<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
    <channel>
        <title>DataDecisionMakers | VentureBeat</title>
        <link>https://venturebeat.com/category/datadecisionmakers/feed/</link>
        <description>Transformative tech coverage that matters</description>
        <lastBuildDate>Thu, 30 Apr 2026 13:05:16 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>Copyright 2026, VentureBeat</copyright>
        <item>
            <title><![CDATA[Monitoring LLM behavior: Drift, retries, and refusal patterns]]></title>
            <link>https://venturebeat.com/infrastructure/monitoring-llm-behavior-drift-retries-and-refusal-patterns</link>
            <guid isPermaLink="false">2zWsVTW2SQxLvpViqkfRDF</guid>
            <pubDate>Sun, 26 Apr 2026 15:13:54 GMT</pubDate>
            <description><![CDATA[<p>Traditional software is predictable: Input A plus function B always equals output C. This determinism allows engineers to develop robust tests. On the other hand, generative AI is stochastic and unpredictable. The exact same prompt often yields different results on Monday versus Tuesday, breaking the traditional unit testing that engineers know and love.</p><p>To ship enterprise-ready AI, engineers cannot rely on mere “vibe checks” that pass today but fail when customers use the product. Product builders need to adopt a new infrastructure layer: The <i>AI Evaluation Stack</i>.</p><p>This framework is informed by my extensive experience shipping <a href="https://venturebeat.com/infrastructure/claude-openclaw-and-the-new-reality-ai-agents-are-here-and-so-is-the-chaos?_gl=1*p1fni9*_up*MQ..*_ga*NzQyMjc1NDIyLjE3NzcwNjQ1MjA.*_ga_SCH1J7LNKY*czE3NzcwNjQ1MTkkbzEkZzAkdDE3NzcwNjQ1MTkkajYwJGwwJGgw*_ga_B8TDS1LEXQ*czE3NzcwNjQ1MTkkbzEkZzAkdDE3NzcwNjQ1MTkkajYwJGwwJGgw">AI products</a> for Fortune 500 enterprise customers in high-stakes industries, where “hallucination” is not funny — it’s a huge compliance risk.</p><h2>Defining the AI evaluation paradigm</h2><p>Traditional software tests are binary assertions (pass/fail). While some AI evals use binary asserts, many evaluate on a gradient. An eval is not a single script; it is a structured pipeline of assertions — ranging from strict code syntax to nuanced semantic checks — that verify the AI system’s intended function.</p><h3>The taxonomy of evaluation checks</h3><p>To build a robust, cost-effective pipeline, asserts must be separated into two distinct architectural layers:</p><h4>Layer 1: Deterministic assertions</h4><p>A surprisingly large share of production AI failures aren&#x27;t semantic &quot;hallucinations&quot; — they are basic syntax and routing failures. Deterministic assertions serve as the pipeline&#x27;s first gate, using traditional code and regex to validate structural integrity.</p><p>Instead of asking if a response is &quot;helpful,&quot; these assertions ask strict, binary questions:</p><ul><li><p>Did the model generate the correct JSON key/value schema?</p></li><li><p>Did it invoke the correct tool call with the required arguments?</p></li><li><p>Did it successfully slot-fill a valid GUID or email address?</p></li></ul><p>// Example: Layer 1 Deterministic Tool Call Assertion</p><p>{</p><p>  &quot;test_scenario&quot;: &quot;User asks to look up an account&quot;,</p><p>  &quot;assertion_type&quot;: &quot;schema_validation&quot;,</p><p>  &quot;expected_action&quot;: &quot;Call API: get_customer_record&quot;,</p><p>  &quot;actual_ai_output&quot;: &quot;I found the customer.&quot;,</p><p>  &quot;eval_result&quot;: &quot;FAIL - AI hallucinated conversational text instead of generating the required API payload.&quot;</p><p>}</p><p>In the example above, the test failed instantly because the model generated conversational text instead of the required tool call payload.</p><p>Architecturally, deterministic assertions must be the first layer of the stack, operating on a computationally inexpensive &quot;fail-fast&quot; principle. If a downstream API requires a specific schema, a malformed JSON string is a fatal error. 
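</p><p>In code, this first gate can be a few lines of plain Python. The sketch below mirrors the illustrative payload above; the tool name and required argument are assumptions for the example, not a specific framework&#x27;s API.</p><pre>import json

REQUIRED_ARGS = {&quot;customer_id&quot;}  # assumption: the argument the downstream API expects

def layer1_tool_call_check(raw_output: str):
    &quot;&quot;&quot;Fail fast on structural problems before any semantic (Layer 2) check runs.&quot;&quot;&quot;
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, &quot;FAIL - not valid JSON (conversational text instead of a payload)&quot;
    if payload.get(&quot;tool&quot;) != &quot;get_customer_record&quot;:
        return False, &quot;FAIL - wrong or missing tool call&quot;
    missing = REQUIRED_ARGS - set(payload.get(&quot;arguments&quot;, {}))
    if missing:
        return False, f&quot;FAIL - required arguments missing: {sorted(missing)}&quot;
    return True, &quot;PASS&quot;

# The hallucinated reply from the example above fails at the very first check:
print(layer1_tool_call_check(&quot;I found the customer.&quot;))</pre><p>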
By failing the evaluation immediately at this layer, engineering teams prevent the pipeline from triggering expensive semantic checks (Layer 2) or wasting valuable human review time (Layer 3).</p><h4>Layer 2: Model-based assertions</h4><p>When deterministic assertions pass, the pipeline must evaluate semantic quality. Because natural language is fluid, traditional code cannot easily assert if a response is &quot;helpful&quot; or &quot;empathetic.&quot; This introduces model-based evaluation, commonly referred to as &quot;<i>LLM-as-a-Judge</i>” or “<i>LLM-Judge</i>.&quot;</p><p>While using one non-deterministic system to evaluate another seems counterintuitive, it is an exceptionally powerful architectural pattern for use cases requiring nuance. It is virtually impossible to write a reliable regex to verify if a response is &quot;actionable&quot; or &quot;polite.&quot; While human reviewers excel at this nuance, they cannot scale to evaluate tens of thousands of CI/CD test cases. Thus, the LLM-as-a-Judge becomes the scalable proxy for human discernment.</p><h2>3 critical inputs for model-based assertions</h2><p>However, model-based assertions only yield reliable data when the LLM-as-a-Judge is provisioned with three critical inputs:</p><ol><li><p><b>A state-of-the-art reasoning model:</b> The Judge must possess superior reasoning capabilities compared to the production model. If your app runs on a smaller, faster model for latency, the judge must be a frontier reasoning model to approximate human-level discernment.</p></li><li><p><b>A strict assessment rubric:</b> Vague evaluation prompts (&quot;Rate how good this answer is&quot;) yield noisy, stochastic evaluations. A robust rubric explicitly defines the gradients of failure and success. (For example, a &quot;Helpfulness&quot; rubric should define Score 1 as an irrelevant refusal, Score 2 as addressing the prompt but lacking actionable steps, and Score 3 as providing actionable next steps strictly within context.)</p></li><li><p><b>Ground truth (golden outputs):</b> While the rubric provides the rules, a human-vetted &quot;expected answer&quot; acts as the answer key. When the LLM-Judge can compare the production model&#x27;s output against a verified Golden Output, its scoring reliability increases dramatically.</p></li></ol><h2>Architecture: The offline vs online pipeline</h2><p>A robust evaluation architecture requires two complementary pipelines. The online pipeline monitors post-deployment telemetry, while the offline pipeline provides the foundational baseline and deterministic constraints required to evaluate stochastic models safely.</p><h3>The offline evaluation pipeline</h3><p>The offline pipeline&#x27;s primary objective is regression testing — identifying failures, drift, and latency before production. Deploying an <a href="https://venturebeat.com/technology/when-product-managers-ship-code-ai-just-broke-the-software-org-chart?_gl=1*p1fni9*_up*MQ..*_ga*NzQyMjc1NDIyLjE3NzcwNjQ1MjA.*_ga_SCH1J7LNKY*czE3NzcwNjQ1MTkkbzEkZzAkdDE3NzcwNjQ1MTkkajYwJGwwJGgw*_ga_B8TDS1LEXQ*czE3NzcwNjQ1MTkkbzEkZzAkdDE3NzcwNjQ1MTkkajYwJGwwJGgw">enterprise LLM</a> feature without a gating offline evaluation suite is an architectural anti-pattern; it is the equivalent of merging uncompiled code into a main branch.</p><h3>Process</h3><h4>1. 
Curating the golden dataset</h4><p>The offline lifecycle begins by curating a &quot;<b>golden dataset</b>&quot; — a static, version-controlled repository of 200 to 500 test cases representing the AI&#x27;s full operational envelope. Each case pairs an exact input payload with an expected &quot;<b>golden output</b>&quot; (ground truth).</p><p>Crucially, this dataset must reflect expected real-world traffic distributions. While most cases cover standard &quot;happy-path&quot; interactions, engineers must systematically incorporate edge cases, jailbreaks, and adversarial inputs. Evaluating &quot;refusal capabilities&quot; under stress remains a strict compliance requirement.</p><p><b>Example test case payload (standard tool use):</b></p><ul><li><p><b>Input:</b> &quot;Schedule a 30-minute follow-up meeting with the client for next Tuesday at 10 a.m.&quot;</p></li><li><p><b>Expected output (golden):</b> The system successfully invokes the schedule_meeting tool with the correct JSON payload: {&quot;duration_minutes&quot;: 30, &quot;day&quot;: &quot;Tuesday&quot;, &quot;time&quot;: &quot;10 AM&quot;, &quot;attendee&quot;: &quot;client_email&quot;}.</p></li></ul><p>While manually curating hundreds of edge cases is tedious, the process can be accelerated with synthetic data generation pipelines that use a <a href="https://venturebeat.com/orchestration/when-ai-turns-software-development-inside-out-170-throughput-at-80-headcount?_gl=1*mgusdo*_up*MQ..*_ga*NzQyMjc1NDIyLjE3NzcwNjQ1MjA.*_ga_SCH1J7LNKY*czE3NzcwNjQ1MTkkbzEkZzAkdDE3NzcwNjQ1MTkkajYwJGwwJGgw*_ga_B8TDS1LEXQ*czE3NzcwNjQ1MTkkbzEkZzAkdDE3NzcwNjQ1MTkkajYwJGwwJGgw">specialized LLM</a> to produce diverse TSV/CSV test payloads. However, relying entirely on AI-generated test cases introduces the risk of data contamination and bias. A human-in-the-loop (HITL) architecture is mandatory at this stage; domain experts must manually review, edit, and validate the synthetic dataset to ensure it accurately reflects real-world user intent and enterprise policy before it is committed to the repository.</p><h4>2. Defining the evaluation criteria</h4><p>Once the dataset is curated, engineers must design the evaluation criteria to compute a composite score for each model output. A robust architecture achieves this by assigning weighted points across a hybrid of Layer 1 (deterministic) and Layer 2 (model-based) asserts.</p><p>Consider an AI agent executing a &quot;send email&quot; tool. An evaluation framework might utilize a 10-point scoring system:</p><ul><li><p><b>Layer 1: Deterministic asserts (6 points):</b> Did the agent invoke the correct tool? (2 pts). Did it produce a valid JSON object? (2 pts). Does the JSON strictly adhere to the expected schema? (2 pts).</p></li><li><p><b>Layer 2: Model-based asserts (4 points):</b> (Note: Semantic rubrics must be highly use-case specific). Does the subject line reflect user intent? (1 pt). Does the email body match expected outputs without hallucination? (1 pt). Were CC/BCC fields leveraged accurately? (1 pt). Was the appropriate priority flag inferred? (1 pt).</p></li></ul><p>To understand why the LLM-Judge awarded these points, the engineer must prompt the judge to supply its reasoning for each score. This is crucial for debugging failures.</p><p><b>The passing threshold and short-circuit logic</b> </p><p>In this example, an 8/10 passing threshold requires 8 points for success. Crucially, the evaluation pipeline must enforce strict short-circuit evaluation (fail-fast logic). 
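</p><p>A minimal sketch of that composite scorer is below. The expected tool, required fields, and the judge interface are illustrative assumptions rather than a particular evaluation framework, and the short-circuit rule is the one described next.</p><pre>import json

EXPECTED_TOOL = &quot;send_email&quot;
REQUIRED_FIELDS = {&quot;to&quot;, &quot;subject&quot;, &quot;body&quot;}

def score_email_tool_case(raw_output: str, judge) -&gt; int:
    &quot;&quot;&quot;Composite 10-point score: Layer 1 deterministic gates, then a Layer 2 rubric.&quot;&quot;&quot;
    # Layer 1: deterministic asserts (6 points), cheapest, so they run first.
    try:
        payload = json.loads(raw_output)
        json_ok = True
    except json.JSONDecodeError:
        payload, json_ok = {}, False
    tool_ok = payload.get(&quot;tool&quot;) == EXPECTED_TOOL
    schema_ok = REQUIRED_FIELDS &lt;= set(payload.get(&quot;arguments&quot;, {}))
    if not (json_ok and tool_ok and schema_ok):
        return 0  # short-circuit: any structural failure fails the whole test case
    score = 6
    # Layer 2: model-based asserts (4 points), one point per rubric criterion.
    for criterion in (&quot;subject_reflects_intent&quot;, &quot;body_matches_golden_output&quot;,
                      &quot;cc_bcc_used_correctly&quot;, &quot;priority_flag_inferred&quot;):
        score += judge.grade(payload, criterion)  # assumed to return 0 or 1, with logged reasoning
    return score  # the case passes at 8/10 or higher</pre><p>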
If the model fails any deterministic assertion — such as generating a malformed JSON schema — the system must instantly fail the entire test case (0/10). There is zero architectural value in invoking an expensive LLM-Judge to assess the semantic &quot;politeness&quot; of an email if the underlying API call is structurally broken.</p><h4>3. Executing the pipeline and aggregating signals</h4><p>Using an evaluation infrastructure of choice, the system executes the offline pipeline — typically integrated as a blocking CI/CD step during a pull request. The infrastructure iterates through the golden dataset, injecting each test payload into the production model, capturing the output, and executing defined assertions against it.</p><p>Each output is scored against the passing threshold. Once batch execution is complete, results are aggregated into an overall pass rate. For enterprise-grade applications, the baseline pass rate must typically exceed <b>95%,</b> scaling to <b>99%-plus </b>for strict compliance or high-risk domains.</p><h4>4. Assessment, iteration, and alignment</h4><p>Based on aggregated failure data, engineering teams conduct a root-cause analysis of failing test cases. This assessment drives iterative updates to core components: refining system prompts, modifying tool descriptions, augmenting knowledge sources, or adjusting hyperparameters (like temperature or top-p). Continuous optimization remains best practice even after achieving a 95% pass rate.</p><p>Crucially, any system modification necessitates a full regression test. Because LLMs are inherently non-deterministic, an update intended to fix one specific edge case can easily cause unforeseen degradations in other areas. The entire offline pipeline must be rerun to validate that the update improved quality without introducing regressions.</p><h3>The online evaluation pipeline</h3><p>While the offline pipeline acts as a strict pre-deployment gatekeeper, the online pipeline is the post-deployment telemetry system. Its objective is to monitor real-world behavior, capturing emergent edge cases, and quantifying model drift. Architects must instrument applications to capture five distinct categories of telemetry:</p><h4>1. Explicit user signals</h4><p>Direct, deterministic feedback indicating model performance:</p><ul><li><p><b>Thumbs up/down:</b> Disproportionate negative feedback is the most immediate leading indicator of system degradation, directing immediate engineering investigation.</p></li><li><p><b>Verbatim in-app feedback:</b> Systematically parsing written comments identifies novel failure modes to integrate back into the offline &quot;golden dataset.&quot;</p></li></ul><h4>2. Implicit behavioral signals</h4><p>Behavioral telemetry reveals silent failures where users give up without explicit feedback:</p><ul><li><p><b>Regeneration and retry rates:</b> High frequencies of retries indicate the initial output failed to resolve user intent.</p></li><li><p><b>Apology rate:</b> Programmatically scanning for heuristic triggers (&quot;I’m sorry&quot;) detects degraded capabilities or broken tool routing.</p></li><li><p><b>Refusal rate:</b> Artificially high refusal rates (&quot;I can’t do that&quot;) indicate over-calibrated safety filters rejecting benign user queries.</p></li></ul><h4>3. 
Production deterministic asserts (synchronous)</h4><p>Because deterministic code checks execute in milliseconds, teams can seamlessly reuse Layer 1 offline asserts (schema conformity, tool validity) to synchronously evaluate 100% of production traffic. Logging these pass/fail rates instantly detects anomalous spikes in malformed outputs — the earliest warning sign of silent model drift or provider-side API changes.</p><h4>4. Production LLM-as-a-Judge (asynchronous)</h4><p>If strict data privacy agreements (DPAs) permit logging user inputs, teams can deploy model-based asserts. Architecturally, production LLM-Judges must never execute synchronously on the critical path, which doubles latency and compute costs. Instead, a background LLM-Judge asynchronously samples a fraction (5%) of daily sessions, grading outputs against the offline rubric to generate a continuous quality dashboard.</p><h2>Engineering the feedback loop (the “flywheel”)</h2><p>Evaluation pipelines are not &quot;set-it-and-forget-it&quot; infrastructure. Without continuous updates, static datasets suffer from &quot;rot&quot; (concept drift) as user behavior evolves and customers discover novel use cases.</p><p>For example, an HR chatbot might boast a pristine 99% offline pass rate for standard payroll questions. However, if the company suddenly announces a new equity plan, users will immediately begin prompting the AI about vesting schedules — a domain entirely missing from the offline evaluations.</p><p>To make the system smarter over time, engineers must architect a closed feedback loop that mines production telemetry for continuous improvement.</p><p><b>The continuous improvement workflow:</b></p><ol><li><p><b>Capture:</b> A user triggers an explicit negative signal (a &quot;thumbs down&quot;) or an implicit behavioral flag in production.</p></li><li><p><b>Triage:</b> The specific session log is automatically flagged and routed for human review.</p></li><li><p><b>Root-cause analysis:</b> A domain expert investigates the failure, identifies the gap, and updates the AI system to successfully handle similar requests.</p></li><li><p><b>Dataset augmentation:</b> The novel user input, paired with the newly corrected expected output, is appended to the offline Golden Dataset alongside several synthetic variations.</p></li><li><p><b>Regression testing:</b> The model is continuously re-evaluated against this newly discovered edge case in all future runs.</p></li></ol><p>Building an evaluation pipeline without monitoring production logs and updating datasets is fundamentally insufficient. Users are unpredictable. Evaluating on stale data creates a dangerous illusion: High offline pass rates masking a rapidly degrading real-world experience.</p><h2>Conclusion: The new “definition of done”</h2><p>In the era of generative AI, a feature or product is no longer &quot;done&quot; simply because the code compiles and the prompt returns a coherent response. It is only done when a rigorous, automated evaluation pipeline is deployed and stable — and when the model consistently passes against both a curated golden dataset and newly discovered production edge cases.</p><p>This guide has equipped you with a comprehensive blueprint for building that reality. From architecting offline regression pipelines and online telemetry to the continuous feedback flywheel and navigating enterprise anti-patterns, you now have the structural foundation required to deploy AI systems with greater confidence.</p><p>Now, it is your turn. 
Share this framework with your engineering, product, and legal teams to establish a unified, cross-functional standard for AI quality in your organization. Stop guessing whether your models are degrading in production, and start measuring.</p><p><i>Derah Onuorah is a Microsoft senior product manager. </i></p>]]></description>
            <category>Infrastructure</category>
            <category>DataDecisionMakers</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/4IEiKF5i3wgiKwmJw8UOtf/8a415ee33ad42c0cb72ceb0aec1155dc/u7277289442_AI_robots_with_hardhats._An_office_setting._They__5df79da3-f7e2-43fa-a9cb-8d27ca6939c9_2.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Context decay, orchestration drift, and the rise of silent failures in AI systems]]></title>
            <link>https://venturebeat.com/infrastructure/context-decay-orchestration-drift-and-the-rise-of-silent-failures-in-ai-systems</link>
            <guid isPermaLink="false">3wag8VeUQGQQBubBe8KioV</guid>
            <pubDate>Sun, 26 Apr 2026 04:00:00 GMT</pubDate>
            <description><![CDATA[<p>The most expensive <a href="https://venturebeat.com/security/five-signs-data-drift-is-already-undermining-your-security-models?_gl=1*5zylav*_up*MQ..*_ga*NzQyMjc1NDIyLjE3NzcwNjQ1MjA.*_ga_SCH1J7LNKY*czE3NzcwNjQ1MTkkbzEkZzAkdDE3NzcwNjQ1MTkkajYwJGwwJGgw*_ga_B8TDS1LEXQ*czE3NzcwNjQ1MTkkbzEkZzAkdDE3NzcwNjQ1MTkkajYwJGwwJGgw">AI failure</a> I have seen in enterprise deployments did not produce an error. No alert fired. No dashboard turned red. The system was fully operational, it was just consistently, confidently wrong. That is the reliability gap. And it is the problem most enterprise AI programs are not built to catch.</p><p>We have spent the last two years getting very good at evaluating models: benchmarks, accuracy scores, red-team exercises, retrieval quality tests. But in production, the model is rarely where the system breaks. It breaks in the infrastructure layer, the data pipelines feeding it, the orchestration logic wrapping it, the retrieval systems grounding it, the downstream workflows trusting its output. That layer is still being monitored with tools designed for a different kind of software.</p><h2><b>The gap no one is measuring</b></h2><p>Here&#x27;s what makes this problem hard to see: Operationally healthy and behaviorally reliable are not the same thing, and most monitoring stacks cannot tell the difference.</p><p>A system can show green across every infrastructure metric, latency within SLA, throughput normal, error rate flat, while simultaneously reasoning over retrieval results that are six months stale, silently falling back to cached context after a tool call degrades, or propagating a misinterpretation through five steps of an agentic workflow. None of that shows up in Prometheus. None of it trips a Datadog alert.</p><p>The reason is straightforward: <a href="https://venturebeat.com/security/your-developers-are-already-running-ai-locally-why-on-device-inference-is?_gl=1*5zylav*_up*MQ..*_ga*NzQyMjc1NDIyLjE3NzcwNjQ1MjA.*_ga_SCH1J7LNKY*czE3NzcwNjQ1MTkkbzEkZzAkdDE3NzcwNjQ1MTkkajYwJGwwJGgw*_ga_B8TDS1LEXQ*czE3NzcwNjQ1MTkkbzEkZzAkdDE3NzcwNjQ1MTkkajYwJGwwJGgw">Traditional observability</a> was built to answer the question “is the service up?” Enterprise AI requires answering a harder question: “Is the service behaving correctly?” Those are different instruments.</p><table><tbody><tr><td><p><b>What teams typically measure</b></p></td><td><p><b>What actually drives AI infrastructure failure</b></p></td></tr><tr><td><p>Uptime / latency / error rate</p></td><td><p>Retrieval freshness and grounding confidence</p></td></tr><tr><td><p>Token usage</p></td><td><p>Context integrity across multi-step workflows</p></td></tr><tr><td><p>Throughput</p></td><td><p>Semantic drift under real-world load</p></td></tr><tr><td><p>Model benchmark scores</p></td><td><p>Behavioral consistency when conditions degrade</p></td></tr><tr><td><p>Infrastructure error rate</p></td><td><p>Silent partial failure at the reasoning layer</p></td></tr></tbody></table><p> Closing this gap requires adding a behavioral telemetry layer alongside the infrastructure one — not replacing what exists, but extending it to capture what the model actually did with the context it received, not just whether the service responded.</p><h2><b>Four failure patterns that standard monitoring will not catch</b></h2><p>Across enterprise AI deployments in network operations, logistics, and observability platforms, I see four failure patterns repeat with enough consistency to name them. 
</p><p>The first is context degradation. The model reasons over incomplete or stale data in a way that is invisible to the end user. The answer looks polished. The grounding is gone. Detection usually happens weeks later, through downstream consequences rather than system alerts.</p><p>The second is orchestration drift. <a href="https://venturebeat.com/infrastructure/claude-openclaw-and-the-new-reality-ai-agents-are-here-and-so-is-the-chaos?_gl=1*5zylav*_up*MQ..*_ga*NzQyMjc1NDIyLjE3NzcwNjQ1MjA.*_ga_SCH1J7LNKY*czE3NzcwNjQ1MTkkbzEkZzAkdDE3NzcwNjQ1MTkkajYwJGwwJGgw*_ga_B8TDS1LEXQ*czE3NzcwNjQ1MTkkbzEkZzAkdDE3NzcwNjQ1MTkkajYwJGwwJGgw">Agentic pipelines</a> rarely fail because one component breaks. They fail because the sequence of interactions between retrieval, inference, tool use, and downstream action starts to diverge under real-world load. A system that looked stable in testing behaves very differently when latency compounds across steps and edge cases stack.</p><p>The third is a silent partial failure. One component underperforms without crossing an alert threshold. The system degrades behaviorally before it degrades operationally. These failures accumulate quietly and surface first as user mistrust, not incident tickets. By the time the signal reaches a postmortem, the erosion has been happening for weeks.</p><p>The fourth is the automation blast radius. In traditional software, a localized defect stays local. In AI-driven workflows, one misinterpretation early in the chain can propagate across steps, systems, and business decisions. The cost is not just technical. It becomes organizational, and it is very hard to reverse.</p><p>Metrics tell you what happened. They rarely tell you what almost happened.</p><h2><b>Why classic chaos engineering is not enough and what needs to change</b></h2><p>Traditional chaos engineering asks the right kind of question: What happens when things break? Kill a node. Drop a partition. Spike CPU. Observe. Those tests are necessary, and enterprises should run them.</p><p>But for AI systems, the most dangerous failures are not caused by hard infrastructure faults. They emerge at the interaction layer between data quality, context assembly, model reasoning, orchestration logic, and downstream action. You can stress the infrastructure all day and never surface the failure mode that costs you the most.</p><p>What AI reliability testing needs is an intent-based layer: Define what the system must do under degraded conditions, not just what it should do when everything works. Then test the specific conditions that challenge that intent. What happens if the retrieval layer returns content that is technically valid but six months outdated? What happens if a summarization agent loses 30% of its context window to unexpected token inflation upstream? What happens if a tool call succeeds syntactically but returns semantically incomplete data? What happens if an agent retries through a degraded workflow and compounds its own error with each step?</p><p>These scenarios are not edge cases. They are what production looks like. This is the framework I have applied in building reliability systems for enterprise infrastructure: Intent-based chaos level creation for distributed computing environments. The key insight: Intent defines the test, not just the fault.</p><h2><b>What the infrastructure layer actually needs</b></h2><p>None of this requires reinventing the stack. It requires extending four things.</p><p>Add behavioral telemetry alongside infrastructure telemetry. 
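</p><p>One way to make that concrete is to emit a small structured record for every model response, next to the usual infrastructure metrics. The sketch below is an assumption about what such a record might contain, not an established schema; the fields map to the signals described next.</p><pre>from dataclasses import dataclass, asdict
import json, time

@dataclass
class BehavioralRecord:
    &quot;&quot;&quot;One record per model response, logged alongside latency and error-rate metrics.&quot;&quot;&quot;
    request_id: str
    grounded: bool             # did retrieved context actually support the answer?
    fallback_triggered: bool   # did the pipeline quietly fall back to cached context?
    confidence: float          # whatever self-estimate or judge score the system exposes
    retrieval_age_days: float  # how stale was the freshest document used?
    downstream_target: str     # where the output went: ticket, workflow step, human review

def emit(record: BehavioralRecord):
    # Placeholder sink; in practice this goes to the existing telemetry pipeline.
    print(json.dumps({&quot;ts&quot;: time.time(), **asdict(record)}))

# A response that looks fine operationally but reasons over six-month-old context:
emit(BehavioralRecord(&quot;req-123&quot;, grounded=False, fallback_triggered=True,
                      confidence=0.41, retrieval_age_days=182.0,
                      downstream_target=&quot;ticket_update&quot;))</pre><p>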
Track whether responses were grounded, whether fallback behavior was triggered, whether confidence dropped below a meaningful threshold, whether the output was appropriate for the downstream context it entered. This is the observability layer that makes everything else interpretable.</p><p>Introduce semantic fault injection into pre-production environments. Deliberately simulate stale retrieval, incomplete context assembly, tool-call degradation, and token-boundary pressure. The goal is not theatrical chaos. The goal is finding out how the system behaves when conditions are slightly worse than your staging environment — which is always what production is.</p><p>Define safe halt conditions before deployment, not after the first incident. AI systems need the equivalent of circuit breakers at the reasoning layer. If a system cannot maintain grounding, validate context integrity, or complete a workflow with enough confidence to be trusted, it should stop cleanly, label the failure, and hand control to a human or a deterministic fallback. A graceful halt is almost always safer than a fluent error. Too many systems are designed to keep going because confident output creates the illusion of correctness.</p><p>Assign shared ownership for end-to-end reliability. The most common organizational failure is a clean separation between model teams, platform teams, data teams, and application teams. When the system is operationally up but behaviorally wrong, no one owns it clearly. Semantic failure needs an owner. Without one, it accumulates.</p><h2><b>The maturity curve is shifting</b></h2><p>For the last two years, the enterprise AI differentiator has been adoption — who gets to production fastest. That phase is ending. As models commoditize and baseline capability converges, competitive advantage will come from something harder to copy: The ability to operate AI reliably at scale, in real conditions, with real consequences.</p><p>Yesterday’s differentiator was model adoption. Today’s is system integration. Tomorrow’s will be reliability under production stress.</p><p>The enterprises that get there first will not have the most advanced models. They will have the most disciplined infrastructure around them — infrastructure that was tested against the conditions it would actually face, not the conditions that made the pilot look good.</p><p>The model is not the whole risk. The untested system around it is.</p><p><i>Sayali Patil is an AI infrastructure and product leader. </i></p>]]></description>
            <category>Infrastructure</category>
            <category>DataDecisionMakers</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/76YheZIpA8AMYZUQRBAljo/5080f08dfe636580d96eb7883f41136d/Silent_AI.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[AI synthetic audiences are already here and poised to upend the consulting industry]]></title>
            <link>https://venturebeat.com/technology/ai-synthetic-audiences-are-already-here-and-poised-to-upend-the-consulting-industry</link>
            <guid isPermaLink="false">1Rj1gcdsr2z1naCpdljfrE</guid>
            <pubDate>Sun, 26 Apr 2026 04:00:00 GMT</pubDate>
            <description><![CDATA[<p><i>By Eren Celebi, WPP</i></p><p>There is a war brewing between AI and consulting. </p><p>Akin to an army’s slow march toward the castle, a new technology is coming to dethrone the expert guessers of McKinsey, Nielsen, Gartner, Publicis and the rest. Any consulting that involves analyzing people (think all of marketing, research, polling, etc.) will have to reckon with the technology of “synthetic audiences”.</p><p>Synthetic audiences aim to generate digital versions of people that can then be surveyed almost instantly and affordably, though not as accurately. Think Tamagotchi, but with people. </p><p>By prompting AI with information about a person, we ask it to get in their shoes and simulate the thoughts, behaviors, priorities and decisions of real-world humans. We can also invent non-specific placeholder people or personas and survey them as though they are real. Various firms have already fielded products in these domains, including startups <a href="https://www.uktech.news/ai/ai-startup-building-synthetic-audiences-to-model-human-thought-raises-10m-20260212">Electric Twin</a>, <a href="https://www.eu-startups.com/2025/08/british-ai-startup-artificial-societies-raises-e4-5-million-to-simulate-human-behaviour-at-scale/">Artificial Societies, </a>and <a href="https://techcrunch.com/2025/12/05/ai-synthetic-research-startup-aaru-raised-a-series-a-at-a-1b-headline-valuation/">Aaru</a>, and even the century-old <a href="https://www.dentsu.com/news-releases/dentsu-launches-generative-audiences-ai-powered-growth-intelligence-that-thinks-like-consumers">Dentsu</a>.</p><p>What used to take four months to survey people, plus two months to create a nice PowerPoint presentation of findings, at a total cost of thousands or even tens of thousands of dollars, now takes two minutes and costs only a few dollars. </p><p>It may seem like I’ve picked my winner. But in this war of tribes, I’m a Romeo, caught between the two warring houses. I work for a large incumbent in this space. From 2023 to 2025, while working at the London headquarters of <a href="https://www.wpp.com/en/wpp-iq/2026/02/shaping-performance-inside-wpp-productions-ai-model-for-relevance-at-scale">WPP</a>, I built similar tools for numerous Fortune 500s and advised many New York University researchers on the subject. </p><p>Companies like WPP, with headcounts and revenues that rival the populations and GDPs of small European nations, need startups for their speed and high margins, while startups need our distribution.</p><p>My advice has always been for unity between these tribes. Considering that WPP is partnering with numerous startups, working tirelessly to build our own tools and forging deep connections with hyperscalers, it’s possible I misled you with the war analogy. This may be a love story after all. But destiny’s bottle of poison is in our hands. These next few years are pivotal and formative.</p><p>The future will ultimately be determined by the buyers of these studies. Fortune 500s, with the largest appetite for market research, often hesitate to include synthetic audiences in their diet. The first question I’m asked in any pitch is &quot;Will AI steal my data?&quot; I find this question to be an emotional response. 
It seems to me like most AI fears are remnants of a 2022 LinkedIn post that burrowed itself into our collective consciousness.</p><p>I generally respond to this question with another: “Do you use Microsoft Teams?” </p><p>The answer is often &quot;yes.&quot; Almost every enterprise stores sensitive data in a cloud service that Google, Amazon or Microsoft provides. These are the same companies that provide enterprise AI services, which state in their terms and conditions that they won’t train models with your data. Now, believing this statement is optional, but then again, belief is voluntary in all things.</p><p>Criticisms of accuracy, on the other hand, are harder to dispute. The famed venture capital firm Andreessen Horowitz (a16z) titled its analysis of this budding tech scene “<a href="https://a16z.com/ai-market-research/">Faster, smarter, cheaper</a>”. </p><p>As the hopeful mediator in this war, I agree synthetic research is faster and cheaper, but is it smarter? Not sure. A <a href="https://arxiv.org/pdf/2411.10109">seminal paper from Stanford by Park et al. established a benchmark in 2024</a> showing that AI can simulate human responses to surveys with an average of 85% accuracy. </p><p>In fact, for certain portions of the General Social Survey, they replicated answers with more than 90% accuracy. When the model is provided relevant information and is given rich context (like a mini biography of the person), it can guess their actions and thoughts very accurately.</p><p>But no prediction can be 100% accurate. A future where human propensities are modeled even better than humans can express their own desires is a possibility. Maybe we’ll live in a future where the movie <i>Minority Report </i>becomes reality. However, this future is too distant to warrant the attention of a business reader and is better suited for Tom Cruise and Steven Spielberg.</p><p>What is more interesting to me is what this technology can do at lower accuracies. In my private tests, I’ve seen that with very simple information about a person, such as their age, neighborhood and gender, certain behaviors can be modeled with 72% accuracy. </p><p>An argument can be made that these are easy-to-make predictions. Predicting whether a married person will have children is low stakes. This can’t completely replace the unique insight of a strategist. </p><p>However, considering how elusive it is to understand and model people, a solution that’s better than random and this attainable is poised to make an impact.</p><p>Think about the immense scale. The human mind works with a small range of values. We understand when something is twice as fast, but we can’t comprehend when something is 175,200 times faster. All of a sudden, a journey that took several days becomes several hours, bridges get built, gas stations spring up, entire industries are started. </p><p>When improvement isn’t marginal but exponential, it has positive externalities that even this article cannot predict.</p><p>What I suggest for all of us is to eat the popcorn and watch the show. No matter what happens, it’ll be fun.</p>]]></description>
            <category>Technology</category>
            <category>DataDecisionMakers</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/4AZSXEMtdtnPgmv3m8XzM6/e9cc92137da39bed7cf07a278ffa2775/Carl_Franzen_graphic_novel_style_intricate_line_drawing_first_f1777103-36c6-4343-90ce-11883e004271_2.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Five signs data drift is already undermining your security models]]></title>
            <link>https://venturebeat.com/security/five-signs-data-drift-is-already-undermining-your-security-models</link>
            <guid isPermaLink="false">7i0FSJc3DXou1wEJSVEmkG</guid>
            <pubDate>Sun, 12 Apr 2026 19:00:00 GMT</pubDate>
            <description><![CDATA[<p>Data drift happens when the statistical properties of a machine learning (ML) model&#x27;s input data change over time, eventually rendering its predictions less accurate. <a href="https://venturebeat.com/security/ocsf-explained-the-shared-data-language-security-teams-have-been-missing?_gl=1*yt0z35*_up*MQ..*_ga*MTcxNTczODYxLjE3NzYwMDUzOTE.*_ga_B8TDS1LEXQ*czE3NzYwMDUzODkkbzEkZzAkdDE3NzYwMDUzODkkajYwJGwwJGgw*_ga_SCH1J7LNKY*czE3NzYwMDUzODkkbzEkZzAkdDE3NzYwMDUzODkkajYwJGwwJGgw">Cybersecurity professionals</a> who rely on ML for tasks like malware detection and network threat analysis find that undetected data drift can create vulnerabilities. A model trained on old attack patterns may fail to see today&#x27;s sophisticated threats. Recognizing the early signs of data drift is the first step in maintaining reliable and efficient security systems.</p><h2><b>Why data drift compromises security models</b></h2><p>ML models are trained on a snapshot of historical data. When live data no longer resembles this snapshot, the model&#x27;s performance dwindles, creating a <a href="https://venturebeat.com/technology/why-cios-must-lead-ai-experimentation-not-just-govern-it?_gl=1*x7qiq4*_up*MQ..*_ga*MTcxNTczODYxLjE3NzYwMDUzOTE.*_ga_B8TDS1LEXQ*czE3NzYwMDUzODkkbzEkZzAkdDE3NzYwMDUzODkkajYwJGwwJGgw*_ga_SCH1J7LNKY*czE3NzYwMDUzODkkbzEkZzAkdDE3NzYwMDUzODkkajYwJGwwJGgw">critical cybersecurity risk</a>. A threat detection model may generate more false negatives by missing real breaches or create more false positives, leading to alert fatigue for security teams.</p><p>Adversaries actively exploit this weakness. In 2024,<a href="https://thehackernews.com/2024/07/proofpoint-email-routing-flaw-exploited.html"> <u>attackers used echo-spoofing techniques</u></a> to bypass email protection services. By exploiting misconfigurations in the system, they sent millions of spoofed emails that evaded the vendor&#x27;s ML classifiers. This incident demonstrates how threat actors can manipulate input data to exploit blind spots. When a security model fails to adapt to shifting tactics, it becomes a liability.</p><h2><b>5 indicators of data drift</b></h2><p>Security professionals can recognize the presence of drift (or its potential) in several ways.</p><h3><b>1. A sudden drop in model performance</b></h3><p>Accuracy, precision, and recall are often the first casualties. A consistent decline in these key metrics is a red flag that the model is no longer in sync with the current threat landscape.</p><p>Consider Klarna&#x27;s success: Its AI assistant handled 2.3 million customer service conversations in its first month and performed work equivalent to 700 agents. This efficiency drove a<a href="https://www.nutshell.com/blog/best-ai-chatbots"> <u>25% decline in repeat inquiries</u></a> and reduced resolution times to under two minutes. </p><p>Now imagine if those parameters suddenly reversed because of drift. In a security context, a similar drop in performance does not just mean unhappy clients — it also means successful intrusions and potential data exfiltration.</p><h3><b>2. 
Shifts in statistical distributions</b></h3><p><a href="https://venturebeat.com/security/human-centric-iam-is-failing-agentic-ai-requires-a-new-identity-control?_gl=1*61shbb*_up*MQ..*_ga*MTcxNTczODYxLjE3NzYwMDUzOTE.*_ga_B8TDS1LEXQ*czE3NzYwMDUzODkkbzEkZzAkdDE3NzYwMDUzODkkajYwJGwwJGgw*_ga_SCH1J7LNKY*czE3NzYwMDUzODkkbzEkZzAkdDE3NzYwMDUzODkkajYwJGwwJGgw">Security teams</a> should monitor the core statistical properties of input features, such as the mean, median, and standard deviation. A significant change in these metrics relative to the training data could indicate the underlying data has changed.</p><p>Monitoring for such shifts enables teams to catch drift before it causes a breach. For example, a phishing detection model might be trained on emails with an average attachment size of 2MB. If the average attachment size suddenly jumps to 10MB due to a new malware-delivery method, the model may fail to classify these emails correctly.</p><h3><b>3. Changes in prediction behavior</b></h3><p>Even if overall accuracy seems stable, distributions of predictions might change, a phenomenon often referred to as prediction drift.</p><p>For instance, if a fraud detection model historically flagged 1% of transactions as suspicious but suddenly starts flagging 5% or 0.1%, either the model&#x27;s behavior has shifted or the nature of the input data has changed. It might indicate a new type of attack that confuses the model or a change in legitimate user behavior that the model was not trained to identify.</p><h3><b>4. An increase in model uncertainty</b></h3><p>For models that provide a confidence score or probability with their predictions, a general decrease in confidence can be a subtle sign of drift.</p><p>Recent studies highlight the<a href="https://arxiv.org/html/2410.21952v2"> <u>value of uncertainty quantification</u></a> in detecting adversarial attacks. If the model becomes less sure about its forecasts across the board, it is likely facing data it was not trained on. In a cybersecurity setting, this uncertainty is an early sign of potential model failure, suggesting the model is operating in unfamiliar territory and that its decisions might no longer be reliable.</p><h3><b>5. Changes in feature relationships</b></h3><p>The correlation between different input features can also change over time. In a network intrusion model, traffic volume and packet size might be highly linked during normal operations. If that correlation disappears, it can signal a change in network behavior that the model may not understand. A sudden feature decoupling could indicate a new tunneling tactic or a stealthy exfiltration attempt.</p><h2><b>Approaches to detecting and mitigating data drift</b></h2><p>Common detection methods include the Kolmogorov-Smirnov (KS) test and the population stability index (PSI). These compare the <a href="https://towardsdatascience.com/drift-detection-in-robust-machine-learning-systems/"><u>distributions of live and training data</u></a> to identify deviations. The KS test determines if two datasets differ significantly, while the PSI measures how much a variable&#x27;s distribution has shifted over time. </p><p>The mitigation method of choice often depends on how the drift manifests, as distribution changes may occur suddenly. For example, customers&#x27; buying behavior may change overnight with the launch of a new product or a promotion. In other cases, drift may occur gradually over a more extended period. 
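</p><p>Both of the detection methods above can be sketched in a few lines. The snippet below is a minimal illustration using SciPy and NumPy, not a production monitor, and the 0.2 PSI cutoff mentioned in the comment is a common rule of thumb rather than a fixed standard.</p><pre>import numpy as np
from scipy.stats import ks_2samp

def psi(expected, actual, bins=10):
    &quot;&quot;&quot;Population stability index between training (expected) and live (actual) values.&quot;&quot;&quot;
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.clip(np.histogram(expected, bins=edges)[0] / len(expected), 1e-6, None)
    a_pct = np.clip(np.histogram(actual, bins=edges)[0] / len(actual), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

def drift_report(train_feature, live_feature):
    ks_stat, p_value = ks_2samp(train_feature, live_feature)
    return {
        &quot;ks_statistic&quot;: float(ks_stat),
        &quot;ks_p_value&quot;: float(p_value),             # small values mean the distributions differ
        &quot;psi&quot;: psi(train_feature, live_feature),  # above roughly 0.2 is often treated as a significant shift
    }</pre><p>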
That said, security teams must learn to adjust their monitoring cadence to capture both rapid spikes and slow burns. Mitigation will involve retraining the model on more recent data to reclaim its effectiveness.</p><h2><b>Proactively manage drift for stronger security</b></h2><p>Data drift is an inevitable reality, and cybersecurity teams can maintain a strong security posture by treating detection as a continuous and automated process. Proactive monitoring and model retraining are fundamental practices to ensure ML systems remain reliable allies against developing threats.</p><p><i>Zac Amos is the Features Editor at </i><a href="https://rehack.com/"><i><u>ReHack</u></i></a><i>.</i></p>]]></description>
            <category>Security</category>
            <category>DataDecisionMakers</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/erAw6FrOeAX9eZJqeF2Dx/3a759d02f32a698bdc815c787701a17a/AI_drift.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Your developers are already running AI locally: Why on-device inference is the CISO’s new blind spot]]></title>
            <link>https://venturebeat.com/security/your-developers-are-already-running-ai-locally-why-on-device-inference-is</link>
            <guid isPermaLink="false">3EC5GemarqXB92UGk1xUjb</guid>
            <pubDate>Sun, 12 Apr 2026 15:00:20 GMT</pubDate>
            <description><![CDATA[<p>For the last 18 months, the CISO playbook for generative AI has been relatively simple: Control the browser.</p><p><a href="https://venturebeat.com/security/ocsf-explained-the-shared-data-language-security-teams-have-been-missing?_gl=1*4903t3*_up*MQ..*_ga*MTcxNTczODYxLjE3NzYwMDUzOTE.*_ga_B8TDS1LEXQ*czE3NzYwMDUzODkkbzEkZzAkdDE3NzYwMDUzODkkajYwJGwwJGgw*_ga_SCH1J7LNKY*czE3NzYwMDUzODkkbzEkZzAkdDE3NzYwMDUzODkkajYwJGwwJGgw">Security teams</a> tightened cloud access security broker (CASB) policies, blocked or monitored traffic to well-known AI endpoints, and routed usage through sanctioned gateways. The operating model was clear: If sensitive data leaves the network for an external API call, we can observe it, log it, and stop it. But that model is starting to break.</p><p>A quiet hardware shift is pushing large language model (LLM) usage off the network and onto the endpoint. Call it Shadow AI 2.0, or the “bring your own model” (BYOM) era: Employees running capable models locally on laptops, offline, with no API calls and no obvious network signature. The governance conversation is still framed as “data exfiltration to the cloud,” but the more immediate enterprise risk is increasingly “unvetted inference inside the device.&quot;</p><p>When inference happens locally, traditional data loss prevention (DLP) doesn’t see the interaction. And when security can’t see it, it can’t manage it.</p><h3><b>Why local inference is suddenly practical</b></h3><p>Two years ago, running a useful LLM on a work laptop was a niche stunt. Today, it’s routine for technical teams.</p><p>Three things converged:</p><ul><li><p><b>Consumer-grade accelerators got serious: </b>A MacBook Pro with 64GB unified memory can often run quantized 70B-class models at usable speeds (with practical limits on context length). What once required multi-GPU servers is now feasible on a high-end laptop for many real workflows.</p></li><li><p><b>Quantization went mainstream:</b> It’s now easy to compress models into smaller, faster formats that fit within laptop memory often with acceptable quality tradeoffs for many tasks.</p></li><li><p><b>Distribution is frictionless:</b> Open-weight models are a single command away, and the tooling ecosystem makes “download → run → chat” trivial.</p></li></ul><p><b>The result: </b>An engineer can pull down a multi‑GB model artifact, turn off Wi‑Fi, and run sensitive workflows locally, source code review, document summarization, drafting customer communications, even exploratory analysis over regulated datasets. No outbound packets, no proxy logs, no cloud audit trail.</p><p>From a <a href="https://venturebeat.com/security/mythos-detection-ceiling-security-teams-new-playbook?_gl=1*qe97gz*_up*MQ..*_ga*MzY1OTQzODYzLjE3NzYwMDU1Mjk.*_ga_SCH1J7LNKY*czE3NzYwMDU1MjgkbzEkZzAkdDE3NzYwMDU1MjgkajYwJGwwJGgw*_ga_B8TDS1LEXQ*czE3NzYwMDU1MjgkbzEkZzAkdDE3NzYwMDU1MjgkajYwJGwwJGgw">network-security perspective</a>, that activity can look indistinguishable from “nothing happened”.</p><h3><b>The risk isn’t only data leaving the company anymore</b></h3><p>If the data isn’t leaving the laptop, why should a CISO care?</p><p>Because the dominant risks shift from exfiltration to integrity, provenance, and compliance. In practice, local inference creates three classes of blind spots that most enterprises have not operationalized.</p><h4><b>1. 
Code and decision contamination (integrity risk)</b></h4><p>Local models are often adopted because they’re fast, private, and “no approval required.” The downside is that they’re frequently unvetted for the enterprise environment.</p><p><b>A common scenario:</b> A senior developer downloads a community-tuned coding model because it benchmarks well. They paste in internal auth logic, payment flows, or infrastructure scripts to “clean it up.” The model returns output that looks competent, compiles, and passes unit tests, but subtly degrades security posture (weak input validation, unsafe defaults, brittle concurrency changes, dependency choices that aren’t allowed internally). The engineer commits the change.</p><p>If that interaction happened offline, you may have no record that AI influenced the code path at all. And when you later do incident response, you’ll be investigating the symptom (a vulnerability) without visibility into a key cause (uncontrolled model usage).</p><h4><b>2. Licensing and IP exposure (compliance risk)</b></h4><p>Many high-performing models ship with licenses that include <a href="https://llama.meta.com/llama3/license/"><u>restrictions on commercial use</u></a>, attribution requirements, field-of-use limits, or obligations that can be incompatible with proprietary product development. When employees run models locally, that usage can bypass the organization’s normal procurement and legal review process.</p><p>If a team uses a non-commercial model to generate production code, documentation, or product behavior, the company can inherit risk that shows up later during M&amp;A diligence, customer security reviews, or litigation. The hard part is not just the license terms; it’s the lack of inventory and traceability. Without a governed model hub or usage record, you may not be able to prove what was used where.</p><h4><b>3. Model supply chain exposure (provenance risk)</b></h4><p>Local inference also changes the software supply chain problem. Endpoints begin accumulating large model artifacts and the toolchains around them: downloaders, converters, runtimes, plugins, UI shells, and Python packages.</p><p>There is a critical technical nuance here: The file format matters. While newer formats like <a href="https://huggingface.co/docs/safetensors/index"><b><u>Safetensors</u></b></a> are designed to prevent arbitrary code execution, older <a href="https://pytorch.org/docs/stable/generated/torch.load.html"><b><u>Pickle-based</u></b><u> PyTorch files</u></a> can execute malicious payloads simply by being loaded. If your developers are grabbing unvetted checkpoints from Hugging Face or other repositories, they aren&#x27;t just downloading data — they could be downloading an exploit.</p><p>Security teams have spent decades learning to treat unknown executables as hostile. BYOM requires extending that mindset to model artifacts and the surrounding runtime stack. The biggest organizational gap today is that most companies have no equivalent of a <a href="https://www.cisa.gov/sbom"><u>software bill of materials</u></a> for models: Provenance, hashes, allowed sources, scanning, and lifecycle management.</p><h3><b>Mitigating BYOM: Treat model weights like software artifacts</b></h3><p>You can’t solve local inference by blocking URLs. You need endpoint-aware controls and a developer experience that makes the safe path the easy path.</p><p>Here are three practical ways:</p><p><b>1. 
Move governance down to the endpoint</b> </p><p>Network DLP and CASB still matter for cloud usage, but they’re not sufficient for BYOM. Start treating local model usage as an endpoint governance problem by looking for specific signals:</p><ul><li><p><b>Inventory and detection:</b> Scan for high-fidelity indicators like .gguf files larger than 2GB, processes like <a href="https://github.com/ggerganov/llama.cpp"><u>llama.cpp</u></a> or Ollama, and local listeners on the common <a href="https://docs.ollama.com/faq"><u>default port 11434</u></a>.</p></li><li><p><b>Process and runtime awareness:</b> Monitor for repeated high GPU/NPU (neural processing unit) utilization from unapproved runtimes or unknown local inference servers.</p></li><li><p><b>Device policy:</b> Use <b>mobile device management (MDM) and endpoint detection and response (EDR)</b> policies to control installation of unapproved runtimes and enforce baseline hardening on engineering devices.</p></li></ul><p>The point isn’t to punish experimentation. It’s to regain visibility.</p><p><b>2. Provide a paved road: An internal, curated model hub</b> </p><p><a href="https://venturebeat.com/security/ai-agent-zero-trust-architecture-audit-credential-isolation-anthropic-nvidia-nemoclaw">Shadow AI</a> is often an outcome of friction. Approved tools are too restrictive, too generic, or too slow to approve. A better approach is to offer a curated internal catalog that includes: </p><ul><li><p>Approved models for common tasks (coding, summarization, classification)</p></li><li><p>Verified licenses and usage guidance</p></li><li><p>Pinned versions with hashes (prioritizing safer formats like Safetensors)</p></li><li><p>Clear documentation for safe local usage, including where sensitive data is and isn’t allowed</p></li></ul><p>If you want developers to stop scavenging, give them something better.</p><p><b>3. Update policy language: “Cloud services” isn’t enough anymore</b> </p><p>Most acceptable use policies talk about SaaS and cloud tools. BYOM requires policy that explicitly covers:</p><ul><li><p>Downloading and running model artifacts on corporate endpoints</p></li><li><p>Acceptable sources</p></li><li><p>License compliance requirements</p></li><li><p>Rules for using models with sensitive data</p></li><li><p>Retention and logging expectations for local inference tools</p></li></ul><p>This doesn’t need to be heavy-handed. It needs to be unambiguous.</p><h3><b>The perimeter is shifting back to the device</b></h3><p>For a decade we moved security controls “up” into the cloud. Local inference is pulling a meaningful slice of AI activity back “down” to the endpoint.</p><p>Five signals that shadow AI has moved to endpoints:</p><ul><li><p><b>Large model artifacts:</b> Unexplained storage consumption by .gguf or .pt files.</p></li><li><p><b>Local inference servers:</b> Processes listening on ports like 11434 (Ollama).</p></li><li><p><b>GPU utilization patterns:</b> Spikes in GPU usage while offline or disconnected from VPN.</p></li><li><p><b>Lack of model inventory:</b> Inability to map code outputs to specific model versions.</p></li><li><p><b>License ambiguity:</b> Presence of &quot;non-commercial&quot; model weights in production builds.</p></li></ul><p>Shadow AI 2.0 isn’t a hypothetical future; it’s a predictable consequence of fast hardware, easy distribution, and developer demand. 
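</p><p>As a starting point, the first two endpoint signals can be swept with a short script. The sketch below makes assumptions about the scan root, file extensions and size threshold; 11434 is Ollama&#x27;s documented default port.</p><pre>import socket
from pathlib import Path

MODEL_EXTENSIONS = {&quot;.gguf&quot;, &quot;.pt&quot;, &quot;.safetensors&quot;}
SIZE_THRESHOLD = 2 * 1024**3  # 2GB, matching the indicator above

def find_model_artifacts(root):
    &quot;&quot;&quot;Flag large local model files that never went through procurement or review.&quot;&quot;&quot;
    hits = []
    for path in Path(root).rglob(&quot;*&quot;):
        if path.suffix in MODEL_EXTENSIONS and path.is_file() and path.stat().st_size &gt; SIZE_THRESHOLD:
            hits.append(str(path))
    return hits

def local_inference_server_listening(port=11434):
    &quot;&quot;&quot;True if something answers on the default Ollama port on this machine.&quot;&quot;&quot;
    with socket.socket() as s:
        s.settimeout(0.5)
        return s.connect_ex((&quot;127.0.0.1&quot;, port)) == 0

print(find_model_artifacts(Path.home()), local_inference_server_listening())</pre><p>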
CISOs who focus only on network controls will miss what’s happening on the silicon sitting right on employees’ desks.</p><p>The next phase of AI governance is less about blocking websites and more about controlling artifacts, provenance, and policy at the endpoint, without killing productivity.</p><p><i>Jayachander Reddy Kandakatla is a senior MLOps engineer.</i></p>]]></description>
            <category>Security</category>
            <category>DataDecisionMakers</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/pAoHef9hMVI3aHoyHfluC/f410fef5dc2a910939184a98db76eec4/AI_perimeter.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Claude, OpenClaw and the new reality: AI agents are here — and so is the chaos]]></title>
            <link>https://venturebeat.com/infrastructure/claude-openclaw-and-the-new-reality-ai-agents-are-here-and-so-is-the-chaos</link>
            <guid isPermaLink="false">E4kZwK085N3OHabqiT6mh</guid>
            <pubDate>Wed, 08 Apr 2026 22:26:37 GMT</pubDate>
            <description><![CDATA[<p>The age of agentic AI is upon us — whether we like it or not. What started as innocent question-and-answer banter with ChatGPT back in 2022 has become an existential debate on job security and the rise of the machines. </p><p>More recently, fears of reaching artificial general intelligence (AGI) have become more real with the advent of powerful autonomous agents like Claude Cowork and <a href="https://venturebeat.com/security/openclaw-500000-instances-no-enterprise-kill-switch">OpenClaw</a>. Having played with these tools for some time, I offer a comparison of them here.</p><p>First, we have OpenClaw (formerly known as Moltbot and Clawdbot). Surpassing 150,000 GitHub stars in days, OpenClaw is already being deployed on local machines with deep system access. This is like a robot “maid” (Irona, for <i>Richie Rich</i> fans) to which you give the keys to your house. It’s supposed to clean it, and you give it the necessary autonomy to take actions and manage your belongings (files and data) as it pleases. The whole purpose is to perform the task at hand — inbox triaging, auto-replies, content curation, travel planning, and more.</p><p>Next we have Google’s <a href="https://antigravity.google/">Antigravity</a>, a coding agent with an IDE that accelerates the path from prompt to production. You can interactively create complete application projects and modify specific details through individual prompts. This is like having a junior developer who can not only code, but also build, test, integrate, and fix issues. In the real world, this is like hiring an electrician: They are really good at a specific job, and you only need to give them access to a specific item (your electric junction box). </p><p>Finally, we have the mighty Claude. The release of Anthropic&#x27;s Cowork, which featured AI agents for automating legal tasks like contract review and NDA triage, caused a sharp sell-off in legal-tech and software-as-a-service (SaaS) stocks (referred to as the <a href="https://venturebeat.com/infrastructure/intuit-is-betting-its-40-years-of-small-business-data-can-outlast-the">SaaSpocalypse</a>). Claude has long been the go-to chatbot; now, with Cowork, it has domain knowledge for specific industries like legal and finance. This is like hiring an accountant. They know the domain inside out and can complete taxes and manage invoices. Users provide specific access to highly sensitive financial details.</p><h2>Making these tools work for you</h2><p>The key to making these tools more impactful is giving them more power, but that increases the <a href="https://venturebeat.com/security/openclaw-can-bypass-your-edr-dlp-and-iam-without-triggering-a-single-alert">risk of misuse</a>. Users must trust providers like Anthropic and Google to ensure that agent prompts will not cause harm, leak data, or provide unfair (illegal) advantage to certain vendors. OpenClaw is open-source, which complicates things, as there is no central governing authority. </p><p>While these technological advancements are amazing and meant for the greater good, all it takes is one or two adverse events to cause panic. Imagine the agentic electrician frying all your house circuits by connecting the wrong wire. In an agent scenario, this could be injecting incorrect code, breaking down a bigger system or adding hidden flaws that may not be immediately evident. Cowork could miss major savings opportunities when doing a user&#x27;s taxes; on the flip side, it could include illegal writeoffs. 
Claude can do unimaginable damage when it has more control and authority.</p><p>But in the middle of this chaos, there is a real opportunity. With the right guardrails in place, agents can focus on specific actions and avoid making random, unaccounted-for decisions. Principles of responsible AI — accountability, transparency, reproducibility, security, privacy — are extremely important. Logging agent steps and requiring human confirmation are absolutely critical. </p><p>Also, when agents deal with so many diverse systems, it&#x27;s important that they speak the same language. Ontology becomes important so that events can be tracked, monitored, and accounted for. A shared, domain-specific ontology can define a “code of conduct,” and that shared code can help control the chaos. When tied together with a shared trust and distributed identity framework, we can build systems that enable agents to do truly useful work. </p><p>When done right, an agentic ecosystem can greatly offload the human “cognitive load” and enable our workforce to perform high-value tasks. Humans will benefit when agents handle the mundane.</p><p><i>Dattaraj Rao is innovation and R&amp;D architect at Persistent Systems. </i></p>]]></description>
            <author>dattarajraogravitar@gmail.com (Dattaraj Rao, Persistent Systems)</author>
            <category>Infrastructure</category>
            <category>DataDecisionMakers</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/33OB5cKXtts9VZ7sMyzGew/7454f3b529fbde6e78746d28b720e4c4/Chaos.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[OCSF explained: The shared data language security teams have been missing]]></title>
            <link>https://venturebeat.com/security/ocsf-explained-the-shared-data-language-security-teams-have-been-missing</link>
            <guid isPermaLink="false">65JjDz3GmMbr0gsoVnjCZu</guid>
            <pubDate>Sun, 05 Apr 2026 18:07:41 GMT</pubDate>
            <description><![CDATA[<p>The security industry has spent the last year talking about models, copilots, and agents, but a quieter shift is happening one layer below all of that: Vendors are lining up around a shared way to describe security data. The Open Cybersecurity Schema Framework (OCSF) is emerging as one of the strongest candidates for that job.</p><p>It gives vendors, enterprises, and practitioners a common way to represent <a href="https://venturebeat.com/security/claude-code-512000-line-source-leak-attack-paths-audit-security-leaders?_gl=1*1bnj2g2*_up*MQ..*_ga*MjA4NDYxMTU5MS4xNzc1MjYyMDM1*_ga_SCH1J7LNKY*czE3NzUyNjIwMzMkbzEkZzAkdDE3NzUyNjIwMzMkajYwJGwwJGgw*_ga_B8TDS1LEXQ*czE3NzUyNjIwMzQkbzEkZzAkdDE3NzUyNjIwMzQkajYwJGwwJGgw">security events</a>, findings, objects, and context. That means less time rewriting field names and custom parsers and more time correlating detections, running analytics, and building workflows that work across products. In a market where every security team is stitching together endpoint, identity, cloud, SaaS, and AI telemetry, a common data language long felt like a pipe dream, and OCSF now puts it within reach.</p><h2>OCSF in plain language</h2><p>OCSF is an open-source framework for cybersecurity schemas. It’s vendor neutral by design and deliberately agnostic to storage format, data collection, and ETL choices. In practical terms, it gives application teams and data engineers a shared structure for events so analysts can work with a more consistent language for threat detection and investigation.</p><p>That sounds dry until you look at the daily work inside a <a href="https://venturebeat.com/security/axios-npm-supply-chain-attack-rat-maintainer-token-2026?_gl=1*18t0sen*_up*MQ..*_ga*MjA4NDYxMTU5MS4xNzc1MjYyMDM1*_ga_SCH1J7LNKY*czE3NzUyNjIwMzMkbzEkZzAkdDE3NzUyNjIwMzMkajYwJGwwJGgw*_ga_B8TDS1LEXQ*czE3NzUyNjIwMzQkbzEkZzAkdDE3NzUyNjIwMzQkajYwJGwwJGgw">security operations center</a> (SOC). Security teams spend a lot of effort normalizing data from different tools so that they can correlate events. For example, detecting that an employee logged in from San Francisco at 10 a.m. on their laptop, then accessed a cloud resource from New York at 10:02 a.m., could reveal a leaked credential. </p><p>Setting up a system that can correlate those events, however, is no easy task: Different tools describe the same idea with different fields, nesting structures, and assumptions. OCSF was built to lower this tax. It helps vendors map their own schemas into a common model and helps customers move data through lakes, pipelines, and security information and event management (SIEM) tools without requiring time-consuming translation at every hop.</p><h2>The last two years have been unusually fast</h2><p>Most of OCSF’s visible acceleration has happened in the last two years. The project was <a href="https://venturebeat.com/security/black-hat-2022-reveals-enterprise-security-trends">announced in August 2022</a> by AWS and Splunk, building on work contributed by Symantec, a division of Broadcom, alongside well-known companies including Cloudflare, CrowdStrike, IBM, Okta, Palo Alto Networks, Rapid7, Salesforce, Securonix, Sumo Logic, Tanium, Trend Micro, and Zscaler.</p><p><i>The OCSF community has kept up a steady cadence of releases over the last two years</i></p><p>The community has grown quickly. 
AWS said in August 2024 that OCSF had expanded from a 17-company initiative into a community with more than 200 participating organizations and 800 contributors, a figure that grew to 900 when OCSF joined the Linux Foundation in November 2024. </p><h2>OCSF is showing up across the industry</h2><p>In the observability and security space, OCSF is everywhere. AWS Security Lake converts natively supported AWS logs and events into OCSF and stores them in Parquet. AWS AppFabric can output OCSF-normalized audit data. AWS Security Hub findings use OCSF, and AWS publishes an extension for cloud-specific resource details. </p><p>Splunk can translate incoming data into OCSF with its Edge Processor and Ingest Processor. Cribl supports converting streaming data into OCSF and compatible formats.</p><p>Palo Alto Networks can forward Strata Logging Service data into Amazon Security Lake in OCSF. CrowdStrike positions itself on both sides of the OCSF pipe, with Falcon data translated into OCSF for Security Lake and Falcon Next-Gen SIEM positioned to ingest and parse OCSF-formatted data. OCSF is one of those rare standards that has crossed the chasm from abstract specification to everyday operational plumbing across the industry.</p><h2>AI is giving the OCSF story fresh urgency</h2><p>When enterprises deploy AI infrastructure, large language models (LLMs) sit at the core, surrounded by complex distributed systems such as model gateways, agent runtimes, vector stores, tool calls, retrieval systems, and policy engines. These components generate new forms of telemetry, much of which spans product boundaries. Security teams across the SOC are increasingly focused on capturing and analyzing this data. The central question often becomes what an agentic AI system actually did, rather than only the text it produced, and whether its actions led to any security breaches.</p><p>That puts more pressure on the underlying data model. An AI assistant that calls the wrong tool, retrieves the wrong data, or chains together a risky sequence of actions creates a security event that needs to be understood across systems. A shared security schema becomes more valuable in that world, especially when AI is also being used on the analytics side to correlate more data, faster.</p><h2>For OCSF, 2025 was all about AI</h2><p>Imagine a company uses an AI assistant to help employees look up internal documents and trigger tools like ticketing systems or code repositories. One day, the assistant starts pulling the wrong files, calling tools it should not use, and exposing sensitive information in its responses. </p><p>Updates in OCSF versions 1.5.0, 1.6.0, and 1.7.0 help security teams piece together what happened by flagging unusual behavior, showing who had access to the connected systems, and tracing the assistant’s tool calls step by step. Instead of only seeing the final answer the AI gave, the team can investigate the full chain of actions that led to the problem.</p><h2>What&#x27;s on the horizon</h2><p>Imagine a company uses an AI customer support bot, and one day the bot begins giving long, detailed answers that include internal troubleshooting guidance meant only for staff. With the kinds of changes being developed for OCSF 1.8.0, the security team could see which model handled the exchange, which provider supplied it, what role each message played, and how the token counts changed across the conversation. 
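</p><p>To make that concrete, here is a rough, hypothetical sketch of the kind of record such an AI interaction event might carry. The field names below are illustrative only, not the finalized OCSF 1.8.0 schema:</p><p>// Illustrative sketch only; field names are hypothetical, not final OCSF 1.8.0</p><p>{</p><p>  &quot;activity_name&quot;: &quot;AI chat completion&quot;,</p><p>  &quot;model&quot;: { &quot;name&quot;: &quot;support-bot-large&quot;, &quot;provider&quot;: &quot;ExampleAI&quot; },</p><p>  &quot;messages&quot;: [ { &quot;role&quot;: &quot;system&quot; }, { &quot;role&quot;: &quot;user&quot; }, { &quot;role&quot;: &quot;assistant&quot; } ],</p><p>  &quot;token_usage&quot;: { &quot;prompt_tokens&quot;: 18500, &quot;completion_tokens&quot;: 2300 }</p><p>}</p><p>Even a handful of fields like these would let an analyst compare a single exchange against the bot&#x27;s normal baseline.</p>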
<p>A sudden spike in prompt or completion tokens could signal that the bot was fed an unusually large hidden prompt, pulled in too much background data from a vector database, or generated an overly long response that increased the chance of sensitive information leaking. That gives investigators a practical clue about where the interaction went off course, instead of leaving them with only the final answer.</p><h2>Why this matters to the broader market</h2><p>The bigger story is that OCSF has moved quickly from being a community effort to becoming a real standard that security products use every day. Over the past two years, it has gained stronger governance, frequent releases, and practical support across data lakes, ingest pipelines, SIEM workflows, and partner ecosystems. </p><p>In a world where AI expands the threat landscape through scams, abuse, and new attack paths, security teams rely on OCSF to connect data from many systems without losing context along the way. </p><p><i>Nikhil Mungel has been building distributed systems and AI teams at SaaS companies for more than 15 years.</i></p>]]></description>
            <category>Security</category>
            <category>DataDecisionMakers</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/1hnK3p6KoPnJhRtzL44Xab/1492dae9e6bd482ddd3ac2e9c5b4c2f1/OCSF.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
    </channel>
</rss>