Three Reports, One PCAP, and What AI Discipline Actually Looks Like in OT & Critical Infrastructure

Originally posted on LinkedIn, written by Dan Gunter, CEO & Founder of Insane Cyber

Recently, we ran an internal experiment. We had a network capture from a critical-infrastructure facility, the kind of multi-vendor environment most of us work in, with controllers and meters and historians from half a dozen vendors layered over twenty-plus years of installation. We wanted to know how different AI-assisted approaches would handle it.

So we did three runs against the same evidence. Two of them used Valkyrie, our OT assessment platform, with different methodology configurations. The third was the simplest thing a customer might do today: drop the PCAP into Claude through Cowork, and ask the model to write an OT assessment.

About eighty percent of the application-layer bytes in that capture were sitting on a non-standard TCP port. Wireshark sees the frames, can’t identify the protocol, and labels everything as plain “TCP.” Of all the OT traffic in the environment, the largest single protocol was the one nobody could read without doing real work.
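A first-pass tally like the one described above can be sketched in a few lines. This is a minimal illustration, not the tooling used in the experiment: the flow records, byte counts, port 18500, and the `KNOWN_PORTS` table are all invented for the example, and a real run would build the flow summaries from the capture itself.

```python
from collections import Counter

# Hypothetical flow summaries: (server_port, application_layer_bytes).
# The numbers are invented; they are not taken from the capture in this article.
flows = [
    (502, 1_200),     # Modbus/TCP on its registered port
    (44818, 800),     # EtherNet/IP on its registered port
    (18500, 5_000),   # invented non-standard port
    (18500, 3_000),
]

# Ports the tooling is assumed to have a dissector for.
KNOWN_PORTS = {502: "Modbus/TCP", 44818: "EtherNet/IP", 20000: "DNP3"}

def byte_share_by_port(flows):
    """Return {port: fraction of total application-layer bytes}."""
    totals = Counter()
    for port, nbytes in flows:
        totals[port] += nbytes
    grand_total = sum(totals.values())
    return {port: n / grand_total for port, n in totals.items()}

def undecoded_ports(shares, known_ports):
    """Ports with no known dissector, largest byte share first."""
    unknown = [(p, s) for p, s in shares.items() if p not in known_ports]
    return sorted(unknown, key=lambda item: item[1], reverse=True)

shares = byte_share_by_port(flows)
top_unknown = undecoded_ports(shares, KNOWN_PORTS)
```

The point of ranking by byte share is that the largest unreadable protocol goes to the top of the work queue instead of being filed under plain “TCP.”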

Each of the three runs produced a confident, vendor-specific identification for that protocol. Each one named a different vendor. Only one was right, and we only confirmed which by comparing against a known-good packet capture from a lab environment running the actual product. Once we had ground truth, the bit-level match was unambiguous.

This is the gap that everyone working in OT cybersecurity needs to think hard about. The distance between confident and verified can look invisible in a PDF, but it has consequences in the plant. A finding that names the wrong vendor leads to remediation recommendations for the wrong product, advisory crosswalks against the wrong CVE list, and a written record that ages into something nobody trusts the next time around. The reader of that report, usually a controls engineer or a security lead with a real budget and a real change window, has no good way to tell which findings rest on evidence and which rest on inference, because by the time the words hit the page, they all look the same.

That’s the problem. It isn’t really about AI capability. The AI in all three runs was capable enough to produce something readable and technically shaped. The problem is process discipline: whether the workflow around the AI catches the wrong answer before it ships, or whether it doesn’t.

Why OT Raises the Stakes

Most readers of this don’t need the OT-versus-IT comparison rehashed. You already know your protective relay can’t reboot on a Tuesday afternoon. You know the PLC controlling your dewatering pumps is older than the engineer maintaining it. You know that when a control loop misbehaves at three in the morning, the consequences are measured in dollars per minute and sometimes in injury reports.

What’s worth saying briefly is why this changes the AI conversation.

A wrong answer in OT can hurt people. The same misconfigured ACL that creates a help-desk ticket in IT can create a personnel-safety event in OT. AI-produced findings have the same property as any other finding: they get implemented. When the recommendation rests on a misidentification, the implementation rests on the same misidentification.

Vendor diversity outlasts staff turnover. A typical industrial site has equipment from six to eight vendors spanning twenty-five years of installation. No analyst, human or AI, knows every protocol on a site like that. Coverage gaps are a permanent feature of OT work, not a temporary one. Any methodology that doesn’t acknowledge its own coverage gaps is lying about them.

The systems are unmovable. You cannot patch a turbine governor on a Wednesday because someone wrote a CVE advisory on Tuesday. Recommendations have to land in a world of scheduled outages, vendor support contracts, and regulatory reporting calendars. An AI that produces advice without that context produces homework, not work.

Those three pressures combine into an asymmetric cost structure. A confident, wrong AI output is cheap to produce. It is expensive to consume, to implement, to document, and to defend during the next audit. The defense against that asymmetry is the process that wraps around the AI, not the model itself.

TTX and Assessments Are the Same Problem in Two Costumes

Two engagement types come up when our customers ask us about AI.

Tabletop exercises are facilitated scenarios. Operations, engineering, and security are in the same room, walking through “an adversary writes to your pump-control PLC; what happens next?” The AI temptation is obvious: TTX scenarios are labor-intensive to write, sector-specific, and often reused across engagements. Drafting inject cards and facilitator guides is exactly the kind of work AI looks productive at.

It also fails ungracefully. A hallucinated process consequence (the pump will explode in a way it physically cannot) gets argued down by the plant engineer in the room, and the whole exercise loses credibility. A sector-mismatched scenario (electric-utility content pasted into a water-treatment engagement) hits the trash bin. A plausibly-shaped inject card that names a protocol the customer doesn’t actually run gets the AI a reputation in the room it doesn’t recover from. Tabletop facilitation is mostly trust; once that’s gone, the exercise is over.

OT assessments have the same problem at higher stakes. The customer hands over network captures, asset inventories, P&IDs, and prior reports; the deliverable is findings, severities, remediation roadmaps, and regulatory crosswalks. AI is useful at every step, but it also fails for the same reasons it fails in TTX: confident misidentification of protocols, severity inflation when context is missing, and vendor blind spots that don’t survive cross-checking.

The ethical question in both is the same. AI is already in the workflow. The real question is what process wraps around it, so wrong answers get caught before they leave the room.

What Discipline Actually Looks Like

Human-driven OT assessments and tabletop exercises have always been built in sections. You don’t just look at a network and write findings. You scope the engagement. You inventory the assets. You map the architecture. You anchor on what consequence the customer actually cares about. You correlate threats against the asset inventory rather than against an ingestion-time guess. You synthesize findings, you map them to frameworks, you produce a deliverable. Each section produces an artifact that the next section consumes.

That section-by-section structure exists for a reason. It keeps the analyst from getting ahead of the evidence. You don’t rate severity before you’ve anchored crown jewels. You don’t write recommendations before you’ve validated the asset inventory. Skipping a section means inference outruns observation, and in OT, that’s where bad findings come from.

The same discipline has to be enforced on AI. Without it, the model will happily write findings before it has finished asset discovery, will guess at protocol identities before checking authoritative references, and will assign severities to assets it hasn’t confirmed exist. The model doesn’t know it’s getting ahead of itself, because in its training data, that’s just how text gets generated. Structure is the constraint that keeps inference from running past evidence.
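One way to enforce that ordering is to make each section refuse to run until the artifacts it depends on exist. The sketch below is a hedged illustration: the section names follow the list earlier in this section, but the dependency map, the `Assessment` class, and `SectionOrderError` are assumptions invented for the example rather than a description of any particular platform.

```python
# Section names follow the text; the exact dependency map is an assumption.
SECTION_REQUIRES = {
    "scope": [],
    "asset_inventory": ["scope"],
    "architecture_map": ["asset_inventory"],
    "crown_jewels": ["architecture_map"],
    "threat_correlation": ["asset_inventory", "crown_jewels"],
    "findings": ["threat_correlation"],
    "framework_mapping": ["findings"],
    "deliverable": ["framework_mapping"],
}

class SectionOrderError(RuntimeError):
    """Raised when a section is attempted before its inputs exist."""

class Assessment:
    def __init__(self):
        self.artifacts = {}  # section name -> artifact produced by that section

    def complete(self, section, artifact):
        """Record a section's artifact, but only if its prerequisites are done."""
        missing = [r for r in SECTION_REQUIRES[section] if r not in self.artifacts]
        if missing:
            raise SectionOrderError(f"{section!r} blocked; missing: {missing}")
        self.artifacts[section] = artifact

# Usage: writing findings before threat correlation fails loudly.
assessment = Assessment()
assessment.complete("scope", {"audience": "controls engineering"})
try:
    assessment.complete("findings", ["draft finding"])
    blocked = False
except SectionOrderError:
    blocked = True
```

The guard is deliberately dumb. It does not judge the quality of an artifact, only whether one exists, which is enough to stop findings from being written ahead of the inventory.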

A few specific disciplines are the ones that pay off most.

Everything has to be labeled with how the AI knows it.

Observed claims come from direct evidence: counts, captured strings, decoded frames. Inferred claims come from cross-correlation: this is a workstation because of how it behaves on the wire. Assumed claims come from sector-norm fill-ins: in a mining context, a host like this is likely doing X, pending confirmation. Mixing the three without labels is what makes AI-assisted findings dangerous. Separating them with explicit confidence levels is what makes them safe. It’s the line between a finding and an opinion.
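The three labels map naturally onto a small data structure. This is a minimal sketch under stated assumptions: the `Basis`, `Claim`, `render`, and `weakest_basis` names are invented here, the sample claims are made up, and the rule that a finding is only as strong as its weakest supporting claim is one reasonable convention, not something the text prescribes.

```python
from dataclasses import dataclass
from enum import Enum

class Basis(Enum):
    OBSERVED = "observed"   # direct evidence: counts, captured strings, decoded frames
    INFERRED = "inferred"   # cross-correlation of behavior on the wire
    ASSUMED = "assumed"     # sector-norm fill-in, pending confirmation

@dataclass(frozen=True)
class Claim:
    text: str
    basis: Basis
    evidence: str  # pointer to the evidence, or the stated rationale

    def __post_init__(self):
        # An unlabeled or unsupported claim is rejected at construction time.
        if not self.evidence.strip():
            raise ValueError("every claim needs evidence or a stated rationale")

def render(claim):
    """Format a claim so the label travels with the words on the page."""
    return f"[{claim.basis.value.upper()}] {claim.text} ({claim.evidence})"

def weakest_basis(claims):
    """Assumed convention: a finding is only as strong as its weakest claim."""
    order = [Basis.OBSERVED, Basis.INFERRED, Basis.ASSUMED]
    return max((c.basis for c in claims), key=order.index)

# Hypothetical claims supporting one finding.
claims = [
    Claim("Host polls 14 controllers", Basis.OBSERVED, "flow counts"),
    Claim("Host is an operator workstation", Basis.INFERRED, "polling pattern"),
]
line = render(claims[0])
overall = weakest_basis(claims)
```

Here `overall` comes out as `Basis.INFERRED`, so a finding built on these two claims would be presented as inferred, not observed.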

The methodology has to show what it considered and rejected.

Every finding should ship with the alternative explanations that were examined and the reasons they were rejected. The benefit isn’t only forensic, though it is that too. Writing down rejected hypotheses forces the AI and the human reviewing it to argue against the most plausible alternatives instead of accepting the first thing that fit. In the three-reports story at the top, the run that got the protocol identification right was the one that explicitly checked whether the port appeared in the named vendor’s official port reference. The runs that got it wrong skipped that check entirely.
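The port-reference check described above can be written so that rejected hypotheses are recorded alongside the surviving one. Everything concrete in this sketch is hypothetical: the vendor names, the port numbers, and the `VENDOR_PORT_REFERENCE` table stand in for data that would be transcribed from each vendor's official documentation. Note that a port match only promotes a hypothesis to candidate; in the experiment, confirmation still required a known-good lab capture.

```python
from dataclasses import dataclass

# Hypothetical stand-in for ports transcribed from official vendor references.
VENDOR_PORT_REFERENCE = {
    "VendorA": {4000, 4001},
    "VendorB": {18500},
}

@dataclass(frozen=True)
class Hypothesis:
    vendor: str
    status: str   # "candidate", "rejected", or "open"
    reason: str

def check_port(observed_port, vendors):
    """Evaluate each vendor hypothesis against its documented ports."""
    results = []
    for vendor in vendors:
        documented = VENDOR_PORT_REFERENCE.get(vendor)
        if documented is None:
            # No reference on hand: the hypothesis stays open, not silently kept.
            results.append(Hypothesis(vendor, "open", "no port reference available"))
        elif observed_port in documented:
            results.append(Hypothesis(
                vendor, "candidate",
                f"port {observed_port} appears in the vendor port reference"))
        else:
            results.append(Hypothesis(
                vendor, "rejected",
                f"port {observed_port} not in documented ports {sorted(documented)}"))
    return results

results = check_port(18500, ["VendorA", "VendorB", "VendorC"])
statuses = {h.vendor: h.status for h in results}
```

Keeping the rejected and open entries in the output, rather than returning only the winner, is what lets a reviewer see which alternatives were actually examined.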

Human input has to be structural, not terminal.

“Human-in-the-loop” (HITL) gets abused. Most HITL workflows have the human at the end, approving or rejecting a finished output. That isn’t HITL; that’s review. Real HITL puts the human between sections of the work. Before the scope is finalized, before asset classifications are committed, before findings are severity-rated, before the deliverable is finalized. These structural touchpoints catch things a review-at-the-end model can’t, because by the time review happens, the structure is already locked. The questions at structural touchpoints are usually short and specific. Who is the audience for this deliverable? What worst-case event do you really care about? Which of these unknown devices do you recognize? Each answer steers the work downstream.
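Structural touchpoints can be modeled as gates that sit in front of specific sections. In this sketch the three questions are quoted from the paragraph above, but which section each one precedes, the `run_with_gates` function, and the scripted answers are assumptions made for illustration.

```python
# Questions are quoted from the text; the section each one gates is assumed.
GATES = {
    "scope": "Who is the audience for this deliverable?",
    "asset_classification": "Which of these unknown devices do you recognize?",
    "severity_rating": "What worst-case event do you really care about?",
}

def run_with_gates(sections, ask):
    """Run (name, work) pairs in order, pausing for a human before gated ones.

    `ask` is any callable that takes a question and returns the human's answer.
    """
    answers = {}
    outputs = []
    for name, work in sections:
        if name in GATES:
            answer = ask(GATES[name])
            if not answer or not answer.strip():
                raise RuntimeError(f"no answer at the gate before {name!r}")
            answers[name] = answer
        # Each section sees every answer so far, so input steers later work.
        outputs.append(work(dict(answers)))
    return outputs

# Usage with a scripted stand-in for the human (hypothetical answers).
scripted = {
    GATES["scope"]: "Controls engineering lead",
    GATES["asset_classification"]: "The unknown host is a vendor gateway",
    GATES["severity_rating"]: "Loss of dewatering pumps",
}
sections = [
    ("scope", lambda a: f"scope written for: {a['scope']}"),
    ("asset_classification", lambda a: f"classified with hint: {a['asset_classification']}"),
    ("severity_rating", lambda a: f"severity anchored on: {a['severity_rating']}"),
]
outputs = run_with_gates(sections, scripted.get)
```

A review-at-the-end workflow would call `ask` once, after `outputs` already exists. Here an unanswered gate stops the run before the dependent section is written.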

Absence of work has to be as visible as the presence of work.

Every deliverable should carry a section that names, explicitly, what wasn’t done. We didn’t have access to the safety-instrumented system network. We didn’t perform active probing. We didn’t have customer-confirmed crown jewels and used sector norms instead. The temptation is to bury that section because it sounds like an excuse. The reason it has to stay loud is that the next person reading the report six months later, on a different engagement, trying to understand whether a finding is still relevant, needs to know what was certain and what was a gap.
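One way to keep that section loud is to make the deliverable fail to render without it. The sketch below is illustrative only: the `Gap` type and `render_not_done` function are invented names, the three gaps are the examples from the paragraph above, and the stated effects are hypothetical wording added for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gap:
    what: str    # the work that was not done
    effect: str  # what that means for the findings (hypothetical wording here)

def render_not_done(gaps):
    """Render the 'what was not done' section; refuse to render an empty one."""
    if not gaps:
        # Assumption: full coverage is never a credible claim in OT,
        # so an empty list is treated as an error, not as good news.
        raise ValueError("deliverable must name at least one gap")
    lines = ["What was not done"]
    for gap in gaps:
        lines.append(f"- {gap.what}. Effect: {gap.effect}")
    return "\n".join(lines)

gaps = [
    Gap("No access to the safety-instrumented system network",
        "SIS findings are absent, not clean"),
    Gap("No active probing performed",
        "silent devices may be missing from the inventory"),
    Gap("Crown jewels not customer-confirmed; sector norms used",
        "severities are provisional"),
]
section = render_not_done(gaps)
```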

What Discipline Catches, In Practice

Going back to the experiment. The same capture that produced the three different vendor identifications also produced findings that the simpler runs missed entirely, and the difference came down to whether the methodology actually decoded the traffic or just looked at it.

When the methodology demanded that we cover the highest-volume un-decoded protocols and write parsers against authoritative vendor documentation, we found a cross-zone write pattern from a host that shouldn’t have been writing at all. Hundreds of write attempts: most rejected by the controllers, some successful. The pattern was invisible until the protocol was decoded. Once it was decoded, the finding was obvious and actionable.
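Once a parser yields per-message records, a check for this kind of pattern is short. The sketch below assumes the decoder already emits source, destination, operation, and controller response; the addresses, zone map, record layout, and allowlist are all invented, and the real protocol's fields are not described in this article.

```python
from collections import Counter

# Hypothetical zone map and allowlist; a real one comes from the architecture map.
ZONE_OF = {
    "10.1.0.5": "business",
    "10.3.0.7": "engineering",
    "10.2.0.10": "control",
    "10.2.0.20": "control",
}
AUTHORIZED_CROSS_ZONE_WRITERS = {"10.3.0.7"}  # e.g. an engineering workstation

def cross_zone_writes(records):
    """Count unauthorized cross-zone writes per (source, outcome)."""
    summary = Counter()
    for r in records:
        if r["op"] != "write":
            continue
        src_zone = ZONE_OF.get(r["src"], "unknown")
        dst_zone = ZONE_OF.get(r["dst"], "unknown")
        # Same known zone on both ends is not a cross-zone write.
        if src_zone == dst_zone and src_zone != "unknown":
            continue
        if r["src"] in AUTHORIZED_CROSS_ZONE_WRITERS:
            continue
        outcome = "accepted" if r["accepted"] else "rejected"
        summary[(r["src"], outcome)] += 1
    return summary

# Invented decoded records for illustration.
records = (
    [{"src": "10.1.0.5", "dst": "10.2.0.10", "op": "write", "accepted": False}] * 3
    + [{"src": "10.1.0.5", "dst": "10.2.0.10", "op": "write", "accepted": True}]
    + [{"src": "10.1.0.5", "dst": "10.2.0.10", "op": "read", "accepted": True}]
    + [{"src": "10.3.0.7", "dst": "10.2.0.10", "op": "write", "accepted": True}]
    + [{"src": "10.2.0.20", "dst": "10.2.0.10", "op": "write", "accepted": True}]
)
summary = cross_zone_writes(records)
```

Splitting the count by outcome matters because rejected writes show intent or misconfiguration, while accepted ones show that a change actually landed on a controller.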

The same protocol-level decoding surfaced something subtler. Plaintext tag names in the SCADA traffic revealed that the facility wasn’t only what we thought it was; it had additional functions we’d missed in the initial framing. That re-characterization changed the regulatory scope of the assessment, the crown-jewel anchoring, and the severity of several findings. None of that emerges from a deliverable that treats high-volume proprietary traffic as opaque.

Decoding a third protocol, one used by a power-quality meter, surfaced repeated error messages indicating the revenue meters at the facility couldn’t synchronize their clocks. On a meter that’s used for billing and forensics, time-discipline failure is its own finding, with real downstream implications for any event-correlation work that ever has to be done.

The pattern across all three is the same. A methodology that says “cover the highest-volume protocols, decode them against authoritative documentation, surface what the evidence shows” catches things a quicker pass misses. A methodology that says “write findings based on what the protocol probably is” produces things the customer will have to walk back later.

What to Look for When AI Is in the Room

If you’re hiring out OT assessment work or scoping it internally, five things are worth checking before you sign off on a deliverable.

There should be a documented methodology with explicit sections. If the work is narrated in free-form prose without a structural backbone, the assessment isn’t repeatable. The same evidence handed to the same vendor six months later should produce a recognizably-shaped report, not a different one.

Every claim should be labeled with how it was known. If you can’t tell which findings rest on observed evidence and which on sector-norm assumptions, you can’t budget remediation against them, and you can’t defend them at the next audit.

There should be a record of what was considered and rejected. If everything in the deliverable arrived without alternatives being examined, the analysis hasn’t been adversarial enough to trust.

The human input should be structural, not terminal. If your only contact with the analyst was at kickoff and at delivery, you weren’t in the loop; you were at the endpoints. Real HITL has the customer steering the work between sections.

There should be a section that names, explicitly, what wasn’t done. If everything in the deliverable is framed as comprehensive, the gaps are being hidden. Comprehensiveness is impossible in OT; the question is whether the gaps are documented.

A deliverable that has all five was produced with discipline. A deliverable that has none is a probability cloud presented as fact, and the cost of consuming it falls on whoever has to implement against it.

Closing

The reason we care about this isn’t that AI is going to ruin OT cybersecurity if we let it. It won’t. AI in OT is already here, and the people on the receiving end of OT assessments are going to keep getting AI-assisted deliverables whether they ask for them or not. The question is whether the process around the AI catches mistakes before they leave the room, or whether it doesn’t.

In a corporate IT environment, the cost of an AI mistake is usually a help-desk ticket. In OT, it’s a change window that didn’t need to happen, a finding that turns out to be wrong on the next audit, or, in the worst cases, something measured in injury reports. We owe the engineers who maintain these systems a better answer than “trust the AI.” The answer we owe them is a methodology that makes the AI’s reasoning legible, its uncertainty visible, and its mistakes catchable.

In critical infrastructure, the difference between a finding and an opinion is the discipline of the workflow that produced it. We build ours to fail loudly when it should fail loudly, because the alternative is to fail quietly in production. That’s a cost the people running these plants don’t deserve to absorb on someone else’s behalf.

If you’re sitting on an AI-assisted assessment that doesn’t carry confidence labels, doesn’t show its considered-and-rejected work, or doesn’t name what wasn’t covered, ask. The vendor either has answers, or they have a problem they should fix before they ship the next one.
