
The Orchestration Layer Is Where AI Finds Bugs: A Tour of Offensive LLM Harnesses

[Illustration: a staged pipeline of glowing amber and cyan nodes wired together above a grid of tool icons, with data flowing left to right through gated checkpoints.]

Point a top-tier model at a real codebase, hand it one prompt - "find the vulnerabilities" - and you get back a confident report you cannot trust. Some findings are real. Some are invented. None arrive with a reproduction, and the run burned a small fortune in tokens rediscovering the same functions three times over. Swap in a smarter model next quarter and the report improves at the margins while the bill climbs. The people actually pulling verified bugs out of large targets with LLMs are not winning on model choice. They are winning on the machinery wrapped around the model. That machinery is the harness, and it is the whole difference between AI slop and a bug-finding machine.

Andy Gill (ZephrFish) made this case cleanly in "Harnessing Harnesses": prompt engineering and model selection get almost all the airtime, while the orchestration layer - the harness - gets almost none, and that layer is where the real gains in capability, cost, and reliability live. For offensive work this has stopped being a theory: over the last few months a cluster of open-source security harnesses has shipped, and the pattern behind them is clear enough to copy. This is a walk through what those tools do and the structure worth stealing.

Who this is for

Read on if you point LLMs at code or binaries for security work: source-code auditors and vulnerability researchers, red teamers building their own tooling, appsec reviewers staring down repositories too large to read by hand, and one-person security teams trying to cover more ground than the headcount allows. If your LLM use stops at drafting emails and summarizing advisories, none of this is aimed at you yet, and you can close the tab. The mechanics run defensively too, so blue teams evaluating AI-assisted code review will recognize the parts.

The myth: a better model finds the bugs

The common assumption is that capability scales with the model, so better results must run through the next frontier release. In practice a strong model with no structure around it burns tokens on redundant context, repeats work it has already done, and produces output you cannot verify or reproduce. The leverage sits in the harness: the orchestration layer controlling the inputs, tools, prompts, models, state, validation gates, and outputs at each stage.

Model Context Protocol servers, which got a lot of attention this year, live inside that layer as one kind of tool. They hand the model callable functions - run a command, decompile a binary, query a database - but they do not decide when those functions fire, in what order, with what context, or what to do with the result. That is the harness's job. You can wire up a full suite of MCP servers and still ship inconsistent, unverifiable findings if nothing coordinates how they are used. The orchestration layer decides which data to collect, which tools to run and why, which model fits the task, how much context is required, what to reuse instead of rediscover, and - the part most setups skip - when to stop and hand control back to the operator.

What the field has already built

The clearest way to understand a harness is to look at the ones already in the open. Each breaks vulnerability research into stages and refuses to trust the model's first answer.

  • RAPTOR turns Claude Code into a general-purpose offensive and defensive agent through rules, sub-agents, and skills. It splits into a Python execution layer that runs the tools and a decision layer that chooses what to run and how to read the results, so the orchestration logic can be tested on its own, even run from CI to emit SARIF with no model in the loop. Its validation pipeline runs six stages: is the pattern genuine, what would an attacker need to reach it, does the code support it line by line, what is the CVSS ruling, is it feasible at the binary level (ASLR and RELRO checks, gadget availability, Z3 SMT solving for one-gadget applicability), and a final contradiction check before anything is promoted.
  • Anthropic's defending-code-reference-harness targets C/C++ and runs an autonomous find, grade, and patch loop inside AddressSanitizer-instrumented Docker containers. Every finding ships a binary PoC that reproduces the crash against the instrumented build, so reachability is not up for debate. The vulnpipeline_* stages run in sequence: recon maps the attack surface, a run stage launches fuzzing agents and collects PoCs, a report stage grades each crash, and a patch stage writes a fix, rebuilds, and re-runs the PoC to confirm it is closed.
  • baby-naptime, an open take on Google Project Zero's unreleased Naptime, is a single-agent runtime loop against a live C/C++ binary: propose an approach, execute it, read the output, update the theory, repeat across dozens of iterations with real runtime data instead of static context alone.
  • evilsocket/audit is an eight-stage, language-agnostic pipeline for repositories without a clean build. It maps the codebase, identifies trust boundaries, reviews past security fixes, and runs parallel agents against focused tasks, then deduplicates and passes each finding through a trace stage that must show attacker-controlled input reaching a vulnerable sink before it is reported.
  • Visa's VVAH leans hardest on threat modeling and taint-flow up front: it inventories the repo, maps trust boundaries, assigns specialist review lenses, validates through an adversarial second pass, and emits SARIF and Markdown, treating results as triage candidates rather than confirmed bugs. Its limits are worth knowing: the call graph is LLM-seeded and reinforced with regex rather than built from a full AST, so dynamic dispatch, reflection, and framework routing can be missed.

None of these produces flawless findings unattended. Gill notes that several have kicked out rubbish for him and that he needed to tune evilsocket/audit heavily to get it flowing. What they buy you is discipline - a repeatable pipeline that challenges its own output - rather than a replacement for the reviewer.

The pattern that makes them work

Strip these tools down and the same skeleton appears: recon → hunt → validate → trace → report, the shape Gill ships in his own harness-kit template. Recon maps the target, Hunt investigates narrow hypotheses in parallel, Validate is told to find every reason a finding is wrong, and Trace proves whether attacker-controlled input reaches the sink. Only findings that clear those gates reach reporting.
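
The skeleton above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any tool named here: Finding, run_pipeline, and the stage callables are hypothetical names, and every model call is assumed to sit behind a plain function so that each stage can be stubbed or swapped on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Finding:
    """Hypothetical finding record; real harnesses carry far more detail."""
    title: str
    sink: str


# Each stage is a plain callable so it can be stubbed, tested, or replaced alone.
Recon = Callable[[str], dict]
Hunt = Callable[[dict], list]
Gate = Callable[[Finding], bool]


def run_pipeline(recon: Recon, hunt: Hunt, validate: Gate, trace: Gate,
                 target: str) -> list:
    surface = recon(target)                              # recon: map the target
    candidates = hunt(surface)                           # hunt: narrow hypotheses
    survivors = [f for f in candidates if validate(f)]   # validate: try to disprove
    proven = [f for f in survivors if trace(f)]          # trace: input reaches sink?
    return proven                                        # only these reach reporting
```

Because the gates are ordinary predicates, the orchestration can be exercised with stub functions before any model is wired in, which is the same property that lets a harness's decision logic be tested separately from its execution layer.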

The single most common mistake is using one system prompt for the whole pipeline. Each stage needs a prompt written for its job: an agent mapping a codebase needs different instructions from one forming exploit hypotheses, and both differ from one auditing a proposed PoC. Stages should exchange structured artifacts - the mapping stage returns JSON of file paths, entry points, and dependencies - rather than one sprawling conversation, which keeps the pipeline inspectable, rerunnable, and replaceable stage by stage.
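
One way to make the hand-off between stages concrete is a typed artifact that is serialized to JSON and checked when the next stage loads it. The sketch below is an assumption about shape only: ReconArtifact, its three fields, and the prompt wording in STAGE_PROMPTS are illustrative and not taken from any of the harnesses above.

```python
import json
from dataclasses import dataclass, asdict

REQUIRED_KEYS = {"file_paths", "entry_points", "dependencies"}


@dataclass(frozen=True)
class ReconArtifact:
    """Hypothetical output of the mapping stage, exchanged as JSON."""
    file_paths: list
    entry_points: list
    dependencies: list

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "ReconArtifact":
        data = json.loads(text)
        if not isinstance(data, dict):
            raise ValueError("recon artifact must be a JSON object")
        missing = REQUIRED_KEYS - set(data)
        if missing:
            # Reject a malformed artifact here instead of letting the
            # next stage guess at what the model meant.
            raise ValueError(f"recon artifact missing keys: {sorted(missing)}")
        return cls(data["file_paths"], data["entry_points"], data["dependencies"])


# One prompt per stage; the wording is illustrative only.
STAGE_PROMPTS = {
    "recon": "Map the codebase. Return JSON only: file_paths, entry_points, dependencies.",
    "hunt": "Form one narrow exploit hypothesis for the entry point you are given.",
    "validate": "List every reason this finding is wrong.",
}
```

Validating the artifact at the boundary means a stage can be rerun from its saved input file without replaying the conversation that produced it.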

The validation gate is where the discipline pays off. Gill points to Scrutineer's revalidate skill: when a deep-dive stage produces a High or Critical, revalidate checks it against git history and returns true_positive, false_positive, already_fixed, or uncertain. Only true positives advance to the expensive verification that tests against current HEAD, keeping your costliest runtime work aimed at the findings most likely to be real. Encoding that as configuration keeps the shape explicit:

# harness.yaml - a five-stage source-audit pipeline. Each stage gets its own
# prompt, model tier, and context budget; a finding only advances past its gate.
stages:
  recon:
    model: fast            # cheap tier: map the target, no deep reasoning
    budget_tokens: 8000
    output: recon.json     # {entry_points, sources, sinks, dependencies}
  hunt:
    model: balanced
    budget_tokens: 8000
    parallel: 6            # narrow, independent hypotheses, scoped context each
    input: recon.json
    output: findings.jsonl
  validate:
    model: balanced
    budget_tokens: 8000
    prompt: "List every reason this finding is WRONG. Check git history."
    gate:
      advance_if: true_positive   # false_positive|already_fixed|uncertain dropped
  trace:
    model: frontier        # expensive: reserve for survivors only
    budget_tokens: 32000
    requires: attacker_input_reaches_sink   # no taint path, no report
  report:
    model: balanced
    format: [sarif, markdown]
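
The gate itself is small. A minimal Python sketch of the four-verdict routing follows; it is not Scrutineer's implementation, and parse_verdict and gate are hypothetical names. The one design assumption worth stating is that the gate fails closed: any answer that is not one of the four expected strings is treated as uncertain and held back.

```python
from enum import Enum


class Verdict(Enum):
    TRUE_POSITIVE = "true_positive"
    FALSE_POSITIVE = "false_positive"
    ALREADY_FIXED = "already_fixed"
    UNCERTAIN = "uncertain"


def parse_verdict(raw: str) -> Verdict:
    """Map a model's answer to a verdict; anything unrecognised is UNCERTAIN."""
    try:
        return Verdict(raw.strip().lower())
    except ValueError:
        return Verdict.UNCERTAIN


def gate(answers: dict) -> tuple:
    """Split finding ids into those that advance and those held back.

    `answers` maps a finding id to the raw verdict string from the validate stage.
    """
    advance, held = [], {}
    for finding_id, raw in answers.items():
        verdict = parse_verdict(raw)
        if verdict is Verdict.TRUE_POSITIVE:
            advance.append(finding_id)
        else:
            held[finding_id] = verdict   # keep the reason for the operator
    return advance, held
```

Keeping the held findings and their verdicts, rather than discarding them, gives the operator something to review when the gate turns out to be too strict.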

Budget the context window like money

Treat the context window as a budget instead of pouring in everything you have. A reliable way to wreck an early harness is to pass raw files, full scanner output, and the whole conversation history into every stage, most of it irrelevant to the current hypothesis. Retrieve only the code paths tied to the hypothesis at hand, summarize noisy tool output, keep a short rolling summary, and drop resolved tasks once their results are stored elsewhere. As a rough calibration: single-function analysis often fits in about 8K tokens, synthesis across findings closer to 32K, and fuzzer or scanner logs should be cut to a few hundred useful tokens before they hit a prompt. Model routing follows the same logic - cheap models classify and summarize, strong models handle validation, tracing, and synthesis. Set this early; bolting context management on later is painful.
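
As a rough illustration of budgeting, the Python sketch below trims a noisy log to a token budget and routes tasks to a model tier. Everything in it is an assumption for illustration: the four-characters-per-token estimate is a crude heuristic and not a tokenizer, the SIGNAL pattern would need tuning per tool, and the tier names are placeholders.

```python
import re


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. A heuristic, not a tokenizer.
    return max(1, len(text) // 4)


# Hypothetical signal pattern; tune it for the scanner or fuzzer in use.
SIGNAL = re.compile(r"error|crash|sanitizer|overflow|use-after-free", re.IGNORECASE)


def trim_log(log: str, budget_tokens: int = 300) -> str:
    """Keep signal-bearing lines, in order, until the budget is spent."""
    kept, spent = [], 0
    for line in log.splitlines():
        if not SIGNAL.search(line):
            continue                      # drop progress noise entirely
        cost = estimate_tokens(line)
        if spent + cost > budget_tokens:
            break                         # stop before exceeding the budget
        kept.append(line)
        spent += cost
    return "\n".join(kept)


# Cheap tier for classification and summaries, strong tier for the rest.
MODEL_FOR_TASK = {
    "classify": "fast",
    "summarize": "fast",
    "validate": "strong",
    "trace": "strong",
    "synthesize": "strong",
}


def route_model(task: str) -> str:
    # Unknown tasks default to the strong tier: an assumption that favours
    # correctness over cost when the task type is unclear.
    return MODEL_FOR_TASK.get(task, "strong")
```

In a real harness the estimate would be replaced with the provider's own token counter, but even this crude version makes the budget an explicit number that a stage can be held to.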

Give the harness a memory

Context management decides what a stage sees during one run; retrieval-augmented generation decides what the harness reuses across runs. A knowledge base of prior notes, tool syntax, documentation, and previous findings lets the pipeline learn from earlier work instead of paying to rediscover it, and a feedback loop after each pass means new findings build on the baseline rather than starting cold. Instrument the spend too: Gill's TokenBurn maps a Claude subscription to equivalent API cost, so you can see which stage is quietly wasteful.
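
A memory layer does not have to start as a vector database. The Python sketch below is a deliberately naive stand-in, with every name (FindingMemory, fingerprint, recall) hypothetical: it fingerprints findings so that a later run can skip what is already known, and it uses keyword overlap where a real harness would more likely use embedding search.

```python
import hashlib
import json
from pathlib import Path
from typing import Optional


class FindingMemory:
    """Tiny store of prior findings; JSON-backed if a path is given."""

    def __init__(self, path: Optional[Path] = None):
        self.path = path
        self.records = {}
        if path is not None and path.exists():
            self.records = json.loads(path.read_text())

    @staticmethod
    def fingerprint(file: str, sink: str, vuln_class: str) -> str:
        key = "|".join((file, sink, vuln_class)).encode()
        return hashlib.sha256(key).hexdigest()[:16]

    def seen(self, file: str, sink: str, vuln_class: str) -> bool:
        return self.fingerprint(file, sink, vuln_class) in self.records

    def remember(self, file: str, sink: str, vuln_class: str, note: str) -> None:
        self.records[self.fingerprint(file, sink, vuln_class)] = {
            "file": file, "sink": sink, "class": vuln_class, "note": note,
        }
        if self.path is not None:
            self.path.write_text(json.dumps(self.records, indent=2))

    def recall(self, query: str, limit: int = 3) -> list:
        """Return notes sharing words with the query, best match first."""
        words = set(query.lower().split())
        scored = [
            (len(words & set(rec["note"].lower().split())), rec["note"])
            for rec in self.records.values()
        ]
        ranked = sorted(scored, key=lambda pair: -pair[0])
        return [note for score, note in ranked if score > 0][:limit]
```

A hunt stage could call seen() before spending tokens on a hypothesis and pass the output of recall() into its prompt as prior context, so each run starts from the previous baseline.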

Why a tuned harness is the whole game

This is the part that decides whether AI helps you or wastes your time. A frontier model handed a codebase and a one-line prompt produces confident, unverifiable output - the AI slop that reads well and falls apart the moment you try to reproduce it. The same model inside a disciplined harness, with staged prompts, scoped context, and gates that try to prove each finding wrong, becomes a bug-finding machine you can stand behind.

At Red Hound we run our own private harness built on these principles, with our own methodologies and expertise encoded into the stages, prompts, and validation gates. A harness is only as good as the judgment poured into it, and ours carries how we scope, hunt, and prove findings. We have run it against both CTF targets and real red team engagements, and used correctly it delivers fast, high-quality offensive work that a model cannot produce out of the box. It does not replace the operator; it lets a skilled one cover far more ground and prove what they find.

Build the validation gate before you scale the agents

The temptation is to point eight parallel agents at a repository on day one and let the volume impress you. Resist it. Start with one honest pipeline: recon, a single Hunt task, a Validate stage whose only job is to disprove the finding, and a Trace stage that must show input reaching a sink. A finding that survives that sequence is worth an operator's time; one that does not should never have reached a report a client will read. Defensible findings, not raw volume, are what an offensive engagement is judged on. If you want a second set of eyes on an AI-assisted research pipeline, or an offensive test that reflects how attackers are actually building this tooling, that is the conversation Red Hound has every week.
