AI Automation

Building an Autonomous AI Agent for Compliance Control Testing: A Practical Guide

Compliance control testing is one of the most expensive and least loved activities in enterprise security. Auditors ask for evidence that MFA is enforced across all accounts. A human analyst logs into the identity provider, runs a report, exports a CSV, opens it in Excel, counts the exceptions, and writes a finding. Then they do it again for backup configurations, access reviews, encryption settings, and 47 other controls. For a mid-size organization with SOC 2, ISO 27001, or HIPAA obligations, a single annual audit can consume 500-800 hours of skilled analyst time — and still miss things because the process is manual, inconsistent, and tedious.

AI agents can automate the deterministic, repeatable parts of this work. Not the judgment calls — those still require a human. But the evidence collection, the API queries, the data normalization, and the initial finding classification can all be delegated to an agent that runs continuously, produces consistent output, and maintains an immutable audit trail. This article explains how to build one.

The Problem with Manual Compliance Testing

Manual compliance testing fails in three predictable ways. First, it is point-in-time: the audit captures a snapshot, not a continuous state. A control that was passing on the day of evidence collection might have been failing for the three months before, and will fail again the day after. Second, it is analyst-dependent: the same control tested by two different analysts frequently produces different findings because the procedure is interpreted differently and the evidence is gathered inconsistently. Third, it scales poorly: adding another compliance framework or another business unit means adding more analyst hours at a roughly linear rate.

AI agents address all three. An agent can run control tests on a schedule — daily, weekly, or continuously — producing a time-series view of control health rather than a point-in-time snapshot. It applies the same query logic every time, eliminating interpretation variance. And it scales horizontally: testing one more cloud account costs the same as testing the first.

Architecture Overview

A compliance testing agent has three primary components:

  1. LLM Orchestrator: The reasoning engine that receives a control objective (e.g., "Verify that MFA is enabled for all users in the Entra ID tenant"), decides which tool functions to call, interprets the results, classifies the finding, and generates the evidence narrative. GPT-4o, Claude 3.5 Sonnet, or equivalent models are suitable. The orchestrator operates in a tool-calling loop: it calls a tool, receives the result, decides what to do next, calls another tool if needed, and terminates when it has sufficient evidence to form a finding.
  2. Tool Functions: Deterministic functions that make API calls to authoritative data sources and return structured JSON. The LLM calls these; it does not construct the API calls itself. This is critical: the LLM orchestrates and interprets, but the actual data retrieval happens through pre-validated functions that return verified data. This eliminates the most dangerous hallucination surface — the agent cannot fabricate API responses if the API responses are real.
  3. Evidence Store and Report Generator: Every tool call and its response is logged immutably with a timestamp. The agent's reasoning trace is preserved. The final finding is generated with citations to specific evidence records. This is what makes the output audit-ready: every conclusion can be traced to a specific API response at a specific point in time.

The architecture looks like this in practice:

# Simplified agent loop pseudocode (assumes an OpenAI-style tool-calling API)
def run_compliance_control_test(control_objective: str, tools: list) -> Finding:
    messages = [
        {"role": "system", "content": COMPLIANCE_AGENT_SYSTEM_PROMPT},
        {"role": "user", "content": control_objective}
    ]

    for _ in range(20):  # Bound the loop so a confused agent cannot run forever
        response = llm.chat(messages=messages, tools=tools)

        if response.finish_reason == "tool_calls":
            # The assistant turn (with its tool calls) must be appended before
            # the tool results, or the next LLM call will reject the history
            messages.append(response.message)
            # Execute each tool call deterministically
            for tool_call in response.tool_calls:
                result = execute_tool(tool_call.name, tool_call.arguments)
                audit_log.record(tool_call, result)  # Immutable record
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": json.dumps(result)
                })

        elif response.finish_reason == "stop":
            # Agent has enough evidence to form a finding
            finding = parse_finding(response.content)
            finding.evidence_ids = audit_log.get_current_session_ids()
            return finding

    raise RuntimeError("Agent exceeded iteration budget without producing a finding")
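The `execute_tool` call in the loop above can be backed by a simple registry that maps tool names to pre-validated functions, so the LLM can only ever invoke code you wrote. A minimal sketch (the names here are illustrative; real entries would point at functions like `check_entra_mfa_status` below):

```python
import json

# Registry of pre-validated tool functions the agent may call
TOOL_REGISTRY = {}

def register_tool(name):
    """Decorator that adds a function to the registry under a fixed name."""
    def wrapper(fn):
        TOOL_REGISTRY[name] = fn
        return fn
    return wrapper

def execute_tool(name: str, arguments: str) -> dict:
    """Dispatch a tool call. The LLM supplies only the name and JSON
    arguments; the actual API interaction happens in the registered function."""
    if name not in TOOL_REGISTRY:
        # Never guess: an unknown tool name is an error, not an improvisation
        return {"error": f"unknown tool: {name}"}
    try:
        kwargs = json.loads(arguments) if arguments else {}
        return TOOL_REGISTRY[name](**kwargs)
    except Exception as exc:
        # Failures surface as data, so the agent can report "Unable to Test"
        return {"error": str(exc)}
```

Returning errors as structured data rather than raising keeps the loop alive and forces the agent to acknowledge the failure in its finding.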

Building the Tool Functions

Each tool function should do one thing, do it reliably, and return structured output the LLM can reason about. The examples below cover common compliance controls:

MFA Enforcement Check (Entra ID)

def check_entra_mfa_status(tenant_id: str, include_guests: bool = False) -> dict:
    """
    Returns MFA registration and enforcement status for all users in the tenant.
    Queries: Authentication Methods Policy, Per-User MFA, and CA Policy coverage.
    """
    graph = GraphClient(tenant_id=tenant_id, credential=get_managed_identity_credential())

    users = graph.get("/users?$select=id,displayName,userType,assignedLicenses")
    auth_methods = graph.get("/reports/authenticationMethods/userRegistrationDetails")
    ca_policies = graph.get("/identity/conditionalAccess/policies?$filter=state eq 'enabled'")

    # Index registration details by user id instead of scanning the list per user
    reg_by_id = {r["id"]: r for r in auth_methods}

    # Identify in-scope users without any MFA method registered
    in_scope = [u for u in users if include_guests or u["userType"] != "Guest"]
    mfa_gaps = []
    for user in in_scope:
        reg = reg_by_id.get(user["id"], {})
        if not reg.get("isMfaRegistered", False):
            mfa_gaps.append({
                "userId": user["id"],
                "displayName": user["displayName"],
                "mfaRegistered": False,
                "methods": reg.get("methodsRegistered", [])
            })

    return {
        "total_users_checked": len(in_scope),
        "users_without_mfa": len(mfa_gaps),
        "gap_percentage": round(len(mfa_gaps) / len(in_scope) * 100, 1) if in_scope else 0.0,
        "enabled_ca_policy_count": len(ca_policies),
        "gap_details": mfa_gaps[:10],  # First 10 for context
        "evidence_timestamp": datetime.utcnow().isoformat(),
        "data_source": "Microsoft Graph API"
    }

Backup Configuration Validation (AWS)

def check_aws_backup_compliance(account_id: str, required_retention_days: int = 30) -> dict:
    """
    Checks that all running EC2 instances and RDS databases are covered by
    backup plans whose retention meets or exceeds the required threshold.
    """
    backup_client = boto3.client("backup")
    ec2 = boto3.client("ec2")
    rds = boto3.client("rds")

    # Check retention rules on every backup plan
    plans = backup_client.list_backup_plans()["BackupPlansList"]
    non_compliant_plans = []
    for plan in plans:
        plan_detail = backup_client.get_backup_plan(BackupPlanId=plan["BackupPlanId"])
        for rule in plan_detail["BackupPlan"]["Rules"]:
            # A missing DeleteAfterDays means recovery points are retained
            # indefinitely, which satisfies any retention threshold
            retention = rule.get("Lifecycle", {}).get("DeleteAfterDays")
            if retention is not None and retention < required_retention_days:
                non_compliant_plans.append({
                    "plan_name": plan["BackupPlanName"],
                    "rule_name": rule["RuleName"],
                    "retention_days": retention,
                    "required_days": required_retention_days
                })

    # Build the set of resources AWS Backup actually protects.
    # ListProtectedResources returns ARNs; compare on the trailing resource ID.
    protected_resources = set()
    paginator = backup_client.get_paginator("list_protected_resources")
    for page in paginator.paginate():
        for resource in page["Results"]:
            protected_resources.add(
                resource["ResourceArn"].split("/")[-1].split(":")[-1])

    # Check for unprotected resources
    all_instances = [i["InstanceId"] for r in ec2.describe_instances()["Reservations"]
                     for i in r["Instances"] if i["State"]["Name"] == "running"]
    all_databases = [d["DBInstanceIdentifier"]
                     for d in rds.describe_db_instances()["DBInstances"]]
    unprotected = [x for x in all_instances + all_databases
                   if x not in protected_resources]

    return {
        "account_id": account_id,
        "total_running_instances": len(all_instances),
        "total_databases": len(all_databases),
        "unprotected_resources": unprotected,
        "non_compliant_backup_plans": non_compliant_plans,
        "finding": "FAIL" if unprotected or non_compliant_plans else "PASS",
        "evidence_timestamp": datetime.utcnow().isoformat()
    }

Access Review Completion Check

def check_access_review_completion(tenant_id: str, review_period_days: int = 90) -> dict:
    """
    Checks Entra ID Access Review completion rates for privileged roles.
    """
    graph = GraphClient(tenant_id=tenant_id, credential=get_managed_identity_credential())
    cutoff = datetime.utcnow() - timedelta(days=review_period_days)

    reviews = graph.get(
        f"/identityGovernance/accessReviews/definitions"
        f"?$filter=createdDateTime ge {cutoff.isoformat()}Z"
    )

    # Reviews created in the period that have neither completed nor had
    # their decisions applied are treated as incomplete
    overdue = [r for r in reviews if r["status"] not in ["Completed", "Applied"]
               and r["settings"]["instanceDurationInDays"] > 0]

    return {
        "total_reviews_in_period": len(reviews),
        "incomplete_reviews": len(overdue),
        "overdue_review_names": [r["displayName"] for r in overdue],
        "finding": "FAIL" if overdue else "PASS",
        "evidence_timestamp": datetime.utcnow().isoformat()
    }
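Encryption-at-Rest Check (AWS S3)

The same pattern extends to encryption settings, the fourth control family mentioned in this article. A useful design choice here is to split the deterministic fetch from the pure classification logic, so the classification can be unit-tested without cloud credentials. The sketch below evaluates already-fetched bucket encryption configurations; the fetch itself (one `get_bucket_encryption` call per bucket via boto3, with buckets lacking a configuration mapped to `None`) is assumed and not shown:

```python
from datetime import datetime

def evaluate_bucket_encryption(bucket_configs: dict) -> dict:
    """
    Classifies S3 default-encryption status from already-fetched data.
    bucket_configs maps bucket name -> ServerSideEncryptionConfiguration
    dict (or None when the bucket has no default encryption configured).
    """
    unencrypted = []
    sse_s3_only = []
    for name, config in bucket_configs.items():
        if config is None:
            unencrypted.append(name)
            continue
        algorithms = {
            r["ApplyServerSideEncryptionByDefault"]["SSEAlgorithm"]
            for r in config.get("Rules", [])
            if "ApplyServerSideEncryptionByDefault" in r
        }
        if not algorithms:
            unencrypted.append(name)
        elif "aws:kms" not in algorithms:
            # Plain SSE-S3 (AES256); flagged separately for controls
            # that require customer-managed KMS keys
            sse_s3_only.append(name)

    return {
        "total_buckets_checked": len(bucket_configs),
        "unencrypted_buckets": unencrypted,
        "sse_s3_only_buckets": sse_s3_only,
        "finding": "FAIL" if unencrypted else "PASS",
        "evidence_timestamp": datetime.utcnow().isoformat()
    }
```

Because the evaluation is pure, the exact same logic that runs in production can be exercised in CI against synthetic bucket configurations.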

Generating Audit-Ready Findings

The agent's final output must be structured for two audiences simultaneously: the compliance team that needs audit evidence, and the engineering team that needs to remediate. A finding should contain:

  • Control reference: Which framework control this maps to (SOC 2 CC6.1, ISO 27001 A.9.1.2, NIST 800-53 IA-2)
  • Test methodology: Which tool functions were called and in what sequence
  • Evidence records: References to the immutable log entries for each API response, with timestamps
  • Finding classification: Pass, Fail, or Exception — with the specific condition that triggered the classification
  • Affected entities: Specific users, systems, or resources that are out of compliance, not just counts
  • Remediation guidance: Specific steps to resolve the finding, appropriate to the platform
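The fields above map naturally onto a structured type that the agent populates and downstream systems consume. A minimal sketch (field names are illustrative):

```python
from dataclasses import dataclass, field
from enum import Enum

class FindingStatus(Enum):
    PASS = "Pass"
    FAIL = "Fail"
    EXCEPTION = "Exception"
    UNABLE_TO_TEST = "Unable to Test"

@dataclass
class Finding:
    control_reference: str           # e.g. "SOC 2 CC6.1"
    test_methodology: list           # Tool functions called, in order
    evidence_ids: list               # References to immutable audit-log records
    status: FindingStatus
    trigger_condition: str           # What specifically caused the classification
    affected_entities: list = field(default_factory=list)
    remediation_guidance: str = ""
```

A fixed schema like this also makes the LLM's output parseable: the agent is instructed to emit JSON matching these fields, and `parse_finding` rejects anything that does not.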

The LLM system prompt for a compliance agent should explicitly instruct the model to cite specific evidence records rather than generalizing, to distinguish between what the data shows and what it cannot determine, and to flag any cases where tool results were ambiguous or incomplete. This is where the compliance context makes prompt engineering critical — you need the model to be conservative and precise, not creative.
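A system prompt embodying those instructions might look like the following. The wording is illustrative, not a tested prompt; in practice it would be tuned against your own control library:

```python
# Illustrative system prompt for the compliance agent
COMPLIANCE_AGENT_SYSTEM_PROMPT = """\
You are a compliance control testing agent. Rules:

1. Base every claim on a specific tool result in this conversation.
   Never assert a control's state from general knowledge.
2. Cite the specific evidence record for each claim; do not generalize.
3. Distinguish clearly between what the data shows and what it cannot
   determine. If a tool returned an error or partial data, classify
   the control as "Unable to Test" and explain why.
4. Be conservative: when results are ambiguous or incomplete, flag the
   ambiguity instead of choosing the more favorable interpretation.
5. Output the finding as JSON with keys: control_reference, status,
   trigger_condition, affected_entities, evidence_ids,
   remediation_guidance.
"""
```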

The Hallucination Problem in Compliance Context

Hallucination risk in compliance agents is not symmetric. A false negative (agent reports a control as passing when it is actually failing) is worse than a false positive (agent flags a control that is actually passing). A false negative can lead your organization to attest to an auditor that a control is effective when it is not. That is a compliance failure, potentially a material misrepresentation.

The architectural safeguards that matter most:

  • The agent never makes compliance conclusions without a tool result. The system prompt must prohibit the LLM from generating a finding based on general knowledge — every claim must be grounded in a specific API response that is in the context window. Implement this by checking that the final finding cites at least one evidence record before accepting it.
  • Tool results are always deterministic. The tools return real API data. The LLM interprets them but cannot manufacture them. If an API call fails, the tool returns an error, and the agent must report the control as "Unable to Test" — not assume it passes.
  • Human review gates for findings that will be used in audit submissions. Agent output should flow into a review queue where a compliance analyst reviews and approves findings before they are included in formal audit evidence packages. The agent reduces the work from evidence collection to evidence review — not from review to automatic submission.
  • Temperature set to 0. Compliance agents should use deterministic (zero temperature) model settings. Variability in reasoning is the enemy of consistent compliance testing.
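The first safeguard, rejecting findings that cite no evidence, can be enforced mechanically before any finding leaves the agent. A sketch, assuming findings arrive as dicts parsed from the agent's JSON output:

```python
VALID_STATUSES = {"Pass", "Fail", "Exception", "Unable to Test"}

def validate_finding(finding: dict, known_evidence_ids: set) -> list:
    """Returns a list of validation errors; empty means the finding is accepted."""
    errors = []
    if finding.get("status") not in VALID_STATUSES:
        errors.append(f"invalid status: {finding.get('status')!r}")
    cited = finding.get("evidence_ids", [])
    if not cited:
        errors.append("finding cites no evidence records")
    # Every citation must point at a record the audit log actually holds;
    # a fabricated evidence ID is treated the same as no evidence
    unknown = [e for e in cited if e not in known_evidence_ids]
    if unknown:
        errors.append(f"unknown evidence ids: {unknown}")
    if finding.get("status") == "Fail" and not finding.get("affected_entities"):
        errors.append("failing finding lists no affected entities")
    return errors
```

Findings that fail validation are discarded and the control is re-queued or marked "Unable to Test", never silently accepted.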

Scheduling and Continuous Compliance

One of the most powerful capabilities of an automated compliance agent is continuous testing. Instead of point-in-time annual evidence collection, controls can be tested daily or weekly. The output feeds a compliance dashboard that shows control health over time — a trend line rather than a snapshot.

Implement this as a scheduled job (Lambda, Azure Functions, or a Kubernetes CronJob) that runs each control test on a defined cadence and writes results to a time-series store. When a control's status changes from Pass to Fail, trigger an immediate alert to the compliance and security teams. This converts compliance from a reactive audit exercise to a proactive monitoring function.
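The Pass-to-Fail alerting described above reduces to a comparison of the current run against the previous one. A sketch, assuming each run is stored as a mapping of control ID to status string:

```python
def detect_status_changes(previous: dict, current: dict) -> list:
    """
    Compares two runs of control results (control_id -> "Pass"/"Fail"/...)
    and returns the transitions worth alerting on.
    """
    alerts = []
    for control_id, status in current.items():
        prior = previous.get(control_id)
        if prior == "Pass" and status == "Fail":
            # Regression: a previously healthy control just broke
            alerts.append({"control_id": control_id,
                           "transition": "Pass->Fail", "severity": "high"})
        elif prior is None and status == "Fail":
            # A newly onboarded control is already failing
            alerts.append({"control_id": control_id,
                           "transition": "new->Fail", "severity": "medium"})
    return alerts
```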

For most organizations, implementing even a subset of automated controls — MFA coverage, backup validation, privileged access review completion, encryption settings — eliminates 40-60% of the manual evidence collection effort while providing continuous visibility that annual audits never could.

Audit Trail Requirements

Everything the agent does must be logged in a way that is useful for audit purposes. Every tool call must be recorded with: the function name, the input parameters (sanitized of secrets), the raw response, and a timestamp. Every LLM call must be recorded with: the full prompt, the model response, and any tool calls triggered. These records must be stored immutably — in append-only storage, with write access restricted to the agent's service identity and read access available to auditors.
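Tamper-evidence can also be built into the record format itself by chaining each entry's hash to its predecessor, so any retroactive edit breaks the chain. This complements, not replaces, storage-level immutability such as append-only tables or object-lock storage. A sketch:

```python
import hashlib
import json
from datetime import datetime

def make_audit_record(prev_hash: str, tool_name: str,
                      parameters: dict, response: dict) -> dict:
    """Builds an audit-log entry whose hash covers the previous entry's hash."""
    record = {
        "timestamp": datetime.utcnow().isoformat(),
        "tool_name": tool_name,
        "parameters": parameters,   # Caller must sanitize secrets first
        "response": response,
        "prev_hash": prev_hash,
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["hash"] = hashlib.sha256(payload).hexdigest()
    return record

def verify_chain(records: list) -> bool:
    """Recomputes every hash; returns False if any record was altered."""
    prev = "genesis"
    for record in records:
        if record["prev_hash"] != prev:
            return False
        body = {k: v for k, v in record.items() if k != "hash"}
        payload = json.dumps(body, sort_keys=True).encode()
        if hashlib.sha256(payload).hexdigest() != record["hash"]:
            return False
        prev = record["hash"]
    return True
```

Running `verify_chain` as part of evidence-package assembly gives the auditor a cheap integrity check over the whole trail.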

When an auditor asks "How do you know MFA was enforced on March 15th?" you should be able to produce a JSON record showing the specific Microsoft Graph API call made at 03:00 UTC on March 15th, the response listing all users and their MFA registration status, and the finding classification derived from that response. That is audit-ready evidence. A spreadsheet saying "Compliance team checked MFA on March 15th" is not.

AI agents do not replace compliance judgment. They replace compliance drudgery — the repetitive, API-driven, evidence-collection work that consumes skilled analyst hours without requiring skilled analyst judgment. Build the agent to handle that layer, build the human review gates to handle the judgment layer, and you have a compliance program that is both more efficient and more thorough than the purely manual alternative.

Ready to automate your compliance control testing?

We design and build AI-powered compliance automation for organizations seeking continuous control visibility — from architecture and tool function development to audit trail design and human review workflows. Book a session with our team.