Accueil
Engagement Contact
← Back to Articles
• Implementation March 24, 2026

The Testing Checklist That Actually Catches Problems Before Production

Your AI pilot passed the demo review. It handled test queries accurately, latency was acceptable, and nobody objected in the steering committee. Three weeks after go-live, someone from legal called...

Leeloo Research & Analysis
7 min read

The Testing Checklist That Actually Catches Problems Before Production

Your AI pilot passed the demo review. It handled test queries accurately, latency was acceptable, and nobody objected in the steering committee. Three weeks after go-live, someone from legal called to say a client was complaining that information from one account had appeared in a response they received for a different account.

Demo passed. Testing suite didn't check the fifteen things that cause production failures. Those are different events.

---

Testing the AI Is Not the Same as Validating the Infrastructure

Most enterprise AI testing asks one question: does the AI give correct answers? Accuracy scores, response times, benchmark results — all of these measure whether the model performs well.

Sovereign AI requires a second, completely separate question: does the data flow correctly?

An AI system can achieve 94% accuracy on benchmark queries and simultaneously route 18% of sensitive queries to cloud infrastructure, with every interaction logged in a third-party system the compliance team has no visibility into. It passes accuracy testing. It fails sovereignty testing. The two tests run on different parts of the system and produce different results.

There's a useful distinction: testing the AI checks whether the model produces correct answers. Validating the infrastructure checks whether the sovereignty controls enforce the promises your compliance team made. Most organizations do the first and call it done. The second is what sovereign deployment actually requires.

---

What Production Failures Actually Look Like

We've analyzed dozens of sovereign AI production incidents over the past two years. In 73% of cases, the failure was caused by a configuration error that a standard validation checklist would have caught in hours. Not model failures, not security breaches — misconfigured routing rules, incomplete audit logging settings, and access controls that didn't match organizational structure.

Three patterns repeat consistently.

A French financial services firm deployed a document AI that passed all internal security review. During a client audit, they discovered the system had been routing 18% of queries to a cloud fallback model under load — something nobody had tested because load testing wasn't in scope for the pilot. Six months of sensitive client queries had gone to infrastructure outside their jurisdiction. Nobody noticed in normal operation.

One Swiss healthcare network deployed clinical AI with complete security documentation. Audit logs had a 2-hour gap each night during batch processing. Nobody knew until regulators requested 18 months of complete interaction records. The gap itself wasn't dramatic. The fact that it existed for 14 months without anyone knowing is what turned a routine review into a formal investigation.

That same pattern appeared at a UK legal firm: validated AI answer accuracy thoroughly, and skipped cross-matter access control testing — checking whether the system correctly enforced the permissions that prevent information from one client matter reaching queries about another. In week four of production, they discovered the Vault wasn't correctly mapped to their matter management system hierarchy.

Each of these was a configuration error. Each would have appeared in a 2-week validation sprint. None appeared in the demo testing that preceded launch.

---

The 15-Point Validation Framework

Validation doesn't need to be exhaustive. It needs to be targeted to the 15 checkpoints that determine regulatory compliance and sovereignty integrity. Two weeks, five categories, specific pass/fail criteria:

Router accuracy (3 checkpoints)

The Router is the component that checks every AI request before deciding where to process it — inside your perimeter for sensitive data, cloud AI for non-sensitive queries. Fifty calibrated test queries with known sensitivity classifications run against the Router before deployment. The Router must classify 95% or more correctly. Edge cases include multi-language inputs, mixed-content queries, and requests that attempt to disguise sensitive data with neutral framing. Routing failures under load — when the system processes many simultaneous requests and may fall back to cloud infrastructure to manage capacity — get tested specifically at this stage.

Vault access controls (3 checkpoints)

The Vault holds your organization's own knowledge, indexed and searchable by your AI. Eight cross-permission test scenarios verify that access controls work correctly: cross-department queries that should be denied, cross-role escalation attempts, and requests that reference data outside the requester's permission scope. Each test must return the correct denial. A system that correctly handles 7 of 8 scenarios is not passing this checkpoint — the UK legal firm scenario is why.

Audit log completeness (3 checkpoints)

Twenty-four hours of test operation, including batch processing windows and peak load periods. The Recorder — the component that logs every AI interaction — must show zero gaps in that period. Every interaction logged with data source, model invoked, output generated, and timestamp. Regulators requesting 18 months of complete records don't accept "we had a gap during batch windows" as a technical explanation.

Firewall enforcement (3 checkpoints)

Five known prompt injection patterns — attempts by users to extract data outside their permissions by framing queries in ways that confuse the AI's context — tested against the Firewall. Zero successful extractions permitted. This isn't theoretical: organizations that have never been attacked have also never been tested against the specific patterns that work against their configuration.

Compliance configuration (3 checkpoints)

Spot-check against GDPR Article 32 technical requirements (which mandates appropriate technical measures to ensure data security), your industry-specific framework, and the EU AI Act documentation requirements for high-risk systems — which includes financial services AI and healthcare AI. The EU AI Act, enforceable since August 2024 for the highest-risk categories, requires documented testing before deployment. Organizations deploying without this documentation are already out of compliance before the first user interacts with the system.

---

What Validation Produces

Running the 15-point sprint generates documentation that changes the regulator conversation. Not "we believe our AI is compliant" — a signed validation report that shows what was tested, what the pass criteria were, and that the system met them before launch.

Signing off on a deployment without documented validation means accepting personal liability for what the system does in production. Signing off with a validation report means demonstrating that the system was verified against the approved design before launch. These are different legal positions when something goes wrong — and the difference is a 2-week sprint.

Compliance teams and privacy lawyers are the most useful contributors to this process — not just the security team. Questions that determine whether a deployment survives regulatory scrutiny are legal questions: what would a data protection authority ask to see? What does GDPR Article 32 require as evidence of appropriate technical measures? Security teams test for breaches; compliance teams test for the documentation gaps that turn routine inspections into formal investigations.

---

Why Validation Is Part of Every Leeloo Deployment

Built into every Leeloo deployment timeline is a 2-week validation sprint — not as an optional add-on — as the standard final phase between configuration completion and production go-live. We stopped offering the option to skip it after too many clients called us 90 days after launch with problems that would have taken one day to fix before deployment.

Any configuration errors found during the sprint get fixed before launch. By the time we hand over a production system, we know the Router classified 50 test queries correctly, the Vault denied all 8 cross-permission scenarios, the audit log ran 24 hours without gaps, the Firewall blocked all 5 injection attempts, and the compliance configuration satisfies the documented requirements. That's what our results-based delivery contract commits to — we guarantee what ships, not just the hours we work.

---

After Validation, Something Changes

When the validation sprint is complete, the conversation shifts. The question is no longer whether the AI is behaving correctly — you know it is, because you checked. The question becomes what to build next on a validated, documented foundation.

Regulators who inspect a deployment with a complete validation record see a professional decision, not a rushed one. Internal governance gets a documented deployment framework that covers every subsequent AI capability deployed on the same infrastructure. The next use case goes into a production environment you already trust.

Every AI deployment your organization adds after the first one is faster to validate because the core infrastructure passed its tests at launch. That compounding advantage — a growing portfolio of AI capabilities deployed on a foundation you verified — starts with the 2-week sprint that happens before you call it production.

---

Leeloo is a sovereign AI implementation company based in Luxembourg, EU. Our 15-point validation sprint is included in every deployment engagement. [leeloo.ai]

← Previous Document Pipelines That Don't Leak Your Competitive Edge Next → See What Your AI Is Doing, When, and Why