45% of AI-Generated Code Ships With a Vulnerability: The Research
AI-generated code security, by the numbers. Four studies (Veracode, Carnegie Mellon, Escape.tech, and Tenzai) on how often AI code ships with a vulnerability.
You let an AI write your app's code, and somewhere in the back of your mind is the obvious question: how often does that code come out insecure? Not in theory — in measured, published numbers. You'd like to know whether you're worrying over nothing or sitting on a real risk.
Here's the short version, and it's worth sitting with: when researchers tested AI-generated code security at scale, roughly 45% of code samples failed security tests. Nearly half. This post walks through the four most important studies, exactly what each measured, and what each one means for a non-developer who recently shipped an app.
⚡ TL;DR
- Veracode tested 100+ AI models across 80 tasks and found 45% of code samples introduced a known vulnerability class. Their CTO: the models "make the wrong choices nearly half the time, and it's not improving."
- Carnegie Mellon found that the best AI setup it tested produced working code 61% of the time but secure code only 10.5% of the time, so over 80% of the code that worked was still vulnerable.
- Real-world scans agree: Escape.tech found thousands of vulnerabilities across 5,600 vibe-coded apps; Tenzai found SSRF in every AI tool it tested.
- Working and secure are nearly independent. Your app running fine tells you almost nothing about whether it's safe.
Veracode: 45% of AI code fails security tests
The headline number comes from the Veracode 2025 GenAI Code Security Report. It's the largest study of its kind, and the cleanest to summarize.
Veracode tested over 100 large language models — the AI systems behind tools like Cursor, Lovable, and the rest — across 80 distinct coding tasks. For each, they checked whether the generated code introduced a vulnerability from the OWASP Top 10, the security industry's standard list of the most common and serious web application flaws.
The result: 45% of the code samples failed. They shipped a real, known-category vulnerability.
The breakdown is where it gets pointed:
- Java was the worst language tested, failing 72% of the time.
- Cross-site scripting (XSS) — the flaw class CWE-80, where an attacker injects malicious code that runs in your users' browsers — failed in 86% of relevant tests.
- Log injection (CWE-117), where unsanitized input poisons your logs, failed in 88%.
And the conclusion that should stick, from Veracode CTO Jens Wessling: "Our research reveals GenAI models make the wrong choices nearly half the time, and it's not improving."
That last clause matters. The intuition that newer, smarter models must be getting safer doesn't hold. Capability is climbing; security is not coming along for the ride.
What it means for you: if your app does anything with user input (and every app does), the measured base rate of a vulnerability slipping into a security-relevant piece of code is close to a coin flip. Not because your tool is bad, but because this is how AI-generated code behaves across the board.
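To make the two worst-scoring categories above concrete, here is a minimal Python sketch of the usual defenses. It is not taken from the Veracode report: the function names `render_greeting`, `safe_log_value`, and `log_failed_login` are hypothetical, and it assumes HTML is built by hand and logging goes through the standard library. Most template engines do this escaping for you when auto-escaping is left on.

```python
import html
import logging

logger = logging.getLogger("app")


def render_greeting(username: str) -> str:
    """Escape user input before placing it in HTML (guards against XSS, CWE-80)."""
    return f"<p>Hello, {html.escape(username)}!</p>"


def safe_log_value(value: str) -> str:
    """Neutralize CR/LF so user input cannot forge extra log lines (CWE-117)."""
    return value.replace("\r", "\\r").replace("\n", "\\n")


def log_failed_login(username: str) -> None:
    """Log a failed login with the username made safe for a single log line."""
    logger.warning("failed login for user=%s", safe_log_value(username))
```

The point of both helpers is the same: user input is treated as data at the moment it crosses into HTML or a log file, never as markup or as a new log record.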
Carnegie Mellon's SusVibes: working ≠ secure
The Veracode number tells you how often vulnerabilities appear. Carnegie Mellon's SusVibes benchmark tells you something subtler and arguably more important: that secure and working are two different things, and AI is optimizing for the wrong one.
SusVibes measured AI coding setups on two separate axes — does the code function correctly, and is it secure. The best combination it tested hit a 61% functional pass rate but only a 10.5% security pass rate.
Read those two numbers together and you get the real finding: over 80% of the solutions that worked correctly still contained a vulnerability. The code did the job. It also left a hole.
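The "over 80%" figure follows from the two pass rates with one line of arithmetic. The sketch below assumes the 10.5% security pass rate counts solutions that were both working and secure, measured over the same set of tasks as the 61%; that reading is an assumption here, not a quote from the paper.

```python
functional_pass = 0.61  # share of tasks where the generated code worked
secure_pass = 0.105     # share of tasks where it worked AND was secure (assumed definition)

# Among the solutions that worked, the fraction that still had a vulnerability.
vulnerable_given_working = 1 - secure_pass / functional_pass

print(f"{vulnerable_given_working:.1%}")  # prints 82.8%
```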
This is the single most useful idea in all of this research, so it's worth stating plainly: an app running perfectly tells you almost nothing about whether it's secure. The two are nearly independent. Your preview works, your users can sign up, everything looks done, and the security posture underneath is a separate question that "it works" never answers. We dig into why the AI behaves this way in "Why AI optimizes for 'it works,' not 'it's secure'."
What it means for you: stop treating a working app as a finished app. Functionality and security are different checkboxes, and the AI only ticks the first one.
Escape.tech: what it looks like in the wild
Lab benchmarks are one thing; real deployed apps are another. Escape.tech scanned 5,600 vibe-coded apps in the wild and found the same kinds of flaws the lab studies predict.
Across those apps:
- 2,000+ vulnerabilities.
- 400+ exposed secrets — API keys and credentials sitting where they shouldn't be.
- 175 instances of exposed PII (personally identifiable information — real user data like names, emails, and more, left accessible).
These aren't synthetic tasks. They're live applications that real people built and shipped, and the gaps are exactly the categories you'd expect: exposed keys and exposed data, at scale.
What it means for you: the benchmark failure rates aren't trapped in a lab. They show up in production apps, including ones whose founders had no idea anything was wrong until someone scanned.
Tenzai: every AI tool left the same holes
The most recent study, Tenzai (December 2025), is valuable because it tested the coding tools themselves rather than the underlying models, and it covered the tools you are likely to use.
Tenzai built 15 apps spanning Cursor, Claude Code, Replit, Devin, and OpenAI Codex. The findings were remarkably uniform:
- Every tool introduced SSRF (Server-Side Request Forgery — tricking your server into making requests it shouldn't, often to reach internal systems). Not most. Every one.
- Zero tools implemented CSRF protection (Cross-Site Request Forgery — stopping a malicious site from acting on a logged-in user's behalf).
- Zero tools set security headers — the standard HTTP settings that harden a site against common attacks.
The uniformity is the point. This isn't one weak tool dragging down an average. It's a shared blind spot. These are baseline web protections, and the entire category of AI builders skips them by default.
What it means for you: the tool you picked is not the variable. Whether you used Cursor or Claude Code or Replit, the same baseline protections are likely missing, because none of them add these on their own.
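SSRF is the least intuitive of the three gaps, so here is a rough Python sketch of the check that was missing. It is an illustration, not Tenzai's test method: the function name `is_safe_outbound_url` is hypothetical, and it assumes your server fetches URLs supplied by users (link previews, webhooks, image imports). The idea is to refuse any URL whose host is, or resolves to, a non-public address.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def _resolve(hostname: str) -> list:
    """Return the IP addresses for a host; IP literals need no DNS lookup."""
    try:
        return [ipaddress.ip_address(hostname)]
    except ValueError:
        pass  # Not an IP literal, so fall through to DNS.
    try:
        infos = socket.getaddrinfo(hostname, None)
        return [ipaddress.ip_address(info[4][0]) for info in infos]
    except (socket.gaierror, ValueError):
        return []  # Unresolvable or unparseable: treated as unsafe by the caller.


def is_safe_outbound_url(url: str) -> bool:
    """Allow only http(s) URLs whose host maps exclusively to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    addresses = _resolve(parts.hostname)
    # Loopback, private, and link-local ranges (including the cloud metadata
    # address 169.254.169.254) all report is_global == False.
    return bool(addresses) and all(address.is_global for address in addresses)
```

A check like this is a starting point rather than a complete defense. The host can resolve differently between the check and the actual request, and redirects can point somewhere else entirely, so a production setup would also pin the resolved address and re-validate each redirect.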
What all four studies agree on
Four studies, four methods — large-scale model testing, a dual-axis benchmark, a field scan of live apps, and direct tool testing. They converge on one conclusion:
AI-generated code is insecure by default, and a working app is not evidence of a secure one.
That's not a reason to stop vibe coding. It's a reason to add one step: a security pass on your deployed app, before you launch and after each deploy. The studies show the gap is real and consistent; the response is to look for it rather than assume it isn't there.
How to check your own app
The honest way to find out whether the research applies to you is to test your live app the way these researchers did — against the actual categories they found:
- Database access. Confirm your access rules (Supabase RLS, Firebase Security Rules) actually restrict who can read and write data, rather than allowing every request.
- Exposed secrets. Scan your frontend for secret keys — not the publishable ones, which are meant to ship.
- The OWASP-style flaws. XSS, SSRF, missing CSRF protection, missing security headers — the classes Veracode and Tenzai flagged. These mostly don't appear in any dashboard; you find them by testing the deployed site.
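Of the checks above, missing security headers is the easiest to test yourself. The following Python sketch fetches a page and lists which baseline headers it does not send. The header list is a commonly recommended baseline chosen for illustration, not the list any of the studies used, and the function names `missing_security_headers` and `check_site` are hypothetical.

```python
import urllib.request
from typing import List

# A commonly recommended baseline; this exact list is an assumption.
RECOMMENDED_HEADERS = (
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-Content-Type-Options",
    "X-Frame-Options",
    "Referrer-Policy",
)


def missing_security_headers(headers) -> List[str]:
    """Return the recommended headers absent from a mapping of response headers."""
    present = {name.lower() for name in headers.keys()}  # header names are case-insensitive
    return [name for name in RECOMMENDED_HEADERS if name.lower() not in present]


def check_site(url: str) -> List[str]:
    """Fetch a URL you own and list the baseline headers it does not send."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return missing_security_headers(response.headers)
```

Calling `check_site` with the address of your own deployed app returns a list of header names; an empty list means all five are present. It says nothing about whether the header values are strict enough, only whether the headers exist.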
A note on what the research does not mean
One thing these numbers do not license: panicking over every flag a scanner throws. The research is about real vulnerability classes — XSS, SSRF, exposed secret keys, missing access rules. It is not about your Supabase anon key or your Firebase web config, which are public by design and meant to live in the browser.
🐺 Not a real problem
The 45% figure covers genuine vulnerabilities. It does not mean your public-by-design keys (the Supabase anon key, the Firebase web config, a Stripe pk_live_ key) are leaks. They're identifiers, not passwords. Don't let alarming research turn into chasing the wrong things.
The right takeaway is targeted, not anxious: the gaps the studies found are real and common, so check for those — and ignore the false alarms that don't appear anywhere in the research.
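The difference between a real leak and a false alarm can be sketched in a few lines of Python. The patterns below are illustrative only (real scanners use many more), and the names `SECRET_PATTERNS` and `find_secrets` are hypothetical. The sketch flags strings that look like secrets, such as a Stripe key starting with `sk_live_`, and deliberately has no pattern for publishable `pk_live_` keys.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; this is not a complete or authoritative list.
SECRET_PATTERNS = {
    "Stripe secret key": re.compile(r"\bsk_live_[0-9a-zA-Z]{10,}"),
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "Private key block": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}


def find_secrets(bundle_text: str) -> List[Tuple[str, str]]:
    """Return (label, matched text) pairs for strings that look like real secrets."""
    findings = []
    for label, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(bundle_text):
            findings.append((label, match.group(0)))
    return findings
```

Run against the text of a frontend JavaScript bundle, an empty result is what you want. Keys that are distinguished by their contents rather than a prefix cannot be classified this way, so this sketch does not try.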
FAQ
Is AI-generated code secure?
By default, often not. The Veracode 2025 study found 45% of AI-generated code samples failed security tests, and Carnegie Mellon found over 80% of working AI solutions still contained a vulnerability. AI code can be made secure, but it doesn't start that way — security is a step you add, not something the tool guarantees.
Does using a better AI model fix the security problem?
Not on its own. Veracode's CTO noted the failure rate "is not improving" even as models get more capable. The models are optimizing for code that works, not code that's secure, and that incentive doesn't change merely because the model is smarter. The reliable fix is testing your deployed app, regardless of which model wrote it.
Which AI coding tool is the safest?
The research doesn't point to a clear winner, and that's the finding. Tenzai tested Cursor, Claude Code, Replit, Devin, and Codex and found the same baseline gaps — SSRF, missing CSRF, missing headers — across all of them. The tool matters less than whether you run a security pass on what it builds.
The bottom line
The numbers are consistent across four independent studies: roughly 45% of AI-generated code fails security tests, over 80% of working AI code is still vulnerable, and the same baseline gaps appear across every major tool. A running app is not a safe app — the two are nearly independent. None of this means stop building. It means add a security pass, and ignore the false alarms about public-by-design keys. An attacker can find your real gaps in minutes; the point is to find them first, on every deploy.