The 7 Bug Categories Every Team Shipping AI-Generated Code Will Encounter

April 29, 2026 · 9-minute read · Seth

After a year of AI-assisted development becoming the default at most engineering teams, the patterns of what AI gets wrong are becoming legible. The bugs are not random. They cluster into recognizable categories that share root causes — and once you can name the categories, you can build review processes that target them specifically.

This is a field guide to the seven categories of bug we see most often in AI-generated code. For each, we describe the pattern, why AI tools produce it, what it looks like in practice, and how to catch it before it ships.

This is not a tier list of severity. Any of these categories can cause a production incident depending on context. The order below is roughly by frequency: Category 1 is the most common, and Category 7 is the most insidious.

Category 1: The Plausible-But-Wrong API Call

The pattern: AI invokes a function, method, or library API that doesn't exist, doesn't exist with the signature it used, or no longer exists in the version of the library you have installed.

Why AI generates it: training data freshness is uneven. The model has seen many versions of many libraries and synthesizes plausible-looking calls that may not match any actual version. The newer the library or the more rapidly it changes (web frameworks, AI SDKs, cloud SDKs), the higher the rate of fabrication.

What it looks like:

# Looks reasonable. Doesn't exist.
response = openai.chat.completions.create_async(
    model="gpt-5", 
    messages=messages,
    enable_caching=True,  # ← not a real parameter
)

How to catch it: type checkers, import linters, and CI. If your project doesn't run a type checker on every PR, adding one is the highest-leverage CI change you can make this week. In a type-checked codebase, hallucinated APIs fail loudly before merge, provided the library ships type information. Otherwise they surface only at runtime, the first time the offending line executes.
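
Where a dependency ships no type information, a similar loud failure can be approximated in the test suite. The sketch below is a minimal illustration, not a substitute for a type checker. `create_completion` is a hypothetical stand-in for a real SDK function, and `check_call` and `call_is_valid` are helpers invented here. It relies on the standard library's `inspect.signature(...).bind(...)`, which validates arguments against a function's signature without executing the function.

```python
import inspect


def create_completion(model: str, messages: list) -> dict:
    """Hypothetical stand-in for a real SDK function."""
    return {"model": model, "count": len(messages)}


def check_call(func, *args, **kwargs) -> None:
    """Raise TypeError if the arguments do not fit func's signature.

    The function itself is never called, so the check is safe to run
    against functions that would otherwise hit the network.
    """
    inspect.signature(func).bind(*args, **kwargs)


def call_is_valid(func, *args, **kwargs) -> bool:
    """Boolean wrapper around check_call, convenient in assertions."""
    try:
        check_call(func, *args, **kwargs)
    except TypeError:
        return False
    return True
```

One limitation to keep in mind: a function that accepts `**kwargs` will bind any keyword, so this check cannot flag an invented parameter on such a function.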

Severity: usually low (caught by CI), occasionally high (when the call sits on a rarely exercised code path, such as an admin-only route, that runs only in production).

Category 2: The Missing Edge Case

The pattern: the code handles the happy path correctly. It does not handle the empty list, the null user, the network timeout, the partial response, the user with a special character in their name, the timezone boundary, or the timestamp that falls exactly on midnight UTC.

Why AI generates it: training data is biased toward example code that demonstrates the main case. Edge cases get added later, in production, by humans who learn what breaks. The model has seen the demo code; it has seen fewer of the bug-fix commits that hardened the demo code over time.

What it looks like:

// Works for users with names. Throws for null.
const initials = user.name.split(' ').map(p => p[0]).join('');

// Works for non-empty lists. Returns undefined for empty.
const latest = sortByDate(items)[0];

// Works when count > 0. With count === 0 the result is NaN or Infinity, which propagates silently.
const average = sum / count;

How to catch it: property-based testing, explicit null-handling in your type system (TypeScript strict mode, Rust's Option, Kotlin's nullability), and a code review checklist that asks specifically: "What happens when this input is empty, null, or malformed?" AI reviewers help here, but they share the same blind spots as the model that generated the code.
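
For comparison, here is a minimal Python sketch of the three JavaScript snippets above with the empty and null cases handled explicitly. The function names and the choice to return an empty string or `None` are assumptions made for illustration; the right fallback depends on the caller. The failure modes also differ slightly by language: dividing by zero in Python raises `ZeroDivisionError` rather than producing NaN.

```python
from typing import Optional, Sequence


def initials(name: Optional[str]) -> str:
    """Initials for a display name; empty string when there is no usable name."""
    if not name:
        return ""
    # str.split() with no argument drops the empty parts left by repeated spaces.
    return "".join(part[0] for part in name.split())


def latest(timestamps: Sequence[float]) -> Optional[float]:
    """Most recent timestamp, or None for an empty sequence."""
    return max(timestamps, default=None)


def average(values: Sequence[float]) -> Optional[float]:
    """Arithmetic mean, or None when there is nothing to average."""
    if not values:
        return None
    return sum(values) / len(values)
```

Returning `Optional` values pushes the decision about the empty case to the call site, where a type checker can insist that it is handled.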

Severity: ranges from cosmetic to catastrophic depending on which edge case lands in production.

Category 3: The Authorization Hole

The pattern: a new endpoint, function, or database query handles its primary responsibility correctly but skips the authorization check that should have been there. The model wrote what was asked for; it did not add the defensive check that wasn't.

Why AI generates it: the prompt asks for the feature. Authorization is implicit context — every senior engineer adds it reflexively, but it isn't usually stated. AI tools optimize for what was asked, not for what a senior engineer would reflexively add.

What it looks like:

@router.get("/users/{user_id}/exports")
async def get_exports(user_id: str):
    # Returns any user's exports to any caller.
    # Auth check was assumed, not added.
    return await db.fetch_exports(user_id=user_id)

How to catch it: authentication and authorization should be enforced at the framework level (middleware, decorators, route registration) rather than as something each endpoint remembers to add. Make missing auth a compile or registration error, not a code review observation. We covered specific patterns in the AI code security checklist.
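
As a rough sketch of what "a registration error" could look like, the toy router below refuses to register a route that does not declare a policy. Everything here is hypothetical and framework-agnostic: `Router`, `owner_only`, and the `(caller_id, user_id)` calling convention are invented for illustration and are not the API of any real web framework.

```python
from typing import Callable, Dict, Optional

# A policy decides whether caller_id may act on target_user_id.
Policy = Callable[[str, str], bool]


def owner_only(caller_id: str, target_user_id: str) -> bool:
    """Allow access only to the user who owns the resource."""
    return caller_id == target_user_id


class Router:
    """Hypothetical mini-router: every route must declare a policy."""

    def __init__(self) -> None:
        self.routes: Dict[str, Callable[[str, str], object]] = {}

    def get(self, path: str, policy: Optional[Policy] = None):
        if policy is None:
            # Fails when the module is imported, long before a request arrives.
            raise TypeError(f"route {path!r} registered without an auth policy")

        def decorator(handler: Callable[[str], object]):
            def guarded(caller_id: str, user_id: str):
                # The policy runs before the handler on every call.
                if not policy(caller_id, user_id):
                    raise PermissionError(f"{caller_id!r} may not access {path!r}")
                return handler(user_id)

            self.routes[path] = guarded
            return guarded

        return decorator


router = Router()


@router.get("/users/{user_id}/exports", policy=owner_only)
def get_exports(user_id: str):
    return [f"export-for-{user_id}"]
```

With this shape, a genuinely public endpoint has to say so by passing an explicit allow-all policy, which makes the decision visible in the diff.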

Severity: high to severe. This is the category that produces the most consequential incidents.

Category 4: The Lost Tenant Filter

The pattern: in multi-tenant systems, a query that originally filtered by tenant or organization loses that filter during refactoring. The new version compiles, runs, and returns everyone's data.

Why AI generates it: refactoring requests ("make this query reusable," "extract this into a function") don't carry the implicit constraint that tenant isolation must be preserved. The model focuses on the structural change and the security-critical filter quietly disappears.

What it looks like:

// Before — tenant filter present
const orders = await db.query(
  'SELECT * FROM orders WHERE tenant_id = $1 AND status = $2',
  [tenantId, status]
);

// After "refactor for reusability" — tenant filter gone
const orders = await getOrdersByStatus(status);  // ← reads all tenants

How to catch it: explicit row-level security in your database (Postgres RLS, similar features in other DBs), so that even buggy application code can't return other tenants' data. Belt and suspenders. If RLS isn't feasible, a static analyzer that flags queries without an explicit tenant filter is worth writing.

Severity: severe. Cross-tenant data leakage is among the worst classes of production bug because it's hard to detect, hard to scope, and triggers compliance and notification obligations.

Category 5: The Wrong Concurrency Model

The pattern: code that works correctly in single-threaded testing but fails under concurrent load. Race conditions in counter updates. Lost writes in last-write-wins scenarios. Deadlocks from inconsistent lock ordering. Database transactions that should be serializable but aren't.

Why AI generates it: concurrency is hard for humans, and the human-written training data the model learned from is full of subtle concurrency bugs that worked in the original context but don't generalize. The model reproduces the patterns.

What it looks like:

# Race condition: two concurrent calls can read the same count.
async def increment_quota(user_id: str):
    current = await db.fetch_quota(user_id)
    new_value = current + 1
    await db.update_quota(user_id, new_value)

How to catch it: write integration tests that simulate concurrent load against any function that touches shared state. For database operations, use the database's atomic operations (UPDATE ... SET count = count + 1) rather than read-modify-write patterns. In code review, ask explicitly: "What happens if this function is called twice in parallel?"
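
The lost update is easy to reproduce in a test. The sketch below is a simulation, not a real database: `FakeQuotaStore` is a hypothetical in-memory stand-in whose methods each yield to the event loop once, the way a network round trip would. Its `increment_quota` method models an atomic `UPDATE ... SET count = count + 1` by doing the read and the write in a single step.

```python
import asyncio


class FakeQuotaStore:
    """In-memory stand-in for a database client (hypothetical).

    Each method yields to the event loop once, as a real network
    round trip would, so other tasks can interleave between calls.
    """

    def __init__(self) -> None:
        self._quota: dict[str, int] = {}

    async def fetch_quota(self, user_id: str) -> int:
        await asyncio.sleep(0)
        return self._quota.get(user_id, 0)

    async def update_quota(self, user_id: str, value: int) -> None:
        await asyncio.sleep(0)
        self._quota[user_id] = value

    async def increment_quota(self, user_id: str) -> None:
        # Models an atomic UPDATE: the read and the write happen in one
        # step, with no await in between.
        await asyncio.sleep(0)
        self._quota[user_id] = self._quota.get(user_id, 0) + 1


async def increment_racy(db: FakeQuotaStore, user_id: str) -> None:
    # Read-modify-write: other tasks can run between the two awaits.
    current = await db.fetch_quota(user_id)
    await db.update_quota(user_id, current + 1)


async def increment_atomic(db: FakeQuotaStore, user_id: str) -> None:
    await db.increment_quota(user_id)


async def final_count(increment, calls: int) -> int:
    """Run `calls` concurrent increments for one user and return the result."""
    db = FakeQuotaStore()
    await asyncio.gather(*(increment(db, "u1") for _ in range(calls)))
    return await db.fetch_quota("u1")
```

A test can then assert that `final_count(increment_atomic, 50)` equals 50 and that the racy version loses updates under the same load.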

Severity: variable. Usually invisible at low traffic; catastrophic at scale. The bugs that manifest only under load are the ones that take production down at the worst possible time.

Category 6: The Outdated Dependency

The pattern: AI-generated code installs a library version that's months or years old, sometimes with known CVEs. Or it pulls in a library that has been superseded, deprecated, or abandoned.

Why AI generates it: the model's snapshot of npm, PyPI, or crates.io is months behind. It will confidently recommend package@1.4.2 when the current secure version is 1.4.7, or suggest a library that's been deprecated in favor of a fork.

What it looks like:

"dependencies": {
  "lodash": "4.17.15",        // ← CVE-2020-8203
  "node-fetch": "2.6.0",      // ← old version with published security advisories
  "request": "^2.88.2"        // ← deprecated since 2020
}

How to catch it: Dependabot, Renovate, or your platform's equivalent to keep versions current, plus npm audit / pip-audit / cargo audit as CI gates on every PR. Block merges on high-severity vulnerabilities. The tooling for this category is mature; the only excuse for not having it is not having configured it.
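
The merge-blocking policy itself is only a few lines once the audit output has been normalized. The sketch below assumes a hypothetical `Finding` record and a four-level severity scale; neither is the output format of any particular audit tool, so a real gate would first map the tool's report into this shape.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Ordered from least to most severe. The labels are an assumption and
# should be mapped from whatever the audit tool actually reports.
SEVERITY_ORDER = ("low", "moderate", "high", "critical")


@dataclass(frozen=True)
class Finding:
    package: str
    installed_version: str
    severity: str


def blocking_findings(
    findings: Iterable[Finding], threshold: str = "high"
) -> List[Finding]:
    """Return the findings at or above the threshold severity."""
    limit = SEVERITY_ORDER.index(threshold)
    return [f for f in findings if SEVERITY_ORDER.index(f.severity) >= limit]


def exit_code(findings: Iterable[Finding], threshold: str = "high") -> int:
    """Non-zero when the merge should be blocked, for use as a CI exit status."""
    return 1 if blocking_findings(findings, threshold) else 0
```

Keeping the threshold as a parameter lets a team start by blocking only on critical findings and tighten the gate later without touching the rest of the pipeline.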

Severity: ranges from low (the vulnerability doesn't apply to your usage) to severe (the vulnerability is your application's exact pattern of use).

Category 7: The Architectural Drift

The pattern: no single PR is wrong. Each one is locally reasonable. But the aggregate of fifty AI-generated PRs over three months produces a codebase that has three different ways to do the same thing, four different naming conventions, two parallel sets of abstractions for the same domain concept, and a steadily growing tangle of dead code.

Why AI generates it: AI tools optimize for the local context of a single PR. They have no memory of architectural decisions made across PRs unless that context is in the codebase's documentation. Each PR is a fresh attempt to solve a problem in a plausibly correct way; consistency with prior decisions is incidental.

What it looks like: you don't notice. That's the problem. The codebase slowly stops looking like a coherent system and starts looking like a collage. Onboarding new engineers takes longer. Bugs become harder to localize because the same logic exists in three places that no longer match.

How to catch it: quarterly architecture reviews led by a senior engineer (internal or external) who reads the diff of the last three months as a single change. This is the only defense, and it cannot be automated. The output of the review should be a list of consolidation tasks and a refreshed set of architectural decisions to document.

Severity: slow-moving, often ignored, eventually crippling. The category that doesn't cause a production incident but does cause a velocity collapse 18 months in.

What to do with this list

Three things:

  1. Use it as a code review checklist for AI-generated PRs. When reviewing a PR, ask explicitly: which of these seven categories could apply here? It primes attention better than generic review.

  2. Build process around the categories your tools don't cover. Categories 1 and 6 are well-handled by existing automation. Categories 2 through 5 require human attention. Category 7 requires periodic deliberate review. If your current process doesn't address each one, it has a hole.

  3. Stop trusting "AI reviewers" to catch AI authors. As we covered in Why AI Code Review Tools Can't Replace Senior Engineers, AI tools share blind spots with AI authors. Categories 2, 3, 4, 5, and 7 need human attention.

If your team doesn't have the senior bandwidth for this — or you'd rather your senior engineers be building product than reviewing PRs — Fairy provides on-demand staff-level review specifically targeted at these categories. 24-hour turnaround, fixed price, sign-off with accountability.

This piece is intended as a foundation we'll revise quarterly with real review data as Fairy's review volume grows. If your team has encountered a category of AI-generated bug that doesn't fit here, we'd genuinely like to hear about it.

