This tiny piece of math prevents perfect coding agents

Contrary to popular belief, coding hasn't been solved yet, and it might never be, even with the advent of general intelligence and superintelligence.

May 21, 2026

We have all been using coding agents more and more. We have detailed SKILL.md files, context management systems, memory layers and so on. I’m not here to tell you AI is bad or that it won’t replace jobs. That ship has sailed, and frankly, if we’re not using these tools in 2026, we’re handicapping ourselves.

A scheduler that works perfectly, or does it?

Consider a chore scheduler. We have three roommates and we need to rotate chores fairly, which means we first have to decide what counts as “fair”. It’s the kind of thing any of us might ask an AI agent to build.

The rules are simple:

No one repeats the same chore consecutively
People with fewer total assignments get higher priority
Recently assigned people get a cooldown
Some people are unavailable on certain days

Here’s the Rust code for the core assignment logic:

pub fn assign(&mut self, people: &[Person], chore: &str, day: &str) -> Option<Person> {
    let last_person_for_chore = self.history.iter().rev()
        .find(|a| a.chore == chore)
        .map(|a| a.person.clone());

    let mut candidates: Vec<Person> = people.iter()
        .filter(|p| {
            !self.unavailable.get(*p)
                .map(|days| days.contains(&day.to_string()))
                .unwrap_or(false)
        })
        .cloned()
        .collect();

    candidates.sort_by_key(|p| {
        (
            self.cooldowns.get(p).copied().unwrap_or(0),
            self.completed.get(p).copied().unwrap_or(0),
        )
    });

    candidates.into_iter()
        .find(|p| Some(p.clone()) != last_person_for_chore)
}

By most reasonable standards, the code is clean, well structured, and follows best practices. There are 12 tests covering basic assignment, consecutive repeat prevention, unavailability, cooldown mechanics, workload balancing, history bounds, and edge cases. All pass. Green across the board.

Verify it yourself:
The full codebase is at github.com/ronniebasak/chore-scheduler-rice-theorem.
git clone git@github.com:ronniebasak/chore-scheduler-rice-theorem.git
cd chore-scheduler-rice-theorem

# All 12 tests pass
cargo test

# Run the 9-week simulation and see the unfair distribution
cargo run

Now here’s the output after 9 weeks of simulation:

=== Final Workload Distribution ===
  Alice: 28 assignments (44.4%)
  Bob: 26 assignments (41.3%)
  Cara: 9 assignments (14.3%)

  Ideal (perfectly fair): 21.0 each
  Actual spread: 9 to 28 (delta: 19)

  ⚠️  UNFAIR: The spread of 19 exceeds reasonable bounds.

Alice is doing roughly three times as much work as Cara. The scheduler follows every single rule it was given. Every test passes. The code is correct by every measurable standard that most AI agents or beginner-level developers would think to check. And it produces a deeply unfair outcome.
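The numbers in that report are easy to recompute by hand, and doing so makes it clear that the warning comes from looking at the outcome rather than at any single rule. Here is a minimal sketch of that arithmetic. The `Summary` struct and `summarize` function are names made up for this illustration and are not taken from the repository.

```rust
/// Summary of a workload distribution (illustrative type, not from the repo).
#[derive(Debug, PartialEq)]
pub struct Summary {
    /// Each person's share of all assignments, in percent, rounded to one decimal.
    pub shares_pct: Vec<f64>,
    /// Assignments per person if the load were split perfectly evenly.
    pub ideal: f64,
    /// Difference between the busiest and the least busy person.
    pub spread: u32,
}

/// Computes the summary for a non-empty slice of counts with at least one assignment.
pub fn summarize(counts: &[u32]) -> Summary {
    let total: u32 = counts.iter().sum();
    let shares_pct = counts
        .iter()
        // Scale to tenths of a percent, round, then scale back to percent.
        .map(|&c| (f64::from(c) * 1000.0 / f64::from(total)).round() / 10.0)
        .collect();
    let ideal = f64::from(total) / counts.len() as f64;
    let max = counts.iter().copied().max().unwrap_or(0);
    let min = counts.iter().copied().min().unwrap_or(0);
    Summary { shares_pct, ideal, spread: max - min }
}
```

Calling `summarize(&[28, 26, 9])` gives shares of 44.4, 41.3 and 14.3 percent, an ideal of 21.0 and a spread of 19, which matches the report above.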

Rice’s theorem, in plain terms

In 1953, Henry Gordon Rice proved something that sounds almost too strong to be true:

No algorithm can decide any non-trivial semantic property of programs.

Let me unpack that. A “semantic property” is something about what a program does, not what it looks like. “Does this program halt?” is semantic. “Does this program produce fair outputs?” is semantic. “Does this variable name start with a lowercase letter?” is syntactic (and trivially checkable).

“Non-trivial” just means it’s true for some programs and false for others. “Is this scheduler fair?” qualifies, because some schedulers are fair and some aren’t.

Rice’s theorem says: for any such property, there is no general algorithm that can look at arbitrary source code and reliably tell you whether that property holds. Not “it’s hard.” Not “we haven’t found one yet.” It’s provably, mathematically impossible.

That doesn’t mean we should stop doing code reviews, AI-powered or otherwise. The more reviews a piece of code gets, the more confident we can be that it is correct. We just can’t claim it is perfect.
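The impossibility claim rests on a reduction from the halting problem, and the shape of that reduction can be sketched in a few lines. The idea: wrap an arbitrary program in front of an obviously fair scheduler. The wrapper yields a fair schedule exactly when the wrapped program halts, so a general “is this fair?” decider would also decide halting. Everything below (`Step`, `halts_within`, `gadget`, the two toy programs) is hypothetical and invented for illustration. The `fuel` budget exists only so the example terminates; the real argument has no such budget.

```rust
/// A toy "program": one step on a u64 state. `None` means the program has halted.
pub type Step = fn(u64) -> Option<u64>;

/// Collatz iteration: halts when the state reaches 1.
pub fn collatz(n: u64) -> Option<u64> {
    if n == 1 {
        None
    } else if n % 2 == 0 {
        Some(n / 2)
    } else {
        Some(3 * n + 1)
    }
}

/// A program that never halts: the state never changes.
pub fn spin(n: u64) -> Option<u64> {
    Some(n)
}

/// Runs `step` from `start` for at most `fuel` steps; true if it halted in time.
pub fn halts_within(step: Step, start: u64, fuel: u64) -> bool {
    let mut state = start;
    for _ in 0..fuel {
        match step(state) {
            Some(next) => state = next,
            None => return true,
        }
    }
    false
}

/// The gadget: first run the arbitrary program, then hand out chores
/// round-robin among three roommates (indices 0, 1, 2), which is fair.
/// Without the fuel cutoff, this would simply never return for `spin`.
pub fn gadget(step: Step, start: u64, fuel: u64, days: usize) -> Option<Vec<usize>> {
    if !halts_within(step, start, fuel) {
        return None;
    }
    Some((0..days).map(|d| d % 3).collect())
}
```

With `collatz` the gadget produces the round-robin schedule, and with `spin` it produces nothing. Deciding which of those two behaviors we get, for every possible wrapped program and without a step budget, is exactly the halting problem.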

Both humans and AI hit this wall

Here’s where we need to be careful, because this isn’t an anti-AI argument.

Most of us wouldn’t catch the unfairness in this scheduler immediately either. We write the rules, we write the tests, we feel satisfied. It takes running the simulation over 63 days and actually looking at the distribution to notice. A human reviewer could easily miss this too.

And that’s exactly the point. Rice’s theorem doesn’t discriminate between carbon and silicon. It’s a statement about computation itself. No algorithm, no matter how sophisticated the LLM (or brain) behind it, can reliably determine “is this program fair” for arbitrary programs.

This applies equally to:

“Does this function match what the user actually intended?”
“Will this system behave reasonably under realistic long-term usage?”
“Is this optimization actually an improvement in the ways that matter?”

These are all semantic properties. They’re all undecidable in the general case.

Token by token confidence, and why we keep approving

Here’s something worth thinking about. How does a coding agent actually work? It generates code token by token. Each token is the “most likely next token” given everything before it. As it writes code, or even follows TDD and generates tests, it produces output that looks correct at every step. And when it communicates that to us with confidence, we tend to just... approve.

This creates a false sense of security that’s unique to AI-assisted development.

In traditional engineering, there’s tension. Engineers push back on product requirements. QA pushes back on engineers. Management pushes back on timelines. That friction is annoying, but it keeps things relatively stable. It’s the reason “this codebase has been maintained for 30 years, it’s rock solid” is a valid heuristic. Decades of humans arguing, reviewing, and catching each other’s mistakes bakes in a kind of robustness.

QA doesn’t find all possible bugs. They find the critical ones. And over time, the important edge cases get covered because humans remember what went wrong last quarter and carry that context forward.

With LLM-generated code, the dynamic is inverted. As a codebase grows, its combinatorial space explodes. For every prompt we give the agent, there are multiple valid token sequences it could produce, and we only ever see a few of them. The agent picks one path through that space, presents it confidently, and we approve it because it passes the tests we asked for.

This means something counterintuitive: with LLMs, the larger and more maintained a project becomes, even if it’s the LLM that produced every line, the more brittle it can get. Not because the code is bad in any obvious way, but because the combinatorial space of possible behaviors grows faster than our ability (or the agent’s ability) to verify all of them. Rice’s theorem guarantees we can’t close that gap completely, neither via static rules nor via “runtime tests”.
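To put a rough number on “explodes”, take just one input dimension of the chore scheduler: who is available on which day of the week. If each person can independently be in or out on each day, the number of possible weekly calendars is 2 raised to (people × days). The sketch below counts them. The function name is made up for this illustration, and it deliberately ignores every other dimension (chores, history, cooldown state), so the real space is larger still.

```rust
/// Number of distinct weekly availability calendars when each of `people`
/// can independently be available or unavailable on each of `days`.
/// Returns `None` if the count does not fit in a `u128`.
pub fn availability_patterns(people: u32, days: u32) -> Option<u128> {
    // One bit per (person, day) pair.
    let bits = people.checked_mul(days)?;
    2u128.checked_pow(bits)
}
```

For three roommates and seven days that is already 2,097,152 calendars, against the 12 tests in the example. At twenty people the count no longer fits in 128 bits.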

Where human failure and AI failure diverge

So if both humans and AI hit the same mathematical ceiling, what’s the difference?

When we eventually notice the unfairness in this scheduler, we understand why it happened. Cara is unavailable Monday through Wednesday, so she gets fewer assignments. The cooldown system then deprioritizes Alice and Bob temporarily, but since they’re the only candidates most days, they just rotate between themselves. The “fewer total chores = higher priority” rule tries to help Cara catch up on Thursday, but one day a week against six days of Alice-and-Bob rotation isn’t enough.

We can explain this to another developer. We carry this understanding forward. Next time we design a scheduler with constrained availability, we’ll think differently about how to weight catch-up priority. We learn something about the shape of the problem.
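One detail of the snippet shown earlier is worth isolating when building that understanding: the sort key is the tuple `(cooldown, completed)`, and Rust compares tuples lexicographically. That means the completed count only breaks ties between people with the same cooldown. The sketch below uses made-up rows to show the effect. The `rank` function and the specific cooldown values are assumptions for illustration, and this alone does not establish how much of the skew in the real repository comes from this ordering.

```rust
/// Orders (name, cooldown, completed) rows the way the scheduler's sort key would.
/// All rows passed in here are hypothetical illustration data.
pub fn rank(mut rows: Vec<(&'static str, u32, u32)>) -> Vec<&'static str> {
    // Same key shape as the snippet: (cooldown, completed).
    // Tuples compare field by field, so `completed` matters only
    // when two people have equal `cooldown`.
    rows.sort_by_key(|&(_, cooldown, completed)| (cooldown, completed));
    rows.into_iter().map(|(name, _, _)| name).collect()
}
```

Given `("Cara", 1, 9)`, `("Alice", 0, 28)` and `("Bob", 0, 26)`, the result is Bob, Alice, Cara. A person with 9 completed chores and any nonzero cooldown still ranks behind someone with 28 completed chores and no cooldown.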

Today’s AI agents don’t seem to do this at the fidelity we can (Mythos may be an exception, but none of us have gotten our hands on it). When Claude or GPT writes a scheduler that passes all tests but produces unfair results, it doesn’t accumulate understanding about why that failure mode exists. We can tell it about the problem, and it’ll fix this specific instance. But it’s not building a mental model of “fairness under asymmetric constraints” that it carries to the next project. Each conversation starts relatively fresh.

This isn’t saying AI is bad. It’s observing that the failure mode is qualitatively different. A junior developer who encounters this bug and thinks through it will handle similar problems better next year. It’s hard to say the same is true for today’s AI systems.

What this means practically

Rice’s theorem sets a ceiling. Neither humans nor machines can perfectly verify semantic correctness of arbitrary programs. This is not something we’ll engineer around with better models or more compute. It’s mathematical.

But below that ceiling, there’s a lot of useful space. AI agents are excellent at:

Catching syntactic issues
Verifying type correctness
Running tests and reporting results
Generating boilerplate
Spotting common patterns of bugs

They’ll keep getting better at these things. The space below the ceiling will get more and more filled in.

What they can’t do, and what Rice’s theorem guarantees they’ll never perfectly do, is look at a program and tell us “yes, this does what you actually meant.” That gap between specification and intent is where the hard problems live.

And when we humans hit that gap, we build context, intuition, and transferable understanding from the experience. When today’s AI hits it, it just... doesn’t notice. Or if we point it out, it fixes the symptom without building the kind of deep understanding that prevents the next occurrence.
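The gap between specification and intent can be made concrete with two checks over the same assignment history: one that encodes a rule we wrote down, and one that encodes the outcome we actually wanted. This is a minimal sketch, simplified to a single chore. The function names, the tolerance parameter and the 9-week history are all invented for illustration and do not come from the repository.

```rust
/// Rule-level check: nobody is assigned twice in a row.
pub fn no_consecutive_repeats(history: &[&str]) -> bool {
    history.windows(2).all(|w| w[0] != w[1])
}

/// Outcome-level check: the gap between the busiest and least busy
/// person stays within `tolerance` assignments.
pub fn spread_within(history: &[&str], people: &[&str], tolerance: usize) -> bool {
    let counts: Vec<usize> = people
        .iter()
        .map(|p| history.iter().filter(|h| *h == p).count())
        .collect();
    match (counts.iter().max(), counts.iter().min()) {
        (Some(max), Some(min)) => max - min <= tolerance,
        _ => true, // no people, nothing to compare
    }
}

/// Hypothetical 9-week history for one chore in which Cara appears once per week.
pub fn demo_history() -> Vec<&'static str> {
    let week = ["Alice", "Bob", "Alice", "Cara", "Bob", "Alice", "Bob"];
    week.iter().copied().cycle().take(week.len() * 9).collect()
}
```

On `demo_history()`, `no_consecutive_repeats` returns true while `spread_within` with a tolerance of 3 returns false: the counts are 27, 27 and 9. Both checks are easy to write once we know to ask for them. The hard part is knowing which outcome-level questions to ask in the first place.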

Closing thought

This isn’t an argument that humans are irreplaceable or that AI won’t take jobs. Both of those claims require ignoring reality.

What this is saying is simpler: there’s a mathematical reason why “perfect coding agent” is an oxymoron. And the way AI handles hitting that mathematical wall is, today, fundamentally different from how we handle it. We fail and learn. Today’s AI fails and doesn’t quite know it failed.

Maybe that changes. Maybe future architectures will build persistent, transferable understanding from failure. But right now, in 2026, Rice’s theorem quietly draws a line that no amount of scaling seems to cross.

And understanding where that line is can make us better users of AI tools, not worse ones. Knowing what they can’t do helps us point them at what they can.

All this is to say that even if we get “superintelligent” AI, it won’t be writing perfect code. Such is the nature of computation in general: either we exhaustively search the entire problem space, which can be impossible for nontrivial software, or we accept the reality and learn to deal with it. Bugs are a “feature” of software.

Arnav Gupta

17h

I think it would be great if you also actually covered why the fairness did not emerge here? (Cara was not available on all days). To highlight how all the initial conditions, even if covered, does not manage to cover this angle.

That's a really useful remark. I'll update the blog and add it.

Heap Hopping
