Anthropic’s Claude Opus 4.6 and the latest GPT-4 models are materially better than their predecessors. They reason longer. They maintain deeper context. They self-correct more effectively. On isolated coding tasks, they perform at a level that would have seemed unrealistic just a few years ago.
But there is a subtle mistake in how many teams evaluate them.
They evaluate the model. Production software is not a model. It is a system.
A coding model is typically judged by whether it completes a task, produces logically coherent output, and compiles successfully. Production systems are judged by entirely different criteria. They must behave deterministically across environments. They must remain portable across platforms. They must preserve architectural boundaries. They must handle edge cases. They must remain secure under unexpected inputs. They must scale without silent drift.
Those are system properties, not prompt properties.
A model can generate syntactically correct code that passes immediate tests while still introducing subtle risk. It may rely on a non-portable flag that fails on macOS. It may hard-code assumptions about branch naming. It may introduce a dependency upgrade that appears safe but alters behavior under load. It may satisfy the happy path while weakening structural constraints over time.
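To make the branch-naming case concrete, here is a minimal sketch in Python. The first helper hard-codes the assumption that the default branch is named main, so it works in one repository and silently breaks in another; the second asks Git for the remote's default branch instead. The helper names and conventions are illustrative, not taken from any particular codebase.

```python
import subprocess

def changed_files_naive() -> list[str]:
    # Hard-codes the default branch name. Passes every local test in a repo
    # whose default branch is "main", then fails in one that uses "master"
    # or an org-specific convention.
    out = subprocess.run(
        ["git", "diff", "--name-only", "main...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

def changed_files_portable() -> list[str]:
    # Asks the remote which branch is the default instead of assuming one
    # (origin/HEAD is set by a normal clone, though some setups unset it).
    default = subprocess.run(
        ["git", "symbolic-ref", "--short", "refs/remotes/origin/HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()  # e.g. "origin/main"
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{default}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()
```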
None of that means the model failed. It means the system was not enforced.
There is also confusion around context windows. A larger context window allows a model to see more files at once, which improves coherence in larger codebases. But visibility is not the same as understanding intent. A model may see architectural patterns without understanding why certain boundaries must hold. It may replicate conventions without recognizing which ones are critical invariants. Production reliability depends on constraints, and constraints must be enforced explicitly.
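What explicit enforcement can look like, as a minimal sketch in Python: a check that domain code never imports from the delivery layer. The package names (app/domain, app.api) are assumptions chosen for illustration; the point is that the boundary is written down and machine-checked rather than inferred from surrounding code.

```python
import ast
import sys
from pathlib import Path

FORBIDDEN_PREFIX = "app.api"     # the layer domain code must not depend on (assumed name)
DOMAIN_DIR = Path("app/domain")  # the layer being protected (assumed name)

def violations() -> list[str]:
    # Walk every domain module and flag imports that cross the boundary.
    found = []
    for source in DOMAIN_DIR.rglob("*.py"):
        tree = ast.parse(source.read_text(), filename=str(source))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                names = [node.module or ""]
            else:
                continue
            for name in names:
                if name.startswith(FORBIDDEN_PREFIX):
                    found.append(f"{source}:{node.lineno} imports {name}")
    return found

if __name__ == "__main__":
    problems = violations()
    for line in problems:
        print(line)
    sys.exit(1 if problems else 0)
```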
Self-correction is another area that gets overstated. Modern models can revise their output when prompted or when errors surface. That is meaningful progress. But self-correction remains probabilistic. It is a retry mechanism. Verification, by contrast, is deterministic. A verification system checks whether tests exist, whether branch rules are respected, whether dependencies resolve safely, and whether structural boundaries hold under defined conditions.
Retrying increases confidence. Verification increases reliability.
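As a rough illustration, a deterministic gate might look like the sketch below, written in Python under assumed conventions: a tests/ directory, a requirements.lock file, and a protected branch named main. The check names and file paths are placeholders, not a prescribed layout; the point is that each check returns a definite answer and the pipeline blocks on failure rather than retrying.

```python
import subprocess
import sys
from pathlib import Path

REPO = Path(".")

def tests_exist() -> bool:
    # The change must ship with tests, not with the assumption that tests exist.
    return any(REPO.glob("tests/**/test_*.py"))

def lockfile_present() -> bool:
    # Dependency changes must go through the committed lockfile.
    return (REPO / "requirements.lock").exists()

def branch_rules_respected() -> bool:
    # No direct work on the protected branch (assumed here to be "main").
    branch = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return branch != "main"

CHECKS = [tests_exist, lockfile_present, branch_rules_respected]

if __name__ == "__main__":
    failures = [check.__name__ for check in CHECKS if not check()]
    if failures:
        print("verification failed:", ", ".join(failures))
        sys.exit(1)  # deterministic outcome: the merge is blocked, not retried
    print("verification passed")
```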
As models improve, another dynamic emerges: trust inflation. When output quality rises, skepticism naturally declines. Developers review less aggressively. Teams probe fewer edge cases. Changes move faster through pipelines. Meanwhile, human review capacity does not scale proportionally. If generation speed increases while inspection capacity remains constant, inspection depth per change decreases. That is arithmetic, not opinion.
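To put assumed numbers on that arithmetic: hold review capacity fixed at, say, 20 hours a week, let AI-assisted throughput rise from 40 to 120 changes a week, and scrutiny per change falls from half an hour to about ten minutes.

```python
review_hours_per_week = 20      # fixed human capacity (assumed figure)
changes_per_week_before = 40    # pre-AI throughput (assumed figure)
changes_per_week_after = 120    # AI-assisted throughput (assumed figure)

print(review_hours_per_week / changes_per_week_before)  # 0.5   -> 30 minutes per change
print(review_hours_per_week / changes_per_week_after)   # ~0.17 -> about 10 minutes per change
```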
Without enforcement that scales with generation velocity, structural drift becomes more likely over time.
This is not an argument against AI. It is an argument for systems thinking. AI increases throughput. Throughput without constraints increases entropy. In complex systems, entropy accumulates quietly before it surfaces visibly. Production incidents are rarely caused by one dramatic mistake. They are usually the result of small, compounding assumptions that were never verified.
The real question is not whether Claude Opus 4.6 or GPT-4 is “good enough.”
The real question is whether the system in which they operate is verified.
Models generate.
Systems enforce.
Teams that separate those responsibilities will build faster and more safely. Teams that merge them will eventually discover that confidence is not the same as correctness.
The model is not the system.
Production reliability depends on understanding that difference.