From Vibe Coding to Agentic Engineering

Andreas Martens, May 2026

Andrej Karpathy coined the term "Vibe Coding" — and made it viral. It describes the mode most developers are in today when they work with AI: sitting in front of Cursor or Claude Code, chatting with a model, accepting patches that "feel right", iterating on the visual surface, swatting bugs as they pop up. Great for personal projects. Great for weekend hacks. Great for prototypes. Doesn't scale to production.

The point Karpathy is now making — and the part of the framing that matters for any engineering organisation putting AI into production — is the contrast. There is a second mode. He has called the discipline "Agentic Engineering". It is not Vibe Coding scaled up. It is a different shape entirely.

Vibe Coding trusts the model. Agentic Engineering measures it. Vibe Coding chats. Agentic Engineering specifies. Vibe Coding lives in a five-minute task horizon. Agentic Engineering targets hours and days. Karpathy himself has been explicit on this: he uses Vibe Coding for his own small tools, and would never have written Tesla Autopilot that way.

The discipline that Agentic Engineering describes is being born right now. The next twelve to twenty-four months will see it crystallise into a professional skillset that engineering organisations either acquire or get out-iterated by competitors who do. Below: seven concrete engineering practices that together constitute the shift.

Seven Practices, One Discipline

Each practice is a deliberate move away from "I type a prompt, the model answers, I eyeball the result" and toward "I orchestrate autonomous agents over long horizons, with tooling, with verification, with measurable output."

Context Engineering: not prompt engineering

The lever shifted: what's in the agent's working memory matters more than how it's asked.

Karpathy has said publicly that "prompt engineering" as a discipline is becoming increasingly irrelevant. The real lever sits one layer down: which files, which history, which spec, which tooling the agent has in its working memory at the moment it acts. Teams that get good at curating that context — what to include, what to exclude, what's stale, what's load-bearing — solve half the problem before the model runs. This is the largest skill-shift engineering teams need to make.
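One way to picture context curation is as a selection problem under a budget. The sketch below is illustrative only: `ContextItem`, its `relevance` score, the `stale` flag, and the token estimates are all hypothetical names and numbers, and a real setup would derive them from its own ranking and tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextItem:
    name: str
    tokens: int          # rough size estimate (hypothetical unit)
    relevance: float     # 0.0 to 1.0, from whatever ranking you use
    stale: bool = False  # marked by a human or a freshness check


def curate_context(items, budget):
    """Greedy selection: skip stale items, keep the most relevant that fit."""
    selected, used = [], 0
    for item in sorted(items, key=lambda i: i.relevance, reverse=True):
        if item.stale or used + item.tokens > budget:
            continue
        selected.append(item)
        used += item.tokens
    return selected


items = [
    ContextItem("spec.md", 800, 0.9),
    ContextItem("auth/service.py", 1500, 0.8),
    ContextItem("old_design_notes.md", 600, 0.7, stale=True),
    ContextItem("CHANGELOG.md", 3000, 0.2),
]
chosen = curate_context(items, budget=2500)
print([item.name for item in chosen])  # ['spec.md', 'auth/service.py']
```

The stale design notes are excluded even though they would fit, which is the point: what is left out matters as much as what goes in.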

Verification as Mandatory Loop: Vibe Coding trusts. Agentic Engineering measures.

Without a verification loop you don't have autonomy — you have extended autocomplete.

Tests, linters, type checkers, eval suites: these are the signals the agent uses to check its own work and re-do it when needed. Without that loop, autonomy is an illusion — the agent only "feels" like it's working until production catches the bug. The discipline is to make verification the agent's job, executed inside the loop, not a human review after the fact.
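A minimal sketch of such a loop, assuming the agent is a callable that receives the previous round's failure messages. `stub_agent` is a hypothetical stand-in for a model call, and the two checks stand in for a real linter and test suite.

```python
def run_with_verification(generate, checks, max_attempts=3):
    """Ask for a candidate, run every check, feed failures back, repeat."""
    feedback = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        feedback = []
        for check in checks:
            message = check(candidate)
            if message is not None:
                feedback.append(message)
        if not feedback:
            return candidate, attempt
    raise RuntimeError(f"still failing after {max_attempts} attempts: {feedback}")


def stub_agent(feedback):
    # Hypothetical stand-in for a model: wrong first, corrected once it gets feedback.
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def check_syntax(source):
    try:
        compile(source, "<candidate>", "exec")
    except SyntaxError as exc:
        return f"syntax error: {exc}"
    return None


def check_behaviour(source):
    namespace = {}
    try:
        exec(source, namespace)  # acceptable for a toy; sandbox this in real use
        if namespace["add"](2, 3) != 5:
            return "add(2, 3) should return 5"
    except Exception as exc:
        return f"test crashed: {exc!r}"
    return None


code, attempts = run_with_verification(stub_agent, [check_syntax, check_behaviour])
print(attempts)  # 2
```

The first candidate passes the syntax check but fails the behaviour check, so the failure message goes back to the agent instead of to a human reviewer.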

Spec-Driven Work: not free conversation

A spec lets you walk away. A chat doesn't.

Instead of "build a login function", there's a specification: acceptance criteria, expected interface behaviour, edge cases enumerated, regulatory constraints stated. The agent works against the spec, not against a developer's gut-feel expectation. That makes the work asynchronously executable — you can hand off a task and check back twenty minutes later, instead of co-piloting every keystroke. Spec-driven work is what turns AI into a teammate instead of a typing aid.
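As an illustration, a spec can be written down as data with executable acceptance criteria. Everything here is hypothetical (`TaskSpec`, `Criterion`, the `login` signature, the sample user table); it is a sketch of the idea, not a prescribed format.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Criterion:
    description: str
    check: Callable[[Callable], bool]  # receives the implementation under test


@dataclass
class TaskSpec:
    title: str
    interface: str
    criteria: list = field(default_factory=list)
    constraints: list = field(default_factory=list)  # non-executable notes


def unmet(spec, implementation):
    """Return the descriptions of all criteria the implementation fails."""
    return [c.description for c in spec.criteria if not c.check(implementation)]


spec = TaskSpec(
    title="Login function",
    interface="login(user: str, password: str | None) -> bool",
    criteria=[
        Criterion("valid credentials are accepted",
                  lambda login: login("ada", "s3cret") is True),
        Criterion("wrong password is rejected",
                  lambda login: login("ada", "nope") is False),
        Criterion("unknown user with missing password is rejected",
                  lambda login: login("ghost", None) is False),
    ],
    constraints=["never log the password"],
)

USERS = {"ada": "s3cret"}


def candidate_login(user, password):
    # Plausible first attempt: an unknown user yields None, which equals a None password.
    return USERS.get(user) == password


print(unmet(spec, candidate_login))
# ['unknown user with missing password is rejected']
```

The enumerated edge case catches a bug that a quick look at the happy path would miss, and it does so without anyone watching the agent work.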

Long Task Horizons: hours, not minutes

How long can the model stay coherent and productive without a human turn?

Vibe Coding solves five-minute tasks. Agentic Engineering targets hour-long and day-long tasks — multi-step, with planning and execution decoupled, with fallback paths, with self-correction at gates. Karpathy has named the maximum coherent horizon as the central open frontier in LLM research right now. The teams that figure out how to set up tasks that survive long horizons — instead of degrading at the 20-minute mark — get the productivity multiplier first.
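The structure described above (a plan decoupled from execution, gates between steps, a fallback when a gate keeps failing) can be sketched in a few lines. `Step`, `run_plan`, and the toy actions are hypothetical; in practice each action would be an agent call and each gate a real check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]  # mutates the shared state
    gate: Callable[[dict], bool]    # must hold before the next step starts


def run_plan(steps, state, max_retries=2):
    """Execute steps in order; retry a step until its gate passes or retries run out."""
    log = []
    for step in steps:
        for attempt in range(1, max_retries + 2):
            step.action(state)
            if step.gate(state):
                log.append((step.name, attempt))
                break
        else:
            # Fallback path: stop and hand the task back instead of pressing on.
            log.append((step.name, "escalated"))
            return state, log
    return state, log


def scaffold(state):
    state["files"] = ["app.py"]


def migrate(state):
    state["migrate_runs"] = state.get("migrate_runs", 0) + 1


def cleanup(state):
    state["clean"] = True


steps = [
    Step("scaffold", scaffold, lambda s: bool(s.get("files"))),
    Step("migrate", migrate, lambda s: s["migrate_runs"] >= 2),  # passes on the retry
    Step("cleanup", cleanup, lambda s: s.get("clean") is True),
]
final_state, log = run_plan(steps, {})
print(log)  # [('scaffold', 1), ('migrate', 2), ('cleanup', 1)]
```

Because the plan is data and the gates are explicit, a failed step is retried or escalated at a known point rather than quietly corrupting everything that follows.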

Code as Agent-Friendly Infrastructure: what used to be "good code for humans" is now a precondition for agents

Agents are more honest than humans about what they find unreadable.

A codebase an agent can operate on at scale has clear structure, good docstrings, type hints, tests, and predictable naming. Most of what was "good code for human readability" is now a hard precondition for agent productivity. The twist: agents are blunter than human reviewers about what they find unreadable. A function with no docstring and a vague name gets misused, and the misuse shows up plainly in the audit logs. Engineering teams that have been carrying technical debt disguised as tribal knowledge feel that debt compounding fast.
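A rough way to find the functions an agent is most likely to misuse is to scan for missing docstrings and type hints. The sketch below uses Python's standard `ast` module; the `audit` function and the sample source are illustrative, and the check only looks at ordinary positional parameters (so it would also flag an unannotated `self`).

```python
import ast


def audit(source):
    """List (function name, problem) pairs for undocumented or untyped functions."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if ast.get_docstring(node) is None:
            findings.append((node.name, "missing docstring"))
        untyped = [arg.arg for arg in node.args.args if arg.annotation is None]
        if untyped or node.returns is None:
            findings.append((node.name, "incomplete type hints"))
    return findings


SAMPLE = '''
def proc(d, f):
    return [x for x in d if f(x)]


def keep_matching(items: list, predicate: object) -> list:
    """Return the items for which predicate(item) is true."""
    return [item for item in items if predicate(item)]
'''

for name, problem in audit(SAMPLE):
    print(f"{name}: {problem}")
# proc: missing docstring
# proc: incomplete type hints
```

Both functions do the same thing; only the second tells a reader, human or agent, what it is for.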

Multi-Agent Orchestration: specialists, not a monster model

A planner, a coder, a reviewer, a test agent — each with its own context, tools, job.

This is the architecture inside Cursor, Devin, Claude Code, Cline: orchestrated specialists rather than one monolithic model that "does everything". Karpathy is cautious — he calls Multi-Agent over-hyped for some use cases — but agrees the pattern works for production workflows with clearly separable subtasks. Each role has its own context window, its own toolset, its own success criteria. The orchestration logic is where most of the engineering work hides.
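To make the pattern concrete, here is a toy orchestration of three roles. It is not how any of the named products work internally; `Agent`, `orchestrate`, and the stubbed `act` callables are hypothetical stand-ins for model calls, and the only point is that each role keeps its own context and the control flow lives in ordinary code.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Agent:
    role: str
    tools: tuple
    act: Callable[[str], str]  # stand-in for a model call
    context: list = field(default_factory=list)

    def run(self, message):
        self.context.append(message)  # each role only ever sees its own inputs
        return self.act(message)


def orchestrate(task, planner, coder, reviewer, max_rounds=3):
    """Plan once, then loop coder and reviewer until the reviewer approves."""
    plan = planner.run(task)
    request = plan
    for round_no in range(1, max_rounds + 1):
        patch = coder.run(request)
        verdict = reviewer.run(patch)
        if verdict == "approve":
            return patch, round_no
        request = f"{plan}\nReviewer feedback: {verdict}"
    raise RuntimeError("no approved patch within the round limit")


planner = Agent("planner", ("read_repo",), lambda task: f"Plan for: {task}")
coder = Agent(
    "coder",
    ("edit_file", "run_tests"),
    lambda req: "patch v2 (tests included)" if "Reviewer feedback" in req else "patch v1 (no tests)",
)
reviewer = Agent(
    "reviewer",
    ("read_diff",),
    lambda patch: "approve" if "tests included" in patch else "add tests",
)

result = orchestrate("add rate limiting", planner, coder, reviewer)
print(result)  # ('patch v2 (tests included)', 2)
```

Even in a toy, most of the decisions sit in `orchestrate`: what each role receives, when to loop, and when to give up.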

Eval-Driven Development: measure the agent the way you'd measure code

Eval suites become engineering practice — like unit tests, but for the agent.

You don't trust the agent over time — you measure it. Which tasks does it solve reliably, where does it break, how does that change with each model upgrade? Eval suites become standard engineering infrastructure: a regression set the agent must pass before being trusted with new responsibility. Without this, you can't tell whether an upgrade made things better, worse, or just different — you're guessing.
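A small sketch of what such a regression set can look like, assuming an agent can be treated as a function from task to answer. The cases and the two stubbed "model versions" are made up for illustration.

```python
def pass_rate(agent, cases):
    """Fraction of (task, expected) cases the agent answers correctly."""
    passed = sum(1 for task, expected in cases if agent(task) == expected)
    return passed / len(cases)


def compare(baseline, candidate, cases):
    """Report pass rates plus the exact cases that regressed or got fixed."""
    regressions = [t for t, e in cases if baseline(t) == e and candidate(t) != e]
    fixes = [t for t, e in cases if baseline(t) != e and candidate(t) == e]
    return {
        "baseline": pass_rate(baseline, cases),
        "candidate": pass_rate(candidate, cases),
        "regressions": regressions,
        "fixes": fixes,
    }


CASES = [
    ("sum 2 3", "5"),
    ("reverse abc", "cba"),
    ("upper hi", "HI"),
    ("len hello", "5"),
]

# Hypothetical recorded answers from two model versions.
V1_ANSWERS = {"sum 2 3": "5", "reverse abc": "cba", "upper hi": "HI", "len hello": "4"}
V2_ANSWERS = {"sum 2 3": "5", "reverse abc": "abc", "upper hi": "HI", "len hello": "5"}

report = compare(V1_ANSWERS.get, V2_ANSWERS.get, CASES)
print(report)
# {'baseline': 0.75, 'candidate': 0.75, 'regressions': ['reverse abc'], 'fixes': ['len hello']}
```

Both versions score 0.75, yet they fail on different tasks. An aggregate number alone would report "no change"; the per-case diff shows the upgrade is different rather than better.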

The Skill Shift This Implies

The seven practices above aren't a list of nice-to-haves. They're the components of a coherent discipline. Engineering teams that try to scale Vibe Coding — by hiring more developers who are "good at prompts" — run into a ceiling fast. The ceiling is structural, not effort-bound.

The skill that matters is no longer typing a clever prompt. It is designing the surrounding system: the context the agent operates in, the verification the agent runs against, the spec the work is anchored to, the eval suite that catches regressions. That's the actual engineering work.

The discipline is not in the prompting. It's in the setup around it.

This explains the otherwise puzzling pattern of organisations getting impressive AI-assisted demos and then stalling at the production gate. The demo runs in Vibe Coding mode and works on the happy path. Production needs Agentic Engineering — and the team has never built the surrounding system.

* * *

Why This Is Happening Now

The terminology is fresh, but the underlying shift was inevitable. Three forces are pushing it.

Models got good enough that the human-in-the-loop became the bottleneck. When the model writes faster than a human can review, the limiting factor is no longer the model. It's the review-and-correction surface. Vibe Coding doesn't fix this — it makes the human the slow part. Agentic Engineering moves verification into the loop.

Tooling matured to support long-running autonomy. A year ago, asking an agent to work for an hour without supervision would have meant rewriting your context window every twenty minutes. The tools — orchestration frameworks, eval libraries, structured tool-use APIs — caught up. Long horizons are now technically feasible. The engineering discipline to use them well is the gap.

The economics are unforgiving. Teams that figure out Agentic Engineering get a multiplier that compounds across every engineering hour. Teams that stay in Vibe Coding mode see their senior engineers spend their day proof-reading model output. Those teams lose the talent war and the product war on the same axis.

* * *

What This Means for Engineering Leadership

For anyone running a software-delivery organisation, the question stops being "should we use AI" and becomes "are we set up to use it as Agentic Engineering or only as Vibe Coding?" The answer is mostly visible in the practices above. A team that has eval suites for its agents, a spec-driven workflow, context curation as a first-class discipline, and orchestration patterns in production has crossed the threshold. A team where AI usage looks like "developers chat with Cursor" hasn't.

The investment is in the practices, not in the licences. The skill shift is in context, verification, specification, eval — not in better prompts. The next twelve to twenty-four months are the window in which a small number of organisations build the muscle and the rest discover, painfully, that they need it.

Vibe Coding is the mode most developers are in today. Agentic Engineering is the mode productive AI-augmented teams will need to operate in if they want to leave the demo stage.

That's the framing worth taking seriously.
