It's easy to look at the latest model release and think the model is the product. It rarely is.
Most of the AI work we ship in production looks more like the boring kind of engineering than the demos suggest. Retrieval pipelines, eval harnesses, guardrails, cost monitoring, prompt hygiene, error states, fallbacks. The model is the smallest part of the stack you're standing up.
What "shipping AI" actually means
When a customer says "we want to add AI" to their product, the work usually breaks down something like this:
- 20% picking and integrating the model.
- 30% building the retrieval and context pipeline so the model has something useful to work with.
- 20% building the eval harness so you can tell whether changes make things better or worse.
- 15% UX work — how does the AI feature fit into the existing product?
- 15% operations — monitoring, cost discipline, on-call, safety.
Every team underestimates one of these. The most commonly missed is evals.
Evals are the thing
You cannot systematically improve a non-trivial AI feature without evals. By "evals" we don't mean benchmarks; we mean task-specific evaluation data that mirrors how your users actually use the feature.
The teams that succeed almost always have:
- A small curated set (50–200 examples) of inputs paired with known-good outputs.
- A way to grade outputs — sometimes deterministic, often LLM-as-judge with a tight rubric.
- A regression suite that runs every time a prompt or model changes.
Without this, you're flying blind. Every prompt edit is a shrug.
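To make the shape of this concrete, here is a minimal sketch of such a regression suite. All of the names (`EvalCase`, `grade_output`, `run_suite`) and the keyword-based grader are illustrative assumptions, not a prescribed design; a real suite would often swap the deterministic grader for an LLM-as-judge with a tight rubric.

```python
# Minimal eval-harness sketch. Names and the keyword grader are
# illustrative assumptions, not a specific library's API.
from dataclasses import dataclass


@dataclass
class EvalCase:
    input: str
    required_keywords: list[str]  # deterministic proxy for a "good" output


def grade_output(output: str, case: EvalCase) -> bool:
    # Deterministic grader: pass if every required keyword appears.
    text = output.lower()
    return all(kw.lower() in text for kw in case.required_keywords)


def run_suite(model_fn, cases: list[EvalCase]) -> float:
    # Returns the pass rate. Run on every prompt or model change,
    # and fail CI if the rate drops below a chosen threshold.
    passed = sum(grade_output(model_fn(c.input), c) for c in cases)
    return passed / len(cases)


# Usage with a stand-in "model"; a real one would call your LLM.
cases = [
    EvalCase("What is our refund window?", ["30 days"]),
    EvalCase("How do I reset my password?", ["reset", "email"]),
]
fake_model = lambda q: "Refunds are accepted within 30 days of purchase."
pass_rate = run_suite(fake_model, cases)  # a fraction between 0 and 1
```

The point isn't the grader — it's that a score exists at all, so a prompt edit becomes a measurable change instead of a shrug.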
The interesting work hasn't changed
We sometimes hear people worry that AI engineering will replace the "normal" kind. The opposite is closer to the truth. AI features create more demand for the same skills that have always made products good: careful problem framing, clean data plumbing, observability, UX that respects the user, and discipline about cost and reliability.
The models are getting better fast. The engineering around them isn't free.