The 5 Agentic Workflow Patterns That Actually Survive Production
Bhupin Baral
April 25, 2026
An MLOps practitioner’s field guide for founders building real AI systems — not demos.

Most “AI agents” you see on Twitter would not survive 24 hours in production.
They work in a notebook. They demo well. They fail the moment a real user, a real timeout, or a real cost ceiling enters the picture.
Deploy enough agentic systems for production workloads and the pattern becomes clear: the simplest workflow architectures win. Anthropic published a now-canonical breakdown of these patterns at the end of 2024, and more than a year later, those same five patterns are still the ones holding up enterprise AI systems.
This is the practitioner’s version of that breakdown — what each pattern actually is, where it earns its keep, and the MLOps reality of running it in production.
Why this matters before we touch the patterns
Founders keep asking the wrong question.
The wrong question is: “Should we build an agent?”
The right question is: “What is the simplest workflow that solves this?”
A full autonomous agent — an LLM in a loop, choosing its own tools, running until it decides it’s done — is the most expensive, least predictable, and hardest-to-monitor version of AI you can deploy. It is rarely the right starting point.
The five patterns below are workflows, not agents. They have predictable topologies. They are observable. They are debuggable. They are cheap to monitor. And they solve 80% of real business problems before you ever need a true agent.
If you skip these and jump straight to “let the LLM figure it out,” you are paying an autonomy tax for no reason.
Pattern 1 — Prompt Chaining
What it is: Decompose a complex task into a sequence of LLM calls. The output of one becomes the input of the next. Optionally, insert a programmatic gate between steps to validate, route, or exit.
Mental model: A factory line. Each station does one thing well.
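Here is a minimal sketch of the topology, assuming a placeholder `call_llm` in place of a real client; the support-ticket task, the prompts, and the gate rule are illustrative stand-ins, not a prescribed implementation:

```python
# Hypothetical chaining skeleton. call_llm is a stand-in for your provider's
# client; the three steps and the gate rule are illustrative only.

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    return f"<model output for: {prompt[:40]}...>"

def extract(ticket: str) -> str:
    return call_llm(f"Extract the customer's core complaint:\n{ticket}")

def classify(complaint: str) -> str:
    return call_llm(f"Classify this complaint as billing, bug, or other:\n{complaint}")

def summarize(complaint: str) -> str:
    return call_llm(f"Write a two-sentence internal summary:\n{complaint}")

def run_chain(ticket: str) -> str | None:
    complaint = extract(ticket)      # step 1
    category = classify(complaint)   # step 2
    # Programmatic gate: exit early instead of paying for the next call.
    if "other" in category.lower():
        return None                  # hand off to a human instead
    return summarize(complaint)      # step 3
```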
When it earns its keep:
- The task has clear, separable sub-steps (extract → classify → summarize → format)
- You need higher accuracy than a single mega-prompt can deliver
- You want to log and inspect intermediate outputs
The trade-off: Latency stacks. If each step takes 1.2 seconds, a 4-step chain is 4.8 seconds before the user sees anything. For chat UX, that is brutal. For batch pipelines, it is irrelevant.
MLOps reality:
- Cache aggressively between steps. If step 1’s output is deterministic for a given input, you should not be paying for it twice.
- Log every intermediate output with a trace ID. When something breaks in step 4, you need to see steps 1–3 without rerunning them.
- The “gate” is where most teams cut cost — fail fast, exit early, do not call the expensive model if the cheap model already said no.
Where founders mess this up: They chain too many steps. If you have 7 LLM calls in a row, you have a debugging nightmare and a latency disaster. Three to four steps is the sweet spot.
Pattern 2 — Routing
What it is: A classifier (often a small LLM or even a fine-tuned encoder) decides which downstream path the input goes to. Each path is specialized.
Mental model: A receptionist at a hospital. Different complaints go to different departments.
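A minimal sketch of the shape, again with a placeholder `call_llm`; the labels, the model names, and the default-to-Q&A fallback are illustrative assumptions:

```python
# Hypothetical router sketch. call_llm is a stand-in; the labels, model names,
# and the low-confidence default are illustrative.

def call_llm(prompt: str, model: str) -> str:
    """Placeholder for a real model call."""
    return f"<{model} output>"

HANDLERS = {
    "qa":        lambda text: call_llm(f"Answer the question:\n{text}", model="small-self-hosted"),
    "action":    lambda text: call_llm(f"Draft the action plan:\n{text}", model="frontier"),
    "complaint": lambda text: call_llm(f"Draft an empathetic reply:\n{text}", model="frontier"),
}

def route(text: str) -> str:
    label = call_llm(
        f"Label this request as exactly one of qa / action / complaint / unknown:\n{text}",
        model="small-self-hosted",   # keep the router small and fast
    ).strip().lower()
    # Default path: an unrecognised or low-confidence label must never become a hard failure.
    handler = HANDLERS.get(label, HANDLERS["qa"])
    return handler(text)
```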
When it earns its keep:
- Your system handles fundamentally different request types (Q&A vs. action vs. complaint vs. billing)
- One generic prompt is degrading quality because you are asking one model to be good at everything
- You want to send cheap requests to a cheap model and expensive requests to a frontier model
The killer use case: Cost routing. A real example — route 70% of routine queries to a smaller, self-hosted model and only escalate ambiguous or high-value queries to a frontier API. The savings are not marginal. They are 5–10x on inference cost for the routed traffic.
MLOps reality:
- The router itself must be evaluated. Misrouting is a silent failure — the user gets an answer, just from the wrong specialist. Build a confusion matrix from labeled traffic and watch it.
- Keep the router small and fast. If your router takes 800ms, you are charging users a tax before they even hit the actual workflow.
- Always have a “default” path for low-confidence routing decisions. Never let the router fail closed.
Pattern 3 — Parallelization
What it is: Split the work, run in parallel, aggregate the results.
Two flavors:
- Sectioning — different sub-tasks run in parallel (extract name, extract amount, extract date — all at once); sketched below
- Voting — same task, multiple times, aggregate via majority or consensus (used to boost accuracy)
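A sketch of the sectioning flavor, assuming an async placeholder `call_llm` and illustrative invoice fields; the point is the per-branch timeout, not the specific client:

```python
# Hypothetical sectioning sketch: independent field extractions fan out
# concurrently with a hard per-branch timeout. call_llm and the field names
# are illustrative stand-ins.
import asyncio

async def call_llm(prompt: str) -> str:
    """Placeholder for a real async model call."""
    await asyncio.sleep(0)   # stands in for network latency
    return f"<output for: {prompt[:30]}>"

FIELDS = ["vendor name", "total amount", "invoice date"]

async def extract_field(field: str, document: str, timeout_s: float) -> str | None:
    try:
        return await asyncio.wait_for(
            call_llm(f"Extract the {field} from:\n{document}"), timeout_s
        )
    except asyncio.TimeoutError:
        return None   # one slow branch must not hold the whole response hostage

async def extract_fields(document: str, timeout_s: float = 10.0) -> dict[str, str | None]:
    values = await asyncio.gather(
        *(extract_field(field, document, timeout_s) for field in FIELDS)
    )
    return dict(zip(FIELDS, values))

# asyncio.run(extract_fields("...invoice text..."))
```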
When it earns its keep:
- Speed matters and the sub-tasks are independent
- You need accuracy bumps on a critical decision and you can afford 3–5x the inference cost for that one step
- You are processing structured documents (invoices, contracts, forms) where many fields can be extracted simultaneously
The trade-off: You pay for N calls. The aggregator logic adds complexity. If the sub-tasks are not actually independent, parallelization will produce inconsistent results that the aggregator has to reconcile.
MLOps reality:
- Set hard timeouts on each branch. One slow branch should not hold the whole response hostage.
- The aggregator is where bugs live. Test it independently from the LLM calls.
- For voting patterns, track agreement rates over time — if your branches start agreeing on 99% of cases, you are wasting money. If they disagree on 40%, your prompts need work.
Pattern 4 — Orchestrator-Workers
What it is: A central LLM looks at the task, dynamically decides what sub-tasks are needed, dispatches them to worker LLMs, then synthesizes the results.
This is where it starts feeling “agentic.” But it is still bounded — the orchestrator does not loop forever; it produces one plan, executes it, and merges the results.
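A bounded sketch with a placeholder `call_llm`; the JSON plan format, the model names, and the `MAX_WORKERS` cap are illustrative choices:

```python
# Hypothetical orchestrator-workers skeleton with a hard cap on fan-out.
import json

MAX_WORKERS = 5   # cap fan-out so weird inputs cannot spawn 30 workers

def call_llm(prompt: str, model: str = "frontier") -> str:
    """Placeholder for a real model call."""
    if prompt.lower().startswith("plan"):
        return json.dumps(["subtask 1", "subtask 2"])
    return f"<{model} output>"

def orchestrate(task: str) -> str:
    # 1. One planning call: the most expensive, most consequential decision in the system.
    plan_raw = call_llm(f"Plan: list the subtasks needed for this task as a JSON array:\n{task}")
    subtasks = json.loads(plan_raw)[:MAX_WORKERS]

    # Log the plan separately from worker outputs; bad answers usually trace back to the plan.
    print(f"plan={subtasks}")

    # 2. Dispatch workers (sequential here for brevity; parallelize in practice).
    worker_outputs = [call_llm(f"Do this subtask:\n{s}", model="worker") for s in subtasks]

    # 3. Synthesis is a first-class step, not an afterthought.
    return call_llm("Write the final answer from these results:\n" + "\n---\n".join(worker_outputs))
```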
When it earns its keep:
- The task structure is not knowable in advance (“research this topic,” “fix this codebase issue,” “build a report on this company”)
- Different inputs require fundamentally different decomposition
The trade-off: The orchestrator is the most expensive call in your system. It sees the most context, makes the most consequential decision, and you cannot use a small model for it. Budget accordingly.
MLOps reality:
- Cap the number of workers the orchestrator can spawn. Without a cap, weird inputs produce 30 worker calls and a $4 inference bill on a single request.
- Log the orchestrator’s plan separately from the worker outputs. When the system gives a bad answer, the failure is almost always in the plan, not the workers.
- The synthesizer at the end is doing real work — it is reading all the worker outputs and writing the final answer. Treat it as a first-class component, not an afterthought.
Pattern 5 — Evaluator-Optimizer
What it is: A generator LLM produces a draft. An evaluator LLM critiques it. If it passes, ship it. If not, the feedback goes back to the generator. Loop until accepted (or max iterations hit).
Mental model: A writer and an editor.
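A capped-loop sketch with a placeholder `call_llm`; the PASS-or-feedback protocol and the `MAX_ITERATIONS` value are illustrative assumptions:

```python
# Hypothetical evaluator-optimizer loop with a hard iteration cap.

MAX_ITERATIONS = 3   # past this, more loops rarely help

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    return "PASS" if prompt.startswith("Critique") else "<draft>"

def generate(task: str, feedback: str | None) -> str:
    prompt = f"Task:\n{task}"
    if feedback:
        prompt += f"\nRevise your previous draft using this feedback:\n{feedback}"
    return call_llm(prompt)

def evaluate(task: str, draft: str) -> str:
    return call_llm(f"Critique this draft against the task. Reply PASS or give feedback.\nTask: {task}\nDraft: {draft}")

def refine(task: str) -> tuple[str, int]:
    draft, feedback = "", None
    for iteration in range(1, MAX_ITERATIONS + 1):
        draft = generate(task, feedback)
        verdict = evaluate(task, draft)
        if verdict.strip().upper() == "PASS":
            return draft, iteration   # track this distribution in production
        feedback = verdict
    return draft, MAX_ITERATIONS      # max iterations hit: ship best effort or escalate
```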
When it earns its keep:
- The output quality is hard to get right in one shot — long-form writing, code generation, deep research, complex translations
- You have a clear notion of “good” that another LLM can articulate
The trade-off: Iteration cost. A 3-loop run is at minimum 6 LLM calls (3 generate, 3 evaluate). Latency is brutal. Use it for batch and async workloads, not real-time chat.
MLOps reality:
- Always cap the loop; three iterations is usually enough. Past that, the evaluator is either too strict or the generator is fundamentally unable to satisfy it — more loops will not save you.
- Track the iteration distribution. If 90% of your runs are hitting max iterations, your evaluator is too picky and you are burning money. If 90% pass on iteration 1, you do not need this pattern.
- The evaluator’s prompt is the highest-leverage thing in the system. A bad evaluator silently lowers quality across the entire product.
How these patterns compose in real systems
Here is what people miss.
These patterns are not exclusive. They nest.
A real production system might look like this:
- Routing at the edge — decide if this is a Q&A, an action, or a research request
- For Q&A, prompt chaining: rewrite the query → retrieve → synthesize answer
- For research, orchestrator-workers: plan the research → dispatch parallel searches → synthesize
- For high-stakes outputs, wrap the final step in evaluator-optimizer with a 2-iteration cap
Each pattern replaces an “LLM call” node inside another pattern. That is the actual mental model. Build small, compose deliberately.
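A rough sketch of that nesting, with one-line stand-ins for the fuller pattern sketches above; the routing labels and the rule that research outputs are the high-stakes ones are illustrative:

```python
# One-line stand-ins for the fuller sketches in each pattern section above.
def classify_request(text: str) -> str:              # Pattern 2: routing at the edge
    return "research" if "report" in text.lower() else "qa"

def qa_chain(text: str) -> str:                      # Pattern 1: rewrite -> retrieve -> synthesize
    return f"<chained answer for: {text[:30]}>"

def orchestrate(text: str) -> str:                   # Pattern 4: plan -> dispatch workers -> synthesize
    return f"<research report for: {text[:30]}>"

def refine(draft: str, max_iterations: int) -> str:  # Pattern 5: evaluator-optimizer wrapper
    return draft

def handle_request(text: str) -> str:
    route = classify_request(text)
    answer = orchestrate(text) if route == "research" else qa_chain(text)
    if route == "research":                          # wrap only the high-stakes output
        answer = refine(answer, max_iterations=2)
    return answer
```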
The MLOps layer that makes any of this real
Patterns are the easy part. What makes agentic systems survive in production is the operations layer underneath them. Without it, the most elegant pattern in the world will fail in week two.
The non-negotiables:
- Tracing on every call. Every LLM call, every retrieval, every tool use, tagged with a request ID. If you cannot replay a single user’s request end-to-end, you cannot debug it.
- Latency and cost budgets per request. Hard caps. A single request should never be allowed to cost $5 because the orchestrator went weird. (A minimal guard sketch follows this list.)
- Evaluation on real traffic. Not vibes. Not “looks good in the demo.” A held-out set of real production inputs, scored automatically, run on every prompt change.
- Versioned prompts. Prompts are code. Treat them like code — git, code review, rollbacks.
- A kill switch for every model dependency. If your frontier API provider has an outage (and they will), you need a fallback path. Even a degraded one.
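To make the first two concrete, here is a minimal sketch of a per-request context that tags every call with a trace ID and enforces hard cost and latency caps; the budget numbers and the `BudgetExceeded` escape hatch are illustrative assumptions:

```python
# Hypothetical per-request guard: trace ID on every call, hard budget caps.
import time
import uuid

MAX_COST_USD = 0.50     # illustrative per-request cost cap
MAX_LATENCY_S = 20.0    # illustrative per-request latency cap

class BudgetExceeded(RuntimeError):
    pass

class RequestContext:
    def __init__(self) -> None:
        self.trace_id = str(uuid.uuid4())
        self.started = time.monotonic()
        self.cost_usd = 0.0

    def record_call(self, step: str, cost_usd: float) -> None:
        self.cost_usd += cost_usd
        elapsed = time.monotonic() - self.started
        # Every call is tagged with the trace ID so a single request can be replayed end-to-end.
        print(f"trace={self.trace_id} step={step} cost=${self.cost_usd:.3f} t={elapsed:.1f}s")
        if self.cost_usd > MAX_COST_USD or elapsed > MAX_LATENCY_S:
            raise BudgetExceeded(f"trace={self.trace_id} blew its budget at step '{step}'")
```

In practice, a context like this would be created at the edge and passed into every pattern call, so the same guard wraps chains, routers, and orchestrators alike.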
These are not nice-to-haves. They are the difference between an AI feature and an AI product.
The closing thought
The teams winning with agentic AI right now are not the ones with the most clever architectures.
They are the ones using the simplest workflow that solves the problem, instrumented well enough that they can see what is actually happening, and disciplined enough not to add complexity until the data demands it.
Start with the simplest pattern. Add the next one only when the metrics force you to.
That is the entire game.
Written from the perspective of an MLOps and AI engineer building production systems.
