Renato Martins
Engineering Leader — AI Systems, Experimentation, Developer Platforms
About
I build systems that answer one critical question: is this change actually better?
Over the past 20+ years, I’ve led engineering teams at Amazon, Microsoft, and startups to design and scale experimentation platforms, data systems, and AI-powered tools that enable teams to ship with confidence.
My work has focused on measuring impact at scale, detecting regressions before they reach customers, and building platforms that make the right thing the easy thing for engineers.
More recently, I’ve been focused on applying LLMs to improve evaluation systems, developer workflows, and product behavior measurement at scale.
Selected Work & Thinking
- Building systems that decide what ships: experimentation and evaluation platforms that support high-confidence product decisions at scale.
- From experimentation to LLM evals: extending measurement beyond metrics into model behavior, qualitative regression detection, and prompt/system-level change management.
- Designing pits of success: developer platforms and workflows that embed safe defaults, guardrails, and clarity by design.
Blog Post
Why LLM Evaluation Is Fundamentally Harder Than A/B Testing
I’ve spent much of my career building experimentation systems. At their core, those systems answer a deceptively simple question: is this change actually better? In traditional product development, that usually means comparing a control and a treatment, looking at a defined set of metrics, and deciding whether the results are strong enough to ship.
That problem is already hard. You have to deal with noisy data, confounding factors, guardrail metrics, latency between cause and effect, and the fact that what improves one metric can quietly harm another. But compared with evaluating large language models, A/B testing is often the easier problem.
Why? Because in A/B testing, the thing you are evaluating is usually a product change with a relatively clear behavioral surface. In LLM systems, the thing you are evaluating is behavior itself — behavior that is flexible, contextual, and often only partially observable through standard product metrics.
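For contrast, the "easier problem" side has a compact canonical form. The sketch below runs a two-proportion z-test on a control/treatment conversion metric; the counts are invented for illustration, and real platforms layer on guardrail metrics, sequential testing, and multiple-comparison corrections.

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Is the treatment's conversion rate different from the control's?
    Returns (absolute lift, z-score)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return p_b - p_a, (p_b - p_a) / se

lift, z = two_proportion_ztest(conv_a=480, n_a=10_000,
                               conv_b=560, n_b=10_000)
# |z| > 1.96 corresponds to p < 0.05, two-sided.
ship = abs(z) > 1.96
```

The readout is one number per arm and a threshold. The rest of this post is about why LLM changes rarely reduce to that shape.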
When an LLM changes, the question is not just whether click-through rate improved or whether task completion rose. The harder questions are things like:
- Did the model become more helpful, or just more verbose?
- Did it become more agreeable in a way that looks pleasant but reduces honesty?
- Did web search improve overall while quietly getting worse on a small but important class of queries?
- Did a prompt change reduce one class of failures while introducing a subtler one somewhere else?
These are not purely quantitative questions. They sit at the seam between measurable performance and human judgment. That makes evaluation infrastructure for language models fundamentally different from traditional experimentation platforms.
The first major difference is that LLM behavior is high dimensional. A product experiment might affect conversion, retention, or latency. An LLM change can affect tone, factuality, refusal behavior, sycophancy, formatting, instruction following, consistency, and domain-specific competence — all at once. Some of those things are measurable with automated harnesses. Some require carefully designed rubrics. Some only emerge over time.
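One way to keep those dimensions from collapsing into a single misleading number is to score and report them separately. A minimal sketch, with hypothetical rubric dimensions and 1-5 grader ratings; a single blended score would hide exactly the trade-offs we care about:

```python
from dataclasses import dataclass

# Hypothetical rubric dimensions; a real rubric defines its own.
DIMENSIONS = ("helpfulness", "factuality", "instruction_following", "tone")

@dataclass
class RubricScore:
    response_id: str
    scores: dict  # dimension -> 1..5 rating from a grader

def summarize(results, floor=3.0):
    """Per-dimension means, plus flags for any dimension below the floor."""
    means = {}
    for dim in DIMENSIONS:
        vals = [r.scores[dim] for r in results]
        means[dim] = sum(vals) / len(vals)
    flagged = [d for d, m in means.items() if m < floor]
    return means, flagged

batch = [
    RubricScore("r1", {"helpfulness": 5, "factuality": 4,
                       "instruction_following": 5, "tone": 4}),
    RubricScore("r2", {"helpfulness": 4, "factuality": 1,
                       "instruction_following": 4, "tone": 5}),
]
means, flagged = summarize(batch)
# "factuality" gets flagged even though the other dimensions look strong.
```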
The second difference is that the regression surface is much larger. In a standard A/B test, you usually know the feature you changed and the user path it touches. In an LLM-based system, a change to a system prompt, model checkpoint, tool-calling policy, or retrieval strategy can create regressions across a surprisingly wide set of behaviors. The system may look better in aggregate while becoming less reliable in a class of edge cases that matter disproportionately.
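A sliced readout makes this failure mode visible. The sketch below compares per-slice accuracy between a baseline and a candidate and flags any slice that regressed beyond a tolerance, regardless of the aggregate; the slice names and numbers are illustrative.

```python
def find_slice_regressions(baseline, candidate, tolerance=0.02):
    """baseline/candidate: dict slice_name -> accuracy on that slice.
    Returns slices where the candidate is worse by more than `tolerance`,
    even if the overall average improved."""
    return {
        s: round(baseline[s] - candidate[s], 3)
        for s in baseline
        if baseline[s] - candidate[s] > tolerance
    }

baseline  = {"head_queries": 0.90, "tail_queries": 0.82, "multilingual": 0.75}
candidate = {"head_queries": 0.94, "tail_queries": 0.76, "multilingual": 0.76}

regressions = find_slice_regressions(baseline, candidate)
# The aggregate moved up, but tail_queries dropped six points.
```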
The third difference is that qualitative regressions often matter more than aggregate wins. A model that is slightly more capable but occasionally much more misleading may be a net negative for users. A model that sounds more confident while being wrong more often can create damage that average metrics fail to capture. In other words, the cost of being wrong in LLM systems is often asymmetrical, and the tails matter.
Evaluation systems for AI cannot just optimize for “what improved.” They also have to protect against “what became dangerously worse.”
This is why robust evaluation for LLM systems needs to look more like a layered safety and quality discipline than a single experiment readout. Good systems combine multiple forms of evidence:
- Automated eval harnesses for repeatability and scale
- Regression suites that preserve hard-won behavior over time
- Human review loops for nuanced judgments
- Prompt and configuration versioning so changes are understandable and reversible
- Launch-time operational discipline for high-stakes model and prompt changes
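Those layers only help if a launch actually consults all of them. One way to wire them together is a gate that returns blockers rather than a bare pass/fail, so the readout stays explainable; this is a sketch of the idea, not a prescription:

```python
def launch_gate(auto_evals_pass, regressions,
                human_review_approved, change_is_versioned):
    """Combine the evidence layers into a go/no-go plus a list of
    blockers, so 'no' always comes with a reason."""
    blockers = []
    if not auto_evals_pass:
        blockers.append("automated eval harness failing")
    if regressions:
        blockers.append(f"regression suite flagged: {sorted(regressions)}")
    if not human_review_approved:
        blockers.append("human review not signed off")
    if not change_is_versioned:
        blockers.append("change is not versioned/reversible")
    return (len(blockers) == 0), blockers

ok, blockers = launch_gate(
    auto_evals_pass=True,
    regressions={"tail_queries"},
    human_review_approved=True,
    change_is_versioned=True,
)
# A single regressed slice is enough to block, with the reason attached.
```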
Another challenge is organizational, not just technical. In traditional experimentation, you can often centralize the platform and let product teams consume it. In AI systems, ownership boundaries are blurrier. Research teams, product engineers, prompt engineers, TPMs, and launch managers all touch the system from different angles. A good eval platform has to serve all of them without collapsing under the weight of competing incentives.
That makes the best evaluation platforms not just measurement systems, but coordination systems. They create shared truth. They make trade-offs visible. They reduce ambiguity during launches. And most importantly, they help teams ship with confidence rather than on intuition.
What I find especially compelling is that this is still early. We do not yet have universally accepted patterns for measuring model quality the way we do for traditional application metrics. The field is still inventing its norms: how to detect behavioral drift, how to compare prompt quality, how to define harness parity, how to blend quantitative and qualitative evidence into something operationally useful.
That uncertainty is exactly what makes the work interesting. The challenge is not simply to build dashboards or bulk runners. It is to build systems that help teams answer hard questions about model behavior with enough rigor that they can move quickly without losing trust.
In the long run, I believe evaluation will become one of the defining disciplines of AI engineering. The companies that do it well will not just build more capable models. They will build systems people can actually rely on.
Visual Diagram
A Practical View of an LLM Evaluation Pipeline
The goal is not just to score models. It is to make launches safer, regressions visible, and trade-offs understandable.
Blog Post
What Breaks When You Scale LLM Eval Systems
Most evaluation systems look reasonable at small scale. A team has a handful of prompts, a set of golden examples, and a few people who understand the context well enough to interpret results. In that world, the eval system feels close to the work. The same people making the change are often the same people reviewing the outcome.
But scale changes the nature of the problem. As more teams rely on a shared model, as more prompts and product behaviors accumulate, and as launch cadence increases, evaluation stops being a local quality check and becomes shared infrastructure. That is the point where things begin to break.
The first thing that breaks is shared meaning. At small scale, everyone can keep a rough mental model of what “better” means. At larger scale, different teams optimize for different things: product engagement, latency, cost, refusal behavior, safety, tone, instruction following, or user trust. If the eval platform does not create a shared language for these trade-offs, people start reading the same results in incompatible ways.
The second thing that breaks is ownership clarity. Evaluation systems sit at the seam between research, product, platform, and operations. That seam is productive when responsibilities are explicit, but fragile when they are assumed. Who owns prompt versioning? Who decides which evals are required for launch? Who maintains shared regression suites? At scale, ambiguity in those boundaries creates duplicated work in some areas and neglect in others.
The third thing that breaks is trust in the harness. Teams will only rely on evaluation systems if they believe the system reflects reality. Harness parity becomes critical: are offline results actually representative of product behavior in production? If the answer is often “not really,” teams fall back to intuition, anecdotes, or slow manual review. The platform becomes something people route around instead of something they depend on.
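Parity can at least be monitored. A minimal sketch: score the same metric offline and on a sample of production traffic, and track the gap over time. The numbers here are invented.

```python
def parity_gap(offline_scores, production_scores):
    """Mean offline score minus mean production score on the same metric.
    A persistent positive gap means the harness is optimistic about
    reality, and teams will (rightly) stop trusting it."""
    off = sum(offline_scores) / len(offline_scores)
    prod = sum(production_scores) / len(production_scores)
    return round(off - prod, 3)

gap = parity_gap(
    offline_scores=[0.91, 0.88, 0.93, 0.90],
    production_scores=[0.84, 0.80, 0.86, 0.82],
)
# A gap of several points, held over multiple runs, is a harness bug.
```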
Another common failure mode is metric overfitting. As systems mature, there is a temptation to optimize what is easy to track rather than what matters most. But many of the most important model regressions are subtle. Behavioral drift, tone changes, over-agreeableness, brittle tool use, and degraded edge-case performance often evade the metrics that look cleanest on dashboards. If the system rewards only what is easy to measure, it gradually becomes blind to what matters.
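Some of these subtle signals can still be tracked cheaply once you name them. The sketch below flags behavioral signals whose shift between baseline and candidate exceeds an allowed delta in either direction; the signal names ("refusal_rate", "hedging_rate") and thresholds are hypothetical.

```python
def drift_flags(baseline, candidate, max_delta):
    """baseline/candidate: dict behavioral signal -> observed rate (0..1).
    Flags signals whose absolute shift exceeds that signal's allowed
    delta, in either direction; a drop can matter as much as a rise."""
    return {
        sig: round(candidate[sig] - baseline[sig], 3)
        for sig in baseline
        if abs(candidate[sig] - baseline[sig]) > max_delta[sig]
    }

baseline  = {"refusal_rate": 0.04, "hedging_rate": 0.10}
candidate = {"refusal_rate": 0.05, "hedging_rate": 0.21}

flags = drift_flags(baseline, candidate,
                    max_delta={"refusal_rate": 0.02, "hedging_rate": 0.05})
# Refusal rate moved within tolerance; hedging more than doubled.
```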
Scale also makes operational moments more chaotic. Model launches compress time. Multiple teams need answers quickly. People want results they can trust, but they also want them now. A good eval platform has to support calm decision-making under pressure. That means bulk runners must be reliable, dashboards legible, version history easy to inspect, and rollback paths obvious. During a launch, operational discipline matters as much as methodological rigor.
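Version history and rollback are the easiest of these to make concrete. A toy append-only prompt registry, in which rolling back is itself a new version so history stays linear; a sketch, not a production design:

```python
class PromptRegistry:
    """Append-only history of prompt versions, so a launch can inspect
    what changed and roll back without archaeology."""

    def __init__(self):
        self._history = []  # list of (version, text)

    def publish(self, text):
        version = len(self._history) + 1
        self._history.append((version, text))
        return version

    def current(self):
        return self._history[-1]

    def rollback(self, to_version):
        # Rolling back republishes an old text as a new version,
        # so the history never forks or loses entries.
        _, text = self._history[to_version - 1]
        return self.publish(text)

reg = PromptRegistry()
reg.publish("You are a concise assistant.")
reg.publish("You are a concise assistant. Always cite sources.")
reg.rollback(to_version=1)
version, text = reg.current()
# The rollback produced version 3, carrying version 1's text.
```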
What breaks, in short, is not just code. Coordination breaks. Trust breaks. Interpretability breaks. The eval system starts to fail not because any one part is obviously wrong, but because the organization has outgrown the assumptions the system was built on.
The response is not to add more dashboards or more evals indiscriminately. It is to make the platform more durable in a few specific ways: create clear ownership boundaries, prioritize harness parity, invest in regression suites that preserve hard-won behavior, and design workflows that make the safe path the easy path. The best evaluation systems do not merely measure models. They help organizations think clearly together.
That is why scaling LLM evals is such an interesting engineering problem. It is a platform problem, a product problem, and an organizational design problem at the same time. The teams that solve it well will not just ship faster. They will ship with justified confidence.