Renato Martins

Engineering Executive — AI, A/B Experimentation, Data Platforms & Distributed Systems

About

I build systems that answer one critical question: is this change actually better?

Over the past 20+ years, I’ve led engineering teams at Amazon, Microsoft, and startups to design and scale experimentation platforms, distributed systems, healthcare data infrastructure, and AI-powered developer tooling.

My work has focused on measuring impact at scale, detecting regressions before they reach customers, and building platforms that make the right thing the easy thing for engineers.

More recently, I’ve been focused on applying LLMs to improve evaluation systems, developer workflows, semantic understanding, and product behavior measurement at scale.

I’m especially interested in systems operating at the intersection of AI, infrastructure, and human decision-making — including model evaluation platforms, AI-assisted healthcare interoperability, and explainable AI systems for regulated domains such as healthcare and financial technology.

Selected Work & Thinking

Building systems that decide what ships: experimentation and evaluation platforms that support high-confidence product decisions at scale.
From experimentation to LLM evals: extending measurement beyond metrics into model behavior, qualitative regression detection, prompt management, and launch-time operational confidence.
AI-assisted healthcare interoperability: designing systems that learn and explain hospital-specific EMR message flows, semantic meaning, and integration anomalies to accelerate onboarding and improve reliability.
Explainable AI for regulated domains: designing RAG-based systems that combine deterministic logic with AI-assisted reasoning for personalized tax and financial guidance.
Designing pits of success: developer platforms and workflows that embed safe defaults, guardrails, observability, and clarity by design.

Blog Post

Why LLM Evaluation Is Fundamentally Harder Than A/B Testing

A short essay on measurement, regressions, and trust in AI systems

I’ve spent much of my career building experimentation systems. At their core, those systems answer a deceptively simple question: is this change actually better? In traditional product development, that usually means comparing a control and a treatment, looking at a defined set of metrics, and deciding whether the results are strong enough to ship.

That problem is already hard. You have to deal with noisy data, confounding factors, guardrail metrics, latency between cause and effect, and the fact that what improves one metric can quietly harm another. But compared with evaluating large language models, A/B testing is often the easier problem.

Why? Because in A/B testing, the thing you are evaluating is usually a product change with a relatively clear behavioral surface. In LLM systems, the thing you are evaluating is behavior itself — behavior that is flexible, contextual, and often only partially observable through standard product metrics.

When an LLM changes, the question is not just whether click-through rate improved or whether task completion rose. The harder questions are things like:

Did the model become more helpful, or just more verbose?
Did it become more agreeable in a way that looks pleasant but reduces honesty?
Did web search improve overall while quietly getting worse on a small but important class of queries?
Did a prompt change reduce one class of failures while introducing a subtler one somewhere else?

These are not purely quantitative questions. They sit at the seam between measurable performance and human judgment. That makes evaluation infrastructure for language models fundamentally different from traditional experimentation platforms.

The first major difference is that LLM behavior is high dimensional. A product experiment might affect conversion, retention, or latency. An LLM change can affect tone, factuality, refusal behavior, sycophancy, formatting, instruction following, consistency, and domain-specific competence — all at once. Some of those things are measurable with automated harnesses. Some require carefully designed rubrics. Some only emerge over time.

The second difference is that the regression surface is much larger. In a standard A/B test, you usually know the feature you changed and the user path it touches. In an LLM-based system, a change to a system prompt, model checkpoint, tool-calling policy, or retrieval strategy can create regressions across a surprisingly wide set of behaviors. The system may look better in aggregate while becoming less reliable in a class of edge cases that matter disproportionately.

The third difference is that qualitative regressions often matter more than aggregate wins. A model that is slightly more capable but occasionally much more misleading may be a net negative for users. A model that sounds more confident while being wrong more often can create damage that average metrics fail to capture. In other words, the cost of being wrong in LLM systems is often asymmetrical, and the tails matter.

Evaluation systems for AI cannot just optimize for “what improved.” They also have to protect against “what became dangerously worse.”

This is why robust evaluation for LLM systems needs to look more like a layered safety and quality discipline than a single experiment readout. Good systems combine multiple forms of evidence:

Automated eval harnesses for repeatability and scale
Regression suites that preserve hard-won behavior over time
Human review loops for nuanced judgments
Prompt and configuration versioning so changes are understandable and reversible
Launch-time operational discipline for high-stakes model and prompt changes

Another challenge is organizational, not just technical. In traditional experimentation, you can often centralize the platform and let product teams consume it. In AI systems, ownership boundaries are blurrier. Research teams, product engineers, prompt engineers, TPMs, and launch managers all touch the system from different angles. A good eval platform has to serve all of them without collapsing under the weight of competing incentives.

That makes the best evaluation platforms not just measurement systems, but coordination systems. They create shared truth. They make trade-offs visible. They reduce ambiguity during launches. And most importantly, they help teams ship with confidence rather than intuition.

What I find especially compelling is that this is still early. We do not yet have universally accepted patterns for measuring model quality the way we do for traditional application metrics. The field is still inventing its norms: how to detect behavioral drift, how to compare prompt quality, how to define harness parity, how to blend quantitative and qualitative evidence into something operationally useful.

That uncertainty is exactly what makes the work interesting. The challenge is not simply to build dashboards or bulk runners. It is to build systems that help teams answer hard questions about model behavior with enough rigor that they can move quickly without losing trust.

In the long run, I believe evaluation will become one of the defining disciplines of AI engineering. The companies that do it well will not just build more capable models. They will build systems people can actually rely on.

Case Study

Hospital Message Flow Analyzer: AI-Assisted Data Onboarding for Healthcare

A product and system design concept for understanding hospital-specific EMR message flows

Healthcare application onboarding is often slowed by a deceptively hard problem: every hospital sends data a little differently. Even when systems use familiar standards such as HL7, real-world implementations vary widely across sites, vendors, workflows, and local operational practices. A message that means one thing at one hospital may carry a slightly different semantic meaning at another.

The core idea behind the Hospital Message Flow Analyzer is to help engineering and implementation teams learn how a hospital actually behaves before and during onboarding. Instead of relying only on documentation, manual mapping, or tribal knowledge, the system observes message streams, detects patterns, and helps teams understand the real lifecycle of visits, patients, orders, results, and operational events.

The product goal is simple: reduce onboarding friction, improve data reliability, and make healthcare application integrations safer and more predictable. In practice, that means answering questions such as: What messages define a visit lifecycle? Which events update state versus correct prior information? Which fields are stable identifiers? Which message patterns are unusual for this hospital? Where are downstream applications likely to break?

Customer and Operational Value

Faster onboarding: reduce the manual effort required to understand site-specific EMR behavior.
Higher reliability: detect unexpected message changes before they break downstream applications.
Better semantic understanding: distinguish between similar-looking events that have different operational meaning.
Lower support burden: give implementation and support teams evidence-backed explanations of data behavior.
Safer healthcare workflows: improve confidence that clinical and operational applications are acting on correct state.

How the System Works

The system ingests historical and live message streams from a hospital interface, normalizes them into a canonical representation, and builds a site-specific model of message behavior. AI assists with semantic interpretation, clustering similar flows, identifying likely state transitions, and summarizing unusual patterns in language that implementation teams can act on.

For example, the system might learn that a particular hospital uses a sequence of ADT messages to represent discharge, correction, and visit-state updates. It can then surface that behavior as a discovered flow, highlight the fields that drive state transitions, and warn teams when a later message violates the expected pattern.

Reference Architecture

Message ingestion: collect HL7, JSON, or other interface messages from historical files and live feeds.
Canonical normalization: parse site-specific fields into a common healthcare data model.
Flow discovery: identify repeated sequences, state transitions, identifiers, and timing relationships.
Semantic analysis: use AI to infer likely business meaning and generate human-readable explanations.
Anomaly detection: detect message drift, missing events, schema changes, and unexpected state transitions.
Onboarding workspace: provide dashboards, flow diagrams, examples, confidence scores, and implementation notes.

Responsible AI Design

AI should assist interpretation, not silently rewrite clinical truth. The system should preserve raw messages, canonical transformations, model-generated explanations, and human approvals as separate artifacts. Recommendations should include evidence: example messages, observed frequencies, confidence scores, and the exact fields or sequences that led to the conclusion.

This separation matters in healthcare. Deterministic transformations should remain explicit and testable. AI-generated interpretations should be reviewable, auditable, and grounded in observed data. The safest design treats AI as a powerful onboarding accelerator, while keeping humans in control of production mappings and clinical semantics.

Success Metrics

Reduction in time to onboard a new hospital interface
Reduction in mapping defects discovered after go-live
Percentage of flows automatically identified with high confidence
Number of message drift events detected before downstream impact
Reduction in support escalations caused by unclear data semantics
Improvement in application-level data completeness and reliability

The broader lesson is that AI is especially valuable when it helps teams understand messy, high-stakes systems without hiding the evidence. In healthcare data infrastructure, the winning approach is not magic automation. It is trustworthy acceleration: helping experts see patterns faster, make better decisions, and ship safer integrations.

Case Study

AI-Powered Tax Guidance: From Filing Tool to Financial Companion

A product and system design case study for proactive, explainable, AI-assisted tax optimization

Most people experience tax filing as a once-a-year burden. The product opportunity is to turn that episodic workflow into a continuous, personalized financial guidance system: one that detects relevant changes in a user’s financial life and proactively recommends deductions, credits, or filing adjustments when they can still act on them.

The customer value is straightforward: personalization, proactivity, and transparency. Instead of waiting until filing season, the system can recognize events such as new expenses, income changes, document uploads, or relevant banking activity and produce timely recommendations. Each recommendation should explain why the user may qualify, cite the applicable rule, and help the user understand the decision rather than simply accept it.

The business value is equally important. Timely recommendations create useful engagement throughout the year, increasing retention and lifetime value. The more the system learns from user behavior and outcomes, the better it can prioritize future recommendations, creating a data flywheel around user insight, trust, and product differentiation.

Technically, this is a strong fit for an AI-assisted architecture because much of the complexity in tax preparation comes from unstructured inputs and changing rules. LLMs can help parse receipts, invoices, PDFs, emails, and other documents into canonical JSON. Streaming ingestion from banks or government sources can turn static annual filing into event-driven insight. Retrieval-Augmented Generation can ground recommendations in up-to-date tax rules and user context, reducing hallucination risk and improving explainability.

High-Level Architecture

User uploads or forwards a document, or a banking/government event arrives.
A foundation-model ETL step extracts and normalizes the input into canonical JSON.
The normalized event is published to Kafka.
A recommendation service consumes the event, retrieves user context, and queries a tax-rule knowledge base using RAG.
An eligibility and ranking layer selects the most applicable rule and computes deterministic amounts.
An LLM generates a personalized, cited tip for the user.
The tip is delivered through push or in-app messaging, and engagement events flow back into analytics.

Why RAG

RAG is the right starting point because tax guidance depends on current, authoritative rules. It allows the system to retrieve relevant tax law, policy text, and user-specific context at request time, while keeping deterministic calculations separate from the language model. This supports explainability, lowers compliance risk, and creates a path toward more advanced semantic search without requiring constant fine-tuning.

Responsible AI and Compliance

The system should be designed around consent, traceability, and auditability. Only consented data should be processed. Every recommendation should link to the source rule and version used. Deterministic calculations should remain rule-based, with the model used primarily for extraction, reasoning support, summarization, and user-facing explanation. This keeps the system more reliable in a regulated environment while still taking advantage of AI where it adds leverage.

Success Metrics

Tip-to-action conversion
Incremental refund or savings captured
Retention and Net Promoter Score
Document upload rate and extraction success rate
Eligibility error rate
P95 event-to-tip latency
Dead-letter queue rate

Validation Strategy

The safest rollout begins with replaying anonymized historical data in staging, followed by internal dogfooding, limited external beta testing, and then controlled A/B experimentation. Exposure should be dialed up gradually against predefined success and guardrail metrics, ensuring the system improves customer outcomes without introducing unacceptable accuracy, privacy, or compliance risk.

The broader lesson: AI is most valuable in regulated domains when it is used to increase clarity, not replace accountability. The product should feel intelligent and proactive, but the system underneath must remain explainable, auditable, and grounded in trusted sources.

Visual Diagram

A Practical View of an LLM Evaluation Pipeline

The goal is not just to score models. It is to make launches safer, regressions visible, and trade-offs understandable.

1. Change Input Model checkpoint, system prompt, retrieval policy, tool use, or product behavior change.

2. Evaluation Harness Bulk runners, scenario suites, prompt sets, replay traffic, and reference tasks.

3. Signals Quantitative metrics, rubric scores, human review, latency, refusal rate, and safety signals.

4. Regression Detection Compare against baseline, detect drift, identify edge-case failures, surface confidence.

5. Launch Decision Ship, rollback, iterate, or narrow rollout based on product impact and risk tolerance.

Platform Layer Versioning, CI hooks, dashboards, prompt management, reproducibility, ownership boundaries.

Operational Layer Launch readiness, incident response, regression triage, rollback discipline.

Decision Layer Shared truth for research, product, and engineering teams under time pressure.

Blog Post

What Breaks When You Scale LLM Eval Systems

A short essay on why evaluation gets harder as usage, teams, and model surface area expand

Most evaluation systems look reasonable at small scale. A team has a handful of prompts, a set of golden examples, and a few people who understand the context well enough to interpret results. In that world, the eval system feels close to the work. The same people making the change are often the same people reviewing the outcome.

But scale changes the nature of the problem. As more teams rely on a shared model, as more prompts and product behaviors accumulate, and as launch cadence increases, evaluation stops being a local quality check and becomes shared infrastructure. That is the point where things begin to break.

The first thing that breaks is shared meaning. At small scale, everyone can keep a rough mental model of what “better” means. At larger scale, different teams optimize for different things: product engagement, latency, cost, refusal behavior, safety, tone, instruction following, or user trust. If the eval platform does not create a shared language for these trade-offs, people start reading the same results in incompatible ways.

The second thing that breaks is ownership clarity. Evaluation systems sit at the seam between research, product, platform, and operations. That seam is productive when responsibilities are explicit, but fragile when they are assumed. Who owns prompt versioning? Who decides which evals are required for launch? Who maintains shared regression suites? At scale, ambiguity in those boundaries creates duplicated work in some areas and neglect in others.

The third thing that breaks is trust in the harness. Teams will only rely on evaluation systems if they believe the system reflects reality. Harness parity becomes critical: are offline results actually representative of product behavior in production? If the answer is often “not really,” teams fall back to intuition, anecdotes, or slow manual review. The platform becomes something people route around instead of something they depend on.

Another common failure mode is metric overfitting. As systems mature, there is a temptation to optimize what is easy to track rather than what matters most. But many of the most important model regressions are subtle. Behavioral drift, tone changes, over-agreeableness, brittle tool use, and degraded edge-case performance often evade the metrics that look cleanest on dashboards. If the system rewards only what is easy to measure, it gradually becomes blind to what matters.

Scale also makes operational moments more chaotic. Model launches compress time. Multiple teams need answers quickly. People want results they can trust, but they also want them now. A good eval platform has to support calm decision-making under pressure. That means bulk runners must be reliable, dashboards legible, version history easy to inspect, and rollback paths obvious. During a launch, operational discipline matters as much as methodological rigor.

What breaks, in short, is not just code. Coordination breaks. Trust breaks. Interpretability breaks. The eval system starts to fail not because any one part is obviously wrong, but because the organization has outgrown the assumptions the system was built on.

The response is not to add more dashboards or more evals indiscriminately. It is to make the platform more durable in a few specific ways: create clear ownership boundaries, prioritize harness parity, invest in regression suites that preserve hard-won behavior, and design workflows that make the safe path the easy path. The best evaluation systems do not merely measure models. They help organizations think clearly together.

That is why scaling LLM evals is such an interesting engineering problem. It is a platform problem, a product problem, and an organizational design problem at the same time. The teams that solve it well will not just ship faster. They will ship with justified confidence.