LLM red-teaming without lobotomizing your product
It's easy to make a model refuse everything. The hard part is making it refuse the right things while still being useful.
There's a pattern we see in early LLM red-team programs. A team runs a battery of jailbreak prompts, finds dozens of failures, and reacts by piling on system-prompt restrictions and rejection patterns. Two weeks later, the product is safe — and useless. Users escalate, the guardrails get rolled back, and the team is back where they started, except now without trust in their own assessment process.
The fix is to treat red-teaming as a measurement discipline, not a compliance ritual.
Three categories worth separating
- Capability harms — outputs that themselves cause damage (CSAM, malware, etc.).
- Trust harms — outputs that mislead users about facts or the product itself.
- Product harms — outputs that break the contract of what the system is supposed to do.
Lumping these together is how teams end up with blanket refusals. Capability harms need hard guardrails. Trust harms need calibration and citations. Product harms need scoping and tool-call discipline.
Agentic systems multiply the surface
If your product gives the model tools — and most production systems now do — your attack surface is no longer just the output. It's the sequence of tool calls, what the tools have access to, and whether intermediate outputs flow back into context. We've seen incidents where a single attacker-controlled string in an upstream document caused an agent to exfiltrate secrets, send emails, and write back to a vector store — all from a prompt the user never wrote.
A working harness
# anatomy of a useful LLM red-team run 1. Generate adversarial inputs across capability/trust/product axes 2. Run them through the deployed product (not just the model) 3. Score outcomes against an explicit rubric, not "did the model refuse?" 4. Stratify by user persona (logged-in/anonymous/admin) 5. Replay top failures with mitigations applied; measure regression
The output is a graph of risk-vs-utility you can show a product team. That's the artifact that actually moves the deployment forward.
Working on something like this?
Tell us about your stack. We'll come back with a scoped plan in two business days.