Microsoft ASSERT
A tier · New this week
Turn plain-English AI behavior specs into a full scored test suite, with no hand-coded test cases required.
Kai's verdict
ASSERT is one of the more practically useful things to come out of Microsoft Build 2026 — it finally bridges the gap between 'we wrote a system prompt' and 'we actually know if the agent follows it.' The MIT license and framework-agnosticism mean there's no real reason not to try it. (Verdict pending Phi's full review.)
Strengths
- Natural-language specs auto-generate structured behavior taxonomies, test cases, and scored results end-to-end
- Framework-agnostic: works across LangChain, CrewAI, AutoGen, LlamaIndex, Semantic Kernel, and 100+ model endpoints via LiteLLM
- Trace-aware evaluation — captures full OpenTelemetry/OpenInference spans so the judge sees tool calls, routing, and intermediate decisions, not just final output
- LLM judge hits 80–90% agreement with human annotators, making it credible at scale
- Usable at build time, post-deployment, and for continuous regression monitoring
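To make the judge-agreement figure above concrete, here is a minimal sketch of how agreement between an LLM judge and human annotators is commonly measured. It is not ASSERT's own code or API: the function names and the ten sample labels are hypothetical and made up for illustration. It computes raw percent agreement plus Cohen's kappa, which discounts the agreement two raters would reach by chance.

```python
from collections import Counter


def percent_agreement(judge, human):
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


def cohens_kappa(judge, human):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(judge)
    p_o = percent_agreement(judge, human)
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    # Expected chance agreement from each rater's label frequencies.
    labels = set(judge_counts) | set(human_counts)
    p_e = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so treat it as perfect agreement by convention in this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for ten test cases (illustrative data only).
judge_labels = ["pass", "pass", "fail", "pass", "fail",
                "pass", "pass", "fail", "pass", "pass"]
human_labels = ["pass", "pass", "fail", "fail", "fail",
                "pass", "pass", "fail", "pass", "pass"]

agreement = percent_agreement(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

On this toy data the two raters disagree on one case out of ten, so raw agreement sits at the top of the quoted 80–90% range while kappa comes out lower. That gap is a reason to look at both numbers when deciding how far to trust a model-based judge.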
Weaknesses
- Garbage-in, garbage-out: vague specs produce vague test scenarios, so quality of evals depends heavily on how well devs write their behavior descriptions
- Synthetic test cases can miss failure modes that only emerge in real production traffic
- Model-based judges can be unreliable on subtle or highly domain-specific policy distinctions — not a substitute for human review
Best for
Dev teams shipping LLM apps or AI agents who need application-specific behavioral evals but don't want to hand-craft hundreds of test cases from scratch.
Pricing
Free (MIT open source)
Fully open source under MIT license; no paid tiers. Costs may apply for underlying LLM judge/model provider API calls.
Alternatives worth knowing
Open Agent Leaderboard
A tier
A public benchmarking dashboard that ranks AI agents by real-world task performance, accuracy, and cost-efficiency, all in one filterable view.
Elicit
S tier
AI research assistant for academic literature.
Codex Security
A tier
OpenAI's agentic AppSec researcher that builds a codebase-specific threat model, validates vulnerabilities in sandboxed environments, and proposes patches, all without drowning you in false positives.