Microsoft ASSERT

A tier · New this week

Turn plain-English AI behavior specs into a full scored test suite — no hand-coded test cases required.

Kai's verdict

ASSERT is one of the more practically useful things to come out of Microsoft Build 2026 — it finally bridges the gap between 'we wrote a system prompt' and 'we actually know if the agent follows it.' The MIT license and framework-agnosticism mean there's no real reason not to try it. (Verdict pending Phi's full review.)

Strengths

Natural-language specs auto-generate structured behavior taxonomies, test cases, and scored results end-to-end
Framework-agnostic: works across LangChain, CrewAI, AutoGen, LlamaIndex, Semantic Kernel, and 100+ model endpoints via LiteLLM
Trace-aware evaluation — captures full OpenTelemetry/OpenInference spans so the judge sees tool calls, routing, and intermediate decisions, not just final output
LLM judge reaches 80–90% agreement with human annotators, which makes automated scoring credible for large test suites
Usable at build time, post-deployment, and for continuous regression monitoring
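
To make the judge-agreement figure concrete, here is a minimal Python sketch of how agreement between an LLM judge and human annotators is commonly measured. It is not ASSERT's API or its published methodology: the function names and the sample labels are hypothetical, and Cohen's kappa is included only as a standard chance-corrected companion to raw percent agreement.

```python
from collections import Counter


def percent_agreement(judge_labels, human_labels):
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def cohens_kappa(judge_labels, human_labels):
    """Agreement corrected for the agreement expected by chance."""
    n = len(judge_labels)
    observed = percent_agreement(judge_labels, human_labels)
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    labels = set(judge_labels) | set(human_labels)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical verdicts for ten test cases (illustrative data only).
    judge = ["pass", "pass", "fail", "pass", "fail",
             "pass", "pass", "fail", "pass", "pass"]
    human = ["pass", "pass", "fail", "fail", "fail",
             "pass", "pass", "fail", "pass", "pass"]
    print(f"agreement: {percent_agreement(judge, human):.2f}")  # agreement: 0.90
    print(f"kappa:     {cohens_kappa(judge, human):.2f}")       # kappa:     0.78
```

A raw agreement of 90% can look less impressive once chance agreement is removed, which is why it is worth checking both numbers on a hand-labeled sample of your own traces before trusting a judge at scale.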

Weaknesses

Garbage in, garbage out: vague specs produce vague test scenarios, so the quality of the evals depends heavily on how well developers write their behavior descriptions
Synthetic test cases can miss failure modes that only emerge in real production traffic
Model-based judges can be unreliable on subtle or highly domain-specific policy distinctions — not a substitute for human review

Best for

Dev teams shipping LLM apps or AI agents who need application-specific behavioral evals but don't want to hand-craft hundreds of test cases from scratch.

Pricing

Free (MIT open source)

Fully open source under the MIT license; no paid tiers. You still pay your model provider for the API calls made by the LLM judge and by the models under test.

Alternatives worth knowing

Open Agent Leaderboard

A public benchmarking dashboard that ranks AI agents by real-world task performance, accuracy, and cost-efficiency — all in one filterable view.

Elicit

AI research assistant for academic literature.

Codex Security

OpenAI's agentic AppSec researcher that builds a codebase-specific threat model, validates vulnerabilities in sandboxed environments, and proposes patches — all without drowning you in false positives.