Thomas
Peng

I build agentic systems and evaluate them honestly. Deterministic scoring, adversarial verification, cost-gated reproducible runs. Honest nulls included.

Visitor assessment (optional)
When evaluating a candidate's AI work, what matters most?

One kernel. Three deployed systems. All evaluated honestly.

Quorum's core/ directory is the substrate: cost-aware routing, K-adversarial verification, full tracing. Aegis and FieldAgent both vendor it. That is not a portfolio of four separate projects. It is one serious infrastructure bet applied to three distinct problems, each with its own honest eval.

The honest nulls are not disclaimers. The truncation artifact in FieldAgent, the gap-erasure in Aegis. These are reported front and center because frontier labs are not hiring people who hide nulls. They are hiring people who find them first.

Flagship

Quorum

Task-aware agent orchestrator with adversarial verification

Honest finding

K=3 adversarial verification cut false positives 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]; recall 100% to 77.8%). The recall drop is real. Held-out: 3/3 genuine bugs found, 0 surviving false positives.

Cost-aware model routing (DeepSeek to Haiku to Sonnet to Opus) plus adversarial multi-agent verification with full tracing. A trace UI that looks like a product. Vendors a shared kernel reused across Aegis and FieldAgent. Fans out finders per file, K skeptics per finding (concurrency cap 8). ~$0.25 total per run. 58 tests, ruff, mypy, CI green. Cost-routing live tier gated on an Anthropic key: the harness is committed, the number is honest.

0.0%FP after K=3
3/3Held-out bugs
~$0.25Per run
58Tests
https://quorum.thomaspeng.caOpen live

Live demo available on desktop

Open demo
Red-team

Aegis

Adaptive attacker agent vs layered defenses

Honest finding (lead with this)

A reasoning model is significantly more robust pre-defense (injection ASR 49.3% vs 68.1%, p=0.0012). But the full defense stack erases the gap (1.7% vs 2.8%, p=0.40, not significant). The null is the point.

Adaptive attacker red-teams a target on two harmless proxies (canary-string extraction, prompt-injection sentinel), scored deterministically via exact match. No LLM judge in the success path. Adaptation lift: 24.0% to 29.9% (became significant only after scaling the benchmark). Scaling is the legit power lever, not p-hacking. Defense reduction 29.2% to 4.2%. The input-classifier is the workhorse. 78 tests, CI green.

p=0.40Gap erased
-25%Defense reduction
+5.9%Adaptation lift
78Tests
https://7p3ng.github.io/aegis/Open live

Live demo available on desktop

Open demo
Contract AI

FieldAgent

CUAD contract red-flag finder, graded span-IoU

Honest finding (the null is the point)

Detection F1 = 0.548 (P=0.741 / R=0.435), 95% CI [0.460, 0.637]. +0.21 F1 over keyword floor. The agentic chunking lift looked like +0.45 on DeepSeek but collapsed to +0.07 on a fair rerun (CIs overlap, truncation artifact). Reporting both.

Reads a real commercial contract, flags risk-bearing clauses (span, severity, plain-English risk), graded span-IoU against CUAD gold. No LLM judge. Vendors Quorum's core/. 20 held-out CUAD contracts. Party names and dollar figures redacted in the demo. 47 tests, CI green. The honest null about the agentic lift is the argument, not a disclaimer.

0.548F1 score
+0.21Over keyword floor
+0.07Fair agentic lift
47Tests
https://fieldagent.thomaspeng.caOpen live

Live demo available on desktop

Open demo

Skill-Tuning Council

Internal infra. No public URL. Presented as methodology.

A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-improvement before it ships. Pipeline: adversary to editors to merger to council to escalate-on-disagreement. 576 tests. Every voter's rationale is logged.

This is a systems-design piece, not a shipped product. The interesting part is the adversarial-then-council pattern: a single editor drifts, four voters catch drift, escalation handles disagreement. The same adversarial-verification kernel from Quorum's core/ applies to the council itself.

No public URL because the output is internal tooling. The architecture is the argument.

How I measure

01

Deterministic scoring

No LLM judge in the success path. Exact match, span-IoU, CI-bounded against labeled gold. If the metric is fuzzy, the eval is not trustworthy.

02

Adversarial verification

K skeptics challenge every finding. False positives are a specific target. A result that survives adversarial review is a different claim than one that does not.

03

Cost-gated runs

Every live tier is gated on a real API key. Estimated costs are honest. Gated numbers are labeled as gated. The harness is committed even when the number is not live.

04

Honest nulls

Results that failed are reported first, not buried. The truncation artifact in FieldAgent. The gap erasure in Aegis. These are not disclaimers. They are the finding.

05

Reproducible offline

make eval-dry reproduces Quorum's core eval without a live API key. Evals that cannot be reproduced are not evals. They are claims.

06

Model-specific claims

A number on DeepSeek-v4-pro is not a number. A number that replicates on Claude Sonnet is a number. Cross-model validation is not optional when the claim matters.

Talk to me

Applied AI, agent engineering, evaluation design. Two contact points. No X, no LinkedIn.