
Eval Set Sizer

“It works on the examples I tried” is not a measurement. To claim an accuracy with a straight face, you need enough labelled cases to put a confidence interval around it. This sizes the golden set for you.

Expected accuracy (%)

Your best guess at how often the feature is right. Use 50 if you truly don’t know: the required set size peaks at 50%, so it’s the most demanding assumption.

Margin of error (± points)

How tight the answer must be. ±5 means that if you measure 90%, the true accuracy plausibly sits between 85% and 95%, at the confidence level you choose.

Confidence level

How often the interval should contain the true accuracy.

Golden examples needed

The math: normal-approximation sample size, n = z² · p(1 − p) ÷ e², rounded up. p is the expected accuracy and e the margin of error, both as fractions (90% is 0.90, ±5 points is 0.05); z is the confidence multiplier (about 1.96 for 95%). It assumes independent, representative examples, so sample real traffic and don’t cherry-pick. For small sets, or accuracies close to 0% or 100%, the normal approximation becomes less reliable; treat the number as a floor.
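The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind this page: the function name `golden_set_size` and its parameter names are hypothetical, and it assumes a two-sided interval, with z taken from the standard normal distribution via the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def golden_set_size(expected_accuracy_pct, margin_points, confidence_pct=95.0):
    """Normal-approximation sample size: n = z^2 * p(1 - p) / e^2, rounded up.

    Hypothetical helper for illustration. Inputs are percentages, matching
    the three fields on this page.
    """
    p = expected_accuracy_pct / 100.0  # expected accuracy as a fraction
    e = margin_points / 100.0          # margin of error as a fraction
    if not 0.0 < p < 1.0:
        raise ValueError("expected accuracy must be strictly between 0 and 100")
    if e <= 0.0:
        raise ValueError("margin of error must be positive")
    if not 0.0 < confidence_pct < 100.0:
        raise ValueError("confidence must be strictly between 0 and 100")

    # Two-sided multiplier: 95% confidence leaves 2.5% in each tail,
    # so z is the 97.5th percentile of the standard normal (about 1.96).
    z = NormalDist().inv_cdf(0.5 + confidence_pct / 200.0)

    return math.ceil(z * z * p * (1.0 - p) / (e * e))


print(golden_set_size(50, 5, 95))  # 385
print(golden_set_size(90, 5, 95))  # 139
print(golden_set_size(90, 2, 99))  # 1493
```

The first two calls show why 50% is the demanding assumption: at the same ±5 points and 95% confidence, an expected accuracy of 50% needs 385 examples while 90% needs only 139.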

An eval harness with a golden set is part of every engagement. Sizing it is step one; building the harness, the regression gate, and the failure-mode catalogue is the work.
