Eval Set Sizer
“It works on the examples I tried” is not a measurement. To claim an accuracy with a straight face, you need enough labelled cases to put a confidence interval around it. This sizes the golden set for you.
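The math: the normal-approximation sample size, n = z² · p(1 − p) ÷ e², rounded up. p is the expected accuracy and e the margin of error (the half-width of the interval), both as proportions; z is the confidence multiplier (1.96 for 95% confidence). For example, p = 0.90 and e = 0.05 at 95% confidence give n = 3.8416 × 0.09 ÷ 0.0025 ≈ 138.3, so 139 cases. It assumes independent, representative examples, so sample real traffic; don’t cherry-pick. For small sets, or when p is close to 0 or 1, the normal approximation becomes less accurate; treat the number as a floor.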
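The formula is small enough to sketch in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function name `eval_set_size` and its parameters are hypothetical, and the assumption that z comes from a two-sided confidence level via the standard normal quantile is ours.

```python
import math
from statistics import NormalDist


def eval_set_size(p: float, e: float, confidence: float = 0.95) -> int:
    """Normal-approximation sample size: ceil(z^2 * p * (1 - p) / e^2).

    p          -- expected accuracy, as a proportion strictly between 0 and 1
    e          -- margin of error (half-width of the interval), as a proportion
    confidence -- two-sided confidence level, e.g. 0.95 (an assumption of this sketch)
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < e < 1.0:
        raise ValueError("e must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")

    # Two-sided z multiplier: about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)

    # Round up: a fractional case cannot be labelled.
    return math.ceil(z * z * p * (1.0 - p) / (e * e))


if __name__ == "__main__":
    print(eval_set_size(0.90, 0.05))  # expected accuracy 90%, margin ±5 points
    print(eval_set_size(0.50, 0.05))  # p = 0.5 is the worst case for a given margin
```

If the expected accuracy is unknown, passing p = 0.5 gives the most conservative size, because p(1 − p) is largest there.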
An eval harness with a golden set is part of every engagement. Sizing it is step one; building the harness, the regression gate, and the failure-mode catalogue is the work.