Benchmarker
A/B testing and consistency analysis with statistical metrics for prompt optimization
Overview
The Benchmarker helps you evaluate prompt performance through systematic testing. Run single prompts multiple times to measure consistency, or compare two variants in A/B tests. Get statistical analysis including latency distributions, token usage, and confidence metrics.
- A/B Testing - Compare two variants
- 10-100 Runs - Statistical significance
- P50/P95/P99 - Latency percentiles
How to Use
1. Select Test Mode - Choose 'Single Prompt' for consistency testing or 'A/B Test' to compare two variants.
2. Enter Prompt(s) - Paste your prompt (and second variant for A/B tests) into the editor(s).
3. Configure Test Parameters - Set the number of runs (10-100), model, and temperature settings.
4. Add Test Input - Enter a sample user message that will be used for all test runs.
5. Run Benchmark - Start the test. Results stream in real-time as each run completes (a sketch of the underlying run loop follows these steps).
6. Analyze Results - Review statistical analysis including averages, percentiles, and distributions.
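Under the hood, a benchmark run is essentially a timed model call repeated N times with the same prompt and test input. The sketch below is a minimal illustration of that loop, assuming a placeholder call_model() function (not the tool's actual client) that returns the response text and token counts.

```python
import time

def call_model(system_prompt, user_message, model, temperature):
    """Placeholder: send the prompt and test input to your model API.
    Expected to return (response_text, input_tokens, output_tokens)."""
    raise NotImplementedError

def run_benchmark(system_prompt, user_message, runs=30,
                  model="example-model", temperature=0.7):
    """Call the model `runs` times and record latency and token usage."""
    results = []
    for _ in range(runs):
        start = time.perf_counter()
        text, input_tokens, output_tokens = call_model(
            system_prompt, user_message, model, temperature)
        results.append({
            "latency_ms": (time.perf_counter() - start) * 1000,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "response": text,
        })
    return results
```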
Test Modes
Single Prompt Mode
Test one prompt multiple times to measure consistency and reliability. Useful for understanding how much variation to expect from your prompt.
What You'll Learn
- Response consistency across multiple runs (a sketch for summarizing run-to-run variation follows this list)
- Latency variation and percentiles
- Token usage patterns
- Cost projection for production volumes
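One simple way to quantify run-to-run variation is the mean and standard deviation of latency and output length. The sketch below assumes the per-run result dicts from the run-loop sketch earlier; the field names are illustrative, not a fixed schema.

```python
import statistics

def summarize_variation(results):
    """Mean and standard deviation of latency and output length across runs.
    `results` is the list of per-run dicts from the run_benchmark sketch above."""
    latencies = [r["latency_ms"] for r in results]
    output_tokens = [r["output_tokens"] for r in results]
    return {
        "mean_latency_ms": statistics.mean(latencies),
        "latency_stdev_ms": statistics.stdev(latencies),
        "mean_output_tokens": statistics.mean(output_tokens),
        "output_token_stdev": statistics.stdev(output_tokens),
    }
```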
A/B Test Mode
Compare two prompt variants head-to-head. Run equal numbers of tests on each variant and get statistical analysis of which performs better.
Statistical Output
- Side-by-side metric comparison
- Statistical significance indicators
- Winner determination with confidence level (see the bootstrap sketch after this list)
- Detailed per-variant analysis
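A winner and confidence figure can be derived in several ways; one approach is a bootstrap comparison of the two variants' latencies, sketched below. This is an illustration, not necessarily the exact statistic the Benchmarker reports.

```python
import random
import statistics

def bootstrap_confidence(latencies_a, latencies_b, iterations=10_000, seed=0):
    """Estimate the confidence that variant B has lower mean latency than A.
    Resamples each variant with replacement and counts how often B's
    resampled mean beats A's."""
    rng = random.Random(seed)
    wins_b = 0
    for _ in range(iterations):
        sample_a = [rng.choice(latencies_a) for _ in latencies_a]
        sample_b = [rng.choice(latencies_b) for _ in latencies_b]
        if statistics.mean(sample_b) < statistics.mean(sample_a):
            wins_b += 1
    return wins_b / iterations

# Example with small synthetic latency samples (milliseconds)
a = [1200, 1310, 1180, 1420, 1250, 1290, 1330, 1210, 1270, 1360]
b = [980, 1040, 950, 1100, 1010, 990, 1060, 970, 1020, 1080]
print(f"Confidence B is faster: {bootstrap_confidence(a, b):.0%}")
```

The same resampling idea extends to other metrics (output tokens, cost) by swapping in the relevant per-run values.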
Understanding Metrics
Latency Metrics
- Median (P50) - Half of requests complete faster than this (see the percentile sketch below)
- 95th percentile (P95) - Typical worst case; 95% of requests are faster
- 99th percentile (P99) - Rare worst case; 99% of requests are faster
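Percentiles can be computed with a simple nearest-rank rule over the sorted latencies, as sketched below; library implementations (for example statistics.quantiles) interpolate slightly differently, so values may not match exactly.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value such that at least `pct`
    percent of the samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [1120, 1180, 1230, 1190, 1310, 1450, 1890, 1210, 1170, 2340]
print(percentile(latencies_ms, 50),
      percentile(latencies_ms, 95),
      percentile(latencies_ms, 99))
```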
Token Metrics
- Input Tokens: Tokens in your prompt (should be consistent)
- Output Tokens: Tokens in response (may vary)
- Total Tokens: Combined input + output
- Cost per Run: Calculated from token counts and pricing (see the cost sketch after this list)
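Cost per run is just token counts multiplied by per-token pricing. The prices in the sketch below are placeholders for illustration, not any particular model's rates.

```python
def cost_per_run(input_tokens, output_tokens,
                 input_price_per_mtok=3.00,     # placeholder $ per 1M input tokens
                 output_price_per_mtok=15.00):  # placeholder $ per 1M output tokens
    """Dollar cost of a single run from measured token counts."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# Example: 420 input tokens, 156 output tokens, projected to 100k requests/month
per_run = cost_per_run(420, 156)
print(f"${per_run:.4f} per run, ~${per_run * 100_000:.0f}/month at 100k requests")
```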
Consistency Score
Measures how similar responses are across runs. Higher scores indicate more predictable outputs. Calculated using semantic similarity of response content.
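One common way to compute such a score is to embed each response and average the pairwise cosine similarities. The sketch below assumes a placeholder embed() function that returns a vector per response; the Benchmarker's exact formula may differ.

```python
import itertools
import math

def embed(text):
    """Placeholder: return an embedding vector for `text` from your
    embedding model of choice."""
    raise NotImplementedError

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def consistency_score(responses):
    """Average pairwise cosine similarity of response embeddings, as a percent."""
    vectors = [embed(text) for text in responses]
    pairs = list(itertools.combinations(vectors, 2))
    if not pairs:
        return 100.0  # a single response is trivially consistent with itself
    return 100 * sum(cosine(u, v) for u, v in pairs) / len(pairs)
```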
Statistical Analysis
Report Includes
Benchmark Results Summary
========================
Variant A: Original Prompt
--------------------------
Runs: 50
Mean Latency: 1,234ms
P50: 1,180ms | P95: 1,890ms | P99: 2,340ms
Avg Output Tokens: 156
Consistency Score: 87%
Total Cost: $0.23
Variant B: Optimized Prompt
--------------------------
Runs: 50
Mean Latency: 987ms
P50: 940ms | P95: 1,450ms | P99: 1,780ms
Avg Output Tokens: 142
Consistency Score: 91%
Total Cost: $0.19
Winner: Variant B
Confidence: 94%
Improvement: 20% faster, 17% cheaper, more consistent
Statistical Significance
- 95%+ confidence: Strong evidence of a real difference
- 90-95% confidence: Likely a real difference; consider more runs
- <90% confidence: The difference may be noise; collect more data
AI Expert Use Cases
- Prompt Optimization
- Model Selection
- Temperature Tuning
- Production Readiness
- Regression Testing
Tips & Best Practices
Pro Tips
- Run at least 30 tests for statistical significance
- Use realistic test inputs that match production usage
- Test at different times to account for API variability
- Compare P95 latency, not just averages (tail matters)
- Consider cost per quality, not just raw performance
- Document benchmark conditions for reproducibility
What to Test
- System prompt variations
- Few-shot example changes
- Different output format instructions
- Constraint and guardrail modifications
- Temperature and other parameter changes