Benchmarker

A/B testing and consistency analysis with statistical metrics for prompt optimization

Overview

The Benchmarker helps you evaluate prompt performance through systematic testing. Run single prompts multiple times to measure consistency, or compare two variants in A/B tests. Get statistical analysis including latency distributions, token usage, and confidence metrics.

  • A/B Testing - Compare two variants
  • 10-100 Runs - Statistical significance
  • P50/P95/P99 - Latency percentiles

How to Use

  1. Select Test Mode - Choose 'Single Prompt' for consistency testing or 'A/B Test' to compare two variants.
  2. Enter Prompt(s) - Paste your prompt (and second variant for A/B tests) into the editor(s).
  3. Configure Test Parameters - Set the number of runs (10-100), model, and temperature settings.
  4. Add Test Input - Enter a sample user message that will be used for all test runs.
  5. Run Benchmark - Start the test. Results stream in real time as each run completes.
  6. Analyze Results - Review statistical analysis including averages, percentiles, and distributions.
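
The steps above describe the UI flow. For a sense of what one benchmark run involves, or to script something comparable yourself, here is a minimal sketch of a single-prompt loop. It assumes the OpenAI Python SDK purely as an example backend; the model name, prompts, and run count are placeholders, not a description of the Benchmarker's internals.

import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "You are a helpful support assistant."  # prompt under test (placeholder)
TEST_INPUT = "How do I reset my password?"              # fixed test input (placeholder)
RUNS = 30                                               # 10-100, as in the UI

results = []
for _ in range(RUNS):
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # placeholder model
        temperature=0.7,       # parameter under test
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": TEST_INPUT},
        ],
    )
    latency_ms = (time.perf_counter() - start) * 1000
    results.append({
        "latency_ms": latency_ms,
        "input_tokens": resp.usage.prompt_tokens,
        "output_tokens": resp.usage.completion_tokens,
        "text": resp.choices[0].message.content,
    })

Each run records latency, token usage, and the response text, which is everything the metrics described below are computed from.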

Test Modes

Single Prompt Mode

Test one prompt multiple times to measure consistency and reliability. Useful for understanding how much variation to expect from your prompt.

What You'll Learn

  • Response consistency across multiple runs
  • Latency variation and percentiles
  • Token usage patterns
  • Cost projection for production volumes

A/B Test Mode

Compare two prompt variants head-to-head. Run equal numbers of tests on each variant and get statistical analysis of which performs better.

Statistical Output

  • Side-by-side metric comparison
  • Statistical significance indicators
  • Winner determination with confidence level
  • Detailed per-variant analysis

Understanding Metrics

Latency Metrics

  • P50: Median - 50% of requests complete faster
  • P95: 95th percentile - typical worst case
  • P99: 99th percentile - rare worst case
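
Percentiles are order statistics over the collected latency samples. Continuing the sketch above (the results list from the benchmark loop), a simple nearest-rank computation looks like this:

import math
import statistics

# Latencies collected by the benchmark loop sketched earlier.
latencies = sorted(r["latency_ms"] for r in results)

def percentile(sorted_values, p):
    # Nearest-rank percentile: smallest value with at least p% of samples at or below it.
    rank = math.ceil(p / 100 * len(sorted_values))
    return sorted_values[max(0, rank - 1)]

print(f"Mean latency: {statistics.mean(latencies):.0f}ms")
print(f"P50: {percentile(latencies, 50):.0f}ms | "
      f"P95: {percentile(latencies, 95):.0f}ms | "
      f"P99: {percentile(latencies, 99):.0f}ms")

Note that with only 10-100 runs, P99 is determined by the one or two slowest samples, which is another reason to prefer more runs when tail latency matters.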

Token Metrics

  • Input Tokens: Tokens in your prompt (should be consistent)
  • Output Tokens: Tokens in response (may vary)
  • Total Tokens: Combined input + output
  • Cost per Run: Calculated from token counts and pricing
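
Cost per run follows directly from the token counts. The rates below are placeholders, not real pricing; substitute your model's current per-million-token prices.

# Placeholder prices in USD per 1M tokens; substitute your model's actual rates.
PRICE_PER_M_INPUT = 0.15
PRICE_PER_M_OUTPUT = 0.60

def cost_usd(run):
    # Cost of one run from its input and output token counts.
    return (run["input_tokens"] * PRICE_PER_M_INPUT
            + run["output_tokens"] * PRICE_PER_M_OUTPUT) / 1_000_000

total_cost = sum(cost_usd(r) for r in results)  # results from the benchmark loop above
print(f"Total cost for {len(results)} runs: ${total_cost:.2f}")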

Consistency Score

Measures how similar responses are across runs. Higher scores indicate more predictable outputs. Calculated using semantic similarity of response content.
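
The exact formula is not spelled out here, but one common way to approximate such a score is to embed each response and average the pairwise cosine similarities. The sketch below does that with the OpenAI embeddings endpoint as an assumed backend; treat it as an illustration rather than the Benchmarker's actual calculation.

from itertools import combinations
from openai import OpenAI

client = OpenAI()

def embed(texts):
    # Embed all response texts in one call.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# Average pairwise similarity of the responses collected by the benchmark loop.
vectors = embed([r["text"] for r in results])
pairs = list(combinations(vectors, 2))
consistency = sum(cosine(a, b) for a, b in pairs) / len(pairs)
print(f"Consistency score: {consistency:.0%}")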

Statistical Analysis

Report Includes

Benchmark Results Summary
========================

Variant A: Original Prompt
--------------------------
Runs: 50
Mean Latency: 1,234ms
P50: 1,180ms | P95: 1,890ms | P99: 2,340ms
Avg Output Tokens: 156
Consistency Score: 87%
Total Cost: $0.23

Variant B: Optimized Prompt
--------------------------
Runs: 50
Mean Latency: 987ms
P50: 940ms | P95: 1,450ms | P99: 1,780ms
Avg Output Tokens: 142
Consistency Score: 91%
Total Cost: $0.19

Winner: Variant B
Confidence: 94%
Improvement: 20% faster, 17% cheaper, more consistent

Statistical Significance

The Benchmarker uses statistical tests to determine whether observed differences are meaningful:
  • 95%+ confidence: Strong evidence of real difference
  • 90-95% confidence: Likely real difference, consider more runs
  • <90% confidence: Difference may be noise, need more data
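
As an illustration of the kind of test that can sit behind a confidence figure, the sketch below applies Welch's t-test to the latency samples of two variants using SciPy. The Benchmarker's actual test is not documented here, so this is a stand-in; results_a and results_b are assumed to be per-variant run lists collected the same way as results in the earlier loop.

from scipy import stats

# Per-variant latency samples (results_a / results_b collected as in the earlier loop).
latencies_a = [r["latency_ms"] for r in results_a]
latencies_b = [r["latency_ms"] for r in results_b]

# Welch's t-test: does not assume equal variance between the two variants.
t_stat, p_value = stats.ttest_ind(latencies_a, latencies_b, equal_var=False)
confidence = (1 - p_value) * 100  # rough reading of the p-value as a confidence level

print(f"p-value: {p_value:.3f} (~{confidence:.0f}% confidence)")
if confidence >= 95:
    print("Strong evidence of a real difference.")
elif confidence >= 90:
    print("Likely a real difference; consider more runs.")
else:
    print("Difference may be noise; collect more data.")

The same approach applies to other per-run metrics such as output tokens or cost per run.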

AI Expert Use Cases

Prompt Optimization

A/B test prompt changes to verify improvements. Never deploy a "better" prompt without statistical evidence that it actually performs better.

Model Selection

Run the same prompt across different models to compare performance characteristics. Find the best model for your specific use case.

Temperature Tuning

Test different temperature settings to find the optimal balance between creativity and consistency for your application.

Production Readiness

Before deployment, run 100 tests to understand latency distributions and ensure the prompt meets your SLA requirements.
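
One way to turn that into a concrete gate is to compare the measured tail latency against your SLA target. The threshold below is a placeholder, and percentile and latencies come from the earlier sketches.

# SLA gate: fail the benchmark if tail latency exceeds the target.
SLA_P95_MS = 2000  # placeholder target; use your own SLA figure

p95 = percentile(latencies, 95)  # helpers from the percentile sketch above
if p95 > SLA_P95_MS:
    raise SystemExit(f"P95 latency {p95:.0f}ms exceeds the {SLA_P95_MS}ms SLA")
print(f"P95 latency {p95:.0f}ms is within the {SLA_P95_MS}ms SLA")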

Regression Testing

After making changes, benchmark to ensure you haven't regressed on important metrics like latency or consistency.

Tips & Best Practices

Pro Tips

  • Run at least 30 tests for statistical significance
  • Use realistic test inputs that match production usage
  • Test at different times to account for API variability
  • Compare P95 latency, not just averages (tail matters)
  • Consider cost per quality, not just raw performance
  • Document benchmark conditions for reproducibility

What to Test

Good candidates for benchmarking:
  • System prompt variations
  • Few-shot example changes
  • Different output format instructions
  • Constraint and guardrail modifications
  • Temperature and other parameter changes