Open Source · Now on GitHub

Quantify AI confidence.
Deploy with certainty.

MetaReason transforms LLM evaluation from guesswork into science. Statistically rigorous confidence scores backed by Bayesian inference and ensemble testing.

metareason evaluate --output json
{
  "population_mean": 4.1368,
  "population_median": 4.1446,
  "hdi_lower": 3.6782,
  "hdi_upper": 4.5428,
  "hdi_prob": 0.94,
  "oracle_noise_mean": 0.5105,
  "oracle_noise_hdi": [0.2397, 0.8211],
  "n_samples": 5
}

From prompt to confidence interval
in five steps

Every evaluation follows a rigorous statistical pipeline that transforms qualitative LLM responses into quantitative risk metrics.

01

Define your evaluation in YAML

Declare your prompt template, variable axes (categorical and continuous), oracle definitions, and compliance mappings in a single version-controlled file.

Declarative · Git-native · Auditable
02

Generate prompt variants

Latin Hypercube Sampling ensures efficient, comprehensive coverage of the parameter space — personas, temperatures, phrasings, and structures — testing your model the way real users will use it.

Latin Hypercube · Stratified · Reproducible
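For intuition, here is a minimal Latin Hypercube sketch using SciPy rather than MetaReason's own sampler. The tone and detail_level axes mirror the example spec later on this page; the mapping from unit samples to concrete values is an illustrative assumption (the spec's "maximin" optimization is assumed to be handled internally).

# Sketch only: SciPy's LatinHypercube, not MetaReason's sampler.
from scipy.stats import qmc

tones = ["formal", "casual", "technical"]            # categorical axis
sampler = qmc.LatinHypercube(d=2, seed=42)           # 2 axes: tone, detail_level
unit_samples = sampler.random(n=5)                   # 5 variants in the unit square

variants = []
for u_tone, u_detail in unit_samples:
    variants.append({
        "tone": tones[int(u_tone * len(tones))],     # map unit interval onto categories
        "detail_level": float(u_detail),             # uniform continuous axis
    })
print(variants)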
03

Run inference at scale

The full prompt ensemble is executed against your target LLM in batch, producing responses that capture the real distribution of model behavior.

Batch processing · Any LLM provider
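As a rough illustration of the batch step, this sketch fans rendered prompt variants out to a provider with a thread pool. call_llm is a hypothetical stand-in for the configured adapter (ollama in the example spec), not a MetaReason API.

# Sketch only: concurrent batch execution with a placeholder provider call.
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    # hypothetical stand-in for an HTTP call to the configured provider
    return f"(response to: {prompt[:40]}...)"

def run_ensemble(prompts: list[str], max_workers: int = 8) -> list[str]:
    # execute the full prompt ensemble concurrently, preserving input order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_llm, prompts))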
04

Score with dual oracles

Each response is evaluated twice: embedding similarity against a canonical answer measures accuracy, while an LLM-as-Judge evaluates explainability against your rubric. Two dimensions, two independent scores.

Cosine similarity · LLM-as-Judge · Binary labels
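A sketch of the accuracy oracle alone: cosine similarity between the response embedding and the canonical-answer embedding, thresholded into a binary label. The embedding source and the 0.8 threshold are assumptions for illustration, not MetaReason defaults.

# Sketch only: cosine similarity thresholded into a binary accuracy label.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def accuracy_label(response_vec: np.ndarray, canonical_vec: np.ndarray,
                   threshold: float = 0.8) -> int:
    # 1 = close enough to the canonical answer, 0 = not;
    # these binary labels feed the Beta-Binomial model in step 05
    return int(cosine_similarity(response_vec, canonical_vec) >= threshold)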
05

Synthesize Bayesian confidence

Binary labels feed into a PyMC Beta-Binomial model, producing full posterior distributions for P(accuracy) and P(explainability) with 94% Highest Density Intervals. Not point estimates — real confidence intervals.

PyMC · Beta-Binomial · 94% HDI
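The statistical core can be sketched in a few lines of PyMC. The binary labels and the Beta prior below are illustrative; the draw, chain, and HDI settings mirror the example spec on this page.

# Sketch only: a Beta-Binomial posterior over P(accuracy) with a 94% HDI.
import pymc as pm
import arviz as az

labels = [1, 1, 0, 1, 1]                                      # example binary labels from step 04

with pm.Model():
    p_accuracy = pm.Beta("p_accuracy", alpha=1.0, beta=1.0)   # illustrative uninformative prior
    pm.Binomial("hits", n=len(labels), p=p_accuracy, observed=sum(labels))
    idata = pm.sample(draws=1000, chains=2, random_seed=42)

print(az.hdi(idata, var_names=["p_accuracy"], hdi_prob=0.94))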
[Image: MetaReason Evaluation Report showing confidence assessment, Bayesian posterior distribution, and score metrics]

Who watches the watchers?

The same Bayesian modeling and MCMC inference that scores your LLM is turned on the oracle judges themselves — quantifying how much noise your evaluators introduce and how confident you should be in the evaluation itself.

metareason calibrate --oracle clarity_judge
{
  "spec_id": "variance_calibration_v1",
  "oracle": "gemma3:27b",
  "repeats": 10,
  "scores": [4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0],
  "analysis": {
    "population_mean": 3.9999999999383204,
    "population_median": 3.9999999999337517,
    "population_std": 7.531e-11,
    "hdi_lower": 3.9999999998274243,
    "hdi_upper": 4.000000000053331,
    "hdi_prob": 0.94,
    "oracle_noise_mean": 2.126e-10,
    "oracle_noise_hdi": [9.046e-11, 4.013e-10],
    "n_samples": 10
  },
  "expected_score": 4.0,
  "bias": -6.168e-11
}
Oracle calibration · Judge noise quantification · Recursive confidence
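The calibration loop can be summarized as: score one fixed response repeatedly and compare the judge against the expected score. The plain mean/standard-deviation summary below is a simplification of the MCMC noise model behind oracle_noise_mean and oracle_noise_hdi above; judge is a hypothetical callable, not a MetaReason API.

# Sketch only: repeated judging of a fixed response to estimate bias and noise.
import statistics

def calibrate(judge, response: str, expected: float, repeats: int = 10) -> dict:
    scores = [judge(response) for _ in range(repeats)]    # repeated judgments of the same response
    return {
        "scores": scores,
        "bias": statistics.mean(scores) - expected,       # systematic offset from the expected score
        "noise_std": statistics.stdev(scores),            # run-to-run variation of the judge
    }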

Transparent by design

Our core evaluation engine is open source because methodologies hidden in black boxes cannot govern systems that are themselves black boxes. We show our work, share our math, and invite scrutiny.

  • Full evaluation engine — YAML config to confidence scores
  • Bayesian statistical models with reproducible inference
  • Community-contributed oracles and evaluation specs
  • Open governance — public docs, decisions, and metrics
Example Evaluation Spec
spec_id: "quick_test"
pipeline:
  - template: |
      You are a helpful assistant.
      Tone: {{ tone }}
      Detail Level: {{ detail_level }}
      Query: Explain how a compiler works.
    adapter:
      name: "ollama"
      model: "gemma3:27b"
      temperature: 0.7
      max_tokens: 500
sampling:
  method: "latin_hypercube"
  optimization: "maximin"
  random_seed: 42
  n_variants: 5
oracles:
  clarity_judge:
    type: "llm_judge"
    model: "gemma3:27b"
    adapter:
      name: "ollama"
    rubric: |
      Evaluate the response for clarity.
      Rate on a scale of 1-5.
analysis:
  mcmc_draws: 1000
  mcmc_chains: 2
  prior_quality_mu: 3.0
  hdi_probability: 0.94
axes:
  - name: "tone"
    type: "categorical"
    values: ["formal", "casual", "technical"]
  - name: "detail_level"
    type: "continuous"
    distribution: "uniform"

Not another leaderboard.
A confidence engine.

Most evaluation tools give you a score. We give you a statistically rigorous probability distribution with uncertainty quantified.

Bayesian, not binary

Full posterior distributions with 94% HDI credible intervals. Know not just the answer, but how confident you should be in it.

Ensemble variants

Latin Hypercube Sampling generates comprehensive prompt ensembles that test your model the way messy, real-world users will.

Audit-ready

YAML specs are version-controlled, reproducible, and map directly to ISO 42001, EU AI Act, and SOX compliance requirements.

Hope is not a strategy.
Measurement is.

Star the repo. Run your first evaluation. Join the community building the standard for AI confidence measurement.