Open Source · Now on GitHub

Quantify AI confidence.
Deploy with certainty.

MetaReason transforms LLM evaluation from guesswork into science. Statistically rigorous confidence scores backed by Bayesian inference and ensemble testing.

metareason evaluate --output json
{
  "population_mean": 4.1368,
  "population_median": 4.1446,
  "hdi_lower": 3.6782,
  "hdi_upper": 4.5428,
  "hdi_prob": 0.94,
  "oracle_noise_mean": 0.5105,
  "oracle_noise_hdi": [0.2397, 0.8211],
  "n_samples": 5
}

From prompt to confidence interval
in five steps

Every evaluation follows a rigorous statistical pipeline that transforms qualitative LLM responses into quantitative risk metrics.

01

Define your evaluation in YAML

Declare your prompt template, variable axes (categorical and continuous), oracle definitions, and compliance mappings in a single version-controlled file.

Declarative · Git-native · Auditable
02

Generate prompt variants

Latin Hypercube Sampling ensures efficient, comprehensive coverage of the parameter space — personas, temperatures, phrasings, and structures — testing your model the way real users will use it.

Latin Hypercube · Stratified · Reproducible
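For intuition, here is a minimal Latin Hypercube sketch using SciPy rather than MetaReason's own sampler. The tone and detail_level axes mirror the example spec later on this page; the mapping from unit samples to concrete values is an illustrative assumption (the spec's "maximin" optimization is assumed to be handled internally).

# Sketch only: SciPy's LatinHypercube, not MetaReason's sampler.
from scipy.stats import qmc

tones = ["formal", "casual", "technical"]            # categorical axis
sampler = qmc.LatinHypercube(d=2, seed=42)           # 2 axes: tone, detail_level
unit_samples = sampler.random(n=5)                   # 5 variants in the unit square

variants = []
for u_tone, u_detail in unit_samples:
    variants.append({
        "tone": tones[int(u_tone * len(tones))],     # map unit interval onto categories
        "detail_level": float(u_detail),             # uniform continuous axis
    })
print(variants)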
03

Run inference at scale

The full prompt ensemble is executed against your target LLM in batch, producing responses that capture the real distribution of model behavior.

Batch processing · Any LLM provider
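As a rough illustration of the batch step, this sketch fans rendered prompt variants out to a provider with a thread pool. call_llm is a hypothetical stand-in for the configured adapter (ollama in the example spec), not a MetaReason API.

# Sketch only: concurrent batch execution with a placeholder provider call.
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    # hypothetical stand-in for an HTTP call to the configured provider
    return f"(response to: {prompt[:40]}...)"

def run_ensemble(prompts: list[str], max_workers: int = 8) -> list[str]:
    # execute the full prompt ensemble concurrently, preserving input order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_llm, prompts))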
04

Score with dual oracles

Each response is evaluated twice: embedding similarity against a canonical answer measures accuracy, while an LLM-as-Judge evaluates explainability against your rubric. Two dimensions, two independent scores.

Cosine similarity · LLM-as-Judge · Binary labels
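A sketch of the accuracy oracle alone: cosine similarity between the response embedding and the canonical-answer embedding, thresholded into a binary label. The embedding source and the 0.8 threshold are assumptions for illustration, not MetaReason defaults.

# Sketch only: cosine similarity thresholded into a binary accuracy label.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def accuracy_label(response_vec: np.ndarray, canonical_vec: np.ndarray,
                   threshold: float = 0.8) -> int:
    # 1 = close enough to the canonical answer, 0 = not;
    # these binary labels feed the Beta-Binomial model in step 05
    return int(cosine_similarity(response_vec, canonical_vec) >= threshold)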
05

Synthesize Bayesian confidence

Binary labels feed into a PyMC Beta-Binomial model, producing full posterior distributions for P(accuracy) and P(explainability) with 94% Highest Density Intervals. Not point estimates — real confidence intervals.

PyMC · Beta-Binomial · 94% HDI
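The statistical core can be sketched in a few lines of PyMC. The binary labels and the Beta prior below are illustrative; the draw, chain, and HDI settings mirror the example spec on this page.

# Sketch only: a Beta-Binomial posterior over P(accuracy) with a 94% HDI.
import pymc as pm
import arviz as az

labels = [1, 1, 0, 1, 1]                                      # example binary labels from step 04

with pm.Model():
    p_accuracy = pm.Beta("p_accuracy", alpha=1.0, beta=1.0)   # illustrative uninformative prior
    pm.Binomial("hits", n=len(labels), p=p_accuracy, observed=sum(labels))
    idata = pm.sample(draws=1000, chains=2, random_seed=42)

print(az.hdi(idata, var_names=["p_accuracy"], hdi_prob=0.94))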
[Image: MetaReason Evaluation Report showing confidence assessment, Bayesian posterior distribution, and score metrics]

Who watches the watchers?

The same Bayesian modeling and MCMC inference that scores your LLM is turned on the oracle judges themselves — quantifying how much noise your evaluators introduce and how confident you should be in the evaluation itself.

metareason calibrate --oracle clarity_judge
{
  "spec_id": "variance_calibration_v1",
  "oracle": "gemma3:27b",
  "repeats": 10,
  "scores": [4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0],
  "analysis": {
    "population_mean": 3.9999999999383204,
    "population_median": 3.9999999999337517,
    "population_std": 7.531e-11,
    "hdi_lower": 3.9999999998274243,
    "hdi_upper": 4.000000000053331,
    "hdi_prob": 0.94,
    "oracle_noise_mean": 2.126e-10,
    "oracle_noise_hdi": [9.046e-11, 4.013e-10],
    "n_samples": 10
  },
  "expected_score": 4.0,
  "bias": -6.168e-11
}
Oracle calibration · Judge noise quantification · Recursive confidence
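The calibration loop can be summarized as: score one fixed response repeatedly and compare the judge against the expected score. The plain mean/standard-deviation summary below is a simplification of the MCMC noise model behind oracle_noise_mean and oracle_noise_hdi above; judge is a hypothetical callable, not a MetaReason API.

# Sketch only: repeated judging of a fixed response to estimate bias and noise.
import statistics

def calibrate(judge, response: str, expected: float, repeats: int = 10) -> dict:
    scores = [judge(response) for _ in range(repeats)]    # repeated judgments of the same response
    return {
        "scores": scores,
        "bias": statistics.mean(scores) - expected,       # systematic offset from the expected score
        "noise_std": statistics.stdev(scores),            # run-to-run variation of the judge
    }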

Transparent by design

Our core evaluation engine is open source because methodologies hidden in black boxes cannot govern systems that are themselves black boxes. We show our work, share our math, and invite scrutiny.

  • Full evaluation engine — YAML config to confidence scores
  • Bayesian statistical models with reproducible inference
  • Community-contributed oracles and evaluation specs
  • Open governance — public docs, decisions, and metrics
Example Evaluation Spec
spec_id: "quick_test"
pipeline:
  - template: |
      You are a helpful assistant.
      Tone: {{ tone }}
      Detail Level: {{ detail_level }}
      Query: Explain how a compiler works.
    adapter:
      name: "ollama"
      model: "gemma3:27b"
      temperature: 0.7
      max_tokens: 500
sampling:
  method: "latin_hypercube"
  optimization: "maximin"
  random_seed: 42
  n_variants: 5
oracles:
  clarity_judge:
    type: "llm_judge"
    model: "gemma3:27b"
    adapter:
      name: "ollama"
    rubric: |
      Evaluate the response for clarity.
      Rate on a scale of 1-5.
analysis:
  mcmc_draws: 1000
  mcmc_chains: 2
  prior_quality_mu: 3.0
  hdi_probability: 0.94
axes:
  - name: "tone"
    type: "categorical"
    values: ["formal", "casual", "technical"]
  - name: "detail_level"
    type: "continuous"
    distribution: "uniform"

Not another leaderboard.
A confidence engine.

Most evaluation tools give you a score. We give you a statistically rigorous probability distribution with uncertainty quantified.

Bayesian, not binary

Full posterior distributions with 94% HDI credible intervals. Know not just the answer, but how confident you should be in it.

Ensemble variants

Latin Hypercube Sampling generates comprehensive prompt ensembles that test your model the way messy, real-world users will.

Audit-ready

YAML specs are version-controlled, reproducible, and map directly to ISO 42001, EU AI Act, and SOX compliance requirements.

Hope is not a strategy.
Measurement is.

Star the repo. Run your first evaluation. Join the community building the standard for AI confidence measurement.