Rust · adaptive · proof-carrying · domain-neutral

Calybris Engine

The decision engine that learns where to spend

Adaptive routing with Thompson Sampling. Outcome-calibrated quality floors. SHA-256 sealed proof chain. 22 models across 6 providers.

1M real conversations: 0 errors, $572K saved (84%). 1.1M across 4 industries: 0 runtime failures.

316 tests. Integer kernel. Zero unsafe code.

01 — Integer Kernel

Allocation-free integer kernel

No floating-point in the hot path. The kernel uses bounded integer arithmetic throughout — every value is an i64 or u64 denominated in microcents. No heap allocation per decision. No GC pause. No NaN propagation.

        // utility formula (integer arithmetic, microcent-denominated)

        utility = quality_adjusted_value - risk_penalty - cost - latency_penalty

11 hard constraint gates evaluated in strict order — first failure short-circuits
Max utility selection across a 64-model catalog per decision
Counterfactual tracking: best alternative model + utility delta recorded alongside every decision
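
A minimal sketch of how the utility formula, max-utility selection, and counterfactual tracking could fit together. The type and function names (`Candidate`, `Decision`, `decide`) and the `passes_gates` closure are assumptions for illustration, not the engine's actual API. Saturating arithmetic is one way to get the "overflow-safe" behavior described here.

```rust
/// Hypothetical per-model inputs, all denominated in microcents.
#[derive(Clone, Copy, Debug)]
pub struct Candidate {
    pub model_id: u16,
    pub quality_adjusted_value: i64,
    pub risk_penalty: i64,
    pub cost: i64,
    pub latency_penalty: i64,
}

/// utility = quality_adjusted_value - risk_penalty - cost - latency_penalty.
/// Saturating ops clamp extreme inputs instead of overflowing.
pub fn utility(c: &Candidate) -> i64 {
    c.quality_adjusted_value
        .saturating_sub(c.risk_penalty)
        .saturating_sub(c.cost)
        .saturating_sub(c.latency_penalty)
}

#[derive(Debug, PartialEq, Eq)]
pub struct Decision {
    pub selected: u16,
    pub utility: i64,
    /// Counterfactual: best alternative model and the utility delta to it.
    pub runner_up: Option<(u16, i64)>,
}

/// Single pass over a fixed slice, no heap allocation.
/// `passes_gates` stands in for the ordered hard constraint gates.
pub fn decide(
    cands: &[Candidate],
    passes_gates: impl Fn(&Candidate) -> bool,
) -> Option<Decision> {
    let mut best: Option<(u16, i64)> = None;
    let mut second: Option<(u16, i64)> = None;
    for c in cands.iter().filter(|c| passes_gates(*c)) {
        let u = utility(c);
        match best {
            // Not better than the current best: maybe a new runner-up.
            Some((_, bu)) if u <= bu => {
                if second.map_or(true, |(_, su)| u > su) {
                    second = Some((c.model_id, u));
                }
            }
            // New best: the old best becomes the runner-up.
            _ => {
                second = best;
                best = Some((c.model_id, u));
            }
        }
    }
    best.map(|(id, u)| Decision {
        selected: id,
        utility: u,
        runner_up: second.map(|(sid, su)| (sid, u.saturating_sub(su))),
    })
}
```

Ties keep the earlier candidate, so the result depends only on the input order, which is what a deterministic replay needs.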

8.6M

decisions / sec

3-run median · overflow-safe

22

models · 6 providers

OpenAI, Anthropic, Google, DeepSeek, Meta, Mistral

11

constraint gates

Short-circuit on first fail

0

heap allocations

Per hot-path decision

Why integers? Floating-point introduces non-determinism across platforms and makes audit replay unreliable. Integer arithmetic is bit-identical on every architecture. The kernel can be replayed on any machine and produce the same decision for the same input.
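
One concrete source of the replay problem is that floating-point addition is not associative, so any reordering of the same terms (by a compiler, a SIMD path, or a different reduction order) can change the result. The sketch below illustrates this under that assumption. The function names are hypothetical.

```rust
/// Sum f64 values strictly left to right.
pub fn sum_f64(xs: &[f64]) -> f64 {
    xs.iter().fold(0.0, |acc, x| acc + x)
}

/// Sum microcent amounts left to right. Two's-complement addition is
/// associative and commutative, so the order of terms cannot matter.
pub fn sum_microcents(xs: &[i64]) -> i64 {
    xs.iter().fold(0i64, |acc, x| acc.wrapping_add(*x))
}
```

Summing `[0.1, 0.2, 0.3]` and `[0.3, 0.2, 0.1]` with `sum_f64` gives two different `f64` values, while `sum_microcents` returns the same total for any permutation of its input.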

02 — Adaptive Intelligence

The engine reads the data. Decides on its own. Learns from outcomes.

Three layers — Welford statistics, Thompson Sampling, outcome calibration — all in allocation-free Rust with lock-free atomics. No manual quality floor tuning needed.

Welford Online Statistics

Tracks a running mean and variance per input regime in a single pass. Requests that deviate from their regime's statistics are detected as anomalous automatically, and unusual patterns get higher quality floors.
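
A minimal sketch of Welford's online update, shown in `f64` for readability. Since the kernel is described as integer-only, the real implementation presumably uses a fixed-point variant. The `is_anomalous` rule (distance from the mean in standard deviations) and the threshold `k` are assumptions, as the page does not state how anomalies are scored.

```rust
/// Running mean and variance for one input regime (Welford's algorithm).
#[derive(Debug, Default, Clone, Copy)]
pub struct Welford {
    n: u64,
    mean: f64,
    m2: f64, // sum of squared deviations from the current mean
}

impl Welford {
    /// Fold one observation into the statistics in O(1), with no stored history.
    pub fn update(&mut self, x: f64) {
        self.n += 1;
        let delta = x - self.mean;
        self.mean += delta / self.n as f64;
        self.m2 += delta * (x - self.mean);
    }

    pub fn mean(&self) -> f64 {
        self.mean
    }

    /// Sample variance; undefined for fewer than two observations.
    pub fn variance(&self) -> Option<f64> {
        if self.n < 2 {
            None
        } else {
            Some(self.m2 / (self.n - 1) as f64)
        }
    }

    /// Hypothetical anomaly rule: more than `k` standard deviations from the mean.
    pub fn is_anomalous(&self, x: f64, k: f64) -> bool {
        match self.variance() {
            Some(v) if v > 0.0 => (x - self.mean).abs() > k * v.sqrt(),
            _ => false,
        }
    }
}
```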

Thompson Sampling

Bayesian explore/exploit for tier selection. Beta(alpha, beta) per regime per tier. Converges to optimal policy without manual tuning.
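
A minimal sketch of Thompson Sampling over Beta posteriors, assuming integer success and failure counts. For brevity it allocates a scratch `Vec` and uses a small built-in PRNG, so it is not the allocation-free version described above. `Arm`, `select_tier`, and `SplitMix64` are illustrative names.

```rust
/// Tiny deterministic PRNG (SplitMix64) so the sketch needs no external crate.
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform draw in [0, 1).
    pub fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Beta(alpha, beta) posterior for one (regime, tier) pair.
#[derive(Clone, Copy, Debug)]
pub struct Arm {
    pub alpha: u32,
    pub beta: u32,
}

impl Arm {
    /// Beta(1, 1) is the uniform prior.
    pub fn new() -> Self {
        Arm { alpha: 1, beta: 1 }
    }

    pub fn record(&mut self, success: bool) {
        if success {
            self.alpha += 1;
        } else {
            self.beta += 1;
        }
    }

    /// For integer parameters, the alpha-th smallest of (alpha + beta - 1)
    /// uniform draws is distributed as Beta(alpha, beta).
    pub fn sample(&self, rng: &mut SplitMix64) -> f64 {
        let n = (self.alpha + self.beta - 1) as usize;
        let mut draws: Vec<f64> = (0..n).map(|_| rng.next_f64()).collect();
        draws.sort_by(|a, b| a.total_cmp(b));
        draws[self.alpha as usize - 1]
    }
}

/// Draw once from every arm's posterior and pick the tier with the largest draw.
pub fn select_tier(arms: &[Arm], rng: &mut SplitMix64) -> usize {
    let mut best = 0;
    let mut best_draw = f64::MIN;
    for (i, arm) in arms.iter().enumerate() {
        let d = arm.sample(rng);
        if d > best_draw {
            best = i;
            best_draw = d;
        }
    }
    best
}
```

Arms with little evidence produce widely spread draws and still get picked occasionally (exploration). Arms with a strong success record win most draws (exploitation).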

Outcome Calibration

Failures raise the quality floor 2x faster than successes lower it. The engine is naturally conservative — it protects quality by default.
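
The asymmetric update can be sketched as follows, assuming the floor is stored in basis points and moves by a fixed step. The step size, the basis-point encoding, and the `QualityFloor` type are assumptions. A production version would presumably use the lock-free atomics mentioned above rather than `&mut self`.

```rust
/// Quality floor in basis points (0..=10_000), so 0.55 is stored as 5_500.
pub struct QualityFloor {
    bps: u32,
    step_bps: u32,
}

impl QualityFloor {
    pub fn new(start_bps: u32, step_bps: u32) -> Self {
        QualityFloor { bps: start_bps.min(10_000), step_bps }
    }

    /// A failure raises the floor by two steps; a success lowers it by one.
    pub fn record_outcome(&mut self, success: bool) {
        if success {
            self.bps = self.bps.saturating_sub(self.step_bps);
        } else {
            let raise = self.step_bps.saturating_mul(2);
            self.bps = self.bps.saturating_add(raise).min(10_000);
        }
    }

    pub fn bps(&self) -> u32 {
        self.bps
    }
}
```

With this rule it takes two successes to undo one failure, which is what biases the floor toward the conservative side.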

Warm-Start Floors

Pre-computed from public benchmarks (WildChat 500K, KDD 494K, SP500 14K). Day-one defaults: support 0.55, coding 0.72, compliance 0.92, security 0.92.

7.2x

fewer missed attacks

Adaptive vs random on KDD Cup 99

16.69%

miss rate

SP500: 20% fewer errors than random

+4.9pp

learning

KDD last quarter vs first quarter

regimes

Each learns its own quality floor

Client override always wins. The adaptive router recommends quality floors only when the client doesn’t provide one. Client-provided floors are never overridden. For high-stakes workflows, set causal_exploration_bps=0 and use shadow mode instead.

03 — Decision Proof Chain

Every decision is cryptographically chained

Each decision gets a SHA-256 fingerprint that includes the previous decision's hash, forming an immutable chain. Tampering with any decision invalidates every subsequent entry.

Decision_n-1 → SHA-256 → Decision_n → SHA-256 → Decision_n+1 → WAL

Write-Ahead Log

Every decision is durably written before acknowledgment. The WAL uses hash chaining — each entry includes the previous entry's SHA-256 digest.

Crash Recovery

Crash-tail truncation detects and discards incomplete entries on restart. Deterministic replay re-derives state from the surviving chain.

Group Commit

Decisions within a 200µs window are batched into a single fsync, amortizing I/O cost without sacrificing durability guarantees.

Checkpoint Signing

Optional ed25519 signatures on periodic checkpoints. KMS integration available. The checkpoint proves the chain was intact at a known point.

        // WAL entry structure
        struct WalEntry {
            sequence_id: u64,
            prev_hash: [u8; 32],    // SHA-256 of previous entry
            decision: DecisionRecord,
            fingerprint: [u8; 32],  // SHA-256(prev_hash || decision)
            timestamp: i64,
        }
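
The chaining and recovery logic can be sketched without external crates by swapping SHA-256 for the standard library's `DefaultHasher`. That hasher is not cryptographic and yields 64-bit digests, so it only illustrates the linking structure. `Entry`, `append`, and `valid_prefix_len` are hypothetical names, and the payload bytes stand in for `DecisionRecord`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in digest. NOT cryptographic: the engine described above uses SHA-256.
fn digest(prev_hash: u64, payload: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write_u64(prev_hash);
    h.write(payload);
    h.finish()
}

#[derive(Debug, Clone)]
pub struct Entry {
    pub sequence_id: u64,
    pub prev_hash: u64,
    pub payload: Vec<u8>, // stands in for DecisionRecord
    pub fingerprint: u64, // digest(prev_hash || payload)
}

/// Append a record whose fingerprint covers the previous entry's fingerprint.
pub fn append(chain: &mut Vec<Entry>, payload: &[u8]) {
    let sequence_id = chain.len() as u64;
    let prev_hash = chain.last().map_or(0, |e| e.fingerprint);
    let fingerprint = digest(prev_hash, payload);
    chain.push(Entry {
        sequence_id,
        prev_hash,
        payload: payload.to_vec(),
        fingerprint,
    });
}

/// Number of leading entries whose sequence, link, and fingerprint all verify.
/// Everything from the first bad entry onward is untrusted.
pub fn valid_prefix_len(chain: &[Entry]) -> usize {
    let mut expected_prev = 0u64;
    for (i, e) in chain.iter().enumerate() {
        let ok = e.sequence_id == i as u64
            && e.prev_hash == expected_prev
            && e.fingerprint == digest(e.prev_hash, &e.payload);
        if !ok {
            return i;
        }
        expected_prev = e.fingerprint;
    }
    chain.len()
}
```

On restart, a recovery routine in this style could call `chain.truncate(valid_prefix_len(&chain))` to discard a damaged tail and then replay the surviving prefix.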

04 — Budget Enforcement

Per-tenant atomic budget control

Balances are denominated in microcents and stored as i64 atomics. Thread-safe reservation uses a CAS loop — no mutex, no lock contention. The budget journal is hash-chained and durably synced.

Request → Reserve → Execute → Commit → Done

Reserve on decision: estimated cost deducted atomically before execution begins
Commit on execution: actual cost replaces the reservation; delta refunded
Release on failure: full reservation returned if execution never completes
CAS loop: compare-and-swap on AtomicI64 — no mutex, no lock convoy
Hash-chained journal: every balance mutation is logged with a SHA-256 chain and durably fsynced
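
A minimal sketch of the reserve, commit, and release steps on an `AtomicI64`, assuming the actual cost is capped at the reserved amount. The `Budget` type and its method names are illustrative, and the journal is omitted.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Per-tenant balance in microcents.
pub struct Budget {
    balance: AtomicI64,
}

impl Budget {
    pub fn new(microcents: i64) -> Self {
        Budget { balance: AtomicI64::new(microcents.max(0)) }
    }

    pub fn balance(&self) -> i64 {
        self.balance.load(Ordering::Acquire)
    }

    /// Reserve: deduct `estimate` only if the balance would stay >= 0.
    pub fn reserve(&self, estimate: i64) -> bool {
        if estimate < 0 {
            return false;
        }
        let mut cur = self.balance.load(Ordering::Acquire);
        loop {
            if cur < estimate {
                return false; // insufficient funds, nothing was changed
            }
            match self.balance.compare_exchange_weak(
                cur,
                cur - estimate,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                Err(observed) => cur = observed, // lost the race, retry
            }
        }
    }

    /// Commit: the actual cost replaces the reservation, the rest is refunded.
    /// Assumption: actual cost is capped at the reserved amount.
    pub fn commit(&self, reserved: i64, actual: i64) {
        let charged = actual.max(0).min(reserved);
        self.balance.fetch_add(reserved - charged, Ordering::AcqRel);
    }

    /// Release: execution never completed, return the full reservation.
    pub fn release(&self, reserved: i64) {
        self.balance.fetch_add(reserved, Ordering::AcqRel);
    }
}
```

Because the deduction only happens inside a successful compare-and-swap against a balance that was checked to cover it, no interleaving of `reserve` calls can drive the balance below zero.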

Budget never goes negative. This invariant is checked by proptest with randomized operation sequences, including concurrent reservations, partial commits, and forced failures. If any generated sequence produces a negative balance, the test suite fails.

05 — Causal Regret Measurement

Doubly-robust off-policy evaluation

The engine measures whether past decisions were optimal by comparing actual outcomes against counterfactual alternatives. This is causal inference, not correlation — and it uses importance weighting to correct for the logging policy's selection bias.

        // regret computation

        regret = best_model_reward - selected_model_reward

        // reward formula

        reward = (business_outcome × quality) - cost - latency_penalty

        // importance weight (corrects for selection bias)

        weight = 1 / logging_propensity    // capped at 20x
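
The three formulas above can be written in integer arithmetic as a rough sketch. The fixed-point encodings are assumptions: quality and propensity in basis points (10_000 = 1.0) and the weight in thousandths (1_000 = 1x). Treating a zero propensity as the capped weight is also a choice of this sketch.

```rust
/// reward = (business_outcome x quality) - cost - latency_penalty, in microcents.
/// Assumption: quality is expressed in basis points (10_000 = 1.0).
pub fn reward(business_outcome: i64, quality_bps: i64, cost: i64, latency_penalty: i64) -> i64 {
    // Widen to i128 so the product cannot overflow, then clamp back to i64.
    let value = (business_outcome as i128 * quality_bps as i128 / 10_000)
        .clamp(i64::MIN as i128, i64::MAX as i128) as i64;
    value.saturating_sub(cost).saturating_sub(latency_penalty)
}

/// regret = best_model_reward - selected_model_reward.
pub fn regret(best_model_reward: i64, selected_model_reward: i64) -> i64 {
    best_model_reward.saturating_sub(selected_model_reward)
}

/// weight = 1 / logging_propensity, capped at 20x.
/// Propensity in basis points (1..=10_000), result in thousandths (1_000 = 1x).
pub fn importance_weight_milli(propensity_bps: u32) -> u32 {
    const CAP_MILLI: u32 = 20_000;
    if propensity_bps == 0 {
        return CAP_MILLI; // assumption: treat an unknown propensity as the cap
    }
    (10_000_000 / propensity_bps).min(CAP_MILLI)
}
```

A model that the logging policy picked 5% of the time (500 bps) gets exactly the 20x cap. Anything rarer would otherwise exceed it and is clamped.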

Outcome Ingestion

Outcomes are bound to their originating decision and execution fingerprints. No outcome can be attributed without matching the proof chain.

Importance Weighting

Weight = 1 / logging_propensity, capped at 20x to prevent variance explosion from rarely selected models. This is the importance-weighted correction term of a standard doubly-robust estimator.

Holdout Validation

20% of traffic (by default) is held out from policy intervention. The holdout provides an unbiased baseline for regret comparison.

Calibration Check

The engine cross-checks its reward model predictions against observed outcomes. Systematic bias triggers a calibration warning, not a silent drift.

Measures regret but does not auto-optimize — that is intentional. The engine tells you which decisions were suboptimal and by how much. It does not autonomously change the policy. The human decides.

06 — Staged Policy Rollout

Shadow evaluation before enforcement

New policies are never deployed directly. A candidate policy snapshot is staged and runs in shadow mode alongside production, evaluating every decision in parallel without affecting live traffic.

Stage candidate → Shadow eval → Count mismatches → Promotion gate → Promote / Rollback

Shadow evaluation runs the candidate policy on live requests without altering the production decision
Action mismatches: counts how often the candidate would choose a different action (allow vs. block vs. downgrade)
Model mismatches: counts how often the candidate would select a different model
Promotion gates: minimum sample count + action delta threshold must both pass before promotion is allowed
Rollback: explicit snapshot ID rollback — the previous policy is always available for instant restore
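
The mismatch counters and the promotion gate could look roughly like this. `ShadowStats`, `Outcome`, and the basis-point threshold are assumed names and units. The page only states that a minimum sample count and an action delta threshold must both pass.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Action {
    Allow,
    Block,
    Downgrade,
}

/// What a policy decided for one request.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct Outcome {
    pub action: Action,
    pub model_id: u16,
}

#[derive(Default, Debug)]
pub struct ShadowStats {
    pub samples: u64,
    pub action_mismatches: u64,
    pub model_mismatches: u64,
}

impl ShadowStats {
    /// Compare the candidate's decision with production's; production is unaffected.
    pub fn observe(&mut self, production: Outcome, candidate: Outcome) {
        self.samples += 1;
        if production.action != candidate.action {
            self.action_mismatches += 1;
        }
        if production.model_id != candidate.model_id {
            self.model_mismatches += 1;
        }
    }

    /// Share of requests with a different action, in basis points.
    pub fn action_delta_bps(&self) -> u64 {
        if self.samples == 0 {
            0
        } else {
            self.action_mismatches * 10_000 / self.samples
        }
    }

    /// Both conditions must hold before promotion is allowed.
    pub fn may_promote(&self, min_samples: u64, max_action_delta_bps: u64) -> bool {
        self.samples >= min_samples && self.action_delta_bps() <= max_action_delta_bps
    }
}
```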

One candidate at a time. This is by design. Concurrent shadow policies create confounding variables that make mismatch counts meaningless. No uncontrolled experiments.

07 — Governance Hard Limits

Fail-closed safety gates

These constraints are evaluated before every decision. They cannot be overridden by client metadata, configuration, or policy rules. If any gate fails, the decision is blocked.

✕ Risk score hard block: risk < 0.96
✕ Confidence floor: confidence ≥ 0.55
✕ No paid models for zero expected value: EV > 0 for paid
✕ Cost-to-value ratio limit: ratio < 1.25
✕ Non-finite numeric rejection: NaN / Inf → reject

These gates cannot be overridden by client metadata. They are compiled into the kernel. A request with risk 0.97 is blocked regardless of tenant, business value, or any other field. The gate evaluates first and short-circuits the entire pipeline.
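
A minimal sketch of fail-closed gates with the thresholds listed above. The `Request` fields, the `GateFailure` variants, and the choice to reject non-finite inputs before the other checks are assumptions. The thresholds are constants rather than parameters to mirror the "compiled into the kernel" property.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum GateFailure {
    NonFinite,
    RiskBlock,
    ConfidenceFloor,
    PaidModelZeroValue,
    CostToValue,
}

/// Hypothetical request fields at the API boundary.
#[derive(Clone, Copy, Debug)]
pub struct Request {
    pub risk: f64,
    pub confidence: f64,
    pub expected_value: f64,
    pub cost: f64,
    pub paid_model: bool,
}

const MAX_RISK: f64 = 0.96;
const MIN_CONFIDENCE: f64 = 0.55;
const MAX_COST_TO_VALUE: f64 = 1.25;

/// Returns the first gate that fails; `Ok(())` only if every gate passes.
pub fn check_hard_limits(r: &Request) -> Result<(), GateFailure> {
    // Reject NaN / Inf up front so later comparisons are well defined.
    if ![r.risk, r.confidence, r.expected_value, r.cost]
        .iter()
        .all(|v| v.is_finite())
    {
        return Err(GateFailure::NonFinite);
    }
    if r.risk >= MAX_RISK {
        return Err(GateFailure::RiskBlock);
    }
    if r.confidence < MIN_CONFIDENCE {
        return Err(GateFailure::ConfidenceFloor);
    }
    if r.paid_model && r.expected_value <= 0.0 {
        return Err(GateFailure::PaidModelZeroValue);
    }
    // cost / value < 1.25, written without division to avoid dividing by zero.
    if r.cost > 0.0 && r.cost >= MAX_COST_TO_VALUE * r.expected_value {
        return Err(GateFailure::CostToValue);
    }
    Ok(())
}
```

The function takes no tenant or policy argument, so nothing a caller passes in can relax a threshold.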

08 — Domain-Neutral Proof

1,110,073 real records. 4 industries. 0 errors.

Same binary. Zero domain-specific code. Every dataset is real — Yahoo Finance, KDD Cup 99, UCI Covertype, California Housing.

14,400

SP500 Finance

30 tickers, 2yr daily · 0 errors

494,021

Cybersecurity

KDD Cup 99 intrusion detection · 0 errors

581,012

Forestry

UCI Covertype classification · 0 errors

20,640

Real Estate

California Housing valuation · 0 errors

Marginal savings depend on your current routing maturity: teams with no routing see ~44%, teams with basic routing see ~26%, teams with smart routing see ~14%. The shadow replay pilot measures YOUR actual savings.

09 — Test Infrastructure

316 tests. Not happy-path.

The test suite is designed to break the engine, not confirm that it works. Fault injection, property-based testing, exhaustive concurrency analysis, and adversarial inputs.

WAL Fault Injection

Truncation mid-write, bit corruption in committed entries, partial writes with power-loss simulation, and 8-thread concurrent contention on a single WAL file.

Budget Proptest

Random operation sequences — reserve, commit, release, fail — with NaN and Inf poison values injected at random positions. Balance invariant checked after every sequence.

Loom Exhaustive

3-way reservation race tested under ALL interleavings using loom. Not sampling — every possible thread schedule is explored to prove the absence of data races.

Handler Adversarial

u32::MAX tokens, empty messages, malformed JSON, and rate limiter 429 responses. The handler must never panic, never produce NaN, and never leak budget.

Constant-time key comparison. Cryptographic key comparisons use length-independent usize arithmetic with no early exit on mismatch, so the time taken does not reveal where two keys first differ. This removes the usual timing side channel by construction.
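
One common way to write such a comparison is to fold every byte difference, plus the length difference, into a single accumulator and inspect it only at the end. The sketch below assumes that approach. `ct_eq` is a hypothetical name, and the loop length depends on the expected key rather than on the caller's input.

```rust
/// Compare a caller-provided key against the expected key without an early exit.
pub fn ct_eq(provided: &[u8], expected: &[u8]) -> bool {
    // Fold the length difference into the accumulator instead of returning early.
    let mut diff: usize = provided.len() ^ expected.len();
    for i in 0..expected.len() {
        // Missing bytes in `provided` are read as 0; the length mismatch
        // recorded above already forces a `false` result in that case.
        let p = provided.get(i).copied().unwrap_or(0);
        diff |= (p ^ expected[i]) as usize;
    }
    diff == 0
}
```

Rust gives no formal constant-time guarantee for code like this, since an optimizer is free to restructure it. Production systems often rely on an audited crate such as `subtle` for that reason.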

Four products, one engine