Rust · adaptive · proof-carrying · domain-neutral

Calybris Engine

The decision engine that learns where to spend

Adaptive routing with Thompson Sampling. Outcome-calibrated quality floors. SHA-256 sealed proof chain. 22 models across 6 providers.

1M real conversations: 0 errors, $572K saved (84%). 1.1M records across 4 industries: 0 runtime failures.

316 tests. Integer kernel. Zero unsafe code.

Allocation-free integer kernel

No floating-point in the hot path. The kernel uses bounded integer arithmetic throughout — every value is an i64 or u64 denominated in microcents. No heap allocation per decision. No GC pause. No NaN propagation.

// utility formula (integer arithmetic, microcent-denominated)
utility = quality_adjusted_value - risk_penalty - cost - latency_penalty

8.6M decisions / sec (3-run median · overflow-safe)
22 models · 6 providers (OpenAI, Anthropic, Google, DeepSeek, Meta, Mistral)
11 constraint gates (short-circuit on first fail)
0 heap allocations (per hot-path decision)

Why integers? Floating-point introduces non-determinism across platforms and makes audit replay unreliable. Integer arithmetic is bit-identical on every architecture. The kernel can be replayed on any machine and produce the same decision for the same input.
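
A minimal sketch of the utility formula above in overflow-safe integer arithmetic. The field names follow the formula; the `Terms` struct, the `utility` function, and the choice of saturating (rather than checked) arithmetic are assumptions for illustration, not the engine's actual API.

```rust
/// All terms are in microcents. Struct and field names are illustrative only.
#[derive(Clone, Copy)]
pub struct Terms {
    pub quality_adjusted_value: i64,
    pub risk_penalty: i64,
    pub cost: i64,
    pub latency_penalty: i64,
}

/// utility = quality_adjusted_value - risk_penalty - cost - latency_penalty.
/// Saturating subtraction clamps at i64::MIN / i64::MAX instead of wrapping,
/// so extreme inputs cannot wrap around and flip the sign of the result.
pub fn utility(t: Terms) -> i64 {
    t.quality_adjusted_value
        .saturating_sub(t.risk_penalty)
        .saturating_sub(t.cost)
        .saturating_sub(t.latency_penalty)
}
```

Because every operation is on `i64`, the same inputs give the same result on any platform, and nothing here touches the heap.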

The engine reads the data. Decides on its own. Learns from outcomes.

Three layers — Welford statistics, Thompson Sampling, outcome calibration — all in allocation-free Rust with lock-free atomics. No manual quality floor tuning needed.

Welford Online Statistics

Tracks rolling mean and variance per input regime. Detects anomalous requests automatically — unusual patterns get higher quality floors.
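
A minimal sketch of Welford's online update for one regime. It uses `f64` for readability and does not try to reproduce the engine's integer representation; the `Welford` type, its methods, and the `k`-standard-deviations anomaly rule are assumptions for illustration.

```rust
/// Running mean and variance for one regime (Welford's online algorithm).
#[derive(Default)]
pub struct Welford {
    count: u64,
    mean: f64,
    m2: f64, // sum of squared deviations from the running mean
}

impl Welford {
    /// Fold one observation into the running statistics in O(1) time and space.
    pub fn update(&mut self, x: f64) {
        self.count += 1;
        let delta = x - self.mean;
        self.mean += delta / self.count as f64;
        self.m2 += delta * (x - self.mean);
    }

    pub fn mean(&self) -> f64 {
        self.mean
    }

    /// Sample variance; None until at least two observations exist.
    pub fn variance(&self) -> Option<f64> {
        if self.count < 2 {
            None
        } else {
            Some(self.m2 / (self.count - 1) as f64)
        }
    }

    /// Hypothetical anomaly rule: x lies more than `k` standard deviations from the mean.
    pub fn is_anomalous(&self, x: f64, k: f64) -> bool {
        match self.variance() {
            Some(v) if v > 0.0 => (x - self.mean).abs() > k * v.sqrt(),
            _ => false,
        }
    }
}
```

A caller would keep one `Welford` per regime, call `update` on each request feature, and raise the quality floor when `is_anomalous` returns true.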

Thompson Sampling

Bayesian explore/exploit for tier selection. Beta(alpha, beta) per regime per tier. Converges to optimal policy without manual tuning.
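
A dependency-free sketch of one Thompson Sampling step over tiers. It assumes a uniform Beta(1, 1) prior and integer success/failure counts, which lets a Beta draw be built from sums of exponential draws. The `SplitMix64` generator, the `Arm` type, and `select_tier` are illustrative names, not the engine's API.

```rust
/// SplitMix64: a small deterministic PRNG so the sketch needs no external crates.
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform in (0, 1], so ln() below is always finite.
    fn next_unit(&mut self) -> f64 {
        ((self.next_u64() >> 11) + 1) as f64 / (1u64 << 53) as f64
    }
}

/// Gamma(shape, 1) for an integer shape: the sum of `shape` Exp(1) draws.
fn gamma_int(rng: &mut SplitMix64, shape: u32) -> f64 {
    (0..shape).map(|_| -rng.next_unit().ln()).sum()
}

/// Posterior for one (regime, tier) pair: Beta(successes + 1, failures + 1).
#[derive(Clone, Copy, Default)]
pub struct Arm {
    pub successes: u32,
    pub failures: u32,
}

impl Arm {
    /// Draw from Beta(a, b) as X / (X + Y) with X ~ Gamma(a) and Y ~ Gamma(b).
    fn sample(&self, rng: &mut SplitMix64) -> f64 {
        let x = gamma_int(rng, self.successes + 1);
        let y = gamma_int(rng, self.failures + 1);
        x / (x + y)
    }

    /// Update the posterior with one observed outcome.
    pub fn record(&mut self, success: bool) {
        if success {
            self.successes += 1;
        } else {
            self.failures += 1;
        }
    }
}

/// Thompson Sampling step: sample every arm and pick the largest draw.
pub fn select_tier(arms: &[Arm], rng: &mut SplitMix64) -> usize {
    let mut best = 0;
    let mut best_draw = f64::MIN;
    for (i, arm) in arms.iter().enumerate() {
        let draw = arm.sample(rng);
        if draw > best_draw {
            best = i;
            best_draw = draw;
        }
    }
    best
}
```

Arms with few observations have wide posteriors and are still picked occasionally, which is where the exploration comes from. As counts grow, the draws concentrate and the best tier dominates. The sum-of-exponentials sampler costs O(successes + failures) per draw, so it suits a sketch rather than a hot path.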

Outcome Calibration

Failures raise the quality floor 2x faster than successes lower it. The engine is naturally conservative — it protects quality by default.
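
The asymmetric update can be sketched as follows, with the floor held in basis points. The `FloorCalibrator` type, the step size, and the clamping bounds are assumptions; only the 2:1 ratio between failure and success steps comes from the description above.

```rust
/// Quality floor in basis points (10_000 = 1.00).
/// Step size and bounds are assumptions; callers must keep min_bps <= max_bps.
pub struct FloorCalibrator {
    pub floor_bps: u32,
    pub step_bps: u32,
    pub min_bps: u32,
    pub max_bps: u32,
}

impl FloorCalibrator {
    /// A failure raises the floor by 2 * step; a success lowers it by 1 * step.
    /// The result is clamped to [min_bps, max_bps].
    pub fn observe(&mut self, success: bool) {
        let next = if success {
            self.floor_bps.saturating_sub(self.step_bps)
        } else {
            self.floor_bps.saturating_add(self.step_bps.saturating_mul(2))
        };
        self.floor_bps = next.clamp(self.min_bps, self.max_bps);
    }
}
```

With this rule, one failure takes two successes to undo, so the floor only drifts down when successes outnumber failures by more than two to one.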

Warm-Start Floors

Pre-computed from public benchmarks (WildChat 500K, KDD 494K, SP500 14K). Day-one defaults: support 0.55, coding 0.72, compliance 0.92, security 0.92.

7.2x fewer missed attacks (adaptive vs random on KDD Cup 99)
16.69% miss rate (SP500: 20% fewer errors than random)
+4.9pp learning (KDD last quarter vs first quarter)
8 regimes (each learns its own quality floor)

Client override always wins. The adaptive router recommends quality floors only when the client doesn’t provide one. Client-provided floors are never overridden. For high-stakes workflows, set causal_exploration_bps=0 and use shadow mode instead.
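
A small sketch of how the day-one defaults and the client override could combine. The floor values are the ones listed above, converted to basis points; the function names and the string keys for domains are hypothetical.

```rust
/// Day-one warm-start floors from the text, in basis points (10_000 = 1.00).
/// The string keys are illustrative; unknown domains return None.
pub fn warm_start_floor_bps(domain: &str) -> Option<u32> {
    match domain {
        "support" => Some(5_500),
        "coding" => Some(7_200),
        "compliance" | "security" => Some(9_200),
        _ => None,
    }
}

/// Client override always wins: the recommended floor is used only
/// when the client did not provide one.
pub fn effective_floor_bps(client_floor: Option<u32>, recommended: u32) -> u32 {
    client_floor.unwrap_or(recommended)
}
```

Note that a client floor lower than the recommendation is still honored here, which matches the statement that client-provided floors are never overridden.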

Every decision is cryptographically chained

Each decision gets a SHA-256 fingerprint that includes the previous decision's hash, forming an immutable chain. Tampering with any decision invalidates every subsequent entry.

Decision n-1 → SHA-256 → Decision n → SHA-256 → Decision n+1 → WAL

Write-Ahead Log

Every decision is durably written before acknowledgment. The WAL uses hash chaining — each entry includes the previous entry's SHA-256 digest.

Crash Recovery

Crash-tail truncation detects and discards incomplete entries on restart. Deterministic replay re-derives state from the surviving chain.
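
One way crash-tail detection could work is sketched below. The frame layout (length prefix, payload, trailing checksum) and the byte-sum checksum are assumptions made to keep the example self-contained; the engine's on-disk format is not described here, and a real WAL would verify the SHA-256 chain rather than a byte sum.

```rust
/// Stand-in checksum (wrapping byte sum), used only to keep the sketch dependency-free.
fn checksum(payload: &[u8]) -> u32 {
    payload.iter().fold(0u32, |acc, &b| acc.wrapping_add(b as u32))
}

fn read_u32_le(b: &[u8]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], b[3]])
}

/// Assumed frame layout: [len: u32 LE][payload][checksum: u32 LE].
pub fn encode(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(payload.len() + 8);
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
    out.extend_from_slice(&checksum(payload).to_le_bytes());
    out
}

/// Scan from the start and return the byte length of the longest valid prefix.
/// Everything past that offset is an incomplete or corrupt tail to truncate.
pub fn valid_prefix_len(log: &[u8]) -> usize {
    let mut pos = 0usize;
    loop {
        // Header: payload length. A short read means the tail is incomplete.
        let header = match log.get(pos..pos + 4) {
            Some(h) => h,
            None => return pos,
        };
        let len = read_u32_le(header) as usize;
        let end = pos + 4 + len + 4;
        // Payload plus checksum must be fully present.
        let frame = match log.get(pos + 4..end) {
            Some(f) => f,
            None => return pos,
        };
        let (payload, stored) = frame.split_at(len);
        if read_u32_le(stored) != checksum(payload) {
            return pos; // checksum mismatch: stop and discard from here
        }
        pos = end;
    }
}
```

On restart, the log file would be truncated to `valid_prefix_len` bytes and state rebuilt by replaying the surviving entries in order.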

Group Commit

Decisions within a 200µs window are batched into a single fsync, amortizing I/O cost without sacrificing durability guarantees.

Checkpoint Signing

Optional ed25519 signatures on periodic checkpoints. KMS integration available. The checkpoint proves the chain was intact at a known point.

// WAL entry structure
struct WalEntry {
  sequence_id: u64,
  prev_hash: [u8; 32],    // SHA-256 of previous entry
  decision: DecisionRecord,
  fingerprint: [u8; 32],  // SHA-256(prev_hash || decision)
  timestamp: i64,
}
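
A sketch of appending to and verifying a hash chain shaped like the entry above. To stay dependency-free it uses the standard library's `DefaultHasher` as a stand-in digest, which is NOT cryptographic; the engine uses SHA-256. The `Entry` type, the helper names, and the all-zero genesis value are assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Stand-in digest over (prev_hash || decision). NOT cryptographic:
/// a real chain would use SHA-256 and a 32-byte output.
fn digest(prev_hash: u64, decision: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write_u64(prev_hash);
    h.write(decision);
    h.finish()
}

pub struct Entry {
    pub prev_hash: u64,
    pub decision: Vec<u8>,
    pub fingerprint: u64, // digest(prev_hash || decision)
}

/// Append a decision, linking it to the fingerprint of the current tail.
/// The first entry links to 0, an assumed genesis value.
pub fn append(chain: &mut Vec<Entry>, decision: Vec<u8>) {
    let prev_hash = chain.last().map_or(0, |e| e.fingerprint);
    let fingerprint = digest(prev_hash, &decision);
    chain.push(Entry { prev_hash, decision, fingerprint });
}

/// Walk the chain and return the index of the first entry that fails
/// verification, or None if every link checks out.
pub fn first_broken_link(chain: &[Entry]) -> Option<usize> {
    let mut expected_prev = 0u64;
    for (i, e) in chain.iter().enumerate() {
        let linked = e.prev_hash == expected_prev;
        let sealed = digest(e.prev_hash, &e.decision) == e.fingerprint;
        if !linked || !sealed {
            return Some(i);
        }
        expected_prev = e.fingerprint;
    }
    None
}
```

Editing a stored decision changes its recomputed fingerprint, so verification fails at that entry. Recomputing the fingerprint to hide the edit breaks the `prev_hash` link of the next entry instead.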

Per-tenant atomic budget control

Balances are denominated in microcents and stored as i64 atomics. Thread-safe reservation uses a CAS loop — no mutex, no lock contention. The budget journal is hash-chained and durably synced.

Request → Reserve → Execute → Commit → Done

Budget never goes negative. This invariant is proven by proptest with randomized operation sequences including concurrent reservations, partial commits, and forced failures. If any sequence produces a negative balance, the test suite fails.
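
The reservation step can be sketched with a compare-and-swap loop on an `AtomicI64`. The `Budget` type and its method names are illustrative, not the engine's API, and the memory orderings are a conservative choice rather than a tuned one.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Per-tenant balance in microcents. Type and method names are illustrative.
pub struct Budget {
    balance: AtomicI64,
}

impl Budget {
    pub fn new(microcents: i64) -> Self {
        Budget { balance: AtomicI64::new(microcents) }
    }

    /// Try to reserve `amount`. Succeeds only if the balance stays >= 0.
    pub fn try_reserve(&self, amount: i64) -> bool {
        if amount < 0 {
            return false; // a negative reservation would mint budget
        }
        let mut current = self.balance.load(Ordering::Acquire);
        loop {
            if current < amount {
                return false; // insufficient funds: fail without modifying state
            }
            match self.balance.compare_exchange_weak(
                current,
                current - amount,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                // Another thread changed the balance (or the CAS failed
                // spuriously): retry against the freshly observed value.
                Err(observed) => current = observed,
            }
        }
    }

    /// Return an unused reservation, for example after a failed execution.
    pub fn release(&self, amount: i64) {
        self.balance.fetch_add(amount, Ordering::AcqRel);
    }

    pub fn balance(&self) -> i64 {
        self.balance.load(Ordering::Acquire)
    }
}
```

The check and the subtraction are applied as one atomic step, so two threads can never both spend the same microcents. A thread that loses the race re-checks the balance before trying again.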

Doubly-robust off-policy evaluation

The engine measures whether past decisions were optimal by comparing actual outcomes against counterfactual alternatives. This is causal inference, not correlation — and it uses importance weighting to correct for the logging policy's selection bias.

// regret computation
regret = best_model_reward - selected_model_reward

// reward formula
reward = (business_outcome × quality) - cost - latency_penalty

// importance weight (corrects for selection bias)
weight = 1 / logging_propensity    // capped at 20x

Outcome Ingestion

Outcomes are bound to their originating decision and execution fingerprints. No outcome can be attributed without matching the proof chain.

Importance Weighting

Weight = 1 / logging_propensity, capped at 20x to prevent variance explosion from rare selections. This is the importance-weighting term of the standard doubly-robust estimator, which pairs it with the reward model's predictions.
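
A sketch of a capped importance weight and a textbook doubly-robust estimate for a deterministic target policy. The `Logged` fields, the function names, and the handling of non-positive propensities are assumptions; only the 1 / propensity formula and the 20x cap come from the text. Off-policy evaluation is shown in `f64` for readability.

```rust
/// One logged decision. Field names are illustrative, not the engine's schema.
pub struct Logged {
    pub reward: f64,           // observed reward of the logged action
    pub predicted_logged: f64, // reward model's estimate for the logged action
    pub predicted_target: f64, // reward model's estimate for the target policy's action
    pub propensity: f64,       // probability the logging policy chose the logged action
    pub target_agrees: bool,   // true if the target policy picks the same action
}

pub const WEIGHT_CAP: f64 = 20.0;

/// weight = 1 / logging_propensity, capped at 20x.
/// Non-positive propensities fall back to the cap (an assumption).
pub fn importance_weight(propensity: f64) -> f64 {
    if propensity <= 0.0 {
        WEIGHT_CAP
    } else {
        (1.0 / propensity).min(WEIGHT_CAP)
    }
}

/// Doubly-robust value estimate: the model's prediction for the target action,
/// plus an importance-weighted residual on samples where the target policy
/// agrees with the logged action. Returns None for an empty log.
pub fn doubly_robust(logs: &[Logged]) -> Option<f64> {
    if logs.is_empty() {
        return None;
    }
    let total: f64 = logs
        .iter()
        .map(|l| {
            let correction = if l.target_agrees {
                importance_weight(l.propensity) * (l.reward - l.predicted_logged)
            } else {
                0.0
            };
            l.predicted_target + correction
        })
        .sum();
    Some(total / logs.len() as f64)
}
```

The estimate stays consistent if either the reward model or the logged propensities are accurate, which is the sense in which it is doubly robust. Capping the weight trades a small bias for bounded variance.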

Holdout Validation

20% of traffic (by default) is held out from policy intervention. The holdout provides an unbiased baseline for regret comparison.
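
Holdout assignment could be made deterministic by hashing a request key into basis-point buckets, as sketched below. The mixing function, the `in_holdout` name, and the idea of keying on a numeric request identifier are assumptions; the text only specifies the 20% default.

```rust
/// SplitMix64 finalizer used as a cheap, deterministic mixing function.
/// This is an assumption: any stable hash of the request key would do.
fn mix(mut z: u64) -> u64 {
    z = z.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// True if the request belongs to the holdout group.
/// `holdout_bps` is in basis points: 2_000 holds out roughly 20% of keys.
pub fn in_holdout(request_key: u64, holdout_bps: u64) -> bool {
    mix(request_key) % 10_000 < holdout_bps
}
```

Because assignment depends only on the key, replaying the same request always lands it in the same group, which keeps the baseline stable across audit replays.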

Calibration Check

The engine cross-checks its reward model predictions against observed outcomes. Systematic bias triggers a calibration warning, not a silent drift.

Measures regret but does not auto-optimize — that is intentional. The engine tells you which decisions were suboptimal and by how much. It does not autonomously change the policy. The human decides.

Shadow evaluation before enforcement

New policies are never deployed directly. A candidate policy snapshot is staged and runs in shadow mode alongside production, evaluating every decision in parallel without affecting live traffic.

Stage candidate → Shadow eval → Count mismatches → Promotion gate → Promote / Rollback

One candidate at a time. This is by design. Concurrent shadow policies create confounding variables that make mismatch counts meaningless. No uncontrolled experiments.
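
The shadow flow can be sketched as two pure functions: one counts disagreements between production and the candidate, the other applies a promotion threshold. The names, the closure-based policy representation, and the basis-point threshold are all assumptions; the text does not state what the promotion gate checks.

```rust
/// Outcome of a shadow run. Threshold semantics are assumptions for illustration.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Promote,
    Rollback,
}

/// Run `candidate` next to `production` on the same inputs and count disagreements.
/// Only production's answer would be served; the candidate's is just compared.
pub fn count_mismatches<I, D: PartialEq>(
    inputs: &[I],
    production: impl Fn(&I) -> D,
    candidate: impl Fn(&I) -> D,
) -> usize {
    inputs
        .iter()
        .filter(|x| production(*x) != candidate(*x))
        .count()
}

/// Promotion gate: promote only if the mismatch rate is at most `max_mismatch_bps`
/// basis points. An empty shadow run never promotes.
pub fn promotion_gate(mismatches: usize, total: usize, max_mismatch_bps: usize) -> Verdict {
    if total > 0 && mismatches * 10_000 <= max_mismatch_bps * total {
        Verdict::Promote
    } else {
        Verdict::Rollback
    }
}
```

Keeping the comparison to one candidate means every mismatch is attributable to that candidate alone.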

Fail-closed safety gates

These constraints are evaluated before every decision. They cannot be overridden by client metadata, configuration, or policy rules. If any gate fails, the decision is blocked.

The gates are compiled into the kernel. A request with risk 0.97 is blocked regardless of tenant, business value, or any other field. Gate evaluation runs first and short-circuits the entire pipeline.
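
A sketch of fail-closed, short-circuiting gates as a fixed array of function pointers. The `Request` fields, the two example gates, and the risk ceiling of 9_500 basis points are hypothetical; the text only states that a request with risk 0.97 is blocked.

```rust
/// Request fields in integer units. Names are hypothetical.
pub struct Request {
    pub risk_bps: u32, // 9_700 = risk 0.97
    pub cost_microcents: i64,
    pub budget_microcents: i64,
}

#[derive(Debug, PartialEq)]
pub enum Blocked {
    RiskTooHigh,
    OverBudget,
}

/// Assumed ceiling for illustration only.
pub const MAX_RISK_BPS: u32 = 9_500;

type Gate = fn(&Request) -> Result<(), Blocked>;

fn risk_gate(r: &Request) -> Result<(), Blocked> {
    if r.risk_bps > MAX_RISK_BPS {
        Err(Blocked::RiskTooHigh)
    } else {
        Ok(())
    }
}

fn budget_gate(r: &Request) -> Result<(), Blocked> {
    if r.cost_microcents > r.budget_microcents {
        Err(Blocked::OverBudget)
    } else {
        Ok(())
    }
}

/// Fixed at compile time: nothing in the request can add, remove, or reorder gates.
const GATES: [Gate; 2] = [risk_gate, budget_gate];

/// Evaluate gates in order; `?` returns on the first failure, so later gates
/// never run for a blocked request.
pub fn evaluate(r: &Request) -> Result<(), Blocked> {
    for gate in GATES.iter() {
        gate(r)?;
    }
    Ok(())
}
```

Since the gate list is a `const` and the thresholds are not read from the request, no tenant field or metadata value can relax them at runtime.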

1,110,073 real records. 4 industries. 0 errors.

Same binary. Zero domain-specific code. Every dataset is real — Yahoo Finance, KDD Cup 99, UCI Covertype, California Housing.

SP500 Finance: 14,400 records (30 tickers, 2yr daily) · 0 errors
Cybersecurity: 494,021 records (KDD Cup 99 intrusion detection) · 0 errors
Forestry: 581,012 records (UCI Covertype classification) · 0 errors
Real Estate: 20,640 records (California Housing valuation) · 0 errors

Marginal savings depend on your current routing maturity: teams with no routing see ~44%, teams with basic routing see ~26%, teams with smart routing see ~14%. The shadow replay pilot measures YOUR actual savings.

316 tests. Not happy-path.

The test suite is designed to break the engine, not confirm that it works. Fault injection, property-based testing, exhaustive concurrency analysis, and adversarial inputs.

WAL Fault Injection

Truncation mid-write, bit corruption in committed entries, partial writes with power-loss simulation, and 8-thread concurrent contention on a single WAL file.

Budget Proptest

Random operation sequences — reserve, commit, release, fail — with NaN and Inf poison values injected at random positions. Balance invariant checked after every sequence.

Loom Exhaustive

3-way reservation race tested under ALL interleavings using loom. Not sampling — every possible thread schedule is explored to prove the absence of data races.

Handler Adversarial

u32::MAX tokens, empty messages, malformed JSON, and rate limiter 429 responses. The handler must never panic, never produce NaN, and never leak budget.

Constant-time key comparison. Cryptographic key comparisons use length-independent usize arithmetic — no early exit on mismatch. Timing side-channel leakage is structurally impossible.
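
A sketch of a comparison with no early exit, accumulating differences in a `usize`. The function name is illustrative and this is not the engine's implementation. The compiler does not guarantee constant-time machine code, so production systems usually rely on an audited crate such as `subtle`.

```rust
/// Compare two byte strings without returning at the first mismatch.
/// Running time depends on the longer input's length, not on where
/// (or whether) the inputs differ.
pub fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    // Fold the length difference into the accumulator instead of returning early.
    let mut diff: usize = a.len() ^ b.len();
    let n = a.len().max(b.len());
    for i in 0..n {
        // Positions past the end of the shorter input read as 0,
        // so both inputs are walked to the same length.
        let x = if i < a.len() { a[i] } else { 0 };
        let y = if i < b.len() { b[i] } else { 0 };
        diff |= (x ^ y) as usize;
    }
    diff == 0
}
```

A naive `a == b` may stop at the first differing byte, which lets an attacker learn how many leading bytes of a guess were correct. Here the loop always runs to completion and the result is decided only at the end.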

Four products, one engine

Each product plugs its own domain vocabulary into the Calybris policy gate. The proof machinery, WAL, budget control, and regret measurement are shared infrastructure.