Rust · adaptive · proof-carrying · domain-neutral

Calybris Engine

The decision engine that learns where to spend

Every decision — AI model call, trading signal, medical triage, factory inspection — gets an adaptive quality floor, Thompson Sampling exploration, outcome-calibrated routing, and a SHA-256 sealed proof.

1M real conversations: 0 errors, $572K saved, 84% cost reduction.
1.1M across 4 industries: 0 runtime failures, 0 domain-specific code.

Measured on a consumer laptop. Not a cloud SLO claim.

All numbers come from a single-node run on Windows with NTFS and no cluster. Linux with NVMe would likely improve the HTTP and WAL numbers. The full methodology covers splits, hyperparameters, and known limitations.

8.6M decisions / sec: Integer kernel, overflow-safe, zero allocation
6,084 req / sec: Durable HTTP with WAL fsync (c=128)
42ms p99 latency: Including disk fsync, 15K requests, 0 failed
1M conversations (WildChat + OpenAssistant): 0 errors · $572K saved (84%) · 2,411 req/s
316 tests passing: Fault injection + proptest + loom + adaptive
22 models: 6 providers (OpenAI, Anthropic, Google, DeepSeek, Meta, Mistral)
69M WAL rows: Hash-chained, tamper-evident, crash-recovered
7.2x fewer missed attacks: Adaptive vs random on KDD Cup 99 (11% vs 80%)

The engine reads the data. Decides on its own. Learns from outcomes.

Three layers — Welford statistics, Thompson Sampling, outcome calibration — all in allocation-free Rust with lock-free atomics.

Welford Online Statistics

Tracks rolling mean and variance per input regime. No heap allocation. Detects anomalous requests automatically — unusual patterns get higher quality floors, common patterns get cheaper routing.

integer arithmetic · 8 regimes
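As a rough illustration of the Welford layer, the sketch below keeps a running mean and variance per regime in a fixed array. It is not the engine's code: it uses f64 for readability where the engine is described as integer-only, and the names `Welford`, `RegimeStats`, and the k-sigma anomaly rule are assumptions made for this example.

```rust
/// Running mean and variance for one input regime (Welford's algorithm).
/// Assumption: f64 for clarity; the engine is described as integer arithmetic.
#[derive(Debug, Clone, Copy, Default)]
pub struct Welford {
    count: u64,
    mean: f64,
    m2: f64, // sum of squared deviations from the running mean
}

impl Welford {
    /// Fold one observation into the running statistics in O(1).
    pub fn update(&mut self, x: f64) {
        self.count += 1;
        let delta = x - self.mean;
        self.mean += delta / self.count as f64;
        self.m2 += delta * (x - self.mean);
    }

    pub fn mean(&self) -> f64 {
        self.mean
    }

    /// Sample variance; None until at least two observations exist.
    pub fn variance(&self) -> Option<f64> {
        if self.count < 2 {
            None
        } else {
            Some(self.m2 / (self.count - 1) as f64)
        }
    }

    /// Hypothetical anomaly rule: more than `k` standard deviations from the mean.
    pub fn is_anomalous(&self, x: f64, k: f64) -> bool {
        match self.variance() {
            Some(var) if var > 0.0 => (x - self.mean).abs() > k * var.sqrt(),
            _ => false,
        }
    }
}

/// One accumulator per regime in a fixed-size array, so nothing is heap-allocated.
pub struct RegimeStats {
    regimes: [Welford; 8],
}

impl RegimeStats {
    pub fn new() -> Self {
        Self { regimes: [Welford::default(); 8] }
    }

    /// Record `x` for `regime`; out-of-range regime indexes are ignored.
    pub fn observe(&mut self, regime: usize, x: f64) {
        if let Some(w) = self.regimes.get_mut(regime) {
            w.update(x);
        }
    }

    pub fn regime(&self, regime: usize) -> Option<&Welford> {
        self.regimes.get(regime)
    }
}
```

A request flagged by `is_anomalous` would be the kind that gets a higher quality floor; how the engine maps the score to a floor is not specified here.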

Thompson Sampling

Bayesian explore/exploit for tier selection. Maintains Beta(alpha, beta) per regime per tier. Explores cheaper tiers when uncertain, exploits when confident. Converges to optimal policy without manual tuning.

Beta-Bernoulli · deterministic seed
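A minimal Beta-Bernoulli sketch of the idea, using only the standard library. The names `SplitMix64`, `Arm`, and `select_tier` are invented for illustration, and the selection rule here simply takes the highest posterior draw; the engine's cost-aware preference for cheaper tiers is not modeled. Integer alpha and beta let the Beta draw be built from sums of exponentials, and the seeded generator makes a run replayable.

```rust
/// SplitMix64: a tiny deterministic generator, so a fixed seed replays exactly.
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self(seed)
    }

    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform draw in (0, 1], so ln() below never sees zero.
    fn next_unit(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }
}

/// Gamma(shape, 1) for an integer shape: the sum of `shape` unit exponentials.
fn gamma_int(rng: &mut SplitMix64, shape: u32) -> f64 {
    (0..shape).map(|_| -rng.next_unit().ln()).sum()
}

/// Beta(alpha, beta) posterior for one (regime, tier) pair.
#[derive(Debug, Clone, Copy)]
pub struct Arm {
    pub alpha: u32, // 1 + observed successes
    pub beta: u32,  // 1 + observed failures
}

impl Arm {
    /// Uniform prior Beta(1, 1).
    pub fn new() -> Self {
        Self { alpha: 1, beta: 1 }
    }

    pub fn update(&mut self, success: bool) {
        if success {
            self.alpha += 1;
        } else {
            self.beta += 1;
        }
    }

    /// Draw from Beta(alpha, beta) as Ga / (Ga + Gb).
    pub fn sample(&self, rng: &mut SplitMix64) -> f64 {
        let x = gamma_int(rng, self.alpha);
        let y = gamma_int(rng, self.beta);
        if x + y == 0.0 { 0.5 } else { x / (x + y) }
    }
}

/// Thompson step: one posterior draw per tier, pick the largest draw.
pub fn select_tier(arms: &[Arm], rng: &mut SplitMix64) -> Option<usize> {
    let mut best: Option<(usize, f64)> = None;
    for (i, arm) in arms.iter().enumerate() {
        let draw = arm.sample(rng);
        if best.map_or(true, |(_, b)| draw > b) {
            best = Some((i, draw));
        }
    }
    best.map(|(i, _)| i)
}
```

With few observations the draws are spread widely, so every tier gets tried; as counts grow the draws concentrate near the empirical success rate and the choice stabilizes.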

Outcome Calibration

When an outcome is reported, quality floors adjust automatically. Failures raise the floor 2x faster than successes lower it. The engine is naturally conservative — it protects quality by default.

asymmetric learning · fail-safe
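The asymmetric update can be pictured with a few lines of fixed-point arithmetic. This is a sketch under assumptions: the floor is held in basis points, and the type name `FloorCalibrator` and the step size are hypothetical. Only the 2:1 ratio between failure and success adjustments comes from the description above.

```rust
/// Quality floor in basis points: 0..=10_000 stands for 0.00..=1.00.
pub const MAX_BP: u32 = 10_000;

pub struct FloorCalibrator {
    floor_bp: u32,
    step_bp: u32, // hypothetical step applied on each success
}

impl FloorCalibrator {
    pub fn new(initial_bp: u32, step_bp: u32) -> Self {
        Self { floor_bp: initial_bp.min(MAX_BP), step_bp }
    }

    pub fn floor_bp(&self) -> u32 {
        self.floor_bp
    }

    /// A failure raises the floor by two steps; a success lowers it by one.
    /// Saturating arithmetic keeps the floor inside 0..=MAX_BP.
    pub fn report(&mut self, success: bool) {
        self.floor_bp = if success {
            self.floor_bp.saturating_sub(self.step_bp)
        } else {
            self.floor_bp
                .saturating_add(self.step_bp.saturating_mul(2))
                .min(MAX_BP)
        };
    }
}
```

Starting from the support default of 0.55 (5,500 bp) with a 50 bp step, one failure moves the floor to 5,600 bp and it takes two successes to return to 5,500 bp.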

Warm-Start Floors

Pre-computed quality floors from public dataset benchmarks (WildChat 500K, KDD 494K, SP500 14K). Day-one defaults for 8 use-case patterns, including support (0.55), coding (0.72), compliance (0.92), and security (0.92).

zero cold-start guessing

Quality Tracker

Empirical success rates per use-case per tier. After 50+ observations, recommends calibrated floors with confidence levels. Catches cross-provider quality gaps before they reach production.

lock-free · concurrent-safe
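One way to keep such counters lock-free is a pair of atomics per (use case, tier). The sketch below is an assumption about shape, not the engine's implementation: `TierStats` and `success_rate_bp` are illustrative names, and it reports only the empirical success rate once 50 observations exist, leaving the mapping from rate to recommended floor unspecified.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Observations required before a rate is reported (the "50+" above).
pub const MIN_OBSERVATIONS: u64 = 50;

/// Success counters for one (use case, tier) pair, updated without locks.
#[derive(Default)]
pub struct TierStats {
    successes: AtomicU64,
    total: AtomicU64,
}

impl TierStats {
    /// Safe to call from many threads at once: each counter is one atomic add.
    pub fn record(&self, success: bool) {
        if success {
            self.successes.fetch_add(1, Ordering::Relaxed);
        }
        self.total.fetch_add(1, Ordering::Relaxed);
    }

    /// Empirical success rate in basis points, or None while the sample is
    /// still too small to say anything.
    pub fn success_rate_bp(&self) -> Option<u64> {
        let total = self.total.load(Ordering::Relaxed);
        if total < MIN_OBSERVATIONS {
            return None;
        }
        // A concurrent record() may have bumped `successes` after `total`
        // was read, so clamp to keep the rate at or below 100%.
        let ok = self.successes.load(Ordering::Relaxed).min(total);
        Some(ok * 10_000 / total)
    }
}
```

Relaxed ordering is enough here because each counter is only ever read as an approximate statistic, never used to guard other memory.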

GBM Compiled to Rust Trees

200 GBM trees trained on 2,090 labeled prompts, exported as pure Rust source code. No ONNX runtime, no Python, no GPU. The model predicts quality_floor, risk_score, and confidence_score from a 384-dim sentence embedding — compiled directly into the binary.

40K lines Rust · 0 dependencies · <1ms inference
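To make "compiled to Rust trees" concrete, here is what generated code of this kind might look like at toy scale. Everything specific is made up for illustration: the two trees, the feature indexes, the thresholds, the leaf values, and the base score. The real export is described as 200 trees.

```rust
/// Hypothetical shape of one exported tree: plain branches on embedding
/// components. Indexes, thresholds, and leaf values are invented.
fn tree_0(e: &[f32; 384]) -> f32 {
    if e[17] < 0.12 {
        if e[203] < -0.05 { 0.031 } else { -0.008 }
    } else {
        0.046
    }
}

fn tree_1(e: &[f32; 384]) -> f32 {
    if e[5] < 0.0 { 0.020 } else { -0.010 }
}

/// Boosted prediction: a base score plus the sum of every tree's leaf value,
/// clamped to a valid floor. BASE is an assumed constant.
pub fn predict_quality_floor(e: &[f32; 384]) -> f32 {
    const BASE: f32 = 0.60;
    (BASE + tree_0(e) + tree_1(e)).clamp(0.0, 1.0)
}
```

Because each tree is ordinary branching code, inference needs no model file or runtime and the compiler can optimize it like any other function.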

Every decision is sealed, chained, and replayable.

Not a log. A cryptographic evidence chain. Tamper with any record and the chain breaks. The engine refuses to start on a broken chain.

SHA-256 Decision Fingerprint

Every decision binds the request payload, policy version, candidate frontier, and final selection into a single SHA-256 hash. Independently verifiable months later.

Hash-Chained WAL

Decision N includes the hash of decision N-1. Delete, modify, or reorder any record and the chain breaks. Validated on startup before accepting new decisions. 69M rows tested.
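The chaining rule can be sketched in a few lines. Important caveat: to stay dependency-free this example substitutes 64-bit FNV-1a for SHA-256, which is not cryptographic and would not make a real log tamper-evident. The `Record` and `Wal` types are illustrative, and nothing here touches disk or fsync.

```rust
/// Stand-in for SHA-256: FNV-1a over the previous hash followed by the payload.
/// NOT cryptographic; used only so the sketch needs no external crates.
fn link_hash(prev_hash: u64, payload: &[u8]) -> u64 {
    let prev = prev_hash.to_le_bytes();
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in prev.iter().chain(payload.iter()) {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

pub struct Record {
    pub payload: Vec<u8>,
    pub prev_hash: u64, // hash of record N-1 (0 for the first record)
    pub hash: u64,      // hash over prev_hash and payload
}

pub struct Wal {
    pub records: Vec<Record>,
}

impl Wal {
    pub fn new() -> Self {
        Self { records: Vec::new() }
    }

    /// Append a record that commits to the hash of its predecessor.
    pub fn append(&mut self, payload: &[u8]) {
        let prev_hash = self.records.last().map_or(0, |r| r.hash);
        let hash = link_hash(prev_hash, payload);
        self.records.push(Record { payload: payload.to_vec(), prev_hash, hash });
    }

    /// Walk the chain from the start; Err(i) is the first record that fails.
    pub fn verify(&self) -> Result<(), usize> {
        let mut expected_prev = 0u64;
        for (i, r) in self.records.iter().enumerate() {
            let recomputed = link_hash(r.prev_hash, &r.payload);
            if r.prev_hash != expected_prev || r.hash != recomputed {
                return Err(i);
            }
            expected_prev = r.hash;
        }
        Ok(())
    }
}
```

A startup check in this style would call `verify()` and refuse to accept new decisions on `Err`, matching the behavior described above.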

Budget Conservation

remaining + reserved + committed = initial. Property-tested with proptest across thousands of random operation sequences. Loom model checking explores every thread interleaving of the concurrent budget operations.
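A single-threaded sketch of the conservation invariant, assuming a reserve/commit/release lifecycle. The method names and the rule that unused reservation flows back to `remaining` are assumptions; the engine's concurrent, atomic version is not shown.

```rust
/// Budget split into three buckets that must always sum to `initial`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Budget {
    initial: u64,
    remaining: u64,
    reserved: u64,
    committed: u64,
}

impl Budget {
    pub fn new(initial: u64) -> Self {
        Self { initial, remaining: initial, reserved: 0, committed: 0 }
    }

    pub fn remaining(&self) -> u64 {
        self.remaining
    }

    /// Move `amount` from remaining to reserved; refuse if it would overdraw.
    pub fn reserve(&mut self, amount: u64) -> bool {
        if amount > self.remaining {
            return false;
        }
        self.remaining -= amount;
        self.reserved += amount;
        true
    }

    /// Settle a reservation: `actual` becomes committed spend and the unused
    /// part of the reservation returns to remaining.
    pub fn commit(&mut self, reserved: u64, actual: u64) -> bool {
        if reserved > self.reserved || actual > reserved {
            return false;
        }
        self.reserved -= reserved;
        self.committed += actual;
        self.remaining += reserved - actual;
        true
    }

    /// Cancel a reservation without spending anything.
    pub fn release(&mut self, amount: u64) -> bool {
        self.commit(amount, 0)
    }

    /// The invariant itself: remaining + reserved + committed = initial.
    pub fn conserved(&self) -> bool {
        self.remaining + self.reserved + self.committed == self.initial
    }
}
```

A property test would apply random sequences of these three operations and assert `conserved()` after every step, including the rejected ones.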

Quality Floor Guarantee

The kernel never selects a model below the quality floor. No override, no exception. If no model qualifies, the request is blocked rather than silently degraded. Property-tested with proptest.
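The guarantee reduces to a filter that runs before any cost comparison. In this sketch the `Model` fields, the `Decision` enum, and the "cheapest qualifying model" rule are assumptions; the engine's i128 utility scoring is replaced here by a plain minimum over cost.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Model {
    pub id: u32,
    pub quality_bp: u32,  // quality score in basis points
    pub cost_micros: u64, // cost per request in millionths of a dollar
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Selected(Model),
    /// No model met the floor: the request is refused, never downgraded.
    Blocked,
}

/// Pick the cheapest model whose quality is at or above the floor.
/// Models below the floor are removed before cost is even considered.
pub fn select(catalog: &[Model], floor_bp: u32) -> Decision {
    catalog
        .iter()
        .copied()
        .filter(|m| m.quality_bp >= floor_bp)
        .min_by_key(|m| m.cost_micros)
        .map_or(Decision::Blocked, Decision::Selected)
}
```

Because `Blocked` is a distinct variant, a caller cannot mistake "nothing qualified" for a valid selection.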

Same binary. 4 industries. 1,110,073 real records. 0 errors.

Every dataset is real — Yahoo Finance, KDD Cup 99, UCI Covertype, California Housing. No synthetic data. No domain-specific code changes.

Domain | Records | Data Source | Errors
SP500 Finance | 14,400 | Yahoo Finance, 30 tickers, 2yr daily | 0
Cybersecurity | 494,021 | KDD Cup 99 intrusion detection | 0
Forestry | 581,012 | UCI Covertype classification | 0
Real Estate | 20,640 | California Housing valuation | 0
Total | 1,110,073 | 4 real datasets | 0
"0 errors" means 0 crashes, 0 panics, and 0 data-corruption events during replay — not 0 wrong decisions. Decision quality depends on quality floor calibration. Marginal savings depend on your current routing maturity: teams with no routing see ~44%, teams with basic routing see ~26%, teams with smart routing see ~14%.

Request to proof in one pipeline. Outcome back to learning.

Request → Adaptive Router → Integer Kernel → Proof Gen → WAL → Budget → Response
Feedback loop: Outcome API → Adaptive Router (Thompson Sampling updates quality floors)

The adaptive router recommends quality floors when the client doesn't provide one. Client-provided floors always override — the engine enhances human judgment, never replaces it.

316 tests. Designed to break the engine.

286 Library Tests

Every subsystem: kernel, WAL, budget, adaptive, quality tracker, handlers, security, telemetry, config. Each test names and exercises a concrete invariant.

26 Integration Tests

Full HTTP stack with mock providers. Shadow mode, proxy mode, auth planes, rate limiting, reconciliation. Axum + Tower end-to-end.

4 Loom Exhaustive Tests

Every thread interleaving for concurrent budget operations. Three threads racing to reserve budget — total never exceeds limit under any schedule.

Fault Injection

Truncated WAL writes, corrupted records, poisoned mutexes, disk-full scenarios, NaN/Infinity inputs. The engine recovers or rejects — never panics.

Infrastructure

Rust (edition 2024, #![forbid(unsafe_code)])
Zero unsafe blocks. Async runtime via Tokio. HTTP via Axum.
Integer Kernel
i128 utility scoring, fixed-point arithmetic, zero allocation in hot path
Thompson Sampling (Rust)
Beta-Bernoulli bandit, lock-free atomics, 8-thread concurrent-safe
Hash-Chained WAL
SHA-256, fsync-on-write, crash recovery, 69M rows tested
22-Model Catalog
OpenAI, Anthropic, Google, DeepSeek, Meta, Mistral + custom JSON
OpenAPI 3.1 + Prometheus
/openapi.json spec, /metrics Prometheus exposition, /health + /ready probes
GBM-to-Rust Codegen
200 decision trees compiled to 40K lines of pure Rust. Zero runtime dependency. <1ms inference.
Sentence Embedding (384-dim)
all-MiniLM-L6-v2 for prompt understanding. CPU-only, ~9ms. Predicts quality_floor, risk, confidence from text.
Circuit Breaker
10 consecutive failures open the circuit, 30s recovery, automatic half-open probe
Docker Deployment
Single image, read-only filesystem, no-new-privileges, one-command pilot
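The circuit breaker entry in the list above (10 consecutive failures, 30s recovery, half-open probe) can be modeled as a small state machine. This is a sketch, not the engine's code: the type names are illustrative, time is passed in as milliseconds to keep the logic testable, and re-opening on a failed half-open probe is an assumption.

```rust
const FAILURE_THRESHOLD: u32 = 10;
const RECOVERY_MS: u64 = 30_000;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Closed { failures: u32 },   // normal operation, counting consecutive failures
    Open { opened_at_ms: u64 }, // calls rejected until the recovery window ends
    HalfOpen,                   // probe traffic allowed through
}

pub struct CircuitBreaker {
    state: State,
}

impl CircuitBreaker {
    pub fn new() -> Self {
        Self { state: State::Closed { failures: 0 } }
    }

    pub fn state(&self) -> State {
        self.state
    }

    /// May a call go out at `now_ms`? Moves Open to HalfOpen once the
    /// recovery window has elapsed.
    pub fn allow(&mut self, now_ms: u64) -> bool {
        if let State::Open { opened_at_ms } = self.state {
            if now_ms.saturating_sub(opened_at_ms) < RECOVERY_MS {
                return false;
            }
            self.state = State::HalfOpen;
        }
        true
    }

    /// Any success closes the circuit and resets the failure streak.
    pub fn on_success(&mut self) {
        self.state = State::Closed { failures: 0 };
    }

    pub fn on_failure(&mut self, now_ms: u64) {
        self.state = match self.state {
            State::Closed { failures } if failures + 1 < FAILURE_THRESHOLD => {
                State::Closed { failures: failures + 1 }
            }
            // Tenth consecutive failure, or a failed half-open probe: open.
            State::Closed { .. } | State::HalfOpen => State::Open { opened_at_ms: now_ms },
            open @ State::Open { .. } => open,
        };
    }
}
```

Callers would check `allow(now)` before each provider call and report the result through `on_success` or `on_failure`.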

The engine gets better every week. Here's the data.

Simulated 4-week shadow replay pilot (10K decisions). Adaptive routing improves as Thompson Sampling calibrates quality floors from outcome feedback.

Week | Calls | Savings | Miss Rate | Precision
Week 1 | 2,500 | 79.0% | 19.2% | 80.8%
Week 2 | 2,500 | 88.2% | 22.9% | 77.1%
Week 3 | 2,500 | 92.5% | 18.5% | 81.5%
Week 4 | 2,500 | 97.1% | 20.7% | 79.3%
Simulated shadow replay with Thompson Sampling + outcome feedback. Savings increase as the engine learns which regimes can be safely downgraded. Miss rate stays near 20% across all four weeks, consistent with the random premium-needed rate in the simulation. In production, shadow replay runs without affecting live traffic. Marginal savings over existing routing: roughly 14% to 44%, depending on current maturity.

Deploy in 3 commands. Audit in 7 days.

1. Deploy

docker compose up -d

Single container. Read-only filesystem. No external dependencies.

2. Mirror

POST /api/v1/route
{"model":"gpt-4o",
 "input_tokens":1200,
 "metadata":{
   "tenant_id":"team-a",
   "use_case":"support"}}

3. Audit

GET /api/v1/audit/report
→ JSON + Markdown
  spend analysis

Board-readable report after 7 days of observation.

No prompt capture. Metadata-only mirror. Private VPC deployment. Shadow mode — production traffic stays unchanged.

One engine, multiple products

Each product plugs its own domain vocabulary. The proof machinery, adaptive routing, and durability layer are shared.