Rust · adaptive · proof-carrying · domain-neutral

Calybris Engine

The decision engine that learns where to spend

Every decision — AI model call, trading signal, medical triage, factory inspection — gets an adaptive quality floor, Thompson Sampling exploration, outcome-calibrated routing, and a SHA-256 sealed proof.

1M real conversations: 0 errors, $572K saved, 84% cost reduction.
1.1M across 4 industries: 0 runtime failures, 0 domain-specific code.

01 — Performance

Measured on a consumer laptop. Not a cloud SLO claim.

All numbers come from a single-node run on Windows with NTFS and no cluster. Linux with NVMe would likely improve the HTTP and WAL numbers. Full methodology: splits, hyperparameters, known limitations.

8.6M

decisions / sec

Integer kernel, overflow-safe, zero allocation

6,084

req / sec

Durable HTTP with WAL fsync (c=128)

42ms

p99 latency

Including disk fsync, 15K requests, 0 failed

WildChat + OpenAssistant

0 errors · $572K saved (84%) · 2,411 req/s

316

tests passing

Fault injection + proptest + loom + adaptive

22

models

6 providers: OpenAI, Anthropic, Google, DeepSeek, Meta, Mistral

69M

WAL rows

Hash-chained, tamper-evident, crash-recovered

7.2x

fewer missed attacks

Adaptive vs random routing on KDD Cup 99 (11% vs 80% miss rate)

02 — Adaptive Intelligence

The engine reads the data. Decides on its own. Learns from outcomes.

Three layers — Welford statistics, Thompson Sampling, outcome calibration — all in allocation-free Rust with lock-free atomics.

Welford Online Statistics

Tracks rolling mean and variance per input regime. No heap allocation. Detects anomalous requests automatically — unusual patterns get higher quality floors, common patterns get cheaper routing.

integer arithmetic · 8 regimes
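
A minimal sketch of per-regime Welford tracking, to make the update rule concrete. It uses f64 for readability, whereas the page states the engine uses integer arithmetic, so treat the numeric type as a simplification. The names `Welford`, `RegimeStats`, and `is_anomalous`, and the k-sigma anomaly rule, are assumptions for illustration and not the engine's API.

```rust
/// Number of input regimes tracked; the page states 8.
const REGIMES: usize = 8;

/// Running statistics for one regime (Welford's online algorithm).
#[derive(Clone, Copy, Debug, Default)]
pub struct Welford {
    count: u64,
    mean: f64,
    m2: f64, // running sum of squared deviations from the mean
}

impl Welford {
    /// Fold one observation into the running mean and variance.
    pub fn update(&mut self, x: f64) {
        self.count += 1;
        let delta = x - self.mean;
        self.mean += delta / self.count as f64;
        self.m2 += delta * (x - self.mean);
    }

    pub fn mean(&self) -> f64 {
        self.mean
    }

    /// Sample variance; undefined until two observations exist.
    pub fn variance(&self) -> Option<f64> {
        if self.count < 2 {
            None
        } else {
            Some(self.m2 / (self.count - 1) as f64)
        }
    }

    /// Hypothetical rule: flag values more than `k` standard deviations out.
    pub fn is_anomalous(&self, x: f64, k: f64) -> bool {
        match self.variance() {
            Some(v) if v > 0.0 => (x - self.mean).abs() > k * v.sqrt(),
            _ => false,
        }
    }
}

/// Fixed-size table of regimes: a plain array, so no heap allocation.
pub struct RegimeStats {
    regimes: [Welford; REGIMES],
}

impl RegimeStats {
    pub fn new() -> Self {
        Self { regimes: [Welford::default(); REGIMES] }
    }

    pub fn observe(&mut self, regime: usize, x: f64) {
        self.regimes[regime % REGIMES].update(x);
    }

    pub fn get(&self, regime: usize) -> &Welford {
        &self.regimes[regime % REGIMES]
    }
}
```

Under this sketch, a request flagged by `is_anomalous` would be a candidate for a higher quality floor; how the engine actually maps anomalies to floors is not described on this page.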

Thompson Sampling

Bayesian explore/exploit for tier selection. Maintains Beta(alpha, beta) per regime per tier. Explores cheaper tiers when uncertain, exploits when confident. Converges to optimal policy without manual tuning.

Beta-Bernoulli · deterministic seed
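
A minimal sketch of Beta-Bernoulli Thompson Sampling with a fixed seed, using only the standard library. The tier count, the SplitMix64 generator, the integer-shape Gamma sampler, and the plain "pick the highest sample" rule are all assumptions chosen to keep the example dependency-free. The page does not say how the engine samples or how it weighs tier cost.

```rust
const REGIMES: usize = 8; // from the page
const TIERS: usize = 3; // assumption: the page does not state a tier count

/// SplitMix64: a small deterministic PRNG, so a fixed seed reproduces a run.
pub struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform value strictly inside (0, 1), so ln() stays finite and nonzero.
    fn next_open01(&mut self) -> f64 {
        ((self.next_u64() >> 12) as f64 + 0.5) / (1u64 << 52) as f64
    }
}

/// Gamma(shape, 1) for integer shape: the sum of `shape` unit exponentials.
fn sample_gamma(rng: &mut SplitMix64, shape: u32) -> f64 {
    (0..shape).map(|_| -rng.next_open01().ln()).sum()
}

/// Beta(a, b) from two Gamma draws: X / (X + Y).
fn sample_beta(rng: &mut SplitMix64, a: u32, b: u32) -> f64 {
    let x = sample_gamma(rng, a);
    let y = sample_gamma(rng, b);
    x / (x + y)
}

/// Posterior counts for one (regime, tier) pair, starting from Beta(1, 1).
#[derive(Clone, Copy)]
pub struct Arm {
    alpha: u32,
    beta: u32,
}

pub struct Bandit {
    arms: [[Arm; TIERS]; REGIMES],
    rng: SplitMix64,
}

impl Bandit {
    pub fn new(seed: u64) -> Self {
        Self {
            arms: [[Arm { alpha: 1, beta: 1 }; TIERS]; REGIMES],
            rng: SplitMix64(seed),
        }
    }

    /// Draw one sample per tier and return the tier with the highest draw.
    pub fn select(&mut self, regime: usize) -> usize {
        let mut best = 0;
        let mut best_p = -1.0;
        for tier in 0..TIERS {
            let arm = self.arms[regime % REGIMES][tier];
            let p = sample_beta(&mut self.rng, arm.alpha, arm.beta);
            if p > best_p {
                best_p = p;
                best = tier;
            }
        }
        best
    }

    /// Report an outcome: success bumps alpha, failure bumps beta.
    pub fn record(&mut self, regime: usize, tier: usize, success: bool) {
        let arm = &mut self.arms[regime % REGIMES][tier % TIERS];
        if success {
            arm.alpha = arm.alpha.saturating_add(1);
        } else {
            arm.beta = arm.beta.saturating_add(1);
        }
    }
}
```

With few observations the Beta draws are wide, so every tier gets tried. As counts accumulate the draws concentrate and the selection settles on the tier with the best observed success rate.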

Outcome Calibration

When an outcome is reported, quality floors adjust automatically. Failures raise the floor 2x faster than successes lower it. The engine is naturally conservative — it protects quality by default.

asymmetric learning · fail-safe
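
The asymmetric update can be sketched as follows. Only the 2x ratio comes from the page. The basis-point unit, the step size, and the name `Floor` are assumptions for illustration.

```rust
/// Quality floor in basis points (0..=10_000). The unit is an assumption.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Floor {
    bp: u32,
}

const MAX_BP: u32 = 10_000;
const STEP_DOWN_BP: u32 = 10; // hypothetical step applied after a success
const STEP_UP_BP: u32 = 2 * STEP_DOWN_BP; // failures move the floor 2x as far

impl Floor {
    pub fn new(bp: u32) -> Self {
        Self { bp: bp.min(MAX_BP) }
    }

    /// Apply one reported outcome to the floor.
    pub fn report(&mut self, success: bool) {
        self.bp = if success {
            self.bp.saturating_sub(STEP_DOWN_BP)
        } else {
            (self.bp + STEP_UP_BP).min(MAX_BP)
        };
    }

    pub fn bp(&self) -> u32 {
        self.bp
    }
}
```

Under these assumptions it takes two successes to undo one failure, which is what biases the floor toward the conservative side.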

Warm-Start Floors

Pre-computed quality floors from public dataset benchmarks (WildChat 500K, KDD 494K, SP500 14K). Day-one defaults for 8 use-case patterns, including support (0.55), coding (0.72), compliance (0.92), and security (0.92).

zero cold-start guessing

Quality Tracker

Empirical success rates per use-case per tier. After 50+ observations, recommends calibrated floors with confidence levels. Catches cross-provider quality gaps before they reach production.

lock-free · concurrent-safe
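
A minimal sketch of a lock-free success counter for one use-case and tier pair, built on standard-library atomics. The 50-observation threshold comes from the text above. The type name `TierCounter` is hypothetical, and the sketch stops at the empirical success rate because the page does not describe how a floor or confidence level is derived from it.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// From the page: recommendations start after 50+ observations.
const MIN_OBSERVATIONS: u64 = 50;

/// Outcome counters for one (use case, tier) pair. Hypothetical type name.
#[derive(Default)]
pub struct TierCounter {
    successes: AtomicU64,
    total: AtomicU64,
}

impl TierCounter {
    /// Record one outcome. Takes &self, so threads can share the counter
    /// without a mutex.
    pub fn record(&self, success: bool) {
        if success {
            self.successes.fetch_add(1, Ordering::Relaxed);
        }
        self.total.fetch_add(1, Ordering::Relaxed);
    }

    /// Empirical success rate, or None until enough outcomes are recorded.
    /// The two counters are read separately, so a reader racing with writers
    /// may see a rate that is off by one in-flight observation.
    pub fn success_rate(&self) -> Option<f64> {
        let total = self.total.load(Ordering::Relaxed);
        if total < MIN_OBSERVATIONS {
            return None;
        }
        let ok = self.successes.load(Ordering::Relaxed);
        Some(ok as f64 / total as f64)
    }
}
```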

GBM Compiled to Rust Trees

200 GBM trees trained on 2,090 labeled prompts, exported as pure Rust source code. No ONNX runtime, no Python, no GPU. The model predicts quality_floor, risk_score, and confidence_score from a 384-dim sentence embedding — compiled directly into the binary.

40K lines Rust · 0 dependencies · <1ms inference
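
To show what "trees exported as Rust source" can look like, here is a toy sketch with two hand-written trees. Every feature index, threshold, leaf value, and the base score is invented for illustration. The real generated code and its function names are not shown on this page.

```rust
/// Embedding dimension stated on the page.
pub const DIM: usize = 384;

/// Hypothetical base score that the tree outputs are added to.
const BASE: f32 = 0.60;

// Each tree becomes nested if/else over the feature array: no model file,
// no runtime, and nothing to allocate. Values below are invented.
#[inline]
fn tree_0(x: &[f32; DIM]) -> f32 {
    if x[12] <= 0.03 {
        if x[200] <= -0.10 { -0.05 } else { 0.02 }
    } else {
        0.08
    }
}

#[inline]
fn tree_1(x: &[f32; DIM]) -> f32 {
    if x[7] <= 0.0 { -0.01 } else { 0.04 }
}

/// Boosted prediction: base score plus the sum of tree outputs, clamped
/// to the valid floor range.
pub fn predict_quality_floor(x: &[f32; DIM]) -> f32 {
    (BASE + tree_0(x) + tree_1(x)).clamp(0.0, 1.0)
}
```

A generator would emit one such function per tree plus a summing function per target, which is consistent with 200 trees expanding to tens of thousands of lines.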

03 — Proof & Durability

Every decision is sealed, chained, and replayable.

Not a log. A cryptographic evidence chain. Tamper with any record and the chain breaks. The engine refuses to start on a broken chain.

SHA-256 Decision Fingerprint

Every decision binds the request payload, policy version, candidate frontier, and final selection into a single SHA-256 hash. Independently verifiable months later.

Hash-Chained WAL

Decision N includes the hash of decision N-1. Delete, modify, or reorder any record and the chain breaks. Validated on startup before accepting new decisions. 69M rows tested.
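
A minimal sketch of the chaining and startup validation described above. To stay dependency-free it uses 64-bit FNV-1a as a stand-in hash, which is not cryptographic; the engine uses SHA-256. The `Record` layout, the zero genesis value, and the function names are assumptions.

```rust
/// FNV-1a 64-bit. Stand-in only: NOT cryptographic, unlike SHA-256.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Hash of a record: covers the previous record's hash and this payload.
fn link_hash(prev: u64, payload: &[u8]) -> u64 {
    let mut buf = Vec::with_capacity(8 + payload.len());
    buf.extend_from_slice(&prev.to_le_bytes());
    buf.extend_from_slice(payload);
    fnv1a(&buf)
}

#[derive(Clone, Debug)]
pub struct Record {
    pub payload: Vec<u8>,
    pub prev: u64, // hash of the preceding record, 0 for the first
    pub hash: u64,
}

#[derive(Default)]
pub struct Wal {
    pub records: Vec<Record>,
}

impl Wal {
    /// Append a record linked to the current tail of the chain.
    pub fn append(&mut self, payload: &[u8]) {
        let prev = self.records.last().map_or(0, |r| r.hash);
        let hash = link_hash(prev, payload);
        self.records.push(Record { payload: payload.to_vec(), prev, hash });
    }

    /// Walk the chain from the start. Err(i) is the first broken record.
    pub fn verify(&self) -> Result<(), usize> {
        let mut prev = 0u64;
        for (i, r) in self.records.iter().enumerate() {
            if r.prev != prev || r.hash != link_hash(prev, &r.payload) {
                return Err(i);
            }
            prev = r.hash;
        }
        Ok(())
    }
}
```

A startup check in this style would call `verify` and refuse to accept new decisions on `Err`.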

Budget Conservation

remaining + reserved + committed = initial. Checked with proptest across thousands of random operation sequences. Loom exhaustive testing explores every thread interleaving of the concurrent budget operations.
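
The invariant can be sketched with a single-threaded ledger in which every operation moves an amount between two buckets and never creates or destroys it. The operation names `reserve`, `commit`, and `release` are assumptions. The page states only the invariant.

```rust
/// Budget ledger. Invariant: remaining + reserved + committed == initial.
#[derive(Debug)]
pub struct Budget {
    initial: u64,
    remaining: u64,
    reserved: u64,
    committed: u64,
}

impl Budget {
    pub fn new(initial: u64) -> Self {
        Self { initial, remaining: initial, reserved: 0, committed: 0 }
    }

    /// Move `amt` from remaining to reserved; refuse if it does not fit.
    pub fn reserve(&mut self, amt: u64) -> bool {
        if amt > self.remaining {
            return false;
        }
        self.remaining -= amt;
        self.reserved += amt;
        true
    }

    /// Move `amt` from reserved to committed (the spend happened).
    pub fn commit(&mut self, amt: u64) -> bool {
        if amt > self.reserved {
            return false;
        }
        self.reserved -= amt;
        self.committed += amt;
        true
    }

    /// Move `amt` from reserved back to remaining (the spend was cancelled).
    pub fn release(&mut self, amt: u64) -> bool {
        if amt > self.reserved {
            return false;
        }
        self.reserved -= amt;
        self.remaining += amt;
        true
    }

    /// The conservation check a property test would assert after each step.
    pub fn conserved(&self) -> bool {
        self.remaining + self.reserved + self.committed == self.initial
    }
}
```

A property test in this spirit generates random sequences of the three operations and asserts `conserved()` after each one.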

Quality Floor Guarantee

The kernel never selects a model below the quality floor. No override, no exception. If no model qualifies, the request is blocked rather than silently degraded. Checked with proptest.
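
A minimal sketch of floor-gated selection. Returning `None` stands for "blocked". Picking the cheapest qualifying model is an assumption made to keep the example short; the `Model` fields and their units are also hypothetical.

```rust
/// Candidate model. Field names and units are hypothetical.
#[derive(Debug, PartialEq)]
pub struct Model {
    pub name: &'static str,
    pub quality_bp: u32, // quality score in basis points
    pub cost_micros: u64, // cost per request in micro-dollars
}

/// Cheapest model at or above the floor, or None if nothing qualifies.
/// None means the request is blocked; there is no fallback below the floor.
pub fn select(models: &[Model], floor_bp: u32) -> Option<&Model> {
    models
        .iter()
        .filter(|m| m.quality_bp >= floor_bp)
        .min_by_key(|m| m.cost_micros)
}
```

Because the filter runs before any cost comparison, a cheaper model below the floor can never win, and an empty candidate set surfaces as an explicit `None`.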

04 — Domain-Neutral Proof

Same binary. 4 industries. 1,110,073 real records. 0 errors.

Every dataset is real — Yahoo Finance, KDD Cup 99, UCI Covertype, California Housing. No synthetic data. No domain-specific code changes.

Domain	Records	Data Source
SP500 Finance	14,400	Yahoo Finance, 30 tickers, 2yr daily
Cybersecurity	494,021	KDD Cup 99 intrusion detection
Forestry	581,012	UCI Covertype classification
Real Estate	20,640	California Housing valuation
Total	1,110,073	4 real datasets

"0 errors" means 0 crashes, 0 panics, and 0 data-corruption events during replay — not 0 wrong decisions. Decision quality depends on quality floor calibration. Marginal savings depend on your current routing maturity: teams with no routing see ~44%, teams with basic routing see ~26%, teams with smart routing see ~14%.

05 — Architecture

Request to proof in one pipeline. Outcome back to learning.

Request → Adaptive Router → Integer Kernel → Proof Gen → WAL → Budget → Response

← Outcome API → Adaptive Router (Thompson Sampling updates quality floors)

The adaptive router recommends quality floors when the client doesn't provide one. Client-provided floors always override — the engine enhances human judgment, never replaces it.
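
The override rule above fits in a few lines. The function name and the basis-point unit are assumptions for illustration.

```rust
/// Resolve the floor used for a request, in basis points (assumed unit).
/// A client-provided floor always wins; the router's recommendation is
/// used only when the client sent none.
pub fn effective_floor_bp(client_floor_bp: Option<u32>, recommended_bp: u32) -> u32 {
    client_floor_bp.unwrap_or(recommended_bp).min(10_000)
}
```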

06 — Test Infrastructure

316 tests. Designed to break the engine.

286 Library Tests

Every subsystem: kernel, WAL, budget, adaptive, quality tracker, handlers, security, telemetry, config. Each test names and exercises a concrete invariant.

26 Integration Tests

Full HTTP stack with mock providers. Shadow mode, proxy mode, auth planes, rate limiting, reconciliation. Axum + Tower end-to-end.

4 Loom Exhaustive Tests

Every thread interleaving for concurrent budget operations. Three threads racing to reserve budget — total never exceeds limit under any schedule.
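
A sketch of the kind of operation such a test exercises: a lock-free reservation that can never drive the budget below zero. It uses standard-library atomics and is not the engine's code. Running it with ordinary threads only samples some schedules, whereas Loom explores them all. The type name `AtomicBudget` is hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Remaining budget shared across threads. Hypothetical type name.
pub struct AtomicBudget {
    remaining: AtomicU64,
}

impl AtomicBudget {
    pub fn new(limit: u64) -> Self {
        Self { remaining: AtomicU64::new(limit) }
    }

    /// Try to take `amount` from the budget. The compare-and-swap loop
    /// inside fetch_update retries on contention, and checked_sub makes
    /// the update fail cleanly when the amount no longer fits.
    pub fn try_reserve(&self, amount: u64) -> bool {
        self.remaining
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |cur| {
                cur.checked_sub(amount)
            })
            .is_ok()
    }

    pub fn remaining(&self) -> u64 {
        self.remaining.load(Ordering::Acquire)
    }
}
```

Because the subtraction and the bounds check happen in one atomic step, the sum of granted reservations cannot exceed the limit under any interleaving.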

Fault Injection

Truncated WAL writes, corrupted records, poisoned mutexes, disk-full scenarios, NaN/Infinity inputs. The engine recovers or rejects — never panics.

07 — Tech Stack

Infrastructure

Rust (edition 2024, #![forbid(unsafe_code)])

Zero unsafe blocks. Async runtime via Tokio. HTTP via Axum.

Integer Kernel

i128 utility scoring, fixed-point arithmetic, zero allocation in hot path

Thompson Sampling (Rust)

Beta-Bernoulli bandit, lock-free atomics, 8-thread concurrent-safe

Hash-Chained WAL

SHA-256, fsync-on-write, crash recovery, 69M rows tested

22-Model Catalog

OpenAI, Anthropic, Google, DeepSeek, Meta, Mistral + custom JSON

OpenAPI 3.1 + Prometheus

/openapi.json spec, /metrics Prometheus exposition, /health + /ready probes

GBM-to-Rust Codegen

200 decision trees compiled to 40K lines of pure Rust. Zero runtime dependency. <1ms inference.

Sentence Embedding (384-dim)

all-MiniLM-L6-v2 for prompt understanding. CPU-only, ~9ms. Predicts quality_floor, risk, confidence from text.

Circuit Breaker

10 consecutive failures opens circuit, 30s recovery, auto half-open
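
A minimal sketch of a breaker with these parameters. The threshold and recovery window come from the line above. The state names, the rule that a failed half-open probe reopens the circuit, and passing `now` explicitly (to make the logic testable) are assumptions.

```rust
use std::time::{Duration, Instant};

const FAILURE_THRESHOLD: u32 = 10; // from the page
const RECOVERY: Duration = Duration::from_secs(30); // from the page

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum State {
    Closed,
    Open { since: Instant },
    HalfOpen,
}

pub struct Breaker {
    state: State,
    consecutive_failures: u32,
}

impl Breaker {
    pub fn new() -> Self {
        Self { state: State::Closed, consecutive_failures: 0 }
    }

    /// May a call proceed at `now`? An open circuit moves to half-open
    /// on its own once the recovery window has elapsed.
    pub fn allow(&mut self, now: Instant) -> bool {
        if let State::Open { since } = self.state {
            if now.duration_since(since) >= RECOVERY {
                self.state = State::HalfOpen;
            } else {
                return false;
            }
        }
        true
    }

    /// Any success closes the circuit and resets the failure streak.
    pub fn on_success(&mut self) {
        self.consecutive_failures = 0;
        self.state = State::Closed;
    }

    /// A failure opens the circuit at the threshold, or immediately if
    /// it was a half-open probe (assumed behavior).
    pub fn on_failure(&mut self, now: Instant) {
        self.consecutive_failures = self.consecutive_failures.saturating_add(1);
        if self.state == State::HalfOpen || self.consecutive_failures >= FAILURE_THRESHOLD {
            self.state = State::Open { since: now };
        }
    }

    pub fn state(&self) -> State {
        self.state
    }
}
```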

Docker Deployment

Single image, read-only filesystem, no-new-privileges, one-command pilot

08 — Shadow Replay

The engine gets better every week. Here's the data.

Simulated 4-week shadow replay pilot (10K decisions). Adaptive routing improves as Thompson Sampling calibrates quality floors from outcome feedback.

Week	Calls	Savings	Miss Rate	Precision
Week 1	2,500	79.0%	19.2%	80.8%
Week 2	2,500	88.2%	22.9%	77.1%
Week 3	2,500	92.5%	18.5%	81.5%
Week 4	2,500	97.1%	20.7%	79.3%

This is a simulated shadow replay using Thompson Sampling with outcome feedback. Savings increase as the engine learns which regimes can be safely downgraded. The miss rate stays near 20%, consistent with the random rate at which requests need the premium tier. In production, shadow replay runs without affecting live traffic. Marginal savings over existing routing range from 14% to 45%, depending on current routing maturity.

09 — Quickstart

Deploy in 3 commands. Audit in 7 days.

1. Deploy

docker compose up -d

Single container. Read-only filesystem. No external dependencies.

2. Mirror

POST /api/v1/route
{"model":"gpt-4o",
 "input_tokens":1200,
 "metadata":{
   "tenant_id":"team-a",
   "use_case":"support"}}

3. Audit

GET /api/v1/audit/report
→ JSON + Markdown
  spend analysis

Board-readable report after 7 days of observation.

No prompt capture. Metadata-only mirror. Private VPC deployment. Shadow mode — production traffic stays unchanged.

One engine, multiple products