Calybris Benchmark Methodology

Every performance, savings, and reliability number on the landing page, in the Medium articles, and in this document comes from a reproducible test or replay. This page documents exactly how each benchmark was run, what splits were used, what hyperparameters were chosen, and what the known limitations are.

If you find a methodological flaw, email emirhuseyininci@gmail.com. I'll fix it or retract the claim.

1. Kernel Throughput (8.6M decisions/sec)

What it measures: Pure integer kernel evaluation speed — no I/O, no HTTP, no WAL, no network.

How it was measured:

Binary: calybris_stress (release profile, LTO thin, codegen-units=1)
Workload: 1,000,000 synthetic decisions with mixed parameters
Seed: 73,120,526 (deterministic, reproducible)
Model catalog: 4 models (semantic-cache, local-small, gpt-4o-mini, gpt-4o)
3-run median on consumer Windows laptop (Intel i7, NTFS)
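
The 3-run median can be illustrated with a small timing harness. This is a minimal Python sketch, not the calybris_stress binary itself: `measure_throughput`, `median_throughput`, and `dummy_workload` are hypothetical names, and the loop body is only a stand-in for the integer kernel.

```python
import statistics
import time


def measure_throughput(workload, rows):
    """Run workload(rows) once and return decisions per second."""
    start = time.perf_counter()
    workload(rows)
    elapsed = time.perf_counter() - start
    return rows / elapsed


def median_throughput(workload, rows, runs=3):
    """Repeat the measurement and report the median of the runs."""
    samples = [measure_throughput(workload, rows) for _ in range(runs)]
    return statistics.median(samples)


def dummy_workload(rows):
    # Hypothetical stand-in for the kernel loop: integer-only work, no I/O.
    acc = 0
    for i in range(rows):
        acc += i & 3
    return acc


if __name__ == "__main__":
    print(f"{median_throughput(dummy_workload, 100_000):,.0f} decisions/sec")
```

Taking the median rather than the mean keeps a single slow run (for example, one disturbed by background activity on a laptop) from skewing the reported figure.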

Known limitations:

4-model catalog. With 22 models, the pure kernel slows down roughly 3x versus the 4-model synthetic loop.
Single-threaded tight loop, no concurrent load
The 12,222 decisions/sec real-data benchmark (section 6) is a broader benchmark: it includes real-record JSON mapping, 22-model evaluation, SHA-256 proof generation, and full decision-object construction. It is not directly comparable to the 8.6M kernel-only number.

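Why 8.6M, not 10.27M: An earlier version used wrapping_mul, which silently wraps around on integer overflow and can produce wrong results. It was reverted to saturating_mul, which clamps at the integer bounds instead. Correctness > speed.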
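
The difference between the two multiplication modes can be shown with a small Python simulation. This is a sketch under an assumption: the kernel's integer width is not stated here, so 64-bit signed integers are assumed, and the helper names are hypothetical Python stand-ins for Rust's wrapping_mul and saturating_mul.

```python
I64_MIN = -(2**63)
I64_MAX = 2**63 - 1


def wrapping_mul_i64(a, b):
    """Multiply, keep the low 64 bits, reinterpret as two's complement."""
    product = (a * b) & (2**64 - 1)
    return product - 2**64 if product > I64_MAX else product


def saturating_mul_i64(a, b):
    """Multiply and clamp to the i64 range instead of wrapping."""
    return max(I64_MIN, min(I64_MAX, a * b))


big = 2**62  # assumed example operand, chosen so that big * 4 overflows i64
print(wrapping_mul_i64(big, 4))    # 0: the overflow wraps silently
print(saturating_mul_i64(big, 4))  # 9223372036854775807: clamped at the maximum
```

A wrapped result of 0 looks like a valid small value, whereas a saturated result stays at the extreme of the range, which is easier to reason about in a cost or score comparison.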

2. HTTP Gateway (6,084 req/sec, 42ms p99)

What it measures: Full HTTP stack including JSON parse, policy evaluation, WAL write with fsync, JSON response.

Concurrency: 128 connections, 1ms mock provider
15,000 requests total, WAL enabled with fsync
Limitation: mock provider, same machine, NTFS, no TLS

3. SP500 Trading Benchmark

Data

Source: Yahoo Finance via yfinance
Tickers: 20-30 major US stocks, 2 years daily OHLCV
Total: 9,580–14,400 records
No lookahead bias: features from day T, ground truth from day T+1

Features (8 dimensions)

daily_return, volume_ratio, intraday_range, gap,
momentum_5d, volatility_20d, momentum_10d, avg_range_5d

Ground truth

A day T is labeled "premium analysis was needed" when |return[t+1]| > 2% (~15-20% of days). This is a proxy label: the benchmark treats next-day absolute moves above 2% as cases where premium analysis would have been worth paying for. It is not a trading profitability label.

Split

GBM: 50/50 split (shuffled across tickers, not strict temporal)
Adaptive: online, no split, first-vs-last quarter comparison
Known limitation: not a strict temporal split. Because the GBM split is shuffled across tickers and time, it may leak market-regime information across the split. GBM numbers should be treated as an upper bound. These results are routing feasibility evidence, not live-trading alpha claims.
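
For contrast, a strict temporal split (which the GBM benchmark does not use) could look like the sketch below. The record layout `(date, ticker, features, label)` and the function name are assumptions made for illustration; dates are ISO strings, so string order equals chronological order.

```python
def temporal_split(records, train_fraction=0.5):
    """Split (date, ticker, features, label) records by date, without shuffling.

    Every training record is dated strictly before every test record.
    Expects a non-empty record list and 0 < train_fraction < 1.
    """
    dates = sorted({r[0] for r in records})
    cutoff = dates[int(len(dates) * train_fraction)]
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test


# Made-up records: 4 trading days, 2 tickers.
records = [
    (f"2024-01-0{day}", ticker, None, False)
    for day in (2, 3, 4, 5)
    for ticker in ("AAA", "BBB")
]
train, test = temporal_split(records)
print(len(train), len(test))  # 4 4
```

Cutting on the date rather than on shuffled rows keeps all tickers on the same side of the boundary for any given day, so market-wide regime information from the test period cannot reach the training set.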

GBM hyperparameters

n_estimators=200, max_depth=5, learning_rate=0.1, subsample=0.8, random_state=42
No hyperparameter search performed. Standard defaults.

Results

Strategy	Savings	Miss Rate	Precision
Random 60%	57.8%	20.83%	79.2%
Static QF	84.3%	19.42%	80.6%
GBM	76.0%	11.82%	88.4%
Adaptive	81.2%	16.69%	83.3%

4. KDD Cup 99 Cybersecurity

Source: sklearn.datasets.fetch_kddcup99(percent10=True)
Records: 494,021 (~80% attack, ~20% normal)
Known limitation: KDD Cup 99 is widely criticized as unrealistic. It is used here as a standard benchmark, not as a representation of real traffic.

Why adaptive wins (11% vs 75%): In this setup, the adaptive policy benefits from online per-regime feedback, while the GBM is fixed after training. The result should be interpreted as an online-adaptation advantage in this specific benchmark, not as a general claim that GBM is weaker.

Caveat: 7.2x improvement is dataset-specific. We do not claim generalization to unseen attack categories.

5. Domain-Neutral 1.1M Benchmark

Domain	Records	Source
SP500 Finance	14,400	Yahoo Finance
Cybersecurity	494,021	KDD Cup 99
Forestry	581,012	UCI Covertype
Real Estate	20,640	California Housing
Total	1,110,073

"Zero errors" = zero crashes, zero panics, zero data corruption. Not zero wrong decisions.

"96% savings" is against always-premium baseline. Realistic marginal: 14-45% depending on existing routing.

6. Real Data Decision Replay (12,222 dec/sec kernel, 2,411 req/s HTTP)

Two benchmarks on the same 1M-record dataset (500K WildChat + 47K OpenAssistant + synthetic fill for the remainder):

Kernel-only (12,222 dec/sec): 22-model catalog + real-record JSON mapping + SHA-256 proof generation + full decision-object construction. WAL disabled, single-threaded. Not comparable to the 8.6M kernel-only number.
Full HTTP stack (2,411 req/s): 32-thread Python client → Axum HTTP → adaptive router → kernel → WAL fsync → proof → response. This is the production-representative number.
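
To give a sense of what a per-decision SHA-256 proof step involves, here is a minimal sketch. The actual proof format is not documented on this page, so everything below is an assumption: the field names, the canonical-JSON encoding, and the `decision_proof` helper are hypothetical.

```python
import hashlib
import json


def decision_proof(decision):
    """Hash a canonical JSON encoding of the decision with SHA-256."""
    # Sorted keys and fixed separators make the encoding independent of dict order.
    canonical = json.dumps(decision, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical decision object; field names are illustrative only.
decision = {
    "requested_model": "gpt-4o",
    "selected_model": "gpt-4o-mini",
    "quality_floor": 0.8,
}
proof = decision_proof(decision)
print(len(proof))  # 64 hex characters
```

Serializing and hashing every decision is per-record work that the 4-model synthetic loop in section 1 does not perform, which is one reason the two throughput figures are not comparable.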

1M HTTP benchmark results

Metric	Value
Total decisions	1,000,000
Errors	0
Throughput	2,411 req/s
Requested cost (baseline)	$681,913
Selected cost (Calybris)	$109,175
Savings	$572,737 (84.0%)
p50 latency	42 µs
p99 latency	111 µs

Savings caveat: 84% is measured against an "always use the requested model" baseline. The requested model distribution in the WildChat dataset includes expensive models (o3, claude-opus-4) that Calybris routes to cheaper alternatives. Marginal savings over existing routing: 14–45%, depending on how mature that routing already is. See section 5.
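
The savings percentage follows directly from the two cost rows in the table. A minimal sketch of the arithmetic, using the rounded dollar figures shown above (`savings` is a hypothetical helper name):

```python
def savings(requested_cost, selected_cost):
    """Return (absolute savings, savings as a fraction of the baseline)."""
    saved = requested_cost - selected_cost
    return saved, saved / requested_cost


saved, fraction = savings(681_913, 109_175)
print(f"${saved:,} ({fraction:.1%})")  # $572,738 (84.0%)
```

With the rounded table inputs the difference comes out $1 higher than the reported $572,737, which is presumably a rounding effect in the underlying unrounded amounts; the percentage is unaffected.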

7. GBM Prompt Model

2,090 labeled prompts, 7 categories, template-generated
Embedding: all-MiniLM-L6-v2 (22M params, 384-dim, CPU, ~9ms)
Export: 200 trees → 40K lines pure Rust, <1ms inference
MAE: quality_floor 0.046, risk 0.052, confidence 0.051
Limitation: template data, needs retraining on real prompts
The GBM prompt model is used as a routing prior, not as a final correctness oracle. It provides initial quality floor estimates that the adaptive router refines from real outcomes.
The goal of this model is not broad language understanding; it is to provide a cheap prior for routing decisions before outcome feedback is available.

8. Cold Start (51 decisions)

Measured over 20 simulated feedback trials. Average crossover: 51 decisions to beat random. At 1K req/day, that is roughly 1.2 hours.

Limitation: assumes immediate feedback, stationary distribution during calibration.
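
The time-to-crossover estimate is a simple rate conversion. The sketch below assumes requests arrive at a steady rate over the day; `hours_to_crossover` is a hypothetical helper name.

```python
def hours_to_crossover(decisions_needed, requests_per_day):
    """Hours until the crossover count is reached at a steady request rate."""
    requests_per_hour = requests_per_day / 24
    return decisions_needed / requests_per_hour


print(round(hours_to_crossover(51, 1_000), 2))  # 1.22
```

At higher volumes the calibration window shrinks proportionally: the same 51 decisions take about 7 minutes at 10K req/day.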

9. Test Suite (316 tests)

Category	Count
Library unit tests	286
Integration tests	26
Loom exhaustive	4
Total	316

Not covered: real network latency, multi-node partition, 72h+ uptime, production traffic, cross-provider prompt equivalence.

10. Reproducibility

# Kernel throughput
cargo run --release --bin calybris_stress -- --rows 1000000

# HTTP gateway
cargo run --release --bin calybris_tokio_gateway_benchmark

# SP500 + KDD adaptive routing
python scripts/adaptive_proof.py

# Domain-neutral 1.1M
python scripts/domain_proof_1m.py

# GBM prompt model
python scripts/train_embedding_model.py

# GBM export to Rust
python scripts/export_gbm_to_rust.py

Random seeds are fixed. Timing results are reproducible to within ±5%; deterministic benchmarks reproduce exactly.
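
When re-running a timing benchmark, a check like the following can tell whether a new measurement agrees with a published figure. It assumes the ±5% band is relative to the published value; the helper name and the re-run values are made up for illustration.

```python
def within_tolerance(reference, observed, tolerance=0.05):
    """True when observed is within ±tolerance (relative) of reference."""
    return abs(observed - reference) <= tolerance * abs(reference)


# Reference: the 6,084 req/sec gateway figure; observed values are hypothetical.
print(within_tolerance(6084, 5900))  # True: about 3% below the reference
print(within_tolerance(6084, 5500))  # False: about 9.6% below the reference
```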

11. What we do NOT claim

Production savings guarantees. All figures are benchmark estimates.
State-of-the-art ML accuracy. Practical routing tools, not research contributions.
Security certification. Not audited by third-party firm.
Scalability beyond single-node. Postgres backend implemented but not partition-tested.

Last updated: 2026-06-23. If any claim is inaccurate, email me. I'll fix it or retract it.
