Every performance, savings, and reliability number on the landing page, in the Medium articles, and in this document comes from a reproducible test or replay. This page documents exactly how each benchmark was run, what splits were used, what hyperparameters were chosen, and what the known limitations are.
If you find a methodological flaw, email emirhuseyininci@gmail.com. I'll fix it or retract the claim.
What it measures: Pure integer kernel evaluation speed — no I/O, no HTTP, no WAL, no network.
How it was measured:
calybris_stress (release profile, LTO thin, codegen-units=1)
Known limitations:
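Why 8.6M not 10.27M: An earlier version used wrapping_mul, which silently wraps around on overflow and can produce wrong results. It was reverted to saturating_mul, which clamps at the integer bounds instead. Correctness > speed.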
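To illustrate the difference between the two multiplication modes, here is a minimal Python sketch that emulates Rust's `wrapping_mul` and `saturating_mul`. The 64-bit signed width and the helper names are assumptions for illustration; the actual kernel's integer type is not stated here.

```python
# Assumed width: signed 64-bit (i64). The real kernel type may differ.
I64_MIN = -(2**63)
I64_MAX = 2**63 - 1
U64_MASK = 0xFFFF_FFFF_FFFF_FFFF


def wrapping_mul_i64(a: int, b: int) -> int:
    """Emulate i64::wrapping_mul: keep the low 64 bits, reinterpret as signed."""
    low = (a * b) & U64_MASK
    return low - 2**64 if low > I64_MAX else low


def saturating_mul_i64(a: int, b: int) -> int:
    """Emulate i64::saturating_mul: clamp to the i64 range instead of wrapping."""
    return max(I64_MIN, min(I64_MAX, a * b))


if __name__ == "__main__":
    big = 2**62
    print(wrapping_mul_i64(big, 4))    # 0 (the true product 2**64 wraps to zero)
    print(saturating_mul_i64(big, 4))  # 9223372036854775807 (clamped to i64 max)
```

With wrapping, an overflowing score can silently collapse to a small or negative value; with saturation it stays pinned at the bound, which is at least monotone in the inputs.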
What it measures: Full HTTP stack including JSON parse, policy evaluation, WAL write with fsync, JSON response.
yfinance
daily_return, volume_ratio, intraday_range, gap,
momentum_5d, volatility_20d, momentum_10d, avg_range_5d
|return[t+1]| > 2% = premium analysis was needed (~15-20% of days). This is a proxy label: the benchmark treats next-day absolute moves above 2% as cases where premium analysis would have been worth paying for. It is not a trading profitability label.
n_estimators=200, max_depth=5, learning_rate=0.1, subsample=0.8, random_state=42
No hyperparameter search was performed. The values above were fixed in advance and not tuned on this data.
| Strategy | Savings | Miss Rate | Precision |
|---|---|---|---|
| Random 60% | 57.8% | 20.83% | 79.2% |
| Static QF | 84.3% | 19.42% | 80.6% |
| GBM | 76.0% | 11.82% | 88.4% |
| Adaptive | 81.2% | 16.69% | 83.3% |
sklearn.datasets.fetch_kddcup99(percent10=True)
Why adaptive wins (11% vs 75%): In this setup, the adaptive policy benefits from online per-regime feedback, while the GBM is fixed after training. The result should be interpreted as an online-adaptation advantage in this specific benchmark, not as a general claim that GBM is weaker.
Caveat: 7.2x improvement is dataset-specific. We do not claim generalization to unseen attack categories.
| Domain | Records | Source |
|---|---|---|
| SP500 Finance | 14,400 | Yahoo Finance |
| Cybersecurity | 494,021 | KDD Cup 99 |
| Forestry | 581,012 | UCI Covertype |
| Real Estate | 20,640 | California Housing |
| Total | 1,110,073 | |
"Zero errors" = zero crashes, zero panics, zero data corruption. Not zero wrong decisions.
"96% savings" is against always-premium baseline. Realistic marginal: 14-45% depending on existing routing.
Two benchmarks on the same 1M-record dataset (500K WildChat + 47K OpenAssistant + synthetic fill for the remainder):
| Metric | Value |
|---|---|
| Total decisions | 1,000,000 |
| Errors | 0 |
| Throughput | 2,411 req/s |
| Requested cost (baseline) | $681,913 |
| Selected cost (Calybris) | $109,175 |
| Savings | $572,737 (84.0%) |
| p50 latency | 42 µs |
| p99 latency | 111 µs |
Savings caveat: 84% is measured against "always use the requested model" baseline. The requested model distribution in the WildChat dataset includes expensive models (o3, claude-opus-4) that Calybris routes to cheaper alternatives. Marginal savings over existing routing: 14–45% depending on maturity. See section 5.
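The savings figure is simple arithmetic against the requested-model baseline. A minimal sketch using the rounded dollar values from the table (the function name is illustrative):

```python
def savings_vs_baseline(baseline_cost, selected_cost):
    """Return (absolute savings, savings as a percentage of the baseline)."""
    saved = baseline_cost - selected_cost
    return saved, 100.0 * saved / baseline_cost


if __name__ == "__main__":
    saved, pct = savings_vs_baseline(681_913, 109_175)
    print(f"${saved:,} ({pct:.1f}%)")  # $572,738 (84.0%)
```

Recomputing from the rounded table values gives a result within $1 of the reported savings; the small difference is presumably due to rounding of the underlying unrounded costs. The percentage is only meaningful relative to the chosen baseline, which is why the marginal figure over existing routing is lower.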
Across 20 simulated feedback trials, the adaptive policy needed 51 decisions on average to beat random. At 1K req/day, that is roughly 1.2 hours of traffic.
Limitation: assumes immediate feedback, stationary distribution during calibration.
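The conversion from a decision count to wall-clock time can be sketched as below. It assumes traffic is spread evenly over 24 hours, which real workloads rarely satisfy; the function name is illustrative.

```python
def hours_to_crossover(decisions, requests_per_day):
    """Hours of traffic needed to accumulate `decisions`, assuming a uniform request rate."""
    return 24.0 * decisions / requests_per_day


if __name__ == "__main__":
    print(round(hours_to_crossover(51, 1_000), 2))  # 1.22
```

If feedback is delayed rather than immediate, the effective calibration time is longer than this estimate, since each decision only counts once its feedback arrives.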
| Category | Count |
|---|---|
| Library unit tests | 286 |
| Integration tests | 26 |
| Loom exhaustive | 4 |
| Total | 316 |
Not covered: real network latency, multi-node partition, 72h+ uptime, production traffic, cross-provider prompt equivalence.
# Kernel throughput
cargo run --release --bin calybris_stress -- --rows 1000000
# HTTP gateway
cargo run --release --bin calybris_tokio_gateway_benchmark
# SP500 + KDD adaptive routing
python scripts/adaptive_proof.py
# Domain-neutral 1.1M
python scripts/domain_proof_1m.py
# GBM prompt model
python scripts/train_embedding_model.py
# GBM export to Rust
python scripts/export_gbm_to_rust.py
Random seeds are fixed. Timing results are reproducible within ±5%; deterministic benchmarks reproduce exactly.
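When comparing a rerun against a published timing number, a relative-tolerance check along these lines may be useful. This is a hypothetical helper, not part of the repository's scripts.

```python
def within_tolerance(reference, measured, rel_tol=0.05):
    """True if `measured` is within rel_tol (default 5%) of `reference`."""
    return abs(measured - reference) <= rel_tol * abs(reference)


if __name__ == "__main__":
    # A rerun at 8.3M against the 8.6M reference is inside the 5% band.
    print(within_tolerance(8_600_000, 8_300_000))   # True
    # 10.27M is not a reproduction of 8.6M under this tolerance.
    print(within_tolerance(8_600_000, 10_270_000))  # False
```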
Last updated: 2026-06-23. If any claim is inaccurate, email me. I'll fix it or retract it.