Calybris Benchmark Methodology

Every performance, savings, and reliability number on the landing page, in the Medium articles, and in this document comes from a reproducible test or replay. This page documents exactly how each benchmark was run, what splits were used, what hyperparameters were chosen, and what the known limitations are.

If you find a methodological flaw, email emirhuseyininci@gmail.com. I'll fix it or retract the claim.

1. Kernel Throughput (8.6M decisions/sec)

What it measures: Pure integer kernel evaluation speed — no I/O, no HTTP, no WAL, no network.

How it was measured:

Known limitations:

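Why 8.6M and not 10.27M: An earlier, faster version used wrapping_mul, which silently wraps around on integer overflow and can return a wrong result. It was reverted to saturating_mul, which clamps at the integer type's bounds instead. Correctness > speed.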
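The difference between the two multiplication modes can be illustrated with a small emulation. This is a sketch in Python rather than the kernel's own Rust, and it assumes 64-bit signed integers, since the kernel's actual integer width is not stated here. The helper names are hypothetical.

```python
# Emulates the semantics of Rust's i64::wrapping_mul and i64::saturating_mul.
# Assumption: 64-bit signed integers; the kernel's real width may differ.
I64_MIN = -(2**63)
I64_MAX = 2**63 - 1

def wrapping_mul_i64(a: int, b: int) -> int:
    """Multiply, keep the low 64 bits, reinterpret as two's complement."""
    low = (a * b) & 0xFFFFFFFFFFFFFFFF
    return low - 2**64 if low > I64_MAX else low

def saturating_mul_i64(a: int, b: int) -> int:
    """Multiply, then clamp to the i64 range instead of wrapping."""
    return max(I64_MIN, min(I64_MAX, a * b))

big = 2**62
wrapped = wrapping_mul_i64(big, 4)    # overflow wraps silently to 0
clamped = saturating_mul_i64(big, 4)  # overflow clamps at I64_MAX
print(wrapped, clamped)
```

The wrapped result (0) looks like a valid value, which is why silent wrapping is risky in a decision kernel. The clamped result is at least an obvious upper bound.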

2. HTTP Gateway (6,084 req/sec, 42ms p99)

What it measures: Full HTTP stack including JSON parse, policy evaluation, WAL write with fsync, JSON response.

3. SP500 Trading Benchmark

Data

Features (8 dimensions)

daily_return, volume_ratio, intraday_range, gap,
momentum_5d, volatility_20d, momentum_10d, avg_range_5d

Ground truth

|return[t+1]| > 2% = premium analysis was needed (~15-20% of days). This is a proxy label: the benchmark treats next-day absolute moves above 2% as cases where premium analysis would have been worth paying for. It is not a trading profitability label.
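As a minimal sketch of the proxy label above, assuming daily returns are stored as fractions (0.02 = 2%) in date order. The function name and the sample values are hypothetical and are not taken from the benchmark scripts.

```python
def premium_needed_labels(returns, threshold=0.02):
    """Label day t as True when |returns[t+1]| exceeds the threshold.

    The last day has no next-day return, so the output is one element
    shorter than the input.
    """
    return [abs(next_ret) > threshold for next_ret in returns[1:]]

# Made-up returns for illustration only, not benchmark data.
daily_returns = [0.004, -0.025, 0.020, 0.031]
labels = premium_needed_labels(daily_returns)
print(labels)  # [True, False, True]
```

The comparison is strict, so a move of exactly 2% is not labeled as needing premium analysis.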

Split

GBM hyperparameters

n_estimators=200, max_depth=5, learning_rate=0.1, subsample=0.8, random_state=42
No hyperparameter search was performed. These are fixed, conventional values, not tuned for this dataset.

Results

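| Strategy   | Savings | Miss Rate | Precision |
|------------|---------|-----------|-----------|
| Random 60% | 57.8%   | 20.83%    | 79.2%     |
| Static QF  | 84.3%   | 19.42%    | 80.6%     |
| GBM        | 76.0%   | 11.82%    | 88.4%     |
| Adaptive   | 81.2%   | 16.69%    | 83.3%     |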

4. KDD Cup 99 Cybersecurity

Why adaptive wins (11% vs 75%): In this setup, the adaptive policy benefits from online per-regime feedback, while the GBM is fixed after training. The result should be interpreted as an online-adaptation advantage in this specific benchmark, not as a general claim that GBM is weaker.

Caveat: 7.2x improvement is dataset-specific. We do not claim generalization to unseen attack categories.

5. Domain-Neutral 1.1M Benchmark

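| Domain        | Records   | Source             |
|---------------|-----------|--------------------|
| SP500 Finance | 14,400    | Yahoo Finance      |
| Cybersecurity | 494,021   | KDD Cup 99         |
| Forestry      | 581,012   | UCI Covertype      |
| Real Estate   | 20,640    | California Housing |
| Total         | 1,110,073 |                    |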

"Zero errors" = zero crashes, zero panics, zero data corruption. Not zero wrong decisions.

"96% savings" is against always-premium baseline. Realistic marginal: 14-45% depending on existing routing.

6. Real Data Decision Replay (12,222 dec/sec kernel, 2,411 req/s HTTP)

Two benchmarks, kernel and HTTP, were run on the same 1M-decision dataset (500K WildChat + 47K OpenAssistant + synthetic fill for the remainder).

1M HTTP benchmark results

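| Metric                    | Value            |
|---------------------------|------------------|
| Total decisions           | 1,000,000        |
| Errors                    | 0                |
| Throughput                | 2,411 req/s      |
| Requested cost (baseline) | $681,913         |
| Selected cost (Calybris)  | $109,175         |
| Savings                   | $572,737 (84.0%) |
| p50 latency               | 42 µs            |
| p99 latency               | 111 µs           |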
Savings caveat: 84% is measured against "always use the requested model" baseline. The requested model distribution in the WildChat dataset includes expensive models (o3, claude-opus-4) that Calybris routes to cheaper alternatives. Marginal savings over existing routing: 14–45% depending on maturity. See section 5.
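The savings figure follows directly from the two cost rows in the results table. A short sketch of that arithmetic, using the whole-dollar values as published (the function name is hypothetical):

```python
def savings(requested_cost, selected_cost):
    """Return (absolute savings, savings as a percentage of the baseline)."""
    saved = requested_cost - selected_cost
    return saved, 100.0 * saved / requested_cost

# Figures from the 1M HTTP benchmark results, rounded to whole dollars.
saved, pct = savings(681_913, 109_175)
print(f"${saved:,} ({pct:.1f}%)")
```

This yields $572,738, one dollar off the published $572,737, presumably because the displayed costs are themselves rounded. The percentage is always relative to the "always use the requested model" baseline, so it shrinks if the baseline already routes some traffic to cheaper models.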

7. GBM Prompt Model

8. Cold Start (51 decisions)

Across 20 simulated feedback trials, the average crossover point was 51 decisions to beat random. At 1K req/day that is roughly 1.2 hours (51/1,000 of a day), assuming evenly spaced requests.

Limitation: assumes immediate feedback, stationary distribution during calibration.
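The wall-clock estimate can be reproduced with a one-line conversion. This sketch assumes requests arrive evenly over 24 hours, which is an added simplification, and the function name is hypothetical.

```python
def hours_to_crossover(decisions_needed, requests_per_day):
    """Wall-clock hours until the crossover decision count is reached.

    Assumes requests arrive evenly over 24 hours and each one yields
    immediate feedback, matching the stated limitation.
    """
    requests_per_hour = requests_per_day / 24
    return decisions_needed / requests_per_hour

hours = hours_to_crossover(51, 1_000)
print(f"{hours:.2f} hours")
```

With bursty traffic or delayed feedback, the real calibration time would be longer than this figure.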

9. Test Suite (316 tests)

| Category           | Count |
|--------------------|-------|
| Library unit tests | 286   |
| Integration tests  | 26    |
| Loom exhaustive    | 4     |
| Total              | 316   |

Not covered: real network latency, multi-node partition, 72h+ uptime, production traffic, cross-provider prompt equivalence.

10. Reproducibility

# Kernel throughput
cargo run --release --bin calybris_stress -- --rows 1000000

# HTTP gateway
cargo run --release --bin calybris_tokio_gateway_benchmark

# SP500 + KDD adaptive routing
python scripts/adaptive_proof.py

# Domain-neutral 1.1M
python scripts/domain_proof_1m.py

# GBM prompt model
python scripts/train_embedding_model.py

# GBM export to Rust
python scripts/export_gbm_to_rust.py

Random seeds are fixed. Timing results are reproducible to within ±5%; deterministic benchmarks reproduce exactly.
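One way to check a rerun against the ±5% timing tolerance is a relative comparison with the published figure. This is a sketch under the assumption that the tolerance is relative to the published value. The helper name and the rerun numbers are hypothetical.

```python
def within_tolerance(published, measured, rel_tol=0.05):
    """True when measured is within rel_tol (default 5%) of published."""
    return abs(measured - published) <= rel_tol * abs(published)

# 8.6M decisions/sec is the published kernel figure; the reruns are made up.
print(within_tolerance(8_600_000, 8_300_000))  # True  (about 3.5% lower)
print(within_tolerance(8_600_000, 8_000_000))  # False (about 7.0% lower)
```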

11. What we do NOT claim

  1. Production savings guarantees. All figures are benchmark estimates.
  2. State-of-the-art ML accuracy. Practical routing tools, not research contributions.
  3. Security certification. Not audited by third-party firm.
  4. Scalability beyond single-node. Postgres backend implemented but not partition-tested.

Last updated: 2026-06-23. If any claim is inaccurate, email me. I'll fix it or retract it.