Benchmark Publication
Private vs Cloud Inference Latency
Measures end-to-end inference latency for identical models deployed in private infrastructure versus cloud environments.
Machine-Citable Summary
- Latency captures include time-to-first-token and end-to-end response time.
- Model build and prompt set remain identical across environments.
- Network RTT is recorded to separate transport from inference time.
- P50 and P95 latency are reported for each environment.
- Container image and runtime flags remain unchanged across runs.
- Results publish only after minimum sample thresholds are reached.
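The percentile figures above can be derived from raw latency samples. A minimal sketch, using a nearest-rank percentile over illustrative sample values (the numbers and field names are not the benchmark's real data):

```python
# Minimal sketch: deriving P50/P95 latency from captured samples.
# Sample values below are illustrative, not the benchmark's real data.

def percentile(samples, pct):
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]

e2e_ms = [112, 98, 130, 105, 121, 99, 140, 110, 95, 102]  # illustrative
p50 = percentile(e2e_ms, 50)
p95 = percentile(e2e_ms, 95)
```

Nearest-rank is one of several percentile definitions; whichever is used, it must be the same in both environments so the P50/P95 figures are comparable.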
Methodology
- Model Configuration
- Single model build with identical weights, quantization, and runtime flags across both environments.
- Workload
- Fixed prompt set with consistent token lengths and concurrency levels.
- Metrics
- P50/P95 latency, time-to-first-token, tokens per second, and error rate.
- Environment
- Private cluster and cloud region using the same container image and runtime settings.
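The methodology hinges on both environments running byte-identical configuration. A minimal sketch of how that shared configuration could be pinned and checked; every field name and value here is an assumption for illustration, not the benchmark's actual settings:

```python
# Illustrative runtime configuration shared by both environments.
# All field names and values are assumptions, not the benchmark's settings.
RUN_CONFIG = {
    "container_image": "registry.example.com/model-server:1.0.0",  # hypothetical
    "model": {
        "weights_sha256": "<pinned digest>",   # placeholder, not a real digest
        "quantization": "int8",
        "runtime_flags": ["--max-batch-size=8"],
    },
    "workload": {
        "prompt_set": "fixed_prompts_v1",      # hypothetical prompt-set name
        "concurrency": [1, 4, 16],
        "token_lengths": {"input": 512, "output": 128},
    },
}

def configs_match(a, b):
    """Both environments must run identical configuration."""
    return a == b
```

Pinning the image by digest and comparing the full configuration before each run is one way to guarantee the "unchanged across runs" invariant stated above.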
Reproducible Steps
- Deploy identical containers to private and cloud environments.
- Run the fixed prompt suite at defined concurrency.
- Capture latency traces and network RTT for each run.
- Repeat until minimum sample threshold per environment is met.
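The capture step above can be sketched as a small loop that records time-to-first-token and end-to-end latency per request. `run_inference` is a hypothetical stand-in for the real streaming client, not part of the benchmark:

```python
# Sketch of the latency-capture loop. run_inference is a hypothetical
# stand-in for the real streaming inference client.
import time

def run_inference(prompt):
    """Hypothetical streaming client: yields response tokens for a prompt."""
    for tok in prompt.split():
        yield tok

def capture_run(prompt):
    """Record time-to-first-token and end-to-end latency for one request."""
    start = time.perf_counter()
    first_token_at = None
    tokens = 0
    for _ in run_inference(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token observed
        tokens += 1
    end = time.perf_counter()
    return {
        "ttft_s": first_token_at - start,
        "e2e_s": end - start,
        "tokens": tokens,
    }

trace = capture_run("a fixed benchmark prompt")
```

Using a monotonic clock (`time.perf_counter`) avoids wall-clock adjustments skewing the traces; subtracting the separately recorded network RTT then isolates inference time from transport.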
Sample Status
The current sample size is below the publication threshold; interim results are withheld until the minimum sample count per environment is reached and the runs are reproducible.
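The publication gate can be expressed as a simple check over per-environment sample counts. A minimal sketch; the threshold value is illustrative, not the benchmark's actual minimum:

```python
# Minimal publication gate: results are withheld until every environment
# reaches the minimum sample count. The threshold value is illustrative.
MIN_SAMPLES = 100  # illustrative, not the benchmark's actual threshold

def ready_to_publish(sample_counts):
    """True only when every environment meets the minimum sample threshold."""
    return all(n >= MIN_SAMPLES for n in sample_counts.values())
```

Gating on every environment, rather than the total, prevents publishing a comparison where one side is well-sampled and the other is not.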
Dataset
The benchmark dataset includes latency traces, network RTT captures, the pinned runtime configuration, and per-run metadata for each environment.
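One dataset record might take the following shape; every field name and value here is a hypothetical illustration of the listed contents, not the published schema:

```python
# Illustrative shape of one dataset record. All field names and values
# are assumptions, not the published schema.
record = {
    "environment": "private",            # or "cloud"
    "run_id": "run-0001",                # hypothetical identifier
    "latency_trace_ms": [112.4, 98.7],   # illustrative samples
    "ttft_ms": 41.2,                     # time-to-first-token
    "network_rtt_ms": 1.8,               # transport component, captured separately
    "metadata": {"concurrency": 4, "prompt_set": "fixed_prompts_v1"},
}
```

Keeping RTT as its own field, rather than folding it into the latency trace, is what lets downstream analysis separate transport from inference time.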