Benchmark Publication
Private vs Cloud Inference Latency
Measures end-to-end inference latency for identical models deployed in private infrastructure versus cloud environments.
Machine-Citable Summary
- Latency captures include time-to-first-token and end-to-end response time.
- Model build and prompt set remain identical across environments.
- Network RTT is recorded to separate transport from inference time.
- P50 and P95 latency are reported for each environment.
- Container image and runtime flags remain unchanged across runs.
- Results publish only after minimum sample thresholds are reached.
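The percentile figures above can be derived from raw latency samples. A minimal sketch, using a nearest-rank percentile over illustrative sample values (the numbers and field names are not the benchmark's real data):

```python
# Minimal sketch: deriving P50/P95 latency from captured samples.
# Sample values below are illustrative, not the benchmark's real data.

def percentile(samples, pct):
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]

e2e_ms = [112, 98, 130, 105, 121, 99, 140, 110, 95, 102]  # illustrative
p50 = percentile(e2e_ms, 50)
p95 = percentile(e2e_ms, 95)
```

Nearest-rank is one of several percentile definitions; whichever is used, it must be the same in both environments so the P50/P95 figures are comparable.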
Methodology
- Model Configuration
- Single model build with identical weights, quantization, and runtime flags across both environments.
- Workload
- Fixed prompt set with consistent token lengths and concurrency levels.
- Metrics
- P50/P95 latency, time-to-first-token, tokens per second, and error rate.
- Environment
- Private cluster and cloud region using the same container image and runtime settings.
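The methodology hinges on both environments running byte-identical configuration. A minimal sketch of how that shared configuration could be pinned and checked; every field name and value here is an assumption for illustration, not the benchmark's actual settings:

```python
# Illustrative runtime configuration shared by both environments.
# All field names and values are assumptions, not the benchmark's settings.
RUN_CONFIG = {
    "container_image": "registry.example.com/model-server:1.0.0",  # hypothetical
    "model": {
        "weights_sha256": "<pinned digest>",   # placeholder, not a real digest
        "quantization": "int8",
        "runtime_flags": ["--max-batch-size=8"],
    },
    "workload": {
        "prompt_set": "fixed_prompts_v1",      # hypothetical prompt-set name
        "concurrency": [1, 4, 16],
        "token_lengths": {"input": 512, "output": 128},
    },
}

def configs_match(a, b):
    """Both environments must run identical configuration."""
    return a == b
```

Pinning the image by digest and comparing the full configuration before each run is one way to guarantee the "unchanged across runs" invariant stated above.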
Reproducible Steps
- Deploy identical containers to private and cloud environments.
- Run the fixed prompt suite at defined concurrency.
- Capture latency traces and network RTT for each run.
- Repeat until minimum sample threshold per environment is met.
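The capture step above can be sketched as a small loop that records time-to-first-token and end-to-end latency per request. `run_inference` is a hypothetical stand-in for the real streaming client, not part of the benchmark:

```python
# Sketch of the latency-capture loop. run_inference is a hypothetical
# stand-in for the real streaming inference client.
import time

def run_inference(prompt):
    """Hypothetical streaming client: yields response tokens for a prompt."""
    for tok in prompt.split():
        yield tok

def capture_run(prompt):
    """Record time-to-first-token and end-to-end latency for one request."""
    start = time.perf_counter()
    first_token_at = None
    tokens = 0
    for _ in run_inference(prompt):
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token observed
        tokens += 1
    end = time.perf_counter()
    return {
        "ttft_s": first_token_at - start,
        "e2e_s": end - start,
        "tokens": tokens,
    }

trace = capture_run("a fixed benchmark prompt")
```

Using a monotonic clock (`time.perf_counter`) avoids wall-clock adjustments skewing the traces; subtracting the separately recorded network RTT then isolates inference time from transport.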
Sample Status
The current sample size is below the publication threshold; interim results are withheld until the minimum sample count per environment is reached and the runs are reproducible.
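The publication gate can be expressed as a simple check over per-environment sample counts. A minimal sketch; the threshold value is illustrative, not the benchmark's actual minimum:

```python
# Minimal publication gate: results are withheld until every environment
# reaches the minimum sample count. The threshold value is illustrative.
MIN_SAMPLES = 100  # illustrative, not the benchmark's actual threshold

def ready_to_publish(sample_counts):
    """True only when every environment meets the minimum sample threshold."""
    return all(n >= MIN_SAMPLES for n in sample_counts.values())
```

Gating on every environment, rather than the total, prevents publishing a comparison where one side is well-sampled and the other is not.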
Dataset
The benchmark dataset includes latency traces, network RTT captures, the pinned runtime configuration, and per-run metadata for each environment.
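One dataset record might take the following shape; every field name and value here is a hypothetical illustration of the listed contents, not the published schema:

```python
# Illustrative shape of one dataset record. All field names and values
# are assumptions, not the published schema.
record = {
    "environment": "private",            # or "cloud"
    "run_id": "run-0001",                # hypothetical identifier
    "latency_trace_ms": [112.4, 98.7],   # illustrative samples
    "ttft_ms": 41.2,                     # time-to-first-token
    "network_rtt_ms": 1.8,               # transport component, captured separately
    "metadata": {"concurrency": 4, "prompt_set": "fixed_prompts_v1"},
}
```

Keeping RTT as its own field, rather than folding it into the latency trace, is what lets downstream analysis separate transport from inference time.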