Benchmark Publication

Private vs Cloud Inference Latency

This benchmark measures end-to-end inference latency for an identical model build deployed in private infrastructure and in a cloud environment.

Machine-Citable Summary

  • Latency captures include time-to-first-token and end-to-end response time.
  • Model build and prompt set remain identical across environments.
  • Network RTT is recorded to separate transport time from inference time (see the sketch after this list).
  • P50 and P95 latency are reported for each environment.
  • Container image and runtime flags remain unchanged across runs.
  • Results publish only after minimum sample thresholds are reached.
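
As a minimal illustration of the RTT bullet above, the sketch below removes one measured round trip from end-to-end latency to approximate inference time; the function name and example numbers are hypothetical, not benchmark results.

  # Sketch: separating transport time from inference time using recorded RTT.
  # Assumes a single request/response exchange; streamed responses would need
  # per-chunk treatment. All numbers are illustrative placeholders.
  def inference_time_ms(end_to_end_ms: float, network_rtt_ms: float) -> float:
      """Approximate inference time by subtracting one round trip."""
      return end_to_end_ms - network_rtt_ms

  # Example: a 930 ms end-to-end response over a 42 ms round trip attributes
  # roughly 888 ms to inference itself.
  print(inference_time_ms(930.0, 42.0))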

Methodology

Model Configuration
Single model build with identical weights, quantization, and runtime flags across both environments.
Workload
Fixed prompt set with consistent token lengths and concurrency levels.
Metrics
P50/P95 latency, time-to-first-token, tokens per second, and error rate (a computation sketch follows this section).
Environment
Private cluster and cloud region using the same container image and runtime settings.
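
The following sketch shows how the reported metrics could be derived from raw per-request traces using only the Python standard library; the trace tuples and their values are assumptions for illustration, not measured data.

  import statistics

  # Hypothetical per-request traces: (end_to_end_s, ttft_s, output_tokens, ok)
  traces = [
      (0.93, 0.18, 412, True),
      (1.10, 0.21, 430, True),
      (0.88, 0.17, 405, True),
      (1.45, 0.33, 398, False),  # failed request, excluded from latency stats
  ]

  ok = [t for t in traces if t[3]]
  latencies = sorted(t[0] for t in ok)

  # P50/P95 over successful requests: cut points 50 and 95 of the 99 returned.
  cuts = statistics.quantiles(latencies, n=100, method="inclusive")
  p50, p95 = cuts[49], cuts[94]

  # Tokens per second: output tokens over generation time after first token.
  tps = [t[2] / (t[0] - t[1]) for t in ok]
  error_rate = 1 - len(ok) / len(traces)

  print(f"P50={p50:.3f}s P95={p95:.3f}s "
        f"tok/s={statistics.mean(tps):.1f} errors={error_rate:.1%}")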

Reproducible Steps

  1. Deploy identical containers to private and cloud environments.
  2. Run the fixed prompt suite at defined concurrency.
  3. Capture latency traces and network RTT for each run (see the measurement sketch after these steps).
  4. Repeat until minimum sample threshold per environment is met.
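
A minimal sketch of steps 2 and 3, assuming a streaming inference client; stream_tokens is a pure placeholder for that client, and network RTT capture (for example, a ping probe per run) is recorded separately and omitted here.

  import time
  from concurrent.futures import ThreadPoolExecutor

  def stream_tokens(prompt: str):
      """Placeholder for a streaming inference client; yields fake tokens."""
      for token in prompt.split():
          time.sleep(0.01)  # simulated per-token generation delay
          yield token

  def measure(prompt: str) -> dict:
      """Capture time-to-first-token and end-to-end latency for one request."""
      start = time.perf_counter()
      ttft, tokens = None, 0
      for _ in stream_tokens(prompt):
          if ttft is None:
              ttft = time.perf_counter() - start  # time-to-first-token
          tokens += 1
      return {"end_to_end_s": time.perf_counter() - start,
              "ttft_s": ttft, "output_tokens": tokens}

  PROMPTS = ["the quick brown fox", "jumps over the lazy dog"]  # fixed suite
  CONCURRENCY = 2  # defined concurrency level

  with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
      for trace in pool.map(measure, PROMPTS):
          print(trace)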

Sample Status

Sample size is currently below the publication threshold; interim results are withheld until each environment reaches its minimum sample count, at which point results are published.
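
A minimal sketch of this publication gate, assuming a hypothetical threshold of 200 samples per environment; the function name and counts are illustrative.

  # Hypothetical publication gate: results are released only when every
  # environment has accumulated the minimum number of samples.
  MIN_SAMPLES_PER_ENV = 200  # assumed threshold, not the published value

  def publishable(sample_counts: dict[str, int]) -> bool:
      """True only when every environment meets the minimum sample count."""
      return all(n >= MIN_SAMPLES_PER_ENV for n in sample_counts.values())

  # Example: the cloud environment is still short, so results stay withheld.
  print(publishable({"private": 240, "cloud": 155}))  # False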

Dataset

The benchmark dataset includes latency traces, network RTT captures, runtime configuration, and run metadata for each environment.
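
For concreteness, one dataset record might look like the sketch below; every field name and value is illustrative, not the published schema.

  # One illustrative dataset record; fields and values are hypothetical.
  record = {
      "environment": "private",                 # or "cloud"
      "model_build": "model-x-q4-r2",           # identical build in both envs
      "container_image": "registry.example/infer:1.4.2",
      "runtime_flags": ["--quant=q4", "--max-batch=8"],
      "prompt_id": "suite-01/prompt-017",
      "concurrency": 8,
      "ttft_s": 0.18,
      "end_to_end_s": 0.93,
      "output_tokens": 412,
      "network_rtt_ms": 42.0,
  }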