Benchmarking details
Benchmarks are always opinionated. The goal of this benchmark is to find the best possible self-hosted backend in terms of $ per token:

- end-to-end, including the REST API server
- multi-tenant: multiple clients query your server concurrently (a client sketch follows this list)
- fair batch size: the request size (sentences per request) is kept low, so that requests can be load-balanced and scaled across replicas
- measured as throughput per token: idle servers are bad for business. This benchmark is NOT about the latency of a single request against an IDLE server; it partially evaluates latency under a typical load scenario
- BERT small / large: the most typical semantic-search tasks require a small model (< 1B params)
- accuracy: each backend must match the torch fp32 embeddings to within ~1e-4
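For illustration, a minimal client sketch matching this setup could look like the following. It is not the exact script used for the numbers below; the OpenAI-compatible `/embeddings` route, port 7997, and the concurrency level are assumptions.

```python
# Minimal benchmark-client sketch (illustrative only; not the exact script used
# for the numbers below). Assumes an OpenAI-compatible POST /embeddings route
# on localhost:7997 and 256 sentences per request.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:7997/embeddings"  # assumed endpoint
MODEL = "BAAI/bge-small-en-v1.5"
BATCH = ["This is a typical semantic-search sentence."] * 256  # 256 sentences per request
N_REQUESTS = 10
N_CLIENTS = 4  # simulate several tenants querying the same server


def one_request(_: int) -> float:
    """Send one batch and return its latency in seconds."""
    t0 = time.perf_counter()
    r = requests.post(URL, json={"model": MODEL, "input": BATCH}, timeout=300)
    r.raise_for_status()
    return time.perf_counter() - t0


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=N_CLIENTS) as pool:
    latencies = list(pool.map(one_request, range(N_REQUESTS)))
total = time.perf_counter() - start

print(f"{N_REQUESTS / total:.2f} requests/s, "
      f"mean latency {sum(latencies) / len(latencies):.2f}s over {total:.1f}s")
```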
 
Benchmarking machines:

CPU and NVIDIA:

- GCP g2-standard-16
- Intel Cascade Lake CPU
- 1 x NVIDIA L4, cu122

AMD:

- 16-core CPU
- AMD MI210, rocm5.7, without flash-attn

AWS Inferentia:

- Huggingface AMI (torch-neuronx 1.13, optimum 1.17)
- inf2.xlarge instance (2 Neuron cores, 1 used)
 
Reproduction steps:

Install the environment, then launch each server as follows.

sentence-transformers, fastembed, infinity:
```bash
git clone https://github.com/michaelfeil/infinity.git
cd infinity
git checkout tags/0.0.25
python ./docs/benchmarks/simple_app.py
```
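Once a server is up, the ~1e-4 accuracy requirement can be spot-checked against a torch fp32 reference. The sketch below is illustrative only; it assumes an OpenAI-compatible `/embeddings` route on port 7997 and normalizes both sides before comparing.

```python
# Illustrative accuracy spot-check (not the benchmark script itself).
# Assumes an OpenAI-compatible POST /embeddings route on localhost:7997.
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

sentences = ["benchmarks are always opinionated", "embeddings should match fp32"]

# Reference: torch fp32 embeddings on CPU.
ref = SentenceTransformer("BAAI/bge-small-en-v1.5")
ref_emb = ref.encode(sentences, convert_to_numpy=True)

# Candidate: whatever backend is currently serving on port 7997.
resp = requests.post(
    "http://localhost:7997/embeddings",
    json={"model": "BAAI/bge-small-en-v1.5", "input": sentences},
    timeout=30,
)
resp.raise_for_status()
cand_emb = np.array([d["embedding"] for d in resp.json()["data"]])


# Normalize both sides so the check is independent of server-side normalization.
def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)


diff = np.abs(l2_normalize(ref_emb) - l2_normalize(cand_emb)).max()
print(f"max abs diff vs torch fp32: {diff:.2e}")  # target: ~1e-4 or better
```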
huggingface/text-embeddings-inference:

Using the cpu and cuda-89 containers (note that compute capability 8.9 matches the NVIDIA L4).
```bash
docker run -it -p 7997:80 --pull always ghcr.io/huggingface/text-embeddings-inference:cpu-0.6 \
  --model-id BAAI/bge-small-en-v1.5 --max-client-batch-size 256

docker run -it -p "7997:80" --gpus all --pull always ghcr.io/huggingface/text-embeddings-inference:89-0.6 \
  --model-id BAAI/bge-large-en-v1.5 --max-client-batch-size 256
```
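A quick way to verify the TEI container answers on the mapped port is shown below; it uses TEI's native `/embed` route and payload shape, so adjust it if your TEI version differs.

```python
# Quick check that the TEI container answers on the mapped port (7997 -> 80).
# The /embed route and payload shape follow the text-embeddings-inference docs;
# adjust if your TEI version differs.
import requests

resp = requests.post(
    "http://localhost:7997/embed",
    json={"inputs": ["benchmark sentence one", "benchmark sentence two"]},
    timeout=30,
)
resp.raise_for_status()
vectors = resp.json()  # list of embedding vectors
print(len(vectors), len(vectors[0]))  # 2 x hidden size (1024 for bge-large-en-v1.5)
```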
tensorrt, onnx-gpu:

```bash
docker buildx build --target production-tensorrt -t inf-trt . && \
  docker run -it -p "7997:7997" --gpus all inf-trt v2 --model-id BAAI/bge-large-en-v1.5 \
  --engine optimum --device "cuda OR tensorrt"
```

Pick either `cuda` or `tensorrt` as the `--device` value.
Results
The benchmarks are launched against each of the servers started above.

Reported metrics:

- Requests / sec (1 request = 256 sentences / 115_000 tokens)
- time to run the benchmark (10 requests = 1_150_000 tokens)
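Because every run pushes the same 1_150_000 tokens through the server, the reported times convert directly into approximate token throughput. The short sketch below shows the conversion, using one value from the tables that follow.

```python
# Convert "time for 10 requests" into approximate token throughput.
TOKENS_PER_REQUEST = 115_000
N_REQUESTS = 10
TOTAL_TOKENS = N_REQUESTS * TOKENS_PER_REQUEST  # 1_150_000 tokens per run

elapsed_seconds = 18.491  # e.g. huggingface/TEI (candle, flashbert) on the L4 below
print(f"~{TOTAL_TOKENS / elapsed_seconds:,.0f} tokens/s")  # roughly 62k tokens/s
```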
Results: CPU-only (BAAI/bge-small-en-v1.5 | bert-small)
| Backend | Time (seconds) | Requests / sec (mean) | 
|---|---|---|
| infinity-optimum-int8 | 100.490 | 0.10 | 
| infinity-optimum (onnx) | 125.342 | 0.08 | 
| fastembed (onnx) | 125.770 | 0.08 | 
| sentence-transformers (torch) | 256.884 | 0.04 | 
| infinity (torch) | 353.065?? | 0.03 (needs revision) | 
| huggingface/TEI (candle) | 1104.357 | 0.009 | 
Results: NVIDIA L4 (BAAI/bge-large-en-v1.5 | bert-large)
| Backend | Requests / sec (mean) | Time (seconds) | 
|---|---|---|
| huggingface/TEI (candle, flashbert) | 0.54 | 18.491 | 
| infinity (torch + compile + fa2) | 0.51 | 19.562 | 
| tensorrt (via infinity) | 0.43 | 23.367 | 
| infinity (onnx-gpu fp16, fused layers) | 0.41 | 24.448 | 
| sentence-transformers (fp16) | 0.17 | 59.107 | 
Results: AMD MI210 (BAAI/bge-large-en-v1.5 | bert-large)
| Backend | Requests / sec (mean) | Time (seconds) | 
|---|---|---|
| infinity (torch + no compile + fa2 disabled) | 0.75 | 13.400 | 
Results: AWS INF2 xlarge (BAAI/bge-large-en-v1.5 | bert-large)
| Backend | Requests / sec (mean) | Time (seconds) | 
|---|---|---|
| infinity (neuron, fp16, constant batch_size 4 / 512 seq) | 0.11 | 90.564 |