DeepSeek TGI benchmark: LLM applications for 70% less cost

DeepSeek TGI Benchmark: Build highly scalable and cost-effective LLM applications on SaladCloud, at 70% less cost than hyperscalers

Published: February 14, 2025

SaladCloud

DeepSeek LLM benchmark on SaladCloud

LLM applications need highly scalable and cost-effective GPU infrastructure to meet rapidly evolving demands. Data center GPUs such as the A100 and H100 from major cloud providers offer high throughput and exceptional performance, but they are often too costly and more powerful than necessary for many use cases, such as text classification, translation, summarization, and personalization. Additionally, some cloud providers emphasize committed resource usage, which doesn’t align with the flexible business models of startups and other organizations. SaladCloud, leveraging tens of thousands of consumer GPUs, provides fine-grained, diverse resource options with unmatched scalability, flexibility and cost-effectiveness. We’ve seen an increasing number of customers successfully running and fine-tuning their LLMs on SaladCloud, achieving over 70% cost savings compared to hyperscalers. In this article, we share results from a DeepSeek TGI benchmark on SaladCloud, showing how companies can develop highly scalable and cost-effective LLM applications.

DeepSeek R1 is a new, highly powerful open-source reasoning model with performance on par with OpenAI’s o1. The smaller dense models distilled from it, which are fine-tuned on reasoning data generated by DeepSeek R1, also deliver exceptional results on benchmarks. Hugging Face’s TGI v3 is the latest version of its toolkit for deploying and serving LLMs, offering significant performance enhancements with zero configuration required. In this benchmark, we explore the deployment and performance evaluation of a typical LLM inference system using Hugging Face’s TGI v3 and DeepSeek-R1-Distill-Llama-8B on SaladCloud. Additionally, we share best practices from our customers for building high-throughput, reliable and cost-effective LLM applications. All code and configurations are available in this GitHub repository.

The DeepSeek benchmark image

Hugging Face’s official TGI image can be deployed on SaladCloud directly, without any modifications.

However, to fully leverage the distributed and dynamic nature of SaladCloud, we recommend building a custom wrapper image with the following features for production:

Install essential tools, such as those for testing and troubleshooting.
Add an I/O worker if using a job queue system.
Pre-load model parameters into the image to minimize costs, as Salad nodes incur no charges when downloading the image.
Implement initial and real-time performance checks to ensure that nodes remain in an optimal state for application execution.
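
An initial check of this kind could look roughly like the sketch below. It is a minimal illustration, not the benchmark image's actual script: the thresholds and function names are hypothetical, and the driver version is used as a stand-in for a CUDA version check because it can be queried directly through `nvidia-smi`.

```python
import subprocess

# Hypothetical thresholds, chosen only for illustration.
MIN_VRAM_MIB = 22_000       # roughly what a 24 GB card should report
MIN_DRIVER_MAJOR = 525      # assumed minimum driver major version

NVIDIA_SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,driver_version",
    "--format=csv,noheader,nounits",
]


def parse_gpu_report(csv_line: str) -> dict:
    """Parse one CSV line such as 'NVIDIA GeForce RTX 3090, 24576, 535.104.05'."""
    name, memory_mib, driver = [field.strip() for field in csv_line.split(",")]
    return {"name": name, "memory_mib": int(memory_mib), "driver": driver}


def node_is_healthy(report: dict) -> bool:
    """Accept the node only if VRAM and driver version meet the thresholds."""
    driver_major = int(report["driver"].split(".")[0])
    return report["memory_mib"] >= MIN_VRAM_MIB and driver_major >= MIN_DRIVER_MAJOR


def query_gpu() -> dict:
    """Run nvidia-smi and parse the report for the first GPU."""
    result = subprocess.run(NVIDIA_SMI_QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_report(result.stdout.strip().splitlines()[0])
```

A container entrypoint could call `query_gpu()` and exit with a non-zero status when `node_is_healthy` returns `False`, before the TGI server is started.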

Providing input prompts and streaming the generated tokens on SaladCloud

There are two primary ways to provide input prompts to your LLMs and stream the generated tokens on SaladCloud:

Deploying SaladCloud’s Container Gateway is the quickest approach. The TGI server on instances should listen on an IPv6 port, and the gateway can map a public URL to this port, forwarding multiple client requests to instances concurrently.

Inference time can vary significantly across prompts, so configuring the Least Number of Connections load-balancing algorithm is crucial to manage these fluctuations effectively. If client applications send more requests than the TGI server can handle, excessive resource usage may lead to errors and request timeouts. To improve system robustness, the TGI server can proactively reject excess requests as a backpressure mechanism, while client applications can implement traffic control to stop accepting new requests from users during periods of congestion.
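
The backpressure idea can be sketched on the application side as a cap on in-flight requests. The class and function names below are hypothetical, and the 429 status is simply a conventional "try again later" signal. On the server side, TGI's launcher has a `--max-concurrent-requests` option that, to our understanding, plays a similar role.

```python
import threading
from typing import Callable, Tuple


class InFlightLimiter:
    """Reject new work once `max_in_flight` requests are already running."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self) -> bool:
        # Non-blocking: returns False immediately when no slot is free.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


def handle(limiter: InFlightLimiter, prompt: str,
           generate: Callable[[str], str]) -> Tuple[int, str]:
    """Return (status, body); 429 tells the caller to back off and retry later."""
    if not limiter.try_acquire():
        return 429, "busy"
    try:
        return 200, generate(prompt)
    finally:
        limiter.release()
```

Rejecting early keeps queues short, so accepted requests finish within their timeout instead of all requests slowing down together.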

The second approach is a job queue. A few customers have successfully implemented a Redis-based near-real-time queue on SaladCloud, which supports regional deployment and is platform-independent. In this setup, client applications send requests to a Redis cluster, while a Redis worker, included in the image, pulls jobs (input prompts) from the cluster, invokes the TGI server, and streams the generated tokens. This queue-based approach may enhance resilience against traffic spikes and variations in inference time, as instances only fetch new jobs after completing the current ones.
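
The pull-based worker loop can be sketched as follows. This is not our customers' implementation: Python's in-process `queue.Queue` stands in for the Redis cluster, and `generate` and `publish` are hypothetical callables standing in for the TGI call and the channel that streams tokens back to the client.

```python
import queue
from typing import Callable, Iterable


def run_worker(jobs: queue.Queue,
               publish: Callable[[str, str], None],
               generate: Callable[[str], Iterable[str]]) -> int:
    """Process jobs one at a time until a None sentinel arrives.

    Returns the number of jobs handled.
    """
    handled = 0
    while True:
        job = jobs.get()                 # blocks until a job is available
        if job is None:                  # sentinel: shut the worker down
            break
        for token in generate(job["prompt"]):
            publish(job["id"], token)    # stream each token as it is produced
        publish(job["id"], "[DONE]")     # mark the end of this job's stream
        handled += 1
    return handled
```

Because the worker asks for the next job only after finishing the current one, a slow prompt delays that worker alone, and a traffic spike accumulates in the queue rather than overloading the instance.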

In this benchmark, we use the Container Gateway to manage access. Please refer to the Dockerfile for the benchmark image, which is built on the official TGI image. This image includes testing tools and performs an initial check (VRAM, CUDA version) before starting the TGI server.

Deployment to SaladCloud

Currently, all container instances, regardless of location, are centrally accessed through SaladCloud’s Container Gateway in the U.S. To reduce latency, we can use the SaladCloud Python SDK to deploy container groups specifically within the U.S.

The example code creates two separate container groups, one for the RTX 3090 and the other for the RTX 4090, each group consisting of 5 replicas.

Using the container gateway for instances in other regions may introduce additional latency, typically in the range of several hundred milliseconds. This is generally acceptable since LLM inference usually takes much longer (tens of seconds). However, for latency-sensitive applications that require local access, a Redis-based queue should be considered.

For more information on deploying LLMs on SaladCloud, please refer to this guide.

Benchmarking tool and methodology

We use the open-source LLMPerf for load testing, which allows testing over public endpoints by sending concurrent requests with varying prompt lengths through OpenAI-compatible APIs. It measures key metrics such as Time-to-First-Token (s), End-to-End Latency (s), Request-Output Throughput (tokens/s), and more. Additionally, there is a performance leaderboard for several API providers across various LLM models, benchmarked using LLMPerf.
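
To make these metrics concrete, the sketch below derives them for a single streamed request from the time it was sent and the arrival time of each token. It illustrates one common set of definitions, with per-request output throughput taken as output tokens divided by end-to-end latency. The function name is hypothetical and the code is not taken from LLMPerf.

```python
from typing import Dict, List


def request_metrics(sent_at: float, token_times: List[float]) -> Dict[str, float]:
    """Compute per-request metrics from timestamps given in seconds.

    sent_at:     when the request was sent
    token_times: arrival time of each streamed output token, in order
    """
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - sent_at      # Time-to-First-Token
    e2e = token_times[-1] - sent_at      # End-to-End Latency
    return {
        "ttft_s": ttft,
        "e2e_latency_s": e2e,
        "output_tokens_per_s": len(token_times) / e2e,
    }
```

For example, four tokens arriving 0.5, 1.0, 1.5 and 2.0 seconds after the request give a TTFT of 0.5 s, an end-to-end latency of 2.0 s and a per-request output throughput of 2 tokens/s.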

Each LLM model uses its own tokenizer, meaning the same prompt may result in different token counts across models. To maintain consistency and comparability, LLMPerf standardizes token counting by using the LlamaTokenizer for both input and output tokens, regardless of the model being tested. For more information, refer to the additional details about the benchmarking samples generated by LLMPerf.

We have updated the benchmark code to include additional metrics based on SaladCloud pricing, such as infrastructure cost, cost per 1K requests and cost per 1M output tokens.
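
The cost metrics follow from simple arithmetic, sketched below. The function and parameter names are hypothetical and not taken from the benchmark code, and the hourly price in the usage example is a placeholder, not a SaladCloud price.

```python
from typing import Dict


def cost_metrics(hourly_price_per_replica: float, replicas: int,
                 duration_s: float, completed_requests: int,
                 output_tokens: int) -> Dict[str, float]:
    """Derive cost figures for one test run from price, duration and volume."""
    # Cost of keeping all replicas running for the duration of the test.
    infrastructure_cost = hourly_price_per_replica * replicas * duration_s / 3600
    return {
        "infrastructure_cost": infrastructure_cost,
        "cost_per_1k_requests": infrastructure_cost / completed_requests * 1_000,
        "cost_per_1m_output_tokens": infrastructure_cost / output_tokens * 1_000_000,
    }


# Placeholder inputs: 5 replicas at $0.30/hour for 10 minutes,
# completing 500 requests and generating 250,000 output tokens.
example = cost_metrics(0.30, 5, 600, 500, 250_000)
```

Because the infrastructure cost for a fixed duration does not depend on load, any increase in total output throughput lowers the cost per 1M output tokens proportionally.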

For this benchmarking test, we send concurrent requests to the two container groups (the RTX 3090 and the RTX 4090), varying the number of concurrent requests and combinations of prompt and output lengths, and then collect the results. Metrics like TTFT and E2E Latency are also affected by the distance from the gateway to the client location, which, in this test, is California.

DeepSeek benchmark results and observations

TGI utilizes a continuous batching algorithm, dynamically adding requests to the running batch for optimal performance. As the number of concurrent requests increases from 5 to 10, 20, 40, and 80 for each of the two 5-replica container groups, the total output throughput improves by over 500% while the cost per 1M output tokens drops by more than 80%. However, per-request output throughput and per-request completion time both worsen as the number of concurrent requests rises.

As input prompt length increases, TTFT also rises. However, network transmission, along with queuing within TGI, remains the primary source of this latency, accounting for approximately 300 milliseconds.

TTFT constitutes a small portion of the total request-output time, which is primarily influenced by the output length (number of generated tokens). Enabling token streaming allows tokens to be returned one by one, so users see output immediately instead of waiting for the complete response, which enhances the user experience.

Larger batch sizes and longer context lengths (input prompt + generated text) lead to higher VRAM usage. Both the RTX 3090 and the RTX 4090 feature 24 GB of VRAM and can handle batched inference with an 8B model in 16-bit precision as follows:

– A batch size of 16 and a context length of 1024
– A batch size of 8 and a context length of 2048
– A batch size of 4 and a context length of 4096
– A batch size of 1 and a context length of 16384

For more information on VRAM usage, please check this link.
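
A back-of-the-envelope estimate helps explain why these four configurations are interchangeable: each one holds 16,384 tokens in the KV cache in total. The sketch below assumes the architecture values of a Llama-3.1-8B-style model (32 layers, 8 key-value heads, head dimension 128) and roughly 8.03 billion parameters. These figures are assumptions not stated in this article, and the estimate ignores activations and runtime overhead.

```python
# Assumed architecture values for a Llama-3.1-8B-style model.
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2        # 16-bit precision
NUM_PARAMS = 8.03e9        # approximate parameter count


def kv_cache_bytes(batch_size: int, context_length: int) -> int:
    """Bytes needed to cache keys and values for every token in the batch."""
    # Factor 2: one key and one value per layer, KV head and token.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * batch_size * context_length


def weights_bytes() -> float:
    """Bytes needed to hold the model weights in 16-bit precision."""
    return NUM_PARAMS * BYTES_PER_VALUE


for batch, ctx in [(16, 1024), (8, 2048), (4, 4096), (1, 16384)]:
    total_gib = (kv_cache_bytes(batch, ctx) + weights_bytes()) / 2**30
    print(f"batch={batch:>2} context={ctx:>5} -> ~{total_gib:.1f} GiB")
```

Under these assumptions the KV cache takes 128 KiB per token, or 2 GiB for 16,384 tokens, on top of about 15 GiB of weights. That is roughly 17 GiB in every configuration, leaving the rest of the 24 GB for activations and overhead.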

The RTX 4090 delivers over 20% higher performance than the RTX 3090, but comes at a 40% higher cost. The RTX 3090 remains a strong option for certain LLM use cases, thanks to its excellent price-to-performance ratio and per-request output throughput exceeding 40 tokens per second.
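
The price-to-performance comparison reduces to a single ratio. The sketch below uses only the two percentages quoted above; the function name is hypothetical.

```python
def relative_cost_per_token(perf_gain: float, price_increase: float) -> float:
    """Cost per output token of the faster GPU relative to the baseline GPU.

    perf_gain:      fractional throughput gain (0.20 means 20% faster)
    price_increase: fractional price increase (0.40 means 40% more expensive)
    """
    return (1 + price_increase) / (1 + perf_gain)


# RTX 4090 relative to RTX 3090, using the figures quoted above.
ratio = relative_cost_per_token(perf_gain=0.20, price_increase=0.40)
print(f"RTX 4090 cost per token is about {ratio:.2f}x that of the RTX 3090")
```

With a gain of exactly 20%, the RTX 4090 costs about 17% more per output token. Since the measured gain is over 20%, the real gap is somewhat smaller, and the RTX 4090 remains the better choice when per-request speed matters more than cost.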

Please see the original samples generated by LLMPerf during the benchmarking test.

SaladCloud – a scalable, cost-effective way to deploy LLM applications

Consumer GPUs are considerably more cost-effective than data center GPUs. While they may not be the ideal choice for extensive, large-model training tasks, they are powerful enough for the inference of most AI models.

DeepSeek-R1-Distill-Llama-8B with Hugging Face’s TGI v3 shows solid performance and cost-effectiveness on both the RTX 3090 and 4090 nodes of SaladCloud. The replica count of container groups can be easily adjusted to meet changing demands.

For a wide range of LLM use cases and models, our customers have reported up to 70% cost savings by switching from hyperscaler GPUs like the A100 and H100 to SaladCloud’s 3090 and 4090. Thanks to its exceptional scalability, flexibility, and cost-effectiveness, SaladCloud is perfectly positioned to drive success in the fast-evolving LLM and AI markets.

When to use SaladCloud for LLM Applications

SaladCloud delivers significant cost savings and high scalability for LLM applications in the following cases:

You are looking for infrastructure to deploy your own LLMs
You are fine-tuning your own LLMs
You have fluctuating compute needs and need extreme scalability
Your use cases involve AI detection, AI humanization, text classification, translation, summarization, personalization, and similar tasks

Interested in free credits to try SaladCloud for LLMs? Contact our support team today.

SaladCloud

SaladCloud is the world’s largest distributed cloud computing network with 11,000+ daily GPUs and 450,000 GPUs contributing compute, all at the lowest cost in the market.
