Dirt Cheap Image Captioning With Qwen Vision-Language Models: Up to 98.4% Cheaper than OpenAI

Image captioning with Qwen Vision Language Model Image captioning and labeling plays an important role in many AI and ML training workloads, and until fairly recently, has been limited in effectiveness both by available technology and cost. Enter open-source vision-language models like Alibaba’s Apache 2.0-licensed Qwen 2.5, available in 3B and 7B sizes. Vision-Language models provide substantial improvements over previous-generation solutions based on CLIP and BLIP. The ability to include a text prompt along with your image gives you a great deal of control as to the style and content of the returned captions. Additionally, self-hosting language models gets easier and more performant all the time thanks to projects like Huggingface’s Text Generation Inference (TGI). Comparing Qwen 2.5 VL against OpenAI’s GPT 40 mini We deployed both the 3B and 7B versions of Qwen 2.5 VL using TGI on SaladCloud, and benchmarked their cost-performance using the COCO2017 image dataset. We compared this to using OpenAI’s GPT 4o mini, their least expensive vision-language model. We also used the “low” detail setting with 4o mini, to try to achieve the least expensive possible use of OpenAI’s services. We found that OpenAI measured roughly twice as many input tokens per image as our TGI/Qwen setup, and cost quite a lot more per token as well. In our best cost-performance of the benchmark, Qwen 2.5 VL 3B achieved 126,650 images captioned per dollar on an RTX 4080, a savings of 98.4% over running the same workload on OpenAI. The rest of the numbers are detailed below. High quality results from all models All tested models provided very good captions, even correctly identifying text. You can see how the different prompts greatly effect the generated captions. Prompt: “What is in this image? Include details.” Qwen2.5 VL 3B The image depicts a street scene in a town with signs for a hotel, shops, and possibly a restaurant. Notable objects in the image include: – A “Peace Way Hotel” sign prominently displayed on the left side of the street. This sign is likely nearby the hotel.– Multiple closed storefronts indicating a quiet day and/or area where businesses are closed.– A clear road with cars, potentially indicating moderate traffic volume.– Buildings with various signs and storefronts, suggesting a commercial area.– Smaller signs and billboards, some of which are partially obscured.– Raised traffic lights or signals, signaling that the road is a busy one.– A large hill or mountain visible in the background, adding to the scenic backdrop.– Some electrical or utility poles and wires extending overhead. Given the setting with sporadic closed shops and hotels, it might suggest a time frame like early morning or low season. Qwen2.5 VL 7B The image depicts a quiet street scene in a town, likely in a Middle Eastern region given the signage and architecture. Key details include: – Hotel Signage: Prominently featured is a sign for the “PEACE WAY HOTEL,” indicating hospitality services in the area.– Restaurant/Cafe: There is another sign labeled “NO NAME RESTAURANT & CAFE,” suggesting local dining options.– Buildings: The landscape includes low-rise buildings with various signs in different languages (likely Arabic and English), displaying shops and businesses.– Traffic: A few vehicles are visible, including a car on the road, suggesting minimal traffic activity.– Surroundings: The background displays a mountainous area, suggesting a hilly or dry environment.– Electrical Wires: The presence of utility poles and wires, indicative of urban infrastructure, adds to the character of the streetscape. Overall, the setting appears to be a small, possibly rural town, characterized by local commerce and quiet surroundings. GPT 4o mini The image shows a street scene that appears to be in a small town or city. There are several buildings lining the road, with a variety of signs in different languages, suggesting a diverse environment. Notably, there is a large sign that reads “PEACE WAY HOTEL,” which stands out prominently. The street is largely empty, indicating a calm atmosphere, and a few vehicles are parked or driving in the area. In the background, hills or mountains rise up, suggesting a rugged landscape surrounding the settlement. The overall mood of the scene is quiet and somewhat subdued, possibly indicating a less busy time of day or a less populated area. Prompt: “Describe the image in detail using comma-separated descriptor tags, with more prominent features toward the front of the list“ Qwen2.5 VL 3B hut, building, window, wood, blue, red, streetlamp, car, truck, sign, mountain, white, beige, asphalt, metal, pole, railing, water, fence, rock, Notify, post office, neuron, electricity, phone, Hotel, Peace, Way Hotel, peaces, signboard, route, mountain, facade, lights, road, run, windows, road, power lines, accommodation, ambulance, driveway, fire escape, handrail, post, craft, sign, go, power, electricity, wine, stop sign Qwen2.5 VL 7B desert-like building facades, narrow street, two signboards, Bavarian architecture style, arid landscape, ridges, single yellow car, white truck on right, administrative architecture, urban setting, signage, blue doors, underground barriers, Hill in background GPT 4o mini street scene, urban landscape, mountainous background, deserted road, hotel signage, small shops, muted colors, peaceful atmosphere, buildings with balconies, clear sky, parked vehicles, roadside signage, Mediterranean architecture, signage in Arabic and English, early morning light, light traffic, hilly terrain, quiet town atmosphere Prompt: “What is in this image? Include details.” Qwen2.5 VL 3B The image depicts a landscape scene with a vast expanse of snow-covered ground in the foreground. The middle ground includes trees and rock formations, suggesting a mountainous terrain. The background features rolling hills and more visible mountain ranges in the distance, under a partly cloudy sky. There is a signpost with blue and white labeling in the right foreground, but the details on the sign are not clearly visible. Qwen2.5 VL 7B The image shows a snowy landscape, likely taken on a mountaintop or hillside. The ground is covered with patches of snow, with some bare soil and vegetation visible where the snow has melted or been pushed away. In the distance, a hazy horizon stretches toward what appears to be valleys and mountains. There is dense evergreen forest
How Undetectable.ai saves over $50,000 a month with SaladCloud’s consumer GPUs

A surge of users for Undetectable.ai Undetectable.ai is on a mission to solve one of the most pressing challenges in the era of large language models (LLMs): accurately detecting AI-generated content and seamlessly “humanizing” it so that it remains indistinguishable from natural human writing. In just a matter of months, Undetectable.ai skyrocketed from an idea around accurate AI detection to 14 million+ signups, serving both individuals and enterprises worldwide. Students worried about getting flagged for AI content turned to the Humanizer for rewriting their legitimate essays. Marketing agencies, churning out blogs and ad copy, used it to keep that “authentic human tone.” Even entire content-writing businesses popped up around the Undetectable.ai API. What began with a single product vision — an AI detector — quickly evolved into a two-pronged solution: Both tools are powered by a highly custom, manually-created dataset with 10s of 1000s of samples in multiple languages, delivering premium accuracy and humanization. Scaling Compute – But at what cost? As undetectable.ai grew in popularity, Ben Miller, COO of Undetectable.ai, was wrestling with a problem that could make or break the startup: These queries weren’t trivial text in/out requests; they involved inference on specialized models requiring significant VRAM. To achieve this at scale, Undetectable.ai needed fast, cost-efficient, and highly adaptable GPU infrastructure – enter SaladCloud. The problem with hyperscalers and high-end GPUs Ben and team looked into the usual suspects: big cloud providers offering A100 or H100 GPUs with impressive performance – but equally jaw-dropping price tags. If they stayed on that path, Undetectable.ai would pay tens of thousands of dollars a month, maybe hundreds of thousands as traffic kept soaring. As a lean startup, they couldn’t sink all their resources into GPU fees. “Some of the A100 providers wanted a committed contract. Our business being cyclical, this was not ideal”, adds Ben. Meanwhile, the need for a flexible infrastructure kept growing. The usage spiked dramatically before midterm exams at universities and peaked again when marketing campaigns ramped up at end-of-quarter cycles. One day they’d need 20 GPUs, the next day maybe 300. “As we scaled, we had to find the most cost effective, scalable way for us to get good GPUs. That’s how I found SaladCloud” – Ben Miller Testing deployment on SaladCloud’s consumer GPUs “We started with a test of our custom model on an A100 and a consumer GPU. The A100 on another cloud could run the queries 3x faster than a consumer card on SaladCloud, but the price was 10 to 40 times higher. And so the math in favor of SaladCloud was very attractive.” – Ben Miller, COO, Undetectable.ai Instead of paying for pricey, high-end datacenter GPUs, Undetectable tapped into thousands of consumer GPUs on SaladCloud. There were immediate benefits. Ben adds, “I was attracted to this idea of a massive cloud with thousands of GPUs while we were struggling to get a single A100 on. It takes a week to get response from support on the hyperscalers while SaladCloud’s team was incredibly responsive”. Saving $50k-$80k a month on SaladCloud As the numbers made sense, Undetectable.ai switched to SaladCloud. Almost overnight, they spun up the capacity to handle hundreds of thousands of queries per day – with compute nodes peppered across the entire planet. “We’re saving roughly $50,000–$80,000 a month by using SaladCloud instead of an enterprise GPU cloud or a high-end GPU. And that’s before we even refine our autoscaling to handle weekend vs. weekday usage more precisely.” – Ben Miller, COO, Undetectable.ai Ensuring AI content stays human At its core, Undetectable.ai’s story is about pushing the boundaries of LLM usage. This year alone, the team is introducing 20-30 innovative products, including the AI Essay Writer, which helps students refine their essays, and the AI Job Application Bot, designed to automate the job search process, helping professionals save time and increase their chances of securing their next position. Alongside these advancements, Undetectable.ai is also onboarding a growing number of enterprise customers, solidifying its role as a leader in humanizing AI-generated content. With AI ensuring humans can create massive amounts of content in minutes, tools like Undetectable are ensuring the digital world doesn’t become polluted by robotic, emotion-less content. Undetectable.ai’s partnership with SaladCloud showcases the power of leveraging consumer-grade GPU clouds for demanding AI workloads. By placing cost efficiency and scalability at the forefront, Undetectable.ai now processes hundreds of thousands of queries daily, meeting the needs of marketers, students, educators, and content creators—all without compromising model accuracy or user experience. For AI companies deploying large language models (LLMs) at scale, Undetectable.ai’s story is a testament to thinking beyond the conventional (and often prohibitively expensive) approach of enterprise GPUs. With SaladCloud, they’ve unlocked a massive, globally distributed network of GPUs—capable of powering advanced AI solutions at a fraction of the cost. SaladCloudSaladCloud is the world’s largest distributed cloud computing network with 11,000+ daily GPUs and 450,000 GPUs contributing compute, all at the lowest cost in the market.
Your own ChatGPT for just $0.04/hr – with Ollama, ChatUI and SaladCloud

Deploy your own LLM with Ollama & Huggingface Chat UI on SaladCloud How much does it cost to build and deploy a ChatGPT-like product today? The cost could be anywhere from thousands to millions – depending on the model, infrastructure, and use case. Even the same task could cost anywhere from $1000 to $100,000. But with the advancement of open-source models & open infrastructure, there’s been tremendous interest in building a cost-efficient ChatGPT-like tool for various real-life applications. In this article, we explore how tools like Ollama and Huggingface Chat UI can simplify this process, particularly when deployed on Salad’s distributed cloud infrastructure. The challenges in hosting & implementing LLMs In today’s digital ecosystem, Large Language Models (LLMs) have revolutionized various sectors, including technology, healthcare, education, and customer service. Their ability to understand and generate human-like text has made them immensely popular, driving innovations in chatbots, content creation, and more. These models, with their vast knowledge bases and sophisticated algorithms, can converse, comprehend complex topics, write code, and even compose poetry. This makes them highly versatile tools for many enterprise & everyday use cases. However, hosting and implementing these LLMs poses significant challenges. Despite these challenges, the integration of LLMs into platforms continues to grow, driven by their vast potential and the continuous advancements in the field. As solutions like Hugging Face’s Chat UIand SaladCloud continue to offer more accessible and efficient ways to deploy these models, we’re likely to see an even greater adoption and innovation across industries. What is Ollama? Ollama is a tool that enables the local execution of open-source large language models like Llama 2 and Mistral 7B on various operating systems, including Mac OS, Linux, and soon Windows. It simplifies the process of running LLMs by allowing users to execute models with a simple terminal command or an API call. Ollama optimizes setup and configuration, specifically tailoring GPU usage for efficient performance. It supports a variety of models and variants, all accessible through the Ollama model library, making it a versatile and user-friendly solution for running powerful language models locally. Here is a list of supported models: Model Parameters Size Download Llama2 7B 3.8GB ollama run llama2 Mistral 7B 4.1GB ollama run mistral Dolphin Phi 2.7B 1.6GB ollama run dolphin-phi Phi-2 2.7B 1.7GB ollama run phi Neural Chat 7B 4.1GB ollama run neural-chat Starling 7B 4.1GB ollama run starling-lm Code Llama 7B 3.8GB ollama run codellama Llama 2 Uncensored 7B 3.8GB ollama run llama2-uncensored Llama 2 13B 13B 7.3GB ollama run llama2:13b Llama 2 70B 70B 39GB ollama run llama2:70b Orca Mini 3B 1.9GB ollama run orca-mini Vicuna 7B 3.8GB ollama run vicuna LLaVA 7B 4.5GB ollama run llava What is Huggingface Chat UI? Huggingface Chat UI is a powerful tool for practitioners in the Large Language Model (LLM) space looking to deploy a ChatGPT-like conversational interface. It enables interaction with models hostedon Huggingface, leveraging its text generation inference or any custom API powered by LLM. Chat UI has such capabilities as conversational history, memory, authentication, and theming. Huggingface Chat UI is an ideal choice for those looking to create a more engaging and robust conversational agent. Integrating Ollama and Huggingface Chat UI for deploying on SaladCloud The main goal of our project is to integrate Ollama with Huggingface Chat UI and deploy them to SaladCloud.The final version of the code can be found here: GitHub – SaladTechnologies/ollama-chatui In order to achieve our goal, we did the following: 1. Clone Ollama Repository We start by cloning the Ollama repository from Ollama Git Repo. This repository serves as the base of the project.Ollama is a user-friendly tool and can be operated via terminal or as a REST API. In this project, the intention is to run Ollama in a Docker container and connect it to Chat UI. The Dockerfile from Ollama repository shows that it runs on host 0.0.0.0 and port 11434. However, since direct access to Ollama isn’t required but rather through the UI, this configuration will be modified later. 2. Setting Up Huggingface Chat UI Chat UI git repo: GitHub – huggingface/chat-ui: Open source codebase powering the HuggingChat app From the Chat UI Readme, we can see that we need to follow a few steps to make it work in our custom solution: Notice that the path to ollama is specified as http://127.0.0.1:11434. 3. Connecting Ollama and Chat UI We now need to connect Ollama and ChatUI. This involves ensuring that the Chat UI can communicate with the Ollama instance, typically by setting the appropriate port and host settings in the UI configuration to match the Ollama Docker deployment. First we clone the ChatUI repo in our Dockerfile and replace the host that Ollama uses with 127.0.0.1. Next expose port 3000 that is used by ChatUi.We will also replace the entry point with our custom shell script: With this script, we establish the necessary .env.local file and populate it with configurations. Next, we initiate the Ollama server in a separate tmux session to download the desired model. ChatUI is then activated on port 3000. For any adjustments in model settings, refer to the models_config/model.local file. We have also converted the MongoDB URL, Huggingface Token, and Model name into environment variables to facilitate seamless alterations during deployment to SaladCloud. Additionally, a DOWNLOAD_TIME variable is defined. Since Ollama runs in a tmux session, subsequent commands can be executed even if the server isn’t fully operational. To ensure that Ollama is fully active before initiating ChatUI, we incorporate a sleep duration. This duration is model-dependent forinstance, downloading llama2 might take around 8 minutes. 4. Deploying to SaladCloud After setting up and connecting Ollama and Chat UI, the complete system is ready for deployment to Salad’s cloud infrastructure. The integrated solution will be hosted on Salad’s robust cloud platform. Detailed deployment instructions and necessary files are accessible through the Salad Technologies Ollama Chat UI GitHub repository or by pulling the image from the Salad Docker Registry: saladtechnologies/ollama-chatui-salad:1.0.0. To deploy our solution, we need to follow the instructions:
LLM Comparison Through TGI Benchmark Using SaladCloud

In the field of Artificial Intelligence (AI), Text Generation Inference (TGI) has become a vital toolkit for deploying and serving Large Language Models (LLMs). TGI enables efficient and scalable text generation with popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and Mistral. This SaladCloud benchmark dives deep into this technology, with a LLM comparison focused on the performance of popular language models. TGI and Large Language Models TGI is essential for leveraging the capabilities of Large Language Models, which are key to many AI applications today. These models, known for generating text that closely resembles human writing, are crucial for applications ranging from automated customer service to creative content generation.You can easily deploy TGI on SaladCloud using the following instructions: Run TGI (Text Generation Interface) by Hugging Face Experiment design: Benchmarking on SaladCloud Our benchmark study on SaladCloud aims to evaluate and compare select LLMs deployed through TGI. This will provide insights into model performance under varying loads and the efficacy of SaladCloud in supporting advanced AI tasks. Models for comparison We’ve selected the following models for our benchmark, each with its unique capabilities: Test parameters Batch Sizes: The models will be tested with batch sizes of 1, 4, 8, 16, 32, 64, and 128.Hardware Configuration: Uniform hardware setup across tests with 8 vCPUs, 28GB of RAM, and a 24GB GPU card.Benchmarking Tool: To conduct this benchmark, we utilized the Text Generation Benchmark Tool,which is a part of TGI and designed to effectively measure the performance of these models.Model Parameters: We’ve used the default Sequence length of 10 and decode length of 8. Performance metrics The TGI benchmark provides us with the following metrics for each batch we provided: Bigcode/santacoder bigcode/santacoder is part of the SantaCoder series, featuring 1.1 billion parameters and trained on subsets of Python, Java, and JavaScript from The Stack (v1.1). This model, known for its Multi Query Attention and a 2048-token context window, utilizes advanced training techniques like near-deduplication and comment-to-code ratio filtering. The SantaCoder series also includes variations in architecture and objectives, providing diverse capabilities in code generation and analysis. This is the smallest model in our benchmark. Key observations Cost-effectiveness on SaladCloud: bigcode/santacoder A key part of our analysis focused on the cost-effectiveness of running TGI models on SaladCloud. For a batch size of 32, with a compute cost of $0.35 per hour, we calculated the cost per million tokens based on throughput : The cost per token, considering the throughput and compute price, is approximately $0.03047 or about 3.047 cents per million tokens for output and $0.07572 per million input tokens. Tiiuae/falcon-7b Falcon-7B is a decoder-only model with 7 billion parameters. It was built by TII and trained on an extensive 1,500B token dataset from RefinedWeb, enhanced with curated corpora. It is available under the Apache 2.0 license, making it a significant model for large-scale text generation tasks. Key findings Cost-effectiveness on SaladCloud: Tiiuae/Falcon-7b For the tiiuae/falcon-7b model on SaladCloud with a batch size of 32 and a compute cost of $0.35 per hour, the calculated cost per million tokens with a throughput of 744 tokens per second is approximately $0.13095, or about 13.095 cents per million output tokens and $0.28345 per million input tokens. The average decode total latency for batch size 32 is 300.82 milliseconds. While this latency might be slightly higher compared to smaller models, it still falls within a reasonable range for many applications, especially considering the model’s large size of 7 billion parameters. The cost-effectiveness, combined with the model’s capabilities, makes it a viable option for extensive text generation tasks on SaladCloud. Code Llama Code Llama is a collection of generative text models, with the base model boasting 7 billion parameters. It’s part of a series ranging up to 34 billion parameters, specifically tailored for code-related tasks. This benchmark focuses on the base 7B version in Hugging Face Transformers format, designed to handle a wide range of coding applications. The cost for processing one million tokens using the Code Llama model on SaladCloud, with a batch size of 32 and a compute cost of $0.35 per hour, is approximately $0.11826 per million output tokens and $0.28679 per million input tokens. This figure highlights the economic feasibility of utilizing SaladCloud for large-scale text generation tasks with sophisticated models like Code Llama. Mistral-7B-Instruct-v0.1 Mistral-7B-Instruct-v0.1 is an instruct fine-tuned version of the Mistral-7B-v0.1 generative text model. This model leverages a variety of publicly available conversation datasets to enhance its capability to understand and generate human-like, conversational text. Its fine-tuning makes it particularly adept at handling instruction-based queries, setting it apart in the realm of LLMs. Key insights Implications and Cost Analysis The performance of the Mistral-7B-Instruct-v0.1 model on SaladCloud shows promising potential for its use in various AI-driven conversational systems. Its ability to process a high number of tokens per second at a manageable latency makes it a strong contender for applications requiring nuanced language understanding and generation. With a price of $0.35 per hour, we achieve a cost of approximately $0.12153 per million output tokens and $0.27778 per million input tokens. Conclusion – LLM comparison benchmark results Our comprehensive LLM comparison benchmark of various Text Generation Inference (TGI) models on SaladCloud reveals an insightful trend: despite the diversity in the models’ capabilities and complexities, there is a remarkable consistency in cost-effectiveness when using the same compute configuration. Consistent performance and cost-effectiveness Customizable compute options Final thoughts In conclusion, SaladCloud emerges as a compelling choice for deploying and running TGI models. Its ability to provide uniform compute efficiency across a range of models, combined with the option to customize and optimize compute resources, offers both consistency in performance and flexibility in cost management. Whether it’s for large-scale commercial deployments or smaller, more targeted AI tasks, SaladCloud’s platform is well-equipped to meet diverse text generation requirements with an optimal balance of efficiency and cost-effectiveness SaladCloudSaladCloud is the world’s largest distributed cloud computing network with 11,000+ daily GPUs and 450,000 GPUs contributing compute, all at the lowest cost in the market.