Image captioning with Qwen Vision Language Model
Image captioning and labeling plays an important role in many AI and ML training workloads, and until fairly recently, has been limited in effectiveness both by available technology and cost. Enter open-source vision-language models like Alibaba’s Apache 2.0-licensed Qwen 2.5, available in 3B and 7B sizes. Vision-Language models provide substantial improvements over previous-generation solutions based on CLIP and BLIP. The ability to include a text prompt along with your image gives you a great deal of control as to the style and content of the returned captions. Additionally, self-hosting language models gets easier and more performant all the time thanks to projects like Huggingface’s Text Generation Inference (TGI).
Comparing Qwen 2.5 VL against OpenAI’s GPT 40 mini
We deployed both the 3B and 7B versions of Qwen 2.5 VL using TGI on SaladCloud, and benchmarked their cost-performance using the COCO2017 image dataset. We compared this to using OpenAI’s GPT 4o mini, their least expensive vision-language model. We also used the “low” detail setting with 4o mini, to try to achieve the least expensive possible use of OpenAI’s services.
We found that OpenAI measured roughly twice as many input tokens per image as our TGI/Qwen setup, and cost quite a lot more per token as well. In our best cost-performance of the benchmark, Qwen 2.5 VL 3B achieved 126,650 images captioned per dollar on an RTX 4080, a savings of 98.4% over running the same workload on OpenAI. The rest of the numbers are detailed below.
High quality results from all models
All tested models provided very good captions, even correctly identifying text. You can see how the different prompts greatly effect the generated captions.

Prompt: “What is in this image? Include details.”
Qwen2.5 VL 3B
The image depicts a street scene in a town with signs for a hotel, shops, and possibly a restaurant. Notable objects in the image include:
– A “Peace Way Hotel” sign prominently displayed on the left side of the street. This sign is likely nearby the hotel.
– Multiple closed storefronts indicating a quiet day and/or area where businesses are closed.
– A clear road with cars, potentially indicating moderate traffic volume.
– Buildings with various signs and storefronts, suggesting a commercial area.
– Smaller signs and billboards, some of which are partially obscured.
– Raised traffic lights or signals, signaling that the road is a busy one.
– A large hill or mountain visible in the background, adding to the scenic backdrop.
– Some electrical or utility poles and wires extending overhead.
Given the setting with sporadic closed shops and hotels, it might suggest a time frame like early morning or low season.
Qwen2.5 VL 7B
The image depicts a quiet street scene in a town, likely in a Middle Eastern region given the signage and architecture. Key details include:
– Hotel Signage: Prominently featured is a sign for the “PEACE WAY HOTEL,” indicating hospitality services in the area.
– Restaurant/Cafe: There is another sign labeled “NO NAME RESTAURANT & CAFE,” suggesting local dining options.
– Buildings: The landscape includes low-rise buildings with various signs in different languages (likely Arabic and English), displaying shops and businesses.
– Traffic: A few vehicles are visible, including a car on the road, suggesting minimal traffic activity.
– Surroundings: The background displays a mountainous area, suggesting a hilly or dry environment.
– Electrical Wires: The presence of utility poles and wires, indicative of urban infrastructure, adds to the character of the streetscape.
Overall, the setting appears to be a small, possibly rural town, characterized by local commerce and quiet surroundings.
GPT 4o mini
The image shows a street scene that appears to be in a small town or city. There are several buildings lining the road, with a variety of signs in different languages, suggesting a diverse environment. Notably, there is a large sign that reads “PEACE WAY HOTEL,” which stands out prominently. The street is largely empty, indicating a calm atmosphere, and a few vehicles are parked or driving in the area. In the background, hills or mountains rise up, suggesting a rugged landscape surrounding the settlement. The overall mood of the scene is quiet and somewhat subdued, possibly indicating a less busy time of day or a less populated area.
Prompt: “Describe the image in detail using comma-separated descriptor tags, with more prominent features toward the front of the list“
Qwen2.5 VL 3B
hut, building, window, wood, blue, red, streetlamp, car, truck, sign, mountain, white, beige, asphalt, metal, pole, railing, water, fence, rock, Notify, post office, neuron, electricity, phone, Hotel, Peace, Way Hotel, peaces, signboard, route, mountain, facade, lights, road, run, windows, road, power lines, accommodation, ambulance, driveway, fire escape, handrail, post, craft, sign, go, power, electricity, wine, stop sign
Qwen2.5 VL 7B
desert-like building facades, narrow street, two signboards, Bavarian architecture style, arid landscape, ridges, single yellow car, white truck on right, administrative architecture, urban setting, signage, blue doors, underground barriers, Hill in background
GPT 4o mini
street scene, urban landscape, mountainous background, deserted road, hotel signage, small shops, muted colors, peaceful atmosphere, buildings with balconies, clear sky, parked vehicles, roadside signage, Mediterranean architecture, signage in Arabic and English, early morning light, light traffic, hilly terrain, quiet town atmosphere

Prompt: “What is in this image? Include details.”
Qwen2.5 VL 3B
The image depicts a landscape scene with a vast expanse of snow-covered ground in the foreground. The middle ground includes trees and rock formations, suggesting a mountainous terrain. The background features rolling hills and more visible mountain ranges in the distance, under a partly cloudy sky. There is a signpost with blue and white labeling in the right foreground, but the details on the sign are not clearly visible.
Qwen2.5 VL 7B
The image shows a snowy landscape, likely taken on a mountaintop or hillside. The ground is covered with patches of snow, with some bare soil and vegetation visible where the snow has melted or been pushed away. In the distance, a hazy horizon stretches toward what appears to be valleys and mountains. There is dense evergreen forest predominantly visible on the slopes, adding depth and texture to the scene. The sky is cloudy, with slashes of sunlight breaking through on the right, indicating that the sun might be setting or emerging from behind a cloud formation. A blue signpost, common for hiking trails, is tilted on the right side, suggesting directionality. The overall atmosphere is serene and remote, typical of a high-altitude or wilderness mountainous area with no visible human structures.
GPT 4o mini
The image depicts a winter landscape, likely at a mountain location. The foreground shows patches of snow covering the ground and some rocky areas. There are several evergreen trees scattered throughout, adding to the natural scenery. In the background, there are rolling mountains that create a panoramic view, with a cloudy sky overhead that hints at possible overcast weather. A signpost can be seen, likely indicating a trail or direction for hikers, which adds context to the outdoor setting. The overall atmosphere suggests a serene and picturesque scene typical of a mountainous region in winter.
Prompt: “Describe the image in detail using comma-separated descriptor tags, with more prominent features toward the front of the list“
Qwen2.5 VL 3B
snow-covered ground, cloudy sky, distant mountains, snowy patch ahead, tree on left, roughly cut tree on right, planted snow mound, blue and white sign on right, sparse vegetation, wide open landscape
Qwen2.5 VL 7B
Open snowy landscape, evergreen trees, cloudy sky, sun rays peeking through clouds, distant mountain range
GPT 4o mini
snow-covered landscape, distant mountains, cloudy sky, tree line, scenic viewpoint, rocky foreground, winter scenery, hiking signpost, fading sunlight, rugged terrain, atmospheric perspective, natural beauty
Benchmark design
We deployed container groups with 10 replicas on SaladCloud, one group for each model+gpu combo we tested.
Our docker image was very simple, just using the base TGI image with the model weights preloaded. We left everything in its default configuration.
FROM ghcr.io/huggingface/text-generation-inference:3.2.0
# We previously downloaded the model files
COPY data /data
ENV MODEL_ID="Qwen/Qwen2.5-VL-3B-Instruct"We used Grafana K6, a popular load-testing tool, to produce and measure load on the container groups, ramping from 10 virtual users to 300 over the course of an hour. With each request, we included an image URL to an image in the COCO2017 dataset, and the text prompt “What is in this image? Include details.” Requests were made in the OpenAI chat completion format.

Example Captioning Request
curl -X 'POST' \
'https://some-random-prefix.salad.cloud/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"max_tokens": 256,
"messages":[
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image? Include details."
},
{
"type": "image_url",
"image_url": {
"url": "https://salad-benchmark-assets.download/coco2017/train2017/000000000094.jpg"
}
}
]
}
],
"stream": false
}'Results from the Qwen 2.5 VL benchmark against OpenAI’s GPT 40 mini
All Results Are From A SaladCloud Container Group With 10 Replicas.
Qwen 2.5 VL 3B
RTX 3090 (24GB VRAM), 4vCPU, 30GB RAM (x10)
- Cost of Cluster Per Hour at Batch Priority: $1.16
- 71,949 Images Processed
- Max Throughput: 33 Images Per Second, achieved at 237 virtual users.
- Images Per Dollar at Max Throughput, Batch Priority: 102,413
RTX 4080 (16GB VRAM), 4vCPU, 30GB RAM (x10) – Best Value
- Cost of Cluster Per Hour at Batch Priority: $1.46
- 108,127 Images processed
- Max Throughput: 51.4 Images Per Second, achieved at 280 virtual users.
- Images Per Dollar at Max Throughput, Batch Priority: 126,650
RTX 4090 (24GB VRAM), 4vCPU, 30GB RAM (x10)
- Cost of Cluster Per Hour at Batch Priority: $1.96
- 76,584 Images processed
- Max Throughput: 37 Images Per Second, achieved at 299 virtual users. This result suggests the cluster could have handled more load.
- Images Per Dollar at Max Throughput, Batch Priority: 66,734
Qwen 2.5 VL 7B
RTX 3090 (24GB VRAM), 4vCPU, 30GB RAM (x10)
- Cost of Cluster Per Hour at Batch Priority: $1.16
- 48664 Images Processed
- Max Throughput: 27.7 Images Per Second, achieved at 288 virtual users.
- Images Per Dollar at Max Throughput, Batch Priority: 85,965
RTX 4090 (24GB VRAM), 4vCPU, 30GB RAM (x10)
- Cost of Cluster Per Hour at Batch Priority: $1.96
- 64,536 Images processed
- Max Throughput: 30.6 Images Per Second, achieved at 293 virtual users.
- Images Per Dollar at Max Throughput, Batch Priority: 56,103
Conclusion
Open vision-language models have largely caught up to proprietary models like those from OpenAI when it comes to image comprehension, and they can be run for a tiny fraction of the cost using open source inference servers and Salad’s distributed GPU cloud. With prompt engineering you can get the type of captions you need for your particular use, and small models like Qwen 2.5 VL can even be finetuned on modest consumer hardware for even further control.
