SaladCloud Blog

INSIDE SALAD

Tag 309K Images/$ with Recognize Anything Model++ (RAM++) On Consumer GPUs

Shawn Rushefsky

What is the Recognize Anything Model++?

The Recognize Anything Model++ (RAM++) is a state-of-the-art image tagging foundation model released last year, with pre-trained model weights available on the Hugging Face Hub. It significantly outperforms other open models like CLIP and BLIP in both the range of categories it recognizes and its accuracy. But how much does it cost to run RAM++ on consumer GPUs?

In this benchmark, we tag 144,485 images from the COCO 2017 and AVA image datasets, evaluating inference speed and cost-performance. The evaluation ran across 167 nodes on SaladCloud, representing 19 different consumer GPU classes. To do this, we created a container group targeting a capacity of 100 nodes, with the “Stable Diffusion Compatible” GPU class. All nodes were assigned 2 vCPUs and 8 GB of RAM. Here’s what we found.

Up to 309k images tagged per dollar on RTX 2080

Images tagged per dollar for each GPU type for Recognize Anything Model++ (RAM++)

In keeping with a trend we often see here, the best cost-performance comes from the lower-end GPUs: RTX 20- and 30-series cards. In general, we find that the smallest, cheapest GPU that can do the job is likely to have the best cost-performance in terms of inferences per dollar. RAM++ is a fairly small, lightweight model (3 GB), and it achieved its best cost-performance on the RTX 2080, with just over 309k inferences per dollar.
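As a rough illustration of how an inferences-per-dollar figure can be derived, the sketch below divides the number of images a node tags per hour by that node's hourly price. It assumes one image is processed at a time with no idle gaps, which may not match the benchmark's actual setup. The function name and the example inputs (0.25 s per image, $0.10 per hour) are hypothetical and are not measurements from this benchmark.

```python
def images_per_dollar(avg_inference_seconds: float, price_per_hour_usd: float) -> float:
    """Estimate images tagged per dollar for a single node.

    Assumes back-to-back sequential inference, so startup time,
    image download time, and idle time are ignored.
    """
    if avg_inference_seconds <= 0 or price_per_hour_usd <= 0:
        raise ValueError("both inputs must be positive")
    images_per_hour = 3600.0 / avg_inference_seconds
    return images_per_hour / price_per_hour_usd


# Hypothetical inputs for illustration only, not benchmark results.
estimate = images_per_dollar(avg_inference_seconds=0.25, price_per_hour_usd=0.10)
print(f"{estimate:,.0f} images per dollar")
```

The formula shows why a cheaper card can beat a faster one: halving the hourly price doubles images per dollar, while a faster GPU only helps in proportion to how much it shortens the average inference time.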

Average Inference Time Is <300ms Across All GPUs

Inference time distribution per GPU for Recognize Anything Model++ (RAM++)

We see relatively quick inference times across all GPU types, but we also see a wide distribution of performance, even within a single GPU type. Zooming in on one GPU type, we can see that this wide distribution is also present within a single node.

Inference time distribution for RTX 3080 GPU

Further, we see no significant correlation between inference time and number of tags generated.

GPU                    Correlation between inference time and number of tags
RTX 2080                0.04255
RTX 2080 SUPER         -0.02209
RTX 2080 Ti            -0.03439
RTX 3060                0.00074
RTX 3060 Ti             0.00455
RTX 3070                0.00138
RTX 3070 Laptop GPU    -0.00326
RTX 3070 Ti            -0.01494
RTX 3080               -0.00041
RTX 3080 Laptop GPU    -0.09197
RTX 3080 Ti             0.02748
RTX 3090               -0.00146
RTX 4060                0.03447
RTX 4060 Laptop GPU    -0.08151
RTX 4060 Ti             0.04153
RTX 4070                0.01393
RTX 4070 Laptop GPU    -0.05811
RTX 4070 Ti             0.00359
RTX 4080                0.02090
RTX 4090               -0.03002

Based on this, you should expect to see fairly wide variation in inference time in production regardless of your GPU selection or image properties.
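If you want to run the same check on your own workload, a correlation coefficient like the ones in the table can be computed from per-image logs. The sketch below uses the Pearson correlation, which is an assumption on our part since the post does not state which coefficient was used. The function name and the sample values are hypothetical and are not taken from the benchmark data.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-image logs for one GPU type (illustration only).
inference_ms = [210, 250, 190, 300, 230]
tag_counts = [12, 8, 15, 9, 11]
r = pearson_correlation(inference_ms, tag_counts)
print(f"correlation: {r:.5f}")
```

A value near zero, as in every row of the table, means the number of tags tells you almost nothing about how long an image took to process.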

Results from the Recognize Anything Model++ (RAM++) benchmark

Consumer GPUs offer a highly cost-effective solution for batch image tagging, delivering between 60x and 300x the cost efficiency of managed services like Azure AI Computer Vision. The Recognize Anything paper and code repository offer guides for training and fine-tuning this model on your own data, so even if you have unusual categories, you should consider RAM++ instead of commercially available managed services.

