Once upon a time, in 2019, it was considered very impressive to generate an approximate depth map from a single image, a technique known as monocular depth estimation. Today, AI models that can run on your laptop can generate fully textured 3D assets ready to import into game engines and modeling tools.
We benchmarked one such model, Hunyuan3D 2.1 from Tencent, on SaladCloud using RTX 4090 GPUs with 4 vCPU and 38GB RAM. For inference we used ComfyUI and ComfyUI API, along with the Hunyuan3D 2.1 Custom Node. Across more than 900 generations, the median generation time was 139.2 seconds, which works out to $0.0148 / generation on High priority and an impressive $0.009 / generation on Batch priority. Even on High priority, that is more than 90% cheaper than FAL’s Hunyuan 3D 2.0 endpoint.
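As a rough sanity check on those numbers, the per-generation cost can be converted back into an implied hourly price for the node. The sketch below assumes that cost scales linearly with the median generation time and that each node runs one generation at a time. Neither assumption is stated in the benchmark, the function names are purely illustrative, and the hourly figures it prints are derived values, not published prices.

```python
# Figures quoted in the benchmark summary above.
MEDIAN_SECONDS = 139.2   # median generation time
COST_HIGH = 0.0148       # USD per generation, High priority
COST_BATCH = 0.009       # USD per generation, Batch priority


def implied_hourly_rate(cost_per_gen: float, seconds_per_gen: float) -> float:
    """Hourly price implied by a per-generation cost.

    Assumption: cost = hourly_rate * seconds / 3600, one job per node.
    """
    return cost_per_gen * 3600.0 / seconds_per_gen


def cost_per_generation(hourly_rate: float, seconds_per_gen: float) -> float:
    """Inverse of implied_hourly_rate under the same assumption."""
    return hourly_rate * seconds_per_gen / 3600.0


def generations_per_dollar(cost_per_gen: float) -> float:
    """How many generations one US dollar buys at a given unit cost."""
    return 1.0 / cost_per_gen


if __name__ == "__main__":
    for label, cost in (("High", COST_HIGH), ("Batch", COST_BATCH)):
        rate = implied_hourly_rate(cost, MEDIAN_SECONDS)
        per_dollar = generations_per_dollar(cost)
        print(f"{label}: ~${rate:.3f}/hour implied, "
              f"~{per_dollar:.0f} generations per dollar")
```

The same `cost_per_generation` helper can be reused to estimate unit cost for a different generation time or hourly price, as long as the linear-cost assumption holds for your workload.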
If you've read our other benchmarks, this will come as no surprise to you. ComfyUI + SaladCloud is an easy, cost-effective way to serve diffusion models at scale, including Image-to-3D models like this one.
Example Outputs
The results aren’t Pixar quality or anything, but they are very impressive overall for something that took about 2 minutes and little-to-no skill. Our input images were AI generated as well, so the full pipeline is Text-to-Image-to-3D.
Potion Bottle

Rabbit Astronaut

Dog

Spaceship
