Benchmarking WAN2.1: Open-Source Text-to-Video Generation on SaladCloud

Published: August 4, 2025

Maksim Gorkii

Just a few years ago, generating a video from a line of text felt almost impossible. Today, anyone can turn a text prompt into a realistic video, not just with expensive closed-source APIs but with powerful open-source models as well.

One of the most capable of these today is WAN2.1, a text-to-video model designed to run on modern GPUs, including consumer cards like the RTX 4090 and 5090.

We wanted to see how WAN2.1 performs in the real world, so we ran it on SaladCloud and measured how fast it is, how much it costs, and how well it scales.

In this post, we’ll show you:

What WAN2.1 is and why it’s one of the best AI video generation models today
How we ran it on SaladCloud, a GPU cloud built for AI workloads
Detailed benchmarks comparing RTX 4090 vs 5090 GPUs
Generation speed, quality, and cost-per-minute breakdowns
Bonus: Running WAN2.1 on the new SaladCloud Secure tier
Why SaladCloud is the best platform to run open-source video AI at scale

What Is WAN2.1?

WAN2.1 supports text-to-video, image-to-video, video editing, text-to-image, and video-to-audio.

WAN2.1 is built on the Diffusion Transformer (DiT) paradigm, taking video generation a step further through a set of architectural and training innovations.

1. Wan-VAE: 3D Spatio-Temporal Variational Autoencoder

At the core of WAN2.1 is Wan-VAE, a new 3D causal variational autoencoder built for video generation.

It achieves spatio-temporal compression that lowers memory requirements while maintaining temporal causality, ensuring smooth motion across frames.
Unlike many open VAEs, Wan-VAE can encode and decode unlimited-length 1080p videos without losing historical temporal information, making it suitable for both short clips and long-form sequences.
Its efficiency and fidelity outperform most open-source alternatives.
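
To make the memory saving concrete, here is a rough sketch of how a causal 3D VAE shrinks a clip before the transformer sees it. The stride values (4 in time, 8 in each spatial direction) and the 81-frame clip length are assumptions for illustration and are not stated in this post; the 16 latent channels match the "Input Dim" column of the model table. The function names are hypothetical.

```python
def latent_shape(frames, height, width, t_stride=4, s_stride=8, channels=16):
    """Latent tensor shape (channels, frames, height, width) for a causal 3D VAE.

    Assumption: the first frame is encoded on its own and each later group of
    `t_stride` frames shares one latent frame, giving 1 + (frames - 1) // t_stride.
    """
    latent_frames = 1 + (frames - 1) // t_stride
    return (channels, latent_frames, height // s_stride, width // s_stride)


def compression_ratio(frames, height, width, **strides):
    """Number of raw RGB values divided by the number of latent values."""
    c, t, h, w = latent_shape(frames, height, width, **strides)
    return (3 * frames * height * width) / (c * t * h * w)


# Hypothetical 81-frame clip at the 480p size used in this benchmark.
print(latent_shape(frames=81, height=480, width=720))
print(round(compression_ratio(81, 480, 720), 1))
```

Under these assumed strides the diffusion model works on a tensor dozens of times smaller than the raw pixels, which is where the lower memory requirement comes from.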

2. Video Diffusion Transformer (DiT) with Flow Matching

WAN2.1 uses a Flow Matching framework built on top of the Diffusion Transformer architecture, enabling stable, high-quality generation:

A multilingual T5 Encoder encodes the text prompt, feeding into transformer blocks where cross-attention integrates the prompt’s semantics directly into the diffusion process.
Each block is modulated by a shared MLP (Linear → SiLU → Linear), which predicts six time-dependent modulation parameters per block.
This shared design reduces parameter overhead while delivering performance gains at the same scale.
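
A back-of-the-envelope sketch of why sharing the modulation MLP saves parameters. It assumes the MLP has the shape Linear(dim → dim) → SiLU → Linear(dim → 6·dim), that each of the six modulation parameters is a vector of width dim, and that each block keeps only a small learned offset of 6·dim values. Those widths are illustrative assumptions, not figures taken from the WAN2.1 code.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def modulation_mlp_params(dim, n_mod=6):
    """Linear(dim -> dim) -> SiLU -> Linear(dim -> n_mod * dim); SiLU has no weights."""
    return linear_params(dim, dim) + linear_params(dim, n_mod * dim)


def shared_cost(dim, layers, n_mod=6):
    """One MLP shared by all blocks, plus an assumed per-block offset of n_mod * dim values."""
    return modulation_mlp_params(dim, n_mod) + layers * n_mod * dim


def per_block_cost(dim, layers, n_mod=6):
    """The alternative: a separate modulation MLP inside every transformer block."""
    return layers * modulation_mlp_params(dim, n_mod)


# Dimension and layer counts come from the model configuration table.
for name, dim, layers in [("1.3B", 1536, 30), ("14B", 5120, 40)]:
    shared = shared_cost(dim, layers)
    separate = per_block_cost(dim, layers)
    print(f"{name}: shared={shared:,}  per-block={separate:,}  saved={separate - shared:,}")
```

Because the per-block version scales with the number of layers while the shared version barely does, the gap grows with model depth.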

WAN2.1 was trained on a massive, deduplicated dataset of image and video content.

Data went through a four-step cleaning pipeline, filtering for visual fidelity, motion smoothness, diversity, and frame integrity.
This pipeline produced one of the largest and most diverse training sets used by any open-source video generation model to date.

Model Configurations

WAN2.1 is available in two primary sizes:

Model	Dimension	Input Dim	Output Dim	Feedforward Dim	Frequency Dim	Heads	Layers
1.3B	1,536	16	16	8,960	256	12	30
14B	5,120	16	16	13,824	256	40	40
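
The table above can be captured as a small configuration object, which is handy when a benchmark script needs to switch between sizes. This is a sketch only: the class and field names are hypothetical and do not mirror the official WAN2.1 code. The one derived value, the per-head width, works out to 128 for both sizes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WanConfig:
    """One row of the model configuration table (field names are illustrative)."""
    name: str
    dim: int
    in_dim: int
    out_dim: int
    ffn_dim: int
    freq_dim: int
    heads: int
    layers: int

    @property
    def head_dim(self) -> int:
        """Width of a single attention head."""
        if self.dim % self.heads:
            raise ValueError("dim must be divisible by heads")
        return self.dim // self.heads


CONFIGS = {
    "1.3B": WanConfig("1.3B", 1536, 16, 16, 8960, 256, 12, 30),
    "14B": WanConfig("14B", 5120, 16, 16, 13824, 256, 40, 40),
}

for cfg in CONFIGS.values():
    print(cfg.name, "head_dim =", cfg.head_dim, "layers =", cfg.layers)
```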

For our benchmarks, we ran the 1.3B Text-to-Video (T2V) variant on consumer GPUs and the 14B variant on SaladCloud Secure.

Performance

In the WAN team’s internal testing, WAN2.1 scored higher than leading open-source and commercial models on human-preference-weighted scoring.

WAN2.1 supports a wide range of AI media generation tasks:

Text-to-Video (T2V)
Image-to-Video
Text-to-Image
Video-to-Audio
Video Editing

For this benchmark, we focused on the Text-to-Video task.

Why Run WAN2.1 on SaladCloud?

Easily deploy WAN2.1 on SaladCloud.

Running WAN2.1 locally can be challenging due to VRAM requirements. Although the model’s documentation says 8.2 GB of VRAM is enough, in practice high-end GPUs like the RTX 4090 or 5090 are needed.

SaladCloud solves this by providing:

On-demand RTX 4090/5090 GPUs
Container-based deployments in minutes
Up to 80% cheaper pricing than big cloud providers

Benchmark Setup: How We Tested WAN2.1 on SaladCloud

Benchmark Inputs:

We ran WAN2.1’s official diffusers implementation on SaladCloud container groups using:

Model: WAN2.1 T2V-1.3B
Prompts: 50 unique text prompts
GPUs: RTX 4090 and RTX 5090 (50 nodes each)
Resolutions:
- 480p (480×720)
- 720p (720×1280)
Guidance Scale: 3 for 480p, 5 for 720p
Task: Text-to-Video generation
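
The inputs above form a small test matrix: two GPU classes times two resolutions. A minimal sketch of how that matrix could be enumerated is shown below. The class and function names are hypothetical, the prompt strings are placeholders, and treating each resolution pair as (height, width) and running all 50 prompts in every group are assumptions.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class GroupSpec:
    """Settings for one container group in the benchmark (illustrative names)."""
    gpu: str
    resolution: str
    height: int
    width: int
    guidance_scale: float


# Resolution label -> (height, width, guidance scale), as listed in the benchmark inputs.
RESOLUTIONS = {
    "480p": (480, 720, 3.0),
    "720p": (720, 1280, 5.0),
}
GPUS = ("RTX 4090", "RTX 5090")


def build_groups():
    """One spec per GPU class and resolution combination."""
    return [
        GroupSpec(gpu, label, h, w, scale)
        for gpu, (label, (h, w, scale)) in product(GPUS, RESOLUTIONS.items())
    ]


groups = build_groups()
prompts = [f"placeholder prompt {i}" for i in range(50)]  # stand-ins for the 50 real prompts

for g in groups:
    print(g)
print("total renders if every group runs every prompt:", len(groups) * len(prompts))
```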

Preparing to run on SaladCloud:

Step	What we did	Why it matters
1. Grabbed the core generation code	Copied the official WAN2.1 diffusers inference script and tweaked it slightly (added easy settings switching, saving results to external storage, and extra logging).	Same logic the authors published, just adapted to our benchmark needs.
2. Packaged it in a container	Wrapped the script into a Docker image.	Every GPU node pulls the identical environment, so runs are reproducible and instantly scalable.
3. Launched the fleets	Started four SaladCloud container groups, one per GPU class and resolution combination.	Apples-to-apples comparisons: same prompts, same code, only GPU class and resolution change.
4. Stored outputs in external storage	Saved each clip and its timing logs to external Azure storage when the render finished.	Keeps containers stateless.
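
The timing and logging part of this setup can be sketched as a thin wrapper around whatever function renders a clip. Everything here is illustrative: `run_and_log` and `fake_generate` are hypothetical names, the stand-in generator writes an empty file instead of calling the model, and the upload to external storage is left out.

```python
import json
import tempfile
import time
from pathlib import Path


def run_and_log(generate, prompt, settings, out_dir):
    """Time one render and write a JSON record next to the clip.

    `generate` is any callable (prompt, settings, out_dir) -> path of the clip.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    start = time.perf_counter()
    clip_path = Path(generate(prompt, settings, out_dir))
    elapsed = time.perf_counter() - start

    record = {
        "prompt": prompt,
        "settings": settings,
        "clip": clip_path.name,
        "seconds": round(elapsed, 3),
    }
    log_path = clip_path.with_suffix(".json")
    log_path.write_text(json.dumps(record, indent=2))
    return record, log_path


def fake_generate(prompt, settings, out_dir):
    """Stand-in for the real inference call: writes an empty file."""
    path = Path(out_dir) / "clip_0001.mp4"
    path.write_bytes(b"")
    return path


with tempfile.TemporaryDirectory() as tmp:
    record, log_path = run_and_log(fake_generate, "a cat surfing", {"height": 480, "width": 720}, tmp)
    print(record)
    print("log written:", log_path.exists())
```

Writing the log beside the clip means both files can be pushed to external storage together, so the container itself keeps no state between jobs.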

WAN2.1 Performance Results: Time, Cost & Resolution Comparison

Here’s how WAN2.1 performed on consumer GPUs:

GPU Type	Resolution	Avg Generation Time (5s video)	Generation Time per 1-min of Video	Cost per Minute of Video (High Priority Pricing)	Cost per Minute of Video (Batch Pricing)
4090	480p	5.3 min	63.6 min	$0.37	$0.24
4090	720p	40 min	8 hours	$2.80	$1.80
5090	480p	2.4 min	28.8 min	$0.24	$0.15
5090	720p	33.9 min	6.8 hours	$3.30	$2.20
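
The "per 1-min of video" column follows directly from the average time for a 5-second clip: one minute of video is twelve such clips. The short sketch below reproduces that scaling and the matching clips-per-hour throughput for a single GPU; the function names are illustrative.

```python
CLIP_SECONDS = 5

# Average minutes to render one 5-second clip, taken from the table above.
AVG_MINUTES_PER_CLIP = {
    ("4090", "480p"): 5.3,
    ("4090", "720p"): 40.0,
    ("5090", "480p"): 2.4,
    ("5090", "720p"): 33.9,
}


def minutes_per_video_minute(avg_minutes):
    """Render time needed for 60 seconds of finished video on one GPU."""
    return avg_minutes * 60 / CLIP_SECONDS


def clips_per_hour(avg_minutes):
    """How many 5-second clips one GPU finishes in an hour."""
    return 60 / avg_minutes


for (gpu, res), avg in AVG_MINUTES_PER_CLIP.items():
    print(
        f"{gpu} {res}: {minutes_per_video_minute(avg):.1f} min per video minute, "
        f"{clips_per_hour(avg):.1f} clips/hour"
    )
```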

Key Observations

480p is a lot faster on both GPU classes. On both GPUs, a 5-second clip at 480p renders in minutes, while the same clip at 720p takes well over half an hour. Because WAN2.1 was trained primarily at 480×720, the lower-resolution output is not only faster but sometimes even looks better.
RTX 5090s double throughput at 480p. A single 5090 produces about twenty-five 5-second clips per hour, or roughly 2 minutes of video, about twice what a 4090 can do. Even though the 5090’s hourly price is higher than the 4090’s, its lower cost per minute of video makes it the better value for 480p batch workloads.
Cost goes up with resolution. Generating a clip at 720p costs roughly 7.5 times more per minute of video than at 480p on the 4090, and roughly 14 times more on the 5090, whose speed advantage largely disappears at 720p. Unless native 720p is absolutely required, it is far more economical to create 480p footage and upscale later.
Batch tier lowers the bill. Running jobs with SaladCloud’s batch tier pricing cuts per-minute cost by roughly a third, but keep in mind that nodes on batch pricing can be interrupted more frequently.
SaladCloud lets you scale across 100+ GPUs affordably, making open-source video generation viable at scale.
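
The cost comparisons in these observations can be checked against the results table with a few lines of arithmetic. The sketch below uses only the per-minute costs from the table; the dictionary layout and function names are illustrative.

```python
# Cost per minute of finished video in USD, copied from the results table.
COST_PER_MINUTE = {
    ("4090", "480p"): {"high": 0.37, "batch": 0.24},
    ("4090", "720p"): {"high": 2.80, "batch": 1.80},
    ("5090", "480p"): {"high": 0.24, "batch": 0.15},
    ("5090", "720p"): {"high": 3.30, "batch": 2.20},
}


def resolution_ratio(gpu, tier):
    """How many times more a minute of 720p costs than a minute of 480p."""
    return COST_PER_MINUTE[(gpu, "720p")][tier] / COST_PER_MINUTE[(gpu, "480p")][tier]


def batch_saving(gpu, resolution):
    """Fraction of the high-priority cost saved by using batch pricing."""
    costs = COST_PER_MINUTE[(gpu, resolution)]
    return 1 - costs["batch"] / costs["high"]


for gpu in ("4090", "5090"):
    print(f"{gpu}: 720p costs {resolution_ratio(gpu, 'high'):.1f}x the 480p price (high priority)")
for (gpu, res) in COST_PER_MINUTE:
    print(f"{gpu} {res}: batch saves {batch_saving(gpu, res):.0%}")
```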

Running the 14B WAN2.1 on SaladCloud Secure (NVIDIA L40S)

Salad’s new Secure tier gives you datacenter-grade GPUs hosted in SOC 2 Type 2 and ISO 27001-attested facilities, with higher uptime, while keeping the interruptible-compute discounts you’re used to. Currently, each Secure node comes as a full 8-GPU server of NVIDIA L40S cards (48 GB VRAM each), well suited to heavyweight models like the 14B WAN2.1 that won’t fit on consumer cards.

We used a compressed DFloat11 version of the model; see DFloat11/examples/wan2.1 in the LeanModels/DFloat11 repository.

GPU Type	Resolution	Avg Generation Time (5s video)	Generation Time per 1-min of Video (using 1 GPU)	Cost per Minute of Video (High Priority Pricing)	Cost per Minute of Video (Batch Pricing)
L40S (Secure)	480p	16.0 min	192.5 min	$2.60	$1.03
L40S (Secure)	720p	66.3 min	796.1 min	$10.75	$4

Throughput: Because each Secure machine has 8 GPUs, you can run eight clips in parallel. At 480p that translates to ~30 clips per hour per node (≈2.5 minutes of finished video per hour).
Pricing: Batch tier is $2.56/h for the whole 8-GPU box (≈$0.32 per GPU-hour). High priority is $6.48/h (≈$0.81 per GPU-hour). The costs in the table multiply those per-GPU rates by render time to show the true end-to-end cost.
Takeaway: Even the big 14B model stays affordable, at barely a dollar per finished minute at 480p on the batch tier, and you also get datacenter-level features.
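
The pricing arithmetic for the Secure tier is simple enough to write down: divide the node price by eight to get a per-GPU rate, then multiply by the render time on one GPU. The sketch below uses only the rates and render times quoted in this section; the names are illustrative.

```python
GPUS_PER_NODE = 8

# Hourly price in USD for a whole 8-GPU Secure node, as quoted above.
NODE_RATE_PER_HOUR = {"high": 6.48, "batch": 2.56}

# Minutes of single-GPU render time per minute of finished video, from the table.
RENDER_MINUTES_PER_VIDEO_MINUTE = {"480p": 192.5, "720p": 796.1}


def gpu_hour_rate(tier):
    """Price of one GPU for one hour."""
    return NODE_RATE_PER_HOUR[tier] / GPUS_PER_NODE


def cost_per_video_minute(resolution, tier):
    """Per-GPU rate multiplied by the render hours for one minute of video."""
    hours = RENDER_MINUTES_PER_VIDEO_MINUTE[resolution] / 60
    return gpu_hour_rate(tier) * hours


def node_clips_per_hour(avg_minutes_per_clip):
    """Clips finished per hour when all eight GPUs render in parallel."""
    return GPUS_PER_NODE * 60 / avg_minutes_per_clip


for tier in ("high", "batch"):
    print(f"{tier}: ${gpu_hour_rate(tier):.2f} per GPU-hour, "
          f"480p costs ${cost_per_video_minute('480p', tier):.2f} per video minute")
print("480p clips per node-hour:", node_clips_per_hour(16.0))
```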

Examples of videos generated on SaladCloud

14B model with 720×1280 resolution

14B model with 480×720 resolution

1.3B model with 480×720 resolution. Generated on the RTX 4090

1.3B model with 720×1280 resolution. Generated on the RTX 5090

Try WAN2.1 on SaladCloud Today

Spin up WAN2.1 in minutes with SaladCloud.

If you’re building with AI-generated video or experimenting with text-to-video generation models, WAN2.1 is one of the best free solutions available. And SaladCloud is the easiest way to run it at scale — fast, affordable, and developer-friendly.

We will soon release a one-click deploy recipe for both models.

Have questions about enterprise pricing for SaladCloud?