SaladCloud Blog

Benchmarking WAN2.1: Open-Source Text-to-Video Generation on SaladCloud

Maksim Gorkii

Just a few years ago, generating a video from a line of text felt almost impossible. Today, anyone can turn a text prompt into a realistic video, not just with expensive closed-source APIs but with powerful open-source models as well.

One of the most capable of these today is WAN2.1, a text-to-video model designed to run on modern GPUs, including consumer cards like the RTX 4090 and 5090.

We wanted to see how WAN2.1 performs in the real world, so we ran it on SaladCloud and measured how fast it runs, how much it costs, and how well it scales.

In this post, we’ll show you:

  • What WAN2.1 is and why it’s one of the best AI video generation models today
  • How we ran it on SaladCloud, a GPU cloud built for AI workloads
  • Detailed benchmarks comparing RTX 4090 vs 5090 GPUs
  • Generation speed, quality, and cost-per-minute breakdowns
  • Bonus: running the 14B WAN2.1 model on the new SaladCloud Secure tier
  • Why SaladCloud is the best platform to run open-source video AI at scale

What Is WAN2.1?

WAN2.1 supports text-to-video, image-to-video, video editing, text-to-image, and video-to-audio generation.

WAN2.1 is built on the Diffusion Transformer (DiT) paradigm, taking video generation a step further through a set of architectural and training innovations.

1. Wan-VAE: 3D Spatio-Temporal Variational Autoencoder

At the core of WAN2.1 is Wan-VAE, a new 3D causal variational autoencoder built for video generation.

  • It achieves spatio-temporal compression that lowers memory requirements while maintaining temporal causality, ensuring smooth motion across frames.
  • Unlike many open VAEs, Wan-VAE can encode and decode unlimited-length 1080p videos without losing historical temporal information, making it suitable for both short clips and long-form sequences.
  • Its efficiency and fidelity outperform most open-source alternatives.

2. Video Diffusion Transformer (DiT) with Flow Matching

WAN2.1 uses a Flow Matching framework built on top of the Diffusion Transformer architecture, enabling stable, high-quality generation:

  • A multilingual T5 Encoder encodes the text prompt, feeding into transformer blocks where cross-attention integrates the prompt’s semantics directly into the diffusion process.
  • Each block is modulated by a shared MLP (Linear → SiLU → Linear), which predicts six time-dependent modulation parameters per block.
  • This shared design reduces parameter overhead while delivering performance gains at the same scale.
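To give a feel for why sharing the modulation MLP saves parameters, here is a rough back-of-the-envelope sketch. The layer widths are an assumption made for illustration (Linear(dim → dim), SiLU, Linear(dim → 6·dim)), not the published layout; only the model dimension and layer count come from the 1.3B configuration.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def modulation_mlp_params(dim: int, n_mod: int = 6) -> int:
    # ASSUMPTION: Linear(dim -> dim), SiLU (no parameters), Linear(dim -> n_mod * dim).
    return linear_params(dim, dim) + linear_params(dim, n_mod * dim)


dim, num_layers = 1536, 30  # 1.3B configuration
shared = modulation_mlp_params(dim)
per_block = shared * num_layers  # hypothetical alternative: a separate MLP in every block

print(f"one shared MLP:    {shared:,} parameters")
print(f"one MLP per block: {per_block:,} parameters")
```

Under these assumed widths, duplicating the MLP in every block would multiply its parameter count by the number of layers, which is the overhead the shared design avoids.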

WAN2.1 was trained on a massive, deduplicated dataset of image and video content.

  • Data went through a four-step cleaning pipeline, filtering for visual fidelity, motion smoothness, diversity, and frame integrity.
  • This pipeline produced one of the largest and most diverse training sets used by any open-source video generation model to date.

Model Configurations

WAN2.1 is available in two primary sizes:

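| Model | Dimension | Input Dim | Output Dim | Feedforward Dim | Frequency Dim | Heads | Layers |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1.3B | 1,536 | 16 | 16 | 8,960 | 256 | 12 | 30 |
| 14B | 5,120 | 16 | 16 | 13,824 | 256 | 40 | 40 |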
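The two configurations can be compared with a little arithmetic. The sketch below derives the per-head width by assuming the model dimension is split evenly across attention heads; the class and field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WanConfig:
    name: str
    dim: int         # model (hidden) dimension
    ffn_dim: int     # feedforward dimension
    freq_dim: int    # frequency dimension
    num_heads: int
    num_layers: int

    @property
    def head_dim(self) -> int:
        # Width of each attention head, assuming dim is split evenly across heads.
        return self.dim // self.num_heads


CONFIGS = [
    WanConfig("1.3B", dim=1536, ffn_dim=8960, freq_dim=256, num_heads=12, num_layers=30),
    WanConfig("14B", dim=5120, ffn_dim=13824, freq_dim=256, num_heads=40, num_layers=40),
]

for cfg in CONFIGS:
    print(f"{cfg.name}: head_dim={cfg.head_dim}, ffn/dim ratio={cfg.ffn_dim / cfg.dim:.2f}")
```

Both sizes work out to the same head width, so the larger model grows by adding heads and layers rather than by widening each head.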

For our benchmarks, we ran the 1.3B Text-to-Video (T2V) variant on consumer GPUs and the 14B model on SaladCloud Secure.

Performance

In the WAN team's internal testing, WAN2.1 scored higher than leading open-source and commercial models under human-preference-weighted scoring.

WAN2.1 supports a wide range of AI media generation tasks:

  • Text-to-Video (T2V)
  • Image-to-Video
  • Text-to-Image
  • Video-to-Audio
  • Video Editing

For this benchmark, we focused on the Text-to-Video task.

Why Run WAN2.1 on SaladCloud?

Running WAN2.1 locally can be challenging due to VRAM requirements. While the model's documentation states that 8.2 GB of VRAM is enough, in practice high-end GPUs such as the RTX 4090 or 5090 are required.

SaladCloud solves this by providing:

  • On-demand RTX 4090/5090 GPUs
  • Container-based deployments in minutes
  • Up to 80% cheaper pricing than big cloud providers

Benchmark Setup: How We Tested WAN2.1 on SaladCloud

Benchmark Inputs:

We ran WAN2.1’s official diffusers implementation on SaladCloud container groups using:

  • Model: WAN2.1 T2V-1.3B
  • Prompts: 50 unique text prompts
  • GPUs: RTX 4090 and RTX 5090 (50 nodes each)
  • Resolutions:
    • 480p (480×720)
    • 720p (720×1280)
  • Guidance Scale: 3 for 480p, 5 for 720p
  • Task: Text-to-Video generation
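The inputs above form a small benchmark matrix. A minimal sketch of how it could be enumerated is shown below; the `BenchmarkGroup` class and `build_groups` helper are hypothetical names for illustration, not part of the actual benchmark code.

```python
from dataclasses import dataclass
from itertools import product

GPUS = ("RTX 4090", "RTX 5090")
# Resolution label -> guidance scale, as listed in the benchmark inputs.
GUIDANCE_SCALE = {"480p": 3.0, "720p": 5.0}


@dataclass(frozen=True)
class BenchmarkGroup:
    gpu: str
    resolution: str
    guidance_scale: float
    model: str = "WAN2.1 T2V-1.3B"
    num_prompts: int = 50


def build_groups() -> list:
    """One group per (GPU class, resolution) pair."""
    return [
        BenchmarkGroup(gpu=gpu, resolution=res, guidance_scale=GUIDANCE_SCALE[res])
        for gpu, res in product(GPUS, GUIDANCE_SCALE)
    ]


groups = build_groups()
for group in groups:
    print(group)
```

Two GPU classes times two resolutions gives four groups, each running the same prompts with the guidance scale tied to the resolution.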

Preparing to run on SaladCloud:

| Step | What we did | Why it matters |
| --- | --- | --- |
| 1. Grabbed the core generation code | Copied the official WAN2.1 diffusers inference script and tweaked it slightly (added easy settings switching, saving of results to external storage, and logging of additional data). | Same logic the authors published, just adapted to our benchmark needs. |
| 2. Packaged it in a container | Wrapped the script into a Docker image. | Every GPU node pulls the identical environment, so runs are reproducible and instantly scalable. |
| 4. Launched the fleets | Started four SaladCloud container groups, one for each combination of GPU class and resolution. | Apples-to-apples comparisons: same prompts, same code, only GPU class and resolution change. |
| 5. Stored outputs in external storage | Saved clips and their timing logs to external Azure storage when each render finishes. | Keeps containers stateless. |
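The timing logs mentioned above could be produced by a thin wrapper around the generation call. The sketch below is illustrative only: `timed_render` and `fake_generate` are hypothetical names, and the placeholder generator stands in for the real inference call so the example runs without a GPU.

```python
import json
import time
from typing import Callable


def timed_render(generate: Callable[[str], bytes], prompt: str, gpu: str, resolution: str) -> dict:
    """Run one generation and return a JSON-serializable timing record."""
    start = time.perf_counter()
    clip = generate(prompt)  # hypothetical stand-in for the real inference call
    elapsed = time.perf_counter() - start
    return {
        "prompt": prompt,
        "gpu": gpu,
        "resolution": resolution,
        "seconds": round(elapsed, 3),
        "bytes": len(clip),
    }


def fake_generate(prompt: str) -> bytes:
    # Placeholder so the sketch runs anywhere; a real worker would call the model here.
    return prompt.encode("utf-8")


record = timed_render(fake_generate, "a cat surfing a wave", gpu="RTX 4090", resolution="480p")
print(json.dumps(record))
```

Because each record is plain JSON, a worker can upload it next to the clip and then exit, which is what keeps the container stateless.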

WAN2.1 Performance Results: Time, Cost & Resolution Comparison

Here’s how WAN2.1 performed on consumer GPUs:

| GPU Type | Resolution | Avg Generation Time (5s video) | Generation Time per 1 min of Video | Cost per Minute of Video (High Priority Pricing) | Cost per Minute of Video (Batch Pricing) |
| --- | --- | --- | --- | --- | --- |
| 4090 | 480p | 5.3 min | 63.6 min | $0.37 | $0.24 |
| 4090 | 720p | 40 min | 8 hours | $2.80 | $1.80 |
| 5090 | 480p | 2.4 min | 28.8 min | $0.24 | $0.15 |
| 5090 | 720p | 33.9 min | 6.8 hours | $3.30 | $2.20 |
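The per-minute columns follow from the average clip time: twelve 5-second clips make one minute of footage. The sketch below redoes that arithmetic and back-calculates an implied hourly rate from the table. That rate is only an inference from the published figures, not an official SaladCloud price, and the function names are illustrative.

```python
CLIP_SECONDS = 5
CLIPS_PER_VIDEO_MINUTE = 60 // CLIP_SECONDS  # 12 clips make one minute of footage

# (gpu, resolution, avg minutes per 5 s clip, high-priority $ per video minute)
RESULTS = [
    ("4090", "480p", 5.3, 0.37),
    ("4090", "720p", 40.0, 2.80),
    ("5090", "480p", 2.4, 0.24),
    ("5090", "720p", 33.9, 3.30),
]


def minutes_per_video_minute(avg_clip_minutes: float) -> float:
    """GPU minutes needed to render one minute of finished video."""
    return avg_clip_minutes * CLIPS_PER_VIDEO_MINUTE


def implied_hourly_rate(cost_per_video_minute: float, avg_clip_minutes: float) -> float:
    """Back-calculated $/GPU-hour (an inference from the table, not a quoted price)."""
    gpu_hours = minutes_per_video_minute(avg_clip_minutes) / 60
    return cost_per_video_minute / gpu_hours


for gpu, res, clip_min, cost in RESULTS:
    total = minutes_per_video_minute(clip_min)
    rate = implied_hourly_rate(cost, clip_min)
    print(f"{gpu} {res}: {total:.1f} GPU-min per video minute, about ${rate:.2f}/GPU-hour implied")
```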

Key Observations

  1. 480p is a lot faster on both GPU classes. A 5-second clip at 480p renders in minutes, while the same clip at 720p takes well over half an hour. Because WAN2.1 was trained primarily at 480×720, the lower-resolution output is not only faster but sometimes even looks better.
  2. The RTX 5090 doubles throughput at 480p. A single 5090 produces about twenty-five 5-second clips per hour (roughly 2 minutes of video), about twice what a 4090 can do. Even though the 5090's hourly price is higher, its cost per minute of 480p video is lower, which makes it the better value for batch workloads at that resolution. At 720p the time saving shrinks to about 15%, and the 4090 is cheaper per minute.
  3. Cost goes up steeply with resolution. Generating a clip at 720p costs seven to eight times more per minute than at 480p on the 4090, and roughly fourteen times more on the 5090. Unless native 720p is absolutely required, it is far more economical to create 480p footage and upscale later.
  4. Batch tier lowers the bill. Running jobs on SaladCloud's batch tier cuts per-minute cost by roughly a third, but keep in mind that nodes on batch pricing can be interrupted more frequently.
  5. SaladCloud lets you scale across 100+ GPUs affordably, making open-source video generation viable at scale.
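The ratios behind these observations can be recomputed directly from the results table. This is a small sketch with illustrative helper names; the numbers are copied from the table.

```python
# Avg minutes per 5 s clip, high-priority and batch cost per video minute (from the results table).
TABLE = {
    ("4090", "480p"): {"clip_min": 5.3, "high": 0.37, "batch": 0.24},
    ("4090", "720p"): {"clip_min": 40.0, "high": 2.80, "batch": 1.80},
    ("5090", "480p"): {"clip_min": 2.4, "high": 0.24, "batch": 0.15},
    ("5090", "720p"): {"clip_min": 33.9, "high": 3.30, "batch": 2.20},
}


def clips_per_hour(gpu: str, res: str) -> float:
    return 60 / TABLE[(gpu, res)]["clip_min"]


def resolution_cost_ratio(gpu: str, tier: str = "high") -> float:
    """How many times more a minute of 720p costs than a minute of 480p."""
    return TABLE[(gpu, "720p")][tier] / TABLE[(gpu, "480p")][tier]


def batch_saving(gpu: str, res: str) -> float:
    """Fractional saving of batch pricing relative to high-priority pricing."""
    row = TABLE[(gpu, res)]
    return 1 - row["batch"] / row["high"]


speedup = clips_per_hour("5090", "480p") / clips_per_hour("4090", "480p")
print(f"5090 vs 4090 throughput at 480p: {speedup:.2f}x")
for gpu in ("4090", "5090"):
    print(f"{gpu}: 720p costs {resolution_cost_ratio(gpu):.1f}x the 480p price per minute")
for key in TABLE:
    print(f"{key}: batch saves {batch_saving(*key):.0%}")
```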

Running the 14B WAN2.1 on SaladCloud Secure (NVIDIA L40S)

Salad’s new Secure tier gives you datacenter-grade GPUs hosted in SOC 2 Type 2 and ISO 27001-attested facilities, with higher uptime, while keeping the interruptible-compute discounts you’re used to. Currently, each Secure node is a full 8-GPU server of NVIDIA L40S cards (48 GB VRAM each), well suited to heavyweight models like the 14B WAN2.1 that won’t fit on consumer cards.

We used a compressed DFloat11 version of the model; see examples/wan2.1 on the examples branch of the LeanModels/DFloat11 repository.

| GPU Type | Resolution | Avg Generation Time (5s video) | Generation Time per 1 min of Video (using 1 GPU) | Cost per Minute of Video (High Priority Pricing) | Cost per Minute of Video (Batch Pricing) |
| --- | --- | --- | --- | --- | --- |
| L40S (Secure) | 480p | 16.0 min | 192.5 min | $2.60 | $1.03 |
| L40S (Secure) | 720p | 66.3 min | 796.1 min | $10.75 | $4 |

  • Throughput: Because each Secure machine has 8 GPUs, you can run eight clips in parallel. At 480p that translates to ~30 clips per hour (≈2.5 minutes of finished video per node per hour).
  • Pricing: Batch tier is $2.56/h for the whole 8-GPU box (≈$0.32 per GPU-hour). High priority is $6.48/h (≈$0.81 per GPU-hour). The numbers above multiply those per-GPU rates by render time to show the true end-to-end cost.
  • Takeaway: Even the big 14B model stays affordable, at about a dollar per finished minute at 480p on the batch tier, and you also get the datacenter-level features.
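The Secure-tier arithmetic can be reproduced in a few lines. The sketch below uses only the node prices and render times quoted above; the helper names are illustrative.

```python
GPUS_PER_NODE = 8
NODE_PRICE_PER_HOUR = {"batch": 2.56, "high": 6.48}       # $ per 8-GPU node
AVG_CLIP_MINUTES = {"480p": 16.0, "720p": 66.3}           # per 5 s clip, one GPU
MINUTES_PER_VIDEO_MINUTE = {"480p": 192.5, "720p": 796.1}  # one GPU, from the table


def gpu_hour_price(tier: str) -> float:
    return NODE_PRICE_PER_HOUR[tier] / GPUS_PER_NODE


def cost_per_video_minute(res: str, tier: str) -> float:
    """Per-GPU hourly rate multiplied by the GPU hours needed for one video minute."""
    return gpu_hour_price(tier) * MINUTES_PER_VIDEO_MINUTE[res] / 60


def node_clips_per_hour(res: str) -> float:
    """Clips per hour when all GPUs on the node render in parallel."""
    return GPUS_PER_NODE * 60 / AVG_CLIP_MINUTES[res]


for tier in NODE_PRICE_PER_HOUR:
    print(f"{tier}: ${gpu_hour_price(tier):.2f} per GPU-hour, "
          f"${cost_per_video_minute('480p', tier):.2f} per 480p video minute")
print(f"480p clips per node-hour: {node_clips_per_hour('480p'):.0f}")
```

Small differences from the table can come from rounding in the published figures.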

Examples of videos generated on SaladCloud

14B Model with 720×1280 resolution

14B model with 480×720 resolution

1.3B model with 480×720 resolution, generated on the RTX 4090

1.3B model with 720×1280 resolution, generated on the RTX 5090

Try WAN2.1 on SaladCloud Today

Spin up WAN2.1 in minutes with SaladCloud.

If you’re building with AI-generated video or experimenting with text-to-video generation models, WAN2.1 is one of the best free solutions available. And SaladCloud is the easiest way to run it at scale — fast, affordable, and developer-friendly.

We will soon release a one-click deploy recipe for both models.

Have questions about enterprise pricing for SaladCloud?

Book a 15 min call with our team.
