The Latest and Best Open Source Image-to-Video AI Models, Tested and Analysed


Wan VACE

Wan VACE (Wan 2.1-VACE) is a state-of-the-art open source AI video generator developed by Alibaba’s Tongyi Lab, released in early 2025. It is part of the Wan 2.1 model suite and supports text-to-video, image-to-video, video editing, and reference-based generation using a modular Video Condition Unit (VCU). Wan VACE is available in both 1.3B and 14B parameter sizes, under an Apache 2.0 license, and is hosted on GitHub, Hugging Face, and ModelScope. It enables multi-modal control including masks, flow, pose, and outpainting, and is optimized to run on consumer GPUs (8GB+), making it highly accessible for researchers and creators in the open source video AI community.
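If you want to reproduce this test locally, the Wan 2.1 weights ship in a Diffusers-compatible format, and a minimal image-to-video run can look like the sketch below. This is an illustration, not the exact setup used here: the checkpoint id, the input file mustang_highway.png, and the resolution/frame settings are assumptions, and VACE-specific controls (masks, flow, pose) go through a separate pipeline, so check the Wan-AI pages on Hugging Face for the exact details.

```python
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Checkpoint id is an assumption -- pick the Diffusers-format Wan 2.1 / VACE
# checkpoint you actually want from the Wan-AI org on Hugging Face.
model_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"

pipe = WanImageToVideoPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # offload idle components so the 14B model fits on smaller GPUs

image = load_image("mustang_highway.png")  # hypothetical input file
prompt = (
    "A red Ford Mustang cruising smoothly down a long, empty highway during sunset. "
    "The camera follows the car from behind in a steady, cinematic motion."
)

video = pipe(
    image=image,
    prompt=prompt,
    height=480,       # 480p settings assumed for the 480P checkpoint
    width=832,
    num_frames=81,    # roughly 5 seconds at 16 FPS
    guidance_scale=5.0,
).frames[0]

export_to_video(video, "wan_mustang.mp4", fps=16)
```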

AI image of a red Mustang cruising on a long, cinematic highway. This image is used to test the image-to-video generation performance of the Wan VACE open source AI model, focusing on motion continuity and perspective tracking.


This scenario tests the model's ability to animate the scene with natural physics (here, smooth vehicle and camera motion) while preserving the essence of the source image in the video.

PROMPT USED

"A red Ford Mustang cruising smoothly down a long, empty highway during sunset. The camera follows the car from behind in a steady, cinematic motion, keeping the car centered while the distant mountains and road stretch ahead. The camera movement is slow, smooth, and dramatic, capturing the beauty of the scene and the motion of the car."

Observations

Below are the pros and cons we found for the Wan VACE video generator after testing it with the prompt above.

Pros

Available in two scales (1.3B for low-end GPUs and 14B for high-res output)

Optimized for both Chinese and English text prompts

Supports a wide range of tasks: text-to-video, image-to-video, video editing, pose and flow control

Cons

Quality can vary significantly between generations without tight prompt tuning

Motion in outputs is often minimal, especially in the 1.3B version

Check out the AI tool's pricing and more on its official website.

Hunyuan AI

Hunyuan Video is an advanced open source AI video generator developed by Tencent and released in December 2024. It is built with a 13-billion-parameter Transformer-based architecture featuring a dual-stream design, 3D VAE for spatio-temporal learning, and a multimodal LLM text encoder for rich prompt comprehension. Hunyuan Video supports both text-to-video and image-to-video tasks with resolutions up to 1280×720p, and integrates with ComfyUI, Gradio, and Hugging Face. It also supports prompt rewriting, pose conditioning, video outpainting, and FP8 multi-GPU inference, making it one of the most capable tools in the open source video AI space.
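Hunyuan Video also has a Diffusers integration, so a minimal local run can look like the sketch below. Note that this is the text-to-video pipeline at a memory-friendly resolution; the community checkpoint id, the resolution, and the frame count are assumptions, and the image-to-video variant uses a separate checkpoint and pipeline class, so verify everything against the current Diffusers docs.

```python
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

# Community Diffusers port of the weights -- id is an assumption, verify on Hugging Face.
model_id = "hunyuanvideo-community/HunyuanVideo"

# Load the 13B transformer in bf16 and the rest of the pipeline in fp16.
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)
pipe.vae.enable_tiling()         # tile VAE decoding to keep VRAM in check
pipe.enable_model_cpu_offload()  # offload idle components to system RAM

video = pipe(
    prompt="A red Ford Mustang cruising down a long, empty highway at sunset, "
           "steady cinematic follow shot from behind",
    height=320,             # low resolution keeps this runnable on ~16 GB GPUs;
    width=512,              # 720p needs substantially more memory and time
    num_frames=61,
    num_inference_steps=30,
).frames[0]

export_to_video(video, "hunyuan_mustang.mp4", fps=15)
```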

AI image of a red Mustang cruising down a long highway. This image is used to test the image-to-video generation capability of the Hunyuan AI model, focusing on dynamic camera motion and consistent object rendering.


This scenario tests the model's ability to animate the scene with natural physics (here, smooth vehicle and camera motion) while preserving the essence of the source image in the video.

PROMPT USED

"A red Ford Mustang cruising smoothly down a long, empty highway during sunset. The camera follows the car from behind in a steady, cinematic motion, keeping the car centered while the distant mountains and road stretch ahead. The camera movement is slow, smooth, and dramatic, capturing the beauty of the scene and the motion of the car."

Observations

Below are the pros and cons we found for the Hunyuan AI video generator after testing it with the prompt above.

Pros

High model capacity (13B) delivers strong motion quality and visual realism

Supports high-resolution video (up to 720p) and long clip durations (up to 16 seconds)

Good temporal consistency with strong prompt-to-video alignment

Cons

Requires high-end GPUs (12–16 GB VRAM+) for smooth local inference

Generation time is relatively long (~8–10 mins per 3-second video in full config)

Check out the AI tool's pricing and more on its official website.

Cosmos 2B

Cosmos‑Predict2‑2B is an open source world-model video generative AI developed by NVIDIA and released in June 2025. It’s part of the Cosmos‑Predict2 family, a set of transformer-based diffusion models that generate high-quality videos up to 720p at 16 FPS from image+text or video+text inputs. The model excels in video-to-world future prediction tasks and is available under the NVIDIA Open Model License, with full commercial usage permitted. Supported via Diffusers and ComfyUI, Cosmos‑Predict2‑2B offers fast inference, multi-modal conditioning, and hardware-accelerated optimization for physical AI and creative applications.
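NVIDIA exposes Cosmos‑Predict2 through Diffusers as well. The sketch below is only an illustration of what an image-conditioned "Video2World" call could look like: the pipeline class name, the checkpoint id, the input file, and the frame count are all assumptions, so confirm them against the current Diffusers and Cosmos documentation before running anything.

```python
import torch
from diffusers import Cosmos2VideoToWorldPipeline  # class name assumed -- check the Diffusers docs
from diffusers.utils import export_to_video, load_image

# Checkpoint id assumed; look up the exact Cosmos-Predict2 2B Video2World repo on Hugging Face.
model_id = "nvidia/Cosmos-Predict2-2B-Video2World"

pipe = Cosmos2VideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

image = load_image("mustang_highway.png")  # hypothetical conditioning frame

video = pipe(
    image=image,
    prompt="A red Ford Mustang cruising down a long, empty highway at sunset, "
           "steady cinematic follow shot from behind",
    num_frames=77,  # assumed: just under 5 seconds at the model's 16 FPS
).frames[0]

export_to_video(video, "cosmos_mustang.mp4", fps=16)
```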

AI image of a red Mustang on an endless road. This image tests the Cosmos 2B open source AI video model’s ability to maintain directionality, road perspective, and car movement in video outputs.


This scenario tests the model's ability to animate the scene with natural physics (here, smooth vehicle and camera motion) while preserving the essence of the source image in the video.

PROMPT USED

"A red Ford Mustang cruising smoothly down a long, empty highway during sunset. The camera follows the car from behind in a steady, cinematic motion, keeping the car centered while the distant mountains and road stretch ahead. The camera movement is slow, smooth, and dramatic, capturing the beauty of the scene and the motion of the car."

Observations

Below are the pros and cons we found for the Cosmos 2B video generator after testing it with the prompt above.

Pros

Efficient and fast, leveraging transformer diffusion and sparse attention for accelerated generation

Integrates smoothly with Diffusers pipelines, ComfyUI, and GPU-optimized environments

Cons

Limited to 5-second video clips; longer sequences may not be supported out of the box

Requires a modern NVIDIA GPU and ~32 GB VRAM for full performance, especially on the 14B variant

Check out the AI tool's pricing and more on its official website.

LTXV 13B

LTXV‑13B is a 13-billion-parameter open source AI video generation model developed by Lightricks, released in May 2025. It specializes in extremely fast AI video generation, capable of rendering short clips at up to 24 FPS. This performance is achieved by regenerating only the motion-related pixels while keeping the background static, enabling rapid output speed. It supports both text-to-video and image-to-video generation and integrates well with ComfyUI, Diffusers, and LoRA. The model is licensed under Apache 2.0.
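Because LTXV plugs straight into Diffusers, an image-to-video run can be as short as the sketch below. The repo id shown is the base LTX-Video checkpoint, and the input file and generation settings are assumptions; pick the actual 13B checkpoint from the Lightricks organisation on Hugging Face before running.

```python
import torch
from diffusers import LTXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

# Base LTX-Video repo shown here; swap in the 13B checkpoint id from the
# Lightricks org on Hugging Face (assumption).
pipe = LTXImageToVideoPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = load_image("mustang_highway.png")  # hypothetical input file
prompt = (
    "A red Ford Mustang cruising smoothly down a long, empty highway during sunset, "
    "camera following from behind in a steady, cinematic motion"
)
negative_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"

video = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=704,
    height=480,
    num_frames=161,          # ~6.7 seconds at 24 FPS
    num_inference_steps=50,
).frames[0]

export_to_video(video, "ltx_mustang.mp4", fps=24)
```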

A red Mustang driving on a cinematic highway. This image is used to test LTXV 13B’s image-to-video rendering accuracy, particularly in capturing camera movement and vehicle tracking realism.


This scenario tests the model's ability to animate the scene with natural physics (here, smooth vehicle and camera motion) while preserving the essence of the source image in the video.

PROMPT USED

"A red Ford Mustang cruising smoothly down a long, empty highway during sunset. The camera follows the car from behind in a steady, cinematic motion, keeping the car centered while the distant mountains and road stretch ahead. The camera movement is slow, smooth, and dramatic, capturing the beauty of the scene and the motion of the car."

Observations

Below are the pros and cons we found for the LTXV 13B video generator after testing it with the prompt above.

Pros

Fastest open source AI video generation model, delivering near real-time performance for both image-to-video and text-to-video tasks.

Lightweight for its scale (13B parameters), allowing smooth execution on high-end consumer GPUs without enterprise hardware requirements.

Cons

Background motion is minimal, since the model reuses static pixels to speed up generation, which reduces realism in dynamic scenes.

Less suitable for high-action or full-scene cinematic AI video outputs.

Check out the AI tool's pricing and more on its official website.

Stable Video Diffusion XT

Stable Video Diffusion (SVD) is an open source AI video generator designed for adding natural motion effects to still images, such as flowing water, moving clouds, and gentle camera shifts. It’s configured to generate up to 25 frames per video with adjustable playback speeds (6–10 FPS), resulting in short clips (2.5–4 seconds). SVD is primarily aimed at creating ambient, loopable motion rather than narrative-driven or prompt-aligned video generation, and does not support text prompts for conditional content.
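SVD XT runs through the standard Diffusers pipeline, and because it is image-conditioned only, the text prompt listed below cannot be passed to it; the prompt is included purely for comparison with the other models. A minimal run looks like the sketch below, where the input file name and the motion settings are assumptions you can tune.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import export_to_video, load_image

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()

# SVD is conditioned on the image alone -- no text prompt is accepted.
image = load_image("mustang_highway.png")  # hypothetical input file
image = image.resize((1024, 576))          # native SVD XT resolution

frames = pipe(
    image,
    decode_chunk_size=8,      # decode fewer frames at a time to save VRAM
    motion_bucket_id=127,     # higher values push the model toward more motion
    noise_aug_strength=0.02,  # a little noise helps when the input is very sharp
).frames[0]

export_to_video(frames, "svd_mustang.mp4", fps=7)
```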

An AI-generated red Mustang driving on a long, scenic road. This image is used to assess the SVD XT AI model’s ability to animate directional movement and keep consistent depth in video generation.


This scenario tests the model's ability to animate the scene with natural physics (here, smooth vehicle and camera motion) while preserving the essence of the source image in the video.

PROMPT USED

"A red Ford Mustang cruising smoothly down a long, empty highway during sunset. The camera follows the car from behind in a steady, cinematic motion, keeping the car centered while the distant mountains and road stretch ahead. The camera movement is slow, smooth, and dramatic, capturing the beauty of the scene and the motion of the car."

Observations

Below are the pros and cons we found for the Stable Video Diffusion XT video generator after testing it with the prompt above.

Pros

Excellent for creating natural object motion (water, clouds, foliage) in short clips.

Fast generation, close to image generation speed.

Cons

Limited to a maximum of 25 total frames, constraining video duration and continuity.

Lacks support for prompt-based or text-conditioned generation, making it unsuitable for narrative content.

Check out the AI tool's pricing and more on its official website.
