Qwen3-Omni-30B-A3B-Captioner
An interactive guide for usage with Transformers and vLLM.
Transformers Usage
Installation
The Hugging Face Transformers code for Qwen3-Omni is merged but not yet released on PyPI. Install from source in a new Python environment to avoid conflicts.
```shell
# If you already have transformers installed, uninstall it first, or create a new Python environment
# pip uninstall transformers
pip install git+https://github.com/huggingface/transformers
pip install accelerate
```
We offer a toolkit to handle various audio and visual inputs conveniently. Ensure your system has ffmpeg installed.
```shell
pip install qwen-omni-utils -U
```
For reduced GPU memory usage, we recommend installing FlashAttention 2. This is optional if you’re using vLLM.
```shell
pip install -U flash-attn --no-build-isolation
```
Note: Your hardware must be compatible with FlashAttention 2, and the model must be loaded in torch.float16 or torch.bfloat16.
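Before enabling it, you may want to gate the attention backend on what the machine actually supports. The helper below is a minimal sketch (the function name and the `sdpa` fallback are our own convention, not part of the Qwen3-Omni API); in practice the two flags would come from checking that `flash-attn` imports and that, e.g., `torch.cuda.is_bf16_supported()` returns `True`:

```python
def attention_kwargs(flash_attn_installed: bool, bf16_supported: bool) -> dict:
    """Pick from_pretrained kwargs: FlashAttention 2 requires fp16/bf16
    weights and compatible hardware; otherwise fall back to PyTorch SDPA."""
    if flash_attn_installed and bf16_supported:
        return {"dtype": "bfloat16", "attn_implementation": "flash_attention_2"}
    return {"dtype": "auto", "attn_implementation": "sdpa"}

kwargs = attention_kwargs(flash_attn_installed=True, bf16_supported=True)
```

The resulting dict can be splatted into `from_pretrained(MODEL_PATH, **kwargs, device_map="auto")`.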
Code Snippet
Here’s how to use Qwen3-Omni with transformers and qwen_omni_utils.
```python
import soundfile as sf

from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
# MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
            {"type": "text", "text": "What can you see and hear? Answer in one short sentence."},
        ],
    },
]

# Set whether to use audio in video
USE_AUDIO_IN_VIDEO = True

# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
    use_audio_in_video=USE_AUDIO_IN_VIDEO,
)
inputs = inputs.to(model.device).to(model.dtype)

# Inference: generation of the output text and audio
text_ids, audio = model.generate(
    **inputs,
    speaker="Ethan",
    thinker_return_dict_in_generate=True,
    use_audio_in_video=USE_AUDIO_IN_VIDEO,
)

text = processor.batch_decode(
    text_ids.sequences[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(text)
if audio is not None:
    sf.write(
        "output.wav",
        audio.reshape(-1).detach().cpu().numpy(),
        samplerate=24000,
    )
```
Advanced Usage Examples
Batch inference and toggling audio output are available. More detailed examples will be added here soon.
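Until then, here is a minimal sketch of how a batch of plain-text prompts can be packed into the conversation schema used above (the `build_text_conversation` helper is our own, not part of the Qwen3-Omni API); each resulting conversation can then go through `apply_chat_template` and `process_mm_info` exactly as in the snippet:

```python
def build_text_conversation(prompt: str) -> list:
    """Wrap a plain text prompt in the single-turn conversation schema
    used by Qwen3-Omni's chat template."""
    return [
        {
            "role": "user",
            "content": [{"type": "text", "text": prompt}],
        }
    ]

# A batch is simply a list of such conversations.
batch = [build_text_conversation(p) for p in ["Describe a cat.", "Describe a dog."]]
```

Whether the batch can be passed to `apply_chat_template` in one call or must be iterated depends on your transformers version; iterating is the safe default.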
vLLM Usage
Installation
We strongly recommend using vLLM for inference. Since the code is in a pull request, install vLLM from source in a new Python environment.
```shell
git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
# If you hit an "undefined symbol" error, build from source with "pip install -e . -v" instead.

# Install the other dependencies
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation
```
Inference with vLLM
Below is a simple example of how to run Qwen3-Omni with vLLM.
```python
import os

import torch
from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

if __name__ == '__main__':
    # vLLM engine v1 not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
    # MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

    llm = LLM(
        model=MODEL_PATH,
        trust_remote_code=True,
        gpu_memory_utilization=0.95,
        tensor_parallel_size=torch.cuda.device_count(),
        limit_mm_per_prompt={'image': 3, 'video': 3, 'audio': 3},
        max_num_seqs=8,
        max_model_len=32768,
        seed=1234,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"},
            ],
        }
    ]
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
        "mm_processor_kwargs": {
            "use_audio_in_video": True,
        },
    }
    if images is not None:
        inputs['multi_modal_data']['image'] = images
    if videos is not None:
        inputs['multi_modal_data']['video'] = videos
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)
    print(outputs[0].outputs[0].text)
```
Advanced vLLM Usage Examples
Batch inference and serving with vLLM provide enhanced throughput. More detailed examples will be added here soon.
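Until then, the request dict assembled in the snippet above can be factored into a small helper and reused across a batch, since `llm.generate` accepts a list of such dicts (the helper name is our own, not a vLLM API):

```python
def build_vllm_request(prompt, images=None, videos=None, audios=None, use_audio_in_video=True):
    """Assemble one vLLM request dict in the format used above; only the
    modalities that are actually present end up in multi_modal_data."""
    request = {
        "prompt": prompt,
        "multi_modal_data": {},
        "mm_processor_kwargs": {"use_audio_in_video": use_audio_in_video},
    }
    for key, value in (("image", images), ("video", videos), ("audio", audios)):
        if value is not None:
            request["multi_modal_data"][key] = value
    return request

# Batch inference is then a single call over a list of requests:
# outputs = llm.generate([build_vllm_request(t1, images=im1),
#                         build_vllm_request(t2)], sampling_params)
```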
Usage Tips (Recommended Reading)
Minimum GPU Memory Requirements
| Model Precision | 15s Video | 30s Video | 60s Video | 120s Video |
|---|---|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct (BF16) | 78.85 GB | 88.52 GB | 107.74 GB | 144.81 GB |
| Qwen3-Omni-30B-A3B-Thinking (BF16) | 68.74 GB | 77.79 GB | 95.76 GB | 131.65 GB |
Note: Theoretical minimum memory for transformers with BF16 precision and FlashAttention 2.
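For video lengths between the tabulated points, a rough figure can be read off by linear interpolation. The sketch below is a back-of-the-envelope estimator built only from the Instruct (BF16) row of the table above, not an official sizing tool:

```python
# (seconds, GB) pairs from the table for Qwen3-Omni-30B-A3B-Instruct (BF16)
INSTRUCT_BF16 = [(15, 78.85), (30, 88.52), (60, 107.74), (120, 144.81)]

def estimate_memory_gb(video_seconds: float, table=INSTRUCT_BF16) -> float:
    """Linearly interpolate minimum GPU memory between tabulated video lengths;
    clamp to the table's endpoints outside the measured range."""
    if video_seconds <= table[0][0]:
        return table[0][1]
    if video_seconds >= table[-1][0]:
        return table[-1][1]
    for (s0, g0), (s1, g1) in zip(table, table[1:]):
        if s0 <= video_seconds <= s1:
            return g0 + (g1 - g0) * (video_seconds - s0) / (s1 - s0)

print(round(estimate_memory_gb(45), 2))
```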
Prompt for Audio-Visual Interaction
For audio-visual interaction, using the following system prompt is recommended to improve reasoning and generate more natural, conversational responses.
```python
user_system_prompt = "You are Qwen-Omni, a smart voice assistant created by Alibaba Qwen."

message = {
    "role": "system",
    "content": [
        {"type": "text", "text": f"{user_system_prompt} You are a virtual voice assistant with no gender or age... Keep replies concise and conversational, as if talking face-to-face."}
    ]
}
```
The full prompt content is extensive. The key is to instruct the model to be a conversational voice assistant, avoiding formal or structured language.
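Putting it together, the system message is simply placed first in the conversation before applying the chat template. The sketch below uses a shortened system prompt and an illustrative user turn:

```python
user_system_prompt = "You are Qwen-Omni, a smart voice assistant created by Alibaba Qwen."

system_message = {
    "role": "system",
    "content": [{"type": "text", "text": user_system_prompt}],
}
user_message = {
    "role": "user",
    "content": [{"type": "text", "text": "What's the weather like today?"}],
}

# The system turn must come first; the rest of the conversation follows.
conversation = [system_message, user_message]
```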
Evaluation
Qwen3-Omni maintains SOTA performance on text and visual modalities and achieves SOTA on 32 of 36 audio/audio-visual benchmarks, outperforming systems like Gemini 2.5 Pro and GPT-4o.
Default Prompts for Evaluation
| Task Type | Prompt |
|---|---|
| ASR (Chinese) | 请将这段中文语音转换为纯文本。 (i.e., "Convert this Chinese speech into plain text.") |
| ASR (Other languages) | Transcribe the audio into text. |
| Speech-to-Text Translation | Listen to the provided speech and produce a translation in text. |
| Song Lyrics Recognition | Transcribe the song lyrics into text without any punctuation, separate lines with line breaks, and output only the lyrics without additional explanations. |
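When scripting an evaluation, the table can be mirrored as a simple lookup; the task keys below are our own labels, not official identifiers:

```python
# Default evaluation prompts from the table above, keyed by task type.
DEFAULT_PROMPTS = {
    "asr_zh": "请将这段中文语音转换为纯文本。",
    "asr_other": "Transcribe the audio into text.",
    "s2t_translation": "Listen to the provided speech and produce a translation in text.",
    "lyrics": (
        "Transcribe the song lyrics into text without any punctuation, "
        "separate lines with line breaks, and output only the lyrics "
        "without additional explanations."
    ),
}

def default_prompt(task: str) -> str:
    """Look up the default evaluation prompt for a task type."""
    return DEFAULT_PROMPTS[task]
```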