Qwen3-Omni-30B-A3B-Captioner
An interactive guide for usage with Transformers and vLLM.
Transformers Usage
Installation
The Hugging Face Transformers code for Qwen3-Omni is merged but not yet released on PyPI. Install from source in a new Python environment to avoid conflicts.
```shell
# If you already have transformers installed, uninstall it first, or create a new Python environment
# pip uninstall transformers
pip install git+https://github.com/huggingface/transformers
pip install accelerate
```
We offer a toolkit to handle various audio and visual inputs conveniently. Ensure your system has ffmpeg installed.
```shell
pip install qwen-omni-utils -U
```
For reduced GPU memory usage, we recommend installing FlashAttention 2. This is optional if you’re using vLLM.
```shell
pip install -U flash-attn --no-build-isolation
```
Note: Your hardware must be compatible with FlashAttention 2, and the model must be loaded in torch.float16 or torch.bfloat16.
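Before enabling it, you may want to gate the attention backend on what the machine actually supports. The helper below is a minimal sketch (the function name and the `sdpa` fallback are our own convention, not part of the Qwen3-Omni API); in practice the two flags would come from checking that `flash-attn` imports and that, e.g., `torch.cuda.is_bf16_supported()` returns `True`:

```python
def attention_kwargs(flash_attn_installed: bool, bf16_supported: bool) -> dict:
    """Pick from_pretrained kwargs: FlashAttention 2 requires fp16/bf16
    weights and compatible hardware; otherwise fall back to PyTorch SDPA."""
    if flash_attn_installed and bf16_supported:
        return {"dtype": "bfloat16", "attn_implementation": "flash_attention_2"}
    return {"dtype": "auto", "attn_implementation": "sdpa"}

kwargs = attention_kwargs(flash_attn_installed=True, bf16_supported=True)
```

The resulting dict can be splatted into `from_pretrained(MODEL_PATH, **kwargs, device_map="auto")`.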
Code Snippet
Here’s how to use Qwen3-Omni with transformers and qwen_omni_utils.
```python
import soundfile as sf

from transformers import Qwen3OmniMoeForConditionalGeneration, Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
# MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cars.jpg"},
            {"type": "audio", "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/cough.wav"},
            {"type": "text", "text": "What can you see and hear? Answer in one short sentence."},
        ],
    },
]

# Set whether to use audio in video
USE_AUDIO_IN_VIDEO = True

# Preparation for inference
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(
    text=text,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
    use_audio_in_video=USE_AUDIO_IN_VIDEO,
)
inputs = inputs.to(model.device).to(model.dtype)

# Inference: generation of the output text and audio
text_ids, audio = model.generate(
    **inputs,
    speaker="Ethan",
    thinker_return_dict_in_generate=True,
    use_audio_in_video=USE_AUDIO_IN_VIDEO,
)

text = processor.batch_decode(
    text_ids.sequences[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(text)
if audio is not None:
    sf.write(
        "output.wav",
        audio.reshape(-1).detach().cpu().numpy(),
        samplerate=24000,
    )
```
Advanced Usage Examples
Batch inference and toggling audio output are available. More detailed examples will be added here soon.
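Until then, here is a minimal sketch of how a batch of plain-text prompts can be packed into the conversation schema used above (the `build_text_conversation` helper is our own, not part of the Qwen3-Omni API); each resulting conversation can then go through `apply_chat_template` and `process_mm_info` exactly as in the snippet:

```python
def build_text_conversation(prompt: str) -> list:
    """Wrap a plain text prompt in the single-turn conversation schema
    used by Qwen3-Omni's chat template."""
    return [
        {
            "role": "user",
            "content": [{"type": "text", "text": prompt}],
        }
    ]

# A batch is simply a list of such conversations.
batch = [build_text_conversation(p) for p in ["Describe a cat.", "Describe a dog."]]
```

Whether the batch can be passed to `apply_chat_template` in one call or must be iterated depends on your transformers version; iterating is the safe default.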
vLLM Usage
Installation
We strongly recommend using vLLM for inference. Since the code is in a pull request, install vLLM from source in a new Python environment.
```shell
git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
# If you hit an "undefined symbol" error, build from source with "pip install -e . -v" instead.

# Install the other dependencies
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation
```
Inference with vLLM
Below is a simple example of how to run Qwen3-Omni with vLLM.
```python
import os

import torch
from vllm import LLM, SamplingParams
from transformers import Qwen3OmniMoeProcessor
from qwen_omni_utils import process_mm_info

if __name__ == '__main__':
    # vLLM engine v1 not supported yet
    os.environ['VLLM_USE_V1'] = '0'

    MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
    # MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Thinking"

    llm = LLM(
        model=MODEL_PATH,
        trust_remote_code=True,
        gpu_memory_utilization=0.95,
        tensor_parallel_size=torch.cuda.device_count(),
        limit_mm_per_prompt={'image': 3, 'video': 3, 'audio': 3},
        max_num_seqs=8,
        max_model_len=32768,
        seed=1234,
    )

    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.95,
        top_k=20,
        max_tokens=16384,
    )

    processor = Qwen3OmniMoeProcessor.from_pretrained(MODEL_PATH)

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-Omni/demo/draw.mp4"},
            ],
        }
    ]
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

    inputs = {
        'prompt': text,
        'multi_modal_data': {},
        "mm_processor_kwargs": {
            "use_audio_in_video": True,
        },
    }
    if images is not None:
        inputs['multi_modal_data']['image'] = images
    if videos is not None:
        inputs['multi_modal_data']['video'] = videos
    if audios is not None:
        inputs['multi_modal_data']['audio'] = audios

    outputs = llm.generate([inputs], sampling_params=sampling_params)
    print(outputs[0].outputs[0].text)
```
Advanced vLLM Usage Examples
Batch inference and serving with vLLM provide enhanced throughput. More detailed examples will be added here soon.
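Until then, the request dict assembled in the snippet above can be factored into a small helper and reused across a batch, since `llm.generate` accepts a list of such dicts (the helper name is our own, not a vLLM API):

```python
def build_vllm_request(prompt, images=None, videos=None, audios=None, use_audio_in_video=True):
    """Assemble one vLLM request dict in the format used above; only the
    modalities that are actually present end up in multi_modal_data."""
    request = {
        "prompt": prompt,
        "multi_modal_data": {},
        "mm_processor_kwargs": {"use_audio_in_video": use_audio_in_video},
    }
    for key, value in (("image", images), ("video", videos), ("audio", audios)):
        if value is not None:
            request["multi_modal_data"][key] = value
    return request

# Batch inference is then a single call over a list of requests:
# outputs = llm.generate([build_vllm_request(t1, images=im1),
#                         build_vllm_request(t2)], sampling_params)
```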
Usage Tips (Recommended Reading)
Minimum GPU Memory Requirements
| Model Precision | 15s Video | 30s Video | 60s Video | 120s Video |
|---|---|---|---|---|
| Qwen3-Omni-30B-A3B-Instruct (BF16) | 78.85 GB | 88.52 GB | 107.74 GB | 144.81 GB |
| Qwen3-Omni-30B-A3B-Thinking (BF16) | 68.74 GB | 77.79 GB | 95.76 GB | 131.65 GB |
Note: Theoretical minimum memory for transformers with BF16 precision and FlashAttention 2.
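For video lengths between the tabulated points, a rough figure can be read off by linear interpolation. The sketch below is a back-of-the-envelope estimator built only from the Instruct (BF16) row of the table above, not an official sizing tool:

```python
# (seconds, GB) pairs from the table for Qwen3-Omni-30B-A3B-Instruct (BF16)
INSTRUCT_BF16 = [(15, 78.85), (30, 88.52), (60, 107.74), (120, 144.81)]

def estimate_memory_gb(video_seconds: float, table=INSTRUCT_BF16) -> float:
    """Linearly interpolate minimum GPU memory between tabulated video lengths;
    clamp to the table's endpoints outside the measured range."""
    if video_seconds <= table[0][0]:
        return table[0][1]
    if video_seconds >= table[-1][0]:
        return table[-1][1]
    for (s0, g0), (s1, g1) in zip(table, table[1:]):
        if s0 <= video_seconds <= s1:
            return g0 + (g1 - g0) * (video_seconds - s0) / (s1 - s0)

print(round(estimate_memory_gb(45), 2))
```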
Prompt for Audio-Visual Interaction
For audio-visual interaction, using the following system prompt is recommended to improve reasoning and generate more natural, conversational responses.
```python
user_system_prompt = "You are Qwen-Omni, a smart voice assistant created by Alibaba Qwen."

message = {
    "role": "system",
    "content": [
        {"type": "text", "text": f"{user_system_prompt} You are a virtual voice assistant with no gender or age... Keep replies concise and conversational, as if talking face-to-face."}
    ]
}
```
The full prompt content is extensive. The key is to instruct the model to be a conversational voice assistant, avoiding formal or structured language.
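Putting it together, the system message is simply placed first in the conversation before applying the chat template. The sketch below uses a shortened system prompt and an illustrative user turn:

```python
user_system_prompt = "You are Qwen-Omni, a smart voice assistant created by Alibaba Qwen."

system_message = {
    "role": "system",
    "content": [{"type": "text", "text": user_system_prompt}],
}
user_message = {
    "role": "user",
    "content": [{"type": "text", "text": "What's the weather like today?"}],
}

# The system turn must come first; the rest of the conversation follows.
conversation = [system_message, user_message]
```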
Evaluation
Qwen3-Omni maintains SOTA performance on text and visual modalities and achieves SOTA on 32 of 36 audio/audio-visual benchmarks, outperforming systems like Gemini 2.5 Pro and GPT-4o.
Default Prompts for Evaluation
| Task Type | Prompt |
|---|---|
| ASR (Chinese) | 请将这段中文语音转换为纯文本。 (i.e., "Convert this Chinese speech into plain text.") |
| ASR (Other languages) | Transcribe the audio into text. |
| Speech-to-Text Translation | Listen to the provided speech and produce a translation in text. |
| Song Lyrics Recognition | Transcribe the song lyrics into text without any punctuation, separate lines with line breaks, and output only the lyrics without additional explanations. |
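When scripting an evaluation, the table can be mirrored as a simple lookup; the task keys below are our own labels, not official identifiers:

```python
# Default evaluation prompts from the table above, keyed by task type.
DEFAULT_PROMPTS = {
    "asr_zh": "请将这段中文语音转换为纯文本。",
    "asr_other": "Transcribe the audio into text.",
    "s2t_translation": "Listen to the provided speech and produce a translation in text.",
    "lyrics": (
        "Transcribe the song lyrics into text without any punctuation, "
        "separate lines with line breaks, and output only the lyrics "
        "without additional explanations."
    ),
}

def default_prompt(task: str) -> str:
    """Look up the default evaluation prompt for a task type."""
    return DEFAULT_PROMPTS[task]
```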