Alibaba Releases New Open-Source Model QwQ-32B, Matching DeepSeek-R1 with Lower Computational Requirements

2025-03-06

The Qwen team under Alibaba has unveiled QwQ-32B, a new reasoning model featuring 32 billion parameters. This model leverages reinforcement learning (RL) to enhance its performance in solving complex problems.

It has been open-sourced under the Apache 2.0 license on platforms like Hugging Face and ModelScope, allowing both commercial and research usage. Companies can integrate it directly into their products and services, including paid offerings.

QwQ (short for Qwen-with-Questions) was initially launched by Alibaba in November 2024 as an open-source reasoning model designed to compete with OpenAI's o1-preview. Its primary goal is to improve logical reasoning and planning abilities by reviewing and refining its responses during the reasoning process. It demonstrates exceptional performance in mathematical and coding tasks.
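The review-and-refine idea can be illustrated with a toy loop (purely a sketch of the concept, not Qwen's actual inference procedure; the `draft` and `critique` functions here are hypothetical stand-ins for model calls):

```python
def self_refine(question, draft_fn, critique_fn, max_rounds=3):
    # Draft an answer, ask for a critique, and revise until the
    # critique finds no issue (or the round budget runs out).
    answer = draft_fn(question)
    for _ in range(max_rounds):
        issue = critique_fn(question, answer)
        if issue is None:
            return answer
        answer = draft_fn(f"{question}\n(Previous attempt was flawed: {issue})")
    return answer

# Toy stand-ins for the model: the first draft is wrong, the revision is right.
draft = lambda prompt: "4" if "flawed" in prompt else "5"
critique = lambda q, a: None if a == "4" else "arithmetic error"
print(self_refine("What is 2+2?", draft, critique))  # prints "4"
```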

The initial version of QwQ boasts 32 billion parameters and a context length of 32,000 tokens. It outperforms o1-preview in math benchmarks such as AIME and MATH, as well as scientific reasoning tasks like GPQA. However, when it comes to programming benchmarks like LiveCodeBench, QwQ still lags behind OpenAI's models and faces challenges like language mixing and circular reasoning.

As AI continues to evolve rapidly, the limitations of traditional large language models (LLMs) have become more apparent, with diminishing returns from scaling up. This has sparked growing interest in large reasoning models (LRMs), which improve accuracy through extended reasoning time and self-reflection. Examples include OpenAI's o3 series and DeepSeek-R1 from China-based DeepSeek.

Since the release of DeepSeek-R1 in January 2025, the DeepSeek website has seen a surge in traffic, making it the second most popular AI model provider site after OpenAI.

Alibaba's latest offering, QwQ-32B, further enhances performance by integrating RL with structured self-questioning. It employs a multi-stage RL training approach to boost mathematical reasoning, coding capabilities, and general problem-solving skills.

QwQ-32B has been benchmarked against leading models, including DeepSeek-R1. Despite having far fewer parameters, it shows competitive performance. While DeepSeek-R1 has 671 billion parameters (with 37 billion activated per token), QwQ-32B achieves similar results using roughly 24 GB of VRAM, fitting on a single Nvidia H100 with 80 GB. Running DeepSeek-R1 at full size, by contrast, requires more than 1,500 GB of VRAM, on the order of 16 Nvidia A100 GPUs.
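A back-of-envelope calculation shows why the memory figures differ so sharply. This counts weights only, ignoring activations and KV cache; the ~24 GB figure for QwQ-32B implies a quantized deployment:

```python
def weight_gb(n_params, bytes_per_param):
    # Rough weight-memory estimate: parameter count x bytes per parameter, in GB.
    # Ignores activations, KV cache, and framework overhead.
    return n_params * bytes_per_param / 1e9

print(f"QwQ-32B, fp16:     ~{weight_gb(32e9, 2):.0f} GB")    # ~64 GB
print(f"QwQ-32B, int4:     ~{weight_gb(32e9, 0.5):.0f} GB")  # ~16 GB, fits in 24 GB with overhead
print(f"DeepSeek-R1, fp16: ~{weight_gb(671e9, 2):.0f} GB")   # ~1342 GB
```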

QwQ-32B is built on a causal language model architecture with several optimizations: a 64-layer transformer incorporating RoPE, SwiGLU, RMSNorm, and attention QKV bias; grouped-query attention (GQA) with 40 heads for queries and 8 for key-value pairs; an extended context length of 131,072 tokens for better handling of long input sequences; and a multi-stage training pipeline covering pre-training, supervised fine-tuning, and reinforcement learning.
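The 40-query / 8-KV head split means each key-value head serves a group of five query heads, shrinking the KV cache fivefold relative to standard multi-head attention. A minimal NumPy sketch of that mechanism (toy dimensions, not the actual QwQ-32B implementation):

```python
import numpy as np

def grouped_query_attention(q, k, v):
    # Grouped-query attention: each group of query heads shares one KV head.
    n_q_heads, n_kv_heads = q.shape[0], k.shape[0]
    group = n_q_heads // n_kv_heads      # 40 // 8 = 5 query heads per KV head
    k = np.repeat(k, group, axis=0)      # broadcast each KV head across its group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)   # softmax over key positions
    return weights @ v

# Toy tensors mirroring the 40-query / 8-KV head ratio
q = np.random.randn(40, 4, 16)   # (query heads, sequence length, head dim)
k = np.random.randn(8, 4, 16)    # (KV heads, sequence length, head dim)
v = np.random.randn(8, 4, 16)
out = grouped_query_attention(q, k, v)
print(out.shape)                 # (40, 4, 16)
```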

The RL process for QwQ-32B is divided into two phases. The first focuses on math and coding tasks, using accuracy validators and code-execution servers to verify that answers are actually correct. The second targets general capability enhancement, employing general reward models and rule-based validators to improve instruction following, human alignment, and agent-based reasoning, all while preserving the math and coding proficiency gained in the first phase.
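The first-phase idea, rewarding verifiably correct outcomes rather than a learned preference score, can be sketched as follows. This is a hypothetical toy; the `solve` entry point and test-case format are illustrative assumptions, not Qwen's actual validator infrastructure:

```python
def math_reward(model_answer: str, reference: str) -> float:
    # Accuracy-validator reward: 1 if the final answer matches the reference.
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

def code_reward(program: str, tests: list) -> float:
    # Execution-based reward: fraction of test cases the candidate program
    # passes, in the spirit of a code-execution server.
    passed = 0
    for inp, expected in tests:
        env = {}
        try:
            exec(program, env)                    # run the candidate program
            if str(env["solve"](inp)) == expected:
                passed += 1
        except Exception:
            pass                                  # crashes earn no reward
    return passed / len(tests)

print(math_reward("42", "42"))   # 1.0
print(code_reward("def solve(x): return x[::-1]",
                  [("abc", "cba"), ("ab", "ba")]))  # 1.0
```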