- Qwen3 beats OpenAI and DeepSeek models on maths, coding, and reasoning benchmarks.
- Now runs in a single non-thinking mode for more consistent results.
Alibaba has rolled out an updated version of its Qwen3 large language model (LLM), with improvements in maths, coding, reasoning, and multilingual tasks. As reported by the South China Morning Post, one variant, Qwen3-235B-A22B-Instruct-2507, has been released on Hugging Face and Alibaba’s ModelScope platform, along with a more efficient FP8 version that’s easier to run on limited hardware.
The company claims that the new version performs better in a range of tasks like logic, tool use, and long-context understanding. Based on benchmark tests, it also edges ahead of some competitors.
One key update is the FP8 model – short for 8-bit floating point – which cuts down on the memory and computing power needed to run the model. This makes it more useful for smaller teams or companies without access to large-scale infrastructure. Users can run it on more modest hardware or deploy it more efficiently in the cloud, which helps lower energy costs and speeds up response times. Alibaba didn’t share exact numbers, but FP8 stores one byte per parameter – half of what 16-bit formats need – so weight memory roughly halves, and similar FP8 deployments typically cut costs and resource use accordingly.
In short, the FP8 format allows organisations to do more with less. Instead of relying on massive clusters, users can run the model on a single GPU or even a personal machine, making it more practical for private use or local development. It also gives teams a chance to fine-tune models without the usual infrastructure headaches.
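The memory argument is easy to see with back-of-the-envelope arithmetic. The sketch below is an illustration only: the 235-billion parameter count comes from the model name, and it ignores activations, the KV cache, and runtime overhead.

```python
# Back-of-the-envelope weight-memory estimate for a 235B-parameter model.
# Illustration only: ignores activations, KV cache, and runtime overhead.

PARAMS = 235e9  # total parameters (from the model name)

def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in gigabytes."""
    return params * bytes_per_param / 1e9

bf16 = weight_memory_gb(PARAMS, 2)  # 16-bit formats: 2 bytes per parameter
fp8 = weight_memory_gb(PARAMS, 1)   # FP8: 1 byte per parameter

print(f"BF16 weights: ~{bf16:.0f} GB")  # BF16 weights: ~470 GB
print(f"FP8 weights:  ~{fp8:.0f} GB")   # FP8 weights:  ~235 GB
```

Halving the weight footprint is what moves a model of this size from a multi-GPU cluster towards a single high-memory node.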
The new version also posted strong scores on several public tests. In the 2025 American Invitational Mathematics Examination, Qwen3 scored 70.3. That’s well above DeepSeek’s 46.6 and OpenAI’s GPT-4o, which scored 26.7. In coding tests, it pulled in 87.9 points on the MultiPL-E benchmark – just ahead of DeepSeek and OpenAI but slightly behind Claude Opus 4 Non-thinking from Anthropic, which scored 88.5.
As the name suggests, the latest Qwen model supports only a non-thinking mode – meaning it gives direct answers without exposing its reasoning chain. But it can process longer input, with the context window now stretched to 256,000 tokens, making it better suited to large documents or long conversations in a single pass.
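To get a feel for what 256,000 tokens means in practice, the rough conversion below assumes about four characters per token and 500 words per dense page – common English-text heuristics, not Qwen-specific figures; actual tokeniser ratios vary by language and content.

```python
# Rough feel for a 256,000-token context window.
# Assumes ~4 characters per token and ~500 words per page –
# generic English-text heuristics, not Qwen-specific figures.

CONTEXT_TOKENS = 256_000
CHARS_PER_TOKEN = 4        # assumption
WORDS_PER_PAGE = 500       # assumption: dense single-spaced page

approx_chars = CONTEXT_TOKENS * CHARS_PER_TOKEN
approx_words = approx_chars // 5   # ~5 characters per English word
approx_pages = approx_words // WORDS_PER_PAGE

print(approx_words)  # 204800 – roughly 200k words
print(approx_pages)  # 409 – about 400 pages
```

By that estimate, a full-length novel or a long legal contract fits in a single request.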
The shift is tied to another change: Alibaba is dropping its earlier “hybrid” reasoning model approach, VentureBeat wrote. The idea behind hybrid mode was to let users switch between thinking and non-thinking behaviour, depending on the task. Users could toggle it manually – for example, adding a “/think” command before a prompt – to have the model work through a chain of logic before giving an answer.
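The toggle described above worked roughly like this sketch. It is hypothetical: the "/think" prefix convention comes from the article, but `build_prompt` is an illustrative helper, not part of any official Qwen SDK.

```python
# Illustrative sketch of the hybrid-mode toggle described above.
# The "/think" prefix convention is from the article; build_prompt
# is a hypothetical helper, not part of any official Qwen SDK.

def build_prompt(user_text: str) -> tuple[str, bool]:
    """Strip a leading '/think' command and report whether
    step-by-step reasoning was requested."""
    text = user_text.strip()
    if text.startswith("/think"):
        return text[len("/think"):].lstrip(), True
    return text, False

prompt, reasoning = build_prompt("/think What is 17 * 24?")
print(reasoning)  # True – the model would work through a reasoning chain first
print(prompt)     # What is 17 * 24?
```

Pushing this decision onto the user is exactly the design burden the Qwen team cites for abandoning the hybrid approach.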
That flexibility gave users more control, but also introduced design issues. Sometimes the model behaved unpredictably depending on the prompt, and users had to decide when to turn the reasoning on or off. After reviewing feedback, the Qwen team said it will now train separate models for instruction and reasoning tasks instead of blending both modes in one model.
A company post on social media read, “After talking with the community and thinking it through, we decided to stop using hybrid thinking mode. Instead, we’ll train Instruct and Thinking models separately so we can get the best quality possible.”
That means the new 2507 release focuses solely on following instructions and generating direct responses. For now, the reasoning model will be a separate track.
Alibaba is also starting to put its Qwen models into real-world products. A 3-billion-parameter version will power Xiaowei Hui, a smart assistant from HP that runs on the company’s PCs in China. The assistant is expected to help with tasks like writing and meeting summaries.
The Qwen3 series, launched in April, spans models from 600 million parameters up to 235 billion. One variant – Qwen3-235B-A22B-no-thinking – ranks among the top open-source models globally, coming in just behind Chinese models from Moonshot AI and DeepSeek in a recent LMArena report.
Hugging Face’s own rankings from last month also placed several Qwen models in the top ten among Chinese LLMs, further boosting the model family’s reputation in the open-source AI space.
Nvidia CEO Jensen Huang also weighed in during his trip to China last week, where he acknowledged the progress China has made with its open-source AI work. He described Alibaba’s Qwen, along with DeepSeek and Moonshot’s Kimi, as “very advanced” and “the best open reasoning models in the world today.”
The Qwen team’s decision to split instruction and reasoning into separate models marks a shift in strategy – one that favours predictability and performance over flexibility. While this means users won’t be able to toggle reasoning on and off in a single model anymore, it could lead to better results in each task area.
For teams looking to run powerful models on lower-cost infrastructure, the new FP8 version adds another reason to consider Qwen3. And for those tracking benchmarks, Alibaba’s latest model is now firmly in the race with some of the best-known names in AI.