GPT-OSS vs. Qwen vs. Deepseek: Comparing Open Source LLM Architectures
Y Combinator Startup Podcast
Summary
This episode compares the architectural differences and training methodologies of three prominent open-source Large Language Models: OpenAI's GPT-OSS, Alibaba Cloud's Qwen3, and DeepSeek's V3. It highlights their distinct approaches to efficiency, context length, and training data, offering insights into the rapidly evolving landscape of open-source AI.
Key Points
- GPT-OSS is an open-weights model from OpenAI, featuring a Mixture of Experts (MoE) architecture available in 120B and 20B parameter sizes, activating only a subset of parameters per token for efficient inference.
- GPT-OSS incorporates modern LLM features like grouped query attention for reduced memory, SwiGLU activations for more expressive feed-forward transformations, rotary positional embeddings (RoPE) for longer context, and RMSNorm for stable training.
- GPT-OSS boasts a 131,000-token context window, achieved through YaRN scaling applied during pre-training, and uses OpenAI's o200k_harmony tokenizer. Its training data is a text-only corpus in the trillions of tokens with a focus on STEM and coding, filtered for safety, though OpenAI has shared limited public detail.
- Qwen3 is a family of models from Alibaba Cloud, offering both dense and MoE variants, with dense models up to 32B parameters and MoE models up to 235B parameters; the MoE variants achieve performance comparable to dense models while activating far fewer parameters per token.
- Qwen3 models share architectural similarities with GPT-OSS, utilizing grouped query attention, SwiGLU activations, RoPE, and RMSNorm, but introduce QK norm to stabilize attention scores, replacing the QKV bias used in previous Qwen models.
- Qwen3 was trained on 36 trillion tokens across three stages: general, reasoning, and long context, with the latter extending context to over 32,000 tokens using optimizations like ABF (adjusted base frequency), YaRN, and dual chunk attention.
- Qwen3's post-training pipeline includes stages for reasoning, RL with GRPO, thinking mode fusion allowing a single model to switch between reasoning and non-reasoning modes, and general RL for instruction following and tool use, using strong-to-weak distillation for smaller model training.
- DeepSeek V3 is a 671 billion parameter MoE model trained natively in 8-bit precision for cost efficiency, and its V3.1 update further extends context through a two-phase approach and adds a hybrid thinking mode for flexible inference.
- DeepSeek V3 utilizes Multi-head Latent Attention (MLA), a memory-efficient attention mechanism that compresses keys and values into a small latent vector, offering better performance than GQA in long-context models.
- A key difference across models is their approach to long context: GPT-OSS has native long-context stability, DeepSeek V3 is trained step-by-step, and Qwen3 pushes the limits of its 32,000 token training with inference-time scaling.
- The comparison highlights that while top-line benchmark statistics and core components like attention mechanisms are similar across these LLMs, the specific methods and underlying dataset engineering vary significantly, making direct replication challenging.
- Reinforcement learning is a common element in the post-training and reasoning phases for these models, with notable data efficiency in places, such as Qwen's strong results from only 4,000 data pairs.
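The Mixture of Experts design mentioned above, in which only a few experts run per token, can be sketched in a few lines. Everything here is a toy (shapes, router, and "experts" are illustrative, not any model's actual weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, gate_w, experts, k=2):
    """Route a token to its top-k experts; only those experts run.

    x: (d,) token activation; gate_w: (n_experts, d) router weights;
    experts: list of (d, d) matrices standing in for expert FFNs.
    """
    logits = gate_w @ x                       # router score per expert
    top = np.argsort(logits)[-k:]             # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                  # softmax over selected experts only
    # Weighted sum of the k active experts' outputs; the rest stay idle,
    # which is why an MoE activates only a subset of parameters per token.
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

d, n_experts = 8, 4
x = rng.standard_normal(d)
gate_w = rng.standard_normal((n_experts, d))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y = moe_layer(x, gate_w, experts, k=2)
print(y.shape)  # (8,)
```

With k=2 of 4 experts active, only half the expert parameters participate in each token's forward pass, which is the efficiency the bullet points describe.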
Conclusion
Understanding the nuances of architectural choices, training data, and post-training methodologies is crucial for evaluating open-source LLMs beyond simple benchmark scores.
The field is rapidly evolving, with each major model release introducing novel techniques for efficiency, context handling, and reasoning capabilities.
Future development in open-source LLMs will likely continue to focus on empirical findings and the unique combinations of tools that yield superior performance, rather than purely first-principles explanations.
Discussion Topics
- How can developers and researchers effectively benchmark and compare open-source LLMs given the diverse architectural and training approaches?
- What are the most significant trade-offs between different context-handling techniques like YaRN scaling and MLA in terms of performance and implementation complexity?
- Beyond architecture and training data, what are the key factors that contribute to the "moat" or competitive advantage of leading open-source LLM developers?
Key Terms
- Mixture of Experts (MoE)
- A neural network architecture where specialized sub-networks (experts) handle different parts of the input data, leading to more efficient computation.
- Grouped Query Attention (GQA)
- A modification to attention mechanisms that reduces memory usage and speeds up inference by allowing multiple query heads to share key-value pairs.
- SwiGLU Activations
- A gated activation function used in feed-forward network layers that allows for more expressive transformations of data than a plain ReLU.
- Rotary Positional Embeddings (RoPE)
- A method of encoding token position directly into the attention mechanism, designed to better handle longer sequences and improve contextual understanding.
- RMS Norm
- A normalization technique that scales inputs by their root mean square, contributing to more stable training of neural networks.
- YaRN Scaling
- A technique used to extend the effective context window of models by adjusting the base frequency of rotary positional embeddings.
- MLA (Multi-head Latent Attention)
- An attention mechanism that compresses keys and values into a smaller latent space to reduce memory usage and improve performance, particularly in long-context models.
- QK Norm
- A normalization step applied to query and key vectors to maintain constant magnitudes, aiming to stabilize attention scores at scale.
- GRPO (Group Relative Policy Optimization)
- An RL algorithm introduced by DeepSeek researchers for strengthening complex problem-solving in LLMs.
- Thinking Mode Fusion
- A technique that integrates reasoning and non-reasoning capabilities into a single model, allowing users to switch between modes.
- Strong to Weak Distillation
- A method used to train smaller models by transferring knowledge from larger, more capable models.
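Several of these terms are easiest to see in code. As one example, grouped query attention's memory saving comes from many query heads sharing a few key/value heads, so the KV cache shrinks by the sharing ratio. A toy sketch with assumed shapes (not any specific model's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def grouped_query_attention(q, k, v, n_kv_heads):
    """Toy GQA: n_q query heads share n_kv_heads key/value heads.

    q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d).
    Each group of n_q_heads // n_kv_heads query heads attends through one
    shared KV head, so the KV cache is that many times smaller than in
    standard multi-head attention.
    """
    n_q_heads, T, d = q.shape
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                        # which shared KV head this query head uses
        scores = q[h] @ k[kv].T / np.sqrt(d)
        probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        out[h] = probs @ v[kv]
    return out

q = rng.standard_normal((8, 4, 16))   # 8 query heads
k = rng.standard_normal((2, 4, 16))   # only 2 KV heads need caching
v = rng.standard_normal((2, 4, 16))
out = grouped_query_attention(q, k, v, n_kv_heads=2)
print(out.shape)  # (8, 4, 16)
```

Here the KV cache holds 2 heads instead of 8, a 4x reduction, which is exactly the trade the GQA definition above describes.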
Timeline
GPT-OSS is a mixture of experts model, available in two sizes, 120 billion parameters and 20 billion parameters.
Trained as a decoder-only transformer, GPT-OSS incorporates plenty of features typical of modern LLMs.
One standout capability of the model is its 131,000-token context window, which it achieves by applying YaRN scaling during pre-training rather than as an inference-time adjustment.
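YaRN extends context by rescaling RoPE's rotation frequencies. A simplified sketch of the underlying idea, using a plain base-frequency adjustment (the full YaRN method additionally blends interpolation per frequency band, which is omitted here):

```python
import numpy as np

def rope_freqs(d, base=10000.0):
    """Standard RoPE inverse frequencies for an even head dimension d."""
    return base ** (-np.arange(0, d, 2) / d)

def scaled_freqs(d, scale=32.0, base=10000.0):
    """Context extension by raising the RoPE base (NTK-aware style).
    YaRN refines this with per-band interpolation ramps, not shown."""
    new_base = base * scale ** (d / (d - 2))
    return new_base ** (-np.arange(0, d, 2) / d)

d = 64
f, fs = rope_freqs(d), scaled_freqs(d)
# The scaled frequencies rotate more slowly, so positions far beyond the
# original training window still fall within one rotation period.
print(fs[-1] < f[-1])  # True
```

Applying the scaling during pre-training, as the episode notes GPT-OSS does, lets the model learn long-range attention patterns natively instead of relying on an inference-time patch.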
For GPT-OSS, OpenAI makes use of its open-source o200k_harmony tokenizer.
As for the dataset GPT-OSS was trained on, OpenAI has only disclosed the broad strokes.
Once training was complete, the model was released in a quantized format by default, making it lightweight enough for deployment on modest hardware.
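The memory saving from quantized release weights can be illustrated with a toy group-wise 4-bit scheme. This is a generic sketch of the idea, not the specific block-scaled format OpenAI shipped:

```python
import numpy as np

def quantize_groupwise(w, bits=4, group=32):
    """Toy symmetric group-wise quantization: one float scale per group of
    weights, 4-bit integers for the values. Illustrative only; the actual
    GPT-OSS release uses a different block-scaled 4-bit format."""
    qmax = 2 ** (bits - 1) - 1
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax   # per-group scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from ints and scales."""
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_groupwise(w)
err = np.abs(dequantize(q, scale) - w).max()
print(err)  # small reconstruction error at roughly 4x less memory than fp16
```

Storing 4 bits per weight plus a handful of scales is what makes a 120B-parameter model deployable on comparatively modest hardware.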
GPT-OSS also underwent substantial post-training for safety and alignment, shaping its default behavior for more controlled outputs.
Qwen3, the newest family of models developed by Alibaba Cloud, dropped this past April to considerable hype, with benchmark scores that rivaled those of leading open-source models like DeepSeek V3 and Llama 4.
Architecturally, the Qwen3 dense models are very similar to the Qwen2.5 models from the previous release.
All Qwen3 models also use the same tokenizer as previous Qwen models, which implements byte-level byte-pair encoding, allowing it to handle any text or symbol without special preprocessing, unlike word- or character-based tokenizers.
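Byte-level BPE's "handle anything" property comes from starting with raw UTF-8 bytes, so every string maps to known symbols before any merges apply. A toy illustration with one hand-picked merge (not Qwen's actual merge table):

```python
def byte_level_tokens(text):
    """Byte-level preprocessing: any string becomes byte values 0-255,
    so no character is ever out-of-vocabulary."""
    return list(text.encode("utf-8"))

def apply_merge(tokens, pair, new_id):
    """One BPE merge step: replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i:i + 2] == list(pair):
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

toks = byte_level_tokens("héllo")          # [104, 195, 169, 108, 108, 111]
toks = apply_merge(toks, (108, 108), 256)  # a learned merge 'll' -> token 256
print(toks)  # [104, 195, 169, 256, 111]
```

Note the accented "é" simply becomes two bytes (195, 169); nothing is ever unknown, which is the contrast with word- or character-based tokenizers the transcript draws.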
One of the main things that sets Qwen3 apart from previous Qwen models is the way it controls the scale of the query and key projections (QK norm) to keep attention scores stable at scale.
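The QK-norm idea is to RMS-normalize query and key vectors before the dot product, so attention logits stay bounded no matter how large the projections grow. A toy sketch with assumed shapes:

```python
import numpy as np

def qk_norm(x, gamma, eps=1e-6):
    """RMS-normalize per-head query/key vectors to roughly constant
    magnitude before the attention dot product (toy sketch of QK norm)."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma

rng = np.random.default_rng(0)
d = 16
q = 100.0 * rng.standard_normal((4, d))   # pathologically large queries
k = rng.standard_normal((4, d))
gamma = np.ones(d)                         # learned gain, identity here
logits = qk_norm(q, gamma) @ qk_norm(k, gamma).T / np.sqrt(d)
print(np.abs(logits).max() < 10)  # True: logits stay bounded despite huge q
```

After normalization each vector has RMS close to 1, so by Cauchy-Schwarz the scaled logits cannot exceed sqrt(d) regardless of the raw activation scale, which is the stability the transcript attributes to this change.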
Dataset-wise, Qwen3 was trained on 36 trillion pre-training tokens, twice as many as the Qwen2.5 models.
Qwen3's pre-training occurred in three stages: general, reasoning, and long context.
Finally, Qwen uses a four-step post-training pipeline with two goals: giving users more control over how much reasoning to apply to a given query, and efficiently distilling larger models' capabilities into smaller models.
DeepSeek's V3 model was one of the most ambitious open source LLMs to come out of a major lab in recent years.
But at a high level, the thing to know about V3 is that it's a mixture of experts model with several hardware and algorithmic optimizations, including training V3 natively in 8-bit rather than 16- or 32-bit, a huge unlock for cutting training costs.
And just recently, DeepSeek pushed V3 even further with an updated version, V3.1.
One thing that sets V3 apart is that it uses a different attention mechanism than GPT-OSS and Qwen3.
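That mechanism is the Multi-head Latent Attention (MLA) described in the key points: keys and values are reconstructed from a single shared low-rank latent per token, so the cache stores only the latent. A toy sketch with assumed dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, T = 64, 8, 10   # model dim, latent dim (r << d), cached sequence length

# Toy MLA-style projections: one shared down-projection compresses each
# token to a small latent; separate up-projections rebuild keys and values.
W_down = rng.standard_normal((d, r)) / np.sqrt(d)   # compress to latent
W_uk = rng.standard_normal((r, d)) / np.sqrt(r)     # latent -> key
W_uv = rng.standard_normal((r, d)) / np.sqrt(r)     # latent -> value

h = rng.standard_normal((T, d))   # hidden states for T cached tokens
latent = h @ W_down               # (T, r): this is all that gets cached
k = latent @ W_uk                 # (T, d) keys, rebuilt on the fly
v = latent @ W_uv                 # (T, d) values, rebuilt on the fly

cache_full = T * 2 * d            # entries in a standard KV cache
cache_mla = T * r                 # entries in the MLA latent cache
print(cache_full // cache_mla)    # 16x smaller cache in this toy setup
```

The real MLA adds details (decoupled RoPE components, learned per-head structure), but the core memory win is this low-rank compression of the KV cache.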
From V3 to Qwen to GPT-OSS, how should we think about, at a high level, the differences between these models?
One of the most interesting differences lies in how each model extends its context length.
Qwen also fine-tunes to 32,000 tokens, but skips that additional retraining step.
In other words, GPT-OSS is born with long-context stability, DeepSeek is trained into it step by step, and Qwen pushes the limits of what a 32,000-token-trained model can do without more long-context training.
Personally, I think one of the most interesting things about these papers, and the state-of-the-art in deep learning more generally, is that a lot of these read as empirical findings.
This is quite surprising.
Also, all the major models heavily use reinforcement learning as part of the post-training and reasoning portions of their model training efforts.
Another point here is that it's very opaque what the differences in datasets are between the labs.
So the big takeaway when reading these papers is you shouldn't focus too much on just the benchmark performance or top-line stats like context size.
I hope this gives you a framework for how to understand the latest open-source releases, and gives you a toolkit to start tinkering with them yourself.
Episode Details
- Podcast
- Y Combinator Startup Podcast
- Episode
- GPT-OSS vs. Qwen vs. Deepseek: Comparing Open Source LLM Architectures
- Official Link
- https://www.ycombinator.com/
- Published
- August 29, 2025