GPT-OSS vs. Qwen vs. Deepseek: Comparing Open Source LLM Architectures
Y Combinator Startup Podcast
Summary
This episode compares the architectural differences and training methodologies of three prominent open-source Large Language Models: OpenAI's GPT-OSS, Alibaba Cloud's Qwen3, and DeepSeek's V3. It highlights their distinct approaches to efficiency, context length, and training data, offering insights into the rapidly evolving landscape of open-source AI.
Key Points
- GPT-OSS is an open-weights model from OpenAI, featuring a Mixture of Experts (MoE) architecture available in 120B and 20B parameter sizes, activating only a subset of parameters per token for efficient inference.
- GPT-OSS incorporates modern LLM features like grouped query attention for reduced memory, SwiGLU activations for more expressive feed-forward transformations, rotary positional embeddings (RoPE) for longer context, and RMSNorm for stable training.
- GPT-OSS boasts a 131,000-token context window, achieved through YaRN scaling applied during pre-training, and uses OpenAI's o200k_harmony tokenizer. Its training data is a text-only corpus in the trillions of tokens with a focus on STEM and coding, filtered for safety, though OpenAI has shared limited public detail.
- Qwen3 is a family of models from Alibaba Cloud, offering both dense and MoE variants, with dense models up to 32B parameters and MoE models up to 235B parameters; the MoE variants achieve performance comparable to dense models while activating far fewer parameters per token.
- Qwen3 models share architectural similarities with GPT-OSS, utilizing grouped query attention, SwiGLU activations, RoPE, and RMSNorm, but introduce QK norm to stabilize attention scores, replacing the QKV bias used in previous Qwen models.
- Qwen3 was trained on 36 trillion tokens across three stages: general, reasoning, and long context, with the latter extending context to over 32,000 tokens using optimizations like ABF (adjusted base frequency), YaRN, and dual chunk attention.
- Qwen3's post-training pipeline includes stages for reasoning, RL with GRPO, thinking mode fusion allowing a single model to switch between reasoning and non-reasoning modes, and general RL for instruction following and tool use, using strong-to-weak distillation for smaller model training.
- DeepSeek V3 is a 671 billion parameter MoE model trained natively in 8-bit precision for cost efficiency, and its V3.1 update further extends context through a two-phase approach and adds a hybrid thinking mode for flexible inference.
- DeepSeek V3 utilizes Multi-head Latent Attention (MLA), a memory-efficient attention mechanism that compresses keys and values into a small latent vector, offering better performance than GQA in long-context models.
- A key difference across models is their approach to long context: GPT-OSS has native long-context stability, DeepSeek V3 is trained step-by-step, and Qwen3 pushes the limits of its 32,000 token training with inference-time scaling.
- The comparison highlights that while top-line benchmark statistics and core components like attention mechanisms are similar across these LLMs, the specific methods and underlying dataset engineering vary significantly, making direct replication challenging.
- Reinforcement learning is a common element in the post-training and reasoning phases for these models, with notable data efficiency in places, such as Qwen's strong results from only 4,000 data pairs.
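The Mixture of Experts design mentioned above, in which only a few experts run per token, can be sketched in a few lines. Everything here is a toy (shapes, router, and "experts" are illustrative, not any model's actual weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_layer(x, gate_w, experts, k=2):
    """Route a token to its top-k experts; only those experts run.

    x: (d,) token activation; gate_w: (n_experts, d) router weights;
    experts: list of (d, d) matrices standing in for expert FFNs.
    """
    logits = gate_w @ x                       # router score per expert
    top = np.argsort(logits)[-k:]             # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                  # softmax over selected experts only
    # Weighted sum of the k active experts' outputs; the rest stay idle,
    # which is why an MoE activates only a subset of parameters per token.
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

d, n_experts = 8, 4
x = rng.standard_normal(d)
gate_w = rng.standard_normal((n_experts, d))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y = moe_layer(x, gate_w, experts, k=2)
print(y.shape)  # (8,)
```

With k=2 of 4 experts active, only half the expert parameters participate in each token's forward pass, which is the efficiency the bullet points describe.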
Conclusion
Understanding the nuances of architectural choices, training data, and post-training methodologies is crucial for evaluating open-source LLMs beyond simple benchmark scores.
The field is rapidly evolving, with each major model release introducing novel techniques for efficiency, context handling, and reasoning capabilities.
Future development in open-source LLMs will likely continue to focus on empirical findings and the unique combinations of tools that yield superior performance, rather than purely first-principles explanations.
Discussion Topics
- How can developers and researchers effectively benchmark and compare open-source LLMs given the diverse architectural and training approaches?
- What are the most significant trade-offs between different context-handling techniques like YaRN scaling and MLA in terms of performance and implementation complexity?
- Beyond architecture and training data, what are the key factors that contribute to the "moat" or competitive advantage of leading open-source LLM developers?
Key Terms
- Mixture of Experts (MoE)
- A neural network architecture where specialized sub-networks (experts) handle different parts of the input data, leading to more efficient computation.
- Grouped Query Attention (GQA)
- A modification to attention mechanisms that reduces memory usage and speeds up inference by allowing multiple query heads to share key-value pairs.
- SwiGLU Activations
- A gated activation function used in feed-forward network layers that allows for more expressive transformations of data than a plain ReLU.
- Rotary Positional Embeddings (RoPE)
- A method of encoding token position directly into the attention mechanism, designed to better handle longer sequences and improve contextual understanding.
- RMS Norm
- A normalization technique that scales inputs by their root mean square, contributing to more stable training of neural networks.
- YaRN Scaling
- A technique used to extend the effective context window of models by adjusting the base frequency of rotary positional embeddings.
- MLA (Multi-head Latent Attention)
- An attention mechanism that compresses keys and values into a smaller latent space to reduce memory usage and improve performance, particularly in long-context models.
- QK Norm
- A normalization step applied to query and key vectors to maintain constant magnitudes, aiming to stabilize attention scores at scale.
- GRPO (Group Relative Policy Optimization)
- An RL algorithm introduced by DeepSeek researchers for strengthening complex problem-solving in LLMs.
- Thinking Mode Fusion
- A technique that integrates reasoning and non-reasoning capabilities into a single model, allowing users to switch between modes.
- Strong to Weak Distillation
- A method used to train smaller models by transferring knowledge from larger, more capable models.
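Several of these terms are easiest to see in code. As one example, grouped query attention's memory saving comes from many query heads sharing a few key/value heads, so the KV cache shrinks by the sharing ratio. A toy sketch with assumed shapes (not any specific model's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def grouped_query_attention(q, k, v, n_kv_heads):
    """Toy GQA: n_q query heads share n_kv_heads key/value heads.

    q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d).
    Each group of n_q_heads // n_kv_heads query heads attends through one
    shared KV head, so the KV cache is that many times smaller than in
    standard multi-head attention.
    """
    n_q_heads, T, d = q.shape
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group                        # which shared KV head this query head uses
        scores = q[h] @ k[kv].T / np.sqrt(d)
        probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        out[h] = probs @ v[kv]
    return out

q = rng.standard_normal((8, 4, 16))   # 8 query heads
k = rng.standard_normal((2, 4, 16))   # only 2 KV heads need caching
v = rng.standard_normal((2, 4, 16))
out = grouped_query_attention(q, k, v, n_kv_heads=2)
print(out.shape)  # (8, 4, 16)
```

Here the KV cache holds 2 heads instead of 8, a 4x reduction, which is exactly the trade the GQA definition above describes.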
Timeline
GPT-OSS is a mixture of experts model, available in two sizes, 120 billion parameters and 20 billion parameters.
Trained as a decoder-only transformer, GPT-OSS incorporates plenty of features typical of modern LLMs.
One standout capability of the model is its 131,000-token context window, which it achieves by applying YaRN scaling during pre-training rather than as an inference-time adjustment.
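YaRN extends context by rescaling RoPE's rotation frequencies. A simplified sketch of the underlying idea, using a plain base-frequency adjustment (the full YaRN method additionally blends interpolation per frequency band, which is omitted here):

```python
import numpy as np

def rope_freqs(d, base=10000.0):
    """Standard RoPE inverse frequencies for an even head dimension d."""
    return base ** (-np.arange(0, d, 2) / d)

def scaled_freqs(d, scale=32.0, base=10000.0):
    """Context extension by raising the RoPE base (NTK-aware style).
    YaRN refines this with per-band interpolation ramps, not shown."""
    new_base = base * scale ** (d / (d - 2))
    return new_base ** (-np.arange(0, d, 2) / d)

d = 64
f, fs = rope_freqs(d), scaled_freqs(d)
# The scaled frequencies rotate more slowly, so positions far beyond the
# original training window still fall within one rotation period.
print(fs[-1] < f[-1])  # True
```

Applying the scaling during pre-training, as the episode notes GPT-OSS does, lets the model learn long-range attention patterns natively instead of relying on an inference-time patch.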
For GPT-OSS, OpenAI makes use of its open-source o200k_harmony tokenizer.
As for the dataset GPT-OSS was trained on, OpenAI has only disclosed the broad strokes.
Once training was complete, the model was released in a quantized format by default, making it lightweight enough for deployment on modest hardware.
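The memory saving from quantized release weights can be illustrated with a toy group-wise 4-bit scheme. This is a generic sketch of the idea, not the specific block-scaled format OpenAI shipped:

```python
import numpy as np

def quantize_groupwise(w, bits=4, group=32):
    """Toy symmetric group-wise quantization: one float scale per group of
    weights, 4-bit integers for the values. Illustrative only; the actual
    GPT-OSS release uses a different block-scaled 4-bit format."""
    qmax = 2 ** (bits - 1) - 1
    w = w.reshape(-1, group)
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax   # per-group scale
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from ints and scales."""
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_groupwise(w)
err = np.abs(dequantize(q, scale) - w).max()
print(err)  # small reconstruction error at roughly 4x less memory than fp16
```

Storing 4 bits per weight plus a handful of scales is what makes a 120B-parameter model deployable on comparatively modest hardware.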
GPT-OSS also underwent substantial post-training for safety and alignment, shaping its default behavior for more controlled outputs.
Qwen3, the newest family of models developed by Alibaba Cloud, dropped this past April to considerable hype, with benchmark scores that rivaled those of leading open-source models like DeepSeek V3 and Llama 4.
Architecturally, the Qwen3 dense models are very similar to the Qwen2.5 models from the previous release.
All Qwen3 models also use the same tokenizer as previous Qwen models, which implements byte-level byte-pair encoding, allowing it to handle any text or symbol without special preprocessing, unlike word- or character-based tokenizers.
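Byte-level BPE's "handle anything" property comes from starting with raw UTF-8 bytes, so every string maps to known symbols before any merges apply. A toy illustration with one hand-picked merge (not Qwen's actual merge table):

```python
def byte_level_tokens(text):
    """Byte-level preprocessing: any string becomes byte values 0-255,
    so no character is ever out-of-vocabulary."""
    return list(text.encode("utf-8"))

def apply_merge(tokens, pair, new_id):
    """One BPE merge step: replace every occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i:i + 2] == list(pair):
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

toks = byte_level_tokens("héllo")          # [104, 195, 169, 108, 108, 111]
toks = apply_merge(toks, (108, 108), 256)  # a learned merge 'll' -> token 256
print(toks)  # [104, 195, 169, 256, 111]
```

Note the accented "é" simply becomes two bytes (195, 169); nothing is ever unknown, which is the contrast with word- or character-based tokenizers the transcript draws.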
One of the main things that sets Qwen3 apart from previous Qwen models is the way it controls the scale of the query and key projections (QK norm) to keep attention scores stable at scale.
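The QK-norm idea is to RMS-normalize query and key vectors before the dot product, so attention logits stay bounded no matter how large the projections grow. A toy sketch with assumed shapes:

```python
import numpy as np

def qk_norm(x, gamma, eps=1e-6):
    """RMS-normalize per-head query/key vectors to roughly constant
    magnitude before the attention dot product (toy sketch of QK norm)."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma

rng = np.random.default_rng(0)
d = 16
q = 100.0 * rng.standard_normal((4, d))   # pathologically large queries
k = rng.standard_normal((4, d))
gamma = np.ones(d)                         # learned gain, identity here
logits = qk_norm(q, gamma) @ qk_norm(k, gamma).T / np.sqrt(d)
print(np.abs(logits).max() < 10)  # True: logits stay bounded despite huge q
```

After normalization each vector has RMS close to 1, so by Cauchy-Schwarz the scaled logits cannot exceed sqrt(d) regardless of the raw activation scale, which is the stability the transcript attributes to this change.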
Dataset-wise, Qwen3 was trained on 36 trillion pre-training tokens, twice as many as the Qwen2.5 models.
Qwen3's pre-training occurred in three stages: general, reasoning, and long context.
Finally, Qwen uses a four-step post-training pipeline with two goals: giving users more control over how much reasoning to apply to a given query, and efficiently distilling larger models' capabilities into smaller models.
DeepSeek's V3 model was one of the most ambitious open source LLMs to come out of a major lab in recent years.
But at a high level, the thing to know about V3 is that it's a mixture of experts model with several hardware and algorithmic optimizations, including training V3 natively in 8-bit rather than 16- or 32-bit, a huge unlock for cutting training costs.
And just recently, DeepSeek pushed V3 even further with an updated version, V3.1.
One thing that sets V3 apart is that it uses a different attention mechanism than GPT-OSS and Qwen3.
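That mechanism is the Multi-head Latent Attention (MLA) described in the key points: keys and values are reconstructed from a single shared low-rank latent per token, so the cache stores only the latent. A toy sketch with assumed dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, T = 64, 8, 10   # model dim, latent dim (r << d), cached sequence length

# Toy MLA-style projections: one shared down-projection compresses each
# token to a small latent; separate up-projections rebuild keys and values.
W_down = rng.standard_normal((d, r)) / np.sqrt(d)   # compress to latent
W_uk = rng.standard_normal((r, d)) / np.sqrt(r)     # latent -> key
W_uv = rng.standard_normal((r, d)) / np.sqrt(r)     # latent -> value

h = rng.standard_normal((T, d))   # hidden states for T cached tokens
latent = h @ W_down               # (T, r): this is all that gets cached
k = latent @ W_uk                 # (T, d) keys, rebuilt on the fly
v = latent @ W_uv                 # (T, d) values, rebuilt on the fly

cache_full = T * 2 * d            # entries in a standard KV cache
cache_mla = T * r                 # entries in the MLA latent cache
print(cache_full // cache_mla)    # 16x smaller cache in this toy setup
```

The real MLA adds details (decoupled RoPE components, learned per-head structure), but the core memory win is this low-rank compression of the KV cache.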
From V3 to Qwen to GPT-OSS, how should we think about, at a high level, the differences between these models?
One of the most interesting differences lies in how each model extends its context length.
Qwen also fine-tunes to 32,000 tokens, but skips that additional retraining step.
In other words, GPT-OSS is born with long-context stability, DeepSeek is trained into it step by step, and Qwen pushes the limits of what a 32,000-token-trained model can do without more long-context training.
Personally, I think one of the most interesting things about these papers, and the state-of-the-art in deep learning more generally, is that a lot of these read as empirical findings.
This is quite surprising.
Also, all the major models heavily use reinforcement learning as part of the post-training and reasoning portions of their model training efforts.
Another point here is that it's very opaque what the differences in datasets are between the labs.
So the big takeaway when reading these papers is you shouldn't focus too much on just the benchmark performance or top-line stats like context size.
I hope this gives you a framework for how to understand the latest open-source releases, and gives you a toolkit to start tinkering with them yourself.
Episode Details
- Podcast
- Y Combinator Startup Podcast
- Episode
- GPT-OSS vs. Qwen vs. Deepseek: Comparing Open Source LLM Architectures
- Official Link
- https://www.ycombinator.com/
- Published
- August 29, 2025