The world got its first look at OpenAI Strawberry, a long-anticipated release from the leading AI lab that represents a giant step forward in AI’s reasoning capabilities, and potentially a new era for AI altogether. The company’s CEO, Sam Altman, has predicted that we might reach AI superintelligence in “a few thousand days.”
How did we get here, and what does that mean for use cases for AI?
The capabilities of large language models (LLMs) have come a long way since OpenAI launched ChatGPT nearly two years ago. Back then, much of GenAI was entertaining: generating images of cats doing yoga, writing break-up texts, and more.
Fast-forward to today, and LLMs and the products they support are getting much better at sounding… human. Previews of the newest OpenAI o1 models, previously known by the internal codename “Strawberry,” demonstrate significant advancements in AI reasoning capabilities. Meanwhile, AI-generated podcast hosts from NotebookLM sound remarkably real, with banter, pauses, and filler words that capture the human touch of language.
AI superintelligence may still feel far away. However, advancements in OpenAI Strawberry and NotebookLM show tangible differences in the user experience and reasoning in AI today. And, they may give us a glimpse into where AI is going next.
OpenAI Strawberry sets a new bar for LLM reasoning.
Before OpenAI o1, AI models struggled to solve complex, multi-step problems. Today, Strawberry represents a significant step toward achieving Artificial General Intelligence (AGI).
How did we get here so quickly? After all, ChatGPT launched a little less than two years ago. What do we mean when we say some LLMs can “reason” and others cannot? And, most importantly, how is this changing the way we think about and interact with GenAI today?
Can large language models reason?
The human world is messy. Our problems, questions, and the answers we’re looking for are rife with context. We live in the nuance, which requires reasoning, contextualization, and critical thinking.
Previous AI models struggled with this complexity largely due to their training. LLMs like GPT-3 and GPT-4 were trained to predict the next word in a sequence based on large datasets. It’s why these models excel at natural language processing: generating coherent text, creative writing, and providing clear information.
Reasoning, however, requires more than predicting words or pulling information. It requires understanding, contextualization, solving multi-step problems, and (sometimes) integrating external knowledge.
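The next-word objective described above can be illustrated with a toy model. The sketch below counts which word most often follows each word in a tiny made-up corpus; real LLMs like GPT-3 and GPT-4 learn these statistics with neural Transformers over billions of tokens, so this is only an analogy for the training objective, not how those models are built:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it and how often.
    A toy stand-in for the next-word-prediction objective."""
    counts = defaultdict(Counter)
    tokens = corpus.split()
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent continuation seen in training."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

corpus = "the cat sat on the mat the cat ate the fish"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # → cat ("cat" follows "the" most often)
```

A model trained this way can produce fluent-looking continuations without any notion of whether they are logically sound, which is exactly the gap the article goes on to describe.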
While previous models excelled at single-hop reasoning, they lacked the ability to break down problems into steps, extrapolate general principles to specific or theoretical instances, and self-correct. The result? Errors and shallow answers.
Where previous LLMs excelled:
- Natural language processing: Interpret, manipulate, and comprehend human language
- Information retrieval: Answer factual questions based on their training data and provide summaries or general knowledge on a wide range of topics
- Pattern recognition: Identify patterns in text data and perform text classification
- Creative writing: Assist with content creation and ideation for a wide range of content types
- Single-hop reasoning: Answer basic questions or make inferences using information from a single source or step (no need to combine multiple pieces of information)
Where previous LLMs fell short:
- Multi-step logic: Breaking down complex problems into manageable steps and following formal logical rules
- Mathematical reasoning: Reliably performing arithmetic and solving complex mathematical tasks
- Abstract thinking: Solving problems that required abstract reasoning beyond pattern matching
- Relational understanding: Interpreting relationships between entities, such as temporal, causal, or conceptual connections
- Contextual understanding: Adapting to implicit contextual cues and handling context-sensitive information
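The single-hop vs. multi-step distinction in the lists above can be made concrete with a toy lookup. In this sketch (the "knowledge base" and questions are invented for illustration), a single-hop answer is one lookup, while a multi-hop answer requires feeding an intermediate result into the next step, the kind of composition earlier LLMs handled unreliably:

```python
# Invented mini knowledge base for illustration only.
facts = {
    ("Paris", "capital_of"): "France",
    ("France", "currency"): "Euro",
}

def single_hop(subject, relation):
    """One lookup, e.g. 'What country is Paris the capital of?'"""
    return facts.get((subject, relation))

def multi_hop(subject, rel1, rel2):
    """Chain two lookups, e.g. 'What currency is used in the country
    whose capital is Paris?' The intermediate answer must feed the
    next step for the final answer to be right."""
    intermediate = facts.get((subject, rel1))
    if intermediate is None:
        return None
    return facts.get((intermediate, rel2))

print(single_hop("Paris", "capital_of"))             # → France
print(multi_hop("Paris", "capital_of", "currency"))  # → Euro
```

An error at any intermediate hop propagates to the final answer, which is why multi-step problems expose weaknesses that single-hop benchmarks hide.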
While LLMs like GPT-3 and GPT-4 excel at pattern recognition and language generation, their ability to reason—especially in a deliberate, context-based, or multi-step manner—has been limited. OpenAI Strawberry changed that in a big way.
A first look at LLM reasoning
In September, OpenAI released a preview of o1, the first in a planned series of reasoning models from the AI lab goliath. The initial version is text-only (no image generation yet) and has already impressed researchers, demonstrating:
- 83.3% success rate in solving complex, competitive programming problems, surpassing many human experts
- 100% accuracy in high school-level mathematical reasoning tasks, providing detailed step-by-step solutions
- Superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models
OpenAI o1’s success lies in its training. Experts trained the model using reinforcement learning to sharpen its thinking and fine-tune its problem-solving strategies. This training is fundamentally different from that of its predecessors, which focused primarily on generating text via next-word prediction. Unlike previous models, the AI behind OpenAI Strawberry is designed to enhance reasoning capabilities, allowing it to solve complex problems, plan, and even tackle tasks that typically require human-like thought processes.
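The reinforcement-learning idea can be sketched in miniature: sample a "reasoning strategy," score it with a reward, and strengthen the preference for whatever worked. Everything here is invented for illustration (the strategies, the rewards, the update rule); OpenAI's actual o1 training recipe is not public:

```python
import random

def rl_preference_loop(strategies, reward_fn, steps=200, lr=0.1):
    """Toy reinforcement-learning loop: repeatedly sample a strategy
    in proportion to its current preference weight, score it, and
    reinforce it by its reward. Only an illustration of the idea."""
    prefs = {s: 1.0 for s in strategies}
    random.seed(0)  # deterministic for the example
    for _ in range(steps):
        chosen = random.choices(strategies,
                                weights=[prefs[s] for s in strategies])[0]
        prefs[chosen] += lr * reward_fn(chosen)
    return max(prefs, key=prefs.get)

# Hypothetical setup: careful step-by-step reasoning earns more
# reward than guessing, so the loop learns to prefer it.
rewards = {"guess": 0.1, "step_by_step": 1.0}
best = rl_preference_loop(["guess", "step_by_step"], rewards.get)
print(best)  # → step_by_step
```

The point of the analogy: the model is optimized for *strategies that lead to correct answers*, not merely for plausible next words.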
Other models have also demonstrated improvements in reasoning that bridge the gap in formal mathematical reasoning. AlphaProof and AlphaGeometry, two new AI systems that use reinforcement learning and a fine-tuned version of Google DeepMind’s Gemini AI, solved four out of six problems from the 2024 International Mathematical Olympiad (IMO).
Performance on Olympiad-level mathematics:
- Google DeepMind (AlphaProof + AlphaGeometry): 4 of 6 IMO problems solved (about 66% of available points)
- OpenAI o1: 83% on the IMO qualifying exam (AIME)
- GPT-4o: 13% on the same qualifying exam
Statistically, four out of six doesn’t seem impressive: the AI still missed a third of the problems. But in context, it’s a big step forward in AI’s ability to apply logic and math in a more grounded fashion. (Look at that contextual reasoning!)
What this means for reasoning in AI
Models that have reasoning capabilities like o1 are more than just impressive. They represent a major step toward a more effective and human-like AI.
These models spend more time "thinking" through problems before responding and use a “chain of thought” to process queries. As a result, o1 demonstrates more human reasoning, echoing how we process problems step-by-step and think before speaking. (At least, that’s the goal.)
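The "chain of thought" idea can be pictured with a trivial word problem. The sketch below records each intermediate step the way a chain-of-thought model verbalizes them, instead of jumping straight to the answer; it's a loose analogy for the technique, not a description of o1's internal mechanism:

```python
def solve_with_trace(start_apples, given_away, bought):
    """Solve a toy word problem while recording each intermediate
    step, mimicking a chain-of-thought trace."""
    trace = []
    remaining = start_apples - given_away
    trace.append(f"Start with {start_apples} apples; give away "
                 f"{given_away}: {remaining} left.")
    total = remaining + bought
    trace.append(f"Buy {bought} more: {total} in total.")
    return trace, total

trace, answer = solve_with_trace(5, 2, 3)
for step in trace:
    print(step)
print("Answer:", answer)  # → Answer: 6
```

Making each step explicit is what lets a model (or a reader) catch and correct a mistake mid-solution rather than committing to a wrong final answer.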
More AI labs, including Microsoft, are working to integrate reasoning capabilities into their models. AI investors say more complex reasoning models can improve reliability and enhance AI agents. This, plus the integration of AI agents into businesses, can dramatically change the way we view AI today: from novelty to necessity.
Still, LLMs with these reasoning capabilities are nascent. While o1 performs multi-hop reasoning faster than humans, it’s much slower than its counterparts. Not to mention, it’s more expensive. Future releases of OpenAI Strawberry and continuous fine-tuning will provide a clearer view of just how smart this technology can be.
NotebookLM and the AI podcast heard around the world
While experts celebrated the new capabilities of OpenAI Strawberry, another story was blowing up. NotebookLM, a not-so-new AI assistant, demonstrated a jaw-dropping new feature; its story serves as a mirror for the advancements and evolutions in the GenAI space.
NotebookLM then vs. NotebookLM now
While not a new product by AI-industry standards, NotebookLM evolved from a mere AI assistant to a breakthrough AI product seemingly overnight. Launched in June 2023 from Google Labs, NotebookLM aimed to help users manage large volumes of information by organizing, summarizing, and generating insights from uploaded documents.
A fancy notebook with standard AI-driven capabilities. Until two months ago.
The engineers behind NotebookLM created an advanced feature called Deep Dives: a GenAI podcast that summarizes and discusses an uploaded file. The difference? The audio actually sounded like a real conversation between two humans.
“NotebookLM creates a conversation between two AI hosts discussing the material. They discuss the material, they banter, they laugh, and they make sense. This feature offers a fresh, passive way to consume information, which is a welcomed alternative to reading dense material.”
— Ksenia Se, Turing Post
This feature even caught the attention of Andrej Karpathy.
“It is a bit of a re-imagination of the UIUX of working with LLMs organized around a collection of sources you upload and then refer to with queries, seeing results alongside and with citations… In my opinion, LLM capability (IQ, but also memory (context length), multimodal, etc.) is getting way ahead of the UIUX of packaging it into products. Think Code Interpreter, Claude Artifacts, Cursor/Replit, NotebookLM, etc. I expect (and look forward to) a lot more and different paradigms of interaction than just chat.”
—Andrej Karpathy, OpenAI co-founder
He’s not the only one excited about it. Just google “NotebookLM,” and you’ll see a wave of enthusiasm, with some saying it’s the “most exciting thing” since ChatGPT.
The tech behind it: Does it include model reasoning?
NotebookLM is powered by Google’s long-context Gemini 1.5 Pro, a Transformer model utilizing a sparse Mixture-of-Experts architecture. This allows NotebookLM to process up to 1,500 pages of information at once, making it suitable for those tackling large datasets or complex topics. Beyond this, NotebookLM leverages a variety of tools and techniques to create its impressive, human-like podcasts, including retrieval-augmented generation (RAG), SoundStorm, and prompt engineering.
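The RAG step mentioned above can be sketched simply: score each uploaded document against the query and pass the best matches into the model's prompt. This toy version uses raw word overlap and an invented mini-corpus purely for illustration; production systems (NotebookLM's internals are not public) typically use vector embeddings and semantic search instead:

```python
def retrieve(query, documents, k=1):
    """Naive retrieval step for RAG: rank documents by word overlap
    with the query and return the top-k to include in the prompt."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

# Invented mini-corpus for illustration.
docs = [
    "Gemini 1.5 Pro uses a sparse mixture of experts architecture",
    "SoundStorm generates realistic multi speaker dialogue audio",
    "Uploaded notes can be summarized into a study guide",
]
top = retrieve("how does the mixture of experts architecture work", docs)
print(top[0])  # → the Gemini architecture document
```

Grounding the model's answer in retrieved source text is what lets NotebookLM cite the user's own documents rather than relying solely on what the base model memorized in training.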
Does it include reasoning capabilities like those in OpenAI Strawberry? No, not quite. NotebookLM’s capabilities are more context-specific and focused on synthesizing information from multiple documents, which enables it to also analyze visual data like charts and images.
In contrast, Strawberry models like OpenAI o1 excel in handling complex, multi-step reasoning tasks.
However, NotebookLM and Strawberry do one thing similarly: They hit the right use cases for AI.
Strawberry solves problems AI previously couldn’t, like mathematics and complex coding, opening up a whole new space for people to use AI in their work. Meanwhile, NotebookLM helps us sort through an overwhelming amount of information and sometimes-chaotic notes, serving it back to us in a format that’s easy and enjoyable.
When discussing NotebookLM, Karpathy said,
“That's what I think is ultimately so compelling about the 2-person podcast format as a UIUX exploration. It lifts two major "barriers to enjoyment" of LLMs. 1 Chat is hard. You don't know what to say or ask. In the 2-person podcast format, the question asking is also delegated to an AI, so you get a lot more chill experience instead of being a synchronous constraint in the generating process. 2 Reading is hard, and it's much easier to just lean back and listen.”
AI enters a new era.
More reasoning. More features and integrations. More AI services that resemble human intelligence?
Only time will tell. What’s certain is that AI is moving beyond its novelty phase and into a new phase of sophistication and adoption. GenAI is no longer just entertaining—it’s becoming a part of our everyday lives.
The advancements seen in models like OpenAI Strawberry and applications like NotebookLM signal a transformative shift. The integration of sophisticated reasoning capabilities for LLMs and user-friendly interfaces proves that AI is no longer merely a tool for entertainment.
LLMs are becoming much, much better at giving humans the assistance they need and the experience they want. If you’ve only dabbled in GenAI, now is the time to dig in deeper. It won’t be long before it’s an integrated part of our world.
Explore the CoreWeave blog for more stories about LLM innovations and the AI infrastructure that powers them.