The Quiet Breakthrough: How LLMs Are Learning to Count and Reason
My grandmother, bless her heart, used to have a meticulous way of counting mangoes from her garden.
Not just by number, but by ripeness, by size, by which tree they had fallen from after a monsoon shower.
She did not use a spreadsheet; she used her lived experience, her hands feeling each one, her eyes discerning the subtle changes.
It was an incremental, sequential process, building a mental ledger as she went.
Simple, yet deeply perceptive.
It strikes me sometimes how we demand such human-like intuition from our most advanced AI, the large language models (LLMs) that now write code and translate languages with astonishing fluidity.
These digital prodigies can weave prose, synthesize vast swathes of information from the furthest corners of the web, and even help us architect entire business strategies.
They are, in many ways, brilliant.
Yet, ask them something as seemingly straightforward as tallying the number of r's in "strawberry," and they might falter.
This reveals a fascinating paradox: the strikingly capable can struggle with the utterly elementary.
This disconnect is not a minor bug; it is a profound architectural challenge, and solving it is key to unlocking the next generation of AI.
Why This Matters Now
The current prowess of LLMs has profoundly reshaped our work, but their core architecture, the transformer, harbors inherent limitations that impact their ability to perform certain fundamental tasks.
For businesses leaning heavily on AI for complex operations, from sophisticated code generation to nuanced customer service, these limitations translate directly into computational costs and accuracy gaps.
In a spotlight poster at NeurIPS 2025, IBM researchers showed that their hybrid model, PD-SSM, significantly outperformed other SSM variants at universal state tracking tasks, achieving an average score of 98.5 percent in experiments.
This shows a clear path forward for more reliable and efficient AI systems.
Modern LLMs, despite their linguistic brilliance, struggle with basic counting and sequential reasoning due to their core transformer architecture.
IBM's new PD-SSM hybrid model offers a breakthrough, dramatically improving their ability to track states and handle complex logical problems, critical for business applications like code generation.
The Transformer’s Core Flaw: Why Parallel Processing Hinders Sequential Logic
Imagine trying to understand a long conversation, not by listening word-for-word, but by getting all the words in a jumbled heap at once, and then trying to piece together the narrative.
That is a bit like how a transformer-based LLM processes information.
The self-attention mechanism, which revolutionized natural language processing by allowing models to process long blocks of text simultaneously, gives transformers a rich understanding of word relationships.
But this parallel, out-of-order crunching of data comes at a cost: it makes state tracking surprisingly difficult.
State tracking is how we, as humans, incrementally update our worldview as circumstances change.
It is how you keep track of details in a long meeting, evaluate an opponent's last move in a chess match, or simply remember the context of a sprawling email thread.
For transformers, this incremental update is a struggle.
Instead, they memorize each new observation for later retrieval, a strategy akin to a vast, but ultimately finite, database.
They can recall facts precisely, but when the history gets too long, their memory, much like ours after a sleepless night, can simply run out.
The Strawberry Problem: A Microcosm of a Macro Issue
The now-famous strawberry problem – asking an LLM to count the r's in the word – seems trivial, almost childish.
Yet, it highlights a profound challenge.
This is not about mathematical aptitude in the human sense, but rather a fundamental difficulty in processing sequential data and maintaining an accurate count over time.
While workarounds like chain-of-thought (CoT) prompting can break the problem into steps, these methods add significant time and computational cost to an already expensive process, as noted by IBM in their 2024 internal article, The quest to teach LLMs how to count.
This simple counting task becomes a canary in the coal mine, signaling deeper limitations in logical reasoning and memory.
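To see why the task is trivial for an explicitly sequential process, consider a minimal sketch: a single pass over the string, carrying one integer of state per step. This incremental ledger is exactly what a transformer's parallel attention does not natively maintain.

```python
def count_char(text: str, target: str) -> int:
    """Count occurrences of `target` by scanning left to right,
    carrying a single integer of state between steps."""
    count = 0            # the "state", updated incrementally
    for ch in text:      # strictly sequential processing
        if ch == target:
            count += 1
    return count

print(count_char("strawberry", "r"))  # 3
```

Three lines of sequential logic solve what a parallel, out-of-order architecture must approximate.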
What the Research Really Says: A New Path to Smarter AI
The quest to imbue LLMs with more robust sequential logic is not new.
For decades, researchers have explored architectures like recurrent neural networks (RNNs) that process text word by word, summarizing past inputs into a compressed state for future reference.
While their memories might be hazier, they can stretch much farther back in time.
This historical context is now informing cutting-edge research.
Chomsky’s Hierarchy and the Parity Problem
In 2020, Stanford researcher Michael Hahn mathematically proved that transformers have limited ability to perform state tracking tasks at the lowest level of Chomsky's hierarchy, a framework for classifying language complexity.
He showed that transformers would theoretically fail the parity problem – determining if the number of ones in a binary string is odd or even – beyond a certain length.
This is not just an academic exercise; it reveals a fundamental weakness.
If LLMs cannot master basic sequential logic, their ability to evaluate complex logical formulas or generate reliable code is inherently compromised, impacting their utility in critical business applications.
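The parity problem itself reduces to a two-state finite automaton: the state flips on every one and ignores every zero. A minimal sketch makes the sequential nature of the task explicit.

```python
def parity(bits: str) -> str:
    """Two-state finite automaton for the parity problem:
    the state flips on every '1' and is untouched by '0'."""
    state = 0                  # 0 = even number of ones seen so far
    for b in bits:
        if b == "1":
            state ^= 1         # a single XOR updates the state
    return "odd" if state else "even"

print(parity("1101"))  # odd: three ones
print(parity("1001"))  # even: two ones
```

One bit of state suffices for arbitrary lengths – but only if the model can carry that bit forward step by step, which is precisely what Hahn showed transformers cannot reliably do.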
The Rise of Hybrid Architectures
Companies like IBM and Nvidia are now integrating RNN-like processing into their LLMs.
IBM's Granite family of models, for instance, interleaves transformer layers with those of a state space model (SSM), an architecture known for skillfully processing long sequences.
Nvidia's models take a similar approach, incorporating Mamba2, a selective state space model introduced in 2024.
Blending old and new architectures is boosting inferencing speeds and efficiency.
Hybrid models like IBM's Bamba architecture promise more cost-effective and faster AI operations, a critical factor for enterprise-scale deployments.
PD-SSM: The Breakthrough in Universal State Tracking
IBM researchers, in a spotlight poster at NeurIPS 2025, propose a new way of structuring the SSM's transition matrices to enable universal state tracking.
Their improved model, called PD-SSM, significantly outperformed other SSM variants in experiments on such tasks.
This architectural innovation makes sequential logical reasoning more intuitive and robust for LLMs.
PD-SSM could dramatically improve an LLM's capacity to handle complex tasks like predicting what comes next in time-ordered events, a critical function for advanced analytics and automation.
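The exact construction is in the NeurIPS poster, but the intuition behind structuring transition matrices can be sketched. A purely diagonal transition matrix can only scale each state component independently, while a permutation composed with a diagonal can also move the state between components, which is exactly what the parity automaton requires. The toy sketch below is my own illustration of that idea, not IBM's implementation: input-dependent transition matrices drive a one-hot state vector.

```python
def apply(matrix, vec):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [matrix[0][0] * vec[0] + matrix[0][1] * vec[1],
            matrix[1][0] * vec[0] + matrix[1][1] * vec[1]]

# Input-dependent transitions: a '0' applies the identity, a '1'
# applies a permutation that swaps the two state components --
# something no diagonal matrix can do.
A = {
    "0": [[1, 0], [0, 1]],   # identity
    "1": [[0, 1], [1, 0]],   # swap (a permutation matrix)
}

def track_parity(bits):
    h = [1, 0]               # one-hot state: start in "even"
    for b in bits:
        h = apply(A[b], h)   # linear, SSM-style state update
    return "even" if h[0] == 1 else "odd"

print(track_parity("110101"))  # four ones -> even
```

The point of the illustration: once the transition matrix is allowed to permute states, a linear recurrence can emulate a finite automaton exactly, at any sequence length.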
Real-World Performance
PD-SSM's capabilities extend beyond abstract problems.
It performed well on real-life tasks involving long time-series datasets, such as classifying heartbeats and forecasting ethanol demand, as reported by IBM researchers in a spotlight poster at NeurIPS 2025.
A hybrid model integrating PD-SSM could correctly compute word parity out to 40 characters, while other SSMs struggled past 15 characters.
The theoretical advancements translate directly into practical, observable improvements.
This level of enhanced state tracking is especially important for applications like code generation, where precise sequential evaluation is non-negotiable, offering a competitive edge for businesses.
Playbook You Can Use Today
Embracing these advancements requires a strategic shift in how we approach AI integration and development.
Here is a playbook for leaders and practitioners:
- Assess Your AI's State Tracking Needs: Identify which of your AI applications (e.g., code generation, long-form content synthesis, conversational AI agents) rely heavily on sequential reasoning and long-term memory.
These are the prime candidates for enhanced models.
- Pilot Hybrid Architectures: Begin experimenting with hybrid LLM architectures that incorporate State Space Models (SSMs).
Look for models that specifically address improved state tracking capabilities, such as those inspired by IBM's PD-SSM.
- Prioritize "Canary in the Coal Mine" Problems: Use tasks like the parity problem or complex counting as benchmarks during AI model evaluation.
If a model cannot reliably solve these, it suggests deeper limitations in its sequential processing. As IBM researcher Shawn Tan notes in The quest to teach LLMs how to count from 2024: "If a model can't track two states sequentially, how can we expect it to solve more challenging problems?"
- Invest in Custom Model Fine-Tuning: Collaborate with AI experts to fine-tune models for your specific use cases, leveraging the strengths of new architectures.
This ensures the theoretical benefits translate into practical gains for your unique data and operations.
- Focus on Data Annotation for Sequence: Ensure your training data for fine-tuning emphasizes sequential relationships and state changes.
High-quality, context-rich sequential data will maximize the benefits of improved state tracking in your models.
- Stay Abreast of Research: The field of AI architecture is evolving rapidly.
Keep a pulse on breakthroughs, especially in areas like State Space Models and their integration with transformers, by following leading academic journals and industry research.
For example, explore the latest findings presented at conferences like NeurIPS.
- Ethical Implementation: As AI becomes more capable, the ethical implications grow.
Ensure your teams are trained on responsible AI development and deployment, especially when models handle sensitive sequential data or make logical decisions impacting human lives.
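The canary benchmark from the playbook can be automated. The sketch below assumes the model under test is exposed as a callable mapping a bit string to "odd" or "even" (a hypothetical interface); it probes increasing string lengths, echoing IBM's parity experiments, and reports the first length at which the model slips. A perfect sequential solver stands in for the model here.

```python
import random

def ground_truth(bits: str) -> str:
    """Reference answer for the parity problem."""
    return "odd" if bits.count("1") % 2 else "even"

def probe_parity(model, max_len=40, trials=20, seed=0):
    """Return the first string length at which `model` makes a mistake,
    or None if it survives every probe up to max_len."""
    rng = random.Random(seed)   # fixed seed for reproducible probes
    for length in range(1, max_len + 1):
        for _ in range(trials):
            bits = "".join(rng.choice("01") for _ in range(length))
            if model(bits) != ground_truth(bits):
                return length   # first failure length
    return None                 # passed every probe

# The reference solver itself never fails:
print(probe_parity(ground_truth))  # None
```

Swapping in a call to your deployed model turns this into a standing regression test for sequential reasoning.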
Risks, Trade-offs, and Ethics
While the advancements in state tracking are exciting, they come with their own set of considerations.
Increased model complexity, while offering greater capability, can also lead to higher computational costs in some scenarios, despite efficiency gains in others.
There is always a trade-off between model universality and task-specific optimization.
Moreover, as AI becomes more adept at sequential reasoning and remembering longer contexts, the potential for biased or harmful patterns embedded in vast training data to be perpetuated over extended interactions also grows.
Mitigation requires rigorous data auditing, transparency in model design, and continuous monitoring of AI system outputs.
Ethical guidelines must be embedded from conception to deployment, not bolted on afterward.
The goal is not just smarter AI, but wiser AI.
Tools, Metrics, and Cadence
Implementing these advanced LLMs requires a robust operational framework.
Recommended Tool Stack:
- Model Hosting & Deployment: Cloud platforms offering managed LLM services.
- Data Annotation & Labeling: Specialized tools for sequential data labeling.
- Experiment Tracking: Platforms for managing and comparing different model architectures and fine-tuning runs.
- Performance Monitoring: Tools to track model accuracy, latency, and resource consumption in production.
Key Performance Indicators (KPIs):
- State Tracking Accuracy: Percentage of correctly processed sequential tasks (e.g., parity, complex counting) – Target: >95 percent.
- Inference Latency: Time taken for the model to generate an output for a given input – Target: <500ms (application dependent).
- Computational Cost: Resources consumed (e.g., GPU hours) per inference or per training epoch – Target: Reduced by 15-20 percent compared to baseline.
- Task Completion Rate: Percentage of user requests or automated tasks successfully handled by the AI – Target: >90 percent for critical sequential tasks.
- Contextual Coherence: Human-rated quality of long-form outputs, reflecting sustained memory & logic – Target: Score >4 out of 5.
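The first KPI above can be computed mechanically from benchmark logs. A minimal sketch, assuming results arrive as (predicted, expected) pairs; the logging format here is an assumption for illustration.

```python
def state_tracking_accuracy(results):
    """Percentage of sequential benchmark tasks answered correctly.

    `results` is a list of (predicted, expected) pairs -- the exact
    format of your benchmark logs is an assumption here."""
    correct = sum(1 for pred, exp in results if pred == exp)
    return 100.0 * correct / len(results)

runs = [("odd", "odd"), ("even", "even"), ("odd", "even"), ("even", "even")]
print(state_tracking_accuracy(runs))  # 75.0 -- below the >95 percent target
```

Tracked weekly, this single number makes the ">95 percent" target auditable rather than aspirational.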
Review Cadence:
- Weekly: Performance monitoring, anomaly detection, small-scale fine-tuning adjustments.
- Monthly: Comprehensive model re-evaluation, data drift analysis, review of new research.
- Quarterly: Strategic review of AI roadmap, resource allocation, and exploration of new architectures for integration.
FAQ
Why do LLMs struggle with simple counting problems?
Large Language Models based on the transformer architecture process data in parallel, out of order, which makes it difficult for them to track sequential changes or incrementally update their worldview.
This affects tasks like counting specific characters or maintaining state in a long conversation, as explained by IBM in their 2024 internal article, The quest to teach LLMs how to count.
What is state tracking and why is it important for LLMs?
State tracking refers to an AI's ability to incrementally update its understanding or memory as new information streams in.
It is crucial for complex tasks like evaluating code, engaging in long conversations, or logical reasoning, as it allows the model to build context over time, as explained by IBM in their 2024 internal article, The quest to teach LLMs how to count.
How is IBM addressing the limitations of transformers in state tracking?
IBM researchers are developing hybrid architectures that interleave transformer layers with state space models (SSMs), which have an RNN-like structure.
Their latest innovation, PD-SSM, proposes a new way to structure an SSM's transition matrices to enable universal state tracking, significantly improving performance on sequential tasks (an average score of 98.5 percent), as detailed in a spotlight poster at NeurIPS 2025 by IBM researchers.
What does the parity problem have to do with LLM capabilities?
The parity problem, determining if a string of ones and zeros has an odd or even number of ones, is a fundamental test of sequential state tracking.
Stanford researcher Michael Hahn mathematically proved in 2020 that transformers have limited ability to solve this at scale, foreshadowing struggles with logical formulas, as documented in his 2020 paper, Theoretical Limitations of Self-Attention in Neural Sequence Models.
Conclusion
The journey to truly intelligent AI, one that can reason with the grounded intuition of my grandmother counting mangoes, is an ongoing quest.
We have witnessed LLMs master language with breathtaking elegance, yet stumble on the simplest acts of numerical perception.
The strawberry problem, the parity test – these are not trivial academic puzzles, but crucial indicators of a deeper truth: real intelligence, whether human or artificial, depends on the quiet, incremental mastery of sequential detail and logical reasoning.
IBM's breakthrough with PD-SSM, restructuring the very algebra of state space models, represents a significant leap in this journey.
It is a testament to the power of blending architectural insights, old and new, to overcome fundamental limitations.
As Aleksandar Terzic, an IBM researcher, remarked in The quest to teach LLMs how to count from 2024: "We are excited to explore these implications further."
The path forward for AI is not just about more data or bigger models; it is about smarter foundations.
If LLMs can finally master the simple art of counting and robust state tracking, who knows what more profound, logical problems they might be able to solve, making them not just eloquent, but truly discerning partners in our digital age.
It is time for AI that does not just speak our language, but truly understands it, one logical step at a time.
References
- IBM. 2024. The quest to teach LLMs how to count (Internal Article).
- IBM Researchers. 2025. Spotlight poster at NeurIPS 2025 (PD-SSM).
- Nvidia. 2024. Mamba2: A Selective State Space Model.
- Hahn, M. 2020. Theoretical Limitations of Self-Attention in Neural Sequence Models. Transactions of the Association for Computational Linguistics.