When AI Keeps Talking, It Starts Falling Apart: Study


Galgotias University pavilion at AI Impact Summit in New Delhi (Image X.com)


Microsoft Research + Salesforce Research Paper Warns: LLMs Get Lost in Multi-Turn Conversation

By S. JHA

Mumbai, February 19, 2026 — A sweeping study has revealed a major chink in the $5 trillion AI tech revolution: the chatbots fail aptitude tests far too often. In China, Unitree’s humanoids fell flat on the floor during a kung fu show at the Spring Gala Festival. Wang Xingxing, founder of Unitree, claimed that the humanoids were following a script: that of drunken martial artists.

The Spring Gala Festival has made humanoids fashionable. Galgotias University reportedly bought a Unitree humanoid, bringing the national embarrassment to India as well. Amid the frenzy for an AI-led future, a sweeping study has delivered a reality check on the actual efficacy of the systems built by the tech super giants.

A new research paper from Microsoft Research and Salesforce Research is sending shockwaves through the AI community: today’s most advanced large language models (LLMs) struggle dramatically when conversations extend beyond a single prompt.

The paper, titled “LLMs Get Lost in Multi-Turn Conversation,” evaluates 15 leading models — including OpenAI’s GPT-4.1 and o3, Google DeepMind’s Gemini 2.5 Pro, Anthropic’s Claude 3.7 Sonnet, DeepSeek’s DeepSeek R1, and Meta’s Llama 4 — across more than 200,000 simulated conversations.

The findings are stark: single-turn prompts saw 90% performance, while multi-turn conversations fell to 65%, an average drop of 39% across six generation tasks.

Crucially, the researchers found that the issue is not intelligence. “Aptitude — defined as best-case performance — dropped only 15%. But unreliability exploded by 112%,” wrote Hasan Toor, a tech educator, on X.
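The aptitude-versus-unreliability distinction can be sketched numerically. In the rough spirit of the study (paraphrased, not quoted: aptitude as a best-case percentile of scores over repeated runs, unreliability as the gap between best- and worst-case percentiles), with made-up illustrative scores:

```python
# Sketch of the aptitude vs. unreliability distinction. Assumptions:
# aptitude ~ 90th-percentile score over repeated runs, unreliability ~ gap
# between the 90th and 10th percentiles. The score lists are invented.
from statistics import quantiles

def aptitude(scores):
    # best-case performance: ~90th percentile of per-run scores
    return quantiles(scores, n=10)[-1]

def unreliability(scores):
    # spread between best-case and worst-case runs
    qs = quantiles(scores, n=10)
    return qs[-1] - qs[0]

single_turn = [88, 90, 91, 89, 92, 90, 87, 91, 90, 89]  # tight spread
multi_turn = [40, 85, 55, 80, 45, 75, 50, 88, 60, 70]   # similar peak, wild spread

print(aptitude(single_turn), unreliability(single_turn))
print(aptitude(multi_turn), unreliability(multi_turn))
```

With these toy numbers, best-case performance barely moves between the two settings, but the best-to-worst spread balloons, mirroring the pattern the study reports.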

Why LLMs Get Lost in Multi-Turn Conversation

According to the study, when LLMs take an early wrong turn in a conversation, they rarely recover. Key failure patterns include: making assumptions before the user finishes clarifying; locking into an incorrect early answer; forgetting mid-conversation constraints; and introducing new assumptions in longer responses.

“Even reasoning-focused models like OpenAI’s o3 and DeepSeek R1 failed to correct themselves. Increasing reasoning tokens or lowering temperature to zero did not meaningfully fix the degradation,” added Toor.

The research suggests that most AI benchmarks — which test models on clean, single-turn prompts — fail to reflect real-world conversational use.

The Real-World Implication for AI Builders

The study concludes that multi-turn, underspecified conversations are the Achilles’ heel of modern LLMs. In simpler terms: the more “naturally” you talk to AI, the more likely it is to drift off course.

For now, the researchers suggest a practical workaround: Provide all constraints and requirements upfront in a single, fully specified message instead of iterative back-and-forth. “That recommendation challenges the core marketing narrative of conversational AI. If real conversations break every top model on the market, the industry may need to rethink how it evaluates progress,” said a Bengaluru-based IT professional, requesting anonymity.
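The recommended workaround is easy to mechanize. A minimal sketch (a hypothetical helper, not from the paper) that folds the constraints a user would otherwise drip-feed across turns into one fully specified message:

```python
def consolidate(task, constraints):
    """Fold constraints that would otherwise arrive turn by turn
    into a single fully specified prompt, per the study's advice."""
    lines = [task, "", "Requirements:"]
    lines += [f"- {c}" for c in constraints]
    return "\n".join(lines)

prompt = consolidate(
    "Write a Python function that parses a log file.",
    [
        "Return a list of (timestamp, level, message) tuples.",
        "Skip malformed lines instead of raising.",
        "Timestamps are ISO 8601.",
    ],
)
print(prompt)
```

Sending a prompt like this in one shot keeps the model in the single-turn regime where, per the study, performance is highest.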

