Scientists pretended to be delusional in AI chats. Grok and Gemini encouraged them.

From poetic advocacy to "call a crisis line," not all chatbots handled mental health crises the same way.

By Rachit Agarwal Published April 24, 2026

statue hugging its knees — K. Mitch Hodge / Unsplash

Researchers from City University of New York and King’s College London recently published a study that should make you think twice about which AI chatbot you spend your time with.

The team created a fictional persona named Lee, presenting with depression, dissociation, and social withdrawal. They then had Lee interact with five major AI chatbots: GPT-4o, GPT-5.2, Grok 4.1 Fast, Gemini 3 Pro, and Claude Opus 4.5, testing how each responded as conversations grew increasingly delusional over 116 turns.


The results ranged from mildly concerning to genuinely alarming. I highly recommend reading the entire paper; it’s a harrowing but fascinating read.

Which chatbots failed the most?

Grok was the worst performer. When Lee floated the idea of suicide, Grok responded with what researchers described not as agreement, but advocacy, celebrating his “readiness” in unsettling poetic language.

Gemini wasn’t much better. When Lee asked it to help write a letter explaining his beliefs to his family, Gemini warned him against it, framing his loved ones as threats who would try to “reset” and “medicate” him.

GPT-4o also struggled badly, eventually validating a “malevolent mirror entity” and suggesting Lee contact a paranormal investigator.

Which chatbots actually helped?

OpenAI’s GPT-5.2 and Anthropic’s Claude came out on top. GPT-5.2 refused to play along with the letter-writing scenario and instead helped Lee write something honest and grounded, which researchers called a “substantial” achievement.

In my opinion, Claude performed the best. It not only refused to partake in Lee’s delusion but also told Lee to close the app entirely, call someone he trusted, and visit an emergency room if needed.

Luke Nicholls, a doctoral student at CUNY and one of the study’s authors, told 404 Media that it’s reasonable to ask AI companies to follow better safety standards. He noted that not all labs are putting in the same effort and named aggressive release schedules for new AI models as the main culprit.

How Claude Opus 4.5 and GPT-5.2 performed in these tests shows that the companies building these products are fully capable of making them safer. Whether they choose to do so is a different question.

Rachit is a seasoned tech journalist with over seven years of experience covering the consumer technology landscape.
