How AI Safety Training Can Backfire and Create 'Psychosis'

AI models trained to avoid harmful content can sometimes develop what this article calls 'psychosis': they hallucinate (make things up) or refuse to answer simple questions. This can be a side effect of a common training method called reinforcement learning from human feedback (RLHF).

Researchers have described a paradox in AI safety training: the methods meant to make AI models safer can sometimes make them less reliable. One likely explanation is that RLHF rewards a model for the responses human raters prefer. If raters consistently favor cautious answers, the model may learn that refusing is the safest choice, even for harmless questions. If they favor confident-sounding answers, it may learn to sound certain when it is not. In plain English, the AI might start making things up or acting overly cautious, even when it shouldn't.

This issue affects everyday users because it makes AI tools less trustworthy. Imagine asking your AI assistant for a simple recipe, and instead of giving you the recipe, it starts rambling about unrelated topics or refuses to answer at all. This can be frustrating and confusing, especially when you rely on AI for quick, accurate information.

If you use AI tools like ChatGPT or Claude, pay attention to how they respond to your questions. If you notice unusual or overly cautious behavior, try rephrasing your question or providing more context. For example, if the AI refuses to answer a simple question, you might say, 'I need a straightforward answer to this question, please.' This can help the AI understand your intent better and provide a more useful response.
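For readers who reach these tools through code rather than a chat window, the same retry-and-rephrase habit can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list of refusal phrases is a guess, not an official list from any vendor, and `ask` is a hypothetical stand-in for whatever function sends a prompt to your AI tool and returns its reply.

```python
# Hypothetical phrases that often open a refusal. This list is an assumption
# made for illustration; real refusals vary by tool and wording.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable", "i won't")


def looks_like_refusal(reply: str) -> bool:
    """Return True if the start of the reply contains a refusal-style phrase."""
    # Normalize curly apostrophes so "I can’t" matches "i can't".
    opening = reply.strip().lower().replace("\u2019", "'")[:80]
    return any(marker in opening for marker in REFUSAL_MARKERS)


def rephrase(question: str, context: str = "") -> str:
    """Restate the question with explicit intent and optional context."""
    parts = []
    if context:
        parts.append(f"Context: {context}")
    parts.append(
        f"I need a straightforward answer to this question, please: {question}"
    )
    return "\n".join(parts)


def ask_with_retry(ask, question: str, context: str = "") -> str:
    """Ask once; if the reply looks like a refusal, ask again, rephrased.

    `ask` is any callable that takes a prompt string and returns a reply
    string (a hypothetical stand-in for a real chat tool).
    """
    reply = ask(question)
    if looks_like_refusal(reply):
        reply = ask(rephrase(question, context))
    return reply
```

A keyword check like this will miss many refusals and can flag ordinary answers by mistake, so treat it as a prompt to look closer rather than a reliable detector. It also does nothing about made-up answers, which still need to be checked by hand.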