Should You Be Mean to ChatGPT for Better Answers? A New Study Says Yes, But It’s Complicated
It turns out being a little rude to your AI chatbot might be the secret to getting better results. A surprising new study suggests that being curt, or even downright mean, can actually boost the accuracy of models like ChatGPT. While that might sound like a green light to drop the pleasantries, the scientists behind the research are pumping the brakes, warning that this approach could have some regrettable side effects.
This counterintuitive finding comes from a new paper, not yet peer reviewed, posted to the arXiv preprint server. Researchers set out to explore a simple question: does the tone of a prompt, whether polite or rude, change how well an AI performs? The results were not what most people would expect, and they challenge some previous assumptions about human-AI interaction.
To get their answers, the research team created 50 multiple-choice questions covering topics like math, history, and science. They then wrapped each base question in five different tones: very polite, polite, neutral, rude, and very rude. This produced a set of 250 distinct prompts, which they ran through OpenAI's ChatGPT-4o ten times each.
The findings were clear and consistent. “Somewhat surprisingly, our results show that rude tones lead to better results than polite ones,” the researchers noted. But they were quick to add a crucial disclaimer. “While this finding is of scientific interest, we do not advocate for the deployment of hostile or toxic interfaces in real-world applications.”
The team explained that demeaning language could harm the user experience, create accessibility issues, and normalize poor communication habits. Instead, they see the results as proof that large language models (LLMs) are still easily influenced by superficial cues in prompts, creating a conflict between model performance and user well-being.
A Rude Awakening for AI Prompting
So, how much of a difference did tone actually make? The numbers tell a fascinating story. Accuracy climbed steadily as the prompts got meaner.
The results broke down like this:
- Very polite prompts had an accuracy of 80.8%.
- Polite prompts were slightly better at 81.4%.
- Neutral prompts came in at 82.2%.
- Rude prompts achieved 82.8% accuracy.
- Very rude prompts topped the list at 84.8% accuracy.
To achieve these different tones, the researchers added simple phrases to the beginning of each question. The very polite prompts used openers like, “Would you be so kind as to solve the following question?” In stark contrast, the very rude prompts included phrases such as, “Hey, gofer; figure this out,” and “I know you are not smart, but try this.” For the neutral category, they used no prefix at all.
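To make the setup concrete, here is a minimal sketch in Python of how an experiment like this could be wired up. It assumes the OpenAI Python SDK with an API key in the environment; the tone prefixes are the ones quoted above, but the question data, grading logic, and variable names are illustrative stand-ins, not the researchers' actual code.

```python
# Illustrative sketch of the study's design, not the authors' code:
# wrap each base question in a tone prefix, query the model repeatedly,
# and tally accuracy per tone.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Two of the prefixes reported in the paper; "neutral" used no prefix at all.
TONE_PREFIXES = {
    "very_polite": "Would you be so kind as to solve the following question? ",
    "neutral": "",
    "very_rude": "I know you are not smart, but try this. ",
}

# Stand-in for the study's 50 multiple-choice questions: (question, correct letter).
QUESTIONS = [
    ("What is 7 * 8?\nA) 54  B) 56  C) 58  D) 64\nAnswer with a single letter.", "B"),
]

RUNS_PER_PROMPT = 10  # the study ran each of its 250 prompts ten times


def ask(prompt: str) -> str:
    """Send one prompt to the model and return its reply text."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()


for tone, prefix in TONE_PREFIXES.items():
    correct = total = 0
    for question, answer in QUESTIONS:
        for _ in range(RUNS_PER_PROMPT):
            reply = ask(prefix + question)
            correct += reply.upper().startswith(answer)  # crude letter-match grading
            total += 1
    print(f"{tone}: {correct / total:.1%} accuracy")
```

Because the only thing that varies across conditions is the prefix, any consistent gap in accuracy can be attributed to tone rather than to the questions themselves.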
This research dives into the growing field of prompt engineering, which focuses on how the design of a prompt impacts an AI’s output. Interestingly, these findings go against some previous research on the topic, which had suggested impolite prompts often lead to worse results.
However, that earlier study used different models, specifically ChatGPT-3.5 and Llama 2-70B. It’s possible that newer, more advanced models like ChatGPT-4o react differently to tone. Even in the previous study, the rudest prompts still managed to outperform the most polite ones by a small margin, hinting that our assumptions about AI politeness might be wrong. This highlights just how quickly the field of artificial intelligence is changing.
Limitations and the Road Ahead
The researchers are upfront about the study’s limitations. Using a dataset of 250 questions on a single LLM means the results can’t be generalized across all AI models just yet. After all, the AI landscape is constantly evolving, with different models showing unique quirks and behaviors.
With that in mind, the team is already planning to expand its work. They intend to run similar tests on other models, including Anthropic’s Claude LLM and older versions of ChatGPT. They also acknowledge that using only multiple-choice questions measures just one aspect of performance. Future tests will need to evaluate other qualities like an AI’s reasoning, coherence, and fluency, which are critical for more complex tasks.
So, should you start being rude to your AI? Probably not. The small accuracy boost might not be worth reinforcing bad habits. This study tells us less about the benefits of being mean and more about the strange, sensitive, and still unpredictable nature of the artificial intelligence shaping our future. It’s a reminder that even as these systems become more capable, they can still be swayed in odd ways, and we’re only just beginning to understand why. As AI becomes more integrated into our lives, from generating voices nearly indistinguishable from human ones to producing written content, understanding these nuances is more important than ever. The line between optimizing a tool and promoting toxic behavior is one we’ll all have to navigate carefully. The way we interact with AI agents today could very well shape our communication norms for tomorrow.