Same Same but Different: Evaluating Hate Speech Detoxification through an LLM-based Agentic Framework

Creators:	Phan, Thuy Linh and Boyce, James and Xie, Hetiao and Namvar, Morteza and Risius, Marten
Title:	Same Same but Different: Evaluating Hate Speech Detoxification through an LLM-based Agentic Framework
Item Type:	Conference or Workshop Item
Event Title:	(Proceedings of the) 46th International Conference on Information Systems (ICIS) “Achieving Digital Integration in the Age of AI”
Event Location:	Nashville, Tennessee, USA
Event Dates:	December 14-17, 2025
Projects:	IDI
Type of Paper / Paper No.:	ICIS2025-2633 (Short Paper)
Date:	December 2025
Divisions:	Informationsmanagement
Abstract (ENG):	Evaluating the effectiveness of hate speech detoxification is an emerging challenge, particularly as large language models (LLMs) become central to content moderation. While text detoxification (TD) presents a promising alternative to deletion or banning, current evaluation methods remain limited. Human evaluation is costly and inconsistent, and existing automatic metrics often fail to capture social sensitivity. We introduce SAFE-TD, a Structured Agentic Framework for Evaluation of TD, which simulates three agent roles to assess detoxified outputs from multiple perspectives. Our preliminary analysis reveals four outcome types and identifies a critical risk: the generation of implicit hate speech that appears neutral but retains harmful meaning. These findings expose underexplored trade-offs in TD and limitations in existing evaluation practices. SAFE-TD contributes a scalable, socially grounded approach to evaluating LLM-based TD, offering a foundation for more ethical and nuanced AI development for online safety.
Forthcoming:	No
Language:	English
Uncontrolled Keywords:	Text Detoxification, Large Language Models (LLMs), Multi-Agent Evaluation, Generative AI, Hate Speech Moderation, Ethical AI
Citation:	Phan, Thuy Linh and Boyce, James and Xie, Hetiao and Namvar, Morteza and Risius, Marten (2025) Same Same but Different: Evaluating Hate Speech Detoxification through an LLM-based Agentic Framework. In: (Proceedings of the) 46th International Conference on Information Systems (ICIS) “Achieving Digital Integration in the Age of AI”, December 14-17, 2025, Nashville, Tennessee, USA, Paper ICIS2025-2633 (Short Paper).
