Same Same but Different: Evaluating Hate Speech Detoxification through an LLM-based Agentic Framework

Creators: Phan, Thuy Linh and Boyce, James and Xie, Hetiao and Namvar, Morteza and Risius, Marten
Title: Same Same but Different: Evaluating Hate Speech Detoxification through an LLM-based Agentic Framework
Item Type: Conference or Workshop Item
Event Title: (Proceedings of the) 46th International Conference on Information Systems (ICIS) “Achieving Digital Integration in the Age of AI”
Event Location: Nashville, Tennessee, USA
Event Dates: December 14-17, 2025
Projects: IDI
Type of Paper / Paper No.: ICIS2025-2633 (Short Paper)
Date: December 2025
Divisions: Informationsmanagement
Abstract (ENG): Evaluating the effectiveness of hate speech detoxification is an emerging challenge, particularly as large language models (LLMs) become central to content moderation. While text detoxification (TD) presents a promising alternative to deletion or banning, current evaluation methods remain limited. Human evaluation is costly and inconsistent, and existing automatic metrics often fail to capture social sensitivity. We introduce SAFE-TD, a Structured Agentic Framework for Evaluation of TD, which simulates three agent roles to assess detoxified outputs from multiple perspectives. Our preliminary analysis reveals four outcome types and identifies a critical risk: the generation of implicit hate speech that appears neutral but retains harmful meaning. These findings expose underexplored trade-offs in TD and limitations in existing evaluation practices. SAFE-TD contributes a scalable, socially grounded approach to evaluating LLM-based TD, offering a foundation for more ethical and nuanced AI development for online safety.
Forthcoming: No
Language: English
Uncontrolled Keywords: Text Detoxification, Large Language Models (LLMs), Multi-Agent Evaluation, Generative AI, Hate Speech Moderation, Ethical AI
Citation:

Phan, Thuy Linh and Boyce, James and Xie, Hetiao and Namvar, Morteza and Risius, Marten (2025) Same Same but Different: Evaluating Hate Speech Detoxification through an LLM-based Agentic Framework. In: (Proceedings of the) 46th International Conference on Information Systems (ICIS) “Achieving Digital Integration in the Age of AI”, December 14-17, 2025, Nashville, Tennessee, USA, Paper ICIS2025-2633 (Short Paper).
