From data to detection: developing a corpus and training language models for the identification of anti-refugee narratives in Spanish

Mata Vázquez, Jacinto; Gualda, Estrella; Pachón Álvarez, Victoria; Rebollo Díaz, Carolina; Domínguez Olmedo, Juan Luis

doi:10.1016/j.array.2025.100526

dc.contributor.author	Mata Vázquez, Jacinto
dc.contributor.author	Gualda, Estrella
dc.contributor.author	Pachón Álvarez, Victoria
dc.contributor.author	Rebollo Díaz, Carolina
dc.contributor.author	Domínguez Olmedo, Juan Luis
dc.date.accessioned	2025-11-17T07:53:15Z
dc.date.available	2025-11-17T07:53:15Z
dc.date.issued	2025
dc.description.abstract	This study addresses the automatic detection of anti-refugee messages in Spanish texts using language models based on pre-trained Transformer models. Despite numerous studies on hate speech detection, few have concentrated on Spanish, particularly regarding hostility towards refugees. To fill this gap, we developed HateRADAR-es, a new corpus of Spanish-language tweets manually annotated by expert sociologists and social workers to identify the presence or absence of hateful content directed at refugees. This dataset has been made available to the research community to encourage further investigation. We present a comprehensive experimental framework, composed of several stages, for obtaining language models that detect such messages with high efficacy. To address class imbalance in the data, data augmentation techniques are applied, and extensive experimentation is carried out to find the hyperparameter values that give the language models the best performance. In the evaluation, an ensemble of the fine-tuned models BETO, XLM-RoBERTa, and RoBERTa-large achieved the best results, with an accuracy of 0.891, an F1-measure of 0.860, and an AUC-ROC of 0.892. These findings underscore the effectiveness of combining multiple models into an ensemble to handle the complexity and nuances of hate speech on social media, offering a promising direction for future adaptations and applications of language models in specific hate contexts.
dc.description.department	Sociología, Trabajo Social y Salud Pública
dc.description.department	Tecnologías de la Información
dc.description.researchgroup	G.I. ESEIS, Estudios Sociales e Intervención Social (SEJ-216)
dc.description.sponsorship	This paper is part of the I+D+i Project titled "Conspiracy Theories and Hate Speech Online: Comparison of patterns in narratives and social networks about COVID-19, immigrants, refugees and LGBTI people [NON CONSPIRA HATE!]", PID2021-123983OB-I00, funded by MCIN/AEI/10.13039/501100011033/ and by FEDER/EU. The publication is part of grant JDC2022-048239-I, funded by MCIN/AEI/10.13039/501100011033 and by the European Union "NextGenerationEU"/PRTR. We also thank the research centers at the University of Huelva "Estudios Sociales e Intervención Social, ESEIS", "Pensamiento Contemporáneo e Innovación para el Desarrollo Social, COIDESO" and "Centro de Investigación en Tecnología, Energía y Sostenibilidad, CITES" for their support.
dc.identifier.citation	Mata, J., Gualda, E., Pachón, V., Rebollo, C., & Domínguez, J. L. (2025). From data to detection: Developing a corpus and training language models for the identification of anti-refugee narratives in Spanish. Array, 28, 100526. https://doi.org/10.1016/j.array.2025.100526
dc.identifier.doi	10.1016/j.array.2025.100526
dc.identifier.issn	2590-0056 (electronic)
dc.identifier.uri	https://hdl.handle.net/10272/27390
dc.language.iso	eng
dc.publisher	Elsevier
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights.accessRights	open access
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.subject.other	Deep learning
dc.subject.other	Language models
dc.subject.other	Transformers
dc.subject.other	Social media
dc.subject.other	Twitter
dc.subject.other	Hate speech
dc.subject.other	Refugees
dc.subject.unesco	1203.04 Artificial Intelligence
dc.subject.unesco	6308 Social Communications
dc.subject.unesco	6112.01 Discrimination
dc.title	From data to detection: developing a corpus and training language models for the identification of anti-refugee narratives in Spanish
dc.type	journal article
dc.type.hasVersion	VoR
dspace.entity.type	Publication
relation.isAuthorOfPublication	ac76819b-d91a-4158-b947-4a9e827e5e9d
relation.isAuthorOfPublication	e65f8d9d-ba99-49a1-9abf-34925101fabc
relation.isAuthorOfPublication	47cb4892-3513-4d33-953c-8521bc9cb187
relation.isAuthorOfPublication	122cbb60-00f2-4997-84ed-979737b38d0b
relation.isAuthorOfPublication	11d4312c-8591-4e26-b971-740ce012d168
relation.isAuthorOfPublication.latestForDiscovery	ac76819b-d91a-4158-b947-4a9e827e5e9d
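
The abstract above reports that an ensemble of fine-tuned models gave the best accuracy, F1-measure, and AUC-ROC. The record does not say how the models were combined, so the following is only a minimal sketch assuming soft voting (averaging each model's positive-class probability). The probability lists, the 0.5 threshold, and all function names are hypothetical toy values, not data or code from the paper.

```python
from statistics import mean


def soft_vote(prob_lists):
    """Average the positive-class probability of each example across models."""
    return [mean(per_example) for per_example in zip(*prob_lists)]


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc_roc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy labels (1 = hateful, 0 = not hateful) and per-model probabilities.
y_true = [1, 0, 1, 0, 1, 0]
model_a = [0.9, 0.2, 0.4, 0.6, 0.8, 0.1]
model_b = [0.7, 0.4, 0.6, 0.3, 0.9, 0.2]
model_c = [0.8, 0.1, 0.7, 0.5, 0.3, 0.3]

ensemble_scores = soft_vote([model_a, model_b, model_c])
y_pred = [int(score >= 0.5) for score in ensemble_scores]  # assumed 0.5 threshold

print("accuracy:", accuracy(y_true, y_pred))
print("F1-measure:", f1_measure(y_true, y_pred))
print("AUC-ROC:", auc_roc(y_true, ensemble_scores))
```

In this toy data each individual model misclassifies at least one example at the 0.5 threshold, while the averaged scores classify all six correctly, which illustrates why an ensemble can outperform its members. Other combination schemes, such as majority voting or weighted averaging, would fit the same structure.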

Files

Original bundle

Name: 1-s2.0-S2590005625001535-main.pdf
Size: 2.35 MB
Format: Adobe Portable Document Format

Collections

Articles