Automated Harmful Content Detection Using Grammar-Focused Representations of Text Data by Daria Stetsenko (NASK PIB)
Harmful content detection is one of the most important topics in natural language processing. Every couple of years, surveys and theses are published that summarize the state-of-the-art approaches to the problem. In 2019, one of the shared tasks in the SemEval competition even focused solely on offensive language detection. Both classical and neural models have been proposed and compared in the literature at different levels of analysis: the lexical level (including sentiment analysis), meta-information (context), and, where applicable, the multimodal perspective. However, almost all of these models relied on semantic word or sentence embeddings, putting aside the potential of grammar, even though a lot of meaningful information can be extracted from the grammar layer.
This research project aims to find models that extract and process grammatical indicators of harmful content in written texts, and to investigate possible ways of representing them, most probably in combination with lexical ones. The research includes work on sentence and document embeddings that preserve linguistic information interpretable by both machine-learning algorithms and humans. We strive to set clear-cut linguistic boundaries of syntactic constructions and semantic features for the categorization of harmful content, defined as a general category including hate speech, descriptions of violence, and pornography. The goal is to determine a linguistic representation of each of them.
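As a rough illustration of what a grammar-focused, human-readable representation could look like, the sketch below counts part-of-speech bigrams for one sentence. It is not the method presented in the talk: the example sentence, its hand-assigned tags, and the function name `pos_ngrams` are assumptions made only for this illustration.

```python
from collections import Counter

# Hypothetical, hand-tagged example: (token, POS tag) pairs.
# In practice the tags would come from a tagger; they are hard-coded here
# so the sketch runs without any external library.
TAGGED_SENTENCE = [
    ("They", "PRON"),
    ("should", "AUX"),
    ("leave", "VERB"),
    ("now", "ADV"),
]


def pos_ngrams(tagged, n=2):
    """Count n-grams over the POS tags only, ignoring the words themselves."""
    tags = [tag for _, tag in tagged]
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


# Each feature name is a readable tag sequence, so a person can inspect it.
features = pos_ngrams(TAGGED_SENTENCE, n=2)
for ngram, count in sorted(features.items()):
    print("-".join(ngram), count)
```

Because every feature is a named tag sequence rather than an opaque vector dimension, such counts can be read by a person and also passed to a classifier, possibly alongside lexical features.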
The talk was delivered at the ML in PL Conference 2022 as part of the Contributed Talks. The conference was organized by the ML in PL Association, a non-profit NGO.
ML in PL Association website: https://mlinpl.org/
ML in PL Conference 2022 website: https://conference2022.mlinpl.org/
ML in PL Conference 2023 website: https://conference2023.mlinpl.org/
---
Founded on the experience of organizing the ML in PL Conference (formerly PL in ML), the ML in PL Association is a non-profit organization devoted to fostering the machine learning community in Poland and Europe and promoting a deep understanding of ML methods. Even though ML in PL is based in Poland, it seeks to provide opportunities for international cooperation.