
Behind Closed Words: Creating and Investigating the forePLay Annotated Dataset for Polish Erotic Discourse

forePLay, a Polish-language dataset for erotic content detection, shows that specialized Polish language models outperform multilingual alternatives, with transformer-based architectures handling imbalanced categories particularly well.

Year
2024
Venue
arXiv 2024
Authors
4

Abstract & full text
arxiv.org/abs/2412.17533v2

Abstract

The surge in online content has created an urgent demand for robust detection systems, especially in non-English contexts where current tools demonstrate significant limitations. We present forePLay, a novel Polish language dataset for erotic content detection, featuring over 24k annotated sentences with a multidimensional taxonomy encompassing ambiguity, violence, and social unacceptability dimensions. Our comprehensive evaluation demonstrates that specialized Polish language models achieve superior performance compared to multilingual alternatives, with transformer-based architectures showing particular strength in handling imbalanced categories. The dataset and accompanying analysis establish essential frameworks for developing linguistically-aware content moderation systems, while highlighting critical considerations for extending such capabilities to morphologically complex languages.
