Decay No More: A Persistent Twitter Dataset for Learning Social Meaning

A new persistent Twitter dataset uses paraphrases to address data decay, enabling fair comparisons in social media research with minimal performance loss.

Year: 2022
Venue: arXiv 2022
Authors: 3

Attribution

Abstract & full text
arxiv.org/abs/2204.04611v2
TL;DR
Semantic Scholar

Abstract

With the proliferation of social media, many studies resort to social media to construct datasets for developing social meaning understanding systems. For the popular case of Twitter, most researchers distribute tweet IDs without the actual text contents due to the data distribution policy of the platform. One issue is that the posts become increasingly inaccessible over time, which leads to unfair comparisons and a temporal bias in social media research. To alleviate this challenge of data decay, we leverage a paraphrase model to propose a new persistent English Twitter dataset for social meaning (PTSM). PTSM consists of 17 social meaning datasets in 10 categories of tasks. We experiment with two SOTA pre-trained language models and show that our PTSM can substitute the actual tweets with paraphrases with marginal performance loss.
