0

tasksource: A Dataset Harmonization Framework for Streamlined NLP Multi-Task Learning and Evaluation

A structured annotation framework for harmonizing dataset schemas facilitates multi-task training and evaluation by providing standardized metadata and preprocessing instructions.

Year
2023
Venue
arXiv 2023
Authors
1
Hosting
Abstract onlyARXIV-DEFAULT

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2301.05948v3ARXIV-DEFAULT
TL;DR
Semantic Scholar
Attribution policy →

Abstract

The HuggingFace Datasets Hub hosts thousands of datasets, offering exciting opportunities for language model training and evaluation. However, datasets for a specific task type often have different schemas, making harmonization challenging. Multi-task training or evaluation necessitates manual work to fit data into task templates. Several initiatives independently tackle this issue by releasing harmonized datasets or providing harmonization codes to preprocess datasets into a consistent format. We identify patterns across previous preprocessing efforts, such as column name mapping and extracting specific sub-fields from structured data in a column. We then propose a structured annotation framework that ensures our annotations are fully exposed and not hidden within unstructured code. We release a dataset annotation framework and dataset annotations for more than 500 English tasks\footnote{\url{https://github.com/sileod/tasksource}}. These annotations include metadata, such as the names of columns to be used as input or labels for all datasets, which can save time for future dataset preprocessing, regardless of whether our framework is utilized. We fine-tune a multi-task text encoder on all tasksource tasks, outperforming every publicly available text encoder of comparable size in an external evaluation.

Authors

1