Digger: Detecting Copyright Content Mis-usage in Large Language Model Training

Pre-training, which utilizes extensive and varied datasets, is a critical factor in the success of Large Language Models (LLMs) across numerous applications. However, the detailed makeup of these datasets is often not disclosed, leading to concerns about data security and…

Year: 2024
ArXiv: arxiv.org/abs/2401.00676
URL: arxiv.org/abs/2401.00676v1
Hosting: External source, license unknown

Attribution

Abstract & full text: arxiv.org/abs/2401.00676v1
TL;DR: Semantic Scholar