Deduplicating Training Data Makes Language Models Better

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini

Abstract

We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets—for example removing from C4 a single 61 word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer training steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. Code for deduplication is released at https://github.com/google-research/deduplicate-text-datasets.
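
The two tools the abstract refers to operate at the substring and document level; as a rough intuition for the simpler document-level case, the sketch below drops exact duplicates by hashing normalized text. This is a minimal illustration under our own naming, not the paper's released implementation, which uses suffix-array exact-substring matching and MinHash-based near-duplicate detection at much larger scale.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies hash identically.
    return " ".join(text.lower().split())

def exact_dedup(documents):
    """Keep only the first occurrence of each exactly repeated (normalized) document."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = [
    "By using this site you agree to our terms.",
    "By using this site  you agree to our TERMS.",  # repeated boilerplate, lightly altered
    "A genuinely novel training sentence.",
]
print(len(exact_dedup(corpus)))  # 2 -- the repeated boilerplate is kept only once
```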

Anthology ID:
2022.acl-long.577
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
8424–8445
Language:
URL:
https://aclanthology.org/2022.acl-long.577
DOI:
10.18653/v1/2022.acl-long.577
Bibkey:
lee-etal-2022-deduplicating
Cite (ACL):
Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2022. Deduplicating Training Data Makes Language Models Better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424–8445, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Deduplicating Training Data Makes Language Models Better (Lee et al., ACL 2022)
PDF:
https://aclanthology.org/2022.acl-long.577.pdf
Video:
https://aclanthology.org/2022.acl-long.577.mp4
Code
google-research/deduplicate-text-datasets + additional community code
Data
Billion Word Benchmark, RealNews, Wiki-40B

Export citation
  • BibTeX
  • MODS XML
  • Endnote
  • Preformatted
@inproceedings{lee-etal-2022-deduplicating,
    title = "Deduplicating Training Data Makes Language Models Better",
    author = "Lee, Katherine and Ippolito, Daphne and Nystrom, Andrew and Zhang, Chiyuan and Eck, Douglas and Callison-Burch, Chris and Carlini, Nicholas",
    editor = "Muresan, Smaranda and Nakov, Preslav and Villavicencio, Aline",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.577",
    doi = "10.18653/v1/2022.acl-long.577",
    pages = "8424--8445",
    abstract = "We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1{\%} of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets{---}for example removing from C4 a single 61 word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer training steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4{\%} of the validation set of standard datasets, thus allowing for more accurate evaluation. Code for deduplication is released at \url{https://github.com/google-research/deduplicate-text-datasets}.",
}

<?xml version="1.0" encoding="UTF-8"?><modsCollection xmlns="http://www.loc.gov/mods/v3"><mods ID="lee-etal-2022-deduplicating"> <titleInfo> <title>Deduplicating Training Data Makes Language Models Better</title> </titleInfo> <name type="personal"> <namePart type="given">Katherine</namePart> <namePart type="family">Lee</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Daphne</namePart> <namePart type="family">Ippolito</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Andrew</namePart> <namePart type="family">Nystrom</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Chiyuan</namePart> <namePart type="family">Zhang</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Douglas</namePart> <namePart type="family">Eck</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Chris</namePart> <namePart type="family">Callison-Burch</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Nicholas</namePart> <namePart type="family">Carlini</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <originInfo> <dateIssued>2022-05</dateIssued> </originInfo> <typeOfResource>text</typeOfResource> <relatedItem type="host"> <titleInfo> <title>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title> </titleInfo> <name type="personal"> <namePart type="given">Smaranda</namePart> <namePart type="family">Muresan</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Preslav</namePart> <namePart type="family">Nakov</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Aline</namePart> <namePart type="family">Villavicencio</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <originInfo> <publisher>Association for Computational Linguistics</publisher> <place> <placeTerm type="text">Dublin, Ireland</placeTerm> </place> </originInfo> <genre authority="marcgt">conference publication</genre> </relatedItem> <abstract>We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets—for example removing from C4 a single 61 word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer training steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. 
Code for deduplication is released at https://github.com/google-research/deduplicate-text-datasets.</abstract> <identifier type="citekey">lee-etal-2022-deduplicating</identifier> <identifier type="doi">10.18653/v1/2022.acl-long.577</identifier> <location> <url>https://aclanthology.org/2022.acl-long.577</url> </location> <part> <date>2022-05</date> <extent unit="page"> <start>8424</start> <end>8445</end> </extent> </part></mods></modsCollection>

%0 Conference Proceedings
%T Deduplicating Training Data Makes Language Models Better
%A Lee, Katherine
%A Ippolito, Daphne
%A Nystrom, Andrew
%A Zhang, Chiyuan
%A Eck, Douglas
%A Callison-Burch, Chris
%A Carlini, Nicholas
%Y Muresan, Smaranda
%Y Nakov, Preslav
%Y Villavicencio, Aline
%S Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
%D 2022
%8 May
%I Association for Computational Linguistics
%C Dublin, Ireland
%F lee-etal-2022-deduplicating
%X We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets—for example removing from C4 a single 61 word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer training steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation. Code for deduplication is released at https://github.com/google-research/deduplicate-text-datasets.
%R 10.18653/v1/2022.acl-long.577
%U https://aclanthology.org/2022.acl-long.577
%U https://doi.org/10.18653/v1/2022.acl-long.577
%P 8424-8445

Markdown (Informal)

[Deduplicating Training Data Makes Language Models Better](https://aclanthology.org/2022.acl-long.577) (Lee et al., ACL 2022)


FAQs

How does deduplicating training data make language models better?

Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer training steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over 4% of the validation set of standard datasets, thus allowing for more accurate evaluation.

Why is it important to remove duplicate data in machine learning?

Duplicate data is a common issue in datasets and can lead to inaccuracies and bias in analysis. Removing duplicates is an essential step in data cleaning and preprocessing, ensuring that the data is accurate and reliable for further analysis or modeling.

What are the text deduplication techniques?

Text deduplication techniques use algorithms such as clustering, fingerprinting, and deep learning models to distinguish true duplicates from similar but distinct data points.
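
As a concrete illustration of the fingerprinting idea, the sketch below builds a MinHash signature over word shingles using only the Python standard library and compares two near-identical sentences; the function names, shingle size, and number of hash functions are illustrative choices, not a reference implementation.

```python
import hashlib

def shingles(text: str, n: int = 3):
    """Return the set of n-word shingles (token n-grams) for a document."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}

def minhash_signature(items, num_perm: int = 64):
    """For each of num_perm salted hash functions, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    # The fraction of matching signature positions approximates the true Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox leaps over the lazy dog")
print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))  # roughly their shingle overlap
```

In practice, such signatures are usually bucketed with locality-sensitive hashing so that candidate duplicate pairs can be found without comparing every pair of documents.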

Why do we need to remove duplicate data?

Duplicates can significantly degrade the quality, accuracy, and reliability of your data and lead to inaccurate results in analysis or modeling. They often arise when two data sources use the same identifier but handle changes to the underlying record differently.

Why is duplicate data bad for machine learning?

Duplicate data can have detrimental effects on your machine learning models and outcomes, such as reducing data diversity and representativeness, which can lead to overfitting or biased models.

What is the purpose of data deduplication?

Data deduplication is a process that eliminates excessive copies of data and significantly decreases storage capacity requirements. Deduplication can be run as an inline process as the data is being written into the storage system and/or as a background process to eliminate duplicates after the data is written to disk.
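
As a hedged sketch of that storage-side process, the toy content-addressed store below chunks incoming data, hashes each chunk, and writes a chunk only if its hash has not been seen before (the inline case). The class, the fixed 4 KiB chunk size, and the in-memory dictionaries are illustrative assumptions, not a description of any particular storage product.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking; real systems often use content-defined chunking

class DedupStore:
    """Toy content-addressed store: identical chunks are persisted only once."""

    def __init__(self):
        self.chunks = {}   # chunk digest -> chunk bytes
        self.objects = {}  # object name -> ordered list of chunk digests

    def write(self, name: str, data: bytes) -> None:
        digests = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # inline dedup: skip chunks already stored
            digests.append(digest)
        self.objects[name] = digests

    def read(self, name: str) -> bytes:
        return b"".join(self.chunks[d] for d in self.objects[name])

store = DedupStore()
payload = b"x" * 10000
store.write("a.bin", payload)
store.write("b.bin", payload)  # the second copy adds no new chunks
assert store.read("b.bin") == payload
print(len(store.chunks))       # 3 unique chunks instead of 6
```

A background (post-process) deduplicator would instead scan chunks already on disk and replace repeats with references after the fact.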

Should you use data deduplication?

The primary goal of data deduplication is to optimise storage space, improve data management efficiency, and enhance data integrity. By identifying and removing redundant copies of the same data, organisations can reduce storage costs, streamline data access, and enhance overall data quality.

What is deduplication in NLP?

Deduplication refers to a method of eliminating a dataset's redundant data. A deduplication tool identifies extra copies of data and deletes them so that only a single instance is stored. Deduplication software typically works by analyzing the data to identify duplicate byte patterns.

Why is duplicated data a problem?

Good reporting requires accurate data that is free of duplicates. Reports generated from duplicate records are less reliable and cannot be used to make informed decisions, and the business will also find it harder to forecast future growth.

Why is it critical to reduce or eliminate duplicate records?

When you have duplicate records, it is more difficult for software to correctly match people to behaviors. One individual may be represented through multiple profiles, which gives increased weight to that particular person's activities.

Why is it important to avoid code duplication?

Duplicate code hurts readability, maintainability, and scalability; it duplicates effort and increases the chances of introducing errors and inconsistencies. One consequence of duplicated code is the burden it places on maintenance.

Why is it important not to duplicate information?

Inaccurate Reporting

If you're making important decisions based on the data and reporting your company creates, you need those reports to be accurate. Duplicate data can skew findings, so your decisions may be based on inaccurate information.
