In eDiscovery, where vast volumes of data are scrutinized for legal purposes, efficiency and accuracy are paramount. Deduplication is one of the key tools for achieving both.


In this article, we cover the benefits of deduplication in the eDiscovery workflow and walk through reviewing the results of GoldFynch's deduplication feature. For a functional guide to deduplication in GoldFynch, see: Finding duplicate files in your GoldFynch case using deduplication.


Understanding Deduplication

Deduplication is the process of identifying and eliminating duplicate documents within a dataset. In eDiscovery, where datasets can be massive and contain redundant information, deduplication plays a crucial role in streamlining the review process. With duplicates removed from review, legal teams can focus on unique content, saving time and resources.


Benefits of Deduplication

  1. Cost Reduction: Deduplication reduces the volume of data that needs to be processed and reviewed, lowering storage and review costs.
  2. Time Savings: Removing duplicates accelerates the review process, allowing legal teams to reach relevant information sooner.
  3. Increased Accuracy: When only unique content is reviewed, each document is assessed once, reducing the risk that identical copies are treated inconsistently.
  4. Enhanced Review Efficiency: With a dataset of unique documents, reviewers can identify key documents more efficiently and make informed decisions.


Deduplication in GoldFynch

Deduplication in GoldFynch detects duplicates using either the MD5 hash value of your files or, for emails, the Message-ID. Files that are not exact matches under the selected deduplication strategy (MD5 hash based, Message-ID based, etc.) will not be detected as duplicates. An MD5 file hash serves as a digital "signature" for a file, and even the slightest change to the file's data (visible or binary) will change the hash.
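To illustrate how sensitive an MD5 hash is to small changes, here is a minimal Python sketch using the standard hashlib module. It is not GoldFynch's implementation; the helper name md5_of_bytes and the sample contents are made up for this example.

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


# Hypothetical file contents, used only for illustration.
original = b"Quarterly report, final version."
exact_copy = b"Quarterly report, final version."
edited = b"Quarterly report, final version. "  # one trailing space added

print(md5_of_bytes(original) == md5_of_bytes(exact_copy))  # True
print(md5_of_bytes(original) == md5_of_bytes(edited))      # False
```

The two byte-for-byte identical inputs share a hash, while the copy with a single extra space does not, which is why two files that look the same on screen may still not be detected as duplicates.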


Deduplication is also done at the root family level, because it is typically not desirable (or permitted) to exclude or remove duplicate attachments that belong to non-duplicate parent files. GoldFynch will therefore not mark attachment files as duplicates, even if their file hashes are the same, unless the parent files are also duplicates. To summarize: if you run a deduplication session using the hash-based strategy and no duplicates are detected, the duplicate-looking files are either attachments to non-duplicate parent files or are not, in fact, exact duplicates.
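The family-level rule can be sketched in Python as follows. This is a simplified model under stated assumptions, not GoldFynch's actual logic: the FileRecord type, the function names, and the placeholder hash strings are hypothetical; a root file is treated as a duplicate when an earlier root has the same hash, and a file with no parent carries its own ID as its parent ID.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    file_id: int
    parent_id: int  # equals file_id when the file has no parent
    md5: str        # placeholder strings stand in for real MD5 digests


def find_root(record, by_id):
    """Follow parent links up to the top-level (root) file."""
    while record.parent_id != record.file_id:
        record = by_id[record.parent_id]
    return record


def mark_duplicates(records):
    """Map each file_id to True if it is treated as a duplicate at the family level."""
    by_id = {r.file_id: r for r in records}
    seen_root_hashes = set()
    duplicate_roots = set()
    for r in records:
        if r.parent_id == r.file_id:  # only root files are compared
            if r.md5 in seen_root_hashes:
                duplicate_roots.add(r.file_id)
            else:
                seen_root_hashes.add(r.md5)
    # An attachment inherits the status of its root file.
    return {r.file_id: find_root(r, by_id).file_id in duplicate_roots for r in records}


records = [
    FileRecord(1, 1, "aaa"),  # email A
    FileRecord(2, 1, "fff"),  # attachment of email A
    FileRecord(3, 3, "bbb"),  # email B, different from A
    FileRecord(4, 3, "fff"),  # same attachment bytes, but under a non-duplicate parent
    FileRecord(5, 5, "aaa"),  # exact copy of email A
    FileRecord(6, 5, "fff"),  # attachment of the copy
]
print(mark_duplicates(records))
# {1: False, 2: False, 3: False, 4: False, 5: True, 6: True}
```

File 4 has the same hash as file 2 but is not marked, because its parent (email B) is not a duplicate. File 6 is marked only because its parent, file 5, duplicates email A.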


More information on why attachments are not detected in a deduplication session can be found here.


Reviewing GoldFynch's Deduplication Results


GoldFynch offers a comprehensive deduplication feature that simplifies the review process. Here's how to review the results of GoldFynch's deduplication report:

  1. Access the Deduplication Report: In GoldFynch, navigate to the deduplication section to access the deduplication report for your case. More information on generating this report, for a session that has either been applied or is waiting to be applied, can be found here.
  2. Review Duplicate Report: The report lists the duplicate items identified, each containing documents that are exact duplicates of one another. Review the report to gain an understanding of the duplicate documents.


Components of the GoldFynch Deduplication Results

The GoldFynch duplicate report contains the following information:

  • APP Link - This is a direct link to the document in your GoldFynch case (only accessible if you are logged into an account that has access to your case)
  • APP ID - GoldFynch's internal ID which is used to track each individual file that is uploaded
  • APP Parent ID - This is the ID of the parent document. If there is no parent, it is the same as the APP ID
  • Keep? - TRUE indicates that the file is the primary copy, and FALSE indicates that the file is a duplicate
  • File Name - File name of the document
  • Pathname - Path of the document in GoldFynch
  • Tags - All tags attached to the document will be listed


If the files are emails, the following fields will also be populated with the available metadata:


  • Subject 
  • From 
  • To
  • Cc
  • Bcc
  • Sent
  • Message ID

Note: If the source does not have metadata, these fields will be blank, even if the files are emails.
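If you want to summarize the report outside GoldFynch, the fields listed above lend themselves to a short script. The following Python sketch assumes the report is available as CSV with column headers matching the field names; the export format is an assumption, as are the function name summarize and the sample rows, which are invented for illustration and show only a subset of the columns.

```python
import csv
import io

# Invented sample rows; a real report would be opened from a file instead.
SAMPLE_REPORT = """APP ID,APP Parent ID,Keep?,File Name,Pathname,Tags
101,101,TRUE,contract.pdf,/production/contract.pdf,
102,102,FALSE,contract.pdf,/backup/contract.pdf,
103,103,TRUE,memo.docx,/production/memo.docx,privileged
"""


def summarize(report_file):
    """Split report rows into primary files and duplicates using the Keep? column."""
    primaries, duplicates = [], []
    for row in csv.DictReader(report_file):
        if row["Keep?"].strip().upper() == "TRUE":
            primaries.append(row["Pathname"])
        else:
            duplicates.append(row["Pathname"])
    return primaries, duplicates


primaries, duplicates = summarize(io.StringIO(SAMPLE_REPORT))
print(len(primaries), "primary,", len(duplicates), "duplicate")
# 2 primary, 1 duplicate
```

To run this against a real export, pass an open file object, for example open(path, newline=""), in place of the io.StringIO wrapper, and adjust the column names if your report differs.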


What next? 

Once the deduplication session is applied, the system does not automatically delete the duplicate items. Instead, it marks them with the system tag DUPE.


As part of the typical workflow, once these files have been marked with the system tag, we suggest you create a review set of your case, which will automatically exclude any system-marked duplicates. Other benefits of conducting your review using review sets are detailed here.