A breakdown of GoldFynch's deduplication strategies, to help you understand how each one works.
For more details on the deduplication feature in general, click here.
1. Hash-Based Deduplication
What it is: Generates a hash value (like a fingerprint) for a file or email. If two items have the same hash, they are considered identical.
Strengths:
Very accurate for standalone files or perfectly identical emails.
Detects duplicates even when filenames differ, as long as the content is exactly the same.
Limitations:
For emails, even a tiny difference (e.g., a space in a header or a different timestamp) changes the hash, so duplicates might be missed.
Example:
Two PDFs with identical text but different filenames → Hash match → Duplicate.
Same email sent to two people at slightly different times → Different hash → Not detected as duplicate.
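The hash comparison described above can be sketched roughly as follows. This is a minimal illustration, not GoldFynch's implementation: SHA-256 is an assumed hash algorithm, and the function names and sample data are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a SHA-256 hex digest of the raw bytes (the 'fingerprint')."""
    # Assumption: SHA-256 stands in for whatever hash the real system uses.
    return hashlib.sha256(data).hexdigest()


def find_hash_duplicates(items: dict[str, bytes]) -> list[tuple[str, str]]:
    """Return (duplicate_name, original_name) pairs for repeated content."""
    seen: dict[str, str] = {}  # digest -> name of the first item with that digest
    duplicates = []
    for name, data in items.items():
        digest = content_hash(data)
        if digest in seen:
            duplicates.append((name, seen[digest]))
        else:
            seen[digest] = name
    return duplicates


# Hypothetical sample data: two files with identical bytes but different names,
# and two emails that differ only in their Date header.
files = {
    "report_v1.pdf": b"identical content",
    "copy_of_report.pdf": b"identical content",
    "email_a.eml": b"Date: Mon, 01 Jan 2024 08:00:00 +0000\n\nHello",
    "email_b.eml": b"Date: Mon, 01 Jan 2024 08:01:00 +0000\n\nHello",
}
duplicates = find_hash_duplicates(files)
```

Only the two PDFs are paired, because the filename is not part of the hashed bytes. The two emails differ by one minute in a header, so their hashes differ and they are not matched.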
2. Message-ID Based Deduplication
What it is: Uses the Message-ID header from the email, which is intended to be globally unique.
Strengths:
Good for catching duplicates across different mailboxes.
Limitations:
Not always unique - some systems reuse IDs, strip them, or generate invalid ones.
Forwarded or resent messages may keep the same Message-ID even if the body changes.
Example:
Two copies of the same email retrieved from two different accounts → Same Message-ID → Duplicate.
A newsletter resent with minor edits but same Message-ID → False duplicate risk.
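As a rough sketch of Message-ID matching, the snippet below uses Python's standard email parser to read the header and keeps the first email seen for each ID. The function names and sample messages are hypothetical, and always keeping emails that lack a Message-ID is an assumption made for this example rather than documented GoldFynch behavior.

```python
from email import message_from_string


def message_id_of(raw_email: str):
    """Return the stripped Message-ID header, or None if missing or empty."""
    value = message_from_string(raw_email).get("Message-ID")
    if value is None:
        return None
    return value.strip() or None


def dedupe_by_message_id(raw_emails):
    """Keep the first email per Message-ID; emails without an ID are kept."""
    seen = set()
    unique = []
    for raw in raw_emails:
        key = message_id_of(raw)
        if key is not None:
            if key in seen:
                continue  # same Message-ID already kept: treated as a duplicate
            seen.add(key)
        unique.append(raw)
    return unique


# Hypothetical sample data: one email exported from two mailboxes, plus a
# newsletter resent with an edited body but the same Message-ID.
mailbox_a = "Message-ID: <123@example.com>\nX-Mailbox: alice\nSubject: Q3 report\n\nSee attached."
mailbox_b = "Message-ID: <123@example.com>\nX-Mailbox: bob\nSubject: Q3 report\n\nSee attached."
newsletter = "Message-ID: <news@example.com>\nSubject: March news\n\nOriginal text."
resent = "Message-ID: <news@example.com>\nSubject: March news\n\nCorrected text."

kept = dedupe_by_message_id([mailbox_a, mailbox_b, newsletter, resent])
```

The copy from the second mailbox is dropped even though another header differs, which is the intended behavior. The resent newsletter is dropped too, although its body changed, which illustrates the false-duplicate risk.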
3. Message-ID + Subject
What it is: Matches on both the Message-ID and the email subject.
Strengths:
Reduces false duplicates when a Message-ID is reused by different messages.
Limitations:
False positives are still possible if different emails share the same Message-ID and subject but have different content.
Example:
Automated daily alerts that reuse the same Message-ID but have different subjects → Not flagged as duplicate.
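One way to picture this strategy is as a two-part key built from both headers. The sketch below is illustrative only: the function name and sample alerts are hypothetical, and stripping surrounding whitespace from each header is an assumed normalization step.

```python
from email import message_from_string


def id_subject_key(raw_email: str) -> tuple[str, str]:
    """Build a (Message-ID, Subject) key; a missing header becomes ''."""
    msg = message_from_string(raw_email)
    message_id = (msg.get("Message-ID") or "").strip()
    subject = (msg.get("Subject") or "").strip()
    return (message_id, subject)


# Hypothetical alerts from a system that reuses one Message-ID every day.
monday = "Message-ID: <alert@monitor.example>\nSubject: Daily alert 2024-01-01\n\nAll systems normal."
tuesday = "Message-ID: <alert@monitor.example>\nSubject: Daily alert 2024-01-02\n\nDisk usage high."
# Same Message-ID and subject as Monday's alert, but a different body.
edited = "Message-ID: <alert@monitor.example>\nSubject: Daily alert 2024-01-01\n\nCorrection: disk usage high."

alerts_differ = id_subject_key(monday) != id_subject_key(tuesday)
false_positive = id_subject_key(monday) == id_subject_key(edited)
```

The differing subjects keep the two daily alerts apart, so `alerts_differ` is true. The edited message still shares both fields with Monday's alert, so `false_positive` is also true: the body never enters the key.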
4. Message-ID + Subject + Time
What it is: Combines three fields - Message-ID, subject, and timestamp - for matching.
Strengths:
Much tighter deduplication - reduces the chance of accidental matches when IDs or subjects are reused.
Good for matching exact copies of the same email from different sources.
Limitations:
Timestamp differences between recipients (e.g., CC vs. TO) can prevent matching.
Example:
An email sent to Person A at 8:00 AM and Person B at 8:01 AM → Same ID & subject but different times → Not flagged as duplicate.
Exact copies of an email imported from two archives → Match → Duplicate.
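The three-field match can be sketched as a key that adds a normalized timestamp to the Message-ID and subject. This is a hedged illustration: it assumes the timestamp comes from the Date header and is compared in UTC, which the description above does not specify, and the function name and sample messages are hypothetical.

```python
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def id_subject_time_key(raw_email: str):
    """Build a (Message-ID, Subject, UTC timestamp) key for matching."""
    msg = message_from_string(raw_email)
    message_id = (msg.get("Message-ID") or "").strip()
    subject = (msg.get("Subject") or "").strip()
    sent_at = None
    date_header = msg.get("Date")
    if date_header:
        try:
            parsed = parsedate_to_datetime(date_header)
            if parsed.tzinfo is not None:
                # Assumption: compare in UTC so equivalent offsets still match.
                parsed = parsed.astimezone(timezone.utc)
            sent_at = parsed.isoformat()
        except (TypeError, ValueError):
            sent_at = None  # unparseable Date header: leave the time unset
    return (message_id, subject, sent_at)


headers = "Message-ID: <42@example.com>\nSubject: Lunch\n"
# Hypothetical copies of one email from two archives (same instant, different offset).
archive_1 = headers + "Date: Mon, 01 Jan 2024 08:00:00 +0000\n\nNoon?"
archive_2 = headers + "Date: Mon, 01 Jan 2024 09:00:00 +0100\n\nNoon?"
# Hypothetical copy delivered to another recipient one minute later.
person_b = headers + "Date: Mon, 01 Jan 2024 08:01:00 +0000\n\nNoon?"

archives_match = id_subject_time_key(archive_1) == id_subject_time_key(archive_2)
recipients_match = id_subject_time_key(archive_1) == id_subject_time_key(person_b)
```

The two archive copies produce the same key and would be flagged as duplicates. The copy stamped one minute later produces a different key, so it is not flagged, which matches the limitation described above.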