A breakdown of GoldFynch's deduplication strategies, covering how each one works, where it is strongest, and where it falls short.


For more details on the deduplication feature in general, click here.


1. Hash-Based Deduplication

  • What it is: Generates a hash value (like a fingerprint) for a file or email. If two items have the same hash, they are considered identical.

  • Strengths:

    • Very accurate for standalone files or perfectly identical emails.

    • Detects duplicates even when filenames differ, as long as the content is exactly the same.

  • Limitations:

    • For emails, even a tiny difference (e.g., a space in a header or a different timestamp) changes the hash, so duplicates might be missed.

  • Example:

    • Two PDFs with identical text but different filenames → Hash match → Duplicate.

    • Same email sent to two people at slightly different times → Different hash → Not detected as duplicate.
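
The two examples above can be reproduced with a minimal Python sketch. It assumes SHA-256 as the hash function and hashes the raw bytes of each item; the algorithm GoldFynch actually uses is not stated here, and the file names and contents are made up for illustration.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw bytes (assumed algorithm)."""
    return hashlib.sha256(data).hexdigest()


def find_duplicates(items: dict[str, bytes]) -> list[list[str]]:
    """Group item names by content hash and keep groups with more than one member."""
    groups: dict[str, list[str]] = {}
    for name, data in items.items():
        groups.setdefault(content_hash(data), []).append(name)
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical items: two byte-identical PDFs and two emails that differ
# only in the timestamp header.
items = {
    "report.pdf": b"identical bytes",
    "report_copy.pdf": b"identical bytes",
    "email_a.eml": b"Date: Mon, 01 Jan 2024 08:00:00 +0000\n\nHello",
    "email_b.eml": b"Date: Mon, 01 Jan 2024 08:01:00 +0000\n\nHello",
}

dupes = find_duplicates(items)
print(dupes)  # [['report.pdf', 'report_copy.pdf']]
```

The PDFs match despite their different names, while the one-minute difference in the email header is enough to produce two different hashes.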


2. Message-ID Based Deduplication

  • What it is: Uses the Message-ID header from the email, which is intended to be globally unique.

  • Strengths:

    • Good for catching duplicates across different mailboxes.

  • Limitations:

    • Not always unique - some systems reuse IDs, strip them, or generate invalid ones.

    • Resent messages, and forwarded messages in some systems, may keep the same Message-ID even if the body changes.

  • Example:

    • Two copies of the same email retrieved from two different accounts → Same Message-ID → Duplicate.

    • A newsletter resent with minor edits but same Message-ID → False duplicate risk.
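
As a rough illustration, the sketch below groups emails by their Message-ID header using Python's standard `email` module. It is not GoldFynch's implementation: the function names, the sample messages, the `X-Mailbox` header, and the choice to leave emails without a Message-ID ungrouped are all assumptions made for the example.

```python
from email import message_from_string
from typing import Optional


def message_id(raw: str) -> Optional[str]:
    """Return the Message-ID header value, or None if it is missing or empty."""
    value = message_from_string(raw).get("Message-ID")
    value = value.strip() if value else ""
    return value or None


def group_by_message_id(emails: dict[str, str]) -> dict[str, list[str]]:
    """Map each Message-ID to the emails that carry it, keeping only repeated IDs."""
    groups: dict[str, list[str]] = {}
    for name, raw in emails.items():
        mid = message_id(raw)
        if mid is None:
            continue  # assumption: an email without a usable ID is left ungrouped
        groups.setdefault(mid, []).append(name)
    return {mid: names for mid, names in groups.items() if len(names) > 1}


# Hypothetical sample data.
emails = {
    "alice_inbox.eml": "Message-ID: <1@example.com>\n\nQuarterly numbers attached.",
    "bob_inbox.eml": "Message-ID: <1@example.com>\nX-Mailbox: bob\n\nQuarterly numbers attached.",
    "newsletter_v1.eml": "Message-ID: <news@example.com>\n\nSale ends Friday.",
    "newsletter_v2.eml": "Message-ID: <news@example.com>\n\nSale ends Sunday.",
    "no_id.eml": "Subject: hello\n\nNo Message-ID header here.",
}

duplicates = group_by_message_id(emails)
for mid, names in duplicates.items():
    print(mid, names)
```

The two mailbox copies are grouped correctly, but so are the two newsletter versions, whose bodies differ: this is the false-duplicate risk described above.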


3. Message-ID + Subject

  • What it is: Matches on both the Message-ID and the email subject.

  • Strengths:

    • Reduces false duplicates when a Message-ID is reused by different messages.

  • Limitations:

    • False positives are still possible when different emails share the same Message-ID and subject but have different content.

  • Example:

    • An automated system reuses the same Message-ID for every daily alert, but each alert has a different subject → Not flagged as duplicate.
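
A composite key makes the behavior concrete. The sketch below is illustrative only: it assumes the subject is compared exactly after trimming surrounding whitespace, which may not be how GoldFynch normalizes it, and `dedupe_key` and the sample alerts are hypothetical.

```python
from email import message_from_string


def dedupe_key(raw: str) -> tuple[str, str]:
    """Build a (Message-ID, Subject) key; missing headers become empty strings."""
    msg = message_from_string(raw)
    mid = (msg.get("Message-ID") or "").strip()
    subject = (msg.get("Subject") or "").strip()
    return (mid, subject)


# Hypothetical alerts that all reuse one Message-ID.
alert_monday = "Message-ID: <alert@monitor.example>\nSubject: Daily alert 2024-01-01\n\nDisk at 80%"
alert_tuesday = "Message-ID: <alert@monitor.example>\nSubject: Daily alert 2024-01-02\n\nDisk at 85%"
same_subject_new_body = "Message-ID: <alert@monitor.example>\nSubject: Daily alert 2024-01-01\n\nDisk at 99%"

# Different subjects: the keys differ, so the alerts are kept apart.
print(dedupe_key(alert_monday) == dedupe_key(alert_tuesday))  # False

# Same ID and subject but a different body: the keys still match (false positive).
print(dedupe_key(alert_monday) == dedupe_key(same_subject_new_body))  # True
```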


4. Message-ID + Subject + Time

  • What it is: Combines three fields - Message-ID, subject, and timestamp - for matching.

  • Strengths:

    • Much stricter matching: reduces the chance of accidental matches when IDs or subjects are reused.

    • Good for matching exact same emails from different sources.

  • Limitations:

    • If copies of the same email carry different timestamps for different recipients (e.g., CC vs. TO), they will not match.

  • Example:

    • An email sent to Person A at 8:00 AM and Person B at 8:01 AM → Same ID & subject but different times → Not flagged as duplicate.

    • Exact copies of an email imported from two archives → Match → Duplicate.
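
The three-field match can be sketched the same way. This example assumes the timestamp comes from the `Date` header and is compared to the second after conversion to UTC; which timestamp GoldFynch uses and at what precision is not stated here, so treat the details, the function name, and the sample messages as hypothetical.

```python
from datetime import datetime, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime
from typing import Optional


def dedupe_key(raw: str) -> tuple[str, str, Optional[datetime]]:
    """Build a (Message-ID, Subject, sent time in UTC) key for one raw email."""
    msg = message_from_string(raw)
    mid = (msg.get("Message-ID") or "").strip()
    subject = (msg.get("Subject") or "").strip()
    date_header = msg.get("Date")
    sent = None
    if date_header:
        sent = parsedate_to_datetime(date_header)
        if sent.tzinfo is None:
            # Assumption: treat a Date without a usable offset as UTC.
            sent = sent.replace(tzinfo=timezone.utc)
        sent = sent.astimezone(timezone.utc)
    return (mid, subject, sent)


headers = "Message-ID: <42@example.com>\nSubject: Contract draft\n"
copy_for_a = headers + "Date: Mon, 01 Jan 2024 08:00:00 +0000\n\nPlease review."
copy_for_b = headers + "Date: Mon, 01 Jan 2024 08:01:00 +0000\n\nPlease review."
archive_copy = headers + "Date: Mon, 01 Jan 2024 08:00:00 +0000\n\nPlease review."

# One minute apart: same ID and subject, but the keys differ.
print(dedupe_key(copy_for_a) == dedupe_key(copy_for_b))  # False

# Exact copy from a second archive: all three fields match.
print(dedupe_key(copy_for_a) == dedupe_key(archive_copy))  # True
```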