When receiving data from other parties, there is sometimes an opportunity to specify or negotiate the production format. In these instances, we recommend requesting data in as near-to-native a format as possible, in order to preserve as much of the original file metadata and forensic information as possible. This document outlines GoldFynch's ideal incoming production format (including detailed load file fields where necessary) and may be sent to other parties as a production format specification.
The following production formats are listed in decreasing order of preference for import:
1. Native load file production
2. Loose collection of native files
3. PDF load file production
4. Loose collection of document-based PDFs
5. TIFF load file production
6. Bulk PDF production
7. Loose collection of TIFF images
8. Paper
If opposing counsel will not produce documents in native format, a PDF production is the next best option, followed by a TIFF production. PDFs render text at much better quality. TIFF productions typically use highly compressed Group 4 encoding, which was designed to minimize file size for fax transmission. These TIFF files can be hard to read and can give low-quality OCR results.
NOTE: GoldFynch provides additional services including sourcing data directly from your clients. You can learn more about them here.
File naming
Each produced file should be assigned a single Bates number and named with that number (including any prefix).
Attachment and child files
- Attachment and child files will be referred to using the parent file’s Bates name
- The filename or location of the contained child file will be tracked relative to the parent file
- Attachments or child files of a native file should not be produced individually, nor assigned their own Bates numbers
NOTE: This does not apply in cases of redaction; see below for details
Redactions and non-native files
If a file has been redacted in GoldFynch, it will be in a non-native format. In this case, and in any other case where it is not possible or practical to produce a file in native format, the file should be rasterized or rendered and produced as a single PDF file with a searchable text layer.
- Any of these generated, non-native PDF files should still be placed in the “NATIVES” folder and have the “NATIVE_PATH” column populated in the load file
- The “TRUE_NATIVE” column in the load file should be set to “F” to indicate that the PDF is a derived representation of the true native file
- In the case that a file is produced in a non-native PDF format, other files in the family should then be assigned their own Bates numbers and produced in native format. These files should then have the “PARENT_ID” column populated to track the file family hierarchy
Example: Consider an MSG email that has a single ZIP attachment. Ideally, the MSG file would be assigned a single Bates number, e.g. FILE_0001, and renamed to FILE_0001.msg, and the ZIP attachment would not be produced separately. If the MSG file needs redactions, it would instead be rendered to a PDF file named FILE_0001.pdf, and the ZIP attachment would then be assigned Bates number FILE_0002 and have its “PARENT_ID” column set to “FILE_0001”.
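The MSG example above can be sketched as load-file rows. This is a minimal illustration, not GoldFynch's implementation: the function name `family_rows` and the flat "NATIVES/" path layout are assumptions made for the sketch, while the column names come from the load file fields in this document.

```python
def family_rows(parent_bates, parent_ext, children, redacted):
    """Build load-file rows for one file family (hypothetical helper).

    children is a list of (bates_number, extension) pairs for attachments.
    """
    if not redacted:
        # Unredacted: only the parent is produced; attachments stay inside it
        # and receive no Bates numbers of their own.
        return [{
            "DOC_ID": parent_bates,
            "PARENT_ID": "",
            "NATIVE_PATH": f"NATIVES/{parent_bates}{parent_ext}",
            "TRUE_NATIVE": "T",
        }]

    # Redacted: the parent becomes a derived PDF, flagged TRUE_NATIVE = "F".
    rows = [{
        "DOC_ID": parent_bates,
        "PARENT_ID": "",
        "NATIVE_PATH": f"NATIVES/{parent_bates}.pdf",
        "TRUE_NATIVE": "F",
        "ORIG_EXT": parent_ext,
    }]
    # Each family member is then produced natively with its own Bates number
    # and linked back to the parent through PARENT_ID.
    for child_bates, child_ext in children:
        rows.append({
            "DOC_ID": child_bates,
            "PARENT_ID": parent_bates,
            "NATIVE_PATH": f"NATIVES/{child_bates}{child_ext}",
            "TRUE_NATIVE": "T",
        })
    return rows


rows = family_rows("FILE_0001", ".msg", [("FILE_0002", ".zip")], redacted=True)
```

With `redacted=False`, the same call yields a single row for FILE_0001.msg and the ZIP attachment is not listed separately.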
Emails
The first preference for email productions is bulk export/archive files like PST or OST for Outlook/Exchange systems, and MBOX for Gmail or other mail services.
- One email archive file should be used per user or mailbox
- Each archive file should be assigned and named with a single Bates identifier
- In the case of such bulk email production, individual emails will not have an assigned identifier or Bates number. In these situations, emails will be tracked and identified using a combination of the container file’s identifier and the email’s subject and message-id
- Attachments will be tracked and identified by MD5 hash value or by name and parent email
- The load file “CUSTODIAN” column should be populated with the mailbox owner’s name and email address
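The tracking scheme for bulk archives can be illustrated for the MBOX case with Python's standard library. This is a sketch under stated assumptions: the function name `index_mbox` and the shape of the returned dictionaries are hypothetical, and PST or OST archives would need a third-party parser that is not shown here.

```python
import hashlib
import mailbox


def index_mbox(mbox_path, container_id):
    """List each email in an MBOX archive with the identifiers described above.

    container_id is the Bates identifier assigned to the archive file itself.
    """
    entries = []
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            attachments = []
            for part in msg.walk():
                filename = part.get_filename()
                if filename is None:
                    continue  # not an attachment part
                payload = part.get_payload(decode=True) or b""
                attachments.append({
                    "name": filename,
                    "md5": hashlib.md5(payload).hexdigest(),
                })
            entries.append({
                "container": container_id,
                "subject": msg.get("Subject", ""),
                "message_id": msg.get("Message-ID", ""),
                "attachments": attachments,
            })
    finally:
        box.close()
    return entries
```

MD5 is used here only because the specification names it as the attachment identifier; it serves as a fingerprint, not as a security measure.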
If bulk email production in PST or MBOX format is not possible, the next preference is for near-native individual MSG or EML (MIME) files.
- Each email should be assigned and named with a single Bates identifier
- The load file “CUSTODIAN” column should be populated with the mailbox owner’s name and email address
- MSG files or EML files that do not have an “X-Gmail-Label” header set should populate the “MAILBOX_FOLDER” column in the load file with the folder location of the email within the user’s mailbox. (Example: “Inbox/Invoices/2018”)
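The “MAILBOX_FOLDER” rule for individual EML files could be checked along these lines. The function name `mailbox_folder_value` is hypothetical, and returning `None` when a label header is present is an assumption of this sketch (the specification only says when the column should be populated). The sketch also accepts the plural spelling "X-Gmail-Labels", which Gmail exports commonly use; MSG files are not covered because they need a third-party parser.

```python
from email import message_from_bytes

# Header name as written in this specification, plus the plural spelling
# seen in Gmail exports (accepting both is an assumption of this sketch).
GMAIL_LABEL_HEADERS = ("X-Gmail-Label", "X-Gmail-Labels")


def mailbox_folder_value(eml_bytes, folder_path):
    """Return the MAILBOX_FOLDER value for one EML file, or None.

    folder_path is the email's location in the mailbox, for example
    "Inbox/Invoices/2018", as recorded by whoever exported the file.
    """
    msg = message_from_bytes(eml_bytes)
    if any(msg.get(name) for name in GMAIL_LABEL_HEADERS):
        # The labels already record the location, so the column is not
        # required under the rule above.
        return None
    return folder_path
```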
Other electronic documents and files
Other digital documents and files should be produced in native format, as found on the originating filesystem where possible, especially any files that represent:
- Audio
- Video
- CAD drawings
- Spreadsheets
- Documents with tracked changes
Where possible, electronic files should populate:
- an “OS” column in the load file, indicating the operating system of the originating electronic device. (Example: “Windows 10”)
- a “FILESYSTEM” column in the load file, indicating the filesystem of the originating disk. (Example: “NTFS”)
- the “FS_CREATED”, “FS_MODIFIED”, and “FS_ACCESSED” columns in the load file with ISO 8601 datetime strings, indicating the various timestamps as stored on the originating filesystem
- the “CUSTODIAN” field in the load file with a description of the owner of the originating electronic device
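As a rough illustration, the timestamp columns could be collected on the originating device with Python's standard library. The function names are hypothetical, and the sketch only gives meaningful values when run against the original filesystem, since copying a file usually resets its timestamps. The “FILESYSTEM” value is omitted because the standard library has no portable way to read it.

```python
import os
import platform
from datetime import datetime, timezone


def iso8601_utc(epoch_seconds):
    """Format a POSIX timestamp as an ISO 8601 UTC string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def filesystem_fields(path):
    """Collect OS and filesystem timestamp columns for one file."""
    st = os.stat(path)
    # st_birthtime is available on macOS and BSD (and on Windows in recent
    # Python versions). On older Windows builds st_ctime is the creation
    # time; on Linux st_ctime is the inode change time, so FS_CREATED is
    # left empty there rather than guessed.
    created = getattr(st, "st_birthtime", None)
    if created is None and platform.system() == "Windows":
        created = st.st_ctime
    return {
        "OS": f"{platform.system()} {platform.release()}",
        "FS_CREATED": iso8601_utc(created) if created is not None else "",
        "FS_MODIFIED": iso8601_utc(st.st_mtime),
        "FS_ACCESSED": iso8601_utc(st.st_atime),
    }
```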
Additionally, when available, files originating from Apple operating systems should populate:
- the “APPLE_WHEREFROM” load file column with the file’s “com.apple.metadata:kMDItemWhereFroms” attribute, in a semicolon-delimited list
- the “APPLE_QUARANTINE” load file column with the file’s “com.apple.quarantine” attribute, in a semicolon-delimited list
Paper documents converted into electronic documents
Paper documents should be:
- scanned with a resolution of at least 300 PPI
- produced as document-level PDF files, with searchable text layers
PDFs generated from scanned documents should:
- be placed in the “NATIVES” folder and have the “NATIVE_PATH” column in the load file set
- have the “TRUE_NATIVE” column in the load file set to “F” to indicate that the PDF is a derived representation of the original paper document
Load file and additional production formatting
The production should:
- consist of native files and generated PDF files
- name each file according to its assigned Bates number
- be placed in a folder named “NATIVES,” which may consist of numbered subdirectories
The load file itself should:
- be in DAT, CSV, or JSON format and be UTF-8 or UTF-16 encoded
- contain a leading Byte Order Mark (BOM) indicating the proper UTF text encoding
- reference the native location of files using a path relative to the top folder of the production
NOTE: In the case of JSON, the load file should be structured as an array, with one JSON object / key-value-map per produced file
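The encoding requirements above can be met in Python with the "utf-8-sig" codec, which writes a leading UTF-8 Byte Order Mark. This is a sketch only: the function names are hypothetical, a DAT load file with its own delimiters is not covered, and writing the BOM into the JSON variant is an assumption based on the BOM requirement above.

```python
import csv
import json


def write_csv_load_file(path, rows, columns):
    """Write rows (a list of dicts) as a UTF-8 CSV load file with a BOM."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8-sig", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=columns, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)


def write_json_load_file(path, rows):
    """Write rows as a JSON array with one object per produced file."""
    with open(path, "w", encoding="utf-8-sig") as fh:
        json.dump(rows, fh, ensure_ascii=False, indent=2)


rows = [{
    "DOC_ID": "FILE_0001",
    # Relative to the top folder of the production, not an absolute path.
    "NATIVE_PATH": "NATIVES/0001/FILE_0001.msg",
    "TRUE_NATIVE": "T",
}]
```

Columns missing from a row are written as empty cells in the CSV, so sparse rows such as the one above remain valid.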
Load file fields
Refer to the following table for load file fields, descriptions and examples:
Column Name | Description | Example |
DOC_ID | The Bates number (with prefix) of the file. | FILE_0001 |
PARENT_ID | The Bates number of the parent file, in the case that individual files of a family are produced individually due to redactions. | FILE_0001 |
NATIVE_PATH | The path to the native file, or to the derived/rendered PDF file. It should be relative to the top folder of the production. | NATIVES/0001/FILE_0001.msg |
TRUE_NATIVE | Indicates whether the file is truly a native file or is a derived PDF. | T |
CUSTODIAN | Description of who/where the file originated. | John Doe |
MAILBOX_FOLDER | Mailbox folder for individual email files | Inbox/Invoices/2018 |
OS | Name of the operating system where the file originated. | Windows 10 |
FILESYSTEM | Name of the disk filesystem where the file originated. | NTFS |
FS_CREATED | Created date & time from the original filesystem. | 2017-02-22T16:24:36Z |
FS_MODIFIED | Modified date & time from the original filesystem. | 2017-02-22T16:24:36Z |
FS_ACCESSED | Accessed date & time from the original filesystem. | 2017-02-22T16:24:36Z |
APPLE_WHEREFROM | “com.apple.metadata:kMDItemWhereFroms” field populated from Apple Finder metadata, as a semicolon-delimited list. | https://dl-web.dropbox.com/get/file.pdf; https://www.dropbox.com/ |
APPLE_QUARANTINE | “com.apple.quarantine” field populated from Apple Finder metadata. | 0001;55555555;Google Chrome; |
ORIG_EXT | For redacted files, the extension of the original, native file. | .msg |
ORIG_TYPE | For redacted files, the MIME filetype of the original, native file. | application/vnd.ms-outlook |
CREATED | For redacted files, the created date from the internal metadata of the original, native file. | 2017-02-22T16:24:36Z |
MODIFIED | For redacted files, the modified date from the internal metadata of the original, native file. | 2017-02-22T16:24:36Z |
AUTHOR | For redacted files, the author from the internal metadata of the original, native file. | Jane Doe |
SUBJECT | For redacted emails, the subject from the original, native file. | Fwd: Some Subject |
FROM | For redacted emails, the “from” field from the original, native file. | Jane Doe <[email protected]> |
TO | For redacted emails, the “to” field from the original, native file. | John Doe <[email protected]>; Jane Doe <[email protected]> |
CC | For redacted emails, the “cc” field from the original, native file. | John Doe <[email protected]>; Jane Doe <[email protected]> |
BCC | For redacted emails, the “bcc” field from the original, native file. | John Doe <[email protected]>; Jane Doe <[email protected]> |
SENT | For redacted emails, the “date” header or PidTagClientSubmitTime from the original, native file. | 2017-02-22T16:24:36Z |
RECEIVED | For redacted emails, the latest “received-by” date or PidTagMessageDeliveryTime from the original, native file. | 2017-02-22T16:24:36Z |
MESSAGE-ID | For redacted emails, the “message-id” header or PidTagInternetMessageId from the original, native file. | |
REFERENCES | For redacted emails, the “references” header or PidTagInternetReferences from the original, native file. | |
HEADERS | For redacted emails, the entire header section or PidTagTransportMessageHeaders from the original, native file. | Received-By: … |
MSG_CLASS | For redacted MSG files, the PidTagMessageClass from the original, native file. | IPM.Note |
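A receiving party might sanity-check each load file row against the table above before import. The checks below are a minimal sketch, not GoldFynch's validation logic: the function name `check_row` is hypothetical, and the sketch assumes forward-slash paths and “T”/“F” as the only valid “TRUE_NATIVE” values.

```python
from datetime import datetime
from pathlib import PurePosixPath

# Columns that the table above shows with ISO 8601 datetime examples.
DATE_COLUMNS = (
    "FS_CREATED", "FS_MODIFIED", "FS_ACCESSED",
    "CREATED", "MODIFIED", "SENT", "RECEIVED",
)


def check_row(row):
    """Return a list of problems found in one load-file row (a dict)."""
    problems = []
    if not row.get("DOC_ID"):
        problems.append("DOC_ID is empty")

    # NATIVE_PATH must be relative to the production's top folder.
    parts = PurePosixPath(row.get("NATIVE_PATH", "")).parts
    if not parts or parts[0] != "NATIVES":
        problems.append("NATIVE_PATH should be relative and start with NATIVES/")

    if row.get("TRUE_NATIVE") not in ("T", "F"):
        problems.append("TRUE_NATIVE should be T or F")

    for column in DATE_COLUMNS:
        value = row.get(column, "")
        if not value:
            continue  # optional columns may be left empty
        try:
            # Older Python versions do not accept a trailing "Z" directly.
            datetime.fromisoformat(value.replace("Z", "+00:00"))
        except ValueError:
            problems.append(f"{column} is not an ISO 8601 datetime: {value!r}")
    return problems
```

A row built from the table's example values produces an empty list, while a row with an absolute path or a date such as "02/22/2017" is reported.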