What is metadata?
Metadata is ‘data about data,’ and all computer files have it. As you create and modify files on your computer, the applications you use (e.g. Microsoft Word) record all sorts of information about those files, such as who created them, when they were created, and when they were last opened. This metadata serves as a digital footprint that tracks the history of the document. It’s useful in many ways, including searching for specific files or file types. All files have metadata embedded in them, but you won’t see it unless you know where to look.
How much metadata is there?
There are hundreds of different types of metadata. Some of them are easy to find—e.g. the author of a document, how much time was spent editing the document, and where it’s stored. But some of them are hard to find unless you have technical skills—e.g. the history of all edits to a document.
Metadata for word processing documents
Let's take a look at a Microsoft Word document’s metadata:
- Filename and size
- When (date and time) the file was created, and who created it
- When (date and time) it was last modified, and who modified it
- How many times and when it has been accessed, changed, or altered
- Where it’s stored on the hard drive or computer network, and (occasionally) the GPS location of where it was created
You can check out some of the metadata for your own Microsoft Word files:
- On Windows:
  - Right-click a Word file and then click ‘Properties’
  - Click the ‘Details’ tab. Here you’ll see a lot of the file’s metadata
- On Mac:
  - Right-click a Word file and select ‘Get Info’
  - Click the arrow next to ‘General.’ The section will expand and you’ll see some of the file’s metadata
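If you would rather look at this information from a script, the sketch below reads a few of the same file system fields with Python's standard library. The function name `file_system_metadata` and the choice of fields are illustrative assumptions, and the values it returns come from the file system, not from inside the Word file itself.

```python
import os
from datetime import datetime, timezone
from pathlib import Path


def file_system_metadata(path):
    """Return a few file system (not file-internal) metadata fields for a file."""
    info = os.stat(path)
    return {
        "name": Path(path).name,
        "size_bytes": info.st_size,
        # Last time the file's contents were modified.
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        # Last time the file was read; some systems update this lazily or not at all.
        "accessed": datetime.fromtimestamp(info.st_atime, tz=timezone.utc),
    }
```

Calling `file_system_metadata("report.docx")` (a hypothetical path) returns a dictionary with the file's name, size, and two of its file system timestamps in UTC.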
Extracting and referencing metadata is extremely useful, which is why GoldFynch has been built to be good at doing it. If you'd like a more in-depth look at metadata, check out this article and this solution (on email metadata).
File System Data vs. Internal File Metadata?
The distinction between file system dates and internal file metadata dates is usually glossed over in load files and in eDiscovery generally. It's very rare for a load file to indicate explicitly whether its created / modified dates are file system dates.
File system dates are, in general, complicated and misunderstood. Which dates are stored, what they actually mean, and when they are updated all depend on the specific file system being used, the operating system, and the version and settings of that operating system.
File system dates can be counterintuitive. For example, suppose a file on a Windows NTFS drive has modified and created dates of Jan 2021. If you copy and paste the file to make a copy, the copy's modified date remains 2021, while its created date is today, because that is when that specific file was created in the file system. Meanwhile, a file-internal created date, such as those found in PDF and Office docs, does not change during the copy operation and still indicates the date the document was first created.
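The copy behavior described above can be approximated in a short Python sketch. It uses `shutil.copy2`, which copies a file and tries to preserve its modified time; that is an assumption standing in for a copy and paste in the file manager, not a description of what Windows does internally. The file names and the helper `copy_and_compare` are made up for illustration.

```python
import os
import shutil
import tempfile
from pathlib import Path


def copy_and_compare(src, dst):
    """Copy src to dst and return both files' modified times (seconds since the Unix epoch)."""
    shutil.copy2(src, dst)  # copies contents and tries to preserve the modified time
    return os.stat(src).st_mtime, os.stat(dst).st_mtime


with tempfile.TemporaryDirectory() as folder:
    original = Path(folder) / "report.txt"
    original.write_text("example contents")

    # Back-date the original to Jan 1, 2021 (UTC) to mimic an older file.
    jan_2021 = 1609459200
    os.utime(original, (jan_2021, jan_2021))

    src_mtime, dst_mtime = copy_and_compare(original, Path(folder) / "report - Copy.txt")

    # Allow a small tolerance because some file systems store coarse timestamps.
    modified_date_preserved = abs(src_mtime - dst_mtime) < 2
```

The sketch only compares modified times. The created date is left out on purpose, because how (and whether) Python exposes a file's creation time differs between operating systems and Python versions.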
Because of all of the complications with understanding and collecting / preserving file system dates, and because the file's internal metadata dates that are recorded and updated by the native editing applications are often more relevant, it is common for eDiscovery processing systems to use those internal dates.
This is what GoldFynch does when processing native files: we pull dates from the native file's internal metadata (assuming that the file format supports metadata).
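To give a feel for what "internal metadata" looks like, here is a minimal sketch for the newer Office formats (.docx, .xlsx, .pptx), which are zip packages that normally carry their created and modified dates in a `docProps/core.xml` part. This is not GoldFynch's implementation, and it does not apply to older binary formats such as .xls. The function names and the sample dates are invented for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace of the Dublin Core "terms" elements used in docProps/core.xml.
DCTERMS = "{http://purl.org/dc/terms/}"


def read_internal_dates(office_file):
    """Return the internal created/modified strings, or None where a field is absent."""
    with zipfile.ZipFile(office_file) as package:
        if "docProps/core.xml" not in package.namelist():
            return {"created": None, "modified": None}
        root = ET.fromstring(package.read("docProps/core.xml"))
    return {
        "created": root.findtext(DCTERMS + "created"),
        "modified": root.findtext(DCTERMS + "modified"),
    }


def build_sample_package():
    """Build a tiny in-memory package with made-up dates, just to exercise the reader."""
    core = (
        '<cp:coreProperties '
        'xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties" '
        'xmlns:dcterms="http://purl.org/dc/terms/" '
        'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
        '<dcterms:created xsi:type="dcterms:W3CDTF">2021-01-15T09:30:00Z</dcterms:created>'
        '<dcterms:modified xsi:type="dcterms:W3CDTF">2021-02-01T17:45:00Z</dcterms:modified>'
        '</cp:coreProperties>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("docProps/core.xml", core)
    buffer.seek(0)
    return buffer


dates = read_internal_dates(build_sample_package())
```

Because these values live inside the file, copying the file to a new location leaves them unchanged, which is the property that makes them attractive for processing.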
The dates seen in GoldFynch were taken from the internal metadata of the native xls file, which contains a "created" and a "last saved / modified" timestamp. Excel xls files store these dates internally as FILETIME values, i.e. the number of 100-nanosecond intervals that have passed since Jan 1, 1601 (UTC) (see here: https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-filetime). A stored value of 0 therefore decodes to Jan 1, 1601, which seems to match the dates seen in GoldFynch, so it's likely that this xls file has its timestamp values set to 0. Normally, an xls file simply omits these date fields from the metadata to indicate that they don't exist or haven't been recorded, and the 0 value here likely indicates the same thing. We should probably add a filter to ensure that reported timestamps are reasonable before storing them.
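As a rough sketch of what such a filter could look like, the Python below decodes a FILETIME value and rejects results that look unset or implausible. The cutoff of Jan 1, 1980 and the function names are assumptions chosen for illustration, not GoldFynch's actual rules.

```python
from datetime import datetime, timedelta, timezone

# FILETIME counts 100-nanosecond intervals from this moment.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)
# Assumed cutoff for illustration only; choose a bound that suits your data.
EARLIEST_PLAUSIBLE = datetime(1980, 1, 1, tzinfo=timezone.utc)


def filetime_to_datetime(filetime):
    """Convert a FILETIME integer to a timezone-aware UTC datetime."""
    return FILETIME_EPOCH + timedelta(microseconds=filetime // 10)


def reasonable_timestamp(filetime, now=None):
    """Return the decoded datetime, or None if the value looks unset or implausible."""
    now = now or datetime.now(timezone.utc)
    try:
        decoded = filetime_to_datetime(filetime)
    except OverflowError:
        # The value decodes to a date beyond what datetime can represent.
        return None
    if decoded < EARLIEST_PLAUSIBLE or decoded > now:
        return None
    return decoded
```

With this sketch, `filetime_to_datetime(0)` gives Jan 1, 1601 (UTC), while `reasonable_timestamp(0)` returns `None`, so a zeroed field would be treated the same way as a missing one.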