MDRS:CrewReportsLog
MDRS crew reports processing pipeline
The MDRS email archive consists of all official emails sent between Mission Control, habitat crews and other Mars Society staff over the past twenty years.
There are several email archive files to process; each one comprises multiple years of reports. Each file is multi-gigabyte and contains all of the emails as a single flat file.
The first archive covers the years 2011-2019 and contains about 18,000 emails in total.
The file is a flat text file formatted as an email export. Each email consists of a lengthy header followed by the text and HTML versions of the message, along with any attachments encoded as base64 text.
https://www.w3.org/Protocols/rfc1341/7_2_Multipart.html
The link above details the structure of emails in raw format. After the header, the different sections of the email are separated by a 'boundary' field. Each section has its own subset of fields which define the contents of that section.
An overall boundary field is defined in the primary header. This boundary is a kind of 'wrapper' boundary; at a minimum, it marks where the content of an email starts and ends. Within the body of the email, each section is delimited by a line that references this boundary.
The emails typically have multiple sections:
- A 'text/plain' content type, the flat text of an email
- A 'text/html' content type, the html version of the email
- A binary content type, 'application/zip' or 'image/jpg' for example, a base64 text representation of a file attachment
There are also internal boundaries, each defined in its own internal sub-header, which demarcate subsections: a wrapper inside of a wrapper.
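The nested layout described above can be explored with Python's standard email package. This is only a minimal sketch: the sample message, the boundary names OUTER and INNER, the addresses and the list_sections helper are all invented for illustration, and real archive emails will be messier.

```python
from email import message_from_string
from email.policy import default

# Invented sample: an outer 'wrapper' boundary (OUTER) containing an
# inner wrapper (INNER) for the text/html pair, plus one attachment.
RAW = """\
From: crew@example.org
To: missioncontrol@example.org
Subject: Sol 3 Commander Report
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="OUTER"

--OUTER
Content-Type: multipart/alternative; boundary="INNER"

--INNER
Content-Type: text/plain; charset="utf-8"

All systems nominal.
--INNER
Content-Type: text/html; charset="utf-8"

<p>All systems nominal.</p>
--INNER--
--OUTER
Content-Type: application/zip; name="data.zip"
Content-Disposition: attachment; filename="data.zip"
Content-Transfer-Encoding: base64

aGVsbG8=
--OUTER--
"""


def list_sections(raw):
    """Return the top-level boundary and (type, filename, size) per leaf section."""
    msg = message_from_string(raw, policy=default)
    sections = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # a wrapper only groups other sections
        # decode=True undoes the base64 transfer encoding of attachments
        payload = part.get_payload(decode=True)
        sections.append((part.get_content_type(), part.get_filename(), len(payload)))
    return msg.get_boundary(), sections


boundary, sections = list_sections(RAW)
for content_type, filename, size in sections:
    print(content_type, filename, size)
```

The attachment payload here is placeholder bytes rather than a real zip file; the point is only that walk() descends through both wrappers and that the base64 text is decoded back to bytes.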
Email is an open standard of mutually incompatible implementations.
The process has been broken down into several stages:
- Scan the master email file and split into individual emails
- Process each individual email and split into separate sections (text, html, attachments)
- Process each email header and collect data about each email:
  - To
  - From
  - Date
  - Subject
  - Boundary fields
  - Thread ID
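The splitting and header-collection stages above might look roughly like the following. This is a sketch under stated assumptions, not the pipeline's actual code: it assumes the export is in mbox style, where each email starts with a separator line beginning with "From " (and that such lines inside bodies are escaped), and it assumes the thread ID lives in an X-GM-THRID header as in Gmail exports. The names split_messages and header_summary and the sample data are hypothetical.

```python
import io
from email.parser import BytesHeaderParser
from email.policy import default


def split_messages(stream):
    """Yield each email as bytes, reading the archive line by line.

    Assumes mbox-style "From " separator lines; streaming avoids
    loading a multi-gigabyte file into memory.
    """
    current = None
    for line in stream:
        if line.startswith(b"From "):
            if current:
                yield b"".join(current)
            current = []  # the separator line itself is dropped
        elif current is not None:
            current.append(line)
    if current:
        yield b"".join(current)


def _text(value):
    return None if value is None else str(value)


def header_summary(raw):
    """Collect the header fields listed above from one raw email."""
    msg = BytesHeaderParser(policy=default).parsebytes(raw)
    return {
        "to": _text(msg["To"]),
        "from": _text(msg["From"]),
        "date": _text(msg["Date"]),
        "subject": _text(msg["Subject"]),
        "boundary": msg.get_boundary(),          # None if not multipart
        "thread_id": _text(msg["X-GM-THRID"]),   # assumed header name
    }


# Invented two-message archive for demonstration only.
SAMPLE = (
    b"From crew@example.org Mon Jan 02 10:00:00 2017\n"
    b"From: crew@example.org\n"
    b"To: missioncontrol@example.org\n"
    b"Date: Mon, 02 Jan 2017 10:00:00 -0700\n"
    b"Subject: Crew 172 Sol 3 Commander Report\n"
    b"X-GM-THRID: 1234567890\n"
    b'Content-Type: multipart/mixed; boundary="OUTER"\n'
    b"\n"
    b"--OUTER--\n"
    b"From staff@example.org Tue Jan 03 09:00:00 2017\n"
    b"From: staff@example.org\n"
    b"To: crew@example.org\n"
    b"Subject: Re: supplies\n"
    b"Content-Type: text/plain\n"
    b"\n"
    b"Received, thanks.\n"
)

rows = [header_summary(m) for m in split_messages(io.BytesIO(SAMPLE))]
```

For a real archive the stream would come from open(path, "rb") instead of io.BytesIO. Parsing only the headers at this stage keeps the pass cheap; the body sections can be split in a later pass per email.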
The next stage will require parsing the subject and email body sections, looking for keywords. As part of our work with MDRS researchers, we will need to locate this data and mark up the emails with it:
- Type of Report (commander, science, operations, etc)
- Person who submitted
- Position of person who submitted
- Crew #
- Field Season #
- Date submitted
- Sol # of crew rotation (if available)
- Text of report
- Any photo attachments of report
- Any other file attachments
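As a starting point for the keyword stage, a few of the fields above could be pulled from the subject line with regular expressions. This is a rough sketch only: the subject formats, the REPORT_TYPES list and the tag_subject helper are assumptions made for illustration, and real subjects will need more patterns plus a fallback to the email body.

```python
import re

# Report types named in the list above; the real set is longer ("etc").
REPORT_TYPES = ("commander", "science", "operations")

# Assumed patterns: "Crew 172", "Crew #172", "Sol 3", "Sol #3".
CREW_RE = re.compile(r"\bcrew\s*#?\s*(\d+)", re.IGNORECASE)
SOL_RE = re.compile(r"\bsol\s*#?\s*(\d+)", re.IGNORECASE)


def tag_subject(subject):
    """Return report type, crew number and sol number found in a subject."""
    lowered = subject.lower()
    report_type = next((t for t in REPORT_TYPES if t in lowered), None)
    crew = CREW_RE.search(subject)
    sol = SOL_RE.search(subject)
    return {
        "report_type": report_type,
        "crew": int(crew.group(1)) if crew else None,
        "sol": int(sol.group(1)) if sol else None,
    }


print(tag_subject("Crew 172 Sol 3 Commander Report"))
print(tag_subject("Science Report - Crew #180"))
```

Fields that return None here (for example the sol number, which is only present if available) would be candidates for a second pass over the text of the report.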