Data Provenance Recommendations

Some of the recommendations below are summarized from a great resource: https://mitcommlab.mit.edu/broad/commkit/file-structure

File structure

Track both your experimental and computational work in electronic lab notebooks.

Always put everything into scripts to ensure repeatability, even when typing a command directly at the command line seems quicker and easier in the moment.

Each folder should contain files with similar functionality or purposes.

A basic file structure will include:

  1. A unique main folder for the project
  2. Code folder(s), with Source Code (finalized code used to create the Results) in its own folder or subfolder
  3. Data folders separated by type
    1. Raw data, which should be immutable. Keep even poorly named raw data files as they are, and write scripts to transform them into intermediate files.
    2. Edited/Intermediate data, where you store each intermediate step as the data is transformed from a copy of its raw state to its final state.
    3. Finalized data, which is what the main analysis program reads. Separate folders prevent accidental misuse of intermediate data.
  4. Results folder
  5. A readme document with any important information about the project for yourself or collaborators
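
The raw-data rule in item 3 can be sketched in Python. This is a minimal sketch under stated assumptions: the function names `clean_name` and `stage_raw_files` and the renaming rule (lowercase, spaces to underscores) are hypothetical and should be adapted to your project. The raw files are copied, never renamed or edited, and are then marked read-only.

```python
import shutil
import stat
from pathlib import Path


def clean_name(name: str) -> str:
    """Hypothetical renaming rule: lowercase, spaces replaced by underscores."""
    return name.strip().lower().replace(" ", "_")


def stage_raw_files(raw_dir: Path, intermediate_dir: Path) -> list:
    """Copy each raw file into intermediate_dir under a cleaned name.

    The raw originals keep their names and contents and are made read-only.
    """
    intermediate_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for raw_file in sorted(raw_dir.iterdir()):
        if not raw_file.is_file():
            continue
        target = intermediate_dir / clean_name(raw_file.name)
        # copyfile copies contents only, so the copy stays writable.
        shutil.copyfile(raw_file, target)
        # Remove write permission from the raw original.
        raw_file.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
        staged.append(target)
    return staged
```

Because the transformation lives in a script, the intermediate files can be regenerated from the untouched raw data at any time.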

Some example file structures
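
One possible layout that follows the basic structure listed above, sketched as a Python script that creates the empty skeleton. The folder names (`code/source`, `data/raw`, and so on), the `README.md` filename, and the function `create_project_skeleton` are illustrative assumptions, not names prescribed by the source.

```python
from pathlib import Path

# Hypothetical folder names; rename them to match your group's conventions.
SUBFOLDERS = [
    "code/source",        # finalized code used to create the results
    "code/exploratory",   # work-in-progress code
    "data/raw",           # immutable raw data
    "data/intermediate",  # edited/intermediate data
    "data/final",         # finalized data read by the main analysis
    "results",
]


def create_project_skeleton(root: Path) -> None:
    """Create the project folders and a placeholder readme if none exists."""
    for sub in SUBFOLDERS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text(f"# {root.name}\n\nDescribe the project here.\n")
```

Calling `create_project_skeleton(Path("my_project"))` would create a single main folder `my_project` containing the code, data, and results folders plus a readme. Running it again is harmless, since existing folders and an existing readme are left alone.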

Directory and file naming

Every file, whether it be data or code, has a path (including the filename at the end) that tells the computer where to look for it. The path is also a “sentence” that tells the human user the information needed to identify the file. Separate this information into individual “idea elements,” which may be a single character or number, a word, or a short phrase.

In general, put the most important thing first, whether in a folder hierarchy or an individual filename. The last idea to be addressed is often the version number.
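
As an illustration of assembling a path from idea elements, here is a minimal Python sketch. The helper `build_path`, the example elements (`plate01`, `absorbance`), and the `v02` version style are all hypothetical; the point is only the ordering, with the broader context in the folders and the version number last in the filename.

```python
from pathlib import Path


def build_path(folders, elements, version, extension):
    """Join idea elements into a path: most important first, version last."""
    filename = "_".join(list(elements) + [f"v{version:02d}"]) + extension
    return Path(*folders) / filename


path = build_path(
    folders=["data", "raw"],
    elements=["2019_07_04", "plate01", "absorbance"],
    version=2,
    extension=".csv",
)
print(path.as_posix())  # data/raw/2019_07_04_plate01_absorbance_v02.csv
```

Building names with a helper like this, rather than typing them by hand, keeps the order of the elements the same across the whole project.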

Other suggestions:

  1. Be descriptive and avoid ambiguity, particularly with versioning (e.g., try “V2”, not “final_final”).
  2. Keep names concise.
    • Abbreviations are helpful, but make sure you define them in a readme.
    • Use context (e.g., parent folders) to avoid redundant or lengthy names.
  3. Format dates one way only: YYYY_MM_DD (e.g., 2019_07_04). Names in this style fall into chronological order when your operating system sorts them alphabetically.
    • Similarly, always pad smaller numbers in a sequence with leading zeros (e.g., 01, 02, …, 10 instead of 1, 2, …, 10).
  4. Be consistent within your project. If your group already has an established style, start there and tailor it to your needs.
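
The date and zero-padding suggestions can be checked with a short Python sketch. The helper `date_stamp` and the `run_` filenames are hypothetical examples; the sorting shown is plain alphabetical string sorting, which is what most file browsers apply.

```python
from datetime import date


def date_stamp(d: date) -> str:
    """Format a date as YYYY_MM_DD."""
    return d.strftime("%Y_%m_%d")


dates = [date(2019, 12, 1), date(2019, 7, 4), date(2020, 1, 15)]
stamps = sorted(date_stamp(d) for d in dates)
print(stamps)  # ['2019_07_04', '2019_12_01', '2020_01_15']

# Without padding, alphabetical order puts 10 before 2.
unpadded = sorted(f"run_{i}.csv" for i in (1, 2, 10))
padded = sorted(f"run_{i:02d}.csv" for i in (1, 2, 10))
print(unpadded)  # ['run_1.csv', 'run_10.csv', 'run_2.csv']
print(padded)    # ['run_01.csv', 'run_02.csv', 'run_10.csv']
```

Choose the padding width from the largest number you expect in the sequence, for example three digits if a series might pass 99 files.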