Data Considerations
DITI team members span multiple disciplines, and their familiarity with data in disciplines other than their own may be limited. We recommend that faculty partners select useful data for workshops, projects, and presentations during advance planning for the module. Faculty partners should plan to provide or point to datasets that are relevant to the course and to consult with DITI team members on what kinds of data are needed for different assignments or workshops. In some cases, the DITI team may be able to assist with data gathering. If faculty are unable to provide data resources, DITI team members may be able to supply datasets; however, such data may not always be the most relevant for the course.
The DITI is guided by three principles: open data, analyzable data, and archivable data. If datasets are not arranged in an open, standard, analyzable format, then users of that data might have difficulty accessing it in the short or long term. We strive to remove all unnecessary restrictions on the data that we create, use, and share in the classroom and on our GitHub Digital Showcase. Where possible, we ask faculty to follow these principles when selecting datasets, though we will do our best to work with files in any form. We prefer that faculty send us data that are analyzable and non-proprietary, so that we can avoid problems such as major software updates making datasets inaccessible, difficulty reading files on different operating systems or with different software, and proprietary data types becoming obsolete.
The DITI follows data and file formatting guidelines that allow for long-term storage and broad accessibility. If you have any questions about these data considerations, please contact a DITI member.
Below are lists outlining the data file formatting standards we recommend at the DITI:
Data File Formatting
- Containers: TAR, GZIP, ZIP
- Databases: CSV, XML
- Tabular data: CSV, TSV
- Geospatial vector data: SHP, GeoJSON, KML, DBF, NetCDF
- Geospatial raster data: GeoTIFF/TIFF, NetCDF, HDF-EOS
- Moving images: MOV, MPEG, AVI, MXF, MP4
- Sound: WAVE, AIFF, MP3, MXF
- Images: TIFF, JPEG 2000, PNG, GIF, BMP
- Text: XML, HTML, TXT, UTF-8
- Web archives: WARC
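As a rough illustration of working with the open formats listed above, the following Python sketch writes a small table as UTF-8 encoded CSV and bundles it in a ZIP archive, using only the standard library. The function names, the file name `people.csv`, and the sample rows are hypothetical placeholders for illustration; they are not part of any DITI tooling.

```python
import csv
import io
import zipfile


def rows_to_csv_bytes(header, rows):
    """Serialize a table to CSV text and encode it as UTF-8 bytes."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def bundle_as_zip(files):
    """Pack a {filename: bytes} mapping into an in-memory ZIP archive."""
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, payload in files.items():
            zf.writestr(name, payload)
    return archive.getvalue()


# Illustrative sample data (hypothetical), including non-ASCII characters.
header = ["name", "city"]
rows = [["Zoë", "Montréal"], ["Ana", "São Paulo"]]

csv_bytes = rows_to_csv_bytes(header, rows)
zip_bytes = bundle_as_zip({"people.csv": csv_bytes})
```

To save the archive to disk, the bytes can be written with `open("dataset.zip", "wb")`; the in-memory version is used here only to keep the sketch free of side effects.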
Below are the conventions for data values that we follow at the DITI, along with resources explaining the reasoning behind these guidelines:
Data Values & Resources
- Standardize all coded and null values within a dataset.
- Use an explicit value for missing or no data, rather than an empty field.
- For numeric fields, represent missing data with a specified extreme value (e.g., -9999), the IEEE floating point NaN value (Not a Number), or the database NULL. Be advised that NULL and NaN can cause problems, particularly with some older programs. For character fields, use NULL, “not applicable”, “n/a” or “none”.
- If there are multiple reasons that cells might not contain values, include a separate code for each.
- The null value(s) should be consistently applied within and among data files.
- If data values are coded, be sure to define each code in the metadata. For character encoding, we recommend using UTF-8 when possible.
- Don’t include rows with summary statistics. It is best to put summary statistics, figures, analyses, and other summary content into a separate companion data file.
- “Data and File Formatting,” Axiom Data Science. 2017. https://www.axiomdatascience.com/best-practices/DataandFileFormatting.html
- Tauberer, Joshua. “Analyzable Data in Open Formats (Principles 5 and 7).” Open Government Data: The Book. Second Edition, 2014. https://opengovdata.io/2014/analyzable-data-in-open-formats/
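The null-value guidelines above can be sketched in Python as follows. This is a minimal example under stated assumptions: the codes `-9999` and `n/a` are taken from the guidelines, while the function name `fill_missing`, the column names, and the sample rows are hypothetical. Whatever codes you choose should be documented in the dataset's metadata.

```python
import csv
import io

# Missing-value codes (assumed choices; record them in the metadata).
NUMERIC_MISSING = "-9999"
TEXT_MISSING = "n/a"


def fill_missing(rows, numeric_columns):
    """Replace empty fields with an explicit, consistent missing-value code."""
    cleaned = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            if value is None or value.strip() == "":
                # Use the numeric code for numeric columns, the text code otherwise.
                if column in numeric_columns:
                    new_row[column] = NUMERIC_MISSING
                else:
                    new_row[column] = TEXT_MISSING
            else:
                new_row[column] = value.strip()
        cleaned.append(new_row)
    return cleaned


# Hypothetical sample: the second row has two empty fields.
raw = "site,temperature_c,observer\nA,21.5,Lee\nB,,\n"
rows = list(csv.DictReader(io.StringIO(raw)))
cleaned = fill_missing(rows, numeric_columns={"temperature_c"})
```

If cells can be empty for more than one reason, the same approach extends to a separate code per reason, as the guidelines recommend.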