Preferred File Formats

The choice of file format is crucial to ensuring that a dataset remains interoperable and reusable. Long-established file formats are often preferable because they are widely supported today and are more likely to remain supported in the future.

The most suitable file formats are usually:

Standard open formats are also the safest option for long-term support and access. Proprietary formats can become obsolete over time (for example, .xls was replaced by .xlsx), which makes data difficult to read and interpret in the future.

The table below gives an overview of preferred vs. non-preferred file formats for a selection of document types. The list in the column “Non-preferred File Formats” is not exhaustive; it covers the formats perceived to be most commonly used. To enhance the interoperability of your data, please use preferred file formats wherever possible.

File Type | Preferred File Formats (examples) | Non-preferred File Formats (examples)
Audio
  • Uncompressed and lossless WAV or AIFF (.wav/.aiff)
  • Compressed and lossless FLAC (.flac)
  • Compressed and lossy MP3 (.mp3)
  • AAC (.m4a)
  • Monkey's Audio (.ape)
  • Ogg Vorbis (.ogg)
  • Windows Media Audio (.wma)
Archive / Container File
  • Zip Archive (.zip)
  • Tar Archive (.tar)
  • 7zip archive (.7z)
  • Compressed Tar (.tar.gz, .tar.bz2)
  • WinRar Archive (.rar)
Image
  • Uncompressed TIFF (.tif or .tiff)
  • Compressed and lossless PNG (.png)
  • Compressed and lossy JPEG (.jpg, .jp2)
  • Standard applicable RAW image formats (.raw, .nef, etc.)
  • Digital Negative (.dng)
  • Adobe Photoshop (.psd)
  • Apple Picture File (.pct)
  • Graphics Interchange Format (.gif)
  • Windows Bitmap (.bmp)
Slides, Illustrations
  • Open Document Format Presentation (.odp)
  • Open Document Format Drawing (.odg)
  • Encapsulated Postscript (.eps)
  • Scalable Vector Graphics (.svg)
  • Portable Document Format (.pdf)
  • Microsoft PowerPoint Files (.ppt, .pptx)
  • Apple Keynote Files (.key)
Spreadsheet, Tabular File
  • Plain text with Unicode UTF-8 character encoding, tab-separated (.tsv, .txt)
  • comma-/semicolon-separated (.csv)
  • Open Document Format Spreadsheet file (.ods)
  • Microsoft Excel Files (.xls, .xlsx)
  • Apple Numbers File (.numbers)
Unstructured Data
  • ASCII Plain Text
  • JavaScript Object Notation (.json)
  • Extensible Markup Language (.xml)
  • Excel Files (.xls, .xlsx)
  • Apple Numbers File (.numbers)
Text
  • ASCII Plain Text
  • Rich Text Format (.rtf)
  • Open Document Format Text File (.odt)
  • Portable Document Format (.pdf)
  • Markdown (.md)
  • R Markdown (.rmd)
  • HTML (.html)
  • Extensible Markup Language (.xml)
  • Microsoft Word Files (.doc, .docx)
  • Apple Pages Files (.pages)
Markup Language
  • Markdown (.md)
  • R Markdown (.rmd)
  • HTML (.html)
  • Extensible Markup Language (.xml)
  • SGML (.sgml), which has been superseded by XML and HTML, both markup languages derived from SGML.
Video
  • MPEG-4 H.264 (.mp4)
  • AVI (.avi)
  • Material Exchange Format (.mxf)
  • Flash Video (.flv)
  • Quicktime (.mov)
  • Windows Media Video (.wmv)
  • WebM (.webm)
Numerical Data
  • Hierarchical Data Format (.hdf5, .h5, .hdf4)
  • NetCDF (.nc)
  • NumPy Array Data (.npz, .npy)
  • Zarr file storage (.zarr)
  • MATLAB binary (.mat)
Statistical Analysis
  • R Files (.r, .RData)
  • SPSS (.dat/.sps)
  • STATA (.dat/.DO)
  • SPSS Portable (.por)
  • SPSS (.sav)
  • STATA (.dta)
  • SAS (.sas7bdat, .sd2, .tpt)
CAD Files
  • AutoCAD Drawing Exchange Format (.dxf)
  • Open Document Format Drawing (.odg)
  • Standard for the Exchange of Product Model Data Files (.step)
  • Stereo Lithography Files (.stl)
  • Polygon File Format (.ply)
All proprietary file types
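
As an illustration of the preferred tabular format in the table above (plain text, UTF-8, tab-separated), the following is a minimal Python sketch using only the standard library. The function names `write_tsv` and `read_tsv` are hypothetical and chosen for this example only; adapt them to your own workflow.

```python
import csv


def write_tsv(path, header, rows):
    """Write a header and rows as tab-separated plain text in UTF-8."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)


def read_tsv(path):
    """Read a UTF-8 tab-separated file back into a list of rows (all strings)."""
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle, delimiter="\t"))
```

Note that values are stored as text, so numbers come back as strings when the file is read and need to be converted again by the reader.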

Choosing Appropriate Filenames

Enhancing Findability

Adopting good practices for file naming and organizing makes finding the required data much easier, not just for you, but also for your colleagues and collaborators, and for other researchers who may re-use your data in the future.

To help with this, the following basic file naming recommendations are given:

  • Files should be named consistently.
  • The name of a file should be the same across file types when it is stored in multiple formats, e.g. in both a preferred and a non-preferred file format.
  • File names should be descriptive but short (< 30 characters), which helps make names human readable. This may not always be possible, and clarity should not be sacrificed for brevity.
  • Avoid the use of spaces in filenames. Various software and operating systems may not be able to read file paths correctly if the folder or file names include spaces. Avoiding spaces helps make all files machine readable.
  • Avoid special characters (e.g. \ / ? : * " > < | # % { } ^ [ ] ` ~ æÆ øØ åÅ äÄ öÖ) as these can cause issues for software and operating systems. It is important that filenames are machine readable.
  • If using a date as part of the filename, use the international date convention YYYY-MM-DD (e.g. 2017-10-25) or its compact form YYYYMMDD (e.g. 20171025).
  • Good file names are easily searchable by both humans and regular expressions. It is good practice to split the filename into parts with a specific character such as - ("kebab-case") or _ ("snake_case") to help searching.
  • It is worth considering whether your naming convention works well with the default ordering applied by computers. For example, a computer will often sort 1,3,4,20,100 as 1,100,20,3,4. In this case leading zeros should be used (001, 003, 004, 020, 100) so that the sequential order is preserved by the computer.
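
The naming recommendations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `make_filename`, the choice of an underscore between parts and a hyphen within parts, and the default padding width of three digits are all hypothetical choices, not a prescribed convention.

```python
import re
import unicodedata
from datetime import date


def make_filename(parts, extension, when=None, sequence=None, width=3):
    """Build a machine-readable filename from descriptive parts."""
    cleaned = []
    for part in parts:
        # Decompose accented letters (å -> a) and drop any remaining
        # non-ASCII characters (letters such as ø are simply removed).
        ascii_text = (
            unicodedata.normalize("NFKD", part)
            .encode("ascii", "ignore")
            .decode("ascii")
        )
        # Replace runs of spaces and special characters with one hyphen.
        slug = re.sub(r"[^A-Za-z0-9]+", "-", ascii_text).strip("-").lower()
        if slug:
            cleaned.append(slug)

    pieces = []
    if when is not None:
        pieces.append(when.isoformat())  # YYYY-MM-DD
    pieces.extend(cleaned)
    if sequence is not None:
        # Leading zeros keep the default (alphabetical) sort order sequential.
        pieces.append(str(sequence).zfill(width))
    return "_".join(pieces) + "." + extension.lstrip(".").lower()
```

For example, `make_filename(["Soil Moisture"], "csv", when=date(2017, 10, 25), sequence=7)` gives `2017-10-25_soil-moisture_007.csv`. The date comes first here so that files sort chronologically; placing it elsewhere is equally valid as long as it is done consistently.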

Organising Data within Files

Enhancing Usability

How the data in your files should be organised depends on the file type and the discipline. You should follow the best-practice recommendations within your field, but some practices make data clearer regardless of discipline.

When storing columnar or tabular data the following recommendations apply:

  • If storing data in spreadsheet format, only one table should be on each sheet or tab to make the spreadsheet machine readable. It also improves clarity for humans.
  • If storing data in ASCII format, only one table should be in each file to make the data machine readable. It also improves clarity for humans.
  • The first row should be the header containing the variable names. In some cases it may be appropriate to have a multiline header, with related constants stored before the variable names.
  • Variable names should not include special characters (where possible).
  • Each variable should have its own column.
  • Each row should be a single observation.
  • There should only be one value per cell in a spreadsheet.
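
Some of the recommendations above can be checked automatically. The following Python sketch is a hypothetical helper (`check_table` and the naming rule in `VALID_NAME` are assumptions made for this example) that reads a delimited text table and reports header names containing special characters and rows whose number of values does not match the header. It cannot detect every problem, such as several values packed into one cell.

```python
import csv
import io
import re

# Assumed naming rule: start with a letter, then letters, digits or underscores.
VALID_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def check_table(text, delimiter="\t"):
    """Return a list of human-readable problems found in a delimited table."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return ["table is empty"]

    header, body = rows[0], rows[1:]
    problems = []
    for name in header:
        if not VALID_NAME.match(name):
            problems.append(f"header {name!r} is not a plain variable name")
    if len(set(header)) != len(header):
        problems.append("duplicate variable names in header")
    # Rows are numbered from 1, with the header as row 1.
    for number, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(
                f"row {number}: expected {len(header)} values, found {len(row)}"
            )
    return problems
```

An empty list means no problems were found by these particular checks. A blank row in the middle of a file is reported as a row with zero values, which is often a sign that a second table has been stored in the same file.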