Preferred File Formats
The choice of file format is crucial to keeping a dataset interoperable and reusable. Long-established file formats may be preferable because they are widely supported and more likely to remain supported in the future.
The most suitable file formats are usually:
- non-proprietary
- open, with documented international standards
- using standard character encoding, preferably Unicode (e.g. UTF-8)
- uncompressed (space permitting)
Standard open formats are also the safest option for long-term support and access, because proprietary formats can become obsolete over time (for example, .xls was replaced by .xlsx), making data difficult to read and interpret in the future.
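As an illustration of the character-encoding recommendation above, the following minimal sketch re-encodes text from a legacy encoding to UTF-8. The function names `to_utf8` and `convert_file` are hypothetical, and the sketch assumes you already know the source encoding (Latin-1 in the example); guessing it wrongly will silently corrupt characters.

```python
from pathlib import Path


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def convert_file(src: Path, dst: Path, source_encoding: str) -> None:
    """Write a UTF-8 copy of `src` to `dst`, leaving the original untouched."""
    dst.write_bytes(to_utf8(src.read_bytes(), source_encoding))


# Example: text that was originally saved as Latin-1 (an assumption).
legacy = "Ærø fjord".encode("latin-1")
converted = to_utf8(legacy, "latin-1")
print(converted.decode("utf-8"))
```

Writing to a separate destination file rather than overwriting the source keeps the original available if the assumed encoding turns out to be wrong.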
The table below gives an overview of preferred vs. non-preferred file formats for a selection of document types. The list of file formats in the column “Non-preferred file formats” is non-exhaustive and may include the formats perceived to be the most commonly used. To enhance the interoperability of your data, please try to use preferred file formats wherever possible.
File Type | Preferred File Format (Examples) | Non-preferred File Formats (Examples) |
---|---|---|
Audio | | |
Archive / Container File | | |
Image | | |
Slides, Illustrations | | |
Spreadsheet, Tabular File | | |
Unstructured Data | | |
Text | | |
Markup Language | | |
Video | | |
Numerical Data | | |
Statistical Analysis | | |
CAD Files | | All proprietary file types |
Choosing Appropriate Filenames
Enhancing Findability
Adopting good practices for file naming and organizing makes finding the required data much easier, not just for you, but also for your colleagues and collaborators, and for other researchers who may re-use your data in the future.
In order to help with this, the following fundamental file naming recommendations are given:
- Files should be named consistently.
- The name of the file should be consistent across file types when the same data is stored in multiple file formats, i.e. in both preferred and non-preferred formats.
- File names should be descriptive but short (< 30 characters), which helps make them human-readable. This may not always be possible, and clarity should not be sacrificed for brevity.
- Avoid spaces in filenames. Some software and operating systems cannot correctly read file paths if folder or file names include spaces. Avoiding them helps make all files machine-readable.
- Avoid special characters (e.g. \ / ? : * ” > < | : # % ” { } | ^ [ ] ` ~ æÆ øØ åÅ äÄ öÖ) as these can cause issues for software and operating systems. It is important that filenames are machine-readable.
- If using a date as part of the filename, use the international date convention YYYY-MM-DD in your formatting (e.g. 2017-10-25 or 20171025).
- Good file names are easily searchable by both humans and regular expressions. It is good practice to split the filename into parts with a specific character such as "-" ("kebab-case") or "_" ("snake_case") to help searching.
- It is worth considering whether your naming convention works well with the default ordering applied by computers. For example, a computer will often sort 1, 3, 4, 20, 100 as 1, 100, 20, 3, 4. In this case leading zeros should be used (001, 003, 004, 020, 100) to ensure the sequential order is preserved by the computer.
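The naming recommendations above can be sketched in a few lines of Python. This is only one possible convention, not a prescribed one: the helper `make_filename`, its parameters, and the choice of "-" inside a part and "_" between parts are assumptions made for illustration. Characters outside A-Z, a-z, 0-9 and "-" are simply dropped, which may be too aggressive for some names.

```python
import re
from datetime import date


def make_filename(parts, day, index, extension, width=3):
    """Join descriptive parts, an ISO date and a zero-padded index into a filename."""
    cleaned = []
    for part in parts:
        # Replace runs of whitespace with "-" so the name contains no spaces.
        part = re.sub(r"\s+", "-", part.strip())
        # Drop every character other than ASCII letters, digits and "-".
        part = re.sub(r"[^A-Za-z0-9-]", "", part)
        if part:
            cleaned.append(part.lower())
    # YYYY-MM-DD date, then an index padded with leading zeros.
    stem = "_".join(cleaned + [day.isoformat(), f"{index:0{width}d}"])
    return f"{stem}.{extension}"


name = make_filename(["soil", "plot 7"], date(2017, 10, 25), 3, "csv")
print(name)  # soil_plot-7_2017-10-25_003.csv

# Plain string sorting shows why leading zeros matter.
numbers = [1, 3, 4, 20, 100]
unpadded = sorted(str(n) for n in numbers)
padded = sorted(f"{n:03d}" for n in numbers)
print(unpadded)  # ['1', '100', '20', '3', '4']
print(padded)    # ['001', '003', '004', '020', '100']
```

The padding width should be chosen for the largest index you expect, since a name that outgrows the width will sort out of order again.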
Organising Data within Files
Enhancing Usability
How the data in your files should be organized depends on the file type and the discipline. You should follow best-practice recommendations within your field, but some practices make data clearer regardless of discipline.
When storing columnar or tabular data the following recommendations apply:
- If storing data in spreadsheet format, only one table should be on each sheet or tab to make the spreadsheet machine readable. It also improves clarity for humans.
- If storing data in ASCII format, only one table should be in each file to make the data machine readable. It also improves clarity for humans.
- The first row should be the header containing the variable names. In some cases it may be appropriate to have a multiline header, with related constants stored before the variable names.
- Variable names should not include special characters (where possible).
- Each variable should have its own column.
- Each row should be a single observation.
- There should only be one value per cell in a spreadsheet.
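Some of the tabular recommendations above can be checked automatically. The sketch below uses Python's standard `csv` module to test a comma-separated text file for a simple single-line header and a consistent number of cells per row. The function `check_table` and its rule for acceptable variable names (a letter followed by letters, digits or underscores) are assumptions for illustration; it does not handle the multiline headers mentioned above, and it cannot tell whether a cell holds more than one value.

```python
import csv
import io
import re


def check_table(text):
    """Return a list of problems found in CSV text that should hold one table."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    header, problems = rows[0], []
    for name in header:
        # Assumed rule: a letter followed by letters, digits or underscores.
        if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name):
            problems.append(f"header {name!r} is not a plain variable name")
    if len(set(header)) != len(header):
        problems.append("duplicate variable names in header")
    # Every record should have exactly one cell per variable.
    for record_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"record {record_no} has {len(row)} cells, expected {len(header)}"
            )
    return problems


good = "site,date,temp_c\nA,2017-10-25,11.5\nB,2017-10-25,12.1\n"
bad = "site,temp (°C)\nA,11.5,extra\n"
print(check_table(good))  # []
for problem in check_table(bad):
    print(problem)
```

A blank line or a second header part-way through the file shows up as a record with the wrong number of cells, which is often a sign that more than one table has been stored in the same file.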