A data file format refers to the rules used to construct the file. There is an infinite number of possible formats of data files. However, most data analysis software can only read data in a small number of formats, so the first step in tidying data is usually to obtain the data in an appropriate format.
File formats designed for data capture
Most advanced analysis software can extract data from sources that were designed for capturing data and then analyze it. For example, web scraping can be used to extract data from web pages, and SQL queries can be used to extract data from relational databases. However, a lot of work is usually required to restructure such data into a format that makes analysis easy.
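As a rough illustration of the restructuring involved, the sketch below uses Python's built-in sqlite3 module. The `responses` table, its columns, and its values are invented for this example. It assumes the common case where captured data is stored with one row per respondent per question, whereas analysis software generally expects one row per respondent.

```python
import sqlite3

# Hypothetical capture-oriented table: one row per respondent per question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE responses (respondent_id INTEGER, question TEXT, answer TEXT)"
)
conn.executemany(
    "INSERT INTO responses VALUES (?, ?, ?)",
    [
        (1, "age", "34"),
        (1, "gender", "F"),
        (2, "age", "51"),
        (2, "gender", "M"),
    ],
)


def long_to_wide(connection):
    """Pivot the long table into one dict per respondent."""
    wide = {}
    query = (
        "SELECT respondent_id, question, answer "
        "FROM responses ORDER BY respondent_id"
    )
    for respondent_id, question, answer in connection.execute(query):
        # Create the respondent's row on first sight, then add one column per question.
        row = wide.setdefault(respondent_id, {"respondent_id": respondent_id})
        row[question] = answer
    return list(wide.values())


wide_rows = long_to_wide(conn)
for row in wide_rows:
    print(row)
```

Even in this small case, the analyst has to decide how to pivot the table, and every answer is still text with no indication of what the questions or values mean.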
Data file formats for data analysis
Where the same kind of data is analyzed regularly, a mechanism will typically be in place for obtaining files that are better suited to data analysis. For example, while most survey data is stored in relational databases, the leading survey data collection companies allow users to export data files in more suitable formats, or to obtain such files via an API.
Overview of the key file formats for data analysis
The most widely used file formats for data analysis are probably:
These file formats share three defining features:
- They are easy to create.
- They are easy to read.
- They contain minimal machine-readable metadata. The "machine-readable" distinction is important. An Excel file can have lots of metadata included in it. For example, it is common for data files created in Excel to have the raw data stored in one worksheet and a data dictionary (the metadata) in a second worksheet. However, as a general rule with very few exceptions, data analysis software cannot read the metadata contained in Excel files.
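To make the last point concrete, here is a minimal sketch using Python's standard csv module on a simple delimited text file. The column names and codes are hypothetical. It shows how little a simple format tells the software: every value arrives as a plain string, with no variable labels, value labels, or types.

```python
import csv
import io

# Hypothetical survey extract; the column names and codes are invented.
csv_text = "id,q1,q2\n1,2,5\n2,1,4\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))

# The only "metadata" available is the header row.
column_names = list(rows[0].keys())
print(column_names)

# Every value is a string; nothing says whether q1 is a count,
# a rating, or a category code, or what the code "2" stands for.
value_types = {name: type(rows[0][name]).__name__ for name in column_names}
print(value_types)
```

Anything beyond the column names, such as what `q1` asks or what the value `2` means, has to be supplied by the analyst.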
For many forms of data analysis, the absence of metadata in the file is not a big issue (e.g., sales data), making these file formats very useful. In academic contexts, where a single data file may be analyzed over months or even years, the absence of metadata is also not a huge problem.
However, when there is substantial metadata, or when analysis needs to be performed quickly, these simple formats tend to be highly problematic, as:
- The metadata has to be entered, and this takes time.
- It is difficult to perform high-quality data cleaning and checking with limited metadata (as one of the ways of checking data is to assess consistency with metadata).
- Entry of metadata becomes a source of error.
- Often the metadata is not entered, leading to errors in interpretation.
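The consistency check mentioned in the list above can be sketched as follows. This is only an illustration in Python: the data dictionary, variable names, and records are invented, and real metadata would normally also carry types, ranges, and missing-value rules.

```python
# Hypothetical data dictionary: valid codes and their labels for each variable.
data_dictionary = {
    "gender": {"1": "Male", "2": "Female"},
    "satisfaction": {"1": "Low", "2": "Medium", "3": "High"},
}

# Hypothetical records, as they might be read from a simple text file.
records = [
    {"id": "1", "gender": "1", "satisfaction": "3"},
    {"id": "2", "gender": "2", "satisfaction": "7"},  # 7 is not a defined code
    {"id": "3", "gender": "", "satisfaction": "2"},   # gender is missing
]


def find_inconsistencies(records, data_dictionary):
    """Return (record id, variable, value) for each value not in the dictionary."""
    problems = []
    for record in records:
        for variable, labels in data_dictionary.items():
            value = record.get(variable, "")
            if value not in labels:
                problems.append((record["id"], variable, value))
    return problems


problems = find_inconsistencies(records, data_dictionary)
for record_id, variable, value in problems:
    print(f"record {record_id}: {variable} has unexpected value {value!r}")
```

Without the data dictionary, neither of the two flagged values could be identified as a problem, which is why limited metadata makes cleaning harder.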
Consequently, more sophisticated file formats are the norm in areas where there is substantial metadata. Common file formats that contain extensive machine-readable metadata are: