A flat data file is a file that is structured as a raw data table, with rows for every case. Flat files generally store values as either text or numbers. Flat files contain limited metadata, which makes them problematic for some types of data analysis (e.g., survey data). If using a flat file, it is good practice to create a data dictionary. Better file formats are required if you have a large number of variables.
The standard structure of a flat file
A flat file has the following structure:
- (Often) The first row contains the names of the variables.
- Each row in the file contains a case.
- Each column represents some property of the individual cases - a variable. Variables are delimited (separated) in some way. In the example below, tabs are used to delimit the variables, and we can see that the first variable is called Person's age and the first case has a value of 35 to 44.
A flat file containing labels
Looking at the age variable shown below, we can see that there are three unique values: 35 to 44 (which appears twice), Under 18, and 60 or more. In this example, the values in the data file are labels (i.e., text).
Person's age	Gender	Attitude
35 to 44	Male	Somewhat disagree
35 to 44	Female	Neither agree nor disagree
Under 18	Female	Neither agree nor disagree
60 or more	Female	Strongly agree
While the text representation is easier for people to read, the numeric representation is typically much better for analysis. This is because the text representation can omit important information (e.g., the numeric values in the next example tell us that Under 18 is ordered before 35 to 44, whereas the text alone does not). An additional problem with text representations is that they often break down when labels are changed (e.g., if Female is changed to Females partway through data collection, the file ends up with two categories that mean the same thing).
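The structure and the ordering problem described above can be sketched in Python. This is a minimal illustration, not a definitive reader: the `FLAT_FILE` string and the `read_flat_file` helper are hypothetical names, and the data is simply the four example cases typed out with tab delimiters.

```python
import csv
import io

# Tab-delimited text mirroring the example cases (assumed layout:
# variable names in the first row, one case per following row).
FLAT_FILE = (
    "Person's age\tGender\tAttitude\n"
    "35 to 44\tMale\tSomewhat disagree\n"
    "35 to 44\tFemale\tNeither agree nor disagree\n"
    "Under 18\tFemale\tNeither agree nor disagree\n"
    "60 or more\tFemale\tStrongly agree\n"
)


def read_flat_file(text):
    """Return one dict per case, keyed by the variable names in the first row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


cases = read_flat_file(FLAT_FILE)

# Every value is plain text, so the only ordering available is alphabetical.
ages = [case["Person's age"] for case in cases]
unique_ages = sorted(set(ages))
print(unique_ages)
```

Sorting the labels places Under 18 last, even though it is the youngest group. Nothing in the file itself records the intended order.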
A flat file containing variable names and values
An alternative to showing labels is to instead show variable names and values. In the example below, Person's age is referred to by the variable name of q1, and the age of 35 to 44 is represented by the value of 3.
q1	q2	q3
3	1	2
3	2	3
1	2	3
4	2	5
The disadvantage of the numeric representation of categories is that the person analyzing the data must re-enter the information about the meaning of the numbers after the data file is imported into a data science app, and this can be a very time-consuming process. Sometimes files will contain a mix of names, labels, and values.
Flat files are always missing metadata
Although flat files are easy to create, they are usually not so easy to analyze. The basic problem is that they are usually missing key metadata, which makes analysis either difficult or impossible.
Consider the example using variable names and values. In order to analyze this data, we need to keep track of what these names and values mean. The only good way of doing this is to enter the metadata into the analysis software, which takes time.
If you use text, the analysis software will generally convert the underlying data back into numbers. However, it generally won't do it in the way you would want, so you need to manually correct it. Consider the third variable, Attitude. If the text is mapped back to numbers automatically, the software will likely assign values based on alphabetical order, giving a value of 1 to Neither agree nor disagree, a value of 2 to Somewhat disagree, and a value of 3 to Strongly agree. Such coding is not sensible, because it places the neutral response below a disagreeing one, and it needs to be rectified prior to any analysis (as otherwise, people will tend to misinterpret results).
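The alphabetical coding described above can be reproduced in a few lines of Python. This is a sketch of the general behavior, not the algorithm of any particular analysis package. The helper name `alphabetical_codes` is hypothetical, and the "intended" codes are taken from the q3 values in the numeric example earlier in this article.

```python
# Attitude labels for the four example cases, in file order.
attitudes = [
    "Somewhat disagree",
    "Neither agree nor disagree",
    "Neither agree nor disagree",
    "Strongly agree",
]


def alphabetical_codes(labels):
    """Assign 1, 2, 3, ... to the distinct labels in alphabetical order."""
    distinct = sorted(set(labels))
    return {label: code for code, label in enumerate(distinct, start=1)}


auto_codes = alphabetical_codes(attitudes)

# Codes that respect the scale order, as used for q3 in the numeric example.
intended_codes = {
    "Somewhat disagree": 2,
    "Neither agree nor disagree": 3,
    "Strongly agree": 5,
}

auto_values = [auto_codes[label] for label in attitudes]
intended_values = [intended_codes[label] for label in attitudes]
print(auto_values)      # [2, 1, 1, 3]
print(intended_values)  # [2, 3, 3, 5]
```

With the automatic codes, the neutral response receives the lowest value, so any average or trend computed from them would be misleading until the mapping is corrected by hand.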
Data dictionaries
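One common practice with flat files containing variable names and values is to provide a data dictionary so that users can look up what each name and value means. For example, a data dictionary for the file above would record that q1 is Person's age and that a value of 3 for q1 means 35 to 44.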
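A data dictionary can also be kept in machine-readable form. The sketch below is one possible layout, assuming a nested Python dict. The names `DATA_DICTIONARY` and `decode_case` are hypothetical, and the entries include only the codes that can be inferred by lining up the two example files in this article (matching q2 and q3 to Gender and Attitude by column position). Codes that never appear in the examples are unknown and are left out.

```python
# Variable name -> variable label and value labels.
# Only codes observed in the example files are listed.
DATA_DICTIONARY = {
    "q1": {
        "label": "Person's age",
        "values": {1: "Under 18", 3: "35 to 44", 4: "60 or more"},
    },
    "q2": {
        "label": "Gender",
        "values": {1: "Male", 2: "Female"},
    },
    "q3": {
        "label": "Attitude",
        "values": {
            2: "Somewhat disagree",
            3: "Neither agree nor disagree",
            5: "Strongly agree",
        },
    },
}


def decode_case(case, dictionary):
    """Translate one case from names and values into labels."""
    decoded = {}
    for name, value in case.items():
        entry = dictionary[name]
        decoded[entry["label"]] = entry["values"][value]
    return decoded


# First case of the numeric example: q1=3, q2=1, q3=2.
first_case = {"q1": 3, "q2": 1, "q3": 2}
decoded = decode_case(first_case, DATA_DICTIONARY)
print(decoded)
```

A lookup like this raises a `KeyError` for any name or value missing from the dictionary, which is a useful early warning that the dictionary and the data file have drifted apart.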
Better file formats are required with large numbers of variables
Flat files with data dictionaries are adequate if you have a small number of variables. However, as the number of variables grows, they become problematic in two ways:
- Analysis slows down considerably, as you either have to spend a lot of time entering all the metadata or have to regularly look things up.
- Errors become more likely. The more metadata that has to be entered or looked up, the greater the chance of an error.
For these reasons, if performing analyses with a larger number of variables, it is routine to use better file formats (e.g., SPSS Data Files, Triple S Data Files, QPacks). Such file formats contain all the underlying values and codes, as well as the metadata, and make it easy to use the values, codes, or text as required.