Most data analysis requires raw data. Raw data is data that can be viewed as a table, where each column represents a variable and each row represents a case. Metadata explains how the data in a data set should be interpreted. Analyses are created by summarizing different sections - rows and columns - of the data. All analyses can be created from a raw data table. Although raw data is often required when analyzing data, this does not mean that the data should be stored as raw data.
A raw data table
The table below is a small example of raw data. It's small in that more commonly we would have hundreds, thousands, or millions of rows, and dozens, hundreds, or thousands of columns.
Each row is a case
In the table above, the 11th row contains the data for a person in a survey. The person is referred to by unique identifier 713, has a mobile phone, is a part-time worker, is aged 20 to 24 years, and mentioned the brand name of Optus when asked the first phone company they could think of.
To introduce some jargon, each row in the raw data above is referred to as a case.
Each column is a variable
When looking at a table of raw data, each column is known as a variable. For example, Age and Work Status are variables in the table above.
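As a minimal sketch of these two ideas in Python, a raw data table can be held as a list of rows, where each row is a case and each key is a variable. Only the case with identifier 713 follows the example described above; the other case and the short variable names are invented for illustration.

```python
# Each dict is one case (row); each key is one variable (column).
# Case 713 follows the example in the text; case 101 is made up.
raw_data = [
    {"id": 101, "has_mobile": "Yes", "work_status": "Full-time worker",
     "age": "30 to 34", "top_of_mind": "Telstra"},
    {"id": 713, "has_mobile": "Yes", "work_status": "Part-time worker",
     "age": "20 to 24", "top_of_mind": "Optus"},
]


def get_case(table, case_id):
    """Return the row (case) with the given identifier, or None."""
    for row in table:
        if row["id"] == case_id:
            return row
    return None


def get_variable(table, name):
    """Return one column (variable) as a list, one value per case."""
    return [row[name] for row in table]


print(get_case(raw_data, 713)["top_of_mind"])  # Optus
print(get_variable(raw_data, "age"))           # ['30 to 34', '20 to 24']
```

Reading across a dict gives everything known about one case; reading the same key down the list gives one variable for every case.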
Metadata explains how to interpret the data
Consider the raw data shown below. Each row represents a customer at a phone company. Each column is a variable. While you can guess the meaning of some of the columns, for others it is impossible. What does a 9 for USAGE mean?
The raw data on its own is rarely sufficient for data analysis. Typically, metadata is also required. This metadata is often referred to as a data dictionary. The table below shows metadata for the raw data above.
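One way to picture a data dictionary is as a lookup from each variable name to a description and a set of value labels. The sketch below is in Python and rests on assumptions: the label and codes given for USAGE are hypothetical (the real meaning of a 9 is not stated here), and TENURE is an invented variable name used to show what happens when metadata is missing.

```python
# Hypothetical data dictionary: the label and value codes are assumptions,
# not the real meaning of USAGE in the raw data above.
data_dictionary = {
    "USAGE": {
        "label": "Monthly call usage band",
        "values": {1: "Under 1 hour", 5: "5 to 10 hours", 9: "Don't know"},
    },
}


def describe(variable, code, dictionary):
    """Translate a stored code into readable text using the metadata."""
    entry = dictionary.get(variable)
    if entry is None:
        return f"{variable} = {code} (no metadata available)"
    value_label = entry["values"].get(code, f"unlabelled code {code}")
    return f"{entry['label']}: {value_label}"


print(describe("USAGE", 9, data_dictionary))   # Monthly call usage band: Don't know
print(describe("TENURE", 3, data_dictionary))  # TENURE = 3 (no metadata available)
```

Without the dictionary entry, the code 9 is just a number; with it, the same value can be reported in words.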
All analyses can be created from raw data
All data analysis can be done by analyzing some or all of a raw data table. For example:
- Looking at the variable Does respondent have a mobile phone? we can see that 100% of people in this data set have a mobile phone.
- 1 / 11 = 9% are aged 16 to 19.
- Everybody aged 16 to 19 has Optus for Top of Mind Awareness, although this group contains only one person.
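The kinds of summaries listed above can be sketched in a few lines of Python. The four cases below are made up and are not the 11 rows of the example table, so the percentages differ; the variable names are also assumptions. The point is only that a percentage summarizes one column and a crosstab summarizes two.

```python
from collections import Counter

# Made-up cases; the variable names are assumptions for illustration.
raw_data = [
    {"has_mobile": "Yes", "age": "16 to 19", "top_of_mind": "Optus"},
    {"has_mobile": "Yes", "age": "20 to 24", "top_of_mind": "Optus"},
    {"has_mobile": "Yes", "age": "20 to 24", "top_of_mind": "Telstra"},
    {"has_mobile": "Yes", "age": "30 to 34", "top_of_mind": "Telstra"},
]


def percentages(table, variable):
    """Summarize one column: percentage of cases in each category."""
    counts = Counter(row[variable] for row in table)
    return {value: 100 * n / len(table) for value, n in counts.items()}


def crosstab(table, row_var, col_var):
    """Summarize two columns: count of cases for each pair of categories."""
    return Counter((row[row_var], row[col_var]) for row in table)


print(percentages(raw_data, "has_mobile"))       # {'Yes': 100.0}
print(percentages(raw_data, "age")["16 to 19"])  # 25.0
print(crosstab(raw_data, "age", "top_of_mind")[("16 to 19", "Optus")])  # 1
```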
Data does not need to be stored as a raw data table
It is important that you can view any data you have as a raw data table. For this purpose, most data analysis software includes a data editor, which is a tool for viewing and editing tables of raw data.
However, it doesn't follow from this that data should be stored as a raw data table. Although there are some crude file formats, such as CSV files, that do store data as raw data tables, it's rarely a great format, as such files:
- Are larger than they need to be.
- Do not contain key metadata.
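The second point can be illustrated with a small round trip using Python's standard csv module. The sketch assumes a made-up data set in which age is stored as a numeric code and a separate, hypothetical value_labels mapping holds the metadata. After writing to CSV and reading back, the values return as text and the labels are not in the file at all.

```python
import csv
import io

# A made-up data set: values are stored as numeric codes, and a separate
# mapping (metadata) says what the codes mean.
rows = [{"id": 713, "age": 2}, {"id": 714, "age": 3}]
value_labels = {"age": {2: "20 to 24", 3: "25 to 29"}}

# Write the rows to CSV text; the file has no place to hold value_labels.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "age"])
writer.writeheader()
writer.writerows(rows)

# Read the CSV back: every value returns as a string, and the labels are gone.
buffer.seek(0)
restored = list(csv.DictReader(buffer))

print(repr(restored[0]["age"]))  # '2'
print(restored[0] == rows[0])    # False
```

Any software reading this file has to guess that age is a coded category rather than a quantity, which is the kind of information a data dictionary would normally supply.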
Data files versus data sets
When data is shared, it is usually shared by storing the data in a file (e.g., an Excel file). Such a file is called a data file.
Data analysis proceeds by importing a data file into the analysis software, and then modifying this data in various ways (e.g., cleaning and tidying), after which analyses are conducted (e.g., crosstabs).
The data that is analyzed is referred to as the data set. It is often very different from the original data file (e.g., it may have fewer cases, modified metadata, and additional derived variables).
The data set can then be saved out as a file, in which case the data set and the data file become one and the same again.
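The sequence described in this section (import a data file, modify it into a data set, then save it out) might look like the following in Python. This is a sketch under stated assumptions: the file contents, the cleaning rule, and the derived variable under_30 are all invented, and CSV text held in memory is used only to keep the example self-contained.

```python
import csv
import io

# Stand-in for a shared data file; the contents are invented.
data_file = (
    "id,age,work_status\n"
    "713,22,Part-time\n"
    "714,,Full-time\n"
    "715,41,Full-time\n"
)

# Import: read the data file into memory as a list of cases.
data_set = list(csv.DictReader(io.StringIO(data_file)))

# Clean: drop cases with a missing age, and store age as a number.
data_set = [row for row in data_set if row["age"] != ""]
for row in data_set:
    row["age"] = int(row["age"])

# Derive: add a new variable that was not in the data file.
for row in data_set:
    row["under_30"] = row["age"] < 30

# Save: write the data set out, so that file and data set match again.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "age", "work_status", "under_30"])
writer.writeheader()
writer.writerows(data_set)
saved_file = out.getvalue()

print(len(data_set))               # 2
print(saved_file.splitlines()[0])  # id,age,work_status,under_30
```

Between the import and the save, the data set has fewer cases and one more variable than the original data file, which is why the two are worth distinguishing.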
The next article examines the idea of a case of data in more detail: Case.