Merging Data Files and Data Tables – The Data Story Guide

When you need to analyze data that is stored in multiple data files or tables it can be appropriate to create a new, merged data set. Merging is also known as joining. Common variants of this are:

Merging by case
Merging by variables
Many-to-one and one-to-many merges
Many-to-many merges

Any alternative to merging files is to use live joins.

When to merge data files

There are a number of common examples in which a data analyst will need to merge new cases into a main, or principal, data file:

They have collected data in a longitudinal study (tracker) – a project in which an analyst collects data over a period of time and analyzes it at intervals.
They have collected data in a before-and-after project – where the analyst collects data before an event, and then again after.
The data you wish to analyze is stored in multiple tables of a database (e.g., one table may show purchasing data, another may show personal characteristics).

Merging by case

Merging by case, sometimes known as appending data or adding data by rows (i.e. you’re adding new rows of data to each column), assumes that the variables in the two files you’re merging are the same in nature. For instance, var1 in the example below should be numeric in both questions, and not a string (text) variable in one file and numeric in the other. Most software matches up the data based on the variable name, and so the same names should be used across the two files. “var1” in one file should be “var1” in the other.

This also assumes that the IDs for each case are different. If it should happen that you have a variable in one file that doesn’t have a match in the other, then missing data (blank values) may be inserted for those rows that do not have data.

Merging by variables

Contrary to when you merge new cases, merging in new variables requires the IDs for each case in the two files to be the same, but the variable names should be different. In this scenario, which is sometimes referred to as augmenting your data or merging data by columns (i.e. you’re adding new columns of data to each row), you’re adding in new variables with information for each existing case in your data file. As with merging new cases where not all variables are present, the same thing applies if you merge in new variables where some cases are missing – these should simply be given blank values.

It could also happen that you have a new file with both new cases and new variables. The approach here will depend on the software you’re using for your merge. If the software cannot handle merging both variables and cases at the same time, then consider first merging in only the new variables for the existing sample (i.e. augment first), and then append the new cases across all variables as a second step to your merge.

Many-to-one and one-to-many merges

In the previous example of merging by variables, each case in one data file was matched with a corresponding case in the other data file. It is also common to have many-to-one merges, where many cases in one file are merged with only one in another.

In the example below, the first table shows data on people, and there are many people who share the same zip code. The second file contains zip codes. And, the third file contains the merged data, where the data from the zip code file has been repeated multiple times, to match the data from the first file.

Many-to-many merges

A many-to-many merge occurs when neither file contains cases that have only one value on the common variable in either file. For example, if the only variable in common was age, and you had multiple instances of different ages in the files.

Such many-to-many merges are not widespread and are known as statistical matching and data fusion.

Live joins

When files are merged, a new file is created. This can be problematic:

If the files are regularly revised, it requires that the merging occurs regularly.
Analysis of some of the variables in merged files can be challenging. For example, if you wanted to know the average zip code's wealth index, you will get the wrong answer if calculating the average using the variable in the third table above (due to some zip codes appearing multiple times in the data).

A remedy for this is to use live joins, which join the data as needed. For example, this can be done:

When using SQL to create tables.
Using Edit data file relationships in Q and Displayr.