Checking Consistency with Known Data – The Data Story Guide

Sometimes studies inadvertently over- or under-represent key groups in the population. For example, a study may contain 45% men when it is known that men make up 49% of the overall population.
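The comparison in this example can be sketched in a few lines of Python. This is only an illustration: the respondent counts (1,000 people in total) and the 51% share for women are assumptions made so that the numbers add up, and `representation_gaps` is a hypothetical helper name, not part of any library.

```python
# Known population shares. The 49% figure for men comes from the example above;
# the 51% figure for women is assumed as the complement.
population_share = {"men": 0.49, "women": 0.51}

# Hypothetical sample of 1,000 respondents, 45% of whom are men.
sample_counts = {"men": 450, "women": 550}


def representation_gaps(sample_counts, population_share):
    """Return sample share minus population share per group, in percentage points."""
    total = sum(sample_counts.values())
    return {
        group: 100 * (sample_counts[group] / total - population_share[group])
        for group in population_share
    }


gaps = representation_gaps(sample_counts, population_share)
for group, gap in gaps.items():
    if gap < 0:
        status = "under-represented"
    elif gap > 0:
        status = "over-represented"
    else:
        status = "in line with the population"
    print(f"{group}: {gap:+.1f} points ({status})")
```

A negative gap means the group is under-represented in the sample relative to the known population figure. How large a gap matters depends on the study, so no threshold is built in here.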

It is good practice to review any results in a data set that can be checked against known facts about the population. Where there is evidence that the data set is inconsistent with other data, there are three possible explanations:

The other data is wrong.
The data is not representative. This happens when a specific group of respondents is under- or over-represented in the data set compared to the population. For example, most surveys under-represent sub-groups of the population that are hard to recruit (e.g., very affluent people or young males).
The data suffers from measurement error. This occurs when the recorded values for some or all cases are wrong to some extent, and it is extremely common. For example:
- In surveys, people are regularly asked how many products they bought, when they bought them, and what they paid for them. People often can't recall all this information correctly.
- Customer databases are often missing critical data. For example, a customer may not always use their loyalty card when purchasing. As a result, the customer database under-records the amount spent by the individual customer and overstates the number of different customers.

Only one of these problems has an easy solution: a lack of representativeness, which can be addressed by weighting the data.
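As a minimal sketch of what weighting can look like, the Python below gives each respondent a weight equal to their group's population share divided by its sample share. It reuses the 45% versus 49% example. The 51% and 55% shares for women and the "yes" rates are hypothetical values chosen only to show the effect, and real weighting schemes often adjust for several variables at once.

```python
# Shares of each group in the population and in the sample.
# Men's figures come from the earlier example; women's are assumed complements.
population_share = {"men": 0.49, "women": 0.51}
sample_share = {"men": 0.45, "women": 0.55}

# Weight for every respondent in a group: population share / sample share.
# Under-represented groups get a weight above 1, over-represented groups below 1.
weights = {g: population_share[g] / sample_share[g] for g in population_share}

# Hypothetical survey result: proportion of each group answering "yes".
yes_rate = {"men": 0.60, "women": 0.40}

# Unweighted estimate uses the sample's own mix of men and women.
unweighted = sum(sample_share[g] * yes_rate[g] for g in sample_share)

# Weighted estimate rescales each group to its population share.
weighted = sum(sample_share[g] * weights[g] * yes_rate[g] for g in sample_share)

print(f"weights: { {g: round(w, 3) for g, w in weights.items()} }")
print(f"unweighted 'yes' share: {unweighted:.3f}")
print(f"weighted 'yes' share:   {weighted:.3f}")
```

With these assumed figures, men receive a weight slightly above 1 and women a weight slightly below 1, so the weighted estimate moves toward the answer that a sample with 49% men would have produced.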