The most common variable set structure is Binary - Multi, which consists of a set of related binary variables. In survey analysis, this type of data is often referred to as multiple response data.
This structure arises in four common scenarios:
- Multiple response questions
- Ad hoc coding of properties of cases (i.e., "flags")
- One-hot coding/dummy variables
- Quantizing other variable set structures
Multiple response questions
Surveys commonly present lists of options and ask people to choose as many as are applicable. Such questions are known as multiple response questions. These questions are usually best stored as a variable set of binary variables.
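As a rough illustration of this storage format, the sketch below turns each respondent's list of selected options into one binary variable per option. The brand names, the `responses` data, and all variable names are hypothetical and are used only for illustration.

```python
# Hypothetical survey data: the options each respondent selected.
responses = [
    ["Coke", "Pepsi"],
    ["Pepsi"],
    [],  # this respondent selected nothing
    ["Coke", "Fanta", "Pepsi"],
]
options = ["Coke", "Pepsi", "Fanta"]

# One binary variable per option: 1 if the respondent selected it, else 0.
binary_multi = {
    option: [1 if option in selected else 0 for selected in responses]
    for option in options
}

# A typical summary: the percentage of respondents selecting each option.
percent_selected = {
    option: 100 * sum(values) / len(values)
    for option, values in binary_multi.items()
}

print(binary_multi)
print(percent_selected)
```

Because respondents can select several options, the percentages in such a summary do not need to add up to 100.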
Ad hoc coding of properties of cases (i.e., "flags")
In databases that are used for analysis (e.g., for business intelligence), it is routine to create sets of binary variables measuring interesting properties of the cases. For example, in a customer database, you may have binary variables indicating:
- Current customer
- Churn risk
- Male
- Etc.
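A minimal sketch of how such flags might be derived is shown below. The customer records, the field names, and the 90-day churn rule are all assumptions made for illustration; they do not come from any particular database or product.

```python
from datetime import date

# Hypothetical customer records.
customers = [
    {"id": 1, "last_purchase": date(2024, 5, 20), "cancelled": False, "gender": "Male"},
    {"id": 2, "last_purchase": date(2024, 1, 10), "cancelled": False, "gender": "Female"},
    {"id": 3, "last_purchase": date(2023, 11, 2), "cancelled": True, "gender": "Male"},
]

REFERENCE_DATE = date(2024, 6, 1)  # assumed date of the analysis
CHURN_DAYS = 90                    # assumed rule: no purchase in 90 days


def make_flags(customer):
    """Return a set of binary flags (0 or 1) for one customer."""
    current = not customer["cancelled"]
    days_inactive = (REFERENCE_DATE - customer["last_purchase"]).days
    return {
        "current_customer": int(current),
        "churn_risk": int(current and days_inactive > CHURN_DAYS),
        "male": int(customer["gender"] == "Male"),
    }


flags = [make_flags(c) for c in customers]
for customer, row in zip(customers, flags):
    print(customer["id"], row)
```

Each flag becomes one binary variable, and together the flags can be treated as a single variable set.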
One-hot coding/dummy variables
Many statistical and machine learning algorithms have been designed to work only with numeric predictor variables. A common way to work around this limitation is to replace each categorical variable with a set of binary variables, with one binary variable for each category except one (the omitted category serves as the reference category). Such variables are known as dummy variables in statistics. The closely related one-hot coding used in machine learning typically creates a binary variable for every category. See Dummy Variables.
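The sketch below illustrates the idea in plain Python. The `dummy_code` helper and the `region` data are hypothetical; the helper drops the first category in sorted order as the reference category, which is one possible convention rather than a fixed rule.

```python
def dummy_code(values, drop_reference=True):
    """Replace a categorical variable with a set of binary variables.

    With drop_reference=True, the first category (in sorted order) is
    omitted, so k categories produce k - 1 binary variables.
    With drop_reference=False, every category gets its own variable.
    """
    categories = sorted(set(values))
    if drop_reference:
        categories = categories[1:]
    return {
        f"is_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }


# Hypothetical categorical variable.
region = ["North", "South", "East", "South"]

dummies = dummy_code(region)                        # "East" is the reference
one_hot = dummy_code(region, drop_reference=False)  # all three categories

print(dummies)
print(one_hot)
```

Omitting one category avoids perfectly redundant predictors in models that include an intercept, which is why regression-style analyses usually use one fewer variable than there are categories.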
Quantizing other variable set structures
Summaries of Binary - Multi variable sets are easier to interpret than summaries of some other variable set structures (e.g., Numeric - Multi, Ordinal - Multi, Nominal - Multi, grids). Consequently, it is common to convert these other variable set structures to Binary - Multi. See Creating New Variables by Duplicating and Modifying Variable Sets.
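One way to picture this conversion is shown in the sketch below, which turns a set of 1 to 5 ratings into binary variables that indicate a rating of 4 or 5. The `ratings` data and the threshold of 4 are assumptions chosen for illustration; the appropriate cut-off depends on the data and the analysis.

```python
# Hypothetical numeric variable set: 1-5 ratings from four respondents.
ratings = {
    "Price": [5, 3, 4, 1],
    "Quality": [4, 4, 2, 5],
    "Service": [2, 5, 5, 3],
}

THRESHOLD = 4  # assumed cut-off: ratings of 4 or 5 count as 1

# Each numeric variable becomes a binary variable.
binary_multi = {
    name: [1 if value >= THRESHOLD else 0 for value in values]
    for name, values in ratings.items()
}

# The summary is now a single percentage per variable.
percent_high = {
    name: 100 * sum(values) / len(values)
    for name, values in binary_multi.items()
}

print(binary_multi)
print(percent_high)
```

The resulting summary reports, for each variable, the percentage of respondents who gave a high rating, which is often easier to read than a table of averages or full distributions.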