Occasionally it is necessary to create variables by writing code. Code is written using a computer language. Code can consist of short or long expressions. Code can also be used to automate the process of creating new variables.
Occasionally it is necessary to create variables by writing code
Most modern software contains multiple efficient ways of creating new variables. In particular:
- Creating Variables Using In-Built Options
- Creating New Variables by Duplicating and Modifying Variable Sets
- Creating New Variables by Creating Filters
If efficiency is the goal, it's advisable to learn the automated approaches in your software, because:
- It will take less of your time.
- You will make fewer errors.
- Calculations will be performed faster (as the built-in methods typically use more optimized code than you are likely to write).
To use an analogy, always writing code to create variables is a bit like cleaning a carpet using chopsticks (i.e., not sensible).
Code is written using computer languages. The most common ones used in data analysis are:
- Excel. Yes, the formulas really are written in a proper computer language.
- SPSS Syntax.
Code can consist of short or long expressions
If you are using analysis software with a graphical user interface, such as SPSS Statistics, Q, or Displayr, you will typically choose an option for creating a new variable and then provide:
- A name.
- A label (if using software that supports this).
- An expression.
The hard bit of writing code is creating the expressions.
Most variables are created with simple expressions, such as sums. E.g.,
Q4a + Q4b
Or if statements:
if (Q4a < 5) 1 else 0
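As a rough illustration, the two expressions above can be sketched in Python (a language swapped in for this example, not one named in this article). The sample responses and the new variable names `q4_sum` and `q4a_low` are invented for the sketch.

```python
# Hypothetical responses; Q4a and Q4b hold one value per respondent.
respondents = [
    {"Q4a": 3, "Q4b": 4},
    {"Q4a": 7, "Q4b": 1},
]

for row in respondents:
    # Sum expression: Q4a + Q4b
    row["q4_sum"] = row["Q4a"] + row["Q4b"]
    # If statement: 1 when Q4a is below 5, otherwise 0
    row["q4a_low"] = 1 if row["Q4a"] < 5 else 0
```

Each expression is evaluated once per respondent, which is what analysis software does behind the scenes when you supply an expression for a new variable.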
However, more complex expressions are also possible. The example below creates a family size variable using household structure and number of children questions as inputs. It assumes the following questions:
Q4. Are you …
- Living with your parents/guardian: 1 (SCREENOUT)
- Living alone: 2 (GO TO Q6)
- Living with partner only: 3 (GO TO Q6)
- Living with children only: 4
- Living with partner and children: 5
- Sharing accommodation: 6 (GO TO Q6)
- Other (Please type into the box.): 7 (GO TO Q6)
Q5. How many children live with you in the following age groups? Please make sure you type a number into every box.
- a) __ Under 5
- b) __ 5-7
- c) __ 8-11
- d) __ 12-14
- e) __ 15-17
- f) __ 18 to 24
- g) __ 25 or over
The expression is then:
if (Q4 == 2) 1;
else if (Q4 == 3) 2;
else if (Q4 == 4) 1 + Q5_A + Q5_B + Q5_C + Q5_D + Q5_E + Q5_F + Q5_G;
else if (Q4 == 5) 2 + Q5_A + Q5_B + Q5_C + Q5_D + Q5_E + Q5_F + Q5_G;
else if (Q4 >= 6) 3;
Note that for the last two categories (sharing accommodation and other), a family size of 3 has been guessed.
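The same logic can be sketched as a Python function (again a swapped-in language, used only for illustration). The function name `family_size` and the `children_counts` argument are hypothetical. Returning `None` when Q4 is 1 is an assumption, since the expression above has no branch for respondents who were screened out.

```python
def family_size(q4, children_counts):
    """Estimate family size from Q4 (household structure) and the Q5 counts.

    children_counts is a list of the Q5 answers (Q5_A to Q5_G).
    """
    children = sum(children_counts)
    if q4 == 2:       # living alone
        return 1
    if q4 == 3:       # living with partner only
        return 2
    if q4 == 4:       # living with children only
        return 1 + children
    if q4 == 5:       # living with partner and children
        return 2 + children
    if q4 >= 6:       # sharing accommodation or other: guessed as 3
        return 3
    return None       # assumption: Q4 == 1 (screened out) is treated as missing


# Example: partner and children, with one child under 5 and two aged 8-11
size = family_size(5, [1, 0, 2, 0, 0, 0, 0])
```

Summing the Q5 answers once, rather than repeating the long addition in two branches, is a design choice that makes the two "children" cases easier to check.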
Using code to automate the creation of the new variables
If using software that does not have a graphical user interface, your code needs to both:
- Contain the expression (as described in the previous section).
- Create a new variable.
In R, which has no graphical user interface, this approach is very efficient. For example, we can create a new variable that adds together two variables by writing:
q4sum = Q4a + Q4b
In software that does have a graphical user interface, the code is typically a bit more verbose, as the code needs to tell the software where to store the data and you need to choose when to run the code. For example, to create a new variable by recoding an existing variable in SPSS Statistics, you write code like this:
RECODE q4c (6=SYSMIS) (1 thru 3=0) (4 thru 5=1) INTO q4c.top2box.
VARIABLE LABELS q4c.top2box 'Top 2 Boxes: Coke Zero'.
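To make the recoding rule concrete, here is a minimal Python sketch of the same mapping (Python is used only for illustration; it is not the SPSS mechanism). The function name `top2box` and the sample values are hypothetical. Treating values outside 1 to 6 as missing is an assumption of this sketch.

```python
def top2box(q4c):
    """Recode q4c: 6 -> missing, 1 to 3 -> 0, 4 to 5 -> 1."""
    if q4c is None or q4c == 6:
        return None   # missing stays missing; 6 is recoded to missing
    if 1 <= q4c <= 3:
        return 0
    if 4 <= q4c <= 5:
        return 1
    return None       # assumption: any other value is treated as missing


# Hypothetical q4c responses, one per respondent
q4c_values = [1, 3, 4, 5, 6]
q4c_top2box = [top2box(value) for value in q4c_values]
```

The new list plays the role of the new variable, with `None` standing in for a missing value.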
In R, you need to run the code whenever you change it, so the efficiency gain is more of a short-term win than a long-term advantage.