Reshaping a data file

 

Converting variables to observations, observations to variables.

Most population surveys gather information about household members in a household roster. The data are organized with one observation being a household. Information about each member of a household is in different variables on the same observation. For example, the age of the first household member might be in the variable age01, the second in variable age02, and so forth.

We may need to summarize the data about household members. Or we may need to merge the household roster information with other information collected from selected members on different forms. In either case, it would be more convenient if the household roster variables were organized differently, with each observation being a member (instead of a household).

Stata offers the reshape command to make this transformation easier. Stata calls the original data format "wide", where household members are represented as variables on a household observation. They call the new format "long", where each household member is a separate observation. So, the command to transform the data from wide to long is reshape long, and from long to wide is reshape wide.

The example below uses a household roster with up to 10 members. Browse the data before and after each reshape to help visualize the transformation.


clear
use "q:\utilities\statatut\hhwide.dta"
summarize

/* Reshape from wide form to long form (each member becomes an observation). */

reshape long age edu rel sex, i(hhid) j(lineno)
summarize

/* Reshape from long form to wide form (each member becomes a set of 4 variables). */

reshape wide age edu rel sex, i(hhid) j(lineno)
summarize

Questions:

1. What do the terms "age", "edu", "rel", and "sex" stand for in the reshape command? Answer.


2. What is the term "i(hhid)"? Answer.


3. What is the term "j(lineno)"? Answer.


4. Why are there 5000 values for lineno in the long file, but only 2564 values for age and the other variables? Answer.


5. What happens to the variable "lineno" when we convert from long to wide? Answer.


6. The variable "electric" is a household-level variable. We want it to be added unchanged to each observation in the long form. Is that what happens? Answer.

 


Answers:

 

1. These are the prefixes for the variable names in the original (wide) form of the data. Reshape works most easily if the variable names have a prefix followed by a number. In this case we have the variable name prefix "age" followed by a number 1-10 corresponding to the household member's order in the roster. If your variable names don't have this prefix-number form, you may want to rename them before doing reshape. See the explanation of  foreach elsewhere in this tutorial for an easy way to rename multiple variables. Alternatively, you may want to explore the advanced features of reshape that allow you work with other naming conventions without renaming the variables. See the Stata manual for details.

 

Back to question

 


 

2. The reshape command needs a common identifier for each member of a household. Reshape long assigns this identifier to each household member, while reshape wide uses this identifier to assign each member to a household. In this case, that identifier is named hhid in the original data. We would need this identifier in order to collapse the data to the household level or to merge data from other files onto the new long file.

 

Back to question 

 


 

3. The reshape command needs an identifier for each individual in the household. Typically, this identifier is the line number in the household roster. The "j(lineno)" term tells reshape long to create a new variable named lineno. This variable captures the position of the household member in each household observation in the original (wide) file. The reshape wide command uses this variable to assign a suffix to the new household-level age, education, relationship, and sex variables it is creating. In reshape long, we can choose any name we want for the "j" variable, so we chose lineno, while in reshape wide we must use the variable that actually identifies each individual in the household. We would need this line number, together with the hhid variable, to merge individual-level data with the new long file.

 

Back to question 

 


 

4. Reshape creates one new observation for each set of 4 variables in the wide file. We have 10 sets of 4 variables, so we get 10 observations. The wide file has 500 households. Multiplied by 10 members, that's 5,000 new observations in the long file. However, not every household in the wide file has 10 members - some have fewer. Nevertheless, reshape creates a new observation for each set of variables, missing or not. The summarize command tells us that these 500 households have 2,564 members distributed among them, with anywhere from 1-10 members in each household. We probably would drop any observation with all missing data before doing anything further with the long file. The command would be:

drop  if missing(age)

Note: you should not use edu to drop observations with missing values, because 8 people are missing education but have values for all the other variables. They shouldn't be dropped.

 

Back to question.

 

 


 

5. The variable lineno disappears going from long to wide. Its values become the suffix in the variable names age, edu, rel, and sex.

Back to question 

 


 

6. Yes, the variable electric is automatically added to each new observation generated by the reshape long command. We do not name it in the reshape command, and Stata carries it along unchanged.

Back to question

 


 

Review again?

 

Another topic?


Wink Plone Theme by Quintagroup © 2013.

Personal tools
This is themeComment for Wink theme