Data Reshaping Techniques Using Stata

By Ishuan Li, Associate Professor of Economics, Minnesota State University

This tip provides an overview of data reshaping techniques in multiple-response sets using Stata, a general purpose statistical software package, as applied to multiple-response variables. See the Basic Introduction to Stata for more information.

Multiple-response variables are generated by questionnaires that allow for more than one response. This in turn can create data structure and analysis issues, which come mostly from the open-ended nature of some survey questions. The responses to these type of questions can be coded as a set of variables, or each answer can be treated as a single variable and responses can take numerical or text (string) values.

Let’s suppose the following survey question appears in a faculty survey of teaching methodologies.  

In Stata, indicator variables (dummy variables that take the value of zero or one) can be created to represent each of the six options. Let’s assume that five respondents answered the question and the results are captured on the following table (data editor on Stata):  


The variables for question 1 are all coded with a similar prefix (q1_). Among the five respondents, respondent (id) 1 chose chalk and board, handouts, and others among the teaching aids. The multiple-response set data formatted as shown allows us to provide simple summary statistics as follows:

Count: Stata command is “count” or “tabulate.” The count is returned on the output panel below the line of commands. In this case, the PowerPoint slide (pp) teaching aid is used two times among respondents who answered question 1.



Tabulate: computes the frequency (count) of times q1_pp equals one (and equals zero).


Among the six items in the answers to question 1, a frequency distribution of the number of aids can be computed by adding up the row totals. But first, a variable “naids” must be created:


For a count of how many times each teaching aid was chosen, the command “tabstat” returns the sum by adding up each column and returning the number under the column sum:


Shaping Data Structure

Long and wide data structures

Some statistical methods require that data be organized in “long” or “wide” structure. A “long” structure is one in which an element (id) is observed more than once; time series and longitudinal data are examples of “long” format. For example:


In “wide” format, an element (id) is observed once (a row on the spreadsheet). Cross-sectional data tend to be structured in wide format; paired t-tests require data to be organized in this format. For example, data from the prior table is presented in wide format as follows: 


In survey responses from multiple-response variables, the data can be presented in the “long” format, where individual unique responses to each teaching aid are recorded as follows:


The variable “q1” is a string (text) variable. Stata reads string variable values in red font (numerical operations are not possible with text values). To summarize the data, we can:


In this example, the data structure is more informative if shaped in long format. Take the following table; it shows the data in a “wide” format:

Suppose our dataset includes a variable on sex of respondent (male or female). We can “reshape” this dataset from “wide” to “long” structure, as follows:
Type: reshape long q1_, i(id) string
In the command, we ask Stata to: reshape the data to long format with one variable named “q1_”; the rows represent the ids; and generate a “j” variable (see second column “_j” on data editor caption, which takes one of the six values: pp, cb, h, iw, e, or o). The column for “sex” repeats if it is the same id.
Ranked multiple responses
Other times, multiple responses are generated to reflect some ranking (ordinal value). In this example, respondents are asked to rank each teaching aid from 1 (most used) to 6 (least used). In this case, respondents could omit (but cannot overlap) rankings. The dataset may look like this:

Row 1 represents respondent number one’s responses: he ranks chalk-and-board highest (or most used), followed by handouts, and ranks PowerPoint slides least. We can also create an additional “response” variable listing the ranking of the different teaching aids as follows:
On the command panel, type: egen response=concat (q1_*)
The response variable is shown below:

We can also clean up the zeroes in the response variable by dropping missing values, zeroes, and empty cells. On the command panel, type: replace response=subinstr (response, “0”, “”, .)

The string of commands is shown on the output panel as follows:


The data editor shows the response variable (in this case) without the zeroes:


On the other hand, suppose the dataset provides the ranking (1=best, 6=least) in the long format as follows:


We can reshape it to “wide” format as follows:

On the command panel, type: reshape wide q1_, i(id) j(rank)

The output panel shows the commands and its execution:


On the data editor window, we have the data in wide format with only five observations, six teaching aid options and the sex variable as columns.


In sum, Stata allows data reshaping from wide to long or vice versa with few key commands. The data editor interface is Excel friendly. Most importantly, the IR professional should realize that data reshaping is a powerful tool that is useful beyond summary presentation of dataset with textual observations. It is a great tool for reshaping the dataset in the appropriate format required for cross-section, time series, or panel/longitudinal inferential analysis.





To add a comment, Sign In
Total Comments: 1
Gary posted on 7/22/2016 12:24 PM
You would like the R packages "reshape2" and "dplyr."