Describing the data

 

Describing the variables: means, univariates, frequencies, and data types.

Now we'll use some real data. The data are from a health facility survey conducted in Tanzania in 1999. Type each of these commands and observe the results. Note that command names can be abbreviated; for example, su stands for summarize and tab for tabulate in the commands below.


       clear
       use "t:\statatut\exampfac.dta"
       describe
       summarize
       su pill-natural
       su fphour* 
       codebook urbrur facname
       tabulate urbrur
       tab urbrur, nolabel missing plot
       tab factype urbrur 
       tab1 factype urbrur
       tab2 factype urbrur
       tab2 factype urbrur, row col cell

 

Questions:

1. The describe command lists each variable in Stata's memory. What do the terms "double," "str42," "byte," etc. in the second column refer to? Answer.


2. How do I specify which data type I want to use? Answer.


3. The summarize command lists the number of observations, mean, standard deviation, min, and max for a variable. Why is the number of observations different for some variables, and is even 0 for facname? Answer.


4. How can I summarize a specific set of variables? Answer.


5. When is the codebook command useful? Answer.


6. The tabulate command gives frequencies (counts), and is most useful with categorical variables. What are the two ways to specify a one-way frequency? Answer.


7. What do the nolabel missing plot options do on the tabulate command? Answer.


8. How can I get two-way frequencies (cross-tabulations)? Answer.

 


Answers:

 

1. That is the data type for each variable. Each data type handles a different kind of data. The following table describes the data types used by Stata:
Type            Min          Max   Precision   Bytes   Kind
----------------------------------------------------------------
byte      -2 digits     2 digits    2 digits       1   integer
int       -4 digits     4 digits    4 digits       2   integer
long      -9 digits     9 digits    9 digits       4   integer
float       -10**38       10**36      10**-8       4   real   
double     -10**307      10**307     10**-16       8   real   
str1              1            1                   1   character
str80             1           80                  80   character
str244            1          244                 244   character
----------------------------------------------------------------
Strings are limited to 244 characters. You can see this and the limits on just about everything else in Stata by typing the command help limits. At CPC, we're using Stata/SE, so the right-hand column of that listing applies.

Back to question.

 

 


 

2. You can specify a data type on the generate command:
       gen byte a=0
If you don't specify a data type, Stata uses type float (4 bytes) by default. Using an efficient data type reduces the file size, which matters for very large files or for computers with little memory (RAM). The compress command selects the most efficient data type after variables have been generated. See compress in Miscellaneous Tips and Tricks for details.
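The effect of the storage type can be sketched with a short example. It assumes Stata's shipped auto dataset (loaded with sysuse) rather than the survey file, so that it runs anywhere; the variable names flag and flag2 are hypothetical.

```stata
* Assumption: uses the auto dataset that ships with Stata, not exampfac.dta
sysuse auto, clear

* Without a type, generate stores the new variable as float (4 bytes)
generate flag = 0

* With an explicit type, the same values fit in 1 byte
generate byte flag2 = 0

* Compare the storage types shown in the second column
describe flag flag2

* compress demotes flag (and any other variable that fits a smaller type)
* without changing any values
compress
describe flag flag2
```

After compress, both variables should be listed as byte.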

Back to question. 

 


 

3. The "Obs" column displays the number of non-missing observations for numeric variables. For string variables, like facname, it is always 0.
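A minimal sketch of the same behavior, assuming Stata's shipped auto dataset instead of the survey file: rep78 contains missing values and make is a string variable.

```stata
* Assumption: uses the auto dataset that ships with Stata, not exampfac.dta
sysuse auto, clear

* make is a string, so Obs is 0; rep78 has missing values,
* so its Obs is lower than that of price
summarize make price rep78

* Count the observations where rep78 is missing
count if missing(rep78)
```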

Back to question.

 


 

4. There are two ways to specify a variable list, both shown in the example:
  • pill-natural (first variable - last variable)
  • fphour* (root variable name plus *)

These two methods work with any Stata command that accepts a variable list. To use the first method, you need to know the order of the variables in the Stata data file. Use the describe command to see that order, or look for it in the Variables window.
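Both forms of variable list can be tried on any dataset. The sketch below assumes Stata's shipped auto dataset, in which price through trunk are adjacent in the file and two variable names begin with t.

```stata
* Assumption: uses the auto dataset that ships with Stata, not exampfac.dta
sysuse auto, clear

* List the variables in file order to see what a range will cover
describe

* Range: every variable from price through trunk, in file order
summarize price-trunk

* Wildcard: every variable whose name begins with t (trunk and turn)
summarize t*
```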

Back to question.

 


 

5. The codebook command gives univariate statistics about numeric variables, and it is a handy way to get information about string variables.
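As a hedged illustration, again assuming Stata's shipped auto dataset rather than the survey file, codebook can be run on a string variable and on numeric variables to compare what it reports.

```stata
* Assumption: uses the auto dataset that ships with Stata, not exampfac.dta
sysuse auto, clear

* String variable: reports the storage type, the number of unique values,
* the number of missing values, and a few example values
codebook make

* Numeric variables: reports range, unique values, and missing count, plus a
* tabulation (few distinct values) or mean, SD, and percentiles (many values)
codebook rep78 price
```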

Back to question.

 


 

6. The two ways to get one-way frequencies are:
  • tab factype (for a single variable)
  • tab1 pill-natural (necessary for lists of variables)
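The difference between the two commands can be sketched as follows, assuming Stata's shipped auto dataset in place of the survey file.

```stata
* Assumption: uses the auto dataset that ships with Stata, not exampfac.dta
sysuse auto, clear

* One variable: tabulate (abbreviated tab) gives a one-way frequency table
tab foreign

* Several variables: tab1 gives a separate one-way table for each
tab1 foreign rep78

* By contrast, tab with two variables gives a cross-tabulation,
* not two one-way tables
tab foreign rep78
```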
Back to question.

 

 


 

7. These three options give extra information about the variable urbrur:
  • nolabel displays the numeric values instead of the value labels
  • missing shows how many observations have missing values
  • plot gives a graphical comparison of the frequencies
For more information on how Stata handles missing values, see missing values in the Miscellaneous Tips and Tricks section of this tutorial.
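Each option can also be tried on its own. This sketch assumes Stata's shipped auto dataset, where foreign carries value labels and rep78 has some missing values.

```stata
* Assumption: uses the auto dataset that ships with Stata, not exampfac.dta
sysuse auto, clear

* Default: categories are shown by value label (Domestic, Foreign)
tab foreign

* nolabel: show the underlying numeric codes (0, 1) instead
tab foreign, nolabel

* missing: add a row for observations with a missing value
tab rep78, missing

* plot: add a bar chart of the relative frequencies
tab rep78, plot
```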

Back to question. 

 


 

8. The two ways to get two-way frequencies are:
  • tab factype urbrur
  • tab2 factype urbrur

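With exactly two variables, these two commands are equivalent. With a longer variable list, tab2 produces a cross-tabulation for every pair of variables, whereas tabulate accepts at most two variables.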
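A short sketch of two-way tables, assuming Stata's shipped auto dataset rather than the survey file:

```stata
* Assumption: uses the auto dataset that ships with Stata, not exampfac.dta
sysuse auto, clear

* Two variables: tab and tab2 give the same cross-tabulation
tab foreign rep78
tab2 foreign rep78

* row, col, and cell add row, column, and cell percentages
tab foreign rep78, row col cell

* Three variables: tab2 produces one table per pair of variables;
* tab would stop with an error here
tab2 foreign rep78 headroom
```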

Back to question. 

 


 

Questions or comments? If you are affiliated with the Carolina Population Center, send them to Phil Bardsley or Dan Blanchette.