Describing the data

 

Describing the variables: means, univariates, frequencies, and data types.

Now we'll use some real data. The data are from a health facility survey conducted in Tanzania in 1999. Copy each of these commands into the Command window, press Enter, and observe the results. Note that the letters in bold on each command are acceptable abbreviations. If you see "-more-" at the bottom of your Results screen, press the Space Bar to see a new page of data, or press "q" to quit the command.


       clear
       use "q:\utilities\statatut\exampfac.dta"
       describe
       summarize
       su pill-natural
       su fphour* 
       codebook urbrur facname
       tabulate urbrur
       tab urbrur, nolabel missing plot
       tab factype urbrur 
       tab1 factype urbrur
       tab2 factype urbrur
       tab2 factype urbrur, row col cell

 

Questions:

1. The describe command lists each variable in Stata's memory. What do the terms "double," "str42," "byte," etc. in the second column refer to?


2. How do I specify which data type I want to use?


3. The summarize command lists the number of observations, mean, standard deviation, min, and max for a variable. Why is the number of observations different for some variables, and why is it 0 for facname?


4. How can I summarize a specific set of variables?


5. When is the codebook command useful?


6. The tabulate command gives frequencies (counts), and is most useful with categorical variables. What are the two ways to specify a one-way frequency?


7. What do the nolabel, missing, and plot options do on the tabulate command?


8. How can I get two-way frequencies (cross-tabulations)?

 


Answers:

 

1. That is the data type for each variable. Each data type handles a different kind of data. The following table describes the data types used by Stata:

Type               Min          Max   Precision        Bytes   Kind
--------------------------------------------------------------------------
byte         -2 digits     2 digits    2 digits            1   integer
int          -4 digits     4 digits    4 digits            2   integer
long         -9 digits     9 digits    9 digits            4   integer
float          -10**38       10**38      10**-8            4   real
double        -10**307      10**307     10**-16            8   real
str1                 1            1                        1   string
str2                 2            2                        2   string
...                ...          ...                      ...   ...
str2045              1         2045                     2045   string
strL        2000000000   2000000000               2000000000   long string
--------------------------------------------------------------------------

Prior to Stata 13, strings were limited to 244 characters. Stata 13 raised the limit for str# variables to 2,045 characters and added a new data type, strL, which can hold strings of up to 2 billion characters. You can see these and the limits on almost everything else in Stata by typing help limits, and there is a detailed explanation of data types in the PDF documentation and under help data types.


 

 


 

2. You can specify a data type on the generate command:

       gen byte a=0

If you don't specify a data type, by default Stata uses type float (4 bytes). Using an efficient data type reduces the file size. This is important for very large files or for computers with little memory (RAM). The compress command selects the most efficient data type after variables have been generated. See compress in Miscellaneous Tips and Tricks for details on compress.
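As a rough illustration of the points above, the following sketch builds a tiny dataset from scratch and runs describe before and after compress. The variable names a, b, c, and name are made up for this example and are not part of the tutorial data.

```stata
* Hypothetical five-observation dataset, created only for this illustration
clear
set obs 5
* No type is given, so Stata stores a as float
generate a = 0
* Explicit storage types
generate byte b = 0
generate double c = 0.1
generate str10 name = "clinic"
describe
* compress demotes each variable to the smallest type that loses no information
compress
describe
```

After compress, the second describe should list a as byte and name as str6, while c stays double.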


 


 

3. The "Obs" column displays the number of non-missing observations for numeric variables. For string variables, like facname, it is always 0.
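To see this behavior on a small scale, here is a minimal sketch that uses a made-up four-facility dataset (the names and numbers are invented for illustration, not taken from the survey). The variables beds and staff contain missing values, and facname is a string.

```stata
* Hypothetical data: "." marks a missing numeric value
clear
input str8 facname beds staff
"Amani"   12  5
"Baraka"   .  7
"Chanzo"  30  .
"Dodoma"  18  .
end
* Obs should be 0 for facname, 3 for beds, and 2 for staff
summarize
* Count the missing values directly
count if missing(staff)
```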


 


 

4. There are two ways to specify a variable list, both shown in the example:

  • pill-natural (first variable - last variable)
  • fphour* (root variable name plus *)

Both methods work with any Stata command that accepts a variable list. The first method depends on the order of the variables in the dataset: pill-natural means pill, natural, and every variable stored between them. Use the describe command to see that order, or look in the Variables window.
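If exampfac.dta is not at hand, the same two kinds of variable list can be tried on auto, the example dataset that ships with Stata. This is only a sketch, and the variables named here belong to that dataset, not to the facility survey.

```stata
* Stand-in dataset bundled with Stata
sysuse auto, clear
* List the variable names in their stored order
describe, simple
* Hyphen range: price, rep78, and everything stored between them
summarize price-rep78
* Wildcard: every variable whose name starts with t
summarize t*
```

In auto, the range covers price, mpg, and rep78, and the wildcard matches trunk and turn.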


 


 

5. The codebook command gives univariate statistics about numeric variables, and it is a handy way to get information about string variables.


 


 

6. The two ways to get one-way frequencies are:

  • tab factype (for a single variable)
  • tab1 pill-natural (necessary for lists of variables)

Another handy command is fre, written by Ben Jann at the University of Bern. It is not built into Stata, but we have installed it on all terminal servers at CPC. Like many user-contributed Stata commands, it is available for free from the SSC archive at Boston College. You can install it on your desktop or laptop by typing:

ssc install fre
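The difference between tab and tab1 is easiest to see side by side. The sketch below uses Stata's bundled auto dataset as a stand-in for the survey data, so the variable names are assumptions for illustration only.

```stata
* Stand-in dataset bundled with Stata
sysuse auto, clear
* One variable: tabulate gives a one-way frequency table
tabulate foreign
* Several variables: tab1 gives a separate one-way table for each
tab1 foreign rep78
```

Typing tabulate foreign rep78 instead would produce a single two-way table, which is why tab1 is needed for lists.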


 

 


 

7. These three options give extra information about the variable urbrur:

  • nolabel displays the numeric values instead of the value labels
  • missing shows how many observations have missing values
  • plot gives a graphical comparison of the frequencies

For more information on how Stata handles missing values, see missing values in the Miscellaneous Tips and Tricks section of this tutorial.
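A minimal sketch of these options on invented data: the six values of urbrur below and the label name urbrurlbl are hypothetical, chosen so that one observation is missing.

```stata
* Hypothetical data: 1 = urban, 2 = rural, . = missing
clear
input byte urbrur
1
1
2
.
2
1
end
label define urbrurlbl 1 "Urban" 2 "Rural"
label values urbrur urbrurlbl
* Default: value labels are shown and the missing observation is left out
tabulate urbrur
* Numeric codes, an extra row for missing values, and a bar plot of the counts
tabulate urbrur, nolabel missing plot
```

The first table should total 5 observations and the second 6, because only the second counts the missing value.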


 


 

8. The two ways to get two-way frequencies are:

  • tab factype urbrur
  • tab2 factype urbrur

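With exactly two variables, these two commands produce the same table. If you give tab2 more than two variables, it produces a two-way table for every pair of variables in the list.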
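The sketch below shows both forms, plus the row, col, and cell options from the command list, on Stata's bundled auto dataset. That dataset is a stand-in, and its variable names are not part of the facility survey.

```stata
* Stand-in dataset bundled with Stata
sysuse auto, clear
* Two-way table of counts
tabulate foreign rep78
* Same table with row, column, and cell percentages added
tabulate foreign rep78, row col cell
* Three variables: tab2 produces one table for each of the three pairs
tab2 foreign rep78 headroom
```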


 


 
