Data cleaning
Working with do-files, documenting, outliers, duplicate ids.
Unless you're lucky enough to be working with perfectly clean data, you'll need to at least check for outliers and, the bane of all survey work, duplicate identifiers. In the previous section on changing data, the edit command was introduced. We don't recommend using it for data cleaning, because it does not keep a record of what you've done, and you can't easily repeat data cleaning steps you've taken.
Stata do-files, which are collections of commands that you write, are an easy way to clean and document your data. Copy and paste the following commands into a text editor (e.g., Wordpad, Notepad, or the Stata do-file editor), and save the file as a text (ascii) file named test.do. The suffix should be ".do" (not required, but strongly recommended). The file should be in the same directory that you're using for your Stata session. If you're using a different directory, remember where you saved it.
clear use "t:\statatut\exampfac.dta" /* Notice the blank line above this line. Blanks are fine in do-files. */ /* Comment lines, like these, can be anywhere in a do-file, including */ /* at the end of a command line. */ // You can also use double-slashes anywhere in a command line // to write a comment. Everything from there to the end of the // line will be treated as a comment. gen factype2= factype /* comment at the end of a command */ gen factype3= factype // this works, too * Which facilities have an odd value for age? ta age list facid facname factype authorit if age == 1998 replace age= 1 if age == 1998 * Look for duplicate values of facid using _n. sort facid list facid factype authorit if facid == facid[_n-1] #delimit ; list facid factype authorit if facid == facid[_n-1] | facid == facid[_n+1]; drop if facid == 1001 & missing(factype); #delimit cr * Look for duplicate values of facid using duplicates command. duplicates list facid
If you're using the Stata 11 do-file editor, click on the button "Execute (do)" (the icon looks like a page with writing on it and an arrow pointing to the right). If you used another editor, open Stata and type:
do test.do
or
do "driveletter/directory/etc/test.do"
depending on where you saved the file test.do. Press Enter and watch the results. When you see --more-- press the Space Bar.
Questions:
1. Comments can begin with an asterisk (*) and end with a carriage return, or they can begin with two slashes (//) and end with a carriage return, or they can be bracketed by /* */ and span an many lines as needed. All are shown in the example above. What happens if commands are enclosed in comments, like this:
/*
ta age
list facid facname factype authorit if age == 1998
replace age= 1 if age == 1998
*/
Answer.
2. When ta age was executed, the screen stopped scrolling and --more-- appeared at the bottom. What does that mean?
Answer.
3. The character _n is the sequence number of the current observation in memory. What does facid[ _n-1] mean?
Answer.
4. Why do we need to sort the data before testing for duplicate values of facid?
Answer.
5. Why list both observations _n-1 and _n+1?
Answer.
6. What does the duplicates command do?
Answer.
7. What if there is more than one nested id variable?
Answer.
8. What do the commands #delimit ; and #delimit cr do?
Answer.
Answers:
1. Everything that's enclosed in comments is treated as a comment and is not executed. This is a handy way to block out portions of a do-file that you don't want to run.
Back to question
2. When the results of a Stata command fill up more than one screen, Stata pauses to let you review what's currently on the screen. When you want to continue, press the Space Bar to see another page, or press the Enter key to see another line.
You have other options for how to handle this situation.
- put the command set more off in the beginning of your do-file
- click on the green "Go" button to see another page
- press Ctrl-Break to cancel the command
- press the "Q" letter key to cancel the command
- click on the "Break" button (red circle with white "X")
The other options are useful if you're debugging a do-file by executing one command at a time, or if you're working interactively (not using a do-file). If you Break or "Q" in the middle of a do-file, the do-file execution is canceled along with the command that produced the --more-- condition.
3. Stata implicitly creates an array for every variable. You can think of each column of the data in memory as an array. The subscript for each member of the array is the sequence number of that observation. Since _n is the sequence number for the current observation during execution of a command, it's element number is _n.
To refer to a particular element in the array, enclose the sequence number or a reference to it (such as _n) in square brackets and place it immediately after the variable name. For example, authorit[3] is the value of authorit for the third observation in memory.
Back to question
4. If duplicate values of facid exist, sorting the data will place them next to each other in the data in memory. So, the array element reference [ _n-1]
will refer to the element immediately prior to the current observation, and the duplicate value will match the current value.
Back to question
5. Actually, we've asked to list the current observation (_n). If the current observation has the same value of facid as either the previous or the next observation, it's a duplicate. This method catches 2 or more observations with the same value for any variable.
Back to question
6. The duplicates command does the same thing as the method using _n-1 and _n+1. It's much easier to use than the method described above, and you don't need to sort the data first. The duplicates command also allows you to drop duplicate observations easily. See "One-to-one merging" for an example. See help duplicates
for details.
Back to question
7. If there is more than one nested id variable, you need to sort in the proper nest order and check all ids for duplicates using the logical operator & (ampersand) between them. Suppose there are 3 nested ids:
sort id1 id2 id3
list id1 id2 id3 if id1 == id1[_n-1] & id2 == id2[_n-1] & id3 == id3[_n-1]
Again, we'd like to point out that the duplicates command is much easier to use than this method:
duplicates list id1 id2 id3
We've shown you the method using _n because there are many instances in programming where you'll want to refer to observations before or after a particular case. This implicit array is powerful and well worth understanding and remembering for other applications.
Back to question.
8. In a do-file, Stata assumes that each command is no more than 1 line long, and that each line ends in a carriage return (when you press the Enter key, a text editor inserts a carriage return symbol). If you want to type a command that is more than one line long, you need to tell Stata to look for a semi-colon with #delimit ; instead of a carriage return. From that point on, you must end each command, whether one or more lines long, with a semi-colon. To switch back to carriage return, use #delimit cr.
There are other ways to continue a single command across more than one line. One way is to comment out the carriage return - type /* at the end of one line, and */ at the beginning of the next line (to end the comment):
list facid factype authorit if facid == facid[_n-1] /* */ | facid == facid[_n+1]Another way, available in version 8 and later, is to end a line with ///, which tells Stata to continue reading the next line as a continuation of this line:
list facid factype authorit if facid == facid[_n-1] /// | facid == facid[_n+1]
There is no one "correct" method, so use the one you prefer.
Back to question
Review again?
Another topic?


