Working with do-files, documenting, outliers, duplicate ids.
Unless you're lucky enough to be working with perfectly clean data, you'll need to at least check for outliers and, the bane of all survey work, duplicate identifiers. In the previous section on changing data, the edit command was introduced. We don't recommend using it for data cleaning, because it does not keep a record of what you've done, and you can't easily repeat data cleaning steps you've taken.
Stata do-files, which are collections of commands that you write, are an easy way to clean and document your data. Copy and paste the following commands into a text editor (e.g., Wordpad, Notepad, or the Stata do-file editor), and save the file as a text (ascii) file named test.do. The suffix should be ".do" (not required, but strongly recommended). The file should be in the same directory that you're using for your Stata session. If you're using a different directory, remember where you saved it.
clear use "q:\utilities\statatut\exampfac.dta" /* Notice the blank line above this line. Blanks are fine in do-files. */
/* Comment lines, like these, can be anywhere in a do-file, including */
/* at the end of a command line. */ // You can also use double-slashes anywhere in a command line
// to write a comment. Everything from there to the end of the
// line will be treated as a comment. gen factype2= factype /* comment at the end of a command */ gen factype3= factype // this works, too * Which facilities have an odd value for age? ta age list facid facname factype authorit if age == 1998 replace age= 1 if age == 1998
* Look for duplicate values of facid using duplicates command. duplicates list facid list facid factype authorit if facid == 1001
* Look for duplicate values of facid using _n. sort facid list facid factype authorit if facid == facid[_n-1] #delimit ; list facid factype authorit if facid == facid[_n-1] | facid == facid[_n+1];
* Drop the duplicate observation where factype and authorit are missing.
drop if facid == 1001 & missing(factype)
If you're using the Stata do-file editor, click on the button "Execute (do)" (the icon looks like a page with writing on it and an arrow pointing to the right). If you used another editor, open Stata and type:
depending on where you saved the file test.do. Press Enter and watch the results. When you see --more-- press the Space Bar.
1. Comments can begin with an asterisk (*) and end with a carriage return (Enter key), or they can begin with two slashes (//) and end with a carriage return, or they can be bracketed by /* */ and span an many lines as needed. All are shown in the example above. What happens if commands are enclosed in comments, like this:
/* ta age list facid facname factype authorit if age == 1998 replace age= 1 if age == 1998 */
3. What does the duplicates command do? Answer.
4. Why are you showing a second way to check for duplicates? Answer.
5. The character _n is the sequence number of the current observation in memory. What does facid[ _n-1] mean? Answer.
6. Why do we need to sort the data before testing for duplicate values of facid? Answer.
7. Why list both observations _n-1 and _n+1? Answer.
8. What if there is more than one nested id variable? Answer.
9. What do the commands #delimit ; and #delimit cr do? Answer.
2. When the results of a Stata command fill up more than one screen, Stata pauses to let you review what's currently on the screen. When you want to continue, press the Space Bar to see another page, or press the Enter key to see another line.
You have other options for how to handle this situation.
- put the command set more off in the beginning of your do-file
- click on the green "Clear --more-- Condition" button to see another page
- press Ctrl-Break to cancel the command
- press the "Q" letter key to cancel the command
- click on the "Break" button (red circle with white "X")
The set more off command allows the do-file to continue uninterrupted to the finish. This is a good option if you have turned on logging, so you'll have a record of what scrolled by. Use set more on(the default) when you need to stop and check each screen.
The other options are useful if you're debugging a do-file by executing one command at a time, or if you're working interactively (not using a do-file). If you Break or "Q" in the middle of a do-file, the do-file execution is canceled along with the command that produced the --more-- condition.
3. The duplicates command counts, lists, tags, or drops duplicates. In this case we're just listing the duplicates so you can explore them further. See "One-to-one merging" for an example of dropping duplicates using the duplicates command. See help duplicates for details.
4. Before the duplicates command, you needed to use [_n] to check for duplicates. Even though you no longer need it to check for duplicates, it's useful to know about [_n] for more advanced programming.
5. You can think of each variable as a column in Excel, and _n is the row number. For example, facid is the first variable, which would be column A in Excel. To refer to the cell in the third row, you would type A3 in Excel. Similarly, in Stata facid is the value of facid for the third observation in memory.
Since Stata executes a command on every observation from top to bottom, facid[_n] is the current observation it's looking at, facid[_n-1] is the observation before the current observation, and facid[_n+1] is the next observation.
6. If duplicate values of facid exist, sorting the data will place them next to each other in the data in memory. So, the array element reference [ _n-1] will refer to the element immediately prior to the current observation, and the duplicate value will match the current value.
7. Actually, we've asked to list the current observation (_n). If the current observation has the same value of facid as either the previous or the next observation, it's a duplicate. This method catches 2 or more observations with the same value for any variable.
8. If there is more than one nested id variable, you need to sort in the proper nest order and check all ids for duplicates using the logical operator & (ampersand) between them. Suppose there are 3 nested ids:
sort id1 id2 id3 list id1 id2 id3 if id1 == id1[_n-1] & id2 == id2[_n-1] & id3 == id3[_n-1]
Again, we'd like to point out that the duplicates command is much easier to use than this method:
duplicates list id1 id2 id3
We've shown you the method using _n because there are many instances in programming where you'll want to refer to observations before or after a particular case. This implicit array is powerful and well worth understanding and remembering for other applications.
9. In a do-file, Stata assumes that each command is no more than 1 line long, and that each line ends in a carriage return (when you press the Enter key, a text editor inserts a carriage return symbol). If you want to type a command that is more than one line long, you can use #delimit ; to tell Stata to look for a semi-colon instead of a carriage return. From that point on, you must end each command, whether one or more lines long, with a semi-colon. To switch back to carriage return, use #delimit cr.
There are other ways to continue a single command across more than one line. One way is to comment out the carriage return - type /* at the end of one line, and */ at the beginning of the next line (to end the comment):
list facid factype authorit if facid == facid[_n-1] /* */ | facid == facid[_n+1]
Another way is to end a line with ///, which tells Stata to continue reading the next line as a continuation of this line:
list facid factype authorit if facid == facid[_n-1] /// | facid == facid[_n+1]
There is no one "correct" method, so use the one you prefer.