Data cleaning

 

Working with do-files, documenting, outliers, duplicate ids.

Unless you're lucky enough to be working with perfectly clean data, you'll need to at least check for outliers and, the bane of all survey work, duplicate identifiers. In the previous section on changing data, the edit command was introduced. We don't recommend using it for data cleaning, because it does not keep a record of what you've done, and you can't easily repeat data cleaning steps you've taken.

Stata do-files, which are collections of commands that you write, are an easy way to clean and document your data. Copy and paste the following commands into a text editor (e.g., Wordpad, Notepad, or the Stata do-file editor), and save the file as a text (ascii) file named test.do. The suffix should be ".do" (not required, but strongly recommended). The file should be in the same directory that you're using for your Stata session. If you're using a different directory, remember where you saved it.


clear
use "q:\utilities\statatut\exampfac.dta"

/* Notice the blank line above this line. Blanks are fine in do-files. */ 
/* Comment lines, like these, can be anywhere in a do-file, including */
/* at the end of a command line. */
// You can also use double-slashes anywhere in a command line
// to write a comment. Everything from there to the end of the
// line will be treated as a comment.
gen factype2= factype /* comment at the end of a command */ gen factype3= factype // this works, too * Which facilities have an odd value for age? ta age list facid facname factype authorit if age == 1998 replace age= 1 if age == 1998

* Look for duplicate values of facid using duplicates command. duplicates list facid list facid factype authorit if facid == 1001  
* Look for duplicate values of facid using _n.
sort facid list facid factype authorit if facid == facid[_n-1] #delimit ; list facid factype authorit if facid == facid[_n-1] | facid == facid[_n+1];
#delimit cr
 
* Drop the duplicate observation where factype and authorit are missing.

drop if facid == 1001 & missing(factype)

If you're using the Stata do-file editor, click on the button "Execute (do)" (the icon looks like a page with writing on it and an arrow pointing to the right). If you used another editor, open Stata and type:

do test.do

or

do "driveletter/directory/etc/test.do"

depending on where you saved the file test.do. Press Enter and watch the results. When you see --more-- press the Space Bar.


Questions:

1. Comments can begin with an asterisk (*) and end with a carriage return (Enter key), or they can begin with two slashes (//) and end with a carriage return, or they can be bracketed by /* */ and span an many lines as needed. All are shown in the example above. What happens if commands are enclosed in comments, like this:

       /*
       ta age
       list facid facname factype authorit if age == 1998
       replace age= 1 if age == 1998
       */

Answer.


2. When ta age was executed, the screen stopped scrolling and --more-- appeared at the bottom. What does that mean? Answer.

3. What does the duplicates command do? Answer.

4. Why are you showing a second way to check for duplicates? Answer.

5. The character _n is the sequence number of the current observation in memory. What does facid[ _n-1] mean? Answer.

6. Why do we need to sort the data before testing for duplicate values of facid? Answer.

7. Why list both observations _n-1 and _n+1? Answer.

8. What if there is more than one nested id variable? Answer.

9. What do the commands #delimit ; and #delimit cr do? Answer.

 

 


Answers:

1. Everything that's enclosed in comments is treated as a comment and is not executed. This is a handy way to block out portions of a do-file that you don't want to run.

Back to question


 

2. When the results of a Stata command fill up more than one screen, Stata pauses to let you review what's currently on the screen. When you want to continue, press the Space Bar to see another page, or press the Enter key to see another line.

You have other options for how to handle this situation.

  • put the command set more off in the beginning of your do-file
  • click on the green "Clear --more-- Condition" button to see another page
  • press Ctrl-Break to cancel the command
  • press the "Q" letter key to cancel the command
  • click on the "Break" button (red circle with white "X")

The set more off command allows the do-file to continue uninterrupted to the finish. This is a good option if you have turned on logging, so you'll have a record of what scrolled by. Use set more on(the default) when you need to stop and check each screen.

The other options are useful if you're debugging a do-file by executing one command at a time, or if you're working interactively (not using a do-file). If you Break or "Q" in the middle of a do-file, the do-file execution is canceled along with the command that produced the --more-- condition.

Back to question


 

3. The duplicates command counts, lists, tags, or drops duplicates. In this case we're just listing the duplicates so you can explore them further. See "One-to-one merging" for an example of dropping duplicates using the duplicates command. See help duplicates for details.

Back to question 


4. Before the duplicates command, you needed to use [_n] to check for duplicates. Even though you no longer need it to check for duplicates, it's useful to know about [_n] for more advanced programming.

Back to question


5. You can think of each variable as a column in Excel, and _n is the row number.  For example, facid is the first variable, which would be column A in Excel. To refer to the cell in the third row, you would type A3 in Excel. Similarly, in Stata facid[3] is the value of facid for the third observation in memory.

Since Stata executes a command on every observation from top to bottom, facid[_n] is the current observation it's looking at, facid[_n-1] is the observation before the current observation, and facid[_n+1] is the next observation.

Back to question 

 


6. If duplicate values of facid exist, sorting the data will place them next to each other in the data in memory. So, the array element reference [ _n-1] will refer to the element immediately prior to the current observation, and the duplicate value will match the current value.

Back to question 

 


7. Actually, we've asked to list the current observation (_n). If the current observation has the same value of facid as either the previous or the next observation, it's a duplicate. This method catches 2 or more observations with the same value for any variable.

Back to question 


8. If there is more than one nested id variable, you need to sort in the proper nest order and check all ids for duplicates using the logical operator & (ampersand) between them. Suppose there are 3 nested ids:

       sort id1 id2 id3
       list id1 id2 id3  if id1 == id1[_n-1] & id2 == id2[_n-1] & id3 == id3[_n-1]

Again, we'd like to point out that the duplicates command is much easier to use than this method:

       duplicates list id1 id2 id3

We've shown you the method using _n because there are many instances in programming where you'll want to refer to observations before or after a particular case. This implicit array is powerful and well worth understanding and remembering for other applications.

Back to question. 

 


9. In a do-file, Stata assumes that each command is no more than 1 line long, and that each line ends in a carriage return (when you press the Enter key, a text editor inserts a carriage return symbol). If you want to type a command that is more than one line long, you can use #delimit ; to tell Stata to look for a semi-colon instead of a carriage return. From that point on, you must end each command, whether one or more lines long, with a semi-colon. To switch back to carriage return, use #delimit cr.

There are other ways to continue a single command across more than one line. One way is to comment out the carriage return - type /* at the end of one line, and */ at the beginning of the next line (to end the comment):

list facid factype authorit if facid == facid[_n-1] /*
  */  | facid == facid[_n+1]

Another way is to end a line with ///, which tells Stata to continue reading the next line as a continuation of this line:

list facid factype authorit if facid == facid[_n-1] ///
   | facid == facid[_n+1]

There is no one "correct" method, so use the one you prefer.

Back to question 

 


 

Review again?

 

Another topic?


Wink Plone Theme by Quintagroup © 2013.

Personal tools
This is themeComment for Wink theme