Working with large data files
Stata requires that the data file you want to analyze fit into memory. This means that working with files approaching the size of your computer's memory can be a challenge. Fortunately, Stata supplies a number of nice tools for dealing with large data files. We review them here.
describe using
Sometimes you may just want to see what variables are in the large file. You don't need to use the entire file just to see a list of variables and their labels. Instead, you can type
describe using "bigfile.dta"
where "bigfile.dta" is the name of the file you want to describe. Stata will give you all the information about the variables that you would expect from the describe command. Ideally, you'll be able to select a subset of variables, or a subset of observations, just by looking at the describe output.
lookfor_all
If the big file has a lot of variables, the describe using command will give you a lot of text to search. You can capture this in a log and search it in a text editor such as Notepad or Word. But, if you have several files to search, try lookfor_all.
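As a minimal sketch of the log approach, the commands below write the describe output to a plain-text log that can be opened in any editor. The log name bigfile_vars.log is a hypothetical choice, and the sketch assumes no other log is already open.

```stata
* Write the variable listing to a text log for searching in an editor.
* "bigfile_vars.log" is a hypothetical file name chosen for this example.
log using "bigfile_vars.log", text replace
describe using "bigfile.dta"
log close
```

With several files, this has to be repeated for each one, which is where lookfor_all saves effort.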
The lookfor_all command is available from the SSC archive. It searches through all Stata data files in the current directory (and its subdirectories, if you ask for it) for any string you want to find. The string may be in the variable name or label. For example, you may want to find the variable containing the sampling weight, so you try searching for the string "weight". First, change directories (cd) to the directory containing the file or files you want to search, then "lookfor" the string:
cd "c:\big_file_directory"
lookfor_all weight, subdir
The command lists the name of each file containing that string, along with the names of all the variables containing that string in their name or label. It then gives you a clickable link to each file with a match. This command has lots of nice features. See help lookfor_all at CPC, or download it to your standalone computer with ssc install lookfor_all.
use list_of_variables using
You can bring a subset of variables from bigfile.dta into memory using this form of the use command:
use list_of_variables using "bigfile.dta"
After looking at the results of describe using, decide which variables you need for your analysis, and list them in the use command.
use in
You can bring in a small set of observations from a large file with this version of the command:
use in 1/20 using "bigfile.dta"
This loads only the first 20 observations, which lets you look at the actual values of the variables, perhaps learning more than you could glean from the describe command.
use if
Suppose you're only interested in studying people in a certain age range.
use if age >= 1 & age < 5 using "bigfile.dta"
Of course, you can combine any or all of these features in the same command.
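For instance, a variable list, an if condition, and an in range can all appear in one use command. This is only a sketch: id and weight are hypothetical variable names, the file is assumed to have at least 1,000 observations, and the clear option is added on the assumption that any data already in memory can be discarded.

```stata
* Load three variables, keeping only observations among the first 1,000
* in the file where age is at least 1 and less than 5.
* id and weight are hypothetical variable names used for illustration.
use id age weight if age >= 1 & age < 5 in 1/1000 using "bigfile.dta", clear
```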
Random sample
You might want to test your model on a small number of observations. Selecting those observations randomly usually gives you a more representative set than, for example, taking them from the beginning of the file. You can use the runiform() function to select any percentage of observations you choose. The function returns a value between 0 and 1, so to get a sample of roughly 10%, you can keep observations for which runiform() returns a value between 0 and 0.1, or in any other range of length 0.1, like this:
use if inrange(runiform(),0,.1) using "bigfile.dta"
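Because runiform() draws new pseudo-random numbers on each call, the command above generally picks a different set of observations every time it runs. If you want the same sample on each run, one option is to set the random-number seed first. In the sketch below, the seed value 12345 is arbitrary, and clear is added on the assumption that data in memory can be discarded.

```stata
* Fix the seed so the same pseudo-random draws are used on every run.
set seed 12345
* Keep each observation with probability 0.1, giving roughly a 10% sample.
use if inrange(runiform(),0,.1) using "bigfile.dta", clear
* Check how many observations were actually loaded.
count
```

The number of observations loaded will be close to, but usually not exactly, 10% of the file, because each observation is selected independently.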