Shrinking large data files

 

In an earlier example, we mentioned the compress command as a way to reduce the size of Stata data files. It works by storing each variable in the smallest data type that can hold its values without losing any precision.

When working with very large files, or on computers with limited RAM, you need to compress the data. Below is an example you can try to see how compress works.

 


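* Clear memory and load the example data set
clear
use "q:\utilities\statatut\excomp.dta"
* Let output scroll without pausing at --more--
set more off
* Store each variable in the smallest type that holds its values
compress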
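
If you do not have access to that file, the following minimal sketch builds a small made-up data set so you can watch compress change the storage types. The variable names age, visits, and region are assumptions chosen for illustration; they are not taken from excomp.dta.

```stata
* Made-up data standing in for excomp.dta (variable names are hypothetical)
clear
set obs 1000

* Whole numbers 0-89 stored as float
generate float age = mod(_n, 90)
* Whole numbers 0-6 stored as double
generate double visits = mod(_n, 7)
* Five-character text stored in an oversized 20-character string
generate str20 region = "north"

* Storage types before: float, double, str20
describe

compress

* Storage types after: age and visits should now be byte, region str5
describe
```

Comparing the two describe listings shows which variables compress was able to shrink.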

Questions:

1. All these variables started as data type "float" or as short strings. By how much does the memory requirement shrink when compress changes floats to bytes?


2. What does set more off do?


3. How can I tell whether compress will make much difference in my RAM requirements?


4. This is a "Catch-22"! I can't fit my Stata data into memory on my computer, but I can't compress it unless I can get it into Stata. What can I do?


5. I have a Windows compression program on my PC. Can I use that instead?

 


Answers:

 

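1. A float variable requires 4 bytes, while a byte variable requires 1 byte, so each variable that compress converts from float to byte needs only a quarter of the memory it did before. See help data types in Stata for details on the amount of storage required by each data type.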
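
To put numbers on this, the sketch below works through the arithmetic for a hypothetical file of 1,000,000 observations and 10 variables. Both figures are made up for illustration, and the calculation counts only the data themselves, taking 1 MB as 1,000,000 bytes.

```stata
* Hypothetical file size (illustrative figures, not from excomp.dta)
local obs   = 1000000
local nvars = 10

* Every variable stored as float: 4 bytes per value
display "as float: " (`obs' * `nvars' * 4 / 1e6) " MB"

* Every variable stored as byte: 1 byte per value
display "as byte:  " (`obs' * `nvars' * 1 / 1e6) " MB"
```

For these figures the data shrink from 40 MB to 10 MB.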

2. You'll recall from an earlier example that --more-- appears at the bottom of the Stata Results window when a command generates more lines than fit in that window. When you see --more--, you need to press the space bar to continue viewing results.

You can tell Stata to keep scrolling the results instead of stopping when the screen fills up. The command is set more off. That way you can go get a cup of coffee while Stata compresses your giant survey file. To turn the pauses back on, use set more on.

3. You can look at the current data type of the variables using the describe command. If you see a lot of doubles (8 bytes) or floats (4 bytes) and you know your data are mostly one- or two-digit whole numbers, compress will make a great deal of difference, because a byte can hold whole numbers from -127 to 100. If your data are already stored in bytes, compress won't help much.
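
One way to measure the difference is sketched below. It uses made-up data (the variable names age and income are hypothetical) and relies on the r(width) result that describe stores, which is the number of bytes each observation occupies.

```stata
* Made-up data for illustration (variable names are hypothetical)
clear
set obs 1000
generate float age = mod(_n, 90)
generate double income = 1000 * mod(_n, 50)

* Current storage types, plus the minimum and maximum of each variable
describe
summarize

* Bytes per observation before compress
quietly describe
local before = r(width)

compress

* Bytes per observation after compress
quietly describe
local after = r(width)

display "bytes per observation: `before' before, `after' after"
```

Multiplying the bytes per observation by the number of observations gives a rough idea of how much memory the data need before and after.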

4. There are two ways to get around this problem. You can find another computer with more RAM than yours and compress the file there. Or you can name a subset of variables in the use command (use varlist using filename) to bring only those variables into memory. This lets you split the file into two or more pieces, each small enough to fit in RAM and be compressed. Then you can recombine them, if necessary, into a single file. An example of this was shown in the discussion of One-to-one merging. Remember to include the necessary identification variables in each of the smaller files you create so that you can merge them if you have to.
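
The split-and-recombine approach might look like the sketch below. All file and variable names (bigfile.dta, part1.dta, id, age, and so on) are hypothetical, the first few lines only create a small stand-in for the oversized file so the example can be run, and the merge 1:1 syntax assumes Stata 11 or later.

```stata
* Build a small stand-in for the file that is "too large" (hypothetical names)
clear
set obs 1000
generate long id = _n
generate float age = mod(_n, 90)
generate float score = mod(_n, 50)
generate double visits = mod(_n, 7)
save "bigfile.dta", replace

* Check the storage types without loading the data into memory
describe using "bigfile.dta"

* Piece 1: the identifier plus the first group of variables
use id age score using "bigfile.dta", clear
compress
save "part1.dta", replace

* Piece 2: the identifier plus the remaining variables
use id visits using "bigfile.dta", clear
compress
save "part2.dta", replace

* Recombine the compressed pieces on the identifier
merge 1:1 id using "part1.dta"
drop _merge
save "bigfile_small.dta", replace
```

Because describe using reads only the file header, it also offers a way to judge how much compress is likely to help before you try to load anything.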

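5. No. Compression programs like WinZip and PKZIP use a different compression method: they shrink the file on disk, not the memory Stata needs once the data are loaded. Stata cannot read data files that have been compressed or archived by those programs, or by Unix commands such as compress, gzip, or tar, until the files have been uncompressed again.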
