Saving Space
Compressing files to save space
A file name ending in .gz or .Z indicates a compressed file. To use it, you must first uncompress it.
Before leaping into unzipping or uncompressing, check to make sure that there is enough disk space to handle the uncompressed version of the file. We try to operate at no higher than 85% on Emerald.
To see how much space is available in AFS, type:
fs lq
Look at the %Used number. If it is 95% or higher, do not unzip any file that is over 30,000,000 bytes.
If you check and see that we are using 90% or more of our disk space, send an e-mail to Phil Bardsley, who will see what files can be compressed or removed.
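Before unzipping, it can help to know how large the file will be once uncompressed. The sketch below reads that figure from the listing printed by `gzip -l` and compares it with the 30,000,000-byte figure mentioned above. The function names `gz_uncompressed_bytes` and `small_enough_to_unzip` are hypothetical, and the size reported by `gzip -l` can be unreliable for very large files, so treat this as a rough check rather than a guarantee.

```shell
# Hypothetical helper: print the uncompressed size in bytes of a .gz file.
# `gzip -l` prints a header line, then: compressed uncompressed ratio name
gz_uncompressed_bytes() {
    gzip -l "$1" | awk 'NR == 2 { print $2 }'
}

# Hypothetical helper: succeed only if the uncompressed size is at most the
# limit in bytes (second argument, defaulting to 30000000).
small_enough_to_unzip() {
    limit=${2:-30000000}
    size=$(gz_uncompressed_bytes "$1")
    [ "$size" -le "$limit" ]
}
```

For example, `small_enough_to_unzip csf8990a.dta.gz && gunzip csf8990a.dta.gz` would unzip the file only when it passes the check.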
This page is specifically for people working on the Dietary Patterns and Trends in US project.
To uncompress a .gz file, type:
gunzip filename.ext.gz
For example:
gunzip csf8990a.dta.gz
You have to be in the csf89/data/ directory to do this.
If a file has .Z at the end of the file name you have to type:
uncompress filename.ext.Z
For example:
uncompress csf8990a.dta.Z
You have to be in the csf89/data/ directory to do this.
If you're on Emerald you have to do this in batch by adding "bsub" like so:
bsub gunzip csf8990a.dta.gz
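Which command to use depends only on the file extension, so the choice can be sketched as a small shell helper. The function name `pick_uncompress_cmd` is hypothetical. This is only an illustration of the rule described above, not a script that exists on Emerald.

```shell
# Hypothetical helper: print the command that matches a file's extension.
pick_uncompress_cmd() {
    case "$1" in
        *.gz) echo gunzip ;;       # gzip-compressed file
        *.Z)  echo uncompress ;;   # file made with the compress command
        *)    echo "not a .gz or .Z file: $1" >&2
              return 1 ;;
    esac
}
```

For example, `pick_uncompress_cmd csf8990a.dta.Z` prints `uncompress`.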
The reason we gzip/compress files is so that we don't waste disk space on data sets that we aren't using. The gzip command does a better job compressing than the compress command. To gzip a file, type:
gzip csf8990a.dta
To compress:
compress csf8990a.dta
A little-known side effect of gzipping/compressing is that whenever you zip, unzip, compress, or uncompress a file, you become its owner. If it's a dataset you don't want to accidentally modify or remove, make sure its file permissions are -r--r--r--, which makes it read-only for all users.
Click here to learn how to change the permissions of a file on Emerald.
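As a minimal sketch of the permission change described above, the standard `chmod` command with mode 444 produces the -r--r--r-- setting. The function name `make_read_only` is hypothetical, and this assumes ordinary Unix permission bits apply to the directory you are working in.

```shell
# Hypothetical helper: make a file read-only for owner, group, and others.
make_read_only() {
    chmod 444 "$1"    # 444 corresponds to -r--r--r--
}
```

For example, `make_read_only csf8990a.dta` followed by `ls -l csf8990a.dta` should show -r--r--r-- at the start of the listing.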
Programming tips to save space
A program that consists of code such as:
******************************************************;
libname in "/usda/csf98/data/";
libname out "/usda/paper_name/data/";
data out.paper_name01;
  merge in.csf9810
        in.csf9805b;
  by hhid personid;
  where age>12;
run;
******************************************************;
is not efficiently using space.
Create analysis datasets sparingly: only when the program that creates them is complicated and you expect to use the analysis dataset frequently and over an extended period. The other exceptions are when the analysis dataset is small (less than 5,000,000 bytes) and/or you are just testing out SAS/Stata code.
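To check whether a dataset falls under the 5,000,000-byte figure, its size in bytes can be read with `wc -c`. The sketch below is only an illustration, and the function name `is_small_dataset` is hypothetical.

```shell
# Hypothetical helper: succeed if a file is smaller than the limit in bytes
# (second argument, defaulting to 5000000).
is_small_dataset() {
    limit=${2:-5000000}
    # Arithmetic expansion strips any padding that wc adds around the count.
    size=$(( $(wc -c < "$1") ))
    [ "$size" -lt "$limit" ]
}
```

For example, `is_small_dataset my_analysis_file.dta && echo "small enough to keep"`.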
If you are a new user and want to create analysis datasets in this resource-demanding way, make sure you keep the program that created the dataset, and remember to delete the dataset (or at least gzip it) once you no longer need it immediately.
If you are a more experienced user, you know you can simply include the above code in every program in which you would use the analysis dataset. Your program may take longer to run, but in general it's better to use up a little extra processing time occasionally than to take up large amounts of disk space all the time.
If you are going to create an analysis dataset, make sure to keep only the variables you need. You can greatly reduce the size of a file by limiting the number of variables stored. Also, an analysis dataset should have observations for only those people in the desired subsample. When it comes time to use survey commands to analyze the data, merge your analysis dataset with the person-level file and create a subpop variable.
**************************************************************
use "/usda/paper_name/data/my_analysis_file.dta"
merge hhid personid using "/usda/csf98/data/csf9805b.dta"
gen subpop=0
replace subpop=1 if _merge==3 /* persons in both datasets */
**************************************************************
Reducing the number of observations in an analysis dataset also saves a lot of disk space.
We cannot afford to house large analysis datasets that basically duplicate datasets already in existence in the usda file system.
Rule of thumb:
If you are not planning to use a dataset for more than a couple of
weeks, then gzip it (or delete it if you have the program that creates it).
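One rough way to find candidates for gzipping is to list datasets that have not been modified in about two weeks. The sketch below uses `find` with `-mtime`. The function name `list_stale_datasets` and the default `*.dta` pattern are assumptions for illustration, and modification time is only a stand-in for whether anyone still plans to use a file, so check with the owner before compressing anything.

```shell
# Hypothetical helper: list files under a directory that match a name pattern
# (default *.dta) and were last modified more than N days ago (default 14).
list_stale_datasets() {
    dir=$1
    days=${2:-14}
    pattern=${3:-*.dta}
    find "$dir" -type f -name "$pattern" -mtime +"$days"
}
```

For example, `list_stale_datasets /usda/paper_name/data` prints the matching file names, one per line, which you could then review and gzip by hand.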


