Precision and data storage

Two precision issues come up repeatedly when using Stata (and other similar analysis packages). One is how decimal values are represented in the computer's memory. The other is how large an integer you can store in a given Stata data type.

When 0.7 doesn't equal 0.7

Computers use a binary (0's and 1's) system to store decimal numbers. This leads to some inaccuracy, since some decimal values can't be stored exactly in binary. Try this:

clear all
set obs 1
gen x= 0.7
list
list if x == 0.7   // 0.7 doesn't equal 0.7
browse
list if x == float(0.7)   // now they are equal

You'll notice that the command list if x == 0.7 results in nothing being listed! When you browse the data, you'll see that 0.7 is being stored as the value 0.69999999. Since that value isn't 0.7, your list command finds no matches.

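The float function takes care of this problem. Stata evaluates the literal 0.7 in double precision, which is closer to 0.7 than the float value stored in x, so the two sides differ. float(0.7) rounds the literal to float precision, so both sides of the comparison hold the same stored value. Many decimal values are stored exactly in binary, for example 0.5, but many are not. Rather than trying to memorize which are and which are not, we suggest always using the float function when comparing a float variable with a decimal value.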
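The comparison can also be checked by displaying the stored values with more digits than the default format shows. The following is a minimal sketch rather than part of the original example: the variable names x and y are chosen here for illustration, and x is declared float explicitly in case the default type has been changed.

```stata
clear all
set obs 1
gen float  x = 0.7      // float: about 7 digits of precision
gen double y = 0.7      // double: about 16 digits of precision

* Display the stored values with 18 decimal places
display %20.18f x[1]
display %20.18f y[1]

* Literals such as 0.7 are evaluated in double precision
count if x == 0.7           // no match: float on the left, double on the right
count if x == float(0.7)    // match: both sides rounded to float
count if y == 0.7           // match: both sides are double
```

Storing the variable as double is therefore another way to make the plain comparison work, at the cost of 8 bytes per value instead of 4.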

Big integers

The other precision issue has to do with Stata's data types. Stata offers 3 data types for integers (byte, int, and long) and 2 for floating point (float and double). For character data, Stata offers data type string. If you type help data_types you'll see a table that lists the 5 numeric data types that Stata uses, and the string data type, along with the minimum and maximum values that can be stored in each data type, and the maximum value that can be stored precisely.
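The largest integer that each floating point type holds exactly can be checked directly. This is a small sketch that assumes the usual limits for single and double precision (2^24 = 16,777,216 for float and 2^53 = 9,007,199,254,740,992 for double); the float() function is used here only to round a value to float precision.

```stata
* float holds every integer exactly up to 2^24 = 16,777,216
display %12.0f float(16777216)
display %12.0f float(16777217)    // rounded to a neighboring float value

* double holds every integer exactly up to 2^53 = 9,007,199,254,740,992
display %20.0f 2^53
display %20.0f 2^53 + 1           // rounded to a neighboring double value
```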

Why data types? When Stata was first being developed, computers had very little random-access memory, and RAM was expensive. So, there was a benefit to storing values in as little memory as possible. While it would be convenient to create all variables with the highest precision available, which for numeric data is type double, this would waste a lot of memory. For example, a typical yes-no (1,2) variable can be stored accurately in a single byte, so storing it in type double would waste 7 bytes per observation. In a data file of survey results with thousands of variables and thousands of observations, this adds up to many megabytes of wasted storage.

Now that computer memory is less expensive, we tend to pay less attention to it. But we need to pay attention when creating values in Stata that are relatively large. Typically, this occurs when the researcher decides to create a single numeric identifier out of multiple, nested identifying variables. In the Demographic and Health Survey data, one can often use three variables (the sampling cluster, the household identifier within the cluster, and the person identifier within the household) jointly to identify an individual respondent. For example:

     duplicates report v001 v002 v003

usually demonstrates that these 3 nested variables create a unique identifier (but not always - be sure to check). But, if you try to combine them into a single numeric variable, you may run into trouble:

     gen id= (v001*1000000) + (v002*1000) + v003

This example creates an 8- or 9-digit number, depending on the values of the cluster identifier v001. But Stata will store id by default in type float. Data type float begins to lose precision above 7 digits: it can hold every integer exactly only up to 16,777,216, and above that, nearby integers are rounded to the same stored value. We might be lucky and create a unique identifier, but probably not.
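To see the problem concretely, here is a small sketch with made-up identifier values (the three observations below are hypothetical, not taken from a DHS file). Near 123 million, adjacent float values are 8 apart, so the three distinct person identifiers collapse into a single stored value.

```stata
clear all
input v001 v002 v003
123 45 1
123 45 2
123 45 3
end

* Stored as float explicitly, matching Stata's usual default
gen float id = (v001*1000000) + (v002*1000) + v003
format id %12.0f
list                      // id shows the same value in all three rows
duplicates report id      // reports surplus copies of the same id
```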

There are two ways to get around this. First, specify double when generating large integers. Data type double can accurately represent integers up to 15 digits:

     gen double id= (v001*1000000) + (v002*1000) + v003

While double is discussed in the data type help page in terms of floating point precision, it works well for integers too, and it's the only numeric type that precisely stores integers larger than the limit of type long (2,147,483,620).
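As a sketch with made-up identifier values (hypothetical, not taken from a DHS file), the double version keeps each identifier distinct. The isid command is used here as a quick check; it exits with an error if the variable does not uniquely identify the observations.

```stata
clear all
input v001 v002 v003
123 45 1
123 45 2
123 45 3
end

gen double id = (v001*1000000) + (v002*1000) + v003
format id %12.0f          // display as a plain integer
list
isid id                   // silent when id is unique, error otherwise
```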

The other way to do this is using data type string for the identifier. In fact, the DHS data includes a string identifier, called caseid, for the individual respondent. Strings are a bit of a pain to work with, but they hold digit sequences exactly: up to 244 characters in older versions of Stata, and up to 2,045 characters in a fixed-length string variable since Stata 13.
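If no string identifier is supplied, one can be built from the numeric parts. This is a sketch under the assumption that each component has at most three digits; the values are made up, the variable name id_s is chosen here for illustration, and the zero-padded widths would need adjusting for real data.

```stata
clear all
input v001 v002 v003
123 45 1
123 45 2
end

* Zero-pad each part to a fixed width so that, for example,
* cluster 1 / household 23 cannot collide with cluster 12 / household 3
gen str9 id_s = string(v001, "%03.0f") + string(v002, "%03.0f") + string(v003, "%03.0f")
list
isid id_s                 // silent when id_s is unique, error otherwise
```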

We recommend that you always check for duplicates when creating a composite identifier. And always use data type double for these big integers. Some people even recommend that you always use data type double, which you can do with:

     set type double, permanently

This asks Stata to create all new variables with data type double, greatly reducing your need to worry about precision. When you're finished creating an analysis file, you can compress the file. This command asks Stata to decide how each variable can be stored most efficiently without losing any information.
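A short sketch of this workflow follows, using made-up variables (yesno and id are names chosen here for illustration). The permanently option is left off so that the setting lasts only for the current session.

```stata
clear all
set type double                    // new variables default to double
set obs 1000
gen yesno = 1 + mod(_n, 2)         // values 1 and 2
gen id = 100000000 + _n            // 9-digit integers
describe                           // both variables are stored as double

compress                           // recasts each variable to the smallest type that loses nothing
describe                           // storage types are now smaller

set type float                     // restore Stata's factory default
```

Because compress only changes a storage type when no value would be altered, it is safe to run on a finished file.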

The Stata Corp. web site offers many FAQs on topics like this. Here's one on the precision of floating point storage that you might find useful: The accuracy of the float data type

 

Note: Dan Blanchette contributed this web page; however, please direct questions to Phil Bardsley as noted below.

 

