
Data Variables and Codes

What do the missing values codes .A, .B, .C, and .D mean?

SAS provides multiple missing-value codes, including the 27 values .A, .B, .C, ..., .Z and ., all of which are treated as missing by statistical procedures. The RLMS data take advantage of this feature to code missing values as follows:

.A = not applicable
.B = does not know
.C = refuses to answer
.D = does not answer
. = legitimate missing (due to skip instruction)

Those who are converting the data to a software product that does not provide multiple missing value codes, and who want to preserve these distinctions, may use the following code in SAS to convert the missing values to numeric values. Before making these changes, confirm that the variables you are converting do not contain any legitimate negative values in the range -9 to -6, and edit the code as needed.

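/* Read the SAS transport file and write a native SAS data set */
libname in xport 'c:\rlms\rminanth.906.xpt';
libname out 'c:\rlms\';
data out.rminanth;
  set in.rminanth;
  /* Loop over every numeric variable read from the input file */
  array miss{*} _numeric_;
  do idx = 1 to dim(miss);
    if miss{idx} = .d then miss{idx} = -9;       /* does not answer */
    else if miss{idx} = .c then miss{idx} = -8;  /* refuses to answer */
    else if miss{idx} = .b then miss{idx} = -7;  /* does not know */
    else if miss{idx} = .a then miss{idx} = -6;  /* not applicable */
  end;
  drop idx;
run;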
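For readers working outside SAS, the precondition stated before the SAS code (no legitimate values between -9 and -6) can be checked with a short script. The following is a minimal Python sketch, not part of the RLMS distribution: the function name and the sample records are hypothetical, and it assumes the data have already been loaded as a list of dictionaries with None standing in for every missing value.

```python
def find_collisions(rows):
    """Return (row_index, variable, value) for every value in -9..-6.

    Such values would become indistinguishable from the recoded
    missing-value codes (-9, -8, -7, -6) after conversion.
    """
    hits = []
    for i, row in enumerate(rows):
        for name, value in row.items():
            if isinstance(value, (int, float)) and -9 <= value <= -6:
                hits.append((i, name, value))
    return hits


# Hypothetical records; "temp_change" is an invented variable name.
rows = [
    {"age": 34, "temp_change": -7},    # -7 is a legitimate value here
    {"age": None, "temp_change": 2},   # None stands in for a missing value
]
collisions = find_collisions(rows)
print(collisions)
```

If the list is not empty, choose different target values for the affected variables before recoding.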

Since “not applicable” and “legitimate missing” are equivalent, some files use the “.” missing value for both meanings (that is, you will not find “.A” in those files).

Why are some questions in the questionnaire but not in the data file?

Some questions were added at the request of a funding agency in Russia and are not available to the public for a certain period of time. When the designated period ends, the corresponding variables will be published in subsequent releases of the data files.

There is also another type of restricted-use data, which users can access after completing the necessary documents. Contact us at rlms@unc.edu if you need questions that are missing from the data file.

What are the “Text” data?

Many responses to questions were recorded verbatim and were not coded into categories. These questions can be identified in the questionnaire by the note “(char)” under the variable name to the left of the question. Since these responses are unique to the respondent, the likelihood of disclosing the respondent’s identity is high. Therefore, the text data are not distributed with the remaining variables. Moreover, these responses are in Russian. Access to the text data requires stricter IRB review than does access to most of the data.

Sample Design and Methods

Is the sample representative?

The RLMS sample is representative of the Russian Federation at the national level. At the sub-national level, the Moscow and St. Petersburg samples are representative. No other sub-national portions of the sample are representative of their geographic or administrative areas. A more detailed discussion of the issues is available here.

Does the sample use stratification?

Yes. A multistage probability sample was employed to draw the sample of dwelling units. First, a list of 2,029 consolidated raions (similar to counties) was created from which to draw primary sample units (PSUs). These were allocated into 38 strata, based largely on geographical factors and level of urbanization, but were also based on ethnicity where there was salient variability. As in many national surveys involving face-to-face interviews, some remote areas were eliminated to contain costs; also, Chechnya was eliminated because of armed conflict. From among the remaining raions (containing more than 95.5 percent of the population), three very large population units were selected with certainty: Moscow city, Moscow Oblast, and St. Petersburg city each constituted a self-representing (SR) stratum. The remaining non-self-representing raions (NSRs) were allocated to 35 strata of roughly equal size. One raion was then selected from each NSR stratum using the method “probability proportional to size” (PPS). That is, the probability that a raion in a given NSR stratum was selected was directly proportional to its measure of population size.
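The selection of one raion per NSR stratum with probability proportional to size can be illustrated with a short sketch. The source does not say which PPS algorithm was used; the cumulative-total method below is one common way to do it and is an assumption, as are the raion names, the population figures, and the function name.

```python
import random


def pps_select_one(units, rng):
    """Pick one (name, size) unit with probability proportional to size.

    Cumulative-total method: draw a point on [0, total size] and return
    the unit whose cumulative size interval contains it.
    """
    total = sum(size for _, size in units)
    threshold = rng.uniform(0, total)
    running = 0
    for name, size in units:
        running += size
        if threshold <= running:
            return name
    return units[-1][0]  # guard against floating-point edge cases


# Hypothetical stratum of three raions with invented population sizes.
stratum = [("raion_a", 50_000), ("raion_b", 30_000), ("raion_c", 20_000)]
rng = random.Random(1994)
draws = [pps_select_one(stratum, rng) for _ in range(10_000)]
share_a = draws.count("raion_a") / len(draws)
print(round(share_a, 2))
```

Over many repeated draws, the share of selections for each raion should approach its share of the stratum population (one half for raion_a in this made-up example).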

The target sample size was set at 4,000 dwelling units. They were distributed as follows: a total of 584 units was allocated to the three SR strata, which contained 14.6 percent of the Russian population. In accordance with the principles of PPS, the remaining 3,416 dwelling units were allocated fairly equally across the 35 NSR primary sampling units, since they were drawn from fairly equal-sized strata using PPS. However, to allow for a non-response rate of approximately 15 percent, in actuality we drew a sample of 4,718 dwelling units, with 940 allocated to the three SR strata. Oversampling was concentrated in large urban areas, where the highest non-response rate was expected.
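The allocation figures in the paragraph above can be checked with simple arithmetic. The sketch below uses only numbers quoted in the text; the variable names are illustrative.

```python
TARGET_TOTAL = 4000   # target number of dwelling units
SR_TARGET = 584       # target units in the three self-representing strata
NSR_STRATA = 35       # number of non-self-representing strata
DRAWN_TOTAL = 4718    # dwelling units actually drawn
SR_DRAWN = 940        # drawn units in the three SR strata

sr_share = SR_TARGET / TARGET_TOTAL       # share of the target in SR strata
nsr_target = TARGET_TOTAL - SR_TARGET     # target units left for NSR strata
per_nsr_psu = nsr_target / NSR_STRATA     # average target per NSR PSU
nsr_drawn = DRAWN_TOTAL - SR_DRAWN        # drawn units in NSR strata
sr_inflation = SR_DRAWN / SR_TARGET       # drawn-to-target ratio, SR strata
nsr_inflation = nsr_drawn / nsr_target    # drawn-to-target ratio, NSR strata

print(f"SR share of target: {sr_share:.1%}")
print(f"NSR target: {nsr_target}, about {per_nsr_psu:.1f} per PSU")
print(f"Drawn/target: SR {sr_inflation:.2f}, NSR {nsr_inflation:.2f}")
```

The drawn-to-target ratio is noticeably higher in the SR strata than in the NSR strata, which is consistent with the statement that oversampling was concentrated in large urban areas.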

Since there was no consolidated list of households or dwellings in any of the 38 selected PSUs, an intermediate stage of selection was then introduced, as usual. The selection of second-stage units (SSUs) differed depending on whether the population was urban (located in cities and “villages of the city type,” known as “PGTs”) or rural (located in villages). That is, within each selected PSU the population was stratified into urban and rural substrata, and the target sample size was allocated proportionately to the two substrata. For example, if 40 percent of the population in a given region was rural, 40 of the 100 dwelling units allotted to the stratum were drawn from villages.

In rural substrata, villages served as the SSUs. In urban substrata, SSUs were defined by the boundaries of 1989 census enumeration districts, if possible. If the necessary information was not available, the boundaries of 1994 microcensus enumeration districts, voting districts, or residential postal zones were employed, in decreasing order of preference. Approximately one SSU was selected for each 10 dwellings in the sample, using PPS where the SSUs differed appreciably in size. After SSUs were selected, an enumeration of dwelling units was made by visual inspection and recourse to official documents. Finally, the required number of dwellings was selected systematically starting with a random address in the list.
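The last step, systematic selection from the enumerated list starting at a random address, might look like the sketch below. The text does not describe what happened when the interval ran past the end of the list; wrapping around (circular systematic sampling) is an assumption made here, and the function name and address labels are hypothetical.

```python
import random


def systematic_sample(addresses, n, rng):
    """Select n addresses at a fixed interval from a random start.

    The list is treated as circular, so the selection wraps around
    the end of the list (an assumption, not stated in the source).
    """
    size = len(addresses)
    if not 0 < n <= size:
        raise ValueError("n must be between 1 and len(addresses)")
    start = rng.randrange(size)
    # Integer arithmetic keeps the spacing exact and the picks distinct.
    return [addresses[(start + (i * size) // n) % size] for i in range(n)]


# Hypothetical enumeration list of 100 dwellings in one SSU.
addresses = [f"dwelling_{k}" for k in range(100)]
sample = systematic_sample(addresses, 10, random.Random(7))
print(sample)
```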

What census did the post-stratification weights adjustment use?

The post-stratification adjustment first used the 1989 census (1994 microcensus). Starting with Round 13, we used the 2002 census results for calculating the post-stratification weights.

How was the sample adjusted when dwellings were demolished?

In the districts where old houses (where our respondents used to live) were demolished, we replaced the demolished buildings with the new ones built on the same site, and the occupants of the new units were included in the sample. We did not create a special variable for these new households. The new households were sampled according to the same procedure: from a list of all dwellings on the survey site, we drew a systematic sample at a fixed interval. The occupants of the old households (movers from the demolished buildings) were followed to their new addresses where possible.

Is there a variable that identifies whether a household was part of the original follow up?

Yes, the RLMS data sets contain special variables that identify original sample respondents (individuals and households). These variables have the words “inmover” or “hhmover” in their names (for individual data use inmover* and for household data use hhmover*). Movers are respondents who are not from the original sample. Cases with *mover*=0 are therefore suitable for cross-sectional analysis, while the others (where *mover*=1) are for follow-up purposes only. You can also use the post-stratification weight variables (such as hhwgt_15) to identify movers, since their post-stratification weight is equal to zero.
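As an illustration of the rule above, the following Python sketch keeps only original-sample households for a cross-sectional analysis. The records are invented, and the column names "hhmover" and "hhwgt" are simplified stand-ins for the round-specific hhmover* and weight variables; check the actual names in the file you are using.

```python
def original_sample(rows, mover_var):
    """Keep records whose mover flag equals 0 (original sample)."""
    return [r for r in rows if r.get(mover_var) == 0]


# Hypothetical household records with simplified variable names.
households = [
    {"family": 1, "hhmover": 0, "hhwgt": 1.08},
    {"family": 2, "hhmover": 1, "hhwgt": 0.0},   # follow-up household
    {"family": 3, "hhmover": 0, "hhwgt": 0.93},
]

cross_section = original_sample(households, "hhmover")

# Consistency check: movers should carry a zero post-stratification weight.
movers_have_zero_weight = all(
    r["hhwgt"] == 0 for r in households if r["hhmover"] == 1
)
print([r["family"] for r in cross_section], movers_have_zero_weight)
```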

How do I identify an individual who changed households?

If, for example, person number 1 in a household roster in round 6 left the household before round 7, then in the household roster for round 7 person number 1 will be coded as absent (h7inhh01=2) in that household. The reason for their absence in round 7 will be coded in h7whyn01 (1 = they moved, 2 = the household split, 3 = they died).

If, for example, a household splits between rounds, then in the round 7 roster all members of the round 6 household are duplicated and will keep their round 6 roster number. They will be coded in the round 7 roster as absent in their old household and present with all personal characteristics in their new household. This approach simplifies keeping track of individuals between rounds.

What are the effects of loss to follow up?

The main effects are in the Moscow/SPB sample. Because of high attrition, the Moscow/SPB sample was replaced with a new sample in round 10. Starting with 2001, the Moscow/SPB observations from the 1994 sample are no longer part of the cross-sectional RLMS sample. Most of these people did not actually move from their original addresses, but in RLMS terms we still mark them as movers (that is, “movers from the cross-sectional sample”). We can no longer calculate post-stratification weights for them, because the weights adjust the cross-sectional sample (which, as a whole, represents the all-Russia population) to the census data. For the non-cross-sectional part of the sample, we have no data to adjust to.

What is the main reason for losing respondents?

Refusals are much more common than the inability to find movers. Along with the refusals, another important reason is “no contact, nobody home during at least 3 visits.” We can only count refusals or no-contacts at the address of a previously interviewed household, but we almost never can say whether the same people were not home/refused, or whether the dwelling now has a new household. So it is difficult to differentiate between refusals and no-contacts.

What are the non-response rates by round?

The response rate in the survey of households in the sample of dwelling units was 87.6 percent in Round 5, 82.1 percent in Round 6, 79.4 percent in Round 7, 77.7 percent in Round 8, 75.3 percent in Round 9, 57.9 percent in Round 10, 57.3 percent in Round 11, 54.8 percent in Round 12, 54.3 percent in Round 13, and 50.8 percent in Round 14. The response rate for individuals within interviewed households exceeded 97 percent in each round; thus the response rate for all individuals within sampled dwelling units was most likely just slightly lower than the corresponding figure for dwelling units.

The response rate for Rounds 10 through 14 cannot be directly compared with the response rate of the previous rounds. Because of the high attrition in the cross-sectional sample in Moscow and St. Petersburg during rounds 5 through 9, in Round 10 the cross-sectional sample in Moscow and St. Petersburg was replaced by a 100 percent new sample (using the same sample design). The comparison of the response rate can be made only for all other areas except Moscow and St. Petersburg cities. The response rate in the survey of households in the sample of dwelling units in all other areas except Moscow and St. Petersburg cities was 91.8 percent in Round 5, 87.3 percent in Round 6, 84.9 percent in Round 7, 83.4 percent in Round 8, 82.0 percent in Round 9, 80.3 percent in Round 10, 78.8 percent in Round 11, 76.8 percent in Round 12, 76.1 percent in Round 13, and 72.2 percent in Round 14.

Because of the decline in the response rate in big cities, the share of big cities in the sample fell below what was needed and continued to decrease each round, so a further sample repair was done in Round 15. We added new households to restore the share of each region in the sample (to make it equal to that of the 1994 sample). We used the same procedure for drawing the new addresses as in 1994. Unsurprisingly, the response rate at the new addresses was lower than at our old addresses, so the response rate for the whole sample in Round 15 decreased: 44.9 percent for the whole RLMS sample and 55.9 percent for all areas except Moscow and St. Petersburg cities. For the comparable part of Round 15 (that is, excluding the new addresses added in Round 15, so that it can be compared with previous rounds), the response rate was 50.6 percent for the whole RLMS sample and 69.9 percent for all areas except Moscow and St. Petersburg cities.

How do I link household-level data with data from individual household members?

For rounds 5 through 17, the combination of three variables (site, censusid, and family) uniquely identifies each household within a round. For rounds 18 and 19, the combination of two variables (region and family) uniquely identifies each household within a round. These variables are in files at both the household and individual levels.
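The composite-key link described above can be sketched in Python as follows. This is only an illustration with invented records; the key variable names (site, censusid, family) come from the text, while the function name and the other fields are hypothetical. For rounds 18 and 19 the key would be ("region", "family") instead.

```python
from collections import defaultdict


def link_members(households, individuals, key_vars):
    """Attach individual records to their household via a composite key."""
    by_key = defaultdict(list)
    for person in individuals:
        by_key[tuple(person[v] for v in key_vars)].append(person)
    return [
        {**hh, "members": by_key.get(tuple(hh[v] for v in key_vars), [])}
        for hh in households
    ]


KEY_R5_R17 = ("site", "censusid", "family")

# Hypothetical records from one round.
households = [{"site": 1, "censusid": 12, "family": 3, "hh_income": 100}]
individuals = [
    {"site": 1, "censusid": 12, "family": 3, "person": 1},
    {"site": 1, "censusid": 12, "family": 3, "person": 2},
    {"site": 1, "censusid": 12, "family": 4, "person": 1},  # other household
]

linked = link_members(households, individuals, KEY_R5_R17)
print(len(linked[0]["members"]))
```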

How was the household head assigned?

The head of household is assigned according to the following demographic hierarchy: (1) the oldest working-age male in the household, (2) if no working-age males, then the oldest working-age female, (3) if no working-age females, then the youngest retirement-age male, (4) if no retirement-age males, then the youngest retirement-age female, and finally (5) if no retirement-age females, then the oldest child.
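The hierarchy above translates directly into an ordered list of rules. The sketch below is an illustration only, not the code used to build the RLMS files. Because the text does not give the age cutoffs for "working-age" and "retirement-age", the example assumes each member already carries a "group" label ("working", "retired", or "child"); that field, the other field names, and the sample household are all hypothetical.

```python
def assign_head(members):
    """Return the household head following the five-step hierarchy."""
    rules = [
        ("working", "male", max),    # (1) oldest working-age male
        ("working", "female", max),  # (2) oldest working-age female
        ("retired", "male", min),    # (3) youngest retirement-age male
        ("retired", "female", min),  # (4) youngest retirement-age female
    ]
    for group, sex, pick in rules:
        pool = [m for m in members if m["group"] == group and m["sex"] == sex]
        if pool:
            return pick(pool, key=lambda m: m["age"])
    # (5) no adults at all: fall back to the oldest child, if any
    children = [m for m in members if m["group"] == "child"]
    return max(children, key=lambda m: m["age"]) if children else None


# Hypothetical household with no working-age male.
family = [
    {"name": "grandmother", "sex": "female", "age": 70, "group": "retired"},
    {"name": "mother", "sex": "female", "age": 41, "group": "working"},
    {"name": "son", "sex": "male", "age": 12, "group": "child"},
]
head = assign_head(family)
print(head["name"])
```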

Longitudinal vs. Cross-Sectional Samples

Why does the RLMS have two types of samples?

The RLMS Phase 2 sample has two parts: (1) the original sample addresses (drawn in 1994) and (2) the follow-up addresses. The original sample addresses make a representative all-Russia sample, while the follow-up addresses are needed for panel analyses of individual changes. The panel is composed of people who were interviewed as the original sample in their original locations for at least one round and then moved to a new address. When they moved, they left the representative sample. When they were interviewed at their new address, they were retained as part of the panel sample. Follow-ups allow us to observe individual changes during a number of years for more people. They help us to see what happens to people with given characteristics in 1994 by 2000 or 2005. But if we do a cross-sectional analysis for a particular round, for which we need a representative all-Russia sample for the given year, we do not need the follow-ups in our analysis.

Are different types of households in the panel?

Yes, there are two types of panel households: (1) “movers,” where all previously interviewed household members move to a new location, and (2) “split households,” where a previously interviewed household forms two different households, with “old” household members in both parts, and both parts are interviewed. With split households, only one of the parts can remain in the original sample; the other parts are interviewed as follow-ups.

For example, let’s suppose that in Round 5 we interviewed a household that consisted of a couple (parents) and their adult son. Before Round 6 the son marries and starts to live in a separate household. Perhaps he and his wife move away from the parents, or they stay at the parents’ address but as two distinct households who pay for food separately. Then, in Round 6 we interview the parents as one household and the son and his wife as another household. Both of these households have previously interviewed people, so both are “old” households. These two households have different BIDs (B identifier) but the same AIDs (A identifier) (as they come from one household of Round 5).

Note that in Round 6 there are no duplicates of the current round identifier, BID, but that there are duplicates of the previous round identifier, AID, because of this household split. If you would like to merge a dataset with the previous round’s data, use the previous round’s *ID to avoid the problems associated with duplicates. For example, if you want to merge Round 6 data with Round 5, use AID (not BID).
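The effect of merging on the previous round's identifier can be seen in a small Python sketch. The ID values and the "members" field are invented for illustration, and the function is hypothetical; the point is only that one Round 5 household may legitimately match several Round 6 households after a split.

```python
def merge_on_previous_id(previous_round, current_round, prev_id):
    """One-to-many merge: each current-round household is paired with
    the previous-round household that shares the previous-round ID."""
    by_id = {hh[prev_id]: hh for hh in previous_round}
    merged = []
    for hh in current_round:
        match = by_id.get(hh.get(prev_id))
        if match is not None:
            merged.append({"previous": match, "current": hh})
    return merged


# Hypothetical data: household 101 split into two parts before Round 6.
round5 = [{"aid": 101, "members": 3}]
round6 = [
    {"bid": 101, "aid": 101, "members": 2},   # parents
    {"bid": 151, "aid": 101, "members": 2},   # son and his wife
    {"bid": 102, "aid": None, "members": 4},  # not interviewed in Round 5
]

linked = merge_on_previous_id(round5, round6, "aid")
print([pair["current"]["bid"] for pair in linked])
```

Both parts of the split household are linked to the same Round 5 record, while the household with no Round 5 identifier is left unmatched.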

What happens if a household drops out?

Sometimes a household doesn’t respond. They might be unavailable or simply refuse to answer. They might move to another town and then return to the old address.

We never drop households from the sample, and we never drop an address from the sample. However, a household may participate in several waves, miss one, two, or three years, and then decide to participate again. For example, household #10107 participated in r6, r7 and r9 but didn’t want to be interviewed in r8. In a situation like this, researchers must be careful when creating a longitudinal file. If a household moved away for a long time (or permanently) and a new household moved into their address, this new household will have the same number, #10107, because #10107 is in fact the address number, not the household number.

To understand this situation, look at the identifiers. This new household #10107, which started to participate in the RLMS project in r9, will have no ID numbers for all previous rounds (r5, r6, r7, r8), so aid, bid, cid, and did will have missing values. This indicates that a new family lives at that address and has joined the sample. In this case you should not match the new household to the old one, because the two families are different.
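A simple check along these lines can flag such households before a longitudinal file is built. The Python sketch below is illustrative: the function name and records are hypothetical, and it treats both None and 0 as "no ID", which is an assumption about how missing identifiers appear once the data are loaded.

```python
def is_new_household(record, previous_id_vars):
    """True when every previous-round ID is missing (None or 0 here)."""
    return all(record.get(v) in (None, 0) for v in previous_id_vars)


PREVIOUS_IDS_R9 = ("aid", "bid", "cid", "did")  # IDs for r5 through r8

# Hypothetical r9 records sharing the address number 10107 scenario.
new_family = {"eid": 10107, "aid": None, "bid": None, "cid": None, "did": None}
returning = {"eid": 10107, "aid": None, "bid": 10107, "cid": 10107, "did": None}

print(is_new_household(new_family, PREVIOUS_IDS_R9))
print(is_new_household(returning, PREVIOUS_IDS_R9))
```

A household that skipped one round but took part in earlier ones still has at least one previous-round ID, so it is not flagged as new.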

If we can follow up the old household and find their new address, it will keep its old number #10107, and the new family which started to live at their old sample address will have a new ID, for example, #10114.

How do round identifiers change over time?

In Round 5 the AIDs were composed of household numbers from 1 to n within a population point. Starting with Round 6, BID, CID, DID, and the *IDs of later rounds have household numbers from 1 to n within a census district. So, for many Round 6 households BID and AID will not match, although the household is the same. We made another global change in numbering in Round 15: two-digit household numbers in *IDs were replaced with three-digit household numbers.

Each data file has IDs for all previous rounds. NEVER calculate a previous round’s ID from a later round’s ID. This is strictly prohibited. Although it gives correct results for many cases within Rounds 6 through 14, it also gives wrong results. We supply all previous *IDs each round to make sure that the linking is done correctly.

How are mover and split household identifiers constructed?

If a household moves as a whole (no split parts), it keeps its previous round *ID (like, CID=BID). If a household splits, one of the parts keeps the “old” *ID (like CID=BID), and other parts are given a new *ID. The new ID number starts with 51, as a rule. This practice eliminates duplicates in current round IDs. For example, if a person in household 40008 in round 6 moved out of the household, and he was successfully followed to his new household, he was assigned a new ID, 40051, in round 7.

Are identifiers unique over time?

For households, only the current round’s IDs are unique. For split households, previous rounds’ *IDs will be duplicated. But if you use only original sample observations, both the current round’s and previous rounds’ *IDs will be unique (there are no split parts in the original sample). If you would like to merge a dataset with the previous round’s data, use the previous round’s *IDs. In SPSS choose the option “table” while matching (that is, link one observation from the previous round to several observations of the current round), using the previous round’s *ID as the key for matching, e.g., “match files /file=rlms7data /table=rlms6data /by BID.”

For individuals, the files include the variable “idind” that identifies each individual across rounds. In addition, within each round a separate individual *ID number is calculated as the household *ID number multiplied by 100 plus the person number (the number of the household member in the household roster).
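The within-round individual ID formula can be written out as a small worked example. The function names are hypothetical, and the household number 40051 is borrowed from the split-household example earlier in this FAQ purely for illustration. The sketch assumes roster person numbers stay below 100, which the formula implies.

```python
def individual_round_id(household_id, person_number):
    """Within-round individual ID: household ID * 100 + roster number."""
    if not 0 < person_number < 100:
        raise ValueError("person number must be between 1 and 99")
    return household_id * 100 + person_number


def split_individual_round_id(individual_id):
    """Recover (household ID, person number) from a within-round ID."""
    return divmod(individual_id, 100)


person_id = individual_round_id(40051, 1)
print(person_id)                             # 4005101
print(split_individual_round_id(person_id))  # (40051, 1)
```

This decomposition applies only within a single round; to follow a person across rounds, use idind rather than any computed ID.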

What is the easiest way to construct a panel file?

For individuals, the best way is to use the variable “idind”, which uniquely identifies each person over all survey years. This variable is attached to each individual record in the Dataverse data. If you have older files that do not include idind, we supply a file that allows you to link idind with each individual so that you can link individuals over time. That variable, and the variables needed to link with each cross-sectional file, are in Longitudinal_identifiers.zip in the Supplemental Files section of the RLMS-HSE Dataverse.

To link households across rounds of data, always use the household identifier (AID, BID, etc.) from the earlier round. Here’s a simple example linking R14 with R15:

use rnhhhous.dta
merge 1:m jid using rohhhous.dta

Note that this is a “one-to-many” merge. The term “1:m” in the merge statement allows for households in R14 to split in R15 by recognizing that there may be duplicates of the R14 identifier (JID) in R15.

It was noticed that for 5-7% of individuals the year of birth changes in the panel data. Cases in which years of education decrease are even more common. Is this a technical error in the construction of the individual identifiers, or an error made when the questionnaires were filled out?

Such discrepancies are explained by the method of collecting information: every year the paper questionnaires are filled out according to the respondents’ own words during a personal interview. The interviewer has no right to demand any supporting documents from the respondent. The year of birth cannot change, but the respondent’s answer to the question about the year of birth can. The same applies to questions about education: in one wave a person may answer differently than in another. This phenomenon is typical not only of the RLMS database but of other large survey databases as well.
In case of discrepancies, we ask these questions again after the interview and record the correct answers in the data. Most often, the information obtained during the latest survey is confirmed. We do not change the data of previous waves, except in cases of obvious mistakes, when it is clear that the error was made by the interviewer. Indeed, the same respondent can answer “A” in one wave, “B” in another, and in the next wave answer “A” again (and confirm “exactly A!”) or even “C” (and confirm “C”).
Errors in the construction of the IDIND identification variable are corrected in the files of all waves as soon as we find them. However, compared to the number of discrepancies in the year of birth or education, the number of errors in the IDIND variable is insignificant.

Economic Constructed Variables

How have the income and expenditure variables been adjusted?

In the household file, the income and expenditure variables are available in deflated form. Non-deflated variables are marked with ‘n’ (nominal) and deflated variables with ‘r’ (real), for example tincm_nm and tincm_rm. The inflation index converts values to June 1992 prices (the start of the survey). The values have not been adjusted for regional differences. Note that the RLMS sample has NOT been designed to be regionally representative, so the researcher is cautioned not to interpret the data at the regional level. The regional deflator (CPI) for individual earnings can be found at: http://www.gks.ru/bgd/regl/b08_17/IssWWW.exe/Stg/02-06.htm.

The table below gives the CPI for the 7 federal districts. Unfortunately, information is available only for 2004 through 2007.

Consumer price index, subjects of the Russian Federation
(December to December of previous year, in %)

                                  2004    2005    2006    2007
Russian Federation               111.7   110.9   109.0   111.9
Central Federal District         112.1   110.5   109.0   112.2
Northwestern Federal District    112.3   111.2   109.5   112.6
Southern Federal District        112.0   112.1   109.0   112.1
Volga Federal District           112.4   110.2   108.7   113.1
Urals Federal District           110.4   111.7   110.2   110.9
Siberian Federal District        111.2   110.5   108.6   110.8
Far Eastern Federal District     111.3   113.3   108.8   109.6
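Because each figure in the table is a December-to-December index, price growth over several years is obtained by multiplying the yearly factors. The Python sketch below chains the Russian Federation row; the function name and the 1,000-ruble amount are illustrative assumptions. It is separate from the June 1992 deflator built into the RLMS constructed variables and is not a replacement for it.

```python
# Russian Federation row of the table (December to December, in %).
RF_CPI = {2004: 111.7, 2005: 110.9, 2006: 109.0, 2007: 111.9}


def cumulative_index(cpi_by_year, first_year, last_year):
    """Chain yearly December-to-December indices into one price factor."""
    index = 1.0
    for year in range(first_year, last_year + 1):
        index *= cpi_by_year[year] / 100
    return index


# Price growth from December 2003 to December 2007.
growth = cumulative_index(RF_CPI, 2004, 2007)

# A hypothetical nominal amount of 1,000 rubles in December 2007,
# expressed in December 2003 prices.
real_amount = 1000 / growth
print(round(growth, 3), round(real_amount, 1))
```

The same function can be applied to any district row by replacing the dictionary.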

Has the basket of goods for the inflation index been changed since 1992?

Yes. The basket of goods is calculated by the federal statistics service (http://www.gks.ru/). They use the same goods most of the time. To better reflect the real structure of consumption they can change the basket (to delete, add, or replace goods), but it should not influence the results greatly.

How were each of these variables constructed?

The variables were constructed from a variety of sources and are expressed as monthly means. While we don’t have documentation on these variables, we supply the code, which is well annotated. Please see the Constructed Variables Code section of the Dataverse.

Miscellaneous

Is other documentation available in English?

Most documentation in English, except the questionnaires and a few supplemental files, is currently on this website. The questionnaires and supplemental files are with the data on the CPC Dataverse. If you do not see an answer to your question, please ask rlms@unc.edu, and we will attempt to get an answer for you.