SOCY7709: Quantitative Data Management
Instructor: Natasha Sarkisian
Processing Observations across Subgroups
1. Subgroup operations across multiple variables
Oftentimes, datasets include nested subcomponents – e.g., data on each child of each respondent,
or on each person who helped with activities of daily living, or each recipient of financial
assistance, or on organizations for which respondent volunteered, etc. Even though the entire
dataset might not have a nested structure, these subcomponents are essentially nested – children
within the respondent, etc. Such data are usually provided in separate variables for each lower
level unit, e.g.:
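R’s ID | Child 1 age | Child 1 gender | Child 2 age | Child 2 gender | Child 3 age | Child 3 gender
-------+-------------+----------------+-------------+----------------+-------------+---------------
     1 |             |                |             |                |             |
     2 |             |                |             |                |             |
     3 |             |                |             |                |             |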
Different individuals might have different numbers of children, so, for example, if someone has
only one child, then only the child 1 columns will contain data; the following ones will be missing.
When dealing with such data, we could either (1) conduct analyses at the level of the individual,
which requires aggregating across variables, or (2) conduct analyses at the level of the
lower-level unit (here, child), which requires reshaping the data into:
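R’s ID | Child number | Child gender | Child age
-------+--------------+--------------+----------
     1 |            1 |              |
     1 |            2 |              |
     1 |            3 |              |
     2 |            1 |              |
     2 |            2 |              |
     2 |            3 |              |
     3 |            1 |              |
     3 |            2 |              |
     3 |            3 |              |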
In this long format, there will be rows with all child data missing, since some people have fewer
than 3 children; such blank rows need to be identified and deleted (we learned how to do that
when discussing reshaping).
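As a reminder, the reshaping route can be sketched as follows. This is a minimal illustration on made-up data; the variable names (id, age1-age3, gender1-gender3) are hypothetical and not taken from any of the course datasets.

```stata
* Toy data (hypothetical): three respondents with up to three children each
clear
input id age1 gender1 age2 gender2 age3 gender3
1 10 1  7 2  . .
2  4 2  . .  . .
3 15 1 12 1  9 2
end

* Wide -> long: one row per respondent-child slot
reshape long age gender, i(id) j(child)

* Drop the blank rows, i.e., slots in which all child variables are missing
drop if missing(age) & missing(gender)

list, sepby(id)
```

After the drop, each respondent keeps only as many rows as he or she has children.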
We will focus, however, on aggregating across variables when our goal is to have individuals as
units of analysis. As we discussed earlier, when we are interested in creating new variables based
on information that is stored in multiple columns in the dataset, we can use the many egen
options that allow us to combine such information across columns in various ways.
This is a recap:
• anycount(varlist), values(integer numlist) -- looks for a match across multiple variables
and generates a count of variables among those in the varlist for which values are equal to
any integer value in a supplied numlist. Values for any observations excluded by either if
or in are set to 0 (not missing).
• anymatch(varlist), values(integer numlist) -- same but generates a yes/no indicator; it is 1
if any variable in varlist is equal to any integer value in a supplied numlist and 0
otherwise. Values for any observations excluded by either if or in are set to 0 (not
missing).
• rowfirst(varlist) -- gives the first nonmissing value in varlist for each observation (row).
• rowlast(varlist) -- gives the last nonmissing value in varlist for each observation (row).
• rowmin(varlist) -- gives the minimum value in varlist for each observation (row).
• rowmax(varlist) -- gives the maximum value (ignoring missing values) in varlist for each
observation (row).
• rowmean(varlist) -- gives the mean for each observation across variables; great for
creating individual scores for multi-item scales. It ignores missing values: for example, if
three variables are specified and, in some observations, one of the variables is missing, in
those observations the new variable will contain the mean of the two variables that do
exist. Other observations will contain the mean of all three variables.
• rowsd(varlist) -- gives the (row) standard deviations of the variables in varlist, ignoring
missing values.
• rowmedian(varlist) -- similar to mean, but gives a median.
• rowpctile(varlist) [, p(#)] -- similar to median but gives the #th percentile of values across
the variables specified. If p() is not specified, p(50) is assumed, meaning medians, which
is the same as the rowmedian option.
• rowtotal(varlist) [, missing] -- creates the (row) sum of the variables in varlist, treating
missing as 0. If missing is specified and all values in varlist are missing for an
observation, the new variable is set to missing.
• rowmiss(varlist) -- gives the number of missing values in varlist for each observation
(row).
• rownonmiss(varlist) [, strok] -- gives the number of nonmissing values in varlist for each
observation (row). String variables may not be specified unless the strok option is also
specified. If strok is specified, string variables will be counted as containing missing
values when they contain "".
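For rowfirst, rowlast, rowmin, rowmax, rowmean, rowsd, rowmedian, and rowpctile, if all variables
in the varlist are missing for an observation, the result will be missing as well. The counting
functions (anycount, anymatch, rowmiss, rownonmiss) still return a number in that case, and
rowtotal returns 0 unless the missing option is specified.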
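To see how several of these functions treat missing values, here is a small sketch on made-up data (the variables age1-age3 are hypothetical child ages, not from the GSS file used below). The third respondent has no children, so all three variables are missing in that row.

```stata
* Toy data (hypothetical): ages of up to three children per respondent
clear
input id age1 age2 age3
1  5  8  .
2 12  .  .
3  .  .  .
end

egen nkids    = rownonmiss(age1 age2 age3)
egen oldest   = rowmax(age1 age2 age3)
egen meanage  = rowmean(age1 age2 age3)
egen totage   = rowtotal(age1 age2 age3)
egen totage_m = rowtotal(age1 age2 age3), missing

list
```

For the all-missing row, oldest and meanage are missing, nkids is 0, totage is 0, and totage_m is missing because the missing option was specified.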
Example:
. use "C:\Users\sarkisin\Documents\Data Management\gss2002.dta", clear
. sum relate*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     relate1 |      2765           1            0         1          1
     relate2 |      1876    3.168977     2.126585         2          8
     relate3 |       944    3.643008     1.496795         3          8
     relate4 |       494    3.504049     1.306476         3          8
     relate5 |       207    3.541063     1.313487         3          8
-------------+--------------------------------------------------------
     relate6 |        64    3.890625     1.533925         3          8
     relate7 |        24    4.166667     1.632993         3          8
     relate8 |         8       4.375     2.065879         3          8
     relate9 |         4        5.25     2.629956         3          8
    relate10 |         2           3            0         3          3
-------------+--------------------------------------------------------
    relate11 |         6         5.5     1.643168         3          7
    relate12 |         1           7            .         7          7

. codebook relate*

-------------------------------------------------------------------------------
relate1                            relationship of 1st person to household head
-------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  relate1

                 range:  [1,1]                        units:  1
         unique values:  1                        missing .:  0/2765

            tabulation:  Freq.   Numeric  Label
                          2765         1  head of household

-------------------------------------------------------------------------------
relate2                            relationship of 2nd person to household head
-------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  relate2

                 range:  [2,8]                        units:  1
         unique values:  7                        missing .:  889/2765

            tabulation:  Freq.   Numeric  Label
                          1214         2  spouse
                           328         3  child
                            13         4  son or daughter-in-law
                             7         5  grand or great-grandchild
                            11         6  parent or parent-in-law
                            44         7  other relative
                           259         8  non-relative
                           889         .

-------------------------------------------------------------------------------
relate3                            relationship of 3rd person to household head
-------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  relate3

                 range:  [3,8]                        units:  1
         unique values:  6                        missing .:  1821/2765

            tabulation:  Freq.   Numeric  Label
                           774         3  child
                            15         4  son or daughter-in-law
                            39         5  grand or great-grandchild
                            13         6  parent or parent-in-law
                            40         7  other relative
                            63         8  non-relative
                          1821         .

-------------------------------------------------------------------------------
relate4                            relationship of 4th person to household head
-------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  relate4

                 range:  [3,8]                        units:  1
         unique values:  6                        missing .:  2271/2765

            tabulation:  Freq.   Numeric  Label
                           417         3  child
                             7         4  son or daughter-in-law
                            30         5  grand or great-grandchild
                             1         6  parent or parent-in-law
                            16         7  other relative
                            23         8  non-relative
                          2271         .

-------------------------------------------------------------------------------
relate5                            relationship of 5th person to household head
-------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  relate5

                 range:  [3,8]                        units:  1
         unique values:  4                        missing .:  2558/2765

            tabulation:  Freq.   Numeric  Label
                           172         3  child
                            18         5  grand or great-grandchild
                             9         7  other relative
                             8         8  non-relative
                          2558         .

-------------------------------------------------------------------------------
relate6                            relationship of 6th person to household head
-------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  relate6

                 range:  [3,8]                        units:  1
         unique values:  4                        missing .:  2701/2765

            tabulation:  Freq.   Numeric  Label
                            45         3  child
                            11         5  grand or great-grandchild
                             5         7  other relative
                             3         8  non-relative
                          2701         .

-------------------------------------------------------------------------------
relate7                            relationship of 7th person to household head
-------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  relate7

                 range:  [3,8]                        units:  1
         unique values:  4                        missing .:  2741/2765

            tabulation:  Freq.   Numeric  Label
                            14         3  child
                             7         5  grand or great-grandchild
                             1         7  other relative
                             2         8  non-relative
                          2741         .

-------------------------------------------------------------------------------
relate8                            relationship of 8th person to household head
-------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  relate8

                 range:  [3,8]                        units:  1
         unique values:  4                        missing .:  2757/2765

            tabulation:  Freq.   Numeric  Label
                             5         3  child
                             1         5  grand or great-grandchild
                             1         7  other relative
                             1         8  non-relative
                          2757         .

-------------------------------------------------------------------------------
relate9                            relationship of 9th person to household head
-------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  relate9

                 range:  [3,8]                        units:  1
         unique values:  3                        missing .:  2761/2765

            tabulation:  Freq.   Numeric  Label
                             2         3  child
                             1         7  other relative
                             1         8  non-relative
                          2761         .

-------------------------------------------------------------------------------
relate10                          relationship of 10th person to household head
-------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  relate10

                 range:  [3,3]                        units:  1
         unique values:  1                        missing .:  2763/2765

            tabulation:  Freq.   Numeric  Label
                             2         3  child
                          2763         .

-------------------------------------------------------------------------------
relate11                              relation of 11th person (visitor) to head
-------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  relate11

                 range:  [3,7]                        units:  1
         unique values:  4                        missing .:  2759/2765

            tabulation:  Freq.   Numeric  Label
                             1         3  child
                             1         4  son or daughter-in-law
                             2         6  parent or parent-in-law
                             2         7  other relative
                          2759         .

-------------------------------------------------------------------------------
relate12                              relation of 12th person (visitor) to head
-------------------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  relate12

                 range:  [7,7]                        units:  1
         unique values:  1                        missing .:  2764/2765

            tabulation:  Freq.   Numeric  Label
                             1         7  other relative
                          2764         .

. for num 1/8: egen relshipX=anycount(relate*), values(X)
-> egen relship1=anycount(relate*), values(1)
-> egen relship2=anycount(relate*), values(2)
-> egen relship3=anycount(relate*), values(3)
-> egen relship4=anycount(relate*), values(4)
-> egen relship5=anycount(relate*), values(5)
-> egen relship6=anycount(relate*), values(6)
-> egen relship7=anycount(relate*), values(7)
-> egen relship8=anycount(relate*), values(8)

. sum relship*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    relship1 |      2765           1            0         0          1
    relship2 |      2765    .4390597     .4963621         0          1
    relship3 |      2765     .636528     1.025315         0          8
    relship4 |      2765    .0130199       .11338         0          1
    relship5 |      2765     .040868     .2810861         0          5
-------------+--------------------------------------------------------
    relship6 |      2765    .0097649     .1088294         0          2
    relship7 |      2765    .0433996     .2839101         0          5
    relship8 |      2765    .1301989     .4350459         0          5

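The same counting loop can also be written with forvalues, which is how loops over a range of numbers are usually written in current Stata. The sketch below uses a made-up household roster (relate1-relate4 with the same relationship codes as above) so that it runs on its own; the data values are hypothetical.

```stata
* Toy household roster (hypothetical): 1=head, 2=spouse, 3=child, 8=non-relative
clear
input relate1 relate2 relate3 relate4
1 2 3 3
1 3 8 .
1 . . .
end

* One count variable per relationship code
forvalues v = 1/8 {
    egen relship`v' = anycount(relate*), values(`v')
}

list relate* relship1 relship2 relship3 relship8
```

In the first household, relship3 is 2 (two children); in the single-person household, every count other than relship1 is 0 rather than missing.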
2. Subgroup operations across cases
When talking about subgroups in previous examples, we were working across variables, focusing
on aggregating based on groups of variables. But oftentimes, subgroups of interest are
represented within a given variable – that is the case when the dataset is nested/multilevel and
structured as such datasets are typically structured – e.g., students nested within schools means
each student is one observation, and therefore each school corresponds to multiple observations.
In such situations, we could conduct operations by group using by and bysort prefixes – for
example, to examine means by groups:
. bysort degree: sum income

-------------------------------------------------------------------------------
-> degree = lt high

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       367    9.637602     3.140082         1         13

-------------------------------------------------------------------------------
-> degree = high sch

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |      1427    10.77645     2.510244         1         13

-------------------------------------------------------------------------------
-> degree = junior c

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       193    11.65285     1.509934         1         13

-------------------------------------------------------------------------------
-> degree = bachelor

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       430    11.68372      1.55324         1         13

-------------------------------------------------------------------------------
-> degree = graduate

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       227     11.8326     1.071841         1         13

-------------------------------------------------------------------------------
-> degree = .

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |         1           9            .         9          9

We could, however, also use the mean command with the over() option:
. mean income, over(degree)

Mean estimation                     Number of obs    =    2644

    _subpop_1: degree = lt high school
    _subpop_2: degree = high school
    _subpop_3: degree = junior college
     bachelor: degree = bachelor
     graduate: degree = graduate

--------------------------------------------------------------
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
income       |
   _subpop_1 |   9.637602   .1639109      9.316195    9.959009
   _subpop_2 |   10.77645   .0664514      10.64615    10.90676
   _subpop_3 |   11.65285   .1086874      11.43973    11.86597
    bachelor |   11.68372   .0749039      11.53684     11.8306
    graduate |    11.8326   .0711406       11.6931     11.9721
--------------------------------------------------------------

If we want to examine values by group, we can use list with sepby:
. sort indus80
. list age sex hrs1 indus80, sepby(indus80)

     +-------------------------------+
     | age      sex   hrs1   indus80 |
     |-------------------------------|
  1. |  72   female      .        10 |
  2. |   .   female      .        10 |
  3. |  62     male     70        10 |
  4. |  83     male      .        10 |
  5. |  74     male     20        10 |
  6. |  69   female      .        10 |
  7. |  31     male      .        10 |
  8. |  47     male      .        10 |
  9. |  68     male      .        10 |
 10. |  75   female      .        10 |
 11. |  76   female      .        10 |
     |-------------------------------|
 12. |  39   female      .        11 |
 13. |  36     male      6        11 |
 14. |  45     male     60        11 |
 15. |  19     male      .        11 |
 16. |  61     male      .        11 |
 17. |  37     male     30        11 |
 18. |  50   female      6        11 |
 19. |  63     male     50        11 |
 20. |  46   female     36        11 |
 21. |  43     male     40        11 |
 22. |  39   female     40        11 |
 23. |  23     male     45        11 |
 24. |  36     male     50        11 |
 25. |  23     male     70        11 |
 26. |  78     male      .        11 |
 27. |  65     male      .        11 |
 28. |  20     male      .        11 |
 29. |  76     male      .        11 |
     |-------------------------------|
 30. |  51   female     35        20 |
 31. |  50     male     20        20 |
     |-------------------------------|
 32. |  59     male     45        21 |
 33. |  63   female     40        21 |
 34. |  20     male     15        21 |
 35. |  53     male      .        21 |
 36. |  24     male     48        21 |
 37. |  30     male     16        21 |
 38. |  28     male      .        21 |
 39. |  25     male     65        21 |
     |-------------------------------|
 40. |  82   female      .        30 |
     |-------------------------------|
 41. |  58   female     36        31 |
 42. |  31   female     45        31 |
 43. |  33     male      .        31 |
     |-------------------------------|
 44. |  75     male      .        40 |
     |-------------------------------|
 45. |  33     male     40        41 |
     |-------------------------------|
 46. |  56     male     50        42 |
 47. |  41     male      .        42 |
 48. |  39     male      .        42 |
 49. |  45     male     84        42 |
 50. |  52     male     50        42 |
     |-------------------------------|
 51. |  48     male     52        50 |
--Break--
r(1);

If we want to actually aggregate from the lower level to the higher level (that is, create a variable
that will have the same value for all members of the same subgroup – for example, mean of that
subgroup, or its standard deviation, etc.), we would now do that across observations (for a given
variable or a combination of variables), also using bysort prefix. There are some useful egen
options for that as well – we didn’t discuss these yet:
• mean(exp) -- the mean of exp
• sd(exp) -- the standard deviation of exp
• total(exp) [, missing] -- creates a constant (within varlist) containing the sum of exp
treating missing as 0. If missing is specified and all values in exp are missing, the new
variable is set to missing.
• median(exp) -- the median of exp
• pctile(exp) [, p(#)] -- the #th percentile of exp. If p(#) is not specified, 50 is assumed,
meaning medians.
• mode(varname) [, minmode maxmode nummode(integer) missing] -- the mode for
varname, which may be numeric or string. The mode is the value occurring most
frequently. If two or more modes exist or if varname contains all missing values, the
mode produced will be a missing value. To avoid this, the minmode, maxmode, or
nummode() option may be used to specify choices for selecting among the multiple
modes, and the missing option will treat missing values as categories. minmode returns
the lowest value, and maxmode returns the highest value. nummode(#) will return the
#th mode, counting from the lowest up. Missing values are excluded from determination
of the mode unless missing is specified.
• min(exp) -- the minimum value of exp
• max(exp) -- the maximum value of exp
• iqr(exp) -- interquartile range of exp
• kurt(varname) -- kurtosis of varname
• skew(varname) -- skewness of varname
• rank(exp) [, field|track|unique] -- creates ranks based on values of exp; by default, equal
observations are assigned the average rank.
• count(exp) -- gives the number of nonmissing observations of exp
With the egen functions we used earlier, we were working across variables, focusing on rows.
Here, by contrast, we focus on one column at a time. If we do not use the by: or bysort: prefix,
the function simply generates a constant (the same value in every observation). For example:
. egen modehrs1=mode(hrs1)
. sum modehrs1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    modehrs1 |      2765          40            0        40         40

. egen meanagehrs=mean(agem*hrs1m)
. sum meanagehrs

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
  meanagehrs |      2765    -11.6341            0  -11.6341   -11.6341

Typical use of these options of egen, however, is not to generate a constant like these ones, but to
create variables with statistics calculated by some kind of group with by: or bysort: prefix.
For example:
. bysort marital: egen agegroupmean=mean(age)
. table agegroupmean marital

-------------------------------------------------------------------------------------
agegroupm |                            marital status
ean       |       married        widowed       divorced      separated  never married
----------+--------------------------------------------------------------------------
 33.32764 |                                                                       708
 44.26042 |                                                         96
 47.75475 |         1,269
 48.86652 |                                         445
  71.7328 |                          247
-------------------------------------------------------------------------------------
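The difference between a group-level statistic and an overall constant is easy to see on a tiny made-up dataset (the variables group and age below are hypothetical):

```stata
* Toy data (hypothetical): five people in two groups
clear
input group age
1 20
1 30
2 40
2 50
2 60
end

* Same value for everyone in a group
bysort group: egen groupmean = mean(age)

* Without the prefix: one constant for the whole dataset
egen grandmean = mean(age)

* A typical follow-up: each person's deviation from his or her group mean
gen dev = age - groupmean

list, sepby(group)
```

Here groupmean is 25 for group 1 and 50 for group 2, while grandmean is 40 in every row.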
Another example, with a loop:
. for var hrs1 age sex rincome: bysort indus80: egen Xmean=mean(X) \ bysort indus80: egen Xsd=sd(X)

-> bysort indus80: egen hrs1mean=mean(hrs1)
(123 missing values generated)
-> bysort indus80: egen hrs1sd=sd(hrs1)
(193 missing values generated)
-> bysort indus80: egen agemean=mean(age)
-> bysort indus80: egen agesd=sd(age)
(28 missing values generated)
-> bysort indus80: egen sexmean=mean(sex)
-> bysort indus80: egen sexsd=sd(sex)
(28 missing values generated)
-> bysort indus80: egen rincomemean=mean(rincome)
(19 missing values generated)
-> bysort indus80: egen rincomesd=sd(rincome)
(77 missing values generated)
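A similar loop can be written with foreach. The sketch below uses made-up data (the variables indus, hrs, and age are hypothetical) and also shows why sd() tends to produce more missing values than mean(): a group with a single observation has a mean but no standard deviation.

```stata
* Toy data (hypothetical): industry 30 has only one person
clear
input indus hrs age
10 40 30
10 50 50
20 35 25
20 45 35
30 60 41
end

foreach v of varlist hrs age {
    bysort indus: egen `v'mean = mean(`v')
    bysort indus: egen `v'sd   = sd(`v')
}

list, sepby(indus)
```

The loop creates hrsmean, hrssd, agemean, and agesd; hrssd and agesd are missing for industry 30.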
If we want to examine these with one observation per subgroup (essentially exploring the data at
the industry level), we can use the tag function of egen:
• tag(varlist) [, missing] -- tags exactly one observation in each group defined by varlist
with 1; all other observations get 0.
We use this only when we do not care which observation is taken, because we will be using this
variable to do analyses at, say, the industry level, and all observations with the same industry
code have the same industry characteristics. The missing option treats missing values as a
separate group.
. egen tagged=tag(indus80)
. tab tagged

tag(indus80 |
          ) |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      2,565       92.77       92.77
          1 |        200        7.23      100.00
------------+-----------------------------------
      Total |      2,765      100.00

. histogram agemean if tagged==1
(bin=14, start=23, width=4.5)
[Figure: histogram of agemean, one observation per industry; x-axis: agemean, y-axis: Density]

. histogram sexsd if tagged==1
(bin=13, start=0, width=.05439283)
[Figure: histogram of sexsd, one observation per industry; x-axis: sexsd, y-axis: Density]
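The logic of tagging can be checked on a small made-up dataset (indus and age below are hypothetical): after creating a group-level variable, restricting to tagged==1 leaves exactly one row per group.

```stata
* Toy data (hypothetical): five people in three industries
clear
input indus age
10 30
10 50
20 25
20 35
30 41
end

bysort indus: egen agemean = mean(age)
egen tagged = tag(indus)

* One row per industry
list indus agemean if tagged == 1
summarize agemean if tagged == 1
```

The summarize command here is based on three observations, one for each industry, rather than on five individuals.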
If you want to create a separate aggregate dataset rather than a mixed level one, you can use the
collapse command:
. collapse hrs1 age sex rincome, by(indus80)
. des

Contains data
  obs:           201
 vars:             5
 size:         6,834
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
indus80         int    %8.0g       indus80    rs industry code (1980)
hrs1            double %8.0g                  (mean) hrs1
age             double %8.0g                  (mean) age
sex             double %8.0g                  (mean) sex
rincome         double %8.0g                  (mean) rincome
-------------------------------------------------------------------------------
Sorted by:  indus80
     Note:  dataset has changed since last saved

In this example, we collapsed by calculating means of each variable by subgroup (mean is the
default here) – but we could also use different statistics, or even specify statistics for each
variable. Available statistics include:

    mean         means (default)
    median       medians
    p1           1st percentile
    p2           2nd percentile
    ...          3rd-49th percentiles
    p50          50th percentile (same as median)
    ...          51st-97th percentiles
    p98          98th percentile
    p99          99th percentile
    sd           standard deviations
    semean       standard error of the mean (sd/sqrt(n))
    sebinomial   standard error of the mean, binomial (sqrt(p(1-p)/n))
    sepoisson    standard error of the mean, Poisson (sqrt(mean))
    sum          sums
    rawsum       sums, ignoring optionally specified weight except observations
                 with a weight of zero are excluded
    count        number of nonmissing observations
    percent      percentage of nonmissing observations
    max          maximums
    min          minimums
    iqr          interquartile range
    first        first value
    last         last value
    firstnm      first nonmissing value
    lastnm       last nonmissing value

. use "C:\Users\sarkisin\Documents\Teaching Grad Statistics\Data Management\gss2002.dta", clear
. collapse (mean) hrs1mean=hrs1 agemean=age sexmean=sex rincomemean=rincome
>     (max) hrs1max=hrs1 agemax=age incomemax=rincome
>     (min) hrs1min=hrs1 agemin=age rincomemin=rincome
>     (sd) hrs1sd=hrs1 agesd=age sexsd=sex rincomesd=rincome, by(indus80)
. des

Contains data
  obs:           201
 vars:            15
 size:        14,472
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
indus80         int    %8.0g       indus80    rs industry code (1980)
hrs1mean        double %8.0g                  (mean) hrs1
agemean         double %8.0g                  (mean) age
sexmean         double %8.0g                  (mean) sex
rincomemean     double %8.0g                  (mean) rincome
hrs1max         byte   %8.0g                  (max) hrs1
agemax          byte   %8.0g                  (max) age
incomemax       byte   %8.0g                  (max) rincome
hrs1min         byte   %8.0g                  (min) hrs1
agemin          byte   %8.0g                  (min) age
rincomemin      byte   %8.0g                  (min) rincome
hrs1sd          double %8.0g                  (sd) hrs1
agesd           double %8.0g                  (sd) age
sexsd           double %8.0g                  (sd) sex
rincomesd       double %8.0g                  (sd) rincome
-------------------------------------------------------------------------------
Sorted by:  indus80
     Note:  dataset has changed since last saved
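The syntax for requesting different statistics can be tried on a small made-up dataset (indus, hrs, and age are hypothetical). Note that collapse replaces the data in memory, so in real work the individual-level file should be saved first, or the command wrapped in preserve and restore.

```stata
* Toy data (hypothetical): five people in three industries
clear
input indus hrs age
10 40 30
10 50 50
20 35 25
20 45 35
30 60 41
end

* Each (statistic) applies to the newvar=oldvar pairs that follow it
collapse (mean) hrsmean=hrs agemean=age (sd) hrssd=hrs (count) n=hrs, by(indus)

list
```

The result has one row per industry; hrssd is missing for industry 30, which has only one observation.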
Subscripting observations within groups
Notation    Meaning
_n          Current observation
_N          Last observation
_n-1        Previous observation
_n+1        Next observation
1, 2, 3…    Observation number 1, 2, 3…
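A quick sketch of this notation on made-up panel data (id, year, and x are hypothetical). With a by: prefix, _n and _N are counted within each group; listing year in parentheses after id is one way to make the within-group order explicit.

```stata
* Toy panel (hypothetical): person 1 has three waves, person 2 has two
clear
input id year x
1 82 5
1 83 7
1 84 9
2 82 1
2 83 4
end

* Position within person and number of rows per person
bysort id (year): gen seq   = _n
bysort id (year): gen n_obs = _N

* Previous, first, and last value within person
bysort id (year): gen prev_x  = x[_n-1]
bysort id (year): gen first_x = x[1]
bysort id (year): gen last_x  = x[_N]

list, sepby(id)
```

prev_x is missing on each person's first row because there is no observation before it within the group.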
We will use a subset of NLSY data focusing on marriage and employment for this example.
. use "C:\Users\sarkisin\Documents\Teaching Grad Statistics\marriage_raw.dta" , clear
. drop interv79 interv81
. reshape long mar fexp pexp educ interv newage emp enrol, i(id) j(year)
(note: j = 82 83 84 85 86 87 88 89 90 91 92 93 94)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                     6081   ->   79053
Number of variables                 109   ->      14
j variable (13 values)                    ->   year
xij variables:
                  mar82 mar83 ... mar94   ->   mar
               fexp82 fexp83 ... fexp94   ->   fexp
               pexp82 pexp83 ... pexp94   ->   pexp
               educ82 educ83 ... educ94   ->   educ
         interv82 interv83 ... interv94   ->   interv
         newage82 newage83 ... newage94   ->   newage
                  emp82 emp83 ... emp94   ->   emp
            enrol82 enrol83 ... enrol94   ->   enrol
-----------------------------------------------------------------------------

. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |     79053        3041     1755.445         1       6081
        year |     79053          88     3.741681        82         94
     parpres |     79053    36.76739     12.06151        12         82
       pared |     76765      10.779      3.25792         0         20
       black |     79053    .2491367     .4325158         0          1
-------------+--------------------------------------------------------
    hispanic |     79053    .1624733     .3688867         0          1
         mar |     78861    .5372618     .4986128         0          1
        pexp |     79053     1.85826     2.077519         0   17.69231
        fexp |     79053    3.462117     3.412239         0   18.57692
        educ |     79053    12.67904     2.247307         0         20
-------------+--------------------------------------------------------
      interv |     79053    10380.57     1424.878      8050      12775
      newage |     66965    9870.376     1613.834      6252      13847
         emp |     79053    .6687918     .4706507         0          1
       enrol |     79053    .0788838      .269559         0          1

Let’s construct a date of birth variable and fill in the missing values:
. gen dob=interv-newage
(12088 missing values generated)
. gen dob2=dob
(12088 missing values generated)
. bysort id: replace dob2=dob2[_n-1] if dob2==.
(11607 real changes made)
. bysort id: replace dob2=dob2[_n+1] if dob2==.
(191 real changes made)
. bysort id: replace dob2=dob2[_n+1] if dob2==.
(79 real changes made)
. bysort id: replace dob2=dob2[_n+1] if dob2==.
(49 real changes made)
. bysort id: replace dob2=dob2[_n+1] if dob2==.
(44 real changes made)
. bysort id: replace dob2=dob2[_n+1] if dob2==.
(37 real changes made)
. bysort id: replace dob2=dob2[_n+1] if dob2==.
(30 real changes made)
. bysort id: replace dob2=dob2[_n+1] if dob2==.
(19 real changes made)
. bysort id: replace dob2=dob2[_n+1] if dob2==.
(12 real changes made)
. bysort id: replace dob2=dob2[_n+1] if dob2==.
(10 real changes made)
. bysort id: replace dob2=dob2[_n+1] if dob2==.
(8 real changes made)
. bysort id: replace dob2=dob2[_n+1] if dob2==.
(2 real changes made)
. bysort id: replace dob2=dob2[_n+1] if dob2==.
(0 real changes made)
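Note that the forward fill needed only one pass, while the backward fill had to be repeated until no more changes were made. This is because replace works through the observations from top to bottom: a value filled in for one row is already available when the next row is processed, but not the other way around. One way to avoid the repetition, sketched here on made-up data (id, year, and dob are hypothetical, as is the helper variable negyear), is to reverse the time order within person and carry values "forward" again.

```stata
* Toy panel (hypothetical): dob observed in only one wave per person
clear
input id year dob
1 82   .
1 83   .
1 84 100
1 85   .
2 82 200
2 83   .
2 84   .
end

gen dob2 = dob

* Forward fill: cascades down the rows in a single pass
bysort id (year): replace dob2 = dob2[_n-1] if missing(dob2)

* Backward fill: sort each person in reverse time order, then fill again
gen negyear = -year
bysort id (negyear): replace dob2 = dob2[_n-1] if missing(dob2)

* Restore the usual order
sort id year
drop negyear
list, sepby(id)
```

After both passes, dob2 is filled in for every wave of both persons.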
Alternatively, the same can be done more easily (but this would not help when you want to use
prior values to fill in gaps in a time series):
. bysort id: egen dob3=mean(dob)
. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |     79053        3041     1755.445         1       6081
        year |     79053          88     3.741681        82         94
     parpres |     79053    36.76739     12.06151        12         82
       pared |     76765      10.779      3.25792         0         20
       black |     79053    .2491367     .4325158         0          1
-------------+--------------------------------------------------------
    hispanic |     79053    .1624733     .3688867         0          1
         mar |     78861    .5372618     .4986128         0          1
        pexp |     79053     1.85826     2.077519         0   17.69231
        fexp |     79053    3.462117     3.412239         0   18.57692
        educ |     79053    12.67904     2.247307         0         20
-------------+--------------------------------------------------------
      interv |     79053    10380.57     1424.878      8050      12775
      newage |     66965    9870.376     1613.834      6252      13847
         emp |     79053    .6687918     .4706507         0          1
       enrol |     79053    .0788838      .269559         0          1
         dob |     66965    364.2885     815.6515     -1095       1824
-------------+--------------------------------------------------------
        dob2 |     79053    308.3008     819.3959     -1095       1824
        dob3 |     79053    308.3008     819.3959     -1095       1824

Using subscripting to indicate sequential position of observation within its group:
. bysort id: gen seq=_n
. bysort id: gen last=(_n==_N)
. bysort id: gen first=(_n==1)
. list id year seq last first

     +----------------------------------+
     |   id   year   seq   last   first |
     |----------------------------------|
  1. |    1     83     1      0       1 |
  2. |    1     84     2      0       0 |
  3. |    1     85     3      0       0 |
  4. |    1     86     4      0       0 |
  5. |    1     87     5      0       0 |
     |----------------------------------|
  6. |    1     88     6      0       0 |
  7. |    1     89     7      0       0 |
  8. |    1     90     8      0       0 |
  9. |    1     91     9      0       0 |
 10. |    1     92    10      0       0 |
     |----------------------------------|
 11. |    1     93    11      0       0 |
 12. |    1     94    12      1       0 |
 13. |    2     83     1      0       1 |
 14. |    2     84     2      0       0 |
 15. |    2     85     3      0       0 |
     |----------------------------------|
 16. |    2     86     4      0       0 |
 17. |    2     87     5      0       0 |
 18. |    2     88     6      0       0 |
 19. |    2     89     7      0       0 |
 20. |    2     90     8      0       0 |
     |----------------------------------|
 21. |    2     91     9      0       0 |
 22. |    2     92    10      0       0 |
 23. |    2     93    11      0       0 |
--Break--
r(1);

Constructing an indicator of transition from single to married:
. gen mart=(mar==1)
. bysort id: replace mart=0 if mar==1 & mar[_n-1]==1
(35677 real changes made)
. bysort id: replace mart=0 if mar==1 & _n==1
(1761 real changes made)
Here we only have missing values at the first wave, and we coded the indicator as 0 for those
cases. If we had more missing values throughout, it might be easier to impute first, or to make
a decision about what to do with “broken” sequences.
. list id year mar mart

     +--------------------------+
     |   id   year   mar   mart |
     |--------------------------|
  1. |    1     82     0      0 |
  2. |    1     83     1      1 |
  3. |    1     84     1      0 |
  4. |    1     85     1      0 |
  5. |    1     86     1      0 |
     |--------------------------|
  6. |    1     87     1      0 |
  7. |    1     88     1      0 |
  8. |    1     89     1      0 |
  9. |    1     90     1      0 |
 10. |    1     91     1      0 |
     |--------------------------|
 11. |    1     92     1      0 |
 12. |    1     93     1      0 |
 13. |    1     94     1      0 |
 14. |    2     82     1      0 |
 15. |    2     83     1      0 |
     |--------------------------|
 16. |    2     84     1      0 |
 17. |    2     85     1      0 |
 18. |    2     86     1      0 |
 19. |    2     87     1      0 |
 20. |    2     88     1      0 |
     |--------------------------|
 21. |    2     89     1      0 |
 22. |    2     90     1      0 |
 23. |    2     91     1      0 |
 24. |    2     92     1      0 |
 25. |    2     93     1      0 |
     |--------------------------|
 26. |    2     94     1      0 |
 27. |    3     82     0      0 |
 28. |    3     83     0      0 |
 29. |    3     84     0      0 |
 30. |    3     85     0      0 |
     |--------------------------|
 31. |    3     86     0      0 |
 32. |    3     87     0      0 |
 33. |    3     88     0      0 |
 34. |    3     89     0      0 |
 35. |    3     90     0      0 |
     |--------------------------|
 36. |    3     91     0      0 |
 37. |    3     92     0      0 |
 38. |    3     93     0      0 |
 39. |    3     94     0      0 |
 40. |    4     82     0      0 |
     |--------------------------|
 41. |    4     83     0      0 |
 42. |    4     84     0      0 |
 43. |    4     85     0      0 |
 44. |    4     86     0      0 |
 45. |    4     87     0      0 |
     |--------------------------|
 46. |    4     88     0      0 |
 47. |    4     89     0      0 |
 48. |    4     90     0      0 |
 49. |    4     91     0      0 |
 50. |    4     92     0      0 |
     |--------------------------|
 51. |    4     93     1      1 |
 52. |    4     94     1      0 |
--Break--
r(1);

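The same kind of indicator can be built in a single step by requiring that the previous wave be observed as single. This is a sketch on made-up data (id, year, and mar are hypothetical), and it embodies an assumption: a missing previous status, including the first wave, is treated as "no observed transition", which can differ from the three-step version above when mar is missing in the middle of a sequence.

```stata
* Toy panel (hypothetical): person 1 marries in 83, person 2 is married
* throughout, person 3 stays single
clear
input id year mar
1 82 0
1 83 1
1 84 1
2 82 1
2 83 1
3 82 0
3 83 0
end

* 1 only when married now and observed as single in the previous wave
bysort id (year): gen mart = (mar == 1 & mar[_n-1] == 0)

list, sepby(id)
```

Only person 1 in year 83 gets mart equal to 1.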
Creating lagged or differenced variables – using subscripting:
. bysort id: gen empl=emp[_n-1]
(13167 missing values generated)
Versus using time series operators:
. tsset id year
       panel variable:  id (strongly balanced)
        time variable:  year, 83 to 94
                delta:  1 unit

. gen empl2=l.emp
(13167 missing values generated)
. list id year emp empl empl2

     +----------------------------------+
     |   id   year   emp   empl   empl2 |
     |----------------------------------|
  1. |    1     83     0      .       . |
  2. |    1     84     0      0       0 |
  3. |    1     85     .      0       0 |
  4. |    1     86     1      .       . |
  5. |    1     87     .      1       1 |
     |----------------------------------|
  6. |    1     88     0      .       . |
  7. |    1     89     1      0       0 |
  8. |    1     90     .      1       1 |
  9. |    1     91     1      .       . |
 10. |    1     92     1      1       1 |
     |----------------------------------|
 11. |    1     93     1      1       1 |
 12. |    1     94     1      1       1 |
 13. |    2     83     0      .       . |
 14. |    2     84     0      0       0 |
--Break--
r(1);

. bysort id: gen educd=educ-educ[_n-1]
(19657 missing values generated)
. gen educd2=d.educ
(19657 missing values generated)
You can also generate further lags: l. is the same as l1., but you can also use l2., etc. You can
also make variables with future values (e.g., f1.mar or f2.mar).
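The two approaches give the same result in a balanced panel like ours, but they can differ when a wave is absent from the data altogether. The sketch below, on made-up data (id, year, and emp are hypothetical), has no row for person 1 in year 84: subscripting takes whatever row comes before, while the time-series operator looks for the value exactly one time unit earlier.

```stata
* Toy panel (hypothetical): person 1 has no row for year 84
clear
input id year emp
1 82 0
1 83 1
1 85 1
2 82 1
2 83 0
end

* Subscripting: previous row within person, whatever its year
bysort id (year): gen emp_prev = emp[_n-1]

* Time-series operators: values one time unit back or ahead
tsset id year
gen emp_lag  = l.emp
gen emp_lead = f.emp
gen emp_diff = d.emp

list, sepby(id)
```

For person 1 in year 85, emp_prev holds the year 83 value, whereas emp_lag is missing because there is no year 84 observation.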