SOCY7709: Quantitative Data Management
Instructor: Natasha Sarkisian

Processing Observations across Subgroups

1. Subgroup operations across multiple variables

Oftentimes, datasets include nested subcomponents, e.g., data on each child of each respondent, on each person who helped with activities of daily living, on each recipient of financial assistance, or on each organization for which the respondent volunteered. Even though the entire dataset might not have a nested structure, these subcomponents are essentially nested (children within the respondent, etc.). Such data are usually provided in separate variables for each lower-level unit, e.g.:

R's ID   Child 1 age   Child 1 gender   Child 2 age   Child 2 gender   Child 3 age   Child 3 gender
1
2
3

Different individuals might have different numbers of children, so, for example, if someone has only one child, then only the child 1 columns will contain data; the following ones will be missing. When dealing with such data, we could either (1) conduct analyses at the level of the individual, in which case we need to aggregate across variables, or (2) conduct analyses at the level of the lower-level unit (here, the child), in which case we need to reshape the data into:

R's ID   Child number   Child gender   Child age
1        1
1        2
1        3
2        1
2        2
2        3
3        1
3        2
3        3

In this long format, there will be rows with all child data missing, since some people have fewer than 3 children; such blank rows need to be identified and deleted (we learned how to do that when discussing reshaping).

We will focus, however, on aggregating across variables when our goal is to have individuals as units of analysis. As we discussed earlier, when we are interested in creating new variables based on information that is stored in multiple columns in the dataset, we can use the many egen functions that allow us to combine such information across columns in various ways.
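As a minimal sketch of the aggregation route, assuming a small made-up dataset (the variable names id, child1age, child2age, and child3age are hypothetical and do not come from any file used in these notes), row-wise egen functions turn the child columns into one value per respondent:

```stata
* Hypothetical wide-format data: one row per respondent, up to three children
clear
input id child1age child2age child3age
1 12  9  .
2  4  .  .
3 17 15 10
end

* Aggregate across the child columns so that each respondent gets one value
egen nkids     = rownonmiss(child1age child2age child3age)  // children with a reported age
egen oldestage = rowmax(child1age child2age child3age)      // age of the oldest child
egen meanage   = rowmean(child1age child2age child3age)     // mean age, ignoring missing columns

list, noobs
```

The // comments assume the code is run from a do-file; when typing commands interactively, leave them out.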
This is a recap:

anycount(varlist), values(integer numlist) -- looks for a match across multiple variables and generates a count of the variables in varlist whose values are equal to any integer value in the supplied numlist. Values for any observations excluded by either if or in are set to 0 (not missing).

anymatch(varlist), values(integer numlist) -- same, but generates a yes/no indicator; it is 1 if any variable in varlist is equal to any integer value in the supplied numlist and 0 otherwise. Values for any observations excluded by either if or in are set to 0 (not missing).

rowfirst(varlist) -- gives the first nonmissing value in varlist for each observation (row).

rowlast(varlist) -- gives the last nonmissing value in varlist for each observation (row).

rowmin(varlist) -- gives the minimum value (ignoring missing values) in varlist for each observation (row).

rowmax(varlist) -- gives the maximum value (ignoring missing values) in varlist for each observation (row).

rowmean(varlist) -- gives the mean for each observation across variables; great for creating individual scores for multi-item scales. It ignores missing values: for example, if three variables are specified and, in some observations, one of the variables is missing, in those observations the new variable will contain the mean of the two variables that do exist. Other observations will contain the mean of all three variables.

rowsd(varlist) -- gives the (row) standard deviation of the variables in varlist, ignoring missing values.

rowmedian(varlist) -- similar to rowmean, but gives the median.

rowpctile(varlist) [, p(#)] -- similar to rowmedian, but gives the #th percentile of values across the variables specified. If p() is not specified, p(50) is assumed, meaning medians, which is the same as rowmedian.

rowtotal(varlist) [, missing] -- creates the (row) sum of the variables in varlist, treating missing as 0. If missing is specified and all values in varlist are missing for an observation, the new variable is set to missing.

rowmiss(varlist) -- gives the number of missing values in varlist for each observation (row).

rownonmiss(varlist) [, strok] -- gives the number of nonmissing values in varlist for each observation (row). String variables may not be specified unless the strok option is also specified. If strok is specified, string variables will be counted as containing missing values when they contain "".

For the functions that return a value taken or computed from the data (rowfirst, rowlast, rowmin, rowmax, rowmean, rowsd, rowmedian, rowpctile), the result is missing if all variables in the varlist are missing for an observation. The counting and indicator functions (anycount, anymatch, rowmiss, rownonmiss) still return a number in that case, and rowtotal returns 0 unless the missing option is specified.

Example:

. use "C:\Users\sarkisin\Documents\Data Management\gss2002.dta", clear

. sum relate*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     relate1 |      2765           1            0         1          1
     relate2 |      1876    3.168977     2.126585         2          8
     relate3 |       944    3.643008     1.496795         3          8
     relate4 |       494    3.504049     1.306476         3          8
     relate5 |       207    3.541063     1.313487         3          8
-------------+--------------------------------------------------------
     relate6 |        64    3.890625     1.533925         3          8
     relate7 |        24    4.166667     1.632993         3          8
     relate8 |         8       4.375     2.065879         3          8
     relate9 |         4        5.25     2.629956         3          8
    relate10 |         2           3            0         3          3
-------------+--------------------------------------------------------
    relate11 |         6         5.5     1.643168         3          7
    relate12 |         1           7            .         7          7

. codebook relate*

-------------------------------------------------------------------------------
relate1                            relationship of 1st person to household head
-------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  relate1

                 range:  [1,1]                        units:  1
         unique values:  1                        missing .:  0/2765

            tabulation:  Freq.   Numeric  Label
                          2765         1  head of household

-------------------------------------------------------------------------------
relate2                            relationship of 2nd person to household head
-------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  relate2

                 range:  [2,8]                        units:  1
         unique values:  7                        missing .:  889/2765

            tabulation:  Freq.   Numeric  Label
                          1214         2  spouse
                           328         3  child
                            13         4  son or daughter-in-law
                             7         5  grand or great-grandchild
                            11         6  parent or parent-in-law
                            44         7  other relative
                           259         8  non-relative
                           889         .

-------------------------------------------------------------------------------
relate3                            relationship of 3rd person to household head
-------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  relate3

                 range:  [3,8]                        units:  1
         unique values:  6                        missing .:  1821/2765

            tabulation:  Freq.   Numeric  Label
                           774         3  child
                            15         4  son or daughter-in-law
                            39         5  grand or great-grandchild
                            13         6  parent or parent-in-law
                            40         7  other relative
                            63         8  non-relative
                          1821         .

-------------------------------------------------------------------------------
relate4                            relationship of 4th person to household head
-------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  relate4

                 range:  [3,8]                        units:  1
         unique values:  6                        missing .:  2271/2765

            tabulation:  Freq.   Numeric  Label
                           417         3  child
                             7         4  son or daughter-in-law
                            30         5  grand or great-grandchild
                             1         6  parent or parent-in-law
                            16         7  other relative
                            23         8  non-relative
                          2271         .

-------------------------------------------------------------------------------
relate5                            relationship of 5th person to household head
-------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  relate5

                 range:  [3,8]                        units:  1
         unique values:  4                        missing .:  2558/2765

            tabulation:  Freq.   Numeric  Label
                           172         3  child
                            18         5  grand or great-grandchild
                             9         7  other relative
                             8         8  non-relative
                          2558         .

-------------------------------------------------------------------------------
relate6                            relationship of 6th person to household head
-------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  relate6

                 range:  [3,8]                        units:  1
         unique values:  4                        missing .:  2701/2765

            tabulation:  Freq.   Numeric  Label
                            45         3  child
                            11         5  grand or great-grandchild
                             5         7  other relative
                             3         8  non-relative
                          2701         .

-------------------------------------------------------------------------------
relate7                            relationship of 7th person to household head
-------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  relate7

                 range:  [3,8]                        units:  1
         unique values:  4                        missing .:  2741/2765

            tabulation:  Freq.   Numeric  Label
                            14         3  child
                             7         5  grand or great-grandchild
                             1         7  other relative
                             2         8  non-relative
                          2741         .

-------------------------------------------------------------------------------
relate8                            relationship of 8th person to household head
-------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  relate8

                 range:  [3,8]                        units:  1
         unique values:  4                        missing .:  2757/2765

            tabulation:  Freq.   Numeric  Label
                             5         3  child
                             1         5  grand or great-grandchild
                             1         7  other relative
                             1         8  non-relative
                          2757         .

-------------------------------------------------------------------------------
relate9                            relationship of 9th person to household head
-------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  relate9

                 range:  [3,8]                        units:  1
         unique values:  3                        missing .:  2761/2765

            tabulation:  Freq.   Numeric  Label
                             2         3  child
                             1         7  other relative
                             1         8  non-relative
                          2761         .

-------------------------------------------------------------------------------
relate10                          relationship of 10th person to household head
-------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  relate10

                 range:  [3,3]                        units:  1
         unique values:  1                        missing .:  2763/2765

            tabulation:  Freq.   Numeric  Label
                             2         3  child
                          2763         .

-------------------------------------------------------------------------------
relate11                              relation of 11th person (visitor) to head
-------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  relate11

                 range:  [3,7]                        units:  1
         unique values:  4                        missing .:  2759/2765

            tabulation:  Freq.   Numeric  Label
                             1         3  child
                             1         4  son or daughter-in-law
                             2         6  parent or parent-in-law
                             2         7  other relative
                          2759         .

-------------------------------------------------------------------------------
relate12                              relation of 12th person (visitor) to head
-------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  relate12

                 range:  [7,7]                        units:  1
         unique values:  1                        missing .:  2764/2765

            tabulation:  Freq.   Numeric  Label
                             1         7  other relative
                          2764         .

. for num 1/8: egen relshipX=anycount(relate*), values(X)

->  egen relship1=anycount(relate*), values(1)
->  egen relship2=anycount(relate*), values(2)
->  egen relship3=anycount(relate*), values(3)
->  egen relship4=anycount(relate*), values(4)
->  egen relship5=anycount(relate*), values(5)
->  egen relship6=anycount(relate*), values(6)
->  egen relship7=anycount(relate*), values(7)
->  egen relship8=anycount(relate*), values(8)

. sum relship*

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    relship1 |      2765           1            0         1          1
    relship2 |      2765    .4390597     .4963621         0          1
    relship3 |      2765     .636528     1.025315         0          8
    relship4 |      2765    .0130199       .11338         0          1
    relship5 |      2765     .040868     .2810861         0          5
-------------+--------------------------------------------------------
    relship6 |      2765    .0097649     .1088294         0          2
    relship7 |      2765    .0433996     .2839101         0          5
    relship8 |      2765    .1301989     .4350459         0          5

2. Subgroup operations across cases

When talking about subgroups in the previous examples, we were working across variables, aggregating over groups of variables. Oftentimes, however, the subgroups of interest are represented by the values of a single variable. That is the case when the dataset is nested (multilevel) and stored in the way such datasets typically are: e.g., with students nested within schools, each student is one observation, and therefore each school corresponds to multiple observations.
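To make this layout concrete, here is a minimal sketch with made-up data (the variables school, student, and score are hypothetical and are not part of the GSS file used in these notes). Each row is a student, and the school identifier repeats across rows:

```stata
* Hypothetical toy data: students (rows) nested within schools
clear
input school student score
1 1 78
1 2 85
1 3 91
2 1 66
2 2 72
3 1 88
end

* Each school spans several observations; record how many students it has
bysort school: gen nstudents = _N

* Show the rows with a separator line between schools
list, sepby(school) noobs
```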
In such situations, we could conduct operations by group using the by and bysort prefixes, for example, to examine means by group:

. bysort degree: sum income

-------------------------------------------------------------------------------
-> degree = lt high

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       367    9.637602     3.140082         1         13

-------------------------------------------------------------------------------
-> degree = high sch

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |      1427    10.77645     2.510244         1         13

-------------------------------------------------------------------------------
-> degree = junior c

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       193    11.65285     1.509934         1         13

-------------------------------------------------------------------------------
-> degree = bachelor

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       430    11.68372      1.55324         1         13

-------------------------------------------------------------------------------
-> degree = graduate

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |       227     11.8326     1.071841         1         13

-------------------------------------------------------------------------------
-> degree = .

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      income |         1           9            .         9          9

We could, however, also use the mean command with the over() option:

. mean income, over(degree)

Mean estimation                     Number of obs    =    2644

    _subpop_1: degree = lt high school
    _subpop_2: degree = high school
    _subpop_3: degree = junior college
     bachelor: degree = bachelor
     graduate: degree = graduate

--------------------------------------------------------------
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
income       |
   _subpop_1 |   9.637602   .1639109      9.316195    9.959009
   _subpop_2 |   10.77645   .0664514      10.64615    10.90676
   _subpop_3 |   11.65285   .1086874      11.43973    11.86597
    bachelor |   11.68372   .0749039      11.53684     11.8306
    graduate |    11.8326   .0711406       11.6931     11.9721
--------------------------------------------------------------

If we want to examine values by group, we can use list with the sepby() option:

. sort indus80

. list age sex hrs1 indus80, sepby(indus80)

     +-------------------------------+
     | age      sex   hrs1   indus80 |
     |-------------------------------|
  1. |  72   female      .        10 |
  2. |   .   female      .        10 |
  3. |  62     male     70        10 |
  4. |  83     male      .        10 |
  5. |  74     male     20        10 |
  6. |  69   female      .        10 |
  7. |  31     male      .        10 |
  8. |  47     male      .        10 |
  9. |  68     male      .        10 |
 10. |  75   female      .        10 |
 11. |  76   female      .        10 |
     |-------------------------------|
 12. |  39   female      .        11 |
 13. |  36     male      6        11 |
 14. |  45     male     60        11 |
 15. |  19     male      .        11 |
 16. |  61     male      .        11 |
 17. |  37     male     30        11 |
 18. |  50   female      6        11 |
 19. |  63     male     50        11 |
 20. |  46   female     36        11 |
 21. |  43     male     40        11 |
 22. |  39   female     40        11 |
 23. |  23     male     45        11 |
 24. |  36     male     50        11 |
 25. |  23     male     70        11 |
 26. |  78     male      .        11 |
 27. |  65     male      .        11 |
 28. |  20     male      .        11 |
 29. |  76     male      .        11 |
     |-------------------------------|
 30. |  51   female     35        20 |
 31. |  50     male     20        20 |
     |-------------------------------|
 32. |  59     male     45        21 |
 33. |  63   female     40        21 |
 34. |  20     male     15        21 |
 35. |  53     male      .        21 |
 36. |  24     male     48        21 |
 37. |  30     male     16        21 |
 38. |  28     male      .        21 |
 39. |  25     male     65        21 |
     |-------------------------------|
 40. |  82   female      .        30 |
     |-------------------------------|
 41. |  58   female     36        31 |
 42. |  31   female     45        31 |
 43. |  33     male      .        31 |
     |-------------------------------|
 44. |  75     male      .        40 |
     |-------------------------------|
 45. |  33     male     40        41 |
     |-------------------------------|
 46. |  56     male     50        42 |
 47. |  41     male      .        42 |
 48. |  39     male      .        42 |
 49. |  45     male     84        42 |
 50. |  52     male     50        42 |
     |-------------------------------|
 51. |  48     male     52        50 |
--Break--
r(1);

If we want to actually aggregate from the lower level to the higher level (that is, create a variable that has the same value for all members of the same subgroup, for example, the mean of that subgroup or its standard deviation), we now do that across observations (for a given variable or a combination of variables), again using the bysort prefix. There are some useful egen functions for that as well, which we have not discussed yet:

mean(exp) -- the mean of exp

sd(exp) -- the standard deviation of exp

total(exp) [, missing] -- creates a constant (within each by-group) containing the sum of exp, treating missing as 0. If missing is specified and all values in exp are missing, the new variable is set to missing.

median(exp) -- the median of exp

pctile(exp) [, p(#)] -- the #th percentile of exp. If p(#) is not specified, 50 is assumed, meaning medians.

mode(varname) [, minmode maxmode nummode(integer) missing] -- the mode of varname, which may be numeric or string. The mode is the value occurring most frequently. If two or more modes exist or if varname contains all missing values, the mode produced will be a missing value. To avoid this, the minmode, maxmode, or nummode() option may be used to choose among multiple modes, and the missing option will treat missing values as categories. minmode returns the lowest value, and maxmode returns the highest value. nummode(#) returns the #th mode, counting from the lowest up. Missing values are excluded from the determination of the mode unless missing is specified.

min(exp) -- the minimum value of exp

max(exp) -- the maximum value of exp

iqr(exp) -- the interquartile range of exp

kurt(varname) -- the kurtosis of varname

skew(varname) -- the skewness of varname

rank(exp) [, field|track|unique] -- creates ranks based on values of exp; by default, equal observations are assigned the average rank.
count(exp) -- gives the number of nonmissing observations of exp

With the egen functions we used earlier, we were working across variables, focusing on rows. Here we are focusing on one column at a time. If we do not use the by: or bysort: prefix, these functions just generate a constant. For example:

. egen modehrs1=mode(hrs1)

. sum modehrs1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    modehrs1 |      2765          40            0        40         40

. egen meanagehrs=mean(agem*hrs1m)

. sum meanagehrs

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
  meanagehrs |      2765    -11.6341            0  -11.6341   -11.6341

The typical use of these egen functions, however, is not to generate a constant like these, but to create variables with statistics calculated within some kind of group, using the by: or bysort: prefix. For example:

. bysort marital: egen agegroupmean=mean(age)

. table agegroupmean marital

--------------------------------------------------------------------------------------
agegroupm |                              marital status
ean       |        married        widowed       divorced      separated  never married
----------+---------------------------------------------------------------------------
 33.32764 |                                                                        708
 44.26042 |                                                          96
 47.75475 |          1,269
 48.86652 |                                          445
  71.7328 |                           247
--------------------------------------------------------------------------------------

Another example, with a loop:

. for var hrs1 age sex rincome: bysort indus80: egen Xmean=mean(X) \ bysort ind
> us80: egen Xsd=sd(X)

->  bysort indus80: egen hrs1mean=mean(hrs1)
(123 missing values generated)
->  bysort indus80: egen hrs1sd=sd(hrs1)
(193 missing values generated)
->  bysort indus80: egen agemean=mean(age)
->  bysort indus80: egen agesd=sd(age)
(28 missing values generated)
->  bysort indus80: egen sexmean=mean(sex)
->  bysort indus80: egen sexsd=sd(sex)
(28 missing values generated)
->  bysort indus80: egen rincomemean=mean(rincome)
(19 missing values generated)
->  bysort indus80: egen rincomesd=sd(rincome)
(77 missing values generated)

If we want to examine these with one observation per subgroup (essentially, to explore the data at the industry level), we can use the tag function of egen:

tag(varlist) [, missing]

It marks exactly one observation in each group with 1 and all others with 0. We only use this when we do not care which observation is taken, because we will be using the variable for analyses at, say, the industry level, where all observations with the same industry code have the same industry characteristics. The missing option treats missing values as a separate group.

. egen tagged=tag(indus80)

. tab tagged

tag(indus80 |
          ) |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      2,565       92.77       92.77
          1 |        200        7.23      100.00
------------+-----------------------------------
      Total |      2,765      100.00

. histogram agemean if tagged==1
(bin=14, start=23, width=4.5)

[Histogram of agemean; x-axis: agemean, y-axis: Density]

. histogram sexsd if tagged==1
(bin=13, start=0, width=.05439283)

[Histogram of sexsd; x-axis: sexsd, y-axis: Density]

If you want to create a separate aggregate dataset rather than a mixed-level one, you can use the collapse command:

. collapse hrs1 age sex rincome, by(indus80)

. des

Contains data
  obs:           201
 vars:             5
 size:         6,834
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
indus80         int     %8.0g      indus80    rs industry code (1980)
hrs1            double  %8.0g                 (mean) hrs1
age             double  %8.0g                 (mean) age
sex             double  %8.0g                 (mean) sex
rincome         double  %8.0g                 (mean) rincome
-------------------------------------------------------------------------------
Sorted by:  indus80
     Note:  dataset has changed since last saved

In this example, we collapsed by calculating the mean of each variable by subgroup (mean is the default here), but we could also use different statistics, or even specify statistics separately for each variable. Available statistics include:

mean         means (default)
median       medians
p1           1st percentile
p2           2nd percentile
...          3rd-49th percentiles
p50          50th percentile (same as median)
...          51st-97th percentiles
p98          98th percentile
p99          99th percentile
sd           standard deviations
semean       standard error of the mean (sd/sqrt(n))
sebinomial   standard error of the mean, binomial (sqrt(p(1-p)/n))
sepoisson    standard error of the mean, Poisson (sqrt(mean))
sum          sums
rawsum       sums, ignoring optionally specified weight except observations with a weight of zero are excluded
count        number of nonmissing observations
percent      percentage of nonmissing observations
max          maximums
min          minimums
iqr          interquartile range
first        first value
last         last value
firstnm      first nonmissing value
lastnm       last nonmissing value

. use "C:\Users\sarkisin\Documents\Teaching Grad Statistics\Data Management\gss2
> 002.dta", clear

. collapse (mean) hrs1mean=hrs1 agemean=age sexmean=sex rincomemean=rincome (max
> ) hrs1max=hrs1 agemax=age incomemax=rincome (min) hrs1min=hrs1 agemin=age rinc
> omemin=rincome (sd) hrs1sd=hrs1 agesd=age sexsd=sex rincomesd=rincome, by(ind
> us80)

. des

Contains data
  obs:           201
 vars:            15
 size:        14,472
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
indus80         int     %8.0g      indus80    rs industry code (1980)
hrs1mean        double  %8.0g                 (mean) hrs1
agemean         double  %8.0g                 (mean) age
sexmean         double  %8.0g                 (mean) sex
rincomemean     double  %8.0g                 (mean) rincome
hrs1max         byte    %8.0g                 (max) hrs1
agemax          byte    %8.0g                 (max) age
incomemax       byte    %8.0g                 (max) rincome
hrs1min         byte    %8.0g                 (min) hrs1
agemin          byte    %8.0g                 (min) age
rincomemin      byte    %8.0g                 (min) rincome
hrs1sd          double  %8.0g                 (sd) hrs1
agesd           double  %8.0g                 (sd) age
sexsd           double  %8.0g                 (sd) sex
rincomesd       double  %8.0g                 (sd) rincome
-------------------------------------------------------------------------------
Sorted by:  indus80
     Note:  dataset has changed since last saved

Subscripting observations within groups

Notation     Meaning
_n           Current observation
_N           Last observation (the total number of observations, or of observations in the by-group)
_n-1         Previous observation
_n+1         Next observation
1, 2, 3…     Observation number 1, 2, 3…

We will use a subset of NLSY data focusing on marriage and employment for this example.

. use "C:\Users\sarkisin\Documents\Teaching Grad Statistics\marriage_raw.dta" , clear

. drop interv79 interv81

. reshape long mar fexp pexp educ interv newage emp enrol, i(id) j(year)
(note: j = 82 83 84 85 86 87 88 89 90 91 92 93 94)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                     6081   ->   79053
Number of variables                 109   ->      14
j variable (13 values)                    ->   year
xij variables:
                  mar82 mar83 ... mar94   ->   mar
               fexp82 fexp83 ... fexp94   ->   fexp
               pexp82 pexp83 ... pexp94   ->   pexp
               educ82 educ83 ... educ94   ->   educ
         interv82 interv83 ... interv94   ->   interv
         newage82 newage83 ... newage94   ->   newage
                  emp82 emp83 ... emp94   ->   emp
            enrol82 enrol83 ... enrol94   ->   enrol
-----------------------------------------------------------------------------

. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |     79053        3041     1755.445         1       6081
        year |     79053          88     3.741681        82         94
     parpres |     79053    36.76739     12.06151        12         82
       pared |     76765      10.779      3.25792         0         20
       black |     79053    .2491367     .4325158         0          1
-------------+--------------------------------------------------------
    hispanic |     79053    .1624733     .3688867         0          1
         mar |     78861    .5372618     .4986128         0          1
        pexp |     79053     1.85826     2.077519         0   17.69231
        fexp |     79053    3.462117     3.412239         0   18.57692
        educ |     79053    12.67904     2.247307         0         20
-------------+--------------------------------------------------------
      interv |     79053    10380.57     1424.878      8050      12775
      newage |     66965    9870.376     1613.834      6252      13847
         emp |     79053    .6687918     .4706507         0          1
       enrol |     79053    .0788838      .269559         0          1

Let's construct a date of birth variable and fill in the missing values:

. gen dob=interv-newage
(12088 missing values generated)

. gen dob2=dob
(12088 missing values generated)

. bysort id: replace dob2=dob2[_n-1] if dob2==.
(11607 real changes made)

. bysort id: replace dob2=dob2[_n+1] if dob2==.
(191 real changes made)

. bysort id: replace dob2=dob2[_n+1] if dob2==.
(79 real changes made)

. bysort id: replace dob2=dob2[_n+1] if dob2==.
(49 real changes made)

. bysort id: replace dob2=dob2[_n+1] if dob2==.
(44 real changes made)

. bysort id: replace dob2=dob2[_n+1] if dob2==.
(37 real changes made)

. bysort id: replace dob2=dob2[_n+1] if dob2==.
(30 real changes made)

. bysort id: replace dob2=dob2[_n+1] if dob2==.
(19 real changes made)

. bysort id: replace dob2=dob2[_n+1] if dob2==.
(12 real changes made)

. bysort id: replace dob2=dob2[_n+1] if dob2==.
(10 real changes made)

. bysort id: replace dob2=dob2[_n+1] if dob2==.
(8 real changes made)

. bysort id: replace dob2=dob2[_n+1] if dob2==.
(2 real changes made)

. bysort id: replace dob2=dob2[_n+1] if dob2==.
(0 real changes made)

The first command carries the last known value forward; because replace works through the observations in order, a single pass fills every later gap. Filling backward with [_n+1] reaches only one observation per pass, so the command is repeated until no more changes are made.

Alternatively, the same can be done more easily, because date of birth is constant within a person (although this would not help when you want to use prior values to fill in gaps in a time series):

. bysort id: egen dob3=mean(dob)

. sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |     79053        3041     1755.445         1       6081
        year |     79053          88     3.741681        82         94
     parpres |     79053    36.76739     12.06151        12         82
       pared |     76765      10.779      3.25792         0         20
       black |     79053    .2491367     .4325158         0          1
-------------+--------------------------------------------------------
    hispanic |     79053    .1624733     .3688867         0          1
         mar |     78861    .5372618     .4986128         0          1
        pexp |     79053     1.85826     2.077519         0   17.69231
        fexp |     79053    3.462117     3.412239         0   18.57692
        educ |     79053    12.67904     2.247307         0         20
-------------+--------------------------------------------------------
      interv |     79053    10380.57     1424.878      8050      12775
      newage |     66965    9870.376     1613.834      6252      13847
         emp |     79053    .6687918     .4706507         0          1
       enrol |     79053    .0788838      .269559         0          1
         dob |     66965    364.2885     815.6515     -1095       1824
-------------+--------------------------------------------------------
        dob2 |     79053    308.3008     819.3959     -1095       1824
        dob3 |     79053    308.3008     819.3959     -1095       1824

Using subscripting to indicate the sequential position of an observation within its group:

. bysort id: gen seq=_n

. bysort id: gen last=(_n==_N)

. bysort id: gen first=(_n==1)

. list id year seq last first

     +--------------------------------+
     | id   year   seq   last   first |
     |--------------------------------|
  1. |  1     83     1      0       1 |
  2. |  1     84     2      0       0 |
  3. |  1     85     3      0       0 |
  4. |  1     86     4      0       0 |
  5. |  1     87     5      0       0 |
     |--------------------------------|
  6. |  1     88     6      0       0 |
  7. |  1     89     7      0       0 |
  8. |  1     90     8      0       0 |
  9. |  1     91     9      0       0 |
 10. |  1     92    10      0       0 |
     |--------------------------------|
 11. |  1     93    11      0       0 |
 12. |  1     94    12      1       0 |
 13. |  2     83     1      0       1 |
 14. |  2     84     2      0       0 |
 15. |  2     85     3      0       0 |
     |--------------------------------|
 16. |  2     86     4      0       0 |
 17. |  2     87     5      0       0 |
 18. |  2     88     6      0       0 |
 19. |  2     89     7      0       0 |
 20. |  2     90     8      0       0 |
     |--------------------------------|
 21. |  2     91     9      0       0 |
 22. |  2     92    10      0       0 |
 23. |  2     93    11      0       0 |
--Break--
r(1);

Constructing an indicator of transition from single to married:

. gen mart=(mar==1)

. bysort id: replace mart=0 if mar==1 & mar[_n-1]==1
(35677 real changes made)

. bysort id: replace mart=0 if mar==1 & _n==1
(1761 real changes made)

The first replace resets the indicator to 0 for years in which the person was already married in the previous year; the second does the same for people who are married at their first observation, since we do not observe their prior status. In these data, mar is missing only at the first wave, and the indicator is coded 0 for those cases. If there were more missing values throughout, it might be easier to impute first, or to decide what to do with "broken" sequences.

. list id year mar mart

     +------------------------+
     | id   year   mar   mart |
     |------------------------|
  1. |  1     82     0      0 |
  2. |  1     83     1      1 |
  3. |  1     84     1      0 |
  4. |  1     85     1      0 |
  5. |  1     86     1      0 |
     |------------------------|
  6. |  1     87     1      0 |
  7. |  1     88     1      0 |
  8. |  1     89     1      0 |
  9. |  1     90     1      0 |
 10. |  1     91     1      0 |
     |------------------------|
 11. |  1     92     1      0 |
 12. |  1     93     1      0 |
 13. |  1     94     1      0 |
 14. |  2     82     1      0 |
 15. |  2     83     1      0 |
     |------------------------|
 16. |  2     84     1      0 |
 17. |  2     85     1      0 |
 18. |  2     86     1      0 |
 19. |  2     87     1      0 |
 20. |  2     88     1      0 |
     |------------------------|
 21. |  2     89     1      0 |
 22. |  2     90     1      0 |
 23. |  2     91     1      0 |
 24. |  2     92     1      0 |
 25. |  2     93     1      0 |
     |------------------------|
 26. |  2     94     1      0 |
 27. |  3     82     0      0 |
 28. |  3     83     0      0 |
 29. |  3     84     0      0 |
 30. |  3     85     0      0 |
     |------------------------|
 31. |  3     86     0      0 |
 32. |  3     87     0      0 |
 33. |  3     88     0      0 |
 34. |  3     89     0      0 |
 35. |  3     90     0      0 |
     |------------------------|
 36. |  3     91     0      0 |
 37. |  3     92     0      0 |
 38. |  3     93     0      0 |
 39. |  3     94     0      0 |
 40. |  4     82     0      0 |
     |------------------------|
 41. |  4     83     0      0 |
 42. |  4     84     0      0 |
 43. |  4     85     0      0 |
 44. |  4     86     0      0 |
 45. |  4     87     0      0 |
     |------------------------|
 46. |  4     88     0      0 |
 47. |  4     89     0      0 |
 48. |  4     90     0      0 |
 49. |  4     91     0      0 |
 50. |  4     92     0      0 |
     |------------------------|
 51. |  4     93     1      1 |
 52. |  4     94     1      0 |
--Break--
r(1);

Creating lagged or differenced variables using subscripting:

. bysort id: gen empl=emp[_n-1]
(13167 missing values generated)

Versus using time series operators:

. tsset id year
       panel variable:  id (strongly balanced)
        time variable:  year, 83 to 94
                delta:  1 unit

. gen empl2=l.emp
(13167 missing values generated)

. list id year emp empl empl2

     +--------------------------------+
     | id   year   emp   empl   empl2 |
     |--------------------------------|
  1. |  1     83     0      .       . |
  2. |  1     84     0      0       0 |
  3. |  1     85     .      0       0 |
  4. |  1     86     1      .       . |
  5. |  1     87     .      1       1 |
     |--------------------------------|
  6. |  1     88     0      .       . |
  7. |  1     89     1      0       0 |
  8. |  1     90     .      1       1 |
  9. |  1     91     1      .       . |
 10. |  1     92     1      1       1 |
     |--------------------------------|
 11. |  1     93     1      1       1 |
 12. |  1     94     1      1       1 |
 13. |  2     83     0      .       . |
 14. |  2     84     0      0       0 |
--Break--
r(1);

. bysort id: gen educd=educ-educ[_n-1]
(19657 missing values generated)

. gen educd2=d.educ
(19657 missing values generated)

You can also generate further lags: l. is the same as l1., but you can also use l2., etc. You can also create variables with future values (leads), e.g., f1.mar or f2.mar.
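The lag, lead, and difference operators can be compared side by side in a minimal sketch. It assumes a made-up three-year panel: the data values and the new variable names (emp_l1, emp_l2, emp_f1, emp_d1, emp_l1b) are hypothetical, and only the operators themselves come from the notes above.

```stata
* Hypothetical toy panel: two people observed in three consecutive years
clear
input id year emp
1 82 0
1 83 1
1 84 1
2 82 1
2 83 0
2 84 1
end

tsset id year                // declare the panel and time variables

gen emp_l1 = l.emp           // previous year's value (same as l1.emp)
gen emp_l2 = l2.emp          // value from two years earlier
gen emp_f1 = f.emp           // next year's value (same as f1.emp)
gen emp_d1 = d.emp           // change since the previous year: emp - l.emp

* Subscripting equivalent of the first lag; listing year in parentheses
* makes the sort order within id explicit
bysort id (year): gen emp_l1b = emp[_n-1]

list, sepby(id)
```

One design difference is worth keeping in mind: the time series operators work on the values of year, so a skipped year yields a missing lag, whereas subscripting simply takes the previous row in the current sort order. The two approaches agree only when the panel has no gaps and the data are sorted by year within id.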