Creating Something from Nothing: Working with Synthetic Files

Chuck Humphrey
University of Alberta
ACCOLEDS/DLI Training: December 2003
Outline
• Types of microdata files
• Which microdata file to use
• Providing services for synthetic files

This presentation is a modification of a workshop that Bo Wandschneider and I presented at the May 2003 National DLI Training program.
Types of Microdata Files
• Confidential Microdata Products
  • Master Files
  • Share Files
• Public Access Microdata Products
  • Public use anonymized microdata (PUMFs)
  • Synthetic Files
Microdata Products
Microdata
• raw data organized in a file in which each record (line) is an observation of a specific unit of analysis and the information on the line is the values of variables
• requires some form of processing or analysis to be used
Microdata Products
CCHS
0000015959922220611230721241433296101121222222112222222223060.75021.6221102296010400009600960400
000000002266662666666666666666666601166666666631114222222122226622612222266966226662122222221213
666666666666666666666999666615212222222222266666666666666666666666666666666666666666666666666666
6666666666666966666666666666666966666666666666666666666666666666666666000.4001.0000.0000.0000.10
00.1001.7112222222222222222222222699799966996699669966996699669966996699669966996699669966996699
6699669966996699669966996699660101300.0100032396969696966666662966666662696969666666666662696011
111101101.00096969696669619222210339699699696669699669605996666666662666611112222222105011000001
00000000000000166666666666666610020002000.006666969669966996669669666666607101122666296666666969
609669696000.009696662621112441100412102119630401161245060522333200224.17
0000023535951221521330523226642101103266666666266666666619045.90999.6622100296040501020300960000
000000001221222666666666666666666606126666666611413211122112226622622222266966226662222222222111
666666666666666666666999666611666666666666615221222222222222266666666666666666666662151222222222
2221222212226026666666666666666966666666666666116666666666666666666666000.1001.0001.0001.0001.00
01.0005.1222222122222222222222122699669966996699669966060299669966996699669966996699669966996699
6699669966996699660032996699660101100.8102112301960705066666662966666662696969666666666662696021
141201100.45996969696669696132229639699699696669699669606996666666662666622266662222296966996996
99699699699699612666666666666639969962000.006666969669966996669669666666696969696666296666666969
609669696000.009696662612631340000312696669669966663234040122333200317.04
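Records like the two above are fixed-width: each variable occupies a set range of columns, and a record layout tells the software where to cut. The following is a minimal Python sketch of that idea. The variable names and column positions in LAYOUT are hypothetical and chosen only for illustration; the real positions come from the record layout or the command files distributed with the data.

```python
# Hypothetical layout: (variable name, start column, end column), 1-based, inclusive.
# These names and positions are illustrative assumptions, not the actual CCHS layout.
LAYOUT = [
    ("record_id", 1, 7),
    ("var_a", 8, 9),
    ("var_b", 10, 12),
]


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict of variable name -> raw string value."""
    return {name: line[start - 1:end] for name, start, end in layout}


# The first characters of the first record shown above.
record = "0000015959922220611230721241433296101121222222112222"
values = parse_record(record)
print(values)
```

The values stay as strings here; deciding which columns are numeric, and which codes mean "missing", is part of the processing that microdata require before analysis.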
Confidential Microdata
Master Files
• These files contain the fullness of detail captured about the unit of observation. The information in these files could identify the individual who provided the original information, and the files are therefore considered confidential.
Confidential Microdata
Master File – Example
Confidential Microdata
Master File – geography
Confidential Microdata
Master File - fullness of data
Confidential Microdata
Master File - fullness of data
Confidential Microdata
Master File - fullness of data
Confidential Microdata
Share Files
• these are confidential files in which the respondents have signed a consent form permitting Statistics Canada to allow access to their information for approved research.
• Used with NPHS and NLSCY
Public Access Microdata
Anonymized Microdata
• these microdata are specially prepared to minimize the possibility of disclosing or identifying any of the cases or observations
• the original data from the master file are edited to create a public use microdata file
Public Access Microdata
Steps in Anonymizing Microdata
• remove all personal identifiers
• include only gross levels of geography
• collapse detailed information into fewer general categories or cap values
• suppress the values of a variable
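The four steps above can be sketched in a few lines of Python. This is an illustration only, not Statistics Canada's procedure: the field names, the age cap, the income cap and the choice of suppressed variable are all hypothetical assumptions made for the example.

```python
def anonymize(record, age_cap=80, income_cap=100000, suppressed=("occupation",)):
    """Apply the four anonymizing steps to one record (a dict). All field names are hypothetical."""
    out = dict(record)

    # 1. Remove personal identifiers.
    for key in ("name", "phone"):
        out.pop(key, None)

    # 2. Keep only gross geography: province stays, postal code goes.
    out.pop("postal_code", None)

    # 3. Collapse detail into fewer categories (10-year age groups) and cap values.
    age = out.pop("age")
    if age >= age_cap:
        out["age_group"] = f"{age_cap}+"
    else:
        low = (age // 10) * 10
        out["age_group"] = f"{low}-{low + 9}"
    out["income"] = min(out["income"], income_cap)

    # 4. Suppress the values of selected variables.
    for key in suppressed:
        if key in out:
            out[key] = None

    return out


master_record = {
    "name": "A. Respondent",
    "phone": "555-0100",
    "province": "AB",
    "postal_code": "X0X 0X0",
    "age": 47,
    "income": 250000,
    "occupation": "judge",
}
public_record = anonymize(master_record)
print(public_record)
```

Each step discards information, which is why a PUMF can fall short when a patron needs small-area geography, rare subpopulations, or the very variables that were collapsed.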
Public Access Microdata
Statistics Canada PUMFs
• only available for select social surveys that undergo a review by the Data Release Committee, an internal Statistics Canada committee;
• no ‘enterprise’ public use microdata;
Public Access Microdata
Statistics Canada PUMFs
• almost all are cross-sectional, that is, represent data collected at one point in time;
• longitudinal data are difficult to anonymize while maintaining any useful information.
Public Access Microdata
PUMFs – personal identifiers
Public Access Microdata
PUMFs – collapsed data
Public Access Microdata
PUMFs – suppressed data
Public Access Microdata
Synthetic Files
• These microdata do not contain actual ‘real’ cases but are pseudo-cases that, for some surveys, provide aggregate results close to those of the ‘real’ cases
Public Access Microdata
Synthetic Files
• They have been prepared so that analysis runs for the master file can be developed and tested without the risk of disclosing or identifying any of the cases
Public Access Microdata
Synthetic Files
• The results are not to be reported, but are strictly to be used to prepare analyses of master files;
• Usually associated with longitudinal files.
Public Access Microdata
Steps in creating Synthetic Files
• Observations are transformed
• No record represents an actual case
• Keep fullness of variable description
• How the files are made is kept confidential
Public Access Microdata
Synthetic Files – CCHS Cycle 1.1

         PUMF      Synthetic
Obs      130880    65101
Lrecl    841       1778
Var      614       1164
Implications for Analysis
What are the implications of doing analysis with these different types of microdata files?
Implications for Analysis
Master File
• All observations
• Has the most variables with the most detail
  • Lots of geography and personal characteristics
  • Little grouping or capping of categories
Implications for Analysis
Master File
• Restricted access: only available to authorized Statistics Canada employees, which includes ‘deemed employees’;
• Use of the analysis is controlled through a contract;
Implications for Analysis
Master File
• Includes linkage variables across files within a study, e.g., NLSCY linkage among the files for different units of analysis (kids, parents, teachers).
Implications for Analysis
Public Use Microdata (PUMF)
• Valuable content for a tremendous amount of research;
• Issues arise when smaller-area geography is desired, when rare subpopulations are being studied, or when the variables that are needed have been used to anonymize respondents;
Implications for Analysis
Public Use Microdata (PUMF)
• Licensed product: agree to certain terms of use;
• No linkage to multiple units of analysis, with a few exceptions (e.g., GSS Time Use and Family);
Implications for Analysis
Synthetic Files
“Looks like a duck and quacks like a
duck”, but it isn’t a duck or any other
type of fowl.
Implications for Analysis
Synthetic Files
• Looks like master files
• Lots of observations
• Lots of variables
• Little grouping or capping of categories
• Lots of geographic detail
Synthetic Files
Precautions
• Results not authentic – but may be close in the aggregate for some synthetic files;
• Use for testing analysis setups only;
• Still need the master files for publishable results.
Where do we get Access?
Master File
• Restricted access governed under the Statistics Act;
• Remote Job Submission (a.k.a. RDA)
• Research Data Centres
  • Apply to SSHRC for peer review of the proposal and to STC for security clearance.
Where do we get Access?
Public Use Microdata Files (PUMF)
• Get from DLI
• Analyze where it is convenient
• Can use a variety of analysis software, including SAS, SPSS, Stata, HLM, LISREL, etc.
Where do we get Access?
Synthetic Files
• Author Divisions ‘may’ create them
• Most relevant when dealing with new panel data, but not exclusively; e.g., the Census has potential
• NPHS & CCHS synthetic files on DLI FTP site
Where do we get Access?
Synthetic files
• Work locally with the file
• Build SAS and SPSS setups
Which File is Appropriate?
• 1st stop is still the PUMF;
• This file has the easiest access for us;
• Probably meets the needs of most patrons;
• Not as administratively burdensome as synthetic or master file;
• Perfect for clients just looking for ‘data’ – courses in quantitative analysis;
Which File is Appropriate?
• If more detail is needed, refer to the Master File Documentation;
• Inform patrons that the cost of use is higher, both in terms of accessibility and analytical requirements;
• Interest most likely to come from grad students and ‘experienced’ researchers
Which File is Appropriate?
• Download the Synthetic files from DLI
• Make patrons aware of the problems with synthetic files – RESULTS ARE NOT PUBLISHABLE
• Encourage them to submit an application for RDC access – there is a time lag
Which File is Appropriate?
• Some of you may work with a patron using synthetic files before passing her or him on to an RDC.
Services for Synthetic Files
DLI Contacts can provide four basic services with synthetic files.
• Build SPSS and SAS system files from the raw synthetic data files that are distributed through DLI;
• Provide information about the use of Remote Job Submission and RDCs;
Services for Synthetic Files
• Assist with finding variables in the synthetic files;
• Provide instruction about ways of capturing SPSS or SAS code from “dummy” analysis runs with the synthetic files. It is this code that is submitted to STC through remote job submission.
Services for Synthetic Files
1. Building SPSS and SAS system files for
synthetic data
• The CCHS synthetic data are distributed as a raw ASCII file with accompanying command files for SPSS and SAS
• Separate synthetic data files exist for the master file setup and for bootstrapping analysis
Services for Synthetic Files
1. Building SPSS and SAS system files for
synthetic data
• The synthetic file for the CCHS Cycle 1.1 has 1,164 variables and 65,101 fabricated cases. Creating the SPSS and SAS system files from this file is not difficult, but it does take time. DLI Contacts may wish to create these products for their patrons.
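Because building the system files takes time, it can be worth confirming that the raw file arrived intact before running the command files. The Python sketch below counts records and checks each record's length. It assumes one record per physical line and that the expected record length (Lrecl) is taken from the file's documentation; the file name in the usage note is hypothetical.

```python
def check_raw_file(lines, expected_lrecl, expected_obs):
    """Return a list of problems; an empty list means the file matches expectations.

    Assumes one record per physical line (an assumption; check the documentation).
    """
    problems = []
    count = 0
    for number, line in enumerate(lines, start=1):
        record = line.rstrip("\r\n")  # drop the line ending, keep trailing blanks
        count += 1
        if len(record) != expected_lrecl:
            problems.append(
                f"line {number}: length {len(record)}, expected {expected_lrecl}"
            )
    if count != expected_obs:
        problems.append(f"{count} records, expected {expected_obs}")
    return problems


# Tiny demonstration with made-up records of length 5.
demo = ["12345\n", "1234\n"]
print(check_raw_file(demo, expected_lrecl=5, expected_obs=2))
```

For the real file, one would open it (for example, with open("cchs_synthetic.txt") as f, a hypothetical name) and pass f together with the documented record length and the 65,101 cases mentioned above.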
Services for Synthetic Files
2. Information about Remote Job Submission
(RJS)
• The author divisions supporting RJS have established their own guidelines and have different operating procedures. Not all divisions supporting longitudinal surveys currently support RJS (e.g., SLID).
• Therefore, there is a need to track down this information for our patrons.
Services for Synthetic Files
2. Information about Remote Job
Submission (RJS)
• For example, the sources for information about RJS include the Centre for Education Statistics:
  http://www.statcan.ca/english/edu/rda/index.htm
Services for Synthetic Files
2. Information about Remote Job
Submission (RJS)
Where do you find this information?
• Ask the DLI Team via the DLI List
• The EAC has asked for a description of
RJS on the DLI website, which should be
on the DLI Team’s to-do list
Services for Synthetic Files
2. Information about Research Data
Centres
• The collection of master files available through RDCs is listed on the STC website for RDCs
• Each RDC has its own website describing its services
http://www.statcan.ca/english/rdc/index.htm
Services for Synthetic Files
3. Data Reference for the content of the
synthetic files
• Helping researchers identify variables over longitudinal files is an important service
• Need to keep the unit of analysis straight
• Need to understand the mnemonic naming convention for variables over cycles
• Develop indexing aids for you and your patrons
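One possible indexing aid is a table that lines up the same variable across cycles. The Python sketch below assumes, purely for illustration, that the cycle is encoded as a single character at a fixed position in the variable name; the position and the variable names used here are hypothetical, so check the survey documentation for the actual convention.

```python
from collections import defaultdict


def build_index(names_by_cycle, cycle_pos=3):
    """Group variable names that differ only in the (assumed) cycle character.

    cycle_pos is the 0-based position of the cycle character; it is an
    assumption for this sketch, not a documented rule.
    """
    index = defaultdict(dict)
    for cycle, names in names_by_cycle.items():
        for name in names:
            # Replace the cycle character with "?" to get a cycle-free stem.
            stem = name[:cycle_pos] + "?" + name[cycle_pos + 1:]
            index[stem][cycle] = name
    return dict(index)


# Made-up variable names for two cycles.
names_by_cycle = {
    "cycle_1": ["XYZ4_01", "QRS4_02"],
    "cycle_2": ["XYZ6_01"],
}
index = build_index(names_by_cycle)
for stem, found in sorted(index.items()):
    print(stem, found)
```

A stem that appears in only one cycle, like the second made-up variable here, flags a question that was added or dropped between cycles, which is often what a patron needs to know.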
Services for Synthetic Files
4. Provide helpful tips for preserving the code
from “dummy” analysis runs in SPSS and SAS
• Researchers will run analyses on the synthetic file to generate the code that they will subsequently email for Remote Job Submission
• Providing information about how to do this easily will be helpful to your patrons