CGIAR Research Program on
Climate Change, Agriculture and Food Security (CCAFS)
Transition from Raw to Primary Data
October 2013
Data Management Guidelines by Statistical Services Centre, University of Reading is licensed under
a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Permissions beyond the scope of this license may be available at www.reading.ac.uk/ssc.
These materials were produced for and with funding from the Climate Change, Agriculture and Food Security (CCAFS) Research Program of the Consultative Group on International Agricultural Research (CGIAR).
STATISTICAL SERVICES CENTRE
Introduction
Raw data encompasses everything that was obtained in the data collection process, whether it is
data directly gathered by the study or data collected prior to the study that becomes part of the
study datasets.
Data can come in a wide variety of formats: some are immediately useful and ready for analysis,
whilst others require steps to input, format, validate and derive before any analysis can be done.
The result of this process, outlined in Figure 1, is the primary data.
Overall Process
The components of the transition from raw to primary data process are:
• Planning: Specifying the format in which each individual piece of data will be stored (whether
data are numeric, coded, categorical, free text, identifier data, etc.), as well as which software
will be used for entering and storing the electronic versions of the data. Some of this may
already be in place from the data collection due to the tool design; for example, the responses
to a survey question may be coded at the data collection stage.
• Data Entry: Ensuring all data are available in electronic form.
• Manipulation: Deriving additional information from the data and bringing all data sources
together.
• Data Quality Checks: These checks should take place throughout the whole data transition
process.
This document is primarily concerned with the elements of the Data Entry and Manipulation stages.
The planning and data quality check stages are outlined in more detail in the “Storing Numerical and
Non-Numerical Data” and “Data Quality Checking” videos and corresponding documents.
Data Entering the Transition Process
Different types of raw data can join the process at different stages; however, once data have joined,
they should follow through the subsequent stages of the process as well as being checked for quality.
Table 1: Types of data entering the process of transition from raw to primary data at different stages

Stage          Types of data entering the process at this stage
Data Entry     Data which were collected/recorded by hand.
Manipulation   Data from electronic data collection devices which have automatically been
               output into a digital file.
               External or historic data which may need format modification for consistency
               with the other activity data.
Data Entry
Where raw data are not already in a digital format (such as paper questionnaires), a data entry
process is required to transfer the collected data into a computerised format. The data entry process
should ensure that all data stored digitally are a completely accurate reflection of the raw data being
entered.
Depending on the amount of data and the budget of the project, this can be done through a simple
spreadsheet program such as Microsoft Excel, or through a more complex package such as Microsoft
Access or CSPro, where a pre-designed database can be created behind specific data entry screens.
The database should be carefully designed so that it meets as many of the objectives for data
storage and data quality checks as possible, whilst remaining intuitive for the data entry staff to use.
To ensure the accuracy of the data, it is advised to follow a double data entry process to minimise
the likelihood of input errors. The expected human error rate for a simple data input process, using
only coded and numerical fields, has been shown in studies to be between 0.3% and 2%, depending
on the prior data entry experience of the individual. Although this may seem small, in the context of
a moderately sized study with 50 pieces of information for each of 100 individuals, it corresponds to
a minimum of 15 errors. Researchers have found that this can have a large effect upon the eventual
analytical findings, potentially showing statistically significant results where the true values would
not have, and vice versa (Barchard & Pace, 2011; Kawado, 2003).
The probability of the same error being made independently by two people is very small (around 1 in
40,000, assuming an individual error rate of 0.5%). This means that the majority of errors can be
found by comparing the two sets of entered data and investigating instances where they differ,
cross-checking with the original copy to verify which value is correct. However, many data entry
errors are unlikely to be independent: they often occur as a result of data not being recorded clearly
or legibly, or perhaps a confusing data entry system. Checks should be in place during the data
collection process and in the design of the data entry system to prevent errors entering the data for
these reasons.
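The comparison step of a double data entry process can be sketched as follows; the record structure and field names are hypothetical, and real systems (e.g. CSPro verification) automate this:

```python
# Compare two independently entered copies of the same records and report
# every field where they disagree, for cross-checking against the originals.

def compare_entries(first, second, id_field="household_id"):
    """Return (id, field, value_in_first, value_in_second) for every
    discrepancy between the two entry passes."""
    second_by_id = {rec[id_field]: rec for rec in second}
    discrepancies = []
    for rec in first:
        twin = second_by_id.get(rec[id_field])
        if twin is None:
            discrepancies.append((rec[id_field], "<missing record>", rec, None))
            continue
        for field, value in rec.items():
            if twin.get(field) != value:
                discrepancies.append((rec[id_field], field, value, twin.get(field)))
    return discrepancies

entry_one = [{"household_id": 1, "age": 34, "crop": "maize"},
             {"household_id": 2, "age": 57, "crop": "beans"}]
entry_two = [{"household_id": 1, "age": 43, "crop": "maize"},  # 34 mistyped as 43
             {"household_id": 2, "age": 57, "crop": "beans"}]

print(compare_entries(entry_one, entry_two))
# [(1, 'age', 34, 43)]
```

Each flagged discrepancy would then be resolved against the original paper record, not by assuming either entry pass is correct.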
Inputting data in coded form is preferable to typing out full responses. It will not only decrease the
time taken per record, but will also reduce the overall error rate. Obviously many fields cannot be
coded and so must be entered in full, such as names, addresses and comments.
In some circumstances, data entry using PDAs or other handheld devices can be a viable alternative
to data collection by hand. Where used effectively, this combines the data collection and data entry
stages of a project. There are potential problems due to technical issues; however, nearly all of
these can be dealt with through good training and careful planning (e.g. regularly backing up data,
carrying spare parts/additional devices, and carrying physical copies in case of failure). One major
drawback of this approach is that it removes the double data entry validation process and therefore
relies solely on the original data being entered correctly.
Optical character recognition software has been developed to read in large amounts of data
automatically. However, as of 2012, the transcription error rates (i.e. the software failing to
recognise or incorrectly recognising characters) associated with these programs are too high for
them to be recommended when dealing with handwritten data. These programs can be used
effectively when dealing with typed data or data which consists solely of tick boxes.
Manipulation
Data will often arise from various different sources; storing these different data sources in separate
data files, rather than trying to create one large dataset, is encouraged. For survey data, which are
likely to result in the creation of a large number of fields, the raw data should be split into multiple
datasets, one per section/question/group of questions. This is for a number of reasons:
• Different data sources may be looking at results on different levels (e.g. individual vs.
household).
• Finding a particular variable is often much easier when dealing with smaller datasets, each
relating to a different aspect of the data (for example, one dataset on household
characteristics, another containing the responses to individual savings questions, and another
dealing with community group membership).
• It prevents excessively wide datasets, which cannot be handled by some software.
It is imperative that all datasets contain all relevant ID variables, and that these ID variables are
consistent across all datasets. Therefore an important step in the transition process for any digital
raw data is to merge the data with all appropriate ID variables. This applies to external or historic
data as well, which will require manipulation so that they include these study-specific ID variables
where relevant.
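The merging step described above can be sketched as follows; the household/individual structure and field names are illustrative assumptions, not a prescribed schema:

```python
# Attach household-level ID variables to an individual-level dataset so
# both files carry consistent identifiers. Field names are illustrative.

households = [{"household_id": 1, "village": "A"},
              {"household_id": 2, "village": "B"}]
individuals = [{"household_id": 1, "person_no": 1, "age": 34},
               {"household_id": 1, "person_no": 2, "age": 8},
               {"household_id": 2, "person_no": 1, "age": 57}]

def merge_on_id(child_records, parent_records, id_field):
    """Left-join parent fields onto each child record via the shared ID."""
    parents = {p[id_field]: p for p in parent_records}
    merged = []
    for child in child_records:
        combined = dict(parents.get(child[id_field], {}))
        combined.update(child)  # child values take precedence on clashes
        merged.append(combined)
    return merged

merged = merge_on_id(individuals, households, "household_id")
print(merged[0])
# {'household_id': 1, 'village': 'A', 'person_no': 1, 'age': 34}
```

In practice the same join is done with a database query or a statistical package's merge command; the point is that it only works if the ID variables are present and consistent in every file.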
Deriving Variables
Additional numerical fields can be derived as part of the data management process and stored
alongside the raw data; for example, converting areas recorded in different units into hectares,
calculating age from date of birth, or calculating the percentage change in a variable recorded twice.
Where these variables can be derived, it is better to rely on computed calculations for these values,
instead of values which may have been recorded by hand and included as part of the raw data.
Categorical variables can also be derived from the data, for example splitting age into groups or
combining the responses from several fields to create a pass/fail type response.
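The derivations above can be sketched as follows; the conversion factor is standard, but the field names and the age-group boundaries are assumptions for illustration:

```python
# Derive numeric and categorical variables from raw fields rather than
# relying on values calculated by hand during collection.

from datetime import date

ACRES_PER_HECTARE = 2.4710538  # standard conversion factor

def age_on(dob, survey_date):
    """Age in whole years at the survey date, computed from date of birth."""
    had_birthday = (survey_date.month, survey_date.day) >= (dob.month, dob.day)
    return survey_date.year - dob.year - (0 if had_birthday else 1)

def acres_to_hectares(acres):
    """Convert an area recorded in acres into hectares."""
    return acres / ACRES_PER_HECTARE

def age_group(age):
    """Collapse age into broad categories (illustrative boundaries)."""
    if age < 18:
        return "child"
    elif age < 60:
        return "adult"
    return "elder"

age = age_on(date(1979, 6, 15), date(2013, 10, 1))
print(age, age_group(age), round(acres_to_hectares(5), 3))
# 34 adult 2.023
```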
Full documentation of how the additional variables were derived should be included as part of the
metadata. The inclusion of syntax files (containing the commands for the calculation of all derived
variables) in the data archive, along with the derived datasets, is highly recommended.
Deriving Datasets
In addition to deriving individual variables, it will often be necessary to derive complete datasets,
changing the level at which the data are stored, to fulfil the analysis requirements. For example,
climate data are generally presented at a daily level, but it is often of more use to derive monthly
summary statistics for use in analysis. Or, if the analysis is focused on change from baseline, a
derived dataset could be created containing the original baseline values alongside the new values
and the derived differences. In both of these circumstances a whole dataset is derived from the raw
data and/or external/historic data. The derived data should be stored following the data storage
recommendations, and any calculations underlying the derivation should be documented in the
metadata. In particular, it should be clear how missing values were treated as part of the derivation
process.
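The daily-to-monthly example can be sketched as follows; the rainfall data and the 80%-completeness rule are illustrative assumptions, and the point is that the treatment of missing days is explicit and documentable:

```python
# Summarise daily rainfall records into monthly totals, recording how many
# days are missing in each month so the treatment of gaps stays explicit.

import calendar
from collections import defaultdict

def monthly_rainfall(daily, min_fraction_present=0.8):
    """daily maps (year, month, day) -> rainfall in mm, or None if the
    observation is missing. Returns per-month totals and missing counts."""
    by_month = defaultdict(list)
    for (year, month, _day), value in daily.items():
        by_month[(year, month)].append(value)
    summaries = {}
    for (year, month), values in by_month.items():
        days_in_month = calendar.monthrange(year, month)[1]
        present = [v for v in values if v is not None]
        complete = len(present) / days_in_month >= min_fraction_present
        summaries[(year, month)] = {
            "total_mm": sum(present) if complete else None,  # too many gaps
            "days_missing": days_in_month - len(present),
        }
    return summaries

daily = {(2013, 9, d): 2.0 for d in range(1, 31)}  # complete September
daily[(2013, 9, 10)] = None                        # one missing day
print(monthly_rainfall(daily)[(2013, 9)])
# {'total_mm': 58.0, 'days_missing': 1}
```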
Updating Categorical Data
As discussed in “Storing Numerical and Non-Numerical Data”, in the initial stages of entering and
compiling data, using numeric categories alongside a data dictionary is often more efficient and less
prone to errors. However, when the data need to be analysed or archived, it is easier to interpret
the meaning of the data if categorical variables are stored as text. Updating the data to include the
labels of the coded categorical data is recommended, and can be achieved relatively easily in most
appropriate software packages by adding label columns to the dataset. An exception can be made
for True/False variables, where the values of 0 and 1 are conventionally understood as False and
True responses respectively.
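The labelling step can be sketched as follows; the codes and labels are illustrative, and in statistical packages the same effect is achieved with value-label commands:

```python
# Replace numeric category codes with their text labels using the data
# dictionary, adding a label column rather than overwriting the codes.

CROP_LABELS = {1: "maize", 2: "beans", 3: "cassava"}  # illustrative dictionary

def add_labels(records, code_field, label_field, dictionary):
    """Add a text label column alongside the coded column; unknown codes
    are labelled explicitly rather than silently dropped."""
    for rec in records:
        rec[label_field] = dictionary.get(rec[code_field], "<unknown code>")
    return records

records = [{"household_id": 1, "crop_code": 1},
           {"household_id": 2, "crop_code": 3}]
add_labels(records, "crop_code", "crop_label", CROP_LABELS)
print(records[0])
# {'household_id': 1, 'crop_code': 1, 'crop_label': 'maize'}
```

Keeping both the code and the label preserves the efficient coded form while making the archived data self-explanatory.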
The final part of the data manipulation process is usually exporting the data to an appropriate
statistical analysis package.
Data Quality Checks
Data quality checks should take place throughout data collection and the transition from raw to
primary data. Checking the data at the collection and entry stages ensures that all data are ‘correct’
before any manipulations are conducted; checking after manipulation confirms that the
manipulation was performed correctly.
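A post-manipulation check can be as simple as validating derived values against range rules; the fields and limits below are illustrative assumptions:

```python
# Sketch of a post-manipulation quality check: flag records whose derived
# values fall outside plausible ranges, for investigation against the raw data.

def check_records(records):
    """Return (record index, message) for every failed check."""
    problems = []
    for i, rec in enumerate(records):
        if not (0 <= rec["age"] <= 120):
            problems.append((i, f"age {rec['age']} out of range"))
        if rec["area_ha"] < 0:
            problems.append((i, "negative area"))
    return problems

records = [{"age": 34, "area_ha": 2.0},
           {"age": 340, "area_ha": 2.0}]  # age mistyped during entry
print(check_records(records))
# [(1, 'age 340 out of range')]
```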
Works Cited
Barchard, K. A., & Pace, L. A. (2011). Preventing human error: The impact of data entry methods on
data accuracy and statistical results. Computers in Human Behavior, 27(5), 1834-1839.
Kawado, M. (2003). A comparison of error detection rates between the reading aloud method and
the double data entry method. Controlled Clinical Trials, 24, 560-569.
SSC Resources
Accessible from www.reading.ac.uk/ssc
SADC Course In Statistics. Module I2: Organising Data (2007):
• Session 7. The use of Optical Character Recognition Technology in National Statistical Offices
• Session 11. Managing datasets (1): appending, merging
• Session 12. Managing datasets (2): moving data between levels
• Session 14. Calculating Indices using Spreadsheets
Uganda Bureau of Statistics. Module 2: From the data to the report (2008)
DFID Guidelines for Good Statistical Practice (2001):
• The Role of a Database Package in Managing Research Data
• Disciplined Use of Spreadsheets for Data Entry
Appendix I – CCAFS Data Management Support Pack
This document is part of the CCAFS Data Management Support Pack produced by the Statistical
Services Centre, University of Reading, UK. The following materials are available in the pack:
0. Data Management Strategy
a. CCAFS Data Management Strategy
1. Research Protocols
a. Writing Research Protocols – a statistical perspective
b. Preparation of Research Protocols – Good Practice Case Study
c. What is a Research Protocol, and how to use one (Video & Transcript)
d. Details of what a Research Protocol should contain (Video & Transcript)
2. Data Management Policies & Plans
a. Creating a Data Management Plan
b. Data Management Plan (Video & Transcript)
c. Example Data Management Activity Plan
d. Example Consent Form
3. Budgeting & Planning
a. Budgeting & Planning for Data Management
b. ToR Data Support Staff
c. Budgeting & Planning (Video & Transcript)
4. Data Ownership
a. Data Ownership and Authorship
b. Template – Data Ownership Agreement
c. CCAFS Data Ownership & Sharing Agreement
d. Data Ownership & Authorship (Video & Transcript)
5. Data & Document Storage
a. Creating and Using a DDS
b. DDS Introduction – (Video & Transcript)
c. DDS Organisation – (Video & Transcript)
d. DDS Ownership – (Video & Transcript)
e. Introduction to Dropbox – (Video & Transcript)
6. Archiving & Sharing
a. Archiving & Sharing Data
b. Data and Documents to Submit for Archiving – a checklist
c. MetaData
d. Archiving & Sharing (Video & Transcript)
e. Metadata (Video & Transcript)
f. CCAFS HBS Questionnaire
g. CCAFS HHS Code Book
h. CCAFS Training Manual for Field Supervisors
7. CCAFS Data Portals
a. Portals for CCAFS Outputs
b. AgTrials Summary
c. CCAFS-Climate Summary
d. DSpace Introduction
e. Introduction to Dataverse (Video & Transcript)
f. Creating a Dataverse (Video & Transcript)
g. Dataverse Study Catalogue
h. CCAFS Dataverse (Video & Transcript)
8. Data Quality & Organisation
a. Data Quality Assurance
b. Guidance for handling different types of Data
c. Transition from Raw to Primary Data
d. Data Quality Assurance (Video & Transcript)
e. Guidance for handling different types of data (Video & Transcript)
f. Transition from Raw to Primary Data (Video & Transcript)