CGIAR Research Program on Climate Change, Agriculture and Food Security (CCAFS)
Transition from Raw to Primary Data
October 2013

Data Management Guidelines by Statistical Services Centre, University of Reading is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. Permissions beyond the scope of this license may be available at www.reading.ac.uk/ssc. These materials were produced for and with funding from the Climate Change, Agriculture and Food Security Research Program of the Consultative Group on International Agricultural Research (CGIAR).

STATISTICAL SERVICES CENTRE

Introduction

Raw data encompasses everything obtained in the data collection process, whether gathered directly by the study or collected before the study and incorporated into the study datasets. Data can arrive in a wide variety of formats: some are immediately useful and ready for analysis, while others require steps to input, format, validate and derive before any analysis can be done. The result of this process, outlined in Figure 1, is the primary data.

Overall Process

The components of the transition from raw to primary data are:

Planning: Specifying how each individual piece of data will be stored (numeric, coded, categorical, free text, identifier, etc.), as well as which software will be used to enter and store the electronic versions of the data. Some of this may already be in place from the data collection because of the tool design; for example, the responses to a survey question may be coded at the data collection stage.

Data Entry: Ensuring all data are available in electronic form.

Manipulation: Deriving additional information from the data and bringing all data sources together.

Data Quality Checks: These checks should take place throughout the whole data transition process.
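The format specification made at the planning stage can be captured in a small machine-readable data dictionary that entry software can check values against. A minimal sketch in Python, with entirely hypothetical field names, codes and ranges:

```python
# A hypothetical data dictionary recording the planning decisions for each
# field before entry begins: storage type, allowed codes, and valid ranges.
DATA_DICTIONARY = {
    "hh_id":          {"type": "identifier", "description": "Household ID"},
    "interview_date": {"type": "date", "format": "YYYY-MM-DD"},
    "farm_area_ha":   {"type": "numeric", "min": 0, "max": 500},
    "water_source":   {"type": "coded",
                       "codes": {1: "borehole", 2: "river",
                                 3: "rainwater", 9: "other"}},
    "comments":       {"type": "free text"},
}

def validate(field, value):
    """Check one entered value against the dictionary (a minimal sketch)."""
    spec = DATA_DICTIONARY[field]
    if spec["type"] == "numeric":
        return spec["min"] <= value <= spec["max"]
    if spec["type"] == "coded":
        return value in spec["codes"]
    return True  # identifiers, dates and free text are checked elsewhere

print(validate("farm_area_ha", 12.5))  # -> True (a plausible area)
print(validate("water_source", 7))     # -> False (7 is not a defined code)
```

Keeping the dictionary alongside the data means the same specification can later supply the text labels for coded fields.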
This document is primarily concerned with the Data Entry and Manipulation stages. The Planning and Data Quality Check stages are outlined in more detail in the "Storing Numerical and Non-Numerical Data" and "Data Quality Checking" videos and corresponding documents.

Data Entering the Transition Process

Different types of raw data can join the process at different stages; however, once data have joined, they should follow through the subsequent stages of the process as well as being checked for quality.

Table 1: Types of data entering the process of transition from raw to primary data at different stages

Stage          Types of data entering the process at this stage
Data Entry     Data which were collected/recorded by hand.
Manipulation   Data from electronic data collection devices which have automatically been output into a digital file.
               External or historic data which may need format modification for consistency with the other activity data.

Data Entry

Where raw data are not already in a digital format (such as paper questionnaires), a data entry process is required to transfer the collected data into a computerised format. The data entry process should ensure that all digitally stored data are a completely accurate reflection of the raw data being entered. Depending on the amount of data and the budget of the project, this can be done through a simple spreadsheet program such as Microsoft Excel, or through a more sophisticated package such as Microsoft Access or CSPro, where a pre-designed database can be created behind specific data entry screens. The database should be carefully designed so that it meets as many of the objectives for data storage and data quality checks as possible, whilst remaining intuitive for the data entry staff to use. To ensure the accuracy of the data, it is advisable to follow a double data entry process to minimise the likelihood of any input errors.
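The comparison step in double data entry can be sketched as follows, assuming both entry files share a record identifier; the records and field names here are hypothetical:

```python
import csv
import io

# Two independently entered copies of the same records (inline CSV here;
# in practice these would be the two data entry files). Record 102 disagrees.
ENTRY_1 = "hh_id,age,crop_code\n101,34,2\n102,41,3\n103,29,1\n"
ENTRY_2 = "hh_id,age,crop_code\n101,34,2\n102,47,3\n103,29,1\n"

def load(text):
    """Index the entered rows by their record identifier."""
    return {row["hh_id"]: row for row in csv.DictReader(io.StringIO(text))}

def compare(first, second):
    """List every (record, field, value_1, value_2) where the two entries
    differ, for cross-checking against the original paper form."""
    mismatches = []
    for key in sorted(first):
        for field, value in first[key].items():
            if second[key][field] != value:
                mismatches.append((key, field, value, second[key][field]))
    return mismatches

for rec, field, v1, v2 in compare(load(ENTRY_1), load(ENTRY_2)):
    print(f"record {rec}: {field} entered as {v1!r} and {v2!r} - check the form")
# prints: record 102: age entered as '41' and '47' - check the form
```

Each flagged mismatch is resolved against the original form, not by assuming either entry is correct.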
The expected human error rate for a simple data input process, using only coded and numerical fields, has been shown in studies to be between 0.3% and 2%, depending on the prior data entry experience of the individual. Although this may seem small, in the context of a moderately sized study of 50 pieces of information for each of 100 individuals (5,000 values in total), it corresponds to a minimum of 15 errors. Researchers have found that this can have a large effect on the eventual analytical findings, potentially showing statistically significant results where the true values would not have, and vice versa (Barchard & Pace, 2011; Kawado, 2003).

The probability of the same error being made independently by two people is very small (around 1 in 40,000, assuming an individual error rate of 0.5%). This means that the majority of errors can be found by comparing the two sets of entered data and investigating instances where they differ, cross-checking against the original copy to verify which value is correct. However, many data entry errors are not independent: they often occur because data were not recorded clearly or legibly, or because the data entry system is confusing. Checks should be in place during the data collection process, and in the design of the data entry system, to prevent errors entering the data for these reasons.

Inputting data in coded form is preferable to typing out full responses. It not only decreases the time taken per record but also reduces the overall error rate. Of course, many fields cannot be coded and must be entered in full, such as names, addresses and comments.

In some circumstances, data entry using PDAs or other handheld devices can be a viable alternative to data collection by hand. Where used effectively, this combines the data collection and data entry stages of a project.
There are potential problems due to technical issues; however, nearly all of them can be dealt with simply by good training and careful planning (e.g. regularly backing up data, carrying spare parts or additional devices, and carrying physical copies in case of failure). One major drawback of this approach is that it removes the double data entry validation step and therefore relies solely on the original data being entered correctly.

Optical character recognition (OCR) software has been developed to read in large amounts of data automatically. However, as of 2012, the transcription error rates (i.e. the software failing to recognise, or incorrectly recognising, characters) associated with these programs are too high for them to be recommended when dealing with handwritten data. These programs can be used effectively with typed data or data consisting solely of tick boxes.

Manipulation

Data will often arise from various different sources, and storing these in separate data files, rather than trying to create one large dataset, is encouraged. For survey data, which is likely to produce a large number of fields, the raw data should be split into multiple datasets, one per section, question or group of questions. This is for a number of reasons:

- Different data sources may record results at different levels (e.g. individuals vs. households).
- Finding a particular variable is often much easier when dealing with smaller datasets, each relating to a different aspect of the data (for example, one dataset on household characteristics, another containing the responses to individual savings questions, another dealing with community group membership).
- Excessively wide datasets cannot be handled by some software.
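As a sketch of this splitting, assuming a flat survey export keyed by a household ID; the section layout and field names are hypothetical:

```python
# Hypothetical flat survey export: one wide record per household.
RECORDS = [
    {"hh_id": "101", "village": "V1", "hh_size": 5,
     "sav_has_acct": 1, "sav_amount": 120},
    {"hh_id": "102", "village": "V2", "hh_size": 3,
     "sav_has_acct": 0, "sav_amount": 0},
]

ID_VARS = ["hh_id", "village"]  # carried into every derived dataset
SECTIONS = {
    "household": ["hh_size"],
    "savings":   ["sav_has_acct", "sav_amount"],
}

def split(records):
    """Split the wide dataset into one dataset per section; every output
    dataset keeps the full set of ID variables so the pieces can be
    merged back together later."""
    return {
        name: [{k: rec[k] for k in ID_VARS + fields} for rec in records]
        for name, fields in SECTIONS.items()
    }

datasets = split(RECORDS)
print(datasets["savings"][0])
# -> {'hh_id': '101', 'village': 'V1', 'sav_has_acct': 1, 'sav_amount': 120}
```

Because every section dataset carries the same ID variables, any subset can be rejoined on those keys without ambiguity.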
It is imperative that all datasets contain all relevant ID variables, and that these ID variables are consistent across all datasets. An important step in the transition process for any digital raw data is therefore to merge the data with all appropriate ID variables. This applies to external or historic data as well, which will require manipulation so that they include these study-specific ID variables where relevant.

Deriving Variables

Additional numerical fields can be derived as part of the data management process and stored alongside the raw data: for example, converting areas recorded in different units into hectares, calculating age from date of birth, or calculating the percentage change in a variable recorded twice. Where such variables can be derived, it is better to rely on computed calculations for these values than on values recorded by hand and included as part of the raw data. Categorical variables can also be derived from the data, for example by splitting age into groups or by combining the responses from several fields to create a pass/fail type response. Full documentation of how the additional variables were derived should be included as part of the metadata. The inclusion of syntax files (containing the commands for the calculation of all derived variables) in the data archive, along with the derived datasets, is highly recommended.

Deriving Datasets

In addition to deriving individual variables, it will often be necessary to derive complete datasets, changing the level at which the data are stored, to fulfil the analysis requirements. For example, climate data are generally presented at a daily level, but it is often more useful to derive monthly summary statistics for use in analysis.
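A sketch of that kind of derivation, using hypothetical daily rainfall values and making the treatment of missing readings explicit:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily rainfall records (date, mm); None marks a missing reading.
DAILY = [
    ("2012-06-01", 4.0), ("2012-06-02", 0.0), ("2012-06-15", None),
    ("2012-06-30", 8.0), ("2012-07-01", 2.5), ("2012-07-02", 1.5),
]

def monthly_summary(daily):
    """Derive a month-level dataset from daily records. Missing readings
    are excluded from the total and mean but counted, so the treatment
    of missing values is visible in the derived data itself."""
    by_month = defaultdict(list)
    for date, value in daily:
        by_month[date[:7]].append(value)  # group on the YYYY-MM prefix
    return {
        month: {
            "total_mm": sum(v for v in values if v is not None),
            "mean_mm": round(mean(v for v in values if v is not None), 2),
            "n_missing": sum(v is None for v in values),
        }
        for month, values in by_month.items()
    }

print(monthly_summary(DAILY)["2012-06"])
# -> {'total_mm': 12.0, 'mean_mm': 4.0, 'n_missing': 1}
```

Storing this logic in a syntax file, as recommended above, documents exactly how the derived dataset was produced.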
Alternatively, if the analysis is focused on change from baseline, a derived dataset could be created containing the original baseline values alongside the new values and the derived differences. In both of these circumstances, a whole dataset is derived from the raw data and/or external/historic data. The data should be stored following the data storage recommendations, and any calculations underlying the derivation should be documented in the metadata. In particular, it should be clear how missing values were treated as part of the derivation process.

Updating Categorical Data

As discussed in "Storing Numerical and Non-Numerical Data", in the initial stages of entering and compiling data, using numeric categories alongside a data dictionary is often more efficient and less prone to errors. However, when the data need to be analysed or archived, the meaning of the data is easier to interpret if categorical variables are stored as text. Updating the data to include the labels of the coded categorical data is recommended, and can be achieved relatively easily in most appropriate software packages by adding label columns to the dataset. An exception can be made for True/False variables, where the values 0 and 1 are conventionally understood as False and True respectively. The final part of the data manipulation process is usually exporting the data to an appropriate statistical analysis package.

Data Quality Checks

Data quality checks should take place throughout data collection and the transition from raw to primary data. Checking the data at the collection and entry stages ensures that all data are 'correct' before any manipulations are conducted; checking after manipulation confirms that the manipulation was performed correctly.

Works Cited

Barchard, K. A., & Pace, L. A. (2011, September).
Preventing human error: The impact of data entry methods on data accuracy and statistical results. Computers in Human Behavior, 27(5), 1834-1839.

Kawado, M. (2003, May). A comparison of error detection rates between the reading aloud method and the double data entry method. Controlled Clinical Trials, 24, 560-569.

SSC Resources

Accessible from www.reading.ac.uk/ssc

SADC Course in Statistics, Module I2: Organising Data (2007):
- Session 7. The use of Optical Character Recognition Technology in National Statistical Offices
- Session 11. Managing datasets (1): appending, merging
- Session 12. Managing datasets (2): moving data between levels
- Session 14. Calculating Indices using Spreadsheets

Uganda Bureau of Statistics, Module 2: From the data to the report (2008)

DFID Guidelines for Good Statistical Practice (2001):
- The Role of a Database Package in Managing Research Data
- Disciplined Use of Spreadsheets for Data Entry

Appendix I – CCAFS Data Management Support Pack

This document is part of the CCAFS Data Management Support Pack produced by the Statistical Services Centre, University of Reading, UK. The following materials are available in the pack:

0. Data Management Strategy
   a. CCAFS Data Management Strategy
1. Research Protocols
   a. Writing Research Protocols – a statistical perspective
   b. Preparation of Research Protocols – Good Practice Case Study
   c. What is a Research Protocol, and how to use one (Video & Transcript)
   d. Details of what a Research Protocol should contain (Video & Transcript)
2. Data Management Policies & Plans
   a. Creating a Data Management Plan
   b. Data Management Plan (Video & Transcript)
   c. Example Data Management Activity Plan
   d. Example Consent Form
3. Budgeting & Planning
   a. Budgeting & Planning for Data Management
   b. ToR Data Support Staff
   c. Budgeting & Planning (Video & Transcript)
4. Data Ownership
   a. Data Ownership and Authorship
   b. Template – Data Ownership Agreement
   c. CCAFS Data Ownership & Sharing Agreement
   d.
Data Ownership & Authorship (Video & Transcript)
5. Data & Document Storage
   a. Creating and Using a DDS
   b. DDS Introduction (Video & Transcript)
   c. DDS Organisation (Video & Transcript)
   d. DDS Ownership (Video & Transcript)
   e. Introduction to Dropbox (Video & Transcript)
6. Archiving & Sharing
   a. Archiving & Sharing Data
   b. Data and Documents to Submit for Archiving – a checklist
   c. MetaData
   d. Archiving & Sharing (Video & Transcript)
   e. Metadata (Video & Transcript)
   f. CCAFS HBS Questionnaire
   g. CCAFS HHS Code Book
   h. CCAFS Training Manual for Field Supervisors
7. CCAFS Data Portals
   a. Portals for CCAFS Outputs
   b. AgTrials Summary
   c. CCAFS-Climate Summary
   d. DSpace Introduction
   e. Introduction to Dataverse (Video & Transcript)
   f. Creating a Dataverse (Video & Transcript)
   g. Dataverse Study Catalogue
   h. CCAFS Dataverse (Video & Transcript)
8. Data Quality & Organisation
   a. Data Quality Assurance
   b. Guidance for handling different types of Data
   c. Transition from Raw to Primary Data
   d. Data Quality Assurance (Video & Transcript)
   e. Guidance for handling different types of data (Video & Transcript)
   f. Transition from Raw to Primary Data (Video & Transcript)