Cookbook for a Codebook - Open Science Framework

Cookbook for a Codebook
How a good codebook can lead to better science
Kai T. Horstmann
What is a good Codebook?
• Whenever you design a new study and collect data, the variables in the data
frame alone may not contain all the information needed to understand them.
• Probably the best-known version of a codebook is the value labels
from SPSS. They are also a bad codebook.
• A codebook contains all information necessary to analyze the
corresponding data frame.
• First, let's take a small survey (in German)
• www.formr.org/formr_codebook
Why do you need a codebook?
• In case you die, and someone else wants to continue working with your
data.
• In case a reviewer wants to see the data
• In case you start working with the data again in about 10 years.
• In case you want to share your data with other scientists.
• To save time and energy and reduce errors when analyzing the data.
• Having a good codebook is one of the first steps towards an open and
reproducible science.
• Here is a publicly available codebook:
• https://osf.io/yw3pq/
Reducing errors with a codebook
• Ideally, the codebook allows all data analyses to be run automatically.
• Using R, a codebook can be used together with the data frame to
summarize variables, get item analyses, etc.
function(codebook, data) = results
• Maybe you are not using R now, but maybe your student is, or you will
be five years from now.
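As a rough illustration of the function(codebook, data) = results idea, the following base R sketch computes item descriptives by looping over the codebook. The toy codebook, the toy data, and the helper name summarize_items() are assumptions made up for this example, not part of the original materials.

```r
# Hypothetical codebook and data, invented for illustration only.
codebook <- data.frame(
  name  = c("excb_BFI_extra_2", "excb_BFI_consc_4"),
  scale = c("extra", "consc"),
  stringsAsFactors = FALSE
)
data <- data.frame(
  excb_BFI_extra_2 = c(2, 4, 6),
  excb_BFI_consc_4 = c(1, 3, 5)
)

# summarize_items() is a hypothetical helper: it returns one row of
# descriptives for every codebook entry that is present in the data frame.
summarize_items <- function(codebook, data) {
  items <- codebook[codebook$name %in% names(data), , drop = FALSE]
  item_mean <- vapply(items$name, function(v) mean(data[[v]], na.rm = TRUE), numeric(1))
  item_sd   <- vapply(items$name, function(v) sd(data[[v]], na.rm = TRUE), numeric(1))
  data.frame(
    name  = items$name,
    scale = items$scale,
    mean  = unname(item_mean),
    sd    = unname(item_sd),
    stringsAsFactors = FALSE
  )
}

results <- summarize_items(codebook, data)
print(results)
```

Because the function only needs the name column to find the variables, the same call would work unchanged for any study that follows the same codebook layout.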
Technical requirements for a codebook
• Technically, a codebook is a table consisting of rows and columns.
• Each row contains an item.
• Columns contain information about the items.
• A codebook should be
• in English
• machine-readable (i.e. no information encoded in colors, no merged cells)
• in a format everyone can open (Google Sheets, Excel, CSV)
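A machine-readable codebook can be loaded like any other table. The sketch below reads a tiny codebook from CSV text and checks that the two mandatory columns described in these slides (name and label) are present. The CSV content is a shortened, illustrative version of the example codebook; in practice the text would come from a file whose name is up to you.

```r
# A tiny codebook written as CSV text (illustrative content only).
csv_text <- 'name,label,coding,scale
excb_BFI_extra_2,"... is outgoing, sociable",0,extra
excb_BFI_consc_4,"... is easily distractable",1,consc'

codebook <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Check that the mandatory columns exist before doing anything else.
mandatory <- c("name", "label")
missing_cols <- setdiff(mandatory, names(codebook))
if (length(missing_cols) > 0) {
  stop("Codebook is missing mandatory column(s): ",
       paste(missing_cols, collapse = ", "))
}

str(codebook)
```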
The typical design of a study involves
(more or less)
- Selection of the measure, for example the BFI-K questionnaire.
- Collection of the items in the format you need, for example an excel table.
- (Getting good and valid translations of your measures).
- Deciding on the Likert-type response scale (e.g. 1-6).
- Listing which items have to be reversed after data collection.
- Getting the items into the (online) survey tool (sometimes, this involves a lot of
copying and pasting).
- Finding better item-names than v1, v2, v3, v4, v5, v6, v7, v8, v9, v10.
- Knowing which item belongs to which scale.
- Collecting the data.
- Downloading the individual items from the collection platform.
- Reversing the items that need to be reversed.
- Scoring scales.
- Looking at the usual stuff: reliabilities, measurement models, etc.
- Starting to look at the stuff that matters, i.e. your initial question.
What a codebook contains:
• There are mandatory and optional columns in each codebook.
• Mandatory columns qualify a codebook as a codebook; the optional
columns make you happy in the future.
What a codebook contains:
Rows
• Each row in a codebook usually represents an item.
• It can also be just the introductory text for a questionnaire, or the
introduction to the study. The study thereby becomes even more
replicable.
What a codebook contains:
Mandatory Columns
• name: The name of the item in the final data frame. On naming of items,
see next slides.
• label: The label of the item, what the participant will actually see.
What a codebook contains:
Optional Columns (I)
- coding: Is the item coded in the direction of the construct (give it the value
0), or does it have to be recoded (= 1)? This information can also be added to
the item names, but it may just be useful to have it in another, separate
column.
- questionnaire: This column contains information about the questionnaire
the item belongs to. Here, one could for example use the short names of the
questionnaires, e.g. BFI. Never use a hyphen (minus sign), though: R does not
accept hyphens in regular variable names and by default converts them to dots
when building a data frame, so BFI-K would become BFI.K.
- scale: This column contains information about the scale the item belongs to.
This can for example be extra for Extraversion, but also demo if it is an item
of a demographics questionnaire or none if the item does not belong to any
scale.
- range: Sometimes useful to explicitly list which range the items should have.
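The remark about hyphens in questionnaire names can be checked directly in base R. This short sketch uses make.names(), which is what data.frame() applies to column names when check.names = TRUE (the default); the column name BFI-K_1 is just an illustrative example.

```r
# make.names() turns invalid characters such as "-" into dots.
make.names("BFI-K")

# data.frame() applies the same rule to column names by default.
df <- data.frame(`BFI-K_1` = 1:3)
names(df)

# The hyphen survives only if the check is switched off, but the column
# then has to be quoted with backticks every time it is used.
df2 <- data.frame(`BFI-K_1` = 1:3, check.names = FALSE)
names(df2)
```

Sticking to letters, digits, and underscores avoids the problem altogether.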
What a codebook contains:
Optional Columns (II)
- (further columns): If there are, for example, subscales, or different phases
or versions in which the same item is used again, add columns that carry this
information. Imagine, for example, that you present the same item twice with
a different wording, e.g. "I feel I am conscientious" and "Others think I am
conscientious". Then you could add another column named version with three
possible entries: neither (for the items that don't have a different
wording), feel (for the items with the feel-beginning), or others (for the
other-items).
- item_number: Finally, all items need an item number, which is the essential
part that makes them unique in their questionnaire. This can either be a
running number throughout the whole questionnaire or a number within the
scale. Either way, it has to be ensured that each item name exists only once
in the whole study, also across multiple surveys. If you have more than one
survey, just add another column called survey.
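Whether every item name exists only once can be checked mechanically. The following is a minimal base R sketch; the codebook rows, including the deliberately duplicated name, are invented for illustration.

```r
# Hypothetical codebook with one name that was accidentally used twice.
codebook <- data.frame(
  survey = c("excb", "excb", "excb"),
  name   = c("excb_BFI_extra_2", "excb_BFI_consc_4", "excb_BFI_consc_4"),
  stringsAsFactors = FALSE
)

# duplicated() flags the second and later occurrences of a value.
dupes <- unique(codebook$name[duplicated(codebook$name)])

if (length(dupes) > 0) {
  message("Duplicated item names: ", paste(dupes, collapse = ", "))
}
```

Running such a check on the combined codebooks of all surveys, before data collection starts, catches name clashes while they are still cheap to fix.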
What a codebook contains:
Optional Columns (III)
- english_version: The English translation of the original item.
- does not need to be a psychometric translation, just so that everyone
can understand the item.
- comment: Sometimes, there just needs to be a comment. Just make a new
column for it.
Any additional information that is necessary to understand a unique item
has to go into the codebook. If, for example, the labels of the Likert-scale
response options are very important, add new columns, one for each answer
(e.g. 5 columns for a five-point scale).
Summary: Columns
| Column          | Information                                                                                        |
|-----------------|----------------------------------------------------------------------------------------------------|
| name            | The variable name as it is in the data frame                                                       |
| label           | The full version of the item                                                                       |
| coding          | The coding (reverse coded?)                                                                        |
| questionnaire   | The questionnaire the item belongs to                                                              |
| scale           | The scale the item belongs to                                                                      |
| range           | The range of the item                                                                              |
| further columns | Additional columns that uniquely identify the item. For example version or day in a daily diary study |
| item_number     | The number of the item (numeric)                                                                   |
| english_version | The English translation                                                                            |
| comment         | A comment, e.g. a link                                                                             |
Example codebook
Here is an example of a codebook for the Big Five Inventory, the first 5 rows:
| name | label | coding | survey | questionnaire | scale | range | item_number | english_version | comment |
|------|-------|--------|--------|---------------|-------|-------|-------------|-----------------|---------|
| excb_BFI_info_1 | ## Bitte beantworten Sie die folgenden Fragen. Ich bin jemand, der... | | excb | BFI | info | | 1 | ## Please answer the following questions. I am someone who ... | |
| excb_BFI_extra_2 | ... aus sich heraus geht, gesellig ist | 0 | excb | BFI | extra | 1 to 6 | 2 | ... is outgoing, sociable | |
| excb_BFI_agree_3 | … rücksichtsvoll und einfühlsam zu anderen ist | 0 | excb | BFI | agree | 1 to 6 | 3 | ... is taking care of others | |
| excb_BFI_consc_4 | … leicht ablenkbar ist, nicht bei der Sache bleibt | 1 | excb | BFI | consc | 1 to 6 | 4 | ... is easily distractable | |
| excb_BFI_neuro_5 | … sich viele Sorgen macht | 0 | excb | BFI | neuro | 1 to 6 | 5 | ... worries a lot | |
Codebook: Naming Variables
| Bad Names | Good Names |
|-----------|------------|
| Contain no information about the item | Contain a lot of information about the item |
| Do not follow a clear structure/system | Have a clear structure/the same system across all items |
| Contain more than one number | Contain only one number (there may be exceptions) |
| Contain different special characters | Contain only underscore |
| Examples: | Examples: |
| v 1, v.2, v:3 | consc_1, consc_2, consc_3 |
| item_2, alter, älter3 | bfi_w_one_c_1, bfi_w_two_c_1, bfi_w_three_c_1 |
| consc_2_v1, consc_1_var6 | demogr_age_1, demogr_gender_2, demogr_consent |
• Variable names don’t have to be very short – you never want to type them
anyways!
Making variable names
• If you fill each column appropriately, you can just copy and paste these
parts together to generate meaningful item-names:
| survey | questionnaire | scale | item_number | name             |
|--------|---------------|-------|-------------|------------------|
| excb   | BFI           | info  | 1           | excb_BFI_info_1  |
| excb   | BFI           | extra | 2           | excb_BFI_extra_2 |
| excb   | BFI           | agree | 3           | excb_BFI_agree_3 |
| excb   | BFI           | consc | 4           | excb_BFI_consc_4 |
| excb   | BFI           | neuro | 5           | excb_BFI_neuro_5 |
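# in a spreadsheet, join the cells that hold the parts (here: H2 to K2)
=""&H2&"_"&I2&"_"&J2&"_"&K2&""
# in R, use paste() or str_c() (the latter is from the stringr package)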
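In R, the same concatenation can be sketched with paste(). The column names below follow the codebook columns described in these slides; the data frame itself is a small illustrative reconstruction of the example rows.

```r
# Illustrative codebook holding only the parts that make up the name.
codebook <- data.frame(
  survey        = "excb",
  questionnaire = "BFI",
  scale         = c("info", "extra", "agree", "consc", "neuro"),
  item_number   = 1:5,
  stringsAsFactors = FALSE
)

# Join the parts with underscores, one name per row.
codebook$name <- paste(
  codebook$survey,
  codebook$questionnaire,
  codebook$scale,
  codebook$item_number,
  sep = "_"
)

print(codebook$name)
```

Generating the names this way, instead of typing them, keeps them in sync with the other columns whenever a scale label or item number changes.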
Working with the codebook
• Using such a system of names allows selecting items very quickly:
# select in R all items that measure conscientiousness
grep(x = names(data), pattern = "consc", value = TRUE)
• For SPSS: You can use the Excel file with drop-down menus to select a
specific subgroup of items.
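As an alternative to pattern matching on the names, items can also be selected through the codebook's scale column. This is a minimal sketch under the assumption that the codebook has name and scale columns as described above; the second conscientiousness item (excb_BFI_consc_9) and all data values are made up.

```r
# Hypothetical codebook and data for three items.
codebook <- data.frame(
  name  = c("excb_BFI_extra_2", "excb_BFI_consc_4", "excb_BFI_consc_9"),
  scale = c("extra", "consc", "consc"),
  stringsAsFactors = FALSE
)
data <- data.frame(
  excb_BFI_extra_2 = c(2, 4, 6),
  excb_BFI_consc_4 = c(1, 6, 3),
  excb_BFI_consc_9 = c(6, 1, 4)
)

# Look up the item names in the codebook, then subset the data frame.
consc_items <- codebook$name[codebook$scale == "consc"]
consc_data  <- data[, consc_items, drop = FALSE]

print(consc_items)
```

A lookup of this kind matches on exact scale labels, so it cannot accidentally pick up other variables whose names merely contain the same letters.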
R + Codebook + Data = All you need
• Further things you could do using R and this Codebook:
• recode all items directly from the codebook.
• generate all scales, reliabilities, measurement models, etc. directly from
the codebook.
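The two steps listed above might look roughly like this in base R. The sketch assumes a codebook with the columns name, coding (1 = reverse), scale, and range written as "1 to 6", as in the example codebook; the item excb_BFI_consc_9 and all data values are invented for illustration.

```r
# Hypothetical codebook and raw data.
codebook <- data.frame(
  name   = c("excb_BFI_consc_4", "excb_BFI_consc_9", "excb_BFI_neuro_5"),
  coding = c(1, 0, 0),
  scale  = c("consc", "consc", "neuro"),
  range  = "1 to 6",
  stringsAsFactors = FALSE
)
data <- data.frame(
  excb_BFI_consc_4 = c(1, 6, 3),
  excb_BFI_consc_9 = c(6, 1, 4),
  excb_BFI_neuro_5 = c(2, 5, 3)
)

# Step 1: reverse every item flagged with coding == 1.
# A reversed score is (minimum + maximum) - original score.
recoded <- data
for (i in which(codebook$coding == 1)) {
  item   <- codebook$name[i]
  bounds <- as.numeric(strsplit(codebook$range[i], " to ", fixed = TRUE)[[1]])
  recoded[[item]] <- sum(bounds) - data[[item]]
}

# Step 2: compute one mean score per scale listed in the codebook.
scales <- unique(codebook$scale)
scores <- sapply(scales, function(s) {
  items <- codebook$name[codebook$scale == s]
  rowMeans(recoded[, items, drop = FALSE], na.rm = TRUE)
})

print(scores)
```

Reliabilities and measurement models could be looped over the scale column in the same way, using whichever package you prefer for those analyses.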
Introduction: formr
• formr is a survey platform that turns your (extended) codebook into a
questionnaire (or your questionnaire into your codebook).
• www.formr.org
• Advantage of using formr:
• it is free
• it uses R + Rmarkdown to format the questionnaire/items
• it has a great support-feature (called Ruben & Cyril)
• it is very flexible
Online Material
• All slides and additional materials can be found here:
• https://osf.io/q3s4q/
• Thanks a lot!