Empirical Research in Economics: Growing up with R

Changyou Sun
Copyright © 2015 by Changyou Sun. All rights reserved. No part of this document may be
reproduced, translated, or distributed in whole or in part without the permission of the author.
Written by
Dr. Changyou Sun
Natural Resource Economics
Department of Forestry
Mississippi State University
Mississippi State, MS 39762, USA
Email: [email protected]
Published by
Pine Square LLC
105 Elm Place
Starkville, Mississippi 39759
USA
Cover designed by Chelbie Caitlyn Williamson
Printed in the United States of America
Printed in August 2015
First edition
Publisher’s Cataloging-In-Publication Data
(Prepared by The Donohue Group, Inc.)
Sun, Changyou, 1969 –
Empirical research in economics : growing up with R / Changyou Sun.
pages : illustrations ; cm
Includes bibliographical references and index.
ISBN: 978-0-9965854-0-8
1. Economics–Research–Methodology. 2. R (Computer program language)
3. Economic statistics–Computer programs. I. Title. II. Title: ERER
HB74.5 .S86 2015
330.0721
2015911715
Brief Contents

Preface  xvi

Part I  Introduction  1
  Chapter 1  Motivation, Objective, and Design  3
  Chapter 2  Getting Started with R  12

Part II  Economic Research and Design  29
  Chapter 3  Areas and Types of Economic Research  31
  Chapter 4  Anatomy on Empirical Studies  42
  Chapter 5  Proposal and Design for Empirical Studies  57
  Chapter 6  Reference and File Management  78

Part III  Programming as a Beginner  99
  Chapter 7  Sample Study A and Predefined R Functions  101
  Chapter 8  Data Input and Output  128
  Chapter 9  Manipulating Basic Objects  152
  Chapter 10  Manipulating Data Frames  180
  Chapter 11  Base R Graphics  214

Part IV  Programming as a Wrapper  251
  Chapter 12  Sample Study B and New R Functions  253
  Chapter 13  Flow Control Structure  269
  Chapter 14  Matrix and Linear Algebra  293
  Chapter 15  How to Write a Function  312
  Chapter 16  Advanced Graphics  359

Part V  Programming as a Contributor  393
  Chapter 17  Sample Study C and New R Packages  395
  Chapter 18  Contents of a New Package  412
  Chapter 19  Procedures for a New Package  428
  Chapter 20  Graphical User Interfaces  446

Part VI  Publishing a Manuscript  469
  Chapter 21  Manuscript Preparation  471
  Chapter 22  Peer Review on Research Manuscripts  486
  Chapter 23  A Clinic for Frequently Appearing Symptoms  503

Appendix A  Programs for R Graphics Show Boxes  513
Appendix B  Ordered Choice Model and R Loops  521
Appendix C  Data Analysis Project  532

References  541
List of Figures, Tables, and Programs  545
Index of Authors  549
Index of Subjects  551
Index of Commands  557
Contents

Preface  xvi

Part I  Introduction  1

Chapter 1  Motivation, Objective, and Design  3
  1.1  Inefficiency, reasons, and motivation  3
  1.2  Objective and design  5
    1.2.1  Principle A: Incrementalism and growing up by stage  5
    1.2.2  Principle B: Project-oriented learning with full samples  6
    1.2.3  Principle C: Reproducible research with programming  7
  1.3  Book structure  7
  1.4  How to use this book  8
    1.4.1  A guide for intensive use as a textbook  9
    1.4.2  A guide for self-study  9
    1.4.3  Materials and notations  10
  1.5  Exercises  11

Chapter 2  Getting Started with R  12
  2.1  Installation of base R, packages, and editors  12
    2.1.1  Base R  12
    2.1.2  Contributed R packages  14
    2.1.3  Alternative R editors  16
  2.2  Installation notes for this book  17
    2.2.1  Recommended installation steps  17
    2.2.2  Working directory and source()  18
    2.2.3  Possible installation problems  19
  2.3  The help system and features of R  20
  2.4  Playing with R like a clickor  22
  2.5  Exercises  27

Part II  Economic Research and Design  29

Chapter 3  Areas and Types of Economic Research  31
  3.1  Areas of economic research  31
  3.2  Theoretical studies  33
    3.2.1  Economic theories and thinking  33
    3.2.2  With or without quantitative models  34
    3.2.3  Structure of theoretical studies with models  34
  3.3  Empirical studies  35
    3.3.1  Common thread: y = f(x, β) + ε  35
    3.3.2  Data analyses by programming  36
    3.3.3  Challenges for empirical studies  37
  3.4  Review studies  38
    3.4.1  Features of review studies  38
    3.4.2  Reviewing a research issue  38
    3.4.3  Reviewing a statistical model  39
  3.5  Becoming an expert  40
  3.6  Exercises  41

Chapter 4  Anatomy on Empirical Studies  42
  4.1  A production approach  42
    4.1.1  Building a house as an analogy  42
    4.1.2  An author's and a reader's perspective to a paper  43
  4.2  Three versions of an empirical study  45
    4.2.1  Three roles, jobs, and outcomes  45
    4.2.2  Relations among the three versions  47
  4.3  Proposal version and design  49
    4.3.1  Idea quality versus presentation skills  49
    4.3.2  Idea sparkles  50
    4.3.3  Tips for literature search  50
    4.3.4  A one-page summary of a proposal  51
  4.4  Program version and structure  51
  4.5  Manuscript version and detail  53
  4.6  Practical steps for an empirical study  54
  4.7  Exercises  56

Chapter 5  Proposal and Design for Empirical Studies  57
  5.1  Fundamentals of proposal preparation  57
    5.1.1  Inputs needed for a great proposal  57
    5.1.2  Keywords in a proposal  58
    5.1.3  Two triangles, one focus, the mood, and the story  60
  5.2  Empirical study design for funding  62
    5.2.1  Understanding a request for proposal  62
    5.2.2  Setting up a proposal outline  63
    5.2.3  An unfunded sample proposal  64
    5.2.4  A funded sample proposal  67
  5.3  Empirical study design for publishing  69
    5.3.1  Design with survey data (Sun et al. 2007)  70
    5.3.2  Design with public data (Wan et al. 2010a)  72
    5.3.3  Design with challenging models (Sun 2011)  73
  5.4  Empirical study design for graduate degrees  74
    5.4.1  Phases: technician (master's) versus designer (PhD)  74
    5.4.2  Facing the constraints: time, money, and experience  75
  5.5  Summary of research design  76
  5.6  Exercises  77

Chapter 6  Reference and File Management  78
  6.1  Demand of literature management  78
    6.1.1  Timing for literature management  78
    6.1.2  How many papers?  79
    6.1.3  Tasks and goals  80
  6.2  Systematic solutions  81
    6.2.1  Principles for effective management  81
    6.2.2  Practical techniques for reference management  82
    6.2.3  Practical techniques for file management  83
    6.2.4  A comparison of EndNote and Mendeley  86
  6.3  Implementation by EndNote  87
    6.3.1  Managing references in EndNote  87
    6.3.2  Managing files in EndNote  91
  6.4  Implementation by Mendeley  93
    6.4.1  Managing references in Mendeley  94
    6.4.2  Managing files in Mendeley  95
  6.5  Exercises  96

Part III  Programming as a Beginner  99

Chapter 7  Sample Study A and Predefined R Functions  101
  7.1  Manuscript version for Sun et al. (2007)  101
  7.2  Statistics for a binary choice model  103
    7.2.1  Definitions and estimation methods  103
    7.2.2  Linear probability model  104
    7.2.3  Binary probit and logit models  106
  7.3  Estimating a binary choice model like a clickor  110
    7.3.1  A logit regression by the R Commander  110
    7.3.2  Characteristics of clickors  111
  7.4  Program version for Sun et al. (2007)  112
  7.5  Basic syntax of R language  116
    7.5.1  Sections and comment lines  116
    7.5.2  Paragraphs and command blocks  116
    7.5.3  Sentences and commands  117
    7.5.4  Words and object names  119
  7.6  Formatting an R program  120
    7.6.1  Comments and an executable program  121
    7.6.2  Line width and breaks  123
    7.6.3  Spaces and indention  124
    7.6.4  A checklist for R program formatting  124
  7.7  Road map: using predefined functions (Part III)  125
  7.8  Exercises  126

Chapter 8  Data Input and Output  128
  8.1  Objects in R  128
    8.1.1  Object attributes  128
    8.1.2  Object types commonly used in R  131
    8.1.3  Object creation, predicate, and coercion  132
  8.2  Calling an R function  135
  8.3  Subscripts, flow control, and new functions  137
  8.4  Data inputs and creation  139
    8.4.1  Manual data inputs in R  139
    8.4.2  Sample data in R packages  141
    8.4.3  Simulation data in R  142
    8.4.4  Reading external data  144
  8.5  Data outputs  147
  8.6  Exercises  150

Chapter 9  Manipulating Basic Objects  152
  9.1  R operators  152
  9.2  Character strings  155
    9.2.1  Frequently used functions  155
    9.2.2  Special meaning of a character  160
    9.2.3  Application: character string manipulation  164
  9.3  Factors  166
  9.4  Date and time  169
  9.5  Time series  172
  9.6  Formulas  177
  9.7  Exercises  179

Chapter 10  Manipulating Data Frames  180
  10.1  Subscripting and indexing in R  180
  10.2  Common tasks for data frame objects  184
  10.3  Summary statistics of data frames  193
    10.3.1  Quick summarization by row or column  193
    10.3.2  Contingency and pivot tables  194
    10.3.3  The apply() family  198
  10.4  Application: estimating a binary choice model  201
  10.5  Exercises  204

Chapter 11  Base R Graphics  214
  11.1  A bird view of the graphics system  214
  11.2  Your paint: preparation of plotting data  216
  11.3  Your canvas: graphics devices  217
    11.3.1  Screen versus file devices  217
    11.3.2  Setting graphics parameters by par()  221
    11.3.3  Many graphs on multiple pages or files  223
    11.3.4  Many graphs on a single page  226
  11.4  Your big and small brushes: plotting functions  229
  11.5  Region, coordinate, clipping, and overlaying  232
    11.5.1  Regions and margins  233
    11.5.2  Coordinate system and clipping  234
    11.5.3  Overlaying, axis, and legend  235
  11.6  Application on customizing a default graph  238
  11.7  Application on creating a diagram  240
    11.7.1  Drawing demand and supply curves  241
    11.7.2  A diagram for an author's work and a reader's memory  243
  11.8  Summary: using predefined functions (Part III)  246
  11.9  Exercises  247

Part IV  Programming as a Wrapper  251

Chapter 12  Sample Study B and New R Functions  253
  12.1  Manuscript version for Wan et al. (2010a)  253
  12.2  Statistics: AIDS model  255
    12.2.1  Static and dynamic models  255
    12.2.2  Implementation: construction of restriction matrices  256
    12.2.3  Implementation: estimation by generalized least square  257
    12.2.4  Implementation: calculation of demand elasticities  259
  12.3  Program version for Wan et al. (2010a)  260
  12.4  Needs for user-defined functions  264
  12.5  Road map: how to write new functions (Part IV)  268

Chapter 13  Flow Control Structure  269
  13.1  Conditional statements  269
    13.1.1  Branching with if  269
    13.1.2  The ifelse() function  271
    13.1.3  The switch() function  272
  13.2  Looping statements  275
    13.2.1  for, while, and repeat loops  275
    13.2.2  Constructing a looping structure  278
  13.3  Additional functions for flow control  281
  13.4  Application: restrictions on the AIDS model  283
  13.5  Application: elasticities for the AIDS model  287
  13.6  Exercises  290

Chapter 14  Matrix and Linear Algebra  293
  14.1  Matrix creation and subscripts  293
    14.1.1  Creating a matrix  293
    14.1.2  Subscripting a matrix  295
  14.2  Matrix operation and linear algebra  298
  14.3  Application: ordinary least square  301
  14.4  Application: generalized least square  304
  14.5  Application: marginal effects for a binary model  307
  14.6  Exercises  309

Chapter 15  How to Write a Function  312
  15.1  Function structure  312
    15.1.1  Main components  312
    15.1.2  Function arguments  315
    15.1.3  Exporting outputs from a function  319
    15.1.4  Loading and attaching a function  322
  15.2  Function environment  322
    15.2.1  Basic concepts  322
    15.2.2  Applications  323
  15.3  Object-oriented programming  329
    15.3.1  A quick start: why and when do we need S3 or S4?  329
    15.3.2  S3  332
    15.3.3  S4  336
    15.3.4  S3 versus S4  339
  15.4  Practical steps for writing a function  340
  15.5  Application: one-dimensional optimization  342
  15.6  Application: multi-dimensional optimization  344
  15.7  Application: static AIDS model by aiStaFit()  347
  15.8  Exercises  349

Chapter 16  Advanced Graphics  359
  16.1  R graphics engine and systems  359
  16.2  The grid system  360
    16.2.1  A comparison of traditional graphics and grid  361
    16.2.2  Viewports  362
    16.2.3  Low-level plotting functions  364
    16.2.4  Graphics parameters and coordinate systems  366
    16.2.5  Application: customizing a default graph by grid  369
    16.2.6  Application: creating a diagram by grid  371
  16.3  The ggplot2 package  374
    16.3.1  A comparison of traditional graphics and ggplot2  374
    16.3.2  Geom, aesthetic, and mapping  376
    16.3.3  Stat, position, scale, and facet  377
    16.3.4  Themes  378
    16.3.5  Application: the graph in Sun et al. (2007) by ggplot2  379
    16.3.6  Application: the graph in Wan et al. (2010a) by ggplot2  381
  16.4  Spatial data and maps  385
    16.4.1  Application: the fire map in Sun and Tolver (2012)  386
  16.5  Summary: how to write new functions (Part IV)  390
  16.6  Exercises  391

Part V  Programming as a Contributor  393

Chapter 17  Sample Study C and New R Packages  395
  17.1  Manuscript version for Sun (2011)  395
  17.2  Statistics: threshold cointegration and APT  397
    17.2.1  Linear cointegration analysis  397
    17.2.2  Threshold cointegration analysis  398
    17.2.3  Asymmetric error correction model  399
  17.3  Needs for a new package  400
  17.4  Program version for Sun (2011)  401
    17.4.1  Program for tables  402
    17.4.2  Program for figures  407
  17.5  Road map: developing a package and GUI (Part V)  410
  17.6  Exercises  411

Chapter 18  Contents of a New Package  412
  18.1  The decision of a new package  412
    18.1.1  Costs and benefits  412
    18.1.2  Validating the need of a new package  414
  18.2  What are inside a package?  415
    18.2.1  General considerations  415
    18.2.2  Example: contents of the apt package  416
  18.3  Debugging  418
    18.3.1  Bugs and browsing status in R  418
    18.3.2  Debugging without special tools  419
    18.3.3  Special tools with source code changes  421
    18.3.4  Special tools without source code changes  421
  18.4  Time and memory  424
  18.5  Exercises  427

Chapter 19  Procedures for a New Package  428
  19.1  An overview of procedures  428
  19.2  Skeleton stage  429
    19.2.1  The folders  429
    19.2.2  Help manuals  432
    19.2.3  Other files  435
  19.3  Compilation stage  440
    19.3.1  Required tools  440
    19.3.2  Four steps: build, check, install, and test  442
  19.4  Distribution stage  444
  19.5  Exercises  445

Chapter 20  Graphical User Interfaces  446
  20.1  Transition from base R graphics to GUIs  446
    20.1.1  Packages and installation  448
  20.2  GUIs in R and the gWidgets package  450
    20.2.1  Two simple examples  450
    20.2.2  GUI structure in R  452
    20.2.3  Main concepts in the gWidgets package  453
    20.2.4  Container widgets and layouts  454
    20.2.5  Control widgets  455
  20.3  Application: correlation between two variables  456
  20.4  Application: a GUI for the apt package  460
  20.5  Summary: developing a package and GUI (Part V)  465
  20.6  Exercises  465

Part VI  Publishing a Manuscript  469

Chapter 21  Manuscript Preparation  471
  21.1  Writing for scientific research  471
  21.2  An outline sample for Wan et al. (2010a)  472
  21.3  Outline construction by section  475
    21.3.1  Manuscript space allocation and sequence  475
    21.3.2  Key sections: methodology and results  476
    21.3.3  Background sections: a review of market, literature, or issue  477
    21.3.4  Wrap-up sections: summary, conclusion, or discussion  478
    21.3.5  Decoration sections: introduction and abstract  479
  21.4  Detail for a manuscript  479
    21.4.1  Knitting a paragraph as a net  480
    21.4.2  Sentence preparation  481
    21.4.3  Getting formats right  482
  21.5  Writing styles for a manuscript  482
    21.5.1  Explanation: common sense or not  483
    21.5.2  Explanation: the error term and beyond  483
    21.5.3  Rhythm: long versus short  484
    21.5.4  Affirmative style  484
    21.5.5  Do not invite questions  485
  21.6  Exercises  485

Chapter 22  Peer Review on Research Manuscripts  486
  22.1  An inherently negative process  486
  22.2  Marketing skills and typical review comments  487
    22.2.1  Marketing skills  487
    22.2.2  Mismatch between a manuscript and a journal  489
    22.2.3  Limited contributions with the current design  489
    22.2.4  Insufficient or inappropriate analyses  490
    22.2.5  Poor writing and details  491
    22.2.6  Random errors and bad lucks  491
  22.3  Peer review comments for Wan et al. (2010a)  492
    22.3.1  Comments from referee A in November 2009  492
    22.3.2  Comments from referee B in November 2009  493
    22.3.3  Comments from referee A in February 2010  494
    22.3.4  Assessment on referee comments  496
  22.4  Responses to the comments for Wan et al. (2010a)  497
    22.4.1  Responses to referee A in November 2009  497
    22.4.2  Steps and strategies for preparing responses  501
  22.5  Summary of a peer review  502

Chapter 23  A Clinic for Frequently Appearing Symptoms  503
  23.1  Symptoms related to study design and outline  503
  23.2  Symptoms related to R programming  507
  23.3  Symptoms related to manuscript details  509
  23.4  Final words  511

Appendix A  Programs for R Graphics Show Boxes  513

Appendix B  Ordered Choice Model and R Loops  521
  B.1  Some math fundamentals  521
  B.2  Predicted probability for ordered choice model  523
  B.3  Marginal effect for ordered choice model  527

Appendix C  Data Analysis Project  532
  C.1  Roll call analysis of voting records (Sun, 2006a)  532
  C.2  Ordered probit model on law reform (Sun, 2006b)  535
  C.3  Event analysis of ESA (Sun and Liao, 2011)  538

References  541

List of Figures, Tables, and Programs  545
  Figures  545
  Tables  546
  Programs  546

Index of Authors  549
Index of Subjects  551
Index of Commands  557
Preface
The major motivation behind this book is that there has been a lack of a systematic approach to teaching students how to conduct empirical studies. Graduate students in economics often spend considerable time taking a number of courses in economics, statistics, or econometrics. After intensive course work, however, they may still feel frustrated or inefficient when working on their own projects. Thus, there has been a critical need for students to integrate various techniques into applied scientific research.
The goal of the book is that, after going through the process prescribed here, a graduate student can finish a typical empirical study in economics within a reasonable period, with an ultimate target of four months or less. To achieve this goal, both research methods and statistical programming are presented. Instead of using small, fragmented data sets and examples, several complete sample studies are employed in the book to demonstrate how to design and conduct typical empirical analyses in economics. The software used is R, a powerful and flexible language for statistical computing and graphics.
This book is highly structured, following the typical process of conducting an empirical study. It is not intended to be an econometrics book, so statistics is covered only as needed. It differs from typical books on research methodology on the market because it provides detailed guidance on how to conduct data analyses. It also differs from many existing R books that focus on how to use R as a tool. In using this book, students need to develop a deep understanding of the sample papers. In learning R and programming, students need to run the sample programs included in the book. This is critical because many statistical techniques become self-evident once students play with real data sets and code.
I am sincerely grateful for all the help and support I have received while working on this book in recent years. In particular, earlier versions of the book were used in several workshops or courses at Beijing Forestry University, Auburn University, and Mississippi State University. Many improvements were based on feedback from the attendees. The accompanying R packages (i.e., erer and apt) have been used worldwide, and I have received a large number of comments from users I have never met. My graduate students, Fan Zhang, Zhuo Ning, and Prativa Shrestha, read the book several times (probably more than they liked) and provided valuable comments. Finally, the whole book was prepared with LaTeX 2ε, a high-quality typesetting system with strong support from many online forums.
Empirical studies in economics have become more revealing and rewarding with modern software. I hope you will enjoy this type of study with the help of this book.
C. Sun
August 2015
Part I  Introduction
Part I Introduction: Two chapters are used to introduce the whole book and the software R. This will help readers understand the structure and design of the book and install the software R for the data analyses to come.
Chapter 1 Motivation, Objective, and Design (pages 3 – 11): The motivation, objective, and
design of the book are presented. Several principles (i.e., incrementalism, project orientation,
and reproducibility) are adopted in designing the book. Some user guides are provided at
the end.
Chapter 2 Getting Started with R (pages 12 – 27): How to install the base R, contributed
packages, and an R editor is described first. Then the R help system and the differences
between R and commercial software products are discussed.
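The installation steps themselves are covered in Chapter 2. As a rough preview, and assuming an internet connection and the default CRAN mirror, installing and loading the two companion packages of this book looks roughly like the sketch below; the exact package list is up to the reader.

    # one-time installation of the companion packages from CRAN
    install.packages(c("erer", "apt"))

    # load them at the start of each R session
    library(erer)
    library(apt)

    # list the attached packages and their versions
    sessionInfo()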
R Graphics • Show Box 1 • A world map with two countries highlighted
R has rich functions for drawing maps. See Program A.1 on page 513 for detail.
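Program A.1 in Appendix A contains the book's own code for this show box. Purely as an illustrative sketch of the idea, a world map with two highlighted countries can also be drawn with the contributed maps package; the country names, colors, and title below are arbitrary choices, not the book's own program.

    library(maps)    # contributed package with world map data

    # draw the base map, then overlay the two highlighted countries
    map("world", col = "grey85", fill = TRUE, border = "white")
    map("world", regions = c("USA", "China"), col = "red",
        fill = TRUE, add = TRUE)
    title("A world map with two countries highlighted")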
Chapter 1  Motivation, Objective, and Design
At present, the way of teaching and learning in economics is inadequate in integrating course work (e.g., statistics and economics), research methodology, and software usage for specific applied analyses. This has provided me with strong motivation for preparing this book. The overall objective is to help students conduct an empirical study over a reasonable period, e.g., four months. In achieving that, several complete sample papers in economics are used, the learning process is divided into several stages (i.e., clickor, beginner, wrapper, and contributor), and the software R is adopted for all data analyses. Some study guides are presented at the end.
1.1  Inefficiency, reasons, and motivation
As a professor, I have had the privilege of mentoring graduate students in natural resource
economics for years. In a typical graduate program, students take a number of courses
related to economics, statistics, econometrics, and research methods. At the same time,
students need to finish a research project or thesis to receive a graduate degree at the end.
In general, the research component of a graduate program is much more challenging than
the course work. In working on a research project, it is common for students to feel excited
at the beginning, become frustrated gradually, and even fail after a few years. For those students who do complete a research project successfully for a degree, the output is often of a low or unpublishable quality. This observation has been repeatedly confirmed in mentoring
my own students, attending the defense seminars by graduate students in our department,
and listening to student presentations at professional meetings.
Why are graduate students inefficient in conducting scientific research? Apparently, the
answer can differ by person or environment. At one time, I thought it might be that the
studies conducted by graduate students were too difficult. This can be true for some individuals who have produced high-quality theses. However, after being in this profession for so many years, I feel that on average most studies conducted by graduate students place low to moderate demands on study design, statistical analysis, and writing. Using much less time
(e.g., four months), established professionals can conduct many of these types of studies
with a better quality, often publishable in a refereed journal.
For quite a while in mentoring my students, I also thought that the weak background
and qualification of an individual student might be the main reason behind the low research
productivity. I know a number of graduate students in our department who did not take any
calculus courses or linear algebra before they started a graduate program in economics here.
Those students had a very difficult time in taking graduate courses (e.g., microeconomics).
Based on these experiences, I have spent a considerable amount of time in recruiting and
evaluating graduate applicants in recent years. Those without basic qualifications have been
filtered away and rejected at the beginning. This is beneficial for both the graduate school
and applicants because valuable resources and time can be saved.
Still, the widespread inefficiency in scientific research by graduate students I have observed cannot be completely explained by the low qualifications of some individual students.
I believe, or have the faith as a professional, that graduate students in economics as a group
have appropriate qualifications to finish their graduate studies and grow up as a new generation of economists. Thus, I have been looking for some common reasons behind the low
research productivity of graduate students. After many years of observing and reflecting, my
conclusion is that the prevailing way of teaching and learning lacks integration among course
work, research methods, and software usage. This is especially true as many economic research projects today involve heavy data analyses.
Specifically, courses in economics and statistics have limited coverage of how to apply
theoretical and empirical models in applied studies. Some courses may have a reading list
of over 50 journal articles. But reading published articles is different from doing a similar
study. This is analogous to the relation between reading many novels and writing a novel;
they are relevant but fundamentally different activities. A graduate program is supposed
to produce writers and creators, not readers only. Furthermore, both introductory and advanced econometric courses cover many quantitative models (e.g., ordinary least squares in scalar or matrix algebra). Then, students are told that they just need to use a software application like EViews or LIMDEP and push a button to get a regression done. For many graduate students, there has been a big gap and black box between the formulas in textbooks and the regression output on a computer screen.
As another contributing factor, published books about software usage offer limited advice
for using software applications for specific scientific studies. In most cases, they are just like a manual for a lawn mower. A husband with limited knowledge of landscaping purchases a powerful mower with a detailed manual, and then believes that he can show off a beautiful lawn to his wife and kids in a few days. I was exactly like that when we purchased our first house: my wife delegated all the authority and responsibility for yard work to me, and I started working on our yard for the first time. In addition, many software books do not
follow basic principles of learning. For example, at the very beginning (e.g., an introduction
chapter), one book for software R presents detailed steps of how to create a package in R.
However, most R users do not write a package at all in the first few years. If there is such a
need, it is often after a student has learned the basics of R through several applied analyses.
Finally, most software books use small and fragmented examples in demonstrating how to
use existing functionality. There is a lack of systematic presentation of how to use a software
application to conduct a study from the beginning to the end.
Scientific research methodology has been the subject of a number of published books.
A search through online bookstores can generate several dozen books in this category.
They often cover principles related to scientific research, such as literature review, study
design, or grantsmanship. However, they generally do not cover specific economic models or
software usage. As a result, they offer some limited advice to students in conducting specific
empirical studies. A big gap still exists between the principles prescribed in methodology
books and a real research project in economics.
To summarize, I believe that the lack of integration among course work, software usage,
and research methods is the main reason behind the low research productivity of students in applied economics. After reading many books on economics, statistics, software applications, and research methods, students may still experience significant difficulties in getting started with their research projects or finishing them on time. Therefore, there has been a critical need to combine these components in mentoring students for their own empirical studies.

Table 1.1 Three versions of an empirical study

  Version      Focus      Description
  Proposal     Idea       Soul, design, and guide of an empirical study
  Program      Structure  Results in tables and figures from data analyses by software
  Manuscript   Detail     Final publication with all details
1.2  Objective and design
The objective of this book is to teach young professionals how to conduct an empirical
study in economics over a reasonable period, with the expectation of four months or less in
general. Based on my mentoring experience, this is highly achievable if students follow the
methods presented in this book, work through the exercises diligently, and go through the
training completely.
To achieve the objective, I have designed and prepared the book with three principles
or considerations. They are all contained in, or implied by, the title of this book: (A) Incrementalism and growing up by stage; (B) Project-oriented learning with complete samples;
(C) Reproducible empirical research in economics with R programming.
1.2.1  Principle A: Incrementalism and growing up by stage
The first, and perhaps the most important, principle in my research philosophy is incrementalism. Scientific research in economics has become more challenging as our economic system
has become more sophisticated. For a specific area in economics, conducting an empirical
study also involves many steps in a long process. Thus, we need to be realistic and take an
incremental approach in learning how to conduct empirical studies. Incrementalism allows
us to build up our skills and grow up gradually in the long term. For an individual project
and in the short term, incrementalism can be implemented through itemization. This is
composed of dividing a big task into small pieces, getting organized and focusing on one piece at a time, and assembling all outputs together at the end.
Specifically, the principle of incrementalism is inherent in the whole book and reflected
in two aspects. First of all, the long production process of an empirical study is divided into
three stages: proposal, program, and manuscript, as presented in Table 1.1 Three versions
of an empirical study. The proposal version of an empirical study provides the guide for the
whole study. The program version, often composed of a computer language like R, generates
final tables and graphs for the study. The manuscript version presents the study to readers.
Furthermore, the programming work behind a program version is divided into four stages:
clickor, beginner, wrapper, and contributor. A clicker is a device that makes a clicking sound,
usually when activated by a person on purpose. It seems that the word clickor does not exist
at present, so I use it here to refer to a person who clicks a device such as a computer mouse. As clickors, students just click pull-down menus in a software application to
conduct data analyses. The other stages are all related to programming with increasing
demand on skills. A beginner uses predefined functions in a software application like R in
a very basic way. The gain for a beginner over a clickor is small but a beginner is on the
track of growing up. A wrapper uses predefined functions extensively and begins to write
some new functions. A contributor writes a large number of new functions, formats them
into a package, and thus extends the software to allow himself and others to do more work.
A researcher can move along the four related stages over time and grow up incrementally.
1.2.2  Principle B: Project-oriented learning with full samples
The second principle is to adopt a project-oriented learning approach. Many graduate students do not conduct any research until the third or even fourth year of their graduate
programs. They prefer to have solid training through courses first before they start a
project seriously. However, I believe no time is better than now. The best way of getting
started in economic research is to have a real research project and learn from the process
as soon as possible. A project is the best way to organize our mind and inspire us for more
exploration in a related area.
Following this principle, three publications are adopted as sample papers in this book
(i.e., Sun et al., 2007; Wan et al., 2010a; Sun, 2011). The selection of these sample papers
is purely personal as I am the author or coauthor of them and have all the raw data. A
similar set of papers can provide needed support to achieve the book objective too. These
sample papers are used to demonstrate the whole research process: generating a research
idea, collecting data, estimating an empirical model, and finally writing a manuscript. This
will give students a complete, instead of fragmented, picture of empirical studies. These
three sample papers and the related data will be used as much as possible in the book.
Specifically, the first sample study is Sun et al. (2007). A binary logit regression is
employed to analyze the decision of liability insurance purchase among sportspersons in
Mississippi. This study is used for learning fundamental steps for empirical studies and R
programming skills. The second sample study by Wan et al. (2010a) is a demand system
model for import competition of wooden beds in the United States. This is used to demonstrate how to write new functions in working with a more complicated economic study. The
last sample study by Sun (2011) is about asymmetric price transmission between two major
wooden bed suppliers in the United States. Through working on Sun (2011), an R package
called apt was developed and published.
In comparison, Sun et al. (2007) is easier than Wan et al. (2010a), and in turn, Wan et al.
(2010a) is easier than Sun (2011). By model, the binary logit regression employed by Sun
et al. (2007) has become a standard model in any introductory book of econometrics. The
almost ideal demand system (AIDS) model and linear cointegration analyses in Wan et al.
(2010a) are a little more complicated, as the study involves static and dynamic models for a group of supplying countries. The threshold cointegration analysis employed in Sun (2011) has been developed mainly in the past 20 years. By data, Sun et al. (2007) uses a survey output with typical features of cross-sectional data. The other two studies use time series
data, and in particular, data manipulation for a group of countries in Wan et al. (2010a)
presents a great opportunity to learn relevant skills. Overall, this set of papers, as carefully
selected, will allow us to learn research methods and programming skills simultaneously and
build up various skills incrementally.
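For readers who have never seen a binary choice model in code, the generic call below shows what the simplest of the three models looks like in R; the data frame and variable names are placeholders, not the variables of Sun et al. (2007), whose actual program version is developed in Part III.

    # a generic binary logit regression with glm(); mydata, buy, age, and
    # income are placeholder names used only for illustration
    fit <- glm(buy ~ age + income, family = binomial(link = "logit"),
               data = mydata)
    summary(fit)    # coefficients, standard errors, and z values
    coef(fit)       # estimated coefficients only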
In addition to the three sample papers used explicitly in the main text, another three
sample papers are also selected for exercises in the book (Sun, 2006a,b; Sun and Liao, 2011).
Raw data for these papers are available to readers. Thus, the sample studies in the main
text can be followed and the studies in the exercises can be reproduced too. Briefly, Sun
(2006a) is a roll call analysis of the voting records for a bill related to forest management in
2003. The model used is a binary logit regression for roll calls, which is popular in political
science. Sun (2006b) is an analysis of statutory reforms on landowners’ liability in using
prescribed fires on forestland. The model is an ordered probit model. Sun and Liao (2011) is
an assessment of the effects of litigation under the Endangered Species Act on the values of
public forest firms. The model employed in this study is a typical event analysis in finance.
Along the same line of project-oriented learning, most R sample codes in this book are
organized in a list format and they focus on a specific issue. In contrast, many published
R books embed short sample codes in the main text or use very small examples. By using
complete sample papers or focusing on a selected issue (e.g., character string manipulation),
the R sample codes in this book present more complete pictures for a focal issue. The
drawback, if any, is that readers will probably need to run the sample codes on a computer while reading to gain a deep understanding of the description in the main text.
1.2.3  Principle C: Reproducible research with programming
The software R is adopted in conducting all statistical analyses in this book. I have used
a number of commercial software products in the past for empirical studies. Each of them
is advantageous in some aspects, such as handling large data sets or truncated dependent
variables. Thus, it is hard for one software product to dominate others completely. However,
it is generally not practical for a researcher to purchase many statistical software products,
or even if affordable, to become an expert in each of them. The realistic approach is to
choose one software product as the main tool and use others as supplements.
R is a free but powerful software product for computation and graphics. R is better
than many commercial software applications, based on my own experience. The main advantage of R is its open source code and online community for help and learning. This allows a much deeper understanding of statistical analyses, a feeling that I have never had when using commercial software products with proprietary procedures and code. In addition, R is a programming language. Programming doubles or triples productivity over clicking menus in many cases. Finally, for many even moderately sophisticated econometric models used today, programming has become a must. This will be demonstrated in the
sample study of Sun (2011).
For every sample study used in this book, a program version is included. All tables and
figures in the published version can be reproduced in a few minutes. For the sample papers
adopted for exercises, their program versions are available to class instructors. Reproducibility through programming will greatly facilitate learning of empirical studies by students. By
explicitly documenting research steps through R programming, a researcher can build up
skills step by step. After a few years and through several studies, one can improve
research efficiency greatly.
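As a rough sketch of what such a program version can look like, the skeleton below documents the typical steps from raw data to a saved table; the file names and the regression are placeholders rather than the actual programs used for the sample papers.

    # skeleton.r: a minimal reproducible program version (illustrative only)
    library(erer)                        # companion package for this book

    dat <- read.csv("survey_raw.csv")    # placeholder raw data file
    dat <- na.omit(dat)                  # a simple data cleaning step

    fit <- glm(y ~ x1 + x2, family = binomial, data = dat)  # placeholder model
    out <- summary(fit)$coefficients     # coefficient table

    write.csv(out, file = "table1.csv")  # export a table for the manuscript

Running source("skeleton.r") then regenerates the table from the raw data in a single step.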
1.3  Book structure
The book has two major components: research methods and statistical programming. Some
statistics and econometrics are covered to help understand the sample papers, but they are
not the focus. As a result, approximately 40% of the book contents are for research methods,
50% are for R and programming, and 10% are for statistics.
There are six parts in total, as shown in Figure 1.1 The structure of the book. Part I
Introduction contains an overview of the whole book and a brief introduction of R. Part II
Economic Research and Design is devoted to research design completely, including study
design and reference management. Reference management is demonstrated through the reference software applications of EndNote and Mendeley. Part VI Publishing a Manuscript is about writing a manuscript and submitting it for publication. Furthermore, in presenting programming skills, research methods are emphasized in various places. The three selected sample papers are utilized extensively in the presentation. In addition, several publications related to the selected sample studies are also analyzed to elaborate research design techniques.

Figure 1.1 The structure of the book: an empirical study (I) and its three versions, namely the proposal version (II), the program version written as a clickor/beginner (III), wrapper (IV), or contributor (V), and the manuscript version (VI)
The three parts in the middle correspond to three main stages in the learning process:
programming as a beginner, wrapper, and contributor. The chapters are organized to present
typical learning needs and research techniques. Part III Programming as a Beginner is
mainly about how to use predefined functions in R for data manipulation, base graphics,
and regression analyses. Part IV Programming as a Wrapper covers how to extend the
existing features by writing new functions. Part V Programming as a Contributor moves
along the spectrum further by focusing on how to create a new package for one’s own use
or public sharing.
Since R as a language is powerful but sophisticated, it is difficult, if not completely impossible, to cover all details. Thus, the main text in the three parts related to programming will highlight key techniques, especially those widely adopted in practical research. In a few
chapters, many sample R codes are presented so students can learn from these samples
quickly. The arrangement also allows readers to explore further by following these sample
programs, or to copy them for their own projects.
1.4  How to use this book
The book can be used either as a textbook for intensive work over a short period, or alternatively, as a casual self-study book. Each way has its pros and cons. By analogy, this is
similar to the two ends of a swimming pool. Before diving into the water, a swimmer should
always make sure of the location, so he will not drown in too much water or break his neck in too little water.
1.4.1  A guide for intensive use as a textbook
In using this book as a textbook, I offer a three-credit-hour course over a 16-week semester. In total, the course has 32 meetings, 1.5 hours each, and 48 contact hours. Outside
the classroom, students are expected to spend another 150 hours to finish exercises and data
analysis projects. In sum, a total of 200 hours over a few months is needed for intensive training.
More specifically, there may be a need to reallocate the lecturing time between the two main
components of the book: research methods and programming. In the book, these materials are
arranged to follow the typical research flow, which is necessary for presentation clarity. In
lecturing, however, they may be adjusted because of the fundamental characteristics of these
materials. In general, research methodology is relatively easy to understand, but hard to
translate into changes in our research habits and behavior. Presenting research methods alone
for an extended period (e.g., a month) can also be boring and unfruitful. Teaching research
methodology also requires intensive mentoring from instructors, often in the form of critical
or even negative comments. In contrast, materials for data analyses and programming are
more appropriate for self-learning, because the results can be checked against standard answers,
expectations, or common sense. Thus, the time for research methods can be less than
four weeks at the first coverage, and these methods can then be reemphasized in the programming
part with sample papers. Mixing these two components to a certain degree in lecturing can
better engage students in the classroom. If needed, the sequence can also be rearranged;
R programming can be covered first, and research design and manuscript preparation are
covered in the end.
In offering the class, I ask students to finish many small assignments and one big data
analysis project during the semester. The assignments are based on exercises listed in the
book. The data analysis project is based on one of the three sample papers prepared for
exercises exclusively (i.e., Sun, 2006a,b; Sun and Liao, 2011). As a result, students are
required to duplicate the published tables and figures by R programming in one selected study.
Data for the three studies are available to students in two formats: raw data as Microsoft
Excel files and the final data used for regression in R data format. All of them are included
in the erer library. Tables and graphs in the published versions of these papers should be
reproduced in the end.
There are some considerations in using the three sample studies for exercises. In general,
it is not recommended that students start a completely new research project to meet
the requirement of the data analysis project. This is because a student's study design probably
does not work, the student cannot collect the needed data in a short period, or the model
employed is too complicated to be analyzed in a semester. By using a published paper as the
base for the exercises, the general principle of confirmation in learning is followed. Students
are expected to reproduce one of the selected papers by stage, with the answers known in
advance.
1.4.2
A guide for self-study
A buffet is a type of meal that allows diners to choose food items, quantities, and sequences
freely. With a fixed amount of payment, diners can eat as much as they like. The downside
is that the appetite of a diner can be easily ruined by having one pound of ice cream at the
beginning, after which he complains about the food quality for the rest of his stay.
Reading a textbook without any guide is like having a buffet dinner. This book is highly
structured. This means that many materials are covered sequentially, and sometimes less
elegantly at first glance. Thus, a casual or impatient reading of the book can be frustrating
if one does not have enough time or discipline in following the main structure, or
just wants to use this book as a dictionary. To have an efficient self-study on a large book,
discipline, patience, and time are all necessary inputs.
A few students have told me the book seems great, but the problem is that it needs
careful reading. When I first heard that, I just ignored it completely, because there is no
good book in the world that does not require committed reading. However, after hearing
the above comment several times, I have gradually realized that casual reading of materials
on the Internet or on mobile electronic devices may have changed the learning habits of many
people. Regardless of the trend, it should be emphasized that this book is prepared like a
traditional textbook. Materials are connected and presented in a logical way. For many key
points highlighted in the text, students need to read them several times, reflect for a while,
and reinforce the learning through exercises.
For self-reading of this book, it is highly possible that a reader will skim the parts for
research methods (i.e., Parts II and VI) casually. This is reasonable because these materials
are easy to understand. However, without any mentoring, it is difficult to improve one’s
skills for either research design or manuscript preparation. Still, I urge students to digest
these research methods and incorporate them into their own research activities.
The three parts of R programming can be read relatively more independently from other
materials. Readers are strongly encouraged to have a good understanding of the individual
sample papers employed. This is an extra cost in comparison to reading those published R
books in the market that rely on fragmented samples. The benefit of combining complete sample
papers and R programming shall become evident to readers in the process.
For readers who need a quick answer to a programming problem within three minutes,
note there are many R code programs and an extensive index of R functions included in the
book. The availability of these materials is the result of many conscious efforts in making
the book more convenient to users. Thus, check the list of indexes and R code programs
when it is needed.
1.4.3
Materials and notations
To students, all the sample programs listed in this book are available individually. They
are included in the erer library and should be copied to a local drive for exercises. How
to install the package and copy the materials is explained in Section 2.2.1 Recommended
installation steps on page 17. In addition, the three sample studies (i.e., Sun et al., 2007;
Wan et al., 2010a; Sun, 2011) for the main text and the three sample studies for exercises
(i.e., Sun, 2006a,b; Sun and Liao, 2011) are all published journal articles. If a user has no
access to any of them, I can share my personal copy for educational purposes by email.
The following materials are available to class instructors only. A sample syllabus is
available with detailed course design for a typical offering in a semester. A total of about
800 slides for 28 lectures are prepared in LaTeX. They are all in PDF and available in several
layouts: one slide per page, six slides per page, and three slides with note areas per page.
For each of the three sample studies for exercises (i.e., Sun, 2006a,b; Sun and Liao, 2011), a
program version is prepared and available. In addition, the answers to all the exercises in
this book are also available to instructors.
Several formats have been adopted consistently, with the sole objective of increasing
the usability of the book. Without these special formats, the book would look too plain.
However, I have been conservative and cautious in using any fancy feature because excessive
use may become distracting to readers. Specifically, for math notations, the rules promoted
in Abadir and Magnus (2005) are adopted. For example, ϕ denotes a scalar function, f a
vector function, and F a matrix function. Similarly, x denotes a scalar argument, x a vector
argument, and X a matrix argument.
The titles of all tables and figures are shaded, so they can better stand out from the
main text. When a figure looks fragmented, a frame is also added to the whole region. All
the tables and figures have floated to the top of a page, which is feasible as none of them
is more than a page long. In addition, some descriptions in the text may be important and
readers may need to use them as a reference later on. If that is the case, a shaded box is
used to highlight the relevant part.
All the R programs have been prepared with the Tinn-R editor, and then inserted into
the main text. The outputs from these R programs are too large to be fully reported. Thus,
most R programs are followed by some selected outputs. In formatting these R programs
in the book, two lines are used at the top of each program to have a stronger separation
effect, and one line at the bottom for a weaker effect. Line numbers are added outside of
the program. They are not a part of the program per se, but added to facilitate references
in the main text. Within the R programs, commands are in a typewriter font, and comments are in a slanted typewriter font. Comments in the selected outputs, however, are
reported without the slanted format (because the verbatim environment in LaTeX is used).
R commands in the main text are also formatted with a typewriter font. R functions in the
main text are indicated with a pair of parentheses, e.g., plot().
LaTeX has a well-designed cross-reference mechanism, which has been used extensively
in the book to provide a label number, title, or page number, e.g., Section 1.4.3 Materials
and notations on page 10. When used in a cross reference, the following words are in bold
type: Table, Figure, Program, Part, Chapter, Section, Equation, and Exercise. This
may be a little distracting in some places when the number of words in bold type is large,
but overall the format allows readers to quickly identify the item of interest. The benefit is
bigger than the cost to me, so this feature has been adopted. The page number associated
with a referred item is included when the item is a few pages away from where it is called.
Words or phrases that need to be emphasized in the main text are in italic type.
All the figures generated from the R programs in this book have been adjusted further
before they are inserted in the book. One change is to use the Computer Modern font, the
default choice of LaTeX. The other change is to embed the font in the graphic output, which
is a technical requirement for book publishing. Several R packages have been used for this
purpose, including extrafont. As this is quite technical and time-consuming, the relevant
codes have been excluded from all the R programs.
1.5
Exercises
1.5.1 Understand sample papers. Read the sample papers that will be used in the main
text (i.e., Sun et al., 2007; Wan et al., 2010a; Sun, 2011). Become familiar with these
papers. This may take at least two hours per paper.
1.5.2 Select an empirical study. Skim the sample papers that will be used as the base for
exercises (i.e., Sun, 2006a,b; Sun and Liao, 2011). Select one of the three papers as the
base for many exercises that will be required in the remaining chapters of this book.
Alternatively, one can select or use a paper from the literature or a finished project if
raw data are available; this is not recommended if no mentoring is available.
Chapter 2
Getting Started with R
If you have never used R before, this chapter will help you get started.
Brief notes about installation of the base R, R packages, and editors are presented
first. Then, R is compared with commercial software. Some misconceptions about
free software like R are discussed also. At the end, the help system for R is described.
2.1
Installation of base R, packages, and editors
Installing R and packages may differ slightly on different operating systems, and various
issues and solutions are discussed at online forums. I am a Microsoft Windows user so the
following installation notes focus on this environment. In general, there are three types of
installation needs: the base R, R packages, and a selected R editor. The latter two are
optional. Additional R packages may be needed if one utilizes extended packages for data
analyses or graphics. If one prefers an interface different from the one offered by the base
R, then a number of editors are available.
2.1.1
Base R
A copy of the base R is available from the Comprehensive R Archive Network (CRAN)
on the Web site of R. As of July 2015, it takes five clicks at http://www.r-project.org
to download it: CRAN (at the left column) ⇒ A mirror site choice ⇒ Download R
for Windows ⇒ base ⇒ Download R 3.2.1 for Windows. The base R can be installed
at the default folder (e.g., C:/Program Files/R-3.2.1), or in another selected folder (e.g.,
C:/myprogram/R-3.2.1). The latter is recommended if LaTeX is used along
with R for document preparation. The reason is that a space is allowed in a folder name
under Microsoft Windows (e.g., "Program Files"), but a space in a path is not handled well
by LaTeX and often causes trouble.
In addition, in the middle of the installation process, one needs to make a choice of
installing 32-bit, 64-bit, or both versions of the R software. In general, if your computer
has a 64-bit version of Windows, then it is recommended to install both versions of R. To
understand the difference between 32-bit and 64-bit system and learn how to check the
version on your computer, search the Internet with relevant keywords.
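If R is already installed, the build that is running can also be checked from within an R session. The two lines below are a small illustration; both commands are part of base R.
R.version$arch            # e.g., "x86_64" for a 64-bit build, "i386" for 32-bit
.Machine$sizeof.pointer   # 8 on a 64-bit build, 4 on a 32-bit build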
Once R is installed, open R graphical user interface (RGui) from the start menu of
Microsoft Windows. In Figure 2.1 R graphical user interface on Microsoft Windows, the
normal working status of R is shown. In this specific case, the 32-bit version is called, as
revealed at the initial message within the window and also at the corner, i.e., RGui (32-bit). In general, when a statistical software application is in operation, it can contain or
show many windows. These include an editor window for commands, a data window for
data display, an output window for results, a log window for working status or trace, an
error message window for error or warning messages, and a navigator for an overview of all
relevant components (similar to Windows Explorer, or any resource management window).
The base R has two major types of windows: R console and editor windows. In opening
a new session of R, the default window shown on a screen is the R console. This window can
show command lines submitted from the editor window, data imported or created in the
current session, analysis results, and any error or warning message. The console allows users
to receive commands from an R editor window and also to type commands directly. This is
convenient if some commands are for testing only and do not need to be saved. Actually, I
often use the R console as a calculator when I am around my desktop computer; it is more
powerful than any hand-held palm calculator I have. Thus, the R console is truly interactive.
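As a small illustration of this interactive use, the following lines can be typed at the console prompt one at a time (the numbers are arbitrary):
(25.5 - 3.2) / 4        # simple arithmetic at the prompt
sqrt(2) + log(100)      # built-in mathematical functions
x <- 1:5; mean(x)       # two commands on one line, separated by a semicolon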
An R editor window can be initiated by clicking the menu of File ⇒ New script, or by
opening a saved script through File ⇒ Open script. Commands in the R editor window
can be modified and saved on a local drive and rerun later as a program. A saved file has an
R extension, e.g., LogitStudyProgram.r. This is a text document so any word processor can
read it. Multiple editor windows can be opened at the same time, and commands from them
can be submitted to the R console. For empirical studies in economics, saving commands
through one or several editor windows is an efficient way to organize analysis steps. This
will be further elaborated and emphasized in later chapters.
The R console and editor windows can be easily arranged on a computer screen by
clicking the menu of Windows ⇒ Tile Horizontally or Tile Vertically. If you prefer
an editor window to stay on the left side, then place the mouse cursor in this window and
click Tile Vertically again. In submitting commands from the console directly, each time
only one line can be submitted by hitting the Enter key on the keyboard. Note one line in
the console window can have multiple commands. In submitting commands from an editor
window, there are two major ways: one is by line and the other is by block. When the cursor is
anywhere on a command line, use Ctrl + r to submit the current line. To submit several
command lines, highlight or select them and then use Ctrl + r.
The base R interface is relatively simple. When R is installed, one can try some simple
codes in an R session to make sure that R is appropriately installed, as shown in Figure 2.1.
Furthermore, explore the functionality offered through the menus. The most frequently used
keyboard shortcuts are listed as follows:
Ctrl + a    Select all command lines;
Ctrl + r    Run a single or a group of command lines;
Ctrl + l    Clear console;
Ctrl + f    Find a text;
Ctrl + c    Copy a text;
Ctrl + v    Paste a text;
Ctrl + h    Replace a text; and
Esc         Stop current computation.
2.1.2
Contributed R packages
The distribution of base R is very lean with 62 megabytes only (as of R 3.2.1 in July 2015).
This allows a fast installation and efficient running. Many functions available in R are not
installed routinely, and even after being installed, they are not loaded into a working environment
automatically. These functions are available in R as packages or libraries. A package is a
set of functions, help files, and data files that have been combined together for a topic
(e.g., the package of AER for applied econometrics with R). As of July 2015, the CRAN
package repository features about 6,800 packages. To find out which packages are installed
and loaded, run sessionInfo() in an R session. This provides the version of R and the
operating system for which it is compiled. The HTML help facility can be initiated by
help.start() and it gives details of the packages installed on a computer.
There are several ways to install R packages. First, installing a package is straightforward
from the RGui under the base R. As shown in Figure 2.1 R graphical user interface
on Microsoft Windows, first click the menu of Packages ⇒ Install package(s). Then
navigate the list, locate the package names of interest, and follow the instructions that pop
up on the screen.
Alternatively, one can use the function of install.packages() to install packages. This
is more efficient than clicking the menu buttons, so I suggest this approach. Three common
situations are: installing packages available at the CRAN site; installing a package from a
local drive with a tar.gz source file; and installing a package from the R-Forge site. In
comparison, installing packages from the CRAN site directly should be the default method.
Installing a package from a local drive is generally for experienced users only.
The following sample commands can be modified and used for specific packages.
install.packages(pkgs = "AER", dependencies = TRUE)
install.packages(pkgs = c("apt", "erer"), dependencies = TRUE)
install.packages(pkgs = "C:/erer_2.4.tar.gz", repos = NULL,
  type = "source")
install.packages(pkgs = "test.package",
  repos = c("http://R-Forge.R-project.org", getOption("repos")))
Installing a package available from the CRAN site is the easiest approach. When a
package available from CRAN is installed like above, dependent packages can be installed
automatically by setting the dependencies argument as TRUE. For example, the AER library
now depends on a number of packages, and all of them can be installed through the above
command. This method can also install several packages together, e.g., apt and erer by the
vector c("apt", "erer"). In addition, the above commands can be saved in a file for
packages often used as an R program. When the base R is reinstalled or updated later on,
these contributed packages can be reinstalled by running the saved program once, instead
of typing the installation commands repeatedly. The number of packages I have been using
is less than 30, so I have maintained and updated a one-page program on my local hard
drive.
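As a sketch, such a saved program could look like the lines below; the file name and the package list are only illustrative, not a recommendation.
# installPackages.r: reinstall my contributed packages after updating base R
myPkg <- c("AER", "apt", "ctv", "erer")
install.packages(pkgs = myPkg, dependencies = TRUE)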
Installing a package from a local drive under Microsoft Windows is feasible too. Depending on the format of the package, this can be less straightforward than installing a package
from the CRAN site. A source version of tar.gz for a package is most common, and it can
be installed through install.packages(). This method is needed if you have a personal
package that is not shared with anybody, or you receive a package from colleagues next door
to you. There are two packages for this book: erer for “Empirical Research in Economics
with R” and apt for “Asymmetric Price Transmission.” I update them frequently but I do
not upload them to the CRAN site every time, so there is a need to install them from
my local drive. It should be emphasized that installing a local package cannot automatically
install these packages that the local package is dependent on. They need to be installed first
before the local package can be installed. For instance, erer depends on several packages,
including systemfit, lmtest, tseries, ggplot2, urca, and MASS; they should be installed
first from the CRAN site before a local installation of erer. Therefore, installing a package
manually from a local drive requires information on package dependencies, which is available
in the DESCRIPTION file of the unzipped source package.
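For example, a local installation of erer could proceed roughly as follows; this is only a sketch, and the dependency list should be taken from the DESCRIPTION file of the version at hand.
# Install the dependencies from CRAN first
install.packages(pkgs = c("systemfit", "lmtest", "tseries", "ggplot2", "urca", "MASS"),
  dependencies = TRUE)
# Then install the local source package
install.packages(pkgs = "C:/erer_2.4.tar.gz", repos = NULL, type = "source")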
Installing a package from the R-Forge site is not uncommon. Some packages on the
R-Forge site (http://r-forge.r-project.org) are under intensive development and testing and
have not been published on CRAN yet. For example, if there is a package called test.package
on the R-Forge site, then it can be installed by following the sample code shown above.
As the number of packages has become so large, the R site has a list of CRAN task
views by subject, e.g., Bayesian, econometrics, and finance. Each task view summarizes the
relevant packages and provides a list of packages at the end. For example, the task view
of econometrics includes about 100 packages now. These packages by task view can be
installed as a group. To automatically install them by view, a package named ctv needs to
be installed first. Then the packages in a task view can be installed via the commands of
install.views() or update.views(), as shown below.
install.packages("ctv")
install.views("Econometrics")
update.views("Econometrics")
Finally, after a contributed package is installed, it still needs to be loaded before its
functions or data sets can be used. A package can be loaded with library()
or require(), e.g., library(erer) or require(erer). Both functions load a package and
put it on the search list. require() is mainly designed for use inside other functions; it
returns FALSE and gives a warning (rather than an error as library() does by default) if
the package does not exist. In the middle of an R session and without closing the R interface,
one can unload a package by using the function of detach(), as shown below.
library(erer)                                   # Load erer
sessionInfo()                                   # Confirm loaded
detach(name = "package:erer", unload = TRUE)    # Unload erer
sessionInfo()                                   # Check unloaded
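Because require() returns a logical value instead of an error, it fits naturally inside a user-written function. The small sketch below (a hypothetical helper, not a function from the book's packages) illustrates the typical pattern.
loadOrStop <- function(pkg) {
  # require() returns FALSE with a warning if the package is missing
  if (!require(pkg, character.only = TRUE)) {
    stop("Package '", pkg, "' is not installed.")
  }
}
loadOrStop("erer")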
Figure 2.2 Interface of the alternative editor Tinn-R
2.1.3
Alternative R editors
The editor with base R distribution as shown in Figure 2.1 R graphical user interface on
Microsoft Windows on page 13 is a simple text editor. Given its limited editing ability, many
editors have been developed for R, as summarized on several Web sites like http://www.
sciviews.org/_rgui. These software applications or editors do not change the functionality
of the base R at all, but make R friendlier to use. By analogy, this is similar to the relation
between a cell phone and various phone shells you purchase separately as phone accessories.
In addition, most of these applications for R are completely free.
Among these editors, one of the most appealing editors to me is Tinn-R for Windows.
It is a small free program with many improvements over the simple R editor; the difference
is like a color television versus a black-and-white one. The cost of enjoying these benefits is that
users need time to install and configure it, and furthermore, navigate the menus to become
familiar with its multiple utilities.
Tinn-R is available from the link at www.sciviews.org/Tinn-R. The interface is shown in
Figure 2.2 Interface of the alternative editor Tinn-R. In general, three windows are most
frequently used: a script window for commands, an output window, and a log window. It
also has a navigation window for resource management. Keyboard shortcuts are available
for various actions, and a user can customize existing shortcuts or create new ones based on
personal preferences. For example, within a Tinn-R editor window, pressing "Ctrl + (" on
a keyboard generates a pair of parentheses, i.e., (). A comment sign of # can be added to
the beginning of each line of a highlighted command block with the keystroke "Alt +
c", and similarly it can be removed by "Alt + z". Different colors and fonts can be used
for commands and comments. Line numbers can be added at the side to facilitate reading.
Multiple program files are organized by tab at the top of the window.
Another editor that has become more popular is RStudio (http://rstudio.org). This one
is easier than Tinn-R to configure at the beginning, and it focuses mainly on R. It
seems that RStudio has been under very intensive development in recent years. An interface
of RStudio on Microsoft Windows system is shown in Figure 2.3 Interface of the alternative
editor RStudio.
Figure 2.3 Interface of the alternative editor RStudio
2.2
Installation notes for this book
In using this book for several courses and workshops, a number of installation problems have
occurred. While most of them are attributable to a lack of experience, it is worth some
space here to list and emphasize the main steps. Readers should follow these steps closely
to get ready for the coming data analyses in the book. Some R concepts may not be very
clear at this point, but they are all included here for completeness.
2.2.1
Recommended installation steps
Step 1 Install the base R. This is the engine of the software and should be installed first.
Step 2 Install an alternative R editor (optional). The base R can work independently. If one
is satisfied with the interface of base R, then skip this step. An alternative editor
like Tinn-R or Rstudio provides more convenient utilities. Furthermore, depending
on the editor chosen, connecting the base R and the editor (e.g., Tinn-R) may take
some effort. If any problem arises, follow the manual provided by these editors
closely or search for a solution on the Internet.
Step 3 Test the base R or alternative R editor. If the base R and an alternative editor are
well installed and connected, then users should be able to reproduce the appearance
shown on Figure 2.1, 2.2, or 2.3. Note there are three command lines in testing
the software. The first line is x <- 1:5; x, the second line is mean(x), and the third
line is y. Create a new editor window to hold the three command lines. There are two
ways of submitting the commands, either by line or by block, and you must make
sure that both the ways work. Thus, submit the test commands by line first, and
then highlight all of them and submit them as a block. In the base R, submission
can be done through a button on the interface or by a keystroke of Ctrl + r. In an
alternative editor, there should be similar buttons or keyboard shortcuts. Sometimes
an error message within the alternative editor interface can be a problem of the editor
per se, not a problem of the base R. This can be verified by testing the same codes
with the base R interface.
Step 4 Install the erer package. Have a computer connected with the Internet, and install
it through the default method, as described earlier in this section. Relevant packages
will be installed automatically. For example, to install the erer package, run the
following line: install.packages(pkgs = "erer", dependencies = TRUE).
Step 5 Find a copy of all data and sample R program files used in this book. There are
two ways to find or locate them: one on your local drive and the other through
the Internet. Once R and contributed packages are installed on a computer, a
folder is created automatically to hold documents from packages installed. For
example, I can find documents related to the erer package on my computer at
C:/CSprogram/myRsoftware/R-3.2.0/library/erer/doc. Close to 100 raw data
and sample R program files used in this book are included there. Alternatively,
one can download a zipped copy of the erer package from the Internet directly.
Search the Internet by “package erer” and save the latest source package, e.g.,
erer_2.3.tar.gz. Unzip this document and all the data and program files should
be available now.
Step 6 Make a copy of all data and program files at a new local folder. After these files are
located, create a new folder on your computer with the following name: C:/aErer,
and then make a copy of all the files. This will allow you to use all the files without
making any additional modification. For users with an operating system other than
Microsoft Windows, the new folder name can still be similarly named, but further
changes in some R sample programs may be needed to modify file directory.
Step 7 Test one R sample program. Open base R or an alternative R editor (e.g., Tinn-R), and then open a sample program in your new folder, e.g., C:/aErer/r072sunSJAF.r.
Select all the commands and submit them as a group. If it runs well without any
error message, then the installation is successful and you are ready for all the coming
data analyses.
2.2.2
Working directory and source()
To run R sample programs in this book smoothly, a basic knowledge of working directory
is needed. Each R code demonstration in this book has a corresponding file saved in the
data folder of the erer library. For example, Program 7.2 The final program version for
Sun et al. (2007) on page 114 is saved as r072sunSJAF.r. All these R programs have been
created and saved under the directory of C:/aErer on my computer. In addition, all the
sample data sets used in this book, e.g., RawDataIns1.csv, are also saved at the C:/aErer
folder. This is the simplest and most convenient folder name I can think of.
When a sample program needs to communicate with a local drive for data input and output, a directory (e.g., C:/aErer) has been specified inside a program through either a local
or global specification. For example, read.table(file = "C:/aErer/RawDataIns1.csv")
uses a local directory specification, so the directory information here is only effective for the
read.table() function. The commands wdNew <- "C:/aErer"; setwd(wdNew) allow
for a global specification, so all the following commands will be affected. In general, the
getwd() function can reveal the name of the current working directory in an R session, and
dir() can list the files in the current directory.
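Put together, a typical session with the book's files might begin with a few lines like the following; this is only a sketch, and the read.table() arguments may need adjusting to the actual file layout.
getwd()                              # show the current working directory
wdNew <- "C:/aErer"; setwd(wdNew)    # global specification for the session
dir()                                # list the files copied in Step 6
ins <- read.table(file = "RawDataIns1.csv", header = TRUE, sep = ",")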
R beginners often make a mistake in using the forward slash (/) and backslash (\)
in defining a directory. Microsoft Windows generally uses the backslash to define a directory (e.g., "C:\aErer"), which is sometimes referred to as the Windows mode. However, R
requires the forward slash (e.g., "C:/aErer"), which is called the Unix mode. In R, a backslash \ is known as an escape character. To put a backslash in a string, you must double
it. Thus, "C:\aErer" is wrong for directory specification in R, but both "C:/aErer" and
"C:\\aErer" are acceptable.
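A quick way to see the difference at the console is to print the three forms; the lines below are only an illustration of the escape behavior.
cat("C:/aErer", "\n")      # forward slash: correct
cat("C:\\aErer", "\n")     # doubled backslash: also correct, prints C:\aErer
cat("C:\aErer", "\n")      # wrong: \a is read as an escape character, not a backslash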
To run an R program like r072sunSJAF.r for Program 7.2 successfully, the data sets
(e.g., RawDataIns1.csv) should be saved in the directory specified within the program (e.g.,
C:/aErer). In addition, some sample programs also use the source() function to call and
run another program. For example, Program 8.1 Accessing and defining object attributes
on page 129 is saved as r081attribute.r. In Program 8.1, the source() function is used
to run the whole program of r072sunSJAF.r to create some objects for further demonstration. If you have followed Step 6 in Section 2.2.1 Recommended installation steps, then
there is no need for you to make any change to run Program 7.2 and Program 8.1. Otherwise, you may need to revise these directories in the files, depending on how and where
you have saved the sample R programs and data sets.
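For instance, the chain of calls can be reproduced manually with a few lines like these, assuming the C:/aErer setup from Step 6; this is only a sketch.
setwd("C:/aErer")
source("r072sunSJAF.r")    # run the whole sample program first
ls()                       # the objects it created are now available for reuse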
2.2.3
Possible installation problems
Various problems can arise in the above steps. Most of them originate from a lack of experience, and in some cases, a lack of common sense (it is harsh to hear but often true). For
example, it is not a good idea to install an alternative editor (e.g., RStudio) before the base
R is installed. Make sure your computer is really connected with the Internet if you know
your Internet service is often unreliable. As each computer faces a different environment, it
is impossible to discuss all possible problems here. Based on my experience, I list several
major problems that seem to occur more frequently.
The first one is about mirror sites. At present, R is available at a large number of mirror
sites so users are generally encouraged to choose one that is physically close to them. For
example, if you are in Beijing, then it is recommended that you choose one in Beijing,
not another in Brazil. However, the server at a mirror site may be unreliable or broken
without any warning sign. You can still connect to the site and finish the whole installation
process. Some packages may be installed only partially during the process, so later on strange
error messages can occur. If that is the case, feel free to choose a site that may be physically
far away from you to try your luck again. My own experience is that most sites in the United
States are pretty stable.
In connecting the base R with an alternative R editor (e.g., Tinn-R), the most common
problem is that submitting a block of commands may not work as expected and can generate
an error message. In general, this is because the alternative R editor is not connected with
the base R well. As these alternative R editors have been under constant development and
each computer environment is unique, the best way of dealing with these errors is to search
the Internet for suggestions.
When you have limited access to the Internet or the signal is weak, it is tempting to
download a copy of the package of interest and then install it locally. However, as emphasized
earlier in the section, it does not work if the package depends on other packages but they
have not been installed on your computer yet. Thus, the best approach for beginners is
to have a computer connected with the Internet and use the default approach as follows:
install.packages(pkgs = "erer", dependencies = TRUE). If your Internet service is
slow, then it may take a good amount of time. You need to be patient in a situation like
that. Installing a package locally is recommended for experienced users only.
Close to 100 R sample programs and data files are presented in this book. A number
of them need a specification of the directory and folder information for data inputs. I have
organized all of them on my computer in the folder of C:/aErer. Thus, the most efficient
way for a beginner to get started is to create a folder with the same name and then copy
these files there. In contrast, one can run these programs in the folder where R installs
them initially. This is strongly discouraged as you need to make changes on some sample R
programs and also the directory where R is installed is often too long to manipulate. Finally,
do not ignore any error message, and more importantly, address errors sequentially. In most
cases, an earlier error will complicate later operations.
2.3
The help system and features of R
Spirit of R and the help system
There are many different descriptions about R in books and on the Internet. On its own Web
site, R is defined as “a free software environment for statistical computing and graphics.”
My definition is: R is freedom.
R allows scientists to conduct statistical computing and draw graphs in a completely
new way. It reflects the inner desire of all human beings for freedom: free to speak, free
to dance, free to move, and now, free to do programming. Like many of us, I am tired of
being controlled by commercial software and waiting for their updates over time. While
there are markets for some unique commercial software products, an increasing number of
data analyses and programming jobs have been accomplished by users with free software
directly.
Inside an R session, the help file for a specific function is just around your fingertips. After
loading a package like library(erer), typing help('erer'), ?erer, or ?bsTab can invoke
the help() page instantly for the package of erer or a specific function. Note there is an
Index at the bottom of the help page. Following the link allows users to navigate all the help
files of packages installed on a computer. For each function, there are detailed descriptions
for arguments, and even better, examples that users can copy to an editor window and run
immediately. These examples can also be run for a function with the example() function
in an R session, e.g., example(bsTab). These examples have passed the checking procedure when
packages are built and uploaded onto the CRAN site, so they should work on another
computer too. This feature allows users to learn a function much faster than checking a
hard-copy software manual. With this help system, users do not need to memorize the
details of a function anymore.
To see the actual codes for a function, one just needs to type the name of a function at
the prompt in R console directly like: bsTab. The code for a function will be shown instantly.
One can examine the code, or extend and modify it for other purposes.
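The commands discussed above can be tried in one short session, for example:
library(erer)      # load the package first
?bsTab             # open the help page for bsTab()
example(bsTab)     # run the examples listed on that help page
bsTab              # type the bare name to print the function's code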
The online community of R is an amazing place to get help. Every day several hundred
questions are posted on sites related to R (e.g., http://www.nabble.com and http://
stackoverflow.com). If anybody has doubt about the power of individual users as a group,
look at the development of Facebook, Twitter, and R in recent years. The spirit behind
these Internet-related phenomena is the same.
At present, R has seven reference manuals: An Introduction to R, R Data Import and Export, R Installation and Administration, Writing R Extensions, The R Language Definition,
R Internals, and The R Reference Index. They are updated with each new release of R. On
its official Web site, many free contributed documents are also available. In addition, over
100 books related to R have been published and they are available at online bookstores. In
particular, I recommend the following books: Dalgaard (2008), Spector (2008), and Everitt
and Hothorn (2009) for basic statistics and data manipulation; Bivand et al. (2008) for
spatial statistics; Wickham (2009) and Murrell (2011) for R graphics; Kleiber and Zeileis
(2008), Pfaff (2008), and Vinod (2008) for econometrics; Braun and Murdoch (2008), Jones
et al. (2009), Adler (2010), and Matloff (2011) for practical programming techniques.
R versus commercial software
To a new user, the most appealing feature of R may be its free license. It can be used
anywhere with very limited constraints. However, being free may not be the main reason
for its popularity in recent years. After all, there are a number of free software products on
the Internet and many of them are not appealing at all. During the last several years, the
more I use R, the more I like its design and advantages over many commercial products.
In contrast with commercial software, R's greatest advantage is its open-source nature.
It allows users to view the code for each function and learn from it. This is impossible for
commercial software products because they depend on the functionality for making profits.
This key difference results in many divergences between R and commercial software. For
example, R can be updated more frequently; commercial software cannot force users to pay
for updates every month, so its updates usually arrive only every few years. The core program of R is small so
it runs efficiently; a commercial software distribution is big because it is a one-time deal
with users. R encourages users to extend its program; commercial software asks users to be
patient and to rely on their updates that may only come years later.
R is also a very concise language. If a data set is analyzed by R and commercial software
products at the same time, in general the program file from R is much shorter than that
from others. This conciseness of R may become a barrier for new users in digesting the
codes at the beginning, but after a while, users often appreciate its short and clean style.
Furthermore, most commercial software products cannot be used as a computer language
like R. It is true that there is a core computer language behind each software product
for statistical analysis. R is among the few products in the market (e.g., MATLAB) that
encourage analysts to use it as a language and provide real aids in the learning process.
Commercial software products are commodities in a market economy. When running a
regression, a commercial product throws all the results onto the computer screen and forces users to digest them.
That is a typical selling strategy used by a local car dealer: look how much we can do for
you! In contrast, R is much more like a shy but decent introvert. R saves everything as
an object, and in general, if an object is not called, it does not show up and bother you.
Commercial software products try to control users while R allows users to have much better
control over their data analyses.
Using the classification of four stages adopted in this book (i.e., clickor, beginner, wrapper, and contributor), R users can reach the status of contributor in one or two years.
However, users of commercial software will seldom reach the stage of contributor; some diligent users may become good wrappers; and most will become good clickors only. That is
exactly what vendors of commercial software like to have: many compliant ‘children’ waiting
for their updates with money in hand.
Commercial software still enjoys some advantages over R in certain aspects. One common
complaint about R is that it is not well organized like commercial software. With a slim
base R and many extended packages, users often face the dilemma of which function should
be used for a specific task. This creates a feeling of fragmentation as R is built up by many
individual contributors worldwide. There is a lack of authority to tell users which one among
many available similar packages is the best for a specific data set. This may force users to
spend more efforts to have a deeper understanding of the model employed. Hopefully, the
fragmentation of R functionality will be better addressed as the software is improved and
users gain more experience in using it.
In summary, I believe R will gradually gain a larger share of the current software market
for statistical analyses. The main advantage of R is that it is free and can be used as a
language to accomplish sophisticated programming jobs. The main advantage of commercial
software products is that they provide stable and consistent solutions to routine statistical
analyses. To some degree, the relation and competition between R and commercial software
is similar to that between electronic books and traditional paper books, or that between
emails and traditional letters. I have no doubt that R will gradually become more popular
over time, and many commercial software applications will face a shrinking customer base.
What is not free?
R being free can cause some misconceptions. While users do not need to pay for a copy of
R, they still need to buy books to learn R systematically. There are some good and free
contributed books on the Web site of R. They can be helpful for beginners to get started.
In addition, over 100 books about R have been published formally in recent years. Many of
them have better quality than these free contributed books. Some of them are very affordable
too, when their prices are compared with the copying or printing costs for these free books (e.g., ten cents
per page at a local store for copying and binding).
Furthermore, learning R is not effort free. Just installing a free copy of R really does not
have any impact on one’s statistical skills and programming ability. The bigger challenge is
to use it efficiently for actual research. Several years ago, for example, I bought a software
application for home projects. I thought I would be able to add a sun room to the back of
our house by myself easily. It turned out I still needed to learn a lot about architecture,
so at the end I never finished a single project with this software. Similarly, owning a copy
of Microsoft Word does not mean one can produce a beautiful document automatically.
Owning a dictionary does not mean one can write a great novel tomorrow. The same logic
applies for R and statistical analysis.
Learning R as a language takes great effort over several years. This is achievable
through carefully designed projects and rigorous training. The benefit is so great that it
is well worthy of your investment. Being able to program with a computer language for
research is just like being able to drive a car for commuting.
2.4
Playing with R like a clickor
Pull-down menus in a computer application, also called drop-down menus, are options that
appear when an item is selected with a mouse. Using the classification for statistical analyses
in this book, this is labeled as a clickor status. While R is mainly designed for programming,
it does have a number of contributed packages that provide pull-down menus for both
computing and plotting. One well-known package is R Commander, titled as Rcmdr. In this
section, R Commander and data from a contributed package are used to demonstrate the
features of this approach.
Using pull-down menus for statistical analyses is intuitive. One can select a data set and
try all the features available in an interface. This may help some users get started with R
and then gradually move on to programming.
Installing and loading the R Commander
To install this package in R, use the command install.packages('Rcmdr'). To invoke
the R Commander, open an R session first. The main R console window should be similar
to Figure 2.1, 2.2, or 2.3, depending on the editor or graphical user interface (GUI) you
choose. Then, submit library(Rcmdr) at the prompt of the R console. A new window will
pop up, as shown in Figure 2.4 Main interface of the R Commander with a linear regression.
If the new window is closed accidentally, it can be recalled anytime by typing Commander() at the
prompt of the R console. From now on, you can click and play with the R Commander interface
independently. The R console window is still open and runs in the background; it can be
minimized but cannot be closed.
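In command form, the whole setup amounts to a few lines:
install.packages("Rcmdr", dependencies = TRUE)   # one-time installation
library(Rcmdr)     # loads the package and opens the R Commander window
# Commander()      # reopens the window if it was closed accidentally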
Major menu items
The R Commander has three main windows: script, output, and message. The script window
displays codes associated with each mouse click. One can also type and submit R commands
through this window. The output window holds results from any operation. The message
window displays notes, warnings, or errors. In addition, separate windows can be initiated
from some operations, such as viewing a data set or displaying a graph.
The pull-down menus are very intuitive, including file management, data operations,
regressions, plotting, and help. Some features are not active until a data set is active. At
present, there are already a large number of menu items available in the R Commander.
However, it should be emphasized that this is still just a very small portion of the total
capacities of R. With several thousands of contributed packages, R can handle much more
diverse and complicated tasks than the R Commander has shown. Below is a selected and
brief list of the items available in the R Commander.
File
- Change working directory
- Open or save script file
- Save output; Save R workspace; Exit
Edit
- Edit R Markdown document
- Cut; Copy; Paste; Find; Delete
- Select all
Data
- New data set; Load data set; Merge data set
- Import data from text file, SPSS, SAS, STATA, Excel, Access
- Data in package (List or read data sets in packages)
- Manage variables in active data set
Statistics
- Summaries, contingency tables, means, variances
- Nonparametric tests
- Dimensional analysis
- Fit models (Linear, generalized linear, multinomial logit)
Graphs
- Color palette
- Histogram, boxplot, line graph, strip chart, Bar graph
- Save graph to file
Models
- Select active model
- Summarize model
- Hypothesis tests
- Graphs (diagnostic, residual)
Distributions
- Continuous distributions
- Discrete distributions
Tools
- Load packages
- Load Rcmdr plug-in(s)
Help
- Commander help
- Introduction to the R Commander
The guide and help system for the R Commander are well documented under the menu
of Help. In particular, the document of Introduction to the R Commander is a PDF file
with 26 pages, and it explains well the operation of this application.
Playing a data set
To understand and get started with the R Commander, the best way is to have a data set
and play it for a while. The Data menu at the main interface lists a number of options. Data
can be imported from one's local drive, prepared manually from the menu of Data ⇒ New
data set, or loaded as sample data from the R Commander directly. To be brief, some
built-in data is used here for demonstration.
When the R Commander is loaded through the command of library(Rcmdr), several
other packages are also loaded together in the current working environment, including the
package of car. So all the contents in car are available to users now. One of its data sets is
Davis. It has 200 rows and 5 columns, and it documents the height and weight information
for men and women who engaged in some exercise. A few observations are selected and
listed as follows:
    sex weight height repwt repht
1     M     77    182    77   180
2     F     58    161    51   159
3     F     53    161    54   158
...
197   M     83    180    80   180
198   M     81    175    NA    NA
199   M     90    181    91   178
200   M     79    177    81   178
where sex is either female (F) or male (M); weight is measured in kg; height is measured
in cm; repwt is reported weight; and repht is reported height. Some values are missing.
Below we show three groups of activities with this data. The key screenshots are displayed
in Figure 2.5 Loading a data set, drawing a scatter plot, and fitting a linear model.
First, to get this data set into the current working environment, follow the menus
at Data ⇒ Data in packages ⇒ Read data set from an attached package and then
make the appropriate clicks. Once the data is loaded successfully, the R Script window
should display the following command line automatically: data(Davis, package = 'car'),
as shown in Figure 2.4. At this time, many menu items will change the color from gray
to black, indicating that a data set is active and these menus are available for use. To view
the data set, one can click the button of View data set on the main interface. A separate
window for the data will come up.
Second, summary statistics can be generated for the data set. For example, following
Statistics ⇒ Summaries ⇒ Active data set, the minimum, median, mean, and maximum
values for each variable will be reported in the Output window. In addition, various
types of graphs can be used to explore the data and understand its properties. For example,
using the menus of Graphs ⇒ Scatter plot, we can see the positive relation between the
variables of height and weight.
Finally, a linear regression can be fitted on the data. For example, if the relation between
the height (x) and weight (y) is of interest, we can fit a linear model by using the menus
of Statistics ⇒ Fit models ⇒ Linear regression. The regression coefficient is 0.238
and it is significant at the 1% level, which confirms our impression that taller persons are
heavier on average. Once a model is fitted, some diagnostic or other plots can be generated
through the Models menu. For example, Figure 2.5 contains a graph generated from clicking
the menus at Models ⇒ Graphs ⇒ Influence plot.
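For reference, the point-and-click steps above correspond roughly to commands like the following in the script window; this is only a sketch, and the exact calls that the R Commander writes may differ slightly.
library(car)                                  # loaded along with Rcmdr
data(Davis, package = "car")                  # first activity: load the data
summary(Davis)                                # second activity: summary statistics
scatterplot(weight ~ height, data = Davis)    # scatter plot from the car package
m1 <- lm(weight ~ height, data = Davis)       # third activity: linear regression
summary(m1)                                   # coefficients and significance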
In summary, through these activities, one can see that the R Commander can meet many
ordinary tasks for data analyses and graphs. This can reduce transaction costs when a user
has never used R before or is completely new to computer programming. Note that the
script and output windows in the R Commander display the underlying R commands for
each action. This is a nice feature that allows users to learn some basic R commands. In the
long term, however, programming still is much more efficient than clicking the menus.
2.5
Exercises
2.5.1 Install R and packages and prepare for coming data analysis. With the guide in this
chapter, readers should have the following items installed: base R, the erer library, a
selected R editor (e.g., RStudio). Furthermore, make a copy of all the data and sample
R programs used in this book to a local folder and test one of them, as detailed in
Section 2.2.1 Recommended installation steps on page 17.
2.5.2 Learn how to have an efficient package installation. In this exercise, create and save
a file for all the contributed packages you install on your computer, following the
discussion at Section 2.1.2 Contributed R packages on page 14. To see the names
of packages installed, use the function of installed.packages(). Some of them are
always installed and loaded with base R, so there is no need to include these in your
file.
Part II
Economic Research and Design
Part II Economic Research and Design: Four chapters in this part are organized with
a top-down approach: economic research in general; three versions of an empirical study;
study design and proposal for an empirical study; and reference and file management.
Chapter 3 Areas and Types of Economic Research (pages 31 – 41): Areas and issues of economic research are introduced. Economic research is classified into three types: theoretical,
empirical, and review studies. Features and challenges by type are discussed.
Chapter 4 Anatomy on Empirical Studies (pages 42 – 56): The production process of
an empirical study is analyzed in detail and relevant research methods are presented. An
empirical study has three versions: proposal, program, and manuscript.
Chapter 5 Proposal and Design for Empirical Studies (pages 57 – 77): General principles
of research design of empirical studies in economics are examined. Three different situations
are analyzed: proposals for money, publication, or degree.
Chapter 6 Reference and File Management (pages 78 – 97): The demand for literature
management is analyzed. Some systematic solutions are suggested. The features in EndNote
and Mendeley are presented for both reference and file management.
R Graphics • Show Box 2 • A heart shape with three-dimensional effects
R is flexible in integrating data into graphics. See Program A.2 on page 514 for detail.
Chapter 3
Areas and Types of Economic Research
Scientific research in economics involves many areas or aspects of our economy.
Economic analyses can be classified into different types, depending on the nature
of the problems, research needs, and methods employed. To facilitate a comparison,
economic analyses are classified into three types in this book: theoretical studies, empirical
studies, and review studies. The main features of each type are examined first. At the end,
those types of economic analyses are compared and some thoughts about long-term career
planning are presented.
3.1
Areas of economic research
Economics as a discipline has evolved to be quite diversified. The classification system
adopted by the Journal of Economic Literature (JEL) has a good representation of major
research areas in economics. As an example, some of the frequently used areas related to
natural resource economics are presented in Table 3.1 Journal of Economic Literature
classification system. Each JEL code is composed of three characters. The first is a letter
and the other two are numbers. For the first character, there are 20 main areas, ranging
from “A — General Economics and Teaching” to “Z — Other Special Topics”. The two tiers
by number represent more detailed areas.
Another interesting way to explore various areas of economic research is to look at
the list of Nobel Prizes and Laureates in Economic Sciences (http://www.nobelprize.org).
The Prize in Economic Sciences was first established in 1968 in memory of Alfred Nobel,
founder of the Nobel Prize. Between 1969 and 2014, 46 Prizes were awarded to 75 Laureates.
The average age of a Laureate is 67 years old and only one woman has received the Prize
in this category (i.e., E. Ostrom in 2009). The research areas covered by these awards in
the past do not cover every aspect of economic research. However, the list indeed reveals a
number of active research areas at the aggregate level. Relevant to empirical studies and the
focus of this book, several Prizes in recent years have been awarded to quantitative analyses
of economic issues. These include the Prize to J.J. Heckman and D.L. McFadden in 2000
for theory and methods of analyzing selective samples and discrete choice, R.F. Engle and
C.W.J. Granger in 2003 for methods of analyzing time-varying volatility and cointegration,
T.J. Sargent and C.A. Sims in 2011 for their empirical research on cause and effect in the
macroeconomy, and more recently, E.F. Fama, L.P. Hansen, and R.J. Shiller in 2013 for
their empirical analysis of asset prices.
Table 3.1 Journal of Economic Literature classification system

Code  Description
A     General Economics and Teaching
B     History of Economic Thought, Methodology, and Heterodox Approaches
C     Mathematical and Quantitative Methods
D     Microeconomics
E     Macroeconomics and Monetary Economics
F     International Economics
G     Financial Economics
H     Public Economics
I     Health, Education, and Welfare
J     Labor and Demographic Economics
K     Law and Economics
L     Industrial Organization
M     Business Administration and Business Economics; Marketing; Accounting
N     Economic History
O     Economic Development, Technological Change, and Growth
P     Economic Systems
Q     Agricultural & Natural Resource Econ.; Environmental & Ecological Econ.
R     Urban, Rural, Regional, Real Estate, and Transportation Economics
Y     Miscellaneous Categories
Z     Other Special Topics
C58   Financial Econometrics
D12   Consumer Economics: Empirical Analysis
Q23   Forestry
Q26   Recreational Aspects of Natural Resources
Q27   Issues in International Trade
Q51   Valuation of Environmental Effects
Source: https://www.aeaweb.org/journal; accessed in July 2015.
For specific areas in economics, there are also detailed directions that researchers can
choose from. Take as an example natural resource economics (e.g., JEL Q2 and Q5). This
is a small branch of economics, but it covers numerous resource issues. One way
of classifying these research issues is to trace the movement of resources (e.g., trees) as if
in a stream. (Historically, logs were indeed transported down rivers.) At the beginning
of the stream, landowners manage land for various outputs. Timber is one main output
from forests. Other outputs include water, air, wildlife, recreation service, and aesthetic
value. This multiple-output feature of forests generates numerous research areas, e.g., timber
market, bioenergy, climate change, carbon sequestration, conservation, and land use change.
In the middle of the stream, the wood products industry has been an important sector
in the economy. In the United States of America, the output of the wood products industry
has been about 10% of the total output from the manufacturing sector. Industrial firms,
including sawmills, paper mills, and furniture firms, are active market participants. Various
issues related to the industry are worthy of attention, e.g., productivity, product demand
and supply, business location and clustering, industry innovation and evolution, mergers
and acquisitions, industrial organization, and market power. At the end of the stream, there
are various products available to consumers. A number of issues related to the market exist,
including eco-labeling for wood products, demand for paper and lumber products, housing
market, and domestic and international trade of wood and paper products.
In summary, there are various areas of economic research. Most economists work on a
small field within the large arena during a whole career. By analogy, this is very similar to a
medical doctor who generally specializes in one field, e.g., nose or heart. It is challenging for
anyone to become an expert in several fields. Thus, the next time an economist on a television
program talks as if he knows everything about the economy, consider whether he is
really so knowledgeable or just bragging about some common sense.
3.2 Theoretical studies
Within a specific area or across various areas, economic research can be classified by other
criteria. In this book, three types of economic research are differentiated: theoretical, empirical, and review studies. The whole book focuses on empirical studies, but an understanding
of the other types will help us improve efficiency in conducting empirical studies.
Theoretical studies in economics generally focus on the theoretical aspect of an economic
issue. This type of study has been qualitative in nature. However, in recent years, it has
become more quantitative with sophisticated mathematical models. Thus, theoretical studies
have become more structured with models and have gained more popularity over time.
Understanding the structure and design of this type of study allows a more comprehensive
view of a research area. It also can help generate empirical research ideas in the process.
3.2.1 Economic theories and thinking
The key feature differentiating theoretical studies from other types of economic studies is
that they are theory-oriented. There are a number of inherent advantages of a theoretical
approach to analyzing economic issues, with the main one being its flexibility. Theoretical
studies often have light demand for data, or in many cases, they do not need any at all.
Economic phenomena can be so complicated in some situations that a theoretical analysis
is the most practical or even the only way available to researchers. Overall, a theoretical
study allows a focus on a specific aspect of the subject, and provides solutions to social and
economic problems.
Conducting a theoretical study can be challenging in several aspects. While some of
them address large issues, many theoretical studies examine small, conceptual, and dry theoretical questions in economics. These questions may have a limited relation with reality or
our breakfast for tomorrow morning. This is similar to many abstract mathematical issues.
In recent years, quantitative models have been increasingly employed in theoretical studies.
As a result, calculus and mathematics have become deeply involved in modern theoretical
analyses. A student needs to be very strong in mathematics and deductive reasoning. Furthermore, for a specific issue, it is not uncommon that results from studies with different
theoretical assumptions are in conflict. This can be frustrating as theoretical studies may
fail to provide clear answers to questions under investigation.
The best way to understand this type of economic study is to read representative studies in some specific areas. For example, the seminal article by Coase (1960) examined the
economic problem of externalities. A number of legal cases and statutes were used in benefit-cost analyses related to transaction cost and property rights. Coase’s study is very readable
with limited quantitative analyses. Similarly, Akerlof (1970) analyzed information asymmetry and quality uncertainty, using the market for used cars as an example. Within the
literature of economic analysis of law, many similar articles have been published since then.
Chapter 4
Anatomy on Empirical Studies
In this chapter, research methodology related to empirical studies is examined in
detail. We adopt a production approach to analyzing the characteristics of an empirical study. A paper can be viewed differently by a reader and an author (i.e.,
a consumer versus producer). From the perspective of production, an empirical study has
three faces or versions: a proposal version, a program version, and a manuscript version. For
a published journal article, readers see the manuscript version only. How the study has been
designed and how the data have been analyzed are usually not available to readers, or they
are only partially revealed in the manuscript version. The differentiation among the three
versions will give us a better understanding of the production process of a scientific paper.
It also allows us to identify several critical steps in practical application, and eventually,
helps us improve research productivity incrementally.
4.1 A production approach
Conducting scientific research is a production activity. It is comparable to many activities
human beings participate in. Similar situations include making a movie, mining coal, and
performing a dance. In the next subsection, we take home building as an analogy to explain
the phases in making a paper. Specifically, a scientific paper is evaluated on the basis of
three phases: idea, structure, and detail. This division is distinctive and easy to understand.
Each phase can include small steps, so a large research project can be divided into workable
units. Finally, we compare an author’s and a reader’s perspective on a paper to further
reveal the key features of the production process in conducting an empirical study.
4.1.1 Building a house as an analogy
Making a paper is similar to other activities, as compared in Table 4.1 A production
comparison between building a house and writing a paper. In 2004, my wife and I spent
half a year shopping for a house in the town where we have lived since then. It was
our first house, so at that time we had very limited experience with the housing market and
the buying process. It was also the biggest financial decision of our lifetime. At the beginning,
we read a lot of books and tried to understand what steps we should follow in finding a
house. After numerous conversations, readings, and trials, we figured out that there were
three major stages in shopping for a house:
Table 4.1 A production comparison between building a house and writing a paper

Role: Designer
  Building a house. 1. Location: A builder or land dealer makes strategic decisions about the location, timing, and size for a proposed house building.
  Writing a paper. 1. Idea: A professor or a project investigator creates a research idea.
Role: Technician
  Building a house. 2. Floor plan: The builder or a hired expert blueprints the floor plan for the proposed house and prepares for actual building.
  Writing a paper. 2. Outline: The professor or a graduate student develops the outline for a paper, possibly with some help from others.
Role: Painter
  Building a house. 3. Details: The builder hires employees or independent contractors to work on the foundation, frame, window, cable, and painting.
  Writing a paper. 3. Detail: The professor or a graduate student works on the study along the outline and fills details in the paper.
• Phase (1) Location: Identifying the right community we like;
• Phase (2) Structure and floor plan: Finding a structure right for us; and
• Phase (3) Detail: Examining the quality (e.g., brick, paint, and landscape).
More importantly, we also found out that the sequence from location to floor plan then
to detail should be largely maintained. We wasted a lot of time in learning these basics. For
example, we found a house with a great floor plan and style but it was too isolated from
other communities. In other words, the location was inappropriate for us, so in the end we
just wasted our time. Overall, we found that the price of a house is determined by
factors related to the three phases in that ordered sequence. The most important single
factor in determining a house price is location, not the flowers in the front yard.
After understanding these basics and the market, our final decision was signing a contract
with a builder. When we signed the contract, he had just finished the concrete foundation for
the house. Over the next several months, we walked around the site every Saturday with
excitement and expectation to observe the whole building process. Because we had already
put a deposit on the house and would become the owners very soon, we watched all the
construction activities carefully.
We also had many conversations with the builder over the following months. He told me
that there were numerous things to worry about in running the business. First, he made a
decision about the location: where to buy a piece of land and build the house. Next, he selected
a design map and floor plan for the house. Finally, he went to the site every day to supervise
the workers he hired and to take care of the details and quality. Not surprisingly, those were
also the things that we as buyers had paid so much attention to in the shopping process.
4.1.2 An author’s and a reader’s perspective on a paper
A paper’s features can be revealed from several perspectives, as shown in Figure 4.1 A
comparison of an author’s work and a reader’s memory. Using the terminology of economics,
[Figure: An author’s work moves from (1) the research idea (one sentence; a few days or months) to (2) the outline (2 pages; several months) to (3) the details (30 pages; several weeks). A reader’s memory runs in the opposite direction: within several days a reader retains most of the details; after a year, only the outline or structure; after several years, only the research idea or topic.]
Figure 4.1 A comparison of an author’s work and a reader’s memory
an author is a paper producer and a reader is a paper consumer. Understanding these
differences can help us conduct scientific research efficiently. Specifically, in conducting an
empirical study, the process usually starts with a research idea of interest to a researcher.
Once the study design is finished, data can be collected and analyses can be conducted
to generate the key findings and set up the structure of the study. Finally, a manuscript
can be prepared, writing can be polished, and the results can be disseminated through
various outlets. Each stage (i.e., idea, structure, and detail) needs a different amount of time.
Sometimes it may take only a few seconds to have a brilliant idea, but overall, generating
ideas for scientific research requires long-term accumulation and diligent work. The structure
and key findings may take several months to set up and finalize. In the end, all the details
for a paper may take several weeks to be connected together.
Then, the paper is presented to readers, either your colleagues or a person you do not
know. Without any direct communication in general, how do readers understand what you
Chapter 5
Proposal and Design for Empirical Studies
Writing a proposal for money, publication, or degree is similar in many aspects. In
this chapter, the common requirements are presented and a number of keywords
are elaborated first. For the primary purpose of funding, two sample proposals, one
unfunded and the other funded, are analyzed with regard to their presentation skills. For
the publishing purpose, three sample empirical studies used in this book are illustrated (i.e.,
Sun et al., 2007; Wan et al., 2010a; Sun, 2011). At the end, constraints associated with
study design for graduate degrees are discussed and some suggestions are proposed to tackle
restrictions in time, financial resources, and experience.
5.1 Fundamentals of proposal preparation
A proposal can be prepared for a number of purposes, including money, publication, or
degree. Funding is an important and necessary resource for many scientific research projects
today, so this has been a primary purpose of proposal writing. Some economic research
may need limited investment in facilities and resources, so a proposal can be prepared with
the main purpose of publishing. In addition, a dissertation design by a graduate student
in economics can focus on degree and graduation time. Different purposes can lead to distinct emphases at the design stage. In this section, some common fundamentals of proposal
preparation are presented first.
5.1.1 Inputs needed for a great proposal
I summarize and aggregate the inputs needed for a great proposal into several items: great
presentation skills, quality time and solid commitment, and a compelling idea. First of all,
presentation skills help researchers demonstrate the merit of a proposed project. Grantsmanship is the art of obtaining grants and is broader than presentation skills. Presentation
skills for proposals are not only about pretty sentences, but also about some implicit or
explicit norms adopted in a community. I believe that all presentation skills can be learned
through several projects over a few years.
Here is an example of presentation skills I learned in the past. A typical problem for
young professionals is that too many objectives are listed and too much is promised in
a proposal. A teenage boy has the tendency to show maturity with some mustache or
muscle. A new investigator also has the tendency to show reviewers the magnificence and
broadness of the project. As a result, proposals written in this way begin to read more like
a lifetime career plan. Unfortunately, the fate of your proposal is controlled by some senior
and seasoned researchers in your field. Most of them will strongly doubt that an ambitious
and large list of objectives can be finished within a few years. This is such a simple and
easy-to-make mistake that many young professionals (including myself) pay heavy prices in
the learning process. The solution is very simple: just focus on two to three objectives in a
typical proposal. A single objective is usually too thin but any number bigger than three has
the risk of promising too much. Four or more objectives are recommended only when there
is really a solid justification behind the decision. For example, it is reasonable to have more
than three objectives when a multi-disciplinary project requests $50 million.
Quality time and commitment is the second necessary input to successful proposal preparation. This sounds so obvious but many researchers can easily forget it with a busy daily
schedule. Time is a basic input for any of our daily activities. Everyone has the same amount
of time per day; remember nobody has 25 hours a day or 400 days a year. A proposal with a
large request for financial support often needs a total of several months of time for preparation. Furthermore, the quality of time is even more important. If one only has some residual
and fragmented time for proposal preparation after other commitments, then the probability of writing an excellent proposal is low. Whether one prepares a proposal continuously
for several months or periodically over a long period, one needs quality time and solid
commitment in the process.
The final and also the most important input is a compelling research idea for a proposal.
Scientific merits of a research idea are the innovation and impact of a proposed research.
Training and experience can help researchers improve the quality of ideas over time. To
generate a high-quality research idea in economics, a solid training in basic courses such
as microeconomics, macroeconomics, and econometrics is necessary, which may take a few
years. To learn these economic theories or techniques better, students also need to participate
in some real research projects while taking these courses. For a specific area of interest
(e.g., international trade of a specific commodity), current literature should be searched and
digested to gain a broad and deep understanding of the research status. In addition, an idea
for economic research can be generated from various sources, such as newspaper, market
reports, or discussions. Scientific research and proposal writing are also an iterative process.
Researchers can improve and accumulate their knowledge in an area in the long term, and
ultimately, improve the quality of research ideas.
While training in economics can be helpful in improving idea quality, it should be pointed
out that training is less effective at improving idea quality than at improving the other inputs for proposal
preparation (e.g., presentation skills). Generating an idea with sound scientific merits is like
standing on the shoulders of giants. It requires tremendous creativity in a brain-storming
process. In other words, some talents are needed in generating a first-rate idea. This is
the artistic, not scientific, side of proposal preparation. By analogy, a person can become
very skilled in playing a piano by training over several years, but he may not become
a great pianist; another person can have beautiful steps on stage, but she may not be
perceived as a great dancer. Overall, the demand on creativity reflects the spirit of human
beings in exploring the unknown world. It is these challenges that make scientific research
so fascinating to many of us.
5.1.2 Keywords in a proposal
Regardless of specific purposes, there are a number of keywords or components that every
proposal needs to address clearly (Morrison and Russell, 2005). In evaluating a proposal,
reviewers also look for these keywords and try to understand if they are well defined and
connected. Thus, it is imperative that researchers spend their efforts in delineating these
keywords explicitly in the proposal. This can improve productivity in the preparation, and
if funding is sought, it also can increase the possibility of being funded. We explain these
keywords below one by one in greater detail. Some of them have already been presented in
Table 4.2 Main components in the proposal version of an empirical study on page 51.
Research issue or question: An issue should be identified and defined at the beginning.
The boundary or size of the issue should be appropriate, not too big or too small. It should
not be too big, vague, or ambitious, so the issue can be addressed in the proposed time
frame. It should not be too trivial or small either. By working on the issue, you can make
significant contributions to the area, not just repeat or marginally extend the work
already published by others. Thus, there is a delicate balance in defining the boundary of
an issue or problem. In general, the issue description will have to be revised many times
when other keywords of a proposal are developed and refined.
What is known or current knowledge status: What is known or what has been established
earlier by other researchers needs to be determined. Scientific research has become more
extensive as our society continues to evolve. A new research idea is likely related to existing
articles in some way. Therefore, a detailed review and analyses of the relevant literature are
required.
What is unknown or knowledge gap: This keyword of knowledge gap is closely related
to the previous keyword of current knowledge status. Thus, identifying the gap may seem like a trivial
task in generating a compelling idea. However, it is not. Understanding what has been achieved is
much easier than figuring out what is unknown. It requires independent and critical thinking
ability, which may take years to gain. My own habit is that after reading a published paper, I
instantly write down notes on its first page about its strengths and shortcomings. This forces
me to identify knowledge gaps and potential directions for future research. For a specific
proposal, the stated knowledge gap should be framed with an appropriate size, similar to
the principle for issue definition.
Study needs: The knowledge gap identified earlier may require several projects to address.
Study needs can be as big as, or smaller than, the knowledge gap identified. They are
supposed to be more relevant to a specific proposal or project. If current knowledge status
and knowledge gap for a selected issue are well stated in a proposal, the keyword of study
need is relatively easy to define.
Goals or objectives: For a large proposal (e.g., asking for one million dollars), the credentials and research records of principal investigators are an important factor for a funding
agency to consider in reducing the risk of failure. Thus, demonstrating that the proposed
project is within the long-term research activities and efforts of the researchers has become
a presentation skill widely employed. In that case, a long-term goal over many years is usually stated before specific objectives for the proposed project. For small projects, describing
a long-term goal is not necessary. Individual objectives should be defined in a way that
achieving them can meet the study need and fill some or all the knowledge gaps presented
earlier.
Methods: This is the key component of the proposal as it describes how to achieve the
objectives defined earlier. It should demonstrate enough innovation and capacity in solving
the problem. The degree of detail released can be affected by a number of factors. Often the
space allowed in a proposal is limited, so the section needs to be concise. Sometimes the detail
is not very clear at the time of proposal preparation, or investigators prefer not to present overly
detailed methods to avoid information leakage. Ultimately, the methods presented should
convince reviewers that they are sufficient for achieving the stated objectives.
Data sources: This is usually a straightforward part in a proposal, especially for empirical
Chapter 6
Reference and File Management
Have you ever felt that formatting the reference section of a manuscript is too tedious?
Have you ever been lost trying to find an article among several hundred documents
saved on your computer? If your answer is yes, then this chapter may help you
address these problems effectively. First of all, the need of literature management from the
perspective of researchers is analyzed. Then, some systematic strategies to reference and file
management are presented. In particular, EndNote and Mendeley have been two leading
products for Windows users in the market, and they are chosen to demonstrate how to accomplish various tasks for reference and file management. Overall, if the methods presented
in this chapter are followed, references and files can become well organized and research
productivity can be gained at the stages of study design and manuscript preparation.
6.1 Demand of literature management
There is a need to understand the demand of literature management before seeking a solution. Reference and file management has become a routine task for scientific research,
covering the whole process from research design to manuscript preparation. The number of
papers that needs to be managed is estimated first. Then, the demand is quantified as a list
of major tasks and goals.
6.1.1 Timing for literature management
Literature management for research includes both reference and file management. Reference
management and file/document management are closely related but two different tasks.
Reference management has been in demand for a long time because past studies need to
be cited in the text of a current manuscript, and a bibliography section should be provided
at the end for detail. Sometimes I read papers published about 50 years ago, and they still
contain reference information with a format similar to what is required today.
In contrast, electronic versions of published articles, especially in portable document
format (PDF), have only become widely available since the 1990s. Before that, printed
copies were obtained from a library and the only management task was to organize one’s
bookshelf. The number of scientific articles published has grown fast in recent years, and in
general they are published as a PDF document. At present, each published paper has some
reference information and a PDF copy. As a result, there has been an increasing need of
managing either hard or electronic copies of these files.
Reference and file management has become a routine job for researchers,
covering every step in the whole process. In particular, it is equally important for project
design and manuscript preparation, or strictly speaking, it is even more important for project
design to improve research productivity. As demand for reference management has existed for
a long time and it is mainly relevant to manuscript preparation, it creates a false impression
that what we cover in this chapter is useful for manuscript preparation only. However,
designing a research project needs deep and comprehensive understanding of what has been
achieved, i.e., the keyword of current knowledge status, as presented in Chapter 5 Proposal
and Design for Empirical Studies. Scientific research today should start with an efficient
management of reference and file information. That is why this topic is covered here in
Part II Economic Research and Design.
6.1.2 How many papers?
Let us first estimate how many papers we need to read and manage. An exceptional researcher can list as many as 700 publications over 30 years on his curriculum vitae (i.e.,
about one paper every two weeks), but I perceive that as an outlier. In economics, a moderately productive researcher can publish about three articles per year, resulting in about
100 articles for a whole career. Many economists can publish about 20 papers only during
their lifetime. At the low end, young professionals like PhD students may work on several
projects and publish fewer than two papers.
In conducting a specific project, the number of papers skimmed may be very large, e.g.,
searching and skimming papers at online databases. In this chapter, we focus on these papers
that one spends a good amount of time on reading. Assume a researcher needs to read 50
relevant papers in working on a specific project. The number of references cited is usually
smaller, e.g., 30 papers in a typical reference section. Thus, the maximum number that one
needs to manage during a career is about 5,000 papers (i.e., 50 × 100). This number can
be bigger for different disciplines, or if a researcher has cooperative research projects with
many colleagues. Thus, to be very conservative by doubling the amount of work, I believe
that the total number of papers that most researchers need to manage should be no more
than 10,000 papers in a whole career. At the low end, if a graduate student only includes
two or three projects in a dissertation, then the number of relevant papers is usually less
than 150.
Personally, I have accumulated about 3,000 papers on my computer after 15 years as a
professor. The size of individual documents is up to 15 megabytes (MB) per copy. In total,
their sizes are 2.3 gigabytes (GB) on my computer. Thus, with a simple extrapolation, the
size of 10,000 PDF papers should be about 8 GB. This is all manageable on modern personal
computers.
Obviously, if we have many research projects, then some special software is needed for
efficient literature management. If the number is very small, then the cost may be bigger
than the benefit from a special tool. The answer is vague if the number of papers is moderate.
For example, suppose a student has only one research project (e.g., a master’s thesis) and the student
is also very certain that he will not design or conduct any scientific research in the
future. Then the benefit of learning a new software application and applying the relevant
techniques on a small set of papers (e.g., 30 cited references) is just too small. Based on my
experience, I recommend that reference and file management be handled by special software
whenever one needs to manage over 100 references and related PDF files.
Table 6.1 Tasks and goals for reference and file management

Task                                                            Goal
Reference management
  Insert all needed citations in the text of a manuscript.     < 30 minutes
  Create a reference section in a manuscript.                  < 1 minute
  Format the citations and reference section in a manuscript.  < 1 hour
File management
  Reorganize old existing PDF files on a local drive.          200 – 300 files per day
  Organize all new PDF files related to a manuscript.          < 1 day
  Find a PDF file on a computer.                               < 10 seconds
  Add comments to a PDF file for record.                       < 10 minutes
  Match a PDF copy with a hard copy on a bookshelf.            < 1 minute
Note: Assume that a typical manuscript has 20 to 40 pages and contains 30 citations. A
researcher needs to prepare three such manuscripts and manage less than 300 new PDF
files every year, and will work on no more than 10,000 PDF documents in total during a
career. The symbol < means less than.
6.1.3 Tasks and goals
For each published article, there are many itemized pieces of information, e.g., author, article
title, year, journal title, volume and issue numbers, and page numbers. There are also many
small format requirements, e.g., capital letters, sequence of authors, and indentation. In addition, each article is usually available as a single PDF document, with possible appendixes.
As a result, the management of reference information is relatively tedious because of the
large number of items and format requirements, while the management of PDF files is more
straightforward. However, as the number of PDF files becomes larger (e.g., several hundred
on a local drive), the task can become very difficult without any software. Actually, many
researchers have been motivated to learn special reference software mainly because of the
anxiety and pressure related to file management.
Tasks of reference management focus on how to add references to a manuscript prepared
for publication. Tasks of file management are about how to organize electronic and hard
copies of articles in a way that they can be quickly identified and annotated. Common major
tasks are listed in Table 6.1 Tasks and goals for reference and file management.
Without any special software, one can manage references and files within applications
used for document preparation (e.g., Microsoft Word) and the operating system (e.g., Microsoft Windows). Personally, I can meet all the reference format requirements for one article
in less than one day of manual work. As I publish about three journal articles per year, all of
these tasks are feasible and tolerable. In contrast, managing several thousand PDF documents using the Windows Explorer application is very inefficient. As an indicator, it is often
hard to find a specific PDF document, and even if feasible, it takes many minutes for a
simple task.
With reference software, what are the cost and benefit? The cost is that one needs
to spend several days or weeks in learning a selected software product, depending on the
learning pace. The benefits are making fewer or even no mistakes in the reference section of
a manuscript, and more importantly, saving time for both reference and file management.
Assume that a typical manuscript has 30 citations, a researcher needs to prepare three such
Part III
Programming as a Beginner
Part III Programming as a Beginner: Five chapters are used to show how to conduct
an empirical study with predefined functions in R. The sample study used is Sun et al.
(2007), which employs a binary logit model to examine insurance purchase decisions.
Chapter 7 Sample Study A and Predefined R Functions (pages 101 – 127): The manuscript
and program versions of Sun et al. (2007) are analyzed first. How to conduct the study
through pull-down menus in R is briefly shown. R grammar and program formatting are
elaborated in detail.
Chapter 8 Data Input and Output (pages 128 – 151): An object and a function are explained first as two core concepts of R. The focus of this chapter is on the exchange of
information between R and a local drive, i.e., data inputs and outputs.
Chapter 9 Manipulating Basic Objects (pages 152 – 179): How to manipulate major R
objects by type is presented. The types covered are R operators, character string, factor,
date and time, time series, and formula.
Chapter 10 Manipulating Data Frames (pages 180 – 213): How to index a data frame is
presented first. Then common tasks related to data frames are addressed one by one. Methods
for data summary and aggregation are presented at the end.
Chapter 11 Base R Graphics (pages 214 – 249): The traditional graphics system available
in base R is presented. The four main inputs for generating an R graph are plotting
data, graphics devices, high-level plotting functions, and low-level plotting functions.
R Graphics • Show Box 3 • Survival of 2,201 passengers on the Titanic, which sank in 1912
[Figure: survival (Yes or No) cross-classified by passenger class (1st, 2nd, 3rd, Crew).]
R can visualize categorical variables. See Program A.3 on page 516 for detail.
Chapter 7
Sample Study A and Predefined R Functions
One of the principles used in this book is to learn the methods of conducting
an empirical study through a complete project. In this chapter, Sun et al. (2007)
is presented as the sample study for Part III Programming as a Beginner. This
simple empirical study is well suited for students to learn basic programming skills. The
manuscript version and statistics related to a binary choice model are presented first. Then
how to estimate the binary choice model by using pull-down menus is demonstrated with
R Commander. Finally, the program version of this study is displayed. The basic R syntax
and format requirements are explained in detail at the end.
7.1 Manuscript version for Sun et al. (2007)
In an empirical study, statistical outcomes are always the core. In the manuscript version
of an empirical study, these outcomes are often reported in the format of tables and figures
and then analyzed one by one. All other sections in the manuscript support the result
section, including methodology, literature review, introduction, discussion, and conclusion.
The manuscript version will grow up from a very basic skeleton to its final version, with the
guide from the proposal version and outputs from the program version. Conversely, creating
and expanding the program version needs the initial guide from the proposal version, and
furthermore, more specific decisions from the manuscript version. For example, table formats
for the final results are generally unspecified in a proposal version. Determining the number
and possible contents in these tables through a manuscript draft can provide practical guides
to data analyses and programming.
In general, the final manuscript version for a peer review journal should have a length
of around 25 to 35 pages in double line spacing, including a separate page for each table or
figure. Therefore, there will be a limit for the number of empirical results that one can report
in a typical journal article. For example, it rarely happens that a journal article contains 20
tables or more. Thus, it is a good habit to write down how many pages are planned for each
section at the beginning when you have a clear and cool mind. This will remind you what
page limits you prefer to have, so you will not waste your time overworking a specific
section, e.g., writing a very long literature review.
Below is the very first draft of the manuscript version for Sun et al. (2007). It is the
foundation of the final manuscript version and it also provides a guide to the programming
for this study. The corresponding proposal version is presented at Section 5.3.1 Design
with survey data (Sun et al. 2007) on page 70. The following manuscript draft seems like
a small variation of the proposal version, but actually, it is not. The first manuscript draft
should be much more practical than the proposal version, with a specific number of tables
and figures. This will provide a clear direction for programming and data analyses later on
within R.
The First Manuscript Version for Sun et al. (2007)
1. Abstract (200 words). Have one or two sentences for research issue, objective, methodology, data, and results.
2. Introduction (2 pages in double line spacing). One or two paragraphs for each of the
following items: research issue, what is known, what is unknown, knowledge gap and
study need, objective, and an overview of the manuscript structure.
3. Literature review (3 pages). Two subsections: (a) liability concerns of forest landowners
and incidents; and (b) liability insurance as a way to reduce liability.
4. Methodology (3 pages). Two subsections: (a) data and telephone survey; and (b) a binary logit model for liability insurance coverage. The methods employed are descriptive
statistics and a binary logit regression.
5. Empirical findings (3 pages of text, 4 pages of tables, and 1 page of a figure). Three
subsections: (a) pattern of injuries and damages; (b) pattern of liability insurance
coverage; and (c) results from the binary logit regression of liability insurance coverage.
Table 1. Recreational bodily injury and property damages
Table 2. Pattern of liability insurance coverage
Table 3. Definitions and means of variables in the logit model
Table 4. Results of the binary logit regression analysis of liability insurance coverage
Figure 1. Probability response curves for key determinants
6. Discussions (3 pages). Choose three to five key results from the empirical findings and
have a discussion.
7. References (3 pages). Cite about 30 papers.
end
Furthermore, within the manuscript draft, the contents of tables and figures should be
specified or designed before any programming can take place. Their contents may change
later on as a result of data analyses or mining. Nevertheless, predicting or designing these tables and
figures in advance will greatly improve programming efficiency. As an example, results of
the logit regression analysis in Sun et al. (2007) are reported as Table 4 in its final published
version. The very first version of this table is presented in Table 7.1 A draft table for the
logit regression analysis in Sun et al. (2007). This table seems like a blank table, but actually,
working on table drafts first is a key technique for improving data analysis efficiency. Note
some hypothetical numbers are put in each column to determine the width and number of
columns that are appropriate for a potential target publication outlet (i.e., Southern Journal
of Applied Forestry in this case). This is necessary because the table will be generated from
a program version directly later on.
Table 7.1 A draft table for the logit regression analysis in Sun et al. (2007)

Variable          Coefficient    t-ratio     Marginal effect    t-ratio
Constant          4.444          3.33***     0.666              7.777***
Injury
HuntYrs
Age
Race
...
Observations      1,700
Log-likelihood    −222.22
Chi-squared       55.55
Prediction        90%

7.2 Statistics for a binary choice model
A binary choice model describes a choice between two discrete alternatives, such as buying
a car or not. The dependent variable is represented by 1 and 0, corresponding to the alternatives available to decision makers. This type of model can be estimated through ordinary
least square, generalized least square, or maximum likelihood. Overall, binary choice models have been well covered in standard textbooks. In this section, formulas related to the
linear probability model and binary probit/logit model are presented. They will be used for
demonstration or programming exercises later on in the book. To be brief, derivations of
these formulas are not described here. Details are available in Judge et al. (1985), Baltagi
(2011), and Greene (2011).
7.2.1 Definitions and estimation methods
In Sun et al. (2007), a binary choice model is employed for an insurance purchase decision
by recreationalists. A binary choice model can be initiated or motivated in several ways. In
economics, there is an economic agent behind each observation, e.g., a hunter. Thus, several
theoretical models have been used to develop binary choice models, such as maximization
of expected utility or unobservable random index. Regardless of the underlying theoretical
model, the empirical model is always in the format of y = f (X), as conceptualized in
this book. For a binary choice model, the dependent variable only has two values, 1 or 0.
Therefore, the formula can be adapted as follows:
$$ \Pr(y_i = 1) = p_i = F(x_i\beta) \qquad (7.1) $$
$$ \Pr(y_i = 0) = 1 - p_i = 1 - F(x_i\beta) $$
where yi is the choice made by recreationalist i; P r(yi = 1), or pi in short, is the probability
of yi being 1; xi is a vector of K independent variables associated with recreationalist i and
its dimension is 1 × K; and β is the coefficient vector (K × 1).
The independent variables in Sun et al. (2007) are composed of three groups:
$$ x_i\beta = \beta_0 + C_i\beta_1 + S_i\beta_2 + T_i\beta_3 \qquad (7.2) $$
where βm (m = 0, 1, 2, 3) are parameter vectors to be estimated. In total, 12 variables are
included in three groups: recreational experience of a person (i.e., Ci = Injury, HuntYrs),
104
Chapter 7 Sample Study A and Predefined R Functions
license type (i.e., Si = Nonres, Lspman, Lnong), and socio-demographic characteristics (i.e.,
Ti = Gender, Age, Race, Marital, Edu, Inc, TownPop).
Equation (7.1) can be expressed more concisely with matrix notations (Judge et al.,
1985):
$$ \Pr(y = 1) = p = F(X\beta) \qquad (7.3) $$
$$ \Pr(y = 0) = 1 - p = 1 - F(X\beta) $$
where y is the vector of the dependent variable (N × 1); p is the
vector of probabilities; X is a matrix of independent variables (N × K); and N is the
total number of observations.
A binary choice model can be estimated in several ways, depending on the choice of
F (Judge et al., 1985). When estimated by ordinary least square, it is called the linear
probability model. As several properties of the linear probability model are unsatisfactory
(e.g., predicted values being out of the range of 0 to 1), it can be improved by generalized least
square. At present, the more popular approach for binary choice models is a binary probit
or logit model. In this book, we cover both the linear probability and binary probit/logit
models for the purpose of programming demonstration. Together, the general formulas for
these models are:
$$ F(X\beta) = X\beta $$
$$ F(X\beta) = \Phi(X\beta) = \int_{-\infty}^{X\beta} \phi(t)\, dt = \int_{-\infty}^{X\beta} \frac{e^{-t^2/2}}{\sqrt{2\pi}}\, dt \qquad (7.4) $$
$$ F(X\beta) = \Lambda(X\beta) = \frac{1}{1 + e^{-X\beta}} $$
where the first equation is for the linear probability model, the second one is for the binary
probit model, and the third one is for the binary logit model. The cumulative probability function
is denoted by the symbol Φ for the normal distribution and Λ for the logistic distribution. These
alternative models will be analyzed in detail below.
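As a quick sketch (an illustration added here, not code from Sun et al. 2007), the three choices of F in Equation (7.4) map directly onto R; pnorm() and plogis() are the built-in normal and logistic cumulative distribution functions.

F.linear <- function(xb) xb                   # linear probability model
F.probit <- function(xb) pnorm(xb)            # standard normal CDF, Phi
F.logit  <- function(xb) 1 / (1 + exp(-xb))   # logistic CDF, Lambda; same as plogis(xb)
xb <- c(-2, 0, 2)                             # a few values of the index X %*% beta
cbind(linear = F.linear(xb), probit = F.probit(xb), logit = F.logit(xb))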
There are various outputs from estimating a binary choice model. Like any regression
analysis, the main output is coefficient estimates. In addition, marginal effects can be calculated to measure the magnitude of the impact on the choice from each independent variable.
In the linear probability model, marginal effects are just the coefficient estimates. However,
the linear probability model is not the best model for handling binary choice data, and
marginal effects from the binary probit or logit model are perceived to be more accurate. As
the relation between dependent and independent variables in a binary probit or logit model
is nonlinear, calculating marginal effects and their standard errors is much more complicated.
7.2.2 Linear probability model
A linear model can be used to examine binary choice data, and this has been referred to as the
linear probability model. The major problem with this approach is that the actual observed
values of the choices are either 1 or 0, but the predicted values from a linear probability
model are continuous and can fall outside the interval between 0 and 1. Furthermore, the error
term is inherently heteroskedastic. As a result of these barriers, the linear probability model has
gradually faded away in handling binary choice data.
In this subsection, basic formulas related to the linear probability model are presented
here for completeness. A linear model can be estimated by various methods. Ordinary least
105
7.2 Statistics for a binary choice model
square and maximum likelihood estimators are representative and thus they are briefly presented here. These formulas will be used later in the book for programming demonstrations
and exercises related to binary choice models.
To begin, express a linear system in a matrix form as follows (Greene, 2011):
$$ y = X\beta + e \qquad (7.5) $$
where y and e are N × 1 vectors of the dependent variable and error term, respectively;
X is an N × K matrix of independent variables; β is a K × 1 coefficient vector; N is
the number of observations; and K is the number of independent variables. The dependent
variable is assumed to be independently and identically distributed with equal variance at
each observation, i.e., $E[ee'] = \sigma^2 I_N$.
Ordinary least square estimation for a linear model
The ordinary least square estimator of a linear system is well documented in statistics
textbooks (e.g., Greene, 2011). The key formulas are as follows:
$$ \hat{\beta} = (X'X)^{-1}X'y $$
$$ \hat{y} = X\hat{\beta} $$
$$ \hat{e} = y - \hat{y} \qquad (7.6) $$
$$ \hat{\sigma}^2 = (\hat{e}\,'\hat{e})/(N - K) $$
$$ \mathrm{cov}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1} $$
where $\hat{\beta}$, $\hat{y}$, and $\hat{e}$ are the estimated coefficients, fitted dependent variable, and estimated
residuals, respectively. The scalar value of $\hat{\sigma}^2$ is an unbiased estimator of $\sigma^2$. The covariance
matrix for the coefficients, i.e., $\mathrm{cov}(\hat{\beta})$, can be computed with the estimated variance $\hat{\sigma}^2$.
The prime symbol ($'$) denotes transposition of vectors or matrices, and the widehat symbol
($\hat{\ }$) indicates an estimated value.
ordinary least square can be computed next. Take the coefficient of determination (R2 ) as
an example. This is a measure of the goodness of model fit. Total variation in the dependent
variable of y can be measured as the deviation from its mean. This can be decomposed into
two components: one explained by the model and the other in the residuals. Mathematically,
the $R^2$ value can be computed as:
$$ R^2 = 1 - \frac{\hat{e}\,'\hat{e}}{(y - \bar{y})'(y - \bar{y})} \qquad (7.7) $$
or alternatively,
$$ R^2 = 1 - \frac{\hat{e}\,'\hat{e}}{y'y - N\bar{y}^2} \qquad (7.8) $$
where the scalar value of $\bar{y}$ denotes the average of the dependent variable.
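Continuing the ols() sketch above, the two expressions in Equations (7.7) and (7.8) give the same value and can be coded in one small helper; e is the residual vector and y the response.

r.squared <- function(y, e) {
  c(eq7.7 = 1 - sum(e^2) / sum((y - mean(y))^2),
    eq7.8 = 1 - sum(e^2) / (sum(y^2) - length(y) * mean(y)^2))
}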
Maximum likelihood estimation for a linear model
A normal (or Gaussian) distribution is a continuous probability distribution with two parameters. If a random variable z is normally distributed, its probability density function
can be expressed as (Greene, 2011):
$$ f(z \mid \mu, \sigma) = (2\pi\sigma^2)^{-1/2} \exp\!\left[-\frac{(z - \mu)^2}{2\sigma^2}\right] \qquad (7.9) $$
$$ f(z \mid 0, 1) = (2\pi)^{-1/2} \exp\!\left[-\frac{z^2}{2}\right] \qquad (7.10) $$
where parameter µ is the mean and parameter σ is the standard deviation of the distribution.
When µ = 0 and σ = 1, the distribution is called the standard normal distribution.
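A quick check (an added illustration) shows that Equation (7.9) reproduces R's built-in density function dnorm(); the numbers used below are arbitrary.

z <- 1.5; mu <- 2; s <- 3
(2 * pi * s^2)^(-0.5) * exp(-(z - mu)^2 / (2 * s^2))   # by Equation (7.9)
dnorm(z, mean = mu, sd = s)                            # built-in equivalent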
Assume that the dependent variable of yi (i = 1, . . . , N ) in Equation (7.5) is normally
and independently distributed with the mean of xi β and variance of σ 2 . Then, the maximum
likelihood principle can be employed to estimate the unknown parameters and variance of
the residuals, i.e., β and σ 2 . The joint probability density function of yi (i = 1, . . . , N ),
given the preceding mean and variance, can be expressed as: L = f (y1 , y2 , . . . , yN | Xβ, σ 2 ).
As the individual values of yi are independent, the joint probability density function can be
written as the product of N individual density functions as follows (Judge et al., 1985):
$$ L = f(y_1 \mid x_1\beta, \sigma^2)\,\cdots\,f(y_N \mid x_N\beta, \sigma^2) = \prod_{i=1}^{N} f(y_i \mid x_i\beta, \sigma^2) \qquad (7.11) $$
$$ f(y_i \mid x_i\beta, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\!\left[-\frac{(y_i - x_i\beta)^2}{2\sigma^2}\right] \qquad (7.12) $$
where in the second equation the probability density function of a normally distributed
variable yi is computed with the given mean and variance at each observation.
Under the maximum likelihood criterion, the parameter estimates are chosen to maximize
the probability of generating the observed sample. Using matrix notations, the likelihood
function for the linear model can be expressed as:
$$ L(\beta, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\!\left[-\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}\right] \qquad (7.13) $$
where the likelihood value becomes a function of β and σ 2 now. In general, the method of
maximum likelihood is applied on the log-likelihood function:
$$ \log L(\beta, \sigma^2) = -\frac{N}{2}\ln(2\pi\sigma^2) - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2} \qquad (7.14) $$
where log and ln both denote the natural logarithm. With this expression, routine optimization methods in a computer language like R can be used to estimate the unknown
parameters and variance.
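A minimal sketch of this idea is given below; the data are simulated only for illustration, and the general-purpose optimizer optim() minimizes the negative of Equation (7.14).

set.seed(1)
N <- 200
X <- cbind(1, runif(N))                    # design matrix with an intercept
y <- X %*% c(2, -1) + rnorm(N, sd = 0.5)   # simulated dependent variable

negLogLik <- function(par, y, X) {
  beta   <- par[1:ncol(X)]
  sigma2 <- exp(par[ncol(X) + 1])          # exp() keeps sigma^2 positive
  0.5 * length(y) * log(2 * pi * sigma2) +
    sum((y - X %*% beta)^2) / (2 * sigma2)
}
fit <- optim(c(0, 0, 0), negLogLik, y = y, X = X)
fit$par[1:2]     # ML estimates of beta, close to (2, -1)
exp(fit$par[3])  # ML estimate of sigma^2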
7.2.3 Binary probit and logit models
A number of continuous probability distribution functions have been employed to address
the shortcoming of the linear probability model (Greene, 2011). A binary probit model is
based on the standard normal distribution, and a binary logit model is based on the logistic
distribution. Others include the Weibull and complementary log-log models. The binary probit
and logit models are the most frequently used, so they are presented with more details here.
Estimating parameter values
The binary probit or logit model is nonlinear so maximum likelihood can be used to estimate
the parameters (Baltagi, 2011). The likelihood and log-likelihood functions can be expressed
as:
$$ L = \prod_{i=1}^{N} [F(x_i\beta)]^{y_i}\,[1 - F(x_i\beta)]^{1 - y_i} \qquad (7.15) $$
$$ \log L = \sum_{i=1}^{N} \Big\{ y_i \ln F(x_i\beta) + (1 - y_i) \ln[1 - F(x_i\beta)] \Big\} \qquad (7.16) $$
where F (.) is the cumulative distribution function, as expressed in Equation (7.4). In a
standard computer language, these distribution functions are well defined. In R, it is the
pnorm() function for the normal distribution, and the plogis() function for the logistic
distribution. Thus, for estimating parameter values only, the specific forms of these cumulative
distribution functions are not needed, but they are indeed needed later for computing the standard errors of marginal effects. Similar to the linear probability model, numerical
optimization techniques can be used to maximize the log-likelihood function and estimate
the unknown parameters and their variance.
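The following sketch (simulated data and illustrative variable names, not the data of Sun et al. 2007) maximizes Equation (7.16) for a logit model with optim() and cross-checks the result against the predefined function glm().

set.seed(2)
N <- 500
X <- cbind(1, rnorm(N), rbinom(N, 1, 0.5))         # intercept, a continuous, a dummy
y <- rbinom(N, 1, plogis(X %*% c(-0.5, 1, 0.8)))   # simulated 0/1 choices

logLik.bin <- function(beta, y, X) {
  p <- plogis(X %*% beta)                  # F(x_i beta); use pnorm() for a probit
  sum(y * log(p) + (1 - y) * log(1 - p))   # Equation (7.16)
}
fit <- optim(rep(0, 3), logLik.bin, y = y, X = X, control = list(fnscale = -1))
cbind(optim = fit$par,
      glm   = coef(glm(y ~ X - 1, family = binomial(link = "logit"))))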
Calculating predicted probabilities
Once the coefficients in a binary probit or logit model are estimated, the predicted probability at specific values of independent variables can be computed. This is similar to the
predicted values from a linear probability model. The difference is that binary probit or
logit models are nonlinear and usually contain dummy variables as independent variables,
which makes the computation more challenging and rewarding.
The simplest case is to compute the probability at the mean values of all variables:
$$ \Pr(y = 1 \mid \bar{X}) = \hat{p} = F(\bar{X}\hat{\beta}) \qquad (7.17) $$
where $\bar{X}$ contains the mean values of the set of independent variables, and other symbols are the
same as defined above. The matrix dimension is $1 \times K$ for $\bar{X}$ and $K \times 1$ for $\hat{\beta}$. Thus, the resulting
probability is a scalar value. This scenario is of little use in practical applications.
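Continuing the simulated logit example above, Equation (7.17) takes two lines in R.

xbar   <- colMeans(X)                    # mean values of the regressors
p.mean <- plogis(sum(xbar * fit$par))    # scalar probability at the means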
More useful probabilities from binary probit or logit models are computed by changing
the values of one or two selected independent variables while fixing
all other independent variables at their mean values. Specifically, there are three scenarios:
(i) one selected independent variable is continuous; (ii) one selected independent variable is
a dummy with the value of 1 or 0; and (iii) two independent variables are considered, with
one being a continuous variable and the other being a dummy variable.
The first scenario focuses on a continuous independent variable, e.g., hunting years in Sun
et al. (2007). The whole range of this variable can be divided equally into many intervals,
e.g., M = 300 for 70 hunting years. With this treatment, a new matrix $\bar{X}_{s=\mathrm{seq}}$ of
independent variables can be generated; the selected variable has the equally spaced values
and all the other variables have the fixed mean values. The probabilities computed have the
dimension of M × 1. The relation between the probabilities and the selected variable can be
revealed through a plot. Mathematically, this can be expressed as:
$$ \Pr(y = 1 \mid \bar{X}_{s=\mathrm{seq}}) = \hat{p} = F(\bar{X}_{s=\mathrm{seq}}\hat{\beta}) \qquad (7.18) $$
where $s$ is the selected continuous variable, with a sequence of values being assigned.
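A sketch of Equation (7.18) with the simulated logit example above: the continuous regressor (column 2 of that X) varies over its range while the other columns stay at their means.

M <- 300
s.seq  <- seq(min(X[, 2]), max(X[, 2]), length.out = M)
Xbar.s <- matrix(colMeans(X), nrow = M, ncol = ncol(X), byrow = TRUE)
Xbar.s[, 2] <- s.seq                             # replace the selected column
p.seq <- plogis(Xbar.s %*% fit$par)              # M x 1 vector of probabilities
plot(s.seq, p.seq, type = "l", xlab = "Selected variable", ylab = "Pr(y = 1)")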
In the above treatment, new values for the selected variable are generated over its whole
range with an equal interval. Can we use the original values of this variable, e.g., hunting
years in Sun et al. (2007)? The answer is yes. The resulting curve will have a similar shape,
but it may not be smooth because the original values of the selected variable are not equally
distributed over its whole range. For practical applications, the interest is usually on the
visual relation between the probability and a selected continuous variable through a graph.
Thus, equally-spaced values of the selected variable over its whole range are used.
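As an illustration of Equation (7.18), a minimal sketch is given below. It reuses the fitted
object ra from Program 7.2 (created with x = TRUE, so the model matrix is stored in ra$x);
the object names Xseq, sq, and prob are illustrative. The erer function maTrend(), used later
in Program 7.2, performs a similar computation.

xm   <- colMeans(ra$x)                     # mean values of all regressors
sq   <- seq(from = min(ra$x[, 'HuntYrs']),
            to = max(ra$x[, 'HuntYrs']), length.out = 300)
Xseq <- matrix(xm, nrow = 300, ncol = length(xm), byrow = TRUE,
  dimnames = list(NULL, names(xm)))
Xseq[, 'HuntYrs'] <- sq                    # selected variable varies; others fixed at means
prob <- plogis(Xseq %*% coef(ra))          # M x 1 vector of predicted probabilities
plot(sq, prob, type = 'l', xlab = 'HuntYrs', ylab = 'Probability')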
The second scenario focuses on a dummy independent variable. As a dummy independent
variable can only take the value of either 1 or 0, the above strategy for a continuous variable
does not work anymore. Instead, two probability values need to be generated:
\Pr(y = 1 \mid \bar{X}_{d=1}) = F(\bar{X}_{d=1}\hat{\beta})

\Pr(y = 1 \mid \bar{X}_{d=0}) = F(\bar{X}_{d=0}\hat{\beta})                                    (7.19)

M_d = \Delta\hat{F} = F(\bar{X}_{d=1}\hat{\beta}) - F(\bar{X}_{d=0}\hat{\beta})
where d is the selected dummy variable. In X̄_{d=1}, the selected dummy variable takes the
value of 1 and the other independent variables take their mean values. X̄_{d=0} is similarly
defined, with the dummy variable taking the value of 0. The difference between the two
probability values, M_d, is also known as the marginal effect of the dummy independent
variable d.
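A minimal sketch of Equation (7.19) for the dummy variable Nonres follows, again reusing
the fitted object ra from Program 7.2; the object names are illustrative.

xm <- colMeans(ra$x)
x1 <- x0 <- xm
x1['Nonres'] <- 1; x0['Nonres'] <- 0
p1 <- plogis(sum(x1 * coef(ra)))   # probability with Nonres = 1, others at means
p0 <- plogis(sum(x0 * coef(ra)))   # probability with Nonres = 0, others at means
md <- p1 - p0                      # marginal effect of the dummy variable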
The third scenario is to combine the previous two scenarios, with both a continuous
variable and a dummy variable included. For example, in Sun et al. (2007), the effects of
hunting years (s = HuntY rs) and residential status of a recreationalist (d = N onres)
on the probability of liability insurance purchase are assessed through three curves in one
plot. The probability series for the selected continuous variable only can be computed with
Equation (7.18). In combining the two variables together, two additional probability series
can be constructed as follows (Greene, 2011):
\Pr(y = 1 \mid \bar{X}_{s=seq,\, d=1}) = F(\bar{X}_{s=seq,\, d=1}\hat{\beta})

\Pr(y = 1 \mid \bar{X}_{s=seq,\, d=0}) = F(\bar{X}_{s=seq,\, d=0}\hat{\beta})                  (7.20)

where X̄_{s=seq, d=1} is a further modification of X̄_{s=seq}, with the column value for the dummy
variable being set to 1. X̄_{s=seq, d=0} is similarly defined.
Note that in Equation (7.19), the difference between two probability values for a
selected dummy variable is interpreted as the marginal effect at the mean values for all
other independent variables. In Equation (7.20), the continuous variable does not take
the mean value, but varies over its whole range. Thus, the difference between the two series
in Equation (7.20) is the marginal effect of the dummy variable over the whole range of the
continuous variable. In a graph, the vertical difference between the two series at the mean
value of the continuous variable should equal the value calculated from Equation (7.19).
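A short sketch of Equation (7.20) is given below; it reuses Xseq, sq, and prob from the earlier
sketch and produces three curves in one plot, similar to Figure 1 in Sun et al. (2007).

X1 <- X0 <- Xseq
X1[, 'Nonres'] <- 1; X0[, 'Nonres'] <- 0
pr1 <- plogis(X1 %*% coef(ra))             # series with Nonres fixed at 1
pr0 <- plogis(X0 %*% coef(ra))             # series with Nonres fixed at 0
matplot(sq, cbind(prob, pr1, pr0), type = 'l', lty = 1:3,
  xlab = 'HuntYrs', ylab = 'Probability')  # three curves in one plot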
Calculating marginal effects
The marginal effect of continuous independent variables from a binary probit or logit model
can be calculated as (Greene, 2011):
M = \frac{\partial E[y \mid X]}{\partial X}
  = \frac{\partial F(X\hat{\beta})}{\partial (X\hat{\beta})} \,
    \frac{\partial (X\hat{\beta})}{\partial X}
  = f(X\hat{\beta})\,\hat{\beta} = f(\hat{y})\,\hat{\beta}                                     (7.21)
where M denotes the marginal effects of X, F is the cumulative distribution function, and f is
the probability density function. The scale factor f(Xβ̂) can change with the value of
X. For the linear probability model, the scale factor is one, so M = β̂.
For the probit or logit model, the scale factor, and thus the marginal effect M, can
be calculated in two ways: M = f(X̄β̂)β̂ or M = f̄(Xβ̂)β̂. Specifically, one way is to use
the mean values of X. The other way is to calculate the scale factor for each observation,
take the average of all the scale factor values, and compute the marginal effects at the
end. In practice, the two approaches may generate similar results (Greene, 2011).
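The two scale-factor choices can be sketched as follows, again reusing ra from Program 7.2;
the erer function maBina(), used in Program 7.2, reports marginal effects of this kind directly,
so this sketch only shows the mechanics.

b  <- coef(ra)
xm <- colMeans(ra$x)
me.mean <- dlogis(sum(xm * b)) * b           # scale factor evaluated at the means of X
me.avg  <- mean(dlogis(ra$x %*% b)) * b      # average scale factor over all observations
cbind(me.mean, me.avg)                       # compare the two sets of marginal effects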
Marginal effects for dummy independent variables can be calculated in the same way as
for continuous variables. Theoretically, however, the concept of derivative is more accurate
for a small change. A dummy variable takes the discrete value of 1 and 0 only. Thus, a more
appropriate marginal effect formula for a binary independent variable is the difference of predicted probabilities associated with the status of 1 and 0, as presented in Equation (7.19)
(Greene, 2011). In practice, however, treating a dummy variable as if it were continuous
usually generates only very small differences for binary choice models.
Conceptually, predicted probabilities and marginal effects are closely related to each
other. For a continuous variable, the marginal effect is the change of the predicted probabilities over a small change of the focal variable. In a graph created from Equation (7.18),
the marginal effect of a continuous independent variable is the slope of the corresponding
predicted probability curve. For a dummy independent variable, its marginal effect is the
change of the probabilities between two statuses (1 versus 0), which can be revealed by a
graph from Equation (7.20). These relations are revealed well through Figure 1 in Sun
et al. (2007) for the combined effects of one continuous variable and another dummy variable.
Finally, the data set used for calculating the marginal effect can be the whole data
set (X), or a subset only. Subsetting on the data is generally applied through a dummy
variable. For example, one independent variable in Sun (2006a) is the party affiliation of a
house representative or a senator in the United States, with the value of 0 for Democrats
and 1 for Republicans. This dummy independent variable can be used to split the whole
data set into two. Then marginal effects and standard errors can be calculated for each data
set, with the same coefficient and covariance matrices estimated from the whole data set.
The formulas for all the computation are still the same.
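For example, the subsetting idea can be sketched as below. The objects xg (a design matrix),
b (full-sample coefficients), and the column name Party are illustrative assumptions used to
mimic the Sun (2006a) example; they are not objects from Program 7.2.

sub0 <- xg[xg[, 'Party'] == 0, ]             # observations for Democrats
sub1 <- xg[xg[, 'Party'] == 1, ]             # observations for Republicans
me0  <- mean(dlogis(sub0 %*% b)) * b         # marginal effects for the Democrat subset
me1  <- mean(dlogis(sub1 %*% b)) * b         # marginal effects for the Republican subset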
Standard errors for predicted probabilities and marginal effects
The delta method can be used to compute standard errors for predicted probabilities and
marginal effects (Baltagi, 2011). First, denote predicted probabilities as p̂ = F(X̄β̂), where
the values of X̄ can change for different needs, e.g., Equations (7.18) to (7.20). Its
asymptotic covariance matrix can be derived as:
\frac{\partial \hat{p}}{\partial \hat{\beta}}
  = \frac{\partial F(\bar{X}\hat{\beta})}{\partial \hat{y}} \,
    \frac{\partial \hat{y}}{\partial \hat{\beta}}
  = f(\bar{X}\hat{\beta})\,\bar{X} = \hat{f}\,\bar{X}                                          (7.22)

\mathrm{cov}(\hat{p})
  = \frac{\partial \hat{p}}{\partial \hat{\beta}}\,\hat{V}\,
    \left( \frac{\partial \hat{p}}{\partial \hat{\beta}} \right)'                              (7.23)
where V̂ is the estimated asymptotic covariance matrix of β̂, and ŷ = X̄β̂. Note that f(X̄β̂) is
the derivative of the vector F(X̄β̂) with respect to another vector ŷ. Assume the length of
F(X̄β̂) and ŷ is M × 1. Then, by the rules of matrix calculus, f(X̄β̂) is an M × M matrix, and
in this particular case, it is also a diagonal matrix.
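A sketch of Equations (7.22) and (7.23) for the probability series of Equation (7.18) is shown
below, reusing Xseq and ra from the earlier sketches; vcov() extracts the estimated covariance
matrix of the coefficients from a glm object.

b    <- coef(ra)
V    <- vcov(ra)
ff   <- as.vector(dlogis(Xseq %*% b))        # density values, one per row of Xseq
grad <- diag(ff) %*% Xseq                    # M x K matrix of derivatives, Equation (7.22)
covp <- grad %*% V %*% t(grad)               # M x M covariance matrix, Equation (7.23)
se.p <- sqrt(diag(covp))                     # standard errors of the predicted probabilities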
The marginal effect of a dummy variable is defined in Equation (7.19) as M_d = ΔF̂.
Its asymptotic covariance matrix can be expressed as:

\frac{\partial M_d}{\partial \hat{\beta}} = \hat{f}_1 \bar{X}_{d=1} - \hat{f}_0 \bar{X}_{d=0}   (7.24)

\mathrm{cov}(M_d)
  = \frac{\partial M_d}{\partial \hat{\beta}}\,\hat{V}\,
    \left( \frac{\partial M_d}{\partial \hat{\beta}} \right)'                                  (7.25)

where f̂_1 = f(X̄_{d=1}β̂) is related to the matrix X̄_{d=1}, as defined in Equations (7.19)
to (7.22); f̂_0 is similarly defined.
The marginal effect of continuous independent variables is M = f(Xβ̂)β̂ = f(ŷ)β̂, as
presented in Equation (7.21). Its asymptotic covariance matrix can be expressed as:

\frac{\partial M}{\partial \hat{\beta}}
  = \hat{f} I + \hat{\beta}\,\frac{d\hat{f}}{d\hat{y}}\,\frac{\partial (X\hat{\beta})}{\partial \hat{\beta}}
  = \hat{f} I + \frac{d\hat{f}}{d\hat{y}}\,\hat{\beta} X'
  = \hat{f}\left( I + w\,\hat{\beta} X' \right)                                                (7.26)

\mathrm{cov}(M)
  = \frac{\partial M}{\partial \hat{\beta}}\,\hat{V}\,
    \left( \frac{\partial M}{\partial \hat{\beta}} \right)'                                    (7.27)
where the product rule for matrix calculus needs to be applied in the first equation. In
addition, it can be verified that df̂/dŷ = −ŷ f̂ for the standard normal distribution, and
df̂/dŷ = (1 − 2F̂) f̂ for the logistic distribution. After combining terms, the difference between
a probit model and a logit model for the purpose of programming is in the scale factor: w = −ŷ for
the probit model and w = 1 − 2F̂ for the logit model.
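The following sketch evaluates Equations (7.26) and (7.27) at the mean values of X for the
logit case, again reusing ra from Program 7.2. The erer function maBina(), used in
Program 7.2, reports marginal effects and their t-ratios directly, so this sketch only connects
the formulas with the code.

b    <- coef(ra)
V    <- vcov(ra)
xm   <- colMeans(ra$x)                       # mean values of X
yhat <- sum(xm * b)                          # the scalar index at the means
fhat <- dlogis(yhat)                         # scale factor f
w    <- 1 - 2 * plogis(yhat)                 # logit case; use w = -yhat for probit
K    <- length(b)
dMdb <- fhat * (diag(K) + w * b %*% t(xm))   # K x K derivative matrix, Equation (7.26)
covM <- dMdb %*% V %*% t(dMdb)               # covariance matrix, Equation (7.27)
se.M <- sqrt(diag(covM))                     # standard errors of the marginal effects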
7.3 Estimating a binary choice model like a clickor
As shown in Section 2.4 Playing with R like a clickor on page 22, using pull-down menus for
statistical analyses is intuitive when an application such as the R Commander is adopted.
However, the long-term costs are much greater than the benefits. In this section, a simplified
analysis for Sun et al. (2007) is presented to demonstrate the characteristics of this approach.
The R Commander is used to walk through the whole process again, with the emphasis
on real data and a meaningful model specification.
7.3.1 A logit regression by the R Commander
The data set used for this demonstration was collected through a telephone survey as reported in Sun et al. (2007). In the end, 57% of the participants completed the phone interview
successfully and the final data set contained 1,653 observations. The data set was saved in
two different formats: RawDataIns2.csv and RawDataIns2.xls. They contained the same
information with 1,653 rows and 14 columns. The column header corresponded to the variable names: Y, Injury, HuntYrs, Nonres, Lspman, Lnong, Gender, Age, Race, Marital, Edu,
Inc, TownPop, FishYrs, where Y was a binary dependent variable and the other 13 were
independent variables (Sun et al., 2007). Among the 13 independent variables, HuntYrs and
FishYrs were highly correlated, so only one of them was used in the regression.
The goal here is to demonstrate the characteristics of using pull-down menus for one’s
own data. Thus, only some selected steps are executed. More complete statistical analyses
will be shown later on through the program version for Sun et al. (2007). Specifically, the
selected steps are:
Figure 7.1 Importing external Excel data in the R Commander
1. Import the data from a local drive by a software application;
2. Generate basic descriptive statistics of the data (e.g., mean and standard deviation);
3. Conduct a binary logit regression with 12 independent variables: Injury, HuntYrs,
Nonres, Lspman, Lnong, Gender, Age, Race, Marital, Edu, Inc, TownPop; and
4. Estimate the logit regression again with the variable of HuntYrs replaced by FishYrs.
The first step of the exercise is to import data from the local drive. After the R Commander
is loaded in an R session with the command library(Rcmdr), data can be imported through
the menu of Data ⇒ Import data ⇒ from Excel. Navigate to the file RawDataIns2.xls on a
local drive. If you have followed Step 6 in Section 2.2.1 Recommended installation steps on
page 17, then this file should be saved on your computer as C:/aErer/RawDataIns2.xls.
Finally, clicking the View data set button will display the data set. The main interface and
relevant menu items are shown in Figure 7.1 Importing external Excel data in the R Commander.
Note that the commands associated with each click appear in the R Commander interface.
The second step is to generate summary statistics. This is produced by following the
menu of Statistics ⇒ Summaries ⇒ Active data set. The third step is to fit a binary
logit model. This is similar to the above step by using the menu of Statistics ⇒ Fit
models ⇒ Generalized linear model. Variables can be selected within a new window.
In the final step, the logit model is fitted again by changing one explanatory variable. Some
partial results from the binary logit model fit are shown as an example in Figure 7.2 Fitting
a binary logit model in the R Commander.
7.3.2 Characteristics of clickors
Going through a simple example like the above with pull-down menus is an experience every
student should have. If you have not done so, play with it for a while; then you can understand
the benefits and costs of this approach.

On the positive side, pull-down menus are self-explanatory and easy to use. This may
be especially attractive if a group of software users is diverse and unlikely to perform any
complicated actions. For statistical analyses in economics, this approach may have some
value for undergraduate education: it inspires students and gets them engaged in the learning
process. In addition, this approach can also be useful if one is interested only in using a
specific procedure or function in a software product for a regression, but has no interest in
learning the whole software. For example, the software LIMDEP has great procedures for
multinomial logit models; one can just use the pull-down menu to run the regression without
reading the thick user guide.

Figure 7.2 Fitting a binary logit model in the R Commander
On the negative side, using pull-down menus for statistical analyses has many drawbacks.
The major problem is that using pull-down menus does not keep a complete track of what
has been tried and done. While all the outputs in the window can be saved, it is not easy
to edit or annotate them. Without a record, communication between group members of a
project is difficult, if not impossible. After a project has been finished for a while, it is
difficult to answer clearly the questions of how, what, and when. In my experience, this
happened often with graduate students in the past. Related to the lack of a record,
the clicking approach is prone to errors. It is true that sometimes a small error
in a computer program can be hard to detect and result in severe consequences. However,
the clicking approach makes error identification even more challenging, because very few
clues are available to users after a few months. In addition, note that steps three and four in
the above example differ by only one independent variable. In a computer program,
commands can be copied, pasted, and modified in just a few seconds. Finally, without a full
program version for an empirical study, the analysis on a data set may be fragmented and
disorganized.
Overall, clicking with pull-down menus is an approach that should be discouraged for
empirical analysis in economics in the long term. As computer programming has become
so easy to learn, I believe that everyone should learn a computer language and build up
programming skills gradually.
7.4 Program version for Sun et al. (2007)
Program 7.1 The first program version for Sun et al. (2007) contains the major steps to
generate outputs in table or figure formats. The structure is set up with the information
and guidance from the proposal version and the first manuscript version of this project. While
this seems straightforward and looks like a direct copy of the first manuscript version, it
does set up the framework on which a program version can be built up gradually. On the basis
of this draft, more content will be added and the final program version can be developed.
In general, a program version should be able to generate and reproduce all the tables
and figures reported in a manuscript. To save space here, the program version for Sun et al.
(2007) as drafted in Program 7.1 is reduced, with Tables 1 and 2 in the original publication
being excluded. The reason for this simplification is that we need a lean program to explain
basic R grammar later on in this chapter.
Program 7.2 The final program version for Sun et al. (2007) contains detailed R commands. Note the line numbers at the left side are not part of the program, but an aid
for reference. Running this program in R can reproduce the key results reported in the
manuscript version within one minute. The tables and figures generated from this program
are still a little bit different from the ones published in Sun et al. (2007). Table 4 is copied
below to show its appearance in the R console. Once the results are saved on a local drive,
tables can be copied to a word processor for final formatting (e.g., Microsoft Word), and
figures should be copied without any further formatting. The amount of formatting work
on tables is usually small (e.g., less than 1%). If one prefers to have tables finalized in R
completely, then a combination of R for the program version and LaTeX for the manuscript
version is needed.
The challenge for conducting an empirical study is to start with the simple skeleton
as listed in the first program version, add commands to generate the expected tables and
figures gradually, and at the end format the program version in a professional way. To do
programming efficiently, one needs to learn a computer language step by step. In that regard,
the main goal of all three parts about programming in the middle of this book is to
teach students how to move efficiently from Program 7.1 to Program 7.2.
Program 7.1 The first program version for Sun et al. (2007)
 1  # Title: R program for Sun et al. (2007 SJAF)
 2  # Date: January 2006
 3
 4  # 0. Libraries and global setting
 5  # Load some libraries; Set up working directory
 6
 7  # 1. Import raw data in csv format
 8  # 2. Descriptive statistics
 9  # Generate Table 3
10
11  # 3. Logit regression and figures
12  # 3.1 Logit regression
13  # 3.2 Marginal effect
14  # Generate Table 4
15
16  # 3.3 Figures: probability response curve
17  # Show and customize one graph on screen device
18  # Save three graphs on file device (Figure 1a, 1b, 1c)
19
20  # 4. Export results in tables
Program 7.2 The final program version for Sun et al. (2007)
 1  # Title: R program for Sun et al. (2007 SJAF)
 2  # Date: January - May 2006
 3
 4  # -------------------------------------------------------------------------
 5  # Brief contents
 6  # 0. Libraries and global setting
 7  # 1. Import raw data in csv format
 8  # 2. Descriptive statistics
 9  # 3. Logit regression and figures
10  # 4. Export results
11
12  # -------------------------------------------------------------------------
13  # 0. Libraries and global setting
14  library(erer)  # functions: bsTab(), maBina(), maTrend()
15  wdNew <- 'C:/aErer'  # Set up working directory
16  setwd(wdNew); getwd(); dir()
17
18  # -------------------------------------------------------------------------
19  # 1. Import raw data in csv format
20  daInsNam <- read.table(file = 'RawDataIns1.csv', header = TRUE, sep = ',')
21  daIns <- read.table(file = 'RawDataIns2.csv', header = TRUE, sep = ',')
22  class(daInsNam); dim(daInsNam); print(daInsNam); class(daIns); dim(daIns)
23  head(daIns); tail(daIns); daIns[1:3, 1:5]
24
25  # -------------------------------------------------------------------------
26  # 2. Descriptive statistics
27  (insMean <- round(x = apply(X = daIns, MARGIN = 2, FUN = mean), digits = 2))
28  (insCorr <- round(x = cor(daIns), digits = 3))
29  table.3 <- cbind(daInsNam, Mean = I(sprintf(fmt="%.2f", insMean)))[-14, ]
30  rownames(table.3) <- 1:nrow(table.3)
31  print(table.3, right = FALSE)
32
33  # -------------------------------------------------------------------------
34  # 3. Logit regression and figures
35  # 3.1 Logit regression
36  ra <- glm(formula = Y ~ Injury + HuntYrs + Nonres + Lspman + Lnong +
37                        Gender + Age + Race + Marital + Edu + Inc + TownPop,
38            family = binomial(link = 'logit'), data = daIns, x = TRUE)
39  fm.fish <- Y ~ Injury + FishYrs + Nonres + Lspman + Lnong +
40    Gender + Age + Race + Marital + Edu + Inc + TownPop
41  rb <- update(object = ra, formula = fm.fish)
42  names(ra); summary(ra)
43  (ca <- data.frame(summary(ra)$coefficients))
44  (cb <- data.frame(summary(rb)$coefficients))
45
46  # 3.2 Marginal effect
47  (me <- maBina(w = ra))
48  (u1 <- bsTab(w = ra, need = '2T'))
49  (u2 <- bsTab(w = me$out, need = '2T'))
50  table.4 <- cbind(u1, u2)[, -4]
51  colnames(table.4) <- c('Variable', 'Coefficient', 't-ratio',
52    'Marginal effect', 't-ratio')
53  table.4
54
55  # 3.3 Figures: probability response curve
56  (p1 <- maTrend(q = me, nam.d = 'Nonres', nam.c = 'HuntYrs'))
57  (p2 <- maTrend(q = me, nam.d = 'Nonres', nam.c = 'Age'))
58  (p3 <- maTrend(q = me, nam.d = 'Nonres', nam.c = 'Inc'))
59
60  # Show one graph on screen device
61  windows(width = 4, height = 3, pointsize = 9)
62  bringToTop(stay = TRUE)
63  par(mai = c(0.7, 0.7, 0.1, 0.1), family = 'serif')
64  plot(p1)
65
66  # Save three graphs on file device
67  fname <- c('OutInsFig1a.png', 'OutInsFig1b.png', 'OutInsFig1c.png')
68  pname <- list(p1, p2, p3)
69  for (i in 1:3) {
70    png(file = fname[i], width = 4, height = 3,
71      units = 'in', pointsize = 9, res = 300)
72    par(mai = c(0.7, 0.7, 0.1, 0.1), family = 'serif')
73    plot(pname[[i]])
74    dev.off()
75  }
76
77  # -------------------------------------------------------------------------
78  # 4. Export results
79  write.table(x = table.3, file = 'OutInsTable3.csv', sep = ',')
80  write.table(x = table.4, file = 'OutInsTable4.csv', sep = ',')
Note: Major functions used in Program 7.2 are setwd(), getwd(), dir(), read.table(),
class(), dim(), print(), head(), tail(), round(), mean(), apply(), data.frame(),
nrow(), rownames(), glm(), update(), names(), summary(), cbind(), colnames(), par(),
windows(), plot(), png(), dev.off(), write.table(), maTrend(), bsTab(), maBina(),
and bringToTop().
# Selected results from Program 7.2
> table.4
      Variable Coefficient t-ratio Marginal effect t-ratio
1  (Intercept)   -3.986***  -5.514       -0.519***  -5.867
2       Injury       0.245   0.466           0.032   0.466
3      HuntYrs     0.014**   2.402         0.002**   2.412
4       Nonres      0.761*   1.910           0.121   1.613
...
11         Edu      -0.010  -0.328          -0.001  -0.328
12         Inc      0.004*   1.867          0.001*   1.873
13     TownPop       0.002   1.029           0.000   1.030
7.5 Basic syntax of R language
There are 80 lines in Program 7.2 The final program version for Sun et al. (2007). It does
not take much time to understand the basic structure and a large portion of the program
version even if you have never used R before. To facilitate understanding, let us compare
English as a language for a manuscript version and R as a computer language for a program
version from several aspects: section, paragraph, sentence, and word, as summarized in
Table 7.2 A comparison of grammar rules between English and R.
7.5.1 Sections and comment lines
In a manuscript version, we use titles such as “Introduction” and “Conclusion” to name
sections. Usually a published ten-page manuscript version of an empirical study can have five
to seven sections. Similarly, we can have section titles in a program version in R. Specifically,
the # sign causes anything after it on the same line to be treated as a comment, not a
command. This allows us to make notes or comments
on the whole program in a very flexible and informative way. For instance, among the 80
lines in Program 7.2, there are 24 complete comment lines that start with # (i.e., 30%).
Most of the comments are created with the first draft of the program, and some are added
when the program is improved.
To what degree one prefers to comment on a program is largely a personal choice.
Nevertheless, there must be some basic comments in a program version to facilitate reading.
You may be surprised how fast you forget what you have done a month ago. You may also be
surprised how many researchers do not have any comment in a program at all. In the past, I
read a few R programs from colleagues without any section separator. These programs can
be over 1,000 lines or 20 pages long (i.e., about 50 lines per page). To me, that is like a long
manuscript in English without section titles. Based on these experiences, I strongly suggest
that young professionals create a small comment block like "Brief contents" as soon as an
R program is first created. That will remind and force one to organize the program in a
logical way.
Comments in R must start with the # sign, either at the beginning of a line, or in the
middle of a line as needed (e.g., line 14). R does not have any symbol for block comments.
In contrast, SAS, for instance, has a pair of symbols for a block of comments, so anything
between /* and */ is treated as comments. With a single click, editors like Tinn-R allow
users to select a block of lines and add a # sign to the beginning of each line simultaneously.
Thus, the lack of a block comment symbol in R is not a big drawback.
The comment sign can also be used in testing command lines. When one needs to exclude
several command lines in the middle of command blocks, one can just add the comment
sign to the beginning of these lines. When these commands need to be included later on, the
comment sign # can be removed. In computer programming, these actions are often called
"commenting it in" (i.e., removing the comment sign and turning a line back into a command)
or "commenting it out" (i.e., adding the comment sign to a line and removing it from
command status).
7.5.2 Paragraphs and command blocks
A paragraph should always have a central idea. Paragraphs in a manuscript are indicated
by blank lines, indention at the beginning, or some white spaces at the end. Similarly, blank
lines can be inserted in a program version to achieve the same effect. This divides a long
program into many blocks or paragraphs. In most cases, without further detailed comments
Table 7.2 A comparison of grammar rules between English and R

Item        English                                    R Language
Section     Section title                              Comment lines starting with #; section
                                                       numbers; dashed lines or similar symbols
                                                       as separators
Paragraph   Blank lines or indention                   Blank lines for code blocks
Sentence    Period, question mark, or similar          End of a line without any special symbol;
            punctuation                                end of a pair of parentheses; operators
Word        Any word in a dictionary                   Some names reserved for internal functions;
                                                       flexible for user-defined names; case
                                                       sensitive; a period allowed in a name
on these paragraphs, readers can understand the central purpose of a block well. Thus, do
not underestimate the benefit of blank lines. There are 11 blank lines out of the 80 lines
(e.g., line 3) in Program 7.2, or 14% of the total. Combining the comment and blank lines together, that
is 44% in this program. In other words, without learning any R function, you can understand
44% of the program version for Sun et al. (2007).
Comment and blank lines are simple but critical for composing an R program version efficiently. They allow researchers to organize the analysis into blocks and solve the problem step
by step. Some software applications, e.g., R Commander, generate commands in the output
window automatically, so users can see the commands for each click or action. Unfortunately
and not surprisingly, no software can add comment or blank lines to a statistical analysis.
Combining all command lines from an output window is fundamentally different from the
approach I advocate here. From the beginning, we create the structure of the program version
on purpose and manage the whole process systematically, e.g., starting with Program 7.1
The first program version for Sun et al. (2007).
Forgetting the name of a specific R function is normal and it can be solved by checking
R help documents. Having a messy structure or no structure at all is more troublesome.
Therefore, to emphasize a critical point about computer programming here, it is the wise
and diligent use of comment and blank lines that allows the program version for an empirical
study to be created with a desired format from the beginning. Then, it can be revised and
built up incrementally by block, table, or figure.
7.5.3 Sentences and commands
A command in R is similar to a sentence in English. First, let us have a look at the typical
structure of a command in R. In English, a typical sentence has the structure of subject-verb-object, e.g., I receive apples. In R, a typical command has the following structure:
object.name <- value
where a new object is created and the value is assigned, e.g., my.weight <- 180, or table.4
<- cbind(u1, u2)[, -4] (i.e., line 50 in Program 7.2). Note the major assignment symbols in R are <- and =. The difference between <- and = is small. In most cases, <- can
be replaced by =, but <- is clearer than =. In a function call like line 36 in Program 7.2,
the equal sign has to be used to assign values to the arguments within a function call
(e.g., formula = ...). To continue the comparison between English and R as a computer
language, the assignment operator in R is similar to a verb in English.
Furthermore, there are many variations of the basic command structure in R, which will
be covered gradually later on. In particular, when a command is composed of an object name
only, it requests that the value or content of the object be printed on a computer screen.
For instance, the command on line 53 in Program 7.2 shows the content of table.4. Of
course, before the value of an object can be shown on a screen, it should be created first
with the assignment form, unless the object is a literal value (e.g., 35) or a built-in constant
in R (e.g., pi for 3.1416).
Sentences in English usually end with punctuation marks such as a period, an exclamation mark,
or a question mark. In contrast, in most cases, each line in the R language is called a command.
There is no such thing as a period at the end of a command line in R. Many software
applications use special features to indicate the end of a command (e.g., run in SAS or $ in
LIMDEP). The choice by R may look less organized than other software applications at
first glance, but after a while, it will become apparent that it indeed makes programming
in R very concise.
While in most situations one R command occupies one line, R allows multiple commands
to be put on the same line, separated by the punctuation sign ";". For example, line
22 in Program 7.2 has five commands, and line 23 has three commands. This allows some
short but related commands to be put on a single line, making an R program concise.
A command can be longer than a single line in R. There are two ways to organize or
indicate a multiple-line command. The first way is through operators. In Program 7.2, the
binary operator of + at the end of line 39 indicates that the command will continue into the
next line. In splitting a long command line, it is always a good habit to put operators at
the end of lines. In Program 7.2, line 36 also follows the same rule.
The second way is through parentheses and the like. R uses parenthesis (, square bracket
[, and curly brace { to indicate a multiple-line command. In R, parentheses and the like
must come in pairs, just as in English. For example, lines 36 to 38 in Program 7.2 are a
single command over three lines. There are two pairs of parentheses: one pair on line 38 and
the other on lines 36 and 38. Sometimes, a command can contain many pairs of parentheses,
so one needs to be careful in balancing these pairs. In the editor of Tinn-R, when a cursor is
moved to one parenthesis, its color changes to red to indicate the matching part. These
utilities can greatly reduce eye strain.
A classic example of matching parentheses or braces is related to the if statement in
R. In Program 7.3 Curly brace match in the if statement, three uses of the statement are
demonstrated; the first two are correct and the third is wrong. The first use is a simple if
statement without the optional else part, so the pair of curly braces on lines 3 and 5
are matched correctly. In the second use of if with the optional else part, “} else {” is
correctly formatted as a single string and put on the same line. In the third case, lines 17
and 18 will be treated as a single command because R has the curly braces matched already
before reading line 19. As a result, lines 19 and 21 will generate two error messages, and
line 20 will become an independent, valid, but unwanted command.
Program 7.3 Curly brace match in the if statement
 1  # a. Correct use of if() without the else part
 2  x <- 10
 3  if (x > 5) {
 4    m <- x + 200
 5  }
 6  x; m
 7
 8  # b. Correct use of if() with the else part
 9  if (x > 5) {
10    y <- x + 1
11  } else {
12    y <- x + 2
13  }
14  x; y
15
16  # c. Incorrect use of if/else; a mismatch of curly braces
17  if (x > 5) {
18    z <- x + 1}
19  else {
20    z <- x + 2
21  }
22  x; z
# Selected results from Program 7.3
> x; m
[1] 10
[1] 210
> x; y
[1] 10
[1] 11
> if (x > 5) {
+   z <- x + 1 }
> else {
Error: unexpected 'else' in "else"
> z <- x + 2
> }
Error: unexpected '}' in "}"
> x; z
[1] 10
[1] 12
7.5.4 Words and object names
Words in English are made from letters and symbols and are used to create sentences. Most
people use fewer than 1,500 English words in conversation and writing. The standard for these
words is an English dictionary. When English words are used to compose comments in R,
they still follow the same grammar rules.
Inside all R commands, everything is an object, so the most basic building block in a
program is the object name. The status of object names in R is similar to that of words in
English. However, object names on R command lines usually do not follow the rules in an
English dictionary anymore, even if they have the same appearance. For example, on line 27
in Program 7.2, there is the word mean. This is the name of an R function that calculates
the average of the given data. A function with the same role can be recreated and named
differently, such as average, mymean, or mEAn. In general, these keywords for R functions
are named to give users a clue of their roles, but they can be coined in any way a programmer
likes. Similarly, names can be created for objects that are only meaningful for the current
research and environment. For example, daInsNam on line 20 in Program 7.2 is the object
name of 14 variables imported for Sun et al. (2007).
In creating object names in R, it should be noted that R is case sensitive, so the names
of dog and Dog are different. This requires users to be careful about details, but it also allows
more flexible ways of creating object names. Usually, using the dot symbol and mixing lower
and upper cases can create very informative object names in R. As examples, table.4 is
used to represent a table in Program 7.2. In package erer, data sets are named with the
prefix of da like daIns and daPe, so whenever users see these names, it is apparent that
they are data objects. Procedures related to a topic can be created with a new prefix. For
instance, ma is used for marginal effect analysis as in maBina() and maTrend().
The underscore symbol can also be used to create object names in R (e.g., my_weight).
However, it is less frequently used than the dot or period symbol (e.g., my.weight), and
even discouraged by many users in R. This is probably because the period symbol is easier
to type and arguably better looking in the middle of an object name than the underscore
symbol. The hyphen symbol (-) cannot be used inside an object name because it is the minus
operator in R. Finally, the name of an object cannot contain any space; otherwise, it would
become two names. Thus, “table 4” is a valid name for a file or phrase in English under
the Microsoft Windows system. In R, however, it should be named as table.4, Table.4,
table_4, or table4.
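A few quick lines illustrate these naming rules; all object names here are made up for the
illustration.

dog <- 1; Dog <- 2       # two different objects because R is case sensitive
table.4 <- 1:3           # a dot is allowed inside a name
table_4 <- 1:3           # an underscore also works, though it is used less often
# table-4 <- 1:3         # not allowed: the hyphen is read as the minus operator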
7.6 Formatting an R program
In this book, we divide the production process of an empirical study into three stages:
proposal, program, and manuscript. Each of them involves a balance between content and
presentation techniques. For a proposal, good grantsmanship can improve the funding probability. For a manuscript, appropriate formats can improve its readability. The same logic
applies to a computer program: the purpose of an R formatting guide is to make an R
program easier to read and share.
The difference is in the readership. Both a proposal and a manuscript need to be read and
evaluated by reviewers other than the authors. Thus, there exists a variety of formatting or
presentation guides for proposal and manuscript preparation, e.g., a guide to authors from a
specific journal. In contrast, an R program is mainly for personal use or sharing among close
colleagues, and it is not for publishing. As a result, the format of an R program is largely
determined by personal choices or the tradition adopted in a local working environment.
To some degree, most R programming guides are just suggestions. For example, Google's
R Style Guide is one such guide; you can search for it online and read the full description.
The common thread of these guides is straightforward: just use your common sense, and
additionally, be consistent (which is also common sense). In general, it takes some time
and patience to learn how to format an R program, just like learning the skills of formatting
a manuscript in English. You can also learn from reading others’ programs if your mind
is always open and your eyes are sharp. Programs with good formats can serve as good
models to follow. Programs with bad formats can also be instructive from a different
perspective, as similar mistakes can be avoided.
We have presented a good example in Program 7.2 The final program version for
Sun et al. (2007). To create a bad formatting example, the commands in Program 7.2
are copied and edited with a few bad formatting treatments. The results are presented
in Program 7.4 A poorly formatted program version for Sun et al. (2007). The main
treatments are eliminating the comment and blank lines, removing spaces and indention,
and allowing automatic wrap-ups of long lines. All the commands are kept, so it still can
generate the same results. The total number of lines is reduced from 80 to 29. Of course,
this revised program with bad formats is difficult to read and digest, if possible at all. In
the following subsections, several key aspects of a good-looking R program are elaborated.
They are based on my personal experience, so please follow them at your discretion.
Program 7.4 A poorly formatted program version for Sun et al. (2007)
 1  library(erer);wdNew<-"C:/aErer";setwd(wdNew);getwd();dir()
 2  daInsNam=read.table("RawDataIns1.csv",header=TRUE,sep=",");daIns=read.table
    ("RawDataIns2.csv",header=TRUE,sep=",");class(daInsNam);dim(daInsNam);print
    (daInsNam);class(daIns);dim(daIns);head(daIns);tail(daIns);daIns[1:3,1:5]
 3  (insMean<-round(x=apply(daIns,MARGIN=2,FUN=mean),digits=2))
 4  (insCorr<-round(x=cor(daIns),digits=3))
 5  table.3 <- cbind(daInsNam, Mean = I(sprintf(fmt="%.2f", insMean)))[-14, ]
 6  rownames(table.3)<-1:nrow(table.3);print(table.3,right=FALSE)
 7  ra<-glm(formula=Y~Injury+HuntYrs+Nonres+Lspman+Lnong+
 8  Gender+Age+Race+Marital+Edu+Inc+TownPop,family=binomial(link="logit"),data
 9  =daIns,x=TRUE)
10  fm.fish<-Y~Injury+FishYrs+Nonres+Lspman+Lnong+Gender+Age+Race+Marital+
11  Edu+Inc+TownPop
12  rb<-update(object=ra,formula=fm.fish);names(ra);summary(ra)
13  (ca<-data.frame(summary(ra)$coefficients))
14  (cb<-data.frame(summary(rb)$coefficients))
15  (me<-maBina(w=ra))
16  (u1<-bsTab(w=ra,need="2T"));(u2<-bsTab(w=me$out,need="2T"))
17  table.4<-cbind(u1,u2)[,-4]
18  colnames(table.4)<-c("Variable","Coefficient","t-ratio","Marginaleffect",
19  "t-ratio");table.4
20  (p1<-maTrend(q=me,nam.c="HuntYrs",nam.d="Nonres"));(p2<-maTrend(q=me,nam.c=
    "Age",nam.d="Nonres"));(p3<-maTrend(q=me,nam.c="Inc",nam.d="Nonres"))
21  windows(width=4,height=3,pointsize=9);bringToTop(stay=TRUE)
22  par(mai=c(0.7,0.7,0.1,0.1),family="serif");plot(p1)
23  fname<-c("insFigure1a.png","insFigure1b.png","insFigure1c.png")
24  pname<-list(p1,p2,p3)
25  for(i in 1:3){png(file=fname[i],width=4,height=3,
26  units="in",pointsize=9,res=300)
27  par(mai=c(0.7,0.7,0.1,0.1),family="serif")
28  plot(pname[[i]]);dev.off()}
29  write.table(x=table.3,file="insTable3.csv",sep=",");write.table(x=table.4,
    file="insTable4.csv",sep=",")
7.6.1 Comments and an executable program
Some comments must be present in an R program to make it readable. The comment
sign # can be used in three ways: long comments, short comments, and commenting out
R commands. First, a
comment sign can be placed at the beginning of a line, turning the whole line into a comment.
The content of the line can be a sentence in English, or just some symbols (e.g., a star,
dash, or hyphen). If an entire line is a comment, then it usually begins with # and one
space. Second, the comment sign can also be placed in the middle of a line and after some
commands. Short comments can be accommodated in this way; the general recommendation
is to have two spaces before # and then one space after it. This can be very helpful if one
needs to locate some specific points in a long program, e.g., # Correlation coefficient
computed here.
The third way of using a comment sign is to comment out one or several R commands.
There are several situations why one needs to keep a block of R commands as comments. It
may be that these commands are test code that has value for future programming.
In general, I suggest that most test code be removed and only the really valuable code
be kept. Another situation is that some commands need a long time to run (e.g., 10 minutes
for a loop). If that is the case, the codes can be commented out and the outputs can be
given explicitly in the program. If the number or size of outputs is large, then one can save
the outputs on a local drive first, and then reload the outputs in the program.
For example, assume that it takes about 30 minutes for the following commented code
to generate the values of the objects x, y, and z. In actual programming, one can run the
codes without the comment signs, and then generate and validate the results. If the final
results are large in size, they can be saved on a local drive as tempResult.Rdata through
the function of save(). Then the relevant codes can be commented out. The saved results
can be loaded in the future directly from the local drive. One can use the function of ls()
before and after the function of load() to reveal if these objects are available on the search
path. Alternatively, if the results are small, then they can be embedded and presented in the
R program directly. Details about the functions used in the following example, i.e., save()
and load(), are available at their help files. At this point, you just need to understand the
essence of this approach.
## need 30 minutes to run this block
# x <- 0; y <- 0; z <- 0
# for (i in 1:1000) {
#   y <- ...
#   ...
#   z <- ...
# }

# A. If the results are large, save and then reload later
# save(x, y, z, file = 'C:/aErer/tempResult.Rdata')
load('C:/aErer/tempResult.Rdata'); ls()

# B. If the results are small, present them directly in the program
x <- 10; y <- 20; z <- 30
Comment signs used in the first and second ways (i.e., long and short comments) make
an R program easy to understand by section. Judicious use of the comment sign in the third
way can make a whole R program for a project executable in a few minutes. My rule is
that no more than three minutes are needed in running an R program and reproducing all
the tables and figures. For most of my finished projects, I just need about one minute to
reproduce all the results.
There are several common problems in using the comment sign. These include forgetting # before a comment line, having very fancy and long section headers (e.g., a big box
composed of stars), and including a time-consuming block of code directly in the final
program. Most of these problems can be revealed by a quick reading of a program with
good common sense, and then by running the whole program with one click. To emphasize,
a well-structured R program should be executable within a few minutes by a single submission. If an R program cannot be run repeatedly, then there must be some problems inside
the program.
7.6.2 Line width and breaks
In preparing a proposal or manuscript in English, most word processing applications allow
automatic wrap-up for long sentences. A long R command line that spreads over several lines
can be wrapped up by an R editor in the same way. However, that is strongly discouraged
for programming in general. Thus, we need some rules in formatting lines in an R program.
The maximum line length is usually about 80 characters. Some R editors (e.g., Tinn-R) allow users to set the position of a vertical gray line and display it in the editor.
Alternatively, a comment line composed of many dash symbols (or any symbol you like) can
be added to separate two sections in an R program, and it can also serve as an indicator of
line width, as adopted in Program 7.2 The final program version for Sun et al. (2007). If
one needs to print an R program for reading or sharing, then the preview function in an R
editor can reveal if the default line width is too wide. In addition, always use a font family
(e.g., Courier New) in programming that has the same width for every letter. This allows
for exact vertical alignment of commands.
If needed, break a long line between argument assignments or at a location with good
sense. In using Microsoft Word, users do not compile the contents directly as in other
applications such as LaTeX. As a result, some bad formats can occur. For example, a phrase
like "Table 1" can be split right in the middle, with "Table" being put at the end of a
line and "1" being put at the beginning of the following line, which is ugly for professional
publishing. Many applications like LaTeX allow users to have better control over these subtle
but important details.
In defining or using R functions, line breaks should be made between argument assignments. Thus,
lines 8 and 9 in Program 7.4 have a bad break, separating data and =daIns across two lines.
This is similar to breaking “Table 1” into two parts in a manuscript. In other long R commands, one may need to break a command several times, e.g., a long index for subscripting
operation. In general, always break a line at a location with good sense.
There are several ways to reduce the number of breaks in a long command line. One
way is to define some objects first and then use them in the long command. For example,
the first command below, associated with update(), spreads over multiple lines. On lines 39 to 40 of
Program 7.2, a new formula object is created first and then used in calling the function.
# A long command line associated with update()
rb <- update(object = ra, formula = Y ~ Injury + FishYrs + Nonres +
    Lspman + Lnong + Gender + Age + Race + Marital + Edu +
    Inc + TownPop)

# A shorter command line with update()
fm.fish <- Y ~ Injury + FishYrs + Nonres + Lspman + Lnong +
  Gender + Age + Race + Marital + Edu + Inc + TownPop
rb <- update(object = ra, formula = fm.fish)
For curly braces, an opening curly brace should not go on its own line; a closing curly
brace should always go on its own line. Curly braces can be omitted when a block consists
of a single statement, and as always, it should be consistently adopted in an R program.
Several good examples are shown in Program 7.3 Curly brace match in the if statement.
In summary, in preparing an R program, it is better to set up the maximum line width
first, either with the help of utilities in an R editor or with some comment lines. A finished
R program should be composed of lines whose breaks are controlled manually. Break up
long lines in R between arguments or anywhere that makes good sense. Never allow automatic
wrap-ups of long lines in an R program, including short comments at the end of a command
line.
7.6.3 Spaces and indention
The use of space in English and R is quite different. In English, extra spaces between words
are not allowed, at least for professional writing. In R, extra spaces are allowed on comment
lines because computers and the R software do not execute these lines at all. They are also
allowed, and even promoted, in many places on command lines. In general, for commands
longer than one line, two spaces should be inserted consistently to indent the following lines
to improve readability. To allow better alignments in some cases, more than two spaces can
be inserted, but it should be implemented consistently. For example, on lines 36 to 38 in
Program 7.2, many extra spaces are inserted to make these relevant arguments aligned
vertically. In indenting code, do not use tabs or mix tabs and spaces; just use the space
bar on a keyboard.
A conditional or looping statement in R, e.g., if and for, generally contains a block of
code. The block of code for each statement should have the same indention. If there are
nested conditional and looping statements, then more spaces should be added for the inner
blocks. For example, lines 69 to 75 in Program 7.2 are a large block of code, as indicated
by the vertical alignment of for and } and the indention of all lines inside the curly braces.
Furthermore, the indention on line 71 indicates that this line and the previous one form one
command. If there is another for loop inside this block, then all the command lines
should be indented further.
Place spaces around all binary operators, e.g., <-, =, +, -, and *. Do not place a space
before a comma, but place one after a comma. Place a space before a left parenthesis, except
in a function call. For all the above situations, spaces can be eliminated if there is a need
to reduce the length of one line within the preferred width, e.g., 75 to 80 characters. In
general, my rule is that a maximum of three spaces can be compressed on a single line if
needed; otherwise, multiple lines will be used for a long command. Some bad examples are
demonstrated below.
me<-maBina(w=ra)
# need spaces around <- and =
test <- daIns[,1:5]
# need a space after the comma
test <- daIns[ ,1:5] # need no space before the comma
for(1 in 1:3) {...
# need a space before ( for loops
if(x >= 5){...
# need a space before ( and { for conditionals
mean (1:10)
# need no space before ( for function calls
7.6.4 A checklist for R program formatting
The following brief list describes the main requirements in formatting an R program. It is
not as comprehensive or detailed as the preceding presentation. You will need to read
this again later on when you format your R program. As always, use good common sense
and consistency in formatting an R program.
• All comments must start with the comment symbol of # in R.
• Add complete comment lines to create and separate sections for a long program.
• Add short comments at the end of a line for brief notes if needed.
• Comment out time-consuming code blocks to facilitate quick reproduction.
• An R program for a project should be executable with a single submission.
• Use blank lines to indicate paragraphs or code blocks.
• Set up the width of command lines (e.g., 80 characters) and never exceed it.
• Do not allow automatic wrap-ups of long lines; break them manually.
• Always place "} else {" together on a separate line.
• Break a long line with good sense (e.g., not on an argument assignment).
• Indent a multiple-line command or a code block with two spaces, consistently.
• Align some codes vertically to indicate a block.
• For curly braces, an opening curly brace should not go on its own line.
• For curly braces, a closing curly brace should go on its own line.
• Use indention to indicate a code block for conditionals and looping statements.
• Place one space before and after binary operators, e.g., <-, =, and +.
• Do not use the operator of = to replace <- when <- is needed.
• No space before a comma, but one space after it, e.g., test[, ].
• Place a space before a left ( and {, e.g., if (a > 5) {b <- 3}.
• Do not place a space before a left ( for a function call, e.g., mean(1:3).
• Spell out argument names in calling a function with multiple arguments.
• Exceptions are always allowed, but use good common sense.
7.7 Road map: using predefined functions (Part III)
Up to this point, we have learned Sun et al. (2007) as the sample study for Part III,
including its manuscript version, underlying statistics, and the final program version. More
importantly, the structure, basic syntax, and format of the program version have been
analyzed. In Program 7.2 The final program version for Sun et al. (2007), we are glad
to know that 44% of the 80 lines are just comments and blank lines. However,
the remaining 56% of the lines are still largely unexplained. Where do we go from here,
and how can we understand the R language used in this sample program completely? The
answers are within the next four chapters in Part III.
The rest of this Part is organized to help beginners learn how to use predefined, existing
R functions efficiently. At the end of this Part, you should be able to understand the whole
sample program completely, including these functions and their usage. For your own selected empirical study, a similar program can be prepared after learning various techniques
presented in this Part.
Specifically in Chapter 8 Data Input and Output, the focus is on the exchange of information between R and a local drive, i.e., data inputs and outputs. A number of R concepts
are explained first, including objects, attributes, subscripts, user-defined functions, and flow
control. To connect with the sample study, the sections of raw data imports and result exports in Program 7.2 should be understandable after this chapter. Then, two chapters are
used to present the techniques of creating and manipulating major R objects. These materials are basic but critical for calling and using R functions efficiently. In general, they are more
difficult to learn than materials in other chapters of the book. In Chapter 9 Manipulating
Basic Objects, how to manipulate major R objects by type is presented. The types covered
include R operators, character strings, factors, dates, time series, and formulas. In Chapter 10
Manipulating Data Frames, common tasks related to data frames are addressed, and methods for data summarization are presented. As an example, using R functions to estimate a
binary choice model is displayed at the end. After reading these two chapters, one should
be able to comprehend the regression analyses in Program 7.2.
Drawing a graph in R is a very different task from conducting a statistical regression. A
regression often involves one or two R functions only. However, plotting requires learning a
large number of functions at the same time, which is true for R and any computer language.
In Chapter 11 Base R Graphics, the traditional graphics system available in base R is
presented in detail. The four main inputs for generating a graph in R are plotting data,
graphics devices, high-level plotting functions, and low-level plotting functions. After learning the techniques presented in this chapter, one should be able to understand the graph
output from Program 7.2.
A simple test of whether you know the materials in Part III well is to read Program 7.2
The final program version for Sun et al. (2007) and assess how much of it you can understand.
In addition, many exercises are designed and included in several chapters. They can also test
your understanding of the materials. Overall, the knowledge in this Part is the foundation
for building an R program efficiently, so everyone should learn it well before moving on to
other chapters in the book. A typical barrier that prevents students from learning advanced
techniques (e.g., writing a new function or package) is the lack of a solid comprehension of
basic R concepts, as presented in the remaining chapters of Part III.
7.8 Exercises
7.8.1 Assess some packages with pull-down menus. Go to the Web site of R and navigate
to the contributed package site (e.g., http://mirrors.ibiblio.org/CRAN/). Search the
Web page with the keyword 'interface' to find packages that contain pull-down
menus. Select one package that is built on Rcmdr, e.g., RcmdrPlugin.SurvivalT, and
install it on your computer. A new menu of SurvivalT should appear within the R
Commander interface. Try to estimate some models and then evaluate the relation
between this new package and Rcmdr.
7.8.2 Prepare draft tables. Recall that in Exercise 4.7.2 on page 56, one empirical study
is selected, and the structure of its published manuscript version is extracted. In
this chapter, tables and figures in the first draft of the manuscript version for Sun et al.
(2007) are further specified through, for instance, Table 7.1 A draft table for the
logit regression analysis in Sun et al. (2007) on page 103. In this exercise, assuming
no analyses had been conducted for your selected empirical study, compose drafts of
at least two tables that would appear in the published version.
7.8.3 Create an R program draft. Recall that in Exercise 3.6.2 on page 41, one empirical
study has been read and selected. In this exercise, prepare a draft of an R program
version for this study. Follow the structure of a program version listed in Table 4.3
Structure of the program version for an empirical study on page 52 and the sample
program version as listed in Program 7.1 The first program version for Sun et al.
(2007) on page 113. Present and organize all the main components of a program version
in this first draft, including the tables and figures.
7.8.4 Format a sample R program. Program 7.5 contains many inappropriate formats.
Note that the commands in this program do work. You do not need to understand
the relevant R functions in order to reformat the whole program. Similarly, a production
editor does not necessarily understand a manuscript when formatting it for a specific
publication outlet. With the formatting guide discussed in this chapter, identify and
correct the existing formatting problems in this R program. Do not make any changes
to the content of this sample program. After it is reformatted, the structure of the
program should be clear to readers before the program is run. In the end, the whole
program should be executable in one submission.
Program 7.5 Identifying and correcting bad formats in an R program
***********************************************************
** Title: This is an exercise for formatting R codes. *****
***********************************************************
__________ 1. Load packages and data ____________
library( erer );data( daIns );
2. Create a dataset from the existing one
mydata<-daIns[ ,1:5 ]; head( mydata); names(mydata );str( mydata
); summary(mydata);
3. Run a linear model ----------------------------------
result=lm(formula=Y~Injury+HuntYrs+Nonres
+Lspman
,data
= mydata)
summary( result );####This is the main result.
Chapter 8 Data Input and Output

Constructing the program version for an empirical study through R requires one to
build up skills gradually. In this chapter, we focus on the exchange of information
between R and a local drive, i.e., data input and output (Spector, 2008). Before
we can do that, several R concepts need to be defined. Objects and functions are explained
first as two core concepts of R. Inside R, everything is an object, including a function, and
furthermore, functions can be used to manipulate objects. Then, subscripting an object,
writing a user-defined function, and controlling program flow are briefly introduced. In the
end, several functions and techniques for data inputs and outputs are covered in detail.
8.1 Objects in R
R objects are the building blocks for every command line in the program version for an
empirical study. They can be differentiated by attribute or property. Major object types
(e.g., list) are briefly described in this section. A set of basic built-in functions can be
employed in R to create and manipulate these objects.
8.1.1 Object attributes
Each object has unique properties that allow classification and differentiation within a working environment. These properties or features are generally referred to as attributes in
R. Two objects in R may look the same on a computer screen, but they are different
objects if any of their attributes is not the same. The attributes of an existing object can
be modified. New objects with built-in or user-defined attributes can be created.
R has a number of built-in object types with well-defined attributes. Objects commonly
used in R include vectors, data frames, matrices, lists, factors, formulas, dates, and time
series. Usual attributes include class, comment, dim, dimnames, names, row.names, tsp,
and levels. See more details by help("attributes"). In Program 8.1 Accessing and
defining object attributes, several objects generated from Program 7.2 The final program
version for Sun et al. (2007) on page 114 are used as examples. In particular, daInsNam is
a data frame object with two columns and 14 rows: one column for the abbreviations of 14
variables and the other for detailed variable descriptions.
Attributes associated with an object can be revealed by an attribute function or a convenience function. Specifically, there are two attribute functions: attributes() and attr().
The function of attributes() reveals all available attributes of an object, and attr()
shows one attribute only. Furthermore, all existing attributes or one specific attribute can
be replaced through those two functions too. In using the replacement form of the two functions, a new attribute can be created if the attribute referred to does not exist for an object,
as shown on lines 12 to 15 in Program 8.1. A new attribute can be any object, including
a numerical vector (e.g., line 13) or a data frame object (e.g., line 15). Actually, it is not
uncommon to store some descriptive information about an object as one of its attributes.
It always stays with the object but does not affect the usage of the object.
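As a small illustration (not taken from Program 8.1), the following lines sketch how an attribute can be read, replaced, and created with attributes() and attr(); the data frame and the note attribute are made up for this example.

# Create a small data frame and inspect its attributes
x <- data.frame(id = 1:3, score = c(80, 95, 88))
attributes(x)              # class, names, row.names
attr(x, "class")           # reveal one attribute only
# Create a new attribute to store descriptive information
attr(x, "note") <- "scores from a hypothetical exam"
attributes(x)$note         # the note stays with the object
mean(x$score)              # the usage of the object is unaffected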
Convenience functions are available for a number of object attributes in R and can also
be used to extract attribute information or define new attributes. Not all attributes are
implemented or included in attributes(). For example, the mode attribute is implicit with
a matrix object and it is usually not visible or available with attributes(). In addition,
convenience functions focus on one particular attribute and they are more convenient to
use. In particular, two of the most important attributes of R objects are class and mode,
with the corresponding convenience functions available under the same names.
Specifically, class() can be used to reveal the current class information of an object.
The command of class(x) <- "value" can be used to revise the existing class attribute,
or assign a new class to x if x has no such class. An object can have many classes at
the same time, as shown on line 23 in Program 8.1. The class of an object is important
because R is an object-oriented programming language. Generic functions are available in
R to invoke different methods. Depending on the class of their arguments, generic functions
can act very differently. For a particular generic function or a given class, you can find out
which methods are available through the function of methods(). For example, the generic
function of mean() can work differently for numeric and date classes, e.g., mean.Date(). As
another example, through lines 56 to 58 in Program 7.2 on page 114, several objects with
the class of "maTrend" are created. The erer package contains two methods for this class:
plot.maTrend() and print.maTrend(), as revealed by line 26 in Program 8.1.
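The following small sketch illustrates these points with the built-in Date class; the commented lines indicate how the maTrend methods mentioned above could be listed, assuming the erer package is installed.

y <- Sys.Date() + 0:4        # a vector with the class of "Date"
class(y)                     # "Date"
mean(y)                      # dispatched to the method mean.Date()
class(y) <- "myClass"        # replace the class attribute
class(y)                     # now "myClass"
# methods(mean)                               # all methods for the generic mean()
# library(erer); methods(class = "maTrend")   # plot.maTrend, print.maTrend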
Similarly, the convenience function of mode() can be used to reveal an object’s mode.
Commonly encountered modes of individual objects are numeric, character, and logical.
Some objects, e.g., matrices or arrays, require that all the data contained in them be of the
same mode, while others (e.g., lists and data frames) allow for multiple modes within a single
object. Additional convenience functions in R include typeof(), names(), dim(), length(),
tsp(), and comment(). Every object in R has several attributes to describe the nature of
the information that it contains. A specific convenience function may be defined for, and thus
applicable to, some types of objects only. For example, tsp() returns the properties of a
time series object with the class of ts. Applying tsp() on a data frame object will return
the result of NULL, as shown by line 20 in Program 8.1.
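A few lines can confirm these behaviors; the objects here are invented for illustration only.

m <- matrix(1:6, nrow = 2)
mode(m); typeof(m)           # "numeric"; "integer"
dim(m); length(m)            # 2 3; 6
z <- ts(rnorm(24), start = c(2001, 1), frequency = 12)
tsp(z)                       # start, end, and frequency of the time series
tsp(data.frame(a = 1:3))     # NULL, because a data frame has no tsp attribute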
For simple objects such as a vector, it is usually straightforward to determine what an
object in R contains. Examining the class and mode, along with its length or dimension
attribute, should be sufficient to allow one to work effectively with the object. For objects
with richer attributes, the function of str() can be employed to reveal its internal structure
and properties. Finally, the ls() function can reveal the names of all objects in an R session.
The erer package also includes an improved function of lss() to reveal the name, class,
and size of each object in an R session. This is similar to the functionality provided by
Windows Explorer on a Microsoft Windows operating system.
Program 8.1 Accessing and defining object attributes
# Run the program version for Sun et al. (2007) by source().
# Create a new object for demonstration.
Chapter 9 Manipulating Basic Objects

After learning how to import and export data, we move on to R object manipulation. The material on this topic is extensive, so it is covered in two chapters. In this
chapter, R operators, character strings, factors, dates and time, time series, and
formulas are covered. In Chapter 10 Manipulating Data Frames, techniques for subscripting and data frame manipulation are presented. These methods are essential to efficient R
programming. In each section, major concepts are first explained in a way that they can
be read independently without running any R program. Then sample codes relevant to the
concepts are presented along with selected results. To have a deep understanding, it is best
to first read the text and have an overall understanding, then run the sample codes and
digest the results, and finally read the relevant description again. All the exercises for this
and the next chapter are combined and presented in Exercise 10.5 on page 204.
9.1 R operators
In calling an R function, its function name is combined with a set of argument values like
this: mean(x = 1:30). An R operator is a function too, and it takes some argument values
and can be written without parentheses. For example, the command of 3 + 5 is the same
as "+"(3, 5). A large number of operators are defined in R. Their meaning will become
clearer when they are used for specific purposes. Detailed definitions for these operators are
available in the built-in documents, e.g., help("+").
Specifically, arithmetic operators in R include +, -, *, /, ^, %%, and %/%; the corresponding
operations are addition, subtraction, multiplication, division, exponentiation (e.g., 3^2 returns
9), modulus (e.g., 50 %% 3 returns 2), and integer division (e.g., 50 %/% 3 returns 16).
Assignment operators in R include <-, <<-, ->, ->>, and =. The operators of <<- and ->>
are normally only used within a function, so a search for an object name is allowed in the
parent environment of the function. The = operator is only allowed at the top level (e.g., in
a complete expression typed at the command prompt) or as one of the sub-expressions in a
braced list of expressions. In general, the = operator is mainly used in assigning values to
function arguments. The assignment operator of the leftward form, i.e., <-, is usually used
in assigning a value to a name, e.g., my.score <- 95. In practice, it is a bad habit to use
= to replace <- where the latter should be used.
Relational or comparison operators are >, >=, <, <=, ==, and !=; the corresponding operation is greater than, greater than or equal to, less than, less than or equal to, logical equal,
and not equal. These operators allow the comparison of values in atomic vectors and return
a logical vector, indicating the result of element-by-element comparison. The elements of
shorter vectors are recycled as necessary. For example, the expression of 1:6 == c(1, 4)
returns the following logical vector: TRUE FALSE FALSE TRUE FALSE FALSE.
Logical operators in R include: &, |, !, &&, ||, and xor; the corresponding operation and
meaning is logical AND for vector, logical OR for vector, logical negation (NOT), logical
AND for scalar, logical OR for scalar, and element-wise exclusive OR. In particular, the
short forms of & and | are different from the long forms of && and || in two aspects. First,
the short forms perform element-wise comparisons in much the same way as arithmetic
operators, so they are vectorized. The long forms are not vectorized and work on scalars, so
only the first element is evaluated and the output is a single logical value of TRUE or FALSE;
this is true even if the vectors being evaluated have more than one element. Of course, if
the vectors being evaluated have one element only, the short and long forms will generate
the same result. Second, the long forms examine the expression sequentially from the left to
right, with the right-hand operand only evaluated if necessary. This design can save time or
avoid errors in some cases. Both features make && and || preferable in flow control, so
they are often used to construct a logical expression for some functions, e.g., if.
Logical expressions contain R relational or logical operators and evaluate to either true
or false, denoted as TRUE or FALSE in R. For example, the logical expression of (34 * 3 >
12 * 9) & (88 %/% 10 == 8) returns a value of FALSE. In addition, two frequently used
logical functions are worthy of some descriptions here. The all() function examines if all
of the given set of logical vectors are true or not. The any() function answers if any of the
given set of logical vectors is true or not. Both the functions check the entire vector supplied
but return a single logical value, TRUE or FALSE.
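A minimal sketch of these operators and functions follows; the vectors a and b are made up for this illustration.

a <- c(TRUE, FALSE, TRUE); b <- c(TRUE, TRUE, FALSE)
a & b                                     # vectorized: TRUE FALSE FALSE
(34 * 3 > 12 * 9) && (88 %/% 10 == 8)     # scalar comparison: FALSE
all(a); any(a)                            # FALSE; TRUE
x <- 5
if (is.numeric(x) && x > 3) "large" else "small"   # && short-circuits safely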
There are different purposes for the following operators in R: (, [, and {. All of them have
to be used in pairs. The parenthesis notation returns the result of evaluating expressions
inside the parentheses, e.g., (4 + 6) * 100. Besides that, it is most commonly used to
supply argument values in calling a function, e.g., mean(1:10). The square brackets (i.e., [
and [[), the dollar sign of $, and the at-sign of @ are all designated in R for indexing, and
can be used to extract from or replace some elements in an existing object. For example,
the command of c(45, 60, 10)[3] returns the third element in a numeric vector. The
most important distinction between [, [[, $, and @ is that the [ operator can select more
than one element whereas the other three can select a single element only. Finally, the curly
braces evaluate a series of expressions, either separated by new lines or semicolons, e.g., dog
<- 1:5; pig <- 6:10. Typical uses of curly braces are grouping the body of a function,
e.g., test <- function(x) {x + 1}; test(20), or grouping expressions for flow control
functions, e.g., x <- 1; y <- 2; for (i in 1:5) {x <- x + i; y <- y * i}.
Other operators in R include : as a sequence generator, :: and ::: for accessing variables in a namespace, ~ for formulas, and ? for help. In general, the order of operations
and precedence rules in R are: function calls and grouping expressions (e.g., {), indexing,
arithmetic, comparison and relation, logical, formula, assignment, and help. Consult the
help page on Syntax, e.g., help("Syntax"), for details.
If needed, one can define new operators in R. A user-defined binary operator is created
through a new function and its name is composed of a character string and two % symbols.
For example, assume that there are two exams in a course. The final score for each student
is composed of the scores from the two exams and a base score of 40. An operator named as
%score% can be defined to add two exam scores and 40 together, as shown in Program 9.1
Built-in and user-defined operators in R.
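Program 9.1 itself is not reproduced here, but a minimal sketch of such an operator could look like the following; the function body and the scores are only illustrative.

"%score%" <- function(exam1, exam2) {exam1 + exam2 + 40}
25 %score% 30                   # one student: 95
c(25, 18) %score% c(30, 27)     # two students at once: 95 85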
Operators in R are vectorized. Thus, for given vector inputs, an operator acts on each element.
Chapter 10 Manipulating Data Frames

A data frame is a type of object that has been widely used in R. The main feature
of a data frame object is that it is a two-dimensional object like a table and it can
hold heterogeneous information. Given its popularity, this chapter focuses on the
common tasks related to data frames. Manipulating data frames involves a comprehensive
understanding of indexing techniques employed in R, so relevant operators and functions
are covered first. In addition, data aggregation and summary on data frame objects are
presented. This involves how to generate summary statistics by row or column, or for subsets
of a data frame object. Finally, an application related to a binary logit regression is presented
to synthesize many techniques that have been covered in Part III Programming as a Beginner.
A number of exercises also have been designed and included at the end of this chapter.
Readers are strongly encouraged to work on some of them and improve their understanding
of basic R object manipulation.
10.1 Subscripting and indexing in R
For objects with multiple elements, R offers efficient indexing and subscripting operations
to locate specific elements. Operators defined for subscripting in R include single brackets in
pair, double brackets in pair, the dollar sign, and the at sign, i.e., [, [[, $, and @. The index
or subscript can be numerical, character, or logical values. Use help("[") in an R console
to launch a help page and read their detailed definitions.
The purpose of subscripting is either extracting or replacing. For extracting operations,
elements identified in an object are extracted and saved as a new object, e.g., new.x <- x[i, j, drop = TRUE, ...]. The drop argument is relevant to matrices, arrays, and data
frames, and it specifies whether subscripting reduces the dimension or not. For replacing
operations, the identified elements in the object are replaced by new values supplied by
users and the object is thus modified. If the original object needs to be kept and untouched,
then it should be copied before the replacement operation, e.g., copy.x <- x; copy.x[i,
j] <- new.value.
A numerical index is either a single integer or a vector of integers. If there is any zero
in the index, it is ignored; thus c(0, 1, 0, 4) is the same as c(1, 4) when it is used as
an index. The c() function, the colon operator (i.e., :), and the seq() function are often
used in generating a small index manually, e.g., 1:4 for the first four elements, or seq(from
= 1, to = 10, by = 2) for all the odd numbers between 1 and 10. Negative subscripts
are accepted in R, e.g., -c(1, 4) or c(-1, -4), and they specify the positions of elements
that need to be excluded. Negative and positive numeric values, however, cannot be mixed
in one index.
A character string index is either a single character string or a vector of character strings,
e.g., "v1" or c("v1", "v2"). Apparently, it can be used only if an object has names for its
elements. Negative character subscripts are not permitted, e.g., -"v1". If elements based on
names need to be excluded, then their numerical location should be determined first using
string manipulation functions and then used as the input to a numeric index.
A logical index can also be used to access elements of an object, e.g., c(TRUE, FALSE,
TRUE). A logical value has to be either TRUE or FALSE, or the shorthand of T or F. Elements
corresponding to TRUE values will be included, and those corresponding to FALSE values will
be excluded.
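The three types of indexes can be compared on a small named vector (made up for this illustration):

x <- c(a = 45, b = 60, c = 10, d = 25)
x[c(1, 4)]                        # numerical index
x[-c(1, 4)]                       # negative index: exclude the 1st and 4th
x[c("b", "c")]                    # character index (element names required)
x[c(TRUE, FALSE, TRUE, FALSE)]    # logical index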
The length of an index vector is allowed to be smaller than, equal to, or larger than the
corresponding dimension in an object (e.g., the number of rows in a data frame). Take an
extracting operation as an example. When smaller, the object extracted will have a smaller
dimension than the original object. When equal, the existing and new objects have the
same dimension, but the order of elements may be different, depending on the index. When
larger, the new object will be composed of elements from the existing object, and some of
the elements from the existing object must be repeated.
When the length of an index vector is larger than the corresponding dimension in an
object, the index must be a numerical or character index, but cannot be a logical index.
With a numerical or character index, an element in an object can be extracted more than
once, and this may be exactly what one needs in some situations. A good example is
demonstrated in Program 9.5 Manipulating factor objects on page 167 with the function
of levels(), i.e., the expression of levels(num)[num]. Another example is bootstrapping
a sample with repetition, using the sample() function. In that situation, one may need to
include some observations more than once, either manually or statistically with a probability
specification.
For a logical index, however, it can be at most as long as the corresponding dimension
of an object. This is because a logical index does not contain the location information. In
practice, a logical index is often generated from logical operation on some components of the
object involved. Thus, the length of a logical index is usually the same as the corresponding
dimension of the object. If the length of a logical vector is longer than the corresponding
dimension, then NA values will be generated because these extra logical values refer to
some locations that do not exist in the object. Furthermore, R does accept a logical index
that is shorter, and if needed, the logical values in the index will be repeated to match the
dimension of the object being indexed. For example, in extracting four out of eight elements
in a vector, a logical index like c(FALSE, TRUE) should have the same effect as c(FALSE,
TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE) or c(2, 4, 6, 8). This use of logical
indexes can be very efficient in programming if the pattern of extracting operation is clear.
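A short sketch of these index-length rules, using a made-up vector of eight elements:

x <- 1:8
x[c(FALSE, TRUE)]      # a short logical index is recycled: 2 4 6 8
x[c(2, 4, 6, 8)]       # the equivalent numerical index
x[c(2, 2, 9)]          # repeats are allowed; position 9 does not exist: 2 2 NA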
Sometimes there is a need to convert one of the three types of indexes into another. For
example, in creating a new function, users are allowed to supply a character string vector to
an argument. The same argument value can be used for including some elements in one place
but for excluding some elements in another place. In general, to convert a character string
index into a numerical index, use functions for string manipulation to find the location,
e.g., match(), pmatch(), or grep(). To convert a logical index into a numerical index, use
the which() function. It can generate a vector containing the subscripts of the elements for
which the logical vector is true. This is demonstrated in Program 10.1 Subscripting and
indexing R objects.
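Program 10.1 is not shown here, but the conversions can be sketched as follows with a made-up named vector:

x <- c(a = 45, b = 60, c = 10, d = 25)
match(c("b", "d"), names(x))     # character index to numerical index: 2 4
which(x > 20)                    # logical condition to numerical index: 1 2 4
x[-which(x > 20)]                # reuse the converted index for exclusion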
Chapter 11 Base R Graphics

R graphics can be used to display either raw data or analysis outputs in a graphic
format. A graph can be shown on a computer screen, or alternatively, saved as a
file on a local drive and then used in a document. The traditional graphics system available in base R can generate quality graphics, and furthermore, has served as the
foundation of many contributed packages in extending the functionality of base R. Thus, a
deep understanding of the graphics system in base R is important. Furthermore, because
drawing computer graphs always involves many details, it is undesirable and inefficient to
copy, paste, and compile many R help pages here. Instead, the emphasis in this chapter will
be on gaining a solid understanding of the overall structure of the base R graphics system.
11.1 A bird's-eye view of the graphics system
Generating a graph by the graphics system in base R is very similar to drawing a picture
on a canvas or board. To facilitate learning, we also borrow the concept of production
function from economics. Assume there is a relation between one output and four inputs:
y = f(x1, x2, x3, x4). In painting, y is the picture output, and the four inputs
are paint, canvas materials, big brushes, and small brushes. In base R, the output is a graph,
and the four inputs are plotting data, graphics devices, high-level plotting functions, and
low-level plotting functions.
Transformation activities from inputs to outputs by human beings involve a production
technology, i.e., the f in the production function. Excellent and mediocre artists differ
mainly in the production technology, but less likely in the inputs used. Similarly, with
exactly the same data available, two persons can produce very different graphs from R. This
is often referred to as art of programming, or philosophy of graphics. Whether the art of
programming can be taught or not is controversial. Personally, I feel that is a very difficult
task, even if not totally impossible. In contrast, most people agree that the physical and
technical side of painting and graphics can be learned step by step. That is what we are
going to do in this chapter. Each of the four major inputs related to the R graphics system
will be analyzed in detail, section by section, in a moment.
In general, plotting data should be prepared before drawing a graph. A graphics device
holds a graph generated from R. It is either a computer screen or a file saved on a local
drive. The R console is largely a window for ordinary outputs in text format only, and outputs
with complicated formats such as graphs and mathematical symbols cannot be shown there.
Figure 11.1 Graphs with different graphical parameters (two panels plotting Return (%) against Year, 2001 – 2005; the left panel uses the default y-axis range and the right panel uses a widened range)
Thus, either a new screen/window or file device should be created to hold a graph. A high-level plotting function can generate a graph independently, e.g., plot(). A low-level plotting
function can annotate, revise, expand, or customize an existing graph, e.g., points(). Thus,
a high-level function can be used alone, but a low-level function has to follow a high-level
function.
The appearance of a graph is controlled by various graphical parameters, and they are all
used as function arguments. These graphical parameters can be further classified into three
types: device parameters, plotting function parameters, and universal parameters. The first
two types of graphical parameters are specific to graphics devices or plotting functions only,
while universal parameters can be used for both graphics devices and plotting functions. For
example, the parameter of col is universal and it can be used everywhere to specify colors.
Where a universal parameter is used does matter. When a universal graphical parameter
is used in a function related to graphics devices, the setting will be effective for the whole
device and following statements. When a universal graphical parameter is used in a plotting
function, it affects the output from that function only.
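As a small illustration of this difference (not part of Program 11.1), consider the universal parameter col:

plot(1:5, col = "red")    # red applies to this plotting call only
par(col = "red")          # set on the device: red becomes the default color
plot(1:5)                 # drawn in red although col is not specified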
As an example, Program 11.1 Base R graphics with inputs of data, devices, and plotting
functions can generate Figure 11.1 Graphs with different graphical parameters. In this
demonstration, the data to be plotted are the annual returns of a public firm over five years,
varying between the value of 10% in 2001 and 13% in 2005. The data are prepared and
saved first. The windows() function initiates a screen graphics device so any graphic output
from the following statements will be displayed on it. The device function can accept a
number of device parameters as its argument, e.g., width for the size of the window. The
par() function can further set the properties of the graphics device. In this example, the mai
argument is a graphical parameter that defines the figure margins. The plot() function is
the workhorse and leading high-level plotting function in R, and it uses a line to connect
five observations here. The plot() function also accepts various graphical parameters, e.g.,
ylab for the axis label. Finally, the points() function is a low-level function that generates
five points; its graphical parameter of pch specifies the shape of the points, with a different
shape by year in this example.
Note the range of the axis values is controlled by the graphical parameter of ylim, and
changing it from the default value to a larger value results in a very flat line graph in
Figure 11.1. Thus, graphical parameters can change the appearance of a graph dramatically. The effect may or may not be desirable from the perspective of a programmer or
viewer. In general, R as a tool, like all other software applications in the world, does not
have an answer to questions like which graph is more appropriate. The answer is related to
the overall goal of a research project and the specific purpose of a graph, and additionally,
one's graphics philosophy. (Having no philosophy is a type of philosophy too, just as with religion.)
More specifically for Figure 11.1, without other information, we cannot simply assert
that the version on the left is better than the one on the right. In most cases, the left one is
better. However, if this public firm needs to be compared with other firms in a study, and
the annual returns of those firms have been volatile between −50% and 50% over the period,
then the version on the right side is appropriate in revealing the small return volatility of
the selected firm.
Finally, the relation among the four inputs is worthy of a brief note. A data set for
plotting needs to be prepared at the beginning, and the demand can be large or small.
When the data needed is small, it can be directly supplied as argument values to function
calls. Apparently, there must be some data to get started; otherwise, there is nothing to
draw or show. If no statements related to any graphics device are present, then the default
device is a new pop-up square window on the screen (seven by seven inches). A high-level
plotting function must be supplied in order to generate a graph in R. Low-level functions are
auxiliary to customize an existing graph so they are always optional. As a result, the simplest
R program that can generate a graph is a statement in an R console like this: plot(1:3).
Following our analytical steps as listed in Program 11.1, this simple command is equivalent
to a group of commands as follows: data <- 1:3; windows(); plot(x = data). It seems
so lengthy that we need two pages to explain a short command, but that is exactly how R
generates a graph for us.
Program 11.1 Base R graphics with inputs of data, devices, and plotting functions
# Four components or inputs in the base R graphics system
# A. Plotting data
Year <- 2001:2005
Return <- c(0.10, 0.12, -0.05, 0.18, 0.13) * 100

# B. Graphics device
windows(width = 4, height = 3, pointsize = 11)
par(mai = c(0.8, 0.8, 0.1, 0.1), family = "serif")

# C1. High-level plotting function
plot(x = Year, y = Return, type = 'l', ylab = 'Return (%)')

# C2. High-level plotting function + a change in the range of y axis
# plot(x = Year, y = Return, type = 'l', ylab = 'Return (%)',
#   ylim = c(-50, 50))

# D. Low-level plotting function
points(x = Year, y = Return, pch = 1:5)
11.2 Your paint: preparation of plotting data
Data are needed for data arguments in plotting functions, e.g., the x and y arguments
in the plot() function. Data are also needed for graphical parameters that are supplied as
arguments in the functions for graphics devices and plotting, e.g., "red" for a color argument.
Part IV Programming as a Wrapper
Part IV Programming as a Wrapper: The goal of this part is to extend the existing R
functionality for data analyses and graphics. The sample study used is Wan et al. (2010a),
and the demand model of AIDS is assessed in detail. Skills for writing new R functions will
be greatly emphasized and elaborated through several chapters.
Chapter 12 Sample Study B and New R Functions (pages 253 – 268): The proposal,
manuscript, and program versions for Wan et al. (2010a) are presented. The AIDS model
and the need for user-defined functions are introduced for later applications.
Chapter 13 Flow Control Structure (pages 269 – 292): Functions and techniques for controlling R program flow are presented. Conditional statements in R include if, ifelse(),
and switch(). Looping statements include for, while, and repeat.
Chapter 14 Matrix and Linear Algebra (pages 293 – 311): Operation rules and main
functions relevant to matrices are described. Several large examples related to the AIDS
model and generalized least square are constructed and presented.
Chapter 15 How to Write a Function (pages 312 – 358): The structure of a function is
analyzed first. S3 and S4 as two approaches to organizing an R function are compared.
Several examples are designed to demonstrate how to write a function.
Chapter 16 Advanced Graphics (pages 359 – 392): The graphics systems in R are reviewed
first. Then the grid system, ggplot2, and several packages for maps are described. For each
contributed package, new concepts are defined and examples are designed.
R Graphics • Show Box 4 • A diagram for the structure of this book (four linked boxes: Empirical Study, Proposal, Manuscript, and R Program). New packages make diagrams easier to draw. See Program A.4 on page 517 for detail.
Chapter 12 Sample Study B and New R Functions

The leading sample study in Part IV Programming as a Wrapper is Wan et al. (2010a). This is a moderately challenging empirical study, and it is appropriate for
students to learn how to write user-defined functions for a specific project. The
skeleton of the manuscript version is presented first. Then, the relevant statistics and the
program version are detailed. At the end, the need for user-defined functions is analyzed
and a road map for Part IV Programming as a Wrapper is presented.
12.1 Manuscript version for Wan et al. (2010a)
The corresponding proposal version for Wan et al. (2010a) is presented at Section 5.3.2
Design with public data (Wan et al. 2010a) on page 72. The program version for this study
is presented later in this chapter and can produce detailed tables and figures. Below is the
very first manuscript version for this study. It is constructed with the information in the
proposal version, and will be used as a practical guide to composing the program version.
The final manuscript version is published as Wan et al. (2010a).
In constructing the skeleton of the first manuscript version, key tables and figures should
be drafted or hypothesized. As an example, Table 12.1 A draft table for the descriptive
statistics in Wan et al. (2010a) is included below. It may look like a blank table, but drafting
it early is an important technique for improving data analysis efficiency. This also has been emphasized
in Section 7.1 Manuscript version for Sun et al. (2007) on page 101.
The First Manuscript Version for Wan et al. (2010a)
1. Abstract (200 words). Have one or two sentences for the research issue, study needs,
objectives, methodology, data, results, and contributions.
2. Introduction (2 pages in double line spacing). Brief trade pattern review; the antidumping investigation process; overall objective; three specific objectives and contributions.
3. Market overview and antidumping investigation against China (3 pages). A market
review of trade patterns, antidumping investigation, duties, and research needs.
4. Methodology (6 pages). Four subsections:
— Static Almost Ideal Demand System (AIDS) model
Table 12.1 A draft table for the descriptive statistics in Wan et al. (2010a)
Variable               Mean       St. Dev.   Minimum    Maximum
Share for country 1    30.123      8.888      22.222     60.123
Share for country 2
Share for country 3
Share for country 4
Share for country 5
Share for country 6
Share for country 7
Share for country 8
Price for country 1   160.123     10.123      80.123    300.123
Price for country 2
Price for country 3
Price for country 4
Price for country 5
Price for country 6
Price for country 7
Price for country 8
Total expenditure
— Dynamic AIDS model
— Estimation and diagnostic tests
— Demand elasticities
5. Data sources and variables (1 page). Define wooden beds by the Harmonized Tariff
Schedule. Describe country selection, time periods covered, and data sources.
6. Empirical results (4 pages of text, 6 tables, and 1 figure). Three subsections:
— Model fit and diagnostic tests
— Results from the estimated coefficients
— Results from the calculated elasticities
Table 1. Descriptive statistics of the variables defined for AIDS model
Table 2. Diagnostic tests on the static and dynamic AIDS models
Table 3. Estimated parameters from the static AIDS model
Table 4. Estimated parameters from the dynamic AIDS model
Table 5. Expenditure elasticity and Marshallian own-price elasticity
Table 6. Long-run and short-run Hicksian cross-price elasticity
Figure 1. Monthly expenditure and import shares by country
7. Discussion and summary (3 pages). A brief summary of the study is presented first.
Then about three key results from the empirical findings will be discussed.
8. References (3 pages). No more than 40 studies will be cited.
end
12.2 Statistics: AIDS model
In this section, the statistics for the Almost Ideal Demand System (AIDS) model are presented.
Emphasis is placed on the information necessary to understand the R implementation later on. For a comprehensive coverage of the development of this model, read
Wan et al. (2010a) and references cited there. Specifically, the key formulas for the AIDS
model are introduced first. Then three aspects of the model implementations are elaborated:
construction of restriction matrices, model estimation by generalized least square, and calculation of demand elasticities. They are all implemented in the erer library with several
new R functions. Note that the erer library also contains several additional functions for
the AIDS model, including the Durbin-Wu-Hausman test for expenditure endogeneity and
diagnostic tests for model fit. For the purpose of brevity, their implementations will not be
analyzed within this book, so relevant technical aspects are omitted here.
12.2.1 Static and dynamic models
The AIDS model, originally developed by Deaton and Muellbauer (1980), has been widely
adopted in estimating demand elasticities. The popularity of this model is due to its several
advantages. It is consistent with consumer theory so theoretical properties of homogeneity
and symmetry can be tested and imposed through linear restrictions. As a demand system, it also overcomes the limitations of a single equation approach and can examine how
consumers make decisions among a bundle of goods to maximize their utility under budget
constraints. Furthermore, with the development of time series econometrics, dynamic AIDS
model has been constructed to consider the properties of individual time series through the
error correction technique pioneered by Engle and Granger (1987).
For the research issue of wooden bed imports (Wan et al., 2010a), a conventional static
AIDS model with a set of policy dummy variables can be specified as follows:
wit = αi + βis ln
mt
Pt∗
+
N
X
j=1
s
γij
ln pjt +
K
X
ϕsik Dkt + uit
(12.1)
k=1
where w is the import share of beds; m is the total expenditure on all imports in the system;
P* is the aggregate price index; m/P* is referred to as the real total expenditure; p is the
price of beds; D denotes the antidumping dummy variables; α, β, γ, and φ are parameters to
be estimated; and u is the disturbance term. The superscript s in the parameters denotes
the static (long-run) AIDS model. For the subscripts, i indexes country names in the import
share and also the equation in the demand system (i = 1, 2, ..., N), j indexes country names
in the price variable (j = 1, 2, ..., N), t indexes time (t = 1, 2, ..., T), and k indexes the
dummy variables (k = 1, 2, ..., K). In this study, the maximum values of these indexes are
N = 8 (seven countries plus the rest of world as a residual supplier), T = 96 (monthly data
from January 2001 to December 2008), and K = 3.
Several variables in the above equation are defined and calculated using the import prices
and quantities for individual countries. The total expenditure is defined as $m_t = \sum_{i=1}^{N} p_{it} q_{it}$,
in which q is the quantity of beds. The import share can be computed as $w_{it} = p_{it} q_{it} / m_t$.
The aggregate price index is generally approximated by the Stone's Price Index as $\ln P_t^* = \sum_{j=1}^{N} w_{jt} \ln p_{jt}$. In addition, major events related to the antidumping investigation were the
announcement of petition in October 2003, the affirmative less-than-fair-value determination
in July 2004, and the final implementation since January 2005. Three corresponding pulse
dummy variables are added to the AIDS model to represent these events. For instance, the
dummy variable for the petition announcement is equal to one for October 2003 and zero
for other months. The other two dummy variables are similarly defined.
The dynamic AIDS model is a combination of the static model, cointegration analysis,
and error correction model. The static AIDS model ignores dynamic adjustment in the short
run and focuses on the long-run behavior. In reality, producers’ behavior can be influenced
by various factors such as price fluctuations and policy interventions. Thus, the static model
can be restrictive for some situations. In addition, time series data used in the static AIDS
model may be nonstationary, which may invalidate the asymptotic distribution of an estimator. Finally, the static AIDS model is incapable of evaluating short-run dynamics. To
address these potential limitations associated with the static AIDS model, the concept of
cointegration has been introduced and the dynamic AIDS model has been developed.
The Engle-Granger two-stage cointegration analysis is employed in Wan et al. (2010a).
At first, the stationarity of the variables used in the static AIDS model needs to be examined through unit root tests, e.g., the Augmented Dickey-Fuller test. If these variables are
integrated with the same order, a cointegration test on the residuals calculated from the
static AIDS model is conducted to determine if the residuals are stationary. If a residual is
stationary, it suggests that a long-run equilibrium and cointegration relation exist for the
variables in that equation. Consequently, the estimates from the static AIDS model can
be interpreted as the long-run equilibrium relation among these variables.
If the cointegration relation is confirmed, the residuals ($\hat{u}_{it}$), also referred to as the error
correction terms, are saved to construct the dynamic AIDS model as follows:

$$\Delta w_{it} = \psi_i \Delta w_{i,t-1} + \lambda_i \hat{u}_{i,t-1} + \beta_i^d \Delta \ln\!\left(\frac{m_t}{P_t^*}\right) + \sum_{j=1}^{N} \gamma_{ij}^d \Delta \ln p_{jt} + \sum_{k=1}^{K} \varphi_{ik}^d D_{kt} + \xi_{it} \qquad (12.2)$$

where $\Delta$ is the first-difference operator; $\hat{u}$ is the residual from the static model and other
variables are the same as defined above; ψ, λ, β, γ, and φ are parameters to be estimated; and
ξ is the disturbance term. The superscript d in the parameters indicates the dynamic (short-run)
AIDS model. The parameter ψ measures the effect of consumption habit. The parameter λ
measures the speed of short-run adjustment and is expected to be negative.
One major concern with the AIDS model is whether the expenditure variable is exogenous. If the expenditure variable is correlated with the error term, then the seemingly unrelated regression estimator may become biased and inconsistent. The Durbin-Wu-Hausman
test is often used to address this concern. In addition, the adequacy of the model specification in the static and dynamic models can be examined through several diagnostic tests,
including the Breusch-Godfrey test, Breusch-Pagan test, Ramsey’s specification error test,
and Jarque-Bera LM test (Wan et al., 2010a).
12.2.2 Implementation: construction of restriction matrices
To comply with economic theory, the static AIDS model is required to satisfy the following
properties, which are organized as three groups of restrictions:

$$\text{Adding-up: } \sum_{i=1}^{N} \alpha_i = 1; \quad \sum_{i=1}^{N} \beta_i^s = 0; \quad \sum_{i=1}^{N} \gamma_{ij}^s = 0; \quad \sum_{i=1}^{N} \varphi_{ik}^s = 0$$

$$\text{Homogeneity: } \sum_{j=1}^{N} \gamma_{ij}^s = 0$$

$$\text{Symmetry: } \gamma_{ij}^s = \gamma_{ji}^s \qquad (12.3)$$
where the adding-up restriction can be satisfied through dropping one equation from the
estimation. The homogeneity and symmetry restrictions can be imposed on the parameters
to be estimated and assessed by likelihood ratio tests. For the dynamic AIDS model, these
restrictions can be similarly defined, imposed, and evaluated.
In actual implementation, how to impose the restrictions on the AIDS system will depend on how the model is estimated, e.g., generalized least square or maximum likelihood
estimation. In the erer library, the AIDS model is estimated with the generalized least
square, so the above restrictions are imposed through matrix manipulation. A good knowledge of matrix objects in R is needed to construct these restrictions and add them to the AIDS
model.
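As a minimal sketch (not the erer implementation), the restriction matrix R and vector q that appear later in Equation (12.11) can be built by hand for one homogeneity restriction; the coefficient ordering and dimensions below are assumptions made only for this illustration.

# One share equation with coefficients ordered as
# (alpha, beta, gamma1, ..., gamma4, phi1, ..., phi3), so H = 9 (assumed).
H <- 9
R <- matrix(0, nrow = 1, ncol = H)
R[1, 3:6] <- 1                      # pick out the four price coefficients
q <- 0                              # homogeneity: their sum must be zero
b <- c(0.2, 0.1, 0.3, -0.1, -0.15, -0.05, 0, 0, 0)   # made-up coefficients
R %*% b                             # equals q, so b satisfies homogeneity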
12.2.3 Implementation: estimation by generalized least square
Both the static and dynamic AIDS models can be expressed in a more general format for the
purpose of estimation (Henningsen and Hamann, 2007; Greene, 2011). To avoid confusion,
notations in this subsection should be read independently from these in previous formulas.
To begin with, consider a system of G equations with the following notations:
$$y_i = X_i \beta_i + u_i, \quad i = 1, 2, \ldots, G \qquad (12.4)$$

where $y_i$ is a vector of the dependent variable in equation i; $X_i$ is a matrix of independent
variables; $\beta_i$ is the coefficient vector; and $u_i$ is the vector of the disturbance term. Assume
that each equation has the same number of observations (t = 1, 2, ..., T). The independent
variable matrix is the same for each equation, and the number of independent variables in
one equation is H. Thus, the matrix dimension is $T \times 1$ for $y_i$ and $u_i$, $T \times H$ for $X_i$, and
$H \times 1$ for $\beta_i$.
To allow for matrix manipulations, the system can be expressed in a stacked format:

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_G \end{bmatrix} = \begin{bmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_G \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_G \end{bmatrix} + \begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_G \end{bmatrix} \qquad (12.5)$$

or more compactly as:

$$y = X\beta + u \qquad (12.6)$$

where the matrix dimension is $TG \times 1$ for $y$ and $u$, $TG \times HG$ for $X$, and $HG \times 1$ for $\beta$.
When the matrix of independent variables is the same across equations, i.e., $X_1 = \cdots = X_G$, the
Kronecker product notation can be used to simplify the relation further so $X = I_G \otimes X_1$,
where $\otimes$ is the Kronecker product and $I_G$ is an identity matrix of dimension G.
The whole system can be treated as a single equation and estimated by ordinary least
square. If that is the case, then a strong assumption is made about the covariance structure of the disturbance term. In practice, generalized least square has been developed to
accommodate more scenarios. Specifically, the coefficients and their covariance matrix by
generalized least square can be expressed as follows:

$$\hat{\beta} = \left(X' \hat{\Omega}^{-1} X\right)^{-1} X' \hat{\Omega}^{-1} y \qquad (12.7)$$

$$\operatorname{cov}(\hat{\beta}) = \left(X' \hat{\Omega}^{-1} X\right)^{-1} \qquad (12.8)$$
where $\hat{\Omega}$ is the estimated covariance of the disturbance term. Note the similarity between
the above equations and those for ordinary least square, i.e., Equation (7.6) on page 105.

The key to understanding the linkage between ordinary least square and generalized
least square is in the definition of $\hat{\Omega}$. Assume that the disturbance terms across observations
are not correlated, but there is contemporaneous correlation. Then, the covariance matrix
for the disturbance terms can be expressed as:

$$E[uu'] = \Omega = \Sigma \otimes I_T \qquad (12.9)$$

where $\Omega$ is a $TG \times TG$ matrix; $\Sigma = [\sigma_{ij}]$ is the $G \times G$ disturbance covariance matrix; $\otimes$ is
the Kronecker product; and $I_T$ is an identity matrix of dimension T.
Whether there is contemporaneous correlation in the disturbance terms is equivalent to
whether $\Sigma = [\sigma_{ij}]$ is a diagonal matrix. If $\Sigma$ is diagonal, then there
is no contemporaneous correlation, and estimating the whole system by generalized least
square is the same as estimating each equation separately by ordinary least square. This is
true because the covariance constant will be canceled out in Equation (12.7), resulting in
the same expression as Equation (7.6) on page 105. If $\Sigma$ is not diagonal, then incorporating
the contemporaneous correlation into the estimation will improve the efficiency
of the estimator. This has been referred to as seemingly unrelated regression in statistics.

Just like in ordinary least square, the true covariance matrix of the disturbance terms in
generalized least square, i.e., $\Omega$, is unknown. To obtain the estimated value $\hat{\Omega}$, the residual
values from a regression need to be utilized. There are several ways of generating $\hat{\Omega}$, and
the following treatment is very similar to that used in ordinary least square:

$$\hat{\sigma}_{ij} = \frac{u_i' u_j}{T - H} \qquad (12.10)$$

where $\hat{\sigma}_{ij}$ is an element in $\hat{\Sigma} = [\hat{\sigma}_{ij}]$. With the covariance matrix being estimated from the
actual data, this estimator has been called the feasible generalized least square.
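The following self-contained sketch imitates Equations (12.7), (12.9), and (12.10) with simulated data; the dimensions, coefficient values, and variable names are all made up, and this is not the estimation code used in the erer package (in erer, these steps are handled internally by functions such as aiStaFit(), as described in the text).

set.seed(1)
G <- 2; tt <- 50; H <- 3                           # equations, observations, regressors
X1 <- cbind(1, matrix(rnorm(tt * (H - 1)), tt))    # identical X for both equations
beta <- c(1, 2, -1, 0.5, -0.5, 3)                  # stacked true coefficients
X <- kronecker(diag(G), X1)                        # X = I_G (Kronecker) X_1
Sig0 <- matrix(c(1, 0.6, 0.6, 1), 2)               # true contemporaneous covariance
u <- as.vector(matrix(rnorm(tt * G), tt) %*% chol(Sig0))
y <- X %*% beta + u
b.ols <- solve(crossprod(X), crossprod(X, y))      # first-step OLS on the stacked system
res <- matrix(y - X %*% b.ols, nrow = tt)          # residuals by equation
Sig <- crossprod(res) / (tt - H)                   # Equation (12.10)
Oi <- kronecker(solve(Sig), diag(tt))              # inverse of Omega from Equation (12.9)
b.gls <- solve(t(X) %*% Oi %*% X, t(X) %*% Oi %*% y)   # Equation (12.7)
cbind(beta, b.ols, b.gls)                          # compare the estimates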
Finally, in many empirical applications, there is a need to estimate model coefficients
under linear restrictions. For example, in the AIDS model, homogeneity and symmetry restrictions from the underlying economic theory need to be imposed on the model. One way
of estimating the coefficients under linear restrictions is to constrain the coefficients with
the following equation:

$$R \beta_R = q \qquad (12.11)$$

where $\beta_R$ is the restricted coefficient vector, $R$ is the restriction matrix, and $q$ is the
restriction vector. Each linearly independent restriction is represented by one row of $R$ and
one corresponding element in $q$.
When the linear restrictions are imposed on the system, the restricted estimator and the
covariance matrix can be derived and expressed as follows:

$$\begin{bmatrix} \hat{\beta}_R \\ \hat{\lambda} \end{bmatrix} = \begin{bmatrix} X' \hat{\Omega}^{-1} X & R' \\ R & 0 \end{bmatrix}^{-1} \begin{bmatrix} X' \hat{\Omega}^{-1} y \\ q \end{bmatrix} \qquad (12.12)$$

$$\operatorname{cov}\begin{bmatrix} \hat{\beta}_R \\ \hat{\lambda} \end{bmatrix} = \begin{bmatrix} X' \hat{\Omega}^{-1} X & R' \\ R & 0 \end{bmatrix}^{-1} \qquad (12.13)$$

where $\hat{\lambda}$ is a vector of the Lagrangian multipliers for the linear restrictions. Apparently, if the
linear restrictions are null, then the above equations will be reduced to Equations (12.7)
and (12.8).
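Continuing the simulated example above, a sketch of the bordered system in Equation (12.12) might look as follows; the single restriction imposed here (equality of the second coefficients across the two equations) is invented purely for illustration.

# Reuses X, Oi, and y from the previous simulated sketch.
R <- matrix(0, nrow = 1, ncol = ncol(X)); R[1, 2] <- 1; R[1, 5] <- -1
q <- 0
A <- rbind(cbind(t(X) %*% Oi %*% X, t(R)), cbind(R, 0))   # bordered matrix
b <- rbind(t(X) %*% Oi %*% y, q)
sol <- solve(A, b)                                        # Equation (12.12)
beta.r <- sol[1:ncol(X)]       # restricted coefficients
lambda <- sol[ncol(X) + 1]     # Lagrangian multiplier for the restriction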
12.2.4 Implementation: calculation of demand elasticities
The key output from the AIDS model is the demand elasticities. Several types of elasticities
can be computed to evaluate the response of consumer preferences and import quantities to
changes in expenditure and prices. In this study, expenditure elasticity, Marshallian price
elasticities, and Hicksian price elasticities are calculated using the estimated parameters
from the AIDS model and the average import shares over the study period. From the static
AIDS model, the long-run elasticities can be calculated as:

$$\eta_i^s = 1 + \frac{\beta_i^s}{\bar{w}_i}$$

$$\epsilon_{ij}^s = -\delta_{ij} + \frac{\gamma_{ij}^s}{\bar{w}_i} - \frac{\beta_i^s \bar{w}_j}{\bar{w}_i}$$

$$\rho_{ij}^s = -\delta_{ij} + \frac{\gamma_{ij}^s}{\bar{w}_i} + \bar{w}_j \qquad (12.14)$$

where $\eta$, $\epsilon$, and $\rho$ are the expenditure elasticity, Marshallian price elasticity, and Hicksian price
elasticity, respectively; $\beta_i^s$ and $\gamma_{ij}^s$ are parameter estimates from the static AIDS model; the
Kronecker delta $\delta_{ij}$ is equal to 1 if i = j (i.e., own-price elasticity) and 0 if i ≠ j (i.e., cross-price
elasticity); and $\bar{w}$ is the average import share over the study period (2001 – 2008). For
the dynamic AIDS model, short-run elasticities can be similarly calculated via the above
formula, with the corresponding parameters (i.e., $\beta_i^d$ and $\gamma_{ij}^d$) being substituted.
The standard errors for elasticities can be computed by following several basic properties
of variances (Greene, 2011). In general,
$$\operatorname{var}(ax + b) = a^2 \operatorname{var}(x)$$

$$\operatorname{var}(x + y) = \operatorname{var}(x) + \operatorname{var}(y) + 2\operatorname{cov}(x, y)$$

$$\operatorname{var}\!\left(\sum_{i=1}^{n} x_i\right) = \sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{cov}(x_i, x_j) = \sum_{i=1}^{n} \operatorname{var}(x_i) + 2 \sum_{i<j} \operatorname{cov}(x_i, x_j) \qquad (12.15)$$

where x, y, $x_i$, and $x_j$ denote random variables, and a and b are constants. Note that the
variance of a constant is zero, and the correlation between a constant and a random variable
is zero too.
To compute the variances of the expenditure elasticity, Marshallian price elasticity, and Hicksian price elasticity, we just need to apply these basic variance properties to the formulas in Equation (12.14).
The results can be expressed as follows:

$$\operatorname{var}(\eta_i^s) = \frac{\operatorname{var}(\beta_i^s)}{\bar{w}_i^2}$$

$$\operatorname{var}(\epsilon_{ij}^s) = \frac{\operatorname{var}(\gamma_{ij}^s)}{\bar{w}_i^2} + \frac{\bar{w}_j^2 \operatorname{var}(\beta_i^s)}{\bar{w}_i^2} - \frac{2 \bar{w}_j \operatorname{cov}(\gamma_{ij}^s, \beta_i^s)}{\bar{w}_i^2}$$

$$\operatorname{var}(\rho_{ij}^s) = \frac{\operatorname{var}(\gamma_{ij}^s)}{\bar{w}_i^2} \qquad (12.16)$$

where the Kronecker delta $\delta_{ij}$ takes the constant value of 1 or 0 only as an additive term
in Equation (12.14), so it does not show up in the variance formulas anymore. Once the
variance is available, the t-ratio and p-value can be computed for an elasticity estimate.
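A small numerical sketch of Equations (12.14) and (12.16) for one own-price case (i = j) follows; the shares, coefficients, and (co)variances are made-up values, not estimates from Wan et al. (2010a).

w.i <- 0.44; w.j <- 0.44                  # average import shares
beta.s <- 0.05; gamma.s <- -0.12          # illustrative AIDS parameter estimates
var.b <- 0.0004; var.g <- 0.0009; cov.gb <- 0.0001
delta <- 1                                # own-price case: i = j
eta <- 1 + beta.s / w.i                                # expenditure elasticity
eps <- -delta + gamma.s / w.i - beta.s * w.j / w.i     # Marshallian own-price
rho <- -delta + gamma.s / w.i + w.j                    # Hicksian own-price
var.eta <- var.b / w.i ^ 2
var.eps <- var.g / w.i ^ 2 + w.j ^ 2 * var.b / w.i ^ 2 - 2 * w.j * cov.gb / w.i ^ 2
var.rho <- var.g / w.i ^ 2
round(c(eta = eta, t.eta = eta / sqrt(var.eta), eps = eps, rho = rho), 3)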
12.3 Program version for Wan et al. (2010a)
Program 12.1 Program version for Wan et al. (2010a) can be used to generate all the five
tables and one figure. To save space, a simplified version of the figure is produced with base
R graphics here, which is shown as Figure 11.2 Plotting multiple time series of wooden
bed trade on a single page on page 227. A more elegant version of this figure is presented
as Figure 16.5 Import shares of beds in Wan et al. (2010a) by ggplot2 and grid on
page 383. A number of new functions are created and saved in the erer library, and the
main R program has been reduced to less than three pages. The statistical analyses for this
project are well organized, and all the results can be reproduced within a few minutes.
The initial step in the program is to import two raw data sets: the import data and
expenditure data. At the end of the transformation, three R time series objects are generated
and saved in the erer library as daExp, daBedRaw, and daBed. Note that daExp is needed for
the Hausman test, daBedRaw is needed for generating some summary statistics in Table 1,
and daBed is the main data for the AIDS model.
Furthermore, Program 12.1 is structured to make all the results in Wan et al. (2010a)
reproducible with the erer library only. For the block titled "# 1. Data import and
transformation", the purpose is to demonstrate how to import raw data in Microsoft Excel
and transform them for the AIDS model. In particular, the dummy variable creation is also
incorporated into the aiData() function. If you are interested in running the model only,
then that is feasible by loading the erer library (line 2 of the program) and then skipping directly to the data() statement in block 1.4 (line 33).
Program 12.1 Program version for Wan et al. (2010a)
# Title: R Program for Wan et al. (2010 JAAE ); last revised Feb. 2010
library(RODBC); library(erer); library(xlsx)
options(width = 120); setwd("C:/aErer")
# -------------------------------------------------------------------------
# 1. Data import and transformation
# 1.1 Import raw data in Microsoft Excel format
dat <- odbcConnectExcel2007('RawDataAids.xlsx')
sheet <- sqlTables(dat); sheet$TABLE_NAME
impo <- sqlFetch(dat, "dataImport")
expe <- sqlFetch(dat, "dataExp")
odbcClose(dat)
names(impo); names(expe); head(impo); tail(expe)
# 1.2 Expenditure data for Hausman test
ex <- ts(data = expe[, -c(1, 2)], start = c(1959, 1), end = c(2009, 12),
frequency = 12)
Exp <- window(ex, start = c(2001, 1), end = c(2008, 12), frequency = 12)
head(Exp); bsStat(Exp)
# 1.3 Raw import data, date selection, and transformation for AIDS
BedRaw <- ts(data = impo[, -c(1, 2)], start = c(1996, 1),
end = c(2008, 12), frequency = 12)
lab8 <- c("CN", "VN", "ID", "MY", "CA", "BR", "IT")
dumm <- list(dum1 = c(2003, 10, 2003, 10), dum2 = c(2004, 7, 2004, 7),
dum3 = c(2005, 1, 2005, 1))
imp8 <- aiData(x = BedRaw, label = lab8, label.tot = "WD",
prefix.value = "v", prefix.quant = "q",
start = c(2001, 1), end = c(2008, 12), dummy = dumm)
imp5 <- update(imp8, label = c("CN", "VN", "ID", "MY")); names(imp5)
Bed <- imp8$out; colnames(Bed)[18:20] <- c("dum1", "dum2", "dum3")
# 1.4 Three datasets saved in 'erer' library already
# Results in Wan (2010 JAAE ) can be reproduced with saved data directly.
data(daExp, daBedRaw, daBed); str(daExp); str(Exp)
# -------------------------------------------------------------------------
# 2. Descriptive statistics (Table 1)
lab8 <- c("CN", "VN", "ID", "MY", "CA", "BR", "IT")
pig <- aiData(x = daBedRaw, label = lab8, label.tot = "WD",
prefix.value = "v", prefix.quant = "q",
start = c(2001, 1), end = c(2008, 12))
hog <- cbind(pig$share * 100, pig$price, pig$m / 10 ^ 6)
colnames(hog) <- c(paste("s", lab8, sep = ""), "sRW",
paste("p", lab8, sep = ""), "pRW", "Expend")
dog <- bsStat(hog, two = TRUE, digits = 3)$fstat[, -6]
colnames(dog) <- c("Variable", "Mean", "St. Dev.", "Minimum", "Maximum")
dog[, -1] <- apply(X = dog[, -1], MARGIN = 2,
FUN = function(x) {sprintf(fmt="%.3f", x)})
(table.1 <- dog)
# -------------------------------------------------------------------------
# 3. Monthly expenditure and import shares by country (Figure 1)
tos <- window(daBedRaw[, "vWD"], start = c(2001, 1), end = c(2008, 12))
tot <- tos / 10 ^ 6
sha <- daBed[, c('sCN', 'sVN', 'sID', 'sMY', 'sCA', 'sBR', 'sIT')] * 100
y <- ts.union(tot, sha); colnames(y) <- c('TotExp', colnames(sha))
windows(width = 5.5, height = 5, family = 'serif', pointsize = 11)
plot(x = y, xlab = "", main = "", oma.multi = c(2.5, 0, 0.2, 0))
# -------------------------------------------------------------------------
# 4. Hausman test and revised data
# 4.1 Getting started with a static AIDS model
sh <- paste("s", c(lab8, "RW"), sep = "")
pr <- paste("lnp", c(lab8, "RW"), sep = "")
du3 <- c("dum1", "dum2", "dum3"); du2 <- du3[2:3]
rSta <- aiStaFit(y = daBed, share = sh, price = pr, shift = du3,
expen = "rte", omit = "sRW", hom = TRUE, sym = TRUE)
summary(rSta)
# 4.2 Hausman test and new data
(dg <- daExp[, "dg"])
rHau <- aiStaHau(x = rSta, instr = dg, choice = FALSE)
names(rHau); colnames(rHau$daHau); colnames(rHau$daFit); rHau
two.exp <- rHau$daFit[, c("rte", "rte.fit")]; bsStat(two.exp, digits = 4)
262
76
77
Chapter 12 Sample Study B and New R Functions
plot(data.frame(two.exp)); abline(a = 0, b = 1)
daBedFit <- rHau$daFit
78
79
80
81
82
83
84
85
86
87
88
89
90
91
# ------------------------------------------------------------------------# 5. Static and dynamic AIDS models
# 5.1 Diagnostics and coefficients (Table 2, 3, 4)
hSta <- update(rSta, y = daBedFit, expen = "rte.fit")
hSta2 <- update(hSta, hom = FALSE, sym = F); lrtest(hSta2$est, hSta$est)
hSta3 <- update(hSta, hom = FALSE, sym = T); lrtest(hSta2$est, hSta3$est)
hSta4 <- update(hSta, hom = TRUE, sym = F); lrtest(hSta2$est, hSta4$est)
hDyn <- aiDynFit(hSta)
hDyn2 <- aiDynFit(hSta2); lrtest(hDyn2$est, hDyn$est)
hDyn3 <- aiDynFit(hSta3); lrtest(hDyn2$est, hDyn3$est)
hDyn4 <- aiDynFit(hSta4); lrtest(hDyn2$est, hDyn4$est)
(table.2 <- rbind(aiDiag(hSta), aiDiag(hDyn)))
(table.3 <- summary(hSta)); (table.4 <- summary(hDyn))
92
93
94
95
96
97
98
99
100
101
# 5.2 Own-price elasticities (Table 5)
es <- aiElas(hSta); ed <- aiElas(hDyn); esm <- edm <- NULL
for (i in 1:7) {
esm <- c(esm, es$marsh[c(i * 2 - 1, i * 2), i + 1])
edm <- c(edm, ed$marsh[c(i * 2 - 1, i * 2), i + 1])
}
MM <- cbind(es$expen[-c(15:16), ], esm, ed$expen[-c(15:16), 2], edm)
colnames(MM) <- c("Country", "LR.exp", "LR.Marsh", "SR.exp", "SR.Marsh")
(table.5 <- MM)
102
103
104
105
106
107
108
109
110
111
112
113
# 5.3 Cross-price elasticities (Table 6)
(table.6a <- es$hicks[-c(15:16), -9])
(table.6b <- ed$hicks[-c(15:16), -9])
for (j in 1:7) {
table.6a[c(j * 2 - 1, j * 2), j + 1] <- "___"
table.6b[c(j * 2 - 1, j * 2), j + 1] <- "___"
}
rown <- rbind(c("Long-run", rep("", times = 7)),
c("Short-run", rep("", times = 7)))
colnames(rown) <- colnames(table.6a)
(table.6 <- rbind(rown[1, ], table.6a, rown[2, ], table.6b))
114
115
116
117
118
# 5.4 Alternative specifications
summary(uSta1 <- update(hSta, shift = du2)); aiElas(uSta1)
summary(uDyn1a <- aiDynFit(uSta1)); aiElas(uDyn1a)
summary(uDyn1b <- aiDynFit(uSta1, dum.dif = TRUE))
119
120
121
122
123
124
# ------------------------------------------------------------------------# 6. Export five tables
# Table in csv format
(output <- listn(table.1, table.2, table.3, table.4, table.5, table.6))
write.list(z = output, file = "OutAidsTable.csv")
263
12.3 Program version for Wan et al. (2010a)
125
126
127
128
129
130
131
# Table in excel format
name <- paste("table", 1:6, sep = ".")
for (i in 1:length(name)) {
write.xlsx(x = get(name[i]), file = "OutAidsTable.xlsx",
sheetName = name[i], row.names = FALSE, append = as.logical(i - 1))
}
Note: Major functions used in Program 12.1 are: library(), getwd(), head(), tail(),
odbcConnectExcel2007(), sqlFetch(), ts(), window(), ts.union(), replace(), plot(),
windows(), update(), for loop, data(), aiData(), aiStaFit(), aiStaHau(), aiDynFit(),
aiElas(), listn(), write.list(), get(), and write.xlsx().
# Selected results from Program 12.1
> table.1
   Variable    Mean St. Dev. Minimum Maximum
1       sCN  44.226    6.984  27.995  58.527
2       sVN  11.731   10.728   0.087  34.251
3       sID   7.817    1.772   4.661  12.525
4       sMY   6.309    2.358   2.340  10.012
5       sCA   6.306    3.720   1.394  16.185
6       sBR   4.435    1.319   1.813   8.807
7       sIT   4.136    2.700   0.716  11.764
8       sRW  15.040    4.106   9.249  26.271
9       pCN 150.179   10.351 116.067 177.675
10      pVN 117.344   11.580  90.712 150.721
11      pID 135.295   21.591  91.127 189.369
12      pMY 104.600   11.536  78.988 142.184
13      pCA 123.673   12.682  94.238 187.215
14      pBR  87.569   11.683  38.021 120.905
15      pIT 244.321  110.453 137.408 652.052
16      pRW 112.263   13.618  84.258 145.088
17   Expend  83.915   23.362  33.728 121.153

> table.5
   Country    LR.exp  LR.Marsh    SR.exp   SR.Marsh
1      sCN  1.095*** -0.467***  1.285***  -0.998***
2          (19.723)  (-2.992)  (12.079)  (-10.311)
3      sVN  3.013*** -2.491***  1.091***  -1.109***
4          (16.783)  (-6.967)   (6.013)   (-8.663)
5      sID  0.494*** -0.955***  0.546***  -0.978***
6           (4.573)  (-5.665)   (2.525)   (-7.301)
7      sMY  2.281*** -0.909***  0.946***  -0.848***
8          (21.988)  (-6.107)   (5.286)   (-7.786)
9      sCA -0.845*** -0.968***     0.172  -1.035***
10          (-6.055) (-3.912)   (0.656)   (-8.062)
11     sBR  0.892*** -1.137***   0.577**  -0.987***
12          (6.068)  (-6.091)   (2.179)   (-9.031)
13     sIT -0.583*** -1.162***     0.511  -0.849***
14          (-3.179) (-9.407)   (1.363)   (-6.119)
12.4 Needs for user-defined functions
As presented above, the full program version for Wan et al. (2010a) is less than three pages
but can reproduce all the results. A dominant feature of this program version is that a good
number of user-defined functions are defined and called repeatedly. To demonstrate the need
for user-defined functions, a simple example is extracted from Program 12.1 and analyzed
in this section.
In Program 12.2 Estimating the AIDS model with fewer user-defined functions, the
AIDS model estimated is static and has four suppliers only: China, Vietnam, Indonesia, and
the residual supplier. The model is first estimated with the two user-defined functions in the
erer library: aiData() and aiStaFit(). Then, the model is estimated again without them.
To replicate the results, several steps are needed: preparing the data for the regression,
constructing the formula list, creating the restriction matrices, and fitting the regression at
the end. Both approaches use the systemfit() function to estimate the static AIDS
demand model, and they generate the same results.
A comparison of the above two approaches can reveal several differences. First of all,
defining new functions can save space. The user-defined functions wrap up many commands
in the format of new functions and make the program flow more concise. Without the user-defined functions, the program is long and fragmented. Furthermore, calling user-defined
functions repeatedly can greatly save space. In many situations, the saving is one line in
calling a function versus a half page without a user-defined function. In Program 12.1
Program version for Wan et al. (2010a), these user-defined functions have been called repeatedly, and without them, the length of the program version would have been at least 20
pages.
Second, defining new functions makes a program version better organized. Sometimes a
new function may be called only once in a program. Thus, wrapping a group of commands
in a function format seems to have limited benefits. However, even if that is true, the single
benefit of having a program organized can still be large enough in many situations to justify
efforts for writing a new function. With user-defined functions, Program 12.1 is reduced
to less than three pages and becomes more like a large table of contents, and many details
are wrapped up within the definitions of new functions. This approach is consistent with the
idea of proposal, program, and manuscript versions for a research project, as emphasized
in this book. An R program version thus has an identifiable structure that allows easy
connections with the proposal and manuscript version for a project. In sum, user-defined
functions make organizing the program version of a study possible or easier. In general, if a
program version for a study is very long (e.g., 40 pages), then it is a good indication that
user-defined functions should be considered and adopted.
Finally, user-defined functions make statistical analyses much more efficient for a research project. In Program 12.2, the model estimated is static and has four suppliers only.
The raw data of daBedRaw contains import values and quantities for 16 countries. At the beginning, how many countries should be included as individual suppliers in the AIDS model?
This is a typical empirical question that can only be determined by the trade pattern for
a specific commodity. In Wan et al. (2010a), seven countries and one residual supplier are
chosen at the end. However, how about a similar model with other choices, like five, nine, or
even more suppliers? How about changing the start point from January 2001 to July 2001
because of some large trade volatility for Vietnam? In reality, the number of alternative
models for data mining can be very large. Without user-defined functions, researchers often
find it hopeless to answer these questions, so the most likely outcome is to examine only one or two selected hypotheses. In particular, note that in Program 12.2, a change
in the country list, i.e., choice <- c("CN", "VN", "ID"), will require many changes in
the subsequent commands, e.g., the dimension of res.left. In contrast, when user-defined
functions are available, additional hypotheses can be examined easily and comprehensively
with a simple function call.
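For instance, once aiData() and aiStaFit() are available, an alternative model is just another pair of calls. The short sketch below re-estimates the static model with a hypothetical list of five suppliers and a later starting month; the object names (lab5, alt, fit5) are illustrative, and the defaults of aiData() are assumed to be the same as in Program 12.2.

# A sketch only: five suppliers and a sample starting in July 2001
library(erer); data(daBedRaw)
lab5 <- c("CN", "VN", "ID", "MY", "CA")
alt  <- aiData(x = daBedRaw, label = lab5, start = c(2001, 7), end = c(2008, 12))
sh5  <- paste("s", c(lab5, "RW"), sep = "")
pr5  <- paste("lnp", c(lab5, "RW"), sep = "")
fit5 <- aiStaFit(y = alt$out, share = sh5, price = pr5, expen = "rte",
  hom = TRUE, sym = TRUE)
summary(fit5)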
The benefit of user-defined functions does not come to us without a cost. The cost is to learn how to wrap up a group of commands in the format of an R function. This is the focus of Part IV Programming as a Wrapper. Specifically, for Wan et al. (2010a), the following new functions are defined for estimating static and dynamic AIDS models. Methods for some generic functions are also defined for several new classes, e.g., summary() and print().
bsFormu()    Creating formula objects for models; used inside aiStaFit()
bsLag()      Generating lagged time series; used inside aiDynFit()
aiData()     Transforming raw data for static AIDS model with dummy variables
aiStaFit()   Fitting a static AIDS model
aiStaHau()   Conducting a Hausman test on a static AIDS model
aiDynFit()   Fitting a dynamic AIDS model
aiDiag()     Diagnostic statistics for static or dynamic AIDS model
aiElas()     Computing elasticity for static or dynamic AIDS models
Program 12.2 Estimating the AIDS model with fewer user-defined functions
# 0. Load library; inputs and choices
library(systemfit); library(erer)
data(daBedRaw); colnames(daBedRaw)
wa <- c(2001, 1); wb <- c(2008, 12); choice <- c("CN", "VN", "ID")

# 1. With two user-defined functions: aiData(); aiStaFit()
pit <- aiData(x = daBedRaw, label = choice, start = wa, end = wb)
cow <- pit$out; round(head(cow), 3)

sh <- paste("s", c(choice, "RW"), sep = "")
pr <- paste("lnp", c(choice, "RW"), sep = "")
rr <- aiStaFit(y = cow, share = sh, price = pr, expen = "rte",
  hom = TRUE, sym = TRUE)
summary(rr)
names(rr); rr$formula
rr$res.matrix

# 2. Without two user-defined functions: aiData(); aiStaFit()
# 2.1 Prepare data for AIDS
vn2 <- paste("v", choice, sep = "")
qn2 <- paste("q", choice, sep = "")
x <- window(daBedRaw, start = wa, end = wb)
y <- x[, c(vn2, "vWD", qn2, "qWD")]
vRW <- y[, "vWD"] - rowSums(y[, vn2])
qRW <- y[, "qWD"] - rowSums(y[, qn2])
value <- ts.union(y[, vn2], vRW); colnames(value) <- c(vn2, "vRW")
quant <- ts.union(y[, qn2], qRW); colnames(quant) <- c(qn2, "qRW")

price <- value / quant; colnames(price) <- c("pCN", "pVN", "pID", "pRW")
lnp <- log(price); colnames(lnp) <- c("lnpCN", "lnpVN", "lnpID", "lnpRW")
m <- ts(rowSums(value), start = wa, end = wb, frequency = 12)
share <- value / m; colnames(share) <- c("sCN", "sVN", "sID", "sRW")

rte <- log(m) - rowSums(share * lnp)
dee <- ts.union(share, rte, lnp)
colnames(dee) <- c(colnames(share), "rte", colnames(lnp))
round(head(dee), 3)
identical(cow, dee)  # TRUE

# 2.2 Formula and restriction matrix
mod <- list(China     = sCN ~ 1 + rte + lnpCN + lnpVN + lnpID + lnpRW,
            Vietnam   = sVN ~ 1 + rte + lnpCN + lnpVN + lnpID + lnpRW,
            Indonesia = sID ~ 1 + rte + lnpCN + lnpVN + lnpID + lnpRW)
res.left <- matrix(data = 0, nrow = 6, ncol = 18)
res.left[1, 3:6] <- res.left[2, 9:12] <- res.left[3, 15:18] <- 1
res.left[4, 4] <- res.left[5, 5]  <- res.left[6, 11] <- 1
res.left[4, 9] <- res.left[5, 15] <- res.left[6, 16] <- -1
res.right <- rep(0, times = 6)
identical(res.left, rr$res.matrix)  # TRUE

# 2.3 Fit AIDS model
dd <- systemfit(formula = mod, method = "SUR", data = dee,
  restrict.matrix = res.left, restrict.rhs = res.right)
round(summary(dd, equations = FALSE)$coefficients, digits = 3)
# Selected results from Program 12.2
> data(daBedRaw); colnames(daBedRaw)
[1] "vBR" "vCA" "vCN" "vDK" "vFR" "vHK" "vIA" "vID" "vIT" "vMY" "vMX"
[12] "vPH" "vTW" "vTH" "vUK" "vVN" "vWD" "qBR" "qCA" "qCN" "qDK" "qFR"
[23] "qHK" "qIA" "qID" "qIT" "qMY" "qMX" "qPH" "qTW" "qTH" "qUK" "qVN"
[34] "qWD"
> cow <- pit$out; round(head(cow), 3)
           sCN   sVN   sID   sRW    rte lnpCN lnpVN lnpID lnpRW
Jan 2001 0.402 0.001 0.107 0.490 12.832 4.932 4.831 4.590 4.679
Feb 2001 0.305 0.001 0.089 0.604 12.527 4.896 4.831 4.558 4.798
Mar 2001 0.280 0.001 0.125 0.594 12.758 4.754 4.800 4.584 4.773
Apr 2001 0.333 0.001 0.118 0.548 12.728 4.890 4.862 4.614 4.795
May 2001 0.376 0.001 0.108 0.515 12.784 4.936 4.684 4.512 4.804

> summary(rr)
     Parameter        sCN        sVN        sID
1  (Intercept)      0.006  -3.438***   0.657***
2                  (0.020)  (-10.502)   (7.864)
3          rte      0.030   0.267***  -0.043***
4                  (1.212)   (10.742)  (-6.785)
5        lnpCN    0.167**     -0.020    -0.030*
6                  (2.299)   (-0.341)  (-1.852)
7        lnpVN     -0.020     -0.092     -0.004
8                 (-0.341)   (-1.405)  (-0.337)
9        lnpID    -0.030*     -0.004      0.010
10                (-1.852)   (-0.337)   (0.875)
11       lnpRW  -0.117***   0.115***     0.025*
12                (-2.719)    (3.050)   (1.837)
13   R-squared      0.143      0.622      0.541

> names(rr); rr$formula
 [1] "y"          "share"      "price"      "expen"      "shift"
 [6] "omit"       "nOmit"      "hom"        "sym"        "nShare"
[11] "nExoge"     "nParam"     "nTotal"     "formula"    "res.matrix"
[16] "res.rhs"    "est"        "AR1"        "call"
[[1]]
sCN ~ 1 + rte + lnpCN + lnpVN + lnpID + lnpRW
<environment: 0x0ee3de1c>

[[2]]
sVN ~ 1 + rte + lnpCN + lnpVN + lnpID + lnpRW
<environment: 0x0ee3de1c>

[[3]]
sID ~ 1 + rte + lnpCN + lnpVN + lnpID + lnpRW
<environment: 0x0ee3de1c>
> rr$res.matrix
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
[1,]    0    0    1    1    1    1    0    0    0     0     0     0     0
[2,]    0    0    0    0    0    0    0    0    1     1     1     1     0
[3,]    0    0    0    0    0    0    0    0    0     0     0     0     0
[4,]    0    0    0    1    0    0    0    0   -1     0     0     0     0
[5,]    0    0    0    0    1    0    0    0    0     0     0     0     0
[6,]    0    0    0    0    0    0    0    0    0     0     1     0     0
     [,14] [,15] [,16] [,17] [,18]
[1,]     0     0     0     0     0
[2,]     0     0     0     0     0
[3,]     0     1     1     1     1
[4,]     0     0     0     0     0
[5,]     0    -1     0     0     0
[6,]     0     0    -1     0     0
> round(summary(dd, equations = FALSE)$coefficients, digits = 3)
                      Estimate Std. Error t value Pr(>|t|)
China_(Intercept)        0.006      0.320   0.020    0.984
China_rte                0.030      0.024   1.212    0.226
China_lnpCN              0.167      0.073   2.299    0.022
China_lnpVN             -0.020      0.057  -0.341    0.734
China_lnpID             -0.030      0.016  -1.852    0.065
China_lnpRW             -0.117      0.043  -2.719    0.007
Vietnam_(Intercept)     -3.438      0.327 -10.502    0.000
Vietnam_rte              0.267      0.025  10.742    0.000
Vietnam_lnpCN           -0.020      0.057  -0.341    0.734
Vietnam_lnpVN           -0.092      0.065  -1.405    0.161
Vietnam_lnpID           -0.004      0.012  -0.337    0.736
Vietnam_lnpRW            0.115      0.038   3.050    0.003
Indonesia_(Intercept)    0.657      0.084   7.864    0.000
Indonesia_rte           -0.043      0.006  -6.785    0.000
Indonesia_lnpCN         -0.030      0.016  -1.852    0.065
Indonesia_lnpVN         -0.004      0.012  -0.337    0.736
Indonesia_lnpID          0.010      0.011   0.875    0.382
Indonesia_lnpRW          0.025      0.013   1.837    0.067
12.5 Road map: how to write new functions (Part IV)
In this chapter, we have learned the sample study of Wan et al. (2010a) for Part IV Programming as a Wrapper, including its manuscript version, underlying statistics, and the
final program version. The demand for user-defined functions is demonstrated through a
small AIDS model. Writing new functions for a specific research project can save programming space, make the program well organized, and save time by improving programming
efficiency. Thus, the focus from this point on is to learn programming techniques for writing
user-defined functions.
The remaining four chapters in this Part are designed to help students learn how to define new
functions to meet the unique need of a specific project. At the end of this Part, you should be
able to understand the whole sample program of Program 12.1 completely, including the
relevant functions in the erer library. For your own selected empirical study, new functions
can be defined if there is such a need. More likely, the programming techniques learned
in this Part, e.g., flow control structures, will improve your programming efficiency greatly
even without any new function being defined. A good example is available in Program 12.1
when two for loops are used to manipulate the final table outputs.
Briefly, Chapter 13 Flow Control Structure presents techniques for controlling programming flow in R. In particular, if and for are the workhorses and thus a deep understanding
of them is a must for efficient programming in R. Chapter 14 Matrix and Linear Algebra
explains operation rules and main functions for matrix manipulations. Chapter 15 How
to Write a Function elaborates R function structure. S3 and S4 as the two approaches of
organizing R functions are compared. At the end, in Chapter 16 Advanced Graphics, several advanced topics for R graphics are covered, including the grid system and the ggplot2
package. In all these chapters, a large number of applications are designed and included,
and several of them are closely related to the sample study of Wan et al. (2010a).
A simple test on whether you understand the materials in this Part well is to read
Program 12.1 Program version for Wan et al. (2010a) and the relevant new user-defined
functions, and then assess how much you can comprehend them. A number of exercises are
also designed and included in the following chapters, especially for writing new functions.
You are strongly encouraged to practice with these small exercises before you move on to
some real and large data sets.
Overall, the techniques presented in Part IV Programming as a Wrapper will appear to
be more helpful or related to real data analyses than those covered in Part III Programming
as a Beginner. The relation is like driving a car at the beginning of a highway and then in
the middle of the highway. If you have spent large efforts and gained a solid understanding
of the materials in Part III Programming as a Beginner, then you will be able to conduct
statistical analyses even more efficiently with moderate efforts in this Part. This is also like
our harvest season while we grow up with R in conducting empirical studies. Enjoy the
freedom that R as a computer language brings to you now.
Chapter 13 Flow Control Structure

Operations in R are usually organized in the format of a function. However, sometimes a function structure may not be suitable or sufficient to manage a program
flow. In addition, writing a large user-defined function also needs some special syntax to manage its function body efficiently. R has two types of flow control statements
(Braun and Murdoch, 2008). One type is branching statements, conditional statements, or simply conditionals, including if, ifelse(), and switch(). The other is looping statements or loops, including for, while, and repeat. Among these statements, the most frequently used are if and for, and furthermore, if is generally easier to use than for.
In this chapter, techniques for controlling R program flow are presented first with simple
examples. For actual scientific research, conditional and looping statements are often mixed
or nested within each other. This will be demonstrated by several challenging applications.
A note is needed here about the formatting difference between a function and a flow
control statement. Strictly speaking, ifelse() and switch() are different from others in
that they are normal functions. Thus, ifelse() and switch() are formatted in the text
with a pair of parentheses. In contrast, the flow control statements (e.g., if) are different
from a typical function in many aspects, even though they can be used like a function in
some cases. Thus, these keywords are formatted in the text without parentheses, e.g., if.
To launch a help page within an R session, use help("if") or ?"if". A command like ?if
does not work, but it works for a function like ?ifelse or ?switch.
13.1 Conditional statements
Three ways of constructing conditional execution are presented in this section. The if statement is most flexible in controlling a large number of commands. ifelse() and switch()
are two functions that can become very handy for some tasks. In the following presentation,
the concepts and basic definitions are elaborated. In Program 13.1 Conditional execution
with if, ifelse(), and switch(), several examples are employed to demonstrate the use
of these conditional statements.
13.1.1 Branching with if
The if statement provides users the flexibility to choose which group of commands is executed when several commands are interlinked. It can have an optional else part.
Specifically, a conditional statement of if can take one of the following two forms:
# Form A: without else
if (cond) {
  commands_true
}

# Form B: with else
if (cond) {
  commands_true
} else {
  commands_false
}
where cond is a logical expression that returns a single logical value of TRUE or FALSE.
commands_true and commands_false are two command groups.
The group of commands is usually enclosed within a set of curly braces. Individual
commands are separated by semicolons or set up as new lines. When multiple commands and
the else part are included, a number of formatting rules should be followed, as demonstrated
in Program 7.3 Curly brace match in the if statement on page 118, and additionally,
explained in Section 7.6 Formatting an R program on page 120. Briefly, there should be
a space before the left parenthesis and curly brace, i.e., if (); the "} else {" component should be on a separate new line; and there should be consistent indentation and vertical
alignment for all commands in a group. If there is one command only within a group, then
the curly braces can be omitted. However, including them is recommended for clarity.
A program flow is conditional on and controlled by the condition object of cond. If the
condition evaluates to TRUE, then a group of commands will be invoked. If the condition
evaluates to FALSE and the optional part of else is provided, then an alternative group of
commands will be invoked. If the condition is FALSE and the else part is not provided, then
no commands will be invoked and executed, and nothing happens at the end.
Instead of cond, the condition object can also be expressed as !cond, where the ! operator
is logical negation (NOT). In this kind of situation, note that the information supplied by a user is the cond object, but the condition object for the if statement is changed into !cond.
If the supplied value for cond is FALSE, then the condition object is TRUE and the commands
afterward will be executed. Thus, the interest is on the status of cond being FALSE, or the
condition object !cond for the if statement being TRUE.
When do we need to use !cond as the condition object, not cond directly? In writing a
new function, the default value for an argument may need to be set in such a way that the
program flow is easy to understand or it can reduce processing time. This will become more
apparent when we learn techniques of writing new functions.
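As a small illustration with hypothetical objects, the first block below uses Form B, and the second block negates a logical flag with ! so that the commands run only when the flag is FALSE.

# Form B with a concrete condition
x <- c(2, 5, NA, 9)
if (any(is.na(x))) {
  y <- mean(x, na.rm = TRUE)
} else {
  y <- mean(x)
}
y

# Using !cond: run the commands only when the result is not saved yet
saved <- FALSE
if (!saved) {
  message("Computing the result ...")
}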
If needed, multiple if statements can be nested in one of the following two ways:
# Form C: nesting without "else if"
if (cond_1) {
  commands_true_1
} else {
  if (cond_2) {
    commands_true_2
  } else {
    commands_false_2
  }
}
Chapter 14 Matrix and Linear Algebra

A matrix object is usually primitive in a computer language, and the relevant matrix language is fundamental to statistical computation. In Part III Programming
as a Beginner, we have learned some basic techniques in manipulating R objects, including matrices. However, in general, R beginners do not use matrices intensively. Instead,
matrix manipulations are more closely related to writing user-defined functions. As the focus
of Part IV Programming as a Wrapper is writing new functions, operation rules and main
functions for matrix manipulations are detailed in this chapter. A number of applications
are included to demonstrate how to conduct statistical computation with matrices. Overall,
the materials in this chapter will allow us to be better prepared for writing new functions.
14.1 Matrix creation and subscripts
A matrix can be created in several different ways (Matloff, 2011). Subscripting and indexing
a matrix are needed to extract a subset from an existing matrix or to replace some of its
values. R functions for matrix creation and subscripts are detailed in this section.
14.1.1 Creating a matrix
The concept of matrix in R is closely related to the concepts of vector and array. An
array is a multidimensional extension of vectors, and it contains objects of the same mode.
The most commonly used array in R has two dimensions, i.e., a matrix. Internally in R, a
matrix is stored columnwise in a vector with an additional dimension attribute, which gives
the number of rows and columns. If the elements of a matrix are not of the same mode,
then all the elements will be coerced from a specific type to a more general type, e.g., from
a numeric mode to a character mode. Thus, the mode of a matrix is simply the mode of its
constituent elements; the class of a matrix is matrix. To find out if an object is a matrix,
use the is.matrix() function to test it.
A matrix is very different from a data frame in R. A data frame is a list with the
restriction that all its individual elements have the same length. It can be similarly indexed
like a matrix, but it accommodates different modes, making it especially convenient in
handling heterogeneous raw data and analysis outputs. In contrast, a matrix object in R
can hold values of the same mode only; it needs smaller storage space in general, which
can be revealed by the object.size() or lss() function. R matrices allow all operations
related to linear algebra, and thus they are the building block of statistical computation.
As a result, matrices are especially relevant for advanced analyses in R.
The main function that can be used to create a new matrix is: matrix(data = NA, nrow
= 1, ncol = 1, byrow = FALSE, dimnames = NULL). Basically, this function converts a
vector into a matrix, with the options of specifying the numbers and names for both the
rows and columns. Specifically, the data argument is an optional data vector. The nrow and
ncol arguments specify the desired number of rows and columns, respectively. If one of the
nrow and ncol arguments is not given, then an attempt is made to infer it from the length
of data and the other arguments. If neither is given, a one-column matrix is returned. If
there are too few elements in data to fill the matrix, then the elements in data are recycled.
That provides a compact way of making a new matrix full of zeros, ones, or NAs at the
beginning of a specific task, e.g., matrix(data = 0, nrow = 3, ncol = 5). The byrow
argument has a logical value; if FALSE (the default), the matrix is filled by column; and if
TRUE, the matrix is filled by row. The dimnames argument can be NULL or a list of length
two, giving the corresponding row and column names; an empty list is treated as NULL; and
a list of length one is treated as row names.
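As a brief hypothetical illustration of recycling and of the dimnames argument:

m1 <- matrix(data = 0, nrow = 3, ncol = 5)     # a single zero recycled to fill 3 x 5
m2 <- matrix(data = 1:6, nrow = 2, byrow = TRUE,
  dimnames = list(c("r1", "r2"), paste("c", 1:3, sep = "")))
m1; m2; dim(m2); nrow(m2); ncol(m2)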
The diag() function can extract or replace the diagonal of a matrix, or construct a new
diagonal matrix. Recall that an identity matrix or unit matrix of size n is the n × n square
matrix with ones on the main diagonal and zeros elsewhere. Specifically, in diag(x = 1,
nrow, ncol), the x argument can be a matrix, a vector, a one-dimensional array, or missing;
nrow and ncol are optional dimensions for the result when x is not a matrix. As detailed at
the built-in help page, diag() has four distinct usages. First, it extracts the diagonal of x
when x is supplied as a matrix. Second, it returns an identity matrix when x is missing and
nrow is specified. Third, if x is a scalar (i.e., a length-one vector) and the only argument,
then it returns an identity matrix of size given by x. (Note strictly speaking, R as a language
cannot define a scalar object explicitly.) Fourth, if x is a numeric vector with a length of at
least two, or if x is a numeric vector with any length but other arguments are also supplied
in diag(), then it returns a matrix with the given diagonal and zero off-diagonal entries.
R has a set of coercion functions that can be used to convert objects with different
classes or modes. To coerce an existing object (e.g., data frame or vector) into a matrix, use
the function of as.matrix(). In addition, use the rbind() function to stack two matrices
vertically, and use cbind() to stack matrices horizontally. The functions of stack() and
unstack() are valid for a data frame or list, but cannot be applied on a matrix.
In linear algebra, matrix vectorization is a linear transformation which converts a matrix
into a vector. To do that in R, use the functions of as.vector() or c(); both have the
same effect for matrix vectorization. c() is more concise and as.vector() is clearer. This
operation removes the dimension attribute from a matrix and leaves the elements in a
vector form, given that the elements of a matrix are stored columnwise internally. Note that
in general the functions of as.vector() and c() are different in many ways. as.vector()
removes all the attributes of an object if the output is atomic, but not for a list object. This
difference is shown with an example in Program 14.1 Creating and subscripting matrices
in R.
Several R functions are available to extract and replace the attributes of a matrix.
Every matrix has a dimension attribute. The dim() function returns a vector of length
two containing the number of rows and columns. Individual elements can be accessed using
nrow() or ncol(). Another matrix attribute is row and column names. They can be assigned
through the dimnames argument of the matrix() function, or after a matrix is created,
through the dimnames(), rownames(), or colnames() function. Since the numbers of rows
and columns in a matrix need not be the same, the value of the dimnames argument in
295
14.1 Matrix creation and subscripts
matrix(), or the value for the dimnames() function, must be a list; the first element is a
vector of names for rows, and the second is a vector of names for columns. To provide names
for just one dimension, use a value of NULL for the dimension without a name.
14.1.2 Subscripting a matrix
Subscripting a matrix can be handled by the [ operator and two indices. The other major
R indexing operators of [[ and $ are available for list and data frame objects, but not
for matrices. As usual, the index values used for subscripting a matrix can be numerical,
character, logical, or empty.
The general format for indexing a matrix is as follows:
x[i, j, ..., drop = TRUE]
x[i]
where x is a matrix; i and j are subscripts; and drop indicates whether the result is coerced
to the lowest possible dimension. By default, subscripting operations reduce the dimensions
of a matrix whenever possible. Consequentially, subscripting a matrix can potentially return
a vector. This may cause problems when the output from subscripting a matrix needs to be
a matrix and will be used in further matrix operations. To prevent this from happening, the
matrix nature of the extracted object can be retained with the drop = FALSE argument.
Note the drop = FALSE argument is applicable with two-index subscripting only.
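A brief illustration with a hypothetical matrix:

mm <- matrix(data = 1:12, nrow = 3, ncol = 4)
mm[2, ]                    # the dimension is dropped: a vector of length 4
mm[2, , drop = FALSE]      # still a 1 x 4 matrix
is.matrix(mm[2, ]); is.matrix(mm[2, , drop = FALSE])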
A single index can also be used to subscript elements in a matrix, and the output is
always a vector. This is because a matrix is internally stored as a vector. In particular, in
subscripting a matrix, another new matrix with the same dimension and all logical values can
be used as subscripts. The new matrix can be generated by logical operations on the existing
matrix, or with several other functions related to matrices. For example, the lower.tri()
and upper.tri() functions return a logical matrix useful in extracting the lower or upper
triangular elements of a matrix. The row() function returns a new matrix with the same
dimension as the existing matrix, filling each cell with the row number of each element;
the col() function returns a new matrix with the column numbers. In the end, the new
matrix serving as the single index is the same as a vector index that is compatible with
the dimension of the existing matrix. Thus, behind this type of operation, a single index is
actually used for subscripting.
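A small sketch of this single-index style with lower.tri(), row(), and col() on a hypothetical matrix (Program 14.1 below covers creation and subscripting more fully):

ms <- matrix(data = 1:16, nrow = 4)
ms[lower.tri(ms)]              # elements below the main diagonal, as a vector
ms[lower.tri(ms)] <- 0         # replace them with zeros
ms
ms[row(ms) == col(ms)]         # the main diagonal via row() and col()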
Program 14.1 Creating and subscripting matrices in R
# A. Creating a matrix
aa <- 1:1000
bb <- matrix(data = aa, nrow = 1); class(bb); mode(bb)
cc <- data.frame(bb)
library(erer); lss()  # size

# diag(): four usages
ma <- matrix(data = 1:20, nrow = 4, ncol = 5, byrow = FALSE); ma
diag(x = ma)                     # 1. extract the diagonal values
diag(diag(x = ma))               #    extract the diagonal matrix
diag(nrow = 4)                   # 2. create an identity matrix
diag(x = 4)                      # 3. create an identity matrix
diag(x = c(3, 9, 10))            # 4. a matrix with the given diagonal
diag(x = c(3, 9, 10), nrow = 4)  #    a matrix with more rows
mb <- ma
Chapter 15 How to Write a Function

A function in R is an object that can transform a set of argument values and then
return output values, e.g., function.name <- function(arguments) {body}. In
this chapter, the structure of a function is analyzed first. R has two approaches in
organizing a function: S3 and S4. Their features are elaborated and compared in detail with
several examples. At the end, a number of applications are designed to demonstrate how
to write a user-defined function. These include writing a function for conducting numerical
optimization with one dimension, estimating a binary choice model with maximum likelihood
for Sun et al. (2007), and wrapping up several functions for a static AIDS model for Wan
et al. (2010a).
15.1 Function structure
In this section, the structure of a new function is analyzed. The major components of a new
function include a name for the new function, the function keyword, arguments, and the
body. The core of a function is its body, which is basically manipulating argument objects
as inputs and generating output objects at the end. Thus, all the techniques we have learned
about R object manipulations will be the foundation of writing new functions. Materials
covered in this chapter are mainly procedural, and we emphasize how to organize some R
commands for a task into a function structure that can be called and used repeatedly.
15.1.1 Main components
A function in R is one type of object that receives arguments from users, makes some
transformation on the arguments, and finally returns one or more values. This is very similar
to the concept of production function in economics. A production function, also known as a
transformation function, changes production factors into one or multiple outputs (e.g., from
labor, capital, and raw materials into computers). User-defined R functions are treated as
the same as predefined functions. Thus, they can be used independently or called by other
functions. The major benefit of creating new functions is that a large programming task can
be organized in small units, and then they can be addressed by individual new functions.
If the number of new functions is large for a project, then they can be wrapped up as
a package, which will be covered in Part V Programming as a Contributor. At present,
writing a function has become a basic technique in using R efficiently for even moderately
challenging research projects, e.g., Wan et al. (2010a).
The basic syntax of a function in R can be expressed like this:
function.name <- function(argument 1, argument 2, ...) {
  statement 1; statement 2
  statement 3
  ...
}
where typical R formatting rules are followed. These rules include one space before and after
the <- operator; no space after the function keyword and before the left parenthesis; one
space before the left curly brace; putting the left curly brace at the end of the argument line;
putting the right curly brace separately on a new line; aligning the right curly brace with
function.name vertically; separating multiple arguments with a comma; and separating
multiple commands on the same line with a semicolon. In Program 15.1 Function structure
and properties, a new function is formatted in this way unless it is very short and can be
put on a single line.
The above function structure has four major components: (a) the function name and
assignment operator, (b) the keyword of function, (c) the arguments, and (d) the body and
curly braces. The body component can be further divided into three parts: (d1) input, (d2)
transformation, and (d3) output. To have a meaningful function, the minimum information
needed is (b), (c), and (d3), i.e., the function keyword, an (empty) argument, and an
(empty) output part in the body. For example, test <- function() 3; test() will return
3, or combining the two commands together as (function() 3)() will generate 3 too.
If the number 3 is replaced by an empty body enclosed in braces, then the function
always returns NULL, i.e., (function() {})(). In practice, a new function has most of these
components in order to achieve a specific goal; otherwise, it would become too trivial. Let’s
examine them briefly one by one here first. The components that need more elaboration will be examined in the following sections in greater detail.
First, the symbol or name of the function is usually supplied, so a named function can be
created and available for use later on. If the expression of function.name <- is not supplied,
then an anonymous function is created. This is perfectly fine because functions are just one
type of object in R, and any object in R can be unnamed. For example, the expression
of mean(c(1, 3, 4)) returns the average value of a vector as an unnamed object. An
anonymous function may be preferred when it is used as an argument in another function.
For example, in the lapply(x, FUN) function, the FUN argument also should be another
function object. It can be a built-in function, e.g., seq(), or a user-defined new function.
If the user-defined function is short, then it is common to just define and supply the new
function at the same location, making the new function anonymous and available for the
calling function only.
The second component is the function declaration by the keyword of function. It tells
R that the object named as function.name before the assignment operator is a function.
The class of a function object, i.e., class(function.name), is “function”. Furthermore, for
function objects, R has a set of functions for extracting or replacing components. The args()
function reveals the set of arguments accepted by a function. Similarly, the formals()
function returns a list object of the formal arguments in a function. The body() function
returns the body of the function specified. The functions of formals() and body() can either
extract or replace components of an existing function. In addition, the alist() function
can be used with formals() to revise the arguments of a function.
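For instance, a small hypothetical function can be inspected and then revised in place; the tax() function and its values below are made up purely for illustration.

tax <- function(income, rate = 0.2) {income * rate}
args(tax); formals(tax); body(tax)
formals(tax) <- alist(income = , rate = 0.25)     # revise the default rate
body(tax) <- quote(round(income * rate, 2))       # replace the body
tax(100)                                          # returns 25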
The third component is a set of arguments that are separated by commas and enclosed
in a pair of parentheses. The set of arguments can be completely empty so some functions
have no arguments at all. For example, the search() function returns a character vector
of packages attached on the search path; to use this function, just type search(). In most
cases, the arguments are composed of symbols (i.e., a variable name x), statements with an
assignment operator of = (e.g., x = TRUE), and the special ... argument.
The fourth component is the body, i.e., the major component of a function. It is composed
of one or multiple R statements, which are usually enclosed in a pair of curly braces. Multiple
statements can be put on one line and separated by semicolons, or each statement can be
put on a separate line. If there is only one statement in the body of a function, then the
curly braces can be omitted. However, in general, including the curly braces is recommended
for clarity.
Within the function body, the first part of inputs can be completely ignored if the
arguments are straightforward. In most situations, however, argument values are examined
in the input part to evaluate whether they possess the appropriate class, mode, or format, e.g.,
a data frame class required for an argument. In addition, some arguments may need simple
extraction or operation before they can be used in the transformation. In the transformation
part, the argument values are transformed to achieve the goal of the function. This is often
the key portion of a function, and therefore, most time in creating a new function is spent
here. In the final output part, one or multiple outputs from previous transformation are
organized and exported, so they will be accessible after a function is called. In general, some
outputs should be returned to make the operation meaningful. Some functions, e.g., plot(),
focus on the side effect, so they do not return anything really meaningful.
Program 15.1 Function structure and properties
# A. Minimum information for a function
test <- function() {3}; test()
(function() {})()

# B. Function properties
dog <- function(x) {
  y <- x + 10
  z <- x * 3
  w <- paste("x times 3 is equal to", z, sep = " ")
  result <- list(x = x, y = y, z = z, w = w)
  return(result)
}
class(dog); args(dog); formals(dog); body(dog)
dog(8)              # default printing for all
res <- dog(8); res  # assignment and selected printing
res$x; res$w

# C. Anonymous function
ga <- lapply(X = 1:3, FUN = seq); ga
my.seq <- function(y) {seq(from = 1, to = y, by = 1)}
gb <- lapply(X = 1:3, FUN = my.seq)
gc <- lapply(X = 1:3, FUN = function(y) {seq(from = 1, to = y, by = 1)})
gc
identical(ga, gb); identical(gb, gc)
Chapter 16 Advanced Graphics

Besides the traditional graphics system covered in Chapter 11, R has many additional facilities for creating and developing graphics. In this chapter, the graphics
systems in R are reviewed first. Then the grid package as another important graphics system in R is introduced. Furthermore, the ggplot2 package is a contributed package
based on the grid system and has several advantages over the traditional graphics in R.
Thus, ggplot2 is also presented with several applications. Finally, as one important extension, R has a number of packages that can handle map data and geographical information
well. These functions are briefly introduced at the end.
16.1 R graphics engine and systems
As the key graphics facility, the grDevices package is included in base R and has been
referred to as the graphics engine (Murrell 2011). This package contains fundamental infrastructure for supporting almost all graphics applications in R. Furthermore, two R packages,
i.e., graphics and grid, have been built on top of the graphics engine, and consequentially,
two graphics systems have been developed in R. The graphics package, also known as
the traditional graphics system, has a complete set of functions for creating and annotating
graphs. Chapter 11 Base R Graphics on page 214 has a detailed coverage of these functions
and techniques. In contrast, the grid package has a separate set of graphics tools. It does
not provide functions for drawing complete plots, so it is not used to produce plots as often
as the traditional graphics system. Instead, it is more common to use functions from grid
to develop new packages. Besides the two systems of graphics and grid, several graphics
packages do exist independently even though the number is small. For example, the rggobi
package provides a command-line interface for interactive and dynamic plotting.
Many new graphics packages have been built on top of either the graphics or grid
system, or even independently. For example, the maps package provides functions for drawing
maps in the traditional graphics system. The packages of lattice and ggplot2 have been
built on top of grid. The CRAN task view has one specific section for graph displays, titled "Graphics." It classifies these new packages into several categories: plotting, graphic
applications, graphics systems, devices, colors, interactive graphics, and development. Over
40 packages related to R graphics have been reviewed. While the list of contributed packages
in the task view may not be comprehensive or complete, it is a good place to view new
development in this area.
The existence of two graphics systems (i.e., graphics v. grid) and related packages in
R raises questions for practical applications: which one is better or when should a package
be employed? Unfortunately, there is no simple answer that everyone would agree with. In
general, the same graphics system should be used to create a complete graph. Mixing them
for one specific graph is difficult or confusing. Nevertheless, it may be desirable to combine
the grid system with other packages in some situations, because the grid system offers more
flexibility in formatting a graph than the traditional system.
More specifically, take lattice and ggplot2 as an example. From my experience, the
main advantages of these packages are that graphs can be saved and manipulated exactly
like other objects in R. The default style of these new packages is often better because they
are motivated and developed to improve the appearance of graphics from the traditional
system. Plotting multivariate data sets as panel graphs is easier and the appearance is
usually more professional. The cost of using these new packages is that there is often a steep
learning curve. A good number of new concepts and plotting functions have to be learned
before any serious plotting is feasible. Furthermore, creating a graph always involves the
application of a set of graphics functions. Thus, one needs to have a solid understanding of
a contributed package before a quality graph can be produced. For example, one may need
to spend a few weeks on the book by Wickham (2009) before ggplot2 can be used to draw
a graph with a publication quality.
The main advantages of the traditional graphics system are that it is easy to learn,
well documented and discussed through the built-in help files and online forums, and in
most situations, very stable. In general, graphs are used either for exploring data patterns
or for publishing. Traditional graphics is still faster and more flexible in data exploration.
For example, plot() can show up to 10 time series in one window quickly with a very
short command because plot.ts() is defined. Traditional graphics is also more flexible in
drawing diagrams. The main disadvantage of the traditional graphics system is that graphs
cannot be saved like a normal object, making a program flow unclear in some situations.
For complicated data, the appearance of traditional graphs may not look professional.
In summary, traditional graphics system has been deeply rooted in base R and used in
many contributed packages to define plot methods for objects with new classes. It is also
flexible for data exploration and demonstration graphs. Thus, it is likely that traditional
graphics system will continue to be used widely. New packages like lattice can generate
more efficient and professional graphics in the long run if one is willing to spend weeks in
learning more new concepts.
16.2 The grid system
The grid package has been developed by Paul Murrell since the 1990s (Murrell, 2011). At
present, it is a core package in base R, so all the functions in this package are just indexed like
other graphics functions in base R. Several contributed packages have also been developed
over time, e.g., gridBase for integrating base and grid graphics, gridDebug for debugging
grid graphics, and gridExtra for some additional functions in grid graphics. Furthermore,
sophisticated graphics packages have been built on the basis of the grid package, including
the well-known lattice and ggplot2. As the grid system has been constantly expanding
and growing, the grid package is explained briefly in this section to help readers get started.
Some examples are presented in Program 16.1 Learning viewports and low-level plotting
functions in grid.
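As a minimal sketch of the grid style (not a complete plot), the commands below push one viewport, draw a dashed rectangle, and add centered text; all functions are from the grid package.

library(grid)
grid.newpage()
pushViewport(viewport(width = 0.7, height = 0.5))
grid.rect(gp = gpar(lty = "dashed"))
grid.text("A label inside the viewport")
popViewport()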
The built-in help documents are comprehensive and can be viewed by running help(package = "grid"). There are over ten vignettes listed by vignette(package = "grid").
Part V Programming as a Contributor
Part V Programming as a Contributor: The main sample study for this part is Sun
(2011). The focus is on developing a new package or a graphical user interface, and making
a contribution to the R community. How to create the content of a new package, what
procedures should be followed, and how to develop a graphical user interface are illustrated
in three individual chapters.
Chapter 17 Sample Study C and New R Packages (pages 395 – 411): The statistics for the
underlying model in Sun (2011) is presented first. Then the complete program version for
generating tables and graphs is assessed, and the need of a new package is emphasized.
Chapter 18 Contents of a New Package (pages 412 – 427): The principles of package design
are analyzed first. The contents of a new package are analyzed with the apt package as an
example. Debugging techniques and management of time and memory are also detailed.
Chapter 19 Procedures for a New Package (pages 428 – 445): Procedural requirements for
building a new package are presented, with Microsoft Windows as the operating system.
The whole process is divided into three stages: skeleton, compilation, and distribution.
Chapter 20 Graphical User Interfaces (pages 446 – 467): Concepts and tools for developing
R graphical user interfaces are covered. The gWidgets package and the GTK toolkit are
employed for the development. At the end, a GUI for the apt package is demonstrated.
R Graphics • Show Box 5 • Screenshots from a dynamic graph for correlation
R graphics can show dynamic correlations. See Program A.5 on page 518 for detail.
Chapter 17 Sample Study C and New R Packages

The core sample study for Part V Programming as a Contributor is Sun (2011). In
this chapter, the underlying statistical model and manuscript and program versions
for this study are presented first. The research issue is asymmetric price transmission
(APT) between China and Vietnam in the import wooden bed market of the United States.
This is closely related to the issue examined in Wan et al. (2010a), i.e., the main sample
study for Part IV Programming as a Wrapper. At the stage of proposal and project design,
some aspects of designing several projects in one area are discussed. The relevant discussion
can be found at Section 5.3.3 Design with challenging models (Sun 2011) on page 73.
The model employed is at the frontier of time series statistics, i.e., nonlinear threshold
cointegration analysis. It involves hundreds of linear regressions even for a very small data
set. Thus, writing new functions and even preparing a new package are needed to have an
efficient data analysis. For this specific model, a new package called apt is created. The
program version for Sun (2011) is organized with the help of this package, so the whole
program has become more concise and readable.
17.1 Manuscript version for Sun (2011)
Recall that an empirical study has three versions: proposal, program, and manuscript. A
proposal provides a guide like a road map for setting up the first draft of a manuscript. Like
an engine, an R program can generate detailed tables and figures for the final manuscript.
For this study, the brief proposal is presented at Section 5.3.3 Design with challenging
models (Sun 2011). The R program for this study is presented later in this chapter. The
final manuscript version is published as Sun (2011). Below is the very first manuscript version
that is developed from the proposal.
In constructing the first manuscript version for an empirical study, the key components
are the tables and figures. The contents should be predicted as much as possible before a
researcher works on an R program. The prediction is based on the understanding of the
issue, data, model, and literature. The more a researcher can predict at this stage, the more
efficient the programming will become. At the end, both the content and format of tables and
figures need to be written down in the manuscript draft. For example, the results of Engle-Granger and threshold cointegration tests are reported in combination as Table 3 in Sun
(2011). The first draft of these results is presented here as Table 17.1. Some hypothetical
values are put in the columns to provide formatting guides for R programming later.
The First Manuscript Version for Sun (2011)
1. Abstract (200 words). Have one or two sentences for research issue, study need, objective, methodology, data, results, and contributions.
2. Introduction (3 pages in double line spacing). Have a paragraph for each of the following items: an overview of wooden bed imports in the United States, market price
analyses and asymmetric price transmission (APT), sources of APT, models of APT,
objective, and manuscript organization.
3. Import wooden bed market in the United States (4 pages). A review of the US wooden
bed market is presented, with the emphasis on expansion of China and Vietnam in
the import wooden bed market.
— Factors behind China’s export growth
— Antidumping investigation against China
— Vietnam’s growth
4. Methodology (6 pages): A brief introduction of the methods and then three subsections.
— Linear cointegration analysis
— Threshold cointegration analysis
— Asymmetric error correction model with threshold cointegration
5. Data and software (0.5 page). Monthly cost-insurance-freight values in dollar and
quantities in piece are reported by country. The period covered in this study is from
January 2002 to January 2010. Threshold cointegration and asymmetric error correction model are combined and used in this study. A new R package named as apt is
created in the process.
6. Empirical results (4 pages of text, 4 pages of tables, and 3 pages of figure).
— Descriptive statistics and unit root test
— Results of the linear cointegration analysis
— Results of the threshold cointegration analysis
— Results of the asymmetric error correction model
Table 1. Results of descriptive statistics and unit root tests
Table 2. Results of Johansen cointegration tests on the import prices
Table 3. Results of Engle-Granger and threshold cointegration tests
Table 4. Results of asymmetric error correction model
Figure 1. Monthly import values of wooden beds from China and Vietnam
Figure 2. Monthly import prices of wooden beds from China and Vietnam
Figure 3. Sum of squared errors by threshold value for threshold selection
7. Conclusion and discussions (3 pages). A brief summary of the study is presented
first. Then about three key results from the empirical findings will be highlighted and
discussed.
8. References (3 pages). No more than 30 studies will be cited.
end
Table 17.1 A draft table for the cointegration analyses in Sun (2011)
Item                      Engle        TAR       CTAR    MTAR    CMTAR
Estimate
  Threshold                 —            0                  0
  ρ1                    −0.666***   −0.666***
                        (−3.333)    (−3.333)
  ρ2                        —       −0.666***
                            —       (−3.333)
Diagnostics
  AIC                                888.888
  BIC                                888.888
  QLB(4)                                —
  QLB(8)                                —
  QLB(12)
Hypotheses
  Φ (H0: ρ1 = ρ2 = 0)               10.123***
  F (H0: ρ1 = ρ2)                    4.444***

17.2 Statistics: threshold cointegration and APT
In this section, the relevant statistics for threshold cointegration and asymmetric price transmission are presented. Emphasis is put on the key information relevant for R implementation in the apt package. For a comprehensive coverage of this methodology, read the
references cited in Sun (2011). Nonstationarity and unit root tests, Johansen-Juselius cointegration analysis, and most model diagnostics are not covered here for brevity. In contrast,
Engle-Granger linear cointegration, threshold cointegration, and asymmetric error correction
model are described here with some detail. Linear cointegration analysis is the foundation
of threshold cointegration.
17.2.1 Linear cointegration analysis
For linear cointegration analysis, there exist two major methods: Johansen-Juselius and
Engle-Granger two-step approaches (Enders, 2010). Both of them assume symmetric relations between variables. The Johansen approach is a multivariate generalization of the
Dickey-Fuller test. The Engle-Granger approach is the foundation of threshold cointegration so it is explained in detail first.
The focal variables here are monthly import prices of wooden beds from two supplying
countries, i.e., Vietnam (Vt ) and China (Ht ). Their properties of nonstationarity and order of
integration can be assessed using the Augmented Dickey-Fuller test. If both series have a unit
root, then it is appropriate to conduct cointegration analysis to evaluate their interaction.
With the Engle-Granger two-stage approach, the properties of the residuals from the long-term equilibrium relation are analyzed (Engle and Granger, 1987). For the two focal price variables, the two-stage approach can be expressed as:
Vt = α0 + α1 Ht + ξt                                                   (17.1)

∆ξ̂t = ρ ξ̂t−1 + Σ_{i=1}^{P} φi ∆ξ̂t−i + µt                               (17.2)
398
Chapter 17 Sample Study C and New R Packages
where α0, α1, ρ, and φi are coefficients, ξt is the error term, ξ̂t denotes the estimated residuals, ∆ indicates the first difference, µt is a white noise disturbance term, and P is the number of lags.
In the first stage of estimating the long-term relation among the price variables, the price of China is chosen to be placed on the right side and assumed to be the driving force. This reflects the fact that China has been the leading supplier in the import wooden bed market of the United States over the study period from 2002 to 2010. In the second stage, the estimated residuals ξ̂t are used to conduct a unit root test. Special critical values are
needed for this test because the series is not raw data but a residual series. The number
of lags is chosen so there is no serial correlation in the regression residuals µt . It can be
selected with several statistics, e.g., the Akaike Information Criterion (AIC) or Ljung-Box
Q test. If the null hypothesis of ρ = 0 is rejected, then the residual series from the long-term
equilibrium is stationary and the focal variables of Vt and Ht are cointegrated.
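The following is a minimal sketch of this two-stage procedure with lm() and the ur.df() function from the urca package. It assumes that prVi and prCh hold the Vietnamese and Chinese price series (the object names used later in Program 17.1), and the test statistic must be compared with Engle-Granger critical values rather than the usual Dickey-Fuller ones.

library(urca)
stage1 <- lm(prVi ~ prCh)                             # Equation (17.1): long-term relation
resid1 <- residuals(stage1)                           # estimated residuals
stage2 <- ur.df(y = resid1, type = 'none', lags = 1)  # Equation (17.2) on the residuals
summary(stage2)                                       # compare with Engle-Granger critical values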
17.2.2 Threshold cointegration analysis
In recent years, nonlinear cointegration has been increasingly used in price transmission studies. Among the various developments of nonlinear cointegration, one branch is called threshold cointegration. The nonlinearity arises from combining two linear regressions, and the linear regressions are based on the Engle-Granger linear cointegration approach above. Thus, the threshold cointegration regression considered here is piecewise linear rather than smooth. Specifically, Enders and Siklos (2001) propose a two-regime threshold cointegration approach to allow asymmetric adjustment in cointegration analysis. This modifies Equation (17.2) such that:
∆ξ̂t = ρ1 It ξ̂t−1 + ρ2 (1 − It) ξ̂t−1 + Σ_{i=1}^{P} ϕi ∆ξ̂t−i + µt          (17.3)

It = 1 if ξ̂t−1 ≥ τ, and 0 otherwise                                      (17.4)

It = 1 if ∆ξ̂t−1 ≥ τ, and 0 otherwise                                     (17.5)
where It is the Heaviside indicator, P the number of lags, ρ1 , ρ2 , and ϕi the coefficients, and
τ the threshold value. The lag (P ) is specified to account for serially correlated residuals
and it can be similarly selected as in linear cointegration analysis.
The Heaviside indicator It can be specified with two alternative definitions of the threshold variable, either the lagged residual (ξ̂t−1) or the change of the lagged residual (∆ξ̂t−1). Equations (17.3) and (17.4) together have been referred to as the Threshold Autoregression (TAR) model, while Equations (17.3) and (17.5) are named the Momentum Threshold Autoregression (MTAR) model. The threshold value τ can be specified as zero, or it can be estimated. Thus, a total of four models can be estimated. They are TAR with τ = 0, consistent TAR with τ estimated, MTAR with τ = 0, and consistent MTAR with τ estimated. In general, the model with the lowest AIC is deemed the most appropriate.
Insights into the asymmetric adjustment in the context of a long-term cointegration
relation can be obtained with two tests. First, an F -test is employed to examine the null
hypothesis of no cointegration (H0 : ρ1 = ρ2 = 0) against the alternative of cointegration
with either TAR or MTAR threshold adjustment. The test statistic is represented by Φ.
This test does not follow a standard distribution and the critical values in Enders and
Siklos (2001) should be used. The second one is a standard F -test to evaluate the null
hypothesis of symmetric adjustment in the long-term equilibrium (H0 : ρ1 = ρ2 ). Rejection
of the null hypothesis indicates the existence of an asymmetric adjustment process. Results
from the two tests are the key outputs from threshold cointegration analysis.
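As a minimal illustration of the two tests, assume that resid1 from the earlier sketch holds the long-term residuals and consider a TAR regression with τ = 0 and no lagged differences (P = 0) for brevity. The Φ statistic below must be compared with the critical values tabulated in Enders and Siklos (2001) rather than a standard F table.

z  <- resid1; dz <- diff(z); zl <- z[-length(z)]     # residual, its change, and its lag
ind <- as.numeric(zl >= 0)                           # TAR indicator with tau = 0
tar <- lm(dz ~ 0 + I(ind * zl) + I((1 - ind) * zl))  # Equation (17.3) with P = 0
sse.tar  <- sum(residuals(tar)^2)
sse.null <- sum(dz^2)                                # SSE when rho1 = rho2 = 0
phi <- ((sse.null - sse.tar) / 2) / (sse.tar / df.residual(tar))  # H0: no cointegration
sym <- lm(dz ~ 0 + zl)                               # rho1 = rho2 imposed
anova(sym, tar)                                      # standard F test of symmetric adjustment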
The challenge of threshold cointegration analysis comes from estimating the threshold
value of τ . With a given value for τ , Equation (17.3) is just a linear regression and it can
be easily estimated by any software application, e.g., the lm() function in R. At present,
the method by Chan (1993) has been widely followed to obtain a consistent estimate of the
threshold value. A super consistent estimate of the threshold value can be attained with
several steps. First, the threshold variable, i.e., ξ̂t−1 for the TAR model or ∆ξ̂t−1 for the MTAR model, is sorted in ascending order. Second, the possible
threshold values are determined. If the threshold value is to be meaningful, the threshold
variable must actually cross the threshold value (Enders, 2010). Thus, the threshold value τ
should lie between the maximum and minimum value of the threshold variable. In practice,
the highest and lowest 15% of the values are excluded from the search to ensure an adequate
number of observations on each side. The middle 70% values of the sorted threshold variable
are generally used as potential threshold values. The percentage can be higher if the total
number of observations in a study is larger, e.g., 90% for 1,000 observations. Third, the
TAR or MTAR model is estimated with each potential threshold value. The sum of squared
errors for each trial can be calculated and the relation between the sum of squared errors
and the threshold value can be examined. Finally, the threshold value that minimizes the
sum of squared errors is deemed to be the consistent estimate of the threshold.
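A minimal sketch of this grid search for the TAR case follows, reusing dz and zl from the previous sketch and keeping P = 0 for brevity; only the middle 70% of the sorted threshold variable is searched.

cand <- sort(zl)                              # sorted threshold variable (TAR case)
trim <- ceiling(0.15 * length(cand))
cand <- cand[trim:(length(cand) - trim)]      # drop the highest and lowest 15%
sse  <- sapply(cand, function(tau) {
  ind <- as.numeric(zl >= tau)                # Heaviside indicator at this trial value
  fit <- lm(dz ~ 0 + I(ind * zl) + I((1 - ind) * zl))
  sum(residuals(fit)^2)
})
tau.hat <- cand[which.min(sse)]               # consistent threshold estimate
plot(cand, sse, type = 'l',
  xlab = 'Threshold value', ylab = 'Sum of squared errors')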
17.2.3 Asymmetric error correction model
The Granger representation theorem (Engle and Granger, 1987) states that an error correction model can be estimated when the variables under consideration are cointegrated. The standard specification assumes that the adjustment process due to disequilibrium among the variables is symmetric. Two extensions of the standard specification in the error correction model
have been made for analyzing asymmetric price transmission. Granger and Lee (1989) first
extend the specification to the case of asymmetric adjustments. Error correction terms and
first differences on the variables are decomposed into positive and negative components.
This allows a detailed examination of whether positive and negative price differences have
asymmetric effects on the dynamic behavior of prices. The second extension follows the
development of threshold cointegration (Enders and Granger, 1998). When the presence of
threshold cointegration is validated, the error correction terms are modified further.
The asymmetric error correction model with threshold cointegration in Sun (2011) is
developed as follows:
∆Ht = θH + δH+ E+t−1 + δH− E−t−1 + Σ_{j=1}^{J} αHj+ ∆H+t−j + Σ_{j=1}^{J} αHj− ∆H−t−j
        + Σ_{j=1}^{J} βHj+ ∆V+t−j + Σ_{j=1}^{J} βHj− ∆V−t−j + ϑHt                     (17.6)

∆Vt = θV + δV+ E+t−1 + δV− E−t−1 + Σ_{j=1}^{J} αVj+ ∆H+t−j + Σ_{j=1}^{J} αVj− ∆H−t−j
        + Σ_{j=1}^{J} βVj+ ∆V+t−j + Σ_{j=1}^{J} βVj− ∆V−t−j + ϑVt                     (17.7)
where ∆H and ∆V are the import prices of China and Vietnam in first difference; θ, δ, α, and β are coefficients; and ϑ denotes the error terms. The subscripts H and V differentiate the coefficients by country, t denotes time, and j represents lags. All the lagged price variables in first difference (i.e., ∆Ht−j and ∆Vt−j) are split into positive and negative components, as indicated by the superscripts + and −. For instance, ∆V+t−1 is equal to (Vt−1 − Vt−2) if Vt−1 > Vt−2 and equal to 0 otherwise; ∆V−t−1 is equal to (Vt−1 − Vt−2) if Vt−1 < Vt−2 and equal to 0 otherwise. The maximum lag J is chosen with the AIC statistic and the Ljung-Box Q test so that the residuals have no serial correlation.
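A minimal sketch of this decomposition for the Vietnamese price series prVi (assumed to be a monthly time series, as elsewhere in this chapter):

dV     <- diff(prVi)                 # first difference of the Vietnamese price
dV.pos <- ifelse(dV > 0, dV, 0)      # positive component
dV.neg <- ifelse(dV < 0, dV, 0)      # negative component
head(cbind(dV, dV.pos, dV.neg))      # dV equals dV.pos + dV.neg by construction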
The error correction terms are the key component of the asymmetric error correction model. They are defined as E+t−1 = It ξ̂t−1 and E−t−1 = (1 − It) ξ̂t−1, and are a direct result of the above threshold cointegration regression. This definition of the error correction terms not only considers the possible asymmetric price responses to positive and negative shocks on the long-term equilibrium, but also incorporates the impact of threshold cointegration through the construction of the Heaviside indicator in Equations (17.4) and (17.5). The signs of the estimated coefficients can offer a first insight on the presence of asymmetric price behavior and can reveal the response of individual variables to the disequilibrium in the previous periods. Note that the price of China is assumed to be the driving force and the long-term disequilibrium is measured as the price spread between Vietnam and China. Thus, the expected signs for the error correction terms should be positive for China (i.e., δH+ > 0, δH− > 0) and negative for Vietnam (i.e., δV+ < 0, δV− < 0).
Single or joint hypotheses can be formally assessed. In this study, four types of hypotheses and F-tests are examined, as detailed in Frey and Manera (2007). The first is a Granger causality test. Whether the Chinese price Granger causes its own price or the Vietnamese price can be tested by restricting all the Chinese price coefficients to be zero (H01: αi+ = αi− = 0 for all lags i simultaneously). Similarly, the test can be applied to the Vietnamese price (H02: βi+ = βi− = 0 for all lags). The second type of hypothesis is concerned with the distributed lag asymmetric effect. At the first lag, for instance, the null hypothesis is that the Chinese price has a symmetric effect on its own price or on the Vietnamese price (H03: α1+ = α1−). This can be repeated for each lag and for both countries (i.e., H04: β4+ = β4−). The third type of hypothesis is the cumulative asymmetric effect. The null hypothesis of a cumulative symmetric effect can be expressed as H05: Σ_{i=1}^{J} αi+ = Σ_{i=1}^{J} αi− for China and H06: Σ_{i=1}^{J} βi+ = Σ_{i=1}^{J} βi− for Vietnam. Finally, the equilibrium adjustment path asymmetry can be examined with the null hypothesis of H07: δ+ = δ− for each equation estimated.
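One possible way to carry out such tests outside the apt package is the linearHypothesis() function in the car package, which accepts restrictions written with coefficient names. The sketch below assumes a hypothetical lm() fit ecm.ch of Equation (17.6) whose coefficients are named dHpos1 to dHpos4, dHneg1 to dHneg4, Epos, and Eneg; these names are illustrative only and are not the apt package's actual naming.

library(car)
pos <- paste0('dHpos', 1:4); neg <- paste0('dHneg', 1:4)
linearHypothesis(ecm.ch, paste(c(pos, neg), '= 0'))             # H01: Granger causality
linearHypothesis(ecm.ch, 'dHpos1 = dHneg1')                     # H03: asymmetry at lag 1
linearHypothesis(ecm.ch, paste(paste(pos, collapse = ' + '), '=',
  paste(neg, collapse = ' + ')))                                # H05: cumulative symmetry
linearHypothesis(ecm.ch, 'Epos = Eneg')                         # H07: adjustment path symmetry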
17.3 Needs for a new package
Estimating the statistical models as described in the previous section is almost impossible
by clicking pull-down menus in a statistical software application. Computer programming
must be employed, and within R’s language structure, new functions must be created. As
the number of functions is relatively large and some of them need to be repeatedly called,
it is also more efficient to wrap up these new functions together in an R package. To reveal
the need for new functions and packages, three particular aspects of the models employed
in Sun (2011) are analyzed here.
The first challenge is to estimate the threshold cointegration model, as expressed in Equations (17.3) to (17.5) as a group. When the threshold value τ and a lag value P are given, the variables in Equation (17.3) can be easily defined. Thus, the regression per se is a linear model and can be estimated by the lm() function. The problem is that the number of regressions is too large. Imagine that the total number of observations is 120 (e.g., monthly data for 10 years). If 70% of the residual values are used as the potential values for τ, then the number is about 84. Furthermore, assume the potential value of P can vary from 1 to 12. In combination, the number of regressions is 84 × 12 × 2 = 2,016 for the TAR and MTAR specifications. At the end of each regression, the sum of squared errors and the threshold value should be documented. Note that a data set with 120 observations is pretty small. If the data set is a little larger (e.g., 500 observations or more), then the task quickly becomes unmanageable or extremely inefficient.
The solution is to use flow control statements, such as if and for. Multiple looping statements can be nested within each other, and outputs from each loop can be selected and collected. This has been presented in Chapter 13 Flow Control Structure on page 269. Furthermore, as functions in R can divide a large programming job with interlinked components into small units, several new functions will be created in estimating the threshold cointegration model.
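A minimal sketch of such a nested search, reusing dz, zl, and cand from the sketches in Section 17.2; the sum of squared errors from every trial regression is collected for later comparison.

trials <- expand.grid(lag = 1:12, thresh = cand)
trials$sse <- NA
for (k in 1:nrow(trials)) {
  p    <- trials$lag[k]
  tau  <- trials$thresh[k]
  ind  <- as.numeric(zl >= tau)
  X    <- embed(dz, p + 1)                 # column 1 is the dependent variable
  rows <- (p + 1):length(dz)               # observations usable at this lag
  fit  <- lm(X[, 1] ~ 0 + I(ind[rows] * zl[rows]) +
    I((1 - ind[rows]) * zl[rows]) + X[, -1])
  trials$sse[k] <- sum(residuals(fit)^2)
}
trials[which.min(trials$sse), ]            # lag and threshold with the smallest SSE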
The second challenge is to estimate the asymmetric error correction model. The variables used in the regression need to be created with a given value of the lag J. The number of variables on the right side rises fast with a larger value of J. Furthermore, the value of J is unknown in advance, so there is a need to estimate the model repeatedly with different values. Thus, the whole process includes selecting a lag value, composing the variables, estimating the linear model, collecting the regression outputs, and repeating these steps for each candidate J. Therefore, while the asymmetric error correction model is linear, the process can be very tedious and inefficient without programming.
The third challenge is hypothesis tests on the coefficients from the asymmetric error correction model. Many hypotheses can be formed, and F-tests can be employed. Individually, they are easy to implement; collectively, the work is inefficient without programming. This is because whenever the value of the lag J changes, the number and positions of the coefficients from the regression change too. Again, using new functions and flow control structures in R can easily solve these problems.
The linkage between new functions and a package is worthy of a note here. When the
number of functions created in a project is large, the need and marginal benefit of building
a new package can become significant. In fact, the threshold cointegration analysis serves
as a good example. Walter Enders has made great contributions in this area through his
book and journal articles (e.g., Enders, 2010; Enders and Siklos, 2001). He also programmed
the main components of threshold cointegration analysis through the commercial software
RATS and distributed it on the Internet. I have benefited from these sources in learning the
method. However, RATS does not have a clearly defined concept of a package or library as R does. As a result, the functions created in RATS have little documentation and are fairly fragmented. In the following chapters, we will learn how to wrap a group of new functions into the apt package. The step from many new functions to a new package makes programming efficient and pleasant for everyone, including the package author.
In summary, conducting an empirical study with a sophisticated statistical model like threshold cointegration is almost impossible without programming. New functions can be created and called to address recurring regressions. When the number of new functions is large, a new package can be used to document the linkage among them clearly, organize the R program for a project logically, and eventually improve research productivity.
17.4 Program version for Sun (2011)
When the program for an empirical study is very long (e.g., 30 pages), it may be better to organize it through several documents. The R program for Sun (2011) is only five pages long, so splitting may not be necessary in this particular case. Nevertheless, to demonstrate the benefits of splitting a long program, two R programs are presented below. One is for the main statistical analyses and tables. The other is used to generate the three figures.
17.4.1 Program for tables
The main program is listed in Program 17.1 Main program version for generating tables
in Sun (2011). This contains all the statistical analyses and can generate the four tables.
Specifically, the data used in this study are pretty simple: four time series of import values and prices for China and Vietnam from January 2002 to January 2010. They are saved as the data object daVich in the apt library. The main steps in the program correspond
to the study design in the proposal and desired outputs in the manuscript. These include
summary statistics (Table 1), Johansen cointegration tests (Table 2), threshold cointegration
tests (Table 3), and asymmetric error correction model (Table 4).
As you read along the program, you will notice that a number of new functions have been
created and wrapped together in the apt package. This is the focus of Part V Programming
as a Contributor and will be elaborated gradually later on. At this point, it should be evident
that the program version is well organized with the help of a new package. Except for some minor format differences, the tables generated from this R program are highly similar to the final versions reported in Sun (2011).
Some results in Table 3 as published in Sun (2011) were inaccurate because of a mistake made when the data were processed in 2009. The mistake was identified after the paper was published. For example, for the consistent MTAR, the coefficient for the positive term was reported as −0.251 (−2.130) in Sun (2011), but it should be −0.106 (−0.764), as calculated from the code below. This is also explained on the help page of daVich. The main conclusions from all the analyses are still qualitatively the same.
A large portion of Program 17.1 has been distributed with the apt library as sample code. A number of users worldwide have raised a similar question to me in recent years. The question is simple from my perspective. However, as it occurs repeatedly from time to time, it is worthy of a note here. Briefly, the data used in Sun (2011) are just two time series. It is tempting for a user to import two new data series into R and then copy and run the sample program. Unfortunately, this will generate errors at various stages in the middle. This is because several key choices have to be made in Program 17.1, e.g., the lag and threshold values. The choices depend on the individual data set. Thus, one cannot simply copy the whole R program and apply it to another data set.
Program 17.1 Main program version for generating tables in Sun (2011)
# Title: R Program for Sun (2011 FPE)
library(apt); library(vars); setwd('C:/aErer')
options(width = 100, stringsAsFactors = FALSE)
# ------------------------------------------------------------------------# 1. Data and summary statistics
# Price data for China and Vietnam are saved as 'daVich'
data(daVich); head(daVich); tail(daVich); str(daVich)
prVi <- daVich[, 1]; prCh <- daVich[, 2]
(dog <- t(bsStat(y = daVich, digits = c(3, 3))))
dog2 <- data.frame(item = rownames(dog), CH.level = dog[, 2],
CH.diff = '__', VI.level = dog[, 1], VI.diff = '__')[2:6, ]
rownames(dog2) <- 1:nrow(dog2); str(dog2); dog2
# ------------------------------------------------------------------------# 2. Unit root test (Table 1)
ch.t1 <- ur.df(type = 'trend', lags = 3, y = prCh); slotNames(ch.t1)
ch.d1 <- ur.df(type = 'drift', lags = 3,  y = prCh)
ch.t2 <- ur.df(type = 'trend', lags = 3,  y = diff(prCh))
ch.d2 <- ur.df(type = 'drift', lags = 3,  y = diff(prCh))
vi.t1 <- ur.df(type = 'trend', lags = 12, y = prVi)
vi.d1 <- ur.df(type = 'drift', lags = 11, y = prVi)
vi.t2 <- ur.df(type = 'trend', lags = 10, y = diff(prVi))
vi.d2 <- ur.df(type = 'drift', lags = 10, y = diff(prVi))
dog2[6, ] <- c('ADF with trend',
  paste(round(ch.t1@teststat[1], digits = 3), '[', 3,  ']', sep = ''),
  paste(round(ch.t2@teststat[1], digits = 3), '[', 3,  ']', sep = ''),
  paste(round(vi.t1@teststat[1], digits = 3), '[', 12, ']', sep = ''),
  paste(round(vi.t2@teststat[1], digits = 3), '[', 10, ']', sep = ''))
dog2[7, ] <- c('ADF with drift',
  paste(round(ch.d1@teststat[1], digits = 3), '[', 3,  ']', sep = ''),
  paste(round(ch.d2@teststat[1], digits = 3), '[', 3,  ']', sep = ''),
  paste(round(vi.d1@teststat[1], digits = 3), '[', 11, ']', sep = ''),
  paste(round(vi.d2@teststat[1], digits = 3), '[', 10, ']', sep = ''))
(table.1 <- dog2)
# ------------------------------------------------------------------------# 3. Johansen-Juselius and Engle-Granger cointegration analyses
# JJ cointegration
VARselect(daVich, lag.max = 12, type = 'const')
summary(VAR(daVich, type = 'const', p = 1))
K <- 5; two <- cbind(prVi, prCh)
summary(j1 <- ca.jo(x = two, type = 'eigen', ecdet = 'trend', K = K))
summary(j2 <- ca.jo(x = two, type = 'eigen', ecdet = 'const', K = K))
summary(j3 <- ca.jo(x = two, type = 'eigen', ecdet = 'none' , K = K))
summary(j4 <- ca.jo(x = two, type = 'trace', ecdet = 'trend', K = K))
summary(j5 <- ca.jo(x = two, type = 'trace', ecdet = 'const', K = K))
summary(j6 <- ca.jo(x = two, type = 'trace', ecdet = 'none' , K = K))
slotNames(j1)
out1 <- cbind('eigen', 'trend', K, round(j1@teststat, digits = 3), j1@cval)
out2 <- cbind('eigen', 'const', K, round(j2@teststat, digits = 3), j2@cval)
out3 <- cbind('eigen', 'none', K, round(j3@teststat, digits = 3), j3@cval)
out4 <- cbind('trace', 'trend', K, round(j4@teststat, digits = 3), j4@cval)
out5 <- cbind('trace', 'const', K, round(j5@teststat, digits = 3), j5@cval)
out6 <- cbind('trace', 'none', K, round(j6@teststat, digits = 3), j6@cval)
jjci <- rbind(out1, out2, out3, out4, out5, out6)
colnames(jjci) <- c('test 1', 'test 2', 'lag', 'statistic',
'c.v 10%', 'c.v 5%', 'c.v 1%')
rownames(jjci) <- 1:nrow(jjci)
(table.2 <- data.frame(jjci))
# EG cointegration
LR <- lm(formula = prVi ~ prCh); summary(LR)
(LR.coef <- round(summary(LR)$coefficients, digits = 3))
(ry <- ts(data = residuals(LR), start = start(prCh), end = end(prCh),
frequency = 12))
eg <- ur.df(y = ry, type = c('none'), lags = 1)
eg2 <- ur.df2(y = ry, type = c('none'), lags = 1)
(eg4 <- Box.test(eg@res, lag = 4, type = 'Ljung') )
(eg8 <- Box.test(eg@res, lag = 8, type = 'Ljung') )
(eg12 <- Box.test(eg@res, lag = 12, type = 'Ljung'))
EG.coef <- coefficients(eg@testreg)[1, 1]
EG.tval <- coefficients(eg@testreg)[1, 3]
(res.EG <- round(t(data.frame(EG.coef, EG.tval, eg2$aic, eg2$bic,
eg4$p.value, eg8$p.value, eg12$p.value)), digits = 3))
# ------------------------------------------------------------------------# 4. Threshold cointegration
# best threshold
test <- ciTarFit(y = prVi, x = prCh); test; names(test)
t3 <- ciTarThd(y = prVi, x = prCh, model = 'tar', lag = 0); plot(t3)
time.org <- proc.time()
(th.tar <- t3$basic)
for (i in 1:12) { # about 20 seconds
t3a <- ciTarThd(y = prVi, x = prCh, model = 'tar', lag = i)
th.tar[i+2] <- t3a$basic[, 2]
}
th.tar
proc.time() - time.org   # elapsed time
t4 <- ciTarThd(y = prVi, x = prCh, model = 'mtar', lag = 0)
(th.mtar <- t4$basic); plot(t4)
for (i in 1:12) { # about 36 seconds
t4a <- ciTarThd(y = prVi, x = prCh, model = 'mtar', lag = i)
th.mtar[i+2] <- t4a$basic[,2]
}
th.mtar
t.tar <- -8.041; t.mtar <- -0.451      # lag = 0 to 4; final choices
# t.tar <- -8.701; t.mtar <- -0.451    # lag = 5 to 12
mx <- 12   # lag selection
(g1 <- ciTarLag(y = prVi, x = prCh, model = 'tar',  maxlag = mx, thresh = 0))
(g2 <- ciTarLag(y = prVi, x = prCh, model = 'mtar', maxlag = mx, thresh = 0))
(g3 <- ciTarLag(y = prVi, x = prCh, model = 'tar',  maxlag = mx, thresh = t.tar))
(g4 <- ciTarLag(y = prVi, x = prCh, model = 'mtar', maxlag = mx, thresh = t.mtar))
plot(g1)
# Figure of threshold selection: mtar at lag = 3 (Figure 3 data)
(t5 <- ciTarThd(y=prVi, x=prCh, model = 'mtar', lag = 3, th.range = 0.15))
plot(t5)
# Table 3 Results of EG and threshold cointegration combined
vv <- 3
(f1 <- ciTarFit(y=prVi, x=prCh, model = 'tar', lag = vv, thresh = 0))
(f2 <- ciTarFit(y=prVi, x=prCh, model = 'tar', lag = vv, thresh = t.tar ))
(f3 <- ciTarFit(y=prVi, x=prCh, model = 'mtar', lag = vv, thresh = 0))
(f4 <- ciTarFit(y=prVi, x=prCh, model = 'mtar', lag = vv, thresh = t.mtar))
r0 <- cbind(summary(f1)$dia, summary(f2)$dia,
summary(f3)$dia, summary(f4)$dia)
diag <- r0[c(1:4, 6:7, 12:14, 8, 9, 11), c(1, 2, 4, 6, 8)]
rownames(diag) <- 1:nrow(diag); diag
e1 <- summary(f1)$out; e2 <- summary(f2)$out
e3 <- summary(f3)$out; e4 <- summary(f4)$out; rbind(e1, e2, e3, e4)
ee <- list(e1, e2, e3, e4); vect <- NULL
for (i in 1:4) {
ef <- data.frame(ee[i])
vect2 <- c(paste(ef[3, 'estimate'], ef[3, 'sign'], sep = ''),
paste('(', ef[3, 't.value'], ')', sep = ''),
paste(ef[4, 'estimate'], ef[4, 'sign'], sep = ''),
paste('(', ef[4, 't.value'], ')', sep = ''))
vect <- cbind(vect, vect2)
}
item <- c('pos.coeff','pos.t.value', 'neg.coeff','neg.t.value')
ve <- data.frame(cbind(item, vect)); colnames(ve) <- colnames(diag)
(res.CI <- rbind(diag, ve)[c(1:2, 13:16, 3:12), ])
rownames(res.CI) <- 1:nrow(res.CI)
res.CI$Engle <- '__'
res.CI[c(3, 4, 9:13), 'Engle'] <- res.EG[, 1]
res.CI[4, 6] <- paste('(', res.CI[4, 6], ')', sep = '')
(table.3 <- res.CI[, c(1, 6, 2:5)])
# -------------------------------------------------------------------------
# 5. Asymmetric error correction model
(sem <- ecmSymFit(y = prVi, x = prCh, lag = 4)); names(sem)
(aem <- ecmAsyFit(y = prVi, x = prCh, lag = 4, model = 'mtar',
split = TRUE, thresh = t.mtar))
(ccc <- summary(aem))
coe <- cbind(as.character(ccc[1:19, 2]),
paste(ccc[1:19, 'estimate'], ccc$signif[1:19], sep = ''),
ccc[1:19, 't.value'],
paste(ccc[20:38, 'estimate'], ccc$signif[20:38],sep = ''),
ccc[20:38, 't.value'])
colnames(coe) <- c('item', 'CH.est', 'CH.t', 'VI.est','VI.t')
(edia <- ecmDiag(aem, 3)); (ed <- edia[c(1, 6:9), ])
ed2 <- cbind(ed[, 1:2], '_', ed[, 3], '_'); colnames(ed2) <- colnames(coe)
(tes <- ecmAsyTest(aem)$out); (tes2 <- tes[c(2, 3, 5, 11:13, 1), -1])
tes3 <- cbind(as.character(tes2[, 1]),
paste(tes2[, 2], tes2[, 6], sep = ''),
paste('[', round(tes2[, 4], digits = 2), ']', sep = ''),
paste(tes2[, 3], tes2[, 7], sep = ''),
paste('[', round(tes2[, 5], digits = 2), ']', sep = ''))
colnames(tes3) <- colnames(coe)
(table.4 <- data.frame(rbind(coe, ed2, tes3)))
# ------------------------------------------------------------------------# 6. Output
(output <- listn(table.1, table.2, table.3, table.4))
write.list(z = output, file = 'OutBedTable.csv')
Note: Major functions used in Program 17.1 are: ur.df(), ca.jo(), VAR(), ciTarThd(),
ciTarLag(), ciTarFit(), ecmSymFit(), ecmAsyFit(), ecmDiag(), bsStat(), Box.test(),
and lm().
# Selected results from Program 17.1
> table.1
             item  CH.level   CH.diff   VI.level    VI.diff
1            mean   148.791        __    115.526         __
2            stde    11.461        __      9.882         __
3            mini   119.618        __     99.335         __
4            maxi   177.675        __    150.721         __
5            obno        97        __         97         __
6  ADF with trend -2.956[3] -7.394[3] -2.936[12] -5.777[10]
7  ADF with drift -2.422[3] -7.195[3] -1.161[11]  -5.74[10]

> table.2
   test.1 test.2 lag statistic c.v.10. c.v.5. c.v.1.
1   eigen  trend   5    10.001   10.49  12.25  16.26
2   eigen  trend   5    20.253   16.85  18.96  23.65
3   eigen  const   5     4.461    7.52   9.24  12.97
4   eigen  const   5    14.304   13.75  15.67   20.2
5   eigen   none   5     4.438     6.5   8.18  11.65
6   eigen   none   5      14.3   12.91   14.9  19.19
7   trace  trend   5    10.001   10.49  12.25  16.26
8   trace  trend   5    30.254   22.76  25.32  30.45
9   trace  const   5     4.461    7.52   9.24  12.97
10  trace  const   5    18.765   17.85  19.96   24.6
11  trace   none   5     4.438     6.5   8.18  11.65
12  trace   none   5    18.738   15.66  17.95  23.52

> table.3
          item    Engle       tar     c.tar      mtar    c.mtar
1          lag       __         3         3         3         3
2       thresh       __         0    -8.041         0    -0.451
3    pos.coeff   -0.407  -0.328**   -0.28**    -0.116    -0.106
4  pos.t.value (-4.173)  (-2.523)  (-2.306)  (-0.824)  (-0.764)
5    neg.coeff       __ -0.515*** -0.721*** -0.658*** -0.677***
6  neg.t.value       __  (-3.119)  (-3.942)  (-4.754)  (-4.888)
7    total obs       __        97        97        97        97
8    coint obs       __        93        93        93        93
9          aic  669.627   658.998   654.863   650.612   649.495
10         bic  677.351   674.193   670.059   665.808    664.69
11  LB test(4)    0.773     0.961     0.879     0.988     0.987
12  LB test(8)    0.919     0.992     0.964     0.999     0.998
13 LB test(12)    0.239     0.122     0.084     0.289     0.333
14   H1: no CI       __     6.539     8.836    11.307    11.976
15  H2: no APT       __     1.033     5.081     9.435    10.612
16 H2: p.value       __     0.312     0.027     0.003     0.002

> table.4[1:7, ]
                 item    CH.est   CH.t  VI.est   VI.t
1         (Intercept)    -0.146 -0.052 -3.853* -1.777
2 X.diff.prCh.t_1.pos -0.622*** -2.755  -0.155 -0.897
3 X.diff.prCh.t_2.pos     0.082  0.344  -0.144 -0.795
4 X.diff.prCh.t_3.pos    -0.282 -1.264   0.146  0.854
5 X.diff.prCh.t_4.pos    -0.324 -1.403  -0.193 -1.091
6 X.diff.prCh.t_1.neg   -0.314. -1.464  -0.105 -0.641
7 X.diff.prCh.t_2.neg -0.584*** -2.651   0.085  0.508

17.4.2 Program for figures
The three figures reported in Sun (2011) can be created with base R graphics or the ggplot2 package. The code for the graphs is organized separately as a document to increase readability, and it is presented in the following R program. When the code for figure generation is long, this separation keeps the main program concise.
There are several ways to connect the individual programs for a specific empirical study. First, the main program can be called with the source() function so that all of its data become available to another program. Alternatively, if it takes a long time to run the main program each time or the data used in the other program are small, then the relevant data can be copied or generated directly. This is exactly the case for the relation between the two programs here. In general, figures use less data than statistical analyses, and a threshold cointegration analysis often takes quite some time to finish. Thus, at the beginning of Program 17.2, the value data for Figure 1, the price data for Figure 2, and the sum of squared errors for Figure 3 are generated directly, without calling the main program.
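If the figure program did need the objects created by the main program, it could simply source that file first; the file name below is hypothetical.

# Run the main program (Program 17.1) and keep its objects in the workspace
source('C:/aErer/BedTables.R')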
Figure 17.1 is generated from the traditional graphics system, and Figure 17.2 is from ggplot2. The main difference is that the ggplot version has a gray background and grid lines. Which version is more attractive is largely a personal choice. The code used for the ggplot version is generally longer than that for the base R version. One can also customize the appearance of the ggplot version and make it very similar to the version from base R. This is left as Exercise 17.6.1 on page 411.
In Sun (2011), Figure 1 shows the monthly import values for China and Vietnam, and Figure 2 shows their monthly import prices. Both figures can be created with the ggplot2 package. Recall that %+% is defined in ggplot2 to replace one data frame with another. It is tempting to use this operator to generate Figure 2 by substituting the underlying data frame. However, the value and price data are quite different in scale. As a result, it is faster in this case to copy all the code for Figure 1 and then revise it for Figure 2.
Program 17.2 Graph program version for generating figures in Sun (2011)
# Title: Graph codes for Sun (2011 FPE)
library(apt); library(ggplot2); setwd('C:/aErer'); data(daVich)
Figure 17.1 Monthly import value of beds from China and Vietnam (base R)
# ------------------------------------------------------------------------# A. Data for graphs: value, price, and t5$path
prVi <- daVich[, 1]; prCh <- daVich[, 2]
vaVi <- daVich[, 3]; vaCh <- daVich[, 4]
(date <- as.Date(time(daVich), format = '%Y/%m/%d'))
(value <- data.frame(date, vaCh, vaVi))
(price <- data.frame(date, prVi, prCh))
(t5 <- ciTarThd(y=prVi, x=prCh, model = 'mtar', lag = 3, th.range = 0.15))
# -------------------------------------------------------------------------
# B. Traditional graphics
# Figure 1 Import values from China and Vietnam
win.graph(width = 5, height = 2.8, pointsize = 9); bringToTop(stay = TRUE)
par(mai = c(0.4, 0.5, 0.1, 0.1), mgp = c(2, 1, 0), family = "serif")
plot(x = vaCh, lty = 1, lwd = 1, ylim = c(0, 60), xlab = '',
  ylab = 'Monthly import value ($ million)', axes = FALSE)
box(); axis(side = 1, at = 2002:2010)
axis(side = 2, at = c(0, 20, 40, 60), las = 1)
lines(x = vaVi, lty = 4, lwd = 1)
legend(x = 2008.1, y = 59, legend = c('China', 'Vietnam'),
lty = c(1, 4), box.lty = 0)
fig1.base <- recordPlot()
# Figure 2 Import prices from China and Vietnam
win.graph(width = 5, height = 2.8, pointsize = 9)
par(mai = c(0.4, 0.5, 0.1, 0.1), mgp = c(2, 1, 0), family = "serif")
plot(x = prCh, lty = 1, type = 'l', lwd = 1, ylim = range(prCh, prVi),
xlab = '', ylab = 'Monthly import price ($/piece)' )
Figure 17.2 Monthly import value of beds from China and Vietnam (ggplot2)
lines(x = prVi, lty = 3, type = 'l', lwd = 1)
legend(x = 2008.5, y = 175, legend = c('China', 'Vietnam'),
lty = c(1, 3), box.lty = 0)
# Figure 3 Sum of squared errors by threshold value from MTAR
win.graph(width = 5.1, height = 3.3, pointsize = 9)
par(mai = c(0.5, 0.5, 0.1, 0.1), mgp = c(2.2, 1, 0), family = "serif")
plot(formula = path.sse ~ path.thr, data = t5$path, type = 'l',
ylab = 'Sum of Squared Errors', xlab = 'Threshold value')
# ------------------------------------------------------------------------# C. ggplot for three figures
pp <- theme(axis.text   = element_text(size = 8, family = "serif")) +
theme(axis.title = element_text(size = 9, family = "serif")) +
theme(legend.text = element_text(size = 9, family = "serif")) +
theme(legend.position = c(0.85, 0.9) ) +
theme(legend.key = element_rect(fill = 'white', color = NA)) +
theme(legend.background = element_rect(fill = NA, color = NA))
fig1 <- ggplot(data = value, aes(x = date)) +
geom_line(aes(y = vaCh, linetype = 'China')) +
geom_line(aes(y = vaVi, linetype = 'Vietnam')) +
scale_linetype_manual(name = '', values = c(1, 3)) +
scale_x_date(name = '', labels = as.character(2002:2010), breaks =
as.Date(paste(2002:2010, '-1-1', sep = ''), format = '%Y-%m-%d')) +
scale_y_continuous(limits = c(0, 60),
name = 'Monthly import value ($ million)') + pp
fig2 <- ggplot(data = price, aes(x = date)) +
geom_line(aes(y = prCh, linetype = 'China')) +
geom_line(aes(y = prVi, linetype = 'Vietnam')) +
scale_linetype_manual(name = '', values = c(1, 3))+
scale_x_date(name = '', labels = as.character(2002:2010), breaks =
as.Date(paste(2002:2010, '-1-1', sep = ''), format = '%Y-%m-%d')) +
scale_y_continuous(limits = c(98, 180),
name = 'Monthly import price ($/piece)') + pp
fig3 <- ggplot(data = t5$path) +
geom_line(aes(x = path.thr, y = path.sse)) +
labs(x = 'Threshold value', y = 'Sum of squared errors') +
scale_y_continuous(limits = c(5000, 5700)) +
scale_x_continuous(breaks = c(-10:7)) +
theme(axis.text = element_text(size = 8, family = "serif")) +
theme(axis.title = element_text(size = 9, family = "serif"))
# ------------------------------------------------------------------------# D. Show on screen devices or save on file devices
pdf(file = 'OutBedFig1base.pdf', width = 5, height = 2.8, pointsize = 9)
replayPlot(fig1.base); dev.off()
windows(width = 5, height = 2.8); fig1
windows(width = 5, height = 2.8); fig2
windows(width = 5, height = 2.8); fig3
ggsave(fig1, filename = 'OutBedFig1ggplot.pdf', width = 5, height = 2.8)
ggsave(fig2, filename = 'OutBedFig2ggplot.pdf', width = 5, height = 2.8)
ggsave(fig3, filename = 'OutBedFig3ggplot.pdf', width = 5, height = 2.8)
17.5 Road map: developing a package and GUI (Part V)
Two large parts for R programming have been presented so far in this book. In Part III
Programming as a Beginner, basic R concepts and data manipulations are elaborated. Using
predefined functions for specific analyses is emphasized. In Part IV Programming as a
Wrapper, the structure of an R function is examined and how to write new functions is
demonstrated through various applications. Assuming you have learned these techniques
well, we now reach the final stage of the growing-up process: creating a new package for a
statistical model or research issue.
In general, the materials in the part for the beginner are more difficult than those in the part for the wrapper. The current materials in Part V Programming as a Contributor are probably
the easiest. The main challenge for creating a new package is to design the structure and
put appropriate contents inside the folders. This is covered in Chapter 18 Contents of a
New Package. Once the contents for a new package are finalized, the procedure of building
up the package is straightforward, and it takes no more than a few days to learn it. This is
covered in Chapter 19 Procedures for a New Package.
It is possible to transform an R package into a graphical user interface (GUI). The decision to build a graphical user interface depends on the associated benefits and costs. The benefits of GUIs include a more intuitive appearance and a low requirement on users' programming skills. The cost of this extra step is that package authors need to learn new commands to develop an application with a clear interface. If R is selected as the language for developing a GUI, then a programmer should have a solid understanding of R.
The basics of developing graphical user interfaces in R are presented in Chapter 20
Graphical User Interfaces. With a good knowledge base, one just needs to learn a few
new concepts related to GUIs and a few more packages in R. In the apt package, its core
functions are programmed into a GUI. This demonstrates well the growing process with
R from preparing individual functions, to a new package, and finally to a graphical user
interface.
17.6 Exercises
17.6.1 Customize Figure 1 in Sun (2011) by ggplot2. In Program 17.2 Graph program version for generating figures in Sun (2011) on page 407, Figure 1 is generated by base R graphics and ggplot2 separately. Customize the version by ggplot2 so that its appearance looks like the version from base R graphics. This may look like a trivial exercise, but it will let you learn more about ggplot2.
17.6.2 Analyze two empirical studies for a similar issue. The purpose of this exercise is
to learn and compare design techniques for several studies in the same area, similar to the relation between Wan et al. (2010a) and Sun (2011). Recall that in
Exercise 3.6.2 on page 41, one empirical study has been selected. This selected
study can be one of the sample studies (i.e., Sun, 2006a,b; Sun and Liao, 2011), or
one from the literature. For this exercise, find another empirical study in the literature that is closely related to the research issue covered in the selected study. Read
and compare the objectives, methods, and other aspects of the two related studies,
with an emphasis on the linkage.
Chapter 18 Contents of a New Package
What is an R package? A package in R is a collection of documents that follow certain format requirements (Adler, 2010). Thus, the definition has two keywords: content and format. The purpose of a package is to provide extra features, usually extending base R in a particular direction. In conducting a specific research project, researchers often generate a number of new functions and data sets and then save them in one or several folders on a local drive. These documents together are similar to a package, or are at an early stage of package development. However, these documents per se are not a package yet because, without some extra effort, they do not conform to the format required by R. The format requirement will be detailed in the next chapter. The documents prepared for a package can be either functions or data sets, and they are the focus of this chapter. The apt package, related to asymmetric price transmission and Sun (2011), will be used as an example to elaborate how to organize the content of a package.
Errors are likely to occur in the process of creating a new function. The probability
of running into errors is likely higher with many functions being designed and created for
a new package. Thus, the topic of debugging is relevant to Chapter 15 How to Write a
Function on page 312, but it is more important to package creators. R has a number of tools
that can be used for debugging. These tools and additional R features for time and memory
management are covered in this chapter.
18.1 The decision of a new package
After one gains experience in using R for a few projects, it is likely that creating a package
will come to one’s mind. This is a natural step or stage because R encourages a user to
become a developer gradually. A good understanding of the benefits and of a research area is needed in making the decision to create a new package.
18.1.1 Costs and benefits
There is always a benefit-cost question for any human activity. The first main cost associated with new package creation in R is learning how to build an R package for the first time. The formats and procedures for a new package can be intimidating, and the learning curve can be steep for some users. Furthermore, extra time investment is also needed each time a specific package is built. The final cost is that if a package is shared with others publicly or
privately, various questions can arise and the developer may need to address these questions
periodically.
Once a package is built successfully, it can be used by the developer only, by a few
collaborators (e.g., in a corporate environment or laboratory), or by many users through a
public distribution on the R Web site. Each scenario has some common or unique benefits.
These benefits are briefly described below, ranging from more private to more public.
The first major benefit of compiling a group of documents into a new package is an
efficiency gain in organization. Without a package format, new functions, data sets, and
explanations related to a research topic are often fragmented and scattered on a local drive.
The package format requirement imposes minimum standards for documentation and consistency between code and documentation. When a package is first built, a number of R functions are available to examine the formats. In submitting a package to the CRAN site, additional checks beyond those required to get the package up and running are conducted. At the end of a large research project, it is always a pleasure to compile many unorganized documents on a local drive into a package with clear documentation and connections. Thus, in my own experience, even if a researcher has no plan to share newly developed functions with others, the efficiency gain in organization alone should be sufficient reason to pack them into a package.
Building a new package can improve one's programming quality. The quality can be improved in designing and adding contents to a package. It can also be improved by following the format requirements during the process of building an R package. In addition, sharing a package with others, especially publicly through the CRAN site, can provide opportunities for testing the functions widely, identifying possible inaccuracies and errors, getting feedback to revise the functions, or adding more functions in areas inspired by user questions.
Compiling a set of documents into a package can provide great convenience of use. In fact, many functions included in base R are actually in packages too, e.g., lm() in the stats package. We know how convenient it is to load and use them in R. Developing a new package for the functions from a specific project allows the same convenience. When an ongoing project is finished and a new package is built, the program version for the project can become very concise after some reorganization. If the package is shared with others (privately or publicly), it provides a simple interface for others to access functions and data sets conveniently, just like any other package in base R.
Developing packages can result in a deep understanding of the R language. Commercial software products intentionally differentiate developers from users because they need to develop new features to make money; they discourage users from extending their software products. R is an open-source software application that encourages users to gradually become developers. Therefore, it is a natural growing path from using predefined functions in R (Part III Programming as a Beginner), to writing new functions (Part IV Programming as a Wrapper), and to writing new packages (Part V Programming as a Contributor). We have learned how to estimate a linear model by ordinary least squares through a user-defined function and know how inspiring the application has been. See Program 15.7 A new function for ordinary least square with the S4 mechanism on page 336 for detail. By the same logic, developing a new package will allow users to gain a much deeper understanding of the structure of R as a computer language and, furthermore, to become more efficient in using R packages and the language.
The final benefit of building new packages is to help others in a world that has become
increasingly interdependent. In using predefined functions (Part III Programming as a Beginner), we are dependent on others’ work. In writing new functions (Part IV Programming
Chapter 19 Procedures for a New Package
The last few chapters in a big book are often more difficult to understand. Fortunately, this is not the case here. In Chapter 18 Contents of a New Package, how to prepare the content of an R package in the form of functions and data sets is elaborated. In this chapter, the procedural requirements for building a new package are presented, with Microsoft Windows as the operating system. The format and procedure requirements for a new package together can be intimidating to beginners, but they are much easier than writing new R functions. In my experience, learning them for the first time through self-study takes less than three days. With the materials in this chapter, I hope the time needed is even shorter. Once the initial investment is made, building another new package of small to moderate size (e.g., the apt library) should consume only a few hours.
19.1 An overview of procedures
A package in R is a collection of documents that follow certain format requirements. Assume
that one has decided to create a new package for a specific research issue or model. A group
of R functions and data sets have been prepared and organized on a local drive. Now, the
final task in building up a package is to reorganize and format all the documents into a
package. To facilitate presentation and learning, this process is divided into three main
stages: skeleton, compilation, and distribution.
In the skeleton stage, function and data documents are organized in several folders on a
local drive with special folder names and formats. In particular, a set of help documents are
created to explain the functions and data sets. The size is usually about half a page to one
page for one individual R function. Thus, depending on the number of functions included
in a package, these help documents may take some effort to prepare.
In the compilation stage, several new software applications need to be installed on a
computer under a Microsoft Windows operating system. A command prompt window is
opened and a package is built there. In the distribution stage, the final package can be
submitted to the CRAN site for public sharing or sent to colleagues for private sharing.
In terms of time, the skeleton stage may take a few hours for a moderate package or even days for a large package. The compilation stage can be finished in half an hour if there is no error in the content of a package, but it can take much more time if the package content and help files need to be revised repeatedly. The distribution stage is the easiest one and can usually be finished in a few minutes. Sharing a package through the CRAN site will go through
additional checks and evaluation. That can generate new errors or warning messages, and
the package may need to be revised and then resubmitted.
19.2 Skeleton stage
After R functions and data sets have been prepared, they still need to be organized under specific folders. These folders have particular names and relations. In addition, help files for each function and data set need to be created. Finally, a number of special documents also need to be constructed, which requires a solid understanding of R namespaces.
19.2.1 The folders
Documents to be included in a package should be organized in several folders. The names
and formats of these folders should be followed strictly. The folder structure can be created
and learned in several approaches: copying one from an existing package; generating one
within the R console; and creating one manually on a local drive. This is demonstrated
briefly in Program 19.1 Three approaches to creating the skeleton of a new package.
If you have never created an R package before, then the most appealing approach is to
copy an existing package listed on the CRAN site. All publicly available packages have
been examined by developers and the CRAN site maintainer. Thus, they are all great
examples to learn from. For each package (e.g., the apt library), several documents are
available on the CRAN site: a reference manual (apt.pdf), vignettes (e.g., apt manual or
any name a developer likes; this is optional), package source (apt_2.3.tar.gz), Windows
binary (apt_2.3.zip), OS X binary (apt_2.3.tgz), and old sources. In particular, the reference manual in PDF is currently available on the CRAN site only. Installing a package
on a computer does not download this PDF help file to a local computer. Instead, an HTML
version of the help file is installed on a computer. The source file in the format of tar.gz has
the same content as the developer has on his or her own computer. When a package is compiled, all comments in the source version are removed. Thus, to view any comments or other
information in the package, the source version (apt_2.3.tar.gz) needs to be downloaded.
All these documents on the CRAN site can be downloaded manually one by one.
Sometimes, one may prefer to download these documents on the CRAN site from an R
console session directly. For such a need, the download.packages() function in R can download a Windows binary version directly for a computer under a Windows operating system,
e.g., apt_2.3.zip. The download.file() function is more powerful and flexible in downloading any file from the Internet. The erer library contains a wrapper function named
download.lib() that can download the zip or tar.gz version of a package and also its PDF
manual. In Program 19.1, this function is used to download the source version and reference manual in PDF for two packages, i.e., erer and apt. Note the version of a package can
change over time. Thus, inside the download.lib() function, the available.packages()
function is used to reveal and extract information of current packages on the CRAN site,
and then the download.file() function is applied to download specific documents.
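A hedged sketch of these steps within one session follows; the exact file name depends on the current package version, so it is read from available.packages() first, and the source URL is assumed to follow the usual CRAN src/contrib layout.

repo <- 'https://cran.r-project.org'
info <- available.packages(repos = repo)            # current packages on the CRAN site
ver  <- info['apt', 'Version']                      # current version of apt
download.file(url = paste0(repo, '/src/contrib/apt_', ver, '.tar.gz'),
  destfile = paste0('C:/aErer/apt_', ver, '.tar.gz'))
# The Windows binary can be fetched directly with download.packages()
download.packages(pkgs = 'apt', destdir = 'C:/aErer', type = 'win.binary')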
Once the source version is downloaded, it needs to be unzipped from tar.gz to tar.
Many tools can be used to unzip a file, including the free software 7-zip (www.7-zip.org).
After all the folders in the example package can be opened and viewed normally, they can
be copied to a new directory, the contents can be replaced by new documents intended for
a new package, and the skeleton (i.e., name and format) can be kept at the end.
As the second approach to creating a folder structure, one can also create the skeleton
of a new package from an R session directly with the package.skeleton() function. This
Chapter 20 Graphical User Interfaces
Two chapters about R graphics have been presented so far in the book. They are Chapter 11 Base R Graphics at the end of Part III Programming as a Beginner, and then Chapter 16 Advanced Graphics at the end of Part IV Programming as a Wrapper. Now, at the end of Part V Programming as a Contributor, a new chapter about graphical user interfaces (GUIs) is included. Essentially, this topic is still within the scope of R graphics, and it is closely related to the functions covered in the previous chapters about R graphics. However, the task is different, and a number of new concepts are introduced.
Many approaches exist for developing GUIs. A large number of contributed packages have
become available in recent years. They have been growing constantly and changing rapidly,
and the trend will likely continue for a while. We choose the gWidgets package to illustrate
the rationale and demonstrate how a GUI can be created with some moderate effort. At the
end, two applications are presented. One is about the correlation between random variables,
and the other is for threshold cointegration analyses with the apt package.
20.1 Transition from base R graphics to GUIs
Graphical user interfaces are part of our daily lives. GUIs allow users to interact with electronic devices (e.g., a computer, a cell phone, or the control panel of a gas pump) through graphical icons and visual indicators (e.g., a button or a message). In contrast, command-line interfaces (CLIs) require commands to be composed and submitted through a keyboard.
The benefits of GUIs are obvious. In general, they provide intuitive looks for straightforward tasks. For personal use, both adults and young kids can play all kinds of games on tablet computers. Furthermore, GUIs also have value for education and professionals. Teaching demos can be created to engage undergraduate students in learning new subjects. GUIs can also be beneficial to professionals who have limited interest or skills in programming. For simple work, GUIs can be more productive than CLIs.
The cost of GUIs can be viewed from the perspective of either production or consumption. Every GUI is created by a person with strong programming skills. Computer science has grown rapidly since the 1990s, and many jobs titled software engineer have become available. A number of computer languages have been popular (e.g., C, Java, S, and R). A GUI may be simple to use, but creating one is more demanding. For example, Microsoft Office has an intuitive interface, which is the output of many programmers.
Figure 20.1 A static view of chess board created by base R graphics
The cost of GUIs for consumers is less obvious. In general, GUIs are less efficient to use
than command-line interfaces for complicated work. A large task may need users to click
numerous buttons and make a large number of choices. As most consumers cannot write
a computer program, they do not understand the loss of productivity associated with
GUIs. For empirical research in economics, this is especially true, as discussed in Section 7.3
Estimating a binary choice model like a clickor on page 110.
With all the benefits and costs in mind, the main goal of this chapter is to develop a GUI
for the apt package as a teaching demo. We have been building our knowledge base in R
and growing up gradually throughout the book. Thus, before we work on GUIs intensively,
it is important to understand what we have learned and what new tools we need if the R
language is used for creating such a demo.
To help understand the transition from base R graphics to GUIs, a simple example
is utilized through Program 20.1 Creating a chess board from base R graphics. Base R graphics is employed here, while other packages such as grid can also achieve the same effect. Specifically, on the whole graphics device, outer and figure margins, axes, and labels are compressed. Graphical components are delivered to the plotting region. A total of 64 rectangular cells are drawn with two nested loops. The color of each cell is determined by the inherent pattern. If the sum of the coordinate values for the start point is an even number (e.g., x + y = 1 + 3 = 4), then the color is dark gray; otherwise, it is light gray (or almost white
on the screen). Four circles with black and white colors are used to represent chess pieces
here. The final output is presented in Figure 20.1 A static view of chess board created
by base R graphics. While the R commands are simple, the picture looks very much like a
chess board. Thus, to a certain degree, base R graphics is pretty efficient in delivering what
we want. In fact, the functionality in base R graphics has been the foundation of many new
packages for GUIs.
As it stands now, the problem of Figure 20.1 is that it is a static view. It does not allow
a user to click on the board and interact with a computer. Thus, the dynamic exchange of
information between users and computers in a typical GUI is not available. Unfortunately,
the predefined functions in base R cannot generate such an interactive effect. Thus, new
functions or tools are needed for these new features associated with GUIs.
Finally, a question is: why use R for a GUI, and not other languages? What is the advantage in
employing R to develop a GUI? The choice of language for GUI development can be affected
by many factors, e.g., the nature of a GUI. R is strong in statistical analyses and graphics.
Thus, if a GUI needs a large amount of computation, then R may have some advantages
over other languages. In addition, R is flexible in handling many tasks and integrating them
together.
In summary, GUIs have some benefits in certain circumstances, which is true not only
for leisure use but also for business and scientific research. We have learned various aspects
of R as a language in previous chapters. Thus, developing a GUI through R is just one extra
mile beyond what we have covered so far. A simple test of your knowledge base is whether you can
understand Program 20.1 in a few minutes without running the program. If not, then you
probably need to read Chapter 11 Base R Graphics and Chapter 16 Advanced Graphics
first. If yes, then you are ready to move forward.
Program 20.1 Creating a chess board from base R graphics
# A. Window device for a chess board
win.graph(width = 3, height = 3)
bringToTop(stay = TRUE)
par(mai = c(0.1, 0.1, 0.1, 0.1))

# B. Draw cells with different colors
plot(x = 1:9, y = 1:9, type = "n", xaxs = "i", yaxs = "i",
  axes = FALSE, ann = FALSE)
for (a in 1:8) {
  for (b in 1:8) {
    colo <- ifelse(test = (a + b) %% 2 == 0, yes = "gray50", no = "gray98")
    rect(xleft = a, xright = a + 1, ybottom = b, ytop = b + 1,
      col = colo, border = "white")
  }
}
box()

# C. Add chess pieces
points(x = c(2.5, 3.5), y = c(3.5, 6.5), pch = 16, cex = 3)
points(x = c(5.5, 7.5), y = c(4.5, 3.5), pch = 21, cex = 3, bg = "white")
out <- recordPlot()

# D. Save a pdf copy
pdf(file = "C:/aErer/fig_chess.pdf", width = 3, height = 3,
  useDingbats = FALSE)
replayPlot(out); dev.off()
20.1.1 Packages and installation
A few concepts need to be defined before getting started. A widget is a control element in a
GUI, such as a button or a scroll bar. A user interacts with a computer through GUI widgets.
A toolkit is a set of widgets used in developing GUIs. A toolkit itself is a software application
that is built on top of an operating system, provides a programming interface, and allows
widgets to be used. There are a large number of GUI toolkits, e.g., GTK+, Qt, Tk, FLTK,
Part VI
Publishing a Manuscript
Part VI Publishing a Manuscript: Manuscript preparation and the peer review process are
analyzed in this part. Major steps in writing a manuscript are examined to improve efficiency.
The whole peer-review process is demonstrated with examples. Typical symptoms associated
with a poorly prepared manuscript are selected and discussed.
Chapter 21 Manuscript Preparation (pages 471 – 485): Manuscript preparation is analyzed
from three perspectives: outline, detail, and style. In particular, how to construct manuscript
outlines by section is explained and compared with examples.
Chapter 22 Peer Review on Research Manuscripts (pages 486 – 502): Peer review on a manuscript is inherently negative, so its nature is analyzed first. Typical marketing skills are then discussed. Comments and responses from Wan et al. (2010a) are used as an example.
Chapter 23 A Clinic for Frequently Appearing Symptoms (pages 503 – 511): Typical
symptoms related to a poorly written manuscript are listed and analyzed. This includes
potential problems in study design, computer programming, and manuscript preparation.
R Graphics • Show Box 6 • Faces at Christmas time conditional on economic status
[Figure: four Chernoff faces for the years 1947, 1952, 1957, and 1962]
R can draw nontraditional graphics. See Program A.6 on page 520 for detail.
Chapter 21 Manuscript Preparation
The final main task in conducting a scientific study is to prepare a manuscript,
based on the study design and data analysis results. In this chapter, manuscript
preparation is analyzed from several perspectives: outline, detail, and style. First of
all, an outline is the “bone” or skeleton of a manuscript. By abstracting the outline of a
published paper, how the paper was created can be recovered. For illustration, Wan et al.
(2010a) is selected to show how the outline of a manuscript can be constructed. Then, the
process of constructing an outline will be analyzed by section (e.g., introduction, method,
and conclusion). The outline can be expanded and details for a manuscript can be added.
Finally, some issues related to writing styles, particularly those for empirical studies, will be
discussed.
21.1 Writing for scientific research
Writing is one way of expressing our thoughts, similar to other ways of communication, such
as speaking, singing, body action, or even no action. In general, writing is a more rigorous
way of communication, as it allows a better organization of mind. Written opinions can be
conveyed to readers without direct communication. Published papers can still exist after the
authors disappear from the earth. One of the most fundamental differences between animals
and human beings is that we can write.
Why is writing important? It is difficult for a person to compete in today’s job market
without strong writing ability. This is true not only in research-oriented professions such as
faculty, but also in a corporate environment. In addition, good writing is the foundation of
high-quality speech. It would be surprising if an excellent public speaker cannot write well.
For scientific research, the final output is in the form of reports, theses, or journal articles.
Clear writing is essential for disseminating research outputs.
Writing per se, i.e., literally using a pen on a piece of paper or punching a computer
keyboard, generally consumes a small amount of the total time for a research project. For
most empirical studies in economics, once data analyses are completed, a few weeks should be
sufficient for writing a 30-page manuscript. In spite of that, scientific writing is still often
perceived to be more challenging than other types of writing. To publish an article in a prestigious journal, a manuscript may need to be revised many times over several months or years. It
is certainly more difficult to write a scientific report than to write a short announcement
notice for a weekend party. The challenges for scientific writing mainly come from how to
present innovative research outputs in a rigorous, concise, and logical way.
To address the challenges for scientific writing, the concepts promoted in this book will
continue to be exploited here. An empirical study in economics has three versions: proposal,
program, and manuscript. The key to improving manuscript preparation is to understand
the relation between the three versions of an empirical study. A proposal contains the study
design and provides the overall guide. A computer program generates all key results in the
format of tables and figures. Manuscript preparation is the final step in the production
process of economic research. Efficient writing needs strong support from the other two
versions; otherwise, there is really not much to write for scientific research.
Over time, scientific writing has become highly structured, even though there is a certain
degree of flexibility for some sections in a manuscript. Thus, structural writing has become
the main approach in improving writing efficiency. Specifically, the first step is to set up the
outline of a manuscript, with inputs generated from the program version of an empirical
study. Then, the outline can be expanded with details being added. In the end, expressions
and sentences can be polished after the main contents of a manuscript are finalized.
In sum, the quality of a manuscript from a scientific project is mainly dependent on the
quality of study design and data analyses. The goal of preparing a manuscript is to present
scientific findings in a concise and rigorous way. To improve the efficiency for manuscript
preparation, structural writing is highly recommended. Grammar mistakes can also affect
the quality of a manuscript. One can improve the ability to handle details and grammar through more practice over time.
21.2 An outline sample for Wan et al. (2010a)
A complete outline for a manuscript should be constructed before any writing starts. To
demonstrate what an outline looks like, Wan et al. (2010a) is selected as an example here.
Recall that the proposal for this study is presented in Section 5.3.2 Design with public
data (Wan et al. 2010a) on page 72. In Section 12.1 Manuscript version for Wan et al.
(2010a) on page 253, the first manuscript version is constructed with information from the
proposal version, which is very brief and sketchy. Finally, in Section 12.3 Program version
for Wan et al. (2010a) on page 260, six tables and one figure have been generated and
saved.
With all the analyses and work being accomplished, we are in a great position to finalize
the outline for this manuscript. After the program version for a project has been completed,
the first manuscript version, e.g., the sample presented in Section 12.1, should be expanded
with the table and figure results first. R may not be able to format the outputs completely.
Thus, within a word processing application, tables and figures should be formatted with the
ultimate publication outlet in mind.
Then, the outline is expanded gradually to the paragraph level with one line for each
paragraph. This is a critical step in setting up the final outline for the manuscript. It involves
a good amount of planning, and for the discussion section, even some brainstorming. At the
end, a complete outline for a typical manuscript should have about two to three pages of
bullet points with texts and about eight pages of tables and figures. For Wan et al. (2010a),
its final outline is listed below. The only difference is that the actual tables and figures are
not included here for brevity, as they are the same as those published in the journal.
In general, one should not begin to write a manuscript before an outline is complete
and largely satisfactory. If some key questions or objectives raised in the proposal version
have not been addressed in the outline, then one should collect new materials, ask for help from
others, or conduct more necessary analyses. That being said, it should be noted that setting
up an outline can be dynamic and it is hard to have a perfect outline before the paper is
published. Sometimes comments from a peer reviewer can change the outline dramatically.
The sample outline for Wan et al. (2010a) below reveals well the components of a typical
empirical study. The key sections are the methodology and results. Without those two
sections, it could not have been published. Other sections like introduction and conclusion
are there to support or sell the research story. In addition, I usually list the number of pages
I plan to write for each section in the outline. This notation can remind me during writing whether the actual length is sufficient. It can help avoid a section that ends up bigger or smaller than planned. When the actual writing differs from what has been planned in the outline, you need to think it over twice: why should I write more for this section, or why should I reduce the description for another section? If you cannot convince yourself that a change is justified, then you will have to restrain the desire to say too much about one point, or push yourself harder to write more about another point. Failing to stick to the original space allocation may result in
an unbalanced structure for the paper at the end.
The final outline for Wan et al. (2010a)
Title: Analysis of import demand for wooden beds in the U.S.
(30 pages in double line spacing)
(1). A cover page. An abstract has about 200 words. Have one or two sentences for research
issue, study need, objective, methodology, data, results, and contributions. Five key
words. JEL classification codes if needed.
(2). Introduction (2 pages in double line spacing). A brief introduction of the trade patterns,
antidumping investigation, past studies, overall objective, and three specific objectives
and contributions. Each paragraph is defined by one line with a black bullet.
• A brief introduction of trade of furniture and wooden beds related to the U.S.
• Past studies about furniture trade, knowledge gap, and study need.
• Overall study objective and a brief introduction of method, product, and data.
• Objective #1: consumer behavior evaluation through price elasticities.
• Objective #2: depression effect of the antidumping action on China’s imports.
• Objective #3: diversion effect of the antidumping action on others’ imports.
• A brief paragraph about the manuscript organization.
(3). Market Overview and the Antidumping Investigation against China (3 pages in double
line spacing). A detailed review of the research issue: trade pattern, antidumping
investigation, duties, and research needs. Figure 1 is relevant to this section.
• Rapid growth in imports of wood bedroom furniture by the U.S.
• Traditional suppliers of wooden beds in the U.S. import market.
• Newly industrialized Asian countries (e.g., China) as new suppliers.
• Antidumping investigation against the large imports from China and key dates.
• Main conclusions from the investigation.
• Antidumping duties and a display of import shares by country (Figure 1).
Chapter 22 Peer Review on Research Manuscripts
To publish a manuscript in a journal, the manuscript needs to go through an
anonymous peer review process. The nature of this process is inherently negative,
and this will be elaborated first. Then, the major reasons for manuscript rejection
are discussed and possible remedies are presented. Finally, the review comments for Wan
et al. (2010a) are included as an example. Some responses are presented and analyzed for
the purpose of illustration.
22.1 An inherently negative process
When a manuscript is submitted to a journal, it is handled by a chief editor, an associate
editor, and two to four anonymous reviewers (also known as referees). At the beginning of a
peer review process, the chief editor reads the manuscript and assigns it to an associate editor with appropriate expertise. The associate editor then selects several reviewers to make
comments on the manuscript. Selected reviewers read the manuscript and make recommendations to an associate editor, which often takes a few weeks or even months. The associate
editor reads the comments and manuscript, and then in turn, makes a recommendation
to the chief editor. At the end, the chief editor makes a decision with all the information.
The names of reviewers, and sometimes even the associate editor, are usually unknown to
authors. This structure can vary greatly by journal, but opinions from one or two editors and several reviewers are the key to the whole peer review process.
A peer review process is negative in nature. It is interesting to compare how we human beings behave in different environments. When we see a new baby, we always say it is a beautiful baby, but we know some babies are not. At a funeral, the deceased is always the best person on this planet, but we know nobody is perfect. A researcher may feel heartbroken if he has spent great effort on a manuscript, followed the long review process, and at the end still received a rejection letter full of negative comments.
An anonymous peer review on a manuscript allows for ugly, harsh, and unfriendly truths
to be discovered and conveyed. Referees search for flaws in a manuscript. They assess study
design, methods, data quality, and results. Any fatal or even suspicious problem in a major
area is likely to result in a rejection decision. In some cases, problems identified through
a peer review process can be fixed, but in other situations, they may not. Ultimately, the
contents and scientific merits will determine the fate of a manuscript.
Given the stressful nature of this game, a natural question is whether we can have a
friendlier process to achieve the same effect. Most probably, the answer is no. The reason is
simple: people around you do not want to, or cannot, tell you their true feeling about the
paper. An anonymous peer review allows referees to make serious comments on a study. The
benefits are apparent if we compare the quality of journal articles to that of non-refereed
conference papers. Once the review process is over and the paper is finally published, most
authors would agree that the peer review comments are helpful in improving manuscript
quality. This is the major benefit of the peer review system, and it will continue to provide strong justification for this system in the future.
A peer review does have several costs. One major cost is that authors often suffer from
the critical comments. This is true even for seasoned researchers. It is just human nature
that we like positive feedback on our activities. A solution to this is to keep in mind firmly
the nature of this game. Do not take the negative comments personally. A manuscript being rejected does not mean the author is a bad person or an unqualified professional, and in many situations, it does not even mean that the research has no value. In most cases, putting
the comments away for several weeks before revising a manuscript can ease most of the pain.
Another major cost of a peer review is that editors and reviewers can make mistakes.
They do reject some good manuscripts that should have been published. This happens because human beings are not perfect and we are all limited in one way or another. Fortunately,
the probability is usually low. If a manuscript is indeed of high quality but rejected by one
journal, authors can always try other journals later.
Like any activity, going through a peer review process needs some advice and experience.
Most graduate students do not stay in an academic environment after their graduation.
They may only have one or two opportunities to practice and learn relevant skills. Thus, one-to-one advising by an adviser usually plays a large role in helping students understand this game. The
skills learned from a peer review should be beneficial to future career development, even
in a corporate environment. This is because a peer review is just one typical activity in a
market economy.
22.2 Marketing skills and typical review comments
Several editorial decisions are possible at the end of a peer review. A manuscript can be rejected directly without reconsideration; rejected with resubmission allowed; accepted after a major, moderate, or minor revision; or accepted as it is. In practice, a manuscript
often goes through the review process several times, with the status being changed gradually
from a major revision to a final acceptance. For each version of a manuscript, the editorial
decision can be influenced by manuscript quality, journal goals, research areas, rules in an
academic community, and reviewers’ personality.
The comments from an editor and several reviewers are a reflection of many factors
listed above. A peer review on a manuscript can generate numerous comments in many
pages. While comments can be different with specific contents and formats, the nature of
these comments can be classified into several major types. In this section, a few typical
situations are analyzed and possible remedies are offered.
22.2.1 Marketing skills
In academia, “publish or perish” is a phrase to describe the pressure to publish academic
work in an increasingly competitive academic environment. In economics, the profit of a
commodity to a producer is equal to the difference between the revenue and cost. Finishing
a manuscript efficiently within one’s office is similar to producing a quality hamburger with
minimum cost in a factory. However, before a manuscript is published, few people will have
a chance to read it so the impact or “revenue” is almost zero. Unless one just conducts a
study purely for personal pleasure, the pressure of "publish or perish" is always present.
Getting a manuscript published in a peer-reviewed journal requires some marketing skills. This is the same as selling any commodity in a market (e.g., a hamburger). It is true even if an author is very confident in the quality of a manuscript. There is intense competition
from the perspective of both researchers and publishers. In a specific research area, it is
common that many researchers compete with each other in trying to publish results with a
better quality or faster pace. Journal editors and publishers also need to compete for quality
manuscripts. Therefore, it is critical that authors understand the competitive nature of the
peer review process, and furthermore, utilize appropriate marketing skills in promoting and
publishing their results.
Where can we learn marketing skills? Unfortunately, it is not something we can learn
in a classroom through a typical course. By analogy, it is hard to imagine that one takes a marketing course and then becomes the best agent in selling cars. It is safe to say that better marketing is always associated with a solid understanding of the commodity, better interaction with persons in the market, and most importantly, more practice. These general principles are also applicable in selling a manuscript to a journal, or in a broad sense and in the long term, to an academic audience.
To publish research outcomes successfully and continually, one needs to know the focus of
a manuscript and the relevant research area well. This is so obvious for selling any commodity
but it can be neglected after one has spent months working on a project and preparing a manuscript. My suggestion is to put the manuscript away for a few weeks once it is
finished. Then read it again, evaluate possible publication outlets (or your previous choice),
and then select a journal for final formatting. In the long term, one will be able to gradually
gain a better understanding of the research area, know where similar studies have been
published, and make sound decisions for the peer review process.
Researchers also need to build up a professional network that covers editors and potential
reviewers. It is true that there are too many scientific journals nowadays, and it is impossible
to know all journal editors. However, it should be quite feasible to interact with the editors
of a few journals where one often publishes. Researchers should utilize various opportunities
to impress editors and potential reviewers with specific outcomes from a study, or in the long
term, with one’s own reputation in a research area. These opportunities include presentations
and social activities at professional meetings, anonymous peer reviews on manuscripts as
requested by journal editors, and other professional communications. In general, when one
is deeply integrated in a profession, it often becomes easier to get manuscripts published,
or at least, much of the unfair treatment or bad luck described below can be avoided.
In dealing with an editorial decision and comments from a journal, write and respond in a
professional way. Some comments can be hard to read and digest. However, angry responses
will not help at all for getting a manuscript accepted, and in the long term, it may hurt
one’s reputation too. Instead, compose sentences and have a cover letter with a professional
writing style. If you do not agree with the reviewers on a specific comment, for example,
express it like this: “We respectfully do not agree with this because ....” Keep in mind
that editors and reviewers are ordinary persons around you and they can be as fragile as
yourself. In one instance, I emphasized a few words by using capital letters. The reviewer
responded to my answer by asking: “Why do you yell to me by capitalizing the letters?”
At the end, that specific manuscript was rejected after two rounds of peer reviews. In sum,
in communicating with editors and reviewers by email or letter, writing professionally and
politely may improve the chance of getting a manuscript published.
Chapter 23 A Clinic for Frequently Appearing Symptoms
So far, the right way to conduct scientific research has been promoted in the book.
In reality, this may not work as planned. It is similar to swimming: one watches another person swim so well in a pool and tells himself, "Easy job! I can do that too." However, when he jumps into the pool, he quickly realizes that what he learned on land does not work in the water, and he is sinking to the bottom. In this chapter, some typical symptoms that have occurred to me and our graduate students are listed and elaborated.
This can serve as a self-guide for improving research productivity.
23.1 Symptoms related to study design and outline
23.1.1 Symptom: Readers of your paper say: “It does not look like a paper.”
Prescription: No simple medicine is available to cure this severe symptom. More
examinations are needed to determine which part of your research is problematic. A
prescription can be offered for each individual problem.
Explanation: If your paper and writing style give readers this kind of impression,
it means that you have not learned the basics of scientific research yet. There is a long way to go before you can reach a professional writing standard and meet the requirements. This is similar to a patient who is in a severe condition. When doctors examine the patient, they cannot determine which part or organ of the patient is failing. The only certain thing is that the patient is dying. If your
paper does not look like a paper, it will disappear from this planet in a short period
too.
23.1.2 Symptom: Faculty advisers say: “There is no way to revise and improve your paper.”
Prescription: No simple medicine is available to cure this severe symptom. Most
probably, your outline is not clear. Raw materials and information, even if relevant,
are just piled together in the paper without logical connection. Alternatively, your
outline is clear but there is no real contribution from the analysis. You need to learn
the basics of writing and receive detailed guidance from your faculty adviser.
Explanation: This often occurs to young professionals. Students think that if a 30-page document is compiled together, it will become a paper automatically. Unfortunately, that perception is totally wrong. To produce a quality paper, you have
to go through several stages: create or understand the research idea and objectives,
conduct the necessary analyses, and connect all the details together in a manuscript.
23.1.3 Symptom: You feel you do not really understand the research project.
Prescription: Read the proposal from your faculty adviser or collaborators carefully.
Or if the research idea is your own, try to clarify it and write it down on a piece of
paper.
Explanation: A complete understanding of what you are doing is the first and
fundamental step in conducting scientific research. If you do not know where you
are going, you will not reach your destination automatically. My personal habit is to
write down relevant information in a single document when I start a new project. In
that regard, taking notes can help greatly in clarifying my mind. Items included are
the objectives, related key literature, main contributions, and major uncertainties
that may kill the idea. These concepts are covered in Section 5.1.2 Keywords in a
proposal on page 58 and Figure 5.1 The relation among keywords in a proposal for
social science on page 61.
23.1.4 Symptom: After writing five pages, you do not know what to write next.
Prescription: You have not fully constructed the outline for the manuscript yet.
You have to go through the outline stage before you write the paper with detail.
Explanation: This may be one of the most common mistakes that young professionals make in conducting scientific research. Many students write a paper without an
overall outline being set up first. They think that by writing down the details they already know, the other components will follow automatically. That is completely wrong and it seldom happens. In fact, when you dive into the detail stage and the writing process, you focus on individual “trees” and quickly lose your view of the “forest.” Sooner or later, you will find yourself desperately in the middle of a jungle with this question: where am I and where can I go now? Even if at the end you can finish the paper with 30 pages, it will most probably look fragmented and fail to achieve your original objective. Therefore, you have to go through the outline stage
before you go to the detail stage and write your paper. In particular, the outline of a
manuscript for empirical study in economics comes from the design in a proposal and
results from a computer program. See the relevant coverage in Chapter 4 Anatomy
on Empirical Studies on page 42.
23.1.5 Symptom: You were trapped in writing the introduction of a manuscript.
Prescription: One should never prepare a manuscript by working on the introduction section first. Instead, write your introduction section after you have finished all
other sections of your paper.
Explanation: The typical symptom is that you start to write a manuscript with the introduction section. You can spend a lot of time doing that but still feel unsatisfied or uncertain. To write a manuscript more efficiently, the introduction
section should be written after all other sections have been done. This is a section
that you use to sell your story to readers. If you have not finished key sections such as
the method and results, you really do not know what to sell or what to introduce in
the first place. Similarly, in building a house, no builder begins the work by putting
up the front door first. See more discussion in Section 21.3 Outline construction
by section on page 475.
Appendix A Programs for R Graphics Show Boxes
Six show boxes for R graphics have been inserted at the beginning of the parts.
In this appendix, the R programs and notes for these graphs are presented. In
general, computer graphics can be either scientific or entertaining, or both. These
show boxes have been selected with some practical value for scientific research, but they are more relaxing than those in the book. I hope these show boxes capture your attention and inspire you to explore more in the process of learning R graphics functionality.
Show Box 1 on page 2: map creation
R can process spatial data and visualize them on maps, as presented in Section 16.4 Spatial
data and maps on page 385. In this show box, the maps package is utilized to generate a basic
sketch of the world map. The key function of map() can use internal databases included in
the package or external data imported from other sources. To make the map more revealing,
two countries, i.e., Brazil and China, are selected and filled with colors. Note the add = TRUE
argument in the map() function allows several layers to be combined in a single map. The
map is displayed on a screen device first and then saved as a PDF file at the end.
Program A.1 A world map with two countries highlighted
# A. Load the package and understand region names
setwd("C:/aErer"); library(maps)
dat <- map(database = "world", plot = FALSE); str(dat)

# B. Display the map on the screen device
windows(width = 5.3, height = 2.5); bringToTop(stay = TRUE)
map(database = "world", fill = FALSE, col = "green", mar = c(0, 0, 0, 0))
map(database = "world", regions = c("Brazil", "China"), fill = TRUE,
  col = c("yellow", "red"), add = TRUE)
showWorld <- recordPlot()

# C. Save the map on a file device
pdf(file = "fig_showWorld.pdf", width = 5.3, height = 2.5)
replayPlot(showWorld); dev.off()
# Selected results from Program A.1
> dat <- map(database = "world", plot = FALSE); str(dat)
List of 4
 $ x    : num [1:27221] -130 -130 -132 -132 -132 ...
 $ y    : num [1:27221] 55.9 56.1 56.7 57 57.2 ...
 $ range: num [1:4] -180 190.3 -85.4 83.6
 $ names: chr [1:2284] "Canada" "South Africa" "Denmark" ...
 - attr(*, "class")= chr "map"
Show Box 2 on page 30: 3-D heart shapes
R is strong in both statistical analyses and graphics. Combining its data processing and
graphics functions can accomplish many interesting tasks. In this show box, a heart shape
with three-dimensional (3-D) effects is created. The heart shape seems too romantic or
casual for an academic book like this. Nonetheless, my main consideration is that Part II
Economic Research and Design is about thinking and it does relate to our heart.
Specifically, the objective is achieved by drawing many heart shapes with shrinking sizes
and lighter colors. First, the colorRampPalette() function can return a new function that
interpolates a set of given colors to create new color palettes. With the chosen colors of red
and white, a set of colors can be generated from this function. If you are interested in seeing
what a black heart would look like, then the command for the color set can be changed to
colors = c("black", "white").
The oneHeart() function is created to draw one heart shape with the two argument
values, i.e., r for the size and col for the color. This new function contains a seemingly
complicated expression for a heart shape, but that is actually easy as there are many similar
mathematical formulas on the Internet.
Finally, in drawing the graph, the key functions used are polygon() and mapply(). In base
R graphics, polygon() is a low-level plotting function. Thus, the high-level plotting functions
of plot.new() and plot.window() are used first to initiate a graph. The mapply() function
vectorizes the drawing action over its two arguments (i.e., heart size values and colors), which
is more efficient than employing two looping statements. See details at Section 10.3.3 The
apply() family on page 198. The number used here is 500, and a smaller number (e.g., n =
3) can show a much coarser version. The oneHeart() function is supplied as an argument value
to the mapply() function.
Program A.2 A heart shape with three-dimensional effects
# A. Set color and heart parameter values
setwd("C:/aErer")
n <- 500
fun.col <- colorRampPalette(colors = c("red", "white")); fun.col
set.col <- fun.col(n); head(set.col)
set.val <- seq(from = 16, to = 0, length.out = n)

# B. Create a new function to draw one heart as a polygon
oneHeart <- function(r, col) {
  t <- seq(from = 0, to = 2 * pi, length.out = 100)
  x <- r * sin(t) ^ 3
  y <- (13 * r / 16) * cos(t) - (5 * r / 16) * cos(2 * t) -
    (2 * r / 16) * cos(3 * t) - (r / 16) * cos(4 * t)
  polygon(x, y, col = col, border = NA)
}

# C. Draw many hearts with mapply()
windows(width = 5.3, height = 3.2); bringToTop(stay = TRUE)
par(mgp = c(0, 0, 0), mai = c(0, 0, 0, 0))
plot.new()
plot.window(xlim = c(-16, 16), ylim = c(-16, 13))
mapply(FUN = oneHeart, set.val, set.col)
showHeart <- recordPlot()

# D. Save the graph on a file device
pdf(file = "fig_showHeart.pdf", width = 5.3, height = 3.2)
replayPlot(showHeart); dev.off()
# Selected results from Program A.2
> fun.col <- colorRampPalette(colors = c("red", "white")); fun.col
function (n)
{
x <- ramp(seq.int(0, 1, length.out = n))
if (ncol(x) == 4L)
rgb(x[, 1L], x[, 2L], x[, 3L], x[, 4L], maxColorValue = 255)
else rgb(x[, 1L], x[, 2L], x[, 3L], maxColorValue = 255)
}
<bytecode: 0x069e1d50>
<environment: 0x0a9a9180>
> set.col <- fun.col(n); head(set.col)
[1] "#FF0000" "#FF0000" "#FF0101" "#FF0101" "#FF0202" "#FF0202"
Show Box 3 on page 100: categorical values
R is widely used to visualize continuous variables. Nevertheless, R also has rich functions in
revealing the relations between categorical variables, and to my understanding, these tools
are largely underutilized in economic research. In this show box, the well-known Titanic
data set is selected to demonstrate how a contingency table can be shown as a graph. Recall
Titanic was a British passenger liner that sank in the North Atlantic Ocean in 1912 and
more than 1,500 people died.
The R data set Titanic is a four-dimensional array resulting from cross-tabulating 2,201 observations on four variables: age, class, gender, and survival. This data set also has the table class, so the functions as.data.frame() and ftable() can be used to get a better view of the data, as detailed in Section 10.3.2 Contingency and pivot tables
on page 194. In drawing a concise graph for demonstration here, the class information is used
as the x axis, and the survival rate as the y axis. The mosaicplot() function in the traditional graphics system shows the pattern well. The main impression is that first-class passengers
have a higher survival rate. If one needs to have more controls over the plot, e.g., adding
values by category to the plotting region, then several contributed packages in R can be
consulted, e.g., vcd.
Program A.3 Survival of 2,201 passengers on the Titanic, which sank in 1912
# A. Understand data; library(vcd) has more options for mosaic plots
setwd("C:/aErer")
Titanic; as.data.frame(Titanic)
str(Titanic)
ftable(Titanic)

# B. Draw a mosaic plot for two categorical variables
windows(width = 5.3, height = 2.5, pointsize = 9)
bringToTop(stay = TRUE)
par(mai = c(0.4, 0.4, 0.1, 0))
mosaicplot(formula = ~ Class + Survived, data = Titanic,
  color = c("red", "green"), main = "", cex.axis = 1)
showMosaic <- recordPlot()

# C. Save the graph on a file device
pdf(file = "fig_showMosaic.pdf", width = 5.3, height = 2.5)
replayPlot(showMosaic)
dev.off()
# Selected results from Program A.3
> str(Titanic)
 table [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
 - attr(*, "dimnames")=List of 4
  ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
  ..$ Sex     : chr [1:2] "Male" "Female"
  ..$ Age     : chr [1:2] "Child" "Adult"
  ..$ Survived: chr [1:2] "No" "Yes"
> ftable(Titanic)
                   Survived  No Yes
Class Sex    Age
1st   Male   Child            0   5
             Adult          118  57
      Female Child            0   1
             Adult            4 140
2nd   Male   Child            0  11
             Adult          154  14
      Female Child            0  13
             Adult           13  80
3rd   Male   Child           35  13
             Adult          387  75
      Female Child           17  14
             Adult           89  76
Crew  Male   Child            0   0
             Adult          670 192
      Female Child            0   0
             Adult            3  20
Show Box 4 on page 252: diagrams
The traditional graphics system in R is very flexible in drawing a diagram. Furthermore, several contributed packages have made some common diagrams even easier to draw. The diagram package is one of them. In this show box, the plotmat() function is used to draw
the structure of this book again. Four boxes are created to represent the overall structure,
i.e., the proposal, program, and manuscript versions for an empirical study. There are many
arguments in this function that can be used to customize the detail. Of course, users need to
pay an extra cost in learning the new package before enjoying the flexibility in customizing
the graph. Calling plotmat() can draw a diagram and also return a list of graphical data.
Program A.4 A diagram for the structure of this book
# A. Data preparation
setwd("C:/aErer"); library(diagram)
M <- matrix(nrow = 4, ncol = 4, data = 0)
M[2, 1] <- 1; M[4, 2] <- 2; M[3, 4] <- 3; M[1, 3] <- 4; M

# B. Display the diagram on a screen device
windows(width = 5.3, height = 2.5, family = "serif")
par(mai = c(0, 0, 0, 0))
book <- plotmat(A = M, pos = c(1, 2, 1), curve = 0.3,
  name = c("Empirical Study", "Proposal", "Manuscript", "R Program"),
  lwd = 1, box.lwd = 1.5, cex.txt = 0.8, arr.type = "triangle",
  box.size = c(0.15, 0.1, 0.1, 0.1), box.cex = 0.75,
  box.type = c("hexa", "ellipse", "ellipse", "ellipse"),
  box.prop = 0.4, box.col = c("pink", "yellow", "green", "orange"),
  lcol = "purple", arr.col = "purple")
names(book); book[["rect"]]
showDiagram <- recordPlot()

# C. Save the graph on a file device
pdf(file = "fig_showDiagram.pdf", width = 5.3, height = 2.5, family = "serif")
replayPlot(showDiagram); dev.off()
# Selected results from Program A.4
> M[2, 1] <- 1; M[4, 2] <- 2; M[3, 4] <- 3; M[1, 3] <- 4; M
     [,1] [,2] [,3] [,4]
[1,]    0    0    4    0
[2,]    1    0    0    0
[3,]    0    0    0    3
[4,]    0    2    0    0
> names(book); book[["rect"]]
[1] "arr"   "comp"  "radii" "rect"
     xleft       ybot xright      ytop
[1,]  0.35 0.70590856   0.65 0.9607581
[2,]  0.15 0.41505015   0.35 0.5849498
[3,]  0.65 0.41505015   0.85 0.5849498
[4,]  0.40 0.08171682   0.60 0.2516165
Show Box 5 on page 394: dynamic graphs
R can create dynamic and interactive graphs. In this show box, 500 plots are drawn on
the screen device sequentially with varying parameters to create an effect like a video.
Specifically, two random variables are generated with the rCopula() function in the copula
package. The relation between the variables is specified through a normal copula, as defined
by normalCopula(). The correlation between these two variables can vary from −1 to 1.
When these variables are plotted on a graph through a for looping statement and the
plot() function, the varying correlations and colors create a continuous visual effect. The
color values are generated with the rainbow() function.
To capture the spirit of this video on a two-dimensional static graph, some extra effort is needed. Specifically, three screenshots are chosen with different correlations and colors.
Then they are arranged on the screen device through the powerful functionality offered in
the grid package. Three viewports are created at the top-level window and each contains
one screenshot. The axes and labels are excluded in the screenshots for brevity.
Program A.5 Screenshots from a demonstration video for correlation
# A. Data: number of plots, points, and colors
setwd("C:/aErer"); library(copula); library(grid)
num.plot <- 500; num.points <- 10000
set.rho <- seq(from = 0, to = 1, length.out = num.plot)
pie(x = rep(x = 1, times = 15), col = rainbow(15))  # understand rainbow()
set.col <- rainbow(num.plot)
str(set.col); head(set.col, n = 4)

# B. Display the video on the screen device; need about 35 seconds
# Two random variables with bivariate normal copula
windows(width = 4, height = 4); bringToTop(stay = TRUE)
par(mar = c(2.5, 2.5, 1, 1))
for (i in 1:num.plot) {
  sam <- rCopula(n = num.points, copula = normalCopula(param = set.rho[i]))
  plot(x = sam, col = set.col[i], xlim = c(0, 1), ylim = c(0, 1),
    type = "p", pch = ".")
}
str(sam); head(sam, n = 3)

# C. Three screenshots for ERER book
windows(width = 5.3, height = 2.5); bringToTop(stay = TRUE)
v1 <- viewport(x = 0.02, y = 0.98, width = 0.55, height = 0.5,
  just = c("left", "top"))
pushViewport(v1); grid.rect(gp = gpar(lty = "dashed"))
sam <- rCopula(n = 3000, copula = normalCopula(param = 0))
grid.points(x = sam[, 1], y = sam[, 2], pch = 1,
  size = unit(0.001, "char"), gp = gpar(col = "red"))

upViewport(0); current.viewport()
v2 <- viewport(width = 0.55, height = 0.5)
pushViewport(v2); grid.rect(gp = gpar(fill = "white", lty = "dashed"))
sam <- rCopula(n = 3000, copula = normalCopula(param = 0.8))
grid.points(x = sam[, 1], y = sam[, 2], pch = 1,
  size = unit(0.001, "char"), gp = gpar(col = "darkgreen"))

upViewport(0)
v3 <- viewport(x = 0.98, y = 0.02, width = 0.55, height = 0.5,
  just = c("right", "bottom"))
pushViewport(v3); grid.rect(gp = gpar(fill = "white", lty = "dashed"))
sam <- rCopula(n = 3000, copula = normalCopula(param = 0.99))
grid.points(x = sam[, 1], y = sam[, 2], pch = 1,
  size = unit(0.001, "char"), gp = gpar(col = "blue"))
showCorrelation <- recordPlot()

# D. Save the three screenshots on a file device
pdf(file = "fig_showCorrelation.pdf", width = 5.3, height = 2.5,
  useDingbats = FALSE)
replayPlot(showCorrelation); dev.off()
# Selected results from Program A.5
> str(set.col); head(set.col, n = 4)
 chr [1:500] "#FF0000FF" "#FF0300FF" "#FF0600FF" "#FF0900FF" ...
[1] "#FF0000FF" "#FF0300FF" "#FF0600FF" "#FF0900FF"
> str(sam); head(sam, n = 3)
 num [1:10000, 1:2] 0.925 0.0674 0.5819 0.24 0.3992 ...
          [,1]      [,2]
[1,] 0.9249584 0.9249584
[2,] 0.0674410 0.0674410
[3,] 0.5818629 0.5818629
Show Box 6 on page 470: human faces as a graph
Herman Chernoff invented Chernoff faces to display multivariate data through the shape of
a human face. The individual parts on a face represent variable values by their shape, size,
placement, and color. The idea is that humans can easily recognize small changes on a face.
By choosing appropriate variables in representing the features of individual parts on a face,
Chernoff faces can generate interesting pictures and facilitate communications in a way that
is different from words, tables, or traditional lines and points.
In this show box, the longley data set is used to create four faces. This is a macroeconomic data set with annual observations for seven economic variables between 1947 and
1962. The variables include gross national product (GNP), GNP deflator, and numbers of
people unemployed and employed. Four years are selected in the show box: 1947, 1952,
1957 and 1962. The faces() function from the aplpack package is used to draw the faces.
The color and Christmas decoration on the faces are determined by the argument value of
face.type = 2. This function also has an argument plot = TRUE. If plot = FALSE, the output from calling faces() can be saved and then plotted. The output has the class “faces”, and a new method named plot.faces() is defined in the package. By using the economic data to shape the faces, the economic status over time is presented graphically, even though it is a little bit different from a typical line graph. Hopefully, you can see from the graph that Americans were happier in the 1960s than in the 1940s.
Program A.6 Faces at Christmas time conditional on economic status
# Load library and data
setwd("C:/aErer"); library(aplpack)
data(longley); str(longley)
longley[as.character(c(1947, 1952, 1957, 1962)), 1:4]

# Some practices
windows(); bringToTop(stay = TRUE)
faces()
faces(face.type = 0)
faces(xy = rbind(1:4, 5:3, 3:5, 5:7), face.type = 2)
faces(xy = longley[c(1, 6, 11, 16), ], face.type = 1)
faces(xy = longley[c(1, 6, 11, 16), ], face.type = 0)

# Display on the screen device
windows(width = 5.3, height = 2.5, pointsize = 9); bringToTop(stay = TRUE)
par(mar = c(0, 0, 0, 0), family = "serif")
aa <- faces(xy = longley[c(1, 6, 11, 16), ], plot = FALSE)
class(aa)  # "faces"
plot.faces(x = aa, face.type = 2, width = 1.1, height = 0.9)
showFace <- recordPlot()

# Save the graph on a file device
pdf(file = "fig_showFace.pdf", width = 5.3, height = 2.5, pointsize = 9)
replayPlot(showFace); dev.off()
# Selected results from Program A.6
> data(longley); str(longley)
'data.frame': 16 obs. of 7 variables:
 $ GNP.deflator: num 83 88.5 88.2 89.5 96.2 ...
 $ GNP         : num 234 259 258 285 329 ...
 $ Unemployed  : num 236 232 368 335 210 ...
 $ Armed.Forces: num 159 146 162 165 310 ...
 $ Population  : num 108 109 110 111 112 ...
 $ Year        : int 1947 1948 1949 1950 1951 ...
 $ Employed    : num 60.3 61.1 60.2 61.2 63.2 ...
> longley[as.character(c(1947, 1952, 1957, 1962)), 1:4]
     GNP.deflator     GNP Unemployed Armed.Forces
1947         83.0 234.289      235.6        159.0
1952         98.1 346.999      193.2        359.4
1957        108.4 442.769      293.6        279.8
1962        116.9 554.894      400.7        282.7
Appendix B Ordered Choice Model and R Loops
This appendix has two objectives. The first objective is to document how to compute
predicted probabilities, marginal effects, and their standard errors for an ordered
probit or logit model. The technical details are similar to those for the binary choice model, as presented in Section 7.2 Statistics for a binary choice model on page 103. The computation for the ordered choice model is more tedious because there are multiple choices instead of two. In addition, for both predicted probabilities and marginal effects, standard errors are more difficult to calculate than their values per se. Standard errors can be computed using a linear approximation approach (i.e., the delta method).
The second objective is to demonstrate how to avoid overuse of looping statements. This
could be demonstrated in Chapter 13 Flow Control Structure on page 269, but it is a little
too long. For the ordered choice model, the ocME() and ocProb() functions are created
and included in the erer library. The implementation is dramatically simplified by using the vectorization techniques available in R and by avoiding overuse of R loops. In general, looping statements such as for are intuitive and straightforward to use, but overusing them can result in inefficiency. Thus, calculating marginal effects from an ordered choice model serves as a good example to illustrate how to have a balanced use of looping statements. In developing these functions, results from ocME() and ocProb() were compared with those from the software LIMDEP, and the assessment was satisfactory.
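To make the contrast between loops and vectorization concrete, the short sketch below (not taken from the erer package; the random matrix is made up for illustration) fills a matrix of normal probabilities twice, once with two nested for loops and once with a single vectorized call to pnorm(). Both routes give identical results, but the vectorized version is shorter and much faster.

# A minimal sketch contrasting nested loops with a vectorized call;
# the random matrix z is an arbitrary illustration, not book data.
z <- matrix(rnorm(200 * 50), nrow = 200, ncol = 50)

# Loop version: fill the probability matrix element by element
p1 <- matrix(NA, nrow = nrow(z), ncol = ncol(z))
for (i in 1:nrow(z)) {
  for (j in 1:ncol(z)) {
    p1[i, j] <- pnorm(z[i, j])
  }
}

# Vectorized version: one call over the whole matrix
p2 <- pnorm(z)
all.equal(p1, p2)  # TRUE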
B.1 Some math fundamentals
A good understanding of the following mathematics is necessary to compute predicted probabilities, marginal effects, and their standard errors for the ordered choice model. Thus, the key results are included here to facilitate the remaining presentation of this appendix.
Delta method
The delta method is an approximate method for computing the variance of a variable, given the variance estimates of its parameters (Greene, 2011). For example, assume
\[
g = f(b_1, b_2) = a b_1 b_2 + 3 b_2
\tag{B.1}
\]
where $g$ is a function of two variables ($b_1$ and $b_2$) and one constant ($a$). This allows the value of $g$ to be calculated when the values of $b_1$, $b_2$, and $a$ are known.
Furthermore, the variance of $g$ can be computed as:
\[
\begin{aligned}
\operatorname{var}(g)
&= \begin{bmatrix} \dfrac{\partial g}{\partial b_1} & \dfrac{\partial g}{\partial b_2} \end{bmatrix}
   \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix}
   \begin{bmatrix} \dfrac{\partial g}{\partial b_1} & \dfrac{\partial g}{\partial b_2} \end{bmatrix}^T \\
&= \begin{bmatrix} a b_2 & a b_1 + 3 \end{bmatrix}
   \begin{bmatrix} 1 & 5 \\ 5 & 2 \end{bmatrix}
   \begin{bmatrix} a b_2 \\ a b_1 + 3 \end{bmatrix} \\
&= 2 a^2 b_1^2 + a^2 b_2^2 + 10 a^2 b_1 b_2 + 12 a b_1 + 30 a b_2 + 18
\end{aligned}
\tag{B.2}
\]
where $\sigma_{11}$ is the variance of $b_1$, $\sigma_{12}$ is the covariance between the two variables, and $T$ denotes the transpose of a matrix. For illustration, some simple numerical values are assumed for the variance-covariance matrix. In sum, given a functional relation as defined in Equation (B.1), the value and covariance estimates of the variables on the right side can be used to compute the variance of the left-side variable.
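As a quick numerical check of Equation (B.2), the sketch below evaluates the delta method in matrix form and compares it with the expanded expression. The values chosen for a, b1, and b2 are arbitrary illustrations, not estimates from any model in the book; the covariance matrix is the one assumed in Equation (B.2).

# Numerical check of the delta method in Equation (B.2);
# a, b1, b2 are made-up values for illustration only.
a <- 2; b1 <- 1.5; b2 <- 0.7
vc   <- matrix(c(1, 5, 5, 2), nrow = 2)    # assumed covariance of (b1, b2)
grad <- c(a * b2, a * b1 + 3)              # gradient of g = a*b1*b2 + 3*b2
var.matrix   <- as.numeric(t(grad) %*% vc %*% grad)
var.expanded <- 2 * a^2 * b1^2 + a^2 * b2^2 + 10 * a^2 * b1 * b2 +
  12 * a * b1 + 30 * a * b2 + 18           # expanded expression in (B.2)
c(var.matrix, var.expanded)                # the two values agree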
If the constant $a$ is treated as a parameter in the computation, it does not affect the result. This can be shown as follows:
\[
\begin{aligned}
\operatorname{var}(g)
&= \begin{bmatrix} \dfrac{\partial g}{\partial b_1} & \dfrac{\partial g}{\partial a} & \dfrac{\partial g}{\partial b_2} \end{bmatrix}
   \begin{bmatrix} \sigma_{11} & \sigma_{1a} & \sigma_{12} \\ \sigma_{a1} & \sigma_{aa} & \sigma_{a2} \\ \sigma_{21} & \sigma_{2a} & \sigma_{22} \end{bmatrix}
   \begin{bmatrix} \dfrac{\partial g}{\partial b_1} & \dfrac{\partial g}{\partial a} & \dfrac{\partial g}{\partial b_2} \end{bmatrix}^T \\
&= \begin{bmatrix} a b_2 & b_1 b_2 & a b_1 + 3 \end{bmatrix}
   \begin{bmatrix} 1 & 0 & 5 \\ 0 & 0 & 0 \\ 5 & 0 & 2 \end{bmatrix}
   \begin{bmatrix} a b_2 \\ b_1 b_2 \\ a b_1 + 3 \end{bmatrix} \\
&= 2 a^2 b_1^2 + a^2 b_2^2 + 10 a^2 b_1 b_2 + 12 a b_1 + 30 a b_2 + 18
\end{aligned}
\tag{B.3}
\]
where the correlation of $a$ with itself and with the two variables is zero. Thus, the variance of $g$ remains the same even if a constant is included as a parameter.
The delta method can also be applied to a vector of variables, e.g., $g_1$ and $g_2$. This requires a good understanding of matrix operations and, additionally, matrix calculus.
Matrix calculus
Matrix calculus is a large subject, as detailed in Abadir and Magnus (2005). Two useful rules
are stated here, as they are particularly related to computing standard errors of marginal
effects in the ordered choice model.
Taking the derivative of one vector with respect to another vector results in a matrix, not just a vector. Specifically, if $y$ is an $m \times 1$ vector, $x$ is an $n \times 1$ vector, and $y = f(x)$, then
\[
\frac{\partial y}{\partial x} =
\begin{bmatrix}
\dfrac{\partial y_1(x)}{\partial x_1} & \cdots & \dfrac{\partial y_1(x)}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y_m(x)}{\partial x_1} & \cdots & \dfrac{\partial y_m(x)}{\partial x_n}
\end{bmatrix}
\tag{B.4}
\]
where $y_1(x)$ is the first element in the vector, and the others are similarly defined. The result is an $m \times n$ matrix. This is also called the Jacobian matrix of $y = f(x)$.
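For readers who want to see Equation (B.4) in action, the sketch below computes the Jacobian of a small vector-valued function numerically and compares it with the analytic answer. It assumes the contributed numDeriv package is installed; the function f is a made-up example, not one used in the book.

# Numerical Jacobian versus the analytic one for a small example;
# assumes install.packages("numDeriv") has been run.
library(numDeriv)
f  <- function(x) c(x[1]^2 + x[2], 3 * x[1] * x[2])   # y = f(x), m = 2, n = 2
x0 <- c(2, 5)
jacobian(func = f, x = x0)                            # numerical 2 x 2 matrix
rbind(c(2 * x0[1], 1), c(3 * x0[2], 3 * x0[1]))       # analytic Jacobian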
If $y = A(r)\,x(r)$, where $y$ is $m \times 1$, $A(r)$ is $m \times n$, $x(r)$ is $n \times 1$, and $r$ is $k \times 1$, then
\[
\frac{\partial y}{\partial r} = (x^T \otimes I_m)\,\frac{\partial A(r)}{\partial r} + A(r)\,\frac{\partial x}{\partial r}
\tag{B.5}
\]
where $\otimes$ is the Kronecker product, and $I_m$ is an identity matrix with the dimension of $m \times m$.
Derivative for a probability density function
For a probit model, the underlying distribution is a normal distribution (Greene, 2011, page 1024). Note that a normal distribution has no closed form for its cumulative distribution function $F$. Its probability density function $f$ can be expressed as:
\[
f(z \mid \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(z - \mu)^2}{2\sigma^2}}
\tag{B.6}
\]
where $z$ is a random variable, $\mu$ is the mean, and $\sigma$ is the standard deviation of the distribution. When $\mu = 0$ and $\sigma = 1$, the distribution is called the standard normal distribution. The probability density function and its derivative can be expressed as
\[
f(z \mid 0, 1) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}
\tag{B.7}
\]
\[
f' = \frac{\partial f}{\partial z} = f \cdot (-2z/2) = -z f
\tag{B.8}
\]
where the prime symbol denotes a derivative.
For the logistic distribution associated with a logistic model, the cumulative distribution function, probability density function, and its derivative can be expressed as:
\[
F(z) = \frac{1}{1 + e^{-z}}
\tag{B.9}
\]
\[
f(z) = F' = -(1 + e^{-z})^{-2}\, e^{-z}\, (-1) = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = F(1 - F)
\tag{B.10}
\]
\[
f'(z) = \frac{\partial f}{\partial z} = F'(1 - F) + F(-F') = F'(1 - 2F) = f(1 - 2F)
\tag{B.11}
\]
where $f'$ is the derivative of the probability density function.
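The identities in Equations (B.8), (B.10), and (B.11) can be verified numerically with base R functions, as in the sketch below; the three test points in z are arbitrary choices for illustration.

# Numerical check of Equations (B.8), (B.10), and (B.11)
z <- c(-1.5, 0, 0.8)

# Normal distribution: f'(z) = -z * f(z)
cbind(analytic = -z * dnorm(z),
      numeric  = (dnorm(z + 1e-6) - dnorm(z - 1e-6)) / 2e-6)

# Logistic distribution: f = F(1 - F) and f'(z) = f(1 - 2F)
Fz <- plogis(z); fz <- dlogis(z)
all.equal(fz, Fz * (1 - Fz))               # TRUE
cbind(analytic = fz * (1 - 2 * Fz),
      numeric  = (dlogis(z + 1e-6) - dlogis(z - 1e-6)) / 2e-6)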
B.2 Predicted probability for ordered choice model
Predicted probabilities are usually calculated for a continuous independent variable to reveal
its impact. First, let us compute the value of predicted probability. Assume there are four
ordered choices, i.e., J = 4. The predicted probability by category can be expressed as
follows:
\[
\begin{aligned}
p_1 &= F_1 = \Pr(y = 1 \mid X) = F(u_2 - X\beta) \\
p_2 &= F_2 = \Pr(y = 2 \mid X) = F(u_3 - X\beta) - F(u_2 - X\beta) \\
p_3 &= F_3 = \Pr(y = 3 \mid X) = F(u_4 - X\beta) - F(u_3 - X\beta) \\
p_4 &= F_4 = \Pr(y = 4 \mid X) = 1 - F(u_4 - X\beta)
\end{aligned}
\tag{B.12}
\]
where $u_2$, $u_3$, and $u_4$ are estimated thresholds. With the above arrangement, only $J - 1$ threshold values are estimated for an ordered choice model with $J$ choices. The sum of all probabilities at a given value is 1, i.e., $\sum_{i=1}^{4} p_i = 1$.
To display the effect of a continuous independent variable on predicted probability, create a new matrix $\bar{X}^{s=seq}$ for all the independent variables. This matrix is first defined as an $N \times K$ matrix, where $N$ is an arbitrary number and $K$ is the number of independent variables. Dimension $N$ is the number of predicted probabilities of interest, and it can be different from the actual number of observations used for regression. In this new matrix, each row
contains the same mean values for the $K$ independent variables. Then the column value of a selected continuous variable is substituted by a sequence. In R, this sequence can be created with the command seq(from = min(.), to = max(.), length.out = N). In general, the range of this sequence is the same as the actual range of the selected variable.
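A minimal sketch of this construction is given below. The data frame dat, the variable names, and N = 100 are hypothetical placeholders, not objects from the book's programs.

# Build a matrix of mean values and vary one continuous variable;
# dat, the variable names, and N are hypothetical.
N <- 100
x.mean <- colMeans(dat[, c("price", "income", "age")])   # K column means
X.new  <- matrix(x.mean, nrow = N, ncol = length(x.mean), byrow = TRUE,
  dimnames = list(NULL, names(x.mean)))
X.new[, "income"] <- seq(from = min(dat$income), to = max(dat$income),
  length.out = N)                                        # substitute a sequence
head(X.new)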
In writing a function to calculate predicted probabilities, the number of categories $J$ and the number of estimated thresholds are unknown in advance. Thus, a looping statement needs to be employed to compute predicted probabilities as expressed in Equation (B.12). This is feasible as only one loop is needed. However, this becomes more cumbersome in calculating standard errors later, as another loop is needed. The dimension of $\bar{X}^{s=seq}\hat{\beta}$ is $(N \times K) \times (K \times 1) = N \times 1$, so the predicted probabilities for each choice form a vector. In sum, with the estimated threshold and parameter values, Equation (B.12) can be used to calculate predicted probabilities by category.
The above approach can be greatly simplified by vectorizing the operation. This starts
with a small rearrangement of the above equation as follows:
$$\begin{aligned}
p_1 &= F(u_2 - X\beta) - F(u_1 - X\beta) = F(u_2 - X\beta) - 0 \\
p_2 &= F(u_3 - X\beta) - F(u_2 - X\beta) \\
p_3 &= F(u_4 - X\beta) - F(u_3 - X\beta) \\
p_4 &= F(u_5 - X\beta) - F(u_4 - X\beta) = 1 - F(u_4 - X\beta)
\end{aligned} \tag{B.13}$$
where $u_1$ is $-\infty$ and $u_5$ is $\infty$. For the purpose of actual computation and programming, a large constant is sufficient for generating a probability value of 0 or 1, e.g., $u_1 = -10^6$ and $u_5 = 10^6$. With these two constants added to the estimated thresholds, the above equation set can be reduced to a single equation in matrix notation. R is powerful at vectorized operations, so the looping statement is avoided and the computation becomes more efficient.
In combination, the predicted probabilities can be calculated as:
$$P = F(U_b - Z) - F(U_a - Z) \tag{B.14}$$
where P has the dimension of N × J and is a combination of the probability vectors for all
choices.
In programming, the estimated values of thresholds and parameters are used to construct
several matrices. To vectorize the computation, three matrices need to be carefully arranged
so they are conformable. Specifically, $U_b$ is an $N \times J$ matrix constructed without the first threshold, and each row has the same value, such as $(\hat{u}_2, \hat{u}_3, \hat{u}_4, \hat{u}_5)$ for four choices. $U_a$ is a matrix constructed without the last threshold, and each row has the same value, such as $(\hat{u}_1, \hat{u}_2, \hat{u}_3, \hat{u}_4)$ for four choices. $Z$ is also an $N \times J$ matrix, with each column repeatedly filled by the same value of $\bar{X}_{s=seq}\hat{\beta}$. In sum, Equation (B.14) vectorizes the operation in Equation (B.12), and no matter how many categories an ordered choice model contains, the computation can be accomplished with a single matrix operation.
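As a minimal numeric sketch of Equation (B.14) for a probit model with J = 4 choices, where the threshold values and the linear index are made up rather than estimated:

zeta <- c(-0.5, 0.4, 1.3)                  # J - 1 "estimated" thresholds (made up)
z <- c(-10^6, zeta, 10^6)                  # add the two extreme thresholds
N <- 5; J <- 4
xb <- seq(-1, 1, length.out = N)           # stands in for Xbar(s=seq) %*% beta-hat
Z  <- matrix(xb, nrow = N, ncol = J)                           # J copies of xb
Ub <- matrix(z[2:(J + 1)], nrow = N, ncol = J, byrow = TRUE)   # drop first threshold
Ua <- matrix(z[1:J],       nrow = N, ncol = J, byrow = TRUE)   # drop last threshold
P  <- pnorm(Ub - Z) - pnorm(Ua - Z)        # N x J matrix of predicted probabilities
rowSums(P)                                 # each row sums to 1

Steps 3 and 4 of Program B.1 below use exactly this arrangement, with the pieces extracted from a fitted polr() model.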
To compute standard errors of predicted probability, take the third category as an example. With the estimated values substituted into the equation, this can be expressed as:
$$F_3 = \Pr(y = 3 \mid \bar{X}_{s=seq}) = F(\hat{u}_4 - \bar{X}_{s=seq}\hat{\beta}) - F(\hat{u}_3 - \bar{X}_{s=seq}\hat{\beta}) \tag{B.15}$$
where $\bar{X}_{s=seq}$ is the new matrix defined for a continuous variable.
The variance of the predicted probability can be computed with the delta method. $F_3$ is a function of three estimated parameters: $\hat{\beta}$, $\hat{u}_3$, and $\hat{u}_4$. Thus, the variance for the predicted probability and the three derivatives can be expressed as
$$var(F_3) = \left[\frac{\partial F_3}{\partial \hat{\beta}} \;\; \frac{\partial F_3}{\partial \hat{u}_3} \;\; \frac{\partial F_3}{\partial \hat{u}_4}\right] cov(\hat{\beta}, \hat{u}_3, \hat{u}_4) \left[\frac{\partial F_3}{\partial \hat{\beta}} \;\; \frac{\partial F_3}{\partial \hat{u}_3} \;\; \frac{\partial F_3}{\partial \hat{u}_4}\right]^T \tag{B.16}$$
$$\frac{\partial F_3}{\partial \hat{\beta}} = \left[f(\hat{u}_3 - \bar{X}_{s=seq}\hat{\beta}) - f(\hat{u}_4 - \bar{X}_{s=seq}\hat{\beta})\right] \bar{X}_{s=seq} = \left[f_3 - f_4\right] \bar{X}_{s=seq} \tag{B.17}$$
$$\frac{\partial F_3}{\partial \hat{u}_3} = -f(\hat{u}_3 - \bar{X}_{s=seq}\hat{\beta}) = -f_3 \tag{B.18}$$
$$\frac{\partial F_3}{\partial \hat{u}_4} = f(\hat{u}_4 - \bar{X}_{s=seq}\hat{\beta}) = f_4 \tag{B.19}$$
where $f_3$ and $f_4$ are diagonal matrices, and the covariance matrix $cov(\hat{\beta}, \hat{u}_3, \hat{u}_4)$ is generated from the model fit. Note that the covariance matrix of the coefficients and the relevant thresholds varies by choice. The variance vectors for all the choices could be combined together with some matrix arrangement, but the cost may be bigger than the benefit. Thus, they are computed with a looping statement in the ocProb() function, as shown in Program B.1.
Program B.1 Calculating predicted probabilities for an ordered choice model
# A new function: probabilities for ordered choice
ocProb <- function(w, nam.c, n = 100, digits = 3)
{
  # 1. Check inputs
  if (!inherits(w, "polr")) {
    stop("Need an ordered choice model from 'polr()'.\n")
  }
  if (w$method != "probit" & w$method != "logistic") {
    stop("Need a probit or logit model.\n")
  }
  if (missing(nam.c)) stop("Need a continuous variable name.\n")

  # 2. Abstract data out
  lev <- w$lev; J <- length(lev)
  x.name <- attr(x = w$terms, which = "term.labels")
  x2 <- w$model[, x.name]
  if (identical(sort(unique(x2[, nam.c])), c(0, 1)) ||
      inherits(x2[, nam.c], what = "factor")) {
    stop("nam.c must be a continuous variable.")
  }

  ww <- paste("~ 1", paste("+", x.name, collapse = " "), collapse = " ")
  x <- model.matrix(as.formula(ww), data = x2)[, -1]
  b.est <- as.matrix(coef(w)); K <- nrow(b.est)
  z <- c(-10^6, w$zeta, 10^6)  # expand it with two extreme thresholds
  z2 <- matrix(data = z, nrow = n, ncol = length(z), byrow = TRUE)

  pfun <- switch(w$method, probit = pnorm, logistic = plogis)
  dfun <- switch(w$method, probit = dnorm, logistic = dlogis)
  V2 <- vcov(w)  # increase covariance matrix by 2 fixed thresholds
  V3 <- rbind(cbind(V2, 0, 0), 0, 0)
  ind <- c(1:K, nrow(V3) - 1, (K+1):(K+J-1), nrow(V3))
  V4 <- V3[ind, ]; V5 <- V4[, ind]

  # 3. Construct x matrix and compute xb
  mm <- matrix(data = colMeans(x), ncol = ncol(x), nrow = n, byrow = TRUE)
  colnames(mm) <- colnames(x)
  ran <- range(x[, nam.c])
  mm[, nam.c] <- seq(from = ran[1], to = ran[2], length.out = n)
  xb <- mm %*% b.est
  xb2 <- matrix(data = xb, nrow = n, ncol = J, byrow = FALSE)  # J copies

  # 4. Compute probability by category; vectorized on z2 and xb2
  pp <- pfun(z2[, 2:(J+1)] - xb2) - pfun(z2[, 1:J] - xb2)
  trend <- cbind(mm[, nam.c], pp)
  colnames(trend) <- c(nam.c, paste("p", lev, sep = "."))

  # 5. Compute the standard errors
  se <- matrix(data = 0, nrow = n, ncol = J)
  for (i in 1:J) {
    z1 <- z[i] - xb; z2 <- z[i+1] - xb
    d1 <- diag(c(dfun(z1) - dfun(z2)), n, n) %*% mm
    q1 <- -dfun(z1); q2 <- dfun(z2)
    dr <- cbind(d1, q1, q2)
    V <- V5[c(1:K, K+i, K+i+1), c(1:K, K+i, K+i+1)]
    va <- dr %*% V %*% t(dr)
    se[, i] <- sqrt(diag(va))
  }
  colnames(se) <- paste("Pred_SE", lev, sep = ".")

  # 6. Report results
  t.value <- pp / se
  p.value <- 2 * (1 - pt(abs(t.value), n - K))
  out <- list()
  for (i in 1:J) {
    out[[i]] <- round(cbind(predicted_prob = pp[, i], error = se[, i],
      t.value = t.value[, i], p.value = p.value[, i]), digits)
  }
  out[[J+1]] <- round(x = trend, digits = digits)
  names(out) <- paste("predicted_prob", c(lev, "all"), sep = ".")
  result <- listn(w, nam.c, method = w$method, mean.x = colMeans(x), out, lev)
  class(result) <- "ocProb"; return(result)
}

# Example: include "Freq" to have a continuous variable for demo
library(erer); library(MASS); data(housing); str(housing); tail(housing)
reg2 <- polr(formula = Sat ~ Infl + Type + Cont + Freq, data = housing,
  Hess = TRUE, method = "probit")
p2 <- ocProb(w = reg2, nam.c = 'Freq', n = 300); p2
plot(p2)
# Selected results from Program B.1
> library(erer); library(MASS); data(housing); str(housing)
'data.frame': 72 obs. of 5 variables:
 $ Sat : Ord.factor w/ 3 levels "Low"<"Medium"<..: 1 2 3 1 ...
 $ Infl: Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 2 2 ...
 $ Type: Factor w/ 4 levels "Tower","Apartment",..: 1 1 1 ...
 $ Cont: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ...
 $ Freq: int 21 21 28 34 22 36 10 11 36 61 ...

> tail(housing)
      Sat   Infl    Type Cont Freq
67    Low Medium Terrace High   31
68 Medium Medium Terrace High   21
69   High Medium Terrace High   13
70    Low   High Terrace High    5
71 Medium   High Terrace High    6
72   High   High Terrace High   13

> p2 <- ocProb(w = reg2, nam.c = 'Freq', n = 300); p2
         Freq p.Low p.Medium p.High
[295,] 84.612 0.087    0.226  0.687
[296,] 84.890 0.086    0.225  0.688
[297,] 85.167 0.086    0.225  0.690
[298,] 85.445 0.085    0.224  0.691
[299,] 85.722 0.084    0.223  0.693
[300,] 86.000 0.084    0.222  0.694
B.3 Marginal effect for ordered choice model
Marginal effects for ordered choice models can be calculated in a way similar to that for binary choice models. The formulas for dummy variables and continuous variables are different, with the latter case being much more challenging.
Marginal effect for dummy variables
Marginal effects for dummy independent variables are closely related to the concept of
predicted probability. It is more efficient to compute them for each dummy separately. For
illustration, assume there are four ordered choices and take the third category as an example:
$$\begin{aligned}
F_{3,1} &= \Pr(y = 3 \mid \bar{X}_{d=1}) = F(\hat{u}_4 - \bar{X}_{d=1}\hat{\beta}) - F(\hat{u}_3 - \bar{X}_{d=1}\hat{\beta}) \\
F_{3,0} &= \Pr(y = 3 \mid \bar{X}_{d=0}) = F(\hat{u}_4 - \bar{X}_{d=0}\hat{\beta}) - F(\hat{u}_3 - \bar{X}_{d=0}\hat{\beta})
\end{aligned} \tag{B.20}$$
where $d$ is the selected dummy variable. In $\bar{X}_{d=1}$, the selected dummy variable takes the value of 1 only, and the other independent variables take their mean values. $\bar{X}_{d=0}$ is similarly defined with the dummy variable taking the value of 0.
The difference between the two probability values can be computed as follows:
$$m_{3d} = \Delta \hat{F} = \Pr(y = 3 \mid \bar{X}_{d=1}) - \Pr(y = 3 \mid \bar{X}_{d=0}) \tag{B.21}$$
which is also referred to as the marginal effect of a dummy variable.
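A tiny numeric illustration of Equation (B.21) for a probit model; the threshold values and the two linear indexes below are made up, not estimated:

u3 <- 0.2; u4 <- 1.1                         # two adjacent thresholds (made up)
xb1 <- 0.65                                  # Xbar(d=1) %*% beta-hat (made up)
xb0 <- 0.40                                  # Xbar(d=0) %*% beta-hat (made up)
F31 <- pnorm(u4 - xb1) - pnorm(u3 - xb1)     # Pr(y = 3 | d = 1)
F30 <- pnorm(u4 - xb0) - pnorm(u3 - xb0)     # Pr(y = 3 | d = 0)
m3d <- F31 - F30                             # marginal effect of the dummy variable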
The delta method can be used to compute standard errors. Note that $m_{3d}$ is a function of two groups of estimated parameters, i.e., $\hat{\beta}$ and $\hat{u}$. Thus, the variance of the marginal effect for a dummy variable can be computed as:
$$var(m_{3d}) = \left[\frac{\partial m_{3d}}{\partial \hat{\beta}} \;\; \frac{\partial m_{3d}}{\partial \hat{u}_3} \;\; \frac{\partial m_{3d}}{\partial \hat{u}_4}\right] cov(\hat{\beta}, \hat{u}_3, \hat{u}_4) \left[\frac{\partial m_{3d}}{\partial \hat{\beta}} \;\; \frac{\partial m_{3d}}{\partial \hat{u}_3} \;\; \frac{\partial m_{3d}}{\partial \hat{u}_4}\right]^T \tag{B.22}$$
$$\begin{aligned}
\frac{\partial m_{3d}}{\partial \hat{\beta}} &= \left[f(\hat{u}_3 - \bar{X}_{d=1}\hat{\beta}) - f(\hat{u}_4 - \bar{X}_{d=1}\hat{\beta})\right]\bar{X}_{d=1} - \left[f(\hat{u}_3 - \bar{X}_{d=0}\hat{\beta}) - f(\hat{u}_4 - \bar{X}_{d=0}\hat{\beta})\right]\bar{X}_{d=0} \\
&= \left[f_{3,1} - f_{4,1}\right]\bar{X}_{d=1} - \left[f_{3,0} - f_{4,0}\right]\bar{X}_{d=0}
\end{aligned} \tag{B.23}$$
$$\frac{\partial m_{3d}}{\partial \hat{u}_3} = -f(\hat{u}_3 - \bar{X}_{d=1}\hat{\beta}) + f(\hat{u}_3 - \bar{X}_{d=0}\hat{\beta}) = f_{3,0} - f_{3,1} \tag{B.24}$$
$$\frac{\partial m_{3d}}{\partial \hat{u}_4} = f(\hat{u}_4 - \bar{X}_{d=1}\hat{\beta}) - f(\hat{u}_4 - \bar{X}_{d=0}\hat{\beta}) = f_{4,1} - f_{4,0} \tag{B.25}$$
where the two relevant threshold values are specific to choices. To calculate the marginal
effects and standard errors for multiple dummy variables, a looping statement can be used.
Marginal effect for continuous variables
First, let us compute the value of marginal effects. For instance, taking a derivative on the
probability for the first choice can generate the following marginal effect:
$$\begin{aligned}
m_1 = \frac{\partial p_1}{\partial X} &= \frac{\partial F(\hat{u}_2 - X\hat{\beta})}{\partial(\hat{u}_2 - X\hat{\beta})}\cdot\frac{\partial(\hat{u}_2 - X\hat{\beta})}{\partial X} - \frac{\partial F(\hat{u}_1 - X\hat{\beta})}{\partial(\hat{u}_1 - X\hat{\beta})}\cdot\frac{\partial(\hat{u}_1 - X\hat{\beta})}{\partial X} \\
&= \hat{\beta}\left[f(\hat{u}_1 - X\hat{\beta}) - f(\hat{u}_2 - X\hat{\beta})\right].
\end{aligned} \tag{B.26}$$
Putting it all together, the marginal effects for an ordered choice model with four choices are:
$$\begin{aligned}
m_1 &= \hat{\beta}\left[f(\hat{u}_1 - X\hat{\beta}) - f(\hat{u}_2 - X\hat{\beta})\right] \\
m_2 &= \hat{\beta}\left[f(\hat{u}_2 - X\hat{\beta}) - f(\hat{u}_3 - X\hat{\beta})\right] \\
m_3 &= \hat{\beta}\left[f(\hat{u}_3 - X\hat{\beta}) - f(\hat{u}_4 - X\hat{\beta})\right] \\
m_4 &= \hat{\beta}\left[f(\hat{u}_4 - X\hat{\beta}) - f(\hat{u}_5 - X\hat{\beta})\right]
\end{aligned} \tag{B.27}$$
where by definition the sum of all marginal effects at a given point is zero, i.e., $\sum_{i=1}^{4} m_i = 0$.
In programming, the computation of the marginal effects can be vectorized in R as follows:
$$M = \hat{\beta}\left[f(U_a - Z) - f(U_b - Z)\right] \tag{B.28}$$
where $M$ is the combination of marginal effects for all the choices. The matrices $U_a$, $U_b$, and $Z$ are the same as defined in Equation (B.14). If you are good at matrix calculus, then the relation between Equations (B.14) and (B.28) should be apparent.
Standard errors for marginal effects can be computed by choice separately using the delta method. The value of each marginal effect is affected by all the parameter estimates ($\hat{\beta}$) and two threshold values (e.g., $\hat{u}_3$ and $\hat{u}_4$ for $m_3$). For the first and last choices, taking derivatives with regard to the two fixed threshold values (i.e., $\hat{u}_1$ and $\hat{u}_5$ with four choices) will not affect the standard errors because the relevant variance is zero. Specifically, taking $m_3$ as an example, the standard error is
$$var(m_3) = \left[\frac{\partial m_3}{\partial \hat{\beta}} \;\; \frac{\partial m_3}{\partial \hat{u}_3} \;\; \frac{\partial m_3}{\partial \hat{u}_4}\right] cov(\hat{\beta}, \hat{u}_3, \hat{u}_4) \left[\frac{\partial m_3}{\partial \hat{\beta}} \;\; \frac{\partial m_3}{\partial \hat{u}_3} \;\; \frac{\partial m_3}{\partial \hat{u}_4}\right]^T \tag{B.29}$$
The first derivative is with respect to the vector $\hat{\beta}$. It is related to the product rule of matrix calculus and can be expressed as
$$\begin{aligned}
\frac{\partial m_3}{\partial \hat{\beta}} &= \left[(f_3 \otimes I_K)\frac{\partial \hat{\beta}}{\partial \hat{\beta}} + \hat{\beta} f_3'(-X')\right] - \left[(f_4 \otimes I_K)\frac{\partial \hat{\beta}}{\partial \hat{\beta}} + \hat{\beta} f_4'(-X')\right] \\
&= \left[f_3 I_K + f_3'(-\hat{\beta}X')\right] - \left[f_4 I_K + f_4'(-\hat{\beta}X')\right] \\
&= \begin{cases} \left[f_3 I_K + (-z_3 f_3)(-\hat{\beta}X')\right] - \left[f_4 I_K + (-z_4 f_4)(-\hat{\beta}X')\right] & \text{(probit)} \\ \left[f_3 I_K + f_3(1 - 2F_3)(-\hat{\beta}X')\right] - \left[f_4 I_K + f_4(1 - 2F_4)(-\hat{\beta}X')\right] & \text{(logit)} \end{cases} \\
&= f_3\left[I_K + S_3(-\hat{\beta}X')\right] - f_4\left[I_K + S_4(-\hat{\beta}X')\right]
\end{aligned} \tag{B.30}$$
where $S_3 = -z_3$ and $S_4 = -z_4$ for a probit model, and $S_3 = 1 - 2F_3$ and $S_4 = 1 - 2F_4$ for a logit model. In addition, $z_3 = \hat{u}_3 - X\hat{\beta}$ and $z_4 = \hat{u}_4 - X\hat{\beta}$. Note that Equations (B.8) and (B.11) are used in the transformation.
The second and third derivatives are with respect to a single estimate:
$$\begin{aligned}
\frac{\partial m_3}{\partial \hat{u}_3} &= \hat{\beta}\,\frac{\partial f_3}{\partial z_3}\cdot\frac{\partial(\hat{u}_3 - X\hat{\beta})}{\partial \hat{u}_3} = \hat{\beta} f_3' = \begin{cases} \hat{\beta}(-z_3 f_3) & \text{(probit)} \\ \hat{\beta} f_3(1 - 2F_3) & \text{(logit)} \end{cases} = \hat{\beta} f_3 S_3 \\
\frac{\partial m_3}{\partial \hat{u}_4} &= -\hat{\beta} f_4 S_4
\end{aligned} \tag{B.31}$$
where $S_3$ and $S_4$ are the same as defined above.
In sum, the value of marginal effect can be vectorized and computed together for all
variables and choices with a single matrix operation. The variance is best handled by choice
individually with a looping statement. All of these are implemented through the ocME()
function, as shown in Program B.2.
Program B.2 Calculating marginal effects for an ordered choice model
# A new function: marginal effect for ordered choice
ocME <- function(w, rev.dum = TRUE, digits = 3)
{
  # 1. Check inputs; similar to ocProb()

  # 2. Get data out: x may contain factors so use model.matrix
  lev <- w$lev; J <- length(lev)
  x.name <- attr(x = w$terms, which = "term.labels")
  x2 <- w$model[, x.name]
  ww <- paste("~ 1", paste("+", x.name, collapse = " "), collapse = " ")
  x <- model.matrix(as.formula(ww), data = x2)[, -1]  # factor x changed
  x.bar <- as.matrix(colMeans(x))
  b.est <- as.matrix(coef(w)); K <- nrow(b.est)
  xb <- t(x.bar) %*% b.est; z <- c(-10^6, w$zeta, 10^6)

  pfun <- switch(w$method, probit = pnorm, logistic = plogis)
  dfun <- switch(w$method, probit = dnorm, logistic = dlogis)
  V2 <- vcov(w)  # increase covariance matrix by 2 fixed thresholds
  V3 <- rbind(cbind(V2, 0, 0), 0, 0)
  ind <- c(1:K, nrow(V3) - 1, (K+1):(K+J-1), nrow(V3))
  V4 <- V3[ind, ]; V5 <- V4[, ind]

  # 3. Calculate marginal effects (ME)
  # 3.1 ME value
  f.xb <- dfun(z[1:J] - xb) - dfun(z[2:(J+1)] - xb)
  me <- b.est %*% matrix(data = f.xb, nrow = 1)
  colnames(me) <- paste("effect", lev, sep = ".")

  # 3.2 ME error
  se <- matrix(0, nrow = K, ncol = J)
  for (j in 1:J) {
    u1 <- c(z[j] - xb); u2 <- c(z[j+1] - xb)
    if (w$method == "probit") {
      s1 <- -u1
      s2 <- -u2
    } else {
      s1 <- 1 - 2 * pfun(u1)
      s2 <- 1 - 2 * pfun(u2)
    }
    d1 <-      dfun(u1) * (diag(1, K, K) - s1 * (b.est %*% t(x.bar)))
    d2 <- -1 * dfun(u2) * (diag(1, K, K) - s2 * (b.est %*% t(x.bar)))
    q1 <-      dfun(u1) * s1 * b.est
    q2 <- -1 * dfun(u2) * s2 * b.est
    dr <- cbind(d1 + d2, q1, q2)
    V <- V5[c(1:K, K+j, K+j+1), c(1:K, K+j, K+j+1)]
    cova <- dr %*% V %*% t(dr)
    se[, j] <- sqrt(diag(cova))
  }
  colnames(se) <- paste("SE", lev, sep = ".")
  rownames(se) <- colnames(x)

  # 4. Revise ME and error for dummy variable
  if (rev.dum) {
    for (k in 1:K) {
      if (identical(sort(unique(x[, k])), c(0, 1))) {
        for (j in 1:J) {
          x.d1 <- x.bar; x.d1[k, 1] <- 1
          x.d0 <- x.bar; x.d0[k, 1] <- 0
          ua1 <- z[j] - t(x.d1) %*% b.est; ub1 <- z[j+1] - t(x.d1) %*% b.est
          ua0 <- z[j] - t(x.d0) %*% b.est; ub0 <- z[j+1] - t(x.d0) %*% b.est
          me[k, j] <- pfun(ub1) - pfun(ua1) - (pfun(ub0) - pfun(ua0))
          d1 <- (dfun(ua1) - dfun(ub1)) %*% t(x.d1) -
                (dfun(ua0) - dfun(ub0)) %*% t(x.d0)
          q1 <- -dfun(ua1) + dfun(ua0); q2 <- dfun(ub1) - dfun(ub0)
          dr <- cbind(d1, q1, q2)
          V <- V5[c(1:K, K+j, K+j+1), c(1:K, K+j, K+j+1)]
          se[k, j] <- sqrt(c(dr %*% V %*% t(dr)))
        }
      }
    }
  }

  # 5. Output
  t.value <- me / se
  p.value <- 2 * (1 - pt(abs(t.value), w$df.residual))
  out <- list()
  for (j in 1:J) {
    out[[j]] <- round(cbind(effect = me[, j], error = se[, j],
      t.value = t.value[, j], p.value = p.value[, j]), digits)
  }
  out[[J+1]] <- round(me, digits)
  names(out) <- paste("ME", c(lev, "all"), sep = ".")
  result <- listn(w, out)
  class(result) <- "ocME"
  return(result)
}

# Example: The specification is from MASS.
library(erer); library(MASS); data(housing); tail(housing)
reg3 <- polr(formula = Sat ~ Infl + Type + Cont, data = housing,
  weights = Freq, Hess = TRUE, method = "probit")
m3 <- ocME(w = reg3); m3

# Selected results from Program B.2
> m3 <- ocME(w = reg3); m3
              effect.Low effect.Medium effect.High
InflMedium        -0.119        -0.016       0.135
InflHigh          -0.255        -0.047       0.303
TypeApartment      0.128         0.003      -0.131
TypeAtrium         0.079         0.004      -0.083
TypeTerrace        0.248        -0.008      -0.240
ContHigh          -0.079        -0.007       0.086
Appendix C  Data Analysis Project
Great efforts have been put into designing a data analysis project. While many exercises are included at the end of individual chapters in the book, none of them can challenge and motivate students like a real research project. With this belief in mind, three new sample studies (i.e., Sun, 2006a,b; Sun and Liao, 2011) are carefully selected and prepared for this data project. Reproducing the results in one of the studies should help you learn research methods and data analyses, as promoted throughout the book, in a more systematic and complete manner.
Manipulating raw data can be more demanding and difficult than running a simple linear regression. In this regard, Sun (2006a) is more difficult than the other two studies because a number of raw data sets need to be consolidated. Skipping the data preparation step is strongly discouraged as it is an integral part of the whole research process. However, if you indeed have a good reason to do so, note that the data sets for all three studies are included in the erer package, i.e., daRoll, daLaw, and daEsa. They can be loaded directly, e.g., data(daRoll). The results in the three studies can be reproduced with these data sets and R functions. The erer library contains several functions that are developed for these models. These include marginal effects for binary choice and ordered probit models, and additionally, return and risk analysis for financial events.
C.1  Roll call analysis of voting records (Sun, 2006a)
Sun (2006a) is a roll call analysis of the Healthy Forests Restoration Act for fire management.
Roll call analyses are an examination of congressional voting records through binary logit
or probit models. It is a type of statistical analysis widely used in political science. The
statistical model and outputs from this study are very similar to those in Sun et al. (2007)
because binary choice models are the key methodology. Thus, among the three sample
studies for exercises, the statistical model in Sun (2006a) is relatively easier to estimate.
The raw data are saved and available as “RawDataProjectARoll.xlsx” in the erer
library. A number of data sources are utilized to collect the raw data, and there are a total
of eight data sheets. The very first challenge is that there are four model specifications, and
correspondingly, four data sets should be prepared from the raw data. Various techniques for
data manipulations should be employed to merge, index, and combine them in generating
the final data sets for regressions. To mitigate the possible frustration from raw data to the
four individual data sets, daRoll is included in the erer package. This is the combined data
set for all specifications. The sample codes on the help page of daRoll can generate four
individual data sets. For a complete exercise, however, one should start with the raw data,
generate a data frame object similar to daRoll, and finally, abstract four individual data
sets for regression analyses.
All the three tables and two figures in Sun (2006a) can be completely reproduced by
R. Some notes are provided here to help you reproduce the results. In your R console, the
results from an R program should have an appearance similar to the following outputs.
• Raw data manipulation: data for this study are in “RawDataProjectARoll.xlsx”.
Using merge(), ifelse(), and other functions in R, the raw data sets can be converted
into a data frame object similar to daRoll in the erer library. From this combined
data frame, four individual data sets can be created.
• Table 1 can be generated by bsStat() with additional manipulations.
• Table 2 can be created by glm(), bsTab(), and logLik().
• Table 3 can be generated by maBina() and bsTab(); a rough sketch of these two steps is given after this list.
• Figure 1 can be generated with the fire data in “RawDataProjectARoll.xlsx” and
traditional graphics system. Note the left y-axis is for acres and the right y-axis is for
fire numbers. To overlay two graphs, techniques in Section 11.5 Region, coordinate,
clipping, and overlaying on page 232 should be consulted.
• Figure 2 can be generated with maTrend(). The mfrow argument in par() can arrange
four graphs on one device. The recordPlot() function can be used to record a graph
on a screen device, and then replayPlot() can save it on a file device.
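As referenced in the Table 3 bullet above, the following hypothetical sketch outlines the estimation step. It assumes that daRoll contains the variables listed in Table 1 under those names and ignores the split into the four specifications, so it would need to be adapted:

library(erer)
data(daRoll)
# Binary logit model for one specification (variable names follow Table 1)
fit <- glm(Vote ~ RepParty + East + West + South + PopDen + PopRural + Edu +
  Income + FYland + Size + ContrFY + ContrEN + Sex + Lawyer + Member + Year,
  family = binomial(link = "logit"), data = daRoll)
summary(fit); logLik(fit)       # inputs for Table 2
me <- maBina(w = fit); me       # marginal effects for Table 3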
> print(table.1, right = FALSE)
   Variable Description
1  Vote     Dummy dependent variable equals one if vote is yes
2  RepParty Dummy equals one if Republican
3  East     Regional dummy for 11 northeastern states
4  West     Regional dummy for 11 western states
5  South    Regional dummy for 13 southern states
6  PopDen   Population density per km2
7  PopRural % of rural population
8  Edu      % of population over 25 with a Bachelor's degree
9  Income   Median family income ($1,000)
10 FYland   % of federal lands in total forestlands 2002
11 Size     Value of shipments of industry 1997 ($ million)
12 ContrFY  Contribution from forest firms ($1,000)
13 ContrEN  Contribution from environmental groups ($1,000)
14 Sex      Dummy equals one if male
15 Lawyer   Dummy equals one if lawyer
16 Member   Dummy equals one if a committee member for HFRA
17 Year     Number of years in the position
18 Chamber  Dummy equals one if House and zero if Senate
> table.2
Variable HR-May
1 Constant -3.96
2 RepParty
5.83
3
East -0.39
t1 HR-Nov
-2.14** -1.87
7.76***
4.16
-0.54
0.04
Mean (min, max)
0.70
0.53
0.21
0.22
0.31
0.78(0,24.56)
0.22(0,0.79)
0.15(0.04,0.35)
50.78(20.92,91.57)
0.12
9.53(0.12,21.01)
4.95(0,151.56)
1.61(0,34.63)
0.86
0.42
0.13
10.43(0,48.00)
0.82
t2 HR-Senate
t3 Democrat
-1.39
0.12
0.10
-0.22
7.60***
4.01 8.15***
__
0.07
-0.57
-1.18
-0.34
t4
-0.16
__
-0.57
4
West
0.34
0.37
2.18 3.02***
5
South
1.13
1.57
1.23
1.96**
6
PopDen -0.28
-0.79 -0.16
-0.87
7 PopRural
5.62 3.32***
4.16 2.98***
8
Edu
0.87
0.10 -1.69
-0.26
9
Income -0.04
-0.93 -0.02
-0.64
10
FYland
1.49
0.95 -0.18
-0.16
11
Size
0.12
2.35**
0.01
0.34
12 ContrFY
0.21
2.37**
0.53 3.70***
13 ContrEN -0.36 -3.43*** -0.16 -2.44***
14
Sex
0.94
1.15
0.14
0.24
15
Lawyer
0.10
0.21
0.10
0.24
16
Member
2.36 2.65***
1.65
1.95**
17
Year
0.00
0.02
0.01
0.50
18
Obs.
426
426
19
Log-L -68.24
-96.52
1.45
1.30
-0.17
3.02
-5.66
0.00
1.00
0.01
0.24
-0.07
0.26
-0.28
1.14
0.01
520
-128.84
2.45***
2.43**
-0.97
2.56***
-0.96
0.06
0.98
0.45
2.86***
-2.19**
0.53
-0.82
1.81*
0.35
1.77 2.66***
1.69 2.70***
-0.16
-0.88
3.14 2.48***
-9.05
-1.34
0.01
0.40
0.73
0.66
-0.01
-0.30
0.19 2.33**
-0.05
-1.36
0.41
0.74
-0.64
-1.63
1.46 2.12**
0.02
0.90
245
-97.48
250
> table.3
Variable HS (Dem)
t1 HS (Rep)
t2 HS(all)
t3 Demo (all)
t4
1 RepParty
0.956 8.55***
0.013
1.64
0.211 3.02***
__
__
2
East
-0.132
-1.23
-0.002 -0.84 -0.035
-0.97
-0.079
-0.59
3
West
0.346 2.65***
0.003
1.43
0.056
2.16**
0.416 3.02***
4
South
0.309
2.40**
0.004
1.33
0.068
1.92*
0.401 2.65***
5
PopDen
-0.040
-0.99
-0.001 -0.79 -0.009
-0.86
-0.038
-0.90
6 PopRural
0.721 2.52***
0.010
1.29
0.159
1.90*
0.744
2.44**
7
Edu
-1.349
-0.96
-0.018 -0.82 -0.297
-0.92
-2.144
-1.34
8
Income
0.000
0.06
0.000
0.06
0.000
0.06
0.003
0.40
9
FYland
0.238
0.98
0.003
0.87
0.052
0.96
0.173
0.65
10
Size
0.003
0.46
0.000
0.43
0.001
0.45
-0.003
-0.30
11 ContrFY
0.058 2.66***
0.001 2.14**
0.013 4.26***
0.046
2.20**
12 ContrEN
-0.017 -2.17**
-0.000 -1.37 -0.004
-1.89*
-0.011
-1.35
13
Sex
0.062
0.53
0.001
0.50
0.014
0.52
0.097
0.74
14
Lawyer
-0.066
-0.82
-0.001 -0.74 -0.015
-0.80
-0.151
-1.62
15
Member
0.279
1.91*
0.003
1.36
0.043
1.99**
0.349
2.38**
16
Year
0.002
0.35
0.000
0.35
0.000
0.35
0.005
0.90
17 Chamber
-0.417 -3.37***
-0.004 -1.54 -0.061 -2.64***
-0.378 -3.05***
[Figure omitted: a two-axis time series plot for 1960 to 2000, with acres burned (million acres, left axis) and number of fires (1,000, right axis).]
Figure 1 in Sun (2006a): Acres burned and fire numbers on federal lands from 1960 to 2003
[Figure omitted: four panels of probability response curves, Prob(Vote = Yes), for Dem, Rep, and All, plotted against federal forestland %, rural population %, contribution from the forest industry ($1,000), and contribution from environmental groups ($1,000).]
Figure 2 in Sun (2006a): Probability response curves (HR-Senate specification) (Note: Dash lines are positioned at the means of the explanatory variables.)
C.2  Ordered probit model on law reform (Sun, 2006b)
In Sun (2006b), an ordered probit model is employed to examine factors that have influenced
the retention of certain liability laws for prescribed fire in the United States. All the four
tables and one figure in this study can be completely reproduced by R. Overall, the data
transformation in this study is easy, but the ordered probit model is difficult to estimate.
• Raw data manipulation: Raw data are saved as “RawDataProjectBBurn.xlsx”. The amount of data manipulation is fairly small. The data set daLaw in the erer library has the same content as the raw data.
• Table 1 can be generated with some indexing operation.
• Table 2 can be created by bsStat() and some additional manipulations.
• Table 3 can be generated by polr() in the MASS library. This function needs a factor as the dependent variable; the relevant column in daLaw is a character string, so it should be converted first with factor(). Exercise 10.5.9 Data frame manipulation on page 211 is relevant to this. To improve model convergence, some initial values (from a guess or from the published version) may need to be supplied through the start argument in polr(). A minimal sketch of this step is given after this list.
• Table 4 can be generated by ocME() from the erer library.
• Figure 1 can be generated with ocProb() and base R graphics functions.
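As referenced in the Table 3 bullet above, here is a hypothetical sketch of the main estimation step. The response column Y and the regressor names are taken from Tables 2 and 3, but the exact coding in daLaw may differ, so treat the column names as assumptions:

library(erer); library(MASS)
data(daLaw)
daLaw$Y <- factor(daLaw$Y)                 # polr() needs a factor response
fit <- polr(Y ~ FYNFS + FYIND + FYNIP + EDU + SEAT + COMIT + RATIO,
  data = daLaw, Hess = TRUE, method = "probit")
summary(fit)                                      # inputs for Table 3
m4 <- ocME(w = fit); m4                           # Table 4: marginal effects
p4 <- ocProb(w = fit, nam.c = "FYNIP", n = 300)   # inputs for Figure 1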
> print(table.1, right = FALSE)
Strict liability Uncertain liability
1 Delaware
Arizona
2 Hawaii
Colorado
3 Minnesota
Connecticut
4 Pennsylvania
Idaho
5 Rhode Island
Illinois
6 Wisconsin
Indiana
7
Iowa
8
Kansas
9
Maine
10
Massachusetts
11
Missouri
12
Montana
13
Nebraska
14
New Mexico
15
North Dakota
16
Ohio
17
South Dakota
18
Tennessee
19
Utah
20
Vermont
21
West Virginia
22
Wyoming
23 N = 6
N = 22
Simple negligence
Alabama
Alaska
Arkansas
California
Kentucky
Louisiana
Maryland
Mississippi
New Hampshire
New Jersey
New York
North Carolina
Oklahoma
Oregon
South Carolina
Texas
Virginia
Washington
Gross negligence
Florida
Georgia
Michigan
Nevada
N = 18
N = 4
> print(table.2, right = FALSE)
   Variable  Mean Minimum Maximum
1  Y           1.4     0.0     3.0
2  FYNFS       3.0     0.0    18.5
3  FYIND       1.3     0.0     7.4
4  FYNIP       7.3     0.3    35.9
5  AGEN      312.8    23.0  3735.0
6  POPRUR      1.2     0.1     3.6
7  EDU         0.3     0.0     2.0
8  INC        20.8    15.8    28.8
9  DAY       166.3    42.0   350.0
10 BIANN       0.3     0.0     1.0
11 SEAT      147.6    49.0   424.0
12 BICAM       2.9     0.0    16.7
13 COMIT      34.6    10.0    69.0
14 RATIO       4.9     1.2    18.6
   Definition
1  Categorical dependent variable (Y = 0, 1, 2, or 3)
2  National Forests area in a state (million acres)
3  Industrial forestland area in a state (million acres)
4  Nonindustrial private forestland area in a state (million acres)
5  Permanent forestry program personnel in a state
6  Rural population in a state (million)
7  Population 25 years old with advanced degrees in a state (million)
8  Per capita income in a state (thousand dollars)
9  The maximum length of legislative sessions in calendar days in a state
10 A dummy variable equal to one for states with annual legislative sessions, ...
11 Total number of legislative seats (Senate plus House) in the legislative ...
12 Level of bicameralism in a state, defined as the size of the Senate ...
13 Total number of standing committees in a state
14 Total number of standing committees in a state divided by ...
> table.3
Variable Estimate.x t_ratio.x Estimate.y t_ratio.y Estimate t_ratio
1
FYNFS
-0.00
-0.11
-0.04
-0.81
-0.04
-1.00
2
FYIND
0.25
2.52b
0.44
3.01c
0.45
3.31c
3
FYNIP
0.05
1.78a
0.04
1.22
0.05
1.54
4
AGEN
0.00
0.21
5
POPRUR
0.11
0.30
6
EDU
1.32
1.41
1.48
2.19b
7
INC
0.04
0.49
0.03
0.44
8
DAY
-0.00
-0.84
-0.00
-0.83
9
BIANN
-0.54
-1.19
-0.51
-1.18
10
SEAT
0.01
1.13
0.01
2.15b
11
BICAM
0.05
0.34
12
COMIT
-0.07
-2.40b
-0.07 -2.49b
13
RATIO
-0.29
-2.19b
-0.30 -2.40b
14 Log-likelihood
-52.42
-46.19
-46.30
> table.4
  Variable  Y = 0  Y = 1  Y = 2  Y = 3
1 FYNFS     0.005  0.012 -0.014 -0.003
2 FYIND    -0.052 -0.126  0.149  0.029
3 FYNIP    -0.005 -0.013  0.015  0.003
4 EDU      -0.169 -0.413  0.488  0.094
5 SEAT     -0.001 -0.003  0.004  0.001
6 COMIT     0.008  0.020 -0.024 -0.005
7 RATIO     0.034  0.084 -0.099 -0.019
[Figure omitted: probability response curves of Prob(Y = J) against NIPF land area in a state (million acres), with four lines for Y = 0 (strict liability), Y = 1 (uncertain liability), Y = 2 (simple negligence), and Y = 3 (gross negligence).]
Figure 1 in Sun (2006b): Effects of NIPF land area on the probability of retaining different liability rules
C.3  Event analysis of ESA (Sun and Liao, 2011)
In Sun and Liao (2011), event analysis is employed to examine the impact of six court
decisions related to the Endangered Species Act (ESA) on the financial performance of U.S.
forest products firms. The methodology is widely used in financial economics. Ordinary least squares is used to estimate all the models. Thus, while the data are time series, the time dimension is irrelevant for model estimation. The work for data preparation is moderate. The main challenge is to use looping statements for repeated analyses by case and window. Several functions are included in the erer library; in particular, the generic function update() should be used repeatedly. The three tables (Tables 2 to 4) and
the figure in this study can be completely reproduced by R.
• Raw data manipulation: Raw data for this study are in “RawDataProjectCEsa.xlsx”. The data set daEsa in the erer library has contents similar to the raw data. The event date information is in Table 1 of the published paper.
• Table 1 is descriptive event information, and there is no need to reproduce it by R.
• Table 2 can be created by the evReturn() function in erer package and update().
• Tables 3 and 4 can be generated by evRisk().
• Figure 1 can be generated with evReturn() and some graphics functions. The limits
of the y-axes in the two plots are set to be the same for comparison. The published
version was generated by the ggplot2 package. In this particular case, generating the
graph by ggplot2 can be challenging to beginners, partly because of adding labels to
a panel graph (i.e., facets). The trick is to add some columns in the data frame for
coordinates or labels, and then use geom_text() to add the labels.
In base R graphics, layout() can be used to create two figure regions. The y-axis titles for the two plots are combined; this is implemented by first setting the oma argument in par() to be bigger than zero, and then adding the label through mtext(..., outer = TRUE), as sketched after this list. Two shaded boxes with texts are added to the right end.
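A minimal base R sketch of the layout(), oma, and mtext() idea mentioned in the Figure 1 bullet; the two series below are made up and merely stand in for the average cumulative abnormal returns:

set.seed(2)
day <- -7:7
car.neg <- cumsum(rnorm(15, -0.2, 0.5))    # made-up series for panel a
car.pos <- cumsum(rnorm(15,  0.3, 0.5))    # made-up series for panel b
layout(matrix(1:2, nrow = 2))              # two stacked figure regions
par(oma = c(0, 3, 0, 0), mar = c(4, 2, 2, 1))   # leave outer space on the left
plot(day, car.neg, type = "l", xlab = "", ylab = "", main = "a. Negative cases")
plot(day, car.pos, type = "l", xlab = "Event day", ylab = "",
  main = "b. Positive cases")
mtext("Average cumulative abnormal returns (%)", side = 2, outer = TRUE, line = 1)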
> (output <- listn(table.2, table.3, table.4))
$table.2
   Case   (-2, 2)   (-3, 3)  (-4, 4)  (-5, 5)  (-6, 6)  (-7, 7)
1     I   -1.872*    -0.672    1.055    0.799    0.187    1.632
2       (-1.954)   (-0.593)  (0.820)  (0.562)  (0.121)  (0.983)
3    II    -1.023    -0.971   -1.459   -1.435   -1.992   -1.560
4       (-0.882)   (-0.708) (-0.937) (-0.833) (-1.065) (-0.779)
5   III    -1.292    -0.024   -0.712    0.692    1.516    1.698
6       (-1.057)   (-0.017) (-0.434)  (0.381)  (0.770)  (0.802)
7    IV  3.873***     2.580    2.015 5.244***    3.066  5.479**
8        (2.869)    (1.618)  (1.116)  (2.656)  (1.429)  (2.379)
9     V   4.040** 11.550*** 6.538***  4.631**  6.002**    2.476
10       (2.587)    (6.274)  (3.142)  (2.016)  (2.409)  (0.925)
11   VI    -0.328    -2.112  -3.124*   -2.267   -1.943   -2.647
12      (-0.243)   (-1.325) (-1.726) (-1.132) (-0.892) (-1.130)
$table.3
   Firm   I_beta  I_gamma  II_beta  II_gamma III_beta III_gamma
1   BBC 0.774***    0.392 0.862*** -0.773***  0.474**    -0.155
2   BOW 0.948***    0.079 1.035*** -0.654*** 0.622***    -0.004
3   CSK 0.884***   -0.124 0.766***  -0.396** 0.656***    -0.256
4    GP 0.820***  0.428** 0.876***    -0.256 0.581***     0.225
5    IP 0.895***  0.421** 1.023*** -0.748*** 0.522***    -0.163
6   KMB 0.992***   -0.009 0.994***  -0.364** 0.677***    -0.192
7   LPX 0.926*** 0.935*** 0.942***    -0.309 0.785***    -0.251
8   MWV 0.854***    0.032 0.802***   -0.386* 0.600***    -0.118
9   PCH 0.868***    0.026 0.848***  -0.327** 0.572***    -0.013
10  PCL 0.873***    0.185 0.755*** -0.541*** 0.501***  -0.353**
11  POP 0.968***   -0.092 0.548***    -0.364    0.318     0.008
12  TIN 0.857***    0.126 0.824***    -0.023 0.690***    -0.115
13  WPP    0.242    0.052 0.873***   -0.398*   0.320*     0.159
14   WY 0.964***  0.457** 0.775***    -0.109 0.670***    -0.068
$table.4
   Firm  IV_beta IV_gamma   V_beta V_gamma  VI_beta  VI_gamma
1   BBC    0.089  0.502** 0.791***  -0.099 1.180***  -0.235**
2   BOW   0.376*    0.271 0.708***  -0.075 1.106***   -0.188*
3   CSK  0.368**    0.283 0.766***  -0.234 0.993***    -0.054
4    GP 0.622***    0.180 0.884***  -0.131 1.598*** -0.686***
5    IP    0.278    0.188 0.939***  -0.261 1.036***    -0.015
6   KMB 0.632***   -0.088 1.041*** -0.390* 0.723***  0.222***
7   LPX 0.639***   -0.138 1.185*** -0.499* 1.376***   -0.270*
8   MWV  0.421**    0.164 0.850***  -0.105 1.189*** -0.217***
9   PCH 0.519***    0.091 0.795***  -0.063 1.043***    -0.064
10  PCL   0.215*    0.122 0.528***  -0.133 0.951***     0.022
11  POP    0.175    0.470 0.522***  -0.020 0.801***     0.112
12  TIN 0.802***   -0.292 0.845***  -0.085 1.152***  -0.156**
13  WPP 0.480***    0.112 1.018***  -0.245 1.214***   -0.180*
14   WY 0.670***   -0.094 0.882***  -0.225 1.093***  -0.156**
[Figure omitted: two panels of average cumulative abnormal returns (%) over event days -7 to 7; panel a (negative cases I and VI) and panel b (positive cases IV and V), annotated with I (-2, 2): -1.872%, VI (-4, 4): -3.124%, V (-6, 6): +6.002%, and IV (-7, 7): +5.479%.]
Figure 1 (Sun and Liao, 2011): Average cumulative abnormal returns (created by ggplot2)
[Figure omitted: the same two-panel plot of average cumulative abnormal returns (%) over event days -7 to 7, drawn with base R graphics.]
Figure 1 (Sun and Liao, 2011): Average cumulative abnormal returns (created by base R graphics)
References
Abadir, K. and Magnus, J. 2005. Matrix Algebra. Cambridge University Press, Cambridge.
Adler, J. 2010. R in a Nutshell: A Desktop Quick Reference. O’REILLY, Sebastopol, CA.
Akerlof, G. 1970. The market for ‘lemons’: quality uncertainty and the market mechanism. Quarterly Journal of Economics 84:488–500.
Alston, J. and Chalfant, J. 1993. The silence of the lambdas: a test of the almost ideal
and Rotterdam models. American Journal of Agricultural Economics 75:304–313.
Amacher, G., Ollikainen, M., and Koskela, E. 2009. Economics of Forest Resources.
The MIT Press, Cambridge, MA.
Baltagi, B. 2011. Econometrics. Springer-Verlag Berlin Heidelberg, New York, NY, 5th
edition.
Bivand, R., Pebesma, E., and Gomez-Rubio, V. 2008. Applied Spatial Data Analysis
with R. Springer, New York, NY.
Blackburn, T. 2003. Getting Science Grants: Effective Strategies for Funding Success.
Jossey-Bass, San Francisco, CA.
Braun, W. and Murdoch, D. 2008. A First Course in Statistical Programming with R.
Cambridge University Press, New York, NY.
Buehlmann, U., Bumgardner, M., Lihra, T., and Frye, M. 2006. Attitudes of
U.S. retailers toward China, Canada, and the United States as manufacturing sources for
furniture: an assessment of competitive priorities. Journal of Global Marketing 20:61–73.
Chan, K. 1993. Consistency and limiting distribution of the least squares estimator of a
threshold autoregressive model. The Annals of Statistics 21:520–533.
Chiang, A. and Wainwright, K. 2004. Fundamental Methods of Mathematical Economics. McGraw-Hill, New York, NY, 4th edition.
Coase, R. 1960. The problem of social cost. Journal of Law and Economics 3:1–44.
Covey, S. 2013. The 7 Habits of Highly Effective People: Powerful Lessons in Personal
Change. Simon and Schuster, New York, NY, 2nd edition.
Dalgaard, P. 2008. Introductory Statistics with R. Springer, London, UK, 2nd edition.
Deaton, A. and Muellbauer, J. 1980. An almost ideal demand system. American
Economic Review 70:312–326.
Edgerton, D. 1993. On the estimation of separable demand models. Journal of Agricultural and Resource Economics 18:141–146.
Enders, W. 2010. Applied Econometric Time Series. John Wiley and Sons, Inc., New
York, NY, 3rd edition.
Enders, W. and Granger, C. 1998. Unit-root tests and asymmetric adjustment with
an example using the term structure of interest rates. Journal of Business and Economic
Statistics 16:304–311.
Enders, W. and Siklos, P. 2001. Cointegration and threshold adjustment. Journal of
Business and Economic Statistics 19:166–176.
Engle, R. and Granger, C. 1987. Co-integration and error correction: representation,
estimation, and testing. Econometrica 55:251–276.
Everitt, B. and Hothorn, T. 2009. A Handbook of Statistical Analyses Using R.
Chapman and Hall/CRC, Boca Raton, FL, 2nd edition.
Faustmann, M. 1995. Calculation of the value which forest land and immature stands
possess for forestry. Journal of Forest Economics 1:7–44.
Frey, G. and Manera, M. 2007. Econometric models of asymmetric price transmission.
Journal of Economic Surveys 21:349–415.
Gallet, C. 2010. The income elasticity of meat: a meta-analysis. Australian Journal of
Agricultural and Resource Economics 54:477–490.
Gardner, B. 1975. Farm retail price spread in a competitive food industry. American
Journal of Agricultural Economics 57:399–409.
Granger, C. and Lee, T. 1989. Investigation of production, sales, and inventory relationships using multicointegration and non-symmetric error correction models. Journal
of Applied Econometrics 4:S145–S159.
Greene, W. 2011. Econometric Analysis. Prentice Hall, New York, NY, 7th edition.
Hartman, R. 1976. The harvesting decision when a standing forest has value. Economic
Inquiry 14:52–58.
Henneberry, S., Piwethongngam, K., and Qiang, H. 1999. Consumer food safety
concerns and fresh produce consumption. Journal of Agricultural and Resource Economics
24:98–113.
Henningsen, A. and Hamann, J. 2007. systemfit: a package for estimating systems of
simultaneous equations in R. Journal of Statistical Software 23:1–40.
Johnson, E. 1991. The Handbook of Good English. Washington Square Press, New York,
NY.
Jones, O., Maillardet, R., and Robinson, A. 2009. Introduction to Scientific Programming and Simulation Using R. Chapman and Hall/CRC, Boca Raton, FL.
Judge, G., Griffiths, W., Hill, R., Lutkepohl, H., and Lee, T.-C. 1985. The
Theory and Practice of Econometrics. John Wiley and Sons, New York, NY, 2nd edition.
Kleiber, C. and Zeileis, A. 2008. Applied Econometrics with R. Springer, New York,
NY.
Lawrence, M. and Verzani, J. 2012. Programming Graphical User Interfaces in R. CRC
Press, Boca Raton, FL.
Luo, X., Sun, C., Jiang, H., Zhang, Y., and Meng, Q. 2015. International trade after
intervention: the case of bedroom furniture. Forest Policy and Economics 50:180–191.
MacKinlay, A. 1997. Event studies in economics and finance. Journal of Economic
Literature 35:13–39.
Matloff, N. 2011. The Art of R Programming: A Tour of Statistical Software Design. No
Starch Press, San Francisco, CA.
Mei, B. and Sun, C. 2008. Assessing time-varying oligopoly and oligopsony power in the
U.S. paper industry. Journal of Agricultural and Applied Economics 40:927–939.
Meyer, J. and von Cramon-Taubadel, S. 2004. Asymmetric price transmission: a
survey. Journal of Agricultural Economics 55:581–611.
Morrison, D. and Russell, S. 2005. The Grant Application Writer’s Workbook: Successful Proposals to any Agency. Grant Writers’ Seminars and Workshops, LLC, Buellton,
CA.
Murrell, P. 2011. R Graphics. CRC Press, London, UK, 2nd edition.
Nelsen, R. 2006. An Introduction to Copulas. Springer, New York, NY, 2nd edition.
Newman, D. 2002. Forestry’s golden rule and the development of the optimal forest rotation
literature. Journal of Forest Economics 8:5–27.
Pfaff, B. 2008. Analysis of Integrated and Cointegrated Time Series with R. Springer
Science and Business Media, LLC, New York, NY, 2nd edition.
Piggott, N. and Wohlgenant, M. 2002. Price elasticities, joint products, and international trade. Australian Journal of Agricultural and Resource Economics 46:487–500.
Spector, P. 2008. Data Manipulation with R. Springer, New York, NY.
Sun, C. 2006a. A roll call analysis of the Healthy Forests Restoration Act and constituent
interests in fire policy. Forest Policy and Economics 9:126–138.
Sun, C. 2006b. State statutory reforms and retention of prescribed fire liability laws on US
forest land. Forest Policy and Economics 9:392–402.
Sun, C. 2011. Price dynamics in the import wooden bed market of the United States.
Forest Policy and Economics 13:479–487.
Sun, C. 2014. Recent growth in China’s roundwood import and its global implications.
Forest Policy and Economics 39:43–53.
Sun, C. and Liao, X. 2011. Effects of litigation under the Endangered Species Act on
forest firm values. Journal of Forest Economics 17:388–398.
Sun, C., Pokharel, S., Jones, W., Grado, S., and Grebner, D. 2007. Extent of
recreational incidents and determinants of liability insurance coverage for hunters and
anglers in Mississippi. Southern Journal of Applied Forestry 31:151–158.
Sun, C. and Tolver, B. 2012. Assessing administrative laws for forestry prescribed burning in the southern United States: a management-based regulation approach. International
Forestry Review 14:337–348.
Sun, C. and Zhang, D. 2003. The effects of exchange rate volatility on U.S. forest
commodities exports. Forest Science 49:807–814.
University of Chicago Press. 2003. The Chicago Manual of Style: The Essential Guide
for Writers, Editors, and Publishers. The University of Chicago Press, Chicago, IL, 15th
edition.
Vinod, H. 2008. Hands-on Intermediate Econometrics Using R: Templates for Extending
Dozens of Practical Examples. World Scientific, New Jersey, USA.
Wan, Y., Sun, C., and Grebner, D. 2010a. Analysis of import demand for wooden beds
in the United States. Journal of Agricultural and Applied Economics 42:643–658.
Wan, Y., Sun, C., and Grebner, D. 2010b. Intervention analysis of the antidumping
investigation on wooden bedroom furniture imports from China. Canadian Journal of
Forest Research 40:1434–1447.
Wickham, H. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer, London, UK.
Wright, B., Kaiser, R., and Nicholls, S. 2002. Rural landowner liability for recreational injuries: myths, perceptions, and realities. Journal of Soil and Water Conservation
57:183–191.
List of Figures, Tables, and Programs
Figures
1.1
2.1
2.2
2.3
2.4
2.5
4.1
4.2
5.1
6.1
6.2
6.3
6.4
6.5
6.6
7.1
7.2
11.1
11.2
11.3
11.4
11.5
11.6
11.7
11.8
11.9
11.10
11.11
11.12
11.13
15.1
15.2
15.3
The structure of the book . . . . . . . . . . . . . . . . . . . . . . . . . .
R graphical user interface on Microsoft Windows . . . . . . . . . . . . .
Interface of the alternative editor Tinn-R . . . . . . . . . . . . . . . . .
Interface of the alternative editor RStudio . . . . . . . . . . . . . . . . .
Main interface of the R Commander with a linear regression . . . . . . .
Loading a data set, drawing a scatter plot, and fitting a linear model . .
A comparison of an author’s work and a reader’s memory . . . . . . . .
Three roles, jobs, and outputs related to an empirical study . . . . . . .
The relation among keywords in a proposal for social science . . . . . .
Saving PDF documents in a single folder on a local drive . . . . . . . . .
Interface of a library in EndNote . . . . . . . . . . . . . . . . . . . . . .
Inserting references in Microsoft Word through EndNote . . . . . . . . .
Customizing reference fields in EndNote . . . . . . . . . . . . . . . . . .
Interface of a library in Mendeley . . . . . . . . . . . . . . . . . . . . . .
Inserting references in Microsoft Word through Mendeley . . . . . . . .
Importing external Excel data in the R Commander . . . . . . . . . . .
Fitting a binary logit model in the R Commander . . . . . . . . . . . . .
Graphs with different graphical parameters . . . . . . . . . . . . . . . .
Plotting multiple time series of wooden bed trade on a single page . . .
Displaying math symbols on a graph . . . . . . . . . . . . . . . . . . . .
Defining sizes of regions and margins and their relations . . . . . . . . .
Understanding region, user coordinates, clipping, and overlaying . . . .
Default probability response curves for hunting experience . . . . . . . .
Customizing probability curves in Sun et al. (2007) with base R graphics
Changes in consumer and producer surplus under a demand shift . . . .
Diagram of an author’s work and a reader’s memory by base R . . . . .
A diagram illustrating management-based regulations . . . . . . . . . .
Probability response curves for hunting ages . . . . . . . . . . . . . . . .
Changing a clipping area and user coordinates . . . . . . . . . . . . . . .
A more challenging market graph with a supply shift . . . . . . . . . . .
The curve for the function of y = (x + a)2 + 20 . . . . . . . . . . . . . .
Variation of weekdays associated with a birthday by year . . . . . . . .
The curve for the function of y = 50 − 2 × sin(x − 5) . . . . . . . . . . .
16.1
16.2
16.3
16.4
16.5
16.6
17.1
17.2
19.1
19.2
19.3
19.4
20.1
20.2
20.3
20.4
20.5
20.6
20.7
20.8
Viewports, primitive functions, graphical parameters, and units in grid
Probability response curves in Sun et al. (2007) by grid . . . . . . . . .
The diagram of an author’s work and a reader’s memory by grid . . . .
Probability response curves in Sun et al. (2007) by ggplot2 . . . . . . .
Import shares of beds in Wan et al. (2010a) by ggplot2 and grid . . .
Intensity of fire regulations in Sun and Tolver (2012) by map() . . . . .
Monthly import value of beds from China and Vietnam (base R) . . . .
Monthly import value of beds from China and Vietnam (ggplot2) . . .
The skeleton for a new package on a local drive . . . . . . . . . . . . . .
A screenshot of the help document for the ciTarFit() function . . . . .
Finding the environment variable and path on Windows 7 . . . . . . . .
The command prompt window for building R packages . . . . . . . . . .
A static view of chess board created by base R graphics . . . . . . . . .
Two simple examples for understanding GUIs . . . . . . . . . . . . . . .
A graphical user interface for the correlation between two variables . . .
Overall appearance for the guiApt() GUI with the tabs on the left . . .
Two screenshots for the guiApt() GUI with the tabs at the bottom . .
A GUI for calculating net monthly income . . . . . . . . . . . . . . . . .
A GUI for R calculator with six simple operations . . . . . . . . . . . .
A revised GUI for the correlation between two variables . . . . . . . . .
Tables
1.1
3.1
3.2
4.1
4.2
4.3
4.4
4.5
5.1
6.1
7.1
7.2
8.1
9.1
12.1
13.1
17.1
18.1
Three versions of an empirical study . . . . . . . . . . . . . . . .
Journal of Economic Literature classification system . . . . . . .
A comparison of three types of economic studies . . . . . . . . .
A production comparison between building a house and writing a paper
Main components in the proposal version of an empirical study .
Structure of the program version for an empirical study . . . . .
Four stages of data analyses and software usage . . . . . . . . . .
Structure of the manuscript version for an empirical study . . . .
The outline from a funded proposal narrative document . . . . .
Tasks and goals for reference and file management . . . . . . . .
A draft table for the logit regression analysis in Sun et al. (2007)
A comparison of grammar rules between English and R . . . . .
Constructor, predicate, and coercion functions for R objects . . .
Major functions for character string manipulation in R . . . . . .
A draft table for the descriptive statistics in Wan et al. (2010a) .
Coefficients for the static AIDS model in Wan et al. (2010a) . . .
A draft table for the cointegration analyses in Sun (2011) . . . .
Documents and functions included in the apt package . . . . . .
Programs
7.1
7.2
7.3
7.4
The first program version for Sun et al. (2007) . . . . . . .
The final program version for Sun et al. (2007) . . . . . .
Curly brace match in the if statement . . . . . . . . . . .
A poorly formatted program version for Sun et al. (2007)
7.5
8.1
8.2
8.3
8.4
8.5
8.6
8.7
9.1
9.2
9.3
9.4
9.5
9.6
9.7
9.8
10.1
10.2
10.3
10.4
10.5
11.1
11.2
11.3
11.4
11.5
11.6
11.7
11.8
11.9
11.10
12.1
12.2
13.1
13.2
13.3
13.4
13.5
13.6
14.1
14.2
14.3
14.4
14.5
15.1
15.2
15.3
15.4
15.5
Identifying and correcting bad formats in an R program . . . . . . .
Accessing and defining object attributes . . . . . . . . . . . . . . . .
Creating R objects and locating missing values . . . . . . . . . . . .
A brief introduction to subscripts, flow control, and new functions .
Manual data inputs through R console and Excel . . . . . . . . . . .
Generating a random sample . . . . . . . . . . . . . . . . . . . . . .
Importing data in text, Excel, or graphics format . . . . . . . . . . .
Exporting tables and graphs from R to a local drive . . . . . . . . .
Built-in and user-defined operators in R . . . . . . . . . . . . . . . .
Manipulating character strings . . . . . . . . . . . . . . . . . . . . .
Special meaning of a character . . . . . . . . . . . . . . . . . . . . .
Creating and manipulating a long character string vector . . . . . . .
Manipulating factor objects . . . . . . . . . . . . . . . . . . . . . . .
Manipulating date and time objects . . . . . . . . . . . . . . . . . . .
Manipulating time series objects . . . . . . . . . . . . . . . . . . . .
Defining the relation among variables by formula . . . . . . . . . . .
Subscripting and indexing R objects . . . . . . . . . . . . . . . . . .
Common tasks for manipulating data frame objects . . . . . . . . . .
Summary statistics and pivot tables from data frames . . . . . . . .
Generating summary statistics with apply() . . . . . . . . . . . . .
Calling glm() to estimate a binary choice model . . . . . . . . . . .
Base R graphics with inputs of data, devices, and plotting functions
Managing screen and file graphics devices . . . . . . . . . . . . . . .
Using the par() function to manipulate graphics parameters . . . .
Viewing and saving multiple pages of graphs . . . . . . . . . . . . . .
Creating multiple graphs on a single page . . . . . . . . . . . . . . .
Using high-level and low-level plotting functions . . . . . . . . . . . .
Region, user coordinates, clipping, and overlaying . . . . . . . . . . .
Customizing a graph generated from existing R functions . . . . . .
Market equilibrium after a demand shift . . . . . . . . . . . . . . . .
An author’s work and a reader’s memory by base R . . . . . . . . . .
Program version for Wan et al. (2010a) . . . . . . . . . . . . . . . . .
Estimating the AIDS model with fewer user-defined functions . . . .
Conditional execution with if, ifelse(), and switch() . . . . . . .
Looping through for, while, and repeat . . . . . . . . . . . . . . .
Looping on objects other than a vector and preallocating spaces . . .
Additional functions for flow control: stop() and return() . . . . .
Homogeneity and symmetry restrictions for the AIDS model . . . . .
Computing elasticities and standard errors for the AIDS model . . .
Creating and subscripting matrices in R . . . . . . . . . . . . . . . .
Matrix multiplication, inversion, and other operations . . . . . . . .
Fitting a linear model by matrix multiplication . . . . . . . . . . . .
Estimating a demand system by generalized least square . . . . . . .
Marginal effects and standard errors for the binary choice model . .
Function structure and properties . . . . . . . . . . . . . . . . . . . .
Supplying arguments for a function . . . . . . . . . . . . . . . . . . .
Returning outputs from a function . . . . . . . . . . . . . . . . . . .
Understanding the environment of a function . . . . . . . . . . . . .
Understanding the need of S3 or S4 . . . . . . . . . . . . . . . . . . .
15.6
15.7
15.8
15.9
15.10
15.11
15.12
16.1
16.2
16.3
16.4
16.5
16.6
17.1
17.2
18.1
18.2
19.1
19.2
19.3
20.1
20.2
20.3
20.4
A.1
A.2
A.3
A.4
A.5
A.6
B.1
B.2
A function for course scores with the S3 mechanism . . . . . . . . . . .
A new function for ordinary least square with the S4 mechanism . . .
Testing the new function for ordinary least square . . . . . . . . . . .
An outline for estimating a linear model with ordinary least square . .
Defining a new function for one-dimensional optimization . . . . . . .
Estimating a binary logit model with maximum likelihood . . . . . . .
Wrapping up commands in a new function for the static AIDS model .
Learning viewports and low-level plotting functions in grid . . . . . .
Customizing probability response curves in Sun et al. (2007) by grid .
Comparing an author’s work and a reader’s memory by grid . . . . .
Customizing a graph from existing R functions by ggplot2 . . . . . .
Creating the panel graph in Wan et al. (2010a) by ggplot2 and grid .
Creating the map for fire regulations in Sun and Tolver (2012) . . . .
Main program version for generating tables in Sun (2011) . . . . . . .
Graph program version for generating figures in Sun (2011) . . . . . .
Debugging tools for identifying program errors in R . . . . . . . . . . .
Examining function efficiency by time and memory . . . . . . . . . . .
Three approaches to creating the skeleton of a new package . . . . . .
Help file of ciTarFit.Rd for ciTarFit() and print.ciTarFit() . . .
Installing, loading, and attaching a package in R . . . . . . . . . . . .
Creating a chess board from base R graphics . . . . . . . . . . . . . .
Understanding key concepts in gWidgets through two examples . . . .
A demonstration of the correlation between two random variables . . .
Creating a GUI for the apt package . . . . . . . . . . . . . . . . . . .
A world map with two countries highlighted . . . . . . . . . . . . . . .
A heart shape with three-dimensional effects . . . . . . . . . . . . . . .
Survival of 2,201 passengers on Titanic sank in 1912 . . . . . . . . . .
A diagram for the structure of this book . . . . . . . . . . . . . . . . .
Screenshots from a demonstration video for correlation . . . . . . . . .
Faces at Christmas time conditional on economic status . . . . . . . .
Calculating predicted probabilities for an ordered choice model . . . .
Calculating marginal effects for an ordered choice model . . . . . . . .
Index of Authors
Author names of the cited references are listed. In the main text, some authors are abbreviated as et al. For example, Frye is the fourth author in Buehlmann et al. (2006).
Abadir, K.M., 10, 522
Adler, J., 21, 412
Akerlof, G.A., 33
Alston, J.M., 477
Amacher, G.S., 34
Baltagi, B.H., 103, 107, 109
Bivand, R.S., 21, 386
Blackburn, T.R., 62
Braun, W.J., 21, 269
Buehlmann, U., 72
Bumgardner, M., 72
Chalfant, J.A., 477
Chan, K.S., 399
Chiang, A.C., 34
Coase, R.H., 33, 34
Covey, S.R., 511
Dalgaard, P., 21
Deaton, A., 72, 255
Edgerton, D.L., 73
Enders, W., 72, 74, 397–399
Engle, R.F., 255, 397, 399
Everitt, B.S., 21
Faustmann, M., 34, 38
Frey, G., 39, 400, 414
Frye, M., 72
Gallet, C.A., 38
Gardner, B.L., 34
Gomez-Rubio, V., 21, 386
Grado, S.C., 6, 11, 57, 70, 75, 101–103, 110, 113, 114, 120, 126, 184, 201, 246, 247, 301, 307, 312, 369, 370, 380, 391, 532
Granger, C.W.J., 255, 397, 399
Grebner, D.L., 6, 11, 37, 41, 57, 70, 72, 73, 75, 89, 96, 101–103, 110, 113, 114, 120, 126, 145, 184, 201, 209, 235, 246, 247, 252, 253, 255, 256, 260, 265, 283, 301, 307, 312, 313, 347, 369, 370, 380, 382, 385, 391, 395, 411, 470–473, 477, 480, 483, 492, 506, 532
Greene, W.H., 103, 105, 106, 108, 109, 257, 259, 521, 523
Griffiths, W.E., 103, 104, 106
Hamann, J.D., 257
Hartman, R., 34
Henneberry, S.R., 73
Henningsen, A., 257
Hill, R.C., 103, 104, 106
Hothorn, T., 21
Jiang, H., 41
Johnson, E.D., 480
Jones, O., 21, 322, 419
Jones, W.D., 6, 11, 57, 70, 75, 101–103, 110, 113, 114, 120, 126, 184, 201, 246, 247, 301, 307, 312, 369, 370, 380, 391, 532
Judge, G.G., 103, 104, 106
Kaiser, R.A., 71
Kleiber, C., 21
Koskela, E.A., 34
Lawrence, M.F., 449, 450
Lee, T.-C., 103, 104, 106
Lee, T.H., 399
Liao, X., 6, 9, 11, 41, 56, 77, 151, 290, 538
Lihra, T., 72
Luo, X., 41
Lutkepohl, H., 103, 104, 106
MacKinlay, A.C., 39, 81, 91
Magnus, J.R., 10, 522
Maillardet, R., 21, 322, 419
Manera, M., 39, 400, 414
Matloff, N., 21, 293
Mei, B., 483
Meng, Q., 41
Meyer, J., 74
Morrison, D.C., 58, 62, 63
Muellbauer, J., 72, 255
Murdoch, D.J., 21, 269
Murrell, P., 21, 359, 360
Nelsen, R.B., 456
Newman, D.H., 38
Nicholls, S., 71
Ollikainen, M., 34
Pebesma, E.J., 21, 386
Pfaff, B., 21
Piggott, N.E., 35
Piwethongngam, K., 73
Pokharel, S., 6, 11, 57, 70, 75, 101–103, 110, 113, 114, 120, 126, 184, 201, 246, 247, 301, 307, 312, 369, 370, 380, 391, 532
Qiang, H., 73
Robinson, A., 21, 322, 419
Russell, S.W., 58, 62, 63
Siklos, P.L., 398
Spector, P., 21, 128
Sun, C., 6, 7, 9, 11, 37, 41, 56, 57, 70, 72, 73, 75, 77, 89, 93, 96, 101–103, 109, 110, 113, 114, 120, 126, 145, 151, 184, 201, 209, 235, 246, 247, 252, 253, 255, 256, 260, 265, 283, 290, 301, 307, 312, 313, 347, 369, 370, 380, 382, 385–387, 391, 394, 395, 397, 402, 407, 411, 416, 460, 465, 470–473, 477, 480, 483, 489, 492, 506, 532, 535, 538
Tolver, B., 247, 386, 387
University of Chicago Press, 480
Verzani, J., 449, 450
Vinod, H.D., 21
von Cramon-Taubadel, S., 74
Wainwright, K., 34
Wan, Y., 6, 11, 37, 41, 57, 70, 72, 73, 89, 96, 145, 209, 235, 252, 253, 255, 256, 260, 265, 283, 312, 313, 347, 370, 382, 385, 395, 411, 470–473, 477, 480, 483, 492, 506
Wickham, H., 21, 374, 376
Wohlgenant, M.K., 35
Wright, B.A., 71
Zeileis, A., 21
Zhang, D., 37
Zhang, Y., 41
Index of Subjects
A
AIDS model, 255
demand elasticity, 259, 287
estimation, 347
estimation by GLS, 257
restriction matrices, 256, 283
static and dynamic, 255
apply family, 198
list, 198
subsets, 198
Asymmetric error correction model, 399
B
Backslash in R, 19, 157, 161
Beginner, 6, 100, 125, 511
Binary choice model, 103, 533
estimation, 103, 307
linear probability model, 104
maximum likelihood, 105
ordinary least square, 105
marginal effect, 108, 307, 521
maximum likelihood estimation, 344
parameter values, 107
predicted probability, 107
standard error, 109, 307
Book design, 511
motivation, 3
objective, 5
principle A: incrementalism, 5
principle B: project-oriented, 6
principle C: reproducible research, 7
structure, 7
Book materials, 10
C
Calling a function, 135
location match, 136
name match, 136
Character string
creating, 156
displaying, 157
extracting, 158
formatting numbers, 162
lower and upper cases, 158
matching, 161
regular expression
creation, 161
definition, 160
Chernoff faces, 519
Clickor, 6, 22, 110
characteristics, 27, 111
playing a data set, 25
Clinic, 503
frequently appearing symptoms, 503
manuscript details, 509
R programming, 507
study design and outline, 503
Cointegration, 256
Comparison of EndNote and Mendeley, 86
Comparison of R and commercial software, 21
Conditional statement, 269
if, 269
ifelse, 271
nesting if, 270
switch, 272
Contingency and pivot tables, 194
Contributor, 6, 394, 511
D
Data analysis project, 532
Data frame, 180
add new elements, 187
changing mode by column, 185
combine, 188
display, 184
extract, 186
merge, 188
pivot tables, 194
random sample, 188
remove rows or columns, 187
renaming, 185
reorder and sort, 186
replace, 186
reshape between narrow and wide formats, 188
summary, 193
Data input
external data
Excel format, 145
text format, 144
importing graphics, 145
manual, 139
sample in packages, 141
simulation, 142
Data output, 147
Excel format, 148
graph exports, 149
text format, 147
Date and time
creation, 169
extracting, 170
type, 169
Debug
browsing status, 418
change source codes, 421
no change on source codes, 421
without special tools, 419
Delta method, 521
Demonstration graph, 240
Deparse, 325
Diagram, 240, 517
E
Economic research
areas, 31
selection, 40
type
empirical study, 35
review study, 38
selection, 40
theoretical study, 33
Empirical study
challenge, 37
common thread, 35
practical steps, 54
production process, 42
programming advantage, 36
three roles, jobs, and outputs, 47, 74
three versions, 45, 101, 253, 395, 511
manuscript, 53, 472
program, 51
proposal, 49
relation, 47
EndNote, 86
file management
absolute path, 91
imports, 92
PDF annotation, 93
relative path, 91
save, locate, and export, 93
implementation, 87
reference management
custom group, 90
group, 90
group set, 90
new reference creation, 87
output styles, 89
search and export, 91
smart group, 90
types and fields, 88
Environment, 322
Error correction model, 256, 399
Escape character in R, 19, 157, 161
Event analysis, 538
F
Factor
conversion, 166
reordering, 166
structure, 166
File management
demand, 78
practical techniques, 83
solution, 81
Flow control, 137, 269
conditional statement, 269
looping statement, 269, 524
stop, warning, and message, 281
Formulas, 177
creation, 177
extracting, 178
Forward slash in R, 19, 157
Frame, 323
Function
arguments, 315
ellipsis, 316
with default values, 316
without default values, 316
debugging, 418
environment, 32