Empirical Research in Economics: Growing up with R

Changyou Sun

Copyright © 2015 by Changyou Sun. All rights reserved. No part of this document may be reproduced, translated, or distributed in whole or in part without the permission of the author.

Written by Dr. Changyou Sun
Natural Resource Economics
Department of Forestry
Mississippi State University
Mississippi State, MS 39762, USA
Email: [email protected]

Published by Pine Square LLC
105 Elm Place
Starkville, Mississippi 39759, USA

Cover designed by Chelbie Caitlyn Williamson
Printed in the United States of America
Printed in August 2015
First edition

Publisher's Cataloging-In-Publication Data (Prepared by The Donohue Group, Inc.)
Sun, Changyou, 1969–
  Empirical research in economics : growing up with R / Changyou Sun.
  pages : illustrations ; cm
  Includes bibliographical references and index.
  ISBN: 978-0-9965854-0-8
  1. Economics–Research–Methodology. 2. R (Computer program language) 3. Economic statistics–Computer programs. I. Title. II. Title: ERER
  HB74.5 .S86 2015
  330.0721
  2015911715

Brief Contents

Preface

Part I    Introduction
  Chapter 1    Motivation, Objective, and Design
  Chapter 2    Getting Started with R

Part II   Economic Research and Design
  Chapter 3    Areas and Types of Economic Research
  Chapter 4    Anatomy on Empirical Studies
  Chapter 5    Proposal and Design for Empirical Studies
  Chapter 6    Reference and File Management

Part III  Programming as a Beginner
  Chapter 7    Sample Study A and Predefined R Functions
  Chapter 8    Data Input and Output
  Chapter 9    Manipulating Basic Objects
  Chapter 10   Manipulating Data Frames
  Chapter 11   Base R Graphics

Part IV   Programming as a Wrapper
  Chapter 12   Sample Study B and New R Functions
  Chapter 13   Flow Control Structure
  Chapter 14   Matrix and Linear Algebra
  Chapter 15   How to Write a Function
  Chapter 16   Advanced Graphics

Part V    Programming as a Contributor
  Chapter 17   Sample Study C and New R Packages
  Chapter 18   Contents of a New Package
  Chapter 19   Procedures for a New Package
  Chapter 20   Graphical User Interfaces

Part VI   Publishing a Manuscript
  Chapter 21   Manuscript Preparation
  Chapter 22   Peer Review on Research Manuscripts
  Chapter 23   A Clinic for Frequently Appearing Symptoms

Appendix A   Programs for R Graphics Show Boxes
Appendix B   Ordered Choice Model and R Loops
Appendix C   Data Analysis Project

References
List of Figures, Tables, and Programs
Index of Authors
Index of Subjects
Index of Commands

Contents

Preface

Part I  Introduction

Chapter 1  Motivation, Objective, and Design
  1.1  Inefficiency, reasons, and motivation
  1.2  Objective and design
    1.2.1  Principle A: Incrementalism and growing up by stage
    1.2.2  Principle B: Project-oriented learning with full samples
    1.2.3  Principle C: Reproducible research with programming
  1.3  Book structure
  1.4  How to use this book
    1.4.1  A guide for intensive use as a textbook
    1.4.2  A guide for self-study
    1.4.3  Materials and notations
  1.5  Exercises

Chapter 2  Getting Started with R
  2.1  Installation of base R, packages, and editors
    2.1.1  Base R
    2.1.2  Contributed R packages
    2.1.3  Alternative R editors
  2.2  Installation notes for this book
    2.2.1  Recommended installation steps
    2.2.2  Working directory and source()
    2.2.3  Possible installation problems
  2.3  The help system and features of R
  2.4  Playing with R like a clickor
  2.5  Exercises

Part II  Economic Research and Design

Chapter 3  Areas and Types of Economic Research
  3.1  Areas of economic research
  3.2  Theoretical studies
    3.2.1  Economic theories and thinking
    3.2.2  With or without quantitative models
    3.2.3  Structure of theoretical studies with models
  3.3  Empirical studies
    3.3.1  Common thread: y = f(x, β) + ε
    3.3.2  Data analyses by programming
    3.3.3  Challenges for empirical studies
  3.4  Review studies
    3.4.1  Features of review studies
    3.4.2  Reviewing a research issue
    3.4.3  Reviewing a statistical model
  3.5  Becoming an expert
  3.6  Exercises

Chapter 4  Anatomy on Empirical Studies
  4.1  A production approach
    4.1.1  Building a house as an analogy
    4.1.2  An author's and a reader's perspective to a paper
  4.2  Three versions of an empirical study
    4.2.1  Three roles, jobs, and outcomes
    4.2.2  Relations among the three versions
  4.3  Proposal version and design
    4.3.1  Idea quality versus presentation skills
    4.3.2  Idea sparkles
    4.3.3  Tips for literature search
    4.3.4  A one-page summary of a proposal
  4.4  Program version and structure
  4.5  Manuscript version and detail
  4.6  Practical steps for an empirical study
  4.7  Exercises

Chapter 5  Proposal and Design for Empirical Studies
  5.1  Fundamentals of proposal preparation
    5.1.1  Inputs needed for a great proposal
    5.1.2  Keywords in a proposal
    5.1.3  Two triangles, one focus, the mood, and the story
  5.2  Empirical study design for funding
    5.2.1  Understanding a request for proposal
    5.2.2  Setting up a proposal outline
    5.2.3  An unfunded sample proposal
    5.2.4  A funded sample proposal
  5.3  Empirical study design for publishing
    5.3.1  Design with survey data (Sun et al. 2007)
    5.3.2  Design with public data (Wan et al. 2010a)
    5.3.3  Design with challenging models (Sun 2011)
  5.4  Empirical study design for graduate degrees
    5.4.1  Phases: technician (master's) versus designer (PhD)
    5.4.2  Facing the constraints: time, money, and experience
  5.5  Summary of research design
  5.6  Exercises

Chapter 6  Reference and File Management
  6.1  Demand of literature management
    6.1.1  Timing for literature management
    6.1.2  How many papers?
    6.1.3  Tasks and goals
  6.2  Systematic solutions
    6.2.1  Principles for effective management
    6.2.2  Practical techniques for reference management
    6.2.3  Practical techniques for file management
    6.2.4  A comparison of EndNote and Mendeley
  6.3  Implementation by EndNote
    6.3.1  Managing references in EndNote
    6.3.2  Managing files in EndNote
  6.4  Implementation by Mendeley
    6.4.1  Managing references in Mendeley
    6.4.2  Managing files in Mendeley
  6.5  Exercises

Part III  Programming as a Beginner

Chapter 7  Sample Study A and Predefined R Functions
  7.1  Manuscript version for Sun et al. (2007)
  7.2  Statistics for a binary choice model
    7.2.1  Definitions and estimation methods
    7.2.2  Linear probability model
    7.2.3  Binary probit and logit models
  7.3  Estimating a binary choice model like a clickor
    7.3.1  A logit regression by the R Commander
    7.3.2  Characteristics of clickors
  7.4  Program version for Sun et al. (2007)
  7.5  Basic syntax of R language
    7.5.1  Sections and comment lines
    7.5.2  Paragraphs and command blocks
    7.5.3  Sentences and commands
    7.5.4  Words and object names
  7.6  Formatting an R program
    7.6.1  Comments and an executable program
    7.6.2  Line width and breaks
    7.6.3  Spaces and indention
    7.6.4  A checklist for R program formatting
  7.7  Road map: using predefined functions (Part III)
  7.8  Exercises

Chapter 8  Data Input and Output
  8.1  Objects in R
    8.1.1  Object attributes
    8.1.2  Object types commonly used in R
    8.1.3  Object creation, predicate, and coercion
  8.2  Calling an R function
  8.3  Subscripts, flow control, and new functions
  8.4  Data inputs and creation
    8.4.1  Manual data inputs in R
    8.4.2  Sample data in R packages
    8.4.3  Simulation data in R
    8.4.4  Reading external data
  8.5  Data outputs
  8.6  Exercises

Chapter 9  Manipulating Basic Objects
  9.1  R operators
  9.2  Character strings
    9.2.1  Frequently used functions
    9.2.2  Special meaning of a character
    9.2.3  Application: character string manipulation
  9.3  Factors
  9.4  Date and time
  9.5  Time series
  9.6  Formulas
  9.7  Exercises

Chapter 10  Manipulating Data Frames
  10.1  Subscripting and indexing in R
  10.2  Common tasks for data frame objects
  10.3  Summary statistics of data frames
    10.3.1  Quick summarization by row or column
    10.3.2  Contingency and pivot tables
    10.3.3  The apply() family
  10.4  Application: estimating a binary choice model
  10.5  Exercises

Chapter 11  Base R Graphics
  11.1  A bird view of the graphics system
  11.2  Your paint: preparation of plotting data
  11.3  Your canvas: graphics devices
    11.3.1  Screen versus file devices
    11.3.2  Setting graphics parameters by par()
    11.3.3  Many graphs on multiple pages or files
    11.3.4  Many graphs on a single page
  11.4  Your big and small brushes: plotting functions
  11.5  Region, coordinate, clipping, and overlaying
    11.5.1  Regions and margins
    11.5.2  Coordinate system and clipping
    11.5.3  Overlaying, axis, and legend
  11.6  Application on customizing a default graph
  11.7  Application on creating a diagram
    11.7.1  Drawing demand and supply curves
    11.7.2  A diagram for an author's work and a reader's memory
  11.8  Summary: using predefined functions (Part III)
  11.9  Exercises

Part IV  Programming as a Wrapper

Chapter 12  Sample Study B and New R Functions
  12.1  Manuscript version for Wan et al. (2010a)
  12.2  Statistics: AIDS model
    12.2.1  Static and dynamic models
    12.2.2  Implementation: construction of restriction matrices
    12.2.3  Implementation: estimation by generalized least square
    12.2.4  Implementation: calculation of demand elasticities
  12.3  Program version for Wan et al. (2010a)
  12.4  Needs for user-defined functions
  12.5  Road map: how to write new functions (Part IV)

Chapter 13  Flow Control Structure
  13.1  Conditional statements
    13.1.1  Branching with if
    13.1.2  The ifelse() function
    13.1.3  The switch() function
  13.2  Looping statements
    13.2.1  for, while, and repeat loops
    13.2.2  Constructing a looping structure
  13.3  Additional functions for flow control
  13.4  Application: restrictions on the AIDS model
  13.5  Application: elasticities for the AIDS model
  13.6  Exercises

Chapter 14  Matrix and Linear Algebra
  14.1  Matrix creation and subscripts
    14.1.1  Creating a matrix
    14.1.2  Subscripting a matrix
  14.2  Matrix operation and linear algebra
  14.3  Application: ordinary least square
  14.4  Application: generalized least square
  14.5  Application: marginal effects for a binary model
  14.6  Exercises

Chapter 15  How to Write a Function
  15.1  Function structure
    15.1.1  Main components
    15.1.2  Function arguments
    15.1.3  Exporting outputs from a function
    15.1.4  Loading and attaching a function
  15.2  Function environment
    15.2.1  Basic concepts
    15.2.2  Applications
  15.3  Object-oriented programming
    15.3.1  A quick start: why and when do we need S3 or S4?
    15.3.2  S3
    15.3.3  S4
    15.3.4  S3 versus S4
  15.4  Practical steps for writing a function
  15.5  Application: one-dimensional optimization
  15.6  Application: multi-dimensional optimization
  15.7  Application: static AIDS model by aiStaFit()
  15.8  Exercises

Chapter 16  Advanced Graphics
  16.1  R graphics engine and systems
  16.2  The grid system
    16.2.1  A comparison of traditional graphics and grid
    16.2.2  Viewports
    16.2.3  Low-level plotting functions
    16.2.4  Graphics parameters and coordinate systems
    16.2.5  Application: customizing a default graph by grid
    16.2.6  Application: creating a diagram by grid
  16.3  The ggplot2 package
    16.3.1  A comparison of traditional graphics and ggplot2
    16.3.2  Geom, aesthetic, and mapping
    16.3.3  Stat, position, scale, and facet
    16.3.4  Themes
    16.3.5  Application: the graph in Sun et al. (2007) by ggplot2
    16.3.6  Application: the graph in Wan et al. (2010a) by ggplot2
  16.4  Spatial data and maps
    16.4.1  Application: the fire map in Sun and Tolver (2012)
  16.5  Summary: how to write new functions (Part IV)
  16.6  Exercises

Part V  Programming as a Contributor

Chapter 17  Sample Study C and New R Packages
  17.1  Manuscript version for Sun (2011)
  17.2  Statistics: threshold cointegration and APT
    17.2.1  Linear cointegration analysis
    17.2.2  Threshold cointegration analysis
    17.2.3  Asymmetric error correction model
  17.3  Needs for a new package
  17.4  Program version for Sun (2011)
    17.4.1  Program for tables
    17.4.2  Program for figures
  17.5  Road map: developing a package and GUI (Part V)
  17.6  Exercises

Chapter 18  Contents of a New Package
  18.1  The decision of a new package
    18.1.1  Costs and benefits
    18.1.2  Validating the need of a new package
  18.2  What are inside a package?
    18.2.1  General considerations
    18.2.2  Example: contents of the apt package
  18.3  Debugging
    18.3.1  Bugs and browsing status in R
    18.3.2  Debugging without special tools
    18.3.3  Special tools with source code changes
    18.3.4  Special tools without source code changes
  18.4  Time and memory
  18.5  Exercises

Chapter 19  Procedures for a New Package
  19.1  An overview of procedures
  19.2  Skeleton stage
    19.2.1  The folders
    19.2.2  Help manuals
    19.2.3  Other files
  19.3  Compilation stage
    19.3.1  Required tools
    19.3.2  Four steps: build, check, install, and test
  19.4  Distribution stage
  19.5  Exercises

Chapter 20  Graphical User Interfaces
  20.1  Transition from base R graphics to GUIs
    20.1.1  Packages and installation
  20.2  GUIs in R and the gWidgets package
    20.2.1  Two simple examples
    20.2.2  GUI structure in R
    20.2.3  Main concepts in the gWidgets package
    20.2.4  Container widgets and layouts
    20.2.5  Control widgets
  20.3  Application: correlation between two variables
  20.4  Application: a GUI for the apt package
  20.5  Summary: developing a package and GUI (Part V)
  20.6  Exercises

Part VI  Publishing a Manuscript

Chapter 21  Manuscript Preparation
  21.1  Writing for scientific research
  21.2  An outline sample for Wan et al. (2010a)
  21.3  Outline construction by section
    21.3.1  Manuscript space allocation and sequence
    21.3.2  Key sections: methodology and results
    21.3.3  Background sections: a review of market, literature, or issue
    21.3.4  Wrap-up sections: summary, conclusion, or discussion
    21.3.5  Decoration sections: introduction and abstract
  21.4  Detail for a manuscript
    21.4.1  Knitting a paragraph as a net
    21.4.2  Sentence preparation
    21.4.3  Getting formats right
  21.5  Writing styles for a manuscript
    21.5.1  Explanation: common sense or not
    21.5.2  Explanation: the error term and beyond
    21.5.3  Rhythm: long versus short
    21.5.4  Affirmative style
    21.5.5  Do not invite questions
  21.6  Exercises

Chapter 22  Peer Review on Research Manuscripts
  22.1  An inherently negative process
  22.2  Marketing skills and typical review comments
    22.2.1  Marketing skills
    22.2.2  Mismatch between a manuscript and a journal
    22.2.3  Limited contributions with the current design
    22.2.4  Insufficient or inappropriate analyses
    22.2.5  Poor writing and details
    22.2.6  Random errors and bad lucks
  22.3  Peer review comments for Wan et al. (2010a)
    22.3.1  Comments from referee A in November 2009
    22.3.2  Comments from referee B in November 2009
    22.3.3  Comments from referee A in February 2010
    22.3.4  Assessment on referee comments
  22.4  Responses to the comments for Wan et al. (2010a)
    22.4.1  Responses to referee A in November 2009
    22.4.2  Steps and strategies for preparing responses
  22.5  Summary of a peer review

Chapter 23  A Clinic for Frequently Appearing Symptoms
  23.1  Symptoms related to study design and outline
  23.2  Symptoms related to R programming
  23.3  Symptoms related to manuscript details
  23.4  Final words

Appendix A  Programs for R Graphics Show Boxes

Appendix B  Ordered Choice Model and R Loops
  B.1  Some math fundamentals
  B.2  Predicted probability for ordered choice model
  B.3  Marginal effect for ordered choice model

Appendix C  Data Analysis Project
  C.1  Roll call analysis of voting records (Sun, 2006a)
  C.2  Ordered probit model on law reform (Sun, 2006b)
  C.3  Event analysis of ESA (Sun and Liao, 2011)

References

List of Figures, Tables, and Programs
  Figures
  Tables
  Programs

Index of Authors
Index of Subjects
Index of Commands
Preface

The major motivation behind the book is that there has been a lack of a systematic approach in teaching students how to conduct empirical studies. Graduate students in economics often spend considerable time taking a number of courses in economics, statistics, or econometrics. After intensive course work, however, they may still feel frustrated or inefficient in working on their own projects. Thus, there has been a critical need for students to integrate various techniques into applied scientific research.

The goal of the book is that after going through the process prescribed in the book, a graduate student can finish a typical empirical study in economics over a reasonable period, with the ultimate target of four months or less. To achieve the goal, both research methods and statistical programming are presented. Instead of using fragmented small data sets and examples, several complete sample studies are employed in the book to demonstrate how to design and conduct typical empirical analyses in economics. The software used is R, which is a powerful and flexible computer language for computing and graphics.

This book is highly structured, following the typical process of conducting an empirical study. It is not intended to be an econometrics book, so statistics is covered only as needed. It is different from typical books on research methodology in the market because this book provides detailed guides on how to conduct data analyses. It is also different from many existing R books that focus on how to use R as a tool. In using this book, students need to have a deep understanding of the sample papers. In learning R and programming, students need to run the sample programs included in the book. This is critical, as many statistical techniques will become self-evident once students play with real data sets and code.

I am sincerely grateful for all the help and support I have received while working on this book in recent years. In particular, earlier versions of the book were used in several workshops or courses at Beijing Forestry University, Auburn University, and Mississippi State University. Many improvements were based on the feedback from the attendees. The accompanying R packages (i.e., erer and apt) have been used worldwide, and I have received a large number of comments from users I have never met. My graduate students, Fan Zhang, Zhuo Ning, and Prativa Shrestha, read the book several times (probably more than they liked) and provided valuable comments. Finally, the whole book was prepared with LaTeX 2ε, a high-quality typesetting system with strong support from many online forums.

Empirical studies in economics have become more revealing and rewarding with modern software. I hope you will enjoy this type of study with the help of this book.

C. Sun
August 2015

Part I  Introduction

Two chapters are used to introduce the whole book and the software R. This will help readers understand the structure and design of the book, and install the software R for the upcoming data analyses.

Chapter 1 Motivation, Objective, and Design (pages 3 – 11): The motivation, objective, and design of the book are presented. Several principles (i.e., incrementalism, project orientation, and reproducibility) are adopted in designing the book. Some user guides are provided at the end.
Chapter 2 Getting Started with R (pages 12 – 27): How to install base R, contributed packages, and an R editor is described first. Then the R help system and the differences between R and commercial software products are discussed.

R Graphics Show Box 1: A world map with two countries highlighted. R has rich functions for drawing maps. See Program A.1 on page 513 for detail.

Chapter 1  Motivation, Objective, and Design

At present, the way of teaching and learning in economics is inadequate in integrating course work (e.g., statistics and economics), research methodology, and software usage for specific applied analyses. This has provided me with strong motivation in preparing this book. The overall objective is to help students conduct an empirical study over a reasonable period, e.g., four months. In achieving that, several complete sample papers in economics are used, the learning process is divided into several stages (i.e., clickor, beginner, wrapper, and contributor), and the software R is adopted for all data analyses. Some study guides are presented at the end.

1.1 Inefficiency, reasons, and motivation

As a professor, I have had the privilege of mentoring graduate students in natural resource economics for years. In a typical graduate program, students take a number of courses related to economics, statistics, econometrics, and research methods. At the same time, students need to finish a research project or thesis to receive a graduate degree at the end. In general, the research component of a graduate program is much more challenging than the course work. In working on a research project, it is common for students to feel excited at the beginning, become frustrated gradually, and even fail after a few years. For those students who do complete a research project successfully for a degree, the output is often of low or unpublishable quality. This observation has been repeatedly confirmed in mentoring my own students, attending defense seminars by graduate students in our department, and listening to student presentations at professional meetings.

Why are graduate students inefficient in conducting scientific research? Apparently, the answer can differ by person or environment. At one time, I thought it might be that the studies conducted by graduate students were too difficult. This may be true for the few individuals who have produced theses of high quality. However, after being in this profession for so many years, I feel that on average most studies conducted by graduate students have low to moderate demands on study design, statistical analysis, and writing. Using much less time (e.g., four months), established professionals can conduct many of these types of studies with better quality, often publishable in a refereed journal.

For quite a while in mentoring my students, I also thought that the weak background and qualifications of an individual student might be the main reason behind the low research productivity. I know a number of graduate students in our department who did not take any calculus or linear algebra courses before they started a graduate program in economics here. Those students had a very difficult time in taking graduate courses (e.g., microeconomics). Based on these experiences, I have spent a considerable amount of time in recruiting and evaluating graduate applicants in recent years.
Those without basic qualifications have been filtered away and rejected at the beginning. This is beneficial for both the graduate school and the applicants because valuable resources and time can be saved. Still, the widespread inefficiency in scientific research by graduate students that I have observed cannot be completely explained by the low qualifications of some individual students. I believe, or have the faith as a professional, that graduate students in economics as a group have appropriate qualifications to finish their graduate studies and grow into a new generation of economists. Thus, I have been looking for some common reasons behind the low research productivity of graduate students.

After many years of observing and reflecting, my conclusion is that the prevailing way of teaching and learning lacks integration among course work, research methods, and software usage. This is especially true as many economic research projects today involve heavy data analyses. Specifically, courses in economics and statistics have limited coverage of how to apply theoretical and empirical models in applied studies. Some courses may have a reading list of over 50 journal articles. But reading published articles is different from doing a similar study. This is analogous to the relation between reading many novels and writing a novel; they are relevant but fundamentally different activities. A graduate program is supposed to produce writers and creators, not readers only. Furthermore, both introductory and advanced econometrics courses cover many quantitative models (e.g., ordinary least squares in scalar or matrix algebra). Then, students are told that they just need to use a software application like EViews or LIMDEP and push a button to get a regression done. For many graduate students, there has been a big gap and black box between the formulas in textbooks and the regression outputs on a computer screen.

As another contributing factor, published books about software usage offer limited advice on using software applications for specific scientific studies. In most cases, they are just like the manual to a lawn mower. A husband with limited knowledge of landscaping purchases a powerful mower with a detailed manual, and then believes that he can show off a beautiful lawn to his wife and kids in a few days. I was exactly like that when we purchased our first house: my wife delegated all the authority and responsibility for yard work to me, and I started working on our yard for the first time. In addition, many software books do not follow basic principles of learning. For example, at the very beginning (e.g., an introductory chapter), one book for the software R presents detailed steps for how to create a package in R. However, most R users do not write a package at all in the first few years. If there is such a need, it often arises after a student has learned the basics of R through several applied analyses. Finally, most software books use small and fragmented examples in demonstrating how to use existing functionality. There is a lack of systematic presentation of how to use a software application to conduct a study from the beginning to the end.

Scientific research methodology has been the subject of a number of published books. A search through online bookstores can generate several dozen books in this category. They often cover principles related to scientific research, such as literature review, study design, or grantsmanship.
However, they generally do not cover specific economic models or software usage. As a result, they offer only limited advice to students in conducting specific empirical studies. A big gap still exists between the principles prescribed in methodology books and a real research project in economics. To summarize, I believe that the lack of integration among course work, software usage, and research methods is the main reason behind the low research productivity for students in applied economics. After reading many books on economics, statistics, software applications, and research methods, students may still experience significant difficulties in getting started with their research projects or finishing them on time. Therefore, there has been a critical need to combine these components in mentoring students for their own empirical studies.

Table 1.1 Three versions of an empirical study
  Version      Focus      Description
  Proposal     Idea       Soul, design, and guide of an empirical study
  Program      Structure  Results in tables and figures from data analyses by software
  Manuscript   Detail     Final publication with all details

1.2 Objective and design

The objective of this book is to teach young professionals how to conduct an empirical study in economics over a reasonable period, with the expectation of four months or less in general. Based on my mentoring experience, this is highly achievable if students follow the methods presented in this book, work through the exercises diligently, and go through the training completely. To achieve the objective, I have designed and prepared the book with three principles or considerations. They are all contained in, or implied by, the title of this book: (A) Incrementalism and growing up by stage; (B) Project-oriented learning with complete samples; (C) Reproducible empirical research in economics with R programming.

1.2.1 Principle A: Incrementalism and growing up by stage

The first, and perhaps the most important, principle in my research philosophy is incrementalism. Scientific research in economics has become more challenging as our economic system has become more sophisticated. For a specific area in economics, conducting an empirical study also involves many steps in a long process. Thus, we need to be realistic and take an incremental approach in learning how to conduct empirical studies. Incrementalism allows us to build up our skills and grow gradually over the long term. For an individual project and in the short term, incrementalism can be implemented through itemization: dividing a big task into small pieces, getting organized and focusing on one piece at a time, and assembling all the outputs together at the end.

Specifically, the principle of incrementalism is inherent in the whole book and reflected in two aspects. First of all, the long production process of an empirical study is divided into three stages: proposal, program, and manuscript, as presented in Table 1.1 Three versions of an empirical study. The proposal version of an empirical study provides the guide for the whole study. The program version, often written in a computer language like R, generates the final tables and graphs for the study. The manuscript version presents the study to readers. Furthermore, the programming work behind a program version is divided into four stages: clickor, beginner, wrapper, and contributor. A clicker is a device that makes a clicking sound, usually when activated by a person on purpose.
It seems that the word clickor does not exist at present, so I use it here to refer to a person who clicks a device such as a computer mouse. As a clickor, students just click pull-down menus in a software application to conduct data analyses. The other stages are all related to programming, with increasing demands on skills. A beginner uses predefined functions in a software application like R in a very basic way. The gain for a beginner over a clickor is small, but a beginner is on the track of growing up. A wrapper uses predefined functions extensively and begins to write some new functions. A contributor writes a large number of new functions, formats them into a package, and thus extends the software to allow himself and others to do more work. A researcher can move along the four related stages over time and grow up incrementally.

1.2.2 Principle B: Project-oriented learning with full samples

The second principle is to adopt a project-oriented learning approach. Many graduate students do not conduct any research until the third or even fourth year of their graduate programs. They prefer to have solid training through courses first before they start a project seriously. However, I believe no time is better than now. The best way of getting started in economic research is to have a real research project and learn from the process as soon as possible. A project is the best way to organize our minds and inspire us toward more exploration in a related area.

Following this principle, three publications are adopted as sample papers in this book (i.e., Sun et al., 2007; Wan et al., 2010a; Sun, 2011). The selection of these sample papers is purely personal, as I am the author or coauthor of them and have all the raw data. A similar set of papers could provide the needed support to achieve the book objective too. These sample papers are used to demonstrate the whole research process: generating a research idea, collecting data, estimating an empirical model, and finally writing a manuscript. This will give students a complete, instead of fragmented, picture of empirical studies. These three sample papers and the related data will be used as much as possible in the book.

Specifically, the first sample study is Sun et al. (2007). A binary logit regression is employed to analyze the decision of liability insurance purchase among sportspersons in Mississippi. This study is used for learning the fundamental steps for empirical studies and R programming skills. The second sample study, by Wan et al. (2010a), is a demand system model for import competition of wooden beds in the United States. It is used to demonstrate how to write new functions in working with a more complicated economic study. The last sample study, by Sun (2011), is about asymmetric price transmission between two major wooden bed suppliers in the United States. Through working on Sun (2011), an R package called apt was developed and published. In comparison, Sun et al. (2007) is easier than Wan et al. (2010a), and in turn, Wan et al. (2010a) is easier than Sun (2011). By model, the binary logit regression employed by Sun et al. (2007) has become a standard model in any introductory book of econometrics. The almost ideal demand system (AIDS) model and linear cointegration analyses in Wan et al. (2010a) are a little more complicated, as they involve static and dynamic models for a group of supplying countries.
The threshold cointegration analysis employed in Sun (2011) has been developed mainly in the most recent 20 years. By data, Sun et al. (2007) uses a survey output with typical features of cross-sectional data. The other two studies use time series data, and in particular, the data manipulation for a group of countries in Wan et al. (2010a) presents a great opportunity to learn relevant skills. Overall, this set of papers, as carefully selected, will allow us to learn research methods and programming skills simultaneously and build up various skills incrementally.

In addition to the three sample papers used explicitly in the main text, another three sample papers are selected for the exercises in the book (Sun, 2006a,b; Sun and Liao, 2011). Raw data for these papers are available to readers. Thus, the sample studies in the main text can be followed, and the studies in the exercises can be reproduced too. Briefly, Sun (2006a) is a roll call analysis of the voting records for a bill related to forest management in 2003. The model used is a binary logit regression for roll calls, which is popular in political science. Sun (2006b) is an analysis of statutory reforms on landowners' liability in using prescribed fires on forestland. The model is an ordered probit model. Sun and Liao (2011) is an assessment of the effects of litigation under the Endangered Species Act on the values of public forest firms. The model employed in this study is a typical event analysis in finance.

Along the same line of project-oriented learning, most R sample codes in this book are organized in a list format and focus on a specific issue. In contrast, many published R books embed short sample codes in the main text or use very small examples. By using complete sample papers or focusing on a selected issue (e.g., character string manipulation), the R sample codes in this book present a more complete picture of a focal issue. The drawback, if any, is that readers will probably need to run the sample codes on a computer during the reading to have a deep understanding of the description in the main text.

1.2.3 Principle C: Reproducible research with programming

The software R is adopted for conducting all statistical analyses in this book. I have used a number of commercial software products in the past for empirical studies. Each of them is advantageous in some aspects, such as handling large data sets or truncated dependent variables. Thus, it is hard for one software product to dominate the others completely. However, it is generally not practical for a researcher to purchase many statistical software products, or even if affordable, to become an expert in each of them. The realistic approach is to choose one software product as the main tool and use others as supplements. R is a free but powerful software product for computation and graphics. R is better than many commercial software applications, based on my own experience. The main advantage of R is its open source code and online community for help and learning. This allows a much deeper understanding of statistical analyses, a feeling that I have never had from using commercial software products with secretive procedures and codes. In addition, R is a programming language. Programming doubles or triples productivity over clicking menus in many cases. Finally, for many even moderately sophisticated econometric models used today, programming has become a must. This will be demonstrated in the sample study of Sun (2011).
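To give a flavor of what a program version in R looks like, the short sketch below simulates a small data set, fits a binary logit model with the predefined function glm(), and writes the coefficient table to a file so that every reported number can be regenerated by rerunning the script. This is only an illustration under assumed, made-up variable names and simulated data; it is not the actual program for Sun et al. (2007), which is developed in Part III.

# A minimal sketch of a reproducible analysis in R (simulated data; illustrative only)
set.seed(100)
n <- 200
demo <- data.frame(income  = rnorm(n, mean = 50, sd = 10),    # made-up covariates
                   acreage = rnorm(n, mean = 120, sd = 30))
xb <- -6 + 0.05 * demo$income + 0.03 * demo$acreage
demo$insure <- rbinom(n, size = 1, prob = plogis(xb))          # simulated purchase decision

fit <- glm(insure ~ income + acreage, data = demo,
           family = binomial(link = "logit"))                  # binary logit by a predefined function
round(coef(summary(fit)), digits = 3)                          # coefficient table on the screen
write.csv(coef(summary(fit)), file = "table1.csv")             # the same table written by code, not by hand

Rerunning such a script reproduces the table exactly, which is the essence of the reproducibility emphasized throughout the book.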
For every sample study used in this book, a program version is included. All tables and figures in the published version can be reproduced in a few minutes. For the sample papers adopted for exercises, their program versions are available to class instructors. Reproducibility through programming will greatly facilitate the learning of empirical studies by students. By explicitly documenting research steps through R programming, a researcher can build up skills gradually, step by step. After a few years and through several studies, one can improve research efficiency greatly.

1.3 Book structure

The book has two major components: research methods and statistical programming. Some statistics and econometrics are covered to help understand the sample papers, but they are not the focus. As a result, approximately 40% of the book contents are for research methods, 50% for R and programming, and 10% for statistics. There are six parts in total, as shown in Figure 1.1 The structure of the book. Part I Introduction contains an overview of the whole book and a brief introduction to R. Part II Economic Research and Design is devoted completely to research design, including study design and reference management. Reference management is demonstrated through the reference software applications EndNote and Mendeley.

[Figure 1.1 The structure of the book. Elements of the diagram: An Empirical Study (I); Proposal Version (II); Program Version, via software, as Clickor/Beginner (III), Wrapper (IV), and Contributor (V); Manuscript Version (VI).]

Part VI Publishing a Manuscript is about writing a manuscript and submitting it for publication. Furthermore, in presenting programming skills, research methods are emphasized in various places. The three selected sample papers are utilized extensively in the presentation. In addition, several publications related to the selected sample studies are also analyzed to elaborate research design techniques.

The three parts in the middle correspond to three main stages in the learning process: programming as a beginner, wrapper, and contributor. The chapters are organized to present typical learning needs and research techniques. Part III Programming as a Beginner is mainly about how to use predefined functions in R for data manipulation, base graphics, and regression analyses. Part IV Programming as a Wrapper covers how to extend the existing features by writing new functions. Part V Programming as a Contributor moves further along the spectrum by focusing on how to create a new package for one's own use or public sharing. Since R as a language is powerful but sophisticated, it is difficult, though not impossible, to cover all details. Thus, the main text in the three parts related to programming highlights key techniques, especially those widely adopted in practical research. In a few chapters, many sample R codes are presented so students can learn from these samples quickly. The arrangement also allows readers to explore further by following these sample programs, or to copy them for their own projects.

1.4 How to use this book

The book can be used either as a textbook for intensive work over a short period or, alternatively, as a casual self-study book. Each way has its pros and cons. By analogy, this is similar to the two ends of a swimming pool. Before diving into the water, a swimmer should always make sure of the location, so that he will not drown in water that is too deep or break his neck in water that is too shallow.
1.4.1 A guide for intensive use as a textbook

In using this book as a textbook, I offer a three-credit-hour course in one semester of 16 weeks. In total, the course has 32 meetings, 1.5 hours each, and 48 contact hours. Outside the classroom, students are expected to spend another 150 hours to finish exercises and a data analysis project. In sum, a total of about 200 hours over a few months is needed for intensive training.

More specifically, there may be a need to reallocate the lecturing time between the two main components of the book: research methods and programming. In the book, these materials are arranged by following the typical research flow, which is necessary for presentation clarity. In lecturing, however, they may be adjusted because of the fundamental characteristics of these materials. In general, research methodology is relatively easy to understand but slow to change our research habits and behavior. Presenting research methods alone for an extended period (e.g., a month) can also be boring and unfruitful. Teaching research methodology also requires intensive mentoring from instructors, often in the form of critical or even negative comments. In contrast, materials for data analyses and programming are more appropriate for self-learning, because they can be confirmed against standard answers, expectations, or common sense. Thus, the time for the research methods part can be less than four weeks on first coverage, and these materials can then be reemphasized in the programming part with the sample papers. Mixing the two components to a certain degree in lecturing can better engage students in the classroom. If needed, the sequence can also be rearranged; R programming can be covered first, and research design and manuscript preparation can be covered at the end.

In offering the class, I ask students to finish many small assignments and one big data analysis project during the semester. The assignments are based on the exercises listed in the book. The data analysis project is based on one of the three sample papers prepared exclusively for exercises (i.e., Sun, 2006a,b; Sun and Liao, 2011). As a result, students are required to duplicate the published tables and figures by R programming for one selected study. Data for the three studies are available to students in two formats: raw data as Microsoft Excel files and the final data used for regression in R data format. All of them are included in the erer library. Tables and graphs in the published versions of these papers should be reproduced in the end.

There are some considerations behind using the three sample studies for exercises. In general, it is not recommended that students start a completely new research project to meet the requirement of the data analysis project. This is because a student's study design probably will not work, the student cannot collect the needed data in a short period, or the model employed is too complicated to be analyzed in a semester. By using a published paper as the base for the exercises, the general principle of confirmation in learning is followed. Students are expected to reproduce one of the selected papers by stage, with the answers known in advance.

1.4.2 A guide for self-study

A buffet is a type of meal that allows diners to choose food items, quantities, and sequences freely. With a fixed amount of payment, diners can eat as much as they like.
The downside is that the appetite of a diner can be easily ruined by having one pound of ice cream at the beginning, after which he complains about the food quality during the rest of his stay. Reading a textbook without any guide is like having a buffet dinner.

This book is highly structured. This means that many materials are covered sequentially, and sometimes less elegantly at first glance. Thus, a casual or impatient reading of the book can be frustrating if one does not have enough time or discipline to follow the main structure, or just wants to use this book as a dictionary. For efficient self-study of a large book, discipline, patience, and time are all necessary inputs. A few students have told me the book seems great, but the problem is that it needs careful reading. When I first heard that, I just ignored it completely, because there is no good book in the world that does not require committed reading. However, after hearing the above comment several times, I have gradually realized that casual reading of materials on the Internet or on mobile electronic devices may have changed the learning habits of many people. Regardless of the trend, it should be emphasized that this book is prepared like a traditional textbook. Materials are connected and presented in a logical way. For many key points highlighted in the text, students need to read them several times, reflect for a while, and reinforce the learning through exercises.

In self-reading this book, it is quite possible that a reader will casually skim the parts on research methods (i.e., Parts II and VI). This is reasonable because these materials are easy to understand. However, without any mentoring, it is difficult to improve one's skills in either research design or manuscript preparation. Still, I urge students to digest these research methods and incorporate them into their own research activities. The three parts on R programming can be read relatively independently from other materials. Readers are strongly encouraged to have a good understanding of the individual sample papers employed. This is an extra cost in comparison to reading those published R books on the market with fragmented samples. The benefit of combining complete sample papers and R programming will become evident to readers in the process. For readers who need a quick answer to a programming problem within three minutes, note that there are many R code programs and an extensive index of R functions included in the book. The availability of these materials is the result of many conscious efforts to make the book more convenient to users. Thus, check the list of indexes and R code programs when it is needed.

1.4.3 Materials and notations

To students, all the sample programs listed in this book are available individually. They are included in the erer library and should be copied to a local drive for exercises. How to install the package and copy the materials is explained in Section 2.2.1 Recommended installation steps on page 17. In addition, the three sample studies (i.e., Sun et al., 2007; Wan et al., 2010a; Sun, 2011) for the main text and the three sample studies for exercises (i.e., Sun, 2006a,b; Sun and Liao, 2011) are all published journal articles. If a user has no access to any of them, I can share my personal copy for educational purposes by email. The following materials are available to class instructors only.
A sample syllabus is available with a detailed course design for a typical offering in a semester. A total of about 800 slides for 28 lectures are prepared in LaTeX. They are all in PDF and available in several layouts: one slide per page, six slides per page, and three slides with note areas per page. For each of the three sample studies for exercises (i.e., Sun, 2006a,b; Sun and Liao, 2011), a program version is prepared and available. In addition, the answers to all the exercises in this book are also available to instructors.

Several formats have been adopted consistently, with the sole objective of increasing the usability of the book. Without these special formats, the book would look too plain. However, I have been conservative and cautious in using any fancy feature because excessive use may become distracting to readers. Specifically, for math notations, the rules promoted in Abadir and Magnus (2005) are adopted. For example, ϕ denotes a scalar function, f a vector function, and F a matrix function. Similarly, x denotes a scalar argument, x a vector argument, and X a matrix argument. The titles of all tables and figures are shaded, so they can better stand out from the main text. When a figure looks fragmented, a frame is also added to the whole region. All the tables and figures float to the top of a page, which is feasible as none of them is more than a page long. In addition, some descriptions in the text may be important and readers may need to use them as a reference later on. If that is the case, a shaded box is used to highlight the relevant part.

All the R programs have been prepared with the Tinn-R editor and then inserted into the main text. The outputs from these R programs are too large to be fully reported. Thus, most R programs are followed by some selected outputs. In formatting these R programs in the book, two lines are used at the top of each program for a stronger separation effect, and one line at the bottom for a weaker effect. Line numbers are added outside of the program. They are not a part of the program per se, but are added to facilitate references in the main text. Within the R programs, commands are in a typewriter font, and comments are in a slanted typewriter font. Comments in the selected outputs, however, are reported without the slanted format (because the verbatim environment in LaTeX is used). R commands in the main text are also formatted with a typewriter font. R functions in the main text are indicated with a pair of parentheses, e.g., plot().

LaTeX has a well-designed cross-reference mechanism, which has been used extensively in the book to provide a label number, title, or page number, e.g., Section 1.4.3 Materials and notations on page 10. When used in a cross reference, the following words are in bold type: Table, Figure, Program, Part, Chapter, Section, Equation, and Exercise. This may be a little distracting in some places when the number of words in bold type is large, but overall the format allows readers to quickly identify the item of interest. The benefit is bigger than the cost to me, so this feature has been adopted. The page number associated with a referred item is included when the item is a few pages away from where it is called. Words or phrases that need to be emphasized in the main text are in italic type. All the figures generated from the R programs in this book have been adjusted further before they are inserted in the book.
One change is to use the Computer Modern font, the default choice of LaTeX. The other change is to embed the font in the graphic output, which is a technical requirement for book publishing. Several R packages have been used for this purpose, including extrafont. As this is quite technical and time-consuming, the relevant code has been excluded from all the R programs.

1.5 Exercises

1.5.1 Understand sample papers. Read the sample papers that will be used in the main text (i.e., Sun et al., 2007; Wan et al., 2010a; Sun, 2011). Become familiar with these papers. This may take at least two hours per paper.

1.5.2 Select an empirical study. Skim the sample papers that will be used as the base for exercises (i.e., Sun, 2006a,b; Sun and Liao, 2011). Select one of the three papers as the base for many exercises that will be required in the remaining chapters of this book. Alternatively, one can select or use a paper from the literature or a finished project if raw data are available; this is not recommended if no mentoring is available.

Chapter 2 Getting Started with R

If you have never used R before, this chapter will help you get started. Brief notes about installation of the base R, R packages, and editors are presented first. Then, R is compared with commercial software. Some misconceptions about free software like R are also discussed. At the end, the help system for R is described.

2.1 Installation of base R, packages, and editors

Installing R and packages may differ slightly on different operating systems, and various issues and solutions are discussed in online forums. I am a Microsoft Windows user, so the following installation notes focus on this environment. In general, there are three types of installation needs: the base R, R packages, and a selected R editor. The latter two are optional. Additional R packages may be needed if one utilizes extended packages for data analyses or graphics. If one prefers an interface different from the one offered by the base R, then a number of editors are available.

2.1.1 Base R

A copy of the base R is available from the Comprehensive R Archive Network (CRAN) on the Web site of R. As of July 2015, it takes five clicks at http://www.r-project.org to download it: CRAN (at the left column) ⇒ A mirror site choice ⇒ Download R for Windows ⇒ base ⇒ Download R 3.2.1 for Windows. The base R can be installed in the default folder (e.g., C:/Program Files/R-3.2.1) or in another selected folder (e.g., C:/myprogram/R-3.2.1). The latter is recommended if LaTeX is used along with R for document preparation. The reason is that a space is allowed in a folder name under Microsoft Windows (e.g., "Program Files"), but it is not recognized in LaTeX and often causes trouble. In addition, in the middle of the installation process, one needs to choose whether to install the 32-bit, 64-bit, or both versions of the R software. In general, if your computer has a 64-bit version of Windows, then it is recommended to install both versions of R. To understand the difference between 32-bit and 64-bit systems and learn how to check the version on your computer, search the Internet with relevant keywords. Once R is installed, open the R graphical user interface (RGui) from the start menu of Microsoft Windows.
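Once a session is open, the running build can also be confirmed from within R itself. The following lines are a minimal sketch; they only query the current session and are not part of any sample program in the book.

R.version.string             # full version string of the running R
R.version$arch               # "x86_64" for a 64-bit build, "i386" for a 32-bit build
.Machine$sizeof.pointer * 8  # 64 on a 64-bit build, 32 on a 32-bit build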
[Figure 2.1 R graphical user interface on Microsoft Windows]

In Figure 2.1 R graphical user interface on Microsoft Windows, the normal working status of R is shown. In this specific case, the 32-bit version is called, as revealed in the initial message within the window and also at the corner, i.e., RGui (32-bit). In general, when a statistical software application is in operation, it can contain or show many windows. These include an editor window for commands, a data window for data display, an output window for results, a log window for working status or trace, an error message window for error or warning messages, and a navigator for an overview of all relevant components (similar to Windows Explorer, or any resource management window). The base R has two major types of windows: the R console and editor windows.

In opening a new session of R, the default window shown on the screen is the R console. This window can show command lines submitted from an editor window, data imported or created in the current session, analysis results, and any error or warning message. The console allows users to receive commands from an R editor window and also to type commands directly. This is convenient if some commands are for testing only and do not need to be saved. Actually, I often use the R console as a calculator when I am around my desktop computer; it is more powerful than any hand-held calculator I have. Thus, the R console is truly interactive.

An R editor window can be initiated by clicking the menu of File ⇒ New script, or by opening a saved script through File ⇒ Open script. Commands in the R editor window can be modified, saved on a local drive, and rerun later as a program. A saved file has an R extension, e.g., LogitStudyProgram.r. This is a text document, so any word processor can read it. Multiple editor windows can be opened at the same time, and commands from them can be submitted to the R console. For empirical studies in economics, saving commands through one or several editor windows is an efficient way to organize analysis steps. This will be further elaborated and emphasized in later chapters. The R console and editor windows can be easily arranged on a computer screen by clicking the menu of Windows ⇒ Tile Horizontally or Tile Vertically. If you prefer an editor window to stay on the left side, then place the mouse cursor in this window and click Tile Vertically again.

In submitting commands from the console directly, only one line can be submitted each time by hitting the Enter key on the keyboard. Note that one line in the console window can have multiple commands. In submitting commands from an editor window, there are two major ways: one is by line and the other is by block. When the cursor is anywhere on a command line, use Ctrl + r to submit the current line. To submit several command lines, highlight or select them and then use Ctrl + r.

The base R interface is relatively simple. When R is installed, one can try a few simple commands in an R session to make sure that R is appropriately installed, as shown in Figure 2.1. Furthermore, explore the functionality offered through the menus.
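For instance, a quick test along the following lines confirms that the console, predefined functions, and the graphics device all respond. This is only a minimal sketch; any short commands will do.

1 + 2 * 3            # use the console as a calculator
sqrt(144)            # predefined functions are available immediately
x <- c(2, 4, 6, 8)   # create a small object
mean(x)              # summarize it
plot(x)              # a simple graph should open in a new window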
The most frequently used keyboard shortcuts are listed as follows:

Ctrl + a   Select all command lines
Ctrl + r   Run a single command line or a group of command lines
Ctrl + l   Clear the console
Ctrl + f   Find text
Ctrl + c   Copy text
Ctrl + v   Paste text
Ctrl + h   Replace text
Esc        Stop the current computation

2.1.2 Contributed R packages

The distribution of base R is very lean, only about 62 megabytes (as of R 3.2.1 in July 2015). This allows a fast installation and efficient running. Many functions available in R are not installed routinely, and even after being installed, they are not loaded into a working environment automatically. These functions are available in R as packages or libraries. A package is a set of functions, help files, and data files that have been combined for a topic (e.g., the package AER for applied econometrics with R). As of July 2015, the CRAN package repository features about 6,800 packages. To find out which packages are installed and loaded, run sessionInfo() in an R session. This also reports the version of R and the operating system for which it is compiled. The HTML help facility can be initiated by help.start(), and it gives details of the packages installed on a computer.

There are several ways to install R packages. First, installing a package is straightforward from the RGui under the base R. As shown in Figure 2.1 R graphical user interface on Microsoft Windows, first click the menu of Packages ⇒ Install package(s). Then navigate the list, locate the package names of interest, and follow the instructions that pop up on the screen. Alternatively, one can use the function install.packages() to install packages. This is more efficient than clicking the menu buttons, so I suggest this approach. Three common situations are: installing packages available at the CRAN site; installing a package from a local drive with a tar.gz source file; and installing a package from the R-Forge site. In comparison, installing packages from the CRAN site directly should be the default method. Installing a package from a local drive is generally for experienced users only. The following sample commands can be modified and used for specific packages.

install.packages(pkgs = "AER", dependencies = TRUE)
install.packages(pkgs = c("apt", "erer"), dependencies = TRUE)
install.packages(pkgs = "C:/erer_2.4.tar.gz", repos = NULL, type = "source")
install.packages(pkgs = "test.package",
  repos = c("http://R-Forge.R-project.org", getOption("repos")))

Installing a package available from the CRAN site is the easiest approach. When a package from CRAN is installed as above, dependent packages can be installed automatically by setting the dependencies argument to TRUE. For example, the AER library now depends on a number of packages, and all of them can be installed through the above command. This method can also install several packages together, e.g., apt and erer through the vector c("apt", "erer"). In addition, the above commands for frequently used packages can be saved in a file as an R program. When the base R is reinstalled or updated later on, these contributed packages can be reinstalled by running the saved program once, instead of typing the installation commands repeatedly. The number of packages I have been using is less than 30, so I have maintained and updated a one-page program on my local hard drive. Installing a package from a local drive under Microsoft Windows is feasible too.
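Before turning to local installation, here is what such a saved installation file might look like. This is only a minimal sketch: the package names are simply the examples used in this chapter, and the if () check is an optional refinement that skips packages already present on the computer.

# A small installation program that can be rerun after R is updated
myPkgs <- c("AER", "apt", "erer", "ctv")       # list your own packages here
for (p in myPkgs) {
  if (!requireNamespace(p, quietly = TRUE)) {  # skip packages already installed
    install.packages(pkgs = p, dependencies = TRUE)
  }
}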
Depending on the format of the package, this can be less straightforward than installing a package from the CRAN site. A tar.gz source version of a package is most common, and it can be installed through install.packages(). This method is needed if you have a personal package that is not shared with anybody, or you receive a package from colleagues next door to you. There are two packages for this book: erer for "Empirical Research in Economics with R" and apt for "Asymmetric Price Transmission." I update them frequently but do not upload them to the CRAN site every time, so there is a need to install them from my local drive. It should be emphasized that installing a local package cannot automatically install those packages on which the local package depends. They need to be installed before the local package can be installed. For instance, erer depends on several packages, including systemfit, lmtest, tseries, ggplot2, urca, and MASS; they should be installed first from the CRAN site before a local installation of erer. Therefore, installing a package manually from a local drive requires information on package dependencies, which is available in the description file of the unzipped source package.

Installing a package from the R-Forge site is not uncommon. Some packages on R-Forge are under intensive development and testing, and they have not been published on CRAN yet (http://r-forge.r-project.org). For example, if there is a package called test.package on the R-Forge site, then it can be installed by following the sample code shown above.

As the number of packages has become so large, the R site has a list of CRAN task views by subject, e.g., Bayesian, econometrics, and finance. Each task view summarizes the relevant packages and provides a list of packages at the end. For example, the task view for econometrics now includes about 100 packages. The packages in a task view can be installed as a group. To install them automatically by view, a package named ctv needs to be installed first. Then the packages in a task view can be installed via the commands install.views() or update.views(), as shown below.

install.packages("ctv")
install.views("Econometrics")
update.views("Econometrics")

Finally, after a contributed package is installed, it needs to be loaded before its functions or data sets can be used. A package can be loaded with the functions library() or require(): library(erer) or require(erer). Both functions load a package and put it on the search list. require() is mainly designed for use inside other functions; it returns FALSE and gives a warning (rather than an error, as library() does by default) if the package does not exist. In the middle of an R session and without closing the R interface, one can unload a package by using the function detach(), as shown below.

library(erer)                                 # Load erer
sessionInfo()                                 # Confirm loaded
detach(name = "package:erer", unload = TRUE)  # Unload erer
sessionInfo()                                 # Check unloaded

[Figure 2.2 Interface of the alternative editor Tinn-R]

2.1.3 Alternative R editors

The editor in the base R distribution, as shown in Figure 2.1 R graphical user interface on Microsoft Windows on page 13, is a simple text editor. Given its limited editing ability, many editors have been developed for R, as summarized on several Web sites like http://www.sciviews.org/_rgui. These software applications or editors do not change the functionality of the base R at all, but make R friendlier to use.
By analogy, this is similar to the relation between a cell phone and the various phone shells you purchase separately as accessories. In addition, most of these applications for R are completely free.

Among these editors, one of the most appealing to me is Tinn-R for Windows. It is a small free program with many improvements over the simple R editor; the difference is like a color television versus a black-and-white one. The cost of enjoying these benefits is that users need time to install and configure it, and furthermore, to navigate the menus and become familiar with its multiple utilities. Tinn-R is available from the link at www.sciviews.org/Tinn-R. The interface is shown in Figure 2.2 Interface of the alternative editor Tinn-R. In general, three windows are most frequently used: a script window for commands, an output window, and a log window. It also has a navigation window for resource management. Keyboard shortcuts are available for various actions, and a user can customize existing shortcuts or create new ones based on personal preferences. For example, within a Tinn-R editor window, pressing Ctrl + ( on the keyboard generates a pair of parentheses, i.e., (). A comment sign of # can be added to the beginning of each line of a highlighted command block with the keystroke Alt + c, and similarly it can be removed by Alt + z. Different colors and fonts can be used for commands and comments. Line numbers can be added at the side to facilitate reading. Multiple program files are organized by tabs at the top of the window.

Another editor that has become more popular is RStudio (http://rstudio.org). It is easier than Tinn-R to configure at the beginning, and it focuses mainly on R. RStudio has been under very intensive development in recent years. An interface of RStudio on a Microsoft Windows system is shown in Figure 2.3 Interface of the alternative editor RStudio.

[Figure 2.3 Interface of the alternative editor RStudio]

2.2 Installation notes for this book

In using this book for several courses and workshops, a number of installation problems have occurred. While most of them are attributable to a lack of experience, it is worth some space here to list and emphasize the main steps. Readers should follow these steps closely in getting ready for the coming data analyses in the book. Some R concepts may not be very clear at this point, but they are all included here for completeness.

2.2.1 Recommended installation steps

Step 1 Install the base R. This is the engine of the software and should be installed first.

Step 2 Install an alternative R editor (optional). The base R can work independently. If one is satisfied with the interface of base R, then skip this step. An alternative editor like Tinn-R or RStudio provides more convenient utilities. Furthermore, depending on the editor chosen, connecting the base R and the editor (e.g., Tinn-R) may take some effort. If any problem arises, follow the manual provided by these editors closely or search for a solution on the Internet.

Step 3 Test the base R or alternative R editor. If the base R and an alternative editor are well installed and connected, then users should be able to reproduce the appearance shown in Figure 2.1, 2.2, or 2.3. Note there are three command lines in testing the software. The first line is x <- 1:5; x, the second line is mean(x), and the third line is y. Create a new editor window to hold the three command lines.
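As a minimal sketch, the three test lines would look like this in the editor window. The comments are mine, not part of the test; the third line presumably triggers an error on purpose, which confirms that error messages reach the console.

x <- 1:5; x   # create a vector and print it: 1 2 3 4 5
mean(x)       # should return 3
y             # no object named y exists yet, so an error message should appear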
There are two ways of submitting the commands, either by line or by block, and you must make sure that both ways work. Thus, submit the test commands by line first, and then highlight all of them and submit them as a block. In the base R, submission can be done through a button on the interface or by the keystroke Ctrl + r. In an alternative editor, there should be similar buttons or keyboard shortcuts. Sometimes an error message within the alternative editor interface is a problem of the editor per se, not a problem of the base R. This can be verified by testing the same code with the base R interface.

Step 4 Install the erer package. Have a computer connected to the Internet, and install it through the default method, as described earlier in this section. Relevant packages will be installed automatically. For example, to install the erer package, run the following line: install.packages(pkgs = "erer", dependencies = TRUE).

Step 5 Find a copy of all data and sample R program files used in this book. There are two ways to locate them: one on your local drive and the other through the Internet. Once R and contributed packages are installed on a computer, a folder is created automatically to hold documents from the installed packages. For example, I can find documents related to the erer package on my computer at C:/CSprogram/myRsoftware/R-3.2.0/library/erer/doc. Close to 100 raw data and sample R program files used in this book are included there. Alternatively, one can download a zipped copy of the erer package from the Internet directly. Search the Internet for "package erer" and save the latest source package, e.g., erer_2.3.tar.gz. Unzip this document and all the data and program files should be available now.

Step 6 Make a copy of all data and program files in a new local folder. After these files are located, create a new folder on your computer with the following name: C:/aErer, and then make a copy of all the files. This will allow you to use all the files without making any additional modification. For users with an operating system other than Microsoft Windows, a similarly named folder can still be created, but further changes to the file directories in some R sample programs may be needed.

Step 7 Test one R sample program. Open base R or an alternative R editor (e.g., Tinn-R), and then open a sample program in your new folder, e.g., C:/aErer/r072sunSJAF.r. Select all the commands and submit them as a group. If it runs well without any error message, then the installation is successful and you are ready for all the coming data analyses.

2.2.2 Working directory and source()

To run the R sample programs in this book smoothly, a basic knowledge of the working directory is needed. Each R code demonstration in this book has a corresponding file saved in the data folder of the erer library. For example, Program 7.2 The final program version for Sun et al. (2007) on page 114 is saved as r072sunSJAF.r. All these R programs have been created and saved under the directory C:/aErer on my computer. In addition, all the sample data sets used in this book, e.g., RawDataIns1.csv, are also saved in the C:/aErer folder. This is the simplest and most convenient folder name I can think of. When a sample program needs to communicate with a local drive for data input and output, a directory (e.g., C:/aErer) has been specified inside the program through either a local or a global specification.
For example, read.table(file = "C:/aErer/RawDataIns1.csv") uses a local directory specification, so the directory information is only effective for the read.table() function. The commands wdNew <- "C:/aErer"; setwd(wdNew) allow for a global specification, so all the following commands are affected. In general, the getwd() function reveals the name of the current working directory in an R session, and dir() lists the files in the current directory.

R beginners often make mistakes in using the forward slash (/) and backslash (\) when defining a directory. Microsoft Windows generally uses the backslash to define a directory (e.g., "C:\aErer"), which is sometimes referred to as the Windows mode. However, R requires the forward slash (e.g., "C:/aErer"), which is called the Unix mode. In R, a backslash \ is known as an escape character. To put a backslash in a string, you must double it. Thus, "C:\aErer" is wrong for directory specification in R, but both "C:/aErer" and "C:\\aErer" are acceptable.

To run an R program like r072sunSJAF.r for Program 7.2 successfully, the data sets (e.g., RawDataIns1.csv) should be saved in the directory specified within the program (e.g., C:/aErer). In addition, some sample programs also use the source() function to call and run another program. For example, Program 8.1 Accessing and defining object attributes on page 129 is saved as r081attribute.r. In Program 8.1, the source() function is used to run the whole program of r072sunSJAF.r to create some objects for further demonstration. If you have followed Step 6 in Section 2.2.1 Recommended installation steps, then there is no need to make any change to run Program 7.2 and Program 8.1. Otherwise, you may need to revise the directories in the files, depending on how and where you have saved the sample R programs and data sets.

2.2.3 Possible installation problems

Various problems can arise in the above steps. Most of them originate from a lack of experience, and in some cases, a lack of common sense (harsh to hear but often true). For example, it is not a good idea to install an alternative editor (e.g., RStudio) before the base R is installed. Make sure your computer is really connected to the Internet if you know your Internet service is often unreliable. As each computer faces a different environment, it is impossible to discuss all possible problems here. Based on my experience, I list several major problems that seem to occur more frequently.

The first one is about mirror sites. At present, R is available at a large number of mirror sites, so users are generally encouraged to choose one that is physically close to them. For example, if you are in Beijing, then it is recommended that you choose one in Beijing, not another in Brazil. However, the server at a mirror site may be unreliable or broken without any warning sign. You can still connect to the site and finish the whole installation process, but some packages may be only partially installed, so strange error messages can occur later on. If that is the case, feel free to choose a site that may be physically far away from you and try your luck again. My own experience is that most sites in the United States are pretty stable.

In connecting the base R with an alternative R editor (e.g., Tinn-R), the most common problem is that submitting a block of commands may not work as expected and can generate an error message.
In general, this is because the alternative R editor is not well connected with the base R. As these alternative R editors have been under constant development and each computer environment is unique, the best way of dealing with these errors is to search the Internet for suggestions.

When you have limited access to the Internet or the signal is weak, it is tempting to download a copy of the package of interest and then install it locally. However, as emphasized earlier in the section, this does not work if the package depends on other packages that have not been installed on your computer yet. Thus, the best approach for beginners is to have a computer connected to the Internet and use the default approach as follows: install.packages(pkgs = "erer", dependencies = TRUE). If your Internet service is slow, then it may take a good amount of time, and you need to be patient in a situation like that. Installing a package locally is recommended for experienced users only.

Close to 100 R sample programs and data files are presented in this book. A number of them need a specification of the directory and folder information for data inputs. I have organized all of them on my computer in the folder C:/aErer. Thus, the most efficient way for a beginner to get started is to create a folder with the same name and then copy these files there. In contrast, one can run these programs in the folder where R installs them initially. This is strongly discouraged, as you would need to make changes to some sample R programs, and the directory where R is installed is often too long to manipulate. Finally, do not ignore any error message, and more importantly, address errors sequentially. In most cases, an earlier error will complicate later operations.

2.3 The help system and features of R

Spirit of R and the help system

There are many different descriptions of R in books and on the Internet. On its own Web site, R is defined as "a free software environment for statistical computing and graphics." My definition is: R is freedom. R allows scientists to conduct statistical computing and draw graphs in a completely new way. It reflects the inner desire of all human beings for freedom: free to speak, free to dance, free to move, and now, free to do programming. Like many of us, I am tired of being controlled by commercial software and waiting for their updates over time. While there are markets for some unique commercial software products, an increasing number of data analyses and programming jobs have been accomplished by users with free software directly.

Inside an R session, the help file for a specific function is at your fingertips. After loading a package like library(erer), typing help("erer"), ?erer, or ?bsTab invokes the help page instantly for the erer package or a specific function. Note there is an Index at the bottom of the help page. Following the link allows users to navigate all the help files of the packages installed on a computer. For each function, there are detailed descriptions of the arguments, and even better, examples that users can copy to an editor window and run immediately. These examples can also be run with the example() function in an R session, e.g., example(bsTab). These examples have passed the checking procedure when packages are built and uploaded to the CRAN site, so they should work on another computer too. This feature allows users to learn a function much faster than checking a hard-copy software manual.
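Put together, a short help-system session might look like the following minimal sketch; it only uses the erer package and its bsTab() function, both mentioned above.

library(erer)      # load the package first
help("erer")       # help page for the package (same as ?erer)
?bsTab             # help page for a specific function
example(bsTab)     # run the examples from the help page of bsTab()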
With this help system, users do not need to worry about the details of a function anymore. To see the actual code for a function, one just needs to type the name of the function at the prompt in the R console, e.g., bsTab. The code for the function will be shown instantly. One can examine the code, or extend and modify it for other purposes.

The online community of R is an amazing place to get help. Every day several hundred questions are posted on sites related to R (e.g., http://www.nabble.com and http://stackoverflow.com). If anybody has any doubt about the power of individual users as a group, look at the development of Facebook, Twitter, and R in recent years. The spirit behind these Internet-related phenomena is the same.

At present, R has seven reference manuals: An Introduction to R, R Data Import and Export, R Installation and Administration, Writing R Extensions, The R Language Definition, R Internals, and The R Reference Index. They are updated with each new release of R. On its official Web site, many free contributed documents are also available. In addition, over 100 books related to R have been published and are available at online bookstores. In particular, I recommend the following books: Dalgaard (2008), Spector (2008), and Everitt and Hothorn (2009) for basic statistics and data manipulation; Bivand et al. (2008) for spatial statistics; Wickham (2009) and Murrell (2011) for R graphics; Kleiber and Zeileis (2008), Pfaff (2008), and Vinod (2008) for econometrics; Braun and Murdoch (2008), Jones et al. (2009), Adler (2010), and Matloff (2011) for practical programming techniques.

R versus commercial software

To a new user, the most appealing feature of R may be its free license. It can be used anywhere with very limited constraints. However, being free may not be the main reason for its popularity in recent years. After all, there are a number of free software products on the Internet and many of them are not appealing at all. During the last several years, the more I use R, the more I like its design and advantages over many commercial products.

In contrast with commercial software, R's greatest advantage is its open-source nature. It allows users to view the code for each function and learn from it. This is impossible for commercial software products because they depend on that functionality for making profits. This key difference results in many divergences between R and commercial software. For example, R can be updated more frequently; commercial software cannot force users to pay for updates every month, so updates usually come every three years. The core program of R is small, so it runs efficiently; commercial software is big in a distribution because it is a one-time deal with users. R encourages users to extend its program; commercial software asks users to be patient and to rely on updates that may only come years later.

R is also a very concise language. If a data set is analyzed by R and commercial software products at the same time, in general the program file from R is much shorter than that from the others. This conciseness of R may become a barrier for new users in digesting the code at the beginning, but after a while, users often appreciate its short and clean style. Furthermore, most commercial software products cannot be used as a computer language like R. It is true that there is a core computer language behind each software product for statistical analysis.
R is among the few products in the market (e.g., MATLAB) that encourage analysts to use it as a language and provide real aids in the learning process. Commercial software products are commodities in a market economy. When running a regression, it throws out all results on a computer screen and forces readers to digest them. That is a typical selling strategy used by a local car dealer: look how much we can do for you! In contrast, R is much more like a shy but decent introvert. R saves everything as an object, and in general, if an object is not called, it does not show up and bother you. Commercial software products try to control users while R allows users to have much better control over their data analyses. Using the classification of four stages adopted in this book (e.g., clickor, beginner, wrapper, and contributor), R users can reach the status of contributor in one or two years. However, users of commercial software will seldom reach the stage of contributor; some diligent users may become good wrappers; and most will become good clickors only. That is exactly what vendors of commercial software like to have: many compliant ‘children’ waiting for their updates with money in hand. Commercial software still enjoys some advantages over R in certain aspects. One common complaint about R is that it is not well organized like commercial software. With a slim base R and many extended packages, users often face the dilemma of which function should be used for a specific task. This creates a feeling of fragmentation as R is built up by many individual contributors worldwide. There is a lack of authority to tell users which one among many available similar packages is the best for a specific data set. This may force users to spend more efforts to have a deeper understanding of the model employed. Hopefully, the fragmentation of R functionality will be better addressed as the software is improved and users gain more experience in using it. 22 Chapter 2 Getting Started with R In summary, I believe R will gradually gain a larger share of the current software market for statistical analyses. The main advantage of R is that it is free and can be used as a language to accomplish sophisticated programming jobs. The main advantage of commercial software products is that they provide stable and consistent solutions to routine statistical analyses. To some degree, the relation and competition between R and commercial software is similar to that between electronic books and traditional paper books, or that between emails and traditional letters. I have no doubt that R will gradually become more popular over time, and many commercial software applications will face a shrinking customer base. What is not free? R being free can cause some misconceptions. While users do not need to pay for a copy of R, they still need to buy books to learn R systematically. There are some good and free contributed books on the Web site of R. They can be helpful for beginners to get started. In addition, over 100 books about R have been published formally in recent years. Many of them have better quality than these free contributed books. Some of them are very affordable too, comparing their prices to the copy or printing costs for these free books (e.g., ten cents per page in a local store for copying and binding). Furthermore, learning R is not effort free. Just installing a free copy of R really does not have any impact on one’s statistical skills and programming ability. 
The bigger challenge is to use it efficiently for actual research. Several years ago, for example, I bought a software application for home projects. I thought I would easily be able to add a sun room to the back of our house by myself. It turned out I still needed to learn a lot about architecture, so in the end I never finished a single project with this software. Similarly, owning a copy of Microsoft Word does not mean one can produce a beautiful document automatically. Owning a dictionary does not mean one can write a great novel tomorrow. The same logic applies to R and statistical analysis. Learning R and the language takes great effort over several years. This is achievable through carefully designed projects and rigorous training. The benefit is so great that it is well worth your investment. Being able to program with a computer language for research is just like being able to drive a car for commuting.

2.4 Playing with R like a clickor

Pull-down menus in a computer application, also called drop-down menus, are options that appear when an item is selected with a mouse. Using the classification for statistical analyses in this book, this is labeled as a clickor status. While R is mainly designed for programming, it does have a number of contributed packages that provide pull-down menus for both computing and plotting. One well-known package is R Commander, named Rcmdr. In this section, R Commander and data from a contributed package are used to demonstrate the features of this approach. Using pull-down menus for statistical analyses is intuitive. One can select a data set and try all the features available in an interface. This may help some users get started with R and then gradually move on to programming.

Installing and loading the R Commander

To install this package in R, use the command install.packages("Rcmdr"). To invoke the R Commander, open an R session first. The main R console window should be similar to Figure 2.1, 2.2, or 2.3, depending on the editor or graphical user interface (GUI) you choose. Then, submit library(Rcmdr) at the prompt of the R console. A new window will pop up, as shown in Figure 2.4 Main interface of the R Commander with a linear regression. If the new window is closed accidentally, it can be recalled anytime by Commander() at the prompt of the R console. From now on, you can click and play with the R Commander interface independently. The R console window is still open and runs in the background; it can be minimized but cannot be closed.

[Figure 2.4 Main interface of the R Commander with a linear regression]

Major menu items

The R Commander has three main windows: script, output, and message. The script window displays the code associated with each mouse click. One can also type and submit R commands through this window. The output window holds results from any operation. The message window displays notes, warnings, or errors. In addition, separate windows can be initiated from some operations, such as viewing a data set or displaying a graph. The pull-down menus are very intuitive, including file management, data operations, regressions, plotting, and help. Some features are not active until a data set is active. At present, there are already a large number of menu items available in the R Commander. However, it should be emphasized that this is still just a very small portion of the total capacities of R.
With several thousand contributed packages, R can handle much more diverse and complicated tasks than the R Commander has shown. Below is a selected and brief list of the items available in the R Commander.

File
- Change working directory
- Open or save script file
- Save output; Save R workspace; Exit

Edit
- Edit R Markdown document
- Cut; Copy; Paste; Find; Delete
- Select all

Data
- New data set; Load data set; Merge data set
- Import data from text file, SPSS, SAS, STATA, Excel, Access
- Data in packages (List or read data sets in packages)
- Manage variables in active data set

Statistics
- Summaries, contingency tables, means, variances
- Nonparametric tests
- Dimensional analysis
- Fit models (Linear, generalized linear, multinomial logit)

Graphs
- Color palette
- Histogram, boxplot, line graph, strip chart, bar graph
- Save graph to file

Models
- Select active model
- Summarize model
- Hypothesis tests
- Graphs (diagnostic, residual)

Distributions
- Continuous distributions
- Discrete distributions

Tools
- Load packages
- Load Rcmdr plug-in(s)

Help
- Commander help
- Introduction to the R Commander

The guide and help system for the R Commander are well documented under the Help menu. In particular, the document Introduction to the R Commander is a 26-page PDF file that explains the operation of this application well.

Playing with a data set

To understand and get started with the R Commander, the best way is to have a data set and play with it for a while. The Data menu at the main interface lists a number of options. Data can be imported from one's local drive, prepared manually from the menu of Data ⇒ New data set, or loaded as sample data from the R Commander directly. To be brief, some built-in data is used here for demonstration. When the R Commander is loaded through the command library(Rcmdr), several other packages are also loaded into the current working environment, including the package car. So all the contents of car are available to users now. One of its data sets is Davis. It has 200 rows and 5 columns, and it documents the height and weight information for men and women who engaged in regular exercise. A few observations are selected and listed as follows:

      sex  weight  height  repwt  repht
1       M      77     182     77    180
2       F      58     161     51    159
3       F      53     161     54    158
...
197     M      83     180     80    180
198     M      81     175     NA     NA
199     M      90     181     91    178
200     M      79     177     81    178

where sex is either female (F) or male (M); weight is measured in kg; height is measured in cm; repwt is reported weight; and repht is reported height. Some values are missing.

Below we show three groups of activities with this data. The key screenshots are displayed in Figure 2.5 Loading a data set, drawing a scatter plot, and fitting a linear model. First, to get this data set into the current working environment, follow the menus at Data ⇒ Data in packages ⇒ Read data set from an attached package and then make the appropriate clicks. Once the data is loaded successfully, the R Script window should display the following command line automatically: data(Davis, package = "car"), as shown in Figure 2.4. At this time, many menu items will change color from gray to black, indicating that a data set is active and these menus are available for use. To view the data set, one can click the View data set button on the main interface. A separate window for the data will come up. Second, summary statistics can be generated for the data set.
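For readers who prefer to see the commands behind the clicks, the following minimal sketch carries out the same data loading and the summary, plot, and regression steps directly at the console (it only uses the Davis data and the model discussed in this section; fitDavis is simply an illustrative object name). In the R Commander itself, the same results come from the menus described next.

library(car)                                    # the Davis data ships with the car package
data(Davis, package = "car")                    # same command the R Commander generates
summary(Davis)                                  # summary statistics for each variable
plot(weight ~ height, data = Davis)             # scatter plot of weight against height
fitDavis <- lm(weight ~ height, data = Davis)   # a simple linear regression
summary(fitDavis)                               # coefficients, significance, and fit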
For example, following Statistics ⇒ Summaries ⇒ Active data set, the minimum, median, mean, and maximum values for each variable will be reported in the Output window. In addition, various types of graphs can be used to explore the data and understand its properties. For example, using the menus of Graphs ⇒ Scatter plot, we can see the positive relation between the variables height and weight. Finally, a linear regression can be fitted to the data. For example, if the relation between height (x) and weight (y) is of interest, we can fit a linear model by using the menus of Statistics ⇒ Fit models ⇒ Linear regression. The regression coefficient is 0.238 and it is significant at the 1% level, which confirms our impression that taller persons are heavier on average. Once a model is fitted, some diagnostic or other plots can be generated through the Models menu. For example, Figure 2.5 contains a graph generated from clicking the menus at Models ⇒ Graphs ⇒ Influence plot.

[Figure 2.5 Loading a data set, drawing a scatter plot, and fitting a linear model]

In summary, through these activities, one can see that the R Commander can handle many ordinary tasks for data analyses and graphs. This can reduce transaction costs when a user has never used R before, or is completely new to computer programming. Note that the script and output windows in the R Commander display the underlying R commands for each action. This is a nice feature that allows users to learn some basic R commands. In the long term, however, programming is still much more efficient than clicking the menus.

2.5 Exercises

2.5.1 Install R and packages and prepare for the coming data analyses. With the guide in this chapter, readers should have the following items installed: base R, the erer library, and a selected R editor (e.g., RStudio). Furthermore, make a copy of all the data and sample R programs used in this book to a local folder and test one of them, as detailed in Section 2.2.1 Recommended installation steps on page 17.

2.5.2 Learn how to have an efficient package installation. In this exercise, create and save a file for all the contributed packages you install on your computer, following the discussion in Section 2.1.2 Contributed R packages on page 14. To see the names of the packages installed, use the function installed.packages(). Some of them are always installed and loaded with base R, so there is no need to include these in your file.

Part II Economic Research and Design: Four chapters in this part are organized with a top-down approach: economic research in general; three versions of an empirical study; study design and proposal for an empirical study; and reference and file management.

Chapter 3 Areas and Types of Economic Research (pages 31 – 41): Areas and issues of economic research are introduced. Economic research is classified into three types: theoretical, empirical, and review studies. Features and challenges by type are discussed.

Chapter 4 Anatomy on Empirical Studies (pages 42 – 56): The production process of an empirical study is analyzed in detail and relevant research methods are presented. An empirical study has three versions: proposal, program, and manuscript.

Chapter 5 Proposal and Design for Empirical Studies (pages 57 – 77): General principles of research design of empirical studies in economics are examined. Three different situations are analyzed: proposals for money, publication, or degree.
Chapter 6 Reference and File Management (pages 78 – 97): The demand of literature management is analyzed. Some systematic solutions are suggested. The features in EndNote and Mendeley are presented for both reference and file management. R Graphics • Show Box 2 • A heart shape with three-dimensional effects R is flexible in integrating data into graphics. See Program A.2 on page 514 for detail. Chapter 3 Areas and Types of Economic Research S cientific research in economics involves many areas or aspects of our economy. Economic analyses can be classified into different types, depending on the nature of the problems, research needs, and methods employed. To facilitate a comparison, economic analyses are classified into three types in this book: theoretical studies, empirical studies, and review studies. The main features of each type are examined first. At the end, those types of economic analyses are compared and some thoughts about long-term career planning are presented. 3.1 Areas of economic research Economics as a discipline has evolved to be quite diversified. The classification system adopted by the Journal of Economic Literature (JEL) has a good representation of major research areas in economics. As an example, some of the frequently used areas related to natural resource economics are presented in Table 3.1 Journal of Economic Literature classification system. Each JEL code is composed of three characters. The first is a letter and the other two are numbers. For the first character, there are 20 main areas, ranging from “A — General Economics and Teaching” to “Z — Other Special Topics”. The two tiers by number represent more detailed areas. Another interesting way to exploring various areas of economic research is to look at the list of Nobel Prize and Laureates in Economics Sciences (http://www.nobelprize.org). The Prize in Economic Sciences was first established in 1968 in memory of Alfred Nobel, founder of the Nobel Prize. Between 1969 and 2014, 46 Prizes were awarded to 75 Laureates. The average age of a Laureate is 67 years old and only one woman has received the Prize in this category (i.e., E. Ostrom in 2009). The research areas covered by these awards in the past do not cover every aspect of economic research. However, the list indeed reveals a number of active research areas at the aggregate level. Relevant to empirical studies and the focus of this book, several Prizes in recent years have been awarded to quantitative analyses of economic issues. These include the Prize to J.J. Heckman and D.L. McFadden in 2000 for theory and methods of analyzing selective samples and discrete choice, R.F. Engle and C.W.J. Granger in 2003 for methods of analyzing time-varying volatility and cointegration, T.J. Sargent and C.A. Sims in 2011 for their empirical research on cause and effect in the macroeconomy, and more recently, E.F. Fama, L.P. Hansen, and R.J. Shiller in 2013 for their empirical analysis of asset prizes. 
Table 3.1 Journal of Economic Literature classification system

Code  Description
A     General Economics and Teaching
B     History of Economic Thought, Methodology, and Heterodox Approaches
C     Mathematical and Quantitative Methods
D     Microeconomics
E     Macroeconomics and Monetary Economics
F     International Economics
G     Financial Economics
H     Public Economics
I     Health, Education, and Welfare
J     Labor and Demographic Economics
K     Law and Economics
L     Industrial Organization
M     Business Administration and Business Economics; Marketing; Accounting
N     Economic History
O     Economic Development, Technological Change, and Growth
P     Economic Systems
Q     Agricultural & Natural Resource Econ.; Environmental & Ecological Econ.
R     Urban, Rural, Regional, Real Estate, and Transportation Economics
Y     Miscellaneous Categories
Z     Other Special Topics
C58   Financial Econometrics
D12   Consumer Economics: Empirical Analysis
Q23   Forestry
Q26   Recreational Aspects of Natural Resources
Q27   Issues in International Trade
Q51   Valuation of Environmental Effects

Source: https://www.aeaweb.org/journal; accessed in July 2015.

For specific areas in economics, there are also detailed directions that researchers can choose from. Take as an example natural resource economics (e.g., JEL Q2 and Q5). This is a small branch of economics, but the field covers numerous resource issues. One way of classifying these research issues is to trace the movement of resources (e.g., a tree) as in a stream. (Historically, logs were also transported through rivers.) At the beginning of the stream, landowners manage land for various outputs. Timber is one main output from forests. Other outputs include water, air, wildlife, recreation services, and aesthetic value. This multiple-output feature of forests generates numerous research areas, e.g., timber markets, bioenergy, climate change, carbon sequestration, conservation, and land use change. In the middle of the stream, the wood products industry has been an important sector in the economy. In the United States of America, the output of the wood products industry has been about 10% of the total output from the manufacturing sector. Industrial firms, including sawmills, paper mills, and furniture firms, are active market participants. Various issues related to the industry are worthy of attention, e.g., productivity, product demand and supply, business location and clustering, industry innovation and evolution, mergers and acquisitions, industrial organization, and market power. At the end of the stream, there are various products available to consumers. A number of issues related to the market exist, including eco-labeling for wood products, demand for paper and lumber products, the housing market, and domestic and international trade of wood and paper products.

In summary, there are various areas of economic research. Most economists work in a small field within this large arena during their whole career. By analogy, this is very similar to a medical doctor, who generally specializes in one field, e.g., the nose or the heart. It is challenging for anyone to become an expert in several fields. Thus, the next time an economist on a television program talks as if he knows everything about the economy, consider whether he is really so knowledgeable or just dressing up common sense.
3.2 Theoretical studies

Within a specific area or across various areas, economic research can be classified by other criteria. In this book, three types of economic research are differentiated: theoretical, empirical, and review studies. The whole book focuses on empirical studies, but an understanding of the other types will help us improve efficiency in conducting empirical studies.

Theoretical studies in economics generally focus on the theoretical aspect of an economic issue. This type of study has traditionally been qualitative in nature. However, in recent years, it has become more quantitative with sophisticated mathematical models. Thus, theoretical studies have become more structured with models and have gained more popularity over time. Understanding the structure and design of this type of study allows a more comprehensive view of a research area. It also can help generate empirical research ideas in the process.

3.2.1 Economic theories and thinking

The key feature differentiating theoretical studies from other types of economic studies is that they are theory-oriented. There are a number of inherent advantages of a theoretical approach to analyzing economic issues, with the main one being its flexibility. Theoretical studies often have light demand for data, or in many cases, they do not need any at all. Economic phenomena can be so complicated in some situations that a theoretical analysis is the most practical or even the only way available to researchers. Overall, a theoretical study allows a focus on a specific aspect of the subject, and provides solutions to social and economic problems.

Conducting a theoretical study can be challenging in several aspects. While some of them address large issues, many theoretical studies examine small, conceptual, and dry theoretical questions in economics. These questions may have a limited relation with reality or our breakfast for tomorrow morning. This is similar to many abstract mathematical issues. In recent years, quantitative models have been increasingly employed in theoretical studies. As a result, calculus and mathematics have become deeply involved in modern theoretical analyses. A student needs to be very strong in mathematics and deductive reasoning. Furthermore, for a specific issue, it is not uncommon that results from studies with different theoretical assumptions are in conflict. This can be frustrating, as theoretical studies may fail to provide clear answers to questions under investigation.

The best way to understand this type of economic study is to read representative studies in some specific areas. For example, the seminal article by Coase (1960) examined the economic problem of externalities. A number of legal cases and statutes were used in benefit-cost analyses related to transaction cost and property rights. Coase’s study is very readable with limited quantitative analyses. Similarly, Akerlof (1970) analyzed information asymmetry and quality uncertainty, using the market for used cars as an example. Within the literature of economic analysis of law, many similar articles have been published since then.

Chapter 4 Anatomy on Empirical Studies

In this chapter, research methodology related to empirical studies is examined in detail. We adopt a production approach to analyzing the characteristics of an empirical study. A paper can be viewed differently by a reader and an author (i.e., a consumer versus a producer).
From the perspective of production, an empirical study has three faces or versions: a proposal version, a program version, and a manuscript version. For a published journal article, readers see the manuscript version only. How the study has been designed and how the data have been analyzed are usually not available to readers, or they are only partially revealed in the manuscript version. The differentiation among the three versions will give us a better understanding of the production process of a scientific paper. It also allows us to identify several critical steps in practical application, and eventually, helps us improve research productivity incrementally.

4.1 A production approach

Conducting scientific research is a production activity. It is comparable to many activities human beings participate in. Similar situations include making a movie, mining coal, and performing a dance. In the next subsection, we take home building as an analogy to explain the phases in making a paper. Specifically, a scientific paper is evaluated on the basis of three phases: idea, structure, and detail. This division is distinctive and easy to understand. Each phase can include small steps, so a large research project can be divided into workable units. Finally, we compare an author’s and a reader’s perspective on a paper to further reveal the key features of the production process in conducting an empirical study.

4.1.1 Building a house as an analogy

Making a paper is similar to other activities, as compared in Table 4.1 A production comparison between building a house and writing a paper.

Table 4.1 A production comparison between building a house and writing a paper

  Building a house: 1. Location. A builder or land dealer makes strategic decisions about the location, timing, and size for a proposed house building.
  Writing a paper: 1. Idea. A professor or a project investigator creates a research idea.
  Role: Designer

  Building a house: 2. Floor Plan. The builder or a hired expert blueprints the floor plan for the proposed house and prepares for actual building.
  Writing a paper: 2. Outline. The professor or a graduate student develops the outline for a paper, possibly with some help from others.
  Role: Technician

  Building a house: 3. Details. The builder hires employees or independent contractors to work on the foundation, frame, window, cable, and painting.
  Writing a paper: 3. Detail. The professor or a graduate student works on the study along the outline and fills details in the paper.
  Role: Painter

In 2004, my wife and I spent half a year shopping for a house in the town where we have been living since then. It was our first house, so at that time we had very limited experience with the housing market and the buying process. It was also the biggest financial decision of our lifetime. At the beginning, we read a lot of books and tried to understand what steps we should follow in finding a house. After numerous conversations, readings, and trials, we figured out that there were three major stages in shopping for a house:

• Phase (1) Location: Identifying the right community we like;
• Phase (2) Structure and floor plan: Finding a structure right for us; and
• Phase (3) Detail: Examining the quality (e.g., brick, paint, and landscape).

More importantly, we also found out that the sequence from location to floor plan and then to detail should be largely maintained. We wasted a lot of time in learning these basics.
For example, we found a house with a great floor plan and style, but it was too isolated from other communities. In other words, the location was inappropriate for us, so in the end we just wasted our time. Overall, we found that the price of a house is determined by factors related to the three phases in that ordered sequence. The most important single factor in determining a house price is location, not the flowers in the front yard.

After understanding these basics and the market, our final decision was signing a contract with a builder. When we signed the contract, he had just finished the concrete foundation for the house. Over the next several months, we walked around the site every Saturday with excitement and expectation to observe the whole building process. Because we had already put our deposit on the house and would become the owners very soon, we watched all the construction activities carefully. We also had many conversations with the builder over the following months. He told me that there were numerous things to worry about in doing the business. First, he made a decision on the location: where to buy a piece of land and build the house. Next, he selected a design map and floor plan for the house. Finally, he went to the site every day to supervise the workers he hired and to take care of the detail and quality. Without much surprise, those were also what we as buyers paid so much attention to in the shopping process.

4.1.2 An author’s and a reader’s perspective on a paper

A paper’s features can be revealed from several perspectives, as shown in Figure 4.1 A comparison of an author’s work and a reader’s memory. Using the terminology of economics, an author is a paper producer and a reader is a paper consumer. Understanding these differences can help us conduct scientific research efficiently.

[Figure 4.1 A comparison of an author’s work and a reader’s memory. An author’s work on a paper: (1) research idea, one sentence, a few days or months; (2) outline, 2 pages, several months; (3) details, 30 pages, several weeks. A reader’s memory of the paper: (1) details, most facts, in several days; (2) outline, structure only, in a year; (3) research idea, topic only, after several years.]

Specifically, in conducting an empirical study, the process usually starts with a research idea of interest to a researcher. Once the study design is finished, data can be collected and analyses can be conducted to generate the key findings and set up the structure of the study. Finally, a manuscript can be prepared, writing can be polished, and the results can be disseminated through various outlets. Each stage (i.e., idea, structure, and detail) needs a different amount of time. Sometimes it may take only a few seconds to have a brilliant idea, but overall, generating ideas for scientific research requires long-term accumulation and diligent work. The structure and key findings may take several months to set up and finalize. In the end, all the details for a paper may take several weeks to be connected together.

Then, the paper is presented to readers, either your colleagues or a person you do not know. Without any direct communication in general, how do readers understand what you have done?

Chapter 5 Proposal and Design for Empirical Studies

Writing a proposal for money, publication, or degree is similar in many aspects. In this chapter, the common requirements are presented and a number of keywords are elaborated first.
For the primary purpose of funding, two sample proposals, one unfunded and the other funded, are analyzed with regard to their presentation skills. For the publishing purpose, three sample empirical studies used in this book are illustrated (i.e., Sun et al., 2007; Wan et al., 2010a; Sun, 2011). At the end, constraints associated with study design for graduate degrees are discussed and some suggestions are proposed to tackle restrictions in time, financial resources, and experience.

5.1 Fundamentals of proposal preparation

A proposal can be prepared for a number of purposes, including money, publication, or degree. Funding is an important and necessary resource for many scientific research projects today, so this has been a primary purpose of proposal writing. Some economic research may need limited investment in facilities and resources, so a proposal can be prepared with the main purpose of publishing. In addition, a dissertation design by a graduate student in economics can focus on the degree and graduation time. Different purposes can lead to distinct emphases at the design stage. In this section, some common fundamentals of proposal preparation are presented first.

5.1.1 Inputs needed for a great proposal

I summarize and aggregate the inputs needed for a great proposal into several items: great presentation skills, quality time and solid commitment, and a compelling idea.

First of all, presentation skills help researchers demonstrate the merit of a proposed project. Grantsmanship is the art of obtaining grants and is broader than presentation skills. Presentation skills for proposals are not only about pretty sentences, but also about some implicit or explicit norms adopted in a community. I believe that all presentation skills can be learned through several projects over a few years.

Here is an example of presentation skills I learned in the past. A typical problem for young professionals is that too many objectives are listed and too much is promised in a proposal. A teenage boy has the tendency to show maturity with some mustache or muscle. A new investigator also has the tendency to show reviewers the magnificence and broadness of the project. As a result, proposals written in this way begin to read more like a lifetime career plan. Unfortunately, the fate of your proposal is controlled by some senior and seasoned researchers in your field. Most of them will strongly doubt that an ambitious and large list of objectives can be finished within a few years. This is such a simple and easy-to-make mistake that many young professionals (including myself) pay heavy prices in the learning process. The solution is very simple: just focus on two to three objectives in a typical proposal. A single objective is usually too thin, but any number bigger than three carries the risk of promising too much. Four or more objectives are recommended only when there is really a solid justification behind the decision. For example, it is reasonable to have more than three objectives when a multiple-discipline project requests $50 million.

Quality time and commitment is the second necessary input to successful proposal preparation. This sounds so obvious, but many researchers can easily forget it with a busy daily schedule. Time is a basic input for any of our daily activities.
Everyone has the same amount of time per day; remember, nobody has 25 hours a day or 400 days a year. A proposal with a large request for financial support often needs a total of several months of preparation time. Furthermore, the quality of time is even more important. If one only has some residual and fragmented time for proposal preparation after other commitments, then the probability of writing an excellent proposal is low. Whether one prepares a proposal continuously for several months or periodically over a long period, one needs quality time and solid commitment in the process.

The final and also the most important input is a compelling research idea for a proposal. The scientific merits of a research idea are the innovation and impact of the proposed research. Training and experience can help researchers improve the quality of ideas over time. To generate a high-quality research idea in economics, a solid training in basic courses such as microeconomics, macroeconomics, and econometrics is necessary, which may take a few years. To learn these economic theories or techniques better, students also need to participate in some real research projects while taking these courses. For a specific area of interest (e.g., international trade of a specific commodity), the current literature should be searched and digested to gain a broad and deep understanding of the research status. In addition, an idea for economic research can be generated from various sources, such as newspapers, market reports, or discussions. Scientific research and proposal writing are also an iterative process. Researchers can improve and accumulate their knowledge in an area in the long term, and ultimately, improve the quality of research ideas.

While training in economics can be helpful in improving idea quality, it should be pointed out that training is less effective at improving idea quality than at improving other inputs for proposal preparation (e.g., presentation skills). Generating an idea with sound scientific merits is like standing on the shoulders of giants. It requires tremendous creativity in a brain-storming process. In other words, some talent is needed in generating a first-rate idea. This is the artistic, not scientific, side of proposal preparation. By analogy, a person can become very skilled in playing a piano by training over several years, but he may not become a great pianist; another person can have beautiful steps on stage, but she may not be perceived as a great dancer. Overall, the demand on creativity reflects the spirit of human beings in exploring the unknown world. It is these challenges that make scientific research so fascinating to many of us.

5.1.2 Keywords in a proposal

Regardless of specific purposes, there are a number of keywords or components that every proposal needs to address clearly (Morrison and Russell, 2005). In evaluating a proposal, reviewers also look for these keywords and try to understand if they are well defined and connected. Thus, it is imperative that researchers spend their efforts in delineating these keywords explicitly in the proposal. This can improve productivity in the preparation, and if funding is sought, it also can increase the possibility of being funded. We explain these keywords below one by one in greater detail. Some of them have already been presented in Table 4.2 Main components in the proposal version of an empirical study on page 51.
Research issue or question: An issue should be identified and defined at the beginning. The boundary or size of the issue should be appropriate, not too big or too small. It should not be too big, vague, or ambitious, so the issue can be addressed in the proposed time frame. It should not be too trivial or small either. By working on the issue, you can make significant contributions to the area, not just repeating or marginally extending the work already published by others. Thus, there is a delicate balance in defining the boundary of an issue or problem. In general, the issue description will have to be revised many times when other keywords of a proposal are developed and refined. What is known or current knowledge status: What is known or what has been established earlier by other researchers needs to be determined. Scientific research has become more extensive as our society continues to evolve. A new research idea is likely related to existing articles in some way. Therefore, a detailed review and analyses of the relevant literature are required. What is unknown or knowledge gap: This keyword of knowledge gap is closely related to the previous keyword of current knowledge status. Thus, it seems like a trivial task in generating a compelling idea. However, it is not. Understanding what has been achieved is much easier than figuring out what is unknown. It requires independent and critical thinking ability, which may take years to gain. My own habit is that after reading a published paper, I write down notes on its first page instantly about its strength and shortcoming. This forces me to identify knowledge gap and potential directions for future research. For a specific proposal, the stated knowledge gap should be framed with an appropriate size, similar to the principle for issue definition. Study needs: The knowledge gap identified earlier may require several projects to address. Study needs can be as big as, or smaller than, the knowledge gap identified. They are supposed to be more relevant to a specific proposal or project. If current knowledge status and knowledge gap for a selected issue are well stated in a proposal, the keyword of study need is relatively easy to define. Goals or objectives: For a large proposal (e.g., asking for one million dollars), the credentials and research records of principal investigators are an important factor for a funding agency to consider in reducing the risk of failure. Thus, demonstrating that the proposed project is within the long-term research activities and efforts of the researchers has become a presentation skill widely employed. In that case, a long-term goal over many years is usually stated before specific objectives for the proposed project. For small projects, describing a long-term goal is not necessary. Individual objectives should be defined in a way that achieving them can meet the study need and fill some or all the knowledge gaps presented earlier. Methods: This is the key component of the proposal as it describes how to achieve the objectives defined earlier. It should demonstrate enough innovation and capacity in solving the problem. The degree of detail released can be affected by a number of factors. Often the space allowed in a proposal is limited, so the section needs to be concise. Sometimes the detail is not very clear at the time of proposal preparation, or investigators prefer not to present too detailed methods to avoid information leakage. 
Ultimately, the methods presented should convince reviewers that they are sufficient for achieving the stated objectives.

Data sources: This is usually a straightforward part in a proposal, especially for empirical studies.

Chapter 6 Reference and File Management

Have you ever found it tedious to format the reference section in a manuscript? Have you ever been lost trying to find an article among several hundred documents saved on your computer? If your answer is yes, then this chapter may help you address these problems effectively. First of all, the need for literature management from the perspective of researchers is analyzed. Then, some systematic strategies for reference and file management are presented. In particular, EndNote and Mendeley have been two leading products for Windows users in the market, and they are chosen to demonstrate how to accomplish various tasks for reference and file management. Overall, if the methods presented in this chapter are followed, references and files can become well organized and research productivity can be gained at the stages of study design and manuscript preparation.

6.1 Demand for literature management

There is a need to understand the demand for literature management before seeking a solution. Reference and file management has become a routine task for scientific research, covering the whole process from research design to manuscript preparation. The number of papers that needs to be managed is estimated first. Then, the demand is quantified as a list of major tasks and goals.

6.1.1 Timing for literature management

Literature management for research includes both reference and file management. Reference management and file/document management are closely related but two different tasks. Reference management has been in demand for a long time because past studies need to be cited in the text of a current manuscript, and a bibliography section should be provided at the end for detail. Sometimes I read papers published about 50 years ago, and they still contain reference information with a format similar to what is required today. In contrast, electronic versions of published articles, especially in portable document format (PDF), have only become widely available since the 1990s. Before that, printed copies were obtained from a library and the only management task was to organize one’s bookshelf.

The number of scientific articles published has grown fast in recent years, and in general they are published as PDF documents. At present, each published paper has some reference information and a PDF copy. As a result, there has been an increasing need to manage either hard or electronic copies of these files. Reference and file management has become a routine job for researchers, covering every step in the whole process. In particular, it is equally important for project design and manuscript preparation, or strictly speaking, it is even more important for project design in order to improve research productivity. Because demand for reference management has existed for a long time and is mainly relevant to manuscript preparation, it creates a false impression that what we cover in this chapter is useful for manuscript preparation only.
However, designing a research project requires a deep and comprehensive understanding of what has been achieved, i.e., the keyword of current knowledge status, as presented in Chapter 5 Proposal and Design for Empirical Studies. Scientific research today should start with an efficient management of reference and file information. That is why this topic is covered here in Part II Economic Research and Design.

6.1.2 How many papers?

Let us first estimate how many papers we need to read and manage. An exceptional researcher can list as many as 700 publications over 30 years on his curriculum vitae (i.e., about one paper every two weeks), but I perceive that as an outlier. In economics, a moderately productive researcher can publish about three articles per year, resulting in about 100 articles for a whole career. Many economists publish only about 20 papers during their lifetime. At the low end, young professionals like PhD students may work on several projects and publish fewer than two papers.

In conducting a specific project, the number of papers skimmed may be very large, e.g., when searching and skimming papers in online databases. In this chapter, we focus on those papers that one spends a good amount of time reading. Assume a researcher needs to read 50 relevant papers in working on a specific project. The number of references cited is usually smaller, e.g., 30 papers in a typical reference section. Thus, the maximum number that one needs to manage during a career is about 5,000 papers (i.e., 50 × 100). This number can be bigger for different disciplines, or if a researcher has cooperative research projects with many colleagues. Thus, to be very conservative by doubling the amount of work, I believe that the total number of papers that most researchers need to manage should be no more than 10,000 papers in a whole career. At the low end, if a graduate student only includes two or three projects in a dissertation, then the number of relevant papers is usually less than 150.

Personally, I have accumulated about 3,000 papers on my computer after 15 years as a professor. The size of individual documents is up to 15 megabytes (MB) per copy. In total, they occupy 2.3 gigabytes (GB) on my computer. Thus, with a simple extrapolation, the size of 10,000 PDF papers should be about 8 GB. This is all manageable on modern personal computers.

Obviously, if we have many research projects, then some special software is needed for efficient literature management. If the number is very small, then the cost may be bigger than the benefit from a special tool. The answer is vague if the number of papers is moderate. For example, suppose a student has only one research project (e.g., a master’s thesis) and is also very certain that he will not design or conduct any scientific research in the future. Then the benefit of learning a new software application and applying the relevant techniques to a small set of papers (e.g., 30 cited references) is just too small. Based on my experience, I recommend that reference and file management be handled by special software whenever one needs to manage over 100 references and related PDF files.

Table 6.1 Tasks and goals for reference and file management

Reference management
  Insert all needed citations in the text of a manuscript: < 30 minutes
  Create a reference section in a manuscript: < 1 minute
  Format the citations and reference section in a manuscript: < 1 hour

File management
  Reorganize old existing PDF files on a local drive: 200 – 300 files per day
  Organize all new PDF files related to a manuscript: < 1 day
  Find a PDF file on a computer: < 10 seconds
  Add comments to a PDF file for record: < 10 minutes
  Match a PDF copy with a hard copy on a bookshelf: < 1 minute

Note: Assume that a typical manuscript has 20 to 40 pages and contains 30 citations. A researcher needs to prepare three such manuscripts and manage less than 300 new PDF files every year, and will work on no more than 10,000 PDF documents in total during a career. The symbol < means less than.
6.1.3 Tasks and goals

For each published article, there are many itemized pieces of information, e.g., author, article title, year, journal title, volume and issue numbers, and page numbers. There are also many small format requirements, e.g., capital letters, sequence of authors, and indention. In addition, each article is usually available as a single PDF document, with possible appendixes. As a result, the management of reference information is relatively tedious because of the large number of items and format requirements, while the management of PDF files is more straightforward. However, as the number of PDF files becomes larger (e.g., several hundred on a local drive), the task can become very difficult without any software. Actually, many researchers have been motivated to learn special reference software mainly because of the anxiety and pressure related to file management.

Tasks of reference management focus on how to add references to a manuscript prepared for publication. Tasks of file management are about how to organize electronic and hard copies of articles in a way that they can be quickly identified and annotated. Common major tasks are listed in Table 6.1 Tasks and goals for reference and file management. Without any special software, one can manage references and files within applications used for document preparation (e.g., Microsoft Word) and the operating system (e.g., Microsoft Windows). Personally, I can meet all the reference format requirements for one article manually in less than one day. As I publish about three journal articles per year, all of these tasks are feasible and tolerable. In contrast, managing several thousand PDF documents using the Windows Explorer application is very inefficient. As an indicator, it is often hard to find a specific PDF document, and even when feasible, it takes many minutes for a simple task.

With reference software, what are the cost and benefit? The cost is that one needs to spend several days or weeks in learning a selected software product, depending on the learning pace. The benefits are making fewer or even no mistakes in the reference section of a manuscript, and more importantly, saving time for both reference and file management. Assume that a typical manuscript has 30 citations and a researcher needs to prepare three such manuscripts every year.

Part III Programming as a Beginner

Five chapters are used to show how to conduct an empirical study with predefined functions in R. The sample study used is Sun et al. (2007), which employs a binary logit model to examine an insurance purchase decision.

Chapter 7 Sample Study A and Predefined R Functions (pages 101 – 127): The manuscript and program versions of Sun et al. (2007) are analyzed first. How to conduct the study through pull-down menus in R is briefly shown. R grammar and program formatting are elaborated in detail.
Chapter 8 Data Input and Output (pages 128 – 151): An object and a function are explained first as two core concepts of R. The focus of this chapter is on the exchange of information between R and a local drive, i.e., data inputs and outputs.

Chapter 9 Manipulating Basic Objects (pages 152 – 179): How to manipulate major R objects by type is presented. The types covered are R operators, character strings, factors, date and time, time series, and formulas.

Chapter 10 Manipulating Data Frames (pages 180 – 213): How to index a data frame is presented first. Then common tasks related to data frames are addressed one by one. Methods for data summary and aggregation are presented at the end.

Chapter 11 Base R Graphics (pages 214 – 249): The traditional graphics system available in base R is presented. The four main inputs for generating an R graph are plotting data, graphics devices, high-level plotting functions, and low-level plotting functions.

R Graphics • Show Box 3 • Survival of 2,201 passengers on the Titanic, which sank in 1912
R can visualize categorical variables (the show box plots survival, yes or no, by passenger class: 1st, 2nd, 3rd, and Crew). See Program A.3 on page 516 for detail.

Chapter 7 Sample Study A and Predefined R Functions

One of the principles used in this book is to learn the methods of conducting an empirical study through a complete project. In this chapter, Sun et al. (2007) is presented as the sample study for Part III Programming as a Beginner. This simple empirical study is well suited for students to learn basic programming skills. The manuscript version and statistics related to a binary choice model are presented first. Then how to estimate the binary choice model by using pull-down menus is demonstrated with R Commander. Finally, the program version of this study is displayed. The basic R syntax and format requirements are explained in detail at the end.

7.1 Manuscript version for Sun et al. (2007)

In an empirical study, statistical outcomes are always the core. In the manuscript version of an empirical study, these outcomes are often reported in the format of tables and figures and then analyzed one by one. All other sections in the manuscript support the result section, including methodology, literature review, introduction, discussion, and conclusion. The manuscript version will grow from a very basic skeleton to its final version, with guidance from the proposal version and outputs from the program version. Conversely, creating and expanding the program version needs the initial guide from the proposal version, and furthermore, more specific decisions from the manuscript version. For example, table formats for the final results are generally unspecified in a proposal version. Determining the number and possible contents of these tables through a manuscript draft can provide practical guidance for data analyses and programming.

In general, the final manuscript version for a peer-review journal should have a length of around 25 to 35 pages in double line spacing, including a separate page for each table or figure. Therefore, there is a limit on the number of empirical results that one can report in a typical journal article. For example, it rarely happens that a journal article contains 20 tables or more. Thus, it is a good habit to write down how many pages are planned for each section at the beginning, when you have a clear and cool mind.
This will remind you of the page limits you prefer, so you will not waste your time overworking on a specific section, e.g., writing a very long literature review.

Below is the very first draft of the manuscript version for Sun et al. (2007). It is the foundation of the final manuscript version and it also provides a guide to the programming for this study. The corresponding proposal version is presented in Section 5.3.1 Design with survey data (Sun et al. 2007) on page 70. The following manuscript draft seems like a small variation of the proposal version, but actually, it is not. The first manuscript draft should be much more practical than the proposal version, with a specific number of tables and figures. This will provide a clear direction for programming and data analyses later on within R.

The First Manuscript Version for Sun et al. (2007)

1. Abstract (200 words). Have one or two sentences for research issue, objective, methodology, data, and results.
2. Introduction (2 pages in double line spacing). One or two paragraphs for each of the following items: research issue, what is known, what is unknown, knowledge gap and study need, objective, and an overview of the manuscript structure.
3. Literature review (3 pages). Two subsections: (a) liability concerns of forest landowners and incidents; and (b) liability insurance as a way to reduce liability.
4. Methodology (3 pages). Two subsections: (a) data and telephone survey; and (b) a binary logit model for liability insurance coverage. The methods employed are descriptive statistics and a binary logit regression.
5. Empirical findings (3 pages of text, 4 pages of tables, and 1 page of a figure). Three subsections: (a) pattern of injuries and damages; (b) pattern of liability insurance coverage; and (c) results from the binary logit regression of liability insurance coverage.
   Table 1. Recreational bodily injury and property damages
   Table 2. Pattern of liability insurance coverage
   Table 3. Definitions and means of variables in the logit model
   Table 4. Results of the binary logit regression analysis of liability insurance coverage
   Figure 1. Probability response curves for key determinants
6. Discussions (3 pages). Choose three to five key results from the empirical findings and discuss them.
7. References (3 pages). Cite about 30 papers.
end

Furthermore, within the manuscript draft, the contents of tables and figures should be specified or designed before any programming can take place. Their contents may change later on through data analyses or mining. Nevertheless, predicting or designing these tables and figures in advance will greatly improve programming efficiency. As an example, results of the logit regression analysis in Sun et al. (2007) are reported as Table 4 in its final published version. The very first version of this table is presented in Table 7.1 A draft table for the logit regression analysis in Sun et al. (2007). This table seems like a blank table, but actually, working on table drafts first is a key technique for improving data analysis efficiency. Note some hypothetical numbers are put in each column to determine the width and number of columns that are appropriate for a potential target publication outlet (i.e., Southern Journal of Applied Forestry in this case). This is necessary because the table will be generated from a program version directly later on.
Table 7.1 A draft table for the logit regression analysis in Sun et al. (2007)

Variable          Coefficient   t-ratio    Marginal effect   t-ratio
Constant          4.444         3.33***    0.666             7.777***
Injury
HuntYrs
Age
Race
...
Observations      1,700
Log-likelihood    −222.22
Chi-squared       55.55
Prediction        90%

7.2 Statistics for a binary choice model

A binary choice model describes a choice between two discrete alternatives, such as buying a car or not. The dependent variable is represented by 1 and 0, corresponding to the alternatives available to decision makers. This type of model can be estimated through ordinary least squares, generalized least squares, or maximum likelihood. Overall, binary choice models have been well covered in standard textbooks. In this section, formulas related to the linear probability model and the binary probit/logit model are presented. They will be used for demonstration or programming exercises later on in the book. To be brief, derivations of these formulas are not described here. Details are available in Judge et al. (1985), Baltagi (2011), and Greene (2011).

7.2.1 Definitions and estimation methods

In Sun et al. (2007), a binary choice model is employed for an insurance purchase decision by recreationalists. A binary choice model can be initiated or motivated in several ways. In economics, there is an economic agent behind each observation, e.g., a hunter. Thus, several theoretical models have been used to develop binary choice models, such as maximization of expected utility or an unobservable random index. Regardless of the underlying theoretical model, the empirical model is always in the format of y = f(X), as conceptualized in this book. For a binary choice model, the dependent variable only has two values, 1 or 0. Therefore, the formula can be adapted as follows:

    Pr(y_i = 1) = p_i = F(x_i β)
    Pr(y_i = 0) = 1 − p_i = 1 − F(x_i β)                                      (7.1)

where y_i is the choice made by recreationalist i; Pr(y_i = 1), or p_i in short, is the probability of y_i being 1; x_i is a vector of K independent variables associated with recreationalist i and its dimension is 1 × K; and β is the coefficient vector (K × 1). The independent variables in Sun et al. (2007) are composed of three groups:

    x_i β = β_0 + C_i β_1 + S_i β_2 + T_i β_3                                  (7.2)

where β_m (m = 0, 1, 2, 3) are parameter vectors to be estimated. In total, 12 variables are included in three groups: recreational experience of a person (i.e., C_i = Injury, HuntYrs), license type (i.e., S_i = Nonres, Lspman, Lnong), and socio-demographic characteristics (i.e., T_i = Gender, Age, Race, Marital, Edu, Inc, TownPop).

Equation (7.1) can be expressed more concisely with matrix notation (Judge et al., 1985):

    Pr(y = 1) = p = F(Xβ)
    Pr(y = 0) = 1 − p = 1 − F(Xβ)                                             (7.3)

where the bold symbol y is the vector of the dependent variable (N × 1); p is the vector of probabilities; X is a matrix of independent variables (N × K); and N is the total number of observations.

A binary choice model can be estimated in several ways, depending on the choice of F (Judge et al., 1985). When estimated by ordinary least squares, it is called the linear probability model. As several properties of the linear probability model are unsatisfactory (e.g., predicted values being out of the range of 0 to 1), it can be improved by generalized least squares. At present, the more popular approach for binary choice models is a binary probit or logit model.
In this book, we cover both the linear probability and binary probit/logit models for the purpose of programming demonstration. Together, the general formulas for these models are:

    F(Xβ) = Xβ
    F(Xβ) = Φ(Xβ) = ∫_{−∞}^{Xβ} φ(t) dt = ∫_{−∞}^{Xβ} (2π)^{−1/2} e^{−t²/2} dt
    F(Xβ) = Λ(Xβ) = 1 / (1 + e^{−Xβ})                                          (7.4)

where the first equation is for the linear probability model, the second one is for the binary probit model, and the third one is for the binary logit model. The cumulative probability function is denoted by the symbol Φ for the normal distribution and Λ for the logistic distribution. These alternative models will be analyzed in detail below.

There are various outputs from estimating a binary choice model. Like any regression analysis, the main output is the coefficient estimates. In addition, marginal effects can be calculated to measure the magnitude of the impact on the choice from each independent variable. In the linear probability model, marginal effects are just the coefficient estimates. However, the linear probability model is not the best model for handling binary choice data, and marginal effects from the binary probit or logit model are perceived to be more accurate. As the relation between dependent and independent variables in a binary probit or logit model is nonlinear, calculating marginal effects and their standard errors is much more complicated.

7.2.2 Linear probability model

A linear model can be used to examine binary choice data, and this has been referred to as the linear probability model. The major problem with this approach is that the actual observed values of the choices are either 1 or 0, but the predicted values from a linear probability model are continuous and can fall outside the interval between 0 and 1. Furthermore, the error term is inherently heteroskedastic. As a result of these barriers, the linear probability model has gradually faded away in handling binary choice data.

In this subsection, basic formulas related to the linear probability model are presented for completeness. A linear model can be estimated by various methods. Ordinary least squares and maximum likelihood estimators are representative, and thus they are briefly presented here. These formulas will be used later in the book for programming demonstrations and exercises related to binary choice models. To begin, express a linear system in matrix form as follows (Greene, 2011):

    y = Xβ + e                                                                  (7.5)

where y and e are N × 1 vectors of the dependent variable and error term, respectively; X is an N × K matrix of independent variables; β is a K × 1 coefficient vector; N is the number of observations; and K is the number of independent variables. The dependent variable is assumed to be independently and identically distributed with equal variance at each observation, i.e., E[ee′] = σ²I_N.

Ordinary least squares estimation for a linear model

The ordinary least squares estimator of a linear system is well documented in statistics textbooks (e.g., Greene, 2011). The key formulas are as follows:

    β̂ = (X′X)⁻¹ X′y
    ŷ = Xβ̂
    ê = y − ŷ
    σ̂² = (ê′ê)/(N − K)
    cov(β̂) = σ̂² (X′X)⁻¹                                                         (7.6)

where β̂, ŷ, and ê are the estimated coefficients, fitted dependent variable, and estimated residuals, respectively. The scalar value σ̂² is an unbiased estimator of σ². The covariance matrix for the coefficients, i.e., cov(β̂), can be computed with the estimated variance σ̂².
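To make Equation (7.6) concrete, here is a minimal R sketch of the ordinary least squares computations by matrix algebra. The data are simulated purely for illustration, and the object names (y, X, beta.hat, and so on) are hypothetical rather than part of the book's sample study.

  # Minimal sketch of Equation (7.6): OLS by matrix algebra on simulated data
  set.seed(1)
  N <- 200; K <- 3
  X <- cbind(1, matrix(rnorm(N * (K - 1)), ncol = K - 1))   # constant plus two regressors
  y <- X %*% c(0.5, 1.2, -0.8) + rnorm(N)

  beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y              # coefficient estimates
  y.hat    <- X %*% beta.hat                                # fitted values
  e.hat    <- y - y.hat                                     # residuals
  sigma2   <- as.numeric(t(e.hat) %*% e.hat) / (N - K)      # unbiased variance estimate
  cov.beta <- sigma2 * solve(t(X) %*% X)                    # covariance matrix of coefficients
  cbind(beta.hat, sqrt(diag(cov.beta)))                     # estimates and standard errors

The same numbers can be checked against summary(lm(y ~ X - 1)), which applies the identical formulas internally.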
The prime symbol (′) denotes transposition of vectors or matrices, and the hat symbol (ˆ) indicates an estimated value.

Once the coefficients and standard errors are estimated, other statistics related to ordinary least squares can be computed. Take the coefficient of determination (R²) as an example. This is a measure of the goodness of model fit. Total variation in the dependent variable y can be measured as the deviation from its mean. This can be decomposed into two components: one explained by the model and the other in the residuals. Mathematically, the R² value can be computed as:

    R² = 1 − (ê′ê) / [(y − ȳ)′(y − ȳ)]                                          (7.7)

or alternatively,

    R² = 1 − (ê′ê) / (y′y − N ȳ²)                                               (7.8)

where the scalar value ȳ denotes the average of the dependent variable.

Maximum likelihood estimation for a linear model

A normal (or Gaussian) distribution is a continuous probability distribution with two parameters. If a random variable z is normally distributed, its probability density function can be expressed as (Greene, 2011):

    f(z | µ, σ) = (2πσ²)^{−1/2} exp[−(z − µ)² / (2σ²)]                          (7.9)
    f(z | 0, 1) = (2π)^{−1/2} exp(−z²/2)                                        (7.10)

where the parameter µ is the mean and the parameter σ is the standard deviation of the distribution. When µ = 0 and σ = 1, the distribution is called the standard normal distribution.

Assume that the dependent variable y_i (i = 1, . . . , N) in Equation (7.5) is normally and independently distributed with mean x_i β and variance σ². Then, the maximum likelihood principle can be employed to estimate the unknown parameters and the variance of the residuals, i.e., β and σ². The joint probability density function of y_i (i = 1, . . . , N), given the preceding mean and variance, can be expressed as L = f(y_1, y_2, . . . , y_N | Xβ, σ²). As the individual values of y_i are independent, the joint probability density function can be written as the product of N individual density functions as follows (Judge et al., 1985):

    L = f(y_1 | x_1 β, σ²) · · · f(y_N | x_N β, σ²) = ∏_{i=1}^{N} f(y_i | x_i β, σ²)      (7.11)

    f(y_i | x_i β, σ²) = (2πσ²)^{−1/2} exp[−(y_i − x_i β)² / (2σ²)]                        (7.12)

where in the second equation the probability density function of a normally distributed variable y_i is computed with the given mean and variance at each observation. Under the maximum likelihood criterion, the parameter estimates are chosen to maximize the probability of generating the observed sample. Using matrix notation, the likelihood function for the linear model can be expressed as:

    L(β, σ²) = (2πσ²)^{−N/2} exp[−(y − Xβ)′(y − Xβ) / (2σ²)]                    (7.13)

where the likelihood value becomes a function of β and σ². In general, the method of maximum likelihood is applied to the log-likelihood function:

    logL(β, σ²) = −(N/2) ln(2πσ²) − (y − Xβ)′(y − Xβ) / (2σ²)                   (7.14)

where log and ln both denote the natural logarithm. With this expression, routine optimization methods in a computer language like R can be used to estimate the unknown parameters and variance.

7.2.3 Binary probit and logit models

A number of continuous probability distribution functions have been employed to address the shortcoming of the linear probability model (Greene, 2011). A binary probit model is based on the standard normal distribution, and a binary logit model is based on the logistic distribution. Others include the Weibull and complementary log-log models. The binary probit and logit models are the most frequently used, so they are presented in more detail here.
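Before turning to the estimation details for the probit and logit models, here is a minimal sketch of the idea just mentioned: maximizing the linear-model log-likelihood in Equation (7.14) with R's general-purpose optimizer. The simulated data and object names are hypothetical and serve only to show the pattern.

  # Minimal sketch: maximum likelihood for the linear model, Equation (7.14)
  set.seed(2)
  N <- 200
  X <- cbind(1, rnorm(N), rnorm(N))
  y <- X %*% c(0.5, 1.2, -0.8) + rnorm(N)

  negLogLik <- function(par, y, X) {
    K <- ncol(X)
    beta   <- par[1:K]
    sigma2 <- exp(par[K + 1])                  # log parameterization keeps sigma2 positive
    e <- y - X %*% beta
    0.5 * length(y) * log(2 * pi * sigma2) + sum(e^2) / (2 * sigma2)
  }
  fit <- optim(par = c(0, 0, 0, 0), fn = negLogLik, y = y, X = X, method = "BFGS")
  fit$par[1:3]      # ML coefficient estimates, close to the OLS estimates
  exp(fit$par[4])   # ML variance estimate; note it divides by N rather than N - K

Because optim() minimizes by default, the function returns the negative of the log-likelihood in Equation (7.14).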
Estimating parameter values

The binary probit or logit model is nonlinear, so maximum likelihood can be used to estimate the parameters (Baltagi, 2011). The likelihood and log-likelihood functions can be expressed as:

    L = ∏_{i=1}^{N} [F(x_i β)]^{y_i} [1 − F(x_i β)]^{1 − y_i}                          (7.15)

    logL = Σ_{i=1}^{N} { y_i ln F(x_i β) + (1 − y_i) ln[1 − F(x_i β)] }                (7.16)

where F(.) is the cumulative distribution function, as expressed in Equation (7.4). In a standard computer language, these distribution functions are well defined. In R, it is the pnorm() function for the normal distribution and the plogis() function for the logistic distribution. Thus, for estimating parameter values only, the specific form of the cumulative distribution function is not needed, but it is indeed needed for the standard error computation for marginal effects later on. Similar to the linear probability model, numerical optimization techniques can be used to maximize the log-likelihood function and estimate the unknown parameters and their variance.

Calculating predicted probabilities

Once the coefficients in a binary probit or logit model are estimated, the predicted probability at specific values of the independent variables can be computed. This is similar to the predicted values from a linear probability model. The difference is that binary probit or logit models are nonlinear and usually contain dummy variables as independent variables, which makes the computation more challenging and rewarding. The simplest case is to compute the probability at the mean values of all variables:

    Pr(y = 1 | X̄) = p̂ = F(X̄ β̂)                                                       (7.17)

where X̄ contains the mean values of the set of independent variables, and other symbols are the same as defined above. The matrix dimension is 1 × K for X̄ and K × 1 for β̂. Thus, the resulting probability is a scalar value. This scenario is of little use in practical applications.

The more useful probabilities computed from binary probit or logit models are obtained by changing the values of one or two selected independent variables while fixing all other independent variables at their mean values. Specifically, there are three scenarios: (i) one selected independent variable is continuous; (ii) one selected independent variable is a dummy with the value of 1 or 0; and (iii) two independent variables are considered, with one being a continuous variable and the other being a dummy variable.

The first scenario focuses on a continuous independent variable, e.g., hunting years in Sun et al. (2007). The whole range of this variable can be divided equally into many intervals, e.g., M = 300 for 70 hunting years. With this treatment, a new matrix X̄_{s=seq} for the independent variables can be generated; the selected variable takes the equally spaced values and all the other variables take their fixed mean values. The probabilities computed have the dimension of M × 1. The relation between the probabilities and the selected variable can be revealed through a plot. Mathematically, this can be expressed as (a small R sketch below illustrates the construction):

    Pr(y = 1 | X̄_{s=seq}) = p̂ = F(X̄_{s=seq} β̂)                                        (7.18)

where s is the selected continuous variable, with a sequence of values being assigned.

In the above treatment, new values for the selected variable are generated over its whole range with an equal interval. Can we use the original values of this variable, e.g., hunting years in Sun et al. (2007)? The answer is yes.
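Here is a minimal R sketch of the first scenario in Equation (7.18): estimate a binary logit model and trace the predicted probability over an equally spaced grid of one continuous variable, holding the other variable at its mean. The simulated data and names (dat, huntyrs, injury) are hypothetical stand-ins, not the actual survey data of Sun et al. (2007).

  # Minimal sketch of Equation (7.18) with a logit model on simulated data
  set.seed(3)
  N <- 500
  dat <- data.frame(huntyrs = runif(N, 0, 70), injury = rbinom(N, 1, 0.3))
  dat$y <- rbinom(N, 1, plogis(-1.5 + 0.04 * dat$huntyrs + 0.8 * dat$injury))

  fit <- glm(y ~ huntyrs + injury, family = binomial(link = "logit"), data = dat)
  b   <- coef(fit)

  # X.seq: the selected variable varies over its range; the other column is held at its mean
  seq.hunt <- seq(min(dat$huntyrs), max(dat$huntyrs), length.out = 300)
  X.seq <- cbind(1, seq.hunt, mean(dat$injury))
  p.seq <- plogis(X.seq %*% b)                     # predicted probabilities, M x 1
  plot(seq.hunt, p.seq, type = "l",
       xlab = "Hunting years", ylab = "Predicted probability")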
The resulting curve will have a similar shape, but it may not be smooth because the original values of the selected variable are not equally distributed over its whole range. For practical applications, the interest is usually in the visual relation between the probability and a selected continuous variable through a graph. Thus, equally spaced values of the selected variable over its whole range are used.

The second scenario focuses on a dummy independent variable. As a dummy independent variable can only take the value of either 1 or 0, the above strategy for a continuous variable does not work anymore. Instead, two probability values need to be generated:

    Pr(y = 1 | X̄_{d=1}) = F(X̄_{d=1} β̂)
    Pr(y = 1 | X̄_{d=0}) = F(X̄_{d=0} β̂)
    M_d = ∆F̂ = F(X̄_{d=1} β̂) − F(X̄_{d=0} β̂)                                          (7.19)

where d is the selected dummy variable. In X̄_{d=1}, the selected dummy variable takes the value of 1 and the other independent variables take their mean values. X̄_{d=0} is similarly defined, with the dummy variable taking the value of 0. The difference between the two probability values, M_d, is also known as the marginal effect of the dummy independent variable d.

The third scenario combines the previous two scenarios, with both a continuous variable and a dummy variable included. For example, in Sun et al. (2007), the effects of hunting years (s = HuntYrs) and the residential status of a recreationalist (d = Nonres) on the probability of liability insurance purchase are assessed through three curves in one plot. The probability series for the selected continuous variable only can be computed with Equation (7.18). In combining the two variables together, two additional probability series can be constructed as follows (Greene, 2011):

    Pr(y = 1 | X̄_{s=seq, d=1}) = F(X̄_{s=seq, d=1} β̂)
    Pr(y = 1 | X̄_{s=seq, d=0}) = F(X̄_{s=seq, d=0} β̂)                                  (7.20)

where X̄_{s=seq, d=1} is a further modification of X̄_{s=seq}, with the column value for the dummy variable being set to 1. X̄_{s=seq, d=0} is similarly defined. Note that in Equation (7.19), the difference between the two probability values for a selected dummy variable is interpreted as the marginal effect at the mean values of all other independent variables. In Equation (7.20), the continuous variable does not take its mean value but varies over its whole range. Thus, the difference between the two series in Equation (7.20) is the marginal effect of the dummy variable over the whole range of the continuous variable. In a graph, the vertical difference between the two series at the mean value of the continuous variable should equal the value calculated from Equation (7.19).

Calculating marginal effects

The marginal effect of continuous independent variables from a binary probit or logit model can be calculated as (Greene, 2011):

    M = ∂E[y | X]/∂X = [∂F(Xβ̂)/∂(Xβ̂)] [∂(Xβ̂)/∂X] = f(Xβ̂) β̂ = f(ŷ) β̂                  (7.21)

where M is the vector of marginal effects of X, F is the cumulative distribution function, and f is the probability density function. The scale factor f(Xβ̂) can change with the value of X. For the linear probability model, the scale factor is one, so M = β̂. For the probit or logit model, the scale factor, and thus the marginal effect M, can be calculated in two ways: M = f(X̄β̂) β̂ or M = f̄(Xβ̂) β̂. Specifically, one way is to use the mean values of X; a short R sketch of this approach follows below.
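As a preview of the programming involved, here is a minimal sketch of this first approach for a logit model: the scale factor is evaluated at the mean values of X as in Equation (7.21), and the dummy-variable version follows Equation (7.19). The simulated data and object names are hypothetical and are not the book's actual program.

  # Minimal sketch of Equation (7.21) (marginal effects at the means) and
  # Equation (7.19) (marginal effect of a dummy variable), logit model, simulated data
  set.seed(4)
  N <- 500
  dat <- data.frame(huntyrs = runif(N, 0, 70), injury = rbinom(N, 1, 0.3))
  dat$y <- rbinom(N, 1, plogis(-1.5 + 0.04 * dat$huntyrs + 0.8 * dat$injury))
  b <- coef(glm(y ~ huntyrs + injury, family = binomial, data = dat))

  X.bar <- c(1, mean(dat$huntyrs), mean(dat$injury))     # means, with a constant term
  scale <- dlogis(sum(X.bar * b))                        # logistic density at X.bar %*% b
  M     <- scale * b                                     # marginal effects at the means
  # (the first element corresponds to the constant and is usually not reported)

  # Dummy variable (injury): difference of two predicted probabilities, Equation (7.19)
  X.d1 <- replace(X.bar, 3, 1)
  X.d0 <- replace(X.bar, 3, 0)
  M.d  <- plogis(sum(X.d1 * b)) - plogis(sum(X.d0 * b))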
The other way is to calculate the scale factor for each observation, then take the average of all the scale factor values, and compute the marginal effects at the end. In practice, the two approaches may generate similar results (Greene, 2011).

Marginal effects for dummy independent variables can be calculated in the same way as for continuous variables. Theoretically, however, the concept of a derivative is more accurate for a small change, while a dummy variable takes the discrete values of 1 and 0 only. Thus, a more appropriate marginal effect formula for a binary independent variable is the difference of predicted probabilities associated with the statuses of 1 and 0, as presented in Equation (7.19) (Greene, 2011). In practice, it seems that treating a dummy variable as a continuous variable or not generates very small differences for binary choice models.

Conceptually, predicted probabilities and marginal effects are closely related to each other. For a continuous variable, the marginal effect is the change of the predicted probabilities over a small change of the focal variable. In a graph created from Equation (7.18), the marginal effect of a continuous independent variable is the slope of the corresponding predicted probability curve. For a dummy independent variable, its marginal effect is the change of the probabilities between the two statuses (1 versus 0), which can be revealed by a graph from Equation (7.20). These relations are revealed well through Figure 1 in Sun et al. (2007) for the combined effects of one continuous variable and another dummy variable.

Finally, the data set used for calculating the marginal effect can be the whole data set (X) or a subset only. Subsetting the data is generally applied through a dummy variable. For example, one independent variable in Sun (2006a) is the party affiliation of a house representative or a senator in the United States, with the value of 0 for Democrats and 1 for Republicans. This dummy independent variable can be used to split the whole data set into two. Then marginal effects and standard errors can be calculated for each data set, with the same coefficient and covariance matrices estimated from the whole data set. The formulas for all the computation are still the same.

Standard errors for predicted probabilities and marginal effects

The delta method can be used to compute standard errors for predicted probabilities and marginal effects (Baltagi, 2011). First, denote the predicted probabilities as p̂ = F(X̄ β̂), where the values of X̄ can change for different needs, e.g., Equations (7.18) to (7.20). The asymptotic covariance matrix of p̂ can be derived as:

    ∂p̂/∂β̂ = [∂F(X̄β̂)/∂ŷ] [∂ŷ/∂β̂] = f(X̄β̂) X̄ = f̂ X̄                                     (7.22)

    cov(p̂) = [∂p̂/∂β̂] V̂ [∂p̂/∂β̂]′                                                      (7.23)

where V̂ is the estimated asymptotic covariance matrix of β̂, and ŷ = X̄β̂. Note that f(X̄β̂) is the derivative of the vector F(X̄β̂) with respect to another vector ŷ. Assume the length of F(X̄β̂) and ŷ is M × 1. Then, by the rules of matrix calculus, f(X̄β̂) is a matrix of M × M, and in this particular case, it is also a diagonal matrix.

The marginal effect of a dummy variable is defined in Equation (7.19) as M_d = ∆F̂. Its asymptotic covariance matrix can be expressed as:

    ∂M_d/∂β̂ = f̂_1 X̄_{d=1} − f̂_0 X̄_{d=0}                                               (7.24)

    cov(M_d) = [∂M_d/∂β̂] V̂ [∂M_d/∂β̂]′                                                 (7.25)

where f̂_1 = f(X̄_{d=1} β̂) is related to the matrix X̄_{d=1}, as defined in Equations (7.19) to (7.22). f̂_0 is similarly defined.
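The delta method in Equations (7.22) and (7.23) translates directly into a few lines of R. Below is a minimal, self-contained sketch for a logit model; the simulated data and object names are hypothetical, and vcov(fit) plays the role of V̂.

  # Minimal sketch of Equations (7.22)-(7.23): delta-method standard error
  # for the predicted probability at the means of X, logit model
  set.seed(5)
  N <- 500
  dat <- data.frame(huntyrs = runif(N, 0, 70), injury = rbinom(N, 1, 0.3))
  dat$y <- rbinom(N, 1, plogis(-1.5 + 0.04 * dat$huntyrs + 0.8 * dat$injury))
  fit <- glm(y ~ huntyrs + injury, family = binomial, data = dat)

  b <- coef(fit)
  V <- vcov(fit)                                   # estimated covariance matrix of beta-hat
  X.bar <- c(1, mean(dat$huntyrs), mean(dat$injury))

  p.hat <- plogis(sum(X.bar * b))                  # predicted probability at the means
  grad  <- dlogis(sum(X.bar * b)) * X.bar          # Equation (7.22): f-hat times X-bar
  var.p <- t(grad) %*% V %*% grad                  # Equation (7.23)
  c(probability = p.hat, std.error = sqrt(as.numeric(var.p)))

The same pattern, with the gradient replaced by Equation (7.24) or (7.26), gives the standard errors for the dummy and continuous marginal effects.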
The marginal effect of continuous independent variables is $M = f(X\hat{\beta})\,\hat{\beta} = f(\hat{y})\,\hat{\beta}$, as presented in Equation (7.21). Its asymptotic covariance matrix can be expressed as:

$$
\frac{\partial M}{\partial \hat{\beta}} = \hat{f}I + \frac{d\hat{f}}{d\hat{y}}\,\hat{\beta}\,\frac{\partial (X\hat{\beta})}{\partial \hat{\beta}} = \hat{f}I + \frac{d\hat{f}}{d\hat{y}}\,\hat{\beta}X = \hat{f}\,(I + w\,\hat{\beta}X)
\tag{7.26}
$$

$$
\mathrm{cov}(M) = \frac{\partial M}{\partial \hat{\beta}}\,\hat{V}\,\frac{\partial M}{\partial \hat{\beta}^{\prime}}
\tag{7.27}
$$

where the product rule for matrix calculus needs to be applied in the first equation. In addition, it can be verified that $d\hat{f}/d\hat{y} = -\hat{y}\hat{f}$ for the standard normal distribution, and $d\hat{f}/d\hat{y} = (1 - 2\hat{F})\hat{f}$ for the logistic distribution. After combining terms, the difference between a probit model and a logit model for the purpose of programming is in the scale factor: $w = -\hat{y}$ for the probit model and $w = 1 - 2\hat{F}$ for the logit model.

7.3 Estimating a binary choice model like a clickor

As shown in Section 2.4 Playing with R like a clickor on page 22, using pull-down menus for statistical analyses is intuitive when an application such as the R Commander is adopted. However, the long-term costs are much bigger than the benefits. In this section, a simplified analysis for Sun et al. (2007) is presented to demonstrate the characteristics of this approach. The R Commander is used to show the whole process again, with the emphasis on real data and a meaningful model specification.

7.3.1 A logit regression by the R Commander

The data set used for this demonstration was collected through a telephone survey as reported in Sun et al. (2007). In the end, 57% of the participants completed the phone interview successfully and the final data set contained 1,653 observations. The data set was saved in two different formats: RawDataIns2.csv and RawDataIns2.xls. They contained the same information with 1,653 rows and 14 columns. The column header corresponded to the variable names: Y, Injury, HuntYrs, Nonres, Lspman, Lnong, Gender, Age, Race, Marital, Edu, Inc, TownPop, FishYrs, where Y was a binary dependent variable and the other 13 were independent variables (Sun et al., 2007). Among the 13 independent variables, HuntYrs and FishYrs were highly correlated, so only one of them was used in the regression.

The goal here is to demonstrate the characteristics of using pull-down menus for one's own data. Thus, only some selected steps are executed. More complete statistical analyses will be shown later on through the program version for Sun et al. (2007). Specifically, the selected steps are:

1. Import the data from a local drive by a software application;
2. Generate basic descriptive statistics of the data (e.g., mean and standard deviation);
3. Conduct a binary logit regression with 12 independent variables: Injury, HuntYrs, Nonres, Lspman, Lnong, Gender, Age, Race, Marital, Edu, Inc, TownPop; and
4. Estimate the logit regression again with the variable of HuntYrs replaced by FishYrs.

The first step of the exercise is to import data from the local drive. After the R Commander is loaded in an R session with the command of library(Rcmdr), data can be imported through the menu of Data ⇒ Import data ⇒ from Excel. Navigate to the file of RawDataIns2.xls on a local drive. If you have followed Step 6 in Section 2.2.1 Recommended installation steps on page 17, then this file should be saved on your computer like this: C:/aErer/RawDataIns2.xls. Finally, clicking the button of View data set will display the data set.

[Figure 7.1 Importing external Excel data in the R Commander]
The main interface and relevant menu items are shown in Figure 7.1 Importing external Excel data in the R Commander. Note commands associated with each click will show up in the R Commander interface. The second step is to generate summary statistics. This is produced by following the menu of Statistics ⇒ Summaries ⇒ Active data set. The third step is to fit a binary logit model. This is similar to the above step by using the menu of Statistics ⇒ Fit models ⇒ Generalized linear model. Variables can be selected within a new window. In the final step, the logit model is fitted again by changing one explanatory variable. Some partial results from the binary logit model fit are shown as an example in Figure 7.2 Fitting a binary logit model in the R Commander. 7.3.2 Characteristics of clickors Going through a simple example like the above by pull-down menus is an experience every student should have. If you have not done that, play it for a while. Then you can understand the benefit and cost of this approach. On the positive side, pull-down menus are self-explanatory and easy to use. This may be especially attractive if a group of software users are diverse and are unlikely to do any complicated actions. For statistical analyses in economics, this approach may have some values to undergraduate education. It inspires students and gets them engaged in the learning process. In addition, this approach can also be useful if one is just interested in using a specific procedure or function in a software product for a regression, but has no interest in 112 Chapter 7 Sample Study A and Predefined R Functions Figure 7.2 Fitting a binary logit model in the R Commander learning the whole software. For example, the software of LIMDEP has great procedures for multinomial logit model. One can just use the pull-down menu to do the regression without reading the thick user guide. On the negative side, using pull-down menus for statistical analyses has many drawbacks. The major problem is that using pull-down menus does not keep a complete track of what has been tried and done. While all the outputs in the window can be saved, it is not easy to edit or annotate them. Without a record, communication between group members of a project is difficult, even if not completely impossible. After a project is finished for a while, it is difficult to have clear answers to the questions of how, what, and when. Personally, this occurred very often between me and graduate students in the past. Related to the lack of record, using the clicking approach is prone to errors. It is true that sometimes a small error in a computer program can be hard to detect and result in severe consequences. However, the clicking approach makes error identification even more challenging, because very few clues are available to users after a few months. In addition, note that steps three and four in the above example are different by one independent variable only. In a computer program, commands can be copied, pasted, and modified in just a few seconds. Finally, without a full program version for an empirical study, the analysis on a data set may be fragmented and disorganized. Overall, clicking with pull-down menus is an approach that should be discouraged for empirical analysis in economics in the long term. As computer programming has become so easy to learn, I believe that everyone should learn a computer language and build up programming skills gradually. 7.4 Program version for Sun et al. 
(2007) Program 7.1 The first program version for Sun et al. (2007) contains the major steps to generate outputs in table or figure formats. The structure is set up with the information 7.4 Program version for Sun et al. (2007) 113 and guide from the proposal version and the first manuscript version of this project. While this seems straightforward and looks like a direct copy of the first manuscript version, it does set up the framework where a program version can be built up gradually. On the basis of this draft, more contents will be added and the final program version can be developed. In general, a program version should be able to generate and reproduce all the tables and figures reported in a manuscript. To save space here, the program version for Sun et al. (2007) as drafted in Program 7.1 is reduced, with Tables 1 and 2 in the original publication being excluded. The reason for this simplification is that we need a lean program to explain basic R grammar later on in this chapter. Program 7.2 The final program version for Sun et al. (2007) contains detailed R commands. Note the line numbers at the left side are not part of the program, but an aid for reference. Running this program at R can reproduce the key results reported in the manuscript version within one minute. The tables and figures generated from this program are still a little bit different from the ones published in Sun et al. (2007). Table 4 is copied below to show its appearance in the R console. Once the results are saved on a local drive, tables can be copied to a word processor for final formatting (e.g., Microsoft Word), and figures should be copied without any further formatting. The amount of formatting work on tables is usually small (e.g., less than 1%). If one prefers to have tables finalized in R completely, then a combination of R for the program version and LATEX for the manuscript version is needed. The challenge for conducting an empirical study is to start with the simple skeleton as listed in the first program version, add commands to generate the expected tables and figures gradually, and at the end format the program version in a professional way. To do programming efficiently, one needs to learn a computer language step by step. In that regard, the main goal of all the three parts about programming in the middle of this book is to teach students how to move efficiently from Program 7.1 to Program 7.2. Program 7.1 The first program version for Sun et al. (2007) 1 2 # Title: R program for Sun et al. (2007 SJAF) # Date: January 2006 3 4 5 # 0. Libraries and global setting # Load some libraries; Set up working directory 6 7 8 9 # 1. Import raw data in csv format # 2. Descriptive statistics # Generate Table 3 10 11 12 13 14 # # # # 3. Logit regression and figures 3.1 Logit regression 3.2 Marginal effect Generate Table 4 15 16 17 18 # 3.3 Figures: probability response curve # Show and customize one graph on screen device # Save three graphs on file device (Figure 1a, 1b, 1c) 19 20 # 4. Export results in tables 114 Chapter 7 Sample Study A and Predefined R Functions Program 7.2 The final program version for Sun et al. (2007) 1 2 # Title: R program for Sun et al. (2007 SJAF) # Date: January - May 2006 3 4 5 6 7 8 9 10 # # # # # # # ------------------------------------------------------------------------Brief contents 0. Libraries and global setting 1. Import raw data in csv format 2. Descriptive statistics 3. Logit regression and figures 4. 
Export results 11 12 13 14 15 16 # ------------------------------------------------------------------------# 0. Libraries and global setting library(erer) # functions: bsTab(), maBina(), maTrend() wdNew <- 'C:/aErer' # Set up working directory setwd(wdNew); getwd(); dir() 17 18 19 20 21 22 23 # ------------------------------------------------------------------------# 1. Import raw data in csv format daInsNam <- read.table(file = 'RawDataIns1.csv', header = TRUE, sep = ',') daIns <- read.table(file = 'RawDataIns2.csv', header = TRUE, sep = ',') class(daInsNam); dim(daInsNam); print(daInsNam); class(daIns); dim(daIns) head(daIns); tail(daIns); daIns[1:3, 1:5] 24 25 26 27 28 29 30 31 # ------------------------------------------------------------------------# 2. Descriptive statistics (insMean <- round(x = apply(X = daIns, MARGIN = 2, FUN = mean), digits =2)) (insCorr <- round(x = cor(daIns), digits = 3)) table.3 <- cbind(daInsNam, Mean = I(sprintf(fmt="%.2f", insMean)))[-14, ] rownames(table.3) <- 1:nrow(table.3) print(table.3, right = FALSE) 32 33 34 35 36 37 38 39 40 41 42 43 44 # ------------------------------------------------------------------------# 3. Logit regression and figures # 3.1 Logit regression ra <- glm(formula = Y ~ Injury + HuntYrs + Nonres + Lspman + Lnong + Gender + Age + Race + Marital + Edu + Inc + TownPop, family = binomial(link = 'logit'), data = daIns, x = TRUE) fm.fish <- Y ~ Injury + FishYrs + Nonres + Lspman + Lnong + Gender + Age + Race + Marital + Edu + Inc + TownPop rb <- update(object = ra, formula = fm.fish) names(ra); summary(ra) (ca <- data.frame(summary(ra)$coefficients)) (cb <- data.frame(summary(rb)$coefficients)) 45 46 47 48 # 3.2 Marginal effect (me <- maBina(w = ra)) (u1 <- bsTab(w = ra, need = '2T')) 7.4 Program version for Sun et al. (2007) 49 50 51 52 53 115 (u2 <- bsTab(w = me$out, need = '2T')) table.4 <- cbind(u1, u2)[, -4] colnames(table.4) <- c('Variable', 'Coefficient', 't-ratio', 'Marginal effect', 't-ratio') table.4 54 55 56 57 58 # 3.3 Figures: probability response curve (p1 <- maTrend(q = me, nam.d = 'Nonres', nam.c = 'HuntYrs')) (p2 <- maTrend(q = me, nam.d = 'Nonres', nam.c = 'Age')) (p3 <- maTrend(q = me, nam.d = 'Nonres', nam.c = 'Inc')) 59 60 61 62 63 64 # Show one graph on screen device windows(width = 4, height = 3, pointsize = 9) bringToTop(stay = TRUE) par(mai = c(0.7, 0.7, 0.1, 0.1), family = 'serif') plot(p1) 65 66 67 68 69 70 71 72 73 74 75 # Save three graphs on file device fname <- c('OutInsFig1a.png', 'OutInsFig1b.png', 'OutInsFig1c.png') pname <- list(p1, p2, p3) for (i in 1:3) { png(file = fname[i], width = 4, height = 3, units = 'in', pointsize = 9, res = 300) par(mai = c(0.7, 0.7, 0.1, 0.1), family = 'serif') plot(pname[[i]]) dev.off() } 76 77 78 79 80 # ------------------------------------------------------------------------# 4. Export results write.table(x = table.3, file = 'OutInsTable3.csv', sep = ',') write.table(x = table.4, file = 'OutInsTable4.csv', sep = ',') Note: Major functions used in Program 7.2 are setwd(), getwd(), dir(), read.table(), class(), dim(), print(), head(), tail(), round(), mean(), apply(), data.frame(), nrow(), rownames(), glm(), update(), names(), summary(), cbind(), colnames(), par(), windows(), plot(), png(), dev.off(), write.table(), maTrend(), bsTab(), maBina(), and bringToTop(). 
# Selected results from Program 7.2 > table.4 Variable Coefficient t-ratio Marginal effect t-ratio 1 (Intercept) -3.986*** -5.514 -0.519*** -5.867 2 Injury 0.245 0.466 0.032 0.466 3 HuntYrs 0.014** 2.402 0.002** 2.412 4 Nonres 0.761* 1.910 0.121 1.613 ... 11 Edu -0.010 -0.328 -0.001 -0.328 12 Inc 0.004* 1.867 0.001* 1.873 13 TownPop 0.002 1.029 0.000 1.030 116 7.5 Chapter 7 Sample Study A and Predefined R Functions Basic syntax of R language There are 80 lines in Program 7.2 The final program version for Sun et al. (2007). It does not take much time to understand the basic structure and a large portion of the program version even if you have never used R before. To facilitate understanding, let us compare English as a language for a manuscript version and R as a computer language for a program version from several aspects: section, paragraph, sentence, and word, as summarized in Table 7.2 A comparison of grammar rules between English and R. 7.5.1 Sections and comment lines In a manuscript version, we use titles such as “Introduction” and “Conclusion” to name sections. Usually a published ten-page manuscript version of an empirical study can have five to seven sections. Similarly, we can have section titles in a program version in R. Specifically, the # sign, or formatted as # in the R program, allows anything after it and on the same line to be treated as comments, not commands. This allows us to make notes or comments on the whole program in a very flexible and informative way. For instance, among the 80 lines in Program 7.2, there are 24 complete comment lines that start with # (i.e., 30%). Most of the comments are created with the first draft of the program, and some are added when the program is improved. To what degree one prefers to comment on a program is largely a personal choice. Nevertheless, there must be some basic comments in a program version to facilitate reading. You may be surprised how fast you forget what you have done a month ago. You may also be surprised how many researchers do not have any comment in a program at all. In the past, I read a few R programs from colleagues without any section separator. These programs can be over 1,000 lines or 20 pages long (i.e., about 50 lines per page). To me, that is like a long manuscript in English without section titles. Based on these experiences, I strongly suggest that young professionals even create a small comment block like “Brief contents” when an R program is first created. That will remind and force one to organize the program in a logical way. Comments in R must start with the # sign, either at the beginning of a line, or in the middle of a line as needed (e.g., line 14). R does not have any symbol for block comments. In contrast, SAS, for instance, has a pair of symbols for a block of comments, so anything between /* and */ is treated as comments. With a single click, editors like Tinn-R allow users to select a block of lines and add a # sign to the beginning of each line simultaneously. Thus, the lack of a block comment symbol in R is not a big drawback. The comment sign can also be used in testing command lines. When one needs to exclude several command lines in the middle of command blocks, one can just add the comment sign to the beginning of these lines. When these commands need to be included later on, the comment sign # can be removed. 
In computer programming, these actions are often called as “comment it in” (i.e., removing the comment sign and turning a line into a command), or “comment it out” (i.e., adding the comment sign to a line and removing it out of the command status). 7.5.2 Paragraphs and command blocks A paragraph should always have a center meaning. Paragraphs in a manuscript are indicated by blank lines, indention at the beginning, or some white spaces at the end. Similarly, blank lines can be inserted in a program version to achieve the same effect. This divides a long program into many blocks or paragraphs. In most cases, without further detailed comments 117 7.5 Basic syntax of R language Table 7.2 A comparison of grammar rules between English and R Item English R Language Section Section title Comment lines starting with #; section numbers; dashed lines or similar symbols as separators Paragraph Blank lines or indention Blank lines for code blocks Sentence Period, question mark, or similar punctuations End of a line without any special symbol; end of a pair of parentheses; operators Word Any word in a dictionary Some names reserved for internal functions; flexible for user-defined names; case sensitive; a period allowed in a name on these paragraphs, readers can understand the center purpose of a block well. Thus, do not underestimate the benefit of blank lines. There are 11 out of 80 lines (e.g., line 3) in Program 7.2, or 14% of the total. Combining the comment and blank lines together, that is 44% in this program. In other words, without learning any R function, you can understand 44% of the program version for Sun et al. (2007). Comment and blank lines are simple but critical to compose an R program version efficiently. It allows researchers to organize the analysis into blocks, and solve the problem step by step. Some software applications, e.g., R Commander, generate commands in the output window automatically, so users can see the commands for each click or action. Unfortunately and not surprisingly, no software can add comment or blank lines to a statistical analysis. Combining all command lines from an output window is fundamentally different from the approach I advocate here. From the beginning, we create the structure of the program version on purpose and manage the whole process systematically, e.g., starting with Program 7.1 The first program version for Sun et al. (2007). Forgetting the name of a specific R function is normal and it can be solved by checking R help documents. Having a messy structure or no structure at all is more troublesome. Therefore, to emphasize a critical point about computer programming here, it is the wise and diligent use of comment and blank lines that allows the program version for an empirical study to be created with a desired format from the beginning. Then, it can be revised and built up incrementally by block, table, or figure. 7.5.3 Sentences and commands A command in R is similar to a sentence in English. First, let us have a look at the typical structure of a command in R. In English, a typical sentence has the structure of subjectverb-object, e.g., I receive apples. In R, a typical command has the following structure: object.name <- value where a new object is created and the value is assigned, e.g., my.weight <- 180, or table.4 <- cbind(u1, u2)[, -4] (i.e., line 50 in Program 7.2). Note the major assignment symbols in R are <- and =. The difference between <- and = is small. In most cases, <- can be replaced by =, but <- is clearer than =. 
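A small sketch of the two assignment forms follows; the object names are made up for illustration.

# Create an object with the leftward assignment operator
my.weight <- 180
# The equal sign also works at the top level, although <- is preferred
my.score = 95
# Typing an object name alone prints its value on the screen
my.weight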
In a function call like line 36 in Program 7.2, the equal sign has to be used to assign values to the arguments within a function call (e.g., formula = ...). To continue the comparison between English and R as a computer language, the assignment operator in R is similar to a verb in English. 118 Chapter 7 Sample Study A and Predefined R Functions Furthermore, there are many variations of the basic command structure in R, which will be covered gradually later on. In particular, when a command is composed of an object name only, it requests that the value or content of the object be printed on a computer screen. For instance, the command on line 53 in Program 7.2 shows the content of table.4. Of course, before the value of an object can be shown on a screen, it should be created first with the assignment form, unless the object is a built-in constant in R (e.g., 35 or pi for 3.1415). Sentences in English usually end with punctuations like a period, an exclamation mark, or a question mark. In contrast, in most cases, each line in R language is called a command. There is no such thing like a period at the end of a command line in R. Many software applications use special features to indicate the end of a command (e.g., run in SAS or $ in LIMDEP). The choice by R may look less organized than other software applications at a first glance, but after a while, it will become apparent that it indeed makes programming in R very concise. While in most situations one R command is one line, R allows multiple command lines to be put on the same line and separated by a punctuation sign of “;”. For example, line 22 in Program 7.2 has five commands, and line 23 has three commands. This allows some short but related commands to be put on a single line, making an R program concise. A command can be longer than a single line in R. There are two ways to organize or indicate a multiple-line command. The first way is through operators. In Program 7.2, the binary operator of + at the end of line 39 indicates that the command will continue into the next line. In splitting a long command line, it is always a good habit to put operators at the end of lines. In Program 7.2, line 36 also follows the same rule. The second way is through parentheses and the like. R uses parenthesis (, square bracket [, and curly brace { to indicate a multiple-line command. In R, parentheses and the like must be in pair, similar to that in English. For example, lines 36 to 38 in Program 7.2 are a single command over three lines. There are two pairs of parentheses: one pair on line 38 and the other on lines 36 and 38. Sometimes, a command can contain many pairs of parentheses, so one needs to be careful in balancing these pairs. In the editor of Tinn-R, when a cursor is moved to one parenthesis, its color is changed to red to indicate the matching part. These utilities can reduce a lot of stress on eyes. A classic example of matching parentheses or braces is related to the if statement in R. In Program 7.3 Curly brace match in the if statement, three uses of the function are demonstrated; the first two are right and the third is wrong. The first use is a simple if statement without the optional part of else, so the pair of curly braces on line 3 and 5 are matched correctly. In the second use of if with the optional else part, “} else {” is correctly formatted as a single string and put on the same line. 
In the third case, lines 17 and 18 will be treated as a single command because R has the curly braces matched already before reading line 19. As a result, lines 19 and 21 will generate two error messages, and line 20 will become an independent, valid, but unwanted command. Program 7.3 Curly brace match in the if statement 1 2 3 4 5 6 7 # a. Correct use of if() without the else part x <- 10 if (x > 5) { m <- x + 200 } x; m 7.5 Basic syntax of R language 8 9 10 11 12 13 14 119 # b. Correct use of if() with the else part if (x > 5) { y <- x + 1 } else { y <- x + 2 } x; y 15 16 17 18 19 20 21 22 # c. Incorrect use of if/else; a mismatch of curly braces if (x > 5) { z <- x + 1} else { z <- x + 2 } x; z # Selected results from Program 7.3 > x; m [1] 10 [1] 210 > x; y [1] 10 [1] 11 > if (x > 5) { + z <- x + 1 } > else { Error: unexpected ‘else’ in "else" > z <- x + 2 > } Error: unexpected ‘}’ in "}" > x; z [1] 10 [1] 12 7.5.4 Words and object names Words in English are made from letters and symbols and used to create sentences. Most people use less than 1,500 English words in conversation and writing. The standard for these words is an English dictionary. When English words are used to compose comments in R, they still follow the same grammar rules. Inside all R commands, everything is an object so the most basic building input in a program is object names. The status of object names in R is similar to that of word in English. However, object names on R command lines usually do not follow the rules in an English dictionary anymore, even if they have the same appearance. For example, on line 27 in Program 7.2, there is a word of mean. This is the name of an R function that calculates 120 Chapter 7 Sample Study A and Predefined R Functions the average of a given data. A function with the same role can be recreated and named differently, such as average, mymean, or mEAn. In general, these keywords for R functions are named to give users a clue of their roles, but they can be coined in any way a programmer likes. Similarly, names can be created for objects that are only meaningful for the current research and environment. For example, daInsNam on line 20 in Program 7.2 is the object name of 14 variables imported for Sun et al. (2007). In creating object names in R, it should be noted that R is case sensitive, so the names of dog and Dog are different. This asks users to be very careful about details, but does allow more flexible ways of creating object names. Usually, using the dot symbol and mixing low and upper cases can create very informative object names in R. As examples, table.4 is used to represent a table in Program 7.2. In package erer, data sets are named with the prefix of da like daIns and daPe, so whenever users see these names, it is apparent that they are data objects. Procedures related to a topic can be created with a new prefix. For instance, ma is used for marginal effect analysis as in maBina() and maTrend(). The underscore symbol can also be used to create object names in R (e.g., my_weight). However, it is less frequently used than the dot or period symbol (e.g., my.weight), and even discouraged by many users in R. This is probably because the period symbol is easier to type and arguably better looking in the middle of an object name than the underscore symbol. The hyphen symbol (-) cannot be used inside an object name because it is the minus operator in R. Finally, the name of an object cannot contain any space; otherwise, it would become two names. 
Thus, “table 4” is a valid name for a file or phrase in English under the Microsoft Windows system. In R, however, it should be named as table.4, Table.4, table_4, or table4. 7.6 Formatting an R program In this book, we divide the production process of an empirical study into three stages: proposal, program, and manuscript. Each of them has a balance between contents and presentation techniques. For a proposal, good grantsmanship can improve the funding probability. For a manuscript, appropriate formats can improve its readability. The same logic is applicable for a computer program. The purpose of R formatting guide is to make an R program easier to read and share. The difference is in the readership. Both a proposal and a manuscript need to be read and evaluated by reviewers other than the authors. Thus, there exists a variety of formatting or presentation guides for proposal and manuscript preparation, e.g., a guide to authors from a specific journal. In contrast, an R program is mainly for personal use or sharing among close colleagues, and it is not for publishing. As a result, the format of an R program is largely determined by personal choices or the tradition adopted in a local working environment. To some degree, most R programming guides are just some suggestions. For example, the Google’s R Style Guide is one of such guides. You can search it on its Web site and read the full description. The common thread of these guides is straightforward: just use your common sense, and additionally, be consistent (which is also a common sense). In general, it takes some time and patience to learn how to format an R program, just like learning the skills of formatting a manuscript in English. You can also learn from reading others’ programs if your mind is always open and your eyes are sharp. Programs with good formats can serve as a good module to follow. Programs with bad formats can also be inspiring to users from a different perspective, as similar mistakes can be avoided. We have presented a good example in Program 7.2 The final program version for 7.6 Formatting an R program 121 Sun et al. (2007). To have a bad formatting example, the commands in Program 7.2 are copied and edited with a few bad formatting treatments. The results are presented in Program 7.4 A poorly formatted program version for Sun et al. (2007). The main treatments are eliminating the comment and blank lines, removing spaces and indention, and allowing automatic wrap-ups of long lines. All the commands are kept so it still can generate the same results. The total number of lines is reduced from 80 to 29. Of course, this revised program with bad formats is difficult to read and digest, if possible at all. In the following subsections, several key aspects of a good looking R programs are elaborated. They are based on my personal experience so please follow them at your discretion. Program 7.4 A poorly formatted program version for Sun et al. 
(2007) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 library(erer);wdNew<-"C:/aErer";setwd(wdNew);getwd();dir() daInsNam=read.table("RawDataIns1.csv",header=TRUE,sep=",");daIns=read.table ("RawDataIns2.csv",header=TRUE,sep=",");class(daInsNam);dim(daInsNam);print (daInsNam);class(daIns);dim(daIns);head(daIns);tail(daIns);daIns[1:3,1:5] (insMean<-round(x=apply(daIns,MARGIN=2,FUN=mean),digits=2)) (insCorr<-round(x=cor(daIns),digits=3)) table.3 <- cbind(daInsNam, Mean = I(sprintf(fmt="%.2f", insMean)))[-14, ] rownames(table.3)<-1:nrow(table.3);print(table.3,right=FALSE) ra<-glm(formula=Y~Injury+HuntYrs+Nonres+Lspman+Lnong+ Gender+Age+Race+Marital+Edu+Inc+TownPop,family=binomial(link="logit"),data =daIns,x=TRUE) fm.fish<-Y~Injury+FishYrs+Nonres+Lspman+Lnong+Gender+Age+Race+Marital+ Edu+Inc+TownPop rb<-update(object=ra,formula=fm.fish);names(ra);summary(ra) (ca<-data.frame(summary(ra)$coefficients)) (cb<-data.frame(summary(rb)$coefficients)) (me<-maBina(w=ra)) (u1<-bsTab(w=ra,need="2T"));(u2<-bsTab(w=me$out,need="2T")) table.4<-cbind(u1,u2)[,-4] colnames(table.4)<-c("Variable","Coefficient","t-ratio","Marginaleffect", "t-ratio");table.4 (p1<-maTrend(q=me,nam.c="HuntYrs",nam.d="Nonres"));(p2<-maTrend(q=me,nam.c= "Age",nam.d="Nonres"));(p3<-maTrend(q=me,nam.c="Inc",nam.d="Nonres")) windows(width=4,height=3,pointsize=9);bringToTop(stay=TRUE) par(mai=c(0.7,0.7,0.1,0.1),family="serif");plot(p1) fname<-c("insFigure1a.png","insFigure1b.png","insFigure1c.png") pname<-list(p1,p2,p3) for(i in 1:3){png(file=fname[i],width=4,height=3, units="in",pointsize=9,res=300) par(mai=c(0.7,0.7,0.1,0.1),family="serif") plot(pname[[i]]);dev.off()} write.table(x=table.3,file="insTable3.csv",sep=",");write.table(x=table.4, file="insTable4.csv",sep=",") 7.6.1 Comments and an executable program Some comments on an R program must be presented to make it more readable. The comment sign # can be used in three ways: long comments, short comments, and R commands. First, a 122 Chapter 7 Sample Study A and Predefined R Functions comment sign can be placed at the beginning of a line, turning the whole line as a comment. The content of the line can be a sentence in English, or just some symbols (e.g., a star, dash, or hyphen). If an entire line is a comment, then it usually begins with # and one space. Second, the comment sign can also be placed in the middle of a line and after some commands. Short comments can be accommodated in this way; the general recommendation is to have two spaces before # and then one space after it. This can be very helpful if one needs to locate some specific points in a long program, e.g., # Correlation coefficient computed here. The third way of using a comment sign is to comment out one or several R commands. There are several situations why one needs to keep a block of R commands as comments. It may be that these commands are some test codes that have values for future programming. In general, I suggest that most test codes be removed and only these really valuable codes be kept. Another situation is that some commands need a long time to run (e.g., 10 minutes for a loop). If that is the case, the codes can be commented out and the outputs can be given explicitly in the program. If the number or size of outputs is large, then one can save the outputs on a local drive first, and then reload the outputs in the program. For example, assume that it takes about 30 minutes from the following commented codes to generate the values for the objects of x, y, and z. 
In actual programming, one can run the codes without the comment signs, and then generate and validate the results. If the final results are large in size, they can be saved on a local drive as tempResult.Rdata through the function of save(). Then the relevant codes can be commented out. The saved results can be loaded in the future directly from the local drive. One can use the function of ls() before and after the function of load() to reveal if these objects are available on the search path. Alternatively, if the results are small, then they can be embedded and presented in the R program directly. Details about the functions used in the following example, i.e., save() and load(), are available at their help files. At this point, you just need to understand the essence of this approach. ## need 30 minutes to run this block # x <- 0; y <- 0; z <- 0 # for (i in 1:1000) { # y <- ... # ... # z <- ... # } # A. If the results are large, save and then reload later # save(x, y, z, file = ‘C:/aErer/tempResult.Rdata’) load(‘C:/aErer/tempResult.Rdata’); ls() # B. If the results are small, present them directly in the program x <- 10; y <- 20; z <- 30 Comment signs used in the first and second ways (i.e., long and short comments) make an R program easy to understand by section. Judicious uses of a comment sign in the third way can make a whole R program for a project executable in a few minutes. My rule is that no more than three minutes are needed in running an R program and reproducing all the tables and figures. For most of my finished projects, I just need about one minute to reproduce all the results. There are several common problems in using the comment sign. These include forgetting # before a comment line, having very fancy and long section headers (e.g., a big box 7.6 Formatting an R program 123 composed of stars), and including a time-consuming block of codes directly in the final program. Most of these problems can be revealed by a quick reading of a program with good common sense, and then by running the whole program with one click. To emphasize, a well-structured R program should be executable within a few minutes by a single submission. If an R program cannot be run repeatedly, then there must be some problems inside the program. 7.6.2 Line width and breaks In preparing a proposal or manuscript in English, most word processing applications allow automatic wrap-up for long sentences. A long R command line that spreads over several lines can be wrapped up by an R editor in the same way. However, that is strongly discouraged for programming in general. Thus, we need some rules in formatting lines in an R program. The maximum line length is usually about 80 characters. Some R editors (e.g., TinnR) allow users to set up the position of a vertical gray line and show it up in the editor. Alternatively, a comment line composed of many dash symbols (or any symbol you like) can be added to separate two sections in an R program, and it can also serve as an indicator of line width, as adopted in Program 7.2 The final program version for Sun et al. (2007). If one needs to print an R program for reading or sharing, then the preview function in an R editor can reveal if the default line width is too wide. In addition, always use a font family (e.g., Courier New) in programming that has the same width for every letter. This allows for exactly vertical alignments of commands. If needed, break a long line between argument assignments or at a location with good sense. 
In using Microsoft Word, users do not compile the contents directly as in other applications such as LATEX. As a result, some bad formats can occur. For example, a phrase like “Table 1” can be split right in the middle, with “Table” being put at the end of a line and “1” being put at the beginning of the following line, which is ugly for professional publishing. Many applications like LATEX allow users to have better controls over these subtle but important details. In defining or using R functions, line breaks should be made between assignments. Thus, lines 8 and 9 in Program 7.4 has a bad break by separating data and =daIns on two lines. This is similar to breaking “Table 1” into two parts in a manuscript. In other long R commands, one may need to break a command several times, e.g., a long index for subscripting operation. In general, always break a line at a location with good sense. There are several ways to reduce the number of breaks in a long command line. One way is to define some objects first and then use them in the long command. For example, the following long command associated with update() has two lines. On lines 39 to 40 of Program 7.2, a new formula object is created first and then used in calling the function. # A long command line associated with update() rb <- update(object = ra, formula = Y ~ Injury + FishYrs + Nonres + Lspman + Lnong + Gender + Age + Race + Marital + Edu + Inc + TownPop) # A shorter command line with update() fm.fish <- Y ~ Injury + FishYrs + Nonres + Lspman + Lnong + Gender + Age + Race + Marital + Edu + Inc + TownPop rb <- update(object = ra, formula = fm.fish) For curly braces, an opening curly brace should not go on its own line; a closing curly brace should always go on its own line. Curly braces can be omitted when a block consists 124 Chapter 7 Sample Study A and Predefined R Functions of a single statement, and as always, it should be consistently adopted in an R program. Several good examples are shown in Program 7.3 Curly brace match in the if statement. In summary, in preparing an R program, it is better to set up the maximum line width first, either with the help of utilities in an R editor, or some comment lines. An R program is always composed of lines that are manually broken down or controlled at the end. Break up long lines in R between arguments or anywhere with a good sense. Never allow automatic wrap-ups of long lines in an R program, including short comments at the end of a command line. 7.6.3 Spaces and indention The use of space in English and R is quite different. In English, extra spaces between words are not allowed, at least for professional writing. In R, extra spaces are allowed on comment lines because computers and the R software do not execute these lines at all. They are also allowed, and even promoted, in many places on command lines. In general, for commands longer than one line, two spaces should be inserted consistently to indent the following lines to improve readability. To allow better alignments in some cases, more than two spaces can be inserted, but it should be implemented consistently. For example, on lines 36 to 38 in Program 7.2, many extra spaces are inserted to make these relevant arguments aligned vertically. In indenting codes, do not use tabs or mix tabs and spaces, but just use the space bar on a keyboard. A conditional or looping statement in R, e.g., if and for, generally contains a block of codes. The block of codes for each statement should have the same indention. 
If there are nested conditional and looping statements, then more spaces should be added for the inner blocks. For example, lines 69 to 75 in Program 7.2 form a large block of codes, as indicated by the vertical alignment of for and } and the indention of all lines inside the curly braces. Furthermore, the indention on line 71 indicates that this line and the previous one are one command line. If there is another for loop inside this block, then all the command lines should be indented farther and deeper.

Place spaces around all binary operators, e.g., <-, =, +, -, and *. Do not place a space before a comma, but place one after a comma. Place a space before a left parenthesis, except in a function call. For all the above situations, spaces can be eliminated if there is a need to reduce the length of one line within the preferred width, e.g., 75 to 80 characters. In general, my rule is that a maximum of three spaces can be compressed on a single line if needed; otherwise, multiple lines will be used for a long command. Some bad examples are demonstrated below.

me<-maBina(w=ra)       # need spaces around <- and =
test <- daIns[,1:5]    # need a space after the comma
test <- daIns[ ,1:5]   # need no space before the comma
for(i in 1:3) {...     # need a space before ( for loops
if(x >= 5){...         # need a space before ( and { for conditionals
mean (1:10)            # need no space before ( for function calls

7.6.4 A checklist for R program formatting

The following brief list describes the main requirements in formatting an R program. It is not as comprehensive and detailed as the preceding presentation. You will need to read this again later on when you format your R program. As always, use good common sense and consistency in formatting an R program.

• All comments must start with the comment symbol of # in R.
• Add complete comment lines to create and separate sections for a long program.
• Add short comments at the end of a line for brief notes if needed.
• Comment out time-consuming code blocks to facilitate quick reproduction.
• An R program for a project should be executable with a single submission.
• Use blank lines to indicate paragraphs or code blocks.
• Set up the width of command lines (e.g., 80 characters) and never exceed it.
• Do not allow automatic wrap-ups of long lines; break them manually.
• Always place "} else {" together on a separate line.
• Break a long line with good sense (e.g., not on an argument assignment).
• Indent a multiple-line command or a code block with two spaces or consistency.
• Align some codes vertically to indicate a block.
• For curly braces, an opening curly brace should not go on its own line.
• For curly braces, a closing curly brace should go on its own line.
• Use indention to indicate a code block for conditionals and looping statements.
• Place one space before and after binary operators, e.g., <-, =, and +.
• Do not use the operator of = to replace <- when <- is needed.
• No space before a comma, but one space after it, e.g., test[, ].
• Place a space before a left ( and {, e.g., if (a > 5) {b <- 3}.
• Do not place a space before a left ( for a function call, e.g., mean(1:3).
• Spell out argument names in calling a function with multiple arguments.
• Exceptions are always allowed, but use good common sense.

7.7 Road map: using predefined functions (Part III)

Up to this point, we have learned Sun et al.
(2007) as the sample study for Part III, including its manuscript version, underlying statistics, and the final program version. More importantly, the structure, basic syntax, and format of the program version have been analyzed. In Program 7.2 The final program version for Sun et al. (2007), we are glad to know that 44% of the 80 command lines are just comments and blank lines. However, the remaining 56% command lines are still largely unexplained. Where do we go from here, and how can we understand the R language used in this sample program completely? The answers are within the next four chapters in Part III. The rest of this Part is organized to help beginners learn how to use predefined, existing R functions efficiently. At the end of this Part, you should be able to understand the whole sample program completely, including these functions and their usage. For your own selected empirical study, a similar program can be prepared after learning various techniques presented in this Part. Specifically in Chapter 8 Data Input and Output, the focus is on the exchange of information between R and a local drive, i.e., data inputs and outputs. A number of R concepts are explained first, including objects, attributes, subscripts, user-defined functions, and flow control. To connect with the sample study, the sections of raw data imports and result exports in Program 7.2 should be understandable after this chapter. Then, two chapters are used to present the techniques of creating and manipulating major R objects. These materials are basic but critical for calling and using R functions efficiently. In general, they are more difficult to learn than materials in other chapters of the book. In Chapter 9 Manipulating Basic Objects, how to manipulate major R objects by type is presented. These types covered are R operators, character strings, factors, dates, time series, and formulas. In Chapter 10 126 Chapter 7 Sample Study A and Predefined R Functions Manipulating Data Frames, common tasks related to data frames are addressed, and methods for data summarization are presented. As an example, using R functions to estimate a binary choice model is displayed at the end. After reading these two chapters, one should be able to comprehend the regression analyses in Program 7.2. Drawing a graph in R is a very different task than conducting a statistical regression. A regression often involves one or two R functions only. However, plotting requires learning a large number of functions at the same time, which is true for R and any computer language. In Chapter 11 Base R Graphics, the traditional graphics system available in base R is presented in detail. The four main inputs for generating a graph in R are plotting data, graphics devices, high-level plotting functions, and low-level plotting functions. After learning the techniques presented in this chapter, one should be able to understand the graph output from Program 7.2. A simple test on whether you know the materials in Part III well is to read Program 7.2 The final program version for Sun et al. (2007) and assess how much you can understand it. In addition, many exercises are designed and included in several chapters. They can also test your understanding of the materials. Overall, the knowledge in this Part is the foundation for building an R program efficiently, so everyone should learn it well before moving on to other chapters in the book. 
A typical barrier that prevents students from learning advanced techniques (e.g., writing a new function or package) is the lack of a solid comprehension of basic R concepts, as presented in the remaining chapters of Part III. 7.8 Exercises 7.8.1 Assess some packages with pull-down menus. Go to the Web site of R and navigate to the contributed package site (e.g., http://mirrors.ibiblio.org/CRAN/). Search the Web page by the keyword of ‘interface’ to reveal these packages that contain pull-down menus. Select one package that is built on Rcmdr, e.g., RcmdrPlugin.SurvivalT, and install it on your computer. A new menu of SurvivalT should appear within the R Commander interface. Try to estimate some models and then evaluate the relation between this new package and Rcmdr. 7.8.2 Prepare draft tables. Recall that in Exercise 4.7.2 on page 56, one empirical study is selected, and the structure of its published manuscript version is extracted. In this chapter, tables and figures in the first draft of manuscript version for Sun et al. (2007) are further specified through, for instance, Table 7.1 A draft table for the logit regression analysis in Sun et al. (2007) on page 103. In this exercise, assuming no analyses had been conducted for your selected empirical study, compose a draft of at least two tables in the published version. 7.8.3 Create an R program draft. Recall that in Exercise 3.6.2 on page 41, one empirical study has been read and selected. In this exercise, prepare a draft of an R program version for this study. Follow the structure of a program version listed in Table 4.3 Structure of the program version for an empirical study on page 52 and the sample program version as listed in Program 7.1 The first program version for Sun et al. (2007) on page 113. Present and organize all the main components of a program version in this first draft, including the tables and figures. 7.8.4 Format a sample R program. Program 7.5 contains many inappropriate formats. Note the commands in this program indeed work well. You do not need to understand these relevant R functions to reformat the whole program. Similarly, a production 127 7.8 Exercises editor does not necessarily understand a manuscript in formatting it for a specific publication outlet. With the formatting guide discussed in this chapter, identify and correct existing formatting problems in this R program. Do not make any change on the content of this sample program. After it is reformatted, the structure of the program should be clear to readers before the program is run. At the end, the whole program can be executed with one submission. Program 7.5 Identifying and correcting bad formats in an R program 1 2 3 4 5 6 7 8 9 10 11 12 13 *********************************************************** ** Title: This is an exercise for formatting R codes. ***** *********************************************************** __________ 1. Load packages and data ____________ library( erer );data( daIns ); 2. Create a dataset from the existing one mydata<daIns[ ,1:5 ]; head( mydata); names(mydata );str( mydata ); summary(mydata); 3. Run a linear model ---------------------------------result=lm(formula=Y~Injury+HuntYrs+Nonres +Lspman ,data = mydata) summary( result );####This is the main result. Chapter 8 Data Input and Output C onstructing the program version for an empirical study through R needs to build up skills gradually. In this chapter, we focus on the exchange of information between R and a local drive, i.e., data input and output (Spector, 2008). 
Before we can do that, several R concepts need to be defined. Objects and functions are explained first as two core concepts of R. Inside R, everything is an object, including a function, and furthermore, functions can be used to manipulate objects. Then, subscripting an object, writing a user-defined function, and controlling program flow are briefly introduced. In the end, several functions and techniques for data inputs and outputs are covered in detail. 8.1 Objects in R R objects are the building blocks for every command line in the program version for an empirical study. They can be differentiated by attribute or property. Major object types (e.g., list) are briefly described in this section. A set of basic built-in functions can be employed in R to create and manipulate these objects. 8.1.1 Object attributes Each object has unique properties that allow classification and differentiation within a working environment. These properties or features are generally referred to as attributes in R. Two objects in R may look like the same on a computer screen, but they are different objects if any of their attributes is not the same. The attributes of an existing object can be modified. New objects with built-in or user-defined attributes can be created. R has a number of built-in object types with well-defined attributes. Objects commonly used in R include vectors, data frames, matrices, lists, factors, formulas, dates, and time series. Usual attributes include class, comment, dim, dimnames, names, row.names, tsp, and levels. See more details by help(“attributes”). In Program 8.1 Accessing and defining object attributes, several objects generated from Program 7.2 The final program version for Sun et al. (2007) on page 114 are used as examples. In particular, daInsNam is a data frame object with two columns and 14 rows: one column for the abbreviations of 14 variables and the other for detailed variable descriptions. Attributes associated with an object can be revealed by an attribute function or a convenience function. Specifically, there are two attribute functions: attributes() and attr(). Empirical Research in Economics: Growing up with R Copyright © 2015 by Changyou Sun 128 129 8.1 Objects in R The function of attributes() reveals all available attributes of an object, and attr() shows one attribute only. Furthermore, all existing attributes or one specific attribute can be replaced through those two functions too. In using the replacement form of the two functions, a new attribute can be created if the attribute referred to does not exist for an object, as shown on lines 12 to 15 in Program 8.1. A new attribute can be any object, including a numerical vector (e.g., line 13) or a data frame object (e.g., line 15). Actually, it is not uncommon to store some descriptive information about an object as one of its attributes. It always stays with the object but does not affect the usage of the object. Convenience functions are available for a number of object attributes in R and can also be used to abstract attribute information or define new attributes. Not all attributes are implemented or included in attributes(). For example, the mode attribute is implicit with a matrix object and it is usually not visible or available with attributes(). In addition, convenience functions focus on one particular attribute and they are more convenient to use. 
In particular, two of the most important attributes of R objects are class and mode, with the corresponding convenience functions available at the same name. Specifically, class() can be used to reveal the current class information of an object. The command of class(x) <- “value” can be used to revise the existing class attribute, or assign a new class to x if x has no such a class. An object can have many classes at the same time, as shown on line 23 in Program 8.1. The class of an object is important because R is an object-oriented programming language. Generic functions are available in R to invoke different methods. Depending on the class of their arguments, generic functions can act very differently. For a particular generic function or a given class, you can find out which methods are available through the function of methods(). For example, the generic function of mean() can work differently for numeric and date classes, e.g., mean.Date(). As another example, through lines 56 to 58 in Program 7.2 on page 114, several objects with the class of “maTrend” are created. The erer package contains two methods for this class: plot.maTrend() and print.maTrend(), as revealed by line 26 in Program 8.1. Similarly, the convenience function of mode() can be used to reveal an object’s mode. Commonly encountered modes of individual objects are numeric, character, and logical. Some objects, e.g., matrices or arrays, require that all the data contained in them be of the same mode, while others (e.g., lists and data frames) allow for multiple modes within a single object. Additional convenience functions in R include typeof(), names(), dim(), length(), tsp(), and comment(). Every object in R has several attributes to describe the nature of the information that it contains. A specific convenience function can be defined and thus applicable for some types of functions only. For example, tsp() returns the properties of a time series object with the class of ts. Applying tsp() on a data frame object will return the result of NULL, as shown by line 20 in Program 8.1. For simple objects such as a vector, it is usually straightforward to determine what an object in R contains. Examining the class and mode, along with its length or dimension attribute, should be sufficient to allow one to work effectively with the object. For objects with richer attributes, the function of str() can be employed to reveal its internal structure and properties. Finally, the ls() function can reveal the names of all objects in an R session. The erer package also includes an improved function of lss() to reveal the name, class, and size of each object in an R session. This is similar to the functionality provided by Windows Explorer on a Microsoft Windows operating system. Program 8.1 Accessing and defining object attributes 1 2 # Run the program version for Sun et al. (2007) by source(). # Create a new object for demonstration. Chapter 9 Manipulating Basic Objects A fter learning how to import and export data, we move onto R object manipulation. The information for this topic is large so it is covered in two chapters. In this chapter, R operators, character strings, factors, dates and time, time series, and formulas are covered. In Chapter 10 Manipulating Data Frames, techniques for subscription and data frame manipulation are presented. These methods are essential to efficient R programming. In each section, major concepts are first explained in a way that they can be read independently without running any R program. 
Then sample code relevant to the concepts is presented along with selected results. To gain a deep understanding, it is best to first read the text for an overall picture, then run the sample code and digest the results, and finally read the relevant description again. All the exercises for this chapter and the next are combined and presented as Exercise 10.5 on page 204.

9.1 R operators

In calling an R function, its function name is combined with a set of argument values, e.g., mean(x = 1:30). An R operator is a function too; it takes argument values but can be written without parentheses. For example, the command 3 + 5 is the same as "+"(3, 5). A large number of operators are defined in R, and their meaning will become clearer when they are used for specific purposes. Detailed definitions are available in the built-in documentation, e.g., help("+").

Specifically, arithmetic operators in R include +, -, *, /, ^, %%, and %/%; the corresponding operations are addition, subtraction, multiplication, division, exponentiation (e.g., 3^2 returns 9), modulus (e.g., 50 %% 3 returns 2), and integer division (e.g., 50 %/% 3 returns 16).

Assignment operators in R include <-, <<-, ->, ->>, and =. The operators <<- and ->> are normally used only within a function: they search the parent environments of the function for the name being assigned, rather than creating a local variable. The = operator is only allowed at the top level (e.g., in a complete expression typed at the command prompt) or as one of the sub-expressions in a braced list of expressions. In practice, the = operator is mainly used for assigning values to function arguments. The leftward assignment operator, <-, is the usual choice for assigning a value to a name, e.g., my.score <- 95. It is a bad habit to use = where <- should be used.

Relational or comparison operators are >, >=, <, <=, ==, and !=; the corresponding operations are greater than, greater than or equal to, less than, less than or equal to, equal to, and not equal to. These operators compare the values in atomic vectors and return a logical vector indicating the result of the element-by-element comparison. The elements of shorter vectors are recycled as necessary. For example, the expression 1:6 == c(1, 4) returns the logical vector TRUE FALSE FALSE TRUE FALSE FALSE.

Logical operators in R include &, |, !, &&, ||, and xor(); the corresponding operations are logical AND for vectors, logical OR for vectors, logical negation (NOT), logical AND for scalars, logical OR for scalars, and element-wise exclusive OR. In particular, the short forms of & and | differ from the long forms of && and || in two respects. First, the short forms perform element-wise comparisons in much the same way as arithmetic operators, so they are vectorized. The long forms are not vectorized and work on scalars: only the first element is evaluated and the output is a single logical value of TRUE or FALSE, even if the vectors being evaluated have more than one element. Of course, if the vectors have one element only, the short and long forms generate the same result. Second, the long forms evaluate the expression sequentially from left to right, and the right-hand operand is evaluated only if necessary. This short-circuiting can save time or avoid errors in some cases, as the short sketch below illustrates.
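The sketch uses made-up objects; the key point is that the long forms skip the right-hand side when the answer is already determined.

x <- c(2, 8)
x > 1 & x < 5           # vectorized short form: TRUE FALSE
x[1] > 1 && x[1] < 5    # scalar long form: a single TRUE
z <- NULL
!is.null(z) && z > 0    # FALSE; z > 0 is never evaluated, so no problem arises
!is.null(z) & z > 0     # logical(0); the short form evaluates both sides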
Both the features of && and || are preferred in flow control, so they are often used to construct a logical expression for some functions, e.g., if. Logical expressions contain R relational or logical operators and evaluate to either true or false, denoted as TRUE or FALSE in R. For example, the logical expression of (34 * 3 > 12 * 9) & (88 %/% 10 == 8) returns a value of FALSE. In addition, two frequently used logical functions are worthy of some descriptions here. The all() function examines if all of the given set of logical vectors are true or not. The any() function answers if any of the given set of logical vectors is true or not. Both the functions check the entire vector supplied but return a single logical value, TRUE or FALSE. There are different purposes for the following operators in R: (, [, and {. All of them have to be used in pairs. The parenthesis notation returns the result of evaluating expressions inside the parentheses, e.g., (4 + 6) * 100. Besides that, it is most commonly used to supply argument values in calling a function, e.g., mean(1:10). The square brackets (i.e., [ and [[), the dollar sign of $, and the at-sign of @ are all designated in R for indexing, and can be used to extract from or replace some elements in an existing object. For example, the command of c(45, 60, 10)[3] returns the third element in a numeric vector. The most important distinction between [, [[, $, and @ is that the [ operator can select more than one element whereas the other three can select a single element only. Finally, the curly braces evaluate a series of expressions, either separated by new lines or semicolons, e.g., dog <- 1:5; pig <- 6:10. Typical uses of curly braces are grouping the body of a function, e.g., test <- function(x) {x + 1}; test(20), or grouping expressions for flow control functions, e.g., x <- 1; y <- 2; for (i in 1:5) {x <- x + i; y <- y * i}. Other operators in R include : as a sequence generator, :: and ::: for accessing variables in a name space, ˜ for formula, and ? for help. In general, the order of operations and precedence rules in R are: function calls and grouping expressions (e.g., {), indexing, arithmetic, comparison and relation, logical, formula, assignment, and help. Consult the help document for the Syntax() function for detail. If needed, one can define new operators in R. A user-defined binary operator is created through a new function and its name is composed of a character string and two % symbols. For example, assume that there are two exams in a course. The final score for each student is composed of the scores from the two exams and a base score of 40. An operator named as %score% can be defined to add two exam scores and 40 together, as shown in Program 9.1 Built-in and user-defined operators in R. Operators in R are vectorized. Thus, for given vector inputs, an operator acts on each Chapter 10 Manipulating Data Frames A data frame is a type of object that has been widely used in R. The main feature of a data frame object is that it is a two-dimensional object like a table and it can hold heterogeneous information. Given its popularity, this chapter focuses on these common tasks related to data frames. Manipulating data frames involves a comprehensive understanding of indexing techniques employed in R, so relevant operators and functions are covered first. In addition, data aggregation and summary on data frame objects are presented. This involves how to generate summary statistics by row or column, or for subsets of a data frame object. 
Finally, an application related to a binary logit regression is presented to synthesize many techniques that have covered in Part III Programming as a Beginner. A number of exercises also have been designed and included at the end of this chapter. Readers are strongly encouraged to work on some of them and improve their understanding of basic R object manipulation. 10.1 Subscripting and indexing in R For objects with multiple elements, R offers efficient indexing and subscripting operations to locate specific elements. Operators defined for subscripting in R include single brackets in pair, double brackets in pair, the dollar sign, and the at sign, i.e., [, [[, $, and @. The index or subscript can be numerical, character, or logical values. Use help(“[”) in an R console to launch a help page and read their detailed definitions. The purpose of subscripting is either extracting or replacing. For extracting operations, elements identified in an object is extracted and saved as a new object, e.g., new.x <x[i, j, drop = TRUE, ...]. The drop argument is relevant to matrices, arrays, and data frames, and it specifies whether subscripting reduces the dimension or not. For replacing operations, the identified elements in the object are replaced by new values supplied by users and the object is thus modified. If the original object needs to be kept and untouched, then it should be copied before the replacement operation, e.g., copy.x <- x; copy.x[i, j] <- new.value. A numerical index is either a single integer or a vector of integers. If there is any zero in the index, it is ignored; thus c(0, 1, 0, 4) is the same as c(1, 4) when it is used as an index. The c() function, the colon operator (i.e., :), and the seq() function are often used in generating a small index manually, e.g., 1:4 for the first four elements, or seq(from = 1, to = 10, by = 2) for all the even numbers between 1 and 10. Negative subscripts Empirical Research in Economics: Growing up with R Copyright © 2015 by Changyou Sun 180 10.1 Subscripting and indexing in R 181 are accepted in R, e.g., -c(1, 4) or c(-1, -4), and they specify the positions of elements that need to be excluded. Negative and positive numeric values, however, cannot be mixed in one index. A character string index is either a single character string or a vector of character strings, e.g., “v1” or c(“v1”, “v2”). Apparently, it can be used only if an object has names for its elements. Negative character subscripts are not permitted, e.g., -“v1”. If elements based on names need to be excluded, then their numerical location should be determined first using string manipulation functions and then used as the input to a numeric index. A logical index can also be used to access elements of an object, e.g., c(TRUE, FALSE, TRUE). A logical value has to be either TRUE or FALSE, or the shorthand of T or F. Elements corresponding to TRUE values will be included, and these corresponding to FALSE values will be excluded. The length of an index vector is allowed to be smaller than, equal to, or larger than the corresponding dimension in an object (e.g., the number of rows in a data frame). Take an extracting operation as an example. When smaller, the object extracted will have a smaller dimension than the original object. When equal, the existing and new objects have the same dimension, but the order of elements may be different, depending on the index. 
When larger, the new object will be composed of elements from the existing object, and some of the elements from the existing object must be repeated. When the length of an index vector is larger than the corresponding dimension in an object, the index must be a numerical or character index, but cannot be a logical index. With a numerical or character index, an element in an object can be extracted for more than one time, and this may be exactly what one needs in some situations. A good example is demonstrated in Program 9.5 Manipulating factor objects on page 167 with the function of levels(), i.e., the expression of levels(num)[num]. Another example is bootstrapping a sample with repetition, using the sample() function. In that situation, one may need to include some observations more than once, either manually or statistically with a probability specification. For a logical index, however, it can be at most as long as the corresponding dimension of an object. This is because a logical index does not contain the location information. In practice, a logical index is often generated from logical operation on some components of the object involved. Thus, the length of a logical index is usually the same as the corresponding dimension of the object. If the length of a logical vector is longer than the corresponding dimension, then NA values will be generated because these extra logical values refer to some locations that do not exist in the object. Furthermore, R does accept a logical index that is short, and if needed, the logical values in the index will be repeated to match the dimension of the object being indexed. For example, in extracting four out of eight elements in a vector, a logical index like c(FALSE, TRUE) should have the same effect as c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE) or c(2, 4, 6, 8). This use of logical indexes can be very efficient in programming if the pattern of extracting operation is clear. Sometimes there is a need to convert one of the three types of indexes into another. For example, in creating a new function, users are allowed to supply a character string vector to an argument. The same argument value can be used for including some elements in one place but for excluding some elements in another place. In general, to convert a character string index into a numerical index, use functions for string manipulation to find the location, e.g., match(), pmatch(), or grep(). To convert a logical index into a numerical index, use the which() function. It can generate a vector containing the subscripts of the elements for which the logical vector is true. This is demonstrated in Program 10.1 Subscripting and indexing R objects. Chapter 11 Base R Graphics R graphics can be used to display either raw data or analysis outputs in a graphic format. A graph can be shown on a computer screen, or alternatively, saved as a file on a local drive and then used in a document. The traditional graphics system available in base R can generate quality graphics, and furthermore, has served as the foundation of many contributed packages in extending the functionality of base R. Thus, a deep understanding of the graphics system in base R is important. Furthermore, because drawing computer graphs always involves many details, it is undesirable and inefficient to copy, paste, and compile many R help pages here. Instead, the emphasis in this chapter will be on gaining a solid understanding of the overall structure of the base R graphics system. 
11.1 A bird view of the graphics system

Generating a graph with the graphics system in base R is very similar to drawing a picture on a canvas or board. To facilitate learning, we also borrow the concept of a production function from economics. Assume there is a relation between one output and four inputs: y = f(x1, x2, x3, x4). In painting, y is the picture produced, and the four inputs are paint, canvas materials, big brushes, and small brushes. In base R, the output is a graph, and the four inputs are plotting data, graphics devices, high-level plotting functions, and low-level plotting functions. Transforming inputs into outputs involves a production technology, i.e., the f in the production function. Excellent and mediocre artists differ mainly in their production technology, and less so in the inputs used. Similarly, with exactly the same data available, two persons can produce very different graphs from R. This is often referred to as the art of programming, or a philosophy of graphics. Whether the art of programming can be taught is controversial. Personally, I feel that is a very difficult task, even if not totally impossible. In contrast, most people agree that the physical and technical side of painting and graphics can be learned step by step. That is what we are going to do in this chapter. Each of the four major inputs to the R graphics system will be analyzed in detail, by section, in a moment.

In general, plotting data should be prepared before drawing a graph. A graphics device holds a graph generated from R; it is either a computer screen or a file saved on a local drive. The R console is largely a window for ordinary outputs in text format only, and outputs with complicated formats such as graphs and mathematical symbols cannot be shown there.

[Figure 11.1 Graphs with different graphical parameters: annual return (%) versus year, 2001 to 2005, drawn twice with different y-axis ranges]

Thus, either a new screen/window device or a file device should be created to hold a graph. A high-level plotting function can generate a graph independently, e.g., plot(). A low-level plotting function can annotate, revise, expand, or customize an existing graph, e.g., points(). Thus, a high-level function can be used alone, but a low-level function has to follow a high-level function. The appearance of a graph is controlled by various graphical parameters, and they are all used as function arguments. These graphical parameters can be further classified into three types: device parameters, plotting function parameters, and universal parameters. The first two types are specific to graphics devices or plotting functions only, while universal parameters can be used for both. For example, the parameter col is universal and can be used everywhere to specify colors. Where a universal parameter is used does matter. When a universal graphical parameter is used in a function related to graphics devices, the setting is effective for the whole device and the following statements. When it is used in a plotting function, it affects the output from that function only.
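A small sketch of this distinction follows; the colors are illustrative only, and col is treated as universal as described above.

par(col = "gray40")        # device level: later plotting calls inherit gray40
plot(1:5, type = "l")      # this line is drawn in gray40
points(1:5, col = "red")   # call level: red applies to these points only
lines(5:1)                 # back to the device-level gray40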
As an example, Program 11.1 Base R graphics with inputs of data, devices, and plotting functions can generate Figure 11.1 Graphs with different graphical parameters. In this demonstration, the data to be plotted are the annual returns of a public firm over five years, varying between the value of 10% in 2001 and 13% in 2005. The data are prepared and saved first. The windows() function initiates a screen graphics device so any graphic output from the following statements will be displayed on it. The device function can accept a number of device parameters as its argument, e.g., width for the size of the window. The par() function can set the properties of graphics device further. In this example, the mai argument is a graphical parameter that defines the figure margins. The plot() function is the workhorse and leading high-level plotting function in R, and it uses a line to connect five observations here. The plot() function also accepts various graphical parameters, e.g., ylab for the axis label. Finally, the points() function is a low-level function that generates five points; its graphical parameter of pch specifies the shape of the points, with a different shape by year in this example. Note the range of the axis values is controlled by the graphical parameter of ylim, and changing it from the default value to a larger value results in a very flat line graph in Figure 11.1. Thus, graphical parameters can change the appearance of a graph dramatically. The effect may or may not be desirable from the perspective of a programmer or viewer. In general, R as a tool, like all other software applications in the world, does not 216 Chapter 11 Base R Graphics have an answer to questions like which graph is more appropriate. The answer is related to the overall goal of a research project and the specific purpose of a graph, and additionally, one’s graphics philosophy. (No philosophy is one type of philosophy too, just like religion.) More specifically for Figure 11.1, without other information, we cannot simply assert that the version on the left is better than the one on the right. In most cases, the left one is better. However, if this public firm needs to be compared with other firms in a study, and the annual returns of those firms have been volatile between −50% and 50% over the period, then the version on the right side is appropriate in revealing the small return volatility of the selected firm. Finally, the relation among the four inputs is worthy of a brief note. A data set for plotting needs to be prepared at the beginning, and the demand can be large or small. When the data needed is small, it can be directly supplied as argument values to function calls. Apparently, there must be some data to get started; otherwise, there is nothing to draw or show. If no statements related to any graphics device are present, then the default device is a new pop-up square window on the screen (seven by seven inches). A high-level plotting function must be supplied in order to generate a graph in R. Low-level functions are auxiliary to customize an existing graph so they are always optional. As a result, the simplest R program that can generate a graph is a statement in an R console like this: plot(1:3). Following our analytical steps as listed in Program 11.1, this simple command is equivalent to a group of commands as follows: data <- 1:3; windows(); plot(x = data). It seems so lengthy that we need two pages to explain a short command, but that is exactly how R generates a graph for us. 
Program 11.1 Base R graphics with inputs of data, devices, and plotting functions 1 2 3 4 # Four components or inputs in the base R graphics system # A. Plotting data Year <- 2001:2005 Return <- c(0.10, 0.12, -0.05, 0.18, 0.13) * 100 5 6 7 8 # B. Graphics device windows(width = 4, height = 3, pointsize = 11) par(mai = c(0.8, 0.8, 0.1, 0.1), family = "serif") 9 10 11 # C1. High-level plotting function plot(x = Year, y = Return, type = 'l', ylab = 'Return (%)') 12 13 14 15 # C2. High-level plotting function + a change in the range of y axis # plot(x = Year, y = Return, type = 'l', ylab = 'Return (%)', # ylim = c(-50, 50)) 16 17 18 # D. Low-level plotting function points(x = Year, y = Return, pch = 1:5) 11.2 Your paint: preparation of plotting data Data are needed for data arguments in plotting functions, e.g., the x and y arguments in the plot() function. Data are also needed for graphical parameters that are supplied as arguments in the functions for graphics devices and plotting, e.g., “red” for a color argument. Part IV Programming as a Wrapper 251 252 Part IV Programming as a Wrapper: The goal of this part is to extend the existing R functionality for data analyses and graphics. The sample study used is Wan et al. (2010a), and the demand model of AIDS is assessed in detail. Skills for writing new R functions will be greatly emphasized and elaborated through several chapters. Chapter 12 Sample Study B and New R Functions (pages 253 – 268): The proposal, manuscript, and program versions for Wan et al. (2010a) are presented. The AIDS model and the need for user-defined functions are introduced for later applications. Chapter 13 Flow Control Structure (pages 269 – 292): Functions and techniques for controlling R program flow are presented. Conditional statements in R include if, ifelse(), and switch(). Looping statements include for, while, and repeat. Chapter 14 Matrix and Linear Algebra (pages 293 – 311): Operation rules and main functions relevant to matrices are described. Several large examples related to the AIDS model and generalized least square are constructed and presented. Chapter 15 How to Write a Function (pages 312 – 358): The structure of a function is analyzed first. S3 and S4 as two approaches to organizing an R function are compared. Several examples are designed to demonstrate how to write a function. Chapter 16 Advanced Graphics (pages 359 – 392): The graphics systems in R are reviewed first. Then the grid system, ggplot2, and several packages for maps are described. For each contributed package, new concepts are defined and examples are designed. R Graphics • Show Box 4 • A diagram for the structure of this book Empirical Study 1 Proposal 4 Manuscript 2 R Program 3 New packages make diagrams easier to draw. See Program A.4 on page 517 for detail. Chapter 12 Sample Study B and New R Functions T he leading sample study in Part IV Programming as a Wrapper is Wan et al. (2010a). This is a moderate challenging empirical study, and it is appropriate for students to learn how to write user-defined functions for a specific project. The skeleton of the manuscript version is presented first. Then, the relevant statistics and the program version are detailed. At the end, the need for user-defined functions is analyzed and a road map for Part IV Programming as a Wrapper is presented. 12.1 Manuscript version for Wan et al. (2010a) The corresponding proposal version for Wan et al. (2010a) is presented at Section 5.3.2 Design with public data (Wan et al. 2010a) on page 72. 
The program version for this study is presented later in this chapter and can produce detailed tables and figures. Below is the very first manuscript version for this study. It is constructed with the information in the proposal version, and will be used as a practical guide to composing the program version. The final manuscript version is published as Wan et al. (2010a). In constructing the skeleton of the first manuscript version, key tables and figures should be drafted or hypothesized. As an example, Table 12.1 A draft table for the descriptive statistics in Wan et al. (2010a) is included below. This seems like a blank table, but it is an important technique for improving data analysis efficiency. This also has been emphasized in Section 7.1 Manuscript version for Sun et al. (2007) on page 101. The First Manuscript Version for Wan et al. (2010a) 1. Abstract (200 words). Have one or two sentences for the research issue, study needs, objectives, methodology, data, results, and contributions. 2. Introduction (2 pages in double line spacing). Brief trade pattern review; the antidumping investigation process; overall objective; three specific objectives and contributions. 3. Market overview and antidumping investigation against China (3 pages). A market review of trade patterns, antidumping investigation, duties, and research needs. 4. Methodology (6 pages). Four subsections: — Static Almost Ideal Demand System (AIDS) model Empirical Research in Economics: Growing up with R Copyright © 2015 by Changyou Sun 253 254 Chapter 12 Sample Study B and New R Functions Table 12.1 A draft table for the descriptive statistics in Wan et al. (2010a) Variable Mean St. Dev. Minimum Maximum Share for country 1 Share for country 2 Share for country 3 Share for country 4 Share for country 5 Share for country 6 Share for country 7 Share for country 8 Price for country 1 Price for country 2 Price for country 3 Price for country 4 Price for country 5 Price for country 6 Price for country 7 Price for country 8 Total expenditure 30.123 8.888 22.222 60.123 160.123 10.123 80.123 300.123 — Dynamic AIDS model — Estimation and diagnostic tests — Demand elasticities 5. Data sources and variables (1 page). Define wooden beds by the Harmonized Tariff Schedule. Describe country selection, time periods covered, and data sources. 6. Empirical results (4 pages of text, 6 tables, and 1 figure). Three subsections: — Model fit and diagnostic tests — Results from the estimated coefficients — Results from the calculated elasticities Table 1. Descriptive statistics of the variables defined for AIDS model Table 2. Diagnostic tests on the static and dynamic AIDS models Table 3. Estimated parameters from the static AIDS model Table 4. Estimated parameters from the dynamic AIDS model Table 5. Expenditure elasticity and Marshallian own-price elasticity Table 6. Long-run and short-run Hicksian cross-price elasticity Figure 1. Monthly expenditure and import shares by country 7. Discussion and summary (3 pages). A brief summary of the study is presented first. Then about three key results from the empirical findings will be discussed. 8. References (3 pages). No more than 40 studies will be cited. end 255 12.2 Statistics: AIDS model 12.2 Statistics: AIDS model In this section, statistics for the Almost Ideal Demand System (AIDS) model is presented. Emphases are put on the relevant information that is necessary to understand some R implementation later on. For a comprehensive coverage of the development of this model, read Wan et al. 
(2010a) and references cited there. Specifically, the key formulas for the AIDS model are introduced first. Then three aspects of the model implementations are elaborated: construction of restriction matrices, model estimation by generalized least square, and calculation of demand elasticities. They are all implemented in the erer library with several new R functions. Note that the erer library also contains several additional functions for the AIDS model, including the Durbin-Wu-Hausman test for expenditure endogeneity and diagnostic tests for model fit. For the purpose of brevity, their implementations will not be analyzed within this book, so relevant technical aspects are omitted here. 12.2.1 Static and dynamic models The AIDS model, originally developed by Deaton and Muellbauer (1980), has been widely adopted in estimating demand elasticities. The popularity of this model is due to its several advantages. It is consistent with consumer theory so theoretical properties of homogeneity and symmetry can be tested and imposed through linear restrictions. As a demand system, it also overcomes the limitations of a single equation approach and can examine how consumers make decisions among a bundle of goods to maximize their utility under budget constraints. Furthermore, with the development of time series econometrics, dynamic AIDS model has been constructed to consider the properties of individual time series through the error correction technique pioneered by Engle and Granger (1987). For the research issue of wooden bed imports (Wan et al., 2010a), a conventional static AIDS model with a set of policy dummy variables can be specified as follows: wit = αi + βis ln mt Pt∗ + N X j=1 s γij ln pjt + K X ϕsik Dkt + uit (12.1) k=1 where w is the import share of beds; m is the total expenditure on all imports in the system; P ∗ is the aggregate price index; m/P ∗ is referred to as the real total expenditure; p is the price of beds; D denotes the antidumping dummy variables; α, β, γ, and ϕ are parameters to be estimated; and u is the disturbance term. The superscript of s in the parameters denotes static (long-run) AIDS model. For the subscripts, i indexes country names in the import share and also the equation in the demand system (i = 1, 2, . . . , N ), j indexes country names in the price variable (j = 1, 2, . . . , N ), t indexes time (t = 1, 2, . . . , T ), and k indexes the dummy variables (k = 1, 2, . . . , K). In this study, the maximum values of these indexes are N = 8 (seven countries plus the rest of world as a residual supplier), T = 96 (monthly data from January 2001 to December 2008), and K = 3. Several variables in the above equation are defined and calculated using the import PN prices and quantities for individual countries. The total expenditure is defined as mt = i=1 pit qit , in which q is the quantity of beds. The import share can be computed as wit = pit qit /mt . The aggregate price index is generally approximated by the Stone’s Price Index as lnPt∗ = PN j=1 wjt ln pjt . In addition, major events related to the antidumping investigation were the announcement of petition in October 2003, the affirmative less-than-fair-value determination in July 2004, and the final implementation since January 2005. Three corresponding pulse dummy variables are added to the AIDS model to represent these events. For instance, the 256 Chapter 12 Sample Study B and New R Functions dummy variable for the petition announcement is equal to one for October 2003 and zero for other months. 
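A minimal sketch of how such a pulse dummy can be built by hand is given below; the monthly series only mimics the study's 2001 to 2008 window, and in Program 12.1 the dummies are actually produced by aiData() through its dummy argument.

mon  <- ts(1:96, start = c(2001, 1), frequency = 12)        # Jan 2001 - Dec 2008
dum1 <- ts(as.numeric(floor(time(mon)) == 2003 & cycle(mon) == 10),
           start = c(2001, 1), frequency = 12)              # one in Oct 2003 only
window(dum1, start = c(2003, 9), end = c(2003, 11))         # 0 1 0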
The other two dummy variables are similarly defined. The dynamic AIDS model is a combination of the static model, cointegration analysis, and error correction model. The static AIDS model ignores dynamic adjustment in the short run and focuses on the long-run behavior. In reality, producers’ behavior can be influenced by various factors such as price fluctuations and policy interventions. Thus, the static model can be restrictive for some situations. In addition, time series data used in the static AIDS model may be nonstationary, which may invalidate the asymptotic distribution of an estimator. Finally, the static AIDS model is incapable of evaluating short-run dynamics. To address these potential limitations associated with the static AIDS model, the concept of cointegration has been introduced and the dynamic AIDS model has been developed. The Engle-Granger two-stage cointegration analysis is employed in Wan et al. (2010a). At first, the stationarity of the variables used in the static AIDS model needs to be examined through unit root tests, e.g., the Augmented Dickey-Fuller test. If these variables are integrated with the same order, a cointegration test on the residuals calculated from the static AIDS model is conducted to determine if the residuals are stationary. If a residual is stationary, it suggests that a long-run equilibrium and cointegration relation exist for the variables in that equation. Consequentially, the estimates from the static AIDS model can be interpreted as the long-run equilibrium relation among these variables. If the cointegration relation is confirmed, the residuals (b uit ), also referred to as the error correction terms, are saved to construct the dynamic AIDS model as follows: ∆wit = ψi ∆wi, t−1 + λi u bi, t−1 + βid ∆ln mt Pt∗ + N X d γij ∆ln pjt + j=1 K X ϕdik Dkt + ξit (12.2) k=1 where ∆ is the first-difference operator; u b is the residual from the static model and other variables are the same as defined above; ψ, λ, β, γ, and ϕ are parameters to be estimated; and ξ is the disturbance term. The superscript d in the parameters indicates dynamic (short-run) AIDS model. The parameter ψ measures the effect of consumption habit. The parameter λ measures the speed of short-run adjustment and is expected to be negative. One major concern with the AIDS model is whether the expenditure variable is exogenous. If the expenditure variable is correlated with the error term, then the seemingly unrelated regression estimator may become biased and inconsistent. The Durbin-Wu-Hausman test is often used to address this concern. In addition, the adequacy of the model specification in the static and dynamic models can be examined through several diagnostic tests, including the Breusch-Godfrey test, Breusch-Pagan test, Ramsey’s specification error test, and Jarque-Bera LM test (Wan et al., 2010a). 12.2.2 Implementation: construction of restriction matrices To comply with economic theory, the static AIDS model is required to satisfy the following properties which are organized as three groups of restrictions: Adding-up: N X αi = 1; i=1 Homogeneity: N X N X i=1 s γij =0 j=1 s s Symmetry: γij = γji βis = 0; N X i=1 s γij = 0; N X ϕsik = 0 i=1 (12.3) 257 12.2 Statistics: AIDS model where the adding-up restriction can be satisfied through dropping one equation from the estimation. The homogeneity and symmetry restrictions can be imposed on the parameters to be estimated and assessed by likelihood ratio tests. 
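The sketch below mirrors the likelihood ratio calls used later in Program 12.1; rSta is assumed to be a static model fitted by aiStaFit() with homogeneity and symmetry imposed.

rUnr <- update(rSta, hom = FALSE, sym = FALSE)   # unrestricted static model
rHom <- update(rSta, hom = TRUE,  sym = FALSE)   # homogeneity only
lrtest(rUnr$est, rHom$est)                       # test homogeneity
lrtest(rUnr$est, rSta$est)                       # test homogeneity and symmetry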
For the dynamic AIDS model, these restrictions can be similarly defined, imposed, and evaluated. In actual implementation, how to impose the restrictions on the AIDS system will depend on how the model is estimated, e.g., generalized least square or maximum likelihood estimation. In the erer library, the AIDS model is estimated with the generalized least square, so the above restrictions are imposed through matrix manipulation. A good knowledge of matrix object in R should be learned in constructing and adding them to the AIDS model. 12.2.3 Implementation: estimation by generalized least square Both the static and dynamic AIDS models can be expressed in a more general format for the purpose of estimation (Henningsen and Hamann, 2007; Greene, 2011). To avoid confusion, notations in this subsection should be read independently from these in previous formulas. To begin with, consider a system of G equations with the following notations: y i = X i β i + ui , i = 1, 2, . . . , G (12.4) where y i is a vector of the dependent variable in equation i; X i is a matrix of independent variables; β i is the coefficient vector; and ui is the vector of the disturbance term. Assume that each equation has the same number of observations (t = 1, 2, . . . , T ). The independent variable matrix is the same for each equation, and the number of independent variables in one equation is H. Thus, the matrix dimension is T × 1 for y i and ui , T × H for X i , and H × 1 for β i . To allow for matrix manipulations, the system can be expressed in a stacked format: X 0 ... 0 1 u1 β1 y1 0 β u2 y2 0 X 2 . . . 2 (12.5) .. .. .. . ... + ... ... = . . . . . uG βG yG 0 0 . . . XG or more compactly as: y = Xβ + u (12.6) where the matrix dimension is T G × 1 for y and u, T G × HG for X, and HG × 1 for β. When the matrix of independent variables is the same across equations, i.e., X 1 = X G , the Kronecker product notation can be used to simplify the relation further so X = I G ⊗ X 1 , where ⊗ is the Kronecker product and I G is an identity matrix of dimension G. The whole system can be treated as a single equation and estimated by ordinary least square. If that is the case, then a strong assumption is made about the covariance structure of the disturbance term. In practice, generalized least square has been developed to accommodate more scenarios. Specifically, the coefficients and their covariance matrix by generalized least square can be expressed as follows: −1 b −1 X b −1 y βb = X 0 Ω X0 Ω −1 b = X0 Ω b −1 X cov(β) (12.7) (12.8) 258 Chapter 12 Sample Study B and New R Functions b is the estimated covariance of the disturbance term. Note the similarity between where Ω the above equations and these for ordinary least square, i.e., Equation (7.6) on page 105. The key to understanding the linkage between ordinary least square and generalized b Assume that the disturbance terms across observations least square is in the definition of Ω. are not correlated, but there is contemporaneous correlation. Then, the covariance matrix for the disturbance terms can be expressed as: E[uu0 ] = Ω = Σ ⊗ I T (12.9) where Ω is a T G × T G matrix; Σ = [σij ] is the G × G disturbance covariance matrix; ⊗ is the Kronecker product; and I T is an identity matrix of dimension T . Whether there is contemporaneous correlation in the disturbance terms is equal to whether Σ = [σij ] is a diagonal matrix. 
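A tiny numerical sketch of Equation (12.9), with a made-up covariance matrix for G = 2 equations and T = 3 observations, shows how the Kronecker product expands the covariance structure.

Sigma <- matrix(c(4, 1, 1, 9), nrow = 2)   # contemporaneous covariance, G = 2
Omega <- Sigma %x% diag(3)                 # Kronecker product with I_T, T = 3
dim(Omega)                                 # 6 x 6, i.e., (T * G) x (T * G)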
If Σ = [σij ] is a diagonal matrix, then there is no contemporaneous correlation, and estimating the whole system by generalized least square is the same as estimating each equation separately by ordinary least square. This is true because the covariance constant will be canceled out in Equation (12.7), resulting in the same expression in Equation (7.6) on page 105. If Σ = [σij ] is not diagonal, then incorporating the contemporaneous correlation into the estimation will improve the efficiency of the estimator. This has been referred to as seemingly unrelated regression in statistics. Just like in ordinary least square, the true covariance matrix of the disturbance terms in b the residual generalized least square, i.e., Ω, is unknown. To have the estimated value of Ω, b and values from a regression need to be utilized. There are several ways of generating Ω, the following treatment is very similar to that used in ordinary least square: σ bij = u0i uj T −H (12.10) b = [b b ij is an element in Σ where σ σij ]. With the covariance matrix being estimated from the actual data, this estimator has been called as the feasible generalized least square. Finally, in many empirical applications, there is a need to estimate model coefficients under liner restrictions. For example, in the AIDS model, homogeneity and symmetry restrictions from the underlying economic theory need to be imposed on the model. One way for estimating the coefficients under linear restrictions is to constrain the coefficients with the following equation: Rβ R = q (12.11) where β R is the restricted coefficient vector, R is the restriction matrix, and q is the restriction vector. Each linear independent restriction is represented by one row of R and one corresponding element in q. When the linear restrictions are imposed on the system, the restricted estimator and the covariance matrix can be derived and expressed as follows: " # " #−1 " # 0 b −1 0 b −1 0 βbR X Ω y = X Ω X R (12.12) b λ R 0 q " # " #−1 0 b −1 0 βbR cov = X Ω X R (12.13) b λ R 0 b is a vector of the Lagrangian multipliers for the linear restrictions. Apparently, if the where λ linear restrictions are null, then the above equations will be reduced to Equations (12.7) and (12.8). 259 12.2 Statistics: AIDS model 12.2.4 Implementation: calculation of demand elasticities The key output from the AIDS model is the demand elasticities. Several types of elasticities can be computed to evaluate the response of consumer preferences and import quantities to changes in expenditure and prices. In this study, expenditure elasticity, Marshallian price elasticities, and Hicksian price elasticities are calculated using the estimated parameters from the AIDS model and the average import shares over the study period. From the static AIDS model, the long-run elasticities can be calculated as: ηis = 1 + βis wi s γij β s wj − i wi wi s γ ij ρsij = −δij + + wj wi sij = −δij + (12.14) where η, , and ρ are the expenditure elasticity, Marshallian price elasticity, Hicksian price s elasticity, respectively; βis and γij are parameter estimates from the static AIDS model; the Kronecker delta δij is equal to 1 if i = j (i.e., own-price elasticity) and 0 if i 6= j (i.e., crossprice elasticity); and w is the average import share over the study period (2001 – 2008). For the dynamic AIDS model, short-run elasticities can be similarly calculated via the above d formula, with the corresponding parameters (i.e., βid and γij ) being substituted. 
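A minimal sketch of Equation (12.14) for one good is shown below, using made-up numbers that loosely echo the four-supplier example in Program 12.2; in practice the erer function aiElas() computes the full set of elasticities from a fitted model.

w     <- c(0.44, 0.12, 0.08, 0.36)          # average import shares (illustrative)
beta1 <- 0.03                               # expenditure coefficient, good 1
gam1  <- c(0.167, -0.020, -0.030, -0.117)   # price coefficients, good 1
i     <- 1
delta <- as.numeric(seq_along(w) == i)      # Kronecker delta
eta   <- 1 + beta1 / w[i]                   # expenditure elasticity
marsh <- -delta + gam1 / w[i] - beta1 * w / w[i]   # Marshallian price elasticities
hicks <- -delta + gam1 / w[i] + w                  # Hicksian price elasticities
round(c(expen = eta, own.price = marsh[i]), 3)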
The standard errors for elasticities can be computed by following several basic properties of variances (Greene, 2011). In general, var(ax + b) = a2 var(x) var(x + y) = var(x) + var(y) + 2 cov(x, y) ! n n X n X X var xi = cov(xi , xj ) i=1 (12.15) i=1 j=1 = n X var(xi ) + 2 i=1 n X n X cov(xi , xj ) i<j j=1 where x, y, xi , and xj denote random variables, and a and b are constants. Note that the variance of a constant is zero, and the correlation between a constant and a random variable is zero too. To compute the variance of expenditure elasticity, Marshallian price elasticity, and Hicksian price elasticity, we just need to use the basic variance formulas in Equation (12.14). The results can be expressed as follows: var(ηis ) = var(sij ) = var(ρsij ) = var(βis ) w2i s var(γij ) w2i s var(γij ) + s , βis ) w2j var(βis ) 2 wj cov(γij − w2i w2i (12.16) w2i where the Kronecker delta δij takes the constant value of 1 or 0 only as an additive term in Equation (12.14), so it does not show up in the variance formulas anymore. Once the variance is available, the t-ratio and p-value can be computed for an elasticity estimate. 260 12.3 Chapter 12 Sample Study B and New R Functions Program version for Wan et al. (2010a) Program 12.1 Program version for Wan et al. (2010a) can be used to generate all the five tables and one figure. To save space, a simplified version of the figure is produced with base R graphics here, which is shown as Figure 11.2 Plotting multiple time series of wooden bed trade on a single page on page 227. A more elegant version of this figure is presented as Figure 16.5 Import shares of beds in Wan et al. (2010a) by ggplot2 and grid on page 383. A number of new functions are created and saved in the erer library, and the main R program has been reduced to less than three pages. The statistical analyses for this project are well organized, and all the results can be reproduced within a few minutes. The initial step in the program is to import two raw data sets: the import data and expenditure data. At the end of the transformation, three R time series objects are generated and saved in the erer library as daExp, daBedRaw, and daBed. Note that daExp is needed for the Hausman test, daBedRaw is needed for generating some summary statistics in Table 1, and daBed is the main data for the AIDS model. Furthermore, Program 12.1 is structured to make all the results in Wan et al. (2010a) reproducible with the erer library only. For the block titled as “# 1. Data import and transformation”, the purpose is to demonstrate how to import raw data in Microsoft Excel and transform them for the AIDS model. In particular, the dummy variable creation is also incorporated into the aiData() function. If you are interested in running the model only, then that is feasible by loading the erer library (line 2) and then skipping to line 33 directly. Program 12.1 Program version for Wan et al. (2010a) 1 2 3 # Title: R Program for Wan et al. (2010 JAAE ); last revised Feb. 2010 library(RODBC); library(erer); library(xlsx) options(width = 120); setwd("C:/aErer") 4 5 6 7 8 9 10 11 12 13 # ------------------------------------------------------------------------# 1. 
Data import and transformation # 1.1 Import raw data in Microsoft Excel format dat <- odbcConnectExcel2007('RawDataAids.xlsx') sheet <- sqlTables(dat); sheet$TABLE_NAME impo <- sqlFetch(dat, "dataImport") expe <- sqlFetch(dat, "dataExp") odbcClose(dat) names(impo); names(expe); head(impo); tail(expe) 14 15 16 17 18 19 # 1.2 Expenditure data for Hausman test ex <- ts(data = expe[, -c(1, 2)], start = c(1959, 1), end = c(2009, 12), frequency = 12) Exp <- window(ex, start = c(2001, 1), end = c(2008, 12), frequency = 12) head(Exp); bsStat(Exp) 20 21 22 23 24 25 26 # 1.3 Raw import data, date selection, and transformation for AIDS BedRaw <- ts(data = impo[, -c(1, 2)], start = c(1996, 1), end = c(2008, 12), frequency = 12) lab8 <- c("CN", "VN", "ID", "MY", "CA", "BR", "IT") dumm <- list(dum1 = c(2003, 10, 2003, 10), dum2 = c(2004, 7, 2004, 7), dum3 = c(2005, 1, 2005, 1)) 12.3 Program version for Wan et al. (2010a) 27 28 29 30 31 261 imp8 <- aiData(x = BedRaw, label = lab8, label.tot = "WD", prefix.value = "v", prefix.quant = "q", start = c(2001, 1), end = c(2008, 12), dummy = dumm) imp5 <- update(imp8, label = c("CN", "VN", "ID", "MY")); names(imp5) Bed <- imp8$out; colnames(Bed)[18:20] <- c("dum1", "dum2", "dum3") 32 33 34 35 # 1.4 Three datasets saved in 'erer' library already # Results in Wan (2010 JAAE ) can be reproduced with saved data directly. data(daExp, daBedRaw, daBed); str(daExp); str(Exp) 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 # ------------------------------------------------------------------------# 2. Descriptive statistics (Table 1) lab8 <- c("CN", "VN", "ID", "MY", "CA", "BR", "IT") pig <- aiData(x = daBedRaw, label = lab8, label.tot = "WD", prefix.value = "v", prefix.quant = "q", start = c(2001, 1), end = c(2008, 12)) hog <- cbind(pig$share * 100, pig$price, pig$m / 10 ^ 6) colnames(hog) <- c(paste("s", lab8, sep = ""), "sRW", paste("p", lab8, sep = ""), "pRW", "Expend") dog <- bsStat(hog, two = TRUE, digits = 3)$fstat[, -6] colnames(dog) <- c("Variable", "Mean", "St. Dev.", "Minimum", "Maximum") dog[, -1] <- apply(X = dog[, -1], MARGIN = 2, FUN = function(x) {sprintf(fmt="%.3f", x)}) (table.1 <- dog) 51 52 53 54 55 56 57 58 59 # ------------------------------------------------------------------------# 3. Monthly expenditure and import shares by country (Figure 1) tos <- window(daBedRaw[, "vWD"], start = c(2001, 1), end = c(2008, 12)) tot <- tos / 10 ^ 6 sha <- daBed[, c('sCN', 'sVN', 'sID', 'sMY', 'sCA', 'sBR', 'sIT')] * 100 y <- ts.union(tot, sha); colnames(y) <- c('TotExp', colnames(sha)) windows(width = 5.5, height = 5, family = 'serif', pointsize = 11) plot(x = y, xlab = "", main = "", oma.multi = c(2.5, 0, 0.2, 0)) 60 61 62 63 64 65 66 67 68 69 # ------------------------------------------------------------------------# 4. 
Hausman test and revised data # 4.1 Getting started with a static AIDS model sh <- paste("s", c(lab8, "RW"), sep = "") pr <- paste("lnp", c(lab8, "RW"), sep = "") du3 <- c("dum1", "dum2", "dum3"); du2 <- du3[2:3] rSta <- aiStaFit(y = daBed, share = sh, price = pr, shift = du3, expen = "rte", omit = "sRW", hom = TRUE, sym = TRUE) summary(rSta) 70 71 72 73 74 75 # 4.2 Hausman test and new data (dg <- daExp[, "dg"]) rHau <- aiStaHau(x = rSta, instr = dg, choice = FALSE) names(rHau); colnames(rHau$daHau); colnames(rHau$daFit); rHau two.exp <- rHau$daFit[, c("rte", "rte.fit")]; bsStat(two.exp, digits = 4) 262 76 77 Chapter 12 Sample Study B and New R Functions plot(data.frame(two.exp)); abline(a = 0, b = 1) daBedFit <- rHau$daFit 78 79 80 81 82 83 84 85 86 87 88 89 90 91 # ------------------------------------------------------------------------# 5. Static and dynamic AIDS models # 5.1 Diagnostics and coefficients (Table 2, 3, 4) hSta <- update(rSta, y = daBedFit, expen = "rte.fit") hSta2 <- update(hSta, hom = FALSE, sym = F); lrtest(hSta2$est, hSta$est) hSta3 <- update(hSta, hom = FALSE, sym = T); lrtest(hSta2$est, hSta3$est) hSta4 <- update(hSta, hom = TRUE, sym = F); lrtest(hSta2$est, hSta4$est) hDyn <- aiDynFit(hSta) hDyn2 <- aiDynFit(hSta2); lrtest(hDyn2$est, hDyn$est) hDyn3 <- aiDynFit(hSta3); lrtest(hDyn2$est, hDyn3$est) hDyn4 <- aiDynFit(hSta4); lrtest(hDyn2$est, hDyn4$est) (table.2 <- rbind(aiDiag(hSta), aiDiag(hDyn))) (table.3 <- summary(hSta)); (table.4 <- summary(hDyn)) 92 93 94 95 96 97 98 99 100 101 # 5.2 Own-price elasticities (Table 5) es <- aiElas(hSta); ed <- aiElas(hDyn); esm <- edm <- NULL for (i in 1:7) { esm <- c(esm, es$marsh[c(i * 2 - 1, i * 2), i + 1]) edm <- c(edm, ed$marsh[c(i * 2 - 1, i * 2), i + 1]) } MM <- cbind(es$expen[-c(15:16), ], esm, ed$expen[-c(15:16), 2], edm) colnames(MM) <- c("Country", "LR.exp", "LR.Marsh", "SR.exp", "SR.Marsh") (table.5 <- MM) 102 103 104 105 106 107 108 109 110 111 112 113 # 5.3 Cross-price elasticities (Table 6) (table.6a <- es$hicks[-c(15:16), -9]) (table.6b <- ed$hicks[-c(15:16), -9]) for (j in 1:7) { table.6a[c(j * 2 - 1, j * 2), j + 1] <- "___" table.6b[c(j * 2 - 1, j * 2), j + 1] <- "___" } rown <- rbind(c("Long-run", rep("", times = 7)), c("Short-run", rep("", times = 7))) colnames(rown) <- colnames(table.6a) (table.6 <- rbind(rown[1, ], table.6a, rown[2, ], table.6b)) 114 115 116 117 118 # 5.4 Alternative specifications summary(uSta1 <- update(hSta, shift = du2)); aiElas(uSta1) summary(uDyn1a <- aiDynFit(uSta1)); aiElas(uDyn1a) summary(uDyn1b <- aiDynFit(uSta1, dum.dif = TRUE)) 119 120 121 122 123 124 # ------------------------------------------------------------------------# 6. Export five tables # Table in csv format (output <- listn(table.1, table.2, table.3, table.4, table.5, table.6)) write.list(z = output, file = "OutAidsTable.csv") 263 12.3 Program version for Wan et al. (2010a) 125 126 127 128 129 130 131 # Table in excel format name <- paste("table", 1:6, sep = ".") for (i in 1:length(name)) { write.xlsx(x = get(name[i]), file = "OutAidsTable.xlsx", sheetName = name[i], row.names = FALSE, append = as.logical(i - 1)) } Note: Major functions used in Program 12.1 are: library(), getwd(), head(), tail(), odbcConnectExcel2007(), sqlFetch(), ts(), window(), ts.union(), replace(), plot(), windows(), update(), for loop, data(), aiData(), aiStaFit(), aiStaHau(), aiDynFit(), aiElas(), listn(), write.list(), get(), and write.xlsx(). # Selected results from Program 12.1 > table.1 Variable Mean St. Dev. 
Minimum 1 sCN 44.226 6.984 27.995 2 sVN 11.731 10.728 0.087 3 sID 7.817 1.772 4.661 4 sMY 6.309 2.358 2.340 5 sCA 6.306 3.720 1.394 6 sBR 4.435 1.319 1.813 7 sIT 4.136 2.700 0.716 8 sRW 15.040 4.106 9.249 9 pCN 150.179 10.351 116.067 10 pVN 117.344 11.580 90.712 11 pID 135.295 21.591 91.127 12 pMY 104.600 11.536 78.988 13 pCA 123.673 12.682 94.238 14 pBR 87.569 11.683 38.021 15 pIT 244.321 110.453 137.408 16 pRW 112.263 13.618 84.258 17 Expend 83.915 23.362 33.728 > table.5 Country LR.exp LR.Marsh 1 sCN 1.095*** -0.467*** 2 (19.723) (-2.992) 3 sVN 3.013*** -2.491*** 4 (16.783) (-6.967) 5 sID 0.494*** -0.955*** 6 (4.573) (-5.665) 7 sMY 2.281*** -0.909*** 8 (21.988) (-6.107) 9 sCA -0.845*** -0.968*** 10 (-6.055) (-3.912) 11 sBR 0.892*** -1.137*** 12 (6.068) (-6.091) 13 sIT -0.583*** -1.162*** 14 (-3.179) (-9.407) Maximum 58.527 34.251 12.525 10.012 16.185 8.807 11.764 26.271 177.675 150.721 189.369 142.184 187.215 120.905 652.052 145.088 121.153 SR.exp 1.285*** (12.079) 1.091*** (6.013) 0.546*** (2.525) 0.946*** (5.286) 0.172 (0.656) 0.577** (2.179) 0.511 (1.363) SR.Marsh -0.998*** (-10.311) -1.109*** (-8.663) -0.978*** (-7.301) -0.848*** (-7.786) -1.035*** (-8.062) -0.987*** (-9.031) -0.849*** (-6.119) 264 12.4 Chapter 12 Sample Study B and New R Functions Needs for user-defined functions As presented above, the full program version for Wan et al. (2010a) is less than three pages but can reproduce all the results. A dominant feature of this program version is that a good number of user-defined functions are defined and called repeatedly. To demonstrate the need for user-defined functions, a simple example is extracted from Program 12.1 and analyzed in this section. In Program 12.2 Estimating the AIDS model with fewer user-defined functions, the AIDS model estimated is static and has four suppliers only: China, Vietnam, Indonesia, and the residual supplier. The model is first estimated with the two user-defined functions in the erer library: aiData() and aiStaFit(). Then, the model is estimated again without them. To replicate the results, several steps are needed: preparing the data for the regression, constructing the formula list, creating the restriction matrices, and fitting the regression at the end. Both the approaches use the systemfit() function to estimate the static AIDS demand model, and they generate the same results. A comparison of the above two approaches can reveal several differences. First of all, defining new functions can save space. The user-defined functions wrap up many commands in the format of new functions and make the program flow more concise. Without the userdefined functions, the program is long and fragmented. Furthermore, calling user-defined functions repeatedly can greatly save space. In many situations, the saving is one line in calling a function versus a half page without a user-defined function. In Program 12.1 Program version for Wan et al. (2010a), these user-defined functions have been called repeatedly, and without that, the length of the program version would have been at least 20 pages. Second, defining new functions makes a program version better organized. Sometimes a new function may be called only once in a program. Thus, wrapping a group of commands in a function format seems have limited benefits. However, even if that is true, the single benefit of having a program organized can still be large enough in many situations to justify efforts for writing a new function. 
With user-defined functions, Program 12.1 is reduced to less than three pages and becomes more like a large table of contents, and many details are wrapped up within the definitions of new functions. This approach is consistent with the idea of proposal, program, and manuscript versions for a research project, as emphasized in this book. An R program version thus has an identifiable structure that allows easy connections with the proposal and manuscript version for a project. In sum, user-defined functions make organizing the program version of a study possible or easier. In general, if a program version for a study is very long (e.g., 40 pages), then it is a good indication that user-defined functions should be considered and adopted. Finally, user-defined functions make statistical analyses much more efficient for a research project. In Program 12.2, the model estimated is static and has four suppliers only. The raw data of daBedRaw contains import values and quantities for 16 countries. At the beginning, how many countries should be included as individual suppliers in the AIDS model? This is a typical empirical question that can only be determined by the trade pattern for a specific commodity. In Wan et al. (2010a), seven countries and one residual supplier are chosen at the end. However, how about a similar model with other choices, like five, nine, or even more suppliers? How about changing the start point from January 2001 to July 2001 because of some large trade volatility for Vietnam? In reality, the number of alternative models for data mining can be very large. Without user-defined functions, researchers often feel hopeless to answer these questions, so the most possible scenario is to just examine one or two of selected hypotheses. In particular, note that in Program 12.2, a change in the country list, i.e., choice <- c("CN", "VN", "ID"), will require many changes in 12.4 Needs for user-defined functions 265 the sequential commands, e.g., the dimension of res.left. In contrast, when user-defined functions are available, additional hypotheses can be examined easily and comprehensively with a simple function call. The benefit of user-defined functions does not come to us without a cost. The cost is to learn how to wrap up a group of commands in the format of R function. This is the focus of Part IV Programming as a Wrapper. Specifically, for Wan et al. (2010a), the following new functions are defined for estimating static and dynamic AIDS model. Methods for some generic functions are also defined for several new classes, e.g., summary() and print(). bsFormu() bsLag() aiData() aiStaFit() aiStaHau() aiDynFit() aiDiag() aiElas() Creating formula objects for models; used inside aiStaFit() Generating lagged time series; used inside aiDynFit() Transforming raw data for static AIDS model with dummy variables Fitting a static AIDS model Conducting a Hausman test on a static AIDS model Fitting a dynamic AIDS model Diagnostic statistics for static or dynamic AIDS model Computing elasticity for static or dynamic AIDS models Program 12.2 Estimating the AIDS model with fewer user-defined functions 1 2 3 4 # 0. load library; inputs and choices library(systemfit); library(erer) data(daBedRaw); colnames(daBedRaw) wa <- c(2001, 1); wb <- c(2008, 12); choice <- c("CN", "VN", "ID") 5 6 7 8 # 1. 
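A small sketch of this point: once aiData() wraps up the data transformation, an alternative country list or sample window becomes a one-line change. The later start date below is purely illustrative.

library(erer)
data(daBedRaw)
alt <- aiData(x = daBedRaw, label = c("CN", "VN", "ID", "MY"),
              start = c(2001, 7), end = c(2008, 12))   # four suppliers, later start
head(alt$out)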
With two user-defined functions: aiData(); aiStaFit() pit <- aiData(x = daBedRaw, label = choice, start = wa, end = wb) cow <- pit$out; round(head(cow), 3) 9 10 11 12 13 14 15 16 sh <- paste("s", c(choice, "RW"), sep = "") pr <- paste("lnp", c(choice, "RW"), sep = "") rr <- aiStaFit(y = cow, share = sh, price = pr, expen = "rte", hom = TRUE, sym = TRUE) summary(rr) names(rr); rr$formula rr$res.matrix 17 18 19 20 21 22 23 24 25 26 27 # 2. Without two user-defined functions: aiData(); aiStaFit() # 2.1 Prepare data for AIDS vn2 <- paste("v", choice, sep = "") qn2 <- paste("q", choice, sep = "") x <- window(daBedRaw, start = wa, end = wb) y <- x[, c(vn2, "vWD", qn2, "qWD")] vRW <- y[, "vWD"] - rowSums(y[, vn2]) qRW <- y[, "qWD"] - rowSums(y[, qn2]) value <- ts.union(y[, vn2], vRW); colnames(value) <- c(vn2, "vRW") quant <- ts.union(y[, qn2], qRW); colnames(quant) <- c(qn2, "qRW") 28 29 30 31 price <- value / quant; colnames(price) <- c("pCN", "pVN", "pID", "pRW") lnp <- log(price); colnames(lnp) <- c("lnpCN", "lnpVN", "lnpID", "lnpRW") m <- ts(rowSums(value), start = wa, end = wb, frequency = 12) 266 32 Chapter 12 Sample Study B and New R Functions share <- value / m; colnames(share) <- c("sCN", "sVN", "sID", "sRW") 33 34 35 36 37 38 rte <- log(m) - rowSums(share * lnp) dee <- ts.union(share, rte, lnp) colnames(dee) <- c(colnames(share), "rte", colnames(lnp)) round(head(dee), 3) identical(cow, dee) # TRUE 39 40 41 42 43 44 45 46 47 48 49 # 2.2 Formula and restriction matrix mod <- list(China = sCN ~ 1 + rte + lnpCN + lnpVN + lnpID + Vietnam = sVN ~ 1 + rte + lnpCN + lnpVN + lnpID + Indonesia = sID ~ 1 + rte + lnpCN + lnpVN + lnpID + res.left <- matrix(data = 0, nrow = 6, ncol = 18) res.left[1, 3:6] <- res.left[2, 9:12] <- res.left[3, 15:18] res.left[4, 4] <- res.left[5, 5] <- res.left[6, 11] res.left[4, 9] <- res.left[5, 15] <- res.left[6, 16] res.right <- rep(0, times = 6) identical(res.left, rr$res.matrix) # TRUE lnpRW, lnpRW, lnpRW) <- 1 <- 1 <- -1 50 51 52 53 54 # 2.3 Fit AIDS model dd <- systemfit(formula = mod, method = "SUR", data = dee, restrict.matrix = res.left, restrict.rhs = res.right) round(summary(dd, equations = FALSE)$coefficients, digits = 3) # Selected results from Program 12.2 > data(daBedRaw); colnames(daBedRaw) [1] "vBR" "vCA" "vCN" "vDK" "vFR" "vHK" "vIA" "vID" "vIT" "vMY" "vMX" [12] "vPH" "vTW" "vTH" "vUK" "vVN" "vWD" "qBR" "qCA" "qCN" "qDK" "qFR" [23] "qHK" "qIA" "qID" "qIT" "qMY" "qMX" "qPH" "qTW" "qTH" "qUK" "qVN" [34] "qWD" > cow <- pit$out; round(head(cow), 3) sCN sVN sID sRW rte Jan 2001 0.402 0.001 0.107 0.490 12.832 Feb 2001 0.305 0.001 0.089 0.604 12.527 Mar 2001 0.280 0.001 0.125 0.594 12.758 Apr 2001 0.333 0.001 0.118 0.548 12.728 May 2001 0.376 0.001 0.108 0.515 12.784 > summary(rr) Parameter 1 (Intercept) 2 3 rte 4 5 lnpCN 6 7 lnpVN 8 9 lnpID lnpCN 4.932 4.896 4.754 4.890 4.936 sCN sVN sID 0.006 -3.438*** 0.657*** (0.020) (-10.502) (7.864) 0.030 0.267*** -0.043*** (1.212) (10.742) (-6.785) 0.167** -0.020 -0.030* (2.299) (-0.341) (-1.852) -0.020 -0.092 -0.004 (-0.341) (-1.405) (-0.337) -0.030* -0.004 0.010 lnpVN 4.831 4.831 4.800 4.862 4.684 lnpID 4.590 4.558 4.584 4.614 4.512 lnpRW 4.679 4.798 4.773 4.795 4.804 267 12.4 Needs for user-defined functions 10 11 12 13 (-1.852) lnpRW -0.117*** (-2.719) R-squared 0.143 (-0.337) 0.115*** (3.050) 0.622 > names(rr); rr$formula [1] "y" "share" [6] "omit" "nOmit" [11] "nExoge" "nParam" [16] "res.rhs" "est" [[1]] sCN ~ 1 + rte <environment: [[2]] sVN ~ 1 + rte <environment: [[3]] sID ~ 1 + rte <environment: (0.875) 
0.025* (1.837) 0.541 "price" "hom" "nTotal" "AR1" "expen" "sym" "formula" "call" "shift" "nShare" "res.matrix" + lnpCN + lnpVN + lnpID + lnpRW 0x0ee3de1c> + lnpCN + lnpVN + lnpID + lnpRW 0x0ee3de1c> + lnpCN + lnpVN + lnpID + lnpRW 0x0ee3de1c> > rr$res.matrix [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [1,] 0 0 1 1 1 1 0 0 0 0 0 0 0 [2,] 0 0 0 0 0 0 0 0 1 1 1 1 0 [3,] 0 0 0 0 0 0 0 0 0 0 0 0 0 [4,] 0 0 0 1 0 0 0 0 -1 0 0 0 0 [5,] 0 0 0 0 1 0 0 0 0 0 0 0 0 [6,] 0 0 0 0 0 0 0 0 0 0 1 0 0 [,14] [,15] [,16] [,17] [,18] [1,] 0 0 0 0 0 [2,] 0 0 0 0 0 [3,] 0 1 1 1 1 [4,] 0 0 0 0 0 [5,] 0 -1 0 0 0 [6,] 0 0 -1 0 0 > round(summary(dd, equations = FALSE)$coefficients, digits = 3) Estimate Std. Error t value Pr(>|t|) China_(Intercept) 0.006 0.320 0.020 0.984 China_rte 0.030 0.024 1.212 0.226 China_lnpCN 0.167 0.073 2.299 0.022 China_lnpVN -0.020 0.057 -0.341 0.734 China_lnpID -0.030 0.016 -1.852 0.065 China_lnpRW -0.117 0.043 -2.719 0.007 Vietnam_(Intercept) -3.438 0.327 -10.502 0.000 Vietnam_rte 0.267 0.025 10.742 0.000 Vietnam_lnpCN -0.020 0.057 -0.341 0.734 Vietnam_lnpVN -0.092 0.065 -1.405 0.161 268 Chapter 12 Sample Study B and New R Functions Vietnam_lnpID Vietnam_lnpRW Indonesia_(Intercept) Indonesia_rte Indonesia_lnpCN Indonesia_lnpVN Indonesia_lnpID Indonesia_lnpRW 12.5 -0.004 0.115 0.657 -0.043 -0.030 -0.004 0.010 0.025 0.012 0.038 0.084 0.006 0.016 0.012 0.011 0.013 -0.337 3.050 7.864 -6.785 -1.852 -0.337 0.875 1.837 0.736 0.003 0.000 0.000 0.065 0.736 0.382 0.067 Road map: how to write new functions (Part IV) In this chapter, we have learned the sample study of Wan et al. (2010a) for Part IV Programming as a Wrapper, including its manuscript version, underlying statistics, and the final program version. The demand for user-defined functions is demonstrated through a small AIDS model. Writing new functions for a specific research project can save programming space, make the program well organized, and save time by improving programming efficiency. Thus, the focus from this point on is to learn programming techniques for writing user-defined functions. The rest four chapters in this Part are designed to help students learn how to define new functions to meet the unique need of a specific project. At the end of this Part, you should be able to understand the whole sample program for Program 12.1 completely, including these relevant functions in the erer library. For your own selected empirical study, new functions can be defined if there is such a need. More likely, the programming techniques learned in this Part, e.g., flow control structures, will improve your programming efficiency greatly even without any new function being defined. A good example is available in Program 12.1 when two for loops are used to manipulate the final table outputs. Briefly, Chapter 13 Flow Control Structure presents techniques for controlling programming flow in R. In particular, if and for are the workhorses and thus a deep understanding of them is a must for efficient programming in R. Chapter 14 Matrix and Linear Algebra explains operation rules and main functions for matrix manipulations. Chapter 15 How to Write a Function elaborates R function structure. S3 and S4 as the two approaches of organizing R functions are compared. At the end, in Chapter 16 Advanced Graphics, several advanced topics for R graphics are covered, including the grid system and the ggplot2 package. 
In all these chapters, a large number of applications are designed and included, and several of them are closely related to the sample study of Wan et al. (2010a). A simple test on whether you understand the materials in this Part well is to read Program 12.1 Program version for Wan et al. (2010a) and the relevant new user-defined functions, and then assess how much you can comprehend them. A number of exercises are also designed and included in the following chapters, especially for writing new functions. You are strongly encouraged to practice with these small exercises before you move on to some real and large data sets. Overall, the techniques presented in Part IV Programming as a Wrapper will appear to be more helpful or related to real data analyses than these covered in Part III Programming as a Beginner. The relation is like driving a car at the beginning of a highway and then in the middle of the highway. If you have spent large efforts and gained a solid understanding of the materials in Part III Programming as a Beginner, then you will be able to conduct statistical analyses even more efficiently with moderate efforts in this Part. This is also like our harvest season while we grow up with R in conducting empirical studies. Enjoy the freedom that R as a computer language brings to you now. Chapter 13 Flow Control Structure O perations in R are usually organized with the format of function. However, sometimes a function structure may not be suitable or sufficient to manage a program flow. In addition, writing a large user-defined function also needs some special syntax to manage its function body efficiently. R has two types of flow control statements (Braun and Murdoch, 2008). One is called branching statements, conditional statement, or simply conditionals, including if, ifelse(), and switch(). The other is looping statements or loops, including for, while, and repeat. Among these statements, the more frequently used functions are if and for, and furthermore, if is generally easier to use than for. In this chapter, techniques for controlling R program flow are presented first with simple examples. For actual scientific research, conditional and looping statements are often mixed or nested within each other. This will be demonstrated by several challenging applications. A notation is needed here about the formatting difference between a function and a flow control statement. Strictly speaking, ifelse() and switch() are different from others in that they are normal functions. Thus, ifelse() and switch() are formatted in the text with a pair of parentheses. In contrast, the flow control statements (e.g., if) are different from a typical function in many aspects, even though they can be used like a function in some cases. Thus, these keywords are formatted in the text without parentheses, e.g., if. To launch a help page within an R session, use help(“if”) or ?”if”. A command like ?if does not work, but it works for a function like ?ifelse or ?switch. 13.1 Conditional statements Three ways of constructing conditional execution are presented in this section. The if statement is most flexible in controlling a large number of commands. ifelse() and switch() are two functions that can become very handy for some tasks. In the following presentation, the concepts and basic definitions are elaborated. In Program 13.1 Conditional execution with if, ifelse(), and switch(), several examples are employed to demonstrate the use of these conditional statements. 
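As a quick, hedged preview of the three tools before the formal definitions below (the real examples are in Program 13.1), the following toy lines use a made-up object score and arbitrary labels.

score <- 76
# if: controls a whole group of commands, with an optional else part
if (score >= 60) {
  grade <- "pass"
} else {
  grade <- "fail"
}
grade

# ifelse(): a normal function that works element by element on a vector
ifelse(c(45, 76, 92) >= 60, yes = "pass", no = "fail")

# switch(): a normal function that picks one branch by name
switch("median", mean = mean(1:10), median = median(1:10), max = max(1:10))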
13.1.1 Branching with if

The if statement provides users the flexibility in choosing which group of commands is executed when several commands are interlinked. It can have an optional part of else. Specifically, a conditional statement of if can take one of the following two forms:

# Form A: without else
if (cond) {
  commands_true
}

# Form B: with else
if (cond) {
  commands_true
} else {
  commands_false
}

where cond is a logical expression that returns a single logical value of TRUE or FALSE. commands_true and commands_false are two command groups. The group of commands is usually enclosed within a set of curly braces. Individual commands are separated by semicolons or set up as new lines. When multiple commands and the else part are included, a number of formatting rules should be followed, as demonstrated in Program 7.3 Curly brace match in the if statement on page 118, and additionally explained in Section 7.6 Formatting an R program on page 120. Briefly, there should be a space before the left parenthesis and curly brace, i.e., if (); the "} else {" component should be on a separate new line; and there should be consistent indentation and vertical alignment for all commands in a group. If there is only one command within a group, then the curly braces can be omitted. However, including them is recommended for clarity.

A program flow is conditional on and controlled by the condition object cond. If the condition evaluates to TRUE, then a group of commands will be invoked. If the condition evaluates to FALSE and the optional else part is provided, then an alternative group of commands will be invoked. If the condition is FALSE and the else part is not provided, then no commands will be invoked and executed, and nothing happens in the end.

Instead of cond, the condition object can also be expressed as !cond, where the ! operator is logical negation (NOT). In this kind of situation, note that the information supplied by a user is the cond object, but the condition object for the if statement is changed into !cond. If the supplied value for cond is FALSE, then the condition object is TRUE and the commands afterward will be executed. Thus, the interest is in the status of cond being FALSE, or equivalently, the condition object !cond for the if statement being TRUE. When do we need to use !cond as the condition object, not cond directly? In writing a new function, the default value for an argument may need to be set in such a way that the program flow is easy to understand or processing time can be reduced. This will become more apparent when we learn techniques of writing new functions.

If needed, multiple if statements can be nested in one of the following two ways:

# Form C: nesting without "else if"
if (cond_1) {
  commands_true_1
} else {
  if (cond_2) {
    commands_true_2
  } else {
    commands_false_2
  }
}

Chapter 14 Matrix and Linear Algebra

A matrix object is usually primitive in a computer language, and the relevant matrix language is fundamental to statistical computation. In Part III Programming as a Beginner, we have learned some basic techniques in manipulating R objects, including matrices. However, in general, R beginners do not use matrices intensively. Instead, matrix manipulations are more closely related to writing user-defined functions.
As the focus of Part IV Programming as a Wrapper is writing new functions, operation rules and main functions for matrix manipulations are detailed in this chapter. A number of applications are included to demonstrate how to conduct statistical computation with matrices. Overall, the materials in this chapter will allow us to be better prepared for writing new functions.

14.1 Matrix creation and subscripts

A matrix can be created in several different ways (Matloff, 2011). Subscripting and indexing a matrix are needed to extract a subset from an existing matrix or to replace some of its values. R functions for matrix creation and subscripts are detailed in this section.

14.1.1 Creating a matrix

The concept of matrix in R is closely related to the concepts of vector and array. An array is a multidimensional extension of vectors, and it contains objects of the same mode. The most commonly used array in R has two dimensions, i.e., a matrix. Internally in R, a matrix is stored columnwise in a vector with an additional dimension attribute, which gives the number of rows and columns. If the elements of a matrix are not of the same mode, then all the elements will be coerced from a specific type to a more general type, e.g., from a numeric mode to a character mode. Thus, the mode of a matrix is simply the mode of its constituent elements; the class of a matrix is matrix. To find out whether an object is a matrix, use the is.matrix() function to test it.

A matrix is very different from a data frame in R. A data frame is a list with the restriction that all its individual elements have the same length. It can be indexed similarly to a matrix, but it accommodates different modes, making it especially convenient in handling heterogeneous raw data and analysis outputs. In contrast, a matrix object in R can hold values of the same mode only; it needs smaller storage space in general, which can be revealed by the object.size() or lss() function. R matrices allow all operations related to linear algebra, and thus they are the building block of statistical computation. As a result, matrices are especially relevant for advanced analyses in R.

The main function that can be used to create a new matrix is matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL). Basically, this function converts a vector into a matrix, with the options of specifying the numbers and names for both the rows and columns. Specifically, the data argument is an optional data vector. The nrow and ncol arguments specify the desired number of rows and columns, respectively. If one of the nrow and ncol arguments is not given, then an attempt is made to infer it from the length of data and the other argument. If neither is given, a one-column matrix is returned. If there are too few elements in data to fill the matrix, then the elements in data are recycled. That provides a compact way of making a new matrix full of zeros, ones, or NAs at the beginning of a specific task, e.g., matrix(data = 0, nrow = 3, ncol = 5). The byrow argument has a logical value; if FALSE (the default), the matrix is filled by column; and if TRUE, the matrix is filled by row. The dimnames argument can be NULL or a list of length two, giving the corresponding row and column names; an empty list is treated as NULL; and a list of length one is treated as row names.
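A small sketch of these arguments, using only base R and the arbitrary object names m1 and m2:

# Fill a 3 x 5 matrix with zeros: the single value 0 is recycled
m1 <- matrix(data = 0, nrow = 3, ncol = 5)
dim(m1); nrow(m1); ncol(m1)

# Fill by row and attach row and column names through dimnames
m2 <- matrix(data = 1:6, nrow = 2, byrow = TRUE,
             dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
m2
m2["r2", "c3"]   # the names can be used later for subscripting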
The diag() function can extract or replace the diagonal of a matrix, or construct a new diagonal matrix. Recall that an identity matrix or unit matrix of size n is the n × n square matrix with ones on the main diagonal and zeros elsewhere. Specifically, in diag(x = 1, nrow, ncol), the x argument can be a matrix, a vector, a one-dimensional array, or missing; nrow and ncol are optional dimensions for the result when x is not a matrix. As detailed on the built-in help page, diag() has four distinct usages. First, it extracts the diagonal of x when x is supplied as a matrix. Second, it returns an identity matrix when x is missing and nrow is specified. Third, if x is a scalar (i.e., a length-one vector) and the only argument, then it returns an identity matrix of size given by x. (Note that, strictly speaking, R as a language cannot define a scalar object explicitly.) Fourth, if x is a numeric vector with a length of at least two, or if x is a numeric vector with any length but other arguments are also supplied in diag(), then it returns a matrix with the given diagonal and zero off-diagonal entries.

R has a set of coercion functions that can be used to convert objects with different classes or modes. To coerce an existing object (e.g., a data frame or vector) into a matrix, use the function as.matrix(). In addition, use the rbind() function to stack two matrices vertically, and use cbind() to stack matrices horizontally. The functions stack() and unstack() are valid for a data frame or list, but cannot be applied to a matrix.

In linear algebra, matrix vectorization is a linear transformation that converts a matrix into a vector. To do that in R, use the function as.vector() or c(); both have the same effect for matrix vectorization. c() is more concise and as.vector() is clearer. This operation removes the dimension attribute from a matrix and leaves the elements in a vector form, given that the elements of a matrix are stored columnwise internally. Note that in general the functions as.vector() and c() are different in many ways. as.vector() removes all the attributes of an object if the output is atomic, but not for a list object. This difference is shown with an example in Program 14.1 Creating and subscripting matrices in R.

Several R functions are available to extract and replace the attributes of a matrix. Every matrix has a dimension attribute. The dim() function returns a vector of length two containing the number of rows and columns. Individual elements can be accessed using nrow() or ncol(). Another matrix attribute is row and column names. They can be assigned through the dimnames argument of the matrix() function, or after a matrix is created, through the dimnames(), rownames(), or colnames() function. Since the numbers of rows and columns in a matrix need not be the same, the value of the dimnames argument in matrix(), or the value for the dimnames() function, must be a list; the first element is a vector of names for rows, and the second is a vector of names for columns. To provide names for just one dimension, use a value of NULL for the dimension without a name.

14.1.2 Subscripting a matrix

Subscripting a matrix can be handled by the [ operator and two indices. The other major R indexing operators, [[ and $, are available for list and data frame objects, but not for matrices. As usual, the index values used for subscripting a matrix can be numerical, character, logical, or empty.
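Before the general notation is laid out below, a small sketch previews the four types of index values; the matrix mc and its dimnames are made up for illustration.

mc <- matrix(data = 1:12, nrow = 3,
             dimnames = list(c("a", "b", "c"), c("x", "y", "z", "w")))
mc[2, 3]                      # numerical indices
mc["b", "z"]                  # character indices based on dimnames
mc[c(TRUE, FALSE, TRUE), ]    # logical index on rows; empty column index keeps all columns
mc[, "y"]                     # empty row index: a whole column, reduced to a vector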
The general format for indexing a matrix is as follows:

x[i, j, ..., drop = TRUE]
x[i]

where x is a matrix; i and j are subscripts; and drop indicates whether the result is coerced to the lowest possible dimension. By default, subscripting operations reduce the dimensions of a matrix whenever possible. Consequently, subscripting a matrix can potentially return a vector. This may cause problems when the output from subscripting a matrix needs to be a matrix and will be used in further matrix operations. To prevent this from happening, the matrix nature of the extracted object can be retained with the drop = FALSE argument. Note that the drop = FALSE argument is applicable with two-index subscripting only. A single index can also be used to subscript elements in a matrix, and the output is always a vector. This is because a matrix is internally stored as a vector.

In particular, in subscripting a matrix, another new matrix with the same dimension and all logical values can be used as subscripts. The new matrix can be generated by logical operations on the existing matrix, or with several other functions related to matrices. For example, the lower.tri() and upper.tri() functions return a logical matrix useful in extracting the lower or upper triangular elements of a matrix. The row() function returns a new matrix with the same dimension as the existing matrix, filling each cell with the row number of each element; the col() function returns a new matrix with the column numbers. In the end, the new matrix serving as the single index is the same as a vector index that is compatible with the dimension of the existing matrix. Thus, behind this type of operation, a single index is actually used for subscripting.

Program 14.1 Creating and subscripting matrices in R
1   # A. Creating a matrix
2   aa <- 1:1000
3   bb <- matrix(data = aa, nrow = 1); class(bb); mode(bb)
4   cc <- data.frame(bb)
5   library(erer); lss()                       # size
6
7   # diag(): four usages
8   ma <- matrix(data = 1:20, nrow = 4, ncol = 5, byrow = FALSE); ma
9   diag(x = ma)                               # 1. extract the diagonal values
10  diag(diag(x = ma))                         # extract the diagonal matrix
11  diag(nrow = 4)                             # 2. create an identity matrix
12  diag(x = 4)                                # 3. create an identity matrix
13  diag(x = c(3, 9, 10))                      # 4. a matrix with the given diagonal
14  diag(x = c(3, 9, 10), nrow = 4)            # a matrix with more rows
15  mb <- ma

Chapter 15 How to Write a Function

A function in R is an object that can transform a set of argument values and then return output values, e.g., function.name <- function(arguments) {body}. In this chapter, the structure of a function is analyzed first. R has two approaches in organizing a function: S3 and S4. Their features are elaborated and compared in detail with several examples. At the end, a number of applications are designed to demonstrate how to write a user-defined function. These include writing a function for conducting numerical optimization with one dimension, estimating a binary choice model with maximum likelihood for Sun et al. (2007), and wrapping up several functions for a static AIDS model for Wan et al. (2010a).

15.1 Function structure

In this section, the structure of a new function is analyzed. The major components of a new function include a name for the new function, the function keyword, arguments, and the body. The core of a function is its body, which is basically manipulating argument objects as inputs and generating output objects at the end.
Thus, all the techniques we have learned about R object manipulations will be the foundation of writing new functions. Materials covered in this chapter are mainly procedural, and we emphasize how to organize some R commands for a task into a function structure that can be called and used repeatedly.

15.1.1 Main components

A function in R is one type of object that receives arguments from users, makes some transformation on the arguments, and finally returns one or more values. This is very similar to the concept of a production function in economics. A production function, also known as a transformation function, changes production factors into one or multiple outputs (e.g., from labor, capital, and raw materials into computers).

User-defined R functions are treated the same as predefined functions. Thus, they can be used independently or called by other functions. The major benefit of creating new functions is that a large programming task can be organized in small units, which can then be addressed by individual new functions. If the number of new functions is large for a project, then they can be wrapped up as a package, which will be covered in Part V Programming as a Contributor. At present, writing a function has become a basic technique in using R efficiently for even moderately challenging research projects, e.g., Wan et al. (2010a).

The basic syntax of a function in R can be expressed like this:

function.name <- function(argument 1, argument 2, ...) {
  statement 1; statement 2
  statement 3
  ...
}

where typical R formatting rules are followed. These rules include one space before and after the <- operator; no space after the function keyword and before the left parenthesis; one space before the left curly brace; putting the left curly brace at the end of the argument line; putting the right curly brace separately on a new line; aligning the right curly brace with function.name vertically; separating multiple arguments with a comma; and separating multiple commands on the same line with a semicolon. In Program 15.1 Function structure and properties, a new function is formatted in this way unless it is very short and can be put on a single line.

The above function structure has four major components: (a) the function name and assignment operator, (b) the keyword of function, (c) the arguments, and (d) the body and curly braces. The body component can be further divided into three parts: (d1) input, (d2) transformation, and (d3) output. To have a meaningful function, the minimum information needed is (b), (c), and (d3), i.e., the function keyword, an (empty) argument, and an (empty) output part in the body. For example, test <- function() 3; test() will return 3, or combining the two commands together as (function() 3)() will generate 3 too. If the number 3 is replaced by an empty body enclosed in braces, then the function always returns NULL, i.e., (function() {})(). In practice, a new function has most of these components in order to achieve a specific goal; otherwise, it would become too trivial. Let's examine them briefly one by one here first. The components that need more elaboration will be examined in the following sections in greater detail.

First, the symbol or name of the function is usually supplied, so a named function can be created and available for use later on.
If the expression function.name <- is not supplied, then an anonymous function is created. This is perfectly fine because functions are just one type of object in R, and any object in R can be unnamed. For example, the expression mean(c(1, 3, 4)) returns the average value of a vector as an unnamed object. An anonymous function may be preferred when it is used as an argument in another function. For example, in the lapply(x, FUN) function, the FUN argument should also be another function object. It can be a built-in function, e.g., seq(), or a user-defined new function. If the user-defined function is short, then it is common to just define and supply the new function at the same location, making the new function anonymous and available for the calling function only.

The second component is the function declaration by the keyword function. It tells R that the object named as function.name before the assignment operator is a function. The class of a function object, i.e., class(function.name), is "function". Furthermore, for function objects, R has a set of functions for extracting or replacing components. The args() function reveals the set of arguments accepted by a function. Similarly, the formals() function returns a list object of the formal arguments in a function. The body() function returns the body of the function specified. The functions formals() and body() can either extract or replace components of an existing function. In addition, the alist() function can be used with formals() to revise the arguments of a function.

The third component is a set of arguments that are separated by commas and enclosed in a pair of parentheses. The set of arguments can be completely empty, so some functions have no arguments at all. For example, the search() function returns a character vector of packages attached on the search path; to use this function, just type search(). In most cases, the arguments are composed of symbols (i.e., a variable name x), statements with an assignment operator of = (e.g., x = TRUE), and the special ... argument.

The fourth component is the body, i.e., the major component of a function. It is composed of one or multiple R statements, which are usually enclosed in a pair of curly braces. Multiple statements can be put on one line and separated by semicolons, or each statement can be put on a separate line. If there is only one statement in the body of a function, then the curly braces can be omitted. However, in general, including the curly braces is recommended for clarity.

Within the function body, the first part of inputs can be completely ignored if the arguments are straightforward. In most situations, however, argument values are examined in the input part to evaluate whether they possess the appropriate class, mode, or format, e.g., a data frame class required for an argument. In addition, some arguments may need simple extraction or operation before they can be used in the transformation. In the transformation part, the argument values are transformed to achieve the goal of the function. This is often the key portion of a function, and therefore, most of the time in creating a new function is spent here. In the final output part, one or multiple outputs from the previous transformation are organized and exported, so they will be accessible after the function is called. In general, some outputs should be returned to make the operation meaningful.
Some functions, e.g., plot(), focus on the side effect, so they do not return anything really meaningful. Program 15.1 Function structure and properties 1 2 3 # A. Minimum information for a function test <- function() {3}; test() (function() {})() 4 5 6 7 8 9 10 11 12 13 14 15 16 # B. Function properties dog <- function(x) { y <- x + 10 z <- x * 3 w <- paste("x times 3 is equal to", z, sep = " ") result <- list(x = x, y = y, z = z, w = w) return(result) } class(dog); args(dog); formals(dog); body(dog) dog(8) # default printing for all res <- dog(8); res # assignment and selected printing res$x; res$w 17 18 19 20 21 22 23 24 # C. Anonymous function ga <- lapply(X = 1:3, FUN = seq); ga my.seq <- function (y) {seq(from = 1, to = y, by = 1)} gb <- lapply(X = 1:3, FUN = my.seq) gc <- lapply(X = 1:3, FUN = function(y) {seq(from = 1, to = y, by = 1)}) gc identical(ga, gb); identical(gb, gc) Chapter 16 Advanced Graphics B esides traditional graphics system covered in Chapter 11, R has many additional facilities in creating and developing graphics. In this chapter, the graphics systems in R are reviewed first. Then the grid package as another important graphics system in R is introduced. Furthermore, the ggplot2 package is a contributed package based on the grid system and has several advantages over the traditional graphics in R. Thus, ggplot2 is also presented with several applications. Finally, as one important extension, R has a number of packages that can handle map data and geographical information well. These functions are briefly introduced at the end. 16.1 R graphics engine and systems As the key graphics facility, the grDevices package is included in base R and has been referred to as the graphics engine (Murrell 2011). This package contains fundamental infrastructure for supporting almost all graphics applications in R. Furthermore, two R packages, i.e., graphics and grid, have been built on top of the graphics engine, and consequentially, two graphics systems have been developed in R. The graphics package, also known as the traditional graphics system, has a complete set of functions for creating and annotating graphs. Chapter 11 Base R Graphics on page 214 has a detailed coverage of these functions and techniques. In contrast, the grid package has a separate set of graphics tools. It does not provide functions for drawing complete plots, so it is not used to produce plots as often as the traditional graphics system. Instead, it is more common to use functions from grid to develop new packages. Besides the two systems of graphics and grid, several graphics packages do exist independently even though the number is small. For example, the rggobi package provides a command-line interface for interactive and dynamic plotting. Many new graphics packages have been built on top of either the graphics or grid system, or even independently. For example, the maps package provides functions for drawing maps in the traditional graphics system. The packages of lattice and ggplot2 have been built on top of grid. The CRAN task view has one specific section for graph displays, titled as “Graphics.” It classifies these new packages into several categories: plotting, graphic applications, graphics systems, devices, colors, interactive graphics, and development. Over 40 packages related to R graphics have been reviewed. While the list of contributed packages in the task view may not be comprehensive or complete, it is a good place to view new development in this area. 
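To make the comparison in the following paragraphs concrete, here is a minimal, hedged sketch that draws the same scatter plot twice, once with the traditional graphics system and once with ggplot2. The built-in data set cars is used only for illustration, and ggplot2 is assumed to be installed.

# Traditional graphics: the plot is drawn directly on a device
plot(dist ~ speed, data = cars, main = "Traditional graphics")

# ggplot2 (built on grid): the plot is an object that can be saved and printed
library(ggplot2)
pg <- ggplot(data = cars, aes(x = speed, y = dist)) +
  geom_point() +
  labs(title = "ggplot2 graphics")
pg   # printing the object draws the graph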
The existence of two graphics systems (i.e., graphics v. grid) and related packages in R raises questions for practical applications: which one is better, or when should a package be employed? Unfortunately, there is no simple answer that everyone would agree with. In general, the same graphics system should be used to create a complete graph. Mixing them for one specific graph is difficult or confusing. Nevertheless, it may be desirable to combine the grid system with other packages in some situations, because the grid system offers more flexibility in formatting a graph than the traditional system.

More specifically, take lattice and ggplot2 as an example. From my experience, the main advantages of these packages are that graphs can be saved and manipulated exactly like other objects in R. The default style of these new packages is often better because they are motivated and developed to improve the appearance of graphics from the traditional system. Plotting multivariate data sets as panel graphs is easier, and the appearance is usually more professional. The cost of using these new packages is that there is often a steep learning curve. A good number of new concepts and plotting functions have to be learned before any serious plotting is feasible. Furthermore, creating a graph always involves the application of a set of graphics functions. Thus, one needs to have a solid understanding of a contributed package before a quality graph can be produced. For example, one may need to spend a few weeks on the book by Wickham (2009) before ggplot2 can be used to draw a graph with publication quality.

The main advantages of the traditional graphics system are that it is easy to learn, well documented and discussed through the built-in help files and online forums, and in most situations, very stable. In general, graphs are used either for exploring data patterns or for publishing. Traditional graphics is still faster and more flexible in data exploration. For example, plot() can show up to 10 time series in one window quickly with a very short command because plot.ts() is defined. Traditional graphics is also more flexible in drawing diagrams. The main disadvantage of the traditional graphics system is that graphs cannot be saved like a normal object, making a program flow unclear in some situations. For complicated data, the appearance of traditional graphs may not look professional.

In summary, the traditional graphics system has been deeply rooted in base R and used in many contributed packages to define plot methods for objects with new classes. It is also flexible for data exploration and demonstration graphs. Thus, it is likely that the traditional graphics system will continue to be used widely. New packages like lattice can generate more efficient and professional graphics in the long run if one is willing to spend weeks learning more new concepts.

16.2 The grid system

The grid package has been developed by Paul Murrell since the 1990s (Murrell, 2011). At present, it is a core package in base R, so all the functions in this package are indexed just like other graphics functions in base R. Several contributed packages have also been developed over time, e.g., gridBase for integrating base and grid graphics, gridDebug for debugging grid graphics, and gridExtra for some additional functions in grid graphics.
Furthermore, sophisticated graphics packages have been built on the basis of the grid package, including the well-known lattice and ggplot2. As the grid system has been constantly expanding and growing, the grid package is explained briefly in this section to help readers get started. Some examples are presented in Program 16.1 Learning viewports and low-level plotting functions in grid. The built-in help documents are very well documented, which is available by running help(package = “grid”). There are over ten vignettes at vignette(package = “grid”), Part V Programming as a Contributor 393 394 Part V Programming as a Contributor: The main sample study for this part is Sun (2011). The focus is on developing a new package or a graphical user interface, and making a contribution to the R community. How to create the content of a new package, what procedures should be followed, and how to develop a graphical user interface are illustrated in three individual chapters. Chapter 17 Sample Study C and New R Packages (pages 395 – 411): The statistics for the underlying model in Sun (2011) is presented first. Then the complete program version for generating tables and graphs is assessed, and the need of a new package is emphasized. Chapter 18 Contents of a New Package (pages 412 – 427): The principles of package design are analyzed first. The contents of a new package are analyzed with the apt package as an example. Debugging techniques and management of time and memory are also detailed. Chapter 19 Procedures for a New Package (pages 428 – 445): Procedural requirements for building a new package are presented, with Microsoft Windows as the operating system. The whole process is divided into three stages: skeleton, compilation, and distribution. Chapter 20 Graphical User Interfaces (pages 446 – 467): Concepts and tools for developing R graphical user interfaces are covered. The gWidgets package and the GTK toolkit are employed for the development. At the end, a GUI for the apt package is demonstrated. R Graphics • Show Box 5 • Screenshots from a dynamic graph for correlation R graphics can show dynamic correlations. See Program A.5 on page 518 for detail. Chapter 17 Sample Study C and New R Packages T he core sample study for Part V Programming as a Contributor is Sun (2011). In this chapter, the underlying statistical model and manuscript and program versions for this study are presented first. The research issue is asymmetric price transmission (APT) between China and Vietnam in the import wooden bed market of the United States. This is closely related to the issue examined in Wan et al. (2010a), i.e., the main sample study for Part IV Programming as a Wrapper. At the stage of proposal and project design, some aspects of designing several projects in one area are discussed. The relevant discussion can be found at Section 5.3.3 Design with challenging models (Sun 2011) on page 73. The model employed is at the frontier of time series statistics, i.e., nonlinear threshold cointegration analysis. It involves hundreds of linear regressions even for a very small data set. Thus, writing new functions and even preparing a new package are needed to have an efficient data analysis. For this specific model, a new package called apt is created. The program version for Sun (2011) is organized with the help of this package, so the whole program has become more concise and readable. 17.1 Manuscript version for Sun (2011) Recall that an empirical study has three versions: proposal, program, and manuscript. 
A proposal provides a guide like a road map for setting up the first draft of a manuscript. Like an engine, an R program can generate detailed tables and figures for the final manuscript. For this study, the brief proposal is presented at Section 5.3.3 Design with challenging models (Sun 2011). The R program for this study is presented later in this chapter. The final manuscript version is published as Sun (2011). Below is the very first manuscript version that is developed from the proposal. In constructing the first manuscript version for an empirical study, the key components are the tables and figures. The contents should be predicted as much as possible before a researcher works on an R program. The prediction is based on the understanding of the issue, data, model, and literature. The more a researcher can predict at this stage, the more efficient the programming will become. At the end, both the content and format of tables and figures need to be written down in the manuscript draft. For example, the results of EngleGranger and threshold cointegration tests are reported in combination as Table 3 in Sun (2011). The first draft of these results is presented here as Table 17.1. Some hypothetical values are put in the columns to provide formatting guides for R programming later. Empirical Research in Economics: Growing up with R Copyright © 2015 by Changyou Sun 395 396 Chapter 17 Sample Study C and New R Packages The First Manuscript Version for Sun (2011) 1. Abstract (200 words). Have one or two sentences for research issue, study need, objective, methodology, data, results, and contributions. 2. Introduction (3 pages in double line spacing). Have a paragraph for each of the following items: an overview of wooden bed imports in the United States, market price analyses and asymmetric price transmission (APT), sources of APT, models of APT, objective, and manuscript organization. 3. Import wooden bed market in the United States (4 pages). A review of the US wooden bed market is presented, with the emphasis on expansion of China and Vietnam in the import wooden bed market. — Factors behind China’s export growth — Antidumping investigation against China — Vietnam’s growth 4. Methodology (6 pages): A brief introduction of the methods and then three subsections. — Linear cointegration analysis — Threshold cointegration analysis — Asymmetric error correction model with threshold cointegration 5. Data and software (0.5 page). Monthly cost-insurance-freight values in dollar and quantities in piece are reported by country. The period covered in this study is from January 2002 to January 2010. Threshold cointegration and asymmetric error correction model are combined and used in this study. A new R package named as apt is created in the process. 6. Empirical results (4 pages of text, 4 pages of tables, and 3 pages of figure). — Descriptive statistics and unit root test — Results of the linear cointegration analysis — Results of the threshold cointegration analysis — Results of the asymmetric error correction model Table 1. Results of descriptive statistics and unit root tests Table 2. Results of Johansen cointegration tests on the import prices Table 3. Results of Engle-Granger and threshold cointegration tests Table 4. Results of asymmetric error correction model Figure 1. Monthly import values of wooden beds from China and Vietnam Figure 2. Monthly import prices of wooden beds from China and Vietnam Figure 3. Sum of squared errors by threshold value for threshold selection 7. 
Conclusion and discussions (3 pages). A brief summary of the study is presented first. Then about three key results from the empirical findings will be highlighted and discussed.
8. References (3 pages). No more than 30 studies will be cited.
end

Table 17.1 A draft table for the cointegration analyses in Sun (2011)
[Draft layout: one column for each of the Engle, TAR, CTAR, MTAR, and CMTAR models; rows for the estimates (threshold value, ρ1, ρ2), the diagnostics (AIC, BIC, QLB(4), QLB(8), QLB(12)), and the hypotheses Φ(H0: ρ1 = ρ2 = 0) and F(H0: ρ1 = ρ2). Hypothetical entries such as 0, −0.666*** (−3.333), 888.888, 10.123***, and 4.444*** are filled in only as formatting guides.]

17.2 Statistics: threshold cointegration and APT

In this section, the relevant statistics for threshold cointegration and asymmetric price transmission is presented. Emphases are put on the key information relevant for R implementation in the apt package. For a comprehensive coverage of this methodology, read the references cited in Sun (2011). Nonstationarity and unit root tests, Johansen-Juselius cointegration analysis, and most model diagnostics are not covered here for brevity. In contrast, Engle-Granger linear cointegration, threshold cointegration, and the asymmetric error correction model are described here in some detail. Linear cointegration analysis is the foundation of threshold cointegration.

17.2.1 Linear cointegration analysis

For linear cointegration analysis, there exist two major methods: the Johansen-Juselius and Engle-Granger two-step approaches (Enders, 2010). Both of them assume symmetric relations between variables. The Johansen approach is a multivariate generalization of the Dickey-Fuller test. The Engle-Granger approach is the foundation of threshold cointegration, so it is explained in detail first.

The focal variables here are monthly import prices of wooden beds from two supplying countries, i.e., Vietnam ($V_t$) and China ($H_t$). Their properties of nonstationarity and order of integration can be assessed using the Augmented Dickey-Fuller test. If both series have a unit root, then it is appropriate to conduct cointegration analysis to evaluate their interaction. With the Engle-Granger two-stage approach, the property of residuals from the long-term equilibrium relation is analyzed (Engle and Granger, 1987). For the two focal price variables, the two-stage approach can be expressed as:

$$V_t = \alpha_0 + \alpha_1 H_t + \xi_t \qquad (17.1)$$

$$\Delta \hat{\xi}_t = \rho\, \hat{\xi}_{t-1} + \sum_{i=1}^{P} \phi_i\, \Delta \hat{\xi}_{t-i} + \mu_t \qquad (17.2)$$

where $\alpha_0$, $\alpha_1$, $\rho$, and $\phi_i$ are coefficients, $\xi_t$ is the error term, $\hat{\xi}_t$ is the estimated residual, $\Delta$ indicates the first difference, $\mu_t$ is a white noise disturbance term, and $P$ is the lag number.

In the first stage of estimating the long-term relation among the price variables, the price of China is chosen to be placed on the right side and assumed to be the driving force. This considers the fact that China has been the leading supplier in the import wooden bed market of the United States over the study period from 2002 to 2010. In the second stage, the estimated residuals $\hat{\xi}_t$ are used to conduct a unit root test. Special critical values are needed for this test because the series is not raw data but a residual series. The number of lags is chosen so that there is no serial correlation in the regression residuals $\mu_t$. It can be selected with several statistics, e.g., the Akaike Information Criterion (AIC) or the Ljung-Box Q test.
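As a hedged illustration of the two-stage procedure (the apt and erer packages used in this book provide richer wrappers), the following base R sketch applies Equations (17.1) and (17.2) with P = 1 to two simulated price series; the object names h, v, and res are arbitrary.

set.seed(1)
h <- cumsum(rnorm(120))           # simulated price series, integrated of order one
v <- 2 + 0.8 * h + rnorm(120)     # a second series cointegrated with the first

# Stage 1: long-term equilibrium relation, as in Equation (17.1)
eq1 <- lm(v ~ h)
res <- residuals(eq1)

# Stage 2: unit root regression on the residuals, as in Equation (17.2) with P = 1
n   <- length(res)
dy  <- diff(res)                  # first difference of the residuals
y1  <- res[2:(n - 1)]             # lagged residual
dy1 <- dy[1:(n - 2)]              # lagged first difference
eq2 <- lm(dy[2:(n - 1)] ~ 0 + y1 + dy1)
summary(eq2)   # the t value on y1 is compared with special critical values, not the usual ones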
If the null hypothesis of $\rho = 0$ is rejected, then the residual series from the long-term equilibrium is stationary and the focal variables $V_t$ and $H_t$ are cointegrated.

17.2.2 Threshold cointegration analysis

In recent years, nonlinear cointegration has been increasingly used in price transmission studies. Among various developments of nonlinear cointegration, one branch is called threshold cointegration. The nonlinearity comes from two linear regressions combined, and the linear regressions are based on the above Engle-Granger linear cointegration approach. Thus, the threshold cointegration regression considered here is piecewise only and not smooth. Specifically, Enders and Siklos (2001) propose a two-regime threshold cointegration approach to entail asymmetric adjustment in cointegration analysis. This modifies Equation (17.2) such that:

$$\Delta \hat{\xi}_t = \rho_1 I_t \hat{\xi}_{t-1} + \rho_2 (1 - I_t)\, \hat{\xi}_{t-1} + \sum_{i=1}^{P} \varphi_i\, \Delta \hat{\xi}_{t-i} + \mu_t \qquad (17.3)$$

$$I_t = 1 \text{ if } \hat{\xi}_{t-1} \ge \tau, \; 0 \text{ otherwise} \qquad (17.4)$$

$$I_t = 1 \text{ if } \Delta \hat{\xi}_{t-1} \ge \tau, \; 0 \text{ otherwise} \qquad (17.5)$$

where $I_t$ is the Heaviside indicator, $P$ the number of lags, $\rho_1$, $\rho_2$, and $\varphi_i$ the coefficients, and $\tau$ the threshold value. The lag ($P$) is specified to account for serially correlated residuals and it can be similarly selected as in linear cointegration analysis. The Heaviside indicator $I_t$ can be specified with two alternative definitions of the threshold variable, either the lagged residual ($\hat{\xi}_{t-1}$) or the change of the lagged residual ($\Delta \hat{\xi}_{t-1}$). Equations (17.3) and (17.4) together have been referred to as the Threshold Autoregression (TAR) model, while Equations (17.3) and (17.5) are named as the Momentum Threshold Autoregression (MTAR) model. The threshold value $\tau$ can be specified as zero, or it can be estimated. Thus, a total of four models can be estimated. They are TAR with $\tau = 0$, consistent TAR with $\tau$ estimated, MTAR with $\tau = 0$, and consistent MTAR with $\tau$ estimated. In general, a model with the lowest AIC is deemed to be the most appropriate.

Insights into the asymmetric adjustment in the context of a long-term cointegration relation can be obtained with two tests. First, an F-test is employed to examine the null hypothesis of no cointegration ($H_0: \rho_1 = \rho_2 = 0$) against the alternative of cointegration with either TAR or MTAR threshold adjustment. The test statistic is represented by $\Phi$. This test does not follow a standard distribution and the critical values in Enders and Siklos (2001) should be used. The second one is a standard F-test to evaluate the null hypothesis of symmetric adjustment in the long-term equilibrium ($H_0: \rho_1 = \rho_2$). Rejection of the null hypothesis indicates the existence of an asymmetric adjustment process. Results from the two tests are the key outputs from threshold cointegration analysis.

The challenge of threshold cointegration analysis comes from estimating the threshold value of $\tau$. With a given value for $\tau$, Equation (17.3) is just a linear regression and it can be easily estimated by any software application, e.g., the lm() function in R. At present, the method by Chan (1993) has been widely followed to obtain a consistent estimate of the threshold value. A super consistent estimate of the threshold value can be attained with several steps. First, the process involves sorting in ascending order the threshold variable, i.e., $\hat{\xi}_{t-1}$ for the TAR model or $\Delta \hat{\xi}_{t-1}$ for the MTAR model. Second, the possible threshold values are determined.
If the threshold value is to be meaningful, the threshold variable must actually cross the threshold value (Enders, 2010). Thus, the threshold value $\tau$ should lie between the maximum and minimum values of the threshold variable. In practice, the highest and lowest 15% of the values are excluded from the search to ensure an adequate number of observations on each side. The middle 70% of the values of the sorted threshold variable are generally used as potential threshold values. The percentage can be higher if the total number of observations in a study is larger, e.g., 90% for 1,000 observations. Third, the TAR or MTAR model is estimated with each potential threshold value. The sum of squared errors for each trial can be calculated, and the relation between the sum of squared errors and the threshold value can be examined. Finally, the threshold value that minimizes the sum of squared errors is deemed to be the consistent estimate of the threshold.

17.2.3 Asymmetric error correction model

The Granger representation theorem (Engle and Granger, 1987) states that an error correction model can be estimated where all the variables in consideration are cointegrated. The specification assumes that the adjustment process due to disequilibrium among the variables is symmetric. Two extensions on the standard specification in the error correction model have been made for analyzing asymmetric price transmission. Granger and Lee (1989) first extend the specification to the case of asymmetric adjustments. Error correction terms and first differences on the variables are decomposed into positive and negative components. This allows detailed examinations of whether positive and negative price differences have asymmetric effects on the dynamic behavior of prices. The second extension follows the development of threshold cointegration (Enders and Granger, 1998). When the presence of threshold cointegration is validated, the error correction terms are modified further. The asymmetric error correction model with threshold cointegration in Sun (2011) is developed as follows:

$$\Delta H_t = \theta_H + \delta_H^{+} E_{t-1}^{+} + \delta_H^{-} E_{t-1}^{-} + \sum_{j=1}^{J} \alpha_{Hj}^{+} \Delta H_{t-j}^{+} + \sum_{j=1}^{J} \alpha_{Hj}^{-} \Delta H_{t-j}^{-} + \sum_{j=1}^{J} \beta_{Hj}^{+} \Delta V_{t-j}^{+} + \sum_{j=1}^{J} \beta_{Hj}^{-} \Delta V_{t-j}^{-} + \vartheta_{Ht} \qquad (17.6)$$

$$\Delta V_t = \theta_V + \delta_V^{+} E_{t-1}^{+} + \delta_V^{-} E_{t-1}^{-} + \sum_{j=1}^{J} \alpha_{Vj}^{+} \Delta H_{t-j}^{+} + \sum_{j=1}^{J} \alpha_{Vj}^{-} \Delta H_{t-j}^{-} + \sum_{j=1}^{J} \beta_{Vj}^{+} \Delta V_{t-j}^{+} + \sum_{j=1}^{J} \beta_{Vj}^{-} \Delta V_{t-j}^{-} + \vartheta_{Vt} \qquad (17.7)$$

where $\Delta H$ and $\Delta V$ are the import prices of China and Vietnam in first difference; $\theta$, $\delta$, $\alpha$, and $\beta$ are coefficients; and $\vartheta$ represents the error terms. The subscripts $H$ and $V$ differentiate the coefficients by country, $t$ denotes time, and $j$ represents lags. All the lagged price variables in first difference (i.e., $\Delta H_{t-j}$ and $\Delta V_{t-j}$) are split into positive and negative components, as indicated by the superscripts $+$ and $-$. For instance, $\Delta V_{t-1}^{+}$ is equal to $(V_{t-1} - V_{t-2})$ if $V_{t-1} > V_{t-2}$ and equal to 0 otherwise; $\Delta V_{t-1}^{-}$ is equal to $(V_{t-1} - V_{t-2})$ if $V_{t-1} < V_{t-2}$ and equal to 0 otherwise. The maximum lag $J$ is chosen with the AIC statistic and the Ljung-Box Q test so that the residuals have no serial correlation.

The error correction terms are the key component of the asymmetric error correction model. They are defined as $E_{t-1}^{+} = I_t \hat{\xi}_{t-1}$ and $E_{t-1}^{-} = (1 - I_t)\, \hat{\xi}_{t-1}$, and are a direct result from the above threshold cointegration regression.
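As a hedged sketch of these definitions, not the actual apt implementation, the following base R lines build the Heaviside indicator, the two error correction terms, and one pair of decomposed price differences from simulated series; all object names are made up for illustration.

set.seed(2)
res <- as.numeric(arima.sim(model = list(ar = 0.5), n = 100))  # stand-in for the residuals
tau <- 0                                   # threshold value (TAR with tau = 0 here)

# Heaviside indicator and the two error correction terms
lag.res <- res[-length(res)]               # lagged residual
It      <- as.numeric(lag.res >= tau)
ect.pos <- It * lag.res                    # E+ term
ect.neg <- (1 - It) * lag.res              # E- term

# Decomposing a price series in first difference into positive and negative parts
v  <- cumsum(rnorm(100))                   # stand-in for the Vietnamese price
dv <- diff(v)
dv.pos <- ifelse(dv > 0, dv, 0)            # positive component of the first difference
dv.neg <- ifelse(dv < 0, dv, 0)            # negative component of the first difference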
This definition of the error correction terms not only considers the possible asymmetric price responses to positive and negative shocks on the long-term equilibrium, but also incorporates the impact of threshold cointegration through the construction of the Heaviside indicator in Equations (17.4) and (17.5). The signs of the estimated coefficients can offer a first insight into the presence of asymmetric price behavior and can reveal the response of individual variables to the disequilibrium in previous periods. Note that the price of China is assumed to be the driving force, and the long-term disequilibrium is measured as the price spread between Vietnam and China. Thus, the expected signs for the error correction terms should be positive for China (i.e., $\delta_H^+ > 0$, $\delta_H^- > 0$) and negative for Vietnam (i.e., $\delta_V^+ < 0$, $\delta_V^- < 0$).

Single or joint hypotheses can be formally assessed. In this study, four types of hypotheses and F-tests are examined, as detailed in Frey and Manera (2007). The first one is the Granger causality test. Whether the Chinese price Granger causes its own price or the Vietnamese price can be tested by restricting all the Chinese prices to be zero ($H_{01}: \alpha_i^+ = \alpha_i^- = 0$ for all lags $i$ simultaneously). Similarly, the test can be applied to the Vietnamese price ($H_{02}: \beta_i^+ = \beta_i^- = 0$ for all lags). The second type of hypothesis is concerned with the distributed lag asymmetric effect. At the first lag, for instance, the null hypothesis is that the Chinese price has a symmetric effect on its own price or the Vietnamese price ($H_{03}: \alpha_1^+ = \alpha_1^-$). This can be repeated for each lag and both countries (e.g., $H_{04}: \beta_4^+ = \beta_4^-$). The third type of hypothesis is the cumulative asymmetric effect. The null hypothesis of a cumulative symmetric effect can be expressed as $H_{05}: \sum_{i=1}^{J} \alpha_i^+ = \sum_{i=1}^{J} \alpha_i^-$ for China and $H_{06}: \sum_{i=1}^{J} \beta_i^+ = \sum_{i=1}^{J} \beta_i^-$ for Vietnam. Finally, the equilibrium adjustment path asymmetry can be examined with the null hypothesis $H_{07}: \delta^+ = \delta^-$ for each equation estimated.

17.3 Needs for a new package

Estimating the statistical models described in the previous section is almost impossible by clicking pull-down menus in a statistical software application. Computer programming must be employed, and within R's language structure, new functions must be created. As the number of functions is relatively large and some of them need to be called repeatedly, it is also more efficient to wrap these new functions together in an R package. To reveal the need for new functions and packages, three particular aspects of the models employed in Sun (2011) are analyzed here.

The first challenge is to estimate the threshold cointegration model, as expressed in Equations (17.3) to (17.5) as a group. When the threshold value $\tau$ and a lag value $P$ are given, the variables in Equation (17.3) can be easily defined. Thus, the regression per se is a linear model and can be estimated by the lm() function. The problem is that the number of regressions is too large. Imagine that the total number of observations is 120 (e.g., monthly data for 10 years). If 70% of the residual values are used as the potential values for $\tau$, then the number is about 84. Furthermore, assume the potential value of $P$ can vary from 1 to 12. In combination, the number of regressions is 84 × 12 × 2 = 2,016 for the TAR and MTAR
17.3 Needs for a new package

Estimating the statistical models described in the previous section is almost impossible by clicking pull-down menus in a statistical software application. Computer programming must be employed, and within R's language structure, new functions must be created. As the number of functions is relatively large and some of them need to be called repeatedly, it is also more efficient to wrap these new functions together in an R package. To reveal the need for new functions and packages, three particular aspects of the models employed in Sun (2011) are analyzed here.

The first challenge is to estimate the threshold cointegration model, as expressed in Equations (17.3) to (17.5) as a group. When the threshold value $\tau$ and a lag value $P$ are given, the variables in Equation (17.3) can be easily defined. Thus, the regression per se is a linear model and can be estimated by the lm() function. The problem is that the number of regressions is too large. Imagine that the total number of observations is 120 (e.g., monthly data for 10 years). If 70% of the residual values are used as the potential values for $\tau$, then the number of candidates is about 84. Furthermore, assume the potential value of $P$ can vary from 1 to 12. In combination, the number of regressions is 84 × 12 × 2 = 2,016 for the TAR and MTAR specifications. At the end of each regression, the sum of squared errors and the threshold value should be documented. Note that a data set with 120 observations is pretty small. If the data set is a little larger (e.g., 500 observations or more), then the task quickly becomes unmanageable or extremely inefficient. The solution is to use flow control statements, such as if and for; a minimal sketch of such a search loop is given at the end of this section. Multiple looping statements can be nested within each other, and outputs from each loop can be selected and collected. This has been presented in Chapter 13 Flow Control Structure on page 269. Furthermore, as functions in R can divide a large programming job with interlinked components into small units, several new functions will be created in estimating the threshold cointegration model.

The second challenge is to estimate the asymmetric error correction model. The variables used in the regression need to be created with a given value of lag $J$. The number of variables on the right-hand side rises quickly with a larger value of $J$. Furthermore, the value of $J$ is unknown in advance, so the model needs to be estimated repeatedly with different values. Thus, the whole process includes selecting a lag value, composing variables, estimating the linear model, collecting regression outputs, and repeating these steps for each candidate value of $J$. Therefore, while the asymmetric error correction model is linear, the process can be very tedious and inefficient without programming.

The third challenge is hypothesis testing on the coefficients from the asymmetric error correction model. Many hypotheses can be formed and $F$-tests can be employed. Individually, they are easy to implement; collectively, the work is inefficient without programming. This is because whenever the value of lag $J$ changes, the number and positions of the coefficients from the regression change too. Again, using new functions and flow control structures in R can easily solve these problems.

The linkage between new functions and a package is worth a note here. When the number of functions created in a project is large, the need for and the marginal benefit of building a new package become significant. In fact, threshold cointegration analysis serves as a good example. Walter Enders has made great contributions in this area through his book and journal articles (e.g., Enders, 2010; Enders and Siklos, 2001). He also programmed the main components of threshold cointegration analysis in the commercial software RATS and distributed the code on the Internet. I have benefited from these sources in learning the method. However, RATS does not have the concept of a package or library as clearly defined as R does. As a result, the functions created in RATS have little formal documentation and are rather fragmented. In the following chapters, we will learn how to wrap a group of new functions into the apt package. The step from many new functions to a new package makes programming efficient and pleasant for everyone, including the package author.

In summary, conducting an empirical study with a sophisticated statistical model like threshold cointegration has become almost impossible without programming. New functions can be created and called to address recurring regressions. When the number of new functions is large, a new package can be used to document the linkage among them clearly, organize the R program for a project logically, and eventually, improve research productivity.
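Before turning to the actual program, here is a minimal sketch of the nested search described in the first challenge. It is illustrative only and is not the apt implementation (the package wraps this logic in ciTarThd() and ciTarLag()); it assumes the price objects prVi and prCh from Program 17.1 are available and, for brevity, it drops the augmentation lags and searches over the threshold only.

# Illustrative grid search for the TAR threshold; not the apt implementation.
xi  <- as.numeric(residuals(lm(prVi ~ prCh)))   # long-run residuals
dxi <- diff(xi)                                 # dependent variable
lxi <- xi[-length(xi)]                          # lagged residuals, aligned with dxi

cand <- sort(lxi)                               # candidate thresholds
cand <- cand[ceiling(0.15 * length(cand)):floor(0.85 * length(cand))]  # middle 70%

sse <- numeric(length(cand))
for (k in seq_along(cand)) {
  ind    <- as.numeric(lxi >= cand[k])          # Heaviside indicator
  pos    <- ind * lxi
  neg    <- (1 - ind) * lxi
  fit    <- lm(dxi ~ 0 + pos + neg)             # threshold regression, no lags
  sse[k] <- sum(residuals(fit)^2)
}
tau.hat <- cand[which.min(sse)]                 # threshold minimizing the SSE

A full version would add two outer loops, one over the lag value P and one over the TAR and MTAR specifications, and would collect the sum of squared errors from every trial, which is precisely the bookkeeping that the new functions in the apt package take care of.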
17.4 Program version for Sun (2011)

When the program for an empirical study is very long (e.g., 30 pages), it may be better to organize it through several documents. The R program for Sun (2011) is only five pages long, so splitting it may not be necessary in this particular case. Nevertheless, to demonstrate the benefits of splitting a long program, two R programs are presented below. One is for the main statistical analyses and tables. The other is used to generate three figures.

17.4.1 Program for tables

The main program is listed in Program 17.1 Main program version for generating tables in Sun (2011). It contains all the statistical analyses and can generate the four tables. Specifically, the data used in this study are pretty simple. There are four time series: import values and prices for China and Vietnam, each from January 2002 to January 2010. They are saved as the data object daVich in the apt library. The main steps in the program correspond to the study design in the proposal and the desired outputs in the manuscript. These include summary statistics (Table 1), Johansen cointegration tests (Table 2), threshold cointegration tests (Table 3), and the asymmetric error correction model (Table 4). As you read along the program, you will notice that a number of new functions have been created and wrapped together in the apt package. This is the focus of Part V Programming as a Contributor and will be elaborated gradually later on. At this point, it should be evident that the program version is well organized with the help of a new package.

Except for some minor format differences, the tables generated from this R program are highly similar to the final versions reported in Sun (2011). Some results in Table 3 as published in Sun (2011) were inaccurate because of a mistake made when the data were processed in 2009. The mistake was identified after the paper was published. For example, for the consistent MTAR, the coefficient for the positive term was reported as −0.251 (−2.130) in Sun (2011), but it should be −0.106 (−0.764), as calculated from the codes below. This is also explained on the help page of daVich. The main conclusions from all the analyses are still qualitatively the same.

A large portion of Program 17.1 has been distributed with the apt library as sample code. A number of users worldwide have raised a similar question to me in recent years. The question is simple from my perspective. However, as it occurs repeatedly from time to time, it is worth a note here. Briefly, the data used in Sun (2011) are just two single time series. It is tempting for another user to import two new data series into R, and then copy and run the sample program. Unfortunately, this will generate errors at various stages in the middle. This is because several key choices have to be made in Program 17.1, e.g., the lag and threshold values. These choices depend on the individual data. Thus, one cannot simply copy the whole R program for another data set.

Program 17.1 Main program version for generating tables in Sun (2011) 1 2 3 # Title: R Program for Sun (2011 FPE) library(apt); library(vars); setwd('C:/aErer') options(width = 100, stringsAsFactors = FALSE) 4 5 6 7 8 9 10 11 12 13 # ------------------------------------------------------------------------- # 1.
Data and summary statistics # Price data for China and Vietnam are saved as 'daVich' data(daVich); head(daVich); tail(daVich); str(daVich) prVi <- daVich[, 1]; prCh <- daVich[, 2] (dog <- t(bsStat(y = daVich, digits = c(3, 3)))) dog2 <- data.frame(item = rownames(dog), CH.level = dog[, 2], CH.diff = '__', VI.level = dog[, 1], VI.diff = '__')[2:6, ] rownames(dog2) <- 1:nrow(dog2); str(dog2); dog2 14 15 16 17 # ------------------------------------------------------------------------# 2. Unit root test (Table 1) ch.t1 <- ur.df(type = 'trend', lags = 3, y = prCh); slotNames(ch.t1) 403 17.4 Program version for Sun (2011) 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 ch.d1 <- ur.df(type = 'drift', lags = 3, ch.t2 <- ur.df(type = 'trend', lags = 3, ch.d2 <- ur.df(type = 'drift', lags = 3, vi.t1 <- ur.df(type = 'trend', lags = 12, vi.d1 <- ur.df(type = 'drift', lags = 11, vi.t2 <- ur.df(type = 'trend', lags = 10, vi.d2 <- ur.df(type = 'drift', lags = 10, dog2[6, ] <- c('ADF with trend', paste(round(ch.t1@teststat[1], digits = paste(round(ch.t2@teststat[1], digits = paste(round(vi.t1@teststat[1], digits = paste(round(vi.t2@teststat[1], digits = dog2[7, ] <- c('ADF with drift', paste(round(ch.d1@teststat[1], digits = paste(round(ch.d2@teststat[1], digits = paste(round(vi.d1@teststat[1], digits = paste(round(vi.d2@teststat[1], digits = (table.1 <- dog2) y y y y y y y = = = = = = = prCh) diff(prCh)) diff(prCh)) prVi) prVi) diff(prVi)) diff(prVi)) 3), 3), 3), 3), '[', '[', '[', '[', 3, 3, 12, 10, ']', ']', ']', ']', sep sep sep sep = = = = ''), ''), ''), '')) 3), 3), 3), 3), '[', '[', '[', '[', 3, 3, 11, 10, ']', ']', ']', ']', sep sep sep sep = = = = ''), ''), ''), '')) 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 # ------------------------------------------------------------------------# 3. 
Johansen-Juselius and Engle-Granger cointegration analyses # JJ cointegration VARselect(daVich, lag.max = 12, type = 'const') summary(VAR(daVich, type = 'const', p = 1)) K <- 5; two <- cbind(prVi, prCh) summary(j1 <- ca.jo(x = two, type = 'eigen', ecdet = 'trend', K = K)) summary(j2 <- ca.jo(x = two, type = 'eigen', ecdet = 'const', K = K)) summary(j3 <- ca.jo(x = two, type = 'eigen', ecdet = 'none' , K = K)) summary(j4 <- ca.jo(x = two, type = 'trace', ecdet = 'trend', K = K)) summary(j5 <- ca.jo(x = two, type = 'trace', ecdet = 'const', K = K)) summary(j6 <- ca.jo(x = two, type = 'trace', ecdet = 'none' , K = K)) slotNames(j1) out1 <- cbind('eigen', 'trend', K, round(j1@teststat, digits = 3), j1@cval) out2 <- cbind('eigen', 'const', K, round(j2@teststat, digits = 3), j2@cval) out3 <- cbind('eigen', 'none', K, round(j3@teststat, digits = 3), j3@cval) out4 <- cbind('trace', 'trend', K, round(j4@teststat, digits = 3), j4@cval) out5 <- cbind('trace', 'const', K, round(j5@teststat, digits = 3), j5@cval) out6 <- cbind('trace', 'none', K, round(j6@teststat, digits = 3), j6@cval) jjci <- rbind(out1, out2, out3, out4, out5, out6) colnames(jjci) <- c('test 1', 'test 2', 'lag', 'statistic', 'c.v 10%', 'c.v 5%', 'c.v 1%') rownames(jjci) <- 1:nrow(jjci) (table.2 <- data.frame(jjci)) 61 62 63 64 65 66 # EG cointegration LR <- lm(formula = prVi ~ prCh); summary(LR) (LR.coef <- round(summary(LR)$coefficients, digits = 3)) (ry <- ts(data = residuals(LR), start = start(prCh), end = end(prCh), frequency = 12)) 404 67 68 69 70 71 72 73 74 75 Chapter 17 Sample Study C and New R Packages eg <- ur.df(y = ry, type = c('none'), lags = 1) eg2 <- ur.df2(y = ry, type = c('none'), lags = 1) (eg4 <- Box.test(eg@res, lag = 4, type = 'Ljung') ) (eg8 <- Box.test(eg@res, lag = 8, type = 'Ljung') ) (eg12 <- Box.test(eg@res, lag = 12, type = 'Ljung')) EG.coef <- coefficients(eg@testreg)[1, 1] EG.tval <- coefficients(eg@testreg)[1, 3] (res.EG <- round(t(data.frame(EG.coef, EG.tval, eg2$aic, eg2$bic, eg4$p.value, eg8$p.value, eg12$p.value)), digits = 3)) 76 77 78 79 80 81 82 83 84 85 86 87 88 89 # ------------------------------------------------------------------------# 4. 
Threshold cointegration # best threshold test <- ciTarFit(y = prVi, x = prCh); test; names(test) t3 <- ciTarThd(y = prVi, x = prCh, model = 'tar', lag = 0); plot(t3) time.org <- proc.time() (th.tar <- t3$basic) for (i in 1:12) { # about 20 seconds t3a <- ciTarThd(y = prVi, x = prCh, model = 'tar', lag = i) th.tar[i+2] <- t3a$basic[, 2] } th.tar time.org - proc.time() 90 91 92 93 94 95 96 97 t4 <- ciTarThd(y = prVi, x = prCh, model = 'mtar', lag = 0) (th.mtar <- t4$basic); plot(t4) for (i in 1:12) { # about 36 seconds t4a <- ciTarThd(y = prVi, x = prCh, model = 'mtar', lag = i) th.mtar[i+2] <- t4a$basic[,2] } th.mtar 98 99 100 t.tar <- -8.041; t.mtar <- -0.451 # t.tar <- -8.701 ; t.mtar <- -0.451 # lag = 0 to 4; final choices # lag = 5 to 12 101 102 103 104 105 106 107 mx <- 12 # lag selection (g1 <-ciTarLag(y=prVi, x=prCh, (g2 <-ciTarLag(y=prVi, x=prCh, (g3 <-ciTarLag(y=prVi, x=prCh, (g4 <-ciTarLag(y=prVi, x=prCh, plot(g1) model='tar', model='mtar', model='tar', model='mtar', maxlag maxlag maxlag maxlag = = = = mx, mx, mx, mx, thresh thresh thresh thresh = = = = 0)) 0)) t.tar)) t.mtar)) 108 109 110 111 # Figure of threshold selection: mtar at lag = 3 (Figure 3 data) (t5 <- ciTarThd(y=prVi, x=prCh, model = 'mtar', lag = 3, th.range = 0.15)) plot(t5) 112 113 114 115 # Table 3 Results of EG and threshold cointegration combined vv <- 3 (f1 <- ciTarFit(y=prVi, x=prCh, model = 'tar', lag = vv, thresh = 0)) 17.4 Program version for Sun (2011) 116 117 118 405 (f2 <- ciTarFit(y=prVi, x=prCh, model = 'tar', lag = vv, thresh = t.tar )) (f3 <- ciTarFit(y=prVi, x=prCh, model = 'mtar', lag = vv, thresh = 0)) (f4 <- ciTarFit(y=prVi, x=prCh, model = 'mtar', lag = vv, thresh = t.mtar)) 119 120 121 122 123 r0 <- cbind(summary(f1)$dia, summary(f2)$dia, summary(f3)$dia, summary(f4)$dia) diag <- r0[c(1:4, 6:7, 12:14, 8, 9, 11), c(1, 2, 4, 6, 8)] rownames(diag) <- 1:nrow(diag); diag 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 e1 <- summary(f1)$out; e2 <- summary(f2)$out e3 <- summary(f3)$out; e4 <- summary(f4)$out; rbind(e1, e2, e3, e4) ee <- list(e1, e2, e3, e4); vect <- NULL for (i in 1:4) { ef <- data.frame(ee[i]) vect2 <- c(paste(ef[3, 'estimate'], ef[3, 'sign'], sep = ''), paste('(', ef[3, 't.value'], ')', sep = ''), paste(ef[4, 'estimate'], ef[4, 'sign'], sep = ''), paste('(', ef[4, 't.value'], ')', sep = '')) vect <- cbind(vect, vect2) } item <- c('pos.coeff','pos.t.value', 'neg.coeff','neg.t.value') ve <- data.frame(cbind(item, vect)); colnames(ve) <- colnames(diag) (res.CI <- rbind(diag, ve)[c(1:2, 13:16, 3:12), ]) rownames(res.CI) <- 1:nrow(res.CI) res.CI$Engle <- '__' res.CI[c(3, 4, 9:13), 'Engle'] <- res.EG[, 1] res.CI[4, 6] <- paste('(', res.CI[4, 6], ')', sep = '') (table.3 <- res.CI[, c(1, 6, 2:5)]) 144 145 146 147 148 149 150 151 152 153 154 155 156 # ------------------------------------------------------------------------# 5. 
Asymmstric error correction model (sem <- ecmSymFit(y = prVi, x = prCh, lag = 4)); names(sem) (aem <- ecmAsyFit(y = prVi, x = prCh, lag = 4, model = 'mtar', split = TRUE, thresh = t.mtar)) (ccc <- summary(aem)) coe <- cbind(as.character(ccc[1:19, 2]), paste(ccc[1:19, 'estimate'], ccc$signif[1:19], sep = ''), ccc[1:19, 't.value'], paste(ccc[20:38, 'estimate'], ccc$signif[20:38],sep = ''), ccc[20:38, 't.value']) colnames(coe) <- c('item', 'CH.est', 'CH.t', 'VI.est','VI.t') 157 158 159 160 161 162 163 164 (edia <- ecmDiag(aem, 3)); (ed <- edia[c(1, 6:9), ]) ed2 <- cbind(ed[, 1:2], '_', ed[, 3], '_'); colnames(ed2) <- colnames(coe) (tes <- ecmAsyTest(aem)$out); (tes2 <- tes[c(2, 3, 5, 11:13, 1), -1]) tes3 <- cbind(as.character(tes2[, 1]), paste(tes2[, 2], tes2[, 6], sep = ''), paste('[', round(tes2[, 4], digits = 2), ']', sep = ''), paste(tes2[, 3], tes2[, 7], sep = ''), 406 165 166 167 Chapter 17 Sample Study C and New R Packages paste('[', round(tes2[, 5], digits = 2), ']', sep = '')) colnames(tes3) <- colnames(coe) (table.4 <- data.frame(rbind(coe, ed2, tes3))) 168 169 170 171 172 # ------------------------------------------------------------------------# 6. Output (output <- listn(table.1, table.2, table.3, table.4)) write.list(z = output, file = 'OutBedTable.csv') Note: Major functions used in Program 17.1 are: ur.df(), ca.jo(), VAR(), ciTarThd(), ciTarLag(), ciTarFit(), ecmSymFit(), ecmAsyFit(), ecmDiag(), bsStat(), Box.test(), and lm(). # Selected results from Program 17.1 > table.1 item CH.level CH.diff VI.level VI.diff 1 mean 148.791 __ 115.526 __ 2 stde 11.461 __ 9.882 __ 3 mini 119.618 __ 99.335 __ 4 maxi 177.675 __ 150.721 __ 5 obno 97 __ 97 __ 6 ADF with trend -2.956[3] -7.394[3] -2.936[12] -5.777[10] 7 ADF with drift -2.422[3] -7.195[3] -1.161[11] -5.74[10] > table.2 test.1 test.2 lag statistic c.v.10. c.v.5. c.v.1. 1 eigen trend 5 10.001 10.49 12.25 16.26 2 eigen trend 5 20.253 16.85 18.96 23.65 3 eigen const 5 4.461 7.52 9.24 12.97 4 eigen const 5 14.304 13.75 15.67 20.2 5 eigen none 5 4.438 6.5 8.18 11.65 6 eigen none 5 14.3 12.91 14.9 19.19 7 trace trend 5 10.001 10.49 12.25 16.26 8 trace trend 5 30.254 22.76 25.32 30.45 9 trace const 5 4.461 7.52 9.24 12.97 10 trace const 5 18.765 17.85 19.96 24.6 11 trace none 5 4.438 6.5 8.18 11.65 12 trace none 5 18.738 15.66 17.95 23.52 > table.3 1 2 3 4 5 6 7 8 9 item Engle tar c.tar mtar c.mtar lag __ 3 3 3 3 thresh __ 0 -8.041 0 -0.451 pos.coeff -0.407 -0.328** -0.28** -0.116 -0.106 pos.t.value (-4.173) (-2.523) (-2.306) (-0.824) (-0.764) neg.coeff __ -0.515*** -0.721*** -0.658*** -0.677*** neg.t.value __ (-3.119) (-3.942) (-4.754) (-4.888) total obs __ 97 97 97 97 coint obs __ 93 93 93 93 aic 669.627 658.998 654.863 650.612 649.495 407 17.4 Program version for Sun (2011) 10 bic 11 LB test(4) 12 LB test(8) 13 LB test(12) 14 H1: no CI 15 H2: no APT 16 H2: p.value 677.351 0.773 0.919 0.239 __ __ __ 674.193 0.961 0.992 0.122 6.539 1.033 0.312 > table.4[1:7, ] 1 2 3 4 5 6 7 670.059 0.879 0.964 0.084 8.836 5.081 0.027 item CH.est CH.t VI.est (Intercept) -0.146 -0.052 -3.853* X.diff.prCh.t_1.pos -0.622*** -2.755 -0.155 X.diff.prCh.t_2.pos 0.082 0.344 -0.144 X.diff.prCh.t_3.pos -0.282 -1.264 0.146 X.diff.prCh.t_4.pos -0.324 -1.403 -0.193 X.diff.prCh.t_1.neg -0.314. 
-1.464 -0.105 X.diff.prCh.t_2.neg -0.584*** -2.651 0.085 665.808 0.988 0.999 0.289 11.307 9.435 0.003 17.4.2 664.69 0.987 0.998 0.333 11.976 10.612 0.002 VI.t -1.777 -0.897 -0.795 0.854 -1.091 -0.641 0.508 Program for figures The three figures reported in Sun (2011) can be created by base R graphics or the ggplot2 package. These codes for graphs are organized separately as a document to increase readability, and they are all presented the following R program. When the codes for figure generation are long, this can make the main program more concise. There are several ways to connect individual programs for a specific empirical study. First, the main program can be called by the source() function and all data will become available for another program. Alternatively, if it takes a long time to run the main program each time or the data used in another program is small, then the relevant data can be copied or generated directly. This is exactly true for the relation between the two programs here. In general, figures use fewer data than statistical analyses. A threshold cointegration analysis often takes quite some time to finish. Thus, at the beginning of Program 17.2, the value data for Figure 1, price data for Figure 2, and sum of squared errors for Figure 3 are generated directly, without calling the main program. Figure 17.1 is generated from traditional graphics system, and Figure 17.2 is from ggplot2. The main difference is that the ggplot version has a gray background and grid lines. Which version is more attractive is largely a personal choice. The codes used for the ggplot version is generally longer than these for the base R version. One can also customize the appearance of the ggplot version and make it very similar to the version from base R. This is left as Exercise 17.6.1 on page 411. In Sun (2011), Figure 1 is monthly import values for China and Vietnam, and Figure 2 is their monthly import prices. Both the figures can be created with the ggplot2 package. Recall that %+% is defined in ggplot2 to replace one data frame with another one. It is tempting to use this operator to generate Figure 2 with a substitution of the underlying data frame. However, the value and price data are quite different in scale. As a result, it is faster in this case to copy all the codes for Figure 1 and then revise them for Figure 2. Program 17.2 Graph program version for generating figures in Sun (2011) 1 2 # Title: Graph codes for Sun (2011 FPE) library(apt); library(ggplot2); setwd('C:/aErer'); data(daVich) 408 Chapter 17 Sample Study C and New R Packages Montly import value($ million) 60 China Vietnam 40 20 0 2002 2003 2004 2005 2006 2007 2008 2009 2010 Figure 17.1 Monthly import value of beds from China and Vietnam (base R) 3 4 5 6 7 8 9 10 11 # ------------------------------------------------------------------------# A. Data for graphs: value, price, and t5$path prVi <- daVich[, 1]; prCh <- daVich[, 2] vaVi <- daVich[, 3]; vaCh <- daVich[, 4] (date <- as.Date(time(daVich), format = '%Y/%m/%d')) (value <- data.frame(date, vaCh, vaVi)) (price <- data.frame(date, prVi, prCh)) (t5 <- ciTarThd(y=prVi, x=prCh, model = 'mtar', lag = 3, th.range = 0.15)) 12 13 14 15 16 17 18 19 20 21 22 23 24 25 # ------------------------------------------------------------------------# B. 
Traditonal graphics # Figure 1 Import values from China and Vietnam win.graph(width = 5, height = 2.8, pointsize = 9); bringToTop(stay = TRUE) par(mai = c(0.4, 0.5, 0.1, 0.1), mgp = c(2, 1, 0), family = "serif") plot(x = vaCh, lty = 1, lwd = 1, ylim = c(0, 60), xlab = '', ylab = 'Montly import value($ million)', axes = FALSE) box(); axis(side = 1, at = 2002:2010) axis(side = 2, at = c(0, 20, 40, 60), las = 1) lines(x = vaVi, lty = 4, lwd = 1) legend(x = 2008.1, y = 59, legend = c('China', 'Vietnam'), lty = c(1, 4), box.lty = 0) fig1.base <- recordPlot() 26 27 28 29 30 31 # Figure 2 Import prices from China and Vietnam win.graph(width = 5, height = 2.8, pointsize = 9) par(mai = c(0.4, 0.5, 0.1, 0.1), mgp = c(2, 1, 0), family = "serif") plot(x = prCh, lty = 1, type = 'l', lwd = 1, ylim = range(prCh, prVi), xlab = '', ylab = 'Monthly import price ($/piece)' ) 409 17.4 Program version for Sun (2011) Monthly import value ($ million) 60 China Vietnam 40 20 0 2002 2003 2004 2005 2006 2007 2008 2009 2010 Figure 17.2 Monthly import value of beds from China and Vietnam (ggplot2) 32 33 34 lines(x = prVi, lty = 3, type = 'l', lwd = 1) legend(x = 2008.5, y = 175, legend = c('China', 'Vietnam'), lty = c(1, 3), box.lty = 0) 35 36 37 38 39 40 # Figure 3 Sum of dquared errors by threshold value from MTAR win.graph(width = 5.1, height = 3.3, pointsize = 9) par(mai = c(0.5, 0.5, 0.1, 0.1), mgp = c(2.2, 1, 0), family = "serif") plot(formula = path.sse ~ path.thr, data = t5$path, type = 'l', ylab = 'Sum of Squared Errors', xlab = 'Threshold value') 41 42 43 44 45 46 47 48 49 # ------------------------------------------------------------------------# C. ggplot for three figures pp <- theme(axis.text = element_text(size = 8, family = "serif")) + theme(axis.title = element_text(size = 9, family = "serif")) + theme(legend.text = element_text(size = 9, family = "serif")) + theme(legend.position = c(0.85, 0.9) ) + theme(legend.key = element_rect(fill = 'white', color = NA)) + theme(legend.background = element_rect(fill = NA, color = NA)) 50 51 52 53 54 55 56 57 58 fig1 <- ggplot(data = value, aes(x = date)) + geom_line(aes(y = vaCh, linetype = 'China')) + geom_line(aes(y = vaVi, linetype = 'Vietnam')) + scale_linetype_manual(name = '', values = c(1, 3)) + scale_x_date(name = '', labels = as.character(2002:2010), breaks = as.Date(paste(2002:2010, '-1-1', sep = ''), format = '%Y-%m-%d')) + scale_y_continuous(limits = c(0, 60), name = 'Monthly import value ($ million)') + pp 59 60 fig2 <- ggplot(data = price, aes(x = date)) + 410 61 62 63 64 65 66 67 Chapter 17 Sample Study C and New R Packages geom_line(aes(y = prCh, linetype = 'China')) + geom_line(aes(y = prVi, linetype = 'Vietnam')) + scale_linetype_manual(name = '', values = c(1, 3))+ scale_x_date(name = '', labels = as.character(2002:2010), breaks = as.Date(paste(2002:2010, '-1-1', sep = ''), format = '%Y-%m-%d')) + scale_y_continuous(limits = c(98, 180), name = 'Monthly import price ($/piece)') + pp 68 69 70 71 72 73 74 75 fig3 <- ggplot(data = t5$path) + geom_line(aes(x = path.thr, y = path.sse)) + labs(x = 'Threshold value', y = 'Sum of squared errors') + scale_y_continuous(limits = c(5000, 5700)) + scale_x_continuous(breaks = c(-10:7)) + theme(axis.text = element_text(size = 8, family = "serif")) + theme(axis.title = element_text(size = 9, family = "serif")) 76 77 78 79 80 # ------------------------------------------------------------------------# D. 
Show on screen devices or save on file devices pdf(file = 'OutBedFig1base.pdf', width = 5, height = 2.8, pointsize = 9) replayPlot(fig1.base); dev.off() 81 82 83 84 windows(width = 5, height = 2.8); fig1 windows(width = 5, height = 2.8); fig2 windows(width = 5, height = 2.8); fig3 85 86 87 88 ggsave(fig1, filename = 'OutBedFig1ggplot.pdf', width = 5, height = 2.8) ggsave(fig2, filename = 'OutBedFig2ggplot.pdf', width = 5, height = 2.8) ggsave(fig3, filename = 'OutBedFig3ggplot.pdf', width = 5, height = 2.8) 17.5 Road map: developing a package and GUI (Part V) Two large parts for R programming have been presented so far in this book. In Part III Programming as a Beginner, basic R concepts and data manipulations are elaborated. Using predefined functions for specific analyses is emphasized. In Part IV Programming as a Wrapper, the structure of an R function is examined and how to write new functions is demonstrated through various applications. Assuming you have learned these techniques well, we now reach the final stage of the growing-up process: creating a new package for a statistical model or research issue. In general, the materials in the part for beginner are more difficult than these in the part for wrapper. The current materials in Part V Programming as a Contributor are probably the easiest. The main challenge for creating a new package is to design the structure and put appropriate contents inside the folders. This is covered in Chapter 18 Contents of a New Package. Once the contents for a new package are finalized, the procedure of building up the package is straightforward, and it takes no more than a few days to learn it. This is covered in Chapter 19 Procedures for a New Package. It is possible to transform an R package into a graphical user interface (GUI). The decision of building a graphical user interface is related to the associated benefit and cost. The benefit of GUIs includes a more intuitive appearance and low requirement on user’s 17.6 Exercises 411 programming skills. The cost of this extra step is that package authors will need to learn new commands to develop an application with a clear interface. If R is selected as the language in developing a GUI, then a programmer should have a solid understanding of R. The basics of developing graphical user interfaces in R are presented in Chapter 20 Graphical User Interfaces. With a good knowledge base, one just needs to learn a few new concepts related to GUIs and a few more packages in R. In the apt package, its core functions are programmed into a GUI. This demonstrates well the growing process with R from preparing individual functions, to a new package, and finally to a graphical user interface. 17.6 Exercises 17.6.1 Customize Figure 1 in Sun (2011) by ggplot2. In Program 17.2 Graph program version for generating figures in Sun (2011) on page 407, Figure 1 is generated by base R graphics and ggplot2 separately. Customize the version by ggplot2 so its appearance looks like the version from base R graphics. This may like a trivial exercise, but it will let you learn more about ggplot2. 17.6.2 Analyze two empirical studies for a similar issue. The purpose of this exercise is to learn and compare design techniques for several studies in the same area, similar to the relation between Wan et al. (2010a) and Sun (2011). Recall that in Exercise 3.6.2 on page 41, one empirical study has been selected. This selected study can be one of the sample studies (i.e., Sun, 2006a,b; Sun and Liao, 2011), or one from the literature. 
For this exercise, find another empirical study in the literature that is closely related to the research issue covered in the selected study. Read and compare the objectives, methods, and other aspects of the two related studies, with an emphasis on the linkage between them.

Chapter 18 Contents of a New Package

What is an R package? A package in R is a collection of documents that follow certain format requirements (Adler, 2010). Thus, the definition has two keywords: content and format. The purpose of a package is to provide extra features, usually extending base R in a particular direction. In conducting a specific research project, researchers often generate a number of new functions and data sets and then save them in one or several folders on a local drive. Together, these documents are similar to a package, or at least to an early stage of package development. However, these documents per se are not a package yet because, without some extra effort, they do not conform to the format required by R. The format requirements will be detailed in the next chapter. The documents prepared for a package can be either functions or data sets, and they are the focus of this chapter. The apt package, related to asymmetric price transmission and Sun (2011), will be used as an example to elaborate how to organize the content of a package.

Errors are likely to occur in the process of creating a new function. The probability of running into errors is likely higher when many functions are designed and created for a new package. Thus, the topic of debugging is relevant to Chapter 15 How to Write a Function on page 312, but it is even more important to package creators. R has a number of tools that can be used for debugging. These tools and additional R features for time and memory management are covered in this chapter.

18.1 The decision of a new package

After one gains experience in using R for a few projects, it is likely that creating a package will come to one's mind. This is a natural step or stage because R encourages a user to become a developer gradually. A good understanding of the benefits and of a research area is needed in making the decision about a new package.

18.1.1 Costs and benefits

There is always a benefit-cost question for any human activity. The first main cost associated with new package creation in R is learning how to build an R package for the first time. The formats and procedures for a new package can be intimidating, and the learning curve can be steep for some users. Furthermore, an extra time investment is also needed in building a specific package every time. The final cost is that if a package is shared with others publicly or privately, various questions can arise and the developer may need to address these questions periodically.

Once a package is built successfully, it can be used by the developer only, by a few collaborators (e.g., in a corporate environment or laboratory), or by many users through a public distribution on the R Web site. Each scenario has some common or unique benefits. These benefits are briefly described below, ranging from more private to more public.

The first major benefit of compiling a group of documents into a new package is an efficiency gain in organization. Without a package format, new functions, data sets, and explanations related to a research topic are often fragmented and scattered on a local drive.
The package format requirement imposes minimum standards for documentation and consistency between code and documentation. When a package is first built, a number of R functions are available to examine the formats. In submitting a package to the CRAN site, additional checks beyond those required to get the package up and running are conducted further. At the end of a large research project, it is always a pleasure to compile many unorganized documents on a local drive into a package with clear documentation and connection. Thus, by my own experience, even if a researcher has no plan to share newly developed functions with others, just the efficiency gain in organization should be sufficient to pack them into a package. Building a new package can improve ones programming quality. The quality can be improved in designing and adding contents into a package. It can also be improved in following the format requirements during the process of building an R package. In addition, sharing a package with others, especially through the CRAN site publicly, can provide opportunities for testing the functions widely, identifying possible inaccuracies and errors, getting feedback to revise the functions, or adding more functions for areas inspired by questions. Compiling a set of documents into a package can provide great convenience of use. In fact, many functions included in base R are actually in packages too, e.g., lm() in the stats package. We know how convenient to load and use them in R. Developing a new package for these functions from a specific project allows the same convenience. When an ongoing project is finished and a new package is built, the program version for the project can become very concise after some reorganization. If the package is shared with others (privately or publicly), it provides a simple interface for others to access functions and data sets conveniently, just like any other packages in base R. Developing packages can result in a deep understanding of R language. Commercial software products intentionally differentiate developers from users because they need to develop new features to make money; they discourage users to extend their software products. R is an open-source software application that encourages users to gradually become developers. Therefore, it is a natural growing path from using predefined functions in R (Part III Programming as a Beginner), to writing new functions (Part IV Programming as a Wrapper), and to writing new packages (Part V Programming as a Contributor). We have learned how to estimate a linear model by ordinary least square through a user-defined function and know how inspiring the application has been. See Program 15.7 A new function for ordinary least square with the S4 mechanism on page 336 for detail. By the same logic, developing a new package will allow users to gain a much deeper understanding of the structure of R as a computer language, and furthermore, to become more efficient in using R packages and the language. The final benefit of building new packages is to help others in a world that has become increasingly interdependent. In using predefined functions (Part III Programming as a Beginner), we are dependent on others’ work. In writing new functions (Part IV Programming Chapter 19 Procedures for a New Package T he last few chapters in a big book are often more difficult to understand. Fortunately, this is not the case here. 
In Chapter 18 Contents of a New Package, how to prepare the content of an R package in the form of functions and data sets is elaborated. In this chapter, the procedural requirements for building a new package are presented, with Microsoft Windows as the operating system. The format and procedure requirements for a new package can together be intimidating to beginners, but they are much easier than writing new R functions. In my experience, it takes less than three days the first time through self-study. With the materials in this chapter, I hope the time needed is even shorter. Once the initial investment is made, building another new package of small to moderate size (e.g., the apt library) should consume only a few hours.

19.1 An overview of procedures

A package in R is a collection of documents that follow certain format requirements. Assume that one has decided to create a new package for a specific research issue or model. A group of R functions and data sets have been prepared and organized on a local drive. Now, the final task in building up a package is to reorganize and format all the documents into a package. To facilitate presentation and learning, this process is divided into three main stages: skeleton, compilation, and distribution.

In the skeleton stage, function and data documents are organized in several folders on a local drive with special folder names and formats. In particular, a set of help documents is created to explain the functions and data sets. The size is usually about half a page to one page for each individual R function. Thus, depending on the number of functions included in a package, these help documents may take some effort to prepare. In the compilation stage, several new software applications need to be installed on a computer under a Microsoft Windows operating system. A command prompt window is opened and the package is built there. In the distribution stage, the final package can be submitted to the CRAN site for public sharing or sent to colleagues for private sharing.

In terms of time, the skeleton stage may take a few hours for a moderate package or even days for a large package. The compilation stage can be finished in half an hour if there is no error in the content of a package, but it can take much more time if the package content and help files need to be revised repeatedly. The distribution stage is the easiest one and can usually be finished in a few minutes. Sharing a package through the CRAN site will go through additional checks and evaluation. That can generate new errors or warning messages, and the package may need to be revised and then resubmitted.

19.2 Skeleton stage

After R functions and data sets have been prepared, they still need to be organized under specific folders. These folders have particular names and relations. In addition, help files for each function and data set need to be created. Finally, a number of special documents also need to be constructed, which requires a solid understanding of R namespaces.

19.2.1 The folders

Documents to be included in a package should be organized in several folders. The names and formats of these folders should be followed strictly. The folder structure can be created and learned through several approaches: copying one from an existing package; generating one within the R console (a small sketch of this approach is shown below); and creating one manually on a local drive.
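As a preview of the second approach, the fragment below calls package.skeleton() from base R (the utils package) to generate the folder structure from objects in the current session. The package name, objects, and path here are hypothetical placeholders, not the ones used for the apt or erer packages.

# Hypothetical example: create a package skeleton from two objects in memory.
myFun <- function(x) mean(x, na.rm = TRUE)                # a trivial function
myDat <- data.frame(year = 2002:2010, price = rnorm(9))   # a small data set

package.skeleton(name = "mypkg", list = c("myFun", "myDat"), path = "C:/aErer")
# This creates C:/aErer/mypkg with R/, man/, and data/ folders, a DESCRIPTION
# file, and related documents that can then be edited to meet the format rules.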
This is demonstrated briefly in Program 19.1 Three approaches to creating the skeleton of a new package. If you have never created an R package before, then the most appealing approach is to copy an existing package listed on the CRAN site. All publicly available packages have been examined by developers and the CRAN site maintainer. Thus, they are all great examples to learn from. For each package (e.g., the apt library), several documents are available on the CRAN site: a reference manual (apt.pdf), vignettes (e.g., apt manual or any name a developer likes; this is optional), package source (apt_2.3.tar.gz), Windows binary (apt_2.3.zip), OS X binary (apt_2.3.tgz), and old sources. In particular, the reference manual in PDF is currently available on the CRAN site only. Installing a package on a computer does not download this PDF help file to a local computer. Instead, a HTML version of the help file is installed on a computer. The source file in the format of tar.gz has the same content as the developer has on his or her own computer. When a package is compiled, all comments in the source version are removed. Thus, to view any comments or other information in the package, the source version (apt_2.3.tar.gz) needs to be downloaded. All these documents on the CRAN site can be downloaded manually one by one. Sometimes, one may prefer to download these documents on the CRAN site from an R console session directly. For such a need, the download.packages() function in R can download a Windows binary version directly for a computer under a Windows operating system, e.g., apt_2.3.zip. The download.file() function is more powerful and flexible in downloading any file from the Internet. The erer library contains a wrapper function named as download.lib() that can download the zip or tar.gz version of a package and also its PDF manual. In Program 19.1, this function is used to download the source version and reference manual in PDF for two packages, i.e., erer and apt. Note the version of a package can change over time. Thus, inside the download.lib() function, the available.packages() function is used to reveal and extract information of current packages on the CRAN site, and then the download.file() function is applied to download specific documents. Once the source version is downloaded, it needs to be unzipped from tar.gz to tar. Many tools can be used to unzip a file, including the free software of 7-zip (www.7-zip.org). After all the folders in the example package can be opened and viewed normally, they can be copied to a new directory, the contents can be replaced by new documents intended for a new package, and the skeleton (i.e., name and format) can be kept at the end. As the second approach to creating a folder structure, one can also create the skeleton of a new package from an R session directly with the package.skeleton() function. This Chapter 20 Graphical User Interfaces T wo chapters about R graphics have been presented so far in the book. They are Chapter 11 Base R Graphics at the end of Part III Programming as a Beginner, and then Chapter 16 Advanced Graphics at the end of Part IV Programming as a Wrapper. Now at the end of Part V Programming as a Contributor, a new chapter about graphical user interfaces (GUIs) is included. Essentially, this topic is still within the scope of R graphics, and it is closely related to these functions covered in several previous chapters about R graphics. However, the task is different and a number of new concepts are introduced. 
Many approaches exist for developing GUIs. A large number of contributed packages have become available in recent years. They have been growing constantly and changing rapidly, and the trend will likely continue for a while. We choose the gWidgets package to illustrate the rationale and demonstrate how a GUI can be created with some moderate effort. At the end, two applications are presented. One is about the correlation between random variables, and the other is for threshold cointegration analyses with the apt package. 20.1 Transition from base R graphics to GUIs Graphics user interfaces are in our daily lives. GUIs allow users to interact with electronic devices (e.g., a computer, a cell phone, or the control panel of a gas pump) through graphical icons and visual indicators (e.g., a button or message). In contrast, command-line interfaces (CLIs) require commands to be composed and submitted through a keyboard. The benefits of GUIs are obvious. In general, they provide intuitive looks for straightforward tasks. For a personal use, both adults and young kids can play all kinds of games on tablet computers. Furthermore, GUIs also have values for education and professionals. Teaching demos can be created to engage undergraduate students in learning new subjects. GUIs can also be beneficial to professionals that have limited interests or skills in programming. For simple work, GUIs can be more productive than CLIs. The cost of GUIs can be looked from the perspective of either production or consumption. Every GUI is created by a person with strong programming skills. Computer science has gained rapid growth since the 1990s, and many jobs titled as software engineers have become available. A number of computer languages have been popular (e.g., C, Java, S, and R). A GUI may be simple to use, but the creation is more demanding. For example, the software of Microsoft Office has an intuitive interface, which is the output of many programmers. Empirical Research in Economics: Growing up with R Copyright © 2015 by Changyou Sun 446 20.1 Transition from base R graphics to GUIs 447 Figure 20.1 A static view of chess board created by base R graphics The cost of GUIs for consumers is less obvious. In general, GUIs are less efficient to use than command-line interfaces for complicated work. A large task may need users to click numerous buttons and make a large number of choices. As most consumers cannot write a computer program, they do not understand the loss of the productivity associated with GUIs. For empirical research in economics, this is especially true, as discussed in Section 7.3 Estimating a binary choice model like a clickor on page 110. With all the benefits and costs in mind, the main goal of this chapter is to develop a GUI for the apt package as a teaching demo. We have been building our knowledge base in R and growing up gradually throughout the book. Thus, before we work on GUIs intensively, it is important to understand what we have learned and what new tools we need if the R language is used for creating such a demo. To help understand the transition from base R graphics to GUIs, a simple example is utilized through Program 20.1 Creating a chess board from base R graphics. Base R graphics is employed here, while other packages such as grid() also can achieve the same effect. Specifically, on the whole graphics device, outer and figure margins, axes, and labels are compressed. Graphical components are delivered to the plotting region. 
A total of 64 rectangular cells are drawn with two nested loops. The color of each cell is determined by the inherent pattern. If the sum of coordinate values for the start point is an event number (e.g., x + y = 1 + 3 = 4), then the color is dark gray; otherwise, it is light gray (or almost white on the screen). Four circles with black and white colors are used to represent chess pieces here. The final output is presented in Figure 20.1 A static view of chess board created by base R graphics. While the R commands are simple, the picture looks very much like a chess board. Thus, to a certain degree, base R graphics is pretty efficient in delivering what we want. In fact, the functionality in base R graphics has been the foundation of many new packages for GUIs. As it stands now, the problem of Figure 20.1 is that it is a static view. It does not allow a user to click on the board and interact with a computer. Thus, the dynamic exchange of information between users and computers in a typical GUI is not available. Unfortunately, the predefined functions in base R cannot generate such an interactive effect. Thus, new functions or tools are needed for these new features associated with GUIs. Finally, a question is why R for a GUI, not other languages? What is the advantage in 448 Chapter 20 Graphical User Interfaces employing R to develop a GUI? The choice of language for GUI development can be affected by many factors, e.g., the nature of a GUI. R is strong in statistical analyses and graphics. Thus, if a GUI needs a large amount of computation, then R may have some advantages over other languages. In addition, R is flexible in handling many tasks and integrating them together. In summary, GUIs have some benefits in certain circumstances, which is true not only for leisure use but also for business and scientific research. We have learned various aspects of R as a language in previous chapters. Thus, developing a GUI through R is just one extra mile on what we have gone so far. A simple test of your knowledge base is whether you can understand Program 20.1 in a few minutes without running the program. If not, then you probably need to read Chapter 11 Base R Graphics and Chapter 16 Advanced Graphics first. If yes, then you are ready to move forward. Program 20.1 Creating a chess board from base R graphics 1 2 3 4 # A. Window device for a chess board win.graph(width = 3, height = 3) bringToTop(stay = TRUE) par(mai = c(0.1, 0.1, 0.1, 0.1)) 5 6 7 8 9 10 11 12 13 14 15 16 # B. Draw cells with different colors plot(x = 1:9, y = 1:9, type = "n", xaxs = "i", yaxs = "i", axes = FALSE, ann = FALSE) for (a in 1:8) { for (b in 1:8) { colo <- ifelse(test = (a + b) %% 2 == 0, yes = "gray50", no = "gray98") rect(xleft = a, xright = a + 1, ybottom = b, ytop = b + 1, col = colo, border = "white") } } box() 17 18 19 20 21 # C. Add chess pieces points(x = c(2.5, 3.5), y = c(3.5, 6.5), pch = 16, cex = 3) points(x = c(5.5, 7.5), y = c(4.5, 3.5), pch = 21, cex = 3, bg = "white") out <- recordPlot() 22 23 24 25 26 # D. Save a pdf copy pdf(file = "C:/aErer/fig_chess.pdf", width = 3, height = 3, useDingbats = FALSE) replayPlot(out); dev.off() 20.1.1 Packages and installation A few concepts need to be defined before getting started. A widget is a control element in a GUI, such as a button or a scroll bar. A user interacts with a computer through GUI widgets. A toolkit is a set of widgets used in developing GUIs. 
A toolkit itself is a software application that is built on top of an operating system, provides a programming interface, and allows widgets to be used. There are a large number of GUI toolkits, e.g., GTK+, Qt, Tk, FLTK,

Part VI Publishing a Manuscript

Manuscript preparation and the peer review process are analyzed in this part. Major steps in writing a manuscript are examined to improve efficiency. The whole peer-review process is demonstrated with examples. Typical symptoms associated with a poorly prepared manuscript are selected and discussed.

Chapter 21 Manuscript Preparation (pages 471 – 485): Manuscript preparation is analyzed from three perspectives: outline, detail, and style. In particular, how to construct manuscript outlines by section is explained and compared with examples.

Chapter 22 Peer Review on Research Manuscripts (pages 486 – 502): Peer review on a manuscript is inherently negative, so its nature is analyzed first. Typical marketing skills are discussed. Comments and responses from Wan et al. (2010a) are used as an example.

Chapter 23 A Clinic for Frequently Appearing Symptoms (pages 503 – 511): Typical symptoms related to a poorly written manuscript are listed and analyzed. This includes potential problems in study design, computer programming, and manuscript preparation.

R Graphics Show Box 6: Faces at Christmas time conditional on economic status. R can draw nontraditional graphics. See Program A.6 on page 520 for detail.

Chapter 21 Manuscript Preparation

The final main task in conducting a scientific study is to prepare a manuscript, based on the study design and data analysis results. In this chapter, manuscript preparation is analyzed from several perspectives: outline, detail, and style. First of all, an outline is the "bone" or skeleton of a manuscript. By abstracting the outline of a published paper, how the paper was created can be recovered. For illustration, Wan et al. (2010a) is selected to show how the outline of a manuscript can be constructed. Then, the process of constructing an outline is analyzed by section (e.g., introduction, method, and conclusion). The outline can be expanded and details for a manuscript can be added. Finally, some issues related to writing styles, particularly those for empirical studies, are discussed.

21.1 Writing for scientific research

Writing is one way of expressing our thoughts, similar to other ways of communication, such as speaking, singing, body action, or even no action. In general, writing is a more rigorous way of communication, as it allows a better organization of the mind. Written opinions can be conveyed to readers without direct communication. Published papers can still exist after the authors disappear from the earth. One of the most fundamental differences between animals and human beings is that we can write.

Why is writing important? It is difficult for a person to compete in today's job market without strong writing ability. This is true not only in research-oriented professions such as faculty positions, but also in a corporate environment. In addition, good writing is the foundation of high-quality speech. It would be surprising if an excellent public speaker could not write well. For scientific research, the final output is in the form of reports, theses, or journal articles. Clear writing is essential for disseminating research outputs.
Writing per se, i.e., literally using a pen on a piece of paper or punching a computer keyboard, generally consumes a small share of the total time for a research project. For most empirical studies in economics, once the data analyses are completed, a few weeks should be sufficient for writing a 30-page manuscript. In spite of that, scientific writing is still often perceived to be more challenging than other types of writing. To publish an article in a prestigious journal, a manuscript may need to be revised many times over several months or even years. It is certainly more difficult to write a scientific report than a short announcement for a weekend party. The challenge of scientific writing mainly comes from how to present innovative research outputs in a rigorous, concise, and logical way.

To address the challenges of scientific writing, the concepts promoted in this book will continue to be exploited here. An empirical study in economics has three versions: proposal, program, and manuscript. The key to improving manuscript preparation is to understand the relation among the three versions of an empirical study. A proposal contains the study design and provides the overall guide. A computer program generates all key results in the format of tables and figures. Manuscript preparation is the final step in the production process of economic research. Efficient writing needs strong support from the other two versions; otherwise, there is really not much to write for scientific research.

Over time, scientific writing has become highly structured, even though there is a certain degree of flexibility for some sections in a manuscript. Thus, structural writing has become the main approach to improving writing efficiency. Specifically, the first step is to set up the outline of a manuscript, with inputs generated from the program version of an empirical study. Then, the outline can be expanded with details being added. In the end, expressions and sentences can be polished after the main contents of a manuscript are finalized.

In sum, the quality of a manuscript from a scientific project is mainly dependent on the quality of the study design and data analyses. The goal of preparing a manuscript is to present scientific findings in a concise and rigorous way. To improve the efficiency of manuscript preparation, structural writing is highly recommended. Grammar mistakes can also affect the quality of a manuscript. One can improve the ability to handle details and grammar through more practice over time.

21.2 An outline sample for Wan et al. (2010a)

A complete outline for a manuscript should be constructed before any writing starts. To demonstrate what an outline looks like, Wan et al. (2010a) is selected as an example here. Recall that the proposal for this study is presented in Section 5.3.2 Design with public data (Wan et al. 2010a) on page 72. In Section 12.1 Manuscript version for Wan et al. (2010a) on page 253, the first manuscript version is constructed with information from the proposal version, which is very brief and sketchy. Finally, in Section 12.3 Program version for Wan et al. (2010a) on page 260, six tables and one figure have been generated and saved. With all the analyses and work accomplished, we are in a great position to finalize the outline for this manuscript.
After the program version for a project has been completed, the first program version, e.g., the sample as presented in Section 12.1, should be expanded with the table and figure results first. R may not be able to format the outputs completely. Thus, within a word processing application, tables and figures should be formatted with the ultimate publication outlet in mind. Then, the outline is expanded gradually to the paragraph level with one line for each paragraph. This is a critical step in setting up the final outline for the manuscript. It involves a good amount of planning, and for the discussion section, even some brainstorming. At the end, a complete outline for a typical manuscript should have about two to three pages of bullet points with texts and about eight pages of tables and figures. For Wan et al. (2010a), its final outline is listed below. The only difference is that the actual tables and figures are not included here for brevity, as they are the same as these published on the journal. In general, one should not begin to write a manuscript before an outline is complete and largely satisfactory. If some key questions or objectives raised in the proposal version have not been addressed in the outline, then one should collect new materials, ask help from others, or conduct more necessary analyses. That being said, it should be noted that setting 21.2 An outline sample for Wan et al. (2010a) 473 up an outline can be dynamic and it is hard to have a perfect outline before the paper is published. Sometimes comments from a peer reviewer can change the outline dramatically. The sample outline for Wan et al. (2010a) below reveals well the components of a typical empirical study. The key sections are the methodology and results. Without those two sections, it could not have been published. Other sections like introduction and conclusion are there to support or sell the research story. In addition, I usually list the number of pages I plan to write for each section in the outline. This notation can remind me during writing if the actual length is sufficient or not. It can help avoid a bigger or smaller section than planned. When actual writing is different from what has been planned in the outline, you need to think over it twice: why should I write more for this section, or why should I reduce the description for another section? If you cannot convince yourself for a change, then you will have to contain the desire to talk too much for one point, or push yourself harder to write more for another point. Failing to stick to the original space allocation may result in an unbalanced structure for the paper at the end. The final outline for Wan et al. (2010a) Title: Analysis of import demand for wooden beds in the U.S. (30 pages in double line spacing) (1). A cover page. An abstract has about 200 words. Have one or two sentences for research issue, study need, objective, methodology, data, results, and contributions. Five key words. JEL classification codes if needed. (2). Introduction (2 pages in double line spacing). A brief introduction of the trade patterns, antidumping investigation, past studies, overall objective, and three specific objectives and contributions. Each paragraph is defined by one line with a black bullet. • A brief introduction of trade of furniture and wooden beds related to the U.S. • Past studies about furniture trade, knowledge gap, and study need. • Overall study objective and a brief introduction of method, product, and data. 
• Objective #1: consumer behavior evaluation through price elasticities.
• Objective #2: the depression effect of the antidumping action on imports from China.
• Objective #3: the diversion effect of the antidumping action on imports from other suppliers.
• A brief paragraph about the manuscript organization.

(3). Market Overview and the Antidumping Investigation against China (3 pages in double line spacing). A detailed review of the research issue: trade pattern, antidumping investigation, duties, and research needs. Figure 1 is relevant to this section.
• Rapid growth in imports of wooden bedroom furniture by the U.S.
• Traditional suppliers of wooden beds in the U.S. import market.
• Newly industrialized Asian countries (e.g., China) as new suppliers.
• Antidumping investigation against the large imports from China and key dates.
• Main conclusions from the investigation.
• Antidumping duties and a display of import shares by country (Figure 1).

Chapter 22 Peer Review on Research Manuscripts

To publish a manuscript in a journal, the manuscript needs to go through an anonymous peer review process. The nature of this process is inherently negative, and this is elaborated first. Then, the major reasons for manuscript rejection are discussed and possible remedies are presented. Finally, the review comments for Wan et al. (2010a) are included as an example. Some responses are presented and analyzed for the purpose of illustration.

22.1 An inherently negative process

When a manuscript is submitted to a journal, it is handled by a chief editor, an associate editor, and two to four anonymous reviewers (also known as referees). At the beginning of a peer review process, the chief editor reads the manuscript and assigns it to an associate editor with appropriate expertise. The associate editor then selects several reviewers to comment on the manuscript. The selected reviewers read the manuscript and make recommendations to the associate editor, which often takes a few weeks or even months. The associate editor reads the comments and the manuscript, and in turn makes a recommendation to the chief editor. In the end, the chief editor makes a decision with all the information. The names of the reviewers, and sometimes even the associate editor, are usually unknown to the authors. This structure can vary greatly by journal, but the opinions of one or two editors and several reviewers are the key to the whole peer review.

A peer review process is negative in nature. It is interesting to compare how we human beings behave in different settings. When we see a new baby, we always say it is a beautiful baby, but we know some babies are not. At a funeral, the deceased is invariably the best person on this planet, but we know nobody is perfect. A researcher may feel heartbroken if he has spent great effort on a manuscript, followed the long review process, and in the end still received a rejection letter full of negative comments. An anonymous peer review of a manuscript allows ugly, harsh, and unfriendly truths to be discovered and conveyed. Referees search for flaws in a manuscript. They assess study design, methods, data quality, and results. Any fatal or even suspicious problem in a major area is likely to result in a rejection decision. In some cases, problems identified through a peer review process can be fixed; in other situations, they cannot. Ultimately, the content and scientific merit will determine the fate of a manuscript.
Given the stressful nature of this game, a natural question is whether a friendlier process could achieve the same effect. Most probably, the answer is no. The reason is simple: people around you do not want to, or cannot, tell you their true feelings about the paper. An anonymous peer review allows referees to make serious comments on a study. The benefits are apparent if we compare the quality of journal articles to that of non-refereed conference papers. Once the review process is over and the paper is finally published, most authors would agree that the peer review comments were helpful in improving manuscript quality. This is the major benefit of the peer review system, and it will continue to provide strong justification for this system in the future.

A peer review does have several costs. One major cost is that authors often suffer from the critical comments. This is true even for seasoned researchers. It is just human nature that we like positive feedback on our activities. A solution is to keep the nature of this game firmly in mind. Do not take the negative comments personally. A manuscript being rejected does not mean the author is a bad person or an unqualified professional, and in many situations, it does not even mean that the research has no value. In most cases, putting the comments away for several weeks before revising a manuscript can ease most of the pain. Another major cost of a peer review is that editors and reviewers can make mistakes. They do reject some good manuscripts that should have been published. This happens because human beings are not perfect and we are all limited in one way or another. Fortunately, the probability is usually low. If a manuscript is indeed of high quality but rejected by one journal, the authors can always try other journals later.

Like any activity, going through a peer review process requires some advice and experience. Most graduate students do not stay in an academic environment after graduation, so they may have only one or two opportunities to practice and learn the relevant skills. Thus, one-to-one advising by an adviser usually plays a large role in helping students understand this game. The skills learned from a peer review should be beneficial to future career development, even in a corporate environment, because a peer review is just one typical activity in a market economy.

22.2 Marketing skills and typical review comments

Several editorial decisions are possible at the end of a peer review. A manuscript can be rejected outright without reconsideration, rejected with resubmission allowed, accepted conditional on a major, moderate, or minor revision, or accepted as it is. In practice, a manuscript often goes through the review process several times, with the status changing gradually from a major revision to final acceptance. For each version of a manuscript, the editorial decision can be influenced by manuscript quality, journal goals, research areas, rules in an academic community, and reviewers' personalities. The comments from an editor and several reviewers are a reflection of the many factors listed above. A peer review of a manuscript can generate numerous comments over many pages. While comments differ in their specific contents and formats, they can be classified into several major types.
In this section, a few typical situations are analyzed and possible remedies are offered.

22.2.1 Marketing skills

In academia, "publish or perish" describes the pressure to publish academic work in an increasingly competitive environment. In economics, the profit of a commodity to a producer equals the difference between revenue and cost. Finishing a manuscript efficiently within one's office is similar to producing a quality hamburger at minimum cost in a factory. However, before a manuscript is published, few people will have a chance to read it, so the impact or "revenue" is almost zero. Unless one conducts a study purely for personal pleasure, the pressure of "publish or perish" is always present.

Getting a manuscript published in a peer-reviewed journal requires some marketing skills. This is the same as selling any commodity in a market (e.g., a hamburger). It is true even if an author is very confident in the quality of a manuscript. There is intense competition from the perspectives of both researchers and publishers. In a specific research area, it is common for many researchers to compete with each other in trying to publish results of better quality or at a faster pace. Journal editors and publishers also need to compete for quality manuscripts. Therefore, it is critical that authors understand the competitive nature of the peer review process and, furthermore, utilize appropriate marketing skills in promoting and publishing their results.

Where can we learn marketing skills? Unfortunately, this is not something we can learn in a classroom through a typical course. By analogy, it is hard to imagine that one takes a marketing course and then becomes the best agent for selling cars. It is safe to say that better marketing is always associated with a solid understanding of the commodity, better interaction with the people in the market, and most importantly, more practice. These general principles are also applicable in selling a manuscript to a journal, or in a broad sense and in the long term, to an academic audience.

To publish research outcomes successfully and continually, one needs to know the focus of a manuscript and the relevant research area well. This is obvious for selling any commodity, but it can be neglected after one has spent months working on a project and preparing a manuscript. My suggestion is to put the manuscript away for a few weeks once it is finished. Then read it again, evaluate possible publication outlets (or your previous choice), and select a journal for final formatting. In the long term, one will gradually gain a better understanding of the research area, know where similar studies have been published, and make sound decisions during the peer review process.

Researchers also need to build up a professional network that covers editors and potential reviewers. It is true that there are many scientific journals nowadays, and it is impossible to know all journal editors. However, it should be quite feasible to interact with the editors of the few journals where one often publishes. Researchers should utilize various opportunities to impress editors and potential reviewers with specific outcomes from a study, or in the long term, with one's own reputation in a research area.
These opportunities include presentations and social activities at professional meetings, anonymous peer reviews of manuscripts as requested by journal editors, and other professional communications. In general, when one is deeply integrated into a profession, it often becomes easier to get manuscripts published, or at least, many of the unfair treatments or bad luck described below can be avoided.

In dealing with an editorial decision and comments from a journal, write and respond in a professional way. Some comments can be hard to read and digest. However, angry responses will not help at all in getting a manuscript accepted, and in the long term, they may hurt one's reputation too. Instead, compose the responses and the cover letter in a professional writing style. If you do not agree with the reviewers on a specific comment, for example, express it like this: "We respectfully do not agree with this because ...." Keep in mind that editors and reviewers are ordinary people and can be as fragile as you are. In one instance, I emphasized a few words by using capital letters. The reviewer responded to my answer by asking: "Why do you yell at me by capitalizing the letters?" In the end, that manuscript was rejected after two rounds of peer review. In sum, when communicating with editors and reviewers by email or letter, writing professionally and politely may improve the chance of getting a manuscript published.

Chapter 23 A Clinic for Frequently Appearing Symptoms

So far, the right way to conduct scientific research has been promoted in this book. In reality, things may not work as planned. It is similar to swimming: one watches another person swim across a pool so well and tells himself, "Easy job! I can do that too." However, when he jumps into the pool, he quickly realizes that what he learned on land does not work in the water, and he is sinking to the bottom. In this chapter, some typical symptoms that I and our graduate students have experienced are listed and elaborated. This can serve as a self-guide for improving research productivity.

23.1 Symptoms related to study design and outline

23.1.1 Symptom: Readers of your paper say: "It does not look like a paper."

Prescription: No simple medicine is available to cure this severe symptom. More examinations are needed to determine which part of your research is problematic. A prescription can then be offered for each individual problem.

Explanation: If your paper and writing style give readers this kind of impression, it means that you have not learned the basics of scientific research yet. There is a long way to go before you can reach a professional writing standard and meet the requirements. This is similar to a patient in a severe condition: when doctors examine the patient, they cannot determine which part or organ is not functioning well. The only certain thing is that the patient is dying. If your paper does not look like a paper, it will disappear from this planet in a short period too.

23.1.2 Symptom: Faculty advisers say: "There is no way to revise and improve your paper."

Prescription: No simple medicine is available to cure this severe symptom. Most probably, your outline is not clear: raw materials and information, even if relevant, are just piled together in the paper without logical connection. Alternatively, your outline is clear but there is no real contribution from the analysis. You need to learn the basics of writing and receive detailed guidance from your faculty adviser.
Explanation: This often occurs to young professionals. Students sometimes think that if a 30-page document is compiled, it will automatically become a paper. Unfortunately, that perception is totally wrong. To produce a quality paper, you have to go through several stages: create or understand the research idea and objectives, conduct the necessary analyses, and connect all the details together in a manuscript.

23.1.3 Symptom: You feel you do not really understand the research project.

Prescription: Read the proposal from your faculty adviser or collaborators carefully. Or, if the research idea is your own, try to clarify it and write it down on a piece of paper.

Explanation: A complete understanding of what you are doing is the first and fundamental step in conducting scientific research. If you do not know where you are going, you will not reach your destination automatically. My personal habit is to write down the relevant information in a single document when I start a new project. In that regard, taking notes can help greatly in clarifying my mind. Items included are the objectives, related key literature, main contributions, and major uncertainties that may kill the idea. These concepts are covered in Section 5.1.2 Keywords in a proposal on page 58 and Figure 5.1 The relation among keywords in a proposal for social science on page 61.

23.1.4 Symptom: After writing five pages, you do not know what to write next.

Prescription: You have not fully constructed the outline for the manuscript yet. You have to go through the outline stage before you write the paper in detail.

Explanation: This may be one of the most common mistakes that young professionals make in conducting scientific research. Many students write a paper without first setting up an overall outline. They think that by writing down the details they already know, the other components will follow automatically. That is completely wrong, and it seldom happens. In fact, when you dive into the detail stage and the writing process, you focus on individual "trees" and quickly lose your view of the "forest." Sooner or later, you will find yourself desperately in the middle of a jungle with this question: where am I and where can I go now? Even if you can finish a 30-page paper in the end, it will most probably look fragmented and fail to achieve your original objective. Therefore, you have to go through the outline stage before you move to the detail stage and write your paper. In particular, the outline of a manuscript for an empirical study in economics comes from the design in a proposal and the results from a computer program. See the relevant coverage in Chapter 4 Anatomy on Empirical Studies on page 42.

23.1.5 Symptom: You are trapped in writing the introduction of a manuscript.

Prescription: One should never prepare a manuscript by working on the introduction section first. Instead, write your introduction after you have finished all other sections of your paper.

Explanation: The typical symptom is that you start to write a manuscript with the introduction as the first section. You can spend a lot of time doing that and still feel unsatisfied or uncertain. To write a manuscript more efficiently, the introduction should be written after all other sections have been done. This is the section that you use to sell your story to readers.
If you have not finished key sections such as the methods and results, you really do not know what to sell or what to introduce in the first place. Similarly, in building a house, no builder begins the work by putting up the front door first. See more discussion in Section 21.3 Outline construction by section on page 475.

Appendix A Programs for R Graphics Show Boxes

Six show boxes for R graphics have been inserted at the beginning of the parts. In this appendix, the R programs and notes for these graphs are presented. In general, computer graphics can be scientific or entertaining, or both. These show boxes have been selected for some practical value in scientific research, but they are more relaxing than those in the body of the book. I hope these show boxes capture your attention and inspire you to explore more in the process of learning R graphics functionality.

Show Box 1 on page 2: map creation

R can process spatial data and visualize them on maps, as presented in Section 16.4 Spatial data and maps on page 385. In this show box, the maps package is utilized to generate a basic sketch of the world map. The key function, map(), can use internal databases included in the package or external data imported from other sources. To make the map more revealing, two countries, i.e., Brazil and China, are selected and filled with colors. Note that the add = TRUE argument in the map() function allows several layers to be combined in a single map. The map is displayed on a screen device first and then saved as a PDF file at the end.

Program A.1 A world map with two countries highlighted

# A. Load the package and understand region names
setwd("C:/aErer"); library(maps)
dat <- map(database = "world", plot = FALSE); str(dat)

# B. Display the map on the screen device
windows(width = 5.3, height = 2.5); bringToTop(stay = TRUE)
map(database = "world", fill = FALSE, col = "green", mar = c(0, 0, 0, 0))
map(database = "world", regions = c("Brazil", "China"), fill = TRUE,
  col = c("yellow", "red"), add = TRUE)
showWorld <- recordPlot()

# C. Save the map on a file device
pdf(file = "fig_showWorld.pdf", width = 5.3, height = 2.5)
replayPlot(showWorld); dev.off()

# Selected results from Program A.1
> dat <- map(database = "world", plot = FALSE); str(dat)
List of 4
 $ x    : num [1:27221] -130 -130 -132 -132 -132 ...
 $ y    : num [1:27221] 55.9 56.1 56.7 57 57.2 ...
 $ range: num [1:4] -180 190.3 -85.4 83.6
 $ names: chr [1:2284] "Canada" "South Africa" "Denmark" ...
 - attr(*, "class")= chr "map"
If you are interested in seeing what a black heart would look like, then the command for the color set can be changed to colors = c(“black”, “white”). The oneHeart() function is created to draw one heart shape with the two argument values, i.e., r for the size and col for the color. This new function contains a seemingly complicated expression for a heart shape, but that is actually easy as there are many similar mathematical formulas on the Internet. In drawing the graph finally, the key functions used are polygon() and mapply(). In base R graphics, polygon() is a low-level plotting function. Thus, the high-level plotting functions of plot.new() and plot.window() are used first to initiate a graph. The mapply() function vectorizes the drawing action over its two arguments (i.e., heart size values and colors), which is more efficient than employing two looping statements. See details at Section 10.3.3 The apply() family on page 198. The number used here is 500, and a smaller number (e.g., n = 3) can show a much coarser version. oneHeart() function is supplied as an argument value to the mapply() function. Program A.2 A heart shape with three-dimensional effects 1 2 3 4 5 6 # A. Set color and heart parameter values setwd("C:/aErer") n <- 500 fun.col <- colorRampPalette(colors = c("red", "white")); fun.col set.col <- fun.col(n); head(set.col) set.val <- seq(from = 16, to = 0, length.out = n) 7 8 9 10 11 12 # B. Create a new function to draw one heart as a polygon oneHeart <- function(r, col) { t <- seq(from = 0, to = 2 * pi, length.out = 100) x <- r * sin(t) ^ 3 y <- (13 * r / 16) * cos(t) - (5 * r / 16) * cos(2 * t) - 515 13 14 15 } (2 * r / 16) * cos(3 * t) - (r / 16) * cos(4 * t) polygon(x, y, col = col, border = NA) 16 17 18 19 20 21 22 23 # C. Draw many hearts with mapply() windows(width = 5.3, height = 3.2); bringToTop(stay = TRUE) par(mgp = c(0, 0, 0), mai = c(0, 0, 0, 0)) plot.new() plot.window(xlim = c(-16, 16), ylim = c(-16, 13)) mapply(FUN = oneHeart, set.val, set.col) showHeart <- recordPlot() 24 25 26 27 # D. Save the graph on a file device pdf(file = "fig_showHeart.pdf", width = 5.3, height = 3.2) replayPlot(showHeart); dev.off() # Selected results from Program A.2 > fun.col <- colorRampPalette(colors = c("red", "white")); fun.col function (n) { x <- ramp(seq.int(0, 1, length.out = n)) if (ncol(x) == 4L) rgb(x[, 1L], x[, 2L], x[, 3L], x[, 4L], maxColorValue = 255) else rgb(x[, 1L], x[, 2L], x[, 3L], maxColorValue = 255) } <bytecode: 0x069e1d50> <environment: 0x0a9a9180> > set.col <- fun.col(n); head(set.col) [1] "#FF0000" "#FF0000" "#FF0101" "#FF0101" "#FF0202" "#FF0202" Show Box 3 on page 100: categorical values R is widely used to visualize continuous variables. Nevertheless, R also has rich functions in revealing the relations between categorical variables, and to my understanding, these tools are largely underutilized in economic research. In this show box, the well-known Titanic data set is selected to demonstrate how a contingency table can be shown as a graph. Recall Titanic was a British passenger liner that sank in the North Atlantic Ocean in 1912 and more than 1,500 people died. The R data set of Titanic() is a four-dimensional array resulting from cross-tabulating 2,201 observations on four variables: age, class, gender, and survival. This data set also has the table attribute, so the functions of as.data.frame() and ftable() can be used to have a better view of the data, as detailed at Section 10.3.2 Contingency and pivot tables on page 194. 
In drawing a concise graph for demonstration here, the class information is used as the x axis, and the survival rate as the y axis. The mosaicplot() function in traditional graphics system shows the pattern well. The main impression is that first-class passengers have a higher survival rate. If one needs to have more controls over the plot, e.g., adding values by category to the plotting region, then several contributed packages in R can be consulted, e.g., vcd. 516 Appendix A Programs for R Graphics Show Boxes Program A.3 Survival of 2,201 passengers on Titanic sank in 1912 1 2 3 4 5 # A. Understand data; library(vcd) has more options for mosaic plots setwd("C:/aErer") Titanic; as.data.frame(Titanic) str(Titanic) ftable(Titanic) 6 7 8 9 10 11 12 13 # B. Draw a mosaic plot for two categorical variables windows(width = 5.3, height = 2.5, pointsize = 9) bringToTop(stay = TRUE) par(mai = c(0.4, 0.4, 0.1, 0)) mosaicplot(formula = ~ Class + Survived, data = Titanic, color = c("red", "green"), main = "", cex.axis = 1) showMosaic <- recordPlot() 14 15 16 17 18 # C. Save the graph on a file device pdf(file = "fig_showMosaic.pdf", width = 5.3, height = 2.5) replayPlot(showMosaic) dev.off() # Selected results from Program A.3 > str(Titanic) table [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ... - attr(*, "dimnames")=List of 4 ..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew" ..$ Sex : chr [1:2] "Male" "Female" ..$ Age : chr [1:2] "Child" "Adult" ..$ Survived: chr [1:2] "No" "Yes" > ftable(Titanic) Class Sex 1st Male Female 2nd Male Female 3rd Male Female Crew Male Female Age Child Adult Child Adult Child Adult Child Adult Child Adult Child Adult Child Adult Child Adult Survived No Yes 0 5 118 57 0 1 4 140 0 11 154 14 0 13 13 80 35 13 387 75 17 14 89 76 0 0 670 192 0 0 3 20 517 Show Box 4 on page 252: diagrams Traditional graphics system in R is very flexible in drawing a diagram. Furthermore, several contributed packages have made some common diagrams even easier to draw. The diagram package is among one of them. In this show box, the plotmat() function is used to draw the structure of this book again. Four boxes are created to represent the overall structure, i.e., the proposal, program, and manuscript versions for an empirical study. There are many arguments in this function that can be used to customize the detail. Of course, users need to pay an extra cost in learning the new package before enjoying the flexibility in customizing the graph. Calling plotmat() can draw a diagram and also return a list of graphical data. Program A.4 A diagram for the structure of this book 1 2 3 4 # A. Data preparation setwd("C:/aErer"); library(diagram) M <- matrix(nrow = 4, ncol = 4, data = 0) M[2, 1] <- 1; M[4, 2] <- 2; M[3, 4] <- 3; M[1, 3] <- 4; M 5 6 7 8 9 10 11 12 13 14 15 16 17 # B. Display the diagram on a screen device windows(width = 5.3, height = 2.5, family = "serif") par(mai = c(0, 0, 0, 0)) book <- plotmat(A = M, pos = c(1, 2, 1), curve = 0.3, name = c("Empirical Study", "Proposal", "Manuscript", "R Program"), lwd = 1, box.lwd = 1.5, cex.txt = 0.8, arr.type = "triangle", box.size = c(0.15, 0.1, 0.1, 0.1), box.cex = 0.75, box.type = c("hexa", "ellipse", "ellipse", "ellipse"), box.prop = 0.4, box.col = c("pink", "yellow", "green", "orange"), lcol = "purple", arr.col = "purple") names(book); book[["rect"]] showDiagram <- recordPlot() 18 19 20 21 # C. 
Save the graph on a file device pdf(file = "fig_showDiagram.pdf", width = 5.3, height=2.5, family="serif") replayPlot(showDiagram); dev.off() # Selected results from Program A.4 > M[2, 1] <- 1; M[4, 2] <- 2; M[3, 4] <- 3; M[1, 3] <- 4; M [,1] [,2] [,3] [,4] [1,] 0 0 4 0 [2,] 1 0 0 0 [3,] 0 0 0 3 [4,] 0 2 0 0 > names(book); book[["rect"]] [1] "arr" "comp" "radii" "rect" xleft ybot xright ytop [1,] 0.35 0.70590856 0.65 0.9607581 [2,] 0.15 0.41505015 0.35 0.5849498 [3,] 0.65 0.41505015 0.85 0.5849498 [4,] 0.40 0.08171682 0.60 0.2516165 518 Appendix A Programs for R Graphics Show Boxes Show Box 5 on page 394: dynamic graphs R can create dynamic and interactive graphs. In this show box, 500 plots are drawn on the screen device sequentially with varying parameters to create an effect like a video. Specifically, two random variables are generated with the rCopula() function in the copula package. The relation between the variables is specified through a normal copula, as defined by normalCopula(). The correlation between these two variables can vary from −1 to 1. When these variables are plotted on a graph through a for looping statement and the plot() function, the varying correlations and colors create a continuous visual effect. The color values are generated with the rainbow() function. To capture the spirit of this video on a two-dimensional static graph, some extra efforts are needed. Specifically, three screenshots are chosen with different correlations and colors. Then they are arranged on the screen device through the powerful functionality offered in the grid package. Three viewports are created at the top-level window and each contains one screenshot. The axes and labels are excluded in the screenshots for brevity. Program A.5 Screenshots from a demonstration video for correlation 1 2 3 4 5 6 7 # A. Data: number of plots, points, and colors setwd("C:/aErer"); library(copula); library(grid) num.plot <- 500; num.points <- 10000 set.rho <- seq(from = 0, to = 1, length.out = num.plot) pie(x = rep(x = 1, times = 15), col = rainbow(15)) # understand rainbow() set.col <- rainbow(num.plot) str(set.col); head(set.col, n = 4) 8 9 10 11 12 13 14 15 16 17 18 # B. Display the video on the screen device; need about 35 seconds # Two random variables with bivariate normal copula windows(width = 4, height = 4); bringToTop(stay = TRUE) par(mar = c(2.5, 2.5, 1, 1)) for (i in 1:num.plot) { sam <- rCopula(n = num.points, copula = normalCopula(param = set.rho[i])) plot(x = sam, col = set.col[i], xlim = c(0, 1), ylim = c(0, 1), type = "p", pch = ".") } str(sam); head(sam, n = 3) 19 20 21 22 23 24 25 26 27 # C. 
Three screenshots for ERER book windows(width = 5.3, height = 2.5); bringToTop(stay = TRUE) v1 <- viewport(x = 0.02, y = 0.98, width = 0.55, height = 0.5, just = c("left", "top")) pushViewport(v1); grid.rect(gp = gpar(lty = "dashed")) sam <- rCopula(n = 3000, copula = normalCopula(param = 0)) grid.points(x = sam[, 1], y = sam[, 2], pch = 1, size = unit(0.001, "char"), gp = gpar(col = "red")) 28 29 30 31 32 upViewport(0); current.viewport() v2 <- viewport(width = 0.55, height = 0.5) pushViewport(v2); grid.rect(gp = gpar(fill = "white", lty = "dashed")) sam <- rCopula(n = 3000, copula = normalCopula(param = 0.8)) 519 33 34 grid.points(x = sam[, 1], y = sam[, 2], pch = 1, size = unit(0.001, "char"), gp = gpar(col = "darkgreen")) 35 36 37 38 39 40 41 42 43 upViewport(0) v3 <- viewport(x = 0.98, y = 0.02, width = 0.55, height = 0.5, just = c("right", "bottom")) pushViewport(v3); grid.rect(gp = gpar(fill = "white", lty = "dashed")) sam <- rCopula(n = 3000, copula = normalCopula(param = 0.99)) grid.points(x = sam[, 1], y = sam[, 2], pch = 1, size = unit(0.001, "char"), gp = gpar(col = "blue")) showCorrelation <- recordPlot() 44 45 46 47 48 # D. Save the three screen shots on a file device pdf(file = "fig_showCorrelation.pdf", width = 5.3, height = 2.5, useDingbats = FALSE) replayPlot(showCorrelation); dev.off() # Selected results from Program A.5 > str(set.col); head(set.col, n = 4) chr [1:500] "#FF0000FF" "#FF0300FF" "#FF0600FF" "#FF0900FF" ... [1] "#FF0000FF" "#FF0300FF" "#FF0600FF" "#FF0900FF" > str(sam); head(sam, n = 3) num [1:10000, 1:2] 0.925 0.0674 0.5819 0.24 0.3992 ... [,1] [,2] [1,] 0.9249584 0.9249584 [2,] 0.0674410 0.0674410 [3,] 0.5818629 0.5818629 Show Box 6 on page 470: human faces as a graph Herman Chernoff invented Chernoff faces to display multivariate data through the shape of a human face. The individual parts on a face represent variable values by their shape, size, placement, and color. The idea is that humans can easily recognize small changes on a face. By choosing appropriate variables in representing the features of individual parts on a face, Chernoff faces can generate interesting pictures and facilitate communications in a way that is different from words, tables, or traditional lines and points. In this show box, the longley data set is used to create four faces. This is a macroeconomic data set with annual observations for seven economic variables between 1947 and 1962. The variables include gross national product (GNP), GNP deflator, and numbers of people unemployed and employed. Four years are selected in the show box: 1947, 1952, 1957 and 1962. The faces() function from the aplpack package is used to draw the faces. The color and Christmas decoration on the faces are determined by the argument value of face.type = 2. This function also has an argument of plot = TRUE. If plot = FALSE, the output from calling faces() can be saved and then plotted. The output has the class of “faces” and a new method named plot.faces() is defined in the package. By using the economic data to shape the faces, the economy status over time is presented graphically, even though it is a little bit different from a typical line graph. Hopefully, you can see that Americans were happier in the 1960s than in the 1940s through the graph. 
520 Appendix A Programs for R Graphics Show Boxes Program A.6 Faces at Christmas time conditional on economic status 1 2 3 4 # Load library and data setwd("C:/aErer"); library(aplpack) data(longley); str(longley) longley[as.character(c(1947, 1952, 1957, 1962)), 1:4] 5 6 7 8 9 10 11 12 # Some practices windows(); bringToTop(stay faces() faces(face.type = 0) faces(xy = rbind(1:4, 5:3, faces(xy = longley[c(1, 6, faces(xy = longley[c(1, 6, = TRUE) 3:5, 5:7), face.type = 2) 11, 16), ], face.type = 1) 11, 16), ], face.type = 0) 13 14 15 16 17 18 19 20 # Display on the screen device windows(width = 5.3, height = 2.5, pointsize = 9); bringToTop(stay = TRUE) par(mar = c(0, 0, 0, 0), family = "serif") aa <- faces(xy = longley[c(1, 6, 11, 16), ], plot = FALSE) class(aa) # "faces" plot.faces(x = aa, face.type = 2, width = 1.1, height = 0.9) showFace <- recordPlot() 21 22 23 24 # Save the graph on a file device pdf(file = "fig_showFace.pdf", width = 5.3, height = 2.5, pointsize = 9) replayPlot(showFace); dev.off() # Selected results from Program A.6 > data(longley); str(longley) ’data.frame’: 16 obs. of 7 variables: $ GNP.deflator: num 83 88.5 88.2 89.5 96.2 ... $ GNP : num 234 259 258 285 329 ... $ Unemployed : num 236 232 368 335 210 ... $ Armed.Forces: num 159 146 162 165 310 ... $ Population : num 108 109 110 111 112 ... $ Year : int 1947 1948 1949 1950 1951 ... $ Employed : num 60.3 61.1 60.2 61.2 63.2 ... > longley[as.character(c(1947, 1952, 1957, 1962)), 1:4] GNP.deflator GNP Unemployed Armed.Forces 1947 83.0 234.289 235.6 159.0 1952 98.1 346.999 193.2 359.4 1957 108.4 442.769 293.6 279.8 1962 116.9 554.894 400.7 282.7 Appendix B Ordered Choice Model and R Loops T his appendix has two objectives. The first objective is to document how to compute predicted probabilities, marginal effects, and their standard errors for an ordered probit or logit model. The technical details are similar to these for binary choice model, as presented in Section 7.2 Statistics for a binary choice model on page 103. The computation for ordered choice model is more tedious because there are multiple choices, instead of two. In addition, for both predicted probabilities and marginal effects, standard errors are more difficult to calculate than their values per se. Standard errors can be computed using linear approximation approach (i.e., delta method). The second objective is to demonstrate how to avoid overuse of looping statements. This could be demonstrated in Chapter 13 Flow Control Structure on page 269, but it is a little too long. For the ordered choice model, the ocME() and ocProb() functions are created and included in the erer library. The implementation is dramatically simplified by using vectorization technique available in R and by avoiding overuse of R loops. In general, looping statements such as for are intuitive and straightforward to use, but overusing them can result in inefficiency. Thus, calculating marginal effects from ordered choice model serves as a good example to illustrate how to have a balanced use of looping statements. In developing these functions, results from ocME() and ocProb() were compared with these from the software of LIMDEP, and the assessment was satisfactory. B.1 Some math fundamentals A good understanding of the following mathematics is necessary to compute predicted probabilities, marginal effects, and their standard errors for ordered choice model. Thus, they are included to facilitate the remaining presentation of this appendix. 
Delta method

The delta method is an approximate method for computing the variance of a variable, given the variance estimates of its parameters (Greene, 2011). For example, assume

$$g = f(b_1, b_2) = a b_1 b_2 + 3 b_2 \tag{B.1}$$

where $g$ is a function of two variables ($b_1$ and $b_2$) and one constant ($a$). This allows the value of $g$ to be calculated when the values of $b_1$, $b_2$, and $a$ are known. Furthermore, the variance of $g$ can be computed as:

$$
\operatorname{var}(g)
= \begin{bmatrix} \dfrac{\partial g}{\partial b_1} & \dfrac{\partial g}{\partial b_2} \end{bmatrix}
  \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix}
  \begin{bmatrix} \dfrac{\partial g}{\partial b_1} & \dfrac{\partial g}{\partial b_2} \end{bmatrix}^T
= \begin{bmatrix} a b_2 & a b_1 + 3 \end{bmatrix}
  \begin{bmatrix} 1 & 5 \\ 5 & 2 \end{bmatrix}
  \begin{bmatrix} a b_2 \\ a b_1 + 3 \end{bmatrix}
= 2 a^2 b_1^2 + a^2 b_2^2 + 10 a^2 b_1 b_2 + 12 a b_1 + 30 a b_2 + 18
\tag{B.2}
$$

where $\sigma_{11}$ is the variance of $b_1$, $\sigma_{12}$ is the covariance between the two variables, and $T$ denotes the transpose of a matrix. For illustration, some simple numerical values are assumed for the variance-covariance matrix. In sum, given a functional relation as defined in Equation (B.1), the value and covariance estimates of the variables on the right side can be used to compute the variance of the left-side variable.

If the constant $a$ is treated as a parameter in the computation, it does not affect the result. This can be shown as follows:

$$
\operatorname{var}(g)
= \begin{bmatrix} \dfrac{\partial g}{\partial b_1} & \dfrac{\partial g}{\partial a} & \dfrac{\partial g}{\partial b_2} \end{bmatrix}
  \begin{bmatrix} \sigma_{11} & \sigma_{1a} & \sigma_{12} \\ \sigma_{a1} & \sigma_{aa} & \sigma_{a2} \\ \sigma_{21} & \sigma_{2a} & \sigma_{22} \end{bmatrix}
  \begin{bmatrix} \dfrac{\partial g}{\partial b_1} & \dfrac{\partial g}{\partial a} & \dfrac{\partial g}{\partial b_2} \end{bmatrix}^T
= \begin{bmatrix} a b_2 & b_1 b_2 & a b_1 + 3 \end{bmatrix}
  \begin{bmatrix} 1 & 0 & 5 \\ 0 & 0 & 0 \\ 5 & 0 & 2 \end{bmatrix}
  \begin{bmatrix} a b_2 \\ b_1 b_2 \\ a b_1 + 3 \end{bmatrix}
= 2 a^2 b_1^2 + a^2 b_2^2 + 10 a^2 b_1 b_2 + 12 a b_1 + 30 a b_2 + 18
\tag{B.3}
$$

where the covariance of $a$ with itself and with the two variables is zero. Thus, the variance of $g$ remains the same even if a constant is included as a parameter.

The delta method can also be applied to a vector of variables, e.g., $g_1$ and $g_2$. This needs a good understanding of matrix operations and, additionally, matrix calculus.

Matrix calculus

Matrix calculus is a large subject, as detailed in Abadir and Magnus (2005). Two useful rules are stated here, as they are particularly related to computing standard errors of marginal effects in an ordered choice model. Taking the derivative of a vector with regard to another vector results in a matrix, not only a vector. Specifically, if $y$ is a $m \times 1$ vector, $x$ is a $n \times 1$ vector, and $y = f(x)$, then

$$
\frac{\partial y}{\partial x} =
\begin{bmatrix}
\dfrac{\partial y_1(x)}{\partial x_1} & \cdots & \dfrac{\partial y_1(x)}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y_m(x)}{\partial x_1} & \cdots & \dfrac{\partial y_m(x)}{\partial x_n}
\end{bmatrix}
\tag{B.4}
$$

where $y_1(x)$ is the first element in the vector, and the others are similarly defined. The result is a $m \times n$ matrix. This is also called the Jacobian matrix of $y = f(x)$.

If $y = A(r)\,x(r)$, where $y$ is $m \times 1$, $A(r)$ is $m \times n$, $x(r)$ is $n \times 1$, and $r$ is $k \times 1$, then

$$
\frac{\partial y}{\partial r} = (x^T \otimes I_m)\,\frac{\partial A(r)}{\partial r} + A(r)\,\frac{\partial x}{\partial r}
\tag{B.5}
$$

where $\otimes$ is the Kronecker product, and $I_m$ is an identity matrix with the dimension of $m \times m$.

Derivative for a probability density function

For a probit model, the underlying distribution is a normal distribution (Greene, 2011, page 1024). Note that a normal distribution has no closed form for its cumulative distribution function $F$. Its probability density function $f$ can be expressed as:

$$
f(z \mid \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(z-\mu)^2}{2\sigma^2}}
\tag{B.6}
$$

where $z$ is a random variable, $\mu$ is the mean, and $\sigma$ is the standard deviation of the distribution. When $\mu = 0$ and $\sigma = 1$, the distribution is called the standard normal distribution. Its probability density function and derivative can be expressed as

$$
f(z \mid 0, 1) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}
\tag{B.7}
$$

$$
f' = \frac{\partial f}{\partial z} = f \cdot \left(-\frac{2z}{2}\right) = -z f
\tag{B.8}
$$

where the prime symbol denotes a derivative.
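Before turning to the logistic distribution, a quick numerical check of the delta method in Equations (B.1) and (B.2) may be helpful. The short sketch below evaluates the gradient of g and the quadratic form directly in R; the values of a, b1, b2 and the covariance matrix are assumed purely for illustration and are not estimates from any model in this book.

# A minimal numerical check of the delta method in Equations (B.1)-(B.2).
# All numerical values below are assumed for illustration only.
a <- 2; b1 <- 0.5; b2 <- 1.5
vc <- matrix(c(1, 5, 5, 2), nrow = 2)        # assumed cov(b1, b2)

# Gradient of g = a*b1*b2 + 3*b2 with respect to (b1, b2)
gr <- c(a * b2, a * b1 + 3)

# Delta-method variance: gradient %*% covariance %*% t(gradient)
var.g <- t(gr) %*% vc %*% gr
var.g

# Compare with the closed-form expression in Equation (B.2)
2*a^2*b1^2 + a^2*b2^2 + 10*a^2*b1*b2 + 12*a*b1 + 30*a*b2 + 18

Both expressions return the same number, which is the point of the delta method: only the gradient and the covariance matrix of the parameters are needed.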
For the logistic distribution associated with a logistic model, the cumulative distribution function, probability density function, and its derivative can be expressed as: F (z) = 1 1 + e−z f (z) = F 0 = −(1 + e−z )−2 e−z (−1) = f 0 (z) = (B.9) 1 e−z = F (1 − F ) 1 + e−z 1 + e−z ∂f = F 0 (1 − F ) + F (−F 0 ) = F 0 (1 − 2F ) = f (1 − 2F ) ∂z (B.10) (B.11) where f 0 is the derivative of probability density function. B.2 Predicted probability for ordered choice model Predicted probabilities are usually calculated for a continuous independent variable to reveal its impact. First, let us compute the value of predicted probability. Assume there are four ordered choices, i.e., J = 4. The predicted probability by category can be expressed as follows: p1 = F1 = P r(y = 1 | X) = F (u2 − Xβ) p2 = F2 = P r(y = 2 | X) = F (u3 − Xβ) − F (u2 − Xβ) (B.12) p3 = F3 = P r(y = 3 | X) = F (u4 − Xβ) − F (u3 − Xβ) p4 = F4 = P r(y = 4 | X) = 1 − F (u4 − Xβ) where u2 , u3 , and u4 are estimated thresholds. With the above arrangement, only J − 1 threshold values are estimated for an ordered choice model with J choices. The sum of all P4 probabilities at a given value is 1, i.e., i=1 pi = 1. To display the effect of a continuous independent variable on predicted probability, create a new matrix X̄ s=seq for all the independent variables. This matrix is first defined as a N ×K matrix, where N is an arbitrary number and K is the number of independent variables. Dimension N is the number of predicted probabilities of interest, and it can be different from the actual number of observations used for regression. In this new matrix, each row 524 Appendix B Ordered Choice Model and R Loops contains the same mean values for the K independent variables. Then the column value of a selected continuous variable is substituted by a sequence. In R, this sequence can be created with the command by seq(from = min(.), to = max(.), by = N). In general, the range of this sequence is the same as the actual range of the selected variable. In writing a function to calculate predicted probabilities, the number of categories J and the number of estimated thresholds are unknown in advance. Thus, a looping statement needs to be employed to compute predicted probabilities as expressed in Equation (B.12). This is feasible as only one loop is needed. However, this will become more cumbersome in calculating standard errors later as another loop is needed. The dimension of X̄ s=seq βb is (N × K) × (K × 1) = N × 1, so the predicted probabilities for each choice are a vector. In sum, with the estimated threshold and parameter values, Equation (B.12) can be used to calculate predicated probabilities by category. The above approach can be greatly simplified by vectorizing the operation. This starts with a small rearrangement of the above equation as follows: p1 = F (u2 − Xβ) − F (u1 − Xβ) = F (u2 − Xβ) − 0 p2 = F (u3 − Xβ) − F (u2 − Xβ) (B.13) p3 = F (u4 − Xβ) − F (u3 − Xβ) p4 = F (u5 − Xβ) − F (u4 − Xβ) = 1 − F (u4 − Xβ) where u1 is −∞; and u5 is ∞. For the purpose of actual computation and programming, a large constant is sufficient for generating a probability value of 0 or 1, e.g., u1 = −106 and u2 = 106 . With two constants added to the estimated thresholds, the above equation set can be reduced to a single equation with matrix notation. R is powerful in vectorization operation, and thus a looping statement is avoided and computation becomes more efficient. 
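The vectorized calculation can be previewed with a small sketch. The coefficient and threshold values below are assumed (hypothetical) for a probit model with four choices; the same logic is written compactly in Equation (B.14) immediately below and implemented later in the ocProb() function.

# A small sketch of the vectorized probabilities, using assumed (hypothetical)
# estimates for a probit model with J = 4 choices and two regressors.
b.est <- c(0.8, -0.5)                   # assumed coefficients
u <- c(-10^6, -1.2, 0.3, 1.5, 10^6)     # thresholds padded with two large constants
n <- 5; J <- 4

# New data matrix: means for all variables, a sequence for the selected one
mm <- cbind(x1 = seq(from = 0, to = 2, length.out = n), x2 = 0.4)
xb <- mm %*% b.est

# Conformable matrices Ua, Ub, and Z, then one matrix operation for all choices
Ub <- matrix(data = u[2:(J + 1)], nrow = n, ncol = J, byrow = TRUE)
Ua <- matrix(data = u[1:J],       nrow = n, ncol = J, byrow = TRUE)
Z  <- matrix(data = xb,           nrow = n, ncol = J, byrow = FALSE)
pp <- pnorm(Ub - Z) - pnorm(Ua - Z)
pp
rowSums(pp)   # each row sums to 1 across the four choices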
In combination, the predicted probabilities can be calculated as: P = F (Ub − Z) − F (Ua − Z) (B.14) where P has the dimension of N × J and is a combination of the probability vectors for all choices. In programming, the estimated values of thresholds and parameters are used to construct several matrices. To vectorize the computation, three matrices need to be carefully arranged so they are conformable. Specifically, Ub is a N × J matrix constructed without the first threshold, and each row has the same value, such as c(b u2 , u b3 , u b4 , u b5 ) for four choices. Ua is a matrix constructed without the last threshold, and each row has the same value, such as c(b u1 , u b2 , u b3 , u b4 ) for four choices. Z is also a N ×J matrix, with each column being repeatedly b In sum, Equation (B.14) vectorizes the operation in filled by the same value of X̄ s=seq β. Equation (B.12), and no matter how many categories an ordered choice model contains, the computation can be accomplished with a single matrix operation. To compute standard errors of predicted probability, take the third category as an example. With the estimated values substituted into the equation, this can be expressed as: b − F (b b F3 = P r(y = 3 | X̄ s=seq ) = F (b u4 − X̄ s=seq β) u3 − X̄ s=seq β) (B.15) where X̄ s=seq is the new matrix defined for a continuous variable. The variance of the predicted probability can be computed with the delta method. F3 is b u a function of three estimated parameters: β, b3 , and u b4 . Thus, the variance for the predicted 525 B.2 Predicted probability for ordered choice model probabilities and the three derivatives can be expressed as ∂F ∂F3 ∂F3 ∂F3 ∂F3 ∂F3 T b u var(F3 ) = ∂ βb3 ∂b cov( β, b , u b ) 3 4 b ∂b u3 ∂b u4 u3 ∂b u4 ∂β h i ∂F3 b − f (b b X̄ s=seq = [f3 − f4 ] X̄ s=seq = f (b u3 − X̄ s=seq β) u4 − X̄ s=seq β) ∂ βb ∂F3 b = −f3 = −f (b u3 − X̄ s=seq β) ∂b u3 ∂F3 b = f4 = f (b u4 − X̄ s=seq β) ∂b u4 (B.16) (B.17) (B.18) (B.19) b u where f3 and f4 are diagonal matrices, and the covariance matrix cov(β, b3 , u b4 ) is generated from the model fit. Note the covariance matrix of coefficients and relevant thresholds vary by choice. The variance vectors for all the choices could be combined together with some matrix arrangement, but the cost may be bigger than the benefit. Thus, they are computed with a looping statement in the ocProb() function, as showed in Program B.1. Program B.1 Calculating predicted probabilities for an ordered choice model 1 2 3 4 5 6 7 8 9 10 11 # A new function: probabilities for ordered choice ocProb <- function(w, nam.c, n = 100, digits = 3) { # 1. Check inputs if (!inherits(w, "polr")) { stop("Need an ordered choice model from 'polr()'.\n") } if (w$method != "probit" & w$method != "logistic") { stop("Need a probit or logit model.\n") } if (missing(nam.c)) stop("Need a continous variable name'.\n") 12 13 14 15 16 17 18 19 20 # 2. 
Abstract data out lev <- w$lev; J <- length(lev) x.name <- attr(x = w$terms, which = "term.labels") x2 <- w$model[, x.name] if (identical(sort(unique(x2[, nam.c])), c(0, 1)) || inherits(x2[, nam.c], what = "factor")) { stop("nam.c must be a continuous variable.") } 21 22 23 24 25 26 ww <- paste("~ 1", paste("+", x.name, collapse = " "), collapse = " ") x <- model.matrix(as.formula(ww), data = x2)[, -1] b.est <- as.matrix(coef(w)); K <- nrow(b.est) z <- c(-10^6, w$zeta, 10^6) # expand it with two extreme thresholds z2 <- matrix(data = z, nrow = n, ncol = length(z), byrow = TRUE) 27 28 29 30 31 32 pfun <- switch(w$method, probit = pnorm, logistic = plogis) dfun <- switch(w$method, probit = dnorm, logistic = dlogis) V2 <- vcov(w) # increase covarance matrix by 2 fixed thresholds V3 <- rbind(cbind(V2, 0, 0), 0, 0) ind <- c(1:K, nrow(V3)-1, (K+1):(K+J-1), nrow(V3)) 526 Appendix B Ordered Choice Model and R Loops V4 <- V3[ind, ]; V5 <- V4[, ind] 33 34 # 3. Construct x matrix and compute xb mm <- matrix(data = colMeans(x), ncol = ncol(x), nrow = n, byrow = TRUE) colnames(mm) <- colnames(x) ran <- range(x[, nam.c]) mm[, nam.c] <- seq(from = ran[1], to = ran[2], length.out = n) xb <- mm %*% b.est xb2 <- matrix(data = xb, nrow = n, ncol = J, byrow = FALSE) # J copy 35 36 37 38 39 40 41 42 # 4. Compute probability by category; vectorized on z2 and xb2 pp <- pfun(z2[, 2:(J+1)] - xb2) - pfun(z2[, 1:J] - xb2) trend <- cbind(mm[, nam.c], pp) colnames(trend) <- c(nam.c, paste("p", lev, sep=".")) 43 44 45 46 47 # 5. Compute the standard errors se <- matrix(data = 0, nrow = n, ncol = J) for (i in 1:J) { z1 <- z[i] - xb; z2 <- z[i+1] - xb d1 <- diag(c(dfun(z1) - dfun(z2)), n, n) %*% mm q1 <- - dfun(z1); q2 <dfun(z2) dr <- cbind(d1, q1, q2) V <- V5[c(1:K, K+i, K+i+1), c(1:K, K+i, K+i+1)] va <- dr %*% V %*% t(dr) se[, i] <- sqrt(diag(va)) } colnames(se) <- paste("Pred_SE", lev, sep = ".") 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 } # 6. Report results t.value <- pp / se p.value <- 2 * (1 - pt(abs(t.value), n - K)) out <- list() for (i in 1:J) { out[[i]] <- round(cbind(predicted_prob = pp[, i], error = se[, i], t.value = t.value[, i], p.value = p.value[, i]), digits) } out[[J+1]] <- round(x = trend, digits = digits) names(out) <- paste("predicted_prob", c(lev, "all"), sep = ".") result <- listn(w, nam.c, method=w$method, mean.x=colMeans(x), out, lev) class(result) <- "ocProb"; return(result) 74 75 76 77 78 79 80 # Example: include "Freq" to have a continuous variable for demo library(erer); library(MASS); data(housing); str(housing); tail(housing) reg2 <- polr(formula = Sat ~ Infl + Type + Cont + Freq, data = housing, Hess = TRUE, method = "probit") p2 <- ocProb(w = reg2, nam.c = 'Freq', n = 300); p2 plot(p2) B.3 Marginal effect for ordered choice model 527 # Selected results from Program B.1 > library(erer); library(MASS); data(housing); str(housing) ’data.frame’: 72 obs. of 5 variables: $ Sat : Ord.factor w/ 3 levels "Low"<"Medium"<..: 1 2 3 1 ... $ Infl: Factor w/ 3 levels "Low","Medium",..: 1 1 1 2 2 2 ... $ Type: Factor w/ 4 levels "Tower","Apartment",..: 1 1 1 ... $ Cont: Factor w/ 2 levels "Low","High": 1 1 1 1 1 1 1 1 1 1 ... $ Freq: int 21 21 28 34 22 36 10 11 36 61 ... 
> tail(housing)
      Sat   Infl    Type Cont Freq
67    Low Medium Terrace High   31
68 Medium Medium Terrace High   21
69   High Medium Terrace High   13
70    Low   High Terrace High    5
71 Medium   High Terrace High    6
72   High   High Terrace High   13

> p2 <- ocProb(w = reg2, nam.c = 'Freq', n = 300); p2
          Freq p.Low p.Medium p.High
[295,]  84.612 0.087    0.226  0.687
[296,]  84.890 0.086    0.225  0.688
[297,]  85.167 0.086    0.225  0.690
[298,]  85.445 0.085    0.224  0.691
[299,]  85.722 0.084    0.223  0.693
[300,]  86.000 0.084    0.222  0.694

B.3 Marginal effect for ordered choice model

Marginal effects for ordered choice models can be calculated in a way similar to that for the binary choice model. The formulas for dummy variables and continuous variables are different, with the latter case being much more challenging.

Marginal effect for dummy variables

Marginal effects for dummy independent variables are closely related to the concept of predicted probability. It is more efficient to compute them for each dummy separately. For illustration, assume there are four ordered choices and take the third category as an example:

$$
\begin{aligned}
\hat{F}_{3,1} &= Pr(y = 3 \mid \bar{X}_{d=1}) = F(\hat{u}_4 - \bar{X}_{d=1}\hat{\beta}) - F(\hat{u}_3 - \bar{X}_{d=1}\hat{\beta}) \\
\hat{F}_{3,0} &= Pr(y = 3 \mid \bar{X}_{d=0}) = F(\hat{u}_4 - \bar{X}_{d=0}\hat{\beta}) - F(\hat{u}_3 - \bar{X}_{d=0}\hat{\beta})
\end{aligned}
\tag{B.20}
$$

where $d$ is the selected dummy variable. In $\bar{X}_{d=1}$, the selected dummy variable takes the value of 1 only, and the other independent variables take their mean values. $\bar{X}_{d=0}$ is similarly defined with the dummy variable taking the value of 0. The difference between the two probability values can be computed as follows:

$$
m_{3d} = \Delta \hat{F} = Pr(y = 3 \mid \bar{X}_{d=1}) - Pr(y = 3 \mid \bar{X}_{d=0})
\tag{B.21}
$$

which is also referred to as the marginal effect of a dummy variable.

The delta method can be used to compute standard errors. Note that $m_{3d}$ is a function of two groups of estimated parameters, i.e., $\hat{\beta}$ and $\hat{u}$. Thus, the variance of the marginal effect for a dummy variable can be computed as:

$$
\operatorname{var}(m_{3d}) =
\begin{bmatrix} \dfrac{\partial m_{3d}}{\partial \hat{\beta}} & \dfrac{\partial m_{3d}}{\partial \hat{u}_3} & \dfrac{\partial m_{3d}}{\partial \hat{u}_4} \end{bmatrix}
\operatorname{cov}(\hat{\beta}, \hat{u}_3, \hat{u}_4)
\begin{bmatrix} \dfrac{\partial m_{3d}}{\partial \hat{\beta}} & \dfrac{\partial m_{3d}}{\partial \hat{u}_3} & \dfrac{\partial m_{3d}}{\partial \hat{u}_4} \end{bmatrix}^T
\tag{B.22}
$$

$$
\begin{aligned}
\frac{\partial m_{3d}}{\partial \hat{\beta}}
&= \left[ f(\hat{u}_3 - \bar{X}_{d=1}\hat{\beta}) - f(\hat{u}_4 - \bar{X}_{d=1}\hat{\beta}) \right] \bar{X}_{d=1}
 - \left[ f(\hat{u}_3 - \bar{X}_{d=0}\hat{\beta}) - f(\hat{u}_4 - \bar{X}_{d=0}\hat{\beta}) \right] \bar{X}_{d=0} \\
&= \left( f_{3,1} - f_{4,1} \right) \bar{X}_{d=1} - \left( f_{3,0} - f_{4,0} \right) \bar{X}_{d=0}
\end{aligned}
\tag{B.23}
$$

$$
\frac{\partial m_{3d}}{\partial \hat{u}_3}
= f_{3,0} - f_{3,1}
= -f(\hat{u}_3 - \bar{X}_{d=1}\hat{\beta}) + f(\hat{u}_3 - \bar{X}_{d=0}\hat{\beta})
\tag{B.24}
$$

$$
\frac{\partial m_{3d}}{\partial \hat{u}_4}
= f_{4,1} - f_{4,0}
= f(\hat{u}_4 - \bar{X}_{d=1}\hat{\beta}) - f(\hat{u}_4 - \bar{X}_{d=0}\hat{\beta})
\tag{B.25}
$$

where the two relevant threshold values are specific to the choice. To calculate the marginal effects and standard errors for multiple dummy variables, a looping statement can be used.

Marginal effect for continuous variables

First, let us compute the value of the marginal effects. For instance, taking the derivative of the probability for the first choice generates the following marginal effect:

$$
\begin{aligned}
m_1 = \frac{\partial p_1}{\partial X}
&= \frac{\partial F(\hat{u}_2 - X\hat{\beta})}{\partial(\hat{u}_2 - X\hat{\beta})} \frac{\partial(\hat{u}_2 - X\hat{\beta})}{\partial X}
 - \frac{\partial F(\hat{u}_1 - X\hat{\beta})}{\partial(\hat{u}_1 - X\hat{\beta})} \frac{\partial(\hat{u}_1 - X\hat{\beta})}{\partial X} \\
&= \hat{\beta} \left[ f(\hat{u}_1 - X\hat{\beta}) - f(\hat{u}_2 - X\hat{\beta}) \right].
\end{aligned}
\tag{B.26}
$$

Putting it all together, the marginal effects for an ordered choice model with four choices are:

$$
\begin{aligned}
m_1 &= \hat{\beta} \left[ f(\hat{u}_1 - X\hat{\beta}) - f(\hat{u}_2 - X\hat{\beta}) \right] \\
m_2 &= \hat{\beta} \left[ f(\hat{u}_2 - X\hat{\beta}) - f(\hat{u}_3 - X\hat{\beta}) \right] \\
m_3 &= \hat{\beta} \left[ f(\hat{u}_3 - X\hat{\beta}) - f(\hat{u}_4 - X\hat{\beta}) \right] \\
m_4 &= \hat{\beta} \left[ f(\hat{u}_4 - X\hat{\beta}) - f(\hat{u}_5 - X\hat{\beta}) \right]
\end{aligned}
\tag{B.27}
$$

where by definition the sum of all marginal effects at a given point is zero, i.e., $\sum_{i=1}^{4} m_i = 0$.
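As a minimal sketch of Equation (B.27), the following code evaluates the marginal effects at one point for a probit model and verifies that they sum to zero across the choices. The coefficient, threshold, and mean values are assumed (hypothetical) and serve only to illustrate the formula.

# A minimal sketch of Equation (B.27) with assumed (hypothetical) estimates
# for a probit model with J = 4 choices and two independent variables.
b.est <- c(0.8, -0.5)                   # assumed coefficients
u <- c(-10^6, -1.2, 0.3, 1.5, 10^6)     # thresholds padded by two large constants
x.bar <- c(1.0, 0.4)                    # evaluation point, e.g., variable means
xb <- sum(x.bar * b.est)

# m_j = beta * [f(u_j - xb) - f(u_{j+1} - xb)] for j = 1, ..., 4
J <- 4
me <- sapply(1:J, function(j) b.est * (dnorm(u[j] - xb) - dnorm(u[j + 1] - xb)))
colnames(me) <- paste("choice", 1:J, sep = ".")
me
rowSums(me)   # for each variable, the effects sum to zero across choices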
In programming, the computation of marginal effect can be vectorized in R as follows M = βb [f (Ua − Z) − f (Ub − Z)] (B.28) where M is the combination of marginal effects for all the choices. The matrices of Ua , Ub , and Z are the same as defined in Equation (B.14). If you are good at matrix calculus, then the relation between Equations (B.14) and (B.28) should be apparent. Standard errors for marginal effects can be computed by choice separately using the delta method. The value of each marginal effect is affected by all the parameter estimates 529 B.3 Marginal effect for ordered choice model b and two threshold values (e.g., u (β) b3 and u b4 for m3 ). For the first and last choices, taking derivatives with regard to the two fixed threshold values (i.e., u b1 and u b5 with four choices) will not affect the standard errors because the relevant variance is zero. Specifically, taking m3 as an example, the standard error is var(m3 ) = ∂m3 b ∂β ∂m3 ∂b u3 ∂m3 ∂b u4 b u cov(β, b3 , u b4 ) ∂m3 b ∂β ∂m3 ∂b u3 ∂m3 ∂b u4 T (B.29) b It is related to the product rule of matrix calculus The first derivative is on the vector β. and can be expressed as " # " # b ∂m3 βb b 0 β b 0 (−X 0 ) = (f 3 ⊗ I K ) + βf 3 (−X 0 ) − (f 4 ⊗ I K ) + βf 4 ∂ βb βb βb i i h h b 0) b 0 ) − f I K + f 0 (−βX (B.30) = f 3 I K + f 03 (−βX 4 4 b 0 ) − f I K + (−z4 f )(−βX b 0) f 3 I K + (−z3 f 3 )(−βX (probit) 4 4 = b 0 ) − f I K + f (1 − 2F 4 )(−βX b 0) f 3 I K + f 3 (1 − 2F 3 )(−βX (logit) 4 4 b 0 ) − f I K + S 4 (−βX b 0) = f 3 I K + S 3 (−βX 4 where S 3 = −z3 and S 4 = −z4 for a probit model, and S 3 = 1 − 2F 3 and S 4 = 1 − 2F 4 b Note Equations (B.8) for a logit model. In addition, z3 = u b3 − X βb and z4 = u b4 − X β. and (B.11) are used in the transformation. The second and third derivatives are about a single estimate: ( b b ∂m3 ∂f ∂(b u − X β) β(−z (probit) 3 3f 3) b 0 = = βb 3 = βf 3 b ∂b u3 ∂z3 ∂b u3 βf 3 (1 − 2F 3 ) (logit) b S3 = βf 3 ∂m3 b S4 = −βf 4 ∂b u4 (B.31) where S3 and S4 are the same as defined above. In sum, the value of marginal effect can be vectorized and computed together for all variables and choices with a single matrix operation. The variance is best handled by choice individually with a looping statement. All of these are implemented through the ocME() function, as shown in Program B.2. Program B.2 Calculating marginal effects for an ordered choice model 1 2 3 4 5 6 7 8 9 10 # A new function: marginal effect for ordered choice ocME <- function(w, rev.dum = TRUE, digits = 3) { # 1. Check inputs; similar to ocProb() # 2. Get data out: x may contains factors so use model.matrix lev <- w$lev; J <- length(lev) x.name <- attr(x = w$terms, which = "term.labels") x2 <- w$model[, x.name] ww <- paste("~ 1", paste("+", x.name, collapse = " "), collapse = " ") x <- model.matrix(as.formula(ww), data = x2)[, -1] # factor x changed 530 11 12 13 Appendix B Ordered Choice Model and R Loops x.bar <- as.matrix(colMeans(x)) b.est <- as.matrix(coef(w)); K <- nrow(b.est) xb <- t(x.bar) %*% b.est; z <- c(-10^6, w$zeta, 10^6) 14 15 16 17 18 19 20 pfun <- switch(w$method, probit = pnorm, logistic = plogis) dfun <- switch(w$method, probit = dnorm, logistic = dlogis) V2 <- vcov(w) # increase covarance matrix by 2 fixed thresholds V3 <- rbind(cbind(V2, 0, 0), 0, 0) ind <- c(1:K, nrow(V3)-1, (K+1):(K+J-1), nrow(V3)) V4 <- V3[ind,]; V5 <- V4[, ind] 21 22 23 24 25 26 # 3. 
    # 3. Calculate marginal effects (ME)
    # 3.1 ME value
    f.xb <- dfun(z[1:J] - xb) - dfun(z[2:(J+1)] - xb)
    me <- b.est %*% matrix(data = f.xb, nrow = 1)
    colnames(me) <- paste("effect", lev, sep = ".")

    # 3.2 ME error
    se <- matrix(0, nrow = K, ncol = J)
    for (j in 1:J) {
      u1 <- c(z[j] - xb); u2 <- c(z[j+1] - xb)
      if (w$method == "probit") {
        s1 <- -u1
        s2 <- -u2
      } else {
        s1 <- 1 - 2 * pfun(u1)
        s2 <- 1 - 2 * pfun(u2)
      }
      d1 <-      dfun(u1) * (diag(1, K, K) - s1 * (b.est %*% t(x.bar)))
      d2 <- -1 * dfun(u2) * (diag(1, K, K) - s2 * (b.est %*% t(x.bar)))
      q1 <-      dfun(u1) * s1 * b.est
      q2 <- -1 * dfun(u2) * s2 * b.est
      dr <- cbind(d1 + d2, q1, q2)
      V <- V5[c(1:K, K+j, K+j+1), c(1:K, K+j, K+j+1)]
      cova <- dr %*% V %*% t(dr)
      se[, j] <- sqrt(diag(cova))
    }
    colnames(se) <- paste("SE", lev, sep = ".")
    rownames(se) <- colnames(x)

    # 4. Revise ME and error for dummy variables
    if (rev.dum) {
      for (k in 1:K) {
        if (identical(sort(unique(x[, k])), c(0, 1))) {
          for (j in 1:J) {
            x.d1 <- x.bar; x.d1[k, 1] <- 1
            x.d0 <- x.bar; x.d0[k, 1] <- 0
            ua1 <- z[j] - t(x.d1) %*% b.est; ub1 <- z[j+1] - t(x.d1) %*% b.est
            ua0 <- z[j] - t(x.d0) %*% b.est; ub0 <- z[j+1] - t(x.d0) %*% b.est
            me[k, j] <- pfun(ub1) - pfun(ua1) - (pfun(ub0) - pfun(ua0))
            d1 <- (dfun(ua1) - dfun(ub1)) %*% t(x.d1) -
                  (dfun(ua0) - dfun(ub0)) %*% t(x.d0)
            q1 <- -dfun(ua1) + dfun(ua0); q2 <- dfun(ub1) - dfun(ub0)
            dr <- cbind(d1, q1, q2)
            V <- V5[c(1:K, K+j, K+j+1), c(1:K, K+j, K+j+1)]
            se[k, j] <- sqrt(c(dr %*% V %*% t(dr)))
          }
        }
      }
    }

    # 5. Output
    t.value <- me / se
    p.value <- 2 * (1 - pt(abs(t.value), w$df.residual))
    out <- list()
    for (j in 1:J) {
      out[[j]] <- round(cbind(effect = me[, j], error = se[, j],
        t.value = t.value[, j], p.value = p.value[, j]), digits)
    }
    out[[J+1]] <- round(me, digits)
    names(out) <- paste("ME", c(lev, "all"), sep = ".")
    result <- listn(w, out)
    class(result) <- "ocME"
    return(result)
  }

  # Example: The specification is from MASS.
  library(erer); library(MASS); data(housing); tail(housing)
  reg3 <- polr(formula = Sat ~ Infl + Type + Cont, data = housing,
    weights = Freq, Hess = TRUE, method = "probit")
  m3 <- ocME(w = reg3); m3

  # Selected results from Program B.2
  > m3 <- ocME(w = reg3); m3
                effect.Low effect.Medium effect.High
  InflMedium        -0.119        -0.016       0.135
  InflHigh          -0.255        -0.047       0.303
  TypeApartment      0.128         0.003      -0.131
  TypeAtrium         0.079         0.004      -0.083
  TypeTerrace        0.248        -0.008      -0.240
  ContHigh          -0.079        -0.007       0.086

Appendix C Data Analysis Project

Great efforts have been put into designing this data analysis project. While many exercises are included at the end of individual chapters in the book, none of them can challenge and motivate students like a real research project. With this belief in mind, three new sample studies (i.e., Sun, 2006a,b; Sun and Liao, 2011) have been carefully selected and prepared for the data project. Reproducing the results in one of the studies should help you learn research methods and data analyses, as promoted throughout the book, in a more systematic and complete manner. Manipulating raw data can be more demanding and difficult than fitting the regression itself. In this regard, Sun (2006a) is more difficult than the other two studies because a number of raw data sets need to be consolidated. Skipping the data preparation step is strongly discouraged, as it is an integral part of the whole research process.
However, if you indeed have a good reason to do so, note that the data sets for all three studies are included in the erer package, i.e., daRoll, daLaw, and daEsa. They can be loaded directly, e.g., data(daRoll). The results in the three studies can be reproduced with these data sets and the R functions in the package. The erer library contains several functions developed for these models, including marginal effects for binary choice and ordered probit models and, additionally, return and risk analyses for financial events.

C.1 Roll call analysis of voting records (Sun, 2006a)

Sun (2006a) is a roll call analysis of the Healthy Forests Restoration Act for fire management. A roll call analysis examines congressional voting records through binary logit or probit models; it is a type of statistical analysis widely used in political science. The statistical model and outputs from this study are very similar to those in Sun et al. (2007) because binary choice models are the key methodology. Thus, among the three sample studies for exercises, the statistical model in Sun (2006a) is the easiest to estimate.

The raw data are saved as "RawDataProjectARoll.xlsx" in the erer library. A number of data sources are used to collect the raw data, and there are a total of eight data sheets. The first challenge is that there are four model specifications, so four corresponding data sets should be prepared from the raw data. Various data manipulation techniques are needed to merge, index, and combine the raw data into the final data sets for the regressions. To mitigate the possible frustration in moving from the raw data to the four individual data sets, daRoll is included in the erer package; this is the combined data set for all specifications. The sample code on the help page of daRoll can generate the four individual data sets. For a complete exercise, however, one should start with the raw data, generate a data frame object similar to daRoll, and finally extract the four individual data sets for the regression analyses.

All three tables and two figures in Sun (2006a) can be completely reproduced by R. Some notes are provided here to help you reproduce the results; in your R console, the results from your program should look similar to the outputs shown below.

• Raw data manipulation: data for this study are in "RawDataProjectARoll.xlsx". Using merge(), ifelse(), and other functions in R, the raw data sets can be converted into a data frame object similar to daRoll in the erer library. From this combined data frame, four individual data sets can be created.
• Table 1 can be generated by bsStat() with additional manipulations.
• Table 2 can be created by glm(), bsTab(), and logLik().
• Table 3 can be generated by maBina() and bsTab(); a short sketch of this workflow follows this list.
• Figure 1 can be generated with the fire data in "RawDataProjectARoll.xlsx" and the traditional graphics system. Note that the left y-axis is for acres and the right y-axis is for fire numbers. To overlay the two graphs, consult the techniques in Section 11.5 Region, coordinate, clipping, and overlaying on page 232.
• Figure 2 can be generated with maTrend(). The mfrow argument in par() can arrange four graphs on one device. The recordPlot() function can be used to record a graph on a screen device, and replayPlot() can then save it on a file device.
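The following lines give only a rough sketch of the workflow behind Tables 2 and 3. The regression formula is a placeholder, not the actual specification: substitute the column names of your reconstructed data set (or of daRoll). Only the functions named above are used; maBina() is assumed here to need the model matrix, so the glm() call keeps x = TRUE, and the bsTab() arguments may need adjusting against its help page.

  library(erer)
  data(daRoll)
  # Hypothetical specification: replace 'vote' and the regressors with the
  # actual column names from your data set.
  fit <- glm(vote ~ party + contrFY + contrEN,
             family = binomial(link = "logit"), data = daRoll, x = TRUE)
  logLik(fit)              # log-likelihood reported in Table 2
  bsTab(fit)               # coefficients and t-values in manuscript style (Table 2)
  eff <- maBina(w = fit)   # marginal effects for the binary choice model (Table 3)
  eff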
> print(table.1, right = FALSE)
   Variable Description                                       Mean (min, max)
1  Vote     Dummy dependent variable equals one if vote is yes   0.70
2  RepParty Dummy equals one if Republican                        0.53
3  East     Regional dummy for 11 northeastern states             0.21
4  West     Regional dummy for 11 western states                  0.22
5  South    Regional dummy for 13 southern states                 0.31
6  PopDen   Population density per km2                            0.78 (0, 24.56)
7  PopRural % of rural population                                 0.22 (0, 0.79)
8  Edu      % of population over 25 with a Bachelor's degree      0.15 (0.04, 0.35)
9  Income   Median family income ($1,000)                        50.78 (20.92, 91.57)
10 FYland   % of federal lands in total forestlands 2002          0.12
11 Size     Value of shipments of industry 1997 ($ million)       9.53 (0.12, 21.01)
12 ContrFY  Contribution from forest firms ($1,000)               4.95 (0, 151.56)
13 ContrEN  Contribution from environmental groups ($1,000)       1.61 (0, 34.63)
14 Sex      Dummy equals one if male                              0.86
15 Lawyer   Dummy equals one if lawyer                            0.42
16 Member   Dummy equals one if a committee member for HFRA       0.13
17 Year     Number of years in the position                      10.43 (0, 48.00)
18 Chamber  Dummy equals one if House and zero if Senate          0.82

> table.2
[Table 2 output: logit coefficient estimates and t-values for the 18 regressors under the HR-May, HR-Nov, HR-Senate, and Democrat specifications, together with the number of observations and log-likelihoods; see Table 2 in Sun (2006a) for the full set of values.]
> table.3
   Variable HS (Dem)  t1       HS (Rep)  t2      HS(all)  t3       Demo (all)  t4
1  RepParty   0.956    8.55***   0.013    1.64    0.211    3.02***   __         __
2  East      -0.132   -1.23     -0.002   -0.84   -0.035   -0.97     -0.079     -0.59
3  West       0.346    2.65***   0.003    1.43    0.056    2.16**    0.416      3.02***
4  South      0.309    2.40**    0.004    1.33    0.068    1.92*     0.401      2.65***
5  PopDen    -0.040   -0.99     -0.001   -0.79   -0.009   -0.86     -0.038     -0.90
6  PopRural   0.721    2.52***   0.010    1.29    0.159    1.90*     0.744      2.44**
7  Edu       -1.349   -0.96     -0.018   -0.82   -0.297   -0.92     -2.144     -1.34
8  Income     0.000    0.06      0.000    0.06    0.000    0.06      0.003      0.40
9  FYland     0.238    0.98      0.003    0.87    0.052    0.96      0.173      0.65
10 Size       0.003    0.46      0.000    0.43    0.001    0.45     -0.003     -0.30
11 ContrFY    0.058    2.66***   0.001    2.14**  0.013    4.26***   0.046      2.20**
12 ContrEN   -0.017   -2.17**   -0.000   -1.37   -0.004   -1.89*    -0.011     -1.35
13 Sex        0.062    0.53      0.001    0.50    0.014    0.52      0.097      0.74
14 Lawyer    -0.066   -0.82     -0.001   -0.74   -0.015   -0.80     -0.151     -1.62
15 Member     0.279    1.91*     0.003    1.36    0.043    1.99**    0.349      2.38**
16 Year       0.002    0.35      0.000    0.35    0.000    0.35      0.005      0.90
17 Chamber   -0.417   -3.37***  -0.004   -1.54   -0.061   -2.64***  -0.378     -3.05***

Figure 1 in Sun (2006a): Acres burned and fire numbers on federal lands from 1960 to 2003 (left y-axis: acres burned in million acres; right y-axis: number of fires in thousands).

Figure 2 in Sun (2006a): Probability response curves under the HR-Senate specification for the federal forestland share, the rural population share, contributions from the forest industry, and contributions from environmental groups, with separate curves for all members, Democrats, and Republicans. (Note: Dashed lines are positioned at the means of the explanatory variables.)

C.2 Ordered probit model on law reform (Sun, 2006b)

In Sun (2006b), an ordered probit model is employed to examine factors that have influenced the retention of certain liability laws for prescribed fire in the United States. All four tables and the one figure in this study can be completely reproduced by R. Overall, the data transformation in this study is easy, but the ordered probit model is difficult to estimate.

• Raw data manipulation: raw data are saved as "RawDataProjectBBurn.xlsx". The amount of data manipulation is small. The data set daLaw in the erer library has the same content as the raw data.
• Table 1 can be generated with some indexing operations.
• Table 2 can be created by bsStat() and some additional manipulations.
• Table 3 can be generated by polr() in the MASS library. This function needs a factor as the dependent variable; the relevant column in daLaw is a character string, so it should be converted first with factor(). Exercise 10.5.9 Data frame manipulation on page 211 is relevant here. To improve model convergence, some initial values (from a guess or from the published version) may need to be supplied through the start argument of polr(). A rough sketch of these steps follows this list.
• Table 4 can be generated by ocME() from the erer library.
• Figure 1 can be generated with ocProb() and base R graphics functions.
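As a rough sketch of these steps, the lines below assume that daLaw contains the dependent variable and regressors named in Table 2 of the study (Y, FYNFS, FYIND, FYNIP, EDU, SEAT, COMIT, RATIO); check the help page of daLaw for the actual column names, and take the exact specification and any starting values from the published paper.

  library(erer); library(MASS)
  data(daLaw)
  # polr() needs a factor response; the ordering of the levels matters, so set
  # it explicitly if the default alphabetical ordering is not 0 < 1 < 2 < 3.
  daLaw$Y <- factor(daLaw$Y)
  fit <- polr(Y ~ FYNFS + FYIND + FYNIP + EDU + SEAT + COMIT + RATIO,
              data = daLaw, method = "probit", Hess = TRUE)
  # If convergence fails, supply initial values through polr(..., start = ...).
  summary(fit)                                    # coefficients for Table 3
  ocME(w = fit)                                   # marginal effects for Table 4
  p <- ocProb(w = fit, nam.c = "FYNIP", n = 100)  # probabilities for Figure 1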
> print(table.1, right = FALSE)
  Strict liability (N = 6):     Delaware, Hawaii, Minnesota, Pennsylvania, Rhode Island, Wisconsin
  Uncertain liability (N = 22): Arizona, Colorado, Connecticut, Idaho, Illinois, Indiana, Iowa, Kansas, Maine, Massachusetts, Missouri, Montana, Nebraska, New Mexico, North Dakota, Ohio, South Dakota, Tennessee, Utah, Vermont, West Virginia, Wyoming
  Simple negligence (N = 18):   Alabama, Alaska, Arkansas, California, Kentucky, Louisiana, Maryland, Mississippi, New Hampshire, New Jersey, New York, North Carolina, Oklahoma, Oregon, South Carolina, Texas, Virginia, Washington
  Gross negligence (N = 4):     Florida, Georgia, Michigan, Nevada

> print(table.2, right = FALSE)
   Variable   Mean  Minimum  Maximum  Definition
1  Y           1.4      0.0      3.0  Categorical dependent variable (Y = 0, 1, 2, or 3)
2  FYNFS       3.0      0.0     18.5  National Forests area in a state (million acres)
3  FYIND       1.3      0.0      7.4  Industrial forestland area in a state (million acres)
4  FYNIP       7.3      0.3     35.9  Nonindustrial private forestland area in a state (million acres)
5  AGEN      312.8     23.0   3735.0  Permanent forestry program personnel in a state
6  POPRUR      1.2      0.1      3.6  Rural population in a state (million)
7  EDU         0.3      0.0      2.0  Population 25 years old with advanced degrees in a state (million)
8  INC        20.8     15.8     28.8  Per capita income in a state (thousand dollars)
9  DAY       166.3     42.0    350.0  The maximum length of legislative sessions in calendar days in a state
10 BIANN       0.3      0.0      1.0  A dummy variable equal to one for states with annual legislative sessions, ...
11 SEAT      147.6     49.0    424.0  Total number of legislative seats (Senate plus House) in the legislative ...
12 BICAM       2.9      0.0     16.7  Level of bicameralism in a state, defined as the size of the Senate ...
13 COMIT      34.6     10.0     69.0  Total number of standing committees in a state
14 RATIO       4.9      1.2     18.6  Total number of standing committees in a state divided by ...

> table.3
[Table 3 output: ordered probit coefficient estimates and t-ratios for three specifications, with log-likelihood values of -52.42, -46.19, and -46.30; see Table 3 in Sun (2006b) for the full set of values.]

> table.4
   Variable   Y = 0   Y = 1   Y = 2   Y = 3
1  FYNFS      0.005   0.012  -0.014  -0.003
2  FYIND     -0.052  -0.126   0.149   0.029
3  FYNIP     -0.005  -0.013   0.015   0.003
4  EDU       -0.169  -0.413   0.488   0.094
5  SEAT      -0.001  -0.003   0.004   0.001
6  COMIT      0.008   0.020  -0.024  -0.005
7  RATIO      0.034   0.084  -0.099  -0.019

Figure 1 in Sun (2006b): Effects of NIPF land area on the probability of retaining different liability rules, with one probability curve for each category (Y = 0 strict liability, Y = 1 uncertain liability, Y = 2 simple negligence, Y = 3 gross negligence) plotted against NIPF land area in a state (million acres).

C.3 Event analysis of ESA (Sun and Liao, 2011)

In Sun and Liao (2011), event analysis is employed to examine the impact of six court decisions related to the Endangered Species Act (ESA) on the financial performance of U.S. forest products firms.
The methodology is widely used in financial economics. Ordinary least squares is used to estimate all the models; thus, while the data are time series, the time dimension is irrelevant for model estimation. The work for data preparation is moderate. The main challenge is to use looping statements for the repeated analyses by case and by window. Several functions in the erer library are relevant; in particular, the generic function update() should be used repeatedly. The three tables (Tables 2 to 4) and the figure in this study can be completely reproduced by R.

• Raw data manipulation: raw data for this study are in "RawDataProjectCEsa.xlsx". The data set daEsa in the erer library has similar contents to the raw data. The event date information is in Table 1 of the published paper.
• Table 1 is descriptive event information, and there is no need to reproduce it by R.
• Table 2 can be created by the evReturn() function in the erer package and update(); a sketch of the looping pattern follows this list.
• Tables 3 and 4 can be generated by evRisk().
• Figure 1 can be generated with evReturn() and some graphics functions. The limits of the y-axes in the two plots are set to be the same for comparison. The published version was generated with the ggplot2 package. In this particular case, generating the graph by ggplot2 can be challenging for beginners, partly because of adding labels to a panel graph (i.e., facets); the trick is to add some columns to the data frame for coordinates or labels and then use geom_text() to add the labels. In base R graphics, layout() can be used to create two figure regions. The y-axis titles for the two plots are combined: this is implemented by first setting the oma argument in par() to be bigger than zero and then adding the label through mtext(..., outer = TRUE). Two shaded boxes with text are added at the right end.
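The update() idiom mentioned above can be sketched with a generic example. Because the exact arguments of evReturn() are documented on its help page and are not repeated here, the sketch below uses lm() as a stand-in; the same pattern of fitting one base model and then looping over small changes with update() applies to evReturn() and evRisk() when repeating the analysis by case and by event window.

  # Fit one base model, then rerun it repeatedly with update(), changing only
  # one piece at a time and collecting the results in a list.
  base <- lm(mpg ~ wt, data = mtcars)            # stand-in for the base estimation
  specs <- list(. ~ . + hp, . ~ . + hp + qsec, . ~ . - wt + disp)
  fits <- vector("list", length(specs))
  for (i in seq_along(specs)) {
    fits[[i]] <- update(base, formula. = specs[[i]])
  }
  sapply(fits, function(m) round(summary(m)$r.squared, 3))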
> (output <- listn(table.2, table.3, table.4))
$table.2
   Case  (-2, 2)    (-3, 3)     (-4, 4)    (-5, 5)    (-6, 6)    (-7, 7)
1  I     -1.872*    -0.672       1.055      0.799      0.187      1.632
2        (-1.954)   (-0.593)    (0.820)    (0.562)    (0.121)    (0.983)
3  II    -1.023     -0.971      -1.459     -1.435     -1.992     -1.560
4        (-0.882)   (-0.708)    (-0.937)   (-0.833)   (-1.065)   (-0.779)
5  III   -1.292     -0.024      -0.712      0.692      1.516      1.698
6        (-1.057)   (-0.017)    (-0.434)   (0.381)    (0.770)    (0.802)
7  IV     3.873***   2.580       2.015      5.244***   3.066      5.479**
8        (2.869)    (1.618)     (1.116)    (2.656)    (1.429)    (2.379)
9  V      4.040**   11.550***    6.538***   4.631**    6.002**    2.476
10       (2.587)    (6.274)     (3.142)    (2.016)    (2.409)    (0.925)
11 VI    -0.328     -2.112      -3.124*    -2.267     -1.943     -2.647
12       (-0.243)   (-1.325)    (-1.726)   (-1.132)   (-0.892)   (-1.130)

$table.3
[Table 3 output: estimated beta and gamma coefficients for the 14 firms (BBC, BOW, CSK, GP, IP, KMB, LPX, MWV, PCH, PCL, POP, TIN, WPP, WY) under cases I to III; see Table 3 in Sun and Liao (2011) for the full set of values.]

$table.4
   Firm  IV_beta   IV_gamma  V_beta    V_gamma  VI_beta   VI_gamma
1  BBC   0.089      0.502**  0.791***  -0.099   1.180***  -0.235**
2  BOW   0.376*     0.271    0.708***  -0.075   1.106***  -0.188*
3  CSK   0.368**    0.283    0.766***  -0.234   0.993***  -0.054
4  GP    0.622***   0.180    0.884***  -0.131   1.598***  -0.686***
5  IP    0.278      0.188    0.939***  -0.261   1.036***  -0.015
6  KMB   0.632***  -0.088    1.041***  -0.390*  0.723***   0.222***
7  LPX   0.639***  -0.138    1.185***  -0.499*  1.376***  -0.270*
8  MWV   0.421**    0.164    0.850***  -0.105   1.189***  -0.217***
9  PCH   0.519***   0.091    0.795***  -0.063   1.043***  -0.064
10 PCL   0.215*     0.122    0.528***  -0.133   0.951***   0.022
11 POP   0.175      0.470    0.522***  -0.020   0.801***   0.112
12 TIN   0.802***  -0.292    0.845***  -0.085   1.152***  -0.156**
13 WPP   0.480***   0.112    1.018***  -0.245   1.214***  -0.180*
14 WY    0.670***  -0.094    0.882***  -0.225   1.093***  -0.156**

Figure 1 (Sun and Liao, 2011): Average cumulative abnormal returns (%) over event days -7 to 7 in two panels, (a) negative responses for cases I and VI and (b) positive responses for cases IV and V, with the annotations I (-2, 2): -1.872%, VI (-4, 4): -3.124%, V (-6, 6): +6.002%, and IV (-7, 7): +5.479% (created by ggplot2).

Figure 1 (Sun and Liao, 2011): Average cumulative abnormal returns (created by base R graphics).

References

Abadir, K. and Magnus, J. 2005. Matrix Algebra. Cambridge University Press, Cambridge.

Adler, J. 2010. R in a Nutshell: A Desktop Quick Reference. O'Reilly, Sebastopol, CA.

Akerlof, G. 1970. The market for 'lemons': quality uncertainty and the market mechanism. Quarterly Journal of Economics 84:488–500.

Alston, J. and Chalfant, J. 1993.
The silence of the lambdas: a test of the almost ideal and Rotterdam models. American Journal of Agricultural Economics 75:304–313. Amacher, G., Ollikainen, M., and Koskela, E. 2009. Economics of Forest Resources. The MIT Press, Cambridge, MA. Baltagi, B. 2011. Econometrics. Springer-Verglag Berlin Heidelberg, New York, NY, 5th edition. Bivand, R., Pebesma, E., and Gomez-Rubio, V. 2008. Applied Spatial Data Analysis with R. Springer, New York, NY. Blackburn, T. 2003. Getting Science Grants: Effective Strategies for Funding Success. Jossey-Bass, San Francisco, CA. Braun, W. and Murdoch, D. 2008. A First Course in Statistical Programming with R. Cambridge University Press, New York, NY. Buehlmann, U., Bumgardner, M., Lihra, T., and Frye, M. 2006. Attitudes of U.S. retailers toward China, Canada, and the United States as manufacturing sources for furniture: an assessment of competitive priorities. Journal of Global Marketing 20:61–73. Chan, K. 1993. Consistency and limiting distribution of the least squares estimator of a threshold autoregressive model. The Annals of Statistics 21:520–533. Chiang, A. and Wainwright, K. 2004. Fundamental Methods of Mathematical Economics. McGraw-Hill, New York, NY, 4th edition. Coase, R. 1960. The problem of social cost. Journal of Law and Economics 3:1–44. Covey, S. 2013. The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change. Simon and Schuster, New York, NY, 2nd edition. 541 542 References Dalgaard, P. 2008. Introductory Statistics with R. Springer, London, UK, 2nd edition. Deaton, A. and Muellbauer, J. 1980. An almost ideal demand system. American Economic Review 70:312–326. Edgerton, D. 1993. On the estimation of separation demand models. Journal of Agricultural and Resource Economics 18:141–146. Enders, W. 2010. Applied Econometric Time Series. John Wiley and Sons, Inc., New York, NY, 3rd edition. Enders, W. and Granger, C. 1998. Unit-root tests and asymmetric adjustment with an example using the term structure of interest rates. Journal of Business and Economic Statistics 16:304–311. Enders, W. and Siklos, P. 2001. Cointegration and threshold adjustment. Journal of Business and Economic Statistics 19:166–176. Engle, R. and Granger, C. 1987. Co-integration and error correction: representation, estimation, and testing. Econometrica 55:251–276. Everitt, B. and Hothorn, T. 2009. A Handbook of Statistical Analyses Using R. Chapman and Hall/CRC, Boca Raton, FL, 2nd edition. Faustmann, M. 1995. Calculation of the value which forest land and immature stands possess for forestry. Journal of Forest Economics 1:7–44. Frey, G. and Manera, M. 2007. Econometric models of asymmetric price transmission. Journal of Economic Surveys 21:349–415. Gallet, C. 2010. The income elasticity of meat: a meta-analysis. Australian Journal of Agricultural and Resource Economics 54:477–490. Gardner, B. 1975. Farm retail price spread in a competitive food industry. American Journal of Agricultural Economics 57:399–409. Granger, C. and Lee, T. 1989. Investigation of production, sales, and inventory relationships using multicointegration and non-symmetric error correction models. Journal of Applied Econometrics 4:S145–S159. Greene, W. 2011. Econometric Analysis. Prentice Hall, New York, NY, 7th edition. Hartman, R. 1976. The harvesting decision when a standing forest has value. Economic Inquiry 14:52–58. Henneberry, S., Piwethongngam, K., and Qiang, H. 1999. Consumer food safety concerns and fresh produce consumption. 
Journal of Agricultural and Resource Economics 24:98–113. Henningsen, A. and Hamann, J. 2007. systemfit: a package for estimating systems of simultaneous equations in R. Journal of Statistical Software 23:1–40. Johnson, E. 1991. The Handbook of Good English. Washington Square Press, New York, NY. References 543 Jones, O., Maillardet, R., and Robinson, A. 2009. Introduction to Scientific Programming and Simulation Using R. Chapman and Hall/CRC, Boca Raton, FL. Judge, G., Griffiths, W., Hill, R., Lutkepohl, H., and Lee, T.-C. 1985. The Theory and Practice of Econometrics. John Wiley and Sons, New York, NY, 2nd edition. Kleiber, C. and Zeileis, A. 2008. Applied Econometrics with R. Springer, New York, NY. Lawrence, M. and Verzani, J. 2012. Programming Graphical User Interfaces in R. CRC Press, Boca Raton, FL. Luo, X., Sun, C., Jiang, H., Zhang, Y., and Meng, Q. 2015. International trade after intervention: the case of bedroom furniture. Forest Policy and Economics 50:180–191. MacKinlay, A. 1997. Event studies in economics and finance. Journal of Economic Literature 35:13–39. Matloff, N. 2011. The Art of R Programming: A Tour of Statistical Software Design. No Starch Press, San Francisco, CA. Mei, B. and Sun, C. 2008. Assessing time-varying oligopoly and oligopsony power in the U.S. paper industry. Journal of Agricultural and Applied Economics 40:927–939. Meyer, J. and von Cramon-Taubadel, S. 2004. Asymmetric price transmission: a survey. Journal of Agricultural Economics 55:581–611. Morrison, D. and Russell, S. 2005. The Grant Application Writer’s Workbook: Successful Proposals to any Agency. Grant Writers’ Seminars and Workshops, LLC, Buellton, CA. Murrell, P. 2011. R Graphics. CRC Press, London, UK, 2nd edition. Nelsen, R. 2006. An Introduction to Copulas. Springer, New York, NY, 2nd edition. Newman, D. 2002. Forestry’s golden rule and the development of the optimal forest rotation literature. Journal of Forest Economics 8:5–27. Pfaff, B. 2008. Analysis of Integrated and Cointegrated Time Series with R. Springer Science and Business Media, LLC, New York, NY, 2nd edition. Piggott, N. and Wohlgenant, M. 2002. Price elasticities, joint products, and international trade. Australian Journal of Agricultural and Resource Economics 46:487–500. Spector, P. 2008. Data Manipulation with R. Springer, New York, NY. Sun, C. 2006a. A roll call analysis of the Healthy Forests Restoration Act and constituent interests in fire policy. Forest Policy and Economics 9:126–138. Sun, C. 2006b. State statutory reforms and retention of prescribed fire liability laws on US forest land. Forest Policy and Economics 9:392–402. Sun, C. 2011. Price dynamics in the import wooden bed market of the United States. Forest Policy and Economics 13:479–487. Sun, C. 2014. Recent growth in China’s roundwood import and its global implications. Forest Policy and Economics 39:43–53. 544 References Sun, C. and Liao, X. 2011. Effects of litigation under the Endangered Species Act on forest firm values. Journal of Forest Economics 17:388–398. Sun, C., Pokharel, S., Jones, W., Grado, S., and Grebner, D. 2007. Extent of recreational incidents and determinants of liability insurance coverage for hunters and anglers in Mississippi. Southern Journal of Applied Forestry 31:151–158. Sun, C. and Tolver, B. 2012. Assessing administrative laws for forestry prescribed burning in the southern United States: a management-based regulation approach. International Forestry Review 14:337–348. Sun, C. and Zhang, D. 2003. 
The effects of exchange rate volatility on U.S. forest commodities exports. Forest Science 49:807–814. University of Chicago Press. 2003. The Chicago Manual of Style: The Essential Guide for Writers, Editors, and Publishers. The University of Chicago Press, Chicago, IL, 15th edition. Vinod, H. 2008. Hands-on Intermediate Econometrics Using R: Templates for Extending Dozens of Practical Examples. World Scientific, New Jersey, USA. Wan, Y., Sun, C., and Grebner, D. 2010a. Analysis of import demand for wooden beds in the United States. Journal of Agricultural and Applied Economics 42:643–658. Wan, Y., Sun, C., and Grebner, D. 2010b. Intervention analysis of the antidumping investigation on wooden bedroom furniture imports from China. Canadian Journal of Forest Research 40:1434–1447. Wickham, H. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer, London, UK. Wright, B., Kaiser, R., and Nicholls, S. 2002. Rural landowner liability for recreational injuries: myths, perceptions, and realities. Journal of Soil and Water Conservation 57:183–191. List of Figures, Tables, and Programs Figures 1.1 2.1 2.2 2.3 2.4 2.5 4.1 4.2 5.1 6.1 6.2 6.3 6.4 6.5 6.6 7.1 7.2 11.1 11.2 11.3 11.4 11.5 11.6 11.7 11.8 11.9 11.10 11.11 11.12 11.13 15.1 15.2 15.3 The structure of the book . . . . . . . . . . . . . . . . . . . . . . . . . . R graphical user interface on Microsoft Windows . . . . . . . . . . . . . Interface of the alternative editor Tinn-R . . . . . . . . . . . . . . . . . Interface of the alternative editor RStudio . . . . . . . . . . . . . . . . . Main interface of the R Commander with a linear regression . . . . . . . Loading a data set, drawing a scatter plot, and fitting a linear model . . A comparison of an author’s work and a reader’s memory . . . . . . . . Three roles, jobs, and outputs related to an empirical study . . . . . . . The relation among keywords in a proposal for social science . . . . . . Saving PDF documents in a single folder on a local drive . . . . . . . . . Interface of a library in EndNote . . . . . . . . . . . . . . . . . . . . . . Inserting references in Microsoft Word through EndNote . . . . . . . . . Customizing reference fields in EndNote . . . . . . . . . . . . . . . . . . Interface of a library in Mendeley . . . . . . . . . . . . . . . . . . . . . . Inserting references in Microsoft Word through Mendeley . . . . . . . . Importing external Excel data in the R Commander . . . . . . . . . . . Fitting a binary logit model in the R Commander . . . . . . . . . . . . . Graphs with different graphical parameters . . . . . . . . . . . . . . . . Plotting multiple time series of wooden bed trade on a single page . . . Displaying math symbols on a graph . . . . . . . . . . . . . . . . . . . . Defining sizes of regions and margins and their relations . . . . . . . . . Understanding region, user coordinates, clipping, and overlaying . . . . Default probability response curves for hunting experience . . . . . . . . Customizing probability curves in Sun et al. (2007) with base R graphics Changes in consumer and producer surplus under a demand shift . . . . Diagram of an author’s work and a reader’s memory by base R . . . . . A diagram illustrating management-based regulations . . . . . . . . . . Probability response curves for hunting ages . . . . . . . . . . . . . . . . Changing a clipping area and user coordinates . . . . . . . . . . . . . . . A more challenging market graph with a supply shift . . . . . . . . . . . The curve for the function of y = (x + a)2 + 20 . . . . . . . . 
. . . . . . Variation of weekdays associated with a birthday by year . . . . . . . . The curve for the function of y = 50 − 2 × sin(x − 5) . . . . . . . . . . . 545 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 13 16 17 23 26 44 47 61 84 87 88 89 94 95 111 112 215 227 231 233 236 239 241 243 244 248 248 249 249 342 350 357 546 References 16.1 16.2 16.3 16.4 16.5 16.6 17.1 17.2 19.1 19.2 19.3 19.4 20.1 20.2 20.3 20.4 20.5 20.6 20.7 20.8 Viewports, primitive functions, graphical parameters, and units in grid Probability response curves in Sun et al. (2007) by grid . . . . . . . . . The diagram of an author’s work and a reader’s memory by grid . . . . Probability response curves in Sun et al. (2007) by ggplot2 . . . . . . . Import shares of beds in Wan et al. (2010a) by ggplot2 and grid . . . Intensity of fire regulations in Sun and Tolver (2012) by map() . . . . . Monthly import value of beds from China and Vietnam (base R) . . . . Monthly import value of beds from China and Vietnam (ggplot2) . . . The skeleton for a new package on a local drive . . . . . . . . . . . . . . A screenshot of the help document for the ciTarFit() function . . . . . Finding the environment variable and path on Windows 7 . . . . . . . . The command prompt window for building R packages . . . . . . . . . . A static view of chess board created by base R graphics . . . . . . . . . Two simple examples for understanding GUIs . . . . . . . . . . . . . . . A graphical user interface for the correlation between two variables . . . Overall appearance for the guiApt() GUI with the tabs on the left . . . Two screenshots for the guiApt() GUI with the tabs at the bottom . . A GUI for calculating net monthly income . . . . . . . . . . . . . . . . . A GUI for R calculator with six simple operations . . . . . . . . . . . . A revised GUI for the correlation between two variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367 370 372 380 383 388 408 409 430 433 442 443 447 450 457 461 462 466 466 467 . . . . . . . . . . . . paper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 32 41 43 51 52 53 54 64 80 103 117 133 156 254 284 397 417 . . . . . . . . . . . . 113 114 118 121 Tables 1.1 3.1 3.2 4.1 4.2 4.3 4.4 4.5 5.1 6.1 7.1 7.2 8.1 9.1 12.1 13.1 17.1 18.1 Three versions of an empirical study . . . . . . . . . . . . . . . . Journal of Economic Literature classification system . . . . . . . A comparison of three types of economic studies . . . . . . . . . A production comparison between building a house and writing a Main components in the proposal version of an empirical study . Structure of the program version for an empirical study . . . . . Four stages of data analyses and software usage . . . . . . . . . . Structure of the manuscript version for an empirical study . . . . The outline from a funded proposal narrative document . . . . . Tasks and goals for reference and file management . . . . . . . . A draft table for the logit regression analysis in Sun et al. (2007) A comparison of grammar rules between English and R . . . . . Constructor, predicate, and coercion functions for R objects . . . Major functions for character string manipulation in R . . . . . . A draft table for the descriptive statistics in Wan et al. (2010a) . 
Coefficients for the static AIDS model in Wan et al. (2010a) . . . A draft table for the cointegration analyses in Sun (2011) . . . . Documents and functions included in the apt package . . . . . . Programs 7.1 7.2 7.3 7.4 The first program version for Sun et al. (2007) . . . . . . . The final program version for Sun et al. (2007) . . . . . . Curly brace match in the if statement . . . . . . . . . . . A poorly formatted program version for Sun et al. (2007) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547 References 7.5 8.1 8.2 8.3 8.4 8.5 8.6 8.7 9.1 9.2 9.3 9.4 9.5 9.6 9.7 9.8 10.1 10.2 10.3 10.4 10.5 11.1 11.2 11.3 11.4 11.5 11.6 11.7 11.8 11.9 11.10 12.1 12.2 13.1 13.2 13.3 13.4 13.5 13.6 14.1 14.2 14.3 14.4 14.5 15.1 15.2 15.3 15.4 15.5 Identifying and correcting bad formats in an R program . . . . . . . Accessing and defining object attributes . . . . . . . . . . . . . . . . Creating R objects and locating missing values . . . . . . . . . . . . A brief introduction to subscripts, flow control, and new functions . Manual data inputs through R console and Excel . . . . . . . . . . . Generating a random sample . . . . . . . . . . . . . . . . . . . . . . Importing data in text, Excel, or graphics format . . . . . . . . . . . Exporting tables and graphs from R to a local drive . . . . . . . . . Built-in and user-defined operators in R . . . . . . . . . . . . . . . . Manipulating character strings . . . . . . . . . . . . . . . . . . . . . Special meaning of a character . . . . . . . . . . . . . . . . . . . . . Creating and manipulating a long character string vector . . . . . . . Manipulating factor objects . . . . . . . . . . . . . . . . . . . . . . . Manipulating date and time objects . . . . . . . . . . . . . . . . . . . Manipulating time series objects . . . . . . . . . . . . . . . . . . . . Defining the relation among variables by formula . . . . . . . . . . . Subscripting and indexing R objects . . . . . . . . . . . . . . . . . . Common tasks for manipulating data frame objects . . . . . . . . . . Summary statistics and pivot tables from data frames . . . . . . . . Generating summary statistics with apply() . . . . . . . . . . . . . Calling glm() to estimate a binary choice model . . . . . . . . . . . Base R graphics with inputs of data, devices, and plotting functions Managing screen and file graphics devices . . . . . . . . . . . . . . . Using the par() function to manipulate graphics parameters . . . . Viewing and saving multiple pages of graphs . . . . . . . . . . . . . . Creating multiple graphs on a single page . . . . . . . . . . . . . . . Using high-level and low-level plotting functions . . . . . . . . . . . . Region, user coordinates, clipping, and overlaying . . . . . . . . . . . Customizing a graph generated from existing R functions . . . . . . Market equilibrium after a demand shift . . . . . . . . . . . . . . . . An author’s work and a reader’s memory by base R . . . . . . . . . . Program version for Wan et al. (2010a) . . . . . . . . . . . . . . . . . Estimating the AIDS model with fewer user-defined functions . . . . Conditional execution with if, ifelse(), and switch() . . . . . . . Looping through for, while, and repeat . . . . . . . . . . . . . . . Looping on objects other than a vector and preallocating spaces . . . Additional functions for flow control: stop() and return() . . . . . Homogeneity and symmetry restrictions for the AIDS model . . . . . Computing elasticities and standard errors for the AIDS model . . . 
Creating and subscripting matrices in R . . . . . . . . . . . . . . . . Matrix multiplication, inversion, and other operations . . . . . . . . Fitting a linear model by matrix multiplication . . . . . . . . . . . . Estimating a demand system by generalized least square . . . . . . . Marginal effects and standard errors for the binary choice model . . Function structure and properties . . . . . . . . . . . . . . . . . . . . Supplying arguments for a function . . . . . . . . . . . . . . . . . . . Returning outputs from a function . . . . . . . . . . . . . . . . . . . Understanding the environment of a function . . . . . . . . . . . . . Understanding the need of S3 or S4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 129 134 138 140 143 146 149 154 158 162 164 167 170 174 178 182 189 195 199 202 216 220 222 224 228 232 237 239 242 245 260 265 272 276 279 282 285 288 295 299 302 304 307 314 317 320 325 330 548 References 15.6 15.7 15.8 15.9 15.10 15.11 15.12 16.1 16.2 16.3 16.4 16.5 16.6 17.1 17.2 18.1 18.2 19.1 19.2 19.3 20.1 20.2 20.3 20.4 A.1 A.2 A.3 A.4 A.5 A.6 B.1 B.2 A function for course scores with the S3 mechanism . . . . . . . . . . . A new function for ordinary least square with the S4 mechanism . . . Testing the new function for ordinary least square . . . . . . . . . . . An outline for estimating a linear model with ordinary least square . . Defining a new function for one-dimensional optimization . . . . . . . Estimating a binary logit model with maximum likelihood . . . . . . . Wrapping up commands in a new function for the static AIDS model . Learning viewports and low-level plotting functions in grid . . . . . . Customizing probability response curves in Sun et al. (2007) by grid . Comparing an author’s work and a reader’s memory by grid . . . . . Customizing a graph from existing R functions by ggplot2 . . . . . . Creating the panel graph in Wan et al. (2010a) by ggplot2 and grid . Creating the map for fire regulations in Sun and Tolver (2012) . . . . Main program version for generating tables in Sun (2011) . . . . . . . Graph program version for generating figures in Sun (2011) . . . . . . Debugging tools for identifying program errors in R . . . . . . . . . . . Examining function efficiency by time and memory . . . . . . . . . . . Three approaches to creating the skeleton of a new package . . . . . . Help file of ciTarFit.Rd for ciTarFit() and print.ciTarFit() . . . Installing, loading, and attaching a package in R . . . . . . . . . . . . Creating a chess board from base R graphics . . . . . . . . . . . . . . Understanding key concepts in gWidgets through two examples . . . . A demonstration of the correlation between two random variables . . . Creating a GUI for the apt package . . . . . . . . . . . . . . . . . . . A world map with two countries highlighted . . . . . . . . . . . . . . . A heart shape with three-dimensional effects . . . . . . . . . . . . . . . Survival of 2,201 passengers on Titanic sank in 1912 . . . . . . . . . . A diagram for the structure of this book . . . . . . . . . . . . . . . . . Screenshots from a demonstration video for correlation . . . . . . . . . 
Faces at Christmas time conditional on economic status . . . . . . . . Calculating predicted probabilities for an ordered choice model . . . . Calculating marginal effects for an ordered choice model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334 336 338 341 343 344 348 367 370 373 381 382 387 402 407 422 425 431 433 437 448 451 458 462 513 514 516 517 518 520 525 529 Index of Authors Author names of the cited references are listed. In the main text, some authors are abbreviated as et al. For example, Frye is the fourth author in Buehlmann et al. (2006). Gomez-Rubio, V., 21, 386 Grado, S.C., 6, 11, 57, 70, 75, 101–103, 110, 113, 114, 120, 126, 184, 201, 246, 247, 301, 307, 312, 369, 370, 380, 391, 532 Granger, C.W.J., 255, 397, 399 Grebner, D.L., 6, 11, 37, 41, 57, 70, 72, 73, 75, 89, 96, 101–103, 110, 113, 114, 120, 126, 145, 184, 201, 209, 235, 246, 247, 252, 253, 255, 256, 260, 265, 283, 301, 307, 312, 313, 347, 369, 370, 380, 382, 385, 391, 395, 411, 470–473, 477, 480, 483, 492, 506, 532 Greene, W.H., 103, 105, 106, 108, 109, 257, 259, 521, 523 Griffiths, W.E., 103, 104, 106 Abadir, K.M., 10, 522 Adler, J., 21, 412 Akerlof, G.A., 33 Alston, J.M., 477 Amacher, G.S., 34 Baltagi, B.H., 103, 107, 109 Bivand, R.S., 21, 386 Blackburn, T.R., 62 Braun, W.J., 21, 269 Buehlmann, U., 72 Bumgardner, M., 72 Chalfant, J.A., 477 Chan, K.S., 399 Chiang, A.C., 34 Coase, R.H., 33, 34 Covey, S.R., 511 Dalgaard, P., 21 Deaton, A., 72, 255 Hamann, J.D., 257 Hartman, R., 34 Henneberry, S.R., 73 Henningsen, A., 257 Hill, R.C., 103, 104, 106 Hothorn, T., 21 Edgerton, D.L., 73 Enders, W., 72, 74, 397–399 Engle, R.F., 255, 397, 399 Everitt, B.S., 21 Faustmann, M., 34, 38 Frey, G., 39, 400, 414 Frye, M., 72 Jiang, H., 41 Johnson, E.D., 480 Jones, O., 21, 322, 419 Jones, W.D., 6, 11, 57, 70, 75, 101–103, 110, 113, 114, 120, 126, 184, Gallet, C.A., 38 Gardner, B.L., 34 549 550 201, 246, 247, 301, 307, 312, 369, 370, 380, 391, 532 Judge, G.G., 103, 104, 106 Kaiser, R.A., 71 Kleiber, C., 21 Koskela, E.A., 34 Lawrence, M.F., 449, 450 Lee, T.-C., 103, 104, 106 Lee, T.H., 399 Liao, X., 6, 9, 11, 41, 56, 77, 151, 290, 538 Lihra, T., 72 Luo, X., 41 Lutkepohl, H., 103, 104, 106 MacKinlay, A.C., 39, 81, 91 Magnus, J.R., 10, 522 Maillardet, R., 21, 322, 419 Manera, M., 39, 400, 414 Matloff, N., 21, 293 Mei, B., 483 Meng, Q., 41 Meyer, J., 74 Morrison, D.C., 58, 62, 63 Muellbauer, J., 72, 255 Murdoch, D.J., 21, 269 Murrell, P., 21, 359, 360 Nelsen, R.B., 456 Newman, D.H., 38 Nicholls, S., 71 Ollikainen, M., 34 Pebesma, E.J., 21, 386 Pfaff, B., 21 Piggott, N.E., 35 Piwethongngam, K., 73 Pokharel, S., 6, 11, 57, 70, 75, 101–103, 110, 113, 114, 120, 126, 184, Index of Authors 201, 246, 247, 301, 307, 312, 369, 370, 380, 391, 532 Qiang, H., 73 Robinson, A., 21, 322, 419 Russell, S.W., 58, 62, 63 Siklos, P.L., 398 Spector, P., 21, 128 Sun, C., 6, 7, 9, 11, 37, 41, 56, 57, 70, 72, 73, 75, 77, 89, 93, 96, 101–103, 109, 110, 113, 114, 120, 126, 145, 151, 184, 201, 209, 235, 246, 247, 252, 253, 255, 256, 260, 265, 283, 290, 301, 307, 312, 313, 347, 369, 370, 380, 382, 385–387, 391, 394, 395, 397, 402, 407, 411, 416, 460, 465, 470–473, 477, 480, 483, 489, 492, 506, 532, 535, 538 Tolver, B., 247, 386, 387 University of Chicago Press, 480 Verzani, J., 449, 450 Vinod, H.D., 21 von Cramon-Taubadel, S., 74 Wainwright, K., 34 Wan, Y., 6, 11, 37, 41, 57, 70, 72, 73, 89, 96, 145, 
209, 235, 252, 253, 255, 256, 260, 265, 283, 312, 313, 347, 370, 382, 385, 395, 411, 470–473, 477, 480, 483, 492, 506 Wickham, H., 21, 374, 376 Wohlgenant, M.K., 35 Wright, B.A., 71 Zeileis, A., 21 Zhang, D., 37 Zhang, Y., 41 Index of Subjects Calling a function, 135 location match, 136 name match, 136 Character string creating, 156 displaying, 157 extracting, 158 formatting numbers, 162 lower and upper cases, 158 matching, 161 regular expression creation, 161 definition, 160 Chernoff faces, 519 Clickor, 6, 22, 110 characteristics, 27, 111 playing a data set, 25 Clinic, 503 frequently appearing symptoms, 503 manuscript details, 509 R programming, 507 study design and outline, 503 Cointegration, 256 Comparison of EndNote and Mendeley, 86 Comparison of R and commercial software, 21 Conditional statement, 269 if, 269 ifelse, 271 nesting if, 270 switch, 272 Contingency and pivot tables, 194 Contributor, 6, 394, 511 A AIDS model, 255 demand elasticity, 259, 287 estimation, 347 estimation by GLS, 257 restriction matrices, 256, 283 static and dynamic, 255 apply family, 198 list, 198 subsets, 198 Asymmetric error correction model, 399 B Backslash in R, 19, 157, 161 Beginner, 6, 100, 125, 511 Binary choice model, 103, 533 estimation, 103, 307 linear probability model, 104 maximum likelihood, 105 ordinary least square, 105 marginal effect, 108, 307, 521 maximum likelihood estimation, 344 parameter values, 107 predicted probability, 107 standard error, 109, 307 Book design, 511 motivation, 3 objective, 5 principle A: incrementalism, 5 principle B: project-oriented, 6 principle C: reproducible research, 7 structure, 7 Book materials, 10 D C 551 552 Data analysis project, 532 Data frame, 180 add new elements, 187 changing mode by column, 185 combine, 188 display, 184 extract, 186 merge, 188 pivot tables, 194 random sample, 188 remove rows or columns, 187 renaming, 185 reorder and sort, 186 replace, 186 reshape between narrow and wide formats, 188 summary, 193 Data input external data Excel format, 145 text format, 144 importing graphics, 145 manual, 139 sample in packages, 141 simulation, 142 Data output, 147 Excel format, 148 graph exports, 149 text format, 147 Date and time creation, 169 extracting, 170 type, 169 Debug browsing status, 418 change source codes, 421 no change on source codes, 421 without special tools, 419 Delta method, 521 Demonstration graph, 240 Deparse, 325 Diagram, 240, 517 E Economic research areas, 31 selection, 40 type empirical study, 35 Index of Subjects review study, 38 selection, 40 theoretical study, 33 Empirical study challenge, 37 common thread, 35 practical steps, 54 production process, 42 programming advantage, 36 three roles, jobs, and outputs, 47, 74 three versions, 45, 101, 253, 395, 511 manuscript, 53, 472 program, 51 proposal, 49 relation, 47 EndNote, 86 file management absolute path, 91 imports, 92 PDF annotation, 93 relative path, 91 save, locate, and export, 93 implementation, 87 reference management custom group, 90 group, 90 group set, 90 new reference creation, 87 output styles, 89 search and export, 91 smart group, 90 types and fields, 88 Environment, 322 Error correction model, 256, 399 Escape character in R, 19, 157, 161 Event analysis, 538 F Factor conversion, 166 reordering, 166 structure, 166 File management demand, 78 practical techniques, 83 solution, 81 Flow control, 137, 269 conditional statement, 269 looping statement, 269, 524 553 Index of Subjects stop, warning, and message, 281 Formulas, 177 creation, 177 extracting, 178 Forward slash in R, 19, 
157 Frame, 323 Function arguments, 315 ellipsis, 316 with default values, 316 without default values, 316 debugging, 418 environment, 32