Final Project - NorCalBiostat

Final Project
Amber Williams
2/18/2017
Intro
Industrial organization is a common area of study in economics. Here we use numerical and
graphical data analysis to look at how a firm's research and development (R&D) spending
relates to its sales and profits. We would expect R&D to increase with firm size. Although
this analysis cannot fully determine which variable causes which, we can still peek at what
the data look like and whether there is a correlation. We will look at the elasticity of R&D
with respect to sales, using the logarithms of R&D and sales so that changes can be read as
percentage changes, and then bring profits into the comparison.
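To make the link between logarithms and elasticity concrete, here is a minimal sketch on simulated data. The numbers are invented for the illustration (the elasticity of 1.1, the intercept, and the noise level are all assumptions, not estimates from rdchem). When both variables are logged, the slope of the fitted line estimates the percentage change in R&D associated with a one percent change in sales.

```r
set.seed(1)                      # reproducible simulated data
n <- 200
true_elasticity <- 1.1           # hypothetical value chosen for this sketch

# Simulate log sales, then log R&D as a linear function of log sales plus noise
log_sales <- runif(n, min = 4, max = 10)
log_rd    <- -4 + true_elasticity * log_sales + rnorm(n, sd = 0.2)

# In a log-log regression the slope is the elasticity of R&D with respect to sales
fit <- lm(log_rd ~ log_sales)
coef(fit)["log_sales"]           # should land near the chosen 1.1
```

The same call on the real data would be `lm(lrd ~ lsales, data = rdchem)`, since lrd and lsales are already logged.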
A Univariate Look at The Variables
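library(foreign)  # provides read.dta() for Stata files
rdchem<-read.dta("http://fmwww.bc.edu/ec-p/data/wooldridge/rdchem.dta")
library(tibble)
# tibble copy of the data; the code below keeps using the rdchem data frame
rchem<-as_tibble(rdchem)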
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.5
head(select(rdchem, contains("lrd"), contains("lsales"), contains("profits")))
##         lrd   lsales profits
## 1 6.0651798 8.427312   186.9
## 2 4.0775380 7.948032   467.0
## 3 3.1570001 6.391582   107.4
## 4 1.2527630 4.894850    -4.3
## 5 0.5306283 3.737670     8.0
## 6 2.1282320 5.966147    47.3
summary(rdchem$lrd)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.5306  2.3840  3.7500  3.6030  4.3630  7.2640
summary(rdchem$lsales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   3.738   6.230   7.191   7.165   7.957  10.590
Looking at where the data lie by quartile, we can check for any discrepancies. Log R&D runs
from a minimum of 0.5306 to a maximum of 7.2640, a wide range that we can look at more
closely for outliers in the graphs. Log sales run from 3.738 to 10.590. The overall ranges
are about the same (6.85 for sales versus 6.73 for R&D), but the interquartile range of log
sales is somewhat smaller (1.73 versus 1.98). The spread could depend on whether a company
specializes in production or R&D.
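The comparison of spreads can be checked with a little arithmetic. This is a small sketch that reuses the quartile values printed by summary() above; the helper name `spread` and the vector names are made up for the illustration.

```r
# Quartile summaries copied from the summary() output above
lrd_summary    <- c(min = 0.5306, q1 = 2.3840, q3 = 4.3630, max = 7.2640)
lsales_summary <- c(min = 3.738,  q1 = 6.230,  q3 = 7.957,  max = 10.590)

# Hypothetical helper: overall range and interquartile range of one summary
spread <- function(s) {
  c(range = unname(s["max"] - s["min"]),
    iqr   = unname(s["q3"] - s["q1"]))
}

spreads <- rbind(lrd = spread(lrd_summary), lsales = spread(lsales_summary))
spreads
```

With the full data loaded, `diff(range(rdchem$lrd))` and `IQR(rdchem$lrd)` give the same quantities directly, up to the rounding in the printed summary.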
library(ggplot2)
ggplot(rdchem, aes(x=lsales)) + geom_histogram(colour="pink", fill="blue") +
ggtitle("Percentage Change in Sales")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(rdchem, aes(x=lrd)) + geom_histogram(colour="blue", fill="pink") +
ggtitle("Percentage Change In R&D")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
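Both histograms look roughly bell-shaped, so the two logged variables appear to be
approximately normally distributed. Separate histograms cannot show how the variables move
together, so to look at this further we will plot them on bivariate graphs.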
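A histogram gives only a visual impression of normality. One way to back it up is a Shapiro-Wilk test, sketched below on the six rows printed by head() above. The data frame name `head_rows` is made up for the illustration, and six observations are far too few to conclude anything, so this only shows the mechanics.

```r
# First six firms, copied from the head() output; `head_rows` is a made-up name
head_rows <- data.frame(
  lrd    = c(6.0651798, 4.0775380, 3.1570001, 1.2527630, 0.5306283, 2.1282320),
  lsales = c(8.427312, 7.948032, 6.391582, 4.894850, 3.737670, 5.966147)
)

# Shapiro-Wilk test: the null hypothesis is that the sample comes from a normal distribution
sw_lrd    <- shapiro.test(head_rows$lrd)
sw_lsales <- shapiro.test(head_rows$lsales)

# A small p-value would be evidence against normality
c(lrd = sw_lrd$p.value, lsales = sw_lsales$p.value)
```

On the full data the calls would be `shapiro.test(rdchem$lrd)` and `shapiro.test(rdchem$lsales)`.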
A Bivariate Comparison
library(ggplot2)
ggplot(rdchem, aes(lsales,lrd))+geom_boxplot(color="pink")+
geom_jitter(color="blue", width = .4)+
ggtitle("Box Plot of Percentage Sales w/ Percentage R&D scatter")
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
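The jittered points show how log R&D varies with log sales and give some insight into the
positive relationship between the two variables. Because lsales is continuous and no
grouping is given, ggplot2 draws a single box for all firms, as the warning notes, so the
scatter is the more informative layer here.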
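The warning above appears because a box plot expects a grouping variable on the x axis. One possible way to get one box per group is to cut log sales into bins first. The sketch below does this on the six rows printed by head(); `head_rows` and `sales_bin` are made-up names, and splitting at the median is just one arbitrary choice of bins.

```r
library(ggplot2)

# First six firms, copied from the head() output; `head_rows` is a made-up name
head_rows <- data.frame(
  lrd    = c(6.0651798, 4.0775380, 3.1570001, 1.2527630, 0.5306283, 2.1282320),
  lsales = c(8.427312, 7.948032, 6.391582, 4.894850, 3.737670, 5.966147)
)

# Cut log sales at its median so that each half gets its own box
head_rows$sales_bin <- cut(
  head_rows$lsales,
  breaks = quantile(head_rows$lsales, c(0, 0.5, 1)),
  include.lowest = TRUE,
  labels = c("lower half", "upper half")
)

p <- ggplot(head_rows, aes(x = sales_bin, y = lrd)) +
  geom_boxplot(colour = "pink") +
  geom_jitter(colour = "blue", width = 0.1) +
  ggtitle("Log R&D by Half of Log Sales")
print(p)
```

With the full rdchem data, more bins (for example quartiles of lsales) would give a finer picture.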
ggplot(data=rdchem, aes(x=factor(lsales), y=lrd, fill=lsales)) +
geom_bar(stat="identity", position=position_dodge(), color="pink")+
ggtitle("Percentage Changes in Sales and R&D")
As you can see above, there is clear evidence that R&D and sales are positively correlated:
firms with higher sales tend to spend more on R&D.
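The plot suggests a positive correlation, and the correlation coefficient puts a number on it. Here is a minimal sketch using only the six rows printed by head() above (`head_rows` is a made-up name), so the result describes those six firms rather than the whole data set.

```r
# First six firms, copied from the head() output; `head_rows` is a made-up name
head_rows <- data.frame(
  lrd    = c(6.0651798, 4.0775380, 3.1570001, 1.2527630, 0.5306283, 2.1282320),
  lsales = c(8.427312, 7.948032, 6.391582, 4.894850, 3.737670, 5.966147)
)

# Pearson correlation between log R&D and log sales
r <- cor(head_rows$lrd, head_rows$lsales)

# cor.test() adds a confidence interval and a test of zero correlation
ct <- cor.test(head_rows$lrd, head_rows$lsales)

r
ct$conf.int
```

On the full data the call would be `cor(rdchem$lrd, rdchem$lsales)`.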
ggplot(data=rdchem, aes(x=lsales, y=lrd, colour=lsales)) +
geom_line(color="pink") +
geom_point()+ ggtitle("The Percentage Change of R&D to Sales")
As predicted, log R&D increases as log sales increase. The data cannot tell us which
variable drives which; for that we can only rely on general knowledge. If we bring profit
into the picture, we can see that firms with higher sales tend to have higher R&D and
larger profits.
ggplot(rdchem, aes(x=lrd, y=lsales, col=profits)) +
geom_point() +
geom_smooth(se=FALSE)+
ggtitle("The Profit Line for the Percentage Comparison of R&D and Sales")
## `geom_smooth()` using method = 'loess'
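Colour shading makes it hard to judge how much profits add once sales are accounted for. One possible way to look at both at once is a regression of log R&D on log sales and profits. The sketch below fits it on the six rows printed by head() above (`head_rows` is a made-up name). With only six firms the estimates are not meaningful, so this only shows the form of the model.

```r
# First six firms, copied from the head() output; `head_rows` is a made-up name
head_rows <- data.frame(
  lrd     = c(6.0651798, 4.0775380, 3.1570001, 1.2527630, 0.5306283, 2.1282320),
  lsales  = c(8.427312, 7.948032, 6.391582, 4.894850, 3.737670, 5.966147),
  profits = c(186.9, 467.0, 107.4, -4.3, 8.0, 47.3)
)

# Log R&D explained by log sales and the level of profits
fit <- lm(lrd ~ lsales + profits, data = head_rows)

# The lsales coefficient is the sales elasticity of R&D, holding profits fixed
summary(fit)$coefficients
```

On the full data the call would be `lm(lrd ~ lsales + profits, data = rdchem)`.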
Conclusion
From the data above, we can infer a strong positive association between the sales and
profits of an industrial organization and the amount the company invests in R&D. This makes
sense: for an industry to grow, it must produce innovative technologies that keep up with
current market demands.