RMDA Workshop 1: Answers
Dr Paul King
21 April 2015
Load up the present day data with the following command.
> present <- read.csv("http://www.pmking.org/rmda/data/present.csv")
The data are stored in a data frame called present.
1. What years are included in this data set? What are the dimensions of the data frame and what are the
variable or column names?
We can find out the answers to these questions using the usual commands shown below.
> dim(present)
[1] 63  3
> names(present)
[1] "year"  "boys"  "girls"
> str(present)
'data.frame':	63 obs. of  3 variables:
 $ year : int  1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 ...
 $ boys : int  1211684 1289734 1444365 1508959 1435301 1404587 1691220 1899876 1813852 1826352 ...
 $ girls: int  1148715 1223693 1364631 1427901 1359499 1330869 1597452 1800064 1721216 1733177 ...
> summary(present$year)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1940    1960    1970    1970    1990    2000 
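They show that the data set has 63 observations of 3 variables (columns), named year, boys and girls.
Assuming one row per consecutive year, as the str() output suggests, the years run from 1940 to 2002. The
maximum of 2000 reported by summary() is a rounded value, since the output above is displayed to only three
significant figures.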
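Because summary() rounds what it displays, range() is a safer way to read off the first and last year. The sketch below does not download the data: it uses a stand-in vector called year, built on the assumption that the 63 rows are consecutive years starting in 1940.

```r
# Stand-in for present$year (assumption: 63 consecutive years from 1940)
year <- seq(1940, by = 1, length.out = 63)

# Exact first and last year, with no rounding of the display
range(year)

# Rounding the last year to three significant figures, as summary() displayed it above
signif(max(year), 3)
```

With the real data loaded, the equivalent call would be range(present$year).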
2. How do these counts compare to Arbuthnot’s? Are they on a similar scale?
We can summarise the counts as follows:
> summary(present$boys)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
1210000 1800000 1920000 1890000 2060000 2190000 
> summary(present$girls)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
1150000 1710000 1830000 1790000 1970000 2080000 
Comparing these figures with those of Arbuthnot clearly shows that we are dealing with much higher numbers
of births (over 200 times as many per year).
3. Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.?
Using the following command
> present$boys > present$girls
[1]
[18]
[35]
[52]
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE
TRUE TRUE TRUE TRUE TRUE TRUE
TRUE TRUE TRUE TRUE TRUE TRUE
TRUE TRUE TRUE TRUE TRUE TRUE
TRUE
produces TRUE in every case, i.e. each year the number of boys born is greater than the number of girls. Thus
Arbuthnot’s observation seems to be valid in the present day.
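Scanning a long vector of TRUE values by eye is error-prone, so the comparison can be collapsed into a single answer with all(), or counted with sum(). The sketch below is only an illustration: the vectors boys and girls are stand-ins holding the first ten values printed by str() earlier, not the full data set.

```r
# Stand-in data: the first ten years of counts, copied from the str() output
boys  <- c(1211684, 1289734, 1444365, 1508959, 1435301,
           1404587, 1691220, 1899876, 1813852, 1826352)
girls <- c(1148715, 1223693, 1364631, 1427901, 1359499,
           1330869, 1597452, 1800064, 1721216, 1733177)

# TRUE only if boys outnumber girls in every year
all(boys > girls)

# Number of years in which boys outnumber girls (TRUE counts as 1)
sum(boys > girls)
```

With the real data loaded, the same check would be all(present$boys > present$girls).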
4. Make a plot that displays the boy-to-girl ratio for every year in the data set. What do you see?
> plot(present$year, present$boys/present$girls, xlab = "Year", ylab = "ratio of boy to girl births",
+      type = "l", lwd = 2, col = "blue", las = 1)
[Figure: line plot of the ratio of boy to girl births (y-axis, ticks from 1.046 to 1.058) against Year (x-axis, ticks from 1940 to 2000).]
The plot clearly shows that the ratio is always greater than 1 (as we would expect given the answer to Qu. 3)
but there does appear to be a general decrease in the ratio over time.
5. In what year did we see the greatest total number of births in the U.S.?
There are a number of ways to do this. First, create a new column containing the total number of births:
> present$total <- present$boys + present$girls
We can plot the total number of births:
> plot(present$year, present$total, xlab = "Year", ylab = "total number of births", type = "l",
+      lwd = 2, col = "blue")
[Figure: line plot of total number of births (y-axis, ticks from 2500000 to 4000000) against Year (x-axis, ticks from 1940 to 2000).]
from which we can see that the maximum number of births occurred in 1961.
A second way, making use of terminology we will meet in the next workshop, uses the function which.max():
> which.max(present$total)
[1] 22
This tells us that the maximum value of the total number of births occurs in row number 22. To find out which
year this is, we could then type
> present$year[22]
[1] 1961
We could have combined both commands into a single one, giving
> present$year[which.max(present$total)]
[1] 1961
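As a further variation, the index returned by which.max() can be used as a row selector to pull out the whole row for the peak year. The sketch below is an illustration only: it builds a stand-in data frame (demo, a hypothetical name) from the first ten years printed by str() earlier, so the peak it reports is the peak of 1940 to 1949, not of the full data set.

```r
# Stand-in data frame: the first ten rows, copied from the str() output
demo <- data.frame(
  year  = 1940:1949,
  boys  = c(1211684, 1289734, 1444365, 1508959, 1435301,
            1404587, 1691220, 1899876, 1813852, 1826352),
  girls = c(1148715, 1223693, 1364631, 1427901, 1359499,
            1330869, 1597452, 1800064, 1721216, 1733177)
)

# Total births per year, as in the answer above
demo$total <- demo$boys + demo$girls

# Select the entire row at which the total is largest
peak <- demo[which.max(demo$total), ]
peak
```

With the real data loaded, present[which.max(present$total), ] would show the year together with the boys, girls and total counts for it.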