Baseball statistics and an
introduction to R
Overview
Review of Big Data Baseball chapter 1
Review of structured data and classic baseball statistics
Introduction to R!
Announcement – baseball night!
When: tomorrow at 7:30!
Where: ASH 111
What: We will watch a few innings of baseball and go over the rules
Food?: There will be a few snacks
Big Data Baseball chapter 1
Story is about the 2013 Pittsburgh Pirates
Manager: Clint Hurdle
Clint became the manager of the Pirates in 2010
• 2010 Pittsburgh team salary last in baseball: $35
million
• 2010 League average salary: $89 million
In 2011 and 2012, the Pirates had good first halves of the
season but collapsed in the second half
• Pirates have not had a winning season in 20 years
Clint is worried about his job
General Manager: Neal Huntington
Amherst College alum
First worked for the Expos and then the Indians
• Learned about sabermetrics while working for the
Cleveland Indians
Became general manager of the Pirates in 2007
• Hired to build sabermetric database for the Pirates
Neal is also worried about his job
Clint and Neal meet in early October 2012
Need to create a plan to get 15 more wins to make the playoffs
Significant problems
• Questionable pitching, not great hitting
• Need to hire 3 players with only $15 million to spend on free agents
• Average player costs $10 million per year
Plan: create better defense using PITCHf/x data
• Use defensive shifts to prevent runs
Point of the meeting
Neal Huntington wanted to get Clint Hurdle to use the shift to improve defense
• The data analyses said this would help, but needed to convince the manager (Clint) to
actually use this strategy
Q&R Discussion
Statistics can get us beyond what we can “see” if we trust them (Matt)
• How do we know when we should trust them? Just faith? (Jacob P)
• How do we convince others to trust them? (Navi)
• and to get people to change? (Joseph)
Use data to make players better, rather than just to find good players (Jacob B)
Human judgment vs. quantified decisions – which is best? (Sundae, Henry)
Baseball simulations (APBA) as a tool. Accuracy? (Michael)
How can we quantify the performance of a manager? (Kiryu)
• Taking a job when the odds are stacked against you – good idea? Managers
scapegoats? (Sally)
What are defensive shifts and how can they help? (Molly)
Comments/Questions?
Presentation guidelines:
• Pictures are better than words!
• Keep it short: ~5 slides, ~5 minute presentation
• Give a bit of insight/clarification beyond the chapter
• You don’t need to discuss everything that was mentioned in the chapter
• End with a discussion that mentions some of the points raised in the Q&R
• Don’t need to go over everyone’s response, can just pick a few
Review of statistics and structured data
Common baseball statistics
G = games
• Number of games a player participated in (out of 162 games in a season)
AB = at bats
• Number of times a batter was hitting and either got a hit or got out (does not
include walks or reaching base on an error)
R = runs
• Number of runs the player scored
H = hit
• Number of times a player hit the ball and got on base or hit a home run (sum of
1B, 2B, 3B, HR)
Common baseball statistics
BB = base on balls (walks)
• Number of times a player got on base due to the pitcher throwing 4 balls
RBI = Runs batted in
• How many runs scored as a result of a player getting a hit
SB = stolen bases
• Number of times a runner advanced by ‘stealing a base’
Common derived baseball statistics
AVG= batting average
• Hits/(At bats) = H/AB = (1B + 2B + 3B + HR)/AB
SLG = slugging percentage
• (1 * 1B + 2 * 2B + 3 * 3B + 4 * HR) / AB
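As a rough illustration, both derived statistics can be computed in R from a player's hit counts. The season totals below are made-up numbers for a hypothetical batter, not values from the Lahman data; slugging percentage weights each hit by the number of bases it is worth, so a home run counts as four.

```r
# Hypothetical season totals for one batter (made-up numbers)
singles   <- 100
doubles   <- 30
triples   <- 5
home.runs <- 20
at.bats   <- 500

# H = 1B + 2B + 3B + HR
hits <- singles + doubles + triples + home.runs

# AVG = H / AB
avg <- hits / at.bats

# SLG = total bases / AB, where each hit is weighted by its bases
total.bases <- 1 * singles + 2 * doubles + 3 * triples + 4 * home.runs
slg <- total.bases / at.bats

avg   # 0.31
slg   # 0.51
```

Note that SLG is always at least as large as AVG, because every hit is worth at least one base.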
Lahman Database – Individual player
yearly batting statistics
Cases
Variables
Data taken from the Lahman Batting dataset
Example Dataset – Individual player yearly
statistics
Cases
Variables
Categorical and Quantitative Variables
Categorical Variable
Quantitative Variable
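A small sketch of the distinction in R, assuming made-up player records (the positions and home run counts below are hypothetical): a categorical variable is summarized by counting cases in each category, while a quantitative variable can be averaged.

```r
# Hypothetical player records (made-up values for illustration)
position  <- c("C", "1B", "C", "SS", "1B", "C")   # categorical variable
home.runs <- c(12, 30, 8, 15, 22, 5)              # quantitative variable

# Categorical: count how many cases fall in each category
position.counts <- table(position)
position.counts

# Quantitative: arithmetic such as a mean is meaningful
mean(home.runs)

# Combining the two: mean home runs within each position
hr.by.position <- tapply(home.runs, position, mean)
hr.by.position
```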
Explanatory and Response Variables
Sometimes we use one variable (the explanatory variable) to
understand/predict another variable (the response variable)
Another Dataset – 2014 Team statistics
Cases
Variables
Example Dataset - Student Data
Cases
Variables
Describing and summarizing data
Statistics that are used to summarize a data set (sample of data) are
called descriptive statistics
Examples:
• Maximum value in the data set
• Minimum value in the data set
• Mean value of the data set
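The three examples above map directly onto built-in R functions. The sketch below uses a made-up vector of hit totals for five hypothetical players; `c()` simply combines the values into one vector.

```r
# Hypothetical hit totals for five players (made-up numbers)
player.hits <- c(150, 98, 172, 120, 135)

max(player.hits)    # 172
min(player.hits)    # 98
mean(player.hits)   # 135
```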
A Question
Q: What programming language do the pirates use?
A: Arrrr
Q: Worst joke of the semester?
A: Wait and see…
Basics of R
Everyone log on to:
https://asterius.hampshire.edu/
Create a new script to keep notes about your work
RStudio layout
R Basics
Arithmetic:
> 2+2
> 7*5
Assignment:
> a <- 4
> b <- 7
> C <- a + b
> C
[1] 11
Number journey
> a <- 7
> b <- 52
> c <- a * b
> c
[1] 364
Character strings and
> a <- 7
> s <- "hello everyone"
> b <- TRUE
> class(a)
[1] "numeric"
> class(s)
[1] "character"
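The third variable on this slide, `b`, can be checked the same way. A minimal sketch, reusing the slide's values:

```r
a <- 7
s <- "hello everyone"
b <- TRUE

class(b)          # "logical"

# Each type also has a yes/no test function
is.numeric(a)     # TRUE
is.character(s)   # TRUE
is.logical(s)     # FALSE

# Functions such as nchar() work on character strings
nchar(s)          # 14
```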
Functions
Functions use parentheses: functionName(x)
> sqrt(49)
> tolower("HELLO everyone")
To get help
> ? sqrt
You can add comments to your code
> sqrt(49) # this takes the square root of 49
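Many functions take more than one argument, separated by commas, and arguments can be given by name. As a small sketch, `round()` with its `digits` argument is handy for displaying statistics such as a batting average to three decimal places:

```r
root <- sqrt(49)                     # 7
lower <- tolower("HELLO everyone")   # "hello everyone"

# round() takes the value and the number of decimal places to keep
rounded <- round(2 / 3, digits = 3)  # 0.667

root
lower
rounded
```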
Getting help
You can get help about a function in R using the ? command.
> ? sqrt
Vectors
Vectors are ordered sequences of numbers or letters
The c() function is used to create vectors
> v <- c(5, 232, 5, 543)
One can access elements of a vector using square brackets []
> v[3]
# what will the answer be?
Works with strings too
> z <- c("a", "b", "c", "d")
> z[3]
Can add names to vector elements
> names(v) <- c("first", "second", "third", "fourth")
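Once a vector has names, its elements can be looked up by name as well as by position. A minimal sketch using the vector from this slide:

```r
v <- c(5, 232, 5, 543)
names(v) <- c("first", "second", "third", "fourth")

v["second"]    # the element named "second" (232)
v[c(1, 4)]     # several positions at once (5 and 543)
length(v)      # 4
```

Indexing with a vector of positions, as in `v[c(1, 4)]`, returns a shorter vector rather than a single value.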
Question?
Q: What kind of grades did the Pirates get in Statistics class?
A: High Seas
Q: Worst joke of the semester?
A: Stay tuned…
Data types: data frames
Data frames are collections of vectors of the same length.
• Each vector can have a different type of data
Data types: data frames
One can access columns of a Data Frame using the $ symbol.
> team.data$R
[1] 615 573 705 634 614 660 595 669
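To see how a data frame like this is put together, here is a rough sketch that builds a small stand-in for `team.data`. The `R` values are the runs shown above; the `team` column and its labels are placeholders invented for illustration, since the real team names are not shown here.

```r
# Stand-in for team.data: placeholder team labels, runs from the slide
team.data <- data.frame(
  team = c("A", "B", "C", "D", "E", "F", "G", "H"),   # hypothetical labels
  R    = c(615, 573, 705, 634, 614, 660, 595, 669),
  stringsAsFactors = FALSE
)

team.data$R            # the runs column, as a numeric vector
class(team.data$team)  # "character"
class(team.data$R)     # "numeric"
nrow(team.data)        # 8 cases (rows)
```

Each column is a vector of the same length, but the two columns hold different types of data.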
Computing statistics
One can compute statistics on vectors (columns of a data frame)
> sum(team.data$R)
[1] 5065
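Other summaries work the same way. The sketch below copies the runs values into a plain vector named `runs` (a name chosen here for illustration) so that it can be run without loading `team.data`:

```r
# Runs values copied from the team.data$R output above
runs <- c(615, 573, 705, 634, 614, 660, 595, 669)

sum(runs)          # 5065
range(runs)        # smallest and largest: 573 705
which.max(runs)    # position of the largest value: 3

# Keep only the values above the mean (633.125)
above.average <- runs[runs > mean(runs)]
above.average      # 705 634 660 669
```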
Let’s look at a data frame
Load a function I wrote into R by typing:
source('/home/shared/baseball_stats/baseball_class_functions.R')
If you load this correctly you should have a function in your Global Environment
called get.Lehman.batting.data
Let’s look at a data frame
Use this function to get batting data on a specific
player:
> card.data <- get.Lehman.batting.data("Kelly", "Shoppach")
> View(card.data)
Practice R with DataCamp!
Try chapters 1 and 2 on the introduction to R DataCamp tutorial
https://www.datacamp.com/courses/free-introduction-to-r
Join the CS 149 group:
https://www.datacamp.com/groups/a9abd31588cdcc625636026fdce168cd271adc91/invite
Read chapter 2 of Big Data Baseball