A Deep Dive with Data… And Sharks!
An example of real world data analysis
Introductions
David Pilkington
Director of Software Engineering at Ontario Systems, where he has worked for the past 17 years. A former Business Intelligence Product Manager, he is currently working on his MSPA degree at Northwestern University.
CJ Clingler
Graduated from Ball State University with a B.S. in Mathematical Economics and has spent the last 2.5 years at Ontario Systems conducting complaint and financial analyses. Currently working on her MS in Statistics at Texas A&M.
Installing R and RStudio
R is open-source software and may be downloaded freely at http://cran.r-project.org/.
RStudio is an integrated development environment (IDE) for R. Install the most current versions of both R and RStudio.
Step 1: Download and install R from http://cran.r-project.org/. Documentation is readily available on
CRAN and on many R community and resource sites.
Step 2: Download and install RStudio from http://www.rstudio.com/products/rstudio/download/
Quick Review
Step #1 - Prepare
1. What are the Context and Goals?
2. Source of the Data (gathered or not)
3. Methods of Collection
4. Relevant Data
Step #2 - Analyze
1. Assess Data Quality
2. Explore the Data: Fully understand what type of data is within your dataset.
3. Apply Statistical Methods: This is the math part.
Step #3 - Conclude
1. Determine Results
   a. Statistical Significance
   b. Practical Significance
2. Communicate Results: To whom are you communicating?
3. Presentation of the Data
Key terms
Data set & subset
Numeric variable
Categorical variable
Summary
Minimum
Maximum
Mean
Binary data
Frequency table
Bar graph
Box plot
Histogram
Correlation
Know Thy Data
“Read in” the data set
Data summary
Data subsets
Read in the dataset
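A minimal sketch of reading a data set into R, assuming the data is stored as a CSV file. The file contents and the column names (Country, Fatal, Age) are hypothetical stand-ins, written to a temporary file so the example runs on its own; they are not the actual shark attack data.

```r
# Hypothetical stand-in for the shark attack data: write a tiny CSV to a
# temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Country,Fatal,Age",
  "USA,N,24",
  "Australia,Y,31",
  "USA,N,17"
), csv_path)

# "Read in" the data set. stringsAsFactors = TRUE stores text columns as
# categorical variables (factors).
sharks <- read.csv(csv_path, stringsAsFactors = TRUE)

str(sharks)   # one line per variable: name, type, first values
head(sharks)  # first few rows
```

With a real file, only the path passed to read.csv() would change.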
Data summary and subsets
Data Summary
• Describes each variable
• Provides max and min information for numerical variables (numbers)
• Provides frequency for categorical variables (words)
Data Subsets
• Extract variables to make smaller, more manageable datasets
• Group by numbers, concepts, etc.
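The summary and subset ideas above can be sketched with base R's summary() and subset() functions. The data frame and its column names are hypothetical and only illustrate the calls.

```r
# Hypothetical data frame standing in for the shark attack data.
sharks <- data.frame(
  Country = factor(c("USA", "Australia", "USA", "South Africa")),
  Age     = c(24, 31, 17, 45),
  Fatal   = factor(c("N", "Y", "N", "N"))
)

# Data summary: min, max, mean and quartiles for numeric variables,
# frequency counts for categorical variables.
summary(sharks)

# Data subsets: keep only some rows and some columns.
usa_attacks <- subset(sharks, Country == "USA", select = c(Country, Age))

# The same idea with bracket indexing: rows where Age is at least 18.
adults <- sharks[sharks$Age >= 18, ]
```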
When and Where do shark attacks happen?
Table and Plot
• The table() function creates a frequency table: how many attacks happen in each country?
• The plot() function is used with categorical variables to create a bar graph showing the frequency of attacks in each country
  • Bar graphs are used to show the frequency of a category
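A short sketch of table() and plot() on a categorical variable. The country values are made up for illustration; in practice they would come from a column of the data set.

```r
# Hypothetical country values; the real data set would supply this column.
country <- factor(c("USA", "Australia", "USA",
                    "South Africa", "USA", "Australia"))

# Frequency table: number of attacks recorded for each country.
country_counts <- table(country)
print(country_counts)

# plot() on a factor draws a bar graph of the category frequencies.
plot(country, main = "Attacks by country", ylab = "Number of attacks")
```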
Barplot
• The barplot() function provides a graph similar to the plot() function, but is used to create frequency bar graphs from numerical data, such as the counts in a frequency table
• What do you notice about the frequency of shark attacks by time of day? Why do you think this is?
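One way to draw a time-of-day graph is to count the attacks per hour with table() and pass the counts to barplot(). The hour values here are hypothetical.

```r
# Hypothetical hour-of-day values (24-hour clock) for a handful of attacks.
attack_hour <- c(9, 11, 14, 14, 15, 16, 16, 16, 18, 11)

# Count attacks per hour, then pass the numeric counts to barplot().
hour_counts <- table(attack_hour)
barplot(hour_counts,
        main = "Attacks by time of day",
        xlab = "Hour of day",
        ylab = "Number of attacks")
```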
Binary Data
• Binary data captures True/False information
• Our data set has two binary variables
  • Was a warning sign posted?
  • Was the victim in a group during the attack?
• What conclusions might we be able to draw between this information and the time of shark attacks?
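A sketch of how binary variables might be explored in R, assuming they are stored as logical (TRUE/FALSE) values. The variable names and values are hypothetical.

```r
# Hypothetical binary variables stored as logical (TRUE/FALSE) values.
warning_posted <- c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
in_group       <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
time_of_day    <- factor(c("Morning", "Afternoon", "Afternoon",
                           "Morning", "Evening", "Afternoon"))

# Frequency of a binary variable on its own.
table(warning_posted)

# The mean of a logical vector is the proportion of TRUE values.
share_in_group <- mean(in_group)

# Cross-tabulate a binary variable against time of day.
warning_by_time <- table(time_of_day, warning_posted)
print(warning_by_time)
```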
How were victims attacked?
Bar graph - Victims’ Activity
Bar graph - Types of Attacks
Bar graph - Victims’ Injury
Breakdown of Injury by Attack Type
What about water conditions?
Boxplots and Histograms
Boxplots
• Standardized way to show the distribution of data
  • Minimum
  • First quartile
  • Median
  • Third quartile
  • Maximum
• Easy way to identify outliers
Histograms
• Similar to bar graphs
  • Bar graphs show frequency of categories
  • Histograms show frequency of values in a range
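A minimal sketch of a boxplot and a histogram on one numeric variable. The depth values are hypothetical, with one value chosen to be far from the rest so that an outlier shows up. Note that fivenum() reports hinges, which can differ slightly from the quartiles that summary() prints.

```r
# Hypothetical water depths in feet; 60 is deliberately far from the rest.
depth <- c(2, 3, 3, 4, 5, 5, 6, 8, 10, 60)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(depth)
print(five)

# Boxplot: the box spans the middle half of the data; points beyond the
# whiskers are drawn individually as potential outliers.
boxplot(depth, horizontal = TRUE, main = "Water depth", xlab = "Depth (ft)")

# Histogram: counts of values falling in each range (bin).
hist(depth, breaks = 10, main = "Water depth", xlab = "Depth (ft)")
```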
Boxplot - Water Depth
Histogram - Water Depth
Boxplot - Water Temperature
Histogram - Water Temperature
Victim Details
Boxplot & Histogram - Victim Age
Frequency Table
            Fatal   Not Fatal
Female        483          52
Male         2645         464
What are the odds?
            Fatal   Not Fatal   Total
Female        483          52     535
Male         2645         464    3109
Total        3128         516    3644
Given that a female was attacked, what is the probability that the female dies?
483/535 = 90.3%
Given that there is a victim of a fatal shark attack, what is the probability the victim is male?
2645/3128 = 84.6%
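The two calculations above can be reproduced in R from the four counts in the frequency table. This is a sketch that enters the counts by hand; the object names are illustrative.

```r
# Counts copied from the frequency table above.
attacks <- matrix(c(483,  52,
                    2645, 464),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Sex     = c("Female", "Male"),
                                  Outcome = c("Fatal", "Not Fatal")))

# Add row and column totals.
addmargins(attacks)

# P(Fatal | Female): fatal female attacks divided by all female attacks.
p_fatal_given_female <- attacks["Female", "Fatal"] / sum(attacks["Female", ])

# P(Male | Fatal): fatal male attacks divided by all fatal attacks.
p_male_given_fatal <- attacks["Male", "Fatal"] / sum(attacks[, "Fatal"])

# Express both as percentages rounded to one decimal place.
round(100 * c(p_fatal_given_female, p_male_given_fatal), 1)
```

The condition (the "given" part) decides which total goes in the denominator: a row total for a condition on sex, a column total for a condition on outcome.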
What are the odds?
1. Given someone is attacked by a shark, what is the probability they will die?
2. Given a male is attacked by a shark, what are the chances he survives?
3. What percentage of shark attack victims are female?
            Fatal   Not Fatal   Total
Female        483          52     535
Male         2645         464    3109
Total        3128         516    3644
Describe the Sharks
Bar graph - Shark Species
How big were the sharks?
Summary
Boxplot
Histogram
Boxplot & Histogram - Shark Length
Correlations
• Correlation does NOT equal causation; it indicates a connection or relationship
• Variables can be positively correlated or negatively correlated
  • Positive
    • Both variables increase or decrease together
    • Value of correlation is close to 1
  • Negative
    • One variable increases as the other decreases
    • Value of correlation is close to -1
• A correlation value close to zero indicates little or no linear relationship between the variables
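A small sketch of computing correlations with R's cor() function. All three variables and their values are hypothetical, chosen so that one pair rises together and the other pair moves in opposite directions.

```r
# Hypothetical measurements; names and values are illustrative only.
shark_length <- c(1.5, 2.0, 2.4, 3.1, 3.8, 4.5)
bite_size    <- c(10, 14, 15, 21, 25, 30)
water_depth  <- c(30, 26, 25, 18, 12, 9)

# Positive correlation: both variables increase together.
r_pos <- cor(shark_length, bite_size)

# Negative correlation: one variable increases as the other decreases.
r_neg <- cor(shark_length, water_depth)

print(c(positive = r_pos, negative = r_neg))

# A scatter plot is a quick visual check of the relationship.
plot(shark_length, bite_size,
     xlab = "Shark length", ylab = "Bite size")
```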
What might be Connected?
• Shark length and bite size
• Others?
  • Water temp and time of year
  • Length & depth
  • Species & temp
Questions?
Hands-on Activity
https://statsguys.wordpress.com/2014/01/03/first-post/
Step by step instructions with R code to copy and paste
Predicting whether Titanic passengers survived or died.
Kaggle data set
Thank you!
David Pilkington
Senior Director, Technology
Ontario Systems, LLC
CJ Clingler
Compliance Analyst
Ontario Systems