Data Analysis 1 - Stony Brook University

Data Analysis
for Physics and Astronomy
with Python
Prof. Joanna Kiryluk
Stony Brook University
Spring 2017 Semester
Course web site:
http://skipper.physics.sunysb.edu/~joanna/Lectures/PHY390/
Target audience:
Freshman/Sophomore students (Physics & Astronomy majors)
Meeting Schedule:
The class meets twice a week: one 80-minute session on Data Analysis
(Tuesdays 8:30 am, Math SINC site S-235, TBD) and one 80-minute session
on Python programming (Thursdays 8:30 am, Math SINC site S-235).
Office hours (with Joanna Kiryluk and TA Anthony Catanese):
We’ll find a time that works for everybody.
Homework: posted weekly. The first homework will be posted on Thursday, February 2nd.
Exams:
There will be two midterm exams, one on Data Analysis methods, and one on Python
programming.
There will be one (final) take-home project. Students will analyze a data set provided to
them by applying data analysis techniques, writing a Python program, and writing a
report (in LaTeX).
Motivation
a) Data Analysis
• Expand on the data analysis methods that students learn during introductory labs
• Prepare Physics and Astronomy majors for more advanced labs such as PHY252
• Teach students modern tools that are useful for data analysis in physics labs
and research
Motivation
b) Python programming
• Teaching a programming language has educational value
• It is important to expose Physics and Astronomy students to programming as early as possible
• Students can work on their own laptops (free and fast installation) under Windows
(freshman students' preference), Mac or Linux
• Teach a modern programming language at a basic level with direct data analysis applications
• Python is an object-oriented language, but easier than C/C++. “Python reads like kindergarten math
and is easy on the layman’s eye. It requires less code to complete basic tasks, making it
an economical language to learn.”
• Tools exist for integrating C/C++ and Fortran code
This course will have essentially no overlap with PHY277
and will benefit students who take PHY277
Source: https://xkcd.com/353/
In-demand programming language
1. Easy-to-Learn: Python was designed with the newcomer in mind.
Python reads like kindergarten math and is easy on the layman’s eye.
Python also requires less code to complete basic tasks, making it an
economical language to learn.
2. Your Stepping Stone
Python can be your stepping stone into the programming universe.
Employers are looking for full-stack programmers and Python will
help you get there. Python is an object-oriented language, just like
JavaScript, C++, C#, Perl, Ruby, and other key programming languages.
3. How About Some Raspberry Pi? It is a card-sized, inexpensive
microcomputer that is being used for a surprising range of exciting
do-it-yourself stuff such as robots, remote-controlled cars, and video game
consoles. With Python as its main programming language, the Raspberry
Pi is being used even by kids to build radios, cameras, arcade machines,
and pet feeders!
Lecture Plan
• Part A (Tuesdays) Data Analysis
• Part B (Thursdays) Python Computing
Part A: Introduction to Data Analysis
1. Introduction: what is a measurement,
random and systematic uncertainties
2. Data characteristics: distribution, mean and
variance
3. Graphic representation of data: histograms,
plots, linear and logarithmic scales
4. Statistics: binomial, Poisson and Gaussian
probability distributions
5. Central Limit Theorem
6. The meaning of sigma
7. Partial differentiation, propagation of small
uncertainties
8. Covariance and correlation
9. Least squares method
10. Combining results of different experiments,
weighted averages
11. Straight line fit
12. Parameter and distribution testing and
comparing results:
3-sigma test, chi-squared test, p-values,
confidence levels
Textbooks:
We’ll use examples and
problems from both books
My preference is the book by
Lyons (shorter)
Part B: Python Programming for Data Analysis:
1. Python from scratch:
a. Installation and setup
b. IPython: An Interactive Computing and
Development Environment
c. Variables, basic math, types of data, input, print
formatting and strings
d. Decisions, loops, lists, functions, objects, modules
e. Pandas, data structures
f. Data files: input and output, file formats
g. Data wrangling: Clean, transform, merge, reshape
h. Plotting and Visualization
Textbook:
+ Python textbook or online tutorial (covered in lectures), e.g.
https://docs.python.org/2/tutorial/index.html
http://www.greenteapress.com/thinkpython/thinkCSpy/thinkCSpy.pdf
2. Data analysis modules
a. SciPy Basics
b. NumPy Basics
http://www.scipy.org/index.html
http://www.numpy.org
3. Data analysis report
a. LaTeX
S-235
https://it.stonybrook.edu/help/kb/sinc-site-general-policies
Learning Python for Data Analysis and Visualisation
1. Anaconda (open source): a high-performance Python distribution
o recommended installer for IPython/Jupyter, Pandas, SciPy, ...
o installation using conda (package manager)
Virtual SINC Site:
• Anaconda is installed and ready for use
Your laptop/computer:
• Download Anaconda2 (Windows/OSX/Linux) from
https://www.continuum.io/downloads
and install it (your TA can help you during office hours).
DO NOT INSTALL THE LATEST VERSION (Anaconda3).
We’ll use Anaconda2 with Python 2.7 (the textbook requires it)
• Anaconda2 includes Python 2.7
• Anaconda3 includes Python 3.5
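After installing, it can help to confirm which Python version is running and whether the packages named above can be imported. The following is a minimal sketch, not an official Anaconda check: the function name `installed_versions` is hypothetical, and the package list is an assumption based on the tools mentioned in these slides.

```python
import importlib
import sys


def installed_versions(names):
    """Return {module name: version string, or None if the import fails}."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            # Not every module defines __version__; fall back to "unknown".
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    print("Python %d.%d" % sys.version_info[:2])
    # Package list is an assumption based on the tools named in the slides.
    report = installed_versions(["IPython", "numpy", "scipy", "pandas"])
    for name, version in sorted(report.items()):
        print("%-10s %s" % (name, version if version else "NOT INSTALLED"))
```

With Anaconda2 the first line printed should report Python 2.7. The sketch itself runs under both Python 2.7 and Python 3.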
Python 2 versus Python 3
This class: Python 2.7
Examples of differences:
http://ptgmedia.pearsoncmg.com/imprint_downloads/informit/promotions/python/python2python3.pdf
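Two of the best-known differences are the `print` statement and integer division. The sketch below is illustrative only and is not taken from the linked document. It is written so that it runs under both Python 2.7 and Python 3, and the comments note where the two versions disagree. The function name `division_examples` is hypothetical.

```python
import sys


def division_examples():
    """Division written so the result is the same in Python 2.7 and 3.x."""
    floor_result = 7 // 2    # floor division: 3 in both versions
    true_result = 7 / 2.0    # float operand forces true division: 3.5 in both
    return floor_result, true_result


floor_result, true_result = division_examples()

# With parentheses and a single argument, print behaves the same in both
# versions. Without parentheses (print "text") it is a Python 3 syntax error.
print("Running Python %d.%d" % (sys.version_info[0], sys.version_info[1]))
print("7 // 2  = %d" % floor_result)
print("7 / 2.0 = %.1f" % true_result)

# The bare expression 7 / 2 is where the versions disagree:
# Python 2.7 gives 3 (integer division), Python 3 gives 3.5.
print("7 / 2   = %s" % (7 / 2))
```

For data analysis, the division difference matters most: dividing two integer counts in Python 2.7 silently drops the fractional part unless one operand is a float.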
WINDOWS (this class). You can access the Virtual SINC Site off-site, e.g. from your laptop,
whatever your operating system: open your web browser and go to:
https://it.stonybrook.edu/services/virtual-sinc-site
Needed: NetID login and Citrix (to access the site from outside SINC classrooms, e.g. from your
laptop, Citrix must be installed on that machine)
Launch Virtual SINC Site Desktop & start Python IDLE
(Programming & Development -> Python 2.7 -> Python IDLE)
IDLE is Python’s Integrated Development and Learning Environment.
IDLE is cross-platform: it works mostly the same on Windows, Unix, and Mac OS X
IDLE:
o Python shell window (interactive interpreter) with colorizing of code
input, output, and error messages
Lecture Plan
• Python Computing (Thursdays)
• Data Analysis (Tuesdays)
Lecture DA1
Why do we do experiments?
Introduction to Data Analysis
Data samples,
histograms, means, RMS, standard deviations
L. Lyons, “A Practical Guide to Data Analysis for Physical
Science Students”, chapter 1: Sections 1.1, 1.2, 1.3 (compact)
OR
J.R. Taylor, “An Introduction to Error Analysis”, chapter 1 and
chapter 2: Sections 2.1 and 2.2
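As a preview of the quantities listed for this lecture, here is a minimal sketch of the mean, the root mean square (RMS) and the sample standard deviation in plain Python. The data values are made up for illustration. Conventions differ between textbooks: some use N instead of N - 1 in the standard deviation, and some use "RMS" to mean the spread about the mean. The definitions assumed here are stated in the comments.

```python
import math


def mean(values):
    """Arithmetic mean: sum of the values divided by their number."""
    return sum(values) / float(len(values))


def rms(values):
    """Root mean square: square root of the mean of the squared values."""
    return math.sqrt(sum(v * v for v in values) / float(len(values)))


def sample_std(values):
    """Sample standard deviation with the N - 1 denominator (an assumption)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1.0))


# Hypothetical repeated measurements of one quantity (made-up numbers).
data = [9.7, 9.9, 10.0, 10.1, 10.3]
print("mean = %.3f" % mean(data))
print("RMS  = %.3f" % rms(data))
print("std  = %.3f" % sample_std(data))
```

The explicit `float(...)` conversions keep the divisions correct under Python 2.7, where dividing two integers would otherwise truncate.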
Why do we do experiments?
Two types of experiments to learn about the physical world:
R.Muller, “The Instant Physicist”
• parameter determination
e.g. measure body temperature
• hypothesis testing
e.g. testing whether body temperature
increased since this morning
The numerical value of the quantity we want to measure
is not enough.
Our conclusion, e.g. “We have made a world-shattering
discovery!”, depends on the accuracy of our measurement.
Acceleration due to gravity
Measurement 1:
Measurement 2:
“True” value= ?
Are measurements 1 and 2 consistent with the “true” value?
https://en.wikipedia.org/wiki/Gravitational_acceleration
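One common way to judge consistency is to express the difference between a measurement and the reference value in units of the measurement's uncertainty. The sketch below assumes that approach. The measurement values and uncertainties are hypothetical, not the numbers shown on the slide, and the reference is the conventional standard value of g, which differs slightly from the local value at any given location. The function name `pull` is also an assumption.

```python
def pull(value, sigma, reference):
    """Difference between a measurement and a reference, in units of sigma."""
    return (value - reference) / sigma


# Conventional standard acceleration of gravity, in m/s^2.
G_STANDARD = 9.80665

# Hypothetical (value, uncertainty) pairs in m/s^2, NOT the slide's numbers.
measurements = {
    "Measurement 1": (9.70, 0.15),
    "Measurement 2": (9.90, 0.02),
}

for label in sorted(measurements):
    value, sigma = measurements[label]
    n_sigma = pull(value, sigma, G_STANDARD)
    print("%s: %.2f +/- %.2f m/s^2, %.1f sigma from reference"
          % (label, value, sigma, n_sigma))
```

Both made-up values lie about 0.1 m/s^2 from the reference, yet only the one with the large uncertainty is within one sigma of it. The numerical value alone does not settle the question.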
Measurement of the same quantity
Experimental Data / Results
Histograms
Axes: e.g. exam score (horizontal), e.g. number of students (vertical)
One entry (x) in this histogram
means one measurement (e.g. one score for every student)
“Binning” - important
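To see why binning matters, the following minimal sketch fills equal-width bins in plain Python and prints the same data with three different bin counts. The exam scores are made up for illustration, and the function name `histogram` and its arguments are assumptions, not part of the course material.

```python
def histogram(values, low, high, n_bins):
    """Count values in n_bins equal-width bins covering [low, high]."""
    width = (high - low) / float(n_bins)
    counts = [0] * n_bins
    for v in values:
        if v < low or v > high:
            continue  # out-of-range values are ignored in this sketch
        # A value exactly at the upper edge goes into the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical exam scores (made-up numbers), one entry per student.
scores = [52, 58, 61, 64, 67, 70, 71, 73, 75, 78, 80, 84, 88, 91, 97]

for n_bins in (2, 5, 10):
    counts = histogram(scores, 50, 100, n_bins)
    print("%2d bins: %s" % (n_bins, counts))
```

With too few bins the shape of the distribution is hidden. With too many, each bin holds only a few entries and the counts fluctuate strongly. NumPy's `numpy.histogram` does the same counting for arrays and also returns the bin edges.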