CS 111: Program Design I - UIC Computer Science

CS 111: Program Design I
Lecture 15: Objects, Pandas,
Modules
Robert H. Sloan & Richard Warner
University of Illinois at Chicago
October 13, 2016
OBJECTS AND DOT
NOTATION
Objects
n 
n 
n 
(Implicit in Chapter 2, Variables, & 9.5, String
Methods of book, but not explicit anywhere:
So pay attention!)
Everything in Python is an object
Object combines
q 
q 
data (e.g., number, string, list) with
methods that can act on that object
Methods
n 
n 
n 
Methods: like (special case of) function but
not globally accessible
Cannot call method just by giving its name,
the way we call print(),open(),abs(),
type(),range(), etc.
Method: function that can only be accessed
through an object
q 
Using dot notation
Dot notation
n 
To call method, use dot notation:
q 
n 
object_name.method()
String example:
>>>test="Thisismyteststring"
>>>test.upper()
'THISISMYTESTSTRING'
If o is object of type having method do_it where
do_it needs an input in addition to o, and x is
defined, what is the proper way to call do_it?
A. 
B. 
C. 
D. 
do_it(x)
do_it(o, x)
o.do_it(x)
o.do_it(o, x)
methods continued
>>>test.find("my")
8
>>>42.upper()
SyntaxError:invalidsyntax
upper(test)
barf
Methods depend on type of object
n 
n 
scdb.head() prints out 5 rows because head()
is a method of objects of type Pandas
dataframe, which is type of scdb object
"test string".head() gives back an error
because head is not a method of strings
Methods' importance
n 
n 
n 
n 
n 
Understanding key data types depends on
understanding their methods
We saw many methods for strings
We have used the append method for lists,
and will come back to more list methods
file reference methods write(), read(),
readline(), readlines()
Pandas dataframe methods head(), tail(), etc.
When you get to CS 341 & 342
n 
n 
n 
Or if you know Java or C++ now
methods are an Object Oriented (OO)
concept
In our CS 111
q 
q 
We do need to know the basics of dot notation
and methods
We will otherwise be ignoring OO, and taking
primarily a procedural approach
PANDAS
(FROM ANOTHER ANGLE)
Pandas: What and Why
n 
n 
n 
High performance way to work with large
dataframes
Dataframe: The 2-d data structure most
familiar from Excel spreadsheets, often with a
header row
Pandas built to play nicely with matplotlib for
plotting (and incidentally NumPy and ScikitLearn for machine learning)
Why Pandas and not Excel
n 
n 
Excel not designed for working with large
datasets
Chicago Crimes to 2008 to present file:
q 
q 
q 
q 
1.04 million rows, 18 columns
Open file in Python: Instantaneous
pandas.read_csv(): 8 secs (Sloan’s 2013 laptop)
Open file in Excel: several minutes
n 
Just resize one column for better viewing: 5-30 sec
Why Pandas and not Excel (2)
n 
n 
Excel allows you to say/do/compute whatever
is built into Excel
Python is general purpose programming
language: Can say/do/compute anything
want, not limited to the functions Microsoft
provides in Excel
n 
Geeky fine point: Anything that can be done with a
computer. There are “uncomputable problems” (theory
of computation CS 301, maybe special lecture in this
class if time at end. Not really issue in data analytics)
Pandas data types
n 
Most important: dataframe, which we are
getting from pandas.read_csv()
q 
n 
2-d array, with column headers
Series: 1-d array, e.g., one column of a
dataframe, is second most important
Dataframe indexing
n 
n 
frame[columnname] returns series from
column with name columnname
Giving the []s a list of names selects those
columns in list order. E.g.,
q 
n 
scdb[["justiceName","chief","docketID"]]
Other indexing: .iloc, .loc
q 
(also others we won't cover)
n 
Special case is that specifically a slice index to whole frame will slice
by rows for convenience because it's a common operation, but
inconsistent with overall Pandas syntax
Dataframe positional slicing: iloc
n 
n 
.iloc for 100% positional indexing and slicing
with usual Python 0 to length-1 numbering
(stands for "integer location")
Arguments for both dimensions separated by
comma [rows, cols]:
q 
q 
n 
frame.iloc[:3,:4]
frame.iloc[:,:3]
upper left 3 rows/4 cols
all rows, first 3 cols
One argument: rows
q 
q 
frame.iloc[3:6]
fame.iloc[42]
second 3 rows
42nd row
Dataframe label indexing: .loc
n 
Use .loc to access by labels, or mix
q 
q 
q 
n 
n 
selection list will put columns in list's order;
selection set in {}s original dataframe order
scdb.loc[3:6,{'docketId','chief','justiceName'}]
n 
Rows 3 through 6 inclusive, columns in scdb's order
scdb.loc[3:6,['docketId','chief','justiceName']]
n 
Rows 3 through 6 inclusive, columns in order ['docketId',
'chief', 'justiceName']
Notice loc uses slices inclusive of both ends,
unlike all the rest of Python & Pandas
.loc with only slices: error (e.g., foo.loc[3:6, 2:4])
Dataframe and series methods
n 
head(): returns sub-dataframe (top rows)
q 
n 
tail():
q 
n 
or for series, first entries
same, bottom rows
With no argument they default to 5 rows; can give
positive integer argument for number of rows
count(): For series, returns number of values
(excluding missing, NaN, etc.)
q 
For dataframe, returns series, with count of each
column, labeled by column
Dataframe and series methods (cont.)
n 
abs, max, mean, median, min, mode, sum
q 
q 
All behave like count, except will give errors if data
types don't support the operation
E.g., a series of strings does return good answer
with .max() method (based on alphabetical order),
but cannot take .median()
plotting
n 
n 
Both DataFrame and Series have a plot()
method (as do many other Pandas types)
Must have loaded Python's plotting module,
because Pandas is making use of it:
importmatplotlib.pyplotasplt
n 
Default is Series makes a line graph; DataFrame
makes one line graph per column, and labels
each line by column labels
100% Optional: Aside for graph geeks
n 
Optional for fun: To change style of your plot:
importmatplotlib
matplotlib.style.use('fivethirtyeight')#OR
matplotlib.style.use('ggplot') #Rstyle
n 
Out of the box, it's Matlab style, which some
folks like a lot
.plot() method
n 
n 
Needs no arguments
Has optional arguments such as kind:
q 
q 
.plot(kind='bar') for bar graphs
Many others including
n 
n 
n 
n 
n 
'hist' for histogram
'box' for box with whiskers
'area' for stacked area plots
'scatter' for scatter plots
'pie' for pie charts
.plot() x and y arguments
n 
n 
If you have dataframe but want one column
as x values and one as y values, can use
optional argument(s)
df.plot(x='Year')
q 
Plot all columns but 'Year" as line graphs against
x being the Year column
Brief demo: Chi murder rate by year
import matplotlib.pyplot as plt
from matplotlib import style
import pandas
f = open("Chicago murders to 2012HeaderRows.csv", "r")
df = pandas.read_csv(f)
plt.ion()
groupby(label) method
n 
Idea: split dataframe into groups that all have
same value in column named label. E.g.,
q 
q 
q 
q 
q 
grouped=scdb.groupby("justiceName")
grouped has many of same methods, indexing
options as a dataframe
grouped.count() à dataframe with 60 columns (all
but justice name) and 1 row per justiceName
grouped["docketID"] selects out that column
we plotted grouped["docketId"].count()
n 
it has a plot() method
A series and series groupby method
n 
n 
nunique() is a method of true series, where it
gives the number of distinct values in the
series (a number)
nunique() is a method of a series-like
groupby object, where it gives a true series:
How many were in each group.
PYTHON STANDARD
LIBRARY & BEYOND:
MODULES
Extending Python
n 
n 
n 
n 
Every modern programming language has
way to extend basic functions of language
with new ones
Python: importing a module
module: Python file with new capabilities
defined in it
One you import module, it's as if you typed it
in: you get all functions, objects, variables
defined it immediately
Python Standard Library
n 
n 
n 
Python always comes with big set of modules
List at
https://docs.python.org/3/py-modindex.html
Examples
csv
Read/write csv files
datetime
Basic date & time types
math
Math stuff (e.g., sin(), cos())
os
E.g., list files in your operating system
random
random number generation
urllib
Open URLs, parse URLs
Using Modules
n 
n 
n 
n 
Use import<module_name> to make
module's function's available
Style: Put all import statements at top of file
After importmodule_name, access its
functions (and variables, etc.) through
module_name.function_name
If module_name is long, can abbreviate in
import with as:
q 
q 
importmodule_nameasmn
mn.function_name
If you prefer to save typing
n 
n 
(I mostly do not do this)
To accessfunction_namewithout having
to type module_name prefix, use:
frommodule_nameimportfunction_name
Common but not standard
n 
n 
n 
matplotlib and pandas are not part of set of
modules that must come with every Python 3
matplotlib is very, very widely used, and
pandas is widely used
Both are among the many modules that
come with the Anaconda distribution of
Python 3