CS 111: Program Design I Lecture 15: Objects, Pandas, Modules Robert H. Sloan & Richard Warner University of Illinois at Chicago October 13, 2016 OBJECTS AND DOT NOTATION Objects n n n (Implicit in Chapter 2, Variables, & 9.5, String Methods of book, but not explicit anywhere: So pay attention!) Everything in Python is an object Object combines q q data (e.g., number, string, list) with methods that can act on that object Methods n n n Methods: like (special case of) function but not globally accessible Cannot call method just by giving its name, the way we call print(),open(),abs(), type(),range(), etc. Method: function that can only be accessed through an object q Using dot notation Dot notation n To call method, use dot notation: q n object_name.method() String example: >>>test="Thisismyteststring" >>>test.upper() 'THISISMYTESTSTRING' If o is object of type having method do_it where do_it needs an input in addition to o, and x is defined, what is the proper way to call do_it? A. B. C. D. do_it(x) do_it(o, x) o.do_it(x) o.do_it(o, x) methods continued >>>test.find("my") 8 >>>42.upper() SyntaxError:invalidsyntax upper(test) barf Methods depend on type of object n n scdb.head() prints out 5 rows because head() is a method of objects of type Pandas dataframe, which is type of scdb object "test string".head() gives back an error because head is not a method of strings Methods' importance n n n n n Understanding key data types depends on understanding their methods We saw many methods for strings We have used the append method for lists, and will come back to more list methods file reference methods write(), read(), readline(), readlines() Pandas dataframe methods head(), tail(), etc. When you get to CS 341 & 342 n n n Or if you know Java or C++ now methods are an Object Oriented (OO) concept In our CS 111 q q We do need to know the basics of dot notation and methods We will otherwise be ignoring OO, and taking primarily a procedural approach PANDAS (FROM ANOTHER ANGLE) Pandas: What and Why n n n High performance way to work with large dataframes Dataframe: The 2-d data structure most familiar from Excel spreadsheets, often with a header row Pandas built to play nicely with matplotlib for plotting (and incidentally NumPy and ScikitLearn for machine learning) Why Pandas and not Excel n n Excel not designed for working with large datasets Chicago Crimes to 2008 to present file: q q q q 1.04 million rows, 18 columns Open file in Python: Instantaneous pandas.read_csv(): 8 secs (Sloan’s 2013 laptop) Open file in Excel: several minutes n Just resize one column for better viewing: 5-30 sec Why Pandas and not Excel (2) n n Excel allows you to say/do/compute whatever is built into Excel Python is general purpose programming language: Can say/do/compute anything want, not limited to the functions Microsoft provides in Excel n Geeky fine point: Anything that can be done with a computer. There are “uncomputable problems” (theory of computation CS 301, maybe special lecture in this class if time at end. Not really issue in data analytics) Pandas data types n Most important: dataframe, which we are getting from pandas.read_csv() q n 2-d array, with column headers Series: 1-d array, e.g., one column of a dataframe, is second most important Dataframe indexing n n frame[columnname] returns series from column with name columnname Giving the []s a list of names selects those columns in list order. E.g., q n scdb[["justiceName","chief","docketID"]] Other indexing: .iloc, .loc q (also others we won't cover) n Special case is that specifically a slice index to whole frame will slice by rows for convenience because it's a common operation, but inconsistent with overall Pandas syntax Dataframe positional slicing: iloc n n .iloc for 100% positional indexing and slicing with usual Python 0 to length-1 numbering (stands for "integer location") Arguments for both dimensions separated by comma [rows, cols]: q q n frame.iloc[:3,:4] frame.iloc[:,:3] upper left 3 rows/4 cols all rows, first 3 cols One argument: rows q q frame.iloc[3:6] fame.iloc[42] second 3 rows 42nd row Dataframe label indexing: .loc n Use .loc to access by labels, or mix q q q n n selection list will put columns in list's order; selection set in {}s original dataframe order scdb.loc[3:6,{'docketId','chief','justiceName'}] n Rows 3 through 6 inclusive, columns in scdb's order scdb.loc[3:6,['docketId','chief','justiceName']] n Rows 3 through 6 inclusive, columns in order ['docketId', 'chief', 'justiceName'] Notice loc uses slices inclusive of both ends, unlike all the rest of Python & Pandas .loc with only slices: error (e.g., foo.loc[3:6, 2:4]) Dataframe and series methods n head(): returns sub-dataframe (top rows) q n tail(): q n or for series, first entries same, bottom rows With no argument they default to 5 rows; can give positive integer argument for number of rows count(): For series, returns number of values (excluding missing, NaN, etc.) q For dataframe, returns series, with count of each column, labeled by column Dataframe and series methods (cont.) n abs, max, mean, median, min, mode, sum q q All behave like count, except will give errors if data types don't support the operation E.g., a series of strings does return good answer with .max() method (based on alphabetical order), but cannot take .median() plotting n n Both DataFrame and Series have a plot() method (as do many other Pandas types) Must have loaded Python's plotting module, because Pandas is making use of it: importmatplotlib.pyplotasplt n Default is Series makes a line graph; DataFrame makes one line graph per column, and labels each line by column labels 100% Optional: Aside for graph geeks n Optional for fun: To change style of your plot: importmatplotlib matplotlib.style.use('fivethirtyeight')#OR matplotlib.style.use('ggplot') #Rstyle n Out of the box, it's Matlab style, which some folks like a lot .plot() method n n Needs no arguments Has optional arguments such as kind: q q .plot(kind='bar') for bar graphs Many others including n n n n n 'hist' for histogram 'box' for box with whiskers 'area' for stacked area plots 'scatter' for scatter plots 'pie' for pie charts .plot() x and y arguments n n If you have dataframe but want one column as x values and one as y values, can use optional argument(s) df.plot(x='Year') q Plot all columns but 'Year" as line graphs against x being the Year column Brief demo: Chi murder rate by year import matplotlib.pyplot as plt from matplotlib import style import pandas f = open("Chicago murders to 2012HeaderRows.csv", "r") df = pandas.read_csv(f) plt.ion() groupby(label) method n Idea: split dataframe into groups that all have same value in column named label. E.g., q q q q q grouped=scdb.groupby("justiceName") grouped has many of same methods, indexing options as a dataframe grouped.count() à dataframe with 60 columns (all but justice name) and 1 row per justiceName grouped["docketID"] selects out that column we plotted grouped["docketId"].count() n it has a plot() method A series and series groupby method n n nunique() is a method of true series, where it gives the number of distinct values in the series (a number) nunique() is a method of a series-like groupby object, where it gives a true series: How many were in each group. PYTHON STANDARD LIBRARY & BEYOND: MODULES Extending Python n n n n Every modern programming language has way to extend basic functions of language with new ones Python: importing a module module: Python file with new capabilities defined in it One you import module, it's as if you typed it in: you get all functions, objects, variables defined it immediately Python Standard Library n n n Python always comes with big set of modules List at https://docs.python.org/3/py-modindex.html Examples csv Read/write csv files datetime Basic date & time types math Math stuff (e.g., sin(), cos()) os E.g., list files in your operating system random random number generation urllib Open URLs, parse URLs Using Modules n n n n Use import<module_name> to make module's function's available Style: Put all import statements at top of file After importmodule_name, access its functions (and variables, etc.) through module_name.function_name If module_name is long, can abbreviate in import with as: q q importmodule_nameasmn mn.function_name If you prefer to save typing n n (I mostly do not do this) To accessfunction_namewithout having to type module_name prefix, use: frommodule_nameimportfunction_name Common but not standard n n n matplotlib and pandas are not part of set of modules that must come with every Python 3 matplotlib is very, very widely used, and pandas is widely used Both are among the many modules that come with the Anaconda distribution of Python 3
© Copyright 2024 Paperzz