Probabilistic Approaches to Artificial Intelligence
(aka: Neural and Genetic Approaches to AI)
CS B553, Spring 2013

1/10/13
What’s in a (course) name?
•  Official class title: Neural and Genetic Approaches to AI
  –  Actually, NEURAL&GENTC APPR ARTFCL INTEL
  –  This is what will appear on your schedules and transcripts
•  But we won’t cover neural and genetic topics
  –  This semester: a new experimental course on “Probabilistic Approaches to AI”
  –  Prof. Hauser taught a similar version of B553 last year

What’s wrong with genetic and neural approaches?
•  Nothing! They’re fascinating and useful!
•  We’re teaching this experimental course as B553 for boring bureaucratic reasons
•  That said, probabilistic approaches are very popular right now in most subfields of AI

Brief, biased history of AI
•  1950: Alan Turing speculates about machine intelligence, proposes the Turing Test
•  1955: Newell & Simon unveil the Logic Theorist to prove theorems automatically; introduce a precursor to Lisp
•  1956: Dartmouth Conference introduces the term Artificial Intelligence; participants include Shannon, Minsky, McCarthy, etc.

1960’s: Progress & optimism
•  AI as search through huge spaces
  –  Techniques for heuristic search, e.g. branch & bound (1960), alpha-beta pruning (1963), A* search (1968)…
•  Artificial neural networks
  –  Using perceptrons (1957); much excitement about their potential power
•  Fuzzy logic (1965)
  –  To model uncertainty
•  Micro-worlds
  –  E.g. Sussman’s “blocks world” in computer vision

1970’s: Decline and the “AI Winter”
•  1969: Minsky and Papert’s book Perceptrons proves limitations of single-layer neural networks
•  1971-72: Cook’s and Karp’s NP-completeness papers show many problems are simply intractable
•  Overly optimistic predictions of the 60’s don’t materialize; funding agencies lose faith in AI
1980’s: Rise from the ashes
•  Resurgence of neural networks, with multi-layer networks and the backpropagation algorithm
•  Much work on knowledge-based systems that try to represent and capture knowledge
•  Expert systems apply rules defined by human experts to solve problems

Early 1990’s: Another crash
•  Expert systems very difficult to maintain
  –  Difficult to manage huge knowledge bases
  –  Suffered from “brittleness”: systems would give nonsensical answers for no easily-explained reasons
  –  Many companies see these as the future and invest lots of money in developing them…

Mid 1990’s to present
•  Greater mathematical sophistication, connecting AI problems to other domains
  –  “Revolutionary” (Russell & Norvig)
  –  Connections to optimization, probabilistic and statistical models, information theory, etc.
•  Focus on more concrete, less ambitious goals
•  Less interest in biologically-inspired techniques in favor of techniques that seem to work in practice
•  Moore’s law makes hard problems more tractable

Probabilistic techniques
•  AI is full of uncertainty
  –  Can’t observe the full state of the system
  –  The observations we can make are noisy
  –  Our models of the world are imperfect
  –  Some systems can’t be modeled anyway (chaotic and apparently nondeterministic systems)
•  Probabilistic frameworks give us a principled way of dealing with and reasoning about this uncertainty
  –  Largely championed by Judea Pearl (2011 Turing Award)

But they’re not a silver bullet!
•  We’ll still face challenges like…
  –  Probability distributions that are impossibly complex, with intractably many dimensions
  –  Parameter estimation problems that seem to require exponential amounts of data
  –  Inference problems that are NP-hard
•  Much work is thus devoted to balancing between what we’d like to model and what we are able to model
  –  Probabilistic graphical models are a popular framework

Course goals
•  Introduce the modern mathematical and algorithmic machinery used in probabilistic techniques for AI
  –  Mostly in the graphical model framework
•  You’ll get to understand both the (nice, clean) theory and the (often messy) implementation details
  –  Along three dimensions: model representation, inference, and parameter estimation (learning)
•  Gain experience with different applications of these techniques to real AI problems
  –  In vision, natural language processing, robotics, etc.
Course overview (tentative)
•  Basic probability: Notation, Bayes’ law, Bayesian classifiers
•  Representation: Bayesian networks, Markov networks
•  Exact inference: Variable elimination, conditioning, clique trees
•  Approximate inference: BP, particle sampling, graph cuts
•  Inference as optimization: Gradient descent, Newton methods, stochastic optimization, genetic algorithms
•  Parameter learning: ML and MAP estimation, Expectation Maximization
•  Structure learning
•  Temporal models: Markov chains, HMMs
•  Applications

Course mechanics
•  Syllabus, schedule, assignments, announcements, etc. on IU OnCourse
  –  http://oncourse.indiana.edu/
•  Readings from the textbook and selections from papers and other books
  –  Textbook: Koller and Friedman, Probabilistic Graphical Models: Principles and Techniques, 2009.

Grading
•  50% Assignments
  –  Mixture of pen-and-paper and programming problems
  –  For programming problems, any general-purpose programming language is acceptable if you implement the important routines yourself (more detail on this later)
  –  We’ll typically recommend a language
•  20% Final project
•  30% In-class quizzes

Course staff
•  Prof. David Crandall, 227 Informatics West, office hour: T 2-3 (tentative)
•  AI: Alex Rudnick, 330I Lindley Hall, office hour: W 10-11am (tentative)

Prerequisites
•  Technically, CS B551
•  Practically:
  –  Proficiency in a general-purpose programming language, e.g. C/C++, Matlab, Python, Java
  –  Some level of mathematical maturity, esp. with statistics, linear algebra, and calculus
  –  Willingness to learn some programming and/or math on your own if necessary

Project
•  On a topic of your choice
•  Three deliverables: a brief proposal, a final report (and source code), and a brief presentation
•  Wide range of possible projects, e.g.
  –  Develop a new technique for problem X
  –  Apply an existing technique to a new application Y
  –  Implement technique Z in a significantly faster way
  –  Implement and compare techniques W and U
  –  Or something else broadly related to probabilistic techniques
Academic integrity
•  Read and understand the academic integrity policy on the syllabus
•  We will look for and prosecute academic integrity violations
•  Be especially careful with homework assignments
  –  You may discuss homework problems at a high level (e.g. general strategies for solving problems), but you may not share code, and you must cite the other student in your submission
  –  If you use ideas or code from another source (like a webpage or textbook), you must acknowledge the source in your submission

Review of basic probability concepts

Probability 101
•  A finite probability space consists of:
  –  A finite set S of mutually exclusive outcomes
  –  A function P that assigns a probability P(s) ≥ 0 to each outcome s, such that Σ_{s ∈ S} P(s) = 1
•  An event A is a subset of S, A ⊆ S
  –  The probability of an event is defined as P(A) = Σ_{s ∈ A} P(s)

Basic identities
•  For two events A and B…
  –  What’s the probability that either A or B (or both) occur?
     P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
     If A and B are disjoint, their intersection is the empty set, and the last term is 0.

Super simple example #1
•  Suppose you roll a six-sided die 5 times. What’s the probability of rolling a “three” all 5 times?

Super simple example #2
•  Suppose you roll a six-sided die 5 times. What’s the probability of rolling a “three” during the first roll?
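A quick Python sketch, not from the slides (the simulation function and trial count are illustrative choices), that checks examples #1 and #2 by Monte Carlo simulation against the exact values (1/6)^5 and 1/6:

import random

def simulate(trials=500_000):
    # Monte Carlo check of examples #1 and #2.
    all_threes = 0      # example #1: a "three" on every one of the 5 rolls
    first_is_three = 0  # example #2: a "three" on the first roll
    for _ in range(trials):
        rolls = [random.randint(1, 6) for _ in range(5)]
        if all(r == 3 for r in rolls):
            all_threes += 1
        if rolls[0] == 3:
            first_is_three += 1
    print("Example #1: exact", (1/6)**5, "simulated", all_threes / trials)
    print("Example #2: exact", 1/6, "simulated", first_is_three / trials)

simulate()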
Super simple example #3
•  Suppose you roll a die 5 times. What’s the probability of getting at least 1 six?
•  Answer:
  –  The probability of getting a six on a single roll is 1/6.
  –  So the probability of getting a six among 5 rolls is 5*1/6 = 5/6.
  –  WRONG! The events are not disjoint.

Example #3 (2nd try)
•  Suppose you roll a die 5 times. What’s the probability of getting at least 1 six?
  –  Answer 2: Sum the probabilities of disjoint events:
     P(at least 1 six) = P(1 six and 4 non-sixes) + P(2 sixes and 3 non-sixes) + P(3 sixes and 2 non-sixes) + P(4 sixes and 1 non-six) + P(5 sixes) = …
  –  Right, but a lot of work.

Example #3 (3rd try)
•  Suppose you roll a die 5 times. What’s the probability of getting at least 1 six?
  –  Either we get at least 1 six (event A), or we get no sixes (event B). A and B are clearly disjoint and their union is S. The probability of B is (5/6)^5.
     P(A) = 1 − P(B) = 1 − (5/6)^5 ≈ 0.598
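As a sanity check, here is a small Python sketch (mine, not from the slides) that computes the same answer two ways: via the complement, as in the 3rd try, and via the sum over the disjoint “exactly k sixes” events, as in the 2nd try:

from math import comb

# Exact answer via the complement (the 3rd try above):
p_complement = 1 - (5/6)**5

# The same answer the long way (the 2nd try): sum the probabilities of the
# disjoint events "exactly k sixes in 5 rolls" for k = 1..5.
p_disjoint_sum = sum(comb(5, k) * (1/6)**k * (5/6)**(5 - k) for k in range(1, 6))

print(p_complement, p_disjoint_sum)   # both are approximately 0.598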
The Birthday Problem
•  Given our class of ~30 people, what’s the probability that at least two of us share the same birthday?
  [Slide chart: probability of a shared birthday vs. number of people]
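The same complement trick answers the birthday question. A minimal Python sketch, not from the slides (the function name and the assumption of 365 equally likely birthdays are mine):

def p_shared_birthday(n, days=365):
    # P(at least one shared birthday) = 1 - P(all n birthdays are distinct),
    # assuming each of the `days` possible birthdays is equally likely.
    p_all_distinct = 1.0
    for k in range(n):
        p_all_distinct *= (days - k) / days
    return 1 - p_all_distinct

print(p_shared_birthday(30))   # roughly 0.71 for a class of ~30 people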
Conditional probabilities
•  Probability that one event occurs, given that another event is known to have occurred
  –  Denoted P(A|B), “the probability of A given B”
  –  Defined as: P(A|B) = P(A ∩ B) / P(B)

Conditional probabilities
•  Two events are independent if P(A|B) = P(A)
  –  Or, equivalently, if P(B|A) = P(B)
  –  Independence denoted A ⊥ B
•  The joint probability of independent events A and B both occurring is then simply: P(A ∩ B) = P(A) P(B)
•  Leads directly to the chain rule: P(A ∩ B) = P(A|B) P(B)
  –  This idea of factoring a distribution into a product of two simpler distributions will be a recurring course theme!
  –  More generally: P(A1 ∩ A2 ∩ … ∩ An) = P(A1) P(A2|A1) P(A3|A1 ∩ A2) ⋯ P(An|A1 ∩ … ∩ An−1)
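To make the definitions concrete, here is a short Python sketch using a made-up joint distribution over two binary events (the numbers are purely illustrative, not from the slides); it applies the definition of conditional probability, verifies the chain rule, and checks for independence:

# Made-up joint distribution over two binary events A and B (illustrative only).
joint = {
    (True, True): 0.20,    # P(A and B)
    (True, False): 0.30,
    (False, True): 0.10,
    (False, False): 0.40,
}

p_B = sum(p for (a, b), p in joint.items() if b)    # marginal P(B) = 0.30
p_AB = joint[(True, True)]                          # P(A and B) = 0.20
p_A_given_B = p_AB / p_B                            # definition: P(A|B) = P(A and B) / P(B)

# Chain rule: P(A and B) = P(A|B) * P(B)
print(p_A_given_B * p_B, p_AB)                      # both 0.20

# Independence would require P(A|B) == P(A); here 2/3 != 0.5, so A and B are dependent.
p_A = sum(p for (a, b), p in joint.items() if a)    # marginal P(A) = 0.50
print(p_A_given_B, p_A)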