ENGL 781/481 INTRODUCTION TO NATURAL LANGUAGE PROCESSING

Time: TR 12:30-1:45 pm
Room: LBR-3244 (3rd floor Mac lab, building 6)
Instructor: Cecilia Ovesdotter Alm, Ph.D.
Office: LBR 06-2110
E-mail/Phone: [email protected]/(585) 475-7327
Office Hour: Tuesdays 2-3 pm & by appt.

Course description: This course provides a theoretical foundation as well as hands-on (lab-style) practice in computational approaches for processing natural language text. The course will have relevance to various disciplines in the humanities, sciences, computational, and technical fields. We will discuss problems that involve different components of the language system (such as meaning in context and linguistic structures). Students will additionally work on modeling and implementing natural language processing and digital text solutions. Programming experience is expected. Students will write code in Python and use a variety of relevant tools. ENGL 781 is a graduate-level counterpart to ENGL 481. Students enrolled under the graduate-level number will be required to complete extra course components; see below.

Course objectives: Upon successful completion of this course, students will be able to:
• Based on acquired theoretical and practical knowledge, define, explain, and apply concepts, methods, and evaluation procedures and metrics to computational linguistics (NLP), and identify remaining limitations
• Apply skills in the conceptual modeling of linguistic phenomena and their computational implementations
• Carry out Python and NLTK programming with written natural language input for computational text analysis; and have had the opportunity to explore additional relevant tools (such as TextBlob, Weka, Scikit-learn, Gensim, Praat, ontologies, CoreNLP/Curator, GRM, and Berkeley Aligner)
• Use and evaluate linguistic corpus data resources and understand key issues in annotation for computational text analysis
• Understand the relationship between language science and computational linguistics and related academic disciplines
• Read and report on research papers on computational linguistics topics for coarse-grained understanding and critique

Course topics:
• Computing with tools for computational linguistics and text analysis
• Linguistic data/corpora: collecting/accessing, annotation, metadata, archiving, agreement metrics, text as unstructured data, archiving and endangered languages, basics of speech/prosody data
• Processing text: encodings, regular expressions, tokenizing, lemmatizing, segmenting, multiword expressions, language/domain/genre-related specific issues, text normalization
• Analyzing and tagging words: computational phonology/morphology, part of speech tagging, ngram/language models
• Lexical semantics: word sense disambiguation, lexical relations, lexical semantic knowledge resources, string similarity, semantic similarity
• Text classification: supervised vs. unsupervised machine learning, core algorithms, case studies
• Processing meaning: named entity recognition, information extraction, topic modeling, semantic role labeling, computing discourse, sentiment/affect and sociolinguistic attributes, foundations of MT and Q&A
• Syntactic parsing: context-free, dependency, feature-based
• Experimentation/evaluation: procedures and methodology, performance measures, error analysis
• NLP project: plan/design, preprocess, implement, evaluate, document, report in oral/written modes
• Mathematical or statistical linguistics when applicable

Class policy: This course takes place in the LBR-3244 Mac lab. We will use the special features of this space and regularly engage with hands-on work. Limit the use of web access and social media to class activities. Respect your own and your peers’ class time by staying on task and not engaging in personal communication unrelated to class activities during lessons. Importantly, prepare readings before class.

Course website: Class notes, problem sets, other assignment instructions, and readings will be made available via myCourses, and you will submit assignment code to the assigned Dropbox folders on our site. Announcements will be channeled through myCourses and email. Please make sure you have a working RIT email account and check our myCourses site before each class.

Textbook and article readings: Jurafsky, D., and Martin, J. H. (2009). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition. Prentice-Hall. The secondary (free) textbook is Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. Sebastopol: O’Reilly Media. It is available online at http://www.nltk.org/book/. We will work with some of the books’ exercises in/out of class. Other readings will be made available in myCourses.

Final grading components:
• Preparation/participation (activities, discussions, self-check exercises, peer review, talk, etc.): 10%
• Early opinion paper (3-4 pp.): 10%
• Self-check online quizzes (five; the lowest grade will be dropped): 5%
• Three of four problem sets (code and write-up; the lowest grade will be dropped): 25%
• Term project (4-6 pp. formal report with code, data, documentation, presentation, demo): 25%
• Final exam: 25%
• Graduate students’ extra course components: evidence of completion (or final grade deduction)

Preparation of readings, participation in class activities, and attendance: Learning in this course is cumulative, where each new topic may presuppose that you have acquired an understanding of concepts covered previously. You are expected to prepare for classes by reading the assigned material before the lesson in which a topic is discussed, and to participate actively and make contributions in class. I expect that you are willing to learn, inquisitive, professional, and respectful, and that you engage in interactive activities in class (pair/group work, discussions, presentations, peer review, etc.). Please take a critical point of view on class topics and confidently question course materials. I encourage you to take advantage of office hours for consultations; this will help you succeed in the course.

Talk requirement: On Thursday 10/15, Dr. Tetreault will visit RIT to give the 2015 Distinguished Computational Linguistics Lecture from 12:30-1:45 pm in the Golisano Auditorium. Instead of the regular class session that day, you are required to attend this talk and post a commentary about it afterwards in a myCourses forum that will be announced.

Web quizzes: In the first half of the term, there are five low-stakes quizzes to help you digest modules’ materials. You are allowed multiple attempts and your highest score is entered into the gradebook.
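
As a rough illustration of how the final grading components listed above could combine into one percentage, here is a minimal Python sketch. The category weights come from the list above. The function names, the equal weighting of items within a category, and the sample scores are assumptions made for illustration only. The graduate-level deduction is not modeled, since its size is not specified here.

```python
# Illustrative sketch only: category weights follow the grading components
# listed in this syllabus. Equal weighting of the items inside a category
# is an assumption, and all names here are hypothetical.
WEIGHTS = {
    "participation": 0.10,
    "opinion_paper": 0.10,
    "quizzes": 0.05,
    "problem_sets": 0.25,
    "term_project": 0.25,
    "final_exam": 0.25,
}


def mean_dropping_lowest(scores):
    """Average a list of percentage scores after dropping the lowest one."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to drop the lowest")
    kept = sorted(scores)[1:]  # discard the single lowest score
    return sum(kept) / len(kept)


def final_percentage(participation, opinion_paper, quizzes,
                     problem_sets, term_project, final_exam):
    """Combine per-category percentages (0-100) into a weighted total."""
    category_scores = {
        "participation": participation,
        "opinion_paper": opinion_paper,
        "quizzes": mean_dropping_lowest(quizzes),
        "problem_sets": mean_dropping_lowest(problem_sets),
        "term_project": term_project,
        "final_exam": final_exam,
    }
    return sum(WEIGHTS[name] * score
               for name, score in category_scores.items())


if __name__ == "__main__":
    # Made-up sample scores, not real data.
    total = final_percentage(
        participation=90,
        opinion_paper=80,
        quizzes=[100, 80, 90, 70, 100],
        problem_sets=[85, 95, 60, 90],
        term_project=88,
        final_exam=84,
    )
    print(f"{total:.1f}")
```

With the made-up scores in the example, the lowest quiz (70) and the lowest problem set (60) are dropped before averaging, and the weighted total works out to roughly 87%.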

Problem sets, final exam: Problem sets provide opportunities to further explore topics and concepts from class. Write-ups are submitted in hard copy in class and documented code is submitted before class in the Dropbox in myCourses. Make sure to follow instructions on code submission as stated in each assignment. (Late work is penalized with a 10% deduction of the actual score per extra day and must be submitted within three days.) There is a final exam on Thursday 12/17. One or more bonus questions may be included in the problem sets/final exam. Extra points are limited by the max score and are not transferable between assignments or grading categories.

Early opinion paper: To stimulate your thinking about the potential, opportunities, and limits of NLP early on in the course (and before you select your term project), you will write a succinct 3-4 pp. opinion paper, discussing one or more aspects of the links between computational linguistics and other academic disciplines, such as your field of focus. It is due at the beginning of class on Tuesday 9/15.

Term project: The topic for the term project should be cleared with me by Thursday 10/22 and involve a problem/question relevant for this course. I encourage you to talk with me earlier about your project ideas. On Tuesday 11/10, you will give a 1-2 min lightning talk that lays out your ideas in one slide only (submit the slide in the assigned myCourses Dropbox on Monday 11/9 before 1 pm). For undergraduate student teams, 2-3 collaborators are encouraged per team, and the team will prepare a joint submission. Please submit a concise written formal report (4-6 pp.) together with a neatly organized project release including a readme file with overall documentation and clear instructions on how to run your code, commented code, and data (as applicable) in the Dropbox (or via a download link in your report) on Thursday 12/10.
For the report, use ACL 2015 format (LaTeX/Word style files: http://acl2015.org/call_for_papers.html). The references section may (but does not have to) exceed the page limit by up to one extra page. A peer review session is scheduled for Thursday 12/3 before the report submission; please bring 3 printed copies of your/your team’s report draft to class then. You will also present your project to the class, either on Tuesday 12/8 or Thursday 12/10. To be fair, the presentation dates are randomly assigned and cannot be changed. Class presentations are allotted 8 min followed by 2 min of Q&A, and they are peer reviewed with a survey, with results viewable by presenters. Lastly, a short 10-15 min less formal project demo is expected at the instructor’s office (scheduled out of class between 12/7 and 12/10). Evaluation criteria for the report, class presentation, and demo will be given ahead of time.

Graduate course credit: Graduate students will complete additional course requirements. These are designed to be relevant for the graduate level and the graduate student's research/field of study:

[1] An additional weekly graduate reading from weeks 2 to 11. The graduate readings are listed in myCourses by week. You will provide a commentary that demonstrates your engagement with the graduate reading. It should be submitted by Sunday 11:59 pm each week (up to two missed weeks are excused). The format of the commentary will alternate between 1) an ‘unstructured’ reflection in a myCourses discussion forum (where you also reply to the posts of 2-3 peers) and 2) a ‘structured’ reviewer-style written critique using an online form linked via myCourses. These commentaries will feed into a compiled annotated bibliography, with entries for ten papers, submitted on Tuesday 11/17 in hard copy in class. In the annotated bibliography you may also include commentaries for sources that you read for your term project.

[2] Paper of the week discussant: Each graduate student will have 15 min to present the gist of a recent paper that caught their attention and whose topic is relevant to the course. The presenter selects the paper, and it can be on any NLP/NLP-related topic. Select a paper from ACL (http://aclweb.org/anthology-new) or another approved source (e.g. http://portal.acm.org). If possible, choose a recent paper that relates to your research or course project; this will facilitate discussing and critiquing the paper in front of the class, as opposed to merely reporting on it. Assigned dates will be listed in myCourses by the second week of class.

[3] An individual (instead of collaborative) term project that addresses a well-defined problem/question with NLP that is relevant for the student's research/field of study, resulting in a presentation, a written report, project submission, and a demo (as described above).

Extra office hour for graduate students: Thursdays 9:30-10:30 am in the instructor’s office. Please stop by and discuss your research, the individual course project, the graduate reading of the week, etc. Indeed, you should come to office hours at least once within approximately the first month of class.

Statement of reasonable accommodations: RIT is committed to providing reasonable accommodations to students with disabilities. If you would like to request accommodations such as special seating or testing modifications due to a disability, please contact the Disability Services Office. It is located in the Student Alumni Union, Room 1150; the Web site is www.rit.edu/dso. After you receive accommodation approval, it is imperative that you see me during office hours so that we can work out whatever arrangement is necessary.

Emergencies: In the event of a University-wide emergency, course requirements, classes, deadlines, and grading schemes are subject to changes that may include alternative delivery methods, alternative methods of interaction with the instructor, class materials, and/or classmates, a revised attendance policy, and a revised semester calendar and/or grading scheme. Please familiarize yourself with these RIT documents: process for closing: https://finweb.rit.edu/grms/close_university_process.html, and emergency preparedness: http://finweb.rit.edu/publicsafety/preparedness/.

Academic integrity statement: As an institution of higher learning, RIT expects students to behave honestly and ethically at all times, especially when submitting work for evaluation in conjunction with any course or degree requirement. Our department encourages all students to become familiar with the RIT Honor Code and with RIT's Academic Integrity Policy. RIT Honor Code URL: http://www.rit.edu/academicaffairs/policiesmanual/sectionA/honorcode.html. RIT Academic Integrity Policy URL: http://www.rit.edu/~w-policy/sectionD/D8.html.

Final grading: At the end of the semester your letter grade will be assigned based on this scale:

Final grade in percentage    Letter grade
93.00-100.00                 A
90.00-92.99                  A-
87.00-89.99                  B+
83.00-86.99                  B
80.00-82.99                  B-
77.00-79.99                  C+
73.00-76.99                  C
70.00-72.99                  C-
60.00-69.99                  D
<60.00                       F

A note on assignments: Working in a responsible and ethically sound way with peers is an important skill in the intellectual process. There are both group and individual assignments. Throughout the course, follow the conditions in the table below with regard to academic honesty. For writing consultation, turn to your instructor or to the Writing Commons: http://www.rit.edu/academicaffairs/writing/about-us.

Course component / Specific conditions

Specific conditions: Student collaboration is expected and encouraged.
Course components:
• Preparing readings, in-class activities
• Problem sets: solving
• Opinion paper: draft version
• Term project: undergraduate students (entire process); graduate students (design, development, drafting)
• Graduate reading: preparation, discussion

Specific conditions: These are individual exercises, and collaboration of any kind is unacceptable.
Course components:
• Problem sets: write-up
• Opinion paper: final version
• Term project: graduate students (submission)
• Graduate reading/annotated bibliography: submission
• Self-check quizzes
• Final exam

WEEKLY OUTLINE

YOU ARE EXPECTED TO COMPLETE READING/ASSIGNMENTS BEFORE THE LESSON THEY ARE DISCUSSED/DUE. ASSIGNMENTS ARE DUE AT THE BEGINNING OF CLASS: WRITE-UPS ARE SUBMITTED IN CLASS AND CODE IN THE DROPBOX. A PRELIMINARY LIST OF TOPICS FOLLOWS, SUBJECT TO CHANGE WITH ADVANCE NOTICE.

Week 1
T Aug 25 Introduction to course
Complete the Day 1 Information Survey in/after class, linked in myCourses
Please install Anaconda and its Launcher tools at home this week (Python version 3, http://continuum.io/downloads#py34, http://docs.continuum.io/anaconda/pkg-docs.html)
(In the event of an incompatibility between Spyder and iPython, you may need to roll back iPython; if Anaconda is installed at /anaconda, you can try /anaconda/bin/conda install ipython=3.2.1)
R Aug 27 Preliminaries & introduction to Python and NLTK
Reading: Chapter 1 in J&M, ch 1-sec 1, 2, 4, ch 2-sec 3 in the NLTK book

Week 2
T Sept 1 Preliminaries & continued introduction to Python and NLTK, regular expressions, speech & prosody
Reading: Section 2.1, 7.1-7.5 in J&M, ch 3-sec 2-5,9 in the NLTK book
R Sept 3 Text processing, segmentation, multilingual text, multiword expressions, text processing for speech synthesis
Reading: 8.1-8.3 in J&M, ch 3-sec 1,6-8, ch 11-sec 4,6 in the NLTK book (optional: Praat Intro tutorial under Help in Praat’s Objects window)
Complete web quiz 1

Week 3
T Sept 8 Finite-state automata for NLP
Reading: Sections 2.2-2.4 (optional: Section 16.2) in J&M, ch 4 (optional: nltk.org/book/ch04-extras.html) in the NLTK book, json.org/example.html
R Sept 10 Computing morphology
Reading: Section 3.1-3.9, 3.12 in J&M, nltk.org/book/ch10-extras.html in the NLTK book
Complete web quiz 2

Week 4
T Sept 15 Corpora, corpus statistics, annotation for NLP
Reading: Rossi (2013) in myCourses, ch 1-sec 3, ch 2-sec 1-2, 4 in the NLTK book
Early opinion paper due
R Sept 17 Word alignment: Guest lecture by P. Vaidyanathan

Week 5
T Sept 22 NO CLASS: We will schedule a make-up class for this/next week
PS 1 released on 9/23
R Sept 24 Make-up class: Lexical semantics, knowledge resources/ontologies
Reading: Ch. 19 in J&M, section 2.5 in the NLTK book
N-grams, language models, smoothing
Reading: Pustejovsky & Stubbs (2012) in myCourses, Sections 4.1-4.8 in J&M
Complete web quiz 3

Week 6
T Sept 29 POS tagging, Transformation-Based Learning (TBL), and POS evaluation
Reading: Sections 5.1-5.4, 5.6-5.7 in J&M, ch 5 in the NLTK book
Complete web quiz 4
R Oct 1 POS tagging (continued): HMMs, noisy channel model
Reading: Section 5.5 in J&M, documentation for nltk.tag.hmm at http://www.nltk.org/api/nltk.tag.html?highlight=hmm#module-nltk.tag.hmm
PS 1 due

Week 7
T Oct 6 Word sense disambiguation, semantic role labeling, word similarity: ontology methods
Complete web quiz 5
Reading: Sections 20.1-20.6, 20.8-20.9 in J&M
R Oct 8 Word/phrase similarity: distributional methods, documents/IR
Reading: Sections 20.7, 23.1 in J&M, Erk (2012) in myCourses
PS 2 released

Week 8
T Oct 13 NO CLASS: Classes follow a Monday schedule
R Oct 15 Attend Dr. J. Tetreault’s lecture (12:30-1:45 pm) in the Golisano Auditorium
Post commentary about the talk in the assigned myCourses forum by 11:59 pm

Week 9
T Oct 20 Text classification, supervised (DT, NB) vs. unsupervised methods
Reading: Hladka & Holub (2015) in myCourses, ch 6 in the NLTK book
PS 2 due
R Oct 22 Evaluation of NLP systems, more learning methods, biomedical text classification
Reading: Section 22.5 in J&M, Resnik & Lin (2010) and Meystre & Haug (2005) in myCourses
Complete web quiz 3
Last day to request approval for term project: submit project title and abstract (0.5-1 p.)

Week 10
T Oct 27 Processing and modeling linguistic affect, sentiment, and sociolinguistic variation, part 1
PS 3 released (team challenge)
R Oct 29 Processing and modeling linguistic affect, sentiment, and sociolinguistic variation, part 2

Week 11
T Nov 3 Topic modeling
R Nov 5 Information extraction, named entity recognition, processing relations/events/time
Reading: Sections 22.1-22.4 in J&M, ch 7 in the NLTK book
PS 3 (team challenge): test data released
Team’s PowerPoint presentation due in myCourses on Fri. Nov 6 by 11:59 pm

Week 12
T Nov 10 Computing syntactic structures: preliminaries
Reading: Chapter 12 (optional: Sections 16.1, 16.3) in J&M, ch 8-sec 1-3 in the NLTK book
Lightning talks for term project (1 slide only; submit on Monday Nov 9 before 1 pm in the assigned Dropbox in myCourses)
PS 4 released
R Nov 12 Computing syntactic structures: parsing algorithms
Reading: Chapter 13 (optional: Chapter 14) in J&M, ch 8-sec 4-7 in the NLTK book

Week 13
T Nov 17 Computing syntactic structures: feature-based grammars
Reading: Chapter 15 in J&M, ch 9 in the NLTK book
Graduate students’ annotated bibliography due
R Nov 19 Computing discourse, anaphora/coreference resolution, dialog processing
Reading: Sections 21.1-21.8 and 24.1-24.5 in J&M
PS 4 due

Week 14
T Nov 24 QA systems and machine translation
Reading: Hearne & Way (2011), Section 23.2 in J&M (optional: Chapter 25 in J&M)
R Nov 26 HAPPY THANKSGIVING!

Week 15
T Dec 1 Final exam review and practice
R Dec 3 Peer review workshop: Bring three hard copies of your project report draft to class

Week 16 and final exam week
T Dec 8 Term project presentations
R Dec 10 Term project presentations; submit project (report, code, data, readmes, etc.) and upload your term project presentation in the assigned myCourses Dropbox
Complete the SmartEvals course evaluation
R Dec 17 FINAL EXAM 10:15 am - 12:15 pm in LBR-3244 (our regular classroom)