Toward Automatic Speech Act Discovery

• email
• newsgroups
• forums
• blogs

Data Set

• 20 usenet newsgroups
• The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his Newsweeder: Learning to filter netnews paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

Preprocessing

>> I just wonder if this will also cause a divergence between commercial
>> and non-commercial software (ie. you will only get free software using
>> Athena or OpenLook widget sets, and only get commercial software using
>> the Motif widget sets).
>
>
> I can't see why. If just about every workstation will come with Motif
> by default and you can buy it for under $100 for the "free" UNIX
> platforms, I can't see this causing major problems.

Let me add another of my concerns: Yes, I can buy a port of Motif for "cheap", but I cannot get the source for "cheap", hence I am limited to using whatever X libraries the Motif port was compiled against (at least with older versions of Motif. I have been told that Motif 1.2 can be used with any X, but I have not seen it myself).

• Section into “levels”
  • Level < previous level = reply to previous message
  • Level > previous level = new message

Also:

• Remove headers

Xref: cantaloupe.srv.cs.cmu.edu comp.windows.x:66928 comp.windows.x.apps:2487
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohiostate.edu!howland.reston.ans.net!gatech!asuvax!chnews!tmcconne
From: [email protected] (Tom McConnell~)
Newsgroups: comp.windows.x,comp.windows.x.apps
Subject: Re: Motif vs. [Athena, etc.]
Date: 16 Apr 1993 20:14:04 GMT
Organization: Intel Corporation
Lines: 44
Sender: tmcconne@sedona (Tom McConnell~)
Distribution: world
Message-ID: <[email protected]>
References: <[email protected]> <[email protected]> <[email protected]>
NNTP-Posting-Host: thunder.intel.com
Originator: tmcconne@sedona

Also:

• Remove signatures

Cheers,
Tom McConnell

-Tom McConnell          | Internet: [email protected]
Intel, Corp. C3-91      | Phone: (602)-554-8229
5000 W. Chandler Blvd.  | The opinions expressed are my own. No one in
Chandler, AZ 85226      | their right mind would claim them.
• Look for ---*
  • Doesn't always match
• First paragraph only
  • Might miss important content
  • Sometimes grabs greetings (e.g. “Hi, \n”)

Preprocessing

• Bi- and tri-grams
• Tag start of sentence with ^
• Force “not” to join with adjacent n-grams
  • e.g. ^there_is_not not_a_way a_way way_to to_do do_that

Text Modeling and Topic Discovery

• Assume words and/or documents belong to some class/topic
• Assume words are conditionally independent given the class/topic
  • P(w|z)

Naïve Bayes

• Each document belongs to one class
• P(d) = ∏_w P(w|z)

Naïve Bayes - Inference

• Expectation-Maximization

Latent Semantic Indexing / Latent Dirichlet Allocation

• Each document contains multiple topics
• P(d) = ∏_w P(w|z) P(z|d)

Model for Conversational Text

• Message m
• Response r
• P(m,r|z) = P(m|z) P(r|z)
• P(r|m) ∝ P(z) P(m|z) P(r|z)

Example

[Five example slides; figures not recovered.]

Classification Performance

• Labeled ~100 messages with speech acts
  • M/R model: 40-60%
  • Single-message NB: 20-30%
• Need more labels
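
The message/response model can be illustrated with a small sketch. Assuming, as in the Naïve Bayes slides, that words are conditionally independent given the class z, the posterior over speech acts for one message/response pair is P(z|m,r) ∝ P(z) P(m|z) P(r|z). The class names, the probability tables, and the smoothing floor below are invented for illustration and are not taken from the talk; using separate word tables for messages and responses is likewise an assumption.

```python
import math


def log_likelihood(tokens, word_probs, floor=1e-6):
    """Naive Bayes log P(tokens | z): sum of per-word log probabilities.

    Unseen words get a small floor probability (an assumption made here
    so that the logarithm is always defined).
    """
    return sum(math.log(word_probs.get(t, floor)) for t in tokens)


def posterior_over_acts(message, response, prior, msg_probs, resp_probs):
    """Return P(z | m, r) for every class z, normalised to sum to 1."""
    log_scores = {
        z: math.log(prior[z])
        + log_likelihood(message, msg_probs[z])
        + log_likelihood(response, resp_probs[z])
        for z in prior
    }
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_scores.values())
    unnorm = {z: math.exp(s - peak) for z, s in log_scores.items()}
    total = sum(unnorm.values())
    return {z: v / total for z, v in unnorm.items()}


# Hypothetical toy parameters: two made-up speech acts.
PRIOR = {"question": 0.5, "statement": 0.5}
MSG_PROBS = {
    "question": {"how": 0.4, "do": 0.3, "i": 0.2, "think": 0.1},
    "statement": {"how": 0.1, "do": 0.1, "i": 0.4, "think": 0.4},
}
RESP_PROBS = {
    "question": {"try": 0.5, "you": 0.3, "agree": 0.2},
    "statement": {"try": 0.1, "you": 0.3, "agree": 0.6},
}

post = posterior_over_acts(["how", "do", "i"], ["try"],
                           PRIOR, MSG_PROBS, RESP_PROBS)
best = max(post, key=post.get)
print(best, post)
```

With these toy numbers the pair is assigned to the hypothetical "question" class with probability of about 0.97. In an unsupervised setting the tables would be estimated with Expectation-Maximization rather than written by hand.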