A Text Model for Speech Act Discovery

Toward Automatic Speech Act Discovery
• email
• newsgroups
• forums
• blogs
Data Set
• 20 Usenet newsgroups
• The 20 Newsgroups data set is a collection of approximately 20,000
newsgroup documents, partitioned (nearly) evenly across 20 different
newsgroups. To the best of my knowledge, it was originally collected by Ken
Lang, probably for his paper “Newsweeder: Learning to filter netnews”,
though he does not explicitly mention this collection. The 20 Newsgroups
collection has become a popular data set for experiments in text
applications of machine learning techniques, such as text classification and
text clustering.
Preprocessing
>> I just wonder if this will also cause a divergence between commercial
>> and non-commercial software (ie. you will only get free software using
>> Athena or OpenLook widget sets, and only get commercial software using
>> the Motif widget sets).
>
>
> I can't see why. If just about every workstation will come with Motif
> by default and you can buy it for under $100 for the "free" UNIX
> platforms, I can't see this causing major problems.
Let me add another of my concerns: Yes, I can buy a port of Motif for "cheap",
but I cannot get the source for "cheap", hence I am limited to using whatever X
libraries the Motif port was compiled against (at least with older versions of
Motif. I have been told that Motif 1.2 can be used with any X, but I have not
seen it myself).
Section into “levels” (quote depth = number of leading “>”)
• Level < previous level = reply to previous message
• Level > previous level = new message
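The level rule above can be sketched roughly as follows. This is a minimal illustration, assuming that a line's level is the number of leading “>” markers and that blank lines carry no content; the function names are hypothetical and not taken from the slides.

```python
import re

# Leading quote markers (">", ">>", "> >", ...) followed by the line's text.
QUOTE_PREFIX = re.compile(r"^\s*((?:>\s*)*)(.*)$")

def quote_level(line):
    """Return (level, text), where level counts the leading '>' markers."""
    prefix, text = QUOTE_PREFIX.match(line).groups()
    return prefix.count(">"), text.strip()

def split_into_levels(body):
    """Group consecutive non-empty lines that share a quote level."""
    segments = []
    for line in body.splitlines():
        level, text = quote_level(line)
        if not text:
            continue  # assumption: blank (or bare ">") lines are ignored
        if segments and segments[-1][0] == level:
            segments[-1][1].append(text)
        else:
            segments.append((level, [text]))
    return [(level, " ".join(parts)) for level, parts in segments]

def message_response_pairs(segments):
    """Pair each segment with the next one when the level drops (a reply)."""
    pairs = []
    for (prev_level, prev_text), (level, text) in zip(segments, segments[1:]):
        if level < prev_level:
            pairs.append((prev_text, text))
    return pairs

body = ">> Will this cause a divergence?\n>\n> I can't see why.\nLet me add a concern."
print(message_response_pairs(split_into_levels(body)))
```

On the three-level example shown earlier, this would yield two pairs: the “>>” text with the “>” text, and the “>” text with the unquoted text.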
Also:
• Remove headers
Xref: cantaloupe.srv.cs.cmu.edu comp.windows.x:66928 comp.windows.x.apps:2487
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohiostate.edu!howland.reston.ans.net!gatech!asuvax!chnews!tmcconne
From: [email protected] (Tom McConnell~)
Newsgroups: comp.windows.x,comp.windows.x.apps
Subject: Re: Motif vs. [Athena, etc.]
Date: 16 Apr 1993 20:14:04 GMT
Organization: Intel Corporation
Lines: 44
Sender: tmcconne@sedona (Tom McConnell~)
Distribution: world
Message-ID: <[email protected]>
References: <[email protected]> <[email protected]>
<[email protected]>
NNTP-Posting-Host: thunder.intel.com
Originator: tmcconne@sedona
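One possible way to drop the header block is Python's standard `email` parser, which treats everything up to the first blank line as headers. This is only a sketch; the slides do not say how headers were actually removed, and `strip_headers` is a hypothetical helper name.

```python
from email import message_from_string

def strip_headers(raw_post):
    """Return (subject, body) for a raw post; all header lines are dropped."""
    msg = message_from_string(raw_post)
    # For a plain, non-multipart post, get_payload() returns the body text.
    return msg["Subject"], msg.get_payload()

raw = (
    "Newsgroups: comp.windows.x,comp.windows.x.apps\n"
    "Subject: Re: Motif vs. [Athena, etc.]\n"
    "Lines: 2\n"
    "\n"
    "> I can't see why.\n"
    "Let me add another of my concerns.\n"
)
subject, body = strip_headers(raw)
print(subject)
print(body)
```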
Also:
• Remove signatures
Cheers,
Tom McConnell
-Tom McConnell
Intel, Corp. C3-91
5000 W. Chandler Blvd.
Chandler, AZ 85226
|
Internet: [email protected]
|
Phone: (602)-554-8229
| The opinions expressed are my own. No one in
| their right mind would claim them.
• Look for ---*
  – Doesn't always match
• First paragraph only
  – Might miss important content
  – Sometimes grabs greetings (e.g. “Hi,\n”)
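The two heuristics can be sketched as below, assuming that `---*` is meant as a regular expression (two dashes followed by any number of further dashes) matched against a whole line. Both function names are hypothetical.

```python
import re

# "---*" read as a regex: "--" followed by zero or more "-", alone on a line.
SIG_DELIMITER = re.compile(r"^---*\s*$")

def strip_signature(body):
    """Cut the body at the first delimiter line; unchanged if none matches."""
    lines = body.splitlines()
    for i, line in enumerate(lines):
        if SIG_DELIMITER.match(line):
            return "\n".join(lines[:i]).rstrip()
    return body.rstrip()

def first_paragraph(body):
    """Keep only the text before the first blank line."""
    return re.split(r"\n\s*\n", body.strip(), maxsplit=1)[0].strip()

print(strip_signature("I have not seen it myself.\n\nCheers,\n--\nTom McConnell\n"))
print(first_paragraph("Hi,\n\nI have not seen it myself."))  # keeps only the greeting
```

The second call shows the failure mode noted above: when a post opens with a greeting, the first-paragraph rule keeps the greeting and discards the content.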
Preprocessing
• Bi- and tri-grams
• Tag start of sentence with ^
• Force “not” to join with adjacent n-grams
• e.g.
^there_is_not not_a_way a_way way_to to_do do_that
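The slide does not spell out the exact joining rule, so the sketch below is one reading inferred from the single example: a bigram that sits next to “not” absorbs it and becomes a trigram, and “not” never closes a bigram on its own. The function name and the whitespace tokenization are assumptions, and only the bigram stream is produced.

```python
def negation_bigrams(sentence):
    """Bigrams with a '^' start tag; 'not' is merged into adjacent bigrams."""
    tokens = sentence.lower().split()
    if not tokens:
        return []
    tokens[0] = "^" + tokens[0]  # tag the start of the sentence
    grams = []
    for i in range(len(tokens) - 1):
        left, right = tokens[i], tokens[i + 1]
        following = tokens[i + 2] if i + 2 < len(tokens) else None
        if right == "not":
            # Normally the previous bigram has already absorbed this "not".
            if i == 0:
                grams.append(left + "_not")
            continue
        if left.lstrip("^") == "not" and following is not None:
            grams.append(f"{left}_{right}_{following}")  # extend rightward
        elif following == "not":
            grams.append(f"{left}_{right}_not")  # absorb the upcoming "not"
        else:
            grams.append(f"{left}_{right}")
    return grams

print(" ".join(negation_bigrams("there is not a way to do that")))
# ^there_is_not not_a_way a_way way_to to_do do_that
```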
Text Modeling and Topic Discovery
• Assume words and/or documents belong to some class/topic
• Assume words are conditionally independent given the class/topic
• P(w|z)
Naïve Bayes
• Each document belongs to one class
• P(d|z) = \prod_{w \in d} P(w|z)
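A minimal sketch of this document likelihood, assuming bag-of-words documents and add-one smoothing. The smoothing, the toy data, and the class labels “question” and “thanks” are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter

def train_word_probs(docs_by_class, vocab):
    """Estimate P(w|z) for each class z, with add-one smoothing (assumed)."""
    probs = {}
    for z, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts[w] for w in vocab) + len(vocab)
        probs[z] = {w: (counts[w] + 1) / total for w in vocab}
    return probs

def log_p_doc_given_class(doc, word_probs):
    """log P(d|z) = sum of log P(w|z); out-of-vocabulary words are skipped."""
    return sum(math.log(word_probs[w]) for w in doc if w in word_probs)

def classify(doc, probs, priors):
    """Pick the class maximizing log P(z) + log P(d|z)."""
    return max(
        probs,
        key=lambda z: math.log(priors[z]) + log_p_doc_given_class(doc, probs[z]),
    )

training = {
    "question": [["can", "you", "help"], ["how", "do", "you"]],
    "thanks": [["thank", "you"], ["thanks", "a", "lot"]],
}
vocab = {w for docs in training.values() for doc in docs for w in doc}
probs = train_word_probs(training, vocab)
print(classify(["how", "can", "you"], probs, {"question": 0.5, "thanks": 0.5}))
```

Working in log space avoids numerical underflow when the product runs over many words.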
Naïve Bayes - Inference
• Expectation-Maximization
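The slide only names the algorithm. As a rough sketch, EM for an unlabeled Naïve Bayes mixture alternates between re-estimating P(z) and P(w|z) from soft class assignments (M-step) and recomputing P(z|d) for each document (E-step). The random initialization, add-one smoothing, and fixed iteration count below are assumptions.

```python
import math
import random
from collections import Counter

def em_naive_bayes(docs, n_classes, n_iters=20, seed=0):
    """Fit a Naive Bayes mixture to unlabeled bag-of-words docs with EM."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    # Random initial soft assignments P(z|d), one normalized row per document.
    resp = []
    for _ in docs:
        row = [rng.random() + 0.1 for _ in range(n_classes)]
        resp.append([x / sum(row) for x in row])
    for _ in range(n_iters):
        # M-step: smoothed class priors and per-class word distributions.
        prior = [
            (sum(r[z] for r in resp) + 1) / (len(docs) + n_classes)
            for z in range(n_classes)
        ]
        word_probs = []
        for z in range(n_classes):
            counts = Counter()
            for d, r in zip(docs, resp):
                for w in d:
                    counts[w] += r[z]
            total = sum(counts.values()) + len(vocab)
            word_probs.append({w: (counts[w] + 1) / total for w in vocab})
        # E-step: P(z|d) proportional to P(z) * prod_w P(w|z), in log space.
        resp = []
        for d in docs:
            logs = [
                math.log(prior[z]) + sum(math.log(word_probs[z][w]) for w in d)
                for z in range(n_classes)
            ]
            peak = max(logs)
            exps = [math.exp(v - peak) for v in logs]
            resp.append([e / sum(exps) for e in exps])
    return prior, word_probs, resp

docs = [["how", "do", "i"], ["thanks", "a", "lot"], ["how", "can", "i"], ["many", "thanks"]]
prior, word_probs, resp = em_naive_bayes(docs, n_classes=2)
print([[round(p, 2) for p in row] for row in resp])
```

The classes found this way are unnamed clusters; which speech act each one corresponds to, if any, has to be judged afterwards.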
Latent Semantic Indexing / Latent Dirichlet Allocation
• Each document contains multiple topics
• P(d) = \prod_{w \in d} \sum_z P(w|z) P(z|d)
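To make the mixture form concrete, the sketch below evaluates log P(d) for one document, summing over topics for each word. The topic names “x” and “y” and all probabilities are made-up illustrations, and the tables are assumed to be given rather than learned.

```python
import math

def log_p_doc_mixture(doc, word_given_topic, topic_given_doc):
    """log P(d) = sum over words of log( sum_z P(w|z) P(z|d) )."""
    total = 0.0
    for w in doc:
        p_w = sum(
            topic_given_doc[z] * word_given_topic[z].get(w, 0.0)
            for z in topic_given_doc
        )
        # Note: math.log raises ValueError if no topic can generate the word.
        total += math.log(p_w)
    return total

word_given_topic = {
    "x": {"motif": 0.5, "widget": 0.5},
    "y": {"cheap": 0.5, "buy": 0.5},
}
topic_given_doc = {"x": 0.5, "y": 0.5}
print(log_p_doc_mixture(["motif", "buy"], word_given_topic, topic_given_doc))
```

Unlike the single-class case, each word may be explained by a different topic, which is why the sum over z sits inside the product over words.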
Model for Conversational Text
Message m
Response r
P(m,r|z) = P(m|z) P(r|z)
P(r|m) \propto \sum_z P(z) P(m|z) P(r|z)
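A small sketch of scoring candidate responses for a message under this model. It assumes separate word tables for messages and responses within each class, a small floor probability for unseen words, and made-up class names (“qa”, “chat”); none of these details come from the slides.

```python
import math

def bag_prob(words, word_probs, floor=1e-6):
    """Product of per-word probabilities; unseen words get a small floor."""
    return math.prod(word_probs.get(w, floor) for w in words)

def response_scores(message, candidates, prior, msg_word_probs, resp_word_probs):
    """Score each candidate r by sum_z P(z) P(m|z) P(r|z), then normalize."""
    scores = [
        sum(
            prior[z]
            * bag_prob(message, msg_word_probs[z])
            * bag_prob(r, resp_word_probs[z])
            for z in prior
        )
        for r in candidates
    ]
    total = sum(scores)
    return [s / total for s in scores]

prior = {"qa": 0.5, "chat": 0.5}
msg_word_probs = {"qa": {"how": 0.9, "hello": 0.1}, "chat": {"how": 0.1, "hello": 0.9}}
resp_word_probs = {"qa": {"try": 0.9, "hi": 0.1}, "chat": {"try": 0.1, "hi": 0.9}}
print(response_scores(["how"], [["try"], ["hi"]], prior, msg_word_probs, resp_word_probs))
```

Because the message and response share the class z, a message that looks like one class pulls probability toward responses typical of that same class.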
Example
Classification Performance
• Labeled ~100 messages with speech acts
– M/R model – 40-60%
– Single-message NB – 20-30%
• Need more labels