Functional description of several
linguistic web services at RACAI
Research Institute for AI,
Romanian Academy
I
Language Identification
LangID overview
• The algorithm uses a collection of texts in various languages to build statistical affix models that characterize each language
• When a new text is received (the bigger the better, but usually no less than a single sentence), its affix model is compared to the trained ones
• The trained model that is closest to the input text model gives the language of the input (see the illustrative sketch below)
• For the technicalities of building language identification models please
refer to:
Dan Tufiş, Radu Ion, Alexandru Ceauşu, and Dan Ştefănescu. RACAI's
Linguistic Web Services. In Proceedings of the 6th Language
Resources and Evaluation Conference - LREC 2008, Marrakech,
Morocco, May 2008. ELRA - European Language Resources
Association. ISBN 2-9517408-4-0.
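The slides do not spell out the exact model or distance measure (see the cited paper); as a rough illustration only, the sketch below assumes word-final character trigrams as "affixes" and cosine similarity, which are assumptions rather than RACAI's actual method.

# Illustrative sketch only: nearest-model language identification over
# affix (word-final trigram) frequency profiles, compared with cosine
# similarity. The real LangID models follow the LREC 2008 paper above.
use strict;
use warnings;

sub affix_profile {
    my ($text) = @_;
    my %freq;
    for my $w ( split /\W+/, lc $text ) {
        next if length($w) < 3;
        $freq{ substr( $w, -3 ) }++;    # last three characters as the "affix"
    }
    return \%freq;
}

sub cosine {
    my ( $p, $q ) = @_;
    my ( $dot, $np, $nq ) = ( 0, 0, 0 );
    $dot += $p->{$_} * ( $q->{$_} // 0 ) for keys %$p;
    $np  += $_**2 for values %$p;
    $nq  += $_**2 for values %$q;
    return ( $np && $nq ) ? $dot / sqrt( $np * $nq ) : 0;
}

# %models maps a language name to a profile built from its training texts;
# the input is assigned to the language whose model is closest.
sub identify {
    my ( $text, $models ) = @_;
    my $profile = affix_profile($text);
    my ( $best, $best_sim ) = ( 'unknown', -1 );
    while ( my ( $lang, $model ) = each %$models ) {
        my $sim = cosine( $profile, $model );
        ( $best, $best_sim ) = ( $lang, $sim ) if $sim > $best_sim;
    }
    return $best;
}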
LangID WSDL description
• Located at http://nlp.racai.ro/webservices/LangIdWebService.asmx?WSDL
• Contains a single function, GetLanguage, that takes a UTF-8 encoded text and returns a pair consisting of the identified language (string) and a confidence score (real number); a small client sketch follows below
• A web interface to this web service can be found
at http://nlp.racai.ro/webservices/LanguageId.aspx
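As a sketch only: a SOAP::Lite call to GetLanguage, in the style of the Perl client shown later for TTL. The SOAPAction handling and the parameter name 'text' are assumptions about the .asmx service, not taken from the WSDL; verify them against the WSDL before use.

use SOAP::Lite;

# Hypothetical client sketch; namespace, SOAPAction format and the 'text'
# parameter name are assumptions (check the WSDL above).
my $langid = SOAP::Lite->new()
    ->uri( 'http://nlp.racai.ro/' )
    ->proxy( 'http://nlp.racai.ro/webservices/LangIdWebService.asmx' )
    ->on_action( sub { sprintf '"%s%s"', @_ } );   # .asmx expects namespace . method

my $som = $langid->GetLanguage(
    SOAP::Data->name( 'text' )->value( 'Acestea sunt câteva cuvinte ale limbii române.' )->type( 'string' )
);
# the confidence score, if returned as a separate output parameter, is in $som->paramsout()
print $som->result(), "\n" unless $som->fault();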
Some examples of LangID
• GetLanguage( “These are the words of
English language.” ) : <english, 91.98>
• GetLanguage( “Acestea sunt câteva cuvinte
ale limbii române.” ) : <romanian, 73.04>
• GetLanguage( “Ce sont les mots de la
langue française.” ) : <french, 97.37>
II
Tokenizing, Tagging,
Lemmatizing, Chunking and
Linking Free Running Texts
TTL overview
• TTL is able to annotate free running texts in
Romanian, English and recently in French (we
also have Bulgarian resources but no one has yet
validated the output)
• TTL is a Perl module
• “Annotate” means chaining the following:
sentence splitting, tokenization, POS tagging,
lemmatization and chunking (shallow parsing)
TTL WSDL description
• Located at http://ws.racai.ro/ttlws.wsdl
• Functions provided by the WSDL (all parameters
and return types are ASCII strings with various
Unicode characters encoded with SGML entities):
– SentenceSplitter( lang, text )
– Tokenizer( lang, text )
– Tagger( lang, text )
– Lemmatizer( lang, text )
– Chunker( lang, text )
– XCES( lang, text )
– UTF8toSGML( text )
– SGMLtoUTF8( text )
TTL usage
• The lang parameter may be one of "en", "ro", "bg" or "fr"
• This list might be extended (meaning that we have the necessary resources) to "el", "si", "cz", "sr", "de", but native-speaker validation is required for fine-tuning of TTL
• Operations are to be stacked from SentenceSplitter to Chunker (meaning that the next function requires the output of the previous one)
• XCES calls all the above functions in order and returns an XML codification of the output of Chunker (which contains all the annotations up to its call)
• UTF8toSGML MUST be used to convert the input (to SentenceSplitter or XCES) to the SGML encoding used internally by TTL (see the sketch below)
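Using the same SOAP::Lite handle as in the Perl client shown further below, the conversion requirement amounts to something like:

# Convert the UTF-8 input to the SGML-entity encoding expected by TTL before
# calling SentenceSplitter, and convert the result back for display
# ($soap is the TTL handle from the Perl client example below).
my $sgml  = $soap->UTF8toSGML( "Acestea sunt câteva cuvinte ale limbii române." )->result();
my $split = $soap->SentenceSplitter( "ro", $sgml )->result();
my $utf8  = $soap->SGMLtoUTF8( $split )->result();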
A TTL-chain example
• Chunker( "en",
    Lemmatizer( "en",
      Tagger( "en",
        Tokenizer( "en",
          SentenceSplitter( "en", "This is a simple example of a web service remote execution." ) ) ) ) )
• produces …
TTL output (I)
token       tag      lemma       chunk
This        Pd3-s    this        (none)
is          Vmip3s   be          Vp#1
a           Ti-s     a           Np#1
simple      Afp      simple      Np#1,Ap#1
example     Ncns     example     Np#1
of          Sp       of          Pp#1
a           Ti-s     a           Pp#1,Np#2
web         Ncns     web         Pp#1,Np#2
service     Ncns     service     Pp#1,Np#2
remote      Afp      remote      Pp#1,Np#2,Ap#2
execution   Ncns     execution   Pp#1,Np#2
.           PERIOD   .           (none)
Another TTL-chain example –
The XCES wrapper
• XCES( "en", "This is a simple example of a web service remote execution." )
• produces …
LexPar – linking sentences
• Located at http://www.racai.ro/lxpws.wsdl
• Uses the output of the XCES wrapper of
TTL and enriches that annotation with
dependency-like annotations (words are
related in pairs but relations are not oriented
and are not labeled)
• Function:
– linkXCES( lang, text )
LexPar output
• linkXCES( "en", XCES( "en", "This is a simple example of a web service remote execution." ) )
• produces …
use SOAP::Lite;

my( $soap ) = SOAP::Lite->new()->
    uri( 'http://ws.racai.ro/pdk/ttlws' )->
    proxy( 'http://ws.racai.ro:8080/' );
my( $soap2 ) = SOAP::Lite->new()->
    uri( 'http://ws.racai.ro/pdk/lxpws' )->
    proxy( 'http://ws.racai.ro:8080/' );

print(
    $soap->Chunker( "en",
        $soap->Lemmatizer( "en",
            $soap->Tagger( "en",
                $soap->Tokenizer( "en",
                    $soap->SentenceSplitter( "en", "This is a simple example of a web service remote execution." )->result()
                )->result()   # end Tokenizer call
            )->result()       # end Tagger call
        )->result()           # end Lemmatizer call
    )->result()               # end Chunker call
);                            # end print

my( $res )  = $soap->XCES( "en", "exmpl", "This is a simple example of a web service remote execution.\n" );
my( $res2 ) = $soap2->linkXCES( "en", $res->result() );
print( $res2->result() );
III
SearchRoWiki
• SearchRoWiki (Search Romanian Wikipedia) is a web service
originally developed for the Romanian shared task at CLEF
2007.
• The web service searches through the collection of 43,000 Romanian documents available on Wikipedia and is based on a C# port of the Lucene search engine.
• The web service was designed to use the results of query analysis (presented as a series of weighted Boolean terms) to retrieve a list of documents/sections that best match the query.
• On the indexing side, SearchRoWiki uses the TTL and LexPar web services to annotate the available documents.
• The web service description is available at http://nlp.racai.ro/WebServices/SearchRoWikiWebService.asmx?WSDL and a sample client can be found at http://nlp.racai.ro/WebServices/SearchRoWiki.aspx
SearchRoWiki methods
• The web service has only one method, "GetResults". The argument of the method is a string containing a Lucene Boolean query (a call sketch follows the request template below).
• There are 4 fields to search in: title word form (title), title
lemma (ltitle), document word form (text) and document lemma
(ltext).
• In the "ltext" and "ltitle" fields only the lemma is indexed. Instead of filtering the index terms using a stop-word list, SearchRoWiki uses the information from POS tagging to keep only the content words (nouns, main verbs, adjectives, adverbs and numerals). In addition, the web service uses the sentence and chunk annotation to insert phrase boundaries into the term index; a phrase query cannot match across different chunks or sentences.
<GetResults xmlns="http://nlp.racai.ro/">
<query>string</query>
</GetResults>
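A call sketch in the same SOAP::Lite style as the earlier examples; the element names come from the request template above, while the SOAPAction handling is an assumption about the .asmx service.

use SOAP::Lite;

# Sketch: query SearchRoWiki with the example query used on a later slide.
my $wiki = SOAP::Lite->new()
    ->uri( 'http://nlp.racai.ro/' )
    ->proxy( 'http://nlp.racai.ro/WebServices/SearchRoWikiWebService.asmx' )
    ->on_action( sub { sprintf '"%s%s"', @_ } );

my $som = $wiki->GetResults(
    SOAP::Data->name( 'query' )->value( '(ltext:Mihai ltext:Eminescu)~2' )->type( 'string' )
);
print $som->result(), "\n" unless $som->fault();   # XML hit list (see the example below)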
SearchRoWiki - results
• To achieve better precision in ranking the documents and their content, the web service uses two indexes: (i) one for the documents and (ii) one for the sections of the documents.
• The hit list returned from the web service is a
list of sections that match the query. The
sections are sorted and ranked using the
documents index.
<GetResultsResponse xmlns="http://nlp.racai.ro/">
  <GetResultsResult>string</GetResultsResult>
</GetResultsResponse>
SearchRoWiki - example
<!DOCTYPE Documents SYSTEM "http://nlp.racai.ro/WebServices/wiki.dtd">
<Documents count="178" query="(ltext:Mihai ltext:Eminescu)~2">
<Terms>
<Term field="ltext">Eminescu</Term>
<Term field="ltext">Mihai</Term>
</Terms>
<Document pgname="Mihai_Eminescu_(dezambiguizare)" score="1.0000"
url="16912.xml.ttl.xml">
<Title>
<snt>Mihai/Mihai/Np/Np#1 Eminescu/Eminescu/Np/Np#1 (/(/LPAR
dezambiguizare/dezambiguizare/Ncfsrn/Np#2 )/)/RPAR</snt>
</Title>
<Sections>
<Section score="1.0000">
<Title>
<snt>Mihai/Mihai/Np/Np#1 Eminescu/Eminescu/Np/Np#1 (/(/LPAR
dezambiguizare/dezambiguizare/Ncfsrn/Np#2 )/)/RPAR</snt>
</Title>
<Content>
<p>
<snt>Mihai/Mihai/Np/Np#1 Eminescu/Eminescu/Np/Np#1 se/sine/Px3--a-------w/Vp#1 poate/putea/Vmip3s/Vp#1 referi/referi/Vmnp/Vp#1 la/la/Spsa
:/:/COLON</snt>
</p>
<p>
<snt>Mihai/Mihai/Np/Np#1 Eminescu/Eminescu/Np/Np#1 ,/,/COMMA
poet/poet/Ncms-n/Np#2 din/din/Spsa/Pp#1
literatura/literatură/Ncfsry/Pp#1,Np#3
română/român/Afpfsrn/Pp#1,Np#3,Ap#1</snt>
</p>
<p>
<snt>Mihai/Mihai/Np/Np#1 Eminescu/Eminescu/Np/Np#1 ,/,/COMMA
Botoşani/Botoşani/Np/Np#2 ,/,/COMMA
localitate/localitate/Ncfsrn/Np#3 în/în/Spsa/Pp#1
judeţul/judeţ/Ncmsry/Pp#1,Np#4
Botoşani/Botoşani/Np/Pp#1,Np#4 ,/,/COMMA
România/România/Np/Np#5</snt>
</p>
………
IV
Statistical Machine
Translation
SMT WebService
• The web service is based on Moses, the open-source factored translation system.
• Factored translation models (Koehn and Hoang, 2007) integrate additional annotations into the word-form model. The additional information can be represented by several factors such as lemma, part of speech, morpho-syntactic description, etc. These annotations can help word-form translation or the reordering of the translated word forms.
• The configuration of the Moses decoder is based on alternating translation and generation steps:
  1. translating lemmas;
  2. generating morpho-syntactic descriptions from lemmas;
  3. translating from part-of-speech tags to part-of-speech tags plus morpho-syntactic descriptions;
  4. generating the surface form based on lemma and morpho-syntactic description.
• The main corpus used for training the translation system is the SEnAC corpus - the SEE-ERA.net Administrative Corpus (Tufis et al., 2008). The SEnAC corpus is richly annotated, thus allowing a great variety of factored translation models (4 factors available: word form, lemma, part of speech and morpho-syntactic description).
• The web service description is available at http://nlp.racai.ro/Moses/MosesWebService.asmx?WSDL and a very simple sample client can be found at http://nlp.racai.ro/Moses
MosesWebService - methods
• The translation web service is available for 4
language pairs: English-Romanian, English-Greek,
English-Slovene and Romanian-English.
• Slovene-English and Greek-English will be available soon
• The translation web service requires factorized input. The factorized input can be obtained using the TTL web service to annotate the user input (available only for English and Romanian). The factorized input is then fed into a Moses decoder instance.
• <Translate xmlns="http://nlp.racai.ro/">
<languagePair>ENRO or ROEN or ENEL or
ENSL</languagePair> <input>string</input>
</Translate>
MosesWebService – example
• The input string "having regard to article 10 of the treaty" is annotated using the TTL web service
• The result is transformed into the factorized format:
– “having|have|PPRE|Vmpp regard|regard|NN|Ncns
to|to|PREP|Sp article|article|NN|Ncns 10|10|CD|Mc
of|of|PREP|Sp the|the|DM|Dd treaty|treaty|NN|Ncns
;|;|SCOLON|SCOLON”
• The factorized string is used as input for the
“Translate” method of the MosesWebService
• If “ENRO” is used as “languagePair” argument of
the “Translate” method, the result will be:
– “având în vedere art. 10 din tratat ;”
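Putting the pieces together as a sketch: the element names (languagePair, input) come from the request template above, while the SOAPAction handling is an assumption about the .asmx service.

use SOAP::Lite;

# Sketch: send the factorized string above to the Translate method.
my $moses = SOAP::Lite->new()
    ->uri( 'http://nlp.racai.ro/' )
    ->proxy( 'http://nlp.racai.ro/Moses/MosesWebService.asmx' )
    ->on_action( sub { sprintf '"%s%s"', @_ } );

my $factorized = 'having|have|PPRE|Vmpp regard|regard|NN|Ncns to|to|PREP|Sp '
               . 'article|article|NN|Ncns 10|10|CD|Mc of|of|PREP|Sp '
               . 'the|the|DM|Dd treaty|treaty|NN|Ncns ;|;|SCOLON|SCOLON';

my $som = $moses->Translate(
    SOAP::Data->name( 'languagePair' )->value( 'ENRO' )->type( 'string' ),
    SOAP::Data->name( 'input' )->value( $factorized )->type( 'string' )
);
print $som->result(), "\n" unless $som->fault();   # expected: "având în vedere art. 10 din tratat ;"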
V
Connotation Analysis
Deep Phrase Structure
The perplexed students reread the chapter over the weekend.
[S [NP [DET The] [ADJ perplexed] [N students]] [VP [V reread] [NP [DET the] [N chapter]] [PP [P over] [NP [DET the] [N weekend]]]] . ]
Partial Phrase Structure
<seg lang="en">
<s id="eng.1">
<w lemma="the" ana="Dd" chunk="Np#1">The</w>
<w lemma="perplexed" ana="Afp"
chunk="Np#1,Ap#1">perplexed</w>
<w lemma="student" ana="Ncnp"
chunk="Np#1">students</w>
<w lemma="reread" ana="Vmip-p" chunk="Vp#1">reread</w>
<w lemma="the" ana="Dd" chunk="Np#2">the </w>
<w lemma="chapter" ana="Ncns" chunk="Np#2">chapter</w>
<w lemma="over" ana="Sp" chunk="Pp#1">over </w>
<w lemma="the" ana="Dd" chunk="Pp#1,Np#3">the</w>
<w lemma="weekend" ana="Ncns"
chunk="Pp#1,Np#3">weekend</w>
<c>.</c>
</s>
</seg>
Partial Phrase Structure
- The         chunk="Np#1"
- perplexed   chunk="Np#1,Ap#1"     NP#1
- students    chunk="Np#1"
- reread      chunk="Vp#1"
- the         chunk="Np#2"
- chapter     chunk="Np#2"          NP#2
- over        chunk="Pp#1"          PP#1
- the         chunk="Pp#1,Np#3"     NP#3
- weekend     chunk="Pp#1,Np#3"
Text Preprocessing
• tokenization, tagging, lemmatization and chunking
(TTL - http://tutankhamon.racai.ro/ttlws.wsdl)
• CONAN takes input either from a file or from the
keyboard.
– in case of input from a file, CONAN expects the file to be
already preprocessed and encoded the same way our
linguistic web services platform provides the output
(XCES format).
– in case of keyboard input, the system uses the RACAI web
services (based on standard web technology:
SOAP/WSDL/UDDI) which ensures processing for
Romanian and English.
Calling TTL (C#)
• Add to Web References: http://tutankhamon.racai.ro/ttlws.wsdl
• C# Code:

System.Net.ServicePointManager.Expect100Continue = false;
// this line is added for compatibility with the Perl implementation of the web service
TTL ttlServ = new TTL();                       // initialize the generated proxy
string text = ttlServ.UTF8toSGML(UserText);    // conversion
text = ttlServ.XCES("en", "eng", text);        // actual preprocessing
text = ttlServ.SGMLtoUTF7(text);               // conversion
ASCIIEncoding asciiEnc = new ASCIIEncoding();  // conversion
text = Encoding.UTF8.GetString(UTF7Encoding.Convert(Encoding.UTF7, Encoding.UTF8, asciiEnc.GetBytes(text))); // conversion

// the text variable now contains the preprocessed version of the text entered by the user
CONAN
ENG20-01709714-a <P>0.125</P><N>0.5</N><O>0.375</O>
ENG20-09970518-n <P>0.0</P><N>0.0</N><O>1</O>
ENG20-00605618-v <P>0.25</P><N>0.0</N><O>0.75</O>
ENG20-05999272-n <P>0.0</P><N>0.0</N><O>1</O>
ENG20-14311212-n <P>0.0</P><N>0.0</N><O>1</O>
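Each output line pairs a PWN 2.0 synset identifier with its SentiWordNet positive/negative/objective scores; a minimal Perl parsing sketch (illustrative only):

# Parse a CONAN output line such as:
#   ENG20-01709714-a <P>0.125</P><N>0.5</N><O>0.375</O>
my $line = 'ENG20-01709714-a <P>0.125</P><N>0.5</N><O>0.375</O>';
if ( $line =~ m{^(\S+)\s*<P>([\d.]+)</P><N>([\d.]+)</N><O>([\d.]+)</O>} ) {
    my ( $synset, $p, $n, $o ) = ( $1, $2, $3, $4 );
    print "$synset: P=$p N=$n O=$o\n";
}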
CONAN
• Interpretations:
  – positive
  – negative
  – objective
  – forced positive
  – forced negative
  – forced objective
  – forced non-positive
  – forced non-negative
  – forced non-subjective
• The score for a node in the tree representation of the sentence is computed as the average of its children's senti-word scores (for the current interpretation)
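In formula form (notation introduced here for illustration), for a node whose children carry senti-word scores c1, …, ck under the current interpretation I:

  score_I(node) = ( score_I(c1) + … + score_I(ck) ) / k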
Forced Interpretations
• Forced interpretations:
  – Non-Negative
  – Non-Positive
  – Non-Subjective
  – Most Positive
  – Most Negative
  – Most Objective
• CONAN replaces the words in the currently analyzed sentence with synonyms that have lower interpretative scores than the current ones. The rationale is to help the user avoid words that could be interpreted in different ways, by suggesting synonyms with less (ideally no) connotation variability.
Forced Interpretations
• Forced I
  – a word is replaced by one of its synonyms, represented by a literal that has another sense with the highest score, for the interpretation I, among all senses of all literals corresponding to the synonyms from the already assigned synset of the original word
• Forced non-I
  – a word is replaced by one of its synonyms, represented by a literal for which the minimum of the scores of its senses for the interpretation I is greater than all the other minimums for the other synonyms
Forced Interpretations
• w with senses m1, m2, …, mn (corresponding to n
synsets)
• consider that for a certain interpretation I, the sense having the highest score is mI. For this sense, w has the following synonyms: s1, s2, …, sk (sI corresponding to mI)
• each si is a sense for a literal Li which may have
several other senses: mi1, mi2, …, mit
• Forced I
• Forced non-I
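Read literally, the definitions on the previous slide select the replacement literal as follows (an illustrative formalization; score_I(m) denotes the score of sense m under interpretation I and is notation introduced here, not from the original slides):

  Forced I:      choose the Li that maximizes  max over j of score_I(mij)
  Forced non-I:  choose the Li that maximizes  min over j of score_I(mij)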
Forced Interpretations
• The jurors said they realized a proportionate distribution of these funds might be...
• positive interpretation:
  – ENG20-00855959-a - exhibiting equivalence or correspondence among constituents of an entity or between different entities (0.125)
  – ENG20-00454604-a * - agreeing in amount, magnitude, or degree (0)
  – ENG20-00455216-a - being in due proportion (0)
• synonyms in ENG20-00855959-a:
  – harmonious
  – proportionate
  – symmetrical
word            Interpretation +                                Interpretation -
harmonious      ENG20-01123151-a (musically pleasing) - 0.75    ENG20-00480442-a (existing together in harmony) - 0.25
proportionate   ENG20-00855959-a - 0.125                        ENG20-00454604-a (agreeing in amount, magnitude, or degree) - 0.125
symmetrical     ENG20-00855959-a - 0.125                        ENG20-00855959-a - 0
Senti-Words and Valence Shifters
• The words which are marked up with SentiWordNet triples <O;P;N> are called senti-words; a phrasal chunk (as returned by our chunker) is called a senti-phrase
• The words which contextually modify the subjectivity scores of senti-words or senti-phrases are called valence shifters; they are collected in external files and are dealt with in a simple-minded way (sketched after this slide):
  – Intensifiers (increase P or N of the argument by 20%)
  – Diminishers (decrease P or N of the argument by 20%)
  – Negations (switch P and N of the argument)
• A more principled way (under development) will take into account the SWN annotation for the valence shifters which are also senti-words, their grammatical category and preferred argument type, an argument-sensitive valence-shifting function, as well as the requested type of connotation analysis (next)
• It is interesting to note that, in general, translation equivalence preserves VSs' type distinctions. However, this is not always true. For instance, in Romanian destul (either adjective or adverb), when followed by the preposition de, is arguably a diminisher. In English, its translation equivalent enough acts more like an intensifier than a diminisher.
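A minimal sketch of the three simple rules above, applied to a <P, N> pair (hypothetical helper; the actual CONAN code may differ, e.g. in whether only one of P or N is scaled):

# Simple valence-shifting rules (sketch): intensifiers and diminishers scale the
# subjective scores by +/-20%, negations swap P and N.
sub apply_shifter {
    my ( $type, $p, $n ) = @_;
    if    ( $type eq 'intensifier' ) { $p *= 1.2; $n *= 1.2; }
    elsif ( $type eq 'diminisher'  ) { $p *= 0.8; $n *= 0.8; }
    elsif ( $type eq 'negation'    ) { ( $p, $n ) = ( $n, $p ); }
    return ( $p, $n );
}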
Interpretativity
• The interpretativity score (IS) is a quantitative measure of
the potential of a sentence to shift its opinion/connotation.
It is computed as a normalized sum of the subjective
interpretations of the sentence:
IS(sentence) = 0.5 * ( P(sentence) + N(sentence) ) / ( 1 + | P(sentence) - N(sentence) | )
• if the subjective interpretation scores are large and comparable for a certain word, that word will very probably have a major impact in changing the connotation of a sentence
• For the current SentiWordNet annotations, the senti-words with the highest interpretativity score (IS = 0.875) are pretty, immoral and gross
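The formula transcribes directly into a small Perl helper (the name is illustrative):

# Interpretativity score from the formula above, given the sentence-level
# positive (P) and negative (N) scores.
sub interpretativity {
    my ( $p, $n ) = @_;
    return 0.5 * ( $p + $n ) / ( 1 + abs( $p - $n ) );
}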
Interpretativity
• experiments on SEMCOR - a Ro-En parallel preprocessed corpus with diverse content - 8146 sentences

Interpret.  I. Poz.  I. Neg.  Sentence
0.625       0.583    0.479    For he feared the Lake_of_Fire
0.617       0.625    0.452    Neither better nor worse
0.611       0.687    0.45     Pat_O'Dwyer looked_like a heavier Jim
0.481       0.75     0.5      THE MODEST AND HAPPY Spahn waved_off his new laurels as one of those good days
0.625       0.479    0.441    Fresh warm sweet and juicy sweet lovin sixteen she was
0.625       0.480    0.5      This happy always smiling lad with the sunny disposition is our new Junior_Mr._Canada Henri_de_Courcy
0.625       0.541    0.4375   Only one worth a shit and that 's Brandon
CONAN Interface (Most Positive Analysis)
Results of the Interpretative Analysis
CONAN Interface (Subjectivity Scorer)
Detailed Analysis
Annotating WordNet for prior
polarities
• Starting with a set of words, hand annotated
for their prior polarities, most sentiment
resources are built by applying some ML
techniques and inducing prior polarities for
lexical items stored in lexico-semantic
repositories. As WordNet is a highly praised repository of this kind, its structure and content are, not surprisingly, the backbone of such enterprises.
How much trust can one have in
such an approach?
• Pretty high, but if there are different viewpoints, one had better harmonize them.
• Domains: psychological_features, psychology, quality,
military etc.
• SUMO/MILO: EmotionalState, Happiness, PsychologicalProcess, SubjectiveAssessmentAttribute, StateOfMind, TraitAttribute, Unhappiness, War etc.
• SENTIWN, DOMAINS, SUMO&MILO, General
Inquirer annotations should, intuitively, match! Do they?
Not really!
Some statistics
• 2637 synsets labeled by the SUMO concept
SubjectiveAssessmentAttribute or
EmotionalState have the Sentiwn annotation
P:0, N:0, O:1
– E.g. (SAA): prosperously, impertinently, best, victimization, oppression, curse (sense 1 as a noun), honeymoon, prettify, beautify, threaten, justify, waste, cry, experience…
– E.g. (ES): unsatisfactorily, lonely, sordid, kindness, disappointment, frustration…
Some statistics (ctnd.)
• 28434 synsets are marked for subjectivity. Many of them are
questionably marked so:
– e.g. Abstract, BodyPart, Building, Device, EngineeringComponent (most of the time with negative polarity☺), FieldOfStudy (nonparametric statistics is bad: <P>0.0</P><N>0.625</N><O>0.375</O>, while gastronomy is much better: <P>0.5</P><N>0.0</N><O>0.5</O>)
– Happiness: Happy, pleased which is similar to glad (1) is not good (<P>0.0</P><N>0.75</N><O>0.25</O>)
– Human instances (both real persons and literary characters)
– LinguisticExpression (extralinguistic is very bad: <P>0.0</P><N>0.75</N><O>0.25</O>)
– PrimeNumber (<P>0.0</P><N>0.375</N><O>0.625</O>)
– Proposition (conservation of momentum: <P>0.0</P><N>0.25</N><O>0.75</O>)
– DiseaseOrSyndrome (influenza, flu, grippe: <P>0.75</P><N>0.0</N><O>0.25</O>)
– Prison (Jail is not bad at all, it's a little fun: <P>0.25</P><N>0.0</N><O>0.75</O>)
– etc.
Some other statistics
• We took Wiebe’s hand-crafted list of PositivePolarity and
MinusPolarity words (based on General Inquirer):
– PolPman file contains 657 words
– PolMman file contains 679 words
• We extracted all the synsets in PWN2.0 containing the literals
in Wiebe’s files
– PwNPolPman file contains 2305 synsets
• 817 synsets are marked as entirely objective O:1
• 239 synsets have non-positive subjectivity (P:0)
• 486 synsets have P ≥ 0.5 (corresponding to 293 literals)
– PwPolMman file contains 1803 synsets
• 461 synsets are marked as entirely objective O:1
• 213 synsets have non-negative subjectivity (N:0)
• 656 synsets have N ≥ 0.5 (corresponding to 356 literals)
Why does this happen?
Assuming WordNet structuring is perfect &
assuming the SUMO&MILO classification is perfect &
assuming General Inquirer is perfect &
assuming Wiebe's polarity lists are perfect:
• Taxonomic generalization does not always work
  – Nightmare is bad!
    • Nightmare is a dream
    • But dream is not bad (per se)
  – An emotion is something good (P:0.5) and so is love, but hate or envy are not!
• Glosses are full of valence shifters (BOW is not sufficient):
– honest, honorable: not disposed to CHEAT or DEFRAUD , not
DECEPTIVE or FRAUDULENT
– intrepid: invulnerable to FEAR or INTIMIDATION
– superfluous: serving no USEFUL+ purpose; having no EXCUSE+
for being
• Majority voting is democratic but not the best solution
Why does this happen (ctnd.)?
But Wiebe's polarity lists are not perfect
& General Inquirer is not perfect
& SUMO&MILO classification is not perfect
& WordNet structuring is not perfect