Automatic classification of Web texts using Functional Text Dimensions

Annotating genres on the Web
Detecting genre symptoms
Lagutin M. B., Katinskaya A. Y., Selegey V. P., Sharoff S. A.,
Sorokin A. A.
MSU, RSUH, ABBYY, University of Leeds, MIPT
30 May 2015
Problem statement
Annotation results
Genre classification as a jungle
Mehler, Sharoff, Santini (2010). Genres on the Web.
MGC (20 genres): adults, blog, children, commercial, community, content delivery, entertainment, error message, FAQ, informative, journalistic, official, personal, poetry, prose, scientific, shopping, gateway, index, user input
KI-04 (8 genres): article, discussion, download, help, linklists, portrait-nonpriv, portrait-priv, shop
Santini's (8 genres): blog, eShop, FAQ, frontpage, hotlist, homepage, sitemap, search page
349 genres in the Syracuse collection
No reliable annotation framework
Functional Text Dimensions
A8: hardnews To what extent does the text appear to be an
informative report of events recent at the time of
writing? (For example, a news item.)
A1: argum To what extent does the text contain explicit
argumentation to persuade the reader? (For example,
argumentative blogs or discussion forums.)
A17: eval To what extent does the text evaluate a specific
entity by endorsing or criticising it? (For example, a
product review.)
Rating levels:
0 none or hardly at all;
0.5 slightly;
1 somewhat or partly;
2 strongly or very much so.
(Forsyth, Sharoff, 2014)
Training corpus
Documents  Words   Sentences
618        997255  72651
Typical text sources
Wikipedia
Blogs (vk.com, livejournal.com)
News websites (ru.wikinews.org, chaskor.ru)
Legal informational portals (base.garant.ru,
consultant.ru)
Scientific and popular scientific journals (sci-article.ru,
cyberleninka.ru)
Advertising websites (avito.ru, mvideo.ru)
Distribution of annotations
Frequent: A16 (info), A1 (argum), A3 (emotive), A8 (news),
A11 (personal) and A17 (evaluation).
Less frequent: A4 (fiction), A5 (flippant), A7 (instructive),
A18 (dialogic) and A19 (poetic).
Some dimensions are categorical (A4, A7, A9, A12, A14).
A16 is in between: 221 values of 2 among 360 nonzero values.
By contrast, only 53 of the 244 nonzero values for A17 are 2.
Clusterization method and results
Learning the dimensions
Binarization
We binarize the values with two recodings and perform the
clustering procedure for both of them.
Extremes are kept: 0 (none) or 2 (strongly)
0.5 (slightly) → 0 in both recodings
1 (somewhat) → 2 in one recoding, 1 → 0 in the other
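The two recodings can be sketched as a lookup table. This is only an illustration of the rule on this slide; the function name `binarize`, the parameter `one_maps_to` and the sample ratings are assumptions, not part of the original work.

```python
def binarize(ratings, one_maps_to):
    """Recode FTD ratings from {0, 0.5, 1, 2} to the extremes {0, 2}.

    one_maps_to selects the recoding: 2 treats 'somewhat' as present,
    0 treats it as absent. 'Slightly' (0.5) always becomes 0.
    """
    if one_maps_to not in (0, 2):
        raise ValueError("one_maps_to must be 0 or 2")
    recode = {0: 0, 0.5: 0, 1: one_maps_to, 2: 2}
    return [recode[r] for r in ratings]


# Hypothetical ratings of one document on four dimensions.
doc = [2, 0.5, 1, 0]
lenient = binarize(doc, one_maps_to=2)  # recoding with 1 -> 2
strict = binarize(doc, one_maps_to=0)   # recoding with 1 -> 0
print(lenient, strict)
```

Clustering is then run once on the `lenient` vectors and once on the `strict` ones.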
Histogram of FTD combinations
Calculating frequencies of FTD combinations
Unifying clusters to classes
Some combinations of FTD values (prototypes) are more
frequent than others; we take them as cluster centers.
We consider such clusters as potential pseudogenres.
Most of the clusters are very small (only 7 contain more than 20
texts).
There is no hope of separating such small groups.
Fortunately, the clusters themselves are organized
hierarchically.
So we group clusters (= standard combinations of feature
values) into higher-order groups.
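A minimal sketch of how frequent combinations might be picked as prototypes: count how often each set of strongly present dimensions occurs in the binarized annotations and keep the frequent ones. The cutoff `min_size`, the function names and the toy documents are assumptions for illustration; the slides do not state the actual selection rule.

```python
from collections import Counter


def combination_counts(binary_docs):
    """Count each combination of dimensions that are rated 2 in a document."""
    combos = (
        frozenset(dim for dim, value in doc.items() if value == 2)
        for doc in binary_docs
    )
    return Counter(combos)


def frequent_prototypes(counts, min_size):
    """Keep combinations seen at least min_size times, most frequent first."""
    return [
        (sorted(combo), n)
        for combo, n in counts.most_common()
        if n >= min_size
    ]


# Hypothetical binarized annotations (dimension -> 0 or 2).
docs = [
    {"A8": 2, "A16": 0},
    {"A8": 2, "A16": 0},
    {"A8": 2, "A16": 0},
    {"A9": 2, "A16": 2},
    {"A9": 2, "A16": 2},
    {"A1": 2},
]
counts = combination_counts(docs)
protos = frequent_prototypes(counts, min_size=2)
print(protos)
```

Combinations below the cutoff (here the single A1 document) are the small clusters that would be merged into higher-order groups.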
Pseudogenres and their FTDs
Genre       FTD combinations with weights
Discussion  A1: 26, (A1+A13): 17, A13: 11, (A1+A8): 7
Personal    A11: 18, (A3+A6+A11+A17): 11, (A11+A1): 9, (A11+A17): 8
News        A8: 68
Legal       (A9+A16): 42, A9: 26
Advert      (A12+A16): 31, A12: 10
Science     (A14+A16): 26, (A14+A15+A16): 17, (A15+A16), (A14+A15): 9, A14: 5
Info        A16: 60
Manual      A7: 35 (weight based on the presence method)
Fiction     (A4 or A5 or A18): 69 (weight based on the presence method)
Linguistic features for automatic classification
Biber's Multidimensional Analysis (1988): 40 initial dimensions for Russian
B1 first person pronoun
B2 second person pronoun
B4 reflexive pronoun
...
B38 conditional subordinate
B39 purpose subordinate
B40 negation
Classification scheme
Remove noise documents from the test set (using the cluster
structure).
Transform the language feature values (see below).
Learn a logistic regression model for each of the FTDs.
For clusters, learn a separate classifier for each cluster and assign
the document to the cluster whose model gives the greatest
score.
Perform a backward inverse procedure to make the models
sparser and more reliable.
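The cluster step of this scheme (one classifier per cluster, assignment by greatest score) can be sketched as follows. The gradient-descent fitter is an illustrative stand-in, not the fitting procedure used by the authors, and the feature vectors, labels, learning rate and epoch count are all assumptions.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(w, x):
    """Linear score; the bias is stored as the last weight."""
    return w[-1] + sum(wj * xj for wj, xj in zip(w, x))


def fit_logreg(X, y, lr=0.5, epochs=500):
    """Batch gradient descent on the log-loss (stand-in fitter)."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(logit(w, xi)) - yi
            for j in range(d):
                grad[j] += err * xi[j]
            grad[d] += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def fit_per_cluster(X, labels):
    """One binary model per cluster: this cluster against all the others."""
    return {
        c: fit_logreg(X, [1 if label == c else 0 for label in labels])
        for c in sorted(set(labels))
    }


def assign(models, x):
    """Pick the cluster whose model gives the greatest score."""
    return max(models, key=lambda c: sigmoid(logit(models[c], x)))


# Hypothetical transformed feature vectors (two features per document).
X = [[2.0, 0.1], [1.8, -0.2], [-2.0, 0.2], [-1.9, -0.1], [0.1, 2.0], [-0.2, 1.9]]
labels = ["News", "News", "Legal", "Legal", "Fiction", "Fiction"]
models = fit_per_cluster(X, labels)
print(assign(models, [1.9, 0.0]))
```

The same `fit_logreg` call, with a binarized FTD rating as the target, would give the per-dimension models mentioned above.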
Data transformation
Box-Cox: $B' = \dfrac{B^{\lambda} - 1}{\lambda}$
Standardization: $X = \dfrac{B' - \mu}{\sigma}$
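A small sketch of the two transformations applied in sequence to one feature. The value of `lam`, the use of the population standard deviation, and the requirement that inputs are strictly positive are assumptions here; the slides do not say how λ was chosen or how zero counts were handled. The λ = 0 branch uses the logarithm, which is the standard limiting case of the Box-Cox formula.

```python
import math
from statistics import mean, pstdev


def box_cox(values, lam):
    """Box-Cox transform (B**lam - 1) / lam, with log(B) when lam == 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox needs strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("constant feature cannot be standardized")
    return [(v - mu) / sigma for v in values]


# Hypothetical raw values of one linguistic feature across three documents.
raw = [1.0, 4.0, 9.0]
transformed = box_cox(raw, lam=0.5)
scaled = standardize(transformed)
print(transformed, scaled)
```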
Transformation results
[Figure: two panels, BEFORE and AFTER the transformation]
Sensitivity and specificity for functional dimensions
FTD  Threshold  Sens. train  Spec. train  Sens. test  Spec. test
A1   0.33       76%          78%          79%         67%
A3   0.23       88%          91%          94%         76%
A4   0.09       92%          96%          92%         100%
A5   0.14       79%          80%          76%         81%
A6   0.23       82%          83%          81%         72%
A7   0.11       94%          93%          91%         95%
A8   0.25       86%          86%          80%         73%
A9   0.22       96%          97%          98%         83%
A11  0.31       83%          86%          79%         90%
A12  0.20       89%          90%          90%         80%
A13  0.19       74%          80%          72%         64%
A14  0.18       86%          89%          91%         71%
A15  0.20       71%          75%          73%         75%
A16  0.61       76%          77%          72%         66%
A17  0.24       72%          83%          71%         88%
A18  0.08       91%          96%          89%         100%
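For reference, sensitivity and specificity at a given probability threshold can be computed as in the sketch below. The scores and labels are made-up toy values, and treating a score equal to the threshold as a positive prediction is an assumption.

```python
def sens_spec(scores, labels, threshold):
    """Return (sensitivity, specificity) for scores cut at threshold.

    labels are 1 (dimension present) or 0 (absent); a score at or above
    the threshold counts as a positive prediction.
    """
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if label == 1 and predicted:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy model outputs for one dimension, cut at the A1 threshold from the table.
scores = [0.90, 0.40, 0.35, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0]
sens, spec = sens_spec(scores, labels, threshold=0.33)
print(round(sens, 2), round(spec, 2))
```

Lowering the threshold trades specificity for sensitivity, which is presumably why each dimension has its own cutoff.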
Sensitivity and specificity for clusters
       D    P    N    L    A    S    I    M    F  Total
D     37    9    3    0    3    1    1    1    1     56
P      9   41    0    0    2    3    2    1    9     67
N      6    0   59    0    0    0    7    0    1     73
L      0    0    1   64    0    3    0    1    0     69
A      0    0    0    0   34    0    4    2    1     41
S      1    0    1    4    3   57    9    0    0     75
I      4    3    4    1    0    1   30    1    3     47
M      1    0    0    1    1    0    0   30    2     35
F      4    6    1    0    2    1    2    3   50     69
Se   .66  .61  .81  .93  .83  .76  .64  .86  .72    .76
Se   .78 (D+P)  .81  .93  .83  .80 (S+I)  .86  .72    .81
Sp   .60  .69  .86  .91  .76  .86  .55  .77  .75    .75
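The summary rows can be recomputed from the cluster confusion matrix, as sketched below with the counts from this slide. Two readings are assumptions inferred from the numbers rather than stated in the slides: rows are taken as annotated classes (each row sums to its Total), and the row labelled Sp matches the diagonal divided by the column sum. The second Se row matches sensitivity after merging D with P and S with I.

```python
LABELS = ["D", "P", "N", "L", "A", "S", "I", "M", "F"]
MATRIX = [
    [37, 9, 3, 0, 3, 1, 1, 1, 1],
    [9, 41, 0, 0, 2, 3, 2, 1, 9],
    [6, 0, 59, 0, 0, 0, 7, 0, 1],
    [0, 0, 1, 64, 0, 3, 0, 1, 0],
    [0, 0, 0, 0, 34, 0, 4, 2, 1],
    [1, 0, 1, 4, 3, 57, 9, 0, 0],
    [4, 3, 4, 1, 0, 1, 30, 1, 3],
    [1, 0, 0, 1, 1, 0, 0, 30, 2],
    [4, 6, 1, 0, 2, 1, 2, 3, 50],
]


def row_share(m):
    """Diagonal over row sum: share of each annotated class that was found."""
    return [m[i][i] / sum(m[i]) for i in range(len(m))]


def column_share(m):
    """Diagonal over column sum: share of each predicted class that is right."""
    return [m[j][j] / sum(row[j] for row in m) for j in range(len(m))]


def merged_row_share(m, group):
    """Row share when the classes in group (indices) count as one class."""
    hits = sum(m[i][j] for i in group for j in group)
    total = sum(sum(m[i]) for i in group)
    return hits / total


se = [round(v, 2) for v in row_share(MATRIX)]
sp = [round(v, 2) for v in column_share(MATRIX)]
se_dp = round(merged_row_share(MATRIX, [0, 1]), 2)  # D and P merged
se_si = round(merged_row_share(MATRIX, [5, 6]), 2)  # S and I merged
print(dict(zip(LABELS, se)), dict(zip(LABELS, sp)), se_dp, se_si)
```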
Detecting clusters in a larger corpus
1,000,000 LJ posts from the current version of GICR
Logistic regression for each FTD (A1-A18)
Binarized scores for every feature (threshold .75): 2 or not
Detection of the nearest prototype by Euclidean distance
The most frequent prototypes become new cluster centers
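The binarization and nearest-prototype steps might look like the sketch below. The three dimensions, the prototype list and the probability vectors are illustrative assumptions (the real space has one coordinate per FTD), and counting a score exactly equal to .75 as 2 is also an assumption.

```python
import math

# Assumed order of coordinates in every vector below.
DIMS = ["A8", "A9", "A16"]

# Hypothetical prototypes: (class name, binarized FTD vector).
PROTOTYPES = [
    ("News", [2, 0, 0]),   # A8
    ("Info", [0, 0, 2]),   # A16
    ("Legal", [0, 2, 2]),  # A9+A16
]


def binarize_scores(probs, threshold=0.75):
    """Turn per-FTD model scores into 2 (present) or 0 (not present)."""
    return [2 if p >= threshold else 0 for p in probs]


def nearest_prototype(vector, prototypes):
    """Return the (name, vector) pair closest in Euclidean distance."""
    return min(prototypes, key=lambda proto: math.dist(vector, proto[1]))


post_a = binarize_scores([0.90, 0.10, 0.30])
post_b = binarize_scores([0.20, 0.80, 0.95])
print(nearest_prototype(post_a, PROTOTYPES)[0])
print(nearest_prototype(post_b, PROTOTYPES)[0])
```

Counting how many posts land on each prototype then gives the frequency ranking used to pick new cluster centers.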
Most frequent prototypes
Class       FTDs               #docs
News        A8                 49738
Personal    A3+A6+A11          35506
Info        A16                34578
Review      A3+A6+A11+A17      31214
News        A8+A16             30219
Review      A3+A5+A6+A11+A17   23036
Discussion  A1+A3+A6+A11       20917
Advert      A12+A16            20874
Fiction     A3+A4+A6+A11       20553
Conclusions
Well-formed dense clusters in the FTD space
These clusters correspond to reliable pseudogenres:
Discussion, Personal, News, Legal, Advert, Science, Info,
Manual, Fiction
FTD values can be predicted with high accuracy (about 75%)
Some similar clusters can be detected on the Web using our
model
BUT accuracy on the Web is less clear
ALSO some new clusters could emerge
Plan for further work
To collect a near-exhaustive list of possible clusters in the FTD
space using a bigger corpus:
1. Check the accuracy of automatic detection of the FTDs and
existing clusters on open Web data;
2. Consider the centers of such clusters as prototypes;
3. Assign the documents of the corpora to their closest prototype,
provided the prototype is indeed close in the FTD space.
These clusters could be used for learning an automatic model of
genre classification.