Automatic classification of Web texts using Functional Text Dimensions

Lagutin M. B., Katinskaya A. Y., Selegey V. P., Sharoff S. A., Sorokin A. A.
MSU, RSUH, ABBYY, University of Leeds, MIPT
30 May 2015

Annotating genres on the Web (Problem statement; Annotation results)

Genre classification as a jungle
Mehler, Sharoff, Santini (2010). Genres on the Web.
- MGC (20 genres): adults, blog, children, commercial, community, content delivery, entertainment, error message, FAQ, gateway, index, informative, journalistic, official, personal, poetry, prose, scientific, shopping, user input
- KI-04 (8 genres): article, discussion, download, help, linklists, portrait-nonpriv, portrait-priv, shop
- Santini's (8 genres): blog, eShop, FAQ, frontpage, hotlist, homepage, sitemap, search page
- 349 genres in the Syracuse collection
No reliable annotation framework

Functional Text Dimensions
A8: hardnews. To what extent does the text appear to be an informative report of events recent at the time of writing? (For example, a news item.)
A1: argum. To what extent does the text contain explicit argumentation to persuade the reader? (For example, argumentative blogs or discussion forums.)
A17: eval. To what extent does the text evaluate a specific entity by endorsing or criticising it? (For example, a product review.)
Rating levels (Forsyth, Sharoff, 2014):
  0    none or hardly at all
  0.5  slightly
  1    somewhat or partly
  2    strongly or very much so

Training corpus
  Documents   Words     Sentences
  618         997255    72651
Typical text sources:
- Wikipedia
- Blogs (vk.com, livejournal.com)
- News websites (ru.wikinews.org, chaskor.ru)
- Legal informational portals (base.garant.ru, consultant.ru)
- Scientific and popular science journals (sci-article.ru, cyberleninka.ru)
- Advertising websites (avito.ru, mvideo.ru)

Distribution of annotations (figure)

Distribution of annotations
- Frequent: A16 (info), A1 (argum), A3 (emotive), A8 (news), A11 (personal) and A17 (evaluation).
- Less frequent: A4 (fiction), A5 (flippant), A7 (instructive), A18 (dialogic) and A19 (poetic).
- Some dimensions are categorical (A4, A7, A9, A12, A14).
- A16 is in-between: 221 of its 360 nonzero values are 2s.
- By contrast, only 53 of the 244 nonzero values for A17 are 2s.

Detecting genre symptoms (Clusterization method and results; Learning the dimensions)

Binarization
We binarize the values with two recodings and perform the clustering procedure for both of them.
Only the extremes are kept: 0 (none) or 2 (strongly).
- 0.5 (slightly) → 0 in both recodings
- 1 → 2 in one recoding, 1 → 0 in the other

Histogram of FTD combinations (figure)

Calculating frequencies of FTD combinations (figures, two slides)

Unifying clusters to classes
- Some combinations of FTD values (prototypes) are more frequent than others; we take them as cluster centers.
- We consider such clusters as potential pseudogenres.
- Most of the clusters are very small (only 7 contain more than 20 texts). There is no hope of separating such small groups.
- Fortunately, the clusters themselves are organized hierarchically, so we group clusters (= standard combinations of feature values) into higher-order groups.
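
The binarization and the frequency count of FTD combinations can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the dictionary layout and the three toy documents are assumptions made for the example; only the recoding rules (0.5 → 0, 1 → 2 or 1 → 0) come from the slides.

```python
from collections import Counter

def binarize(ratings, one_maps_to_two):
    """Recode ratings 0/0.5/1/2 to {0, 2}: 0.5 -> 0, 2 -> 2, 1 -> 2 or 0."""
    recoded = {}
    for ftd, value in ratings.items():
        if value == 2 or (value == 1 and one_maps_to_two):
            recoded[ftd] = 2
        else:
            recoded[ftd] = 0
    return recoded

def combination(recoded):
    """Return the FTDs recoded to 2 as a sorted, hashable tuple."""
    return tuple(sorted(ftd for ftd, v in recoded.items() if v == 2))

def prototype_counts(documents, one_maps_to_two):
    """Count how often each combination of 'strong' FTDs occurs."""
    return Counter(
        combination(binarize(doc, one_maps_to_two)) for doc in documents
    )

# Toy annotations (hypothetical, not taken from the corpus).
docs = [
    {"A1": 0, "A8": 2, "A16": 1},
    {"A1": 0.5, "A8": 2, "A16": 0},
    {"A1": 2, "A8": 0, "A16": 1},
]
strict = prototype_counts(docs, one_maps_to_two=False)   # recoding 1 -> 0
lenient = prototype_counts(docs, one_maps_to_two=True)   # recoding 1 -> 2

for combo, count in strict.most_common():
    print("+".join(combo) or "(none)", count)
```

The most frequent keys of such a counter play the role of the prototypes used as cluster centers; running both recodings shows how sensitive the prototype list is to the treatment of the rating 1.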

Pseudogenres and their FTDs
  Genre       FTD combinations (weight)
  Discussion  A1 (26), A1+A13 (17), A13 (11), A1+A8 (7)
  Personal    A11 (18), A3+A6+A11+A17 (11), A11+A1 (9), A11+A17 (8)
  News        A8 (68)
  Legal       A9+A16 (42), A9 (26)
  Advert      A12+A16 (31), A12 (10)
  Science     A14+A16 (26), A14+A15+A16 (17), A15+A16, A14+A15 (9), A14 (5)
  Info        A16 (60)
  Manual      A7 (35; weight based on the presence method)
  Fiction     A4 or A5 or A18 (69; weight based on the presence method)

Linguistic features for automatic classification
Biber's multidimensional analysis (1988)
40 initial dimensions for Russian:
  B1   first person pronoun
  B2   second person pronoun
  B4   reflexive pronoun
  ...
  B38  conditional subordinate
  B39  purpose subordinate
  B40  negation

Classification scheme
1 Remove noise documents from the test set (using the cluster structure).
2 Transform the language feature values (see below).
3 Learn a logistic regression model for each of the FTDs.
4 For clusters, learn a separate classifier for each cluster and assign the document to the cluster whose model gives the greatest score.
5 Apply a backward (stepwise) elimination procedure to make the models sparser and more reliable.
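
Steps 2 and 4 of the scheme can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the value of lambda, the two-feature models and their weights are hypothetical, and fitting the logistic regressions (step 3) is left out. The Box-Cox transform assumes strictly positive feature values, and the standardization assumes a feature that is not constant.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform of positive values; log is the lam -> 0 limit."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / sigma for v in values]

def logistic_score(weights, bias, features):
    """Probability given by one logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))

def assign_cluster(models, features):
    """Pick the cluster whose model gives the greatest score."""
    return max(models, key=lambda name: logistic_score(*models[name], features))

# Hypothetical raw frequencies of one feature across three documents.
raw = [1.0, 4.0, 9.0]
transformed = standardize(box_cox(raw, lam=0.5))  # lam chosen for illustration

# Hypothetical per-cluster models: (weights, bias) over two features.
models = {
    "Personal": ([2.0, 0.0], -1.0),
    "News": ([-2.0, 0.5], 0.0),
}
print(assign_cluster(models, [transformed[2], 0.0]))
```

Each cluster gets its own binary model, so the scores need not sum to one; taking the maximum is what turns the separate classifiers into a single decision.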

Data transformation
Box-Cox: $B' = \frac{B^{\lambda} - 1}{\lambda}$
Standardization: $X = \frac{B' - \mu}{\sigma}$

Transformation results (figure: feature distributions BEFORE and AFTER the transformation)

Sensitivity and specificity for functional dimensions
  FTD   Threshold  Sens. train  Spec. train  Sens. test  Spec. test
  A1    0.33       76%          78%          79%         67%
  A3    0.23       88%          91%          94%         76%
  A4    0.09       92%          96%          92%         100%
  A5    0.14       79%          80%          76%         81%
  A6    0.23       82%          83%          81%         72%
  A7    0.11       94%          93%          91%         95%
  A8    0.25       86%          86%          80%         73%
  A9    0.22       96%          97%          98%         83%
  A11   0.31       83%          86%          79%         90%
  A12   0.20       89%          90%          90%         80%
  A13   0.19       74%          80%          72%         64%
  A14   0.18       86%          89%          91%         71%
  A15   0.20       71%          75%          73%         75%
  A16   0.61       76%          77%          72%         66%
  A17   0.24       72%          83%          71%         88%
  A18   0.08       91%          96%          89%         100%

Sensitivity and specificity for clusters
(D = Discussion, P = Personal, N = News, L = Legal, A = Advert, S = Science, I = Info, M = Manual, F = Fiction)
          D    P    N    L    A    S    I    M    F    Se   Se   Sp
  D      37    9    6    0    0    1    4    1    4   .66  .78  .60
  P       9   41    0    0    0    0    3    0    6   .61  .78  .69
  N       3    0   59    1    0    1    4    0    1   .81  .81  .86
  L       0    0    0   64    0    4    1    1    0   .93  .93  .91
  A       3    2    0    0   34    3    0    1    2   .83  .83  .76
  S       1    3    0    3    0   57    1    0    1   .76  .80  .86
  I       1    2    7    0    4    9   30    0    2   .64  .80  .55
  M       1    1    0    1    2    0    1   30    3   .86  .86  .77
  F       1    9    1    0    1    0    3    2   50   .72  .72  .75
  Total  56   67   73   69   41   75   47   35   69   .76  .81  .75
The first Se value is the diagonal count divided by the column total (for D, 37/56 = .66); the Sp value is the diagonal count divided by the row total (for D, 37/62 = .60). In the second Se column, D and P share one value (.78) and S and I share one value (.80), which is consistent with merging each pair (96/123 and 97/122).

Detecting clusters in a larger corpus
- 1,000,000 LiveJournal posts from the current version of GICR
- Logistic regression for each FTD (A1-A18)
- Binarized scores for every feature (threshold .75): 2 or not
- Detection of the nearest prototype by Euclidean distance
- The most frequent prototypes become the new cluster centers

Most frequent prototypes
  Class       FTDs                 #docs
  News        A8                   49738
  Personal    A3+A6+A11            35506
  Info        A16                  34578
  Review      A3+A6+A11+A17        31214
  News        A8+A16               30219
  Review      A3+A5+A6+A11+A17     23036
  Discussion  A1+A3+A6+A11         20917
  Advert      A12+A16              20874
  Fiction     A3+A4+A6+A11         20553

Conclusions
- Well-formed dense clusters exist in the FTD space.
- These clusters correspond to reliable pseudogenres: Discussion, Personal, News, Legal, Advert, Science, Info, Manual, Fiction.
- FTD values can be predicted with high accuracy (about 75%).
- Some similar clusters can be detected on the Web using our model,
  BUT the accuracy on the Web is less clear;
  ALSO some new clusters could emerge.

Plan for further work
To collect a near-exhaustive list of possible clusters in the FTD space using a bigger corpus:
1 Check the accuracy of automatic detection of the FTDs and existing clusters on open Web data;
2 Consider the centers of such clusters as prototypes;
3 Assign the documents of the corpora to their closest prototype, provided the prototype is indeed close in the FTD space.
These clusters could then be used for learning an automatic model of genre classification.
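
Assigning a document to its closest prototype, with a check that the prototype is in fact close, might look like the sketch below. The three-dimensional FTD vectors (A1, A8, A16), the prototype set and the distance cutoff `max_distance` are hypothetical choices for illustration; the slides specify only Euclidean distance over binarized FTD scores.

```python
import math

def nearest_prototype(scores, prototypes, max_distance):
    """Return (name, distance) of the closest prototype, or (None, distance)
    when even the closest prototype is farther than max_distance."""
    best_name, best_dist = None, math.inf
    for name, center in prototypes.items():
        dist = math.dist(scores, center)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist <= max_distance:
        return best_name, best_dist
    return None, best_dist

# Binarized FTD vectors in the order (A1, A8, A16); values are 0 or 2.
prototypes = {
    "News (A8)": (0, 2, 0),
    "Info (A16)": (0, 0, 2),
    "News (A8+A16)": (0, 2, 2),
}

print(nearest_prototype((0, 2, 0), prototypes, max_distance=2.0))
print(nearest_prototype((2, 0, 0), prototypes, max_distance=2.0))
```

Documents that fall outside the cutoff are left unassigned; counting the combinations among them is one way new clusters could surface in a bigger corpus.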