Central University of Las Villas, Cuba
Artificial Intelligence Lab, Computer Science Department

Text Mining
Prof. Leticia Arco García

Background
Central University “Marta Abreu” of Las Villas

Motivation: unstructured data
“We are drowning in information, but starving for knowledge” (John Naisbitt)
Advances in Knowledge Discovery and Data Mining. AAAI Press and MIT Press, Menlo Park and Cambridge, MA, USA, 1996.
[Figure: kinds of unstructured data (text, documents, audio, image, video) and data volumes (petabytes, zettabytes, 2.5 quintillion bytes)]

Contents
• Origin and definitions
• Techniques
• Natural language processing
• Textual representation approaches
• Tools and resources
• Applications in Business Informatics
• Our results
• Challenges: SemEval-2017?

Origin
• The challenge of exploiting the large unstructured proportion of enterprise information (Luhn, 1958) [1]
• Manual text mining approaches (Schulman et al., 1989) [2]
• Knowledge Discovery in Texts, KDT (Feldman and Dagan, 1995) [3]

[1] Luhn, H. P. (1958) A business intelligence system. IBM Journal, October, pp. 314-319.
[2] Schulman, P., Castellon, C., Seligman, M. (1989) Assessing explanatory style: the content analysis of verbatim explanations and the attributional style questionnaire. Behav. Res. Ther., Vol. 27, No. 5, pp. 505-512.
[3] Feldman, R., Dagan, I. (1995) Knowledge Discovery in Textual Databases (KDT). In Proceedings of the First International Conference on Knowledge Discovery and Data Mining KDD-95, pp. 112-117.

Text Mining definitions (1/2)
• Text mining can be broadly defined as a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools.
• Text mining is the study and practice of extracting information from text using the principles of computational linguistics.
• Text mining as exploratory data analysis is a method of (building and) using software systems to support researchers in deriving new and relevant information (knowledge) from large text collections.

Text Mining definitions (2/2)
• Text mining is the establishing of previously unknown and unsuspected relations of features in a (textual) database.
• Text mining is a knowledge creation tool, because it offers powerful possibilities for creating knowledge and relevance out of the massive amounts of unstructured information available on the Internet and corporate intranets.
• Text mining is defined as the automatic discovery of hidden patterns, traits, or unknown information and knowledge from textual data.

Text Mining vs Data Mining
                           Search (goal-oriented)    Discover (opportunistic)
Structured data            Data retrieval            Data mining
Unstructured data (text)   Information retrieval     Text mining

Multidisciplinary field
• Natural language processing
• Computational linguistics
• Machine learning
• Visualization
• Database systems
• Data mining
• Statistics

Techniques
• Information retrieval
• Textual analysis
• Text clustering
• Generation of term association
• Topic detection
• Text categorization and classification
• Text summarization

Techniques: Information retrieval
1. Indexing process
2. Information retrieval model
3. Queries
Crawlers: Nutch, Scrapy, …

Techniques: Textual analysis
• Dictionaries
• Taggers
• Ontologies
• Parsers

Techniques: Text clustering
• Different clustering classifications
• Distances and similarities
• Clustering validity measures
• Cluster labeling

Techniques: Generation of term association
• Mining frequent patterns
• Discovering associations
• Discovering correlations
• Generating association rules from frequent itemsets
• Generating cause-effect rules
• Extracting decision rules

Techniques: Topic detection
• Supervised models
• Unsupervised models

Techniques: Text categorization and classification
• Support vector machines
• Decision trees
• Neural networks
• Naïve Bayes
• Deep learning

Techniques: Text summarization
• Single-document
• Multi-document
• Extracts
• Abstracts
• Domain specific
• Domain independent

Natural language processing levels
• Phonology: sound of words
• Morphology: nature of words
• Lexical: an interpretation of an individual word
• Syntactic: grammatical structure of the sentence
• Semantic: look at the whole sentence to discover the meaning
• Discourse: connections between sentences are made
  • Anaphora resolution
  • Text structure recognition
• Pragmatic: try to reveal an extra meaning of the text

Different linguistic approaches
1. Graphemic level: analysis on a subword level, commonly concerning letters
2. Lexical level: analysis concerning individual words
3. Syntactic level: analysis concerning the structure of sentences
4. Semantic level: analysis related to the meaning of words and phrases
5. Pragmatic level: analysis related to meaning with regard to language-dependent and language-independent (e.g. application-specific) context
Graphemic and lexical levels operate solely on plain statistical facts about text:
• They cannot completely capture the meaning of documents.
• There is only a weak relationship between term occurrences and document content.
Syntactic, semantic and pragmatic levels try to capture more semantic content by exploiting an increasing amount of contextual information:
• Structure of sentences
• Paragraphs
• Documents

Natural language processing tasks
• Syntax
• Semantics
• Discourse
• Speech

Syntax
• Morphological segmentation: separate words into individual morphemes and identify the class of the morphemes
• Normalization (lemmatization, stemming, …): reduce inflectional forms and sometimes derivationally related forms of a word to a common base form
• Part-of-speech tagging
• Parsing
• Sentence breaking (sentence boundary disambiguation), e.g. “I will see you tomorrow at 5 p.m. Saturday night we will go to the restaurant.” versus “I will come back at 5 p.m. Saturday.”
• Word segmentation: separate a chunk of continuous text into separate words

Semantics
• Named entity recognition (NER): determine which items in the text map to proper names (e.g. person, location, organization)
• Natural language generation: convert information from computer databases or semantic intents into readable human language (e.g. from BPMN to natural description)
• Natural language understanding
• Question answering: given a human-language question, determine its answer
• Recognizing textual entailment: given two text fragments, determine if one being true entails the other
• Relationship extraction: given a chunk of text, identify the relationships among named entities (e.g. antecedents and consequents into a decision rule)
• Sentiment analysis
• Topic segmentation and recognition: given a chunk of text, separate it into segments each of which is devoted to a topic, and identify the topic of the segment (e.g. textual segments which contribute to a decision)
• Word sense disambiguation: select the meaning which makes the most sense in context (e.g. run)

Discourse
• Automatic summarization
• Co-reference resolution (anaphora resolution)
• Discourse analysis: identify the discourse structure of connected text, i.e. the nature of the discourse relationships between sentences (e.g. useful for identifying the BPMN flow)
Example: “For a value of more than 5,000, senior management approval is required (A8). If this is granted, the invoice may be finally approved”.

Textual representation models
• Based on the vector space model
  • Vector Space Model (VSM)
  • Latent Semantic Analysis (LSA)
• Based on graphs
  • Nodes: textual units
  • Edges: relations between textual units
• Probabilistic models
  • Probabilistic Latent Semantic Analysis (PLSA)
  • Latent Dirichlet Allocation (LDA)
• Word2vec (word embeddings)
  • Input: a large corpus of text
  • Output: a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space

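As a rough illustration of the vector space family listed above, the sketch below turns documents into weighted term vectors and compares them with cosine similarity. It assumes a plain tf-idf weighting with length normalization and whitespace tokenization; these are common choices rather than anything the slides prescribe, and the function names (`tokenize`, `tfidf_vectors`, `cosine`) are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace only. A real pipeline
    # would also remove stop words and stem or lemmatize the tokens.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, unit-length tf-idf vectors), one vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)  # raw term frequency within this document
        weights = [tf[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        # Scale to unit length; an all-zero vector stays all zero.
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocab, vectors


def cosine(u, v):
    # The vectors are already unit length, so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))
```

With three short documents, two about mining and one about a restaurant, the two mining documents would be expected to score higher against each other than either does against the third, since the third shares no terms with them.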
Graph-based models
[Figure]

Vector Space Model
             Term1   Term2   …   Termm
Document1    w11     w12     …   w1m
Document2    w21     w22     …   w2m
…            …       …       …   …
Documentn    wn1     wn2     …   wnm

Text Representation
• Recognize different file formats
• Avoid capitalization constraints and punctuation marks
• Delete non-alphanumeric characters

• Stop word elimination
• Term frequency thresholding and Zipf’s law
• Document frequency thresholding
• Entropy
• Mutual information
• Stemming
• Using a thesaurus
• Latent Semantic Analysis

         T1    T2    …    Tm
Doc1     …     …     …    …
Doc2     …     fij   …    …
…        …     …     …    …
Docn     …     …     …    …

• Local component
• Global component
• Normalization component

Tools
Apache Lucene, NLTK, Tika, SemanticVectors, Stanford CoreNLP, S-Space, Apache OpenNLP, LingPipe, FrameNet, Weka, UIMA, TextMiner, GATE, R, Solr, RapidMiner, Portia, KNIME

Resources
• WordNet (RitaWordNet and JWNL)
• EuroWordNet
• SUMO (Suggested Upper Merged Ontology)
• WordNet Domain
• WordNet Affect
• TreeTagger
• SentiWordNet
• General Inquirer

WordNet: synsets
COMPUTER
• synset#1 {computer, computing machine, computing device, data processor, electronic computer, information processing system} - (a machine for performing calculations automatically)
• synset#2 {calculator, reckoner, figurer, estimator, computer} - (an expert at calculation (or at operating calculating machines))

WordNet: Relation between terms and synsets
EAT
• [IndexWord: [Lemma: eat] [POS: verb]]: take in solid food; “She was eating a banana”; “What did you eat for dinner last night?”
• [IndexWord: [Lemma: eat] [POS: verb]]: eat a meal; take a meal; “We did not eat until 10 P.M. because there were so many phone calls”; “I didn’t eat yet, so I gladly accept your invitation”
• [IndexWord: [Lemma: eat] [POS: verb]]: take in food; used of animals only; “This dog doesn’t eat certain kinds of meat”; “What do whales eat?”
• [IndexWord: [Lemma: eat] [POS: verb]]: worry or cause anxiety in a persistent way; “What’s eating you?”
• [IndexWord: [Lemma: eat] [POS: verb]]: use up (resources or materials); “this car consumes a lot of gas”; “We exhausted our savings”; “They run through 20 bottles of wine a week”
• [IndexWord: [Lemma: eat] [POS: verb]]: cause to deteriorate due to the action of water, air, or an acid; “The acid corroded the metal”; “The steady dripping of water rusted the metal stopper in the sink”

WordNet: Relation between synsets
• All POS: synonymy, antonymy
• Nouns only: hypernymy, hyponymy, meronymy
• Verbs only: troponymy, entailment

WordNet: synsets related to COMPUTER synset#1
Hyponymy shows the relationship between a generic term (hypernym) and a specific instance of it (hyponym).
• Hypernymy: {machine} - (any mechanical or electrical device that transmits or modifies energy to perform or assist in the performance of human tasks)
• Hyponymy: {analog computer, analogue computer} - (a computer that represents information by variable quantities (e.g., positions or voltages))
• Meronymy (part of, or a member of something): {busbar, bus} - (an electrical conductor that makes a common connection between several circuits; "the busbar in this computer can transmit data either way between any two components of the system")

WordNet Domain
[Figure]

TreeTagger
Input: The TreeTagger is easy to use.

Word         POS    Lemma
The          DT     the
TreeTagger   NP     TreeTagger
is           VBZ    be
easy         JJ     easy
to           TO     to
use          VB     use
.            SENT   .

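To make the WordNet relations above more concrete, the following sketch models a few synsets as a small in-memory graph and walks the hypernym and hyponym links. It is a toy data structure filled with the COMPUTER example from the slides, not the WordNet API; the class `Synset` and the helpers `hypernym_chain` and `hyponyms` are hypothetical names, and glosses are shortened.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare synsets by identity, not by field values
class Synset:
    name: str
    lemmas: list
    gloss: str
    hypernyms: list = field(default_factory=list)  # more generic synsets
    meronyms: list = field(default_factory=list)   # parts or members


def hypernym_chain(synset):
    """Follow the first hypernym link upwards and return the names visited."""
    chain = []
    current = synset
    while current.hypernyms:
        current = current.hypernyms[0]
        chain.append(current.name)
    return chain


def hyponyms(synset, all_synsets):
    """Hyponymy is the inverse of hypernymy: find synsets pointing at `synset`."""
    return [s.name for s in all_synsets
            if any(h is synset for h in s.hypernyms)]


# Toy data taken from the COMPUTER synset#1 example (glosses shortened).
machine = Synset("machine", ["machine"],
                 "device that transmits or modifies energy")
busbar = Synset("busbar", ["busbar", "bus"],
                "conductor connecting several circuits")
computer = Synset("computer", ["computer", "computing machine", "data processor"],
                  "a machine for performing calculations automatically",
                  hypernyms=[machine], meronyms=[busbar])
analog = Synset("analog computer", ["analog computer", "analogue computer"],
                "represents information by variable quantities",
                hypernyms=[computer])
ALL_SYNSETS = [machine, busbar, computer, analog]
```

Calling `hypernym_chain(analog)` climbs from the specific synset to the generic ones, while `hyponyms(computer, ALL_SYNSETS)` goes in the opposite direction. A real application would query WordNet through one of the libraries listed under Resources instead of hand-built objects.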
Transforming Data into Business Intelligence
[Diagram labels: Input; Data outside; Data within enterprise; Supply chain management; Search engines; Information retrieval; Information extraction; Information transformation; Data cleaning; Text mining; Web mining; Data mining; Real-time analysis; Predictions and forecast; Data Warehouse; Business Intelligence tools; Output]

Scenarios where text mining can build BPM capabilities
Questions:
• How can an organization improve system functionality and user experience?
• What organizational values are incorporated in an organization?
• How can an organization innovate its process?
• What organizational values support innovation in a business area?
• Are we still operating according to the right strategy?
• How can new information systems help to support users’ needs?
• What are emerging topics in process and in business area that are relevant to an organization?
Data sources:
• Customer reviews
• Documents
• Process descriptions
• Cloud services
• Process-related content
• Communication logs

A Text Mining approach for integrating business process models and governing documents
[Figure]

Facilitating business process discovery using email analysis
• Hypothesis: data sources that represent the communications facilitate the identification of the business processes
• Idea: identify email message threads
• Outcome: process fragment enactment models that can help process engineers
  • Validate their findings about the business processes
  • Understand better the vague and unclear parts of the processes

Predictive process monitoring framework that combines text mining with sequence classification techniques
Structured data often comes in conjunction with unstructured (textual) data such as emails or comments.
Call
{revenue : 34555; debt sum : 500}
{Please send a warning. 1234567: “Gave extension of 5 days and issued a warning about sending it to encashment. An encashment warning letter sent on the 06/10, 11:10 deadline.”}

Automated generation of business process models from natural language input
[Figure]

Semantics-based event log aggregation for process mining and analytics
Event log pre-processing techniques are needed that leverage semantic information for better alignment with the purpose of semi-automatically building, extending, and applying process ontologies.

Other applications
• Generate natural language text from business process models
• Support process model validation based on text generation
• Use unstructured data from interviews and questionnaires for improving business processes
• Detect inconsistencies between process models and textual descriptions
• Detect non-uniformly specified process element names on the same process decomposition level
• Detect naming convention violations, ambiguity and incomplete elements in process models

General schema for clustering, labeling and evaluating textual corpora
[Figure]

Document clustering based on Differential Betweenness
• Consider the structure and relationships between data
• Represent objects and their relationships in a graph
• Exploit the topology
[Figure: term-document matrix (Doc1 … Docn by T1 … Tm, entries fij) transformed into a document similarity matrix (Doc1 … Docn by Doc1 … Docn, entries Sim(i,j))]
Betweenness is an indicator of:
• who the most influential people in the network are
• who controls the flow of information between most others

Clustering based on Differential Betweenness
1. Obtain the similarity graph
2. Calculate the weighted differential betweenness matrix
3. Estimate the edges to be eliminated
4. Determine the cluster kernels by extracting the connected components
5. Classify the remaining nodes

GARLucene: System for Managing Retrieved Scientific Articles using Lucene
[Figure]

Which are more similar?
Document 1          Document 2          Document 3
…                   …                   …
<Abstract>          <Abstract>          <Abstract>
Term 1              Term 1              Term 2
</Abstract>         </Abstract>         </Abstract>
<Keywords>          <Keywords>          <Keywords>
Term 1              Term 1              Term 2
</Keywords>         </Keywords>         </Keywords>
<Introduction>      <Introduction>      <Introduction>
Term 2              Term 2              Term 1
</Introduction>     </Introduction>     </Introduction>
…                   …                   …
<References>        <References>        <References>
Term 2              Term 2              Term 1
</References>       </References>       </References>

Methodology for clustering considering content and structure
• Desktop application: LucXML
• Web system: Scientific Solr

Schema for topic segmentation and detection
Input: textual corpora
1. Identify textual units (output: textual units)
2. Pre-process (output: tokens)
3. Represent textual units (output: vectors, graphs, probabilistic distribution)
4. Segment (output: segments)
5. Represent segments (output: vectors, graphs, probabilistic distribution)
6. Cluster segments (output: segment clusters, i.e. topics)
7. Label segment clusters (output: topics and corresponding labels)
Framework: OpinionTopicDetection
Desktop application: OpinionTD

RST-disambiguation
New unsupervised semantic disambiguation algorithm based on clustering and Rough Set Theory
1. Eliminate stop words
2. Lemmatize terms
3. Find the senses of each term in WordNet
4. Cluster the senses of the terms
5. Calculate the lower and upper approximation of each cluster of terms
6. Calculate the Rough F-measure of each cluster
7. Identify the best cluster considering the number of terms, the Rough F-measure value and the WordNet index
8. Calculate rough membership measures for assigning senses to clusters

SemEval-2017
Semantic comparison for words and texts
• Task 1: Semantic Textual Similarity
• Task 2: Multilingual and Cross-lingual Semantic Word Similarity
• Task 3: Community Question Answering
Parsing semantic structures
• Task 9: Abstract Meaning Representation Parsing and Generation
• Task 10: Extracting Keyphrases and Relations from Scientific Publications
• Task 11: End-User Development using Natural Language

Thanks!
Questions, ideas, suggestions, comments, …