2017.06.14 - Text Mining - Research group Business Informatics

Central University of Las Villas, Cuba
Artificial Intelligence Lab
Computer Science Department
Text Mining
Prof. Leticia Arco García
[email protected]
Background
Central University “Marta Abreu” of Las Villas
Motivation: unstructured data
“We are drowning in information, but starving for knowledge”
John Naisbitt
Advances in Knowledge Discovery and Data Mining.
AAAI Press and MIT Press, Menlo Park and Cambridge, MA, USA 1996
[Figure: types of unstructured data (audio, image, video, text, document) and data volume scales (bytes, petabyte, zettabyte, 2.5 quintillion)]
Contents
 Origin and definitions
 Techniques
 Natural language processing
 Textual representation approaches
 Tools and resources
 Applications in Business Informatics
 Our results
 Challenges: SemEval-2017?
Origin
 The challenge of exploiting the large unstructured proportion of enterprise information [1] - Luhn, 1958
 Manual text mining approaches [2] - Schulman et al., 1989
 Knowledge Discovery in Texts (KDT) [3] - Feldman and Dagan, 1995
[1] Luhn, H. P. (1958) A business intelligence system. IBM Journal. October, pp. 314-319.
[2] Schulman, P., Castellon, C., Seligman, M. (1989) Assessing explanatory style: the content analysis of verbatim explanations and the attributional style questionnaire. Behav. Res. Ther. Vol. 27, No. 5, pp. 505-512.
[3] Feldman, R., Dagan, I. (1995) Knowledge Discovery in Textual Databases (KDT). In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), pp. 112-117.
Text Mining definitions (1/2)
 Text mining can be broadly defined as a knowledge-intensive
process in which a user interacts with a document collection
over time by using a suite of analysis tools
 Text mining is the study and practice of extracting
information from text using the principles of
computational linguistics
 Text mining as exploratory data analysis is a method of
(building and) using software systems to support researchers
in deriving new and relevant information (knowledge)
from large text collections
Text Mining definitions (2/2)
 Text mining is the establishment of previously unknown and
unsuspected relations of features in a (textual) database
 Text mining is a knowledge creation tool, because it offers
powerful possibilities for creating knowledge and relevance
out of the massive amounts of unstructured information
available on the Internet and corporate intranets
Text Mining is defined as the automatic discovery of hidden patterns,
traits, or unknown information and knowledge from textual data
Text Mining vs Data Mining
                          Search (goal-oriented)    Discover (opportunistic)
Structured data           Data retrieval            Data mining
Unstructured data (text)  Information retrieval     Text mining
Multidisciplinary field
 Natural language processing
 Computational linguistics
 Machine learning
 Visualization
 Database systems
 Data mining
 Statistics
Techniques
 Information retrieval
 Textual analysis
 Text clustering
 Generation of term association
 Topic detection
 Text categorization and classification
 Text summarization
Techniques: Information retrieval
1. Indexing process
2. Information retrieval model
3. Queries
Crawlers: Nutch, Scrapy, …
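The three information retrieval steps named on this slide (indexing process, retrieval model, queries) can be illustrated with a minimal sketch. It assumes a Boolean AND retrieval model and a hypothetical three-document collection; the function names `tokenize`, `build_index`, and `search_and` are illustrative and do not come from any of the tools mentioned in the talk.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Indexing process: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search_and(index, query):
    """Boolean retrieval model: ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical toy collection (not from the talk)
docs = {
    "d1": "Text mining discovers knowledge in text collections",
    "d2": "Data mining works on structured data",
    "d3": "Information retrieval finds documents for a query",
}
index = build_index(docs)
print(sorted(search_and(index, "mining text")))  # ['d1']
```

Engines such as Lucene use the same inverted-index idea, with ranking models on top of plain Boolean matching.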
Techniques: Textual analysis
• Dictionaries
• Taggers
• Ontologies
• Parsers
Techniques: Text clustering
• Different clustering classifications
• Distances and similarities
• Clustering validity measures
• Cluster labeling
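As one example of the "distances and similarities" used in text clustering, the sketch below computes cosine similarity between bag-of-words vectors. Cosine is a common choice but only an assumption here, since the slide does not name a specific measure. The three sample documents are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Distance derived from the similarity: 0 for identical directions."""
    return 1.0 - cosine_similarity(a, b)


# Hypothetical documents represented as term-frequency vectors
d1 = Counter("text mining text".split())
d2 = Counter("text mining".split())
d3 = Counter("process model".split())

print(round(cosine_similarity(d1, d2), 4))  # 0.9487
print(cosine_similarity(d1, d3))            # 0.0
```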
Techniques: Generation of term association
• Mining frequent patterns
• Discovering associations
• Discovering correlations
• Generating association rules from frequent itemsets
• Generating cause-effect rules
• Extracting decision rules
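Generating association rules from frequent itemsets can be sketched as follows. Each document is treated as a set of terms. The sketch enumerates candidate itemsets by brute force rather than with an optimized algorithm such as Apriori, and the four "documents" are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size terms."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            support = sum(1 for t in transactions if cset <= t) / n
            if support >= min_support:
                frequent[cset] = support
    return frequent


def association_rules(frequent, min_confidence):
    """Derive rules antecedent -> consequent with their support and confidence."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                a = frozenset(antecedent)
                # Every subset of a frequent itemset is itself frequent
                confidence = support / frequent[a]
                if confidence >= min_confidence:
                    rules.append((a, itemset - a, support, confidence))
    return rules


# Hypothetical documents reduced to their sets of terms
transactions = [
    {"invoice", "approval", "manager"},
    {"invoice", "approval"},
    {"invoice", "payment"},
    {"approval", "manager"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(frequent, min_confidence=0.8)
for antecedent, consequent, support, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), support, confidence)
# ['manager'] -> ['approval'] 0.5 1.0
```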
Techniques: Topic detection
• Supervised models
• Unsupervised models
Techniques: Text categorization and classification
• Support vector machines
• Decision trees
• Neural networks
• Naïve Bayes
• Deep learning
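Of the classifiers listed, Naïve Bayes is the simplest to show in a few lines. The sketch below is a multinomial variant with add-one (Laplace) smoothing and whitespace tokenization. The class name and the tiny labeled corpus are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one smoothing (illustrative sketch)."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = text.lower().split()
            self.term_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = [t for t in text.lower().split() if t in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # log P(class) + sum of log P(term | class)
            score = math.log(count / total_docs)
            total_terms = sum(self.term_counts[label].values())
            for token in tokens:
                score += math.log(
                    (self.term_counts[label][token] + 1) / (total_terms + vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled customer reviews
train_docs = [
    "great product works well",
    "excellent service great support",
    "poor quality broke quickly",
    "terrible service poor support",
]
train_labels = ["positive", "positive", "negative", "negative"]

clf = NaiveBayesTextClassifier().fit(train_docs, train_labels)
print(clf.predict("great support"))  # positive
print(clf.predict("poor quality"))   # negative
```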
Techniques: Text summarization
• Single-document
• Multi-document
• Extracts
• Abstracts
• Domain specific
• Domain independent
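A single-document, extract-based, domain-independent summarizer can be sketched by scoring each sentence with the average corpus frequency of its content words. This scoring rule, the small stop-word list, and the sample text are assumptions chosen for brevity and are not the method used in the talk.

```python
import re
from collections import Counter

# Small illustrative stop-word list (assumption)
STOP_WORDS = {"the", "was", "from", "a", "an", "of", "and", "in", "to", "is"}


def content_words(text):
    """Lowercased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def summarize(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]


text = (
    "Text mining extracts knowledge from text. "
    "The weather was nice yesterday. "
    "Text mining supports business process analysis."
)
print(summarize(text, 1))  # ['Text mining extracts knowledge from text.']
```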
Natural language processing levels
 Phonology: Sound of words
 Morphology: Nature of words
 Lexical: Interpretation of an individual word
 Syntactic: Grammatical structure of the sentence
 Semantic: Look at the whole sentence to discover the meaning
 Discourse: Connections between sentences are made
• Anaphora resolution
• Text structure recognition
 Pragmatic: Try to reveal an extra meaning of the text
Different linguistic approaches
1. Graphemic level: analysis on a subword level, commonly concerning letters
2. Lexical level: analysis concerning individual words
3. Syntactic level: analysis concerning the structure of sentences
4. Semantic level: analysis related to the meaning of words and phrases
5. Pragmatic level: analysis related to meaning regarding language-dependent and language-independent, e.g. application-specific, context
Approaches at the lower levels:
• Operate solely on plain statistical facts about text.
• Cannot completely capture the meaning of documents.
• There is only a weak relationship between term occurrences and document content.
Approaches at the higher levels try to capture more semantic content by exploiting an increasing amount of contextual information:
• Structure of sentences
• Paragraphs
• Documents
Natural language processing tasks
 Syntax
 Semantics
 Discourse
 Speech
Syntax
 Morphological segmentation
• Separate words into individual morphemes and identify the class of the morphemes
 Normalization (lemmatization, stemming, …)
• Reduce inflectional forms and sometimes derivationally related forms of a word to a common base form
 Part-of-speech tagging
 Parsing
 Sentence breaking (sentence boundary disambiguation)
• “I will see you tomorrow at 5 p.m. Saturday night we will go to the restaurant.”
• “I will come back at 5 p.m. Saturday.”
 Word segmentation
• Separate a chunk of continuous text into separate words.
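Word segmentation and normalization can be illustrated with a toy sketch. The regular expression and the four-suffix list are assumptions chosen for brevity. A real system would use a proper stemmer or lemmatizer (for example from NLTK or Stanford CoreNLP), since naive suffix stripping over-stems many words.

```python
import re

# Toy suffix list, checked in this order (assumption, not a real stemming algorithm)
SUFFIXES = ("ing", "ed", "es", "s")


def segment_words(text):
    """Word segmentation: split continuous text into word and number tokens."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?", text)


def stem(word):
    """Normalization: strip one known suffix if at least three letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = segment_words("Extracted documents clustering")
print(tokens)                     # ['Extracted', 'documents', 'clustering']
print([stem(t) for t in tokens])  # ['extract', 'document', 'cluster']
```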
Semantics
 Named entity recognition (NER)
• Determine which items in the text map to proper names (e.g. person, location, organization)
 Natural language generation
• Convert information from computer databases or semantic intents into readable human language (e.g. from BPMN to natural description)
 Natural language understanding
 Question answering
• Given a human-language question, determine its answer
 Recognizing textual entailment
• Given two text fragments, determine if one being true entails the other
 Relationship extraction
• Given a chunk of text, identify the relationships among named entities (e.g. antecedents and consequents into a decision rule)
 Sentiment analysis
 Topic segmentation and recognition
• Given a chunk of text, separate it into segments each of which is devoted to a topic, and identify the topic of the segment (e.g. textual segments which contribute to a decision)
 Word sense disambiguation
• Select the meaning which makes the most sense in context (e.g. run)
Discourse
 Automatic summarization
 Co-reference resolution (anaphora resolution)
• “For a value of more than 5,000, senior management approval is required (A8). If this is granted, the invoice may be finally approved”.
 Discourse analysis
• Identify the discourse structure of connected text, i.e. the nature of the discourse relationships between sentences (e.g. useful for identifying the BPMN flow)
Textual representation models
 Based on vector space model
 Vector Space Model (VSM)
 Latent Semantic Analysis (LSA)
 Based on graphs
 Node: Textual units
 Edges: Relations between textual units
 Probabilistic models
 Probabilistic Latent Semantic Analysis (PLSA)
 Latent Dirichlet Allocation (LDA)
 Word2vec (Word embeddings)
 Input: a large corpus of text
 Output: a vector space, typically of several hundred dimensions, with each
unique word in the corpus being assigned a corresponding vector in the
space.
Graph-based models
Vector Space Model
             Term1   Term2   …   Termm
Document1    w11     w12     …   w1m
Document2    w21     w22     …   w2m
…            …       …       …   …
Documentn    wn1     wn2     …   wnm
Text Representation
• Recognize different file formats
• Avoid capitalization constraints and punctuation marks
• Delete non-alphanumeric characters

• Stop word elimination
• Term Frequency Thresholding and Zipf’s law
• Document Frequency Thresholding
• Entropy
• Mutual information
• Stemming
• Using thesaurus
• Latent Semantic Analysis

• Local component
• Global component
• Normalization component

[Document-term matrix: rows Doc1, Doc2, …, Docn; columns T1, T2, …, Tm; entries fij]
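The local, global, and normalization components of a term weight are often combined as a product. The sketch below assumes one common choice: raw term frequency as the local component, inverse document frequency as the global component, and cosine (unit-length) normalization. The slide does not say which variants were used, so this is only an illustration on a hypothetical corpus.

```python
import math
from collections import Counter


def weight_documents(tokenized_docs):
    """Return one {term: weight} vector per document (tf * idf, unit length)."""
    n = len(tokenized_docs)
    # Document frequency: number of documents in which each term occurs
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    weighted = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # local component
        vec = {t: f * math.log(n / df[t]) for t, f in tf.items()}  # times global component
        norm = math.sqrt(sum(w * w for w in vec.values()))  # normalization component
        weighted.append({t: (w / norm if norm else 0.0) for t, w in vec.items()})
    return weighted


# Hypothetical pre-processed corpus
corpus = [
    ["text", "mining", "text"],
    ["data", "mining"],
    ["text", "retrieval"],
]
weights = weight_documents(corpus)
print(round(weights[0]["text"], 4))    # 0.8944
print(round(weights[0]["mining"], 4))  # 0.4472
```

With this global component, a term that occurs in every document receives weight 0, which is why very frequent terms contribute little to the representation.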
Tools
 Apache Lucene
 NLTK
 Tika
 SemanticVectors
 Stanford CoreNLP
 S-Space
 Apache OpenNLP
 LingPipe
 FrameNet
 Weka
 UIMA
 TextMiner
 GATE
 R
 SOLR
 RapidMiner
 Portia
 Knime
Resources
 WordNet (RitaWordNet and JWNL)
 EuroWordNet
 SUMO (Suggested Upper Merged Ontology)
 WordNet Domain
 WordNet Affect
 TreeTagger
 SentiWordNet
 General Inquirer
WordNet: synsets
COMPUTER
 synset#1
{computer, computing machine, computing device, data
processor, electronic computer, information processing
system} - (a machine for performing calculations
automatically)
 synset#2
{calculator, reckoner, figurer, estimator, computer} - (an
expert at calculation (or at operating calculating machines))
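The two COMPUTER glosses above can drive a tiny word sense disambiguation sketch. It uses a simplified Lesk-style overlap count between the context and each gloss. This is a standard baseline, not the RST-disambiguation algorithm presented later in the talk. The glosses are copied from the slide, while the stop-word list and the context sentences are hypothetical.

```python
import re

# Small illustrative stop-word list (assumption)
STOP_WORDS = {"a", "an", "the", "at", "for", "or", "of", "to", "in", "is", "was", "and"}

# Glosses of the two COMPUTER synsets shown on the slide
COMPUTER_GLOSSES = {
    "synset#1": "a machine for performing calculations automatically",
    "synset#2": "an expert at calculation (or at operating calculating machines)",
}


def content_words(text):
    """Set of lowercased alphabetic tokens with stop words removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def simplified_lesk(context, glosses):
    """Pick the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context)
    return max(glosses, key=lambda sense: len(content_words(glosses[sense]) & context_words))


print(simplified_lesk("The machine finished the calculations automatically", COMPUTER_GLOSSES))
# synset#1
print(simplified_lesk("She was an expert at operating the old devices", COMPUTER_GLOSSES))
# synset#2
```

Because the overlap is computed on surface forms, "machine" and "machines" do not match; lemmatizing both sides first would make the comparison less brittle.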
WordNet: Relation between terms and synsets
EAT
 [IndexWord: [Lemma: eat] [POS: verb]]: take in solid food; “She was eating a banana”; “What did you eat for dinner last night?”
 [IndexWord: [Lemma: eat] [POS: verb]]: eat a meal; take a meal; “We did not eat until 10 P.M. because there were so many phone calls”; “I didn’t eat yet, so I gladly accept your invitation”
 [IndexWord: [Lemma: eat] [POS: verb]]: take in food; used of animals only; “This dog doesn’t eat certain kinds of meat”; “What do whales eat?”
 [IndexWord: [Lemma: eat] [POS: verb]]: worry or cause anxiety in a persistent way; “What’s eating you?”
 [IndexWord: [Lemma: eat] [POS: verb]]: use up (resources or materials); “this car consumes a lot of gas”; “We exhausted our savings”; “They run through 20 bottles of wine a week”
 [IndexWord: [Lemma: eat] [POS: verb]]: cause to deteriorate due to the action of water, air, or an acid; “The acid corroded the metal”; “The steady dripping of water rusted the metal stopper in the sink”
WordNet: Relation between synsets
 All POS
• Synonymy
• Antonymy
 Only nouns
• Hypernymy
• Hyponymy
• Meronymy
 Only verbs
• Troponymy
• Entailment
WordNet: synsets related to COMPUTER synset#1
Hyponymy shows the relationship between a generic term
(hypernym) and a specific instance of it (hyponym)
 Hypernymy
• {machine} - (any mechanical or electrical device that transmits or modifies energy to perform or assist in the performance of human tasks)
 Hyponymy
• {analog computer, analogue computer} - (a computer that represents information by variable quantities (e.g., positions or voltages))
 Meronymy (part of, or a member of something)
• {busbar, bus} - (an electrical conductor that makes a common connection between several circuits; "the busbar in this computer can transmit data either way between any two components of the system")
WordNet Domain
TreeTagger
The TreeTagger is easy to use.
Word         POS    Lemma
The          DT     the
TreeTagger   NP     TreeTagger
is           VBZ    be
easy         JJ     easy
to           TO     to
use          VB     use
.            SENT   .
Data Transforming into Business Intelligence
[Figure: transformation of input (data outside the enterprise, data within the enterprise, supply chain management) into output (predictions and forecast). Labeled elements: search engines, information retrieval, information extraction, information transformation, data cleaning, data warehouse, text mining, web mining, data mining, real-time analysis, business intelligence tools.]
Scenarios where text mining can build BPM capabilities
Labels in the figure: customer reviews, documents, process descriptions, cloud services, process-related content, communication logs
• How can an organization improve system functionality and user experience?
• What organizational values are incorporated in an organization?
• How can an organization innovate its process?
• What organizational values support innovation in a business area?
• Are we still operating according to the right strategy?
• How can new information systems help to support users’ needs?
• What are emerging topics in process and in business area that are relevant to an organization?
A Text Mining approach for integrating business process
models and governing documents
Facilitating business process discovery using email analysis
 Hypothesis: Data sources that represent the communications facilitate the identification of the business processes
 Idea: Identify email message threads
 Outcome: process fragment enactment models that can help process engineers
• Validate their findings about the business processes
• Better understand the vague and unclear parts of the processes
Predictive process monitoring framework that combines
text mining with sequence classification techniques
Structured data often comes in conjunction with unstructured
(textual) data such as emails or comments.
Call {revenue : 34555; debt sum : 500} {Please send a warning. 1234567: “Gave extension
of 5 days and issued a warning about sending it to encashment. An encashment warning
letter sent on the 06/10, 11:10 deadline.”}
Automated generation of business process models from
natural language input
Semantics-based event log aggregation for process
mining and analytics
Event log pre-processing techniques are needed that leverage semantic
information for better alignment with the purpose of semi-automatically
building, extending, and applying process ontologies.
Other applications
 Generate natural language text from business process models
 Support process model validation based on Text Generation
 Use unstructured data from interviews and questionnaires for
improving business process
 Detect inconsistencies between process models and textual
descriptions
 Detect non-uniformly specified process element names on the
same process decomposition level
 Detect naming convention violations, ambiguity and
incomplete elements in process models
General schema for clustering,
labeling and evaluating textual
corpora
Document clustering based on Differential Betweenness
• Consider the structure and relationships between data
• Represent objects and their relationships in a graph
• Exploit the topology
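[Figure: the document-term matrix (rows Doc1, Doc2, …, Docn; columns T1, T2, …, Tm; entries fij) is transformed into a document-document similarity matrix (rows and columns Doc1, Doc2, …, Docn; entries Sim(i,j))]
Betweenness is an indicator of:
• who the most influential people in the network are
• who controls the flow of information between most others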
Clustering based on Differential Betweenness
1. Obtain the similarity graph
2. Calculate the weighted differential betweenness matrix
3. Estimate the edges to be eliminated
4. Determine the cluster kernels by means of the extraction of the connected components
5. Classify remaining nodes
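The step that determines cluster kernels by extracting connected components can be sketched as below. As a loud simplification, edges are eliminated with a plain similarity threshold, which is a stand-in for the weighted differential betweenness criterion and not the method presented here. The similarity values and function names are hypothetical.

```python
def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


def cluster_kernels(similarity, threshold):
    """Drop edges below the threshold (stand-in criterion) and extract the components."""
    nodes = sorted({doc for pair in similarity for doc in pair})
    kept = [pair for pair, sim in similarity.items() if sim >= threshold]
    return connected_components(nodes, kept)


# Hypothetical document similarity graph: {(doc_a, doc_b): Sim(a, b)}
similarity = {
    ("d1", "d2"): 0.9,
    ("d2", "d3"): 0.8,
    ("d3", "d4"): 0.1,
    ("d4", "d5"): 0.7,
}
kernels = cluster_kernels(similarity, threshold=0.5)
print([sorted(k) for k in kernels])  # [['d1', 'd2', 'd3'], ['d4', 'd5']]
```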
GARLucene: System for Managing Scientific Articles Retrieved using Lucene
Which are more similar?
Section          Document 1   Document 2   Document 3
<Abstract>       Term 1       Term 1       Term 2
<Keywords>       Term 1       Term 1       Term 2
<Introduction>   Term 2       Term 2       Term 1
<References>     Term 2       Term 2       Term 1
Methodology for clustering considering content and
structure
Desktop application LucXML
Web system Scientific Solr
Schema for topic segmentation and detection
[Figure: textual corpora → identify textual units (textual units) → pre-process (tokens) → represent textual units (vectors, graphs, probabilistic distribution) → segment (segments) → represent segments (vectors, graphs, probabilistic distribution) → cluster segments (segment clusters, i.e. topics) → label segment clusters → topics and corresponding labels]
Framework OpinionTopicDetection
Desktop application OpinionTD
RST-disambiguation
New unsupervised semantic disambiguation algorithm based on
clustering and Rough Set Theory
1. Eliminate stop words
2. Lemmatize terms
3. Find sense of each term in WordNet
4. Cluster senses of terms
5. Calculate lower and upper approximation of each cluster of terms
6. Calculate Rough F-measure of each cluster
7. Identify the best cluster considering number of terms, Rough F-measure value and WordNet index
8. Calculate rough membership measures for assigning senses to clusters
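The lower and upper approximations and the rough membership measure used in this algorithm follow the standard Rough Set Theory definitions, sketched below. The equivalence classes and the target set are hypothetical, and the Rough F-measure is not shown because the slides do not define it.

```python
def approximations(equivalence_classes, target):
    """Standard rough set lower and upper approximations of a target set."""
    lower, upper = set(), set()
    for eq_class in equivalence_classes:
        if eq_class <= target:   # class entirely inside the target
            lower |= eq_class
        if eq_class & target:    # class overlaps the target
            upper |= eq_class
    return lower, upper


def rough_membership(eq_class, target):
    """Fraction of an equivalence class that belongs to the target set."""
    return len(eq_class & target) / len(eq_class)


# Hypothetical example: senses s1..s5 grouped into indiscernibility classes
classes = [{"s1", "s2"}, {"s3"}, {"s4", "s5"}]
cluster = {"s1", "s2", "s4"}

lower, upper = approximations(classes, cluster)
print(sorted(lower))                            # ['s1', 's2']
print(sorted(upper))                            # ['s1', 's2', 's4', 's5']
print(rough_membership({"s4", "s5"}, cluster))  # 0.5
```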
SemEval-2017
 Semantic comparison for words and texts
 Task 1: Semantic Textual Similarity
 Task 2: Multilingual and Cross-lingual Semantic Word Similarity
 Task 3: Community Question Answering
 Parsing semantic structures
 Task 9: Abstract Meaning Representation Parsing and Generation
 Task 10: Extracting Keyphrases and Relations from Scientific Publications
 Task 11: End-User Development using Natural Language
Thanks!
Questions, ideas, suggestions, comments, …