A Text Mining Model for Strategic Alliance Discovery

2012 45th Hawaii International Conference on System Sciences
Yilu Zhou, George Washington University, [email protected]
Yi Zhang, George Washington University, [email protected]
Nicholas Vonortas, George Washington University, [email protected]
based national and international knowledge repository
on strategic alliances, through both government and
private-sector organizations, those efforts have been of
limited success. Alliances in general have become
ubiquitous and thus sole reliance on a limited set of
popular sources for related information creates bias.
The main challenge in constructing such a
knowledge repository is not different from many other
fields: information overload and human limitations.
Currently, discovery of strategic alliances relies on
humans physically reading news articles, company
reports, etc., and then inputting data manually. This imposes strict limits on the knowledge to be shared, including the number of alliances that can be searched (thus decreasing completeness), the speed of knowledge updates, and increased cost (paying researchers to manually comb through records).
In this research, we aim to address the limitations of manual work by developing an intelligent
text mining model to extract existing alliances from
open resources such as published news articles. The
model integrates meta-search, dependency parsing,
entity extraction, and relation extraction. In particular,
we propose an Alliance Discovery Template (ADT)
based relation extraction and an Alliance Confidence
Ranking strategy (ACRank). An alliance knowledge
portal is proposed to further assist alliance study. We
evaluated the effectiveness of our framework in a case
study and compared the coverage of our approach with
a gold standard, the Thomson SDC database and expert
judgment.
Our research is a first step toward building an
alliance knowledge repository. Results showed that this
automatic approach has the potential to create a much
more comprehensive alliance database than existing gold standards such as the Thomson SDC Alliance
Database. The results of this research will provide rich information and evidence on strategic alliances
formed, and will enable researchers to better
understand the interactions and outputs that make these
alliances one of the central drivers of innovation and
economic growth. Policymakers will then be better
informed regarding the crafting of regulations and laws
that allow these alliances to operate at socially-optimal
Abstract
Strategic alliances among organizations are one of
the central drivers of innovation and economy and
have raised strong interest among policymakers,
strategists and economists. However, discovery of
alliances has relied on pure manual search and has
limited scope. This research addresses the limitations
by proposing a text mining framework that
automatically extracts alliances from news articles.
The model not only integrates meta-search, entity
extraction and shallow and deep linguistic parsing
techniques, but also proposes an innovative ADTbased relation extraction method to deal with the
extremely skewed and noisy news articles and ACRank
to further improve the precision using various
linguistic features. Evaluation from an IBM alliances
case study showed that ADT-based extraction achieved
78.1% recall, 44.7% precision and 0.569 F-measure, and eliminated over 99% of the noise in
document collections. ACRank further improved
precision to 97% with the top-20% extracted alliance
instances. Our case study also showed that the widely cited Thomson SDC database covered less than 20% of all alliances, while our automatic approach covered 67%.
1. Introduction
Inter-firm collaboration has exploded during the
past couple of decades. The nature of collaboration has
shifted from peripheral interests to the very core
functions of the corporation, and from equity to non-equity forms of collaboration. The phenomenon has
raised strong interest among analysts, business
strategists, and policymakers from various fields including economics, management, public administration, and science and technology.
Being able to discover strategic alliances will allow
policy, innovation, and economics researchers to better
understand this growing phenomenon. Such data can
be studied to gain insights into the evolution of
industrial sectors and the growing interdisciplinary
nature of today’s R&D, and thus allow for cause and
effect analysis of policies and events. While there have
been various attempts in the past to create a broad-
978-0-7695-4525-7/12 $26.00 © 2012 IEEE
DOI 10.1109/HICSS.2012.86
Jeff Williams, George Washington University, [email protected]
between invention and commercialization. Non-equity,
co-operative research, technology, and development
(RTD) agreements are often used to get around
technology bottlenecks in an industry. Industrial
comprehensive joint ventures are crucial mechanisms
through which large firms devote energy to specific
technology problems in an industry. Customer-supplier agreements are important in that they allow research-focused firms to stay centered on research by outsourcing some production or infrastructure maintenance (labs, manufacturing space, etc.) to an
external specialist. One-way licensing and marketing
allow one firm to take advantage of innovations
created by a second. They are also helpful in opening
up foreign markets, with the foreign partner controlling
the production and marketing of a product in that
firm’s home country. University-based research is a
key policy issue at the moment, with discussions
centering on whether the increased pressure for
universities to convert knowledge into commercial
goods is detrimental to the long-run supply of
innovation in the economy. Jointly-funded industry and
government research undertakings often play a vital
role in creating socially important innovative products
or processes that, due to collective action or resource
appropriation issues, would not be produced by private
sector firms.
levels. In this paper, we focus on formal strategic
alliances. Other types of business collaborations are
out of the scope of this research.
The rest of the paper is structured as follows. We
first review the current research on strategic alliance
and the field of text mining. We then describe the
challenges associated with finding strategic alliances in
massive documents and present our text mining model.
Following our model, evaluations are presented in a
case study and finally we conclude the paper with
contributions and future directions.
2. Research background
As we mentioned earlier, with technological
innovation and diffusion being the primary driver of
today’s knowledge economy and alliances becoming
an ever-increasing source of technological innovation
and diffusion, economists, managers, and policymakers
need to have access to information on what makes
those alliances work and what can put up barriers.
2.1. Strategic alliance types
Strategic alliances involve a wide range of formal
activities, including: joint ventures, research
corporations, joint R&D (e.g., research pacts, joint
development agreements), technology exchange
agreements (e.g., technology sharing, cross-licensing, mutual second-sourcing), direct investment, minority/cross-holding, customer-supplier relations, R&D contracts, one-directional technology flow agreements (e.g., licensing, second-sourcing), and
manufacturing agreements. In addition, innovation
networks involve informal collaboration and
knowledge exchanges such as working relationships of
individuals across organizations and systemic learning
through patents, blueprints, etc. All sorts of organizations are involved, including firms, universities, research institutes, and “intermediate organizations” (e.g., professional and trade associations, think tanks). Table 1 summarizes our
categorization of alliance types and provides examples
of each type.
Each type of alliance listed in Table 1 is of equal, though different, importance for economists, business managers, and policymakers to understand fully. Research corporations are important
vehicles through which large technology firms
collaborate on pushing knowledge forward on a large
scale, due to their access to resources. Research
performed for industry associations can have a similar
impact. Venture capital is crucial to the survival of
many small technology firms, as it may be the only
way for them to bridge the so-called Valley of Death
2.2. Research strategies to identify alliances
and limitations
Current methods of studying alliances – patents,
bibliometrics and citations tracking, popular press
analysis, surveys, and government documents – have
contributed important findings but are limited in their
scope. For example, patenting levels vary by sector and
by firm size, among other factors, so any alliance data
based on that metric will potentially be biased toward sectors that have strong patenting traditions or needs.
Another example is surveys, as these rarely cover all
firms in a sector, and their accuracy is contingent upon
getting the right person – or group of people – to
answer questions at the firm level. Policymakers need
to know how generic innovation policies affect all
players, not just the select few that answer a survey or
happen to patent a lot. Likewise, economists will have
a skewed picture of the inputs and outputs associated
with operating an economy at efficient levels.
Existing alliance databases are maintained
manually. As highlighted in Schilling [15], none of the databases is considered an accurate reflection of the entire population of strategic alliances, but only of subsets. The Thomson SDC Alliance database is the most widely cited and is populated by information manually extracted from newspapers, trade journals,
government filings, and other press and newswire services. Team members for Thomson read these news sources and, upon locating an alliance announcement or information relating to an alliance already in the database, update the database as needed. There is a time lag component to this data entry. For example, a search on the Thomson Alliance database in 2010 for alliances from 2009 would return very few listings compared to 2008. This is due to the backlog of news sources awaiting examination by Thomson team members.
Table 1. Strategic alliance types
Pre-competitive R&D Cooperation:
  University-based co-operative research funded jointly by industry
  RTD performed or sponsored by industry associations / contract RTD for multiple clients by non-profit organizations
  Jointly funded government-industry RTD
  Research corporations
  Corporate venture capital in small companies (by one or more firms)
Downstream Technology Development Co-operation:
  Non-equity co-operative RTD agreements between firms in selected areas
Production and/or Marketing and Technology Development Co-operation:
  Inter-firm agreements regarding proven technologies developed independently
  Industrial comprehensive joint ventures
  Customer-supplier agreements
  One-way licensing and/or marketing agreements
learning-based approach employs a statistical classification model to solve entity recognition [10]. These systems use statistical models and machine learning algorithms to find patterns and relationships in text. Supervised learning uses a program that learns to classify a given set of labeled examples, each made up of the same number of features, and needs a large amount of annotated data to achieve high performance [20]. Examples of supervised models are the Hidden Markov Model, the Maximum Entropy Model and the Support Vector Machine (SVM). Unsupervised and semi-supervised approaches often involve bootstrapping [13]. The hybrid approach combines a rule-based and a machine learning-based method, taking the strongest points from each [16].
2.3. Text mining
Text mining is defined as the process of extracting
interesting information and knowledge from
unstructured texts[5]. It has been studied in various
fields such as e-commerce [14], patent analysis[18],
and bioinformatics research[7, 8]. Compared with data
mining, text mining is in general a more challenging
task because it deals with unstructured and amorphous
texts. Text mining research deals with a variety of
problems including text summarization, document and
information retrieval, text categorization, authorship
identification, entity extraction and relation
extraction[19].
Despite the development of text mining and its
applications in many scientific fields, to our knowledge
no collaboration has been formed between alliance
researchers and text mining researchers. It is natural to
think of an automatic way to discover alliances but the
task is challenging. Alliance discovery mainly involves
the subfields of entity extraction and relation
extraction, two difficult subfields in text mining.
2.3.2. Relation Extraction. Relation extraction
involves annotating the unstructured text with entities
and relations between entities. Relation extraction is
considered to be a hard task[19] and the first step is to
perform entity extraction. Most relation extraction work concerns relations between different types of entities, such as a person-position or a person-location relation. Relation extraction methods can be classified into two categories: template-based approaches and learning-based approaches.
A template-based approach relies on experts to carefully construct the rule lexicon and templates, and is therefore domain specific. With a well-defined template-based
extraction approach, Banko and Etzioni[2] concluded
that nearly 95% of 500 randomly selected sentences
belong to one of the eight categories of verb phrases
illustrated in Table 2.
A learning-based approach tends to formulate the
2.3.1. Entity Extraction. Entity extraction research
aims to extract and classify rigid designators[13].
There are many types of named entities in text such as
person name, product name, location, and date. Entity
extraction can take a rule-based approach, a machine learning-based approach or a hybrid approach. A rule-based approach focuses on extracting names using a large set of human-made rules. Generally these systems consist of a set of patterns using grammatical (e.g., part of speech), syntactic (e.g., word precedence) and orthographic features (e.g., capitalization) in combination with dictionaries [10]. A machine
information can appear in any type of news but the
frequency of appearance is very low. When we
manually read 2,000 search results from Google for “IBM Alliance”, we found only a handful of articles describing true formal alliances. This low number of hits makes some well-performing learning-based algorithms such as SVM fail. This also alerts us that
being able to select a relevant set of documents is
important to the performance. The documents being
parsed need to be comprehensive enough to cover most
alliances announced yet still be focused enough to
avoid unnecessary noise.
We develop a text mining model to address the
challenges in alliance extraction. The model is
composed of the following components as shown in
Figure 1: (1) Meta-crawling of alliance news articles,
(2) document pre-processing, (3) sentence-level
alliance extraction, and (4) corpus level alliance
ranking. The result of our text mining model is a
ranked list of possible alliances. To further assist
alliance researchers, we also propose a knowledge
portal. Users can search/browse our extracted alliances,
categorize alliance types, and visualize alliance
relations. We discuss each component in detail.
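The four components above can be sketched as a simple pipeline. The function names and stub signatures below are illustrative assumptions, not the paper's actual implementation:

```python
def discover_alliances(company, crawl, preprocess, extract, rank):
    """Chain the four components and return a ranked alliance list.

    Each argument after `company` is a hypothetical callable standing
    in for one component of the model in Figure 1.
    """
    documents = crawl(company)                   # (1) meta-crawling
    parsed = [preprocess(d) for d in documents]  # (2) pre-processing
    candidates = []
    for doc in parsed:
        candidates.extend(extract(doc))          # (3) sentence-level extraction
    return rank(candidates)                      # (4) corpus-level ranking (ACRank)
```

The callables would be replaced by the concrete components described in Sections 3.1 through 3.4.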
relation extraction problem as a classification problem.
A discriminative classifier, such as Support Vector
Machine (SVM), Voted Perceptron and Log-linear
model, may be used on a set of features extracted from
a structured representation of the sentence. Two kinds of supervised approaches have been used: feature-based [6], such as FASTUS [1], and kernel-based [9]. A
learning-based approach performs well but is difficult
to extend to new relation types. The effort needed in
labeling training data is tremendous.
Both approaches depend on the ability to parse
sentences. There are three levels of sentence parsing:
(1) Bag of Words (BoW) parsing, (2) syntactic parsing,
and (3) semantic parsing. The BoW approach
disregards grammar and word order and takes each
word into account. Syntactic parsing is also known as
shallow parsing. It performs Part-Of-Speech (POS)
tagging and entity extraction. Semantic parsing is the
most advanced and usually represents a sentence with a
dependency parse tree [3, 4, 12].
Adopting text mining techniques in alliance
discovery is not easy, because most current techniques
are developed on standard datasets and have been
tuned for better performance. Making them work in a new domain requires substantial investigation.
Table 2. Verb phrase template categories [2]
Category      Relative Freq.  Lexico-Syntactic Pattern
Verb          37.8            E1 Verb E2 (e.g., X established Y)
Noun+Prep     22.8            E1 NP Prep E2 (e.g., X settlement with Y)
Verb+Prep     16.0            E1 Verb Prep E2 (e.g., X moved to Y)
Infinitive    9.4             E1 to Verb E2 (e.g., X plans to acquire Y)
Modifier      5.2             E1 Verb E2 Noun (e.g., X is Y winner)
Coordinate_n  1.8             E1 (and|,|-|:) E2 NP (e.g., X-Y deal)
Coordinate_v  1.0             E1 (and|,) E2 Verb (e.g., X, Y merge)
Appositive    0.8             E1 NP (:|,)? E2 (e.g., X hometown: Y)
3. A text mining model: finding strategic
alliances from news articles
Figure 1. Text mining model for alliance discovery
Two factors contribute to the limitation of today’s
alliance discovery: too many resources that could
potentially contain alliance information and limited
human ability to scan all these resources. We aim to
build a text mining model for alliance discovery from
various news articles.
Our initial study showed that, compared with other fields, extracting alliances from free text is even more challenging. One major reason is that alliance
3.1. Meta-crawling
A meta-crawler supports unified access to multiple
existing search engines and databases[11]. It provides a
higher coverage and more up-to-date information than
each of its component search engines. An alliance
meta-search lexicon was created by experts to search
for relevant articles from multiple resources. The
3574
PURPOSE_CLAUSE_MODIFIER, PREPOSITIONAL_MODIFIER, ADV_CLAUSE_MODIFIER, TEMPORAL_MODIFIER, PRECONJUNCT, PARTICIPIAL_MODIFIER or INFINITIVAL_MODIFIER, we merge w1 and w2 together, forming a compound entity.
2. If a dependency d(w1, w2) is within the
dependency class AUX_MODIFIER and not
COPULA, we merge w1 and w2.
3. If a dependency d(w1, w2) is within the
dependency class PREPOSITION_MODIFIER and w1
is not a verb, we combine w1 and w2.
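A minimal sketch of the three merging rules, assuming each dependency is given as a coarse class, a fine-grained type, and a flag for whether w1 is a verb; this tuple representation is an assumption for illustration, not the parser's native output:

```python
# Fine-grained modifier types excluded from rule 1 (see the list above).
NON_MERGING_MODIFIERS = {
    "RELATIVE_CLAUSE_MODIFIER", "PURPOSE_CLAUSE_MODIFIER",
    "PREPOSITIONAL_MODIFIER", "ADV_CLAUSE_MODIFIER",
    "TEMPORAL_MODIFIER", "PRECONJUNCT",
    "PARTICIPIAL_MODIFIER", "INFINITIVAL_MODIFIER",
}

def should_merge(dep_class, dep_type, w1_is_verb):
    """Apply merging rules 1-3 to a dependency d(w1, w2)."""
    if dep_class == "MODIFIER" and dep_type not in NON_MERGING_MODIFIERS:
        return True   # rule 1: plain modifiers form compound entities
    if dep_class == "AUX_MODIFIER" and dep_type != "COPULA":
        return True   # rule 2: auxiliaries merge with their verb
    if dep_class == "PREPOSITION_MODIFIER" and not w1_is_verb:
        return True   # rule 3: prepositions merge unless w1 is a verb
    return False
```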
Each chunk has a head, which is the main word of the chunk, so we can classify chunks into noun chunks and verb chunks. In the above example, 7 chunks are extracted (Table 3) and the simplified dependency parse tree is presented in Figure 3. After chunking all related words, the grammatical structure of the sentence is simplified, and most remaining dependency relations belong to the dependency classes SUBJECT, COMPLEMENT and PREPOSITION. This dramatically reduces the rule space needed for extracting alliance relationships.
lexicon contains keywords such as “alliance,” “joint
venture,” “team with,” “license,” etc. News articles are
automatically downloaded using this lexicon.
Duplicate articles are detected and removed.
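The crawling step could be sketched as follows. The lexicon keywords come from the paper; the query format and the hash-based duplicate check are illustrative assumptions:

```python
import hashlib

# Sample of the expert-built alliance meta-search lexicon.
ALLIANCE_LEXICON = ["alliance", "joint venture", "team with", "license"]

def build_queries(company, lexicon=ALLIANCE_LEXICON):
    """Pair the target organization with each lexicon keyword."""
    return [f'"{company}" "{kw}"' for kw in lexicon]

def deduplicate(articles):
    """Drop verbatim duplicates by hashing normalized body text."""
    seen, unique = set(), []
    for art in articles:
        normalized = " ".join(art.split()).lower()
        digest = hashlib.sha1(normalized.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(art)
    return unique
```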
3.2. Pre-processing
The documents resulting from meta-crawling are sent to pre-processing, which involves three steps.
Document indexing involves data cleaning, document tokenization and meta-data extraction. Meta-data includes the time and source of publication and the length of the article. These features are used in alliance confidence ranking later. In order to have a uniform way to manage various source data, we convert all
source files into XML format. While commercial
databases like Thomson SDC only provide a general
source without specific dates or authors, this structure
will allow us to point to the specific alliance evidence
and will allow researchers to further investigate and
validate the type and nature of alliances.
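A minimal sketch of this conversion using the standard library; the element names are assumptions, not the paper's actual schema:

```python
import xml.etree.ElementTree as ET

def article_to_xml(title, date, source, body):
    """Wrap a crawled article's metadata and text in a uniform XML record."""
    root = ET.Element("article")
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "date").text = date      # publication time
    ET.SubElement(root, "source").text = source  # publication source
    ET.SubElement(root, "body").text = body
    return ET.tostring(root, encoding="unicode")
```

Keeping date and source as explicit fields is what lets the portal later point back to the specific alliance evidence.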
Part-of-speech tagging is the process of assigning a
part of speech to each word/token in a sentence. The
POS tags consist of coded abbreviations conforming to
the scheme of the Penn Treebank, a linguistic corpus
developed by the University of Pennsylvania. In addition, we perform chunk parsing, which groups the tokens
of a sentence into larger chunks with each chunk
corresponding to a syntactic unit such as a noun phrase
or a verb phrase. It allows us to identify important
noun and verb phrases. Chunk parsing also helps in
named entity identification and extracts alliance
relation verb phrases with their tenses. There are
several tools available to perform English POS tagging
and chunk parsing such as OpenNLP and LingPipe.
A dependency parse tree presents a hierarchy of word dependencies. Dependency types are
organized in a hierarchy according to the similarity in
their grammatical roles. Figure 2 provides an example
of a dependency tree after analyzing the sentence
“IBM Corp. and Alvarion Inc. have established an
alliance to offer wireless systems to municipalities and
their public safety agencies, Alvarion announced.”
While POS tagging captures the syntactic structure of a sentence, a dependency parse tree captures its semantic meaning. The Stanford Parser [17] performs this task well.
Merging chunks. The structure of an original parse
tree is often too complex for template matching. To
reduce the template space, we use chunk parsing
results to simplify the tree by merging words into
chunks. The following rules are used in merging.
1. If a dependency d(w1, w2) is within the dependency class MODIFIER and not one of RELATIVE_CLAUSE_MODIFIER,
Figure 2. Original dependency parse tree
Figure 3. Simplified dependency parse tree
organizations. (We can check if w2 and w4 contain
organization names.)
Template 2. If two dependencies d1 and d2, where
d1(w1, w2) is within the dependency class SUBJECT
and d2(w1, w3) is within the class COMPLEMENT,
we extract w1, w2 and w3 as our candidate alliances.
Template 3. We extract two noun chunks connected by a dependency relation of PREPOSITION_between.
Template 4. If two dependencies d1(w1, w2) and
d2(w1, w3) are within the dependency class
COMPLEMENT and d3(w1, w4) is within the class
SUBJECT, we extract w1, w2, w3 and w4 as our
candidate alliances. This rule can be used to extract
alliances among more than two organizations. (We can
check if w2 and w3 contain organization names.)
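As an illustration, Template 2 can be matched over the simplified dependency edges roughly as follows. The (class, head, dependent) triple representation and the function name are assumptions for this sketch:

```python
def match_template2(edges):
    """Template 2: a head w1 with both a SUBJECT and a COMPLEMENT edge.

    `edges` is an iterable of (dep_class, head, dependent) triples from
    the simplified dependency parse tree.
    """
    subjects = {(h, d) for c, h, d in edges if c == "SUBJECT"}
    complements = {(h, d) for c, h, d in edges if c == "COMPLEMENT"}
    candidates = []
    for head, subj in subjects:
        for head2, comp in complements:
            if head == head2:
                # (w2, w1, w3): organization, relation verb, organization
                candidates.append((subj, head, comp))
    return candidates
```

A final check that the extracted chunks contain organization names (via the entity extractor) would filter out non-alliance matches.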
Table 3. Extracted chunks from IBM example
CHUNKS                          HEAD
IBM Corp.                       Corp.
Alvarion Inc.                   Inc.
have established                established
an alliance                     alliance
to offer                        offer
wireless systems                systems
their public safety agencies    agencies
3.3. Sentence level alliance extraction
After document parsing, the next steps are to
extract entities and relations between them.
Entity extraction is performed to extract
organization names. There are many types of named
entities in text such as person name, product name,
location, and date. We are particularly interested in organization names, which are the most difficult to extract because novel names frequently appear. Various tools
are available and we choose Stanford NER[17] for its
better performance.
To perform sentence level relation extraction, we
use template-based approach. A learning-based
approach is not a good option here because of the
extremely skewed datasets. Although our news article
collection is crawled with a domain lexicon, it still
contains much noise: documents that are not relevant
to alliance but happen to have some relevant keywords
in the text. It could be rumors, informal alliances and
just some random mentioning of the word “alliance”.
Furthermore, each document contains at least dozens
and sometimes even hundreds of sentences. To parse
each sentence and classify it as an alliance or non-alliance relation is impossible at this step. True alliance announcements may appear in only one out of fifty documents, which can translate to roughly one out of five hundred sentences. Any
machine learning algorithms at this step will achieve
high precision and zero recall by predicting every
sentence to be a non-alliance relation.
Thus, we studied sentence syntax and semantics
when alliances are announced. Following Banko and
Etzioni[2]’s templates, we found that most alliance
announcements can be categorized into the following
four templates shown in Table 4. We call this Alliance
Discovery Template (ADT).
With the results from simplified dependency parse
trees, we can implement the four templates by the
following steps:
Template 1. If two dependencies d1(w1, w2) and
d2(w3, w4) are within the dependency class SUBJECT
and w1, w3 are in the same chunk, we extract w2, w4
and w1 as our candidate alliances. This rule can be
used to extract alliances among more than two
Table 4. Alliance Discovery Template (ADT)
Template 1: Organization list + Verb (form, establish, forge…). Example: IBM, Sony and Toshiba form chip R&D alliance
Template 2: First organization + Verb (join, work with…) + second organization. Example: Red Hat joins top-level IBM strategic alliance
Template 3: Noun (collaboration, agreement) + Conjunction (between, among) + organization list. Example: The collaboration between IBM and Geisinger…
Template 4: Noun (participants, partners) + include + organization list. Example: Participants include IBM and GE Health
3.4. Corpus level alliance ranking
Sentence level extraction identifies relation words
between two or more entities in the same sentence.
However, there are still many false positive examples.
We go beyond traditional relation extraction research
and expand this to a corpus level alliance ranking.
Given two entities with multiple mentions in a large
corpus, the confidence level of these two entities forming a true alliance increases if (1) the announcement appears in the first paragraph, news title, or early sentences; (2) the source of the news article is authoritative; and (3) the announcement of the alliance is mentioned multiple times.
We develop a multi-feature based ranking approach to detect whether the relationship between two entities is a true formal alliance. We call this ranking-based approach Alliance Confidence Rank (ACRank). A ranking-based approach is chosen instead of a learning-based classification approach because alliances do not
sentence in a paragraph and the position of the sentence in the entire document. In addition, we include the publisher’s authority as a document-level feature. (3) Features from the corpus level. We look at the total number of times the same alliance is extracted and aggregate confidence by summing the confidence level of each extracted instance with some degrading factors. Degrading factors are introduced because big companies appear frequently in news articles. Finally, by aggregating features on all three levels, ACRank represents the confidence level of a relationship being a true formal alliance. The notion is somewhat similar to document retrieval, but we replace “documents” with “alliances”.
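The corpus-level aggregation could be sketched as below. The geometric decay form and the 0.8 default are illustrative assumptions; the paper does not specify the exact degrading factors:

```python
def acrank_score(instance_confidences, decay=0.8):
    """Aggregate per-instance confidences for one candidate alliance pair.

    Later (lower-confidence) mentions are down-weighted by a degrading
    factor so that companies that simply appear often in the news are
    not over-ranked.
    """
    ranked = sorted(instance_confidences, reverse=True)
    return sum(conf * decay ** i for i, conf in enumerate(ranked))
```

Under this form the score still grows with repeated mentions, but with diminishing returns per additional mention.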
appear frequently in news articles. Our pilot study
showed that a learning-based classification approach
fails with extremely skewed datasets with very few
positive examples. Also, a ranking approach allows
researchers to browse the possible alliances later on to
make their own judgment. Even some rumor alliances
might be of interest because of the possibility of
forming a true alliance in the future.
In this study, we propose the following features to
be included in ACRank. These features can be
expanded in future research. (1) Features from
sentence level. They include the ADT template used, the number of entities extracted, and the number of critical keywords appearing in the sentence. (2) Features from the
document level. They include the position of the
Figure 4. User interface of alliance knowledge portal
evidence is drawn from and whether these alliances have been validated by human experts (last column). By clicking on the number in each cell, snippets of the evidence are shown in the upper right panel, called “Evidence List”. This panel displays all the evidence from news articles in the fashion of a search engine’s ranked result list, with titles, retrieval dates and snippets. By clicking an item in the evidence list, a user can view the original evidence in
the lower right “Evidence” panel. A network
visualization tool is provided in the lower left
“Graphical Network” panel. In this example, each node
represents a participating company and each link
represents a strategic alliance. The size of a node
indicates the size of a company and the color indicates
3.5. Alliance knowledge portal
Automatically extracted alliances are loaded to a
knowledge portal to further assist alliance researchers.
To illustrate, we show an example of the user interface
in Figure 4. The interface is composed of several
panels. The upper left panel is the “Input” panel. The
knowledge portal provides basic and advanced search
functions, where a user can specify the organization of
interest (in this case “IBM”), type of alliances of
interest in terms of sector and company size, and time
period. Other information will be incorporated in the
future. The system pops up a list of alliances in a
ranked order based on confidence level called “Raw
Data”. It also indicates the data sources where the
the sector it belongs to. The thickness of a link
indicates the confidence level (or the strength of
evidence), and a dotted link indicates an informal
alliance (not validated by experts yet). The interface is not a final product, but it gives an idea of how researchers can trace each alliance and actually see the supporting evidence.
Table 5. Performance of the ADT method in comparison with benchmarks
Method         Recall  Precision  F-measure
Co-occurrence  68.7%   4.4%       0.075
Co-occur+verb  65.2%   7.3%       0.121
Template 1     55.8%   51.1%      0.528
Template 2     47.6%   34.6%      0.398
Template 3     11.4%   57.1%      0.186
Template 4     30.2%   51.2%      0.372
ADT            78.1%   44.7%      0.568
4. Evaluation
We conducted a case study with IBM Corp. during
the year 2006. There are three goals of this case study:
(1) to evaluate the effectiveness of our sentence level
ADT-based approach, (2) to evaluate the effectiveness
of our corpus level ACRank approach and (3) to study
the coverage of the Thomson database by comparing
with domain experts’ extraction results. The Thomson database is the most popular database recording publicly announced alliance deals worldwide. We
chose IBM because as a multinational and technology
driven company, IBM has established numerous
alliances with other organizations inside and outside of
the United States. We collected all publications in 2006
from Lexis-Nexis using a domain lexicon. A total of
4,261 documents were crawled. We selected a subset
of 1,000 documents that contains a total of 63,019
sentences for this study. Our domain expert manually
read all 1,000 documents and extracted 63 alliances.
4.2. Evaluation of corpus level ACRank
Our second experiment evaluated the performance
of corpus level ACRank. We compared corpus level
ACRank with sentence-level features alone, document-level features alone, and a popular machine learning algorithm, SVM.
We adopted classic information retrieval evaluation
metrics of recall and precision for the top-N% ranked
documents. Here, we changed documents retrieved to
alliances extracted. We present the recall and precision
with the top-N% extracted alliances in Table 6, Figure
5 and Figure 6.
We observed that recall increases consistently with the number of ranked alliances included until it reaches 100%. Fifty percent recall was reached when
the top-25% alliance instances were considered. Using
sentence-level features alone, recall reached 50% when
the top-40% alliance instances were considered, and
45% for document-level features alone. On the other
hand, SVM only achieved a 32.7% recall. That means
67.3% of alliances were missed during SVM
classification using the same set of features. The same
recall was achieved at top -15% in ACRank.
Precision reached the highest of 100% with the top10% extracted alliances and then gradually decreased
to 44.7% when all alliances were considered. With the
top-20% of alliance instances, precision was
maintained at 97%. This shows that our ACRank is
most effective in predicting the very top alliances and
the two categories of features significantly boost each
other. With the top-50% of alliance instances, precision
was still maintained at 65.5%. Compared to
information extraction tasks, our performance is
satisfactory. In comparison, SVM achieved 59.3%
precision which is similar to the precision of the top65% results from ACRank.
When looking at errors generated, we found that
there were three major causes of errors: unidentified or
wrongly identified entity names, wrong dependency
parse tree and instances that did not fit into any of the
templates. This suggested that although achieving
78.1% in recall, our four templates still need to be
expanded.
4.1. Evaluation of ADT-based extraction
We compared our ADT-based extraction with two
benchmark algorithms. Benchmark 1 was cooccurrence based approach and benchmark 2 was cooccurrence plus critical verb identified by experts. Cooccurrence approach assumes that if two organization
names appear in the same sentence, they are likely to
form an alliance given that the collection is an alliance
relevant collection. Our evaluation metrics are recall,
precision and F-measure which are standard in
information extraction.
Table 5 presents the performance of ADT method
in comparison with the two benchmark algorithms. We
also present the performance of each template
extraction. ADT method achieved a 78.1% in recall,
44.7% in precision and 0.568 in F-measure. It
outperformed two co-occurrence based benchmarks
significantly in recall, precision and F-measure.
Meanwhile, we observed that Template 1, 3 and 4 were
more accurate in finding alliances and Template 1
alone covered 50% of alliance announcement. In
general, sentence-based alliance extraction can identify
most alliances but also extracts a significant amount of
false alliances.
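As a minimal illustration, the co-occurrence baselines described above can be sketched as follows. This is a simplified stand-in, not the paper's actual pipeline: the entity list and alliance-verb lexicon here are hypothetical placeholders for the paper's named entity recognizer and expert-built verb list.

```python
from itertools import combinations

# Hypothetical alliance-signaling verbs; the experts' lexicon is larger.
ALLIANCE_VERBS = {"partner", "collaborate", "team", "ally", "join"}

def cooccurrence_pairs(sentences, org_names, require_verb=False):
    """Benchmark 1: any two organization names in the same sentence form a
    candidate alliance. Benchmark 2 (require_verb=True) additionally
    requires an alliance-signaling verb in the sentence."""
    candidates = set()
    for sent in sentences:
        lowered = sent.lower()
        orgs = [o for o in org_names if o.lower() in lowered]
        if require_verb and not any(v in lowered for v in ALLIANCE_VERBS):
            continue
        for a, b in combinations(sorted(set(orgs)), 2):
            candidates.add((a, b))
    return candidates

sents = ["IBM and Cisco announced they will partner on data centers.",
         "IBM reported quarterly earnings alongside Intel results."]
orgs = ["IBM", "Cisco", "Intel"]
print(cooccurrence_pairs(sents, orgs))                     # both pairs
print(cooccurrence_pairs(sents, orgs, require_verb=True))  # only IBM-Cisco
```

The toy example shows why benchmark 1 has such low precision in Table 5: the second sentence yields a false IBM-Intel "alliance" from mere co-occurrence, which the verb filter of benchmark 2 removes.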
Table 6. Performance of ACRank with the top-N% alliances

                Top-10%  Top-20%  Top-30%  Top-40%  Top-50%  Top-60%  Top-70%  Top-80%  Top-90%  Top-100%
Sentence-Level
  Precision     72.7%    60.6%    57.6%    54.5%    54.5%    52.8%    50.9%    49.0%    46.3%    44.7%
  Recall        16.3%    27.2%    38.8%    49.0%    61.2%    70.7%    79.6%    87.8%    93.2%    100.0%
  F-measure     0.267    0.376    0.463    0.516    0.577    0.605    0.621    0.629    0.619    0.618
Document-Level
  Precision     66.7%    60.6%    53.5%    50.0%    47.3%    47.2%    46.1%    46.4%    46.6%    44.7%
  Recall        15.0%    27.2%    36.1%    44.9%    53.1%    63.3%    72.1%    83.0%    93.9%    100.0%
  F-measure     0.244    0.376    0.431    0.473    0.500    0.541    0.562    0.595    0.623    0.618
ACRank
  Precision     100.0%   97.0%    81.8%    69.7%    65.5%    60.4%    53.9%    50.6%    47.6%    44.7%
  Recall        22.4%    43.5%    55.1%    62.6%    73.5%    81.0%    84.4%    90.5%    95.9%    100.0%
  F-measure     0.367    0.601    0.659    0.659    0.692    0.692    0.658    0.649    0.637    0.618
SVM (all extracted alliances): Precision 59.3%, Recall 32.7%, F-measure 0.421
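The top-N% metrics reported in Table 6 can be computed as in the sketch below, assuming a ranked list of candidate alliances and a gold-standard set of true alliances; the function name and toy data are illustrative, not from the original system.

```python
def topn_metrics(ranked_candidates, gold, n_percent):
    """Precision, recall and F-measure over the top-N% of a ranked
    candidate list, mirroring standard IR evaluation metrics."""
    k = max(1, round(len(ranked_candidates) * n_percent / 100))
    top = ranked_candidates[:k]
    true_pos = sum(1 for c in top if c in gold)
    precision = true_pos / k
    recall = true_pos / len(gold)
    f = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return precision, recall, f

# Toy example: 10 ranked candidates, 4 real alliances in the gold standard.
ranked = ["a1", "x1", "a2", "a3", "x2", "a4", "x3", "x4", "x5", "x6"]
gold = {"a1", "a2", "a3", "a4"}
p, r, f = topn_metrics(ranked, gold, 50)  # top-5 candidates
print(p, r)  # 0.6 precision, 0.75 recall
```

As in Table 6, a good ranker concentrates true alliances near the top, so precision is highest at small N and recall necessarily reaches 100% at the top-100% cutoff.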
Figure 5. Top-N% Recall of ACRank

Figure 6. Top-N% Precision of ACRank

4.3. Thomson SDC coverage
Our third experiment evaluates the coverage of
Thomson SDC in comparison with the alliances
identified by our domain experts. There are a total of
14 IBM alliances reported for 2006 in the Thomson
database. After examining the overlap with the
Thomson database, we found that 58 of the 63
alliances that the experts identified were not covered in
the Thomson database (Table 7). This finding is
surprising, yet it demonstrates that Thomson is indeed
far from a comprehensive alliance database.
From this table, the Thomson database covered
only 19.4% of the total alliances an expert can identify.
This number is an upper bound on Thomson's coverage
because, as experts read more documents, more
alliances may be identified. This is consistent with
Schilling's conclusion [15] that none of the databases
can be considered an accurate reflection of the entire
population of strategic alliances, but only of subsets;
he estimated that Thomson covers about 20% of total
alliances.

Table 7. Coverage of the Thomson SDC

                      Thomson SDC   Experts   Automatic
Found in 1,000 docs   5             63        49
Found in 4,261 docs   12            NA        NA
Total                 14            NA        NA
Coverage              19.4%         87.5%     68.1%

5. Conclusions and future directions
With the increasing interest in strategic alliances,
most researchers rely on manually built databases to
perform analysis and draw conclusions. Among these
databases, Thomson SDC is the most popular; it
records publicly announced alliance deals worldwide,
tracked down in Securities and Exchange Commission
filings in the United States, newswires, the press, trade
magazines, professional journals and the like.
However, as documents are manually screened, input
is limited to those sources that can realistically be read
by a set number of staff in a set time. Manually reading
documents to identify alliances is extremely
time-consuming. While the accuracy might be high,
the coverage is often low, meaning that many valid
alliances can be missed. In this research, we design,
develop and evaluate a text mining model to extract
alliance knowledge. We propose the ADT method, a
template-based relation extraction method that utilizes
entity extraction, POS tagging and dependency parse
trees. Moreover, we aggregate information from single
sentences and documents to generate ACRank, a
corpus-based multi-feature ranking algorithm. An
alliance knowledge portal is proposed to support
alliance researchers in searching, browsing and
visualizing alliance extraction results. This portal could
provide researchers with evidence of alliance
announcements and assist in answering strategy and
policy questions. Evaluation results show that our
automatic approach can extract over 78% of total
alliances with over 44% precision. In evaluating the
Thomson SDC database, we concluded that only
19.4% of strategic alliances were covered in SDC.
This research will not only encourage new research
and discovery in economics and public policy, but will
also advance techniques in text and Web mining
research. By bringing text mining and knowledge
discovery techniques into the field of economics and
public policy, the research will foster awareness of
cross-disciplinary research and enrich collaboration
between social science and computer science
paradigms.
In the future, we plan to expand our case study by
including longer time periods and adding more
companies from different industries. We also plan to
study additional features in our ACRank algorithm and
to add other templates to the relation extraction
component.
6. References

[1] D. E. Appelt, et al., "FASTUS: A finite-state processor
for information extraction from real-world text," 1993, pp.
1172-1172.

[2] M. Banko and O. Etzioni, "The tradeoffs between open
and traditional relation extraction," Proceedings of ACL-08:
HLT, pp. 28-36, 2008.

[3] R. C. Bunescu and R. J. Mooney, "A shortest path
dependency kernel for relation extraction," 2005, pp.
724-731.

[4] A. Culotta, et al., "Integrating probabilistic extraction
models and data mining to discover relations and patterns in
text," 2006, pp. 296-303.

[5] A. Hotho, et al., "A brief survey of text mining," 2005,
pp. 19-62.

[6] N. Kambhatla, "Combining lexical, syntactic, and
semantic features with maximum entropy models for
extracting relations," 2004, pp. 22-es.

[7] M. Krallinger and A. Valencia, "Text-mining and
information-retrieval services for molecular biology,"
Genome Biology, vol. 6, p. 224, 2005.

[8] Y. Liu, et al., "Exploiting rich syntactic information for
relation extraction from biomedical articles," 2007, pp.
97-100.

[9] H. Lodhi, et al., "Text classification using string
kernels," The Journal of Machine Learning Research, vol. 2,
pp. 419-444, 2002.

[10] A. Mansouri, et al., "Named entity recognition
approaches," IJCSNS, vol. 8, p. 339, 2008.

[11] W. Meng, et al., "Building efficient and effective
metasearch engines," ACM Computing Surveys (CSUR),
vol. 34, pp. 48-89, 2002.

[12] S. Miller, et al., "A novel use of statistical parsing to
extract information from text," 2000, pp. 226-233.

[13] D. Nadeau, "Semi-supervised named entity recognition:
learning to recognize 100 entity types with little
supervision," 2007.

[14] B. Pang and L. Lee, "Opinion mining and sentiment
analysis," Foundations and Trends in Information Retrieval,
vol. 2, pp. 1-135, 2008.

[15] M. A. Schilling, "Understanding the alliance data,"
Strategic Management Journal, vol. 30, pp. 233-260, 2009.

[16] R. Srihari, et al., "A hybrid approach for named entity
and sub-type tagging," 2000, pp. 247-254.

[17] Stanford. Named Entity Recognition (NER) and
Information Extraction (IE). Available:
http://nlp.stanford.edu/ner/index.shtml

[18] Y. H. Tseng, et al., "Text mining techniques for patent
analysis," Information Processing & Management, vol. 43,
pp. 1216-1247, 2007.

[19] I. H. Witten, et al., "Text mining in a digital library,"
International Journal on Digital Libraries, vol. 4, pp. 56-59,
2004.

[20] Y. C. Wu, et al., "Extracting named entities using
support vector machines," Knowledge Discovery in Life
Science Literature, pp. 91-103, 2006.