2012 45th Hawaii International Conference on System Sciences

A Text Mining Model for Strategic Alliance Discovery

Yilu Zhou, George Washington University, [email protected]
Yi Zhang, George Washington University, [email protected]
Nicholas Vonortas, George Washington University, [email protected]

based national and international knowledge repository on strategic alliances, through both government and private-sector organizations, those efforts have been of limited success. Alliances in general have become ubiquitous, and sole reliance on a limited set of popular sources for related information therefore creates bias. The main challenge in constructing such a knowledge repository is no different from that in many other fields: information overload and human limitations. Currently, discovery of strategic alliances relies on humans physically reading news articles, company reports, etc., and then inputting data manually. This imposes strict limits on the knowledge to be shared, including the number of alliances that can be searched (thus decreasing completeness), the speed of knowledge updates, and increased cost (paying researchers to manually comb through records). In this research, we aim to address the limitations of manual work by developing an intelligent text mining model to extract existing alliances from open resources such as published news articles. The model integrates meta-search, dependency parsing, entity extraction, and relation extraction. In particular, we propose an Alliance Discovery Template (ADT) based relation extraction method and an Alliance Confidence Ranking strategy (ACRank). An alliance knowledge portal is proposed to further assist alliance study. We evaluated the effectiveness of our framework in a case study and compared the coverage of our approach against a gold standard, the Thomson SDC database, and against expert judgment. Our research is a first step toward building an alliance knowledge repository.
Results showed that this automatic approach has the potential to create a much more comprehensive alliance database than existing gold standards such as the Thomson SDC Alliance Database. The results of this research will provide rich information and evidence on strategic alliances formed, and will enable researchers to better understand the interactions and outputs that make these alliances one of the central drivers of innovation and economic growth. Policymakers will then be better informed regarding the crafting of regulations and laws that allow these alliances to operate at socially-optimal

Abstract

Strategic alliances among organizations are one of the central drivers of innovation and the economy and have raised strong interest among policymakers, strategists and economists. However, discovery of alliances has relied on pure manual search and has limited scope. This research addresses these limitations by proposing a text mining framework that automatically extracts alliances from news articles. The model not only integrates meta-search, entity extraction, and shallow and deep linguistic parsing techniques, but also proposes an innovative ADT-based relation extraction method to deal with extremely skewed and noisy news articles, and ACRank to further improve precision using various linguistic features. Evaluation in an IBM alliances case study showed that ADT-based extraction achieved 78.1% recall, 44.7% precision and 0.569 F-measure, and eliminated over 99% of the noise in document collections. ACRank further improved precision to 97% for the top-20% extracted alliance instances. Our case study also showed that the widely cited Thomson SDC database covered less than 20% of total alliances, while our automatic approach covered 67%.

1. Introduction

Inter-firm collaboration has exploded during the past couple of decades.
The nature of collaboration has shifted from peripheral interests to the very core functions of the corporation, and from equity to non-equity forms of collaboration. The phenomenon has raised strong interest among analysts, business strategists, and policymakers from various fields including economics, management, public administration, and science and technology. Being able to discover strategic alliances will allow policy, innovation, and economics researchers to better understand this growing phenomenon. Such data can be studied to gain insights into the evolution of industrial sectors and the growing interdisciplinary nature of today's R&D, and thus allow for cause-and-effect analysis of policies and events. While there have been various attempts in the past to create a broad-

978-0-7695-4525-7/12 $26.00 © 2012 IEEE. DOI 10.1109/HICSS.2012.86

Jeff Williams, George Washington University, [email protected]

between invention and commercialization. Non-equity, co-operative research, technology, and development (RTD) agreements are often used to get around technology bottlenecks in an industry. Industrial comprehensive joint ventures are crucial mechanisms through which large firms devote energy to specific technology problems in an industry. Customer-supplier agreements are important in that they allow research-focused firms to stay centered on research by outsourcing some production or infrastructure maintenance (labs, manufacturing space, etc.) to an external specialist. One-way licensing and marketing agreements allow one firm to take advantage of innovations created by a second. They are also helpful in opening up foreign markets, with the foreign partner controlling the production and marketing of a product in that firm's home country.
University-based research is a key policy issue at the moment, with discussions centering on whether the increased pressure for universities to convert knowledge into commercial goods is detrimental to the long-run supply of innovation in the economy. Jointly-funded industry and government research undertakings often play a vital role in creating socially important innovative products or processes that, due to collective action or resource appropriation issues, would not be produced by private-sector firms.

levels. In this paper, we focus on formal strategic alliances. Other types of business collaboration are outside the scope of this research. The rest of the paper is structured as follows. We first review the current research on strategic alliances and the field of text mining. We then describe the challenges associated with finding strategic alliances in massive document collections and present our text mining model. Following our model, evaluations are presented in a case study, and finally we conclude the paper with contributions and future directions.

2. Research background

As we mentioned earlier, with technological innovation and diffusion being the primary driver of today's knowledge economy, and alliances becoming an ever-increasing source of technological innovation and diffusion, economists, managers, and policymakers need to have access to information on what makes those alliances work and what can put up barriers.

2.1. Strategic alliance types

Strategic alliances involve a wide range of formal activities, including: joint ventures, research corporations, joint R&D (e.g., research pacts, joint development agreements), technology exchange agreements (e.g., technology sharing, cross-licensing, mutual second-sourcing), direct investment, minority/cross-holding, customer-supplier relations, R&D contracts, one-directional technology flow agreements (e.g., licensing, second-sourcing), and manufacturing agreements.
In addition, innovation networks involve informal collaboration and knowledge exchanges, such as working relationships of individuals across organizations and systemic learning through patents, blueprints, etc. All sorts of organizations are involved, including firms, universities, research institutes, and "intermediate organizations" (e.g., professional and trade associations, think tanks). Table 1 summarizes our categorization of alliance types and provides examples of each type. Each type of alliance listed in the table is of equal, but different, importance for economists, business managers, and policymakers to understand fully. Research corporations are important vehicles through which large technology firms collaborate on pushing knowledge forward on a large scale, due to their access to resources. Research performed for industry associations can have a similar impact. Venture capital is crucial to the survival of many small technology firms, as it may be the only way for them to bridge the so-called Valley of Death

2.2. Research strategies to identify alliances and limitations

Current methods of studying alliances – patents, bibliometrics and citation tracking, popular press analysis, surveys, and government documents – have contributed important findings but are limited in their scope. For example, patenting levels vary by sector and by firm size, among other factors, so any alliance data based on that metric will be potentially biased toward sectors that have strong patenting traditions or needs. Another example is surveys, as these rarely cover all firms in a sector, and their accuracy is contingent upon getting the right person – or group of people – to answer questions at the firm level. Policymakers need to know how generic innovation policies affect all players, not just the select few that answer a survey or happen to patent a lot.
Likewise, economists will have a skewed picture of the inputs and outputs associated with operating an economy at efficient levels. Existing alliance databases are maintained manually. As highlighted in Schilling [15], none of the databases is considered an accurate reflection of the entire population of strategic alliances, but only of subsets. The Thomson SDC Alliance database is the most cited database and is populated by information manually extracted from newspapers, trade journals, government filings, and other press and newswire services. Team members for Thomson read these news sources and, upon locating an alliance announcement or information relating to an alliance already in the database, update the database as needed. There is a time-lag component to this data entry. For example, if one were to search the Thomson Alliance database in 2010 for alliances from 2009, there would be very few listed compared to the 2008 listings. This is due to the backlog of news sources awaiting examination by Thomson team members.

Table 1. Strategic alliance types
Pre-competitive R&D Cooperation: University-based co-operative research funded jointly by industry; RTD performed or sponsored by industry associations / contract RTD for multiple clients by non-profit organizations; Jointly funded government-industry RTD; Research corporations; Corporate venture capital in small companies (by one or more firms)
Downstream Technology Development Co-operation: Non-equity co-operative RTD agreements between firms in selected areas; Inter-firm agreements regarding proven technologies developed independently
Production and/or Marketing and Technology Development Co-operation: Industrial comprehensive joint ventures; Customer-supplier agreements; One-way licensing and/or marketing agreements

learning-based approach employs a statistical classification model to solve entity recognition [10].
The systems look for patterns and relationships in text to build a model using statistical techniques and machine learning algorithms. Supervised learning uses a program that learns to classify a given set of labeled examples, each made up of the same number of features, and requires a large amount of annotated data to achieve high performance [20]. Examples of supervised models are the Hidden Markov Model, the Maximum Entropy Model and the Support Vector Machine (SVM). An unsupervised or semi-supervised approach often involves bootstrapping [13]. The hybrid approach combines a rule-based and a machine learning-based method, creating new methods from the strongest points of each [16].

2.3. Text mining

Text mining is defined as the process of extracting interesting information and knowledge from unstructured texts [5]. It has been studied in various fields such as e-commerce [14], patent analysis [18], and bioinformatics research [7, 8]. Compared with data mining, text mining is in general a more challenging task because it deals with unstructured and amorphous texts. Text mining research deals with a variety of problems including text summarization, document and information retrieval, text categorization, authorship identification, entity extraction and relation extraction [19]. Despite the development of text mining and its applications in many scientific fields, to our knowledge no collaboration has been formed between alliance researchers and text mining researchers. It is natural to think of an automatic way to discover alliances, but the task is challenging. Alliance discovery mainly involves entity extraction and relation extraction, two difficult subfields of text mining.

2.3.2. Relation Extraction. Relation extraction involves annotating unstructured text with entities and the relations between them. Relation extraction is considered a hard task [19], and the first step is to perform entity extraction.
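As a concrete illustration of what a rule-based first step can look like, the sketch below extracts candidate organization names with a single orthographic pattern. The suffix lexicon and regular expression are our own illustrative assumptions for this sketch, not the rules of any system cited in this paper:

```python
import re

# Illustrative lexicon of corporate suffixes (an assumption for this sketch,
# not an exhaustive or cited rule set).
ORG_SUFFIX = r"(?:Corp\.|Inc\.|Ltd\.|LLC|Co\.)"

# One or more capitalized tokens followed by a corporate suffix,
# e.g. "Alvarion Inc." -- a crude orthographic rule.
ORG_PATTERN = re.compile(r"\b(?:[A-Z][\w&-]*\s)+" + ORG_SUFFIX)

def extract_orgs(sentence):
    """Return candidate organization names matched by the pattern."""
    return [m.group(0) for m in ORG_PATTERN.finditer(sentence)]

sent = ("IBM Corp. and Alvarion Inc. have established an alliance "
        "to offer wireless systems to municipalities.")
print(extract_orgs(sent))  # ['IBM Corp.', 'Alvarion Inc.']
```

Real systems combine many such patterns with dictionaries; this single rule already shows why hand-built rule sets are domain specific and brittle for novel names.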
Most relation extraction concerns relations between different types of entities, such as a person-position or a person-location relation. Relation extraction methods can be classified into two categories: template-based approaches and learning-based approaches. A template-based approach relies on experts to carefully construct the rule lexicon and template. It is domain specific. With a well-defined template-based extraction approach, Banko and Etzioni [2] concluded that nearly 95% of 500 randomly selected sentences belong to one of the eight categories of verb phrases illustrated in Table 2. A learning-based approach tends to formulate the

2.3.1. Entity Extraction. Entity extraction research aims to extract and classify rigid designators [13]. There are many types of named entities in text, such as person names, product names, locations, and dates. Entity extraction can take a rule-based approach, a machine learning-based approach or a hybrid approach. A rule-based approach focuses on extracting names using a large set of hand-made rules. Generally, such systems consist of a set of patterns using grammatical (e.g., part of speech), syntactic (e.g., word precedence) and orthographic features (e.g., capitalization) in combination with dictionaries [10]. A machine

information can appear in any type of news, but the frequency of appearance is very low. When we manually read 2,000 search results from Google for "IBM Alliance", only a handful of articles described true formal alliances. This low number of hits makes some well-performing learning-based algorithms, such as SVM, fail. It also alerts us that being able to select a relevant set of documents is important to performance. The documents being parsed need to be comprehensive enough to cover most alliances announced, yet still focused enough to avoid unnecessary noise. We develop a text mining model to address the challenges in alliance extraction.
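Since document relevance is flagged above as critical, a naive sketch of lexicon-based document filtering is shown below. The phrase list and threshold are illustrative assumptions, not the paper's actual meta-search lexicon:

```python
# Alliance phrases drawn from the paper's examples; the weighting scheme
# and threshold are our own illustrative assumptions.
ALLIANCE_LEXICON = {"alliance", "joint venture", "team with", "license",
                    "collaboration", "partnership"}

def is_candidate(document, min_hits=2):
    """Keep a document only if enough lexicon phrases occur, reducing the
    extreme noise described above before any expensive parsing."""
    text = document.lower()
    hits = sum(1 for phrase in ALLIANCE_LEXICON if phrase in text)
    return hits >= min_hits

docs = [
    "IBM and Sony announced a joint venture; the alliance targets chip R&D.",
    "The Rebel Alliance appears in a film review.",  # noise: one stray hit
]
print([is_candidate(d) for d in docs])  # [True, False]
```

A filter this crude trades recall for focus; raising `min_hits` shrinks the collection further at the risk of dropping tersely worded announcements.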
The model is composed of the following components, as shown in Figure 1: (1) meta-crawling of alliance news articles, (2) document pre-processing, (3) sentence-level alliance extraction, and (4) corpus-level alliance ranking. The result of our text mining model is a ranked list of possible alliances. To further assist alliance researchers, we also propose a knowledge portal. Users can search and browse the extracted alliances, categorize alliance types, and visualize alliance relations. We discuss each component in detail below.

relation extraction problem as a classification problem. A discriminative classifier, such as a Support Vector Machine (SVM), Voted Perceptron or log-linear model, may be used on a set of features extracted from a structured representation of the sentence. Two kinds of supervised approaches have been used: feature-based [6], such as FASTUS [1], and kernel-based [9]. A learning-based approach performs well but is difficult to extend to new relation types, and the effort needed to label training data is tremendous. Both approaches depend on the ability to parse sentences. There are three levels of sentence parsing: (1) Bag-of-Words (BoW) parsing, (2) syntactic parsing, and (3) semantic parsing. The BoW approach disregards grammar and word order and takes each word into account. Syntactic parsing is also known as shallow parsing; it performs Part-Of-Speech (POS) tagging and entity extraction. Semantic parsing is the most advanced and usually represents a sentence with a dependency parse tree [3, 4, 12]. Adopting text mining techniques for alliance discovery is not easy, because most current techniques were developed on standard datasets and have been tuned for better performance. Making them work in a new domain requires much investigation.

Table 2. Verb phrase template categories [2] (Category: lexico-syntactic pattern, example; relative freq.)
Verb: E1 Verb E2 (e.g., X established Y); 37.8
Noun+Prep: E1 NP Prep E2 (e.g., X settlement with Y); 22.8
Verb+Prep: E1 Verb Prep E2 (e.g., X moved to Y); 16.0
Infinitive: E1 to Verb E2 (e.g., X plans to acquire Y); 9.4
Modifier: E1 Verb E2 Noun (e.g., X is Y winner); 5.2
Coordinate_n: E1 (and|,|-|:) E2 NP (e.g., X-Y deal); 1.8
Coordinate_v: E1 (and|,) E2 Verb (e.g., X, Y merge); 1.0
Appositive: E1 NP (:|,)? E2 (e.g., X hometown: Y); 0.8

3. A text mining model: finding strategic alliances from news articles

Figure 1. Text mining model for alliance discovery

Two factors contribute to the limitations of today's alliance discovery: too many resources that could potentially contain alliance information, and limited human ability to scan all these resources. We aim to build a text mining model for alliance discovery from various news articles. Our initial study showed that, compared to other fields, extracting alliances from free text is even more challenging. One major reason is that alliance

3.1. Meta-crawling

A meta-crawler supports unified access to multiple existing search engines and databases [11]. It provides higher coverage and more up-to-date information than any of its component search engines. An alliance meta-search lexicon was created by experts to search for relevant articles from multiple resources. The

PURPOSE_CLAUSE_MODIFIER, PREPOSITIONAL_MODIFIER, ADV_CLAUSE_MODIFIER, TEMPORAL_MODIFIER, PRECONJUNCT, PARTICIPIAL_MODIFIER or INFINITIVAL_MODIFIER, we merge w1 and w2 together, forming a compound entity.
2. If a dependency d(w1, w2) is within the dependency class AUX_MODIFIER and not COPULA, we merge w1 and w2.
3. If a dependency d(w1, w2) is within the dependency class PREPOSITION_MODIFIER and w1 is not a verb, we combine w1 and w2.

Each chunk has a head, which is the main word of the chunk, so we can classify chunks into noun chunks and verb chunks. In the above example, seven chunks are extracted (Table 3) and the simplified dependency parse tree is presented in Figure 3. After chunking all related words, the grammar structure of the sentence is simplified.
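A schematic sketch of this merging step is given below, with dependencies represented as (label, head, dependent) triples. The tiny data structures and the abridged label set are illustrative assumptions for the sketch, not the parser's actual output format:

```python
# Abridged set of dependency classes that trigger rule 1 (illustrative).
MERGE_MODIFIERS = {"PARTICIPIAL_MODIFIER", "INFINITIVAL_MODIFIER",
                   "TEMPORAL_MODIFIER", "PREPOSITIONAL_MODIFIER"}

def merge_chunks(tokens, deps):
    """Union-find style grouping: each merge rule joins the two words of a
    dependency into the same chunk."""
    parent = {t: t for t in tokens}

    def find(t):
        while parent[t] != t:
            t = parent[t]
        return t

    def union(a, b):
        parent[find(b)] = find(a)

    for label, head, dep in deps:
        if label in MERGE_MODIFIERS:       # rule 1 (abridged label set)
            union(head, dep)
        elif label == "AUX_MODIFIER":      # rule 2
            union(head, dep)

    chunks = {}
    for t in tokens:
        chunks.setdefault(find(t), []).append(t)
    return list(chunks.values())

tokens = ["IBM Corp.", "Alvarion Inc.", "have", "established", "an alliance"]
deps = [("AUX_MODIFIER", "established", "have")]
print(merge_chunks(tokens, deps))
```

Here the auxiliary "have" is merged into the chunk headed by "established", mirroring how the verb chunk "have established" arises in the IBM example.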
Most remaining dependency relations belong to the dependency classes SUBJECT, COMPLEMENT and PREPOSITION. This dramatically reduces the rule space needed for extracting alliance relationships.

lexicon contains keywords such as "alliance," "joint venture," "team with," "license," etc. News articles are automatically downloaded using this lexicon. Duplicates are detected and removed.

3.2. Pre-processing

The resulting documents from meta-crawling are sent to pre-processing, which comprises three processes. Document indexing involves data cleaning, document tokenization and metadata extraction. Metadata includes the time and source of publication and the length of the article. These features are used later in alliance confidence ranking. In order to have a uniform way to manage various source data, we convert all source files into XML format. While commercial databases like Thomson SDC only provide a general source without specific dates or authors, this structure allows us to point to the specific alliance evidence and allows researchers to further investigate and validate the type and nature of alliances. Part-of-speech tagging is the process of assigning a part of speech to each word/token in a sentence. The POS tags consist of coded abbreviations conforming to the scheme of the Penn Treebank, a linguistic corpus developed by the University of Pennsylvania. In addition, we perform chunk parsing, which groups the tokens of a sentence into larger chunks, each corresponding to a syntactic unit such as a noun phrase or a verb phrase. This allows us to identify important noun and verb phrases. Chunk parsing also helps in named entity identification and extracts alliance relation verb phrases with their tenses. Several tools are available for English POS tagging and chunk parsing, such as OpenNLP and LingPipe. A dependency parse tree is a way to present a hierarchy of word dependencies.
Dependency types are organized in a hierarchy according to the similarity of their grammatical roles. Figure 2 provides an example of a dependency tree resulting from analyzing the sentence "IBM Corp. and Alvarion Inc. have established an alliance to offer wireless systems to municipalities and their public safety agencies, Alvarion announced." While POS tagging captures the syntactic structure of a sentence, a dependency parse tree captures its semantic structure. The Stanford Parser [17] is a well-established tool for this task.

Merging chunks. The structure of an original parse tree is often too complex for template matching. To reduce the template space, we use the chunk parsing results to simplify the tree by merging words into chunks. The following rules are used in merging.
1. If a dependency d(w1, w2) is within the dependency class MODIFIER and not one of RELATIVE_CLAUSE_MODIFIER,

Figure 2. Original dependency parse tree

Figure 3. Simplified dependency parse tree

organizations. (We can check if w2 and w4 contain organization names.)
Template 2. If there are two dependencies d1(w1, w2) within the dependency class SUBJECT and d2(w1, w3) within the class COMPLEMENT, we extract w1, w2 and w3 as our candidate alliances.
Template 3. We extract two noun chunks connected by a dependency relation of PREPOSITION_between.
Template 4. If dependencies d1(w1, w2) and d2(w1, w3) are within the dependency class COMPLEMENT and d3(w1, w4) is within the class SUBJECT, we extract w1, w2, w3 and w4 as our candidate alliances. This rule can be used to extract alliances among more than two organizations. (We can check if w2 and w3 contain organization names.)

Table 3. Extracted chunks from the IBM example (chunk: head)
IBM Corp.: Corp.
Alvarion Inc.: Inc.
have established: establish
an alliance: alliance
to offer: offer
wireless systems: systems
their public safety agencies: agencies

3.3.
Sentence level alliance extraction

After document parsing, the next steps are to extract entities and the relations between them. Entity extraction is performed to extract organization names. There are many types of named entities in text, such as person names, product names, locations, and dates. We are particularly interested in organization names, which are the most difficult to extract because novel names appear frequently. Various tools are available, and we chose Stanford NER [17] for its better performance. To perform sentence-level relation extraction, we use a template-based approach. A learning-based approach is not a good option here because of the extremely skewed dataset. Although our news article collection is crawled with a domain lexicon, it still contains much noise: documents that are not relevant to alliances but happen to have some relevant keywords in the text. These could be rumors, informal alliances, or just random mentions of the word "alliance". Furthermore, each document contains at least dozens and sometimes hundreds of sentences. Parsing each sentence and classifying it as an alliance or non-alliance relation is impossible at this step. True alliance announcements may appear in only about one out of fifty documents, which roughly translates to one out of five hundred sentences. Any machine learning algorithm at this step would achieve high accuracy and zero recall by predicting every sentence to be a non-alliance relation. Thus, we studied sentence syntax and semantics when alliances are announced. Following Banko and Etzioni's [2] templates, we found that most alliance announcements can be categorized into the four templates shown in Table 4. We call this the Alliance Discovery Template (ADT). With the results from simplified dependency parse trees, we can implement the four templates by the following steps:
Template 1.
If two dependencies d1(w1, w2) and d2(w3, w4) are within the dependency class SUBJECT, and w1 and w3 are in the same chunk, we extract w2, w4 and w1 as our candidate alliances. This rule can be used to extract alliances among more than two

Table 4. Alliance Discovery Template (ADT)
Template 1: Organization list + verb (form, establish, forge…). Example: "IBM, Sony and Toshiba form chip R&D alliance"
Template 2: First organization + verb (join, work with…) + second organization. Example: "Red Hat joins top level IBM strategic alliance"
Template 3: Noun (collaboration, agreement) + conjunction (between, among) + organization list. Example: "The collaboration between IBM and Geisinger…"
Template 4: Noun (participants, partners) + include + organization list. Example: "Participants include IBM and GE Health"

3.4. Corpus level alliance ranking

Sentence-level extraction identifies relation words between two or more entities in the same sentence. However, there are still many false positive examples. We go beyond traditional relation extraction research and expand this to corpus-level alliance ranking. Given two entities with multiple mentions in a large corpus, the confidence level that these two entities form a true alliance increases if (1) the announcement appears in the first paragraph, the news title, or early sentences; (2) the source of the news article is authoritative; and (3) the announcement of the alliance is mentioned multiple times. We develop a multi-feature ranking approach to detect whether the relationship between two entities is a true formal alliance. We call this ranking-based approach Alliance Confidence Rank (ACRank). A ranking-based approach is chosen instead of a learning-based classification approach because alliances do not

sentence in a paragraph and the position of the sentence in the entire document. Besides these, we also include the publisher's authority as a document-level feature. (3) Features from the corpus level.
We look at the total number of times the same alliance is extracted, and aggregate confidence by summing the confidence level of each extracted instance with a degrading factor. Degrading factors are introduced because big companies appear very often in news articles. By aggregating features on all three levels, ACRank can represent the confidence level of a relationship being a true formal alliance. The notion is somewhat similar to document retrieval, but we replace "documents" with "alliances".

appear frequently in news articles. Our pilot study showed that a learning-based classification approach fails on extremely skewed datasets with very few positive examples. Also, a ranking approach allows researchers to browse the possible alliances later and make their own judgment. Even rumored alliances might be of interest because of the possibility that they become true alliances in the future. In this study, we propose the following features for inclusion in ACRank. These features can be expanded in future research. (1) Features from the sentence level. These include the ADT template used, the number of entities extracted, and the number of critical keywords appearing in the sentence. (2) Features from the document level. These include the position of the

Figure 4. User interface of the alliance knowledge portal

evidence is drawn from and whether these alliances have been validated by human experts (last column). By clicking on the number in each cell, snippets of the evidence are shown in the upper right panel, called "Evidence List". This panel displays all the evidence from news articles in the fashion of a search engine returning a ranked list. Titles, dates retrieved and snippets are displayed as well. By clicking on an item in the evidence list, a user can view the original evidence in the lower right "Evidence" panel. A network visualization tool is provided in the lower left "Graphical Network" panel.
In this example, each node represents a participating company and each link represents a strategic alliance. The size of a node indicates the size of a company, and the color indicates

3.5. Alliance knowledge portal

Automatically extracted alliances are loaded into a knowledge portal to further assist alliance researchers. To illustrate, we show an example of the user interface in Figure 4. The interface is composed of several panels. The upper left panel is the "Input" panel. The knowledge portal provides basic and advanced search functions, where a user can specify the organization of interest (in this case "IBM"), the type of alliances of interest in terms of sector and company size, and the time period. Other information will be incorporated in the future. The system pops up a list of alliances ranked by confidence level, called "Raw Data". It also indicates the data sources where the

the sector it belongs to. The thickness of a link indicates the confidence level (or the strength of evidence), and a dotted link indicates an informal alliance (not yet validated by experts). The interface is not a final product, but gives an idea of how researchers can trace each alliance and actually see the supporting evidence.

Table 5. Performance of the ADT method in comparison with benchmarks (Method: recall / precision / F-measure)
Co-occurrence: 68.7% / 4.4% / 0.075
Co-occurrence + verb: 65.2% / 7.3% / 0.121
Template 1: 55.8% / 51.1% / 0.528
Template 2: 47.6% / 34.6% / 0.398
Template 3: 11.4% / 57.1% / 0.186
Template 4: 30.2% / 51.2% / 0.372
ADT: 78.1% / 44.7% / 0.568

4. Evaluation

We conducted a case study with IBM Corp. for the year 2006. There are three goals of this case study: (1) to evaluate the effectiveness of our sentence-level ADT-based approach, (2) to evaluate the effectiveness of our corpus-level ACRank approach, and (3) to study the coverage of the Thomson database by comparing it with domain experts' extraction results.
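The ADT templates of Table 4 can be approximated as pattern checks over simplified dependency chunks. The sketch below covers Templates 1 and 2 only, using our own toy chunk representation and abridged trigger lists; these are illustrative assumptions, not the paper's implementation:

```python
# Abridged trigger lists per ADT template (Table 4); illustrative only.
T1_VERBS = {"form", "establish", "forge"}
T2_VERBS = {"join", "work with", "team with"}

def match_adt(chunks):
    """chunks: (role, head, text) triples from a simplified parse.
    Returns a candidate partner list, or None if no template fires."""
    subjects = [c for c in chunks if c[0] == "SUBJECT"]
    verbs    = [c for c in chunks if c[0] == "VERB"]
    comps    = [c for c in chunks if c[0] == "COMPLEMENT"]
    # Template 1: organization list + trigger verb
    if len(subjects) >= 2 and any(v[1] in T1_VERBS for v in verbs):
        return [s[2] for s in subjects]
    # Template 2: first org + trigger verb + second org
    if subjects and comps and any(v[1] in T2_VERBS for v in verbs):
        return [subjects[0][2]] + [c[2] for c in comps]
    return None

chunks = [("SUBJECT", "IBM", "IBM"), ("SUBJECT", "Sony", "Sony"),
          ("VERB", "form", "form chip R&D alliance")]
print(match_adt(chunks))  # Template 1 fires: ['IBM', 'Sony']
```

In the full system, candidate partners returned by these checks would be verified against the entity extractor's organization names before being ranked.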
The Thomson database is the most popular database recording all publicly announced alliance deals world-wide. We chose IBM because, as a multinational and technology-driven company, IBM has established numerous alliances with other organizations inside and outside the United States. We collected all publications from 2006 from Lexis-Nexis using a domain lexicon. A total of 4,261 documents were crawled. We selected a subset of 1,000 documents, containing a total of 63,019 sentences, for this study. Our domain expert manually read all 1,000 documents and extracted 63 alliances.

4.2. Evaluation of corpus-level ACRank

Our second experiment evaluated the performance of corpus-level ACRank. We compared corpus-level ACRank with sentence-level features alone, document-level features alone, and a popular machine learning algorithm, SVM. We adopted the classic information retrieval evaluation metrics of recall and precision for the top-N% ranked documents; here, we changed documents retrieved to alliances extracted. We present the recall and precision for the top-N% extracted alliances in Table 6, Figure 5 and Figure 6. We observed that recall increases consistently as more ranked alliances are included, until it reaches 100%. Fifty percent recall was reached when the top-25% of alliance instances were considered. Using sentence-level features alone, recall reached 50% when the top-40% of alliance instances were considered, and 45% for document-level features alone. On the other hand, SVM only achieved 32.7% recall; that is, 67.3% of alliances were missed during SVM classification using the same set of features. The same recall was achieved at the top-15% in ACRank. Precision reached its highest value of 100% with the top-10% of extracted alliances and then gradually decreased to 44.7% when all alliances were considered. With the top-20% of alliance instances, precision was maintained at 97%.
This shows that our ACRank is most effective in predicting the very top alliances and the two categories of features significantly boost each other. With the top-50% of alliance instances, precision was still maintained at 65.5%. Compared to information extraction tasks, our performance is satisfactory. In comparison, SVM achieved 59.3% precision which is similar to the precision of the top65% results from ACRank. When looking at errors generated, we found that there were three major causes of errors: unidentified or wrongly identified entity names, wrong dependency parse tree and instances that did not fit into any of the templates. This suggested that although achieving 78.1% in recall, our four templates still need to be expanded. 4.1. Evaluation of ADT-based extraction We compared our ADT-based extraction with two benchmark algorithms. Benchmark 1 was cooccurrence based approach and benchmark 2 was cooccurrence plus critical verb identified by experts. Cooccurrence approach assumes that if two organization names appear in the same sentence, they are likely to form an alliance given that the collection is an alliance relevant collection. Our evaluation metrics are recall, precision and F-measure which are standard in information extraction. Table 5 presents the performance of ADT method in comparison with the two benchmark algorithms. We also present the performance of each template extraction. ADT method achieved a 78.1% in recall, 44.7% in precision and 0.568 in F-measure. It outperformed two co-occurrence based benchmarks significantly in recall, precision and F-measure. Meanwhile, we observed that Template 1, 3 and 4 were more accurate in finding alliances and Template 1 alone covered 50% of alliance announcement. In general, sentence-based alliance extraction can identify most alliances but also extracts a significant amount of false alliances. 3578 Table 6. 
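The two co-occurrence benchmarks described in Section 4.1 can be sketched as follows, assuming organization names have already been tagged by an entity extractor. The verb list and function names are illustrative; the paper's expert-built critical-verb lexicon is not published:

```python
from itertools import combinations

# Hypothetical critical-verb list; a stand-in for the experts' lexicon.
ALLIANCE_VERBS = {"partner", "collaborate", "team", "ally", "cooperate"}

def cooccurrence_pairs(org_names):
    """Benchmark 1: any two organization names appearing in the same
    sentence form a candidate alliance."""
    return list(combinations(sorted(set(org_names)), 2))

def cooccurrence_with_verb(sentence, org_names):
    """Benchmark 2: co-occurrence plus a critical alliance verb in the
    same sentence."""
    words = [w.lower().strip(".,") for w in sentence.split()]
    has_verb = any(w.startswith(v) for w in words for v in ALLIANCE_VERBS)
    return cooccurrence_pairs(org_names) if has_verb else []
```

As Table 5 shows, plain co-occurrence yields high recall but very low precision, which is why the ADT templates add dependency-level constraints on top of entity co-occurrence.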
Table 6. Performance of ACRank with the top-N% alliances

                 Top-10%  Top-20%  Top-30%  Top-40%  Top-50%  Top-60%  Top-70%  Top-80%  Top-90%  Top-100%
Sentence-Level
  precision      72.7%    60.6%    57.6%    54.5%    54.5%    52.8%    50.9%    49.0%    46.3%    44.7%
  recall         16.3%    27.2%    38.8%    49.0%    61.2%    70.7%    79.6%    87.8%    93.2%    100.0%
  F-measure      0.267    0.376    0.463    0.516    0.577    0.605    0.621    0.629    0.619    0.618
Document-Level
  precision      66.7%    60.6%    53.5%    50.0%    47.3%    47.2%    46.1%    46.4%    46.6%    44.7%
  recall         15.0%    27.2%    36.1%    44.9%    53.1%    63.3%    72.1%    83.0%    93.9%    100.0%
  F-measure      0.244    0.376    0.431    0.473    0.500    0.541    0.562    0.595    0.623    0.618
ACRank
  precision      100.0%   97.0%    81.8%    69.7%    65.5%    60.4%    53.9%    50.6%    47.6%    44.7%
  recall         22.4%    43.5%    55.1%    62.6%    73.5%    81.0%    84.4%    90.5%    95.9%    100.0%
  F-measure      0.367    0.601    0.659    0.659    0.692    0.692    0.658    0.649    0.637    0.618
SVM (all instances): precision 59.3%, recall 32.7%, F-measure 0.421

Figure 5. Top-N% Recall of ACRank

Figure 6. Top-N% Precision of ACRank

4.3. Thomson SDC coverage

Our third experiment evaluates the coverage of Thomson SDC in comparison with alliances identified by our domain experts. There are a total of 14 IBM alliances reported in 2006 in the Thomson database. After examining the overlap with the Thomson database, we found that 58 of the 63 alliances that experts identified were not covered in the Thomson database (Table 7). This finding is surprising, yet it demonstrates that Thomson is indeed far from a comprehensive alliance database. From this table, the Thomson database only covered 19.4% of the total alliances an expert can identify. This number is an upper bound on Thomson's coverage because, as experts read more documents, more alliances may be identified. This is consistent with Schilling's conclusion [15] that none of the databases can be considered an accurate reflection of the entire population of strategic alliances, but only of subsets. He estimated that Thomson covers about 20% of total alliances.

Table 7. Coverage of the Thomson SDC

                      Thomson SDC   Experts   Automatic
Found in 1,000 docs   5             63        49
Found in 4,261 docs   12            NA        NA
Total                 14            NA        NA
Coverage              19.4%         87.5%     68.1%

5. Conclusions and future directions

With the increasing interest in strategic alliances, most researchers rely on manually built databases to perform analysis and draw conclusions. Among these databases, Thomson SDC is the most popular; it records publicly announced alliance deals worldwide, tracked down in Securities and Exchange Commission filings in the United States, newswires, press, trade magazines, professional journals and the like. However, as documents are manually screened, input is limited to those sources that can realistically be read by a set number of staff during a set time. Manually reading documents to identify alliances is extremely time-consuming. While the accuracy might be high, the coverage is often low, meaning that many valid alliances can be missed.

In this research, we design, develop and evaluate a text mining model to extract alliance knowledge. We propose the ADT method, a template-based relation extraction method that utilizes entity extraction, POS tagging and dependency parse trees. Moreover, we aggregate information from single sentences and documents to generate ACRank, a corpus-based multi-feature ranking algorithm. An alliance knowledge portal is proposed to support alliance researchers in searching, browsing and visualizing alliance extraction results. This portal could provide researchers with evidence of alliance announcements and assist in answering strategy and policy questions. Evaluation results show that our automatic approach can extract over 78% of total alliances with over 44% precision. In evaluating the Thomson SDC database, we concluded that only 19.4% of strategic alliances were covered in SDC.
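The top-N% figures reported for ACRank follow the standard ranked-retrieval definitions: precision and recall are computed over the highest-confidence fraction of the extracted instances. A minimal sketch (names are ours, not the paper's):

```python
def precision_recall_at(ranked_labels, fraction):
    """Precision and recall over the top `fraction` of a confidence-ranked
    list. `ranked_labels` holds booleans (True = verified alliance),
    ordered from highest to lowest confidence."""
    k = max(1, round(len(ranked_labels) * fraction))
    tp = sum(ranked_labels[:k])          # true alliances in the top-k
    total_true = sum(ranked_labels)      # all true alliances in the corpus
    precision = tp / k
    recall = tp / total_true if total_true else 0.0
    return precision, recall
```

Recall is non-decreasing in the fraction considered, which matches the monotone recall curves described for Figure 5.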
This research will not only encourage new research and discovery in economics and public policy, but will also advance techniques in text and Web mining research. By bringing text mining and knowledge discovery techniques into the field of economics and public policy, the research will foster awareness of cross-disciplinary research and enrich collaboration between the social science and computer science paradigms. In the future, we plan to expand our case study by including longer time periods and adding more companies from different industries. We also plan to study additional features in our ACRank algorithm and to add other templates to the relation extraction component.

6. References

[1] D. E. Appelt, et al., "FASTUS: A finite-state processor for information extraction from real-world text," 1993, pp. 1172-1172.
[2] M. Banko and O. Etzioni, "The tradeoffs between open and traditional relation extraction," Proceedings of ACL-08: HLT, pp. 28-36, 2008.
[3] R. C. Bunescu and R. J. Mooney, "A shortest path dependency kernel for relation extraction," 2005, pp. 724-731.
[4] A. Culotta, et al., "Integrating probabilistic extraction models and data mining to discover relations and patterns in text," 2006, pp. 296-303.
[5] A. Hotho, et al., "A brief survey of text mining," 2005, pp. 19-62.
[6] N. Kambhatla, "Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations," 2004, pp. 22-es.
[7] M. Krallinger and A. Valencia, "Text-mining and information-retrieval services for molecular biology," Genome Biology, vol. 6, p. 224, 2005.
[8] Y. Liu, et al., "Exploiting rich syntactic information for relation extraction from biomedical articles," 2007, pp. 97-100.
[9] H. Lodhi, et al., "Text classification using string kernels," The Journal of Machine Learning Research, vol. 2, pp. 419-444, 2002.
[10] A. Mansouri, et al., "Named entity recognition approaches," IJCSNS, vol. 8, p. 339, 2008.
[11] W. Meng, et al., "Building efficient and effective metasearch engines," ACM Computing Surveys (CSUR), vol. 34, pp. 48-89, 2002.
[12] S. Miller, et al., "A novel use of statistical parsing to extract information from text," 2000, pp. 226-233.
[13] D. Nadeau, "Semi-supervised named entity recognition: learning to recognize 100 entity types with little supervision," 2007.
[14] B. Pang and L. Lee, "Opinion mining and sentiment analysis," Foundations and Trends in Information Retrieval, vol. 2, pp. 1-135, 2008.
[15] M. A. Schilling, "Understanding the alliance data," Strategic Management Journal, vol. 30, pp. 233-260, 2009.
[16] R. Srihari, et al., "A hybrid approach for named entity and sub-type tagging," 2000, pp. 247-254.
[17] Stanford. Named Entity Recognition (NER) and Information Extraction (IE). Available: http://nlp.stanford.edu/ner/index.shtml
[18] Y. H. Tseng, et al., "Text mining techniques for patent analysis," Information Processing & Management, vol. 43, pp. 1216-1247, 2007.
[19] I. H. Witten, et al., "Text mining in a digital library," International Journal on Digital Libraries, vol. 4, pp. 56-59, 2004.
[20] Y. C. Wu, et al., "Extracting named entities using support vector machines," Knowledge Discovery in Life Science Literature, pp. 91-103, 2006.