Text Analytics with
Ambiverse
Text to Knowledge
www.ambiverse.com
Version 1.2, November 2016
Contents
1 Ambiverse: Text to Knowledge
1.1 Text is all Around
1.2 Ambiverse: Leading research to industry
1.3 Text to Knowledge
2 Named Entity Disambiguation
2.1 What is it?
2.2 Why is it Important?
2.3 Why is it Challenging?
2.4 Ambiverse Gives Meaning to Text
2.5 Ambiverse & YAGO, a Powerful Combination
2.6 Integrating Domain-specific Knowledge
2.7 Ambiverse Text Analytics in Facts
3 Applications
3.1 Ambiverse Search
3.2 Ambiverse Analyze
3.3 Ambiverse Write
3.4 Personalized Text Analytics
1. Ambiverse: Text to Knowledge
1.1 Text is all Around
Most of the information produced by people, organizations, and public institutions
is in the form of text. In 2014, 300 million new websites were created [1]. Every year,
2 million blog posts are written [2], thousands of news sites around the globe publish
articles, and millions of new updates in social networks are generated. In fact, most
of human interaction is performed via unstructured data (e.g., articles, reports, social
network posts, ads, comments, reviews, etc.). Companies and public institutions also
regularly produce large quantities of internal documents.
This vast amount of text goes beyond what is commonly understood as “big data”.
Textual information is not easy to interpret because it lacks a well-defined structure.
To make use of it, the machine must be given certain “text understanding” capabilities, so that these huge collections of documents can be computationally
analyzed and transformed into useful data. It is increasingly understood that text
analytics gives companies, individuals, and public institutions considerable leverage.
The text analytics market is expected to grow at an average rate of 25% per year [3]. In
2013, only 1% of companies were processing their textual information; by 2021, 65%
are expected to do so (Figure 1.1) [4]. In domains such as news, advertising, finance, and insurance,
among others, companies are starting to make sense of their textual data as a means of adding
value to their businesses.
[1] http://www.internetlivestats.com/total-number-of-websites/
[2] http://www.digitalbuzzblog.com/wp-content/uploads/2012/03/A-Day-In-The-Internet.jpg
[3] http://www.digitalreasoning.com/resources/Text-Analytics-2014-Digital-Reasoning.pdf
[4] http://www.federalnewsradio.com/wp-content/uploads/pdfs/031115_gartner_co_branded_newsletter_turning_dark_data_into_smart_data.pdf
[Figure: bar chart of the % of companies using text analytics in 2013, 2016, and 2021; labeled values include 1 and 65]
Figure 1.1: The use of text analytics will increase dramatically in the coming years
1.2 Ambiverse: Leading research to industry
Ambiverse, a spin-off of the Max Planck Institute for Informatics, joins the new world of
text analytics. Ambiverse develops technology to automatically understand, analyze,
and manage big collections of textual data. Ambiverse is built on years of state-of-the-art
research in text analytics. In 2015, Ambiverse received an EXIST Transfer of Research
grant from the German Federal Ministry for Economic Affairs and the European Union.
1.3 Text to Knowledge
Our technology is focused on the recognition and disambiguation of named entities
in text. It relies on years of scientific work at the Max Planck
Institute for Informatics, a world-leading institution in automatic text understanding.
Our technology was named the best named entity
disambiguation system by IBM [5], and our corresponding scientific publications are among
the most cited in the international automatic text understanding community [6][7].
This cutting-edge technology gives Ambiverse an advantage in the text analytics world,
allowing the development of a new generation of text analytics tools to transform textual
information into machine-understandable knowledge.
[5] D. A. Ferrucci (2012). Introduction to ‘This is Watson’. IBM Journal of Research and Development.
[6] J. Hoffart et al. (2011). Robust Disambiguation of Named Entities in Text. In Proceedings of the
Conference on Empirical Methods in Natural Language Processing (EMNLP).
[7] J. Hoffart et al. (2013). YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia.
Artificial Intelligence.
2. Named Entity Disambiguation
2.1 What is it?
A named entity, or simply entity, is a real-world object such as a person, an organization,
a location or a product. Named entity disambiguation is the task of automatically
recognizing the names of these objects in text and identifying their real-world reference.
For instance, in the sentence “Page played the hit Kashmir on his uniquely tuned
Les Paul” our disambiguation system recognizes that the mention “Page” refers to the
famous rock guitarist Jimmy Page and not to Larry Page, founder of Google, and that
“Les Paul” refers to the guitar and not its designer (see Figure 2.1).
Figure 2.1: Selecting the correct entity for each mention: Jimmy Page, the song Kashmir
and a Les Paul guitar
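The idea of choosing one entity per mention can be illustrated with a deliberately naive sketch. It is not the Ambiverse method: the candidate lists, entity identifiers, and context keywords below are hypothetical, and the scoring simply counts how many context keywords of a candidate appear in the sentence.

```python
import re

# Hypothetical toy data: candidate entities per mention, each with context keywords.
CANDIDATES = {
    "Page": {
        "Jimmy_Page": {"guitarist", "rock", "kashmir", "played", "hit"},
        "Larry_Page": {"google", "founder", "search", "computer"},
    },
    "Les Paul": {
        "Gibson_Les_Paul": {"guitar", "tuned", "played", "instrument"},
        "Les_Paul_(musician)": {"designer", "inventor", "born", "musician"},
    },
}

def disambiguate(sentence, candidates=CANDIDATES):
    """Pick, for each known mention in the sentence, the candidate entity
    whose keywords overlap most with the words of the sentence."""
    lowered = sentence.lower()
    words = set(re.findall(r"[a-z]+", lowered))
    result = {}
    for mention, options in candidates.items():
        if mention.lower() in lowered:
            # Ties are resolved in favor of the first candidate listed.
            result[mention] = max(options, key=lambda e: len(options[e] & words))
    return result

print(disambiguate("Page played the hit Kashmir on his uniquely tuned Les Paul"))
```

A real system would also need to detect the mentions themselves and weigh the candidates against each other, which this sketch leaves out.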
2.2 Why is it Important?
Ambiguous entities are all around us. The variety of names is much smaller than one
may think; there are more entities than names. Places are named after people, and
people after people. Places also tend to share similar names, as do people
and products. In this context, knowing the real-world object behind a reference produces
significant gains in text understanding capabilities.
If we want to select or analyze documents mentioning the city of Paris in France, we first
have to make sure that the mentions of “Paris” refer to the entity we are interested in
and not, for instance, to the city of Paris in Texas. If we want to efficiently search for
information about Larry Page, we have to make sure to exclude documents about Jimmy
Page, another famous “Page”. Moreover, if companies want to analyze customer
opinions about cars, they need to understand whether a tweet refers to the Jeep Wrangler
or to Wrangler jeans (“I bought a Wrangler, and it is very comfortable”, “I sell my
brand new Wranglers”, Figure 2.2).
Knowing the correct meaning of a name makes it possible to analyze and search
large text collections more efficiently. Ambiverse developed a state-of-the-art technology to disambiguate entities and a set of applications around it for smart text analytics.
Image from flickr (zombieite) - CC-BY 2.0
Figure 2.2: Ambiverse Text Analytics helps to identify the real enthusiastic fans.
2.3 Why is it Challenging?
Named entity mentions can be very ambiguous. The name “Page” alone can refer to
hundreds of entities; for more ambiguous names like “John”, the potential candidates
likely number in the thousands.
A machine needs to resolve the meanings of all names in a single text while ensuring coherence among the entities (e.g., it is reasonable that “Paris” and “France” are simultaneously assigned to the French capital and the European country). Naive approaches that
simply enumerate all possible combinations quickly come up against a brick
wall. Even for a single sentence with three or four moderately ambiguous names, the
number of combinations exceeds 100,000. For full documents, enumeration becomes infeasible even for the
fastest machines. Solving such a problem requires smart technologies such as the one we
provide in Ambiverse Text Analytics.
Page played the hit Kashmir on his uniquely tuned Les Paul.
500 x 50 x 5 = 125,000 possible candidate combinations
Figure 2.3: There are 500 possible “Pages”, 50 possible “Kashmirs”, and 5 possible “Les
Pauls”, leading to 125,000 possible entity combinations.
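The arithmetic behind this figure can be checked with a few lines of code. The sketch below multiplies the candidate counts from the example; the 20-mention document with 10 candidates per mention is a hypothetical case, added only to show how quickly brute-force enumeration grows.

```python
from math import prod

def count_combinations(candidates_per_mention):
    """Number of joint assignments when every mention picks one candidate."""
    return prod(candidates_per_mention)

# Candidate counts from the example sentence.
sentence = {"Page": 500, "Kashmir": 50, "Les Paul": 5}
print(count_combinations(sentence.values()))  # 125000

# Hypothetical document: 20 mentions with 10 candidates each.
print(count_combinations([10] * 20))  # 100000000000000000000
```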
2.4 Ambiverse Gives Meaning to Text
Ambiverse Text Analytics opens up a wide range of possibilities to manage and understand big text collections. Its main characteristic is the capability to understand the
meaning of the objects, detaching them from their textual representations. For instance,
in the sentences “Page played Kashmir.”, “Jimmy rocked the show at Knebworth!” and
“James Patrick Page is one of the greatest guitarists of all time.”, Ambiverse Text Analytics understands that “Jimmy”, “Page”, and “James Patrick Page” all refer to the same
person (Figure 2.4). It understands real world concepts in text regardless of how they
are actually mentioned. This allows Ambiverse to develop a set of applications around
the named entity disambiguation technology, changing the way in which text is stored,
searched, analyzed and produced.
James Patrick Page is one of the greatest guitarists of all time.
Page played Kashmir.
Jimmy rocked the show at Knebworth!
Figure 2.4: Ambiverse Text Analytics understands that all sentences refer to the same
Jimmy Page.
2.5 Ambiverse & YAGO, a Powerful Combination
All entities like Jimmy Page, Larry Page, Les Paul (person) and his self-named guitar
are present in our YAGO knowledge graph [Hof+13]. YAGO, which is derived from
Wikipedia, can be thought of as a very large collection of entities.
YAGO also contains accurate characterizations of all entities. It knows that Larry Page
is a computer scientist, a corporate director, and a billionaire, that Google is a U.S.
company, and that Jimmy Page is a guitarist and a musician. These characteristics of the
entities are called categories or classes and are the key to developing useful applications
around named entity disambiguation technology. An example of YAGO is shown in
Figure 2.5.
[Figure: graph diagram with a Classes layer and an Entities layer; class labels: artifact, song, guitar, musician; edge labels: subclass, type, created, plays, played at, was played at, happened in; literal: 1975]
Figure 2.5: Example of the knowledge stored in YAGO: The entities, their classes, and
the relations between them.
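A knowledge graph of this kind can be pictured as a set of (subject, relation, object) triples. The sketch below is an illustration only: the identifiers, the relation names, and the assumption that song and guitar are subclasses of artifact are hypothetical and do not reflect YAGO's actual schema or data format.

```python
# Hypothetical triples loosely modeled on the labels in the figure.
TRIPLES = [
    ("Jimmy_Page", "type", "musician"),
    ("Kashmir", "type", "song"),
    ("Gibson_Les_Paul", "type", "guitar"),
    ("song", "subclassOf", "artifact"),
    ("guitar", "subclassOf", "artifact"),
    ("Jimmy_Page", "plays", "Gibson_Les_Paul"),
]

def classes_of(entity, triples=TRIPLES):
    """Return all classes of an entity, following subclassOf edges upward."""
    found = set()
    frontier = [o for s, p, o in triples if s == entity and p == "type"]
    while frontier:
        cls = frontier.pop()
        if cls not in found:
            found.add(cls)
            # Add the superclasses of the class just visited.
            frontier.extend(o for s, p, o in triples
                            if s == cls and p == "subclassOf")
    return found

print(sorted(classes_of("Kashmir")))     # ['artifact', 'song']
print(sorted(classes_of("Jimmy_Page")))  # ['musician']
```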
2.6 Integrating Domain-specific Knowledge
The flexible architecture of Ambiverse Text Analytics allows the use of additional domain-specific entities. Other knowledge graphs (e.g., a company-specific knowledge graph
or a product catalog) can be easily integrated into our system, or a user can
concentrate on a specific slice of YAGO. This enables companies to focus on the entities
of importance to them, like their products or customers. Ambiverse Text Analytics can thus be
fully customized to the specific needs of our customers.
2.7 Ambiverse Text Analytics in Facts
2.7.1 Performance
The following numbers correspond to average-length news articles processed on a
compute instance with 16 CPU cores and 32 GB of memory.
• Documents per hour with high accuracy: 20,000
• Documents per hour with highest accuracy: 6,000
The exact accuracy depends on the nature of the documents. An experimental evaluation on a large set of newswire documents [Hof+11] showed 80% accuracy for the high
accuracy setting and 83% accuracy for the highest accuracy setting.
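To put these throughput figures into perspective, the short sketch below converts them into time per document and into the time needed for a collection. The collection size of one million documents is a hypothetical example, and the calculation assumes the quoted rates hold steadily on the instance described above.

```python
SECONDS_PER_HOUR = 3600

# Throughput figures quoted above (documents per hour on the 16-core instance).
THROUGHPUT = {"high accuracy": 20_000, "highest accuracy": 6_000}

def seconds_per_document(docs_per_hour):
    """Average wall-clock time per document at the given throughput."""
    return SECONDS_PER_HOUR / docs_per_hour

def hours_for_collection(num_documents, docs_per_hour):
    """Hours needed to process a collection at the given throughput."""
    return num_documents / docs_per_hour

for setting, rate in THROUGHPUT.items():
    print(f"{setting}: {seconds_per_document(rate):.2f} s per document, "
          f"{hours_for_collection(1_000_000, rate):.1f} h for 1,000,000 documents")
```

Under these assumptions, the high accuracy setting takes about 0.18 seconds per document and the highest accuracy setting about 0.6 seconds.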
2.7.2 Languages
We currently support English, Spanish, Chinese, and German.
2.7.3 Knowledge Graph
A brief comparison of the size of YAGO and other prominent openly available knowledge
graphs shows that YAGO is among the most comprehensive and precise ones. YAGO’s
distinct advantages are the clear semantic modelling of entities and especially the
specific class hierarchy, ranging from very general categories like “person” to highly
specific ones like “British rhythm and blues boom musicians”. Also, YAGO is the only
knowledge graph that has been evaluated in terms of accuracy [Hof+13].
                                 Entities        Classes         Accuracy
English YAGO3                    3.5 million     550 thousand    > 95%
Combined YAGO3 (10 languages)    4.6 million     570 thousand    > 95%
English DBpedia                  4.8 million     735             not evaluated
Combined DBpedia                 38.3 million    735             not evaluated
Table 2.1: Facts about the YAGO knowledge graph
More details about YAGO are available at:
http://www.yago-knowledge.org
3. Applications
Ambiverse’s cutting edge text analysis technology allows the development of a whole
range of next-generation applications to manage, search, analyze and produce text.
3.1 Ambiverse Search
3.1.1 Searching for Entities
Traditional search engines take words or phrases as input and return a set of documents
in which these words or phrases appear to be most relevant. They have limited understanding
of the user intent in the sense that they do not give meaning to the input words. They
only understand their form. For instance, they cannot understand if the input word
“Paris” refers to the city in France, to Paris Hilton, or to the mythological Greek character.
Searching for “Paris” in a regular search engine will return documents where the word
“Paris” appears without distinguishing which Paris it is. Probably documents referring
to the city of Paris in France will be ranked at the top since it is the most popular entity.
Users searching for less common “Paris” references have to refine their input (e.g., “Paris
Greece Troy”), which forces them to express their intention by incorporating (sometimes
unavailable) extra knowledge into the input.
However, if the documents are first processed via Ambiverse Text Analytics (meaning
that all entities in all documents have been previously identified), the user can search for
the entities themselves independently of how they are mentioned in the text, and without
any additional background knowledge. The user intent is fully described in the input
entity itself. For instance, the user can directly search for Paris Hilton and no matter
how she is referred to (e.g. “Paris”, “Paris Hilton”, “Hilton’s granddaughter”, etc.), all
documents in which she is mentioned will be retrieved (and properly ranked). All other
documents where other “Paris” occurrences appear (Paris, France; the Greek character;
Paris, Texas) will be excluded. This type of ambiguity is more common than one may
think, and it makes keyword search results highly imprecise.
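The difference between the two kinds of search can be sketched with a small inverted index keyed by entity instead of by word. The documents, mentions, and entity identifiers below are hypothetical, and the keyword search only looks at the annotated mentions to keep the example short; this is an illustration of the principle, not the Ambiverse Search implementation.

```python
from collections import defaultdict

# Hypothetical pre-annotated documents: (mention text, entity identifier) pairs.
DOCS = {
    "d1": [("Paris", "Paris_Hilton")],
    "d2": [("Paris", "Paris_(France)"), ("France", "France")],
    "d3": [("Hilton's granddaughter", "Paris_Hilton")],
    "d4": [("Paris", "Paris_(Texas)")],
}

def build_entity_index(docs):
    """Map each entity identifier to the set of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, annotations in docs.items():
        for _mention, entity in annotations:
            index[entity].add(doc_id)
    return index

def entity_search(index, entity):
    """Documents about the entity, regardless of how it is mentioned."""
    return sorted(index.get(entity, set()))

def keyword_search(docs, word):
    """Documents in which some mention contains the word, whatever it refers to."""
    return sorted(doc_id for doc_id, annotations in docs.items()
                  if any(word.lower() in mention.lower()
                         for mention, _entity in annotations))

index = build_entity_index(DOCS)
print(entity_search(index, "Paris_Hilton"))  # ['d1', 'd3']
print(keyword_search(DOCS, "Paris"))         # ['d1', 'd2', 'd4']
```

The entity search finds d3 even though the word “Paris” does not occur in its mention, and it leaves out d2 and d4, which are about other entities named Paris.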
Ambiverse Search gives the user the capability to search for meaning or concepts on
huge text collections, reaching more precise results by better interpreting the user’s
intent, abstracting meaning from textual forms. Out of the box, we provide search for
4.6 million entities, to which, in addition, customer-specific entities can easily be added
(see Section 2.6). Figures 3.1 and 3.2 provide an example of regular and smart search.
Figure 3.1: Searching for the word “Prada” is imprecise due to its ambiguity.
Figure 3.2: Searching for the company Prada gives precise results: Ambiguities have
been resolved.
Contact us for a demonstration of the prototype.
3.1.2 Searching for Categories: the Power of the YAGO Knowledge Graph
As mentioned before, YAGO contains information about categories for each entity. This
allows us to incorporate a new abstraction layer to our search, something impossible
in traditional search engines. Instead of searching for a given entity, we can directly
search for a category so that a set of entities is grouped in our search.
For instance, we can directly search for fashion labels, and all the documents mentioning
a fashion label (e.g., Prada, Gucci, Chanel, etc.) will be retrieved. We can also search
for documents containing German soccer players (e.g., Schweinsteiger, Thomas Müller,
Mesut Özil, etc.), Harvard alumni (e.g., Barack Obama, Ban Ki-Moon, Natalie Portman,
Robert Solow, etc.), or any other category available in our knowledge graph. The secret
here is that Ambiverse Text Analytics is capable of identifying the entities in the text
and our knowledge graph knows the categories of those entities. Our knowledge graph
contains more than 570k categories.
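Category search combines two pieces of information: which entities occur in each document, and which categories each entity belongs to. The following sketch shows that combination on hypothetical data; the document identifiers, entity identifiers, and category assignments are made up for illustration.

```python
# Hypothetical category assignments for a few entities.
ENTITY_CLASSES = {
    "Prada": {"fashion label", "company"},
    "Gucci": {"fashion label", "company"},
    "Thomas_Müller": {"German soccer player"},
}

# Hypothetical result of disambiguation: entities found in each document.
DOC_ENTITIES = {
    "d1": {"Prada"},
    "d2": {"Gucci", "Thomas_Müller"},
    "d3": {"Thomas_Müller"},
}

def search_by_category(category, doc_entities=DOC_ENTITIES,
                       entity_classes=ENTITY_CLASSES):
    """Return the documents that mention at least one entity of the category."""
    return sorted(
        doc_id for doc_id, entities in doc_entities.items()
        if any(category in entity_classes.get(entity, set())
               for entity in entities)
    )

print(search_by_category("fashion label"))         # ['d1', 'd2']
print(search_by_category("German soccer player"))  # ['d2', 'd3']
```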
Figure 3.3: Searching for the category high fashion brands finds documents on all
fashion labels.
3.2 Ambiverse Analyze
Understanding entities in text allows a whole new range of text analytics tools. For
instance, one can visualize the correlation over time between two companies or even
the correlation between a company and its sector. Ambiverse Analyze helps you
understand how mentions of the fashion label Prada correlate to mentions of all fashion
labels (Figure 3.4).
Contact us for a demonstration of the prototype.
Figure 3.4: Ambiverse Analyze plots the trends of Prada against all other fashion labels.
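One simple way to quantify such a relationship is the Pearson correlation between two series of mention counts. The sketch below is an assumption about how a correlation of this kind could be computed, not a description of Ambiverse Analyze, and the weekly counts are invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Hypothetical weekly mention counts after entity disambiguation.
prada_mentions = [3, 5, 2, 8, 6]
all_fashion_label_mentions = [30, 42, 25, 70, 55]

print(round(pearson(prada_mentions, all_fashion_label_mentions), 3))
```

A value close to 1 would indicate that mentions of the single label rise and fall together with mentions of the category as a whole.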
3.3 Ambiverse Write
Understanding entities is also a key element in the production of intelligent texts. We
developed Ambiverse Write, a smart authoring platform for intelligent text production:
While typing, entities are automatically recognized, relevant entities are suggested and
background information is provided to the author on the fly. An author writing about
fashion topics will get suggestions about fashion brands or designers, and background
information about them directly while typing.
Figure 3.5: Ambiverse Write allows authors to write texts and link entities at the same
time.
Once the writing process has been completed, the text is ready for smart publishing:
it gets annotated with the correct entities and can be immediately integrated into
Ambiverse Search and Analyze. This integration also enables Ambiverse to continuously
improve the quality of its technology by incorporating user-specific annotations.
In the example shown in Figure 3.5, authors can get a deeper understanding of
the entities they are writing about without ever leaving the editor. Additionally, the links
improve the reading experience, adding value to the article and encouraging readers to
stay longer and use the article as a prominent reference.
Contact us for a demonstration of the prototype.
3.4 Personalized Text Analytics
Companies or even individual users usually have their own knowledge graph or want to
add their own customization to YAGO (e.g., they may be interested in only a part of it
or want to modify some entities or categories). We developed a framework that allows users
to add their own entities to their specific knowledge graph making our disambiguation
technology fully customizable to each particular user and/or organization. Ambiverse
Text Analytics will then focus on entities of interest for the user or adapt to the setting
that the user considers most appropriate.
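Conceptually, such a customization restricts the base knowledge graph to a slice of interest and then adds the user's own entities. The sketch below illustrates this with plain dictionaries; the function name `customize`, its parameters, and all entities shown are hypothetical and are not part of the Ambiverse framework.

```python
def customize(base_graph, keep_classes=None, custom_entities=None):
    """Build a customized entity-to-classes mapping.

    base_graph: dict mapping entity identifiers to sets of classes.
    keep_classes: if given, keep only base entities with one of these classes.
    custom_entities: user-supplied entities (and classes) to add or extend.
    """
    if keep_classes is None:
        graph = {e: set(c) for e, c in base_graph.items()}
    else:
        graph = {e: set(c) for e, c in base_graph.items()
                 if set(c) & set(keep_classes)}
    for entity, classes in (custom_entities or {}).items():
        graph.setdefault(entity, set()).update(classes)
    return graph

# Hypothetical base graph and company-specific product entity.
base = {
    "Prada": {"fashion label", "company"},
    "Google": {"company"},
    "Jimmy_Page": {"musician"},
}
custom = customize(base, keep_classes={"company"},
                   custom_entities={"ACME_Model_X": {"product"}})
print(sorted(custom))  # ['ACME_Model_X', 'Google', 'Prada']
```

The base graph is copied rather than modified, so several customized views can be derived from the same underlying data.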
The tool for augmenting an existing knowledge graph is intuitive and
simple to use. Users have several ways to generate their customized
knowledge graph without specific knowledge of our technology.
Contact us for a demonstration of the prototype.
References
[Fer12] David A. Ferrucci. “Introduction to ‘This is Watson’”. In: IBM Journal of Research and Development 56.3.4 (2012), pages 1–15.
[Hof+11] Johannes Hoffart et al. “Robust Disambiguation of Named Entities in Text”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2011, pages 782–792.
[Hof+13] Johannes Hoffart et al. “YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia”. In: Artificial Intelligence 194 (2013), pages 28–61.
Ambiverse GmbH
Campus E1 4
66123 Saarbrücken
Germany
Phone: +49 681 9325-5024
Fax: +49 681 9325-5099
E-Mail: [email protected]