Teragram® Linguistic Suites

SOLUTION OVERVIEW
Teragram® Linguistic Suites – European and Middle Eastern Languages
Experience reliable, highly profitable linguistic technologies for European
and Middle Eastern languages
Overview
You need a rich suite of tools for
discovering and extracting knowledge
from text documents, including a
comprehensive text mining solution that
integrates text-based information with
structured data and predictive analytics
for better answers to complex questions.
Teragram, a division of SAS, provides
fast, automated and highly scalable
linguistic technologies that address
the challenges associated with extracting text from document sources and
reducing it to meaningful elements
(known as tokenization).
Language-processing functions identify
the linguistic information needed to
retrieve meaning from input text, so
you can act on hidden insights.
Teragram’s advanced linguistic technologies are available for all major
European and Middle Eastern languages
with annotated dictionaries that contain
several hundred million words for:
• Arabic
• Czech
• Danish
• Dutch
• English
• Finnish
• French
• Farsi
• German
• Greek
• Hebrew
• Hindi
• Hungarian
• Italian
• Norwegian
• Polish
• Portuguese
• Romanian
• Russian
• Slovak
• Spanish
• Swedish
• Turkish
• Ukrainian
■ Challenges
• Accurate tokenization at the
word and sentence level.
Complexities in words,
expressions and numerical
representations, as well as in
punctuation, can lead to
inaccurate segmentation.
• Compound words.
Decomposing compound
words in Germanic languages
is complex, creating increased
processing time, poor stem
recognition, loss of contextual
meaning and improper handling
of lexicalized compound words.
• Stemming complexity. The
large number of inflected terms
in Dutch can result in a loss of
relationships between terms,
leading to inaccurate or
ambiguous results.
• Limited or closed dictionaries
and inflexible interfaces.
Linguistic solutions that are
inflexible place unnecessary
limitations on solutions,
particularly for extensibility
and scalability.
Automate text processing
at word and sentence levels
■ Key Benefits
• Punctuation recognition.
Teragram tokenization can
identify subtle and complex
punctuation across our
extensive language dictionaries.
• Efficient word stemming.
Our solution gives you access
to a combination of high-speed
dictionary-based and
heuristic-based algorithms and
rich, easily extensible
dictionaries.
• Accurate part-of-speech
tagging. Count on more than
95 percent accuracy for major
languages at some of the
fastest speeds in the industry.
• Language-specific custom
APIs. Configure your tokenizer
to specify how characters and
punctuation are handled.
• Seamless linguistic
processing. Our solution easily
integrates with your document
preprocessing and content
management software.
Capabilities
Teragram licenses linguistic technologies as OEM libraries so organizations
can embed linguistic functions into
commercial offerings. These highly
specialized linguistics capabilities can
enhance your existing offerings and
provide superior meaning extraction to
new offerings.
Advanced text processing
Our technology provides you with advanced text processing, including case
and accent normalization, to effectively
handle variations and normalize characters to a standard form.
Case and accent normalization are important because sentence meaning can
change due to incorrect usage of upper
or lower case or accent marks. With
these prebuilt functions, the product
development cycle is reduced, allowing
for rapid deployment.
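As a rough illustration of what case and accent normalization involves, here is a minimal Python sketch that uses only the standard library. It is not the Teragram API; the function name `normalize_token` and its parameters are assumptions made for this example.

```python
import unicodedata


def normalize_token(token: str, fold_case: bool = True,
                    strip_accents: bool = True) -> str:
    """Reduce a token to a standard form for matching (illustrative only)."""
    if fold_case:
        # casefold() is a more aggressive lower(); German "ß" becomes "ss".
        token = token.casefold()
    if strip_accents:
        # NFD splits an accented letter into base letter + combining mark,
        # so the combining marks can be filtered out.
        decomposed = unicodedata.normalize("NFD", token)
        token = "".join(ch for ch in decomposed
                        if not unicodedata.combining(ch))
    # Recompose what remains so that equal strings compare equal.
    return unicodedata.normalize("NFC", token)


print(normalize_token("R\u00c9SUM\u00c9"))                       # case and accents removed
print(normalize_token("R\u00e9sum\u00e9", strip_accents=False))  # accents kept
```

Both steps are switchable in this sketch because case and accents can carry meaning: stripping the tilde from Spanish "año" (year) yields a different word, so accent stripping tends to suit search matching better than analysis.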
Customizable language dictionaries
and programming interfaces
Teragram’s language dictionaries are
customizable for the specific way words
need to be handled in European and
Middle Eastern languages. The software handles exceptions, including text
containing numbers, dates, URLs and
abbreviations. You can configure
tokenization and other modules for the chosen language.
Linguistic modules can be customized
for a wide variety of standard platforms.
You will have access to readily available configuration and maintenance
support.
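To picture the exception handling mentioned above (numbers, dates, URLs and abbreviations), consider the following minimal Python sketch. It is an assumption-laden illustration, not Teragram's tokenizer: the abbreviation list, the `PROTECTED` patterns and the `tokenize` function are all hypothetical, and only trailing punctuation is separated.

```python
import re

# Hypothetical, user-extensible abbreviation list (not a Teragram dictionary).
DEFAULT_ABBREVIATIONS = frozenset({"Dr.", "Mr.", "Inc.", "e.g.", "i.e."})

# Chunks that must stay whole: URLs, slash-separated dates, grouped numbers.
PROTECTED = re.compile(
    r"https?://\S*[\w/]"          # URL ending in a word character or slash
    r"|\d{1,2}/\d{1,2}/\d{2,4}"   # date such as 12/31/2013
    r"|\d+(?:[.,]\d+)*"           # number such as 500,000 or 3.14
)
TRAILING = ".,;:!?)"


def tokenize(text, abbreviations=DEFAULT_ABBREVIATIONS):
    """Split on whitespace, then peel trailing punctuation off each chunk
    unless the chunk is a listed abbreviation or a protected pattern."""
    tokens = []
    for chunk in text.split():
        peeled = []
        while (chunk and chunk not in abbreviations
               and not PROTECTED.fullmatch(chunk)
               and chunk[-1] in TRAILING):
            peeled.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        tokens.extend(reversed(peeled))  # restore original punctuation order
    return tokens


print(tokenize("Dr. Smith paid 500,000 on 12/31/2013, see http://sas.com."))
```

Passing a different `abbreviations` set is the sketch's stand-in for a customizable dictionary: adding "U.S." to the set, for example, would keep that token's final period attached.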
Enables increased performance and
rapid expansion
As highly scalable technologies, our
linguistic suites are optimized to perform linear scans of text for accurate
and timely processing with the ability
to analyze nearly 500,000 words per
second from compressed forms. Our
parts-of-speech tagger’s performance
is highly accurate and can be verified
with manual parts-of-speech tagging.
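Verifying a tagger against manual tagging usually comes down to comparing its output, token by token, with a hand-tagged reference. A minimal Python sketch of that comparison follows; the function name `tagging_accuracy` and the tag labels are assumptions for illustration, not part of the Teragram SDK.

```python
def tagging_accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the manual (gold) tag."""
    if len(predicted) != len(gold):
        raise ValueError("tag sequences must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


manual = ["PRON", "V", "PREP", "N"]     # hand-assigned reference tags
automatic = ["PRON", "V", "PREP", "V"]  # hypothetical tagger output
print(tagging_accuracy(automatic, manual))
```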
Efficient integration with major
technologies
Our linguistic suites provide seamless
integration with existing document preprocessing activities and easy integration with content management software
and search applications. Major Internet
search companies, news and media
companies, as well as technology and
pharmaceutical companies, are current
users of Teragram’s technologies.
Components: Linguistic
Functionalities
The following components are available
for each language:
Segmentation
Segmentation, or tokenization, is the process by which a sequence of characters
is divided into one or more units of meaning. Teragram’s word tokenization
separates punctuation marks from text to enhance text processing, while
Teragram’s sentence tokenization adds another tokenization layer to incoming
text by breaking documents into sentence streams. Our tokenization can
differentiate, for example, a period in English that functions as a marker
for the end of a sentence, the period character that follows abbreviations or
periods used in ellipses. Our technologies can also recognize, for example,
inverted question marks (¿) and exclamation points (¡), which precede
Spanish sentences that end with a question mark or an exclamation point.
Compound word decomposition
Germanic languages present unique challenges because compounding is a
dynamic process that makes it impossible to compile and maintain comprehensive
lists of compound words. Our software solves this problem by operating in two
modes. Weak compound analysis performs compound decomposition only for words
that are not in the dictionary or that are new. Strong compound analysis
is a comprehensive functionality that aggressively decomposes unlisted
and listed compound words.
Parts-of-speech tagging
Correctly analyzing a word in context is crucial to understanding its precise
meaning. Parts-of-speech tagging identifies or differentiates the grammatical
category of a word by analyzing its context. Consider this sentence: “I like to
bank at the local branch of my bank.” In this case, the word “bank” has two
parts-of-speech tags: verb (V) and noun (N), respectively. The correct meaning
can only be understood in its context.
Noun-phrase extraction
Noun phrases are nouns and their modifiers. Our noun-phrase extraction
determines the noun context that is crucial in evaluating data. This uses
parts-of-speech tagging to identify nouns and their related words that form
a noun phrase. Sentence tokenization is deployed to identify the noun
phrase within a sentence. For example, “week-long cruises,” “Middle Eastern
languages” and “emerging economies” all provide more information that might
make them more or less relevant than simply returning the nouns: “cruises,”
“languages” and “economies.”
Morphological stemming
Teragram’s linguistic functionalities identify every synonym/inflected form
of a given stem. When the software encounters an inflected form of the
stem, it identifies the stem word. For example, “talk” is identified as the stem
for the inflected form “talks.” For some languages, such as Dutch, which has
a high number of both inflected word forms and compound words, stemming
is a more complex process than for English. For example, the word
passagierlijst (passenger list) in Dutch makes it necessary to break this
compound word into its simpler constituents before it can be stemmed.
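The sentence-boundary behavior described under Segmentation can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Teragram's tokenizer: the abbreviation list and the `split_sentences` function are hypothetical, and an ellipsis is never treated as a sentence end here, which a real system would have to decide from context.

```python
# Hypothetical lowercase abbreviation list (not a Teragram dictionary).
ABBREVIATIONS = frozenset({"dr.", "mr.", "mrs.", "inc.", "e.g.", "i.e."})


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Break text into a stream of sentences, skipping periods that belong
    to listed abbreviations or to an ellipsis."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        if word.endswith(("...", "\u2026")):
            continue  # ellipsis: assumed not to end the sentence
        if word.lower() in abbreviations:
            continue  # the period belongs to the abbreviation
        if word[-1] in ".!?":
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))  # trailing text without a stop
    return sentences


# Inverted Spanish marks stay attached to the start of their sentence.
for sentence in split_sentences(
        "Dr. Lee arrived... late. \u00bfD\u00f3nde est\u00e1? \u00a1Aqu\u00ed!"):
    print(sentence)
```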
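[Figure: Parts-of-Speech Representation of Teragram Linguistic Suites for European and Middle Eastern Languages. Diagram labels: input text collections, segmentation, compound decomposition, morphological stemming, part-of-speech tagging, noun phrase chunking, entity extraction, morphological analysis, valuable information (business value).]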
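The difference between weak and strong compound analysis can be illustrated with a toy Python sketch. Everything here is an assumption made for the example: the five-word lexicon, the function names and the greedy dictionary-driven splitting are not Teragram's algorithm, and linking elements between constituents are not handled.

```python
# Toy Dutch lexicon; "boekhandel" (bookshop) is listed as a lexicalized
# compound, while "passagierlijst" is deliberately left unlisted.
LEXICON = frozenset({"passagier", "lijst", "boek", "handel", "boekhandel"})


def split_compound(word, lexicon, min_len=3):
    """Return lexicon words that concatenate to `word`, or None if no
    split into at least two parts exists."""
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head not in lexicon:
            continue
        if tail in lexicon:
            return [head, tail]
        rest = split_compound(tail, lexicon, min_len)
        if rest:
            return [head] + rest
    return None


def decompose(word, lexicon=LEXICON, mode="weak"):
    """Weak mode leaves dictionary words intact; strong mode also tries
    to split compounds that are themselves listed in the dictionary."""
    if mode not in ("weak", "strong"):
        raise ValueError("mode must be 'weak' or 'strong'")
    if mode == "weak" and word in lexicon:
        return [word]
    return split_compound(word, lexicon) or [word]


print(decompose("passagierlijst", mode="weak"))   # unlisted: split in both modes
print(decompose("boekhandel", mode="weak"))       # listed: kept whole
print(decompose("boekhandel", mode="strong"))     # listed: split anyway
```

In this sketch, weak mode preserves the meaning of lexicalized compounds, while strong mode trades that for more aggressive matching on constituents.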
Software Development
Toolkits and Platforms
Advanced linguistic technologies from
Teragram are available as software
development toolkit (SDK) libraries with
application programming interfaces
(APIs). The SDKs can be provided as
multithreaded libraries that enable
application data to be shared across
multiple calls to the SDK. Because the libraries are multithreaded, application
data is loaded once and shared across
threads to optimize access and
natural-language processing speed.
The SDK libraries are supported for all
major 32- and 64-bit operating systems,
including HP-UX, Mac OS X, AIX, Windows and UNIX.
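The load-once, share-across-threads pattern described above can be sketched generically in Python. This is not the Teragram SDK interface: the tiny stem table, `get_lexicon`, `stem_document` and `process` are hypothetical names used only to show the idea.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_lexicon = None
_lock = threading.Lock()
load_count = 0  # counts how often the shared data is actually loaded


def get_lexicon():
    """Load the shared, read-only stem table on first use only."""
    global _lexicon, load_count
    if _lexicon is None:
        with _lock:
            if _lexicon is None:  # re-check inside the lock
                load_count += 1
                # Stand-in for reading a large dictionary from disk.
                _lexicon = {"talks": "talk", "talked": "talk",
                            "talking": "talk"}
    return _lexicon


def stem_document(words):
    lexicon = get_lexicon()  # every thread sees the same object
    return [lexicon.get(word, word) for word in words]


def process(documents, workers=4):
    """Stem several documents concurrently against one shared table."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(stem_document, documents))


print(process([["talks", "now"], ["talking"], ["talked", "then"]]))
print(load_count)
```

Because the table is only read after loading, the threads need no further locking; the lock protects the one-time load alone.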
Why Teragram®?
Native language dictionaries specific to
European and Middle Eastern languages
European and Middle Eastern language
files are built and developed by native
language speakers so that the context-sensitive nuances are accurately represented in the software functions.
27 languages: East and West European, Middle Eastern and Asian languages
Customizable dictionaries and APIs for
easy integration with many platforms
Teragram Linguistic Suites can be
customized with user-defined terms to
support unique application needs.
Dedicated implementation and maintenance support
Teragram linguistic suite software
includes an installation guide that explains how to customize the SDK, along
with documentation for language-specific uses. Personalized and dedicated
technical online support for implementation or integration issues is available.
Global language support
Dictionaries are available for the listed
languages, and we support more than
24 European and Middle Eastern languages and
six Asian languages. Teragram continues to extend lexical and inflection
forms for other Eastern languages,
including Malay.
About Teragram
A division of SAS
Teragram, a market leader in delivering
reliable natural-language processing
technologies, deciphers unorganized
text data for your applications. Founded
in 1997 by innovators in computational
linguistics, Teragram provides a full
range of natural-language technologies,
including stemming, spelling correction, phrase extraction, segmentation,
entity and event extraction, information
retrieval, full-text parsing and question-answering technologies for more than
27 languages, including East and West
European, Middle Eastern and Asian
languages. Teragram advanced linguistic technologies are used by clients
worldwide to improve searching and
classifying text-based information.
Copyright © 2013, SAS Institute Inc. All rights reserved.
Teragram serves customers across
the publishing, pharmaceutical,
telecommunications and financial
industries. See the current listing of
public customers at Teragram.com/customers.
SAS is the leader in business analytics
software and services, and the largest
independent vendor in the business
intelligence market. Through innovative
solutions, SAS helps customers at more
than 65,000 sites improve performance
and deliver value by making better decisions faster. Since 1976 SAS has been
giving customers around the world
THE POWER TO KNOW®.
SAS Institute Inc. World Headquarters +1 919 677 8000
To contact your local SAS office, please visit:
sas.com/offices
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA
and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
Copyright © 2014, SAS Institute Inc. All rights reserved. 106290_S121292.0214