SOLUTION OVERVIEW

Teragram® Linguistic Suites – European and Middle Eastern Languages

Experience reliable, highly profitable linguistic technologies for European and Middle Eastern languages

Overview

You need a rich suite of tools for discovering and extracting knowledge from text documents, including a comprehensive text mining solution that integrates text-based information with structured data and predictive analytics for better answers to complex questions.

Teragram, a division of SAS, provides fast, automated and highly scalable linguistic technologies that address the challenges of extracting text from document sources and reducing it to meaningful elements (known as tokenization). Language-processing functions identify the linguistic information needed to retrieve meaning from input text, so you can act on hidden insights.

Teragram’s advanced linguistic technologies are available for all major European and Middle Eastern languages, with annotated dictionaries that contain several hundred million words for:

• Arabic
• Czech
• Danish
• Dutch
• English
• Finnish
• French
• Farsi
• German
• Greek
• Hebrew
• Hindi
• Hungarian
• Italian
• Norwegian
• Polish
• Portuguese
• Romanian
• Russian
• Slovak
• Spanish
• Swedish
• Turkish
• Ukrainian

■ Challenges
• Accurate tokenization at the word and sentence level. Complexities in words, expressions and numerical representations, as well as in punctuation, can lead to inaccurate segmentation.
• Compound words. Decomposing compound words in Germanic languages is complex, and can lead to increased processing time, poor stem recognition, loss of contextual meaning and improper handling of lexicalized compound words.
• Stemming complexity. The large number of inflected terms in Dutch can result in a loss of relationships between terms, leading to inaccurate or ambiguous results.
• Limited or closed dictionaries and inflexible interfaces. Inflexible linguistic solutions place unnecessary limitations on the applications built on them, particularly for extensibility and scalability.

Automate text processing at word and sentence levels

■ Key Benefits
• Punctuation recognition. Teragram tokenization can identify subtle and complex punctuation across our extensive language dictionaries.
• Efficient word stemming. Our solution gives you access to a combination of high-speed, dictionary-based and heuristic-based algorithms, and to rich, easily extendible dictionaries.
• Accurate part-of-speech tagging. Count on more than 95 percent accuracy for major languages at some of the fastest speeds in the industry.
• Language-specific custom APIs. Configure your tokenizer to specify how characters and punctuation are handled.
• Seamless linguistic processing. Our solution easily integrates with your document preprocessing and content management software.

Capabilities

Teragram licenses linguistic technologies as OEM libraries so organizations can embed linguistic functions into commercial offerings. These highly specialized linguistic capabilities can enhance your existing offerings and provide superior meaning extraction to new offerings.

Advanced text processing
Our technology provides you with advanced text processing, including case and accent normalization, to effectively handle variations and normalize characters to a standard form. Case and accent normalization are important because sentence meaning can change due to incorrect usage of upper or lower case or accent marks. These prebuilt functions shorten the product development cycle, allowing for rapid deployment.

Customizable language dictionaries and programming interfaces
Teragram’s language dictionaries are customizable for the specific way words need to be handled in European and Middle Eastern languages. The software handles exceptions, including text containing numbers, dates, URLs and abbreviations.
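As a rough illustration of the exception handling and the case and accent normalization described above, the following minimal Python sketch keeps URLs, dates, numbers and a few known abbreviations together as single tokens, and folds case and accents to a standard form. It is not the Teragram SDK or its API: the names ABBREVIATIONS, TOKEN_PATTERN, normalize and tokenize are hypothetical, the abbreviation list is a tiny placeholder for a real language dictionary, and the rules use only Python's standard library.

```python
import re
import unicodedata

# Hypothetical, tiny abbreviation list; a real dictionary would be far
# larger and language-specific.
ABBREVIATIONS = ["e.g.", "i.e.", "etc.", "Dr.", "Mr.", "Mrs."]

# Longest abbreviations first so the alternation prefers the longest match.
_abbr = "|".join(
    re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True)
)

# Order matters: the exception rules (URL, abbreviation, date, number) are
# tried before the generic word and punctuation rules.
TOKEN_PATTERN = re.compile(
    rf"""
    https?://\S*[^\s.,;:!?)]              # URL, trailing punctuation left outside
  | \b(?:{_abbr})                        # known abbreviation, period included
  | \d{{1,4}}[-/]\d{{1,2}}[-/]\d{{1,4}}  # date such as 2014-02-28
  | \d+(?:[.,]\d+)*                      # number such as 500,000 or 3.14
  | \w+(?:[-']\w+)*                      # word, including hyphenated forms
  | [^\w\s]                              # any other single punctuation mark
    """,
    re.VERBOSE | re.IGNORECASE,
)


def tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens, keeping exceptions whole."""
    return TOKEN_PATTERN.findall(text)


def normalize(token: str) -> str:
    """Fold case and strip accent marks (Unicode combining characters)."""
    decomposed = unicodedata.normalize("NFD", token.casefold())
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    sample = (
        "Dr. Müller visited http://sas.com/offices on 2014-02-28, "
        "e.g. to review 500,000 words."
    )
    tokens = tokenize(sample)
    print(tokens)
    print([normalize(t) for t in tokens])
```

In this sketch the sentence-final period is split off as its own token, while the periods inside "Dr.", "e.g." and the URL stay attached. A dictionary-driven tokenizer would replace the placeholder list and regular expressions with language-specific data.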
You can configure tokenization and other modules based on the chosen language. Linguistic modules can be customized for a wide variety of standard platforms, and configuration and maintenance support is readily available.

Enables increased performance and rapid expansion
As highly scalable technologies, our linguistic suites are optimized to perform linear scans of text for accurate and timely processing, with the ability to analyze nearly 500,000 words per second from compressed forms. Our parts-of-speech tagger is highly accurate, and its output can be verified against manual parts-of-speech tagging.

Efficient integration with major technologies
Our linguistic suites provide seamless integration with existing document preprocessing activities and easy integration with content management software and search applications. Major Internet search companies, news and media companies, as well as technology and pharmaceutical companies, are current users of Teragram’s technologies.

Components: Linguistic Functionalities

The following components are available for each language:

Segmentation
Segmentation, or tokenization, is the process by which a sequence of characters is divided into one or more units of meaning. Teragram’s word tokenization separates punctuation marks from text to enhance text processing, while Teragram’s sentence tokenization adds another tokenization layer to incoming text by breaking documents into sentence streams. Our tokenization can differentiate, for example, between a period in English that marks the end of a sentence, a period that follows an abbreviation and the periods used in an ellipsis. Our technologies can also recognize, for example, inverted question marks (¿) and exclamation points (¡), which precede Spanish sentences that end with a question mark or an exclamation point.

Morphological stemming
Teragram’s linguistic functionalities identify every synonym/inflected form of a given stem. When the software encounters an inflected form of the stem, it identifies the stem word. For example, “talk” is identified as the stem for the inflected form “talks.” For some languages, such as Dutch, which has a high number of both inflected word forms and compound words, stemming is a more complex process than for English. For example, the Dutch word passagierlijst (passenger list) must be broken into its simpler constituents before it can be stemmed.

Compound word decomposition
Germanic languages present unique challenges because compounding is a dynamic process that makes it impossible to compile and maintain comprehensive lists of compound words. Our software solves this problem by operating in two modes. Weak compound analysis performs compound decomposition only for words that are not in the dictionary or that are new. Strong compound analysis is a comprehensive functionality that aggressively decomposes unlisted and listed compound words.

Parts-of-speech tagging
Correctly analyzing a word in context is crucial to understanding its precise meaning. Parts-of-speech tagging identifies or differentiates the grammatical category of a word by analyzing its context. Consider this sentence: “I like to bank at the local branch of my bank.” In this case, the word “bank” has two parts-of-speech tags: verb (V) and noun (N), respectively. The correct meaning can only be understood in its context.

Noun-phrase extraction
Noun phrases are nouns and their modifiers. Our noun-phrase extraction determines the noun context that is crucial in evaluating data. It uses parts-of-speech tagging to identify nouns and the related words that form a noun phrase, and sentence tokenization to identify the noun phrase within a sentence. For example, “week-long cruises,” “Middle Eastern languages” and “emerging economies” all provide more information, which might make them more or less relevant than simply returning the nouns “cruises,” “languages” and “economies.”

[Figure: Parts-of-Speech Representation of Teragram Linguistic Suites for European and Middle Eastern Languages. Labels: Input text collections; Segmentation; Compound decomposition; Morphological stemming; Part of speech tagging; Noun phrase chunking; Entity extraction; Morphological analysis; Valuable information – business value.]

Software Development Toolkits and Platforms

Advanced linguistic technologies from Teragram are available as software development toolkit (SDK) libraries with application programming interfaces (APIs). The SDKs can be provided as multithreaded libraries that enable application data to be shared across multiple calls to the SDK. Because the technology is multithreaded, the application data is loaded once and shared across the threads, optimizing access and natural-language processing speed. The SDK libraries are supported for all major 32- and 64-bit operating systems, including HP-UX, Mac OS X, AIX, Windows and UNIX.

Why Teragram®?

Native language dictionaries specific to European and Middle Eastern languages
European and Middle Eastern language files are built and developed by native language speakers so that the context-sensitive nuances are accurately represented in the software functions.

27 languages: East and West European, Middle Eastern and Asian languages

Customizable dictionaries and APIs for easy integration with many platforms
Teragram Linguistic Suites can be customized with user-defined terms to support unique application needs.

Dedicated implementation and maintenance support
Teragram linguistic suite software includes an installation guide that explains how to customize the SDK, along with documentation for language-specific uses. Personalized and dedicated online technical support is available for implementation or integration issues.

Global language support
Dictionaries are available for the listed languages, and we support more than 24 European and Middle Eastern and six Asian languages.
Teragram continues to extend lexical and inflection forms for other Eastern languages, including Malay.

About Teragram
A division of SAS

Teragram, a market leader in delivering reliable natural-language processing technologies, deciphers unorganized text data for your applications. Founded in 1997 by innovators in computational linguistics, Teragram provides a full range of natural-language technologies, including stemming, spelling correction, phrase extraction, segmentation, entity and event extraction, information retrieval, full-text parsing and question-answering technologies for more than 27 languages, including East and West European, Middle Eastern and Asian languages. Teragram advanced linguistic technologies are used by clients worldwide to improve searching and classifying text-based information. Teragram serves customers across the publishing, pharmaceutical, telecommunications and financial industries. See the current listing of public customers at Teragram.com/customers.

Copyright © 2013, SAS Institute Inc. All rights reserved.

SAS is the leader in business analytics software and services, and the largest independent vendor in the business intelligence market. Through innovative solutions, SAS helps customers at more than 65,000 sites improve performance and deliver value by making better decisions faster. Since 1976 SAS has been giving customers around the world THE POWER TO KNOW®.

SAS Institute Inc. World Headquarters +1 919 677 8000
To contact your local SAS office, please visit: sas.com/offices

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. Copyright © 2014, SAS Institute Inc. All rights reserved. 106290_S121292.0214