Rosette®
BIG TEXT ANALYTICS

RLI: Rosette Language Identifier

Instantly identify and triage many languages within large volumes of text.

Identify languages and transform encodings

[Figure: the tagline shown in four languages, sorted by detected language, with each language's share of the text and the primary language marked.]
English: Instantly identify and triage many languages within large volumes of text.
Chinese: 即时识别和处理大量多语言文本。
French: Identifiez et triez instantanément plusieurs langues à travers de nombreux textes.
Arabic: التحديد والتصنيف الفوري للعديد من اللغات ضمن كميات كبيرة من النصوص.

Rosette product family:
- RLI: Language Identifier
- RBL: Base Linguistics
- REX: Entity Extractor
- RES: Entity Resolver
- RNI: Name Indexer
- RNT: Name Translator
- RCA: Categorizer
- RSA: Sentiment Analyzer

www.basistech.com
[email protected]
+1 617-386-2090

Rosette® Language Identifier (RLI) analyzes text from a few words to whole
documents, to detect the languages and character encoding with speed and
very high accuracy. Automatic language identification is the necessary first
step for applications that categorize, search, process, and store text in many
languages. Individual documents may be routed to language specialists, or sent
into language-specific analysis pipelines (such as Rosette Base Linguistics) to
improve the quality of search results.
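
The routing step described above can be sketched as a simple dispatch on the detected language. This is only an illustration: the language codes, the handler functions, and the `route` helper are hypothetical names introduced here, and the language code is assumed to come from an upstream identifier such as RLI rather than from any real Rosette API call.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


# Hypothetical stand-ins for language-specific analysis pipelines.
def analyze_english(text: str) -> str:
    return "en-pipeline:" + text.lower()


def analyze_french(text: str) -> str:
    return "fr-pipeline:" + text.lower()


def send_to_specialist(text: str) -> str:
    # Fallback for languages without an automated pipeline.
    return "manual-review:" + text


# Assumed mapping from detected language code to its pipeline.
PIPELINES: Dict[str, Callable[[str], str]] = {
    "eng": analyze_english,
    "fra": analyze_french,
}


def route(documents: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Process (language_code, text) pairs and group the results by language."""
    routed: Dict[str, List[str]] = defaultdict(list)
    for language, text in documents:
        handler = PIPELINES.get(language, send_to_specialist)
        routed[language].append(handler(text))
    return dict(routed)


result = route([("eng", "Hello"), ("fra", "Bonjour"), ("deu", "Hallo")])
```

Documents in a language without a registered pipeline fall through to the specialist queue, which mirrors the two routing options named in the paragraph.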
For applications that analyze tweets, search keywords, and other short text,
RLI offers market-leading language detection accuracy on inputs ranging from
1-3 words (<20 bytes) up to a full sentence.

55 Supported Languages

KEY FEATURES

- Simple API
- Fast and scalable
- Industrial-strength support
- Easy installation
- Flexible and customizable
- Java or C++
- Unix, Linux, Mac, or Windows
- Component of the Rosette SDK

RLI achieves its accuracy by combining proprietary algorithms with
information-rich language profiles derived from statistical analysis,
along with language-specific methods for detecting the language of short text.
Basis Technology continually improves the Rosette product family with
language additions, feature updates, and the latest innovations from the
academic world.
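
RLI's algorithms are proprietary and are not described in this document. To give a feel for what a statistically derived "language profile" can look like, here is a minimal sketch of a generic textbook technique: character trigram counts compared by cosine similarity. The function names, the toy training sentences, and the choice of trigrams are all assumptions made for illustration; this is not RLI's method, and real profiles would be built from far more text.

```python
import math
from collections import Counter


def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams, with a space padded on each edge."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Toy training samples (illustrative only).
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog and then the cat sleeps in the sun",
    "French": "le renard brun saute par dessus le chien paresseux et le chat dort dans le soleil",
    "German": "der schnelle braune fuchs springt über den faulen hund und die katze schläft in der sonne",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}


def identify(text: str) -> str:
    """Return the language whose profile is most similar to the input text."""
    query = ngram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))
```

With profiles this small the sketch only separates inputs that share vocabulary with the training sentences, which is one reason short-text detection usually needs additional language-specific methods.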
Select Customers
StumbleUpon
Start using RLI today
Try our free product evaluation
www.basistech.com

IDENTIFICATION FEATURES

- Identifies the primary or dominant language of a document
- Determines the languages and their percentages within multilingual documents
- Identifies the language scripts within the document, such as Latin and Cyrillic
- Works with texts that have been transliterated, such as Arabic chat that is
  written in the Latin script
- Accurate with short strings, from 1-3 words (<20 bytes) to a full sentence, to
  enable full analysis of search queries, tweets, image captions, metadata, news
  headlines, email subject lines, and more

LANGUAGE BOUNDARY LOCATOR

Digital text is often composed of multiple languages within the same document,
presenting a challenge to computers and humans alike. RLI enriches the text with
start and end markers for each language placed within multilingual documents, even
if all the languages are written in the same script, such as English, French,
German, or Italian. Boundaries of each writing system are also detected, such as
Latin, Cyrillic, Japanese kana, or Chinese hanzi.

[Figure: a mixed-language passage with each detected region labeled.]
FRENCH: J'ai été surprise par cette surprise.
ENGLISH: Vice President Biden spoke about this in Munich.
SPANISH: El carpintero prensa los bordes de la placa decorativa.
ENGLISH: Proper wound care management prevents
GERMAN: die Geige gibt einen schoenen Laut von sich.

ENCODING CONVERSION

Although modern text encoding standards, such as XML, mandate the use of Unicode,
many existing applications, documents, websites, and data streams use “legacy
encodings,” such as ASCII, ISO 8859-1, Shift-JIS, and many others.

Rosette accurately converts large collections of text with these legacy encodings
into a single, uniform format in the Unicode standard. This converted text can then
be used in any language, which eliminates data corruption and other problems due to
incompatible code.

188 Language/Encoding Pairs
55 Languages with Unicode
7 Latin Script Variants (Transliterations)
44 Legacy Encodings

© 2015 Basis Technology Corporation. “Basis Technology Corporation”, “Rosette”, and “Highlight” are registered trademarks of Basis Technology Corporation. “Big Text Analytics” is a trademark of Basis Technology Corporation. All other trademarks, service marks, and logos used in this document are the property of their respective owners. (2015-06-29-RLI)

LANGUAGE AND ENCODING COMPATIBILITY

Albanian: ISO-8859-1, Windows-1252
Arabic: ISO-8859-6, Windows-720, Windows-1256
Arabic (transliterated): ISO-8859-1, Windows-1252, Windows-1256
Bengali: ISCII-Bengali
Bulgarian: ISO-8859-5, Windows-1251, KOI8-R
Catalan: ISO-8859-1, Windows-1252
Chinese, Simplified: GB-2312, GB-18030, HZ-GB-2312, ISO-2022-CN
Chinese, Traditional: Big5, Big5-HKSCS
Croatian: Windows-1250
Czech: ISO-8859-2, Windows-1250
Danish: ISO-8859-1, Windows-1252
Dutch: ISO-8859-1, Windows-1252
English: ISO-8859-1, Windows-1252
Estonian: ISO-8859-13, Windows-1257
Finnish: ISO-8859-1, Windows-1252
French: ISO-8859-1, Windows-1252
German: ISO-8859-1, Windows-1252
Greek: ISO-8859-7, Windows-1253
Gujarati: ISCII-Gujarati
Hebrew: ISO-8859-8, Windows-1255
Hindi: ISCII-Hindi
Hungarian: ISO-8859-2, Windows-1250
Icelandic: ISO-8859-1, Windows-1252
Indonesian: ISO-8859-1, Windows-1252
Italian: ISO-8859-1, Windows-1252
Japanese: EUC-JP, ISO-2022-JP, Shift-JIS, Shift-JIS-2004 (JIS X 0213)
Kannada: ISCII-Kannada
Korean: EUC-KR, ISO-2022-KR
Kurdish: Windows-1256
Kurdish (transliterated): ISO-8859-1, Windows-1252, Windows-1256
Latvian: ISO-8859-13, Windows-1257
Lithuanian: ISO-8859-13, Windows-1257
Macedonian: ISO-8859-5, Windows-1251
Malay: ISO-8859-1, Windows-1252
Malayalam: ISCII-Malayalam
Norwegian: ISO-8859-1, Windows-1252
Pashto: ISO-8859-6, Windows-1256
Pashto (transliterated): ISO-8859-1, Windows-1252
Persian: ISO-8859-6, Windows-1256
Persian (transliterated): ISO-8859-1, Windows-1252, Windows-1256
Polish: ISO-8859-2, Windows-1250
Portuguese: ISO-8859-1, Windows-1252
Romanian: ISO-8859-2, Windows-1250
Russian: ISO-8859-5, Windows-1251, KOI8-R, IBM-866, Mac Cyrillic
Serbian: ISO-8859-5, Windows-1251
Serbian (transliterated): ISO-8859-2, Windows-1250
Slovak: Windows-1250
Slovenian: Windows-1250
Somali: ISO-8859-1, Windows-1252
Spanish: ISO-8859-1, Windows-1252
Swedish: ISO-8859-1, Windows-1252
Tagalog: ISO-8859-1, Windows-1252
Tamil: ISCII-Tamil
Telugu: ISCII-Telugu
Thai: Windows-874
Turkish: ISO-8859-9, Windows-1254
Ukrainian: ISO-8859-5, Windows-1251, KOI8-R
Urdu: ISO-8859-6, Windows-1256
Urdu (transliterated): ISO-8859-1, Windows-1252
Uzbek: ISO-8859-5, Windows-1251, KOI8-R
Uzbek (transliterated): Windows-1251
Vietnamese: TCVN, VIQR, VISCII, VNI, VPS
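
As a rough illustration of what converting legacy-encoded text to Unicode involves once the source encoding is known, the sketch below uses only the codecs that ship with Python's standard library. It is not the Rosette API: the helper names `to_unicode` and `convert_to_utf8` are hypothetical, the encoding is assumed to have been identified already, and the codec names follow Python's spelling (for example `shift_jis` and `koi8_r`).

```python
def to_unicode(raw: bytes, encoding: str) -> str:
    """Decode bytes in a known legacy encoding into a Unicode string.

    Raises UnicodeDecodeError if the bytes are not valid in that encoding.
    """
    return raw.decode(encoding)


def convert_to_utf8(raw: bytes, encoding: str) -> bytes:
    """Re-encode legacy bytes as UTF-8, a single uniform Unicode format."""
    return to_unicode(raw, encoding).encode("utf-8")


# Sample inputs, each paired with the encoding it was stored in.
samples = [
    ("日本語".encode("shift_jis"), "shift_jis"),
    ("Русский".encode("koi8_r"), "koi8_r"),
    ("français".encode("iso-8859-1"), "iso-8859-1"),
]
converted = [to_unicode(raw, enc) for raw, enc in samples]

# Decoding with the wrong encoding does not always fail; it can silently
# produce corrupted text, which is why the encoding must be detected first.
mojibake = "Русский".encode("koi8_r").decode("cp1251")
```

The last line shows the kind of data corruption the section refers to: KOI8-R bytes read as Windows-1251 still decode, but to the wrong Cyrillic letters.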

HEADQUARTERS
One Alewife Center
Cambridge, MA 02140

FEDERAL
2553 Dulles View Dr., Suite 450
Herndon, VA 20171

WEST COAST
1700 Montgomery St.
San Francisco, CA 94111

EUROPE
Furzeground Way
Middlesex UB11 1BD, UK

ASIA
9-6 Nibancho, Chiyoda-ku
Tokyo 102-0084, Japan