Text Analysis Language Definition Guide

CUSTOMER
SAP BusinessObjects Predictive Analytics 3.1
2016-12-20
Data Manager User Guide
Content
1 About this Document
2 General introduction
2.1 How to Configure a New Language?
2.2 What is a Stop Words List in Automated Analytics?
2.3 What is a Concepts List in Automated Analytics?
2.4 What is a Synonyms List in Automated Analytics?
2.5 What is Stemming in Automated Analytics?
2.6 Example Presented in this Documentation
3 Algorithm Transcription for Spanish Language
3.1 Region Definition
    Step 1 - Spanish First Region
    Step 2 - Spanish Second Region
    Step 3 - Spanish Third Region
3.2 Main Stemming Rules for the Spanish Language
    Step Rule
    Phase A
    Phase B
    Phase C
    Phase D
    Phase E
3.3 Testing Your Stemming Rules
    Creating the Model
    Applying the Stemming Rules
1 About this Document
The purpose of this document is to enable Automated Analytics users to create their own
language definitions for the Data Manager – text coding feature.
Its main goals are to:
● present Automated Analytics vision and use of stemming,
● present the five configuration files used by Data Manager – text coding,
● describe the internal format of the stemming rules,
● provide a complete stemming algorithm transcription for a language, using Spanish as an example.
Organization of this Document
This document is subdivided into three chapters.
This chapter, About this Document, describes who this document is intended for and what
knowledge is required to understand and use it.
Chapter 2, General Introduction, introduces the definition of the Spanish language as an example, which is
developed in Chapter 3.
Chapter 3, Stemming Algorithm Transcription for Spanish Language, provides you with the complete
algorithm definition, including all the rules needed to create a stemming language definition usable in Data
Manager – text coding for the Spanish language.
Who should Read this Document
This document is intended for readers who want to:
● understand how the language definitions provided with Data Manager – text coding have been created,
● expand the provided language definitions to better meet their needs,
● create the definition for a new language and use it in Data Manager – text coding.
Full Documentation
Complete documentation can be found on the SAP Help Portal at http://help.sap.com/pa.
Prerequisites
In order to make the best use of this document, you need to:
● have a basic understanding of regular expressions,
● have a stemming algorithm for the language to define, either by having enough linguistics knowledge to
define it on your own, or by having been provided with it by a third party. For example, some ready-to-use
algorithms are available on the Snowball website (http://snowball.tartarus.org/algorithms/spanish/stemmer.html).
2 General introduction
The text coding feature lets users add their own definitions, either to enhance the provided language
definitions or to create definitions for a new language. Only English, German and Spanish language
definitions are provided with the current version of the application. These definitions cover a general
level of language, but the user may need to create a definition for a more specific domain.
When do you need to write your own stemming rules?
● when the language definition does not exist,
● when the user wants to add rules to an existing definition in order to increase the text analysis accuracy,
● when the language found in the data is specific to a domain rather than generic.
What is a Language Definition?
A language definition is composed of five files:
● a configuration file common to all languages (KXLANGUAGE.CFG),
● one containing the concepts definition,
● one containing the synonyms definition,
● one containing the stop words definition,
● one containing the stemming rules definition.
2.1 How to Configure a New Language?
Definition
The KXLANGUAGE.CFG file centralizes the language definitions. It must be present in every folder
containing language definitions.
File Description
The KXLANGUAGE.CFG file lists the information necessary for the language definition as <key>=<value>
pairs. Each pair is optional, but at least one must be defined. The keys and their associated values are
listed in the following table:
Key                       Value
<language>.name           name of the language in Automated Analytics. If not set, the language name will be the first part of the key of the following node.
<language>.ConceptList    name of the file containing the concepts for the language <language>
<language>.StemmingRules  name of the file containing the stemming rules defining the language <language>
<language>.StopList       name of the file containing the stop words list for the language <language>
<language>.SynonymList    name of the file containing the list of synonyms for the language <language>
Caution
For each file, the name of the file must be valid and exist in the current repository.
<language> can be any string defined by the user. When no value is assigned to <language>.name,
<language> is assigned as the value. For all other keys, the empty string is used by default. For the Spanish
language, the file KXLANGUAGE.CFG should contain the following lines when all the Automated Analytics
configuration files exist:
sp.Name="Spanish"
sp.ConceptList="ConceptList_sp"
sp.StemmingRules="StemmingRules_sp"
sp.StopList="StopList_sp"
sp.SynonymList="SynonymList_sp"
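As an illustration only, the key/value layout above can be read with a few lines of Python. This is a minimal sketch, not the parser used by Automated Analytics: the function name parse_language_cfg is hypothetical, keys are compared case-insensitively as an assumption (the table writes <language>.name while the sample writes sp.Name), and the defaults follow the paragraph above.

```python
from collections import defaultdict

# Key names taken from the table above, lowercased for comparison (assumption).
KEYS = ("name", "conceptlist", "stemmingrules", "stoplist", "synonymlist")


def parse_language_cfg(text):
    """Group <language>.<key>=<value> pairs by language (hypothetical helper)."""
    languages = defaultdict(dict)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        language, sep, setting = key.strip().partition(".")
        if not sep:
            continue  # not of the form <language>.<key>
        languages[language][setting.lower()] = value.strip().strip('"')
    result = {}
    for language, settings in languages.items():
        # Defaults described above: empty string for every key...
        entry = {k: settings.get(k, "") for k in KEYS}
        # ...except the name, which falls back to the <language> identifier.
        entry["name"] = settings.get("name", language)
        result[language] = entry
    return result


SAMPLE = """sp.Name="Spanish"
sp.ConceptList="ConceptList_sp"
sp.StemmingRules="StemmingRules_sp"
sp.StopList="StopList_sp"
sp.SynonymList="SynonymList_sp"
"""

config = parse_language_cfg(SAMPLE)
print(config["sp"]["name"], config["sp"]["stoplist"])
```

The sketch does not check that the referenced files exist, which the caution above requires of a real configuration.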
2.2 What is a Stop Words List in Automated Analytics?
Definition
Stop words are common words that carry no information of their own, such as linking words, articles, and so on.
File Description
The stop words file is a text file starting with “Word” on the first line and listing each stop word on a new line, as
shown in the example below.
Word
de
la
que
el
en
y
a
los
del
se
las
por
un
para
con
no
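The effect of such a list can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names load_stop_words and remove_stop_words are hypothetical, and the comparison is done on lowercased words because stemming works on lowercase text.

```python
def load_stop_words(text):
    """Read a stop words file: header line "Word", then one word per line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or lines[0] != "Word":
        raise ValueError('a stop words file must start with "Word"')
    return set(lines[1:])


def remove_stop_words(words, stop_words):
    """Drop every word that appears in the stop list (lowercased comparison)."""
    return [word for word in words if word.lower() not in stop_words]


STOP_FILE = "Word\nde\nla\nque\nel\nen\ny\na\nlos\n"
stop_words = load_stop_words(STOP_FILE)
kept = remove_stop_words("el perro de la casa".split(), stop_words)
print(kept)
```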
2.3 What is a Concepts List in Automated Analytics?
Definition
The concepts list allows you to replace a set of roots by one root carrying the same meaning, for example
"operating system" by "OS".
File Description
The file contains a list of pairs with the following syntax:
<root>-<root>…-<root>=<Value>
For example:
WordList=Concept
20-gb=capacity_20gb
2-gb=capacity_2gb
30-gb=capacity_30gb
40-gb=capacity_40gb
4-gb=capacity_4gb
6-gb=capacity_6gb
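To make the <root>-<root>=<Value> syntax concrete, the following Python sketch replaces consecutive roots by their concept. It is an assumption-laden illustration: load_concepts and apply_concepts are hypothetical names, the first line of the file is treated as a header, and trying the longest sequence first is a choice made for this sketch, not documented behavior.

```python
def load_concepts(text):
    """Map a tuple of roots to its concept, skipping the header line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    concepts = {}
    for line in lines[1:]:  # lines[0] is the "WordList=Concept" header
        roots, _, value = line.partition("=")
        concepts[tuple(roots.split("-"))] = value
    return concepts


def apply_concepts(roots, concepts):
    """Replace each known sequence of consecutive roots by its concept."""
    longest = max((len(key) for key in concepts), default=0)
    out, i = [], 0
    while i < len(roots):
        # Try the longest candidate sequence first (assumption of this sketch).
        for size in range(min(longest, len(roots) - i), 0, -1):
            key = tuple(roots[i:i + size])
            if key in concepts:
                out.append(concepts[key])
                i += size
                break
        else:
            out.append(roots[i])
            i += 1
    return out


CONCEPT_FILE = "WordList=Concept\n20-gb=capacity_20gb\n2-gb=capacity_2gb\n"
concepts = load_concepts(CONCEPT_FILE)
print(apply_concepts(["disk", "20", "gb", "free"], concepts))
```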
2.4 What is a Synonyms List in Automated Analytics?
Definition
The synonyms list allows you to replace one root by another root carrying the same meaning. For example,
"Windows" and "Unix" could both be mapped to the same root "OperatingSystem".
File Description
The file contains a list of pairs with the following syntax:
<Key>=<Value>
For example:
Word=Synonym
iPhone=cellPhone
Nokia5800=cellPhone
BlackBerry9000=cellPhone
SamsungS5230=cellPhone
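A synonyms file of this shape amounts to a one-to-one lookup. The Python sketch below illustrates that reading; load_synonyms and apply_synonyms are hypothetical names, and the exact, case-sensitive comparison is an assumption.

```python
def load_synonyms(text):
    """Map each key to its synonym, skipping the "Word=Synonym" header."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    pairs = (line.partition("=") for line in lines[1:])
    return {key: value for key, _, value in pairs}


def apply_synonyms(roots, synonyms):
    """Replace every root that has a synonym; leave the others untouched."""
    return [synonyms.get(root, root) for root in roots]


SYNONYM_FILE = "Word=Synonym\niPhone=cellPhone\nNokia5800=cellPhone\n"
synonyms = load_synonyms(SYNONYM_FILE)
print(apply_synonyms(["my", "iPhone", "and", "Nokia5800"], synonyms))
```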
2.5 What is Stemming in Automated Analytics?
Definition
Stemming is a process that determines the morphological root of a given inflected (or, sometimes,
derived) word form, generally a written word form.
Description of the Process
Stemming applies to words, so the textual fields being analyzed are first split into words. Then, to
identify and extract the morphological roots, each word is separated into regions, which can be achieved
by using regular expressions. This process is based on Porter's theory
(http://www.comp.lancs.ac.uk/computing/research/stemming/Links/porter.htm). A preprocessing step
converts all uppercase letters to lowercase. The first step is to identify the word root and to check
whether it has suffixes (plural, conjugation, …). If the word is a derived form and its length exceeds a
specified threshold, then the suffix or part of it is deleted and, in some cases, replaced.
This process is composed of the following steps:
1. The first step defines the characters or groups of characters for a language, that is the vowels,
consonants, special characters, accented forms, and so on.
2. The second step defines the regions of the word that will be used to construct replacement conditions. All
the stemmers make use of at least one of the region definitions R1 and R2. As an example, they can be
defined as follows for Latin languages:
    1. R1 is the region after the first non-vowel following a vowel, or is the null region at the end of the word if
    there is no such non-vowel.
    2. R2 is the region after the first non-vowel following a vowel in R1, or is the null region at the end of the
    word if there is no such non-vowel.
Depending on the language, other regions can be identified, such as RV (verbal region) for the
Spanish language.
The regions corresponding to a specific language can only be determined by a linguistics specialist.
3. The third step is where the actual stemming is done, that is, the removal of suffixes and prefixes and the
extraction of the roots from the word.
2.6 Example Presented in this Documentation
As an example, this document will describe how to write stemming rules for the Spanish language. The original
Spanish algorithm can be found on the Snowball website (http://snowball.tartarus.org/algorithms/spanish/stemmer.html).
Algorithm Sequence Graph
The following figure presents the algorithm for Spanish stemming based on Porter's algorithm and an example
of application on the word “torpedearon”.
Defining the Phases for Spanish Stemming
The goal of each phase is defined in the following table:
Phase   Goal
A       identifying and removing the attached pronoun
B       identifying and removing the standard suffixes
C       identifying and removing the verb suffixes
D       identifying and removing the residual suffixes
E       identifying and replacing the accented forms by the unaccented forms
Each phase is composed of one or several steps and each step is composed of one or several rules as shown in
the figure below.
Defining the Columns for Spanish Stemming Rules
This part lists the number of columns needed to define Spanish stemming rules and provides their title and
meaning in the global process.
● Rule: the identifier of the current rule. This column must be sorted in ascending order.
● Step: the step the rule belongs to. Rules must be ordered by group of steps.
● CondWord: a regular expression defining a condition to apply on the word.
● CondR1: a regular expression defining a condition to apply on the region R1 of the word.
● CondR2: a regular expression defining a condition to apply on the region R2 of the word.
● CondRV: a regular expression defining a condition to apply on the region RV of the word.
● Match: a regular expression defining the part of the word to be replaced.
● Replace: the string replacing the string defined by the match.
● StepAfter: the step to go to if the current rule has been applied to the processed word.
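The nine columns can be pictured as one record per rule. The Python sketch below is only a way to reason about the columns and the two ordering constraints stated above (rule identifiers ascending, rules grouped by step); StemmingRule, rule_from_fields and is_well_ordered are hypothetical names and say nothing about how the product stores rules.

```python
from dataclasses import dataclass

COLUMNS = ("Rule", "Step", "CondWord", "CondR1", "CondR2", "CondRV",
           "Match", "Replace", "StepAfter")


@dataclass(frozen=True)
class StemmingRule:
    rule: int        # identifier, ascending across the file
    step: int        # step the rule belongs to
    cond_word: str   # regular expression on the word, or "nocond"
    cond_r1: str     # regular expression on region R1, or "nocond"
    cond_r2: str     # regular expression on region R2, or "nocond"
    cond_rv: str     # regular expression on region RV, or "nocond"
    match: str       # part of the word to replace
    replace: str     # replacement string (may be empty)
    step_after: int  # step to go to once the rule has been applied


def rule_from_fields(fields):
    """Build a rule from a {column: text} mapping (hypothetical helper)."""
    missing = [column for column in COLUMNS if column not in fields]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return StemmingRule(
        int(fields["Rule"]), int(fields["Step"]), fields["CondWord"],
        fields["CondR1"], fields["CondR2"], fields["CondRV"],
        fields["Match"], fields["Replace"], int(fields["StepAfter"]),
    )


def is_well_ordered(rules):
    """Check ascending rule identifiers and contiguous groups of steps."""
    ids = [r.rule for r in rules]
    if any(b <= a for a, b in zip(ids, ids[1:])):
        return False
    seen, previous = set(), None
    for r in rules:
        if r.step != previous:
            if r.step in seen:
                return False  # a step reappears after another step
            seen.add(r.step)
            previous = r.step
    return True


base = {"CondWord": "nocond", "CondR1": "nocond", "CondR2": "nocond",
        "CondRV": "nocond", "Match": "s$", "Replace": ""}
first = rule_from_fields({**base, "Rule": "1", "Step": "1", "StepAfter": "2"})
second = rule_from_fields({**base, "Rule": "2", "Step": "2", "StepAfter": "3"})
print(is_well_ordered([first, second]))
```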
3 Algorithm Transcription for Spanish Language
Preprocessing
This part explains the preprocessing rules: how to define them and what they do exactly.
The preprocessing rules belong to step 0.
A preprocessing rule defines:
● a condition to apply on the word,
● a Match,
● a Replace,
● a StepAfter.
No preprocessing rules will be defined here since none are needed in Spanish.
Example
To replace every instance of the letter é found in the word by the letter e, the following rule should be applied:
0 1 é nocond nocond nocond é e 1
This rule obeys the following syntax:
Step Rule CondWord CondR1 CondR2 CondRV Match Replace StepAfter
This rule can be translated as follows:
● This is rule 1
● This rule belongs to Step 0
● If the word contains the letter é
● Then replace the letter é by the letter e
● Go to Step 1.
Note
The keyword nocond indicates that no condition is defined for this part of the rule.
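The effect of the example rule can be reproduced with Python's re module, as a rough illustration. The function apply_preprocessing is a hypothetical name, the lowercasing mirrors the preprocessing described in the general introduction, and Python's regular expression engine is used here as a stand-in for the PCRE engine.

```python
import re

NOCOND = "nocond"


def apply_preprocessing(word, cond_word, match, replace):
    """Apply one step-0 rule: check CondWord, then replace every Match."""
    word = word.lower()
    if cond_word != NOCOND and not re.search(cond_word, word):
        return word  # the condition on the word is not met
    return re.sub(match, replace, word)


print(apply_preprocessing("Bebé", "é", "é", "e"))
```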
3.1 Region Definition
To define a region with the stemming rules, you have to create a regular expression designating the specific
region you want to isolate. This expression has to be declared in the Match column and the Replace column
must be blank. A region is defined by one rule and only one.
The step of the rule is what matters. The rule that defines the first region must be in Step 1 and
must be the only rule of Step 1. In the same way, the rule defining the second region must be in Step 2 and be
the only rule of Step 2, and so on.
In other words, if there are three regions to define, then Step 1, Step 2 and Step 3 are used to define them,
and the first main stemming rules start at Step 4.
If a given language has no region to define, then Step 1 contains the first main stemming rules.
Regular Expressions Reminder
The regular expressions engine used for the stemming rules is a PCRE engine (Perl Compatible Regular
Expressions). The following table summarizes the main elements that can be used in the regular expressions:
\    general escape character with several uses
^    assert start of subject (or line, in multiline mode)
$    assert end of subject (or line, in multiline mode)
.    match any character except newline (by default)
[    start character class definition
]    end character class definition
|    start of alternative branch
(    start subpattern
)    end subpattern
?    extends the meaning of (, also 0 or 1 quantifier, also quantifier minimizer
*    0 or more quantifier
+    1 or more quantifier
{    start min/max quantifier
}    end min/max quantifier
Spanish Letters
Letters in Spanish include the following accented forms:
áéíóúüñ
So the complete alphabet is:
aábcdeéfghiíjklmnñoópqrstuúüvwxyz
Spanish Vowels and Consonants
The following letters are considered as vowels:
aeiouáéíóúü
The following letters are considered as consonants:
bcdfghjklmnñpqrstvwxyz
3.1.1 Step 1 - Spanish First Region
Definition
R1 is the region starting after the first non-vowel following a vowel, or is the null region at the end of the word if
there is no such non-vowel.
Examples
The R1 value for the word TORMENTAS is MENTAS.
The R1 value for the word TOROS is OS.
The R1 value for the word CHE is NULL.
Rule
The regular expression that defines the region described above is:
(^[aábcdeéfghiíjklmnñoópqrstuúüvwxyz]*?[aeiouáéíóúü][bcdfghjklmnñpqrstvwxyz])
So the complete rule defining R1 in the Spanish stemming file is:
Rule: 1
Step: 1
CondWord: nocond
CondR1: nocond
CondR2: nocond
CondRV: nocond
Match: (^[aábcdeéfghiíjklmnñoópqrstuúüvwxyz]*?[aeiouáéíóúü][bcdfghjklmnñpqrstvwxyz])
Replace:
StepAfter: 2
When applied to the word TORMENTAS, the regular expression defined in the Match column returns TOR. The
matched part (TOR) is replaced by nothing since the string declared in the Replace field is null. As a result the
word is transformed from TORMENTAS to MENTAS (the desired region). This result is stored internally as the
first defined region of the processed word.
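The three R1 examples can be checked with the same expression in Python, used here as a stand-in for the PCRE engine. This is a sketch under assumptions: region_r1 is a hypothetical name, the word is lowercased first, and a word on which the expression does not match is given the null (empty) region, as stated in the definition.

```python
import re

LETTERS = "aábcdeéfghiíjklmnñoópqrstuúüvwxyz"
VOWELS = "aeiouáéíóúü"
CONSONANTS = "bcdfghjklmnñpqrstvwxyz"

# Same shape as the Match column of the R1 rule.
R1_MATCH = f"(^[{LETTERS}]*?[{VOWELS}][{CONSONANTS}])"


def region_r1(word):
    """Return what is left after removing the matched prefix, or "" if none."""
    word = word.lower()
    found = re.search(R1_MATCH, word)
    return word[found.end():] if found else ""


for sample in ("TORMENTAS", "TOROS", "CHE"):
    print(sample, repr(region_r1(sample)))
```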
3.1.2 Step 2 - Spanish Second Region
Definition
R2 is the region after the first non-vowel following a vowel in R1, or is the null region at the end of the word if
there is no such non-vowel.
Examples
The R1 value for the word TORMENTAS is MENTAS, so the R2 is TAS.
The R1 value for the word TOROS is OS, so the R2 is NULL.
Both R1 and R2 values are NULL for the word CHE.
Rule
The regular expression that defines the region described above is:
(^[aábcdeéfghiíjklmnñoópqrstuúüvwxyz]*?[aeiouáéíóúü][bcdfghjklmnñpqrstvwxyz][aábcdeéfghiíjklmnñoópqrstuúüvwxyz]*?[aeiouáéíóúü][bcdfghjklmnñpqrstvwxyz])
So the complete rule defining R2 in the Spanish stemming file is:
Rule: 2
Step: 2
CondWord: nocond
CondR1: nocond
CondR2: nocond
CondRV: nocond
Match: (^[aábcdeéfghiíjklmnñoópqrstuúüvwxyz]*?[aeiouáéíóúü][bcdfghjklmnñpqrstvwxyz][aábcdeéfghiíjklmnñoópqrstuúüvwxyz]*?[aeiouáéíóúü][bcdfghjklmnñpqrstvwxyz])
Replace:
StepAfter: 3
When applied to the word TORMENTAS, the regular expression defined in the Match column will return
TORMEN. The matched part (TORMEN) is then replaced by nothing since the string declared in the Replace
column is null. As a result the word is transformed from TORMENTAS into TAS (the desired region). This result
is stored internally as the second defined region of the processed word.
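The R2 examples can be verified the same way. The sketch below uses Python's re module in place of the PCRE engine; region_r2 is a hypothetical name, and treating a non-matching word as having a null region is an assumption taken from the definition.

```python
import re

LETTERS = "aábcdeéfghiíjklmnñoópqrstuúüvwxyz"
VOWELS = "aeiouáéíóúü"
CONSONANTS = "bcdfghjklmnñpqrstvwxyz"

# The vowel + non-vowel pattern applied twice, as in the R2 rule.
R2_MATCH = (f"(^[{LETTERS}]*?[{VOWELS}][{CONSONANTS}]"
            f"[{LETTERS}]*?[{VOWELS}][{CONSONANTS}])")


def region_r2(word):
    """Return what is left after removing the matched prefix, or "" if none."""
    word = word.lower()
    found = re.search(R2_MATCH, word)
    return word[found.end():] if found else ""


for sample in ("TORMENTAS", "TOROS", "CHE"):
    print(sample, repr(region_r2(sample)))
```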
3.1.3 Step 3 - Spanish Third Region
Definition
● If the second letter is a consonant, RV is the region after the next following vowel.
● If the first two letters are vowels, RV is the region after the next consonant.
● Otherwise (consonant-vowel case) RV is the region after the third letter.
● Finally, RV is the end of the word if the positions described above cannot be found.
Examples
● The RV value for the word TORMENTAS is MENTAS.
● The RV value for the word TOROS is OS.
● The RV value for the word CHE is NULL.
Rule
Since the definition is composed of several parts, the expression will be defined by steps.
The first condition, which states that “if the second letter is a consonant, RV is the region after the next
following vowel”, can be described by the following expression:
(.[bcdfghjklmnñpqrstvwxyz][bcdfghjklmnñpqrstvwxyz]*?[aeiouáéíóúü])
The second condition, which states that “if the first two letters are vowels, RV is the region after the next
consonant”, can be described by the following expression:
([aeiouáéíóúü][aeiouáéíóúü][aeiouáéíóúü]*?[bcdfghjklmnñpqrstvwxyz])
The last condition, which states that “(consonant-vowel case) RV is the region after the third letter”, can be
described by the following expression:
([bcdfghjklmnñpqrstvwxyz][aeiouáéíóúü].)
So the final expression can be defined as follows:
(.[bcdfghjklmnñpqrstvwxyz][bcdfghjklmnñpqrstvwxyz]*?[aeiouáéíóúü])|([aeiouáéíóúü][aeiouáéíóúü][aeiouáéíóúü]*?[bcdfghjklmnñpqrstvwxyz])|([bcdfghjklmnñpqrstvwxyz][aeiouáéíóúü].)
So the complete rule defining RV in the Spanish stemming file is:
Rule: 3
Step: 3
CondWord: nocond
CondR1: nocond
CondR2: nocond
CondRV: nocond
Match: (.[bcdfghjklmnñpqrstvwxyz][bcdfghjklmnñpqrstvwxyz]*?[aeiouáéíóúü])|([aeiouáéíóúü][aeiouáéíóúü][aeiouáéíóúü]*?[bcdfghjklmnñpqrstvwxyz])|([bcdfghjklmnñpqrstvwxyz][aeiouáéíóúü].)
Replace:
StepAfter: 4
When applied to the word TORMENTAS, the regular expression defined in the Match column will return TOR.
Then the matched part (TOR) is replaced by nothing since the string declared in the Replace column is null. As
a result, the word is transformed from TORMENTAS into MENTAS (the desired region). This result is stored
internally as the RV of the processed word.
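The RV examples can also be reproduced in Python, again as a stand-in for the PCRE engine. In this sketch region_rv is a hypothetical name; the expression is tried at the beginning of the word only (re.match), which is an assumption, since the three conditions all refer to the first letters; a word on which nothing matches gets the null region.

```python
import re

VOWELS = "aeiouáéíóúü"
CONSONANTS = "bcdfghjklmnñpqrstvwxyz"

RV_MATCH = (
    f"(.[{CONSONANTS}][{CONSONANTS}]*?[{VOWELS}])"   # second letter is a consonant
    f"|([{VOWELS}][{VOWELS}][{VOWELS}]*?[{CONSONANTS}])"  # first two letters are vowels
    f"|([{CONSONANTS}][{VOWELS}].)"                  # consonant-vowel case
)


def region_rv(word):
    """Return what is left after removing the matched prefix, or "" if none."""
    word = word.lower()
    found = re.match(RV_MATCH, word)
    return word[found.end():] if found else ""


for sample in ("TORMENTAS", "TOROS", "CHE", "aire"):
    print(sample, repr(region_rv(sample)))
```

The word "aire" is an extra illustration of the second condition: its first two letters are vowels, so RV starts after the next consonant.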
Conclusion
Three steps are needed to define the three regions used in Spanish. The accented characters, which are
usually handled in step 0 (during pre-processing), are needed in the stemming rules of the Spanish language
and as such will only be removed at the very end of the algorithm. The number of regions defined for a
language will determine the rules syntax. In Spanish, three regions have been defined thus determining the
following syntax:
Rule Step CondWord CondR1 CondR2 CondRV Match Replace StepAfter
3.2 Main Stemming Rules for the Spanish Language
3.2.1 Step Rule
Definition
A special rule, named Step Rule, must be created to define which step to go to if no rule has been applied in the
current step.
Example
The rule stating that, if no rule in step 8 has been matched, the next step to apply is step 12 would
be defined as follows:
Rule: 108
Step: 8
CondWord: nocond
CondR1: nocond
CondR2: nocond
CondRV: nocond
Match: nocond
Replace: nocond
StepAfter: 12
Note
This rule is mandatory for every step. Even if the step to go to is the following one (for example from step 8 to
step 9), the rule must be defined.
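The role of the Step Rule is easier to see in a toy interpreter. The Python sketch below is a simplified model built on assumptions, not the actual engine: run_steps is a hypothetical name, region conditions are left out, at most one rule is applied per step, the Step Rule is expected to be the last rule of its step, and processing stops when StepAfter points to a step that has no rules. The rules other than rule 108 are invented for the demonstration.

```python
import re
from collections import namedtuple

Rule = namedtuple("Rule", "rule step cond_word match replace step_after")
NOCOND = "nocond"


def run_steps(word, rules, first_step):
    """Walk through the steps; return the final word and the visited steps."""
    by_step = {}
    for r in rules:
        by_step.setdefault(r.step, []).append(r)
    step, visited = first_step, []
    while step in by_step:
        visited.append(step)
        next_step = None
        for r in by_step[step]:
            if r.match == NOCOND:  # Step Rule: no rule applied in this step
                next_step = r.step_after
                break
            if r.cond_word != NOCOND and not re.search(r.cond_word, word):
                continue
            if re.search(r.match, word):
                word = re.sub(r.match, r.replace, word, count=1)
                next_step = r.step_after
                break
        if next_step is None:
            raise ValueError(f"step {step} has no Step Rule")
        if next_step <= step:
            raise ValueError("StepAfter must move forward in this sketch")
        step = next_step
    return word, visited


RULES = [
    Rule(100, 8, NOCOND, "es$", "", 9),         # invented rule
    Rule(108, 8, NOCOND, NOCOND, NOCOND, 12),   # Step Rule from the example
    Rule(109, 9, NOCOND, NOCOND, NOCOND, 12),   # Step Rule of step 9
    Rule(120, 12, NOCOND, "s$", "", 13),        # invented rule
    Rule(121, 12, NOCOND, NOCOND, NOCOND, 13),  # Step Rule of step 12
]

print(run_steps("toros", RULES, 8))
print(run_steps("flores", RULES, 8))
```

In this model, "toros" skips step 9 because no rule of step 8 applies and the Step Rule sends it to step 12, while "flores" goes through step 9 because rule 100 was applied.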
3.2.2 Phase A
Goal
This step handles the attached pronoun by identifying in the word the longest suffix from the following list ME
SE SELA SELO SELAS SELOS LA LE LO LAS LES LOS NOS and suppressing the identified suffix from the
processed word.
However, those suffixes are deleted only under specific conditions:
● only if they appear after one of the following:
○ a) iéndo ándo ár ér ír
○ b) ando iendo ar er ir
○ c) yendo preceded by u
● a), b) or c) must appear in RV. For c), however, the U does not have to be in RV.
● Finally, for all a) forms, the accents must be removed (for example IÉNDO -> IENDO).
To sum up, there are three different stages in this step:
● first, identifying if the word that will be stemmed finishes with one of the suffixes from the provided list.
● then, identifying if this suffix comes after one of three given conditions.
● finally, if the matched condition is one of the accented forms, removing the accent.
Each stage corresponds to a group of rules or step.
First Stage
Definition
The first condition detects if the word contains one of the suffixes in the following list:
me se sela selo selas selos la le lo las les los nos
If more than one suffix can be identified in the word, the longest one is selected and removed. To facilitate the
identification of the longest suffix first, the list is resorted by decreasing length. The resorted list is:
selos selas selo sela nos los les las se me lo le la
Rule
This stage can be described by the following rule for the suffix SELOS:
Rule: 4
Step: 4
CondWord: ^.*selos$
CondR1: nocond
CondR2: nocond
CondRV: nocond
Match:
Replace:
StepAfter: 5
Rather than creating one rule per suffix, that is 13 rules, a single rule can be created with a disjunctive
expression. CondWord can then be defined as follows:
CondWord: (^.*selos$|^.*selas$|^.*selo$|^.*sela$|^.*nos$|^.*los$|^.*les$|^.*las$|^.*se$|^.*me$|^.*lo$|^.*le$|^.*la$)
Since the goal of this rule is to determine if the word contains the suffix defined in CondWord, the values for the
columns Match and Replace are null.
The following rule, in this case Rule 5, will be a Step Rule from step 4 to a step yet to be defined (it can only be
determined once Phase A has been completely defined).
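Sorting the suffixes and writing the disjunctive expression by hand is error-prone for longer lists. As a convenience, the expression can be generated, for example with the Python sketch below. The helper name build_cond_word is hypothetical; ties between suffixes of equal length are broken in reverse alphabetical order, which happens to reproduce the re-sorted list above but has no effect on matching.

```python
import re

SUFFIXES = "me se sela selo selas selos la le lo las les los nos".split()


def build_cond_word(suffixes):
    """Sort by decreasing length and join into one anchored alternation."""
    ordered = sorted(suffixes, key=lambda s: (len(s), s), reverse=True)
    return "(" + "|".join(f"^.*{s}$" for s in ordered) + ")"


COND_WORD = build_cond_word(SUFFIXES)
print(COND_WORD)
print(bool(re.search(COND_WORD, "dándoselos")), bool(re.search(COND_WORD, "casa")))
```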
Second Stage
Definition
The identified suffix appears after one of the following word parts:
1. IÉNDO ÁNDO ÁR ÉR ÍR
2. ANDO IENDO AR ER IR
3. YENDO preceded by U
when they are present in RV.
Because of the specific problem of the letter U that can precede YENDO but not necessarily in RV, this
particular case will be handled in an additional rule.
Rule
The second stage can be defined by the following rule:
Rule: 6
Step: 5
CondWord: nocond
CondR1: nocond
CondR2: nocond
CondRV: (^.*iéndo.*$|^.*ándo.*$|^.*ár.*$|^.*ér.*$|^.*ír.*$|^.*ando.*$|^.*iendo.*$|^.*ar.*$|^.*er.*$|^.*ir.*$)
Match: (selos$|selas$|selo$|sela$|nos$|los$|les$|las$|se$|me$|lo$|le$|la$)
Replace:
StepAfter: 6
This rule is relatively simple; if one of the conditions defined in CondRV is matched then the suffix identified in
Match will be replaced by a null value. So it simply deletes the suffix if the condition is verified.
For the specific (c) condition, the rule can be defined as follows:
Rule: 7
Step: 5
CondWord: ^.*uyendo.*$
CondR1: nocond
CondR2: nocond
CondRV: ^.*yendo.*$
Match: (selos$|selas$|selo$|sela$|nos$|los$|les$|las$|se$|me$|lo$|le$|la$)
Replace:
StepAfter: 6
The following rule, in this case Rule 8, will be a Step Rule from step 5 to a step yet to be defined (it can only be
determined once Phase A has been entirely defined).
Last Stage
Definition
If the word part corresponding to the RV condition is accented, then the accent must be removed.
Rule
A rule needs to be created for each accent removal. For example, for IÉNDO the rule would be defined as
follows:
Rule: 9
Step: 6
CondWord: ^.*iéndo.*$
CondR1: nocond
CondR2: nocond
CondRV: nocond
Match: iéndo
Replace: iendo
StepAfter: 7
The last rule, in this case Rule 14, will be a Step Rule from step 6 to a step yet to be defined (it can only be
determined once Phase A has been entirely defined).
Step 7 will be the first step of Phase B. As a result, the value for the StepAfter field in all the Step Rules defined
in this section should be set to 7.
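The three stages of Phase A can be put together in a short Python sketch to see their combined effect on a word. It is an approximation under stated assumptions: phase_a and region_rv are hypothetical names, Python's re module stands in for the PCRE engine, the RV condition is a containment test as in the CondRV expressions above (looser than "immediately before the suffix"), and the suffix alternation is written without a leading ^.* so that only the pronoun itself is removed.

```python
import re

VOWELS = "aeiouáéíóúü"
CONSONANTS = "bcdfghjklmnñpqrstvwxyz"
RV_MATCH = (f"(.[{CONSONANTS}][{CONSONANTS}]*?[{VOWELS}])"
            f"|([{VOWELS}][{VOWELS}][{VOWELS}]*?[{CONSONANTS}])"
            f"|([{CONSONANTS}][{VOWELS}].)")

PRONOUNS = "selos selas selo sela nos los les las se me lo le la".split()
SUFFIX = "(" + "|".join(p + "$" for p in PRONOUNS) + ")"
COND_RV = "(iéndo|ándo|ár|ér|ír|ando|iendo|ar|er|ir)"
ACCENTS = {"iéndo": "iendo", "ándo": "ando", "ár": "ar", "ér": "er", "ír": "ir"}


def region_rv(word):
    found = re.match(RV_MATCH, word)
    return word[found.end():] if found else ""


def phase_a(word):
    # First stage: does the word end with an attached pronoun?
    if not re.search(SUFFIX, word):
        return word
    # Second stage: one of the conditions a), b) or c) must hold in RV.
    rv = region_rv(word)
    general = re.search(COND_RV, rv) is not None
    yendo = "uyendo" in word and "yendo" in rv
    if not (general or yendo):
        return word
    word = re.sub(SUFFIX, "", word, count=1)
    # Last stage: remove the accent of the first accented form found.
    for accented, plain in ACCENTS.items():
        if accented in word:
            return word.replace(accented, plain, 1)
    return word


for sample in ("haciéndola", "comprándolo", "construyendolo", "velas"):
    print(sample, "->", phase_a(sample))
```

The word "velas" ends with the suffix "las" but none of the conditions holds in its RV ("as"), so it is left unchanged.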
3.2.3 Phase B
3.2.3.1 First Stage
This step handles the standard suffix removal.
Description
The goal is to identify in the word the longest suffix from the following list in order to suppress it from the
processed word:
anza anzas ico ica icos icas ismo ismos able ables ible ibles ista istas oso osa
osos osas amiento amientos imiento imientos
To facilitate the identification of the longest suffix first, the list is resorted by decreasing length. The resorted
list is:
imientos amientos imiento amiento istas ismos ibles anzas ables osos osas ista ismo
icos icas ible anza able oso osa ico ica
Those suffixes will be deleted only if they appear in R2.
Rule
This first stage can be defined by the following rule:
Rule: 15
Step: 7
CondWord: nocond
CondR1: nocond
CondR2: (^.*imientos$|^.*amientos$|^.*imiento$|^.*amiento$|^.*istas$|^.*ismos$|^.*ibles$|^.*anzas$|^.*ables$|^.*osos$|^.*osas$|^.*ista$|^.*ismo$|^.*icos$|^.*icas$|^.*ible$|^.*anza$|^.*able$|^.*oso$|^.*osa$|^.*ico$|^.*ica$)
CondRV: nocond
Match: (imientos$|amientos$|imiento$|amiento$|istas$|ismos$|ibles$|anzas$|ables$|osos$|osas$|ista$|ismo$|icos$|icas$|ible$|anza$|able$|oso$|osa$|ico$|ica$)
Replace:
StepAfter: ??
If one of the conditions defined in CondR2 is matched, then the suffix identified in Match will be replaced by a
NULL value. So each rule simply deletes the suffix if the condition is verified.
Because the number of steps contained in each part of Phase B is still unknown, the column StepAfter will only
be defined after the creation of the last rule in the last stage.
3.2.3.2 Second Stage
Description
The goal is to identify in the word the longest suffix from the following list in order to suppress it from the
processed word:
adora ador ación adoras adores aciones ante antes ancia ancias
To facilitate the identification of the longest suffix first, the list is resorted by decreasing length. The resorted
list is:
aciones adores adoras ancias antes adora ancia ación ante ador
Those suffixes will be deleted only if they appear in R2. Moreover if IC appears in R2 just before the suffix, it is
also deleted.
Rule
This stage can be described by the following rule:
Rule: 16
Step: 7
CondWord: nocond
CondR1: nocond
CondR2: (^.*aciones$|^.*adores$|^.*adoras$|^.*ancias$|^.*antes$|^.*adora$|^.*ancia$|^.*ación$|^.*ante$|^.*ador$)
CondRV: nocond
Match: ((ic)?aciones$|(ic)?adores$|(ic)?adoras$|(ic)?ancias$|(ic)?antes$|(ic)?adora$|(ic)?ancia$|(ic)?ación$|(ic)?ante$|(ic)?ador$)
Replace:
StepAfter: ??
Because the number of steps contained in each part of Phase B is still unknown, the column StepAfter will only
be defined after the creation of the last rule in the last stage.
3.2.3.3 Third Stage
Description
The goal is to identify in the word the longest suffix from the following list in order to suppress it from the
processed word:
logía logías
To facilitate the identification of the longest suffix first, the list is resorted by decreasing length. The resorted
list is:
logías logía
If the identified suffix appears in R2 then it is replaced by LOG.
Rule
This stage can be defined by the following rule:
Rule: 17
Step: 7
CondWord: nocond
CondR1: nocond
CondR2: (^.*logías$|^.*logía$)
CondRV: nocond
Match: (logías$|logía$)
Replace: log
StepAfter: ??
Because the number of steps contained in each part of Phase B is still unknown, the column StepAfter will only
be defined after the creation of the last rule in the last stage.
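The interplay between CondR2 and Match in this rule can be tried out in Python, used as a stand-in for the PCRE engine. The sketch below makes some assumptions: region_r2 and apply_rule_17 are hypothetical names, R2 is computed with the expression from the region definition, and a word without R2 is given the null region. The two sample words are chosen for illustration only.

```python
import re

LETTERS = "aábcdeéfghiíjklmnñoópqrstuúüvwxyz"
VOWELS = "aeiouáéíóúü"
CONSONANTS = "bcdfghjklmnñpqrstvwxyz"
R2_MATCH = (f"(^[{LETTERS}]*?[{VOWELS}][{CONSONANTS}]"
            f"[{LETTERS}]*?[{VOWELS}][{CONSONANTS}])")

COND_R2 = "(^.*logías$|^.*logía$)"
MATCH = "(logías$|logía$)"
REPLACE = "log"


def region_r2(word):
    found = re.search(R2_MATCH, word)
    return word[found.end():] if found else ""


def apply_rule_17(word):
    """Replace the suffix by "log" only when the suffix lies in R2."""
    if not re.search(COND_R2, region_r2(word)):
        return word
    return re.sub(MATCH, REPLACE, word, count=1)


print(apply_rule_17("antropologías"))  # R2 is "ologías": the rule applies
print(apply_rule_17("biologías"))      # R2 is "ías": the word is unchanged
```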
3.2.3.4 Fourth Stage
Description
The goal is to identify in the word the longest suffix from the following list in order to suppress it from the
processed word:
ución uciones
To facilitate the identification of the longest suffix first, the list is resorted by decreasing length. The resorted
list is:
uciones ución
If the identified suffix appears in R2 then it is replaced by U.
Rule
This stage can be defined by the following rule:
Rule: 18
Step: 7
CondWord: nocond
CondR1: nocond
CondR2: (^.*uciones$|^.*ución$)
CondRV: nocond
Match: (uciones$|ución$)
Replace: u
StepAfter: ??
Because the number of steps contained in each part of Phase B is still unknown, the column StepAfter will only
be defined after the creation of the last rule in the last stage.
3.2.3.5 Fifth Stage
Description
The goal is to identify in the word the longest suffix from the following list in order to suppress it from the
processed word:
encia encias
To facilitate the identification of the longest suffix first, the list is resorted by decreasing length. The resorted
list is:
encias encia
If the identified suffix appears in R2 then it is replaced by ENTE.
Rule
This stage can be defined by the following rule:
Rule: 19
Step: 7
CondWord: nocond
CondR1: nocond
CondR2: (^.*encias$|^.*encia$)
CondRV: nocond
Match: (encias$|encia$)
Replace: ente
StepAfter: ??
Because the number of steps contained in each part of Phase B is still unknown, the column StepAfter will only
be defined after the creation of the last rule in the last stage.
3.2.3.6 Sixth Stage
Description
The goal is to identify and suppress the suffix AMENTE from the processed word.
This suffix will only be deleted if it appears in R2.
Moreover if ATIV, IV, OS, IC or AD appear in R2 just before the suffix, they are also deleted.
Rule
This stage can be defined by the following rule:
Rule: 20
Step: 7
CondWord: nocond
CondR1: nocond
CondR2: (((ativ)?|(iv)?|(os)?|(ic)?|(ad)?)amente$)
CondRV: nocond
Match: (((ativ)?|(iv)?|(os)?|(ic)?|(ad)?)amente$)
Replace:
StepAfter: ??
Because the number of steps contained in each part of Phase B is still unknown, the column StepAfter will only
be defined after the creation of the last rule in the last stage.
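To see which part of a text the pattern of Rule 20 captures, the expression can be tried with Python's re module. This is only a sketch: the function name is hypothetical, and it is an assumption that the product's regular expression engine behaves like Python's for this pattern (leftmost match, alternatives tried in order).

```python
import re

# Pattern copied from the CondR2 and Match fields of Rule 20.
AMENTE = re.compile(r"(((ativ)?|(iv)?|(os)?|(ic)?|(ad)?)amente$)")

def matched_suffix(text):
    """Return the part of text captured by the pattern, or None."""
    found = AMENTE.search(text)
    return found.group(0) if found else None

print(matched_suffix("relativamente"))  # optional ATIV is captured with the suffix
print(matched_suffix("rápidamente"))    # no listed prefix, only AMENTE is captured
print(matched_suffix("casa"))           # the pattern does not apply
```

Because the search returns the leftmost match, the longer prefix ATIV is captured in preference to IV when both would fit.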
3.2.3.7 Seventh Stage
Description
The goal is to identify and suppress the suffix MENTE from the processed word.
This suffix will only be deleted if it appears in R2.
Moreover if ANTE, ABLE or IBLE appear in R2 just before the suffix, they are also deleted.
Rule
This stage can be defined by the following rule:
Rule: 21
Step: 7
CondWord: nocond
CondR1: nocond
CondR2: (((ante)?|(able)?|(ible)?)mente$)
CondRV: nocond
Match: (((ante)?|(able)?|(ible)?)mente$)
Replace:
StepAfter: ??
Because the number of steps contained in each part of Phase B is still unknown, the column StepAfter will only
be defined after the creation of the last rule in the last stage.
3.2.3.8 Eighth Stage
Description
The goal is to identify in the word the longest suffix from the following list in order to suppress it from the
processed word:
idad idades
To facilitate the identification of the longest suffix first, the list is resorted by decreasing length. The resorted
list is:
idades idad
This suffix will only be deleted if it appears in R2.
Moreover if ABIL, IC or IV appear in R2 just before the suffix, they are also deleted.
Rule
This stage can be defined by the following rule:
Rule: 22
Step: 7
CondWord: nocond
CondR1: nocond
CondR2: ((((abil)?|(ic)?|(iv)?))(idades$|idad$))
CondRV: nocond
Match: ((((abil)?|(ic)?|(iv)?))(idades$|idad$))
Replace:
StepAfter: ??
Because the number of steps contained in each part of Phase B is still unknown, the column StepAfter will only
be defined after the creation of the last rule in the last stage.
3.2.3.9 Ninth Stage
Description
The goal is to identify in the word the longest suffix from the following list in order to suppress it from the
processed word:
iva ivo ivas ivos
To facilitate the identification of the longest suffix first, the list is resorted by decreasing length. The resorted
list is:
ivos ivas ivo iva
This suffix will only be deleted if it appears in R2.
Moreover if AT appears in R2 just before the suffix, it is also deleted.
Rule
This stage can be defined by the following rule:
Rule: 23
Step: 7
CondWord: nocond
CondR1: nocond
CondR2: (((at)?)(ivos$|ivas$|ivo$|iva$))
CondRV: nocond
Match: (((at)?)(ivos$|ivas$|ivo$|iva$))
Replace:
StepAfter: ??
Because the number of steps contained in each part of Phase B is still unknown, the column StepAfter will only
be defined after the creation of the last rule in the last stage.
3.2.3.10 Last Stage
Description
If a rule of Phase B has been applied, the next step is Phase D; otherwise, the next step is Phase C.
Since the step number at which the Phase D rules start is not yet known, the StepAfter field of all Phase B rules
will be filled in after this stage.
Rule
This stage can be defined by the following rule:
Rule: 24
Step: 7
CondWord: nocond
CondR1: nocond
CondR2: nocond
CondRV: nocond
Match: nocond
Replace: nocond
StepAfter: 8
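The role of the Step and StepAfter fields can be illustrated with a small Python sketch. It rests on several assumptions that are not stated in this guide: that the rules of a step are tried in order, that the first rule whose conditions all hold is applied, and that processing then jumps to its StepAfter. The function names are hypothetical, the region strings are supplied by hand for illustration, and Rule 18 is shown with StepAfter 9, the value given later for the completed Phase B rules.

```python
import re

def holds(cond, text):
    """A condition holds when it is 'nocond' or when its pattern matches text."""
    return cond == "nocond" or re.search(cond, text) is not None

def apply_step(word, regions, rules, step):
    """Try the rules of one step in order and return (new_word, next_step)."""
    for rule in rules:
        if rule["Step"] != step:
            continue
        if (holds(rule["CondWord"], word)
                and holds(rule["CondR1"], regions["R1"])
                and holds(rule["CondR2"], regions["R2"])
                and holds(rule["CondRV"], regions["RV"])):
            if rule["Match"] != "nocond":
                word = re.sub(rule["Match"], rule["Replace"], word)
            return word, rule["StepAfter"]
    return word, None  # no rule of this step applied

RULES = [
    {"Rule": 18, "Step": 7, "CondWord": "nocond", "CondR1": "nocond",
     "CondR2": r"(^.*uciones$|^.*ución$)", "CondRV": "nocond",
     "Match": r"(uciones$|ución$)", "Replace": "u", "StepAfter": 9},
    {"Rule": 24, "Step": 7, "CondWord": "nocond", "CondR1": "nocond",
     "CondR2": "nocond", "CondRV": "nocond",
     "Match": "nocond", "Replace": "nocond", "StepAfter": 8},
]

# Region strings are illustrative values entered by hand.
print(apply_step("revolución", {"R1": "olución", "R2": "ución", "RV": "olución"}, RULES, 7))
print(apply_step("casa", {"R1": "a", "R2": "", "RV": "a"}, RULES, 7))
```

In this sketch a word handled by a Phase B rule jumps straight to Step 9 (Phase D), while any other word falls through to the unconditional Rule 24 and continues with Step 8 (Phase C).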
3.2.4 Phase C
3.2.4.1 First Stage
Description
The goal is to suppress the verb suffixes beginning with Y.
The longest suffix from the following list must be identified in order to suppress it from the processed word:
ya ye yan yen yeron yendo yo yó yas yes yais yamos
To facilitate the identification of the longest suffix first, the list is resorted by decreasing length. The resorted
list is:
yamos yeron yendo yais yes yen yas yan yo yó ye ya
This suffix will only be deleted if it appears in RV and is preceded by the letter U in the word. The letter U
itself is not deleted, and this applies whether the letter U is in RV or not.
Rule
This stage can be defined by the following rule:
Rule: 25
Step: 8
CondWord: ((uyamos$)|(uyeron$)|(uyendo$)|(uyais$)|(uyes$)|(uyen$)|(uyas$)|(uyan$)|
(uyo$)|(uyó$)|(uye$)|(uya$))
CondR1: nocond
CondR2: nocond
CondRV: ((yamos$)|(yeron$)|(yendo$)|(yais$)|(yes$)|(yen$)|(yas$)|(yan$)|(yo$)|
(yó$)|(ye$)|(ya$))
Match: ((yamos$)|(yeron$)|(yendo$)|(yais$)|(yes$)|(yen$)|(yas$)|(yan$)|(yo$)|
(yó$)|(ye$)|(ya$))
Replace:
StepAfter: ??
Because the number of steps contained in each part of Phase C is still unknown, the column StepAfter will only
be defined after the creation of the last rule in the last stage.
3.2.4.2 Second Stage
Description
The goal is to identify the other verb suffixes and delete them. The longest suffix from the following list must be
identified in order to remove it from the processed word:
en es éis emos arían arías arán arás aríais aría aréis aríamos aremos ará aré erían
erías erán erás eríais ería eréis eríamos eremos erá eré irían irías irán irás
iríais iría iréis iríamos iremos irá iré aba ada ida ía ara iera ad ed id ase iese
aste iste an aban ían aran ieran asen iesen aron ieron ado ido ando iendo ió ar er
ir as abas adas idas ías aras ieras ases ieses ís áis abais íais arais ierais aseis
ieseis asteis isteis ados idos amos ábamos íamos imos áramos iéramos iésemos ásemos
This suffix will be removed only when appearing in RV. Some of the suffixes from the given list need special
treatment. Indeed EN, ES, ÉIS and EMOS must be removed but if one of these suffixes is preceded by GU (not
necessarily in RV) then the U must also be deleted. Those specific suffixes will be handled in a separate rule.
Basic Suffixes
The longest suffix from the following list must be identified in order to remove it from the processed word:
arían arías arán arás aríais aría aréis aríamos aremos ará aré erían erías erán
erás eríais ería eréis eríamos eremos erá eré irían irías irán irás iríais iría
iréis iríamos iremos irá iré aba ada ida ía ara iera ad ed id ase iese aste iste an
aban ían aran ieran asen iesen aron ieron ado ido ando iendo ió ar er ir as abas
adas idas ías aras ieras ases ieses ís áis abais íais arais ierais aseis ieseis
asteis isteis ados idos amos ábamos íamos imos áramos iéramos iésemos ásemos
To facilitate the identification of the longest suffix first, the list is resorted by decreasing length. The resorted
list is:
iríamos iésemos iéramos eríamos aríamos isteis iríais iremos ieseis ierais eríais
eremos ásemos áramos ábamos asteis aríais aremos íamos irías irían iréis ieses
iesen ieron ieras ieran iendo erías erían eréis aseis arías arían arais aréis abais
íais iste iría irás irán imos iese iera idos idas ería erás erán aste ases asen
aron aría arás arán aras aran ando amos ados adas abas aban ías ían iré irá ido ida
eré erá áis ase aré ará ara ado ada aba ís ía ir ió id er ed as ar an ad
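Sorting a list of this size by hand is error-prone, so the re-sorted list and the corresponding pattern can also be generated. The following Python sketch is one possible way to do it; the function name is hypothetical and the script is not part of the product. Suffixes of equal length keep their original relative order, which does not affect the result because only the decreasing length matters.

```python
BASIC_SUFFIXES = """
arían arías arán arás aríais aría aréis aríamos aremos ará aré erían erías erán
erás eríais ería eréis eríamos eremos erá eré irían irías irán irás iríais iría
iréis iríamos iremos irá iré aba ada ida ía ara iera ad ed id ase iese aste iste an
aban ían aran ieran asen iesen aron ieron ado ido ando iendo ió ar er ir as abas
adas idas ías aras ieras ases ieses ís áis abais íais arais ierais aseis ieseis
asteis isteis ados idos amos ábamos íamos imos áramos iéramos iésemos ásemos
""".split()

def build_alternation(suffixes):
    """Sort suffixes by decreasing length and join them as (s1$|s2$|...)."""
    ordered = sorted(suffixes, key=len, reverse=True)
    return "(" + "|".join(suffix + "$" for suffix in ordered) + ")"

# Small list from the special-treatment suffixes, then the full basic list.
print(build_alternation(["en", "es", "éis", "emos"]))
print(build_alternation(BASIC_SUFFIXES))
```

The generated string can then be pasted into the CondRV and Match fields of the rule.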
Rule for Basic Suffixes
This stage can be defined by the following rule:
Rule: 26
Step: 8
CondWord: nocond
CondR1: nocond
CondR2: nocond
CondRV: (iríamos$|iésemos$|iéramos$|eríamos$|aríamos$|isteis$|iríais$|iremos$|
ieseis$|ierais$|eríais$|eremos$|ásemos$|áramos$|ábamos$|asteis$|aríais$|aremos$|
íamos$|irías$|irían$|iréis$|ieses$|iesen$|ieron$|ieras$|ieran$|iendo$|erías$|
erían$|eréis$|aseis$|arías$|arían$|arais$|aréis$|abais$|íais$|iste$|iría$|irás$|
irán$|imos$|iese$|iera$|idos$|idas$|ería$|erás$|erán$|aste$|ases$|asen$|aron$|
aría$|arás$|arán$|aras$|aran$|ando$|amos$|ados$|adas$|abas$|aban$|ías$|ían$|iré$|
irá$|ido$|ida$|eré$|erá$|áis$|ase$|aré$|ará$|ara$|ado$|ada$|aba$|ís$|ía$|ir$|ió$|
id$|er$|ed$|as$|ar$|an$|ad$)
Match: (iríamos$|iésemos$|iéramos$|eríamos$|aríamos$|isteis$|iríais$|iremos$|
ieseis$|ierais$|eríais$|eremos$|ásemos$|áramos$|ábamos$|asteis$|aríais$|aremos$|
íamos$|irías$|irían$|iréis$|ieses$|iesen$|ieron$|ieras$|ieran$|iendo$|erías$|
erían$|eréis$|aseis$|arías$|arían$|arais$|aréis$|abais$|íais$|iste$|iría$|irás$|
irán$|imos$|iese$|iera$|idos$|idas$|ería$|erás$|erán$|aste$|ases$|asen$|aron$|
aría$|arás$|arán$|aras$|aran$|ando$|amos$|ados$|adas$|abas$|aban$|ías$|ían$|iré$|
irá$|ido$|ida$|eré$|erá$|áis$|ase$|aré$|ará$|ara$|ado$|ada$|aba$|ís$|ía$|ir$|ió$|
id$|er$|ed$|as$|ar$|an$|ad$)
Replace:
StepAfter: ??
Because the number of steps contained in each part of Phase C is still unknown, the column StepAfter will only
be defined after the creation of the last rule in the last stage.
Suffixes Needing a Special Treatment
The longest suffix from the following list must be identified in order to remove it from the processed word and
to apply the specific condition described in the Second Stage description:
en es éis emos
To facilitate the identification of the longest suffix first, the list is resorted by decreasing length. The resorted
list is:
emos éis es en
Rule for Suffixes Needing a Special Treatment
This stage can be defined by the following rule:
Rule: 27
Step: 8
CondWord: (guemos$|guéis$|gues$|guen$)
CondR1: nocond
CondR2: nocond
CondRV: (emos$|éis$|es$|en$)
Match: (uemos$|uéis$|ues$|uen$)
Replace:
StepAfter: ??
and
Rule: 28
Step: 8
CondWord: nocond
CondR1: nocond
CondR2: nocond
CondRV: (emos$|éis$|es$|en$)
Match: (emos$|éis$|es$|en$)
Replace:
StepAfter: 9
In Phases B and C, the StepAfter field could not be filled for some of the rules defined. At this point, you know
that Step 9 will be the next step, so you can finish those rules. The rules needing to be completed are rules 15
to 23 in Phase B and rules 25 to 27 in Phase C.
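The special treatment described in the Second Stage (remove EN, ES, ÉIS or EMOS when it appears in RV, and also remove a preceding U when the suffix follows GU) can be sketched in Python as follows. The function name is hypothetical and the RV string is passed in by hand; the sketch follows the wording of the description and is not a transcription of the product's engine.

```python
import re

SPECIAL = r"(emos$|éis$|es$|en$)"

def strip_special_suffix(word, rv):
    """Remove EN/ES/ÉIS/EMOS found in rv; after GU, remove the U as well."""
    if not re.search(SPECIAL, rv):
        return word  # the suffix is not in RV, so nothing is removed
    if re.search(r"gu(emos$|éis$|es$|en$)", word):
        # The GU itself does not need to be in RV.
        return re.sub(r"u(emos$|éis$|es$|en$)", "", word)
    return re.sub(SPECIAL, "", word)

# RV values below are illustrative and entered by hand.
print(strip_special_suffix("paguemos", "uemos"))
print(strip_special_suffix("comemos", "emos"))
print(strip_special_suffix("mes", ""))
```

Testing the GU case first mirrors the order of the two rules: the more specific rule has to be tried before the general one.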
3.2.4.3 Last Stage
Description
This is a Step Rule leading to Phase D.
Rule
This stage can be defined by the following rule:
Rule: 29
Step: 8
CondWord: nocond
CondR1: nocond
CondR2: nocond
CondRV: nocond
Match: nocond
Replace: nocond
StepAfter: 9
3.2.5 Phase D
First Stage
Description
The goal is to suppress residual suffixes.
The longest suffix from the following list must be identified in order to suppress it from the processed word:
os a o á í ó
This suffix will only be deleted if it appears in RV.
This stage can be defined by the following rule:
Rule: 30
Step: 9
CondWord: nocond
CondR1: nocond
CondR2: nocond
CondRV: (os$|a$|o$|á$|í$|ó$)
Match: (os$|a$|o$|á$|í$|ó$)
Replace:
StepAfter: 10
Second Stage
Description
Any suffix from the following list must be identified in order to suppress it from the processed word:
e é
This suffix will only be deleted if it appears in RV.
If the identified suffix appears in RV preceded by GU, and if the U is in RV too, then the U should also be
deleted.
This stage can be defined by the following rules:
Rule: 31
Step: 9
CondWord: (gue$|gué$)
CondR1: nocond
CondR2: nocond
CondRV: (ue$|ué$)
Match: (ue$|ué$)
Replace:
StepAfter: 10
Rule: 32
Step: 9
CondWord: nocond
CondR1: nocond
CondR2: nocond
CondRV: (e$|é$)
Match: (e$|é$)
Replace:
StepAfter: 10
Last Stage
Description
This is a Step Rule leading to Phase E.
Rule
This stage can be defined by the following rule:
Rule: 33
Step: 9
CondWord: nocond
CondR1: nocond
CondR2: nocond
CondRV: nocond
Match: nocond
Replace: nocond
StepAfter: 10
3.2.6 Phase E
Description
The goal is to remove the acute accents listed below:
á é í ó ú
This stage can be defined by the following rule:
Rule: 34
Step: 10
CondWord: (á|é|í|ó|ú)
CondR1: nocond
CondR2: nocond
CondRV: nocond
Match: (á|é|í|ó|ú)
Replace: (a|e|i|o|u)
StepAfter: 11
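The effect intended by Phase E, replacing each accented vowel with its unaccented counterpart, can be illustrated with a character mapping in Python. This is only a sketch of the expected result for checking purposes; the function name is hypothetical and it says nothing about how the product interprets the Replace field.

```python
# Map each acute-accented vowel to the corresponding plain vowel.
ACUTE_TO_PLAIN = str.maketrans("áéíóú", "aeiou")

def remove_acute_accents(word):
    """Return word with á, é, í, ó, ú replaced by a, e, i, o, u."""
    return word.translate(ACUTE_TO_PLAIN)

print(remove_acute_accents("canción"))
print(remove_acute_accents("pingüino"))  # ü is not in the list and is kept
```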
3.3 Testing Your Stemming Rules
This part details the process to follow in order to test and validate your stemming rules.
Verification Process Overview
The goal of the verification process is to apply the stemming rules you have created on a specific data set and
to visualize their results in an output file. You will then be able to compare the roots created by your rules to
the roots identified and provided in the source file.
This process consists of three steps:
1. Creating the source file:
To check your stemming, you will need to create a file containing words for which you already know the
stem. This file will contain three columns:
○ Word, which contains the word to stem.
○ Expected Stem, which contains the known stem of the word.
○ Target, which can be empty. This column is used when applying a classification/regression or
segmentation/clustering model.
Note
Both columns Word and Expected Stem must contain only one word on each line.
2. Creating a model to apply:
You will then need to build a Data Manager – text coding model without added computing such as
classification/regression or segmentation/clustering. In KxShell creating a model would not be necessary
and you would be able to directly apply the stemming rules on your data. However with the graphical
interface you cannot apply a model which has not been created. Creating the model will also allow you to
set the language of the stemming rules and dictionary. Since no actual model is needed, you will only need
a few lines in the learning phase.
3. Applying the Stemming Rules:
You will select the columns to create in the output file, so that you can easily compare the Expected Stem
with the Computed Stem created by the application of your stemming rules on the source file.
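The source file described in step 1 could be produced with a short script. The sketch below makes several assumptions that this guide does not specify: a delimited text file with a header row, a semicolon as the delimiter, and a hypothetical function name. The two sample rows are placeholders; replace them with words whose stems you already know.

```python
import csv
import io

HEADER = ["Word", "Expected Stem", "Target"]

def write_source(stream, rows, delimiter=";"):
    """Write (word, expected_stem) pairs; the Target column is left empty."""
    writer = csv.writer(stream, delimiter=delimiter, lineterminator="\n")
    writer.writerow(HEADER)
    for word, stem in rows:
        # Word and Expected Stem must each contain exactly one word.
        if len(word.split()) != 1 or len(stem.split()) != 1:
            raise ValueError(f"expected single words, got {word!r} / {stem!r}")
        writer.writerow([word, stem, ""])

# Placeholder rows written to an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_source(buffer, [("metodología", "metodolog"), ("revolución", "revolu")])
print(buffer.getvalue())
```

To write an actual file, pass a stream opened with open(path, "w", newline="", encoding="utf-8") instead of the in-memory buffer; the encoding expected by the product is not stated here and should be checked.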
3.3.1 Creating the Model
1. Start SAP BusinessObjects Predictive Analytics.
2. Select the option Data Manager > Perform a Text Analysis.
3. Click the Next button.
The panel Add a Transform is displayed.
4. Select the option No Added Computing.
5. Use the following settings to create your model:
Task(s): Selecting a Cutting Strategy; Specifying the Data Source
Screen: Data to be Modeled
Settings:
○ Cutting strategy: Random
○ Select the option Text Source.
○ In the Source field, select the folder Samples/Data Manager – text coding/
○ In the Estimation field, select the file containing the words list to test.
Task(s): Sub-sampling the Data
Screen: Sub-Sampling Settings
Settings:
○ Select the option Line Selection.
○ Enter 10 in the field Last Line.
Task(s): Describing the Data
Screen: Data Description
Settings:
○ Use the Analyze option.
○ Set 'Word' as textual.
Task(s): Data Manager – text coding - Setting the Language Definition
Screen: Data Manager – text coding Parameters Settings
Settings:
○ Keep default Language Definition Repository (blank).
○ Select the User Defined Language option as the Language Recognition Mode.
○ In the combo box, select the language you want to test (es) as the User Defined Language.
Task(s): Data Manager – text coding - Setting the Advanced Parameters
Screen: Advanced…
Settings:
○ Set the Frequency Threshold to 0.
Task(s): Data Manager – text coding - Setting the Dictionary and Encoding Parameters
Screen: Data Manager – text coding Parameters Settings (2)
Settings:
○ Uncheck the option Stop Words Removing.
Once the model has been generated, the following panel is displayed:
6. Click the Next button. The Data Manager – text coding Stand Alone options are displayed.
3.3.2 Applying the Stemming Rules
The data set used as the apply data set is the same one that was used to create the model.
1. On the panel KTC Stand Alone Options, select the option Apply Model to a New Data Set.
2. Click the Next button.
The apply panel is displayed.
3. Select the file containing the word list in the Data field of section Application Data Set.
4. In the Generate drop-down list, select the Transactional option.
5. Define a name for the output file in the Data field of section Results Generated by the Model.
6. In the Generation Options section, click the button Advanced Apply Settings…
7. In the tab General Outputs, check the box Copy Input Variables.
8. Select the Individual option.
9. Click the >> button. The list of available variables is displayed.
10. Select the variable Expected Stem.
11. Click the > button to send it to the Selected list.
12. Click the Validate button.
13. Click the Apply button.
The generated file will contain the following columns:
○ Expected Stem, which has been copied from the source file,
○ Id, which contains the identifier for each line,
○ Textual_Field, which contains the original word from the source file,
○ Original_Case, which corresponds to the identifier of the line in the original source file,
○ Root, which contains the root extracted by the application of your stemming rules on the word.
By comparing the columns Expected Stem and Root, you will be able to check if the roots have been
correctly identified by your stemming rules.
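The comparison of the Expected Stem and Root columns can be automated. The following Python sketch assumes that the generated file is a delimited text file with a header row carrying the column names listed above and a semicolon as the delimiter; both points are assumptions to check against your actual output. The function name and the sample content are hypothetical.

```python
import csv
import io

def mismatches(stream, delimiter=";"):
    """Return (word, expected, computed) for each row where the stems differ."""
    reader = csv.DictReader(stream, delimiter=delimiter)
    return [(row["Textual_Field"], row["Expected Stem"], row["Root"])
            for row in reader
            if row["Expected Stem"] != row["Root"]]

# Hypothetical output content used in place of a real generated file.
SAMPLE = (
    "Expected Stem;Id;Textual_Field;Original_Case;Root\n"
    "metodolog;1;metodología;1;metodolog\n"
    "revolu;2;revolución;2;revolucion\n"
)

for word, expected, computed in mismatches(io.StringIO(SAMPLE)):
    print(f"{word}: expected {expected}, got {computed}")
```

An empty result means that every computed root equals the expected stem for the words in the file.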
Important Disclaimers and Legal Information
Coding Samples
Any software coding and/or code lines / strings ("Code") included in this documentation are only examples and are not intended to be used in a productive system
environment. The Code is only intended to better explain and visualize the syntax and phrasing rules of certain coding. SAP does not warrant the correctness and
completeness of the Code given herein, and SAP shall not be liable for errors or damages caused by the usage of the Code, unless damages were caused by SAP
intentionally or by SAP's gross negligence.
Accessibility
The information contained in the SAP documentation represents SAP's current view of accessibility criteria as of the date of publication; it is in no way intended to be
a binding guideline on how to ensure accessibility of software products. SAP in particular disclaims any liability in relation to this document. This disclaimer, however,
does not apply in cases of willful misconduct or gross negligence of SAP. Furthermore, this document does not result in any direct or indirect contractual obligations
of SAP.
Gender-Neutral Language
As far as possible, SAP documentation is gender neutral. Depending on the context, the reader is addressed directly with "you", or a gender-neutral noun (such as
"sales person" or "working days") is used. If when referring to members of both sexes, however, the third-person singular cannot be avoided or a gender-neutral noun
does not exist, SAP reserves the right to use the masculine form of the noun and pronoun. This is to ensure that the documentation remains comprehensible.
Internet Hyperlinks
The SAP documentation may contain hyperlinks to the Internet. These hyperlinks are intended to serve as a hint about where to find related information. SAP does
not warrant the availability and correctness of this related information or the ability of this information to serve a particular purpose. SAP shall not be liable for any
damages caused by the use of related information unless damages have been caused by SAP's gross negligence or willful misconduct. All links are categorized for
transparency (see: http://help.sap.com/disclaimer).
go.sap.com/registration/contact.html
© 2016 SAP SE or an SAP affiliate company. All rights reserved.
No part of this publication may be reproduced or transmitted in any
form or for any purpose without the express permission of SAP SE
or an SAP affiliate company. The information contained herein may
be changed without prior notice.
Some software products marketed by SAP SE and its distributors
contain proprietary software components of other software
vendors. National product specifications may vary.
These materials are provided by SAP SE or an SAP affiliate company
for informational purposes only, without representation or warranty
of any kind, and SAP or its affiliated companies shall not be liable for
errors or omissions with respect to the materials. The only
warranties for SAP or SAP affiliate company products and services
are those that are set forth in the express warranty statements
accompanying such products and services, if any. Nothing herein
should be construed as constituting an additional warranty.
SAP and other SAP products and services mentioned herein as well
as their respective logos are trademarks or registered trademarks
of SAP SE (or an SAP affiliate company) in Germany and other
countries. All other product and service names mentioned are the
trademarks of their respective companies.
Please see http://www.sap.com/corporate-en/legal/copyright/index.epx
for additional trademark information and notices.
