Set up in WCA

IBM Software Services, Support and Success
IBM Watson Group
Limiting Search Field Values in Search Results Pane
Watson Content Analytics
Author
[email protected]
[email protected],
Marshall Schor/Watson/IBM
Date
July 29, 2017
November 18, 2014
Version
0
0.1
Description of Issue
Search fields can be displayed in either the Facet tree or in the Details pane of the search results.
In WCA 3.5, to display the field, it must be set as Returnable in the Search Field Definitions:
It must also be added to the fields to display, via the user Preferences.
Note that in the search results, each occurrence is listed in the field display.
This contrasts with the Facet tree, which displays each value only once, together with a count. The
count, however, is not the number of occurrences of the value in the data corpus; rather, it is the
number of documents in which the value occurs at least once. In the example, the Field is Person and
the Value is William.
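The difference between the two counts can be illustrated outside WCA. The following is a minimal sketch in plain Java, not WCA code: the class FacetCountSketch, its methods, and the three sample documents are made up for illustration. It only assumes that each document contributes a list of extracted Person values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an indexed corpus (not a WCA API): each inner
// list holds the Person values found in one document.
class FacetCountSketch {

    // Made-up sample data: three documents.
    static List<List<String>> sampleCorpus() {
        List<List<String>> docs = new ArrayList<List<String>>();
        docs.add(Arrays.asList("William", "William", "William"));
        docs.add(Arrays.asList("William", "Mary"));
        docs.add(Arrays.asList("Mary"));
        return docs;
    }

    // Total number of times the value occurs across all documents.
    static int occurrenceCount(List<List<String>> docs, String value) {
        int n = 0;
        for (List<String> doc : docs) {
            for (String v : doc) {
                if (v.equals(value)) {
                    n++;
                }
            }
        }
        return n;
    }

    // Number of documents in which the value occurs at least once;
    // this is the kind of count the Facet tree displays.
    static int documentCount(List<List<String>> docs, String value) {
        int n = 0;
        for (List<String> doc : docs) {
            if (doc.contains(value)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<List<String>> docs = sampleCorpus();
        System.out.println("occurrences: " + occurrenceCount(docs, "William"));
        System.out.println("documents:   " + documentCount(docs, "William"));
    }
}
```

With this sample data, William occurs four times but in only two documents, so a facet-style count would be 2.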
The search field values are generated by a UIMA annotator, and come from unstructured text. A more
detailed explanation is that the annotator fires on all values in each text, and puts each of those values
into the Common Analysis Structure (CAS). In this simple example, all occurrences of “cat” fire the
Animal annotation:
This is exactly what we see in the Details pane, in the search results: all the values in the CAS (this is
handled via the cas2index.xml). The Facet tree representation is not a direct output of the field values
that are populating the search results. For one thing, each unique value is listed only once, with the
understanding that different representations of the same value will each be considered unique unless
you apply a special configuration that outputs only the lemma of each value (we will see this later).
The Solution
Here, we only need to remember that as long as one occurrence of a field value exists in the
document, that value will receive a count in the Facet tree. Thus, if we could remove all but one
occurrence of the value from the CAS, the document count would not be affected, and only one unique
value would occur in the Details pane. In the simple example, this UIMA code would suffice:
// Imports needed at the top of the annotator class
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.text.AnnotationIndex;
import org.apache.uima.jcas.tcas.Annotation;

// Get the index of Animal annotations, and an iterator over it
AnnotationIndex<Annotation> idx = aJCas.getAnnotationIndex(Animal.type);
FSIterator<Annotation> it = idx.iterator();
while (idx.size() > 1) {
    // more than one entry is left, so remove the first one
    it.next().removeFromIndexes(aJCas);
    // we need a fresh copy of the annotation index, and a new iterator
    idx = aJCas.getAnnotationIndex(Animal.type);
    it = idx.iterator();
}
We can demonstrate this in the CAS Visual Debugger, shipped with all Apache UIMA
distributions. In our simple example, without the above code, we get four
occurrences of “cat”.
After inserting the code for culling values, we are left with the single occurrence.
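The reasoning behind the culling step can also be checked without UIMA. The sketch below is plain Java under stated assumptions: a list of strings stands in for the Animal annotations of one document, and the class CullSketch and its methods are hypothetical names, not part of UIMA or WCA.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CullSketch {

    // Remove entries from the front until one is left, mirroring the loop
    // that removes Animal annotations from the index.
    static List<String> cullToOne(List<String> values) {
        List<String> remaining = new ArrayList<String>(values);
        while (remaining.size() > 1) {
            remaining.remove(0);
        }
        return remaining;
    }

    // A document counts toward a facet value if the value occurs at least once.
    static boolean countsToward(List<String> values, String value) {
        return values.contains(value);
    }

    public static void main(String[] args) {
        List<String> before = Arrays.asList("cat", "cat", "cat", "cat");
        List<String> after = cullToOne(before);
        System.out.println(before.size() + " -> " + after.size());
        System.out.println(countsToward(before, "cat") == countsToward(after, "cat"));
    }
}
```

The document still counts toward "cat" after culling, which is why the facet count is unaffected while the Details pane would show a single value.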
Set up in WCA
We will configure a Custom Stage in our WCA pipeline, which will contain the above UIMA code. First, we
need to export the Java project:
We can export the AE Descriptor separately, or extract it from the jar we created. In any event, we must
import both artifacts into our Studio project.
Now we simply add a Custom Stage, after the main Parsing Rules Stage.
When we now run the pipeline, the Type Animal gets into the CAS from the default Parsing Stage, and
then is culled in our Custom (UIMA) Stage, to have only one value for the Type:
We now export the TAE (Text Analysis Engine) as a .pear file (Processing Engine ARchive). One step that
will make the result even cleaner is to export the lemma of the values the annotator picks:
Here are the results:
Solution: Phase 2
You may have recognized that this solution is too simplistic for most cases. Indeed, it was only
presented to lay the foundations for more robust solutions. In the above solution, the UIMA code simply
removes all the annotations but the last. Thus, if we add other annotations besides cat, the problem
becomes evident:
It is clear that what we need is a kind of de-duplication of Annotations. In Java, this looks remarkably
simple: a Java Set accepts only unique values, so putting the list of Annotations into a Set would,
typically, de-duplicate it automatically. Some sample code would be:
AnnotationIndex<Annotation> idx = aJCas.getAnnotationIndex();
ArrayList<Annotation> tempList = new ArrayList<Annotation>(idx.size());
FSIterator<Annotation> it = idx.iterator();

// load the Annotations into a temporary list. includes duplicates
while (it.hasNext()) {
    tempList.add(it.next());
}

// remove all Annotations from the index. this works fine, because we
// iterate over the copy and not over the index itself
for (Annotation a : tempList) {
    a.removeFromIndexes(aJCas);
}

// push tempList into HashSet; this should not allow duplicates
HashSet<Annotation> hs = new HashSet<Annotation>();
hs.addAll(tempList);
// size should be less than the size of the FSIndex by the number of duplicates
System.out.println("HS length: " + hs.size());

tempList.clear();
tempList.addAll(hs);
System.out.println("templist length: " + tempList.size());

// tempList should now be the clean list; put it back into the index
for (Annotation a : tempList) {
    a.addToIndexes(aJCas);
}
In the line hs.addAll(tempList) above, we might think that all “duplicate” occurrences of “cat” and
“bird” would be excluded as duplicates. However, while they are the same Annotation Type, e.g. Animal,
and they have the same covered text, e.g. “cat”, it is not the same “cat”. An Annotation’s identity is
fundamentally dependent upon its location in the text, by design. That is, every Annotation has a begin
and end offset pair, which differs for each of the “cat” and “bird” occurrences shown above.
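To see why the Set does not collapse the occurrences, consider a small stand-in written in plain Java. This is a sketch under assumptions: the Span class is hypothetical, and its equals and hashCode simply model the behaviour described here (identity that includes the offsets); it is not UIMA's actual implementation of Annotation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;

class SpanIdentitySketch {

    // Hypothetical stand-in for an Annotation. Equality includes the offsets,
    // so two spans over the same word at different positions are different.
    static final class Span {
        final String type;
        final String coveredText;
        final int begin;
        final int end;

        Span(String type, String coveredText, int begin, int end) {
            this.type = type;
            this.coveredText = coveredText;
            this.begin = begin;
            this.end = end;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Span)) {
                return false;
            }
            Span s = (Span) o;
            return type.equals(s.type) && begin == s.begin && end == s.end;
        }

        @Override
        public int hashCode() {
            return Objects.hash(type, begin, end);
        }
    }

    // Create one Span per occurrence of word in docText.
    static List<Span> findAll(String type, String docText, String word) {
        List<Span> spans = new ArrayList<Span>();
        int from = docText.indexOf(word);
        while (from >= 0) {
            spans.add(new Span(type, word, from, from + word.length()));
            from = docText.indexOf(word, from + 1);
        }
        return spans;
    }

    // Number of elements that survive when the spans are put into a HashSet.
    static int sizeAsSet(List<Span> spans) {
        return new HashSet<Span>(spans).size();
    }

    public static void main(String[] args) {
        List<Span> cats = findAll("Animal", "cat cat bird cat", "cat");
        System.out.println("list: " + cats.size() + ", set: " + sizeAsSet(cats));
    }
}
```

Because every occurrence has its own begin and end offsets, the set keeps all three "cat" spans and nothing is de-duplicated.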
To force the primacy of the covered text in the Annotations, we can replace the HashSet with a HashMap,
using the covered text as the keys (the Annotations are the values). A HashMap holds one value per key,
so each put with a key that is already present overwrites the earlier entry, which culls the duplicates:
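// key each Annotation by its covered text; a later put with the same key
// replaces the earlier Annotation
HashMap<String, Annotation> hm = new HashMap<String, Annotation>();
for (Annotation a : tempList) {
    hm.put(a.getCoveredText(), a);
}
System.out.println("HM length: " + hm.size());
tempList.clear();
tempList.addAll(hm.values());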
This will produce the desired results:
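The keyed de-duplication can be tried in isolation. The following plain-Java sketch assumes a hypothetical Span class in place of the UIMA Annotation, and made-up offsets for the text "cat cat bird cat"; only HashMap.put is taken from the approach described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CoveredTextDedupSketch {

    // Hypothetical stand-in for an Annotation (not a UIMA class).
    static final class Span {
        final String coveredText;
        final int begin;

        Span(String coveredText, int begin) {
            this.coveredText = coveredText;
            this.begin = begin;
        }
    }

    // Key every span by its covered text; a later span with the same text
    // replaces the earlier one, so one span per distinct text remains.
    static Map<String, Span> byCoveredText(List<Span> spans) {
        Map<String, Span> map = new HashMap<String, Span>();
        for (Span s : spans) {
            map.put(s.coveredText, s);
        }
        return map;
    }

    // Made-up sample: the occurrences in "cat cat bird cat".
    static List<Span> sample() {
        List<Span> spans = new ArrayList<Span>();
        spans.add(new Span("cat", 0));
        spans.add(new Span("cat", 4));
        spans.add(new Span("bird", 8));
        spans.add(new Span("cat", 13));
        return spans;
    }

    public static void main(String[] args) {
        Map<String, Span> map = byCoveredText(sample());
        System.out.println("entries: " + map.size());
        System.out.println("surviving cat begins at " + map.get("cat").begin);
    }
}
```

Four spans go in and two entries come out, one per covered text. Note that the span that survives for "cat" is the last one inserted, which matters as soon as several Types share the same covered text.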
We can safely add additional Annotation Types:
This solution should now work in WCA, but it does not. WCA/Studio has a behavior that adds the
uima.tt.TokenAnnotation annotation to the pipeline automatically, at the very end of the FSIndex. Thus,
if the index contains:
Type                           Covered Text
com.ibm.watson.l2.Animal       bird
com.ibm.watson.l2.Vegetable    bush
uima.tt.TokenAnnotation        bird
uima.tt.TokenAnnotation        bush

then, because the method puts in Annotations of all Types and uses their covered text as the key (and
so de-duplicates on it),
hm.put(a.getCoveredText(), a);
the final HashMap will only contain:

Type                           Covered Text
uima.tt.TokenAnnotation        bird
uima.tt.TokenAnnotation        bush
To fix this, we can modify the code that prepares the list that is passed to the HashMap, so that elements
of Type uima.tt.TokenAnnotation are excluded:
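Annotation anno;
Type type; // org.apache.uima.cas.Type
while (it2.hasNext()) {
    anno = it2.next();
    type = anno.getType();
    // skip the TokenAnnotations that WCA/Studio adds automatically
    if (!type.getName().equals("uima.tt.TokenAnnotation")) {
        tempList.add(anno);
    }
    // print every Type seen, including the ones that were skipped
    System.out.println(type.toString());
}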
Note that if we print the Types that are presented to the ArrayList, we see the
TokenAnnotation:
uima.tcas.DocumentAnnotation
uima.tt.TokenAnnotation
com.ibm.watson.l2.Animal
uima.tt.TokenAnnotation
com.ibm.watson.l2.Animal
But if we print the ArrayList, we see that no TokenAnnotation was added: only the DocumentAnnotation
and the Animal Types are present, and thus presented to the HashMap:
DocumentAnnotation
   sofa: _InitialView
   begin: 0
   end: 9
   language: "en"
Animal
   sofa: _InitialView
   begin: 0
   end: 3
Animal
   sofa: _InitialView
   begin: 4
   end: 8
tempList length: 3
To test, we can comment out the filter:
Index size is: 5
uima.tcas.DocumentAnnotation
uima.tt.TokenAnnotation
com.ibm.watson.l2.Animal
uima.tt.TokenAnnotation
com.ibm.watson.l2.Animal
DocumentAnnotation
   sofa: _InitialView
   begin: 0
   end: 7
   language: "en"
TokenAnnotation
   sofa: _InitialView
   begin: 0
   end: 3
Animal
   sofa: _InitialView
   begin: 0
   end: 3
TokenAnnotation
   sofa: _InitialView
   begin: 4
   end: 7
Animal
   sofa: _InitialView
   begin: 4
   end: 7
templist length: 2
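The effect of the TokenAnnotation filter can also be reproduced outside WCA. The sketch below is plain Java, not WCA or UIMA code: each annotation is modelled as a hypothetical {type name, covered text} pair, the class TokenFilterSketch and its methods are made up for illustration, and the sample rows are the four index entries from the bird/bush table earlier in this section, in the same order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TokenFilterSketch {

    static final String TOKEN_TYPE = "uima.tt.TokenAnnotation";

    // Sample index: {type name, covered text}, with the TokenAnnotations last.
    static List<String[]> sampleIndex() {
        List<String[]> index = new ArrayList<String[]>();
        index.add(new String[] {"com.ibm.watson.l2.Animal", "bird"});
        index.add(new String[] {"com.ibm.watson.l2.Vegetable", "bush"});
        index.add(new String[] {TOKEN_TYPE, "bird"});
        index.add(new String[] {TOKEN_TYPE, "bush"});
        return index;
    }

    // Key the annotations by covered text, optionally skipping TokenAnnotations.
    // The map value is the type name of the annotation that survives.
    static Map<String, String> keyByCoveredText(List<String[]> index, boolean skipTokens) {
        Map<String, String> typeByText = new HashMap<String, String>();
        for (String[] a : index) {
            String type = a[0];
            String coveredText = a[1];
            if (skipTokens && TOKEN_TYPE.equals(type)) {
                continue;
            }
            typeByText.put(coveredText, type);
        }
        return typeByText;
    }

    public static void main(String[] args) {
        System.out.println("unfiltered: " + keyByCoveredText(sampleIndex(), false));
        System.out.println("filtered:   " + keyByCoveredText(sampleIndex(), true));
    }
}
```

Without the filter, both keys end up mapped to uima.tt.TokenAnnotation, because those entries are inserted last; with the filter, the Animal and Vegetable entries survive.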