IBM Software Services, Support and Success
IBM Watson Group

Limiting Search Field Values in Search Results Pane
Watson Content Analytics

Author: [email protected] [email protected], Marshall Schor/Watson/IBM
Date: July 29, 2017 November 18, 2014
Version: 0 0.1

Description of Issue

Search fields can be displayed either in the Facet tree or in the Details pane of the search results. In WCA 3.5, to display a field, it must be set as Returnable in the Search Field Definitions:

It must also be added to the fields to display, via the user Preferences.

Note that in the search results, each occurrence is listed in the field display.

This contrasts with the Facet tree, which displays each value only once, together with a count. The count, however, is not the number of occurrences of the value in the data corpus; rather, it is the number of documents in which the value occurs at least once. In the example below, the Field is Person and the Value is William.

The search field values are generated by a UIMA annotator, and come from unstructured text. In more detail, the annotator fires on every matching value in each text, and puts each of those values into the Common Analysis Structure (CAS). In this simple example, all occurrences of “cat” fire the Animal annotation:

This is exactly what we see in the Details pane of the search results: all the values in the CAS (this is handled via the cas2index.xml).

The Facet tree representation is not a direct output of the field values that are populating the search results.
For one thing, the unique values are listed only once. Note that different representations of the same value are each considered unique, unless you apply a special configuration that outputs only the lemma of each value (we will see this later).

The Solution

Here, we only need to remember that as long as one occurrence of the field value exists in the document, that value will receive a count in the Facet tree. Thus, if we could remove all values from the CAS except one, the document count would not be affected, and only one unique value would occur in the Details pane. In the simple example, this UIMA code would suffice:

    // Get the index of Animal annotations, and an iterator over it
    AnnotationIndex<Annotation> idx = aJCas.getAnnotationIndex(Animal.type);
    FSIterator<Annotation> it = idx.iterator();
    while (idx.size() > 1) {
        // as long as more than one entry is left, we can remove another one
        it.next().removeFromIndexes(aJCas);
        // we need a fresh copy of the annotation index
        idx = aJCas.getAnnotationIndex(Animal.type);
        it = idx.iterator();
    }

We can demonstrate this in the CAS Visual Debugger, shipped with all Apache UIMA distributions. In our simple example, without the above code, we get four occurrences of “cat”.

After inserting the code for culling values, we are left with a single occurrence.

Set up in WCA

We will configure a Custom Stage in our WCA pipeline, which will contain the above UIMA code. First, we need to export the Java project:

We can export the AE Descriptor separately, or extract it from the jar we created. In any event, we must import both artifacts into our Studio project.
Now we simply add a Custom Stage, after the main Parsing Rules Stage.

When we now run the pipeline, the Type Animal gets into the CAS from the default Parsing Stage, and is then culled in our Custom (UIMA) Stage, so that only one value remains for the Type:

We now export the TAE as a .pear file (Processing Engine Archive). One step that will make the result even cleaner is to export the lemma of the values the annotator picks:

Here are the results:

Solution: Phase 2

You may have recognized that this solution is too simplistic for most cases. Indeed, it was only presented to lay the foundations for more robust solutions. In the above solution, the UIMA code simply removes all the annotations but the last. Thus, if we add other annotations besides cat, the problem becomes evident:

It is clear that what we need is a kind of de-duplication of Annotations. In Java, this looks remarkably simple: many of the Collection types in Java accept only unique values, so putting the list of Annotations into a Java Set would typically de-duplicate it automatically. Some sample code would be:

    AnnotationIndex<Annotation> idx = aJCas.getAnnotationIndex();
    ArrayList<Annotation> tempList = new ArrayList<Annotation>(idx.size());
    FSIterator<Annotation> it = idx.iterator();
    // load the Annotations into a temporary list. includes duplicates
    while (it.hasNext()) {
        tempList.add(it.next());
    }
    Iterator<Annotation> tempIt = tempList.iterator();
    // remove all Annotations from the index. this works fine
    while (tempIt.hasNext()) {
        tempIt.next().removeFromIndexes(aJCas);
    }
    // push tempList into HashSet
    HashSet<Annotation> hs = new HashSet<Annotation>();
    hs.addAll(tempList); // this should not allow duplicates
    // size should be less than the size of the FSIndex by the number of duplicates
    System.out.println("HS length: " + hs.size());
    tempList.clear();
    tempList.addAll(hs);
    System.out.println("templist length: " + tempList.size());
    // this should now be the clean list
    Iterator<Annotation> it2 = tempList.iterator();
    while (it2.hasNext()) {
        it2.next().addToIndexes(aJCas);
    }

In the line above that adds tempList to the HashSet, we might think that all “duplicate” occurrences of “cat” and “bird” would be excluded as duplicates. However, while they are the same Annotation Type, e.g. Animal, and they have the same covered text, e.g. “cat”, it is not the same “cat”. An Annotation’s identity is fundamentally dependent upon its location in the text, by design. That is, every Annotation has a begin and end offset pair, which is different for each of the “cat” and “bird” occurrences shown above.

To force the primacy of the covered text in the Annotations, we can replace the HashSet with a HashMap, and use the covered text as keys (the Annotations are the values). A HashMap holds only one value per key: putting an element whose key is already present replaces the earlier element, so duplicates are culled by default:

    HashMap<String, Annotation> hm = new HashMap<String, Annotation>();
    for (Annotation a : tempList) {
        // a later Annotation with the same covered text replaces the earlier one
        hm.put(a.getCoveredText(), a);
    }
    System.out.println("HS length: " + hm.size());
    tempList.clear();
    tempList.addAll(hm.values());

This will produce the desired results:

We can safely add additional Annotation Types:

This solution should now work in WCA, but it does not.
WCA/Studio has a behavior that adds the uima.tt.TokenAnnotation annotation to the pipeline automatically, at the very end of the FSIndex. Suppose the index contains:

    Type                           Covered Text
    com.ibm.watson.l2.Animal       bird
    com.ibm.watson.l2.Vegetable    bush
    uima.tt.TokenAnnotation        bird
    uima.tt.TokenAnnotation        bush

The method puts in all Types, and uses their covered text as the key (and thereby de-duplicates):

    hm.put(a.getCoveredText(), a);

Because the TokenAnnotation entries come last, they replace the Animal and Vegetable entries that have the same covered text, and the final HashMap will only contain:

    uima.tt.TokenAnnotation        bird
    uima.tt.TokenAnnotation        bush

To fix this, we can modify the code that prepares the list that is passed to the HashMap, so that elements of Type uima.tt.TokenAnnotation are excluded:

    // it2 here iterates over the annotation index while tempList is being loaded
    while (it2.hasNext()) {
        Annotation anno = it2.next();
        Type type = anno.getType();
        // skip the TokenAnnotation entries that WCA/Studio adds automatically
        if (!type.getName().equals("uima.tt.TokenAnnotation")) {
            tempList.add(anno);
        }
        System.out.println(type.getName());
    }

Note that if we print the Types that are presented to the ArrayList, we see the TokenAnnotation:

    uima.tcas.DocumentAnnotation
    uima.tt.TokenAnnotation
    com.ibm.watson.l2.Animal
    uima.tt.TokenAnnotation
    com.ibm.watson.l2.Animal

But if we print the ArrayList, we see that no TokenAnnotation was added; only the DocumentAnnotation and the Animal Types are presented to the HashMap:

    DocumentAnnotation
       sofa: _InitialView
       begin: 0
       end: 9
       language: "en"
    Animal
       sofa: _InitialView
       begin: 0
       end: 3
    Animal
       sofa: _InitialView
       begin: 4
       end: 8
    tempList length: 3

To test, we can comment out the filter:

    Index size is: 5
    uima.tcas.DocumentAnnotation
    uima.tt.TokenAnnotation
    com.ibm.watson.l2.Animal
    uima.tt.TokenAnnotation
    com.ibm.watson.l2.Animal
    DocumentAnnotation
       sofa: _InitialView
       begin: 0
       end: 7
       language: "en"
    TokenAnnotation
       sofa: _InitialView
       begin: 0
       end: 3
    Animal
       sofa: _InitialView
       begin: 0
       end: 3
    TokenAnnotation
       sofa: _InitialView
       begin: 4
       end: 7
    Animal
       sofa: _InitialView
       begin: 4
       end: 7
    templist length: 2
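
The de-duplication logic discussed in this document can also be tried outside of UIMA and WCA. The following is only a minimal sketch under stated assumptions: the class DedupSketch, the stand-in type Ann, and the helpers sampleIndex and dedupeByCoveredText are hypothetical names that do not come from UIMA or WCA, and the sample text "cat bird cat" is made up for illustration. A LinkedHashMap is used in place of the HashMap shown earlier, purely so that the output order is predictable. The sketch shows three things: a HashSet does not collapse annotations that differ in their offsets, a map keyed by covered text does, and the trailing TokenAnnotation entries replace the Animal entries unless they are filtered out.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class DedupSketch {

    /** Hypothetical stand-in for a UIMA Annotation: type name, offsets, covered text. */
    static final class Ann {
        final String type;
        final int begin;
        final int end;
        final String text;

        Ann(String type, int begin, int end, String text) {
            this.type = type;
            this.begin = begin;
            this.end = end;
            this.text = text;
        }

        // Equality includes the offsets, so two "cat" entries at different positions differ.
        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Ann)) {
                return false;
            }
            Ann other = (Ann) o;
            return type.equals(other.type) && begin == other.begin
                    && end == other.end && text.equals(other.text);
        }

        @Override
        public int hashCode() {
            return Objects.hash(type, begin, end, text);
        }
    }

    /** Made-up index for the text "cat bird cat"; token entries come last, as described for WCA. */
    static List<Ann> sampleIndex() {
        List<Ann> index = new ArrayList<Ann>();
        index.add(new Ann("com.ibm.watson.l2.Animal", 0, 3, "cat"));
        index.add(new Ann("com.ibm.watson.l2.Animal", 4, 8, "bird"));
        index.add(new Ann("com.ibm.watson.l2.Animal", 9, 12, "cat"));
        index.add(new Ann("uima.tt.TokenAnnotation", 0, 3, "cat"));
        index.add(new Ann("uima.tt.TokenAnnotation", 4, 8, "bird"));
        index.add(new Ann("uima.tt.TokenAnnotation", 9, 12, "cat"));
        return index;
    }

    /**
     * Keeps one entry per covered text. Entries of excludedType are skipped
     * (pass null to skip nothing). A later entry replaces an earlier one
     * that has the same covered text.
     */
    static List<Ann> dedupeByCoveredText(List<Ann> index, String excludedType) {
        Map<String, Ann> byText = new LinkedHashMap<String, Ann>();
        for (Ann a : index) {
            if (excludedType != null && a.type.equals(excludedType)) {
                continue;
            }
            byText.put(a.text, a);
        }
        return new ArrayList<Ann>(byText.values());
    }

    public static void main(String[] args) {
        List<Ann> index = sampleIndex();
        // The set keeps every entry, because each one differs in type or offsets.
        System.out.println("HashSet size: " + new HashSet<Ann>(index).size());
        for (Ann a : dedupeByCoveredText(index, null)) {
            System.out.println("unfiltered: " + a.type + " " + a.text);
        }
        for (Ann a : dedupeByCoveredText(index, "uima.tt.TokenAnnotation")) {
            System.out.println("filtered:   " + a.type + " " + a.text);
        }
    }
}
```

With this sample data, the HashSet keeps all six entries, the unfiltered map is left with two uima.tt.TokenAnnotation entries (“cat” and “bird”), and the filtered map is left with two com.ibm.watson.l2.Animal entries for the same two covered texts.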