
AUTOMATIC CONCEPT-BASED QUERY EXPANSION
USING TERM RELATIONAL PATHWAYS BUILT
FROM A COLLECTION-SPECIFIC ASSOCIATION THESAURUS
by
Jennifer Rae Lyall-Wilson
_____________________
Copyright © Jennifer Rae Lyall-Wilson 2013
A Dissertation Submitted to the Faculty of the
SCHOOL OF INFORMATION RESOURCES AND LIBRARY SCIENCE
In Partial Fulfillment of the Requirements
For the Degree of
DOCTOR OF PHILOSOPHY
In the Graduate College
THE UNIVERSITY OF ARIZONA
2013
THE UNIVERSITY OF ARIZONA
GRADUATE COLLEGE
As members of the Dissertation Committee, we certify that we have read the dissertation
prepared by Jennifer Rae Lyall-Wilson
entitled Automatic Concept-Based Query Expansion Using Term Relational Pathways
Built From a Collection-Specific Association Thesaurus
and recommend that it be accepted as fulfilling the dissertation requirement for the
Degree of Doctor of Philosophy
_______________________________________________________________________
Martin Frické Date: 5/6/2013
_______________________________________________________________________
Hong Cui Date: 5/6/2013
_______________________________________________________________________
Bryan Heidorn Date: 5/6/2013
_______________________________________________________________________
Gary Bakken Date: 5/6/2013
Final approval and acceptance of this dissertation is contingent upon the candidate’s
submission of the final copies of the dissertation to the Graduate College.
I hereby certify that I have read this dissertation prepared under my direction and
recommend that it be accepted as fulfilling the dissertation requirement.
________________________________________________
Dissertation Director: Martin Frické
Date: 5/6/2013
STATEMENT BY AUTHOR
This dissertation has been submitted in partial fulfillment of requirements for an
advanced degree at the University of Arizona and is deposited in the University Library
to be made available to borrowers under rules of the Library.
Brief quotations from this dissertation are allowable without special permission, provided
that accurate acknowledgment of source is made. Requests for permission for extended
quotation from or reproduction of this manuscript in whole or in part may be granted by
the copyright holder.
SIGNED: Jennifer Rae Lyall-Wilson
ACKNOWLEDGEMENTS
I would like to thank my major advisor Dr. Martin Frické for his help, advice, and
encouragement throughout my journey. He is an outstanding teacher and an insightful
and caring advisor and mentor. Dr. Frické has been there for me since I began the
program and I am grateful he was willing to work with me. He encouraged me to pursue
a dissertation research topic that I was excited about, and he believed I was capable of doing
the search engine research that I wasn’t sure I could. His confidence in my abilities and his
support have made all the difference.
I’d like to thank Dr. Gary Bakken for his support, advice, and encouragement
throughout my doctoral program. His passion for his work and his energy in pursuing his
goals are inspiring.
I’d like to thank my dissertation committee members Dr. Hong Cui and Dr. Bryan
Heidorn for their input and suggestions and for working with me to make this a successful
dissertation. I’d also like to thank Dr. Heshan Sun for early advice and direction in the
dissertation research.
I’d like to thank Geraldine Fragoso in the SIRLS Administration Office for all
she’s done to make sure that the administrative side of my program stayed on track. As a
distance student it is easy to feel disconnected and alone, but I always felt that I had an
advocate in Geraldine looking out for me and ready to help.
I’d like to thank all those who support and staff the libraries in which I spent time
working on my dissertation. These libraries include: Portland State University Library,
University of Oregon Knight Library, San Diego State University Library, University of
California San Diego Geisel Library, and Northern Arizona University Cline Library.
And, I’d also especially like to thank the staff at the University of Arizona Main Library
who provided me access to all the resources I needed electronically. Their electronic
collections as well as their excellent Interlibrary Loan and Document Delivery programs
made it possible for me to effectively and efficiently access the resources I needed to
complete this research as a distance student.
I would like to thank my sister-in-law and good friend Michele Lyall for
generously donating her time and expertise to be my editor and conduct a thorough
review of my dissertation. She caught those silly, embarrassing mistakes that could’ve
only been made in the wee hours of the morning and, more importantly, she provided
suggestions and guidance that allowed my writing to achieve a higher degree of clarity.
She helped make the ideas I presented more understandable and, therefore, made my
dissertation stronger.
I’d like to thank my Mom who, as the first in her family to go to college, instilled
in me the value of education and set an excellent example of being successful in college
while being a loving, caring, and nurturing mother. I am so lucky to have a Mom who
always believes in me, never ceases to encourage me, and is always ready to be my own
personal cheering section.
I’d like to thank my daughter Kinsey who warms my heart every day with her
loving, fun, and generous spirit. Her never-ending curiosity and passion to learn about the
world around her inspires me to look at the world in the same way. She constantly
amazes me, makes me laugh, and helps me be a better person.
And, finally, I’d like to thank my partner Beth. She’s my partner in life, love,
family, growth, work, fun, relaxation, and appreciation of all the wonderful things around
us. It sounds like a cliché, but it really is so hard to capture in mere words how grateful I
am to her for all she has done to contribute to both my dissertation and growth as a
researcher in general. As a sounding board and as an excellent critical, logical, creative,
and global thinker, she has contributed in ways both big and small from the conception of
this dissertation research to its completion.
Of course, there are many others who have played large or small roles over the
years but I don’t have the space to include them all. It takes a village to complete a
dissertation. I am grateful to my village.
DEDICATION
To my daughter Kinsey and my partner Beth. You are the loves of my life.
Kinsey, you inspire me to be the best person I possibly can be. Beth, with your love,
support, and encouragement you help me achieve it.
TABLE OF CONTENTS
LIST OF FIGURES .........................................................................................................15
LIST OF TABLES ...........................................................................................................21
ABSTRACT .................................................................................................................27
CHAPTER 1. INTRODUCTION ..................................................................................28
1.1. Outside the Scope .................................................................................................33
1.2. Improved Automatic Query Expansion Using Relational Pathways ....................35
1.2.1. Generating the Conceptual Network ............................................................38
1.2.2. Formulating the Expanded Query ................................................................41
1.3. Example Query Walkthrough ...............................................................................44
1.4. Dissertation Report Structure ................................................................................49
CHAPTER 2. LITERATURE REVIEW ......................................................................50
2.1. Improving Information Retrieval using Natural Language Processing ................50
2.1.1. Information Retrieval ...................................................................................50
2.1.2. Natural Language Processing ......................................................................64
2.1.3. Describing Information Needs in the Form of a Query ...............................71
2.1.4. Will NLP techniques improve information retrieval systems? ....................75
2.2. Lexical Acquisition of Meaning ...........................................................................78
2.2.1. Vector Space Models ...................................................................................78
2.2.2. Latent Semantic Analysis ............................................................................85
2.2.3. Normalized Web Distance ...........................................................................89
2.2.4. Probabilistic Models Used in Determining Semantic Similarity .................92
2.3. Augmenting Retrieval Methods ............................................................................98
2.3.1. Integrating Manually Developed Semantic Knowledge ..............................98
2.3.2. Relevance Feedback...................................................................................102
2.3.3. Query Expansion ........................................................................................103
2.4. Evaluation ...........................................................................................................125
2.4.1. Search Engine Performance Measures.......................................................125
2.4.2. Search Engine Data for Evaluation Experiments.......................................131
CHAPTER 3. RESEARCH HYPOTHESES..............................................................139
CHAPTER 4. METHODOLOGY ...............................................................................140
4.1. Identify and Design Baseline Search Engine ......................................................140
4.1.1. Design Process Used ..................................................................................142
4.1.2. Baseline Search Engine Structure ..............................................................143
4.2. Design Enhanced Search Engine ........................................................................144
4.2.1. Build Association Thesaurus .....................................................................147
4.2.2. Generate Conceptual Network ...................................................................149
4.2.3. Expand Query ............................................................................................151
4.3. Select a Document Collection.............................................................................152
4.4. Develop Query Topics ........................................................................................153
4.4.1. Tangible and Intangible Concepts .............................................................154
4.4.2. Relevant Document Sets for Query Topics ...............................................155
4.5. Run Experiment ..................................................................................................156
4.5.1. Select Samples ...........................................................................................156
4.5.2. Adjudicate ..................................................................................................157
4.6. Measure Performance and Determine Statistical Significance ...........................158
4.6.1. Addressing Potential Impact of Unfound Relevant Documents ................159
4.6.2. Addressing Potential Impact of Relevancy Assumption in Modified Pooling
Method .................................................................................................................162
4.6.3. Perform Calculations and Sensitivity Analyses .........................................163
CHAPTER 5. RESULTS ..............................................................................................168
5.1. Full Query Topic Set ...........................................................................................168
5.2. Samples of Query Topics That Produce a Difference ........................................171
5.2.1. Baseline vs. Enhanced Search Performance ..............................................173
5.2.2. Tangible vs. Intangible Concepts ...............................................................174
5.3. Sensitivity Analysis For Impact Of Unfound Relevant Documents ...................176
5.3.1. Level 1 Sensitivity To Unfound Relevant Documents (0.25X) .................176
5.3.2. Level 2 Sensitivity To Unfound Relevant Documents (2X) ......................178
5.3.3. Level 3 Sensitivity To Unfound Relevant Documents (10X) ....................178
5.4. Sensitivity Analysis For Impact Of Relevancy Assumptions .............................179
5.4.1. Level 1 Sensitivity To Relevancy Assumptions (25% Non-Relevant)......180
5.4.2. Level 2 Sensitivity To Relevancy Assumptions (50% Non-Relevant)......181
5.4.3. Level 3 Sensitivity To Relevancy Assumptions (75% Non-Relevant)......182
5.4.4. Level 4 Sensitivity To Relevancy Assumptions (Overlap With Google
Desktop) ...............................................................................................................183
5.5. Query Topic Outliers ..........................................................................................184
CHAPTER 6. DISCUSSION........................................................................................187
6.1. Research Hypothesis #1 ......................................................................................188
6.2. Research Hypothesis #2 ......................................................................................189
6.3. Research Hypothesis #3 ......................................................................................190
6.4. Research Hypothesis #4 ......................................................................................190
6.5. Generalizing the Results to Full Set of Query Topics ........................................191
6.6. Sensitivity Analysis For Impact Of Unfound Relevant Documents ...................193
6.7. Sensitivity Analysis For Impact Of Relevancy Assumptions .............................194
6.8. Interpreting the Results .......................................................................................195
6.8.1. Conceptual Network ..................................................................................195
6.8.2. Complete Set of Relevant Documents .......................................................197
6.8.3. Query Topic Types ....................................................................................198
6.9. Impact of Sample Selection ................................................................................199
6.10. Impact of Search Engine Parameter Values......................................................199
6.10.1. Jaccard Coefficient Threshold Value .......................................................200
6.10.2. Maximum Associated Term Entries Threshold Value.............................201
6.11. Outstanding Issues ............................................................................................204
6.11.1. Impact of Characteristics of the Test Document Collection ....................204
6.11.2. Data Processing and Document Collection Size......................................205
6.11.3. Performance Comparison.........................................................................206
CHAPTER 7. CONCLUSION .....................................................................................208
APPENDIX A. SEARCH ENGINE STRUCTURE AND DESIGN PARAMETERS ..
...............................................................................................................210
A.1. Structure .............................................................................................................210
A.1.1. Baseline Search Engine Structure .............................................................210
A.1.2. Enhanced Search Engine Structure ...........................................................212
A.2. Design Parameters..............................................................................................214
A.2.1. Technology .................................................................................................214
A.2.2. Indexing Parameters ...................................................................................214
A.2.3. Association Thesaurus Processing Parameters .........................................215
A.2.4. Conceptual Network and Relational Pathway Parameters ........................218
A.2.5. Search Parameters .....................................................................................220
APPENDIX B. BUILDING BOOLEAN EXPRESSIONS FROM RELATIONAL
PATHWAYS ...............................................................................................................221
B.1. Building Boolean Phrase from Relational Pathway ...........................................221
B.2. Building Query Expression for Query Topic .....................................................223
B.2.1. Combining Multiple Relational Pathways for Term Pair..........................223
B.2.2. Creating Full Boolean Expression for Query Topic..................................224
APPENDIX C. QUERY TOPIC LIST ........................................................................227
C.1. Query Topics Representing Tangible Concepts .................................................227
C.2. Query Topics Representing Intangible Concepts ...............................................228
APPENDIX D. DOCUMENTS RETURNED COUNTS FOR QUERY TOPICS BY
SEARCH ENGINE ........................................................................................................230
D.1. Documents Returned Counts .............................................................................230
D.2. Significance of Difference of Documents Returned ..........................................233
APPENDIX E. GRADED RELEVANCE QUERY TOPIC DEFINITIONS ............235
E.1. Query Topics Representing Tangible Concepts .................................................236
E.2. Query Topics Representing Intangible Concepts ...............................................241
APPENDIX F. BINARY RELEVANCE DATA AND CALCULATIONS ...............246
F.1. Data for Query Topics Representing Tangible Concepts ...................................246
F.1.1. Adjudicated Query Topics for Tangible Concepts ....................................247
F.1.2. Not Adjudicated Query Topics for Tangible Concepts .............................250
F.2. Data for Query Topics Representing Intangible Concepts .................................253
F.2.1. Adjudicated Query Topics for Intangible Concepts ..................................253
F.2.2. Not Adjudicated Query Topics For Intangible Concepts ..........................256
APPENDIX G. BINARY RELEVANCE SIGNIFICANCE TESTS.........................259
G.1. Difference in Recall* between Baseline and Enhanced Search Engines ...........259
G.2. Difference in F-measure* between Baseline and Enhanced Search Engines ....261
G.3. Difference in F-measures* between Tangible and Intangible Concepts............263
G.3.1. Difference between Tangible and Intangible Concepts for Baseline Search
Engine ..................................................................................................................264
G.3.2. Difference between Tangible and Intangible Concepts for Enhanced Search
Engine ..................................................................................................................265
APPENDIX H. GRADED RELEVANCE DATA .......................................................266
H.1. Graded Relevance Data of Query Topics for Tangible Concepts ......................266
H.2. Graded Relevance Data of Query Topics for Intangible Concepts ....................268
APPENDIX I. SENSITIVITY ANALYSIS FOR IMPACT OF UNFOUND
RELEVANT DOCUMENTS ........................................................................................270
I.1. Level 1 Sensitivity To Unfound Relevant Documents (0.25X) ..........................271
I.2. Level 2 Sensitivity To Unfound Relevant Documents (2X) ...............................278
I.3. Level 3 Sensitivity To Unfound Relevant Documents (10X) .............................285
APPENDIX J. SENSITIVITY ANALYSIS FOR IMPACT OF RELEVANCY
ASSUMPTIONS .............................................................................................................292
J.1. Level 1 Sensitivity To Relevancy Assumptions (25% Non-Relevant) ...............293
J.2. Level 2 Sensitivity To Relevancy Assumptions (50% Non-Relevant) ...............301
J.3. Level 3 Sensitivity To Relevancy Assumptions (75% Non-Relevant) ...............308
J.4. Level 4 Sensitivity To Relevancy Assumptions (Overlap With Google Desktop)
....................................................................................................................................315
REFERENCES ...............................................................................................................322
LIST OF FIGURES
Figure 1.1 Term cluster generated from thesaurus entry for term A. ............................ 38
Figure 1.2 Term clusters for term A and term D connected by association relationship
generated from thesaurus entries. ................................................................ 39
Figure 1.3 Imaginary relational pathway between terms A and J................................. 39
Figure 1.4 Simple Boolean implementation for formulating an expanded query from
path A – B – C – D – E. In the visual representation, the solid circles on the
pathway indicate the terms that must be present in the document for the
document to be identified as relevant. ......................................................... 43
Figure 1.5 Term clusters generated from collection-specific association thesaurus
entries for terms warning and understandable.......................................... 46
Figure 1.6 Example Boolean query for the relational path warnings – aware – color –
coding – understandable............................................................................ 48
Figure 2.1 The two phases of the retrieval process in an information retrieval system.
...................................................................................................................... 54
Figure 2.2 The stages of analysis in Natural Language Processing (adapted from Figure
1.1 in Dale, 2010, p. 4)................................................................................. 66
Figure 2.3 An example document-by-word matrix (adapted from Figure 8.3 in
Manning & Schütze, 1999, p. 297). ............................................................. 79
Figure 2.4 An example word-by-word matrix (adapted from Figure 8.4 in Manning &
Schütze, 1999, p. 297).................................................................................. 80
Figure 2.5 Typical steps of query expansion (adapted from Figure 1 of Carpineto &
Romano, 2012, p. 1:10).............................................................................. 108
Figure 4.1 High-level structure of the Baseline search engine.................................... 144
Figure 4.2 High-level structure of the Enhanced search engine.................................. 146
Figure 4.3 Venn diagram illustrating that the documents returned by the Baseline are a
subset of those returned by the Enhanced search engine. .......................... 160
Figure 5.1 Histogram of the returned document count frequencies for the 75 query
topics run on the Baseline and the Enhanced search engines. ................... 169
Figure 5.2 Distribution of all 75 query topics that produced a difference in performance
between the Baseline and the Enhanced search engines and those that
performed the same (i.e., no difference in performance). .......................... 170
Figure 6.1 Distribution of all 75 query topics with additional detail to illustrate the
population from which the tangible and intangible sample sets were derived.
.................................................................................................................... 188
Figure 6.2 Relational pathways identified for various values of the Maximum
Associated Term Entries Threshold. .......................................................... 202
Figure A.1 High-level structure of the Baseline search engine.................................... 211
Figure A.2 High-level structure of the Enhanced search engine.................................. 213
Figure B.1 Boolean Phrase for Relational Pathway Length of 3. In the visual
representation, the solid circles on the pathway indicate the terms that must
be present in the document for the document to be identified as relevant. ......
.................................................................................................................... 221
Figure B.2 Boolean Phrase for Relational Pathway Length of 4. In the visual
representation, the solid circles on the pathway indicate the terms that must
be present in the document for the document to be identified as relevant. ......
.................................................................................................................... 222
Figure B.3 Boolean Phrase for Relational Pathway Length of 5. In the visual
representation, the solid circles on the pathway indicate the terms that must
be present in the document for the document to be identified as relevant. ......
.................................................................................................................... 222
Figure B.4 Boolean phrase for term pair A and E constructed by combining individual
Boolean phrases for the term pair’s two relational pathways. ................... 224
Figure B.5 Boolean phrase for query topic A B C constructed by combining Boolean
phrases for each of the term pairs. ............................................................. 225
Figure B.6 Boolean phrase for query topic A B C constructed by combining Boolean
phrases generated from the relational pathway identified between the term
pair A and B and the term C. Term C did not share a relational pathway
with term A nor with term B. ..................................................................... 226
Figure G.1 Two Sample t-Test to test the recall* difference between Baseline and
Enhanced search engines. .......................................................................... 260
Figure G.2 Two Sample t-Test to test the F-measure* difference between Baseline and
Enhanced search engines. .......................................................................... 262
Figure G.3 ANOVA Single Factor to test the F-measure* difference between Tangible
and Intangible Concepts Query Sample Sets on the Baseline search engine.
.................................................................................................................... 264
Figure G.4 ANOVA Single Factor to test the F-measure difference between Tangible
and Intangible Concepts Query Sample Sets on the Enhanced search engine.
.................................................................................................................... 265
Figure I.1 Two Sample t-Test to test the difference of the recall* at Sensitivity Level 1
(0.25X) between the Baseline and Enhanced search engines. ................... 276
Figure I.2 Two Sample t-Test to test the difference of the F-measure* at Sensitivity
Level 1 (0.25X) between the Baseline and Enhanced search engines. ...... 277
Figure I.3 Two Sample t-Test to test the difference of the recall* at Sensitivity Level 2
(2X) between the Baseline and Enhanced search engines. ........................ 283
Figure I.4 Two Sample t-Test to test the difference of the F-measure* at Sensitivity
Level 2 (2X) between the Baseline and Enhanced search engines. ........... 284
Figure I.5 Two Sample t-Test to test the difference of the recall* at Sensitivity Level 3
(10X) between the Baseline and Enhanced search engines. ...................... 290
Figure I.6 Two Sample t-Test to test the difference of the F-measure* at Sensitivity
Level 3 (10X) between the Baseline and Enhanced search engines. ......... 291
Figure J.1 Two Sample t-Test to test the difference of the recall* at Level 1 Sensitivity
to Relevancy Assumptions (25%) between the Baseline and Enhanced
search engines. ........................................................................................... 299
Figure J.2 Two Sample t-Test to test the difference of the F-measure* at Level 1
Sensitivity to Relevancy Assumptions (25% non-relevant) between the
Baseline and Enhanced search engines. ..................................................... 300
Figure J.3 Two Sample t-Test to test the difference of the recall* at Level 2 Sensitivity
to Relevancy Assumptions (50%) between the Baseline and Enhanced
search engines. ........................................................................................... 306
Figure J.4 Two Sample t-Test to test the difference of the F-measure* at Level 2
Sensitivity to Relevancy Assumptions (50% non-relevant) between the
Baseline and Enhanced search engines. ..................................................... 307
Figure J.5 Two Sample t-Test to test the difference of the recall* at Level 3 Sensitivity
to Relevancy Assumptions (75%) between the Baseline and Enhanced
search engines. ........................................................................................... 313
Figure J.6 Two Sample t-Test to test the difference of the F-measure* at Level 3
Sensitivity to Relevancy Assumptions (75%) between the Baseline and
Enhanced search engines. .......................................................................... 314
Figure J.7 Two Sample t-Test to test the difference of the recall* at Level 4 Sensitivity
to Relevancy Assumptions where estimated number of assumed relevant
documents generated from overlap of results with Google Desktop. ........ 320
Figure J.8 Two Sample t-Test to test the difference of the F-measure* at Level 4
Sensitivity to Relevancy Assumptions where estimated number of assumed
relevant documents generated from overlap of results with Google Desktop.
.................................................................................................................... 321
LIST OF TABLES
Table 2.1
Vector-based similarity measures. ............................................................... 81
Table 2.2
Probabilistic dissimilarity measures. ........................................................... 95
Table 2.3
The number of times term t1 and term t2 occur in each of the five documents
that comprise the example document collection. ....................................... 106
Table 4.1
Example of the estimates used for unfound relevant documents used in each
of the three levels of the sensitivity. Estimates of unfound documents were
rounded up to the next whole number........................................................ 164
Table 5.1
Counts and percentages of the distribution of the 75 query topics that
produced a difference in performance between the Baseline and the
Enhanced search engines and those that performed the same (i.e., no
difference in performance). ........................................................................ 170
Table 5.2
Sample of 14 query topics representing tangible concepts. ....................... 171
Table 5.3
Sample of 16 query topics representing intangible concepts. .................... 172
Table 5.4
Average performance measures for Baseline and Enhanced search engines.
.................................................................................................................... 174
Table 5.5
Average performance measures for query topics representing tangible and
intangible concepts for the Baseline search engine..................................... 175
Table 5.6
Average performance measures for query topics representing tangible and
intangible concepts for the Enhanced search engine................................... 175
Table 5.7
Average recall* for Baseline and Enhanced search engines from original
calculation and at the three sensitivity levels for unfound documents. ..... 177
Table 5.8
Average F-measure* for Baseline and Enhanced search engines from
original calculation and at the three sensitivity levels for unfound
documents. ................................................................................................. 177
Table 5.9
Average recall* for Baseline and Enhanced search engines from original
calculation and at the four sensitivity levels for assumed relevant
documents. ................................................................................................. 180
Table 5.10 Average F-measure* for Baseline and Enhanced search engines from
original calculation and at the four sensitivity levels for assumed relevant
documents. ................................................................................................. 181
Table 5.11 Documents returned by the Baseline and Enhanced search engines for the
two outlier query topics. ............................................................................ 185
Table 6.1
Calculation to estimate the performance of the Enhanced search engine on
the full 75 query topic set........................................................................... 192
Table D.1
Document counts returned by Baseline and Enhanced search engine. ...... 233
Table D.2
Two Sample t-test Assuming Equal Variances to determine statistical
significance of differences in documents returned by Baseline and Enhanced
search engines. ........................................................................................... 234
Table F.1
Adjudicated Query Topics for Tangible Concepts with Recall*, Precision,
and F-measure* calculations. ..................................................................... 248
Table F.2
Query topics for Tangible Concepts with no performance difference. ...... 251
Table F.3
Query topics for Tangible Concepts with a performance difference but not
included in the Tangible Concepts Query Sample Set............................... 252
Table F.4
Adjudicated Query Topics for Intangible Concepts with Recall*, Precision,
and F-measure* calculations. ..................................................................... 254
Table F.5
Query topics for Intangible Concepts with no performance difference. .... 257
Table F.6
Query topics for Intangible Concepts with a performance difference but not
included in the Intangible Concepts Query Sample Set............................. 258
Table H.1
Graded Relevance Data for Tangible Concepts Query Sample Set. .......... 267
Table H.2
Graded Relevance Data for Intangible Concepts Query Sample Set. ........ 269
Table I.1
Recall*, Precision, and F-measure* calculations for Tangible Concepts
Query Sample Set at Sensitivity Level 1 (0.25X) where the number of
unfound documents are assumed to be a quarter of the number of relevant
documents identified by the Baseline and Enhanced search engines. ....... 272
Table I.2
Recall*, Precision, and F-measure* calculations for Intangible Concepts
Query Sample Set at Sensitivity Level 1 (0.25X) where the number of
unfound documents are assumed to be a quarter of the number of relevant
documents identified by the Baseline and Enhanced search engines. ....... 274
Table I.3
Recall*, Precision, and F-measure* calculations for Tangible Concepts
Query Sample Set at Sensitivity Level 2 (2X) where the number of unfound
documents are assumed to be double the number of relevant documents
identified by the Baseline and Enhanced search engines........................... 279
Table I.4
Recall*, Precision, and F-measure* calculations for Intangible Concepts
Query Sample Set at Sensitivity Level 2 (2X) where the number of unfound
documents are assumed to be double the number of relevant documents
identified by the Baseline and Enhanced search engines........................... 281
Table I.5
Recall*, Precision, and F-measure* calculations for Tangible Concepts
Query Sample Set at Sensitivity Level 3 (10X) where the number of
unfound documents are assumed to be ten times the number of relevant
documents identified by the Baseline and Enhanced search engines. ....... 286
Table I.6
Recall*, Precision, and F-measure* calculations for Intangible Concepts
Query Sample Set at Sensitivity Level 3 (10X) where the number of
unfound documents are assumed to be ten times the number of relevant
documents identified by the Baseline and Enhanced search engines. ....... 288
Table J.1
Recall*, Precision, and F-measure* calculations for Tangible Concepts
Query Sample Set at Level 1 Sensitivity to Relevancy Assumptions where
25% of the documents returned by both the Baseline and the Enhanced
search engines were assumed to be non-relevant. ..................................... 295
Table J.2
Recall*, Precision, and F-measure* calculations for Intangible Concepts
Query Sample Set at Level 1 Sensitivity to Relevancy Assumptions where
25% of the documents returned by both the Baseline and the Enhanced
search engines were assumed to be non-relevant. ..................................... 297
Table J.3
Recall*, Precision, and F-measure* calculations for Tangible Concepts
Query Sample Set at Level 2 Sensitivity to Relevancy Assumptions where
50% of the documents returned by both the Baseline and the Enhanced
search engines were assumed to be non-relevant ...................................... 302
Table J.4
Recall*, Precision, and F-measure* calculations for Intangible Concepts
Query Sample Set at Level 2 Sensitivity to Relevancy Assumptions where
50% of the documents returned by both the Baseline and the Enhanced
search engines were assumed to be non-relevant. ..................................... 304
Table J.5
Recall*, Precision, and F-measure* calculations for Tangible Concepts
Query Sample Set at Level 3 Sensitivity to Relevancy Assumptions where
75% of the documents returned by both the Baseline and the Enhanced
search engines were assumed to be non-relevant. ..................................... 309
Table J.6
Recall*, Precision, and F-measure* calculations for Intangible Concepts
Query Sample Set at Level 3 Sensitivity to Relevancy Assumptions where
75% of the documents returned by both the Baseline and the Enhanced
search engines were assumed to be non-relevant. ..................................... 311
Table J.7
Recall*, Precision, and F-measure* calculations for Tangible Concepts
Query Sample Set at Level 4 Sensitivity to Relevancy Assumptions where
estimated number of relevant documents that may be assumed are the
documents returned by all three search engines (i.e., Baseline, Enhanced,
and Google Desktop search engines) ......................................................... 316
Table J.8
Recall*, Precision, and F-measure* calculations for Intangible Concepts
Query Sample Set at Level 4 Sensitivity to Relevancy Assumptions where
estimated number of relevant documents to be assumed are generated from
overlap of results with Google Desktop..................................................... 318
ABSTRACT
The dissertation research explores an approach to automatic concept-based query
expansion to improve search engine performance. It uses a network-based approach for
identifying the concept represented by the user’s query and is founded on the idea that a
collection-specific association thesaurus can be used to create a reasonable representation
of all the concepts within the document collection as well as the relationships these
concepts have to one another. Because the representation is generated using data from the
association thesaurus, a mapping will exist between the representation of the concepts
and the terms used to describe these concepts. The research applies to search engines
designed for use in an individual website with content focused on a specific conceptual
domain. Therefore, both the document collection and the subject content must be well-bounded, which affords the ability to make use of techniques not currently feasible for
general-purpose search engines used on the entire web.
CHAPTER 1. INTRODUCTION
It is difficult to overemphasize the importance of the search engine in today’s
Web environment. It is estimated that the Web contains over 13.02 billion documents 1
and within this enormous collection exists information related to almost any imaginable
topic. The challenge in learning about a particular topic arises not because relevant
information does not exist on the Web, but rather because it can be very difficult to
efficiently retrieve the relatively small subset of documents on the Web that meets a
specific information need. One solution to this challenge is to use a search engine. Search
engines provide a way to identify and access documents containing relevant information
that would otherwise remain unknown to the user (Baeza-Yates & Ribeiro-Neto, 1999;
Savoy & Gaussier, 2010; Wolfram, Spink, Jansen, & Saracevic, 2001).
An analogous challenge exists when the information required is contained within
an individual website (i.e., local web domain). Individual websites can contain a large
amount of detailed information about a particular subject. Finding the specific, relevant
information within the website is often difficult, and, again, the search engine can serve
as an important tool for helping users locate the specific documents that fill their
information needs.
Designing a search engine is deceptively difficult. One of the primary reasons for
this difficulty is that natural language is used as the medium to form the information bridges 2 between the user with the information need and the authors of the documents in the collection.
1 The size of the World Wide Web. (9 February 2012). Retrieved from http://www.worldwidewebsize.com/
Users form a query using natural language as an abstraction to represent
their information need, and the authors of the documents in the collection use natural
language as an abstraction to represent the concepts in the documents. The translation
into and representation using natural language introduces issues with the accuracy and
completeness of the description of the concepts on both sides of the information bridge.
Therefore, the search engine must perform its task using a frequently incomplete,
imprecise, and ambiguous description of both the user’s information need and the related
concepts included in the documents within the collection (Savoy & Gaussier, 2010). In
addition, the challenge of the task is exacerbated because the less-than-perfect
descriptions of the same concept can be expressed using different words and phrases due
to the richness and productivity (i.e., invention of new words and new uses of old words)
of natural language (Manning & Schütze, 1999).
Search engines may be designed using a symbolic search algorithm that compares
the text of the user’s query against text found in the documents within the collection. A
document is identified as relevant when string patterns in its text match the string patterns
in the query text. While a symbolic approach is typically able to identify a significant
portion of the relevant documents, it is often not sufficient for identifying the complete
set of relevant documents contained in the collection. The primary reason is that the use
of natural language introduces the issues mentioned above that limit the ability of a
purely symbolic approach to identify all the relevant information.
2 The concept of information bridges is described by Martin Frické in his book Logic and the Organization of Information published in 2012.
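As a minimal illustration of this kind of symbolic matching, the following sketch (a hypothetical Python example, not the baseline search engine built for this research) treats a query as a set of term strings and returns a document only when every one of those strings appears in its text; a document that expresses the same concept in different words is missed.

    def tokenize(text):
        """Lowercase the text and split it into term strings."""
        return set(text.lower().split())

    def symbolic_search(query, documents):
        """Return ids of documents containing every query term as an exact string match."""
        query_terms = tokenize(query)
        return [doc_id for doc_id, text in documents.items()
                if query_terms <= tokenize(text)]

    # Hypothetical two-document collection: only doc1 matches the query below,
    # even though doc2 describes the same concept in different words.
    documents = {
        "doc1": "warning labels must be understandable to the operator",
        "doc2": "caution placards should be easy for crews to interpret",
    }
    print(symbolic_search("understandable warning", documents))   # ['doc1']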
In addition, it is important to note that the information missed by a symbolic
search algorithm may be essential to fully understanding the relevant concepts contained
in the document collection. The information is missed because it is expressed differently
from what the query posed (i.e., the text string patterns are not the same). This missed
information may present a unique way of thinking about the concept by highlighting,
emphasizing, or connecting aspects of the idea not captured by the query terms, aspects of
which the user may not even be aware. Therefore, retrieving this information would fill
deficits in the user’s understanding of the concept and thereby more completely fill the
user’s information need.
Search engine design is an actively researched area and presents interesting
challenges in discovering effective ways to identify relevant information missed by
symbolic search algorithms. One area of research addressing these challenges is the
development of semantic information to either replace or augment symbolic search
engine designs. A portion of this research focuses on manually developing (i.e., a human
expert is required to develop) semantic information. This includes research in developing,
defining, and using semantic relations among lexical elements (e.g., WordNet, Fellbaum,
1998) as well as research that addresses the manual development of domain ontologies to
extract and represent meaning based on a constructed world model (Nirenburg & Raskin,
2004). However, the manual development of semantic information is extremely time-consuming and expensive, and the resulting resources either lack the specificity needed
for technical domains or lack portability for reuse in other conceptual domains and over time within an evolving
conceptual domain (Anderson & Pérez-Carballo, 2001; Manning & Schütze, 1999).
Because of these limitations, other lines of research exploring ways to
automatically generate information to augment symbolic search engine designs are
attractive. One area of this type of research uses Natural Language Processing (NLP)
techniques to perform automatic query expansion in an attempt to more completely
define and describe a user’s query. By augmenting the user’s query with additional search
terms, additional candidate string patterns are available when performing the symbolic
search. It is believed that these added candidate strings allow additional opportunities for
a symbolic search algorithm to identify additional documents that contain information
relevant to the desired concept.
Unfortunately, there has been a marked lack of success in this line of research
over the years, and many of the resulting systems either fail to improve or actually decrease
the performance of the search engine (Qiu & Frei, 1993; Brants, 2003). Qiu and Frei (1993)
present a theory that the lack of success with previous query expansion methods is
primarily because the methods were based on adding terms similar to each of the
individual terms used to construct the query instead of adding terms similar to the overall
concept the query describes. This method of expanding a query by adding terms similar
only to the individual terms of the query often introduces tangential, non-relevant
concepts to the query, causing the search algorithm to identify documents that are not
relevant to the original query concept. To address this, Qiu and Frei developed an
alternate method in which their search algorithm creates a vector of the query in the Term
Vector Space (TVS) generated from the document collection and expands the query with
those terms that have a high similarity to the query vector. While Qiu and Frei achieved
a notable improvement in search engine performance, they noted that their method was
less successful than systems that used mature user feedback relevance data. This result
indicates that relevant documents were missed by their search algorithm. To create a
search engine capable of identifying all relevant documents without a significant increase
in non-relevant documents, a more effective method for concept-based query expansion
is needed to augment the performance of symbolic search engine design.
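A rough sketch of this kind of vector-space, concept-based expansion, using an assumed toy term-by-document matrix rather than Qiu and Frei’s exact formulation, is shown below: each term is a row vector, the query concept is the sum of its term vectors, and the terms most similar to that combined vector are added to the query.

    import numpy as np

    # Toy term-by-document matrix: rows are terms, columns are documents.
    # The counts are illustrative only and are not data from this research.
    terms = ["warning", "label", "understandable", "color", "font"]
    term_doc = np.array([
        [3, 0, 2, 1],   # warning
        [2, 1, 2, 0],   # label
        [1, 0, 3, 1],   # understandable
        [0, 2, 1, 1],   # color
        [0, 3, 0, 0],   # font
    ], dtype=float)

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    def expand_query(query_terms, k=2):
        """Add the k terms most similar to the combined vector of the whole query."""
        idx = [terms.index(t) for t in query_terms]
        query_vec = term_doc[idx].sum(axis=0)        # vector representing the query concept
        scores = {t: cosine(term_doc[i], query_vec)
                  for i, t in enumerate(terms) if t not in query_terms}
        return list(query_terms) + sorted(scores, key=scores.get, reverse=True)[:k]

    print(expand_query(["warning", "understandable"]))   # adds the most concept-similar terms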
In this dissertation research project, the goal is to extend the work on, and the
state-of-the-art understanding of, automatic concept-based query expansion methods. The project
explores an approach to automatic query expansion able to identify a greater portion of
relevant documents without increasing the portion of non-relevant documents returned in
response to the user’s query. The research applies to search engines designed for use in
an individual website with content focused on a specific conceptual domain (sometimes
referred to as a vertical search 3). Therefore, both the document collection and the subject
content must be well-bounded, which affords the ability to make use of techniques not
currently feasible for general-purpose search engines used on the entire web. In the
following sections, the scope of the work, the idea that drives the approach, and a walkthrough of an example query to illustrate it will be presented.
3 “Vertical search is a specialized form of web search where the domain of the search is restricted to a particular topic” (Croft, Metzler, & Strohman, 2010, p. 3). Other terms used to describe searches that are limited by the conceptual domain are focused search or topical search.
1.1. Outside the Scope
Designing a search engine, as mentioned above, is a complex and difficult
process. There are many aspects to be considered if the ultimate goal is to design a
search engine that performs as well as a human but as fast as a computer. To keep this
dissertation work well-contained, it is focused on addressing only a few aspects of
importance.
The research focuses on the design of a search engine that extends the work on, and the
state-of-the-art understanding of, automatic concept-based query expansion. The search
engine is intended to be used by searchers who have a well-formed information need in
advance of using the search engine but may not be able to articulate their information
need with a high level of precision. The search engine is intended to be used on a
bounded, medium-sized document collection whose documents are focused on a
single scientific or other technical subject domain. A medium-sized
document collection can be assumed to be comprised of several thousand documents 4.
There are many important and interesting issues that have been defined as outside
the scope of this research. Related and relevant issues not addressed in this research
include:
• Formation and definition of the information need
o Challenges with forming and defining an information need – The early stages of Kuhlthau’s (2004) Information Search Process (ISP), in which the information seeking task is initiated, a topic is selected, information about the topic is explored, and the information seeker formulates a focused perspective on the topic, are not addressed.
4 Qiu & Frei (1993) refer to the CACM test collection, which comprises 3,204 documents, as a medium-sized document collection. The CACM collection is a collection of titles and abstracts from Communications of the ACM.
o Challenges with expressing an information need in the form of a
query – The ability of a user to develop a precise description in the
form of a query that truly expresses their particular information need is
not addressed.
• Accommodation of user behaviors and backgrounds
o Accommodating a variety of information seeking behaviors –
Identifying and effectively facilitating various searching behaviors
(e.g., browsing, berry picking, or horizontal search) which may be
preferred by the users is not addressed.
o Accommodating varying user backgrounds – Adapting to facilitate
users with varying backgrounds to effectively conduct a search is not
addressed. This research assumes that the users for which the website
and its document collection were designed are the intended users of
the search engine.
• Capabilities and technical details of the search engine
o Ability to process languages other than English – This research is
focused only on document collections and queries in English.
o Ranking documents returned in order of relevance to query – This
research is focused only on generating a set of relevant documents
contained in the collection and does not employ or address document
relevance ranking algorithms.
o Computational efficiency of approach – The computational efficiency
of the design and the feasibility of real-world use of the search engine
design are not addressed. It is assumed that these issues will be
considered in the future if the research produces favorable results that
suggest that the approach may be useful in a real-world application.
1.2. Improved Automatic Query Expansion Using Relational Pathways
As mentioned earlier, previous work has shown that expansion methods are
generally not successful when the terms used to expand the query are identified by only
individually considering the terms that make up the query (i.e., using synonyms for each
individual term). In response to this problem, Qiu & Frei (1993) investigated a concept-based query expansion method using the Vector Space Model that sought to add terms
related to the overall concept of the query rather than synonyms for individual terms.
Their method improved retrieval performance over a non-enhanced search engine but still
missed relevant documents found using other methods.
The new method proposed in this research uses a network-based rather than a
vector-based approach for identifying the concept represented by the user’s query. It is
founded on the idea that a collection-specific association thesaurus can be used to create a
reasonable representation of all the concepts within the document collection as well as
the relationships these concepts have to one another. Because the representation is
generated using data from the association thesaurus, a mapping will exist between the
representation of the concepts and the terms used to describe these concepts.
To do this, an interconnected network will be generated from each of the
association thesaurus entries. The terms will be represented by nodes and the connections
between the nodes will be created through the relationship defined by the IS-ASSOCIATED-WITH terms identified for each thesaurus entry. At search time, the
relevant portion of the overall conceptual network will be used to represent the intended
concept of the query. This will be accomplished by identifying all those pathways that
exist within the network that connect together the individual terms from the query. Each
of these relational pathways will represent a particular aspect of the overall desired
concept, and the intervening nodes (i.e., terms) on these pathways between a pair of
query terms will be candidate terms for expanding the original query.
The idea that such a conceptual network could be automatically generated from
the association thesaurus for the document collection is based on the following
assumptions. Consider that a local website houses a document collection that contains
information concentrated on a particular conceptual subject domain. Given this, the
following properties will be assumed for all content-bearing terms contained in the
collection (i.e., terms that provide meaning and, therefore, exclude stop words or words
that simply provide the structural framework in the document):
• Terms are used to describe concepts within a conceptual domain; therefore, they are used in regular and intentional ways.
• Individual terms represent some portion of an overall concept.
• Terms that are in close proximity to one another are likely used to describe a single (though possibly complex) concept (i.e., each term contributing in conjunction with other nearby terms to the overall meaning of a concept).
• Terms that frequently co-occur are likely conceptually related.
Based on these assumptions, a collection-specific association thesaurus could be
constructed using the co-occurrence of terms (i.e., terms that frequently appear within
close proximity to one another), and each of its term entries would represent small
clusters of conceptually related terms. Linking these clusters together through shared
terms would result in a complex network of terms linked by way of association-based
relationships.
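A minimal sketch of how such a co-occurrence-based association thesaurus could be built is shown below. It is a hypothetical illustration that uses an arbitrary window size and a raw co-occurrence cutoff; the design evaluated in this research instead filters associations with a Jaccard coefficient threshold (see Section 6.10.1).

    from collections import defaultdict

    STOP_WORDS = {"a", "an", "the", "of", "to", "and", "in"}   # illustrative stop list

    def association_thesaurus(documents, window=5, min_count=2):
        """Map each content-bearing term to the terms it frequently co-occurs with."""
        co_counts = defaultdict(int)
        for text in documents:
            tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
            for i in range(len(tokens)):
                for j in range(i + 1, min(i + window, len(tokens))):
                    if tokens[i] != tokens[j]:
                        pair = tuple(sorted((tokens[i], tokens[j])))
                        co_counts[pair] += 1
        thesaurus = defaultdict(set)
        for (t1, t2), count in co_counts.items():
            if count >= min_count:              # keep only the frequent associations
                thesaurus[t1].add(t2)
                thesaurus[t2].add(t1)
        return thesaurus

    thesaurus = association_thesaurus([
        "color coding makes a warning understandable",
        "an understandable warning uses clear color coding",
    ])
    print(sorted(thesaurus["warning"]))   # ['coding', 'color', 'understandable']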
Because of the above properties of the content-bearing terms and the co-occurrence-based method by which the association relationships are generated, the
resulting association thesaurus would contain the data necessary to generate a network
capable of representing a reasonable approximation of all the concepts represented in the
document collection and the conceptual relationships among these concepts. Various
appropriately short paths between two terms could then be used to represent specific,
relevant aspects of a concept expressed by the term pair for which information exists in
the document collection.
1.2.1. Generating the Conceptual Network
We can think of each term in the association thesaurus as a node. Child nodes for
a term are generated from the associated terms defined in its thesaurus entry. For
example, the thesaurus entry for term A may contain terms B, C, D, E, and F. From this
information, a term cluster as shown below in Figure 1.1 could be generated. Following
this pattern, term clusters could be generated for all entries in the association thesaurus.
Figure 1.1. Term cluster generated from thesaurus entry for term A: main term A is-associated-with B, C, D, E, and F.
To form the full network, each term cluster would be linked to the other term
clusters using shared terms as defined by the association relationships. For example, the
thesaurus entry for one of term A’s associated terms, term D, may contain entries for A,
G, H, and I and could be linked to the term A cluster as illustrated in Figure 1.2.
Figure 1.2. Term clusters for term A and term D connected by association relationship generated from thesaurus entries: main term A is-associated-with B, C, D, E, and F; main term D is-associated-with A, G, H, and I.
The entire conceptual network would be developed by continuing this linking
process using all shared terms defined through relationships defined in the association
thesaurus.
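As a sketch of this linking process, the short Python fragment below (a toy illustration under assumed entries, not the system built for this research) represents the network as an adjacency structure and reproduces the situation of Figures 1.1 and 1.2, where the clusters for terms A and D are joined through their shared association.

# Illustrative sketch: turn association thesaurus entries into an undirected
# term network. Nodes are terms; an edge joins a main term to each of its
# is-associated-with terms, so clusters that share terms link automatically.
from collections import defaultdict

def build_conceptual_network(thesaurus):
    # thesaurus: {main_term: [associated terms]} -> {term: set of neighboring terms}
    network = defaultdict(set)
    for main_term, associated in thesaurus.items():
        for assoc_term in associated:
            network[main_term].add(assoc_term)
            network[assoc_term].add(main_term)  # association treated as symmetric
    return network

# Toy entries mirroring Figures 1.1 and 1.2: terms A and D share an association.
toy_thesaurus = {"A": ["B", "C", "D", "E", "F"],
                 "D": ["A", "G", "H", "I"]}
network = build_conceptual_network(toy_thesaurus)
print(sorted(network["D"]))  # ['A', 'G', 'H', 'I'] -- D links the two clusters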
Once the entire network has been created, if we want to know the relationship
between any two terms, we could follow a path within this network from one term to the
other term. For example, Figure 1.3 shows one imaginary path that could be traversed to
connect term A to term J.
Figure 1.3. Imaginary relational pathway between terms A and J: A – D – G – J.
As described above, because the overall network represents an approximation of
the concepts described within the document collection and their relationships to one
another, the relational pathways that exist between terms (i.e., nodes) represent specific,
relevant aspects of a concept expressed by the term pair. The path A – D – G – J then can
be thought of as representing a particular aspect of the concept that would be expressed
by using the two terms A and J together. The intervening nodes, terms D and G, can then
be identified as additional concept-relevant terms that may be used to expand the user’s
original query.
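A minimal sketch of this path-based step is shown below; it enumerates the simple paths between two query terms up to an assumed maximum length and collects the intervening terms. The depth-bounded search and the toy network (which mirrors the A – D – G – J pathway of Figure 1.3) are illustrative assumptions rather than the implementation evaluated in this research.

# Illustrative sketch: enumerate appropriately short relational pathways
# between two query terms and collect the intervening terms as candidate
# expansion terms. The maximum path length is a design decision; the value
# used here is arbitrary.
def relational_pathways(network, start, end, max_length=5):
    # Return all simple paths from start to end containing at most max_length nodes.
    paths = []
    def walk(node, path):
        if len(path) > max_length:
            return
        if node == end:
            paths.append(path)
            return
        for neighbor in network.get(node, ()):
            if neighbor not in path:  # keep paths simple (no repeated terms)
                walk(neighbor, path + [neighbor])
    walk(start, [start])
    return paths

def candidate_expansion_terms(paths):
    # Intervening nodes on the pathways, i.e., everything except the two endpoints.
    return {term for path in paths for term in path[1:-1]}

# Toy network containing the pathway of Figure 1.3.
toy_network = {"A": {"D"}, "D": {"A", "G"}, "G": {"D", "J"}, "J": {"G"}}
paths = relational_pathways(toy_network, "A", "J", max_length=4)
print(paths)                             # [['A', 'D', 'G', 'J']]
print(candidate_expansion_terms(paths))  # {'D', 'G'}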
It is important to note that because the association thesaurus is created from the
document collection itself, the conceptual relationships represented in the network are
only derived from information that exists in the document collection. Because we want to
use the network to expand a query for retrieving information from the collection, we are
only interested in aspects of the concept that are present in the document collection. We cannot assume, however, that all aspects of a given concept will be represented in our network, and the network would not be appropriate for use with other document collections. Each document collection is unique and therefore requires a unique conceptual network.
The success of this method for generating a conceptual network from the
collection-specific association thesaurus may be impacted by various search engine
design decisions including:
• The number of relational paths used to identify expansion terms for a given term pair.
• The longest length of a path between nodes that should be used (i.e., that still provides a meaningful representation of a concept).
• Which content-bearing terms in the association thesaurus are included in the network.
1.2.2. Formulating the Expanded Query
Once we have identified additional concept-relevant terms that may be used to
expand the user’s original query, we must appropriately formulate the new query to send
into the search engine.
The idea behind using the relational pathways between terms to identify
additional terms is that we want any additional terms to be focused on the intended
concept expressed in the original query. Adding such concept-relevant terms will allow
us to maximize the matches to documents containing information relevant to the intended
concept and minimize non-relevant document matches. Therefore, instead of using all
the associated terms for an individual query term, we will only expand the query with
those associated terms that we can identify as having a connection to the overall concept.
In addition to only choosing additional terms that have been found to be related to
the overall concept, we want to be smart about how we use those terms in the expanded
query. Therefore, we should formulate the expanded query in such a way that allows us
to capitalize on the implied properties of the relationships between the terms along a
pathway. The first implied property is that for any given relational path, we can assume
that the closer the nodes are to one another in the path, the stronger their similarity is to
one another. For example, given the path A – B – C – D – E, the term A is likely to be
more similar to B than it is to D. The second implied property is that we can assume that
the full path expresses a fairly complete specification of the aspect of the concept that it
represents, but the concept will not always be expressed in an individual document using
all the terms that make up the path. Therefore, a document containing terms that
represent only a partial path may still represent the desired concept. For example, it may
be the case that a document that describes the desired concept contains the terms A, C, D,
and E but does not include the term B. In another example, it may be the case that the
desired concept is described using the terms B, C, and D but does not include the terms A
nor E.
Taking these points into consideration, the expanded query should be formulated
to look for appropriate combinations of terms that represent partial versions of the
identified pathways. It is believed that a simple, yet effective implementation of this can
be accomplished using nested Boolean phrases. While it is possible that other models
could be used to formulate an expanded query using relational pathways with increased
retrieval precision, this simple implementation will likely serve as an efficient proof-ofconcept approach.
As an example of generating an expanded query using a simple Boolean
implementation, a relational pathway with a length of five nodes may be divided into
three partial relational paths and formed into a nested Boolean query as is illustrated in
Figure 1.4.
Figure 1.4. Simple Boolean implementation for formulating an expanded query from path A – B – C – D – E. Boolean query phrase: (A AND E) OR (A AND C AND D) OR (B AND C AND E) OR (A AND B AND D) OR (B AND D AND E). In the visual representation, the solid circles on the pathway indicate the terms that must be present in the document for the document to be identified as relevant.
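To make the query assembly concrete, the short Python sketch below builds the Boolean query string from a set of partial relational paths. How the partial paths are selected from the full pathway is itself a design decision, so the sketch simply takes them as input, using the sub-phrases of Figure 1.4 as its example; the function name and structure are illustrative assumptions rather than the system's actual implementation.

# Illustrative sketch: assemble an expanded Boolean query from partial
# relational paths. Selecting which partial paths to use is a separate design
# decision, so they are passed in explicitly here.
def boolean_query(partial_paths):
    # OR together AND-groups, one group per partial relational path.
    groups = ["(" + " AND ".join(terms) + ")" for terms in partial_paths]
    return " OR ".join(groups)

# Partial paths for the pathway A - B - C - D - E, as illustrated in Figure 1.4.
partials = [["A", "E"],
            ["A", "C", "D"],
            ["B", "C", "E"],
            ["A", "B", "D"],
            ["B", "D", "E"]]
print(boolean_query(partials))
# (A AND E) OR (A AND C AND D) OR (B AND C AND E) OR (A AND B AND D) OR (B AND D AND E)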
The success of this method for formulating an expanded query from the relational
pathways between query term pairs may be impacted by various search engine design
decisions including:
• The combinations of original terms and candidate expansion terms from the relational pathways used to formulate the expanded query.
• The number of terms (i.e., nodes) that make up the path that are necessary for effectively representing the full concept in the query.
• Whether term proximity is used in the expanded query (e.g., all the terms searched for from the relational path occur within x terms of each other in the document).
1.3. Example Query Walkthrough
As previously discussed, purely symbolic search algorithms may have difficulty
generating a conceptually complete result set of documents to fill the user’s information
need. For example, consider that a local website houses a document collection whose
content is focused on the Human Factors design elements of aircraft flight decks and that
a user has the following question about flight deck interface design: How do you ensure
that warnings are understandable? To conduct this search, the user enters the terms
warnings understandable into the search engine.
When a symbolic search algorithm is used, each document returned as a “match”
by the search engine will contain both the word warnings and the word
understandable5. To the user who is not an expert on the contents of the document set
contained on the website, the list of retrieved documents may appear to be complete and
the user may think, “If I read all of these documents, my question will be answered and I
will have learned all the information that the documents on this website have to offer
about this topic.”
5 Most modern search engines are capable of performing a symbolic search in which alternate grammatical forms of words are also automatically included in the search. For example, alternate forms for the word warnings may include warn, warns, and warning, and alternate grammatical forms for the word understandable may include understand, understands, and understanding. The baseline search engine to be used for comparison as well as the search system enhanced with my research will also include this functionality.
However, it is possible that the result set may have missed some important relevant documents. For example, the missed documents could talk about employing the
aviation convention of using the color red on flight deck displays to represent warning
messages as in this passage:
To ensure that the pilot is made aware of and understands the impending danger
of this situation, the message presented must be color-coded red.
This passage uses a form of the word understandable but does not include any variants of
the word warnings and, therefore, would not be considered a match to the user’s query.
However, by taking into consideration the semantic content of the passage (namely the
knowledge that one definition of the word warning in the aviation domain is a message
used to make pilots aware of impending danger), it is clear that the example passage is
relevant to the concept of ensuring that warnings are understandable. If there were no
other passages that contained the two original query words in this document, this
document would not be included in the result set. A search engine that uses only a
symbolic search algorithm would miss some important information relevant to answering
the user’s question.
The vision addressed in this research is that a symbolic search engine enhanced
with the ideas presented in the previous section for an improved automatic concept-based
query expansion would enhance the query sent into the search engine as follows. The
enhanced search engine would begin by discovering the portion of the overall conceptual
network that represents the concept warnings understandable. It does this by
identifying all those pathways that exist within the network that connect the terms
warnings and understandable together. For example, assume that the entry from the
collection-specific association thesaurus for the term warning contains the terms
caution, aware, indicate, alert, annunciation, and nuisance. And also assume that the
entry for the term understandable contains the terms distinct, meaning, clarity,
confusing, coding, and clear. From this information, two term clusters as shown below
in Figure 1.5 could be generated.
Figure 1.5. Term clusters generated from collection-specific association thesaurus entries for terms warning and understandable: main term warning is-associated-with caution, aware, indicate, alert, annunciation, and nuisance; main term understandable is-associated-with distinct, meaning, clarity, confusing, clear, and coding.
The algorithm then identifies all the (appropriately short) relational pathways that
exist between these two terms. By identifying all these pathways, it in turn identifies all
the aspects of the concept that are present in the document collection.
Imagine that the following relational pathways exist between the two terms and
illustrate some of the conceptual relationships that exist between these two terms in the
document collection:
• warnings – aware – color – coding – understandable
• warnings – aware – color – confusing – understandable
• warnings – caution – clear – understandable
• warnings – indicate – clear – understandable
From the relational pathways identified, each of the intervening terms on the pathways
that connect the pair of original query terms are candidate terms for enhancing and
expanding the query. If we consider the relational path warnings – aware – color –
coding – understandable, we see that the candidate terms are aware, color, and coding.
Using the simple implementation described above in Figure 1.4 as a model, the enhanced
system would create the Boolean query phrase shown in Figure 1.6.
Figure 1.6. Example Boolean query for the relational path warnings – aware – color – coding – understandable. Boolean query phrase: (warnings AND understandable) OR (warnings AND color AND coding) OR (aware AND color AND understandable) OR (warnings AND aware AND coding) OR (aware AND coding AND understandable).
If we return to consider our example passage
To ensure that the pilot is made aware of and understands the impending danger
of this situation, the message presented must be color-coded red.
we see that the enhanced Boolean query phrase provides a symbolic match to sub-phrase 3, (aware AND color AND understandable). The enhanced system, therefore, would return the document containing this passage as
a match to the intended concept of the user’s query. (Note: In the actual enhanced search
engine, the enhanced Boolean query phrase shown in Figure 1.6 would be combined with
the other enhanced Boolean query phrases created from each of the other relational
pathways identified for the original pair of terms.)
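A small sanity check of this match is sketched below in Python: after a crude stem-style reduction (a stand-in assumption for the baseline engine's handling of grammatical variants, not its actual stemming component), the example passage satisfies sub-phrase 3 of the expanded query even though it would not match the original two-term query.

# Illustrative check: sub-phrase 3, (aware AND color AND understandable),
# matches the example passage once grammatical variants are reduced. The
# crude prefix comparison below is only a stand-in for real stemming.
import re

passage = ("To ensure that the pilot is made aware of and understands the "
           "impending danger of this situation, the message presented must "
           "be color-coded red.")
tokens = re.findall(r"[a-z]+", passage.lower())

def present(term, tokens):
    # True if some token shares the term's first five characters (crude stem match).
    stem = term[:5]
    return any(tok.startswith(stem) for tok in tokens)

print(all(present(t, tokens) for t in ["aware", "color", "understandable"]))  # True
print(present("warnings", tokens))  # False -- the original two-term query misses this passage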
The relational pathways in the conceptual network created through the collection-specific association thesaurus may be a powerful way to take advantage of the inherent
mapping that exists between the concepts and the pattern of terms used to describe them
in the document collection. Based on the relational pathways that exist in the document
collection, the terms chosen to enhance the query, and how the query is reformulated, this
approach to concept-based query expansion may be able to return a more complete set of
documents without a significant addition of non-relevant documents.
1.4. Dissertation Report Structure
The remaining chapters present a description of the research conducted to
determine if the ideas described in this introductory chapter of using relational pathways
derived from an automatically generated conceptual network to expand user queries
could be an effective approach in improving the performance of search engines used on
domain-specific collections.
The next chapter provides a review of the relevant literature in the fields of
information retrieval and natural language processing to understand the foundational
concepts and recent research related to automatic concept-based query expansion
methods. Then, the research hypotheses posed, the methodology used to test the research
hypotheses, and the experimental results collected are presented. And finally, the
analysis of the results is discussed and conclusions are drawn. Data, calculations, and other supporting information are presented in Appendices A through H.
CHAPTER 2. LITERATURE REVIEW
In this chapter, the foundational concepts and recent research related to automatic
concept-based query expansion methods will be reviewed.
2.1. Improving Information Retrieval using Natural Language Processing
The fields of Information Retrieval and Natural Language Processing (NLP)
illustrate an excellent example of two disciplines coming together to create new ways of
thinking about old problems and devising new strategies to address them. However, the
path has not always been easy or successful. To make progress in solving what may at
times seem to be the intractable remaining problems requires thoughtful consideration of
the lessons learned by earlier attempts to apply NLP to Information Retrieval systems. In
this section, an overview of the fields of Information Retrieval and NLP will be
presented. In addition, the primary challenges in using NLP to improve information
retrieval performance will be discussed.
2.1.1. Information Retrieval
Information retrieval has its origins in Library Science and Computer Science.
One of the pioneers of information retrieval, Gerard Salton, defined it as follows:
Information Retrieval is a field concerned with the structure, analysis,
organization, storage, searching, and retrieval of information. (1968, p. v)
This general, all-encompassing definition captures the wide range of areas that are
part of information retrieval. Information retrieval consists of all the elements and
processes that are necessary to access the information needed to satisfy an information
need. This includes the behind-the-scenes data architecture and representation used in
storage and processing, the front-end interface with which the user enters the query and
views the results, and everything in-between.
Because the goal of an Information Retrieval system is to identify and fulfill the
user’s information need, an important aspect of information retrieval is that “[t]he
representation and organization of the information items should provide the user with
easy access to the information in which he is interested” (Baeza-Yates and Ribeiro-Neto,
1999, p. 1) (Liddy, 1998; Croft, Metzler, & Strohman, 2010; Savoy and Gaussier, 2010).
2.1.1.1. Data Retrieval versus Information Retrieval
In this field, a key distinction is made between data retrieval and information
retrieval (van Rijsbergen, 1979; Baeza-Yates & Ribeiro-Neto, 1999). To understand the
distinction between these two types of retrieval, a definition of data versus information is
important. While a universally acceptable, formal definition of these two concepts may
be impossible to find, the following informal definition is sufficient to understand the
important distinction. Assume that data is a raw, unprocessed message while information
is a message that has been processed, structured, and organized within a particular
context in order to make it useful.
This conceptual difference is extended to the different types of retrieval. Data
retrieval is concerned with identifying which records (which may be full documents)
contain an exact match to the keywords in the user’s query. No semantic information is
needed to perform the straight-forward symbolic pattern matching task to identify that a
“data” match exists between the user’s query and a record. A common example of a data
retrieval system is a relational database (van Rijsbergen, 1979; Baeza-Yates and Ribeiro-Neto, 1999).
On the other hand, as the name implies, information retrieval is concerned with
identifying information and requires that at least some of the intrinsic semantics of the
text be considered. Therefore, a system must be able to identify documents that contain
varied expressions of relevant concepts in order to be an effective retriever of
information. The expression of the concept may be represented using a variety of
terminology and grammatical constructs. The simple string matching function used for
data retrieval is often not sufficient for generating a complete set of relevant documents
in information retrieval.
2.1.1.2. Information Retrieval Systems
Information retrieval systems are useful whenever there is a need to retrieve
relevant information from a large collection of information bearing items. Not
surprisingly, some of the first institutions to adopt information retrieval systems for
retrieving needed information were libraries. For example, in 1964 the National Library
of Medicine (NLM) began using the computer for batch processing bibliographic
information retrieval (Baeza-Yates and Ribeiro-Neto, 1999). Early information retrieval
systems used in libraries typically were either for searching bibliographic catalog records
of the library’s locally held materials like that used by the NLM or for “searching remote
electronic databases provided by commercial vendors in order to provide reference
services” (Baeza-Yates and Ribeiro-Neto, 1999, p. 397). Today, the information retrieval
systems used in libraries have evolved into powerful resources where the distinctions
between locally held and remote material is often blurred and may contain references to
both physical and electronic materials.
“Desktop and file system search provides another example of a widely used
[information retrieval] application. A desktop search engine provides search and
browsing facilities for files stored on a local hard disk and possibly on disks connected
over a local network” (Büttcher, Clarke, & Cormack, 2010, p. 3).
But arguably, the most well-known and heavily used information retrieval
systems today are Web search engines (Büttcher, Clarke, & Cormack, 2010). Search
engines allow users to manage, retrieve, and filter information from a large, constantly
changing, unstructured set of documents that exist on the Web. It is estimated that the
Web contains over 8.28 billion documents 6 and within this huge collection exists
information related to almost any imaginable topic. The challenge in learning about a
particular topic arises not because relevant information does not exist on the Web, but
rather because it can be very difficult to efficiently retrieve the relatively small subset of
documents on the Web that meets a specific information need. One solution to this
challenge is to use a search engine. Search engines provide a way to identify and access
documents containing relevant information that would otherwise remain unknown to the
user (Baeza-Yates and Ribeiro-Neto, 1999; Savoy & Gaussier, 2010; Wolfram, Spink,
Jansen, & Saracevic, 2001).
6 The size of the World Wide Web. (14 December 2011). Retrieved from http://www.worldwidewebsize.com/
An analogous challenge exists when the information required is contained within
an individual website (i.e., local domain). Individual websites can contain a large amount
of detailed information about a particular subject. Finding the specific relevant
information within the website is often difficult, and again, the search engine can serve as
an important tool in helping users locate the specific documents that fill their information
need.
2.1.1.3. The Retrieval Process
The retrieval process in an information retrieval system consists of two distinct
phases: first, preparing the system for use and second, processing the query submitted to
the system in order to return the results.
Figure 2.1. The two phases of the retrieval process in an information retrieval system. Preparing the System: Defining the Collection, Text Acquisition, Text Transformation, Indexing. Processing the Query: Submitting a Query, Identifying Relevant Document Matches, Returning Results.
2.1.1.3.1. Preparing the System
Before the first query is submitted, the system must be prepared to effectively
process queries and quickly return the relevant results.
2.1.1.3.1.1. Defining the Collection
The first step in preparing the system includes specifying the documents that will
be included in the searchable collection, the elements from the documents that will be
searchable, and the elements of the document that will be retrieved. Specifying the
documents to be included in the information retrieval system is typically a straightforward task and consists of defining the bounds of the document collection. An
information retrieval system designed to be a web search engine for a local web domain
may contain all textual documents stored on the domain or alternatively may include only
those documents that contain information-bearing text and exclude documents that only
provide structure or navigation elements (e.g., site maps, navigational menus, etc.). Next,
the elements of documents that should be searchable need to be specified. This will
likely include the main body of the document but may also include various metadata
about the document such as title, author, document type, or other tags or keywords that
have been assigned to the document. And finally, the elements of the document that will
be presented in the results must be specified so that the system can store and make
accessible the specified information for each document. This may include information
that describes the document like its title and a hyperlink to the full text of the document.
If snippets of relevant text from the document will be presented to the user, the text of the
document will need to be stored by the system in such a way that the location of the
relevant text can be efficiently identified and presented when the system displays the
results (Baeza-Yates and Ribeiro-Neto, 1999; Croft, Metzler, & Strohman, 2010).
2.1.1.3.1.2. Text Acquisition
Once the collection has been defined, the text contained in the documents must be
acquired. This process consists of acquiring the document, converting the document’s
text into a form usable by the system, storing the document’s text and metadata, and
passing the document’s text to the text transformation processor (Baeza-Yates and
Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010; Croft, Metzler, & Strohman,
2010).
After a document has been acquired, the system converts the document into a
format that can be easily read and processed by later components of the information
retrieval system. For example, if a document is in a format such as Microsoft Word, it
must be converted so that “the control sequences and non-content data associated with a
particular format are either removed or recorded as metadata” (Croft, Metzler, &
Strohman, 2010, p. 18). At this point in the process, some systems also ensure that the
text is encoded using the correct character encoding specification (Croft, Metzler, &
Strohman, 2010).
Typically, the next step of the text acquisition process is to store the converted
document text along with its metadata and other information extracted from the document
into a document data store. Depending on the size of the document collection, the data
store system may be designed specifically to allow for fast retrieval times (Croft, Metzler,
& Strohman, 2010).
If the collection contains source documents that are frequently changed or if new
documents will be added to the collection, the text acquisition process must have some
mechanism to continually revisit the set of documents within the bounds of the collection
to identify and process all the new and revised documents within the collection (Baeza-Yates and Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010; Croft, Metzler, &
Strohman, 2010).
2.1.1.3.1.3. Text Transformation
The text acquired in the previous phase is sent to the text transformation or text
processing phase. In this phase, the text transformation operations modify the text of the
document to reduce the complexity of the document representation and determine which
terms are eligible to be included in the index.
Some of the most commonly used text transformation operations are the following (a brief illustrative sketch combining these operations appears after the list):
• Lexical Analysis – “Lexical analysis is the process of converting a stream of
characters (the text of the documents) into a stream of words (the candidate
words to be adopted as index terms). Thus, one of the major objectives of the
lexical analysis phase is the identification of words in the text” (Baeza-Yates
and Ribeiro-Neto, 1999, p.165). This step, also referred to as parsing, is
often harder than is initially expected because even in such languages as
English, space characters are not the only delimiters of individual words.
Often characters such as punctuation marks, hyphens, numerical digits, and
the case of the letters need to be considered when identifying the bounds of a
word (Baeza-Yates and Ribeiro-Neto, 1999; Croft, Metzler, & Strohman,
2010).
In addition to identifying words, the lexical analysis process also may
remove punctuation and perform case normalization in which all characters
are converted to lowercase (Büttcher, Clarke, & Cormack, 2010).
• Stopword Removal – Another common text transformation operation is
known as stopword removal. The concept of stopwords was first introduced
by Hans Peter Luhn in 1958. Stopwords are words that occur too frequently
in the document collection to aid in the ability to discriminate between
relevant and non-relevant documents. Baeza-Yates and Ribeiro-Neto (1999)
state that “a word which occurs in 80% of the documents in the collection is
useless for purposes of retrieval. Such words are frequently referred to as
stopwords and are normally filtered out as potential index terms” (p. 167).
Stopword removal has the added benefit of reducing the size of the index
(Baeza-Yates and Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010;
Savoy & Gaussier, 2010) and often also reduces query execution times by
avoiding the need to process the stopwords (Büttcher, Clarke, & Cormack,
2010).
The most common type of stopword removed before indexing is
function words. “Function words are words that have no well-defined
meanings in and of themselves; rather they modify other words or indicate
grammatical relationships. In English, function words include prepositions,
articles, pronouns, and conjunctions. Function words are usually
the most frequently occurring words in any language” (Büttcher, Clarke, &
Cormack, 2010, p. 89). However, depending on the document collection and
the type of searches to be conducted, there may be other typically
information-bearing terms that also may be included in the stopword list
(Blanchard, 2007). Terms to be included in the stopword list may be generated from predefined stopword lists such as van Rijsbergen’s list of stopwords in English (van Rijsbergen, 1979), from manually defined stopword lists customized based on knowledge of the content of the documents in the collection, or from automatically created stopword lists built using tools that analyze word frequency, distribution, or other word attributes in the document collection (Blanchard, 2007).
• Word Stemming – Word stemming is a type of morphological normalization
to allow query terms and document terms that are morphological variants of
the same word to be matched (Savoy & Gaussier, 2010). As described by
Croft, Metzler, & Strohman (2010),
[p]art of the expressiveness of natural language comes from the huge
number of ways to convey a single idea. This can be a problem for search
engines, which rely on matching words to find relevant documents.
Instead of restricting matches to words that are identical, a number of
techniques have been developed to allow a search engine to match words
that are semantically related. Stemming, also called conflation, is a
component of text processing that captures the relationships between
different variations of a word. More precisely, stemming reduces the
different forms of a word that occur because of inflection (e.g., plurals,
tenses) or derivation (e.g., making a verb into a noun by adding the suffix
-ation) to a common stem. (p. 91)
The stem or root of a word is the portion of the word that remains after
its prefixes and suffixes have been removed (Baeza-Yates and Ribeiro-Neto,
1999; Büttcher, Clarke, & Cormack, 2010; Croft, Metzler, & Strohman, 2010;
Savoy & Gaussier, 2010).
In contrast to the process of lemmatization in Linguistics that produces
linguistically valid lemmas, word stemming is a purely operational process
and may produce a stem that does not have any linguistic validity. Because
the stem created from the word stemming process is used only for comparison
to other word stems generated using the same process, word stemming rather
than the more difficult and complex process of lemmatization is adequate for
most information retrieval systems (Büttcher, Clarke, & Cormack, 2010).
Like stopword removal, word stemming has the added benefit of
reducing the size of the index (Baeza-Yates and Ribeiro-Neto, 1999).
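As noted before the list, the following Python sketch combines the three operations in order. The stopword list and suffix-stripping rules are tiny illustrative assumptions; a production system would use a curated stopword list and an established stemming algorithm such as Porter's.

# Illustrative sketch of the text transformation operations: lexical analysis
# (tokenization with case normalization), stopword removal, and a crude
# suffix-stripping stemmer. The stopword list and suffix rules are toy
# assumptions, not those of a real system.
import re

STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "that", "are", "on"}
SUFFIXES = ("ational", "ation", "ings", "ing", "ed", "es", "s")

def lexical_analysis(text):
    # Convert the character stream into lowercase word tokens.
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # Strip the first matching suffix; the result need not be a valid lemma.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def transform(text):
    return [stem(t) for t in remove_stopwords(lexical_analysis(text))]

print(transform("The pilot understands the warnings on the displays"))
# ['pilot', 'understand', 'warn', 'display']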
2.1.1.3.1.4. Indexing
The final step of preparing the system is to build the index of terms for each
document in the collection. Instead of searching in each of the actual documents of the
collection at search time, an index in which a mapping between each eligible term and the
documents in which it can be found is used. The index allows for very fast searching over
large document collections (Baeza-Yates and Ribeiro-Neto, 1999; Croft, Metzler, &
Strohman, 2010; Savoy & Gaussier, 2010).
While an index can take a number of different forms, currently the most popular
form is known as an inverted file or inverted index. “An inverted file (or inverted index)
is a word-oriented mechanism for indexing a text collection in order to speed up the
searching task. The inverted file structure is composed of two elements: the vocabulary
and the occurrences. The vocabulary is the set of all different words in the text. For each
such word, a list of all the text positions where the word appears is stored. The set of all
those lists is called the ‘occurrences’” (Baeza-Yates and Ribeiro-Neto, 1999, p. 192).
Therefore, the inverted file is able to not only store information related to which
documents contain a specific word, but also where in that document the word may be
found.
Performing the indexing process consists of converting “the stream of document-term information coming from the text transformation component into term-document
information for the creation of inverted indexes. The challenge is to do this efficiently,
not only for large numbers of documents when the inverted indexes are initially created,
but also when the indexes are updated with new documents from feeds or crawls” (Croft,
Metzler, & Strohman, 2010, p. 23).
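The Python sketch below illustrates the inverted-index idea in a few lines: for each term it records the documents containing the term and the token positions within each document (the vocabulary and the occurrences). The trivial transform() used here is a stand-in assumption for the full text transformation step described above, and the structure is a toy illustration rather than a production index.

# Illustrative sketch of an inverted index: for each term, record the
# documents it appears in and its token positions within each document.
import re
from collections import defaultdict

def transform(text):
    # Stand-in for the full text transformation step: lowercase tokenization only.
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(documents):
    # documents: {doc_id: text} -> {term: {doc_id: [positions]}}
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, term in enumerate(transform(text)):
            index[term][doc_id].append(position)
    return index

docs = {"d1": "warnings must be clear", "d2": "clear caution messages"}
index = build_inverted_index(docs)
print(dict(index["clear"]))  # {'d1': [3], 'd2': [0]}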
2.1.1.3.2. Processing the Query
Once the system has been prepared, it is ready for the user to submit a query.
After the user has submitted a query to the system, the system identifies relevant
documents that match the topic described by the query and returns the results as
appropriate.
2.1.1.3.2.1. Submitting a Query
To submit the query, the user enters a query into the IR system’s user interface.
The user’s information need “underlies and drives the search process. … As a result of
her information need, the user constructs and issues a query to the IR system. Typically,
this query consists of a small number of terms with two to three terms being typical for a
Web search” (Büttcher, Clarke, & Cormack, 2010, pp. 5-6).
2.1.1.3.2.2. Identifying Relevant Document Matches
When the query is received, the system performs the same text transformation
operations that were performed on each of the documents before they were indexed. This
transforms the stream of characters received by the system to consist of the same eligible
terms that would be candidates for an index. For example, if stemming operations were
performed on the document text to convert words into their root stems, then it is
necessary to perform these same operations on the query submitted to enable appropriate
matches between the query terms and the document terms to be found (Baeza-Yates and
Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010; Croft, Metzler, & Strohman,
2010; Savoy & Gaussier, 2010).
After the query terms have been transformed to be consistent with the form of the
terms in the document index, query enhancement processes may be performed in which
the system attempts to more precisely and accurately capture the user’s intended
information need as a reformulated query. For example, this is the stage at which concept-based query expansion (the subject of this research) would be applied. In addition to any
query enhancements, the query may be formatted to be consistent with any system
specific formatting and operator symbology required by the system (Baeza-Yates and
Ribeiro-Neto, 1999; Savoy & Gaussier, 2010).
Finally, the query is run against the index to identify all documents that satisfy the
requirements of the query. The actual mechanism by which the query is compared with
the index data to determine if a match exists and the speed that this processing can occur
across the entire index depends on the architecture of the index data store. A variety of
index architectures have been explored, and the architecture used largely depends on the
size of the collection and the retrieval needs of the users of the system (Baeza-Yates and
Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010; Croft, Metzler, & Strohman,
2010; Savoy & Gaussier, 2010).
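A minimal Python sketch of this conjunctive matching step is given below; it assumes an index shaped like the toy inverted index sketched earlier and a transform() function identical to the one applied at indexing time. Real systems use far more sophisticated index architectures and matching strategies.

# Illustrative sketch of query processing: apply the same transformation used
# at indexing time, then intersect the posting lists so that only documents
# containing every query term remain. Ranking is a separate, later step.
def process_query(query, index, transform):
    terms = transform(query)
    if not terms:
        return set()
    matches = set(index.get(terms[0], {}))   # documents containing the first term
    for term in terms[1:]:
        matches &= set(index.get(term, {}))  # keep only documents that also contain this term
    return matches

# With the toy index and transform() from the previous sketch:
# process_query("Clear warnings", index, transform) -> {'d1'}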
Ranking and ordering processes required to calculate the relevancy ranking and
other ordering criteria are then performed to determine the final order and organization of
the documents that will be returned to the user (Croft, Metzler, & Strohman, 2010;
Savoy & Gaussier, 2010).
2.1.1.3.2.3. Returning Results
After the system has identified, ordered, and organized the set of documents that
satisfies the user’s query, the resulting documents are displayed to the user. This provides
both a listing of the relevant results in a form that is deemed useful (at least by the system designer) and a mechanism for accessing the relevant documents
and information to allow the user to make an initial assessment of the relevance of each
document. The information to identify each document varies from system to system but
may include the title of the document, document access information (i.e., a hyperlink to
the document or address of the physical location of the document), and snippets of text
that surround the terms that satisfied the requirement of the query (Baeza-Yates and
Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010).
2.1.2. Natural Language Processing
Natural Language Processing (NLP) has its origins in symbolic linguistics and
statistical modeling. Liddy (1998) provides the following definition:
Natural language processing is a set of computational techniques for analyzing
and representing naturally occurring texts at one or more levels of linguistic
analysis for the purpose of achieving human-like language processing for a range
of tasks or applications. The goal of researchers and developers of NLP is to
produce systems that process text of any type, the same way which we, as
humans, do - systems that take written or spoken text and extract what is meant at
different levels at which meaning is conveyed in language. (p. 137)
Over the past 30 years a revolution in NLP has occurred that has had a huge
impact on the types of techniques for analyzing and representing natural language that are
used. Dale (2010) describes the situation in the mid 1990s:
… the field of natural language processing was less than 10 years into what some
might call its “statistical revolution.” It was still early enough that there were
occasional signs of friction between some of the “old guard,” who hung on to the
symbolic approaches to natural language processing that they had grown up with,
and the “young turks,” with their new fangled statistical processing techniques,
which just kept gaining ground. Some old guard would give talks pointing out that
there were problems in natural language processing that were beyond the reach of
statistical or corpus-based methods; meanwhile, the occasional young turk could
be heard muttering a variation on Fred Jelinek’s 1988 statement that “whenever I
fire a linguist our system performance improves.” (p. 3)
For a time it looked as if the old guard of symbolic linguists would need to give way to statistical modelers; more recently, however, there has been a trend to develop
techniques in which the important lessons learned from symbolic linguistics are
incorporated into the latest statistical modeling techniques (Dale, 2010). I think it is
important to note that much of the friction and growing pains related to the recent
evolution in NLP has taken place during the same time IR researchers were trying to
augment their systems with NLP and were repeatedly disappointed with their levels of
success.
2.1.2.1. Stages of Analysis in NLP
The stages of analysis in NLP move from processing the symbolic representation
of the text up through identifying conceptual meaning conveyed in the text. These stages
are illustrated in Figure 2.2. Each of these analysis stages is described below except for pragmatic analysis 7, which is an element of Natural Language Processing beyond the scope of
of this research project.
7 Pragmatic Analysis is concerned with understanding the context and purpose of a message.
Figure 2.2. The stages of analysis in Natural Language Processing (adapted from Figure 1.1 in Dale, 2010, p. 4): surface text → tokenization → lexical analysis → syntactic analysis → semantic analysis → pragmatic analysis → speaker’s intended meaning.
2.1.2.1.1. Tokenization
The first stage of analysis in NLP consists of “the task of converting a raw text
file, essentially a sequence of digital bits, into a well-defined sequence of linguistically
meaningful units: at the lowest level characters representing the individual graphemes in
a language’s written system, words consisting of one or more characters, and sentences
consisting of one or more words.” (Palmer, 2010, p. 9)
As discussed earlier, the process of identifying words from a stream of characters
is not always straight-forward; even in such languages as English, space characters aren’t
the only delimiters of individual words. The tokenization process must consider
characters representing punctuation marks, hyphens, numerical digits, as well as the case
of letters to determine what segment of the text constitutes a word. (Baeza-Yates and
Ribeiro-Neto, 1999; Croft, Metzler, & Strohman, 2010; Dale, 2010).
2.1.2.1.2. Lexical Analysis
The lexical analysis stage of processing in NLP performs text analysis at the level
of the word. One of the primary tasks at this stage is lemmatization. Lemmatization is a
process of relating morphological variants to their lemma. To do this, “morphologically
complex strings are identified, decomposed into invariant stem (= lemma’s canonical
form) and affixes, and the affixes are then deleted. The result is texts as search objects
that consist of stems only so that they can be searched via a lemma list” (Hippisley, 2010,
p. 32).
As discussed earlier, lemmatization is similar to word stemming, but there is an
important distinction between these two processes. A goal of lemmatization is to produce
linguistically valid lemmas while word stemming is a purely operational process and may
produce a stem that does not have any linguistic validity. Because the stem created from
the word stemming process is used only for comparison to other word stems, word
stemming rather than the more difficult and complex process of lemmatization is
adequate for most information retrieval systems (Büttcher, Clarke, & Cormack, 2010).
Another distinction between NLP and Information Retrieval related to lexical
analysis is that some Information Retrieval researchers bundle the step of tokenization
with other text transformation processes such as word stemming and refer to the entire
stage as lexical analysis. However, NLP researchers tend to talk about tokenization as a
separate and distinct process from lexical analysis.
2.1.2.1.3. Syntactic Analysis
The syntactic analysis area is arguably the most well-established area in natural
language processing. “A presupposition in most work in natural language processing is
that the basic unit of meaning analysis is the sentence: a sentence expresses a proposition,
an idea, or a thought, and says something about some real or imaginary world. Extracting
meaning from a sentence is thus a key issue” (Dale, 2010, p. 6). Syntactic analysis is
comprised of applying “techniques for grammar-driven natural language parsing, that is,
analyzing a string of words (typically a sentence) to determine its structural description
according to a formal grammar” (Ljunglöf & Wirén, 2010, p. 59).
Some of the challenges syntactic analysis must overcome include the following:
• Robustness – Robustness is the system’s ability to gracefully handle input that does not conform to the expectations of the system. One source of non-conformance is “that the input may contain errors; in other words, it may be
ill-formed (though the distinction between well-formed and ill-formed input
is by no means clear cut.)” (Ljunglöf & Wirén, 2010, p. 80). Another source
of non-conformance is undergeneration in which the rules of the grammar
being used by the system do not adequately cover the natural language
being input. Ljunglöf & Wirén (2010) talk about the desirability of graceful
degradation where “robustness means that small deviations from the
expected input will only cause small impairments of the parse result, whereas
large deviations may cause large impairments” (p. 80).
• Disambiguation – “At any point in a pass through a sentence, there will
typically be several grammar rules that might apply” (Ljunglöf & Wirén,
2010, p. 60). The challenge then is determining which of the possible
syntactic structures that appears to fit the input is the one intended by the
creator of the sentence. While the information required to disambiguate
possible syntactic structures may not always be available during this stage, at
a minimum the analysis typically can narrow the possible options. This
problem is helped by the observation made by Ljunglöf & Wirén (2010) that
“although a general grammar will allow a large number of analyses of almost
any nontrivial sentence, most of these analyses will be extremely implausible
in the context of a particular domain” (p. 81).
2.1.2.1.4. Semantic Analysis
The semantic analysis in NLP uses the results from the previous levels of analysis
to analyze “the meanings of words, fixed expressions, whole sentences, and utterances in
context. In practice, this means translating original expressions into some kind of
semantic metalanguage. The major theoretical issues in semantic analysis therefore turn
on the nature of the metalanguage or equivalent representational system” (Goddard &
Schalley, 2010, p. 94). In this way, the semantic analysis works to translate the text into
a semantic representational system to determine the meaning of words, multi-word
expressions, phrases and/or indefinitely large word combinations such as sentences in
order to understand the message.
The variety of approaches and theories used to conduct semantic analysis tend to
be divided along two dimensions. The first is a compositional versus lexical dimension.
The compositional approaches are concerned with working bottom-up to construct
meaning from the lexical items whose meaning is accepted as a given. At the other end
of this dimension are the lexical approaches that work to precisely analyze the meaning
of the lexical items using either decomposition or relational methods. The second
dimension is formal versus cognitive. Formal approaches focus on the readily apparent
structural patterns in the messages as well as the importance of linking the grammatical
components to semantic components. On the other hand, cognitive approaches focus on
the patterns and processes of the organization of the conceptual content in a language
(Goddard & Schalley, 2010; Talmy, forthcoming).
Regardless of the approach used to conduct the semantic analysis, “[i]t is widely
recognized that the overriding problems in semantic analysis are how to avoid circularity
and how to avoid infinite regress. Most approaches concur that the solution is to ground
the analysis in a terminal set of primitive elements, but they differ on the nature of the
primitives … Approaches also differ on the extent to which they envisage that semantic
analysis can be precise and exhaustive” (Goddard & Schalley, 2010, p. 94).
Dale points out that in semantic analysis “we begin to reach the bounds of what
has so far been scaled up from theoretical work to practical application” (Dale, 2010, p.
6). The problems of understanding text through semantic analysis are difficult and
complex. According to Goddard & Schalley (2010), the outlook is grim for significant
advancements in semantic analysis in the near future. “Despite the tremendous advances
in computation power and improvements in corpus linguistics … [m]any common
semantic phenomena are likely to remain computationally intractable for the foreseeable
future. Rough-and-ready semantic processing (partial text understanding), especially in
restricted domains and/or restricted ‘sublanguages’ offer more promising prospects”
(Goddard & Schalley, 2010, p. 114).
2.1.3. Describing Information Needs in the Form of a Query
The primary purpose of an IR system is to efficiently and effectively fill a user’s
information need (Baeza-Yates and Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack,
2010; Croft, Metzler, & Strohman, 2010). The term information need was coined by
Taylor in his 1962 paper “The Process of Asking Questions”. Taylor describes an
information need as something distinct and traceable that is developed through a process
that progresses through four levels of question formation. The information need begins
as a vague, inexpressible sense of dissatisfaction, progresses to a “conscious mental
description of an ill-defined area of indecision” (p. 392), then develops into an
unambiguous and rational expression of the question, and finally is adapted into a form
the user believes is appropriate to pose to the IR system. This final step of adapting the
question into a form appropriate for the system is impacted not only by the form in which
the user must express the information but also by what the user believes that the system
can provide. Because the already complex process of developing the user’s information
need must be translated into a query that the system can process, it is no surprise that “[a]
query can be a poor representation of the information need” (Croft, Metzler, & Strohman,
2010, p. 188).
Two issues that further compound the difficulty in appropriately expressing an
information need in the form of an acceptable IR system query are the gap in the user’s knowledge and the use of natural language as the medium for concept representation.
2.1.3.1. Gap in the User’s Knowledge
It can be difficult to ask about things a user doesn’t know. Belkin (1980)
described the issue as the necessity for the users to develop their information need from
an inadequate state of knowledge. “The expression of an information need … is in
general a statement of what the user does not know” (Belkin, Oddy, & Brooks, 1982,
p. 64). The IR system must then try to match this inadequate, uncertain, imprecise, and
possibly incoherent statement of the need with documents that contain representations of
a coherent state of knowledge (Belkin, 1980; Belkin, Oddy, & Brooks, 1982).
2.1.3.2. Natural Language as the Medium for Concept Representation
Natural language is used as the medium to form the information bridges as
described by Frické (2012) between the user with the information need and the authors of
the documents in the collection. The user forms a query using natural language as an
abstraction to represent an information need, and the authors of the documents in the
collection use natural language as an abstraction to represent the concepts in the
documents. The translation into and representation using natural language introduces
issues with the accuracy and completeness of the description of the concepts on both
sides of the information bridge. Therefore, the search engine must perform its task using
a frequently incomplete, imprecise, and ambiguous description of both the user’s
information need and the related concepts included in the documents within the collection
(Savoy & Gaussier, 2010).
Attributes of natural language that can cause textual expressions to be incomplete,
imprecise, ambiguous and, therefore, difficult for IR systems to handle include word
morphology, orthographic variation, and various forms of syntax.
2.1.3.2.1. Morphology
Morphology relates to the composition of words in a language. A word may have
a number of different valid morphological forms, each of which represents the same
underlying concept. Morphological forms of words may be created through inflectional
construction such as singular versus plural forms (e.g., “book” and “books”) or gender
assignment (e.g., “actor” and “actress”), verb conjugation such as the present participle
(e.g., “laugh” and “laughing”) and past participle (e.g., “play” and “played”), and
derivational construction such as the gerund (e.g., the verb “train” and the noun
“training”) (Savoy & Gaussier, 2010).
In English, as illustrated in the previous examples, morphological variants of a
word are typically formed by adding affixes to the beginning of the word (i.e., prefix) or
to the end of a word (i.e., suffix). However, there are exceptions that introduce
complexity and challenges and prohibit a morphological analyzer from relying only on
simple rules to identify all the word forms that represent the same concept. For example,
challenges are introduced with exceptions like the pluralized form of “goose” to “geese”
and then further complicated by the fact that the pluralized form of the very similar word
“moose” is “moose”. Homographs such as “train” as in a locomotive and “train” as in
instruct also pose difficult challenges when trying to determine which words represent the
same underlying concept (Savoy & Gaussier, 2010).
When searching for relevant matches within a document collection, it is necessary
to return results for all morphological variants of a word in order to have complete recall.
However, the many exceptions in a language make this a difficult task.
2.1.3.2.2. Orthographic Variation
Orthographic variation relates to the differences in the acceptable spelling of a
word. In the 1800s, much effort was put into ensuring that spelling was standardized.
However, spelling differences still exist. Two common sources of orthographic variation
are regional differences (e.g., the British and American English forms of the word “grey”
and “gray”) and the transliteration of foreign names (e.g., “Creutzfeld-Jakob” and
“Creutzfeldt-Jacob”). In addition to acceptable alternative spellings, typographic errors and misspellings are also commonplace in document collections. As with morphological variants, when searching for matches within a document collection it may be important to return all orthographic variants of a word in order to have complete recall (Savoy &
Gaussier, 2010).
2.1.3.2.3. Syntax
There are a number of valid syntactical constructions that can make it difficult for
automated analyzers to correctly determine the appropriate meaning of a phrase or
sentence. As discussed earlier, a number of valid grammatical structures may fit a given
sentence, each of which provides a different meaning. Consider the following example
of syntactic ambiguity: “The fish is ready to eat.” Like the optical illusion of the Rubin Vase, the two interpretations, that it is time to feed your fish or that your fish dinner is ready for you to eat, seem to bounce back and forth in your mind. Without context, either interpretation is valid, but each represents a very different semantic concept (Manning &
Schütze, 1999; Savoy & Gaussier, 2010).
2.1.4. Will NLP techniques improve information retrieval systems?
While some researchers consider information retrieval an example of a successful
applied domain of NLP (Savoy & Gaussier, 2010), others have been disappointed with
the lack of success in using NLP to improve the performance of IR systems (Brants,
2003; Lease, 2007; Sparck Jones, 1997; Smeaton, 1999; Buckley, 2004). In the early
days, there was optimism about the large improvement in the performance of IR systems
that could be reaped by augmenting IR systems with NLP techniques. The combination
seemed natural. Information retrieval systems needed to go beyond simple symbolic
searches in order to find information that required a deeper level of understanding of the
content that pattern matching algorithms could not identify, and Natural Language
Processing brought techniques to extract the meaning conveyed at different levels of
written and spoken language.
However, as described by Brants (2003), “[s]imple methods (stopwording, Porter-style
stemming, etc.) usually yield significant improvements, while higher-level
processing (chunking, parsing, word sense disambiguation, etc.) only yield very small
improvements or even a decrease in accuracy” (p. 1). These were disappointing results
since it was hoped that Natural Language Processing could allow IR systems to move
beyond the limitations of symbolic pattern matching by providing them with the
capability to work at the higher-level of the semantic content of both the queries posed
and the documents in the collection.
Several researchers have posited ideas about why NLP techniques have not been
more successful. Brants (2003) stated that the use of existing ‘out-of-the-box’ NLP
components not specifically geared for IR was one reason why NLP techniques have not
been more successful at improving the performance of information retrieval. In his 2003
review of research investigating the use of NLP techniques to improve retrieval, he
described that the examples of successful NLP techniques like the use of the Porter
stemming algorithm (Porter, 1980) and statistical “phrases” were techniques that actually
have linguistic flaws and, in some cases, are counter to linguistic knowledge. But, they
are specifically designed with the goal of improving retrieval rather than adherence to
linguistic constructs, and this allows them to be more successful at performing the task
for which they are used.
Qiu and Frei (1993) presented their thoughts on why the specific technique of
Query Expansion in information retrieval systems using NLP has not been more
successful. They stated that the lack of success with previous query expansion methods
was primarily because the methods were based on adding terms similar to each of the
individual terms used to construct the query instead of adding terms similar to the overall
concept the query describes. This method of expanding a query by adding terms similar
only to the individual terms of the query often introduces tangential, non-relevant
concepts to the query causing the search algorithm to identify documents that are not
relevant to the original query concept.
In addition, two other factors have likely contributed to the lack of success
experienced. First, as mentioned earlier, much of the friction and growing pains related
to the recent evolution in NLP was taking place at the same time that information
retrieval researchers were trying to augment their systems with NLP and being
disappointed with their levels of success. This state of flux and the field's level of
maturity could have limited the NLP techniques that were available to an information retrieval
system. And second, NLP is still working on the tough issues at the semantic level of
processing natural language. “[T]he known is the surface text, and anything deeper is a
representational abstraction that is harder to pin down; so it is not surprising that we have
better developed techniques at the more concrete end of the processing spectrum” (Dale,
2010, p. 5).
Lease (2007) stated that in the field of NLP “the statistical revolution has
continued to expand the field's horizons; the field today is thoroughly statistical with
robust methodology for estimation, inference, and evaluation. As such, one may well ask
if there are new advancements that suggest re-exploring prior directions in applying NLP
to [information retrieval]?” (p. 1).
By carefully considering the potential pitfalls, challenges, and new opportunities
highlighted by such researchers as Brants, Qiu and Frei, Dale, and Lease, it is possible
that NLP may still hold the key to unlock the large performance benefits available by
moving beyond symbolic searches.
2.2. Lexical Acquisition of Meaning
The lexical acquisition of meaning is focused primarily on automatically
measuring the relative value of how similar (or dissimilar) one word is to another word.
This process of calculating a relative measure is a substitute for actually determining
what the meaning of a word is, but, despite this, it is a useful measure in information
retrieval systems (Manning & Schütze, 1999).
There are a number of different methods that have been used to measure semantic
similarity. A few of the more popular and well-known methods are described below.
2.2.1. Vector Space Models
Vector space models represent one of the oldest and most well-known methods
for automatically measuring semantic similarity (Büttcher, Clarke, & Cormack, 2010). In
these models, words are represented as vectors in a multi-dimensional space.
2.2.1.1. Multi-dimensional Space
The multi-dimensional space used in vector space models may be created from
different elements, but two of the most common are document space and word space. To
describe the differences between these two types of multi-dimensional space, examples
are presented below that have been adapted from those in Manning & Schütze (1999).
             d1   d2   d3   d4   d5   d6
cosmonaut     1    0    1    0    0    0
astronaut     0    1    0    0    0    0
moon          1    1    0    0    0    0
car           1    0    0    1    1    0
truck         0    0    0    1    0    1

Figure 2.3    An example document-by-word matrix (adapted from Figure 8.3 in
Manning & Schütze, 1999, p. 297).
In Figure 2.3, we see that a document space is created from a document-by-word
matrix where the values in the cells represent the number of times the word occurs in the
document. Using document space, words are represented as vectors formed from their
occurrences in each of the documents in the collection and “[w]ords are deemed similar
to the extent that they occur in the same documents. In document space, cosmonaut and
astronaut are dissimilar (no shared documents); truck and car are similar since they share
a document: they co-occur in [document] d4” (Manning & Schütze, 1999, p. 296).
             cosmonaut   astronaut   moon   car   truck
cosmonaut        2           0         1     1      0
astronaut        0           1         1     0      0
moon             1           1         2     1      0
car              1           0         1     3      1
truck            0           0         0     1      2

Figure 2.4    An example word-by-word matrix (adapted from Figure 8.4 in Manning &
Schütze, 1999, p. 297).
In Figure 2.4, we see that a word space is created from a word-by-word matrix
where the values in the cells represent the number of documents in which the two
intersecting words occur together. Using word space, words are represented as vectors
formed from their co-occurrence with other words. “Co-occurrence can be defined with
respect to documents, paragraphs or other units. Words are similar to the extent that they
co-occur with the same words. Here, cosmonaut and astronaut are more similar than
before since they both co-occur with moon” (Manning & Schütze, 1999, p. 297).
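The following sketch (assuming Python with NumPy) reproduces the two example spaces: the rows of the matrix give the word vectors in document space, and the word-by-word co-occurrence counts of Figure 2.4 are obtained by multiplying the matrix with its transpose.

```python
# A small sketch reproducing the two example spaces with NumPy.
# Rows follow Figure 2.3: one row per word, one column per document (d1..d6).
import numpy as np

words = ["cosmonaut", "astronaut", "moon", "car", "truck"]
A = np.array([
    [1, 0, 1, 0, 0, 0],   # cosmonaut
    [0, 1, 0, 0, 0, 0],   # astronaut
    [1, 1, 0, 0, 0, 0],   # moon
    [1, 0, 0, 1, 1, 0],   # car
    [0, 0, 0, 1, 0, 1],   # truck
])

# Document space: each word is the vector given by its row of A.
# Word space: co-occurrence counts, i.e., the word-by-word matrix of Figure 2.4.
B = A @ A.T
print(B[words.index("cosmonaut"), words.index("astronaut")])  # 0 shared documents
print(B[words.index("car"), words.index("truck")])            # 1 shared document (d4)
```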
2.2.1.2. Constructing Vectors
The vectors created from a multi-dimensional space may be constructed using
either binary vectors or real-valued vectors. A binary vector is one in which each
dimension is assigned either a 0 or 1. However, a more powerful representation, though
more computationally complex, is when real values are used for each of the dimensions.
A vector constructed from real values represents more information about the level or
strength of each dimension.
2.2.1.3. Vector Similarity Measures
After vectors have been constructed, the next step is to measure how similar two
vectors are to one another (i.e., how close they are to one another in the multi-dimensional
vector space) in order to determine the similarity of (or association between)
the words that they represent. When the vectors represent words, the similarity of the
two vectors may be measured to derive a relative value that can be used to determine how
similar one word is to another word within the document collection.
Similarity between two vectors can be measured using a variety of vector
similarity measures. The most commonly used are outlined in Table 2.1 and described
in more detail in the sections below.
Vector Similarity Measure     Definition

Matching Coefficient          $|X \cap Y|$                                 (intersection)

Dice Coefficient              $\dfrac{2\,|X \cap Y|}{|X| + |Y|}$           (intersection over mean)

Jaccard Coefficient           $\dfrac{|X \cap Y|}{|X \cup Y|}$             (intersection over union)

Overlap Coefficient           $\dfrac{|X \cap Y|}{\min(|X|, |Y|)}$

Cosine Coefficient            $\dfrac{|X \cap Y|}{\sqrt{|X| \times |Y|}}$

Table 2.1     Vector-based similarity measures.
2.2.1.3.1. Matching Coefficient
The simplest similarity measure is the matching coefficient. The similarity value
using the matching coefficient is determined by counting the number of dimensions on
which both vectors have a non-zero value or, in other words, the intersection of the two
vectors. This is different than other methods used to measure similarity because it does
not account for any differences in the length of the vectors. For example, assume that
vector A has 10 non-zero dimensions, vector B has 12 non-zero dimensions, and vector C
has 1000 non-zero dimensions. If vector A and vector B share 8 dimensions and vector A
and vector C share 8 dimensions, the similarity values for A-B and for A-C would both be
8. However, considering the differences in the length of the various vectors, it is likely
that vector A and vector B are semantically more similar to one another than vector A is
to vector C because proportionally there is significantly more overlap between vectors A
and B (van Rijsbergen, 1979; Manning & Schütze, 1999).
2.2.1.3.2. Dice Coefficient
The Dice Coefficient performs an intersection over mean calculation to normalize
the length of the vectors and then measure the amount of overlap the vectors have with
one another. The result is a range between 0.0 and 1.0. The value 0.0 means that there is
no overlap between the two vectors and, therefore, no similarity. The value 1.0 means
that the vectors have perfect overlap and are, therefore, identical (van Rijsbergen, 1979;
Manning & Schütze, 1999).
2.2.1.3.3. Jaccard Coefficient
The Jaccard Coefficient is also known as the Tanimoto Coefficient. It is similar
to the Dice Coefficient, but instead of the intersection over mean, it is based on an
Intersection Over Union (IOU) calculation to normalize and measure the amount of
overlap between two vectors. The Jaccard Coefficient values range from 0.0 to 1.0 where
0.0 represents no overlap and 1.0 represents perfect overlap (i.e., the vectors are
identical). The difference between the Jaccard Coefficient and the Dice Coefficient is
that the Jaccard Coefficient includes a greater penalty for situations in which the
proportion of shared dimensions with non-zero values is small with respect to the overall
length of the vectors (i.e., overall number of non-zero dimension that each vector
possesses) (van Rijsbergen, 1979; Manning & Schütze, 1999).
2.2.1.3.4. Cosine Coefficient
The Cosine similarity calculation uses linear algebra to measure the angle
between two vectors. Smaller angles represent higher levels of similarity between the
two vectors. (van Rijsbergen, 1979; Manning & Schütze, 1999; Büttcher, Clarke, &
Cormack, 2010). Because the Cosine Coefficient is based on the angle between the two
vectors rather than the amount of overlap, the Cosine Coefficient includes a reduced
penalty for situations in which the non-zero dimensions between the two vectors are very
different. In other words, when using the Cosine Coefficient, it is not necessary for the
two vectors being compared to be similar in size. “This property of the cosine is
important in Statistical NLP since we often compare words or objects that we have
different amounts of data for, but we don’t want to say they are dissimilar just because of
that” (Manning & Schütze, 1999, p. 300).
2.2.1.4. Using Vector Space Model in Retrieval
Vector space models can be used in the retrieval process to identify the set of the
most relevant documents in a collection to a query. To do this, “[q]ueries as well as
documents are represented as vectors in a high-dimensional space in which each vector
component corresponds to a term in the vocabulary of the collection” (Büttcher, Clarke,
& Cormack, 2010, p. 55). In this way, a real-valued vector of the query and real-valued
vectors of each of the documents in the collection are created from a multi-dimensional
space comprised of all the words present in the query and in the document
collection (i.e., the union of the words in the query and the words in the document
collection). The relative similarity between the query vector and each of the document
vectors is then measured using a vector similarity measure. Typically, the vector
similarity measure used is the Cosine similarity measure so that the disparity of the
number of vector dimensions between the query vector and the document vectors do not
adversely impact the similarity calculations. “If we can appropriately represent queries
and documents as vectors, cosine similarity may be used to rank the documents with
respect to the queries. In representing a document or query as a vector, a weight must be
assigned to each term that represents the value of the corresponding component of the
vector” (Büttcher, Clarke, & Cormack, 2010, p. 57). Typically, the weight calculated is
based on Term Frequency (TF) and Inverse Document Frequency (IDF).
Term Frequency (TF) represents the frequency with which a term appears in a
document. It is used in the weight calculation based on the idea that terms that occur
more frequently in a document should be assigned a greater weight than terms that occur
less frequently (i.e., the more frequently occurring term is more important to the concept
represented in the document). Inverse Document Frequency (IDF) represents the
frequency with which a term occurs in documents in the collection. It is used in the
weight calculation based on the idea that terms that occur in a large number of documents
should be assigned a lower weight than terms that occur less frequently in the document
collection (i.e., the less frequently occurring term is better able to discriminate between
relevant and non-relevant documents in the collection) (Büttcher, Clarke, & Cormack,
2010).
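The following sketch (assuming Python with scikit-learn, and using an illustrative toy collection and query) shows the overall process: TF-IDF-weighted document and query vectors are constructed in the same term space and the documents are ranked by cosine similarity.

```python
# A minimal sketch of vector-space retrieval with TF-IDF weights and cosine ranking.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cosmonaut landed on the moon",
    "the astronaut trained for the moon mission",
    "the truck and the car share a garage",
]
query = ["astronaut moon"]

vectorizer = TfidfVectorizer()                 # builds the vocabulary and TF-IDF weights
doc_vectors = vectorizer.fit_transform(docs)   # one real-valued vector per document
query_vector = vectorizer.transform(query)     # the query represented in the same space

scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]               # documents ordered by cosine similarity
print([(docs[i], round(float(scores[i]), 3)) for i in ranking])
```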
The vector space models used in the retrieval process can be used not only with
respect to documents but also at the level of paragraphs or other units of text that are
deemed useful by the information retrieval system designer. Therefore, by using the
vector space model, it is possible to identify the set of most relevant paragraphs to a
query if a finer level of retrieval is desired.
2.2.2. Latent Semantic Analysis
As described by Landauer, Foltz, and Laham (1998), “Latent Semantic Analysis
(LSA) is a theory and method for extracting and representing the contextual-usage
meaning of words by statistical computations applied to a large corpus of text” (p. 259).
It is an extension of the vector space model in which singular value decomposition (SVD)
is used to reduce the dimensionality of the term-vector space to extract and infer relations
between words based on how the words are used in the text (Büttcher, Clarke, &
Cormack, 2010; Cilibrasi & Vitányi, 2007; Landauer, Foltz, and Laham, 1998;
Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990).
Proponents of LSA say that this method “is capable of correctly inferring much
deeper relations (thus the phrase latent semantic), and, as a consequence, they are often
much better predictors of human meaning-based judgments and performance than are the
surface-level contingencies” (Landauer, Foltz, & Laham, 1998, pp. 260-261). It is also
believed that SVD is able “to reduce the negative impact of synonymy – multiple terms
with the same meaning – by merging related words into common dimensions” (Büttcher,
Clarke, & Cormack, 2010, p. 78).
2.2.2.1. Singular Value Decomposition
The main distinctive element of LSA is the use of Singular Value Decomposition
(SVD) to reduce the dimensionality of the semantic space represented as the term-vector
space. SVD is a mathematical factor analysis from linear algebra that performs a linear
decomposition on the term-vector and then uses the resulting scaling values to determine
which dimensions may be removed (Büttcher, Clarke, & Cormack, 2010; Cilibrasi &
Vitányi, 2007; Landauer, Foltz, and Laham, 1998).
As mentioned, the first step is to perform a linear decomposition in which the
original “rectangular matrix is decomposed into the product of three other matrices. One
component matrix describes the original row entities as vectors of derived orthogonal
factor values, another describes the original column entities in the same way, and the
third is a diagonal matrix containing scaling values such that when the three components
are matrix multiplied, the original matrix is reconstructed” (Landauer, Foltz, & Laham,
1998, p. 263). The second step of the SVD is to reduce the number of dimensions by
deleting the scaling coefficients in the diagonal matrix starting with the smallest
coefficients until only the desired number of dimensions remains. The final matrix is then
reconstructed using the modified, reduced-dimensionality diagonal matrix. The resulting
values in the individual cells of the final matrix represent the level of similarity between
the entity represented by the row and the entity represented by the column (e.g.,
similarity between a word and a document, similarity between two words, or similarity
between two documents) (Landauer, Foltz, and Laham, 1998).
One of the primary assumptions of LSA is “that reducing the dimensionality (the
number of parameters by which a word or passage is described) of the observed data from
the number of initial contexts to a much smaller – but still large – number will often
produce much better approximations to human cognitive relations. It is this
dimensionality reduction step, the combining of surface information into a deeper
abstraction, that captures the mutual implications of words and passages” (Landauer,
Foltz, and Laham, 1998, pp. 261-262). Therefore, determining the number of dimensions
that should be chosen to represent the semantic space is a key element of SVD and has a
large impact on the results of LSA. However, methods are still evolving for choosing the
optimal dimensionality for a data set.
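As an illustration of these two steps, the following sketch (assuming Python with NumPy, and using an illustrative term-document count matrix) performs the decomposition, keeps only the k largest scaling values, and reconstructs the reduced-rank matrix whose cells reflect latent similarities.

```python
# A minimal sketch of the LSA steps described above: decompose a term-document
# matrix with SVD, keep only the k largest scaling values, and reconstruct a
# reduced-dimensionality matrix.
import numpy as np

A = np.array([                 # toy term-document count matrix (terms x documents)
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # step 1: linear decomposition
k = 2                                              # step 2: chosen dimensionality
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # reconstruct with reduced dimensions

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rows 0 and 1 (e.g., cosmonaut and astronaut) have cosine 0 in the raw counts;
# the reduced space typically assigns them a non-zero similarity because both
# co-occur with a shared neighbor such as moon.
print(cosine(A_k[0], A_k[1]))
```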
2.2.2.2. Distinctions of LSA from Other Statistical Approaches
LSA differs in several important ways from other common statistical lexical
acquisition of meaning techniques. First, to a greater extent than other vector space
models, the level of similarity between a word and a document in LSA is dependent not
only on the attributes of documents in which the word occurs, but also on the attributes of
documents in which the word does not occur. The LSA calculation incorporates data
from all the documents in the collection and considers which words occur and which do
not occur in each. For example, the fact that word a does not occur in document D1 but
words b, c, and d do occur in document D1 may impact the calculation of the similarity
level between word a and document D2.
Second, LSA does not consider word order and other grammatical constructs, but
instead LSA uses “the detailed patterns of occurrences of very many words over very
large numbers of local meaning-bearing contexts, such as sentences or paragraphs,
treated as unitary wholes. … Another way to think of this is that LSA represents the
meaning of a word as a kind of average of the meaning of all the passages in which it
appears and the meaning of a passage as a kind of average of the meaning of all the
words it contains” (Landauer, Foltz, and Laham, 1998, p. 261).
And third, unlike other common statistical methods used to calculate semantic
similarity, LSA is both used as a successful computational method and as a model of
human learning. As a computational method, it is used to estimate the similarity between
two words and between words and other units of text (e.g., documents, paragraphs, or
phrases) by mathematically extracting and inferring the relationships between the
expected contextual usages of words. As a model of human learning, it is posited as a
computational theory of inductive learning by which humans acquire and represent
knowledge in an environment that does not appear to contain adequate information to
account for the level of knowledge the human learns (i.e., the problem of the ‘poverty of
the input’ or ‘insufficiency of evidence’) (Landauer, Foltz, and Laham, 1998).
2.2.3. Normalized Web Distance
Normalized Web Distance (NWD) is a compression-based similarity metric
created by Cilibrasi & Vitányi (2007) to determine the relative similarity between words
or phrases and is computed using the Web and a search engine. In the earliest version of
the NWD theory, Cilibrasi & Vitányi (2007) refer to NWD as the more specific
‘Normalized Google Distance’ because they had specifically used the Google search
engine to supply the necessary page counts used in their calculations. Later, Cilibrasi &
Vitányi re-named the metric to the more broadly applicable ‘Normalized Web Distance’
to allow for any search engine with sufficient coverage of the web to be used to perform
the relative similarity calculations (Cilibrasi & Vitányi, 2007; Vitányi & Cilibrasi, 2010).
NWD is based on Cilibrasi & Vitányi’s “contention that the relative frequencies
of web pages containing search terms gives objective information about the semantic
relations between the search terms” (Cilibrasi & Vitányi, 2007, p. 371). Due to the
vastness and sheer quantity of the information available on the Web, it is likely that the
extremes of information not representative of how words are currently used in society
will cancel each other out and that the majority of the information contained on the Web
90
consists of diverse, low-quality, yet valid information. Though the overwhelming
majority of the information is likely of low-quality, there is an immense quantity of it on
the Web. Cilibrasi & Vitányi (2007) stated that because there is so much of the of lowquality information and that is so diverse that using search engine page counts can
effectively average out the semantic information so that valid and useful semantic
associations between words and phrases can be drawn from the NWD method.
The example presented in Cilibrasi & Vitányi (2007) best illustrates how the
NWD is computed to determine similarity between two terms using the World Wide Web
and Google for the search engine.
“While the theory we propose is rather intricate, the resulting method is simple
enough. We give an example: At the time of doing the experiment, a Google
search for ‘horse’, returned 46,700,000 hits. The number of hits for the search
term ‘rider’ was 12,200,000. Searching for the pages where both ‘horse’ and
‘rider’ occur gave 2,630,000 hits, and Google indexed 8,058,044,651 web pages.
Using these numbers in the main formula … with N = 8,058,044,651, this yields a
Normalized Google Distance [(NGD)] between the terms ‘horse’ and ‘rider’ as
follows: NGD(horse, rider) ≈ 0.443.” (Cilibrasi & Vitányi, 2007, p. 371)
As can be inferred from this example, this method does not consider the location
of the terms as they exist in the web pages, nor the number of occurrences of the terms
within a single web page; it is simply based on the number of web pages in which the
terms occur at least once. Therefore, Cilibrasi & Vitányi (2007) point out “that this can
mean that terms with different meaning have the same semantics, and that opposites like
‘true’ and ‘false’ often have a similar semantics. Thus, we just discover associations
between terms, suggesting a likely relationship” (Cilibrasi & Vitányi, 2007, p. 371).
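The result in this example can be reproduced from the quoted page counts using the published NGD formula; the following sketch (assuming Python) is illustrative only, and the ratio of log differences is independent of the logarithm base.

```python
# A sketch of the Normalized Google Distance calculation using the page counts
# quoted above, following the published NGD formula.
from math import log

def ngd(f_x, f_y, f_xy, n):
    return (max(log(f_x), log(f_y)) - log(f_xy)) / (log(n) - min(log(f_x), log(f_y)))

# 'horse' hits, 'rider' hits, joint hits, and the number of indexed pages N.
print(round(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651), 3))  # ~0.443
```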
2.2.3.1. Kolmogorov Complexity and Normalized Information Distance
NWD is based on a theory of semantic distance between a pair of objects using
Kolmogorov complexity and normalized information distance. “One way to think about
the Kolmogorov complexity K(x) is to view it as the length, in bits, of the ultimate
compressed version from which x can be recovered by a general decompression
program” (Cilibrasi & Vitányi, 2007, p. 372). The less complex and more redundant the
information is, the more it can be compressed and the smaller the Kolmogorov
complexity value. It is important to note that Kolmogorov complexity is a theoretical
construct that is incomputable because it is based on an imaginary perfect compressor.
Kolmogorov complexity represents the lower bound of the ultimate value of the
compressed string. It is important to NWD because it allows for a theoretical analysis “to express
and prove properties of absolute relations between objects” (Cilibrasi & Vitányi, 2007, p.
371) typically not possible with other automatic approaches used in the lexical
acquisition of meaning for information retrieval.
Normalized Information Distance (NID) is based on Kolmogorov complexity and
is the normalized length of the shortest program required to reconstruct one string from
another string and vice versa. For example, given string x and string y, NID represents
the length of the shortest program required to reconstruct string x given string y as the
input and to reconstruct string y given string x as the input. Like Kolmogorov complexity,
NID is a theoretical construct useful for expressing and proving the properties of the
NWD. Because NID is based on the incomputable Kolmogorov complexity, it is also
incomputable.
2.2.3.2. Normalized Compression Distance
As indicated above, because Kolmogorov complexity is incomputable, NID is,
therefore, incomputable. However, a computable version of NID, called Normalized
Compression Distance (NCD) can be formulated by using “real data compression
programs to approximate the Kolmogorov complexities … A compression algorithm
defines a computable function from strings to the lengths of the compressed versions of
those strings. Therefore, the number of bits of the compressed version of a string is an
upper bound on Kolmogorov complexity of that string, up to an additive constant
depending on the compressor but not on the string in question” (Cilibrasi & Vitányi,
2007). The formula for the
NCD is what is used to compute the NWD.
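A minimal sketch of the NCD calculation, assuming Python and using the zlib compressor as a stand-in for the imaginary perfect compressor, is shown below; the example strings are illustrative only.

```python
# A minimal sketch of the Normalized Compression Distance, approximating
# Kolmogorov complexity with a real compressor (zlib here).
import zlib

def compressed_len(data: bytes) -> int:
    return len(zlib.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = compressed_len(x), compressed_len(y), compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the horse and its rider crossed the field " * 20
b = b"the horse and its rider crossed the river " * 20
c = b"an entirely unrelated string about compilers and parsing " * 20
print(ncd(a, b) < ncd(a, c))   # similar strings yield a smaller distance (True)
```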
2.2.4. Probabilistic Models Used in Determining Semantic Similarity
While vector space models are conceptually easy to understand, there are some
theoretical problems with using vector space for measuring the relative similarity of
terms. Specifically, Manning & Schütze (1999) point out that vector space models
assume a Euclidean space to measure the distance between two vectors and are, therefore,
theoretically only appropriate for normally distributed values. Normal distributions,
however, cannot be assumed for data based on counts and probabilities. Word counts
(i.e., the number of occurrences of a word within a unit of text like a document or
paragraph) or probabilities (i.e., the probability that a word is contained in a document)
are the data that typically populate the matrices from which the values of vectors are
derived.
Matrices containing counts such as those used in the vector-based methods can be
converted into “matrices of conditional probabilities by dividing each element in a row
by the sum of all entries in the row” (Manning & Schütze, 1999, p. 303), which
transforms the values into estimates of maximum likelihood.
A number of information retrieval researchers have explored the usage of
probabilistic models to determine semantic similarity in order to use a method with sound
theoretical underpinnings. Probabilistic models recast the idea of calculating the
semantic similarity in vector space models to one of calculating the dissimilarity of two
probability distributions (van Rijsbergen, 1979; Lin, 1991; Dagan, Lee, & Pereira, 1997;
Manning & Schütze, 1999).
2.2.4.1. Entropy and Probabilistic Models
One class of probabilistic measures for calculating the semantic dissimilarity
between two probability distributions is based on the notion of entropy defined in
Claude Shannon’s (1948) Mathematical Theory of Communication (van Rijsbergen,
1979; Lin, 1991; Dagan, Lee, & Pereira, 1997; Manning & Schütze, 1999). In general
terms, entropy is the amount of information contained in a message and can be thought of
as the level of uncertainty about a message before it is received. As the level of
predictability of the message decreases, the level of uncertainty on the part of the
recipient increases. For example, if the message sender always sends the same sequence
of digits (010101010101…), the recipient can reliably predict what the message will be
without receiving it. In this sense, the message contains no information because there is
no uncertainty about the content of the message. On the other hand, if the recipient
cannot predict what message the sender will transmit, then the recipient is only certain of
what the message contains after the message is received (i.e., uncertainty about the
message is resolved only when the message is received) (Shannon & Weaver, 1998;
Pierce, 1980).
The amount of information contained in a message is called entropy and is
measured in bits. The entropy increases as the number of possible messages increases
and as the freedom with which to choose the messages increases (i.e., the greater the
number of possible choices and the less predictable which message will be sent, the
greater the entropy.) The entropy decreases as the number of possible messages
decreases and as the freedom with which to choose the messages decreases (i.e., the
fewer the number of possible choices and the more predictable which message will be
sent, the lower the entropy) (Shannon & Weaver, 1998; Pierce, 1980).
Shannon developed the following equation for the entropy of a message whose symbols
are not equally probable:

$$ H = -\sum_{i=1}^{n} p_i \log p_i $$

Where:
H = entropy (i.e., amount of information)
$p_i$ = probability of the ith symbol being chosen
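A small sketch of this calculation follows (assuming Python, with logarithms taken base 2 so that entropy is measured in bits).

```python
# Shannon entropy of a discrete symbol distribution, in bits.
from math import log2

def entropy(probabilities):
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([1.0]))          # 0.0 bits: a perfectly predictable message
print(entropy([0.5, 0.5]))     # 1.0 bit: two equally likely symbols
print(entropy([0.25] * 4))     # 2.0 bits: four equally likely symbols
```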
Entropy can be used to compare coding methods to determine which are the most
efficient for data compression. The true entropy for a message indicates what the best
possible encoding method can be for a message (i.e., the best compression possible
without losing any information). In the case of binary coding methods, this means
encoding a message in such a way that it requires the fewest number of binary digits per
symbol (i.e., the number of bits per symbol approaches the true entropy). The trick is to
devise an encoding scheme that achieves (or comes reasonably close to achieving) this
goal (Shannon & Weaver, 1998; Pierce, 1980; Floridi, 2003).
Entropy and its relationship to data compression form a basis for a class of
information-theoretic measures developed to measure the difference (or divergence)
between probability distributions (Lin, 1991). Two of these measures, Kullback-Leibler
(KL) divergence and the Jensen-Shannon divergence, are described below.
2.2.4.2. Dissimilarity Probabilistic Measures
Dissimilarity between two probability distributions can be measured using a
variety of probabilistic measures. Three commonly used measures are outlined in Table
2.2 and described in more detail in the sections below.
Dissimilarity Measure            Definition

Kullback-Leibler Divergence      $D(p \,\|\, q) = \sum_i p_i \log\left(\dfrac{p_i}{q_i}\right)$

Jensen-Shannon Divergence        $D\left(p \,\Big\|\, \dfrac{p+q}{2}\right) + D\left(q \,\Big\|\, \dfrac{p+q}{2}\right)$

L1 Norm                          $\sum_i |p_i - q_i|$

Table 2.2     Probabilistic dissimilarity measures.
2.2.4.2.1. Kullback-Leibler Divergence
The Kullback-Leibler (KL) divergence is the relative entropy of two probability
distributions (Lin, 1991; Dagan, Lee, & Pereira, 1997; Manning & Schütze, 1999;
Pargellis, Fosler-Lussier, Potamianos & Lee, 2001). KL divergence “measures how well
distribution q approximates distribution p; or, more precisely, how much information is
lost if we assume distribution q when the true distribution is p” (Manning & Schütze,
1999, p. 304).
There are several problems with the KL divergence that cause practical
difficulties when using it to determine the relative similarity of terms. One problem is
that the KL divergence measure has difficulty with infinite values. The measure returns
“a value of ∞ if there is a ‘dimension’ with qi = 0 and pi ≠ 0 (which will happen often,
especially if we use simple maximum likelihood estimates)” (Manning & Schütze, 1999,
p. 304). To deal with this situation, estimates must be smoothed to redistribute some
probability mass to the zero-frequency events. Mathematically, such a situation requires
additional effort that can be computationally expensive for large vocabularies (Dagan,
Lee, Pereira, 1997).
Another problem with KL divergence is that it is asymmetric. Intuitively,
semantic similarity between two terms is typically symmetric so that the level of
similarity between term a and term b is equal to the level of similarity between term b
and term a (Manning & Schütze, 1999).
2.2.4.2.2. Jensen-Shannon Divergence
Jensen-Shannon divergence measure (Lin, 1991; Wartena & Brussee, 2008) is
also known as information radius (Manning & Schütze, 1999; Pargellis, Fosler-Lussier,
Potamianos & Lee, 2001) and as the total divergence to the average measure (Dagan,
Lee, & Pereira, 1997). Jensen-Shannon divergence measure is an extension of the KL
divergence measure and “can be defined as the average of the KL divergence of each of
two distributions to their average distribution” (Dagan, Lee, & Pereira, 1999).
Like the KL divergence, the Jensen-Shannon divergence measure is based on the
notion of entropy as a measure of information. “The intuitive interpretation of [the
Jensen-Shannon divergence] is that it answers the question: How much information is
lost if we describe the two words … that correspond to p and q with their average
distribution?” (Manning & Schütze, 1999, p. 304).
The Jensen-Shannon divergence overcomes two of the major problems of the
practical application of KL divergence because all values produced by Jensen-Shannon
divergence measure are finite (i.e., there is no difficulty with infinite values being
generated), and the Jensen-Shannon divergence measure is symmetric (Lin, 1991;
Manning & Schütze, 1999).
According to Dagan, Lee & Pereira (1997), the Jensen-Shannon divergence
method consistently performs better than the other probabilistic measures they compared,
and they recommend its use, in general, as a similarity-based estimation method.
2.2.4.2.3. L1 Norm
L1 norm or Manhattan norm is “the absolute value of the difference of the two
distributions” (Pargellis, Fosler-Lussier, Potamianos & Lee, 2001, p. 220). It can be
interpreted “as a measure of the expected proportion of different events, that is, as the
expected proportion of events that are going to be different between the distributions p
and q” (Manning & Schütze, 1999, pp. 304-305). Like the Jensen-Shannon divergence,
the L1 norm is symmetric.
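For illustration, the following sketch (assuming Python) computes the three measures of Table 2.2 for two small, illustrative distributions, showing the asymmetry of the KL divergence and the symmetry of the other two measures.

```python
# A minimal sketch of the three dissimilarity measures in Table 2.2 for two
# discrete probability distributions p and q (lists of probabilities summing to 1).
from math import log2

def kl(p, q):
    # Relative entropy D(p || q); assumes q_i > 0 wherever p_i > 0
    # (otherwise the true value is infinite).
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, m) + kl(q, m)          # always finite and symmetric

def l1_norm(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
print(kl(p, q), kl(q, p))                            # asymmetric
print(jensen_shannon(p, q), jensen_shannon(q, p))    # symmetric
print(l1_norm(p, q))
```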
2.3. Augmenting Retrieval Methods
A major challenge in search engine design is the fact that purely symbolic search
algorithms miss relevant information. To identify and retrieve information that is missed,
researchers have investigated a number of different approaches to augment retrieval
methods used in search engines. Three of these approaches include integrating manually
developed semantic knowledge into the retrieval method, incorporating relevance
feedback into the retrieval method, and augmenting the retrieval method with automatic
query expansion. Each of these approaches is discussed in detail in the following
sections.
2.3.1. Integrating Manually Developed Semantic Knowledge
One large area of research that addresses the challenge of retrieving information
that is typically missed by symbolic search algorithms is the development of semantic
information to either replace or augment symbolic search engine designs. A portion of
this research focuses on manually developing (i.e., a human expert is required to develop)
semantic information. This includes research in developing, defining, and using semantic
relations between lexical elements (e.g., WordNet, Fellbaum, 1998) as well as research
that addresses the development of domain ontologies to extract and represent meaning
based on a constructed world model (Bhogal, Macfarlane, & Smith, 2007). However, the
manual development of semantic information is extremely time consuming, expensive,
and either lacks the specificity for technical domains or lacks portability for reuse in other
conceptual domains and over time within an evolving conceptual domain (Anderson &
Pérez-Carballo, 2001; Manning & Schütze, 1999).
2.3.1.1. WordNet
WordNet is an electronic lexical database of English whose design was inspired by
computational and psycholinguistic theories of human lexical memory. It is a large-scale
implementation of relational lexical semantics that represents a pattern of semantic
relations creating a mapping between word forms and word meanings. WordNet
organizes English nouns, verbs, adverbs, and adjectives into synonym sets (also called
synsets). Each synset represents one distinct underlying lexical concept and is linked to
other synsets based on semantic relationships (Fellbaum, 1998; Miller, 1998a; Carpineto
& Romano, 2012; Bird, Klein, & Loper, 2009).
In addition to the synonym relationship as captured by the construction of synsets,
other relationships are captured, such as the following semantic relationships between
noun synsets:
• hyponymy – the generalization relationship between concepts, where the individual
concepts fall on a continuum from specific to general. It can be represented in a
hierarchical tree relationship connected by IS-A or IS-A-KIND-OF links (e.g., robin →
bird → animal → organism)
• meronymy – the whole-part relationship that describes the relation between a concrete
or abstract object and its components. It can be represented by IS-A-COMPONENT-OF,
IS-A-MEMBER-OF, or IS-MADE-FROM links (e.g., beak and wing are parts of a bird).
(Miller, 1998b)
The effort and time required to create and populate WordNet has been huge. As
Miller (1998a) describes it, “[a] small army of people have worked on it at one time or
another” (p. xxi). They began adding words derived from the standard corpus of
present-day edited American English (also known as the Brown Corpus) by Kučera and Francis
(1967) and progressively added words as the developers continued to come across
sources that contained words not already present in the WordNet vocabulary. The
WordNet vocabulary is not specific to any particular domain but instead is comprised of
general-purpose words currently in use. Since 1991 when WordNet 1.0 was released to
be used by the research community, it has been used for a variety of applications, some of
which include its use to replace or augment symbolic search engine designs (Miller,
1998a; Carpineto & Romano, 2012; Bird, Klein, & Loper, 2009).
In general, WordNet can be used to augment symbolic search by “selecting one
synset for a given query term, thus solving the ambiguity problem, and then traversing
the hierarchy by following its typed links. In order to choose a synset with a similar
meaning to the query term, the adjacent query terms can be best matched with the
concepts present in each synset containing the query term. After selecting the most
relevant synset, one might consider for query expansion, all the synonyms of the query
term in the synset plus the concepts contained in any synset directly related to it, usually
with different weights” (Carpineto & Romano, 2012, p. 13).
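The following sketch (assuming Python with NLTK and a downloaded WordNet corpus) shows how synsets and their typed links can be read programmatically; the chosen words are illustrative only.

```python
# A brief sketch of reading WordNet relations with NLTK (the 'wordnet' corpus is
# assumed to be downloaded): synsets, synonyms, hypernyms (the hyponymy
# hierarchy), and part meronyms.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("train")[:3]:       # a homograph maps to several synsets
    print(synset.name(), "-", synset.definition())

bird = wn.synset("bird.n.01")
print(bird.lemma_names())                    # synonyms within the synset
print(bird.hypernyms())                      # more general concepts (IS-A links)
print(bird.part_meronyms())                  # parts such as beak and wing
```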
2.3.1.2. Domain ontologies
A domain ontology is a model of knowledge that captures and represents a
conceptual view of a particular subject domain. It is a formal representation “with a
well-defined mathematical interpretation which is capable at least to represent a subconcept
taxonomy, concept instances and user-defined relations between concepts” (Nagypál,
2005, p. 781). Through concepts, relations, and instances, ontologies represent
knowledge of a domain that may be used in information retrieval applications to infer the
intended context of potentially ambiguous queries (Nagypál, 2005; Bhogal, Macfarlane,
& Smith, 2007).
The success of using an ontology in an information retrieval application is
dependent on a variety of factors and one of the most challenging is the quality of the
ontology. The quality is determined by the accuracy, comprehensiveness, stability, and
currency of the knowledge represented in the ontology (Bhogal, Macfarlane, & Smith,
2007). But creating a quality ontology is expensive, and often the cost is prohibitive.
Nagypál (2005) has found that “presently good quality ontologies … are a very scarce
resource” (p. 782).
The level of effort required to build an ontology can be seen in Revuri,
Upadhyaya, and Kumar’s (2006) brief overview of the process they took to build the
ontology used in their work. Their process includes the following:
1. list all possible concepts in the domain
2. identify properties of each concept
3. identify characteristics of each property (e.g., Transitive, Symmetric, Functional,
Inverse Functional)
4. define constraints on properties to add specificity as necessary
5. identify relationships between concepts
6. define instances of concept
7. populate property values for each instance
8. check entire ontology for consistency
Because of the cost of manually building an ontology, a number of researchers
have started investigating ways to partially or fully automate the process of creating an
ontology (Bhogal, Macfarlane, & Smith, 2007).
2.3.2. Relevance Feedback
Another method that has been investigated that requires human input, although in
this case after the search results have been returned, is the incorporation of relevance
feedback back into the search engine. “Relevance feedback takes the results that are
initially returned from a given query and uses information provided by the user about
whether or not those results are relevant to perform a new query. The content of the
assessed documents is used to adjust the weights of terms in the original query and/or to
add words to the query” (Carpineto & Romano, 2012, p. 13). A variation on relevance
feedback is called pseudo-relevance feedback (also known as blind feedback or blind
query expansion) in which instead of requiring the user to assess the relevance of
documents, the system will assume the top-ranked documents returned are relevant.
These top-ranked documents are then analyzed and used to refine the definition of the
original query. The system then uses the refined query to create the listing of results that
are returned to the user (Bhogal, Macfarlane, & Smith, 2007; Baeza-Yates &
Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010; Savoy & Gaussier, 2010).
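One common way to implement pseudo-relevance feedback is a Rocchio-style reweighting of the query vector toward the centroid of the assumed-relevant documents; the following sketch (assuming Python with scikit-learn, with illustrative documents and parameter values) outlines the idea.

```python
# A Rocchio-style sketch of pseudo-relevance feedback: run the query, assume the
# top-k results are relevant, and move the query vector toward their centroid.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["solar panels convert sunlight", "photovoltaic cells and solar energy",
        "wind turbines generate power", "installing rooftop solar panels"]
vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(docs).toarray()

q = vectorizer.transform(["solar energy"]).toarray()[0]
scores = cosine_similarity([q], D).ravel()
top_k = scores.argsort()[::-1][:2]                 # assumed-relevant documents

alpha, beta = 1.0, 0.75                            # illustrative feedback weights
q_new = alpha * q + beta * D[top_k].mean(axis=0)   # reformulated query vector

new_scores = cosine_similarity([q_new], D).ravel() # re-rank with the refined query
print(new_scores.argsort()[::-1])
```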
2.3.3. Query Expansion
Another line of research that addresses the challenge of retrieving information
that is typically missed by symbolic search algorithms is the automatic development of
semantic information to augment symbolic search engine designs. The automatic
development of semantic information is particularly attractive because of the time, cost,
specificity, and portability limitations of the manual development of this information.
One area of this type of research uses Natural Language Processing (NLP)
techniques to perform automatic query expansion in an attempt to more completely
define and describe a user’s query. By augmenting the user’s query with additional search
terms, additional candidate string patterns are available when performing the symbolic
search. It is believed that these added candidate strings allow additional opportunities for
a symbolic search algorithm to identify additional documents that contain information
relevant to the desired concept (Efthimiadis, 1996; Bhogal, Macfarlane, & Smith, 2007;
Savoy & Gaussier, 2010; Carpineto & Romano, 2012).
Unfortunately, there has been a marked lack of success in this line of research
over the years and the resulting systems either do not improve or decrease the
performance of the search engine (Qiu & Frei, 1993; Brants, 2003). Qiu and Frei (1993)
present a theory that the lack of success with previous query expansion methods is
primarily because the methods were based on adding terms similar to each of the
individual terms used to construct the query, instead of adding terms similar to the overall
concept the query describes. The method of expanding a query by adding terms similar
only to the individual terms of the query often introduces tangential, non-relevant
concepts to the query causing the search algorithm to identify documents that are not
relevant to the original query concept. This phenomenon where the expansion terms
cause a drift in the focus of the intended search is often referred to as topic drift or query
drift (Qiu and Frei, 1993; Carpineto & Romano, 2012; Savoy & Gaussier, 2010; Carmel
et al., 2002).
2.3.3.1. Qiu and Frei’s Concept-Based Query Expansion
To address the problem of topic drift with previous methods of query expansion,
Qiu and Frei (1993) developed an alternate method whose goal was to expand the
original query with terms similar to the overall concept expressed by the original query
rather than only to the individual terms of the original query. They developed a search
algorithm that relies on a vector space model to represent the original query as a vector in
the term vector space (TVS) generated from the document collection. Additional terms
with which to expand the query are identified as those that have a high similarity to the
query vector in TVS. For a term to be eligible for use in expansion, it must be similar to
the overall query rather than similar only to one of the terms that make up the query.
To accomplish this, Qiu and Frei (1993) constructed a similarity thesaurus by
interchanging the traditional roles of documents and terms. “[T]he terms play the role of
the retrievable items and the documents constitute the ‘indexing features’ of the terms.
With this arrangement a term ti is represented by a vector t⃗i = (di1, di2, …, din) in the
document vector space (DVS) defined by all the documents of the collection. The dik’s
signify feature weights of the indexing features (documents) dk with respect to the item
(term) ti and n is the number of features (documents) in the collection” (p. 161). For
example, assume that the indexing features are weighted using the number of times a
term occurs in the document (i.e., number of occurrences) and that term t1 and term t2
occur in the five documents of the collection as presented in Table 2.3. If this were the
case, the vector for t⃗1 would be t⃗1 = (4, 1, 0, 2, 0) and the vector for t⃗2 would be t⃗2 =
(0, 8, 0, 5, 0). The similarity between two terms can be measured using a simple scalar
vector product (or other vector similarity calculations as described in the earlier sections
of this chapter) and a similarity thesaurus can be constructed by calculating the
similarities of all the term pairs (ti, tj).
Document     Occurrences of t1     Occurrences of t2
D1                    4                     0
D2                    1                     8
D3                    0                     0
D4                    2                     5
D5                    0                     0

Table 2.3     The number of times term t1 and term t2 occur in each of the five
documents that comprise the example document collection.
It should be noted that in Qiu and Frei’s approach, a more complex calculation is
used to determine the feature weights that includes such things as document length,
number of unique terms contained in a document, and total number of documents in the
collection that contain the term and total number of documents in the collection.
The next step is to represent the user’s query as a vector. “A query is represented
by a vector q⃗ = (q1, q2, …, qm) in the term vector space (TVS) defined by all the terms
of the collection. Here, the qi’s are the weights of the search terms ti contained in the
query q; m is the total number of terms in the collection” (p. 162). The term weights are
determined by calculating the probability that the term is similar to the overall concept of
the query. Therefore, in the vector 𝑞⃗ the value q1 corresponds to the probability that term
t1 is similar to the overall concept of the query, value q2 corresponds to the probability
that term t2 is similar to the overall concept of the query, and so on.
“Since the similarity thesaurus expresses the similarity between the terms of the
collection in the DVS (defined by the documents of the collection), we map the vector 𝑞⃗
from the TVS (defined by the terms of the collection) into a vector in space DVS. This
way, the overall similarity between a term and the query can be estimated” (p. 163). Once
the vector 𝑞⃗ has been mapped to the document vector space, the pre-computed entries of
the similarity thesaurus can be used to determine which terms in the collection have a
high similarity to the overall concept of the query and may, therefore, be considered
candidate terms for expanding the original query.
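The following simplified sketch (assuming Python with NumPy) illustrates the spirit of this approach: term vectors live in the document space, a similarity thesaurus is pre-computed for all term pairs, and expansion candidates are ranked by similarity to the query as a whole rather than to its individual terms. It uses raw occurrence counts and cosine similarity in place of Qiu and Frei's more elaborate feature-weighting scheme.

```python
# A simplified sketch of concept-based expansion in the spirit of Qiu and Frei (1993).
import numpy as np

terms = ["t1", "t2", "t3"]
# Rows = terms, columns = documents D1..D5; t1 and t2 use the occurrence
# counts from Table 2.3, and t3 is an extra term added for illustration.
T = np.array([
    [4, 1, 0, 2, 0],
    [0, 8, 0, 5, 0],
    [1, 0, 3, 1, 0],
], dtype=float)

def unit(v):
    return v / np.linalg.norm(v)

# Similarity thesaurus: a similarity value for every term pair (cosine here).
thesaurus = np.array([[unit(a) @ unit(b) for b in T] for a in T])

# Represent the query by weights over the collection's terms (t1 only here),
# then score every term against the query as a whole via the thesaurus.
query_weights = np.array([1.0, 0.0, 0.0])
scores = query_weights @ thesaurus

candidates = [terms[i] for i in scores.argsort()[::-1] if query_weights[i] == 0]
print(candidates[:1])   # best expansion candidate not already in the query
```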
While Qiu and Frei achieved a notable improvement in search engine
performance, they noted that their method was less successful than systems that used
mature user feedback relevance data. However, Qiu and Frei’s work inspired a line of
query-expansion methods known as concept-based query expansion that continues to be
actively researched (Efthimiadis, 1996; Bhogal, Macfarlane, & Smith, 2007; Savoy &
Gaussier, 2010; Carpineto & Romano, 2012).
2.3.3.2. Performing Query Expansion
While the details of the methods that have been investigated for query expansion
vary, the overall process typically consists of four main steps as illustrated in Figure 2.5.
The typical steps of query expansion include data preprocessing, feature generation,
feature selection, and query reformulation. Considering each of these key steps
separately presents a useful way to understand the types and variety of alternative
approaches that have been explored by information retrieval researchers in an attempt to
better understand the best methods for improving retrieval performance with query
expansion. Each step is discussed in the sections below.
Data Preprocessing → Feature Generation → Feature Selection → Query Reformulation

Figure 2.5     Typical steps of query expansion (adapted from Figure 1 of Carpineto &
Romano, 2012, p. 1:10).
2.3.3.2.1. Data Preprocessing
The first step of the query expansion process is data preprocessing. In this step,
the data used to identify the candidate terms and determine the weighting used in refining
or augmenting the query is converted and processed into a form that can be used by the
system. Much of the data preprocessing that occurs at this stage is not unique to the
query expansion process and, therefore, fairly similar across various query expansion
methods (see Section 2.1.1.3.1. Preparing the System). However, one key difference
between query expansion methods at this stage is the data source that the method uses to
derive the features or terms that will be used in later steps of the process. In some cases,
the corpus on which the search is to be conducted is used; in other words, the text from
the documents in the target collection is used (Qiu & Frei, 1993; Graupmann, Cai, &
Schenkel, 2005; Hu, Deng, & Guo, 2006). However, sometimes only a particular aspect
or set of the documents from the corpus is used as the data source. For example, some
systems use what is sometimes called anchor text documents or anchor text summaries as
the data source. These anchor text documents correspond to a particular document in the
collection and are created by collecting all the anchor text (i.e., the underlined or
highlighted clickable text in a hyperlink) found in the collection that points to this
particular document (Kraft & Zien, 2004; He & Ounis, 2007). Another very common
data source is derived by running the original, unaltered query against the collection and
using the matching top-ranked document listing and their relevant text snippets as the
subset of text used for the subsequent steps of the query expansion process (Robertson,
Walker, & Beaulieu, 1998; Lavrenko & Croft, 2001; Carpineto, de Mori, Romano, &
Bigi, 2001; Carmel, Farchi, Petruschka, & Soffer, 2002).
But, the corpus is not the only data source that may be used in information
retrieval systems to identify terms for use in query expansion. Some systems look beyond
the current corpus for ways to better define the query. Examples of alternate data sources
include the following:
• a query log of previous queries made in the system (Billerbeck, Scholer, Williams, &
Zobel, 2003; Cui, Wen, Nie, & Ma, 2003)
• WordNet or other ontological knowledge model (Voorhees, 1994; Liu, Liu, Yu, &
Meng, 2004; Collins-Thompson & Callan, 2005; Bhogal, Macfarlane, & Smith, 2007;
Song, Song, Hu, & Allen, 2007; Kara, Alan, Sabuncu, Akpinar, Cicekli, & Alpaslan,
2012)
• the content of Frequently Asked Questions pages (Riezler, Vasserman, Tsochantaridis,
Mittal, & Liu, 2007)
• information derived from relevant Wikipedia articles (Arguello, Elsas, Callan, &
Carbonell, 2008; Xu, Jones, & Wang, 2009)
Some systems also use a combination of the above data sources to increase the
opportunity of finding features or terms that will improve the ability of the system to
identify relevant documents for a wider variety of users’ information needs.
2.3.3.2.2. Feature Generation
A variety of methods are used to generate the possible features that are considered
candidates for expanding the original query. Typically the candidate features can be
thought of as the specific terms or phrases that may be used to expand the original query
to provide more opportunities for string pattern matching but features may also include
such things as abstract representations of concepts or attribute-value pairs. The methods
used to extract features from the data source reflect the conceptual paradigms that the
system creator has chosen to guide the design of the query expansion component of the
information retrieval system (Carpineto & Romano, 2012).
Carpineto and Romano (2012) identify five major conceptual paradigms that were
used to generate and rank the features used in query expansion. These are linguistic
analysis, global corpus-specific techniques, local query-specific techniques, search log
analysis, and web data harvesting.
2.3.3.2.2.1. Linguistic Analysis
Linguistic analysis “techniques leverage global language properties such as
morphological, lexical, syntactic and semantic word relationships to expand or
reformulate query terms. They are typically based on dictionaries, thesauri, or other
similar knowledge representation sources such as WordNet. As the expansion features are
usually generated independently of the full query and of the content of the database being
searched, they are usually more sensitive to word sense ambiguity” (Carpineto &
Romano, 2012, p. 25). Some common types of linguistic analysis for the generation of
features are the following:
• word stemming to expand the query to include the morphological variants of the
original query terms (Krovetz, 1993; Collins-Thompson & Callan, 2005)
• domain-specific ontologies to provide appropriate contextual information in order to
disambiguate query terms and find synonyms and related words (Nagypál, 2005; Bhogal,
Macfarlane, & Smith, 2007; Revuri, Upadhyaya, & Kumar, 2006; Song, Song, Allen, &
Obradovic, 2006)
• domain-independent models like WordNet to find synonyms and related words
(Voorhees, 1994; Liu, Liu, Yu, & Meng, 2004; Collins-Thompson & Callan, 2005;
Bhogal, Macfarlane, & Smith, 2007)
• syntactic analysis to extract relations between the query terms (Sun, Ong & Chua,
2006)
2.3.3.2.2.2. Global Corpus-Specific Techniques
The global corpus-specific techniques extract information from the whole set of
documents in the collection (Baeza-Yates & Ribeiro-Neto, 1999). These techniques
analyze the collection to identify features used in similar ways and in many cases build a
collection-specific thesaurus to be used for identifying candidate expansion features
(Carpineto & Romano, 2012). Approaches used to build the collection-specific thesaurus
include the following:
• similarity between concept terms (Qiu & Frei, 1993)
• term clustering (Schütze & Pedersen, 1997; Bast, Majumdar & Weber, 2007)
• associations between terms formed through mutual information (Hu, Deng, & Guo,
2006; Bai, Nie, Cao, & Bouchard, 2007), context vectors (Gauch, Wang, & Rachakonda,
1999), and latent semantic indexing (Park and Ramamohanarao, 2007)
2.3.3.2.2.3. Local Query-Specific Techniques
The local query-specific techniques extract information from the local set of
documents retrieved from the original query to identify the features to be used to refine
and augment the query (Baeza-Yates & Ribeiro-Neto, 1999). Typically, these
techniques make use of the top-ranked documents that are returned when the original
query is submitted and include pseudo-relevance feedback as described above (see Section
2.3.2. Relevance Feedback) (Carpineto & Romano, 2012). The information to identify
each of the top-ranked documents such as the title of the document, document access
information (i.e., a hyperlink to the document or address of the physical location of the
document), and snippets of text that surround the terms that satisfied the requirement of
the query may be analyzed to identify features (Robertson, Walker, & Beaulieu, 1998;
Lee, Croft, & Allan, 2008; Cao, Gao, Nie, & Robertson, 2008).
2.3.3.2.2.4. Search Log Analysis
The technique of analyzing search logs mines “query associations that have been
implicitly suggested by Web users, thus bypassing the need to generate such associations
in the first place by content analysis” (Carpineto & Romano, 2012, p. 27). While this
technique has the advantage of being able to mine information already created, the
implicit relevance feedback contained in search logs may be only relatively accurate and
may not be equally useful for all search tasks.
The most widely used technique based on analyzing search logs is to “exploit the
relation of queries and retrieval results to provide additional or greater context in finding
expansion features” (Carpineto & Romano, 2012, p. 27). Methods that have used this
approach include the following:
• using top-ranked documents returned in similar past queries (Fitzpatrick & Dent, 1997)
• selecting terms from past queries associated with documents in the collection (Billerbeck, Scholer, Williams, & Zobel, 2003)
• establishing probabilistic correlations between query terms and document terms by analyzing those documents selected by the user after submitting a query (Cui, Wen, Nie, & Ma, 2003)
2.3.3.2.2.5. Web Data Harvest
The web data harvest technique uses sources of relevant information that are not
necessarily part of the document collection to generate candidate expansion features.
One of these techniques is to use the articles and their connections with one another in
Wikipedia as a source of relevant feature extraction. Two examples include:
• using the anchor text of hyperlinks that point to relevant Wikipedia articles as a source of expansion phrases (Arguello, Elsas, Callan, & Carbonell, 2008)
• creating Wikipedia-based pseudo-relevance information based on the category of query submitted (Xu, Jones, & Wang, 2009)
Other techniques have included the use of Frequently Asked Question (FAQ) pages to compile a large set of question-answer pairs as a source of relevant feature extraction (Riezler et al., 2007).
2.3.3.2.3. Feature Selection
After generating the candidate expansion features, the top features are selected for
query expansion. “Usually only a limited number of features is selected for expansion,
partly because the resulting query can be processed more rapidly, partly because the
retrieval effectiveness of a small set of good terms is not necessarily less successful than
adding all candidate expansion terms, due to noise reduction” (Carpineto & Romano,
2012, p. 22).
The representation of the candidate features that are to be selected in this step varies from method to method. Candidate features to be selected include the following:
• single words (Qiu & Frei, 1993; Voorhees, 1994; Robertson, Walker, & Beaulieu, 1998; Lee, Croft, & Allan, 2008; Cao, Gao, Nie, & Robertson, 2008)
• phrases (Kraft & Zien, 2004; Song, Song, Allen, & Obradovic, 2006; Riezler et al., 2007; Arguello et al., 2008)
• attribute-value pairs (Graupmann, Cai, & Schenkel, 2005)
• multi-word concepts (Metzler & Croft, 2007)
In some cases, multiple types of candidate features have been generated. For
example, Liu et al. (2004) describe a method that selects the top-ranking single words
and/or phrases that have been identified in the feature generation step.
There is no consensus about whether there exists an optimum number of
expansion features that should be used to optimize the results returned or what this
number (or range) is. Some suggest that an optimum range is five to ten features (Amati, 2003; Chang, Ounis, & Kim, 2006), Harman (1992) recommends 20 features, and at the extreme end of the range, Buckley et al. (1995) recommend a massive set of 300-530 features. Alternatively, Chirita, Firan, and Nejdl (2007) suggest that instead of selecting a static number of expansion features, the number of expansion features selected may be adapted based on the estimated clarity of the original query. The clarity of the query is
estimated using an equation that measures “the divergence between the language model
associated to the user query and the language model associated to the collection” (p. 12).
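The clarity idea can be illustrated with a small sketch. The following Python fragment is a simplified illustration, not Chirita, Firan, and Nejdl's exact formulation: it assumes a query language model estimated directly from the query's own term frequencies and a collection language model estimated from collection-wide term counts, and the example terms and counts are hypothetical.

import math
from collections import Counter

def clarity_score(query_terms, collection_counts, collection_size):
    # Query language model: relative term frequencies within the query itself.
    query_model = Counter(query_terms)
    query_length = sum(query_model.values())
    score = 0.0
    for term, count in query_model.items():
        p_query = count / query_length
        # Collection language model; a small floor avoids division by zero
        # for terms that never occur in the collection.
        p_collection = max(collection_counts.get(term, 0), 1e-9) / collection_size
        score += p_query * math.log2(p_query / p_collection)
    return score

# Hypothetical collection statistics; a higher score suggests a clearer query.
counts = Counter({"aircraft": 1200, "warning": 300, "design": 900, "the": 50000})
print(clarity_score(["aircraft", "warning"], counts, sum(counts.values())))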
2.3.3.2.4. Query Reformulation
After the features are selected, the final step is to determine how to augment the
original query with the selected features. Many query expansion methods assign a weight
to the features that are added (called query re-weighting). A common re-weighting
technique uses a calculation that is based on Rocchio’s relevance feedback equation. In
general, Rocchio’s equation is defined so that q represents the original query, q′ represents the expanded query, w_{t,q} and w′_{t,q′} represent the weight of term t in the original and expanded queries, λ is a parameter used to weight the relative contribution of the query terms and the expansion terms, and score_t is a weight assigned to expansion term t:

w′_{t,q′} = (1 − λ) · w_{t,q} + λ · score_t
Alternatively, instead of using an equation to calculate the weights for each term,
a simple rule of thumb may be used by simply arbitrarily assigning the original query
terms twice as much weight as the new expansion terms (Carpineto & Romano, 2012).
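As an illustration, the following Python sketch applies the Rocchio-style re-weighting defined above. The term names, weights, and the value of λ are hypothetical; a real system would derive score_t from whatever feature-ranking method it uses.

def reweight_query(original_weights, expansion_scores, lam=0.5):
    # w'_{t,q'} = (1 - lam) * w_{t,q} + lam * score_t for every term t that
    # appears in either the original query or the set of expansion terms.
    terms = set(original_weights) | set(expansion_scores)
    return {t: (1 - lam) * original_weights.get(t, 0.0) + lam * expansion_scores.get(t, 0.0)
            for t in terms}

# Hypothetical query weights, expansion scores, and lambda value.
print(reweight_query({"ireland": 1.0, "peace": 1.0},
                     {"belfast": 0.8, "peace": 0.4}, lam=0.3))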
While re-weighting is common in query expansion methods, it is not always
performed. Some query expansion methods achieve good results by reformulating the
query using query specification languages in which weighting is not used. Two examples
of query specification methods that do not weight terms that have been shown to be
successful in some instances are a general structured query (Collins-Thompson & Callan,
2005) and a Boolean query (Graupmann, Cai, & Schenkel, 2005; Liu, Liu, Yu, & Meng,
2004; Kekäläinen & Järvelin, 1998).
2.3.3.3. Example Query Expansion Systems
In the previous section, a variety of approaches for designing a concept-based
query expansion system was described in the context of the typical key steps used in the
query expansion process. In this section, two query expansion systems that possess
elements of interest will be presented in more detail.
2.3.3.3.1. Hu, Deng, and Guo’s Global Analysis Approach
Hu, Deng, and Guo (2006) designed an information retrieval system in which a
global analysis approach was used to perform query expansion. They divided their
approach into three primary stages: one, term-term association calculation; two, suitable
term selection; and three, expansion term reweighting.
In the first stage, the term-term association calculations were performed in which
the statistical relationship between term pairs in a document collection provides the basis
for the construction of what Hu, Deng, and Guo refer to as a “thesaurus-like resource” to
aid query expansion. As discussed in section 2.2, the measures commonly used to calculate how similar (or dissimilar) one term is to another each suffer from various limitations and weaknesses. Hu, Deng, and Guo attempted to
overcome some of these limitations by developing their own association measure in
which the values of three different measures are integrated into a single association value.
Their association measure is composed of the following:
• Term Weight – From vector space models, the term weight of each term is calculated based on a normalized Term Frequency (TF) and Inverse Document Frequency (IDF).
• Mutual Information – From probabilistic models based on the notion of entropy, mutual information represents the average amount of information shared by the two terms.
• Normalized Distance Between Terms – The normalized distance between the terms represents the average proximity of the two terms in the collection.
The term weight, mutual information, and normalized distance between terms measures
are combined with equal weight to produce an association value for a given term pair.
Hu, Deng, and Guo ran this association measure on their document collection to construct
the thesaurus-like resource containing the association measure values between each term
pair in the collection.
In the second stage, the term-query based expansion was performed. To address
the problem of topic drift identified by Qiu and Frei (1993) when expansion terms are
selected based only on their similarity to a single term in the query, Hu, Deng, and Guo
used “a term-query based expansion scheme, which emphasizes the correlation of a term
to the entire query” (p. 704). They expressed the correlation in the following equation:
Co(t_i, q) = Σ_{j=1}^{k} tf_j · A(t_i, t_j)

Where:
k = number of unique terms in the query q
tf_j = term j’s term frequency in the query q
A(t_i, t_j) = association measure value for term t_i and term t_j
Using this calculation, “all index terms are ranked in decreasing order according
to their correlation to a given query q” (p. 704) and the top m ranked terms are selected to
create a revised query q′.
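A minimal sketch of this term-query based expansion scheme is shown below. It assumes the association measure values A(t_i, t_j) have already been computed and stored; the example terms and values are hypothetical, and the details of Hu, Deng, and Guo's implementation may differ.

from collections import Counter

def correlation_to_query(term, query_terms, association):
    # Co(t_i, q) = sum over the unique query terms t_j of tf_j * A(t_i, t_j).
    tf = Counter(query_terms)
    return sum(freq * association.get(frozenset((term, t_j)), 0.0)
               for t_j, freq in tf.items())

def select_expansion_terms(index_terms, query_terms, association, m):
    # Rank all index terms by their correlation to the query and keep the top m.
    ranked = sorted(index_terms,
                    key=lambda t: correlation_to_query(t, query_terms, association),
                    reverse=True)
    return ranked[:m]

# Hypothetical association values A(t_i, t_j), stored under unordered term pairs.
A = {frozenset(("cockpit", "flightdeck")): 0.9, frozenset(("warning", "alert")): 0.7}
print(select_expansion_terms(["flightdeck", "alert", "runway"],
                             ["cockpit", "warning"], A, m=2))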
The third and final stage of the process reweighted the expansion terms chosen in
the previous stage to create the reformulated query. The calculation Hu, Deng, and Guo
used was based on their goal to ensure that the selected expansion terms contributed to
the performance of the search but did not override the impact of the original terms of the
query. They therefore used “a simple reweighing scheme with a formula defined by the
product of expanded term’s rank and average term weight of the original query” (p. 705).
Hu, Deng, and Guo evaluated their retrieval performance by comparing the
performance of their search engine after each stage of processing to a baseline search
engine that ran with only the original unexpanded query. They found that each stage
improved the retrieval performance. One of the explanations that they provide to describe
why they believe that their method was successful was that “it is mainly based on the
analysis of intrinsic association relationships between terms in the entire document
collection rather than some retrieved documents that are assumed to be relevant” (p. 706).
Like Hu, Deng, and Guo’s method, the Enhanced search engine method
investigated in this dissertation research is based on an association thesaurus built from
the document collection. There are two primary differences between the association thesaurus used by Hu, Deng, and Guo and the association thesaurus used in the Enhanced search engine. First, Hu, Deng, and Guo use an association measure derived
from combining term weight, mutual information, and normalized distance between
terms while the method in this dissertation research simply uses a co-occurrence
calculation based on the Jaccard Coefficient within a predefined proximity window (additional details about the association calculations used in the Enhanced search engine developed in this research are presented in Chapter 4). The
second difference is the way the association thesaurus is used in the two methods. Hu,
Deng, and Guo used the association thesaurus to create a ranked list of all index terms
and chose the top m terms as the expansion terms. In this dissertation research, the
association thesaurus is used to create a conceptual network. The expansion terms,
therefore, are not derived directly from the association thesaurus but rather from the
relational pathways that connect the original query terms within the conceptual network.
2.3.3.3.2. Collins-Thompson and Callan’s Markov Chain Framework Approach
Collins-Thompson and Callan (2005) designed an information retrieval system in
which they used a Markov chain framework to combine multiple sources of information
related to term associations. They began with the pseudo-relevance feedback function
built into the Lemur Project’s Indri search engine to identify the set of candidate expansion terms for the given query. The Indri algorithm calculates a log-odds ratio for each candidate term, and the top k terms are selected. From these top k Indri-generated candidate terms, a query-specific term network is constructed using a multi-stage Markov chain framework to conduct “a random walk to estimate the likelihood of relevance of
expansion terms” (p. 705). In this way, pairs of terms are linked by drawing data from
six different sources of semantic and lexical association to form a network of terms
specific to the query. The association information about the terms is derived from the following sources: synonyms from WordNet, stemming, a general word association
database, co-occurrence in a set of Wikipedia articles, co-occurrence in the top-retrieved documents from the original query, and background smoothing to uniformly link each term with all others. The multi-stage Markov chain model favors different sources of information at different stages of its walk. For example, in the earliest steps of the random walk, the chain may favor co-occurrence relationships while later in the process the walk may favor synonyms.
The walk process begins with a node representing one aspect of the original
query. The aspects of a query are each represented using one or more terms from the
original query. Collins-Thompson and Callan present the example query “Ireland peace
talks” and state that the aspects of this query may be represented by the following sets of
words taken from the query:
• “Ireland”
• “peace”
• “talks”
• “peace talks”
• “Ireland peace talks”
In order to ensure that documents related to the intent of the query are retrieved, the
expansion terms chosen should reflect one or more of the aspects of the original query.
Therefore, the method performed a multi-staged random walk for each aspect to
propagate the association information and create the term network model. The stationary
distribution of the model provided a probability distribution over the candidate expansion
terms and was used to identify those expansion terms that had a high probability of
representing one or more of these aspects of the original query. It identified expansion terms from those nodes that had direct links to one another as well as from those nodes whose connections were only implied. Once the expansion terms were identified, the
terms were weighted to favor those terms that had a high probability of being closely
related to main aspects of the query as well as those terms that reflected multiple aspects
of the query.
Collins-Thompson and Callan’s method performed similarly to other well-performing methods evaluated using the same TREC datasets, with some “modest
improvements in precision, accuracy, and robustness for some tests. Statistically
significant differences in accuracy were observed depending on the weighting of
evidence in the random walk. For example, using co-occurrence data later in the walk
was generally better than using it early” (p. 711).
On the surface, Collins-Thompson and Callan’s Markov chain framework approach
is one of the most similar query expansion methods to the Enhanced search engine
method investigated in this dissertation research. Like Collins-Thompson and Callan’s
method, the Enhanced search engine is based on identifying candidate expansion terms from
a network of terms generated by the system. However, the Collins-Thompson and Callan
query-specific term network differs from the conceptual network in this dissertation
research in several ways.
First, the nature of and the process by which the network is constructed differ
between the two methods. Collins-Thompson and Callan began their process by using the
set of candidate expansion terms generated from Indri’s built-in pseudo-relevance
function for the original query. They then used a Markov chain framework to generate a
term network using the word sets representing the various aspects of the query and the
Indri generated candidate terms. Each link between the candidate term pairs in the
network was weighted by the probability from one of a variety of sources that the two
terms were related.
In contrast, the Enhanced search engine conceptual network was created using the
term entries from the association thesaurus and all links between terms were of equal
importance. The association thesaurus represented clusters of terms related by collection-specific co-occurrence. In this way, the conceptual network represents all the concepts
present in the document collection and does not change based on the query posed but
rather changes only as the document collection changes. Therefore, rather than requiring
a kernel of expansion terms, it used the individual terms that make up the original query
itself to identify the portion of the overall conceptual network that is relevant to the given
query concept.
Second, how the expansion terms are selected using the network is different
between the two methods. The Collins-Thompson and Callan method selected the top k
expansion terms based on a probability calculated by combining the probability derived
from the stationary distribution and the original Indri log-odds ratio. On the other hand,
the Enhanced search engine selected the candidate expansion terms based on the
intervening terms within the relational pathways. The various, appropriately short
relational pathways that form the connection(s) between the original query terms in the
conceptual network are thought to represent an aspect of the query topic. The pathway of
terms represents a logical progression of ideas that, linked together, represent an aspect of the overall query concept. In this way, each pathway represents an aspect of the query concept that is present in the collection, as opposed to the sets of words that provide the
initial node for the Markov random walk in the Collins-Thompson and Callan method.
Therefore, appropriate combinations of the intervening terms and the original query terms
from each relational pathway found were selected to expand the original query.
Third, the processing requirements of the two methods also appear to differ
significantly. The Collins-Thompson and Callan method requires a variety of mathematical
processing techniques to perform the multi-stage Markov chain random walks to create
the stationary distribution of the model. Because the term network is constructed for
each query based on the query terms and the candidate terms generated from Indri’s
pseudo-relevance function, it is likely that the majority of the processing must be
performed after the user has submitted the query (i.e., post-search). Depending on the
processing time required to generate the network, identify the expansion terms, reformulate
the query, and re-run the query, the user’s experience may be negatively impacted by the
processing time required to complete the search request. In contrast, the mathematical
calculations required by the Enhanced search engine are relatively simple (i.e., the most
complex calculation is the Jaccard Coefficient to determine term co-occurrence) and a
large majority of the processing can be performed ahead of time so that the user’s
experience is not negatively impacted (i.e., the user is not aware of the time required to
perform the processing).
2.4. Evaluation
Like that of computer science research in general, the purpose of most research in
search engine design “is to invent algorithms and generate evidence to convince others
that the new methods are worthwhile” (Moffat & Zobel, 2004, p. 1). In order to convince
others that the new search engine methods are worthwhile, compelling, accurate,
reproducible evidence must be generated. Typically, the most effective way to achieve
this is by implementing the search engine and conducting a well-designed experiment to
measure its performance (Moffat & Zobel, 2004).
Two unique aspects of experimental design for the comparative evaluation of
search engines are the performance measures chosen and the data (i.e., test collection)
used.
2.4.1. Search Engine Performance Measures
One of the primary aspects of search engine performance that is desirable to try to
quantify and measure is search engine effectiveness. “Effectiveness, loosely speaking,
measures the ability of the search engine to find the right information” (Croft, Metzler, &
Strohman, 2010, p. 298). There are various statistical methods available to measure the
effectiveness of a search engine. Depending on the objectives of the search engine and
the type of performance improvement expected, some measures may be more appropriate
than others (Baeza-Yates & Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010;
Croft, Metzler, & Strohman, 2010).
2.4.1.1. Recall and Precision
Recall and precision are traditional information retrieval measures that have been
used extensively in information retrieval research. Recall is the proportion of the relevant
documents contained in the document collection that are retrieved by a search engine in
response to a search query. Therefore, recall quantifies exhaustivity (i.e., how complete
the set of relevant documents retrieved for a given search query is). Precision, on the
other hand, is the proportion of relevant documents contained in the result set. In other words, precision reflects the amount of noise (i.e., the number of false positives) in the result set (Baeza-Yates & Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010;
Croft, Metzler, & Strohman, 2010).
Recall and precision are calculated using the following equations:
R = recall = |A ∩ B| / |A|

P = precision = |A ∩ B| / |B|

Where:
A = set of all relevant documents in document collection
B = set of retrieved documents
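As a small illustration of these definitions, the following Python sketch computes recall and precision from a set of relevant document identifiers and a set of retrieved document identifiers; the identifiers themselves are hypothetical.

def recall_precision(relevant, retrieved):
    # R = |A ∩ B| / |A| and P = |A ∩ B| / |B| for relevant set A and retrieved set B.
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical document identifiers: four relevant documents, three retrieved.
print(recall_precision({"d1", "d2", "d3", "d4"}, {"d2", "d3", "d9"}))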
2.4.1.1.1. Combining Recall and Precision with the F-measure
Recall and precision are complementary calculations that should be considered
together to properly evaluate the effectiveness of a search engine. A search engine that
always returns the entire collection in response to a query has perfect recall (i.e., R = 1)
but extremely low precision (i.e., P approaches 0); it is doubtful that such a search engine
would be considered effective. The opposite is also true: a search engine with near-perfect precision (P = 1) and very low recall (i.e., R approaches 0) also is unlikely to be
considered an effective search engine. In most cases, the goal is to find an appropriate
balance between recall and precision (Croft, Metzler, & Strohman, 2010).
To make it easier to consider recall and precision together, the values may be
combined into a single value called the F-measure. The F-measure is the harmonic mean
of recall and precision and is calculated using the following equation:
F-measure = 2 / (1/R + 1/P) = (2 · R · P) / (R + P)
Where:
R = recall value
P = precision value
Because it is a harmonic mean, the F-measure enforces a balance between recall and precision so that a more sensible measure of effectiveness is computed at the various
extremes of recall and precision values. For example, consider the example above in
which the entire collection was retrieved in response to a search query (i.e., R = 1 and P
approaches 0). When an arithmetic mean ((R + P) / 2) is used, the resulting value is
greater than 0.5. However, when the F-measure (i.e., harmonic mean) is used, the
resulting value is close to 0. In this situation, an effectiveness value close to 0 instead of
a value greater than 0.5 intuitively makes more sense and more accurately conveys how
effectively the search engine has performed (Croft, Metzler, & Strohman, 2010).
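The following short Python sketch reproduces this worked example, comparing the arithmetic mean and the F-measure (harmonic mean) for the perfect-recall, near-zero-precision case; the specific precision value used is an arbitrary small number chosen only for illustration.

def f_measure(recall, precision):
    # Harmonic mean of recall and precision; defined as 0 when both are 0.
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Perfect recall with near-zero precision (the "return everything" case).
R, P = 1.0, 0.001
print((R + P) / 2)      # arithmetic mean: about 0.5, misleadingly high
print(f_measure(R, P))  # harmonic mean: about 0.002, close to 0 as expected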
While the version of the F-measure that equally balances recall and precision is
the most common, a weighted harmonic mean may also be used. The weighted version
allows weights to be included in the calculation to reflect the relative importance of recall
and precision that is desired. The weighted version of the F-measure is calculated using
the following equation:
weighted F-measure = (R · P) / (α · R + (1 − α) · P)
Where:
R = recall value
P = precision value
𝛼 = a weight
2.4.1.1.2. Assumptions of Recall and Precision Measures
Recall and precision rely on several assumptions for their calculations to be valid
measurements of effectiveness. These assumptions include the following:
• The user’s information need is represented by the search query supplied to a search engine.
• Each document in the document collection is relevant or is not relevant to a user’s information need (i.e., it is a binary classification).
• The classification of relevance is only dependent on the document itself and the information need and is not influenced by the relevance of any other document in the document collection.
• For a given search query supplied to a given search engine, there will be a set of one or more documents that will be retrieved, and the remainder of the documents in the document collection will not be retrieved.
• The order that the documents are returned in the result set has no bearing on the calculated values.
2.4.1.2. Effectiveness Measures Used for Ranked Results
While relevance ranking is outside the scope of this dissertation, a majority of
search engine research conducted today performs relevance ranking to order the
documents returned in the result set. Therefore, two of the most common measures of
effectiveness reported in search engine research literature are Precision at k Documents
(P@k) and Average Precision (AP). Both measures are based on the assumption that as
both document collections and the number of relevant documents they contain grow
larger, users are not interested in every relevant document contained in the collection.
Instead, it is assumed that a user considers a search engine most effective if the first
documents returned are relevant. The relevancy of the document beyond the initial set of
documents retrieved is unimportant and, therefore, should not be considered as part of the
effectiveness measures. Therefore, both P@k and AP focus primarily on the precision of
the documents returned (Büttcher, Clarke, & Cormack, 2010).
2.4.1.2.1. Precision at k Documents (P@k)
The measure Precision at k Documents (P@k) “is meant to model the satisfaction
of a user who is presented with a list of up to k highly ranked documents, for some small
value of k (typically k = 5, 10, or 20)” (Büttcher, Clarke, & Cormack, 2010, p. 408). The
P@k equation is defined as:
P@k = |B[1..k] ∩ A| / k
Where:
k = the number of highly ranked documents to be considered
A = set of all relevant documents in document collection
B[1..k] = set of the top k retrieved documents
The P@k measure does not consider the order of the top k documents (i.e., whether
document 1 or document k has the higher ranking does not impact the results). “P@k
assumes that the user inspects the results in arbitrary order, and that she inspects all of
them even after she has found one or more relevant documents. It also assumes that if the
search engine is unable to identify at least one relevant document in the top k results, it
has failed, and the user’s information need remains unfulfilled. Precision at k documents
is sometimes referred to as an early precision measure” (Büttcher, Clarke, & Cormack,
2010, p. 408).
One argument against the use of the P@k measure is that the choice of the value of k is arbitrary yet can impact the effectiveness results.
2.4.1.2.2. Average Precision (AP)
The Average Precision (AP) measure provides an alternative approach to the P@k
equation. AP addresses the problem of k being an arbitrary choice by combining
precision values at all the possible recall levels.
AP = (1/|A|) · Σ_{i=1}^{|B|} relevant(i) · P@i

Where:
A = set of all relevant documents in document collection
B = set of retrieved documents
relevant(i) = 1 if the i-th document in B is relevant; 0 otherwise
“[F]or every relevant document d, AP computes the precision of the result list up
to and including d. If a document does not appear in [the set of retrieved documents], AP
assumes the corresponding precision to be 0. Thus, we may say that AP contains an
implicit recall component because it accounts for relevant documents that are not in the
results list” (Büttcher, Clarke, & Cormack, 2010, p. 408).
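A minimal Python sketch of both measures is given below, assuming a ranked result list of document identifiers and a set of relevant identifiers; the example documents are hypothetical.

def precision_at_k(ranked_results, relevant, k):
    # P@k = |B[1..k] ∩ A| / k for ranked result list B and relevant set A.
    return sum(1 for doc in ranked_results[:k] if doc in relevant) / k

def average_precision(ranked_results, relevant):
    # AP = (1/|A|) * sum of relevant(i) * P@i over positions i in the result list;
    # relevant documents missing from the list implicitly contribute 0.
    if not relevant:
        return 0.0
    total = 0.0
    for i, doc in enumerate(ranked_results, start=1):
        if doc in relevant:
            total += precision_at_k(ranked_results, relevant, i)
    return total / len(relevant)

# Hypothetical ranked result list and relevant set.
ranked = ["d3", "d7", "d1", "d9", "d2"]
relevant = {"d1", "d2", "d3"}
print(precision_at_k(ranked, relevant, 5))   # 0.6
print(average_precision(ranked, relevant))   # approximately 0.76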
2.4.2. Search Engine Data for Evaluation Experiments
The test data set for conducting search engine evaluation experiments is
generally comprised of three elements: the document collection that contains the
documents that will be searched, the query topics that represent the information need to
be filled, and the set of relevant documents within the collection that matches the criteria
of each query topic.
The test data set used can have a large impact on the accuracy of the results. It
may either be developed specifically for the particular evaluation experiment to be
conducted or pre-defined by others for the purpose of evaluating search engines. The
objectives of the novel search engine method and the type of improvement expected
should always drive the choice of the test data set used.
2.4.2.1. Developing a Test Data Set
The test data set may be developed specifically for the particular evaluation
experiment to be conducted. While this requires additional effort and may be time
consuming, the quality of the results from some evaluation experiments may benefit from
developing a specific test data set. Typically, the test data set developed is comprised of
a document collection, a set of query topics, and the relevance judgments for each query
topic (Büttcher, Clarke, & Cormack, 2010; Croft, Metzler, & Strohman, 2010).
2.4.2.1.1. Document Collection
When identifying or creating a document collection, it is important to begin by
identifying the required attributes of the document collection for which the search engine
is intended to be used. Depending on the objective of the search engine, the important
attributes of the document collection may include the number of documents or size of
collection, the average length of the documents, type of documents, subject matter of the
documents, and the level of technical detail and jargon presented in the documents. Once
the required attributes have been identified, the document collection should be selected or
created to adequately meet these requirements. “In some cases, this may be the actual
collection for the application; in others it will be a sample of the actual collection or even
a similar collection” (Croft, Metzler, & Strohman, 2010, pp. 303-304).
2.4.2.1.2. Query Topics
Query topics are identified to represent individual information needs that will be
tested in the experiment. Two important aspects of identifying query topics are the
particular concepts represented by each of the query topics and the number of query
topics defined. The information needs as described by each of the query topics should
provide reasonable coverage and be representative of the types of information needs that
the intended users will try to fill with the search engine. It is also desirable that “[w]hen
developing a topic, a preliminary test of the topic on a standard IR system should reveal a
reasonable mixture of relevant and non-relevant documents in the top ranks” (Büttcher,
Clarke, & Cormack, 2010, p. 75). This will increase the chances that differences
between the performances of two or more search engines will be distinguishable. Sources
that may be used to identify representative queries include query logs of similar systems
and asking potential users for examples of queries (Büttcher, Clarke, & Cormack, 2010;
Croft, Metzler, & Strohman, 2010).
“Although it may be possible to gather tens of thousands of queries in some
applications, the need for relevance judgments is a major constraint. The number of
queries must be sufficient to establish that a new technique makes a significant
difference” (Croft, Metzler, & Strohman, 2010, p. 304). A balance should be struck so
that neither too many nor too few query topics are defined. Too many query topics will
require unnecessary time and effort to perform the necessary relevance judgments. Too
few queries will not allow the necessary conclusions to be drawn. The number of query
topics represents the sample size to be used in the evaluation experiment, and, therefore,
a statistical power analysis should be conducted to determine the required sample size
(Cohen, 1988). A power analysis will provide the information necessary to determine
how many query topics will be necessary to generate the results required to conclude that
an improvement in performance at the desired effect size and significance level is
achieved.
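As a sketch of such a power analysis, the following Python fragment uses the statsmodels library (an assumption; any statistical package with power calculations would serve) to estimate the number of query topics required for a paired t-test on per-topic scores. The effect size, significance level, and power values shown are illustrative assumptions, not the values used in this research.

from statsmodels.stats.power import TTestPower

# Paired comparison of per-topic scores between two search engines,
# analyzed with a paired t-test; all three inputs below are assumptions.
analysis = TTestPower()
topics_needed = analysis.solve_power(effect_size=0.5,   # assumed medium effect size (Cohen's d)
                                     alpha=0.05,        # significance level
                                     power=0.80,        # desired statistical power
                                     alternative="two-sided")
print(round(topics_needed))  # approximate number of query topics required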
2.4.2.1.3. Relevance Judgments
For each query topic, the set of documents within the collection that matches the
criteria of each query topic must be identified. This is the information that will be used to
determine the number of relevant and non-relevant documents that the search engines
return as well as the number of relevant documents that the search engines miss.
The best practice for assessing the relevance of documents in a document
collection for a particular query topic is to perform adjudication by a human assessor.
However, such a process can be very time intensive (Büttcher, Clarke, & Cormack, 2010;
Croft, Metzler, & Strohman, 2010). For example, consider a medium-sized test document
collection composed of 3000 documents, and assume that there are 50 query topics to be
tested. In the Exhaustive Adjudication method, each document is reviewed for relevancy
against each query topic. If each of the 3000 documents is reviewed to determine its
relevance for each of the 50 test query topics, and each judgment takes 30 seconds, the
adjudication effort would take approximately 1250 hours (i.e., a single assessor would
need to spend over 30 weeks of full-time effort). The time commitment required by this
method of Exhaustive Adjudication makes it unfeasible for most projects. Therefore,
alternative methods that minimize the effort required to perform the relevance
assessments for each test query topic are typically used.
There are a number of alternatives that have been used successfully to reduce the
effort of adjudication, yet produce good results. The standard method used by Text
REtrieval Conference (TREC) is known as the Pooling Method. In this method, a
document collection and test query set is supplied to a group of participants. The
participants then generate a set of retrieved documents for each query topic in the test
query set. For a given test query topic, the top-ranked k documents supplied by each
participant are pooled together to form the set of documents to be assessed for relevance.
This pool of documents is then reviewed, and each document is tagged as relevant or not
relevant by a human assessor or team of human assessors. All the documents in the
collection that are not part of the assessment pool are classified automatically as not
relevant. The pooling method can significantly reduce the number of documents that
must be looked at by the human assessors (Büttcher, Clarke, & Cormack, 2010; Harman,
2005).
Another alternative is the Interactive Search and Judging (ISJ) Method. In this
method, a “skilled” searcher uses a search engine to find and tag as many relevant
documents as possible. Typically an adjudication tracking tool is used to help the
searcher record relevance judgments and suppress previously adjudicated documents
from consideration. All documents in the collection that are not explicitly adjudicated for
a given test query topic are automatically tagged as not relevant. The results of using the
ISJ evaluation method were found both by Cormack, Palmer, and Clarke (1998) and by
Voorhees (2000) to be very similar to official TREC results using the pooling method.
2.4.2.2. Pre-defined Test Data Sets
A number of pre-defined test data sets created by others in the information
retrieval community are available for the purpose of evaluating search engines. The most
well-known and widely used test collections are those from the Text REtrieval
Conference (TREC), but other publicly available collections also exist.
2.4.2.2.1. Text REtrieval Conference (TREC)
Using a Text REtrieval Conference (TREC) test data set can be an effective way
to evaluate how a new search engine method compares to other search engines. In the
early days of evaluating information retrieval systems, it was difficult to conduct an
evaluation using a realistically-sized document collection. Instead, many of the
document collections available for use were much smaller than the collections for which
the retrieval system was designed. “Evaluation using the small collections often does not
reflect performance of systems in large full-text searching, and certainly does not
demonstrate any proven abilities to operate in real-world information retrieval
environments” (Harman, 2005, p. 21).
To address this problem, TREC was developed. It is “a series of experimental
evaluation efforts conducted annually since 1991 by the U.S. National Institute of
Standards and Technology (NIST). TREC provides a forum for researchers to test their
IR systems on a broad range of problems. … In a typical year, TREC experiments are
structured into six or seven tracks, each devoted to a different area of information
retrieval. In recent years, TREC has included tracks devoted to enterprise search,
genomic information retrieval, legal discovery, e-mail spam filtering, and blog search.
Each track is divided into several tasks that test different aspects of that area. For
example, at TREC 2007, the enterprise search track included an e-mail discussion search
task and a task to identify experts on given topics” (Büttcher, Clarke, & Cormack, 2010,
p. 23).
Because of the variety of tracks available, a TREC collection appropriate for the
objectives of a new search engine method is often available. If the TREC test collection
chosen is an active track, then the results of the experiment may be submitted to TREC
and included in the TREC relevance adjudication efforts. If the TREC test collection
used is not active, the relevance judgments will have been previously compiled and can
be used to calculate the performance of a new search engine. These results may also be
compared to the results of the other systems that participated while the track was active.
2.4.2.2.2. Other Publicly Available Test Data Sets
TREC document collections are not in the public domain; as a result, individuals
or their affiliated research organizations must be approved by TREC to be eligible to
purchase a copy of a TREC data set. Therefore, not everyone can obtain a copy of a
TREC data collection to use to evaluate their search engine. However, TREC data sets
are not the only pre-defined test data sets available for evaluating a search engine. For
example, both the CACM test collection and the OHSUMED test collection are available
at no cost 9. CACM is a collection of titles and abstracts of articles published in the
Communications of the ACM journal between 1958 and 1979. This is a collection of
3,204 documents, 64 query topics, and the relevant document result sets for each query
topic (Harman, 2005). The OHSUMED test collection consists of a subset of clinically oriented MEDLINE articles published in 270 medical journals between 1987 and 1991.
This is a collection of 348,566 documents, 106 query topics, and the relevant document
result sets for each query topic (Hersh et al., 1994).
Today, many of the alternative test data sets are considered quite small and,
therefore, less relevant for testing the advances in technology for use on the ever
expanding Web. But, for some smaller projects or in the early stages of larger projects,
these sets may be useful. In fact, Qiu and Frei (1993) used the CACM collection along
with several of these freely available test sets for their groundbreaking work.
9 A copy of the CACM test collection may be obtained from the Information Retrieval Group in the School of Computing Science at the University of Glasgow at http://ir.dcs.gla.ac.uk/resources/test_collections/. A copy of the OHSUMED test collection may be obtained from the Department of Medical Informatics and Clinical Epidemiology at the Oregon Health & Science University at http://ir.ohsu.edu/ohsumed/.
CHAPTER 3. RESEARCH HYPOTHESES
The objective of this research was to determine if a search engine enhanced with
automatic concept-based query expansion using term relational pathways built from
a collection-specific association thesaurus positively impacted search performance. To
determine if this research objective was met, the following four research hypotheses were
posed.
Research Hypothesis #1
The Enhanced search engine will perform differently than the Baseline search engine.
Research Hypothesis #2
The Enhanced search engine will on average have greater recall than the Baseline search engine.
Research Hypothesis #3
The Enhanced search engine will on average perform better than the Baseline search
engine as determined by a higher average F-measure value.
Research Hypothesis #4
The Enhanced search engine will on average perform better on query topics that describe
intangible concepts than query topics that describe tangible concepts as determined by a
higher average F-measure value.
CHAPTER 4. METHODOLOGY
The research methodology was designed to effectively and efficiently test the four
research hypotheses posed in Chapter 3.
4.1. Identify and Design Baseline Search Engine
In order to determine whether the enhancement impacts search performance, the
performance resulting from the enhancement must be isolated from the performance of
other search engine elements. To do this, the performance of an enhanced search engine
may be compared to the performance of a baseline search engine where the only
difference between the two search engines is the set of modules and components required for
implementing the enhancement. This approach ensures that any difference in
performance is the result of the enhancement rather than the performance being
confounded by other differences in search engine architecture or configuration settings.
Ideally, when choosing an appropriate baseline search engine, it is beneficial for it
to be well-known, well understood, and representative of the state of the art of
performance in the field. However, this is not always possible. In this research, the
original plan was to use an offline version of the Google search engine called Google
Desktop as a model for the baseline condition because it could be considered a well-known, representative tool used for searching local document collections. And while the
core of Google’s search technology and component architecture is proprietary and,
therefore, cannot be modified with any design enhancements, the plan was to use Google
Desktop as a guide for tuning the performance of a Baseline search engine that could be
modified. The Baseline search engine was to be developed using the popular, open-source Lucene.NET search engine development library. Once developed, the
Lucene.NET Baseline search engine would be configured and tuned to produce
equivalent search results to Google Desktop. After the Lucene.NET Baseline search
engine was appropriately configured, enhancements to implement the automatic concept-based query expansion would be built on top of the Lucene.NET Baseline search engine
core to create the Enhanced search engine.
However, in the course of the Baseline search engine development, it was
recognized that the Lucene.NET Baseline search engine could not be accurately
configured to perform in an equivalent manner to Google Desktop on the target document
collection. Therefore, an alternate plan was developed. Instead of using Google Desktop
to tune the Lucene.NET Baseline search engine, the Lucene.NET Baseline search engine
was configured using current best practices in search engine design. These best practices
included the following features:
• Word Stemming – The Snowball word stemmer was used during the parsing step. It is an implementation of a modified Porter Stemmer and is the standard stemmer used in the Lucene.NET search engine development library.
• Stopword Removal – Common function words in English typically used to indicate grammatical relationships were removed during the parsing step. See Appendix A section A.2.2.2 for the list of stopwords removed.
• HTML Code Filtering – An HTML code filter was used so that words included inside HTML code tags were ignored during indexing. The code filter used was the built-in Lucene.NET analyzer called HTMLStripCharFilter.
• Proximity Search – The search engine had the ability to retrieve documents that contained the query terms occurring within a prescribed threshold of one another. If a document contained all the terms, but they were considered too far away from one another in the document (as determined by the predefined threshold), then the document was not considered a match and was not returned. See Appendix A section A.2.5 for more information about the proximity search and the threshold used.
4.1.1. Design Process Used
To ensure that the search engines designed as part of this research would be of the
appropriate level of quality and successfully meet the research objectives, many
principles of systems engineering were used in the design process. Some of the core
principles in the design process used to develop the search engines included the
following:
• Define the intentions of the system, including the needs and relevant characteristics of the intended users.
• Define the system objectives and requirements.
• Make design decisions based on defined intentions, objectives, and requirements.
• Iteratively refine requirements throughout the design process as better and more complete knowledge of the intended system is acquired.
• Ensure that all intentions, objectives, and requirements are met in the design through requirements tracking and system testing.
4.1.2. Baseline Search Engine Structure
The Baseline search engine performed two core functions: one, index the
document collection and store the indexed information in a quickly accessible format;
and, two, process queries entered by the user. Therefore, the Baseline search engine was
composed of an Index Module, a Search Module, and an Index data store. The Index
Module was part of the pre-search processes that prepare the system for use. The Search
Module was part of the post-search processes that occur to allow users to enter their
desired search terms, the system to process the query, and the system to return the search
results to the user. The high-level structure of the Baseline search engine is illustrated in
Figure 4.1.
The distinction made between pre-search and post-search is important when
considering the impact on the user’s experience. The processing time required to perform
the pre-search processes does not impact the user’s experience (i.e., the user is not aware
of the time required to perform the pre-search processing), while the processing time
required to perform the post-search processes does impact the user’s experience (i.e., the
user must wait for the post-search processing to be completed before the search results
may be displayed).
Figure 4.1. Baseline Search Engine Structure. High-level structure of the Baseline search engine: a post-search Search Module (User Interface: Enter Search Terms, Present Search Results to User; Search: Parse User Input, Build Query, Run Query, Organize and Format Results), the Index data store, and a pre-search Index Module (Acquire Content, Analyze Document, Build Document Record, Index Document).
4.2. Design Enhanced Search Engine
The Enhanced search engine was designed to implement the approach to concept-based query expansion described in Chapter 1. The search engine, therefore, was
enhanced to perform three additional core functions: one, build a collection-specific
association thesaurus and store it in a quickly accessible format; two, generate the
conceptual network from the association thesaurus entries and store it in a quickly
accessible format; and, three, identify candidate query expansion terms from the
conceptual network and the user’s original query terms.
The additional modules and components necessary to accomplish the enhanced
functionality were built on top of the Lucene.NET Baseline search engine described in
the previous section. Therefore, in addition to the Baseline’s Index Module, Search
Module, and Index data store, the Enhanced search engine also contained modules to
Build Association Thesaurus, Generate Conceptual Network and Expand Query as well
as the necessary components for the Association Thesaurus data store and the Conceptual
Network data store. The Index Module, Build Association Thesaurus module, and
Generate Conceptual Network module were part of the pre-search processes to prepare
the system. The Search Module and Expand Query module were part of the post-search
processes that occur to process the user’s query. The high-level structure of the Enhanced
search engine is illustrated in Figure 4.2.
Figure 4.2. Enhanced Search Engine Structure. High-level structure of the Enhanced search engine: the post-search Search Module and Expand Query module (Identify Relational Pathways, Collect Expansion Terms, Build Expanded Query); the Index, Association Thesaurus, and Conceptual Network data stores; and the pre-search Index Module, Build Association Thesaurus module (Create Document Segments, Identify Eligible Terms, Identify Term Pairs, Create Matrix, Calculate Co-Occurrence Values, Store Association Thesaurus Entries), and Generate Conceptual Network module (Create Links Between Terms, Store Conceptual Network).
4.2.1. Build Association Thesaurus
The Build Association Thesaurus module automatically built the Association
Thesaurus using the document collection. The module accomplished this by manipulating
a Term-Document matrix comprised of terms and their occurrences in the documents of
the collection to determine the level of association between term pairs. To determine the
similarity, overlapping document segments were analyzed to determine the frequency of
eligible terms, and the resulting data was used to calculate co-occurrence values.
4.2.1.1. Overlapping Document Segments
The term vectors that made up the Term-Document matrix were defined by the
number of occurrences (i.e., frequency) of the term within document segments rather than
within full documents. Document segments (i.e., moving shingled window) were created
from each full document. The document segments were 200 words long, and each
segment overlapped the previous and next segment by 100 words (i.e., the shingle
increment). The number of document segments created from a full document varied from
one segment to several hundred segments, depending on the length of the full document.
The Term-Document matrix was, therefore, constructed so that the terms were
represented by the columns, and the document segments were represented by the rows.
Using document segments rather than the full documents controlled for the variability in
length of documents in the collection and ensured that only the terms in close proximity
(i.e., within 200 terms) to one another were assumed to be similar to one another.
The document segment size and the shingle increment were chosen based on an
informal average paragraph size. It was observed that a single, although possibly
complex, concept is often contained in a paragraph. Because of this, the words used in
the beginning of the paragraph are likely topically related to the words used at the end of
the paragraph. Therefore, the average number of words contained in a paragraph may be
a reasonable guide to the size of a chunk of text in which all the words are semantically
related. Assuming that paragraphs typically range from 100 to 200 words, a document
segment size of 200 words and a shingle increment of 100 words were chosen. These
values were chosen early in the design process and no tuning of these values was
performed.
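A minimal Python sketch of this segmentation is shown below; the exact boundary handling used in the Enhanced search engine may differ, so the fragment should be read as an approximation of the moving shingled window described above rather than the actual implementation.

def document_segments(text, segment_size=200, increment=100):
    # Split a document into overlapping segments of segment_size words,
    # moving the window forward by increment words each time.
    words = text.split()
    if len(words) <= segment_size:
        return [words]
    return [words[start:start + segment_size]
            for start in range(0, len(words) - increment, increment)]

# A hypothetical 450-word document yields four overlapping segments.
document = " ".join("word%d" % i for i in range(450))
print([len(segment) for segment in document_segments(document)])  # [200, 200, 200, 150]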
4.2.1.2. Eligible Term Identification
Not all terms present in the document collection were included in the Association
Thesaurus. Only stemmed content bearing words (i.e., stop words were excluded)
present in the document collection with an appropriate level of frequency were identified
as eligible for inclusion in the Association Thesaurus. Therefore, the terms needed to
occur frequently enough in the document collection for co-occurrence calculations to
yield useful information but not so frequently that their presence was no longer a useful
discriminator of relevance. Eligible terms were those that had a minimum frequency of
50 in the overall document collection and did not appear in more than 9999 document
segments. These eligible terms parameters were not tuned but chosen at the beginning of
the design process based on reasonable initial guesses as to appropriate starting values.
4.2.1.3. Co-Occurrence Calculations
The co-occurrence calculations to determine level of association (or, similarity)
between term pairs were conducted using the Jaccard Coefficient. The Jaccard
Coefficient is based on an Intersection Over Union (IOU) calculation to normalize and
measure the amount of overlap between two term vectors.
The Jaccard Coefficient value of a term pair was used only to make the binary
decision of inclusion or exclusion of a term pair in the Association Thesaurus. Those
term pairs with a Jaccard Coefficient value greater than 0.5 were included in the
Association Thesaurus as associated term entries for each other.
This minimum threshold value of 0.5 was chosen early in the design process
based on the idea that a value near the mid-point of possible Jaccard Coefficient values
(i.e., values between 0 and 1) would provide a reasonable starting point and no tuning was
performed to improve this value.
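The following Python sketch illustrates the calculation and the inclusion decision, assuming each term is represented by the set of document-segment identifiers in which it occurs; the example sets are hypothetical.

def jaccard(segments_a, segments_b):
    # Intersection-over-union of the sets of segment identifiers in which
    # each of the two terms occurs.
    a, b = set(segments_a), set(segments_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def include_in_thesaurus(segments_a, segments_b, threshold=0.5):
    # Binary decision: the term pair is entered in the Association Thesaurus
    # only when its Jaccard Coefficient value exceeds the threshold.
    return jaccard(segments_a, segments_b) > threshold

# Hypothetical segment-occurrence sets for two terms.
print(jaccard({1, 2, 3, 4}, {2, 3, 4, 5}))               # 0.6
print(include_in_thesaurus({1, 2, 3, 4}, {2, 3, 4, 5}))  # True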
4.2.2. Generate Conceptual Network
The Generate Conceptual Network module used the entries in the Association
Thesaurus to generate the conceptual network in the same manner described in section
1.2.1. Each term in the Association Thesaurus represented a node. Child nodes for a term
were generated from all of the associated terms defined in its thesaurus entry to create a
term cluster. To form the full conceptual network, each term cluster generated from the
thesaurus entry was linked to the other term clusters using shared terms. The entire
conceptual network was developed by continuing this term cluster linking process using
all shared terms defined through the relationships defined by the associated term entries
in the Association Thesaurus.
Only terms likely to be useful in discriminating the relevance of a document were
included in the conceptual network. A maximum threshold was used to restrict the
number of associated terms a target term may have to be eligible for inclusion in the
conceptual network. Terms that had more than 275 entries were considered to be too
frequently occurring to be able to offer a useful discrimination and were ignored during
the process of creating the conceptual network. Therefore, any term included in
conceptual network had 275 or fewer associated terms included in its Association
Thesaurus entry. This threshold value of 275 entries was chosen early in the design
process based on reviewing several example term pairs and their resulting pathways. No
tuning was done after this early design decision was made.
In this way, the conceptual network was composed of all terms with 275 or fewer
entries in the Association Thesaurus and links between terms were only based on whether
or not shared terms existed in the individual term clusters (i.e., there were no other
parameters considered when forming the links between nodes).
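A simplified Python sketch of this construction is given below, assuming the Association Thesaurus is available as a mapping from each term to its set of associated terms; it is an approximation of the process described above, not the exact implementation.

def build_conceptual_network(thesaurus, max_entries=275):
    # Each eligible term becomes a node linked, with equal weight, to each of
    # its associated terms; terms whose entries list more than max_entries
    # associations are ignored as too frequent to discriminate relevance.
    eligible = {t for t, assoc in thesaurus.items() if len(assoc) <= max_entries}
    network = {t: set() for t in eligible}
    for term in eligible:
        for other in thesaurus[term]:
            if other in eligible:
                network[term].add(other)
                network[other].add(term)  # links are undirected
    return network

# Hypothetical Association Thesaurus entries (term -> associated terms).
thesaurus = {"cockpit": {"flightdeck", "crew"},
             "flightdeck": {"cockpit"},
             "crew": {"cockpit"}}
print(build_conceptual_network(thesaurus))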
4.2.2.1. Relational Pathways
To minimize search processing time experienced by the users, the relational
pathways were identified during the pre-search process stage in the Generate Conceptual
Network module. All possible term pairs were identified from the terms contained in the
Association Thesaurus. Next, the relational pathways for each term pair were identified
and stored for fast retrieval at search time.
The relational pathways identified were 3, 4, or 5 terms long. To identify the
relational pathways between a pair of terms, the module began with the first term of the
term pair and traversed the conceptual network looking for the second term of the term
pair using a breadth-first search to a maximum depth of 4 terms. When the second term
was found, the intervening terms were captured to form a relational pathway.
It was possible for zero, one, or more relational pathways to be identified for a
given term pair. There was no maximum threshold used to limit the number of relational
pathways that could be identified for a given term pair.
All the relational pathways for a given term pair were the same length: once a
relational pathway was found at a given depth, the search completed that level of the
breadth-first traversal and then stopped before moving to the next level of depth.
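A minimal sketch of this pathway search, assuming the conceptual network is held as an adjacency dictionary like the one sketched above, is shown below; it enumerates simple paths for illustration and is not the exact implementation used in this research.

from collections import deque

def find_relational_pathways(network, term_a, term_b, max_depth=4):
    # Breadth-first search from term_a toward term_b, capturing the intervening
    # terms. All pathways returned come from the shallowest depth at which
    # term_b is reached; deeper levels are not explored once a pathway exists.
    queue = deque([[term_a]])
    pathways = []
    while queue:
        path = queue.popleft()
        if pathways and len(path) > len(pathways[0]):
            break  # a shallower level already produced pathways, so stop
        if path[-1] == term_b and len(path) > 1:
            if len(path) >= 3:          # pathways are 3, 4, or 5 terms long
                pathways.append(path)
            continue                    # do not extend a path past the goal term
        if len(path) - 1 >= max_depth:
            continue                    # respect the maximum search depth
        for neighbor in network.get(path[-1], ()):
            if neighbor not in path:
                queue.append(path + [neighbor])
    return pathways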
4.2.3. Expand Query
The Expand Query module identified the candidate expansion terms and
performed the query expansion. The module accomplished this by identifying all
relational pathways between each term pair in the original query, identifying appropriate
combinations of candidate expansion terms, and reformulating the original query into an
expanded query.
For each pair of terms in the original query, the Expand Query module attempted
to identify one or more relational pathways for the term pair. To do this, all possible term
pairs in the original query were identified. For example, if the original query is composed
of three terms, such as warnings quickly understandable, then there are three
term pairs that may result in relational pathways.
The three term pairs in this example are the following:
1. warnings quickly
2. warnings understandable
3. quickly understandable
Next, all relational pathways identified and stored during the pre-search process
stage in the Generate Conceptual Network module for each pair of terms from the
original query were retrieved. The relational pathways for each term pair retrieved were
then processed to identify the appropriate combinations of candidate expansion terms
with which to expand the original query.
See Appendix B for additional details about how Boolean expressions are
generated from the relational pathways.
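The following Python sketch illustrates how the term pairs of an original query could be enumerated and their pre-computed pathways gathered into candidate expansion terms; the pathway store and function names are assumptions for illustration, and the actual Boolean reformulation follows the rules described in Appendix B.

from itertools import combinations

def candidate_expansion_terms(query_terms, pathway_store):
    # pathway_store: dict mapping a (term, term) pair to the list of relational
    # pathways identified for that pair during the pre-search process stage.
    candidates = {}
    for pair in combinations(query_terms, 2):
        pathways = pathway_store.get(pair, []) or pathway_store.get(pair[::-1], [])
        # The intervening terms of each pathway (everything between the two
        # query terms) become candidate expansion terms for this pair.
        candidates[pair] = [path[1:-1] for path in pathways]
    return candidates

# Example: the three-term query "warnings quickly understandable" yields the pairs
# (warnings, quickly), (warnings, understandable), and (quickly, understandable).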
4.3. Select a Document Collection
The search engine enhancement is intended to be useful for bounded, medium-sized
document collections containing documents that are focused on a single scientific
or technical conceptual domain. The document collection of the Design CoPilotTM web
application10 was chosen as the test document collection to be used in the search
performance experiments.
10 The Design CoPilotTM is available at http://www.designcopilot.com.
Design CoPilotTM is a subscription-based web application that contains a
collection of approximately 3000 documents. These documents range in length from a
few paragraphs to several hundred pages. Approximately half of the documents in the
collection are regulatory or guidance documents. These are technical documents written
by government agencies and/or industry groups that focus on providing the regulatory
requirements and advisory guidance material for the design and certification of aircraft.
The second half of the Design CoPilotTM document collection is a set of documents
that describe human factors research, guidelines, and other relevant human factors
information for the design and certification of aircraft.
Therefore, the Design CoPilotTM document collection provides an appropriately
sized, well-bounded collection focused on a single technical conceptual domain.
4.4. Develop Query Topics
Query topics were developed with which to test the search performance of the
Baseline and Enhanced search engines on the test document collection. In order to
provide an accurate measure of performance, the query topics needed to be representative
of the queries posed by the users of the Design CoPilotTM to fill their real-life information
needs. Therefore, they were developed by a team of Design CoPilotTM content experts by
reviewing query logs of the Design CoPilotTM web application and by drawing on the
content experts’ experience with the document collection.
The team of six Design CoPilotTM content experts included the author of this
dissertation. The team worked together to develop a set of 75 query topics that would
represent the variety of concepts likely to be asked by users of the Design CoPilotTM
application. Each query topic included in the set was required to be composed of two or
more terms (i.e., no single term query topics were included).
4.4.1. Tangible and Intangible Concepts
Because Research Hypothesis #4 predicted that the type of concept that the query
topic represented would impact the performance of the Enhanced search engine, two
types of query topics were developed: those that represented tangible concepts and those
that represented intangible concepts.
Tangible concepts are simple, unambiguous, well-defined concepts and include
query topics such as “attitude trend indicator” and “placard lighting.” Tangible concepts
are often specific examples or instances of an item (e.g., “mouse”, “track pad” and “joy
stick” are specific instances of cursor control devices). Intangible concepts are complex,
harder-to-define, fuzzy concepts and include query topics such as “excessive cognitive
effort” or “how to provide unambiguous feedback.” Intangible concepts may represent a
general class of items or concepts (e.g., “cursor control device”) and may be discussed in
many different ways (e.g., “mental processing” may also be discussed as “cognitive
processing,” “cognitive requirements,” or “mental workload”). Often intangible concepts
are described using one or more ambiguous or difficult to define qualifying terms (e.g.,
“suitable”, “appropriate” and “excessively”).
While there may be concepts that could arguably fit in either the tangible or
intangible category, the primary distinction is that of how easily, unambiguously,
uniformly, and precisely the concept may be described. Those query topics that represent
concepts that are typically described in a single way and are unambiguous have been
categorized as tangible; those query topics that represent concepts that are ambiguous,
complex, and may be described in several different ways have been categorized as
intangible.
A total of 75 query topics were developed. Of these, 40 query topics represented
tangible concepts, and 35 query topics represented intangible concepts. See Appendix C
for a complete list of query topics used in this research.
4.4.2. Relevant Document Sets for Query Topics
The set of relevant documents for each query topic were identified through the
process of adjudication after the query topics had been run on both the Baseline and the
Enhanced search engines. To make the effort required to perform the adjudication task
manageable, a modified pooling method was used in which only the differences in the
sets of documents returned were manually reviewed by a human assessor to determine
document relevance. Documents that were returned by both the Baseline and the
Enhanced search engines were assumed to be relevant. Documents returned by neither
the Baseline nor the Enhanced search engines were assumed to be non-relevant.
Documents only returned by the Enhanced search engine (and not by the Baseline) were
manually reviewed for relevance.
Because the Enhanced search engine was built on top of the Baseline search
engine, all documents returned by the Baseline would always also be returned by the
Enhanced. Or, stated another way, no documents were returned by the Baseline that were
not also returned by the Enhanced.
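Under the assumption that each engine's results are available as sets of document identifiers, the modified pooling logic can be sketched as follows (the names are illustrative only):

def classify_for_adjudication(baseline_docs, enhanced_docs):
    # Modified pooling: documents returned by both engines are assumed relevant,
    # documents returned by neither are assumed non-relevant, and only the
    # documents returned by the Enhanced engine alone go to the human assessor.
    baseline, enhanced = set(baseline_docs), set(enhanced_docs)
    assumed_relevant = baseline & enhanced      # equal to the Baseline set, by design
    needs_manual_review = enhanced - baseline   # the "table of differences"
    return assumed_relevant, needs_manual_review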
4.5. Run Experiment
The experiment was run by querying the Baseline and then the Enhanced search
engine for each of the 75 query topics. The list of documents returned by each search
engine was stored in a relational database. After the experimental run was complete, a
table of differences was created that represented those documents returned by only the
Enhanced search engine and, therefore, required manual adjudication to determine their
relevancy. The following sections describe how the samples of query topics were selected
and their associated document sets adjudicated to determine relevance.
4.5.1. Select Samples
To understand the performance difference between the Baseline and Enhanced
search engines, a more detailed analysis was performed on those query topics in which
the Baseline and the Enhanced search engine’s performance differed. A sample of 14
query topics from the tangible set and a sample of 16 query topics from the intangible set
were selected.
Each sample set was created by selecting all query topics for which the difference
in the number of documents returned by the Baseline and the Enhanced search engines
was less than or equal to 50 documents. In the Tangible Sample Set, there were 14 query
topics that met this criterion. In the Intangible Sample Set, there were 16 query topics
that met this criterion. The threshold of less than or equal to 50 documents was chosen
arbitrarily to restrict the samples to an appropriately large yet manageable size.
4.5.2. Adjudicate
The 30 query topics selected in the previous step (i.e., 14 query topics in the
Tangible Sample Set plus 16 query topics from the Intangible Sample Set) were
adjudicated using the modified pooling method described earlier to identify their relevant
document sets. Therefore, documents returned by both search engines were assumed to
be relevant; documents returned by neither search engine were assumed to be irrelevant;
and the documents returned by only the Enhanced search engine were reviewed manually
by a human assessor to determine relevance.
The human assessor was the author of this dissertation. Because the only source
of the documents to be manually adjudicated was the Enhanced search engine (i.e., all
documents returned by the Baseline were, by design, also returned by the Enhanced
search engine), the adjudication process was not blind. To address the potential issues of
consistency, repeatability, and reasonableness of subjective decisions related to the
adjudication process that may be introduced when the process is not blind, graded
relevancy descriptions were used to aid the adjudication task.
During the adjudication, graded relevancy descriptions were developed for each
query topic, and the documents were reviewed for relevancy against these descriptions
based on their content.
higher the score, the more relevant the document was to the query topic. Documents with
a score of “1” contained information that addressed all aspects of the concept represented
by the query topic. Documents with a score of “0.5” or “0.25” contained information that
only addressed some but not all aspects of the concept represented by the query topic.
And finally, documents with a score of “0” were irrelevant (i.e., did not contain any
information that addressed the concept represented by the query topic). See Appendix E
for the query topic graded relevancy descriptions.
In addition to aiding adjudication consistency, the graded relevancy definitions
allowed for traceability of the process. For each document adjudicated, both a binary
relevancy score (i.e., assigned “1” when the document is fully or partially relevant and is
assigned a “0” when the document is irrelevant) and a graded relevancy score (i.e., “1”,
“0.5”, “0.25”, “0” as described above) were captured in a relational database table. If the
document received a non-zero relevancy score, an excerpt from the document supporting
the relevancy score assigned was also captured in the relational database table.
The graded relevancy scores were not used in any of the calculations in this
research project but rather served as an adjudication aid and allowed for traceability.
Follow-on research may be conducted in which these scores are analyzed and used in the
performance calculations.
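A minimal sketch of the kind of record captured for each adjudicated document is shown below as a Python dataclass; the field names are assumptions made for illustration and do not reflect the actual database schema used in this research.

from dataclasses import dataclass
from typing import Optional

@dataclass
class AdjudicationRecord:
    query_topic: str
    document_id: str
    binary_relevance: int            # 1 = fully or partially relevant, 0 = irrelevant
    graded_relevance: float          # one of 1.0, 0.5, 0.25, 0.0
    supporting_excerpt: Optional[str] = None  # captured only for non-zero scores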
4.6. Measure Performance and Determine Statistical Significance
After the adjudication was complete, the data generated in the previous steps were
used to measure performance and determine the statistical significance related to the full
set of 30 query topics as well as for any differences among the 14 query topics in the
Tangible Sample Set and 16 query topics from the Intangible Sample Set.
4.6.1. Addressing Potential Impact of Unfound Relevant Documents
In this experiment, the goal of measuring search engine performance was to
determine whether or not the enhancement proposed positively impacted the overall
performance. A positive impact was considered one in which the recall is increased while
the precision remained high. Therefore, an F-measure, which enforces a balance between
precision and recall, provided a useful and logical method to compare the overall
performance between search engines.
As described earlier, the Enhanced search engine was created by building the
enhanced functionality on top of the Baseline search engine. The advantage of this
technique is that any differences in performance between the Baseline and the Enhanced
search engine are the product of the enhancement and not some uncontrolled or unknown
variability introduced by differences in the search engine structure or components.
However, there are some potential issues with performing a traditional F-measure
calculation in this experimental situation because of the potential impact of unfound (and,
therefore, unknown) relevant documents.
Unfound relevant documents can impact whether or not the recall calculation is a
good estimator of actual recall (Frické, 1998). Typically, the likelihood that unfound
relevant documents exist in an experimental test set is reduced to an acceptable level by
pooling the results from a collection of unique search engines. The idea is that each
unique search engine algorithm will provide its own set of relevant documents and allow
identification of most, if not all, relevant documents that exist in the collection (i.e., it is
assumed that all search engines will not miss retrieving the same relevant documents).
However, because this experiment is conducted on a unique local website
document collection to illustrate an enhancement particular to this type of collection, the
relevant documents for the query topics are not pre-defined, and the experiment cannot
take advantage of a collective of researchers to perform the adjudication task. Therefore,
the adjudication task must be performed in such a way to minimize effort required, so
that it is not prohibitively time-consuming, while still providing information that is of an
appropriate level of quality.
One of the assumptions that impacted the number of documents requiring
manual adjudication was the use of a modified pooling method. As described earlier in
section 4.4.2., in the modified pooling method, all documents returned by both the
Baseline and the Enhanced search engine were assumed to be relevant; all other
documents returned were manually adjudicated to determine their relevance. Because the
set of documents returned by the Baseline was always a subset of those returned by the
Enhanced search engine (as illustrated in the Venn diagram in Figure 4.3), the result was
that all documents returned by the Baseline were assumed to be relevant.
Figure 4.3
Venn diagram illustrating that the documents returned by the Baseline are a
subset of those returned by the Enhanced search engine.
Because of this assumption, the only additional relevant documents identified for
a query topic were provided by the Enhanced search engine. Therefore, it is possible that
a large number of relevant documents could remain unfound in this experiment.
To address the issue that there may be a large number of unfound relevant
documents and that they may impact the ability of the recall calculations to be a good
estimator of the true recall of the Baseline and Enhanced search engines, this experiment
modified the calculations used and performed a sensitivity analysis to assess the potential
impact of unfound relevant documents.
Instead of using the traditional recall calculation, this experiment used a recall-like calculation we will call recall*. Recall* is calculated the same way that traditional
recall is calculated, but the notation highlights the idea that it may not be an accurate
estimation of the true recall of the search engine.
$$R^{*} = \text{recall}^{*} = \frac{|A \cap B|}{|A|}$$
Where:
A = set of all found relevant documents in document collection
B = set of retrieved documents
Because the F-measure relies on recall, the issue described also impacts the F-measure
calculation. Therefore, instead of the traditional F-measure, this experiment
used an F-measure-like calculation we called F-measure*. Like recall*, F-measure* is
calculated in the same way as the traditional version, but the notation used highlights the
idea that it may not be an accurate estimation of the true F-measure value of the search
engines.
$$\text{F-measure}^{*} = \frac{2}{\frac{1}{R^{*}} + \frac{1}{P}} = \frac{2 \cdot R^{*} \cdot P}{R^{*} + P}$$
Where:
R* = recall* value
P = precision value
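As a sketch, these quantities could be computed from the adjudicated document sets as follows; the function mirrors the formulas above and assumes the set of found relevant documents and the set of retrieved documents are available for a query topic.

def evaluation_metrics(found_relevant, retrieved):
    # found_relevant: all relevant documents identified for the query topic (set A).
    # retrieved: documents returned by the search engine under evaluation (set B).
    a, b = set(found_relevant), set(retrieved)
    recall_star = len(a & b) / len(a) if a else 0.0
    precision = len(a & b) / len(b) if b else 0.0
    if recall_star + precision == 0:
        f_measure_star = 0.0
    else:
        f_measure_star = 2 * recall_star * precision / (recall_star + precision)
    return recall_star, precision, f_measure_star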
4.6.2. Addressing Potential Impact of Relevancy Assumption in Modified Pooling
Method
The use of a modified pooling method to minimize time and effort required while
still providing information that is of an appropriate level of quality relied on the
assumption that all documents returned by both the Baseline and the Enhanced search
engine were relevant and, therefore, did not require adjudication. However, it is possible
that this assumption is not accurate; there may be a large proportion of those documents
returned by both the Baseline and the Enhanced search engines that are not relevant. If
there is a large proportion of non-relevant documents returned by both the Baseline and
the Enhanced search engines, the precision calculations may not be a good estimator nor
represent the true precision of the Baseline and Enhanced search engines.
To address this possible issue, a sensitivity analysis for documents assumed
relevant was performed to assess the potential impact of non-relevant documents returned
by both the Baseline and the Enhanced search engines.
4.6.3. Perform Calculations and Sensitivity Analyses
After the adjudication was complete, the relevancy results for the sample of 30
query topics run on each search engine were collated and the F-measure* (i.e., harmonic
mean of recall* and precision) was calculated. Significance was determined using t-tests
to determine whether a performance difference existed between the Baseline and
Enhanced search engines. Next, a Single Factor ANOVA calculation using the Fmeasure* was performed to determine whether a performance difference existed between
the 14 query topics that represented tangible concepts and the 16 query topics that
represented intangible concepts.
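A hedged sketch of these significance tests is shown below, using SciPy as a stand-in for whatever tool actually performed the calculations; it assumes the per-query-topic F-measure* values are held in simple lists.

from scipy import stats

# Two-sample t-test: does performance differ between the Baseline and Enhanced engines?
# baseline_f and enhanced_f are per-query-topic F-measure* values (lists of floats).
def compare_engines(baseline_f, enhanced_f, alpha=0.05):
    t_stat, p_value = stats.ttest_ind(baseline_f, enhanced_f)
    return p_value < alpha, p_value

# Single-factor ANOVA: does performance differ between tangible and intangible topics?
def compare_concept_types(tangible_f, intangible_f, alpha=0.05):
    f_stat, p_value = stats.f_oneway(tangible_f, intangible_f)
    return p_value < alpha, p_value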
A sensitivity analysis was then conducted to assess the potential impact of
unfound relevant documents. The F-measure* was calculated for three levels of
sensitivity to unfound relevant documents. The estimated total relevant documents for
each level of sensitivity was calculated by adding the number of relevant documents for
the query topic identified by the Baseline and Enhanced search engines plus the estimated
number of unfound relevant documents at that level. The three levels of sensitivity were
the following:
• Level 1 Sensitivity To Unfound Relevant Documents (0.25X) – The unfound relevant documents were estimated to be a quarter of the number of relevant documents identified by the Baseline and Enhanced search engines.
• Level 2 Sensitivity To Unfound Relevant Documents (2X) – The unfound relevant documents were estimated to be double the number of relevant documents identified by the Baseline and Enhanced search engines.
• Level 3 Sensitivity To Unfound Relevant Documents (10X) – The unfound relevant documents were estimated to be ten times the number of relevant documents identified by the Baseline and Enhanced search engines.
For each level of sensitivity, the values estimating the number of unfound relevant
documents were rounded up to the next whole number. See Table 4.1 for an example of
the estimates for each of the three levels of unfound relevant documents used in the
sensitivity analysis.
Level        Identified Relevant    Estimated Unfound Relevant    Estimated Total Relevant
1 (0.25X)    10                     3                             13
2 (2X)       10                     20                            30
3 (10X)      10                     100                           110
Table 4.1
Example of the estimates of unfound relevant documents used in each of the
three levels of the sensitivity analysis. Estimates of unfound documents
were rounded up to the next whole number.
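The arithmetic behind these estimates can be sketched in a few lines of Python; the multipliers correspond to the three levels above, and the rounding up reproduces the example in Table 4.1 (e.g., 0.25 × 10 = 2.5 rounds up to 3).

import math

def estimated_total_relevant(identified_relevant, multiplier):
    # Estimated unfound relevant documents are a multiple of those identified,
    # rounded up to the next whole number, then added to the identified count.
    unfound = math.ceil(identified_relevant * multiplier)
    return identified_relevant + unfound

# Levels 1-3 for 10 identified relevant documents: 13, 30, and 110 respectively.
levels = {0.25: estimated_total_relevant(10, 0.25),
          2:    estimated_total_relevant(10, 2),
          10:   estimated_total_relevant(10, 10)}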
As with the earlier calculation, significance was determined at each level of
sensitivity using t-tests to determine whether a performance difference existed between
the Baseline and Enhanced search engines. The last step of the sensitivity analysis was to
compare the conclusions drawn from the t-tests at each level of sensitivity to the
conclusion drawn in the original calculations (i.e., without the addition of estimated
unfound relevant documents).
Finally, a sensitivity analysis for documents assumed relevant was then conducted
to assess the potential impact of non-relevant documents returned by both the Baseline
and the Enhanced search engines. The F-measure* was calculated for four levels of
sensitivity to non-relevant documents in the set of documents returned by both the
Baseline and the Enhanced search engines. The estimates were calculated in two different
ways. First, the estimated total relevant documents returned by both the Baseline and the
Enhanced search engines was calculated by assuming that a certain percentage of the
documents returned by both the Baseline and Enhanced search engines were non-relevant. The following outlines the three levels of sensitivity estimates calculated:
• Level 1 Sensitivity To Relevancy Assumptions (25% Non-Relevant) – Twenty-five percent of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant.
• Level 2 Sensitivity To Relevancy Assumptions (50% Non-Relevant) – Fifty percent of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant.
• Level 3 Sensitivity To Relevancy Assumptions (75% Non-Relevant) – Seventy-five percent of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant.
For each level of sensitivity, the values calculated to estimate the assumed number of
non-relevant documents returned by both the Baseline and the Enhanced search engines
were rounded up to the next whole number. See Table 4.2 for an example of the estimates
for each of the three levels of assumed non-relevant documents used in the sensitivity
analysis.
Level      Returned by both Baseline and Enhanced    Estimated Non-Relevant    Estimated Relevant
1 (25%)    10                                        3                         7
2 (50%)    10                                        5                         5
3 (75%)    10                                        8                         2
Table 4.2
Example of the estimates of assumed non-relevant and relevant documents
used in each of the three levels of the sensitivity to documents assumed
relevant. Estimates of non-relevant documents were rounded up to the next
whole number.
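For completeness, the corresponding arithmetic for this second sensitivity analysis can be sketched in the same way; the example values reproduce the rows of Table 4.2.

import math

def estimated_relevant_after_assumption(returned_by_both, pct_non_relevant):
    # A percentage of the documents returned by both engines is assumed to be
    # non-relevant (rounded up); the remainder stay counted as relevant.
    non_relevant = math.ceil(returned_by_both * pct_non_relevant)
    return returned_by_both - non_relevant

# For 10 documents returned by both engines: 7 (25%), 5 (50%), and 2 (75%).
examples = [estimated_relevant_after_assumption(10, p) for p in (0.25, 0.50, 0.75)]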
An additional sensitivity level was calculated using results generated by Google
Desktop. The estimates for this level of sensitivity were generated by determining the
number of documents returned by the Google Desktop search engine that overlap with
the documents returned by the Baseline and the Enhanced search engines. Documents
returned by all three search engines were assumed to be relevant in the fourth level of
sensitivity.
As described earlier, initial attempts at tuning the performance of the Baseline
search engine with Google Desktop appeared unfeasible, and they also revealed that
Google Desktop missed some of the relevant documents that the Baseline returned.
Because it is likely that Google Desktop missed some of the relevant documents that the
Baseline search engine found, this fourth level of sensitivity may be considered a
conservative estimate of relevancy.
As with the earlier calculations, significance was determined at each of the four
levels of sensitivity using t-tests to determine whether a performance difference existed
between the Baseline and Enhanced search engines. And like the previous sensitivity
analysis, the last step of this sensitivity analysis was to compare the conclusions drawn
from the t-tests at each level of sensitivity to the conclusion drawn in the original
calculations (i.e., assuming that all documents returned by both the Baseline and the
Enhanced search engines were relevant).
CHAPTER 5. RESULTS
The methodology described in the previous chapter was performed to address the
four research hypotheses posed and the results derived from the experiment are described
in this chapter.
5.1. Full Query Topic Set
A total of 75 query topics were run on the Baseline search engine. All documents
returned by the Baseline search engine for each of the query topics were stored in an
Access relational database. On average, there were 20.2 documents retrieved for each of
the query topics run on the Baseline search engine with a range between 0 and 105
documents. The process was repeated using the Enhanced search engine. The same 75
query topics were run on the Enhanced search engine. All documents returned by the
Enhanced search engine for each of the query topics were stored in an Access relational
database. On average, there were 43.5 documents returned for each of the query topics
run on the Enhanced search engine with a range between 0 and 244 documents. The
difference in the returned document counts between the Baseline search engine and the
Enhanced search engine was determined to be statistically significant to an alpha less
than 0.05 using a Two Sample t-Test. (See Appendix D for the data and calculations
related to the documents returned counts.) The frequency distribution of the documents
returned by the Baseline and by the Enhanced search engines is presented in Figure 5.1.
Figure 5.1
Histogram of the returned document count frequencies (number of query
topics by document count) for the 75 query topics run on the Baseline and
the Enhanced search engines.
Of the 75 topics that were run, there were 41 query topics (i.e., 54.7%) that
produced a difference in performance between the Baseline and the Enhanced search
engines. Twenty-four of these were from the set of query topics that represented tangible
concepts, and 17 of these were from the set of query topics that represented intangible
concepts. Of the remaining 34 query topics, in which the Enhanced search engine
performed the same as the Baseline search engine, 16 were from the set of query
topics that represented tangible concepts, and 18 were from the set of query
topics that represented intangible concepts. The counts and percentages of the
distribution of the full set of 75 query topics are listed in Table 5.1 and illustrated in Figure 5.2.
                              Performance in Baseline and Enhanced
Query Topic Set               Same               Different            Total
Full Set of Query Topics      34    45.3%        41    54.7%          75
Tangible                      16    40.0%        24    60.0%          40
Intangible                    18    51.4%        17    48.6%          35
Table 5.1
Counts and percentages of the distribution of the 75 query topics that
produced a difference in performance between the Baseline and the
Enhanced search engines and those that performed the same (i.e., no
difference in performance).
Figure 5.2
Distribution of all 75 query topics that produced a difference in
performance between the Baseline and the Enhanced search engines (54.7%)
and those that performed the same (45.3%), each subdivided into tangible
and intangible concepts.
5.2. Samples of Query Topics That Produce a Difference
The next step was to perform a detailed analysis on only those query topics in
which the Baseline and the Enhanced search engine’s performance differed. A sample of
14 query topics from the tangible set and a sample of 16 query topics from the intangible
set were selected. The listing of both sample sets is presented in Tables 5.2 and 5.3.
Tangible Query Topic Sample Set
1 auxiliary power unit fire extinguishing
2 false resolution advisory
3 fault tolerant data entry
4 hydraulic system status messages
5 icing conditions operating speeds provided in AFM
6 information presented in peripheral visual field
7 information readable with vibration
8 instruments located in normal line of sight
9 labels readable distance
10 landing gear manual extension control design
11 negative transfer issues
12 safety belt latch operation
13 side stick control considerations
14 text color contrast
Table 5.2
Sample of 14 query topics representing tangible concepts.
Intangible Query Topic Sample Set
1 acceptable message failure rate and pilots confidence in system
2 appropriate size of characters on display
3 arrangement of right seat instruments
4 control is identifiable in the dark
5 cultural conventions switch design
6 design attributes for auditory displays
7 ergonomics of pilot seating
8 excessive cognitive effort
9 how to ensure that labels are readable
10 how to improve situation awareness
11 how to provide unambiguous feedback
12 minimal mental processing
13 needs too much attention
14 preventing instrument reading errors
15 proper use of red and amber on displays
16 suitable menu navigation methods
Table 5.3
Sample of 16 query topics representing intangible concepts.
Adjudication was performed to determine the relevancy of the documents
retrieved by the Baseline and Enhanced search engines. The results were collated and F-measure* calculated for each of the 30 query topics (i.e., 14 query topics in the Tangible
Sample Set plus 16 query topics from the Intangible Sample Set). See Appendix F for
the individual recall*, precision, and F-measure* calculations for each of the 30 query
topics.
5.2.1. Baseline vs. Enhanced Search Performance
The Baseline search engine performed with an average recall* of 0.30, average
precision of 0.77, and average F-measure* of 0.41 as shown in Table 5.4. Because the
Enhanced search engine always returned all the documents returned by the Baseline for a
given query topic and these documents were assumed to be relevant, the precision for the
Baseline search engine was always either 1.00 or 0.00. The precision value 0.00 occurred
when the Baseline search engine returned no documents for a given query topic but the
Enhanced search engine returned at least 1 relevant document for that same query topic.
The recall* value varied based on the number of additional relevant documents that the
Enhanced search engine returned.
The Enhanced search engine performed with an average recall* of 1.00, average
precision of 0.78, and average F-measure* of 0.85 as shown in Table 5.4. When a
difference in performance existed between the two search engines, the set of documents
returned by the Baseline was always only a subset of those returned by the Enhanced
search engine; the recall* was always determined by the number of relevant documents
returned by the Enhanced search engine (i.e., there were no opportunities for additional
relevant documents to be identified that were not returned by the Enhanced search
engine), and, therefore, the recall* value was always 1.00. While there were 7 query
topics in the two sample sets in which the Baseline search engine returned 0 documents,
the Enhanced search engine always returned 2 or more documents for all query topics in
the sample sets. The precision value varied based on the number of additional irrelevant
documents that the Enhanced search engine returned.
                     Average Performance Measures
Search Engine        Recall*      Precision     F-measure*
Baseline             0.30         0.77          0.41
Enhanced             1.00         0.78          0.85
Table 5.4
Average performance measures for Baseline and Enhanced search
engines.
The difference in recall* between the Baseline search engine with an average
recall* of 0.30 and the Enhanced search engine with an average recall* of 1.00 was
determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. The difference in performance in the F-measure* value between the Baseline search
engine with an average F-measure* of 0.41 and the Enhanced search engine with an
average F-measure* of 0.85 was also determined to be statistically significant to an alpha
less than 0.05 using a Two Sample t-Test. See Appendix G, sections G.1. and G.2. for
the t-Test calculations used to determine this statistical significance.
5.2.2. Tangible vs. Intangible Concepts
For the 14 query topics that represented tangible concepts, the Baseline search
engine performed with an average recall* of 0.39, average precision of 0.86, and average
F-measure* of 0.51. For the 16 query topics that represented intangible concepts, the
Baseline search engine performed with an average recall* of 0.22, average precision of
0.69, and average F-measure* of 0.32. The performance measures for Tangible and
Intangible Sample Sets for the Baseline search engine are shown in Table 5.5.
                        Baseline Search Engine
                        Average Performance Measures
Query Topic Type        Recall*      Precision     F-measure*
Tangible Concepts       0.39         0.86          0.51
Intangible Concepts     0.22         0.69          0.32
Table 5.5
Average performance measures for query topics representing tangible and
intangible concepts for the Baseline search engine.
For the 14 query topics that represented tangible concepts, the Enhanced search
engine performed with an average recall* of 1.00, average precision of 0.67, and average
F-measure* of 0.78. For the 16 query topics that represented intangible concepts, the
Enhanced search engine performed with an average recall* of 1.00, average precision of
0.88, and average F-measure* of 0.92. The performance measures for Tangible and
Intangible Sample Sets for the Enhanced search engine are shown in Table 5.6.
                        Enhanced Search Engine
                        Average Performance Measures
Query Topic Type        Recall*      Precision     F-measure*
Tangible Concepts       1.00         0.67          0.78
Intangible Concepts     1.00         0.88          0.92
Table 5.6
Average performance measures for query topics representing tangible and
intangible concepts for the Enhanced search engine.
A Single Factor ANOVA calculation was used to determine whether a
performance difference existed between the 14 query topics that represented tangible
concepts and the 16 query topics that represented intangible concepts. For the Baseline
search engine, it was determined that there was no difference in performance between the
query topics that represented tangible concepts and the query topics that represented
intangible concepts. However, a difference in performance was found for the Enhanced
search engine. The difference in performance of the Enhanced search engine between
query topics that represent tangible concepts and query topics that represent intangible
concepts was determined to be statistically significant at an alpha of 0.05. See Appendix
G section G.3. for the ANOVA calculations used to determine this statistical significance.
5.3. Sensitivity Analysis For Impact Of Unfound Relevant Documents
A sensitivity analysis was performed at three different levels to assess the
potential impact of unfound relevant documents. This analysis was conducted using the
sample of 30 query topics (i.e., 14 query topics in the Tangible Concepts Query Sample
Set plus 16 query topics from the Intangible Concepts Query Sample Set).
5.3.1. Level 1 Sensitivity To Unfound Relevant Documents (0.25X)
At Sensitivity Level 1 in which the unfound relevant documents were assumed to
be a quarter of the number of relevant documents found by the Baseline and Enhanced
search engines, the average recall* of the Baseline was 0.22 and the average recall* of
the Enhanced was 0.78 as presented in Table 5.7. The difference in performance in the
recall* value between the Baseline and Enhanced search engines was determined to be
statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the
same conclusion that was drawn from the original recall* calculations.
                     Average Recall*
Search Engine        Original     Level 1 (0.25X)    Level 2 (2X)    Level 3 (10X)
Baseline             0.30         0.22               0.10            0.03
Enhanced             1.00         0.78               0.33            0.09
Table 5.7
Average recall* for Baseline and Enhanced search engines from original
calculation and at the three sensitivity levels for unfound documents.
Under these same conditions at Sensitivity Level 1, the average F-measure* of the
Baseline was 0.33 and the average F-measure* of the Enhanced was 0.75 as presented in
Table 5.8. The difference in performance in the F-measure* value between the Baseline
and Enhanced search engines was determined to be statistically significant to an alpha
less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn
from the original F-measure* calculations.
See Appendix I section I.1. for the data and t-Test calculations used to determine
the statistical significance of recall* and F-measure* at Sensitivity Level 1 for unfound
documents.
                     Average F-measure*
Search Engine        Original     Level 1 (0.25X)    Level 2 (2X)    Level 3 (10X)
Baseline             0.41         0.33               0.17            0.05
Enhanced             0.85         0.75               0.45            0.16
Table 5.8
Average F-measure* for Baseline and Enhanced search engines from
original calculation and at the three sensitivity levels for unfound
documents.
5.3.2. Level 2 Sensitivity To Unfound Relevant Documents (2X)
At Sensitivity Level 2 in which the unfound relevant documents were assumed to
be double the number of relevant documents found by the Baseline and Enhanced search
engines, the average recall* of the Baseline was 0.10 and the average recall* of the
Enhanced was 0.33 as presented in Table 5.7. The difference in performance in the
recall* value between the Baseline and Enhanced search engines was determined to be
statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the
same conclusion that was drawn from the original recall* calculations.
Under these same conditions at Sensitivity Level 2, the average F-measure* of the
Baseline was 0.17 and the average F-measure* of the Enhanced was 0.45 as presented in
Table 5.8. The difference in performance in the F-measure* value between the Baseline
and Enhanced search engines was determined to be statistically significant to an alpha
less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn
from the original F-measure* calculations.
See Appendix I section I.2. for the data and t-Test calculations used to determine
the statistical significance of recall* and F-measure* at Sensitivity Level 2 for unfound
documents.
5.3.3. Level 3 Sensitivity To Unfound Relevant Documents (10X)
At Sensitivity Level 3, in which the unfound relevant documents were assumed to
be ten times the number of relevant documents found by the Baseline and Enhanced
search engines, the average recall* of the Baseline was 0.03 and the average recall* of
the Enhanced was 0.09 as presented in Table 5.7. The difference in performance in the
recall* value between the Baseline and Enhanced search engines was determined to be
statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the
same conclusion that was drawn from the original recall* calculations.
Under these same conditions at Sensitivity Level 3, the average F-measure* of the
Baseline was 0.05 and the average F-measure* of the Enhanced was 0.16 as presented in
Table 5.8. The difference in performance in the F-measure* value between the Baseline
and Enhanced search engines was determined to be statistically significant to an alpha
less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn
from the original F-measure* calculations.
See Appendix I section I.3. for the data and t-Test calculations used to determine
the statistical significance of recall* and F-measure* at Sensitivity Level 3 for unfound
documents.
5.4. Sensitivity Analysis For Impact Of Relevancy Assumptions
A sensitivity analysis for documents assumed relevant was performed at four
different sensitivity levels to assess the potential impact of non-relevant documents
returned by both the Baseline and the Enhanced search engines. This analysis was
conducted using the sample of 30 query topics (i.e., 14 query topics in the Tangible
Concepts Query Sample Set plus 16 query topics from the Intangible Concepts Query
Sample Set).
5.4.1. Level 1 Sensitivity To Relevancy Assumptions (25% Non-Relevant)
At Sensitivity Level 1 in which 25% of the documents returned by both the
Baseline and the Enhanced search engines were assumed to be non-relevant, the average
recall* of the Baseline was 0.25 and the average recall* of the Enhanced was 1.00, as
presented in Table 5.9. The difference in performance in the recall* value between the
Baseline and Enhanced search engines was determined to be statistically significant to an
alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was
drawn from the original recall* calculations.
                     Average Recall*
Search Engine        Original    Level 1 (25%)    Level 2 (50%)    Level 3 (75%)    Level 4 (Overlap with Google Desktop)
Baseline             0.29        0.25             0.20             0.09             0.21
Enhanced             1.00        1.00             1.00             1.00             1.00
Table 5.9
Average recall* for Baseline and Enhanced search engines from original
calculation and at the four sensitivity levels for assumed relevant
documents.
Under these same conditions at Sensitivity Level 1, the average F-measure* of the
Baseline was 0.30 and the average F-measure* of the Enhanced was 0.80 as presented in
Table 5.10. The difference in performance in the F-measure* value between the Baseline
and Enhanced search engines was determined to be statistically significant to an alpha
less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn
from the original F-measure* calculations.
See Appendix J section J.1. for the data and t-Test calculations used to determine
the statistical significance of recall* and F-measure* at Sensitivity Level 1 for assumed
relevant documents.
                     Average F-measure*
Search Engine        Original    Level 1 (25%)    Level 2 (50%)    Level 3 (75%)    Level 4 (Overlap with Google Desktop)
Baseline             0.40        0.30             0.22             0.09             0.25
Enhanced             0.84        0.80             0.76             0.70             0.77
Table 5.10
Average F-measure* for Baseline and Enhanced search engines from
original calculation and at the four sensitivity levels for assumed relevant
documents.
5.4.2. Level 2 Sensitivity To Relevancy Assumptions (50% Non-Relevant)
At Sensitivity Level 2 in which 50% of the documents returned by both the
Baseline and the Enhanced search engines were assumed to be non-relevant, the average
recall* of the Baseline was 0.20 and the average recall* of the Enhanced was 1.00, as
presented in Table 5.9. The difference in performance in the recall* value between the
Baseline and Enhanced search engines was determined to be statistically significant to an
alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was
drawn from the original recall* calculations.
Under these same conditions at Sensitivity Level 2, the average F-measure* of the
Baseline was 0.22 and the average F-measure* of the Enhanced was 0.76, as presented in
Table 5.10. The difference in performance in the F-measure* value between the Baseline
and Enhanced search engines was determined to be statistically significant to an alpha
less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn
from the original F-measure* calculations.
See Appendix J section J.2. for the data and t-Test calculations used to determine
the statistical significance of recall* and F-measure* at Sensitivity Level 2 for assumed
relevant documents.
5.4.3. Level 3 Sensitivity To Relevancy Assumptions (75% Non-Relevant)
At Sensitivity Level 3 in which 75% of the documents returned by both the
Baseline and the Enhanced search engines were assumed to be non-relevant, the average
recall* of the Baseline was 0.09 and the average recall* of the Enhanced was 1.00, as
presented in Table 5.9. The difference in performance in the recall* value between the
Baseline and Enhanced search engines was determined to be statistically significant to an
alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was
drawn from the original recall* calculations.
Under these same conditions at Sensitivity Level 3, the average F-measure* of the
Baseline was 0.09 and the average F-measure* of the Enhanced was 0.70, as presented in
Table 5.10. The difference in performance in the F-measure* value between the Baseline
and Enhanced search engines was determined to be statistically significant to an alpha
less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn
from the original F-measure* calculations.
See Appendix J section J.3. for the data and t-Test calculations used to determine
the statistical significance of recall* and F-measure* at Sensitivity Level 3 for assumed
relevant documents.
5.4.4. Level 4 Sensitivity To Relevancy Assumptions (Overlap With Google Desktop)
At Sensitivity Level 4, in which estimates of relevant documents that may be
assumed were generated by determining the number of documents returned by the
Google Desktop search engine that overlap with the documents returned by the Baseline
and the Enhanced search engines, the average recall* of the Baseline was 0.21 and the
average recall* of the Enhanced was 1.00, as presented in Table 5.9. The difference in
performance in the recall* value between the Baseline and Enhanced search engines was
determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn from the original recall* calculations.
Under these same conditions at Sensitivity Level 4, the average F-measure* of the
Baseline was 0.25 and the average F-measure* of the Enhanced was 0.77, as presented in
Table 5.10. The difference in performance in the F-measure* value between the Baseline
and Enhanced search engines was determined to be statistically significant to an alpha
less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn
from the original F-measure* calculations.
See Appendix J section J.4. for the data and t-Test calculations used to determine
the statistical significance of recall* and F-measure* at Sensitivity Level 4 for assumed
relevant documents.
5.5. Query Topic Outliers
Two query topics in the sample set had an extremely large number of non-overlapping documents returned by the Baseline and Enhanced search engines as
compared to the other query topics (i.e., the Enhanced search engine returned more than
100 documents that the Baseline search engine did not). To better understand why these
query topics were outliers, an informal analysis was conducted.
The first outlier query topic was “TCAS warnings” in the tangible concepts query
topic sample set. The number of documents returned by the Baseline and Enhanced
search engines for this query topic is presented in Table 5.11. Information relevant for
this query topic discusses design attributes of effective warnings for traffic and collision
avoidance systems or other similar systems. The design attributes necessary for effective
warnings is discussed extensively in the Design CoPilotTM document collection. From
partial adjudication of the documents returned only by the Enhanced search engine (i.e.,
the differences), it is evident that the Enhanced search engine found relevant documents
that the Baseline missed. However, it should be noted that the Baseline search engine
returned a large number of documents (i.e., 96 documents). In fact, it was one of the
largest document result sets returned by the Baseline. The initial analysis suggests that
the large difference between the number of documents returned by the Baseline and
Enhanced search engines may be a function of proportion. The average proportional
difference between the documents returned by the Baseline and the Enhanced is 4.16
(i.e., the Enhanced search engine on average returned 4.16 times the number of
documents returned by the Baseline search engine when a difference exists).
                                           Documents Returned
Outlier Query Topic (QT)                   Baseline    Enhanced    Difference
TCAS warnings                              96          244         148
easily understandable system status        9           133         124
Table 5.11
Documents returned by the Baseline and Enhanced search engines for the
two outlier query topics.
The second query topic was “easily understandable system status” in the
intangible concepts query topic sample set. The number of documents returned by the
Baseline and Enhanced search engines for this query topic is presented in Table 5.11.
Information relevant for this query topic discusses the design attributes related to
recognizing and comprehending system status messages. Again, this is a topic that is
discussed extensively in the Design CoPilotTM document collection. From partial
adjudication of the documents returned by only the Enhanced search engine, it is evident
that the Enhanced search engine found relevant documents that the Baseline missed.
However, unlike the previous outlier query topic, the Baseline search engine returned a
relatively small number of documents (i.e., 9 documents), and, therefore, the large
difference cannot be attributed to a function of proportion. For this query topic, it may be
a function of conceptually redundant wording. Within the Design CoPilotTM document
collection, the concept of “understandability” assumes the idea of “easily”. In other
words, if the information presented or the way a control operates is understandable, it is
also easily understandable (i.e., not just understandable after some difficult cognitive
effort). Therefore, it may be the case that the words “understandable” and “easily” do not
appear together frequently in the document collection. But, the query topic in the
Baseline search engine requires that the concept be discussed in the text with both the
word “easily” and the word “understandable,” while the Enhanced search engine also
looks for other similar words based on the relational pathways. In this way, the
Enhanced search engine was not constrained by the conceptually redundant wording in
the query topic itself.
CHAPTER 6. DISCUSSION
Four hypotheses were developed to guide the research to determine whether or
not the Enhanced search engine design positively impacted search performance. Based
on the results of the experiment, each of the four hypotheses was found to be true.
When considering the results for each of the research hypotheses, it is important
to keep in mind that the precision, recall*, and F-measure* values are only available for
the sample of 30 query topics that produced a difference in performance between the
Baseline and the Enhanced search engines and that these represent only a subset of query
topics that may be run. Of the 75 query topics developed for the research, only 54.7% of
these topics produced a difference in performance. Figure 6.1 is a modified version of
Figure 5.2 with additional detail to illustrate that the sample query topics were derived
from the population of query topics that produced a difference in performance between
the Baseline and Enhanced search engine.
In addition, it is also important to realize that the Enhanced system was designed to
improve the performance of query topics that contain two or more terms. Because the
enhancement is based on the relationship between terms, query topics that consist of a
single term do not have these relationships to exploit.
Figure 6.1
Distribution of all 75 query topics with additional detail to illustrate the
population from which the tangible and intangible sample sets were derived
(the query topics with the same performance, and the query topics with a
performance difference divided into the Tangible and Intangible Sample
Sets and the remaining tangible and intangible query topics not included in
the samples).
6.1. Research Hypothesis #1
The first research hypothesis states that the Enhanced search engine will perform
differently than the Baseline search engine. Based on the results derived from running the
full set of 75 query topics on both search engines, the difference in counts of documents
returned from the Baseline search engine with an average of 20.2 documents returned and
from the Enhanced search engine with an average of 43.5 documents returned was
determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. The probability that this is not a statistical difference was calculated as 8.91 × 10⁻⁵
for the two-tailed test and 4.46 × 10⁻⁵ for the one-tailed test. It can be confidently
concluded that on average, the Enhanced search engine returns a different number of
documents for a given query topic than does the Baseline search engine. And stated
another way, the Enhanced search engine performs differently from the Baseline search
engine. This is not an unexpected conclusion, but it forms the foundation for the
conclusions that can be made by the other hypotheses posited in this research.
6.2. Research Hypothesis #2
The second research hypothesis states that the Enhanced search engine will, on
average, have greater recall than the Baseline engine. Based on the results derived from
testing the sample of 30 query topics (i.e., both sample sets combined), the difference in
recall* between the Baseline search engine with an average recall* of 0.30 and the
Enhanced search engine with an average recall* of 1.00 was also determined to be
statistically significant to an alpha less than 0.05 using a Two Sample t-Test. The
probability that this is not truly a statistical difference was calculated as 7.66 × 10⁻²² for
the two-tailed test and 3.83 × 10⁻²² for the one-tailed test. Therefore, it can be confidently
concluded that not only is the recall* performance of the Enhanced search engine
different from that of the Baseline search engine, but also that the Enhanced search
engine performs with higher recall* than the Baseline search engine.
This conclusion is important because it supports the claims that traditional
symbolic pattern matching search engines miss relevant documents and that enhancing
the queries using the relational pathways developed in this research can identify at least
some relevant documents that are missed.
6.3. Research Hypothesis #3
The third research hypothesis states that the Enhanced search engine will, on
average, perform better than the Baseline search engine as determined by a higher
average F-measure value. Based on testing the sample of 30 query topics (i.e., both
sample sets combined), the difference in performance in the F-measure* values between
the Baseline search engine with an average F-measure* of 0.41 and the Enhanced search
engine with an average F-measure* of 0.85 was determined to be directionally
statistically significant to an alpha less than 0.05 using a Two Sample t-Test (i.e., the
Enhanced search engine’s F-measure* performance values are statistically greater than
the Baseline’s F-measure* values). The probability that this is not truly a directional
statistical difference was calculated as 6.07 × 10⁻¹⁰ for the one-tailed test. Therefore, it
can be confidently concluded that the Enhanced search engine performs with a greater F-measure* value than the Baseline search engine. Because the greater the F-measure*
value, the better the performance, it is concluded that the Enhanced search engine performed
better than the Baseline search engine in the experiment.
6.4. Research Hypothesis #4
The fourth research hypothesis states that the Enhanced search engine will, on
average, perform better on query topics that represent intangible concepts than on query
topics that represent tangible concepts as determined by a higher average F-measure
value. Based on the results derived from comparing the performance of the Enhanced
search engine on the sample of 14 query topics representing tangible concepts with an
average F-measure* value of 0.78 and the sample of 16 query topics representing
intangible concepts with an average F-measure* value of 0.92, the difference in
performance of the Enhanced search engine between query topics that represent tangible
concepts and query topics that represent intangible concepts was determined to be
statistically significant at an alpha of 0.05 using a Single Factor ANOVA test. The
probability that this is not truly a statistical difference was calculated as 0.025.
Therefore, it is concluded that the Enhanced search engine performed better on the query
topics that represented intangible concepts than on the query topics that represented
tangible concepts in the experiment.
6.5. Generalizing the Results to Full Set of Query Topics
As described earlier, the parts of the experiment in which adjudicated results were
used (i.e., research hypothesis #2, #3, and #4) are based on samples drawn only from the
population of query topics that produce a difference in performance between the Baseline
and Enhanced search engines. However, 45.3% of the query topics produced the same
results in the Baseline and the Enhanced search engine. Therefore, it is not appropriate to
generalize the performance of the Enhanced search engine measured in the samples as
the overall performance of the search engine on the full population of query topics. Based
on the data gathered in this experiment, the best estimate of the performance of the
Enhanced search engine when it returns the same set of documents that the Baseline
search engine returns (i.e., there are no additional files returned by the Enhanced search
engine) is to use the Baseline’s performance derived from the 30 query topic sample set.
As determined in the significance calculations, the Baseline did not perform differently
on query topics that represented tangible concepts than on query topics that represented
intangible concepts. Because there was no difference in performance, it is appropriate to
use the combined performance of the Tangible and Intangible Sample Sets to estimate the
average performance of the Baseline search engine.
Therefore, if we estimate the Enhanced search engine’s performance on the query
topics when it returns the same set of documents that the Baseline search engine returns
with the Baseline’s F-measure* value of 0.41, we can calculate an overall estimate of the
Enhanced search engine’s performance on the full 75 query topic set to be an F-measure*
value of 0.64. This value may be a more accurate average performance expectation on
the population of all query topics for the Enhanced search engine than an F-measure* of
0.85 because it attempts to take into account the portion of query topics that do and do
not have relational pathways that identify additional documents in the search results.
Query Topic Set                                Count   Percentage (%)   F-measure*   (% x F-measure*)
Tangible QT with Performance Difference          24        0.320           0.78            0.250
Intangible QT with Performance Difference        17        0.227           0.92            0.209
Estimate for all QT with Same Performance        34        0.453           0.41            0.186
                                                                 Estimated F-measure* =     0.644

Table 6.1       Calculation to estimate the performance of the Enhanced search engine on the full 75 query topic set.
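The estimate in Table 6.1 is a weighted average of the three sets' F-measure* values, weighted by each set's share of the full 75 query topic population, and the arithmetic can be checked directly. The following minimal C# sketch reproduces it; the values are taken from Table 6.1 and the class and variable names are illustrative only.

    // Check of the Table 6.1 estimate: a weighted average of F-measure* values.
    using System;

    class FullSetEstimateCheck
    {
        static void Main()
        {
            double estimate = (24.0 / 75.0) * 0.78   // tangible QT with a performance difference
                            + (17.0 / 75.0) * 0.92   // intangible QT with a performance difference
                            + (34.0 / 75.0) * 0.41;  // QT with identical results (Baseline F-measure* used)

            Console.WriteLine(estimate.ToString("0.000"));   // prints 0.644, matching Table 6.1
        }
    }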
6.6. Sensitivity Analysis For Impact Of Unfound Relevant Documents
The purpose of the sensitivity analysis was to determine whether or not unfound
relevant documents would impact the conclusions drawn from the results of the
experiment. Specifically, it looked at whether the differences in recall* and F-measure*
between the Baseline and Enhanced search engines would continue to be statistically
significant if unfound relevant documents existed in the document collection.
The results at each of the levels of sensitivity for both recall* and F-measure*
allowed the same conclusions to be drawn as were drawn from the original calculations.
The difference of recall* for the Baseline and the Enhanced search engine at each level of
sensitivity was found to remain statistically significant. This means that the second research hypothesis, which states that the Enhanced search engine will, on average, have greater recall than the Baseline search engine, would remain true even if a large number of unfound relevant documents existed in the collection.
The difference in F-measure* for the Baseline and the Enhanced search engine at
each level of sensitivity was also found to remain statistically significant. This means that the third and fourth research hypotheses, which state that the Enhanced search engine will, on average, perform better than the Baseline search engine as determined by a higher average F-measure value, would remain true even if a large number of unfound relevant documents existed in the collection.
Therefore, the sensitivity analysis demonstrates that even with a large number of
unfound relevant documents in the collection, the conclusions drawn from the original
calculations remain valid.
6.7. Sensitivity Analysis For Impact Of Relevancy Assumptions
The purpose of the sensitivity analysis was to determine whether the documents
that were assumed relevant in the modified pooling method of adjudication would impact
the conclusions drawn from the results of the experiment. Specifically, it looked at
whether the differences in recall* and F-measure* between the Baseline and Enhanced
search engines would continue to be statistically significant if some of the documents that
were returned by both the Baseline and Enhanced search engines were not relevant.
Like the previous sensitivity analysis, the results at each of the levels of
sensitivity for both recall* and F-measure* allowed the same conclusions to be drawn as
were drawn from the original calculations. The difference of recall* for the Baseline and
the Enhanced search engine at each level of sensitivity was found to remain statistically
significant. This means that the second research hypothesis, which states that the Enhanced search engine will, on average, have greater recall than the Baseline search engine, would remain true even if a large portion of the documents that were returned by both the Baseline and Enhanced search engines were not relevant.
Similarly, the difference in F-measure* for the Baseline and the Enhanced search
engine at each level of sensitivity was also found to remain statistically significant. This means that the third and fourth research hypotheses, which state that the Enhanced search engine will, on average, perform better than the Baseline search engine as determined by a higher average F-measure value, would remain true even if a large portion of the documents that were returned by both the Baseline and Enhanced search engines were not relevant.
Therefore, the sensitivity analysis for the impact of the relevancy assumptions
demonstrates that even if a large portion of non-relevant documents returned by both the
Baseline and Enhanced search engines were assumed relevant, the conclusions drawn
from the original calculations remain valid.
6.8. Interpreting the Results
The results are promising and provide evidence that the idea of automatic concept-based query expansion using term relational pathways built from a collection-specific association thesaurus may be an effective way to improve search engine performance. With these results in mind, it is important to revisit the original ideas from
which the enhanced design was inspired. Some of the main ideas revolved around the
automatic development of a conceptual network, whether a complete set of relevant
documents could be retrieved by augmenting symbolic search, and the impact of query
topic types on search performance.
6.8.1. Conceptual Network
One of the primary ideas that inspired this research was that a conceptual network
could be automatically generated through the use of a document collection co-occurrence
calculation-based association thesaurus that would represent a reasonable approximation
of the concepts that were represented in the document collection. Therefore, it is
important to consider whether or not concepts are truly represented in the network
created by the Enhanced search engine. The relational pathways that are identified for
term pairs from the query topics provide the best look into the network. To get a sense of
whether concepts have been captured in the full network, we can ask whether the
relational pathways appear to represent relevant aspects of concepts that a human would
recognize and agree are relevant to the term pair.
Consider character and size, a term pair taken from the following query topic:
appropriate size of characters on display. In the experiment, the following three
relational paths were identified for the term pair:
Pathway #1:
character – discriminable – font – size
Pathway #2:
character – readable – font – size
Pathway #3:
character – text – font – size
The three relational pathways do appear to represent recognizable aspects of a
larger concept. They illustrate two types of relationships among the terms. The first type
of relationship is illustrated in pathway #1 and pathway #2. These pathways represent
the effect that one of the original query terms has on the other query term. In pathway
#1, the effect of discriminability is identified. Size affects a character’s discriminability
(i.e., the ease with which one piece of information, or in this case a character, can be
recognized within a group). If characters are too small, they will not be discriminable.
Characters that are not discriminable are not of an appropriate size. It is clear that this
pathway represents a valid aspect of the overall concept presented by the term pair and, in
addition, the query topic in general.
The same type of relationship is represented in pathway #2 with the effect of
readability. Like discriminability, size affects a character’s readability (i.e., the ease with
which text or numbers are recognized as words or number sequences). If characters are
too small, they will not be readable and, therefore, not of an appropriate size. Again, it is
clear that this pathway represents a valid aspect of the overall concept presented by the
term pair and, in addition, the query topic in general.
The second type of relationship is illustrated in pathway #3. In pathway #3, the
pathway represents alternate expressions of the similar concept. The alternate
expressions are the terms text and font for the original query term character. These
alternate expressions may be of greater or lesser specificity, but in the test document
collection, they represent the same concept 11.
Both types of relationships are useful in identifying candidate terms with which to
expand the query in a way that remains focused on the overall concept of interest.
Not all relational pathways identified for term pairs will be as easily recognizable as representing a relevant aspect of the overall concept. However, as seen in this
example, some relational pathways do present highly recognizable and obviously relevant
aspects of the overall concept represented by the term pairs and the overall query topic.
This suggests that real concepts, at least in part, are represented in the conceptual network
produced automatically using the association thesaurus.
6.8.2. Complete Set of Relevant Documents
The primary motivation for pursuing this approach was the desire to return a
complete set of relevant documents. So, the question remains: did the Enhanced search engine return a complete set of relevant documents? The methodology was not designed
to identify the complete set of all relevant documents for a given query topic in the
collection but rather to determine how many additional relevant documents were
returned. As discussed earlier, complete adjudication of a document collection (i.e.,
identifying all relevant documents in the collection to ensure that no unfound relevant
documents exist in the collection) is prohibitively time consuming. However, based on
an informal assessment during the adjudication process, it was identified by the collection
content expert that some relevant documents were missed, not only by the Baseline
search engine but also by the Enhanced search engine.
This proof-of-concept research work was based on a number of “best reasonable
guess” design decisions, and, therefore, this observation is not surprising. (See Appendix
A for a list of design parameters chosen.) It is reasonable to expect that further investigation will be necessary to identify and tune the various design parameters that impact the recall level of an enhanced search engine.
6.8.3. Query Topic Types
While developing the query topics, there appeared to be a qualitative difference in
the types of query topics that were of interest to the users of the Design CoPilot™. There
were query topics that represented well-defined, concrete, tangible concepts and query
topics that represented more complex, harder-to-define, fuzzy, intangible concepts.
Therefore, it seemed of interest to determine if there was a qualitative performance
difference between these two types of query topics. In addition, it seemed logical that the
less well-defined concepts represented by the intangible query topics would receive a
greater benefit from more completely defining the information need using the relational
pathways. The experimental results appear to bear this out. The Enhanced search engine
did perform better on query topics that represented the intangible concepts. Additional
investigation will be necessary to determine how this knowledge may be used to further
improve the performance of the Enhanced search engine.
6.9. Impact of Sample Selection
As described in section 4.5.1, the Tangible and Intangible Concept Query Sample
Sets were generated by using an arbitrarily selected difference threshold of less than or
equal to 50 documents. This was done to restrict the samples to a manageable size so that
the adjudication effort required would be feasible. However, one of the ramifications of this decision is that the conclusions are only relevant for query topics with a difference count of less than or equal to 50 documents. Because those query topics with large differences in the number of documents returned by the Baseline and Enhanced search engines were not
included in the sample, it may be the case that query topics with a large difference count
(i.e., greater than 50 documents) perform qualitatively differently than query topics with
smaller difference counts. Additional research would be necessary to determine the
impact of the magnitude of the difference counts on performance.
6.10. Impact of Search Engine Parameter Values
As mentioned earlier, the values for the various parameters were primarily based
on a number of “best reasonable guess” design decisions. It is likely that with tuning, the
performance of the Enhanced search engine could be improved. However, in this
research, the decision was made to avoid tuning the parameters because the tuning could
have undesirable and unknown impact on the generalizability of the Enhanced search
engine design for other document collections. Therefore, it was thought inappropriate to
tune at this early stage in the research.
After the experiment was completed, an informal analysis was performed to get a
sense of the sensitivity that two key parameters may have on the performance of the
Enhanced search engine. The parameters analyzed included the Jaccard Coefficient
Threshold and the Maximum Associated Term Entries Threshold.
6.10.1. Jaccard Coefficient Threshold Value
An informal analysis was conducted by calculating the number of associated
terms that would be included in the Association Thesaurus for each of the unique eligible
terms that made up the 75 query topics at alternate Jaccard Coefficient Threshold values.
The Jaccard Coefficient Threshold used in the experiment was 0.5 and so values of 0.4
and 0.6 were analyzed.
There were a total of 176 unique eligible terms that did not exceed the Maximum
Associated Terms threshold of 275 associated terms. The number of associated terms for
each of these 176 terms was calculated for Jaccard Coefficient Threshold values of 0.4,
0.5, and 0.6. Of these, the average difference in the number of associated terms that
would be included in the Association Thesaurus for each of the unique eligible terms at
thresholds 0.4 and 0.5 was 0.26 terms (i.e., if the threshold value was 0.4, an average of
0.26 additional terms would be added to the thesaurus entry); the average between 0.5
and 0.6 was 0.78 terms (i.e., if the threshold value was 0.6, an average of 0.78 fewer
terms would be added to the thesaurus entry); and the average between 0.4 and 0.6 was
1.03. These results suggest that, on average, there would be little impact if the threshold
value chosen was 0.4, 0.5, or 0.6.
6.10.2. Maximum Associated Term Entries Threshold Value
An informal analysis was conducted by identifying the relational pathways that
would be returned for a term pair at various Maximum Associated Term Entries
Threshold values. The term pair used was made up of the terms warnings and understandable. The Maximum Associated Term Entries Threshold used in the experiment was 275 terms, and so values of 200, 225, 250, 300, and 325 were analyzed.
The relational pathways for each of these threshold values were identified. At the 200-term threshold level, 3 relational pathways were identified; at the 225-, 250-, and 275-term threshold levels, 5 relational pathways were identified; and at the 300- and 325-term threshold levels, 11 relational pathways were identified. The relational pathways at each of these threshold levels are presented in Figure 6.2.
(This term pair, warnings understandable, was presented in an example discussed in Chapter 1 using hypothetical thesaurus entries and relational pathways. In the analysis presented in this section, however, the actual Association Thesaurus entries for the document collection were used, and the relational pathways generated are real.)
Maximum Associated Term Entries Threshold Trials

MaxAssocThreshold = 200
[1] understand – clear – awareness – warn
[2] understand – clear – device – warn
[3] understand – confusing – caution – warn

MaxAssocThreshold = 225
[1] understand – clear – annunciate – warn
[2] understand – clear – awareness – warn
[3] understand – clear – device – warn
[4] understand – confusing – caution – warn
[5] understand – distinct – visual – warn

MaxAssocThreshold = 250
[1] understand – clear – annunciate – warn
[2] understand – clear – awareness – warn
[3] understand – clear – device – warn
[4] understand – confusing – caution – warn
[5] understand – distinct – visual – warn

MaxAssocThreshold = 275
[1] understand – clear – annunciate – warn
[2] understand – clear – awareness – warn
[3] understand – clear – device – warn
[4] understand – confusing – caution – warn
[5] understand – distinct – visual – warn

MaxAssocThreshold = 300
[1] understand – clear – alert – warn
[2] understand – clear – annunciate – warn
[3] understand – clear – awareness – warn
[4] understand – clear – device – warn
[5] understand – confusing – caution – warn
[6] understand – consistent – alert – warn
[7] understand – consistent – annunciate – warn
[8] understand – consistent – aural – warn
[9] understand – consistent – caution – warn
[10] understand – distinct – alert – warn
[11] understand – distinct – visual – warn

MaxAssocThreshold = 325
[1] understand – clear – alert – warn
[2] understand – clear – annunciate – warn
[3] understand – clear – awareness – warn
[4] understand – clear – device – warn
[5] understand – confusing – caution – warn
[6] understand – consistent – alert – warn
[7] understand – consistent – annunciate – warn
[8] understand – consistent – aural – warn
[9] understand – consistent – caution – warn
[10] understand – distinct – alert – warn
[11] understand – distinct – visual – warn

Figure 6.2      Relational pathways identified for various values of the Maximum Associated Term Entries Threshold.
Reviewing the relational pathways for this example term pair shows that
additional relational pathways identified at threshold values higher than the value used in the experiment may represent relevant aspects of the overall intended concept of interest without introducing undesirable tangential concepts. For example, the six additional relational pathways introduced at the 300- and 325-term threshold levels each present recognizable aspects of a larger concept related to ensuring that warnings are
understandable. The six additional relational pathways are the following:
Pathway [1]: understand – clear – alert – warn
Pathway [6]: understand – consistent – alert – warn
Pathway [7]: understand – consistent – annunciate – warn
Pathway [8]: understand – consistent – aural – warn
Pathway [9]: understand – consistent – caution – warn
Pathway [10]: understand – distinct – alert – warn
The higher maximum threshold values of both 300 and 325 allow two additional
terms to be included in the conceptual network and, therefore, to be eligible for creating
relational pathways. These two terms are consistent and alert. Both these terms appear
relevant to the larger concept represented by the term pair. For example, consistency
plays a large role in promoting the understandability of the elements in the flight deck,
and the term alert is often used synonymously with the words warn or warning. Both
these terms have the potential of playing a useful role in expanding the query with the
original term pair in a way that would remain focused on the overall concept of interest
yet allow additional relevant documents to be retrieved.
The results of this informal analysis suggest that the performance of the Enhanced search engine may be improved by using a higher Maximum Associated Term Entries Threshold value for the document collection used in this experiment. These results also suggest that the Maximum Associated Term Entries Threshold may be a useful parameter to tune based on the particular characteristics of the document collection on which the Enhanced search engine runs. However, a sufficient number of example term pairs would need to
be analyzed and considered to appropriately tune this threshold value.
6.11. Outstanding Issues
While the results are promising, this research was a proof-of-concept test. There
are still questions to be answered before such a method could be put into production on a
live website. Three of these issues are discussed below.
6.11.1. Impact of Characteristics of the Test Document Collection
The Design CoPilot™ document collection by its nature has some repetitive
content. The human factors-related pages in the collection cite passages from the related
regulatory and guidance material. In addition, within the regulatory and guidance
materials, cross-referencing and discussion about various regulatory excerpts occur.
Therefore, the same passage of text may be present in several documents within the
document collection. It is difficult to know whether this played a significant role in the
performance of the Enhanced search engine. It could have made the conceptual network
stronger in that important concepts and connections between terms were given more
weight by having an exaggerated frequency. Or, it could have limited the useful
connections among terms and provided fewer relational pathways with which to expand
the original query.
Therefore, an outstanding question is: Would the Enhanced search engine perform
in a similar manner on a document collection that is comparably technical but less
repetitive? Further investigation is required to determine the answer to this question.
6.11.2. Data Processing and Document Collection Size
The method to automatically create the association thesaurus and generate the
conceptual network requires a large amount of data processing. While much of the time-consuming processing can be done ahead of time (i.e., during the pre-query processing stage), and the algorithms used could likely be made more efficient, the data processing required limits the size of the document collection for which this approach may be used given currently available processing capabilities.
The matrix size required for calculating the Jaccard Coefficient co-occurrence
values for the Design CoPilot™ document collection tested exceeded the performance
limitations of MATLAB 7.4 (R2007). Because MATLAB must hold the matrix in
memory, the maximum matrix size (or array size) is determined by the amount of
contiguous memory made available to MATLAB. For installations of MATLAB on 32-bit Windows operating systems, the maximum number of elements in a real-numbered array ranges from approximately 155 x 10^6 to approximately 200 x 10^6 elements. (This information was taken from an article about MATLAB on the MathWorks website titled What is the maximum matrix size for each platform?, retrieved March 17, 2013, from http://www.mathworks.com/support/solutions/en/data/1IHYHFZ/index.html.) To overcome this limitation, a workaround was developed to break the matrix up into sufficiently small sub-matrices, send each sub-matrix into MATLAB to perform the required co-occurrence calculations, and then reassemble the data generated in MATLAB back into the full matrix. While it is likely that the workaround process developed could be easily scaled up to handle document collections double the size of the test collection (i.e., to handle document collections of approximately 6000 documents), it is much less certain how much further the scaling of the process could be taken and remain feasible.
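For illustration, the sketch below shows one way such a partitioning could be organized: the term-by-segment matrix is split into column blocks that each stay under a fixed element budget, and each block would then be handed to MATLAB in turn. The budget, the counts, and the column-wise scheme are assumptions made for this sketch; they are not a description of the exact workaround code.

    // Hypothetical partitioning of a term-by-document-segment matrix into column
    // blocks small enough to respect a per-array element limit.
    using System;
    using System.Collections.Generic;

    class MatrixPartitionSketch
    {
        static void Main()
        {
            long elementBudget = 150000000;   // assumed per-array element limit (below the 32-bit MATLAB range)
            int segmentRows = 120000;         // assumed number of document segments (illustrative)
            int termColumns = 20000;          // assumed number of eligible terms (illustrative)

            int columnsPerBlock = (int)(elementBudget / segmentRows);

            var blocks = new List<Tuple<int, int>>();   // (first column, column count) per sub-matrix
            for (int start = 0; start < termColumns; start += columnsPerBlock)
            {
                int count = Math.Min(columnsPerBlock, termColumns - start);
                blocks.Add(Tuple.Create(start, count));
            }

            Console.WriteLine(blocks.Count + " sub-matrices of up to " + columnsPerBlock + " columns each");
            // Each sub-matrix would be sent to MATLAB for the co-occurrence calculations
            // and the results reassembled into the full matrix afterward.
        }
    }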
Therefore, an outstanding question is: What are the size limitations on the
document collection? Further investigation is required to determine the answer to this
question.
6.11.3. Performance Comparison
Finally, another outstanding issue is related to how the performance of the
Enhanced search engine compares to other third-party search engines. The methodology
used in this experiment allowed us to conclude that the Enhanced functionality improved
the performance of the Baseline search engine. However, it left an open question about
the true level of performance that can be expected from the Enhanced search engine and
how this compares to other search algorithms in use today.
This question is partially addressed by including the results generated by the
Google Desktop search engine in the sensitivity analysis to assess the impact of the
relevancy assumptions made in the modified pooling adjudication method. However,
additional work is necessary to draw conclusions about the true level of performance of
the Enhanced search engine.
Two paths of follow-on research are under consideration. In the first path,
Google Desktop (or some other third-party search engine) may be used to run the sample
of 30 query topics on the document collection, and the unique documents returned for
each query would be manually adjudicated. After the adjudication, the results of the
third-party search engine would be compared to the performance results of the Baseline
and Enhanced search engine described in this research. In the second path of follow-on
research, an alternate document collection that possesses the appropriate collection
characteristics and has a set of predefined query topics may be used to assess the
performance of the Baseline and the Enhanced search engines. The Baseline and
Enhanced search engines would be run on this alternate document collection and their
performance measured. Next, their performance would be compared to that of other third-party search engines that have also been run on that particular alternate document collection. Such a document collection would need to be identified and obtained before
this path of follow-on research could be performed.
CHAPTER 7. CONCLUSION
While search engines provide an important tool to help users locate relevant information contained within a website, the use of natural language in both the representation of the information need and the representation of the concepts in the documents presents challenges in designing effective search engines. One promising method to overcome
these challenges is using concept-based query expansion to augment traditional symbolic
approaches. This research described an approach to concept-based query expansion that uses a network-based method to automatically create a reasonable approximation of all the concepts represented in the document collection, and the relationships among them, using an association thesaurus created for the target document collection.
Even though not all query topics have associated relational pathways with which
to expand the original query, the experiment demonstrated that the Enhanced search
engine performs better than the Baseline search engine.
In addition, the results suggest that real concepts, at least in part, are represented
in the conceptual network produced automatically using the association thesaurus.
Therefore, this approach has the potential to be extended to a variety of other applications in which mapping the verbal representation of a concept to the terms used to express it within a set of documents is required.
While there are still some important questions to be answered before such a
method could be put into production on a live website, the results of this experiment are
encouraging. The results suggest that, on a bounded, medium-sized document collection containing documents focused on a single technical subject domain, the enhancement will allow users to identify a significantly greater portion of relevant documents to fill their information needs.
APPENDIX A. SEARCH ENGINE STRUCTURE AND DESIGN PARAMETERS
This appendix provides a high-level illustration of the structure of the Baseline
and Enhanced search engines and identifies significant features and design parameters
used. Much of the content of this appendix duplicates information presented in the body of the dissertation; however, this appendix was created to facilitate replicating or modifying the design of the Baseline and Enhanced search engines by presenting the relevant search engine design structure and parameters together.
A.1. Structure
The following sections describe the high-level structure and major components of
the Baseline and Enhanced search engines.
A.1.1. Baseline Search Engine Structure
The Baseline search engine was composed of an Index Module, a Search Module,
and an Index data store. The Index Module was part of the pre-search processes that
occur to prepare the system for use. The Search Module was part of the post-search
processes that occur to allow users to enter their desired search terms, the system to
process the query, and the system to return the search results to the user. The high-level
structure of the Baseline search engine is illustrated in Figure A.1.
Figure A.1      High-level structure of the Baseline search engine. The figure depicts the pre-search processes (the Index Module, with Acquire Content, Index Document, Analyze Document, and Build Document Record), the Index data store, and the post-search processes (the Search Module, with the User Interface components Enter Search Terms and Present Search Results to User and the Search components Parse User Input, Build Query, Run Query, and Organize and Format Results).
A.1.2. Enhanced Search Engine Structure
The Baseline search engine was used as the core of the Enhanced search engine.
Modules and components necessary for performing the query expansion task were added
on to the core structure. Therefore, in addition to the Baseline’s Index Module, Search
Module, and Index data store, the Enhanced search engine also contained modules to
Build Association Thesaurus, Generate Conceptual Network and Expand Query as well
as the necessary components for the Association Thesaurus data store and the Conceptual
Network data store. The Index Module, Build Association Thesaurus module, and
Generate Conceptual Network module were part of the pre-search processes that occur to
prepare the system. The Search Module and Expand Query module were part of the post-search processes to process the user's query. The high-level structure of the Enhanced
search engine is illustrated in Figure A.2.
Figure A.2      High-level structure of the Enhanced search engine. The figure depicts the pre-search processes (the Index Module, with Acquire Content, Index Document, Analyze Document, and Build Document Record; the Build Association Thesaurus module, with Create Document Segments, Identify Eligible Terms, Create Matrix, Identify Term Pairs, Calculate Co-Occurrence Values, and Store Association Thesaurus Entries; and the Generate Conceptual Network module, with Create Links Between Terms and Store Conceptual Network), the data stores (Index, Association Thesaurus, and Conceptual Network), and the post-search processes (the Search Module, with the User Interface components Enter Search Terms and Present Search Results to User, the Expand Query components Identify Relational Pathways, Collect Expansion Terms, and Build Expanded Query, and the Search components Parse User Input, Run Query, and Organize and Format Results).
A.2. Design Parameters
The following design parameters were used in the design and development of the
Baseline and Enhanced search engines. When no best current practices dictated an
appropriate value for a given parameter, the values were chosen based on educated best
guesses.
A.2.1. Technology
The core of this search engine was built using the open-source Lucene.NET
search engine development library to perform the indexing, data storage, and retrieval
functions. The features for the baseline search engine were chosen based on current best
practices and therefore included word stemming, stop word removal, HTML code filters
(to ignore text inside HTML tags), and proximity search.
The search engine was developed in a Visual Studio 2008 development platform
using C# and the Lucene.NET version 2.9.2 development library to build an ASPX
website containing the document collection and the search engine.
A.2.2. Indexing Parameters
The following parameters were used in the indexing process for both the Baseline
and the Enhanced search engines.
A.2.2.1. HTML Code Filter
The built-in Lucene.NET character filter called HTMLStripCharFilter was used to ignore all text contained inside HTML tags while indexing.
A.2.2.2. Stopwords
The following stopword list was used:
a, an, and, are, as, at, be, been, but, by, for, if, in, into, is, it, no, of, on, such, that, the, their, then, there, these, they, this, to, was, will, with
The stopword list used differs from the ENGLISH_STOP_WORD_SET that is
built into Lucene.NET in the following ways:
• The words "or" and "not" were removed from the list.
• The word "been" was added to the list.
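The customization above can be sketched as a simple set manipulation. The sketch below uses a plain C# set rather than a specific Lucene.NET call, because the exact member used to obtain the default English stopword set is not shown here; the method name is illustrative.

    // Build the customized stopword set: start from the default English stopword
    // list, drop "or" and "not" so they remain searchable, and add "been".
    using System.Collections.Generic;

    static class StopwordSetSketch
    {
        public static HashSet<string> Build(IEnumerable<string> defaultEnglishStopwords)
        {
            var stopwords = new HashSet<string>(defaultEnglishStopwords);
            stopwords.Remove("or");    // removed from the list so it is indexed and searchable
            stopwords.Remove("not");   // removed from the list so it is indexed and searchable
            stopwords.Add("been");     // added to the list
            return stopwords;
        }
    }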
A.2.2.3. Word Stemming
Word stemming was performed using the Lucene.NET built-in Snowball
Analyzer.
A.2.3. Association Thesaurus Processing Parameters
The Build Association Thesaurus module automatically builds the Association
Thesaurus using the document collection. The module accomplished this by manipulating a Term-Document matrix composed of terms and their occurrences in the documents of the collection to determine the level of association between term pairs. To determine the association, overlapping document segments were analyzed to determine the frequency of eligible terms, and the resulting data were used to calculate co-occurrence values.
A.2.3.1. Overlapping Document Segments
The term vectors that made up the Term-Document matrix were defined by the
number of occurrences (i.e., frequency) of the term within document segments rather than
within full documents. Document segments (i.e., moving shingled window) were created
from each full document. The document segments were 200 words long, and each
segment overlapped the previous and next segment by 100 words (i.e., the shingle
increment). The number of document segments created from a full document varied from
one segment to several hundred segments, depending on the length of the full document.
The Term-Document matrix was, therefore, constructed so that the terms were
represented by the columns and the document segments were represented by the rows.
Using document segments rather than the full documents controlled for the variability in
length of documents in the collection and ensured that only the terms in close proximity
(i.e., within 200 terms) to one another were assumed to be similar to one another.
The document segment size and the shingle increment were chosen based on an
informal average paragraph size. It was observed that a single, although possibly
complex, concept is often contained in a paragraph. Because of this, the words used in
the beginning of the paragraph are likely topically related to the words used at the end of
the paragraph. Therefore, the average number of words contained in a paragraph may be
a reasonable guide to the size of a chunk of text in which all the words are semantically
related. Assuming that paragraphs typically range from 100 to 200 words, a document
segment size of 200 words and a shingle increment of 100 words were chosen. These
values were chosen early in the design process and no tuning of these values was
performed.
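As a sketch of the segmentation just described, the following C# fragment produces 200-word segments with a 100-word shingle increment from an already tokenized (and stemmed, stopword-filtered) document; the class, method, and parameter names are illustrative.

    // Create overlapping document segments: 200 words long, advancing by a
    // 100-word shingle increment so adjacent segments overlap by 100 words.
    using System;
    using System.Collections.Generic;

    static class DocumentSegmenterSketch
    {
        public static List<string[]> Segment(IList<string> tokens, int segmentSize = 200, int shingleIncrement = 100)
        {
            var segments = new List<string[]>();
            for (int start = 0; start < tokens.Count; start += shingleIncrement)
            {
                int length = Math.Min(segmentSize, tokens.Count - start);
                var segment = new string[length];
                for (int i = 0; i < length; i++)
                    segment[i] = tokens[start + i];
                segments.Add(segment);

                if (start + segmentSize >= tokens.Count)
                    break;   // the final (possibly shorter) segment has been produced
            }
            return segments;
        }
    }

Each segment produced in this way then becomes one row of the Term-Document matrix described above.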
A.2.3.2. Eligible Term Identification
Not all terms present in the document collection were included in the Association
Thesaurus. Only stemmed, content-bearing words (i.e., stop words were excluded) present in the document collection with an appropriate level of frequency were identified as eligible for inclusion in the Association Thesaurus. Therefore, the terms needed to occur frequently enough in the document collection for co-occurrence calculations to yield useful information, but not so frequently that their presence ceased to be a useful discriminator of relevance. Eligible terms were those that had a minimum frequency of 50 in the overall document collection and did not appear in more than 9999 document segments. These eligible-term parameters were not tuned but were chosen at the beginning of the design process based on reasonable initial guesses as to appropriate starting values.
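A sketch of the eligibility filter under the parameter values stated above; the two frequency dictionaries are assumed to have been accumulated while building the document segments, and the names are illustrative.

    // Select eligible terms: overall collection frequency of at least 50 and
    // presence in no more than 9999 document segments.
    using System.Collections.Generic;

    static class EligibleTermSketch
    {
        public static HashSet<string> Select(
            IDictionary<string, int> collectionFrequency,   // total occurrences per stemmed term
            IDictionary<string, int> segmentCount,          // number of segments containing each term
            int minCollectionFrequency = 50,
            int maxSegmentCount = 9999)
        {
            var eligible = new HashSet<string>();
            foreach (var pair in collectionFrequency)
            {
                int segments;
                if (pair.Value >= minCollectionFrequency
                    && segmentCount.TryGetValue(pair.Key, out segments)
                    && segments <= maxSegmentCount)
                {
                    eligible.Add(pair.Key);
                }
            }
            return eligible;
        }
    }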
A.2.3.3. Co-Occurrence Calculations
The co-occurrence calculations to determine level of association (or, similarity)
between term pairs were conducted using the Jaccard Coefficient. The Jaccard
Coefficient is based on an Intersection Over Union (IOU) calculation to normalize and
measure the amount of overlap between two term vectors.
The Jaccard Coefficient value of a term pair was used only to make the binary decision of inclusion or exclusion of the term pair in the Association Thesaurus. Those term pairs with a Jaccard Coefficient value greater than 0.5 were included in the Association Thesaurus as associated term entries for each other.
This minimum threshold value of 0.5 was chosen early in the design process based on the idea that a value near the mid-point of possible Jaccard Coefficient values (i.e., values between 0 and 1) would provide a reasonable starting point; no tuning was performed to improve this value.
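As a sketch of the calculation, the Jaccard Coefficient for a term pair can be computed from the sets of document segments in which each term occurs, with the 0.5 threshold making the inclusion decision. Treating the term vectors as binary presence/absence sets is a simplification made for this sketch; the implementation's exact vector form is not reproduced here.

    // Jaccard Coefficient as an Intersection Over Union of the document-segment
    // sets of two terms, followed by the binary inclusion decision at 0.5.
    using System.Collections.Generic;

    static class JaccardSketch
    {
        public static double Coefficient(HashSet<int> segmentsA, HashSet<int> segmentsB)
        {
            int intersection = 0;
            foreach (int segment in segmentsA)
                if (segmentsB.Contains(segment))
                    intersection++;

            int union = segmentsA.Count + segmentsB.Count - intersection;
            return union == 0 ? 0.0 : (double)intersection / union;
        }

        public static bool IsAssociated(HashSet<int> segmentsA, HashSet<int> segmentsB, double threshold = 0.5)
        {
            return Coefficient(segmentsA, segmentsB) > threshold;   // strictly greater than 0.5
        }
    }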
A.2.4. Conceptual Network and Relational Pathway Parameters
The Generate Conceptual Network module used the entries in the Association
Thesaurus to generate the conceptual network. Each term in the Association Thesaurus
represented a node. Child nodes for a term were generated from all of the associated
terms defined in its thesaurus entry to create a term cluster. To form the full conceptual
network, each term cluster generated from the thesaurus entry was linked to the other
term clusters using shared terms. The entire conceptual network was developed by
continuing this term cluster linking process using all shared terms defined through the
relationships defined by the associated term entries in the Association Thesaurus.
Only terms likely to be useful in discriminating the relevance of a document were included in the conceptual network. A maximum threshold was used to restrict the number of associated terms a target term may have in order to be eligible for inclusion in the conceptual network. Terms that had more than 275 entries were considered to be too frequently occurring to offer useful discrimination and were ignored during the process of creating the conceptual network. Therefore, any term included in the conceptual network had 275 or fewer associated terms included in its Association Thesaurus entry. This threshold value of 275 entries was chosen early in the design process based on reviewing several example term pairs and their resulting pathways. No tuning was done after this early design decision was made.
In this way, the conceptual network was composed of all terms with 275 or fewer
entries in the Association Thesaurus and links between terms were only based on whether
or not shared terms existed in the individual term clusters (i.e., there were no other
parameters considered when forming the links between nodes).
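A sketch of the network construction described above, using a simple adjacency map as an illustrative (assumed) representation: every thesaurus term with 275 or fewer associated term entries becomes a node, and links follow the associated term entries, so term clusters are joined wherever terms are shared.

    // Build the conceptual network from the Association Thesaurus, ignoring terms
    // whose entries exceed the Maximum Associated Term Entries Threshold (275).
    using System.Collections.Generic;

    static class ConceptualNetworkSketch
    {
        public static Dictionary<string, HashSet<string>> Build(
            IDictionary<string, HashSet<string>> thesaurus, int maxAssociatedTermEntries = 275)
        {
            var network = new Dictionary<string, HashSet<string>>();
            foreach (var entry in thesaurus)
            {
                if (entry.Value.Count > maxAssociatedTermEntries)
                    continue;   // too frequently occurring to discriminate; excluded from the network

                var links = new HashSet<string>();
                foreach (string associated in entry.Value)
                {
                    HashSet<string> otherEntry;
                    if (thesaurus.TryGetValue(associated, out otherEntry)
                        && otherEntry.Count <= maxAssociatedTermEntries)
                    {
                        links.Add(associated);   // link only to terms that are themselves included
                    }
                }
                network[entry.Key] = links;
            }
            return network;
        }
    }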
A.2.4.1. Relational Pathways
To minimize search processing time experienced by the users, the relational
pathways were identified during the pre-search process stage in the Generate Conceptual
Network module. All possible term pairs were identified from the terms contained in the
Association Thesaurus. Next, the relational pathways for each term pair were identified
and stored for fast retrieval at search time.
The relational pathways identified were 3, 4, or 5 terms long. To identify the
relational pathways between a pair of terms, the module began with the first term of the
term pair and traversed the conceptual network looking for the second term of the term
pair using a breadth-first search to a maximum depth of 4 terms. When the second term
was found, the intervening terms were captured to form a relational pathway.
It was possible for zero, one, or more relational pathways to be identified for a
given term pair. There was no maximum threshold used to limit the number of relational
pathways that could be identified for a given term pair.
All the relational pathways for a given term pair were the same length. Once a
relational pathway was found, the search on that level was completed and then stopped
before it moved to the next level of depth.
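A sketch of the pathway search as described above: a breadth-first search from the first term of the pair that stops at the first depth at which the second term is reached, so every returned pathway has the same length (between 3 and 5 terms). This is a reconstruction from the description, not the code used in the experiment.

    // Breadth-first search for relational pathways between a term pair over the
    // conceptual network, limited to pathways of 3 to 5 terms.
    using System.Collections.Generic;

    static class RelationalPathwaySketch
    {
        public static List<List<string>> FindPathways(
            IDictionary<string, HashSet<string>> network, string termA, string termB, int maxPathwayLength = 5)
        {
            var found = new List<List<string>>();
            var frontier = new List<List<string>> { new List<string> { termA } };

            while (frontier.Count > 0 && frontier[0].Count < maxPathwayLength)
            {
                var next = new List<List<string>>();
                foreach (var path in frontier)
                {
                    HashSet<string> neighbours;
                    if (!network.TryGetValue(path[path.Count - 1], out neighbours))
                        continue;

                    foreach (string neighbour in neighbours)
                    {
                        if (path.Contains(neighbour))
                            continue;   // do not revisit a term within a pathway

                        var extended = new List<string>(path) { neighbour };
                        if (neighbour == termB)
                        {
                            if (extended.Count >= 3)   // pathways are 3, 4, or 5 terms long
                                found.Add(extended);
                        }
                        else
                        {
                            next.Add(extended);
                        }
                    }
                }
                if (found.Count > 0)
                    break;   // finish the level on which termB was found, then stop
                frontier = next;
            }
            return found;   // empty when no pathway of an eligible length exists
        }
    }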
A.2.5. Search Parameters
The search is conducted by parsing the user’s input to remove stop words and
perform word stemming. The query then is performed as a proximity search where
documents are retrieved when the terms occur within 100 words of one another. A
proximity threshold of 100 words is a fairly restrictive value. Such a restrictive value was
chosen for two reasons. One, technical content such as that represented by the document
collection used in this research tends to be written in a very focused and direct manner,
and it was believed that if all the terms of the query were found within three to four
sentences of one another, the passage would likely be relevant to the concept represented
by the query topic. Conversely, terms present outside of that proximity window may be more likely to be addressing different ideas. Two, a frequent issue with query expansion
methods is the loss of precision. Therefore, using a proximity value that was more
restrictive should mitigate some of the undesirable effects that cause the loss of precision.
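For illustration only: in Lucene's classic query syntax a proximity constraint is expressed as a sloppy phrase, so a clause requiring two stemmed terms to occur within roughly 100 positions of one another can be rendered as shown below. The helper is hypothetical; the Search Module's actual query construction is not reproduced here.

    // Render a proximity clause in Lucene classic query syntax; for example,
    // ProximityClause("warn", "understand") yields "warn understand"~100, which
    // matches documents where the two terms occur within roughly 100 positions.
    static class ProximityQuerySketch
    {
        public static string ProximityClause(string termA, string termB, int proximity = 100)
        {
            return "\"" + termA + " " + termB + "\"~" + proximity;
        }
    }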
APPENDIX B. BUILDING BOOLEAN EXPRESSIONS FROM RELATIONAL
PATHWAYS
This appendix describes how the Boolean expressions are built from the relational
pathways.
B.1. Building Boolean Phrase from Relational Pathway
The relational pathways built for each term pair of the query topic may be
comprised of 3, 4, or 5 terms. The query expansion terms are selected in such a way that
the resulting Boolean phrase requires at least one of the original terms to be present in the
document. How a Boolean phrase is constructed from a relational pathway is illustrated
in the following three figures:
Pathway length = 3
Relational pathway:     A – B – C
Boolean query phrase:   (A AND B) OR (A AND C) OR (B AND C)

Figure B.1      Boolean phrase for a relational pathway length of 3. Each parenthesized group of terms joined by AND identifies a set of terms that must all be present in the document for the document to be identified as relevant (in the original figure these sets were indicated by solid circles on the pathway).
Pathway length = 4
Relational pathway:     A – B – C – D
Boolean query phrase:   (A AND D) OR (A AND B AND C) OR (B AND C AND D)

Figure B.2      Boolean phrase for a relational pathway length of 4. Each parenthesized group of terms joined by AND identifies a set of terms that must all be present in the document for the document to be identified as relevant (in the original figure these sets were indicated by solid circles on the pathway).
Pathway length = 5
Relational pathway:     A – B – C – D – E
Boolean query phrase:   (A AND E) OR (A AND C AND D) OR (B AND C AND E) OR (A AND B AND D) OR (B AND D AND E)

Figure B.3      Boolean phrase for a relational pathway length of 5. Each parenthesized group of terms joined by AND identifies a set of terms that must all be present in the document for the document to be identified as relevant (in the original figure these sets were indicated by solid circles on the pathway).
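The term combinations shown in Figures B.1 through B.3 can be encoded directly. The sketch below hard-codes, for each pathway length, the zero-based index patterns of the terms that appear together in each parenthesized group of the figures; it is a reconstruction for illustration rather than the generation code used in the experiment.

    // Build the Boolean query phrase for a relational pathway of 3, 4, or 5 terms
    // using the term-position patterns shown in Figures B.1 to B.3 (positions 0 and
    // pathway.Count - 1 are the original query terms).
    using System;
    using System.Collections.Generic;

    static class BooleanPhraseSketch
    {
        static readonly Dictionary<int, int[][]> PatternsByLength = new Dictionary<int, int[][]>
        {
            { 3, new[] { new[] { 0, 1 }, new[] { 0, 2 }, new[] { 1, 2 } } },
            { 4, new[] { new[] { 0, 3 }, new[] { 0, 1, 2 }, new[] { 1, 2, 3 } } },
            { 5, new[] { new[] { 0, 4 }, new[] { 0, 2, 3 }, new[] { 1, 2, 4 }, new[] { 0, 1, 3 }, new[] { 1, 3, 4 } } }
        };

        public static string BuildPhrase(IList<string> pathway)
        {
            int[][] patterns;
            if (!PatternsByLength.TryGetValue(pathway.Count, out patterns))
                throw new ArgumentException("Relational pathways are 3, 4, or 5 terms long.");

            var groups = new List<string>();
            foreach (int[] indices in patterns)
            {
                var terms = new List<string>();
                foreach (int i in indices)
                    terms.Add(pathway[i]);
                groups.Add("(" + string.Join(" AND ", terms.ToArray()) + ")");
            }
            return string.Join(" OR ", groups.ToArray());
        }
    }

For example, for the 4-term pathway character – text – font – size identified earlier, BuildPhrase returns (character AND size) OR (character AND text AND font) OR (text AND font AND size).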
B.2. Building Query Expression for Query Topic
For each pair of original query terms, the search engine attempts to identify one or
more relational pathways between the term pair. The Boolean phrase for each relational
pathway is generated, and all the resulting Boolean phrases are combined into a single
full Boolean query expression.
B.2.1. Combining Multiple Relational Pathways for Term Pair
When more than one relational pathway is found for a term pair, the Boolean
phrases for each relational pathway are generated and then combined with OR operators
to form the full Boolean phrase for this term pair. Using OR operators to combine the
phrases means that if the Boolean phrase for any of the relational pathways is true, then
the entire Boolean phrase is true. This is appropriate because the concept represented by only one of the relational pathways needs to be present in the document for the term pair to be represented.
Figure B.4 illustrates the full Boolean phrase that is generated from a term pair for
which two relational pathways were identified. This figure assumes that each of the two relational pathways found between the original query terms A and E is 5 terms long.
Because a breadth-first search is used, all relational paths found will be of the same
length. Once a path is found, the search on that level is completed and then stops before it
moves to the next level of depth.
Figure B.4      Boolean phrase for term pair A and E constructed by combining the individual Boolean phrases for the term pair's two relational pathways with an OR operator.
B.2.2. Creating Full Boolean Expression for Query Topic
For query topics that include only two terms (i.e., a single pair of terms), the
Boolean phrase generated using the method for combining multiple relational pathways
for the single term pair generates the full Boolean query expression.
However, if a query topic is comprised of more than two terms, such as warnings quickly understandable, then there are three term pairs that may result in relational pathways. The three term pairs in this example are the following:
1. warnings quickly
2. warnings understandable
3. quickly understandable
The Boolean phrases generated for the relational pathways found for each of the
term pairs must be combined to form the full Boolean query expression. To do this, the
Boolean phrase generated for each term pair is combined using AND operators. The
resulting Boolean expression, therefore, requires that each term pair be represented by at
least one of its relational pathways.
Figure B.5 illustrates the full Boolean phrase that is generated from a query topic
comprised of three term pairs for which relational pathways were identified for each term
pair.
Figure B.5      Boolean phrase for query topic A B C constructed by combining the Boolean phrases for each of the term pairs with AND operators.
Not all term pairs have relational pathways. Therefore, provision is made to
ensure that all terms present in the original query topic are represented in the final full
Boolean expression. During the process of identifying relational pathways and
constructing the Boolean phrases, a record is kept for each term that is represented in the
Boolean expression. Each original query topic term that is not represented in any of the
relational pathways is added to the full Boolean expression using the AND operator.
Figure B.6 illustrates the full Boolean phrase that is generated from a query topic
comprised of three term pairs for which only one relational pathway was identified for
one term pair.
Figure B.6      Boolean phrase for query topic A B C constructed by combining the Boolean phrase generated from the relational pathway identified between the term pair A and B with the term C using an AND operator. Term C did not share a relational pathway with term A nor with term B.
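A sketch of the combination rules described in this section: the phrases for a term pair's multiple pathways are joined with OR operators, the resulting term-pair phrases are joined with AND operators, and any original query term not represented by a pathway is appended with an AND operator. The parameter shapes and helper names are illustrative assumptions.

    // Combine term-pair Boolean phrases and uncovered original terms into the
    // full Boolean query expression.
    using System.Collections.Generic;

    static class QueryExpressionSketch
    {
        public static string BuildFullExpression(
            IList<KeyValuePair<string[], List<string>>> pathwayPhrasesByTermPair,   // term pair -> pathway phrases
            IEnumerable<string> originalTerms)                                      // stemmed, stopword-filtered query terms
        {
            var clauses = new List<string>();
            var coveredTerms = new HashSet<string>();

            foreach (var pair in pathwayPhrasesByTermPair)
            {
                if (pair.Value.Count == 0)
                    continue;   // no relational pathway was found for this term pair

                // Any one of the term pair's pathways is sufficient, so join them with OR.
                clauses.Add("(" + string.Join(" OR ", pair.Value.ToArray()) + ")");
                coveredTerms.Add(pair.Key[0]);
                coveredTerms.Add(pair.Key[1]);
            }

            // Original query terms not represented in any relational pathway are ANDed in directly.
            foreach (string term in originalTerms)
                if (!coveredTerms.Contains(term))
                    clauses.Add(term);

            // Every term-pair phrase and every remaining term must be satisfied.
            return string.Join(" AND ", clauses.ToArray());
        }
    }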
APPENDIX C. QUERY TOPIC LIST
This appendix contains the complete listing of the 75 query topics that were used
in the search engine testing. The query topics are divided by those that represent tangible
concepts and those that represent intangible concepts. The query topics were developed
by content experts by reviewing query logs of the Design CoPilot™ web application and
by drawing from experience with the document collection.
C.1. Query Topics Representing Tangible Concepts
There were 40 query topics that represented tangible concepts.
1. ambient lighting intensity
2. attitude trend indicator
3. audible stall warnings
4. auxiliary power unit fire extinguishing
5. cockpit display color scheme
6. cockpit lighting at night
7. color and vibration
8. consistent label placement
9. control force and resistance considerations
10. electronic flight bag display design
11. emergency landing gear control location
12. excessive strength required to operate control
13. false resolution advisory
14. fault tolerant data entry
15. fuel feed selector control
16. hydraulic system status messages
17. icing conditions operating speeds provided in AFM
18. inadvertent activation prevention
19. information presented in peripheral visual field
20. information readable with vibration
21. instruments located in normal line of sight
22. interference from adjacent controls
23. labels readable distance
24. landing gear manual extension control design
25. low airspeed alerting
26. monochrome altimeter design
27. negative transfer issues
28. operable using only one hand
29. pilot access to circuit breakers
30. pilot response time to visual warnings
31. placard lighting
32. redundant coding methods
33. safety belt latch operation
34. sensitivity of controls
35. side stick control considerations
36. stuck microphone
37. tactile feedback
38. TCAS warnings
39. text color contrast
40. thrust reverser control design
C.2. Query Topics Representing Intangible Concepts
There were 35 query topics that represented intangible concepts.
1. acceptable message failure rate and pilots confidence in system
2. adequately usable
3. appropriate size of characters on display
4. arrangement of right seat instruments
5. audio signals annoying to the pilot
6. body position required to operate equipment
7. control design allows pilot to operate in extreme flight conditions
8. control is identifiable in the dark
9. cultural conventions switch design
10. cursor control device
11. design attributes for auditory displays
12. designed to keep pilot fatigue to a minimum
13. easily understandable system status
14. ergonomics of pilot seating
15. excessive cognitive effort
16. excessively objectionable
17. failure notifications are readily apparent
18. hard to readily find information on screen
19. how to ensure that labels are readable
20. how to improve situation awareness
21. how to provide unambiguous feedback
22. implications of automatically disengaging autopilot
23. information overload
24. intuitive display design
25. magnitude and direction of systems response to control input
26. minimal mental processing
27. needs too much attention
28. poorly organized information
29. preventing instrument reading errors
30. proper use of red and amber on displays
31. readily accessible overhead controls
32. representing self in airplane-referenced displays
33. soft controls
34. suitable menu navigation methods
35. unnecessarily distract from tasks
APPENDIX D. DOCUMENTS RETURNED COUNTS FOR QUERY TOPICS BY
SEARCH ENGINE
This appendix contains the data and calculations related to counts of documents
returned by the Baseline and Enhanced search engines.
D.1. Documents Returned Counts
The following is a table of counts of documents returned by the Baseline and
Enhanced search engines for each of the 75 query topics.
Query Topics (Baseline / Enhanced / Difference)
1 audible stall warnings: 20 / 78 / 58
2 monochrome altimeter design: 0 / 0 / 0
3 ambient lighting intensity: 21 / 21 / 0
4 attitude trend indicator: 10 / 79 / 69
5 auxiliary power unit fire extinguishing: 16 / 21 / 5
6 icing conditions operating speeds provided in AFM: 11 / 61 / 50
7 pilot access to circuit breakers: 29 / 107 / 78
8 cockpit display color scheme: 17 / 117 / 100
9 color and vibration: 30 / 109 / 79
10 control force and resistance considerations: 8 / 8 / 0
11 electronic flight bag display design: 48 / 48 / 0
12 emergency landing gear control location: 9 / 80 / 71
13 false resolution advisory: 9 / 23 / 14
14 fault tolerant data entry: 1 / 14 / 13
15 fuel feed selector control: 14 / 71 / 57
16 inadvertent activation prevention: 57 / 109 / 52
17 consistent label placement: 15 / 15 / 0
18 labels readable distance: 7 / 12 / 5
19 landing gear manual extension control design: 3 / 48 / 45
20 placard lighting: 70 / 148 / 78
21 low airspeed alerting: 30 / 118 / 88
22 negative transfer issues: 2 / 5 / 3
23 cockpit lighting at night: 61 / 61 / 0
24 instruments located in normal line of sight: 6 / 35 / 29
25 operable using only one hand: 38 / 38 / 0
26 information presented in peripheral visual field: 3 / 11 / 8
27 information readable and vibration: 12 / 44 / 32
28 pilot response time to visual warnings: 26 / 107 / 81
29 redundant coding methods: 20 / 20 / 0
30 sensitivity of controls: 105 / 105 / 0
31 side stick control considerations: 0 / 30 / 30
32 stuck microphone: 8 / 8 / 0
33 tactile feedback: 50 / 50 / 0
34 text color contrast: 19 / 54 / 35
35 interference from adjacent controls: 28 / 91 / 63
36 excessive strength required to operate control: 67 / 67 / 0
37 thrust reverser control design: 41 / 41 / 0
38 TCAS warnings: 96 / 244 / 148
39 safety belt latch operation: 19 / 54 / 35
40 hydraulic system status messages: 0 / 27 / 27
41 poorly organized information: 17 / 17 / 0
42 excessive cognitive effort: 4 / 10 / 6
43 how to improve situation awareness: 2 / 10 / 8
44 needs too much attention: 4 / 26 / 22
45 adequately usable: 88 / 88 / 0
46 intuitive display design: 60 / 60 / 0
47 cultural conventions switch design: 1 / 7 / 6
48 excessively objectionable: 30 / 30 / 0
49 readily accessible overhead controls: 0 / 0 / 0
50 easily understandable system status: 9 / 133 / 124
51 unnecessarily distract from tasks: 11 / 11 / 0
52 audio signals annoying to the pilot: 8 / 8 / 0
53 minimal mental processing: 9 / 26 / 17
54 information overload: 26 / 26 / 0
55 soft controls: 58 / 58 / 0
56 cursor control device: 57 / 57 / 0
57 body position required to operate equipment: 12 / 12 / 0
58 control is identifiable in the dark: 20 / 69 / 49
59 hard to readily find information on screen: 0 / 0 / 0
60 suitable menu navigation methods: 0 / 11 / 11
61 how to provide unambiguous feedback: 0 / 25 / 25
62 proper use of red and amber on displays: 13 / 29 / 16
63 failure notifications are readily apparent: 0 / 0 / 0
64 design attributes for auditory displays: 4 / 7 / 3
65 preventing instrument reading errors: 7 / 33 / 26
66 magnitude and direction of systems response to control input: 14 / 14 / 0
67 acceptable message failure rate and pilots confidence in system: 0 / 4 / 4
68 control design allows pilot to operate in extreme flight conditions: 8 / 8 / 0
69 arrangement of right seat instruments: 6 / 23 / 17
70 appropriate size of characters on display: 13 / 56 / 43
71 ergonomics of pilot seating: 0 / 2 / 2
72 implications of automatically disengaging autopilot: 4 / 4 / 0
73 representing self in airplane-referenced displays: 0 / 0 / 0
74 designed to keep pilot fatigue to a minimum: 3 / 3 / 0
75 how to ensure that labels are readable: 0 / 20 / 20

Table D.1       Document counts returned by Baseline and Enhanced search engine.
D.2. Significance of Difference of Documents Returned
To determine whether the difference between the Baseline and the Enhanced search engines in the number of documents returned was statistically significant, a Two Sample t-Test Assuming Equal Variances was run with an α = 0.05.
Two Sample t-Test Assuming Equal Variances

Statistical Measures        Baseline      Enhanced
Mean                        20.2          43.5
Variance                    584.23        1936.41
Observations                75            75
Pooled Variance             1260.32
t Stat                      -4.0295
P(T<=t) one-tail            4.455E-05
t Critical one-tail         1.6552
P(T<=t) two-tail            8.910E-05
t Critical two-tail         1.9761

Table D.2       Two Sample t-test Assuming Equal Variances to determine statistical significance of differences in documents returned by Baseline and Enhanced search engines.
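The t statistic in Table D.2 can be reproduced from the means, variances, and sample sizes it reports, using the standard pooled-variance two-sample formula. The short C# check below uses the rounded values shown in the table.

    // Recompute the pooled variance and t statistic from the values in Table D.2.
    using System;

    class TTestCheck
    {
        static void Main()
        {
            double meanBaseline = 20.2, meanEnhanced = 43.5;
            double varBaseline = 584.23, varEnhanced = 1936.41;
            int n = 75;   // observations per search engine

            // With equal sample sizes the pooled variance is the average of the two variances.
            double pooledVariance = ((n - 1) * varBaseline + (n - 1) * varEnhanced) / (2.0 * n - 2);
            double standardError = Math.Sqrt(pooledVariance * (1.0 / n + 1.0 / n));
            double tStat = (meanBaseline - meanEnhanced) / standardError;

            Console.WriteLine(pooledVariance);   // 1260.32, matching Table D.2
            Console.WriteLine(tStat);            // about -4.02 with the rounded means; Table D.2 reports -4.0295
        }
    }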
APPENDIX E. GRADED RELEVANCE QUERY TOPIC DEFINITIONS
This appendix contains the tables of graded relevancy definitions for the 30 query topics that were adjudicated: 14 query topics representing tangible concepts and 16 query topics representing intangible concepts.
The graded definitions were used in the adjudication process to determine the
relative level of relevancy of each document returned from each search engine. The
higher the score, the more relevant the document is to the query topic. Documents with a
score of 1 contain information that addresses all aspects of the concept represented by the
query topic. Documents with a score of 0.5 or 0.25 contain information that only address
some but not all aspects of the concept represented by the query topic. Documents with a
score of 0 are irrelevant (i.e., do not contain any information that addresses the concept
represented by the query topic). Documents that were returned by both the Baseline and
the Enhanced search engines were assumed to be relevant.
The graded relevancy descriptions were developed as an aid to consistently
adjudicate documents for the query topic and were based on the content of the documents
reviewed. In some cases, the descriptions for the 0.5 or 0.25 scores were not developed
because no documents contained information that addressed only that level of relevancy.
In addition, some query topics include specific descriptions of irrelevant topics (i.e., score
of 0) when it was deemed useful in the adjudication process to distinguish between
concepts that were fully or partially relevant and related information that was not
sufficient to be considered relevant.
236
E.1. Query Topics Representing Tangible Concepts
A sample of 14 query topics was chosen from the set of query topics for tangible
concepts in which there was a difference in performance between the Baseline and
Enhanced search engines. The following definitions were used in the adjudication process
to determine the relevancy of documents.
1. auxiliary power unit fire extinguishing
Score Description of Document Content
1.0
Discusses the design considerations related to extinguishing fires of the
Auxiliary Power Unit (APU)
0.5
Discusses the act of extinguishing a fire or the design of fire extinguishing
equipment in the flight deck
0.5
Discusses design issues related to fireproofing the APU or to fireproofing
elements of the APU
0.25
Discusses design attributes related to fire detection
0.25
Discusses extinguishing fires of other systems
0.25
Discusses design attributes related to fireproofing other systems
0
States only that fire detection systems or fire extinguishing systems must be
provided
0
States only that APUs should be provided
2. false resolution advisory
Score Description of Document Content
1.0
Discusses design attributes related to or the impact of false Resolution
Advisories (RAs) presented to the pilot
0.5
Discusses design attributes related to or the impact of false alerts presented by
the TCAS system (and not specifically RAs)
0.25
Discusses design attributes related to or the impact of false alarms, alerts, or
nuisance warnings from systems other than the TCAS
237
3. fault tolerant data entry
Score Description of Document Content
1.0
Discusses design attributes related to or needs for fault tolerant data entry
0.5
n/a
0.25
n/a
4. hydraulic system status messages
Score Description of Document Content
1.0
Discusses design attributes related to or needs for hydraulic system status
messages
0.5
Design attributes of hydraulic system indicators/displays
0.25
Design attributes of status messages for other systems
0
States only that a status message (of some system other than hydraulics) should
be provided
5. icing conditions operating speeds provided in AFM
Score Description of Document Content
1.0
Discusses the requirement that the information about operating speeds in icing
conditions should be provided in the Airplane Flight Manual (AFM)
0.5
Discusses providing information in AFM about operating speeds in general
(i.e., not specifically in icing conditions)
0.5
Discusses operating speeds in icing conditions but does not state that they
should be included in AFM
0.25
Discusses providing information in AFM about operating speeds in other
conditions (i.e., not in icing conditions)
0.25
Discusses providing information related to icing conditions but not specifically
about operating speeds
0.25
Discusses design attributes or format of information that could apply to
operating speeds (e.g., units)
0
States only that an AFM should be provided, that information should be
provided in the AFM (i.e., information other than that related to icing
conditions), or icing conditions in general
238
6. information presented in peripheral visual field
Score Description of Document Content
1.0
Discusses design attributes of information or types of information that is
presented in the pilot’s peripheral visual field
0.5
n/a
0.25
Discusses parameters to define the primary visual field and the peripheral
visual field
0
States only that information should be provided in the primary visual field
7. information readable with vibration
Score Description of Document Content
1.0
Discusses readability issues caused by vibration
0.5
n/a
0.25
Discusses other visual issues related to vibration (e.g., eye fatigue)
0
Discusses non-visually-specific issues related to vibration (e.g., general
physical fatigue)
8. instruments located in normal line of sight
Score Description of Document Content
1.0
Discusses instruments that should be located in the pilot’s normal line of sight
or primary field of view
0.5
n/a
0.25
Discusses the location of instruments with respect to pilots and their visibility
requirements
239
9. labels readable distance
Score Description of Document Content
1.0
Discusses how distance from the pilot impacts the readability of a label
0.5
Describes design attributes related to the placement of a label and the impact of the
label's location on whether it can be seen or read by the pilot
0.25
n/a
10. landing gear manual extension control design
Score Description of Document Content
1.0
Discusses design attributes related to the manual landing gear extension control
0.5
Discusses the requirement that a landing gear manual extension control is
provided
0.5
Discusses design attributes related to landing gear controls in general
(including position indicator markings on controls)
0.25
Discusses the entire extension/retracting system (which includes the control to operate
it). This does not include general statements about the landing gear system that do not
specifically reference the extension/retracting mechanism
0
Discusses landing gear in general, such as landing gear failures and landing
gear position (and does not specifically discuss the design of the control that allows
the pilot to operate the landing gear)
11. negative transfer issues
Score Description of Document Content
1.0
Discusses the impact of negative transfer on pilot performance or design
attributes that may cause negative transfer issues to occur
0.5
n/a
0.25
n/a
240
12. safety belt latch operation
Score Description of Document Content
1.0
Discusses the operation or design attributes of the safety belt latch (other terms
that may be used include seat belt, safety harness, shoulder harness, pilot
restraint system, buckle, fastener)
0.5
n/a
0.25
Discusses other design attributes of the safety belt (i.e., not specific to the
latching mechanism)
0
Operation of other sorts of latches such as doors
13. side stick control considerations
Score Description of Document Content
1.0
Discusses design attributes of the side stick control and their impact on pilot
performance
0.5
n/a
0.25
Discusses design attributes of the stick control (including stick control forces)
0
Discusses design attributes or use of joysticks as cursor control devices
14. text color contrast
Score Description of Document Content
1.0
Discusses use of color contrast on text or alphanumeric characters
0.5
n/a
0.25
Discusses attributes of color that impact readability and perception of text (e.g.,
ambient light impacting choice of saturation of color to be used, use of colors
that are distinguishable from one another)
0.25
Use of contrast such as reverse video
0
Discusses design attributes other than those related to perception and
readability to be considered when using color on a display (e.g., attention-getting
qualities or color-coding)
241
E.2. Query Topics Representing Intangible Concepts
A sample of 16 query topics was chosen from the set of query topics for
intangible concepts in which there was a difference in performance between the Baseline
and Enhanced search engines. The following definitions were used in the adjudication
process to determine the relevancy of documents.
15. acceptable message failure rate and pilots confidence in system
Score Description of Document Content
1.0
Discusses acceptable rates of failure of messages presented to the pilot and the
impact of nuisance warnings or false alarms on the pilot's confidence in the
system
0.5
n/a
0.25
Discusses nuisance warnings and false alarms
16. appropriate size of characters on display
Score Description of Document Content
1.0
Discusses design attributes related to the appropriate size of characters on a
display
0.5
n/a
0.25
Discusses other appropriate design attributes related to displaying characters
on a display (i.e., not character size)
17. arrangement of right seat instruments
Score Description of Document Content
1.0
Discusses the arrangement of instruments for the right seat pilot (i.e., first
officer seat)
0.5
Discusses other design attributes related to the right seat versus the left seat of
the flight deck
0.25
Discusses the general arrangement of instruments that could apply to right seat
242
18. control is identifiable in the dark
Score Description of Document Content
1.0
Discusses design attributes related to allowing a control to be identified in the
dark or specific controls that must be identifiable in the dark
0.5
n/a
0.25
Discusses design attributes that make control identifiable
19. cultural conventions switch design
Score Description of Document Content
1.0
Discusses the impact of cultural conventions on control design
0.5
Discusses the impact of cultural conventions on display or other equipment
design
0.25
n/a
20. design attributes for auditory displays
Score Description of Document Content
1.0
Discusses design attributes of auditory displays
0.5
n/a
0.25
n/a
21. ergonomics of pilot seating
Score Description of Document Content
1.0
Discusses the ergonomics of pilot seating
0.5
n/a
0.25
Lists reference resources specific to ergonomics of pilot seating
243
22. excessive cognitive effort
Score Description of Document Content
1.0
Discusses design attributes and tasks related to addressing the avoidance of
excessive cognitive effort
0.5
n/a
0.25
n/a
23. how to ensure that labels are readable
Score Description of Document Content
1.0
Discusses design attributes related to ensuring that labels are readable by
the pilot
0.5
n/a
0.25
Discusses other design attributes related to making labels usable
0
States only that a label should be provided or that a component should be
properly labeled
24. how to improve situation awareness
Score Description of Document Content
1.0
Discusses design attributes or instruments that improve a pilot’s situation
awareness
0.5
Discusses design attributes or instruments that provide information about the
current conditions that allow pilots to respond in an appropriate and timely
manner
0.25
n/a
25. how to provide unambiguous feedback
Score Description of Document Content
1.0
Discusses design attributes related to providing the pilot unambiguous
feedback (i.e., the feedback is clearly understandable)
0.5
n/a
0.25
n/a
244
26. minimal mental processing
Score Description of Document Content
1.0
Discusses design attributes or tasks related to requiring only minimal mental
(or cognitive) processing from the pilot
0.5
Discusses design attributes or tasks related to reducing mental effort or
avoiding excessive mental (or cognitive) processing from the pilot
0.25
Discusses design attributes or tasks related to mental or cognitive effort
27. needs too much attention
Score Description of Document Content
1.0
Discusses design attributes and tasks related to addressing a component or
situation that requires too much attention
0.5
Discusses the level of attention required to perform a task or operate a
component
0.25
Discusses design attributes that impact the level of attention required
28. preventing instrument reading errors
Score Description of Document Content
1.0
Discusses design attributes related to preventing instrument reading errors (i.e.,
any error related to visual, tactile, or auditory perception of the state, setting, or
value of a display, control, or other equipment)
0.5
Discusses design attributes related to ensuring that instruments are readable.
0.5
Discusses design attributes related to preventing data interpretation errors.
(Note: The idea of “reading errors” can be used in the broad sense to include
interpreting the data that has been read on a display)
0.25
Discusses display design attributes or importance of preventing errors in
general (i.e., not specifically reading errors)
0.25
Discusses display design attributes related to readability
0
Discusses errors in general
245
29. proper use of red and amber on displays
Score Description of Document Content
1.0
Discusses the correct usage of both the colors red and amber on a flight deck
display
0.5
Discusses the correct usage of either red or amber on a flight deck display (but
not both colors)
0.25
Discusses the use of an appropriate color coding philosophy in display design
30. suitable menu navigation methods
Score Description of Document Content
1.0
Discusses design attributes of menus to ensure that the menu navigation
methods required are appropriate
0.5
n/a
0.25
n/a
246
APPENDIX F. BINARY RELEVANCE DATA AND CALCULATIONS
This appendix contains the table of binary relevancy counts and the calculated
values for each query topic included in the samples adjudicated. Binary relevance was
determined by converting any non-zero graded relevance value assigned to a document
(i.e., a graded relevance of 0.25, 0.5, 1) to 1. All relevance values of zero remain a value
of zero.
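As a minimal sketch, this conversion can be expressed in a single line of Python (the function name is illustrative):

```python
def to_binary_relevance(graded_scores):
    """Map graded relevance scores (0, 0.25, 0.5, 1) to binary relevance (0 or 1)."""
    # Any non-zero graded score becomes relevant (1); zero remains irrelevant (0).
    return [1 if score > 0 else 0 for score in graded_scores]

# Example: to_binary_relevance([1, 0.5, 0.25, 0]) returns [1, 1, 1, 0]
```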
The information is divided by those query topics that represent tangible concepts
and those that represent intangible concepts. This information also identifies the query
topics for which the Enhanced search engine performed the same as the Baseline (e.g.,
when no relational pathways were found for any of the term pairs in the query topic) and
those for which the performance differed (i.e., the Enhanced search engine returned
additional documents).
F.1. Data for Query Topics Representing Tangible Concepts
There were 40 total query topics that represented tangible concepts. There were
13 query topics in which the Baseline and the Enhanced search engine each returned the
same set of documents, and there were 27 query topics in which the Enhanced search
engine returned additional documents that the Baseline did not return. Of the 27 query
topics in which there was a difference in performance between the Baseline and
Enhanced search engine, a sample of 14 were adjudicated. The results from each are
included in the sections below.
247
F.1.1. Adjudicated Query Topics for Tangible Concepts
A sample of 14 query topics was chosen from the set of query topics for tangible
concepts in which there was a difference in performance between the Baseline and
Enhanced search engines. These query topics were adjudicated to determine the
relevancy of the returned documents. Table F.1 provides the data and calculations for
these query topics.
Table F.1    Adjudicated Query Topics for Tangible Concepts with Recall*,
             Precision, and F-measure* calculations.

Tangible Concepts Query Sample Set

    Query Topic                                         Total Relevant | Baseline: Total, Relevant, Irrelevant, Recall*, Precision, F-measure* | Enhanced: Total, Relevant, Irrelevant, Recall*, Precision, F-measure*
 1  auxiliary power unit fire extinguishing                  21        |   16   16    0   0.76   1.00   0.86 |   21   21    0   1.00   1.00   1.00
 2  false resolution advisory                                20        |    9    9    0   0.45   1.00   0.62 |   23   20    3   1.00   0.87   0.93
 3  fault tolerant data entry                                 4        |    1    1    0   0.25   1.00   0.40 |   14    4   10   1.00   0.29   0.44
 4  hydraulic system status messages                          7        |    0    0    0   0.00   0.00   0.00 |   27    7   20   1.00   0.26   0.41
 5  icing conditions operating speeds provided in AFM        46        |   11   11    0   0.24   1.00   0.39 |   61   46   15   1.00   0.75   0.86
 6  information presented in peripheral visual field         11        |    3    3    0   0.27   1.00   0.43 |   11   11    0   1.00   1.00   1.00
 7  information readable with vibration                      25        |   12   12    0   0.48   1.00   0.65 |   44   25   19   1.00   0.57   0.72
 8  instruments located in normal line of sight              28        |    6    6    0   0.21   1.00   0.35 |   35   28    7   1.00   0.80   0.89
 9  labels readable distance                                 10        |    7    7    0   0.70   1.00   0.82 |   12   10    2   1.00   0.83   0.91
10  landing gear manual extension control design             27        |    3    3    0   0.11   1.00   0.20 |   48   27   21   1.00   0.56   0.72
11  negative transfer issues                                  2        |    2    2    0   1.00   1.00   1.00 |    5    2    3   1.00   0.40   0.57
12  safety belt latch operation                              37        |   19   19    0   0.51   1.00   0.68 |   54   37   17   1.00   0.69   0.81
13  side stick control considerations                        21        |    0    0    0   0.00   0.00   0.00 |   30   21    9   1.00   0.70   0.82
14  text color contrast                                      38        |   19   19    0   0.50   1.00   0.67 |   54   38   16   1.00   0.70   0.83

    Average:  Baseline Recall* 0.39, F-measure* 0.51; Enhanced Precision 0.67, F-measure* 0.78
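The recall*, precision, and F-measure* values in Table F.1 follow directly from the Total, Relevant, and Total Relevant columns, with the pooled relevant count (relevant documents identified by either engine) serving as the recall* denominator. The sketch below shows that calculation under this reading of the table; the function name is illustrative.

```python
def search_metrics(relevant_returned, total_returned, total_relevant):
    """Recall*, precision, and F-measure* for one query topic.

    total_relevant is the pooled number of relevant documents identified by the
    Baseline and Enhanced engines together (the Total Relevant column).
    """
    recall_star = relevant_returned / total_relevant if total_relevant else 0.0
    precision = relevant_returned / total_returned if total_returned else 0.0
    denom = recall_star + precision
    f_measure_star = (2 * recall_star * precision / denom) if denom else 0.0
    return recall_star, precision, f_measure_star

# Query topic 1, Baseline engine: search_metrics(16, 16, 21)
# yields approximately (0.76, 1.00, 0.86), matching the first row of Table F.1.
```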
F.1.2. Not Adjudicated Query Topics for Tangible Concepts
Of the 40 query topics that represented tangible concepts, there were a total of 26
query topics that were not adjudicated. The details of the query topics that were not
adjudicated are outlined below.
251
F.1.2.1. No Performance Difference For Tangible Concepts
There were 13 query topics representing tangible concepts in which the Baseline
and the Enhanced search engine each returned the same set of documents (i.e., there was
no difference in performance between the Baseline and the Enhanced search engines).
These query topics are listed in Table F.2. This occurred when no relational pathways
were found for any of the term pairs in the query topic or when the relational pathways
identified did not yield an expanded query that generated additional document matches.
     Query Topic                                         Baseline Total   Enhanced Total   Difference
 1   ambient lighting intensity                                21               21              0
 2   cockpit lighting at night                                 61               61              0
 3   consistent label placement                                15               15              0
 4   control force and resistance considerations                8                8              0
 5   electronic flight bag display design                      48               48              0
 6   excessive strength required to operate control            67               67              0
 7   monochrome altimeter design                                0                0              0
 8   operable using only one hand                              38               38              0
 9   redundant coding methods                                  20               20              0
10   sensitivity of controls                                  105              105              0
11   stuck microphone                                           8                8              0
12   tactile feedback                                          50               50              0
13   thrust reverser control design                            41               41              0

Table F.2    Query topics for Tangible Concepts with no performance difference.
252
F.1.2.2. Difference in Performance For Tangible Concepts
Although there was a difference in performance between the Baseline and
Enhanced search engines in the query topics representing tangible concepts listed in
Table F.3, these query topics were not part of the sample that was adjudicated because
the difference in performance exceeded the difference threshold of 50 documents.
     Query Topic                                         Baseline Total   Enhanced Total   Difference
 1   attitude trend indicator                                  10               79             69
 2   audible stall warnings                                    20               78             58
 3   pilot access to circuit breakers                          29              107             78
 4   cockpit display color scheme                              17              117            100
 5   color and vibration                                       30              109             79
 6   emergency landing gear control location                    9               80             71
 7   fuel feed selector control                                14               71             57
 8   inadvertent activation prevention                         57              109             52
 9   placard lighting                                          70              148             78
10   low airspeed alerting                                     30              118             88
11   pilot response time to visual warnings                    26              107             81
12   interference from adjacent controls                       28               91             63
13   TCAS warnings                                             96              244            148

Table F.3    Query topics for Tangible Concepts with a performance difference but
             not included in the Tangible Concepts Query Sample Set.
253
F.2. Data for Query Topics Representing Intangible Concepts
There were 35 total query topics that represented intangible concepts. There were
18 query topics in which the Baseline and the Enhanced search engine each returned the
same set of documents, and there were 17 query topics in which the Enhanced search
engine returned additional documents that the Baseline did not return. Of the 17 query
topics in which there was a difference in performance between the Baseline and
Enhanced search engine, a sample of 16 were adjudicated. The results from each are
included in the sections below.
F.2.1. Adjudicated Query Topics for Intangible Concepts
A sample of 16 query topics was chosen from the set of query topics for
intangible concepts in which there was a difference in performance between the Baseline
and Enhanced search engines. These query topics were adjudicated to determine the
relevancy of the returned documents. Table F.4 provides the data and calculations for
these query topics.
254
Table F.4    Adjudicated Query Topics for Intangible Concepts with Recall*,
             Precision, and F-measure* calculations.

Intangible Concepts Query Sample Set

    Query Topic                                                        Total Relevant | Baseline: Total, Relevant, Irrelevant, Recall*, Precision, F-measure* | Enhanced: Total, Relevant, Irrelevant, Recall*, Precision, F-measure*
 1  acceptable message failure rate and pilots confidence in system          3        |    0    0    0   0.00   0.00   0.00 |    4    3    1   1.00   0.75   0.86
 2  appropriate size of characters on display                               56        |   13   13    0   0.23   1.00   0.38 |   56   56    0   1.00   1.00   1.00
 3  arrangement of right seat instruments                                   19        |    6    6    0   0.32   1.00   0.48 |   23   19    4   1.00   0.83   0.90
 4  control is identifiable in the dark                                     56        |   20   20    0   0.36   1.00   0.53 |   69   56   13   1.00   0.81   0.90
 5  cultural conventions switch design                                       6        |    1    1    0   0.17   1.00   0.29 |    7    6    1   1.00   0.86   0.92
 6  design attributes for auditory displays                                  7        |    4    4    0   0.57   1.00   0.73 |    7    7    0   1.00   1.00   1.00
 7  ergonomics of pilot seating                                              2        |    0    0    0   0.00   0.00   0.00 |    2    2    0   1.00   1.00   1.00
 8  excessive cognitive effort                                              10        |    4    4    0   0.40   1.00   0.57 |   10   10    0   1.00   1.00   1.00
 9  how to ensure that labels are readable                                  16        |    0    0    0   0.00   0.00   0.00 |   20   16    4   1.00   0.80   0.89
10  how to improve situation awareness                                      10        |    2    2    0   0.20   1.00   0.33 |   10   10    0   1.00   1.00   1.00
11  how to provide unambiguous feedback                                     25        |    0    0    0   0.00   0.00   0.00 |   25   25    0   1.00   1.00   1.00
12  minimal mental processing                                               24        |    9    9    0   0.38   1.00   0.55 |   26   24    2   1.00   0.92   0.96
13  needs too much attention                                                24        |    4    4    0   0.17   1.00   0.29 |   26   24    2   1.00   0.92   0.96
14  preventing instrument reading errors                                    30        |    7    7    0   0.23   1.00   0.38 |   33   30    3   1.00   0.91   0.95
15  proper use of red and amber on displays                                 28        |   13   13    0   0.46   1.00   0.63 |   29   28    1   1.00   0.97   0.98
16  suitable menu navigation methods                                         3        |    0    0    0   0.00   0.00   0.00 |   11    3    8   1.00   0.27   0.43

    Average:  Baseline Recall* 0.22, F-measure* 0.32; Enhanced Precision 0.88, F-measure* 0.92
256
F.2.2. Not Adjudicated Query Topics For Intangible Concepts
Of the 35 query topics that represented intangible concepts, there were a total of
19 query topics that were not adjudicated. The details of the query topics that were not
adjudicated are outlined below.
F.2.2.1. No Performance Difference For Intangible Concepts
There were 18 query topics representing intangible concepts in which the
Baseline and the Enhanced search engine each returned the same set of documents (i.e.,
there was no difference in performance between the Baseline and the Enhanced search
engines). These query topics are listed in Table F.5. This occurred when no relational
pathways were found for any of the term pairs in the query topic or when the relational
pathways identified did not yield an expanded query that generated additional document
matches.
257
     Query Topic                                                           Baseline Total   Enhanced Total   Difference
 1   adequately usable                                                           88               88              0
 2   audio signals annoying to the pilot                                          8                8              0
 3   body position required to operate equipment                                 12               12              0
 4   control design allows pilot to operate in extreme flight conditions          8                8              0
 5   cursor control device                                                       57               57              0
 6   designed to keep pilot fatigue to a minimum                                  3                3              0
 7   excessively objectionable                                                   30               30              0
 8   failure notifications are readily apparent                                   0                0              0
 9   hard to readily find information on screen                                   0                0              0
10   implications of automatically disengaging autopilot                          4                4              0
11   information overload                                                        26               26              0
12   intuitive display design                                                    60               60              0
13   magnitude and direction of systems response to control input                14               14              0
14   poorly organized information                                                17               17              0
15   readily accessible overhead controls                                         0                0              0
16   representing self in airplane-referenced displays                            0                0              0
17   soft controls                                                               58               58              0
18   unnecessarily distract from tasks                                           11               11              0

Table F.5    Query topics for Intangible Concepts with no performance difference.
258
F.2.2.2. Difference in Performance For Intangible Concepts
Although there was a difference in performance between the Baseline and
Enhanced search engines for the query topic representing intangible concepts listed in
Table F.6, this query topic was not part of the sample that was adjudicated because the
difference in performance exceeded the difference threshold of 50 documents.
     Query Topic                                   Baseline Total   Enhanced Total   Difference
 1   easily understandable system status                  9              133             124

Table F.6    Query topics for Intangible Concepts with a performance difference but
             not included in the Intangible Concepts Query Sample Set.
259
APPENDIX G. BINARY RELEVANCE SIGNIFICANCE TESTS
This appendix contains the data and calculations used to determine the
significance of the differences identified in the binary relevance data gathered. Binary
relevance was determined by converting any non-zero graded relevance value assigned to
a document (i.e., a graded relevance of 0.25, 0.5, 1) to 1. All relevance values of zero
remain a value of zero.
G.1. Difference in Recall* between Baseline and Enhanced Search Engines
To determine whether the difference in recall* between the Baseline and the
Enhanced search engines was statistically significant, a Two Sample t-Test Assuming Equal
Variances was run with a significance level of α=0.05. The data and results are presented
in Figure G.1.
260
Recall* Significance Calculations

Recall* values (Baseline): 0.76, 0.45, 0.25, 0.00, 0.24, 0.27, 0.48, 0.21, 0.70, 0.11,
1.00, 0.51, 0.00, 0.50, 0.00, 0.23, 0.32, 0.36, 0.17, 0.57, 0.00, 0.40, 0.00, 0.20, 0.00,
0.38, 0.17, 0.23, 0.46, 0.00
Recall* values (Enhanced): 1.00 for each of the 30 query topics

Two Sample t-Test Assuming Equal Variances

Statistical Measures     Baseline    Enhanced
Mean                     0.299       1
Variance                 0.064       0
Observations             30          30
Pooled Variance          0.032
t Stat                   -15.164
P(T<=t) one-tail         3.830E-22
t Critical one-tail      1.672
P(T<=t) two-tail         7.660E-22
t Critical two-tail      2.002

Interpretation of Statistical Test:
The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference
between the recall* of the Baseline and the Enhanced search engines. Because the absolute
value of the t Stat is greater than t Critical two-tail value, we can reject the Null
Hypothesis. We can draw this same conclusion by looking at the P(T<=t) two-tail that
represents the probability that the Null Hypothesis is true. The P(T<=t) two-tail value is
less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the
Null Hypothesis, we can conclude that there is a statistically significant difference
between the recall* of the Baseline and Enhanced search engines.

Figure G.1   Two Sample t-Test to test the recall* difference between Baseline and
             Enhanced search engines.
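For reference, the statistics reported in Figure G.1 follow the standard pooled-variance form of the two-sample t-test. The formulas below are the textbook expressions (not reproduced from the dissertation), with subscripts B and E denoting the Baseline and Enhanced samples:

s_p^2 = \frac{(n_B - 1)\,s_B^2 + (n_E - 1)\,s_E^2}{n_B + n_E - 2},
\qquad
t = \frac{\bar{x}_B - \bar{x}_E}{\sqrt{s_p^2 \left( \frac{1}{n_B} + \frac{1}{n_E} \right)}}

Substituting the values in Figure G.1 (\bar{x}_B = 0.299, \bar{x}_E = 1, s_p^2 = 0.032, n_B = n_E = 30) gives t ≈ -15.2, consistent with the reported t Stat of -15.164.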
261
G.2. Difference in F-measure* between Baseline and Enhanced Search Engines
To determine whether the difference in search performance between the Baseline
and the Enhanced search engines was statistically significant, a Two Sample t-Test
Assuming Equal Variances was run with a significance level of α=0.05. Search
performance was calculated using the F-measure* in order to balance equally the
contribution of recall* and precision. The data and results are presented in Figure G.2.
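The F-measure* used here is the usual harmonic mean of the two quantities, written below in LaTeX for reference (this is the standard formula, not quoted from the dissertation):

F^{*} = \frac{2 \cdot \mathrm{recall}^{*} \cdot \mathrm{precision}}{\mathrm{recall}^{*} + \mathrm{precision}}

For example, the Baseline result for query topic 1 in Table F.1 (recall* = 0.76, precision = 1.00) gives F* = 2(0.76)(1.00)/1.76 ≈ 0.86.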
262
F-measure* Significance Calculations

F-measure* values (Baseline): 0.86, 0.62, 0.40, 0.00, 0.39, 0.43, 0.65, 0.35, 0.82, 0.20,
1.00, 0.68, 0.00, 0.67, 0.00, 0.38, 0.48, 0.53, 0.29, 0.73, 0.00, 0.57, 0.00, 0.33, 0.00,
0.55, 0.29, 0.38, 0.63, 0.00
F-measure* values (Enhanced): 1.00, 0.93, 0.44, 0.41, 0.85, 1.00, 0.72, 0.89, 0.91, 0.72,
0.57, 0.81, 0.82, 0.83, 0.86, 1.00, 0.90, 0.90, 0.92, 1.00, 1.00, 1.00, 0.89, 1.00, 1.00,
0.96, 0.96, 0.95, 0.98, 0.43

Two Sample t-Test Assuming Equal Variances

Statistical Measures     Baseline    Enhanced
Mean                     0.407       0.856
Variance                 0.084       0.031
Observations             30          30
Pooled Variance          0.058
t Stat                   -7.228
P(T<=t) one-tail         6.071E-10
t Critical one-tail      1.6716
P(T<=t) two-tail         1.214E-09
t Critical two-tail      2.002

Interpretation of Statistical Test:
The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference
between the performance of the Baseline and the Enhanced search engines. Search
performance is determined using the F-measure* as the harmonic mean of recall* and
precision. The greater the F-measure* value, the better the performance. Because the
absolute value of the t Stat is greater than t Critical two-tail value, we can reject the
Null Hypothesis. We can draw this same conclusion by looking at the P(T<=t) two-tail that
represents the probability that the Null Hypothesis is true. The P(T<=t) two-tail value is
less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the
Null Hypothesis, we can conclude that there is a statistically significant difference
between the Baseline and Enhanced search engine F-measure* values. The F-measure* value of
the Enhanced search engine is statistically greater than the Baseline and, therefore,
performs better than the Baseline search engine.

Figure G.2   Two Sample t-Test to test the F-measure* difference between Baseline
             and Enhanced search engines.
263
G.3. Difference in F-measures* between Tangible and Intangible Concepts
To determine whether the F-measure* values for the query topics that represent
tangible concepts and those that represent intangible concepts are statistically different,
a Single Factor ANOVA test was run with a significance level of α=0.05 for the Baseline search
engine and for the Enhanced search engine.
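A minimal sketch of this comparison using SciPy is shown below; tangible_f and intangible_f are assumed to be lists of F-measure* values for the two concept groups (the function and variable names are illustrative).

```python
from scipy import stats

def compare_concept_groups(tangible_f, intangible_f, alpha=0.05):
    """Single-factor ANOVA on F-measure* values for tangible vs. intangible topics."""
    f_stat, p_value = stats.f_oneway(tangible_f, intangible_f)
    return f_stat, p_value, p_value < alpha  # reject the null hypothesis when p < alpha
```

Applied to the 14 tangible and 16 intangible Baseline values in Figure G.3, this should reproduce F ≈ 3.2 (p ≈ 0.08); applied to the Enhanced values in Figure G.4, F ≈ 5.6 (p ≈ 0.03).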
264
G.3.1. Difference between Tangible and Intangible Concepts for Baseline Search
Engine
F-measure* Values, Baseline Search Engine
Tangible:    0.86, 0.62, 0.40, 0.00, 0.39, 0.43, 0.65, 0.35, 0.82, 0.20, 1.00, 0.68, 0.00, 0.67
Intangible:  0.00, 0.38, 0.48, 0.53, 0.29, 0.73, 0.00, 0.57, 0.00, 0.33, 0.00, 0.55, 0.29, 0.38, 0.63, 0.00

Summary Statistics for Baseline Search Engine
Groups        Count    Sum      Average   Variance
Tangible       14      7.070     0.505     0.094
Intangible     16      5.145     0.322     0.065

ANOVA Single Factor
Source of Variation     F        P-value   F critical
Between Groups          3.212    0.084     4.196

Interpretation of Statistical Test:
The Null Hypothesis for this Single Factor ANOVA test is that there is no statistical
difference between the performance of the Baseline search engine for the query topics that
represent tangible concepts and the query topics that represent intangible concepts.
Because the value of F is less than F critical value, we cannot reject the Null Hypothesis.
Because we cannot reject the Null Hypothesis, we must conclude that there is no
statistically significant difference between the performance of the Baseline search engine
on the query topics that represent tangible concepts and the query topics that represent
intangible concepts.

Figure G.3   ANOVA Single Factor to test the F-measure* difference between Tangible
             and Intangible Concepts Query Sample Sets on the Baseline search engine.
265
G.3.2. Difference between Tangible and Intangible Concepts for Enhanced Search
Engine
F-measure* Values, Enhanced Search Engine
Tangible:    1.00, 0.93, 0.44, 0.41, 0.86, 1.00, 0.72, 0.89, 0.91, 0.72, 0.57, 0.81, 0.82, 0.83
Intangible:  0.86, 1.00, 0.90, 0.90, 0.92, 1.00, 1.00, 1.00, 0.89, 1.00, 1.00, 0.96, 0.96, 0.95, 0.98, 0.43

Summary Statistics for Enhanced Search Engine
Groups        Count    Sum       Average   Variance
Tangible       14      10.923     0.780     0.035
Intangible     16      14.753     0.922     0.020

ANOVA Single Factor
Source of Variation     F        P-value   F critical
Between Groups          5.599    0.025     4.196

Interpretation of Statistical Test:
The Null Hypothesis for this Single Factor ANOVA test is that there is no statistical
difference between the performance of the Enhanced search engine on the query topics that
represent tangible concepts and the query topics that represent intangible concepts.
Because the value of F is greater than F critical value, we can reject the Null Hypothesis.
By rejecting the Null Hypothesis, we can conclude that there is a statistically significant
difference between the performance of the Enhanced search engine on the query topics that
represent tangible concepts and the query topics that represent intangible concepts.

Figure G.4   ANOVA Single Factor to test the F-measure difference between Tangible
             and Intangible Concepts Query Sample Sets on the Enhanced search engine.
266
APPENDIX H. GRADED RELEVANCE DATA
This appendix contains the table of graded relevancy score counts for each query
topic included in the samples adjudicated. The graded relevance identifies the level of
relevancy of each document returned in the results generated by each search engine. The
higher the number, the more relevant the document is to the query topic. The graded
definitions used for each query topic adjudicated can be found in Appendix E.
H.1. Graded Relevance Data of Query Topics for Tangible Concepts
A sample of 14 query topics was chosen from the set of query topics for tangible
concepts in which there was a difference in performance between the Baseline and
Enhanced search engines. These query topics were adjudicated to determine the
relevancy of the returned documents, and the graded relevancy scores were captured for
each document returned in the Enhanced search engine result set. Table H.1 provides the
counts for each of the graded relevancy scores for the adjudicated Tangible Concepts
Query Sample Set.
267
Enhanced Search Engine
                                                                             Graded Relevancy Score
     Query Topics Representing Tangible Concepts         Total Returned      1     0.5    0.25     0
 1   auxiliary power unit fire extinguishing                   21           16      4      1       0
 2   false resolution advisory                                 23           10      1      9       3
 3   fault tolerant data entry                                 14            4      0      0      10
 4   hydraulic system status messages                          27            1      0      6      20
 5   icing conditions operating speeds provided in AFM         61           16      7     23      15
 6   information presented in peripheral visual field          11           11      0      0       0
 7   information readable with vibration                       44           20      0      5      19
 8   instruments located in normal line of sight               35           25      1      2       7
 9   labels readable distance                                  12            7      3      0       2
10   landing gear manual extension control design              48           17      8      2      21
11   negative transfer issues                                   5            2      0      0       3
12   safety belt latch operation                               54           32      0      5      17
13   side stick control considerations                         30            6      0     15       9
14   text color contrast                                       54           35      0      3      16

Table H.1    Graded Relevance Data for Tangible Concepts Query Sample Set.
268
H.2. Graded Relevance Data of Query Topics for Intangible Concepts
A sample of 16 query topics was chosen from the set of query topics for
intangible concepts in which there was a difference in performance between the Baseline
and Enhanced search engines. These query topics were adjudicated to determine the
relevancy of the returned documents, and the graded relevancy scores were captured for
each document returned in the Enhanced search engine result set. Table H.2 provides the
counts for each of the graded relevancy scores for the adjudicated Intangible Concepts
Query Sample Set.
269
Enhanced Search Engine
                                                                                        Graded Relevancy Score
     Query Topics Representing Intangible Concepts                  Total Returned      1     0.5    0.25     0
 1   acceptable message failure rate and pilots confidence in system       4            2      0      1       1
 2   appropriate size of characters on display                            56           49      0      7       0
 3   arrangement of right seat instruments                                23            7      4      8       4
 4   control is identifiable in the dark                                  69           30      5     21      13
 5   cultural conventions switch design                                    7            2      4      0       1
 6   design attributes for auditory displays                               7            7      0      0       0
 7   ergonomics of pilot seating                                           2            0      0      2       0
 8   excessive cognitive effort                                           10            9      0      1       0
 9   how to ensure that labels are readable                               20            9      0      7       4
10   how to improve situation awareness                                   10            7      3      0       0
11   how to provide unambiguous feedback                                  25           24      1      0       0
12   minimal mental processing                                            26           11     11      2       2
13   needs too much attention                                             26           12      5      7       2
14   preventing instrument reading errors                                 33           17      9      4       3
15   proper use of red and amber on displays                              29           22      2      4       1
16   suitable menu navigation methods                                     11            3      0      0       8

Table H.2    Graded Relevance Data for Intangible Concepts Query Sample Set.
270
APPENDIX I.
SENSITIVITY ANALYSIS FOR IMPACT OF UNFOUND
RELEVANT DOCUMENTS
This appendix contains the data and calculations used to perform the sensitivity
analysis to unfound relevant documents. Three different levels of sensitivity were
analyzed to assess the impact of unfound relevant documents on the performance
measures used in this experiment. To perform the sensitivity analysis, unfound relevant
documents were estimated based on the level of sensitivity and rounded up to the next
whole number. These unfound relevant document estimates were then used to estimate
what the total number of relevant documents would be in the collection by adding the
number of relevant documents for the query topic identified by the Baseline and
Enhanced search engines to the estimated number of unfound relevant documents at that
sensitivity level.
Using the estimated total relevant documents for the given sensitivity level, the
recall* was calculated for each of the 30 sample query topics. Next, a Two Sample t-Test
Assuming Equal Variances was run with a significance level of α=0.05 to determine
whether the difference in the recall* between the Baseline and the Enhanced search
engines was statically significant. Finally, the conclusion drawn related to the
significance of the difference in recall* values was compared to the conclusion drawn in
the original calculations (i.e., those performed without the addition of estimated unfound
relevant documents).
Next, the estimated total relevant documents for the given sensitivity level was
used to calculate the F-measure* for each of the 30 sample query topics. A Two Sample
t-Test Assuming Equal Variances was then run with a significance level of α=0.05 to
determine whether the difference in the F-measure* between the Baseline and the
Enhanced search engines was statistically significant. Finally, the conclusion drawn related
to the significance of the difference in F-measure* values was compared to the
conclusion drawn in the original calculations (i.e., those performed without the addition
of estimated unfound relevant documents).
The data and calculations for each of the three sensitivity levels are presented in
the following sections.
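The per-topic recalculation can be summarized by the following sketch, which assumes the counts from Appendix F are available for each query topic (the function and parameter names are illustrative):

```python
import math

def sensitivity_metrics(relevant_returned, total_returned, pooled_relevant, multiplier):
    """Recall*, precision, and F-measure* after adding estimated unfound relevant documents.

    multiplier is 0.25, 2, or 10 for Sensitivity Levels 1-3; the unfound estimate is
    rounded up to the next whole number, as described above.
    """
    unfound = math.ceil(multiplier * pooled_relevant)
    estimated_total_relevant = pooled_relevant + unfound
    recall_star = (relevant_returned / estimated_total_relevant
                   if estimated_total_relevant else 0.0)
    precision = relevant_returned / total_returned if total_returned else 0.0
    denom = recall_star + precision
    f_measure_star = (2 * recall_star * precision / denom) if denom else 0.0
    return recall_star, precision, f_measure_star

# Query topic 1, Enhanced engine, Level 1 (0.25X): sensitivity_metrics(21, 21, 21, 0.25)
# yields approximately (0.778, 1.000, 0.875), matching the first row of Table I.1.
```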
I.1. Level 1 Sensitivity To Unfound Relevant Documents (0.25X)
At the first sensitivity level, the unfound relevant documents were estimated to be
a quarter of the number of relevant documents identified by the Baseline and Enhanced
search engines. Tables I.1 and I.2 provide the data and calculations performed using the
unfound relevant document estimates for Sensitivity Level 1 for the Tangible Concept
Query Sample Set and the Intangible Concept Query Sample Set. The data and results of
the statistical significance test at Sensitivity Level 1 are presented in Figure I.1 for recall*
and in Figure I.2 for F-measure*.
272
Table I.1    Recall*, Precision, and F-measure* calculations for Tangible Concepts
             Query Sample Set at Sensitivity Level 1 (0.25X), where the number of
             unfound documents is assumed to be a quarter of the number of relevant
             documents identified by the Baseline and Enhanced search engines.

Level 1 Sensitivity To Unfound Relevant Documents (0.25X) – Tangible Concepts Query Sample Set

    Query Topic (QT)                                  Baseline+Enhanced   Unfound    Estimated Total | Baseline: Recall*  Precision  F-measure* | Enhanced: Recall*  Precision  F-measure*
                                                      Relevant            Relevant   Relevant        | (at Sensitivity Level 1)                  | (at Sensitivity Level 1)
 1  auxiliary power unit fire extinguishing                21                 6           27         |  0.593   1.000   0.744 |  0.778   1.000   0.875
 2  false resolution advisory                              20                 5           25         |  0.360   1.000   0.529 |  0.800   0.870   0.833
 3  fault tolerant data entry                               4                 1            5         |  0.200   1.000   0.333 |  0.800   0.286   0.421
 4  hydraulic system status messages                        7                 2            9         |  0.000   0.000   0.000 |  0.778   0.259   0.389
 5  icing conditions operating speeds provided in AFM      46                12           58         |  0.190   1.000   0.319 |  0.793   0.754   0.773
 6  information presented in peripheral visual field       11                 3           14         |  0.214   1.000   0.353 |  0.786   1.000   0.880
 7  information readable with vibration                    25                 7           32         |  0.375   1.000   0.545 |  0.781   0.568   0.658
 8  instruments located in normal line of sight            28                 7           35         |  0.171   1.000   0.293 |  0.800   0.800   0.800
 9  labels readable distance                               10                 3           13         |  0.538   1.000   0.700 |  0.769   0.833   0.800
10  landing gear manual extension control design           27                 7           34         |  0.088   1.000   0.162 |  0.794   0.563   0.659
11  negative transfer issues                                2                 1            3         |  0.667   1.000   0.800 |  0.667   0.400   0.500
12  safety belt latch operation                            37                10           47         |  0.404   1.000   0.576 |  0.787   0.685   0.733
13  side stick control considerations                      21                 6           27         |  0.000   0.000   0.000 |  0.778   0.700   0.737
14  text color contrast                                    38                10           48         |  0.396   1.000   0.567 |  0.792   0.704   0.745
274
Table I.2    Recall*, Precision, and F-measure* calculations for Intangible Concepts
             Query Sample Set at Sensitivity Level 1 (0.25X), where the number of
             unfound documents is assumed to be a quarter of the number of relevant
             documents identified by the Baseline and Enhanced search engines.

Level 1 Sensitivity To Unfound Relevant Documents (0.25X) – Intangible Concepts Query Sample Set

    Query Topic (QT)                                                  Baseline+Enhanced   Unfound    Estimated Total | Baseline: Recall*  Precision  F-measure* | Enhanced: Recall*  Precision  F-measure*
                                                                      Relevant            Relevant   Relevant        | (at Sensitivity Level 1)                  | (at Sensitivity Level 1)
 1  acceptable message failure rate and pilots confidence in system        3                 1            4         |  0.000   0.000   0.000 |  0.750   0.750   0.750
 2  appropriate size of characters on display                             56                14           70         |  0.186   1.000   0.313 |  0.800   1.000   0.889
 3  arrangement of right seat instruments                                 19                 5           24         |  0.250   1.000   0.400 |  0.792   0.826   0.809
 4  control is identifiable in the dark                                   56                14           70         |  0.286   1.000   0.444 |  0.800   0.812   0.806
 5  cultural conventions switch design                                     6                 2            8         |  0.125   1.000   0.222 |  0.750   0.857   0.800
 6  design attributes for auditory displays                                7                 2            9         |  0.444   1.000   0.615 |  0.778   1.000   0.875
 7  ergonomics of pilot seating                                            2                 1            3         |  0.000   0.000   0.000 |  0.667   1.000   0.800
 8  excessive cognitive effort                                            10                 3           13         |  0.308   1.000   0.471 |  0.769   1.000   0.870
 9  how to ensure that labels are readable                                16                 4           20         |  0.000   0.000   0.000 |  0.800   0.800   0.800
10  how to improve situation awareness                                    10                 3           13         |  0.154   1.000   0.267 |  0.769   1.000   0.870
11  how to provide unambiguous feedback                                   25                 7           32         |  0.000   0.000   0.000 |  0.781   1.000   0.877
12  minimal mental processing                                             24                 6           30         |  0.300   1.000   0.462 |  0.800   0.923   0.857
13  needs too much attention                                              24                 6           30         |  0.133   1.000   0.235 |  0.800   0.923   0.857
14  preventing instrument reading errors                                  30                 8           38         |  0.184   1.000   0.311 |  0.789   0.909   0.845
15  proper use of red and amber on displays                               28                 7           35         |  0.371   1.000   0.542 |  0.800   0.966   0.875
16  suitable menu navigation methods                                       3                 1            4         |  0.000   0.000   0.000 |  0.750   0.273   0.400
276
Level 1 Sensitivity To Unfound Relevant Documents (0.25X) –
Recall* Significance Calculations

Recall* values at Sensitivity Level 1 (0.25X), Baseline: 0.593, 0.360, 0.200, 0.000, 0.190,
0.214, 0.375, 0.171, 0.538, 0.088, 0.667, 0.404, 0.000, 0.396, 0.000, 0.186, 0.250, 0.286,
0.125, 0.444, 0.000, 0.308, 0.000, 0.154, 0.000, 0.300, 0.133, 0.184, 0.371, 0.000
Recall* values at Sensitivity Level 1 (0.25X), Enhanced: 0.778, 0.800, 0.800, 0.778, 0.793,
0.786, 0.781, 0.800, 0.769, 0.794, 0.667, 0.787, 0.778, 0.792, 0.750, 0.800, 0.792, 0.800,
0.750, 0.778, 0.667, 0.769, 0.800, 0.769, 0.781, 0.800, 0.800, 0.789, 0.800, 0.750

Two Sample t-Test Assuming Equal Variances

Statistical Measures     Baseline    Enhanced
Mean                     0.2313      0.7766
Variance                 0.0352      0.0011
Observations             30          30
Pooled Variance          0.0182
t Stat                   -15.6739
P(T<=t) one-tail         8.155E-23
t Critical one-tail      1.6716
P(T<=t) two-tail         1.63E-22
t Critical two-tail      2.0017

Interpretation of Statistical Test:
The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference
between the recall* of the Baseline and the Enhanced search engines at Sensitivity Level 1
(0.25X). Because the absolute value of the t Stat is greater than t Critical two-tail
value, we can reject the Null Hypothesis. We can draw this same conclusion by looking at
the P(T<=t) two-tail that represents the probability that the Null Hypothesis is true. The
P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null
Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically
significant difference between the recall* at Sensitivity Level 1 (0.25X) of the Baseline
and Enhanced search engines.

Figure I.1   Two Sample t-Test to test the difference of the recall* at Sensitivity
             Level 1 (0.25X) between the Baseline and Enhanced search engines.
277
Level 1 Sensitivity To Unfound Relevant Documents (0.25X) –
F-measure* Significance Calculations

F-measure* values at Sensitivity Level 1 (0.25X), Baseline: 0.744, 0.529, 0.333, 0.000,
0.319, 0.353, 0.545, 0.293, 0.700, 0.162, 0.800, 0.576, 0.000, 0.567, 0.000, 0.313, 0.400,
0.444, 0.222, 0.615, 0.000, 0.471, 0.000, 0.267, 0.000, 0.462, 0.235, 0.311, 0.542, 0.000
F-measure* values at Sensitivity Level 1 (0.25X), Enhanced: 0.875, 0.833, 0.421, 0.389,
0.773, 0.880, 0.658, 0.800, 0.800, 0.659, 0.500, 0.733, 0.737, 0.745, 0.750, 0.889, 0.809,
0.806, 0.800, 0.875, 0.800, 0.870, 0.800, 0.870, 0.877, 0.857, 0.857, 0.845, 0.875, 0.400

Two Sample t-Test Assuming Equal Variances

Statistical Measures     Baseline    Enhanced
Mean                     0.3401      0.7594
Variance                 0.0596      0.0215
Observations             30          30
Pooled Variance          0.0406
t Stat                   -8.0617
P(T<=t) one-tail         2.414E-11
t Critical one-tail      1.6716
P(T<=t) two-tail         4.83E-11
t Critical two-tail      2.0017

Interpretation of Statistical Test:
The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference
between the performance of the Baseline and the Enhanced search engines at Sensitivity
Level 1 (0.25X). Search performance is determined using the F-measure* at Sensitivity
Level 1 (0.25X) where it is assumed that there are additional unfound relevant documents.
Because the absolute value of the t Stat is greater than t Critical two-tail value, we can
reject the Null Hypothesis. We can draw this same conclusion by looking at the P(T<=t)
two-tail that represents the probability that the Null Hypothesis is true. The P(T<=t)
two-tail value is less than the Alpha value of 0.05, so we can reject the Null Hypothesis.
By rejecting the Null Hypothesis, we can conclude that there is a statistically significant
difference between the Baseline and Enhanced search engine F-measure* at Sensitivity
Level 1 (0.25X) values. The F-measure* value of the Enhanced search engine is statistically
greater than the Baseline and, therefore, performs better than the Baseline search engine.
This is consistent with the conclusion drawn without the addition of estimated unfound
relevant documents.

Figure I.2   Two Sample t-Test to test the difference of the F-measure* at Sensitivity
             Level 1 (0.25X) between the Baseline and Enhanced search engines.
278
I.2. Level 2 Sensitivity To Unfound Relevant Documents (2X)
At the second sensitivity level, the unfound relevant documents were estimated to
be double the number of relevant documents identified by the Baseline and Enhanced
search engines. Tables I.3 and I.4 provide the data and calculations performed using the
unfound relevant document estimates for Sensitivity Level 2 for the Tangible Concept
Query Sample Set and the Intangible Concept Query Sample Set. The data and results of
the statistical significance test at Sensitivity Level 2 are presented in Figure I.3 for recall*
and in Figure I.4 for F-measure*.
279
Table I.3    Recall*, Precision, and F-measure* calculations for Tangible Concepts
             Query Sample Set at Sensitivity Level 2 (2X), where the number of
             unfound documents is assumed to be double the number of relevant
             documents identified by the Baseline and Enhanced search engines.

Level 2 Sensitivity To Unfound Relevant Documents (2X) – Tangible Concepts Query Sample Set

    Query Topic (QT)                                  Baseline+Enhanced   Unfound    Estimated Total | Baseline: Recall*  Precision  F-measure* | Enhanced: Recall*  Precision  F-measure*
                                                      Relevant            Relevant   Relevant        | (at Sensitivity Level 2)                  | (at Sensitivity Level 2)
 1  auxiliary power unit fire extinguishing                21                42           63         |  0.254   1.000   0.405 |  0.333   1.000   0.500
 2  false resolution advisory                              20                40           60         |  0.150   1.000   0.261 |  0.333   0.870   0.482
 3  fault tolerant data entry                               4                 8           12         |  0.083   1.000   0.154 |  0.333   0.286   0.308
 4  hydraulic system status messages                        7                14           21         |  0.000   0.000   0.000 |  0.333   0.259   0.292
 5  icing conditions operating speeds provided in AFM      46                92          138         |  0.080   1.000   0.148 |  0.333   0.754   0.462
 6  information presented in peripheral visual field       11                22           33         |  0.091   1.000   0.167 |  0.333   1.000   0.500
 7  information readable with vibration                    25                50           75         |  0.160   1.000   0.276 |  0.333   0.568   0.420
 8  instruments located in normal line of sight            28                56           84         |  0.071   1.000   0.133 |  0.333   0.800   0.471
 9  labels readable distance                               10                20           30         |  0.233   1.000   0.378 |  0.333   0.833   0.476
10  landing gear manual extension control design           27                54           81         |  0.037   1.000   0.071 |  0.333   0.563   0.419
11  negative transfer issues                                2                 4            6         |  0.333   1.000   0.500 |  0.333   0.400   0.364
12  safety belt latch operation                            37                74          111         |  0.171   1.000   0.292 |  0.333   0.685   0.448
13  side stick control considerations                      21                42           63         |  0.000   0.000   0.000 |  0.333   0.700   0.452
14  text color contrast                                    38                76          114         |  0.167   1.000   0.286 |  0.333   0.704   0.452
281
Table I.4    Recall*, Precision, and F-measure* calculations for Intangible Concepts
             Query Sample Set at Sensitivity Level 2 (2X), where the number of
             unfound documents is assumed to be double the number of relevant
             documents identified by the Baseline and Enhanced search engines.

Level 2 Sensitivity To Unfound Relevant Documents (2X) – Intangible Concepts Query Sample Set

    Query Topic (QT)                                                  Baseline+Enhanced   Unfound    Estimated Total | Baseline: Recall*  Precision  F-measure* | Enhanced: Recall*  Precision  F-measure*
                                                                      Relevant            Relevant   Relevant        | (at Sensitivity Level 2)                  | (at Sensitivity Level 2)
 1  acceptable message failure rate and pilots confidence in system        3                 6            9         |  0.000   0.000   0.000 |  0.333   0.750   0.462
 2  appropriate size of characters on display                             56               112          168         |  0.077   1.000   0.144 |  0.333   1.000   0.500
 3  arrangement of right seat instruments                                 19                38           57         |  0.105   1.000   0.190 |  0.333   0.826   0.475
 4  control is identifiable in the dark                                   56               112          168         |  0.119   1.000   0.213 |  0.333   0.812   0.473
 5  cultural conventions switch design                                     6                12           18         |  0.056   1.000   0.105 |  0.333   0.857   0.480
 6  design attributes for auditory displays                                7                14           21         |  0.190   1.000   0.320 |  0.333   1.000   0.500
 7  ergonomics of pilot seating                                            2                 4            6         |  0.000   0.000   0.000 |  0.333   1.000   0.500
 8  excessive cognitive effort                                            10                20           30         |  0.133   1.000   0.235 |  0.333   1.000   0.500
 9  how to ensure that labels are readable                                16                32           48         |  0.000   0.000   0.000 |  0.333   0.800   0.471
10  how to improve situation awareness                                    10                20           30         |  0.067   1.000   0.125 |  0.333   1.000   0.500
11  how to provide unambiguous feedback                                   25                50           75         |  0.000   0.000   0.000 |  0.333   1.000   0.500
12  minimal mental processing                                             24                48           72         |  0.125   1.000   0.222 |  0.333   0.923   0.490
13  needs too much attention                                              24                48           72         |  0.056   1.000   0.105 |  0.333   0.923   0.490
14  preventing instrument reading errors                                  30                60           90         |  0.078   1.000   0.144 |  0.333   0.909   0.488
15  proper use of red and amber on displays                               28                56           84         |  0.155   1.000   0.268 |  0.333   0.966   0.496
16  suitable menu navigation methods                                       3                 6            9         |  0.000   0.000   0.000 |  0.333   0.273   0.300
283
Level 2 Sensitivity To Unfound Relevant Documents (2X) –
Recall* Significance Calculations

Recall* values at Sensitivity Level 2 (2X), Baseline: 0.254, 0.150, 0.083, 0.000, 0.080,
0.091, 0.160, 0.071, 0.233, 0.037, 0.333, 0.171, 0.000, 0.167, 0.000, 0.077, 0.105, 0.119,
0.056, 0.190, 0.000, 0.133, 0.000, 0.067, 0.000, 0.125, 0.056, 0.078, 0.155, 0.000
Recall* values at Sensitivity Level 2 (2X), Enhanced: 0.333 for each of the 30 query topics

Two Sample t-Test Assuming Equal Variances

Statistical Measures     Baseline    Enhanced
Mean                     0.0997      0.3333
Variance                 0.0071      0.0000
Observations             30          30
Pooled Variance          0.0036
t Stat                   -15.1644
P(T<=t) one-tail         3.830E-22
t Critical one-tail      1.6716
P(T<=t) two-tail         7.66E-22
t Critical two-tail      2.0017

Interpretation of Statistical Test:
The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference
between the recall* of the Baseline and the Enhanced search engines at Sensitivity Level 2
(2X). Because the absolute value of the t Stat is greater than t Critical two-tail value,
we can reject the Null Hypothesis. We can draw this same conclusion by looking at the
P(T<=t) two-tail that represents the probability that the Null Hypothesis is true. The
P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null
Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically
significant difference between the recall* at Sensitivity Level 2 (2X) of the Baseline and
Enhanced search engines.

Figure I.3   Two Sample t-Test to test the difference of the recall* at Sensitivity
             Level 2 (2X) between the Baseline and Enhanced search engines.
284
Level 2 Sensitivity To Unfound Relevant Documents (2X) –
F-measure* Significance Calculations

F-measure* values at Sensitivity Level 2 (2X), Baseline: 0.405, 0.261, 0.154, 0.000, 0.148,
0.167, 0.276, 0.133, 0.378, 0.071, 0.500, 0.292, 0.000, 0.286, 0.000, 0.144, 0.190, 0.213,
0.105, 0.320, 0.000, 0.235, 0.000, 0.125, 0.000, 0.222, 0.105, 0.144, 0.268, 0.000
F-measure* values at Sensitivity Level 2 (2X), Enhanced: 0.500, 0.482, 0.308, 0.292, 0.462,
0.500, 0.420, 0.471, 0.476, 0.419, 0.364, 0.448, 0.452, 0.452, 0.462, 0.500, 0.475, 0.473,
0.480, 0.500, 0.500, 0.500, 0.471, 0.500, 0.500, 0.490, 0.490, 0.488, 0.496, 0.300

Two Sample t-Test Assuming Equal Variances

Statistical Measures     Baseline    Enhanced
Mean                     0.1714      0.4556
Variance                 0.0180      0.0037
Observations             30          30
Pooled Variance          0.0109
t Stat                   -10.5565
P(T<=t) one-tail         2.007E-15
t Critical one-tail      1.6716
P(T<=t) two-tail         4.01E-15
t Critical two-tail      2.0017

Interpretation of Statistical Test:
The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference
between the performance of the Baseline and the Enhanced search engines at Sensitivity
Level 2 (2X). Search performance is determined using the F-measure* at Sensitivity Level 2
(2X) where it is assumed that there are additional unfound relevant documents. Because the
absolute value of the t Stat is greater than t Critical two-tail value, we can reject the
Null Hypothesis. We can draw this same conclusion by looking at the P(T<=t) two-tail that
represents the probability that the Null Hypothesis is true. The P(T<=t) two-tail value is
less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the
Null Hypothesis, we can conclude that there is a statistically significant difference
between the Baseline and Enhanced search engine F-measure* at Sensitivity Level 2 (2X)
values. The F-measure* value of the Enhanced search engine is statistically greater than
the Baseline and, therefore, performs better than the Baseline search engine. This is
consistent with the conclusion drawn without the addition of estimated unfound relevant
documents.

Figure I.4   Two Sample t-Test to test the difference of the F-measure* at Sensitivity
             Level 2 (2X) between the Baseline and Enhanced search engines.
285
I.3. Level 3 Sensitivity To Unfound Relevant Documents (10X)
At the third sensitivity level, the unfound relevant documents were estimated to
be ten times the number of relevant documents identified by the Baseline and Enhanced
search engines. Tables I.5 and I.6 provide the data and calculations performed using the
unfound relevant document estimates for Sensitivity Level 3 for the Tangible Concept
Query Sample Set and the Intangible Concept Query Sample Set. The data and results of
the statistical significance test at Sensitivity Level 3 are presented in Figure I.5 for recall*
and in Figure I.6 for F-measure*.
286
Table I.5    Recall*, Precision, and F-measure* calculations for Tangible Concepts
             Query Sample Set at Sensitivity Level 3 (10X), where the number of
             unfound documents is assumed to be ten times the number of relevant
             documents identified by the Baseline and Enhanced search engines.

Level 3 Sensitivity To Unfound Relevant Documents (10X) – Tangible Concepts Query Sample Set

    Query Topic (QT)                                  Baseline+Enhanced   Unfound    Estimated Total | Baseline: Recall*  Precision  F-measure* | Enhanced: Recall*  Precision  F-measure*
                                                      Relevant            Relevant   Relevant        | (at Sensitivity Level 3)                  | (at Sensitivity Level 3)
 1  auxiliary power unit fire extinguishing                21               210          231         |  0.069   1.000   0.130 |  0.091   1.000   0.167
 2  false resolution advisory                              20               200          220         |  0.041   1.000   0.079 |  0.091   0.870   0.165
 3  fault tolerant data entry                               4                40           44         |  0.023   1.000   0.044 |  0.091   0.286   0.138
 4  hydraulic system status messages                        7                70           77         |  0.000   0.000   0.000 |  0.091   0.259   0.135
 5  icing conditions operating speeds provided in AFM      46               460          506         |  0.022   1.000   0.043 |  0.091   0.754   0.162
 6  information presented in peripheral visual field       11               110          121         |  0.025   1.000   0.048 |  0.091   1.000   0.167
 7  information readable with vibration                    25               250          275         |  0.044   1.000   0.084 |  0.091   0.568   0.157
 8  instruments located in normal line of sight            28               280          308         |  0.019   1.000   0.038 |  0.091   0.800   0.163
 9  labels readable distance                               10               100          110         |  0.064   1.000   0.120 |  0.091   0.833   0.164
10  landing gear manual extension control design           27               270          297         |  0.010   1.000   0.020 |  0.091   0.563   0.157
11  negative transfer issues                                2                20           22         |  0.091   1.000   0.167 |  0.091   0.400   0.148
12  safety belt latch operation                            37               370          407         |  0.047   1.000   0.089 |  0.091   0.685   0.161
13  side stick control considerations                      21               210          231         |  0.000   0.000   0.000 |  0.091   0.700   0.161
14  text color contrast                                    38               380          418         |  0.045   1.000   0.087 |  0.091   0.704   0.161
288
Table I.6    Recall*, Precision, and F-measure* calculations for Intangible Concepts
             Query Sample Set at Sensitivity Level 3 (10X), where the number of
             unfound documents is assumed to be ten times the number of relevant
             documents identified by the Baseline and Enhanced search engines.

Level 3 Sensitivity To Unfound Relevant Documents (10X) – Intangible Concepts Query Sample Set

    Query Topic (QT)                                                  Baseline+Enhanced   Unfound    Estimated Total | Baseline: Recall*  Precision  F-measure* | Enhanced: Recall*  Precision  F-measure*
                                                                      Relevant            Relevant   Relevant        | (at Sensitivity Level 3)                  | (at Sensitivity Level 3)
 1  acceptable message failure rate and pilots confidence in system        3                30           33         |  0.000   0.000   0.000 |  0.091   0.750   0.162
 2  appropriate size of characters on display                             56               560          616         |  0.021   1.000   0.041 |  0.091   1.000   0.167
 3  arrangement of right seat instruments                                 19               190          209         |  0.029   1.000   0.056 |  0.091   0.826   0.164
 4  control is identifiable in the dark                                   56               560          616         |  0.032   1.000   0.063 |  0.091   0.812   0.164
 5  cultural conventions switch design                                     6                60           66         |  0.015   1.000   0.030 |  0.091   0.857   0.164
 6  design attributes for auditory displays                                7                70           77         |  0.052   1.000   0.099 |  0.091   1.000   0.167
 7  ergonomics of pilot seating                                            2                20           22         |  0.000   0.000   0.000 |  0.091   1.000   0.167
 8  excessive cognitive effort                                            10               100          110         |  0.036   1.000   0.070 |  0.091   1.000   0.167
 9  how to ensure that labels are readable                                16               160          176         |  0.000   0.000   0.000 |  0.091   0.800   0.163
10  how to improve situation awareness                                    10               100          110         |  0.018   1.000   0.036 |  0.091   1.000   0.167
11  how to provide unambiguous feedback                                   25               250          275         |  0.000   0.000   0.000 |  0.091   1.000   0.167
12  minimal mental processing                                             24               240          264         |  0.034   1.000   0.066 |  0.091   0.923   0.166
13  needs too much attention                                              24               240          264         |  0.015   1.000   0.030 |  0.091   0.923   0.166
14  preventing instrument reading errors                                  30               300          330         |  0.021   1.000   0.042 |  0.091   0.909   0.165
15  proper use of red and amber on displays                               28               280          308         |  0.042   1.000   0.081 |  0.091   0.966   0.166
16  suitable menu navigation methods                                       3                30           33         |  0.000   0.000   0.000 |  0.091   0.273   0.136
Level 3 Sensitivity To Unfound Relevant Documents (10X) – Recall* Significance Calculations

The recall* values at Sensitivity Level 3 (10X) for the 30 sample query topics (Baseline and Enhanced) are those listed in Tables I.5 and I.6.

Two Sample t-Test Assuming Equal Variances

| Statistical Measures | Baseline | Enhanced |
| --- | --- | --- |
| Mean | 0.0272 | 0.090 |
| Variance | 0.0005 | 1.793E-3 |
| Observations | 30 | 30 |
| Pooled Variance | 0.0003 | |
| t Stat | -15.1644 | |
| P(T<=t) one-tail | 3.830E-22 | |
| t Critical one-tail | 1.6716 | |
| P(T<=t) two-tail | 7.66E-22 | |
| t Critical two-tail | 2.0017 | |

Interpretation of Statistical Test:
The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the recall* of the Baseline and the Enhanced search engines at Sensitivity Level 3 (10X). Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. The P(T<=t) two-tail value, which is the probability of observing a difference at least this large if the Null Hypothesis were true, leads to the same conclusion: it is less than the Alpha value of 0.05, so the Null Hypothesis is rejected. We therefore conclude that there is a statistically significant difference between the recall* of the Baseline and Enhanced search engines at Sensitivity Level 3 (10X).

Figure I.5. Two Sample t-Test to test the difference of the recall* at Sensitivity Level 3 (10X) between the Baseline and Enhanced search engines.
Level 3 Sensitivity To Unfound Relevant Documents (10X) – F-measure* Significance Calculations

The F-measure* values at Sensitivity Level 3 (10X) for the 30 sample query topics (Baseline and Enhanced) are those listed in Tables I.5 and I.6.

Two Sample t-Test Assuming Equal Variances

| Statistical Measures | Baseline | Enhanced |
| --- | --- | --- |
| Mean | 0.0520 | 0.1607 |
| Variance | 0.0018 | 0.0001 |
| Observations | 30 | 30 |
| Pooled Variance | 0.0010 | |
| t Stat | -13.5466 | |
| P(T<=t) one-tail | 6.441E-20 | |
| t Critical one-tail | 1.6716 | |
| P(T<=t) two-tail | 1.29E-19 | |
| t Critical two-tail | 2.0017 | |

Interpretation of Statistical Test:
The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the performance of the Baseline and the Enhanced search engines at Sensitivity Level 3 (10X), where search performance is measured by the F-measure* under the assumption that there are additional unfound relevant documents. Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. The P(T<=t) two-tail value, which is the probability of observing a difference at least this large if the Null Hypothesis were true, leads to the same conclusion: it is less than the Alpha value of 0.05, so the Null Hypothesis is rejected. We therefore conclude that there is a statistically significant difference between the Baseline and Enhanced F-measure* values at Sensitivity Level 3 (10X). The F-measure* of the Enhanced search engine is statistically greater than that of the Baseline, so the Enhanced search engine performs better than the Baseline search engine. This is consistent with the conclusion drawn without the addition of estimated unfound relevant documents.

Figure I.6. Two Sample t-Test to test the difference of the F-measure* at Sensitivity Level 3 (10X) between the Baseline and Enhanced search engines.
APPENDIX J. SENSITIVITY ANALYSIS FOR IMPACT OF RELEVANCY
ASSUMPTIONS
This appendix contains the data and calculations used to perform the sensitivity
analysis for documents assumed relevant. Four different levels of sensitivity were
analyzed to assess the impact of document relevancy assumptions on the performance
measures used in this experiment. To perform the sensitivity analysis, an estimate of the
number of documents returned by both the Baseline and the Enhanced search engines that
could still be assumed relevant was calculated for each level of sensitivity. The estimates
were calculated in two different ways. For sensitivity levels 1, 2, and 3, the estimated
number of documents assumed relevant in the modified pooling adjudication method was
calculated by assuming that a given percentage (25%, 50%, or 75%) of the documents
returned by both the Baseline and the Enhanced search engines was non-relevant. For
sensitivity level 4, the estimate was determined using results generated by Google
Desktop: the documents returned by the Baseline, the Enhanced, and the Google Desktop
search engines were compared, and only the documents returned by all three search
engines were assumed to be relevant, so that count became the number of documents
assumed relevant in the modified pooling adjudication method.
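As a rough sketch of the percentage-based estimate used for levels 1 through 3 (the function and variable names are illustrative and not taken from the dissertation's code; rounding the estimated non-relevant count up is an assumption, but it is consistent with the values shown in Tables J.1 through J.6):

```python
# Illustrative sketch of the percentage-based relevancy assumption (levels 1-3).
import math

def assumed_relevant_levels_1_to_3(returned_by_both: int, pct_non_relevant: float) -> int:
    """A fixed percentage (0.25, 0.50, or 0.75) of the documents returned by both
    engines is treated as non-relevant; the remainder stays in the assumed-relevant
    pool. The non-relevant count is rounded up."""
    estimated_non_relevant = math.ceil(returned_by_both * pct_non_relevant)
    return returned_by_both - estimated_non_relevant

# Example: 16 documents returned by both engines with 25% assumed non-relevant
# leaves 12 assumed relevant (the values in the first row of Table J.1).
print(assumed_relevant_levels_1_to_3(16, 0.25))  # 12
```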
Using the estimated number of documents assumed relevant for the given
sensitivity level, the recall* was calculated for each of the 30 sample query topics. Next,
a Two Sample t-Test Assuming Equal Variances was run with a significance level of
α=0.05 to determine whether the difference in the recall* between the Baseline and the
Enhanced search engines was statistically significant. Finally, the conclusion drawn about
the significance of the difference in recall* values was compared to the conclusion drawn
in the original calculations (i.e., those performed assuming that all documents returned by
both the Baseline and the Enhanced search engines are relevant).
Next, the estimated number of documents assumed relevant for the given
sensitivity level was used to calculate the F-measure* for each of the 30 sample query
topics. A Two Sample t-Test Assuming Equal Variances was then run with a
significance level of α=0.05 to determine whether the difference in the F-measure*
between the Baseline and the Enhanced search engines was statistically significant. Finally,
the conclusion drawn about the significance of the difference in F-measure* values was
compared to the conclusion drawn in the original calculations (i.e., those performed
assuming that all documents returned by both the Baseline and the Enhanced search
engines are relevant).
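The per-level computation can be summarized in a short sketch. This is illustrative only: it takes F-measure* as the usual harmonic mean of recall* and precision (which reproduces the tabulated values, e.g., 2 × 0.706 × 0.750 / (0.706 + 0.750) ≈ 0.727 in the first row of Table J.1), and it uses SciPy's equal-variance two-sample t-test in place of the spreadsheet tool used in the study; the five example values are the first five Baseline and Enhanced recall* entries of Table J.1.

```python
# Illustrative sketch of the F-measure* calculation and the
# Two Sample t-Test Assuming Equal Variances (alpha = 0.05).
from scipy import stats

def f_measure_star(recall_star: float, precision: float) -> float:
    # Harmonic mean of recall* and precision (assumed form of F-measure*).
    if recall_star + precision == 0.0:
        return 0.0
    return 2 * recall_star * precision / (recall_star + precision)

print(round(f_measure_star(0.706, 0.750), 3))  # 0.727, as in the first row of Table J.1

# First five Baseline and Enhanced recall* values from Table J.1 (illustration only;
# the study uses all 30 sample query topics at each sensitivity level).
baseline_recall = [0.706, 0.353, 0.000, 0.000, 0.186]
enhanced_recall = [1.000, 1.000, 1.000, 1.000, 1.000]

# ttest_ind with equal_var=True performs the pooled (equal-variance) two-sample
# t-test and returns the t statistic and the two-tailed p-value.
t_stat, p_two_tail = stats.ttest_ind(baseline_recall, enhanced_recall, equal_var=True)
print(t_stat, p_two_tail, "reject H0" if p_two_tail < 0.05 else "fail to reject H0")
```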
The data and calculations for each of the four sensitivity levels are presented in
the following sections.
J.1. Level 1 Sensitivity To Relevancy Assumptions (25% Non-Relevant)
At the first sensitivity level, the documents to be assumed relevant were estimated
by assuming that 25% of the documents returned by both the Baseline and the Enhanced
search engines were non-relevant. Tables J.1 and J.2 provide the data and calculations
performed using the document relevancy assumption estimates for Sensitivity Level 1 for
the Tangible Concept Query Sample Set and the Intangible Concept Query Sample Set.
The data and results of the statistical significance test at Sensitivity Level 1 are presented
in Figure J.1 for recall* and in Figure J.2 for F-measure*.
Table J.1
Recall*, Precision, and F-measure* calculations for Tangible Concepts
Query Sample Set at Level 1 Sensitivity to Relevancy Assumptions where
25% of the documents returned by both the Baseline and the Enhanced
search engines were assumed to be non-relevant.
Level 1 Sensitivity To Relevancy Assumptions (25%) – Tangible Concepts Query Sample Set

| QT | Query Topic | Original Assumed Relevant | Estimated Non-Relevant | Estimated Assumed Relevant | Baseline Recall* | Baseline Precision | Baseline F-measure* | Enhanced Recall* | Enhanced Precision | Enhanced F-measure* |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | auxiliary power unit fire extinguishing | 16 | 4 | 12 | 0.706 | 0.750 | 0.727 | 1.000 | 0.810 | 0.895 |
| 2 | false resolution advisory | 9 | 3 | 6 | 0.353 | 0.667 | 0.462 | 1.000 | 0.739 | 0.850 |
| 3 | fault tolerant data entry | 1 | 1 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.214 | 0.353 |
| 4 | hydraulic system status messages | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.259 | 0.412 |
| 5 | icing conditions operating speeds provided in AFM | 11 | 3 | 8 | 0.186 | 0.727 | 0.296 | 1.000 | 0.705 | 0.827 |
| 6 | information presented in peripheral visual field | 3 | 1 | 2 | 0.200 | 0.667 | 0.308 | 1.000 | 0.909 | 0.952 |
| 7 | information readable with vibration | 12 | 3 | 9 | 0.409 | 0.750 | 0.529 | 1.000 | 0.500 | 0.667 |
| 8 | instruments located in normal line of sight | 6 | 2 | 4 | 0.154 | 0.667 | 0.250 | 1.000 | 0.743 | 0.852 |
| 9 | labels readable distance | 7 | 2 | 5 | 0.625 | 0.714 | 0.667 | 1.000 | 0.667 | 0.800 |
| 10 | landing gear manual extension control design | 3 | 1 | 2 | 0.077 | 0.667 | 0.138 | 1.000 | 0.542 | 0.703 |
| 11 | negative transfer issues | 2 | 1 | 1 | 1.000 | 0.500 | 0.667 | 1.000 | 0.200 | 0.333 |
| 12 | safety belt latch operation | 19 | 5 | 14 | 0.438 | 0.737 | 0.549 | 1.000 | 0.593 | 0.744 |
| 13 | side stick control considerations | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.700 | 0.824 |
| 14 | text color contrast | 19 | 5 | 14 | 0.424 | 0.737 | 0.538 | 1.000 | 0.611 | 0.759 |
Table J.2
Recall*, Precision, and F-measure* calculations for Intangible Concepts
Query Sample Set at Level 1 Sensitivity to Relevancy Assumptions where
25% of the documents returned by both the Baseline and the Enhanced
search engines were assumed to be non-relevant.
Level 1 Sensitivity To Relevancy Assumptions (25%) – Intangible Concepts Query Sample Set

| QT | Query Topic | Original Assumed Relevant | Estimated Non-Relevant | Estimated Assumed Relevant | Baseline Recall* | Baseline Precision | Baseline F-measure* | Enhanced Recall* | Enhanced Precision | Enhanced F-measure* |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | acceptable message failure rate and pilots confidence in system | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.750 | 0.857 |
| 2 | appropriate size of characters on display | 13 | 4 | 9 | 0.173 | 0.692 | 0.277 | 1.000 | 0.929 | 0.963 |
| 3 | arrangement of right seat instruments | 6 | 2 | 4 | 0.235 | 0.667 | 0.348 | 1.000 | 0.739 | 0.850 |
| 4 | control is identifiable in the dark | 20 | 5 | 15 | 0.294 | 0.750 | 0.423 | 1.000 | 0.739 | 0.850 |
| 5 | cultural conventions switch design | 1 | 1 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.714 | 0.833 |
| 6 | design attributes for auditory displays | 4 | 1 | 3 | 0.500 | 0.750 | 0.600 | 1.000 | 0.857 | 0.923 |
| 7 | ergonomics of pilot seating | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| 8 | excessive cognitive effort | 4 | 1 | 3 | 0.333 | 0.750 | 0.462 | 1.000 | 0.900 | 0.947 |
| 9 | how to ensure that labels are readable | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.800 | 0.889 |
| 10 | how to improve situation awareness | 2 | 1 | 1 | 0.111 | 0.500 | 0.182 | 1.000 | 0.900 | 0.947 |
| 11 | how to provide unambiguous feedback | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| 12 | minimal mental processing | 10 | 3 | 7 | 0.318 | 0.700 | 0.438 | 1.000 | 0.846 | 0.917 |
| 13 | needs too much attention | 4 | 1 | 3 | 0.130 | 0.750 | 0.222 | 1.000 | 0.885 | 0.939 |
| 14 | preventing instrument reading errors | 7 | 2 | 5 | 0.179 | 0.714 | 0.286 | 1.000 | 0.848 | 0.918 |
| 15 | proper use of red and amber on displays | 13 | 4 | 9 | 0.375 | 0.692 | 0.486 | 1.000 | 0.828 | 0.906 |
| 16 | suitable menu navigation methods | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.273 | 0.429 |
Level 1 Sensitivity To Relevancy Assumptions (25%) – Recall* Significance Calculations

The recall* values at Sensitivity Level 1 (25% assumed non-relevant) for the 30 sample query topics (Baseline and Enhanced) are those listed in Tables J.1 and J.2.

Two Sample t-Test Assuming Equal Variances

| Statistical Measures | Baseline | Enhanced |
| --- | --- | --- |
| Mean | 0.2407 | 1.0000 |
| Variance | 0.0605 | 0.0000 |
| Observations | 30 | 30 |
| Pooled Variance | 0.0303 | |
| t Stat | -16.9082 | |
| P(T<=t) one-tail | 2.184E-24 | |
| t Critical one-tail | 1.6716 | |
| P(T<=t) two-tail | 4.37E-24 | |
| t Critical two-tail | 2.0017 | |

Interpretation of Statistical Test:
The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the recall* of the Baseline and the Enhanced search engines at Level 1 Sensitivity to Relevancy Assumptions (25%). Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. The P(T<=t) two-tail value, which is the probability of observing a difference at least this large if the Null Hypothesis were true, leads to the same conclusion: it is less than the Alpha value of 0.05, so the Null Hypothesis is rejected. We therefore conclude that there is a statistically significant difference between the recall* of the Baseline and Enhanced search engines at Level 1 Sensitivity to Relevancy Assumptions (25%).

Figure J.1. Two Sample t-Test to test the difference of the recall* at Level 1 Sensitivity to Relevancy Assumptions (25%) between the Baseline and Enhanced search engines.
Level 1 Sensitivity To Relevancy Assumptions (25%) – F-measure* Significance Calculations

The F-measure* values at Sensitivity Level 1 (25% assumed non-relevant) for the 30 sample query topics (Baseline and Enhanced) are those listed in Tables J.1 and J.2.

Two Sample t-Test Assuming Equal Variances

| Statistical Measures | Baseline | Enhanced |
| --- | --- | --- |
| Mean | 0.2951 | 0.8046 |
| Variance | 0.0584 | 0.0351 |
| Observations | 30 | 30 |
| Pooled Variance | 0.0468 | |
| t Stat | -9.1255 | |
| P(T<=t) one-tail | 4.118E-13 | |
| t Critical one-tail | 1.6716 | |
| P(T<=t) two-tail | 8.24E-13 | |
| t Critical two-tail | 2.0017 | |

Interpretation of Statistical Test:
The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the performance of the Baseline and the Enhanced search engines at Sensitivity Level 1 (25% non-relevant), where search performance is measured by the F-measure* under the assumption that 25% of the documents returned by both the Baseline and the Enhanced search engines are non-relevant. Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. The P(T<=t) two-tail value, which is the probability of observing a difference at least this large if the Null Hypothesis were true, leads to the same conclusion: it is less than the Alpha value of 0.05, so the Null Hypothesis is rejected. We therefore conclude that there is a statistically significant difference between the Baseline and Enhanced F-measure* values at Sensitivity Level 1 (25% non-relevant). The F-measure* of the Enhanced search engine is statistically greater than that of the Baseline, so the Enhanced search engine performs better than the Baseline search engine. This is consistent with the conclusion drawn in the original calculations (i.e., assuming that all documents returned by both the Baseline and the Enhanced search engines are relevant).

Figure J.2. Two Sample t-Test to test the difference of the F-measure* at Level 1 Sensitivity to Relevancy Assumptions (25% non-relevant) between the Baseline and Enhanced search engines.
J.2. Level 2 Sensitivity To Relevancy Assumptions (50% Non-Relevant)
At the second sensitivity level, the documents to be assumed relevant were
estimated by assuming that 50% of the documents returned by both the Baseline and the
Enhanced search engines were non-relevant. Tables J.3 and J.4 provide the data and
calculations performed using the document relevancy assumption estimates for
Sensitivity Level 2 for the Tangible Concept Query Sample Set and the Intangible
Concept Query Sample Set. The data and results of the statistical significance test at
Sensitivity Level 2 are presented in Figure J.3 for recall* and in Figure J.4 for F-measure*.
Table J.3
Recall*, Precision, and F-measure* calculations for the Tangible Concepts Query Sample Set at Level 2 Sensitivity to Relevancy Assumptions, where 50% of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant.

Level 2 Sensitivity To Relevancy Assumptions (50%) – Tangible Concepts Query Sample Set

| QT | Query Topic | Original Assumed Relevant | Estimated Non-Relevant | Estimated Assumed Relevant | Baseline Recall* | Baseline Precision | Baseline F-measure* | Enhanced Recall* | Enhanced Precision | Enhanced F-measure* |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | auxiliary power unit fire extinguishing | 16 | 8 | 8 | 0.615 | 0.500 | 0.552 | 1.000 | 0.619 | 0.765 |
| 2 | false resolution advisory | 9 | 5 | 4 | 0.267 | 0.444 | 0.333 | 1.000 | 0.652 | 0.789 |
| 3 | fault tolerant data entry | 1 | 1 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.214 | 0.353 |
| 4 | hydraulic system status messages | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.259 | 0.412 |
| 5 | icing conditions operating speeds provided in AFM | 11 | 6 | 5 | 0.125 | 0.455 | 0.196 | 1.000 | 0.656 | 0.792 |
| 6 | information presented in peripheral visual field | 3 | 2 | 1 | 0.111 | 0.333 | 0.167 | 1.000 | 0.818 | 0.900 |
| 7 | information readable with vibration | 12 | 6 | 6 | 0.316 | 0.500 | 0.387 | 1.000 | 0.432 | 0.603 |
| 8 | instruments located in normal line of sight | 6 | 3 | 3 | 0.120 | 0.500 | 0.194 | 1.000 | 0.714 | 0.833 |
| 9 | labels readable distance | 7 | 4 | 3 | 0.500 | 0.429 | 0.462 | 1.000 | 0.500 | 0.667 |
| 10 | landing gear manual extension control design | 3 | 2 | 1 | 0.040 | 0.333 | 0.071 | 1.000 | 0.521 | 0.685 |
| 11 | negative transfer issues | 2 | 1 | 1 | 1.000 | 0.500 | 0.667 | 1.000 | 0.200 | 0.333 |
| 12 | safety belt latch operation | 19 | 10 | 9 | 0.333 | 0.474 | 0.391 | 1.000 | 0.500 | 0.667 |
| 13 | side stick control considerations | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.700 | 0.824 |
| 14 | text color contrast | 19 | 10 | 9 | 0.321 | 0.474 | 0.383 | 1.000 | 0.519 | 0.683 |
Table J.4
Recall*, Precision, and F-measure* calculations for Intangible Concepts
Query Sample Set at Level 2 Sensitivity to Relevancy Assumptions where
50% of the documents returned by both the Baseline and the Enhanced
search engines were assumed to be non-relevant.
Level 2 Sensitivity To Relevancy Assumptions (50%) – Intangible Concepts Query Sample Set

| QT | Query Topic | Original Assumed Relevant | Estimated Non-Relevant | Estimated Assumed Relevant | Baseline Recall* | Baseline Precision | Baseline F-measure* | Enhanced Recall* | Enhanced Precision | Enhanced F-measure* |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | acceptable message failure rate and pilots confidence in system | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.750 | 0.857 |
| 2 | appropriate size of characters on display | 13 | 7 | 6 | 0.122 | 0.462 | 0.194 | 1.000 | 0.875 | 0.933 |
| 3 | arrangement of right seat instruments | 6 | 3 | 3 | 0.188 | 0.500 | 0.273 | 1.000 | 0.696 | 0.821 |
| 4 | control is identifiable in the dark | 20 | 10 | 10 | 0.217 | 0.500 | 0.303 | 1.000 | 0.667 | 0.800 |
| 5 | cultural conventions switch design | 1 | 1 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.714 | 0.833 |
| 6 | design attributes for auditory displays | 4 | 2 | 2 | 0.400 | 0.500 | 0.444 | 1.000 | 0.714 | 0.833 |
| 7 | ergonomics of pilot seating | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| 8 | excessive cognitive effort | 4 | 2 | 2 | 0.250 | 0.500 | 0.333 | 1.000 | 0.800 | 0.889 |
| 9 | how to ensure that labels are readable | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.800 | 0.889 |
| 10 | how to improve situation awareness | 2 | 1 | 1 | 0.111 | 0.500 | 0.182 | 1.000 | 0.900 | 0.947 |
| 11 | how to provide unambiguous feedback | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| 12 | minimal mental processing | 10 | 5 | 5 | 0.250 | 0.500 | 0.333 | 1.000 | 0.769 | 0.870 |
| 13 | needs too much attention | 4 | 2 | 2 | 0.091 | 0.500 | 0.154 | 1.000 | 0.846 | 0.917 |
| 14 | preventing instrument reading errors | 7 | 4 | 3 | 0.115 | 0.429 | 0.182 | 1.000 | 0.788 | 0.881 |
| 15 | proper use of red and amber on displays | 13 | 7 | 6 | 0.286 | 0.462 | 0.353 | 1.000 | 0.724 | 0.840 |
| 16 | suitable menu navigation methods | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.273 | 0.429 |
Level 2 Sensitivity To Relevancy Assumptions (50%) – Recall* Significance Calculations

The recall* values at Sensitivity Level 2 (50% assumed non-relevant) for the 30 sample query topics (Baseline and Enhanced) are those listed in Tables J.3 and J.4.

Two Sample t-Test Assuming Equal Variances

| Statistical Measures | Baseline | Enhanced |
| --- | --- | --- |
| Mean | 0.1926 | 1.0000 |
| Variance | 0.0497 | 0.0000 |
| Observations | 30 | 30 |
| Pooled Variance | 0.0249 | |
| t Stat | -19.8269 | |
| P(T<=t) one-tail | 8.157E-28 | |
| t Critical one-tail | 1.6716 | |
| P(T<=t) two-tail | 1.63E-27 | |
| t Critical two-tail | 2.0017 | |

Interpretation of Statistical Test:
The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the recall* of the Baseline and the Enhanced search engines at Level 2 Sensitivity to Relevancy Assumptions (50%). Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. The P(T<=t) two-tail value, which is the probability of observing a difference at least this large if the Null Hypothesis were true, leads to the same conclusion: it is less than the Alpha value of 0.05, so the Null Hypothesis is rejected. We therefore conclude that there is a statistically significant difference between the recall* of the Baseline and Enhanced search engines at Level 2 Sensitivity to Relevancy Assumptions (50%).

Figure J.3. Two Sample t-Test to test the difference of the recall* at Level 2 Sensitivity to Relevancy Assumptions (50%) between the Baseline and Enhanced search engines.
Level 2 Sensitivity To Relevancy Assumptions (50%) – F-measure* Significance Calculations

The F-measure* values at Sensitivity Level 2 (50% assumed non-relevant) for the 30 sample query topics (Baseline and Enhanced) are those listed in Tables J.3 and J.4.

Two Sample t-Test Assuming Equal Variances

| Statistical Measures | Baseline | Enhanced |
| --- | --- | --- |
| Mean | 0.2184 | 0.7681 |
| Variance | 0.0358 | 0.0330 |
| Observations | 30 | 30 |
| Pooled Variance | 0.0344 | |
| t Stat | -11.4772 | |
| P(T<=t) one-tail | 7.364E-17 | |
| t Critical one-tail | 1.6716 | |
| P(T<=t) two-tail | 1.47E-16 | |
| t Critical two-tail | 2.0017 | |

Interpretation of Statistical Test:
The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the performance of the Baseline and the Enhanced search engines at Sensitivity Level 2 (50% non-relevant), where search performance is measured by the F-measure* under the assumption that 50% of the documents returned by both the Baseline and the Enhanced search engines are non-relevant. Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. The P(T<=t) two-tail value, which is the probability of observing a difference at least this large if the Null Hypothesis were true, leads to the same conclusion: it is less than the Alpha value of 0.05, so the Null Hypothesis is rejected. We therefore conclude that there is a statistically significant difference between the Baseline and Enhanced F-measure* values at Sensitivity Level 2 (50% non-relevant). The F-measure* of the Enhanced search engine is statistically greater than that of the Baseline, so the Enhanced search engine performs better than the Baseline search engine. This is consistent with the conclusion drawn in the original calculations (i.e., assuming that all documents returned by both the Baseline and the Enhanced search engines are relevant).

Figure J.4. Two Sample t-Test to test the difference of the F-measure* at Level 2 Sensitivity to Relevancy Assumptions (50% non-relevant) between the Baseline and Enhanced search engines.
J.3. Level 3 Sensitivity To Relevancy Assumptions (75% Non-Relevant)
At the third sensitivity level, the documents to be assumed relevant were
estimated by assuming that 75% of the documents returned by both the Baseline and the
Enhanced search engines were non-relevant. Tables J.5 and J.6 provide the data and
calculations performed using the document relevancy assumption estimates for
Sensitivity Level 3 for the Tangible Concept Query Sample Set and the Intangible
Concept Query Sample Set. The data and results of the statistical significance test at
Sensitivity Level 3 are presented in Figure J.5 for recall* and in Figure J.6 for F-measure*.
Table J.5
Recall*, Precision, and F-measure* calculations for Tangible Concepts
Query Sample Set at Level 3 Sensitivity to Relevancy Assumptions where
75% of the documents returned by both the Baseline and the Enhanced
search engines were assumed to be non-relevant.
Level 3 Sensitivity To Relevancy Assumptions (75%) – Tangible Concepts Query Sample Set

| QT | Query Topic | Original Assumed Relevant | Estimated Non-Relevant | Estimated Assumed Relevant | Baseline Recall* | Baseline Precision | Baseline F-measure* | Enhanced Recall* | Enhanced Precision | Enhanced F-measure* |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | auxiliary power unit fire extinguishing | 16 | 12 | 4 | 0.444 | 0.250 | 0.320 | 1.000 | 0.429 | 0.600 |
| 2 | false resolution advisory | 9 | 7 | 2 | 0.154 | 0.222 | 0.182 | 1.000 | 0.565 | 0.722 |
| 3 | fault tolerant data entry | 1 | 1 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.214 | 0.353 |
| 4 | hydraulic system status messages | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.259 | 0.412 |
| 5 | icing conditions operating speeds provided in AFM | 11 | 9 | 2 | 0.054 | 0.182 | 0.083 | 1.000 | 0.607 | 0.755 |
| 6 | information presented in peripheral visual field | 3 | 3 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.727 | 0.842 |
| 7 | information readable with vibration | 12 | 9 | 3 | 0.188 | 0.250 | 0.214 | 1.000 | 0.364 | 0.533 |
| 8 | instruments located in normal line of sight | 6 | 5 | 1 | 0.043 | 0.167 | 0.069 | 1.000 | 0.657 | 0.793 |
| 9 | labels readable distance | 7 | 6 | 1 | 0.250 | 0.143 | 0.182 | 1.000 | 0.333 | 0.500 |
| 10 | landing gear manual extension control design | 3 | 3 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500 | 0.667 |
| 11 | negative transfer issues | 2 | 2 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 12 | safety belt latch operation | 19 | 15 | 4 | 0.182 | 0.211 | 0.195 | 1.000 | 0.407 | 0.579 |
| 13 | side stick control considerations | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.700 | 0.824 |
| 14 | text color contrast | 19 | 15 | 4 | 0.174 | 0.211 | 0.190 | 1.000 | 0.426 | 0.597 |
Table J.6
Recall*, Precision, and F-measure* calculations for Intangible Concepts
Query Sample Set at Level 3 Sensitivity to Relevancy Assumptions where
75% of the documents returned by both the Baseline and the Enhanced
search engines were assumed to be non-relevant.
Level 3 Sensitivity To Relevancy Assumptions (75%) – Intangible Concepts Query Sample Set

| QT | Query Topic | Original Assumed Relevant | Estimated Non-Relevant | Estimated Assumed Relevant | Baseline Recall* | Baseline Precision | Baseline F-measure* | Enhanced Recall* | Enhanced Precision | Enhanced F-measure* |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | acceptable message failure rate and pilots confidence in system | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.750 | 0.857 |
| 2 | appropriate size of characters on display | 13 | 10 | 3 | 0.065 | 0.231 | 0.102 | 1.000 | 0.821 | 0.902 |
| 3 | arrangement of right seat instruments | 6 | 5 | 1 | 0.071 | 0.167 | 0.100 | 1.000 | 0.609 | 0.757 |
| 4 | control is identifiable in the dark | 20 | 15 | 5 | 0.122 | 0.250 | 0.164 | 1.000 | 0.594 | 0.745 |
| 5 | cultural conventions switch design | 1 | 1 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.714 | 0.833 |
| 6 | design attributes for auditory displays | 4 | 3 | 1 | 0.250 | 0.250 | 0.250 | 1.000 | 0.571 | 0.727 |
| 7 | ergonomics of pilot seating | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| 8 | excessive cognitive effort | 4 | 3 | 1 | 0.143 | 0.250 | 0.182 | 1.000 | 0.700 | 0.824 |
| 9 | how to ensure that labels are readable | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.800 | 0.889 |
| 10 | how to improve situation awareness | 2 | 2 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.800 | 0.889 |
| 11 | how to provide unambiguous feedback | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| 12 | minimal mental processing | 10 | 8 | 2 | 0.118 | 0.200 | 0.148 | 1.000 | 0.654 | 0.791 |
| 13 | needs too much attention | 4 | 3 | 1 | 0.048 | 0.250 | 0.080 | 1.000 | 0.808 | 0.894 |
| 14 | preventing instrument reading errors | 7 | 6 | 1 | 0.042 | 0.143 | 0.065 | 1.000 | 0.727 | 0.842 |
| 15 | proper use of red and amber on displays | 13 | 10 | 3 | 0.167 | 0.231 | 0.194 | 1.000 | 0.621 | 0.766 |
| 16 | suitable menu navigation methods | 0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.273 | 0.429 |
Level 3 Sensitivity To Relevancy Assumptions (75%) – Recall* Significance Calculations

The recall* values at Sensitivity Level 3 (75% assumed non-relevant) for the 30 sample query topics (Baseline and Enhanced) are those listed in Tables J.5 and J.6.

Two Sample t-Test Assuming Equal Variances

| Statistical Measures | Baseline | Enhanced |
| --- | --- | --- |
| Mean | 0.0838 | 0.9667 |
| Variance | 0.0113 | 0.0333 |
| Observations | 30 | 30 |
| Pooled Variance | 0.0223 | |
| t Stat | -22.8770 | |
| P(T<=t) one-tail | 5.127E-31 | |
| t Critical one-tail | 1.6716 | |
| P(T<=t) two-tail | 1.03E-30 | |
| t Critical two-tail | 2.0017 | |

Interpretation of Statistical Test:
The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the recall* of the Baseline and the Enhanced search engines at Level 3 Sensitivity to Relevancy Assumptions (75%). Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. The P(T<=t) two-tail value, which is the probability of observing a difference at least this large if the Null Hypothesis were true, leads to the same conclusion: it is less than the Alpha value of 0.05, so the Null Hypothesis is rejected. We therefore conclude that there is a statistically significant difference between the recall* of the Baseline and Enhanced search engines at Level 3 Sensitivity to Relevancy Assumptions (75%).

Figure J.5. Two Sample t-Test to test the difference of the recall* at Level 3 Sensitivity to Relevancy Assumptions (75%) between the Baseline and Enhanced search engines.
Level 3 Sensitivity To Relevancy Assumptions (75%) – F-measure* Significance Calculations

The F-measure* values at Sensitivity Level 3 (75% assumed non-relevant) for the 30 sample query topics (Baseline and Enhanced) are those listed in Tables J.5 and J.6.

Two Sample t-Test Assuming Equal Variances

| Statistical Measures | Baseline | Enhanced |
| --- | --- | --- |
| Mean | 0.0906 | 0.7107 |
| Variance | 0.0092 | 0.0463 |
| Observations | 30 | 30 |
| Pooled Variance | 0.0277 | |
| t Stat | -14.4212 | |
| P(T<=t) one-tail | 3.872E-21 | |
| t Critical one-tail | 1.6716 | |
| P(T<=t) two-tail | 7.74E-21 | |
| t Critical two-tail | 2.0017 | |

Interpretation of Statistical Test:
The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the performance of the Baseline and the Enhanced search engines at Sensitivity Level 3 (75% non-relevant), where search performance is measured by the F-measure* under the assumption that 75% of the documents returned by both the Baseline and the Enhanced search engines are non-relevant. Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. The P(T<=t) two-tail value, which is the probability of observing a difference at least this large if the Null Hypothesis were true, leads to the same conclusion: it is less than the Alpha value of 0.05, so the Null Hypothesis is rejected. We therefore conclude that there is a statistically significant difference between the Baseline and Enhanced F-measure* values at Sensitivity Level 3 (75% non-relevant). The F-measure* of the Enhanced search engine is statistically greater than that of the Baseline, so the Enhanced search engine performs better than the Baseline search engine. This is consistent with the conclusion drawn in the original calculations (i.e., assuming that all documents returned by both the Baseline and the Enhanced search engines are relevant).

Figure J.6. Two Sample t-Test to test the difference of the F-measure* at Level 3 Sensitivity to Relevancy Assumptions (75%) between the Baseline and Enhanced search engines.
J.4. Level 4 Sensitivity To Relevancy Assumptions (Overlap With Google Desktop)
At the fourth sensitivity level, the estimated number of relevant documents that
may be assumed was generated by determining the number of documents returned by the
Google Desktop search engine that overlap with the documents returned by the Baseline
and the Enhanced search engines. Documents returned by all three search engines were
assumed to be relevant. Tables J.7 and J.8 provide the data and calculations performed
using the document relevancy assumption estimates for Sensitivity Level 4 for the
Tangible Concept Query Sample Set and the Intangible Concept Query Sample Set. The
data and results of the statistical significance test at Sensitivity Level 4 are presented in
Figure J.7 for recall* and in Figure J.8 for F-measure*.
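A minimal sketch of this overlap-based estimate, assuming each engine's result list is available as a set of document identifiers (the function name and identifiers below are hypothetical, not taken from the study's code):

```python
# Illustrative sketch: only documents returned by all three engines are assumed relevant.
def assumed_relevant_overlap(baseline_docs: set, enhanced_docs: set, google_desktop_docs: set) -> int:
    return len(baseline_docs & enhanced_docs & google_desktop_docs)

# Hypothetical document identifiers for one query topic.
baseline = {"d01", "d02", "d03", "d07"}
enhanced = {"d01", "d02", "d03", "d05", "d07", "d09"}
google_desktop = {"d02", "d03", "d08", "d09"}
print(assumed_relevant_overlap(baseline, enhanced, google_desktop))  # 2 (d02 and d03)
```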
Table J.7
Recall*, Precision, and F-measure* calculations for the Tangible Concepts Query Sample Set at Level 4 Sensitivity to Relevancy Assumptions, where the documents assumed relevant are those returned by all three search engines (i.e., the Baseline, Enhanced, and Google Desktop search engines).
Level 4 Sensitivity To Relevancy Assumptions (Overlap With Google Desktop) – Tangible Concepts Query Sample Set

| QT | Query Topic | Original Assumed Relevant | Estimated Non-Relevant | Estimated Assumed Relevant | Baseline Recall* | Baseline Precision | Baseline F-measure* | Enhanced Recall* | Enhanced Precision | Enhanced F-measure* |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | auxiliary power unit fire extinguishing | 16 | -- | 9 | 0.643 | 0.563 | 0.600 | 1.000 | 0.667 | 0.800 |
| 2 | false resolution advisory | 9 | -- | 4 | 0.267 | 0.444 | 0.333 | 1.000 | 0.652 | 0.789 |
| 3 | fault tolerant data entry | 1 | -- | 1 | 0.250 | 1.000 | 0.400 | 1.000 | 0.286 | 0.444 |
| 4 | hydraulic system status messages | 0 | -- | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.259 | 0.412 |
| 5 | icing conditions operating speeds provided in AFM | 11 | -- | 4 | 0.103 | 0.364 | 0.160 | 1.000 | 0.639 | 0.780 |
| 6 | information presented in peripheral visual field | 3 | -- | 3 | 0.273 | 1.000 | 0.429 | 1.000 | 1.000 | 1.000 |
| 7 | information readable with vibration | 12 | -- | 7 | 0.350 | 0.583 | 0.438 | 1.000 | 0.455 | 0.625 |
| 8 | instruments located in normal line of sight | 6 | -- | 3 | 0.120 | 0.500 | 0.194 | 1.000 | 0.714 | 0.833 |
| 9 | labels readable distance | 7 | -- | 3 | 0.500 | 0.429 | 0.462 | 1.000 | 0.500 | 0.667 |
| 10 | landing gear manual extension control design | 3 | -- | 2 | 0.077 | 0.667 | 0.138 | 1.000 | 0.542 | 0.703 |
| 11 | negative transfer issues | 2 | -- | 1 | 1.000 | 0.500 | 0.667 | 1.000 | 0.200 | 0.333 |
| 12 | safety belt latch operation | 19 | -- | 6 | 0.250 | 0.316 | 0.279 | 1.000 | 0.444 | 0.615 |
| 13 | side stick control considerations | 0 | -- | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.700 | 0.824 |
| 14 | text color contrast | 19 | -- | 16 | 0.457 | 0.842 | 0.593 | 1.000 | 0.648 | 0.787 |
Table J.8
Recall*, Precision, and F-measure* calculations for the Intangible Concepts Query Sample Set at Level 4 Sensitivity to Relevancy Assumptions, where the estimated number of documents assumed relevant is generated from the overlap of results with Google Desktop.
Level 4 Sensitivity To Relevancy Assumptions (Overlap With Google Desktop) – Intangible Concepts Query Sample Set

| QT | Query Topic | Original Assumed Relevant | Estimated Non-Relevant | Estimated Assumed Relevant | Baseline Recall* | Baseline Precision | Baseline F-measure* | Enhanced Recall* | Enhanced Precision | Enhanced F-measure* |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | acceptable message failure rate and pilots confidence in system | 0 | -- | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.750 | 0.857 |
| 2 | appropriate size of characters on display | 13 | -- | 13 | 0.232 | 1.000 | 0.377 | 1.000 | 1.000 | 1.000 |
| 3 | arrangement of right seat instruments | 6 | -- | 1 | 0.071 | 0.167 | 0.100 | 1.000 | 0.609 | 0.757 |
| 4 | control is identifiable in the dark | 20 | -- | 4 | 0.100 | 0.200 | 0.133 | 1.000 | 0.580 | 0.734 |
| 5 | cultural conventions switch design | 1 | -- | 1 | 0.167 | 1.000 | 0.286 | 1.000 | 0.857 | 0.923 |
| 6 | design attributes for auditory displays | 4 | -- | 3 | 0.500 | 0.750 | 0.600 | 1.000 | 0.857 | 0.923 |
| 7 | ergonomics of pilot seating | 0 | -- | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| 8 | excessive cognitive effort | 4 | -- | 4 | 0.400 | 1.000 | 0.571 | 1.000 | 1.000 | 1.000 |
| 9 | how to ensure that labels are readable | 0 | -- | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.800 | 0.889 |
| 10 | how to improve situation awareness | 2 | -- | 2 | 0.200 | 1.000 | 0.333 | 1.000 | 1.000 | 1.000 |
| 11 | how to provide unambiguous feedback | 0 | -- | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 | 1.000 |
| 12 | minimal mental processing | 10 | -- | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.577 | 0.732 |
| 13 | needs too much attention | 4 | -- | 2 | 0.091 | 0.500 | 0.154 | 1.000 | 0.846 | 0.917 |
| 14 | preventing instrument reading errors | 7 | -- | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.697 | 0.821 |
| 15 | proper use of red and amber on displays | 13 | -- | 3 | 0.167 | 0.231 | 0.194 | 1.000 | 0.621 | 0.766 |
| 16 | suitable menu navigation methods | 0 | -- | 0 | 0.000 | 0.000 | 0.000 | 1.000 | 0.273 | 0.429 |
Level 4 Sensitivity To Relevancy Assumptions (Overlap With Google Desktop) – Recall* Significance Calculations

The recall* values at Sensitivity Level 4 (Overlap with Google Desktop) for the 30 sample query topics (Baseline and Enhanced) are those listed in Tables J.7 and J.8.

Two Sample t-Test Assuming Equal Variances

| Statistical Measures | Baseline | Enhanced |
| --- | --- | --- |
| Mean | 0.2072 | 1.0000 |
| Variance | 0.0548 | 0.0000 |
| Observations | 30 | 30 |
| Pooled Variance | 0.0274 | |
| t Stat | -18.5406 | |
| P(T<=t) one-tail | 2.369E-26 | |
| t Critical one-tail | 1.6716 | |
| P(T<=t) two-tail | 4.74E-26 | |
| t Critical two-tail | 2.0017 | |

Interpretation of Statistical Test:
The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the recall* of the Baseline and the Enhanced search engines at Level 4 Sensitivity to Relevancy Assumptions (Overlap with Google Desktop). Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. The P(T<=t) two-tail value, which is the probability of observing a difference at least this large if the Null Hypothesis were true, leads to the same conclusion: it is less than the Alpha value of 0.05, so the Null Hypothesis is rejected. We therefore conclude that there is a statistically significant difference between the recall* of the Baseline and Enhanced search engines at Level 4 Sensitivity to Relevancy Assumptions (Overlap with Google Desktop).

Figure J.7. Two Sample t-Test to test the difference of the recall* at Level 4 Sensitivity to Relevancy Assumptions, where the estimated number of assumed relevant documents is generated from the overlap of results with Google Desktop.
Level 4 Sensitivity To Relevancy Assumptions (Overlap With Google Desktop) – F-measure* Significance Calculations

The F-measure* values at Sensitivity Level 4 (Overlap with Google Desktop) for the 30 sample query topics (Baseline and Enhanced) are those listed in Tables J.7 and J.8.

Two Sample t-Test Assuming Equal Variances

| Statistical Measures | Baseline | Enhanced |
| --- | --- | --- |
| Mean | 0.2480 | 0.7786 |
| Variance | 0.0497 | 0.0352 |
| Observations | 30 | 30 |
| Pooled Variance | 0.0424 | |
| t Stat | -9.9769 | |
| P(T<=t) one-tail | 1.691E-14 | |
| t Critical one-tail | 1.6716 | |
| P(T<=t) two-tail | 3.38E-14 | |
| t Critical two-tail | 2.0017 | |

Interpretation of Statistical Test:
The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the performance of the Baseline and the Enhanced search engines at Sensitivity Level 4 (Overlap with Google Desktop), where search performance is measured by the F-measure* under the assumption that only the documents also returned by the Google Desktop search engine are relevant. Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. The P(T<=t) two-tail value, which is the probability of observing a difference at least this large if the Null Hypothesis were true, leads to the same conclusion: it is less than the Alpha value of 0.05, so the Null Hypothesis is rejected. We therefore conclude that there is a statistically significant difference between the Baseline and Enhanced F-measure* values at Sensitivity Level 4 (Overlap with Google Desktop). The F-measure* of the Enhanced search engine is statistically greater than that of the Baseline, so the Enhanced search engine performs better than the Baseline search engine. This is consistent with the conclusion drawn in the original calculations (i.e., assuming that all documents returned by both the Baseline and the Enhanced search engines are relevant).

Figure J.8. Two Sample t-Test to test the difference of the F-measure* at Level 4 Sensitivity to Relevancy Assumptions, where the estimated number of assumed relevant documents is generated from the overlap of results with Google Desktop.