AUTOMATIC CONCEPT-BASED QUERY EXPANSION USING TERM RELATIONAL PATHWAYS BUILT FROM A COLLECTION-SPECIFIC ASSOCIATION THESAURUS

by

Jennifer Rae Lyall-Wilson

Copyright © Jennifer Rae Lyall-Wilson 2013

A Dissertation Submitted to the Faculty of the

SCHOOL OF INFORMATION RESOURCES AND LIBRARY SCIENCE

In Partial Fulfillment of the Requirements
For the Degree of

DOCTOR OF PHILOSOPHY

In the Graduate College

THE UNIVERSITY OF ARIZONA

2013


THE UNIVERSITY OF ARIZONA
GRADUATE COLLEGE

As members of the Dissertation Committee, we certify that we have read the dissertation prepared by Jennifer Rae Lyall-Wilson entitled Automatic Concept-Based Query Expansion Using Term Relational Pathways Built From a Collection-Specific Association Thesaurus and recommend that it be accepted as fulfilling the dissertation requirement for the Degree of Doctor of Philosophy.

_______________________________________________ Date: 5/6/2013
Martin Frické

_______________________________________________ Date: 5/6/2013
Hong Cui

_______________________________________________ Date: 5/6/2013
Bryan Heidorn

_______________________________________________ Date: 5/6/2013
Gary Bakken

Final approval and acceptance of this dissertation is contingent upon the candidate's submission of the final copies of the dissertation to the Graduate College.

I hereby certify that I have read this dissertation prepared under my direction and recommend that it be accepted as fulfilling the dissertation requirement.

_______________________________________________ Date: 5/6/2013
Dissertation Director: Martin Frické


STATEMENT BY AUTHOR

This dissertation has been submitted in partial fulfillment of requirements for an advanced degree at the University of Arizona and is deposited in the University Library to be made available to borrowers under rules of the Library.

Brief quotations from this dissertation are allowable without special permission, provided that accurate acknowledgment of source is made. Requests for permission for extended quotation from or reproduction of this manuscript in whole or in part may be granted by the copyright holder.

SIGNED: Jennifer Rae Lyall-Wilson


ACKNOWLEDGEMENTS

I would like to thank my major advisor Dr. Martin Frické for his help, advice, and encouragement throughout my journey. He is an outstanding teacher and an insightful and caring advisor and mentor. Dr. Frické has been there for me since I began the program, and I am grateful he was willing to work with me. He encouraged me to pursue a dissertation research topic that I was excited about, and he believed I was capable of doing the search engine research that I wasn't sure I could. His confidence in my abilities and his support have made all the difference.

I'd like to thank Dr. Gary Bakken for his support, advice, and encouragement throughout my doctoral program. His passion for his work and energy in pursuing his goals are inspiring.

I'd like to thank my dissertation committee members Dr. Hong Cui and Dr. Bryan Heidorn for their input and suggestions and for working with me to make this a successful dissertation. I'd also like to thank Dr. Heshan Sun for early advice and direction in the dissertation research.

I'd like to thank Geraldine Fragoso in the SIRLS Administration Office for all she's done to make sure that the administrative side of my program stayed on track.
As a distance student it is easy to feel disconnected and alone, but I always felt that I had an advocate in Geraldine looking out for me and ready to help.

I'd like to thank all those who support and staff the libraries in which I spent time working on my dissertation. These libraries include: Portland State University Library, University of Oregon Knight Library, San Diego State University Library, University of California San Diego Geisel Library, and Northern Arizona University Cline Library. And, I'd also especially like to thank the staff at the University of Arizona Main Library who provided me access to all the resources I needed electronically. Their electronic collections as well as their excellent Interlibrary Loan and Document Delivery programs made it possible for me to effectively and efficiently access the resources I needed to complete this research as a distance student.

I would like to thank my sister-in-law and good friend Michele Lyall for generously donating her time and expertise to be my editor and conduct a thorough review of my dissertation. She caught those silly, embarrassing mistakes that could've only been made in the wee hours of the morning and, more importantly, she provided suggestions and guidance that allowed my writing to achieve a higher degree of clarity. She helped make the ideas I presented more understandable and, therefore, made my dissertation stronger.

I'd like to thank my Mom who, as the first in her family to go to college, instilled in me the value of education and set an excellent example of being successful in college while being a loving, caring, and nurturing mother. I am so lucky to have a Mom who always believes in me, never ceases to encourage me, and is always ready to be my own personal cheering section.

I'd like to thank my daughter Kinsey who warms my heart every day with her loving, fun, and generous spirit. Her never-ending curiosity and passion to learn about the world around her inspires me to look at the world in the same way. She constantly amazes me, makes me laugh, and helps me be a better person.

And, finally, I'd like to thank my partner Beth. She's my partner in life, love, family, growth, work, fun, relaxation, and appreciation of all the wonderful things around us. It sounds like a cliché, but it really is so hard to capture in mere words how grateful I am to her for all she has done to contribute to both my dissertation and my growth as a researcher in general. As a sounding board and as an excellent critical, logical, creative, and global thinker, she has contributed in ways both big and small from the conception of this dissertation research to its completion.

Of course, there are many others who have played large or small roles over the years, but I don't have the space to include them all. It takes a village to complete a dissertation. I am grateful to my village.


DEDICATION

To my daughter Kinsey and my partner Beth. You are the loves of my life. Kinsey, you inspire me to be the best person I possibly can be. Beth, with your love, support, and encouragement you help me achieve it.


TABLE OF CONTENTS

LIST OF FIGURES ......................................................................................... 15
LIST OF TABLES ........................................................................................... 21
ABSTRACT ..................................................................................................... 27
CHAPTER 1.
INTRODUCTION ..................................................................................28 1.1. Outside the Scope .................................................................................................33 1.2. Improved Automatic Query Expansion Using Relational Pathways ....................35 1.2.1. Generating the Conceptual Network ............................................................38 1.2.2. Formulating the Expanded Query ................................................................41 1.3. Example Query Walkthrough ...............................................................................44 1.4. Dissertation Report Structure ................................................................................49 CHAPTER 2. LITERATURE REVIEW ......................................................................50 2.1. Improving Information Retrieval using Natural Language Processing ................50 2.1.1. Information Retrieval ...................................................................................50 2.1.2. Natural Language Processing ......................................................................64 2.1.3. Describing Information Needs in the Form of a Query ...............................71 2.1.4. Will NLP techniques improve information retrieval systems? ....................75 2.2. Lexical Acquisition of Meaning ...........................................................................78 2.2.1. Vector Space Models ...................................................................................78 2.2.2. Latent Semantic Analysis ............................................................................85 2.2.3. Normalized Web Distance ...........................................................................89 2.2.4. Probabilistic Models Used in Determining Semantic Similarity .................92 9 TABLE OF CONTENTS – Continued 2.3. Augmenting Retrieval Methods ............................................................................98 2.3.1. Integrating Manually Developed Semantic Knowledge ..............................98 2.3.2. Relevance Feedback...................................................................................102 2.3.3. Query Expansion ........................................................................................103 2.4. Evaluation ...........................................................................................................125 2.4.1. Search Engine Performance Measures.......................................................125 2.4.2. Search Engine Data for Evaluation Experiments.......................................131 CHAPTER 3. RESEARCH HYPOTHESES..............................................................139 CHAPTER 4. METHODOLOGY ...............................................................................140 4.1. Identify and Design Baseline Search Engine ......................................................140 4.1.1. Design Process Used ..................................................................................142 4.1.2. Baseline Search Engine Structure ..............................................................143 4.2. Design Enhanced Search Engine ........................................................................144 4.2.1. Build Association Thesaurus .....................................................................147 4.2.2. Generate Conceptual Network ...................................................................149 4.2.3. 
Expand Query ............................................................................................151 4.3. Select a Document Collection.............................................................................152 4.4. Develop Query Topics ........................................................................................153 4.4.1. Tangible and Intangible Concepts .............................................................154 4.4.2. Relevant Document Sets for Query Topics ...............................................155 4.5. Run Experiment ..................................................................................................156 10 TABLE OF CONTENTS – Continued 4.5.1. Select Samples ...........................................................................................156 4.5.2. Adjudicate ..................................................................................................157 4.6. Measure Performance and Determine Statistical Significance ...........................158 4.6.1. Addressing Potential Impact of Unfound Relevant Documents ................159 4.6.2. Addressing Potential Impact of Relevancy Assumption in Modified Pooling Method .................................................................................................................162 4.6.3. Perform Calculations and Sensitivity Analyses .........................................163 CHAPTER 5. RESULTS ..............................................................................................168 5.1. Full Query Topic Set ...........................................................................................168 5.2. Samples of Query Topics That Produce a Difference ........................................171 5.2.1. Baseline vs. Enhanced Search Performance ..............................................173 5.2.2. Tangible vs. Intangible Concepts ...............................................................174 5.3. Sensitivity Analysis For Impact Of Unfound Relevant Documents ...................176 5.3.1. Level 1 Sensitivity To Unfound Relevant Documents (0.25X) .................176 5.3.2. Level 2 Sensitivity To Unfound Relevant Documents (2X) ......................178 5.3.3. Level 3 Sensitivity To Unfound Relevant Documents (10X) ....................178 5.4. Sensitivity Analysis For Impact Of Relevancy Assumptions .............................179 5.4.1. Level 1 Sensitivity To Relevancy Assumptions (25% Non-Relevant)......180 5.4.2. Level 2 Sensitivity To Relevancy Assumptions (50% Non-Relevant)......181 5.4.3. Level 3 Sensitivity To Relevancy Assumptions (75% Non-Relevant)......182 11 TABLE OF CONTENTS – Continued 5.4.4. Level 4 Sensitivity To Relevancy Assumptions (Overlap With Google Desktop) ...............................................................................................................183 5.5. Query Topic Outliers ..........................................................................................184 CHAPTER 6. DISCUSSION........................................................................................187 6.1. Research Hypothesis #1 ......................................................................................188 6.2. Research Hypothesis #2 ......................................................................................189 6.3. Research Hypothesis #3 ......................................................................................190 6.4. 
Research Hypothesis #4 ......................................................................................190 6.5. Generalizing the Results to Full Set of Query Topics ........................................191 6.6. Sensitivity Analysis For Impact Of Unfound Relevant Documents ...................193 6.7. Sensitivity Analysis For Impact Of Relevancy Assumptions .............................194 6.8. Interpreting the Results .......................................................................................195 6.8.1. Conceptual Network ..................................................................................195 6.8.2. Complete Set of Relevant Documents .......................................................197 6.8.3. Query Topic Types ....................................................................................198 6.9. Impact of Sample Selection ................................................................................199 6.10. Impact of Search Engine Parameter Values......................................................199 6.10.1. Jaccard Coefficient Threshold Value .......................................................200 6.10.2. Maximum Associated Term Entries Threshold Value.............................201 6.11. Outstanding Issues ............................................................................................204 6.11.1. Impact of Characteristics of the Test Document Collection ....................204 12 TABLE OF CONTENTS – Continued 6.11.2. Data Processing and Document Collection Size......................................205 6.11.3. Performance Comparison.........................................................................206 CHAPTER 7. CONCLUSION .....................................................................................208 APPENDIX A. SEARCH ENGINE STRUCTURE AND DESIGN PARAMETERS .. ...............................................................................................................210 A.1. Structure .............................................................................................................210 A.1.1. Baseline Search Engine Structure .............................................................210 A.1.2. Enhanced Search Engine Structure ...........................................................212 A.2. Design Parameters..............................................................................................214 A.2.1.Technology.................................................................................................214 A.2.2.Indexing Parameters ...................................................................................214 A.2.3. Association Thesaurus Processing Parameters .........................................215 A.2.4. Conceptual Network and Relational Pathway Parameters ........................218 A.2.5. Search Parameters .....................................................................................220 APPENDIX B. BUILDING BOOLEAN EXPRESSIONS FROM RELATIONAL PATHWAYS ...............................................................................................................221 B.1. Building Boolean Phrase from Relational Pathway ...........................................221 B.2. Building Query Expression for Query Topic .....................................................223 B.2.1. Combining Multiple Relational Pathways for Term Pair..........................223 B.2.1. 
Creating Full Boolean Expression for Query Topic..................................224 13 TABLE OF CONTENTS – Continued APPENDIX C. QUERY TOPIC LIST ........................................................................227 C.1. Query Topics Representing Tangible Concepts .................................................227 C.2. Query Topics Representing Intangible Concepts ...............................................228 APPENDIX D. DOCUMENTS RETURNED COUNTS FOR QUERY TOPICS BY SEARCH ENGINE ........................................................................................................230 D.1. Documents Returned Counts .............................................................................230 D.2. Significance of Difference of Documents Returned ..........................................233 APPENDIX E. GRADED RELEVANCE QUERY TOPIC DEFINITIONS ............235 E.1. Query Topics Representing Tangible Concepts .................................................236 E.2. Query Topics Representing Intangible Concepts ...............................................241 APPENDIX F. BINARY RELEVANCE DATA AND CALCULATIONS ...............246 F.1. Data for Query Topics Representing Tangible Concepts ...................................246 F.1.1. Adjudicated Query Topics for Tangible Concepts ....................................247 F.1.2. Not Adjudicated Query Topics for Tangible Concepts .............................250 F.2. Data for Query Topics Representing Intangible Concepts .................................253 F.2.1. Adjudicated Query Topics for Intangible Concepts ..................................253 F.2.2. Not Adjudicated Query Topics For Intangible Concepts ..........................256 APPENDIX G. BINARY RELEVANCE SIGNIFICANCE TESTS.........................259 G.1. Difference in Recall* between Baseline and Enhanced Search Engines ...........259 G.2. Difference in F-measure* between Baseline and Enhanced Search Engines ....261 G.3. Difference in F-measures* between Tangible and Intangible Concepts............263 14 TABLE OF CONTENTS – Continued G.3.1. Difference between Tangible and Intangible Concepts for Baseline Search Engine ..................................................................................................................264 G.3.2. Difference between Tangible and Intangible Concepts for Enhanced Search Engine ..................................................................................................................265 APPENDIX H. GRADED RELEVANCE DATA .......................................................266 H.1. Graded Relevance Data of Query Topics for Tangible Concepts ......................266 H.2. Graded Relevance Data of Query Topics for Intangible Concepts ....................268 APPENDIX I. SENSITIVITY ANALYSIS FOR IMPACT OF UNFOUND RELEVANT DOCUMENTS ........................................................................................270 I.1. Level 1 Sensitivity To Unfound Relevant Documents (0.25X) ..........................271 I.2. Level 2 Sensitivity To Unfound Relevant Documents (2X) ...............................278 I.3. Level 3 Sensitivity To Unfound Relevant Documents (10X) .............................285 APPENDIX J. SENSITIVITY ANALYSIS FOR IMPACT OF RELEVANCY ASSUMPTIONS .............................................................................................................292 J.1. Level 1 Sensitivity To Relevancy Assumptions (25% Non-Relevant) ...............293 J.2. 
Level 2 Sensitivity To Relevancy Assumptions (50% Non-Relevant) ...............301 J.3. Level 3 Sensitivity To Relevancy Assumptions (75% Non-Relevant) ...............308 J.4. Level 4 Sensitivity To Relevancy Assumptions (Overlap With Google Desktop) ....................................................................................................................................315 REFERENCES ...............................................................................................................322 15 LIST OF FIGURES Figure 1.1 Term cluster generated from thesaurus entry for term A. ............................ 38 Figure 1.2 Term clusters for term A and term D connected by association relationship generated from thesaurus entries. ................................................................ 39 Figure 1.3 Imaginary relational pathway between terms A and J................................. 39 Figure 1.4 Simple Boolean implementation for formulating an expanded query from path A – B – C – D – E. In the visual representation, the solid circles on the pathway indicate the terms that must be present in the document for the document to be identified as relevant. ......................................................... 43 Figure 1.5 Term clusters generated from collection-specific association thesaurus entries for terms warning and understandable.......................................... 46 Figure 1.6 Example Boolean query for the relational path warnings – aware – color – coding – understandable............................................................................ 48 Figure 2.1 The two phases of the retrieval process in an information retrieval system. ...................................................................................................................... 54 Figure 2.2 The stages of analysis in Natural Language Processing (adapted from Figure 1.1 in Dale, 2010, p. 4)................................................................................. 66 Figure 2.3 An example document-by-word matrix (adapted from Figure 8.3 in Manning & Schütze, 1999, p. 297). ............................................................. 79 Figure 2.4 An example word-by-word matrix (adapted from Figure 8.4 in Manning & Schütze, 1999, p. 297).................................................................................. 80 16 LIST OF FIGURES – Continued Figure 2.5 Typical steps of query expansion (adapted from Figure 1 of Carpineto & Romano, 2012, p. 1:10).............................................................................. 108 Figure 4.1 High-level structure of the Baseline search engine.................................... 144 Figure 4.2 High-level structure of the Enhanced search engine.................................. 146 Figure 4.3 Venn digram illustrating that the documents returned by the Baseline are a subset of those returned by the Enhanced search engine. .......................... 160 Figure 5.1 Histogram of the returned document count frequencies for the 75 query topics run on the Baseline and the Enhanced search engines. ................... 169 Figure 5.2 Distribution of all 75 query topics that produced a difference in performance between the Baseline and the Enhanced search engines and those that performed the same (i.e., no difference in performance). .......................... 170 Figure 6.1 Distribution of all 75 query topics with addtional detail to illustrate population from which the tangible and intangible sample set were derived. 
.................................................................................................................... 188 Figure 6.2 Relational pathways identified for various values of the Maximum Associated Term Entries Threshold. .......................................................... 202 Figure A.1 High-level structure of the Baseline search engine.................................... 211 Figure A.2 High-level structure of the Enhanced search engine.................................. 213 17 LIST OF FIGURES – Continued Figure B.1 Boolean Phrase for Relational Pathway Length of 3. In the visual representation, the solid circles on the pathway indicate the terms that must be present in the document for the document to be identified as relevant. ...... .................................................................................................................... 221 Figure B.2 Boolean Phrase for Relational Pathway Length of 4. In the visual representation, the solid circles on the pathway indicate the terms that must be present in the document for the document to be identified as relevant. ...... .................................................................................................................... 222 Figure B.3 Boolean Phrase for Relational Pathway Length of 5. In the visual representation, the solid circles on the pathway indicate the terms that must be present in the document for the document to be identified as relevant. ...... .................................................................................................................... 222 Figure B.4 Boolean phrase for term pair A and E constructed by combining individual Boolean phrases for the term pair’s two relational pathways. ................... 224 Figure B.5 Boolean phrase for query topic A B C constructed by combining Boolean phrases for each of the term pairs. ............................................................. 225 Figure B.6 Boolean phrase for query topic A B C constructed by combining Boolean phrases generated from the relational pathway identified between the term pair A and B and the term C. Term C did not share a relational pathway with term A nor with term B. ..................................................................... 226 18 LIST OF FIGURES – Continued Figure G.1 Two Sample t-Test to test the recall* difference between Baseline and Enhanced search engines. .......................................................................... 260 Figure G.2 Two Sample t-Test to test the F-measure* difference between Baseline and Enhanced search engines. .......................................................................... 262 Figure G.3 ANOVA Single Factor to test the F-measure* difference between Tangible and Intangible Concepts Query Sample Sets on the Baseline search engine. .................................................................................................................... 264 Figure G.4 ANOVA Single Factor to test the F-measure difference between Tangible and Intangible Concepts Query Sample Sets on the Enhanced search engine. .................................................................................................................... 265 Figure I.1 Two Sample t-Test to test the difference of the recall* at Sensitivity Level 1 (0.25X) between the Baseline and Enhanced search engines. ................... 276 Figure I.2 Two Sample t-Test to test the difference of the F-measure* at Sensitivity Level 1 (0.25X) between the Baseline and Enhanced search engines. 
...... 277 Figure I.3 Two Sample t-Test to test the difference of the recall* at Sensitivity Level 2 (2X) between the Baseline and Enhanced search engines. ........................ 283 Figure I.4 Two Sample t-Test to test the difference of the F-measure* at Sensitivity Level 2 (2X) between the Baseline and Enhanced search engines. ........... 284 Figure I.5 Two Sample t-Test to test the difference of the recall* at Sensitivity Level 3 (10X) between the Baseline and Enhanced search engines. ...................... 290 19 LIST OF FIGURES – Continued Figure I.6 Two Sample t-Test to test the difference of the F-measure* at Sensitivity Level 3 (10X) between the Baseline and Enhanced search engines. ......... 291 Figure J.1 Two Sample t-Test to test the difference of the recall* at Level 1 Sensitivity to Relevancy Assumptions (25%) between the Baseline and Enhanced search engines. ........................................................................................... 299 Figure J.2 Two Sample t-Test to test the difference of the F-measure* at Level 1 Sensitivity to Relevancy Assumptions (25% non-relevant) between the Baseline and Enhanced search engines. ..................................................... 300 Figure J.3 Two Sample t-Test to test the difference of the recall* at Level 2 Sensitivity to Relevancy Assumptions (50%) between the Baseline and Enhanced search engines. ........................................................................................... 306 Figure J.4 Two Sample t-Test to test the difference of the F-measure* at Level 2 Sensitivity to Relevancy Assumptions (50% non-relevant) between the Baseline and Enhanced search engines. ..................................................... 307 Figure J.5 Two Sample t-Test to test the difference of the recall* at Level 3 Sensitivity to Relevancy Assumptions (75%) between the Baseline and Enhanced search engines. ........................................................................................... 313 Figure J.6 Two Sample t-Test to test the difference of the F-measure* at Level 3 Sensitivity to Relevancy Assumptions (75%) between the Baseline and Enhanced search engines. .......................................................................... 314 20 LIST OF FIGURES – Continued Figure J.7 Two Sample t-Test to test the difference of the recall* at Level 4 Sensitivity to Relevancy Assumptions where estimated number of assumed relevant documents generated from overlap of results with Google Desktop. ........ 320 Figure J.8 Two Sample t-Test to test the difference of the F-measure* at Level 4 Sensitivity to Relevancy Assumptions where estimated number of assumed relevant documents generated from overlap of results with Google Desktop. .................................................................................................................... 321 21 LIST OF TABLES Table 2.1 Vector-based similarity measures. ............................................................... 81 Table 2.2 Probabilistic dissimilarity measures. ........................................................... 95 Table 2.3 The number of times term t1 and term t2 occur in each of the five documents that comprise the example document collection. ....................................... 106 Table 4.1 Example of the estimates used for unfound relevant documents used in each of the three levels of the sensitivity. Estimates of unfound documents were rounded up to the next whole number........................................................ 
164 Table 5.1 Counts and percentages of the distribution of the 75 query topics that produced a difference in performance between the Baseline and the Enhanced search engines and those that performed the same (i.e., no difference in performance). ........................................................................ 170 Table 5.2 Sample of 14 query topics representing tangible concepts. ....................... 171 Table 5.3 Sample of 16 query topics representing intangible concepts. .................... 172 Table 5.4 Average performance measures for Baseline and Enhanced search engines. .................................................................................................................... 174 Table 5.5 Average performance measures for query topics representing tangible and and intangible concepts for the Baseline search engine............................. 175 Table 5.6 Average performance measures for query topics representing tangible and and intangible concepts for the Enhanced search engine........................... 175 Table 5.7 Average recall* for Baseline and Enhanced search engines from original calculation and at the three sensitivity levels for unfound documents. ..... 177 22 LIST OF TABLES – Continued Table 5.8 Average F-measure* for Baseline and Enhanced search engines from original calculation and at the three sensitivity levels for unfound documents. ................................................................................................. 177 Table 5.9 Average recall* for Baseline and Enhanced search engines from original calculation and at the four sensitivity levels for assumed relevant documents. ................................................................................................. 180 Table 5.10 Average F-measure* for Baseline and Enhanced search engines from original calculation and at the four sensitivity levels for assumed relevant documents. ................................................................................................. 181 Table 5.11 Documents returned by the Baseline and Enhanced search engines for the two outlier query topics. ............................................................................ 185 Table 6.1 Calculation to estimate the performance of the Enhanced search engine on the full 75 query topic set........................................................................... 192 Table D.1 Document counts returned by Baseline and Enhanced search engine. ...... 233 Table D.2 Two Sample t-test Assuming Equal Variances to determine statistical significance of differences in documents returned by Baseline and Enhanced search engines. ........................................................................................... 234 Table F.1 Adjudicated Query Topics for Tangible Concepts with Recall*, Precision, and F-measure* calculations. ..................................................................... 248 Table F.2 Query topics for Tangible Concepts with no performance difference. ...... 251 23 LIST OF TABLES – Continued Table F.3 Query topics for Tangible Concepts with a performance difference but not included in the Tangible Concepts Query Sample Set............................... 252 Table F.4 Adjudicated Query Topics for Intangible Concepts with Recall*, Precision, and F-measure* calculations. ..................................................................... 254 Table F.5 Query topics for Intangible Concepts with no performance difference. .... 
257 Table F.6 Query topics for Intangible Concepts with a performance difference but not included in the Intangible Concepts Query Sample Set............................. 258 Table H.1 Graded Relevance Data for Tangible Concepts Query Sample Set. .......... 267 Table H.2 Graded Relevance Data for Intangible Concepts Query Sample Set. ........ 269 Table I.1 Recall*, Precision, and F-measure* calculations for Tangible Concepts Query Sample Set at Sensitivity Level 1 (0.25X) where the number of unfound documents are assumed to be a quarter of the number of relevant documents identified by the Baseline and Enhanced search engines. ....... 272 Table I.2 Recall*, Precision, and F-measure* calculations for Intangible Concepts Query Sample Set at Sensitivity Level 1 (0.25X) where the number of unfound documents are assumed to be a quarter of the number of relevant documents identified by the Baseline and Enhanced search engines. ....... 274 Table I.3 Recall*, Precision, and F-measure* calculations for Tangible Concepts Query Sample Set at Sensitivity Level 2 (2X) where the number of unfound documents are assumed to be double the number of relevant documents identified by the Baseline and Enhanced search engines........................... 279 24 LIST OF TABLES – Continued Table I.4 Recall*, Precision, and F-measure* calculations for Intangible Concepts Query Sample Set at Sensitivity Level 2 (2X) where the number of unfound documents are assumed to be double the number of relevant documents identified by the Baseline and Enhanced search engines........................... 281 Table I.5 Recall*, Precision, and F-measure* calculations for Tangible Concepts Query Sample Set at Sensitivity Level 3 (10X) where the number of unfound documents are assumed to be ten times the number of relevant documents identified by the Baseline and Enhanced search engines. ....... 286 Table I.6 Recall*, Precision, and F-measure* calculations for Intangible Concepts Query Sample Set at Sensitivity Level 3 (10X) where the number of unfound documents are assumed to be ten times the number of relevant documents identified by the Baseline and Enhanced search engines. ....... 288 Table J.1 Recall*, Precision, and F-measure* calculations for Tangible Concepts Query Sample Set at Level 1 Sensitivity to Relevancy Assumptions where 25% of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant. ..................................... 295 Table J.2 Recall*, Precision, and F-measure* calculations for Intangible Concepts Query Sample Set at Level 1 Sensitivity to Relevancy Assumptions where 25% of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant. ..................................... 297 25 LIST OF TABLES – Continued Table J.3 Recall*, Precision, and F-measure* calculations for Tangible Concepts Query Sample Set at Level 2 Sensitivity to Relevancy Assumptions where 50% of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant ...................................... 302 Table J.4 Recall*, Precision, and F-measure* calculations for Intangible Concepts Query Sample Set at Level 2 Sensitivity to Relevancy Assumptions where 50% of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant. ..................................... 
304
Table J.5 Recall*, Precision, and F-measure* calculations for Tangible Concepts Query Sample Set at Level 3 Sensitivity to Relevancy Assumptions where 75% of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant. .............................. 309
Table J.6 Recall*, Precision, and F-measure* calculations for Intangible Concepts Query Sample Set at Level 3 Sensitivity to Relevancy Assumptions where 75% of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant. .............................. 311
Table J.7 Recall*, Precision, and F-measure* calculations for Tangible Concepts Query Sample Set at Level 4 Sensitivity to Relevancy Assumptions where the estimated number of relevant documents that may be assumed are the documents returned by all three search engines (i.e., Baseline, Enhanced, and Google Desktop search engines). .............................. 316
Table J.8 Recall*, Precision, and F-measure* calculations for Intangible Concepts Query Sample Set at Level 4 Sensitivity to Relevancy Assumptions where the estimated number of relevant documents to be assumed are generated from overlap of results with Google Desktop. .............................. 318


ABSTRACT

The dissertation research explores an approach to automatic concept-based query expansion to improve search engine performance. It uses a network-based approach for identifying the concept represented by the user's query and is founded on the idea that a collection-specific association thesaurus can be used to create a reasonable representation of all the concepts within the document collection as well as the relationships these concepts have to one another. Because the representation is generated using data from the association thesaurus, a mapping will exist between the representation of the concepts and the terms used to describe these concepts. The research applies to search engines designed for use in an individual website with content focused on a specific conceptual domain. Therefore, both the document collection and the subject content must be well-bounded, which affords the ability to make use of techniques not currently feasible for general-purpose search engines used on the entire web.


CHAPTER 1. INTRODUCTION

It is difficult to overemphasize the importance of the search engine in today's Web environment. It is estimated that the Web contains over 13.02 billion documents [1], and within this enormous collection exists information related to almost any imaginable topic. The challenge in learning about a particular topic arises not because relevant information does not exist on the Web, but rather because it can be very difficult to efficiently retrieve the relatively small subset of documents on the Web that meets a specific information need. One solution to this challenge is to use a search engine. Search engines provide a way to identify and access documents containing relevant information that would otherwise remain unknown to the user (Baeza-Yates & Ribeiro-Neto, 1999; Savoy & Gaussier, 2010; Wolfram, Spink, Jansen, & Saracevic, 2001).

An analogous challenge exists when the information required is contained within an individual website (i.e., local web domain). Individual websites can contain a large amount of detailed information about a particular subject.
Finding the specific, relevant information within the website is often difficult, and, again, the search engine can serve as an important tool for helping users locate the specific documents that fill their information needs.

Designing a search engine is deceptively difficult. One of the primary reasons for this difficulty is that natural language is used as the medium to form the information bridges [2] between the user with the information need and the authors of the documents in the collection. Users form a query using natural language as an abstraction to represent their information need, and the authors of the documents in the collection use natural language as an abstraction to represent the concepts in the documents. The translation into and representation using natural language introduces issues with the accuracy and completeness of the description of the concepts on both sides of the information bridge. Therefore, the search engine must perform its task using a frequently incomplete, imprecise, and ambiguous description of both the user's information need and the related concepts included in the documents within the collection (Savoy & Gaussier, 2010). In addition, the task is made more challenging because these less-than-perfect descriptions of the same concept can be expressed using different words and phrases due to the richness and productivity (i.e., invention of new words and new uses of old words) of natural language (Manning & Schütze, 1999).

[1] The size of the World Wide Web. (9 February 2012). Retrieved from http://www.worldwidewebsize.com/
[2] The concept of information bridges is described by Martin Frické in his book Logic and the Organization of Information, published in 2012.

Search engines may be designed using a symbolic search algorithm that compares the text of the user's query against text found in the documents within the collection. A document is identified as relevant when string patterns in its text match the string patterns in the query text. While a symbolic approach is typically able to identify a significant portion of the relevant documents, it is often not sufficient for identifying the complete set of relevant documents contained in the collection. The primary reason is that the use of natural language introduces the issues mentioned above that limit the ability of a purely symbolic approach to identify all the relevant information.

In addition, it is important to note that the information missed by a symbolic search algorithm may be important in fully understanding the relevant concepts contained in the document collection. The information is missed because it is expressed differently from what the query posed (i.e., the text string patterns are not the same). It may be this missed information that presents a unique way of thinking about the concept by highlighting, emphasizing, or connecting aspects of the idea not contained in the terms used to construct the query, aspects of which the user may not be aware. Therefore, including this information would fill deficits in the user's understanding of the concept, thereby more completely filling the user's information need.

Search engine design is an actively researched area and presents interesting challenges in discovering effective ways to identify relevant information missed by symbolic search algorithms. One area of research addressing these challenges is the development of semantic information to either replace or augment symbolic search engine designs.
A portion of this research focuses on manually developing (i.e., a human expert is required to develop) semantic information. This includes research in developing, defining, and using semantic relations among lexical elements (e.g., WordNet, Fellbaum, 1998) as well as research that addresses the manual development of domain ontologies to extract and represent meaning based on a constructed world model (Nirenburg & Raskin, 2004). However, the manual development of semantic information is extremely time-consuming and expensive, and it either lacks the specificity needed for technical domains or lacks portability for reuse in other conceptual domains and over time within an evolving conceptual domain (Anderson & Pérez-Carballo, 2001; Manning & Schütze, 1999).

Because of these limitations, other lines of research exploring ways to automatically generate information to augment symbolic search engine designs are attractive. One area of this type of research uses Natural Language Processing (NLP) techniques to perform automatic query expansion in an attempt to more completely define and describe a user's query. By augmenting the user's query with additional search terms, additional candidate string patterns are available when performing the symbolic search. It is believed that these added candidate strings allow additional opportunities for a symbolic search algorithm to identify additional documents that contain information relevant to the desired concept. Unfortunately, there has been a marked lack of success in this line of research over the years, and many of the resulting systems either do not improve or actually decrease the performance of the search engine (Qiu & Frei, 1993; Brants, 2003).

Qiu and Frei (1993) present a theory that the lack of success with previous query expansion methods is primarily because the methods were based on adding terms similar to each of the individual terms used to construct the query instead of adding terms similar to the overall concept the query describes. Expanding a query by adding terms similar only to the individual terms of the query often introduces tangential, non-relevant concepts to the query, causing the search algorithm to identify documents that are not relevant to the original query concept. To address this, Qiu and Frei developed an alternate method in which their search algorithm creates a vector of the query in the Term Vector Space (TVS) generated from the document collection and expands the query with those terms that have a high similarity to the query vector. While Qiu and Frei achieved a notable improvement in search engine performance, they noted that their method was less successful than systems that used mature user relevance feedback data. This result indicates that relevant documents were missed by their search algorithm. To create a search engine capable of identifying all relevant documents without a significant increase in non-relevant documents, a more effective method for concept-based query expansion is needed to augment the performance of symbolic search engine design.

In this dissertation research project, the goal is to extend the work in, and state-of-the-art understanding of, automatic concept-based query expansion methods. The project explores an approach to automatic query expansion able to identify a greater portion of relevant documents without increasing the portion of non-relevant documents returned in response to the user's query.
The research applies to search engines designed for use in an individual website with content focused on a specific conceptual domain (sometimes referred to as a vertical search [3]). Therefore, both the document collection and the subject content must be well-bounded, which affords the ability to make use of techniques not currently feasible for general-purpose search engines used on the entire web. The following sections present the scope of the work, the idea that drives the approach, and a walkthrough of an example query that illustrates it.

[3] "Vertical search is a specialized form of web search where the domain of the search is restricted to a particular topic" (Croft, Metzler, & Strohman, 2010, p. 3). Other terms used to describe searches that are limited by the conceptual domain are focused search or topical search.

1.1. Outside the Scope

Designing a search engine, as mentioned above, is a complex and difficult process. There are many aspects to be considered if the ultimate goal is to design a search engine that performs as well as a human but as fast as a computer. To keep this dissertation work well-contained, it is focused on addressing only a few aspects of importance.

The research focuses on the design of a search engine that extends the work in, and state-of-the-art understanding of, automatic concept-based query expansion. The search engine is intended to be used by searchers who have a well-formed information need in advance of using the search engine but may not be able to articulate their information need with a high level of precision. The search engine is intended to be used on a bounded, medium-sized document collection containing documents that are focused on a single scientific or other technical subject domain. A medium-sized document collection can be assumed to be comprised of several thousand documents [4].

[4] Qiu & Frei (1993) refer to the CACM test collection, which is comprised of 3,204 documents, as a medium-sized document collection. The CACM collection is a collection of titles and abstracts from Communications of the ACM.

There are many important and interesting issues that have been defined as outside the scope of this research. Related and relevant issues not addressed in this research include:

• Formation and definition of the information need
  o Challenges with forming and defining an information need – The early stages of Kuhlthau's (2004) Information Search Process (ISP), in which the information-seeking task is initiated, a topic is selected, information about the topic is explored, and the information seeker formulates a focused perspective on the topic, are not addressed.
  o Challenges with expressing an information need in the form of a query – The ability of a user to develop a precise description in the form of a query that truly expresses their particular information need is not addressed.
• Accommodation of user behaviors and backgrounds
  o Accommodating a variety of information-seeking behaviors – Identifying and effectively facilitating various searching behaviors (e.g., browsing, berry picking, or horizontal search) which may be preferred by the users is not addressed.
  o Accommodating varying user backgrounds – Adapting to facilitate users with varying backgrounds in effectively conducting a search is not addressed. This research assumes that the users for which the website and its document collection were designed are the intended users of the search engine.
• Capabilities and technical details of the search engine
  o Ability to process languages other than English – This research is focused only on document collections and queries in English.
  o Ranking documents returned in order of relevance to query – This research is focused only on generating a set of relevant documents contained in the collection and does not employ or address document relevance ranking algorithms.
  o Computational efficiency of approach – The computational efficiency of the design and the feasibility of real-world use of the search engine design are not addressed.

It is assumed that these issues will be considered in the future if the research produces favorable results that suggest that the approach may be useful in a real-world application.

1.2. Improved Automatic Query Expansion Using Relational Pathways

As mentioned earlier, previous work has shown that expansion methods are generally not successful when the terms used to expand the query are identified by only individually considering the terms that make up the query (i.e., using synonyms for each individual term). In response to this problem, Qiu & Frei (1993) investigated a concept-based query expansion method using the Vector Space Model that sought to add terms related to the overall concept of the query rather than synonyms for individual terms. Their method improved retrieval performance over a non-enhanced search engine but still missed relevant documents found using other methods.

The new method proposed in this research uses a network-based rather than a vector-based approach for identifying the concept represented by the user's query. It is founded on the idea that a collection-specific association thesaurus can be used to create a reasonable representation of all the concepts within the document collection as well as the relationships these concepts have to one another. Because the representation is generated using data from the association thesaurus, a mapping will exist between the representation of the concepts and the terms used to describe these concepts.

To do this, an interconnected network will be generated from the association thesaurus entries. The terms will be represented by nodes, and the connections between the nodes will be created through the relationship defined by the IS-ASSOCIATED-WITH terms identified for each thesaurus entry. At search time, the relevant portion of the overall conceptual network will be used to represent the intended concept of the query. This will be accomplished by identifying all the pathways within the network that connect the individual terms from the query. Each of these relational pathways will represent a particular aspect of the overall desired concept, and the intervening nodes (i.e., terms) on these pathways between a pair of query terms will be candidate terms for expanding the original query.

The idea that such a conceptual network could be automatically generated from the association thesaurus for the document collection is based on the following assumptions. Consider that a local website houses a document collection that contains information concentrated on a particular conceptual subject domain.
Given this, the following properties will be assumed for all content-bearing terms contained in the collection (i.e., terms that provide meaning and, therefore, exclude stop words or words that simply provide the structural framework in the document):

• Terms are used to describe concepts within a conceptual domain; therefore, they are used in regular and intentional ways.
• Individual terms represent some portion of an overall concept.
• Terms that are in close proximity to one another are likely used to describe a single (though possibly complex) concept (i.e., each term contributing, in conjunction with other nearby terms, to the overall meaning of a concept).
• Terms that frequently co-occur are likely conceptually related.

Based on these assumptions, a collection-specific association thesaurus could be constructed using the co-occurrence of terms (i.e., terms that frequently appear within close proximity to one another), and each of its term entries would represent a small cluster of conceptually related terms. Linking these clusters together through shared terms would result in a complex network of terms linked by way of association-based relationships. Because of the above properties of the content-bearing terms and the co-occurrence-based method by which the association relationships are generated, the resulting association thesaurus would contain the data necessary to generate a network capable of representing a reasonable approximation of all the concepts represented in the document collection and the conceptual relationships among these concepts. Various appropriately short paths between two terms could then be used to represent specific, relevant aspects of a concept expressed by the term pair for which information exists in the document collection.

1.2.1. Generating the Conceptual Network

We can think of each term in the association thesaurus as a node. Child nodes for a term are generated from the associated terms defined in its thesaurus entry. For example, the thesaurus entry for term A may contain terms B, C, D, E, and F. From this information, a term cluster as shown below in Figure 1.1 could be generated. Following this pattern, term clusters could be generated for all entries in the association thesaurus.

Main term: A
is-associated-with: B, C, D, E, F

Figure 1.1 Term cluster generated from thesaurus entry for term A.

To form the full network, each term cluster would be linked to the other term clusters using shared terms as defined by the association relationships. For example, the thesaurus entry for one of term A's associated terms, term D, may contain entries for A, G, H, and I and could be linked to the term A cluster as illustrated in Figure 1.2.

Main term: A                           Main term: D
is-associated-with: B, C, D, E, F      is-associated-with: A, G, H, I

Figure 1.2 Term clusters for term A and term D connected by the association relationship generated from thesaurus entries.

The entire conceptual network would be developed by continuing this linking process using all shared terms defined through the relationships in the association thesaurus. Once the entire network has been created, if we want to know the relationship between any two terms, we could follow a path within this network from one term to the other. For example, Figure 1.3 shows one imaginary path, A – D – G – J, that could be traversed to connect term A to term J.

A – D – G – J

Figure 1.3 Imaginary relational pathway between terms A and J.
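To make the construction concrete, the sketch below shows one way an association thesaurus could be built from term co-occurrence and then linked into the kind of network illustrated in Figures 1.1 and 1.2. It is a minimal sketch under assumed choices: the fixed co-occurrence window, the Jaccard coefficient threshold (a parameter revisited in Chapter 6 and Appendix A), and the function names and default values are illustrative assumptions, not the dissertation's implementation.

```python
from collections import defaultdict
from itertools import combinations

def build_association_thesaurus(docs, window=10, jaccard_threshold=0.1):
    """Build a collection-specific association thesaurus from term co-occurrence.

    docs: list of token lists with stop words already removed.
    Returns a dict mapping each term to its IS-ASSOCIATED-WITH terms.
    """
    term_windows = defaultdict(set)   # term -> ids of text windows containing it
    pair_counts = defaultdict(int)    # (t1, t2) -> number of shared windows
    window_id = 0
    for tokens in docs:
        for start in range(0, len(tokens), window):
            window_terms = set(tokens[start:start + window])
            for term in window_terms:
                term_windows[term].add(window_id)
            for t1, t2 in combinations(sorted(window_terms), 2):
                pair_counts[(t1, t2)] += 1
            window_id += 1

    thesaurus = defaultdict(set)
    for (t1, t2), shared in pair_counts.items():
        union = len(term_windows[t1] | term_windows[t2])
        # Keep the association only if the Jaccard coefficient clears the threshold.
        if union and shared / union >= jaccard_threshold:
            thesaurus[t1].add(t2)     # the association relationship is symmetric
            thesaurus[t2].add(t1)
    return thesaurus

def build_conceptual_network(thesaurus):
    """Link the term clusters through shared terms into one undirected network."""
    network = defaultdict(set)
    for term, associated in thesaurus.items():
        for other in associated:
            network[term].add(other)
            network[other].add(term)
    return network
```

Linking through shared terms simply merges the per-entry clusters into one graph; the relational pathways discussed next are ordinary paths in this graph.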
As described above, because the overall network represents an approximation of the concepts described within the document collection and their relationships to one another, the relational pathways that exist between terms (i.e., nodes) represent specific, relevant aspects of a concept expressed by the term pair. The path A – D – G – J then can be thought of as representing a particular aspect of the concept that would be expressed by using the two terms A and J together. The intervening nodes, terms D and G, can then be identified as additional concept-relevant terms that may be used to expand the user's original query.

It is important to note that because the association thesaurus is created from the document collection itself, the conceptual relationships represented in the network are only derived from information that exists in the document collection. Because we want to use the network to expand a query for retrieving information from the collection, we are only interested in aspects of the concept that are present in the document collection. Therefore, we cannot assume that all aspects of a given concept would be represented in our network, and, therefore, this network would not be appropriate to use in other document collections. Each document collection is unique and would, therefore, require a unique conceptual network.

The success of this method for generating a conceptual network from the collection-specific association thesaurus may be impacted by various search engine design decisions including:
• The number of relational paths used to identify expansion terms for a given term pair.
• The longest length of a path between nodes that should be used (i.e., provides a meaningful representation of a concept).
• Which content-bearing terms in the association thesaurus are included in the network.

1.2.2. Formulating the Expanded Query

Once we have identified additional concept-relevant terms that may be used to expand the user's original query, we must appropriately formulate the new query to send into the search engine. The idea behind using the relational pathways between terms to identify additional terms is that we want any additional terms to be focused on the intended concept expressed in the original query. Adding such concept-relevant terms will allow us to maximize the matches to documents containing information relevant to the intended concept and minimize non-relevant document matches. Therefore, instead of using all the associated terms for an individual query term, we will only expand the query with those associated terms that we can identify as having a connection to the overall concept.

In addition to only choosing additional terms that have been found to be related to the overall concept, we want to be smart about how we use those terms in the expanded query. Therefore, we should formulate the expanded query in such a way that allows us to capitalize on the implied properties of the relationships between the terms along a pathway.

The first implied property is that for any given relational path, we can assume that the closer the nodes are to one another in the path, the stronger their similarity is to one another. For example, given the path A – B – C – D – E, the term A is likely to be more similar to B than it is to D.
The second implied property is that we can assume that the full path expresses a fairly complete specification of the aspect of the concept that it represents, but the concept will not always be expressed in an individual document using all the terms that make up the path. Therefore, a document containing terms that represent only a partial path may still represent the desired concept. For example, it may be the case that a document that describes the desired concept contains the terms A, C, D, and E but does not include the term B. In another example, it may be the case that the desired concept is described using the terms B, C, and D but does not include the terms A or E.

Taking these points into consideration, the expanded query should be formulated to look for appropriate combinations of terms that represent partial versions of the identified pathways. It is believed that a simple, yet effective implementation of this can be accomplished using nested Boolean phrases. While it is possible that other models could be used to formulate an expanded query using relational pathways with increased retrieval precision, this simple implementation will likely serve as an efficient proof-of-concept approach. As an example of generating an expanded query using a simple Boolean implementation, a relational pathway with a length of five nodes may be divided into three partial relational paths and formed into a nested Boolean query as is illustrated in Figure 1.4.

Relational Pathway: A – B – C – D – E
Boolean Query Phrase: (A AND E) OR (A AND C AND D) OR (B AND C AND E) OR (A AND B AND D) OR (B AND D AND E)
[The figure's visual representation of the partial paths is not reproduced here.]

Figure 1.4 Simple Boolean implementation for formulating an expanded query from path A – B – C – D – E. In the visual representation, the solid circles on the pathway indicate the terms that must be present in the document for the document to be identified as relevant.

The success of this method for formulating an expanded query from the relational pathways between query term pairs may be impacted by various search engine design decisions including:
• The combinations of original terms and candidate expansion terms from the relational pathways used to formulate the expanded query.
• The number of terms (i.e., nodes) that make up the path that are necessary for effectively representing the full concept in the query.
• Whether term proximity is used in the expanded query (e.g., all the terms searched for from the relational path occur within x terms of each other in the document).

1.3. Example Query Walkthrough

As previously discussed, purely symbolic search algorithms may have difficulty generating a conceptually complete result set of documents to fill the user's information need. For example, consider that a local website houses a document collection whose content is focused on the Human Factors design elements of aircraft flight decks and that a user has the following question about flight deck interface design: How do you ensure that warnings are understandable? To conduct this search, the user enters the terms warnings understandable into the search engine. When a symbolic search algorithm is used, each document returned as a "match" by the search engine will contain both the word warnings and the word understandable5.

5 Most modern search engines are capable of performing a symbolic search in which alternate grammatical forms of words are also automatically included in the search. For example, alternate forms for the word warnings may include warn, warns, and warning and alternate grammatical forms for the word understandable may include understand, understands, and understanding. The baseline search engine to be used for comparison as well as the search system enhanced with my research will also include this functionality.
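Before continuing the walkthrough, the following Python sketch makes the formulation step of Section 1.2.2 concrete by turning a single five-node relational pathway into the nested Boolean query phrase pattern shown in Figure 1.4 (and, later, Figure 1.6). The specific partial-path combinations are copied from those figures; the function name is hypothetical, and an actual system could choose different combinations or add proximity constraints, as noted above.

# A minimal sketch of turning one five-node relational pathway into the nested
# Boolean query phrase illustrated in Figures 1.4 and 1.6. The particular
# partial-path combinations are taken from those figures; other combinations
# (or proximity constraints) could be used instead.

def boolean_phrase_for_pathway(path):
    """Build the OR of partial-path AND clauses for a five-node pathway."""
    if len(path) != 5:
        raise ValueError("this illustrative scheme is defined for 5-node paths")
    a, b, c, d, e = path
    partial_paths = [          # combinations used in Figures 1.4 and 1.6
        (a, e),
        (a, c, d),
        (b, c, e),
        (a, b, d),
        (b, d, e),
    ]
    clauses = ["(" + " AND ".join(terms) + ")" for terms in partial_paths]
    return " OR ".join(clauses)

path = ["warnings", "aware", "color", "coding", "understandable"]
print(boolean_phrase_for_pathway(path))
# (warnings AND understandable) OR (warnings AND color AND coding) OR
# (aware AND color AND understandable) OR (warnings AND aware AND coding) OR
# (aware AND coding AND understandable)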
To the user who is not an expert on the contents of the document set contained on the website, the list of retrieved documents may appear to be complete and the user may think, "If I read all of these documents, my question will be answered and I will have learned all the information that the documents on this website have to offer about this topic." However, it is possible that the result set may have missed some important relevant documents. For example, the missed documents could talk about employing the aviation convention of using the color red on flight deck displays to represent warning messages as in this passage:

To ensure that the pilot is made aware of and understands the impending danger of this situation, the message presented must be color-coded red.

This passage uses a form of the word understandable but does not include any variants of the word warnings and, therefore, would not be considered a match to the user's query. However, by taking into consideration the semantic content of the passage (namely the knowledge that one definition of the word warning in the aviation domain is a message used to make pilots aware of impending danger), it is clear that the example passage is relevant to the concept of ensuring that warnings are understandable. If there were no other passages that contained the two original query words in this document, this document would not be included in the result set. A search engine that only uses a symbolic search algorithm would miss some important information relevant to answering the user's question.

The vision addressed in this research is that a symbolic search engine enhanced with the ideas presented in the previous section for an improved automatic concept-based query expansion would enhance the query sent into the search engine as follows. The enhanced search engine would begin by discovering the portion of the overall conceptual network that represents the concept warnings understandable. It does this by identifying all those pathways that exist within the network that connect the terms warnings and understandable together. For example, assume that the entry from the collection-specific association thesaurus for the term warning contains the terms caution, aware, indicate, alert, annunciation, and nuisance. Also assume that the entry for the term understandable contains the terms distinct, meaning, clarity, confusing, coding, and clear. From this information, two term clusters as shown below in Figure 1.5 could be generated.

Main term: warning
is-associated-with: caution, aware, indicate, alert, annunciation, nuisance

Main term: understandable
is-associated-with: distinct, meaning, clarity, confusing, clear, coding

Figure 1.5 Term clusters generated from collection-specific association thesaurus entries for terms warning and understandable.

The algorithm then identifies all the (appropriately short) relational pathways that exist between these two terms.
By identifying all these pathways, it in turn identifies all the aspects of the concept that are present in the document collection. Imagine that the following relational pathways exist between the two terms and illustrate some of the conceptual relationships that exist between these two terms in the document collection:
• warnings – aware – color – coding – understandable
• warnings – aware – color – confusing – understandable
• warnings – caution – clear – understandable
• warnings – indicate – clear – understandable

From the relational pathways identified, each of the intervening terms on the pathways that connect the pair of original query terms is a candidate term for enhancing and expanding the query. If we consider the relational path warnings – aware – color – coding – understandable, we see that the candidate terms are aware, color, and coding. Using the simple implementation described above in Figure 1.4 as a model, the enhanced system would create the Boolean query phrase shown in Figure 1.6.

Relational Pathway: warnings – aware – color – coding – understandable
Boolean Query Phrase: (warnings AND understandable) OR (warnings AND color AND coding) OR (aware AND color AND understandable) OR (warnings AND aware AND coding) OR (aware AND coding AND understandable)

Figure 1.6 Example Boolean query for the relational path warnings – aware – color – coding – understandable.

If we return to consider our example passage

To ensure that the pilot is made aware of and understands the impending danger of this situation, the message presented must be color-coded red.

we see that the enhanced Boolean query phrase provides a symbolic match to sub-phrase 3. The enhanced system, therefore, would return the document containing this passage as a match to the intended concept of the user's query. (Note: In the actual enhanced search engine, the enhanced Boolean query phrase shown in Figure 1.6 would be combined with the other enhanced Boolean query phrases created from each of the other relational pathways identified for the original pair of terms.)

The relational pathways in the conceptual network created through the collection-specific association thesaurus may be a powerful way to take advantage of the inherent mapping that exists between the concepts and the pattern of terms used to describe them in the document collection. Based on the relational pathways that exist in the document collection, the terms chosen to enhance the query, and how the query is reformulated, this approach to concept-based query expansion may be able to return a more complete set of documents without a significant addition of non-relevant documents.

1.4. Dissertation Report Structure

The remaining chapters present a description of the research conducted to determine if the ideas described in this introductory chapter of using relational pathways derived from an automatically generated conceptual network to expand user queries could be an effective approach in improving the performance of search engines used on domain-specific collections. The next chapter provides a review of the relevant literature in the fields of information retrieval and natural language processing to understand the foundational concepts and recent research related to automatic concept-based query expansion methods. Then, the research hypotheses posed, the methodology used to test the research hypotheses, and the experimental results collected are presented.
And finally, the analysis of the results is discussed and conclusions drawn. Data, calculations, and other supporting information are presented in Appendices A through H.

CHAPTER 2. LITERATURE REVIEW

In this chapter, the foundational concepts and recent research related to automatic concept-based query expansion methods will be reviewed.

2.1. Improving Information Retrieval using Natural Language Processing

The fields of Information Retrieval and Natural Language Processing (NLP) illustrate an excellent example of two disciplines coming together to create new ways of thinking about old problems and devising new strategies to address them. However, the path has not always been easy or successful. To make progress in solving what may at times seem to be the intractable remaining problems requires thoughtful consideration of the lessons learned by earlier attempts to apply NLP to Information Retrieval systems. In this section, an overview of the fields of Information Retrieval and NLP will be presented. In addition, the primary challenges in using NLP to improve information retrieval performance will be discussed.

2.1.1. Information Retrieval

Information retrieval has its origins in Library Science and Computer Science. One of the pioneers of information retrieval, Gerard Salton, defined it as follows:

Information Retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information. (1968, p. v)

This general, all-encompassing definition captures the wide range of areas that are part of information retrieval. Information retrieval consists of all the elements and processes that are necessary to access the information needed to satisfy an information need. This includes the behind-the-scenes data architecture and representation used in storage and processing, the front-end interface with which the user enters the query and views the results, and everything in-between. Because the goal of an Information Retrieval system is to identify and fulfill the user's information need, an important aspect of information retrieval is that "[t]he representation and organization of the information items should provide the user with easy access to the information in which he is interested" (Baeza-Yates and Ribeiro-Neto, 1999, p. 1) (Liddy, 1998; Croft, Metzler, & Strohman, 2010; Savoy and Gaussier, 2010).

2.1.1.1. Data Retrieval versus Information Retrieval

In this field, a key distinction is made between data retrieval and information retrieval (van Rijsbergen, 1979; Baeza-Yates & Ribeiro-Neto, 1999). To understand the distinction between these two types of retrieval, a definition of data versus information is important. While a universally acceptable, formal definition of these two concepts may be impossible to find, the following informal definition is sufficient to understand the important distinction. Assume that data is a raw, unprocessed message while information is a message that has been processed, structured, and organized within a particular context in order to make it useful.

This conceptual difference is extended to the different types of retrieval. Data retrieval is concerned with identifying which records (which may be full documents) contain an exact match to the keywords in the user's query. No semantic information is needed to perform the straightforward symbolic pattern matching task to identify that a "data" match exists between the user's query and a record.
A common example of a data 52 retrieval system is a relational database (van Rijsbergen, 1979; Baeza-Yates and RibeiroNeto, 1999). On the other hand, as the name implies, information retrieval is concerned with identifying information and requires that at least some of the intrinsic semantics of the text be considered. Therefore, a system must be able to identify documents that contain varied expressions of relevant concepts in order to be an effective retriever of information. The expression of the concept may be represented using a variety of terminology and grammatical constructs. The simple string matching function used for data retrieval is often not sufficient for generating a complete set of relevant documents in information retrieval. 2.1.1.2. Information Retrieval Systems Information retrieval systems are useful whenever there is a need to retrieve relevant information from a large collection of information bearing items. Not surprisingly, some of the first institutions to adopt information retrieval systems for retrieving needed information were libraries. For example, in 1964 the National Library of Medicine (NLM) began using the computer for batch processing bibliographic information retrieval (Baeza-Yates and Ribeiro-Neto, 1999). Early information retrieval systems used in libraries typically were either for searching bibliographic catalog records of the library’s locally held materials like that used by the NLM or for “searching remote electronic databases provided by commercial vendors in order to provide reference services” (Baeza-Yates and Ribeiro-Neto, 1999, p. 397). Today, the information retrieval systems used in libraries have evolved into powerful resources where the distinctions 53 between locally held and remote material is often blurred and may contain references to both physical and electronic materials. “Desktop and file system search provides another example of a widely used [information retrieval] application. A desktop search engine provides search and browsing facilities for files stored on a local hard disk and possibly on disks connected over a local network” (Büttcher, Clarke, & Cormack, 2010, p. 3). But arguably, the most well-known and heavily used information retrieval systems today are Web search engines (Büttcher, Clarke, & Cormack, 2010). Search engines allow users to manage, retrieve, and filter information from a large, constantly changing, unstructured set of documents that exist on the Web. It is estimated that the Web contains over 8.28 billion documents 6 and within this huge collection exists information related to almost any imaginable topic. The challenge in learning about a particular topic arises not because relevant information does not exist on the Web, but rather because it can be very difficult to efficiently retrieve the relatively small subset of documents on the Web that meets a specific information need. One solution to this challenge is to use a search engine. Search engines provide a way to identify and access documents containing relevant information that would otherwise remain unknown to the user (Baeza-Yates and Ribeiro-Neto, 1999; Savoy & Gaussier, 2010; Wolfram, Spink, Jansen, & Saracevic, 2001). 6 The size of the World Wide Web. (14 December 2011). Retrieved from http://www.worldwidewebsize.com/ 54 An analogous challenge exists when the information required is contained within an individual website (i.e., local domain). Individual websites can contain a large amount of detailed information about a particular subject. 
Finding the specific relevant information within the website is often difficult, and again, the search engine can serve as an important tool in helping users locate the specific documents that fill their information need.

2.1.1.3. The Retrieval Process

The retrieval process in an information retrieval system consists of two distinct phases: first, preparing the system for use and second, processing the query submitted to the system in order to return the results.

[Figure 2.1 depicts the two phases: Preparing the System (Defining the Collection, Text Acquisition, Text Transformation, Indexing) and Processing the Query (Submitting a Query, Identifying Relevant Document Matches, Returning Results).]

Figure 2.1 The two phases of the retrieval process in an information retrieval system.

2.1.1.3.1. Preparing the System

Before the first query is submitted, the system must be prepared to effectively process queries and quickly return the relevant results.

2.1.1.3.1.1. Defining the Collection

The first step in preparing the system includes specifying the documents that will be included in the searchable collection, the elements from the documents that will be searchable, and the elements of the document that will be retrieved. Specifying the documents to be included in the information retrieval system is typically a straightforward task and consists of defining the bounds of the document collection. An information retrieval system designed to be a web search engine for a local web domain may contain all textual documents stored on the domain or alternatively may include only those documents that contain information-bearing text and exclude documents that only provide structure or navigation elements (e.g., site maps, navigational menus, etc.). Next, the elements of documents that should be searchable need to be specified. This will likely include the main body of the document but may also include various metadata about the document such as title, author, document type, or other tags or keywords that have been assigned to the document. And finally, the elements of the document that will be presented in the results must be specified so that the system can store and make accessible the specified information for each document. This may include information that describes the document like its title and a hyperlink to the full text of the document. If snippets of relevant text from the document will be presented to the user, the text of the document will need to be stored by the system in such a way that the location of the relevant text can be efficiently identified and presented when the system displays the results (Baeza-Yates and Ribeiro-Neto, 1999; Croft, Metzler, & Strohman, 2010).
For example, if a document is in a format such as Microsoft Word, it must be converted so that “the control sequences and non-content data associated with a particular format are either removed or recorded as metadata” (Croft, Metzler, & Strohman, 2010, p. 18). At this point in the process, some systems also ensure that the text is encoded using the correct character encoding specification (Croft, Metzler, & Strohman, 2010). Typically, the next step of the text acquisition process is to store the converted document text along with its metadata and other information extracted from the document into a document data store. Depending on the size of the document collection, the data store system may be designed specifically to allow for fast retrieval times (Croft, Metzler, & Strohman, 2010). 57 If the collection contains source documents that are frequently changed or if new documents will be added to the collection, the text acquisition process must have some mechanism to continually revisit the set of documents within the bounds of the collection to identify and process all the new and revised documents within the collection (BaezaYates and Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010; Croft, Metzler, & Strohman, 2010). 2.1.1.3.1.3. Text Transformation The text acquired in the previous phase is sent to the text transformation or text processing phase. In this phase, the text transformation operations modify the text of the document to reduce the complexity of the document representation and determine which terms are eligible to be included in the index. Some of the most commonly used text transformation operations are the following: • Lexical Analysis – “Lexical analysis is the process of converting a stream of characters (the text of the documents) into a stream of words (the candidate words to be adopted as index terms). Thus, one of the major objectives of the lexical analysis phase is the identification of words in the text” (Baeza-Yates and Ribeiro-Neto, 1999, p.165). This step, also referred to as parsing, is often harder than is initially expected because even in such languages as English, space characters are not the only delimiters of individual words. Often characters such as punctuation marks, hyphens, numerical digits, and the case of the letters need to be considered when identifying the bounds of a 58 word (Baeza-Yates and Ribeiro-Neto, 1999; Croft, Metzler, & Strohman, 2010). In addition to identifying words, the lexical analysis process also may remove punctuation and perform case normalization in which all characters are converted to lowercase (Büttcher, Clarke, & Cormack, 2010). • Stopword Removal – Another common text transformation operation is known as stopword removal. The concept of stopwords was first introduced by Hans Peter Luhn in 1958. Stopwords are words that occur too frequently in the document collection to aid in the ability to discriminate between relevant and non-relevant documents. Baeza-Yates and Ribeiro-Neto (1999) state that “a word which occurs in 80% of the documents in the collection is useless for purposes of retrieval. Such words are frequently referred to as stopwords and are normally filtered out as potential index terms” (p. 167). Stopword removal has the added benefit of reducing the size of the index (Baeza-Yates and Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010; Savoy & Gaussier, 2010) and often also reduces query execution times by avoiding the need to process the stopwords (Büttcher, Clarke, & Cormack, 2010). 
The most common type of stopword removed before indexing is function words. “Function words are words that have no well-defined meanings in and of themselves; rather they modify other words or indicate 59 grammatical relationships. In English, function words include prepositions, articles, pronouns and articles, and conjunctions. Function words are usually the most frequently occurring words in any language” (Büttcher, Clarke, & Cormack, 2010, p. 89). However, depending on the document collection and the type of searches to be conducted, there may be other typically information-bearing terms that also may be included in the stopword list (Blanchard, 2007). Terms to be included in the stopword list may be generated from predefined stopword lists such as van Rijsbergen’s list of stopwords in English (van Rijsbergen, 1979), manually defined stopword lists that are customized based on knowledge of the content of the documents in collection or automatically created stopword lists that are created using tools that analyze word frequency and distribution or other word attributes in the document collection. (Blanchard, 2007) • Word Stemming – Word stemming is a type of morphological normalization to allow query terms and document terms that are morphological variants of the same word to be matched (Savoy & Gaussier, 2010). As described by Croft, Metzler, & Strohman (2010), [p]art of the expressiveness of natural language comes from the huge number of ways to convey a single idea. This can be a problem for search engines, which rely on matching words to find relevant documents. Instead of restricting matches to words that are identical, a number of techniques have been developed to allow a search engine to match words that are semantically related. Stemming, also called conflation, is a component of text processing that captures the relationships between different variations of a word. More precisely, stemming reduces the 60 different forms of a word that occur because of inflection (e.g., plurals, tenses) or derivation (e.g., making a verb into a noun by adding the suffix -ation) to a common stem. (p. 91) The stem or root of a word is the portion of the word that remains after its prefixes and suffixes have been removed (Baeza-Yates and Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010; Croft, Metzler, & Strohman, 2010; Savoy & Gaussier, 2010). In contrast to the process of lemmatization in Linguistics that produces linguistically valid lemmas, word stemming is a purely operational process and may produce a stem that does not have any linguistic validity. Because the stem created from the word stemming process is used only for comparison to other word stems generated using the same process, word stemming rather than the more difficult and complex process of lemmatization is adequate for most information retrieval systems (Büttcher, Clarke, & Cormack, 2010). Like stopword removal, word stemming has the added benefit of reducing the size of the index (Baeza-Yates and Ribeiro-Neto, 1999). 2.1.1.3.1.4. Indexing The final step of preparing the system is to build the index of terms for each document in the collection. Instead of searching in each of the actual documents of the collection at search time, an index in which a mapping between each eligible term and the documents in which it can be found is used. The index allows for very fast searching over 61 large document collections (Baeza-Yates and Ribeiro-Neto, 1999; Croft, Metzler, & Strohman, 2010; Savoy & Gaussier, 2010). 
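Before turning to the structure of the index itself, the following Python sketch illustrates the text transformation operations just described: lexical analysis (tokenization and case normalization), stopword removal, and a toy suffix-stripping stemmer. The stopword list and suffix rules are illustrative assumptions only; a real system would use a curated stopword list and an established stemmer such as Porter's algorithm.

# A minimal sketch of the text transformation operations described above.
# The stopword list and suffix rules are illustrative, not production-ready.

import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "that"}
SUFFIXES = ("ings", "ing", "ed", "es", "s")   # crude, not linguistically valid

def tokenize(text):
    """Lexical analysis: split on non-letter characters and lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def transform(text):
    """Full transformation: tokenize, drop stopwords, stem the rest."""
    return [stem(t) for t in tokenize(text) if t not in STOPWORDS]

print(transform("The pilot understands the meaning of the warnings."))
# ['pilot', 'understand', 'mean', 'warn']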
While an index can take a number of different forms, currently the most popular form is known as an inverted file or inverted index. "An inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up the searching task. The inverted file structure is composed of two elements: the vocabulary and the occurrences. The vocabulary is the set of all different words in the text. For each such word, a list of all the text positions where the word appears is stored. The set of all those lists is called the 'occurrences'" (Baeza-Yates and Ribeiro-Neto, 1999, p. 192). Therefore, the inverted file is able to not only store information related to which documents contain a specific word, but also where in that document the word may be found.

Performing the indexing process consists of converting "the stream of document-term information coming from the text transformation component into term-document information for the creation of inverted indexes. The challenge is to do this efficiently, not only for large numbers of documents when the inverted indexes are initially created, but also when the indexes are updated with new documents from feeds or crawls" (Croft, Metzler, & Strohman, 2010, p. 23).

2.1.1.3.2. Processing the Query

Once the system has been prepared, it is ready for the user to submit a query. After the user has submitted a query to the system, the system identifies relevant documents that match the topic described by the query and returns the results as appropriate.

2.1.1.3.2.1. Submitting a Query

To submit the query, the user enters a query into the IR system's user interface. The user's information need "underlies and drives the search process. … As a result of her information need, the user constructs and issues a query to the IR system. Typically, this query consists of a small number of terms with two to three terms being typical for a Web search" (Büttcher, Clarke, & Cormack, 2010, pp. 5-6).

2.1.1.3.2.2. Identifying Relevant Document Matches

When the query is received, the system performs the same text transformation operations that were performed on each of the documents before they were indexed. This transforms the stream of characters received by the system to consist of the same eligible terms that would be candidates for an index. For example, if stemming operations were performed on the document text to convert words into their root stems, then it is necessary to perform these same operations on the query submitted to enable appropriate matches between the query terms and the document terms to be found (Baeza-Yates and Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010; Croft, Metzler, & Strohman, 2010; Savoy & Gaussier, 2010).

After the query terms have been transformed to be consistent with the form of the terms in the document index, query enhancement processes may be performed in which the system attempts to more precisely and accurately capture the user's intended information need as a reformulated query. For example, this is the stage at which concept-based query expansion (the subject of this research) would be applied. In addition to any query enhancements, the query may be formatted to be consistent with any system-specific formatting and operator symbology required by the system (Baeza-Yates and Ribeiro-Neto, 1999; Savoy & Gaussier, 2010). Finally, the query is run against the index to identify all documents that satisfy the requirements of the query.
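As a concrete illustration of these two steps, the following Python sketch builds a simple inverted index (vocabulary terms mapped to the documents in which they occur, with word positions omitted for brevity) and then runs a conjunctive query against it, applying the same normalization to the query that was applied to the documents. The sample documents and the analyze() helper are assumptions made for the example, not part of any particular system.

# A minimal sketch of an inverted index and conjunctive (AND) query matching.
# Positions are omitted; each term maps only to the documents containing it.

import re
from collections import defaultdict

def analyze(text):
    """Stand-in for the text transformation pipeline (tokenize + lowercase)."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

DOCUMENTS = {
    "d1": "Warnings on the flight deck must be color-coded red.",
    "d2": "Display clutter can make messages confusing.",
    "d3": "Color coding helps pilots notice warnings quickly.",
}

def build_inverted_index(documents):
    """Map each vocabulary term to the set of document identifiers containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index

def conjunctive_search(index, query):
    """Return the documents containing every (normalized) query term."""
    postings = [index.get(term, set()) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()

index = build_inverted_index(DOCUMENTS)
print(conjunctive_search(index, "color warnings"))   # documents d1 and d3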
The actual mechanism by which the query is compared with the index data to determine if a match exists and the speed that this processing can occur across the entire index depends on the architecture of the index data store. A variety of index architectures have been explored, and the architecture used largely depends on the size of the collection and the retrieval needs of the users of the system (Baeza-Yates and Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010; Croft, Metzler, & Strohman, 2010; Savoy & Gaussier, 2010). Ranking and ordering processes required to calculate the relevancy ranking and other ordering criteria are then performed to determine the final order and organization of the documents that will be returned to the user (Croft, Metzler, & Strohman, 2010; Savoy & Gaussier, 2010). 2.1.1.3.2.3. Returning Results After the system has identified, ordered, and organized the set of documents that satisfies the user’s query, the resulting documents are displayed to the user. This provides both the listing of the relevant results in a form that is deemed useful (at least by the system designer) as well as providing a mechanism for accessing the relevant documents 64 and information to allow the user to make an initial assessment of the relevance of each document. The information to identify each document varies from system to system but may include the title of the document, document access information (i.e., a hyperlink to the document or address of the physical location of the document), and snippets of text that surround the terms that satisfied the requirement of the query (Baeza-Yates and Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010). 2.1.2. Natural Language Processing Natural Language Processing (NLP) has its origins in symbolic linguistics and statistical modeling. Liddy (1998) provides the following definition: Natural language processing is a set of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications. The goal of researchers and developers of NLP is to produce systems that process text of any type, the same way which we, as humans, do - systems that take written or spoken text and extract what is meant at different levels at which meaning is conveyed in language. (p. 137) Over the past 30 years a revolution in NLP has occurred that has had a huge impact on the types of techniques for analyzing and representing natural language that are used. Dale (2010) describes the situation in the mid 1990s: … the field of natural language processing was less than 10 years into what some might call its “statistical revolution.” It was still early enough that there were occasional signs of friction between some of the “old guard,” who hung on to the symbolic approaches to natural language processing that they had grown up with, and the “young turks,” with their new fangled statistical processing techniques, which just kept gaining ground. Some old guard would give talks pointing out that there were problems in natural language processing that were beyond the reach of statistical or corpus-based methods; meanwhile, the occasional young turk could 65 be heard muttering a variation on Fred Jelinek’s 1988 statement that “whenever I fire a linguist our system performance improves.” (p. 
3) For a time it looked as if the old guard of symbolic linguists would need to give way to statistical modelers; more recently, however, there has been a trend to develop techniques in which the important lessons learned from symbolic linguistics are incorporated into the latest statistical modeling techniques (Dale, 2010). I think it is important to note that much of the friction and growing pains related to the recent evolution in NLP has taken place during the same time IR researchers were trying to augment their systems with NLP and were repeatedly disappointed with their levels of success.

2.1.2.1. Stages of Analysis in NLP

The stages of analysis in NLP move from processing the symbolic representation of the text up through identifying conceptual meaning conveyed in the text. These stages are illustrated in Figure 2.2. Each of these analysis stages is described below except for pragmatic analysis7, which is an element of Natural Language Processing beyond the scope of this research project.

7 Pragmatic Analysis is concerned with understanding the context and purpose of a message.

[Figure 2.2 shows the stages of analysis moving upward from the surface text: tokenization, lexical analysis, syntactic analysis, semantic analysis, and pragmatic analysis, leading to the speaker's intended meaning.]

Figure 2.2 The stages of analysis in Natural Language Processing (adapted from Figure 1.1 in Dale, 2010, p. 4).

2.1.2.1.1. Tokenization

The first stage of analysis in NLP consists of "the task of converting a raw text file, essentially a sequence of digital bits, into a well-defined sequence of linguistically meaningful units: at the lowest level characters representing the individual graphemes in a language's written system, words consisting of one or more characters, and sentences consisting of one or more words" (Palmer, 2010, p. 9). As discussed earlier, the process of identifying words from a stream of characters is not always straightforward; even in such languages as English, space characters are not the only delimiters of individual words. The tokenization process must consider characters representing punctuation marks, hyphens, numerical digits, as well as the case of letters to determine what segment of the text constitutes a word (Baeza-Yates and Ribeiro-Neto, 1999; Croft, Metzler, & Strohman, 2010; Dale, 2010).

2.1.2.1.2. Lexical Analysis

The lexical analysis stage of processing in NLP performs text analysis at the level of the word. One of the primary tasks at this stage is lemmatization. Lemmatization is a process of relating morphological variants to their lemma. To do this, "morphologically complex strings are identified, decomposed into invariant stem (= lemma's canonical form) and affixes, and the affixes are then deleted. The result is texts as search objects that consist of stems only so that they can be searched via a lemma list" (Hippisley, 2010, p. 32). As discussed earlier, lemmatization is similar to word stemming, but there is an important distinction between these two processes. A goal of lemmatization is to produce linguistically valid lemmas while word stemming is a purely operational process and may produce a stem that does not have any linguistic validity. Because the stem created from the word stemming process is used only for comparison to other word stems, word stemming rather than the more difficult and complex process of lemmatization is adequate for most information retrieval systems (Büttcher, Clarke, & Cormack, 2010).
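The distinction can be seen by running an off-the-shelf stemmer and lemmatizer side by side; the short Python sketch below uses NLTK's PorterStemmer and WordNetLemmatizer as stand-ins. The specific outputs shown in the comments are typical but depend on the library version and its WordNet data, and this comparison is offered only as an illustration of the difference described above.

# A small illustration of the distinction drawn above: a stemmer produces
# operational stems that need not be valid words, while a lemmatizer returns
# linguistically valid lemmas. Requires the nltk package and its WordNet data
# (nltk.download("wordnet")).

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "warnings", "geese"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))

# Typical output:
# studies -> studi | study      (the stem is not a valid English word)
# warnings -> warn | warning
# geese -> gees | goose         (the lemmatizer handles the irregular plural)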
Another distinction between NLP and Information Retrieval related to lexical analysis is that some Information Retrieval researchers bundle the step of tokenization with other text transformation processes such as word stemming and refer to the entire 68 stage as lexical analysis. However, NLP researchers tend to talk about tokenization as a separate and distinct process from lexical analysis. 2.1.2.1.3. Syntactic Analysis The syntactic analysis area is arguably the most well-established area in natural language processing. “A presupposition in most work in natural language processing is that the basic unit of meaning analysis is the sentence: a sentence expresses a proposition, an idea, or a thought, and says something about some real or imaginary world. Extracting meaning from a sentence is thus a key issue” (Dale, 2010, p. 6). Syntactic analysis is comprised of applying “techniques for grammar-driven natural language parsing, that is, analyzing a string of words (typically a sentence) to determine its structural description according to a formal grammar” (Ljunglöf & Wirén, 2010, p. 59). Some of the challenges syntactic analysis must overcome include the following: • Robustness – Robustness is the system’s ability to gracefully handle input that does not conform to the expectations of the system. One source of nonconformance is “that the input may contain errors; in other words, it may be ill-formed (though the distinction between well-formed and ill-formed input is by no means clear cut.)” (Ljunglöf & Wirén, 2010, p. 80). Another source of non-conformance is undergeneration in which the rules of the grammar being used by the system does not adequately cover the natural language being input. Ljunglöf & Wirén (2010) talk about the desirability of graceful degradation where “robustness means that small deviations from the 69 expected input will only cause small impairments of the parse result, whereas large deviations may cause large impairments” (p. 80). • Disambiguation – “At any point in a pass through a sentence, there will typically be several grammar rules that might apply” (Ljunglöf & Wirén, 2010, p. 60). The challenge then is determining which of the possible syntactic structures that appears to fit the input is the one intended by the creator of the sentence. While the information required to disambiguate possible syntactic structures may not always be available during this stage, at a minimum the analysis typically can narrow the possible options. This problem is helped by the observation made by Ljunglöf & Wirén (2010) that “although a general grammar will allow a large number of analyses of almost any nontrivial sentence, most of these analyses will be extremely implausible in the context of a particular domain” (p. 81). 2.1.2.1.4. Semantic Analysis The semantic analysis in NLP uses the results from the previous levels of analysis to analyze “the meanings of words, fixed expressions, whole sentences, and utterances in context. In practice, this means translating original expressions into some kind of semantic metalanguage. The major theoretical issues in semantic analysis therefore turn on the nature of the metalanguage or equivalent representational system” (Goddard & Schalley, 2010, p. 94). In this way, the semantic analysis works to translate the text into a semantic representational system to determine the meaning of words, multi-word 70 expressions, phrases and/or indefinitely large word combinations such as sentences in order to understand the message. 
The variety of approaches and theories used to conduct semantic analysis tend to be divided along two dimensions. The first is a compositional versus lexical dimension. The compositional approaches are concerned with working bottom-up to construct meaning from the lexical items whose meaning is accepted as a given. At the other end of this dimension are the lexical approaches that work to precisely analyze the meaning of the lexical items using either decomposition or relational methods. The second dimension is formal versus cognitive. Formal approaches focus on the readily apparent structural patterns in the messages as well as the importance of linking the grammatical components to semantic components. On the other hand, cognitive approaches focus on the patterns and processes of the organization of the conceptual content in a language (Goddard & Schalley, 2010; Talmy, forthcoming). Regardless of the approach used to conduct the semantic analysis, “[i]t is widely recognized that the overriding problems in semantic analysis are how to avoid circularity and how to avoid infinite regress. Most approaches concur that the solution is to ground the analysis in a terminal set of primitive elements, but they differ on the nature of the primitives … Approaches also differ on the extent to which they envisage that semantic analysis can be precise and exhaustive” (Goddard & Schalley, 2010, p. 94). Dale points out that in semantic analysis “we begin to reach the bounds of what has so far been scaled up from theoretical work to practical application” (Dale, 2010, p. 6). The problems of understanding text through semantic analysis are difficult and 71 complex. According to Goddard & Schalley (2010), the outlook is grim for significant advancements in semantic analysis in the near future. “Despite the tremendous advances in computation power and improvements in corpus linguistics … [m]any common semantic phenomena are likely to remain computationally intractable for the foreseeable future. Rough-and-ready semantic processing (partial text understanding), especially in restricted domains and/or restricted ‘sublanguages’ offer more promising prospects” (Goddard & Schalley, 2010, p. 114). 2.1.3. Describing Information Needs in the Form of a Query The primary purpose of an IR system is to efficiently and effectively fill a user’s information need (Baeza-Yates and Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010; Croft, Metzler, & Strohman, 2010). The term information need was coined by Taylor in his 1962 paper “The Process of Asking Questions”. Taylor describes an information need as something distinct and traceable that is developed through a process that progresses through four levels of question formation. The information need begins as a vague, inexpressible sense of dissatisfaction, progresses to a “conscious mental description of an ill-defined area of indecision” (p. 392), then develops into an unambiguous and rational expression of the question, and finally is adapted into a form the user believes is appropriate to pose to the IR system. This final step of adapting the question into a form appropriate for the system is impacted not only by the form in which the user must express the information but also by what the user believes that the system can provide. 
Because the already complex process of developing the user's information need must be translated into a query that the system can process, it is no surprise that "[a] query can be a poor representation of the information need" (Croft, Metzler, & Strohman, 2010, p. 188). Two issues that further compound the difficulty in appropriately expressing an information need in the form of an acceptable IR system query are the gap in the user's knowledge and the use of natural language as the medium for concept representation.

2.1.3.1. Gap in the User's Knowledge

It can be difficult to ask about things a user doesn't know. Belkin (1980) described the issue as the necessity for the users to develop their information need from an inadequate state of knowledge. "The expression of an information need … is in general a statement of what the user does not know" (Belkin, Oddy, & Brooks, 1982, p. 64). The IR system must then try to match this inadequate, uncertain, imprecise, and possibly incoherent statement of the need with documents that contain representations of a coherent state of knowledge (Belkin, 1980; Belkin, Oddy, & Brooks, 1982).

2.1.3.2. Natural Language as the Medium for Concept Representation

Natural language is used as the medium to form the information bridges as described by Frické (2012) between the user with the information need and the authors of the documents in the collection. The user forms a query using natural language as an abstraction to represent an information need, and the authors of the documents in the collection use natural language as an abstraction to represent the concepts in the documents. The translation into and representation using natural language introduces issues with the accuracy and completeness of the description of the concepts on both sides of the information bridge. Therefore, the search engine must perform its task using a frequently incomplete, imprecise, and ambiguous description of both the user's information need and the related concepts included in the documents within the collection (Savoy & Gaussier, 2010). Attributes of natural language that can cause textual expressions to be incomplete, imprecise, ambiguous and, therefore, difficult for IR systems to handle include word morphology, orthographic variation, and various forms of syntax.

2.1.3.2.1. Morphology

Morphology relates to the composition of words in a language. A word may have a number of different valid morphological forms, each of which represents the same underlying concept. Morphological forms of words may be created through inflectional construction such as singular versus plural forms (e.g., "book" and "books") or gender assignment (e.g., "actor" and "actress"), verb conjugation such as the present participle (e.g., "laugh" and "laughing") and past participle (e.g., "play" and "played"), and derivational construction such as the gerund (e.g., the verb "train" and the noun "training") (Savoy & Gaussier, 2010). In English, as illustrated in the previous examples, morphological variants of a word are typically formed by adding affixes to the beginning of the word (i.e., prefix) or to the end of a word (i.e., suffix). However, there are exceptions that introduce complexity and challenges and prohibit a morphological analyzer from relying only on simple rules to identify all the word forms that represent the same concept.
For example, challenges are introduced by exceptions like the pluralization of "goose" to "geese" and are further complicated by the fact that the pluralized form of the very similar word "moose" is "moose". Homographs such as "train" as in a locomotive and "train" as in instruct also pose difficult challenges when trying to determine which words represent the same underlying concept (Savoy & Gaussier, 2010). When searching for relevant matches within a document collection, it is necessary to return results for all morphological variants of a word in order to have complete recall. However, the many exceptions in a language make this a difficult task.

2.1.3.2.2. Orthographic Variation

Orthographic variation relates to the differences in the acceptable spelling of a word. In the 1800s, much effort was put into ensuring that spelling was standardized. However, spelling differences still exist. Two common sources of orthographic variation are regional differences (e.g., the British and American English forms of the word "grey" and "gray") and the transliteration of foreign names (e.g., "Creutzfeld-Jakob" and "Creutzfeldt-Jacob"). In addition to acceptable alternative spellings, typographic errors and misspellings are also commonplace in document collections. As with morphological variants, when searching for matches within a document collection it may be important to return all orthographical variants of a word in order to have complete recall (Savoy & Gaussier, 2010).

2.1.3.2.3. Syntax

There are a number of valid syntactical constructions that can make it difficult for automated analyzers to correctly determine the appropriate meaning of a phrase or sentence. As discussed earlier, a number of valid grammatical structures may fit a given sentence, each of which provides a different meaning. Consider the following example of syntactic ambiguity: "The fish is ready to eat." Like the optical illusion of the Rubin Vase, the two interpretations (that it is time to feed your fish, or that your fish dinner is ready for you to eat) seem to bounce back and forth in your mind. Without context, either interpretation is valid but represents a very different semantic concept (Manning & Schütze, 1999; Savoy & Gaussier, 2010).

2.1.4. Will NLP techniques improve information retrieval systems?

While some researchers consider information retrieval an example of a successful applied domain of NLP (Savoy & Gaussier, 2010), others have been disappointed with the lack of success in using NLP to improve the performance of IR systems (Brants, 2003; Lease, 2007; Sparck Jones, 1997; Smeaton, 1999; Buckley, 2004). In the early days, there was optimism about the large improvement in the performance of IR systems that could be reaped by augmenting IR systems with NLP techniques. The combination seemed natural. Information retrieval systems needed to go beyond simple symbolic searches in order to find information that required a deeper level of understanding of the content that pattern matching algorithms could not identify, and Natural Language Processing brought techniques to extract the meaning conveyed at different levels of written and spoken language. However, as described by Brants (2003), "[s]imple methods (stopwording, Porter-style stemming, etc.) usually yield significant improvements, while higher-level processing (chunking, parsing, word sense disambiguation, etc.) only yield very small improvements or even a decrease in accuracy" (p. 1).
These were disappointing results since it was hoped that Natural Language Processing could allow IR systems to move beyond the limitations of symbolic pattern matching by providing them with the capability to work at the higher-level of the semantic content of both the queries posed and the documents in the collection. Several researchers have posited ideas about why NLP techniques have not been more successful. Brants (2003) stated that the use of existing ‘out-of-the-box’ NLP components not specifically geared for IR was one reason why NLP techniques have not been more successful at improving the performance of information retrieval. In his 2003 review of research investigating the use of NLP techniques to improve retrieval, he described that the examples of successful NLP techniques like the use of the Porter stemming algorithm (Porter, 1980) and statistical “phrases” were techniques that actually have linguistic flaws and, in some cases, are counter to linguistic knowledge. But, they are specifically designed with the goal of improving retrieval rather than adherence to linguistic constructs, and this allows them to be more successful at performing the task for which they are used. 77 Qiu and Frei (1993) presented their thoughts on why the specific technique of Query Expansion in information retrieval systems using NLP has not been more successful. They stated that the lack of success with previous query expansion methods was primarily because the methods were based on adding terms similar to each of the individual terms used to construct the query instead of adding terms similar to the overall concept the query describes. This method of expanding a query by adding terms similar only to the individual terms of the query often introduces tangential, non-relevant concepts to the query causing the search algorithm to identify documents that are not relevant to the original query concept. In addition, two other factors have likely contributed to the lack of success experienced. First, as mentioned earlier, much of the friction and growing pains related to the recent evolution in NLP was taking place at the same time that information retrieval researchers were trying to augment their systems with NLP and being disappointed with their levels of success. The flux and level of maturity could have played a role in the NLP techniques that were available to an information retrieval system. And second, NLP is still working on the tough issues at the semantic level of processing natural language. “[T]he known is the surface text, and anything deeper is a representational abstraction that is harder to pin down; so it is not surprising that we have better developed techniques at the more concrete end of the processing spectrum” (Dale, 2010, p. 5). Lease (2007) stated that in the field of NLP “the statistical revolution has continued to expand the fields horizons; the field today is thoroughly statistical with 78 robust methodology for estimation, inference, and evaluation. As such, one may well ask if there are new advancements that suggest re-exploring prior directions in applying NLP to [information retrieval]?” (p. 1). By carefully considering the potential pitfalls, challenges, and new opportunities highlighted by such researchers as Brants, Qiu and Frei, Dale, and Lease, it is possible that NLP may still hold the key to unlock the large performance benefits available by moving beyond symbolic searches. 2.2. 
Lexical Acquisition of Meaning
The lexical acquisition of meaning is focused primarily on automatically measuring the relative value of how similar (or dissimilar) one word is to another word. This process of calculating a relative measure is a substitute for actually determining what the meaning of a word is, but, despite this, it is a useful measure in information retrieval systems (Manning & Schütze, 1999). There are a number of different methods that have been used to measure semantic similarity. A few of the more popular and well-known methods are described below.
2.2.1. Vector Space Models
Vector space models represent one of the oldest and most well-known methods for automatically measuring semantic similarity (Büttcher, Clarke, & Cormack, 2010). In these models, words are represented as vectors in a multi-dimensional space.
2.2.1.1. Multi-dimensional Space
The multi-dimensional space used in vector space models may be created from different elements, but two of the most common are document space and word space. To describe the differences between these two types of multi-dimensional space, examples are presented below that have been adapted from those in Manning & Schütze (1999).

               d1   d2   d3   d4   d5   d6
  cosmonaut     1    0    1    0    0    0
  astronaut     0    1    0    0    0    0
  moon          1    1    0    0    0    0
  car           1    0    0    1    1    0
  truck         0    0    0    1    0    1

Figure 2.3 An example document-by-word matrix (adapted from Figure 8.3 in Manning & Schütze, 1999, p. 297).

In Figure 2.3, we see that a document space is created from a document-by-word matrix where the values in the cells represent the number of times the word occurs in the document. Using document space, words are represented as vectors formed from their occurrences in each of the documents in the collection and “[w]ords are deemed similar to the extent that they occur in the same documents. In document space, cosmonaut and astronaut are dissimilar (no shared documents); truck and car are similar since they share a document: they co-occur in [document] d4” (Manning & Schütze, 1999, p. 296).

               cosmonaut   astronaut   moon   car   truck
  cosmonaut        2           0         1     1      0
  astronaut        0           1         1     0      0
  moon             1           1         2     1      0
  car              1           0         1     3      1
  truck            0           0         0     1      2

Figure 2.4 An example word-by-word matrix (adapted from Figure 8.4 in Manning & Schütze, 1999, p. 297).

In Figure 2.4, we see that a word space is created from a word-by-word matrix where the values in the cells represent the number of documents in which the two intersecting words occur together. Using word space, words are represented as vectors formed from their co-occurrence with other words. “Co-occurrence can be defined with respect to documents, paragraphs or other units. Words are similar to the extent that they co-occur with the same words. Here, cosmonaut and astronaut are more similar than before since they both co-occur with moon” (Manning & Schütze, 1999, p. 297).
2.2.1.2. Constructing Vectors
The vectors created from a multi-dimensional space may be constructed using either binary vectors or real-valued vectors. A binary vector is one in which each dimension is assigned either a 0 or a 1. However, a more powerful representation, though more computationally complex, is one in which real values are used for each of the dimensions. A vector constructed from real values represents more information about the level or strength of each dimension.
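To make the two kinds of multi-dimensional space concrete, the short sketch below builds both a document-by-word and a word-by-word matrix from a toy collection whose occurrence pattern mirrors Figures 2.3 and 2.4. The document contents and the Python representation are illustrative assumptions only; a real system would first tokenize, normalize, and index the text.

    from collections import Counter

    # Toy collection: each document is given as a list of its terms.
    docs = {
        "d1": ["cosmonaut", "moon", "car"],
        "d2": ["astronaut", "moon"],
        "d3": ["cosmonaut"],
        "d4": ["car", "truck"],
        "d5": ["car"],
        "d6": ["truck"],
    }
    vocab = sorted({t for terms in docs.values() for t in terms})

    # Document-by-word matrix: one row per document, one column per term,
    # each cell holding the number of times the term occurs in the document.
    doc_word = {d: [Counter(terms)[t] for t in vocab] for d, terms in docs.items()}

    # Word-by-word matrix: cell (a, b) counts the documents in which terms a and b
    # occur together (the diagonal counts the documents containing the term).
    word_word = {(a, b): 0 for a in vocab for b in vocab}
    for terms in docs.values():
        present = set(terms)
        for a in present:
            for b in present:
                word_word[(a, b)] += 1

    print(doc_word["d4"])               # car and truck both occur in d4
    print(word_word[("car", "truck")])  # 1: they share exactly one document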
2.2.1.3. Vector Similarity Measures
After vectors have been constructed, the next step is to measure how similar two vectors are to one another (i.e., how close they are to one another in the multi-dimensional vector space) in order to determine the similarity of (or association between) the words that they represent. When the vectors represent words, the similarity of the two vectors may be measured to derive a relative value that can be used to determine how similar one word is to another word within the document collection. Similarity between two vectors can be measured using a variety of vector similarity measures. The most commonly used are outlined in Table 2.1 and described in more detail in the sections below.

  Vector Similarity Measure    Definition
  Matching coefficient         |X ∩ Y|                    (intersection)
  Dice coefficient             2|X ∩ Y| / (|X| + |Y|)     (intersection over mean)
  Jaccard coefficient          |X ∩ Y| / |X ∪ Y|          (intersection over union)
  Overlap coefficient          |X ∩ Y| / min(|X|, |Y|)
  Cosine coefficient           |X ∩ Y| / √(|X| × |Y|)

Table 2.1 Vector-based similarity measures.

2.2.1.3.1. Matching Coefficient
The simplest similarity measure is the matching coefficient. The similarity value using the matching coefficient is determined by counting the number of dimensions on which both vectors have a non-zero value or, in other words, the intersection of the two vectors. This differs from the other similarity measures because it does not account for any differences in the length of the vectors. For example, assume that vector A has 10 non-zero dimensions, vector B has 12 non-zero dimensions, and vector C has 1,000 non-zero dimensions. If vector A and vector B share 8 dimensions and vector A and vector C share 8 dimensions, the similarity values for A-B and for A-C would both be 8. However, considering the differences in the length of the various vectors, it is likely that vector A and vector B are semantically more similar to one another than vector A is to vector C because proportionally there is significantly more overlap between vectors A and B (van Rijsbergen, 1979; Manning & Schütze, 1999).
2.2.1.3.2. Dice Coefficient
The Dice Coefficient performs an intersection over mean calculation to normalize for the length of the vectors and then measures the amount of overlap the vectors have with one another. The result is a value between 0.0 and 1.0. The value 0.0 means that there is no overlap between the two vectors and, therefore, no similarity. The value 1.0 means that the vectors have perfect overlap and are, therefore, identical (van Rijsbergen, 1979; Manning & Schütze, 1999).
2.2.1.3.3. Jaccard Coefficient
The Jaccard Coefficient is also known as the Tanimoto Coefficient. It is similar to the Dice Coefficient, but instead of the intersection over mean, it is based on an intersection over union (IOU) calculation to normalize and measure the amount of overlap between two vectors. The Jaccard Coefficient values range from 0.0 to 1.0, where 0.0 represents no overlap and 1.0 represents perfect overlap (i.e., the vectors are identical). The difference between the Jaccard Coefficient and the Dice Coefficient is that the Jaccard Coefficient imposes a greater penalty in situations where the proportion of shared dimensions with non-zero values is small with respect to the overall length of the vectors (i.e., the overall number of non-zero dimensions that each vector possesses) (van Rijsbergen, 1979; Manning & Schütze, 1999).
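The sketch below implements the binary (set-based) form of each coefficient in Table 2.1, where X and Y are taken to be the sets of dimensions on which two vectors are non-zero. The three sets reuse the sizes and overlaps of the hypothetical vectors A, B, and C discussed under the matching coefficient; the particular dimension indices are invented for the example.

    import math

    def matching(x, y):   # |X ∩ Y|
        return len(x & y)

    def dice(x, y):       # 2|X ∩ Y| / (|X| + |Y|)
        return 2 * len(x & y) / (len(x) + len(y))

    def jaccard(x, y):    # |X ∩ Y| / |X ∪ Y|
        return len(x & y) / len(x | y)

    def overlap(x, y):    # |X ∩ Y| / min(|X|, |Y|)
        return len(x & y) / min(len(x), len(y))

    def cosine(x, y):     # |X ∩ Y| / sqrt(|X| * |Y|)
        return len(x & y) / math.sqrt(len(x) * len(y))

    A = set(range(10))                              # 10 non-zero dimensions
    B = set(range(2, 14))                           # 12 dimensions, 8 shared with A
    C = set(range(2, 10)) | set(range(100, 1092))   # 1,000 dimensions, 8 shared with A

    for name, other in (("A-B", B), ("A-C", C)):
        print(name, matching(A, other), round(dice(A, other), 3),
              round(jaccard(A, other), 3), round(overlap(A, other), 3),
              round(cosine(A, other), 3))
    # The matching (and overlap) values are identical for A-B and A-C, while the
    # length-normalizing measures score A-B much higher than A-C.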
2.2.1.3.4. Cosine Coefficient
The Cosine similarity calculation uses linear algebra to measure the angle between two vectors. Smaller angles represent higher levels of similarity between the two vectors (van Rijsbergen, 1979; Manning & Schütze, 1999; Büttcher, Clarke, & Cormack, 2010). Because the Cosine Coefficient is based on the angle between the two vectors rather than the amount of overlap, the Cosine Coefficient imposes a reduced penalty in situations where the non-zero dimensions of the two vectors are very different. In other words, when using the Cosine Coefficient, it is not necessary for the two vectors being compared to be similar in size. “This property of the cosine is important in Statistical NLP since we often compare words or objects that we have different amounts of data for, but we don’t want to say they are dissimilar just because of that” (Manning & Schütze, 1999, p. 300).
2.2.1.4. Using Vector Space Models in Retrieval
Vector space models can be used in the retrieval process to identify the set of documents in a collection that are most relevant to a query. To do this, “[q]ueries as well as documents are represented as vectors in a high-dimensional space in which each vector component corresponds to a term in the vocabulary of the collection” (Büttcher, Clarke, & Cormack, 2010, p. 55). In this way, a real-valued vector of the query and real-valued vectors of each of the documents in the collection are created from a multi-dimensional space comprised of all the words present in the query and in the document collection (i.e., the union of the words in the query and the words in the document collection). The relative similarity between the query vector and each of the document vectors is then measured using a vector similarity measure. Typically, the vector similarity measure used is the Cosine similarity measure so that the disparity in the number of vector dimensions between the query vector and the document vectors does not adversely impact the similarity calculations. “If we can appropriately represent queries and documents as vectors, cosine similarity may be used to rank the documents with respect to the queries. In representing a document or query as a vector, a weight must be assigned to each term that represents the value of the corresponding component of the vector” (Büttcher, Clarke, & Cormack, 2010, p. 57). Typically, the weight calculated is based on Term Frequency (TF) and Inverse Document Frequency (IDF).
Term Frequency (TF) represents the frequency with which a term appears in a document. It is used in the weight calculation based on the idea that terms that occur more frequently in a document should be assigned a greater weight than terms that occur less frequently (i.e., the more frequently occurring term is more important to the concept represented in the document). Inverse Document Frequency (IDF) represents the frequency with which a term occurs in documents across the collection. It is used in the weight calculation based on the idea that terms that occur in a large number of documents should be assigned a lower weight than terms that occur less frequently in the document collection (i.e., the less frequently occurring term is better able to discriminate between relevant and non-relevant documents in the collection) (Büttcher, Clarke, & Cormack, 2010).
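A minimal sketch of this ranking procedure is given below, under simplified assumptions: raw term frequency, a basic logarithmic IDF, whitespace tokenization, and no stemming or stop-word removal. The three documents and the query are invented for illustration.

    import math
    from collections import Counter

    docs = {
        "d1": "the crew trained for the moon landing".split(),
        "d2": "the truck hauled the car to the garage".split(),
        "d3": "astronauts and cosmonauts trained for the moon mission".split(),
    }
    N = len(docs)
    df = Counter(t for terms in docs.values() for t in set(terms))  # document frequency

    def tf_idf(terms):
        """Weight each term by TF * IDF; terms unseen in the collection get no weight."""
        tf = Counter(terms)
        return {t: tf[t] * math.log(N / df[t]) for t in tf if t in df}

    def cosine(u, v):
        num = sum(u[t] * v[t] for t in set(u) & set(v))
        den = (math.sqrt(sum(w * w for w in u.values())) *
               math.sqrt(sum(w * w for w in v.values())))
        return num / den if den else 0.0

    doc_vectors = {d: tf_idf(terms) for d, terms in docs.items()}
    q_vec = tf_idf("moon landing".split())
    ranking = sorted(doc_vectors, key=lambda d: cosine(q_vec, doc_vectors[d]), reverse=True)
    print(ranking)   # d1 ranks first: it contains both 'moon' and the rarer term 'landing'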
The vector space models used in the retrieval process can be used not only with respect to documents but also at the level of paragraphs or other units of text that are deemed useful by the information retrieval system designer. Therefore, by using the vector space model, it is possible to identify the set of most relevant paragraphs to a query if a finer level of retrieval is desired. 2.2.2. Latent Semantic Analysis As described by Landauer, Foltz, and Laham (1998), “Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text” (p. 259). It is an extension of the vector space model in which singular value decomposition (SVD) 86 is used to reduce the dimensionality of the term-vector space to extract and infer relations between words based on how the words are used in the text (Büttcher, Clarke, & Cormack, 2010; Cilibrasi & Vitányi, 2007; Landauer, Foltz, and Laham, 1998; Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990). Proponents of LSA say that this method “is capable of correctly inferring much deeper relations (thus the phrase latent semantic), and, as a consequence, they are often much better predictors of human meaning-based judgments and performance than are the surface-level contingencies” (Landauer, Foltz, & Laham, 1998, pp. 260-261). It is also believed that SVD is able “to reduce the negative impact of synonymy – multiple terms with the same meaning – by merging related words into common dimensions” (Büttcher, Clarke, & Cormack, 2010, p. 78). 2.2.2.1. Singular Value Decomposition The main distinctive element of LSA is the use of Singular Value Decomposition (SVD) to reduce the dimensionality of the semantic space represented as the term-vector space. SVD is a mathematical factor analysis from linear algebra that performs a linear decomposition on the term-vector and then uses the resulting scaling values to determine which dimensions may be removed (Büttcher, Clarke, & Cormack, 2010; Cilibrasi & Vitányi, 2007; Landauer, Foltz, and Laham, 1998). As mentioned, the first step is to perform a linear decomposition in which the original “rectangular matrix is decomposed into the product of three other matrices. One component matrix describes the original row entities as vectors of derived orthogonal factor values, another describes the original column entities in the same way, and the 87 third is a diagonal matrix containing scaling values such that when the three components are matrix multiplied, the original matrix is reconstructed” (Landauer, Foltz, & Laham, 1998, p. 263). The second step of the SVD is to reduce the number of dimensions by deleting the scaling coefficients in the diagonal matrix starting with the smallest coefficients until only the desired number of dimensions remains. The final matrix is then reconstructed using the modified, reduced-dimensionality diagonal matrix. The resulting values in the individual cells of the final matrix represent the level of similarity between the entity represented by the row and the entity represented by the column (e.g., similarity between a word and a document, similarity between two words, or similarity between two documents) (Landauer, Foltz, and Laham, 1998). 
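As a rough illustration of these two steps, the sketch below applies SVD to the toy word-by-document matrix from Figure 2.3 and keeps only the two largest scaling values before reconstructing the matrix. The use of NumPy and the choice of two dimensions are assumptions made for the example, not part of Landauer, Foltz, and Laham's procedure; choosing the dimensionality is itself a key design decision, as discussed next.

    import numpy as np

    terms = ["cosmonaut", "astronaut", "moon", "car", "truck"]
    # Word-by-document counts from Figure 2.3 (rows: terms, columns: d1-d6).
    A = np.array([
        [1, 0, 1, 0, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 0],
        [1, 0, 0, 1, 1, 0],
        [0, 0, 0, 1, 0, 1],
    ], dtype=float)

    # Step 1: decompose A into row factors U, scaling values S, and column factors Vt.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    # Step 2: delete the smallest scaling values (keep k of them) and reconstruct.
    k = 2
    A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    i, j = terms.index("cosmonaut"), terms.index("astronaut")
    print(round(cos(A[i], A[j]), 3))      # 0.0 in the original space: no shared documents
    print(round(cos(A_k[i], A_k[j]), 3))  # clearly positive in the reduced space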
One of the primary assumptions of LSA is “that reducing the dimensionality (the number parameters by which a word or passage is described) of the observed data from the number of initial contexts to a much smaller – but still large – number will often produce much better approximations to human cognitive relations. It is this dimensionality reduction step, the combining of surface information into a deeper abstraction, that captures the mutual implications of words and passages” (Landauer, Foltz, and Laham, 1998, pp. 261-262). Therefore, determining the number of dimension that should be chosen to represent the semantic space is a key element of SVD and has a large impact on the results of LSA. However, methods are still evolving for choosing the optimal dimensionality for a data set. 88 2.2.2.2. Distinctions of LSA from Other Statistical Approaches LSA differs in several important ways from other common statistical lexical acquisition of meaning techniques. First, to a greater extent than other vector space models, the level of similarity between a word and a document in LSA is dependent not only on the attributes of documents in which the word occurs, but also on the attributes of documents in which the word does not occur. The LSA calculation incorporates data from all the documents in the collection and considers which words occur and which do not occur in each. For example, the fact that word a does not occur in document D1 but words b, c, and d do occur in document D1 may impact the calculation of the similarity level between word a and document D2. Second, LSA does not consider word order and other grammatical constructs, but instead LSA uses “the detailed patterns of occurrences of very many words over very large numbers of local meaning-bearing contexts, such as sentences or paragraphs, treated as unitary wholes. … Another way to think of this is that LSA represents the meaning of a word as a kind of average of the meaning of all the passages in which it appears and the meaning of a passage as a kind of average of the meaning of all the words it contains” (Landauer, Foltz, and Laham, 1998, p. 261). And third, unlike other common statistical methods used to calculate semantic similarity, LSA is both used as a successful computational method and as a model of human learning. As a computational method, it is used to estimate the similarity between two words and between words and other units of text (e.g., documents, paragraphs, or phrases) by mathematically extracting and inferring the relationships between the 89 expected contextual usages of words. As a model of human learning, it is posited as a computational theory of inductive learning by which humans acquire and represent knowledge in an environment that does not appear to contain adequate information to account for the level of knowledge the human learns (i.e., the problem of the ‘poverty of the input’ or ‘insufficiency of evidence’) (Landauer, Foltz, and Laham, 1998). 2.2.3. Normalized Web Distance Normalized Web Distance (NWD) is a compression-based similarity metric created by Cilibrasi & Vitányi (2007) to determine the relative similarity between words or phrase and is computed using the Web and a search engine. In the earliest version of the NWD theory, Cilibrasi & Vitányi (2007) refer to NWD as the more specific ‘Normalized Google Distance’ because they had specifically used the Google search engine to supply the necessary page counts used in their calculations. 
Later, Cilibrasi & Vitányi re-named the metric to the more broadly applicable ‘Normalized Web Distance’ to allow for any search engine with sufficient coverage of the web to be used to perform the relative similarity calculations (Cilibrasi & Vitányi, 2007; Vitányi & Cilibrasi, 2010). NWD is based on Cilibrasi & Vitányi’s “contention that the relative frequencies of web pages containing search terms gives objective information about the semantic relations between the search terms” (Cilibrasi & Vitányi, 2007, p. 371). Due to the vastness and sheer quantity of the information available on the Web, it is likely that the extremes of information not representative of how words are currently used in society will cancel each other out, and that the majority of the information contained on the Web consists of diverse, low-quality, yet valid information. Though the overwhelming majority of the information is likely of low quality, there is an immense quantity of it on the Web. Cilibrasi & Vitányi (2007) argued that because this low-quality information is so plentiful and so diverse, search engine page counts effectively average out the noise, so that valid and useful semantic associations between words and phrases can be drawn using the NWD method.
The example presented in Cilibrasi & Vitányi (2007) best illustrates how the NWD is computed to determine the similarity between two terms using the World Wide Web and Google as the search engine.
“While the theory we propose is rather intricate, the resulting method is simple enough. We give an example: At the time of doing the experiment, a Google search for ‘horse’, returned 46,700,000 hits. The number of hits for the search term ‘rider’ was 12,200,000. Searching for the pages where both ‘horse’ and ‘rider’ occur gave 2,630,000 hits, and Google indexed 8,058,044,651 web pages. Using these numbers in the main formula … with N = 8,058,044,651, this yields a Normalized Google Distance [(NGD)] between the terms ‘horse’ and ‘rider’ as follows: NGD(horse, rider) ≈ 0.443.” (Cilibrasi & Vitányi, 2007, p. 371)
As can be inferred from this example, the method does not consider the location of the terms within the web pages nor the number of occurrences of the terms within a single web page; it is simply based on the number of web pages in which the terms occur at least once. Therefore, Cilibrasi & Vitányi (2007) point out “that this can mean that terms with different meaning have the same semantics, and that opposites like ‘true’ and ‘false’ often have a similar semantics. Thus, we just discover associations between terms, suggesting a likely relationship” (Cilibrasi & Vitányi, 2007, p. 371).
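The “main formula” referred to in the quotation is the NGD formula given by Cilibrasi & Vitányi (2007), which combines the logarithms of the individual and joint page counts with the size of the index. The short sketch below reproduces the horse/rider calculation; the page counts are the ones quoted above, and the function name is an arbitrary choice.

    from math import log

    def ngd(f_x, f_y, f_xy, n):
        """Normalized Google/Web Distance computed from search engine page counts:
        f_x and f_y are the hit counts for each term, f_xy the count for pages
        containing both terms, and n the total number of pages indexed."""
        lx, ly, lxy = log(f_x), log(f_y), log(f_xy)
        return (max(lx, ly) - lxy) / (log(n) - min(lx, ly))

    # Page counts quoted by Cilibrasi & Vitányi (2007) for 'horse' and 'rider'.
    print(round(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651), 3))  # 0.443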
2.2.3.1. Kolmogorov Complexity and Normalized Information Distance
NWD is based on a theory of semantic distance between a pair of objects using Kolmogorov complexity and normalized information distance. “One way to think about the Kolmogorov complexity K(x) is to view it as the length, in bits, of the ultimate compressed version from which x can be recovered by a general decompression program” (Cilibrasi & Vitányi, 2007, p. 372). The less complex and more redundant the information is, the more it can be compressed and the smaller the Kolmogorov complexity value. It is important to note that Kolmogorov complexity is a theoretical construct that is incomputable because it is based on an imaginary perfect compressor; it represents the lower bound on the length of the ultimate compressed version of the string. Kolmogorov complexity is important to NWD because it allows for a theoretical analysis “to express and prove properties of absolute relations between objects” (Cilibrasi & Vitányi, 2007, p. 371) that is typically not possible with other automatic approaches used in the lexical acquisition of meaning for information retrieval.
Normalized Information Distance (NID) is based on Kolmogorov complexity and is the normalized length of the shortest program required to reconstruct one string from another string and vice versa. For example, given string x and string y, NID represents the length of the shortest program required to reconstruct string x given string y as the input and to reconstruct string y given string x as the input. Like Kolmogorov complexity, NID is a theoretical construct useful for expressing and proving the properties of the NWD. Because NID is based on the incomputable Kolmogorov complexity, it is also incomputable.
2.2.3.2. Normalized Compression Distance
As indicated above, because Kolmogorov complexity is incomputable, NID is, therefore, incomputable. However, a computable version of NID, called Normalized Compression Distance (NCD), can be formulated by using “real data compression programs to approximate the Kolmogorov complexities … A compression algorithm defines a computable function from strings to the lengths of the compressed versions of those strings. Therefore, the number of bits of the compressed version of a string is an upper bound on Kolmogorov complexity of that string, up to an additive constant depending on the compressor but not on the string in question.” The formula for the NCD is what is used to compute the NWD.
2.2.4. Probabilistic Models Used in Determining Semantic Similarity
While vector space models are conceptually easy to understand, there are some theoretical problems with using vector space for measuring the relative similarity of terms. Specifically, Manning & Schütze (1999) point out that vector space models assume a Euclidean space to measure the distance between two vectors and are, therefore, theoretically only appropriate for normally distributed values. Normal distributions, however, cannot be assumed for data based on counts and probabilities, and word counts (i.e., the number of occurrences of a word within a unit of text such as a document or paragraph) or probabilities (i.e., the probability that a word is contained in a document) are the data that typically populate the matrices from which the values of vectors are derived.
Matrices containing counts, such as those used in the vector-based methods, can be converted into “matrices of conditional probabilities by dividing each element in a row by the sum of all entries in the row” (Manning & Schütze, 1999, p. 303), which transforms the values into maximum likelihood estimates. A number of information retrieval researchers have explored the use of probabilistic models to determine semantic similarity in order to use a method with sound theoretical underpinnings. Probabilistic models recast the idea of calculating semantic similarity in vector space models as one of calculating the dissimilarity of two probability distributions (van Rijsbergen, 1979; Lin, 1991; Dagan, Lee, & Pereira, 1997; Manning & Schütze, 1999).
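The row-normalization step quoted from Manning & Schütze (1999) is simple to state in code; the co-occurrence counts below are invented purely to show the conversion, and the dictionary-based representation is an assumption of the sketch.

    # Convert rows of co-occurrence counts into conditional probability
    # distributions (maximum likelihood estimates) by dividing each entry
    # by its row sum. Empty rows are left untouched.
    counts = {
        "car":   {"road": 6, "engine": 3, "moon": 1},
        "truck": {"road": 5, "engine": 4, "moon": 1},
    }

    def to_distribution(row):
        total = sum(row.values())
        return {k: v / total for k, v in row.items()} if total else dict(row)

    distributions = {term: to_distribution(row) for term, row in counts.items()}
    print(distributions["car"])   # {'road': 0.6, 'engine': 0.3, 'moon': 0.1}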
2.2.4.1. Entropy and Probabilistic Models
One class of probabilistic measures for calculating the semantic dissimilarity between two probability distributions is based on the notion of entropy defined in Claude Shannon’s (1948) Mathematical Theory of Communication (van Rijsbergen, 1979; Lin, 1991; Dagan, Lee, & Pereira, 1997; Manning & Schütze, 1999). In general terms, entropy is the amount of information contained in a message and can be thought of as the level of uncertainty about a message before it is received. As the level of predictability of the message decreases, the level of uncertainty on the part of the recipient increases. For example, if the message sender always sends the same sequence of digits (010101010101…), the recipient can reliably predict what the message will be without receiving it. In this sense, the message contains no information because there is no uncertainty about the content of the message. On the other hand, if the recipient cannot predict what message the sender will transmit, then the recipient is only certain of what the message contains after the message is received (i.e., uncertainty about the message is resolved only when the message is received) (Shannon & Weaver, 1998; Pierce, 1980).
The amount of information contained in a message is called entropy and is measured in bits. The entropy increases as the number of possible messages increases and as the freedom with which to choose the messages increases (i.e., the greater the number of possible choices and the less predictable which message will be sent, the greater the entropy). The entropy decreases as the number of possible messages decreases and as the freedom with which to choose the messages decreases (i.e., the fewer the number of possible choices and the more predictable which message will be sent, the lower the entropy) (Shannon & Weaver, 1998; Pierce, 1980). Shannon developed the following equation for entropy when the symbols in a message are not equally probable:

H = −Σᵢ pᵢ log pᵢ   (summed over i = 1, …, n)

Where:
H = entropy (i.e., amount of information)
pᵢ = probability of the ith symbol being chosen
(When the logarithm is taken to base 2, H is measured in bits.)

Entropy can be used to compare coding methods to determine which are the most efficient for data compression. The true entropy of a message indicates the best possible encoding method for that message (i.e., the best compression possible without losing any information). In the case of binary coding methods, this means encoding a message in such a way that it requires the fewest number of binary digits per symbol (i.e., the number of bits per symbol approaches the true entropy). The trick is to devise an encoding scheme that achieves (or comes reasonably close to achieving) this goal (Shannon & Weaver, 1998; Pierce, 1980; Floridi, 2003).
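A direct implementation of the entropy equation makes the relationship between predictability and information content easy to see; the example distributions are arbitrary.

    from math import log2

    def entropy(probabilities):
        """H = -sum(p_i * log2(p_i)); symbols with zero probability contribute nothing."""
        return -sum(p * log2(p) for p in probabilities if p > 0)

    print(entropy([1.0]))                     # 0.0 bits: the message is fully predictable
    print(entropy([0.5, 0.5]))                # 1.0 bit: a fair coin flip
    print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: four equally likely messages
    print(round(entropy([0.7, 0.1, 0.1, 0.1]), 3))  # under 2 bits: skewed, so more predictable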
Entropy and its relationship to data compression form a basis for a class of information-theoretic measures developed to measure the difference (or divergence) between probability distributions (Lin, 1991). Two of these measures, the Kullback-Leibler (KL) divergence and the Jensen-Shannon divergence, are described below.
2.2.4.2. Dissimilarity Probabilistic Measures
Dissimilarity between two probability distributions can be measured using a variety of probabilistic measures. Three commonly used measures are outlined in Table 2.2 and described in more detail in the sections below.

  Dissimilarity Measure           Definition
  Kullback-Leibler divergence     D(p || q) = Σᵢ pᵢ log(pᵢ / qᵢ)
  Jensen-Shannon divergence       D(p || (p + q)/2) + D(q || (p + q)/2)
  L1 norm                         Σᵢ |pᵢ − qᵢ|

Table 2.2 Probabilistic dissimilarity measures.

2.2.4.2.1. Kullback-Leibler Divergence
The Kullback-Leibler (KL) divergence is the relative entropy of two probability distributions (Lin, 1991; Dagan, Lee, & Pereira, 1997; Manning & Schütze, 1999; Pargellis, Fosler-Lussier, Potamianos, & Lee, 2001). KL divergence “measures how well distribution q approximates distribution p; or, more precisely, how much information is lost if we assume distribution q when the true distribution is p” (Manning & Schütze, 1999, p. 304).
There are several problems with the KL divergence that cause practical difficulties when using it to determine the relative similarity of terms. One problem is that the KL divergence measure has difficulty with infinite values. The measure returns “a value of ∞ if there is a ‘dimension’ with qi = 0 and pi ≠ 0 (which will happen often, especially if we use simple maximum likelihood estimates)” (Manning & Schütze, 1999, p. 304). To deal with this situation, estimates must be smoothed to redistribute some probability mass to the zero-frequency events. Mathematically, such a situation requires additional effort that can be computationally expensive for large vocabularies (Dagan, Lee, & Pereira, 1997). Another problem with KL divergence is that it is asymmetric. Intuitively, semantic similarity between two terms is typically symmetric, so that the level of similarity between term a and term b is equal to the level of similarity between term b and term a (Manning & Schütze, 1999).
2.2.4.2.2. Jensen-Shannon Divergence
The Jensen-Shannon divergence measure (Lin, 1991; Wartena & Brussee, 2008) is also known as information radius (Manning & Schütze, 1999; Pargellis, Fosler-Lussier, Potamianos, & Lee, 2001) and as the total divergence to the average measure (Dagan, Lee, & Pereira, 1997). The Jensen-Shannon divergence measure is an extension of the KL divergence measure and “can be defined as the average of the KL divergence of each of two distributions to their average distribution” (Dagan, Lee, & Pereira, 1997). Like the KL divergence, the Jensen-Shannon divergence measure is based on the notion of entropy as a measure of information. “The intuitive interpretation of [the Jensen-Shannon divergence] is that it answers the question: How much information is lost if we describe the two words … that correspond to p and q with their average distribution?” (Manning & Schütze, 1999, p. 304).
The Jensen-Shannon divergence overcomes two of the major problems in the practical application of KL divergence: all values produced by the Jensen-Shannon divergence measure are finite (i.e., there is no difficulty with infinite values being generated), and the Jensen-Shannon divergence measure is symmetric (Lin, 1991; Manning & Schütze, 1999). According to Dagan, Lee, & Pereira (1997), the Jensen-Shannon divergence method consistently performed better than the other probabilistic measures they compared, and they recommend its use, in general, as a similarity-based estimation method.
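The sketch below implements the KL and Jensen-Shannon divergences as defined in Table 2.2, using two invented distributions; it shows the asymmetry of the KL divergence and the symmetry of the Jensen-Shannon measure. Smoothing, which real systems need when q assigns zero probability to an event that p does not, is omitted.

    from math import log2

    def kl(p, q):
        """D(p || q); grows without bound as q_i approaches 0 while p_i > 0,
        which is why estimates are smoothed in practice."""
        return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    def js(p, q):
        """Jensen-Shannon divergence: the KL divergence of each distribution
        to their average distribution, summed; always finite and symmetric."""
        avg = [(pi + qi) / 2 for pi, qi in zip(p, q)]
        return kl(p, avg) + kl(q, avg)

    p = [0.6, 0.3, 0.1]
    q = [0.5, 0.4, 0.1]
    print(round(kl(p, q), 4), round(kl(q, p), 4))  # the two directions differ: KL is asymmetric
    print(round(js(p, q), 4), round(js(q, p), 4))  # identical either way: JS is symmetric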
2.2.4.2.3. L1 Norm
The L1 norm, or Manhattan norm, is “the absolute value of the difference of the two distributions” (Pargellis, Fosler-Lussier, Potamianos, & Lee, 2001, p. 220). It can be interpreted “as a measure of the expected proportion of different events, that is, as the expected proportion of events that are going to be different between the distributions p and q” (Manning & Schütze, 1999, pp. 304-305). Like the Jensen-Shannon divergence, the L1 norm is symmetric.
2.3. Augmenting Retrieval Methods
A major challenge in search engine design is the fact that purely symbolic search algorithms miss relevant information. To identify and retrieve information that is missed, researchers have investigated a number of different approaches to augment the retrieval methods used in search engines. Three of these approaches include integrating manually developed semantic knowledge into the retrieval method, incorporating relevance feedback into the retrieval method, and augmenting the retrieval method with automatic query expansion. Each of these approaches is discussed in detail in the following sections.
2.3.1. Integrating Manually Developed Semantic Knowledge
One large area of research that addresses the challenge of retrieving information that is typically missed by symbolic search algorithms is the development of semantic information to either replace or augment symbolic search engine designs. A portion of this research focuses on manually developing (i.e., a human expert is required to develop) semantic information. This includes research in developing, defining, and using semantic relations between lexical elements (e.g., WordNet; Fellbaum, 1998) as well as research that addresses the development of domain ontologies to extract and represent meaning based on a constructed world model (Bhogal, Macfarlane, & Smith, 2007). However, the manual development of semantic information is extremely time consuming and expensive, and it either lacks the specificity needed for technical domains or lacks portability for reuse in other conceptual domains and over time within an evolving conceptual domain (Anderson & Pérez-Carballo, 2001; Manning & Schütze, 1999).
2.3.1.1. WordNet
WordNet is an electronic lexical database of English whose design was inspired by computational and psycholinguistic theories of human lexical memory. It is a large-scale implementation of relational lexical semantics that represents a pattern of semantic relations creating a mapping between word forms and word meanings. WordNet organizes English nouns, verbs, adverbs, and adjectives into synonym sets (also called synsets). Each synset represents one distinct underlying lexical concept and is linked to other synsets based on semantic relationships (Fellbaum, 1998; Miller, 1998a; Carpineto & Romano, 2012; Bird, Klein, & Loper, 2009). In addition to the synonym relationship captured by the construction of synsets, other relationships are captured, such as the following semantic relationships between noun synsets:
• hyponymy – the generalization relationship between concepts, where the individual concepts fall on a continuum from specific to general. It can be represented in a hierarchical tree relationship connected by IS-A or IS-A-KIND-OF links (e.g., robin → bird → animal → organism)
• meronymy – the whole-part relationship that describes the relation between a concrete or abstract object and its components. It can be represented by IS-A-COMPONENT-OF, IS-A-MEMBER-OF, or IS-MADE-FROM links (e.g., beak and wing are parts of a bird). (Miller, 1998b)
The effort and time required to create and populate WordNet has been huge.
As Miller (1998a) describes it, “[a] small army of people have worked on it at one time or another” (p. xxi). They began adding words derived from the standard corpus of presentday edited American English (also known as the Brown Corpus) by Kučera and Francis (1967) and progressively added words as the developers continued to come across sources that contained words not already present in the WordNet vocabulary. The WordNet vocabulary is not specific to any particular domain but instead is comprised of general-purpose words currently in use. Since 1991 when WordNet 1.0 was released to be used by the research community, it has been used for a variety of applications, some of which include its use to replace or augment symbolic search engine designs (Miller, 1998a; Carpineto & Romano, 2012; Bird, Klein, & Loper, 2009). In general, WordNet can be used to augment symbolic search by “selecting one synset for a given query term, thus solving the ambiguity problem, and then traversing the hierarchy by following its typed links. In order to choose a synset with a similar meaning to the query term, the adjacent query terms can be best matched with the concepts present in each synset containing the query term. After selecting the most 101 relevant synset, one might consider for query expansion, all the synonyms of the query term in the synset plus the concepts contained in any synset directly related to it, usually with different weights” (Carpineto & Romano, 2012, p. 13). 2.3.1.2. Domain ontologies A domain ontology is a model of knowledge that captures and represents a conceptual view of a particular subject domain. It is a formal representation “with a welldefined mathematical interpretation which is capable at least to represent a subconcept taxonomy, concept instances and user-defined relations between concepts” (Nagypál, 2005, p. 781). Through concepts, relations, and instances, ontologies represent knowledge of a domain that may be used in information retrieval applications to infer the intended context of potentially ambiguous queries (Nagypál, 2005; Bhogal, Macfarlane, & Smith, 2007). The success of using an ontology in an information retrieval application is dependent on a variety of factors and one of the most challenging is the quality of the ontology. The quality is determined by the accuracy, comprehensiveness, stability, and currency of the knowledge represented in the ontology. (Bhogal, Macfarlane, & Smith, 2007) But, creating a quality ontology is expensive and often the cost is prohibitive. Nagypál (2005) has found that “presently good quality ontologies … are a very scarce resource” (p. 782). 102 The level of effort required to build an ontology can be seen in Revuri, Upadhyaya, and Kumar’s (2006) brief overview of the process they took to build the ontology used in their work. Their process includes the following: 1. list all possible concepts in the domain 2. identify properties of each concept 3. identify characteristics of each property (e.g., Transitive, Symmetric, Functional, Inverse Functional) 4. define constraints on properties to add specificity as necessary 5. identify relationships between concepts 6. define instances of concept 7. populate property values for each instance 8. check entire ontology for consistency Because of the cost of manually building an ontology, a number of researchers have started investigating ways to partially or fully automate the process of creating an ontology (Bhogal, Macfarlane, & Smith, 2007). 2.3.2. 
Relevance Feedback Another method that has been investigated that requires human input, although in this case after the search results have been returned, is the incorporation of relevance feedback back into the search engine. “Relevance feedback takes the results that are initially returned from a given query and uses information provided by the user about 103 whether or not those results are relevant to perform a new query. The content of the assessed documents is used to adjust the weights of terms in the original query and/or to add words to the query” (Carpineto & Romano, 2012, p. 13). A variation on relevance feedback is called pseudo-relevance feedback (also known as blind feedback or blind query expansion) in which instead of requiring the user to assess the relevance of documents, the system will assume the top-ranked documents returned are relevant. These top-ranked documents are then analyzed and used to refine the definition of the original query. The system then uses the refined query to create the listing of results that are returned to the user (Bhogal, Macfarlane, & Smith, 2007; Baeza-Yates and RibeiroNeto, 1999; Büttcher, Clarke, & Cormack, 2010; Savoy & Gaussier, 2010). 2.3.3. Query Expansion Another line of research that addresses the challenge of retrieving information that is typically missed by symbolic search algorithms is the automatic development of semantic information to augment symbolic search engine designs. The automatic development of semantic information is particularly attractive because of the time, cost, specificity, and portability limitations of the manual development of this information. One area of this type of research uses Natural Language Processing (NLP) techniques to perform automatic query expansion in an attempt to more completely define and describe a user’s query. By augmenting the user’s query with additional search terms, additional candidate string patterns are available when performing the symbolic search. It is believed that these added candidate strings allow additional opportunities for 104 a symbolic search algorithm to identify additional documents that contain information relevant to the desired concept (Efthimiadis, 1996; Bhogal, Macfarlane, & Smith, 2007; Savoy & Gaussier, 2010; Carpineto & Romano, 2012). Unfortunately, there has been a marked lack of success in this line of research over the years and the resulting systems either do not improve or decrease the performance of the search engine (Qiu & Frei, 1993; Brants, 2003). Qiu and Frei (1993) present a theory that the lack of success with previous query expansion methods is primarily because the methods were based on adding terms similar to each of the individual terms used to construct the query, instead of adding terms similar to the overall concept the query describes. The method of expanding a query by adding terms similar only to the individual terms of the query often introduces tangential, non-relevant concepts to the query causing the search algorithm to identify documents that are not relevant to the original query concept. This phenomenon where the expansion terms cause a drift in the focus of the intended search is often referred to as topic drift or query drift (Qiu and Frei, 1993; Carpineto & Romano, 2012; Savoy & Gaussier, 2010; Carmel et al., 2002). 2.3.3.1. 
Qiu and Frei’s Concept-Based Query Expansion
To address the problem of topic drift in previous methods of query expansion, Qiu and Frei (1993) developed an alternate method whose goal was to expand the original query with terms similar to the overall concept expressed by the original query rather than only to the individual terms of the original query. They developed a search algorithm that relies on a vector space model to represent the original query as a vector in the term vector space (TVS) generated from the document collection. Additional terms with which to expand the query are identified as those that have a high similarity to the query vector in the TVS. For a term to be eligible for use in expansion, it must be similar to the overall query rather than similar only to one of the terms that make up the query.
To accomplish this, Qiu and Frei (1993) constructed a similarity thesaurus by interchanging the traditional roles of documents and terms. “[T]he terms play the role of the retrievable items and the documents constitute the ‘indexing features’ of the terms. With this arrangement a term ti is represented by a vector t⃗i = (di1, di2, …, din) in the document vector space (DVS) defined by all the documents of the collection. The dik’s signify feature weights of the indexing features (documents) dk with respect to the item (term) ti and n is the number of features (documents) in the collection” (p. 161). For example, assume that the indexing features are weighted using the number of times a term occurs in the document (i.e., number of occurrences) and that term t1 and term t2 occur in the five documents of the collection as presented in Table 2.3. If this were the case, the vector for t1 would be t⃗1 = (4, 1, 0, 2, 0) and the vector for t2 would be t⃗2 = (0, 8, 0, 5, 0). The similarity between two terms can be measured using a simple scalar vector product (or other vector similarity calculations as described in the earlier sections of this chapter), and a similarity thesaurus can be constructed by calculating the similarities of all the term pairs (ti, tj).

  Document   Occurrences of t1   Occurrences of t2
  D1         4                   0
  D2         1                   8
  D3         0                   0
  D4         2                   5
  D5         0                   0

Table 2.3 The number of times term t1 and term t2 occur in each of the five documents that comprise the example document collection.

It should be noted that in Qiu and Frei’s approach, a more complex calculation is used to determine the feature weights, one that takes into account such things as document length, the number of unique terms contained in a document, the number of documents in the collection that contain the term, and the total number of documents in the collection.
The next step is to represent the user’s query as a vector. “A query is represented by a vector q⃗ = (q1, q2, …, qm) in the term vector space (TVS) defined by all the terms of the collection. Here, the qi’s are the weights of the search terms ti contained in the query q; m is the total number of terms in the collection” (p. 162). The term weights are determined by calculating the probability that the term is similar to the overall concept of the query. Therefore, in the vector q⃗ the value q1 corresponds to the probability that term t1 is similar to the overall concept of the query, value q2 corresponds to the probability that term t2 is similar to the overall concept of the query, and so on.
“Since the similarity thesaurus expresses the similarity between the terms of the collection in the DVS (defined by the documents of the collection), we map the vector q⃗ from the TVS (defined by the terms of the collection) into a vector in space DVS. This way, the overall similarity between a term and the query can be estimated” (p. 163). Once the vector q⃗ has been mapped to the document vector space, the pre-computed entries of the similarity thesaurus can be used to determine which terms in the collection have a high similarity to the overall concept of the query and may, therefore, be considered candidate terms for expanding the original query.
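A minimal sketch of this style of concept-based expansion appears below, under simplifying assumptions: raw occurrence counts stand in for Qiu and Frei's feature-weighting scheme, cosine similarity stands in for their similarity function, and the similarity of a candidate term to the query as a whole is approximated as a sum of pre-computed term-term similarities. The vocabulary and counts (which extend the t1 and t2 example in Table 2.3) are invented for illustration.

    import math

    # Term-by-document occurrence counts: the similarity-thesaurus view, in which
    # terms are the retrievable items and documents are their indexing features.
    # 'horse' and 'rider' reuse the t1 and t2 counts from Table 2.3; the other
    # terms and all labels are invented.
    term_doc = {
        "horse":  [4, 1, 0, 2, 0],
        "rider":  [0, 8, 0, 5, 0],
        "saddle": [2, 3, 0, 4, 0],
        "engine": [0, 0, 6, 0, 3],
    }

    def cosine(u, v):
        num = sum(a * b for a, b in zip(u, v))
        den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return num / den if den else 0.0

    # Pre-computed similarity thesaurus: one similarity value per term pair.
    thesaurus = {(a, b): cosine(term_doc[a], term_doc[b])
                 for a in term_doc for b in term_doc}

    def expand(query_terms, m=1):
        """Rank candidate terms by their similarity to the query as a whole
        (summed over all query terms), not to any single query term, and
        return the top m expansion candidates."""
        candidates = [t for t in term_doc if t not in query_terms]
        score = {c: sum(thesaurus[(q, c)] for q in query_terms) for c in candidates}
        return sorted(score, key=score.get, reverse=True)[:m]

    print(expand(["horse", "rider"]))   # ['saddle']: close to the overall query concept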
While Qiu and Frei achieved a notable improvement in search engine performance, they noted that their method was less successful than systems that used mature relevance feedback data from users. However, Qiu and Frei’s work inspired a line of query-expansion methods known as concept-based query expansion that continues to be actively researched (Efthimiadis, 1996; Bhogal, Macfarlane, & Smith, 2007; Savoy & Gaussier, 2010; Carpineto & Romano, 2012).
2.3.3.2. Performing Query Expansion
While the details of the methods that have been investigated for query expansion vary, the overall process typically consists of four main steps, as illustrated in Figure 2.5: data preprocessing, feature generation, feature selection, and query reformulation. Considering each of these key steps separately presents a useful way to understand the types and variety of alternative approaches that have been explored by information retrieval researchers in an attempt to better understand the best methods for improving retrieval performance with query expansion. Each step is discussed in the sections below.

Data Preprocessing → Feature Generation → Feature Selection → Query Reformulation

Figure 2.5 Typical steps of query expansion (adapted from Figure 1 of Carpineto & Romano, 2012, p. 1:10).

2.3.3.2.1. Data Preprocessing
The first step of the query expansion process is data preprocessing. In this step, the data used to identify the candidate terms and determine the weighting used in refining or augmenting the query is converted and processed into a form that can be used by the system. Much of the data preprocessing that occurs at this stage is not unique to the query expansion process and is, therefore, fairly similar across various query expansion methods (see Section 2.1.1.3.1. Preparing the System). However, one key difference between query expansion methods at this stage is the data source that the method uses to derive the features or terms that will be used in later steps of the process. In some cases, the corpus on which the search is to be conducted is used; in other words, the text from the documents in the target collection is used (Qiu & Frei, 1993; Graupmann, Cai, & Schenkel, 2005; Hu, Deng, & Guo, 2006). However, sometimes only a particular aspect or subset of the documents from the corpus is used as the data source. For example, some systems use what are sometimes called anchor text documents or anchor text summaries as the data source. These anchor text documents correspond to a particular document in the collection and are created by collecting all the anchor text (i.e., the underlined or highlighted clickable text in a hyperlink) found in the collection that points to this particular document (Kraft & Zien, 2004; He & Ounis, 2007).
Another very common data source is derived by running the original, unaltered query against the collection and using the matching top ranked document listing and their relevant text snippets as the subset of text used for the subsequent steps of query expansion process (Robertson, Walker, & Beaulieu, 1998; Lavrenko & Croft, 2001; Carpineto, de Mori, Romano, & Bigi, 2001; Carmel, Farchi, Petruschka, & Soffer, 2002). But, the corpus is not the only data source that may be used in information retrieval systems to identify terms for use in query expansion. Some systems look beyond the current corpus for ways to better define the query. Examples of alternate data sources include the following: • a query log of previous queries made in the system (Billerbeck, Scholer, Williams, & Zobel, 2003; Cui, Wen, Nie, & Ma, 2003) • WordNet or other ontological knowledge model (Voorhees, 1994; Liu, Liu, Yu, & Meng, 2004; Collins-Thompson & Callan, 2005; Bhogal, Macfarlane, & Smith, 2007; Song, Song, Hu, & Allen, 2007; Kara, Alan, Sabuncu, Akpinar, Cicekli, & Alpaslan, 2012) • the content of Frequently Asked Questions pages (Riezler, Vasserman, Tsochantaridis, Mittal, & Liu, 2007) • information derived from relevant Wikipedia articles (Arguello, Elsas, Callan, & Carbonell, 2008; Xu, Jones, & Wang, 2009) 110 Some systems also use a combination of the above data sources to increase the opportunity of finding features or terms that will improve the ability of the system to identify relevant documents for a wider variety of users’ information needs. 2.3.3.2.2. Feature Generation A variety of methods are used to generate the possible features that are considered candidates for expanding the original query. Typically the candidate features can be thought of as the specific terms or phrases that may be used to expand the original query to provide more opportunities for string pattern matching but features may also include such things as abstract representations of concepts or attribute-value pairs. The methods used to extract features from the data source reflect the conceptual paradigms that the system creator has chosen to guide the design of the query expansion component of the information retrieval system (Carpineto & Romano, 2012). Carpineto and Romano (2012) identify five major conceptual paradigms that were used to generate and rank the features used in query expansion. These are linguistic analysis, global corpus-specific techniques, local query-specific techniques, search log analysis, and web data harvesting. 2.3.3.2.2.1. Linguistic Analysis Linguistic analysis “techniques leverage global language properties such as morphological, lexical, syntactic and semantic word relationships to expand or reformulate query terms. They are typically based on dictionaries, thesauri, or other similar knowledge representation sources such as WordNet. As the expansion features are 111 usually generated independently of the full query and of the content of the database being searched, they are usually more sensitive to word sense ambiguity” (Carpineto & Romano, 2012, p. 25). 
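As a small illustration of this dictionary- and thesaurus-based style of feature generation (and of the stemming and WordNet-based techniques listed next), the sketch below pulls morphological and lexical variants of a single term using NLTK. It assumes the nltk package and its WordNet data have been installed, and it makes no attempt at the sense disambiguation such features usually require.

    from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def linguistic_features(term):
        features = {stemmer.stem(term)}              # morphological variant via stemming
        for synset in wn.synsets(term):              # every sense of the term
            features.update(l.name() for l in synset.lemmas())        # synonyms
            for hypernym in synset.hypernyms():                       # broader concepts
                features.update(l.name() for l in hypernym.lemmas())
        features.discard(term)
        return features

    # Features for 'car' mix senses (automobile, railway car, gondola, ...),
    # illustrating why such features are sensitive to word sense ambiguity.
    print(sorted(linguistic_features("car"))[:10])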
Some common types of linguistic analysis for the generation of features are the following: • word stemming to expand the query to include the morphological variants of the original query terms (Krovetz, 1993; Collins-Thompson & Callan, 2005) • domain-specific ontologies to provide appropriate contextual information in order to disambiguate query terms and find synonyms and related words (Nagypál, 2005; Bhogal, Macfarlane, & Smith, 2007; Revuri, Upadhyaya, & Kumar, 2006; Song, Song, Allen, & Obradovic, 2006) • domain-independent models like WordNet to find synonyms and related words (Voorhees, 1994; Liu, Liu, Yu, & Meng, 2004; Collins-Thompson & Callan, 2005; Bhogal, Macfarlane, & Smith, 2007) • syntactic analysis to extract relations between the query terms (Sun, Ong & Chua, 2006) 2.3.3.2.2.2. Global Corpus-Specific Techniques The global corpus-specific techniques extract information from the whole set of documents in the collection (Baeza-Yates & Ribierto-Neto, 1999). These techniques analyze the collection to identify features used in similar ways and in many cases build a collection-specific thesaurus to be used for identifying candidate expansion features 112 (Carpineto & Romano, 2012). Approaches used to build the collection-specific thesaurus include the following: • similarity between concept terms (Qiu & Frei, 1993) • term clustering (Schütze & Pedersen, 1997; Bast, Majumdar & Weber 2007) • associations between terms formed through mutual information (Hu, Deng, & Guo, 2006; Bai, Nie, Cao, & Bouchard, 2007), context vectors (Gauch, Wang, & Rachakonda, 1999), and latent semantic indexing (Park and Ramamohanarao, 2007) 2.3.3.2.2.3. Local Query-Specific Techniques The local query-specific techniques extract information from the local set of documents retrieved from the original query to identify the features to be used to refine and augment the query (Baeza-Yates & Ribierto-Neto, 1999). Typically, these techniques make use of the top-ranked documents that are returned when the original query is submitted and include pseudo-relevance rankings as described above (see section 2.3.2. Relevance Ranking) (Carpineto & Romano, 2012). The information to identify each of the top-ranked documents such as the title of the document, document access information (i.e., a hyperlink to the document or address of the physical location of the document), and snippets of text that surround the terms that satisfied the requirement of the query may be analyzed to identify features (Robertson, Walker, & Beaulieu, 1998; Lee, Croft, & Allan, 2008; Cao, Gao, Nie, & Robertson, 2008). 113 2.3.3.2.2.4. Search Log Analysis The technique of analyzing search logs mines “query associations that have been implicitly suggested by Web users, thus bypassing the need to generate such associations in the first place by content analysis” (Carpineto & Romano, 2012, p. 27). While this technique has the advantage of being able to mine information already created, the implicit relevance feedback contained in search logs may be only relatively accurate and may not be equally useful for all search tasks. The most widely used techniques based on analyzing search logs is to “exploit the relation of queries and retrieval results to provide additional or greater context in finding expansion features” (Carpineto & Romano, 2012, p. 27). 
Methods that have used this approach include the following: • using top-ranked documents returned in similar past queries (Fitzpatrick & Dent, 1997) • selecting terms from past queries associated with documents in the collection (Billerbeck, Scholer, Williams, & Zobel, 2003) • establishing probabilistic correlations between query terms and document terms by analyzing those documents selected by the user after submitting a query (Cui, Wen, Nie, & Ma, 2003) 2.3.3.2.2.5. Web Data Harvest The web data harvest technique uses sources of relevant information that are not necessarily part of the document collection to generate candidate expansion features. 114 One of these techniques is to use the articles and their connections with one another in Wikipedia as a source of relevant feature extraction. Two examples include: • using the anchor text of hyperlinks that point to relevant Wikipedia articles as a source of expansion phrases (Arguello, Elsas, Callan, & Carbonell, 2008) • creating Wikipedia based pseudo-relevance information based on the category of query submitted (Xu, Jones, & Wang, 2009) Other techniques have included the use of Frequently Asked Question (FAQ) pages to compile a large set of question-answer pairs as a source of relevant feature extraction. (Riezler et al., 2007) 2.3.3.2.3. Feature Selection After generating the candidate expansion features, the top features are selected for query expansion. “Usually only a limited number of features is selected for expansion, partly because the resulting query can be processed more rapidly, partly because the retrieval effectiveness of a small set of good terms is not necessarily less successful than adding all candidate expansion terms, due to noise reduction” (Carpineto & Romano, 2012, p. 22). The representation of the candidate features that are to be selected in this step vary from method to method. Candidate features to be selected include the following: • single words (Qiu & Frei, 1993; Voorhees, 1994; Robertson, Walker, & Beaulieu, 1998; Lee, Croft, & Allan, 2008; Cao, Gao, Nie, & Robertson, 2008) 115 • phrases (Kraft, & Zien, 2004; Song, Song, Allen, & Obradovic, 2006; Riezler et al., 2007; Arguello et al., 2008) • attribute-value pairs (Graupmann, Cai, & Schenkel, 2005) • multi-word concepts (Metzler & Croft, 2007) In some cases, multiple types of candidate features have been generated. For example, Liu et al. (2004) describe a method that selects the top-ranking single words and/or phrases that have been identified in the feature generation step. There is no consensus about whether there exists an optimum number of expansion features that should be used to optimize the results returned or what this number (or range) is. Some suggest that an optimum range is five to ten features (Amati, 2003; Chang, Ounis, & Kim, 2006), Harman (1992) recommends 20 features, and at the extreme end of the range, Buckley et al. (1995) recommends a massive set of 300-530 features. Alternatively, Chirita, Firan and Nejdl (2007) suggest that instead of selecting a static number of expansion features, that the number of expansion features selected may be adapted based on the estimated clarity of the original query. The clarity of the query is estimated using an equation that measures “the divergence between the language model associated to the user query and the language model associated to the collection” (p. 12). 2.3.3.2.4. 
2.3.3.2.4. Query Reformulation

After the features are selected, the final step is to determine how to augment the original query with the selected features. Many query expansion methods assign a weight to the features that are added (called query re-weighting). A common re-weighting technique uses a calculation based on Rocchio's relevance feedback equation. In general, Rocchio's equation is defined so that w_{t,q} is the weight of term t in the original query q, w'_{t,q'} is the weight of term t in the expanded query q', \lambda is a parameter that controls the relative contribution of the original query terms and the expansion terms, and score_t is the weight assigned to expansion term t:

w'_{t,q'} = (1 - \lambda) \cdot w_{t,q} + \lambda \cdot score_t

Alternatively, instead of using an equation to calculate the weight for each term, a simple rule of thumb may be used by arbitrarily assigning the original query terms twice as much weight as the new expansion terms (Carpineto & Romano, 2012).

While re-weighting is common in query expansion methods, it is not always performed. Some query expansion methods achieve good results by reformulating the query using query specification languages in which weighting is not used. Two examples of query specification methods that do not weight terms and that have been shown to be successful in some instances are a general structured query (Collins-Thompson & Callan, 2005) and a Boolean query (Graupmann, Cai, & Schenkel, 2005; Liu, Liu, Yu, & Meng, 2004; Kekäläinen & Järvelin, 1998).

2.3.3.3. Example Query Expansion Systems

In the previous section, a variety of approaches for designing a concept-based query expansion system were described in the context of the typical key steps used in the query expansion process. In this section, two query expansion systems that possess elements of interest will be presented in more detail.

2.3.3.3.1. Hu, Deng, and Guo's Global Analysis Approach

Hu, Deng, and Guo (2006) designed an information retrieval system in which a global analysis approach was used to perform query expansion. They divided their approach into three primary stages: one, term-term association calculation; two, suitable term selection; and three, expansion term reweighting.

In the first stage, the term-term association calculations were performed, in which the statistical relationship between term pairs in a document collection provides the basis for the construction of what Hu, Deng, and Guo refer to as a "thesaurus-like resource" to aid query expansion. As discussed in section 2.2, the commonly used measures for calculating how similar (or dissimilar) one term is to another each suffer from various limitations and weaknesses. Hu, Deng, and Guo attempted to overcome some of these limitations by developing their own association measure in which the values of three different measures are integrated into a single association value. Their association measure is composed of the following:

• Term Weight – From vector space models, the term weight of each term is calculated based on a normalized Term Frequency (TF) and Inverse Document Frequency (IDF).

• Mutual Information – From probabilistic models based on the notion of entropy, mutual information represents the average amount of information shared by the two terms.

• Normalized Distance Between Terms – The normalized distance between the terms represents the average proximity of the two terms in the collection.
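A sketch of how such an integrated measure might be assembled is given below. It does not reproduce Hu, Deng, and Guo's formulas; the three components are simplified stand-ins for the term weight, mutual information, and normalized distance signals listed above, the toy statistics are invented, and the equal weighting anticipates the combination described in the next paragraph.

    # Hedged sketch: integrating three signals into a single association value for
    # a term pair. The components are simplified stand-ins, not the exact formulas
    # used by Hu, Deng, and Guo, and the example statistics are invented.
    import math

    def association_value(n_i, n_j, n_ij, n_segments, avg_distance, window=200):
        """n_i, n_j: segments containing each term; n_ij: segments containing both;
        avg_distance: average word distance between the terms when they co-occur."""
        # 1. Stand-in for the normalized term-weight component
        weight = n_ij / max(n_i, n_j)
        # 2. Pointwise mutual information, rescaled to roughly [0, 1]
        p_i, p_j, p_ij = n_i / n_segments, n_j / n_segments, n_ij / n_segments
        mutual_info = max(0.0, math.log2(p_ij / (p_i * p_j))) / math.log2(n_segments)
        # 3. Proximity: term pairs that sit closer together score higher
        proximity = max(0.0, 1.0 - avg_distance / window)
        return (weight + mutual_info + proximity) / 3.0   # equal contribution

    print(association_value(n_i=120, n_j=90, n_ij=60, n_segments=10000,
                            avg_distance=35))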
The term weight, mutual information, and normalized distance between terms measures are combined with equal weight to produce an association value for a given term pair. Hu, Deng, and Guo ran this association measure on their document collection to construct the thesaurus-like resource containing the association measure values between each term pair in the collection.

In the second stage, the term-query based expansion was performed. To address the problem of topic drift identified by Qiu and Frei (1993) when expansion terms are selected based only on their similarity to a single term in the query, Hu, Deng, and Guo used "a term-query based expansion scheme, which emphasizes the correlation of a term to the entire query" (p. 704). They expressed the correlation in the following equation:

Co(t_i, \mathbf{q}) = \sum_{j=1}^{k} tf_j \cdot A(t_i, t_j)

Where:
k = number of unique terms in the query q
tf_j = term j's term frequency in the query q
A(t_i, t_j) = association measure value for term t_i and term t_j

Using this calculation, "all index terms are ranked in decreasing order according to their correlation to a given query q" (p. 704) and the top m ranked terms are selected to create a revised query q'.

The third and final stage of the process reweighted the expansion terms chosen in the previous stage to create the reformulated query. The calculation Hu, Deng, and Guo used was based on their goal to ensure that the selected expansion terms contributed to the performance of the search but did not override the impact of the original terms of the query. They therefore used "a simple reweighing scheme with a formula defined by the product of expanded term's rank and average term weight of the original query" (p. 705).

Hu, Deng, and Guo evaluated their retrieval performance by comparing the performance of their search engine after each stage of processing to a baseline search engine that ran with only the original unexpanded query. They found that each stage improved the retrieval performance. One of the explanations they provide for why they believe their method was successful was that "it is mainly based on the analysis of intrinsic association relationships between terms in the entire document collection rather than some retrieved documents that are assumed to be relevant" (p. 706).

Like Hu, Deng, and Guo's method, the Enhanced search engine method investigated in this dissertation research is based on an association thesaurus built from the document collection. There are two primary differences between the association thesaurus used by Hu, Deng, and Guo and the association thesaurus used in the Enhanced search engine. First, Hu, Deng, and Guo use an association measure derived from combining term weight, mutual information, and normalized distance between terms, while the method in this dissertation research simply uses a co-occurrence calculation based on the Jaccard Coefficient within a predefined proximity window (additional details about the association calculations used in the Enhanced search engine developed in this research are presented in Chapter 4). The second difference is the way the association thesaurus is used in the two methods. Hu, Deng, and Guo used the association thesaurus to create a ranked list of all index terms and chose the top m terms as the expansion terms. In this dissertation research, the association thesaurus is used to create a conceptual network. The expansion terms, therefore, are not derived directly from the association thesaurus but rather from the relational pathways that connect the original query terms within the conceptual network.
2.3.3.3.2. Collins-Thompson and Callan's Markov Chain Framework Approach

Collins-Thompson and Callan (2005) designed an information retrieval system in which they used a Markov chain framework to combine multiple sources of information related to term associations. They began with the pseudo-relevance feedback function built into the Lemur Project's Indri search engine to identify the set of candidate expansion terms for the given query. The Indri algorithm calculates a log-odds ratio for each candidate term, and the top k terms are selected. From these top k Indri-generated candidate terms, a query-specific term network is constructed using a multi-stage Markov chain framework to conduct "a random walk to estimate the likelihood of relevance of expansion terms" (p. 705). In this way, pairs of terms are linked by drawing data from six different sources of semantic and lexical association to form a network of terms specific to the query. The association information about the terms is derived from the following sources: synonyms from WordNet, stemming, a general word association database, co-occurrence in a set of Wikipedia articles, co-occurrence in the top-retrieved documents from the original query, and background smoothing to uniformly link each term with all others. The multi-stage Markov chain model favors different sources of information at different stages of its walk. For example, in the earliest steps of the random walk, the chain may favor co-occurrence relationships, while later in the process the walk may favor synonyms.

The walk process begins with a node representing one aspect of the original query. The aspects of a query are each represented using one or more terms from the original query. Collins-Thompson and Callan present the example query "Ireland peace talks" and state that the aspects of this query may be represented by the following sets of words taken from the query:

• "Ireland"
• "peace"
• "talks"
• "peace talks"
• "Ireland peace talks"

In order to ensure that documents related to the intent of the query are retrieved, the expansion terms chosen should reflect one or more of the aspects of the original query. Therefore, the method performed a multi-stage random walk for each aspect to propagate the association information and create the term network model. The stationary distribution of the model provided a probability distribution over the candidate expansion terms and was used to identify those expansion terms that had a high probability of representing one or more of these aspects of the original query. It identified expansion terms from those nodes that had direct links to one another as well as from those nodes in which connections were implied between them. Once the expansion terms were identified, the terms were weighted to favor those terms that had a high probability of being closely related to the main aspects of the query as well as those terms that reflected multiple aspects of the query.

Collins-Thompson and Callan's method performed similarly to other well-performing methods evaluated using the same TREC datasets, with some "modest improvements in precision, accuracy, and robustness for some tests. Statistically significant differences in accuracy were observed depending on the weighting of evidence in the random walk.
For example, using co-occurrence data later in the walk was generally better than using it early" (p. 711).

On the surface, Collins-Thompson and Callan's Markov chain framework approach is one of the query expansion methods most similar to the Enhanced search engine method investigated in this dissertation research. Like Collins-Thompson and Callan's method, the Enhanced search engine is based on identifying candidate expansion terms from a network of terms generated by the system. However, the Collins-Thompson and Callan query-specific term network differs from the conceptual network in this dissertation research in several ways.

First, the nature of the network and the process by which it is constructed differ between the two methods. Collins-Thompson and Callan began their process by using the set of candidate expansion terms generated from Indri's built-in pseudo-relevance function for the original query. They then used a Markov chain framework to generate a term network using the word sets representing the various aspects of the query and the Indri-generated candidate terms. Each link between the candidate term pairs in the network was weighted by the probability, drawn from one of a variety of sources, that the two terms were related. In contrast, the Enhanced search engine conceptual network was created using the term entries from the association thesaurus, and all links between terms were of equal importance. The association thesaurus represented clusters of terms related by collection-specific co-occurrence. In this way, the conceptual network represents all the concepts present in the document collection and does not change based on the query posed but rather changes only as the document collection changes. Therefore, rather than requiring a kernel of expansion terms, it used the individual terms that make up the original query itself to identify the portion of the overall conceptual network that is relevant to the given query concept.

Second, how the expansion terms are selected using the network is different between the two methods. The Collins-Thompson and Callan method selected the top k expansion terms based on a probability calculated by combining the probability derived from the stationary distribution and the original Indri log-odds ratio. On the other hand, the Enhanced search engine selected the candidate expansion terms based on the intervening terms within the relational pathways. The various, appropriately short relational pathways that form the connection(s) between the original query terms in the conceptual network are thought to represent aspects of the query topic. Each pathway of terms represents a logical progression of ideas that, linked together, represent an aspect of the overall query concept. In this way, each pathway represents an aspect of the query concept that is present in the collection, as opposed to the sets of words that provide the initial node for the Markov random walk in the Collins-Thompson and Callan method. Therefore, appropriate combinations of the intervening terms and the original query terms from each relational pathway found were selected to expand the original query.

Third, the processing requirements of the two methods also appear to differ significantly. The Collins-Thompson and Callan method requires a variety of mathematical processing techniques to perform the multi-stage Markov chain random walks to create the stationary distribution of the model.
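To give a sense of the kind of computation involved, the sketch below estimates the stationary distribution of a small term-transition matrix by power iteration. It is a single-stage toy illustration only; the actual framework is multi-stage, combines several evidence sources, and the three-term matrix shown here is invented.

    # Toy sketch: estimating the stationary distribution of a term-to-term
    # transition matrix by power iteration. The real framework described above is
    # multi-stage and query-specific; this tiny invented matrix only shows the
    # flavor of the computation performed once the query has been submitted.
    def stationary_distribution(transition, iterations=100):
        n = len(transition)
        dist = [1.0 / n] * n                      # start from a uniform distribution
        for _ in range(iterations):
            dist = [sum(dist[i] * transition[i][j] for i in range(n))
                    for j in range(n)]
        return dist

    # Rows are "from" terms, columns are "to" terms; each row sums to 1.
    transition = [[0.1, 0.6, 0.3],
                  [0.4, 0.4, 0.2],
                  [0.5, 0.3, 0.2]]
    print(stationary_distribution(transition))    # probability mass per candidate term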
Because the term network is constructed for each query based on the query terms and the candidate terms generated from Indri's pseudo-relevance function, it is likely that the majority of the processing must be performed after the user has submitted the query (i.e., post-search). Depending on the processing time required to generate the network, identify the expansion terms, reformulate the query, and re-run the query, the user's experience may be negatively impacted by the processing time required to complete the search request. In contrast, the mathematical calculations required by the Enhanced search engine are relatively simple (i.e., the most complex calculation is the Jaccard Coefficient to determine term co-occurrence), and a large majority of the processing can be performed ahead of time so that the user's experience is not negatively impacted (i.e., the user is not aware of the time required to perform the processing).

2.4. Evaluation

As with computer science research in general, the purpose of most research in search engine design "is to invent algorithms and generate evidence to convince others that the new methods are worthwhile" (Moffat & Zobel, 2004, p. 1). In order to convince others that the new search engine methods are worthwhile, compelling, accurate, reproducible evidence must be generated. Typically, the most effective way to achieve this is by implementing the search engine and conducting a well-designed experiment to measure its performance (Moffat & Zobel, 2004). Two unique aspects of experimental design for the comparative evaluation of search engines are the performance measures chosen and the data (i.e., test collection) used.

2.4.1. Search Engine Performance Measures

One of the primary aspects of search engine performance that is desirable to quantify and measure is search engine effectiveness. "Effectiveness, loosely speaking, measures the ability of the search engine to find the right information" (Croft, Metzler, & Strohman, 2010, p. 298). There are various statistical methods available to measure the effectiveness of a search engine. Depending on the objectives of the search engine and the type of performance improvement expected, some measures may be more appropriate than others (Baeza-Yates & Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010; Croft, Metzler, & Strohman, 2010).

2.4.1.1. Recall and Precision

Recall and precision are traditional information retrieval measures that have been used extensively in information retrieval research. Recall is the proportion of the relevant documents contained in the document collection that are retrieved by a search engine in response to a search query. Therefore, recall quantifies exhaustivity (i.e., how complete the set of relevant documents retrieved for a given search query is). Precision, on the other hand, is the proportion of relevant documents contained in the result set. In other words, precision reflects the amount of noise (i.e., the number of false positives) in the result set (Baeza-Yates & Ribeiro-Neto, 1999; Büttcher, Clarke, & Cormack, 2010; Croft, Metzler, & Strohman, 2010). Recall and precision are calculated using the following equations:

R = recall = \frac{|A \cap B|}{|A|}

P = precision = \frac{|A \cap B|}{|B|}

Where:
A = set of all relevant documents in the document collection
B = set of retrieved documents
2.4.1.1.1. Combining Recall and Precision with the F-measure

Recall and precision are complementary calculations that should be considered together to properly evaluate the effectiveness of a search engine. A search engine that always returns the entire collection in response to a query has perfect recall (i.e., R = 1) but extremely low precision (i.e., P approaches 0); it is doubtful that such a search engine would be considered effective. The opposite is also true: a search engine with near-perfect precision (P = 1) and very low recall (i.e., R approaches 0) is also unlikely to be considered an effective search engine. In most cases, the goal is to find an appropriate balance between recall and precision (Croft, Metzler, & Strohman, 2010).

To make it easier to consider recall and precision together, the values may be combined into a single value called the F-measure. The F-measure is the harmonic mean of recall and precision and is calculated using the following equation:

F\text{-measure} = \frac{2}{\frac{1}{R} + \frac{1}{P}} = \frac{2 \cdot R \cdot P}{R + P}

Where:
R = recall value
P = precision value

The F-measure, as a harmonic mean, is able to enforce a balance between recall and precision so that a more logical measure of effectiveness is computed at the various extremes of recall and precision values. For example, consider the example above in which the entire collection was retrieved in response to a search query (i.e., R = 1 and P approaches 0). When an arithmetic mean ((R + P) / 2) is used, the resulting value is greater than 0.5. However, when the F-measure (i.e., the harmonic mean) is used, the resulting value is close to 0. In this situation, an effectiveness value close to 0 instead of a value greater than 0.5 intuitively makes more sense and more accurately conveys how effectively the search engine has performed (Croft, Metzler, & Strohman, 2010).

While the version of the F-measure that equally balances recall and precision is the most common, a weighted harmonic mean may also be used. The weighted version allows weights to be included in the calculation to reflect the desired relative importance of recall and precision. The weighted version of the F-measure is calculated using the following equation:

weighted F\text{-measure} = \frac{R \cdot P}{\alpha \cdot R + (1 - \alpha) \cdot P}

Where:
R = recall value
P = precision value
\alpha = a weight

2.4.1.1.2. Assumptions of Recall and Precision Measures

Recall and precision rely on several assumptions for their calculations to be valid measurements of effectiveness. These assumptions include the following:

• The user's information need is represented by the search query supplied to a search engine.

• Each document in the document collection is relevant or is not relevant to a user's information need (i.e., it is a binary classification).

• The classification of relevance is dependent only on the document itself and the information need and is not influenced by the relevance of any other document in the document collection.

• For a given search query supplied to a given search engine, there will be a set of one or more documents that will be retrieved, and the remainder of the documents in the document collection will not be retrieved.

• The order in which the documents are returned in the result set has no bearing on the calculated values.

2.4.1.2. Effectiveness Measures Used for Ranked Results

While relevance ranking is outside the scope of this dissertation, a majority of search engine research conducted today performs relevance ranking to order the documents returned in the result set.
Therefore, two of the most common measures of effectiveness reported in the search engine research literature are Precision at k Documents (P@k) and Average Precision (AP). Both measures are based on the assumption that as document collections and the number of relevant documents they contain grow larger, users are not interested in every relevant document contained in the collection. Instead, it is assumed that a user considers a search engine most effective if the first documents returned are relevant. The relevancy of documents beyond the initial set of documents retrieved is unimportant and, therefore, should not be considered as part of the effectiveness measures. Therefore, both P@k and AP focus primarily on the precision of the documents returned (Büttcher, Clarke, & Cormack, 2010).

2.4.1.2.1. Precision at k Documents (P@k)

The measure Precision at k Documents (P@k) "is meant to model the satisfaction of a user who is presented with a list of up to k highly ranked documents, for some small value of k (typically k = 5, 10, or 20)" (Büttcher, Clarke, & Cormack, 2010, p. 408). The P@k equation is defined as:

P@k = \frac{|B[1..k] \cap A|}{k}

Where:
k = the number of highly ranked documents to be considered
A = set of all relevant documents in the document collection
B[1..k] = set of the top k retrieved documents

The P@k measure does not consider the order of the top k documents (i.e., whether document 1 or document k has the higher ranking does not impact the results). "P@k assumes that the user inspects the results in arbitrary order, and that she inspects all of them even after she has found one or more relevant documents. It also assumes that if the search engine is unable to identify at least one relevant document in the top k results, it has failed, and the user's information need remains unfulfilled. Precision at k documents is sometimes referred to as an early precision measure" (Büttcher, Clarke, & Cormack, 2010, p. 408). One argument against the use of P@k is that the choice of the value of k is arbitrary yet can impact the effectiveness results.

2.4.1.2.2. Average Precision (AP)

The Average Precision (AP) measure provides an alternative approach to the P@k equation. AP addresses the problem of k being an arbitrary choice by combining precision values at all the possible recall levels:

AP = \frac{1}{|A|} \cdot \sum_{i=1}^{|B|} relevant(i) \cdot P@i

Where:
A = set of all relevant documents in the document collection
B = set of retrieved documents
relevant(i) = 1 if the i-th document in B is relevant; otherwise 0

"[F]or every relevant document d, AP computes the precision of the result list up to and including d. If a document does not appear in [the set of retrieved documents], AP assumes the corresponding precision to be 0. Thus, we may say that AP contains an implicit recall component because it accounts for relevant documents that are not in the results list" (Büttcher, Clarke, & Cormack, 2010, p. 408).

2.4.2. Search Engine Data for Evaluation Experiments

The test data set for conducting search engine evaluation experiments is generally composed of three elements: the document collection that contains the documents that will be searched, the query topics that represent the information needs to be filled, and the set of relevant documents within the collection that matches the criteria of each query topic. The test data set used can have a large impact on the accuracy of the results.
It may either be developed specifically for the particular evaluation experiment to be conducted or pre-defined by others for the purpose of evaluating search engines. The 132 objectives of the novel search engine method and the type of improvement expected should always drive the choice of the test data set used. 2.4.2.1. Developing a Test Data Set The test data set may be developed specifically for the particular evaluation experiment to be conducted. While this requires additional effort and may be time consuming, the quality of the results from some evaluation experiments may benefit from developing a specific test data set. Typically, the test data set developed is comprised of a document collection, a set of query topics, and the relevance judgments for each query topics (Büttcher, Clarke, & Cormack, 2010; Croft, Metzler, & Strohman, 2010). 2.4.2.1.1. Document Collection When identifying or creating a document collection, it is important to begin by identifying the required attributes of the document collection for which the search engine is intended to be used. Depending on the objective of the search engine, the important attributes of the document collection may include the number of documents or size of collection, the average length of the documents, type of documents, subject matter of the documents, and the level of technical detail and jargon presented in the documents. Once the required attributes have been identified, the document collection should be selected or created to adequately meet these requirements. “In some cases, this may be the actual collection for the application; in others it will be a sample of the actual collection or even a similar collection” (Croft, Metzler, & Strohman, 2010, pp. 303-304). 133 2.4.2.1.2. Query Topics Query topics are identified to represent individual information needs that will be tested in the experiment. Two important aspects of identifying query topics are the particular concepts represented by each of the query topics and the number of query topics defined. The information needs as described by each of the query topics should provide reasonable coverage and be representative of the types of information needs that the intended users will try to fill with the search engine. It is also desirable that “[w]hen developing a topic, a preliminary test of the topic on a standard IR system should reveal a reasonable mixture of relevant and non-relevant documents in the top ranks” (Büttcher, Clarke, & Cormack, 2010, p. 75). This will increase the chances that differences between the performances of two or more search engines will be distinguishable. Sources that may be used to identify representative queries include query logs of similar systems and asking potential users for examples of queries (Büttcher, Clarke, & Cormack, 2010; Croft, Metzler, & Strohman, 2010). “Although it may be possible to gather tens of thousands of queries in some applications, the need for relevance judgments is a major constraint. The number of queries must be sufficient to establish that a new technique makes a significant difference” (Croft, Metzler, & Strohman, 2010, p. 304). A balance should be struck so that neither too many nor too few query topics are defined. Too many query topics will require unnecessary time and effort to perform the necessary relevant judgments. Too few queries will not allow the necessary conclusions to be drawn. 
The number of query topics represents the sample size to be used in the evaluation experiment, and, therefore, 134 a statistical power analysis should be conducted to determine the required sample size (Cohen, 1988). A power analysis will provide the information necessary to determine how many query topics will be necessary to generate the results required to conclude that an improvement in performance at the desired effect size and significance level is achieved. 2.4.2.1.3. Relevance Judgments For each query topic, the set of documents within the collection that matches the criteria of each query topic must be identified. This is the information that will be used to determine the number of relevant and non-relevant documents that the search engines return as well as the number of relevant documents that the search engines miss. The best practice for assessing the relevance of documents in a document collection for a particular query topic is to perform adjudication by a human assessor. However, such a process can be very time intensive (Büttcher, Clarke, & Cormack, 2010; Croft, Metzler, & Strohman, 2010). For example, consider a medium-sized test document collection composed of 3000 documents, and assume that there are 50 query topics to be tested. In the Exhaustive Adjudication method, each document is reviewed for relevancy against each query topic. If each of the 3000 documents is reviewed to determine its relevance for each of the 50 test query topics, and each judgment takes 30 seconds, the adjudication effort would take approximately 1250 hours (i.e., a single assessor would need to spend over 30 weeks of full-time effort). The time commitment required by this method of Exhaustive Adjudication makes it unfeasible for most projects. Therefore, 135 alternative methods that minimize the effort required to perform the relevance assessments for each test query topic are typically used. There are a number of alternatives that have been used successfully to reduce the effort of adjudication, yet produce good results. The standard method used by Text REtrieval Conference (TREC) is known as the Pooling Method. In this method, a document collection and test query set is supplied to a group of participants. The participants then generate a set of retrieved documents for each query topic in the test query set. For a given test query topic, the top-ranked k documents supplied by each participant are pooled together to form the set of documents to be assessed for relevance. This pool of documents is then reviewed, and each document is tagged as relevant or not relevant by a human assessor or team of human assessors. All the documents in the collection that are not part of the assessment pool are classified automatically as not relevant. The pooling method can significantly reduce the number of documents that must be looked at by the human assessors (Büttcher, Clarke, & Cormack, 2010; Harman, 2005). Another alternative is the Interactive Search and Judging (ISJ) Method. In this method, a “skilled” searcher uses a search engine to find and tag as many relevant documents as possible. Typically an adjudication tracking tool is used to help the searcher record relevance judgments and suppress previously adjudicated documents from consideration. All documents in the collection that are not explicitly adjudicated for a given test query topic are automatically tagged as not relevant. 
The results of using the 136 ISJ evaluation method were found both by Cormack, Palmer, and Clarke (1998) and by Voorhees (2000) to be very similar to official TREC results using the pooling method. 2.4.2.2. Pre-defined Test Data Sets A number of pre-defined test data sets created by others in the information retrieval community are available for the purpose of evaluating search engines. The most well-known and widely used test collections are those from the Text REtrieval Conference (TREC) but, other publically available collections also exist. 2.4.2.2.1. Text REtrieval Conference (TREC) Using a Text REtrieval Conference (TREC) test data set can be an effective way to evaluate how a new search engine method compares to other search engines. In the early days of evaluating information retrieval systems, it was difficult to conduct an evaluation using a realistically-sized document collection. Instead, many of the document collections available for use were much smaller than the collections for which the retrieval system was designed. “Evaluation using the small collections often does not reflect performance of systems in large full-text searching, and certainly does not demonstrate any proven abilities to operate in real-world information retrieval environments” (Harman, 2005, p. 21). To address this problem, TREC was developed. It is “a series of experimental evaluation efforts conducted annually since 1991 by the U.S. National Institute of Standards and Technology (NIST). TREC provides a forum for researchers to test their IR systems on a broad range of problems. … In a typical year, TREC experiments are 137 structured into six or seven tracks, each devoted to a different area of information retrieval. In recent years, TREC has included tracks devoted to enterprise search, genomic information retrieval, legacy discovery, e-mail spam filtering, and blog search. Each track is divided into several tasks that test different aspects of that area. For example, at TREC 2007, the enterprise search track included an e-mail discussion search task and a task to identify experts on given topics” (Büttcher, Clarke, & Cormack, 2010, p. 23). Because of the variety of tracks available, a TREC collection appropriate for the objectives of a new search engine method is often available. If the TREC test collection chosen is an active track, then the results of the experiment may be submitted to TREC and included in the TREC relevance adjudication efforts. If the TREC test collection used is not active, the relevance judgments will have been previously compiled and can be used to calculate the performance of a new search engine. These results may also be compared to the results of the other systems that participated while the track was active. 2.4.2.2.2. Other Publicly Available Test Data Sets TREC document collections are not in the public domain; as a result, individuals or their affiliated research organizations must be approved by TREC to be eligible to purchase a copy of a TREC data set. Therefore, not everyone can obtain a copy of a TREC data collection to use to evaluate their search engine. However, TREC data sets are not the only pre-defined test data sets available for evaluating a search engine. For example, both the CACM test collection and the OHSUMED test collection are available 138 at no cost 9. CACM is a collection of titles and abstracts of articles published in the Communications of the ACM journal between 1958 and 1979. 
This is a collection of 3,204 documents, 64 query topics, and the relevant document result sets for each query topic (Harman, 2005). The OHSUMED test collection consists of a subset of clinicallyoriented MEDLINE articles published in 270 medical journals between 1987 and 1991. This is a collection of 348,566 documents, 106 query topics, and the relevant document result sets for each query topic (Hersh et al., 1994). Today, many of the alternative test data sets are considered quite small and, therefore, less relevant for testing the advances in technology for use on the ever expanding Web. But, for some smaller projects or in the early stages of larger projects, these sets may be useful. In fact, Qiu and Frei (1993) used the CACM collection along with several of these freely available test sets for their groundbreaking work. 9 A copy of CACM test collections may be obtained from the Information Retrieval Group in the School of Computing Science at the University of Glaslow at http://ir.dcs.gla.ac.uk/resources/test_collections/. A copy of the OHSUMED test collection may be obtained from Department of Medical Informatics and Clinical Epidemiology at the Oregon Health & Science University at http://ir.ohsu.edu/ohsumed/. 139 CHAPTER 3. RESEARCH HYPOTHESES The objective of this research was to determine if a search engine enhanced with automatic concept-based query expansion using term relational pathways built from collection-specific association thesaurus positively impacted search performance. To determine if this research objective was met, the following four research hypotheses were posed. Research Hypothesis #1 The Enhanced search engine will perform differently than the Baseline search engine. Research Hypothesis #2 The Enhanced search engine will on average have greater recall than the Baseline engine. Research Hypothesis #3 The Enhanced search engine will on average perform better than the Baseline search engine as determined by a higher average F-measure value. Research Hypothesis #4 The Enhanced search engine will on average perform better on query topics that describe intangible concepts than query topics that describe tangible concepts as determined by a higher average F-measure value. 140 CHAPTER 4. METHODOLOGY The research methodology was designed to effectively and efficiently test the four research hypotheses posed in Chapter 3. 4.1. Identify and Design Baseline Search Engine In order to determine whether the enhancement impacts search performance, the performance resulting from the enhancement must be isolated from the performance of other search engine elements. To do this, the performance of an enhanced search engine may be compared to the performance of a baseline search engine where the only difference between the two search engines are the modules and components required for implementing the enhancement. This approach ensures that any difference in performance is the result of the enhancement rather than the performance being confounded by other differences in search engine architecture or configuration settings. Ideally, when choosing an appropriate baseline search engine, it is beneficial for it to be well-known, well understood, and representative of the state of the art of performance in the field. However, this is not always possible. 
In this research, the original plan was to use an offline version of the Google search engine called Google Desktop as a model for the baseline condition because it could be considered a wellknown, representative tool used for searching local document collections. And while the core of Google’s search technology and component architecture is proprietary and, therefore, cannot be modified with any design enhancements, the plan was to use Google Desktop as a guide for tuning the performance of a Baseline search engine that could be 141 modified. The Baseline search engine was to be developed using the popular, opensource Lucene.NET search engine development library. Once developed, the Lucene.NET Baseline search engine would be configured and tuned to produce equivalent search results to Google Desktop. After the Lucene.NET Baseline search engine was appropriately configured, enhancements to implement the automatic conceptbased query expansion would be built on top of the Lucene.NET Baseline search engine core to create the Enhanced search engine. However, in the course of the Baseline search engine development, it was recognized that the Lucene.NET Baseline search engine could not be accurately configured to perform in an equivalent manner to Google Desktop on the target document collection. Therefore, an alternate plan was developed. Instead of using Google Desktop to tune the Lucene.NET Baseline search engine, the Lucene.NET Baseline search engine was configured using current best practices in search engine design. These best practices included the following features: • Word Stemming – The Snowball word stemmer was used during the parsing step. It is an implementation of a modified Porter Stemmer and is the standard stemmer used in the Lucene.NET search engine development library. • Stopword Removal – A list of common function words in English typically used to indicate grammatical relationships were removed during the parsing step. See Appendix A section A.2.2.2 for the list of stopwords removed. 142 • HTML code filtering – An HTML code filter was used so that words included inside HTML code tags were ignored during indexing. The code filter used was the built-in Lucene.NET analyzer called HTMLStripCharFilter. • Proximity Search – The search engine had the ability to retrieve documents that contained the terms occurring within a prescribed threshold of one another. If the document contained all the terms, but they were considered too far away from one another in the document (as determined by the predefined threshold), then the document was not considered a match and was not returned. See Appendix A section A.2.5. for more information about the proximity search and the threshold used. 4.1.1. Design Process Used To ensure that the search engines designed as part of this research would be of the appropriate level of quality and successfully meet the research objectives, many principles of systems engineering were used in the design process. Some of the core principles in the design process used to develop the search engines included the following: • Define the intentions of the system, including the needs and relevant characteristics of the intended users. • Define the system objectives and requirements. • Make design decisions based on defined intentions, objectives, and requirements. 143 • Iteratively refine requirements throughout design process as better and more complete knowledge of the intended system is acquired. 
• Ensure that all intentions, objectives, and requirements are met in the design through requirements tracking and system testing.

4.1.2. Baseline Search Engine Structure

The Baseline search engine performed two core functions: one, index the document collection and store the indexed information in a quickly accessible format; and, two, process queries entered by the user. Therefore, the Baseline search engine was composed of an Index Module, a Search Module, and an Index data store. The Index Module was part of the pre-search processes that prepare the system for use. The Search Module was part of the post-search processes that occur to allow users to enter their desired search terms, the system to process the query, and the system to return the search results to the user. The high-level structure of the Baseline search engine is illustrated in Figure 4.1.

The distinction made between pre-search and post-search is important when considering the impact on the user's experience. The processing time required to perform the pre-search processes does not impact the user's experience (i.e., the user is not aware of the time required to perform the pre-search processing), while the processing time required to perform the post-search processes does impact the user's experience (i.e., the user must wait for the post-search processing to be completed before the search results may be displayed).

Figure 4.1 High-level structure of the Baseline search engine (pre-search Index Module feeding an Index data store; post-search Search Module behind the user interface).

4.2. Design Enhanced Search Engine

The Enhanced search engine was designed to implement the approach to concept-based query expansion described in Chapter 1. The search engine, therefore, was enhanced to perform three additional core functions: one, build a collection-specific association thesaurus and store it in a quickly accessible format; two, generate the conceptual network from the association thesaurus entries and store it in a quickly accessible format; and, three, identify candidate query expansion terms from the conceptual network and the user's original query terms.

The additional modules and components necessary to accomplish the enhanced functionality were built on top of the Lucene.NET Baseline search engine described in the previous section. Therefore, in addition to the Baseline's Index Module, Search Module, and Index data store, the Enhanced search engine also contained modules to Build Association Thesaurus, Generate Conceptual Network, and Expand Query as well as the necessary components for the Association Thesaurus data store and the Conceptual Network data store. The Index Module, Build Association Thesaurus module, and Generate Conceptual Network module were part of the pre-search processes to prepare the system. The Search Module and Expand Query module were part of the post-search processes that occur to process the user's query. The high-level structure of the Enhanced search engine is illustrated in Figure 4.2.
Figure 4.2 High-level structure of the Enhanced search engine (pre-search Index Module, Build Association Thesaurus module, and Generate Conceptual Network module with the Index, Association Thesaurus, and Conceptual Network data stores; post-search Search Module and Expand Query module behind the user interface).

4.2.1. Build Association Thesaurus

The Build Association Thesaurus module automatically builds the Association Thesaurus using the document collection. The module accomplished this by manipulating a Term-Document matrix composed of terms and their occurrences in the documents of the collection to determine the level of association between term pairs. To determine the similarity, overlapping document segments were analyzed to determine the frequency of eligible terms, and the resulting data was used to calculate co-occurrence values.

4.2.1.1. Overlapping Document Segments

The term vectors that made up the Term-Document matrix were defined by the number of occurrences (i.e., frequency) of the term within document segments rather than within full documents. Document segments (i.e., a moving shingled window) were created from each full document. The document segments were 200 words long, and each segment overlapped the previous and next segment by 100 words (i.e., the shingle increment). The number of document segments created from a full document varied from one segment to several hundred segments, depending on the length of the full document. The Term-Document matrix was, therefore, constructed so that the terms were represented by the columns, and the document segments were represented by the rows. Using document segments rather than the full documents controlled for the variability in length of documents in the collection and ensured that only the terms in close proximity (i.e., within 200 terms) to one another were assumed to be similar to one another.

The document segment size and the shingle increment were chosen based on an informal average paragraph size. It was observed that a single, although possibly complex, concept is often contained in a paragraph. Because of this, the words used in the beginning of the paragraph are likely topically related to the words used at the end of the paragraph. Therefore, the average number of words contained in a paragraph may be a reasonable guide to the size of a chunk of text in which all the words are semantically related. Assuming that paragraphs typically range from 100 to 200 words, a document segment size of 200 words and a shingle increment of 100 words were chosen. These values were chosen early in the design process and no tuning of these values was performed.

4.2.1.2. Eligible Term Identification

Not all terms present in the document collection were included in the Association Thesaurus. Only stemmed, content-bearing words (i.e., stop words were excluded) present in the document collection with an appropriate level of frequency were identified as eligible for inclusion in the Association Thesaurus.
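The segmentation and counting steps just described can be sketched as follows. The sketch is an illustration only, not the Lucene.NET implementation used in this research; the segment size and shingle increment match the values given above, and the frequency bounds correspond to the eligibility thresholds discussed next.

    # Hedged sketch of the pre-search segmentation and term-eligibility steps:
    # split each (already stemmed, stopword-filtered) document into overlapping
    # 200-word segments with a 100-word shingle increment, then keep terms whose
    # frequencies fall within the eligibility bounds described in this section.
    from collections import Counter

    def shingle(words, segment_size=200, increment=100):
        """Yield overlapping document segments as lists of words."""
        if len(words) <= segment_size:
            yield words
            return
        for start in range(0, len(words) - increment, increment):
            yield words[start:start + segment_size]

    def eligible_terms(documents, min_collection_freq=50, max_segment_count=9999):
        collection_freq = Counter()   # total occurrences across the collection
        segment_count = Counter()     # number of segments containing the term
        for doc_words in documents:
            collection_freq.update(doc_words)
            for segment in shingle(doc_words):
                segment_count.update(set(segment))
        return {term for term in collection_freq
                if collection_freq[term] >= min_collection_freq
                and segment_count[term] <= max_segment_count}

The same per-segment counts feed the Jaccard-based co-occurrence calculations described in section 4.2.1.3.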
Therefore, the terms needed to occur frequently enough in the document collection for co-occurrence calculations to yield useful information but not too frequently for their presence to not be a useful discriminator of relevance. Eligible terms were those that had a minimum frequency of 50 in the overall document collection and did not appear in more than 9999 document segments. These eligible terms parameters were not tuned but chosen at the beginning of the design process based on reasonable initial guesses as to appropriate starting values. 149 4.2.1.3. Co-Occurrence Calculations The co-occurrence calculations to determine level of association (or, similarity) between term pairs were conducted using the Jaccard Coefficient. The Jaccard Coefficient is based on an Intersection Over Union (IOU) calculation to normalize and measure the amount of overlap between two term vectors. The Jaccard Coefficient value of a term pair was used only to make the binary decision of inclusion or exclusion of a term pair in the Association Thesaurus. Those term pairs with a Jaccard Coefficient value greater than 0.5 were included in the Association Thesaurus as associated term entries for each other. This minimum threshold value of 0.5 was chosen early in the design process based on the idea that a value near the mid-point of possible Jaccard Coefficient values (i.e, values between 0 and 1) would provide a reasonable starting point and no tuning was performed to improve this value. 4.2.2. Generate Conceptual Network The Generate Conceptual Network module used the entries in the Association Thesaurus to generate the conceptual network in the same manner described in section 1.2.1. Each term in the Association Thesaurus represented a node. Child nodes for a term were generated from all of the associated terms defined in its thesaurus entry to create a term cluster. To form the full conceptual network, each term cluster generated from the thesaurus entry was linked to the other term clusters using shared terms. The entire conceptual network was developed by continuing this term cluster linking process using 150 all shared terms defined through the relationships defined by the associated term entries in the Association Thesaurus. Only terms likely to be useful in discriminating the relevance of a document were included in the conceptual network. A maximum threshold was used to restrict the number of associated terms a target term may have to be eligible for inclusion in the conceptual network. Terms that had more than 275 entries were considered to be too frequently occurring to be able to offer a useful discrimination and were ignored during the process of creating the conceptual network. Therefore, any term included in conceptual network had 275 or fewer associated terms included in its Association Thesaurus entry. This threshold value of 275 entries was chosen early in the design process based on reviewing several example term pairs and their resulting pathways. No tuning was done after this early design decision was made. In this way, the conceptual network was composed of all terms with 275 or fewer entries in the Association Thesaurus and links between terms were only based on whether or not shared terms existed in the individual term clusters (i.e., there were no other parameters considered when forming the links between nodes). 4.2.2.1. 
Relational Pathways To minimize search processing time experienced by the users, the relational pathways were identified during the pre-search process stage in the Generate Conceptual Network module. All possible term pairs were identified from the terms contained in the Association Thesaurus. Next, the relational pathways for each term pair were identified and stored for fast retrieval at search time. 151 The relational pathways identified were 3, 4, or 5 terms long. To identify the relational pathways between a pair of terms, the module began with the first term of the term pair and traversed the conceptual network looking for the second term of the term pair using a breadth-first search to a maximum depth of 4 terms. When the second term was found, the intervening terms were captured to form a relational pathway. It was possible for zero, one, or more relational pathways to be identified for a given term pair. There was no maximum threshold used to limit the number of relational pathways that could be identified for a given term pair. All the relational pathways for a given term pair were the same length. Once a relational pathway was found, the search on that level was completed and then stopped before it moved to the next level of depth. 4.2.3. Expand Query The Expand Query module identified the candidate expansion terms and performed the query expansion. The module accomplished this by identifying all relational pathways between each term pair in the original query, identifying appropriate combinations of candidate expansion terms, and reformulating the original query into an expanded query. For each pair of terms in the original query, the Expand Query module attempted to identify one or more relational pathways for the term pair. To do this, all possible term pairs in the original query were identified. For example, if the original query is composed 152 of three terms such as such as warnings quickly understandable, then there are three terms pairs that may result in relational pathways. The three term pairs in this example are the following: 1. warnings quickly 2. warnings understandable 3. quickly understandable Next, all relational pathways identified and stored during the pre-search process stage in the Generate Conceptual Network module for each pair of terms from the original query were retrieved. The relational pathways for each term pair retrieved were then processed to identify the appropriate combinations of candidate expansion terms with which to expand the original query. See Appendix B for additional details about how Boolean expressions are generated from the relational pathways. 4.3. Select a Document Collection The search engine enhancement is intended to be useful for bounded, mediumsized document collections containing documents that are focused on a single scientific or technical conceptual domain. The document collection of the Design CoPilotTM web application 10 was chosen as the test document collection to be used in the search performance experiments. 10 The Design CoPilotTM is available at http://www.designcopilot.com. 153 Design CoPilotTM is a subscription-based web application that contains a collection of approximately 3000 documents. These documents range in length from a few paragraphs to several hundred pages. Approximately half of the documents in the collection are regulatory or guidance documents. 
These are technical documents written by government agencies and/or industry groups that focus on providing the regulatory requirements and advisory guidance material for the design and certification of aircraft. The second half of the Design CoPilotTM document collection is a set of the documents that describe human factors research, guidelines, and other relevant human factors information for the design and certification of aircraft. Therefore, the Design CoPilotTM document collection provides an appropriately sized, well-bounded collection focused on a single technical conceptual domain. 4.4. Develop Query Topics Query topics were developed with which to test the search performance of the Baseline and Enhanced search engines on the test document collection. In order to provide an accurate measure of performance, the query topics needed to be representative of the queries posed by the users of the Design CoPilotTM to fill their real-life information needs. Therefore, they were developed by a team of Design CoPilotTM content experts by reviewing query logs of the Design CoPilotTM web application and by drawing on the content experts’ experience with the document collection. The team of six of Design CoPilotTM content experts included the author of this dissertation. The team worked together to develop a set of 75 query topics that would 154 represent the variety of concepts likely to be asked by users of the Design CoPilotTM application. Each query topic included in the set was required to be composed of two or more terms (i.e., no single term query topics were included). 4.4.1. Tangible and Intangible Concepts Because Research Hypothesis #4 predicted that the type of concept that the query topic represented would impact the performance of the Enhanced search engine, two types of query topics were developed: those that represented tangible concepts and those that represented intangible concepts. Tangible concepts are simple, unambiguous, well-defined concepts and include query topics such as “attitude trend indicator” and “placard lighting.” Tangible concepts are often specific examples or instances of an item (e.g., “mouse”, “track pad” and “joy stick” are specific instances of cursor control devices). Intangible concepts are complex, harder-to-define, fuzzy concepts and include query topics such as “excessive cognitive effort” or “how to provide unambiguous feedback.” Intangible concepts may represent a general class of items or concepts (e.g., “cursor control device”) and may be discussed in many different ways (e.g., “mental processing” may also be discussed as “cognitive processing,” “cognitive requirements,” or “mental workload”). Often intangible concepts are described using one or more ambiguous or difficult to define qualifying terms (e.g., “suitable”, “appropriate” and “excessively”). While there may be concepts that could arguably fit in either the tangible or intangible category, the primary distinction is that of how easily, unambiguously, 155 uniformly, and precisely the concept may be described. Those query topics that represent concepts that are typically described in a single way and are unambiguous have been categorized as tangible; those query topics that represent concepts that are ambiguous, complex, and may be described in several different ways have been categorized as intangible. A total of 75 query topics were developed. Of these, 40 query topics represented tangible concepts, and 35 query topics represented intangible concepts. 
See Appendix C for a complete list of the query topics used in this research.

4.4.2. Relevant Document Sets for Query Topics

The set of relevant documents for each query topic was identified through a process of adjudication after the query topics had been run on both the Baseline and the Enhanced search engines. To make the effort required to perform the adjudication task manageable, a modified pooling method was used in which only the differences in the sets of documents returned were manually reviewed by a human assessor to determine document relevance. Documents that were returned by both the Baseline and the Enhanced search engines were assumed to be relevant. Documents returned by neither the Baseline nor the Enhanced search engine were assumed to be non-relevant. Documents returned only by the Enhanced search engine (and not by the Baseline) were manually reviewed for relevance.

Because the Enhanced search engine was built on top of the Baseline search engine, all documents returned by the Baseline would always also be returned by the Enhanced. Or, stated another way, no documents were returned by the Baseline that were not also returned by the Enhanced.

4.5. Run Experiment

The experiment was run by querying the Baseline and then the Enhanced search engine for each of the 75 query topics. The list of documents returned by each search engine was stored in a relational database. After the experimental run was complete, a table of differences was created that represented those documents returned only by the Enhanced search engine and that, therefore, required manual adjudication to determine their relevancy. The following sections describe how the samples of query topics were selected and how their associated document sets were adjudicated to determine relevance.

4.5.1. Select Samples

To understand the performance difference between the Baseline and Enhanced search engines, a more detailed analysis was performed on those query topics for which the Baseline and the Enhanced search engines' performance differed. A sample of 14 query topics from the tangible set and a sample of 16 query topics from the intangible set were selected. Each sample set was created by selecting all query topics for which the difference in the number of documents returned by the Baseline and the Enhanced search engines was less than or equal to 50 documents. In the Tangible Sample Set, 14 query topics met this criterion; in the Intangible Sample Set, 16 query topics met this criterion. The threshold of less than or equal to 50 documents was chosen arbitrarily to restrict the samples to an appropriately large yet manageable size.

4.5.2. Adjudicate

The 30 query topics selected in the previous step (i.e., 14 query topics in the Tangible Sample Set plus 16 query topics from the Intangible Sample Set) were adjudicated using the modified pooling method described earlier to identify their relevant document sets. Therefore, documents returned by both search engines were assumed to be relevant; documents returned by neither search engine were assumed to be irrelevant; and documents returned by only the Enhanced search engine were reviewed manually by a human assessor to determine relevance. The human assessor was the author of this dissertation.
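As a concrete illustration of the modified pooling method just described, the short Python sketch below partitions the results for a single query topic into the three categories used during adjudication. The function name, variable names, and document identifiers are assumptions made for the example, not part of the dissertation's software.

```python
def partition_for_adjudication(baseline_docs, enhanced_docs, collection_docs):
    """Partition one query topic's documents under the modified pooling method.

    baseline_docs is expected to be a subset of enhanced_docs because the
    Enhanced engine was built on top of the Baseline engine.
    """
    assumed_relevant = baseline_docs & enhanced_docs         # returned by both engines
    needs_manual_review = enhanced_docs - baseline_docs      # returned only by the Enhanced engine
    assumed_non_relevant = collection_docs - enhanced_docs   # returned by neither engine
    return assumed_relevant, needs_manual_review, assumed_non_relevant


# Illustrative document identifiers only:
baseline = {"doc-01", "doc-02"}
enhanced = {"doc-01", "doc-02", "doc-07", "doc-09"}
collection = {f"doc-{i:02d}" for i in range(1, 11)}
relevant, review, non_relevant = partition_for_adjudication(baseline, enhanced, collection)
```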
Because the only source of the documents to be manually adjudicated was the Enhanced search engine (i.e., all documents returned by the Baseline were, by design, also returned by the Enhanced search engine), the adjudication process was not blind. To address the potential issues of consistency, repeatability, and reasonableness of subjective decisions that may be introduced when the adjudication process is not blind, graded relevancy descriptions were used to aid the adjudication task. During the adjudication, graded relevancy descriptions were developed for each query topic, and documents were reviewed for relevancy based on their content. The higher the score, the more relevant the document was to the query topic. Documents with a score of "1" contained information that addressed all aspects of the concept represented by the query topic. Documents with a score of "0.5" or "0.25" contained information that addressed some but not all aspects of the concept represented by the query topic. Finally, documents with a score of "0" were irrelevant (i.e., did not contain any information that addressed the concept represented by the query topic). See Appendix E for the query topic graded relevancy descriptions.

In addition to aiding adjudication consistency, the graded relevancy definitions allowed for traceability of the process. For each document adjudicated, both a binary relevancy score (i.e., "1" when the document is fully or partially relevant and "0" when the document is irrelevant) and a graded relevancy score (i.e., "1", "0.5", "0.25", or "0" as described above) were captured in a relational database table. If the document received a non-zero relevancy score, an excerpt from the document supporting the assigned score was also captured in the relational database table. The graded relevancy scores were not used in any of the calculations in this research project but rather served as an adjudication aid and allowed for traceability. Follow-on research may be conducted in which these scores are analyzed and used in the performance calculations.
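To make the record-keeping described above concrete, the following sketch stores the binary score, graded score, and supporting excerpt for each adjudicated document. The dissertation used an Access database; SQLite and the table and column names shown here are assumptions used purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # illustrative only; the study used an Access database
conn.execute("""
    CREATE TABLE adjudication (
        query_topic  TEXT,
        document_id  TEXT,
        binary_score INTEGER,  -- 1 = fully or partially relevant, 0 = irrelevant
        graded_score REAL,     -- 1, 0.5, 0.25, or 0 per the graded relevancy descriptions
        excerpt      TEXT      -- supporting excerpt, captured when graded_score > 0
    )
""")


def record_judgment(query_topic, document_id, graded_score, excerpt=None):
    """Derive the binary score from the graded score and store both."""
    binary_score = 1 if graded_score > 0 else 0
    conn.execute(
        "INSERT INTO adjudication VALUES (?, ?, ?, ?, ?)",
        (query_topic, document_id, binary_score, graded_score, excerpt),
    )


record_judgment("excessive cognitive effort", "doc-42", 0.5,
                "Illustrative excerpt supporting partial relevance.")
```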
4.6. Measure Performance and Determine Statistical Significance

After the adjudication was complete, the data generated in the previous steps were used to measure performance and determine the statistical significance of any differences, both for the full set of 30 query topics and between the 14 query topics in the Tangible Sample Set and the 16 query topics in the Intangible Sample Set.

4.6.1. Addressing Potential Impact of Unfound Relevant Documents

In this experiment, the goal of measuring search engine performance was to determine whether or not the proposed enhancement positively impacted the overall performance. A positive impact was considered one in which recall increased while precision remained high. Therefore, an F-measure, which enforces a balance between precision and recall, provided a useful and logical method to compare the overall performance of the search engines.

As described earlier, the Enhanced search engine was created by building the enhanced functionality on top of the Baseline search engine. The advantage of this technique is that any differences in performance between the Baseline and the Enhanced search engine are the product of the enhancement and not of some uncontrolled or unknown variability introduced by differences in search engine structure or components. However, there are some potential issues with performing a traditional F-measure calculation in this experimental situation because of the potential impact of unfound (and, therefore, unknown) relevant documents.

Unfound relevant documents can impact whether or not the recall calculation is a good estimator of actual recall (Frické, 1998). Typically, the likelihood that unfound relevant documents exist in an experimental test set is reduced to an acceptable level by pooling the results from a collection of unique search engines. The idea is that each unique search engine algorithm will provide its own set of relevant documents and allow identification of most, if not all, relevant documents that exist in the collection (i.e., it is assumed that the search engines will not all miss retrieving the same relevant documents). However, because this experiment is conducted on a unique local website document collection to illustrate an enhancement particular to this type of collection, the relevant documents for the query topics are not pre-defined, and the experiment cannot take advantage of a collective of researchers to perform the adjudication task. Therefore, the adjudication task must be performed in a way that minimizes the effort required, so that it is not prohibitively time-consuming, while still providing information of an appropriate level of quality.

One of the assumptions that impacted the number of documents requiring manual adjudication was the use of a modified pooling method. As described earlier in section 4.4.2., in the modified pooling method all documents returned by both the Baseline and the Enhanced search engine were assumed to be relevant; all other documents returned were manually adjudicated to determine their relevance. Because the set of documents returned by the Baseline was always a subset of those returned by the Enhanced search engine (as illustrated in the Venn diagram in Figure 4.3), the result was that all documents returned by the Baseline were assumed to be relevant.

Figure 4.3. Venn diagram illustrating that the documents returned by the Baseline are a subset of those returned by the Enhanced search engine.

Because of this assumption, the only additional relevant documents identified for a query topic were provided by the Enhanced search engine. Therefore, it is possible that a large number of relevant documents could remain unfound in this experiment. To address the possibility that a large number of unfound relevant documents might impact the ability of the recall calculations to estimate the true recall of the Baseline and Enhanced search engines, this experiment modified the calculations used and performed a sensitivity analysis to assess the potential impact of unfound relevant documents.

Instead of using the traditional recall calculation, this experiment used a recall-like calculation called recall*. Recall* is calculated the same way that traditional recall is calculated, but the notation highlights the idea that it may not be an accurate estimate of the true recall of the search engine.

R* = recall* = |A ∩ B| / |A|

Where:
A = set of all found relevant documents in the document collection
B = set of retrieved documents

Because the F-measure relies on recall, the issue described also impacts the F-measure calculation. Therefore, instead of the traditional F-measure, this experiment used an F-measure-like calculation called F-measure*. Like recall*, F-measure* is calculated in the same way as the traditional version, but the notation highlights the idea that it may not be an accurate estimate of the true F-measure value of the search engines.
F-measure* = 2 / (1/R* + 1/P) = (2 · R* · P) / (R* + P)

Where:
R* = recall* value
P = precision value

4.6.2. Addressing Potential Impact of Relevancy Assumption in Modified Pooling Method

The use of a modified pooling method to minimize the time and effort required, while still providing information of an appropriate level of quality, relied on the assumption that all documents returned by both the Baseline and the Enhanced search engine were relevant and, therefore, did not require adjudication. However, it is possible that this assumption is not accurate; a large proportion of the documents returned by both the Baseline and the Enhanced search engines may not be relevant. If a large proportion of the documents returned by both search engines are non-relevant, the precision calculations may not be a good estimator of the true precision of the Baseline and Enhanced search engines. To address this possible issue, a sensitivity analysis for documents assumed relevant was performed to assess the potential impact of non-relevant documents returned by both the Baseline and the Enhanced search engines.

4.6.3. Perform Calculations and Sensitivity Analyses

After the adjudication was complete, the relevancy results for the sample of 30 query topics run on each search engine were collated and the F-measure* (i.e., the harmonic mean of recall* and precision) was calculated. t-Tests were used to determine whether a statistically significant performance difference existed between the Baseline and Enhanced search engines. Next, a Single Factor ANOVA calculation using the F-measure* was performed to determine whether a performance difference existed between the 14 query topics that represented tangible concepts and the 16 query topics that represented intangible concepts.

A sensitivity analysis was then conducted to assess the potential impact of unfound relevant documents. The F-measure* was calculated for three levels of sensitivity to unfound relevant documents. The estimated total relevant documents for each level of sensitivity was calculated by adding the number of relevant documents for the query topic identified by the Baseline and Enhanced search engines to the estimated number of unfound relevant documents at that level. The three levels of sensitivity were the following:

• Level 1 Sensitivity To Unfound Relevant Documents (0.25X) – The unfound relevant documents were estimated to be a quarter of the number of relevant documents identified by the Baseline and Enhanced search engines.
• Level 2 Sensitivity To Unfound Relevant Documents (2X) – The unfound relevant documents were estimated to be double the number of relevant documents identified by the Baseline and Enhanced search engines.
• Level 3 Sensitivity To Unfound Relevant Documents (10X) – The unfound relevant documents were estimated to be ten times the number of relevant documents identified by the Baseline and Enhanced search engines.

For each level of sensitivity, the values estimating the number of unfound relevant documents were rounded up to the next whole number. See Table 4.1 for an example of the estimates for each of the three levels of unfound relevant documents used in the sensitivity analysis.

Level   Multiplier   Identified Relevant   Estimated Unfound Relevant   Estimated Total Relevant
1       0.25X        10                    3                            13
2       2X           10                    20                           30
3       10X          10                    100                          110

Table 4.1. Example of the estimates used for unfound relevant documents at each of the three sensitivity levels. Estimates of unfound documents were rounded up to the next whole number.
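The calculations just defined are straightforward to express in code. The following Python sketch is only an illustration (the function names are assumptions, and the sets and counts shown are toy data); it computes recall*, precision, and F-measure*, and reproduces the Table 4.1-style estimates of total relevant documents.

```python
import math


def recall_star(found_relevant, retrieved):
    """recall* = |A ∩ B| / |A|, where A is the set of all found relevant
    documents and B is the set of retrieved documents."""
    return len(found_relevant & retrieved) / len(found_relevant) if found_relevant else 0.0


def precision(found_relevant, retrieved):
    return len(found_relevant & retrieved) / len(retrieved) if retrieved else 0.0


def f_measure_star(r_star, p):
    """Harmonic mean of recall* and precision: 2·R*·P / (R* + P)."""
    return 2 * r_star * p / (r_star + p) if (r_star + p) else 0.0


def estimated_total_relevant(identified_relevant, multiplier):
    """Table 4.1-style estimate: identified relevant documents plus a multiple
    of that count (0.25, 2, or 10), rounded up to the next whole number."""
    return identified_relevant + math.ceil(identified_relevant * multiplier)


# Reproducing the Table 4.1 example for 10 identified relevant documents:
for level, multiplier in [(1, 0.25), (2, 2), (3, 10)]:
    print(level, estimated_total_relevant(10, multiplier))  # 13, 30, 110
```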
As with the earlier calculations, significance was determined at each level of sensitivity using t-tests to determine whether a performance difference existed between the Baseline and Enhanced search engines. The last step of the sensitivity analysis was to compare the conclusions drawn from the t-tests at each level of sensitivity to the conclusion drawn from the original calculations (i.e., without the addition of estimated unfound relevant documents).

Finally, a sensitivity analysis for documents assumed relevant was conducted to assess the potential impact of non-relevant documents returned by both the Baseline and the Enhanced search engines. The F-measure* was calculated for four levels of sensitivity to non-relevant documents in the set of documents returned by both the Baseline and the Enhanced search engines. The estimates were calculated in two different ways. First, the estimated total relevant documents returned by both the Baseline and the Enhanced search engines was calculated by assuming that a certain percentage of the documents returned by both search engines were non-relevant. The following outlines the three levels of sensitivity estimates calculated in this way:

• Level 1 Sensitivity To Relevancy Assumptions (25% Non-Relevant) – Twenty-five percent of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant.
• Level 2 Sensitivity To Relevancy Assumptions (50% Non-Relevant) – Fifty percent of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant.
• Level 3 Sensitivity To Relevancy Assumptions (75% Non-Relevant) – Seventy-five percent of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant.

For each level of sensitivity, the values estimating the assumed number of non-relevant documents returned by both the Baseline and the Enhanced search engines were rounded up to the next whole number. See Table 4.2 for an example of the estimates for each of the three levels of assumed non-relevant documents used in the sensitivity analysis.

Level   % Non-Relevant   Returned by both Baseline and Enhanced   Estimated Non-Relevant   Estimated Relevant
1       25%              10                                       3                        7
2       50%              10                                       5                        5
3       75%              10                                       8                        2

Table 4.2. Example of the estimates of assumed non-relevant and relevant documents at each of the three levels of sensitivity to documents assumed relevant. Estimates of non-relevant documents were rounded up to the next whole number.
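A similar sketch, again using assumed names and toy numbers rather than the study's own code, reproduces the Table 4.2 adjustment in which a percentage of the documents assumed relevant is treated as non-relevant.

```python
import math


def adjusted_assumed_relevant(returned_by_both, pct_non_relevant):
    """Table 4.2-style estimate: treat a given percentage of the documents
    returned by both engines as non-relevant, rounding the non-relevant
    count up to the next whole number."""
    estimated_non_relevant = math.ceil(returned_by_both * pct_non_relevant)
    return returned_by_both - estimated_non_relevant


# Reproducing the Table 4.2 example for 10 documents returned by both engines:
for level, pct in [(1, 0.25), (2, 0.50), (3, 0.75)]:
    print(level, adjusted_assumed_relevant(10, pct))  # 7, 5, 2
```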
An additional sensitivity level was calculated using results generated by Google Desktop. The estimates for this level of sensitivity were generated by determining the number of documents returned by the Google Desktop search engine that overlapped with the documents returned by the Baseline and the Enhanced search engines. Documents returned by all three search engines were assumed to be relevant in this fourth level of sensitivity. As described earlier, initial attempts to tune the performance of the Baseline search engine with Google Desktop proved infeasible, and they also revealed that Google Desktop missed some of the relevant documents that the Baseline returned. Therefore, this fourth level of sensitivity may be considered a conservative estimate of relevancy.

As with the earlier calculations, significance was determined at each of the four levels of sensitivity using t-tests to determine whether a performance difference existed between the Baseline and Enhanced search engines. And like the previous sensitivity analysis, the last step of this sensitivity analysis was to compare the conclusions drawn from the t-tests at each level of sensitivity to the conclusion drawn from the original calculations (i.e., assuming that all documents returned by both the Baseline and the Enhanced search engines were relevant).

CHAPTER 5. RESULTS

The methodology described in the previous chapter was performed to address the four research hypotheses posed, and the results derived from the experiment are described in this chapter.

5.1. Full Query Topic Set

A total of 75 query topics were run on the Baseline search engine. All documents returned by the Baseline search engine for each of the query topics were stored in an Access relational database. On average, 20.2 documents were retrieved for each of the query topics run on the Baseline search engine, with a range between 0 and 105 documents.

The process was repeated using the Enhanced search engine. The same 75 query topics were run on the Enhanced search engine. All documents returned by the Enhanced search engine for each of the query topics were stored in an Access relational database. On average, 43.5 documents were returned for each of the query topics run on the Enhanced search engine, with a range between 0 and 244 documents.

The difference in the returned document counts between the Baseline search engine and the Enhanced search engine was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. (See Appendix D for the data and calculations related to the returned document counts.) The frequency distribution of the documents returned by the Baseline and by the Enhanced search engines is presented in Figure 5.1.

Figure 5.1. Histogram of the returned document count frequencies for the 75 query topics run on the Baseline and the Enhanced search engines.

Of the 75 query topics that were run, 41 (i.e., 54.7%) produced a difference in performance between the Baseline and the Enhanced search engines. Twenty-four of these were from the set of query topics that represented tangible concepts, and 17 were from the set of query topics that represented intangible concepts. Of the remaining 34 query topics, for which the Enhanced search engine performed the same as the Baseline search engine, 16 were from the set of query topics that represented tangible concepts, and 18 were from the set of query topics that represented intangible concepts. The counts and percentages of the distribution of the full set of 75 query topics are listed in Table 5.1 and illustrated in Figure 5.2.

Query Topic Set            Same Performance   Different Performance   Total
Full Set of Query Topics   34 (45.3%)         41 (54.7%)              75
Tangible                   16 (40.0%)         24 (60.0%)              40
Intangible                 18 (51.4%)         17 (48.6%)              35

Table 5.1. Counts and percentages of the distribution of the 75 query topics that produced a difference in performance between the Baseline and the Enhanced search engines and those that performed the same (i.e., no difference in performance).
Figure 5.2. Distribution of all 75 query topics that produced a difference in performance between the Baseline and the Enhanced search engines and those that performed the same (i.e., no difference in performance).

5.2. Samples of Query Topics That Produce a Difference

The next step was to perform a detailed analysis on only those query topics for which the Baseline and the Enhanced search engine's performance differed. A sample of 14 query topics from the tangible set and a sample of 16 query topics from the intangible set were selected. The listing of both sample sets is presented in Tables 5.2 and 5.3.

Tangible Query Topic Sample Set
1. auxiliary power unit fire extinguishing
2. false resolution advisory
3. fault tolerant data entry
4. hydraulic system status messages
5. icing conditions operating speeds provided in AFM
6. information presented in peripheral visual field
7. information readable with vibration
8. instruments located in normal line of sight
9. labels readable distance
10. landing gear manual extension control design
11. negative transfer issues
12. safety belt latch operation
13. side stick control considerations
14. text color contrast

Table 5.2. Sample of 14 query topics representing tangible concepts.

Intangible Query Topic Sample Set
1. acceptable message failure rate and pilots confidence in system
2. appropriate size of characters on display
3. arrangement of right seat instruments
4. control is identifiable in the dark
5. cultural conventions switch design
6. design attributes for auditory displays
7. ergonomics of pilot seating
8. excessive cognitive effort
9. how to ensure that labels are readable
10. how to improve situation awareness
11. how to provide unambiguous feedback
12. minimal mental processing
13. needs too much attention
14. preventing instrument reading errors
15. proper use of red and amber on displays
16. suitable menu navigation methods

Table 5.3. Sample of 16 query topics representing intangible concepts.

Adjudication was performed to determine the relevancy of the documents retrieved by the Baseline and Enhanced search engines. The results were collated and F-measure* was calculated for each of the 30 query topics (i.e., 14 query topics in the Tangible Sample Set plus 16 query topics from the Intangible Sample Set). See Appendix F for the individual recall*, precision, and F-measure* calculations for each of the 30 query topics.

5.2.1. Baseline vs. Enhanced Search Performance

The Baseline search engine performed with an average recall* of 0.30, average precision of 0.77, and average F-measure* of 0.41, as shown in Table 5.4. Because the Enhanced search engine always returned all the documents returned by the Baseline for a given query topic and these documents were assumed to be relevant, the precision for the Baseline search engine was always either 1.00 or 0.00. The precision value of 0.00 occurred when the Baseline search engine returned no documents for a given query topic but the Enhanced search engine returned at least 1 relevant document for that same query topic. The recall* value varied based on the number of additional relevant documents that the Enhanced search engine returned.

The Enhanced search engine performed with an average recall* of 1.00, average precision of 0.78, and average F-measure* of 0.85, as shown in Table 5.4.
When a difference in performance existed between the two search engines, the set of documents returned by the Baseline was always a proper subset of those returned by the Enhanced search engine; the recall* of the Enhanced search engine was therefore always determined by the number of relevant documents it returned (i.e., there were no opportunities for additional relevant documents to be identified that were not returned by the Enhanced search engine), and so its recall* value was always 1.00. While there were 7 query topics in the two sample sets for which the Baseline search engine returned 0 documents, the Enhanced search engine always returned 2 or more documents for every query topic in the sample sets. The precision value varied based on the number of additional irrelevant documents that the Enhanced search engine returned.

Search Engine   Recall*   Precision   F-measure*
Baseline        0.30      0.77        0.41
Enhanced        1.00      0.78        0.85

Table 5.4. Average performance measures for the Baseline and Enhanced search engines.

The difference in recall* between the Baseline search engine, with an average recall* of 0.30, and the Enhanced search engine, with an average recall* of 1.00, was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. The difference in the F-measure* value between the Baseline search engine, with an average F-measure* of 0.41, and the Enhanced search engine, with an average F-measure* of 0.85, was also determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. See Appendix G, sections G.1. and G.2., for the t-Test calculations used to determine this statistical significance.

5.2.2. Tangible vs. Intangible Concepts

For the 14 query topics that represented tangible concepts, the Baseline search engine performed with an average recall* of 0.39, average precision of 0.86, and average F-measure* of 0.51. For the 16 query topics that represented intangible concepts, the Baseline search engine performed with an average recall* of 0.22, average precision of 0.69, and average F-measure* of 0.32. The performance measures for the Tangible and Intangible Sample Sets for the Baseline search engine are shown in Table 5.5.

Query Topic Type      Recall*   Precision   F-measure*
Tangible Concepts     0.39      0.86        0.51
Intangible Concepts   0.22      0.69        0.32

Table 5.5. Average performance measures of the Baseline search engine for query topics representing tangible and intangible concepts.

For the 14 query topics that represented tangible concepts, the Enhanced search engine performed with an average recall* of 1.00, average precision of 0.67, and average F-measure* of 0.78. For the 16 query topics that represented intangible concepts, the Enhanced search engine performed with an average recall* of 1.00, average precision of 0.88, and average F-measure* of 0.92. The performance measures for the Tangible and Intangible Sample Sets for the Enhanced search engine are shown in Table 5.6.

Query Topic Type      Recall*   Precision   F-measure*
Tangible Concepts     1.00      0.67        0.78
Intangible Concepts   1.00      0.88        0.92

Table 5.6. Average performance measures of the Enhanced search engine for query topics representing tangible and intangible concepts.

A Single Factor ANOVA calculation was used to determine whether a performance difference existed between the 14 query topics that represented tangible concepts and the 16 query topics that represented intangible concepts. For the Baseline search engine, it was determined that there was no difference in performance between the query topics that represented tangible concepts and the query topics that represented intangible concepts. However, a difference in performance was found for the Enhanced search engine: the difference in performance of the Enhanced search engine between query topics that represent tangible concepts and query topics that represent intangible concepts was determined to be statistically significant at an alpha of 0.05. See Appendix G, section G.3., for the ANOVA calculations used to determine this statistical significance.
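The dissertation does not state which statistical software was used for the Two Sample t-Tests and the Single Factor ANOVA reported above; the following SciPy-based sketch is only an illustration of how comparable tests could be run. The arrays are placeholders, not the experimental data tabulated in Appendix F.

```python
import numpy as np
from scipy import stats

# Placeholder per-query-topic F-measure* values (illustrative only):
baseline_f = np.array([0.40, 0.35, 0.55, 0.30])
enhanced_f = np.array([0.85, 0.80, 0.90, 0.88])

# Two Sample t-Test comparing Baseline and Enhanced performance
t_stat, p_two_tailed = stats.ttest_ind(baseline_f, enhanced_f)
p_one_tailed = p_two_tailed / 2  # directional (one-tailed) interpretation

# Single Factor ANOVA comparing tangible vs. intangible topics on the Enhanced engine
tangible_f = np.array([0.78, 0.70, 0.82, 0.75])
intangible_f = np.array([0.92, 0.90, 0.95, 0.89])
f_stat, p_anova = stats.f_oneway(tangible_f, intangible_f)

print(p_two_tailed, p_one_tailed, p_anova)
```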
5.3. Sensitivity Analysis For Impact Of Unfound Relevant Documents

A sensitivity analysis was performed at three different levels to assess the potential impact of unfound relevant documents. This analysis was conducted using the sample of 30 query topics (i.e., 14 query topics in the Tangible Concepts Query Sample Set plus 16 query topics from the Intangible Concepts Query Sample Set).

5.3.1. Level 1 Sensitivity To Unfound Relevant Documents (0.25X)

At Sensitivity Level 1, in which the unfound relevant documents were assumed to be a quarter of the number of relevant documents found by the Baseline and Enhanced search engines, the average recall* of the Baseline was 0.22 and the average recall* of the Enhanced was 0.78, as presented in Table 5.7. The difference in performance in the recall* value between the Baseline and Enhanced search engines was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn from the original recall* calculations.

Average Recall*
Search Engine   Original   Level 1 (0.25X)   Level 2 (2X)   Level 3 (10X)
Baseline        0.30       0.22              0.10           0.03
Enhanced        1.00       0.78              0.33           0.09

Table 5.7. Average recall* for the Baseline and Enhanced search engines from the original calculation and at the three sensitivity levels for unfound documents.

Under these same conditions at Sensitivity Level 1, the average F-measure* of the Baseline was 0.33 and the average F-measure* of the Enhanced was 0.75, as presented in Table 5.8. The difference in performance in the F-measure* value between the Baseline and Enhanced search engines was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn from the original F-measure* calculations.

See Appendix I, section I.1., for the data and t-Test calculations used to determine the statistical significance of recall* and F-measure* at Sensitivity Level 1 for unfound documents.

Average F-measure*
Search Engine   Original   Level 1 (0.25X)   Level 2 (2X)   Level 3 (10X)
Baseline        0.41       0.33              0.17           0.05
Enhanced        0.85       0.75              0.45           0.16

Table 5.8. Average F-measure* for the Baseline and Enhanced search engines from the original calculation and at the three sensitivity levels for unfound documents.

5.3.2. Level 2 Sensitivity To Unfound Relevant Documents (2X)

At Sensitivity Level 2, in which the unfound relevant documents were assumed to be double the number of relevant documents found by the Baseline and Enhanced search engines, the average recall* of the Baseline was 0.10 and the average recall* of the Enhanced was 0.33, as presented in Table 5.7.
The difference in performance in the recall* value between the Baseline and Enhanced search engines was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn from the original recall* calculations.

Under these same conditions at Sensitivity Level 2, the average F-measure* of the Baseline was 0.17 and the average F-measure* of the Enhanced was 0.45, as presented in Table 5.8. The difference in performance in the F-measure* value between the Baseline and Enhanced search engines was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn from the original F-measure* calculations.

See Appendix I, section I.2., for the data and t-Test calculations used to determine the statistical significance of recall* and F-measure* at Sensitivity Level 2 for unfound documents.

5.3.3. Level 3 Sensitivity To Unfound Relevant Documents (10X)

At Sensitivity Level 3, in which the unfound relevant documents were assumed to be ten times the number of relevant documents found by the Baseline and Enhanced search engines, the average recall* of the Baseline was 0.03 and the average recall* of the Enhanced was 0.09, as presented in Table 5.7. The difference in performance in the recall* value between the Baseline and Enhanced search engines was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn from the original recall* calculations.

Under these same conditions at Sensitivity Level 3, the average F-measure* of the Baseline was 0.05 and the average F-measure* of the Enhanced was 0.16, as presented in Table 5.8. The difference in performance in the F-measure* value between the Baseline and Enhanced search engines was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn from the original F-measure* calculations.

See Appendix I, section I.3., for the data and t-Test calculations used to determine the statistical significance of recall* and F-measure* at Sensitivity Level 3 for unfound documents.

5.4. Sensitivity Analysis For Impact Of Relevancy Assumptions

A sensitivity analysis for documents assumed relevant was performed at four different sensitivity levels to assess the potential impact of non-relevant documents returned by both the Baseline and the Enhanced search engines. This analysis was conducted using the sample of 30 query topics (i.e., 14 query topics in the Tangible Concepts Query Sample Set plus 16 query topics from the Intangible Concepts Query Sample Set).

5.4.1. Level 1 Sensitivity To Relevancy Assumptions (25% Non-Relevant)

At Sensitivity Level 1, in which 25% of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant, the average recall* of the Baseline was 0.25 and the average recall* of the Enhanced was 1.00, as presented in Table 5.9. The difference in performance in the recall* value between the Baseline and Enhanced search engines was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn from the original recall* calculations.
Average Recall*
Search Engine   Original   Level 1 (25%)   Level 2 (50%)   Level 3 (75%)   Level 4 (Overlap with Google Desktop)
Baseline        0.29       0.25            0.20            0.09            0.21
Enhanced        1.00       1.00            1.00            1.00            1.00

Table 5.9. Average recall* for the Baseline and Enhanced search engines from the original calculation and at the four sensitivity levels for assumed relevant documents.

Under these same conditions at Sensitivity Level 1, the average F-measure* of the Baseline was 0.30 and the average F-measure* of the Enhanced was 0.80, as presented in Table 5.10. The difference in performance in the F-measure* value between the Baseline and Enhanced search engines was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn from the original F-measure* calculations.

See Appendix J, section J.1., for the data and t-Test calculations used to determine the statistical significance of recall* and F-measure* at Sensitivity Level 1 for assumed relevant documents.

Average F-measure*
Search Engine   Original   Level 1 (25%)   Level 2 (50%)   Level 3 (75%)   Level 4 (Overlap with Google Desktop)
Baseline        0.40       0.30            0.22            0.09            0.25
Enhanced        0.84       0.80            0.76            0.70            0.77

Table 5.10. Average F-measure* for the Baseline and Enhanced search engines from the original calculation and at the four sensitivity levels for assumed relevant documents.

5.4.2. Level 2 Sensitivity To Relevancy Assumptions (50% Non-Relevant)

At Sensitivity Level 2, in which 50% of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant, the average recall* of the Baseline was 0.20 and the average recall* of the Enhanced was 1.00, as presented in Table 5.9. The difference in performance in the recall* value between the Baseline and Enhanced search engines was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn from the original recall* calculations.

Under these same conditions at Sensitivity Level 2, the average F-measure* of the Baseline was 0.22 and the average F-measure* of the Enhanced was 0.76, as presented in Table 5.10. The difference in performance in the F-measure* value between the Baseline and Enhanced search engines was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn from the original F-measure* calculations.

See Appendix J, section J.2., for the data and t-Test calculations used to determine the statistical significance of recall* and F-measure* at Sensitivity Level 2 for assumed relevant documents.

5.4.3. Level 3 Sensitivity To Relevancy Assumptions (75% Non-Relevant)

At Sensitivity Level 3, in which 75% of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant, the average recall* of the Baseline was 0.09 and the average recall* of the Enhanced was 1.00, as presented in Table 5.9. The difference in performance in the recall* value between the Baseline and Enhanced search engines was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn from the original recall* calculations.

Under these same conditions at Sensitivity Level 3, the average F-measure* of the Baseline was 0.09 and the average F-measure* of the Enhanced was 0.70, as presented in Table 5.10.
The difference in performance in the F-measure* value between the Baseline and Enhanced search engines was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn from the original F-measure* calculations.

See Appendix J, section J.3., for the data and t-Test calculations used to determine the statistical significance of recall* and F-measure* at Sensitivity Level 3 for assumed relevant documents.

5.4.4. Level 4 Sensitivity To Relevancy Assumptions (Overlap With Google Desktop)

At Sensitivity Level 4, in which the estimates of documents that may be assumed relevant were generated by determining the number of documents returned by the Google Desktop search engine that overlapped with the documents returned by the Baseline and the Enhanced search engines, the average recall* of the Baseline was 0.21 and the average recall* of the Enhanced was 1.00, as presented in Table 5.9. The difference in performance in the recall* value between the Baseline and Enhanced search engines was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn from the original recall* calculations.

Under these same conditions at Sensitivity Level 4, the average F-measure* of the Baseline was 0.25 and the average F-measure* of the Enhanced was 0.77, as presented in Table 5.10. The difference in performance in the F-measure* value between the Baseline and Enhanced search engines was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. This is the same conclusion that was drawn from the original F-measure* calculations.

See Appendix J, section J.4., for the data and t-Test calculations used to determine the statistical significance of recall* and F-measure* at Sensitivity Level 4 for assumed relevant documents.

5.5. Query Topic Outliers

Two query topics in the full set of query topics had an extremely large number of non-overlapping documents returned by the Baseline and Enhanced search engines as compared to the other query topics (i.e., the Enhanced search engine returned more than 100 documents that the Baseline search engine did not). To better understand why these query topics were outliers, an informal analysis was conducted.

The first outlier query topic was "TCAS warnings" in the tangible concepts query topic set. The number of documents returned by the Baseline and Enhanced search engines for this query topic is presented in Table 5.11. Information relevant to this query topic discusses design attributes of effective warnings for traffic and collision avoidance systems or other similar systems. The design attributes necessary for effective warnings are discussed extensively in the Design CoPilot™ document collection. From partial adjudication of the documents returned only by the Enhanced search engine (i.e., the differences), it is evident that the Enhanced search engine found relevant documents that the Baseline missed. However, it should be noted that the Baseline search engine returned a large number of documents (i.e., 96 documents); in fact, it was one of the largest document result sets returned by the Baseline. The initial analysis suggests that the large difference between the number of documents returned by the Baseline and Enhanced search engines may be a function of proportion.
The average proportional difference between the documents returned by the Baseline and the Enhanced is 4.16 (i.e., when a difference exists, the Enhanced search engine on average returned 4.16 times the number of documents returned by the Baseline search engine).

Outlier Query Topic (QT)              Baseline   Enhanced   Difference
TCAS warnings                         96         244        148
easily understandable system status   9          133        124

Table 5.11. Documents returned by the Baseline and Enhanced search engines for the two outlier query topics.

The second outlier query topic was "easily understandable system status" in the intangible concepts query topic set. The number of documents returned by the Baseline and Enhanced search engines for this query topic is presented in Table 5.11. Information relevant to this query topic discusses the design attributes related to recognizing and comprehending system status messages. Again, this is a topic that is discussed extensively in the Design CoPilot™ document collection. From partial adjudication of the documents returned by only the Enhanced search engine, it is evident that the Enhanced search engine found relevant documents that the Baseline missed. However, unlike the previous outlier query topic, the Baseline search engine returned a relatively small number of documents (i.e., 9 documents), and, therefore, the large difference cannot be attributed to a function of proportion.

For this query topic, the difference may instead be a function of conceptually redundant wording. Within the Design CoPilot™ document collection, the concept of "understandability" assumes the idea of "easily". In other words, if the information presented or the way a control operates is understandable, it is also easily understandable (i.e., not just understandable after some difficult cognitive effort). Therefore, it may be the case that the words "understandable" and "easily" do not frequently appear together in the document collection. But the query topic in the Baseline search engine requires that the concept be discussed in the text with both the word "easily" and the word "understandable," while the Enhanced search engine also looks for other similar words based on the relational pathways. In this way, the Enhanced search engine was not constrained by the conceptually redundant wording in the query topic itself.

CHAPTER 6. DISCUSSION

Four hypotheses were developed to guide the research and to determine whether or not the Enhanced search engine design positively impacted search performance. Based on the results of the experiment, each of the four hypotheses was found to be true.

When considering the results for each of the research hypotheses, it is important to keep in mind that the precision, recall*, and F-measure* values are only available for the sample of 30 query topics that produced a difference in performance between the Baseline and the Enhanced search engines, and that these represent only a subset of the query topics that may be run. Of the 75 query topics developed for the research, only 54.7% produced a difference in performance. Figure 6.1 is a modified version of Figure 5.2 with additional detail to illustrate that the sample query topics were derived from the population of query topics that produced a difference in performance between the Baseline and Enhanced search engines. In addition, it is also important to realize that the Enhanced system was designed to improve the performance of query topics that contain two or more terms.
Because the enhancement is based on the relationship between terms, query topics that consist of a single term do not have these relationships to exploit.

Figure 6.1. Distribution of all 75 query topics with additional detail to illustrate the population from which the Tangible and Intangible Sample Sets were derived.

6.1. Research Hypothesis #1

The first research hypothesis states that the Enhanced search engine will perform differently than the Baseline search engine. Based on the results derived from running the full set of 75 query topics on both search engines, the difference in counts of documents returned by the Baseline search engine, with an average of 20.2 documents returned, and by the Enhanced search engine, with an average of 43.5 documents returned, was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. The probability that this is not a statistical difference was calculated as 8.91 x 10^-5 for the two-tailed test and 4.46 x 10^-5 for the one-tailed test. It can be confidently concluded that, on average, the Enhanced search engine returns a different number of documents for a given query topic than does the Baseline search engine. Stated another way, the Enhanced search engine performs differently from the Baseline search engine. This is not an unexpected conclusion, but it forms the foundation for the conclusions that can be made from the other hypotheses posited in this research.

6.2. Research Hypothesis #2

The second research hypothesis states that the Enhanced search engine will, on average, have greater recall than the Baseline engine. Based on the results derived from testing the sample of 30 query topics (i.e., both sample sets combined), the difference in recall* between the Baseline search engine, with an average recall* of 0.30, and the Enhanced search engine, with an average recall* of 1.00, was determined to be statistically significant to an alpha less than 0.05 using a Two Sample t-Test. The probability that this is not truly a statistical difference was calculated as 7.66 x 10^-22 for the two-tailed test and 3.83 x 10^-22 for the one-tailed test. Therefore, it can be confidently concluded not only that the recall* performance of the Enhanced search engine differs from that of the Baseline search engine, but also that the Enhanced search engine performs with higher recall* than the Baseline search engine. This conclusion is important because it supports the claims that traditional symbolic pattern-matching search engines miss relevant documents and that enhancing queries using the relational pathways developed in this research can identify at least some of the relevant documents that are missed.
Based on testing the sample of 30 query topics (i.e., both sample sets combined), the difference in the F-measure* values between the Baseline search engine, with an average F-measure* of 0.41, and the Enhanced search engine, with an average F-measure* of 0.85, was determined to be directionally statistically significant to an alpha less than 0.05 using a Two Sample t-Test (i.e., the Enhanced search engine's F-measure* values are statistically greater than the Baseline's F-measure* values). The probability that this is not truly a directional statistical difference was calculated as 6.07 x 10^-10 for the one-tailed test. Therefore, it can be confidently concluded that the Enhanced search engine performs with a greater F-measure* value than the Baseline search engine. Because a greater F-measure* value indicates better performance, it is concluded that the Enhanced search engine performed better than the Baseline search engine in the experiment.

6.4. Research Hypothesis #4

The fourth research hypothesis states that the Enhanced search engine will, on average, perform better on query topics that represent intangible concepts than on query topics that represent tangible concepts, as determined by a higher average F-measure value. Based on the results derived from comparing the performance of the Enhanced search engine on the sample of 14 query topics representing tangible concepts, with an average F-measure* value of 0.78, and the sample of 16 query topics representing intangible concepts, with an average F-measure* value of 0.92, the difference in performance of the Enhanced search engine between query topics that represent tangible concepts and query topics that represent intangible concepts was determined to be statistically significant at an alpha of 0.05 using a Single Factor ANOVA test. The probability that this is not truly a statistical difference was calculated as 0.025. Therefore, it is concluded that the Enhanced search engine performed better on the query topics that represented intangible concepts than on the query topics that represented tangible concepts in the experiment.

6.5. Generalizing the Results to Full Set of Query Topics

As described earlier, the parts of the experiment in which adjudicated results were used (i.e., research hypotheses #2, #3, and #4) are based on samples drawn only from the population of query topics that produce a difference in performance between the Baseline and Enhanced search engines. However, 45.3% of the query topics produced the same results in the Baseline and the Enhanced search engine. Therefore, it is not appropriate to generalize the performance of the Enhanced search engine measured on the samples to the full population of query topics. Based on the data gathered in this experiment, the best estimate of the performance of the Enhanced search engine when it returns the same set of documents that the Baseline search engine returns (i.e., when no additional documents are returned by the Enhanced search engine) is the Baseline's performance derived from the 30 query topic sample set. As determined in the significance calculations, the Baseline did not perform differently on query topics that represented tangible concepts than on query topics that represented intangible concepts.
Because there was no difference in performance, it is appropriate to use the combined performance of the Tangible and Intangible Sample Sets to estimate the average performance of the Baseline search engine. Therefore, if we estimate the Enhanced search engine's performance on the query topics for which it returns the same set of documents as the Baseline search engine using the Baseline's F-measure* value of 0.41, we can calculate an overall estimate of the Enhanced search engine's performance on the full 75 query topic set to be an F-measure* value of 0.64. This value may be a more accurate average performance expectation for the Enhanced search engine on the population of all query topics than an F-measure* of 0.85, because it attempts to take into account the portions of query topics that do and do not have relational pathways that identify additional documents in the search results.

Query Topic Set                             Count   Percentage   F-measure*   (% x F-measure*)
Tangible QT with Performance Difference     24      0.320        0.78         0.250
Intangible QT with Performance Difference   17      0.227        0.92         0.209
Estimate for all QT with Same Performance   34      0.453        0.41         0.186
Estimated F-measure*                                                          0.644

Table 6.1. Calculation to estimate the performance of the Enhanced search engine on the full 75 query topic set.

6.6. Sensitivity Analysis For Impact Of Unfound Relevant Documents

The purpose of this sensitivity analysis was to determine whether or not unfound relevant documents would impact the conclusions drawn from the results of the experiment. Specifically, it looked at whether the differences in recall* and F-measure* between the Baseline and Enhanced search engines would continue to be statistically significant if unfound relevant documents existed in the document collection. The results at each of the levels of sensitivity for both recall* and F-measure* allowed the same conclusions to be drawn as were drawn from the original calculations.

The difference in recall* between the Baseline and the Enhanced search engine at each level of sensitivity was found to remain statistically significant. This means that the second research hypothesis, which states that the Enhanced search engine will, on average, have greater recall than the Baseline engine, would remain true if a large number of unfound relevant documents existed in the collection. The difference in F-measure* between the Baseline and the Enhanced search engine at each level of sensitivity was also found to remain statistically significant. This means that the third and fourth research hypotheses, which state that the Enhanced search engine will, on average, perform better than the Baseline search engine as determined by a higher average F-measure value, would remain true if a large number of unfound relevant documents existed in the collection. Therefore, the sensitivity analysis demonstrates that even with a large number of unfound relevant documents in the collection, the conclusions drawn from the original calculations remain valid.

6.7. Sensitivity Analysis For Impact Of Relevancy Assumptions

The purpose of this sensitivity analysis was to determine whether the documents that were assumed relevant in the modified pooling method of adjudication would impact the conclusions drawn from the results of the experiment.
Specifically, it looked at whether the differences in recall* and F-measure* between the Baseline and Enhanced search engines would continue to be statistically significant if some of the documents that were returned by both the Baseline and Enhanced search engines were not relevant. Like the previous sensitivity analysis, the results at each of the levels of sensitivity for both recall* and F-measure* allowed the same conclusions to be drawn as were drawn from the original calculations.

The difference in recall* between the Baseline and the Enhanced search engine at each level of sensitivity was found to remain statistically significant. This means that the second research hypothesis, which states that the Enhanced search engine will, on average, have greater recall than the Baseline engine, would remain true if a large portion of the documents that were returned by both the Baseline and Enhanced search engines were not relevant. Similarly, the difference in F-measure* between the Baseline and the Enhanced search engine at each level of sensitivity was also found to remain statistically significant. This means that the third and fourth research hypotheses, which state that the Enhanced search engine will, on average, perform better than the Baseline search engine as determined by a higher average F-measure value, would remain true if a large portion of the documents that were returned by both the Baseline and Enhanced search engines were not relevant.

Therefore, the sensitivity analysis for the impact of the relevancy assumptions demonstrates that even if a large portion of the documents returned by both the Baseline and Enhanced search engines and assumed relevant were in fact non-relevant, the conclusions drawn from the original calculations remain valid.

6.8. Interpreting the Results

The results are promising and provide evidence that the idea of automatic concept-based query expansion using term relational pathways built from a collection-specific association thesaurus may be an effective way to improve search engine performance. With these results in mind, it is important to revisit the original ideas from which the enhanced design was inspired. Some of the main ideas revolved around the automatic development of a conceptual network, whether a complete set of relevant documents could be retrieved by augmenting symbolic search, and the impact of query topic types on search performance.

6.8.1. Conceptual Network

One of the primary ideas that inspired this research was that a conceptual network could be automatically generated through the use of an association thesaurus based on document collection co-occurrence calculations, and that this network would represent a reasonable approximation of the concepts represented in the document collection. Therefore, it is important to consider whether or not concepts are truly represented in the network created by the Enhanced search engine. The relational pathways identified for term pairs from the query topics provide the best look into the network. To get a sense of whether concepts have been captured in the full network, we can ask whether the relational pathways appear to represent relevant aspects of concepts that a human would recognize and agree are relevant to the term pair. Consider character and size, a term pair taken from the following query topic: appropriate size of characters on display.
In the experiment, the following three relational pathways were identified for the term pair:
Pathway #1: character – discriminable – font – size
Pathway #2: character – readable – font – size
Pathway #3: character – text – font – size
The three relational pathways do appear to represent recognizable aspects of a larger concept. They illustrate two types of relationships among the terms. The first type of relationship is illustrated in pathway #1 and pathway #2. These pathways represent the effect that one of the original query terms has on the other query term. In pathway #1, the effect of discriminability is identified. Size affects a character's discriminability (i.e., the ease with which one piece of information, or in this case a character, can be recognized within a group). If characters are too small, they will not be discriminable. Characters that are not discriminable are not of an appropriate size. It is clear that this pathway represents a valid aspect of the overall concept presented by the term pair and, in addition, the query topic in general. The same type of relationship is represented in pathway #2 with the effect of readability. Like discriminability, size affects a character's readability (i.e., the ease with which text or numbers are recognized as words or number sequences). If characters are too small, they will not be readable and, therefore, not of an appropriate size. Again, it is clear that this pathway represents a valid aspect of the overall concept presented by the term pair and, in addition, the query topic in general. The second type of relationship is illustrated in pathway #3. In pathway #3, the pathway represents alternate expressions of a similar concept. The alternate expressions are the terms text and font for the original query term character. These alternate expressions may be of greater or lesser specificity, but in the test document collection, they represent the same concept.[11]
[11] The term font is part of all three pathways #1, #2, and #3 and represents an alternate expression in each of them. Therefore, it is clear that more than one type of relationship may be represented in a pathway.
Both types of relationships are useful in identifying candidate terms with which to expand the query in a way that remains focused on the overall concept of interest. Not all relational pathways identified for term pairs are as easily recognizable as representing a relevant aspect of the overall concept. However, as seen in this example, some relational pathways do present highly recognizable and obviously relevant aspects of the overall concept represented by the term pairs and the overall query topic. This suggests that real concepts, at least in part, are represented in the conceptual network produced automatically using the association thesaurus.

6.8.2. Complete Set of Relevant Documents
The primary motivation for pursuing this approach was the desire to return a complete set of relevant documents. So, the question remains, did the Enhanced search engine return a complete set of relevant documents? The methodology was not designed to identify the complete set of all relevant documents for a given query topic in the collection but rather to determine how many additional relevant documents were returned. As discussed earlier, complete adjudication of a document collection (i.e., identifying all relevant documents in the collection to ensure that no unfound relevant documents exist in the collection) is prohibitively time-consuming.
However, based on an informal assessment during the adjudication process, the collection content expert identified that some relevant documents were missed, not only by the Baseline search engine but also by the Enhanced search engine. This proof-of-concept research work was based on a number of "best reasonable guess" design decisions, and, therefore, this observation is not surprising. (See Appendix A for a list of design parameters chosen.) It is reasonable to expect that further investigation will be necessary to identify and tune various design parameters that impact the recall level of an enhanced search engine.

6.8.3. Query Topic Types
While developing the query topics, there appeared to be a qualitative difference in the types of query topics that were of interest to the users of the Design CoPilot™. There were query topics that represented well-defined, concrete, tangible concepts and query topics that represented more complex, harder-to-define, fuzzy, intangible concepts. Therefore, it seemed of interest to determine if there was a qualitative performance difference between these two types of query topics. In addition, it seemed logical that the less well-defined concepts represented by the intangible query topics would receive a greater benefit from more completely defining the information need using the relational pathways. The experimental results appear to bear this out. The Enhanced search engine did perform better on query topics that represented the intangible concepts. Additional investigation will be necessary to determine how this knowledge may be used to further improve the performance of the Enhanced search engine.

6.9. Impact of Sample Selection
As described in section 4.5.1, the Tangible and Intangible Concept Query Sample Sets were generated by using an arbitrarily selected difference threshold of less than or equal to 50 documents. This was done to restrict the samples to a manageable size so that the adjudication effort required would be feasible. However, one of the ramifications of this decision is that the conclusions are only relevant for query topics with a difference of less than or equal to 50 documents. Because those query topics with large differences in the number of documents returned by the Baseline and Enhanced search engines were not included in the sample, it may be the case that query topics with a large difference count (i.e., greater than 50 documents) perform qualitatively differently than query topics with smaller difference counts. Additional research would be necessary to determine the impact of the magnitude of the difference counts on performance.

6.10. Impact of Search Engine Parameter Values
As mentioned earlier, the values for the various parameters were primarily based on a number of "best reasonable guess" design decisions. It is likely that with tuning, the performance of the Enhanced search engine could be improved. However, in this research, the decision was made to avoid tuning the parameters because the tuning could have undesirable and unknown impact on the generalizability of the Enhanced search engine design for other document collections. Therefore, it was thought inappropriate to tune at this early stage in the research. After the experiment was completed, an informal analysis was performed to get a sense of the sensitivity that two key parameters may have on the performance of the Enhanced search engine. The parameters analyzed included the Jaccard Coefficient Threshold and the Maximum Associated Term Entries Threshold.
6.10.1. Jaccard Coefficient Threshold Value
An informal analysis was conducted by calculating the number of associated terms that would be included in the Association Thesaurus for each of the unique eligible terms that made up the 75 query topics at alternate Jaccard Coefficient Threshold values. The Jaccard Coefficient Threshold used in the experiment was 0.5, so values of 0.4 and 0.6 were analyzed. There were a total of 176 unique eligible terms that did not exceed the Maximum Associated Term Entries Threshold of 275 associated terms. The number of associated terms for each of these 176 terms was calculated for Jaccard Coefficient Threshold values of 0.4, 0.5, and 0.6. Of these, the average difference in the number of associated terms that would be included in the Association Thesaurus for each of the unique eligible terms at thresholds 0.4 and 0.5 was 0.26 terms (i.e., if the threshold value was 0.4, an average of 0.26 additional terms would be added to the thesaurus entry); the average between 0.5 and 0.6 was 0.78 terms (i.e., if the threshold value was 0.6, an average of 0.78 fewer terms would be added to the thesaurus entry); and the average between 0.4 and 0.6 was 1.03 terms. These results suggest that, on average, there would be little impact if the threshold value chosen was 0.4, 0.5, or 0.6.

6.10.2. Maximum Associated Term Entries Threshold Value
An informal analysis was conducted by identifying the relational pathways that would be returned for a term pair at various Maximum Associated Term Entries Threshold values. The term pair used was made up of the terms warnings and understandable.[12] The Maximum Associated Term Entries Threshold used in the experiment was 275 terms, so values of 200, 225, 250, 300, and 325 were analyzed. The relational pathways for each of these threshold values were identified. At the 200-term threshold level, 3 relational pathways were identified; at the 225-, 250-, and 275-term threshold levels, 5 relational pathways were identified; and at the 300- and 325-term threshold levels, 11 relational pathways were identified. The relational pathways at each of these threshold levels are presented in Figure 6.2.
[12] This term pair, "warnings understandable," was presented in an example discussed in Chapter 1 using hypothetical thesaurus entries and relational pathways. In the analysis presented in this section (6.10.2), however, the actual Association Thesaurus entries for the document collection were used and the relational pathways generated are real.
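As a concrete illustration of what such a sweep involves, the following minimal sketch filters a small set of thesaurus terms by candidate Maximum Associated Term Entries Threshold values and reports which terms would be admitted to the conceptual network at each value. The class name, term names, and entry counts are invented for illustration only; the actual pathways obtained for the warnings/understandable pair are shown in Figure 6.2 below.

using System;
using System.Collections.Generic;

// Illustrative sketch only: shows how raising or lowering the Maximum Associated
// Term Entries Threshold changes which terms are eligible for the conceptual
// network. Entry counts below are invented; the experiment used a 275-entry limit.
public static class MaxEntriesThresholdSweep
{
    public static void Main()
    {
        // Hypothetical: number of associated-term entries per thesaurus term.
        var entryCounts = new Dictionary<string, int>
        {
            { "clear", 190 }, { "confusing", 140 }, { "distinct", 230 },
            { "consistent", 290 }, { "alert", 310 }, { "caution", 260 }
        };

        int[] thresholds = { 200, 225, 250, 275, 300, 325 };
        foreach (int threshold in thresholds)
        {
            var eligible = new List<string>();
            foreach (var pair in entryCounts)
                if (pair.Value <= threshold)   // terms with too many entries are excluded
                    eligible.Add(pair.Key);

            Console.WriteLine("MaxAssocThreshold = {0}: {1} eligible terms ({2})",
                threshold, eligible.Count, string.Join(", ", eligible.ToArray()));
        }
        // A larger threshold admits more terms, which in turn can yield additional
        // relational pathways, as illustrated by the actual results in Figure 6.2.
    }
}

In such a sweep, the pathway identification step would be re-run against each filtered network; the sketch above only shows the eligibility filtering that precedes it.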
Maximum Associated Term Entries Threshold Trials

MaxAssocThreshold = 200
[1] understand – clear – awareness – warn
[2] understand – clear – device – warn
[3] understand – confusing – caution – warn

MaxAssocThreshold = 225
[1] understand – clear – annunciate – warn
[2] understand – clear – awareness – warn
[3] understand – clear – device – warn
[4] understand – confusing – caution – warn
[5] understand – distinct – visual – warn

MaxAssocThreshold = 250
[1] understand – clear – annunciate – warn
[2] understand – clear – awareness – warn
[3] understand – clear – device – warn
[4] understand – confusing – caution – warn
[5] understand – distinct – visual – warn

MaxAssocThreshold = 275
[1] understand – clear – annunciate – warn
[2] understand – clear – awareness – warn
[3] understand – clear – device – warn
[4] understand – confusing – caution – warn
[5] understand – distinct – visual – warn

MaxAssocThreshold = 300
[1] understand – clear – alert – warn
[2] understand – clear – annunciate – warn
[3] understand – clear – awareness – warn
[4] understand – clear – device – warn
[5] understand – confusing – caution – warn
[6] understand – consistent – alert – warn
[7] understand – consistent – annunciate – warn
[8] understand – consistent – aural – warn
[9] understand – consistent – caution – warn
[10] understand – distinct – alert – warn
[11] understand – distinct – visual – warn

MaxAssocThreshold = 325
[1] understand – clear – alert – warn
[2] understand – clear – annunciate – warn
[3] understand – clear – awareness – warn
[4] understand – clear – device – warn
[5] understand – confusing – caution – warn
[6] understand – consistent – alert – warn
[7] understand – consistent – annunciate – warn
[8] understand – consistent – aural – warn
[9] understand – consistent – caution – warn
[10] understand – distinct – alert – warn
[11] understand – distinct – visual – warn

Figure 6.2 Relational pathways identified for various values of the Maximum Associated Term Entries Threshold.

Reviewing the relational pathways for this example term pair shows that additional relational pathways identified at threshold values higher than used in the experiment may represent relevant aspects of the overall intended concept of interest without introducing undesirable tangential concepts. For example, the six additional relational pathways introduced at the 300 and 325 terms threshold levels each present recognizable aspects of a larger concept related to ensuring that warnings are understandable. The six additional relational pathways are the following:
Pathway [1]: understand – clear – alert – warn
Pathway [6]: understand – consistent – alert – warn
Pathway [7]: understand – consistent – annunciate – warn
Pathway [8]: understand – consistent – aural – warn
Pathway [9]: understand – consistent – caution – warn
Pathway [10]: understand – distinct – alert – warn
The higher maximum threshold values of both 300 and 325 allow two additional terms to be included in the conceptual network and, therefore, to be eligible for creating relational pathways. These two terms are consistent and alert. Both these terms appear relevant to the larger concept represented by the term pair. For example, consistency plays a large role in promoting the understandability of the elements in the flight deck, and the term alert is often used synonymously with the words warn or warning.
Both these terms have the potential of playing a useful role in expanding the query with the original term pair in a way that would remain focused on the overall concept of interest yet allow additional relevant documents to be retrieved. The results of this informal analysis suggest that the performance of the Enhanced search engine may be improved by using a higher Maximum Associated Term Entries Threshold value for the document collection used in this experiment. These results also suggest the Maximum Associated Term Entries Threshold may be a useful parameter to tune based on the particular characteristics of the document collection on which the Enhanced search engine runs. However, a sufficient number of example term pairs would need to be analyzed and considered to appropriately tune this threshold value.

6.11. Outstanding Issues
While the results are promising, this research was a proof-of-concept test. There are still questions to be answered before such a method could be put into production on a live website. Three of these issues are discussed below.

6.11.1. Impact of Characteristics of the Test Document Collection
The Design CoPilot™ document collection by its nature has some repetitive content. The human factors-related pages in the collection cite passages from the related regulatory and guidance material. In addition, within the regulatory and guidance materials, cross-referencing and discussion about various regulatory excerpts occur. Therefore, the same passage of text may be present in several documents within the document collection. It is difficult to know whether this played a significant role in the performance of the Enhanced search engine. It could have made the conceptual network stronger in that important concepts and connections between terms were given more weight by having an exaggerated frequency. Or, it could have limited the useful connections among terms and provided fewer relational pathways with which to expand the original query. Therefore, an outstanding question is: Would the Enhanced search engine perform in a similar manner on a document collection that is comparably technical but less repetitive? Further investigation is required to determine the answer to this question.

6.11.2. Data Processing and Document Collection Size
The method to automatically create the association thesaurus and generate the conceptual network requires a large amount of data processing. While much of the time-consuming processing can be done ahead of time (i.e., during the pre-query processing stage) and the algorithms used could likely be made more efficient, the data processing required limits the size of the document collection for which this approach may be used given currently available processing capabilities. The matrix size required for calculating the Jaccard Coefficient co-occurrence values for the Design CoPilot™ document collection tested exceeded the performance limitations of MATLAB 7.4 (R2007). Because MATLAB must hold the matrix in memory, the maximum matrix size (or array size) is determined by the amount of contiguous memory made available to MATLAB. For installations of MATLAB on 32-bit Windows operating systems, the maximum number of elements in a real-numbered array ranges from approximately 155 × 10^6 to approximately 200 × 10^6 elements.[13]
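As context for the workaround described next, the following back-of-the-envelope sketch checks whether a segment-by-term matrix fits within an assumed per-array element budget and, if not, how many column blocks a split would require. The matrix dimensions and the 150 million element budget are invented placeholders for illustration, not measurements or parameters from this study.

using System;

// Illustrative sketch only: rough check of whether a segment-by-term matrix fits
// within an assumed per-array element budget, and if not, how many column blocks
// it would need to be split into. All figures below are hypothetical.
public static class MatrixSizeEstimate
{
    public static void Main()
    {
        long segments = 400000;          // hypothetical number of document segments (rows)
        long terms = 3000;               // hypothetical number of eligible terms (columns)
        long elementBudget = 150000000;  // assumed safe element count for one in-memory array

        long totalElements = segments * terms;
        Console.WriteLine("Full matrix: {0:N0} elements", totalElements);

        if (totalElements <= elementBudget)
        {
            Console.WriteLine("Fits within the assumed budget; no splitting needed.");
            return;
        }

        // Split by columns so each sub-matrix keeps complete term vectors for its columns.
        long columnsPerBlock = elementBudget / segments;
        long blocks = (terms + columnsPerBlock - 1) / columnsPerBlock;  // ceiling division
        Console.WriteLine("Split into {0} column blocks of up to {1} terms each.",
            blocks, columnsPerBlock);
    }
}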
To overcome this limitation, a workaround was developed to break up the matrix into sufficiently small sub-matrices, send each sub-matrix into MATLAB to perform the required co-occurrence calculations, and then reassemble the data generated in MATLAB back into the full matrix. While it is likely the workaround process developed could be easily scaled up to handle document collections that are double the size of the test collection (i.e., to handle document collections of approximately 6000 documents), it is much less certain how much further the scaling of the process could be taken and remain feasible.
[13] This information was taken from an article about MATLAB on the MathWorks website titled What is the maximum matrix size for each platform? Retrieved March 17, 2013, from http://www.mathworks.com/support/solutions/en/data/1IHYHFZ/index.html
Therefore, an outstanding question is: What are the size limitations on the document collection? Further investigation is required to determine the answer to this question.

6.11.3. Performance Comparison
Finally, another outstanding issue is related to how the performance of the Enhanced search engine compares to other third-party search engines. The methodology used in this experiment allowed us to conclude that the Enhanced functionality improved the performance of the Baseline search engine. However, it left an open question about the true level of performance that can be expected from the Enhanced search engine and how this compares to other search algorithms in use today. This question is partially addressed by including the results generated by the Google Desktop search engine in the sensitivity analysis to assess the impact of the relevancy assumptions made in the modified pooling adjudication method. However, additional work is necessary to draw conclusions about the true level of performance of the Enhanced search engine. Two paths of follow-on research are under consideration. In the first path, Google Desktop (or some other third-party search engine) may be used to run the sample of 30 query topics on the document collection, and the unique documents returned for each query would be manually adjudicated. After the adjudication, the results of the third-party search engine would be compared to the performance results of the Baseline and Enhanced search engines described in this research. In the second path of follow-on research, an alternate document collection that possesses the appropriate collection characteristics and has a set of predefined query topics may be used to assess the performance of the Baseline and the Enhanced search engines. The Baseline and Enhanced search engines would be run on this alternate document collection and their performance measured. Their performance would then be compared to other third-party search engines that have also been run on that particular alternate document collection. Such a document collection would need to be identified and obtained before this path of follow-on research could be performed.

CHAPTER 7. CONCLUSION
While search engines provide an important tool to help users locate relevant information contained within a website, the use of natural language in both the representation of the information need and the concepts in the documents presents challenges in designing effective search engines. One promising method to overcome these challenges is using concept-based query expansion to augment traditional symbolic approaches.
This research described an approach to concept-based query expansion that uses a network-based method to automatically create a reasonable approximation of the concepts represented in the document collection, and the relationships among them, using an association thesaurus created for the target document collection. Even though not all query topics have associated relational pathways with which to expand the original query, the experiment demonstrated that the Enhanced search engine performs better than the Baseline search engine. In addition, the results suggest that real concepts, at least in part, are represented in the conceptual network produced automatically using the association thesaurus. Therefore, this approach has the potential for extension to a variety of other applications in which mapping the verbal representation of a concept to the terms used to express it within a set of documents is required. While there are still some important questions to be answered before such a method could be put into production on a live website, the results of this experiment are encouraging. The results suggest that, on a bounded, medium-sized document collection containing documents focused on a single technical subject domain, the enhancement will allow users to identify a significantly greater portion of relevant documents to fill their information needs.

APPENDIX A. SEARCH ENGINE STRUCTURE AND DESIGN PARAMETERS
This appendix provides a high-level illustration of the structure of the Baseline and Enhanced search engines and identifies significant features and design parameters used. Much of the content of this appendix is a duplicate of the information presented in the body of the dissertation; however, this appendix was created to facilitate replicating or modifying the design of the Baseline and Enhanced search engines by presenting the relevant search engine design structure and parameters together.

A.1. Structure
The following sections describe the high-level structure and major components of the Baseline and Enhanced search engines.

A.1.1. Baseline Search Engine Structure
The Baseline search engine was composed of an Index Module, a Search Module, and an Index data store. The Index Module was part of the pre-search processes that occur to prepare the system for use. The Search Module was part of the post-search processes that occur to allow users to enter their desired search terms, the system to process the query, and the system to return the search results to the user. The high-level structure of the Baseline search engine is illustrated in Figure A.1.

Figure A.1 High-level structure of the Baseline search engine. (The original block diagram shows the post-search processes, consisting of the User Interface, where the user enters search terms and is presented the search results, and the Search Module, which parses the user input, builds the query, runs the query, and organizes and formats the results; the Index data store; and the pre-search processes of the Index Module, which acquires content, analyzes each document, builds the document record, and indexes the document.)

A.1.2. Enhanced Search Engine Structure
The Baseline search engine was used as the core of the Enhanced search engine. Modules and components necessary for performing the query expansion task were added on to the core structure.
Therefore, in addition to the Baseline's Index Module, Search Module, and Index data store, the Enhanced search engine also contained modules to Build Association Thesaurus, Generate Conceptual Network, and Expand Query as well as the necessary components for the Association Thesaurus data store and the Conceptual Network data store. The Index Module, Build Association Thesaurus module, and Generate Conceptual Network module were part of the pre-search processes that occur to prepare the system. The Search Module and Expand Query module were part of the post-search processes to process the user's query. The high-level structure of the Enhanced search engine is illustrated in Figure A.2.

Figure A.2 High-level structure of the Enhanced search engine. (The original block diagram shows the post-search processes, consisting of the User Interface, where the user enters search terms and is presented the search results, the Expand Query module, which identifies relational pathways and collects expansion terms, and the Search Module, which parses the user input, builds the expanded query, runs the query, and organizes and formats the results; the Index, Association Thesaurus, and Conceptual Network data stores; and the pre-search processes, consisting of the Index Module (acquire content, analyze document, build document record, index document), the Build Association Thesaurus module (create document segments, identify eligible terms, create matrix, identify term pairs, calculate co-occurrence values, store Association Thesaurus entries), and the Generate Conceptual Network module (create links between terms, store conceptual network).)

A.2. Design Parameters
The following design parameters were used in the design and development of the Baseline and Enhanced search engines. When no best current practices dictated an appropriate value for a given parameter, the values were chosen based on educated best guesses.

A.2.1. Technology
The core of this search engine was built using the open-source Lucene.NET search engine development library to perform the indexing, data storage, and retrieval functions. The features for the baseline search engine were chosen based on current best practices and therefore included word stemming, stop word removal, HTML code filters (to ignore text inside HTML tags), and proximity search. The search engine was developed in a Visual Studio 2008 development platform using C# and the Lucene.NET version 2.9.2 development library to build an ASPX website containing the document collection and the search engine.

A.2.2. Indexing Parameters
The following parameters were used in the indexing process for both the Baseline and the Enhanced search engines.

A.2.2.1. HTML Code Filter
The built-in Lucene.NET analyzer called HTMLStripCharFilter was used to ignore all text contained inside HTML tags while indexing.

A.2.2.2. Stopwords
The following stopword list was used: a, an, and, are, as, at, be, been, but, by, for, if, in, into, is, it, no, of, on, such, that, the, their, then, there, these, they, this, to, was, will, with.
The stopword list used differs from the ENGLISH_STOP_WORD_SET that is built into Lucene.NET in the following ways:
• The words "or" and "not" were removed from the list.
• The word "been" was added to the list.

A.2.2.3. Word Stemming
Word stemming was performed using the Lucene.NET built-in Snowball Analyzer.

A.2.3. Association Thesaurus Processing Parameters
The Build Association Thesaurus module automatically builds the Association Thesaurus using the document collection.
The module accomplished this by manipulating a Term-Document matrix composed of terms and their occurrences in the documents of the collection to determine the level of association between term pairs. To determine the association, overlapping document segments were analyzed to determine the frequency of eligible terms, and the resulting data was used to calculate co-occurrence values.

A.2.3.1. Overlapping Document Segments
The term vectors that made up the Term-Document matrix were defined by the number of occurrences (i.e., frequency) of the term within document segments rather than within full documents. Document segments (i.e., a moving shingled window) were created from each full document. The document segments were 200 words long, and each segment overlapped the previous and next segment by 100 words (i.e., the shingle increment). The number of document segments created from a full document varied from one segment to several hundred segments, depending on the length of the full document. The Term-Document matrix was, therefore, constructed so that the terms were represented by the columns and the document segments were represented by the rows. Using document segments rather than the full documents controlled for the variability in length of documents in the collection and ensured that only the terms in close proximity (i.e., within 200 words) to one another were assumed to be similar to one another. The document segment size and the shingle increment were chosen based on an informal average paragraph size. It was observed that a single, although possibly complex, concept is often contained in a paragraph. Because of this, the words used in the beginning of the paragraph are likely topically related to the words used at the end of the paragraph. Therefore, the average number of words contained in a paragraph may be a reasonable guide to the size of a chunk of text in which all the words are semantically related. Assuming that paragraphs typically range from 100 to 200 words, a document segment size of 200 words and a shingle increment of 100 words were chosen. These values were chosen early in the design process and no tuning of these values was performed.

A.2.3.2. Eligible Term Identification
Not all terms present in the document collection were included in the Association Thesaurus. Only stemmed, content-bearing words (i.e., stop words were excluded) present in the document collection with an appropriate level of frequency were identified as eligible for inclusion in the Association Thesaurus. Therefore, the terms needed to occur frequently enough in the document collection for co-occurrence calculations to yield useful information but not so frequently that their presence ceased to be a useful discriminator of relevance. Eligible terms were those that had a minimum frequency of 50 in the overall document collection and did not appear in more than 9999 document segments. These eligible-term parameters were not tuned but were chosen at the beginning of the design process based on reasonable initial guesses as to appropriate starting values.

A.2.3.3. Co-Occurrence Calculations
The co-occurrence calculations to determine the level of association (or similarity) between term pairs were conducted using the Jaccard Coefficient. The Jaccard Coefficient is based on an Intersection Over Union (IOU) calculation to normalize and measure the amount of overlap between two term vectors.
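To make the calculation concrete, the following minimal sketch (illustrative only, with a hypothetical class name and invented occurrence data; not the code used in the experiment) computes the Jaccard Coefficient for a term pair from the sets of document segments in which each term occurs and applies the 0.5 inclusion threshold discussed next.

using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch only: computes the Jaccard Coefficient (Intersection Over
// Union) for a pair of terms, where each term is represented by the set of
// document-segment IDs in which it occurs. Names and data are hypothetical.
public static class JaccardSketch
{
    public static double JaccardCoefficient(HashSet<int> segmentsWithTermA,
                                            HashSet<int> segmentsWithTermB)
    {
        // |A intersect B| : segments containing both terms
        int intersection = segmentsWithTermA.Intersect(segmentsWithTermB).Count();
        // |A union B| : segments containing either term
        int union = segmentsWithTermA.Union(segmentsWithTermB).Count();
        return union == 0 ? 0.0 : (double)intersection / union;
    }

    public static void Main()
    {
        // Hypothetical segment-occurrence sets for two stemmed terms.
        var warn = new HashSet<int> { 1, 2, 5, 8, 9 };
        var caution = new HashSet<int> { 2, 5, 8, 9, 12 };

        double j = JaccardCoefficient(warn, caution);
        Console.WriteLine("Jaccard = {0:F3}", j);   // 4 shared / 6 total = 0.667

        // The experiment used the value only as a binary include/exclude decision.
        const double threshold = 0.5;
        Console.WriteLine("Included as associated terms: {0}", j > threshold);
    }
}

In the actual system the occurrence sets would come from the segment-by-term matrix described above; the sketch simply isolates the IOU arithmetic.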
The Jaccard Coefficient value of a term pair was used only to make the binary decision of inclusion or exclusion of a term pair in the Association Thesaurus. Those term pairs with a Jaccard Coefficient value greater than 0.5 were included in the Association Thesaurus as associated term entries for each other. This minimum threshold value of 0.5 was chosen early in the design process based on the idea that a value near the mid-point of possible Jaccard Coefficient values (i.e., values between 0 and 1) would provide a reasonable starting point; no tuning was performed to improve this value.

A.2.4. Conceptual Network and Relational Pathway Parameters
The Generate Conceptual Network module used the entries in the Association Thesaurus to generate the conceptual network. Each term in the Association Thesaurus represented a node. Child nodes for a term were generated from all of the associated terms defined in its thesaurus entry to create a term cluster. To form the full conceptual network, each term cluster generated from the thesaurus entry was linked to the other term clusters using shared terms. The entire conceptual network was developed by continuing this term cluster linking process using all shared terms defined through the relationships defined by the associated term entries in the Association Thesaurus. Only terms likely to be useful in discriminating the relevance of a document were included in the conceptual network. A maximum threshold was used to restrict the number of associated terms a target term may have and still be eligible for inclusion in the conceptual network. Terms that had more than 275 entries were considered to be too frequently occurring to be able to offer a useful discrimination and were ignored during the process of creating the conceptual network. Therefore, any term included in the conceptual network had 275 or fewer associated terms included in its Association Thesaurus entry. This threshold value of 275 entries was chosen early in the design process based on reviewing several example term pairs and their resulting pathways. No tuning was done after this early design decision was made. In this way, the conceptual network was composed of all terms with 275 or fewer entries in the Association Thesaurus, and links between terms were only based on whether or not shared terms existed in the individual term clusters (i.e., there were no other parameters considered when forming the links between nodes).

A.2.4.1. Relational Pathways
To minimize search processing time experienced by the users, the relational pathways were identified during the pre-search process stage in the Generate Conceptual Network module. All possible term pairs were identified from the terms contained in the Association Thesaurus. Next, the relational pathways for each term pair were identified and stored for fast retrieval at search time. The relational pathways identified were 3, 4, or 5 terms long. To identify the relational pathways between a pair of terms, the module began with the first term of the term pair and traversed the conceptual network looking for the second term of the term pair using a breadth-first search to a maximum depth of 4 terms. When the second term was found, the intervening terms were captured to form a relational pathway.
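The sketch below illustrates this pathway search in simplified form: it performs a breadth-first search over a small, hypothetical conceptual network (an adjacency list of associated terms), collects every pathway found at the shallowest level at which the second term appears, and then stops, so all returned pathways share the same length, up to a 5-term limit. The class, method names, and network data are invented for illustration and are not the module's actual implementation.

using System;
using System.Collections.Generic;

// Illustrative sketch only: finds relational pathways between two query terms by
// breadth-first search over a conceptual network, completing the current search
// level when the second term is found and then stopping.
public static class PathwaySketch
{
    public static List<List<string>> FindPathways(
        Dictionary<string, List<string>> network, string start, string target,
        int maxPathwayTerms)
    {
        var found = new List<List<string>>();
        var frontier = new List<List<string>> { new List<string> { start } };

        // Each iteration extends every partial pathway by one term (one BFS level).
        while (frontier.Count > 0 && frontier[0].Count < maxPathwayTerms)
        {
            var next = new List<List<string>>();
            foreach (var path in frontier)
            {
                List<string> neighbours;
                if (!network.TryGetValue(path[path.Count - 1], out neighbours)) continue;
                foreach (var term in neighbours)
                {
                    if (path.Contains(term)) continue;        // no repeated terms in a pathway
                    var extended = new List<string>(path) { term };
                    if (term == target) found.Add(extended);  // collected after the level finishes
                    else next.Add(extended);
                }
            }
            if (found.Count > 0) break;   // stop before moving to the next level of depth
            frontier = next;
        }
        return found;
    }

    public static void Main()
    {
        // Hypothetical associated-term entries (stemmed terms).
        var network = new Dictionary<string, List<string>>
        {
            { "understand", new List<string> { "clear", "confusing" } },
            { "clear",      new List<string> { "understand", "awareness", "device" } },
            { "confusing",  new List<string> { "understand", "caution" } },
            { "awareness",  new List<string> { "clear", "warn" } },
            { "device",     new List<string> { "clear", "warn" } },
            { "caution",    new List<string> { "confusing", "warn" } },
            { "warn",       new List<string> { "awareness", "device", "caution" } }
        };

        foreach (var path in FindPathways(network, "understand", "warn", 5))
            Console.WriteLine(string.Join(" - ", path.ToArray()));
        // Prints three 4-term pathways, e.g. understand - clear - awareness - warn
    }
}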
All the relational pathways for a given term pair were the same length. Once a relational pathway was found, the search on that level was completed and then stopped before it moved to the next level of depth. A.2.5. Search Parameters The search is conducted by parsing the user’s input to remove stop words and perform word stemming. The query then is performed as a proximity search where documents are retrieved when the terms occur within 100 words of one another. A proximity threshold of 100 words is a fairly restrictive value. Such a restrictive value was chosen for two reasons. One, technical content such as that represented by the document collection used in this research tends to be written in a very focused and direct manner, and it was believed that if all the terms of the query were found within three to four sentences of one another, the passage would likely be relevant to the concept represented by the query topic. Conversely, terms present outside of that proximity window may be more likely addressing different ideas. Two, a frequent issue with query expansion methods is the loss of precision. Therefore, using a proximity value that was more restrictive should mitigate some of the undesirable effects that cause the loss of precision. 221 APPENDIX B. BUILDING BOOLEAN EXPRESSIONS FROM RELATIONAL PATHWAYS This appendix describes how the Boolean expressions are built from the relational pathways. B.1. Building Boolean Phrase from Relational Pathway The relational pathways built for each term pair of the query topic may be comprised of 3, 4, or 5 terms. The query expansion terms are selected in such a way that the resulting Boolean phrase requires at least one of the original terms to be present in the document. How a Boolean phrase is constructed from a relational pathway is illustrated in the following three figures: Pathway length = 3 Relational Pathway A Boolean Query Phrase (A AND B) AND C) B C Visual Representation OR ( (A OR ( (B Figure B.1 AND C) Boolean Phrase for Relational Pathway Length of 3. In the visual representation, the solid circles on the pathway indicate the terms that must be present in the document for the document to be identified as relevant. 222 Pathway length = 4 Relational Pathway A Boolean Query Phrase (A D C B Visual Representation AND D) AND B AND C) C AND D) OR ( (A OR ( (B Figure B.2 AND Boolean Phrase for Relational Pathway Length of 4. In the visual representation, the solid circles on the pathway indicate the terms that must be present in the document for the document to be identified as relevant. Pathway length = 5 Relational Pathway A B Boolean Query Phrase (A C D E Visual Representation AND E) AND C AND D) AND C AND E) AND B AND D) AND D AND E) OR ( (A OR ( (B OR (A OR (B Figure B.3 Boolean Phrase for Relational Pathway Length of 5. In the visual representation, the solid circles on the pathway indicate the terms that must be present in the document for the document to be identified as relevant. 223 B.2. Building Query Expression for Query Topic For each pair of original query terms, the search engine attempts to identify one or more relational pathways between the term pair. The Boolean phrase for each relational pathway is generated, and all the resulting Boolean phrases are combined into a single full Boolean query expression. B.2.1. 
B.2.1. Combining Multiple Relational Pathways for Term Pair
When more than one relational pathway is found for a term pair, the Boolean phrases for each relational pathway are generated and then combined with OR operators to form the full Boolean phrase for this term pair. Using OR operators to combine the phrases means that if the Boolean phrase for any of the relational pathways is true, then the entire Boolean phrase is true. This is appropriate because a concept represented by only one of the relational pathways needs to be present in the document for the term pair to be represented. Figure B.4 illustrates the full Boolean phrase that is generated from a term pair for which two relational pathways were identified. This figure assumes that each of the two relational pathways found between the original query terms A and E is 5 terms long. Because a breadth-first search is used, all relational paths found will be of the same length. Once a path is found, the search on that level is completed and then stops before it moves to the next level of depth.

Figure B.4 Boolean phrase for term pair A and E constructed by combining individual Boolean phrases for the term pair's two relational pathways. (Figure not reproduced; the two pathway phrases are joined with an OR operator.)

B.2.2. Creating Full Boolean Expression for Query Topic
For query topics that include only two terms (i.e., a single pair of terms), the Boolean phrase generated using the method for combining multiple relational pathways for the single term pair generates the full Boolean query expression. However, if a query topic is composed of more than two terms, such as warnings quickly understandable, then there are three term pairs that may result in relational pathways. The three term pairs in this example are the following:
1. warnings quickly
2. warnings understandable
3. quickly understandable
The Boolean phrases generated for the relational pathways found for each of the term pairs must be combined to form the full Boolean query expression. To do this, the Boolean phrases generated for each term pair are combined using AND operators. The resulting Boolean expression, therefore, requires that each term pair be represented by at least one of its relational pathways. Figure B.5 illustrates the full Boolean phrase that is generated from a query topic composed of three term pairs for which relational pathways were identified for each term pair.

Figure B.5 Boolean phrase for query topic A B C constructed by combining Boolean phrases for each of the term pairs. (Figure not reproduced; the three term-pair phrases are joined with AND operators.)

Not all term pairs have relational pathways. Therefore, provision is made to ensure that all terms present in the original query topic are represented in the final full Boolean expression. During the process of identifying relational pathways and constructing the Boolean phrases, a record is kept for each term that is represented in the Boolean expression. Each original query topic term that is not represented in any of the relational pathways is added to the full Boolean expression using the AND operator. Figure B.6 illustrates the full Boolean phrase that is generated from a query topic composed of three term pairs for which only one relational pathway was identified for one term pair.

Figure B.6 Boolean phrase for query topic A B C constructed by combining the Boolean phrase generated from the relational pathway identified between the term pair A and B and the term C. Term C did not share a relational pathway with either term A or term B. (Figure not reproduced; the pathway phrase and term C are joined with an AND operator.)

APPENDIX C.
QUERY TOPIC LIST This appendix contains the complete listing of the 75 query topics that were used in the search engine testing. The query topics are divided by those that represent tangible concepts and those that represent intangible concepts. The query topics were developed by content experts by reviewing query logs of the Design CoPilotTM web application and by drawing from experience with the document collection. C.1. Query Topics Representing Tangible Concepts There were 40 query topics that represented tangible concepts. 1. ambient lighting intensity 2. attitude trend indicator 3. audible stall warnings 4. auxiliary power unit fire extinguishing 5. cockpit display color scheme 6. cockpit lighting at night 7. color and vibration 8. consistent label placement 9. control force and resistance considerations 10. electronic flight bag display design 11. emergency landing gear control location 12. excessive strength required to operate control 13. false resolution advisory 14. fault tolerant data entry 15. fuel feed selector control 16. hydraulic system status messages 17. icing conditions operating speeds provided in AFM 228 18. inadvertent activation prevention 19. information presented in peripheral visual field 20. information readable with vibration 21. instruments located in normal line of sight 22. interference from adjacent controls 23. labels readable distance 24. landing gear manual extension control design 25. low airspeed alerting 26. monochrome altimeter design 27. negative transfer issues 28. operable using only one hand 29. pilot access to circuit breakers 30. pilot response time to visual warnings 31. placard lighting 32. redundant coding methods 33. safety belt latch operation 34. sensitivity of controls 35. side stick control considerations 36. stuck microphone 37. tactile feedback 38. TCAS warnings 39. text color contrast 40. thrust reverser control design C.2. Query Topics Representing Intangible Concepts There were 35 query topics that represented intangible concepts. 1. acceptable message failure rate and pilots confidence in system 2. adequately usable 3. appropriate size of characters on display 4. arrangement of right seat instruments 5. audio signals annoying to the pilot 6. body position required to operate equipment 229 7. control design allows pilot to operate in extreme flight conditions 8. control is identifiable in the dark 9. cultural conventions switch design 10. cursor control device 11. design attributes for auditory displays 12. designed to keep pilot fatigue to a minimum 13. easily understandable system status 14. ergonomics of pilot seating 15. excessive cognitive effort 16. excessively objectionable 17. failure notifications are readily apparent 18. hard to readily find information on screen 19. how to ensure that labels are readable 20. how to improve situation awareness 21. how to provide unambiguous feedback 22. implications of automatically disengaging autopilot 23. information overload 24. intuitive display design 25. magnitude and direction of systems response to control input 26. minimal mental processing 27. needs too much attention 28. poorly organized information 29. preventing instrument reading errors 30. proper use of red and amber on displays 31. readily accessible overhead controls 32. representing self in airplane-referenced displays 33. soft controls 34. suitable menu navigation methods 35. unnecessarily distract from tasks 230 APPENDIX D. 
DOCUMENTS RETURNED COUNTS FOR QUERY TOPICS BY SEARCH ENGINE This appendix contains the data and calculations related to counts of documents returned by the Baseline and Enhanced search engines. D.1. Documents Returned Counts The following is a table of counts of documents returned by the Baseline and Enhanced search engines for each of the 75 query topics. Query Topics Baseline Enhanced Difference 1 audible stall warnings 20 78 58 2 monochrome altimeter design 0 0 0 3 ambient lighting intensity 21 21 0 4 attitude trend indicator 10 79 69 5 auxiliary power unit fire extinguishing 16 21 5 6 icing conditions operating speeds provided in AFM 11 61 50 7 pilot access to circuit breakers 29 107 78 8 cockpit display color scheme 17 117 100 9 color and vibration 30 109 79 10 control force and resistance considerations 8 8 0 11 electronic flight bag display design 48 48 0 12 emergency landing gear control location 9 80 71 13 false resolution advisory 9 23 14 14 fault tolerant data entry 1 14 13 15 fuel feed selector control 14 71 57 16 inadvertent activation prevention 57 109 52 17 consistent label placement 15 15 0 231 Query Topics (continued…) Baseline Enhanced Difference 18 labels readable distance 7 12 5 19 landing gear manual extension control design 3 48 45 20 placard lighting 70 148 78 21 low airspeed alerting 30 118 88 22 negative transfer issues 2 5 3 23 cockpit lighting at night 61 61 0 24 instruments located in normal line of sight 6 35 29 25 operable using only one hand 38 38 0 26 information presented in peripheral visual field 3 11 8 27 information readable and vibration 12 44 32 28 pilot response time to visual warnings 26 107 81 29 redundant coding methods 20 20 0 30 sensitivity of controls 105 105 0 31 side stick control considerations 0 30 30 32 stuck microphone 8 8 0 33 tactile feedback 50 50 0 34 text color contrast 19 54 35 35 interference from adjacent controls 28 91 63 36 excessive strength required to operate control 67 67 0 37 thrust reverser control design 41 41 0 38 TCAS warnings 96 244 148 39 safety belt latch operation 19 54 35 40 hydraulic system status messages 0 27 27 41 poorly organized information 17 17 0 42 excessive cognitive effort 4 10 6 43 how to improve situation awareness 2 10 8 44 needs too much attention 4 26 22 232 Query Topics (continued…) Baseline Enhanced Difference 45 adequately usable 88 88 0 46 intuitive display design 60 60 0 47 cultural conventions switch design 1 7 6 48 excessively objectionable 30 30 0 49 readily accessible overhead controls 0 0 0 50 easily understandable system status 9 133 124 51 unnecessarily distract from tasks 11 11 0 52 audio signals annoying to the pilot 8 8 0 53 minimal mental processing 9 26 17 54 information overload 26 26 0 55 soft controls 58 58 0 56 cursor control device 57 57 0 57 body position required to operate equipment 12 12 0 58 control is identifiable in the dark 20 69 49 59 hard to readily find information on screen 0 0 0 60 suitable menu navigation methods 0 11 11 61 how to provide unambiguous feedback 0 25 25 62 proper use of red and amber on displays 13 29 16 63 failure notifications are readily apparent 0 0 0 64 design attributes for auditory displays 4 7 3 65 preventing instrument reading errors 7 33 26 66 magnitude and direction of systems response to control input 14 14 0 67 acceptable message failure rate and pilots confidence in system 0 4 4 68 control design allows pilot to operate in extreme flight conditions 8 8 0 69 arrangement of right seat instruments 6 23 17 70 appropriate size of characters 
on display 13 56 43
71 ergonomics of pilot seating 0 2 2
72 implications of automatically disengaging autopilot 4 4 0
73 representing self in airplane-referenced displays 0 0 0
74 designed to keep pilot fatigue to a minimum 3 3 0
75 how to ensure that labels are readable 0 20 20
Table D.1 Document counts returned by Baseline and Enhanced search engines.

D.2. Significance of Difference of Documents Returned
To determine whether the difference between the Baseline and the Enhanced search engines in the number of documents returned was statistically significant, a Two Sample t-Test Assuming Equal Variances was run with an α=0.05.

Two Sample t-Test Assuming Equal Variances
Statistical Measures       Baseline     Enhanced
Mean                         20.2         43.5
Variance                    584.23      1936.41
Observations                  75           75
Pooled Variance            1260.32
t Stat                      -4.0295
P(T<=t) one-tail          4.455E-05
t Critical one-tail         1.6552
P(T<=t) two-tail          8.910E-05
t Critical two-tail         1.9761

Table D.2 Two Sample t-test Assuming Equal Variances to determine statistical significance of differences in documents returned by Baseline and Enhanced search engines.

APPENDIX E. GRADED RELEVANCE QUERY TOPIC DEFINITIONS
This appendix contains the table of graded relevancy definitions for the 30 query topics that were adjudicated: 14 query topics representing tangible concepts and 16 query topics representing intangible concepts. The graded definitions were used in the adjudication process to determine the relative level of relevancy of each document returned from each search engine. The higher the score, the more relevant the document is to the query topic. Documents with a score of 1 contain information that addresses all aspects of the concept represented by the query topic. Documents with a score of 0.5 or 0.25 contain information that addresses only some, but not all, aspects of the concept represented by the query topic. Documents with a score of 0 are irrelevant (i.e., do not contain any information that addresses the concept represented by the query topic). Documents that were returned by both the Baseline and the Enhanced search engines were assumed to be relevant. The graded relevancy descriptions were developed as an aid to consistently adjudicate documents for the query topic and were based on the content of the documents reviewed. In some cases, the descriptions for the 0.5 or 0.25 scores were not developed because no documents contained information that addressed only that level of relevancy. In addition, some query topics include specific descriptions of irrelevant topics (i.e., score of 0) when it was deemed useful in the adjudication process to distinguish between concepts that were fully or partially relevant and related information that was not sufficient to be considered relevant.

E.1. Query Topics Representing Tangible Concepts
A sample of 14 query topics was chosen from the set of query topics for tangible concepts in which there was a difference in performance between the Baseline and Enhanced search engines. The following definitions were used in the adjudication process to determine the relevancy of documents.
1.
auxiliary power unit fire extinguishing Score Description of Document Content 1.0 Discusses the design considerations related to extinguishing fires of the Auxiliary Power Unit (APU) 0.5 Discusses the act of extinguishing a fire or the design of fire extinguishing equipment in the flight deck 0.5 Discusses design issues related to fireproofing the APU or to fireproofing elements of the APU 0.25 Discusses design attributes related to fire detection 0.25 Discusses extinguishing fires of other systems 0.25 Discusses design attributes related to fireproofing other systems 0 States only that fire detection systems or fire extinguishing systems must be provided 0 States only that APUs should be provided 2. false resolution advisory Score Description of Document Content 1.0 Discusses design attributes related to or the impact of false Resolution Advisories (RAs) presented to the pilot 0.5 Discusses design attributes related to or the impact of false alerts presented by the TCAS system (and not specifically RAs) 0.25 Discusses design attributes related to or the impact of false alarms, alerts, or nuisance warnings from systems other than the TCAS 237 3. fault tolerant data entry Score Description of Document Content 1.0 Discusses design attributes related to or needs for fault tolerant data entry 0.5 n/a 0.25 n/a 4. hydraulic system status messages Score Description of Document Content 1.0 Discusses design attributes related to or needs for hydraulic system status messages 0.5 Design attributes of hydraulic system indicators/displays 0.25 Design attributes of status messages for other systems 0 States only that a status message (of some system other than hydraulics) should be provided 5. icing conditions operating speeds provided in AFM Score Description of Document Content 1.0 Discusses the requirement that the information about operating speeds in icing conditions should be provided in the Airplane Flight Manual (AFM) 0.5 Discusses providing information in AFM about operating speeds in general (i.e., not specifically in icing conditions) 0.5 Discusses operating speeds in icing conditions but does not state that they should be included in AFM 0.25 Discusses providing information in AFM about operating speeds in other conditions (i.e., not in icing conditions) 0.25 Discusses providing information related to icing conditions but not specifically about operating speeds 0.25 Discusses design attributes or format of information that could apply to operating speeds (e.g., units) 0 States only that an AFM should be provided, that information should be provided in the AFM (i.e., information other than that related to icing conditions), or icing conditions in general 238 6. information presented in peripheral visual field Score Description of Document Content 1.0 Discusses design attributes of information or types of information that is presented in the pilot’s peripheral visual field 0.5 n/a 0.25 Discusses parameters to define the primary visual field and the peripheral visual field 0 States only that information should be provided in the primary visual field 7. information readable with vibration Score Description of Document Content 1.0 Discusses readability issues caused by vibration 0.5 n/a 0.25 Discusses other visual issues related to vibration (e.g., eye fatigue) 0 Discusses non-visually-specific issues related to vibration (e.g., general physical fatigue) 8. 
instruments located in normal line of sight Score Description of Document Content 1.0 Discusses instruments that should be located in the pilot’s normal line of sight or primary field of view 0.5 n/a 0.25 Discusses the location of instruments with respect to pilots and their visibility requirements 239 9. labels readable distance Score Description of Document Content 1.0 Discusses how distance from the pilot impacts in the readability of a label 0.5 Describes design attributes related to placement of label and the impact of label’s location on being seen or readable to pilot 0.25 n/a 10. landing gear manual extension control design Score Description of Document Content 1.0 Discusses design attributes related to the manual landing gear extension control 0.5 Discusses the requirement that a landing gear manual extension control is provided 0.5 Discusses design attributes related to landing gear controls in general (including position indicator markings on controls) 0.25 Discusses entire extension/retracting system (which includes control to operate it). This does not include general statements of landing gear system that does not specifically reference the extension/retracting mechanism 0 Discusses landing gear in general such as landing gear failures and landing gear position (and does not specifically talk about design of control associated with allow pilot to operate landing gear) 11. negative transfer issues Score Description of Document Content 1.0 Discusses the impact of negative transfer on pilot performance or design attributes that may cause negative transfer issues to occur 0.5 n/a 0.25 n/a 240 12. safety belt latch operation Score Description of Document Content 1.0 Discusses the operation or design attributes of the safety belt latch (other terms that may be used include seat belt, safety harness, shoulder harness, pilot restraint system, buckle, fastener) 0.5 n/a 0.25 Discusses other design attributes of the safety belt (i.e., not specific to the latching mechanism) 0 Operation of other sorts of latches such as doors 13. side stick control considerations Score Description of Document Content 1.0 Discusses design attributes of the side stick control and their impact on pilot performance 0.5 n/a 0.25 Discusses design attributes of the stick control (including stick control forces) 0 Discusses design attributes or use of joysticks as cursor control devices 14. text color contrast Score Description of Document Content 1.0 Discusses use of color contrast on text or alphanumeric characters 0.5 n/a 0.25 Discusses attributes of color that impact readability and perception of text (e.g., ambient light impacting choice of saturation of color to be used, use of colors that are distinguishable from one another) 0.25 Use of contrast such as reverse video 0 Discusses design attributes other than those related to perception and readability to be considered when using color on a display (e.g., attentiongetting qualities or color-coding) 241 E.2. Query Topics Representing Intangible Concepts A sample of 16 query topics was chosen from the set of query topics for intangible concepts in which there was a difference in performance between the Baseline and Enhanced search engines. The following definitions were used in the adjudication process to determine the relevancy of documents. 15. 
acceptable message failure rate and pilots confidence in system Score Description of Document Content 1.0 Discusses acceptable rates of failure of messages presented to the pilot and the impact of nuisance warning or false alarms on the pilots confidence in the system 0.5 n/a 0.25 Discusses nuisance warnings and false alarms 16. appropriate size of characters on display Score Description of Document Content 1.0 Discusses design attributes related to the appropriate size of characters on a display 0.5 n/a 0.25 Discusses other appropriate design attributes related to displaying characters on a display (i.e., not character size) 17. arrangement of right seat instruments Score Description of Document Content 1.0 Discusses the arrangement of instruments for the right seat pilot (i.e., first officer seat) 0.5 Discusses other design attributes related to the right seat versus the left seat of flight deck 0.25 Discusses the general arrangement of instruments that could apply to right seat 242 18. control is identifiable in the dark Score Description of Document Content 1.0 Discusses design attributes related to allowing a control to be identified in the dark or specific controls that must be identifiable in the dark 0.5 n/a 0.25 Discusses design attributes that make control identifiable 19. cultural conventions switch design Score Description of Document Content 1.0 Discusses the impact of cultural conventions on control design 0.5 Discusses the impact of cultural conventions on display or other equipment design 0.25 n/a 20. design attributes for auditory displays Score Description of Document Content 1.0 Discusses design attributes of auditory displays 0.5 n/a 0.25 n/a 21. ergonomics of pilot seating Score Description of Document Content 1.0 Discusses the ergonomics of pilot seating 0.5 n/a 0.25 Lists reference resources specific to ergonomics of pilot seating 243 22. excessive cognitive effort Score Description of Document Content 1.0 Discusses design attributes and tasks related to addressing the avoidance of excessive cognitive effort 0.5 n/a 0.25 n/a 23. how to ensure that labels are readable Score Description of Document Content 1.0 Discusses design attributes related to ensuring that that labels are readable by the pilot 0.5 n/a 0.25 Discusses other design attributes related to making labels usable 0 States only that a label should be provided or that a component should be properly labeled 24. how to improve situation awareness Score Description of Document Content 1.0 Discusses design attributes or instruments that improve a pilot’s situation awareness 0.5 Discusses design attributes or instruments that provide information about the current conditions that allow pilots to respond in appropriate and timely manner 0.25 n/a 25. how to provide unambiguous feedback Score Description of Document Content 1.0 Discusses design attributes related to providing the pilot unambiguous feedback (i.e., the feedback is clearly understandable) 0.5 n/a 0.25 n/a 244 26. minimal mental processing Score Description of Document Content 1.0 Discusses design attributes or tasks related to requiring only minimal mental (or cognitive) processing from the pilot 0.5 Discusses design attributes or tasks related to reducing mental effort or avoiding excessive mental (or cognitive) processing from the pilot 0.25 Discusses design attributes or tasks related to mental or cognitive effort 27. 
needs too much attention Score Description of Document Content 1.0 Discusses design attributes and tasks related to addressing a component or situation that requires too much attention 0.5 Discusses the level of attention required to perform a task or operate a component 0.25 Discusses design attributes that impact the level of attention required 28. preventing instrument reading errors Score Description of Document Content 1.0 Discusses design attributes related to preventing instrument reading errors (i.e., any error related to visual, tactile, or auditory perception of the state, setting, or value of a display, control, or other equipment) 0.5 Discusses design attributes related to ensuring that instruments are readable. 0.5 Discusses design attributes related to preventing data interpretation errors. (Note: The idea of “reading errors” can be used in the broad sense to include interpreting the data that has been read on a display) 0.25 Discusses display design attributes or importance of preventing errors in general (i.e., not specifically reading errors) 0.25 Discusses display design attributes related to readability 0 Discusses errors in general 29. proper use of red and amber on displays Score Description of Document Content 1.0 Discusses the correct usage of both the colors red and amber on a flight deck display 0.5 Discusses the correct usage of either red or amber on a flight deck display (but not both colors) 0.25 Discusses the use of an appropriate color coding philosophy in display design 30. suitable menu navigation methods Score Description of Document Content 1.0 Discusses design attributes of menus to ensure that the menu navigation methods required are appropriate 0.5 n/a 0.25 n/a

APPENDIX F. BINARY RELEVANCE DATA AND CALCULATIONS

This appendix contains the table of binary relevancy counts and the calculated values for each query topic included in the samples adjudicated. Binary relevance was determined by converting any non-zero graded relevance value assigned to a document (i.e., a graded relevance of 0.25, 0.5, or 1) to 1. All relevance values of zero remain zero. The information is divided between the query topics that represent tangible concepts and those that represent intangible concepts. This information also identifies the query topics for which the Enhanced search engine performed the same as the Baseline (e.g., when no relational pathways were found for any of the term pairs in the query topic) and those for which the performance differed (i.e., the Enhanced search engine returned additional documents).

F.1. Data for Query Topics Representing Tangible Concepts

There were 40 total query topics that represented tangible concepts. There were 13 query topics in which the Baseline and the Enhanced search engine each returned the same set of documents, and there were 27 query topics in which the Enhanced search engine returned additional documents that the Baseline did not return. Of the 27 query topics in which there was a difference in performance between the Baseline and Enhanced search engines, a sample of 14 was adjudicated. The results from each are included in the sections below.

F.1.1. Adjudicated Query Topics for Tangible Concepts

A sample of 14 query topics was chosen from the set of query topics for tangible concepts in which there was a difference in performance between the Baseline and Enhanced search engines. These query topics were adjudicated to determine the relevancy of the returned documents.
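The recall*, precision, and F-measure* values reported in the tables of this appendix follow directly from these binary counts. The sketch below is a minimal illustration only, not the code used in this study; the function and variable names are hypothetical, and the pooled relevant total is taken to be the set of relevant documents identified by the two engines combined.

```python
def binary_relevance(graded_scores):
    """Collapse graded relevance values (0, 0.25, 0.5, 1) to binary: any non-zero score counts as relevant."""
    return [1 if score > 0 else 0 for score in graded_scores]

def recall_precision_fmeasure(relevant_returned, irrelevant_returned, pooled_relevant):
    """Compute recall*, precision, and F-measure* for a single query topic.

    pooled_relevant is the number of relevant documents identified by the
    Baseline and Enhanced engines combined (the recall* denominator)."""
    returned = relevant_returned + irrelevant_returned
    recall = relevant_returned / pooled_relevant if pooled_relevant else 0.0
    precision = relevant_returned / returned if returned else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f_measure = 2 * recall * precision / (recall + precision)  # harmonic mean of recall* and precision
    return recall, precision, f_measure

print(sum(binary_relevance([1, 0.5, 0.25, 0])))   # 3 of the 4 graded scores count as relevant
# Example: an engine returns 16 relevant and 0 non-relevant documents against a
# pooled total of 21 relevant documents (cf. query topic 1 in Table F.1).
print(recall_precision_fmeasure(16, 0, 21))        # roughly (0.76, 1.00, 0.86)
```

When both recall* and precision are zero, the F-measure* is reported as zero, which matches the convention used for query topics where the Baseline returned no relevant documents.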
Table F.1 provides the data and calculations for these query topics. 248 Table F.1 Adjudicated Query Topics for Tangible Concepts with Recall*, Precision, and F-measure* calculations. 249 Tangible Concepts Query Sample Set Baseline Search Engine Query Topic 1 2 auxiliary power unit fire extinguishing false resolution advisory 3 fault tolerant data entry 4 hydraulic system status messages icing conditions operating speeds provided in AFM information presented in peripheral visual field information readable with vibration instruments located in normal line of sight labels readable distance 5 6 7 8 9 10 11 landing gear manual extension control design negative transfer issues 12 safety belt latch operation 13 side stick control considerations text color contrast 14 Enhanced Search Engine Total Relevant Total Relevant Irrelevant Recall* Precision F-measure* Total Relevant Irrelevant Recall* Precision F-measure* 21 16 16 0 0.76 1.00 0.86 21 21 0 1.00 1.00 1.00 20 9 9 0 0.45 1.00 0.62 23 20 3 1.00 0.87 0.93 4 1 1 0 0.25 1.00 0.40 14 4 10 1.00 0.29 0.44 7 0 0 0 0.00 0.00 0.00 27 7 20 1.00 0.26 0.41 46 11 11 0 0.24 1.00 0.39 61 46 15 1.00 0.75 0.86 11 3 3 0 0.27 1.00 0.43 11 11 0 1.00 1.00 1.00 25 12 12 0 0.48 1.00 0.65 44 25 19 1.00 0.57 0.72 28 6 6 0 0.21 1.00 0.35 35 28 7 1.00 0.80 0.89 10 7 7 0 0.70 1.00 0.82 12 10 2 1.00 0.83 0.91 27 3 3 0 0.11 1.00 0.20 48 27 21 1.00 0.56 0.72 2 2 2 0 1.00 1.00 1.00 5 2 3 1.00 0.40 0.57 37 19 19 0 0.51 1.00 0.68 54 37 17 1.00 0.69 0.81 21 0 0 0 0.00 0.00 0.00 30 21 9 1.00 0.70 0.82 38 19 19 0 0.50 1.00 0.67 54 38 16 1.00 0.70 0.83 0.67 0.78 Average = 0.39 0.51 250 F.1.2. Not Adjudicated Query Topics for Tangible Concepts Of the 40 query topics that represented tangible concepts, there were a total of 26 query topics that were not adjudicated. The details of the query topics that were not adjudicated are outlined below. 251 F.1.2.1. No Performance Difference For Tangible Concepts There were 13 query topics representing tangible concepts in which the Baseline and the Enhanced search engine each returned the same set of documents (i.e., there was no difference in performance between the Baseline and the Enhanced search engines). These query topics are listed in Table F.2. This occurred when no relational pathways were found for any of the term pairs in the query topic or when the relational pathways identified did not yield an expanded query that generated additional document matches. Query Topic Baseline Enhanced Total Total Difference 1 ambient lighting intensity 21 21 0 2 cockpit lighting at night 61 61 0 3 consistent label placement 15 15 0 4 control force and resistance considerations 8 8 0 5 electronic flight bag display design 48 48 0 6 excessive strength required to operate control 67 67 0 7 monochrome altimeter design 0 0 0 8 operable using only one hand 38 38 0 9 redundant coding methods 20 20 0 10 sensitivity of controls 105 105 0 11 stuck microphone 8 8 0 12 tactile feedback 50 50 0 13 thrust reverser control design 41 41 0 Table F.2 Query topics for Tangible Concepts with no performance difference. 252 F.1.2.2. Difference in Performance For Tangible Concepts Although there was a difference in performance between the Baseline and Enhanced search engines in the query topics representing tangible concepts listed in Table F.3, these query topics were not part of the sample that was adjudicated because the difference in performance exceeded the difference threshold of 50 documents. 
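The selection rule implied here is simple enough to state directly. The sketch below is purely illustrative and the helper name is hypothetical; Table F.3 lists the topics this rule excludes.

```python
def eligible_for_adjudication(baseline_total, enhanced_total, threshold=50):
    """A query topic was adjudicated only when the Enhanced engine returned additional
    documents and the difference did not exceed the 50-document threshold."""
    difference = enhanced_total - baseline_total
    return 0 < difference <= threshold

print(eligible_for_adjudication(10, 79))   # False: a 69-document difference exceeds the threshold
print(eligible_for_adjudication(16, 21))   # True: a 5-document difference is within the threshold
```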
Query Topic Baseline Enhanced Total Total Difference 1 attitude trend indicator 10 79 69 2 audible stall warnings 20 78 58 3 pilot access to circuit breakers 29 107 78 4 cockpit display color scheme 17 117 100 5 color and vibration 30 109 79 6 emergency landing gear control location 9 80 71 7 fuel feed selector control 14 71 57 8 inadvertent activation prevention 57 109 52 9 placard lighting 70 148 78 10 low airspeed alerting 30 118 88 11 pilot response time to visual warnings 26 107 81 12 interference from adjacent controls 28 91 63 13 TCAS warnings 96 244 148 Table F.3 Query topics for Tangible Concepts with a performance difference but not included in the Tangible Concepts Query Sample Set. 253 F.2. Data for Query Topics Representing Intangible Concepts There were 35 total query topics that represented intangible concepts. There were 18 query topics in which the Baseline and the Enhanced search engine each returned the same set of documents, and there were 17 query topics in which the Enhanced search engine returned additional documents that the Baseline did not return. Of the 17 query topics in which there was a difference in performance between the Baseline and Enhanced search engine, a sample of 16 were adjudicated. The results from each are included in the sections below. F.2.1. Adjudicated Query Topics for Intangible Concepts A sample of 16 query topics was chosen from the set of query topics for intangible concepts in which there was a difference in performance between the Baseline and Enhanced search engines. These query topics were adjudicated to determine the relevancy of the returned documents. Table F.4 provides the data and calculations for these query topics. 254 Table F.4 Adjudicated Query Topics for Intangible Concepts with Recall*, Precision, and F-measure* calculations. 255 Intangible Concepts Query Sample Set Baseline Search Engine Query Topic 1 2 3 4 5 6 7 8 9 10 11 12 13 14 14 15 acceptable message failure rate and pilots confidence in system appropriate size of characters on display arrangement of right seat instruments control is identifiable in the dark cultural conventions switch design design attributes for auditory displays ergonomics of pilot seating excessive cognitive effort how to ensure that labels are readable how to improve situation awareness how to provide unambiguous feedback minimal mental processing needs too much attention preventing instrument reading errors proper use of red and amber on displays suitable menu navigation methods Enhanced Search Engine Total Relevant Total Relevant Irrelevant Recall* Precision F-measure* Total Relevant Irrelevant Recall* Precision F-measure* 3 0 0 0 0.00 0.00 0.00 4 3 1 1.00 0.75 0.86 56 13 13 0 0.23 1.00 0.38 56 0 1.00 1.00 1.00 19 6 6 0 0.32 1.00 0.48 23 19 4 1.00 0.83 0.90 56 20 20 0 0.36 1.00 0.53 69 56 13 1.00 0.81 0.90 6 1 1 0 0.17 1.00 0.29 7 6 1 1.00 0.86 0.92 7 4 4 0 0.57 1.00 0.73 7 7 0 1.00 1.00 1.00 2 10 0 4 0 4 0 0 0.00 0.40 0.00 1.00 0.00 0.57 2 10 2 10 0 0 1.00 1.00 1.00 1.00 1.00 1.00 16 0 0 0 0.00 0.00 0.00 20 16 4 1.00 0.80 0.89 10 2 2 0 0.20 1.00 0.33 10 10 0 1.00 1.00 1.00 25 0 0 0 0.00 0.00 0.00 25 25 0 1.00 1.00 1.00 24 24 9 4 9 4 0 0 0.38 0.17 1.00 1.00 0.55 0.29 26 26 24 24 2 2 1.00 1.00 0.92 0.92 0.96 0.96 30 7 7 0 0.23 1.00 0.38 33 30 3 1.00 0.91 0.95 28 13 13 0 0.46 1.00 0.63 29 28 1 1.00 0.97 0.98 3 0 0 0 0.00 0.00 0.00 11 3 8 1.00 0.27 0.43 0.88 0.92 Average = 0.22 0.32 56 256 F.2.2. 
Not Adjudicated Query Topics For Intangible Concepts

Of the 35 query topics that represented intangible concepts, there were a total of 19 query topics that were not adjudicated. The details of the query topics that were not adjudicated are outlined below.

F.2.2.1. No Performance Difference For Intangible Concepts

There were 18 query topics representing intangible concepts in which the Baseline and the Enhanced search engine each returned the same set of documents (i.e., there was no difference in performance between the Baseline and the Enhanced search engines). These query topics are listed in Table F.5. This occurred when no relational pathways were found for any of the term pairs in the query topic or when the relational pathways identified did not yield an expanded query that generated additional document matches.

Query Topic | Baseline Total | Enhanced Total | Difference
1 adequately usable | 88 | 88 | 0
2 audio signals annoying to the pilot | 8 | 8 | 0
3 body position required to operate equipment | 12 | 12 | 0
4 control design allows pilot to operate in extreme flight conditions | 8 | 8 | 0
5 cursor control device | 57 | 57 | 0
6 designed to keep pilot fatigue to a minimum | 3 | 3 | 0
7 excessively objectionable | 30 | 30 | 0
8 failure notifications are readily apparent | 0 | 0 | 0
9 hard to readily find information on screen | 0 | 0 | 0
10 implications of automatically disengaging autopilot | 4 | 4 | 0
11 information overload | 26 | 26 | 0
12 intuitive display design | 60 | 60 | 0
13 magnitude and direction of systems response to control input | 14 | 14 | 0
14 poorly organized information | 17 | 17 | 0
15 readily accessible overhead controls | 0 | 0 | 0
16 representing self in airplane-referenced displays | 0 | 0 | 0
17 soft controls | 58 | 58 | 0
18 unnecessarily distract from tasks | 11 | 11 | 0
Table F.5 Query topics for Intangible Concepts with no performance difference.

F.2.2.2. Difference in Performance For Intangible Concepts

Although there was a difference in performance between the Baseline and Enhanced search engines for the query topic representing an intangible concept listed in Table F.6, this query topic was not part of the sample that was adjudicated because the difference in performance exceeded the difference threshold of 50 documents.

Query Topic | Baseline Total | Enhanced Total | Difference
1 easily understandable system status | 9 | 133 | 124
Table F.6 Query topics for Intangible Concepts with a performance difference but not included in the Intangible Concepts Query Sample Set.

APPENDIX G. BINARY RELEVANCE SIGNIFICANCE TESTS

This appendix contains the data and calculations used to determine the significance of the differences identified in the binary relevance data gathered. Binary relevance was determined by converting any non-zero graded relevance value assigned to a document (i.e., a graded relevance of 0.25, 0.5, or 1) to 1. All relevance values of zero remain zero.

G.1. Difference in Recall* between Baseline and Enhanced Search Engines

To determine whether the difference in recall* between the Baseline and the Enhanced search engines was statistically significant, a Two Sample t-Test Assuming Equal Variances was run with a significance level of α=0.05. The data and results are presented in Figure G.1.
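For reference, the test statistic reported in these figures is the standard pooled-variance (equal variances) two-sample t statistic. The sketch below is illustrative only; it is not the software used in the study, the function name is hypothetical, and the sample lists are truncated placeholders for the 30 recall* values per engine shown in Figure G.1.

```python
import math
from statistics import mean, variance

def pooled_t_test(sample_a, sample_b):
    """Two-sample t-test assuming equal variances: returns the t statistic and degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / dof
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t_stat, dof

# Truncated example: a few Baseline recall* values against the corresponding Enhanced values.
baseline_recall = [0.76, 0.45, 0.25, 0.00, 0.24, 0.27]
enhanced_recall = [1.00, 1.00, 1.00, 1.00, 1.00, 1.00]
t_stat, dof = pooled_t_test(baseline_recall, enhanced_recall)
# Compare |t_stat| with the two-tailed critical value for dof degrees of freedom at alpha = 0.05;
# with the full 30-value samples, dof = 58 and the critical value is roughly 2.00.
print(t_stat, dof)
```

When the absolute value of the t statistic exceeds the critical value (equivalently, when the two-tailed p-value falls below α), the null hypothesis of equal means is rejected, as in the interpretations that follow.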
260 Recall* Significance Calculations Recall* Values Baseline Enhanced 0.76 0.45 0.25 0.00 0.24 0.27 0.48 0.21 0.70 0.11 1.00 0.51 0.00 0.50 0.00 0.23 0.32 0.36 0.17 0.57 0.00 0.40 0.00 0.20 0.00 0.38 0.17 0.23 0.46 0.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 Figure G.1 Two Sample t-Test Assuming Equal Variances Statistical Measures Mean Variance Observations Pooled Variance t Stat P(T<=t) one-tail t Critical one-tail P(T<=t) two-tail t Critical two-tail Baseline 0.299 0.064 30 Enhanced 1 0 30 0.032 -15.164 3.830E-22 1.672 7.660E-22 2.002 Interpretation of Statistical Test: The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the recall* of the Baseline and the Enhanced search engines. Because the absolute value of the t Stat is greater than t Critical two-tail value, we can reject the Null Hypothesis. We can draw this same conclusion by looking at the P(T<=t) two-tail that represents the probability that the Null Hypothesis is true. The P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically significant difference between the recall* of the Baseline and Enhanced search engines. Two Sample t-Test to test the recall* difference between Baseline and Enhanced search engines. 261 G.2. Difference in F-measure* between Baseline and Enhanced Search Engines To determine whether the difference in search performance between the Baseline and the Enhanced search engines was statically significant, a Two Sample t-Test Assuming Equal Variances was run with a significance level of α=0.05. Search performance was calculated using the F-measure* in order to balance equally the contribution of recall* and precision. The data and results are presented in Figure G.2. 262 F-measure* Significance Calculations F-measure* Values Baseline 0.86 0.62 0.40 0.00 0.39 0.43 0.65 0.35 0.82 0.20 1.00 0.68 0.00 0.67 0.00 0.38 0.48 0.53 0.29 0.73 0.00 0.57 0.00 0.33 0.00 0.55 0.29 0.38 0.63 0.00 Figure G.2 Enhanced 1.00 0.93 0.44 0.41 0.85 1.00 0.72 0.89 0.91 0.72 0.57 0.81 0.82 0.83 0.86 1.00 0.90 0.90 0.92 1.00 1.00 1.00 0.89 1.00 1.00 0.96 0.96 0.95 0.98 0.43 Two Sample t-Test Assuming Equal Variances Statistical Measures Mean Variance Observations Pooled Variance t Stat P(T<=t) one-tail t Critical one-tail P(T<=t) two-tail t Critical two-tail Baseline 0.407 0.084 30 Enhanced 0.856 0.031 30 0.058 -7.228 6.071E-10 1.6716 1.214E-09 2.002 Interpretation of Statistical Test: The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the performance of the Baseline and the Enhanced search engines. Search performance is determined using the F-measure* as the harmonic mean of recall* and precision. The greater the F-measure* value, the better the performance. Because the absolute value of the t Stat is greater than t Critical two-tail value, we can reject the Null Hypothesis. We can draw this same conclusion by looking at the P(T<=t) twotail that represents the probability that the Null Hypothesis is true. The P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically significant difference between the Baseline and Enhanced search engine Fmeasure* values. 
The F-measure* value of the Enhanced search engine is statistically greater than the Baseline and, therefore, performs better than the Baseline search engine. Two Sample t-Test to test the F-measure* difference between Baseline and Enhanced search engines. 263 G.3. Difference in F-measures* between Tangible and Intangible Concepts To determine whether the F-measure* values for the query topics that represent tangible concepts and those that represent intangible are statistically different, a Single Factor ANOVA test was run with a significant level of α=0.05 for the Baseline search engine and for the Enhanced search engine. 264 G.3.1. Difference between Tangible and Intangible Concepts for Baseline Search Engine F-measure* Values Baseline Search Engine Tangible 0.86 0.62 0.40 0.00 0.39 0.43 0.65 0.35 0.82 0.20 1.00 0.68 0.00 0.67 --- Interpretation of Statistical Test: Intangible The Null Hypothesis for this Single Factor ANOVA test is that there is no statistical difference between the performance of the Baseline search engine for the query topics that represent tangible concepts and the query topics that represent intangible concepts. Because the value of F is less than F critical value, we cannot reject the Null Hypothesis. Because we cannot reject the Null Hypothesis, we must conclude that there is no statistically significant difference between the performance of the Baseline search engine on the query topics that represent tangible concepts and the query topics that represent intangible concepts. 0.00 0.38 0.48 0.53 0.29 0.73 0.00 0.57 0.00 0.33 0.00 0.55 0.29 0.38 0.63 0.00 Summary Statistics for Baseline Search Engine Groups Tangible Intangible Count 14 16 Sum 7.070 5.145 Average 0.505 0.322 Variance 0.094 0.065 ANOVA Single Factor Source of Variation Between Groups Figure G.3 F 3.212 P-value 0.084 F critical 4.196 ANOVA Single Factor to test the F-measure* difference between Tangible and Intangible Concepts Query Sample Sets on the Baseline search engine. 265 G.3.2. Difference between Tangible and Intangible Concepts for Enhanced Search Engine F-measure* Values Enhanced Search Engine Tangible 1.00 0.93 0.44 0.41 0.86 1.00 0.72 0.89 0.91 0.72 0.57 0.81 0.82 0.83 --- Intangible 0.86 1.00 0.90 0.90 0.92 1.00 1.00 1.00 0.89 1.00 1.00 0.96 0.96 0.95 0.98 0.43 Interpretation of Statistical Test: The Null Hypothesis for this Single Factor ANOVA test is that there is no statistical difference between the performance of the Enhanced search engine on the query topics that represent tangible concepts and the query topics that represent intangible concepts. Because the value of F is greater than F critical value, we can reject the Null Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically significant difference between the performance of the Enhanced search engine on the query topics that represent tangible concepts and the query topics that represent intangible concepts. Summary Statistics for Enhanced Search Engine Groups Tangible Intangible Count 14 16 Sum Average Variance 10.923 0.780 0.035 14.753 0.922 0.020 ANOVA Single Factor Source of Variation Between Groups Figure G.4 F 5.599 P-value F critical 0.025 4.196 ANOVA Single Factor to test the F-measure difference between Tangible and Intangible Concepts Query Sample Sets on the Enhanced search engine. 266 APPENDIX H. GRADED RELEVANCE DATA This appendix contains the table of graded relevancy score counts for each query topic included in the samples adjudicated. 
The graded relevance identifies the level of relevancy of each document returned in the results generated by each search engine. The higher the number, the more relevant the document is to the query topic. The graded definitions used for each query topic adjudicated can be found in Appendix E. H.1. Graded Relevance Data of Query Topics for Tangible Concepts A sample of 14 query topics was chosen from the set of query topics for tangible concepts in which there was a difference in performance between the Baseline and Enhanced search engines. These query topics were adjudicated to determine the relevancy of the returned documents, and the graded relevancy scores were captured for each document returned in the Enhanced search engine result set. Table H.1 provides the counts for each of the graded relevancy scores for the adjudicated Tangible Concepts Query Sample Set. 267 Enhanced Search Engine Query Topics Representing Tangible Concepts Total Returned Graded Relevancy Score 1 0.5 0.25 0 1 auxiliary power unit fire extinguishing 21 16 4 1 0 2 false resolution advisory 23 10 1 9 3 3 fault tolerant data entry 14 4 0 0 10 4 hydraulic system status messages 27 1 0 6 20 5 icing conditions operating speeds provided in AFM 61 16 7 23 15 6 information presented in peripheral visual field 11 11 0 0 0 7 information readable with vibration 44 20 0 5 19 8 instruments located in normal line of sight 35 25 1 2 7 9 labels readable distance 12 7 3 0 2 10 landing gear manual extension control design 48 17 8 2 21 11 negative transfer issues 5 2 0 0 3 12 safety belt latch operation 54 32 0 5 17 13 side stick control considerations 30 6 0 15 9 14 text color contrast 54 35 0 3 16 Table H.1 Graded Relevance Data for Tangible Concepts Query Sample Set. 268 H.2. Graded Relevance Data of Query Topics for Intangible Concepts A sample of 16 query topics was chosen from the set of query topics for intangible concepts in which there was a difference in performance between the Baseline and Enhanced search engines. These query topics were adjudicated to determine the relevancy of the returned documents, and the graded relevancy scores were captured for each document returned in the Enhanced search engine result set. Table H.2 provides the counts for each of the graded relevancy scores for the adjudicated Intangible Concepts Query Sample Set. 269 Enhanced Search Engine Query Topics Representing Intangible Concepts Total Returned Graded Relevancy Score 1 0.5 0.25 0 1 acceptable message failure rate and pilots confidence in system 4 2 0 1 1 2 appropriate size of characters on display 56 49 0 7 0 3 arrangement of right seat instruments 23 7 4 8 4 4 control is identifiable in the dark 69 30 5 21 13 5 cultural conventions switch design 7 2 4 0 1 6 design attributes for auditory displays 7 7 0 0 0 7 ergonomics of pilot seating 2 0 0 2 0 8 excessive cognitive effort 10 9 0 1 0 9 how to ensure that labels are readable 20 9 0 7 4 10 how to improve situation awareness 10 7 3 0 0 11 how to provide unambiguous feedback 25 24 1 0 0 12 minimal mental processing 26 11 11 2 2 13 needs too much attention 26 12 5 7 2 14 preventing instrument reading errors 33 17 9 4 3 15 proper use of red and amber on displays 29 22 2 4 1 16 suitable menu navigation methods 11 3 0 0 8 Table H.2 Graded Relevance Data for Intangible Concepts Query Sample Set. 270 APPENDIX I. SENSITIVITY ANALYSIS FOR IMPACT OF UNFOUND RELEVANT DOCUMENTS This appendix contains the data and calculations used to perform the sensitivity analysis to unfound relevant documents. 
Three different levels of sensitivity were analyzed to assess the impact of unfound relevant documents on the performance measures used in this experiment. To perform the sensitivity analysis, the number of unfound relevant documents was estimated based on the level of sensitivity and rounded up to the next whole number. These unfound relevant document estimates were then used to estimate the total number of relevant documents in the collection by adding the number of relevant documents for the query topic identified by the Baseline and Enhanced search engines to the estimated number of unfound relevant documents at that sensitivity level. Using the estimated total relevant documents for the given sensitivity level, the recall* was recalculated for each of the 30 sample query topics. Next, a Two Sample t-Test Assuming Equal Variances was run with a significance level of α=0.05 to determine whether the difference in the recall* between the Baseline and the Enhanced search engines was statistically significant. Finally, the conclusion drawn about the significance of the difference in recall* values was compared to the conclusion drawn in the original calculations (i.e., those performed without the addition of estimated unfound relevant documents).

Next, the estimated total relevant documents for the given sensitivity level were used to calculate the F-measure* for each of the 30 sample query topics. A Two Sample t-Test Assuming Equal Variances was then run with a significance level of α=0.05 to determine whether the difference in the F-measure* between the Baseline and the Enhanced search engines was statistically significant. Finally, the conclusion drawn about the significance of the difference in F-measure* values was compared to the conclusion drawn in the original calculations (i.e., those performed without the addition of estimated unfound relevant documents).

The data and calculations for each of the three sensitivity levels are presented in the following sections.

I.1. Level 1 Sensitivity To Unfound Relevant Documents (0.25X)

At the first sensitivity level, the unfound relevant documents were estimated to be a quarter of the number of relevant documents identified by the Baseline and Enhanced search engines. Tables I.1 and I.2 provide the data and calculations performed using the unfound relevant document estimates for Sensitivity Level 1 for the Tangible Concepts Query Sample Set and the Intangible Concepts Query Sample Set. The data and results of the statistical significance test at Sensitivity Level 1 are presented in Figure I.1 for recall* and in Figure I.2 for F-measure*.

Table I.1 Recall*, Precision, and F-measure* calculations for the Tangible Concepts Query Sample Set at Sensitivity Level 1 (0.25X), where the number of unfound relevant documents is assumed to be a quarter of the number of relevant documents identified by the Baseline and Enhanced search engines.
273 Level 1 Sensitivity To Unfound Relevant Documents (0.25X) – Tangible Concepts Query Sample Set Baseline Query Topic (QT) Baseline + Enhanced Relevant Unfoun d Relevan t Estimated Total Relevant Sensitivity Level 1 Recall* Enhanced Precision Sensitivity Level 1 F-measure* Sensitivity Level 1 Recall* Precision Sensitivity Level 1 F-measure* 1 auxiliary power unit fire extinguishing 21 6 27 0.593 1.000 0.744 0.778 1.000 0.875 2 false resolution advisory 20 5 25 0.360 1.000 0.529 0.800 0.870 0.833 3 fault tolerant data entry 4 1 5 0.200 1.000 0.333 0.800 0.286 0.421 4 hydraulic system status messages 7 2 9 0.000 0.000 0.000 0.778 0.259 0.389 5 icing conditions operating speeds provided in AFM 46 12 58 0.190 1.000 0.319 0.793 0.754 0.773 6 information presented in peripheral visual field 11 3 14 0.214 1.000 0.353 0.786 1.000 0.880 7 information readable with vibration 25 7 32 0.375 1.000 0.545 0.781 0.568 0.658 8 instruments located in normal line of sight 28 7 35 0.171 1.000 0.293 0.800 0.800 0.800 9 labels readable distance 10 3 13 0.538 1.000 0.700 0.769 0.833 0.800 10 landing gear manual extension control design 27 7 34 0.088 1.000 0.162 0.794 0.563 0.659 11 negative transfer issues 2 1 3 0.667 1.000 0.800 0.667 0.400 0.500 12 safety belt latch operation 37 10 47 0.404 1.000 0.576 0.787 0.685 0.733 13 side stick control considerations 21 6 27 0.000 0.000 0.000 0.778 0.700 0.737 14 text color contrast 38 10 48 0.396 1.000 0.567 0.792 0.704 0.745 274 Table I.2 Recall*, Precision, and F-measure* calculations for Intangible Concepts Query Sample Set at Sensitivity Level 1 (0.25X) where the number of unfound documents are assumed to be a quarter of the number of relevant documents identified by the Baseline and Enhanced search engines. 275 Level 1 Sensitivity To Unfound Relevant Documents (0.25X) – Intangible Concepts Query Sample Set Baseline Query Topic (QT) Baseline + Enhanced Relevant Unfound Relevant Estimated Total Relevant Sensitivity Level 1 Recall* Enhanced Precision Sensitivity Level 1 F-measure* Sensitivity Level 1 Recall* Precision Sensitivity Level 1 F-measure* 1 acceptable message failure rate and pilots confidence in system 3 1 4 0.000 0.000 0.000 0.750 0.750 0.750 2 appropriate size of characters on display 56 14 70 0.186 1.000 0.313 0.800 1.000 0.889 3 arrangement of right seat instruments 19 5 24 0.250 1.000 0.400 0.792 0.826 0.809 4 control is identifiable in the dark 56 14 70 0.286 1.000 0.444 0.800 0.812 0.806 5 cultural conventions switch design 6 2 8 0.125 1.000 0.222 0.750 0.857 0.800 6 design attributes for auditory displays 7 2 9 0.444 1.000 0.615 0.778 1.000 0.875 7 ergonomics of pilot seating 2 1 3 0.000 0.000 0.000 0.667 1.000 0.800 8 excessive cognitive effort 10 3 13 0.308 1.000 0.471 0.769 1.000 0.870 9 how to ensure that labels are readable 16 4 20 0.000 0.000 0.000 0.800 0.800 0.800 10 how to improve situation awareness 10 3 13 0.154 1.000 0.267 0.769 1.000 0.870 11 how to provide unambiguous feedback 25 7 32 0.000 0.000 0.000 0.781 1.000 0.877 12 minimal mental processing 24 6 30 0.300 1.000 0.462 0.800 0.923 0.857 13 needs too much attention 24 6 30 0.133 1.000 0.235 0.800 0.923 0.857 14 preventing instrument reading errors 30 8 38 0.184 1.000 0.311 0.789 0.909 0.845 15 proper use of red and amber on displays 28 7 35 0.371 1.000 0.542 0.800 0.966 0.875 16 suitable menu navigation methods 3 1 4 0.000 0.000 0.000 0.750 0.273 0.400 276 Level 1 Sensitivity To Unfound Relevant Documents (0.25X) – Recall* Significance Calculations Recall* Values at Sensitivity 
Level 1 (0.25X) Baseline Enhanced 0.593 0.360 0.200 0.000 0.190 0.214 0.375 0.171 0.538 0.088 0.667 0.404 0.000 0.396 0.000 0.186 0.250 0.286 0.125 0.444 0.000 0.308 0.000 0.154 0.000 0.300 0.133 0.184 0.371 0.000 0.778 0.800 0.800 0.778 0.793 0.786 0.781 0.800 0.769 0.794 0.667 0.787 0.778 0.792 0.750 0.800 0.792 0.800 0.750 0.778 0.667 0.769 0.800 0.769 0.781 0.800 0.800 0.789 0.800 0.750 Figure I.1 Two Sample t-Test Assuming Equal Variances Statistical Measures Mean Variance Observations Pooled Variance t Stat P(T<=t) one-tail t Critical one-tail P(T<=t) two-tail t Critical two-tail Baseline 0.2313 0.0352 30 Enhanced 0.7766 0.0011 30 0.0182 -15.6739 8.155E-23 1.6716 1.63E-22 2.0017 Interpretation of Statistical Test: The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the recall* of the Baseline and the Enhanced search engines at Sensitivity Level 1 (0.25X). Because the absolute value of the t Stat is greater than t Critical two-tail value, we can reject the Null Hypothesis. We can draw this same conclusion by looking at the P(T<=t) twotail that represents the probability that the Null Hypothesis is true. The P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically significant difference between the recall* at Sensitivity Level 1 (0.25X) of the Baseline and Enhanced search engines. Two Sample t-Test to test the difference of the recall* at Sensitivity Level 1 (0.25X) between the Baseline and Enhanced search engines. 277 Level 1 Sensitivity To Unfound Relevant Documents (0.25X) – F-measure* Significance Calculations F-measure* Values at Sensitivity Level 1 (0.25X) Baseline 0.744 0.529 0.333 0.000 0.319 0.353 0.545 0.293 0.700 0.162 0.800 0.576 0.000 0.567 0.000 0.313 0.400 0.444 0.222 0.615 0.000 0.471 0.000 0.267 0.000 0.462 0.235 0.311 0.542 0.000 Figure I.2 Enhanced 0.875 0.833 0.421 0.389 0.773 0.880 0.658 0.800 0.800 0.659 0.500 0.733 0.737 0.745 0.750 0.889 0.809 0.806 0.800 0.875 0.800 0.870 0.800 0.870 0.877 0.857 0.857 0.845 0.875 0.400 Two Sample t-Test Assuming Equal Variances Statistical Measures Mean Variance Observations Pooled Variance t Stat P(T<=t) one-tail t Critical one-tail P(T<=t) two-tail t Critical two-tail Baseline 0.3401 0.0596 30 Enhanced 0.7594 0.0215 30 0.0406 -8.0617 2.414E-11 1.6716 4.83E-11 2.0017 Interpretation of Statistical Test: The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the performance of the Baseline and the Enhanced search engines at Sensitivity Level 1 (0.25X). Search performance is determined using the F-measure* at Sensitivity Level 1 (0.25X) where it is assumed that there are additional unfound relevant documents. Because the absolute value of the t Stat is greater than t Critical two-tail value, we can reject the Null Hypothesis. We can draw this same conclusion by looking at the P(T<=t) two-tail that represents the probability that the Null Hypothesis is true. The P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically significant difference between the Baseline and Enhanced search engine F-measure* at Sensitivity Level 1 (0.25X) values. The F-measure* value of the Enhanced search engine is statistically greater than the Baseline and, therefore, performs better than the Baseline search engine. 
This is consistent with the conclusion drawn without the addition of estimated unfound relevant documents. Two Sample t-Test to test the difference of the F-measure* at Sensitivity Level 1 (0.25X) between the Baseline and Enhanced search engines. 278 I.2. Level 2 Sensitivity To Unfound Relevant Documents (2X) At the second sensitivity level, the unfound relevant documents were estimated to be double the number of relevant documents identified by the Baseline and Enhanced search engines. Table I.3 and I.4 provide the data and calculations performed using the unfound relevant document estimates for Sensitivity Level 2 for the Tangible Concept Query Sample Set and the Intangible Concept Query Sample Set. The data and results of the statistical significance test at Sensitivity Level 2 are presented in Figure I.3 for recall* and in Figure I.4 for F-measure*. 279 Table I.3 Recall*, Precision, and F-measure* calculations for Tangible Concepts Query Sample Set at Sensitivity Level 2 (2X) where the number of unfound documents are assumed to be double the number of relevant documents identified by the Baseline and Enhanced search engines. 280 Level 2 Sensitivity To Unfound Relevant Documents (2X) – Tangible Concepts Query Sample Set Baseline Query Topic (QT) Baseline + Enhanced Relevant Unfound Relevant Estimated Total Relevant Sensitivity Level 2 Recall* Enhanced Precision Sensitivity Level 2 F-measure* Sensitivity Level 2 Recall* Precision Sensitivity Level 2 F-measure* 1 auxiliary power unit fire extinguishing 21 42 63 0.254 1.000 0.405 0.333 1.000 0.500 2 false resolution advisory 20 40 60 0.150 1.000 0.261 0.333 0.870 0.482 3 fault tolerant data entry 4 8 12 0.083 1.000 0.154 0.333 0.286 0.308 4 hydraulic system status messages 7 14 21 0.000 0.000 0.000 0.333 0.259 0.292 5 icing conditions operating speeds provided in AFM 46 92 138 0.080 1.000 0.148 0.333 0.754 0.462 6 information presented in peripheral visual field 11 22 33 0.091 1.000 0.167 0.333 1.000 0.500 7 information readable with vibration 25 50 75 0.160 1.000 0.276 0.333 0.568 0.420 8 instruments located in normal line of sight 28 56 84 0.071 1.000 0.133 0.333 0.800 0.471 9 labels readable distance 10 20 30 0.233 1.000 0.378 0.333 0.833 0.476 10 landing gear manual extension control design 27 54 81 0.037 1.000 0.071 0.333 0.563 0.419 11 negative transfer issues 2 4 6 0.333 1.000 0.500 0.333 0.400 0.364 12 safety belt latch operation 37 74 111 0.171 1.000 0.292 0.333 0.685 0.448 13 side stick control considerations 21 42 63 0.000 0.000 0.000 0.333 0.700 0.452 14 text color contrast 38 76 114 0.167 1.000 0.286 0.333 0.704 0.452 281 Table I.4 Recall*, Precision, and F-measure* calculations for Intangible Concepts Query Sample Set at Sensitivity Level 2 (2X) where the number of unfound documents are assumed to be double the number of relevant documents identified by the Baseline and Enhanced search engines. 
282 Level 2 Sensitivity To Unfound Relevant Documents (2X) – Intangible Concepts Query Sample Set Baseline Query Topic (QT) Baseline + Enhanced Relevant Unfound Relevant Estimated Total Relevant Sensitivity Level 2 Recall* Enhanced Precision Sensitivity Level 2 F-measure* Sensitivity Level 2 Recall* Precision Sensitivity Level 2 F-measure* 1 acceptable message failure rate and pilots confidence in system 3 6 9 0.000 0.000 0.000 0.333 0.750 0.462 2 appropriate size of characters on display 56 112 168 0.077 1.000 0.144 0.333 1.000 0.500 3 arrangement of right seat instruments 19 38 57 0.105 1.000 0.190 0.333 0.826 0.475 4 control is identifiable in the dark 56 112 168 0.119 1.000 0.213 0.333 0.812 0.473 5 cultural conventions switch design 6 12 18 0.056 1.000 0.105 0.333 0.857 0.480 6 design attributes for auditory displays 7 14 21 0.190 1.000 0.320 0.333 1.000 0.500 7 ergonomics of pilot seating 2 4 6 0.000 0.000 0.000 0.333 1.000 0.500 8 excessive cognitive effort 10 20 30 0.133 1.000 0.235 0.333 1.000 0.500 9 how to ensure that labels are readable 16 32 48 0.000 0.000 0.000 0.333 0.800 0.471 10 how to improve situation awareness 10 20 30 0.067 1.000 0.125 0.333 1.000 0.500 11 how to provide unambiguous feedback 25 50 75 0.000 0.000 0.000 0.333 1.000 0.500 12 minimal mental processing 24 48 72 0.125 1.000 0.222 0.333 0.923 0.490 13 needs too much attention 24 48 72 0.056 1.000 0.105 0.333 0.923 0.490 14 preventing instrument reading errors 30 60 90 0.078 1.000 0.144 0.333 0.909 0.488 15 proper use of red and amber on displays 28 56 84 0.155 1.000 0.268 0.333 0.966 0.496 16 suitable menu navigation methods 3 6 9 0.000 0.000 0.000 0.333 0.273 0.300 283 Level 2 Sensitivity To Unfound Relevant Documents (2X) – Recall* Significance Calculations Recall* Values at Sensitivity Level 2 (2X) Baseline Enhanced 0.254 0.150 0.083 0.000 0.080 0.091 0.160 0.071 0.233 0.037 0.333 0.171 0.000 0.167 0.000 0.077 0.105 0.119 0.056 0.190 0.000 0.133 0.000 0.067 0.000 0.125 0.056 0.078 0.155 0.000 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 0.333 Figure I.3 Two Sample t-Test Assuming Equal Variances Statistical Measures Mean Variance Observations Pooled Variance t Stat P(T<=t) one-tail t Critical one-tail P(T<=t) two-tail t Critical two-tail Baseline 0.0997 0.0071 30 Enhanced 0.3333 0.0000 30 0.0036 -15.1644 3.830E-22 1.6716 7.66E-22 2.0017 Interpretation of Statistical Test: The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the recall* of the Baseline and the Enhanced search engines at Sensitivity Level 2 (2X). Because the absolute value of the t Stat is greater than t Critical two-tail value, we can reject the Null Hypothesis. We can draw this same conclusion by looking at the P(T<=t) two-tail that represents the probability that the Null Hypothesis is true. The P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically significant difference between the recall* at Sensitivity Level 2 (2X) of the Baseline and Enhanced search engines. Two Sample t-Test to test the difference of the recall* at Sensitivity Level 2 (2X) between the Baseline and Enhanced search engines. 
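The adjusted recall* values feeding these tests follow the estimation rule described at the start of this appendix. The sketch below is a minimal illustration under that description; the helper name is hypothetical, and the rounding matches the stated convention of rounding up to the next whole number.

```python
import math

def recall_with_unfound(relevant_found_by_engine, pooled_relevant, multiplier):
    """Recompute recall* after assuming additional unfound relevant documents.

    multiplier is 0.25, 2, or 10 for Sensitivity Levels 1, 2, and 3; the estimate is
    rounded up to the next whole number and added to the pooled relevant count."""
    estimated_unfound = math.ceil(multiplier * pooled_relevant)
    estimated_total_relevant = pooled_relevant + estimated_unfound
    return relevant_found_by_engine / estimated_total_relevant

# Query topic 1 at Sensitivity Level 2 (2X): 21 pooled relevant documents,
# of which the Baseline found 16 and the Enhanced found 21 (cf. Table I.3).
print(recall_with_unfound(16, 21, 2))   # roughly 0.254
print(recall_with_unfound(21, 21, 2))   # roughly 0.333
```

Precision is unaffected by the unfound-document assumption, so only recall* and, through it, F-measure* change at each sensitivity level.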
284 Level 2 Sensitivity To Unfound Relevant Documents (2X) – F-measure* Significance Calculations F-measure* Values at Sensitivity Level 2 (2X) Baseline 0.405 0.261 0.154 0.000 0.148 0.167 0.276 0.133 0.378 0.071 0.500 0.292 0.000 0.286 0.000 0.144 0.190 0.213 0.105 0.320 0.000 0.235 0.000 0.125 0.000 0.222 0.105 0.144 0.268 0.000 Figure I.4 Enhanced 0.500 0.482 0.308 0.292 0.462 0.500 0.420 0.471 0.476 0.419 0.364 0.448 0.452 0.452 0.462 0.500 0.475 0.473 0.480 0.500 0.500 0.500 0.471 0.500 0.500 0.490 0.490 0.488 0.496 0.300 Two Sample t-Test Assuming Equal Variances Statistical Measures Mean Variance Observations Pooled Variance t Stat P(T<=t) one-tail t Critical one-tail P(T<=t) two-tail t Critical two-tail Baseline 0.1714 0.0180 30 Enhanced 0.4556 0.0037 30 0.0109 -10.5565 2.007E-15 1.6716 4.01E-15 2.0017 Interpretation of Statistical Test: The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the performance of the Baseline and the Enhanced search engines at Sensitivity Level 2 (2X). Search performance is determined using the F-measure* at Sensitivity Level 2 (2X) where it is assumed that there are additional unfound relevant documents. Because the absolute value of the t Stat is greater than t Critical two-tail value, we can reject the Null Hypothesis. We can draw this same conclusion by looking at the P(T<=t) two-tail that represents the probability that the Null Hypothesis is true. The P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically significant difference between the Baseline and Enhanced search engine F-measure* at Sensitivity Level 2 (2X) values. The F-measure* value of the Enhanced search engine is statistically greater than the Baseline and, therefore, performs better than the Baseline search engine. This is consistent with the conclusion drawn without the addition of estimated unfound relevant documents. Two Sample t-Test to test the difference of the F-measure* at Sensitivity Level 2 (2X) between the Baseline and Enhanced search engines. 285 I.3. Level 3 Sensitivity To Unfound Relevant Documents (10X) At the third sensitivity level, the unfound relevant documents were estimated to be ten times the number of relevant documents identified by the Baseline and Enhanced search engines. Table I.5 and I.6 provide the data and calculations performed using the unfound relevant document estimates for Sensitivity Level 3 for the Tangible Concept Query Sample Set and the Intangible Concept Query Sample Set. The data and results of the statistical significance test at Sensitivity Level 3 are presented in Figure I.5 for recall* and in Figure I.6 for F-measure*. 286 Table I.5 Recall*, Precision, and F-measure* calculations for Tangible Concepts Query Sample Set at Sensitivity Level 3 (10X) where the number of unfound documents are assumed to be ten times the number of relevant documents identified by the Baseline and Enhanced search engines. 
287 Level 3 Sensitivity To Unfound Relevant Documents (10X) – Tangible Concepts Query Sample Set Baseline Query Topic (QT) Baseline + Enhanced Relevant Estimated Total Relevant Sensitivity Level 3 Recall* Unfound Relevant Precision 210 231 0.069 Enhanced Sensitivity Level 3 F-measure* Sensitivity Level 3 Recall* Precision Sensitivity Level 3 F-measure* 1.000 0.130 0.091 1.000 0.167 1 auxiliary power unit fire extinguishing 21 2 false resolution advisory 20 200 220 0.041 1.000 0.079 0.091 0.870 0.165 3 fault tolerant data entry 4 40 44 0.023 1.000 0.044 0.091 0.286 0.138 4 hydraulic system status messages 7 70 77 0.000 0.000 0.000 0.091 0.259 0.135 5 icing conditions operating speeds provided in AFM 46 460 506 0.022 1.000 0.043 0.091 0.754 0.162 6 information presented in peripheral visual field 11 110 121 0.025 1.000 0.048 0.091 1.000 0.167 7 information readable with vibration 25 250 275 0.044 1.000 0.084 0.091 0.568 0.157 8 instruments located in normal line of sight 28 280 308 0.019 1.000 0.038 0.091 0.800 0.163 9 labels readable distance 10 100 110 0.064 1.000 0.120 0.091 0.833 0.164 10 landing gear manual extension control design 27 270 297 0.010 1.000 0.020 0.091 0.563 0.157 11 negative transfer issues 2 20 22 0.091 1.000 0.167 0.091 0.400 0.148 12 safety belt latch operation 37 370 407 0.047 1.000 0.089 0.091 0.685 0.161 13 side stick control considerations 21 210 231 0.000 0.000 0.000 0.091 0.700 0.161 14 text color contrast 38 380 418 0.045 1.000 0.087 0.091 0.704 0.161 288 Table I.6 Recall*, Precision, and F-measure* calculations for Intangible Concepts Query Sample Set at Sensitivity Level 3 (10X) where the number of unfound documents are assumed to be ten times the number of relevant documents identified by the Baseline and Enhanced search engines. 
289 Level 3 Sensitivity To Unfound Relevant Documents (10X) – Intangible Concepts Query Sample Set Baseline Query Topic (QT) Baseline + Enhanced Relevant Unfound Relevant Estimated Total Relevant Sensitivity Level 3 Recall* 30 33 0.000 Enhanced Precision Sensitivity Level 3 F-measure* Sensitivity Level 3 Recall* Precision Sensitivity Level 3 F-measure* 0.000 0.000 0.091 0.750 0.162 1 acceptable message failure rate and pilots confidence in system 3 2 appropriate size of characters on display 56 560 616 0.021 1.000 0.041 0.091 1.000 0.167 3 arrangement of right seat instruments 19 190 209 0.029 1.000 0.056 0.091 0.826 0.164 4 control is identifiable in the dark 56 560 616 0.032 1.000 0.063 0.091 0.812 0.164 5 cultural conventions switch design 6 60 66 0.015 1.000 0.030 0.091 0.857 0.164 6 design attributes for auditory displays 7 70 77 0.052 1.000 0.099 0.091 1.000 0.167 7 ergonomics of pilot seating 2 20 22 0.000 0.000 0.000 0.091 1.000 0.167 8 excessive cognitive effort 10 100 110 0.036 1.000 0.070 0.091 1.000 0.167 9 how to ensure that labels are readable 16 160 176 0.000 0.000 0.000 0.091 0.800 0.163 10 how to improve situation awareness 10 100 110 0.018 1.000 0.036 0.091 1.000 0.167 11 how to provide unambiguous feedback 25 250 275 0.000 0.000 0.000 0.091 1.000 0.167 12 minimal mental processing 24 240 264 0.034 1.000 0.066 0.091 0.923 0.166 13 needs too much attention 24 240 264 0.015 1.000 0.030 0.091 0.923 0.166 14 preventing instrument reading errors 30 300 330 0.021 1.000 0.042 0.091 0.909 0.165 15 proper use of red and amber on displays 28 280 308 0.042 1.000 0.081 0.091 0.966 0.166 16 suitable menu navigation methods 3 30 33 0.000 0.000 0.000 0.091 0.273 0.136 290 Level 3 Sensitivity To Unfound Relevant Documents (10X) – Recall* Significance Calculations Recall* Values at Sensitivity Level 1 (0.25X) Baseline Enhanced 0.069 0.041 0.023 0.000 0.022 0.025 0.044 0.019 0.064 0.010 0.091 0.047 0.000 0.045 0.000 0.021 0.029 0.032 0.015 0.052 0.000 0.036 0.000 0.018 0.000 0.034 0.015 0.021 0.042 0.000 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 Figure I.5 Two Sample t-Test Assuming Equal Variances Statistical Measures Mean Variance Observations Pooled Variance t Stat P(T<=t) one-tail t Critical one-tail P(T<=t) two-tail t Critical two-tail Baseline 0.0272 0.0005 30 Enhanced 0.090 1.793E-3 3 0.0003 -15.1644 3.830E-22 1.6716 7.66E-22 2.0017 Interpretation of Statistical Test: The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the recall* of the Baseline and the Enhanced search engines at Sensitivity Level 3 (10X). Because the absolute value of the t Stat is greater than t Critical two-tail value, we can reject the Null Hypothesis. We can draw this same conclusion by looking at the P(T<=t) two-tail that represents the probability that the Null Hypothesis is true. The P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically significant difference between the recall* at Sensitivity Level 3 (10X) of the Baseline and Enhanced search engines. Two Sample t-Test to test the difference of the recall* at Sensitivity Level 3 (10X) between the Baseline and Enhanced search engines. 
291 Level 3 Sensitivity To Unfound Relevant Documents (10X) – F-measure* Significance Calculations F-measure* Values at Sensitivity Level 3 (10X) Baseline 0.130 0.079 0.044 0.000 0.043 0.048 0.084 0.038 0.120 0.020 0.167 0.089 0.000 0.087 0.000 0.041 0.056 0.063 0.030 0.099 0.000 0.070 0.000 0.036 0.000 0.066 0.030 0.042 0.081 0.000 Figure I.6 Enhanced 0.167 0.165 0.138 0.135 0.162 0.167 0.157 0.163 0.164 0.157 0.148 0.161 0.161 0.161 0.162 0.167 0.164 0.164 0.164 0.167 0.167 0.167 0.163 0.167 0.167 0.166 0.166 0.165 0.166 0.136 Two Sample t-Test Assuming Equal Variances Statistical Measures Mean Variance Observations Pooled Variance t Stat P(T<=t) one-tail t Critical one-tail P(T<=t) two-tail t Critical two-tail Baseline 0.0520 0.0018 30 Enhanced 0.1607 0.0001 30 0.0010 -13.5466 6.441E-20 1.6716 1.29E-19 2.0017 Interpretation of Statistical Test: The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the performance of the Baseline and the Enhanced search engines at Sensitivity Level 3 (10X). Search performance is determined using the F-measure* at Sensitivity Level 3 (10X) where it is assumed that there are additional unfound relevant documents. Because the absolute value of the t Stat is greater than t Critical two-tail value, we can reject the Null Hypothesis. We can draw this same conclusion by looking at the P(T<=t) two-tail that represents the probability that the Null Hypothesis is true. The P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically significant difference between the Baseline and Enhanced search engine F-measure* at Sensitivity Level 3 (10X) values. The F-measure* value of the Enhanced search engine is statistically greater than the Baseline and, therefore, performs better than the Baseline search engine. This is consistent with the conclusion drawn without the addition of estimated unfound relevant documents. Two Sample t-Test to test the difference of the F-measure* at Sensitivity Level 3 (10X) between the Baseline and Enhanced search engines. 292 APPENDIX J. SENSITIVITY ANALYSIS FOR IMPACT OF RELEVANCY ASSUMPTIONS This appendix contains the data and calculations used to perform the sensitivity analysis for documents assumed relevant. Four different levels of sensitivity were analyzed to assess the impact of document relevancy assumptions on the performance measures used in this experiment. To perform the sensitivity analysis, estimates of the number of relevant documents in the set of documents returned by the Baseline and Enhanced to be assumed relevant were calculated based on the level of sensitivity. The estimates were calculated in two different ways. For sensitivity levels 1, 2, and 3, the estimated total relevant documents returned by both the Baseline and the Enhanced search engines (and therefore assumed relevant in the modified pooling adjudication method), was calculated by assuming that a certain percentage of the documents returned by both the Baseline and Enhanced search engines were non-relevant for. For sensitivity level 4, the estimated total relevant documents was determined using results generated by Google Desktop. The estimates for this level of sensitivity were generated by determining the number of documents returned by the Baseline, the Enhanced, and the Google Desktop search engines. 
Documents returned by all three search engines were assumed to be relevant at the fourth sensitivity level, and their count therefore served as the number of documents assumed relevant in the modified pooling adjudication method.

Using the estimated total relevant documents to be assumed relevant for the given sensitivity level, the recall* was recalculated for each of the 30 sample query topics. Next, a Two Sample t-Test Assuming Equal Variances was run with a significance level of α=0.05 to determine whether the difference in the recall* between the Baseline and the Enhanced search engines was statistically significant. Finally, the conclusion drawn about the significance of the difference in recall* values was compared to the conclusion drawn in the original calculations (i.e., those performed without the document relevancy assumption adjustments).

Next, the estimated total relevant documents to be assumed relevant for the given sensitivity level were used to calculate the F-measure* for each of the 30 sample query topics. A Two Sample t-Test Assuming Equal Variances was then run with a significance level of α=0.05 to determine whether the difference in the F-measure* between the Baseline and the Enhanced search engines was statistically significant. Finally, the conclusion drawn about the significance of the difference in F-measure* values was compared to the conclusion drawn in the original calculations (i.e., those performed without the document relevancy assumption adjustments).

The data and calculations for each of the four sensitivity levels are presented in the following sections.

J.1. Level 1 Sensitivity To Relevancy Assumptions (25% Non-Relevant)

At the first sensitivity level, the documents to be assumed relevant were estimated by assuming that 25% of the documents returned by both the Baseline and the Enhanced search engines were non-relevant. Tables J.1 and J.2 provide the data and calculations performed using the document relevancy assumption estimates for Sensitivity Level 1 for the Tangible Concepts Query Sample Set and the Intangible Concepts Query Sample Set. The data and results of the statistical significance test at Sensitivity Level 1 are presented in Figure J.1 for recall* and in Figure J.2 for F-measure*.

Table J.1 Recall*, Precision, and F-measure* calculations for the Tangible Concepts Query Sample Set at Level 1 Sensitivity to Relevancy Assumptions, where 25% of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant.
J.1. Level 1 Sensitivity To Relevancy Assumptions (25% Non-Relevant)

At the first sensitivity level, the documents to be assumed relevant were estimated by assuming that 25% of the documents returned by both the Baseline and the Enhanced search engines were non-relevant. Tables J.1 and J.2 provide the data and calculations performed using the document relevancy assumption estimates for Sensitivity Level 1 for the Tangible Concept Query Sample Set and the Intangible Concept Query Sample Set. The data and results of the statistical significance test at Sensitivity Level 1 are presented in Figure J.1 for recall* and in Figure J.2 for F-measure*.

Table J.1 Recall*, Precision, and F-measure* calculations for Tangible Concepts Query Sample Set at Level 1 Sensitivity to Relevancy Assumptions where 25% of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant.

Level 1 Sensitivity To Relevancy Assumptions (25%) – Tangible Concepts Query Sample Set
QT | Query Topic | Original Assumed Relevant | Estimated Non-Relevant | Estimated Assumed Relevant | Baseline Recall*, Precision, F-measure* | Enhanced Recall*, Precision, F-measure*
1 | auxiliary power unit fire extinguishing | 16 | 4 | 12 | 0.706, 0.750, 0.727 | 1.000, 0.810, 0.895
2 | false resolution advisory | 9 | 3 | 6 | 0.353, 0.667, 0.462 | 1.000, 0.739, 0.850
3 | fault tolerant data entry | 1 | 1 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.214, 0.353
4 | hydraulic system status messages | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.259, 0.412
5 | icing conditions operating speeds provided in AFM | 11 | 3 | 8 | 0.186, 0.727, 0.296 | 1.000, 0.705, 0.827
6 | information presented in peripheral visual field | 3 | 1 | 2 | 0.200, 0.667, 0.308 | 1.000, 0.909, 0.952
7 | information readable with vibration | 12 | 3 | 9 | 0.409, 0.750, 0.529 | 1.000, 0.500, 0.667
8 | instruments located in normal line of sight | 6 | 2 | 4 | 0.154, 0.667, 0.250 | 1.000, 0.743, 0.852
9 | labels readable distance | 7 | 2 | 5 | 0.625, 0.714, 0.667 | 1.000, 0.667, 0.800
10 | landing gear manual extension control design | 3 | 1 | 2 | 0.077, 0.667, 0.138 | 1.000, 0.542, 0.703
11 | negative transfer issues | 2 | 1 | 1 | 1.000, 0.500, 0.667 | 1.000, 0.200, 0.333
12 | safety belt latch operation | 19 | 5 | 14 | 0.438, 0.737, 0.549 | 1.000, 0.593, 0.744
13 | side stick control considerations | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.700, 0.824
14 | text color contrast | 19 | 5 | 14 | 0.424, 0.737, 0.538 | 1.000, 0.611, 0.759

Table J.2 Recall*, Precision, and F-measure* calculations for Intangible Concepts Query Sample Set at Level 1 Sensitivity to Relevancy Assumptions where 25% of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant.

Level 1 Sensitivity To Relevancy Assumptions (25%) – Intangible Concepts Query Sample Set
QT | Query Topic | Original Assumed Relevant | Estimated Non-Relevant | Estimated Assumed Relevant | Baseline Recall*, Precision, F-measure* | Enhanced Recall*, Precision, F-measure*
1 | acceptable message failure rate and pilots confidence in system | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.750, 0.857
2 | appropriate size of characters on display | 13 | 4 | 9 | 0.173, 0.692, 0.277 | 1.000, 0.929, 0.963
3 | arrangement of right seat instruments | 6 | 2 | 4 | 0.235, 0.667, 0.348 | 1.000, 0.739, 0.850
4 | control is identifiable in the dark | 20 | 5 | 15 | 0.294, 0.750, 0.423 | 1.000, 0.739, 0.850
5 | cultural conventions switch design | 1 | 1 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.714, 0.833
6 | design attributes for auditory displays | 4 | 1 | 3 | 0.500, 0.750, 0.600 | 1.000, 0.857, 0.923
7 | ergonomics of pilot seating | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 1.000, 1.000
8 | excessive cognitive effort | 4 | 1 | 3 | 0.333, 0.750, 0.462 | 1.000, 0.900, 0.947
9 | how to ensure that labels are readable | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.800, 0.889
10 | how to improve situation awareness | 2 | 1 | 1 | 0.111, 0.500, 0.182 | 1.000, 0.900, 0.947
11 | how to provide unambiguous feedback | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 1.000, 1.000
12 | minimal mental processing | 10 | 3 | 7 | 0.318, 0.700, 0.438 | 1.000, 0.846, 0.917
13 | needs too much attention | 4 | 1 | 3 | 0.130, 0.750, 0.222 | 1.000, 0.885, 0.939
14 | preventing instrument reading errors | 7 | 2 | 5 | 0.179, 0.714, 0.286 | 1.000, 0.848, 0.918
15 | proper use of red and amber on displays | 13 | 4 | 9 | 0.375, 0.692, 0.486 | 1.000, 0.828, 0.906
16 | suitable menu navigation methods | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.273, 0.429
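As a worked check on how the columns in Tables J.1 and J.2 relate to one another, the tabulated F-measure* values are consistent with the balanced harmonic mean of precision and recall* (the asterisk indicating that the estimated, rather than fully adjudicated, set of relevant documents is used). For query topic 1 of the Tangible Concepts set, for example:

    F^{*} \;=\; \frac{2\,P\,R^{*}}{P + R^{*}}
          \;=\; \frac{2 \times 0.750 \times 0.706}{0.750 + 0.706}
          \;\approx\; 0.727

The same relationship appears to reproduce the F-measure* columns of the remaining sensitivity-level tables (J.3 through J.8) from their precision and recall* columns.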
Level 1 Sensitivity To Relevancy Assumptions (25%) – Recall* Significance Calculations

Recall* values at Sensitivity Level 1 (25% assumed non-relevant), 30 query topics:
Baseline: 0.706, 0.353, 0.000, 0.000, 0.186, 0.200, 0.409, 0.154, 0.625, 0.077, 1.000, 0.438, 0.000, 0.424, 0.000, 0.173, 0.235, 0.294, 0.000, 0.500, 0.000, 0.333, 0.000, 0.111, 0.000, 0.318, 0.130, 0.179, 0.375, 0.000
Enhanced: 1.000 for all 30 query topics

Two Sample t-Test Assuming Equal Variances
Statistical Measure     Baseline    Enhanced
Mean                    0.2407      1.0000
Variance                0.0605      0.0000
Observations            30          30
Pooled Variance         0.0303
t Stat                  -16.9082
P(T<=t) one-tail        2.184E-24
t Critical one-tail     1.6716
P(T<=t) two-tail        4.37E-24
t Critical two-tail     2.0017

Interpretation of Statistical Test: The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the recall* of the Baseline and the Enhanced search engines at Level 1 Sensitivity to Relevancy Assumptions (25%). Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. We can draw the same conclusion from the P(T<=t) two-tail value, which gives the probability of observing a difference at least as large as the one observed if the Null Hypothesis were true. The P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically significant difference between the recall* values of the Baseline and Enhanced search engines at Level 1 Sensitivity to Relevancy Assumptions (25%).

Figure J.1 Two Sample t-Test to test the difference of the recall* at Level 1 Sensitivity to Relevancy Assumptions (25%) between the Baseline and Enhanced search engines.
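As a worked check on the t statistic reported in Figure J.1, substituting the rounded means, pooled variance, and sample sizes shown above into the equal-variance two-sample formula gives:

    t \;=\; \frac{\bar{x}_{\mathrm{Baseline}} - \bar{x}_{\mathrm{Enhanced}}}
                 {\sqrt{s_p^{2}\left(\frac{1}{n_{1}} + \frac{1}{n_{2}}\right)}}
      \;=\; \frac{0.2407 - 1.0000}{\sqrt{0.0303\left(\frac{1}{30} + \frac{1}{30}\right)}}
      \;\approx\; -16.9

The small discrepancy from the reported value of -16.9082 comes only from using the rounded figures displayed above; the same substitution applies to the t statistics reported in Figures J.2 through J.8.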
Level 1 Sensitivity To Relevancy Assumptions (25%) – F-measure* Significance Calculations

F-measure* values at Sensitivity Level 1 (25% assumed non-relevant), 30 query topics:
Baseline: 0.727, 0.462, 0.000, 0.000, 0.296, 0.308, 0.529, 0.250, 0.667, 0.138, 0.667, 0.549, 0.000, 0.538, 0.000, 0.277, 0.348, 0.423, 0.000, 0.600, 0.000, 0.462, 0.000, 0.182, 0.000, 0.438, 0.222, 0.286, 0.486, 0.000
Enhanced: 0.895, 0.850, 0.353, 0.412, 0.827, 0.952, 0.667, 0.852, 0.800, 0.703, 0.333, 0.744, 0.824, 0.759, 0.857, 0.963, 0.850, 0.850, 0.833, 0.923, 1.000, 0.947, 0.889, 0.947, 1.000, 0.917, 0.939, 0.918, 0.906, 0.429

Two Sample t-Test Assuming Equal Variances
Statistical Measure     Baseline    Enhanced
Mean                    0.2951      0.8046
Variance                0.0584      0.0351
Observations            30          30
Pooled Variance         0.0468
t Stat                  -9.1255
P(T<=t) one-tail        4.118E-13
t Critical one-tail     1.6716
P(T<=t) two-tail        8.24E-13
t Critical two-tail     2.0017

Interpretation of Statistical Test: The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the performance of the Baseline and the Enhanced search engines at Sensitivity Level 1 (25% non-relevant). Search performance is determined using the F-measure* at Sensitivity Level 1, where it is assumed that 25% of the documents returned by both the Baseline and the Enhanced search engines are non-relevant. Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. We can draw the same conclusion from the P(T<=t) two-tail value, which gives the probability of observing a difference at least as large as the one observed if the Null Hypothesis were true. The P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically significant difference between the Baseline and Enhanced search engine F-measure* values at Sensitivity Level 1 (25% non-relevant). The F-measure* value of the Enhanced search engine is statistically greater than that of the Baseline and, therefore, the Enhanced search engine performs better than the Baseline search engine. This is consistent with the conclusion drawn in the original calculations (i.e., assuming that all documents returned by both the Baseline and the Enhanced search engines are relevant).

Figure J.2 Two Sample t-Test to test the difference of the F-measure* at Level 1 Sensitivity to Relevancy Assumptions (25% non-relevant) between the Baseline and Enhanced search engines.

J.2. Level 2 Sensitivity To Relevancy Assumptions (50% Non-Relevant)

At the second sensitivity level, the documents to be assumed relevant were estimated by assuming that 50% of the documents returned by both the Baseline and the Enhanced search engines were non-relevant. Tables J.3 and J.4 provide the data and calculations performed using the document relevancy assumption estimates for Sensitivity Level 2 for the Tangible Concept Query Sample Set and the Intangible Concept Query Sample Set. The data and results of the statistical significance test at Sensitivity Level 2 are presented in Figure J.3 for recall* and in Figure J.4 for F-measure*.

Table J.3 Recall*, Precision, and F-measure* calculations for Tangible Concepts Query Sample Set at Level 2 Sensitivity to Relevancy Assumptions where 50% of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant.

Level 2 Sensitivity To Relevancy Assumptions (50%) – Tangible Concepts Query Sample Set
QT | Query Topic | Original Assumed Relevant | Estimated Non-Relevant | Estimated Assumed Relevant | Baseline Recall*, Precision, F-measure* | Enhanced Recall*, Precision, F-measure*
1 | auxiliary power unit fire extinguishing | 16 | 8 | 8 | 0.615, 0.500, 0.552 | 1.000, 0.619, 0.765
2 | false resolution advisory | 9 | 5 | 4 | 0.267, 0.444, 0.333 | 1.000, 0.652, 0.789
3 | fault tolerant data entry | 1 | 1 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.214, 0.353
4 | hydraulic system status messages | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.259, 0.412
5 | icing conditions operating speeds provided in AFM | 11 | 6 | 5 | 0.125, 0.455, 0.196 | 1.000, 0.656, 0.792
6 | information presented in peripheral visual field | 3 | 2 | 1 | 0.111, 0.333, 0.167 | 1.000, 0.818, 0.900
7 | information readable with vibration | 12 | 6 | 6 | 0.316, 0.500, 0.387 | 1.000, 0.432, 0.603
8 | instruments located in normal line of sight | 6 | 3 | 3 | 0.120, 0.500, 0.194 | 1.000, 0.714, 0.833
9 | labels readable distance | 7 | 4 | 3 | 0.500, 0.429, 0.462 | 1.000, 0.500, 0.667
10 | landing gear manual extension control design | 3 | 2 | 1 | 0.040, 0.333, 0.071 | 1.000, 0.521, 0.685
11 | negative transfer issues | 2 | 1 | 1 | 1.000, 0.500, 0.667 | 1.000, 0.200, 0.333
12 | safety belt latch operation | 19 | 10 | 9 | 0.333, 0.474, 0.391 | 1.000, 0.500, 0.667
13 | side stick control considerations | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.700, 0.824
14 | text color contrast | 19 | 10 | 9 | 0.321, 0.474, 0.383 | 1.000, 0.519, 0.683

Table J.4 Recall*, Precision, and F-measure* calculations for Intangible Concepts Query Sample Set at Level 2 Sensitivity to Relevancy Assumptions where 50% of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant.
Level 2 Sensitivity To Relevancy Assumptions (50%) – Intangible Concepts Query Sample Set
QT | Query Topic | Original Assumed Relevant | Estimated Non-Relevant | Estimated Assumed Relevant | Baseline Recall*, Precision, F-measure* | Enhanced Recall*, Precision, F-measure*
1 | acceptable message failure rate and pilots confidence in system | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.750, 0.857
2 | appropriate size of characters on display | 13 | 7 | 6 | 0.122, 0.462, 0.194 | 1.000, 0.875, 0.933
3 | arrangement of right seat instruments | 6 | 3 | 3 | 0.188, 0.500, 0.273 | 1.000, 0.696, 0.821
4 | control is identifiable in the dark | 20 | 10 | 10 | 0.217, 0.500, 0.303 | 1.000, 0.667, 0.800
5 | cultural conventions switch design | 1 | 1 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.714, 0.833
6 | design attributes for auditory displays | 4 | 2 | 2 | 0.400, 0.500, 0.444 | 1.000, 0.714, 0.833
7 | ergonomics of pilot seating | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 1.000, 1.000
8 | excessive cognitive effort | 4 | 2 | 2 | 0.250, 0.500, 0.333 | 1.000, 0.800, 0.889
9 | how to ensure that labels are readable | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.800, 0.889
10 | how to improve situation awareness | 2 | 1 | 1 | 0.111, 0.500, 0.182 | 1.000, 0.900, 0.947
11 | how to provide unambiguous feedback | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 1.000, 1.000
12 | minimal mental processing | 10 | 5 | 5 | 0.250, 0.500, 0.333 | 1.000, 0.769, 0.870
13 | needs too much attention | 4 | 2 | 2 | 0.091, 0.500, 0.154 | 1.000, 0.846, 0.917
14 | preventing instrument reading errors | 7 | 4 | 3 | 0.115, 0.429, 0.182 | 1.000, 0.788, 0.881
15 | proper use of red and amber on displays | 13 | 7 | 6 | 0.286, 0.462, 0.353 | 1.000, 0.724, 0.840
16 | suitable menu navigation methods | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.273, 0.429

Level 2 Sensitivity To Relevancy Assumptions (50%) – Recall* Significance Calculations

Recall* values at Sensitivity Level 2 (50% assumed non-relevant), 30 query topics:
Baseline: 0.615, 0.267, 0.000, 0.000, 0.125, 0.111, 0.316, 0.120, 0.500, 0.040, 1.000, 0.333, 0.000, 0.321, 0.000, 0.122, 0.188, 0.217, 0.000, 0.400, 0.000, 0.250, 0.000, 0.111, 0.000, 0.250, 0.091, 0.115, 0.286, 0.000
Enhanced: 1.000 for all 30 query topics

Two Sample t-Test Assuming Equal Variances
Statistical Measure     Baseline    Enhanced
Mean                    0.1926      1.0000
Variance                0.0497      0.0000
Observations            30          30
Pooled Variance         0.0249
t Stat                  -19.8269
P(T<=t) one-tail        8.157E-28
t Critical one-tail     1.6716
P(T<=t) two-tail        1.63E-27
t Critical two-tail     2.0017

Interpretation of Statistical Test: The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the recall* of the Baseline and the Enhanced search engines at Level 2 Sensitivity to Relevancy Assumptions (50%). Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. We can draw the same conclusion from the P(T<=t) two-tail value, which gives the probability of observing a difference at least as large as the one observed if the Null Hypothesis were true. The P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically significant difference between the recall* values of the Baseline and Enhanced search engines at Level 2 Sensitivity to Relevancy Assumptions (50%).

Figure J.3 Two Sample t-Test to test the difference of the recall* at Level 2 Sensitivity to Relevancy Assumptions (50%) between the Baseline and Enhanced search engines.
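The critical values quoted in these figures follow from the test's degrees of freedom (n1 + n2 - 2 = 58 for two samples of 30 query topics) and α = 0.05. As a quick check, the following one-liner reproduces them; the scipy call is an illustrative assumption about tooling, not part of the original analysis.

    from scipy import stats

    df = 30 + 30 - 2   # two samples of 30 query topics each
    alpha = 0.05
    print(round(stats.t.ppf(1 - alpha, df), 4))      # 1.6716, the one-tail critical value
    print(round(stats.t.ppf(1 - alpha / 2, df), 4))  # 2.0017, the two-tail critical value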
Level 2 Sensitivity To Relevancy Assumptions (50%) – F-measure* Significance Calculations

F-measure* values at Sensitivity Level 2 (50% assumed non-relevant), 30 query topics:
Baseline: 0.552, 0.333, 0.000, 0.000, 0.196, 0.167, 0.387, 0.194, 0.462, 0.071, 0.667, 0.391, 0.000, 0.383, 0.000, 0.194, 0.273, 0.303, 0.000, 0.444, 0.000, 0.333, 0.000, 0.182, 0.000, 0.333, 0.154, 0.182, 0.353, 0.000
Enhanced: 0.765, 0.789, 0.353, 0.412, 0.792, 0.900, 0.603, 0.833, 0.667, 0.685, 0.333, 0.667, 0.824, 0.683, 0.857, 0.933, 0.821, 0.800, 0.833, 0.833, 1.000, 0.889, 0.889, 0.947, 1.000, 0.870, 0.917, 0.881, 0.840, 0.429

Two Sample t-Test Assuming Equal Variances
Statistical Measure     Baseline    Enhanced
Mean                    0.2184      0.7681
Variance                0.0358      0.0330
Observations            30          30
Pooled Variance         0.0344
t Stat                  -11.4772
P(T<=t) one-tail        7.364E-17
t Critical one-tail     1.6716
P(T<=t) two-tail        1.47E-16
t Critical two-tail     2.0017

Interpretation of Statistical Test: The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the performance of the Baseline and the Enhanced search engines at Sensitivity Level 2 (50% non-relevant). Search performance is determined using the F-measure* at Sensitivity Level 2, where it is assumed that 50% of the documents returned by both the Baseline and the Enhanced search engines are non-relevant. Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. We can draw the same conclusion from the P(T<=t) two-tail value, which gives the probability of observing a difference at least as large as the one observed if the Null Hypothesis were true. The P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically significant difference between the Baseline and Enhanced search engine F-measure* values at Sensitivity Level 2 (50% non-relevant). The F-measure* value of the Enhanced search engine is statistically greater than that of the Baseline and, therefore, the Enhanced search engine performs better than the Baseline search engine. This is consistent with the conclusion drawn in the original calculations (i.e., assuming that all documents returned by both the Baseline and the Enhanced search engines are relevant).

Figure J.4 Two Sample t-Test to test the difference of the F-measure* at Level 2 Sensitivity to Relevancy Assumptions (50% non-relevant) between the Baseline and Enhanced search engines.

J.3. Level 3 Sensitivity To Relevancy Assumptions (75% Non-Relevant)

At the third sensitivity level, the documents to be assumed relevant were estimated by assuming that 75% of the documents returned by both the Baseline and the Enhanced search engines were non-relevant. Tables J.5 and J.6 provide the data and calculations performed using the document relevancy assumption estimates for Sensitivity Level 3 for the Tangible Concept Query Sample Set and the Intangible Concept Query Sample Set. The data and results of the statistical significance test at Sensitivity Level 3 are presented in Figure J.5 for recall* and in Figure J.6 for F-measure*.

Table J.5 Recall*, Precision, and F-measure* calculations for Tangible Concepts Query Sample Set at Level 3 Sensitivity to Relevancy Assumptions where 75% of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant.
Level 3 Sensitivity To Relevancy Assumptions (75%) – Tangible Concepts Query Sample Set
QT | Query Topic | Original Assumed Relevant | Estimated Non-Relevant | Estimated Assumed Relevant | Baseline Recall*, Precision, F-measure* | Enhanced Recall*, Precision, F-measure*
1 | auxiliary power unit fire extinguishing | 16 | 12 | 4 | 0.444, 0.250, 0.320 | 1.000, 0.429, 0.600
2 | false resolution advisory | 9 | 7 | 2 | 0.154, 0.222, 0.182 | 1.000, 0.565, 0.722
3 | fault tolerant data entry | 1 | 1 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.214, 0.353
4 | hydraulic system status messages | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.259, 0.412
5 | icing conditions operating speeds provided in AFM | 11 | 9 | 2 | 0.054, 0.182, 0.083 | 1.000, 0.607, 0.755
6 | information presented in peripheral visual field | 3 | 3 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.727, 0.842
7 | information readable with vibration | 12 | 9 | 3 | 0.188, 0.250, 0.214 | 1.000, 0.364, 0.533
8 | instruments located in normal line of sight | 6 | 5 | 1 | 0.043, 0.167, 0.069 | 1.000, 0.657, 0.793
9 | labels readable distance | 7 | 6 | 1 | 0.250, 0.143, 0.182 | 1.000, 0.333, 0.500
10 | landing gear manual extension control design | 3 | 3 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.500, 0.667
11 | negative transfer issues | 2 | 2 | 0 | 0.000, 0.000, 0.000 | 0.000, 0.000, 0.000
12 | safety belt latch operation | 19 | 15 | 4 | 0.182, 0.211, 0.195 | 1.000, 0.407, 0.579
13 | side stick control considerations | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.700, 0.824
14 | text color contrast | 19 | 15 | 4 | 0.174, 0.211, 0.190 | 1.000, 0.426, 0.597

Table J.6 Recall*, Precision, and F-measure* calculations for Intangible Concepts Query Sample Set at Level 3 Sensitivity to Relevancy Assumptions where 75% of the documents returned by both the Baseline and the Enhanced search engines were assumed to be non-relevant.

Level 3 Sensitivity To Relevancy Assumptions (75%) – Intangible Concepts Query Sample Set
QT | Query Topic | Original Assumed Relevant | Estimated Non-Relevant | Estimated Assumed Relevant | Baseline Recall*, Precision, F-measure* | Enhanced Recall*, Precision, F-measure*
1 | acceptable message failure rate and pilots confidence in system | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.750, 0.857
2 | appropriate size of characters on display | 13 | 10 | 3 | 0.065, 0.231, 0.102 | 1.000, 0.821, 0.902
3 | arrangement of right seat instruments | 6 | 5 | 1 | 0.071, 0.167, 0.100 | 1.000, 0.609, 0.757
4 | control is identifiable in the dark | 20 | 15 | 5 | 0.122, 0.250, 0.164 | 1.000, 0.594, 0.745
5 | cultural conventions switch design | 1 | 1 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.714, 0.833
6 | design attributes for auditory displays | 4 | 3 | 1 | 0.250, 0.250, 0.250 | 1.000, 0.571, 0.727
7 | ergonomics of pilot seating | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 1.000, 1.000
8 | excessive cognitive effort | 4 | 3 | 1 | 0.143, 0.250, 0.182 | 1.000, 0.700, 0.824
9 | how to ensure that labels are readable | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.800, 0.889
10 | how to improve situation awareness | 2 | 2 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.800, 0.889
11 | how to provide unambiguous feedback | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 1.000, 1.000
12 | minimal mental processing | 10 | 8 | 2 | 0.118, 0.200, 0.148 | 1.000, 0.654, 0.791
13 | needs too much attention | 4 | 3 | 1 | 0.048, 0.250, 0.080 | 1.000, 0.808, 0.894
14 | preventing instrument reading errors | 7 | 6 | 1 | 0.042, 0.143, 0.065 | 1.000, 0.727, 0.842
15 | proper use of red and amber on displays | 13 | 10 | 3 | 0.167, 0.231, 0.194 | 1.000, 0.621, 0.766
16 | suitable menu navigation methods | 0 | 0 | 0 | 0.000, 0.000, 0.000 | 1.000, 0.273, 0.429
Level 3 Sensitivity To Relevancy Assumptions (75%) – Recall* Significance Calculations

Recall* values at Sensitivity Level 3 (75% assumed non-relevant), 30 query topics:
Baseline: 0.444, 0.154, 0.000, 0.000, 0.054, 0.000, 0.188, 0.043, 0.250, 0.000, 0.000, 0.182, 0.000, 0.174, 0.000, 0.065, 0.071, 0.122, 0.000, 0.250, 0.000, 0.143, 0.000, 0.000, 0.000, 0.118, 0.048, 0.042, 0.167, 0.000
Enhanced: 1.000 for all query topics except the eleventh value (query topic 11 of the Tangible set, negative transfer issues), which is 0.000

Two Sample t-Test Assuming Equal Variances
Statistical Measure     Baseline    Enhanced
Mean                    0.0838      0.9667
Variance                0.0113      0.0333
Observations            30          30
Pooled Variance         0.0223
t Stat                  -22.8770
P(T<=t) one-tail        5.127E-31
t Critical one-tail     1.6716
P(T<=t) two-tail        1.03E-30
t Critical two-tail     2.0017

Interpretation of Statistical Test: The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the recall* of the Baseline and the Enhanced search engines at Level 3 Sensitivity to Relevancy Assumptions (75%). Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. We can draw the same conclusion from the P(T<=t) two-tail value, which gives the probability of observing a difference at least as large as the one observed if the Null Hypothesis were true. The P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically significant difference between the recall* values of the Baseline and Enhanced search engines at Level 3 Sensitivity to Relevancy Assumptions (75%).

Figure J.5 Two Sample t-Test to test the difference of the recall* at Level 3 Sensitivity to Relevancy Assumptions (75%) between the Baseline and Enhanced search engines.

Level 3 Sensitivity To Relevancy Assumptions (75%) – F-measure* Significance Calculations

F-measure* values at Sensitivity Level 3 (75% assumed non-relevant), 30 query topics:
Baseline: 0.320, 0.182, 0.000, 0.000, 0.083, 0.000, 0.214, 0.069, 0.182, 0.000, 0.000, 0.195, 0.000, 0.190, 0.000, 0.102, 0.100, 0.164, 0.000, 0.250, 0.000, 0.182, 0.000, 0.000, 0.000, 0.148, 0.080, 0.065, 0.194, 0.000
Enhanced: 0.600, 0.722, 0.353, 0.412, 0.755, 0.842, 0.533, 0.793, 0.500, 0.667, 0.000, 0.579, 0.824, 0.597, 0.857, 0.902, 0.757, 0.745, 0.833, 0.727, 1.000, 0.824, 0.889, 0.889, 1.000, 0.791, 0.894, 0.842, 0.766, 0.429

Two Sample t-Test Assuming Equal Variances
Statistical Measure     Baseline    Enhanced
Mean                    0.0906      0.7107
Variance                0.0092      0.0463
Observations            30          30
Pooled Variance         0.0277
t Stat                  -14.4212
P(T<=t) one-tail        3.872E-21
t Critical one-tail     1.6716
P(T<=t) two-tail        7.74E-21
t Critical two-tail     2.0017

Interpretation of Statistical Test: The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the performance of the Baseline and the Enhanced search engines at Sensitivity Level 3 (75% non-relevant). Search performance is determined using the F-measure* at Sensitivity Level 3, where it is assumed that 75% of the documents returned by both the Baseline and the Enhanced search engines are non-relevant. Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. We can draw the same conclusion from the P(T<=t) two-tail value, which gives the probability of observing a difference at least as large as the one observed if the Null Hypothesis were true. The P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically significant difference between the Baseline and Enhanced search engine F-measure* values at Sensitivity Level 3 (75% non-relevant). The F-measure* value of the Enhanced search engine is statistically greater than that of the Baseline and, therefore, the Enhanced search engine performs better than the Baseline search engine. This is consistent with the conclusion drawn in the original calculations (i.e., assuming that all documents returned by both the Baseline and the Enhanced search engines are relevant).

Figure J.6 Two Sample t-Test to test the difference of the F-measure* at Level 3 Sensitivity to Relevancy Assumptions (75%) between the Baseline and Enhanced search engines.
J.4. Level 4 Sensitivity To Relevancy Assumptions (Overlap With Google Desktop)

At the fourth sensitivity level, the estimate of the number of documents that may be assumed relevant was generated by determining the number of documents returned by the Google Desktop search engine that overlap with the documents returned by the Baseline and the Enhanced search engines. Documents returned by all three search engines were assumed to be relevant. Tables J.7 and J.8 provide the data and calculations performed using the document relevancy assumption estimates for Sensitivity Level 4 for the Tangible Concept Query Sample Set and the Intangible Concept Query Sample Set. The data and results of the statistical significance test at Sensitivity Level 4 are presented in Figure J.7 for recall* and in Figure J.8 for F-measure*.
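As a sketch of how the Level 4 estimate could be produced, the number of documents to be assumed relevant for a query topic is simply the size of the three-way intersection of the result sets. The function name and document identifiers below are illustrative assumptions, not the actual tooling used with Google Desktop.

    def level4_assumed_relevant(baseline_results, enhanced_results, google_desktop_results):
        """Documents returned by all three engines for a query topic are assumed relevant."""
        overlap = set(baseline_results) & set(enhanced_results) & set(google_desktop_results)
        return len(overlap)

    # Hypothetical document identifiers for a single query topic.
    baseline = {"doc01", "doc02", "doc03", "doc07"}
    enhanced = {"doc01", "doc02", "doc03", "doc04", "doc07"}
    google_desktop = {"doc02", "doc03", "doc09"}
    print(level4_assumed_relevant(baseline, enhanced, google_desktop))  # 2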
Table J.7 Recall*, Precision, and F-measure* calculations for Tangible Concepts Query Sample Set at Level 4 Sensitivity to Relevancy Assumptions where the estimated number of documents that may be assumed relevant is the number of documents returned by all three search engines (i.e., the Baseline, Enhanced, and Google Desktop search engines).

Level 4 Sensitivity To Relevancy Assumptions (Overlap With Google Desktop) – Tangible Concepts Query Sample Set
QT | Query Topic | Original Assumed Relevant | Estimated Non-Relevant | Estimated Assumed Relevant | Baseline Recall*, Precision, F-measure* | Enhanced Recall*, Precision, F-measure*
1 | auxiliary power unit fire extinguishing | 16 | -- | 9 | 0.643, 0.563, 0.600 | 1.000, 0.667, 0.800
2 | false resolution advisory | 9 | -- | 4 | 0.267, 0.444, 0.333 | 1.000, 0.652, 0.789
3 | fault tolerant data entry | 1 | -- | 1 | 0.250, 1.000, 0.400 | 1.000, 0.286, 0.444
4 | hydraulic system status messages | 0 | -- | 0 | 0.000, 0.000, 0.000 | 1.000, 0.259, 0.412
5 | icing conditions operating speeds provided in AFM | 11 | -- | 4 | 0.103, 0.364, 0.160 | 1.000, 0.639, 0.780
6 | information presented in peripheral visual field | 3 | -- | 3 | 0.273, 1.000, 0.429 | 1.000, 1.000, 1.000
7 | information readable with vibration | 12 | -- | 7 | 0.350, 0.583, 0.438 | 1.000, 0.455, 0.625
8 | instruments located in normal line of sight | 6 | -- | 3 | 0.120, 0.500, 0.194 | 1.000, 0.714, 0.833
9 | labels readable distance | 7 | -- | 3 | 0.500, 0.429, 0.462 | 1.000, 0.500, 0.667
10 | landing gear manual extension control design | 3 | -- | 2 | 0.077, 0.667, 0.138 | 1.000, 0.542, 0.703
11 | negative transfer issues | 2 | -- | 1 | 1.000, 0.500, 0.667 | 1.000, 0.200, 0.333
12 | safety belt latch operation | 19 | -- | 6 | 0.250, 0.316, 0.279 | 1.000, 0.444, 0.615
13 | side stick control considerations | 0 | -- | 0 | 0.000, 0.000, 0.000 | 1.000, 0.700, 0.824
14 | text color contrast | 19 | -- | 16 | 0.457, 0.842, 0.593 | 1.000, 0.648, 0.787

Table J.8 Recall*, Precision, and F-measure* calculations for Intangible Concepts Query Sample Set at Level 4 Sensitivity to Relevancy Assumptions where the estimated number of documents to be assumed relevant is generated from the overlap of results with Google Desktop.

Level 4 Sensitivity To Relevancy Assumptions (Overlap With Google Desktop) – Intangible Concepts Query Sample Set
QT | Query Topic | Original Assumed Relevant | Estimated Non-Relevant | Estimated Assumed Relevant | Baseline Recall*, Precision, F-measure* | Enhanced Recall*, Precision, F-measure*
1 | acceptable message failure rate and pilots confidence in system | 0 | -- | 0 | 0.000, 0.000, 0.000 | 1.000, 0.750, 0.857
2 | appropriate size of characters on display | 13 | -- | 13 | 0.232, 1.000, 0.377 | 1.000, 1.000, 1.000
3 | arrangement of right seat instruments | 6 | -- | 1 | 0.071, 0.167, 0.100 | 1.000, 0.609, 0.757
4 | control is identifiable in the dark | 20 | -- | 4 | 0.100, 0.200, 0.133 | 1.000, 0.580, 0.734
5 | cultural conventions switch design | 1 | -- | 1 | 0.167, 1.000, 0.286 | 1.000, 0.857, 0.923
6 | design attributes for auditory displays | 4 | -- | 3 | 0.500, 0.750, 0.600 | 1.000, 0.857, 0.923
7 | ergonomics of pilot seating | 0 | -- | 0 | 0.000, 0.000, 0.000 | 1.000, 1.000, 1.000
8 | excessive cognitive effort | 4 | -- | 4 | 0.400, 1.000, 0.571 | 1.000, 1.000, 1.000
9 | how to ensure that labels are readable | 0 | -- | 0 | 0.000, 0.000, 0.000 | 1.000, 0.800, 0.889
10 | how to improve situation awareness | 2 | -- | 2 | 0.200, 1.000, 0.333 | 1.000, 1.000, 1.000
11 | how to provide unambiguous feedback | 0 | -- | 0 | 0.000, 0.000, 0.000 | 1.000, 1.000, 1.000
12 | minimal mental processing | 10 | -- | 0 | 0.000, 0.000, 0.000 | 1.000, 0.577, 0.732
13 | needs too much attention | 4 | -- | 2 | 0.091, 0.500, 0.154 | 1.000, 0.846, 0.917
14 | preventing instrument reading errors | 7 | -- | 0 | 0.000, 0.000, 0.000 | 1.000, 0.697, 0.821
15 | proper use of red and amber on displays | 13 | -- | 3 | 0.167, 0.231, 0.194 | 1.000, 0.621, 0.766
16 | suitable menu navigation methods | 0 | -- | 0 | 0.000, 0.000, 0.000 | 1.000, 0.273, 0.429

Level 4 Sensitivity To Relevancy Assumptions (Overlap With Google Desktop) – Recall* Significance Calculations

Recall* values at Sensitivity Level 4 (overlap with Google Desktop), 30 query topics:
Baseline: 0.643, 0.267, 0.250, 0.000, 0.103, 0.273, 0.350, 0.120, 0.500, 0.077, 1.000, 0.250, 0.000, 0.457, 0.000, 0.232, 0.071, 0.100, 0.167, 0.500, 0.000, 0.400, 0.000, 0.200, 0.000, 0.000, 0.091, 0.000, 0.167, 0.000
Enhanced: 1.000 for all 30 query topics

Two Sample t-Test Assuming Equal Variances
Statistical Measure     Baseline    Enhanced
Mean                    0.2072      1.0000
Variance                0.0548      0.0000
Observations            30          30
Pooled Variance         0.0274
t Stat                  -18.5406
P(T<=t) one-tail        2.369E-26
t Critical one-tail     1.6716
P(T<=t) two-tail        4.74E-26
t Critical two-tail     2.0017

Interpretation of Statistical Test: The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the recall* of the Baseline and the Enhanced search engines at Level 4 Sensitivity to Relevancy Assumptions (Overlap with Google Desktop). Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. We can draw the same conclusion from the P(T<=t) two-tail value, which gives the probability of observing a difference at least as large as the one observed if the Null Hypothesis were true. The P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null Hypothesis.
By rejecting the Null Hypothesis, we can conclude that there is a statistically significant difference between the recall* values of the Baseline and Enhanced search engines at Level 4 Sensitivity to Relevancy Assumptions (Overlap with Google Desktop).

Figure J.7 Two Sample t-Test to test the difference of the recall* at Level 4 Sensitivity to Relevancy Assumptions, where the estimated number of assumed relevant documents was generated from the overlap of results with Google Desktop.

Level 4 Sensitivity To Relevancy Assumptions (Overlap With Google Desktop) – F-measure* Significance Calculations

F-measure* values at Sensitivity Level 4 (overlap with Google Desktop), 30 query topics:
Baseline: 0.600, 0.333, 0.400, 0.000, 0.160, 0.429, 0.438, 0.194, 0.462, 0.138, 0.667, 0.279, 0.000, 0.593, 0.000, 0.377, 0.100, 0.133, 0.286, 0.600, 0.000, 0.571, 0.000, 0.333, 0.000, 0.000, 0.154, 0.000, 0.194, 0.000
Enhanced: 0.800, 0.789, 0.444, 0.412, 0.780, 1.000, 0.625, 0.833, 0.667, 0.703, 0.333, 0.615, 0.824, 0.787, 0.857, 1.000, 0.757, 0.734, 0.923, 0.923, 1.000, 1.000, 0.889, 1.000, 1.000, 0.732, 0.917, 0.821, 0.766, 0.429

Two Sample t-Test Assuming Equal Variances
Statistical Measure     Baseline    Enhanced
Mean                    0.2480      0.7786
Variance                0.0497      0.0352
Observations            30          30
Pooled Variance         0.0424
t Stat                  -9.9769
P(T<=t) one-tail        1.691E-14
t Critical one-tail     1.6716
P(T<=t) two-tail        3.38E-14
t Critical two-tail     2.0017

Interpretation of Statistical Test: The Null Hypothesis for this Two Sample t-Test is that there is no statistical difference between the performance of the Baseline and the Enhanced search engines at Sensitivity Level 4 (Overlap with Google Desktop). Search performance is determined using the F-measure* at Sensitivity Level 4, where the documents to be assumed relevant are those that are also returned by the Google Desktop search engine. Because the absolute value of the t Stat is greater than the t Critical two-tail value, we can reject the Null Hypothesis. We can draw the same conclusion from the P(T<=t) two-tail value, which gives the probability of observing a difference at least as large as the one observed if the Null Hypothesis were true. The P(T<=t) two-tail value is less than the Alpha value of 0.05, so we can reject the Null Hypothesis. By rejecting the Null Hypothesis, we can conclude that there is a statistically significant difference between the Baseline and Enhanced search engine F-measure* values at Sensitivity Level 4 (Overlap with Google Desktop). The F-measure* value of the Enhanced search engine is statistically greater than that of the Baseline and, therefore, the Enhanced search engine performs better than the Baseline search engine. This is consistent with the conclusion drawn in the original calculations (i.e., assuming that all documents returned by both the Baseline and the Enhanced search engines are relevant).

Figure J.8 Two Sample t-Test to test the difference of the F-measure* at Level 4 Sensitivity to Relevancy Assumptions, where the estimated number of assumed relevant documents was generated from the overlap of results with Google Desktop.