Identifying Queries in the Wild, Wild Web
Jingjing Liu, Chang Liu, Jun Zhang, Ralf Bierig, Michael Cole
School of Communication and Information, Rutgers University
4 Huntington Street, New Brunswick, NJ 08901, USA
{jingjing, changl, zhangj}@eden.rutgers.edu, {bierig, m.cole}@rutgers.edu
ABSTRACT
Identifying user querying behavior is an important problem for
information seeking and retrieval research. Query-related studies
typically rely on server-side logs taken from a single search
engine, but a comprehensive view of user querying behaviors
requires analysis of data collected from the client-side for
unrestricted searches. We developed three methods to identify
querying behaviors and tested them on client-side logs collected
in a lab experiment for realistic tasks and unrestricted searches on
the entire Web. Results show that the best method was able to
identify 97% of queries issued, with a precision of 92%. Although
based on a relatively small number of search episodes, our
methods, perhaps with minimal modifications, should be adequate
for identification of queries in logs of unconstrained Web search.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search
and Retrieval – Query formulation
General Terms
Algorithms, Measurement, Performance, Human Factors, Experimentation.
Keywords
Web search logs, query log analysis, query identification, user
studies, unrestricted search.
1. INTRODUCTION
Understanding searcher querying behavior is a central concern of
information retrieval and information seeking research. Almost all
studies of such behaviors have been conducted within the
confines of a single search engine or database with associated
search facilities. However, when searching for information,
especially in the context of the Web, people do not typically limit
themselves to only one search engine or one database. Rather,
they often follow links from one system to another, or go from a
search engine in the “Visible Web” to one in the “Invisible Web”,
and so on. Thus, the nature of the data concerning Web searching
behavior that we currently have does not represent realistic
searching behaviors in the Web in general.
To extend our understanding of searching, and especially of querying behavior, it will be necessary to generate and analyze search session logs of quite a different character than has been done up to now. Two key features of such logs are that they will need to be constructed on the client side, rather than on the server side, as has been done; and that they will contain data on queries and other behaviors that are not amenable to the types of simple analyses and query identification that have been used in the past. In this paper, we report on three methods for identifying queries in client-side logs of unrestricted searching on the Web. These methods were developed and tested with a corpus of realistic, unrestricted assigned searches, collected in the course of an experimental study of Web searching. We report the results of the methods, with failure analyses. Our methods, or slight revisions of them, can be applied to other search logs, and achieve acceptable performance in identifying querying behavior on the open Web. This is an important initial step toward exploring and understanding users' unconstrained search behaviors in everyday life.
2. RELATED WORK
2.1 Server-side Query Log Analysis
One stream of query-related work has used large transaction logs from search engines. Most transaction logs are server-side recordings of interactions between searchers browsing on a particular computer and search engines [5]. The data usually come from a specific search engine and include all queries submitted by all users to that search engine during a specific period of time, varying from a portion of a day to several years. Queries and query terms are already identified in these logs.
Previous work using this type of transaction log has focused mainly on identifying query sessions in order to group together the queries for the same search topic (e.g., [5], [6], [7], [8]), and on analyzing and/or evaluating query reformulation types ([12], [13]). The data were transaction logs from AltaVista ([5], [13]), Excite ([12], [14]), Dogpile [8], AOL [6], and so on. Since queries were available in the data sets of all these studies, none of them needed to do any work to identify queries or querying behavior.
2.2 Client-side Query Log Analysis
Another line of studies involving query analysis collected data through lab experiments. These data were usually logs recorded on the client side rather than the server side. Such studies typically asked users to search in specific systems, either existing or self-built, or in a limited number of Web search engines, particularly Google. User querying behavior and query terms can be easily detected in all of these circumstances.
Some query related studies examined search moves and tactics by
analyzing changes in query terms while searching in specified
online databases such as INQUERER [16], PsychINFO [15], etc. Although these works did not describe how queries were obtained, the queries appear to have been identifiable without difficulty. Some studies (e.g., [3]) that used the number of queries as one indication of user search effort restricted users to specified search systems (e.g., Google Wikipedia and ALVIS Wikipedia), where search queries could be easily detected. Other studies used self-built systems to examine the effectiveness of query elicitation techniques [1] or a term relevance feedback interface [9]. Accurately identifying queries in these systems is not difficult.
Another approach to logging search activities is to have users install, on their own browsers, add-ons that log their search activities. One example, among others, is the Lemur Query Log Toolbar (http://www.lemurproject.org/querylogtoolbar/), which can be installed on the Internet Explorer or Firefox browsers. However, to our knowledge, this tool can only log searches in pre-defined systems, and it cannot easily capture Invisible Web searches, which are not always possible to define a priori.
In sum, it appears that in previous studies of query analysis, the queries and query terms were either provided by the server logs of a specific search engine, or could be easily detected in client-side logs where users searched in specified or self-built systems. Query identification in these circumstances is not a problem. However, in everyday life people search freely on the Web in order to accomplish their work tasks. In addition to major search engines, Invisible Web searches occur frequently, for example searches on proprietary databases and searches with dynamic results. It is much more difficult to accurately identify queries and query terms in these circumstances. To our knowledge, no one has addressed this problem for query research.
3. METHOD
3.1 Three Query Identification Methods
Based on its page number, a Search Engine Result Page (SERP) generally falls into one of two types: the first page of search results, or a subsequent page of search results. Based on how many times a user has visited a SERP, a visit can be categorized as either a first visit or a revisit (if any). Among these types, a user's first visit to the first page of a SERP corresponds to that user's query submission. We refer to such SERPs as "querying SERPs". In this paper, we focus on identifying these querying SERPs, which in turn helps identify querying behavior.
Our query identification methods are based on client-side search logs instead of server-side logs. The logs we used were captured in a lab experiment (details in Section 3.2) using the logging software Morae (http://www.techsmith.com/morae.asp), which recorded user activities including URLs opened, mouse clicks, and keystrokes, as well as the time stamp of each activity. Using this data set, we developed and tested three methods for identifying querying SERPs:
• The Pattern (P) method: using query URL patterns found in major search engines (Google, Yahoo!, and MSN/Bing), removing duplicates so that only the SERP that first appeared is kept, and removing SERPs that were pages 2 and above of the returned result lists.
• The Keystroke (K) method: using keystrokes followed by a mouse click or an "Enter" key, excluding cases where participants used page-internal search (Ctrl+F keys) or typed URLs in the browser address bar.
• The Pattern and Keystroke (P+K) method: merging the unique results of the P and the K methods (a minimal sketch of these checks follows this list).
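To make the three checks concrete, the following is a minimal Python sketch of the P, K, and P+K decisions over a client-side event log. The event format (dictionaries with a type, a URL, and flags), the query-parameter and result-offset rules, and all function names are our own illustrative assumptions; the paper does not publish its exact URL pattern list or log parser.

from urllib.parse import urlparse, parse_qs

# Illustrative query-URL rules for the major engines named in Section 3.1.
# Real SERP URLs vary; these parameters are assumptions, not the authors' rules.
QUERY_PARAMS = {"google.": "q", "search.yahoo.": "p", "bing.": "q", "search.msn.": "q"}

def is_querying_serp_pattern(url, seen_queries):
    """P method: first visit to page 1 of a SERP from a major search engine."""
    parsed = urlparse(url)
    qs = parse_qs(parsed.query)
    engine = next((e for e in QUERY_PARAMS if e in parsed.netloc), None)
    if engine is None or QUERY_PARAMS[engine] not in qs:
        return False                                # no query-URL pattern
    if int(qs.get("start", ["0"])[0] or 0) > 0:     # Google-style result offset
        return False                                # pages 2 and above are excluded
    key = (engine, qs[QUERY_PARAMS[engine]][0])
    if key in seen_queries:
        return False                                # duplicate or revisited SERP
    seen_queries.add(key)
    return True

def is_query_submission_keystrokes(events, i):
    """K method: keystrokes followed by Enter or a mouse click, excluding
    Ctrl+F page-internal search and typing in the browser address bar."""
    ev = events[i]
    if ev["type"] != "keystrokes" or ev.get("ctrl_f") or ev.get("address_bar"):
        return False
    nxt = events[i + 1] if i + 1 < len(events) else None
    return nxt is not None and nxt["type"] in ("enter_key", "mouse_click")

def p_plus_k(p_detected, k_detected):
    """P+K method: the union of the unique SERPs found by the two methods."""
    return set(p_detected) | set(k_detected)

In this sketch the P check is applied to each URL-opened event and the K check to each keystroke event; the P+K result is simply the union of the two detection sets, mirroring the merge described above.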
3.2 Data Collection
The data used in this paper came from a lab experiment that was conducted to investigate user search behaviors associated with different task types in an unconstrained search environment. Participants were 32 undergraduate journalism students, who were asked to carry out tasks designed to be realistic journalism assignments following "simulated work task situations" [2]. After a training task that helped the participants become familiar with the task requirements and the search environment, they performed four tasks involving Web search (described below). Task orders were balanced to minimize possible order effects. For each task, participants were allowed up to 20 minutes to search freely on the Web and to find and save useful documents. Morae recorded the user-computer interactions. Pre- and post-task questionnaires were used to elicit participants' knowledge of the task topics, self-ratings of search experience, self-evaluations of performance, etc.
Although we had only four tasks in the experiment, they represented several task facets that have been found to have a significant impact on user behaviors [10] (Table 1). On all tasks, participants were observed to engage various search sources, including Web search engines, searchable Web sites, and proprietary databases accessed through a university library. The tasks (complete descriptions available at [10]) were: 1) background information collection (BIC): collect published stories in important newspapers on changes in US visa laws after 9/11 reducing enrollment of international students at universities in the US; 2) copy editing (CPE): find authoritative pages that confirm or disconfirm three facts in a part of an article; 3) interview preparation (INT): find the contact information of two people with appropriate expertise to be interviewed for a news story about the effect of state budget cuts in New Jersey on financial aid for college and university students; and 4) advance obituary (OBI): collect the information needed to write an advance obituary of an artist.

Table 1. Variable facet values for the search tasks
Task | Objective complexity | Product | Level    | Goal (Quality)
BIC  | High                 | Mixed   | Document | Specific
CPE  | Low                  | Factual | Segment  | Specific
INT  | Low                  | Mixed   | Document | Mixed
OBI  | High                 | Factual | Document | Amorphous
3.3 Performance Measurements
We used recall, precision, and false positive rate as evaluation measures. Ground truth, i.e., the actual number of searches (query SERPs), was obtained by manually reviewing the screen videos. Since our study did not constrain search sources, a fair number of searches were performed using non-major search engines. In order to compare the performance of the three query detection methods on different types of search sources, we classified the searches into two categories: 1) those using major Web search engines (Google, Yahoo!, and MSN/Bing), referred to as Visible Web (V-Web) searches, and 2) those using all other search sources, referred to as Invisible Web (I-Web) searches.
• Recall: the number of correct query SERPs detected by a method
out of the total number of query SERPs in ground truth
(calculated separately for V-Web and I-Web).
• Precision: the number of correct query SERPs detected by a
method out of the total number of query SERPs detected by this
method.
• False positive rate: the number of SERPs incorrectly detected as query SERPs by a method out of the total number of query SERPs detected by this method. This rate is equal to 1 minus precision (a small worked computation follows this list).
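As a concrete illustration of the three measures, the short computation below plugs in the P-method counts reported later in Table 3 (148 correct detections, 155 total detections, and 191 query SERPs in the ground truth); the function and variable names are ours.

def evaluate(correct_detected, total_detected, ground_truth_total):
    recall = correct_detected / ground_truth_total
    precision = correct_detected / total_detected
    false_positive_rate = 1 - precision   # incorrect detections / total detections
    return recall, precision, false_positive_rate

# P method over all searches (counts from Table 3): recall ~0.77 and
# false positive rate ~0.05, matching the figures reported in Section 4.2.2.
print(evaluate(148, 155, 191))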
4. RESULTS
4.1 Data Sets in General
There were 128 experiment sessions (32 participants by 4 tasks). To evaluate the performance of the three methods, we sampled 10% of the sessions (12 sessions), stratified by task (3 per task). A random sample was first selected, of which 6 (50%) had I-Web searches. To test our methods, we wanted a collection that included I-Web searches, so we resampled to replace the 6 sessions without I-Web searches with sessions that included them.
We found that users employed quite a number of different ways to enter a search query in the various search sources. Query text input methods included keystrokes, copy-and-paste (Ctrl+C / Ctrl+V, the "copy" and "paste" functions in the browser drop-down menu, or the dialog shown after right-clicking), and combinations of these. Query reformulation methods included modifying all, some parts, or even none of the previous query. Searches could be initiated by pressing the "Enter" key or by clicking the "Search" button on the search interface. The SERPs returned by the systems sometimes followed the URL rules of major search engine SERPs and sometimes did not.
4.2 The Test Data Set
4.2.1 Description of the Test Set
The 12 sample sessions contained 191 query SERPs, 119 of which (62.3%) were V-Web searches and 72 of which (37.7%) were I-Web searches. Table 2 shows the ground truth details.

Table 2. Sample data set information (12 sessions); All, V-Web, and I-Web give the number of query SERPs in the ground truth
#  | User | Task | All | V-Web | I-Web
1  | s006 | BIC  | 25  | 1     | 24
2  | s014 | BIC  | 6   | 2     | 4
3  | s029 | BIC  | 20  | 5     | 15
4  | s006 | CPE  | 11  | 8     | 3
5  | s024 | CPE  | 9   | 4     | 5
6  | s025 | CPE  | 18  | 17    | 1
7  | s008 | INT  | 11  | 9     | 2
8  | s014 | INT  | 15  | 11    | 4
9  | s021 | INT  | 23  | 18    | 5
10 | s008 | OBI  | 11  | 10    | 1
11 | s012 | OBI  | 25  | 20    | 5
12 | s021 | OBI  | 17  | 14    | 3
Total            | 191 | 119   | 72
4.2.2 Performance of the Three Methods
Table 3 and Figure 1 show the performance of the three methods on the test set. The recall of the P method was 77%. It did well on V-Web SERPs, with a recall of 97%, but had a poor recall of 44% on I-Web SERPs. The recall of the K method was 91%, a 17.57% improvement over the P method. The K method did worse than the P method in detecting V-Web searches, with a recall of 92%; however, it greatly outperformed the P method in detecting I-Web SERPs (recall 89% vs. 44%). The P+K method, with a recall of 97%, had better results than either of the two individual methods: a 25.68% improvement over the P method and a 6.90% improvement over the K method. In I-Web SERP detection in particular, the P+K method was a 100% improvement over the P method and a 4.69% improvement over the K method. In sum, the P method was better on the V-Web and the K method was better on the I-Web, but the P+K method outperformed (or at least was not worse than) both the P method and the K method on the V-Web and the I-Web, considered alone or together.
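The relative improvements quoted above can be re-derived from the correct-SERP counts in Table 3 below (148, 174, and 186 total correct SERPs for P, K, and P+K; 64 and 67 correct I-Web SERPs for K and P+K). The snippet is an illustrative recomputation, not the authors' analysis script.

def improvement(new_count, old_count):
    # Relative improvement, in percent, of one method's correct-SERP count over another's.
    return (new_count - old_count) / old_count * 100

print(f"{improvement(174, 148):.2f}%")  # K vs. P, all searches: 17.57%
print(f"{improvement(186, 148):.2f}%")  # P+K vs. P, all searches: 25.68%
print(f"{improvement(186, 174):.2f}%")  # P+K vs. K, all searches: 6.90%
print(f"{improvement(67, 64):.2f}%")    # P+K vs. K, I-Web correct SERPs: 4.69%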
Table 3. Query SERP statistics for the three methods
Method        | Correct (V-Web) | Correct (I-Web) | Total correct | False positives | Total detected | Missing (V-Web) | Missing (I-Web) | Missing (total)
Pattern (P)   | 116 | 32 | 148 | 7  | 155 | 3 | 40 | 43
Keystroke (K) | 110 | 64 | 174 | 8  | 182 | 9 | 8  | 17
P+K           | 119 | 67 | 186 | 15 | 201 | 0 | 5  | 5
Ground truth  | 119 | 72 | 191 | 0  | 191 | 0 | 0  | 0

[Figure 1. Performance of the three methods (P, K, and P+K): recall overall, recall in the Visible Web, recall in the Invisible Web, precision, and false positive rate (y-axis: 0-100%).]

Table 3 compares the performance of the three methods. The precision of the three methods was comparable, although the false positive rate of the P+K method (7%) was a little higher than that of the other two methods (5% for the P method and 4% for the K method).
4.2.3 False Positive and Missing Cases
The P method had 7 false positive cases, all of which had a query string pattern in their URLs. Among them, 4 were page 2 of a SERP on non-search-engine search systems, 1 was from the Google cache, 1 was a content page with a string pattern in its URL, and 1 was a SERP linked from the Google search engine to the Bloomberg search system. The K method had 8 false positive cases, all of which involved keystrokes followed by an Enter/click and then a URL. In 7 of them, users were modifying their previous query but, before executing the modified query, clicked a link in the result list or a link back to a previous content page; in the remaining case, the user typed login information into a database. The false positives of the P+K method combined those of the two individual methods, for a total of 15 cases.
The P method missed 43 query SERPs: 40 of them had no string pattern in their URLs, and 3 were repeated queries that the searcher issued in the same search session. The K method missed 16 query SERPs because they did not involve any keyboard activity: users just used mouse clicks to change search fields (7 times), accept system spelling corrections (8 times), or launch a consecutive repeated search (once). Since most of the missing cases of the P and the K methods complemented each other, the P+K method had far fewer missing cases: there were only 5, all of the same type, with no string patterns in the URLs and no keystroke activity.
5. CONCLUSIONS
To gain a better understanding of realistic Web searching behaviors, in particular querying behavior, it is necessary to move beyond the current methods of query analysis based on logs of searches conducted on single search engines or in single databases. In our study of realistic, unrestricted Web searches, 50% of the randomly sampled search sessions included queries put to the Invisible Web. This supports the common-sense understanding of the importance of studying unrestricted searching behavior, and of the need to ensure that we have methods to analyze that behavior; in particular, there is a need to identify unrestricted querying behavior.
The best of the methods described here identified 97% of the queries, with a precision of 92%, in the 12 search episodes of our data sample. The methods can be applied to other client-side Web logs as long as they record URLs, mouse clicks, and keystrokes, which many loggers can do. Our data were collected in unrestricted Web searching, which includes quite typical real-life Web activities such as URL typing, page-internal search, and log-in activities. Possible false positive cases such as URL typing and page-internal search were accounted for in our algorithms. Although we did not address log-in activities or Web-form filling, which also involve keystrokes and mouse clicks (or the Enter key), they are usually not a large portion of people's normal Web activities, and, if needed, scripts addressing them can certainly be added to the algorithms.
Our failure analyses suggest that the false positive and missing cases in our results are quite special cases, which might be amenable to post hoc correction. Even if they cannot be corrected, this matters little in the noisy environment of client-side logs of unrestricted information seeking. We believe that our methods, or slight revisions of them, provide an acceptable level of performance for the analysis of querying behavior in the Wild, Wild Web.
6. ACKNOWLEDGEMENTS
This work is sponsored by IMLS grant LG-06-07-0105-07. We
thank Nick Belkin, Jacek Gwizdka, and Xiangmin Zhang for their
advice and feedback.
7. REFERENCES
[1] Belkin, N.J., Cool, C., Kelly, D., Kim, G., Kim, J.-Y., Lee, H.-J., et al. (2003). Query length in interactive information retrieval. Proceedings of SIGIR '03.
[2] Borlund, P. (2003). The IIR evaluation model: A framework for evaluation of interactive information retrieval systems. Information Research, 8(3), paper no. 152.
[3] Croft, W.B. & Thompson, R. (1987). I3R: A new approach to the design of document retrieval systems. Journal of the American Society for Information Science, 38(6), 389-404.
[4] Gwizdka, J. (2009). Assessing cognitive load on Web search tasks. The Ergonomics Open Journal, 2, 114-123.
[5] He, D., Goker, A., & Harper, D.J. (2002). Combining evidence for automatic Web session identification. Information Processing & Management, 38(5), 727-742.
[6] Huang, J. & Efthimiadis, E.N. (2009). Analyzing and evaluating query reformulation strategies in Web search logs. Proceedings of CIKM '09.
[7] Jansen, B.J. (2006). Search log analysis: What is it; what's been done; how to do it. Library and Information Science Research, 28(3), 407-432.
[8] Jansen, B.J., Booth, D.L., & Spink, A. (2009). Patterns of query reformulation during Web searching. Journal of the American Society for Information Science and Technology, 60(7), 1358-1371.
[9] Kelly, D. & Fu, X. (2006). Elicitation of term relevance feedback: An investigation of term source and context. Proceedings of SIGIR '06, 453-460.
[10] Li, Y. & Belkin, N.J. (2008). A faceted approach to conceptualizing tasks in information seeking. Information Processing & Management, 44(6), 1822-1837.
[11] Liu, J., Cole, M., Liu, C., Bierig, R., Gwizdka, J., Belkin, N.J., Zhang, J., & Zhang, X. (2010). Search behaviors in different task types. Proceedings of JCDL '10.
[12] Rieh, S.Y. & Xie, H. (2006). Analysis of multiple query reformulations on the Web: The interactive information retrieval context. Information Processing & Management, 42, 751-768.
[13] Silverstein, C., Marais, H., Henzinger, M., & Moricz, M. (1999). Analysis of a very large Web search engine query log. ACM SIGIR Forum, 33(1), 6-12.
[14] Spink, A., Wolfram, D., Jansen, B.J., & Saracevic, T. (2001). Searching the Web: The public and their queries. Journal of the American Society for Information Science and Technology, 52(3), 226-234.
[15] Vakkari, P., Pennanen, M., & Serola, S. (2003). Changes in search terms and tactics while writing a research proposal: A longitudinal case study. Information Processing & Management, 39, 445-463.
[16] Wildemuth, B.M. (2004). The effects of domain knowledge on search tactic formulation. Journal of the American Society for Information Science and Technology, 55, 246-258.