Query abandonment in the legal search domain

Menno van Leeuwen
10280588
Bachelor thesis
Credits: 18 EC
Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam
Supervisor
Dr. Evangelos Kanoulas
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam
June 24th, 2016
Abstract
This thesis measures search engine performance using search query satisfaction
as key performance indicator. The search engine used is a commercially available
product developed by Legal Intelligence. A total of 100,000 queries from users
at the University of Leiden were analysed and divided into two groups: a query
is either satisfying (SAT) or dissatisfying (DSAT). This division was made by
implementing four discriminating methods: click only, click plus dwell time,
reformulation, and a combination of reformulations and clicks. After creating a list
of DSAT queries, the applied filter path of each query in this list was analysed to see
whether there are law areas where the system performs worse than in others. For
the simplest discrimination method roughly fifty-nine percent of the queries were
classified as DSAT; this percentage increased further when dwell time was introduced.
Reformulation showed a drop compared to the earlier methods, which can be explained
by its definition differing substantially from the first two. The last method, an
extension of reformulation, again showed an increase in DSAT. The applied filter
paths showed no big differences between the compared methods. They did show,
however, that within the DSAT lists people more often search for unstructured data
such as literature or magazines than for structured data like laws.
Contents
1 Introduction
2 Related work
3 Experimental design
   3.1 Data
   3.2 Building search sessions
   3.3 Identifying queries
   3.4 Analysing filter paths of DSAT queries
4 Results
   4.1 Query abandonment
   4.2 DSAT Filter paths
5 Conclusion
6 Discussion & future work
7 References
8 Appendix
1 Introduction
Search engines play a central role in the way people discover information available
on the Internet. Practically every person with access to the Internet uses one of the
generic search providers, such as Google, Bing or Yahoo. These engines share the
goal of indexing the web as a whole, covering a broad spectrum of knowledge, e.g.
sites with information about celebrities on one end and sports on the other. In
contrast, proprietary search engines exist that focus on a specific information
domain. Whenever an accountant searches the intranet of a big firm such as Deloitte,
or a doctor accesses the medical records of a patient, they both use such a
proprietary search engine. Proprietary variants are available in both public and
commercial form; this thesis focuses on the latter.
One company that develops a proprietary engine is Legal Intelligence
(LI) (Legal Intelligence, 2016), a Rotterdam-based company whose main product
is a web application for searching content within the Dutch legal domain.
It searches through publicly available legal data, such as wetten.nl, as well as
commercial documents that are provided by publishers like Wolters Kluwer.
For LI the main question for this thesis was to look at how and why people give up
on a particular search. Such abandonment plays an important role as key performance
indicator (KPI), since the inability to fulfil the user's information need with
relevant documents can lead to customer frustration. This irritation may in turn
lead to a switch to a different alternative or, in the case of a commercial
product, to cancellation of the subscription. Looking back at the example of the
accountant and the doctor, we can also see the relevance of this KPI to other
sectors, which makes it a measure of broader interest than just the legal field. In
order to measure performance we ask the main question: In which law areas does
the search engine perform less well? Two sub-questions, each addressed by its own
experiment, were formulated in pursuit of answering the main question. The first
sub-question is: Which queries do not address the information need of the user?
Once these have been found, the second sub-question arises: What are the
characteristics of the queries that fail to address the information need?
Compared to previous research on query abandonment, two aspects of the current
thesis are of particular interest. The first is the domain-specific knowledge the
chosen engine tries to cover. As will be shown in the literature review, research on
abandonment tends to focus more on generic search engines than on proprietary ones
(Diriye et al., 2012; Dupret and Lalmas, 2013). Secondly, the Dutch language poses a
challenge of its own, as many models in natural language processing (NLP) are built
for English rather than for a language that is relatively uncommon.
2 Related work
Search query abandonment can be seen as the outcome of a failure to discover relevant
document(s) on the search engine result page (SERP) (Thuma et al., 2013). Several
studies investigating query abandonment have been carried out on general-purpose
search engines such as Bing and Yahoo! (Diriye et al., 2012; Dupret and Lalmas,
2013). In order to gain insight into abandonment, which can lead to user
frustration, observing and quantifying user behaviour is key. Both quantitative and
qualitative research has been conducted on the topic of query abandonment.
Qualitative research most often consists of surveys or interviews in which users
explicitly state the intent of their search task. Researchers can thus evaluate
whether the documents accessed from the SERP fulfil the intended task (Diriye et al.,
2012; Stamou and Efthimiadis, 2010). For the current research, quantitative methods
such as those used by Hassan et al. (2013) are more relevant: only the search
query logs of the partner company are available, and the absence of permission to
conduct surveys or interview users makes qualitative research impractical.
User interaction with search engines can be described at three levels: the query
(Diriye et al., 2012; Dupret and Lalmas, 2013), the session (Kanoulas et al., 2011)
and the task level (Buscher et al., 2012). Each level can be seen as consisting of
multiple instances of the previous one. The first and most straightforward level
focuses on single search queries and the resulting interaction with the SERP. Search
sessions are an extension, defined as multiple succeeding queries that are
reformulations of each other and that may be separated by interaction with the
SERP. Multiple definitions of when a session ends exist. A generally accepted
one is a thirty-minute period of inactivity (Radlinski and Joachims, 2005); any
activity beyond this period is considered part of a new session. User tasks are
much broader and harder to define, but focus on describing the user's information
need. A task may be one session long, e.g. finding a hotel in Amsterdam. A task can,
however, also be more extensive, for example comparing hotel prices across the
Netherlands: the first session could cover hotels around Amsterdam, followed by a
subsequent search for hotels in the Rotterdam area. Single query analysis is a
logical starting point for any research addressing query abandonment. If time
permits, the later levels can be analysed.
To describe success or failure in meeting the user intent, two terms are widely used
in the literature. Query satisfaction (SAT) can be defined as the successful
fulfilment of the information need of the user. In contrast, query dissatisfaction
(DSAT) occurs when the user need is not addressed as expected. These definitions
leave room for debate on what is considered satisfaction, and there seems to be no
generally accepted definition. Papers describe satisfaction in numerous ways, mostly
as the researchers see fit. Hassan et al. (2013) use five systems to describe user
satisfaction at different levels. The first system starts from the assumption that
satisfaction occurs whenever a document is clicked as the result of a search query.
This very simple approach ignores the possibility that a clicked document is a false
positive. To tackle this bias a second system is proposed, in which document dwell
time is taken into account. Dwell time is the time a user spends on a particular
document, measured as the time that elapses between two user actions (document
clicks or a reformulated query). If a user spends more than a certain time on a
particular document, it is assumed to be informative and satisfactory. Reformulation
is the central idea behind the third system. It assumes that a user who reformulates
a query must have been dissatisfied with the previous query. This system ignores the
information used by the previous two systems. The fourth approach combines the
preceding variants and takes both clicks and reformulations into account: whenever
a reformulation occurs, it checks whether the follow-up action is a click. The
fifth and last variant trains a classifier that takes the reformulations of queries
into account at the same time, rather than in the predefined if-this-then-that
(IFTTT) way described for the fourth system. In this thesis the five-system approach
to describing satisfaction is used as a foundation; this helps to keep focus and to
avoid the debate about what is and is not satisfaction.
3 Experimental design
In order to analyse search query behaviour two experiments were conducted. The
first experiment addresses the first sub-question and labels queries as either
satisfactory (SAT) or unsatisfactory (DSAT). As seen in previous work, there are
multiple approaches to making this distinction. In order not to get lost in an
endless debate about what is or is not a SAT or DSAT query, this thesis follows four
of the five methods for separating queries described by Hassan et al. (2013). The
second experiment looks at the characteristics of the failing queries produced by
the first experiment. The applied filter path of these queries is analysed: are
certain filter paths, which represent certain law areas within the search engine,
more present in the DSAT query list?
All experiments are implemented in the Python 3 programming language. Within
this language several packages are used to analyse the data further. All scripts
were executed on a MacBook Pro (Retina, 13-inch, Early 2015) running Mac OS X
El Capitan. All code has been made publicly available on GitHub and is accessible
through the following URL:
https://github.com/10280588/LI-Scriptie.
3.1 Data
Data was gathered from search logs, which hold records of all user interactions
with the system. The cooperating company Legal Intelligence (LI) provided these
search logs, which were pulled from their search engine database. LI uses a list of
observed events to follow these interactions. For this thesis two events were of
interest: one event contains all attributes of an entered query and the other holds
information about accessed documents. They are labelled event 164 and event 27
respectively. A major limitation of the available data was the exclusion of paying
clients: analysis of this client data could breach attorney-client privilege if used
without explicit consent. Therefore search logs that track the behaviour of
students and academics from the University of Leiden were used.
They have free access to the web application and their behaviour may be tracked
for purposes such as the current research.
Data was exported from the database to comma-separated value (CSV) files with the
help of a bash script executing multiple curl commands. Three sizes of recorded
entries were used: ten, 100 and 100,000 entries. This created a total of
six files, three per event. The two smaller sets were used for testing whether
the algorithm worked correctly, while the largest one was used as the final data set
to be analysed.
Per event several attributes were selected. For both events the user ID and time
stamp were selected. For queries the search text, applied filter path and time
stamp were saved, while for documents the web address (URL) and time stamp
were stored.
3.2 Building search sessions
In order to see how users interact with the search engine, time series of search
queries and document accesses were constructed. To create these sessions a
merged version of the two event data frames was used. This merged frame was then
sorted by user ID and time stamp, constructing a chronological time line
per user. These records were then split up into sessions. A session either starts as
the first search event for a new user, or alternatively starts.
These sessions have certain characteristics, such as the session length, obtained by
combining the two observed events, and the number of queries and documents per
session. In the appendix, graphs show the occurrences of the length, number of
queries and number of documents. All three graphs show a classical long-tail
distribution in which short sessions occur more often than longer sessions. This
matches expectations. An interesting observation is that almost 4000 sessions do not
have a single document click at all. This must be taken into account when analysing
the results.
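The session construction can be sketched as follows. This is a minimal illustration and not the code from the thesis repository: the tuple layout `(user_id, timestamp, event_type)`, the function name `build_sessions` and the thirty-minute inactivity gap (borrowed from the definition by Radlinski and Joachims (2005) cited in the related work) are assumptions made for the example.

```python
from datetime import datetime, timedelta
from itertools import groupby

QUERY_EVENT = 164     # event id of an entered query (from the log description)
DOCUMENT_EVENT = 27   # event id of an accessed document


def build_sessions(events, gap=timedelta(minutes=30)):
    """Split (user_id, timestamp, event_type) tuples into per-user sessions.

    A new session starts at the first event of a user or (assumption) when
    more than `gap` has passed since that user's previous event.
    """
    ordered = sorted(events, key=lambda e: (e[0], e[1]))
    sessions = []
    for _, user_events in groupby(ordered, key=lambda e: e[0]):
        current = []
        for event in user_events:
            if current and event[1] - current[-1][1] > gap:
                sessions.append(current)
                current = []
            current.append(event)
        sessions.append(current)
    return sessions


t0 = datetime(2016, 5, 1, 9, 0)
example_events = [
    ("u1", t0, QUERY_EVENT),
    ("u1", t0 + timedelta(minutes=1), DOCUMENT_EVENT),
    ("u1", t0 + timedelta(hours=2), QUERY_EVENT),  # opens a second session
    ("u2", t0, QUERY_EVENT),
]
example_sessions = build_sessions(example_events)

# Sessions without a single document click, the quantity noted in the text.
no_click = [s for s in example_sessions
            if not any(e[2] == DOCUMENT_EVENT for e in s)]
```

Sorting once by user and time stamp keeps the split a single linear pass; a different inactivity threshold can be tried by passing another `gap`.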
3.3 Identifying queries
The first and simplest method classifies a query (Q) as SAT if it receives at
least one document click (D) on the SERP. In the defined sessions
such a query is successful if an event 164 is followed by a document click, event 27.
Sessions that started with a document access instead of a search were discarded.
As an extension of this method, the second method also requires a document click for
a query to be classified as satisfactory, but adds dwell time as an additional
constraint. This constraint follows from the idea that when a document is relevant to
the user, and thus satisfies the user need, at least several seconds of dwell time
must elapse. The user uses the document as a source of information and will
therefore spend more time reading it. This leads to a period of inactivity, which is
called the dwell time. Once the dwell time exceeds a certain threshold, the document
is considered to have satisfied (part of) the user's need for information.
Experiments were conducted with thirty, sixty and ninety seconds as dwell time
threshold. Furthermore, method one is equivalent to method two if we let the time
delta be zero seconds.
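The first two methods can be illustrated with a short sketch. It is not taken from the thesis code: the session layout `(seconds, event_type)`, the function name `label_by_clicks` and the treatment of the last action in a session (it has no successor, so its dwell time is treated as long enough) are assumptions made for the example.

```python
QUERY_EVENT = 164     # entered query
DOCUMENT_EVENT = 27   # document click


def label_by_clicks(session, min_dwell=0.0):
    """Label every query in a session as 'SAT' or 'DSAT'.

    `session` is a chronologically ordered list of (seconds, event_type).
    A query is SAT when a document click that follows it (before the next
    query) has a dwell time of at least `min_dwell` seconds. Dwell time is
    the gap until the next logged action; the final action of a session is
    treated as having an unbounded dwell time (assumption).
    """
    labels = []
    current = None  # position in `labels` of the most recent query
    for i, (seconds, event_type) in enumerate(session):
        if event_type == QUERY_EVENT:
            labels.append("DSAT")
            current = len(labels) - 1
        elif event_type == DOCUMENT_EVENT and current is not None:
            if i + 1 < len(session):
                dwell = session[i + 1][0] - seconds
            else:
                dwell = float("inf")
            if dwell >= min_dwell:
                labels[current] = "SAT"
    return labels


example = [(0, QUERY_EVENT), (5, DOCUMENT_EVENT), (20, QUERY_EVENT),
           (25, QUERY_EVENT), (30, DOCUMENT_EVENT), (130, DOCUMENT_EVENT)]
click_only = label_by_clicks(example)                 # method one
with_dwell = label_by_clicks(example, min_dwell=30)   # method two, 30 s
```

With `min_dwell=0` the function reduces to the click-only method, which mirrors the equivalence noted above.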
The third method ignores clicks and instead looks at reformulations. The base
argument of this method is that a user who is not satisfied with the results on the
SERP will reformulate the query. Therefore a SAT query is one that is not followed
by a reformulation. Multiple ways can be used to measure reformulation, for example
by looking at the stemming of words. This thesis solely uses the Jaccard index to
determine whether a query is a reformulation of the previous one. The Jaccard index
can be written as the size of the intersection divided by the size of the union of
the two sets. The elements of each set are the words within a search query. As long
as the Jaccard index is above the 35% threshold, queries were seen as reformulations
of each other; if the similarity dropped below this threshold, the query was seen as
a new search.
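A minimal sketch of the Jaccard-based reformulation check is given here. Lower-casing, whitespace tokenisation, the function names and the example queries are assumptions for the illustration; only the 35% threshold and the definition of the index come from the text.

```python
def jaccard_index(query_a, query_b):
    """Jaccard index of the word sets of two queries: |A & B| / |A | B|."""
    words_a = set(query_a.lower().split())
    words_b = set(query_b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0  # two empty queries share nothing
    return len(words_a & words_b) / len(union)


def is_reformulation(previous, current, threshold=0.35):
    """A query reformulates the previous one when the index exceeds 35%."""
    return jaccard_index(previous, current) > threshold


def label_by_reformulation(queries, threshold=0.35):
    """Method three: a query is DSAT when its successor reformulates it."""
    labels = []
    for i, query in enumerate(queries):
        followed = (i + 1 < len(queries)
                    and is_reformulation(query, queries[i + 1], threshold))
        labels.append("DSAT" if followed else "SAT")
    return labels


# Hypothetical queries, chosen only to exercise both branches.
example_queries = ["huurrecht opzegging",
                   "huurrecht opzegging termijn",
                   "arbeidsrecht ontslag"]
example_labels = label_by_reformulation(example_queries)
```

In the example the first two queries share two of three distinct words (index 2/3), so the first query counts as reformulated, while the third query shares no words with the second and is treated as a new search.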
The last implemented method combines reformulations and clicks in a cascading
way. It checks whether a follow-up query can be seen as a reformulation and, if this
is the case, applies the definition of the third method. If, on the other hand, no
reformulation is found, the query is treated as a new query and it is checked whether
a click as described by the first two methods is detected. Both a combination of this
method with click only and a combination with click dwell time (thirty, sixty and
ninety seconds) were used.
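The cascade can be sketched as below, under the reading that a reformulated query is always DSAT and a non-reformulated query is SAT only when it received a qualifying click. The input layout `(text, has_sat_click)`, the function names and the example queries are hypothetical; `has_sat_click` stands for the outcome of the click-only or click-plus-dwell-time check.

```python
def jaccard(query_a, query_b):
    """Jaccard index over the lower-cased word sets of two queries."""
    a, b = set(query_a.lower().split()), set(query_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def label_cascade(queries, threshold=0.35):
    """Method four: reformulation first, clicks as fallback.

    `queries` is a chronological list of (text, has_sat_click) pairs
    belonging to one session.
    """
    labels = []
    for i, (text, has_sat_click) in enumerate(queries):
        reformulated = (i + 1 < len(queries)
                        and jaccard(text, queries[i + 1][0]) > threshold)
        if reformulated:
            labels.append("DSAT")  # definition of the third method
        else:
            labels.append("SAT" if has_sat_click else "DSAT")
    return labels


example = [
    ("ontslag op staande voet", True),
    ("ontslag op staande voet vergoeding", False),
    ("erfrecht legitieme portie", False),
    ("huurrecht", True),
]
cascade_labels = label_cascade(example)
```

Under this reading every query that method three marks as DSAT stays DSAT, and queries without a reformulation and without a click are added, which is in line with the higher DSAT counts reported for the combined method.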
Each of the four methods focuses on identifying SAT queries rather than classifying
DSAT queries. However, every single query within a collection must be either
SAT or DSAT; hence, once a method for detecting SAT is defined, every observed
search that does not meet its requirement can be seen as a DSAT
variant.
3.4 Analysing filter paths of DSAT queries
Applying the four methods above results in a list of queries
that are classified as DSAT. This thesis introduces a methodology to
analyse these queries by looking at their filter path, which allows us to answer the
second sub-question. The filter paths that are applied say something about the law
area the user was looking for with a particular query. From the list of queries,
occurrences of applied filter paths are counted and ordered so we can see whether
there are any big differences between applied filters.
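Counting and ordering the applied filter paths can be sketched with the standard library. The pair layout `(search_text, filter_path)`, the function name and the example queries are assumptions for the illustration.

```python
from collections import Counter


def top_filter_paths(dsat_queries, n=20):
    """Return the `n` most frequently applied filter paths with their counts.

    `dsat_queries` is an iterable of (search_text, filter_path) pairs.
    """
    counts = Counter(path for _, path in dsat_queries)
    return counts.most_common(n)


# Hypothetical DSAT list; "/" denotes a query without an applied filter.
example_dsat = [
    ("q1", "/"),
    ("q2", "/Literatuur/Tijdschriften"),
    ("q3", "/"),
    ("q4", "/Nederland/Rechtspraak"),
    ("q5", "/Literatuur/Tijdschriften"),
    ("q6", "/"),
]
example_top = top_filter_paths(example_dsat, n=2)
```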
4 Results
4.1 Query abandonment
Query log results from users of the University of Leiden (running total of 89,875 queries)
Method                                                NR of SAT  NR of DSAT  DSAT percentage
Click only                                            36896      52979       58.95%
Clicks with dwell time (∆t = 30s)                     26638      63237       70.36%
Clicks with dwell time (∆t = 60s)                     22448      67427       75.02%
Clicks with dwell time (∆t = 90s)                     19976      69899       77.77%
============================================================================================
Reformulation only                                    78967      10908       12.14%
Reformulation with clicks only (no dwell time)        64856      25019       27.84%
Reformulation with clicks and dwell time (∆t = 30s)   54598      35277       39.25%
Reformulation with clicks and dwell time (∆t = 60s)   50408      39467       43.91%
Reformulation with clicks and dwell time (∆t = 90s)   47936      41939       46.66%
The table above shows the results of the four methods. First, one can
note that out of the original 100,000 queries 89,875 are left. This difference is
explained by the fact that certain data points were cleaned out of the sessions.
The biggest change in DSAT percentage is found between methods two and three.
There is a logical explanation, since method three focuses on reformulation rather
than on clicks. A division can therefore be made between methods one and two on the
one hand and three and four on the other (in the table divided by the double line).
The DSAT percentage increases as soon as a narrower definition of SAT is introduced.
The increase in DSAT percentage in method two is caused by the introduction of
dwell time and, as expected, a longer dwell time threshold leads to a higher DSAT
percentage. The DSAT percentages for methods one and two remain quite high; part of
this can be explained by the fact that around 4000 sessions do not have any document
clicks at all. Another reason could be that the DSAT lists include many false
positives, but without asking the user this cannot be verified. With the
reformulation-only method the percentage of queries classified as DSAT drops
dramatically, although the same trend of increasing percentages seen between
methods one and two is also found between the third and fourth method.
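The percentages in the table follow directly from the counts. The short sketch below recomputes them from the reported DSAT counts and the running total of 89875 queries; the dictionary keys and the function name are illustrative labels only.

```python
TOTAL_QUERIES = 89875  # running total reported with the table

# DSAT counts as reported in the table above.
dsat_counts = {
    "click only": 52979,
    "click + dwell 30s": 63237,
    "click + dwell 60s": 67427,
    "click + dwell 90s": 69899,
    "reformulation only": 10908,
    "reformulation + click": 25019,
    "reformulation + click + dwell 30s": 35277,
    "reformulation + click + dwell 60s": 39467,
    "reformulation + click + dwell 90s": 41939,
}


def dsat_percentage(dsat, total=TOTAL_QUERIES):
    """Share of DSAT queries in percent, rounded to two decimals."""
    return round(100 * dsat / total, 2)


percentages = {name: dsat_percentage(count)
               for name, count in dsat_counts.items()}
# SAT is the complement of DSAT, e.g. 89875 - 52979 = 36896 for click only.
sat_counts = {name: TOTAL_QUERIES - count
              for name, count in dsat_counts.items()}
```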
4.2 DSAT Filter paths
Having seen the differences between the percentages, we now look at how the filter
paths of the different methods compare. To do this, the click-only method is
compared with the reformulation-only method. The top twenty applied filter paths of
both methods are displayed in the two tables below. As we can see, the top five
contains the same filter paths for both methods, although the second and third place
are swapped. The biggest share of DSAT queries does not have any associated filter
path: 32396 and 8149 queries respectively. Further conclusions are difficult to draw
without knowing more about the user intent.
Click only - Top 20 applied filters (over 52979 queries)
Filter path                                                 nr. of times applied
/                                                           32396
/Literatuur/Tijdschriften                                   4068
/Nederland/Rechtspraak                                      2806
/Literatuur/Naslagwerken                                    1739
/Literatuur/Tijdschriften/Artikelen                         829
/Nederland/Officiële publicaties                            517
/Europa/Rechtspraak                                         388
/Nederland/Wet- en regelgeving                              347
/2010-heden                                                 345
/2016                                                       237
/Literatuur/Naslagwerken/Asser Serie                        182
/2015                                                       180
/Nederland/Rechtspraak/Hoge Raad                            176
/Nederland/Overig                                           132
/Nederland/Rechtspraak/Centrale Nederlandse rechtscolleges  105
/Literatuur/Naslagwerken/Tekst & Commentaar                 102
/2010-heden/2016                                            96
/Literatuur/Tijdschriften/2010-heden                        86
/Literatuur/Tijdschriften/AA                                86
/Nederland/Rechtspraak/Geannoteerd                          8
Reformulation only - Top 20 applied filters (over 10908 queries)
Filter path                                                                               nr. of times applied
/                                                                                         8149
/Nederland/Rechtspraak                                                                    559
/Literatuur/Tijdschriften                                                                 513
/Literatuur/Naslagwerken                                                                  230
/Literatuur/Tijdschriften/Artikelen                                                       123
/Europa/Rechtspraak                                                                       59
/Nederland/Officiële publicaties                                                          58
/Nederland/Wet- en regelgeving                                                            48
/Nederland/Rechtspraak/Hoge Raad                                                          32
/2010-heden                                                                               28
/2015                                                                                     28
/2016                                                                                     23
/Literatuur/Naslagwerken/Asser Serie                                                      22
/Nederland/Rechtspraak/Geannoteerd                                                        20
/Nederland/Rechtspraak/Bestuursrecht, Bestuursrecht; Belastingrecht, ....., 2011, 2010    18
/Nederland/Rechtspraak/Bestuursrecht, Bestuursrecht; Ambtenarenrecht,, ..., 2011, 2010    17
/Europa/Officiële publicaties                                                             17
/Nederland/Rechtspraak/Bestuursrecht                                                      15
/2012                                                                                     14
/Nederland/Rechtspraak/Strafrecht/Strafvordering                                          12
5 Conclusion
From the results the main conclusion to be drawn is that different methods lead
to varying DSAT percentages. The low percentage of DSAT in the reformulation-only
method suggests a high precision, but at the risk of a low recall of
DSAT queries. As mentioned in the results section, however, without user surveys or
interviews it is difficult to evaluate these findings. What we can conclude,
as one would expect, is that narrowing down the definition of SAT increases the
percentage of DSAT.
Looking at the filter paths, we see no big differences between the click-only
and reformulation-only methods. So independent of the size of the DSAT list
(recall the big difference in percentage points between these methods),
the same kinds of paths seem to be selected by the user. This is especially visible
in the top five, which contains the same paths for both methods. The filter paths
that are applied most seem to be associated with unstructured documents.
Returning to our main question, in which law areas the search engine performs less
well, it is still difficult to provide a clear answer. From the applied filter
paths we can draw the conclusion that unstructured data is more difficult to
index. Generalising these paths to specific law areas is more difficult. Although
some paths are associated with certain fields, such a mapping cannot be found for
every filter. On top of that, it is hard to verify whether these paths really
represent actions of users who were not satisfied with the outcome, or whether false
positives were included in the created DSAT lists as well. In the next section a few
reflections and suggestions are made to tackle this problem.
6 Discussion & future work
As the results show, considerable differences exist between the percentages of
queries that fail to satisfy the user's information need. One explanation for
these varying results is that different methods were used for separating
SAT and DSAT queries. The existence of these different methods shows how
researchers hold different opinions about which queries lead to satisfaction. Perhaps
a query is only satisfied when official documents are found instead of articles
from magazines or blogs, for instance when the user is looking for a specific law
article. Besides the multiple opinions that exist about the discrimination of
queries, this thesis focuses on single queries only and does not take into account
the broader search session or task the user tries to fulfil. It may be the case that
a query fails today, but that the user reformulates it tomorrow and finds a relevant
document, so that the query is in fact one that satisfies the user.
In order to learn more about user intent and the associated search task, future work
could consist of qualitative explanations for query abandonment. This would also
give more insight into what the user sees as a satisfying result for a query. Such
research can be built around pop-up windows that ask users whether they found what
they were looking for, so that accuracy and recall can be calculated between the
queries predicted as DSAT and those reported by the user. A more intensive study
would include surveys or interviews that provide extended information about, first,
what the user's search intent is and, secondly, how
good he or she finds the outcomes as suggested by the system.
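How such an evaluation could be computed is sketched below. No user-reported labels exist in this thesis, so the query identifiers, the function name and the example sets are purely hypothetical.

```python
def accuracy_and_recall(predicted_dsat, reported_dsat, all_queries):
    """Compare predicted DSAT queries with DSAT queries reported by users.

    All arguments are collections of query identifiers. Accuracy is the
    share of queries on which prediction and user agree; recall is the
    share of user-reported DSAT queries that were also predicted as DSAT.
    """
    predicted = set(predicted_dsat)
    reported = set(reported_dsat)
    universe = set(all_queries)
    true_positives = len(predicted & reported)
    true_negatives = len(universe - predicted - reported)
    accuracy = (true_positives + true_negatives) / len(universe)
    recall = true_positives / len(reported) if reported else 0.0
    return accuracy, recall


# Hypothetical example with ten queries.
example_accuracy, example_recall = accuracy_and_recall(
    predicted_dsat={1, 2, 3, 4},
    reported_dsat={2, 3, 5},
    all_queries=range(1, 11),
)
```

In the example two of the three reported DSAT queries are predicted (recall 2/3), and prediction and user agree on seven of the ten queries (accuracy 0.7).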
Future research could also compare the results when different input data is used.
At this moment only academic users are present in the data; legal professionals
working at law firms have not been included. It would therefore be interesting to
see whether certain assumptions, such as the things users are looking for and the
type of queries used, hold up when we cross-validate them with professional users.
Another distinction that may be of interest is age: a younger lawyer may use
different search tactics than a senior
colleague.
7 References
Buscher, G., White, R. W., Dumais, S., and Huang, J. (2012). Large-scale analysis
of individual and task differences in search result page examination strategies.
In Proceedings of the Fifth ACM International Conference on Web Search and
Data Mining, WSDM ’12, pages 373–382, New York, NY, USA. ACM.
Diriye, A., White, R., Buscher, G., and Dumais, S. (2012). Leaving so soon?:
Understanding and predicting web search abandonment rationales. In Proceedings
of the 21st ACM International Conference on Information and Knowledge
Management, CIKM ’12, pages 1025–1034, New York, NY, USA. ACM.
Dupret, G. and Lalmas, M. (2013). Absence time and user engagement: Evaluating
ranking functions. In Proceedings of the Sixth ACM International Conference
on Web Search and Data Mining, WSDM ’13, pages 173–182, New York, NY,
USA. ACM.
Hassan, A., Shi, X., Craswell, N., and Ramsey, B. (2013). Beyond clicks: Query
reformulation as a predictor of search satisfaction. In Proceedings of the 22nd
ACM International Conference on Information & Knowledge
Management, CIKM ’13, pages 2019–2028, New York, NY, USA. ACM.
Kanoulas, E., Carterette, B., Clough, P. D., and Sanderson, M. (2011). Evaluating
multi-query sessions. In Proceedings of the 34th International ACM SIGIR
Conference on Research and Development in Information Retrieval, SIGIR ’11,
pages 1053–1062, New York, NY, USA. ACM.
Legal Intelligence (2016). Legal Intelligence.
Radlinski, F. and Joachims, T. (2005). Query chains: Learning to rank from implicit
feedback. In Proceedings of the Eleventh ACM SIGKDD International Conference
on Knowledge Discovery in Data Mining, pages 239–248. ACM.
Stamou, S. and Efthimiadis, E. N. (2010). Advances in Information Retrieval:
32nd European Conference on IR Research, ECIR 2010, Milton Keynes, UK,
March 28-31, 2010. Proceedings, chapter Interpreting User Inactivity on Search
Results, pages 100–113. Springer Berlin Heidelberg, Berlin, Heidelberg.
Thuma, E., Rogers, S., and Ounis, I. (2013). Evaluating bad query abandonment in
an iterative SMS-based FAQ retrieval system. In Proceedings of the 10th
Conference on Open Research Areas in Information Retrieval, pages 117–120. Le
Centre de Hautes Etudes Internationales d’Informatique
Documentaire.
8 Appendix
Figures 1 and 2 display the length per session, which forms a traditional long-tail
distribution.
Figures 3 and 4 display the number of queries per session; as can be seen, a
typical long-tail graph forms. The first graph provides a broad overview
whilst the second one is a zoomed-in version.
Figures 5 and 6 display the number of documents per session, which again forms
a long tail. An important observation is that nearly 4000 sessions have no
recorded document clicks at all; this can be seen in the zoomed-in version at the
bar where the number of documents is zero.
Figure 1: Session length overview
Figure 2: Session length - Top 25
Figure 3: Queries per session overview
Figure 4: Queries per session - Top 25
Figure 5: Documents per session overview
Figure 6: Documents per session - Top 25