Query expansion for the legal domain: a case study from the JUMAS project

ONTOSE 2010 – 7th June 2010 @ CAISE, Hammamet, Tunisia
Query expansion for the legal domain:
a case study from the JUMAS project
Fabio Sartori, Matteo Palmonari
Disco, University of Milan-Bicocca, Italy
Paper appears in: M.-A. Sicilia, Ch. Kop, F. Sartori, ONTOSE 2010, Springer,
Lecture Notes in Business Information Processing (LNBIP), Vol. 62
Outline
• JUMAS: Background & Motivation
• Query Expansion Methods
• Platform for Qualitative Evaluation of Methods
Libraries in the Juridical Domain
• The legal domain has interested AI for many years
• Knowledge representation systems for reasoning about
the legal domain
• Large and heterogeneous data sources:
• Documents
• Video and audio of trials
• Need for effective techniques to manage libraries in the
juridical domain
• Semantic processing and search
• The JUMAS project addresses these issues
The JUMAS project
• JUdicial MAnagement by Digital Libraries Semantics
• JUMAS envisages a system for embedded semantic extraction from multimedia data, feeding into an advanced knowledge management system
JUMAS Objectives
• Objectives:
– Knowledge Models and Spaces: search directly in the audio and video
source without a verbatim transcription of the proceedings
– Knowledge and Content Management: exploit hidden semantics in
audiovisual digital libraries in order to facilitate search and retrieval,
intelligent processing and effective presentation of multimedia
information
– Sensor and Multimedia Integration: information fusion deriving from
multi-modal sources (audio/video) in order to improve accuracy in
automatic transcription and annotation phases
– Effective Information Management: streamline and optimise the
document workflow, allowing the analysis of (un)structured
information for document search and evidence base assessment
– ICT Infrastructure: Service Oriented Architecture supporting a large-scale audio/video retrieval system focusing on scalability,
interoperability and modularity.
JUMAS Functional Architecture
• Automatic transcription
• Semantic annotation
• Search services, including semantic search
Semantic Search and
Query Expansion (QE)
• Problems with search for semantically annotated
content: poor recall
• Query expansion (QE) is the process of
reformulating a query to improve retrieval
performance.
– Evaluates a query (i.e. the keywords typed into the
search form) and expands it to match
additional relevant documents, using a domain
ontology
• Many different approaches:
– Difficult to evaluate which is better in a given domain
Specific Goal & Objectives
• Main goal
• Testing and comparison of existing
ontology-based methods for query
expansion to be used in the JUMAS project
• Objectives
• Develop a configurable platform for query
expansion that:
• implements and integrates existing methods
• can be adopted by domain experts in the context
of the JUMAS project
Outline
• JUMAS: Background & Motivation
• Query Expansion Methods
• Platform for Qualitative Evaluation of Methods
Query Expansion Explained
• Query expansion maps a query to an expansion E
– phrase-based: the whole query is expanded, e.g. "knife, trial, murder"
– single-keyword: one keyword is expanded, e.g. "knife"
• Two main approaches in the literature:
– Probabilistic QE
– Ontology-based QE
• Often E consists of a set of weighted keywords, e.g. for "knife":
knife - 1.0
blade - 0.6
melee weapon - 0.5
bayonet - 0.3
machete - 0.3
scalpel - 0.3
sword - 0.2
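As an illustration of the weighted keyword set above, the following minimal Python sketch stores the expansion of "knife" with the weights shown on the slide and keeps only the terms above a cut-off. The function name `select_terms` and the `min_weight` threshold are assumptions made for this example; the slides do not say how the platform filters weighted terms.

```python
# Expansion set E for the keyword "knife", using the weights from the slide.
EXPANSION = {
    "knife": 1.0,
    "blade": 0.6,
    "melee weapon": 0.5,
    "bayonet": 0.3,
    "machete": 0.3,
    "scalpel": 0.3,
    "sword": 0.2,
}


def select_terms(expansion, min_weight):
    """Return (term, weight) pairs with weight >= min_weight, heaviest first.

    Both the function and the threshold are hypothetical: they only show
    how a weighted expansion set could be cut down before searching.
    """
    kept = [(term, w) for term, w in expansion.items() if w >= min_weight]
    # Sort by descending weight, then alphabetically for a stable order.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for term, weight in select_terms(EXPANSION, min_weight=0.5):
        print(f"{term}: {weight}")
```

A higher threshold trades recall for precision: with `min_weight=0.5` only "knife", "blade" and "melee weapon" are kept.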
Methodology
1. Analysis of proposed approaches
2. Implementation of the platform, including a
set of selected methods
3. Testing and qualitative comparison of methods
Proposed and Selected QE Methods
• There are many classifications
according to:
– Type of method
• (relevance feedback / similarity function)
– User involvement
• (manual / interactive / automatic)
– Usage of adjacent words
• (single word / phrase based)
– Kind of query
• (short / long, structured / not structured,
navigational / information / transactional)
– Application domain
• (documents / tagged resources / ontologies /
source code / …)
– Kind of domain
• (generic / specific)
Methods considered (bracketed labels are the names of the methods implemented in the platform):
• Voorhees 94 [Voorhees]
• Tuominen et al. 09
• Alani et al. 00 [ATJ]
• Navigli & Velardi 03
• Hirst & St-Onge 97 [HO]
• Andreou 05
• Xhu et al. 06
• Calegari & Pasi 08
• + Wu & Palmer 94 [Ancestor]
Adopted Ontology Model and
Web-compliant Representation
• Thesauri as the ontological model (i.e. lexical
ontologies)
– focus on four relations among terms:
• SYN (Synonymy);
• NT (Narrower Term);
• BT (Broader Term);
• RT (Related Term).
• Representation with SKOS (Simple Knowledge
Organization System)
– W3C recommendation for thesauri definition
– based on RDF (+/- OWL)
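To make the four relations concrete, here is a minimal in-memory sketch of such a thesaurus in Python. It is not the platform's SKOS/RDF representation: the `Thesaurus` class, its method names and the sample terms are assumptions for illustration. In SKOS itself, BT, NT and RT correspond to `skos:broader`, `skos:narrower` and `skos:related`, while synonyms are usually expressed as alternative labels (`skos:altLabel`) of the same concept.

```python
from collections import defaultdict

RELATIONS = ("SYN", "NT", "BT", "RT")
# Adding a link in one direction implies the inverse link in the other.
INVERSE = {"SYN": "SYN", "NT": "BT", "BT": "NT", "RT": "RT"}


class Thesaurus:
    """Hypothetical in-memory thesaurus with SYN/NT/BT/RT links."""

    def __init__(self):
        # term -> relation -> set of linked terms
        self._links = defaultdict(lambda: {r: set() for r in RELATIONS})

    def add(self, source, relation, target):
        """Record `source --relation--> target` and its inverse."""
        if relation not in INVERSE:
            raise ValueError(f"unknown relation: {relation}")
        self._links[source][relation].add(target)
        self._links[target][INVERSE[relation]].add(source)

    def related(self, term, relation):
        """Return the terms linked to `term` by `relation` (empty if none)."""
        links = self._links.get(term)
        return set(links[relation]) if links else set()


if __name__ == "__main__":
    t = Thesaurus()
    t.add("melee weapon", "NT", "knife")  # knife is narrower than melee weapon
    t.add("knife", "SYN", "blade")
    t.add("knife", "RT", "stabbing")
    print(t.related("knife", "BT"))
```

Storing the inverse link on every insertion keeps BT and NT consistent, which the expansion methods rely on when they walk the thesaurus in both directions.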
Platform’s functional architecture
• Three main blocks:
– query expansion
testing interface
– query expansion
engine
– ontology
management
Outline
• JUMAS: Background & Motivation
• Query Expansion Methods
• Platform for Qualitative Evaluation of
Methods
QE workflow
Query Expansion Engine in JUMAS
• QE exposed as a
Web service
Query Expansion Platform’s GUI
• Queries and results
– with explanations
[Screenshot labels: word to expand; retrieved word and distance; type of relation]
Query Expansion Platform’s GUI
• Ontology Management Interface
– Add/delete terms and relations to/from the thesaurus
[Screenshot labels: word to expand; retrieved word and distance; export the ontology]
Query Expansion Platform’s GUI
• Platform configuration (example for HO)
– Configuration of parameters for each method
Conclusions & Future Work
• Development of
– an environment for testing and exploring the potential of
expansion through lexical ontologies (i.e. thesauri)
– a query expansion service that can be invoked as a Web service
• The platform allowed the expert from the court to select the
method they felt best fit their needs (Hirst & St-Onge 97)
• Future work
– More quantitative experiments
– Axiomatic ontologies
– Adaptation of other similarity measures for ontologies (e.g. in
the Simpack library)
Thanks!
Thesaurus adopted
• Thesaurus PICO, from www.culturaitalia.it
(the website of the Italian Ministry of Cultural Resources Management)
• Why PICO?
– developed in SKOS;
– complete;
– covers a well-known domain.
• Extended by adding:
– RT relations;
– synonyms;
– terms.
Query set
• The queries adopted are single words
• They were selected so that each term can be distinguished
within the thesaurus. In particular, they differ in:
• genericity;
• number of NT;
• number of RT;
• presence of synonyms;
• number of parents.
Evaluation Criteria
• Qualitative, not quantitative!
• Based on reasonable search scenarios
[Figure: a query and its output, with expansion by means of Ancestor and expansion by means of Voorhees (version 1)]
Results
Indications on the best expansion for different scenarios:
• Contextual definition: expansion by means of siblings or NT
• Components definition: expansion by means of NT is ideal; it is better to avoid expansion by means of sibling nodes
• Search for general information: RT are very useful; better to avoid NT methods
• Search for specific information: better to avoid sibling-based methods; synonyms are very useful
• Causal relationships among terms: it is ideal to expand by means of children instead of siblings
Results
Indications on general features of QE:
• Importance vs. number: the importance of relationships among terms seems to be inversely proportional to their number
• Expansion "explosion": some terms, if reached during the expansion, cause the addition of many other terms which are irrelevant
These results are starting points for new tests!
Future developments
• Platform extension:
– Addition of new QE methods
– Addition of new kinds of ontologies
– Addition of a suitable GUI
• Application to the JUMAS and CRESM-Milano Antica projects
• Analysis and research: looking for new methods, phrase-based methods, and techniques to detect the kind of query and to configure the method to apply accordingly
Method 1: Voorhees
Four variants:
1. Expansion by means of synonyms only
2. Expansion by means of synonyms and hyponyms
3. Expansion by means of synonyms, hyponyms and hypernyms
4. Expansion by means of all the adjacent nodes
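The four variants differ only in which relation types are followed from the query term. The Python sketch below illustrates this under some assumptions: hyponyms are read as NT links and hypernyms as BT links, "all adjacent nodes" is taken to mean all four relation types of the thesaurus model, and the toy thesaurus and the names `VARIANT_RELATIONS` and `expand` are invented for the example.

```python
# Toy thesaurus (hypothetical data): term -> relation -> directly linked terms.
THESAURUS = {
    "knife": {
        "SYN": {"blade"},
        "NT": {"bayonet", "machete"},
        "BT": {"melee weapon"},
        "RT": {"stabbing"},
    },
}

# Relation types followed by each variant (NT = hyponyms, BT = hypernyms).
VARIANT_RELATIONS = {
    1: ("SYN",),
    2: ("SYN", "NT"),
    3: ("SYN", "NT", "BT"),
    4: ("SYN", "NT", "BT", "RT"),
}


def expand(term, variant, thesaurus=THESAURUS):
    """Return the query term plus its direct neighbours for the given variant."""
    links = thesaurus.get(term, {})
    expanded = {term}
    for relation in VARIANT_RELATIONS[variant]:
        expanded |= links.get(relation, set())
    return expanded


if __name__ == "__main__":
    for variant in sorted(VARIANT_RELATIONS):
        print(variant, sorted(expand("knife", variant)))
```

Each variant returns a superset of the previous one, which makes it easy to see how much each additional relation type widens the query.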
Method 2: ATJ
Four variants:
• expansion with hyponyms and hypernyms, weighting the relations by type and depth;
• as above, with the addition of RT relations;
• topological restriction on the expansion of RT;
• semantic restriction on the expansion of RT (not implemented).
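The slide does not give the weighting formula, so the following Python sketch only illustrates the general idea of the first variant: hyponym (NT) and hypernym (BT) links are followed up to a maximum depth, and each step multiplies the weight by a factor that depends on the relation type, so terms reached by longer paths weigh less. The factors in `TYPE_WEIGHT`, the `max_depth` parameter and the toy graph are all hypothetical and are not taken from Alani et al. or from the platform.

```python
from collections import deque

# Toy hierarchy (hypothetical data): term -> list of (relation, neighbour).
GRAPH = {
    "weapon": [("NT", "melee weapon")],
    "melee weapon": [("BT", "weapon"), ("NT", "knife"), ("NT", "sword")],
    "knife": [("BT", "melee weapon"), ("NT", "bayonet")],
    "sword": [("BT", "melee weapon")],
    "bayonet": [("BT", "knife")],
}

# Hypothetical per-relation factors: narrower terms count more than broader ones.
TYPE_WEIGHT = {"NT": 0.8, "BT": 0.5}


def weighted_expansion(start, max_depth=2):
    """Breadth-first expansion; a term's weight is the product of the
    relation factors along the best path found within max_depth steps."""
    best = {start: 1.0}
    queue = deque([(start, 1.0, 0)])
    while queue:
        term, weight, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for relation, neighbour in GRAPH.get(term, []):
            candidate = weight * TYPE_WEIGHT[relation]
            if candidate > best.get(neighbour, 0.0):
                best[neighbour] = candidate
                queue.append((neighbour, candidate, depth + 1))
    return best


if __name__ == "__main__":
    for term, weight in sorted(weighted_expansion("knife").items()):
        print(f"{term}: {weight:.2f}")
```

The RT variants would add a third relation type to `GRAPH` and `TYPE_WEIGHT`, with an extra rule limiting how far RT links may be followed.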
Method 3: HO
• Only the terms reached through specific patterns are added:
• A BT relation can be preceded only by other BT relations
• There can be a single change of direction, or two if the pattern consists of a series of BT, a series of RT and a series of NT
• The maximum path length is 5
Method 3: HO, patterns
One or more RT
One or more BT
One or more NT
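The path rules listed for HO can be checked mechanically. The Python sketch below encodes only the three rules stated on the slide (BT preceded only by BT, at most one change of direction or two for the BT, RT, NT shape, maximum length 5); it is an illustration rather than a full implementation of Hirst & St-Onge's measure, and the function name `is_allowed` is an assumption.

```python
from itertools import groupby

MAX_LENGTH = 5
DIRECTIONS = {"BT", "NT", "RT"}


def is_allowed(path):
    """Check a path, given as a list of relation labels such as
    ["BT", "BT", "NT"], against the three rules stated on the slide."""
    if not 1 <= len(path) <= MAX_LENGTH:
        return False
    if any(label not in DIRECTIONS for label in path):
        return False
    # Collapse consecutive equal labels into runs, e.g. BT BT NT -> [BT, NT].
    runs = [label for label, _ in groupby(path)]
    # A BT may be preceded only by BT, so BT can appear only in the first run.
    if "BT" in runs[1:]:
        return False
    changes = len(runs) - 1
    if changes <= 1:
        return True
    # Two changes are allowed only for BT series, RT series, NT series.
    return runs == ["BT", "RT", "NT"]


if __name__ == "__main__":
    for candidate in (["BT", "BT", "NT"], ["RT", "BT"], ["BT", "RT", "NT"]):
        print(candidate, is_allowed(candidate))
```

During expansion, a term would be added only if some path from the query term to it passes this check.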
Method 4: Ancestor
• Computes the similarity of two terms with a formula that takes into account their distance from the common parent and the depth of that parent:
\[ \mathrm{ConSim}(C_1, C_2) = \frac{2 N_3}{N_1 + N_2 + 2 N_3} \]
where N_1 and N_2 are the distances of C_1 and C_2 from their common parent and N_3 is the depth of the common parent.
• The terms closest to the initial term are added.
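As a worked example of the ConSim formula, the Python sketch below computes it on a small hierarchy and ranks the other terms by similarity to a query term. It assumes each term has a single parent, counts N3 as the number of links from the common ancestor to the root, and uses made-up terms; the names `PARENT`, `con_sim` and `closest_terms` are hypothetical.

```python
# Toy single-parent hierarchy (hypothetical data): term -> broader term.
PARENT = {
    "weapon": None,
    "melee weapon": "weapon",
    "firearm": "weapon",
    "knife": "melee weapon",
    "sword": "melee weapon",
    "pistol": "firearm",
}


def ancestors(term):
    """Return the chain [term, parent, ..., root]."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def con_sim(c1, c2):
    """ConSim(C1, C2) = 2*N3 / (N1 + N2 + 2*N3)."""
    chain1, chain2 = ancestors(c1), ancestors(c2)
    common = next(t for t in chain1 if t in chain2)  # deepest common ancestor
    n1 = chain1.index(common)         # links from c1 up to the common ancestor
    n2 = chain2.index(common)         # links from c2 up to the common ancestor
    n3 = len(ancestors(common)) - 1   # links from the common ancestor to the root
    denominator = n1 + n2 + 2 * n3
    if denominator == 0:
        return 1.0  # both terms are the root itself
    return 2 * n3 / denominator


def closest_terms(term, k):
    """Return the k other terms most similar to `term`."""
    others = [t for t in PARENT if t != term]
    return sorted(others, key=lambda t: (-con_sim(term, t), t))[:k]


if __name__ == "__main__":
    print(con_sim("knife", "sword"))
    print(closest_terms("knife", 2))
```

For "knife" and "sword", N1 = N2 = 1 and N3 = 1, so ConSim = 2 / (1 + 1 + 2) = 0.5. With links counted this way, any pair whose only common ancestor is the root scores 0.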