topic centric querying of web information resources

TOPIC CENTRIC QUERYING
OF WEB RESOURCES
by
İsmail Sengör Altıngövde
Supervisor: Assoc. Prof. Dr. Özgür Ulusoy
September 2001
Talk Outline
• Introduction
• Background and Related Work
• Web Information Space (WIS) Model
• SQL-TC (Topic Centric) Language Features
• Prototype Implementation
• Conclusions and Contributions
• Future Work
Introduction: Problem Definition
• Tremendous growth of WWW, thanks to...
– No centralized authority governing the web
– No strict schema characterizing the data on
the web
• Problem: “needle in the haystack” [Pepper]
• Locating highly relevant information in
reasonable time and effort!
Introduction: The Approach
• Exploit metadata (expressed as topics and
their relationships),
• Exploit XML and related efforts, such as
XML Topic Maps,
• Exploit the database point of view,
to produce high-quality semantically
related responses to user queries in short
time spans.
A Motivating Example
• Query: Find movies and their resources
that are related to the novel “Carrie”,
written by “Stephen King”, and rated very
good (i.e. importance above 0.7)
• Presently achievable by:
– Browsing the movie pages, or
– A keyword-based search followed by a
look-up of the resulting hits
• Both may be ineffective and time-consuming!
Motivating Example (2)
• Assume for the movie database at
www.movie-bank.com, an expert defines a
metadata model where
– “Stephen King” and “Carrie” are so-called
topics,
– RelatedTo and WrittenBy are relationships
(called metalinks) among these topics
Motivating Example (3)
• Furthermore,
– the expert links these topics to actual
web pages (topic sources), and
– attaches importance values to topics,
metalinks and sources
• The above data model serves as a
“semantic-index” for the underlying web
resources
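[Figure: the topic/metalink domain (topics “Carrie” and “Stephen King”, metalinks WrittenBy and RelatedTo) shown above the actual “information resource” domain; labels include “Story of Carrie” and the importance value 0.75.]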
Motivating Example (4)
• Now, it is possible to (informally) formulate
the example query using these data model
objects and satisfy the user’s request:
select sources for topic_t
from www.movie-bank.com
where topic_t RelatedTo “Carrie” and
topic_t WrittenBy “Stephen King” and
topic_t.Importance > 0.7
order by topic importance
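The informal query above can be mimicked over a tiny in-memory model. This is only a sketch: the movie topics ("Movie A", "Movie B", "Movie C"), their importance values, and the page addresses are hypothetical placeholders, and `sources_for` is an illustrative helper, not part of the proposed language.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    name: str
    importance: float
    sources: list = field(default_factory=list)  # addresses of topic sources


# Hypothetical expert advice for www.movie-bank.com (illustrative values only).
topics = {
    "Movie A": Topic("Movie A", 0.9, ["www.movie-bank.com/movie-a.html"]),
    "Movie B": Topic("Movie B", 0.5, ["www.movie-bank.com/movie-b.html"]),
    "Movie C": Topic("Movie C", 0.8, ["www.movie-bank.com/movie-c.html"]),
}

# Metalink instances as (left topic, metalink type, right topic) triples.
metalinks = {
    ("Movie A", "RelatedTo", "Carrie"),
    ("Movie A", "WrittenBy", "Stephen King"),
    ("Movie B", "RelatedTo", "Carrie"),
    ("Movie B", "WrittenBy", "Stephen King"),
    ("Movie C", "WrittenBy", "Stephen King"),
}


def sources_for(related_to, written_by, min_importance):
    """Return (topic name, source) pairs, most important topics first."""
    hits = [
        t for t in topics.values()
        if (t.name, "RelatedTo", related_to) in metalinks
        and (t.name, "WrittenBy", written_by) in metalinks
        and t.importance > min_importance
    ]
    hits.sort(key=lambda t: t.importance, reverse=True)
    return [(t.name, url) for t in hits for url in t.sources]


result = sources_for("Carrie", "Stephen King", 0.7)
```

Here only "Movie A" satisfies both metalink conditions and the importance threshold; "Movie B" is too unimportant and "Movie C" is not related to “Carrie”.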
What do we propose?
• A “web information space” model with
metadata (topics and metalinks) at its heart,
• An SQL-like query language operating on
this model: SQL-TC.
• Metadata is
– stable (with the exception of topic sources), and
– amortized over time through repeated use and fast query
responses
What do we propose? (2)
• In practice, we assume that the modeled
information resources do not span the whole web;
instead, metadata is defined for subnets, such as:
– ACM SIGMOD Anthology sites, or
– Online Collections of the Smithsonian Institution
• In this talk, “web information resources” means such “subnets”
• Semi-automated tools can gather metadata
for much larger and/or distributed domains
Background: XML
• Extensible Markup Language (XML)
– being a subset of SGML, also makes use of
tags for elements and attributes,
– a self-describing format for data (i.e., XML
describes the content, but not presentation),
– becoming a universal standard for data
exchange on the web.
Background: Metadata
• Metadata: fellow traveller with data
• The need for metadata on the web is justified by:
– the increase of digital information and noise on the web,
– emerging mechanisms to address information objects at a
finer granularity
• Metadata is different from database views:
– the metadata attached to a resource set may not be explicitly
present in the set, and
– metadata is likely to be semi-structured or not structured at
all
Background: Metadata
• Two well-known metadata standards for
web pages are
– Dublin Core, and
– Warwick Framework
• More general frameworks are also
proposed:
– Topic Maps Standard (ISO/IEC 13250)
– Resource Description Framework (RDF)
Background: Topic Maps
• Topic Maps standard
– provides an interchangeable notation to
represent the structure of information
resources.
• Key concepts:
– Topics (with types, names, occurrences)
– Associations among topics, with association
types and player role types
– Occurrences of topics, with occurrence types
Background: Topic Maps
• Example. In the context of an encyclopedia:
“Stephen King” is a topic of type “person”
“Maine” is a topic of type “state”
“S. King” was-born-in “Maine” is an association,
where was-born-in is the association type
“S. King” can be depicted in a picture or mentioned
in an article
• In a topic map, all types, roles etc. are also
topics
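The encyclopedia example can be written down as plain data to make the three key concepts concrete. This is a sketch under stated assumptions: the dictionary layout and the `types_are_topics` check are illustrative, and the resource addresses are placeholders rather than real occurrences.

```python
# Every topic maps to its type; a type of None only means "no type recorded"
# in this sketch. Types and association types appear as topics themselves.
topics = {
    "Stephen King": "person",
    "Maine": "state",
    "person": None,
    "state": None,
    "was-born-in": None,
    "picture": None,
    "article": None,
}

# Associations: (association type, left topic, right topic).
associations = [("was-born-in", "Stephen King", "Maine")]

# Occurrences: topic -> list of (occurrence type, resource address).
# The addresses are placeholders, not real resources.
occurrences = {
    "Stephen King": [
        ("picture", "http://www.example.org/king.jpg"),
        ("article", "http://www.example.org/king-article.html"),
    ],
}


def types_are_topics():
    """Check that every type used anywhere is itself declared as a topic."""
    used = {t for t in topics.values() if t is not None}
    used |= {a_type for a_type, _, _ in associations}
    used |= {o_type for occs in occurrences.values() for o_type, _ in occs}
    return used <= set(topics)
```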
Background: XML Topic Maps
• The ISO standard defines Topic Maps using
SGML architectural forms and HyTime
hyperlinks
• XML Topic Maps (XTM) effort aims to
– describe an XML grammar for interchanging
Web-based topic maps, and thus,
– encourage the use of a functional and syntactic
subset of topic maps in the mass market
Background: RDF
• RDF is another specification to annotate
resources with metadata in an interchangeable
manner
• RDF abstract model defines
– resources, identified by URIs,
– properties, corresponding to attributes used to
describe the resources, and
– statements, to associate property-value pair with a
resource
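The abstract model above can be pictured as a list of (resource, property, value) statements. A minimal sketch, assuming a placeholder URI and illustrative property names ("creator", "title"); it does not use any RDF library.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    resource: str  # URI identifying the resource
    prop: str      # property (attribute) describing the resource
    value: str     # value of that property


# Placeholder URI and property names, chosen only for illustration.
statements = [
    Statement("http://www.example.org/carrie.html", "creator", "Stephen King"),
    Statement("http://www.example.org/carrie.html", "title", "Carrie"),
]


def describe(resource, stmts):
    """Collect all property-value pairs stated about one resource."""
    return {s.prop: s.value for s in stmts if s.resource == resource}


description = describe("http://www.example.org/carrie.html", statements)
```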
Background: RDF vs. Topic Maps
• RDF and Topic Maps are similar efforts, but [Freese]
– RDF starts from resources, and may end up at
an abstract knowledge layer, whereas
– Topic Maps start with an abstract topic domain,
which may optionally be linked to actual
resources
Thus, RDF is resource-centric, whereas Topic Maps are topic-centric.
Related Work: C-Web
• The C-Web project [Christophides et al.] aims to
support information sharing within
specific web communities
• Design goals
– creation of conceptual models (schema) by domain
experts,
– publishing resources using the schema terms,
– querying these information resources (RQL)
• C-Web employs RDF to express the schema.
Related Work: Others
• WebML and Structured Maps are two
other studies exploiting metadata
• IBM’s Xcentral: the first search engine that
indexes XML and RDF elements
• Lots of languages with the broader goal of
querying web as a whole:
– WebSQL, WebOQL, W3QL, STRUQL,
QUEST...
Web Information Space Model
• WIS model has three basic components:
– Information Resources
– Expert Advice Repositories
– Personalized Information Repositories
• Information Resources are web-based
documents of any type.
– In this study, we only consider XML/HTML
documents.
WIS: Expert Advice Model
• Expert Advice Repositories
– capture metadata in terms of topic, topic source
reference and metalink entities, and
– slightly extend the topic map data model (i.e.
topic detail levels).
– All metadata entities are assigned importance
values by the expert specifying them.
importance values ∈ [0, 1] ∪ {No, Don’t-care}
Expert Advice: Topics
• Expert advice specifies topics along with their
type and domain, and the quadruple
<Tname, Ttype, Tdomain, Tauthor>
serves as a key for a topic entity.
• Example. Topic entity with internal id T1
T1 : <“Carrie”, “Novel”, “Literature”, E1>
Expert Advice: Topics (2)
• Topic importance values can be specified in
a functional form:
Tadvice (E1, Tname = “*Diabetes*”, Ttype=“Disease”,
Tdomain=“Surgery Nurse Training”)= 0.3
• The above advice states that in the domain of
“surgery nurse training”, any topic (of type
“disease”) including the phrase “diabetes” in
its name has a low importance value.
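The functional form of Tadvice can be read as a pattern rule over topic names. A minimal sketch, assuming case-insensitive wildcard matching and a default of "Don't-care" when no rule applies; both are assumptions made for illustration, as are the topic names in the usage lines.

```python
from fnmatch import fnmatchcase

# Each rule: (expert, name pattern, topic type, topic domain, importance).
TADVICE_RULES = [
    ("E1", "*Diabetes*", "Disease", "Surgery Nurse Training", 0.3),
]


def tadvice(expert, tname, ttype, tdomain, default="Don't-care"):
    """Return the importance an expert assigns to a topic, if any rule matches."""
    for r_expert, pattern, r_type, r_domain, importance in TADVICE_RULES:
        if (r_expert == expert and r_type == ttype and r_domain == tdomain
                and fnmatchcase(tname.lower(), pattern.lower())):
            return importance
    return default  # assumption: unmatched topics are treated as Don't-care


low = tadvice("E1", "Gestational Diabetes", "Disease", "Surgery Nurse Training")
none = tadvice("E1", "Asthma", "Disease", "Surgery Nurse Training")
```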
Expert Advice: Topic Sources
• Topic Source Reference entity contains
information about the topic sources and has the
following attributes:
– web address of the document in which the topic
occurs,
– detail-level, describing how advanced the
document is for the corresponding topic,
– importance value (as usual), and
– other attributes such as media-type, last-modification
date, etc.
Expert Advice: Metalinks
• Metalinks represent relationships among
topics (correspond to associations of TMs)
• Metalinks do not represent the relationship
among topic sources, hence the term
“metalink” is chosen
Expert Advice: Metalinks (2)
• Assume the metalink signature below:
Prerequisite : SetOf topic → topic
• Metalink instances of type Prerequisite are:
“Rel. Calculus” Prerequisite “Rel. Algebra”
“Diabetes Surgery2” Prerequisite “Diabetes1”
• Then, an importance value may be assigned to a
metalink instance as below:
Madvice(E, “Rel. Calculus” Prerequisite “Rel. Algebra”) = 1
Expert Advice: Metalinks (3)
• In this study, we consider “topic closure”
with respect to a particular metalink type
• For instance, assume the two assertions below:
“Rel. Calculus” Prerequisite “Rel. Algebra”
“Rel. Model” Prerequisite “Rel. Calculus”
Then, the prerequisites of understanding the
relational model are both the relational algebra
and the relational calculus
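Topic closure with respect to one metalink type amounts to a transitive closure over that type's instances. A minimal sketch, assuming the metalink instances are held in a dictionary from a topic to its direct prerequisites; `topic_closure` is an illustrative name, not the prototype's operator.

```python
from collections import deque

# Direct Prerequisite metalinks from the two assertions above.
PREREQUISITE = {
    "Rel. Calculus": {"Rel. Algebra"},
    "Rel. Model": {"Rel. Calculus"},
}


def topic_closure(start, edges):
    """Return every topic reachable from `start` by following `edges`."""
    seen = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in edges.get(current, ()):
            # Skip topics already collected, so cycles terminate.
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen


closure = topic_closure("Rel. Model", PREREQUISITE)
```

The breadth-first traversal visits each topic once, so the cost is linear in the number of metalink instances of the chosen type.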
WIS: Personalized Information Model
• Personalized information is expressed in two
forms (both are assumed to be given in XML):
– User preferences
– User knowledge
• User preferences are specified as an ordered
set of statements, along the lines of [AW00].
• User preferences are used to resolve conflicts
among different domain experts.
User Preferences
• Example. Preferences of user John Doe:
Expert (John-Doe) = <GW-Bush, W-Clinton>
TImportance (John-Doe) = {(GW-Bush, 0.5), (W-Clinton, 0.9)}
MImportance (John-Doe) = {(W-Clinton, 0.9)}
Reject-S (John-Doe) = {Web-Address=www.dirtypolitics.com}
Conflict-R = Ordered-Accept
User Knowledge
• User knowledge includes knowledge on a
particular topic in terms of detail levels:
UKnowledge (U) = {(topic, detail-level-value)}
UKnowledge (John-Doe) = {(TName=“Math”, 3)}
• Besides, navigation history of web
resources for these topics is also kept in user
knowledge.
SQL-TC (Topic Centric)
Features
• SQL-TC
– is a strongly-typed multi-database query
language,
– is designed for querying both web resources
in a particular domain and the associated
expert advice repositories in an integrated
manner,
– allows users to pose queries using WIS
model entities such as topics and metalinks.
SQL-TC Syntax
select [topic {.attribute} | metalink {.attribute}] as T
from resources XML: url1, …
using experts Topic Map1: url1, … as E1, …
with user profile XML: URL as U
where (i) conditions on topics and metalinks of experts
(ii) content-based conditions on sources,
(iii) conditions on user profile information.
order by [topic] importance
stop after n most important | when importance below m
| after n most important and when importance below m
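The three forms of the stop after clause can be read as a rank-then-cut step. A minimal sketch, assuming result rows are (item, importance) pairs and that "when importance below m" keeps rows whose importance is at least m; `apply_stop` and its parameters are illustrative names, not part of SQL-TC.

```python
def apply_stop(rows, n=None, m=None):
    """Rank rows by importance, then apply the optional cut-offs.

    n: keep at most the n most important rows.
    m: stop as soon as importance falls below m.
    """
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    if m is not None:
        ranked = [r for r in ranked if r[1] >= m]
    if n is not None:
        ranked = ranked[:n]
    return ranked


# Illustrative rows only; the importance values are made up.
rows = [("b", 0.8), ("c", 0.4), ("a", 0.9)]
top_two = apply_stop(rows, n=2)
above = apply_stop(rows, m=0.85)
both = apply_stop(rows, n=2, m=0.5)
```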
Querying Web Resources
• Assume the information resources are at
– http://www.stephenkinglibrary.com as S1,
– http://www.stephen-king.net as S2.
• The expert advice repository associated
with these resources is located at
– www.sql-tc.com/king.xtm as E1
• And the user profile is located at
– www.myprofile.com as U1
Example Query-1
• Query: Using only the advice at E1, find
two highest-ranked novels that are written
by the novelist Stephen King, and the
novels’ detail level 4 reviews from the two
information resources.
Example Query-1
select [$topic.name, $sourceRef.web-address] as T
from resources S1, S2
using experts E1
where WrittenBy in E1.Metalinks and
$topic = any (WrittenBy (“Stephen King”, “horror
novelist”, “literature”, E1,)) and
$sourceRef = SourceOf($topic, 4, E1) and
“review” in $sourceRef.roles
order by $topic importance
stop after 2 most important
Example Query-1
• The output of the query will be an interactive
table as below:
Tname          SourceRef.Web-address
“Carrie”       www.critics.com/carrie.html
“The Stand”    www.critics.com/stand.html
Example Query-2
• Illustrates topic closure computation,
user profiles and multi-experts
• Query: Using the advice of experts E1 and
E2, and excluding the novels read by the
user, find the highest ranked novel which
is related to another novel “Wizard and
The Glass”.
Example Query-2
select [$topic.name] as T
from resources S1, S2
using experts E1, E2
with user profile www.myprofile.com as U
where RelatedTo in (E1, E2).Metalinks and
$topic = any (RelatedTo* (“Wizard and The
Glass”, , “literature”, ,)) and
$topic not in GetTopics(U.UserKnowledge)
order by importance
stop after 1 most important
Example Query-2
• Assume that, in the expert advice repositories, the
following are specified:
“The Wasteful Lands” RelatedTo “Wizard and The Glass”
“Drawings of Three” RelatedTo “The Wasteful Lands”
“Dark Tower” RelatedTo “Drawings of Three”
• Then, the variable $topic will be bound to each of the
LHS topics above
Example Query-2
• While using multiple expert advice repositories,
expert advice conflicts may occur.
• Example:
Madvice(E1, “Drawings of Three” RelatedTo “The Wasteful Lands”) = 0.8
Madvice(E2, “Drawings of Three” RelatedTo “The Wasteful Lands”) = “No”
• Consult user preferences while computing the
closure: this may lead to early pruning!
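One way to picture the conflict resolution step is sketched below. The slides do not define how a conflict-resolution policy such as Ordered-Accept works, so this code assumes one plausible reading: take the opinion of the first expert, in the user's preferred order, who gives something other than "Don't-care". The `resolve` helper and this reading are assumptions, not the prototype's behaviour.

```python
def resolve(advice_by_expert, expert_order):
    """Pick one importance value for a metalink instance.

    advice_by_expert: expert id -> importance (a number, "No" or "Don't-care").
    expert_order: the user's experts, most preferred first.
    """
    for expert in expert_order:
        value = advice_by_expert.get(expert, "Don't-care")
        if value != "Don't-care":
            return value  # assumed policy: first expressed opinion wins
    return "Don't-care"


# Conflicting advice on the metalink instance from the example above.
advice = {"E1": 0.8, "E2": "No"}
prefers_e1 = resolve(advice, ["E1", "E2"])
prefers_e2 = resolve(advice, ["E2", "E1"])
```

Under this reading, a user who prefers E2 gets “No” for the metalink instance, so the closure computation need not expand it any further; that is the early pruning mentioned above.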
Implementation
• A prototype implementation is provided on a PC/Windows platform using Java
• Expert Advice Repositories are expressed as
XTM documents with slight modifications
• XTMs are then processed to yield an
internal graph data structure and stored in
a relational database
• At the moment, SQL-TC queries are
translated to equivalent SQL queries
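To illustrate what a translation to SQL might look like, the sketch below stores topics and metalinks in two relational tables and evaluates the closure part of Example Query-2 with a recursive query. The table layout, the importance values, and the use of SQLite from Python are assumptions made for this example; the prototype's actual schema and generated SQL are not shown in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one table for topics, one for metalink instances.
conn.executescript("""
    CREATE TABLE topic (id INTEGER PRIMARY KEY, name TEXT, importance REAL);
    CREATE TABLE metalink (type TEXT, lhs INTEGER, rhs INTEGER);
""")
conn.executemany("INSERT INTO topic VALUES (?, ?, ?)", [
    (1, "Wizard and The Glass", 0.9),  # importance values are made up
    (2, "The Wasteful Lands", 0.8),
    (3, "Drawings of Three", 0.7),
    (4, "Dark Tower", 0.6),
])
conn.executemany("INSERT INTO metalink VALUES (?, ?, ?)", [
    ("RelatedTo", 2, 1),
    ("RelatedTo", 3, 2),
    ("RelatedTo", 4, 3),
])

# RelatedTo* as a recursive common table expression; UNION drops duplicates,
# which also keeps the recursion finite if the metalinks contain a cycle.
CLOSURE_SQL = """
    WITH RECURSIVE related(id) AS (
        SELECT lhs FROM metalink
        WHERE type = 'RelatedTo'
          AND rhs = (SELECT id FROM topic WHERE name = ?)
        UNION
        SELECT m.lhs FROM metalink m
        JOIN related r ON m.rhs = r.id
        WHERE m.type = 'RelatedTo'
    )
    SELECT t.name FROM topic t
    JOIN related r ON t.id = r.id
    ORDER BY t.importance DESC
    LIMIT 1
"""
best = conn.execute(CLOSURE_SQL, ("Wizard and The Glass",)).fetchone()[0]
```

The `ORDER BY ... LIMIT 1` pair plays the role of "order by importance / stop after 1 most important" in the SQL-TC query.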
Conclusion and Contributions
• A web information space model and its query
language are developed for sophisticated
querying of “subnets”
• Essential components of WIS model are expert
advice repositories capturing metadata and
personalized information for users
• XML Topic Maps is exploited to express expert
advice, with the chance of identifying and
resolving many issues with this evolving effort
Conclusion and Contributions (2)
• SQL-TC, an integrated query language, is
designed to query both the expert advice
repositories and associated web resources
• SQL-TC consults user preferences and
knowledge to refine the outputs
• Further features include
– topic closure computation,
– importance based output ranking/filtering
Future Research Directions
• Develop a semi-automated tool to gather
metadata information for very large web
domains,
• Develop a sophisticated GUI for naive
users,
• Develop a large-scale, fully functional
system implementation that allows
measuring performance in terms of response
time and semantic quality.
Future Research Directions(2)
• Develop query processing and
optimization algorithms for SQL-TC,
– tailored for similarity-based matching of
metadata entities,
– support ranking inherently, and
– support topic closure operator evaluation
efficiently.
Thanks for
your attendance...
Any questions ???