TOPIC CENTRIC QUERYING OF WEB RESOURCES
by İsmail Sengör Altıngövde
Supervisor: Assoc. Prof. Dr. Özgür Ulusoy
September 2001

Talk Outline
• Introduction
• Background and Related Work
• Web Information Space (WIS) Model
• SQL-TC (Topic Centric) Language Features
• Prototype Implementation
• Conclusions and Contributions
• Future Work

Introduction: Problem Definition
• Tremendous growth of WWW, thanks to...
  – No centralized authority governing the web
  – No strict schema characterizing the data on the web
• Problem: “needle in the haystack” [Pepper]
• Locating highly relevant information in reasonable time and effort!

Introduction: The Approach
• Exploit metadata (expressed as topics and their relationships),
• Exploit XML and related efforts, such as XML Topic Maps,
• Exploit the database point of view,
to produce high-quality, semantically related responses to user queries in short time spans.

A Motivating Example
• Query: Find movies and their resources that are related to the novel “Carrie”, written by “Stephen King”, and rated very good (i.e. importance above 0.7)
• Presently achievable by:
  – Browsing the movie pages, or
  – A keyword-based search followed by a look-up of the resulting hits
• Both may be ineffective and time-inefficient!

Motivating Example (2)
• Assume that, for the movie database at www.movie-bank.com, an expert defines a metadata model where
  – “Stephen King” and “Carrie” are so-called topics,
  – RelatedTo and WrittenBy are relationships (called metalinks) among these topics

Motivating Example (3)
• Furthermore,
  – the expert links these topics to actual web pages (topic sources), and
  – attaches importance values to topics, metalinks and sources
• The above data model serves as a “semantic-index” for the underlying web resources
[Figure: the topic/metalink domain and the actual “information resource” domain; labels: Story of Carrie, 0.75, WrittenBy, RelatedTo, Carrie, Stephen King]

Motivating Example (4)
• Now, it is possible to (informally) formulate the example query using these data model objects and satisfy the user’s request:
    select sources for topic_t
    from www.movie-bank.com
    where topic_t RelatedTo “Carrie”
      and topic_t WrittenBy “Stephen King”
      and topic_t.Importance > 0.7
    order by topic importance

What do we propose?
• A “web information space” model with metadata (topics and metalinks) at its heart,
• An SQL-like query language operating on this model: SQL-TC.
• Metadata is
  – stable (with the exception of topic sources), and
  – amortized by its use over time and by fast query responses

What do we propose? (2)
• In practice, we assume that the modeled information resources do not span the whole web; instead, metadata is defined for subnets, such as:
  – ACM SIGMOD Anthology sites, or
  – Online Collections of the Smithsonian Institution
• Here, “web information resources” refers to such “subnets”
• Semi-automated tools can gather metadata for much larger and/or distributed domains

Background: XML
• Extensible Markup Language (XML)
  – being a subset of SGML, also makes use of tags for elements and attributes,
  – is a self-describing format for data (i.e., XML describes the content, but not the presentation),
  – is becoming a universal standard for data exchange on the web.

Background: Metadata
• Metadata: fellow traveller with data
• The need for metadata on the web is justified by:
  – the increase of digital information and noise on the web,
  – emerging mechanisms to address information objects at a finer granularity
• Metadata is different from database views:
  – the metadata attached to a resource set may not be explicitly present in the set, and
  – metadata is likely to be semi-structured or not structured at all

Background: Metadata (2)
• Two well-known metadata standards for web pages are
  – Dublin Core, and
  – Warwick framework
• More general frameworks have also been proposed:
  – Topic Maps Standard (ISO/IEC 13250)
  – Resource Description Framework (RDF)

Background: Topic Maps
• Topic Maps standard
  – provides an interchangeable notation to represent the structure of information resources.
• Key concepts:
  – Topics (with types, names, occurrences)
  – Associations among topics, with association types and player role types
  – Occurrences of topics, with occurrence types

Background: Topic Maps (2)
• Example. In the context of an encyclopedia:
  – “Stephen King” is a topic of type “person”
  – “Maine” is a topic of type “state”
  – “S. King” was-born-in “Maine” is an association, where was-born-in is the association type
  – “S. King” can be depicted in a picture or mentioned in an article
• In a topic map, all types, roles etc. are also topics

Background: XML Topic Maps
• The ISO standard defines Topic Maps using SGML architectural forms and HyTime hyperlinks
• The XML Topic Maps (XTM) effort aims to
  – describe an XML grammar for interchanging Web-based topic maps, and thus,
  – encourage the use of a functional and syntactic subset of topic maps in the mass market

Background: RDF
• RDF is another specification to annotate resources with metadata in an interchangeable manner
• The RDF abstract model defines
  – resources, identified by URIs,
  – properties, corresponding to attributes used to describe the resources, and
  – statements, to associate a property-value pair with a resource

Background: RDF vs. Topic Maps
• RDF and Topic Maps are similar efforts, but [Freese]
  – RDF starts from resources, and may end up at an abstract knowledge layer, whereas
  – Topic Maps start with an abstract topic domain, which may optionally be linked to actual resources
• Thus, RDF is resource-centric and Topic Maps are topic-centric

Related Work: C-Web
• The C-Web [Christophides et al.] project aims to support information sharing within specific web communities
• Design goals
  – creation of conceptual models (schema) by domain experts,
  – publishing resources using the schema terms,
  – querying these information resources (RQL)
• C-Web employs RDF to express the schema.

Related Work: Others
• WebML and Structured Maps are two other studies exploiting metadata
• IBM’s Xcentral: the first search engine that indexes XML and RDF elements
• Many languages with the broader goal of querying the web as a whole:
  – WebSQL, WebOQL, W3QL, STRUQL, QUEST...

Web Information Space Model
• The WIS model has three basic components:
  – Information Resources
  – Expert Advice Repositories
  – Personalized Information Repositories
• Information Resources are web-based documents of any type.
  – In this study, we only consider XML/HTML documents.

WIS: Expert Advice Model
• Expert Advice Repositories
  – capture metadata in terms of topic, topic source reference and metalink entities, and
  – slightly extend the topic map data model (e.g. with topic detail levels).
  – All metadata entities are assigned importance values by the expert specifying them.
• Importance values: [0, 1] ∪ {No, Don’t-care}

Expert Advice: Topics
• Expert advice specifies topics along with their type and domain, and the quadruple <Tname, Ttype, Tdomain, Tauthor> serves as a key for a topic entity.
• Example. Topic entity with internal id T1
    T1 : <“Carrie”, “Novel”, “Literature”, E1>

Expert Advice: Topics (2)
• Topic importance values can be specified in a functional form:
    Tadvice (E1, Tname = “*Diabetes*”, Ttype = “Disease”, Tdomain = “Surgery Nurse Training”) = 0.3
• The above advice states that, in the domain of “surgery nurse training”, any topic of type “disease” whose name includes the phrase “diabetes” has a low importance value.

Expert Advice: Topic Sources
• The Topic Source Reference entity contains information about the topic sources and has the following attributes:
  – web address of the document in which the topic occurs,
  – detail level, describing how advanced the document is for the corresponding topic,
  – importance value (as usual), and
  – other attributes such as media type, last-modification date etc.
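
The functional topic advice from the “Expert Advice: Topics (2)” slide can be illustrated with a minimal Python sketch. It is not the prototype's code: the class names `Topic` and `TopicAdvice`, the function `topic_importance`, the shell-style wildcard matching, and the rule that an expert's advice applies only to topics authored by that expert are all assumptions made for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional, Sequence


@dataclass(frozen=True)
class Topic:
    # The quadruple <Tname, Ttype, Tdomain, Tauthor> that keys a topic entity.
    name: str
    ttype: str
    domain: str
    author: str


@dataclass(frozen=True)
class TopicAdvice:
    # One functional advice, e.g. Tadvice(E1, Tname="*Diabetes*", ...) = 0.3.
    expert: str
    name_pattern: str
    ttype: str
    domain: str
    importance: float

    def matches(self, topic: Topic) -> bool:
        # Assumption: advice of an expert covers that expert's own topics,
        # and "*" in the name pattern is a shell-style wildcard.
        return (
            topic.author == self.expert
            and topic.ttype == self.ttype
            and topic.domain == self.domain
            and fnmatchcase(topic.name, self.name_pattern)
        )


def topic_importance(topic: Topic, rules: Sequence[TopicAdvice]) -> Optional[float]:
    # Return the importance given by the first matching advice, if any.
    for rule in rules:
        if rule.matches(topic):
            return rule.importance
    return None


rules = [TopicAdvice("E1", "*Diabetes*", "Disease", "Surgery Nurse Training", 0.3)]
diabetes = Topic("Type 2 Diabetes", "Disease", "Surgery Nurse Training", "E1")
carrie = Topic("Carrie", "Novel", "Literature", "E1")
print(topic_importance(diabetes, rules))
print(topic_importance(carrie, rules))
```

The first lookup matches the wildcard advice, while the second topic has no applicable advice; how the actual system treats topics without advice is not stated in the slides.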

Expert Advice: Metalinks
• Metalinks represent relationships among topics (they correspond to the associations of Topic Maps)
• Metalinks do not represent relationships among topic sources, hence the term “metalink” is chosen

Expert Advice: Metalinks (2)
• Assume the metalink signature below:
    Prerequisite : SetOf topic → topic
• Metalink instances of type Prerequisite are:
    “Rel. Calculus” Prerequisite “Rel. Algebra”
    “Diabetes Surgery2” Prerequisite “Diabetes1”
• Then, an importance value may be assigned to a metalink instance as below:
    Madvice(E, “Rel. Calculus” Prerequisite “Rel. Algebra”) = 1

Expert Advice: Metalinks (3)
• In this study, we consider “topic closure” with respect to a particular metalink type
• For instance, assume the two assertions below:
    “Rel. Calculus” Prerequisite “Rel. Algebra”
    “Rel. Model” Prerequisite “Rel. Calculus”
• Then, the prerequisites for understanding the relational model are both relational algebra and relational calculus

WIS: Personalized Information Model
• Personalized information is expressed in 2 forms (both are assumed to be given in XML):
  – User preferences
  – User knowledge
• User preferences are specified as an ordered set of statements, along the lines of [AW00].
• User preferences are used to resolve conflicts among different domain experts.

User Preferences
• Example. Preferences of user John Doe:
    Expert (John-Doe) = <GW-Bush, W-Clinton>
    TImportance (John-Doe) = {(GW-Bush, 0.5), (W-Clinton, 0.9)}
    MImportance (John-Doe) = {(W-Clinton, 0.9)}
    Reject-S (John-Doe) = {Web-Address = www.dirtypolitics.com}
    Conflict-R = Ordered-Accept

User Knowledge
• User knowledge includes knowledge on a particular topic in terms of detail levels:
    UKnowledge (U) = {(topic, detail-level-value)}
    UKnowledge (John-Doe) = {(TName = “Math”, 3)}
• In addition, the navigation history of web resources for these topics is also kept in user knowledge.
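
The topic closure from the “Expert Advice: Metalinks (3)” slide can be read as reachability over metalink instances of one type. The following Python sketch is one way to compute it, assuming metalinks are stored as (left topic, metalink type, right topic) triples; the function name `topic_closure` and this representation are hypothetical and not taken from the prototype.

```python
from collections import deque


def topic_closure(start, metalink_type, metalinks):
    """Return every topic reachable from `start` by following metalinks
    of the given type from left-hand topic to right-hand topic."""
    # Index the right-hand topics by left-hand topic for the chosen type.
    successors = {}
    for lhs, mtype, rhs in metalinks:
        if mtype == metalink_type:
            successors.setdefault(lhs, []).append(rhs)

    reached = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in successors.get(current, []):
            # Skip topics already reached; this also guards against cycles.
            if nxt not in reached and nxt != start:
                reached.add(nxt)
                queue.append(nxt)
    return reached


metalinks = [
    ("Rel. Calculus", "Prerequisite", "Rel. Algebra"),
    ("Rel. Model", "Prerequisite", "Rel. Calculus"),
]
prerequisites = topic_closure("Rel. Model", "Prerequisite", metalinks)
print(sorted(prerequisites))  # ['Rel. Algebra', 'Rel. Calculus']
```

A breadth-first traversal is used here only for simplicity; the slides do not say how the prototype evaluates the closure.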

SQL-TC (Topic Centric) Features
• SQL-TC
  – is a strongly-typed multi-database query language,
  – is designed for querying both web resources in a particular domain and the associated expert advice repositories in an integrated manner,
  – allows users to pose queries using WIS model entities like topics and metalinks.

SQL-TC Syntax
    select [topic {.attribute} | metalink {.attribute}] as T
    from resources XML: url1, …
    using experts Topic Map1: url1, … as E1, …
    with user profile XML: URL as U
    where (i) conditions on topics and metalinks of experts,
          (ii) content-based conditions on sources,
          (iii) conditions on user profile information.
    order by [topic] importance
    stop after n most important
         | when importance below m
         | after n most important and when importance below m

Querying Web Resources
• Assume the information resources are at
  – http://www.stephenkinglibrary.com as S1,
  – http://www.stephen-king.net as S2,
• The expert advice repository associated with these resources is located at
  – www.sql-tc.com/king.xtm as E1
• And the user profile is located at
  – www.myprofile.com as U1

Example Query-1
• Query: Using only the advice at E1, find the two highest-ranked novels that are written by the novelist Stephen King, and the novels’ detail level 4 reviews from the two information resources.

Example Query-1 (2)
    select [$topic.name, $sourceRef.web-address] as T
    from resources S1, S2
    using experts E1
    where WrittenBy in E1.Metalinks
      and $topic = any (WrittenBy (“Stephen King”, “horror novelist”, “literature”, E1,))
      and $sourceRef = SourceOf($topic, 4, E1)
      and “review” in $sourceRef.roles
    order by $topic importance
    stop after 2 most important

Example Query-1 (3)
• The output of the query will be an interactive table as below

    Tname          SourceRef.Web-address
    “Carrie”       www.critics.com/carrie.html
    “The Stand”    www.critics.com/stand.html

Example Query-2
• Illustrates topic closure computation, user profiles and multiple experts
• Query: Using the advice of experts E1 and E2, and excluding the novels read by the user, find the highest-ranked novel that is related to another novel, “Wizard and The Glass”.

Example Query-2 (2)
    select [$topic.name] as T
    from resources S1, S2
    using experts E1, E2
    with user profile www.myprofile.com as U
    where RelatedTo in (E1, E2).Metalinks
      and $topic = any (RelatedTo* (“Wizard and The Glass”, , “literature”, ,))
      and $topic not in GetTopics(U.UserKnowledge)
    order by importance
    stop after 1 most important

Example Query-2 (3)
• Assume that the following are specified in the expert advice repositories:
    “The Wasteful Lands” RelatedTo “Wizard and The Glass”
    “Drawings of Three” RelatedTo “The Wasteful Lands”
    “Dark Tower” RelatedTo “Drawings of Three”
• Then, the variable $topic will be bound to each of the LHS topics above

Example Query-2 (4)
• When multiple expert advice repositories are used, expert advice conflicts may occur.
• Example:
    Madvice(E1, “Drawings of Three” RelatedTo “The Wasteful Lands”) = 0.8
    Madvice(E2, “Drawings of Three” RelatedTo “The Wasteful Lands”) = “No”
• Consult the user preferences while computing the closure: this may lead to early pruning!
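
The interplay of closure computation and conflict resolution in Example Query-2 can be sketched in Python as follows. The slides do not define the Ordered-Accept rule, so the sketch assumes it means “take the advice of the first expert, in the user's preference order, who rated the metalink”. It further assumes that a metalink with no advice is accepted. The names `resolve` and `related_closure` are hypothetical.

```python
def resolve(edge, advice, expert_order):
    """Assumed reading of Ordered-Accept: return the value given by the
    first expert (in the user's preference order) who rated this metalink."""
    for expert in expert_order:
        value = advice.get((expert, edge))
        if value is not None:
            return value
    return None  # no expert rated the metalink


def related_closure(target, edges, advice, expert_order):
    """Topics X with X RelatedTo* target, skipping metalinks resolved to "No"."""
    found = []
    frontier = [target]
    while frontier:
        current = frontier.pop()
        for lhs, rhs in edges:
            if rhs != current or lhs in found or lhs == target:
                continue
            if resolve((lhs, rhs), advice, expert_order) == "No":
                continue  # early pruning: nothing behind this metalink is visited
            found.append(lhs)
            frontier.append(lhs)
    return found


edges = [
    ("The Wasteful Lands", "Wizard and The Glass"),
    ("Drawings of Three", "The Wasteful Lands"),
    ("Dark Tower", "Drawings of Three"),
]
conflicted = ("Drawings of Three", "The Wasteful Lands")
advice = {("E1", conflicted): 0.8, ("E2", conflicted): "No"}

prefers_e1 = related_closure("Wizard and The Glass", edges, advice, ["E1", "E2"])
prefers_e2 = related_closure("Wizard and The Glass", edges, advice, ["E2", "E1"])
print(prefers_e1)
print(prefers_e2)
```

Under these assumptions, a user who prefers E1 reaches all three related novels, while a user who prefers E2 has the conflicted metalink rejected, so “Drawings of Three” and “Dark Tower” are never explored.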

Implementation
• A prototype implementation is provided on the PC/Windows platform using Java
• Expert Advice Repositories are expressed as XTM documents with slight modifications
• XTMs are then processed to yield an internal graph data structure and stored in a relational database
• At the moment, SQL-TC queries are translated to equivalent SQL queries

Conclusion and Contributions
• A web information space model and its query language are developed for sophisticated querying of “subnets”
• Essential components of the WIS model are expert advice repositories capturing metadata, and personalized information for users
• XML Topic Maps is exploited to express expert advice, with the chance of identifying and resolving many issues with this evolving effort

Conclusion and Contributions (2)
• SQL-TC, an integrated query language, is designed to query both the expert advice repositories and the associated web resources
• SQL-TC consults user preferences and knowledge to refine the outputs
• Further features include
  – topic closure computation,
  – importance-based output ranking/filtering

Future Research Directions
• Develop a semi-automated tool to gather metadata information for very large web domains,
• Develop a sophisticated GUI for naive users,
• Develop a large-scale, fully functional system implementation that allows measuring performance in terms of response time and semantic quality,

Future Research Directions (2)
• Develop query processing and optimization algorithms for SQL-TC that
  – are tailored for similarity-based matching of metadata entities,
  – support ranking inherently, and
  – support topic closure operator evaluation efficiently.

Thanks for your attendance... Any questions?