Detecting and Representing Relevant Page-Level Web Deltas
Sanjay Kumar Madria
Department of Computer Science
Purdue University
West Lafayette, IN 47907

Current Situation of W3
The Web allows information to change at any time and in any way
Two forms of changes
  • Existence
  • Structure and content modification
A change leaves no trace of the previous document: the new version replaces its antecedents

Problems of Change Management
Problem: detecting, representing and querying these changes
The problem is challenging
  • Typical database approaches to detecting changes, based on triggering mechanisms, are not usable
  • Information sources typically do not keep historical information in a format that is accessible to outside users

Motivating Example
Assume that there is a web site at www.panacea.gov
It provides information related to drugs used for various diseases

Motivating Example
Suppose that, on 15th January, a user wishes to find out periodically (every 30 days) information related to the side effects and uses of drugs used for various diseases, and the changes to this information at the page level compared to its previous version

Structure of www.panacea.gov
The web page at www.panacea.gov contains a list of diseases
Each disease link points to a web page containing a list of drugs used for prevention and cure of the disease
Hyperlinks associated with each drug point to documents containing a list of various issues related to that drug (description, manufacturers, clinical pharmacology, uses, side effects etc.)
From the hyperlinks associated with each issue, one can retrieve details of these issues for a particular drug

A Snapshot as on 15th Jan
[Figure: link structure of www.panacea.gov on 15th January. Disease labels shown: AIDS, Cancer, Alzheimer’s Disease, Heart disease, Diabetes, Impotence. Drug labels shown: Indavir, Ritonavir, Hirudin, Ibuprofen, Niacin, Vasomax, Caverject. Drugs link to Side effects and Uses pages.]

Some Changes
25th January
Links related to Diabetes are removed
New link containing information related
to Parkinson’s Disease
Information related to issues, side effects and uses of various drugs for Cancer is also modified

A Partial Snapshot as on 25th Jan
[Figure: partial link structure of www.panacea.gov on 25th January. Labels shown: Parkinson’s Disease, Tolcapone, Cancer, Diabetes, Side effects, Uses.]

Some Changes
30th January
Links related to Impotence are modified
  • Previously provided by www.pfizer.com
  • Now by www.panacea.gov
Inter-linked structure of the Web pages related to Caverject is also modified
Information about Viagra, a new drug for Impotence, is added

A Partial Snapshot as on 30th Jan
[Figure: partial link structure of www.panacea.gov on 30th January. Labels shown: Impotence, Caverject, Vasomax, Viagra, Side effects, Uses.]

Some Changes
8th February
Link structure of Heart Disease is modified
  • Label Heart Disease is changed to Heart Disorder
  • Content of the pages dealing with side effects and uses of Hirudin is updated
  • Inter-linked document structure of Niacin is modified
Web pages related to the side effects and uses of Ibuprofen (Alzheimer’s Disease) are removed

On 8th February
[Figure: partial link structure of www.panacea.gov on 8th February. Labels shown: Alzheimer’s Disease, Heart disorder, Hirudin, Niacin, Side effects, Uses.]

A Snapshot as on 15th Feb
[Figure: link structure of www.panacea.gov on 15th February. Labels shown: AIDS, Cancer, Alzheimer’s Disease, Parkinson’s Disease, Heart disease, Impotence, Indavir, Ritonavir, Hirudin, Niacin, Viagra, Vasomax, Caverject, Side effects, Uses.]

Objectives
Web deltas: changes to web information
Detecting and representing relevant page-level web deltas
  • Detect those document changes that are relevant to the user’s query, not arbitrary changes or web deltas
  • Restricted to page level: documents which are added to the site, deleted from the site, or have undergone content or structural modification
  • Determine how these delta documents are related to one another and to other documents relevant to the user’s query

The WHOWEDA Project
WHOWEDA: A WareHouse of WEb DAta
To design and implement a web warehousing system capable of effective extraction, management, and processing of information on the World Wide Web
Data model: WHOM (WareHouse Object Model)

Overview
of WHOM
Our web warehouse can be conceived of as a collection of web tables
A web table is represented by a set of web tuples and a set of web schemas
A web tuple is a directed graph containing nodes and links, and satisfies a web schema
Nodes and links contain content, metadata and structural information associated with Web documents and hyperlinks
  • Tree representation
Web algebra containing web operators to manipulate web tables
  • Global Coupling, Web Select, Web Join etc.

Overview of our approach
Step 1: Two snapshots of old and new relevant data are coupled from the Web using the global web coupling operation and materialized in two web tables
Step 2: Web join, left outer web join and right outer web join operations are performed on these two web tables
  • The results are the joined, left outer joined and right outer joined web tables
Step 3: Delta web tables containing different types of web deltas are generated from these resultant web tables
Elaborate on these steps...

Step 1: Retrieving snapshots of Web data using Global Web Coupling

Web Query Specification
Features:
  • Draw a web query as a directed connected acyclic graph (also called a coupling query)
  • Query can also be specified in text form
  • Specify search conditions on the nodes and edges of the graph
Performed by the global web coupling operator

Coupling Query
Set of node variables Xn
  • Each variable represents a set of Web documents
Set of link variables Xl
  • Each variable represents a set of hyperlinks
Set of predicates P defined over some of the node and link variables
  • Specify metadata, content or structural conditions
Set of connectivities C in DNF defined over node and link variables
  • Specify the hyperlink structure of the documents
Set of coupling query predicates Q
  • Conditions on execution of the query

Example
Suppose that, on 15th January, a user wishes to find out periodically (every 30 days), from the web site at www.panacea.gov, information related to the side effects and uses of drugs used for various diseases
The result of the query is stored in the form of a web
table

Coupling Query
Xn = {a, b, d, k}
Xl = { - }
P = {p1, p2, p3, p4}
p1(a) = METADATA:: a[url] EQUALS “www.panacea.gov”
p2(b) = CONTENT:: b[html.body.title] NON-ATTRCONT “drug list”
p3(k) = CONTENT:: k[html.body.title] NON-ATTRCONT “uses”
p4(d) = CONTENT:: d[html.body.title] NON-ATTRCONT “side effects”

Coupling Query
C = k1 AND k2 AND k3
k1 = a < - > b
k2 = b < -{1, 6} > d
k3 = b < -{1, 3} > k
Q = {q1}
q1(b) = COUPLING_QUERY:: polling_frequency EQUALS “30 days”

Pictorial Representation
[Figure: coupling query graph. Node a (www.panacea.gov) is connected to node b (“drug list”); b is connected to d (“side effects”) by an edge labeled {1, 6} and to k (“uses”) by an edge labeled {1, 3}.]

Web Table Drugs (15th Jan)
[Figure: web tuples for AIDS (Indavir, Ritonavir), Cancer (Beta Carotene) and Alzheimer’s Disease (Ibuprofen). Node identifiers shown: a0, b0, b1, b5, u0, u1, d0, d1, d2, d12, k0, k1, k2, k12.]

Web Table Drugs (15th Jan)
[Figure: web tuples for Diabetes (Albuterol), Impotence (Vasomax, Caverject) and Heart Disease (Hirudin). Node identifiers shown: a0, b2, b3, b4, u2, u4, u5, u6, u7, u8, d3, d4, d5, d6, k3, k5, k6, k7.]

Web Table New Drugs (15th Feb)
[Figure: web tuples for AIDS (Indavir, Ritonavir), Cancer (Beta Carotene) and Heart Disorder (Hirudin). Node identifiers shown: a0, b0, b1, b2, u0, u1, u2, d0, d1, d2, d3, k0, k1, k2, k3.]

Web Table New Drugs (15th Feb)
[Figure: web tuples for Heart Disorder (Niacin), Impotence (Vasomax, Caverject) and Parkinson’s Disease (Tolcapone). Node identifiers shown: a0, b2, b4, b6, u3, u7, u9, u10, d6, d7, d8, d10, k7, k8, k10.]

Web Table New Drugs (15th Feb)
[Figure: web tuples for Parkinson’s Disease (Tolcapone) and Impotence (Viagra). Node identifiers shown: a0, b4, b6, u10, u12, d9, d10, k9, k10.]

Step 2: Performing Web Join, Left and Right Outer Web Join

Web Join
Information composition operator
Combines two web tables into a single web table under certain conditions
  • Concatenates a web tuple of one web table with a web tuple of the other web table whenever there exist joinable nodes
  • Two nodes are joinable if they are identical
  • Two nodes are identical if their URLs and last modification dates are the same
The joined web tuple is stored in a different web table

Web Join
Join web tables Drugs and New Drugs
Nodes which have not undergone any changes are the joinable nodes in these two web tables
Content-modified nodes, new nodes and deleted nodes cannot be joinable nodes

Joined web table
[Figure: joined web tuples (1) to (3), combining the AIDS tuples for Indavir and Ritonavir from both web tables. Node identifiers shown: a0, b0, u0, u1, d0, d1, k0, k1.]

Joined Web Table
[Figure: joined web tuple (4), combining Heart Disorder (Niacin) with Heart Disease (Hirudin), and joined web tuple (5), combining the two Impotence (Caverject) tuples. Node identifiers shown: a0, b2, b4, u2, u3, u7, u8, d3, d6, d7, k3, k4, k7.]

Joined Table
[Figure: joined web tuple (6), combining Heart Disease (Hirudin) with Heart Disorder (Hirudin). Node identifiers shown: a0, b2, u2, d3, k3.]

Types of web tuples
Web tuples in which all the nodes are joinable
  • Result of joining two versions of a web tuple that has remained unchanged during the transition
Web tuples in which some of the nodes are joinable
  • The remaining nodes are the result of insertion, deletion or modification operations
[Figure: joined web tuple (5), Impotence (Caverject). Node identifiers shown: a0, b4, u7, u8, d6, k7.]

Types of web tuples
Tuples in which
  • Some of the nodes are joinable nodes
  • Of the remaining nodes, some are the result of insertion, deletion or modification
  • The rest remained unchanged during the transition
[Figure: joined web tuple (3), AIDS (Indavir, Ritonavir). Node identifiers shown: a0, b0, u0, u1, d0, d1, k0, k1.]

Outer Web Join
Web tuples that do not participate in the web join process (dangling web tuples) are absent from the joined web table
Outer web join enables us to identify them
  • Left outer web join
  • Right outer web join

Web Table New Drugs (15th Feb)
[Figure: the three slides of web table New Drugs (15th Feb) are repeated here for reference. Tuples shown: AIDS (Indavir, Ritonavir), Cancer (Beta Carotene), Heart Disorder (Hirudin, Niacin), Impotence (Vasomax, Caverject, Viagra), Parkinson’s Disease (Tolcapone).]

Right Outer Web Join
[Figure: dangling web tuples of New Drugs for Cancer (Beta Carotene), Impotence (Vasomax, Viagra) and Parkinson’s Disease (Tolcapone). Node identifiers shown: a0, b1, b4, b6, u9, u10, u12, d2, d8, d9, d10, k2, k8, k9, k10.]

Types of web tuples
New web tuples which are added during the
transition
  • These tuples contain some new nodes, and the content of the remaining nodes has changed
Tuples in which all the nodes have undergone content modification
Tuples which existed before, in which some of the nodes are new and the content of the remaining ones has changed

Web Table Drugs (15th Jan)
[Figure: the two slides of web table Drugs (15th Jan) are repeated here for reference. Tuples shown: AIDS (Indavir, Ritonavir), Cancer (Beta Carotene), Alzheimer’s Disease (Ibuprofen), Diabetes (Albuterol), Impotence (Vasomax, Caverject), Heart Disease (Hirudin).]

Left Outer Web Join
[Figure: dangling web tuples of Drugs for Cancer (Beta Carotene), Alzheimer’s Disease (Ibuprofen), Diabetes (Albuterol) and Impotence (Vasomax). Node identifiers shown: a0, b1, b3, b4, b5, u4, u5, u6, d2, d4, d5, d12, k2, k5, k6, k12.]

Types of web tuples
Web tuples which are deleted during the transition
  • These tuples do not occur in the new web table
Tuples in which all the nodes have undergone content modification
Tuples in which some of the nodes are deleted and the content of the remaining ones has changed
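
Step 2 as a whole can be sketched in Python as follows. This is a minimal illustration, not the WHOWEDA implementation: the `Node` class, the function names, the URLs below www.panacea.gov and all dates are assumptions made up for the example. The only rule taken from the slides is that two nodes are joinable when their URL and last modification date are the same.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    """A page in a web tuple. Two nodes are identical (joinable) when both
    the URL and the last modification date match."""
    url: str
    last_modified: str  # hypothetical date string


WebTuple = Tuple[Node, ...]


def joinable(t1: WebTuple, t2: WebTuple) -> bool:
    """True if the two web tuples share at least one identical node."""
    return bool(set(t1) & set(t2))


def web_join(old: List[WebTuple], new: List[WebTuple]) -> List[WebTuple]:
    """Concatenate every old tuple with every new tuple that has a joinable node."""
    return [o + n for o in old for n in new if joinable(o, n)]


def left_outer_web_join(old: List[WebTuple], new: List[WebTuple]) -> List[WebTuple]:
    """Dangling tuples of the old table: they join with no tuple of the new table."""
    return [o for o in old if not any(joinable(o, n) for n in new)]


def right_outer_web_join(old: List[WebTuple], new: List[WebTuple]) -> List[WebTuple]:
    """Dangling tuples of the new table: they join with no tuple of the old table."""
    return [n for n in new if not any(joinable(o, n) for o in old)]


# Hypothetical snapshots: paths and dates are invented for illustration only.
BASE = "http://www.panacea.gov"
drugs = [
    (Node(BASE, "01-15"), Node(BASE + "/aids", "01-10"),
     Node(BASE + "/aids/indavir", "01-05")),
    (Node(BASE, "01-15"), Node(BASE + "/diabetes", "01-02"),
     Node(BASE + "/diabetes/albuterol", "01-03")),
]
new_drugs = [
    (Node(BASE, "02-08"), Node(BASE + "/aids", "01-10"),
     Node(BASE + "/aids/indavir", "02-01")),
    (Node(BASE, "02-08"), Node(BASE + "/parkinsons", "01-25"),
     Node(BASE + "/parkinsons/tolcapone", "01-25")),
]

joined = web_join(drugs, new_drugs)
left_dangling = left_outer_web_join(drugs, new_drugs)
right_dangling = right_outer_web_join(drugs, new_drugs)
```

In this toy data the root page is modified between the snapshots, so only the unchanged AIDS page makes the first tuples joinable; the Diabetes tuple is dangling on the left and the Parkinson’s Disease tuple is dangling on the right.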
Step 3: Generating Delta Web Tables

Overview
Input: joined, left outer joined and right outer joined web tables
Output: set of delta web tables

Delta Web Tables
Delta web tables are used to represent web deltas
They encapsulate the relevant changes that have occurred in the Web with respect to a user’s query
Three types
Delta+ web table
  • Set of web tuples containing new nodes inserted during the transition
Delta- web table
  • Set of web tuples containing nodes removed during the transition
Delta-M web table
  • Set of web tuples representing the previous and current sets of modified nodes

Steps for Generation
Phase 1: Delta Nodes Identification Phase
Nodes which are added, deleted or modified during the transition are identified
Input: old and new versions of the web tables, and the set of joinable nodes from the joined web table
Output: sets of nodes which are added, deleted or modified during the transition
  • Nodes which exist in the new web table but not in the old web table are the new nodes
  • Nodes which exist in the old web table but not in the new one are the deleted nodes
  • Nodes which exist in both web tables but are not joinable are the nodes which have undergone content modification

Steps for Generation
Phase 2: Delta Tuples Identification Phase
Determines how the delta nodes are related to one another and how they are associated with those nodes which have remained unchanged
We identify those tuples which contain nodes that are added, deleted or modified during the transition
Input: joined, left outer joined and right outer joined web tables, and the sets of delta nodes
Output: sets of web tuples represented by the Delta+, Delta- and Delta-M web tables

Phase 2 (Delta+ Web Table)
Scan the joined and right outer joined web tables to identify web tuples containing nodes which are inserted during the transition
New nodes can occur in these tables only:
  • In the right outer joined table, if the remaining nodes in the tuple containing the new nodes are modified (hence not joinable)
  • In the
joined web table, if some of the nodes in the tuple containing new nodes have remained unchanged and hence are joinable
These web tuples are stored in the Delta+ web table

Example (Right Outer Web Join)
[Figure: dangling web tuples of New Drugs for Cancer (Beta Carotene), Impotence (Vasomax, Viagra) and Parkinson’s Disease (Tolcapone). Node identifiers shown: a0, b1, b4, b6, u9, u10, u12, d2, d8, d9, d10, k2, k8, k9, k10.]

Example (Joined Web Table)
[Figure: joined web tuple (4), combining Heart Disorder (Niacin) with Heart Disease (Hirudin). Node identifiers shown: a0, b2, u2, u3, d3, d7, k3, k7.]

Delta+ Web Table
[Figure: web tuples for Heart Disorder (Niacin), Impotence (Vasomax, Viagra) and Parkinson’s Disease (Tolcapone). Node identifiers shown: a0, b2, b4, b6, u3, u9, u10, u12, d7, d8, d9, d10, k7, k8, k9, k10.]

Phase 2 (Delta- Web Table)
Scan the joined and left outer joined web tables to identify web tuples containing nodes which are deleted during the transition
Deleted nodes can occur in these tables only:
  • In the left outer joined table, if the remaining nodes in the tuple containing the deleted nodes are modified (hence not joinable)
  • In the joined web table, if some of the nodes in the tuple containing deleted nodes have remained unchanged and hence are joinable
These web tuples are stored in the Delta- web table

Example (Left Outer Web Join)
[Figure: dangling web tuples of Drugs for Cancer (Beta Carotene), Alzheimer’s Disease (Ibuprofen), Diabetes (Albuterol) and Impotence (Vasomax). Node identifiers shown: a0, b1, b3, b4, b5, u4, u5, u6, d2, d4, d5, d12, k2, k5, k6, k12.]

Example (Joined Web Table)
[Figure: joined web tuple (5), Impotence (Caverject). Node identifiers shown: a0, b4, u7, u8, d6, k7.]

Delta- Web Table
[Figure: web tuples for Impotence (Caverject, Vasomax), Diabetes (Albuterol) and Alzheimer’s Disease (Ibuprofen). Node identifiers shown: a0, b3, b4, b5, u4, u5, u6, u7, u8, d4, d5, d6, d12, k5, k6, k7, k12.]

Phase 2 (Delta-M Web Table)
Finally, nodes which are modified during the transition can be identified by inspecting all three web tables
Tuples in the left and right outer joined tables which do not contain any new or deleted node represent the old and new versions of these nodes respectively
  • These tuples do not occur in the joined web table, as all their nodes are modified
Tuples in left and right outer joined tables
that contain modified nodes as well as inserted or deleted nodes
  • These modified nodes may not appear in the joined web table if no other joinable web tuples contain them

Example (Right Outer Web Join)
[Figure: dangling web tuples of New Drugs for Cancer (Beta Carotene), Impotence (Vasomax, Viagra) and Parkinson’s Disease (Tolcapone). Node identifiers shown: a0, b1, b4, b6, u9, u10, u12, d2, d8, d9, d10, k2, k8, k9, k10.]

Example (Left Outer Web Join)
[Figure: dangling web tuples of Drugs for Cancer (Beta Carotene), Alzheimer’s Disease (Ibuprofen), Diabetes (Albuterol) and Impotence (Vasomax). Node identifiers shown: a0, b1, b3, b4, b5, u4, u5, u6, d2, d4, d5, d12, k2, k5, k6, k12.]

Phase 2
Tuples in the joined web table where some of the nodes represent the old and new versions of these modified nodes
These web tuples are stored in the Delta-M web table

Example (Joined web table)
[Figure: joined web tuples (1) and (2), AIDS (Indavir, Ritonavir). Node identifiers shown: a0, b0, u0, u1, d0, d1, k0, k1.]

Delta-M Web Table
[Figure: web tuples (1) and (2) for AIDS (Indavir, Ritonavir) and web tuple (3) for Impotence (Caverject). Node identifiers shown: a0, b0, b4, u0, u1, u7, u8, d0, d1, d6, k0, k1, k7.]

Delta-M Web Table
[Figure: web tuple (4), combining Heart Disease (Hirudin) with Heart Disorder (Hirudin), and web tuple (5), combining the old and new Cancer (Beta Carotene) tuples. Node identifiers shown: a0, b1, b2, u2, d2, d3, k2, k3.]

Applications
Provides the framework for
Trend analysis
E-commerce
  • Consumer behaviour
  • Product comparisons
  • Competitive intelligence
  • Notification services
  • Provides a useful database for buyers’ and sellers’ agents

Future Work
Analytical and empirical studies of the algorithms for generating delta web tables
Mechanism to distinguish between the modified, new or deleted nodes
Annotation on delta nodes
Extend to sub-page level
Query languages for querying the changes
Change notification service
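
Phase 1 of Step 3 (delta nodes identification) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the algorithm whose analysis is listed as future work: the `Node` class, the function names, the URLs below www.panacea.gov and the dates are hypothetical, nodes are matched across snapshots by URL, and "not joinable" is decided by comparing last modification dates directly rather than by reading the joinable set from the joined web table.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class Node:
    """A page in a web tuple, identified by URL and last modification date."""
    url: str
    last_modified: str  # hypothetical date string


WebTuple = Tuple[Node, ...]


def nodes_by_url(table: Iterable[WebTuple]) -> Dict[str, Node]:
    """Index every node of a web table by its URL (assumed unique per page)."""
    return {node.url: node for tup in table for node in tup}


def identify_delta_nodes(
    old: Iterable[WebTuple], new: Iterable[WebTuple]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return the URLs of (added, deleted, modified) nodes."""
    old_nodes = nodes_by_url(old)
    new_nodes = nodes_by_url(new)
    # In the new web table but not in the old one: new nodes.
    added = set(new_nodes) - set(old_nodes)
    # In the old web table but not in the new one: deleted nodes.
    deleted = set(old_nodes) - set(new_nodes)
    # In both tables but not identical (date differs): modified nodes.
    modified = {
        url
        for url in set(old_nodes) & set(new_nodes)
        if old_nodes[url] != new_nodes[url]
    }
    return added, deleted, modified


# Hypothetical snapshots: paths and dates are invented for illustration only.
BASE = "http://www.panacea.gov"
drugs = [
    (Node(BASE, "01-15"), Node(BASE + "/aids", "01-10"),
     Node(BASE + "/aids/indavir", "01-05")),
    (Node(BASE, "01-15"), Node(BASE + "/diabetes", "01-02"),
     Node(BASE + "/diabetes/albuterol", "01-03")),
]
new_drugs = [
    (Node(BASE, "02-08"), Node(BASE + "/aids", "01-10"),
     Node(BASE + "/aids/indavir", "02-01")),
    (Node(BASE, "02-08"), Node(BASE + "/parkinsons", "01-25"),
     Node(BASE + "/parkinsons/tolcapone", "01-25")),
]

added, deleted, modified = identify_delta_nodes(drugs, new_drugs)
```

Tuples of the new table that contain a URL from `added` would then be candidates for the Delta+ web table, and tuples of the old table that contain a URL from `deleted` candidates for the Delta- web table.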