
Detecting and Representing
Relevant Page-Level Web Deltas
Sanjay Kumar Madria
Department of Computer Science
Purdue University
West Lafayette, IN 47907
[email protected]
Current Situation of W3


The Web allows information
to change at any time and in
any way
Two forms of changes



Existence
Structure and content modification
Leaves no trace of the previous document
Replaces its antecedents, leaving no trace
Problems of Change Management

Problem:


Detecting, Representing and Querying these changes
The problem is challenging


Typical database approaches to detect changes based
on triggering mechanisms are not usable
Information sources typically do not keep historical information in a format that is accessible to outside users
Motivating Example

Assume that there is a web site at www.panacea.gov

Provides information related to drugs used for various
diseases
Motivating Example

Suppose, on 15th January, a user wishes to find out
periodically (every 30 days)


information related to side effects and uses of drugs used for various diseases, and
changes to this information at the page level compared to its previous version
Structure of www.panacea.gov




Web page at www.panacea.gov contains a list of
diseases
Each link of a particular disease points to a web page
containing a list of drugs used for prevention and
cure of the disease
Hyperlinks associated with each drug point to documents containing a list of various issues related to a particular drug (description, manufacturers, clinical pharmacology, uses, side-effects etc.)
From the hyperlinks associated with each issue, one
can retrieve details of these issues for a particular
drug
A Snapshot as on 15th Jan
[Figure: site structure on 15th January. Diseases: AIDS, Cancer, Alzheimer’s Disease, Heart disease, Diabetes, Impotence. Drugs shown: Indavir, Ritonavir, Hirudin, Ibuprofen, Niacin, Vasomax, Caverject, with “Side effects” and “Uses” pages.]
Some Changes

25th January



Links related to Diabetes are removed
A new link containing information related to Parkinson’s Disease is added
Information related to issues, side-effects and uses of various drugs for Cancer is also modified
A Partial Snapshot as on 25th Jan
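[Figure: partial site structure on 25th January under www.panacea.gov, showing Parkinson’s Disease with the drug Tolcapone and its “Side effects” and “Uses” pages, together with Cancer and Diabetes.]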
Some Changes

30th January



Links related to Impotence are modified
• Previously provided by www.pfizer.com
• Now by www.panacea.gov
Inter-linked structure of the Web pages related to Caverject is also modified
Information about Viagra, a new drug for Impotence, is added
A Partial Snapshot as on 30th Jan
[Figure: partial site structure on 30th January, showing Impotence under www.panacea.gov with the drugs Caverject, Vasomax and Viagra and their “Side effects” and “Uses” pages.]
Some Changes

8th February


Link structure of Heart Disease is modified
• Label Heart Disease is modified to Heart Disorder
• Content of the pages dealing with side-effects and uses of Hirudin is updated
• Inter-linked document structure of Niacin is
modified
Web pages related to the side effects and uses of
Ibuprofen (Alzheimer’s Disease) are removed
On 8th February
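[Figure: partial site structure on 8th February, showing Alzheimer’s Disease and Heart disorder under www.panacea.gov, with the drugs Hirudin and Niacin and their “Side effects” and “Uses” pages.]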
A Snapshot as on 15th Feb
[Figure: site structure on 15th February. Diseases: AIDS, Cancer, Alzheimer’s Disease, Parkinson’s Disease, Heart disease, Impotence. Drugs shown: Indavir, Ritonavir, Hirudin, Niacin, Viagra, Vasomax, Caverject, with “Side effects” and “Uses” pages.]
Objectives


Web deltas: changes to web information
Detecting and representing relevant page-level web deltas
• Changes that are relevant to the user’s query, not arbitrary changes or web deltas
• Restricted to the page level
Detect those documents
• which are added to the site
• which are deleted from the site
• which have undergone content or structural modification
Determine how these delta documents are related to one another and to other documents relevant to the user’s query
The WHOWEDA Project



WHOWEDA: A WareHouse of WEb DAta
To design and implement a web warehousing system
capable of effective extraction, management, and
processing of information on the World Wide Web
Data model: WHOM (WareHouse Object Model)
Overview of WHOM




Our web warehouse can be conceived of as a
collection of web tables
A set of web tuples and a set of web schemas represent a web table
A web tuple is a directed graph that contains nodes and links and satisfies a web schema
Nodes and links contain content, metadata and
structural information associated with Web
documents and hyperlinks


Tree representation
Web algebra containing web operators to manipulate
web tables

Global Coupling, Web Select, Web Join etc.
Overview of our approach


Step 1: Two snapshots of the relevant data, old and new, are retrieved from the Web using the global web coupling operation and materialized in two web tables.
Step 2: Web join, left outer web join and right outer web join operations are performed on these two web tables.
• The results are the joined, left outer joined and right outer joined web tables.
Step 3: Delta web tables containing the different types of web deltas are generated from these resultant web tables.
Each step is elaborated below.
Step 1: Retrieving snapshots of
Web data using Global Web
Coupling
Web Query Specification

Features:




Draw a web query as a directed connected acyclic
graph (also called a coupling query)
Query can also be specified in text form
Specify search conditions on the nodes and edges of
the graph
Performed by the global web coupling operator
Coupling Query

Set of node variables Xn
• Each variable represents a set of Web documents
Set of link variables Xl
• Each variable represents a set of hyperlinks
Set of predicates P defined over some of the node and link variables
• Specify metadata, content or structural conditions
Set of connectivities C in DNF defined over node and link variables
• Specify the hyperlink structure of the documents
Set of coupling query predicates Q
• Conditions on execution of the query
Example

Suppose, on 15th January, a user wishes to find out
periodically (every 30 days) from the web site at
www.panacea.gov


information related to side effects and uses of drugs
used for various diseases
Result of the query is stored in the form of web table
Coupling Query



Xn = {a, b, d, k}
Xl = { - }
P = {p1, p2, p3, p4}




p1(a) = METADATA:: a[url] EQUALS “www.panacea.gov”
p2(b) = CONTENT:: b[html.body.title] NON-ATTRCONT “drug list”
p3(k) = CONTENT:: k[html.body.title] NON-ATTRCONT “uses”
p4(d) = CONTENT:: d[html.body.title] NON-ATTRCONT “side effects”
Coupling Query

C = k1 AND k2 AND k3




k1 = a < - > b
k2 = b < -{1, 6} > d
k3 = b < -{1, 3} > k
Q = {q1}

q1(b) = COUPLING_QUERY:: polling_frequency EQUALS “30 days”
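As a rough illustration, the example coupling query could be held in a small data structure. This is a sketch only: the class and field names (`Predicate`, `Connectivity`, `CouplingQuery`, `min_links`, `max_links`, `polling_frequency_days`) are hypothetical and are not WHOWEDA's own representation, and the `{1, 6}` and `{1, 3}` annotations are assumed to bound the number of links on a path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    variable: str   # node variable the condition applies to
    kind: str       # "METADATA" or "CONTENT"
    attribute: str  # e.g. "url" or "html.body.title"
    operator: str   # e.g. "EQUALS" or "NON-ATTRCONT"
    value: str


@dataclass(frozen=True)
class Connectivity:
    source: str
    target: str
    min_links: int = 1  # assumed meaning of the {min, max} annotation
    max_links: int = 1


@dataclass
class CouplingQuery:
    node_vars: set
    predicates: list
    connectivities: list  # read as one conjunction: k1 AND k2 AND k3
    polling_frequency_days: int

    def undeclared_variables(self):
        """Return variables used in P or C that are missing from Xn."""
        used = {p.variable for p in self.predicates}
        for c in self.connectivities:
            used.update((c.source, c.target))
        return used - self.node_vars


query = CouplingQuery(
    node_vars={"a", "b", "d", "k"},
    predicates=[
        Predicate("a", "METADATA", "url", "EQUALS", "www.panacea.gov"),
        Predicate("b", "CONTENT", "html.body.title", "NON-ATTRCONT", "drug list"),
        Predicate("k", "CONTENT", "html.body.title", "NON-ATTRCONT", "uses"),
        Predicate("d", "CONTENT", "html.body.title", "NON-ATTRCONT", "side effects"),
    ],
    connectivities=[
        Connectivity("a", "b"),
        Connectivity("b", "d", 1, 6),
        Connectivity("b", "k", 1, 3),
    ],
    polling_frequency_days=30,
)
```

The `undeclared_variables` check is a convenience added for this sketch: it catches a predicate or connectivity that mentions a variable absent from Xn.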
Pictorial Representation
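[Figure: the coupling query drawn as a graph. Node a (www.panacea.gov) links to node b (“drug list”); b is connected to node d (“side effects”) by an edge labelled {1, 6} and to node k (“uses”) by an edge labelled {1, 3}.]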
Web Table Drugs (15th Jan)
[Figure: web table Drugs, materialized on 15th January. Each web tuple starts at node a0 and passes through a disease node b and drug nodes u to nodes d and k. Disease nodes: AIDS (b0), Cancer (b1), Heart Disease (b2), Diabetes (b3), Impotence (b4), Alzheimer’s Disease (b5). Drugs: Indavir, Ritonavir, Beta Carotene, Ibuprofen, Albuterol, Vasomax, Caverject, Hirudin.]
Web Table New Drugs (15th Feb)
[Figure: web table New Drugs, materialized on 15th February. Each web tuple starts at node a0 and passes through a disease node b and drug nodes u to nodes d and k. Disease nodes: AIDS (b0), Cancer (b1), Heart Disorder (b2), Impotence (b4), Parkinson’s Disease (b6). Drugs: Indavir, Ritonavir, Beta Carotene, Hirudin, Niacin, Vasomax, Caverject, Tolcapone, Viagra.]
Step 2: Performing Web Join, Left
and Right Outer Web Join
Web Join






Information composition operator
Combines two web tables into a single web table
under certain conditions
Combine two web tables by concatenating a web
tuple of one web table with a web tuple of other
web table whenever there exist joinable nodes
Two nodes are joinable if they are identical
Two nodes are identical if their URLs and last modification dates are the same
The joined web tuple is stored in a different web
table
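The joinability rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the WHOWEDA operator: a web tuple is simplified to a plain tuple of nodes (links and schemas are ignored), the `Node` class and its field names are hypothetical, and the URLs and dates are made up for the example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Node:
    url: str
    last_modified: str  # ISO date string; the field name is an assumption


def web_join(old_table, new_table):
    """Pair every old tuple with every new tuple that shares an identical node."""
    joined = []
    for old_tuple, new_tuple in product(old_table, new_table):
        # Frozen dataclasses compare by value, so two nodes are equal
        # exactly when both URL and last modification date match.
        shared = set(old_tuple) & set(new_tuple)  # the joinable nodes
        if shared:
            joined.append((old_tuple, new_tuple, shared))
    return joined


# Hypothetical data: the side-effects page changed between the snapshots.
root = Node("www.panacea.gov", "2000-01-02")
aids = Node("www.panacea.gov/aids", "2000-01-02")
old_side = Node("www.panacea.gov/aids/side-effects", "2000-01-05")
new_side = Node("www.panacea.gov/aids/side-effects", "2000-02-01")

drugs = [(root, aids, old_side)]
new_drugs = [(root, aids, new_side)]
joined = web_join(drugs, new_drugs)
```

Here `joined` holds one entry whose joinable nodes are `root` and `aids`; the modified side-effects page appears on both sides but is not joinable.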
Web Join



Join web tables Drugs and New Drugs
Nodes which have not undergone any changes are the joinable nodes in these two web tables.
Content-modified nodes, new nodes and deleted nodes cannot be joinable nodes
Joined web table
[Figure: joined web table with six joined web tuples, numbered (1) to (6). Tuples (1) to (3) join the AIDS tuples for Indavir and Ritonavir; (4) joins the Heart Disease tuple for Hirudin with the Heart Disorder tuple for Niacin; (5) joins the two versions of the Impotence tuple for Caverject; (6) joins the Heart Disease and Heart Disorder versions of the tuple for Hirudin.]
Types of web tuples

Web tuples in which all the nodes are joinable


Results of joining two versions of web tuples that have remained unchanged during the transition
Web tuples in which


some of the nodes are joinable nodes
remaining nodes are the result of insertion, deletion or
modification operations
[Figure: joined web tuple (5), joining the two versions of the Impotence (b4) tuple for Caverject.]
Types of web tuples

Tuples in which



Some of the nodes are joinable nodes
Out of the remaining nodes some are result of
insertion, deletion or modification and
The remaining ones remained unchanged during the
transition
[Figure: joined web tuple (3), joining the AIDS (b0) tuples for Indavir and Ritonavir.]
Outer Web Join


Web tuples that do not participate in the web join process (dangling web tuples) are absent from the joined web table
Outer web join enables us to identify them


Left outer web join
Right outer web join
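A minimal sketch of how dangling web tuples might be collected follows. It assumes a web tuple is a plain tuple of nodes and that a tuple is dangling when none of its nodes is identical (same URL and last modification date) to a node in the other table. The names `Node`, `dangling`, `left_outer_web_join` and `right_outer_web_join` and all URLs and dates are hypothetical.

```python
from typing import NamedTuple


class Node(NamedTuple):
    url: str
    last_modified: str  # ISO date string; the field name is an assumption


def dangling(table, other_table):
    """Return the tuples of `table` that share no identical node with `other_table`."""
    other_nodes = {node for web_tuple in other_table for node in web_tuple}
    return [t for t in table if not other_nodes.intersection(t)]


def left_outer_web_join(old_table, new_table):
    # Dangling tuples of the old snapshot.
    return dangling(old_table, new_table)


def right_outer_web_join(old_table, new_table):
    # Dangling tuples of the new snapshot.
    return dangling(new_table, old_table)


# Hypothetical data for two snapshots.
aids = Node("www.panacea.gov/aids", "2000-01-02")
uses_v1 = Node("www.panacea.gov/aids/uses", "2000-01-05")
uses_v2 = Node("www.panacea.gov/aids/uses", "2000-02-01")
diabetes = Node("www.panacea.gov/diabetes", "2000-01-02")
albuterol = Node("www.panacea.gov/diabetes/albuterol", "2000-01-02")
parkinsons = Node("www.panacea.gov/parkinsons", "2000-01-25")
tolcapone = Node("www.panacea.gov/parkinsons/tolcapone", "2000-01-25")

drugs = [(aids, uses_v1), (diabetes, albuterol)]
new_drugs = [(aids, uses_v2), (parkinsons, tolcapone)]

left = left_outer_web_join(drugs, new_drugs)
right = right_outer_web_join(drugs, new_drugs)
```

The AIDS tuples share the unchanged `aids` node, so they take part in the join and appear in neither outer result.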
Right Outer Web Join
[Figure: right outer joined web table with four web tuples: Cancer (b1) with Beta Carotene, Impotence (b4) with Vasomax, Impotence (b4) with Viagra, and Parkinson’s Disease (b6) with Tolcapone.]
Types of web tuples

New web tuples which are added during the transition
These tuples contain some new nodes, and the content of the remaining nodes has changed
Tuples in which all the nodes have undergone content modification
Tuples which existed before and in which some of the nodes are new and the content of the remaining nodes has changed
Left Outer Web Join
[Figure: left outer joined web table with web tuples for Cancer (b1) with Beta Carotene, Alzheimer’s Disease (b5) with Ibuprofen, Diabetes (b3) with Albuterol, and Impotence (b4) with Vasomax.]
Types of web tuples

Web tuples which are deleted during the transition



These tuples do not occur in the new web table
Tuples in which all the nodes have undergone
content modification
Tuples in which some of the nodes are deleted and the content of the remaining nodes has changed
Step 3: Generating Delta Web
Tables
Overview

Input


Joined, left outer joined and right outer joined web
tables
Output

Set of delta web tables
Delta Web Tables



Delta web tables are used to represent web deltas
Encapsulate the relevant changes that have occurred in the Web with respect to a user’s query
Three types



Delta+ web table
• Contains a set of tuples containing new nodes
inserted during transition
Delta- web table
• Set of web tuples containing nodes removed during
the transition
Delta-M web table
• Set of web tuples representing the previous and
current sets of modified nodes
Steps for Generation

Phase 1: Delta Nodes Identification Phase



Nodes which are added, deleted or modified during the transition are identified
Input: Old and new versions of the web tables and a set of joinable nodes from the joined web table
Output: Sets of nodes which are added, deleted or modified during the transition
• Nodes which exist in the new web table but not in the old web table are the new nodes
• Nodes which exist in the old web table but not in the new one are the deleted nodes
• Nodes which exist in both web tables but are not joinable are the nodes which have undergone content modification
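The three rules above can be sketched with dictionary lookups. This is an illustration under assumptions, not the authors' algorithm: whether a node "exists" in a table is judged by URL, each URL is assumed to occur in one version per snapshot, and joinability is recomputed from URL and date rather than read from the joined web table. `Node` and `identify_delta_nodes` are hypothetical names, and the URLs and dates are made up.

```python
from typing import NamedTuple


class Node(NamedTuple):
    url: str
    last_modified: str  # ISO date string; the field name is an assumption


def identify_delta_nodes(old_table, new_table):
    """Return (added, deleted, modified) nodes between two web tables."""
    # Index every node of each snapshot by URL.
    old_by_url = {n.url: n for t in old_table for n in t}
    new_by_url = {n.url: n for t in new_table for n in t}

    added = {n for url, n in new_by_url.items() if url not in old_by_url}
    deleted = {n for url, n in old_by_url.items() if url not in new_by_url}
    # Present in both snapshots but not identical, hence not joinable.
    modified = {
        url: (old_by_url[url], new_by_url[url])
        for url in old_by_url.keys() & new_by_url.keys()
        if old_by_url[url] != new_by_url[url]
    }
    return added, deleted, modified


# Hypothetical data: the home page changed, Diabetes went, Parkinson's came.
root_v1 = Node("www.panacea.gov", "2000-01-02")
root_v2 = Node("www.panacea.gov", "2000-01-25")
aids = Node("www.panacea.gov/aids", "2000-01-02")
diabetes = Node("www.panacea.gov/diabetes", "2000-01-02")
parkinsons = Node("www.panacea.gov/parkinsons", "2000-01-25")

old_table = [(root_v1, aids), (root_v1, diabetes)]
new_table = [(root_v2, aids), (root_v2, parkinsons)]
added, deleted, modified = identify_delta_nodes(old_table, new_table)
```

Keeping both versions of each modified node is a design choice of the sketch; it matches the later need to show the previous and current sets of modified nodes together.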
Steps for Generation

Phase 2: Delta Tuples Identification Phase




Determines how the delta nodes are related to one
another and how they are associated with those nodes
which have remained unchanged
We identify those tuples which contain nodes which
are added, deleted or modified during the transition
Input: Joined, left outer joined and right outer joined
web tables, sets of delta nodes
Output: Sets of web tuples represented by Delta+,
Delta- and Delta-M web tables
Phase 2 (Delta+ Web Table)


Scan the joined and right outer joined web tables to identify web tuples containing nodes which are inserted during the transition
New nodes can occur in these tables only
• in the right outer joined table, if the remaining nodes in the tuple containing the new nodes are modified (hence not joinable)
• in the joined web table, if some of the nodes in the tuple containing new nodes have remained unchanged and hence are joinable
These web tuples are stored in the Delta+ web table
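The scan for Delta+ tuples might look like the sketch below. It rests on several assumptions: a web tuple is a plain tuple of nodes, each joined entry is kept as an (old tuple, new tuple) pair, and only the new-side tuple of a joined entry is stored in Delta+. The names `Node` and `delta_plus` and all URLs and dates are hypothetical.

```python
from typing import NamedTuple


class Node(NamedTuple):
    url: str
    last_modified: str  # ISO date string; the field name is an assumption


def delta_plus(joined_pairs, right_outer_table, added_nodes):
    """Collect new-side web tuples that contain at least one inserted node."""
    candidates = [new_t for _old_t, new_t in joined_pairs]
    candidates += list(right_outer_table)
    result = []
    for web_tuple in candidates:
        has_new_node = any(node in added_nodes for node in web_tuple)
        # A new tuple can join with several old tuples; keep it only once.
        if has_new_node and web_tuple not in result:
            result.append(web_tuple)
    return result


# Hypothetical data.
impotence = Node("www.panacea.gov/impotence", "2000-01-02")
caverject = Node("www.panacea.gov/impotence/caverject", "2000-01-02")
viagra = Node("www.panacea.gov/impotence/viagra", "2000-01-30")
parkinsons = Node("www.panacea.gov/parkinsons", "2000-01-25")
tolcapone = Node("www.panacea.gov/parkinsons/tolcapone", "2000-01-25")
cancer_v2 = Node("www.panacea.gov/cancer", "2000-01-25")

joined_pairs = [((impotence, caverject), (impotence, viagra))]
right_outer = [(parkinsons, tolcapone), (cancer_v2,)]
added = {viagra, parkinsons, tolcapone}

table = delta_plus(joined_pairs, right_outer, added)
```

A Delta- scan could mirror this by taking the old-side tuples of the joined entries, the left outer joined table, and the set of deleted nodes.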
Example (Right Outer Web Join)
[Figure: the right outer joined web table, with web tuples for Cancer (b1) with Beta Carotene, Impotence (b4) with Vasomax, Impotence (b4) with Viagra, and Parkinson’s Disease (b6) with Tolcapone.]
Example (Joined Web Table)
[Figure: joined web tuple (4), joining the Heart Disorder (b2) tuple for Niacin with the Heart Disease tuple for Hirudin.]
Delta+ Web Table
[Figure: Delta+ web table with four web tuples: Heart Disorder (b2) with Niacin, Impotence (b4) with Vasomax, Impotence (b4) with Viagra, and Parkinson’s Disease (b6) with Tolcapone.]
Phase 2 (Delta- Web Table)


Scan the joined and left outer joined web tables to identify web tuples containing nodes which are deleted during the transition
Deleted nodes can occur in these tables only
• in the left outer joined table, if the remaining nodes in the tuple containing the deleted nodes are modified (hence not joinable)
• in the joined web table, if some of the nodes in the tuple containing deleted nodes have remained unchanged and hence are joinable
These web tuples are stored in the Delta- web table
Example (Left Outer Web Join)
[Figure: the left outer joined web table, with web tuples for Cancer (b1) with Beta Carotene, Alzheimer’s Disease (b5) with Ibuprofen, Diabetes (b3) with Albuterol, and Impotence (b4) with Vasomax.]
Example (Joined Web Table)
[Figure: joined web tuple (5), joining the two versions of the Impotence (b4) tuple for Caverject.]
Delta- Web Table
[Figure: Delta- web table with web tuples for Impotence (b4) with Caverject, Alzheimer’s Disease (b5) with Ibuprofen, Diabetes (b3) with Albuterol, and Impotence (b4) with Vasomax.]
Phase 2 (Delta-M Web Table)

Finally, nodes which are modified during the transition can be identified by inspecting all three web tables
Tuples in the left and right outer joined tables which do not contain any new or deleted node represent the old and new versions of these nodes respectively
• These tuples do not occur in the joined web table as all the nodes are modified
Tuples in the left and right outer joined tables that contain modified nodes as well as inserted or deleted nodes
• These modified nodes may not appear in the joined web table if no other joinable web tuples contain these modified nodes
Example (Right Outer Web Join)
[Figure: the right outer joined web table, with web tuples for Cancer (b1) with Beta Carotene, Impotence (b4) with Vasomax, Impotence (b4) with Viagra, and Parkinson’s Disease (b6) with Tolcapone.]
Example (Left Outer Web Join)
[Figure: the left outer joined web table, with web tuples for Cancer (b1) with Beta Carotene, Alzheimer’s Disease (b5) with Ibuprofen, Diabetes (b3) with Albuterol, and Impotence (b4) with Vasomax.]
Phase 2


Tuples in the joined web table in which some of the nodes represent the old and new versions of these modified nodes
These web tuples are stored in the Delta-M web table
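A rough sketch of the Delta-M scan over all three tables is given below. It makes simplifying assumptions: web tuples are plain tuples of nodes, a joined entry is an (old tuple, new tuple) pair that is stored as one concatenated tuple, and the old and new versions of the modified nodes are already known. `Node` and `delta_m` are hypothetical names and the data is invented.

```python
from typing import NamedTuple


class Node(NamedTuple):
    url: str
    last_modified: str  # ISO date string; the field name is an assumption


def delta_m(joined_pairs, left_outer, right_outer, modified_old, modified_new):
    """Collect tuples from all three tables that contain a modified node."""
    changed = set(modified_old) | set(modified_new)
    # A joined entry keeps both versions side by side in one tuple.
    candidates = [old_t + new_t for old_t, new_t in joined_pairs]
    candidates += list(left_outer) + list(right_outer)
    return [t for t in candidates if any(node in changed for node in t)]


# Hypothetical data.
aids = Node("www.panacea.gov/aids", "2000-01-02")
side_v1 = Node("www.panacea.gov/aids/side-effects", "2000-01-05")
side_v2 = Node("www.panacea.gov/aids/side-effects", "2000-02-01")
cancer_v1 = Node("www.panacea.gov/cancer", "2000-01-02")
cancer_v2 = Node("www.panacea.gov/cancer", "2000-01-25")
diabetes = Node("www.panacea.gov/diabetes", "2000-01-02")
parkinsons = Node("www.panacea.gov/parkinsons", "2000-01-25")

joined_pairs = [((aids, side_v1), (aids, side_v2))]
left_outer = [(cancer_v1,), (diabetes,)]
right_outer = [(cancer_v2,), (parkinsons,)]

table = delta_m(
    joined_pairs,
    left_outer,
    right_outer,
    modified_old={side_v1, cancer_v1},
    modified_new={side_v2, cancer_v2},
)
```

The deleted `diabetes` tuple and the new `parkinsons` tuple contain no modified node, so they are left for the Delta- and Delta+ web tables.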
Example (Joined web table)
[Figure: joined web tuples (1) and (2), joining the AIDS (b0) tuples for Indavir and Ritonavir.]
Delta-M Web Table
[Figure: Delta-M web table with five web tuples, numbered (1) to (5): (1) and (2) for AIDS (b0) with Indavir and Ritonavir, (3) for Impotence (b4) with Caverject, (4) for Heart Disease and Heart Disorder (b2) with Hirudin, and (5) for the two versions of the Cancer (b1) tuple with Beta Carotene.]
Applications

Provides the framework for


Trend analysis
E-commerce
• Consumer behaviour
• Product comparisons
• Competitive Intelligence
• Notification Services
• Provides a useful database for buyers’ and sellers’ agents
Future Work


Analytical and empirical studies of the algorithms for
generating delta web tables
Mechanism to distinguish between the modified, new
or deleted nodes




Annotation on delta nodes
Extend to sub-page level
Query languages for querying the changes
Change notification service