Measuring Information Revealed

UIR Alert Agent: An alert system for identifying
suspicious web-site browsing leading to unintended
information revelation (UIR)
Rohini K. Srihari
State University of New York at Buffalo
May 6, 2003
Tracking suspicious web browsing
User has visited these pages
http://www.faa.gov/apa/safer_skies/fsstats.htm
http://www.faa.gov/certification/aircraft/sfar88/01hstry2.pps
User is requesting
http://www.awp.faa.gov/fsdo/docs/spm_info/what/fy2000/sdplan00.doc
Should we let them see it? Should we monitor their next moves?

What information has the user obtained so far?
What was inferred from the visited pages?
What additional information can they infer from this new web page?
Did we intend to reveal this information?
Should we be alerted if this is unintended?
Measuring Unintended Information Revelation (UIR) for visited and
requested pages will answer these questions.
FAA Workshop May 2003
Outline
- Unintended Information Revelation
- Problem Definition
  - Solutions with Existing Technology
- Proposed Solution
  - UIR System Architecture
  - Extracting Concepts and Associations
  - Creating Concept Chain Graphs (CCG)
  - Mining and visualization of CCGs
- Evaluation Methodology
- Preliminary Results
- Summary
User’s previous request
Fact Sheet: Aviation Accident Statistics
http://www.faa.gov/apa/safer_skies/fsstats.htm
Important Concepts
safer skies, fatal accidents,
runway incursions, hijack, etc.
Interesting Information
Number and percentage of Fatal
Accidents in 1996
- Runway Incursions
- Ice/Snow
- In-Flight fire
User’s current request
Fuel tank ignition events
http://www.faa.gov/certification/aircraft/sfar88/01hstry2.pps
Important Concepts
fatalities, fuel tank ignition, hull
loss, electrostatics, etc.
Interesting Information
Identifies causes for fuel
tank ignition accidents
- Small bomb
- Faulty Wiring
- Pump Faults
Synthesized Information
From the previous page: in-flight fire can cause accidents.
From the current page: fuel-tank ignitions can be caused by small bombs,
faulty pumps, faulty wiring, etc.
Domain Knowledge: In-flight fires and fuel-tank ignitions are both
aviation hazards.
Inference: faulty wiring can cause in-flight fires.
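The reasoning on this slide can be sketched in a few lines of Python. This is only an illustration, not the system's actual inference method: the cause/effect pairs are taken from the two pages above, while the `domain_category` table and the "sibling hazard" rule (a cause of one hazard is proposed as a candidate cause of other hazards in the same category) are assumptions made for the example.

```python
# Cause -> effect facts stated explicitly on each page (from the slides).
page_facts = {
    "fsstats.htm": {("in-flight fire", "accident")},
    "01hstry2.pps": {
        ("small bomb", "fuel-tank ignition"),
        ("faulty wiring", "fuel-tank ignition"),
        ("pump fault", "fuel-tank ignition"),
    },
}

# Domain knowledge: concept -> category (hypothetical encoding).
domain_category = {
    "in-flight fire": "aviation hazard",
    "fuel-tank ignition": "aviation hazard",
}


def infer_candidates(page_facts, domain_category):
    """Propose cause -> effect links that no single page states."""
    stated = set().union(*page_facts.values())
    inferred = set()
    for cause, effect in stated:
        category = domain_category.get(effect)
        if category is None:
            continue  # effect is not covered by the domain knowledge
        # Assumed rule: transfer the cause to sibling concepts in the category.
        for sibling, sibling_category in domain_category.items():
            if sibling_category == category and sibling != effect:
                link = (cause, sibling)
                if link not in stated:
                    inferred.add(link)
    return inferred


candidates = infer_candidates(page_facts, domain_category)
for cause, effect in sorted(candidates):
    print(f"candidate inference: {cause} -> {effect}")
```

Under these assumptions the candidates include the slide's inference (faulty wiring, in-flight fire), which neither page states on its own.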
UIR Alert Agent
UIR is a phenomenon in which the information synthesized from
multiple documents exceeds the sum of the information provided by
the individual documents.
Goal: generate alerts for unintended information revelation based on a
user's browsing history and requested pages.
[Figure: the browsing histories of users A, B, and C (sequences of
numbered pages) feed the UIR Alert Agent, which generates an alert on
user B and writes it to the alerts log.]
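A minimal sketch of such an agent is shown below. It is not the system's implementation: the class name `UIRAlertAgent`, the pluggable `score_fn`, the threshold, the page identifiers, and the set of "sensitive pairs" are all hypothetical, chosen only to show per-user history tracking and alert logging.

```python
from collections import defaultdict


class UIRAlertAgent:
    """Tracks each user's browsing history and logs alerts (illustrative)."""

    def __init__(self, score_fn, threshold):
        self.score_fn = score_fn          # (viewed pages, requested page) -> score
        self.threshold = threshold
        self.history = defaultdict(list)  # user -> pages viewed so far
        self.alert_log = []               # (user, page, score) tuples

    def request(self, user, page):
        """Score a page request against the user's history; log an alert if needed."""
        viewed = self.history[user]
        score = self.score_fn(viewed, page)
        alerted = score >= self.threshold
        if alerted:
            self.alert_log.append((user, page, score))
        viewed.append(page)
        return alerted


# Hypothetical scoring: count viewed pages that form a sensitive pair with the request.
sensitive_pairs = {frozenset(("page6", "page7"))}


def pair_score(viewed, page):
    return sum(1 for v in viewed if frozenset((v, page)) in sensitive_pairs)


agent = UIRAlertAgent(pair_score, threshold=1)
sessions = {
    "A": ["page1", "page2", "page3", "page4"],
    "B": ["page1", "page6", "page7", "page4"],
    "C": ["page9", "page1", "page11", "page12"],
}
for user, pages in sessions.items():
    for page in pages:
        agent.request(user, page)

print(agent.alert_log)
```

In a real deployment the scoring function would presumably be the UIR measure computed over the concept chain graph rather than a fixed table of page pairs.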
Architecture of UIR System
[Figure: UIR system architecture]
Offline: Document Collection (web pages) and Pre-existing Domain
Ontology/Lexicon (e.g. Aviation Ontology) pass through Information
Extraction to build Concept Chain Graphs (CCG).
Input: User surfing web pages on sites of interest to national security
(a document subset of the collection).
Processing: the CCG is instantiated for the subset of interest and passed
to the UIR Module, which raises UIR Alerts and writes user alerts / logs.
Example concept chains: Accident-hazard-fuel tank -…,
ice/snow-hazard-fatalities-…
Output: web pages that reveal too much information; a human monitor can
visualize paths in the CCG.
Proposed Solution
Step 1: Determine significant concepts and associations in
target domain (offline, semi-automatic)
- use of existing ontologies such as the DAML ontology on aviation
- use of information extraction to automatically extract concepts
  and associations from a representative document collection
Step 2: Create Concept Chain Graph (CCG)
- consolidates underlying domain knowledge and specific documents
- weights concepts and associations using both domain weights and
  individual document weights
Step 3: Visualization and text mining operations on CCG
Step 4: UIR Alert agent invoked
- tracking user surfing patterns
- what-if scenarios
Evaluation Methodology
[Figure: two evaluation tracks]
Typical IR evaluation: a TREC Query ("find pages that discuss ways of
causing air disasters") is submitted to an IR system that includes query
expansion. The ranked web pages it returns are compared with the relevant
web pages to evaluate the precision and recall of the IR system.
UIR evaluation: the UIR System, using the CCG, is evaluated on its ability
to generate the TREC Narrative ("Pages that are relevant to causing air
disasters will mention aircraft maintenance operations or passenger
screening procedures").
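For the typical IR track, precision and recall follow their standard definitions. The sketch below shows the arithmetic on made-up document identifiers; the identifiers and the helper name `precision_recall` are illustrative only and do not come from the experiments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 4 pages retrieved, 3 pages judged relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")
```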
Step 1: Extracting Concepts and Associations
Extracting Concepts:
- Use InfoXtract engine from Cymfony
- Named Entity Tagger (NE) identifies common entities such as Date,
  Time, Location, State, Country, Organization, Person
- InfoXtract also identifies significant noun groups and verb groups,
  e.g. fuel tanker, runway de-icing
Extracting Associations:
- Concept co-occurrence in documents
- Concept proximity in sentences/paragraphs
- Advanced techniques using machine learning
Example (association learning):
Input text:
… The designation for one end of the runway
should be used on the sign only when the
taxiway intersects the beginning of that
runway. Taxiways that intersect the runway at
intermediate points must have the
designations for both runway ends. ...
Output: (runway, taxiway): 0.85
The output implies that the system has 85% confidence that runway
and taxiway are associated by some relation.
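The simplest of the association signals above, sentence-level co-occurrence, can be sketched as follows. This is an assumed scoring (the share of sentences mentioning either concept that mention both), not the learned model that produced the 0.85 confidence; the function name and the naive plural handling are illustrative.

```python
import re
from itertools import combinations


def sentence_cooccurrence(text, concepts):
    """Score each concept pair by the share of sentences that mention both."""
    sentences = [s.lower() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    mentions = {}
    for concept in concepts:
        # Whole-word match with an optional plural "s" (a deliberate simplification).
        pattern = re.compile(r"\b" + re.escape(concept) + r"s?\b")
        mentions[concept] = {i for i, s in enumerate(sentences) if pattern.search(s)}
    scores = {}
    for a, b in combinations(concepts, 2):
        either = mentions[a] | mentions[b]
        both = mentions[a] & mentions[b]
        scores[(a, b)] = len(both) / len(either) if either else 0.0
    return scores


text = (
    "The designation for one end of the runway should be used on the sign "
    "only when the taxiway intersects the beginning of that runway. "
    "Taxiways that intersect the runway at intermediate points must have "
    "the designations for both runway ends."
)
scores = sentence_cooccurrence(text, ["runway", "taxiway", "sign"])
for pair, score in scores.items():
    print(pair, round(score, 2))
```

On this two-sentence passage, runway and taxiway co-occur in every sentence that mentions either, so the raw score saturates. A learned model can temper such counts with proximity and other evidence.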
Sample Information Extraction output
Concepts and Named Entities are marked up during information extraction.
DATE: October 23, 1992 NO. 92-03
TO: AIRPORT CERTIFICATION PROGRAM INSPECTORS
TOPIC: Effects Of Type II Deicing Fluid On Runway Friction
The FAA's Technical Center in conjunction with the Port Authority of
New York and New Jersey conducted tests to determine the effects of
Type II aircraft deicing fluids on runway friction. The tests were
conducted this past July and August at La Guardia and John F. Kennedy
International Airports on grooved asphaltic pavement. Since the tests
were conducted in the summer no attempt was made to simulate ice or
snow on the pavement surface. (See future test programs.) Two specially
instrumented B-727's and two Saab friction devices were used to
measure the runway friction.
The purpose of this effort was to test the premise that Type II deicing
fluid deposited on a runway poses a hazard to aircraft landing on the
runway. At the present time it is unknown to what extent Type II actually
falls off a departing aircraft and what portion of it is deposited on the
runway. (See future test programs.)
Step 2: Create Concept Chain Graph
- Create concept chain graph based on underlying domain knowledge
  (concepts, associations)
  - Weight concept nodes based on frequency, type, user-defined importance
  - Weight associations based on proximity, importance of concepts they
    link, uniqueness
- Project/map documents viewed by user onto CCG
  - A document is represented as a probabilistic sub-graph in the CCG
  - Proximity and other metrics are used to assign weights to the concepts
    (nodes) and associations (edges) discovered in a document
[Figure: a document projected onto the Aviation Ontology as a sub-graph of
document-specific concepts and associations, each labeled with a weight.]
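One way to picture the projection step is the sketch below. It is an assumption-laden illustration rather than the actual CCG construction: the class name, the rule of multiplying domain weights by document weights, the choice to drop concepts missing from the domain graph, and every numeric weight are hypothetical.

```python
class ConceptChainGraph:
    """Undirected graph with weighted concepts (nodes) and associations (edges)."""

    def __init__(self):
        self.concepts = {}      # concept -> weight
        self.associations = {}  # frozenset({a, b}) -> weight

    def add_concept(self, name, weight=1.0):
        self.concepts[name] = weight

    def add_association(self, a, b, weight):
        for concept in (a, b):
            self.concepts.setdefault(concept, 1.0)
        self.associations[frozenset((a, b))] = weight

    def project(self, doc_concepts, doc_associations):
        """Return the document's sub-graph, combining domain and document weights.

        Assumed rule: combined weight = domain weight * document weight, and
        anything absent from the domain graph is dropped.
        """
        sub = ConceptChainGraph()
        for name, weight in doc_concepts.items():
            if name in self.concepts:
                sub.add_concept(name, self.concepts[name] * weight)
        for (a, b), weight in doc_associations.items():
            key = frozenset((a, b))
            if key in self.associations and a in sub.concepts and b in sub.concepts:
                sub.associations[key] = self.associations[key] * weight
        return sub


# Hypothetical domain graph and document weights.
domain = ConceptChainGraph()
domain.add_concept("hazard", 0.9)
domain.add_concept("fuel tank", 0.6)
domain.add_concept("wiring", 0.4)
domain.add_association("hazard", "fuel tank", 0.8)
domain.add_association("fuel tank", "wiring", 0.5)

sub = domain.project(
    doc_concepts={"fuel tank": 0.5, "wiring": 0.5, "pump": 0.3},
    doc_associations={("fuel tank", "wiring"): 0.6},
)
print(sub.concepts)
print(sub.associations)
```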
Step 2: Instantiated Concept Chain Graph
[Figure: CCG instantiated for two documents, "Accident Statistics" and
"Fuel tank ignition events". Edges are marked either as associations in a
document or as domain knowledge.
Nodes: AVIATION, AIRPLANE, HAZARD, ACCIDENT, Statistics, Lightning,
Windshear, Ice/snow, Air_traffic_control_tower, Fuel Tank, Wiring, Pumps,
Small Bomb, Fuel Tank Ignition events, In-flight fire, Fatalities,
Runway Incursions, hull losses.]
Step 3: Mining the CCG
- Goals
  - detecting information-rich concept chains,
    e.g. air disaster - onboard explosion - fuel tanker
  - quantifying information revealed
  - issuing alerts when too much information is revealed
  - “what-if” scenarios to enable dissemination of benign information
- Graph traversal
  - generate CCG representing documents viewed by user
  - start with explicit query/search terms as seed concepts; could
    be multiple terms
  - strategies:
    - try to find best paths/chains that connect “seed” concepts; could
      generate multiple chains
    - try to find best subgraph
  - various graph traversal algorithms are suitable
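As one possible reading of "best path between seed concepts", the sketch below treats each association weight as a confidence in (0, 1] and finds the chain with the highest product of weights, using Dijkstra's algorithm on negative log weights. The scoring rule, the function name `best_chain`, and all weights are assumptions; only the example chain comes from the slide.

```python
import heapq
import math


def best_chain(edges, source, target):
    """Return (chain, score): the path maximizing the product of edge weights."""
    graph = {}
    for (a, b), weight in edges.items():
        cost = -math.log(weight)  # weights in (0, 1] give non-negative costs
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for nxt, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, math.inf):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))

    if target not in dist:
        return None, 0.0
    chain = [target]
    while chain[-1] != source:
        chain.append(prev[chain[-1]])
    chain.reverse()
    return chain, math.exp(-dist[target])


# Hypothetical association weights between concepts.
edges = {
    ("air disaster", "onboard explosion"): 0.8,
    ("onboard explosion", "fuel tanker"): 0.5,
    ("air disaster", "runway incursion"): 0.9,
    ("runway incursion", "fuel tanker"): 0.1,
}
chain, score = best_chain(edges, "air disaster", "fuel tanker")
print(" - ".join(chain), round(score, 2))
```

Generating multiple chains, as the slide suggests, would need a k-shortest-paths variant rather than this single-path search.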
Graph Traversal Techniques
- Minimum cover techniques
  - INSTANCE: Graph G = (V, E)
  - SOLUTION: A vertex cover for G, i.e., a subset V' ⊆ V such that, for
    each edge (u, v) ∈ E, at least one of u and v belongs to V'
  - MEASURE: Cardinality of the vertex cover, i.e., |V'|
- Flow networks
  - given a network (G, s, t, c) where G = (V, E) is a directed graph with n
    vertices and m edges, s and t are two vertices (source and sink), and
    c: E → R+ is a function that defines the capacities of edges
  - find the maximum flow from s to t that satisfies the capacity constraints
- Energy minimization (used in image processing)
  - active contours (e.g. snakes) used for tracking various shapes,
    including road detection
  - dynamic programming solutions available
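Two of the techniques above have compact textbook formulations, sketched here on toy graphs. Because minimum vertex cover is NP-hard, the first function is the standard 2-approximation based on a maximal matching, not an exact solver. The second is the Edmonds-Karp maximum-flow algorithm (augmenting paths found by breadth-first search). The example graphs and capacities are invented, and nothing here is claimed to be how the UIR system applies these techniques to a CCG.

```python
import math
from collections import defaultdict, deque


def approx_vertex_cover(edges):
    """2-approximation: take both endpoints of each edge not yet covered."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover


def max_flow(capacity, s, t):
    """Edmonds-Karp: repeatedly augment along shortest residual paths."""
    residual = defaultdict(lambda: defaultdict(int))
    for (u, v), c in capacity.items():
        residual[u][v] += c
        residual[v][u] += 0  # make sure the reverse edge exists
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow  # no augmenting path remains
        # Find the bottleneck capacity along the path, then push that much flow.
        bottleneck = math.inf
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


path_edges = [("a", "b"), ("b", "c"), ("c", "d")]
cover = approx_vertex_cover(path_edges)

capacity = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1, ("a", "t"): 2, ("b", "t"): 3}
total = max_flow(capacity, "s", "t")
print(sorted(cover), total)
```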
Step 4: Track user surfing with UIR module
[Figure: the instantiated CCG with the previously viewed page(s) and the
requested page highlighted.
Nodes: AVIATION, AIRPLANE, HAZARD, ACCIDENT, Statistics, Lightning,
Windshear, Ice/snow, Air_traffic_control_tower, Fuel Tank, Wiring, Pumps,
Small Bomb, Fuel Tank Ignition events, In-flight fire, Fatalities,
Runway Incursions, hull losses.]
The UIR module determines that these two documents reveal a new
association between wiring and accidents.
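A rough way to detect such a newly revealed association is to compare which concept pairs become connected only when both pages are combined. The sketch below uses plain graph connectivity as a stand-in for the UIR measure, which is a strong simplification (any chain counts, regardless of weight or length). The edge lists are a hypothetical reduction of the slide's example, and the function names are illustrative.

```python
from itertools import combinations


def connected_pairs(edges):
    """Return all concept pairs that lie in the same connected component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    return {frozenset(pair) for g in groups.values() for pair in combinations(g, 2)}


def new_associations(viewed, requested, domain):
    """Pairs connected only when both pages and the domain knowledge are merged."""
    merged = connected_pairs(viewed + requested + domain)
    already = connected_pairs(viewed + domain) | connected_pairs(requested + domain)
    return merged - already


# Hypothetical edges reduced from the example slides.
viewed = [("in-flight fire", "accident")]
requested = [("wiring", "fuel tank ignition")]
domain = [("in-flight fire", "hazard"), ("fuel tank ignition", "hazard")]

revealed = new_associations(viewed, requested, domain)
print([sorted(pair) for pair in revealed])
```

With these inputs the only newly connected pair is wiring and accident, matching the association the slide describes.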
Preliminary Experiments
Summary
Benefits to FAA
- Automated monitoring of information acquired by users of the FAA website,
  and an alert mechanism for unintentionally revealed information
- Shortlists and identifies documents and concepts seen by the user that
  reveal unintended information
- Domain map visualization tool facilitates concept- and association-based
  queries
Claims
- New, richer representation for information retrieval that combines keyword
  statistics (bag-of-words model) with NLP-based information extraction
- Solution is general to any domain; only the domain map needs to be
  customized/retrained
- Experts can intervene and guide the process if desired; tools are provided