Document Ontology Extractor (DOE)

Research Team:
Govind R Maddi, Jun Zhao
Chakravarthi S Velvadapu
Faculty:
Dr. Sadanand Srivastava
Dr. James Gil De Lamadrid
Joint Project of
University of Maryland, Baltimore County
Bowie State University
Sponsored by
Department Of Defense
OVERVIEW
1. The system takes text documents as its input
2. Performs semantic analysis on these documents
3. Generates a useful ontology
4. Represents the ontology graphically
GOAL
To build an Ontology utilizing
• Statistical methods
• A small amount of user feedback
• Automation
Architecture of DOE
Text Document
Pre-processing
Normalization
Latent Semantic Indexing
(SVD)
Document Ontology
Graph Construction
GUI
INPUT
Text documents
Pre-processing
Read-in text file
Extract meaningful terms
Count their frequencies
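The pre-processing steps above can be sketched in Python as follows. This is a minimal illustration, not the DOE implementation: the stop-word list, the regular-expression tokenizer and the name `term_frequencies` are assumptions, since the slides do not say how meaningful terms are identified.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the slides do not define "meaningful" terms.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

def term_frequencies(text):
    """Lowercase the text, split it into alphabetic tokens,
    drop stop words and count the remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

sample = "The ontology links terms to concepts, and terms to documents."
counts = term_frequencies(sample)
```

For a file on disk, the text would first be read in, for example with `open(path).read()`, and then passed to the same function.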
Normalization
Calculate the weight of each term using
$$w_{i,k} = \frac{\text{frequency}_{i,k}}{\sum_{j=1}^{n_k} \text{frequency}_{j,k}}$$
Normalization (contd)
Calculate the normalized weight using
$$W_{i,k} = \frac{w_{i,k}}{\sqrt{\sum_{j=1}^{n_k} w_{j,k}^{2}}}$$
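The two weighting steps can be sketched as below. The function names and the sample counts are made up for illustration. The first function divides each term count by the document's total count, and the second divides each weight by the Euclidean length of the document's weight vector.

```python
import math

def term_weights(freqs):
    """Relative frequency: each count divided by the total count in the document."""
    total = sum(freqs.values())
    return {term: f / total for term, f in freqs.items()}

def normalize(weights):
    """Divide each weight by the Euclidean length of the weight vector."""
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / length for term, w in weights.items()}

freqs = {"ontology": 3, "term": 4}  # made-up counts for one document
w = term_weights(freqs)             # 3/7 and 4/7
W = normalize(w)                    # approximately 0.6 and 0.8
```

After normalization the squared weights of a document sum to 1, so long and short documents become comparable. An empty document would cause a division by zero here and would need to be skipped beforehand.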
Latent Semantic Indexing (LSI)
Statistical method representing documents
by statistically independent concepts
Based on Singular Value Decomposition
(SVD)
Singular Value Decomposition
(SVD)
A technique that decomposes a
given matrix into three
components – U, S and V.
SVD (contd)
An m x n term-document matrix A, of rank r,
can be expressed as the product:
A = U * S * V^T
U is the m x r term matrix
S is the r x r diagonal matrix
V^T is the r x n document matrix (V itself is n x r)
SVD (contd)
The diagonal of S contains the singular values of A
in descending order.
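As an illustration of the decomposition, the sketch below applies NumPy's `numpy.linalg.svd` to a small made-up term-document matrix. The matrix values are hypothetical, and the use of NumPy is an assumption; the slides do not name the library used by DOE.

```python
import numpy as np

# Made-up 4-term x 3-document matrix of normalized weights.
A = np.array([
    [0.8, 0.0, 0.6],
    [0.6, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.8],
])

# full_matrices=False returns the thin SVD: U is 4 x 3 and Vt is 3 x 3.
# The singular values come back as a 1-D array in descending order.
U, singular_values, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(singular_values)

# The product of the three factors reproduces A up to rounding error.
reconstructed = U @ S @ Vt
```

This example matrix has full column rank, so the thin SVD has exactly r = 3 columns in U, matching the shapes given on the slide.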
SVD (contd)
For LSI, a rank-s approximation of A is formed as follows:
A_s = U_s * S_s * V_s^T
U_s - derived from U by removing all but the first s
columns
S_s - derived from S by removing all but the
largest s singular values
V_s^T - derived from V^T by removing all but the s
corresponding rows
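The truncation step can be sketched as follows, assuming NumPy. The helper name `rank_s_approximation` and the example matrix are hypothetical.

```python
import numpy as np

def rank_s_approximation(A, s):
    """Keep only the s largest singular values and the matching
    columns of U and rows of V^T."""
    U, sv, Vt = np.linalg.svd(A, full_matrices=False)
    U_s = U[:, :s]          # first s columns of U
    S_s = np.diag(sv[:s])   # s largest singular values
    Vt_s = Vt[:s, :]        # first s rows of V^T
    return U_s, S_s, Vt_s

# Made-up matrix whose singular values are 2 and 1.
A = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
U_s, S_s, Vt_s = rank_s_approximation(A, 1)
A_s = U_s @ S_s @ Vt_s  # keeps only the direction with singular value 2
```

Because the singular values are sorted in descending order, slicing the first s entries is the same as keeping the s largest ones.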
SVD (contd)
[Figure: A (m x n) = U (m x r) * S (r x r) * V^T (r x n), showing the truncated matrices U_s, S_s and V_s^T]
Document Ontology
Build Concept Nodes and Term Nodes
using the document matrix (V) and term
matrix (U).
Building concept nodes from term matrix (U)
A concept node contains information
about
• Concept name
• Terms that belong to that concept
• Respective weights of terms in that
concept
Building concept nodes from term matrix (U) (contd)
Naming convention:
• Generated automatically
• A hyphenated string of the five most
frequent terms in that concept
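The naming convention can be illustrated with a short sketch. The function name `concept_name` and the sample weights are hypothetical, and ranking terms by their weight in the concept is an assumption about how "most frequent" is measured.

```python
def concept_name(term_weights, top_n=5):
    """Join the top_n highest-ranked terms of a concept with hyphens."""
    ranked = sorted(term_weights.items(), key=lambda kv: kv[1], reverse=True)
    return "-".join(term for term, _ in ranked[:top_n])

# Made-up weights of six terms in one concept.
weights = {
    "ontology": 0.9, "graph": 0.7, "term": 0.6,
    "concept": 0.5, "matrix": 0.4, "file": 0.1,
}
name = concept_name(weights)  # "ontology-graph-term-concept-matrix"
```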
Building concept nodes from term matrix (U) (contd)
A concept node represents a document
Each column in U corresponds to a concept
node
Building term nodes from term matrix (U)
A term node contains information
about
• Term name
• Concepts to which it belongs
• Its respective weight in each concept
Building term nodes from term matrix (U) (contd)
Naming convention:
• Generated automatically
• Named after the term itself
Building term nodes from term matrix (U) (contd)
A term node represents a term
Each row in U corresponds to a term
node
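The row and column correspondence can be made concrete with a small sketch. The class names `TermNode` and `ConceptNode`, the placeholder concept names and the sample matrix are all hypothetical. No threshold is applied here, so every weight is stored.

```python
from dataclasses import dataclass, field

@dataclass
class TermNode:
    name: str
    concept_weights: dict = field(default_factory=dict)  # concept index -> weight

@dataclass
class ConceptNode:
    name: str
    term_weights: dict = field(default_factory=dict)     # term name -> weight

def build_nodes(terms, U):
    """Each row of U yields a term node, each column a concept node."""
    n_concepts = len(U[0])
    term_nodes = [
        TermNode(t, {j: row[j] for j in range(n_concepts)})
        for t, row in zip(terms, U)
    ]
    concept_nodes = [
        ConceptNode(f"concept-{j}", {t: row[j] for t, row in zip(terms, U)})
        for j in range(n_concepts)
    ]
    return term_nodes, concept_nodes

terms = ["ontology", "graph", "term"]
U = [[0.8, 0.1], [0.5, 0.3], [0.2, 0.9]]  # made-up 3-term x 2-concept matrix
term_nodes, concept_nodes = build_nodes(terms, U)
```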
Graph Construction
A bipartite graph is constructed with
concept nodes and term nodes
A concept node is connected to all term
nodes that belong to it.
A term node is connected to all concept
nodes to which it belongs.
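A minimal sketch of the bipartite graph follows, assuming the concept membership has already been decided. The function name and the memberships in the example are made up, and the edges are not taken from the slides.

```python
from collections import defaultdict

def build_bipartite_graph(concept_terms):
    """Return the edge set and a term -> concepts map.

    concept_terms maps each concept name to the terms that belong to it.
    Edges only ever join a concept to a term, so the graph is bipartite.
    """
    edges = set()
    term_concepts = defaultdict(set)
    for concept, terms in concept_terms.items():
        for term in terms:
            edges.add((concept, term))
            term_concepts[term].add(concept)
    return edges, dict(term_concepts)

# Hypothetical memberships for two concepts and five terms.
concept_terms = {
    "Concept 1": ["Term 1", "Term 2", "Term 3"],
    "Concept 2": ["Term 3", "Term 4", "Term 5"],
}
edges, term_concepts = build_bipartite_graph(concept_terms)
```

Storing both directions lets a GUI look up the terms of a concept and the concepts of a term without scanning every edge.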
Graph Construction (contd)
Term 1
Concept 1
Term 2
Term 3
Concept 2
Term 4
Term 5
Graphical User Interface
(GUI)
GUI (contd)
GUI consists of
• Concepts list
• Terms list
• Display for bipartite graph
• Display for list of files in ontology
GUI
To view the terms related to a concept, the user
selects that concept from the concepts list
To view the concepts related to a term, the user
selects that term from the terms list
GUI (contd)
To view only terms related to a
specific concept:
• Select that concept from concepts list
• Select checkbox “Display Selected Ones
Only”
Result:
• GUI displays ONLY relations between
selected terms and concepts
GUI (contd)
To view only concepts related to a
term:
• Select that term from terms list
• Select checkbox “Display Selected Ones
Only”
Result:
• GUI displays ONLY relations between
selected terms and concepts
GUI (contd)
To highlight relationship between a
term and a concept:
• Select that term or concept from terms
or concepts list
• Click on line connecting term and
concept
GUI – File Operations
New
Open
Save
saveAs
Close
Exit
GUI – Ontology Updates
Add
Delete
changeSVDThreshold
changeConcThreshold
foldInDoc
defaultBuild
GUI – Ontology Updates
Add:
• Click on Add
• Select the file to be added from the file
chooser popup menu
• Choose whether to build now or not
• If yes, the document is added and the display is updated
• If no, the GUI remains unchanged
GUI – Ontology Updates
Delete:
• Click on Delete
• Select the file to be deleted from the file chooser
popup menu
• Choose whether to build now or not
• If yes, the document is deleted and the display is updated
• If no, the GUI remains unchanged
GUI – Ontology Updates
changeSVDThreshold:
• SVDThreshold controls how many of the
largest singular values (s) are selected
from S
• Default value is 70%, i.e. only the
singular values higher than 70% of the
highest singular value are selected
• The user can change this default value
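The threshold rule can be sketched as a small function. The name `select_rank` and the sample singular values are hypothetical. The rule itself follows the slide: keep the singular values that are higher than the given fraction of the largest one.

```python
def select_rank(singular_values, threshold=0.70):
    """Count the singular values strictly above threshold * largest value."""
    if not singular_values:
        return 0
    cutoff = threshold * max(singular_values)
    return sum(1 for sv in singular_values if sv > cutoff)

# Made-up singular values; the cutoff is 70% of 10.0, so 10.0 and 8.0 are kept.
s = select_rank([10.0, 8.0, 6.5, 3.0])
```

Lowering the threshold keeps more singular values and therefore produces more concepts.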
GUI – Ontology Updates
changeConcThreshold:
• Controls the number of terms related to
a concept, based upon term weight
• Default value is 70%, i.e. only the
terms with weights higher than 70% of
the highest term weight are selected
• The user can change this default value
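The same kind of rule applied to one concept can be sketched as follows. The function name and the sample weights are made up, and the weights are assumed to be non-negative; if a column of U contains negative entries, their magnitudes could be compared instead.

```python
def terms_for_concept(terms, weights, threshold=0.70):
    """Return the terms whose weight exceeds threshold * highest weight."""
    cutoff = threshold * max(weights)
    return [t for t, w in zip(terms, weights) if w > cutoff]

terms = ["ontology", "graph", "term", "file"]
weights = [0.9, 0.7, 0.5, 0.1]  # made-up weights in one concept
selected = terms_for_concept(terms, weights)  # ["ontology", "graph"]
```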
GUI – Ontology Modifications
Rename
• Renames a selected concept
DelTerm
• Deletes a selected term
Undo
• Reverts the last modification and returns to the
previous state
Future Work
To investigate less expensive methods
for adding new documents:
• Fold-In
• SVD update
Future Work
Fold-In:
• A method to add new document(s) to
an existing ontology
• Reuses the existing data when adding
documents
• Less expensive than the
regular build method
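The slides do not give a formula for fold-in. The sketch below uses the projection commonly described for LSI, d_hat = d^T * U_s * S_s^-1, as an assumption about what the method would look like. The function name, the example matrix and the new document vector are hypothetical.

```python
import numpy as np

def fold_in(doc_vector, U_s, singular_values_s):
    """Project a new document's term-weight vector into the existing
    s-dimensional concept space without recomputing the SVD."""
    return (doc_vector @ U_s) / singular_values_s

# Made-up 4-term x 3-document matrix and its existing decomposition.
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
U, sv, Vt = np.linalg.svd(A, full_matrices=False)
s = 2
U_s, sv_s = U[:, :s], sv[:s]

new_doc = np.array([1.0, 0.0, 1.0, 0.0])  # term weights of a hypothetical new document
coords = fold_in(new_doc, U_s, sv_s)      # its coordinates in the 2-concept space
```

Folding in a document that is already a column of A returns that document's existing column of V_s^T. The saving comes from reusing U_s and S_s instead of decomposing the enlarged matrix again.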
Acknowledgements
We express our appreciation to
• Department Of Defense
• University of Maryland, Baltimore County
• Advisors, Bowie State University