A Search Engine That Learns

Jeff Elser
John Paxton
Montana State University - Bozeman
Presentation Outline
I. Problem
II. Background Information
III. Approach
IV. Preliminary Results
V. Future Work
VI. Summary
VII. Questions
I. Problem
• RightNow software use
• Spidering and searching
• Website optimization
  • Page by page is tedious and time consuming
  • Dual ownership should allow perfect optimization
• Solutions
  • Search engine adjustments
  • Suggesting specific web page changes
II. Background – Search Engine
• Spidering
• Indexing
• Weighting factors

Weight Identifier   Default Value   First Test Case Results
backlink            1000.0          510.0
description         150.0           980.0
keywords            100.0           66.0
title               100.0           180.0
meta-description    50.0            920.0
heading 1           5.0             130.0
author              1.0             440.0
multi-match         1.0             170.0
text                1.0             0.0
url text            1.0             540.0
date                0.35            140.0
II. Background – Genetic Algorithms
• Goldberg’s Simple GA
  • Mutation
  • Crossover
  • Elitism
  • Non-overlapping populations
  • Several fitness functions
• Individual 1 0 0 0 0
  • Fitness = 2
• Individual 2 1 1 1 1
  • Fitness = 4
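The crossover and mutation operators named on this slide can be illustrated on four-bit individuals. This is a minimal sketch rather than the presenters' code: it assumes a fitness that simply counts the 1 bits (which agrees with the individual 1 1 1 1 scoring 4), and the function names and example bit strings are hypothetical.

```python
import random

def fitness(bits):
    """Assumed fitness for illustration: the number of 1 bits."""
    return sum(bits)

def crossover(parent_a, parent_b, point):
    """Single-point crossover: the children swap tails after `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(bits, probability, rng):
    """Flip each bit independently with the given probability."""
    return [1 - b if rng.random() < probability else b for b in bits]

rng = random.Random(0)
a, b = [0, 1, 1, 0], [1, 1, 1, 1]
child_a, child_b = crossover(a, b, point=2)
print(child_a, fitness(child_a))  # [0, 1, 1, 1] 3
print(child_b, fitness(child_b))  # [1, 1, 1, 0] 3
print(mutate(a, probability=0.01, rng=rng))
```

Elitism and non-overlapping populations concern how a whole generation is replaced, so they do not appear in these per-individual operators.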
III. Approach
A. Architecture
B. Training data
C. Testing controls (website source)
D. GA specifics
E. Fitness functions
A. Architecture
B. Training Data
• Website source
  • 20,000 newsgroup articles from the UCI Knowledge Discovery in Databases Archive
  • Hand-formatted HTML
  • Chosen for word count and structure
C. Testing Controls
• Webmaster provides training data
  • List of important keywords
  • Associated ranked pages
  • Tedious, but trivial compared to optimizing all pages
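One possible in-memory shape for this training data is a mapping from each keyword to its pages in the desired order. The sketch below is illustrative only: the keywords, page paths, and helper name are hypothetical and do not come from the presentation.

```python
# Hypothetical training data: keyword -> pages listed in the desired rank order.
desired_rankings = {
    "genetic algorithm": ["/ga.htm", "/index.htm", "/seo.htm"],
    "search optimization": ["/seo.htm", "/index.htm"],
}

def desired_position(keyword, page):
    """Return the 1-based desired rank of `page` for `keyword`, or None if unlisted."""
    pages = desired_rankings.get(keyword, [])
    return pages.index(page) + 1 if page in pages else None

print(desired_position("genetic algorithm", "/index.htm"))  # 2
```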
D. GA Specifics
• Random initial population
  • Population size 1000
  • Used GAlib’s built-in random number generator
• Genome
  • 16 real numbers corresponding to the 16 weighting factors
  • Range 0.0 – 1000.0
D. GA Specifics
• GA executes for 10000 generations
• Elitism is turned on
• Mutation probability = 0.01
• Crossover probability = 0.6
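These settings can be tried out in a small generational loop. The sketch below is not the GAlib setup used in the project: it is plain Python using the standard `random` module, and the roulette-wheel selection, single-point crossover, and "replace the gene with a fresh random value" mutation are assumptions. The `toy_fitness` function and `TARGET` vector are hypothetical stand-ins for running the search engine and scoring its rankings, and the demo run is scaled well below 1000 individuals and 10000 generations.

```python
import random

GENOME_LENGTH = 16                 # one gene per weighting factor
GENE_MIN, GENE_MAX = 0.0, 1000.0   # range of each weight
P_MUTATION = 0.01
P_CROSSOVER = 0.6

def random_genome(rng):
    return [rng.uniform(GENE_MIN, GENE_MAX) for _ in range(GENOME_LENGTH)]

def evolve(fitness, population_size, generations, seed=0):
    """Generational GA with elitism; `fitness` must return a positive score."""
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(population_size)]
    for _ in range(generations):
        scores = [fitness(g) for g in population]
        best = population[scores.index(max(scores))]
        next_population = [best[:]]            # elitism: carry the best genome over
        while len(next_population) < population_size:
            # Roulette-wheel selection of two parents (assumed selection scheme).
            mother, father = rng.choices(population, weights=scores, k=2)
            child = mother[:]
            if rng.random() < P_CROSSOVER:
                point = rng.randrange(1, GENOME_LENGTH)
                child = mother[:point] + father[point:]
            # Mutation: each gene may be replaced by a new random value in range.
            child = [rng.uniform(GENE_MIN, GENE_MAX) if rng.random() < P_MUTATION else gene
                     for gene in child]
            next_population.append(child)
        population = next_population           # non-overlapping populations
    scores = [fitness(g) for g in population]
    return population[scores.index(max(scores))]

# Hypothetical placeholder fitness: closeness to a made-up target weight vector.
TARGET = [500.0] * GENOME_LENGTH

def toy_fitness(genome):
    error = sum(abs(g - t) for g, t in zip(genome, TARGET))
    return 1.0 / (1.0 + error)

best = evolve(toy_fitness, population_size=50, generations=100)
print(round(toy_fitness(best), 6))
```

Because the best genome is copied into every new generation, the best fitness never decreases from one generation to the next.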
E. Fitness Function 1
• ∑D
• D = |(actual ranking) – (desired ranking)|
• +1 to avoid division by 0
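Read together, these bullets suggest a fitness of the form 1 / (1 + ∑D), where the sum runs over the pages in the training data. The slide does not spell the formula out, so the reciprocal form in this sketch is an assumption, and the page paths and function name are hypothetical.

```python
def fitness_1(actual, desired):
    """Assumed form: 1 / (1 + sum of |actual rank - desired rank|).

    `actual` and `desired` map each page to its 1-based rank.
    """
    total_distance = sum(abs(actual[page] - rank) for page, rank in desired.items())
    return 1.0 / (1.0 + total_distance)

desired = {"/ga.htm": 1, "/index.htm": 2, "/seo.htm": 3}
actual = {"/ga.htm": 2, "/index.htm": 1, "/seo.htm": 3}
print(fitness_1(desired, desired))  # 1.0 for a perfect ranking
print(fitness_1(actual, desired))   # 1 / (1 + 2), two pages are each off by one
```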
E. Fitness Function 2
• +100 penalty for pages that don’t appear
• -10 reward for pages with a perfect fit
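The penalty and reward can be read as adjustments to the summed ranking distance. The sketch below computes only that adjusted distance; how it is converted into a final fitness value is not stated on the slide and is left out. The assumption that pages neither missing nor perfectly placed contribute their plain rank difference, along with the names and sample data, is hypothetical.

```python
MISSING_PENALTY = 100   # added when a desired page is absent from the results
PERFECT_REWARD = 10     # subtracted when a page sits exactly at its desired rank

def adjusted_distance(actual, desired):
    """Sum per-page distances with the penalty and reward applied.

    `actual` and `desired` map each page to its 1-based rank.
    """
    total = 0
    for page, rank in desired.items():
        if page not in actual:
            total += MISSING_PENALTY
        elif actual[page] == rank:
            total -= PERFECT_REWARD
        else:
            total += abs(actual[page] - rank)
    return total

desired = {"/ga.htm": 1, "/index.htm": 2, "/seo.htm": 3}
actual = {"/ga.htm": 1, "/index.htm": 3}      # "/seo.htm" is missing
print(adjusted_distance(actual, desired))     # -10 + 1 + 100 = 91
```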
IV. Preliminary Results
• 12 tests using fitness function #2
  • 1 realistic set of desired rankings
  • 11 random sets
• 4 tests obtained perfect rankings
• 4 tests improved rankings, but did not achieve optimal rankings
• 4 tests showed no improvement
IV. Preliminary Results
[Figure: Distance (y-axis, 2.5 to 7.5) versus Generation Number (x-axis, 0 to 100); series: Htdig default weights, Fitness Function #2]
IV. Preliminary Results
[Figure: Fitness Value (y-axis, 0 to 0.4) versus Generation Number (x-axis, 0 to 175); series: Fitness Function #2, Htdig default weights]
V. Future Work – Fitness Function 3
Levenshtein Distance
• D = string 1; A = string 2
• Construct an m×n matrix M, where m = |D|+1 and n = |A|+1
• M[i,0] = i and M[0,j] = j
• For each remaining cell M[i,j], where D[i] and A[j] are the i-th and j-th characters:
  • If D[i] = A[j] then cost = 0, otherwise cost = 1
  • M[i,j] = MIN {a, b, c} where
    a = M[i-1,j] + 1
    b = M[i,j-1] + 1
    c = M[i-1,j-1] + cost
• Distance = M[|D|,|A|] (the bottom-right cell)

Example (D = FROM, A = FARM):

      F  A  R  M
   0  1  2  3  4
F  1  0  1  2  3
R  2  1  1  1  2
O  3  2  2  2  2
M  4  3  3  3  2
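The matrix construction above translates directly into code. The following is a minimal Python sketch of the standard dynamic-programming algorithm, not the project's implementation; the function names are hypothetical, and Python's 0-based strings shift the character indices by one relative to the slide.

```python
def levenshtein_matrix(d, a):
    """Build the (|d|+1) x (|a|+1) edit-distance matrix for strings d and a."""
    rows, cols = len(d) + 1, len(a) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i                      # delete i characters of d
    for j in range(cols):
        m[0][j] = j                      # insert j characters of a
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if d[i - 1] == a[j - 1] else 1
            m[i][j] = min(m[i - 1][j] + 1,          # deletion
                          m[i][j - 1] + 1,          # insertion
                          m[i - 1][j - 1] + cost)   # match or substitution
    return m

def levenshtein(d, a):
    """Return the edit distance: the bottom-right cell of the matrix."""
    return levenshtein_matrix(d, a)[-1][-1]

for row in levenshtein_matrix("FROM", "FARM"):
    print(row)
print(levenshtein("FROM", "FARM"))  # 2
```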
V. Future Work – Fitness Function 3
Levenshtein Distance
• Reduce the URL comparison to string comparison

  F A R M
  ↓
  www.url.com/index.htm
  www.url.com/ga.htm
  www.url.com/seo.htm
  www.url.com/etc.htm

• Experiment further using LD as a fitness function
  • Sigmoid weighting function to increase the importance of the front of the string
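Treating each URL as one symbol means a ranked result list can be compared with the desired list in the same way as two strings. The sketch below shows that idea with an unweighted distance; it is an illustration under that assumption, with a hypothetical function name, and it does not attempt the sigmoid weighting mentioned above.

```python
def ranking_distance(actual_pages, desired_pages):
    """Levenshtein distance between two rankings, one URL per symbol."""
    previous = list(range(len(desired_pages) + 1))
    for i, page in enumerate(actual_pages, start=1):
        current = [i]
        for j, wanted in enumerate(desired_pages, start=1):
            cost = 0 if page == wanted else 1
            current.append(min(previous[j] + 1,         # deletion
                               current[j - 1] + 1,      # insertion
                               previous[j - 1] + cost)) # match or substitution
        previous = current
    return previous[-1]

desired = ["www.url.com/index.htm", "www.url.com/ga.htm",
           "www.url.com/seo.htm", "www.url.com/etc.htm"]
actual = ["www.url.com/index.htm", "www.url.com/seo.htm",
          "www.url.com/ga.htm", "www.url.com/etc.htm"]
print(ranking_distance(actual, desired))  # 2: two adjacent pages are swapped
```

Only the last two rows of the matrix are kept here, since each row depends solely on the one before it.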
V. Future Work
• Create more extensive test sets
  • dare.com, studentaid.ed.gov, fafsa.ed.gov, americorps.org
V. Future Work
• For pages that still do not rank properly, create optimization suggestions
• Use custom meta tags to properly rank outliers
• Use implicit user feedback to find the desired rankings
VI. Summary
• Proof of concept
• Testing on real-world websites will strengthen results and open other areas of study
VII. Questions

Thanks for attending

Any questions?