A Search Engine That Learns

Jeff Elser
John Paxton
Montana State University - Bozeman

Presentation Outline
I. Problem
II. Background Information
III. Approach
IV. Preliminary Results
V. Future Work
VI. Summary
VII. Questions

I. Problem
RightNow software use
Spidering and searching
Website optimization
• Page by page is tedious and time consuming
• Dual ownership should allow perfect optimization
Solutions
• Search engine adjustments
• Suggesting specific web page changes

II. Background – Search Engine
Spidering
Indexing
Weighting factors

Weight Identifier    Default Value    First Test Case Results
backlink                    1000.0                      510.0
description                  150.0                      980.0
keywords                     100.0                       66.0
title                        100.0                      180.0
meta-description              50.0                      920.0
heading 1                      5.0                      130.0
author                         1.0                      440.0
multi-match                    1.0                      170.0
text                           1.0                        0.0
url text                       1.0                      540.0
date                           0.35                     140.0

II. Background – Genetic Algorithms
Goldberg's Simple GA
• Mutation
• Crossover
• Elitism
• Non-overlapping populations
• Several fitness functions
Individual 1: 0 0 0 0
• Fitness = 2
Individual 2: 1 1 1 1
• Fitness = 4

III. Approach
A. Architecture
B. Training data
C. Testing controls (website source)
D. GA specifics
E. Fitness functions

A. Architecture

B. Training Data
Website source
• 20000 newsgroup articles from UCI Knowledge Discovery in Databases Archive
• Hand formatted HTML
• Chosen for word count and structure

C. Testing Controls
Webmaster provides training data
• List of important keywords
• Associated ranked pages
• Tedious, but trivial compared to optimizing all pages

D. GA Specifics
Random initial population
• Population size 1000
• Used GAlib's built-in random number generator
Genome
• 16 real numbers corresponding to the 16 weighting factors
• Range 0.0 – 1000.0
GA executes for 10000 generations
Elitism is turned on
Mutation probability = 0.01
Crossover probability = 0.6

E. Fitness Function 1
Based on ∑D, where D = |(actual ranking) – (desired ranking)|
• +1 to avoid division by 0

E. Fitness Function 2
• +100 penalty for pages that don't appear
• -10 reward for pages with a perfect fit

IV. Preliminary Results
12 tests using fitness function #2
• 1 realistic set of desired rankings
• 11 random sets
4 tests obtained perfect rankings
4 improved rankings, but did not achieve optimal
4 tests showed no improvement

[Figure: Distance vs. Generation Number (0 to 100); series: Htdig default weights, Fitness Function #2]

[Figure: Fitness Value vs. Generation Number (0 to 175); series: Fitness Function #2, Htdig default weights]

V. Future Work – Fitness Function 3
Levenshtein Distance
D = string 1; A = string 2
Construct an m × n matrix M, where m = |D|+1 and n = |A|+1
M[i,0] = i and M[0,j] = j
For each remaining cell M[i,j]:
• If D[i] == A[j] then cost = 0
• If D[i] != A[j] then cost = 1
• M[i,j] = MIN {a, b, c} where
    a = M[i-1,j] + 1
    b = M[i,j-1] + 1
    c = M[i-1,j-1] + cost
Distance = M[m-1,n-1] (the bottom-right cell)

         F   A   R   M
     0   1   2   3   4
F    1   0   1   2   3
R    2   1   1   1   2
O    3   2   2   2   2
M    4   3   3   3   2

Reduce the URL comparison to string comparison
F A R M
↓
www.url.com/index.htm
www.url.com/ga.htm
www.url.com/seo.htm
www.url.com/etc.htm
Experiment further using LD as a fitness function
• Sigmoid weighting function to increase the importance of the front of the string

V. Future Work
Create more extensive test sets
• dare.com, studentaid.ed.gov, fafsa.ed.gov, americorps.org
For pages that still do not rank properly, create optimization suggestions
Use custom meta tags to properly rank outliers
Use implicit user feedback to find the desired rankings

VI. Summary
Proof of concept
Testing on real world websites will strengthen results and open other areas of study.

VII. Questions
Thanks for attending
Any questions?
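
Sketch: one GA generation (III.D)
The GA settings listed under "GA Specifics" (elitism on, mutation probability 0.01, crossover probability 0.6, real-valued genes in 0.0 – 1000.0) can be illustrated with a minimal Python sketch of one non-overlapping generation. This is not GAlib code and not necessarily what the authors ran: roulette-wheel selection (as in Goldberg's simple GA), single-point crossover, and uniform-reset mutation are assumptions, and `next_generation` is a hypothetical name. It also assumes fitness values are non-negative.

```python
import random
from typing import Callable, List

Genome = List[float]


def next_generation(
    population: List[Genome],
    fitness: Callable[[Genome], float],
    rng: random.Random,
    p_cross: float = 0.6,    # crossover probability from the slides
    p_mut: float = 0.01,     # per-gene mutation probability from the slides
    low: float = 0.0,        # gene range from the slides
    high: float = 1000.0,
) -> List[Genome]:
    """Build a full replacement population (non-overlapping generations)."""
    scores = [fitness(g) for g in population]
    best = population[scores.index(max(scores))]
    total = sum(scores)

    def select() -> Genome:
        # Roulette-wheel selection (assumed); needs non-negative fitness.
        if total <= 0:
            return rng.choice(population)
        return rng.choices(population, weights=scores, k=1)[0]

    # Elitism: the best individual is copied over unchanged.
    children: List[Genome] = [list(best)]
    while len(children) < len(population):
        p1, p2 = list(select()), list(select())
        # Single-point crossover (assumed operator).
        if rng.random() < p_cross and len(p1) > 1:
            cut = rng.randrange(1, len(p1))
            p1, p2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
        for child in (p1, p2):
            # Mutation: reset a gene to a fresh value in [low, high].
            for k in range(len(child)):
                if rng.random() < p_mut:
                    child[k] = rng.uniform(low, high)
            if len(children) < len(population):
                children.append(child)
    return children
```

Calling `next_generation` repeatedly (the slides report 10000 generations with a population of 1000 and 16 genes per genome) yields the full run; because of elitism, the best fitness never decreases from one generation to the next.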
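
Sketch: fitness functions 1 and 2 (III.E)
The two fitness functions from the Approach section are only outlined on the slides, so the following Python sketch fills the gaps with labeled assumptions. For function 1 it assumes the fitness is 1 / (∑D + 1), which is one reading of "+1 to avoid division by 0". For function 2 it assumes the penalty and reward are folded into the summed distance and that lower is better. Rankings are assumed to be lists of page URLs, best first; all function names are hypothetical.

```python
from typing import List


def rank_distance(actual: List[str], desired: List[str]) -> int:
    """Sum of D = |actual rank - desired rank| over desired pages that appear."""
    position = {url: i for i, url in enumerate(actual)}
    return sum(
        abs(position[url] - i)
        for i, url in enumerate(desired)
        if url in position
    )


def fitness_1(actual: List[str], desired: List[str]) -> float:
    # Assumed form: 1 / (sum of D + 1); the +1 avoids division by zero
    # when the actual ranking already matches the desired ranking.
    return 1.0 / (rank_distance(actual, desired) + 1)


def distance_2(
    actual: List[str],
    desired: List[str],
    missing_penalty: int = 100,  # "+100 penalty for pages that don't appear"
    perfect_reward: int = 10,    # "-10 reward for pages with a perfect fit"
) -> int:
    """Adjusted distance (assumed combination; lower is better)."""
    position = {url: i for i, url in enumerate(actual)}
    total = 0
    for i, url in enumerate(desired):
        if url not in position:
            total += missing_penalty
        elif position[url] == i:
            total -= perfect_reward
        else:
            total += abs(position[url] - i)
    return total
```

With a desired ranking of three pages, a result list that swaps the last two gives a summed distance of 2 under function 1, while function 2 additionally rewards the page that is already in place.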
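
Sketch: Levenshtein distance for rankings (V)
The matrix recurrence described under Fitness Function 3 can be written as a short Python function. The `encode` helper is a hypothetical illustration of "reduce the URL comparison to string comparison": it assumes each URL in the desired ranking is assigned one symbol, and that pages outside the symbol table map to "?".

```python
from typing import Dict, List


def levenshtein(d: str, a: str) -> int:
    """Edit distance between d and a using the (|d|+1) x (|a|+1) matrix."""
    rows, cols = len(d) + 1, len(a) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i          # first column: delete i characters
    for j in range(cols):
        m[0][j] = j          # first row: insert j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if d[i - 1] == a[j - 1] else 1
            m[i][j] = min(
                m[i - 1][j] + 1,         # deletion
                m[i][j - 1] + 1,         # insertion
                m[i - 1][j - 1] + cost,  # match or substitution
            )
    return m[rows - 1][cols - 1]  # bottom-right cell


def encode(ranking: List[str], symbols: Dict[str, str]) -> str:
    """Turn a ranked list of URLs into a string, one symbol per page."""
    return "".join(symbols.get(url, "?") for url in ranking)
```

For the slide's example, `levenshtein("FROM", "FARM")` returns 2, matching the bottom-right cell of the matrix.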