Efficient Approximate Search on String Collections Part I Marios Hadjieleftheriou Chen Li 1 DBLP Author Search http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html 2 Try their names (good luck!) UCSD Yannis Papakonstantinou Case Western Meral Ozsoyoglu AT&T--Research Marios Hadjieleftheriou http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html 3 4 Better system? 5 http://dblp.ics.uci.edu/authors/ People Search at UC Irvine 6 http://psearch.ics.uci.edu/ Web Search Actual queries gathered by Google http://www.google.com/jobs/britney.html Errors in queries Errors in data Bring query and meaningful results closer together 7 Data Cleaning R informix microsoft … … S infromix … mcrosoft … 8 Problem Formulation Find strings similar to a given string: dist(Q,D) <= δ Example: find strings similar to “hadjeleftheriou” Performance is important! -10 ms: 100 queries per second (QPS) - 5 ms: 200 QPS 9 Outline Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion Part I Part II 10 Next… Preliminaries 11 Similarity Functions Similar to: a domain-specific function returns a similarity value between two strings Examples: Edit distance Hamming distance Jaccard similarity Soundex TF/IDF, BM25, DICE See [KSS06] for an excellent survey 12 Edit Distance A widely used metric to define string similarity Ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2 Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2 13 13 Next… Gram-based algorithms List-merging algorithms [LLL08] Variable-length grams (VGRAM) [LWY07,YWL08] 14 “q-grams” of strings universal 2-grams 15 Edit operation’s effect on grams universal Fixed length: q k operations could affect k * q grams If ed(s1,s2) <= k, then their # of common grams >= (|s1|- q + 1) – k * q 16 q-gram inverted lists id 0 1 2 3 4 strings rich stick stich stuck static 2-grams at ch ck ic ri st ta ti tu uc 4 0 2 1 0 0 1 4 1 3 3 3 1 2 4 2 3 4 2 4 17 Searching using inverted lists Query: “shtick”, ED(shtick, ?)≤1 sh id 0 1 2 3 4 ht strings rich stick stich stuck static ti 2-grams ic at ch ck ic ri st ta ti tu uc ck # of common grams >= 3 4 0 2 1 0 0 1 4 1 3 3 3 1 2 4 2 3 4 2 4 18 T-occurrence Problem Merge Ascending order Find elements whose occurrences ≥ T 19 Example T=4 1 10 5 3 13 7 5 15 13 13 15 10 13 Result: 13 20 List-Merging Algorithms HeapMerger MergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkip DivideSkip 21 Heap-based Algorithm Push to heap …… Min-heap Count # of occurrences of each element using a heap 22 MergeOpt Algorithm [SK04] Binary search Long Lists: T-1 Short Lists 23 Example of MergeOpt 1 10 5 3 13 7 5 15 13 13 15 10 13 Long Lists: 3 Short Lists: 2 Count threshold T≥ 4 24 ScanCount String ids 1 2 3 # of occurrences 1 0 0 1 0 Increment by 1 … 1 10 5 3 13 7 5 15 13 13 15 10 13 14 0 4 0 15 2 0 Result! 13 Count threshold T≥ 4 25 List-Merging Algorithms HeapMerger MergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkip DivideSkip 26 MergeSkip algorithm [BK02, LLL08] …… Min-heap Jump Pop T-1 Greater or equals T-1 27 Example of MergeSkip 1 minHeap 5 13 Jump 10 15 1 10 5 3 13 7 5 15 17 13 15 10 15 Count threshold T≥ 4 28 DivideSkip Algorithm [LLL08] Binary MergeSkip search Long Lists Short Lists 29 How many lists are treated as long lists? 30 Length Filtering Length: 10 s: By length only! Ed(s,t) ≤ 2 t: Length: 19 31 Positional Filtering Ed(s,t) ≤ 2 s a b (ab,1) t a b (ab,12) 32 A filter tree Combine filters with list-merging algorithms [LLL08] 33 Next… Variable-length grams (VGRAM) [LWY07,YWL08] 34 2-grams -> 3-grams? Query: “shtick”, ED(shtick, ?)≤1 sht id 0 1 2 3 4 hti strings rich stick stich stuck static tic 3-grams ick ati ich ick ric sta sti stu tat tic tuc uck # of common grams >= 1 4 0 1 0 4 1 3 4 1 3 3 2 2 2 4 35 Observation 1: dilemma of choosing “q” Increasing “q” causing: id 0 1 2 3 4 Longer grams Shorter lists Smaller # of common grams of similar strings strings rich stick stich stuck static 2-grams at ch ck ic ri st ta ti tu uc 4 0 2 1 0 0 1 4 1 3 3 3 1 2 4 2 3 4 2 4 36 Observation 2: skew distributions of gram frequencies DBLP: 276,699 article titles Popular 5-grams: ation (>114K times), tions, ystem, catio 37 VGRAM: Main idea Grams with variable lengths (between qmin and qmax) zebra - corrasion - ze(123) co(5213), cor(859), corr(171) Advantages Reduce index size Reducing running time Adoptable by many algorithms 38 Challenges Generating variable-length grams? Constructing a high-quality gram dictionary? Relationship between string similarity and their gram-set similarity? Adopting VGRAM in existing algorithms? 39 Challenge 1: String Variable-length grams? Fixed-length 2-grams universal Variable-length grams [2,4]-gram dictionary universal ni ivr sal uni vers 40 Representing gram dictionary as a trie ni ivr sal uni vers 41 Step 2: Constructing a gram dictionary qmin=2 qmax=4 Frequency-based [LYW07] Cost-based [YLW08] 42 Challenge 3: Edit operation’s effect on grams universal Fixed length: q k operations could affect k * q grams 43 Deletion affects variable-length grams Not affected Affected i-qmax+1 i Deletion Not affected i+qmax- 1 44 Main idea For a string, for each position, compute the number of grams that could be destroyed by an operation at this position Compute number of grams possibly destroyed by k operations Store these numbers (for all data strings) as part of the index Vector of s = <2,4,6,8,9> With 2 edit operations, at most 4 grams can be affected Use this number to do count filtering 45 Summary of VGRAM index 46 Challenge 4: adopting VGRAM Easily adoptable by many algorithms Basic interfaces: String s grams String s1, s2 such that ed(s1,s2) <= k min # of their common grams 47 Lower bound on # of common grams Fixed length (q) universal If ed(s1,s2) <= k, then their # of common grams >=: (|s1|- q + 1) – k * q Variable lengths: # of grams of s1 – NAG(s1,k) 48 Example: algorithm using inverted lists Query: “shtick”, ED(shtick, ?)≤1 sh ht tick 2-grams … ck ic … ti … 2-4 grams Lower bound = 3 1 0 3 1 2 1 2 4 id 0 1 2 3 4 4 strings rich stick stich stuck static … ck ic ich … tic tick … 1 1 0 3 4 2 2 1 4 Lower bound = 1 49 End of part I Motivation Preliminaries Trie-based approach Gram-based algorithms Sketch-based algorithms Compression Selectivity estimation Transformations/Synonyms Conclusion Part I Part II 50 References [AGK06] Efficient Exact Set-Similarity Joins. Arvind Arasu, Venkatesh Ganti, Raghav Kaushik .VLDB 2006 [ACGK08] Incorporating string transformations in record matching. Arvind Arasu, Surajit Chaudhuri, Kris Ganjam, Raghav Kaushik. SIGMOD 2008 [BK02] Adaptive intersection and t-threshold problems. Jérémy Barbay, Claire Kenyon. SODA 2002 [BJL+09] Space-Constrained Gram-Based Indexing for Efficient Approximate String Search. Alexander Behm, Shengyue Ji, Chen Li, and Jiaheng Lu. ICDE 2009 [BCFM98] Min-Wise Independent Permutations. Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher. STOC 1998 [CGG+05]Data cleaning in microsoft SQL server 2005. Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rahul Kapoor, Vivek R. Narasayya, Theo Vassilakis. SIGMOD 2005 [CGK06] A Primitive Operator for Similarity Joins in Data Cleaning. Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik. ICDE06 [CCGX08] An Efficient Filter for Approximate Membership Checking. Kaushik Chakrabarti, Surajit Chaudhuri, Venkatesh Ganti, Dong Xin. SIGMOD08 [HCK+08] Fast Indexes and Algorithms for Set Similarity Selection Queries. Marios Hadjieleftheriou, Amit Chandel, Nick Koudas, Divesh Srivastava. ICDE 2008 51 References [HYK+08] Hashed samples: selectivity estimators for set similarity selection queries. Marios Hadjieleftheriou, Xiaohui Yu, Nick Koudas, Divesh Srivastava. PVLDB 2008. [JL05] Selectivity Estimation for Fuzzy String Predicates in Large Data Sets. Liang Jin, Chen Li. VLDB 2005. [JLL+09] Efficient Interactive Fuzzy Keyword Search. Shengyue Ji, Guoliang Li, Chen Li, and Jianhua Feng. WWW 2009 [JLV08] SEPIA: Estimating Selectivities of Approximate String Predicates in Large Databases. Liang Jin, Chen Li, Rares Vernica. VLDBJ08 [KSS06] Record linkage: Similarity measures and algorithms. Nick Koudas, Sunita Sarawagi, Divesh Srivastava. SIGMOD 2006. [LLL08] Efficient Merging and Filtering Algorithms for Approximate String Searches. Chen Li, Jiaheng Lu, and Yiming Lu. ICDE 2008. [LNS07] Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance. Hongrae Lee, Raymond T. Ng, Kyuseok Shim. VLDB 2007 [LWY07] VGRAM: Improving Performance of Approximate Queries on String Collections Using Variable-Length Grams, Chen Li, Bin Wang, and Xiaochun Yang. VLDB 2007 [MBK+07] Estimating the selectivity of approximate string queries. Arturas Mazeika, Michael H. Böhlen, Nick Koudas, Divesh Srivastava. ACM TODS 2007 52 References [SK04] Efficient set joins on similarity predicates. Sunita Sarawagi, Alok Kirpal. SIGMOD 2004 [XWL08] Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. Chuan Xiao, Wei Wang, Xuemin Lin. PVLDB 2008 [XWL+08] Efficient similarity joins for near duplicate detection. Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. WWW 2008 [YWL08] Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently. Xiaochun Yang, Bin Wang, and Chen Li. SIGMOD 2008 53
© Copyright 2026 Paperzz