occurence

Efficient and flexible text
manipulation, spelling
correction and page collections
with Pywikibot
From Budapest
User:Bináris
Hungarian Wikipedia &
Pywikipedia developer team
Wikimania 2012
Useful links
[[meta:User:Bináris]]
Just check it now on your laptop to
follow me
What is this about?
My spellchecker underlined occurence.
• Wiktionary:
Noun
occurence
1. Common misspelling of occurrence.
• A search in English Wikipedia:
Results 1–20 of 333,623 for occurence
Does this include every erroneous form?
We speak about
• Pywikipedia bot framework
• replace.py
• fixes.py
This works on every MediaWiki installation!
Some ideas
• Spellink corrections
• Linking and unlinking
• Mass change of section titles
• Execution of naming conventions
• Replacing templates
• Replacing template parameters
• Placing templates
• Correcting link errors
Decisions
1. Command line parameters or fix?
2. Searching in live wiki or in dump?
3. Search & replace in one run or separately?
4. Simple text replacements or regular expressions?
5. Manual or automatic running?
The two-pass model of replacement
1. Gathering candidates (possible to-be-replaced texts) to a file
-save / -savenew
Relatively slow and automatic
– Optionally uploading the list to your wiki
(line numbers help to clean)
2. Making the actual replacements
Faster (or very fast) and attended
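The two passes can be sketched without the framework. This is only an illustration of the idea, not how replace.py implements -save: `pages` is a hypothetical dictionary standing in for the wiki (title to text), and both function names are invented for this sketch.

```python
import re


def collect_candidates(pages, pattern, list_path):
    """Pass 1: write the titles of all pages whose text matches to a file."""
    regex = re.compile(pattern)
    titles = [title for title, text in pages.items() if regex.search(text)]
    with open(list_path, "w", encoding="utf-8") as list_file:
        for title in titles:
            list_file.write(title + "\n")
    return titles


def replace_from_list(pages, pattern, replacement, list_path):
    """Pass 2: read the saved titles and replace only on those pages."""
    regex = re.compile(pattern)
    with open(list_path, encoding="utf-8") as list_file:
        titles = [line.rstrip("\n") for line in list_file if line.strip()]
    for title in titles:
        pages[title] = regex.sub(replacement, pages[title])
    return titles
```

The saved list can be cleaned by hand between the two passes, which is the point of separating them.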
What is a fix?
• A fix contains a replacement task.
• See the links on my Meta page for
description & examples
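As a rough idea of what a fix looks like, here is a minimal sketch. The fix name, the edit summary and the pattern are invented for illustration; the dictionary keys follow the layout used in fixes.py, and `apply_fix` is a hypothetical helper that ignores the exceptions and only shows what the replacement pairs do.

```python
import re

fixes = {}  # stand-in for the dictionary defined in fixes.py

# Hypothetical fix: name, summary and pattern are invented for this sketch.
fixes['occurrence-demo'] = {
    'regex': True,
    'msg': {'en': 'Bot: correcting the spelling of "occurrence"'},
    'replacements': [
        (r'\b([Oo])ccurence(s?)\b', r'\1ccurrence\2'),
    ],
    'exceptions': {
        'inside-tags': ['hyperlink', 'interwiki'],
    },
}


def apply_fix(fix, text):
    """Apply the replacement pairs with plain re.sub (exceptions are ignored)."""
    for old, new in fix['replacements']:
        text = re.sub(old, new, text)
    return text
```

On the command line such a fix would be selected with -fix:occurrence-demo instead of typing the patterns as parameters.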
The magic of regular expressions
Regular expressions
• color  colour: this is concrete and accidental (and
uninteresting :-P)
• What about changing
[[január 4]]. to [[január 4.]] and [[január 4]]-én to
[[január 4.|január 4]]-én? (For all dates, of course)
• Or July 13, 2012 and 13 July 2012 to 2012-07-13 and
7/13/2012 to 2012-07-13 (ISO 8601) within tables?
• Or color, Color, c/Colorful, c/Colorfulness to colour…
(but not Colorado and colorectal cancer)?
Note! Colorful (film) and (manga) and CSS colors go
to exceptions! (Why? Sure? How to decide?)
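The color/colour case could be approached with a negative lookahead. The pattern below is only a sketch made up for this slide: it handles the word forms listed above and skips Colorado and colorectal, but it knows nothing about article titles or CSS, which is exactly why those go to exceptions.

```python
import re

# Hypothetical pattern: "color" at the start of a word, unless the word
# continues as Colorado or colorectal. The capital is kept via group 1.
COLOR_RE = re.compile(r'\b([Cc])olor(?!ado|ectal)')


def to_colour(text):
    """Replace color with colour, preserving the case of the first letter."""
    return COLOR_RE.sub(r'\1olour', text)
```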
Regular expressions
• Regular expressions form a simple
programming language that searches for
patterns and replaces them with patterns.
• Learn them, they are worth it! They open another
dimension of efficiency.
Example: search for a date
July 13, 2012 (a regex-like analysis)
1. A month name (possibly in lower case or abbreviated
as Jul)
2. Zero or more spaces
3. 1…9 OR 0 followed by 1…9 OR 1 or 2 followed by
0…9 OR 3 followed by 0 or 1
4. An optional comma
5. Zero or more spaces (at least one if there is no comma)
6. At most four digits (are one- and two-digit years worth matching?)
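The six steps can be put together roughly as follows. This is one possible translation, not the only one: the names MONTH, DAY, SEPARATOR, YEAR and DATE_RE are chosen for this sketch, and the month list assumes English names with three-letter abbreviations.

```python
import re

# Step 1: full month name or its three-letter abbreviation
MONTH = (r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|'
         r'Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)')
# Step 3: 30-31, 10-29, or 1-9 with an optional leading zero
DAY = r'(?:3[01]|[12][0-9]|0?[1-9])'
# Steps 4-5: a comma with optional spaces around it, or at least one space
SEPARATOR = r'(?:\s*,\s*|\s+)'
# Step 6: one to four digits, not followed by a further digit
YEAR = r'\d{1,4}(?!\d)'

DATE_RE = re.compile(
    r'\b(?P<month>' + MONTH + r')\s*'      # step 2: zero or more spaces
    + r'(?P<day>' + DAY + r')' + SEPARATOR
    + r'(?P<year>' + YEAR + r')',
    re.IGNORECASE,                         # also finds "july"
)
```

Each loosened step (lower case, abbreviation, missing space after the comma) adds hits and adds complexity at the same time.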
First theorem
The more hits and the more precise matching
you want, the more complex the regex will be.
(Do you want to find july? Do you want to
find July 13,2012? Do you want to find
Jul 13, 2012?)
Example: agents (search & replace)
'replacements': [
    (ur'(FBI|CIA|KGB|MI ?\d) [üÜ]gynök(?!e)', ur'\1-ügynök'),
    # Non-capturing inner group, so that \]\] applies to every agency, not only to MI
    (ur'((?:FBI|CIA|KGB|MI ?\d)\]\]) [üÜ]gynök(?!e)', ur'\1-ügynök'),
],
1. An agency (MI followed by an optional space and a digit)
2. A space
3. Ügynök OR ügynök, but NOT ügynöke (a hyphen is prohibited there)
Second line: a linked agency (the agency is followed by ]])
Result: a hyphenated, lower case agent (ügynök = agent in Hungarian)
NB it was preceded by some searches! Not all agencies are here.
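To try the agent replacement outside the bot, the two pairs can be run with plain re.sub. This is a sketch in Python 3 syntax (r'' instead of ur''); the function name is invented, and the second pattern is grouped on the assumption that the closing ]] should be accepted after any of the agencies.

```python
import re

REPLACEMENTS = [
    # unlinked agency, space, Ügynök/ügynök not followed by "e"
    (r'(FBI|CIA|KGB|MI ?\d) [üÜ]gynök(?!e)', r'\1-ügynök'),
    # linked agency: the same, but the agency is followed by ]]
    (r'((?:FBI|CIA|KGB|MI ?\d)\]\]) [üÜ]gynök(?!e)', r'\1-ügynök'),
]


def hyphenate_agents(text):
    """Apply both replacement pairs one after the other."""
    for old, new in REPLACEMENTS:
        text = re.sub(old, new, text)
    return text
```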
Example: exceptions with regexes
BaseExceptions = {
    'inside-tags': [
        'hyperlink',
        'interwiki',
    ],
    'text-contains': [
        ur'(?i)(\{\{szinnyei|\{\{pallas\}|\{\{fényes\}|\{\{vályi\}|Vályi András|Fényes Elek|\{\{sicc\})',
    ],
    'inside': [
        r'\{\{DEFAULTSORT:.*?\}\}',  # DEFAULTSORT intentionally contains words without diacritics.
        ur'<ref name.*?>',
        # All kinds of citation templates:
        ur'(?is)\{\{cite.*?\}\}',  # All the cite-something templates (there is not always a space)
        ur'(?is)\{\{cit(lib|per).*?\}\}',  # CitLib and CitPer (a space is not guaranteed, it may be |)
        ur'(?is)\{\{citation .*?\}\}',
    ],
    'title': [
        ur'\d{4} a jogalkotásban',
    ],
}
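What an 'inside' exception does can be imitated with plain re. The sketch below is a simplification and not the framework's implementation; `replace_outside` is a hypothetical helper that leaves every match untouched if it overlaps a protected span.

```python
import re


def replace_outside(text, old, new, inside_exceptions):
    """Replace old with new, except where a match overlaps an exception."""
    # Spans of the text covered by any of the exception patterns
    protected = [m.span() for exc in inside_exceptions
                 for m in re.finditer(exc, text)]

    def substitute(match):
        start, end = match.span()
        if any(start < p_end and end > p_start for p_start, p_end in protected):
            return match.group(0)      # inside an exception: keep as it is
        return match.expand(new)       # outside: normal replacement

    return re.sub(old, substitute, text)
```

For example, with r'(?is)\{\{cite.*?\}\}' as the only exception, a "color" inside a cite template stays, while the ones in running text become "colour".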
What is to be excepted?
• Keywords
Advanced level
• Fixes and functions – own Python functions
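A function is useful when the new text has to be computed from the match. As a sketch of the idea with plain re.sub, which accepts a function in place of the replacement string: the names below are invented, and the pattern is a deliberately simple one for the "July 13, 2012" form.

```python
import re

MONTHS = {name: number for number, name in enumerate(
    ['january', 'february', 'march', 'april', 'may', 'june', 'july',
     'august', 'september', 'october', 'november', 'december'], start=1)}

US_DATE_RE = re.compile(r'\b([A-Za-z]+) (\d{1,2}), (\d{4})\b')


def to_iso(match):
    """Return the matched date in ISO 8601 form, or unchanged if it is no date."""
    month = MONTHS.get(match.group(1).lower())
    if month is None:
        return match.group(0)          # e.g. "Chapter 13, 2012"
    return '%s-%02d-%02d' % (match.group(3), month, int(match.group(2)))
```

Usage: US_DATE_RE.sub(to_iso, text) calls to_iso once for every match.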
Workflow
Simple replacement tasks
• Find an idea
• Create the replacement
• Find a good selector (search*, category…)
• Do the work with two fingers
(y/enter, then /enter)
(asynchronous save!)
• Imagine that this slide and the next one are flowcharts.
*Unfortunately, there are no regexes in the MediaWiki
search engine
Advanced replacements tasks
• Find an idea
• Create the first version of replacement
• Test it as usual in software development
– Watch it working during collection
– Create a test page with purposeful errors
– Take care of [[link]]ed & [[link|piped]] versions!
• Found false positives? Missing replacements? Is it too
slow? Are the previous problems solved as far as
possible? Refine your regexes and/or exceptions
• Press Ctrl+C, and da capo al fine
• If the fix is good enough, begin the work.
• Maintain fixes & exceptions continuously
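The "test page with purposeful errors" can also live in a few lines of code before anything touches the wiki. A minimal sketch, with an invented pattern and invented test cases that include the linked and piped versions:

```python
import re

PATTERN = r'\b([Oo])ccurence'
REPLACEMENT = r'\1ccurrence'

# (input, expected output) pairs with purposeful errors
TEST_CASES = [
    ('an occurence', 'an occurrence'),
    ('Occurences', 'Occurrences'),
    ('[[occurence]]', '[[occurrence]]'),                # linked
    ('[[Event|occurence]]', '[[Event|occurrence]]'),    # piped
    ('occurrence', 'occurrence'),                       # correct, must stay
]


def failed_cases(pattern, replacement, cases):
    """Return (input, expected, actual) for every case that goes wrong."""
    failures = []
    for source, expected in cases:
        actual = re.sub(pattern, replacement, source)
        if actual != expected:
            failures.append((source, expected, actual))
    return failures
```

An empty result means the current version passes; every false positive found later becomes a new line in TEST_CASES.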
Why manually?
• Color as CSS property
• % next to a number – may be an
operation
• Misspelled word – may be an example
in a linguistic article or a quotation
• RESPONSIBILITY!
Second theorem
Spelling corrections must be done manually.
Period.
Semiautomatic running
• Ingredients:
– A replacement task that runs almost always
correctly
– One or more pizzas (depending on running time)
(possibly a bottle of beer, if you like it)
– Your favourite music
– Stable knowledge of where your Pause button is
Errors
• False positives
• Conflicts (originating from false positives)
• Missed matches
• Simply bad replacement expression
• Slow fix
• Inappropriate automatic running
• Unnecessary changes because of fatigue
• Unnecessary changes because of incompetence
Change the bot owner!
Third theorem
The more hits you want, the more conflicts
you get.
This is the game.
Find the balance.
Speed
• Complex fixes may run slower
• Exceptions make it slower
• Lookbehinds make it slower
• Recursive run and allowoverlap are
definitely slow (risk of infinite loop!)
• It will be slow if the beginning of the
expression has many more hits than the
trailing part (see examples in fixes.py)
Speed
Fast replacements take the titles from
• -search
• -cat & al
• -links
• -transcludes
• -file
etc.
Efficiency
What does it mean?
• Find as many occurrences as possible (even
if agglutinated)
• Find as few false positives as possible
• Face as few correction conflicts as possible
• Always give the appropriate replacement
• Let the bot work quickly; don’t wait in
front of the screen
Keys to efficiency
• If you find a very efficient replacement (close to 100%),
run it separately before the others in the same package;
you will have fewer conflicts (but you may collect the
candidates together)
• Too big packages may run slowly and have a greater
chance of causing correction conflicts. Sometimes it is
worth splitting them into smaller parts.
• Too small packages will use more dead time during
preparation and execution. Sometimes it is worth
merging them.
• How to decide then? Just watch.
Keys to efficiency
• Use exceptions when appropriate. They will decrease
false positives as well as correction conflicts. E.g.
– Cite book, cite web, cite anything templates
– URLs, image names (even as template parameters and
gallery images!)
– Templates marking pages out of your scope (old authors in
Hungarian Wikipedia whose quotations contain old-style
spelling)
– Titles marking pages out of your scope (year numbers in
law in Hungarian Wikipedia)
• …and first of all: improve your regexes continuously!
Keys to efficiency
• Once you have found a false positive, save it for later use!
-saveexc / -saveexcnew
• Then insert these titles into your exceptions.
• Run searches before/during creation of a fix.
• Don’t deal with tasks that are not worth a bot!
• Use the two-pass model and the dump whenever
possible!
An ugly example
I have a fix to correct short and long i (i/í).
Argentína has an í, but often occurs in English
and Spanish titles → no regex for it, title
exceptions must be used → separate fix.
But they may be collected together.
A less ugly example
• replace.py ásnéven "ás néven" -search:másnéven -ns:0 -summary:"Helyesírás javítása kézi
botszerkesztéssel: más néven"
(the edit summary means "Spelling correction by manual bot editing: más néven")
• Live demo
Character encoding problems
• Keep your files in UTF-8, and don’t use
Windows Notepad
• E.g. the encoding setting in Notepad++ (screenshot)
Character encoding problems
• If it doesn’t work on the command line, write a fix
• If you can’t solve it with a fix, use URL encoding
– replace.py -catr:Венгрия . @ -lang:ru
-excepttext:"[[hu:" -save:magyarok.txt -always
– replace.py -catr:%D0%92%D0%B5%D0%BD%D0%B3%D1%80%D0%B8%D1%8F
. @ -lang:ru -excepttext:"[[hu:"
-save:magyarok.txt -always (live demo)
• You may store this in a script (import replace.py)
This is how page collections are made
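The URL-encoded form does not have to be typed by hand. A short sketch with the Python 3 standard library (the variable names are chosen for this example):

```python
from urllib.parse import quote, unquote

category = 'Венгрия'

# quote() percent-encodes the UTF-8 bytes of the category name
encoded = quote(category)

# unquote() restores the original text, useful for checking the result
decoded = unquote(encoded)
```

Here `encoded` is the same string that appears after -catr: in the second command above.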
Page collections
A simple idea
1. Gathering candidates (possible to-be-replaced
texts) to a file
Relatively slow and automatic
– Uploading the list to your wiki (this is the result!)
2. Nothing. You are ready.
Some ideas for page collections
• Scheme: some existing/missing text
• Articles related to Hungary in other
Wikipedias (see above for ruwiki)
• The Redlist Project for animals and plants
• Articles with {{commons}} template, but
without any image
• …let your imagination run free!
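The "existing/missing text" scheme can be sketched as a simple test on the page text, here for the {{commons}} idea. Everything below is an assumption made for illustration: `pages` is a hypothetical title-to-text dictionary, and the namespace names in IMAGE_RE (File, Image, Fájl, Kép) are a guess that has to be adapted to the wiki. Images given only as infobox parameters are not recognised by this sketch.

```python
import re

# Existing text: a {{commons...}} template
COMMONS_RE = re.compile(r'\{\{\s*commons\b', re.IGNORECASE)
# Missing text: any [[File:...]] style image link
IMAGE_RE = re.compile(r'\[\[\s*(?:File|Image|Fájl|Kép)\s*:', re.IGNORECASE)


def commons_without_image(text):
    """True if the text has a commons template but no image link."""
    return bool(COMMONS_RE.search(text)) and not IMAGE_RE.search(text)


def collect(pages):
    """Return the sorted titles of the pages that fit the scheme."""
    return sorted(title for title, text in pages.items()
                  if commons_without_image(text))
```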
Useful links
[[meta:User:Bináris]]
Thank you for your attention!
PS – some thoughts months later
• Lookahead is faster than recursion or
overlapping.
• If a function is called for each match, that
makes the bot run really slowly.
• In such cases a separate "fellow fix"
without a function call is useful
for faster searching.
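Why a lookahead can replace a recursive or overlapping run is easiest to see on a small example. The task below (removing the spaces inside a number) is invented for this sketch:

```python
import re

text = '1 000 000'

# Consuming pattern: the three digits after the space become part of the
# match, so the second space is skipped and a second pass would be needed.
consuming = re.sub(r'(\d) (\d{3})', r'\1\2', text)

# Lookahead: the three digits are only checked, not consumed, so a single
# pass finds both spaces.
lookahead = re.sub(r'(\d) (?=\d{3})', r'\1', text)
```

After one pass `consuming` still contains a space, while `lookahead` is already the complete number.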