Federal Statistical Office Germany
Application of Regular Expressions in the German Business Register
Session 5: Projects on Improvements for Business Registers
Wiesbaden Group on Business Registers
Paris, November 26th 2007, Patrizia Moedinger
© Federal Statistical Office Germany, IV A2
Example 1:
Improving legal form coding
by using regular expressions
© Federal Statistical Office Germany, IV A2 – Patrizia Moedinger
31.07.2017
Background
- information on legal forms comes mainly from VAT records
- not all administrative sources provide information on legal forms
- sources use different, incompatible legal form codings or different aggregation levels
- special requirements for other purposes, such as the coding of institutional sectors
Background
- enterprises (legal units) with certain legal forms are legally obliged to carry their legal form in the enterprise name:
  - incorporated firms
  - non-incorporated firms
  - cooperatives
  - merchants that are registered in the German Commercial Register
- enterprise names can therefore be used for legal form coding
Definition of search patterns
- patterns from nomenclature, abbreviations and notations (tax authorities), e.g. GmbH, AG & Co.KG, Limited, Ltd.
- patterns from BR real data, e.g. spelling mistakes, missing blanks, ...
- construction of regular expressions from these patterns
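As a minimal sketch of how such search patterns could be turned into regular expressions, the Python snippet below codes the notations named on this slide. The pattern list, the labels and the function name `detect_legal_form` are illustrative assumptions; the slides do not show the expressions actually used for the BR.

```python
import re

# Hypothetical, illustrative patterns. Compound forms are listed before their
# components so that "AG & Co. KG" is not coded as a plain "AG". Further
# compound forms (e.g. GmbH & Co. KG) would need entries of their own.
LEGAL_FORM_PATTERNS = [
    # Tolerates a missing blank ("Co.KG") and "und" / "u." instead of "&".
    ("AG & Co. KG", re.compile(r"\bAG\s*(?:&|und|u\.)\s*Co\.?\s*KG\b", re.IGNORECASE)),
    # Tolerates dots and stray blanks ("G.m.b.H.", "Gmb H"); no leading word
    # boundary, so a missing blank before the legal form ("MusterGmbH") is found.
    ("GmbH", re.compile(r"G\.?\s?m\.?\s?b\.?\s?H\b", re.IGNORECASE)),
    ("Limited", re.compile(r"\b(?:Limited|Ltd)\b", re.IGNORECASE)),
    # Case-sensitive on purpose, to keep the two-letter pattern specific.
    ("AG", re.compile(r"\bAG\b")),
]


def detect_legal_form(enterprise_name):
    """Return the label of the first matching pattern, or None if none matches."""
    for label, pattern in LEGAL_FORM_PATTERNS:
        if pattern.search(enterprise_name):
            return label
    return None


for name in ["Muster AG & Co.KG", "MusterGmbH", "Smith Ltd.", "Hans Mueller"]:
    print(name, "->", detect_legal_form(name))
```

Testing the patterns in a fixed order keeps the sketch simple. A real pattern set would also have to be checked against false positives in the register data.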
Evaluation of search patterns
- completeness of coding: because of the legal obligation, a high share of legal forms should be found in enterprise names
- reliability: evaluation of coding results
  - drawing a sample after legal form coding
  - classification of the coding results
  - calculation of sensitivity, specificity, positive predictive value, negative predictive value
Completeness of coding
Share of units in % (bar chart, axis from 0 to 100 %)

    Legal form group                                     no legal form could be   legal form could be
                                                         detected from            detected from
                                                         enterprise name          enterprise name
1   sole proprietors                                             93.7                     6.3
2   non-incorporated firms                                        9.9                    90.1
3   incorporated firms                                            3.2                    96.8
4   miscellaneous legal forms (including cooperatives)           89.7                    10.3
Evaluation of Type I and II errors
Evaluation sample, N = 4,000

                                   Enterprise name contains
Regular expression detects         legal form     no or wrong legal form
legal form                            1,009                  4
no or wrong legal form                   26              2,961

PPV (positive predictive value) = 1,009 / (1,009 + 4) = 99.6 %
NPV (negative predictive value) = 2,961 / (2,961 + 26) = 99.1 %
Sensitivity = 1,009 / (1,009 + 26) = 97.5 %
Specificity = 2,961 / (4 + 2,961) = 99.9 %
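The four measures can be recomputed from the cell counts of the evaluation sample (1,009, 4, 26 and 2,961, which sum to N = 4,000). The snippet below is a small sketch; the function and argument names are illustrative and not taken from the slides.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the four evaluation measures from a 2x2 classification table.

    tp: legal form in the name and detected
    fp: legal form detected although the name contains none (or another one)
    fn: legal form in the name but not (or wrongly) detected
    tn: no legal form in the name and none detected
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


metrics = classification_metrics(tp=1009, fp=4, fn=26, tn=2961)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```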
Example 2:
Data pre-processing as a
preliminary for record linkage
Background
- no common unique identifiers available
- data from different sources are initially linked by names and addresses
- different or no address standards
- different notations: "BMW" or "Bayerische Motorenwerke" or "Bay. Motorenwerke"
- German BR is technically limited in storing several addresses (only dispatch and domicile)
Problem of non-standardized notations
- matching by administrative identifiers
  - dependent variable = match by administrative identifiers + no change in the postal code
  - independent variables = differences between enterprise names, street names and town names (Levenshtein edit distance)
- comparison within the same (administrative) source
- comparison between different sources (administrative source and BR)
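The string difference used as independent variable can be sketched as follows: the Levenshtein edit distance divided by the length of the longer string, as on the axis of the following charts. This is a plain textbook implementation for illustration; the function names are assumptions, and the slides do not say which software computed the distances.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance divided by the maximum string length, in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(normalized_distance("Bayerische Motorenwerke", "Bay. Motorenwerke"))
```

Dividing by the maximum string length makes short and long names comparable: 0 means identical strings, 1 means no character could be kept.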
Matching probability against string similarity
within an administrative source (Employment
Agency) (Model: Logistic regression)
[Figure: predicted match probability (predicted y, from 0 = no match to 1 = match) plotted against Levenshtein edit distance / maximum string length (0 to 1), with one curve each for enterprise name, street name and town name]
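The curves in this chart follow the usual logistic form. As a sketch, assume one model per string variable with the normalized edit distance $d$ as the only regressor. The slides do not state the exact specification, and the coefficients $\beta_0$ and $\beta_1$ are not reported.

```latex
d = \frac{\operatorname{lev}(s_1, s_2)}{\max(|s_1|, |s_2|)}, \qquad
\hat{y} = P(\text{match} = 1 \mid d)
        = \frac{1}{1 + \exp\bigl(-(\beta_0 + \beta_1 d)\bigr)}
```

Under this assumption, a negative $\beta_1$ corresponds to a match probability that falls as the two strings become more dissimilar.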
Matching probability against string similarity
between an administrative source (Employment
Agency) and BR (Model: Logistic regression)
[Figure: predicted match probability (predicted y, from 0 = no match to 1 = match) plotted against Levenshtein edit distance / maximum string length (0 to 1), with one curve each for enterprise name, street name and town name]
Pre-processing of administrative data for
record linkage
Differences in name or address (scale from low to high):
- high level of similarity between two strings: identical units
- in between: identical unit or different unit
- high level of disparity between two strings: different units
Pre-processing of administrative data for record linkage
- conversion into specific variables for string matching
  enterprise address:
    BMW AG
    Branch Munich
    Mr Mueller
  converted into:
    enterprise name: BMW
    legal form: AG
    other elements: Branch Munich, Mr Mueller
- simplify comparison strings
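The conversion of the BMW example could look roughly like the sketch below. It assumes that the first address line holds the enterprise name and legal form and that all further lines are "other elements". The short pattern list and the function name `split_address_block` are hypothetical; real pre-processing rules depend on the data source.

```python
import re

# Hypothetical short list of legal form notations, for illustration only.
LEGAL_FORM = re.compile(r"\b(GmbH|AG|KG|Ltd|Limited)\b")


def split_address_block(lines):
    """Split a non-empty address block into name, legal form and other elements."""
    first, *rest = [line.strip() for line in lines if line.strip()]
    match = LEGAL_FORM.search(first)
    if match is None:
        return {"enterprise_name": first, "legal_form": None, "other_elements": rest}
    # Text after the legal form on the first line is kept as another element.
    trailing = first[match.end():].strip(" .,")
    return {
        "enterprise_name": first[:match.start()].strip(" ,"),
        "legal_form": match.group(1),
        "other_elements": ([trailing] if trailing else []) + rest,
    }


print(split_address_block(["BMW AG", "Branch Munich", "Mr Mueller"]))
```

After this step only the enterprise name field needs to enter the string comparison, which removes the legal form and contact details as sources of spurious differences.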
Methods for evaluation
- evaluate the link between string similarity and match before and after pre-processing the data
- evaluation of matching results
  - drawing a sample after the matching process
  - classification of the matching results
  - calculation of sensitivity, specificity, positive predictive value, negative predictive value
- controlling for effects caused by the matching program used
Synopsis
- BR text data needs special treatment in data processing
- applications for regular expressions
  - simple application: legal form coding (limited set of search patterns)
  - more complex application: pre-processing (set of patterns depends on data source and later use)
- application of regular expressions should always be evaluated
Thank you for your attention.