ESSnet DI WP4 Additional Italian case study 2

ESSNet DI PROJECT
On Data Integration
Work Package 4 – Case studies
23-05-2011
First Steps in Profiling Italian Patenting Enterprises
Daniela Ichim, Giulio Perani, Giovanni Seri
{ichim,perani,seri}@istat.it
Version 1.0
ISTAT, Italian National Statistical Institute
Via C. Balbo, 16
00184, Rome
Italy
Deliverable No: DI WP4-IT
Abstract
The paper describes the record linkage scheme followed at the Italian national statistical institute to
match micro-data on patent applications from the international database PATSTAT with the data
available from the Italian Official Business Register (ASIA).
The target data in PATSTAT are the applicants based in Italy that filed one or more patents in the period
1985-2010. Patent applicants can be ‘individuals’ or ‘establishments’. Within the latter category we aim
at identifying business enterprises that were active (as recorded in ASIA) in the period 1998-2008.
The desired output of the linkage process is, for each patenting enterprise, a pair composed of the
‘applicant identification code in PATSTAT’ and the ‘enterprise identification number in ASIA’.
The latter gives access to the repositories of official statistical data and, therefore, allows
economic data to be linked to patenting enterprises. Statistical analyses, such as identifying the
factors behind patenting propensity or evaluating the impact of patenting on enterprise profitability,
can then be performed.
On the methodological side, the linkage of patent data has to rely on the ‘applicant names’.
Consequently, a great effort has been put into the pre-processing phase to standardise
the applicant/enterprise names and extract the ‘legal form’ from the name string. During the linkage
process, two practical problems were faced: the reduced number of comparison variables and the
huge dimension, in terms of number of records, of the Italian Business Register. These issues were
addressed within a rule-based deterministic record linkage approach. In this paper, together with the
results obtained, we illustrate the main features of the sequential searching and linkage
methodology we adopted.
Table of contents
Introduction
1. A general description of patenting administrative flows
2. Data sources
3. Data pre-processing and standardisation
4. The record linkage process
5. Preliminary results
6. Conclusions and future plans
References
Introduction
In this report we describe a preliminary stage of an Istat project aiming, mainly, at monitoring
and profiling Italian patenting enterprises. A complete characterisation of such enterprises might
allow, for example, updating the survey frame list of potential Research and Development (R&D)
performers, and might favour the investigation of specific subpopulations such as biotech enterprises.
Moreover, from a statistical analysis point of view, the identification of patenting enterprises enables
their linking to structural characteristics. Thus, the factors influencing the patenting propensity of
enterprises might be studied, as well as the economic impact of patenting activity.
The preliminary stage we are concerned with in this report is the design of a strategy aiming at the
unambiguous identification of Italian patenting enterprises.
This document is divided into six sections. In section 1, a general description of patenting
administrative flows is given. In section 2, we discuss the selection of the databases to work with and
briefly describe them. In section 3, details on the applied standardization
procedure are reported. The record linkage methodology as applied to these particular datasets is
illustrated in section 4. Because of the reduced number of comparison variables and the huge amount
of data we had to deal with, section 4 puts the emphasis on search space reduction methods.
Finally, in section 5, we present the results obtained. Some conclusions and ideas for further
implementations, analyses and research are given in the last section.
1. A general description of patenting administrative flows
A patent is an exclusive right granted for an invention, which is a product or a process that provides,
in general, a new way of doing something, or offers a new technical solution to a problem. In order
to be patentable, the invention must fulfill certain conditions. Namely, it must be of practical use; it
must show an element of novelty, that is, some new characteristic which is not known in the body
of existing knowledge in its technical field. This body of existing knowledge is called "prior art".
The invention must show an inventive step which could not be deduced by a person with average
knowledge of the technical field. Finally, its subject matter must be accepted as "patentable" under
law.
A patent is granted by a national patent office or by a regional office that does the work for a
number of countries, such as the European Patent Office. Under such regional systems, an applicant
requests protection for the invention in one or more countries, and each country decides as to
whether to offer patent protection within its borders. A patent provides protection for the invention
to the owner of the patent. The protection is granted for a limited period.
A patent owner has the right to decide who may - or may not - use the patented invention for the
period in which the invention is protected. The patent owner may give permission to, or license,
other parties to use the invention on mutually agreed terms. The owner may also sell the right to the
invention to someone else, who will then become the new owner of the patent. Once a patent
expires, the protection ends and the invention enters the public domain, that is, the owner no longer
holds exclusive rights to the invention, which becomes available for commercial exploitation by
others.
The first step in securing a patent is the filing of a patent application. The patent application
generally contains the title of the invention, as well as an indication of its technical field; it must
include the background and a description of the invention.
There are three main actors in any administrative patenting flow: the inventor, the owner and the
applicant. A special feature of a patenting process is that the inventor, the owner and the applicant
might be different subjects (each referring to one or more entities).
A special case of the inventor-owner-applicant relationship is provided by patents whose
original idea was ‘born’ in an enterprise where the general manager (head of the company) is also the
owner of the enterprise. Sometimes the manager is the patent owner, while in other cases the patent
owner is the enterprise itself. In both cases, the inventor might be a completely different person (for
example a researcher employed by the enterprise), and so might the applicant (for example a notary’s
office offering patenting services).
2. Data sources
To our knowledge, the most complete and up-to-date database on patents is the European Patent
Office (EPO) “Worldwide Patent Statistical Database”, called PATSTAT. Much of the
raw data in PATSTAT is extracted from the EPO's master bibliographic database DOCDB, also
known as the EPO Patent Information Resource. PATSTAT is updated twice a year (April and
October). PATSTAT is a relational database containing 20 tables with more than 70 million
records (63 million patent applications) from over 80 countries. Other sources on patents either
concern only regional applications (like Ufficio Italiano Brevetti e Marchi) or offer only data
extraction and analysis services.
PATSTAT registers mainly information on patent applications. To reach our goal (identification of
Italian patenting enterprises), in this work we concentrated only on the two tables depicted in
Figure 1. The link between them is given by the unique values of the field Application Number or,
alternatively, Publication Number. The Application Number also contains the year of registration
of the patent. The time period covered by the database is 1985-2010. There is no
explicit database field concerning the legal form of the inventor, owner or applicant. PATSTAT
registers both the inventor and the applicant name; only the latter was used in this work. The legal
form, if any, has to be extracted from those names. For the applicant, PATSTAT also registers the
address (street, city, postal code) and the country code. Only applicants based in Italy, i.e.
COUNTRY_CODE = “IT”, were selected from the PATSTAT tables. At this stage of the work, the
postal code was used as geographical location, assuming it has the same accuracy as the address. We
plan to rely on the detailed address to assess the linkage quality or to elucidate special cases like
those in which the applicant is the manager (or owner) of an enterprise. These aspects are not
reported in this document.
For the patent, PATSTAT registers its IPC (International Patent Classification) code and its application
and publication numbers. It is worth noting that a patent can be assigned more than one IPC
code. Indeed, while in the second table of Figure 1 each record corresponds to a unique
Application Number (about 70000), in the first table the number of records is around 300000.
Moreover, it should be stressed that there is no formal/well-defined relationship between IPC codes
and the principal economic activity classification (NACE).
PATSTAT (1) Applications
    Application number (by year)
    Publication number
    International Patent Classification (IPC)
PATSTAT (2) Applications
    Application number (by year)
    Publication number
    Applicant name
    Applicant code
    Postal/Zip code
    Applicant Country
Figure 1. Used database tables from PATSTAT; COUNTRY_CODE = “IT”.
The most recent versions of PATSTAT also include a standardized version of the applicant name (see
OECD "Harmonised Applicants' Names", available at http://www.oecd.org/dataoecd/52/17/43846611.pdf).
In our case study, this standardization was ignored because it is not fully compliant with Italian
enterprise names and it includes the legal form. Moreover, as our goal is to link the patent applications
to an enterprise register, we should apply the same standardization process to the selected enterprise
register too.
Applicants may be classified as individuals or establishments. The latter, according to the Frascati
manual, see OECD (2002), could be business enterprises, public institutions, non-profit institutions
and private or public universities. In this work, the aim is the identification of patenting enterprises.
The complete classification of applications will be performed in later stages. The distinction between
business enterprises and natural individuals could be supported by a catalogue of Italian first names.
Istat provides such a list, stemming from surveys on the population register. Alternatively, a list of
Italian first names may be downloaded from www.nomix.it. Preliminary investigations suggest that other
data sources potentially helpful in profiling the Italian patent applications might concern the general
managers of large enterprises and academic researchers.
Additional details on PATSTAT may be found at www.epo.org.
On enterprises, many registers might be available in Italy, with different degrees of accessibility.
From the Istat point of view, the most important is surely ASIA (Archivio Statistico delle Imprese
Attive). ASIA is developed, updated and maintained through the statistical integration of different
administrative sources (Tax Register, Register of Enterprises and Local Units, Social Security
Register, Work Accident Insurance Register, Register of the Electric Power Board), covering the
entire population of enterprises of industry and services, other minor archives available (covering
particular sectors), and structural business statistics currently produced by Istat. ASIA is a business
register used in many different business survey stages, e.g. sampling frame, post-stratification,
calibration, etc.
The variables included in ASIA comprise, among others:
a) Enterprise Identification Number (an Istat internal identification code allowing linkage to
any economic information on the same unit collected by Istat); this identification
code is unique for each enterprise
b) Enterprise Name
c) Zip Code
d) NACE code
e) Geographical information (address, municipality, province, region)
f) Legal form
Other variables that could be useful are the Fiscal Code, Number of employees and Turnover.
It should be observed that only Enterprise Name and Zip Code overlap with the information
contained in PATSTAT.
Given the reliability and availability of ASIA, only enterprises that were active in the period
1998-2008 have been taken into consideration. Consequently, in this work, it was assumed that an
enterprise was active during the year it applied for a patent. In this report we refer only to the
selection of ASIA corresponding exclusively to active enterprises.
To give an idea of the numerical complexity of the problem, table 1 reports the number of
active enterprises by year. For each year, we also present the percentage
of active enterprises with more than 1 employee, which is almost constant, around 40%. As may
be observed, the union of the different versions of ASIA contains more than 47 million records.
YEAR    Thousands of active    % of active enterprises with
        enterprises            more than 1 employee
1998    3871                   40
1999    3950                   40
2000    4223                   40
2001    3992                   40
2002    4323                   40
2003    4327                   40
2004    4367                   40
2005    4458                   40
2006    4484                   40
2007    4554                   40
2008    4577                   40
Table 1. Statistics on the number of active enterprises in Italy, period 1998-2008.
Given that our goal is to identify the patent applicants, it was considered useful and
efficient to concentrate first on lists of enterprises showing a high research and innovation
propensity. To this aim, we took into consideration the survey frame of the Research and Development
survey, which is conducted yearly by Istat. Only the 2006 and 2007 waves were available in a
standardized form. In the 2006 R & D data file there are 26237 records, while in the 2007 data file
there are 16730 records. Since ASIA is the sampling frame for any business survey conducted at
Istat, the information included in the R & D survey frames is similar to that contained in ASIA.
When using the R & D survey frames, we only assumed that the linkage probability would be
higher (due to the innovation propensity) for the R & D survey frames than for the entire business
register ASIA.
3. Data pre-processing and standardisation
PATSTAT counts 299769 applications identified by an Application Number and a Publication
Number; the latter is redundant information and was therefore ignored in this work. The number
of Italian applications reduces to 72037. Each Application Number is assigned an applicant name
(and id code) and a Zip Code. Additional information may be derived from these
fields: year of application, year of first/last application by applicant, number of patent
applications filed by each applicant, region of residence of the applicants.
Variable Applicant Name has been subject to the following standardisation operations:
1. extraction of the application year, recorded in a new variable “Application Year”
2. transformation of all letters to upper case
3. removal of punctuation and similar noise:
a. accents
b. symbols and special characters (e.g. ‘$’, ‘%’ , ‘&’, ‘/’, ‘*’)
c. double spaces (transformed into a single space)
d. dots (e.g. L.T.D. transformed into LTD)
4. standardisation of known abbreviations to a unique value, typically an Italian word (e.g. we
found about 150 ways to say “in short”)
5. standardisation of the most frequent words using a deterministic record linkage procedure in
Relais, see Istat (2011)
a) input files: we considered a file of words (sequence of characters separated by a blank
character) with frequencies greater than 1000 against a file of words with frequencies
greater than 100, but smaller than 1000;
b) parameters: comparison function = “Edit distance”; threshold=0.8, greedy algorithm to
perform the one-to-one assignment;
c) output check: the word pairs declared “match” were subject to a clerical review;
d) standardization: the 122 pairs declared as equivalent were standardized in the same
way; they generally concerned singular – plural or Italian – English versions of the
same words.
e) examples: TERMOIDRAULICA – TERMOIDRAULICI; SOLUTION – SOLUTIONS;
MULTISERVICE – MULTISERVIZI;
6. removal of duplicated words in the same name (every repeated occurrence of a word
was removed), so that each word appears only once in a name, e.g.
AAA BB AAA CCCCC was transformed into AAA BB CCCCC
7. ordering of words in alphabetical order, e.g. CC BB AA was transformed into AA BB CC
8. identification and removal of the legal form, if any. Information on the legal form was stored
in a standardized manner in another variable, called Legal Form. About 80 ways of
expressing 6 main standardized legal forms were identified. The 6 main legal form
categories are ‘SPA’, ‘SRL’, ‘SAS’, ‘SNC’, ‘COOP’ and ‘NONE’.
The resulting variable is called Standardized Name.
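As a rough illustration of operations 2, 3 and 6-8 above, the following Python sketch standardises a single name. It is not the procedure actually used at Istat: the function name standardise_name and the tiny LEGAL_FORM_VARIANTS table are hypothetical, the handling of abbreviations (operation 4) and of frequent words (operation 5) is left out, and the legal form is extracted before the words are sorted, which does not change the result.

```python
import re
import unicodedata

# Hypothetical, heavily abridged table; the report mentions about 80 variants.
LEGAL_FORM_VARIANTS = {
    "SPA": "SPA", "SRL": "SRL", "SAS": "SAS", "SNC": "SNC",
    "COOP": "COOP", "COOPERATIVA": "COOP",
}


def standardise_name(raw):
    """Return (standardised name, legal form) for a raw applicant name."""
    text = raw.upper()
    # Strip accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Dots are deleted so that S.P.A. becomes SPA; other symbols become blanks.
    text = text.replace(".", "")
    text = re.sub(r"[^A-Z0-9 ]", " ", text)
    words = text.split()  # split() also collapses repeated blanks
    legal_form = "NONE"
    kept = []
    for word in words:
        if word in LEGAL_FORM_VARIANTS:
            legal_form = LEGAL_FORM_VARIANTS[word]  # store and remove it
        elif word not in kept:  # drop repeated words
            kept.append(word)
    return " ".join(sorted(kept)), legal_form
```

For example, under these assumptions standardise_name("AAA BB AAA CCCCC") yields the name AAA BB CCCCC with legal form NONE, in line with the example given for operation 6.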
Then, some additional variables have been derived from the standardized Applicant Name:
a) the Standardised name, without abbreviations, duplications, etc.
b) the standardised Applicant Name, without some very common words, e.g. ITALIA
c) acronyms and abbreviations
d) number of characters
e) longest and shortest words
f) length of the longest/shortest words
Since, at this stage, only enterprises should be subject to any linkage process, universities and
known public administrations were eliminated from the file (those records were identified as names
containing words like “UNIVERSITY”, “POLITECNICO”, etc.).
Except for standardisation operations 1 and 8, the same pre-processing was applied to ASIA.
Operations 1 and 8 are not necessary since ASIA already contains information on the year and legal
form of enterprises. Additionally, the same unique standard values identified when performing
operation 5 on PATSTAT were also used for ASIA.
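Operation 5 (pairing very frequent words with less frequent variants) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Relais implementation: the similarity is taken here as 1 minus the edit distance divided by the length of the longer word, pairs are accepted when the similarity is at least 0.8, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalisation: 1 - distance / length of the longer word."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def greedy_pairs(frequent, rare, threshold=0.8):
    """Greedy one-to-one assignment: best-scoring pairs are accepted first."""
    candidates = [(similarity(f, r), f, r) for f in frequent for r in rare]
    candidates = [c for c in candidates if c[0] >= threshold]
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    used_frequent, used_rare, pairs = set(), set(), []
    for _, f, r in candidates:
        if f not in used_frequent and r not in used_rare:
            pairs.append((f, r))
            used_frequent.add(f)
            used_rare.add(r)
    return pairs
```

With this definition the three example pairs listed under operation 5 all score above 0.8 (MULTISERVICE and MULTISERVIZI differ by two substitutions over twelve characters). The candidate pairs would still go to clerical review, as described in the text.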
In this linkage stage, the only three variables shared by PATSTAT and ASIA, and hence usable as
comparison variables, are Standardised Name, Zip Code and Legal Form (stemming from the Applicant Name).
Finally, the PATSTAT data file was deduplicated by considering as duplicates those records having
simultaneously the same values for the three comparison variables mentioned above. Thus, the
number of records was reduced from 72037 to 23833. It should be noted that records in ASIA are
supposed to be unique, each enterprise being assigned a unique identification code, i.e. a key
number. This unique identification number allows the enterprise to be traced across any business
survey conducted by Istat.
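The deduplication rule described above can be sketched in a few lines of Python. The record layout (dictionaries with the hypothetical keys name, zip and legal_form) is an assumption made for illustration only.

```python
def deduplicate(records):
    """Keep the first record for each (name, zip, legal form) combination."""
    seen = set()
    unique = []
    for record in records:
        key = (record["name"], record["zip"], record["legal_form"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Two records that differ in any one of the three comparison variables, for instance the same name with two different Zip Codes, are kept as distinct units.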
In figure 1, histograms of the length of the Standardized Name and of the number of words (sequences of
characters separated by blanks) in the standardized PATSTAT database are shown. It may be
observed that, on average, the Standardized Name has a length of 15 characters, while the mean number of
words in a name equals 2.2.
Figure 1. Distribution of both length of Standardized Name and number of words in a name, PATSTAT database.
In table 2, the distribution of the variable Legal Form is shown. It may be observed that for almost 40%
of records no legal form was identified, while the majority of records, about 57%, is concentrated
in categories “SPA” and “SRL”.
Legal Form    Frequency    %
(empty)       8979         37.67
COOP          63           0.26
SAS           501          2.10
SNC           756          3.17
SPA           6164         25.86
SRL           7370         30.92
Total         23833        100
Table 2. Distribution of Legal Form, PATSTAT database.
4. The record linkage process
As illustrated by the red arrow in figure 2, the desired output of the linkage problem is the pair Applicant
Identification Number (PATSTAT) - Enterprise Identification Number (ASIA). The latter allows
linking structural and economic information stemming from Istat official surveys to patenting
enterprises, shown in gray in Figure 2.
[Figure 2 diagram. Boxes: PATSTAT (Applicant identification number, Patent information); ASIA
(Enterprises identification number); Istat surveys (Enterprises structural information).]
Figure 2. The PATSTAT-ASIA linkage problem and its opportunities.
As described in the previous sections, the only overlapping information between the two datasets
PATSTAT and ASIA concerns the Applicant Name – Enterprise Name and the Zip Code.
It should be noted that, in PATSTAT, Applicant Name is missing in 40 records, while Zip Code is
missing in about 10% of records. Besides the missing value problem, in about 9.4% of PATSTAT
records the Zip Code represents the geographical location only at an
aggregated level.
4.1 Search space reduction
Due to the huge amount of data, and, consequently, the huge amount of candidate matching pairs,
the usage of search space reduction techniques was necessary. In this section details on the search
space reduction techniques applied to PATSTAT and ASIA will be given. Moreover, a blocking
technique by neighbourhoods of words will be introduced. Some classical blocking techniques
based on the patent year and 2-digit ZIP Code proved to be extremely ineffective; these are not
further detailed here.
A) Reduction of PATSTAT
After the removal of duplicated records, the number of records in PATSTAT equals 23833. It should
be recalled that records showing exactly the same values for Standardized Name, Zip Code and
Legal Form were considered duplicated records.
Since, in this phase, our goal is to link PATSTAT enterprises to ASIA enterprises, PATSTAT was
reduced in order to contain only units probably representing enterprises. Unfortunately, given its
meaning, the variable Legal Form does not provide a perfect discrimination between enterprise and
non-enterprise units. Thus a list of Italian First Names, containing about 1600 units, was used. From the
PATSTAT database, we removed those records whose Standardized Name satisfies simultaneously
the following conditions:
1. it contains an Italian First Name
2. it has an empty Legal Form
3. it does not contain any of several special words indicating a business activity (e.g. enterprise,
group, systems, hotel, holding, etc.). These special words were found by a
manual inspection of those records satisfying only the first two conditions. About 63 such
special words were found.
This procedure does not offer a 100% discrimination between enterprise and non-enterprise units.
For example, Standardized Names containing a non-Italian First Name or an extremely rare
(almost unique) Italian First Name, e.g. Karl Dietriech, Jean-Pierre or Odoardo,
would not be correctly classified. However, since it is very difficult to
discover these situations automatically, the PATSTAT reduction was not further improved.
Some very simple record linkage technique would probably help in finding some typing errors, e.g.
Eduardo instead of Odoardo.
Based on the above separation procedure, PATSTAT was divided into two parts: the first
containing 7700 records considered non-enterprises and the second containing 16132 records
considered enterprises. The record linkage process was applied to the latter.
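The three removal conditions can be expressed as a small Python predicate. The sketch below is illustrative only: the two word sets are tiny hypothetical samples (the report uses about 1600 first names and about 63 special words), and the function name is an assumption.

```python
# Hypothetical samples; the real lists are much longer.
FIRST_NAMES = {"MARIO", "GIOVANNI", "DANIELA"}
BUSINESS_WORDS = {"ENTERPRISE", "GROUP", "SYSTEMS", "HOTEL", "HOLDING"}


def is_probable_individual(standardised_name, legal_form):
    """True when all three removal conditions hold for a PATSTAT record."""
    words = set(standardised_name.split())
    contains_first_name = bool(words & FIRST_NAMES)       # condition 1
    empty_legal_form = legal_form == "NONE"               # condition 2
    no_business_word = not (words & BUSINESS_WORDS)       # condition 3
    return contains_first_name and empty_legal_form and no_business_word
```

A name such as ODOARDO ROSSI would not be flagged with this sample list, which mirrors the limitation on rare first names noted above.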
B) Reduction of ASIA
Obviously, a large number of enterprises are active in consecutive years. This means
that the same enterprise, if active in consecutive years, is registered in consecutive versions
of ASIA. The same reasoning holds for non-consecutive years, too. In order to reduce the
search space, the 11 versions of ASIA (1998 - 2008) were prepared in such a way that an active
enterprise is included only once in their union. ASIA 2008 was considered the most
complete and updated version. Then, from ASIA 2007, enterprises that were also active in 2008
were deleted, since they are already included in ASIA 2008. Next, from ASIA 2006, enterprises that
were active in 2007 and/or 2008 were removed, and so on backwards until ASIA 1998. The
Enterprise Identification Number was used to perform this recursive selection of records. The final
number of ASIA records we had to deal with is shown in table 3. For each year, the percentage
(third column) was computed over the total number of unique enterprises (union of ASIA 1998 –
ASIA 2008).
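The backward selection just described can be sketched as follows, assuming each ASIA version is available as a collection of Enterprise Identification Numbers keyed by year. The function name and data layout are hypothetical.

```python
def reduce_versions(versions):
    """Keep each enterprise only in the most recent year in which it is active.

    versions: dict mapping year -> iterable of enterprise identification numbers.
    Returns a dict mapping year -> set of identifiers retained for that year.
    """
    seen = set()
    kept = {}
    for year in sorted(versions, reverse=True):  # start from the latest version
        ids = set(versions[year])
        kept[year] = ids - seen  # drop enterprises already found in later years
        seen |= ids
    return kept
```

By construction the retained sets are disjoint and their union equals the union of all versions, which is why the yearly counts in table 3 add up to the total number of unique enterprises.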
YEAR     Thousands of remaining    Percentage of remaining
         active enterprises        active enterprises
1998     292                       3.76
1999     170                       2.19
2000     484                       6.24
2001     127                       1.64
2002     336                       4.34
2003     321                       4.14
2004     322                       4.15
2005     347                       4.47
2006     367                       4.73
2007     417                       5.37
2008     4577                      58.99
Total    7760                      100
Table 3. Number of active enterprises in the search space, by year.
In the union of the different waves of ASIA, except for 885 records (out of more than 7 million), the
ZIP Code is always registered with 5 digits.
In table 4, the percentage of enterprises in several ASIA datasets by Legal Form and year is shown.
Only information stemming from even years was used to derive table 4. First, the high percentage
of enterprises without a “Legal form” may be observed; such enterprises are probably
individual enterprises. Second, the Legal form distribution appears quite stable over time.
Finally, these ASIA distributions seem quite different from the PATSTAT
distribution shown in table 2.
Legal form \ YEAR    1998     2000     2002     2004     2006
NONE                 71.03    79.17    73.68    72.94    72.40
COOP                  1.07     0.48     0.77     0.86     1.09
SAS                   6.92     5.65     6.74     6.59     6.38
SNC                   8.09     7.26     7.82     7.43     6.66
SPA                   0.58     0.46     0.48     0.43     0.37
SRL                  12.31     6.98    10.51    11.76    13.10
Table 4. ASIA: percentage of enterprises by Legal Form and year.
Finally, it should be mentioned that, due to the huge computational burden, ASIA 2008 was divided
into three parts: a) enterprises with more than 10 employees, b) enterprises with 1-9 employees and
non-empty Legal Form, and c) enterprises with less than 1 employee and non-empty Legal Form.
A subpopulation receiving special attention is represented by the R & D enterprises. Indeed, it was
assumed that patenting enterprises have an increased probability of performing research and
development activities. Consequently, the first record linkage procedures were applied using the
2006 and 2007 R & D survey frames. Since a significant number of patenting enterprises were
linked to enterprises in the R & D survey frames, this approach also allowed the number of
patenting enterprises still to be linked to be reduced.
A similar reasoning was applied to ASIA 2008. It was assumed that the largest enterprises, in
terms of number of employees, have an increased probability of performing patenting activities.
This assumption is supported by the complexity of the technical, legal and administrative procedure
an applicant has to follow in order to be granted a patent. Consequently, enterprises with more
than 10 employees in ASIA 2008 were considered in the second record linkage step.
C) Blocking by neighbourhood
Due to the huge number of records we had to deal with, some blocking
technique was still necessary to reduce the search space. As previously discussed, PATSTAT and ASIA
share only three variables: Standardized Name, Zip Code and Legal Form. Unfortunately, none of these variables is
reliable enough to be used as a blocking variable. The idea of a neighbourhood of words was therefore
introduced. For a pair of records, it was assumed that a necessary matching condition is that their
Standardized Names share at least one word. Here “word” means a sequence of characters
not including a blank. In other words, it was assumed that at least one word is registered
correctly. Then, for each record in PATSTAT, the list of words forming its Standardized Name was
found. These words are illustrated by the coloured (non-gray) symbols in the main horizontal row of
Figure 3. Next, for each such word, the list of enterprises in ASIA containing that word was identified
(the vertical columns in Figure 3). The union of these lists of enterprises was named the Neighbourhood
of the Standardized Name under consideration. If an exact match on Standardized Name exists, it
belongs to this Neighbourhood; more precisely, it belongs to the intersection of the lists of enterprises
forming the Neighbourhood (see the rows with coloured, non-gray symbols in every column). In
such cases, a merge operation would be equivalent. It might also happen that an exact match on
Standardized Name does not exist. In such cases, a rule-based deterministic record linkage should
be applied, as in the situation depicted in the fifth row of the third column. Finally, for each
record in PATSTAT, the record linkage procedure was applied using only its Neighbourhood, i.e.
the Neighbourhood was used as blocking variable.
Figure 3. The PATSTAT-ASIA neighbourhoods.
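One way to build such Neighbourhoods is through an inverted index from words to ASIA enterprises. The Python sketch below illustrates the idea; the function names and the toy data layout (a dictionary from Enterprise Identification Number to Standardized Name) are assumptions and do not describe the actual implementation.

```python
from collections import defaultdict


def build_word_index(asia_names):
    """Map each word to the set of ASIA enterprises whose name contains it.

    asia_names: dict mapping enterprise identification number -> Standardized Name.
    """
    index = defaultdict(set)
    for enterprise_id, name in asia_names.items():
        for word in name.split():
            index[word].add(enterprise_id)
    return index


def neighbourhood(patstat_name, index):
    """Union of the enterprise lists of all words in the PATSTAT name."""
    result = set()
    for word in patstat_name.split():
        result |= index.get(word, set())
    return result


def shares_all_words(patstat_name, index):
    """Intersection of the lists: enterprises containing every word of the name.

    Any exact match on Standardized Name necessarily lies in this set.
    """
    lists = [index.get(word, set()) for word in patstat_name.split()]
    return set.intersection(*lists) if lists else set()
```

The index is built once per ASIA dataset, after which each PATSTAT record only needs a few dictionary lookups. A name none of whose words appears in the index gets an empty Neighbourhood.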
Several considerations hold. First, blocking by Neighbourhood allows us to divide the enormous
search space into a huge number of much smaller search spaces. Obviously, the number of search
spaces equals the number of records in PATSTAT, and Relais can deal with many search spaces
in an automatic manner. Second, each search space has a reduced dimension. In Table 5, some
statistics on the dimension of such search spaces are shown. It may be observed that the maximum
dimension of a search space equals 15570, a very reasonable dimension to deal with in record
linkage problems. Third, it should be mentioned that, by construction, each Neighbourhood
contains at most one correct link. For this reason, and because of the dependency between
Neighbourhood and Standardized Name, this blocking procedure, as defined here, cannot be
used in a probabilistic record linkage procedure based on the Standardized Name as comparison
variable (because blocks and comparison variables are not independent). Moreover, it might be
difficult or ineffective to apply the Neighbourhood blocking procedure when units are individuals
(natural persons), because the variability of names of natural persons is much smaller than the
variability shown by names of enterprises. Hence, when dealing with natural persons,
Neighbourhoods might themselves contain a huge number of records, so that no real reduction of the
search space would be obtained.
                # of ASIA enterprises    # of Neighborhoods containing
                in a Neighborhood        the same ASIA enterprise
MIN             1                        1
1st QUARTILE    5                        1
MEDIAN          77                       3
MEAN            760                      8
3rd QUARTILE    837                      10
MAX             15570                    124
Table 5. ASIA 2008: statistics on the dimension of the Neighbourhoods.
Of course, very short Standardized Names (1-2 letters) or very common words
(e.g. ITALIA, GROUP, etc.) could generate huge neighbourhoods. PATSTAT records containing
only very short words, i.e. with the length of the longest word equal to 1 or 2, were excluded from the
Neighbourhood creation phase. Such PATSTAT records were searched for by a simple merging
procedure. Of the 649 such PATSTAT records, 169 were identified by a complete search in ASIA. As for
the very common words, it was considered that no reliable record linkage procedure could be
performed only on the basis of such words; the reasoning is similar to the one applied to the names
of natural persons.
Moreover, some Standardized Names might have an empty Neighbourhood. This is
generally the case for Standardized Names consisting of a single word. If such words are registered
differently in PATSTAT and ASIA, the corresponding Neighbourhood is empty because of the way it
is derived (at least one word has to be registered in exactly the same manner in both databases). In table 6,
the number and percentage (over the 16132 PATSTAT records considered enterprises) of records
with empty Neighbourhood are shown. Even if, for each database, the
number of records with empty Neighbourhood is not so small, it was observed that only 836 records
were classified as “WITHOUT Neighbourhood” throughout the entire search space creation flow.
ASIA                                          # records without    percentage of records
                                              neighbourhood        without neighbourhood
1998                                          1870                 11.6
1999                                          1983                 12.3
2000                                          1794                 11.1
2001                                          2053                 12.7
2002                                          1888                 11.7
2003                                          1793                 11.1
2004                                          1832                 11.4
2005                                          1818                 11.3
2006                                          1812                 11.2
2007                                          1734                 10.7
2008 less than 1 employee                     2481                 15.4
2008 more than 10 employees                   2114                 13.1
R & D 2006                                    4144                 25.7
R & D 2007                                    3664                 22.7
2008 more than 1 employee with Legal Form     5046                 31.3
Table 6. Number and percentage of records without Neighbourhood.
The 836 Standardized Names with empty Neighbourhood generally have 1 or 2 words, as illustrated in
Figure 4. Indeed, the number of words in Standardized Names with empty Neighbourhood has a
median equal to 1, while the number of words in all Standardized Names has a median equal to 2.
Of course, neighbourhoods could also be defined by an approximate matching of at least one word
(e.g. using a similarity distance instead of exact equality). This idea will be the subject of
further implementations.
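As a rough illustration of this idea, the sketch below replaces exact word equality with a word-level similarity threshold. It is only an assumption of how such a variant might look: the similarity is the ratio provided by Python's standard difflib module (not one of the comparators used in this work), the 0.8 threshold is borrowed from the deterministic rule purely for the example, and the names are hypothetical. The pairwise scan is quadratic, so an actual implementation on ASIA would need some form of indexing.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    """True when two words are close according to difflib's similarity ratio."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def approx_neighbourhood(applicant_name, asia_names, threshold=0.8):
    """Ids of enterprises having at least one word similar to an applicant word."""
    app_words = applicant_name.split()
    result = set()
    for ent_id, name in asia_names.items():
        ent_words = name.split()
        if any(similar(w, v, threshold) for w in app_words for v in ent_words):
            result.add(ent_id)
    return result

# Hypothetical standardized names, for illustration only.
asia_example = {"E1": "ROSSI MECCANICA", "E3": "VERDI PLAST"}

# "MECANICA" has no exact word in common with any name above, so its exact
# Neighbourhood would be empty; the approximate variant still retrieves E1.
print(approx_neighbourhood("MECANICA", asia_example))  # {'E1'}
```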
4.2 Deterministic record linkage
Even if the Neighbourhood was used as a blocking variable, similarity criteria were still
necessary. Indeed, the Neighbourhood contains all ASIA records having at least one exact word in
common with the studied PATSTAT record. A similarity criterion between Standardized Names was
used to give an overall measure of the similarity of the records.
Figure 4 Number of words in Standardized Names corresponding to empty Neighbourhoods.
A deterministic rule was used in this work. It is a compound one, stating that at least one of the
following string comparators is greater than 0.8:
1. Jaro
2. Levenshtein
3. Jaro-Winkler
4. Dice
5. 3-Grams
6. equality rule (only in this case the threshold was equal to 1)
Details on the implementation of these comparison functions may be found in the Relais manual
(Istat, 2011). Other thresholds, different from 0.8, were tested, but 0.8 proved to be the most
efficient.
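The compound rule listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Relais code: the comparators are textbook versions (Levenshtein similarity normalised by the longer string, Dice on character bigrams, the same Dice coefficient on 3-grams, Jaro-Winkler with the usual 0.1 prefix scale), and the exact definitions used by Relais may differ.

```python
def levenshtein_sim(a, b):
    """1 - (edit distance / length of the longer string)."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))

def qgram_dice(a, b, q):
    """Dice coefficient on the sets of q-grams of the two strings."""
    ga = {a[i:i + q] for i in range(len(a) - q + 1)}
    gb = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not ga or not gb:
        return 1.0 if a == b else 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def jaro(a, b):
    """Jaro similarity: matches within a sliding window, minus transpositions."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit, b_hit = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(a, b, p=0.1):
    """Jaro similarity boosted by the length of the common prefix (max 4)."""
    j = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

def is_candidate_match(a, b, threshold=0.8):
    """Compound rule: equality, or at least one comparator above the threshold."""
    scores = (jaro(a, b), levenshtein_sim(a, b), jaro_winkler(a, b),
              qgram_dice(a, b, 2), qgram_dice(a, b, 3))
    return a == b or any(s > threshold for s in scores)

# Hypothetical standardized names, for illustration only.
print(is_candidate_match("ROSSI MECCANICA", "ROSSI MECANICA"))  # True
print(is_candidate_match("ROSSI MECCANICA", "VERDI PLAST"))     # False
```

Because the rule is a disjunction, a pair is retained as soon as any single comparator exceeds 0.8, which is why the retained pairs still go through a clerical review.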
The selection of the unique links was also performed using Relais, by means of a greedy solution
already implemented in the software. In this work, equal weights for all rules were always used.
Finally, the pairs declared matches were subject to a clerical review.
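A greedy selection of unique links can be sketched as below. The sketch assumes that each candidate pair carries a single score and that pairs are accepted in decreasing score order as long as neither record is already linked; the procedure implemented in Relais may differ in its details, and the identifiers and scores are hypothetical.

```python
def greedy_unique_links(scored_pairs):
    """Select one-to-one links from (applicant_id, enterprise_id, score) triples.

    Pairs are visited from the highest to the lowest score; a pair is kept only
    if neither its applicant nor its enterprise has already been linked.
    """
    used_applicants, used_enterprises, links = set(), set(), []
    for app, ent, score in sorted(scored_pairs, key=lambda p: p[2], reverse=True):
        if app not in used_applicants and ent not in used_enterprises:
            links.append((app, ent, score))
            used_applicants.add(app)
            used_enterprises.add(ent)
    return links

# Hypothetical candidate pairs, for illustration only.
pairs = [
    ("A1", "E1", 0.95),
    ("A1", "E2", 0.90),
    ("A2", "E1", 0.93),
    ("A2", "E3", 0.85),
]
print(greedy_unique_links(pairs))
# [('A1', 'E1', 0.95), ('A2', 'E3', 0.85)]
```

A greedy choice is fast but not guaranteed to maximise the total score over all one-to-one assignments, which is one more reason for the clerical review mentioned above.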
To conclude this section, we summarize the record linkage procedure. Only the blocking procedure
and the selection of databases were varied; the deterministic rule used to compare Standardized
Names and its threshold were kept constant. In a first phase, blocking by Neighbourhood, Zip Code
and Legal Form was used. In a second phase, only blocking by Neighbourhood and Legal Form was
used. The matching pairs were always subject to a clerical review. The records in PATSTAT were
linked against the following databases:
1. 2006 and 2007 R & D survey frames
2. ASIA 2008 with more than 10 employees
Then an update of the PATSTAT database was performed.
3. ASIA 2008 with more than 1 employee
4. ASIA 1998 – ASIA 2007
Then an update of the PATSTAT database was performed.
5. ASIA 2008 with less than 1 employee
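The two blocking phases summarized above differ only in whether the Zip Code is required to agree. A minimal sketch of this filtering step, applied to the candidates already found in a Neighbourhood, is given below; the record layout (dictionaries with `legal_form` and `zip_code` keys) and the example values are assumptions made for illustration.

```python
def block_candidates(applicant, candidates, use_zip_code):
    """Keep the candidates that agree with the applicant on the blocking variables.

    Legal Form must always agree; Zip Code is required only when use_zip_code
    is True (first phase) and ignored otherwise (second phase).
    """
    kept = []
    for cand in candidates:
        if cand["legal_form"] != applicant["legal_form"]:
            continue
        if use_zip_code and cand["zip_code"] != applicant["zip_code"]:
            continue
        kept.append(cand)
    return kept

# Hypothetical records, for illustration only.
applicant = {"name": "ROSSI MECANICA", "legal_form": "SRL", "zip_code": "00184"}
neighbours = [
    {"id": "E1", "name": "ROSSI MECCANICA", "legal_form": "SRL", "zip_code": "00185"},
    {"id": "E2", "name": "BIANCHI MECCANICA", "legal_form": "SPA", "zip_code": "00184"},
]

phase1 = block_candidates(applicant, neighbours, use_zip_code=True)
phase2 = block_candidates(applicant, neighbours, use_zip_code=False)
print([c["id"] for c in phase1])  # []
print([c["id"] for c in phase2])  # ['E1']
```

In this example the enterprise has a different postal code from the one recorded for the applicant, so it is only reachable in the second, relaxed phase.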
5. Preliminary Results
At this stage, the number of “correct” links found is 12510 out of 16132 (applicant names
potentially referring to individuals have been stored for later analysis), i.e. 78%. By “correct”
link we mean a (non-duplicated) pair (Applicant Identification Code - Enterprise Code) stemming
from one of the linkage steps performed during the project: in each step, the links found were
classified as “correct” (true according to the available information), “maybe” (possibly subject to
a more detailed and sophisticated clerical review) or “false” (discarded), and stored after removing
duplicates. Even if the pairs Applicant Identification Code - Enterprise Code are non-duplicated,
some of them may represent duplication of applicants (more than one Applicant Identification Code
may be linked to the same Enterprise Code) or of enterprises (the same Applicant Identification
Code may be linked to more than one Enterprise Code). The first case may happen when a
multi-patenting applicant has been registered with different names in different applications; the
standardisation process does not compensate for these differences. Consequently, the same
applicant might be assigned to more than one enterprise. On the other hand, as different steps of
the linkage procedure were performed on the same applicants dataset (identified applicants were
removed from the file only twice), several applicants linked “correctly” to the same enterprise code
cannot be excluded. In any case, the impact on the total number of links seems to be limited to a
few cases: the number of unique enterprise codes at this stage is 12488 out of 12510.
Moreover, in order to assess the quality of the results, a small experiment was conducted on a set
of 190 codes randomly selected from the Espacenet web database (the “application number” field
was used to download patent information from the EPA web-site; see footnote 2). We found 5
mismatches out of 190 records (2.6%). This means that, even if the available standardised
information coincides in the two sources, a 100% exact link cannot be guaranteed because of very
similar (or common) names. Other possible sources of misclassification that should be taken into
account when checking the quality of the linkage process are the following: enterprises belonging
to the same enterprise group often register their patents with similar names, and enterprises
undergo changes during their life (changes of address, legal form, etc.).
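The link rate and the number of links sharing an Enterprise Code follow directly from the counts quoted above; the short Python check below recomputes them (variable names are illustrative).

```python
# Counts quoted in the text above.
correct_links = 12510        # pairs classified as "correct"
establishments = 16132       # applicants considered for linkage
unique_enterprises = 12488   # distinct Enterprise Codes among the correct links

link_rate = 100 * correct_links / establishments
shared_code_links = correct_links - unique_enterprises

print(f"link rate: {link_rate:.1f}%")                          # link rate: 77.5%
print(f"links beyond the first per enterprise: {shared_code_links}")  # 22
```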
In Table 7 the patenting enterprises (corresponding to the Enterprise Codes found in ASIA) are
reported by size, i.e. by classes of employees. Frequencies are shown for two subsequent phases; in
the second one the criterion adopted in searching for a link in the ‘neighbourhood’ was relaxed by
removing the postal code from the set of blocking variables. As expected, more than half of the
population of patenting enterprises have a size greater than or equal to ten employees (the
highest class considered).
Classes of     First phase                     Second phase
Employees      Freq (Cum Freq)   % (Cum %)     Freq (Cum Freq)   % (Cum %)
1               793               8.1           2345             18.8
(1-10)         1995 (2788)       20.4 (28.5)    2334 (4679)      18.7 (37.5)
[10, +∞)       6985              71.5           7809             62.5
Total          9773                            12488
Table 7. Patenting enterprises by size (classes of employees)
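The percentages and cumulative figures of Table 7 can be recomputed from the class frequencies alone. The following Python sketch does so; the function name is illustrative and the frequencies are those reported in the table, listed from the smallest to the largest size class.

```python
from itertools import accumulate

def size_distribution(freqs):
    """Return total, cumulative frequencies, percentages and cumulative percentages."""
    total = sum(freqs)
    cum = list(accumulate(freqs))
    pct = [round(100 * f / total, 1) for f in freqs]
    cum_pct = [round(100 * c / total, 1) for c in cum]
    return total, cum, pct, cum_pct

first_phase = [793, 1995, 6985]     # frequencies by size class, first phase
second_phase = [2345, 2334, 7809]   # frequencies by size class, second phase

print(size_distribution(first_phase))
# (9773, [793, 2788, 9773], [8.1, 20.4, 71.5], [8.1, 28.5, 100.0])
print(size_distribution(second_phase))
# (12488, [2345, 4679, 12488], [18.8, 18.7, 62.5], [18.8, 37.5, 100.0])
```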
In Table 8 the frequency distribution of the patenting enterprises by division of economic activity
(2-digit NACE code) is reported in descending order of frequency. Only the ten most frequent
NACE divisions are shown, out of the 45 assigned to the whole set of matched pairs. These ‘most
important’ divisions cover more than 65% of the total and mostly belong to the Manufacturing
sector.
Footnote 2: Ten application numbers can be downloaded per trial, up to a maximum of 200
application numbers. Moreover, the information provided needs to be processed before use.
NACE  NACE Description                                                          Frequency     %  Cumulative Frequency  Cumulative %
28    Manufacture of machinery and equipment n.e.c.                                  2186  22.4                  2186          22.4
25    Manufacture of fabricated metal products, except machinery and equipment        885   9.1                  3071          31.5
46    Wholesale trade, except of motor vehicles and motorcycles                       695   7.1                  3766          38.6
22    Manufacture of rubber and plastic products                                      601   6.2                  4367          44.8
27    Manufacture of electrical equipment                                             461   4.7                  4828          49.5
26    Manufacture of computer, electronic and optical products                        456   4.7                  5284          54.2
20    Manufacture of chemicals and chemical products                                  324   3.3                  5608          57.5
68    Real estate activities                                                          282   2.9                  5890          60.4
29    Manufacture of motor vehicles, trailers and semitrailers                        280   2.9                  6170          63.3
32    Other manufacturing                                                             251   2.6                  6421          65.9
Table 8. Patenting enterprises (active in 2008) by economic activity (2 digit NACE 2007): the ten most frequent
NACE’s division
6. Conclusions and future plans
This report described the path followed at the Italian national statistical institute (Istat) in
designing a linkage strategy to match micro-data on patent applications from the international
database PATSTAT with the data available from the Italian Official Business Register (ASIA). The
overall aim of this project is to identify the Italian patenting enterprises and characterise them
through the economic information surveyed by Istat. This might allow, for example, investigating
which factors influence the patenting propensity of enterprises and/or whether patenting activity
has an impact on economic performance. On the other hand, monitoring such a subpopulation
could be useful in maintaining the frame list for surveys related to the Research and
Development area, such as the biotechnology sector.
In PATSTAT, the applicants resident in Italy and registering at least one patent in the period
1985-2010 have been considered. Patent applicants can be ‘individuals’ or ‘establishments’. At this
stage, the linkage process aimed at identifying, among establishments, the business enterprises
recorded in ASIA in the period 1989-2008. The desired output of the linkage process assigns to
each patenting enterprise the ‘applicant identification code in PATSTAT’ and the ‘enterprise
identification number in ASIA’. The latter allows access to the repositories of official
statistical data and, therefore, the linking of economic data to patenting enterprises.
The overlapping information between the two archives that is reliable enough to be used as
matching variables in the linkage process consists mainly of the ‘applicant names’ and the ‘postal
code’. Moreover, the size of the business register ASIA, in terms of number of records, poses a
computational problem. Therefore, a great effort has been put into the pre-processing phase to
standardise the applicant/enterprise names, and some ‘search space’ reduction techniques have been
adopted. Among the latter, the ‘blocking by neighbourhood’ introduced in this work has proved
particularly effective. Assuming that, for a given patenting enterprise, at least one word in the
‘applicant name’ (in PATSTAT) and the ‘enterprise name’ (in ASIA) is correctly registered in both
archives, the ‘neighbourhood’ of an applicant name is defined as the set of enterprises whose
name contains at least one word equal to (written in the same way as) a word in the applicant
name. The correct link for the given applicant has then been searched for in its neighbourhood.
At this stage of the study, the share of around 77% of enterprises identified as patenting (12488
out of 16132 applicants identified as ‘establishments’) can be considered a promising result, but
further investigations are planned to improve it. The next step will be to define the
‘neighbourhood’ on the basis of similarity between words instead of equality, in order to handle
possible typing errors in the names. Moreover, the set of names without a neighbourhood has to be
investigated.
As a further development, it would be desirable to classify the whole set of patenting
establishments as business enterprises, public institutions, non-profit institutions, and private or
public universities, according to the Frascati Manual (OECD, 2002). A subset of applicants
identified as ‘individuals’ (with no legal form) has not been investigated so far, on the assumption
that the probability of finding a correct link to an enterprise is low. For this kind of applicant it
is planned to use different archives (such as the List of enterprise managers or the List of company
partners) or information (e.g. checking by address).
Finally, a probabilistic approach to record linkage can be considered. In order to reduce the
computational size of the problem, the R&D survey frame can be used as a test set, given that it
proved to define a subpopulation with a high concentration of patenting enterprises.
References
OECD (2002). Frascati Manual 2002: Proposed Standard Practice for Surveys on Research and
Experimental Development, Paris 2002.
Istat (2003). Metodi statistici per il record linkage, Metodi e Norme n. 16, edited by Mauro
Scanu.
Istat (2011). RELAIS - Record linkage at Istat, software and User’s guide available at:
http://www.istat.it/strumenti/metodi/software/analisi_dati/relais/