APE-INV's "NAME GAME" ALGORITHM CHALLENGE: A GUIDELINE FOR BENCHMARK DATA ANALYSIS & REPORTING
VERSION 1.2, July 2010 – Last update: 27/07/2010

Francesco Lissoni (APE-INV Chair; [email protected])
Andrea Maurino ([email protected])
Michele Pezzoni (APE-INV External Coordinator; [email protected]) – contact author
Gianluca Tarasconi ([email protected])
DIMI – Università di Brescia; KITES – Università Bocconi, Milano; DISCO – Università Milano-Bicocca

Abstract
APE-INV is a project funded by the European Science Foundation that aims at identifying academic inventors through a reclassification by inventor of patents from PatStat, the EPO Worldwide Patent Statistical Database. Such a reclassification effort requires inventors' names, surnames, and addresses to be parsed, matched, and filtered, in order to identify synonyms (that is, names+surnames or addresses which are the same, although spelled differently) and to disambiguate homonyms (that is, to verify whether two inventors with the same name and surname are indeed the same person). Several algorithms have been produced in the recent past, either with reference to data from PatStat or from national patent offices. One of the objectives of the APE-INV project is to compare the accuracy and efficiency of such algorithms, and to involve as many researchers as possible in a collective research effort aimed at producing a shared database of inventors' names, surnames, and addresses, linked to PatStat. In order to achieve this objective APE-INV produces a number of PatStat-based benchmark databases, and invites all interested parties to test their algorithms against them. The present document (to be updated periodically) describes such benchmark databases and their rules of access, and provides guidelines on how to conduct the tests and how to report their results, in order to ensure comparability. Information is also provided on the workshops that will be organized in order to allow a discussion of the results.

OUTLINE
1. INTRODUCTION
2. A VERY SHORT INTRODUCTION TO PATSTAT
3. THE 'NAME GAME' ALGORITHM CHALLENGE AND THE ROLE OF BENCHMARK DATABASES
4. CONTENTS AND STRUCTURE OF THE BENCHMARK DATABASES
5. REPORTING ON THE EFFICIENCY OF ALGORITHMS AND USE OF BENCHMARK DATABASES
6. AVAILABLE AND PLANNED BENCHMARK DATABASES
7. CONCLUSIONS: HOW TO JOIN THE ALGORITHM CHALLENGE
REFERENCES
APPENDIX A – IDENTIFICATION AND DISAMBIGUATION OF INVENTORS: A SHORT SURVEY
APPENDIX B – A NOTE ON USPTO DATA IN PATSTAT
APPENDIX C – "NAME GAME" WORKSHOPS: A CALENDAR

1. INTRODUCTION
APE-INV is a project funded by the European Science Foundation which aims at measuring the extent of academic patenting in Europe, and at studying its determinants, in order to improve our understanding of university-industry relationships (for details: http://www.academicpatenting.eu). APE-INV is chaired by KITES-Università Bocconi, which is also in charge of maintaining the related databases. APE-INV builds its activities on a historical and institutional premise, namely that most European universities have long been prevented from getting involved in IPR management, or have themselves resisted such involvement, for legal, administrative, or cultural reasons. As a consequence, European universities often do not appear as applicants on patents taken on their own scientists' inventions.
It is only by re-classifying patents by inventor, and by discovering whether such inventors belong to the academic research system, that it becomes possible to measure the number and importance of the inventions produced by academia. To this end, APE-INV promotes any effort to reclassify patents by inventor. In particular, it supports efforts to reclassify all patent applications to the European and US Patent Offices (respectively, EPO and USPTO applications) as listed in the EPO Worldwide Patent Statistical Database, better known as "PatStat". A very important part of the reclassification-by-inventor effort will consist in parsing, matching, and filtering1 the inventors' names as reported on the original patent application documents. APE-INV promotes collective participation in this effort by inviting all interested researchers to:
- produce their own algorithms for cleaning, matching, and filtering inventors' names;
- test such algorithms against one or more common benchmark databases;
- report the results of their tests in such a way that lessons can be learned, and possibly a common algorithm may be produced.
In what follows, technical information is provided on the type of data used for benchmarking, on the contents of the first benchmark databases produced so far, and on the information to be reported to APE-INV about algorithm effectiveness, as measured by application to the benchmark databases.

2. A VERY SHORT INTRODUCTION TO PATSTAT
PatStat is produced by EPO, the European Patent Office, and contains over 70 million records. It is updated every six months (for details: http://www.epo.org/patents/patent-information/raw-data/test/product-14-24.html). Records consist of patent applications and granted patents from a number of offices. APE-INV is interested in EPO and USPTO patent applications. At this stage of the project, however, only work on EPO data has been conducted, so all the following discussion refers to the contents and characteristics of EPO data, unless otherwise specified.
Patent documents and the information therein are identified by a number of elements which contain text or codes derived from the original legal documents. All elements related to a specific patent document remain the same across different PatStat editions, as long as the document is present in all such editions. In addition, PatStat provides a number of "surrogate keys" which summarize information and help identify relevant documents (or information within documents and common to several documents, such as inventors' names). These surrogate keys are specific to each edition of PatStat, so they cannot be compared across different editions, the design principle of PatStat being that each new edition is a stand-alone database, completely refreshed.2 This means that users cannot easily update databases built upon one edition of PatStat by simply looking for additional records in the latest edition; the assistance of a programmer is needed. It also means that, when building a benchmark database for the purposes of APE-INV's name game, we have to refer to one specific edition of PatStat, because the surrogate keys included in the benchmark database are edition-specific.
1 The terminology "parsing → matching → filtering" used in this document to describe the necessary steps leading to the identification of inventors derives from Raffo and Lhuillery (2009). We come back to it in section 3.
Patent documents are identified by a combination of unique elements, which contain codes attributed to them by the examiners. For the purposes of the APE-INV "Name Game" the most relevant elements are:
PUBLN_NR (Publication number): the number given by the Patent Authority issuing the publication of the application. The number is stored in PatStat as a 15-character string, with leading spaces.
PUBLN_AUTH (Publication Authority, aka Publishing Office): a code indicating the Patent Authority that issued the publication of the application: EP indicates EPO, US indicates the US Patent and Trademark Office (USPTO), and so forth.
Any combination of PUBLN_AUTH and PUBLN_NR identifies uniquely a patent application. For example, PUBLN_AUTH=EP and PUBLN_NR=10000 identifies patent application nr 10000 at EPO, while PUBLN_AUTH=US and PUBLN_NR=10000 identifies patent application nr 10000 at the US Patent and Trademark Office (they are entirely different patents to which the two offices have – by chance – given the same publication number).
After being numbered by the relevant Patent Authority, each patent application undergoes a number of processing steps (such as examination, granting, opposition, etc.), each of which produces a separate document, also included in PatStat as soon as it is made available by the relevant authority. All documents related to the same application share the same PUBLN_AUTH and PUBLN_NR and are differentiated by an additional field, PUBLN_KIND, which contains 1- or 2-character codes that specify the nature of the document. Contents of PUBLN_KIND are specific to each Publication Authority, because they reflect country-specific legal procedures. In the case of EPO, the most common code is A1, which refers to the first document published by EPO in relation to any patent application, inclusive of the "search report" performed by EPO on the existing prior art (if no A1 can be found, then an A2 exists, which also refers to the patent application, when this does not include a search report).3
2 The full list of these surrogate keys is: APPLN_ID; INTERNAT_APPLN_ID; PRIOR_APPLN_ID; TECH_REL_APPLN_ID; PERSON_ID; DOC_STD_NAME_ID; PAT_PUBLN_ID; CITN_ID; NPL_PUBLN_ID; CITED_PAT_PUBLN_ID; PARENT_APPLN_ID; DOCDB_FAMILY_ID; INPADOC_FAMILY_ID.
3 The full list of codes which can be found in PUBLN_KIND for EPO patents is:
A1 APPLICATION PUBLISHED WITH SEARCH REPORT
A2 APPLICATION PUBLISHED WITHOUT SEARCH REPORT
A3 SEARCH REPORT
A4 SUPPLEMENTARY SEARCH REPORT
A8 MODIFIED FIRST PAGE
A9 MODIFIED COMPLETE SPECIFICATION
B1 PATENT SPECIFICATION
B2 NEW PATENT SPECIFICATION
B3 AFTER LIMITATION PROCEDURE
B8 MODIFIED FIRST PAGE GRANTED PATENT
B9 CORRECTED COMPLETE GRANTED PATENT
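To make the use of these kind codes concrete, the sketch below (in Python/pandas) selects, for each EP application, the A1 document when it exists and the A2 document otherwise. It is only an illustration under stated assumptions: the input file name is hypothetical, the column names simply follow the PatStat elements discussed above, and the snippet is not part of the official benchmark tools.

# Minimal sketch: for each EP application (PUBLN_AUTH + PUBLN_NR), keep the first
# published document, preferring kind A1 (with search report) over A2 (without).
# File name and export format are hypothetical; column names follow PatStat.
import pandas as pd

pubs = pd.read_csv("patstat_publications.csv", dtype=str)  # PUBLN_AUTH, PUBLN_NR, PUBLN_KIND, PUBLN_DATE

ep = pubs[pubs["PUBLN_AUTH"] == "EP"].copy()
ep = ep[ep["PUBLN_KIND"].isin(["A1", "A2"])]

# Rank A1 before A2, then keep one row per application.
ep["kind_rank"] = ep["PUBLN_KIND"].map({"A1": 0, "A2": 1})
first_pub = (ep.sort_values(["PUBLN_AUTH", "PUBLN_NR", "kind_rank"])
               .drop_duplicates(subset=["PUBLN_AUTH", "PUBLN_NR"], keep="first")
               .drop(columns="kind_rank"))

print(first_pub.head())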
Notice that the same combination of PUBLN_AUTH and PUBLN_NR, despite identifying a unique patent, may appear on several PatStat records. This is also due to phenomena of "re-issuing" or "renumbering" of a patent.4 Publication Number (PUBLN_NR) and Publication Authority (PUBLN_AUTH) remain the same from one edition of PatStat to the following ones and can be compared across editions (that is, any patent document which appears in two different editions of PatStat will carry the same PUBLN_NR and PUBLN_AUTH in both editions). The relevant surrogate key for patent documents is PAT_PUBLN_ID, which is unique for any combination of PUBLN_NR, PUBLN_AUTH, and PUBLN_KIND; being a surrogate key, it cannot be compared across editions. The following example shows the many records carrying publication number 1, as issued by the authorities 'AP' and 'AT':

PAT_PUBLN_ID | PUBLN_AUTH | PUBLN_NR | PUBLN_KIND | PUBLN_DATE
70           | 'AP'       | '     1' | 'A'        | '1985-07-03'
6697         | 'AP'       | '     1' | 'U'        | '2002-06-06'
84476        | 'AT'       | '     1' | 'B'        | '9999-12-31'
85183        | 'AT'       | '     1' | 'U2'       | '1994-07-25'
85184        | 'AT'       | '     1' | 'U3'       | '1995-01-25'
771622       | 'AT'       | '     1' | 'T'        | '1980-11-15'

However, for the purposes of building the benchmark databases (which, we remind the reader, are for the time being based only on EPO patents) we collapse all records sharing the same PUBLN_NR and PUBLN_AUTH into one record, and report the information of the various documents corresponding to each unique combination of these two elements separately.
Coming to information on inventors, all persons (both physical and legal) involved in the invention and/or application of a patent are identified by five fields:
PERSON_NAME, which includes all elements of the name as from the application, with no further standardization by PatStat producers (note: small differences such as the number of spaces or commas will cause e.g. "John Smith" and "john smith," to be treated as two separate persons in the PatStat database, even if they have exactly the same address);
PERSON_ADDRESS, which contains all address elements of the person apart from the country (example: street, city, postal code);
PERSON_CTRY_CODE, which indicates the country of residence of the person or business by means of its international code; PatStat actually makes use of several country codes, according to different standards, but we use just the two-letter one (example: 'IT' for Italy);
INVT_SEQ_NR (Sequence Number of Inventor), which indicates the person's place in the list of inventors attached to the patent application; all persons for whom this field takes value zero do not appear as inventors in the application, and must therefore be identified as applicants, not inventors (i.e. in order to be an inventor, a person must have INVT_SEQ_NR>0). One and the same person may be recorded in different places in the source files, for example both as applicant and inventor;5
4 In some jurisdictions, once a patent is issued, the patent holder may request a "reissue" of it to correct mistakes, within a particular period of time following issuance of the original patent.
5 For some applications the inventor and the applicant may be the same person. If an Inventor record has sequence nr = 20 then the Inventor(s) are the same as the Applicant(s).
PERSON_ID, which is a surrogate key (that is, a piece of information created by PatStat producers, and not present in the original legal document) based on PERSON_NAME, PERSON_ADDRESS and PERSON_CTRY_CODE (technically, it is a sequential number unique for each unique combination of these elements). When considering these fields for the creation of PERSON_ID, upper case and lower case are considered equal, so that, for example, Donald Duck is considered to be the same person as DONALD DUCK, and Ducktown Street is considered the same address as DUCKTOWN STREET. Two persons receive the same PERSON_ID only when they can both be fully identified – by name, address, and country. If one of the attributes is missing, no combination is done. This can lead to cases where one can clearly guess that two persons are the same individual, but PatStat does not provide a common PERSON_ID.
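As a purely illustrative sketch (not EPO's actual procedure), the following snippet mimics the exact-combination logic just described: case is folded, but any other difference in name, address, or country produces a different identifier. The input records and field handling are assumptions made only for the example.

# Minimal sketch of the PERSON_ID logic described above: a sequential identifier
# assigned to each distinct (name, address, country) combination, with case folded.
# This only mimics the behaviour; it is not EPO's actual procedure.
import pandas as pd

persons = pd.DataFrame({
    "PERSON_NAME":      ["Donald Duck", "DONALD DUCK", "Donald Duck"],
    "PERSON_ADDRESS":   ["166 Ducktown Street - Disneyworld",
                         "166 Ducktown Street - Disneyworld",
                         "166 Ducktown St. - Disneyworld"],
    "PERSON_CTRY_CODE": ["US", "US", "US"],
})

key = (persons["PERSON_NAME"].str.lower() + "|" +
       persons["PERSON_ADDRESS"].str.lower() + "|" +
       persons["PERSON_CTRY_CODE"].str.lower())

# The first two rows differ only by case and therefore share a PERSON_ID;
# the abbreviated address in the third row gets a new one, exactly as in PatStat.
persons["PERSON_ID"] = pd.factorize(key)[0] + 1
print(persons)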
Besides, and more importantly, PatStat producers are unwilling to make assumptions that any two similar names or addresses may be, in reality, the same (these assumptions are left to data users): any inventor who appears on two different patent documents with a slightly changed name or address (possibly due to typos) will therefore be identified by two different PERSON_IDs. So, for example, "Donald Duck, 166 Ducktown Street – Disneyworld, US" and "Donald Duck, 166 Ducktown St. – Disneyworld, US" are not given the same PERSON_ID. Notice finally that for some EPO patents the inventors' personal data have been withheld at their request. In these cases the text 'data withheld' is substituted both for the name and for the address.

3. THE 'NAME GAME' ALGORITHM CHALLENGE AND THE ROLE OF BENCHMARK DATABASES
The 'Name Game' Algorithm Challenge consists of a comparison of the results obtained by applying different algorithms to the same sets of PatStat data, where the aim of all algorithms is that of identifying who, among the various PERSON_IDs, are the same persons (inventors). All researchers interested in producing algorithms and joining the challenge are welcome. A list of PERSON_IDs and related information (publication numbers of patents associated to those PERSON_IDs, as well as addresses and country codes) will be provided by the Challenge organizers (see the RAW data in section 4). Participants will use this information plus any other information of their choice (either from PatStat or from other sources) in order to identify inventors. Typically, algorithms will comprise the following operations, which follow, with modifications, those proposed by Raffo and Lhuillery (2009); a toy sketch of these string-handling steps is provided at the end of this section:
1. Parsing: Strings of names and addresses or other text are cleaned in order to delete or modify elements such as corrupted characters, double spaces, unnecessary punctuation, and so forth. Conversion of characters specific to a relatively uncommon alphabet into characters from a more common one can also take place (as when Scandinavian characters such as "Ø" are converted to "O", or "ü" is converted to "ue"). Finally, at this stage some algorithms may split a string into two or more substrings, as when a string comprising both a person's name and surname (such as PERSON_NAME in PatStat) is split into "Name" and "Surname", or a string containing the elements of a person's address (such as PERSON_ADDRESS in PatStat) is split into "Street and street number", "City", "Region", etc. Notice that these operations may refer not only to information regarding the inventors (such as PERSON_NAME and PERSON_ADDRESS from PatStat) but also to information regarding the inventors' patents. In particular, algorithms that base subsequent steps on information relative to the inventor's patents will parse PatStat elements such as IPC_CLASS_SYMBOL (which reports the technological classification of the patent according to the International Patent Classification) or PERSON_NAME and PERSON_ADDRESS referred to the patent applicant, rather than to the inventor (that is, for INVT_SEQ_NR = 0).6
2. External information retrieval: After parsing, contents of PatStat may be matched to external information, either to improve the results of parsing and/or to add information useful for the subsequent steps. For example, parsed addresses may be compared to addresses retrieved from online directories, or zip codes may be added when missing, again by searching the internet on the basis of parsed information from PatStat.
It is important that, when describing their algorithms, participants in the Challenge mention explicitly which external sources of information they have accessed and whether any limitations of access exist, either due to fees or to data sensitivity issues.
3. Matching (Identification of Synonyms): A matching algorithm is applied in order to produce a list of potential matched pairs of inventors. Most typically, inventors with the same or similar names, but different addresses, are matched, as when "Donald Duck, Ducktown Street 1, Disneyland" and "Donald D. Duck, Dücktøwn St. 1, Disney" are matched (in which case the addresses are also similar), or when "Mordecai Richler, 32 avenue Duddy Kravitz, Montreal, QC H3W 1P2" is compared to "Mordecai Richler, 561 St Urbain's Horseman, London SW7 2RH".
4. Filtering (Disambiguation of Homonyms): Some rules are applied in order to decide which matches have to be retained (that is, the two matched inventors are considered to be the same person) and which discarded (the two matched inventors are simply considered homonyms or quasi-homonyms). These rules are often based on "similarity scores", that is scores assigned to elements of similarity between the two matched inventors besides their names (such as the existence of common co-inventors, the technological similarity of their patents, the rarity of their surnames, etc.).
Notice that the sequence of steps we have just illustrated is purely logical: some algorithms may skip one step or collapse two into one. For example, an algorithm may be produced that matches all inventors in a database one to another, irrespective of the similarity of names, and immediately filters out "wrong" matches. A different algorithm may instead retrieve external information only after the matching or the filtering stage, and so forth.
In order to join the Challenge, participants should take care to produce an output comparable with the benchmark database, in particular with the MATCH table of such database (as described in the next section). This will make it possible to compute precision and recall statistics in the same fashion for all algorithms, and also make the algorithms' output immediately intelligible to all participants in the Challenge.
The benchmark database will provide information useful to test precision and recall for all stages of the algorithms. For all pairs of inventors in the benchmark database, information will be provided not only on whether the two are in fact the same person, but also on whether their address or city or zip code is in fact the same (which may be the case even if the two inventors are NOT the same person). This information can be useful to evaluate algorithms not only for the quality of their final outcome (which consists in identifying those inventors who are, or are not, the same person), but also for the quality of their intermediate stages. For example, an algorithm that does a poor job at filtering may nevertheless be very effective at parsing, thus resulting in the correct identification of most addresses, albeit not of persons. In principle, this will help push forward the collective research agenda by combining the strong elements of all algorithms into one meta-algorithm.
In order to get a clearer picture of what "Name Game" algorithms for inventors may be expected to do, readers may refer to the papers surveyed in Appendix A.
6 For these and other PatStat elements see section 2 and EPO's information on PatStat at: http://www.epo.org/patents/patent-information/raw-data/test/product-14-24.html
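As anticipated above, here is a deliberately simplified, toy sketch of the parsing, matching, and filtering steps applied to PERSON_NAME-like strings. The character map, the similarity threshold, and the co-inventor rule are arbitrary illustrative choices, not the Challenge's prescribed method; real algorithms will of course be far richer.

# Toy illustration of the parsing -> matching -> filtering logic described above.
# All thresholds, character maps, and the co-inventor rule are arbitrary examples,
# not a prescribed or official APE-INV procedure.
from difflib import SequenceMatcher
from itertools import combinations

CHAR_MAP = {"ø": "o", "Ø": "O", "ü": "ue", "é": "e", "è": "e"}  # tiny sample map

def parse(name: str) -> str:
    """Step 1 (parsing): normalize characters, case, punctuation, and spacing."""
    for src, dst in CHAR_MAP.items():
        name = name.replace(src, dst)
    name = "".join(ch for ch in name if ch.isalnum() or ch.isspace())
    return " ".join(name.lower().split())

def match(records, threshold=0.85):
    """Step 3 (matching): propose pairs whose parsed names are similar enough."""
    pairs = []
    for a, b in combinations(records, 2):
        score = SequenceMatcher(None, parse(a["name"]), parse(b["name"])).ratio()
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs

def keep(a, b) -> bool:
    """Step 4 (filtering): retain a pair only if extra evidence supports it,
    here a shared co-inventor (any other rule, e.g. IPC overlap, could be used)."""
    return bool(set(a["coinventors"]) & set(b["coinventors"]))

records = [
    {"person_id": 113, "name": "Donald Duck",    "coinventors": {"Daisy Duck"}},
    {"person_id": 222, "name": "Donald D. Dück", "coinventors": {"Daisy Duck"}},
    {"person_id": 303, "name": "Antoine Doinel", "coinventors": {"R. Lachenay"}},
]

for a, b, score in match(records):
    print(a["person_id"], b["person_id"], round(score, 2), "same person?", keep(a, b))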
Some of these papers, along with others dealing with similar problems for companies' (patent applicants') names, can be found on the website of the APE-INV project (http://www.academicpatenting.eu).

4. CONTENTS AND STRUCTURE OF THE BENCHMARK DATABASE
By "benchmark database" we mean a database containing the tables and elements listed in figure 1 (in bold: table names; in italics: original elements from PatStat; in plain text: elements created ad hoc for the benchmark exercise). The combination of Person_ID, PUBLN_NR, and PUBLN_AUTH provides the primary key for linking the various tables among themselves and to the PatStat database.

Figure 1 – Structure and contents of the Benchmark Database
RAW: Person_ID, PUBLN_NR, PUBLN_AUTH, PERSON_NAME, PERSON_ADDRESS, PERSON_CTRY_CODE, PatStat_Edition
CLEAN_ADDRESS: Person_ID, PUBLN_NR, PUBLN_AUTH, Full_Address, Country_Code, Zipcode, Street_Nr, City, Province, Region
CLEAN_NAME: Person_ID, PUBLN_NR, PUBLN_AUTH, Name_Surname, Name, Surname
BENCHMARK_ID: Person_ID, PUBLN_NR, PUBLN_AUTH, BENCHMARK_ID
MATCH: Person_ID, PUBLN_NR, PUBLN_AUTH, Person_ID_match, PUBLN_NR_match, PUBLN_AUTH_match, DIRECTION, SAME_PERSON, SAME_Name_Surname, SAME_Full_Address, SAME_Country, SAME_Name, SAME_Surname, SAME_Street_Nr, SAME_Zipcode, SAME_City, SAME_Province, SAME_Region
All tables link to each other and to the PatStat database through the combination Person_ID + PUBLN_NR + PUBLN_AUTH.

The two most important tables are RAW and MATCH, the latter providing the information necessary to calculate precision and recall rates of algorithms applied to the Person_IDs identified by the RAW table. CLEAN_ADDRESS and CLEAN_NAME contain additional information that participants in the "Name Game" challenge may find useful in order to compare the inventors' names and addresses, as parsed and cleaned by their algorithms, to the inventors' names and addresses parsed, cleaned, and hand-checked by the authors of the benchmark database. List 1 contains the definition of each element in the four tables.
Concerning the RAW table, at the date of this report the PatStat version of reference is October 2009. Participants in the APE-INV "Name Game" challenge ought to secure access to this version of PatStat directly from EPO, or contact [email protected] in order to arrange for it. Besides PERSON_ID, the RAW table contains original PatStat information on inventors such as PERSON_NAME, PERSON_ADDRESS, and PERSON_CTRY_CODE. Although this information may be sufficient to test an algorithm's efficiency in parsing and cleaning names, it is insufficient to perform the matching stage (see again section 3).
The MATCH table provides all information needed to test precision and recall of any algorithm applied to the RAW data (and related info from PatStat). Every observation (line) contains a pair of uniquely identified 'inventor+patent' combinations, plus information on whether the two inventors in the pair are in reality the same person and/or share some trait (e.g. the address, the city, the name or surname, or a combination of these elements). This information is contained in a number of variables whose names' first four letters are 'SAME' (more on their meaning below): when referring to them as a group we will call them the SAME_x variables (where 'x' refers to the rest of their name).
As a way of illustration, a line may compare "Donald Duck, Ducktown Street 1, Disneyland + his patent nr. 10000" to "Donald D. Duck, Dücktøwn St. 1, Disney + his patent nr. 99999", and provide information on:
- whether "Donald Duck" and "Donald D. Duck" are the same person (in which case the element SAME_PERSON takes value 1; otherwise 0), and/or
- whether "Ducktown Street 1" and "Dücktøwn St. 1" are the same address (in which case the element SAME_Street_Nr takes value 1; otherwise 0), and/or
- whether "Disneyland" and "Disney" are in reality the same city (in which case the element SAME_City takes value 1; otherwise 0), and so on.
More precisely, each line of MATCH compares two inventor+patent combinations, in which the first inventor+patent is identified by PERSON_ID, PUBLN_NR, and PUBLN_AUTH, and the second inventor+patent is identified by PERSON_ID_match, PUBLN_NR_match, and PUBLN_AUTH_match. Notice that both the combination PERSON_ID + PUBLN_NR + PUBLN_AUTH and the combination PERSON_ID_match + PUBLN_NR_match + PUBLN_AUTH_match map into the RAW table. Notice also that each pair of combinations can be found twice, but permuted, with the flag variable DIRECTION taking value 1 for one permutation and value 2 for the other. For example, one line of MATCH will compare "Donald Duck, Ducktown Street 1, Disneyland + his patent nr. 10000" to "Donald D. Duck, Dücktøwn St. 1, Disney + his patent nr. 99999", with DIRECTION=1; while another line will compare "Donald D. Duck, Dücktøwn St. 1, Disney + his patent nr. 99999" to "Donald Duck, Ducktown Street 1, Disneyland + his patent nr. 10000", with DIRECTION=2.

List 1 – Definition of elements in the benchmark database
All tables – Person_ID: surrogate key from PatStat (unique combination of PERSON_NAME, PERSON_ADDRESS, PERSON_CTRY_CODE)
All tables – PUBLN_NR: publication number of the patent (from PatStat)
All tables – PUBLN_AUTH: patent authority issuing the patent (from PatStat)
RAW – PERSON_NAME: all elements of the inventor's name, as from PatStat
RAW – PERSON_ADDRESS: all elements of the inventor's address, as from PatStat
RAW – PERSON_CTRY_CODE: inventor's country code, as from PatStat
RAW – PatStat_Edition: edition of PatStat to which PERSON_ID refers
MATCH – Person_ID_match: surrogate key from PatStat (unique combination of PERSON_NAME, PERSON_ADDRESS, PERSON_CTRY_CODE)
MATCH – PUBLN_NR_match: publication number of the patent (from PatStat)
MATCH – PUBLN_AUTH_match: patent authority issuing the patent (from PatStat)
MATCH – DIRECTION: flag variable (values: 1 or 2) for filtering purposes (see explanation in text)
MATCH – SAME_PERSON: =1 if the two inventors are the same person; =0 if they are not (NULL values not admitted)
MATCH – SAME_Name_Surname: =1 if the combinations of name and surname of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH – SAME_Full_Address: =1 if the addresses of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH – SAME_Country: =1 if the countries of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH – SAME_Name: =1 if the first names of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH – SAME_Surname: =1 if the surnames of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH – SAME_Street_Nr: =1 if the street and street number of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH – SAME_Zipcode: =1 if the zip codes of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH – SAME_City: =1 if the cities of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH – SAME_Province: =1 if the provinces (county, département…) of the two inventors are the same; =0 if they are not (NULL values admitted)
MATCH – SAME_Region: =1 if the regions (State…) of the two inventors are the same; =0 if they are not (NULL values admitted)
CLEAN_ADDRESS – Street_Nr: inventor's street and street number, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_ADDRESS – Zipcode: inventor's zip code, as retrieved by the authors of the benchmark database
CLEAN_ADDRESS – City: inventor's city, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_ADDRESS – Province: inventor's province, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_ADDRESS – Region: inventor's region, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_ADDRESS – Country_Code: inventor's country code, as checked and formatted by the authors of the benchmark database
CLEAN_ADDRESS – Full_Address: inventor's full address, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_NAME – Name: inventor's name, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_NAME – Surname: inventor's surname, as parsed, cleaned and formatted by the authors of the benchmark database
CLEAN_NAME – Name_Surname: inventor's full name and surname, as parsed, cleaned and formatted by the authors of the benchmark database

In this way, by filtering for DIRECTION=1 or DIRECTION=2, and extracting non-duplicated values of the PERSON_ID + PUBLN_NR + PUBLN_AUTH combinations, one obtains the same list of inventors+patents as in the RAW table.
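As a quick sanity check of this structure, the following sketch verifies that the de-duplicated Person_ID + PUBLN_NR + PUBLN_AUTH keys of MATCH, filtered on one DIRECTION, coincide with those of RAW. It assumes the benchmark tables have been exported to CSV files with the column names of List 1; the file names are illustrative.

# Sanity check: the distinct Person_ID + PUBLN_NR + PUBLN_AUTH combinations found in
# MATCH (one DIRECTION only) should coincide with those listed in RAW.
# File names are illustrative; column names follow List 1 above.
import pandas as pd

KEYS = ["Person_ID", "PUBLN_NR", "PUBLN_AUTH"]

raw   = pd.read_csv("raw.csv",   dtype=str)
match = pd.read_csv("match.csv", dtype=str)

match_keys = (match[match["DIRECTION"] == "1"][KEYS]
              .drop_duplicates()
              .sort_values(KEYS)
              .reset_index(drop=True))
raw_keys = raw[KEYS].drop_duplicates().sort_values(KEYS).reset_index(drop=True)

print("Same inventor+patent list in RAW and MATCH:", match_keys.equals(raw_keys))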
Finally, notice that, with the exception of SAME_PERSON, all SAME_x variables may take not only value 1 or 0, but also value NULL (identified by the missing value symbol ' . ') when the information is not available and could not be retrieved. In a similar fashion, participants in the Name Game may wish to place a similar value when their algorithm does not produce the information; for example SAME_Name=. and SAME_Surname=. if the algorithm does not split names and surnames, and compares inventors only by means of the full Name_Surname string. SAME_PERSON is an exception, to the extent that all algorithms are expected to produce a judgement on whether two inventors are or are not the same person (NULL, that is "don't know", judgements are considered equivalent to zero values).
In what follows, we provide three graphical illustrations of these same concepts. In the first example (Donald Duck) the two inventors are both identified as the same person and found to share the same address (although not all information on such address is available: for example, the Province, Region, and Zipcode are missing in the original PatStat data and were not recovered by the imaginary author of the algorithm).

MATCH table (Donald Duck example): the two inventors are the same person, although not all the information on their addresses was available.
DIRECTION=1: Person_ID=113, PUBLN_NR=10000, PUBLN_AUTH=EP compared to Person_ID_match=222, PUBLN_NR_match=99999, PUBLN_AUTH_match=EP; SAME_PERSON=1, SAME_Name_Surname=1, SAME_Full_Address=1, SAME_Country=1, SAME_Name=1, SAME_Surname=1, SAME_Street_Nr=1, SAME_Zipcode=., SAME_City=1, SAME_Province=., SAME_Region=.
DIRECTION=2: the same pair, permuted (222/99999/EP compared to 113/10000/EP), with identical SAME_x values.

RAW table (Donald Duck example):
Person_ID=113, PUBLN_NR=10000, PUBLN_AUTH=EP, PERSON_NAME=Donald Duck, PERSON_ADDRESS=Ducktown Street 1, Disneyland CA, PERSON_CTRY_CODE=US
Person_ID=222, PUBLN_NR=99999, PUBLN_AUTH=EP, PERSON_NAME=Donald D. Duck, PERSON_ADDRESS=Dücktøwn St. 1, Disney CA, PERSON_CTRY_CODE=US

In the second example (Mordecai Richler) the two inventors are found to be the same person despite not sharing the same address (not even the city or the country); we can imagine they are identified thanks to other information derived from PatStat (such as the technological class of their patents and/or the name of the patents' applicants and/or a common co-inventor), not reported in the benchmark database (but available on request).

MATCH table (Mordecai Richler example): the two inventors are the same person, although their addresses are clearly different (i.e. same person, but two addresses).
DIRECTION=1: Person_ID=777, PUBLN_NR=11111, PUBLN_AUTH=EP compared to Person_ID_match=888, PUBLN_NR_match=12345, PUBLN_AUTH_match=EP; SAME_PERSON=1, SAME_Name_Surname=1, SAME_Full_Address=0, SAME_Country=0, SAME_Name=1, SAME_Surname=1, SAME_Street_Nr=0, SAME_Zipcode=0, SAME_City=0, SAME_Province=0, SAME_Region=0
DIRECTION=2: the same pair, permuted (888/12345/EP compared to 777/11111/EP), with identical SAME_x values.

RAW table (Mordecai Richler example):
Person_ID=777, PUBLN_NR=11111, PUBLN_AUTH=EP, PERSON_NAME=Mordecai Richler, PERSON_ADDRESS=32 avenue Duddy Kravitz, Montreal, QC H3W 1P2, PERSON_CTRY_CODE=CA
Person_ID=888, PUBLN_NR=12345, PUBLN_AUTH=EP, PERSON_NAME=Mordecai Richler, PERSON_ADDRESS=561 St Urbain's Horseman, London SW7 2RH, PERSON_CTRY_CODE=UK

In the third example (Antoine Doinel) the two inventors are found to be different persons despite sharing the same city; we can imagine they are told apart thanks to other information derived from PatStat (such as the technological class of their patents and/or the name of the patents' applicants and/or a common co-inventor), not reported in the benchmark database (but available on request).

MATCH table (Antoine Doinel example): the two inventors are not the same person, despite sharing the same name, surname, and city.
DIRECTION=1: Person_ID=303, PUBLN_NR=13571, PUBLN_AUTH=EP compared to Person_ID_match=404, PUBLN_NR_match=45785, PUBLN_AUTH_match=EP; SAME_PERSON=0, SAME_Name_Surname=1, SAME_Full_Address=0, SAME_Country=1, SAME_Name=1, SAME_Surname=1, SAME_Street_Nr=0, SAME_Zipcode=0, SAME_City=1, SAME_Province=1, SAME_Region=1
DIRECTION=2: the same pair, permuted (404/45785/EP compared to 303/13571/EP), with identical SAME_x values.

RAW table (Antoine Doinel example):
Person_ID=303, PUBLN_NR=13571, PUBLN_AUTH=EP, PERSON_NAME=Antoine Doinel, PERSON_ADDRESS=451, rue de Fahrenheit, 75006 Paris, PERSON_CTRY_CODE=FR
Person_ID=404, PUBLN_NR=45785, PUBLN_AUTH=EP, PERSON_NAME=Antoine Doinel, PERSON_ADDRESS=400, cours des Coups, 75001 Paris, PERSON_CTRY_CODE=FR

As for the remaining tables of the Benchmark Database, they serve mainly for reference purposes. The CLEAN_NAME and CLEAN_ADDRESS tables contain respectively the inventors' names and surnames, and their addresses, as cleaned and standardized by the authors of the Benchmark Database. Loosely speaking, they are the "true" names, surnames, and addresses of the inventors corresponding to the list of PERSON_IDs from PatStat. Strictly speaking, no "true" name, surname, or address really exists, since these items' syntax always depends on conventions; and the conventions followed by the authors of the Benchmark database are not necessarily universal and uncontroversial, with the possible exception of zip codes and Country codes.
For example, when building the CLEAN_NAME table we may have adopted the convention that both PERSON_NAMEs "Donald Duck" and "Donald D. Duck" correspond to "Donald Duck" (that is, the middle name may be ignored); although this may not be the choice made by a participant in the Challenge, nothing prevents such a participant from correctly identifying the two PERSON_NAMEs as the same Name_Surname combination, nor from correctly splitting both into identical Names and Surnames.
As for the BENCHMARK_ID table, this contains surrogate keys (BENCHMARK_IDs) produced by the authors of the benchmark in order to identify uniquely all PERSON_IDs who are in fact the same person. For each PERSON_ID (no matter how many different patents – i.e. PUBLN_NRs – it appears on) one and only one BENCHMARK_ID may exist; but of course several PERSON_IDs may correspond to one and only one BENCHMARK_ID. Counting the distinct BENCHMARK_IDs in the table is a quick way to count the number of true persons corresponding to all the PERSON_IDs in the Benchmark Database. By producing a similar surrogate key and counting its instances, participants in the challenge may quickly check whether their algorithms over- or under-estimate the number of persons in the Benchmark database. This exercise, however, does not immediately produce the required Precision and Recall statistics. In order to achieve these results, we recommend following the procedure described below, which relies heavily on the MATCH table of the Benchmark database and requires producing a similar one.

5. REPORTING ON THE EFFICIENCY OF ALGORITHMS AND USE OF BENCHMARK DATABASE
List 2 summarizes the information requested from Challenge participants in order to evaluate the performance of their algorithms.

List 2. Required information for Challenge Participants
1. Precision rate, defined as: precision = true positives / (true positives + false positives)
2. Recall rate, defined as: recall = true positives / (true positives + false negatives)
for the following fields:
i. Full address (Street and street nr, City, Zipcode) and/or parts thereof (including Province and Region)
ii. Name_Surname and/or parts thereof (Name and Surname as separate fields)
iii. Person
3. Time of completion by activity (Cleaning + Matching)
4. Additional information:
i. description of the algorithm
ii. clean dataset resulting from application to the Benchmark Database

In the context of the Challenge, "positives" and "negatives" correspond to matched pairs in the MATCH table of the Benchmark Database, and to the values assigned to the various SAME_x variables. For example, when comparing the combination Person_ID=113, PUBLN_NR=10000, PUBLN_AUTH=EP to the combination Person_ID_match=222, PUBLN_NR_match=99999, PUBLN_AUTH_match=EP, a "positive" match is generated if SAME_PERSON=1, that is if the algorithm considers the two PERSON_IDs (more precisely: PERSON_ID and PERSON_ID_match) as the same person; on the contrary, a "negative" match is generated if SAME_PERSON=0, that is the algorithm does not consider the two PERSON_IDs as the same person. Similarly, for the same observation, we obtain a "positive" ("negative") for the Address if the algorithm assigns value 1 (value 0) to the SAME_Full_Address variable, that is if it recognizes the two addresses as the same. Also similarly, we obtain a "positive" ("negative") for the Name_Surname if the algorithm assigns value 1 (value 0) to the SAME_Name_Surname variable, that is if it recognizes the two combinations of Name and Surname as the same. And so on for SAME_City, SAME_Zipcode, etc.
In all these cases, we allow algorithms joining the Challenge to produce also a NULL value, in case the algorithm's structure is such that some information is not generated (for example, the algorithm does not split Name and Surname, or does not split the Street and the City). By comparing the "positives" and "negatives" calculated by the algorithm for the various SAME_x variables with the benchmark, the authors of the algorithms can also calculate how many true and false positives, as well as true and false negatives, their algorithm generates for the various SAME_x variables.
Notice that the MATCH table is a directed one (see the DIRECTION flag): if the match comparing Person_ID=113, PUBLN_NR=10000, PUBLN_AUTH=EP to Person_ID_match=222, PUBLN_NR_match=99999, PUBLN_AUTH_match=EP appears with DIRECTION=1, then the permuted match (222/99999/EP compared to 113/10000/EP) will appear too, with DIRECTION=2. Therefore, it is advisable, when preparing the equivalent of the MATCH table, to produce a similar permutation. Alternatively (and less time consuming in computational terms), participants in the Challenge may produce matches in one direction only, but should then compare their results with the MATCH table for one direction only (that is, they should filter MATCH either for DIRECTION=1 or for DIRECTION=2). A third, even less time consuming alternative may consist in producing only a subset of the MATCH table, for example one which contains only matches between similar names and/or addresses (the MATCH table of the benchmark datasets contains all possible matches, regardless of any similarity). In this case, the author of the algorithm is simply assuming that all the matches she is not producing have to be considered "negatives", and will take this into account when computing the relevant precision and recall scores. Within this third strategy, the author of the algorithm may consider producing separate MATCH tables, one for each SAME_x variable of interest (e.g., one for calculating precision and recall over SAME_PERSON, another for SAME_Full_Address, etc.).
By following one of these procedures, a perfect precision score (that is, a score of 1.0, alias 100%) for a given SAME_x variable means that the algorithm generates SAME_x=1 only when such is the value of SAME_x in the MATCH table of the Benchmark database. In other words, a perfectly precise algorithm does not generate false positives, that is it never assigns SAME_x=1 when it is not the case. (However, this says nothing about whether the algorithm fails to assign SAME_x=1 when it is the case, that is whether it generates some false negatives.) A perfect precision score for SAME_PERSON, in particular, means that all inventors are correctly identified: that is, the number of inventors identified by the algorithm corresponds to the number of distinct BENCHMARK_IDs listed in the BENCHMARK_ID table of the Benchmark database. Notice that if we were interested only in this aspect of Precision, the MATCH table would be unnecessary, the Precision rate being easily calculated by considering only the information provided by the BENCHMARK_ID table. But since we are interested also in checking the Precision of the algorithm in retrieving the Addresses, the Surnames, or other elements of an inventor's identity, using the MATCH table appears more convenient.
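The procedure just described can be sketched in a few lines of code. The snippet below computes precision and recall for SAME_PERSON by joining a participant's MATCH-like table with the benchmark MATCH table (one DIRECTION only), treating non-produced or NULL matches as negatives. The file names, and the assumption that the participant's table uses the same key columns, are illustrative only.

# Minimal sketch of computing precision and recall for SAME_PERSON by comparing a
# participant's output to the benchmark MATCH table. File names and the layout of
# the participant's table are assumptions made for the example.
import pandas as pd

KEYS = ["Person_ID", "PUBLN_NR", "PUBLN_AUTH",
        "Person_ID_match", "PUBLN_NR_match", "PUBLN_AUTH_match"]

bench = pd.read_csv("benchmark_match.csv")      # the MATCH table of the benchmark
algo  = pd.read_csv("my_algorithm_match.csv")   # participant's MATCH-like table

# Work on one direction only, as suggested in the text.
bench = bench[bench["DIRECTION"] == 1]

merged = bench.merge(algo[KEYS + ["SAME_PERSON"]], on=KEYS,
                     how="left", suffixes=("_true", "_pred"))

# Pairs not produced by the algorithm, or flagged NULL, count as negatives.
pred = merged["SAME_PERSON_pred"].fillna(0)
true = merged["SAME_PERSON_true"]

tp = ((pred == 1) & (true == 1)).sum()
fp = ((pred == 1) & (true == 0)).sum()
fn = ((pred == 0) & (true == 1)).sum()

precision = tp / (tp + fp) if (tp + fp) else float("nan")
recall    = tp / (tp + fn) if (tp + fn) else float("nan")
print(f"precision={precision:.2%}  recall={recall:.2%}")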
Notice also that this way of calculating Precision allows for an algorithm to be overall precise in identifying inventors (that is, to be precise with respect to SAME_PERSON) despite not being very precise with respect to other elements of the inventor's identity, such as the Address or the Name and so on. In particular, we may have algorithms which identify the inventors precisely without locating them precisely in geographical space.
Similarly, a perfect Recall score (1.0, alias 100%) means that the algorithm assigns SAME_x=1 to all cases in which the MATCH table actually reports such value, that is it does not generate false negatives. (However, this score says nothing about whether the algorithm also assigns SAME_x=1 when it is not the case, that is whether it generates false positives.)
As a further example, let us imagine that a participant in the challenge has created an algorithm called 'Garfield', has applied it to the three examples listed above (Donald Duck, Mordecai Richler, and Antoine Doinel), and has produced the relevant MATCH table (which we will call MATCH_Garfield to distinguish it from the MATCH table in the benchmark database). The records of this imaginary MATCH_Garfield table are reproduced below, together with the corresponding records from MATCH (which correspond to the examples above). Notice that both tables have 30 observations, since the six unique 'inventor+patent' combinations of the three examples generate n*(n-1) = 6*5 = 30 ordered pairs (combinations cum permutation); that is, each 'inventor+patent' is compared twice with each of the other five. Notice also that, in our example, MATCH_Garfield identifies all pairs in our examples as the same person, that is it identifies 3 persons out of the various combinations. In reality, MATCH tells us there are 4 persons, because the two "Donald Duck" are the same person, as are the two "Mordecai Richler", but the two "Antoine Doinel" are different persons; that is, Garfield creates a false positive and therefore falls short in terms of precision with respect to SAME_PERSON. However, the Garfield algorithm does not miss any real positive, that is it does not neglect to identify the two Donald Ducks and the two Mordecai Richlers as the same person; in other words, it does not create false negatives, so it exhibits perfect recall with respect to SAME_PERSON. Box 1 reports in greater detail how both precision and recall rates are calculated. As for all the other matching dimensions (all the SAME_x variables besides SAME_PERSON) the Garfield algorithm exhibits both perfect precision and recall.
[Table: MATCH table for the examples above]

[Table: MATCH_Garfield table – outcome of the imaginary Garfield algorithm applied to the examples above]

Box 1 – "Garfield" algorithm's precision and recall rates for SAME_PERSON
Positive matches: 6 (3 for each value of DIRECTION), of which:
- True positives: 4 (2 for each value of DIRECTION)
- False positives: 2 (1 for each value of DIRECTION)
Negative matches: 24 (12 for each value of DIRECTION), of which:
- True negatives: 24 (12 for each value of DIRECTION)
- False negatives: 0 (0 for each value of DIRECTION)
Precision (calculated on both DIRECTIONs) = 4/6 = 66%
Precision (calculated on one value of DIRECTION only) = 2/3 = 66%
Recall (calculated on both DIRECTIONs) = 4/4 = 100%
Recall (calculated on one value of DIRECTION only) = 2/2 = 100%

Notice that, provided that no algorithm predicts the value of the SAME_x variables differently according to the DIRECTION of the match, calculating precision and recall rates by making use of all observations in the MATCH table, or by filtering for one value of DIRECTION only, makes no difference. Notice also that precision and recall rates for SAME_PERSON could have been calculated after producing only a subset of MATCH_Garfield, namely one containing matches only for similar names and surnames, that is the first six lines of MATCH. By calculating correctly the number of all potential matches (that is, 30, i.e. 15 for each DIRECTION) and by treating all non-performed matches as negatives (which in this case would mean 24 negatives, 12 for each DIRECTION), one could calculate the precision and recall rates anyway. Even in this case, the MATCH table of the benchmark database would contain useful information, because it would help track the false negatives (that is, the non-performed matches that would have involved a positive).

6. AVAILABLE AND PLANNED BENCHMARK DATABASES
Three benchmark databases will be produced over time, each containing a different subset of Person_IDs from PatStat:
- The France_Academic_Benchmark database, which contains 1498 Person_IDs and 1850 PUBLN_NRs (EPO patent applications), corresponding to 1997 Person_ID - PUBLN_NR pairs. The number of distinct inventors is 424, all of them being academic scientists affiliated with French universities in 2004-05. More precisely, the database comes from KITES' parsing, cleaning, and matching of all inventors listed on a patent application at EPO from 1975 to 2001 with PERSON_CTRY_CODE = 'FR', and from further matching the resulting records with the list of all Maîtres de Conférences and Professeurs listed on French ministerial records in 2005, for the medical, engineering, and natural sciences (see Lissoni et al., 2008). Subsequent hand-checking and cleaning has been performed both by Carayol and Cassi (2009) and by the authors of this report.
- The EPFL_Benchmark database, which contains 843 Person_IDs and 685 patent publications, of which 564 with EP as publication authority (PUBLN_AUTH='EP') and 121 with WIPO as publication authority (PUBLN_AUTH='WO'), corresponding to 1088 Person_ID - PUBLN_NR pairs. The number of distinct inventors is 312, all of them being academic scientists affiliated with the Ecole Polytechnique Federale de Lausanne (EPFL), plus a few homonyms of theirs, from various countries.
This database is based upon Raffo and Lhuillery (2009).
- The IBM_Benchmark database, based upon a list of 500 inventors kindly provided by the IBM corporation.
At the present date, only the France_Academic_Benchmark and the EPFL_Benchmark databases are ready for use; they can be downloaded from the dedicated website (http://www.academicpatenting.eu, section: "Name Game" Algorithm Challenge and Tools). List 3 provides information on their contents.

List 3 – Numerosity of elements in the French and EPFL benchmark databases
(TABLE | Element | France_Academic | EPFL)
All tables | Person_ID | 1498 | 843
All tables | PUBLN_NR | 1850 | 685
All tables | PUBLN_AUTH | 1 | 2
RAW | nr of observations | 1997 | 1088
RAW | PERSON_NAME | 728 | 308
RAW | PERSON_ADDRESS | 1446 | 682
RAW | PERSON_CTRY_CODE | 1 | 12
MATCH | nr of observations | 3986012 | 1182656
MATCH | DIRECTION | 2 | 2
MATCH | SAME_PERSON | 2 | 2
MATCH | SAME_Name_Surname | 2 | 2
MATCH | SAME_Full_Address | 3 | 3
MATCH | SAME_Country | 1 | 3
MATCH | SAME_Name | 2 | 2
MATCH | SAME_Surname | 2 | 2
MATCH | SAME_Street_Nr | 3 | 3
MATCH | SAME_Zipcode | 3 | 3
MATCH | SAME_City | 2 | 2
MATCH | SAME_Province | 2 | 2
MATCH | SAME_Region | 2 | 2
CLEAN_ADDRESS | nr of observations | 1997 | 1088
CLEAN_ADDRESS | Street_Nr | 746 | 315
CLEAN_ADDRESS | Zipcode | 420 | 162
CLEAN_ADDRESS | City | 357 | 131
CLEAN_ADDRESS | Province | 59 | 7
CLEAN_ADDRESS | Region | 20 | 4
CLEAN_ADDRESS | Country_Code | 1 | 12
CLEAN_ADDRESS | Full_Address | 806 | 365
CLEAN_NAME | nr of observations | 1997 | 1088
CLEAN_NAME | Name | 120 | 242
CLEAN_NAME | Surname | 345 | 315
CLEAN_NAME | Name_Surname | 365 | 326
BENCHMARK_ID | nr of observations | 1997 | 1088
BENCHMARK_ID | PUBLN_NR | 1850 | 685
BENCHMARK_ID | PUBLN_AUTH | 1 | 2
BENCHMARK_ID | BENCHMARK_ID | 424 | 312

As for the IBM_Benchmark database, it will be made available in September 2010.
A major limitation of the existing and planned benchmark databases is the preponderance of names and surnames of European descent among inventors, and of European addresses, which pose different challenges than Asian ones (Japan, Korea, and China being among the largest countries by number of patent applications filed both at USPTO and EPO). Any contribution towards creating an Asian-oriented benchmark database is therefore welcome.

7. CONCLUSIONS: HOW TO JOIN THE ALGORITHM CHALLENGE
i. Obtain access to PatStat version October 2009, or contact [email protected] to obtain it (notice also that REGPAT users will find information from PatStat - October 2009 in the January 2010 REGPAT edition).
ii. Keep in touch with [email protected] in order to obtain information on the next workshop, which will be scheduled around November 2010.
iii. Visit the website http://www.academicpatenting.eu (section: "Name Game" Algorithm Challenge and Tools) to download the BENCHMARK DATABASES and other useful information.
iv. Provide, according to a schedule that will be communicated to all participants, a report containing the following information:
1. Precision rate, defined as: precision = true positives / (true positives + false positives)
2. Recall rate, defined as: recall = true positives / (true positives + false negatives)
for the following fields:
i. Full address (Street and street nr, City, Zipcode) and/or parts thereof (including Province and Region)
ii. Name_Surname and/or parts thereof (Name and Surname as separate fields)
iii. Person
3. Time of completion by activity (Cleaning + Matching)
4. Additional information:
i. description of the algorithm
ii. clean dataset resulting from application to the Benchmark Database

REFERENCES
Balconi M., Breschi S., Lissoni F. (2004), "Networks of inventors and the role of academia: an exploration of Italian patent data", Research Policy 33/1, pp. 127-145.
Carayol N., Cassi L. (2009), "Who's Who in Patents. A Bayesian approach", Cahiers du GREThA 2009-07, Groupe de Recherche en Economie Théorique et Appliquée, Université Bordeaux 4, Bordeaux (http://cahiersdugretha.u-bordeaux4.fr/2009/2009-07.pdf).
Hall B.H., Jaffe A.B., Trajtenberg M. (2001), "The NBER Patent Citations Data File: Lessons, Insights and Methodological Tools", NBER Working Paper 8498, National Bureau of Economic Research, Cambridge MA (http://www.nber.org/papers/w8498).
Huang H., Walsh J.P. (2010), "A New Name-Matching Approach for Searching Patent Inventors", mimeo.
Kim J., Lee S., Marschke G. (2005), "The Influence of University Research on Industrial Innovation", NBER Working Paper 11447, National Bureau of Economic Research, Cambridge MA (http://www.nber.org/papers/w11447). Forthcoming in Journal of Economic Behavior and Organization.
Lai R., D'Amour A., Fleming L. (2009), "The careers and co-authorship networks of U.S. patent-holders, since 1975", Harvard Business School / Harvard Institute for Quantitative Social Science (http://en.scientificcommons.org/48544046).
Lissoni F., Sanditov B., Tarasconi G. (2006), "The KEINS Database on Academic Inventors: Methodology and Contents", CESPRI Working Paper 181, Università "L. Bocconi", Milano, October 2006 (http://www.cespri.unibocconi.it/workingpapers).
Magerman T., van Looy B., Song X. (2006), "Data production methods for harmonized patent statistics: Patentee name harmonization", KU Leuven FETEW MSI Research Report 0605, Leuven.
Raffo J., Lhuillery S. (2009), "How to play the "Names Game": Patent retrieval comparing different heuristics", Research Policy 38(10), pp. 1617-1627.
Tang L., Walsh J.P. (2010), "Bibliometric fingerprints: name disambiguation based on approximate structure equivalence of cognitive maps", Scientometrics (forthcoming).
Thoma G., Torrisi S., Gambardella A., Guellec D., Hall B.H., Harhoff D. (2010), "Harmonizing and Combining Large Datasets – An Application to Patent and Finance Data", NBER Working Paper 15851, National Bureau of Economic Research, Cambridge MA (http://www.nber.org/papers/w15851).
Trajtenberg M., Shiff G., Melamed R. (2006), "The "Names Game": Harnessing Inventors' Patent Data for Economic Research", NBER Working Paper 12479, National Bureau of Economic Research, Cambridge MA (http://www.nber.org/papers/w12479).

APPENDIX A – IDENTIFICATION AND DISAMBIGUATION OF INVENTORS: A SHORT SURVEY
The present survey summarizes the main methodological issues related to the identification and disambiguation of inventors, as discussed in a number of recent papers which have made use of patent data from various sources. The survey does not aim to be exhaustive. No effort has been made to retrieve all papers based upon inventors' data; only those entirely or largely dedicated to methodological issues have been considered. At the same time, we restrict our attention to inventors' data only and do not consider papers dedicated to the identification and disambiguation of applicants (which are chiefly business companies and other organizations), such as Magerman et al. (2006) and Thoma et al. (2010). After a preliminary discussion of terminology, we briefly illustrate the data sources used by the surveyed papers, and then we move to a comparison of the methodologies regarding the various steps followed to move from the raw data to the final product.
Terminology
Not all papers use the same terminology to describe the operations they perform, so that similar operations may go under different names. In what follows we will make use of two different sets of terms, coming respectively from Raffo and Lhuillery (2009), which is one of the surveyed papers, and from Kang et al. (2009), which is one of the many papers from the "information processing" literature, a specialized field of computer science.
Raffo and Lhuillery (2009) describe the various operations to be undertaken when dealing with inventors as Parsing → External information retrieval → Matching → Filtering (each operation is described in detail in section 3 above). The sequence is purely logical: some algorithms may skip one step or collapse two into one, as when an algorithm matches all inventors in a database one to another, irrespective of the similarity of names, and immediately filters out "wrong" matches; a different algorithm may instead retrieve external information only after the matching or the filtering stage, and so forth. Kang et al. (2009) describe the first three steps (Parsing → External information retrieval → Matching) as leading to the "identification" of inventors, and the last one as "disambiguation", the latter being a term used also by Lai et al. (2009) with reference to the entire process.

Internal vs. External information
Information on each inventor to be examined can be distinguished as "internal" or "external". Internal information concerns exclusively the inventor's name (and surname, middle names or initials, etc.) and address, as reported in separate text strings on patent documents (one string for name-surname-etc., one for the address, either inclusive or exclusive of the city and country, which in some patent data sources are reported in dedicated strings). External information may come from within the patent data source or from other sources. External information from within the patent data concerns:
- the patents signed by the inventor (their technological classification, title, abstract...);
- the characteristics of the patents' applicant (whether it is the inventor itself, or another entity, such as a company or a university; in which case we are interested in the text strings reporting the applicant's name, address, etc.);
- the citations linking the inventor's patents to other patents or to the "non patent literature" (chiefly, scientific articles);
- relational data such as the identity of the inventor's co-inventors or of other inventors in the same database (such as those linked to the inventor of interest through a chain of co-inventorship relationships, as illustrated in Balconi et al., 2004).
External information from outside the patent dataset refers to any source which may help improve the identification or the disambiguation.
Data sources

Three main data sources can be identified in the surveyed papers:
‐ EP-CESPRI database, used by Lissoni et al. (2006) and Carayol and Cassi (2009). This database consists of patent applications filed at the European Patent Office and published on the ESPACE Bulletin (a bibliographic and legal status database of all European patent applications published since the EPO was founded; http://www.epo.org/patents/patent-information/subscription/bulletin.html), with names and addresses of inventors and applicants parsed and coded by CESPRI researchers7.
‐ NBER database, produced by Hall et al. (2001) and updated by Trajtenberg et al. (2006). It is based on information released by the US Patent & Trademark Office (USPTO). Such information covers utility patents granted between 1963 and 1999, and citations between 1975 and 1999. Other authors, such as Lai et al. (2009) and Kim et al. (2005), integrate it with data coming directly from the USPTO for years after 1998 or 1999, which may also cover design and other types of patents.
‐ PatStat database, which we describe in section 2 above. PatStat covers data from both the EPO and the USPTO and has been used, among the papers we survey, by Raffo and Lhuillery (2009). Notice that the information on USPTO patents reported by PatStat is not entirely consistent with the information contained in the USPTO patent records in the NBER database and its updates (see Appendix B below).

7 CESPRI was a research centre of Bocconi University, Milan (Italy). It has now been absorbed by KITES, a larger research centre of the same university. What used to be known as the EP-CESPRI database has now been absorbed by the KITES database, which is no longer based on ESPACE Bulletins, but on subsequent editions of PatStat, whose records are parsed and matched according to the same methodological principles once used for the EP-CESPRI database.

Methodologies

Tables A1.1–A1.3 below provide a synoptic view of the key features of the methodologies followed by the surveyed papers. Most methodologies are largely similar, although a few papers make choices (some of them still at an experimental stage) that are worth commenting upon.

PARSING: at the parsing level, all authors undertake similar steps for the elimination of symbols, characters, and double blanks, and for the conversion of characters typical of uncommon alphabets. However, no common standards are used; two partial exceptions are Lissoni et al. (2006), who explicitly convert all characters to ASCII, and Raffo and Lhuillery (2009), who have produced and made public a list of symbols and characters to be eliminated from PatStat text strings (http://wiki.epfl.ch/patstat/cleaning).

EXTERNAL INFORMATION RETRIEVAL: all papers make use of information external to the inventor's records but internal to the patent data; only one makes use of external-to-patent information. The exception is Lai et al. (2009), and it is limited to the use of CASSIS (a database for the harmonization of USPTO technological classes across the various editions of the USPTO classification) and the SAS directories of zip codes, used to fill information gaps in the original inventors' records for which information on the address was provided.

MATCHING: this is the stage where most differences across methodologies can be found. Differences concern both the way records are selected for matching and the criteria used for matching. As for the selection of records, Lai et al. (2009) distinguish between "block" and "adjacency" matching. Block matching requires selecting groups (blocks) of records according to some criteria, and then matching all records satisfying those criteria. For example, Trajtenberg et al. (2006) convert each inventor's name and surname into Soundex codes, group all inventors with the same Soundex-coded names and surnames, and then proceed to match them. So, if 4 inventors (a, b, c, d) have the same Soundex-coded names and surnames, they will be grouped together and the group will generate 6 matches (a-b, a-c, a-d, b-c, b-d, c-d).
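As an illustration of block matching on Soundex codes, the sketch below (our own code, with a deliberately simplified Soundex implementation) groups inventor records by the Soundex codes of name and surname and generates all within-block candidate pairs; four records falling in the same block yield the six pairs mentioned above.

```python
import itertools
from collections import defaultdict

# Letter-to-digit map used by the classic Soundex algorithm.
SOUNDEX_MAP = {c: d for letters, d in
               [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                ("L", "4"), ("MN", "5"), ("R", "6")]
               for c in letters}

def soundex(word: str) -> str:
    """Return the 4-character Soundex code of a word (simplified implementation)."""
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    digits, prev = [], SOUNDEX_MAP.get(word[0], "")
    for ch in word[1:]:
        d = SOUNDEX_MAP.get(ch, "")
        if d and d != prev:
            digits.append(d)
        if ch not in "HW":      # vowels separate equal codes; H and W do not
            prev = d
    return (word[0] + "".join(digits) + "000")[:4]

def block_matches(inventors):
    """Group records by (Soundex(name), Soundex(surname)), then pair all records in a block."""
    blocks = defaultdict(list)
    for rec_id, name, surname in inventors:
        blocks[(soundex(name), soundex(surname))].append(rec_id)
    pairs = []
    for ids in blocks.values():
        pairs.extend(itertools.combinations(sorted(ids), 2))   # n*(n-1)/2 pairs per block
    return pairs

# Four records falling in the same block generate the 6 candidate matches a-b, a-c, a-d, b-c, b-d, c-d.
demo = [("a", "Robert", "Smith"), ("b", "Rupert", "Smyth"),
        ("c", "Roberto", "Smitt"), ("d", "Roberta", "Smithe")]
print(block_matches(demo))
```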
Similarly, Lissoni et al. (2006) group all inventors with exactly the same name+surname string (purged of punctuation, double blanks, symbols, etc.), and then proceed to match all records within the group. The main drawback of this technique is that, once a record has been assigned to a group, it will not be matched anymore to records in other groups, with a negative impact on recall. Alternatively, one can produce very large groups (as with the Soundex method, as opposed to the full-string method), but this will increase the computation time; the same applies to the possibility of re-iterating the matching procedure, each time creating new blocks with different grouping criteria. Lai et al. (2009) propose instead to proceed with adjacency matching, which requires sorting all records according to some criteria and then matching-and-filtering each record with the following ones; whenever two records are considered to point at the same inventor, the information they contain is collapsed into one record only (for example, if comparing inventors "a" and "b", with "a" patenting in class 1 for company A and "b" patenting in class 2 for company B, suggests that "a" and "b" are the same person, the two records will be replaced by one record, indicating that inventor "a" ("b") patents both in class 1 and class 2, and for both company A and company B). Provided that the sorting procedure makes sense (so that inventors with the same or similar names follow one another in the database), and that the procedure is iterated a number of times, all records will end up being matched to all others.

FILTERING: after matching two inventors, all methodologies produce a similarity score, and a score threshold is fixed, so that only matches with higher-than-threshold scores are retained. Most similarity scores are calculated on the basis of similar criteria, such as further similarity measures for names and/or surnames (for example, two inventors with the same Soundex-coded names and surnames may be further compared to check whether their names or surnames are exactly the same, or whether they have the same middle name initials); the presence or absence of common co-inventors, assignees, or technological classes of patents; or accessory criteria, such as whether the shared name or surname is a common or an uncommon one, or whether the common assignee is one with many or few patents. Most algorithms are based upon arbitrary scores assigned to the various similarity items, which are then possibly fine-tuned through trial and error in order to maximize precision, recall, or a combination of the two, on the basis of results obtained against a hand-cleaned benchmark database (the best discussion of this point is provided by Raffo and Lhuillery, 2009).
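The score-and-threshold logic just described can be illustrated by the following sketch. The record fields, the individual weights, and the threshold are arbitrary placeholders of our own; they do not reproduce the values used by any of the surveyed papers, and in practice such weights are exactly what gets calibrated by trial and error against a benchmark database.

```python
def similarity_score(a, b):
    """Linear combination of arbitrary scores for the usual similarity items.
    Each record is assumed to be a dict of pre-computed fields (sets and flags)."""
    score = 0.0
    if a["coinventors"] & b["coinventors"]:
        score += 3.0                       # common co-inventor
    if a["appln_ids"] & b["cited_appln_ids"] or b["appln_ids"] & a["cited_appln_ids"]:
        score += 2.0                       # self-citation between the two records
    if a["ipc4"] & b["ipc4"]:
        score += 1.5                       # same 4-digit technological class
    if a["assignees"] & b["assignees"]:
        score += 1.5                       # same assignee
    if a["city"] and a["city"] == b["city"]:
        score += 1.0                       # same city
    if a["surname_is_common"]:
        score -= 0.5                       # very common surnames are penalised
    return score

def filter_matches(candidate_pairs, threshold=3.0):
    """Keep only the candidate pairs whose similarity score reaches the threshold."""
    kept = []
    for a, b in candidate_pairs:
        s = similarity_score(a, b)
        if s >= threshold:
            kept.append((a, b, s))
    return kept

# Toy example: two records sharing a co-inventor, a self-citation, class, assignee and city.
a = {"coinventors": {"X"}, "appln_ids": {101}, "cited_appln_ids": set(),
     "ipc4": {"C07D"}, "assignees": {"ACME"}, "city": "MILANO", "surname_is_common": False}
b = {"coinventors": {"X"}, "appln_ids": {202}, "cited_appln_ids": {101},
     "ipc4": {"C07D"}, "assignees": {"ACME"}, "city": "MILANO", "surname_is_common": False}
print(similarity_score(a, b))   # 3 + 2 + 1.5 + 1.5 + 1 = 9.0, well above the threshold
```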
The main exception, in this respect, is provided by Carayol and Cassi (2009), who calculate the similarity score according to a Bayesian approach (that is, they base their similarity scores on theoretical grounds). Their approach first requires examining different patents from the same inventor and observing whether they exhibit differences (for example, whether two such patents are in different technological classes or have different assignees). By examining all relevant patents, they can estimate the probability that two patents by the same inventor exhibit such differences. Similarity scores are then calculated on the basis of these probabilities and computed for all pairs of inventor-patent records in which the inventors have the same name and surname. A similarity threshold is then chosen, so that only inventor-patent pairs with higher-than-threshold similarity scores are considered to point at the same person; in addition, transitivity checks (a feature common to other algorithms) allow further inventors to be identified as the same person. The process is then reiterated: new probabilities of observing differences between inventor-patent pairs (with the inventors being the same person) are calculated, similarity scores are re-calculated accordingly, and a new matching+filtering round is launched. The process continues until it converges, that is, until no further inventors are identified as the same person. Results are then tested against a benchmark dataset. Depending on the chosen threshold, combined false positives and false negatives can go down to 2% of the benchmark.

The most original approach to filtering is proposed by Huang and Walsh (2010), although it is still at a very experimental stage. It ignores most of the information used by the other methods, in order to rely exclusively on citation analysis. Following the "Approximate Structural Equivalence" (ASE) method proposed by Tang and Walsh (2010) for the disambiguation of authors of scientific papers, Huang and Walsh calculate, for each pair of patents by matched inventors, a similarity score based upon the number of common citations to prior art (that is, the method checks whether the two matched patents cite the same prior patents, or different ones). Common citations are weighted by their overall frequency (highly cited patents contribute less to the similarity score than patents receiving very few citations) and by the number of citations listed in the matched patents (if the matched patents cite a lot of prior art, the probability of a common citation is higher, which affects the similarity score negatively).
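Huang and Walsh's exact formula is not reported here, so the following sketch should be read only as our own guess at the general shape of such a citation-overlap score: common citations are weighted by their inverse overall frequency, and the score is scaled down when the matched patents list many citations.

```python
from collections import Counter

def citation_frequencies(citations_by_patent):
    """How often each cited (prior-art) document appears across the whole dataset."""
    freq = Counter()
    for cited in citations_by_patent.values():
        freq.update(set(cited))
    return freq

def citation_overlap_score(patent_a, patent_b, citations_by_patent, freq):
    """Citation-overlap score: common citations count more when they are rare overall,
    and the score is scaled down when the two patents cite a lot of prior art."""
    cites_a = set(citations_by_patent.get(patent_a, ()))
    cites_b = set(citations_by_patent.get(patent_b, ()))
    if not cites_a or not cites_b:
        return 0.0
    common = sum(1.0 / freq[c] for c in cites_a & cites_b)
    return common / (len(cites_a) * len(cites_b)) ** 0.5

# Toy data: patents P1 and P2 (by two matched inventors) share the rarely cited document D3.
cites = {"P1": ["D1", "D3"], "P2": ["D2", "D3"], "P3": ["D1", "D2"], "P4": ["D1"]}
freq = citation_frequencies(cites)
print(citation_overlap_score("P1", "P2", cites, freq))
```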
In conclusion, no superior methodology has yet emerged from the various efforts aimed at cleaning and coding inventors, with the exception of some results on matching methods achieved by Raffo and Lhuillery (2009), such as the superior accuracy of weighted 2-gram methods (a sketch of this method is given after Table A1.3 below), and of the necessity of multi-filtering approaches based upon information on patent classes, assignees, citations, and co-inventorship.

Table A1.1 – A comparison of identification and disambiguation methods (parsing and external info)

Carayol and Cassi (2009)
  Data: EP-CESPRI database
  Parsing (of inventors' names and addresses): as in Lissoni et al. (2006)
  Within-patent external info: assignee; inventors' address (city); tech class of patents; citations

Huang and Walsh (2010)
  Data: NBER db, plus update
  Parsing: as in Lai et al. (2009)
  Within-patent external info: patent citations, for the computation of "approximate structural equivalence" (ASE) of patents

Kim et al. (2005)
  Data: USPTO data from NBER db, plus update
  Parsing: as in Trajtenberg et al. (2006)
  Within-patent external info: patent citations, for self-citation analysis; co-inventorship data

Lai et al. (2009)
  Data: NBER db, plus update
  Parsing: 1. conversion to ASCII (?); 2. capitalization (?)
  Within-patent external info: tech. class of patents; inventors' address (city, country/US State, zip code); assignees' names (parsed); co-inventorship data; patent citations, for self- and cross-citation analysis

Lissoni et al. (2006)
  Data: EP-CESPRI database
  Parsing: 1. elimination of all non-letter characters & symbols, incl. double spaces and accents; 2. conversion to ASCII (http://rawpatentdata.blogspot.com/search/label/ascii); 3. capitalization
  Within-patent external info: tech. class (IPC code) of patents; inventors' address (street, city, province/county, region/US State, country, zip code); assignees' names (parsed) and group info (if available); size of assignee's patent portfolio (nr of patents); patent citations, for self-citation analysis; co-inventorship data; surname's frequency within the database, by country

Raffo and Lhuillery (2009)
  Data: PatStat v.2006
  Parsing: 1. elimination of all non-letter characters & symbols, incl. double spaces and corrupt characters (http://wiki.epfl.ch/patstat/corrupted); 2. conversion to ASCII; 3. capitalization
  Within-patent external info: tech. class of patents; assignees' names (parsed); patent citations, for self- and cross-citation analysis; co-inventorship data

Trajtenberg et al. (2006)
  Data: NBER database
  Parsing: 1. elimination of all non-letter characters & symbols; 2. split of: Name(s)-Surname-Middle name initials-Name modifier; 3. elimination of spaces within Name and Surname; 4. capitalization; 5. Soundex conversion of Name and Surname
  Within-patent external info: USPTO tech. class of patents; patent citations, for self-citation analysis; co-inventorship data; frequency of: Soundex-coded names and surnames, cities, assignee names, tech. class

Table A1.2 – A comparison of identification and disambiguation methods (external info and matching method)

Carayol and Cassi (2009)
  External-to-patent info: –
  Matching method and criteria: block matching

Huang and Walsh (2010)
  External-to-patent info: –
  Matching method and criteria: block matching: exact match of full Name(s)+Surname string, to the exclusion of cases in which middle names' initials differ

Kim et al. (2005)
  External-to-patent info: –
  Matching method and criteria: block matching: exact match of Soundex Name, Soundex Surname, and middle initials

Lai et al. (2009)
  External-to-patent info: 1. CASSIS database; 2. SAS Institute's ZIP codes
  Matching method and criteria: MAIN: adjacency matching: approximate match (Jaro-Winkler method) of full Name(s)+Surname string, applicant's name string, and City+Country (US State) string, with match probability adjusted by the string's frequency in the database. SUBSIDIARY: block matching of similar (?) matches, if not resolved by MAIN matching

Lissoni et al. (2006)
  External-to-patent info: –
  Matching method and criteria: block matching, exact match of full Name(s)+Surname string

Raffo and Lhuillery (2009)
  External-to-patent info: EPFL Human Resources info (for filtering EPFL staff)
  Matching method and criteria: block matching, various experiments. Best results with approximate match of full Name(s)+Surname string, with weighted 2-gram method* (followed by Token and 3-gram). * weight = inverse function of the 2-gram's frequency in PatStat

Trajtenberg et al. (2006)
  External-to-patent info: –
  Matching method and criteria: block matching, exact match of Soundex Name and Soundex Surname. Resulting matches are classified according to (decreasing level of accuracy): 1. exactly same first name (or >5 non-zero digits in Soundex code) and exactly same last name (or >5 non-zero digits in Soundex code); 2. exactly same last name (but not exactly same first name), or 2-4 digits in Soundex-coded last name; 3. all other matches
Table A1.3 – A comparison of identification and disambiguation methods (filtering method)

Carayol and Cassi (2009)
  Filtering: Bayesian approach: similarity scores are computed on the basis of observed differences between inventor-patent observations known to refer to the same inventor, and updated through an iterative process

Huang and Walsh (2010)
  Filtering: computation of similarity score, cum threshold filter (arbitrary); similarity score based on ASE score of matched inventors' patents

Kim et al. (2005)
  Filtering: no similarity score is calculated and compared to any threshold; matches are retained if at least one of these conditions is met: same full address; self-citation; common co-inventor; same full Name and full Surname, plus same zip code OR same full Middle Name

Lai et al. (2009)
  Filtering: MAIN: 5-stage (S1 to S5) adjacency matching, based on multiple information and stage-specific thresholds, based upon: S1-S3: assignee name / location (city, country/US state) / Name+Surname string; S4: assignee name / location (city, country/US state) / Name+Surname string + tech. class; S5: Name+Surname string + tech. class + co-inventors. SUBSIDIARY: negative matches from MAIN re-checked for self-citations

Lissoni et al. (2006)
  Filtering: computation of similarity score, cum threshold filter (≈ median score, by country); similarity score = linear combination of (arbitrary) scores for (in order of importance): co-inventor + geodesic distance in inventors' network (points if <3 degrees of separation); self-citation; tech class (cumulative points for 4-6-12 IPC digits); same assignee (exact match), with extra points for assignees with fewer than 50 patents; max time lag between patents (negative score if >20 years); location (cumulative points for same city, province/county, region/US State, country); common surname (negative points)

Raffo and Lhuillery (2009)
  Filtering: computation of similarity score + optimal threshold filter (max accuracy against benchmark test); similarity score = linear combination of equal scores for: tech. class; assignees' names (parsed); self-citations; cross-citations (citation of co-inventor's patents)

Trajtenberg et al. (2006)
  Filtering: STAGE 1 – computation of similarity score = linear combination of arbitrary scores for (in decreasing order of score): exact same address; self-citations; co-inventors; middle name (full middle name or initials, for "rare" middle names); others (such as: size or frequencies of assignees, names, cities, patent classes; presence of surname modifier). STAGE 2 – filtering by 3-level thresholds, one for each level of matching accuracy
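To make the weighted 2-gram matching singled out above (and in Table A1.2) more concrete, the sketch below computes a Dice-style 2-gram similarity between two name+surname strings, weighting each 2-gram by the inverse of its frequency in the whole set of names. The exact weighting and normalization adopted by Raffo and Lhuillery (2009) may differ; this is only an illustration.

```python
from collections import Counter

def bigrams(s):
    """Set of 2-grams of a normalised name+surname string."""
    s = " ".join(s.upper().split())
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bigram_frequencies(all_names):
    """Count how often each 2-gram occurs across the whole set of inventor name strings."""
    freq = Counter()
    for name in all_names:
        freq.update(bigrams(name))
    return freq

def weighted_2gram_similarity(name1, name2, freq):
    """Dice-style 2-gram similarity, with each 2-gram weighted by its inverse frequency."""
    b1, b2 = bigrams(name1), bigrams(name2)
    if not b1 or not b2:
        return 0.0
    def weight(g):
        return 1.0 / freq.get(g, 1)
    common = sum(weight(g) for g in b1 & b2)
    total = sum(weight(g) for g in b1) + sum(weight(g) for g in b2)
    return 2.0 * common / total

names = ["MUELLER HANS", "MULLER HANS", "MUELLER HANNES", "ROSSI MARIO"]
freq = bigram_frequencies(names)
print(round(weighted_2gram_similarity("MUELLER HANS", "MULLER HANS", freq), 3))
```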
APPENDIX B – A NOTE ON USPTO DATA IN PATSTAT

PatStat stores data on patent applications from the EPO (the European Patent Office) and from a number of national patent offices, including the USPTO. USPTO data have a long tradition as a research tool in bibliometrics, economics, and other social sciences. In particular, extensive use has been made of the NBER Patent Citation Data File (Hall et al., 2001; see also: http://www.nber.org/patents/), which contains information on patents granted by the USPTO from 1963 to 1999. Recent users of this source, such as Lai et al. (2009) and Kim et al. (2005 and 2007), have often integrated it with data from the Patent Bibliographic data (Patents BIB), published regularly by the USPTO (http://www.uspto.gov/products/catalog/patent_products/page6.jsp#heading-5).

Data on USPTO patents in PatStat come from a variety of sources, and the information content on person names and addresses is not the same as in the NBER and Patents BIB databases. In what follows we sum up the evidence we have produced so far on such differences.

Information on person names and addresses in PatStat

PatStat stores data about applicants and inventors in a table with the prefix TLS206, indexed by a field named PERSON_ID, which can be linked to the applications via the table TLS207_PERS_APPLN, which links each person_id to an application id (appln_id); a minimal example of this linking is sketched at the end of this appendix. The PatStat DVD provides two versions of TLS2068:
‐ TLS206_PERSON is a comma-separated-value file containing mainly the fields name, address, and country code;
‐ TLS206_ASCII contains the same information already parsed (and also something more9): the name is split into last, first, and middle name; the address into street, city, state, and zip code.
Both tables are generated from the following sources:
1) the EPO Register10, for EP patent applications: PatStat patent information from this source is the most up-to-date at the time of extraction of the data from the Register;
2) the OECD patents database, for US data from 1976-01-01 up to and including November 15th 2005, for Published Grants;
3) PatStat weekly file extracts from the USPTO website, for Published Grants from November 22nd 2005 until today, and for Published Applications from September 29th 2005 to today inclusive;
4) inventor and applicant names for USPTO Published Applications from March 1st 2001 to September 22nd 2005, from DOCDB, data-format="docdba";
5) all other names from DOCDB, data-format="docdba" (US data for names and addresses of patents published before 1976-01-01 are taken from the EPO's DOCDB database).
….. To be completed later

8 More info on http://www.epo.org/patents/patent-information/raw-data/test/product-14-24.html
9 Insert ref. to layout of TLS206_ASCII
10 See: http://www.epoline.org/portal/public/registerplus
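As an illustration of the TLS206/TLS207 linking described above, the sketch below reads the two files and lists, for each person, the applications he or she is connected to. The file names and column headers used here (person_id, person_name, appln_id) follow the description given in this appendix, but they are assumptions on our part and should be checked against the layout of the specific PatStat edition in use.

```python
import csv
from collections import defaultdict

# Read the person table (TLS206_PERSON): one row per person_id, with name, address, country code.
persons = {}
with open("tls206_person.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        persons[row["person_id"]] = row

# Read the link table (TLS207_PERS_APPLN): one row per (person_id, appln_id) pair.
applications_by_person = defaultdict(list)
with open("tls207_pers_appln.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        applications_by_person[row["person_id"]].append(row["appln_id"])

# Join: for each person, print the raw name string and the linked application ids.
for person_id, person in persons.items():
    print(person_id, person.get("person_name"), applications_by_person.get(person_id, []))
```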
APPENDIX C – «NAME GAME» WORKSHOPS: A CALENDAR

For information on all events, please contact Michele Pezzoni ([email protected])

Past events:
ESF-APE-INV "Name Game" workshop – Paris, 25-26 November 2009
In collaboration with the Observatoire des sciences et des techniques (OST: http://www.obs-ost.fr)

Forthcoming:
ESF-APE-INV 2nd "Name Game" workshop – Madrid, 9-10 December 2010
In collaboration with: Instituto de Politicas y Bienes Publicos, Consejo Superior de Investigaciones Científicas (IPP-CSIC: http://www.iesam.csic.es/)

Joint "disambiguation" workshop – Cambridge MA, Spring 2010 (tbc)
In collaboration with: Institute for Quantitative Social Science (IQSS), Harvard University