Documentation of the datasets shared with participants of the PatentsView Inventor Disambiguation Technical Workshop
5 June 2015

Workshop introduction

PatentsView is a US Patent and Trademark Office (USPTO)-supported initiative to develop a patent data visualization and analysis platform that will increase the value, utility, and transparency of US patent data. We are seeking creative new approaches to obtaining better information on innovators and the new technologies they develop by disambiguating inventor names. The American Institutes for Research (AIR) is hosting this workshop as part of the PatentsView initiative.

We invite individual researchers or research teams to develop inventor disambiguation algorithms. The top fifteen teams will be invited to present their results at the final workshop, which will be held at USPTO headquarters on September 24, 2015. At the final workshop, we will identify a single approach that can be scaled and integrated into the next version of the PatentsView data platform. The researcher or team that contributes the most effective algorithm will receive a $25,000 stipend. The stipend will support the team's work on the algorithm and compensate team members for their technical guidance on integration efforts. Workshop participation is open to US and foreign nationals in academia or the private sector. US government employees are not eligible to participate.

Labeled research datasets

The set of files we have provided is drawn from five labeled research datasets. These research datasets were generously provided by the researchers who previously developed, curated, and validated the data for research purposes.

1. One training dataset contains deidentified pairwise comparisons of records that originally come from a human-labeled research dataset of 98,762 labeled USPTO records corresponding to inventors of optoelectronic patents—hereinafter the OE labeled dataset. The OE labeled dataset is a hand-disambiguated dataset that comes from a study on economic downturns, inventor mobility, and technology trajectories in OE (Akinsanmi et al., 2014). It consists of four subsamples of inventors with varying characteristics. The target populations of the four subsamples were: top inventors by number of patents before 1999; top inventors by rate of patenting before 1999; all inventors with patents in a technological sub-field of OE corresponding to USPTO subclass 385/14, an emerging sub-field of OE called "integration" (on which Akinsanmi et al., 2014 focused); and a random sample of all OE inventors with no patents in subclass 385/14. Because of confidentiality agreements, the AIR team did not have access to the OE labeled dataset itself, only to a deidentified analysis file of pairwise record comparisons. Akinsanmi et al., 2014 and Ventura et al., 2015 provide detailed information about the OE labeled dataset as well as about the deidentified pairwise comparison dataset that is available to workshop participants.

2. The second of the training datasets comes from a human-labeled research dataset of 42,376 labeled USPTO patent-inventor records corresponding to a subset of 4,801 academics in the life sciences with patents—hereinafter the ALS labeled dataset. This research dataset was kindly provided by Pierre Azoulay.
In putting together this dataset, Azoulay leveraged data from the Association of American Medical Colleges (AAMC) Faculty Roster, the NIH consolidated grant application file, and the database of patents issued by the USPTO during the period from 1976 to 2004. Detailed information on the construction of this dataset can be found in Azoulay et al., 2007 and Azoulay et al., 2012. Because of confidentiality agreements, workshop participants will not have access to the labeled ALS dataset, only to a deidentified analysis file of pairwise record comparisons. Below is a detailed explanation of how the AIR team built this deidentified analysis file. We also provide an example program that shows how this file was created from underlying patent data.

3. The third research dataset has 9,156 patent-inventor records with 3,845 unique Israeli inventors who have patented in the US—hereinafter the IS labeled dataset. The dataset was kindly provided by Manuel Trajtenberg, and further information on how it was constructed can be found in Trajtenberg and Shiff, 2008.

4. The fourth research dataset has 96,104 labeled patent-inventor records with 14,293 unique engineers and scientists with patents—hereinafter the E&S labeled dataset. The dataset was kindly provided by Ivan Png and comes from the work done by Chunmian et al., 2015.

5. The fifth research database contains information on European patents and inventors from the European Patent Office (EPO). This information was kindly shared by Francesco Lissoni. This research database has a labeled dataset—hereinafter the EPO labeled dataset—in which every observation contains a pair of uniquely identified "inventor+patent" combinations, plus information on whether the two inventors in the pair are in reality the same person and/or share some trait (e.g. the address, the city, the name, the surname, or a combination of these elements).

Training files: unmodified files from researchers

The files described in this section are provided to workshop participants in the same format in which they were received from researchers; they are "unmodified" in that sense. However, in the case of the IS and E&S datasets, we have provided data for a random sample of 80% of the inventors that appeared in the original data. The remaining data will be used for evaluation. As noted above, we are unable to release identifying data for the inventors in the ALS dataset due to a confidentiality agreement, so there are no unmodified files for that dataset.

OE labeled dataset

Files available for download from http://www.cmu.edu/epp/disambiguation

The Sample Pairwise Comparisons Dataset contains 15,000 pairwise comparison records. Each pair is labeled as either a pairwise match (if the pair refers to the same unique individual) or a pairwise non-match (if the pair refers to two unique individuals). These records are a subset of the full pairwise comparisons dataset, which contains all possible comparisons of record pairs from the authors' labeled USPTO optoelectronics inventor records dataset. The full dataset is very large, with 105,407,940 comparisons.

The link also provides a data dictionary of the fields present in the pairwise comparison datasets, which we transcribe here: "Each field contains a prefix (e.g. 'first', 'last', 'city', etc.), indicating which field in the original dataset is being compared, and a suffix (e.g. 'j', 'l', 'e', 's', etc.), indicating the type of comparison being made.
The prefix and suffix are separated by an underscore ('_').

"The fields being compared:

first : the first name field of the inventor record
last : the last name field of the inventor record
mid : the middle name field of the inventor record
suffix : the name suffix (e.g. "Jr.", "III", etc.) of the inventor record
city : the city name field of the inventor record; cities are associated with the inventor of the patent, not the assignee
st : the state abbreviation (e.g. "PA", "CA", "NY") field of the inventor record; states are associated with the inventor of the patent, not the assignee
country : the country abbreviation (e.g. "USA", "GB", "JP", etc.) field of the inventor record; countries are associated with the inventor of the patent, not the assignee
ass : the assignee corresponding to the inventor record's patent
class : the list of technology classes corresponding to the inventor record's patent
subclass : the list of technology subclasses corresponding to the inventor record's patent
coinv : the list of co-inventors on the inventor record's patent
fileyear : the year the inventor record's patent was filed

The types of comparisons:

j : Jaro-Winkler String Similarity
l : Levenshtein String Similarity
e : Exact matching (1 if exactly equal, 0 otherwise)
s : Exact matching performed on the SoundEx abbreviations of the pair
a : Absolute difference between numerical values (e.g. file year of the patent)
p : Jaro-Winkler Similarity of the phonetic representations of the strings
p : Levenshtein Similarity of the phonetic representations of the strings
1, 2, or 3 : Exact matching performed on the first 1, 2, or 3 characters of the strings
set : Jaccard coefficient of the two sets (e.g. sets of co-inventors)

(Note: although described by the authors, some of these fields were not provided in the pairwise comparison datasets.)

"Before making any comparisons, all text fields are converted to all capital letters, punctuation is removed, and leading/trailing whitespace is removed."

Finally, a Match flag indicates whether the patent-inventor records being compared have the same inventor—as indicated in the labeled OE inventor dataset.
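To make these comparison suffixes concrete, the following minimal Python sketch computes a few features of this kind for a single pair of records. It is an illustration only, not the authors' code: the record layout, the feature names, and the use of the jellyfish library for Jaro-Winkler, Levenshtein, and SoundEx are all assumptions.

```python
import re
import jellyfish  # assumed third-party library for Jaro-Winkler, Levenshtein, and SoundEx


def normalize(text):
    """Upper-case, strip punctuation, and trim whitespace, as described in the quote above."""
    return re.sub(r"[^\w\s]", "", text or "").upper().strip()


def jaccard(a, b):
    """Jaccard coefficient of two sets (e.g. sets of co-inventors)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def compare_records(r1, r2):
    """Build illustrative pairwise features for two inventor records (hypothetical dict layout)."""
    f1, f2 = normalize(r1["first"]), normalize(r2["first"])
    l1, l2 = normalize(r1["last"]), normalize(r2["last"])
    return {
        "first_j": jellyfish.jaro_winkler_similarity(f1, f2),                  # 'j': Jaro-Winkler
        "last_l": jellyfish.levenshtein_distance(l1, l2),                      # 'l': Levenshtein (distance form)
        "last_e": 1 if l1 == l2 else 0,                                        # 'e': exact match
        "last_s": 1 if jellyfish.soundex(l1) == jellyfish.soundex(l2) else 0,  # 's': SoundEx match
        "fileyear_a": abs(r1["fileyear"] - r2["fileyear"]),                    # 'a': absolute difference
        "coinv_set": jaccard(r1["coinv"], r2["coinv"]),                        # 'set': Jaccard coefficient
    }
```

A real feature builder would loop over candidate record pairs and emit one row per pair, mirroring the structure of the Sample Pairwise Comparisons Dataset.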
IS labeled dataset

File: is_inventors.csv

This file contains one record for each labeled patent-inventor pair. It includes additional fields that may be useful for disambiguation, including first and last inventor name and inventor location fields. See Trajtenberg and Shiff, 2008, for additional information about this file.

E&S labeled dataset

Files: ens_inventors.csv and ens_patents.csv

Each record in the file ens_inventors.csv identifies an inventor-year pair. The inventor is identified by a unique ID number (lkn_id) and a URL to a LinkedIn page. The file ens_patents.csv links inventors to patent numbers; it includes both a raw patent number and a cleaned patent number provided by Png et al. See Chunmian et al., 2015, for additional information about these files.

EPO labeled dataset

Files: benchmark_epfl.rar and benchmark_france.rar

Each of these files is a compressed archive that contains five CSV files: benchmark_id.csv, clean_address.csv, clean_name.csv, match.csv, and raw.csv. The following description is taken from Lissoni et al., 2010; refer to that document for additional information about these databases. "The two most important tables are RAW and MATCH, the latter providing the information necessary to calculate precision and recall rates of algorithms applied to Person_IDs, as identified by the RAW table. CLEAN_ADDRESS and CLEAN_NAME contain additional information that participants to the 'Name Game' challenge may find useful in order to compare the inventors' names and addresses, as parsed and cleaned by their algorithms, to the inventors' names and addresses parsed, cleaned, and hand-checked by the author of the benchmark database."

Training files: processed files provided by AIR

In addition to the "raw" files described above, AIR provides additional datasets that were created by linking researcher datasets to patent databases. The first is an analytical file created from the ALS dataset and modeled after the files provided by researchers in the OE dataset. The second consists of multiple files created by linking inventors and patents in the IS and E&S datasets to processed bulk patent data. These processed files are provided as a convenience and may give a starting point for disambiguation work. We heartily encourage participants to create their own linked training files based directly on the processed bulk patent data. We also provide the IPC patent classifications for patents that appear in the EPO labeled dataset.

Linking inventors to processed bulk patent data

In order to create the processed files, we linked inventor names in the ALS, IS, and E&S datasets to inventor names in the processed bulk patent data. We used the following rules to create these links (a simplified code sketch appears after this section):

1. If the inventor's last name is unique among co-inventors on the patent, use the exact last name match.
2. When there is no exact last name match, use the record with the best fuzzy last name match if the string comparison on the best match is greater than 0.88.
3. If there is more than one exact last name match, and there is exactly one first name with a fuzzy match score greater than 0.88, use that record.

Workshop participants who choose to use these files should consider possible bias in the linked data due to these matching rules.

The ALS and IS labeled datasets contain parsed inventor names. The E&S dataset does not contain parsed inventor names, but the AIR team was able to parse out the majority of inventor names from a data field containing names embedded in LinkedIn URLs. We applied the above inventor-linking rules to the extracted names and discarded inventors for which there was no match. For example, given 'http://www.linkedin.com/pub/john-a-johnson-phd/26/2C7/842', we parsed out '[john, a, johnson, phd]' and created a list of stopwords that consisted mostly of degrees and professional certifications, so in this example we would be left with '[john, a, johnson]'. After removing stopwords, we treated the first word in the list as a first name, the last word as a last name, and anything else was concatenated to form a middle name. We did not consider URLs where the individual's name was not hyphenated, for example 'http://www.linkedin.com/in/michaelbeigel'.
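The name extraction just described can be sketched in a few lines of Python. This is a reconstruction for illustration, not the AIR team's actual code; the stopword list here is a hypothetical stand-in.

```python
from urllib.parse import urlparse

# Hypothetical stand-in for AIR's stopword list, which consisted mostly of
# degrees and professional certifications.
STOPWORDS = {"phd", "md", "mba", "jd", "pe"}


def parse_linkedin_name(url):
    """Return (first, middle, last) parsed from a hyphenated LinkedIn profile URL, or None."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if not parts:
        return None
    slug = parts[1] if len(parts) > 1 else parts[0]      # e.g. 'john-a-johnson-phd'
    if "-" not in slug:
        return None                                      # non-hyphenated names were not considered
    words = [w for w in slug.lower().split("-") if w not in STOPWORDS]
    if not words:
        return None
    first, last = words[0], words[-1]
    middle = " ".join(words[1:-1])                       # everything in between forms the middle name
    return first, middle, last


# Example from the text above: prints ('john', 'a', 'johnson')
print(parse_linkedin_name("http://www.linkedin.com/pub/john-a-johnson-phd/26/2C7/842"))
# The non-hyphenated case is skipped: prints None
print(parse_linkedin_name("http://www.linkedin.com/in/michaelbeigel"))
```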
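The three linking rules above can be sketched as follows. This is a simplified reconstruction under stated assumptions, not the code AIR ran: the record fields (name_first, name_last) are hypothetical, and the Jaro-Winkler comparator is assumed to come from the jellyfish library.

```python
import jellyfish  # assumed library providing a Jaro-Winkler string comparator

THRESHOLD = 0.88  # fuzzy-match cutoff used in rules 2 and 3


def link_inventor(researcher, patent_inventors):
    """Pick the patent-inventor record that matches a researcher, following rules 1-3 above.

    `researcher` and each element of `patent_inventors` are dicts with hypothetical
    'name_first' and 'name_last' keys; returns None when no rule produces a link.
    """
    if not patent_inventors:
        return None
    last = researcher["name_last"].upper()
    first = researcher["name_first"].upper()

    exact = [inv for inv in patent_inventors if inv["name_last"].upper() == last]

    # Rule 1: the last name matches exactly one co-inventor on the patent.
    if len(exact) == 1:
        return exact[0]

    # Rule 2: no exact last-name match; take the best fuzzy last-name match above the threshold.
    if not exact:
        score = lambda inv: jellyfish.jaro_winkler_similarity(last, inv["name_last"].upper())
        best = max(patent_inventors, key=score)
        return best if score(best) > THRESHOLD else None

    # Rule 3: several exact last-name matches; keep a record only if exactly one
    # first name scores above the threshold.
    close = [inv for inv in exact
             if jellyfish.jaro_winkler_similarity(first, inv["name_first"].upper()) > THRESHOLD]
    return close[0] if len(close) == 1 else None
```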
ALS labeled dataset, pairwise comparison file

File: als_training_data.csv

Because we were not able to release the names and patent numbers in the ALS dataset due to a confidentiality agreement, AIR created an analytical file modeled after the files provided in the OE dataset (Ventura et al., 2015). While we have attempted to recreate the comparisons used in the Sample Pairwise Comparisons Dataset provided in the OE training data, we note that the exact parameters of the Jaro-Winkler string comparator can vary between implementations. We provide an example program that shows how this file was created from underlying patent data (forthcoming).

The fields being compared:

first : the first name field of the inventor record
last : the last name field of the inventor record
mid : the middle name field of the inventor record
suffix : the name suffix (e.g. "Jr.", "II", etc.) of the inventor record
city : the city name field of the inventor record; cities are associated with the inventor of the patent, not the assignee
state : the state abbreviation field of the inventor record; states are associated with the inventor of the patent, not the assignee
country : the country abbreviation (e.g. "USA", "GB", "JP", etc.) field of the inventor record; countries are associated with the inventor of the patent, not the assignee
assignee : the assignee corresponding to the inventor record's patent
class : the list of technology classes corresponding to the inventor record's patent
subclass : the list of technology subclasses corresponding to the inventor record's patent
coinventor : the list of co-inventors on the inventor record's patent

The types of comparisons:

j : Jaro-Winkler string comparator, rounded to two digits
e : Exact comparison (1 if exactly equal, 0 otherwise)
s : Exact comparison between SoundEx codes
same : Jaccard similarity coefficient

In particular, the field same_coinventor shows the number of common co-inventors (determined by exact string comparison) divided by the total number of co-inventors on both patents. The fields same_class and same_subclass are computed in the same way using USPC classifications. The match flag indicates whether the two patent records being compared have the same label in your data or not. Blank cells indicate that one (or both) of the records did not contain a value for the given field.
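As a worked illustration of the same_* fields, the sketch below computes a Jaccard-style overlap between two lists of co-inventors (or classes, or subclasses). It reflects one reading of the description above, with the "total" taken as the number of distinct values across both patents; it is not the program that produced als_training_data.csv.

```python
def same_field(values_a, values_b):
    """Share of common values between two lists, using exact string comparison
    after simple normalization; returns None to stand for a blank cell."""
    a = {v.strip().upper() for v in values_a}
    b = {v.strip().upper() for v in values_b}
    if not a or not b:
        return None  # one (or both) records had no value for this field
    return len(a & b) / len(a | b)


# One shared co-inventor out of three distinct names across both patents -> 0.33...
print(same_field(["SMITH, JOHN", "LEE, ANNA"], ["LEE, ANNA", "GARCIA, MARIA"]))
```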
Combined IS and E&S training files

Files: td_patent.csv, td_inventor.csv, td_assignee.csv, and td_class.csv

Manuel Trajtenberg and Ivan Png gave consent to share inventor names from their labeled datasets with workshop participants. In this section we describe linked files that were created from bulk patent data. For information about the researcher files, see Trajtenberg et al., 2008, and Chunmian et al., forthcoming. We note again that workshop participants will have access to inventor names and patent numbers in the researcher files, and they are free and encouraged to create their own files from bulk patent data or other external sources for analysis. The use of any external data should be included in the "intent to participate" that is submitted to the AIR workshop team. AIR must approve in advance the use of any external datasets in this workshop.

Our linked patent data consists of four data files containing labeled records from the IS and E&S research datasets (a sketch of how the files can be joined follows the list below). The inventors and linked patents that appear in these files are the same inventors that were selected in the 80% sample used to create the "unmodified" files for these datasets.

1. Patent file: td_patent.csv. The dataset has 68,563 observations and the following fields:

patent_number : US patent number
date : Date of the patent grant
abstract : Text of the patent abstract
title : Patent title
num_claims : Number of claims made in the patent

2. Inventor file: td_inventor.csv. The dataset has 209,405 observations and the following fields:

patent_number : Number of the patent in the patent-inventor record
name_first : First name of the inventor in the patent-inventor record
name_last : Last name of the inventor in the patent-inventor record
city : City of the inventor (if the city is in the US)
state : State of the inventor (if the state is in the US)
country : Country of the inventor
id_ens : Unique identifier for the inventor in the E&S labeled dataset
id_is : Unique identifier for the inventor in the IS labeled dataset

We keep the two unique identifiers in separate fields so that workshop participants can see which dataset a given label was taken from. Please note that since we merged some USPTO source data fields into the IS and E&S research database (using the patent number), there are some inventor-patent records that do not have a unique identifier for the inventor. In those cases, the unlabeled inventor is a co-inventor from a labeled patent-inventor record that appears in IS or E&S.

3. Assignee file: td_assignee.csv. This data file has 60,917 observations and the following fields:

patent_number : Number of the patent in the patent-inventor record
name_first : First name of the assignee in the patent-inventor record, if the assignee is an individual
name_last : Last name of the assignee in the patent-inventor record, if the assignee is an individual
organization : Name of the organization if the assignee is a company
sequence : A number (0-4) indicating the order of appearance of the assignee on the patent

4. USPC patent class file: td_class.csv. This data file has 264,910 observations and the following fields:

patent_number : Number of the patent in the patent-inventor record
mainclass_id : Code of the patent's main class
subclass_id : Code of the patent's subclass
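To show how the four files fit together, here is a minimal pandas sketch that joins them on patent_number. The file paths and the particular aggregation choices (first-listed assignee, count of distinct main classes) are assumptions made for illustration.

```python
import pandas as pd

# Load the four linked training files (paths assume the current working directory).
patents = pd.read_csv("td_patent.csv")
inventors = pd.read_csv("td_inventor.csv")
assignees = pd.read_csv("td_assignee.csv")
classes = pd.read_csv("td_class.csv")

# One row per patent-inventor record, carrying patent-level attributes.
records = inventors.merge(patents, on="patent_number", how="left")

# Patents can list several assignees and several USPC classes, so collapse each
# to one row per patent before merging.
first_assignee = (assignees.sort_values("sequence")
                  .drop_duplicates("patent_number")[["patent_number", "organization"]])
n_classes = (classes.groupby("patent_number")["mainclass_id"]
             .nunique()
             .rename("n_mainclasses")
             .reset_index())

records = (records.merge(first_assignee, on="patent_number", how="left")
                  .merge(n_classes, on="patent_number", how="left"))

# Labeled rows carry id_is or id_ens; co-inventor rows merged in from bulk data carry neither.
labeled = records[records["id_is"].notna() | records["id_ens"].notna()]
print(len(records), len(labeled))
```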
IPC patent class data for EPO labeled dataset (forthcoming)

When available, this file will show the IPC patent classes assigned to patents that appear in the EPO labeled dataset.

The value of inventor disambiguation

The purpose of this workshop is to encourage researchers to develop novel approaches to disambiguating inventor identity in US patent data. Unambiguously identifying inventors is critical to researching patterns and trends in US and international patenting activity. For example, identifying inventors allows researchers to study their mobility patterns, both in space and across companies (Trajtenberg et al., 2008; Marx et al., 2009; Png et al., 2014), as well as their social capital, as measured within co-inventor networks (Fleming, 2007). Also, when inventors are uniquely identified, it is possible to use additional information at the individual level, such as scientific publications, professional affiliations, and employment history.

References

Akinsanmi, E., Reagans, R., Fuchs, E., 2014. Economic Downturns, Technology Trajectories, and the Careers of Scientists. Carnegie Mellon University Working Paper.

Azoulay, P., Michigan, R., Sampat, B.N., 2007. The anatomy of medical school patenting. N. Engl. J. Med. 357 (20), 2049–2056. http://dx.doi.org/10.1056/NEJMsa067417.

Azoulay, P., Zivin, J.S.G., Wang, J., 2010. Superstar extinction. Q. J. Econ. 125 (2), 549–589.

Azoulay, P., Zivin, J.S.G., Sampat, B.N., 2012. The diffusion of scientific knowledge across time and space: evidence from professional transitions for the superstars of medicine. In: Lerner, J., Stern, S. (Eds.), The Rate & Direction of Inventive Activity Revisited. University of Chicago Press, Chicago, IL, pp. 107–155.

Chunmian Ge, Ke-wei Huang, and I.P.L. Png. Engineer/Scientist Careers: Patents, Online Profiles, and Misclassification. Strategic Management Journal, forthcoming.

Fleming, L., King III, C., Juda, A., 2007. Small world and regional innovation. Organ. Sci. 18 (6).

Fleming, L., Singh, J., 2010. Lone inventors as sources of breakthroughs: myth or reality? Manag. Sci. 56 (1), 41–56.

Lissoni, F., Maurino, A., Pezzoni, M., Tarasconi, G., 2010. Ape-Inv's "Name Game" Algorithm Challenge: A Guideline for Benchmark Data Analysis & Reporting. Version 1.2.

Marx, M., Strumsky, D., Fleming, L., 2009. Mobility, skills, and the Michigan non-compete experiment. Manag. Sci. 55 (6), 875–889.

Trajtenberg, M., Shiff, G., Melamed, R., 2008. Identification and Mobility of Israeli Patenting Inventors. Working Paper.

Ventura, S., Nugent, R., Fuchs, E., 2015. Seeing the Non-Stars: (Some) Sources of Bias in Past Disambiguation Approaches and a New Public Tool Leveraging Labeled Records.