Documentation of the datasets shared with participants of the
PatentsView Inventor Disambiguation Technical Workshop
5 June 2015
Workshop introduction
PatentsView is a US Patent and Trademark Office (USPTO)-supported initiative to develop a patent
data visualization and analysis platform that will increase the value, utility, and transparency of US
patent data. We are seeking creative new approaches to get better information on innovators and
the new technologies they develop by disambiguating inventor names.
The American Institutes for Research (AIR) is hosting this workshop as part of the PatentsView
initiative. We invite individual researchers or research teams to develop inventor disambiguation
algorithms. The top fifteen teams will be invited to present their results at the final workshop, which
will be held at USPTO headquarters on September 24, 2015. At the final workshop, we will identify
a single approach that can be scaled and integrated into the next version of the PatentsView data
platform.
The researcher or team that contributes the most effective algorithm will receive a $25,000 stipend.
The stipend will support the team’s work on the algorithm and compensate team members for their
technical guidance on integration efforts.
Workshop participation is open to US and foreign nationals in academia or the private sector. US
government employees are not eligible to participate.
Labeled research datasets
The set of files we have provided is drawn from five labeled research data sets. These research datasets
were generously provided by the researchers who previously developed, curated, and validated the
data for research purposes.
1. One training dataset contains deidentified pairwise comparisons of records that originally
come from a human-labeled research dataset that consists of 98,762 labeled USPTO records
corresponding to inventors of optoelectronic patents—hereinafter OE labeled dataset.
The OE labeled dataset is a hand-disambiguated dataset that comes from a study on economic
downturns, inventor mobility, and technology trajectories in OE (Akinsanmi et al., 2014). It
consists of four subsamples of inventors with varying characteristics. The target populations
of the four subsamples were as follows: top inventors by number of patents before 1999;
top inventors by rate of patenting before 1999; all inventors with patents in a technological
sub-field of OE corresponding to USPTO subclass 385/14, an emerging sub-field of OE called
“integration” (on which Akinsanmi et al., 2014 were focused); and a random sample of all OE
inventors with no patents in sub-class 385/14.
Because of confidentiality agreements, the AIR team did not have access to the OE labeled
dataset itself, only to a deidentified analysis file of pairwise record comparisons. Akinsanmi et al.,
2014 and Ventura et al., 2015 provide detailed information about the OE labeled dataset as well
as about the deidentified pairwise comparison dataset that is available to workshop participants.
2. The second of the training datasets comes from a human-labeled research dataset of 42,376
labeled USPTO patent-inventor records corresponding to a subset of 4,801 academics in the life
sciences with patents—hereinafter ALS labeled dataset. This research dataset was kindly
provided by Pierre Azoulay. In putting together this dataset, Azoulay leveraged data from
the Association of Medical Colleges (AAMC) Faculty Roster, the NIH consolidated grant
application file, and the database of patents issued by USPTO during the period from 1976 to
2004. Detailed information on how this dataset was constructed can be found in Azoulay et al., 2007
and Azoulay et al., 2012.
Because of confidentiality agreements, workshop participants will not have access to the labeled
ALS dataset, only to a deidentified analysis file of pairwise record comparisons. Below is a detailed
explanation of how the AIR team built this deidentified analysis file. We also provide an
example program that shows how this file was created from underlying patent data.
3. The third research dataset has 9,156 patent-inventor records covering 3,845 unique Israeli
inventors who have patented in the US—hereinafter IS labeled dataset. The dataset was kindly
provided by Manuel Trajtenberg and further information on how it was constructed can be
found in Trajtenberg and Shiff, 2008.
4. The fourth research dataset has 96,104 labeled patent-inventor records with 14,293 unique
engineers and scientists with patents—hereinafter E&S labeled dataset. The dataset was
kindly provided by Ivan Png and it comes from the work done by Chunmian et al., 2015.
5. The fifth research database contains information on European patents and inventors from the
European Patents Office (EPO). This information was kindly shared by Francesco Lissoni.
This research database has a labeled dataset—hereinafter EPO labeled dataset—in which
every observation contains a pair of uniquely identified combinations “inventor+patent”, plus
information on whether the two inventors in the pair are in reality the same person and/or
share some trait (e.g. the address, the city, the name or surname, or a combination of these
elements).
Training files: unmodified files from researchers
The files described in this section are provided to workshop participants in the same format in which
they were received from researchers; they are “unmodified” in that sense. However, in the case of
the IS and E&S datasets, we have provided data for a random sample of 80% of the inventors that
appeared in the original data. The remaining data will be used for evaluation.
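The key property of this split is that whole inventors, not individual records, are sampled, so that no inventor's records straddle the train/evaluation divide. A minimal sketch of this kind of grouped split follows; the field name inventor_id and the sampling mechanics are our illustrative assumptions, since AIR's actual procedure is not published here.

```python
import random

def split_by_inventor(records, frac=0.8, seed=0):
    """Hold out whole inventors, not individual records: every
    patent-inventor record for a sampled inventor goes to training,
    so no inventor's records appear in both splits."""
    inventors = sorted({r["inventor_id"] for r in records})
    rng = random.Random(seed)
    train_ids = set(rng.sample(inventors, int(frac * len(inventors))))
    train = [r for r in records if r["inventor_id"] in train_ids]
    holdout = [r for r in records if r["inventor_id"] not in train_ids]
    return train, holdout
```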
As noted above, we are unable to release identifying data for those inventors in the ALS dataset due
to a confidentiality agreement. Hence there are no unmodified files for that dataset.
OE labeled dataset
Files available for download from http://www.cmu.edu/epp/disambiguation
The Sample Pairwise Comparisons Dataset contains 15,000 pairwise comparison records. Each pair
is labeled as either a pairwise match (if the pair refers to the same unique individual) or a pairwise
non-match (if the pair refers to two unique individuals). These records are a subset of the full pairwise
comparisons dataset that contains all possible comparisons of record-pairs from the authors’ labeled
USPTO optoelectronics inventor records dataset. The full dataset is very large, with 105,407,940
comparisons.
The link also provides a data dictionary of the fields present in the pairwise comparison datasets,
which we transcribe here:
“Each field contains a prefix (e.g. ‘first’, ‘last’, ‘city’, etc), indicating which field in the original dataset
is being compared, and a suffix (e.g. ‘j’, ‘l’, ‘e’, ‘s’, etc), indicating the type of comparison being
made. The prefix and suffix are separated by an underscore (’_’).
“The fields being compared:
first the first name field of the inventor record
last the last name field of the inventor record
mid the middle name field of the inventor record
suffix the name suffix (e.g. "Jr.", "III", etc) of the inventor record
city the city name field of the inventor record; cities are associated with the inventor of the patent,
not the assignee
st the state abbreviation (e.g. "PA", "CA", "NY") field of the inventor record; states are associated
with the inventor of the patent, not the assignee
country the country abbreviation (e.g. "USA", "GB", "JP", etc) field of the inventor record; countries
are associated with the inventor of the patent, not the assignee
ass the assignee corresponding to the inventor record’s patent
class the list of technology classes corresponding to the inventor record’s patent
subclass the list of technology subclasses corresponding to the inventor record’s patent
coinv the list of co-inventors on the inventor record’s patent
fileyear the year the inventor record’s patent was filed
The types of comparisons:
j : Jaro-Winkler String Similarity
l : Levenshtein String Similarity
e : Exact matching (1 if exactly equal, 0 otherwise)
s : Exact matching performed on the SoundEx abbreviations of the pair
a : Absolute difference between numerical values (e.g. file year of the patent)
p : Jaro-Winkler Similarity of the phonetic representations of the strings
p : Levenshtein Similarity of the phonetic representations of the strings
1, 2, or 3 : Exact matching performed on the first 1, 2, or 3 characters of the strings
set : Jaccard coefficient of the two sets (e.g. sets of co-inventors)1
“Before making any comparisons, all text fields are converted to all capital letters, punctuation is
removed, and leading/trailing whitespace is removed.”
Finally, a Match flag indicates whether the patent-inventor records being compared have the same
inventor—as indicated in the labeled OE inventor dataset.
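As an illustration of these comparison types, the following sketch implements the suffixes that are straightforward with the Python standard library alone (Jaro-Winkler and SoundEx require dedicated implementations or third-party libraries and are omitted here); the normalization step follows the quoted pre-processing rules. This is our illustration, not the authors' code.

```python
import string

def normalize(s: str) -> str:
    """Pre-processing: uppercase, strip punctuation, trim whitespace."""
    return s.upper().translate(str.maketrans("", "", string.punctuation)).strip()

def levenshtein_similarity(a: str, b: str) -> float:
    """'l' suffix: 1 - edit distance / length of the longer string."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))

def exact(a: str, b: str) -> int:
    """'e' suffix: 1 if exactly equal, 0 otherwise."""
    return int(a == b)

def prefix_match(a: str, b: str, k: int) -> int:
    """'1', '2', '3' suffixes: exact match on the first k characters."""
    return int(a[:k] == b[:k])

def jaccard(a: set, b: set) -> float:
    """'set' suffix: Jaccard coefficient of two sets (e.g. co-inventors)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```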
IS labeled dataset
File: is_inventors.csv
This file contains one record for each labeled patent-inventor pair. It includes additional fields that
may be useful to disambiguation, including first and last inventor name and inventor location fields.
See Trajtenberg and Shiff, 2008, for additional information about this file.
E&S labeled dataset
Files: ens_inventors.csv and ens_patents.csv
Each record in the file ens_inventors.csv identifies an inventor-year pair. The inventor is identified
by a unique ID number (lkn_id) and a URL to a LinkedIn page. The file ens_patents.csv links
inventors to patent numbers and includes both a raw patent number and a cleaned patent
number provided by Png et al. See Chunmian et al., 2015, for additional information about these
files.
EPO labeled dataset
Files: benchmark_epfl.rar and benchmark_france.rar
Each of these files is a compressed archive file that contains five CSV files: benchmark_id.csv,
clean_address.csv, clean_name.csv, match.csv, and raw.csv. The following description is taken
from Lissoni et al., 2010. Refer to that document for additional information about these databases.
“The two most important tables are RAW and MATCH, the latter providing the information necessary
to calculate precision and recall rates of algorithms applied to Person_IDs, as identified by the RAW
table. CLEAN_ADDRESS and CLEAN_NAME contain additional information that participants to
the ‘Name Game’ challenge may find useful in order to compare the inventors’ names and addresses,
as parsed and cleaned by their algorithms, to the inventors’ names and addresses parsed, cleaned,
and hand-checked by the author of the benchmark database.”
1 Note that some of the fields described by the authors were not provided in the pairwise comparison
datasets.
Training files: processed files provided by AIR
In addition to the “raw” files described above, AIR provides additional datasets for participants,
created by linking researcher datasets to patent databases. The first is an analytical file
created from the ALS dataset and modeled after the files provided by researchers in the OE dataset.
The second data set consists of multiple files created by linking inventors and patents in the IS and
E&S datasets to processed bulk patent data. These processed files are created as a convenience and
may provide a starting point for disambiguation work. We heartily encourage participants to create
their own linked training files based directly on the processed bulk patent data. We also provide the
IPC patent classifications for patents that appear in the EPO labeled dataset.
Linking inventors to processed bulk patent data
In order to create the processed files, we linked inventor names in the ALS, IS, and E&S datasets to
inventor names in the processed bulk patent data. We used the following rules to create these links:
1. If the inventor’s last name is unique among coinventors on the patent, use the exact last name match
2. When there is no exact last name match, use the record with the best fuzzy last name match if
the string comparison on the best match is greater than 0.88
3. If there is more than one exact last name match, and there is exactly one first name with a
fuzzy match score greater than 0.88, use that record
Workshop participants who choose to use these files should consider possible bias in the linked data
due to these matching rules.
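The three rules above can be sketched as follows. The record dicts with name_last/name_first keys are hypothetical, and the document does not name the string comparator used, only the 0.88 threshold, so any [0, 1]-scaled comparator is passed in.

```python
from typing import Callable, List, Optional

def link_inventor(last: str, first: str,
                  candidates: List[dict],
                  fuzzy: Callable[[str, str], float],
                  threshold: float = 0.88) -> Optional[dict]:
    """Apply the three linking rules against the candidate records
    (all co-inventors on one patent). `fuzzy` is any string comparator
    scaled to [0, 1]; the document specifies a 0.88 threshold."""
    # Rule 1: unique exact last-name match among co-inventors
    exact_last = [c for c in candidates if c["name_last"] == last]
    if len(exact_last) == 1:
        return exact_last[0]
    if not exact_last:
        # Rule 2: no exact match -> best fuzzy last-name match above threshold
        best = max(candidates, key=lambda c: fuzzy(c["name_last"], last), default=None)
        if best is not None and fuzzy(best["name_last"], last) > threshold:
            return best
    else:
        # Rule 3: several exact last-name matches -> exactly one fuzzy first-name hit
        first_hits = [c for c in exact_last if fuzzy(c["name_first"], first) > threshold]
        if len(first_hits) == 1:
            return first_hits[0]
    return None  # no link; the record is discarded
```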
The ALS and IS labeled datasets contain parsed inventor names. The E&S dataset does not contain
parsed inventor names, but the AIR team was able to parse out the majority of inventor names from
a data field containing names embedded in LinkedIn URLs.2 We applied the above inventor-linking
rules to the extracted names and discarded inventors for which there was no match.
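The URL-parsing scheme described in footnote 2 can be sketched as follows. The stopword set here is a small illustrative stand-in; the AIR team's actual list of degrees and certifications is not published.

```python
import re
from typing import Optional

# Illustrative stopword list; AIR's actual list is not published.
STOPWORDS = {"phd", "md", "jr", "sr", "mba", "pe", "esq"}

def parse_linkedin_name(url: str) -> Optional[dict]:
    """Extract first/middle/last name tokens from a hyphenated
    LinkedIn profile URL, per the scheme in footnote 2."""
    m = re.search(r"linkedin\.com/(?:pub|in)/([a-z0-9-]+)", url)
    if not m:
        return None
    tokens = [t for t in m.group(1).split("-")
              if t and t not in STOPWORDS and not t.isdigit()]
    if len(tokens) < 2:
        return None  # un-hyphenated names (e.g. /in/michaelbeigel) are skipped
    return {"first": tokens[0],
            "middle": " ".join(tokens[1:-1]),
            "last": tokens[-1]}
```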
ALS labeled dataset, pairwise comparison file
File: als_training_data.csv
Because we were not able to release the names and patent numbers in the ALS dataset due to a
confidentiality agreement, AIR created an analytical file modeled after the files provided in the OE
dataset (Ventura et al., 2015). While we have attempted to recreate the comparisons used in the
Sample Pairwise Comparisons Dataset provided in the OE training data, we note that the exact
parameters of the Jaro-Winkler string comparator can vary between implementations.
We provide an example program that shows how this file was created from underlying patent data
(forthcoming).
The fields being compared:
first the first name field of the inventor record
last the last name field of the inventor record
mid the middle name field of the inventor record
suffix the name suffix (e.g. "Jr.", "II", etc) of the inventor record
city the city name field of the inventor record; cities are associated with the inventor of the patent,
not the assignee
state the state abbreviation field of the inventor record; states are associated with the inventor of
the patent, not the assignee
country the country abbreviation (e.g. "USA", "GB", "JP", etc) field of the inventor record; countries
are associated with the inventor of the patent, not the assignee
assignee the assignee corresponding to the inventor record’s patent
class the list of technology classes corresponding to the inventor record’s patent
subclass the list of technology subclasses corresponding to the inventor record’s patent
coinventor the list of co-inventors on the inventor record’s patent
2 For example, given: ‘http://www.linkedin.com/pub/john-a-johnson-phd/26/2C7/842‘, we parsed out ‘[john, a,
johnson, phd]‘ and created a list of stopwords that consisted mostly of degrees and professional certifications, so in this
example we would be left with ‘[john, a, johnson]‘. After removing stopwords, we treated the first word in the list as a
first name, the last word as a last name, and anything else was concatenated to form a middle name. We did not consider
URLs where the individual’s name was not hyphenated. For example: ‘http://www.linkedin.com/in/michaelbeigel‘
The types of comparisons:
j Jaro-Winkler string comparator, rounded to two digits
e Exact comparison (1 if exactly equal, 0 otherwise)
s Exact comparison between SoundEx codes
same Jaccard similarity coefficient
In particular, the field same_coinventor shows the number of common coinventors (determined
by exact string comparison) divided by the total number of coinventors on both patents. The fields
same_class and same_subclass are computed in the same way using USPC classifications.
The match flag indicates whether the two patent-inventor records being compared have the same
label. Blank cells indicate that one (or both) of the records did not contain a value for the
given field.
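Read together with the "same Jaccard similarity coefficient" entry above, "divided by the total number of coinventors on both patents" means the distinct items appearing on either patent. A sketch under that reading:

```python
def same_set(a, b):
    """Jaccard coefficient for the same_coinventor / same_class /
    same_subclass fields: items shared by both patents, divided by
    the distinct items appearing on either patent (exact string match)."""
    a, b = set(a), set(b)
    if not a or not b:
        return None  # blank cell: a record lacks a value for the field
    return len(a & b) / len(a | b)
```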
Combined IS and E&S training files
Files: td_patent.csv, td_inventor.csv, td_assignee.csv, and td_class.csv
Manuel Trajtenberg and Ivan Png gave consent to share inventor names from their labeled datasets
with workshop participants. In this section we describe linked files that were created from bulk
patent data. For information about the researcher files, see Trajtenberg et al., 2008, and Chunmian
et al., forthcoming. We note again that workshop participants will have access to inventor names and
patent numbers in the researcher files, and they are free and encouraged to create their own files
from bulk patent data or other external sources for analysis.
The use of any external data should be included in the “intent to participate” that is submitted to the
AIR workshop team. AIR must approve in advance the use of any external datasets in this workshop.
Our linked patent data consists of 4 different data files containing labeled records from the IS and
E&S research datasets. The inventors and linked patents that appear in these files are the same
inventors that were selected in the 80% sample that was used to create the “unmodified” files for
these datasets.
1. Patent file: td_patent.csv. The dataset has 68,563 observations and the following fields:
patent_number US patent number
date Date of the patent grant
abstract Text of the patent abstract
title Patent title
num_claims Number of claims made in the patent
2. Inventor file: td_inventor.csv. The dataset has 209,405 observations and the following fields:
patent_number Number of the patent in the patent-inventor record
name_first First Name of the inventor in the patent-inventor record
name_last Last Name of the inventor in the patent-inventor record
city City of inventor (if City is in the US)
state State of inventor (if State is in the US)
country Country of inventor
id_ens Unique identifier for inventor in E&S labeled dataset
id_is Unique identifier for inventor in IS labeled dataset
We keep the two unique identifiers in separate fields, so that workshop participants can see
which dataset a given label was taken from.
Please note that because we merged some USPTO source data fields into the IS and E&S research
databases (using the patent number), there are some inventor-patent records that do not have a
unique identifier for the inventor. In those cases, the unlabeled inventor is a co-inventor on a
labeled patent-inventor record that appears in IS or E&S.
3. Assignee file: td_assignee.csv. This data file has 60,917 observations and the following fields:
patent_number Number of the patent in the patent-inventor record
name_first First Name of the assignee in the patent-inventor record, if the assignee is an
individual.
name_last Last Name of the assignee in the patent-inventor record, if the assignee is an
individual.
organization Name of organization if the assignee is a company
sequence A number from 0 to 4 giving the order of appearance of the assignee on the patent.
4. USPC patent class file: td_class.csv. This data file has 264,910 observations and the following fields:
patent_number Number of the patent in the patent-inventor record
mainclass_id Code of the patent’s main class
subclass_id Code of the patent’s subclass
IPC patent class data for EPO labeled dataset (forthcoming)
When available, this file will show the IPC patent classes assigned to patents that appear in the EPO
labeled dataset.
The value of inventor disambiguation
The purpose of this workshop is to encourage researchers to develop novel approaches to disambiguating inventor identity in US patent data. Unambiguously identifying inventors is critical to
researching patterns and trends in US and international patenting activity. For example, identifying
inventors allows researchers to study their mobility patterns, both in space and across companies
(Trajtenberg et al. 2008; Marx et al., 2009; Png et al., 2014), as well as their social capital, as
measured within co-inventor networks (Fleming, 2007). Also, when inventors are uniquely identified,
it is possible to use additional information at the individual level such as scientific publications,
professional affiliations, and employment history.
References
Akinsanmi, E., Reagans, R., Fuchs, E., 2014. Economic Downturns, Technology Trajectories, and
the Careers of Scientists. Carnegie Mellon University Working Paper.
Azoulay, P., Michigan, R., Sampat, B.N., 2007. The anatomy of medical school patenting. N. Engl.
J. Med. 357 (20), 2049–2056, http://dx.doi.org/10.1056/NEJMsa067417.
Azoulay, P., Zivin, J.S.G., Wang, J., 2010. Superstar extinction. Q. J. Econ. 125 (2), 549–589.
Azoulay, P., Zivin, J.S.G., Sampat, B.N., 2012. The diffusion of scientific knowledge across time and
space: evidence from professional transitions for the super-stars of medicine. In: Lerner, J., Stern, S.
(Eds.), The Rate & Direction of Inventive Activity Revisited. University of Chicago Press, Chicago,
IL, pp. 107–155.
Chunmian Ge, Ke-wei Huang, and I.P.L. Png, “Engineer/Scientist Careers: Patents, Online Profiles,
and Misclassification”, Strategic Management Journal, forthcoming.
Fleming, L., King III, C., Juda, A., 2007. Small worlds and regional innovation. Organ. Sci. 18 (6).
Fleming, L., Singh, J., 2010. Lone inventors as sources of breakthroughs: myth or reality? Manag.
Sci. 56 (1), 41–56.
Lissoni, F., Maurino, A., Pezzoni, M., Tarasconi, G., 2010. Ape-Inv’s “Name Game” Algorithm
Challenge: A Guideline for Benchmark Data Analysis & Reporting. Version 1.2.
Marx, M., Strumsky, D., Fleming, L., 2009. Mobility, skills, and the Michigan non-compete
experiment. Manag. Sci. 55 (6), 875–889.
Trajtenberg M., Shiff G., Melamed R. (2008), “Identification and Mobility of Israeli Patenting
Inventors,” Working Paper.
Ventura, S., Nugent, R., and Fuchs, E. 2015. “Seeing the Non-Stars: (Some) sources of bias in past
disambiguation approaches and a new public tool leveraging labeled records”.