Living Network Centrality

Thomas Krichel
Long Island University & Novosibirsk State
University
5 May 2010
sponsors
● Nikos Askitas of IZA, for inviting me today.
● Vincent Bertone Jr of Miteq Corp., for the computation support. We have an 8-CPU machine that runs the calculations and temporarily hosts the web service.
● A previous version of these slides, prepared for a meeting exactly 4 years ago, was joint work with Nisa Bakkalbaşı.
structure of this talk
● background on RePEc
● RePEc author service
● centrality as an incentive device
● back to basics
● results using the RePEc author service
● implementation challenges
RePEc essence and history
● It is an open-access abstracting and indexing database about economics.
● It goes back to 1993, when Thomas Krichel started to build indexes of printed and online working papers in economics.
● Now it also covers journal articles and some other publication types such as books and book chapters.
what is interesting about RePEc
● Large
● Unfunded
● Relational
● Evaluation oriented
RePEc is large
● Over 550 archives contribute document data to the collection.
● There are about 350k items described. That is more than in arXiv.org, at some recent count.
● There are about 10 different user services that use RePEc data or process it further.
RePEc is unfunded
● While there are some sponsors for parts of RePEc, neither data collection nor service provision is externally sponsored.
● Most data about publications come from dedicated RePEc archives based at
  – economics departments at universities
  – other research centers
  – some specialized administrative units such as central banks.
● Services are mainly run by amateurs.
RePEc is relational
● RePEc registers not only documents but also researchers and their institutions.
● Institutions are centrally registered by one volunteer, Christian Zimmermann.
● People register with the RePEc Author Service (RAS). More about this later.
RePEc is evaluation-oriented
● Since we have identified authors, we can aggregate evaluative measures over authors and institutions.
● Recently, Christian Zimmermann has built a battery of 22 different indicators for individuals.
● This is a very rich dataset for scientometric exercises.
any questions?
RAS history and essence
● It goes back to 1999, when Thomas Krichel directed work by Markus Johannes Richard Klink to build a special author registration web interface.
● In 2002 the Open Society Institute contributed $50k to develop generic software to implement services such as RAS.
  – The software is written by Ivan V. Kurmanov.
  – It is called ACIS (Academic Contribution Information System).
how does RAS work?
● Authors contact RAS to let RePEc know what papers they have written.
  – Registrants create and maintain a personal profile.
  – Registrants create and maintain a name variations profile.
  – RAS creates and maintains a contributions profile.
● Once an initial profile is defined, ACIS has a mechanism called ARPU that alerts authors about documents being added to their profile.
● The contributions profile contains the names of all documents.
what is interesting about RAS?
● Registration of authors solves all problems of trying to identify authors by their names.
  – There are many ways to represent the same name, e.g. Bruno Van Pottelsbergh De la Potterie, proceedings page 128. Some RAS registrants have even longer names!
  – Many different authors may share the same name or the same way in which the name can be represented.
● Solving these problems "manually" is very expensive and only feasible for small sets of authors.
but RAS is not complete
● Bakkalbasi and Krichel (2006), http://openlib.org/home/krichel/papers/elba.pdf (the Elba paper), have shown that, at their time of writing,
  – roughly every third RePEc document has at least one registered author.
  – roughly every fourth RePEc authorship is captured by RAS.
● These figures are not likely to change very rapidly.
  – RAS gets more registrants.
  – RePEc gets more documents.
RAS and co-authorship
● In the Elba paper there is a conjecture that the fact that author A is registered does not significantly increase the chance that the co-authors of A are registered.
● This cannot be formally shown without labouring through attempts to identify authors by name.
● One indication is that the graph formed by co-author relationships in RAS is not dense. This has been found in recent work by Nisa Bakkalbasi.
registration incentive on co-authors
● To get authors to register, we need good incentives.
● In conventional (Zimmermann's 22) indicators, the position of an author depends only on the author's own actions.
● If we use co-authorship, we can devise rankings that depend on co-authorship.
● If we have such a ranking, authors will have incentives to get their co-authors to register.
imagine a RAS-CIS
● A RAS Collaboration Information System should be built.
● RAS-CIS could show the registrants
  – local information about shortest paths
  – network summaries via centrality indices
● The summary information will improve as more of the author's collaborators register.
two tasks to build RAS-CIS
● We have to select the measures to calculate and develop the tools to calculate them. This is what the paper is about.
● We have to build an interface that will allow intuitive access to that data. The data would have to be updated.
● Since there has been no similar service before, this is a hard task. It is not done here.
the job here
● We calculate different centrality rankings of authors.
● We compare the rankings among themselves.
● We want to select the measure that is best to use in a web-based collaboration centrality ranking service.
● RAS-CIS is still to be built fully. But I have built a running version under the title collec.repec.org for the meeting today.
collaboration graph
● From a social networking perspective, collaboration establishes a graph structure.
  – RAS authors are the nodes.
  – Collaboration, i.e. common claim(s) of the same paper, forms the edges between nodes.
  – If there is no common paper claimed by two authors, no edge exists between the nodes.
● Specific results depend on how the edge length is calculated from the collaboration structure.
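As a minimal sketch of the graph construction described on this slide, the following Python fragment turns paper claims into an undirected collaboration graph. The `claims` data and the function name are hypothetical and only illustrate the idea; they are not taken from RAS.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical claims: paper id -> set of registered authors who claimed it.
claims = {
    "p1": {"alice", "bob"},
    "p2": {"alice", "bob", "carol"},
    "p3": {"dave"},
}

def collaboration_edges(claims):
    """Return {author: set of co-authors} for authors sharing a claimed paper."""
    neighbours = defaultdict(set)
    for authors in claims.values():
        for a in authors:
            neighbours[a]  # touch the key so sole authors appear as isolated nodes
        for a, b in combinations(sorted(authors), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return dict(neighbours)

graph = collaboration_edges(claims)
```

Here `dave` has claimed a paper but shares none, so he is a node without edges.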
graph components
● If there is a path between one author A and another author B along collaboration edges, we say that A and B belong to the same component of the collaboration graph.
● It is commonly observed in real-world networks that the largest component is quite large. It usually has more than 50% of all nodes and is therefore known as the giant component.
● Most centrality measures are only meaningful for the members of the giant component.
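Components can be found with a plain breadth-first search. The sketch below assumes the graph is held as a dictionary of neighbour sets; the toy graph is invented for illustration.

```python
from collections import deque

def components(graph):
    """Split an undirected graph {node: set(neighbours)} into components."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), {start}
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    members.add(nxt)
                    queue.append(nxt)
        result.append(members)
    return result

# Hypothetical toy graph: one component of four authors, one of two.
toy = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"},
    "x": {"y"}, "y": {"x"},
}
giant = max(components(toy), key=len)  # the largest component
share = len(giant) / len(toy)          # its share of all nodes
```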
face the force of facts in 2010
● 24,000 registrants are found in RAS.
● ????? registrants (70% of registrants) are authors, i.e. they have claimed at least one paper.
● ???? registrants (66% of authors) are co-authors, i.e. they are authors who have collaborated with at least one other RAS author.
● 16000 registrants (83% of co-authors) are in the giant component.
the RAS nodes
● 16k authors is still a rather large network.
● There are at least 16k times 16k / 2 shortest paths between the authors, and many more other paths.
● Calculating a full set of shortest paths takes 8 days on an 8-CPU machine.
network type
● Between any two nodes, there is an edge if the authors have ever collaborated.
● But the length of the edge depends on your point of view of the strength of collaboration.
● Different edge lengths lead to different networks.
● We introduce three networks in the following three slides.
network 1: binary network
● In the binary network, the collaboration strength between any two authors is one if the two authors have claimed at least one common paper in RAS. The collaboration strength is zero otherwise.
● The edge length is the inverse of the collaboration strength.
● If the collaboration strength is zero, there is no edge between the two nodes.
● We use an algorithm by Newman to do the calculations.
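The slide does not spell out the algorithm. As a hedged illustration of the breadth-first building block that shortest-path calculations on a binary network rely on (not necessarily the exact procedure used here), the sketch below finds distances from one author and counts how many distinct shortest paths lead to each other author. The function name and the toy graph are assumptions.

```python
from collections import deque

def bfs_paths(graph, source):
    """Distances from source and the number of distinct shortest paths."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:              # reached for the first time
                dist[nxt] = dist[node] + 1
                count[nxt] = 0
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:  # node lies on a shortest route
                count[nxt] += count[node]
    return dist, count

# Hypothetical square a-b-d / a-c-d: two equally short routes from a to d.
square = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a", "d"}, "d": {"b", "c"}}
dist, count = bfs_paths(square, "a")
```

Even this four-node example has two shortest paths between `a` and `d`, which is the path multiplicity that binary networks tend to produce.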
network 2: symmetric weighted network
● In a symmetric weighted network, for each paper that two authors have claimed in common, we increment the collaboration strength between the two authors by the inverse of the number of authors on that paper minus 1, i.e. by 1/(n-1) for a paper with n authors.
● As a result, the total collaboration strength of an author is the number of co-authored papers.
● We used the Dijkstra algorithm to find the shortest paths. This will find only one shortest path.
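A minimal sketch of this network follows. It assumes that a paper with n authors adds 1/(n-1) to each pair of its authors (the weighting under which an author's total strength equals the number of co-authored papers) and that edge length is the inverse of strength, as in the binary network. All names and the sample papers are hypothetical.

```python
import heapq
from collections import defaultdict
from itertools import combinations

def symmetric_strengths(papers):
    """Each paper with n >= 2 authors adds 1/(n-1) to every pair on it."""
    strength = defaultdict(float)
    for authors in papers:
        n = len(authors)
        if n < 2:
            continue
        for a, b in combinations(sorted(authors), 2):
            strength[(a, b)] += 1.0 / (n - 1)
    return strength

def edge_lengths(strength):
    """Edge length is the inverse of collaboration strength, both directions."""
    graph = defaultdict(dict)
    for (a, b), s in strength.items():
        graph[a][b] = graph[b][a] = 1.0 / s
    return graph

def dijkstra(graph, source):
    """Shortest distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for nxt, length in graph[node].items():
            nd = d + length
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return dist

# Hypothetical papers, each given as the list of its registered authors.
papers = [["ann", "ben"], ["ann", "ben"], ["ben", "cat", "dan"]]
strength = symmetric_strengths(papers)
dist = dijkstra(edge_lengths(strength), "ann")
```

In this toy data `ben` has three co-authored papers and a total strength of 2 + 1/2 + 1/2 = 3.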
network 3: random walk network
● In this type of network, we normalize the total collaboration strength of each author to be one.
● This generates an asymmetric network where inward edges are shorter for important authors who have written more papers.
● This type of measure is used in SNA to measure prestige.
● We used the Dijkstra algorithm to find the shortest paths. This will find only one shortest path.
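The normalization can be sketched as follows, assuming each author's outgoing strengths are divided by their sum and that edge length remains the inverse of strength. The names and numbers are invented for illustration.

```python
def normalise(strength):
    """Scale each author's outgoing strengths so that they sum to one."""
    out = {}
    for author, partners in strength.items():
        total = sum(partners.values())
        out[author] = {p: s / total for p, s in partners.items()}
    return out

# Hypothetical symmetric strengths: "star" collaborates a lot, "new" only once.
strength = {
    "star": {"new": 1.0, "old": 3.0},
    "new": {"star": 1.0},
    "old": {"star": 3.0},
}
normed = normalise(strength)

# Edge length is the inverse of the normalised strength, so it is directed.
length = {a: {b: 1.0 / s for b, s in row.items()} for a, row in normed.items()}
```

The edge from `new` into `star` has length 1, while the edge from `star` to `new` has length 4: edges leading into the more prolific author are shorter.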
centrality measures
● For each network, we can look at two centrality measures.
  – closeness centrality: a node is more central if it has a shorter average shortest path leading to all other nodes.
  – betweenness centrality: a node is more central if it lies on more of the shortest paths leading from one node to another.
● These centrality measures rank authors from the most central to the least central.
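As a small illustration of the closeness definition, the sketch below averages shortest-path lengths per node and ranks nodes by that average. The distances are a made-up example on the line a - b - c, and the function name is an assumption.

```python
def closeness(dist, node):
    """Average shortest-path length from node to all other reachable nodes."""
    others = [d for target, d in dist.items() if target != node]
    return sum(others) / len(others)

# Hypothetical shortest-path lengths on the line a - b - c.
dist_from = {
    "a": {"a": 0, "b": 1, "c": 2},
    "b": {"a": 1, "b": 0, "c": 1},
    "c": {"a": 2, "b": 1, "c": 0},
}
scores = {n: closeness(d, n) for n, d in dist_from.items()}
ranking = sorted(scores, key=scores.get)  # smaller average = more central
```

The middle node `b` comes first in the ranking because its average distance to the others is the smallest.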
notation for centrality measures
● BIC: closeness centrality in the binary network
● BIB: betweenness centrality in the binary network
● SYC: closeness centrality in the symmetric weighted network
● SYB: betweenness centrality in the symmetric weighted network
● RWC: closeness centrality in the random walk network
● RWB: betweenness centrality in the random walk network
pair-wise Spearman rank correlation
from paper of 4 years ago
       BIC   BIB   SYC   SYB   RWC   RWB
BIC    1     .60   .90   .52   .89   .30
BIB    .60   1     .54   .81   .61   .57
SYC    .90   .54   1     .54   .91   .23
SYB    .52   .81   .54   1     .56   .42
RWC    .87   .61   .91   .56   1     .41
RWB    .30   .57   .23   .42   .41   1
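For readers who want to reproduce this kind of comparison, here is a minimal pure-Python sketch of the Spearman rank correlation, computed as the Pearson correlation of the rank vectors with tied values sharing their average rank. The two score lists are invented and are not the data behind the table.

```python
def ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return pearson(ranks(x), ranks(y))

# Hypothetical centrality scores for five authors under two measures.
measure_1 = [0.10, 0.40, 0.35, 0.80, 0.55]
measure_2 = [1.0, 3.0, 5.0, 9.0, 7.0]
rho = spearman(measure_1, measure_2)
```

The two toy measures disagree only on the order of the second and third author, so `rho` is high but below one.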
comments
● All three closeness measures produce very similar rankings.
● SYB and BIB are close, but RWB is quite far off both of them.
● Overall, the choice between betweenness and closeness seems to be more important than the choice between models. This has been a surprise to us. BIC and BIB correlate at only .60, and the closeness/betweenness pairs in the other two networks correlate even less.
adding the number of documents
● We can add the number of documents as an additional ranking criterion, NDO. We get

       NDO   BIC   BIB   SYC   SYB   RWC   RWB
NDO    1     .68   .55   .71   .60   .70   .19

● Overall, the weighted network appears to be best correlated with the number of documents. This should come as no surprise.
why add this alien number NDO?
● We can think of NDO as the simplest indication of the personal fame of an author.
● If we want to incentivize authors to climb the ranks of a collaboration centrality ranking, we need to have people at the top that they actually recognize.
● Remember Groucho Marx: "I'll never join a club that accepts me as a member".
● Thus the symmetric weighted network appears appealing.
symmetric weighted network
● If we use the symmetric weighted network in an interface, the numbers that come out for closeness are not intuitive because the total lengths are fractions.
● But the fact that there should be much less path multiplicity makes the presentation simpler.
● But the paths may be longer (in simple counts of intermediate nodes) than in the binary model.
RAS-CIS
● The most difficult aspect is to build the interface when there is no similar service present at this time.
● Updating cannot be done instantaneously, but ought to be close to it.
  – If the contributions profile of an author changes, we can recalculate her paths.
  – We can also recalculate the paths of her co-authors.
  – But then we end up with an overall network that is no longer symmetric.
more work
● RAS authorship data is a high-quality dataset that is easy to use.
● It is not widely used at this point.
● Note in particular that much of the data affecting collaboration has not been worked on:
  – affiliation data
  – journal/series data
  – subject classification data
● New ideas and partnerships welcome!
More history
● In September 2006 I started to work on a document that would describe a general software system to maintain centrality calculations and an interface.
● This is the Metz paper at http://openlib.org/home/krichel/work/metz.html
● It was first implemented by Dmitri Ishkov, but in a way that I did not like.
● I have recently been rewriting the software and the spec. After 4 years it has become a hard-hat area again.
basic ideas
● Software written in Perl for mod_fcgi.
● Can support a number of networks but does not automate addition and removal of networks.
● Computational intensity controlled by crontab entries.
● Perl manipulates XML structures (nuclea). All presentation work is done through XSLT.
● Very limited use of database technology.
key concept
● A source contains network data. These are descriptions of nodes.
● A nettype is a type of network. The nettype determines the structure of the network, i.e. the numbers in the edges matrix.
● Every network has a source and a nettype. All icanis functions are parameterized by them.
nodes table
● There is a single table for all nodes.
  – name
  – homepage
  – node_tist
  – path_tist
  – closeness
  – closeness_rank
  – betweenness
  – betweenness_rank
● In addition, there are a URL and a nodepage attribute that can be generated using a configuration Perl module.
path calculations
● All software that I know of basically calculates the paths from a single start point to all other nodes, as specified in the edges matrix at the time of calculation.
● The Metz paper specified some crude instructions for an incremental recalculation of paths.
● I have completely abandoned that approach.
path data store
● Historically, our attempts to feed paths into a database came to a sad end.
● Now the paths are held in a file, one per node.
● This implies that the same information is held twice on disk.
● At path search time, software determines the more recent source of path data and uses that one.
closeness update
● Closeness centrality can be calculated knowing information about a single node only.
● Closeness is immediately updated.
betweenness update
● This is particularly hard because the basis of calculations is all paths.
● Icanis uses a construct called the inter file. At path calculation time, a file called inter, in a directory determined by the handle of the starting node, is created or updated.
● It contains, for each node, the number of times it has been seen as an intermediary on the paths for this node.
● This enables easy betweenness calculations.
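The idea behind the inter file can be sketched in a few lines. The sketch keeps the counts in memory rather than in files, and the data layout (one dictionary of stored paths per starting node) is an assumption made for illustration, not the actual icanis format.

```python
from collections import Counter

def inter_counts(paths):
    """Count how often each node is an intermediary on one start node's paths.

    `paths` maps each destination to the node list of the stored path,
    start and destination included.
    """
    counts = Counter()
    for path in paths.values():
        counts.update(path[1:-1])  # drop the two end points
    return counts

# Hypothetical stored paths on the line a - b - c - d, one dict per start node.
paths_from = {
    "a": {"b": ["a", "b"], "c": ["a", "b", "c"], "d": ["a", "b", "c", "d"]},
    "b": {"a": ["b", "a"], "c": ["b", "c"], "d": ["b", "c", "d"]},
    "c": {"a": ["c", "b", "a"], "b": ["c", "b"], "d": ["c", "d"]},
    "d": {"a": ["d", "c", "b", "a"], "b": ["d", "c", "b"], "c": ["d", "c"]},
}
inter = {start: inter_counts(p) for start, p in paths_from.items()}

# Summing the per-start counts gives an unnormalised betweenness score.
betweenness = Counter()
for counts in inter.values():
    betweenness.update(counts)
```

When the paths of one starting node are recalculated, only that node's counts need to be replaced before the totals are summed again.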
ranking updates
● The paths database contains not only values for node criteria, but also their rankings.
● The ranking for each criterion is updated when the static ranking pages are calculated. The nodes pages refer to the rankings as calculated at that time.
node visualization
● One interface problem is the choice of representation of a node.
● All registrants can give us a homepage address, but it is optional and may not be up to date.
● We have the node_page, an internal page of an icanis implementer.
● We have the URL, an address of an external service we can link to for node information.
static html pages
● icanis tries to rely as much as possible on file-based responses.
● This is implemented for all browsable components.
● There is one file per node.
● There is one file per batch of criteria rank. The batch size is given as a run-time parameter at page renewal.
● Making paths browsable seems difficult. At this time only a search is supported.
RAS implementation
● RAS data forms a source called “ras”. Currently an implementation with the nettype “mans” (weighted symmetric network) exists at the address http://collec.repec.org/ras/mans.
● A search for nodes is still to be done. At the moment nodes can be browsed only. Destinations for paths can be searched.
problem with mans
● The results appear to lend themselves to a paradox: the shortest path between two collaborators who have written a paper together need not be their direct edge.
● This is the Joseph Pearlman / Thomas Sargent problem. I have observed it with these two authors.
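A purely hypothetical calculation shows how such a paradox can arise. It assumes that a paper with n authors adds 1/(n-1) to the strength s of each author pair and that the edge length ℓ is the inverse of the strength. The numbers are invented and are not the actual figures for these two authors. Suppose A and B share one five-author paper, while each of them has written two two-author papers with C:

```latex
\begin{align*}
s_{AB} &= \frac{1}{5-1} = \frac{1}{4}, & \ell_{AB} &= \frac{1}{s_{AB}} = 4,\\
s_{AC} = s_{BC} &= 2 \cdot \frac{1}{2-1} = 2, & \ell_{AC} = \ell_{BC} &= \frac{1}{2},\\
\ell_{AC} + \ell_{CB} &= 1 < 4 = \ell_{AB}.
\end{align*}
```

The path from A to B through C is then shorter than the direct edge, even though A and B are co-authors.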
a new binary network
● Since this is to appeal to humans rather than to computers, it appears to be best to return to a binary representation of edges.
● Since binary networks tend to produce a vast array of multiple shortest paths, the mans edge lengths can be used to eliminate all paths that do not have the shortest length in the weighted network.
● We can take a random selection of the rest.
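One possible reading of this proposal is sketched below: enumerate all binary shortest paths between two nodes, keep those with the smallest weighted length, and pick one of the remaining paths at random. The function names, the toy graph, and the edge lengths are all assumptions made for illustration.

```python
import random
from collections import deque

def all_shortest_paths(graph, source, target):
    """Every path with the fewest edges between source and target."""
    dist = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                parents[nxt] = []
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                parents[nxt].append(node)

    def build(node):
        if node == source:
            return [[source]]
        return [p + [node] for par in parents[node] for p in build(par)]

    return build(target) if target in dist else []

def pick_path(graph, length, source, target, rng=random):
    """Keep the binary shortest paths of least weighted length, pick one."""
    candidates = all_shortest_paths(graph, source, target)
    if not candidates:
        return None

    def weighted(path):
        return sum(length[a][b] for a, b in zip(path, path[1:]))

    best = min(weighted(p) for p in candidates)
    finalists = [p for p in candidates if weighted(p) == best]
    return rng.choice(finalists)

# Hypothetical square a-b-d / a-c-d with weighted ("mans") edge lengths.
graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
length = {
    "a": {"b": 0.5, "c": 1.0}, "b": {"a": 0.5, "d": 0.5},
    "c": {"a": 1.0, "d": 1.0}, "d": {"b": 0.5, "c": 1.0},
}
chosen = pick_path(graph, length, "a", "d")
```

Both routes from `a` to `d` have two edges, but only the route through `b` has the smaller weighted length, so it is the only one left to choose from.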
other application
● This technology can be extended to many domains.
● For example, I did some analysis using RePEc data into the centrality of JEL classification categories using relationships of classification numbers in actual economics papers.
● But that’s a topic for another day!
AuthorClaim
● This is Krichel’s magnum opus.
● A completely free author registration service for all disciplines and all types of academic documents.
● Started in 2008, it will occupy me until my dying day, I hope.
● Even if ORCID has gotten the hype, I think I can still make a useful contribution.
● Lives at http://authorclaim.org
3lib.org
● If you want to build an author registration system you need document data.
● I tried to get free access to CrossRef but was turned down.
● Now I have to build a free CrossRef-like database. This is 3lib.org.
many docs, no registrants
● Over 90 million authorships can be claimed in AuthorClaim.
● But there are just a few authors who have registered.
● This is a chicken-and-egg problem. With no services using the data, no authors claim papers. With no author data, the data is useless for services.
● I am forced to set up my own user service.
A Fethy Mili strategy
● Historic evidence has shown that when we show bad data about academics, they don’t get angry.
● They are eager for all exposure that they can get.
● So they will get in touch to help.
AuthorProfile
● We know that names are bad author identifiers.
● But we can show this poor data and make this visible.
● The basic component of AuthorProfile is therefore an aunex, an author name expression.
auversion
● Bibliographic data is indexed by document id.
● Auversion is the process by which bibliographic data is reindexed by aunex.
● A record by aunex is kept.
● It is a huge technical challenge.
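At toy scale, auversion amounts to inverting an index. The sketch below assumes records with an `authors` list of name strings; the record layout, the identifiers, and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical bibliographic records, indexed by document id.
documents = {
    "doc:1": {"title": "Paper one", "authors": ["T. Krichel", "N. Bakkalbasi"]},
    "doc:2": {"title": "Paper two", "authors": ["Thomas Krichel"]},
    "doc:3": {"title": "Paper three", "authors": ["T. Krichel"]},
}

def auvert(documents):
    """Reindex document records by author name expression (aunex)."""
    by_aunex = defaultdict(list)
    for doc_id, record in documents.items():
        for aunex in record["authors"]:
            by_aunex[aunex].append(doc_id)
    return dict(by_aunex)

index = auvert(documents)
```

Note that "T. Krichel" and "Thomas Krichel" end up as separate aunexes: the index is keyed by name expressions, not by identified authors.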
network structure
● A network structure can be built from co-aunexes on the same paper, as in co-authorship.
● A centrality calculation is not feasible.
● A second network structure can be built using an automated name variations profile.
adding a top
● Registered authors can be used to build browsable entry points.
● Destination aunexes can link back to the registered authors and report their network distance to them.
● This will increment the page rank of registered authors and generate a visibility payoff for them.
Thank you for your attention!
http://openlib.org/home/krichel
write to [email protected]