Impact of lexical and sentiment factors on the popularity of scientific

Impact of lexical and sentiment factors on the popularity of scientific papers
Julian Sienkiewicz and Eduardo G. Altmann
arXiv:1605.07465v1 [physics.soc-ph] 24 May 2016
Max Planck Institute for the Physics of Complex Systems, D-01187 Dresden, Germany
(Dated: May 25, 2016)
We investigate how textual properties of scientific papers relate to the number of citations they
receive. Our main finding is that correlations are non-linear and affect differently most-cited and
typical papers. For instance, we find that in most journals short titles correlate positively with
citations only for the most cited papers, for typical papers the correlation is in most cases negative.
Our analysis of 6 different factors, calculated both at the title and abstract level of 4.3 million papers
in over 1500 journals, reveals the number of authors, and the length and complexity of the abstract,
as having the strongest (positive) influence on the number of citations.
PACS numbers:
I.
INTRODUCTION
The number of citations an article receives can be considered a proxy for the attention or popularity the article
achieved in the scientific community. Citations play a
crucial role both in the evolution of science [1–5] as well
as in the bibliometric evaluation of scientists and institutions, in which case the number of citations is often tacitly taken as a measure of quality. Understanding which
factors in a paper contribute or correlate with citations
has been the subject of a number of investigations (see
Refs. [6–8] for reviews). Diversity in the affiliation of authors, multinationality, multidisciplinarity, and number
of references, figures or tables have all been identified as
factors that positively correlate with citations.
Here we perform a more systematic investigation of
how different textual properties of scientific papers affect
the number of citations they acquire (see Appendix A
for data description). A classical result, which motivates
our more general analysis, is the negative correlation between title length and citations (i.e., shorter the titles
more citations) [9–12]. In our analysis we consider additionally the complexity and the sentiment of the text
both at the title and the abstract, see table I. Lexical
complexity is usually considered as proportional to the
effort needed (by non-experts) to understand the texts.
The three measures of text complexity we use (see table I) take into account the number of different words
in the text (normalized by the length) and the length of
these words in syllables (see Appendix B for details). In
several previous studies authors used the concept of the
sentiment analysis (i.e., emotional content) of the examined text/messages. In general, psychologists are able to
specify several dimensions of emotions, reaching as far as
12 [13]. However two of them — valence and arousal —
are probably the best recognized and the most frequently
used. Valence reflects the emotional sign of the message
(negative, neutral, positive) while arousal is used to describe the level of activation (low, medium, high). Pairs
of valence and arousal can indicate the specific emotion
type [14], e.g., fear (negative and aroused), sad (negative and not aroused) etc., however they can be also utilized as independent variables. For example valence as a
standalone dimension has successfully been used to detect collective states of online users [15], to indicate the
end of online discussions [16] or to predict the dynamics
of Twitter users during Olympic Games in London [17].
Lately this kind of analysis has also been introduced to
judge upon the role of negative citations [18], citation
bias [19], and to check what boosts the diffusion of scientific content [20]. Here we quantify arousal and valence
through dictionary classifier, see Appendix C.
property
length
complexity
title
number of chars
—
z-index
Herdan’s C
valence
arousal
sentiment
abstract
number of words
Gunning fog index F
z-index
Herdan’s C
valence
arousal
number of authors
TABLE I: List of textual factors whose relation to citations
we investigate in our paper. Whenever possible, factors are
obtained on the title and abstract of a paper. See Appendices
B and C for exact definitions. Additionally, we consider the
number of authors (motivated by previous studies e.g., [7,
21]).
II.
RESULTS
We are interested in quantifying the relationship between X – a real number that quantifies for each paper
one of the textual factors listed in table I – and the logarithm of the number of citations Y ≡ ln(citations + 1).
We standardize X in order to be able to compare the
different factors (see Appendix D) and we use the citations provided by Web Of Science at the end of 2014 for
papers published in 1995-2004[38]. Exemplary results of
the X vs. Y relationship for two factors in two journals are shown in the left part of figure 1. The broad
scattering of the points shows that visual inspection fails
even to detect whether the relation between X and Y
is positive or negative. The simplest (and widely used)
approach is to perform an ordinary (least square) linear
2
au
0.00
0.25
ex
zin
d
w
or
ds
ce
C
H
er
da
τ
s
sa
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
normalized number of characters
in
d
4
ar
ou
2
l
-0.15
0
n'
β top
Fo
g
5
ex
Abstract
10
-2
ex
l
50
β half
th
o
sa
0.50
ar
ou
0.75
100
β
citations
500
zin
d
0.2
0.1
0.0
-0.1
-0.2
β low
τ
va
le
n
1000
rs
β
0.2
0.1
0.0
-0.1
-0.2
0.15
5000
ch
ar
H
s
er
da
n'
s
C
va
le
nc
e
β
β
0.2
0.1
0.0
-0.1
-0.2
β
Title & authors
0.2
0.1
0.0
-0.1
-0.2
de
x
en
in
z-
da
va
l
n'
s
ce
C
s
ar
H
er
au
ar
ou
sa
l
β half
0.75
100
0.50
β
-0.25
50
β low
τ
de
x
in
z-
en
w
or
ce
C
s
va
l
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
in
2
n'
l
sa
0
ar
ou
-2
normalized valence
Fo
g
-4
da
-0.50
5
er
10
ds
Abstract
H
0.25
de
x
citations
500
th
o
τ
ch
0.00
1000
rs
Title & authors
β top
5000
FIG. 1: Relation between different factors and the number of citations in two journals: Science (top) and Nature Genetics
(bottom). Left-side plots: each black dot corresponds to one paper and lines show QR results for color-coded quantiles
τ = {0.02, 0.04, ..., 0.98}. Middle panels: β coefficients (slopes of QR in the left panel) as a function of quantile τ . The red
arrows (summary pointers) show βlow ≡ β(τ = 0.02), βhalf ≡ β(τ = 0.5) and βtop ≡ β(τ = 0.98), as, respectively, the nock, a
circle on the shaft, and the head of the arrow. Right panels: summary pointers for all factors.
regression Y = α† + β † X, in which case β is related to
the Pearson correlation coefficient r as β = rσY /σX (in
fact, due to standardization of variable X in our case β †
is simply covXY ). For the data in figure 1, this yields:
β † = 0.020 ± 0.011 with p > 0.05 for title length in Science case and β † = −0.21 ± 0.03 with p < 0.001 for valence in Nature Genetics. In other words, the second example shows a negative correlation between valence and
citations while the first shows no clear correlation between the number of characters and citations (we cannot
reject the null hypothesis of lack of linear dependence
at 5% significance level). We note that the analysis of
Ref. [12], which identified a negative correlation between
title length and citation, was restricted only to the most
cited papers. This difference in the conclusion regarding
the role of title length and the large variability shown in
the data motivates us to go beyond the above described
computation of linear correlations, which relies on the
(homoscedasticity) assumption of uniform errors in the
whole dataset.
A.
Quantile regression (QR)
Quantile regression [22] gives the opportunity to track
the relation between variables for different parts of the
dataset. The simple question it addresses is: what are the
coefficients α and β of a linear relation Y = α(τ )+β(τ )X
that divides the dataset so that a fraction τ of points
lies below the line and the remaining part (1-τ ) above
it (a precise formulation of QR is shown in Appendix
E). We thus obtain a sequence of values β(τ ) which can
be thought as the quantification of the relation between
X and Y at the τ quantile. The QR is widely used in
different fields [23] and has lately been applied to predict
future paper citation basing on their previous history,
i.e., early citations as well as on the Impact Factor [24].
The results in the center panels of figure 1 show a clear
τ dependence of β, a signature of the non-linearity of correlations. For instance, the top panel shows that for low
values of τ there is a positive correlation between number of characters in the tile and citations, while for high
3
τ the correlation is reversed. This shows the limitations
of the popularized message [25, 26] following Ref. [12]
that shorter titles lead to more citation. This only holds
if you know in advance that your paper will be among
the top cited papers (longer titles seem to be better, e.g.,
in order to avoid being among the least cited papers).
Similar observations (with the opposite trend) are observed in the lower panel for valence — the emotional
polarity — contained in the abstract of Nature Genetics
articles. These examples show that even simple textual
variables can have a mixed relation to the number of citations acquired by the papers of a given journal. We repeated the QR analysis for all factors in more than 1500
journals[39]. In our discussion of our different findings
below we focus on three characteristic values of β which
represent the low-cited (βlow ≡ β(τ = 0.02)), typical
(βhalf ≡ β(τ = 0.5)), and top-cited (βtop ≡ β(τ = 0.98))
papers (graphically represented in the central and left
panels of figure 1 by a summary pointer, i.e., a red arrow
with a circle).
B.
Strength of factors
In order to compare the strength of the effect of a factor
on the number of citations we focus on the distribution
of βhalf (typical papers) across different journals. The
linear relationship Y = ln(citations) = α + βX and the
fact that X is standardized imply that β quantifies how
much growth in citations should be expected from the
variation of one standard deviation in one factor (e.g.,
β = ln 2 ≈ 0.69 means that the number of citations Y
doubles by moving one standard deviation in X). Figure
2 summarizes the results and presents the factors ordered
according to the median of the βhalf distributions. The
influence of factors is overall rather weak, as seen by the
fact that for most journals |β| < 0.5. Factors in the title are considerably weaker than those in the abstract or
the one connected to the number of authors. The variation across journals is in general high, but higher in the
title than in the abstract (possibly due to the fact that
the estimations of X are more robust in the abstract due
to the larger amount of text). The strongest factors observed are: (i) the number of words in the abstract, (ii)
the number of authors and (iii) z-index in the abstract.
For those factors, over 75% of journals (equivalently, the
whole box) are placed above zero. The negative value of
Herdan’s C can be attributed to its anticorrelation to the
number of words (see Appendix B); when C is accounted
for that fact and presented in the form of z-index the
value is positive. This means that for a typical paper
and for most journals a more variable vocabulary (more
unique words) translates into more citations. Similarly,
the number of words in the abstract or the number of
authors are positively correlated with the number of citations in almost all journals.
a. Quantile dependence. Now we quantify in which
extent the influence of factors (β) varies across papers
with different number of citations (the quantile τ ). We
are particularly interested in the cases in which the effect of a given factor on the most successful papers is
significantly different from the effect on typical papers.
To quantify how typical this is, we count the number of
journals for which βtop 6= βhalf is observed beyond the
estimated uncertainties σβtop , σβhalf , i.e., |βtop − βhalf | >
(σβtop + σβhalf ). The results shown in table II reveal
that overall this happens in about 1/3 of the cases (it
is more typical for text length and less typical for sentiment factors). Table II reveals also the factors for which
βtop 6= βhalf because β(τ ) grows in most journals (and
thus βtop > βhalf , as in the case of valence in the abstract), decays in most journals (and thus βtop < βhalf ,
as in the case title length), or shows a mixed behavior in
different journals (as in the case of arousal).
The next question we investigate is the extent into
which the quantile dependence leads to a reversal of the
effect of factors, i.e., when β(τ ) crosses 0. Table III shows
the percentage of journals with positive βlow , βhalf , and
βtop coefficients for each factor. It shows that except
for singular cases (marked by asterisk) the observations
tend to be significantly different from chance (50%). The
variation across the different β’s (quantiles) quantifies the
number of journals for which β(τ ) crosses zero. Such a
behavior has already been discussed for title length in
Science (see figure 1), and table III confirms the generality of this observation (it shows for title length 72% of
journals with positive βlow as compared to nearly 75%
with negative βtop ). In case of three factors (title length,
Herdan’s C in the abstract, and valence in the abstract),
we observe that moving form βlow to βtop we cross 50%,
which indicates that for a certain range of β the factor
in question increases citations for most journals while for
other β’s the opposite effect is typical across journals.
The combination of the results of these two tables allows for a more complete picture of the τ dependence on
β for different factors. For instance, the number of authors and the number of characters in the title can be
identified as the ones that exhibit the strongest systematic trend of decaying β(τ ) (in about 40% of journals,
as shown in table II). However, only for the number of
authors the majority of the values are above zero (see
table III), i.e., the value of β for top papers is less then
for typical ones but it still stays positive. On the other
hand, in the case of the number of characters not only is
β smaller for top papers as compared to typical ones but
it changes its sign as well. Sentiment factors (except for
the valence in the abstract) bring no overall information
about the trend — the number of up- and downward
occurrences is similar. Notably, there is a strong coincidence between z-index and fog index in the abstract,
suggesting that although those two quantities have different definitions, both indicate the increase of correlations
between abstract complexity and citations.
Variability across journals. The large variability
across journals apparent in all our analysis can have different origins. One possibility is that certain journals
4
TITLE
AUTHORS
1.0
1.0
ABSTRACT
property
1.0
complexity
length
sentiment
0.5
0.5
0.5
β half
outliers
0.0
0.0
0.0
Q3 (3rd quartile)
median
-0.5
-0.5
-1.0
-1.0
arousal
chars
z
C
valence
Q1 (1st quartile)
-0.5
-1.0
authors
words
z
arousal fog index valence
C
FIG. 2: Strength of factors calculated over all journals. Box-plots (see definition on the right) summarize the distribution of
βhalf values across different journals. Influential factors are identified as those for which |β| is large for almost all journals (e.g.,
when the box does not contain βhalf = 0 line this implies that in at least 75% of the journals the value of βhalf is above or
below zero).
property
length
factor βtop > βhalf βtop < βhalf βtop 6= βhalf
no. of characters (title)
2.6%
44.4%
47.0%
no. of words (abstract)
8.3%
29.4%
36.7%
mean
41.9%
complexity
Herdan’s C (title)
18.7%
8.5%
27.2%
Herdan’s C (abstract)
34.9%
6.5%
41.4%
z-index (title)
8.3%
16.7%
25.0%
z-index (abstract)
24.6%
7.7%
32.3%
fog index (abstract)
26.4%
8.0%
34.4%
mean
32.0%
sentiment
arousal (title)
11.0%
13.5%
24.5%
arousal (abstract)
15.7%
13.7%
29.4%
valence (title)
16.1%
11.3%
27.4%
valence (abstract)
29.2%
5.7%
34.9%
mean
29.1%
no. of authors
4.0%
39.6%
43.6%
overall mean
33.7%
TABLE II: Factors often affect top and typical papers differently. Percentage of journals for which βtop 6= βhalf are reported.
The right column, βtop 6= βhalf , is the sum of the two others.
property
length
factor
no. of characters (title)
no. of words (abstract)
complexity
Herdan’s C (title)
Herdan’s C (abstract)
z-index (title)
z-index (abstract)
fog index (abstract)
sentiment
arousal (title)
arousal (abstract)
valence (title)
valence (abstract)
no. of authors
βlow > 0 βhalf > 0
71.4%
56.2%
96.5%
96.7%
50.1%∗ 56.7%
19.4%
28.1%
62.0%
58.2%
71.3%
82.1%
62.9%
68.2%
56.5%
61.6%
62.8%
67.8%
42.9%
43.1%
42.0% 47.6%∗
93.4%
92.5%
βtop > 0
27.7%
83.4%
62.4%
51.2%∗
47.9%∗
81.0%
72.9%
58.3%
61.9%
49.5%∗
63.4%
61.6%
TABLE III: Percentage of journals with positive βlow , βhalf ,
and βtop for each factor. All values are statistically significant
(p < 0.001) except for those marked with an asterisk ∗ (see
Appendix F).
are read only by specific (scientific) communities. To
address that issue, in figure 3 we group the journals in
disciplines according to their OECD subcategory [27] and
show summary pointers (introduced in figure 1) for two
factors. The results indicate that the variation across
journals is partially explained by disciplines, e.g., for
Clinical medicine all values of β in case of valence in
abstract are below zero while for Physical sciences the
majority is positive. Another possibility is that more
popular journals are different than less popular journal.
To address this option, journals inside each discipline in
figure 3 are ranked by their Impact Factor index (IF).
No clear tendency can be visually identified, however by
comparing with a random attribution of IF, popularity
proves to be statistically significant, although to much
less extent than scientific discipline (see caption for figure 3). Figure 3 allows also for a straightforward comparison of the strength of title length and abstract valence
5
factors in different journals. By calculating exp(β∆X)
one can directly estimate how much gain in citations is
obtained on average by a move in ∆X standard deviations in the variable X (e.g., for title length in the journal
Lancet βhalf = 0.33 and thus extending the length of the
title by one standard deviation gives almost 40% gain in
citations; for Nature, βhalf = 0.038 and thus one obtains
less than 4% gain).
III.
DISCUSSION AND CONCLUSION
In this paper we investigate the importance of factors
of scientific papers on the popularity they acquire. As
factors we consider the number of authors of the paper
and also text-related properties that quantify the length
of title and abstract, the complexity of the vocabulary,
and sentiment based on the used words. These factors
capture different stylistic dimensions of scientific writing and were selected also based on previous works that
indicated a correlation to the number of citations. We
found that the factors with a stronger (positive) effect
on citations are the number of authors and the length of
the abstract. Text complexity is positive correlated with
citation at the level of the abstract, while we could not
detect a strong effect within the title. The agreement of
two factors designed to quantify text complexity – the
z-index and Gunning fog index – support this conclusion
(the opposite result is obtained if Herdan’s C measure is
used, but we attribute this to the negative correlation of
this measure with text length). In terms of the sentiment
factors, the level of arousal a title or abstract invokes is
poorly correlated with citations. This result should be
examined more carefully as there are controversies as to
the relation between text polarity and information contained therein (see [28, 29] and the following discussion).
Also the vocabulary on which we rely in this study [30]
has been obtained by evaluating the common reception
of words. This fact can strongly affect the value of valence, e.g., a highly negative word “cancer” in medical
papers.
The discussion above, and the fact that a statistically
significant effect is present for most factors, should not
hide that the effect is typically weak (|β| < 0.5 for most
factors, quantiles τ , and journals) and that there are
strong fluctuations across papers and journals. For instance, a positive correlation between number of characters and citations for all the quantiles is measured in
the New England Journal of Medicine, while a negative
correlation is observed in the overwhelming majority of
other journals. One of the main findings of our paper is
that the factors vary also strongly depending whether the
analysis uses all or only the most cited papers. We quantified this effect by the dependence of β on the quantile τ
in a Quantile Regression analysis. One example in which
this effect is particularly strong is the role of title length
in figure 1. In the public media [25, 26], the message
behind the finding [12] of negative correlation between
text length and citations was that authors should write
shorter titles to achieve more citations. While this simple
message is appealing and agrees with some stylistic recommendations, our results show that for most journals
this is wrong (even if one assumes that there is a causal
relation behind the correlations). The negative correlation is found only in the most cited journals, for typical
journals the correlation is positive (longer titles are better). This suggests that papers with short titles show a
larger variation on the number of citations and can be
very well cited or very poorly cited. A similar behavior
is observed in other factors, and a significant dependence
on τ is seen on average in 1/3 of the journals.
Altogether, our results indicate that textual properties of title and abstract have non-trivial effects in the
processes leading to the attribution of citations. In particular, the effect varies significantly between papers with
usual number of citations and with large number of citations. This finding is even more important considering
that the number of citations across papers varies dramatically. The weak signal we detect can be considered also
a sign that the quantities we measure have limited information, e.g., expressing the impact of publications by
single number (the number of citations) can be misleading and lacking information (a point that has been previously raised, e.g., in Ref. [31]). The overall estimates
(calculated over a set of journals or categories) may dim
the clear picture one receives while observing a specific
journal. For authors interested in how to write the title
and abstract of their paper, we recommend looking at
the values of βhalf and βtop of the different factors for the
specific journals of interest.
Acknowledgments
We thank Margit Palzenberger and the Max-PlanckDigital-Library for providing access to the dataset used
in this paper.
Appendix A: Data
We obtained data from the Web of Science service
about the papers marked as “articles” published in
the period of 1995 — 2004 that fulfil the following
two conditions: (i) the journal where the article has
been published had to be active in all the mentioned
years and (ii) there had to be at least 1000 articles
published in total in this journal in the given period.
By applying this filtering we obtained over 4 300 000
articles from over 1500 different journals containing
information about the title of the paper, the number of
its authors, full abstract contents and OECD category
it had been classified to. Additionally, for each of
the record we also recorded the number of citations it
acquired between being published and 31st December
2014. Data processing, plots and statistical analysis
β low , β half , β top
-0.50
-0.25
0.00
0.25
0.50
Valence in abstract
s r s b l e t l n s l j n s s l c l e tt v l l r e l p y l n l i s r tt j j b c d
s it a h g d g y d n t i r l l d c t i d y t y d ll t v t l l ll i a l v s i v b e e a i a i s s s tr s l r n
re e ic o iro m io no io re io rt io n re io a io at le re pp eo g in io o g o a ito sc l te le s n s o v em al n ct c n e n it e o es sc he no ho e o ce er e g ia g e e ne e ne io io e sc gi ho e re sc a l ur c s sc di sc id di di u di io a u
e e thom icr v nool bch lat rc ard ea ns rtesc ard im v blim sc ci l a g anocc rob un olotomorgas av uppmaev phy tro hy n s re chteri mo a med eb md e unp meurinv ro l t upat j masslan nc n moloycholot m cge e dge rr bll bnt cin olo yc ld din robehcho at ienci uad i infr j a ct ct n n ct emiommm
r
n
h
r
i
s
e
s
h
p
a
m
e
r
s
o
e
m
o
a r er s at a
i
e
t n u e a ra h s i ra eu l y n sc s c c a
c
n
u
n
c
h
c
m
n
m
e
t
a
o
c
t
n
fe fe li fe id b i
a
n
t
s v
e
g im io e a e m x n
a
eu co m j l
c
a
i
r
r t p
c e
d ars s
in in j cj in epem ect
no ge ge on j g grob io irc oll eu er hy iov ell o ns j neary ecchel oc v ml imoged e ua t p bes j ad ys str uclast phycliom ru ch j bo-mol liom im j e clinj nma j imm eng ed tl cinte nt n p ep n na gehum c j c plv b yc c p chv bv n sio ps
m na oe e h
j id f
a
c yp j rd l c smco
i r d in bi th b
j ar
ca ny ur
la rn
a ps so a a hy oc
n r
ge m vir mmic j b c
j
rg linam
o vetno eri me aq ve im hy ph a
b
a
n
w
l
p
h
g
r
h
h
in
c
e
n
h
n
u
m
o
t
a
e
c
j
o
a
o
m
t
a
n
l
t
c
u h j dis an op
amep
cm c
h en m l m
h te
anas ch
ne-j a
ph
be ur ers be be prs s
am
th
at an
e h p an
ja
no
l tr
em
rt ua
r
p
l
lim
m t
j im
r
e
e
n
n
p
g
a
p
e
a
p
p
i
e
n
a
e rc
s
n j
l
m
e q
i
c
p
p
a
h
i
p
p
o
a
m
c
a
a
n
a
t
m
ja
oc
ca
ve
ge
-0.50
-0.25
0.00
0.25
Characters in title
FIG. 3: Summary pointers showing βlow , βhalf , and βtop for two factors number of title characters (top) and valence in abstract
(bottom) (see figure 1 for the definition of summary pointers). Journals are grouped according to the OECD bibliographic
categories [27]. The 8 journals with highest Impact Factor in each category are shown (6 for Other natural sciences). The
categories are sorted with respect to the number of positive βhalf values. Testing null hypothesis that categories are randomly
attributed to journals (we compare the average standard deviation within categories with a random attribution of categories
to journals) yields p-values p = 0.002 for title length and p < 10−8 for valence in abstract. The same procedure performed for
Impact Factor (by creating 12 categories according to decreasing IF) gives p = 0.02 for title length and p < 10−5 for valence in
abstract, suggesting higher impact exerted by scientific category.
Journal
l
i
l
l r
i
l l r
l
l j
l
l
t i r l l
t
t
i
l i
i
t l
l
t i
l t
t
t l l l
r
r t j j
r
ed oc e r ed gy ia gy ed el e ev e io io el ity ed on s c e o o ds is is ut is io a un on es io rt on ns es io re ce sa c ia c m ls ni ta ch ng ed ng c ia o ev es c av l b ac io te et ev p o g ne io op gy o an to c l s te et s n b oc d es e cs ob ro e io o
m ss nc ce m lo ch lo m c en d en b l b t c un m ur veros th unath ai t d t d n t d m om m ati c r rd ea si te c r rd tu en i u d sind j s he ria o ac e e m e n slog ch d n r ros eh ho m b ma c l i r ap ge no ci ob n lo om rg si v spp a v l hy ro ys s rev e r th mi cr vi oml b hn
l j d a lacan rn ero sy atonat at genem g urrcel lan m xp nen ineu col m j p fecfec lin fec ide bi imcul cirl car h tenper as ca nasci sc ca ci afrin c ateg mim iomedlab edrai ho psyhildrai eu l b yc chi erv cli t s sccolem eavacicr muenoent t o ara ha su v m rerop astl phronys om ne no mi engen io tec
g
b
l
e
t
e
m
e
c
c
a
t
v
t
a s
e
p p
o s
s
h b m m b c
l
j
s j e y
o
t
in in j in p m c ir ol eu er hy io el
n g hu c j p im j e cli j nma j i m
ad y rr s c iomdrun c j io- hol io av sy oc c v bv nysi c p moon lannar e ch l o t ml imioged qu et p bys adhy ast nuc as ph en g giron j gm cro bi
en me atl in en n e
a rg lin m j j eidenfe c
c yp j rd l c
j ar
g m v
j
r
p
b er li t b at n b h op s s ha ha h o os c
m i j
ac n u
w m j n nn tro ge h
p r
i
a
e
h ca o
no veunoher j mis a v nimph
e be p s c
m
tl ann c
amep
hu en ma l m
ot
h ate
am
h c e p an be ur er
a ro
m c
a as ch
ne-j a
m
ph
t
t
m
b
a
t
a
d
s
i
j
n
l
e
r
r
p
l
l
m
r
j
e h
u
n
m
g ar
p st
e
a
p
in
ne j p
m
ea q
iearc
p
ap
pechi
on
m
cl
ap a
nc
ap
ti
m
ja
o
ca
ve
ge
β low , β half , β top
0.50
6
7
have been performed using R language [32].
Appendix B: Text properties
The most obvious candidates for quantitative factors
that could be used to describe the paper are the number
of words or the number of characters. In the case of the
title the second option has been used while in the case
of abstract — the first one. Additionally the number of
authors has also been used as in a previous study it had
been shown to be an important factor [21]. As it concerns
the complexity of the vocabulary a way to account for
that is to measure a so-called Herdan’s C index (see e.g.,
[33] p. 72), defined for each paper i as
log Ni
Ci =
,
log Mi
(B1)
where Mi stands for the text length (number of words)
and Ni is the vocabulary size (i.e., the number of unique
words) of paper i. To overcome methodological shortcomings of this traditional approach (e.g., no fluctuations
effect included) it has recently been proposed [34] to use
a z-score that shows how much the obtained pair (Ni ,
Mi ) is different form the expected value µ(M ) in units of
standard deviations σ(M )
zNi ,Mi =
Ni − µ(Mi )
,
σ(Mi )
(B2)
where µ(Mi ) and σ(Mi ) were obtained empirically using
all papers in our database. Finally one might also take
into account the complexity of the used words. A classical
quantity to measure this effect is so-called “Gunning fog
index” Fi [35], defined for each paper i as
#wordsi
#complexwordsi
Fi = 0.4
,
+ 100
#sentencesi
#wordsi
(B3)
where a complex a word is a one that has more than two
syllables [36]. Fog index is widely used as its value can
be connected to the number of formal years of education
needed to understand the text at first reading. Because
of the absence of sentences fog index has not been
calculated in case of title (i.e., a typical title contains
only one sentence therefore Fi is highly correlated with
the number of words).
We have used a very recent study [30] which contains
norms for almost 14.000 English words, where valence
(v) and arousal (a) are given as real numbers in the scale
[1; 9] (i.e., v below 5 is negative, while v > 5 means positive words, low a values indicate low arousal, while high
a is high arousal). The total valence and arousal were
obtained as the average of all words in the title or abstract.
Appendix D: Standardization
In order to make comparison among different factors
each factor x has been separately standardized with respect to journal, i.e. for each i
x̂i =
xi − µ(x)
,
σ(x)
(D1)
where µ(x) and σ(x) are, respectively, sample mean and
variance over factor x in a journal j it belongs to.
Appendix E: Quantile regression
In the approach of quantile regression [22, 23], having
k factors (variables) Xk and an observable Y , we are able
to obtain a regression line defined by coefficients βi (τ )
Y = β0 (τ ) + X1 β1 (τ ) + ... + Xk βk (τ ) = Xβ
(E1)
for a given quantile τ by solving the minimization problem
β̂(τ ) = arg min
β∈Rk
n
X
(ρτ (Yi − Xβ)) ,
(E2)
i=1
where ρτ (y) = |y(τ − I(y<0) )| is called loss function (I is
indicator variable). In this study we restrict ourselves to
case where
ln Y = α(τ ) + Xβ(τ ),
(E3)
i.e., we examine the influence of each of the factors separately. As the logarithm is an increasing function, the
logarithm of the p-th quantile is equal to the p-th quantile of the log-transformed citation counts. For computational purposes we used R’s quantreg package [37].
Appendix F: Statistical analysis
Appendix C: Sentiment properties
In study the idea of a dictionary emotional classifier
has been used: in this approach one takes the dictionary
of words that had been tagged for valence and arousal
and calculates the mean arithmetic value of all the recognized words. Thus in case of each paper we have separately valence (and arousal) values for title and abstract.
We test if the number of positive values of βlow , βhalf ,
and βtop is significantly different from the one obtained
by chance (i.e., by randomly choosing “+” or “-” signs
with equal probability q = 1/2). This statistics follows a
binomial distribution. However, as the number of samples (journals) n is large (n > 1500) we simplyp
use normal
distribution N (µ, σ), with µ = nq and σ = nq(q − 1)
8
with q = 1/2. We consider the observation to be statistical significant at if the measured number of positive β
differs from µ by more than 3σ (i.e., p-value is less than
0.001).
[1] Chavalarias D, Cointet JP. 2013 Phylomemetic patterns
in science evolution–the rise and fall of scientific fields.
PLoS One 8, e54847. (doi:10.1371/journal.pone.0054847)
[2] Kuhn T, Perc M, Helbing D. 2014 Inheritance Patterns in
Citation Networks Reveal Scientific Memes. Phys. Rev.
X 4, 041036. (doi:10.1103/PhysRevX.4.041036)
[3] Scharnhorst A, Börner K, van den Besselaar P (eds.).
2012 Models of Science Dynamics, Understanding
Complex Systems, Springer-Verlag Berlin Heidelberg
(doi:10.1007/978-3-642-23068-4)
[4] Perc M. 2013 Self-organization of progress across
the century of physics. Sci. Rep. 3,
1720.
(doi:10.1038/srep01720)
[5] Sinatra R, Deville P, Szell M, Wang D, Barabási AL.
2015 A century of physics. Nature Physics 11, 791–796.
(doi:10.1038/nphys3494)
[6] Didegah F, Thelwall, M. 2013 Which factors help authors produce the highest impact research? Collaboration, journal and document properties. J. Informetr. 7,
861–873. (doi:10.1016/j.joi.2013.08.006)
[7] Didegah F, Thelwall, M. 2013 Determinants of research citation impact in nanoscience and nanotechnology. J. Am. Soc. Inf. Sci. Technol. 64, 1055–1064.
(doi:10.1002/asi.22806)
[8] Onodera N, Yoshikane F. 2015 Factors affecting citation
rates of research articles. J. Assoc. Inf. Sci. Technol. 66,
739–764. (doi:10.1002/asi.23209).
[9] Paiva C, Lima J, Paiva B. 2012 Articles with short titles
describing the results are cited more often. Clinics 67,
509–513. (doi:10.6061/clinics/2012(05)17).
[10] van Wesel M, Wyatt S, ten Haaf J. 2014 What a difference a colon makes: how superficial factors influence subsequent citation. Scientometrics 98, 1601–1615.
(doi:10.1007/s11192-013-1154-x)
[11] Rostami F, Mohammadpoorasl A, Hajizadeh M.
2014 The effect of characteristics of title on citation rates of articles. Scientometrics 98, 2007–2010.
(doi:10.1007/s11192-013-1118-1).
[12] Letchford A, Moat HS, Preis T. 2015 The advantage
of short paper titles. R. Soc. Open Sci. 2, 150266.
(doi:10.1098/rsos.150266)
[13] Yik M, Russell JA, Steiger JH. 2011 A 12-point circumplex structure of core affect. Emotion 11, 705–731.
(doi:10.1037/a0023980)
[14] Russell JA. 1980 A circumplex model of affect. J. Pers.
Soc. Psychol. 39, 1161–1178. (doi:10.1037/h0077714).
[15] Chmiel A, Sienkiewicz J, Thelwall M, Paltoglou G, Buckley K, Kappas A, Holyst JA. 2011 Collective emotions
online and their influence on community life. PLoS One
6, e22207. (doi:10.1371/journal.pone.0022207)
[16] Sienkiewicz J, Skowron M, Paltoglou G, Holyst JA. 2013
Entropy-growth-based model of emotionally charged
online dialogues. Adv. Complex Syst. 16, 1350026.
(doi:10.1142/S0219525913500264)
[17] Choloniewski J, Sienkiewicz J, Holyst JA, Thelwall M.
2014 The role of emotional variables in the classification
and prediction of collective social dynamics. Acta Phys.
Pol. A 127, A-21. (doi:10.12693/APhysPolA.127.A-21)
[18] Catalini C, Lacetera N, Oettl A. 2015 The incidence and
role of negative citations in science. Proc. Natl. Acad. Sci.
U. S. A. 112, 13823–13826.
[19] Yu B. 2013 Automated citation sentiment analysis: what
can we learn from biomedical researchers. In ASIST’13
Proceedings of the 76th ASIS&T Annual Meeting: Beyond the Cloud: Rethinking Information Boundaries, art
no. 83.
[20] Milkman KL, Berger J. 2014 The science of sharing and
the sharing of science. Proc. Natl. Acad. Sci. U. S. A.
111 Suppl. 4, 13642–9. (doi:10.1073/pnas.1317511111)
[21] Miotto JM, Altmann EG. 2014 Predictability of Extreme Events in Social Media. PLoS One 9, e111506.
(doi:10.1371/journal.pone.0111506).
[22] Koenker R, Bassett-Jr G. 1978 Regression Quantiles.
Econometrica 46, 33–50.
[23] Davino C, Furno M, Vistocco D. 2014 Quantile Regression: Theory and Applications. Oxford, UK: Wiley.
[24] Stegehuis C, Litvak N, Waltman L. 2015 Predicting the
long-term citation impact of recent publications. J. Informetr. 9, 642–657. (doi:10.1016/j.joi.2015.06.005).
[25] Chawla DS. 2015 In brief, papers with shorter titles get
more citations, study suggests, Science News, 25 August
2015 (doi: 10.1126/science.aad1669).
[26] Deng B. 2015 Papers with shorter titles get
more citations, Nature News, 26 August 2015
(doi:10.1038/nature.2015.18246).
[27] http://www.oecd.org/science/inno/38235147.pdf
[28] Garcia D, Garas A, Schweitzer F. 2012 Positive words
carry less information than negative words. EPJ Data
Sci. 1, 1–12. (doi:10.1140/epjds3)
[29] Dodds PS et al. 2015 Human language reveals a universal
positivity bias. Proc. Natl. Acad. Sci. U. S. A. 112, 2389–
2394. (doi:10.1073/pnas.1411678112)
[30] Warriner AB, Kuperman V, Brysbaert M. 2013 Norms
of valence, arousal, and dominance for 13,915 English lemmas. Behav. Res. Methods 45, 1191–1207.
(doi:10.3758/s13428-012-0314-x)
[31] Bollen J, Van de Sompel H, Hagberg A, Chute
R. 2009 A Principal Component Analysis of 39
Scientific Impact Measures. PLoS One 4, e6022.
(doi:10.1371/journal.pone.0006022)
[32] R Core Team. 2015 R: A Language and Environment for
Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria (http://www.R-project.org/).
[33] Herdan G. 1960 Language as Choice and Chance.
Berlin, Heidelberg:
Springer Berlin Heidelberg.
(doi:10.1007/978-3-642-88388-0)
[34] Gerlach M, Altmann EG. 2014 Scaling laws and fluctuations in the statistics of word frequencies. New J. Phys.
16, 113010 (doi:10.1088/1367-2630/16/11/113010).
[35] Gunning R. 1952 The technique of clear writing, New
York: McGraw-Hill International Book Co.
[36] An own adaptation of Greg Fast’s Pearl algorithm
(http://cpansearch.perl.org/src/GREGFAST/Lingua-EN-Syllable-0.2
to R language has been used.
9
[37] Koenker R. 2015 quantreg:
Quantile Regression,
(http://CRAN.R-project.org/package=quantreg).
[38] This guarantees that papers have at least 10 years to
gain citations. Equivalent, but much more noisy (due to
drastically limited number of data points) results, are
obtained when looking only at papers published in the
same year.
[39] We perform QR fitting for each journal independently
because journals are known to play an important role in
the number of citations a paper receives.