THE MEASUREMENT OF JARGON STANDARDIZATION
IN SCIENTIFIC WRITING
USING RANK-FREQUENCY ("ZIPF") CURVES
BY
RONALD EUGENE WYLLYS
A thesis submitted in partial fulfillment of the
requirements for the degree of
DOCTOR OF PHILOSOPHY
at the
UNIVERSITY OF WISCONSIN
1974
© Copyright by Ronald Eugene Wyllys, 1974
All Rights Reserved
ABSTRACT
THE MEASUREMENT OF JARGON STANDARDIZATION IN SCIENTIFIC WRITING
USING RANK-FREQUENCY ("ZIPF") CURVES
Ronald Eugene Wyllys
Under the supervision of Associate Professor James Krikelas
The jargon of a scientific discipline is taken to consist of the words in writings in the discipline.
At any given time, the jargon contains expressions for (1) concepts that are already well defined and
generally accepted in the discipline, and (2) concepts that are not yet well defined or fully accepted. A
well defined concept will tend to have one standard expression; an incompletely defined concept will
tend to be expressed in a variety of ways. The study investigated two kinds of hypothesized effects of
the mix of standardized and nonstandardized expressions in the jargon of a discipline.
The first set of hypothesized effects dealt with changes over time in the mix. As time passes, the
number of well defined concepts in a discipline increases, and hence so does the number of
standardized expressions in the jargon of the discipline. It was hypothesized that this increase would be
reflected in a steepening of the slope of a certain curve formed from a frequency count of the words in
writings in the discipline. This was the "Zipf" curve, the curve formed by plotting the logarithms of the
frequencies of the words against the logarithms of the ranks of the words, after the words had been
arranged in order of decreasing frequency. Experiments with eight corpora drawn from 1921 and
1969 writings in four disciplines—ecology, mathematics, physics, and psychology—failed to support the
hypothesis that later writings in a discipline would have steeper slopes for their Zipf curves than earlier
writings.
The second set of hypothesized effects dealt with comparisons of the mix of standardized
and nonstandardized expressions between the so-called hard and soft sciences. It seemed possible that
part of the hardness of a hard discipline consists in its having a higher proportion of standardized
expressions in its jargon than a soft discipline has. It was hypothesized that this higher proportion
would be reflected in a steeper slope for the Zipf curve of writings in a hard discipline than for the
Zipf curve of coeval writings in a soft discipline. Experiments with the eight corpora, treating
mathematics and physics as hard disciplines and the other two as soft, supported this hypothesis.
_________________________
James Krikelas
TABLE OF CONTENTS
ABSTRACT
PREFACE
CHAPTER 1 INTRODUCTION
  1.1 How Do Scientists Use Words?
  1.2 A Possible Measure of Jargon Standardization
  1.3 Outline of Zipf's Hypothesis
  1.4 Outline of the Purposes of This Study
  1.5 Organization of This Thesis
CHAPTER 2 RELATED WORK
  2.1 Quantitative Analysis of Scientific Prose
  2.2 Zipf's Hypothesis and Some Suggested Explanations
    2.2.1 Early work by Zipf
    2.2.2 Later work by Zipf
    2.2.3 Mandelbrot's explanation of the rank-frequency phenomenon
    2.2.4 Simon's explanation of the rank-frequency phenomenon
    2.2.5 Other work on the rank-frequency phenomenon
    2.2.6 The Waring-Herdan formula
    2.2.7 A perspective on the rank-frequency phenomenon
  2.3 The Notion of Hard Science and Soft Science
    2.3.1 Storer's work
    2.3.2 Hagstrom's work
    2.3.3 Price's work
    2.3.4 Other work on hard and soft science
  2.4 Summary
CHAPTER 3 PURPOSES OF THE STUDY
  3.1 Assumptions and Definitions
    3.1.1 Scientific and technical disciplines
    3.1.2 Representativeness of the journals sampled
    3.1.3 The use of journals
    3.1.4 Hard science and soft science
    3.1.5 The jargon of a discipline: "Characteristic idiom" and "technical vocabulary"
    3.1.6 Words, word-tokens, and word-types
    3.1.7 Rank-frequency curves, their slopes, and regression
  3.2 Hypotheses
    3.2.1 Hypotheses concerning a tendency toward jargon standardization over time
          General Hypothesis I
    3.2.2 Hypotheses concerning the degrees of jargon standardization in different disciplines
          General Hypothesis II
  3.3 Summary
CHAPTER 4 DATA SELECTION, PREPARATION, PROCESSING, AND ANALYSIS
  4.1 Selection of the Data
  4.2 Preparation of the Text Samples
  4.3 Processing of the Machine-Usable Data
  4.4 Analysis of the Word-Frequency Data
  4.5 Summary
CHAPTER 5 RESULTS OF THE STUDY
  5.1 Outline of the Results
  5.2 The Tests of the Hypotheses
    5.2.1 Tests of Hypotheses 1
    5.2.2 Tests of Hypotheses 2
    5.2.3 Tests of Hypotheses 3
    5.2.4 Tests of Hypotheses 4
    5.2.5 Tests of Hypothesis 5 and Hypothesis 6
  5.3 Summary
CHAPTER 6 INTERPRETATION AND RECOMMENDATIONS
  6.1 A Possible Interpretation of the Results of the Study
  6.2 Possible Inquiries into Changes over Time in Scientific Writings
  6.3 Possible Inquiries into Potential Measures of the Hardness of a Scientific Discipline
    6.3.1 The type-token ratio
    6.3.2 The logarithmic type-token ratio
    6.3.3 Yule's characteristic, K
    6.3.4 Singleton types and their probability
  6.4 Summary
APPENDICES
  Appendix A. Articles from Which Text Samples Were Drawn
  Appendix B. List of Words Defined as "Common" Words in This Study
  Appendix C. Graphs of Rank-Frequency Pairs and Regression Lines for the Corpora
REFERENCES
LIST OF FIGURES AND TABLES
Figures
  Figure 1-1  A Typical Zipf Curve
  Figure 3-1  Rank-Frequency Curve and Regression Line, for a One-Sentence Corpus
Tables
  Table 1-1  Example of a Ranked Frequency Count
  Table 1-2  Two Hypothetical Examples of Rank-Frequency Distributions
  Table 2-1  Values of Price's Index for Journals Used in This Study
  Table 3-1  Example of a Frequency Count of Word-Types
  Table 3-2  Example of a Frequency-Ordered List of Word-Types for a One-Sentence Corpus
  Table 3-3  Example of Observed Distinct Frequencies and Defined Corresponding Ranks for a One-Sentence Corpus
  Table 3-4  Example of Pairs (log r_i, log f_i) for a One-Sentence Corpus
  Table 5-1  Data for the Tests of Hypotheses 1, Using the Original Texts of the Samples
  Table 5-2  Data for the Tests of Hypotheses 2, Using Samples with Common Words Deleted
  Table 5-3  Data for the Tests of Hypotheses 3, Using the Original Texts of the Samples
  Table 5-4  Data for the Tests of Hypotheses 4, Using Samples with Common Words Deleted
  Table 5-5  Data for the Tests of Hypothesis 5, Using the Original Texts of the Samples, and of Hypothesis 6, Using Texts with Common Words Deleted
  Table 6-1  Vocabulary Statistics of the Corpora, Using the Original Texts
  Table 6-2  Vocabulary Statistics of the Corpora, Using Texts with Common Words Deleted
  Table 6-3  Proportions of Running Text Accounted for by Word-Types of Various Frequencies
  Table 6-4  Values of the Proposed "Yule's Index of Diversity," W = 10,000/K, for the Original-Text Corpora
  Table 6-5  Distribution of Numbers of Word-Types Represented by 1, 2, . . . , 10 Tokens in the Original-Text Corpora
PREFACE
Science advances because one scientist's ideas and factual findings are communicated to
other scientists by the written and spoken word. The study reported here deals with one aspect
of the language that scientists use: the process by which the scientists in a particular field reach
consensus on the word or phrase they use to express a concept.
This process is a gradual one. Typically, scientists begin by using a variety of words and
phrases to deal with a new concept as it starts to emerge in a discipline. Only over a period of
time does the concept gradually become clear and assume a form that the scientists agree on.
Accompanying the process of clarifying and standardizing the concept is the process of settling
on a word or phrase that becomes the standard way of expressing the concept.
To deal directly with these twin processes would demand an extremely detailed
examination of the individual words used over periods of time in any one field or subfield of
science. Furthermore, the examiner would have to have a substantial knowledge of the field or
subfield in order to trace the emergence of each new concept. The degree of knowledge
required would be so great that it would be impossible for any one examiner to hope to deal
with more than a few subfields and time periods. Yet the phenomenon of the emergence of new
concepts and of names for them is of interest, not only to historians and sociologists of science
but also to librarians and others concerned with storing information and making it accessible.
Some time ago it occurred to me that the rank-frequency curve (the "Zipf curve") of a
body of prose in a field of science might provide a tool for measuring the extent of the
standardization of the words and phrases used to express concepts in that field. The idea
seemed worth pursuing because Zipf curves depend upon only the frequencies of words in a
body of prose and not upon knowledge of the semantic content of those words, a much stiffer
requirement. Furthermore, the counting of word frequencies could be an inexpensive
by-product of computer-based handling of scientific texts in libraries and other
information-processing systems. Thus it appeared that if Zipf curves were found to yield usable
information about the standardization of scientific terminology, they could be utilized for that
purpose in practical information-handling systems at low cost.
This dissertation reports my efforts to find out whether Zipf curves would, in fact, yield
such information. These efforts followed two paths. First, in each of four scientific fields I
looked at writings separated by almost half a century to see whether the later writings appeared
more standardized than the earlier writings when they were measured by their Zipf curves.
Second, having chosen the four fields to include two fields in the hard sciences and two in the
soft sciences, I looked at differences in Zipf-curve measurements between the hard and the soft
fields to see whether the hard fields appeared more standardized than the soft fields.
The first path led to the conclusion that the later writings did not appear more
standardized than the earlier ones, according to Zipf-curve measurements. In fact, the later
writings exhibited a slight tendency to be less standardized than the earlier ones. The second
path led to the conclusion that writings in hard sciences do appear more standardized than
writings of the same period in soft sciences.
This study had its origin in readings for a course I took in 1971 from Professor Victor
H. Yngve of the Graduate Library School, University of Chicago. These readings led me to an
awareness that Zipf curves are not so immutable as Zipf himself thought. Later, at the
University of Wisconsin—Madison, in thinking about the problems of librarians and others in
handling information in rapidly advancing fields of science, I became concerned about the
difficulties that information systems have in keeping up with changes in scientific vocabulary and
with the emergence of new scientific subfields in which there is great activity. The possible use
of Zipf curves as a tool to measure rates of change and to delimit subfields of science occurred
to me, and with the assistance of Professor James Krikelas I focused that general thought into
the investigation reported here.
The study took place at the University of Texas at Austin (UTA) during academic years
1972-73 and 1973-74. In the first of these years my efforts were directed toward becoming
familiar with the facilities of the UTA Computation Center and the UTA Linguistics Research
Center. The selection and preparation of the data for the study began in the autumn of 1973,
and other phases of the study followed as rapidly as possible. This dissertation was written
during the late spring and early summer of 1974.
In acknowledging assistance in this study, I wish to thank first and foremost Dr. Claud
Glenn Sparks, Dean of the Graduate School of Library Science (GSLS), University of Texas
at Austin, for the very generous financial support, clerical help, and assistance in meeting my
teaching responsibilities that I received from the GSLS during the period of this study. Without
that aid, this study would have been impossible.
The study was immeasurably assisted by the natural-language-processing programs of the
Linguistics Research Center (LRC) of the University of Texas at Austin. These computer
programs were developed under the direction of Dr. Rolf A. Stachowitz by Bary A. Gold. I am
greatly indebted to Dr. Stachowitz for allowing me to use the programs and to Mr. Gold for
many hours of consultation and for tailoring the programs to my needs. I should also like to
thank Dr. Stachowitz for his continuing interest in my study and for helpful suggestions. Dr.
Helen-Jo J. Hewitt of the LRC staff provided assistance in guiding me through the literature of
linguistics. Funding for the LRC has been provided during the past several years by: (1) Rome
Air Development Center, Air Force Systems Command, Griffiss Air Force Base, New York,
under contracts F30602-70-C-0118, F30602-73-C-0192, and F30602-74-C-0028; and (2) Air
Force Office of Scientific Research, Air Force Systems Command, Arlington, Virginia, through
grant AFOSR-69-1788. It is a pleasure to record my personal gratitude for this support, which
has made the LRC, its computer programs, and its facilities possible.
I am grateful to Dr. Charles H. Warlick, Director, Computation Center, University of
Texas at Austin (UTACC), and his staff for the superb interactive computing facilities provided
by the UTACC. Without the convenience and speed provided by the ability to compile, de-bug,
and execute programs interactively; without the ease and speed of data- and program-correction
via the UTACC's EDIT program; and without the capabilities of the UTACC's locally prepared
interactive version of the OMNITAB II program package developed by the National Bureau of
Standards, I would still be only beginning this study. Among the UTACC staff I should like
particularly to thank G. Scott Harris, Florence Turck, and Christopher V. Yurkanan for their
help.
Three members of the faculty of the University of Chicago deserve my special thanks.
Professor Victor H. Yngve of the Graduate Library School (GLS) not only taught the most
exciting course I ever took in any graduate school but also called my attention to Zipf's
successors and started me on the line of inquiry that led to this study. Professor, and former
Dean, Don R. Swanson of GLS made special arrangements for me to do part of my pre-doctoral
work at GLS. Professor Patrick P. Billingsley of the Department of Mathematics was very
helpful when I first started struggling with Mandelbrot's mathematics.
I am grateful to too many members of the faculty of the University of
Wisconsin—Madison (UW) to be able to name here all those who taught me or helped me in
other ways. But I do want to acknowledge the invaluable assistance in helping me formulate and
carry out this study that I received from my doctoral committee: my advisor, Associate Professor
James Krikelas, Professor William L. Williamson, and Associate Professor Richard D. Walker, all
of the UW Library School; George C. Tiao, Professor of Business and of Statistics, and
Chairman, Statistics Department, who also taught the most useful course in statistics that I ever
took; and Larry E. Travis, Professor of Computer Sciences, and Director, UW Computing
Center. To Orie L. Loucks, Professor of Environmental Studies and of Botany, I owe thanks
for very helpful discussions of this study. I also want to thank Professors Jack A. Clarke and
Margaret E. Monroe of the Library School for encouraging and assisting me in entering on my
doctoral fellowship. It is a great pleasure to acknowledge here that that fellowship, indispensable
to my library education, was supported by the U. S. Office of Education under the Title II-B
program of the Higher Education Act of 1965.
Mrs. Ruth Sawyer, Librarian, Library Science Library, University of Texas at Austin, gave
generously of her time in helping me locate hard-to-find materials. Edward A. Eaton, III,
rendered invaluable assistance on many occasions in acquainting me with the UTACC's
interactive system. My thanks go to Athala J. Wyllys for her assistance in preparing Table 2-1
and to R. K. Graham Wyllys for calling my attention to the Barnhart Dictionary's definitions of
"hard science" and "soft science." Lane DeCamp's typing skill was important in the preparation
of the data for processing by the optical character reader. His comprehension of the purpose of
the study played a vital role, for he was able to recognize many unanticipated problems as he
came across them in his typing.
It is almost a tradition that acknowledgements conclude with an expression of gratitude
to the author's spouse and family. Not only do I want to thank them; but I must add that never
in the past did I imagine how deep that gratitude can be, and how inadequate are any words to
express what I owe to my family for their patient understanding and especially to my wife, Jean
M. Wyllys, for her support and encouragement in everything for many years.
CHAPTER 1
INTRODUCTION
1.1 How Do Scientists Use Words?
The fields of information retrieval, librarianship, and abstracting and indexing unite in their
concern with the written word as a medium for conveying thought, over distances and through time.
This concern leads naturally to a concern with the ways in which the meanings and patterns of use
of words change as time passes; for these changes lead to changes in how information-handling
agencies cope with the processing, storing, retrieving, and disseminating of verbal materials.
One interesting change takes place with respect to the words that people working in a
particular discipline use to express the concepts they deal with in the discipline. Such words are
known as the "jargon" of the discipline, a term used with no pejorative connotations. Typically, a
new concept is referred to in several different ways as the experts begin to recognize it and work
toward understanding it. As the concept is developed, the experts come to use for it a single term,
or at most a relatively small number of preferred terms. The process of the experts' reaching
consensus on how to express a concept is called "jargon standardization" in this study.
A recent example of jargon standardization can be found in the history of the phenomena
and the devices known as the "maser" and the "laser." These names are acronyms for "microwave
amplification by stimulated emission of radiation" and "light amplification by stimulated
emission of radiation." Other names used for the phenomena have included "amplification of light,"
"amplification of microwave radiation," "coherent radiation," and "coherent emission." According
to Brotherton (1964, p. 81), the primary phenomenon was referred to as "negative oscillation" and
"negative dispersion" when it was first observed in 1924.
In working with these phenomena, physicists and electronic engineers gradually came to
understand what they consisted in and to shift from using a variety of terms for them to using
primarily a few preferred terms including the convenient acronyms. The acronyms have now been
so thoroughly incorporated into the jargon of physics and electronics that they have begotten
working, conjugatable verbs. A recent article in Science (14 June 1974, p. 1166) includes such
sentences as, "The lasing transition would then occur when one of the excited electrons relaxes to a
lower state," and "So far, no proof of lasing has been obtained."
Similar changes, from several names for a not yet fully grasped concept to one or two
standard names for a thoroughly developed concept, occur in all disciplines. In view of this
tendency, it seemed possible that one way of assessing the degree of maturation of a discipline might
be to examine the extent to which the jargon of the field has been standardized and made precise, in
the sense that workers in the field tend to use only one verbal expression for each concept in the
field. In this sense chemistry could be viewed as a mature discipline because most concepts (e.g.,
chemical compounds, or their physical properties) tend to have one standard name, in contrast to
the larger number of alternative expressions available and used for a given concept in, say, sociology.
It should be emphasized that the use of the phrase "degree of maturation" in connection with
various fields is not intended to suggest that a more mature field is "better" than a less mature one.
1.2 A Possible Measure of Jargon Standardization
The concept of differing degrees of jargon standardization has implications for the design
and management of information systems, including both traditional libraries and computer-based
systems. A field that is mature with respect to its jargon would be expected to exhibit proportionally
fewer changes in its vocabulary over a given period of time than would an immature field. Thus,
information systems designed for fields of low jargon standardization would need to be more
flexible in coping with vocabulary change than would systems for fields of high jargon
standardization. Design decisions in systems with different degrees of standardization might well
differ in the means for vocabulary review and modification and in the frequency of vocabulary
review. For example, an indexing service for a field of low jargon standardization (i.e., of rapid
change in vocabulary) might find it desirable to conduct a thorough vocabulary review every six
months, in contrast to such a review once in two or three years for an indexing service covering a
field of high jargon standardization.
The usefulness of a measure of jargon standardization would depend partly on its ease of
use. This study investigated a possible measure that would be an easy and inexpensive by-product
of an information system employing computer processing of at least the titles of documents. The
measure consists in a generalized form of the language phenomenon usually associated with the name
of George Kingsley Zipf.
Although the phenomenon has come to be generally referred to as "Zipf's Law," Zipf's
concept of it has been sufficiently discredited to be called a "law" no longer. Hence, it will be
called "Zipf's Hypothesis" in this study.
1.3 Outline of Zipf's Hypothesis
If one has a natural-language corpus—e.g., a book written in English—he can make a
frequency count of the words in the corpus, i.e., a count of the number of occurrences of "the,"
"and," "of," etc. Then one can arrange the words in decreasing order of frequency so that the most
frequent word has rank one; the next most frequent, rank two; and so on.
For example, a frequency count of the 104 words in the two preceding paragraphs yields the
results shown in Table 1-1.
TABLE 1-1
EXAMPLE OF A RANKED FREQUENCY COUNT

Word-Type              Rank r                 Frequency f   Product rf
the                    1                      8             8.0
of                     2                      7             14.0
a                      3                      5             15.0
has, in                4-5 (mean = 4.5)       4             18.0
be, one, to, Zipf's    6-9 (mean = 7.5)       3             22.5
(13 words)             10-22 (mean = 16.0)    2             32.0
(38 words)             23-60 (mean = 41.5)    1             41.5
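The construction of a table such as Table 1-1 can be sketched in Python. This is a minimal illustration, not the procedure used in the study: the tokenization rule (lowercased runs of letters, with internal apostrophes kept so that "Zipf's" is one word-type) and the function name are assumptions. Word-types with equal frequencies share the mean of the ranks they span, as in the table.

```python
import re
from collections import Counter
from itertools import groupby


def ranked_frequency_table(text):
    """Return rows (word_types, first_rank, last_rank, mean_rank, frequency, product)."""
    # Assumed tokenization: lowercase letter runs, internal apostrophes allowed.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(tokens)
    # Decreasing frequency; ties are ordered alphabetically for reproducibility.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    rows = []
    next_rank = 1
    for freq, group in groupby(ordered, key=lambda item: item[1]):
        words = [word for word, _ in group]
        first, last = next_rank, next_rank + len(words) - 1
        mean_rank = (first + last) / 2  # tied word-types share the mean rank
        rows.append((words, first, last, mean_rank, freq, mean_rank * freq))
        next_rank = last + 1
    return rows


# Hypothetical one-sentence sample, used only to exercise the function.
sample = "the cat saw the dog and the dog saw a cat"
for row in ranked_frequency_table(sample):
    print(row)
```

Each printed row corresponds to one line of a ranked frequency count: the tied word-types, the span of ranks they occupy, the mean rank, the shared frequency, and the product of mean rank and frequency.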
Despite the small sample, the frequency count in Table 1-1 illustrates well the narrowly
constrained relation of rank and frequency for words in natural language. The values of the
products of rank r and frequency f fall in the relatively limited range 8 to 22.5, except for the last two
products (which are discussed below). By definition, rank and frequency must have an inverse
relationship, but there was no a priori reason to expect that the products rf would fall within so
constrained a range. This point can be illustrated by a hypothetical larger sample of 1,000 words.
Table 1-2 shows the beginnings of two conceivable rank-frequency distributions for this sample,
which illustrate how rank-frequency products could have more widely varying values than they do in
reality.
TABLE 1-2
TWO HYPOTHETICAL EXAMPLES OF RANK-FREQUENCY DISTRIBUTIONS

   First distribution         Second distribution
   r      f      rf           r      f      rf
   1     100     100          1     100     100
   2      99     198          2      50     100
   3      98     294          3      25      75
   4      97     388          4      12      48
   5      96     480          5       6      30
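The products in Table 1-2 can be recomputed with a few lines of Python. The generating rules below (frequencies falling by one per rank, and frequencies halved with integer division at each rank) are assumptions chosen only because they reproduce the tabulated values; the function name is hypothetical.

```python
def rank_frequency_products(frequencies):
    """Return r * f for frequencies listed in rank order (rank 1 first)."""
    return [rank * freq for rank, freq in enumerate(frequencies, start=1)]


# First distribution: 100, 99, 98, 97, 96 (assumed rule: f = 101 - r).
slowly_falling = [101 - r for r in range(1, 6)]
# Second distribution: 100, 50, 25, 12, 6 (assumed rule: repeated integer halving).
halving = [100 // 2 ** (r - 1) for r in range(1, 6)]

for name, freqs in (("first", slowly_falling), ("second", halving)):
    products = rank_frequency_products(freqs)
    print(name, products, "max/min =", max(products) / min(products))
```

In both hypothetical cases the products drift steadily away from their starting value, in contrast to the comparatively narrow band seen in the middle ranks of Table 1-1.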
The surprisingly constrained relationship between the frequency of a word in a corpus and
its rank gained wide attention in the 1930s and 1940s through the work of George Kingsley Zipf, a
professor of philology at Harvard University. Zipf's Hypothesis is the rank-frequency relationship:

$$ rf = c \qquad (1.1) $$

where
    r = rank of a word
    f = frequency of the word
    c = a constant, dependent on the corpus (often around 1/10 of the total size of the corpus)
Zipf's Hypothesis merely approximates the relationship between rank r and frequency f for
any actual corpus. Zipf's work (1949, passim) shows that the approximation is much better for the
middle ranks than for the very highest and lowest ranks. His work with samples of various sizes
(e.g., Zipf, 1949, p. 291) suggests that the corpus should be on the order of 5,000 words or more for
a reasonably good approximation, even in the middle ranks. (These qualifications are why the words
of frequencies 2 and 1 were ignored in the discussion of Table 1-1.)
When stated algebraically, Zipf's Hypothesis is usually given in the form of equation (1.1),
but it is probably most familiar in the graphic representation of a mathematically equivalent form:

$$ \log f = \log c - \log r \qquad (1.2) $$
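The equivalence of the two forms follows in one step from the rules of logarithms. As a short worked derivation, assuming r, f, and c are all positive (as ranks, frequencies, and the constant must be):

```latex
\begin{align*}
rf &= c \\
\log(rf) &= \log c \\
\log r + \log f &= \log c \\
\log f &= \log c - \log r
\end{align*}
```

The base of the logarithm does not matter, provided the same base is used for all three terms.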
A linear equation, such as (1.2), can be represented by a straight line in a two-dimensional graph.
The solid line of slope[1] -1.0 in Figure 1-1 illustrates a typical display of Zipf's Hypothesis in
the form of equation (1.2). The dashed line illustrates what a slightly different equation,
log f = log c - 0.7 log r, would look like.
FIGURE 1-1
A TYPICAL ZIPF CURVE
[1] In analytic geometry the slope of a straight line is found by taking any two points on the line
and calculating the ratio of (1) the difference between the points measured along the vertical axis to
(2) the difference between the points measured along the horizontal axis.
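The two-point slope calculation can be checked numerically for both lines of Figure 1-1. This is a minimal sketch: the value c = 1000, the choice of ranks 10 and 1000, the use of base-10 logarithms, and the function names are all assumptions made for illustration.

```python
import math


def slope_between(p1, p2):
    """Slope of the straight line through two points (x, y)."""
    (x1, y1), (x2, y2) = p1, p2
    return (y2 - y1) / (x2 - x1)


def log_log_point(rank, c, b):
    """Point (log r, log f) on the line log f = log c - b log r."""
    return (math.log10(rank), math.log10(c) - b * math.log10(rank))


c = 1000  # hypothetical constant
for b in (1.0, 0.7):  # solid and dashed lines of Figure 1-1
    slope = slope_between(log_log_point(10, c, b), log_log_point(1000, c, b))
    print("b =", b, "slope =", round(slope, 6))
```

Any other pair of distinct ranks gives the same slope, since the points lie on a straight line in the logarithmic coordinates.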
For convenience, the relation of the logarithm of the frequency to the logarithm of the rank
is called the "rank-frequency" relation in this study. In general, a line representing a straight-line
approximation of the rank-frequency relation would have slope -B and would be represented by the
equation

$$ \log f = \log c - B \log r \qquad (1.3) $$

To equation (1.3) corresponds the nonlogarithmic relation

$$ r^{B} f = c \qquad (1.4) $$

which is thus a generalization of Zipf's Hypothesis.
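Given a set of (rank, frequency) pairs, the parameters B and c of equation (1.3) can be estimated by fitting a straight line to the points (log r, log f). The sketch below assumes ordinary least squares and base-10 logarithms; it is an illustration of the idea, not necessarily the regression procedure used in the study, and the function name is hypothetical.

```python
import math


def fit_rank_frequency_line(pairs):
    """Least-squares fit of log f = log c - B log r; returns (B, c)."""
    xs = [math.log10(r) for r, _ in pairs]
    ys = [math.log10(f) for _, f in pairs]
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # slope of the fitted line, i.e. -B
    intercept = mean_y - slope * mean_x  # log c
    return -slope, 10 ** intercept


# Synthetic pairs that satisfy r^0.7 * f = 1000 exactly (hypothetical data).
pairs = [(r, 1000 / r ** 0.7) for r in range(1, 51)]
b, c = fit_rank_frequency_line(pairs)
print("B =", round(b, 6), "c =", round(c, 3))
```

With real word counts the points scatter about the fitted line, so the estimates of B and c depend on which ranks are included in the fit.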
1.4 Outline of the Purposes of This Study
The ideas just discussed can be applied to the problem of measuring the degree of jargon
standardization. Assume for the moment that the words in a corpus of writings in the field of
chemistry have been counted, and that the rank-frequency curve for this corpus has been found to
have a slope of -1.0. Hence, this curve can be represented by the solid line in Figure 1-1. Assume
also that the words in a corpus of the same size taken from the field of sociology have been counted
and that the rank-frequency curve for this corpus has been plotted.
One can ask the question, How will the value of the slope of the rank-frequency curve for
the sociology corpus compare with the value of the slope for the chemistry corpus? Because the
verbalization of concepts in sociology is less standardized than that of concepts in chemistry,
sociologists will tend to use different alternative expressions in several references to a sociological
concept, whereas chemists will tend to use the same expression in each of several references to a
chemical concept. Thus sociologists will tend to use a larger number of different words to express a given
number of concepts than chemists will, and hence sociologists will tend to use any particular word less
often on the average than chemists will.
As a result, the high-frequency (i.e., low-rank) words in the sociology corpus will tend to
have frequencies that are smaller than those of comparable high-frequency words in the chemistry
corpus. At the same time, there will tend to be a larger number of distinct words in the sociology
corpus than in the chemistry corpus, so that the maximum rank for sociology will exceed the
maximum rank for chemistry. The net effect will be to produce for sociology a rank-frequency
curve resembling the dashed line in Figure 1-1, i.e., a line that starts at a lower point on the
vertical axis, and extends to a higher value on the horizontal axis, than does the solid line
representing chemistry.
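The two direct effects described above, lower frequencies for the most frequent words and a larger number of distinct words, can be seen in a toy example. Everything in the sketch is hypothetical: the concept labels, the expressions, the mention counts, and the rule that writers cycle evenly through the available expressions. It shows only those two effects and does not by itself establish how a fitted slope would change.

```python
from collections import Counter


def express(concept_mentions, expressions):
    """Map each concept mention to a word, cycling through that concept's expressions."""
    seen = Counter()
    words = []
    for concept in concept_mentions:
        options = expressions[concept]
        words.append(options[seen[concept] % len(options)])
        seen[concept] += 1
    return words


def summarize(words):
    """Return (number of distinct word-types, highest frequency)."""
    counts = Counter(words)
    return len(counts), max(counts.values())


# The same 21 concept mentions are expressed under two vocabularies.
mentions = ["c1"] * 12 + ["c2"] * 6 + ["c3"] * 3
standardized = {"c1": ["alpha"], "c2": ["beta"], "c3": ["gamma"]}
varied = {"c1": ["alpha", "alfa", "a-term"], "c2": ["beta", "b-term"], "c3": ["gamma"]}

print("standardized:", summarize(express(mentions, standardized)))
print("varied:      ", summarize(express(mentions, varied)))
```

With one expression per concept there are three word-types and the top frequency is 12; with alternative expressions the same mentions yield six word-types and a top frequency of 4.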
In short, this reasoning leads one to expect that a higher degree of standardization in the
jargon of a field would manifest itself in a steeper slope of the rank-frequency curve for a corpus
from that field than for a corpus of similar size from a field with a lower degree of jargon
standardization. That is, one would expect larger absolute values of the slopes to be associated with
corpora having higher degrees of jargon standardization. This suggests that it would be possible to
use the values of the slopes of the rank-frequency curves for different fields as a measure of their
relative degrees of jargon standardization.
The goal of this study was to test this possibility. Two paths of investigation were planned.
One of them started from the idea that the hardness of a scientific discipline is reflected in the
proportion of standardized terms used for standardized concepts in the field. This path sought to
determine whether writings in hard disciplines have steeper slopes than coeval writings in soft
disciplines. The other path started from the idea that the amount of standardized jargon in a
discipline can be expected to increase with the passage of time, as new concepts are introduced,
become accepted, and are given standard names. This path sought to determine whether later
writings in a given scientific discipline have steeper slopes than earlier writings, as would be expected
to result from an increase in the amount of standardized jargon.
1.5 Organization of This Thesis
Chapter 1 has introduced the notion of jargon standardization, and has presented the two
paths of investigation by which this study sought to determine whether jargon standardization could
be measured. Chapter 2 of this thesis discusses work related to this study in linguistics, on Zipf's
Hypothesis, and on the notion of hard and soft science. Chapter 3 presents the definitions and
assumptions employed in the study, and states the hypotheses tested in it. Chapter 4 describes how
the data of the study were selected, prepared, processed, and analyzed. Chapter 5 presents the
results of the tests of the hypotheses. Chapter 6 offers an interpretation of the results and suggests
some possible further investigations.
CHAPTER 2
RELATED WORK
Chapter 1 of this thesis introduced the notion of jargon standardization, the process by
which scientists move from using a variety of words and phrases in expressing an emerging concept
to using a single, standard name for a fully developed, well defined concept. The suggestion was
made that it might be possible to measure the degree of jargon standardization in a discipline by the
slopes of rank-frequency curves formed from counts of frequencies of words in texts in the
discipline. The amount of standardized jargon in a discipline would be expected to increase with the
passage of time. It was also suggested that the degree of jargon standardization in a discipline might
be closely related to the hardness of that discipline, a matter of interest in the sociology of science.
These ideas led to the two major purposes of this study. The first was to ascertain whether
measurements of the slopes of rank-frequency curves would reveal the expected increase over time
in the amount of standardized jargon. The second was to ascertain whether such measurements
would provide a quantitative measure of the hardness of a scientific discipline.
Chapter 2 reviews the published research relevant to this study, which falls into three
categories: (1) linguistic analysis of specialized subsets of languages; (2) work by G. K. Zipf and
others on examples of the rank-frequency phenomenon, and on attempts to explain it; and (3)
investigations into characteristics of hard and soft science. It appears that nothing has been
published dealing specifically with the idea of jargon standardization in scientific writing, from either
a qualitative or a quantitative viewpoint.
2.1 Quantitative Analysis of Scientific Prose
Only a small amount of quantitative work has been done in the general area of specialized
vocabularies and their development. Philology, of course, deals with the phenomena of historical
change in language, but philology has been almost entirely qualitative in nature rather than
quantitative. Not till the past quarter century has glottochronology (see, for example, Hockett, 1958,
pp. 526-35) introduced a strongly quantitative note, though still into only a small, specialized part of
philology. Like the present study, glottochronology treats the phenomenon of changes in
vocabulary, but treats it on the scale of an entire language and over time periods an order of
magnitude greater than the half century covered by this study.
Computational linguistics, whose name suggests that it might be concerned with quantitative
methods, actually has followed a mathematical but nonquantitative line of development,
emphasizing syntactic analysis. An interesting example is a paper by Bross, Shapiro, and Anderson
(1972), which reports major differences between the syntactic structures preponderating in surgical
reports and in mathematical papers.
Statistical linguistics is still a small new field, which so far has concentrated on finding and
developing basic tools and applying them to stylostatistical studies, such as versification and
authorship. In a survey of the state of linguistics, Carroll (1953) pointed out the newness of the
"statistical or quantitative approach to language study" by citing the establishment of a "'Committee
on Quantitative Linguistics' by the Sixth International Congress of Linguists in Paris, 1948." He also
reported that a "course in statistical linguistics was offered, for the first time, at the summer 1951
session of the Linguistics Institute. . . ." Nevertheless, it was not till 1968 that the heading
"Statistical Studies of Language" first appeared in the standard bibliography of linguistics, the
International Bibliography of Books and Articles on the Modern Languages and Literatures of the Modern
Language Association.
A still useful bibliography by Guiraud (1954a) covered statistical linguistics up to its
publication, and its commentaries can serve as an introduction to the field. A recent bibliography
by Edmundson, Crook, and Tung (1972) included coverage of statistical linguistics from 1967 to
1971, and listed other possible sources of bibliographic information.
Somewhat parallel to the line of inquiry of this study is a recent paper by Klimeš (1972),
concerned with quantitative analysis of the jargon (here properly called the slang) of certain social
groups in Czechoslovakia. Klimeš studied the numbers of "specific lexical units occurring in the
slang" of the various groups and attempted to measure the differing rates of turnover of the slang
in different groups. His comments on the newness of such an effort are pertinent here:
The findings of modern linguistics have contributed considerably to the
solution of some problems of social dialects. But as far as we know no attention has
been given so far to the quantitative aspects of social dialects. . . .
The present study has tried to discover the quantitative relations in the
structure of the kinds of slang used by miners, postmen, railwaymen, footballers, and
students. It paid attention to the number of the lexical units in the slang vocabulary,
to its synonymy, to the occurrence of slang lexical units and the quantitative
development of the slang, especially of apprentices' slang. . . . Without
quantitative and statistical methods it would be impossible to estimate the increase of
apprentices' slang and to answer the question how many months or even years the
apprentices need to master the slang used by skilled workers. . . .
In spite of various difficulties, the quantitative and statistical method is able
to yield new information about the structure and development of the new kinds of
slang. As far as we know, it has been applied here, on a larger scale, for the first
time.
Barber (1962) dealt with the needs of "teachers of English abroad, and especially . . . those
who teach English to scientists and technologists." He examined the questions of what sentence
structures and verb forms are common in scientific prose, and what specific English words "would
be generally useful to the science and technology student trying to read specialist text-books in
English." Kučera and Francis (1967) presented "lexical and statistical data" on the "Standard
Corpus of Present-Day Edited American English, a computer-processible corpus of language texts
assembled at Brown University during 1963-64." Among their text samples were 80 in the category
they called "Learned and Scientific Writings," a category so broad that it included articles entitled
"Functional Marriage Course for the Already Married" and "Elections in Morocco: Progress or
Confusion?" Though both the Barber and the Kučera and Francis studies included quantitative
study of scientific prose, neither of them is closely relevant to the present study.
2.2 Zipf's Hypothesis and Some Suggested Explanations
In contrast to the scarcity of quantitative studies of scientific prose is the abundance of
studies dealing with Zipf's Hypothesis in the field of language, similar phenomena in other fields,
and efforts to find a theoretically satisfying explanation of the pervasiveness of such phenomena.
2.2.1 Early Work by Zipf
As a graduate student at Harvard in the late 1920s, Zipf encountered the phenomenon he
later made famous. In studying phonetic changes in languages, he became interested in the
frequency of use of phonemes as a factor in their tendency to change phonetically over long periods
of time. His first studies were published under the title "Relative Frequency as a Determinant of
Phonetic Change" (Zipf, 1929), a paper based on his doctoral dissertation.
A review by Joos (1936) provides a summary of Zipf's thesis in this and his later works:
The thesis, very briefly stated, is that the key to the explanation of all
synchronic and diachronic language-phenomena has been found in a statistically
established tendency to maintain equilibrium between size and frequency. . . .
In greater detail: (1) That relatively frequent use of a linguistic unit causes it
to be reduced in one or more of its various kinds of magnitude—accent, complexity
of articulation, extent in time, number of components, etc.—while relative
infrequency of use occasions corresponding enlargements; (2) that this Law of
Abbreviations has been established by statistical study; (3) that this Law can serve as
the basis of a new science of language; (4) that current techniques of linguistic
science thereby become partly obsolete, partly ancillary.
Zipf's first book, Selected Studies of the Principle of Relative Frequency in Language (1932), included
his first published findings with respect to the frequencies of words. In his next book, The Psycho-Biology
of Language (1935), Zipf continued to explore countable linguistic phenomena. Two ideas
from this book are of interest here. First, Zipf treated the tendency for an inverse relationship to
exist between the frequency of use of a word and its length, measured in letters, phonemes, or
syllables. Second, he called attention for the first time to the phenomenon of the rank-frequency
curve, rf = c. Interestingly enough, in view of the fact that this phenomenon was eventually to bear
his name, Zipf began his discussion of it by noting that it was "suggested by a friend" (1935, p. 44).
As part of his review of The Psycho-Biology of Language, Joos (1936) touched on a problem in
the use of the relation rf = c. This relation implies a mathematical contradiction in that it leads
theoretically, Joos showed, to the impossible conclusion that a divergent series has a finite sum. To
avoid this problem, Joos suggested the model $r^{1+b} f = c$, where b is a small positive fraction. It can be
argued that because of the numbers involved in actual vocabulary sizes, the contradiction lacks
practical importance. Nevertheless, Joos was certainly correct in pointing out that there was neither
a theoretical nor an empirical reason for Zipf's insistence that the exponent of r in the relation
rf = c be exactly 1, and in urging Zipf to admit the possibility that $r^{1+b} f = c$ was a better model.
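The contradiction can be sketched as follows. This is only an illustrative reading of Joos's argument: the symbol N (total number of word occurrences in the text) and the assumption that the relation holds for every rank of an unbounded vocabulary are introduced here and are not taken from the source.

```latex
% Assumption: r f = c holds for every rank r = 1, 2, 3, ...
% N (hypothetical symbol) is the total number of word occurrences.
N = \sum_{r=1}^{\infty} f_r = c \sum_{r=1}^{\infty} \frac{1}{r} = \infty ,
\qquad\text{whereas}\qquad
N = c \sum_{r=1}^{\infty} \frac{1}{r^{1+b}} = c\,\zeta(1+b) < \infty \quad (b > 0).
```

For a finite vocabulary of V words the harmonic sum is roughly ln V + 0.577, which remains finite; this is consistent with the remark that the contradiction lacks practical importance.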
2.2.2 Later Work by Zipf
In addition to considering word frequencies in "Relative Frequency as a Determinant of
Phonetic Change," Selected Studies of the Principle of Relative Frequency in Language, and The Psycho-Biology of
Language, Zipf devoted much effort to studying the frequency distribution of phonetic elements and
to similar studies of such matters as intervals between repetitions of various linguistic elements,
accent patterns, and occurrences of parts of speech. All of these phenomena exhibit behavior that
can be described by some form of Zipf's Hypothesis. The regularity of behavior that he found was
fascinating to Zipf, as it has been to many others. An indication of his feelings may be found in
these comments (Zipf, 1935, p. vii):
. . . a very understandable human question . . . continuously [lurks] in the
background. What person in speaking ever selects or arranges his words for the sake
of preserving or restoring any imaginable condition of equilibrium in the resultant
frequency-distribution of the elements of his speech? Clearly we select words
according to their meanings, and according to the ideas and feelings which we wish
to convey; both content and direction of our speech are dictated almost solely by
exigencies of meaning and emotion. What, then, is the nature of meaning and
emotion that their manifestation in the production of speech reveals such a high
degree of orderliness as we find?
After publication of The Psycho-Biology of Language in 1935, and especially from 1940 on, Zipf
examined many human phenomena for which the formula $r^B f = c$ (i.e., the generalized form of
Zipf's Hypothesis) or the related generalized harmonic series $\sum r^{-p}$ are mathematical models.
During this period he continued to seek an intellectually satisfactory rationale for the astonishingly
pervasive applicability of these mathematical models. His earlier arguments were based on what he
called the Force of Unification and the Force of Diversification, from whose struggle for
equilibrium supposedly arose the observed statistical regularity. Later, he developed the Principle of
Least Effort: that men seek to minimize the effort needed in their work. He saw in this principle
the resolution of the struggle between the Forces of Unification and Diversification.
Zipf's last book was Human Behavior and the Principle of Least Effort, published in 1949. Its 573
pages include an impressive range of phenomena to which Zipf was able to apply his mathematical
models. One interesting point arose in his discussion (pp. 288-296) of rank-frequency curves for
the language of schizophrenics, which seems to be the only place where Zipf recognized the
possibility that language curves could have slopes different from -1: "of all the rank-frequency data
on words that have ever come to the attention of the present writer, only those of [two
schizophrenics] have negative slopes greater than unity." However, it must be borne in mind that
Zipf was accustomed to "measure" such slopes by merely plotting the data, drawing a line of slope
-1, and then noting that the fit of the line to the data looked good to the eye. Even with the
schizophrenics' data he simply observed that the data curves were quite discrepant from the "ideal
line" of slope -1, being much steeper.
For this observation, Zipf (1949, p. 295) offered the explanation that "these data . . .
suggest that [the schizophrenic] was loading his words with an inconveniently large number of
meanings." That is, a schizophrenic, as compared with a normal person, would attach more
meanings to an average word and thus would use a smaller number of distinct words within a given
amount of prose. The result would be a lower maximum rank, along with higher frequencies for the
words used, causing the observed steeper slope of the rank-frequency curve. The idea that such
curves may have slopes that are not necessarily -1 and in fact may vary with differences in the
subject fields and in the time periods of prose is, of course, the topic of the present study.
2.2.3 Mandelbrot's Explanation of the Rank-Frequency Phenomenon
Following the publication of Human Behavior and the Principle of Least Effort, several workers
undertook to find a more acceptable explanation of the rank-frequency phenomenon than Zipf's
Principle of Least Effort. The most persistent of these workers has undoubtedly been Benoît
Mandelbrot. In a series of publications (1951a, 1951b, 1953a, 1953b, and 1954a) Mandelbrot
initiated the use of ideas from information theory to explain the rank-frequency phenomenon. An
excellent summary is included in a paper by Miller (1954). The essence of Mandelbrot's contribution
was his considering communication costs of words in terms of the letters that spell them and the
spaces that separate them. This cost increases with the number of letters in a word and, by
extension, in a message. Mandelbrot showed that Zipf's Hypothesis follows, as a first
approximation, from the minimization of communication costs in terms of letters and spaces.
Linguistically, this amounts to minimizing costs in terms of phonemes.
On the basis of communication costs, Mandelbrot further derived a more accurate second
approximation:
$$(r + m)^B f = c \qquad (2.1)$$
where
f = frequency of a word
r = rank of the word
B, c, m = constants dependent on the corpus.
The key idea in this approximation is that m has its greatest effect when r is small and thus enables
equation (2.1) to provide a better fit to typical observed data, especially to the low-rank, high-frequency
words, than does Zipf's Hypothesis, $r^B f = c$. In a later paper Mandelbrot emphasized
the variability of the slope -B of the rank-frequency curve. This paper (1954b, pp. 25-45; my
translation) called 1/B the
informational temperature of the text. High temperature means that the available words
are well employed, rare words being utilized with appreciable frequencies. Low
temperature means that the words are poorly employed, rare words being extremely
rare. . . .
The qualitative appearance of texts is distinctly different according to
whether the temperature is less than 1, or greater than 1.
If the temperature is less than 1 [i.e., if B > 1] . . . the diversity generally increases
with temperature. For example, James Joyce, who has an extreme diversity, also has
a B very close to 1. This example was misleading to Zipf, for this author considered
Joyce to be the best available sample, because of his length and diversity, and took
B = 1 to be the best estimate of B for every author, whereas in fact this value is due
to the exceptionally large diversity of Joyce's text. At the other extreme,
schizophrenics have a very low temperature (B varying from 1.4 to 1.6). . . .
Typical values for the English authors considered range from B = 1.1 to B = 1.3.
Thus the individual variations are very large among texts in the same
language. . . .
If the temperature is greater than 1 [i.e., if B < 1], which is found only in
exceptional cases, . . . rare words are employed relatively more frequently than
[when B > 1] . . . .
In actual fact, the only examples of texts with a temperature higher than 1 are
modern Hebrew around 1935, certain very purist poets, and finally the Latin in the
prose of Notker [footnote 1] and the English in Pennsylvania Dutch.
In the first case, the vocabulary was limited for historical reasons whereas the
diversity of meanings for which people desired to use this vocabulary was large. In
the second case, the vocabulary was intentionally limited. The two other cases are a
matter of loan words. . . .
It would be equally interesting to test our explanation of the case of
temperature higher than 1, by a comparative study of texts that are authentically of
children or of the common people, where the temperature ought to be below 1,
with texts written by adults forcing themselves to employ only a restricted
vocabulary. There would be the same interest in the language of aphasics.
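Equation (2.1) and the informational temperature 1/B quoted above can be explored numerically. The following is a minimal sketch, not Mandelbrot's or the present study's procedure: the function names are hypothetical, and the local slope -B r / (r + m) is obtained here simply by differentiating log f = log c - B log(r + m) with respect to log r.

```python
def mandelbrot_frequency(r, c, B, m):
    """Frequency f implied by equation (2.1), (r + m)^B f = c."""
    return c / (r + m) ** B


def local_slope(r, B, m):
    """Slope d(log f)/d(log r) of equation (2.1) at rank r.

    From log f = log c - B log(r + m) it follows that the slope is
    -B * r / (r + m): flatter than -B at low ranks when m > 0, and
    approaching -B as r grows.
    """
    return -B * r / (r + m)


def informational_temperature(B):
    """Informational temperature 1/B, as named in the quoted passage."""
    return 1.0 / B


if __name__ == "__main__":
    # Illustrative constants only; they are not fitted to any corpus.
    c, B, m = 1000.0, 1.2, 2.0
    for r in (1, 10, 100, 1000):
        print(r, mandelbrot_frequency(r, c, B, m), local_slope(r, B, m))
    print("temperature:", informational_temperature(B))
```

With m = 0 the local slope equals -B at every rank, which recovers the generalized form of Zipf's Hypothesis; a positive m bends only the low-rank end of the curve.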
In his later work Mandelbrot (1954b, 1956, 1957a, 1957b, 1957c, 1957d, 1959, and 1961)
refined the information-theoretic bases of his explanation of Zipf's Hypothesis. A very readable
summary is that of Mandelbrot (1966). He also considered certain other areas where essentially the
same mathematical model applies: e.g., "natural systems of categories," such as the distribution of
species within genera, first studied by the botanist J. C. Willis.
Footnote 1. Zipf (1949, pp. 115-16) explained this reference by saying:
For the purpose of instructing his pupils, Notker Labeo (died 1022) . . . "translated"
certain Latin classics into a mixture of Old High German and Latin in such a
fashion that the two languages are inextricably woven together. . . . [In his
translations] we probably see . . . what the students of the monastery really spoke
. . . an Old High German that was heavily garbled with words and clichés from Latin
which the students were trying to learn.
Miller and Newman (1958) and Miller, Newman, and Friedman (1958) reported on an
empirical study of word frequencies and word lengths that tended to support Mandelbrot's
approach. The second paper also contained a full list of the words they treated as common words in
the study.
2.2.4 Simon's Explanation of the Rank-Frequency Phenomenon
Herbert A. Simon found a somewhat different path along which to seek a rationale for
Zipf's Hypothesis. He described the origin of his interest in the problem in the following way
(Simon, 1957, p. 97):
In the late 1930's I came across some papers and books by a Harvard
philologist, George Kingsley Zipf, which reported remarkable regularities in such
phenomena as the distribution of cities by size, the relation of rank order to
frequency of word occurrences, the distribution of frequencies of publication by
scientists, and many others. Zipf tried to bring all these under the roof of a single
mathematical relationship, the harmonic law, and devised a pseudo-explanation
which he dubbed the "law of least effort."
These data were more irritating than a grain of sand in an oyster. The
relations were undoubtedly genuine, and the purported explanation undoubtedly
spurious.
In his first paper on the problem, Simon (1955, p. 427) proposed
. . . to analyze a class of distribution functions that appears in a wide range of
empirical data—particularly data describing sociological, biological and economic
phenomena. Its appearance is so frequent, and the phenomena in which it appears
so diverse, that one is led to the conjecture that if these phenomena have any
property in common it can only be a similarity in the structure of the underlying
probability mechanisms. The empirical distributions to which we shall refer
specifically are: (A) distributions of words in prose samples by their frequency of
occurrence, (B) distributions of scientists by number of papers published, (C)
distributions of cities by population, (D) distributions of incomes by size, and (E)
distributions of biological genera by number of species.
Simon employed a stochastic model, "in which the probability that a particular word will be the next
one written depends on what words have been written previously." He showed that this approach
led to a class of distribution functions conceptually different from Zipf's and Mandelbrot's models,
though all three models make numerically similar predictions.
Simon had offered his stochastic model as an alternative to Mandelbrot's information-theoretic
approach. An exchange of papers with Mandelbrot ensued, leading to a paper by Simon
(1960). Its interest here is Simon's observation that since none of the models fits the data exactly,
especially for low ranks, "it is difficult to know how to estimate" the slope of the rank-frequency
curve in logarithmic coordinates. "An unweighted least-squares fit to the distribution on a
logarithmic scale is perhaps not the most plausible method." However, Simon did not suggest an
alternative method.
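For concreteness, the unweighted least-squares fit that Simon mentions can be sketched as follows. This is an illustration under stated assumptions and should not be read as the estimation procedure of Simon or of the present study: the function names are hypothetical, base-10 logarithms are an arbitrary choice, and tied frequencies are simply given consecutive ranks.

```python
import math
from collections import Counter


def rank_frequency(words):
    """Word frequencies in decreasing order; position i holds rank i + 1."""
    return sorted(Counter(words).values(), reverse=True)


def loglog_slope(freqs):
    """Unweighted least-squares line through (log r, log f).

    Returns (slope, intercept). Every rank carries equal weight, which
    is the property Simon questioned.
    """
    if len(freqs) < 2:
        raise ValueError("at least two ranks are needed to fit a line")
    xs = [math.log10(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Toy text, far too short for a meaningful slope.
    text = "the cat saw the dog and the dog saw the cat".split()
    print(rank_frequency(text))
    print(loglog_slope(rank_frequency(text)))
```

Because the many low-frequency words at high ranks outnumber the few high-frequency words, an unweighted fit of this kind is dominated by the tail of the curve.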
2.2.5 Other Work on the Rank-Frequency Phenomenon
Parker-Rhodes and Joyce (1956 and 1957) and Good (1957) developed Zipf's Hypothesis
from assumptions concerning the amount of effort required in searching the memory for each word
seen, heard, or to be expressed. Mandelbrot (1954b, p. 22) had earlier mentioned the possible
relationship of his information-theoretic cost of a word to studies that seemed to show that the time
required to read a word depended "logarithmically on the supposed frequency of the word within
the subject-field studied," a dependence that is consistent with Zipf's Hypothesis.
Belevitch (1959) found that the truncated lognormal distribution also provides a means of
approximating observed rank-frequency data. Carroll (1967) discussed the problems of sampling
text under the hypothesis that word frequencies are distributed lognormally, and he found good
agreement with the lognormal distribution in empirical data.
Edmundson (1972) elegantly brought the work of Zipf, Joos, and Mandelbrot together in
what he called the "3-parameter rank distribution"
$$f(r; c, b, a) = c\,(r + a)^{-b}, \qquad c > 0,\ b > 0,\ a \ge 0$$
which includes the Joos and Zipf versions as special cases. Edmundson, Fostel, Tung, and
Underwood (1972a and 1972b) also treated the problem of estimating, given a sample of text, the
total vocabulary size V from which the sample was formed (i.e., the number of words in the entire
theoretical vocabulary). They provided useful tables relating V, a, b, and c.
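The special cases can be written out explicitly. The sketch below assumes the 3-parameter form f(r; c, b, a) = c (r + a)^(-b); the symbol b' is introduced here for Joos's small positive fraction, to avoid a clash with Edmundson's b, and the Mandelbrot line is included on the assumption that equation (2.1) maps onto the same form.

```latex
% Form assumed: f(r; c, b, a) = c (r + a)^{-b}.
% b' (hypothetical symbol) stands for Joos's small positive fraction.
\begin{aligned}
\text{Zipf:}       &\quad a = 0,\ b = 1      &&\Rightarrow\ f = c\,r^{-1}      &&\text{i.e. } r f = c, \\
\text{Joos:}       &\quad a = 0,\ b = 1 + b' &&\Rightarrow\ f = c\,r^{-(1+b')} &&\text{i.e. } r^{1+b'} f = c, \\
\text{Mandelbrot:} &\quad a = m,\ b = B      &&\Rightarrow\ f = c\,(r+m)^{-B}  &&\text{i.e. } (r+m)^{B} f = c.
\end{aligned}
```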
2.2.6 The Waring-Herdan Formula
Herdan (1956, 1960, 1962, 1964, and 1966) condemned the work of Zipf, Mandelbrot,
Simon, and others, for a variety of reasons. One of these, a manageable problem, is the sensitivity
of the rf = c model to the size of the corpus to which it is applied. Another criticism (Herdan, 1962,
p. 61) concerned the
unscientific nature, if not futility, of the frequency-rank relation . . . best brought
out by comparison with a physical law of essentially the same form. As is well
known, the relation between volume and pressure in an adiabatic gas can be
represented as an equi-lateral hyperbola in a co-ordinate system, one of whose
co-ordinates, say x, represents the volume, and the other, say y, the pressure. The
formula is then xy = const. On a bilogarithmic grid, the relation would, of course,
plot as a straight line.
It is easily seen that both formula and graph of the volume-pressure relation
are closely similar to those for the word frequency-rank relation. But there is a great
difference. The value of the volume-pressure relation consists in this, that it enables
us to estimate either the volume for a specified pressure, or the pressure for a
specified volume, of the gas under certain conditions. Both, volume and pressure,
are independently observed, and a specified value of one enables us to estimate the
corresponding measure of the other variable.
Nothing of this is possible according to the Zipf-Mandelbrot law. There is
no way of independently obtaining the rank of a word from which to estimate the
frequency, simply because the rank of the word results only from having first
arranged all the words, including the one we have in mind, according to their
frequency. And vice versa, although the frequency of a word can be directly
observed by counting the number of occurrences, again we cannot give an estimate
of the rank of that word until the frequency observations for all words in the text
have been done, and the frequency-rank curve has been properly set up.
The analogy with Boyle's Law is somewhat misleading. It is true that, unlike volume and
pressure, the rank of a particular word cannot be independently observed. Nevertheless, if one
knows at least the size of a corpus, then he can estimate what rank value will correspond to a given
frequency value, and vice versa, after the ordering by frequency. Although these estimates cannot
be made so precisely as Boyle's Law enables volume to be estimated from pressure, this does not
make such estimates totally impossible. That the rank of a particular word cannot be observed in
advance of the ordering by frequency is irrelevant to the estimation problem. Perhaps Herdan's
objection is really that not only frequency but also total size of corpus (and possibly also B) should
be known in order that rank may be estimated. If so, it ought to be recognized that the application
of Boyle's Law requires that not only volume but also temperature must be known in order that
pressure may be estimated.
In his Quantitative Linguistics (1964, pp. 85-88), Herdan offered an alternative to the rf = c
model. He suggested that a better approximation to observed rank-frequency data was to be found
in a probability function based on a little known algebraic relation:
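$$p_f = \begin{cases}
\dfrac{x - a}{x}, & f = 1, \\[2ex]
(x - a)\,\dfrac{a(a+1)(a+2)\cdots(a+f-2)}{x(x+1)(x+2)\cdots(x+f-1)}, & f = 2, 3, \ldots,
\end{cases} \qquad 0 < a < x,$$
where $p_f$ denotes the probability that a word will appear with frequency f in a large corpus, and a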
and x are constants that will depend on the corpus. The function is due to Irwin (1963), who
discovered its applicability in a search for "adequate theoretical forms to describe biological
distributions with very long tails." Irwin credited an eighteenth-century British mathematician,
Edward Waring, with discovering the basic inverse factorial expansion underlying the probability
function. In linguistics the function has come to be known as the Waring-Herdan formula.
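A minimal sketch of how the probabilities of the Waring-Herdan formula might be computed is given below. It assumes the form p_1 = (x - a)/x and, for f = 2, 3, ..., p_f = (x - a) a(a+1)...(a+f-2) / [x(x+1)...(x+f-1)], with 0 < a < x; the function name and the use of the term-to-term ratio p_(f+1)/p_f = (a + f - 1)/(x + f) are choices made for this illustration, not part of Herdan's or Irwin's presentation.

```python
def waring_herdan_probs(a, x, max_f):
    """Return [p_1, ..., p_max_f] for the assumed Waring-Herdan form.

    Successive terms are obtained from the ratio
    p_(f+1) / p_f = (a + f - 1) / (x + f),
    which avoids building the long products directly.
    """
    if not 0 < a < x:
        raise ValueError("the constants must satisfy 0 < a < x")
    probs = []
    p = (x - a) / x  # p_1
    for f in range(1, max_f + 1):
        probs.append(p)
        p *= (a + f - 1) / (x + f)
    return probs


if __name__ == "__main__":
    # Illustrative constants only; they are not fitted to any corpus.
    probs = waring_herdan_probs(a=1.0, x=2.0, max_f=999)
    print(probs[:3])
    print(sum(probs))  # partial sum, approaching 1 as max_f grows
```

For a = 1 and x = 2 the formula reduces to p_f = 1 / (f (f + 1)), so the slow approach of the partial sums to 1 shows the very long tail that motivated Irwin's interest in the function.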
Herdan fitted the formula to data from a story by Pushkin, with good results. Earlier, Irwin
(1963) had presented examples of fitting the formula to data from a study by Kendall (1960) on the
frequency distribution of bibliographic citations. These results were also good. In an informal
comment on, and immediately following, Irwin's paper, Kendall reported a good fit of the formula
to data on the distribution of lengths of runs among 1602 runs on an IBM 7090. Muller (1969)
tested the Waring-Herdan formula on several texts and for larger ranges of f. Though his results
were not uniformly so good as the examples just mentioned, especially for very high f, Muller
concluded that the formula's
success over a large range of data is striking and cannot be attributed to a series of
happy coincidences. . . .
The uses of this formula will be interesting to explore. . . . They can only
enrich our still limited knowledge of the processes which transform the elements of
our mental lexicon into vocabulary in action.
2.2.7 A Perspective on the Rank-Frequency Phenomenon
There can be no doubt that it was Zipf's pioneering work that stimulated interest among
other linguists and among mathematicians in the striking pervasiveness, among diverse real
phenomena, of frequency distributions that are related to Zipf's Hypothesis. But Zipf was not able
to provide an underlying rationale for his phenomena that others could accept. Consequently, it was
left to his successors to try to develop better mathematical models, hoping thereby to provide also
better explanations of why such diverse phenomena should share such a strong mathematical
relationship.
Three excellent summaries of many of the matters in section 2.2 deserve mention here:
Plath (1963), Fairthorne (1969), and Good (1969).
2.3 The Notion of Hard Science and Soft Science
One of the major purposes of the present study was to investigate the possibility that the
slope of the rank-frequency curve of writings in a scientific discipline might provide a measure of
the hardness of the discipline. The present section reviews published discussions of hard and soft
science: first, a clarification of the notions of hardness and softness in science, and second, reports
of objective measures found to be consistent with accepted evaluations of the hardness of various
disciplines.
The recency of the phrases "hard science" and "soft science" is indicated by their inclusion
in The Barnhart Dictionary of New English since 1963 (Barnhart, Steinmetz, and Barnhart, 1973), which
claims to cover "those terms and meanings which have come into the common or working
vocabulary of the English-speaking world during the period from 1963 to 1972." It includes the
definitions
hard science, any of the natural or physical sciences, such as physics, chemistry,
biology, geology, and astronomy.
soft science, any of the social or behavioral sciences, such as political science,
economics, sociology, and psychology.
The earliest quotation given for either item in the Barnhart Dictionary is from a story in Time for
June 3, 1966. The 1961 edition of Webster's Third New International Dictionary of the English Language
contains no definition of either "hard" or "soft" linking them to the word "science."
2.3.1 Storer's Work
A paper by Storer (1967) appears to be the first published discussion of hard and soft
science. Storer was interested in certain sociological aspects of the ways in which scientists in hard
and soft fields work, but he took the trouble to try to define what the new phrases "hard science"
and "soft science" meant.
Let us begin by looking at some of the connotations of the words "hard" and
"soft." These terms are obviously expressive in some way of something about
different fields of science; we think of physics as hard and of political science as soft;
we think of chemistry as being harder than zoology, and of sociology as being
somewhat softer than economics. But what is it we sense about these different
sciences that makes it appropriate for us to assign these adjectives as we do? . . .
"Hard" seems to imply tough, brittle, impenetrable, and strong, while "soft,"
on the other hand, calls to mind the qualities of weakness, gentleness, and
malleability. In more personal terms, "hard" suggests impersonality, aggressiveness,
and a sharp concern for the letter of the law, while "soft" implies sympathy, warmth,
and informality. Going still further, we find that a "hard" job is one that is difficult
or laborious, and that "soft" jobs are those that are easy, not demanding of great
effort.
Somewhere among these various connotations, I think we will find the key to
why we feel it appropriate to say that biochemistry is hard and psychology is soft, or
that genetics is hard and anthropology is soft. The immediate explanation that
comes to mind, of course, is that a hard science is one that requires more effort to
learn. Physics presumably requires more concentration, more hours of homework
and laboratory exercises, than does sociology if one is to earn an "A" for the
semester. However, I am not satisfied that it is simply the relative difficulty involved
in mastering different subjects that accounts for the way we employ these adjectives;
....
As an aspect of the hardness of a field, Storer then considered the manner in which new
contributions to the field are evaluated.
Knowledge refers essentially to a set of symbols that are organized so that
the meaning of each symbol is supported by the others. Further, the relationships
among these symbols must be such that, ideally, no logical contradictions among
them are produced by the rules that govern their relationships. That is, while the
symbols making up scientific knowledge refer primarily to events and to their
relationships with each other in the "real world," there do exist rules that enable us
to relate these symbols to each other so that they constitute more than just a
congeries of separate statements. To judge the goodness of a contribution to
knowledge, then, requires not only that we find it to be a valid representation of
empirical phenomena, but also that we be able to relate it to the established set of
symbols representing what we already know about these phenomena. . . .
. . . the rules governing the relationships among these symbols . . . may be
more or less "rigorous." Both the precision with which a contribution fits into an
existing body of knowledge, and the specific implications it has for existing
knowledge, are ultimately functions of the amount of precision that characterizes
these rules. . . .
. . . "Hardness" . . . implies much more than relative difficulty in mastering
a subject; it suggests also the degree of difficulty involved in making a contribution
to the subject and, thus, the degree of risk a scientist takes when he offers a
contribution. If a hard science is one in which error, irrelevance, or sloppy thinking
is relatively easy to detect, then the scientist must take greater pains in his research if
he does not wish to be exposed as incompetent.
In the softer sciences, on the other hand, where such a high level of rigor is
lacking, it is likely that such nonscientific criteria as relevance to common values or
to practical problems, elegance of style, or even the unexpectedness of one's findings
vis-à-vis common sense, will play a larger part in determining the acceptance and
success of a contribution.
Storer's explanation of what constitutes hardness and softness in science is consistent with
the supposition in this study that hard sciences are those that contain a greater proportion of well
defined, standardized concepts than do soft sciences. Storer went on in his 1967 paper to find an
interesting correlation between the hardness of a field and the frequencies with which authors in the
field use tables and provide only the initials, rather than the first names, of persons whom they cite.
In a later paper Storer (1972a) added another dimension by which to classify scientific
disciplines, with these results:
Dividing the social dimension between "basic" and "applied," and the
cultural dimension between "hard" and "soft," we have four cells.
The assignment of disciplines to separate cells must be a fairly impressionistic
operation. Working with NSF data from 1968 and using the fifteen disciplines into
which the scientific community was categorized that year . . . we can tentatively
distinguish between the hard and soft disciplines by assuming that the physical and
mathematical sciences are "harder" than the biological and social sciences. For the
basic-applied distinction, it seems reasonable to draw this
line on the basis of the relative percentage of each discipline's members who report
being engaged principally in basic research or teaching, as opposed to applied
research, administration, and so on. The eight disciplines ranking highest on this
measure are classified as "basic," and the remainder "applied." . . . This
arrangement yields the following categories:
Basic-Soft   : anthropology, biology, linguistics, political science, and
               sociology
Basic-Hard   : mathematics and physics
Applied-Soft : agricultural sciences, economics, and psychology
Applied-Hard : atmospheric and space sciences, chemistry, computer
               sciences, earth and marine sciences, and statistics
Storer found differences among the four cells with respect to the following measures:
"percentage of discipline-members holding the doctorate," "percentage of doctorates under forty
years of age," "percentage employed in educational institutions," "percentage of submissions
rejected by disciplinary journals," and "percentage of discipline-members receiving Federal support."
2.3.2 Hagstrom's Work
In what has already become a classic work in the sociology of science, Hagstrom (1965)
studied many aspects of the social organization of science. One of these was the tendency, as
perceived by "recent recipients of the Ph.D.," of professors to exploit their students by such
techniques as "prolongation of graduate work and subordination of the student's educational
interests to the professor's research interests." Hagstrom ranked various fields by the "percentage of
recent recipients agreeing that 'major professors often exploit doctoral candidates'." He found that
this percentage declined from the hard sciences through the soft sciences to the humanities, except
that the percentage in the field of mathematics and statistics was as low as that in the humanities.
However, Hagstrom pointed out that by its nature mathematics is a field in which "professors are
very seldom assisted by their students," so that opportunities for exploitation are rare.
Hagstrom also studied the relative productivity of workers in several fields. Although his
data here are more difficult to interpret, they appear to show a general decline in productivity from
the biological sciences, through the physical and mathematical sciences, then through the social
sciences and engineering, to the humanities.
Another topic was the relative percentage of jointly authored articles in several fields.
Hagstrom found a decline in this percentage from hard to soft fields, mathematics being an
anomaly. His figures are:
chemistry      83 %
biology        70
physics        67
psychology     47
mathematics    15
philosophy      5
history         4
English         3
2.3.3 Price's Work
Price (1970) discussed some further possible measures of hard and soft science. One was
the percentage of Ph.D.s in a field "who become employed in the nonuniversity world," Price's
suggestion being that this percentage is low for hard fields of science, higher for soft fields, and
highest of all for the humanities. But his main purpose was to propose what he named "Price's
Index," defined as "the proportion of the references [in a paper] that are to the last five years of
literature." Of the use of this statistic, Price said:
Perhaps the most important finding I have to offer is that the hierarchy of
Price's Index seems to correspond very well with what we intuit as hard science, soft
science, and non-science as we descend the scale. At the top, with indexes of 60 to
70 percent we have journals of physics and biochemistry; a little lower there are
publications like Radiology (58 percent) and American Journal of Roentgenology (54
percent). American Sociological Review stands at 46.5 percent and a study by Parker
et al.1 shows most of the other social sciences clustering around 41.9 percent ± 1.2
percent, while a pilot project investigation of my own covering 154 journals of
various brands of scholarship . . . showed that the median over all fields of science
and nonscience was 32 percent with quartiles at 21 percent and 42 percent.
Table 2-1 presents the values of Price's Index for the journals used in the present study. The
first five lines of the table were taken from Price (1970). The last line was calculated for this study,
since Price did not include the Psychological Review among his data.
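A calculation of the kind made for the last line of Table 2-1 can be sketched as follows. This is a minimal illustration only: the function name `prices_index` is hypothetical, and the counting rule is an assumption, since neither Price's definition as quoted nor this section states exactly which years fall within "the last five years." The sketch counts a reference as recent when it is dated fewer than five years before the citing paper.

```python
def prices_index(publication_year, reference_years, window=5):
    """Return the fraction of references dated within the last `window` years.

    Assumption: a reference is "recent" when
    publication_year - reference_year < window, so a 1969 paper's
    recent references are those dated 1965 or later.
    """
    if not reference_years:
        raise ValueError("a paper without references has no Price's Index")
    recent = sum(1 for year in reference_years
                 if publication_year - year < window)
    return recent / len(reference_years)


# Hypothetical reference list for a paper published in 1969.
example_years = [1969, 1968, 1966, 1965, 1964, 1960, 1950, 1931]
print(prices_index(1969, example_years))  # prints 0.5
```

For a journal-level figure such as those in Table 2-1, the same ratio would be taken over the pooled references of all the articles sampled.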
In summary, Price declared:
. . . It would seem that this index provides a good diagnostic for the extent
to which a subject is attempting, so to speak, to grow from the skin rather than from
the body [i.e., from the most recent results rather than from the accumulated
knowledge of the field]. With a low index one has a humanistic type of metabolism
in which the scholar has to digest all that has gone before, let it mature gently in the
cellar of his wisdom, and then distill forth new words of wisdom about the same
sorts of questions. In hard science the positiveness of the knowledge and its short
term permanence enable one to move through the packed down past while still a
student and then to emerge at the research front where interaction with one's peers
is as important as the storehouse of conventional wisdom. The thinner the skin of
science the more orderly and crystalline the growth and the more rapid the process.
1 The reference is to "Bibliographic Citations as Unobtrusive Measures of Scientific
Communication," by Edwin B. Parker, William J. Paisley, and Roger Garrett. Institute for
Communication Research, Stanford University, Palo Alto, CA, 1967.
TABLE 2-1
VALUES OF PRICE'S INDEX FOR JOURNALS USED IN THIS STUDY (AFTER PRICE)

                                                          Average Number   Percent of All
                                   Number of              of References    References Dated
Journal                            Articles    Date       per Article      within Last 5 Years
American Journal of Mathematics    12          "recent"    9               29
Ecology                             8          "recent"   24               26
Physical Review                    47          1900        5               56
Physical Review                    28          1925        8               67
Physical Review                    18          "current"  11               72
Psychological Review               37          1968       42               48

Though mathematics is clearly a hard science with respect to its rigor and positiveness of
knowledge, Price's Index assigns it to the level of a soft science. Part of the anomaly is explained by
the long-term permanence of research in mathematics, which means that authors are more likely to
cite older material. Another part can be explained by the finding by Campbell and Edmisten (1964)
that mathematics journals run a much higher backlog (more than twice the average) of papers
accepted and awaiting publication than do journals in other fields; thus, by the time a mathematics
paper appears in print, its references have a greater probability of being more than five years old.
2.3.4 Other Work on Hard and Soft Science
Chase (1970) followed up Storer's suggestion that the difference between hard and soft
science might consist at least partly of different degrees of rigor and precision in organizing
knowledge and of different criteria for the evaluation of new research. She asked faculty members
at a "Big Ten" university to rate the importance of ten possible evaluation criteria, examples of
which are "logical rigor," "replicability of research techniques," "pertinence to current research in
the discipline," and "applicability to 'practical' problems or applied problems in the field." Of these
examples, the first two were ranked highest, and the last two lowest, by the respondents as a whole;
but Chase found significant differences between the responses of "natural scientists" and "social
scientists." She summarized her findings by stating:
Disciplinary variations in the norms used to evaluate scientific information
for publication appear to be related to the stage of development of the discipline's
knowledge base. The results of this study indicate that the harder natural sciences
stress precise mathematical and technical criteria, whereas the softer social sciences
emphasize less-defined logico-theoretical standards.
Zuckerman and Merton (1971) ranked 16 scientific and humanistic fields according to the
average rates of rejection of manuscripts by journals in each field. They reported:
The figures exhibit marked and determinate variation. Journals in the
humanities have the highest rates of rejection [e.g., 90 percent in history]. They are
followed by the social and behavioral sciences with mathematics and statistics next in
line. The physical, chemical, and biological sciences have the lowest rates, running to
no more than a third of the rates found in the humanities.
Confirming this empirical uniformity are subsidiary patterns of deviant rates
within disciplines which virtually reproduce the major patterns. To begin with,
consider the field of physics. The 12 journals had an average rejection rate of 24
percent, with the figures for 11 of them varying narrowly between 17 percent and 25
percent. But the twelfth journal, the American Journal of Physics, departs widely from
this norm with a rejection rate of 40 percent. In light of the general pattern of
rejection rates, we suggest that this seemingly deviant case only confirms the rule.
For this journal, alone among the twelve assigned to physics . . . is not so much a
journal in physics as a journal about physics. It publishes articles dealing primarily
with the humanistic, pedagogical, historical and social aspects of physics rather than
articles presenting new research in physics. Accordingly, it diverges from the
relatively low rate characteristic of the physical sciences in the direction of the
substantially higher one characteristic of the humanities and social sciences.
We find similar patterns within other disciplines. . . . The journals devoted
to social, abnormal, clinical, and educational psychology average a rejection rate of
70 percent while the journals in experimental, comparative, and physiological
psychology diverge toward the physical sciences with an average of 51 percent. . . .
The pattern of differences between fields and within fields can be described
in the same rule of thumb: the more humanistically oriented the journal, the higher
the rate of rejecting manuscripts for publication; the more experimentally and
observationally oriented, with an emphasis on rigor of observation and analysis, the
lower the rate of rejection.
The recognition of differences in hardness among scientific disciplines is still quite new. But
the papers discussed above show that a number of quantitative measures of hardness have already
been discovered. One of the purposes of this study was to investigate another possible
measure of the hardness of a discipline.
2.4 Summary
Chapter 2 has discussed published research relevant to this study in three areas. First was
work in linguistic analysis, including statistical linguistics and studies of scientific prose.
Second, the work of George Kingsley Zipf was discussed. Zipf's discovery of his model,
rf = c, as a language phenomenon was described, and his pursuit of diverse applications of his
model outside of language was mentioned. Numerous attempts by others to find better
mathematical models and more satisfactory explanations of the rank-frequency phenomenon were
outlined. The discussion noted that the slope of the rank-frequency curves can vary in different
bodies of prose, a fact that is at the heart of the present study.
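In logarithmic form, Zipf's model rf = c becomes log f = log c - log r, which is a straight line of slope -1 when log frequency is plotted against log rank. The following minimal sketch shows one way a slope might be estimated from a list of word frequencies. It assumes an ordinary least-squares fit, which is an illustrative choice and not necessarily the fitting procedure used in this study; the function name `zipf_slope` is hypothetical.

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log10(frequency) against log10(rank).

    Words are ranked in order of decreasing frequency; tied
    frequencies simply receive consecutive ranks (an assumption).
    """
    freqs = sorted(frequencies, reverse=True)
    if len(freqs) < 2:
        raise ValueError("at least two frequencies are needed to fit a slope")
    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Frequencies that obey rf = c exactly, with c = 1200 and ranks 1 to 6.
ideal = [1200, 600, 400, 300, 240, 200]
print(round(zipf_slope(ideal), 6))  # prints -1.0
```

A corpus whose fitted slope is more negative than -1 has a steeper rank-frequency curve than the ideal Zipf case; one with a slope between -1 and 0 has a flatter curve.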
The third portion of this chapter described work related to the idea of hard and soft science.
The concept of hardness of a discipline was discussed. Mentioned were several studies reporting
that objective measures had been found to be associated with generally accepted judgments of the
hardness of various scientific disciplines.
CHAPTER 3
PURPOSES OF THE STUDY
Jargon standardization was defined in Chapter 1 as the process by which scientists move
from using a variety of words and phrases in expressing an emerging concept to using a single,
standard name for a fully developed, well defined concept. The amount of standardized jargon in a
discipline would be expected to increase over time. Chapter 1 suggested that it might be possible to
measure the degree of jargon standardization in a discipline by the slopes of rank-frequency curves
formed from counts of frequencies of words in texts in the discipline, and also that the degree of
jargon standardization in a discipline might be related to the hardness of that discipline. Chapter 2
reviewed published research on linguistic analysis of specialized subsets of languages such as
scientific writing, on the rank-frequency phenomenon and efforts to explain it, and on investigations
of how to characterize hard science and soft science.
Chapter 1 also outlined the two major purposes of this study: to ascertain whether
measurements of the slopes of rank-frequency curves would reveal the expected increase over time
in the amount of standardized jargon; and to ascertain whether such measurements would provide a
quantitative measure of the hardness of a scientific discipline.
Chapter 3 re-states these purposes in the form of the hypotheses to be tested in this study.
First, however, the assumptions made and definitions employed in the study are presented.
3.1 Assumptions and Definitions
3.1.1 Scientific and Technical Disciplines
Underlying this study was the assumption that the notion of a scientific or technical
discipline finds sufficient consensus that it is reasonable to talk about the observable, written
behavior of people working in a discipline. This statement contains two problems: the definition of
"discipline," and the definition of what satisfactorily constitutes written behavior of people working
in a discipline.
Among its definitions of "discipline" Webster's Seventh New Collegiate Dictionary (1965) gives
"2: a subject that is taught; a field of study." Pursuing this dictionary's definitions of "subject," one
finds among them "3 a: a department of knowledge or learning." But one also finds that its
definition of "field" that comes closest to being pertinent says only "2 a: an area or division of an
activity."
To find a definition more closely concerned with the use of the term "discipline" in science,
one may turn to a sociologist of science. Storer (1972b, pp. 231-32) approaches the problem of
defining "discipline" by a careful chain of reasoning.
If there were no "reality" beyond men's imaginations, or if the most
fundamental aspects of reality were characterized by random, capricious change,
there would be no possibility of building a trustworthy body of knowledge about it.
Direct sensory experience, however, persuades us that there is an external reality
which is both organized and stable in its fundamental characteristics. . . .
Repeated observations of spatial and temporal correlations among physical
events, further, tell us that this external reality is organized into "clusters" of events
and relationships. These clusters are distinguished from one another not only by
differences in spatial and temporal location but also by the fact that change in one
cluster seems to have little or no effect upon another. . . .
Systematic observation of empirical clusters of events leads to the
identification of different categories of events, which are more abstract and
economic descriptions of reality. . . . the conceptual distinctions . . . [men] make
among different categories of natural events approximate the distinctions that exist
among different parts of reality. . . .
The process of identifying and analyzing clusters and categories of natural
events is called research. It is a painstaking and time-consuming activity. One man
rarely has enough time in his life to concentrate on more than a few of these clusters.
It is inevitable, therefore, that the major clusterings of natural events come to be
matched by clusters of men engaged in their investigation. The scientific community
is thus differentiated roughly in the same way that natural phenomena are separated
into distinctive categories, so that at any given time the organization of science
comes close to reflecting men's current understanding of the organization of
nature. . . .
The present division of science into the physical, mathematical, life, and
social sciences represents the most general subdivision of the scientific community.
Within each of these broad areas, more specific clusters of events and relationships
have been singled out for attention and have become the foci of the major scientific
disciplines.
Storer's reasoning still leaves the problem of just what is to be understood as a cluster of
sufficient size, and yet sufficient distinctness from other clusters, to constitute a discipline. The
definitional difficulties are evident.
Furthermore, it is hard to define precisely the four specific disciplines—ecology,
mathematics, physics, and psychology—with which this study has dealt. The boundaries of these
disciplines encroach upon other disciplines in ways that make definitions difficult. The example of
mathematical physics comes readily to mind; and there are people who would find it hard to decide
whether they would rather be called "mathematical physicists" or "applied mathematicians." The
major role of statistics in psychology is well known. Specialties exemplified by biochemistry,
molecular biology, and biostatistics abound today. Some might even argue that parapsychology
could someday lead to a field of "psychological physics."
Because of the difficulty of defining the four disciplines in this study satisfactorily, the
following assumption was employed: It is reasonable to treat, as belonging to each of the
disciplines, those written materials whose authors and editors have been satisfied to identify them as
contributions to the discipline by submitting them to and accepting them for, respectively, a journal
in that discipline. This assumption is supported by the high professional standing of each of the
journals used in this study, and by the fact that each of them is sponsored by the major American
professional organization in its field: the American Journal of Mathematics, by the American
Mathematical Society; Ecology, by the Ecological Society of America; the Physical Review, by the
American Physical Society; and the Psychological Review, by the American Psychological Association.
3.1.2 Representativeness of the Journals Sampled
The next question concerns the extent to which the materials appearing in each of the
journals used are representative, in the everyday sense of that word, of materials in the discipline
as a whole. Although each journal has editorial criteria that restrict the subject scope of the
materials it selects for publication, each journal was judged to be adequate, for the purposes of this
study, in breadth of scope of subject matter when compared with the discipline as a whole. This
judgment was based on examination of the journals themselves, on general knowledge about three
of the disciplines, and on the opinion of an expert1 in the fourth discipline, ecology.
The assumptions involved here can be stated as follows, where "discernible" means
"discernible by the procedures tested in this study" and "discipline" refers to the four disciplines
treated in this study:
1. It was assumed that any discernible tendency toward standardization in the jargon of a
discipline as a whole would be reflected in a similar tendency within a reasonably broad portion
of the discipline.
2. It was assumed that any discernible difference in the degrees of standardization of
different disciplines would be reflected in a similar difference between broad portions of those
disciplines.
3. It was assumed that each journal used in this study represented a reasonably broad
portion of its discipline.
The journals used in this study were chosen because each is a major journal in its discipline
and each has been published continuously over a period of approximately 50 years. Selection of a
single, long-term journal for each discipline was expected to produce a higher level of editorial
consistency than would have been likely with an alternative procedure of using a number of
different journals that might have produced a greater breadth of subject coverage. Consistency was
valued more highly than breadth, but the breadth achieved was judged to be sufficient. With respect
to their coverage, the journals chosen can be described as follows.
1 Personal communication from Orie L. Loucks, Professor of Environmental Studies and of
Botany, University of Wisconsin—Madison.
Fifty years ago, Ecology was the only journal in that field. It covered the field as well as one
journal could cover a discipline. By 1969 Ecology tended to emphasize articles that described
ecological situations or reported experiments, rather than articles of a highly theoretical nature, but it
still provided broad coverage of its discipline. Of its editorial policy, Ecology stated in 1921, "The
pages of Ecology are open to papers of ecological interest from the entire field of biological science."
By 1969, the statement had become, "The pages of Ecology . . . are open to papers of ecological
interest stressing basic, not applied, problems."
The American Journal of Mathematics, though it slights some new areas such as numerical
methods and statistics, covers the traditional major areas, analysis and modern algebra, that
constitute the core of mathematics. It makes no statement of editorial policy in its pages.
There is no question about the suitability of the Physical Review as a journal of broad
coverage. It covers the whole field of physics. The coverage has not been without strain; in 1921
the Physical Review contained 1,193 pages of reports, but by 1969 it had grown to 24,532 pages.1 The
Physical Review makes no statement of editorial policy in its pages.
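The growth rate of 6.5% a year cited in the footnote to this paragraph can be checked with a short calculation. The sketch below assumes compound annual growth over the 48 years from 1921 to 1969; the function name `annual_growth_rate` is hypothetical.

```python
def annual_growth_rate(start_value, end_value, years):
    """Compound annual growth rate that carries start_value to end_value."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Page counts of the Physical Review in 1921 and 1969, as given in the text.
rate = annual_growth_rate(1193, 24532, 1969 - 1921)
print(f"{rate:.1%}")  # prints 6.5%
```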
The Psychological Review covers general experimental psychology, which constitutes almost all
the research in psychology other than that which is concerned with human psychological disorders
and their treatment. As a journal of experimental psychology, the Psychological Review was judged to
be adequate in breadth of coverage. In 1921 the Psychological Review made no statement of
editorial policy in its pages, but by 1969 it was declaring, "The Psychological Review is devoted to
articles of theoretical significance to any area of scientific endeavor in psychology."
Appendix A, which lists the articles from which samples of text were drawn, provides the
reader with an opportunity to assess the range of topics covered in each discipline.
3.1.3 The Use of Journals
Another assumption in this study was that if there existed a discernible tendency toward
jargon standardization in a discipline, it would show up best in writings that describe the active
research being conducted in the discipline. That is, such a tendency should appear more clearly in
materials that report new research, such as professional journals, than in materials such as textbooks
and state-of-the-art reviews.
3.1.4 Hard Science and Soft Science
It was assumed in this study that there is a quality in science, sensed by both scientists and
nonscientists, which is expressed as the hardness or softness of a particular discipline, and that a
1 To those concerned with rates of growth of scientific information, it may be of interest that
this increase corresponds almost exactly to a rate of 6.5% a year.
general consensus of scientists would rate mathematics and physics as being harder fields of science
than ecology or psychology.
3.1.5 The Jargon of a Discipline: "Characteristic Idiom" and "Technical Vocabulary"
Webster's Seventh New Collegiate Dictionary (1965) states that among the meanings of "jargon" is
"the technical terminology or characteristic idiom of a special activity or group." In this study
"jargon" is used in both of these slightly different senses. "Jargon" as "characteristic idiom" is used
to refer to the whole vocabulary employed in writings in a discipline. It was important to consider
the whole vocabulary of a discipline for at least two reasons. First, the idea of a tendency toward
jargon standardization implies that in the early stages of the formulation of a new concept, the
concept will be expressed in words at least some of which will not be unique to the discipline or
have a special meaning in it. Second, a long neglected phenomenon is that writings in different
disciplines can exhibit considerable differences in the typical patterns of use of even the most
common words.
The first to study this phenomenon quantitatively was Wallace (1965). Using one corpus
made up of abstracts in psychology and another of abstracts dealing with electronic computers,
Wallace counted the words in each of the two sets of abstracts and ranked the words in each set in
order of decreasing frequency. He found such differences as those for "can" (rank 100 in the
psychology corpus and 23 in the computer corpus), "each" (ranks 84 and 54, respectively), "its"
(ranks 49 and 90), and "may" (ranks 50 and 91). Noting the same phenomenon in his foreword to
the book by Kučera and Francis (1967), Twaddell (1967) stated that the
distribution-of-occurrence table also shows some of the interesting variations along
with the partial uniformities of high-frequency vocabulary items. For example, in all
genres the most frequent item is "the." But the second most frequent word is "of" in ten
genres, "and" in five. Already at rank 2 genre difference affects the occupancy of a
rank.
Thus the whole vocabulary of a discipline, including even the most common words, may be peculiar
to that discipline in subtle ways.
This study treats the "technical terminology" sense of "jargon" under the name "technical
vocabulary." The phrase "technical vocabulary" is used to refer to what remains of the whole
vocabulary of a discipline after the set of words listed in Appendix B is deleted. This method of
defining the technical vocabulary was used for three reasons. First, the set of words remaining after
the deletion contained all words that could be expected to be unique to the discipline or to have
special meanings within it. Second, the method was impartial, in that all the corpora in the study
were treated alike. Third, it was practicable, in that the deletions could be accomplished by a
computer program.
Appendix B contains the 248 words that were defined in this study as "common" words.
This set of words consists of the union of two lists that were developed through much
experimentation and practical experience. One list was provided by Professor Gerard Salton of
Cornell University; it is the list of words he found to be nonsignificant in many scientific fields and
used as a list of words to be excluded from consideration as content indicators in the SMART
system (Salton, 1968). The other list was provided by Dr. Melvin Weinstock of the Institute for
Scientific Information (ISI); it is the list of "full-stop words" used in the preparation of ISI's
Permuterm Subject Index of the Science Citation Index (Weinstock, 1970). The 1972 Science Citation Index
Guide and Journal Lists (1973) describes "full-stop words" as "terms that have no practical semantic
value" and therefore "are completely suppressed" when the Permuterm Subject Index is prepared.
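The derivation of a technical vocabulary described in this section, in which the union of two lists of common words is deleted from the whole vocabulary, can be sketched as follows. The sketch is illustrative only: the function name `technical_vocabulary` is hypothetical, and the short word lists are placeholders, not the actual Salton or ISI lists, whose union forms the 248 words of Appendix B.

```python
from collections import Counter


def technical_vocabulary(word_counts, *common_word_lists):
    """Return the counts that remain after deleting every common word.

    The common words are the union of all the lists supplied.
    """
    common = set().union(*common_word_lists)
    return Counter({word: n for word, n in word_counts.items()
                    if word not in common})


# Placeholder lists standing in for the two source lists (hypothetical).
list_a = {"the", "of", "is"}
list_b = {"of", "and", "may"}

whole = Counter({"the": 10, "of": 7, "entropy": 3, "may": 2, "lattice": 1})
remaining = technical_vocabulary(whole, list_a, list_b)
print(sorted(remaining.items()))  # prints [('entropy', 3), ('lattice', 1)]
```

Because the same deletion is applied mechanically to every corpus, the procedure treats all corpora alike, which is the impartiality the section requires.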
3.1.6 Words, Word-Tokens, and Word-Types
What is a word? Are "child" and "children" to be treated as two different words, or simply
as the singular and plural forms of one word? A choice has to be made between the two possible
treatments. In this study it was desirable, because of precedents and programming considerations,
to choose to treat "child" and "children" as two different words. Formally stated, the definition is
that the study treated, as a different word, each distinguishable string of characters that was bounded
by a space at each end after the removal of terminal punctuation marks, such as a period or a
comma. Thus "give," "gives," and "given" would constitute three different words in this study,
because they are formed from three distinct sequences of letters. Furthermore, this definition of
"word" admits sequences of any characters: letters, digits, special symbols, and interior punctuation
marks other than the space. The space alone is excluded from the interior of strings, because it is
the natural symbol to use for the boundaries of strings. For example, the sentence
If she gives him the score-card they've given her, she'll ask him to
give it away.
would be considered to have the following distinct words and stated numbers of occurrences of
each word:
ask      1        her          1        she       1
away     1        him          2        she'll    1
give     1        if           1        the       1
given    1        it           1        they've   1
gives    1        score-card   1        to        1
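The counting rule illustrated above can be sketched in a few lines. This is a minimal sketch, not the program used in the study. It rests on two assumptions: the set of terminal punctuation marks to be removed, and the folding of capitals to lower case, which the listing of "if" suggests but the definition does not state. The names `count_words` and `TERMINAL_PUNCTUATION` are hypothetical.

```python
from collections import Counter

# Assumed set of terminal punctuation marks; the text names only
# the period and the comma as examples.
TERMINAL_PUNCTUATION = ".,;:!?"


def count_words(text):
    """Count each distinct space-bounded string as a separate word.

    Terminal punctuation is removed; interior hyphens, apostrophes,
    and digits are kept. Case folding is an assumption.
    """
    counts = Counter()
    for raw in text.split():
        word = raw.rstrip(TERMINAL_PUNCTUATION).lower()
        if word:
            counts[word] += 1
    return counts


sentence = ("If she gives him the score-card they've given her, "
            "she'll ask him to give it away.")
counts = count_words(sentence)
# Occurrences of "him", number of word-types, number of word-tokens.
print(counts["him"], len(counts), sum(counts.values()))  # prints 2 15 16
```

Since only the space separates strings, a string containing digits or special symbols is counted as a word in exactly the same way as a string of letters.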
As another example, the sentence "The H2O2 was added to the H2O" would be considered to have
two occurrences of "the" and one occurrence each of five other words, "added," "H2O," "H2O2,"
"to," and "was."
An important precedent for the definition in this study of a word as what can be represented
by a distinct sequence of characters was Zipf's use of the same definition. Here is how Zipf
described his usage in The Psycho-Biology of Language (1935, pp. 39-40):
For example, in English the word child may be considered as one word, children as
another, give a third, gives a fourth, given a fifth, — five different words for each of
which the respective frequencies in a given sample may be established. On the other
hand, a dictionary compiled on the basis of this evidence would view child and children
as but two forms of one word, and give, gives, given as but three forms of one word; the
five different words above would appear in the dictionary list under the words child
and give, — two words. The sole difference between these two uses of the term word
is that the former considers only a word in its fully inflected form as a unit of speech,
whereas the lexicographers are inclined to take the word either in its non-inflected
form, or in some one arbitrarily selected . . . form . . . as the unit. For the sake of
keeping these two units distinct, let us henceforth call the lexicographer's unit the
lexical unit and the other unit, that is, the word in its fully inflected form, the word.
Although both terms are popularly considered words, slight reflection on the subject
will convince the reader that a statistical analysis of the stream of speech into fully
inflected words and into lexical units will yield significantly different quantitative
results.
In the present investigation the term word will always designate a word in its
fully inflected form. . . .
Using "word" to "designate . . . [the] fully inflected form" is the same thing as
distinguishing different sequences of letters as different words. It should be noted in passing that
what Zipf called the "lexical unit" played no part in the present study.
Sometimes it is useful to be able to draw a distinction between (1) a given word as an
abstract linguistic entity and (2) particular occurrences of that entity. The technical terms used in
linguistics to make such distinctions are "word-type," or just "type," and "word-token" or "token."
The pertinent definitions in Webster's Seventh New Collegiate Dictionary (1965) are "token . . . an
instance of a linguistic expression" and "type . . . the form common to all instances of a word."
For example, the sentence "She wore a long red dress, red shoes, a red hat, and long red gloves" can
be described as containing four tokens of the word-type "red," two tokens each of the word-types
"a" and "long," and one token each of the word-types "and," "dress," "gloves," "hat," "she,"
"shoes," and "wore." As another example, the sentence
The chemist poured H2O2 into one flask of H2O, and poured H2SO4
into another flask of H2O.
can be described as containing the distinct word-types and numbers of word-tokens listed in Table
3-1.
3.1.7 Rank-Frequency Curves, Their Slopes, and Regression
In this study, the phrase "rank-frequency curve" means the curve formed by the following
process:
1. The words in a corpus are counted and the number of word-tokens of each
word-type in the corpus is determined. The number of tokens of a word-type is called the
"frequency" of that word-type.
TABLE 3-1
EXAMPLE OF A FREQUENCY COUNT OF WORD-TYPES

Word-Type    Number of Word-Tokens      Word-Type    Number of Word-Tokens
             of the Word-Type                        of the Word-Type
and          1                          H2SO4        1
another      1                          into         2
chemist      1                          of           2
flask        2                          one          1
H2O          2                          poured       2
H2O2         1                          the          1

2. The word-types are arranged in a list in decreasing order of frequency. For this study
it is immaterial how word-types of equal frequency are arranged; for convenience, they are arranged
alphabetically.
3. The first word-type in the list is assigned rank 1; the second word-type in the list,
rank 2; and so on. For example, the sentence "She wore a long red dress, red shoes, a red hat, and
long red gloves" yields the list in Table 3-2.
TABLE 3-2
EXAMPLE OF A FREQUENCY-ORDERED LIST OF WORD-TYPES
FOR A ONE-SENTENCE CORPUS

Word-Type    Frequency    Rank
red          4            1
a            2            2
long         2            3
and          1            4
dress        1            5
gloves       1            6
hat          1            7
she          1            8
shoes        1            9
wore         1            10

4. Next, consider any particular frequency realized in the corpus, say $f_i$, and the one or
more word-types occurring with that frequency. These word-types will have ranks, say
$r_{i1}, r_{i2}, \ldots, r_{im_i}$, where $m_i \geq 1$. In accordance with a common statistical procedure for handling the
ranks of tied scores, the rank $r_i$ corresponding to the frequency $f_i$ is defined as the mean of the ranks $r_{ij}$ of
the word-types occurring with frequency $f_i$:

\[ r_i = \frac{1}{m_i} \sum_{j=1}^{m_i} r_{ij} \]

It should be noted that if $f_{i+1}$ is the next lower frequency realized in the corpus, then the rank of the
first word-type that occurs with frequency $f_{i+1}$ will be

\[ r_{i+1,1} = r_{im_i} + 1 \]

That is, the ranks of the individual word-types are sequential throughout the list, even though mean
values are used for the ranks said to correspond to particular realized frequencies. This definition of
"the rank corresponding to a frequency" was adopted for the purpose of comparability with the
work of Zipf. The data in Table 3-2 illustrate the sequentialness of the ranks of the word-types, and
they yield the example in Table 3-3 of the calculation of the ranks $r_i$ corresponding to the distinct
observed frequencies $f_i$. For instance, Table 3-2 shows that there are two word-types of frequency
2, with ranks 2 and 3, respectively; hence for this frequency, $m_i = 2$, and the rank $r$ corresponding to
frequency 2 is

\[ r = \frac{1}{m}(2 + 3) = \frac{1}{2}(2 + 3) = 2.5 \]

as shown in Table 3-3.
5. Finally, common logarithms are taken of the $f_i$ and $r_i$. Pairs $(\log r_i, \log f_i)$ are formed, in
which as usual the first member can be taken to represent the abscissa (i.e., position on the
horizontal axis), and the second member, the ordinate (i.e., position on the vertical axis), in a two-dimensional plot. The pairs $(\log r_i, \log f_i)$ constitute the rank-frequency curve for the corpus. The data in
Table 3-3 yield the example in Table 3-4.
The slope of the rank-frequency curve for a corpus is defined as the slope of the straight line
that provides the best fit to the rank-frequency curve in the least-squares sense (see, for example,
Draper and Smith, 1966, pp. 7-13). In the present study this line is called the regression line of the
rank-frequency curve; for it is the same line as that found if one uses linear regression, treating the
$\log r_i$ values as values of an independent variable and the $\log f_i$ values as those of a dependent
variable (see, for example, Draper and Smith, 1966). The slope of the rank-frequency curve is then
the regression coefficient of the $\log r_i$ terms. Figure 3-1 plots the $(\log r_i, \log f_i)$ pairs of Table 3-4,
along with the regression line of the rank-frequency curve consisting of these pairs. The slope of the
regression line is -0.71.

TABLE 3-3
EXAMPLE OF OBSERVED DISTINCT FREQUENCIES AND DEFINED
CORRESPONDING RANKS FOR A ONE-SENTENCE CORPUS

Distinct Observed Frequency    Defined Corresponding Rank
4                              1
2                              2.5 = (1/2)(2 + 3)
1                              7 = (1/7)(4 + 5 + 6 + 7 + 8 + 9 + 10)

TABLE 3-4
EXAMPLE OF PAIRS (log ri, log fi) FOR A ONE-SENTENCE CORPUS

ri      fi     (log ri, log fi)
1       4      (0.0000, 0.6021)
2.5     2      (0.3979, 0.3010)
7       1      (0.8451, 0.0000)

FIGURE 3-1
RANK-FREQUENCY CURVE AND REGRESSION LINE,
FOR A ONE-SENTENCE CORPUS
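
The five-step construction of the rank-frequency curve, and the least-squares slope defined in this section, can be sketched in Python for the one-sentence corpus. This is an illustrative sketch only, not the program used in the study; the function names are hypothetical, and the closed-form least-squares formula is written out directly rather than taken from any particular regression package.

```python
import math
from collections import Counter


def rank_frequency_curve(type_counts):
    """Return (log10 r_i, log10 f_i) pairs, one per distinct observed frequency."""
    # Steps 2-3: order by decreasing frequency (ties alphabetically) and
    # assign sequential ranks 1, 2, 3, ...
    ordered = sorted(type_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    ranks_by_freq = {}
    for rank, (_, freq) in enumerate(ordered, start=1):
        ranks_by_freq.setdefault(freq, []).append(rank)
    # Step 4: the rank for each distinct frequency is the mean of the tied ranks.
    # Step 5: take common logarithms of both members of each pair.
    pairs = []
    for freq in sorted(ranks_by_freq, reverse=True):
        ranks = ranks_by_freq[freq]
        mean_rank = sum(ranks) / len(ranks)
        pairs.append((math.log10(mean_rank), math.log10(freq)))
    return pairs


def least_squares_slope(pairs):
    """Slope of the least-squares line through (x, y) pairs (needs 2+ distinct x)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx


# Step 1 for the one-sentence corpus, already lower-cased and stripped of punctuation.
words = "she wore a long red dress red shoes a red hat and long red gloves".split()
pairs = rank_frequency_curve(Counter(words))
slope = least_squares_slope(pairs)
```

Rounded to four places, `pairs` agrees with Table 3-4, and `slope` rounds to -0.71, the value stated for the regression line of Figure 3-1.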
Using the method of least squares to fit a straight line to a rank-frequency curve is a well
known procedure that presents no problems. But there may be a problem in assessing the tightness
of the fit by the usual regression procedures; the extent, if any, to which the assessment is affected is
unknown. The problem stems from the fact that the relation between the values $r_i$ and $f_i$ is not the
ordinary one between an independent and a dependent variable in regression (cf. section 2.2.6).
Although the ranks $r_i$ are not truly independent of the frequencies $f_i$, there is, nevertheless, a good
deal of random choice involved in just what values of $f_i$ will correspond to any particular value of $r_i$.
Similarly, given any particular $f_i$ (except perhaps the few highest), there is at least a range, increasing
with decreasing $f_i$, of possibilities for the number $m_i$ of word-types occurring with frequency $f_i$; and
hence there is a range of possibilities for $r_i$. In these respects the relation between $r_i$ and $f_i$ is
somewhat like the relation that regression requires.
These arguments having been weighed, the assumption was made that for the purposes of
this study it would be reasonable to interpret the values obtained for the slopes of the rank-frequency curves for the experimental corpora as though they were true regression coefficients.
3.2 Hypotheses
3.2.1 Hypotheses Concerning a Tendency toward Jargon Standardization over Time
Discussed first are the hypotheses dealing with the possibility that examination of the slopes
of the rank-frequency curves for corpora in a scientific discipline will reveal the expected increase
over time in the amount of standardized jargon in the discipline. These hypotheses may be
summarized in General Hypothesis I: The slopes of the rank-frequency curves of corpora in a
discipline will become steeper as time passes.
The reasoning behind General Hypothesis I is this: If a tendency toward standardization
exists, the jargon of a discipline will, as time passes, contain an ever greater number of words that
are used in a regular way, by many workers in the discipline, to denote well accepted, well defined
concepts in the discipline. As more concepts become standardized in their expression, fewer
circumlocutions will be necessary. As the circumlocutions are reduced, the total number of word-types used will become smaller, but each word-type remaining in use will tend to be used more
often. The result will be higher frequencies of occurrence for each of a smaller total number of
word-types in the jargon than at an earlier period in the discipline. A smaller total number of word-types will mean a smaller maximum rank in the frequency-ordered list of words for a later corpus.
The combined effect of higher frequencies and fewer word types on a rank-frequency curve
(such as that of Figure 3-1) will be to raise the left part of the curve and to pull the right end in
toward zero. Thus the rank-frequency curves of later corpora will be tilted up at the left and down
at the right, making their slopes steeper. Numerically, this means that the slopes will take on
increasingly negative values, moving from, say, -0.8 to -0.9 or even to -1.0.
The effects on rank-frequency curves of a tendency toward jargon standardization over time
ought to show up both in corpora composed of only the technical vocabularies of writings in a
discipline and also in corpora composed of the whole vocabularies. Hence, it was desirable in this
study to examine the technical vocabularies to see if the hypothesized tendency was discernible. But
it was also desirable to examine the whole vocabularies for the presence of the hypothesized
tendency, for at least two reasons. First, the very notion that incompletely formed concepts in a
discipline are discussed in a variety of ways suggests that words outside the technical vocabulary of
the discipline will be employed. Second, since practically all previous studies of the slopes of
rank-frequency curves have been based on the whole vocabulary of some corpus, this study needed
to include examination of whole vocabularies for the purpose of comparability.
General Hypothesis I must be turned into explicit hypotheses to be tested. The notation of
these hypotheses employs the abbreviations EC for ecology, MA for mathematics, PH for physics,
and PS for psychology.
Hypothesis 1-EC: In ecology, a corpus drawn from later writings will have a steeper
slope for the rank-frequency curve formed from the whole vocabulary of the corpus
than will a corpus drawn from earlier writings.
Hypotheses 1-MA, 1-PH, and 1-PS consist of the analogous statements for their respective
disciplines.
Hypothesis 2-EC: In ecology, a corpus drawn from later writings will have a steeper
slope for the rank-frequency curve formed from the technical vocabulary of the
corpus than will a corpus drawn from earlier writings.
Hypotheses 2-MA, 2-PH, and 2-PS consist of the analogous statements for their respective
disciplines.
3.2.2 Hypotheses Concerning the Relative Jargon Standardization of Different Disciplines
The second major purpose of this study was to investigate the possibility that the degree to
which the jargon of a discipline has been standardized at a given time is related to the subjectively
judged notion of the hardness or softness of the discipline at that time. It seemed possible that
judgments that a discipline is a soft science might be based at least partly on the presence, in writings
of that discipline, of a smaller proportion of regularly used terms for well defined concepts than
would be true of writings in a hard science.
If the hardness of a discipline is related to the proportion of well accepted terms used for
well defined concepts in the discipline, then hard disciplines will have a higher proportion of such
terms and a lower proportion of circumlocutions than will soft disciplines. Thus corpora in hard
disciplines will tend to exhibit higher frequencies for each of a smaller total number of word-types
than will corpora in soft disciplines. The effect will be that the rank-frequency curves of corpora in
hard disciplines will be tilted further up at the left and further down at the right than the curves of
corpora in soft disciplines, so that hard corpora will have steeper slopes for these curves than soft
corpora will.
Since the total number of well defined concepts in a discipline undoubtedly increases with
time, comparisons of disciplines should be made in terms of contemporary states of the disciplines.
Few would disagree with the statement that as of 1974 physics is a harder science than psychology,
but few would feel confident in making any assertion about the hardness of psychology in 1974
compared with the hardness of physics in 1900.
The foregoing reasoning leads to General Hypothesis II: The slopes of the rank-frequency
curves of corpora in hard disciplines will be steeper than those of coeval corpora in soft disciplines.
The testing of this second general hypothesis again involves both the technical vocabularies
and the whole vocabularies of the disciplines. This results in the following explicit hypotheses to
be tested.
Hypothesis 3-PH-EC: A corpus drawn from writings in physics will have a steeper
slope for the rank-frequency curve formed from the whole vocabulary of the corpus
than will a coeval corpus drawn from ecology.
Hypotheses 3-PH-PS, 3-MA-EC, and 3-MA-PS consist of the analogous statements for their
respective pairs of disciplines.
Hypothesis 4-PH-EC: A corpus drawn from writings in physics will have a steeper
slope for the rank-frequency curve formed from the technical vocabulary of the
corpus than will a coeval corpus drawn from ecology.
Hypotheses 4-PH-PS, 4-MA-EC, and 4-MA-PS consist of the analogous statements for their
respective pairs of disciplines.
Not only can pairs of disciplines—one hard and the other soft—be tested against each
other, but also combinations of the hard disciplines can be tested against combinations of the soft.
Combinations of the hard disciplines represent the whole class of hard sciences somewhat better
than just one hard discipline can. The analogous statement for combinations of soft disciplines also
holds. It is true that if the tests of Hypotheses 3 and 4 confirmed them all, then similar tests of
combinations of disciplines would add nothing new. But it could happen that some of Hypotheses
3 and 4 would fail to be confirmed while similar hypotheses about combinations of hard and of soft
disciplines were confirmed. Such a situation would be of interest, because it would suggest that
further testing of the class of hard sciences against the class of soft sciences would be useful.
Therefore, it was judged desirable to test the following hypotheses about combinations of
disciplines.
Hypothesis 5: The mean slope of rank-frequency curves formed from the whole
vocabularies of corpora drawn from physics and mathematics will be steeper than
the corresponding slope for coeval corpora from ecology and psychology.
Hypothesis 6 consists of the analogous statement for the rank-frequency curves formed from the
technical vocabularies of corpora.
3.3 Summary
Chapter 3 has detailed the reasoning, the assumptions, and the definitions leading to the
formal hypotheses of the study. The purposes of these hypotheses are to test whether it is possible
to measure (1) an increase over time in jargon standardization within a scientific discipline, and (2)
differences in jargon standardization between hard and soft disciplines.
CHAPTER 4
DATA SELECTION, PREPARATION, PROCESSING, AND ANALYSIS
This thesis treats the possibility that jargon standardization—the process of moving from
using a variety of words and phrases in expressing an emerging concept in a discipline to using a
single, standard name for a fully developed, well defined concept—might be measurable in terms of
the slopes of rank-frequency curves formed from counts of frequencies of words in texts in the
discipline. The two major purposes of this study were to ascertain whether measurements of the
slopes of rank-frequency curves would reveal the expected increase over time in the amount of
standardized jargon in a discipline, and whether such measurements would provide a quantitative
measure of the hardness of the discipline.
Chapter 3 detailed the assumptions and definitions made in this study, and discussed the
reasoning about slopes of rank-frequency curves for scientific writings that formed the basis for the
hypotheses to be tested in the study. These hypotheses were stated, and their relation to the two
major purposes of the study was shown. The hypotheses can be summarized as General Hypothesis
I:
The slopes of the rank-frequency curves of corpora in a discipline will become
steeper as time passes.
and General Hypothesis II:
The slopes of the rank-frequency curves of corpora in hard disciplines will be steeper
than those of coeval corpora in soft disciplines.
The reasoning leading to General Hypothesis I can be summarized as follows: As a
discipline ages, its practitioners tend to settle upon a standard vocabulary to express its concepts and
to use the same words consistently and often; this tendency results in their making increasingly
frequent use of each of a relatively limited number of standard words; and this result has the effect
of steepening the slopes of rank-frequency curves. The reasoning leading to General Hypothesis II
can be summarized as follows: At any given time, the practitioners of a hard discipline have
available a more standardized vocabulary to describe their discipline's concepts than practitioners of
a soft discipline have; the latter have to use a larger number of different words in wrestling with
concepts that tend to be less well defined than those in the hard discipline; hence, the practitioners
of the hard discipline tend to make more frequent use of each of a relatively limited number of
different words, while the practitioners of the soft discipline tend to make less frequent use of each
of a larger number of different words; and the result is that hard corpora have steeper slopes for
their rank-frequency curves than do soft corpora.
Chapter 4 explains how the data employed to test the hypotheses were selected, prepared,
processed, and analyzed.
4.1 Selection of the Data
The purposes of this study required that the data of the study, written materials in science,
had to span a number of years, include both hard and soft fields of science, and represent the
reporting of research in journals. In addition, it seemed desirable to use a single journal in each
discipline in order to minimize variations in jargon usage that might arise from differing editorial
policies of several journals, such as aiming at different audiences or levels of presentation. Each
such single journal needed to be a major one in its discipline.
These considerations proved to be rather restrictive, especially when combined with a desire
to have a time span of about 50 years, which was judged sufficient to provide a good chance of
revealing the suspected tendency toward standardization. Eventually, however, four journals in
suitable disciplines were found: American Journal of Mathematics, which began publication in 1878;
Physical Review, which began in 1893; Psychological Review, in 1894; and Ecology, in 1920. Each of these
journals is among the leading ones in its discipline.
The time periods to be used were chosen as 1921 and 1969, because: (1) Ecology was begun
in 1920, and to give it a year in which to establish its editorial policies seemed reasonable; and (2) in
1970 the Physical Review was split into four journals, each covering a subfield of physics. The result
was a difference of 48 years in the time periods of the writings.
In all, therefore, there were to be eight individual corpora, one from 1921 and one from
1969 in each of the four disciplines. Next came the question of how much text each corpus should
contain. On the basis of Zipf's work (1949, passim, but especially p. 291), a size of 20,000 words
each for the individual corpora appeared to be adequate for determining the slopes of the rank-frequency curves, while yielding a total sample of a size (8 × 20,000 = 160,000 words) that would be
manageable.
How should the sample corpus of 20,000 words for a given journal and year be formed? It
was apparent that some kind of random sampling from the text should be employed, but it was
obviously impracticable to make 20,000 random selections of individual word tokens from among
all the words appearing in the journal in the year. It would be almost equally impracticable to deal
with random selections of individual sentences and paragraphs. If entire papers were selected at
random, the selection process would be easy, but there would arise difficulties stemming from great
disparities in the lengths of papers and from heavy dependence on the writing styles of a small
number of authors. A rather convenient unit, so far as random selection was concerned, would be
the page, but selection of whole pages would involve disparities in the number of words, not only
between different journals but even within a given journal.
Because of such considerations, the following sampling procedure was chosen. The basic
unit would be a text string approximately 400 word-tokens long, and the full sample corpus of
20,000 words for a given journal and year would consist of 50 such units. Random selection of the
units was accomplished through the following steps:
1. The total number N of pages appearing in the journal in the year was determined.
This was not quite trivial, especially with the Physical Review for 1969, whose more than 24,000 pages
were numbered independently each month. Where there was more than one sequence of page
numbers during the year, a list of equivalences was prepared between the sequence 1 to N and the
chronologically arranged sequences of page numbers.
2. The number D of digits in N was noted. Numbers within the range 1 to N were
drawn from a table of random numbers by proceeding sequentially through sets of D digits each
from an arbitrary starting point, ignoring repeated numbers and numbers larger than N . If the
journal used a two-column format, sets of D + 1 digits each were selected, with the first D digits
representing the page number and the (D + 1)st digit indicating the left or right column according as
it was odd or even.
3. Each tentatively selected page was examined to see whether it contained text of a
research article or something else, such as only a figure, only a list of references, or a memorial. A
page that contained no text reporting research was ignored, and another random number was drawn
to replace it.
4. The first complete word of running text on the page was marked. Titles, authors'
names, and the like were excluded when they appeared outside the running text, as on the first page
of an article. In journals with a two-column format, the first complete word of running text at the
top of the left or the right column was marked, as indicated by the (D + 1)st digit.
5. The typist was instructed to begin with the marked word and to continue to copy running
text till the page in the typewriter could not fully accommodate the next word of text, which was
therefore not begun. The typing was done in a special font and in a rigid format of 75 characters
(including spaces) per line and 37 double-spaced lines per legal-size (8-1/2" × 14") page, because it
was to be processed by an optical character reader (OCR). A word that had not been finished in
column 75 of a line was simply continued, without hyphenation, in column 1 of the next line.
The consequence of interest here is that the typewritten page, i.e., the sample unit, contained at most
75 × 37 = 2775 characters (of which the first 7 were not text but were inserted to serve as
a unique identifier for the sample). Since a typical word of general English text contains about 5.5
characters including the necessary terminating space (Pratt, 1939), and since no partial words were to
be entered at the end of the last line, about 2765/5.5 ≈ 503 words were to be expected in each
sample unit thus constructed. The typing procedure was intended primarily to ensure that there
would be at least 400 words in each sample, and secondarily to take maximum advantage of the fact
that the company doing the OCR conversion from the typescript to computer tape was willing to
process a legal-size sheet of paper at the same cost as an ordinary 8-1/2" × 11" sheet.¹ It was a
bonus that the average sample length might go as high as 500 words rather than the originally
planned minimum of 400 words. In fact, the average turned out to be 454.4 words per sample,
corresponding to a grand total of 181,755 words. That the average number of words per sample
fell so far below 500 probably means that when compared with general English, scientific English
has a slightly higher average word length, apparently about 6.1 characters per word including the
terminal space.

¹ It may be of interest to note here that typewriting and subsequent OCR processing was
chosen as the mode of input because of its considerable advantage in cost. The estimated cost of
this procedure, paying at local competitive rates for typewriting and commercial rates for the OCR
processing, came to about $550. (The actual cost was about $560, for somewhat more text than
estimated.) Estimates of the cost of keypunching, using a rule of thumb of 6,000 keystrokes per
hour for coded data, came to $1,015.

6. In the very small number of instances out of the 400 samples when the randomly
chosen page numbers were so close that one sample ran over onto the initial page of another, the
typist began the second sample where the first one ended.
7. Not much more frequent were the cases in which a sample was still incomplete when
the typist reached the end of the running text of the article. In such cases he filled out the sample
with text beginning at the first complete word on the page previous to the start of the incomplete
sample. Neither this procedure nor that of the preceding paragraph appeared to pose any difficulty
with respect to the goal of obtaining for each sample 400 - 500 words of homogeneous running text
from one article.
8. Footnotes and references were not included. Neither were captions of figures and
tables, nor the contents of tables. As elsewhere, the evident intentions of the authors (and possibly
the editor) were the final criterion: thus, long lists—e.g., of species observed in a certain
locale—would be omitted if they appeared in tabular form but included if the author had placed
them in running text, say as a parenthesized insertion.

A possible objection to the foregoing
procedure for selecting samples is that the practice of always starting with the first word on a page
(except when the right-hand column of a two-column page was chosen) resulted in a slight increase
in the probability that the first few lines of an article would be chosen for a sample. This would
affect the purposes of the study only if that increase were considerable and if, further, the first few
lines of articles differed consistently in vocabulary from the rest of the articles. A perusal of the
articles in the study indicated that their beginnings did not differ consistently, that in fact they were
quite varied in style and content and were by no means confined to general or introductory
comments. Hence it was judged that the slight bias in favor of the first few lines of an article could
have had only a negligible effect on the results of the study. However, any similar study in the
future could avoid the problem by randomly choosing the starting lines of samples.
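
Steps 1, 2, and 4 of the page-selection procedure in this section can be sketched in Python as follows. This is a hedged illustration only: Python's pseudo-random generator stands in for the printed table of random numbers, the function name and parameters are hypothetical, "repeated numbers" is assumed to mean repeated page numbers, and the manual inspection of step 3 (rejecting pages without research text) is not modelled.

```python
import random


def draw_sample_pages(n_pages, n_units, two_column=False, rng=None):
    """Draw n_units distinct page numbers in 1..n_pages from a stream of random digits."""
    if n_units > n_pages:
        raise ValueError("cannot draw more distinct pages than the journal contains")
    rng = rng or random.Random()
    d = len(str(n_pages))                  # D, the number of digits in N
    width = d + 1 if two_column else d     # one extra digit selects the column
    chosen, seen = [], set()
    while len(chosen) < n_units:
        digits = [rng.randrange(10) for _ in range(width)]
        page = int("".join(str(x) for x in digits[:d]))
        if page < 1 or page > n_pages or page in seen:
            continue                       # skip 0, numbers above N, and repeats
        seen.add(page)
        column = None
        if two_column:
            # Odd (D + 1)st digit: left column; even digit: right column.
            column = "left" if digits[d] % 2 == 1 else "right"
        chosen.append((page, column))
    return chosen


# Illustrative call: 50 units from a two-column journal of 24,000 pages.
pages = draw_sample_pages(24000, 50, two_column=True, rng=random.Random(0))
```

Each returned pair names a page in the 1-to-N sequence and, for two-column journals, the column whose first complete word of running text would be marked as the start of the sample unit.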
4.2 Preparation of the Text Samples
In connection with the procedure for selecting the samples, the previous section has
discussed the essentials of the process of typing them, using a font readable by an optical character
reader. What follows is a discussion of special problems that arose during the typing, the correcting
of errors, and the counting of word frequencies:
1. In typing the running text, the typist omitted numbers that were printed as digits, because
their great variety would have led to the identification of a speciously large number of word-types
of very low frequency in a corpus that included such numbers. Likewise, the initials of personal
names were not typed. Mathematical formulas were omitted for the same reason. Besides, any
attempt to translate formulas into prose almost certainly would have been inaccurate and also would
have mixed my vocabulary with that of the author of the article. However, such strings of
characters as "9-point" were included, on the grounds that the author was clearly thinking of the
string as a word. Similarly, "*-automorphism" was treated as a word in that form, and "λ-plane" was
handled as "lambda-plane."
2. The author's hyphenation was almost always followed. One kind of exception was part
of the process of counting frequencies. Here problems arose that are exemplified by "nonzero" and
"non-zero," two variants, used by different authors, of what is clearly the same concept. Such cases
required an arbitrary choice of one of the variants, ordinarily the one of higher frequency, which was
credited with the sum of the frequencies of both variants. The other kind of exception occurred
when the lack of a hyphen was almost certainly due to a printer's error, as when "nearly Euclidean"
occurred in a sample in the same kind of construction as that in which "nearly-Euclidean" occurred
several times elsewhere in the article. But another author's consistent use of "nearly Euclidean"
as separate words would, of course, not be changed. The point is that the frequency-counting
program treated as a single word-token any pair of words linked by a hyphen, in accordance with the
definition of "word" in this study, as stated in section 3.1.6.
3. Another kind of variational problem is illustrated by "sodium" and its standard
abbreviation "Na". In the counting of frequencies, standard abbreviations and the corresponding
full forms were combined into one arbitrarily chosen variant, usually the full form, which was
credited with the sum of the frequencies of the variants. Abbreviations such as "eV" were treated as
tokens of a single word-type, here "electron-volt." Variations like "keV" and "MeV" were treated as
distinct word-types, here "kilo-electron-volt" and "mega-electron-volt." On the other hand, "AOV"
is a standard abbreviation of "analysis of variance," which authors almost always use without
hyphens in its full form; hence, each occurrence of "AOV" was treated as representing one token
each of the word-types "analysis," "of," and "variance." Psychology makes heavy use of the
abbreviations "S" and "Ss," which were treated as the word-types "subject" and "subjects."
Abbreviations like "hr" that are regularly used for both singular and plural forms were treated in the
frequency-counting process as representing "hour" and "hours" in the same proportion as those full
forms themselves occurred in the corpus. An author's individual intra-article abbreviations, e.g.,
"IGE" for "inverse gap equation," and "s.p." for "single-particle," were treated as tokens of one or
more word-types according to the author's definition; i.e., an occurrence of "IGE" was taken to
represent one token each of "inverse," "gap," and "equation," whereas "s.p." was equated with the
single word-token "single-particle."
4. Chemical symbols that the author treated as a unit, e.g., "NaI(Td)", were handled as
a unit in the frequency count. Isotopes such as "40Ca" and "44Ca" were typewritten as "calcium-40"
and "calcium-44", which resulted in their being treated as tokens of distinct word-types.
5. Obvious misspellings were corrected. Words with variant British and American
spellings were rendered in the American style; e.g., "colour" was typed as "color."
Many more examples of such problems could be given, but the foregoing discussion has shown the
kinds of problems
encountered and the approach to their solutions. The guidelines were to follow, first, the author's
own usage insofar as that could be determined, and second, the standard usage in the field
concerned, as evidenced both in the journal from which the sample was drawn and also by resort,
when necessary, to standard reference works and textbooks in the field. One such search beyond
the journal may be of interest as an example. The Physical Review for 1969 did not supply an
explanation of "DHCP", which the Solid State Abstract Journal finally revealed to represent "double
hexagonal close-packed." Apparently the property of having a DHCP arrangement at the molecular
level has become so common and well understood a concept in solid state physics that it is now
almost always referred to by its acronym. Here, surely, is an example of jargon standardization at
work.
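The conventions for spelling variants and abbreviations described above can be illustrated by a short sketch in present-day Python. It is offered only as an illustration and is not the procedure actually used in the study: the function names merge_variants and expand_abbreviations are hypothetical, and the choice of the higher-frequency variant follows the "ordinarily" rule stated for "nonzero" and "non-zero."

```python
from collections import Counter


def merge_variants(counts, variant_groups):
    """Credit one variant in each group with the summed frequency of the group."""
    merged = Counter(counts)
    for group in variant_groups:
        present = [word for word in group if merged[word] > 0]
        if len(present) < 2:
            continue  # fewer than two variants occur, so there is nothing to merge
        # Keep the variant of higher frequency; ties go to the first one listed.
        keep = max(present, key=lambda word: merged[word])
        total = sum(merged[word] for word in present)
        for word in present:
            del merged[word]
        merged[keep] = total
    return merged


def expand_abbreviations(tokens, expansions):
    """Replace each abbreviation by the word-token(s) it is taken to represent."""
    expanded = []
    for token in tokens:
        expanded.extend(expansions.get(token, [token]))
    return expanded


counts = Counter({"nonzero": 5, "non-zero": 3, "field": 2})
print(merge_variants(counts, [("nonzero", "non-zero")]))
print(expand_abbreviations(
    ["the", "AOV", "showed"],
    {"AOV": ["analysis", "of", "variance"], "s.p.": ["single-particle"]},
))
```

In this sketch "AOV" yields three word-tokens, whereas "s.p." would yield the single hyphenated token "single-particle," in keeping with the treatment described in item 3.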
4.3 Processing of the Machine-Usable Data
When the text samples had been prepared in the form of typescript, they were processed by
a commercial firm with an optical character reader, and were turned into 400 records on a computer
tape. Thereafter, all processing took place at the Computation Center, University of Texas at Austin
(UTACC).
The facilities at the UTACC include a Control Data Corporation (CDC) 6600 computer in
combination with a CDC 6400 computer, disk storage units, tape drives, and a CalComp Plotter.
The CDC 6600/6400 computer system supports a large remote-terminal, timesharing system
covering the campus of the University of Texas at Austin (UTA) and extending to other educational
institutions in central Texas. All but a tiny fraction of the processing for this study was
accomplished interactively via a terminal in the Graduate School of Library Science, UTA.
The programs used in the processing included some provided by the UTACC, others
provided by the Linguistics Research Center (LRC) of the UTA, and several small programs written
as part of the study.
The first step in the processing took place with the aid of an LRC program, LISTUP, which
converted the 400 records of text samples of about 2770 characters each into blocked streams of
80-character segments on another computer tape. Each segment consisted of 70 characters of text and
a 10-character unique label that identified the corpus, the sample, and the sequence number of the
segment within the sample.
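The segmenting step can be pictured with the following sketch in Python. It is not the LISTUP program, whose internals are not described here. The layout of the 10-character label (four characters for the corpus, three digits for the sample, three digits for the sequence number) is purely an assumption made for the illustration, as is the function name to_segments.

```python
def to_segments(text, corpus, sample, width=70):
    """Split a sample into 80-character segments: 70 of text plus a 10-character label."""
    segments = []
    for seq, start in enumerate(range(0, len(text), width), start=1):
        # Pad the final chunk with blanks so that every segment has the same length.
        chunk = text[start:start + width].ljust(width)
        # Hypothetical label layout: corpus (4 chars) + sample (3 digits) + sequence (3 digits).
        label = f"{corpus:<4.4}{sample:03d}{seq:03d}"
        segments.append(chunk + label)
    return segments


segments = to_segments("x" * 2770, "PH69", 7)
print(len(segments), len(segments[0]), segments[0][70:])
```

Under these assumptions a record of 2770 characters produces 40 segments of exactly 80 characters each, the last one padded with blanks.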
A small program converted the contents of the preceding tape into 80-character card-images. This step made it possible to store the data on what the UTACC calls "permanent file sets."
At the UTACC, a permanent file set is a tape specially prepared for convenient interactive use on
the UTACC system, and a permanent file is a file (i.e., a sequence of characters terminated by what
the computer recognizes as an "end of file" mark) on such a tape. Because of the ambiguity of the
term "permanent file" for anyone outside the UTACC environment, the abbreviation "PF" is used
herein to denote "permanent file at the UTACC." Each corpus in the study was, in these terms,
stored in the form of a PF.
Computer printouts were made of each corpus, and the checking began. Each corpus was
checked by comparing the printouts word for word with the original article. A few errors that the
typist himself had not caught were found, of course; but the great bulk of the single-character errors
resulted from failure of the optical character reader properly to identify a letter. Most of the
difficulties were caused by the typist's use of white correction fluid on the typescript. It turned out
that if even a tiny chip of dried correction fluid broke loose and carried away with it a piece of a
letter (e.g., making a tiny crack in the bar of an "H"), the OCR was quite likely to misinterpret what
remained of the letter. A few OCR errors stemmed from flaws in the paper.
Errors were corrected interactively by use of the UTACC's highly satisfactory EDIT
program, with which it is possible to do such things as changing all the occurrences of "non-zero" in
a 20,000-word corpus into "nonzero" with one instruction and a fraction of a second of computer
time. Lest this make the correction process sound too easy, it should be added that most of the
errors had to be tackled line by line, or card-image by card-image. Furthermore, the card-images
could contain no more than 80 characters, because of requirements of later steps in
the processing. Hence, any change involving the addition of one or more characters had to be
handled by the creation and insertion of a new card-image, complete with corpus and sample
identifiers and a suitable sequence number.
When each corpus had been corrected as completely as possible by the foregoing
procedures, a new computer printout was made from it. The proofreading of the printout gave
special attention to catching errors that had been introduced during the correcting of the initial
errors. Only after these secondary errors had been corrected in the PF-stored corpora via EDIT
was the next step taken.
This step consisted in applying to each corpus a sequence of three LRC programs, INDEX,
FREQ, and GROUP. These programs counted the word-tokens in each corpus and produced two
lists of word-types and their corresponding frequencies. One list was arranged alphabetically; the
other, in order of decreasing frequency and then alphabetically among word-types of equal
frequency. In the same computer run a small program called PRPCNT put the frequency-ordered
list for each corpus into the form of a card-image file and stored it as a PF.
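The two lists just described can be illustrated by a brief Python sketch. It is not the LRC programs INDEX, FREQ, and GROUP. The regular expression below is only an assumed approximation of the definition of "word" in section 3.1.6 (a run of letters or digits, with hyphen-linked words counted as one token), and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed approximation of a "word": letters or digits, with hyphen-linked
# words treated as a single token.
WORD = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def count_words(text):
    """Count the word-tokens belonging to each word-type."""
    return Counter(WORD.findall(text.lower()))


def alphabetical_list(counts):
    """Word-types and frequencies in alphabetical order."""
    return sorted(counts.items())


def frequency_list(counts):
    """Word-types by decreasing frequency, alphabetically within equal frequency."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


counts = count_words("The non-zero value of the field; the value")
print(alphabetical_list(counts))
print(frequency_list(counts))
```

The second ordering corresponds to the frequency-ordered list that was stored as a PF for each corpus.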
The alphabetical list turned out to be far more useful than expected. It spotlighted not only
some residual misspelled words but also many spelling variations that had previously gone
unnoticed. It was at this stage, for example, that the problem exemplified by "non-zero" and
"nonzero" came to light.
As misspellings and spelling variants were removed, corresponding changes were made in
the PF-stored corpora. The necessary adjustments in frequencies (e.g., from combining "non-zero"
and "nonzero") were made in the frequency-ordered list. The frequency adjustments were entered
in the PF copy of this list through the use of the EDIT program, since it cost less in computer time
to do so than to reapply LRC's INDEX, FREQ, and GROUP programs to the entire corrected
corpus. A small program, CNTCHK, then printed the PF copy of the list in the same format as that
used in the printouts of the LRC programs, to facilitate a final visual check of the PF copy against
the now hand-corrected list originally produced by the LRC programs.
4.4 Analysis of the Word-Frequency Data
At this point the study had yielded eight machine-usable lists of words arranged by their
frequencies. There was a list for each corpus drawn from one of the four disciplines at one of the
two time periods. From each of these lists a program called RANK produced a PF consisting of
pairs of numbers (fi, ri), where each fi represented a distinct frequency found in the corpus and
each ri represented the rank corresponding to the frequency fi. The ranks ri were computed
according to the definition in section 3.1.7; the point of that definition is that where more than one
word-type occurs with a given frequency, the two or more word-types are ordered in some fashion
(in fact, alphabetically) and the mean of the resulting ranks is used as the rank of the frequency. In
graphic terms this means that the (log ri, log fi) pairs in this study correspond to the midpoints of the
horizontal segments in the lower-right halves of Zipf's diagrams (cf. Zipf, 1949).
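The mean-rank rule can be made concrete with a small Python sketch. It is an illustration under the reading of the definition given above, not the RANK program itself, and the function name frequency_rank_pairs is hypothetical.

```python
from itertools import groupby


def frequency_rank_pairs(frequencies):
    """Return (f, r) pairs, r being the mean rank of the word-types with frequency f."""
    ordered = sorted(frequencies, reverse=True)
    pairs = []
    next_rank = 1
    for freq, group in groupby(ordered):
        size = len(list(group))
        first, last = next_rank, next_rank + size - 1
        # Tied word-types occupy ranks first..last; the frequency receives their mean.
        pairs.append((freq, (first + last) / 2))
        next_rank = last + 1
    return pairs


print(frequency_rank_pairs([10, 5, 5, 2, 1, 1, 1]))
# [(10, 1.0), (5, 2.5), (2, 4.0), (1, 6.0)]
```

Here the two word-types of frequency 5 occupy ranks 2 and 3 and so share the rank 2.5, while the three word-types of frequency 1 occupy ranks 5 through 7 and share the rank 6.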
For each corpus, RANK also computed and reported (1) the total number of (fi, ri) pairs, (2)
the total number of word-types, (3) the total number of word-tokens, (4) the type-token ratio, i.e.,
the ratio of the total number of word-types to the total number of word-tokens (Miller, 1951), and
(5) "Yule's characteristic, K" (Yule, 1944). The last two computations were included for possible
future use.
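For readers unfamiliar with the last two statistics, the following Python sketch shows one common way of computing them from the list of word-type frequencies. The form of Yule's characteristic used here, K = 10^4 (S2 - N) / N^2 with S2 the sum of m^2 V(m) over the frequencies m, is the usual textbook form; that RANK used exactly this form is an assumption.

```python
from collections import Counter


def type_token_ratio(frequencies):
    """Number of word-types divided by number of word-tokens."""
    return len(frequencies) / sum(frequencies)


def yules_k(frequencies):
    """Yule's characteristic K in its usual form, 10^4 * (S2 - N) / N^2."""
    n_tokens = sum(frequencies)
    spectrum = Counter(frequencies)  # spectrum[m] = number of word-types occurring m times
    s2 = sum(m * m * v for m, v in spectrum.items())
    return 1e4 * (s2 - n_tokens) / (n_tokens ** 2)


frequencies = [3, 2, 1, 1, 1]  # five word-types, eight word-tokens
print(type_token_ratio(frequencies))  # 0.625
print(yules_k(frequencies))           # 1250.0
```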
RANK put the sets of pairs (fi, ri) for each corpus into a form suited for a linear regression
program. The original plan was to use the regression subprogram in SPSS: Statistical Package for
the Social Sciences (Nie, Bent, and Hull, 1970), a widely used and recommended program package
available on the UTACC system. But experiment revealed that the regression feature in the
OMNITAB II package (Hogben, Peavy, and Varner, 1971) cost about one-third as much as SPSS
did for each regression run on the data in this study.
Hence, the regression processing employed the very satisfactory interactive version of
OMNITAB II that the UTACC had developed. After preparatory commands, the FIT command in
OMNITAB II was applied to the observed (log ri, log fi) pairs for each corpus. To determine the
linear regression line for the rank-frequency curve, FIT performed a least-squares fit of the observed
pairs to the model
$$\log f_i = \alpha + \beta \log r_i + \varepsilon_i$$
In this model, the use of which was based on the assumption stated in section 3.1.7, α and β
represented constants to be determined through the least-squares process, and the εi represented
random errors meeting the basic assumptions required for regression (see, for example, Draper and
Smith, 1966, p. 17). The application of FIT to a particular set of pairs (log ri, log fi) yielded
particular values A and B such that the equation
$$\widehat{\log f} = A + B \log r$$
provided for any value of r the predicted value (denoted by the " ^ ") of log f that was best in the
least-squares sense. The value of B was the slope of the rank-frequency curve for the corpus.
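The least-squares fit can be sketched in a few lines of Python. This is an illustration of ordinary least squares applied to the (log r, log f) pairs, not the FIT command of OMNITAB II. The use of base-10 logarithms is an assumption (the slope B does not depend on the base), and the function name fit_log_log is hypothetical.

```python
import math


def fit_log_log(pairs):
    """Least-squares fit of log f = A + B log r to a list of (f, r) pairs."""
    xs = [math.log10(r) for _, r in pairs]
    ys = [math.log10(f) for f, _ in pairs]
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # B, the slope of the rank-frequency curve
    intercept = mean_y - slope * mean_x    # A
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    resid_sd = math.sqrt(sse / (n - 2))    # residual standard deviation
    var_log_r = sxx / (n - 1)              # sample variance of log r
    return intercept, slope, resid_sd, var_log_r


# An idealized rank-frequency relation f = 1000 / r has slope -1 on log-log axes.
pairs = [(1000 / r, r) for r in (1, 2, 4, 5, 10)]
a, b, s, v = fit_log_log(pairs)
print(round(a, 6), round(b, 6))
```

The residual standard deviation and the variance of log r returned by the sketch are the kinds of quantities needed later when the slopes of two corpora are compared.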
To aid in the presentation of this study, a program called PLTRF plotted the (log ri, log fi) values for
each corpus together with their regression line and, for comparison, a line of slope -1 passing
through the midpoint of the regression line. Appendix C contains these plots.
The foregoing discussion has described the processing of the whole vocabularies of the
various corpora. To prepare for the tests of the hypotheses concerned with the technical
vocabularies, the study employed a combination of two empirically developed lists of words found
to have no value as indicators of content in scientific fields. The combined list was stored as a PF,
called COMMON. This list and its sources are detailed in Appendix B. A program named DELFN
compared COMMON with a PF copy of the frequency-ordered list of words in each corpus.
Whenever DELFN found a match, it deleted the word and its frequency from the PF copy. What
remained when DELFN had finished was taken to be the technical vocabulary of the corpus. It
was, of course, arranged in order of decreasing frequencies and alphabetically within a given
frequency.
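The deletion step amounts to filtering the frequency-ordered list against the list of common words, as the following Python sketch suggests. It is not the DELFN program; the function name technical_vocabulary and the sample words are hypothetical, and case-insensitive matching is an assumption.

```python
def technical_vocabulary(frequency_list, common_words):
    """Drop every (word, frequency) entry whose word appears in the common-word list."""
    common = {word.lower() for word in common_words}
    # Filtering preserves the existing order of the frequency-ordered list.
    return [(word, freq) for word, freq in frequency_list if word.lower() not in common]


frequency_list = [("the", 120), ("of", 75), ("electron", 12), ("lattice", 9), ("is", 8)]
print(technical_vocabulary(frequency_list, ["the", "of", "is"]))
# [('electron', 12), ('lattice', 9)]
```

Because the filter does not reorder anything, the remaining list stays in order of decreasing frequency and alphabetical within a given frequency.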
Following the preparation of the PF lists of the technical vocabularies, these lists were
processed by RANK, OMNITAB II, and PLTRF in the same way as the lists of the whole
vocabularies. A minor change in PLTRF omitted the line of slope -1, since it turned out to be of
no interest in the plots of the technical vocabularies.
The actual tests of the hypotheses of this study consisted in comparisons of slopes of
rank-frequency curves determined by the processing just described. The OMNITAB II program
package was used in making these comparisons.
4.5 Summary
Chapter 4 has described the preparation and use of the data in this study, texts of scientific
writings. These data were (1) randomly selected from journal articles published in 1921 and 1969 in
four disciplines, (2) typewritten, and then converted by an optical character reader into computer-usable form, (3) carefully corrected and tailored to the needs of the study, and (4) processed by
several computer programs to yield measurements of the slopes of rank-frequency curves. These
slopes were compared in order to test the hypotheses of the study. The results are presented in
Chapter 5.
CHAPTER 5
RESULTS OF THE STUDY
Jargon standardization is the process by which scientists move from using a variety of
expressions for an emerging concept to a standard name for an accepted concept. This study was
intended to ascertain whether measurements of the slopes of rank-frequency curves would reveal
the expected increase over time in the amount of standardized jargon in a discipline, and whether
such measurements would provide a quantitative measure of the hardness of a discipline.
After presenting the assumptions and definitions in this study, Chapter 3 detailed the
hypotheses of the study and their relation to its two major purposes. The hypotheses can be
summarized as General Hypothesis I:
The slopes of the rank-frequency curves of corpora in a discipline will become
steeper as time passes.
and General Hypothesis II:
The slopes of the rank-frequency curves of corpora in hard disciplines will be steeper
than those of coeval corpora in soft disciplines.
Chapter 4 began by explaining how the samples of scientific writing that the study employed
to test the hypotheses were selected from journal articles published in 1921 and in 1969 in each of
four scientific disciplines, forming a total of eight experimental corpora. The disciplines were:
physics and mathematics, as representative hard sciences; ecology and psychology, as representative
soft sciences. Chapter 4 then discussed how the samples were prepared for processing by computer
programs, and how these programs were used to obtain measurements of the slopes of rank-frequency curves for the eight corpora. For each corpus, two such slopes were measured: one for
the whole vocabulary of the corpus; and one for what was called the "technical vocabulary" of the
corpus, the words remaining after the deletion of the set of common words listed in Appendix B.
Chapter 5 discusses the results obtained when the hypotheses of the study were tested by
comparisons of these slopes.
5.1 Outline of the Results
General Hypothesis I of this study states that the slopes of the rank-frequency curves of
corpora in a discipline will become steeper as time passes. The results do not support this
hypothesis. In fact, there is a slight tendency toward the opposite behavior.
In the tests using the original texts—i.e., in the tests of the whole vocabularies of the
corpora—there was a difference in slope that was significant at the 5% level in 3 out of the 4 pairs
of time-differentiated corpora in a discipline. Of the 3 significant differences, 2 reflected slopes that
were less steep in 1969 than in 1921, and only 1 conformed to General Hypothesis I.
In the tests of texts from which the common words had been deleted—i.e., in the tests of
the technical vocabularies—none of the pairs of time-differentiated corpora in a discipline showed a
difference in slope that was large enough to be significant at the 5% level.
General Hypothesis II of the study states that the slopes of the rank-frequency curves of
corpora in hard disciplines will be steeper than those of coeval corpora in soft disciplines. The
results generally support this hypothesis.
In the tests with original texts, 6 out of the 8 pairs of corpora—one corpus from a hard
discipline and one from a soft discipline in the same year—showed a difference in slope that was
significant at the 5% level. In each of the 6 significantly different pairs the slope of the hard corpus
was steeper than that of the soft corpus.
In the tests of texts from which common words had been deleted, 4 out of the 8 pairs of
hard and soft corpora showed a difference in slope that was significant at the 5% level. All 4
significant differences resulted from steeper slopes for hard corpora than for soft corpora.
The tests that compared mean slopes of pairs of hard corpora with mean slopes of pairs of
coeval soft corpora also supported General Hypothesis II. All 4 comparisons, 2 of them using the
original texts and 2 using texts from which common words had been deleted, showed differences
significant at the 5% level. In all 4 cases, the mean slope of the hard corpora was steeper than the
mean slope of the soft corpora.
The details of the tests of the specific hypotheses are presented in section 5.2. As noted in
section 3.2.1, the notation of the hypotheses represents ecology by EC, mathematics by MA, physics
by PH, and psychology by PS.
5.2 The Tests of the Hypotheses
5.2.1 Tests of Hypotheses 1
The four specific Hypotheses 1 consist of statements exemplified by
Hypothesis 1-EC: In ecology, a corpus drawn from later writings will have a steeper
slope for the rank-frequency curve formed from the whole vocabulary of the corpus
than will a corpus drawn from earlier writings.
The other three Hypotheses 1, labelled 1-MA, 1-PH, and 1-PS, make the analogous statements for
their respective disciplines.
Table 5-1 presents the data for the tests of Hypotheses 1. The contents of the first and third
columns of data in this table resulted from the application of the FIT command in OMNITAB II to
the pairs (log ri, log fi) for each corpus; the contents of the fourth column, from the application of
the command STATISTICAL ANALYSIS to the values of log ri .
The data in columns 2 through 4 were processed, via OMNITAB II, in accordance with the
standard procedure for comparing regression coefficients from different samples, as presented in
Dixon and Massey (1969, pp. 206-9). For each pair of corpora, the estimated standard deviation of
the difference in slopes of samples was determined and is shown in column 5. The observed
difference in the slopes of a pair of corpora was divided by the corresponding estimated standard
deviation to yield a t score (column 6), whose number of degrees of freedom is given in column 7.
Column 8 contains the interpretation, at the 5% level of significance, of the t scores as tests of the
null hypothesis that the true slopes of the corpora are equal.
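The comparison of two slopes can be sketched as follows in Python. The formula is the common textbook form, in which the residual variances are pooled and the t score has N1 + N2 - 4 degrees of freedom; whether it agrees in every detail with the presentation in Dixon and Massey is an assumption, and the function name compare_slopes is hypothetical. The input values are those given for the two ecology corpora in Table 5-1.

```python
import math


def compare_slopes(b1, n1, s1, var1, b2, n2, s2, var2):
    """t test for the equality of two regression slopes.

    b = slope, n = number of pairs, s = residual standard deviation,
    var = sample variance of log r (computed with divisor n - 1).
    """
    df = n1 + n2 - 4
    # Pool the residual variances, each weighted by its n - 2 degrees of freedom.
    pooled_var = ((n1 - 2) * s1 ** 2 + (n2 - 2) * s2 ** 2) / df
    # (n - 1) * var is the sum of squared deviations of log r about its mean.
    sd_diff = math.sqrt(pooled_var * (1 / ((n1 - 1) * var1) + 1 / ((n2 - 1) * var2)))
    t = (b1 - b2) / sd_diff
    return sd_diff, t, df


# Ecology, 1921 versus Ecology, 1969 (values from Table 5-1).
sd_diff, t, df = compare_slopes(-0.92946, 83, 0.042936, 0.41641,
                                -0.88788, 77, 0.048953, 0.43641)
print(round(sd_diff, 4), round(t, 2), df)
```

With these inputs the sketch gives a standard deviation of about 0.0112, a t score of about -3.71, and 156 degrees of freedom, in line with the tabulated row for ecology.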
Only Hypothesis 1-MA was found to be consistent with the observations. The test of
Hypothesis 1-PH showed a nonsignificant difference. The tests of Hypotheses 1-EC and 1-PS
showed significantly less steep slopes for the 1969 corpora than for the 1921 corpora, the opposite
of what these hypotheses predicted.
5.2.2 Tests of Hypotheses 2
The four specific Hypotheses 2 consist of statements exemplified by
Hypothesis 2-EC: In ecology, a corpus drawn from later writings will have a steeper
slope for the rank-frequency curve formed from the technical vocabulary of the
corpus than will a corpus drawn from earlier writings.
The other three Hypotheses 2, labelled 2-MA, 2-PH, and 2-PS, make the analogous
statements for their respective disciplines.
Table 5-2 presents the data for the tests of Hypotheses 2. These data were prepared in the
manner described in section 5.2.1.
None of Hypotheses 2 was found to be consistent with the observations. All the differences
in slope failed to be significant at the 5% level, and 3 out of the 4 nonsignificant differences were
opposite in tendency to what Hypotheses 2 predicted.
5.2.3 Tests of Hypotheses 3
The four specific Hypotheses 3 consist of statements exemplified by
Hypothesis 3-PH-EC: A corpus drawn from writings in physics will have a steeper
slope for the rank-frequency curve formed from the whole vocabulary of the corpus
than will a coeval corpus drawn from ecology.
TABLE 5-1
DATA FOR THE TESTS OF HYPOTHESES 1, USING THE ORIGINAL TEXTS OF THE SAMPLES

                    Slope of    Number of   Residual               Standard
                    Rank-       Rank-       Standard    Variance   Deviation of
Corpora Compared    Frequency   Frequency   Deviation   of Log r   Differences of                   Significant at
(Original Text)     Curve       Pairs                              Slopes                           5% Level?
                    B           N           S_f·r       S_r^2      SD(B21 - B69)    t        d.f.

Ecology, 1921       -0.92946    83          0.042936    0.41641    0.011202         -3.712   156    yes
Ecology, 1969       -0.88788    77          0.048953    0.43641

Mathematics, 1921   -1.00399    106         0.054802    0.32623    0.013707         2.635    204    yes
Mathematics, 1969   -1.04011    102         0.058259    0.33475

Physics, 1921       -0.95301    90          0.052965    0.38983    0.013905         -1.187   171    no
Physics, 1969       -0.93650    85          0.062424    0.40822

Psychology, 1921    -0.93947    84          0.035506    0.40082    0.009223         -2.489   168    yes
Psychology, 1969    -0.91651    88          0.039716    0.38756
TABLE 5-2
DATA FOR THE TESTS OF HYPOTHESES 2, USING SAMPLES WITH COMMON WORDS DELETED

                    Slope of    Number of   Residual               Standard
Corpora Compared    Rank-       Rank-       Standard    Variance   Deviation of
(Common Words       Frequency   Frequency   Deviation   of Log r   Differences of                   Significant at
Deleted)            Curve       Pairs                              Slopes                           5% Level?
                    B           N           S_f·r       S_r^2      SD(B21 - B69)    t        d.f.

Ecology, 1921       -0.54337    47          0.072939    0.59647    0.024545         -1.948   87     no
Ecology, 1969       -0.49556    44          0.106078    0.62538

Mathematics, 1921   -0.65485    61          0.117012    0.43118    0.031085         -0.044   120    no
Mathematics, 1969   -0.65347    63          0.108233    0.42991

Physics, 1921       -0.55826    54          0.121175    0.50326    0.030849         -0.448   99     no
Physics, 1969       -0.54443    49          0.103025    0.56093

Psychology, 1921    -0.50137    43          0.107595    0.59464    0.028667         0.899    88     no
Psychology, 1969    -0.52713    49          0.098325    0.55149
The other three Hypotheses 3, labelled 3-PH-PS, 3-MA-EC, and 3-MA-PS, make the
analogous statements for their respective pairs of disciplines.
Table 5-3 presents the data for the tests of Hypotheses 3. These data were prepared in the
manner described in section 5.2.1.
The tests showed Hypotheses 3-MA-EC, 3-MA-PS, and 3-PH-EC to be consistent with the
observations for both the 1921 data and the 1969 data. The tests of Hypothesis 3-PH-PS showed a
nonsignificant difference in slope for both the 1921 and the 1969 data, though in each year the
physics corpus had a steeper slope than the psychology corpus.
On balance, these results support the idea that hard corpora have steeper slopes than do
coeval soft corpora, when these corpora consist of the whole vocabularies used.
5.2.4 Tests of Hypotheses 4
The four specific Hypotheses 4 consist of statements exemplified by
Hypothesis 4-PH-EC: A corpus drawn from writings in physics will have a steeper
slope for the rank-frequency curve formed from the technical vocabulary of the
corpus than will a coeval corpus drawn from ecology.
The other three Hypotheses 4, labelled 4-PH-PS, 4-MA-EC, and 4-MA-PS, make the
analogous statements for their respective pairs of disciplines.
Table 5-4 presents the data for the tests of Hypotheses 4. These data were prepared in the
manner described in section 5.2.1.
The tests showed Hypotheses 4-MA-EC and 4-MA-PS to be consistent with the
observations for both the 1921 data and the 1969 data. The tests of Hypotheses 4-PH-EC and
4-PH-PS showed nonsignificant differences in slope for both the 1921 data and the 1969 data. It is
interesting that these four differences, though not significant, all involved a physics slope that was
steeper than the slope of the corresponding soft corpus.
On balance, these results give partial support to the idea that hard corpora have steeper
slopes than do coeval soft corpora, when these corpora consist of texts from which the common
words have been deleted, i.e., the technical vocabularies.
TABLE 5-3
DATA FOR THE TESTS OF HYPOTHESES 3, USING THE ORIGINAL TEXTS OF THE SAMPLES

                    Slope of    Number of   Residual               Standard
                    Rank-       Rank-       Standard    Variance   Deviation of
Corpora Compared    Frequency   Frequency   Deviation   of Log r   Differences of                   Significant at
(Original Text)     Curve       Pairs                              Slopes                           5% Level?
                    B           N           S_f·r       S_r^2      SD(Bs - Bh)      t        d.f.

Ecology, 1921       -0.92946    83          0.042936    0.41641    0.012080         6.170    185    yes
Mathematics, 1921   -1.00399    106         0.054802    0.32623

Psychology, 1921    -0.93947    84          0.035506    0.40082    0.011508         5.607    186    yes
Mathematics, 1921   -1.00399    106         0.054802    0.32623

Ecology, 1921       -0.92946    83          0.042936    0.41641    0.011672         2.018    169    yes
Physics, 1921       -0.95301    90          0.052965    0.38983

Psychology, 1921    -0.93947    84          0.035506    0.40082    0.011014         1.229    170    no
Physics, 1921       -0.95301    90          0.052965    0.38983

Ecology, 1969       -0.88788    77          0.048953    0.43641    0.013311         11.436   175    yes
Mathematics, 1969   -1.04011    102         0.058259    0.33475

Psychology, 1969    -0.91651    88          0.039716    0.38756    0.012300         10.049   186    yes
Mathematics, 1969   -1.04011    102         0.058259    0.33475

Ecology, 1969       -0.88788    77          0.048953    0.43641    0.013744         3.538    158    yes
Physics, 1969       -0.93650    85          0.062424    0.40822

Psychology, 1969    -0.91651    88          0.039716    0.38756    0.012641         1.581    169    no
Physics, 1969       -0.93650    85          0.062424    0.40822
TABLE 5-4
DATA FOR THE TESTS OF HYPOTHESES 4, USING SAMPLES WITH COMMON WORDS DELETED

                    Slope of    Number of   Residual               Standard
Corpora Compared    Rank-       Rank-       Standard    Variance   Deviation of
(Common Words       Frequency   Frequency   Deviation   of Log r   Differences of                   Significant at
Deleted)            Curve       Pairs                              Slopes                           5% Level?
                    B           N           S_f·r       S_r^2      SD(Bs - Bh)      t        d.f.

Ecology, 1921       -0.54337    47          0.072939    0.59647    0.027499         4.054    104    yes
Mathematics, 1921   -0.65485    61          0.117012    0.43118

Psychology, 1921    -0.50137    43          0.107595    0.59464    0.031768         4.831    100    yes
Mathematics, 1921   -0.65485    61          0.117012    0.43118

Ecology, 1921       -0.54337    47          0.072939    0.59647    0.027649         0.539    97     no
Physics, 1921       -0.55826    54          0.121175    0.50326

Psychology, 1921    -0.50137    43          0.107595    0.59464    0.032128         1.771    93     no
Physics, 1921       -0.55826    54          0.121175    0.50326

Ecology, 1969       -0.49556    44          0.106078    0.62538    0.029344         5.381    103    yes
Mathematics, 1969   -0.65347    63          0.108233    0.42991

Psychology, 1969    -0.52713    49          0.098325    0.55149    0.028547         4.426    108    yes
Mathematics, 1969   -0.65347    63          0.108233    0.42991

Ecology, 1969       -0.49556    44          0.106078    0.62538    0.028484         1.716    89     no
Physics, 1969       -0.54443    49          0.103025    0.56093

Psychology, 1969    -0.52713    49          0.098325    0.55149    0.027563         0.628    94     no
Physics, 1969       -0.54443    49          0.103025    0.56093
5.2.5 Tests of Hypothesis 5 and Hypothesis 6
The first hypothesis discussed in this section is
Hypothesis 5: The mean slope of rank-frequency curves formed from the whole
vocabularies of corpora drawn from physics and mathematics will be steeper than
the corresponding slope for coeval corpora from ecology and psychology.
The second hypothesis discussed in this section, Hypothesis 6, makes the analogous statement for
the rank-frequency curves formed from the technical vocabularies of the corpora.
Table 5-5 presents the data for the tests of Hypothesis 5 and Hypothesis 6. These data were
prepared in essentially the same manner as that described in section 5.2.1. The only changes of procedure
were: (1) The slope of a pair of corpora was calculated as the unweighted mean of the individual
slopes; and (2) the residual standard deviation and the variance of the log r terms were pooled for
each pair by the ordinary procedure for pooling such quantities (see Dixon and Massey, 1969, p.
113).
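The two changes of procedure can be illustrated by a short Python sketch. The pooling weights used here (N - 2 for the residual variances, N - 1 for the variances of log r) are the ordinary ones and are an assumption about the exact computation; the function name pooled_pair is hypothetical. The inputs are the values for the 1921 ecology and psychology corpora in Table 5-1.

```python
import math


def pooled_pair(b1, n1, s1, var1, b2, n2, s2, var2):
    """Combine two corpora: unweighted mean slope, pooled residual SD, pooled variance of log r."""
    mean_slope = (b1 + b2) / 2
    # Residual variances are pooled with weights n - 2 (their degrees of freedom).
    pooled_sd = math.sqrt(((n1 - 2) * s1 ** 2 + (n2 - 2) * s2 ** 2) / (n1 + n2 - 4))
    # Variances of log r are pooled with weights n - 1.
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return mean_slope, pooled_sd, pooled_var


# Ecology, 1921 and Psychology, 1921 (original text, values from Table 5-1).
mean_slope, pooled_sd, pooled_var = pooled_pair(-0.92946, 83, 0.042936, 0.41641,
                                                -0.93947, 84, 0.035506, 0.40082)
print(round(mean_slope, 5), round(pooled_sd, 6), round(pooled_var, 5))
```

Under these assumptions the sketch gives values that agree, to the precision shown, with the entries -0.93447, 0.039374, and 0.40857 for the pair Psychology & Ecology, 1921, in Table 5-5.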
The tests showed both Hypotheses 5 and 6 to be consistent with the observations for both
the 1921 and the 1969 data.
These results support the idea that combinations of hard corpora have steeper slopes than
do combinations of coeval soft corpora, for both whole vocabularies and technical vocabularies.
5.3 Summary
The results of the present study showed that the evidence fails to support the hypothesis that
jargon standardization, acting over time, will bring about increased steepness in the slopes of rank-frequency curves of scientific texts.
The results also showed that it does appear possible to distinguish between hard and soft
disciplines by the relatively greater steepness of slopes of rank-frequency curves for writings in hard
disciplines.
TABLE 5-5
DATA FOR THE TESTS OF HYPOTHESIS 5, USING THE ORIGINAL TEXTS OF THE SAMPLES,
AND OF HYPOTHESIS 6, USING TEXTS WITH COMMON WORDS DELETED

                                  Mean Slope  Total       Pooled                 Standard
                                  of Rank-    Number of   Residual    Pooled     Deviation of
                                  Frequency   Rank-       Standard    Variance   Differences of                   Significant at
Pairs of Corpora                  Curves      Frequency   Deviation   of Log r   Slopes           t        d.f.   5% Level?
                                              Pairs - 2

ORIGINAL TEXT

Psychology & Ecology, 1921        -0.93447    165         0.039374    0.40857    0.008214         5.360    355    yes
vs. Mathematics & Physics, 1921   -0.50137    194         0.053967    0.35541

Psychology & Ecology, 1969        -0.54337    173         0.044548    0.41184    0.009043         9.523    354    yes
vs. Mathematics & Physics, 1969   -0.50137    185         0.060186    0.36811

COMMON WORDS DELETED

Psychology & Ecology, 1921        -0.49556    88          0.091138    0.59560    0.021136         3.983    197    yes
vs. Mathematics & Physics, 1921   -0.52713    113         0.118983    0.46499

Psychology & Ecology, 1969        -0.49556    91          0.102062    0.58641    0.020261         4.234    197    yes
vs. Mathematics & Physics, 1969   -0.52713    110         0.105992    0.48708
CHAPTER 6
INTERPRETATION AND RECOMMENDATIONS
This thesis defined jargon standardization to be the process by which scientists move from
using a variety of words and phrases in expressing an emerging concept to using a single, standard
name for a fully developed, well defined concept. Two suggestions were made: first, that it might
be possible to measure the degree of jargon standardization in a discipline by the slopes of rank-frequency curves formed from counts of frequencies of words in texts in the discipline; and second,
that the degree of jargon standardization in a discipline might be closely related to the hardness of
that discipline. From these ideas came the two purposes of this study: to ascertain whether
measurements of the slopes of rank-frequency curves would reveal the expected increase over time
in the amount of standardized jargon in a discipline, and whether such measurements would provide
a quantitative measure of the hardness of a scientific discipline.
The hypotheses of this study can be summarized as General Hypothesis I:
The slopes of the rank-frequency curves of corpora in a discipline will become
steeper as time passes.
and General Hypothesis II:
The slopes of the rank-frequency curves of corpora in hard disciplines will be steeper
than those of coeval corpora in soft disciplines.
General Hypothesis I stems from the following reasoning: As a discipline ages, its
practitioners tend to settle upon a standard vocabulary to express its concepts and to use the same
words consistently and often; this tendency results in their making increasingly frequent use of each
of a relatively limited number of standard words; and this result has the effect of steepening the
slopes of rank-frequency curves. The reasoning leading to General Hypothesis II can be
summarized thus: At any given time, the practitioners of a hard discipline have available a more
standardized vocabulary to describe their discipline's concepts than practitioners of a soft discipline
have; the latter have to use a larger number of different words in wrestling with concepts that tend
to be less well defined than those in the hard discipline; hence, the practitioners of the hard
discipline tend to make more frequent use of each of a relatively limited number of different words,
while the practitioners of the soft discipline tend to make less frequent use of each of a larger
number of different words; and the result is steeper slopes for rank-frequency curves for hard
corpora than for soft corpora.
Chapter 4 explained how the eight experimental corpora in this study were selected from
journal articles published in 1921 and 1969 in each of four scientific disciplines: physics and
mathematics, as representative hard sciences; ecology and psychology, as representative soft
sciences. For each corpus, two slopes of rank-frequency curves were measured: one for the whole
vocabulary of the corpus; and one for what was called the "technical vocabulary" of the corpus, the
words remaining after the deletion of the set of common words listed in Appendix B.
Chapter 5 set forth the results of the present study. These showed that the evidence failed
to support the hypothesis that jargon standardization, acting over time, will bring about increased
steepness in the slopes of rank-frequency curves of scientific texts. But the results also showed that
it does appear possible to distinguish between hard and soft disciplines by the relatively greater
steepness of slopes in hard disciplines.
Chapter 6 presents first a possible way of interpreting these seemingly conflicting results, in
section 6.1. The next two sections discuss some recommendations for further research. The major
portion, section 6.3, of these recommendations concerns the problem of whether hard and soft
disciplines can be distinguished by applying certain standard vocabulary statistics to writings in the
disciplines. Such research would be a logical extension of the finding by this study that a nonstandard vocabulary statistic, the slope of the rank-frequency curve, is capable of making such a
distinction. Section 6.3 illustrates the recommendations for such research by examples based on
data available from the present study.
6.1 A Possible Interpretation of the Results of the Study
The results of this study with respect to General Hypothesis I indicate that the process of
jargon standardization in a discipline either does not exist or, if it does exist, fails to manifest itself
by an increase over time in the steepness of the slopes of rank-frequency curves in a discipline. On
the other hand, the comparisons of hard and soft corpora accord generally with the idea—General
Hypothesis II—that the presumed greater jargon standardization in a hard discipline should
manifest itself in a steeper slope of a rank-frequency curve for that discipline than for a coeval soft
discipline. Though these results appear to be contradictory, resolution of the contradiction may
reside in the following line of reasoning.
That the process of jargon standardization in scientific disciplines takes place is not open to
doubt. As time passes in any discipline, new concepts emerge and come to bear one standard label
or at most a very small number of standard alternative labels. The absolute number of such
standardized concepts and corresponding labels in any discipline undoubtedly increases with time.
But what is crucial to the present study is whether there is an increase over time in the relative number
of standardized concepts, compared with the total number of all standardized and unstandardized
concepts, discussed in the journal articles reporting research in a discipline.
A possible interpretation of the findings of this study with respect to General Hypothesis I is
that the relative number of standardized concepts in the journal articles reporting research in a
discipline does not increase with time and may, in fact, tend to decrease. After all, the research
literature of an actively developing discipline should contain much discussion of concepts that are
just emerging in the discipline and are the subjects of much of the research activity in it.
An especially actively developing discipline might well be one in which new concepts
—not yet standardized—are being introduced so fast and discussed so much that their role in the
research literature balances, or even outweighs, the role of the standardized concepts, even though
the latter continue to increase in absolute number in the discipline. In such a discipline, the
proportion of standardized concepts discussed in the research literature could decrease over time,
with an accompanying decrease in the steepness of the slopes of rank-frequency curves for texts in
the discipline. The amount of decrease might even be a measure of the change in how actively the
discipline is developing.
That the whole vocabulary of ecology had a slope of -0.92946 in 1921 and a less steep slope
of -0.88788 in 1969 (cf. Table 5-1) would indicate, under this interpretation, that ecology was
developing more rapidly in 1969 than in 1921. Similar statements would hold for physics and
psychology. In contrast, mathematics, with a slope of -1.00399 in 1921 and a steeper slope of
-1.04011 in 1969, would appear to be developing less rapidly in 1969 than in 1921. The differences
of slope for whole vocabularies, shown in Table 5-1, are not matched by the differences of slopes
for the technical vocabularies, shown in Table 5-2. This lack of matching might be explained as
reflecting the relative stability of the proportion of standardized terms in the technical vocabulary
when the diluting effect of the nonstandardized terms, employed to talk about concepts not yet
standardized, was removed.
Under this interpretation, furthermore, the findings of the study with respect to General
Hypothesis II could be stated in the following way: The findings support the idea that the relative
number of standardized terms in the research literature of a hard discipline at any given time tends
to exceed the relative number of standardized terms in the research literature of a coeval soft
discipline.
These are only speculations, of course. But they do suggest that it might be worthwhile to
pursue the line of inquiry of this study using as data literature other than journal articles, for
example, textbooks and state-of-the-art reviews, which would appear to be less susceptible to the
diluting effects of nonstandardized concepts than the research literature is.
6.2 Possible Inquiries into Changes over Time in Scientific Writings
Certain phenomena are readily sensed when one compares present-day writings in science
with older materials. Though subjective impressions do not constitute evidence, anyone who takes
the trouble to spend a few minutes browsing in the 1921 issues and then in the 1969 issues of any of
the journals used in this study will recognize that the tone, the style, and the ambience differ
markedly between the two years. The older writings display a casualness, a digressiveness, and a
relaxed air that the newer writings lack.
It would be interesting to look into ways of objectifying these impressions. It seems possible
that an investigation of scientific writing might find a number of differences over time in standard
stylostatistical measures: average sentence length, noun-to-verb ratios, measures of sentence
complexity, etc.
One difference readily apparent in all four disciplines in this study is the much greater
density of mathematical expressions in the 1969 articles than in the 1921 articles. Though the
difference is easy to see, to measure it precisely would be difficult. No studies are known that have
attempted to deal with the considerable problems involved in carefully assessing the "amount of
mathematics" in a body of text. Storer (1967) referred to his "characterization of 'hardness' . . . [as] a
measure of the frequency with which mathematics is used in the different sciences," but his measure
was merely "the frequency [with which] tables are used in articles—on the assumption that a table
involves at least some mathematics." A study that carefully assessed the "amount of mathematics"
in articles in various disciplines at different times would be desirable.
Other possible measures of change over time in scientific writing that deserve investigation
include average number of citations, average length of articles, and average length of abstracts.
Another possibility is the amount of borrowing by one discipline of techniques and models from
other fields, a measure of the "interdisciplinariness" of the discipline. A study complementary to the
present one could investigate changes over time in scientific writing within much narrower and more
homogeneous disciplines. For such a study disciplines might be defined in terms of networks
formed by citations.
Three further topics for speculation, arising in the present study, may be noted here. First,
the fit of the regression line to the (log r_i, log f_i) pairs for the technical vocabularies was much less
satisfactory than for the whole vocabularies. The plots for the corpora with common words deleted,
shown in Appendix C, suggest that (log r_i, log f_i) pairs would be better approximated by a polynomial
of degree at least 2 than by the linear polynomial shown. Investigation of fitting the technical
vocabulary corpora by higher-order polynomials was left to further study.
Second, when more than one word-type occurred with a particular frequency, the rank
corresponding to that frequency was defined as the mean rank of all such word-types (cf. section
3.1.7). This definition, adopted for comparability with Zipf's work, was not the only possible way to
handle the problem of tied ranks, but other possible definitions were left for future investigation.
Third, instead of the definition in this study of "technical vocabulary" as what remained after
certain common words had been deleted from the whole corpus, the technical vocabulary of a
discipline might be defined as only those words found in a standard technical dictionary of the
discipline. Application of the procedures of this study to a different kind of technical vocabulary
might have interestingly different results.
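
The first two of these points can be made concrete with a short sketch. The Python code below is
illustrative only and is not the program used in this study. Its function names are hypothetical,
it assumes that one (log r_i, log f_i) pair is formed for each distinct frequency (an assumption,
since the study's exact choice is not restated in this chapter), and it fits polynomials by ordinary
least squares through the normal equations. It assigns mean ranks to tied word-types and then
compares the residual sum of squares of a linear fit with that of a quadratic fit.

```python
import math
from collections import Counter


def mean_rank_points(tokens):
    """Return (log10 rank, log10 frequency) pairs, one per distinct frequency.

    Word-types that share a frequency receive the mean of the ranks they occupy.
    The slope of a fitted line does not depend on the base of the logarithm.
    """
    type_freqs = Counter(tokens)               # word-type -> frequency
    spectrum = Counter(type_freqs.values())    # frequency -> number of word-types
    points = []
    ranks_used = 0
    for freq in sorted(spectrum, reverse=True):
        n_types = spectrum[freq]
        # Ranks ranks_used+1 .. ranks_used+n_types are tied; take their mean.
        mean_rank = ranks_used + (n_types + 1) / 2
        points.append((math.log10(mean_rank), math.log10(freq)))
        ranks_used += n_types
    return points


def polyfit(xs, ys, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... via the normal equations."""
    m = degree + 1
    a = [[sum(x ** (j + k) for x in xs) for k in range(m)] for j in range(m)]
    b = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(m)]
    for col in range(m):                       # elimination with partial pivoting
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            factor = a[row][col] / a[col][col]
            for k in range(col, m):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    coeffs = [0.0] * m
    for row in range(m - 1, -1, -1):           # back substitution
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, m))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs


def residual_ss(xs, ys, coeffs):
    """Residual sum of squares of the fitted polynomial."""
    return sum(
        (y - sum(c * x ** power for power, c in enumerate(coeffs))) ** 2
        for x, y in zip(xs, ys)
    )


# Toy corpus (hypothetical); the corpora of the study ran to about 23,000 tokens each.
tokens = "the of the a of the model the a field of the model".split()
points = mean_rank_points(tokens)
xs = [x for x, _ in points]
ys = [y for _, y in points]
line = polyfit(xs, ys, 1)                      # line[1] is the slope of the Zipf curve
parabola = polyfit(xs, ys, 2)
print("slope:", line[1])
print("RSS linear:", residual_ss(xs, ys, line))
print("RSS quadratic:", residual_ss(xs, ys, parabola))
```

A markedly smaller residual sum of squares for the quadratic than for the line would be one
informal indication of the curvature suggested by the plots in Appendix C.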
6.3 Possible Inquiries into Potential Measures of the Hardness of a Scientific Discipline
The present study found that a nonstandard vocabulary statistic, the slope of the
rank-frequency curve, is capable of distinguishing writings in hard science from those in soft science.
This finding suggests that it would be worthwhile to inquire whether standard vocabulary statistics
would be capable of making such distinctions. To pursue such an inquiry was outside the scope of
the present study, but data from the study can be used to illustrate some possible directions for such
an inquiry.
The rest of this section discusses these possible directions. Table 6-1 brings together several
standard vocabulary statistics for the corpora of this study. Table 6-2 does the same thing for the
corpora with common words deleted, and also notes the numbers of word-types and word-tokens
that were deleted.
TABLE 6-1
VOCABULARY STATISTICS OF THE CORPORA, USING THE ORIGINAL TEXTS

N = number of tokens; V = number of types; V1 = number of singleton types;
V/√N = Guiraud's ratio; V/N = type-token ratio; γ = log V/log N;
N/V = mean frequency of word-types; p1 = V1/V = probability that a type has
exactly one token; K = Yule's characteristic.

Corpus (Original Text)         N       V      V1    V/√N     V/N       γ      N/V      p1       K
Ecology, 1921             23,141   3,865   1,934   25.41  0.1670  0.8219    5.987  0.5004   156.9
Ecology, 1969             22,002   3,793   1,846   25.57  0.1724  0.8242    5.801  0.4867   122.3
Mathematics, 1921         23,429   1,892     691   12.36  0.0808  0.7499   12.383  0.3652   187.8
Mathematics, 1969         23,873   1,954     766   12.65  0.0818  0.7517   12.218  0.3920   164.4
Physics, 1921             23,398   2,714   1,113   17.74  0.1160  0.7859    8.621  0.4101   209.3
Physics, 1969             22,323   3,027   1,340   20.26  0.1356  0.8005    7.375  0.4427   192.4
Psychology, 1921          22,235   3,676   1,894   24.03  0.1653  0.8202    6.049  0.5152   130.7
Psychology, 1969          21,354   3,276   1,498   22.42  0.1534  0.8120    6.518  0.4573   118.5
Total                    181,755
TABLE 6-2
VOCABULARY STATISTICS OF THE CORPORA, USING TEXTS WITH COMMON WORDS DELETED

N = number of tokens (the values listed are the counts before deletion; V/N, γ, and N/V
are computed from the tokens remaining after deletion); V = number of types;
V1 = number of singleton types; V/N = type-token ratio; γ = log V/log N;
N/V = mean frequency of word-types; p1 = V1/V = probability that a type has
exactly one token; K = Yule's characteristic.

Corpus (Common                  Common              Common           Singleton
Words Deleted)             N    Tokens       V      Types      V1    Types       V/N       γ     N/V      p1       K
                                Deleted             Deleted          Deleted
Ecology, 1921         23,141    11,943   3,648        217   1,914         20  0.3258  0.8797   3.070  0.5247   9.357
Ecology, 1969         22,002    10,470   3,584        209   1,823         23  0.3108  0.8750   3.218  0.5086   8.296
Mathematics, 1921     23,429    14,124   1,698        194     670         21  0.1825  0.8138   5.480  0.3946   26.78
Mathematics, 1969     23,873    14,201   1,769        185     747         19  0.1829  0.8149   5.467  0.4223   29.84
Physics, 1921         23,398    13,274   2,500        214   1,096         17  0.2469  0.8483   4.050  0.4384   14.12
Physics, 1969         22,323    11,743   2,815        212   1,315         25  0.2661  0.8571   3.758  0.4671   12.35
Psychology, 1921      22,235    12,074   3,453        223   1,885          9  0.3398  0.8830   2.943  0.5459   8.855
Psychology, 1969      21,354    10,910   3,061        215   1,479         19  0.2931  0.8674   3.412  0.4832   10.29
6.3.1 The Type-Token Ratio
Several standard vocabulary statistics showed a surprising variability in the corpora of this
study. One of these is the type-token ratio, i.e., the ratio of the number of word-types to the
number of word-tokens. Miller (1951) summarized several studies of the type-token ratio and noted
its dependence on the total size of the body of text on which it is measured. In Tables 6-1 and 6-2
the type-token ratio is identified as V/N , in accordance with the notation introduced by Guiraud
(1954b): V = vocabulary size, i.e., number of different word-types; and N = size of sample, i.e.,
number of word-tokens.
The corpora in this study are quite close in size. The mean of their numbers N of tokens is
N̄ = 22,719.38, and the standard deviation of N is SD(N) = 865.47, which is about 4% of N̄.
Guiraud (1959) suggested that "for normal texts composed of approximately 10,000 to 50,000
words, the vocabulary V increases with the square root of the text length, in such fashion that
V/√N ≈ 22" for French texts.¹ The corpora in this study have a mean number of types
V̄ = 3,024.63, whence V̄/√N̄ = 20.07, but the ratio actually varies from 12.36 to 25.57 among
the corpora. Their type-token ratios, V/N, also vary considerably, from 0.0808 to 0.1724. A
similar situation holds for the corpora with common words deleted, for which N̄ is 10,264.00 and
SD(N) = 770.78, about 8% of N̄, and among which the type-token ratio varies from 0.1825 to
0.3398.
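
As a check on the arithmetic, the ratios tabulated for a corpus can be recomputed from its three
basic counts. The following Python sketch is illustrative and is not the program used in this study;
the function and key names are hypothetical. It evaluates the ratios for the Ecology 1921 corpus
from the values of N, V, and V1 given in Table 6-1.

```python
import math
from collections import Counter


def vocabulary_statistics(n_tokens, n_types, n_singletons):
    """Vocabulary ratios computed from N, V, and V1."""
    return {
        "type_token_ratio": n_types / n_tokens,                          # V/N
        "guiraud_ratio": n_types / math.sqrt(n_tokens),                  # V/sqrt(N)
        "log_type_token_ratio": math.log(n_types) / math.log(n_tokens),  # log V/log N
        "mean_type_frequency": n_tokens / n_types,                       # N/V
        "singleton_probability": n_singletons / n_types,                 # p1 = V1/V
    }


def counts_from_tokens(tokens):
    """Return (N, V, V1) for a list of word-tokens."""
    freqs = Counter(tokens)
    singletons = sum(1 for f in freqs.values() if f == 1)
    return len(tokens), len(freqs), singletons


# Counts for the Ecology 1921 corpus, taken from Table 6-1.
ecology_1921 = vocabulary_statistics(23141, 3865, 1934)
for name, value in ecology_1921.items():
    print(f"{name}: {value:.4f}")
```

Rounded to the precision of Table 6-1, the printed values agree with the Ecology 1921 row
(0.1670, 25.41, 0.8219, 5.987, and 0.5004).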
Unfortunately, there is no theory adequate to provide norms for the mean and standard
deviation of the type-token ratio. The closest approach to such a theory is a paper by Mizutani
(1973), in which the author derived the following expression for the variance Var(V ) of the
expected number V of word-types in a sample of N running words:
\[
\operatorname{Var}(V) = \sum_{i=1}^{L} (1 - P_i)^N
  - \left[ \sum_{i=1}^{L} (1 - P_i)^N \right]^2
  + \sum_{i=1}^{L} \sum_{j \ne i} (1 - P_i - P_j)^N
\]
In this expression, L is the size of the potential vocabulary realizable in a not very well defined but
certainly very large body of text, of which an example would be all writings in physics in a given
year, and P_i is the probability that a randomly selected word-token will be a token of the i-th
word-type in a list of all L word-types.
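
The form of this expression can be recovered by a short argument, sketched here under the
assumption (introduced for this sketch, not taken from Mizutani's paper) that the N word-tokens are
drawn independently with fixed probabilities P_i. The indicator Z_i is an auxiliary symbol used only
in this sketch.

```latex
\begin{align*}
Z_i &= \begin{cases} 1 & \text{if word-type } i \text{ is absent from the sample,} \\
                     0 & \text{otherwise,} \end{cases}
\qquad V = L - \sum_{i=1}^{L} Z_i, \\
E(Z_i) &= (1 - P_i)^N, \qquad E(Z_i Z_j) = (1 - P_i - P_j)^N \quad (j \ne i), \\
E(V) &= L - \sum_{i=1}^{L} (1 - P_i)^N, \\
\operatorname{Var}(V) &= \operatorname{Var}\Bigl(\sum_{i=1}^{L} Z_i\Bigr)
  = \sum_{i=1}^{L} E(Z_i) + \sum_{i=1}^{L} \sum_{j \ne i} E(Z_i Z_j)
    - \Bigl[\sum_{i=1}^{L} E(Z_i)\Bigr]^2 \\
 &= \sum_{i=1}^{L} (1 - P_i)^N - \Bigl[\sum_{i=1}^{L} (1 - P_i)^N\Bigr]^2
    + \sum_{i=1}^{L} \sum_{j \ne i} (1 - P_i - P_j)^N .
\end{align*}
```

The second line of the variance uses the fact that Z_i squared equals Z_i, so that the expected
square of the sum splits into the single sum and the double sum over distinct pairs.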
Mizutani's expression for Var(V ) depends on knowledge of the probabilities of
occurrence of the individual word-types in the language as a whole. Hence, to use his result to
provide a numerical value of the theoretical variance against which to compare the observed values
in this study would be very time-consuming.
¹ My translation.
Some shortening of that task could be achieved by using an approximation procedure based
on that of Guiraud (1954b, p. 33). He formed groups consisting of the 50 most frequent word-types,
the 50 next most frequent, then the 500 next most frequent, and so on. Presumably, the
assignment of a word-type to a group would reflect its usage in a very large body of text. Guiraud
then stated the proportion p_k of word-tokens from the k-th group that occurred in a (presumably
very large) body of text. He used this proportion to calculate the expected total number of
word-types, along with the expected numbers of word-types occurring once, twice, etc., in a specific
text, on the hypothesis that these numbers are governed by the Poisson distribution.
Guiraud offered the data in the first three columns of Table 6-3 as an example of this
procedure using French word-types. His approach can be extended to provide an approximate
probability of occurrence, P_i = p_k/n_k, for each individual word-type in the k-th group, which
contains n_k word-types. Such approximations are shown in the fourth column of Table 6-3.
Corresponding approximate probabilities for English could be used in calculating numerical values
of Mizutani's variance of the expected number of word-types in a sample.
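
Guiraud's Poisson procedure can be sketched as follows. The Python code is illustrative only: the
function name is hypothetical, the grouped proportions are the French values of Table 6-3 (so the
results say nothing definite about English), and the Poisson rate for a word-type in group k is
taken to be N p_k/n_k, which is the extension of Guiraud's approach described above. The sketch
returns the expected number of word-types occurring at least once and the expected number
occurring exactly once in a text of N word-tokens.

```python
import math

# (number of word-types n_k, proportion of running text p_k), from Table 6-3.
GUIRAUD_GROUPS = [
    (50, 0.51), (50, 0.08), (500, 0.19), (500, 0.08), (1000, 0.065),
    (1000, 0.03), (1000, 0.02), (5000, 0.02), (5000, 0.003), (10000, 0.002),
]


def expected_types_poisson(n_tokens, groups=GUIRAUD_GROUPS):
    """Expected numbers of word-types with at least one token and with exactly one."""
    expected_v = 0.0
    expected_v1 = 0.0
    for n_k, p_k in groups:
        rate = n_tokens * p_k / n_k                    # expected tokens of one word-type
        expected_v += n_k * (1.0 - math.exp(-rate))    # Poisson P(at least one token)
        expected_v1 += n_k * rate * math.exp(-rate)    # Poisson P(exactly one token)
    return expected_v, expected_v1


# 22,719 is the mean corpus size in this study, rounded to a whole number of tokens.
expected_v, expected_v1 = expected_types_poisson(22719)
print(f"expected V = {expected_v:.0f}, expected V1 = {expected_v1:.0f}")
```

Corresponding proportions for English would have to be substituted before the output could be
compared with the observed values of V and V1 in Table 6-1.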
TABLE 6-3
PROPORTIONS OF RUNNING TEXT ACCOUNTED FOR BY WORD-TYPES OF
VARIOUS FREQUENCIES (AFTER P. GUIRAUD)

Group,   Number in    Proportion of     Probability of Occurrence of a
k        Group, n_k   Running Text,     Word-Type Belonging to Group k,
                      p_k               P_i
 1           50       0.51              0.0102
 2           50       0.08              0.0016
 3          500       0.19              0.00038
 4          500       0.08              0.00016
 5        1,000       0.065             0.000065
 6        1,000       0.03              0.000030
 7        1,000       0.02              0.000020
 8        5,000       0.02              0.000004
 9        5,000       0.003             0.0000006
10       10,000       0.002             0.0000002
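
With grouped probabilities of this kind, Mizutani's variance can be evaluated without listing every
word-type separately, because all word-types in a group share one probability. The Python sketch
below is illustrative only: the function name is hypothetical, the probabilities are the French values
of Table 6-3 rather than values for English, and the three terms are those of the variance
expression for independently drawn tokens (the sum of (1 - P_i)^N, minus its square, plus the
double sum of (1 - P_i - P_j)^N over distinct pairs).

```python
# (number of word-types n_k, proportion of running text p_k), from Table 6-3.
GUIRAUD_GROUPS = [
    (50, 0.51), (50, 0.08), (500, 0.19), (500, 0.08), (1000, 0.065),
    (1000, 0.03), (1000, 0.02), (5000, 0.02), (5000, 0.003), (10000, 0.002),
]


def variance_of_types(n_tokens, groups=GUIRAUD_GROUPS):
    """Var(V) for N tokens when every word-type in group k has P_i = p_k / n_k."""
    probs = [(n_k, p_k / n_k) for n_k, p_k in groups]
    # Single sum over word-types of (1 - P_i)^N, grouped.
    absent = sum(n_k * (1.0 - prob) ** n_tokens for n_k, prob in probs)
    # Double sum over ordered pairs of distinct word-types of (1 - P_i - P_j)^N.
    pairs = 0.0
    for a, (n_a, prob_a) in enumerate(probs):
        for b, (n_b, prob_b) in enumerate(probs):
            # Within a group there are n(n - 1) ordered pairs of distinct types.
            n_pairs = n_a * (n_a - 1) if a == b else n_a * n_b
            pairs += n_pairs * (1.0 - prob_a - prob_b) ** n_tokens
    return absent - absent ** 2 + pairs


# 22,719 is the mean corpus size in this study, rounded to a whole number of tokens.
variance = variance_of_types(22719)
print(f"Var(V) = {variance:.1f}, SD(V) = {variance ** 0.5:.1f}")
```

The square of the single sum and the double sum are both large and nearly cancel, so the result
is a small difference of large numbers; double-precision arithmetic is adequate at this scale,
but a term-by-term covariance form would be safer for much larger vocabularies.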
6.3.2 The Logarithmic Type-Token Ratio
Another way of handling the type-token ratio was suggested by Herdan (1960, p. 26), who observed
that this ratio
changes, in general, with the size of the piece of literary work, the vocabulary
increasing with the text length—but by no means proportional to it—in such a way
that the quantity decreases, on the whole, with increasing sample size. It cannot,
therefore, in this form serve as a characteristic of style, which must be independent
of the text length. The logarithmic type/token ratio, i.e., the log type/log token, on
the other hand, remains sensibly constant for samples of different size from a given
literary text and is, therefore, suitable to serve as a style characteristic.
This fact—one of the most remarkable in the field of quantitative
linguistics—appears to have been independently discovered by three investigators:
Chotlos . . . Herdan . . . and Devooght. . . .
Tables 6-1 and 6-2 include values of the logarithmic (Herdan also called it the "bilogarithmic")
type-token ratio, γ = log V/log N , for the corpora in the present study. These observed values of γ
suggest that the logarithmic type-token ratio might plausibly serve as a stylistic discriminant between
individual disciplines and, even more plausibly, as a discriminant between hard and soft corpora.
With respect to the latter possibility, it is interesting to note from Table 6-1 that the four hard
corpora show γ̄ = 0.7720 and SD(γ) = 0.0252, while the four soft corpora show γ̄ = 0.8196 and
SD(γ) = 0.0053.
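
These group means and standard deviations can be recomputed from the log V/log N column of
Table 6-1. The short Python sketch below is illustrative; it assumes that the standard deviation
meant is the sample form with divisor n - 1, which is the form that agrees with the figures just
quoted.

```python
from statistics import mean, stdev

# log V / log N for the original-text corpora, from Table 6-1.
GAMMA = {
    "hard": [0.7499, 0.7517, 0.7859, 0.8005],   # mathematics and physics, 1921 and 1969
    "soft": [0.8219, 0.8242, 0.8202, 0.8120],   # ecology and psychology, 1921 and 1969
}

for group, values in GAMMA.items():
    # stdev() is the sample standard deviation (divisor n - 1).
    print(f"{group}: mean = {mean(values):.4f}, SD = {stdev(values):.4f}")
```

Run as written, the sketch prints means of 0.7720 and 0.8196 and standard deviations of 0.0252
and 0.0053 for the hard and soft groups respectively.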
6.3.3 Yule's Characteristic, K
Yule's characteristic K is one of the earliest vocabulary statistics, having been introduced by
Yule (1944) in his pioneering work The Statistical Study of Literary Vocabulary. In the notation used in
the present study, Yule's definition was
\[
K = 10{,}000 \, \frac{\sum_i n_i f_i^2 - N}{N^2}
\]
where
f_i = a distinct frequency with which one or more word-types occur in the body of the
      text being considered
n_i = the number of word-types occurring with frequency f_i
N = the total number of word-tokens in the body of text.
Yule's characteristic K is especially important because it is independent of the size of the
text from which it is calculated (Yule, 1944, p. 46). It is essentially a measure of the probability that
a randomly chosen pair of words will be identical; the larger K is, the greater is this probability. For
this reason Williams (1970) preferred to think of K as a measure of uniformity and proposed that
10,000/K be called "Yule's Index of Diversity." This index of diversity, which will be called W here
in honor of Williams, has the convenient interpretation that W = 10,000/K is the expected number
of pairs of words that must be chosen at random before an identical pair is obtained.
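
Yule's definition and Williams's index translate directly into code. The following Python sketch is
illustrative, with hypothetical function names; it assumes the text is available as a list of
word-tokens and forms the sum over distinct frequencies f_i exactly as in the definition of K.

```python
from collections import Counter


def yules_k(tokens):
    """Yule's characteristic K = 10,000 * (sum_i n_i * f_i**2 - N) / N**2."""
    n_tokens = len(tokens)
    type_freqs = Counter(tokens)                # word-type -> frequency
    spectrum = Counter(type_freqs.values())     # f_i -> n_i, the types with that frequency
    sum_sq = sum(n_i * f_i ** 2 for f_i, n_i in spectrum.items())
    return 10_000 * (sum_sq - n_tokens) / n_tokens ** 2


def index_of_diversity(tokens):
    """W = 10,000 / K; undefined (division by zero) when no word-type is repeated."""
    return 10_000 / yules_k(tokens)


# Toy text (hypothetical): N = 7, two types of frequency 2 and three of frequency 1.
tokens = "the cell divides and the cell grows".split()
print(yules_k(tokens), index_of_diversity(tokens))

# W for the Ecology 1921 corpus from its tabulated K (Table 6-1).
print(round(10_000 / 156.9, 2))
```

For the toy text the sum of n_i f_i^2 is 2(4) + 3(1) = 11, so K = 10,000(11 - 7)/49 and
W = 49/4 = 12.25.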
As an illustration, Table 6-4 presents values of "Yule's Index of Diversity" for the eight
corpora in the present study. It is interesting to note that in each pair of corpora in a discipline the
later corpus has the greater diversity, and that the soft corpora have a mean diversity W̄ = 76.60,
with a standard deviation SD(W) = 9.18, while the hard corpora have W̄ = 53.39 and
SD(W) = 5.55. This difference in diversities is consistent with the idea that a hard discipline has a
more standardized vocabulary than a soft discipline has; for one consequence of this would be more
frequent repetition of standard terms and, hence, less diversity.
TABLE 6-4
VALUES OF THE PROPOSED "YULE'S INDEX OF DIVERSITY,"
W = 10,000/K, FOR THE ORIGINAL-TEXT CORPORA

SOFT                                          HARD
Corpus (Original Text)   Yule's Index of      Corpus (Original Text)   Yule's Index of
                         Diversity, W                                  Diversity, W
Ecology, 1921            63.73                Mathematics, 1921        53.25
Ecology, 1969            81.77                Mathematics, 1969        60.83
Psychology, 1921         76.51                Physics, 1921            47.78
Psychology, 1969         84.39                Physics, 1969            51.98
6.3.4 Singleton Types and Their Probability
Among the data in both Tables 6-1 and 6-2 are the numbers of singleton types, i.e.,
word-types represented by just one token in the corpus. These numbers are denoted by V1 ,
following Guiraud (1954b), who worked extensively with this vocabulary statistic. Of it Muller
(1969) said:
V1 has always been considered a very significant value. . . .
The nuisance is that in non-exhaustive statistical surveys, and even in indices
verborum, attention is generally concentrated on the most frequently appearing words,
while those that appear only once are passed over without comment.
The values of V1 lead directly to the calculations of p1 = V1/V , the probability that a
word-type has exactly one token in the corpus. Table 6-1 shows that the hard and soft corpora
appear to be distinguished by the values of p1. The former have a mean p̄1 = 0.4025 and a standard
deviation SD(p1) = 0.0325; the latter have p̄1 = 0.4899 with SD(p1) = 0.0247. Since the number
of singleton types is involved in Yule's Index of Diversity (cf. section 6.3.3), it is not surprising that
p1 = V1/V also appears to discriminate between hard and soft corpora.
As a matter of possible interest, Table 6-5 is included. It shows the distribution, in each of
the corpora, of numbers of word-types represented by 1, 2, . . . , 10 tokens. Among other things,
this table shows that the relative deficit of the hard corpora with respect to their proportions of
singleton types has not been fully made up even by the time the decaton types have been counted.
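
A distribution of the kind shown in Table 6-5 can be produced from a list of word-tokens as in the
following Python sketch. The sketch is illustrative, with a hypothetical function name, and it
forms each cumulative percentage from the cumulative count of word-types rather than by adding
rounded percentages.

```python
from collections import Counter


def spectrum_table(tokens, max_frequency=10):
    """Rows of (frequency, number of types, % of types, cumulative % of types)."""
    type_freqs = Counter(tokens)                # word-type -> frequency
    spectrum = Counter(type_freqs.values())     # frequency -> number of word-types
    n_types = len(type_freqs)
    rows = []
    cumulative = 0
    for freq in range(1, max_frequency + 1):
        count = spectrum.get(freq, 0)
        cumulative += count
        rows.append((freq, count, 100 * count / n_types, 100 * cumulative / n_types))
    return rows


# Toy text (hypothetical): four word-types with frequencies 2, 1, 3, and 1.
for freq, count, pct, cum_pct in spectrum_table("a a b c c c d".split(), max_frequency=3):
    print(f"{freq:2d}  {count:4d}  {pct:6.2f}  {cum_pct:6.2f}")
```

Comparing the cumulative column between two corpora at a fixed frequency is the comparison
made above for the hard and soft corpora at ten tokens.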
6.4 Summary
Chapter 6 has suggested, in section 6.1, an interpretation of the finding by this study that
jargon standardization failed to reveal itself as expected in the behavior of the slopes of
rank-frequency curves of time-differentiated corpora, although it did show up as hypothesized in
differences of these slopes between hard and soft corpora. Sections 6.2 and 6.3 have suggested
some possible investigations into changes over time in scientific writing and into the distinguishing
of writings in hard and soft sciences, and have illustrated some of these suggestions.
Section 6.3 has provided strong indications that measurable differences exist between
writings in hard and in soft sciences. These indications are consistent with the finding by this study
that the slope of the rank-frequency curve provides a measure of such differences. The general
problem of measuring such differences merits further investigation.
TABLE 6-5
DISTRIBUTION OF NUMBERS OF WORD-TYPES REPRESENTED BY 1, 2, …, 10
TOKENS IN THE ORIGINAL-TEXT CORPORA

Corpus              Total      Singleton        Doubleton              Tripleton              Quadraton              Quintaton
(Original Text)     Number     %      (No.)     %      Cum. %  (No.)   %      Cum. %  (No.)   %      Cum. %  (No.)   %      Cum. %  (No.)
                    of Types
Ecology, 1921       (3,865)    50.05  (1,934)   16.74  66.78   (647)   8.51   75.29   (329)   5.36   80.65   (207)   3.67   84.32   (142)
Ecology, 1969       (3,793)    48.67  (1,846)   16.74  65.41   (635)   8.46   73.87   (321)   5.06   78.93   (192)   3.82   82.75   (145)
Mathematics, 1921   (1,892)    36.52  (691)     14.53  51.05   (275)   9.83   60.88   (186)   6.18   67.06   (117)   4.60   71.66   (87)
Mathematics, 1969   (1,954)    39.20  (766)     14.69  53.89   (287)   9.47   63.36   (185)   5.42   68.78   (106)   4.25   73.03   (83)
Physics, 1921       (2,714)    41.01  (1,113)   16.62  57.63   (451)   9.10   66.73   (247)   7.00   73.73   (190)   4.68   78.41   (127)
Physics, 1969       (3,027)    44.27  (1,340)   17.58  61.85   (532)   9.12   70.97   (276)   4.92   75.89   (149)   3.83   79.72   (116)
Psychology, 1921    (3,676)    51.52  (1,894)   16.43  67.95   (604)   8.03   75.98   (295)   5.14   81.12   (189)   3.51   84.63   (129)
Psychology, 1969    (3,276)    45.73  (1,498)   17.28  63.01   (566)   8.97   71.98   (294)   5.25   77.23   (172)   4.00   81.23   (131)
TABLE 6-5 (cont'd)
DISTRIBUTION OF NUMBERS OF WORD-TYPES REPRESENTED BY 1, 2, …, 10
TOKENS IN THE ORIGINAL-TEXT CORPORA

Corpus              Sexaton                Septaton               Octaton                Nonaton                Decaton
(Original Text)     %      Cum. %  (No.)   %      Cum. %  (No.)   %      Cum. %  (No.)   %      Cum. %  (No.)   %      Cum. %  (No.)
Ecology, 1921       2.12   86.44   (82)    2.07   88.51   (80)    1.29   89.80   (50)    1.03   90.83   (40)    1.60   92.43   (62)
Ecology, 1969       2.45   85.20   (93)    2.35   87.55   (89)    1.92   89.47   (73)    1.34   90.81   (51)    0.95   91.76   (36)
Mathematics, 1921   3.75   75.41   (71)    2.64   78.05   (50)    2.43   80.48   (46)    2.22   82.70   (42)    1.64   84.34   (31)
Mathematics, 1969   3.74   76.77   (73)    1.94   78.71   (38)    2.00   80.71   (39)    1.59   82.30   (31)    1.33   83.63   (26)
Physics, 1921       3.17   81.58   (86)    2.10   83.68   (57)    1.84   85.52   (50)    1.66   87.18   (45)    1.40   88.58   (38)
Physics, 1969       3.17   82.89   (96)    2.44   85.33   (74)    2.02   87.35   (61)    1.42   88.77   (43)    0.79   89.56   (24)
Psychology, 1921    2.29   86.92   (84)    1.93   88.85   (71)    1.31   90.16   (48)    1.09   91.25   (40)    0.82   92.07   (30)
Psychology, 1969    2.81   84.04   (92)    2.66   86.70   (87)    1.89   88.59   (62)    1.25   89.94   (41)    0.95   90.79   (31)
APPENDIX A
ARTICLES FROM WHICH TEXT SAMPLES WERE DRAWN
This appendix identifies articles from which one or more samples of text were drawn. Its
purpose is to indicate the scope of the subjects covered in the samples from each of the disciplines
in 1921 and 1969. Because of this limited purpose, only titles and first authors are listed. For
consistency with the practice elsewhere in this report, each author's own decisions with respect to
capitalization and other stylistic matters have been followed.
American Journal of Mathematics, 1921
Carmichael, R. D.
Boundary value and expansion problems: Algebraic basis of the
theory
Carmichael, R. D.
Boundary value and expansion problems: Formulation of various
transcendental problems
Coble, A. B.
Multiple binary forms with the closure property
Daniell, P. J.
Integral products and probability
Datta, B.
On the motion of two spheroids in an infinite liquid along their
common axis of revolution
Dickson, L. E.
Algebraic theory of the expressibility of cubic forms as determinants,
with application to diophantine analysis
Hart, W. L.
The Cauchy-Lipschitz method for infinite systems of differential
equations
Hazlett, O. C.
Associated forms in the general theory of modular covariants
Hollcroft, T. R.
On (2,3) compound involutions
Kasner, E.
Einstein's theory of gravitation: Determination of the field by light
signals
Kasner, E.
Finite representation of the solar gravitational field in flat space of six
dimensions
Kasner, E.
Geometrical theorems on Einstein's cosmological equations
Lane, E. P.
Conjugate systems with indeterminate axis curves
Morse, H. M.
A one-to-one representation of geodesics on a surface of negative
curvature
Post, E. L.
Introduction to a general theory of elementary propositions
Sparrow, C. M.
On the Fermat and hessian points for the non-Euclidean triangle and
their analogues for the tetrahedron
Whittemore, J. K.
Reciprocity in a problem of relative maxima and minima
American Journal of Mathematics, 1969
Adler, A.
The second fundamental forms of S^6 and P^n(C)
Askey, R., et al.
A convolution structure for Jacobi series
Beals, R.
Correction to "Classes of compact operators and eigenvalue
distributions for elliptic operators"
Billigheimer, C. E.
Singular boundary problems for a five-term recurrence relation
Brown, M.
A note on cartesian products
Cheeger, J.
Pinching theorems for a certain class of riemannian manifolds
Epstein, D. B. A.
Group representations and functors
Fell, J. M. G.
An extension of Mackey's method to algebraic bundles over finite
groups
Gerstein, L. J.
Splitting quadratic forms over integers of global fields
Gottlieb, D. H.
Evaluation subgroups of homotopy groups
Hartman, P.
Principal solutions of disconjugate n-th order linear differential
equations
Harvey, R.
The theory of hyperfunctions on totally real subsets of a complex
manifold with applications to extension problems
Hochschild, G., et al.
Pro-affine algebraic groups
Johnson, R. P.
Orthogonal groups of local anisotropic spaces
Kallman, R. R.
Unitary groups and automorphisms of operator algebras
Knopp, M. I.
A corona theorem for automorphic forms and related results
Kodama, Y.
On subset theorems and the dimension of products
Kra, I.
On Teichmüller spaces for finitely generated Fuchsian groups
Lance, E. C.
Automorphisms of certain operator algebras
Lipsman, R. L.
Uniformly bounded representations of SL(2,C)
Lipsman, R. L.
Uniformly bounded representations of the Lorentz groups
Lutz, D. A.
Asymptotic behavior of solutions of linear systems of ordinary
differential equations near an irregular singular point
Pepe, W. D.
On the total curvature of C^1 hypersurfaces in E^(n+1)
Pincus, J. D., et al.
A spectral theory for some unbounded self-adjoint singular integral
operators
Putnam, C. R.
Absolute continuity of singular integral operators
Radjavi, H., et al.
On invariant subspaces and reflexive algebras
Santaló, L. A.
On some geometric inequalities in the style of Fáry
Seeley, R. T.
Analytic extension of the trace associated with elliptic boundary
problems
Seeley, R. T.
The resolvent of an elliptic boundary problem
Seidenberg, A.
Analytic products
Shalika, J. A., et al.
On an explicit construction of a certain class of automorphic forms
Stein, J. D., Jr.
Continuity of homomorphisms of von Neumann algebras
Takesaki, M.
A characterization of group algebras as a converse of Tannaka-Stinespring-Tatsuuma
duality theorem
Veech, W. A.
Properties of minimal functions on abelian groups
White, J. H.
Self-linking and the Gauss integral in higher dimensions
Wyman, B. F.
Wildly ramified gamma extensions
Ecology, 1921
Allen, W. E.
Problems of floral dominance in the open sea
Arrhenius, O.
Influence of soil reaction on earthworms
Baker, F. C.
The importance of ecology in the interpretation of fossil faunas
Baker, F. S., et al.
Snowshoe rabbits and conifers in the Wasatch Mountains of Utah
Barney, R. L., et al.
Seasonal abundance of the mosquito destroying top-minnow,
Gambusia affinis, especially in relation to male frequency
Bird, H.
Soil acidity in relation to insects and plants
Braun, E. L.
Composition and source of the flora of the Cincinnati region
Clements, F. E.
Drouth periods and climatic cycles
Ekblaw, W. E.
The ecological relations of the Polar Eskimo
Gail, F. W.
Factors controlling the distribution of Douglas fir in semi-arid
regions of the northwest
Geiser, S. W.
Notes on the differential death-rate in Gambusia
Haasis, F. W.
Relations between soil type and root form of western yellow pine
seedlings
Hofmann, J. V.
Adaptation in Douglas fir
Kern, F. D.
Observations on the disseminations of the barberry
MacDougal, D. T.
The reaction of plants to new habitats
McHargue, J. S.
Some points of interest concerning the cocklebur and its seeds
Michael, E. L., et al.
Problems of marine ecology
Moore, B., et al.
Plant composition and soil acidity of a Maine bog
Redway, J. W.
The dust of the upper air
Rigg, G. B.
Some factors in evergreenness in the Puget Sound region
Satterthwait, A. F.
Notes on the food plants and distribution of certain billbugs
Shaw, W. T.
Moisture and altitude as factors in determining the seasonal activities
of the Townsend Ground Squirrel in Washington
Ward, H. B.
Some of the factors controlling the migration and spawning of the
Alaska red salmon
Wheeler, W. M.
A new case of parabiosis and the "ant gardens" of British Guiana
Ecology, 1969
Anderson, R. C., et al.
Herbaceous response to canopy cover, light intensity, and throughfall
precipitation in coniferous forests
Barbour, R. W., et al.
Home range, movements, and activity of the eastern worm snake,
Carphophis amoenus amoenus
Baskerville, G. L., et al.
Rapid estimation of heat accumulation from maximum and minimum
temperatures
Bryant, E. H.
The fates of immatures in mixtures of two housefly strains
Collias, N. E., et al.
Size of breeding colony related to attraction of mates in a tropical
passerine bird
Cubit, J.
Behavior and physical factors causing migration and aggregation of
the sand crab Emerita analoga (Stimpson)
Cunningham, G. L., et al.
An ecological significance of seasonal leaf variability in a desert shrub
Davis, M. B.
Climatic changes in southern Connecticut recorded by pollen
deposition at Rogers Lake
Dawson, T. J., et al.
A bioclimatological comparison of the summer day
microenvironments of two species of arid-zone kangaroo
Golley, F. B.
Caloric value of wet tropical forest vegetation
Gorden, R. W., et al.
Studies of a simple laboratory micro-ecosystem: Bacterial activities in
a heterotrophic succession
Haertel, L., et al.
Nutrient and plankton ecology of the Columbia River estuary
Hamner, W. M., et al.
The behavior and life history of a sand-beach isopod, Tylos punctatus
Heuschele, A. S.
Invertebrate life cycle patterns in the benthos of a floodplain in
Minnesota
Hurlbert, S. H.
A coefficient of interspecific association
Idso, S. B., et al.
A method for determination of infrared emittance of leaves
Krebs, C. J., et al.
Microtus population biology: Demographic changes in fluctuating
populations of M. ochrogaster and M. pennsylvanicus in southern Indiana
Lee, G. F., et al.
Use of chemical composition of freshwater clamshells as indicators
of paleohydrologic conditions
McColl, J. G.
Soil-plant relationships in a Eucalyptus forest on the south coast of
New South Wales
Maly, E. J.
A laboratory study of the interaction between the predatory rotifer
Asplanchna and Paramecium
Nord, E. C., et al.
Atriplex species [or taxa] that spread by root sprouts, stem layers, and
by seed
Orians, G. H.
The number of bird species in some tropical forests
Paine, R. T.
The Pisaster-Tegula interaction: Prey patches, predator food
preference, and intertidal community structure
Patten, D. T., et al.
Carbon dioxide exchange patterns of cacti from different
environments
Pearson, L. C.
Influence of temperature and humidity on distribution of lichens in a
Minnesota bog
Quade, H. W.
Cladoceran faunas associated with aquatic macrophytes in some lakes
in northwestern Minnesota
Ritchie, G. A.
Cuvette temperatures and transpiration rates
Rosenzweig, M. L., et al.
Population ecology of desert rodent communities: Habitats and
environmental complexity
Sale, P. F.
Pertinent stimuli for habitat selection by the juvenile manini,
Acanthurus triostegus sandvicensis
Schroeder, L.
Population growth efficiencies of laboratory Hydra pseudoligactis
Hyman populations
Schweger, C. E.
Pollen analysis of Iola Bog and paleoecology of the Two Creeks
Forest Bed, Wisconsin
Simberloff, D. S.
Experimental zoogeography of islands. A model for insular
colonization
Sinha, R. N., et al.
Principal-component analysis of interrelations among fungi, mites,
and insects in grain bulk ecosystems
Sinko, J. W., et al.
Applying models incorporating age-size structure of a population to
Daphnia
Steenbergh, W. F., et al.
Critical factors during the first years of life of the saguaro (Cereus
giganteus) at Saguaro National Monument, Arizona
Stirling, I.
Ecology of the Weddell Seal in McMurdo Sound, Antarctica
Terborgh, J., et al.
Colonization of secondary habitats by Peruvian birds
Turner, R. M., et al.
Mortality of transplanted saguaro seedlings
Vogl, R.
One hundred and thirty years of plant succession in a southeastern
Wisconsin lowland
Weaver, M. E., et al.
Morphological changes in swine associated with environmental
temperature
Whitcomb, C. E.
A connecting pot technique for root competition investigations
between woody plants or between woody and herbaceous plants
White, T. C. R.
An index to measure weather-induced stress of trees associated with
outbreaks of psyllids in Australia
Wilhm, J. L., et al.
Succession in algal mat communities at three different nutrient levels
Wilson, E. O., et al.
Experimental zoogeography of islands. Defaunation and monitoring
techniques
Yarranton, G. A.
Pattern analysis by regression
Physical Review, 1921
Abbott, R. B.
Damped electric oscillations
Barber, I. G.
Secondary electron emission from copper surfaces
Bateman, H.
Electricity and gravitation
Benade, J. M.
Thermoelectric effects in iron and mercury due to
asymmetric heating
Brant, L.
Magnetic susceptibility of nickel and cobalt chloride solutions
Breit, G.
The distributed capacity of inductance coils
Bridgman, P. W.
The electrical resistance of metals
Brown, B. E.
The occlusion of water by gas-mask charcoal
Bryan, A. B.
Conductivity of flames containing salt vapors
Carson, J. R.
Theory and calculation of variable electrical systems
Compton, A. H.
The absorption of gamma rays by magnetized iron
Curtiss, L. F.
Physical properties of thin metallic films. III. The effect of
temperature on the change of resistance of bismuth films in a
magnetic field
Davis, B., et al.
An experimental study of the reflection of X-rays from calcite
Dempster, A. J.
Positive ray analysis of lithium and magnesium
Dennison, D. M.
The crystal structure of ice
Hartley, R. V. L., et al.
The binaural location of pure tones
Helmick, P. S.
The blackening of a photographic plate as a function of intensity of
light and time of exposure
Howes, H. L.
The luminescence of samarium
Hulburt, E. O.
The detecting efficiency of the resistance-capacity coupled electron
tube amplifier
Hull, A. W.
X-ray crystal analysis of thirteen common metals
Hull, A. W.
The effect of a uniform magnetic field on the motion of electrons
between coaxial cylinders
Ingersoll, L. R.
Some peculiarities of polarization and energy distribution by
speculum gratings
Kleeman, R. D.
An electrical doublet theory of the nature of the molecular forces of
chemical and physical interaction
Lassalle, L. J.
On the motion of a sphere of oil through carbon dioxide and a
determination of the coefficient of viscosity of that gas by the oil
drop method
Loeb, L. B.
The formation of negative ions in air
Maizlish, I.
The principle of projective covariance
Millikan, R. A.
The distinction between intrinsic and spurious contact e.m.f.s and the
question of the absorption of radiation by metals in quanta
Murdock, C. C.
A study of the photo-active electrolytic cell, platinum–rhodamineB–platinum
Murnaghan, F. D.
The absolute significance of Maxwell's equations
Nichols, E. L., et al.
Flame excitation of luminescence
Nolan, P. J.
Evidence for the existence of homogeneous groups of large ions
Page, L.
A generalization of electrodynamics with applications to the structure
of the electron and to non-radiating orbits
Roman, I.
The transmission of waves through a symmetric optical instrument
Rood, E. S.
Thermal conductivity of some wearing materials
Sethi, N. K.
On Talbot's bands and the theory of the Lummer-Gehrcke
interferometer
Smith, A. W.
The Hall effect and the Nernst effect in magnetic alloys
Wilson, H. A.
The reflexion of x-rays by crystals
Physical Review, 1969
Alzetta, R., et al.
Improved inverse gap equation and quasiparticle theories of odd and
even tin isotopes
Balling, L. C.
Phase-shift calculation for low-energy electron-Rb scattering
Bennett, H. S.
F center in ionic crystals. II. Polarizable-ion models
Berger, E. L., et al.
Pion-nucleon and kaon-nucleon scattering in the Veneziano model
Bisiacchi, G., et al.
Compton scattering from a bound system of two charged particles
Blazey, K. W., et al.
Antiferromagnetic phase diagram and magnetic band gap shift of
NaCrS2
Bouchiat, C. C., et al.
Evidence for Rb–rare-gas molecules from the relaxation of polarized
Rb atoms in a rare gas. Theory
Canada, T. R., et al.
Decay of the 40Ca 3.90-MeV state
Choudhury, D. K., et al.
Consequences of a QQQ model for meson baryon processes. II.
Positive-parity mesons
Cole, R. K., Jr.
Nuclear-polarization corrections to the levels of muonic atoms
Cornwall, J. M.
Problems of gauge invariance in composite-particle theory
Fink, C. L., et al.
Nemets effect in deuteron breakup by heavy nuclei
Flytzanis, C., et al.
Second-order optical susceptibilities of III-V semiconductors
Frederick, D. E., et al.
O16 (γ, p)N15 angular distributions over the energy range 21-33 MeV
Franklin, J.
Quarks of almost any spin
Gordon, J. P., et al.
Photon echoes in gases
Haas, F., et al.
Experimental investigations of the C12 (h,p)N14 reaction mechanism
Hannon, J. P., et al.
Mössbauer diffraction. II. Dynamical theory of Mössbauer optics
Hardy, J. C., et al.
Isobaric analog states and coulomb displacement energies in the
(1d5/2) shell
Hwang, I. K., et al.
Theory of photons in a fully ionized gas. I. Photon momentum
distribution
James, L. W., et al.
Transport properties of GaAs obtained from photoemission
measurements
Jung, M., et al.
Cross sections of fragments emitted in spallation reactions of carbon
and hydrogen with 90-MeV α particles
Klose, J. Z.
Transition probabilities and mean lives of the 3s2 laser level in neon I
Knight, R. E.
Correlation energies of some states of 3-10 electron atoms
Kruse, U. E., et al.
Production of Σ± multibody final states in 5.5-GeV/c K–p
interactions
Lindholm, D. A., et al.
Determination of anisotropic relaxation times in copper by cyclotron
resonance
Lipinski, H. M.
Regge trajectory of the pion in the NN Bethe-Salpeter equation
McDonald, P. F.
Ultrasonic paramagnetic resonance of U4+ in CaF2
Mandelstam, S.
Relativistic quark model based on the Veneziano representation. I.
Meson trajectories
Moyer, L., et al.
Theory of pion-nucleus scattering lengths
Mubayi, V., et al.
Phase transition in the two-dimensional Heisenberg ferromagnet
Narasimhamurty, T. S.
Photoelastic behavior of Rochelle salt
Narath, A.
Nuclear magnetic resonance in hexagonal lanthanum metal: Knight
shifts, spin relaxation rates, and quadrupole coupling constants
Nellis, W. J., et al.
Thermal conductivities and Lorenz functions of gadolinium, terbium,
and holmium single crystals
Oosterhuis, W. T., et al.
Mössbauer effect in K3Fe(CN)6
Parker, E. H. C., et al.
Diffusion of Kr isotopes in solid Ar
Payne, G.
Compact operators and particle-bound state scattering: One channel
case
Pešić, S. S., et al.
Transport properties of negative ions in their parent gases
Polkinghorne, J. C.
Renormalization of Regge trajectories and singularity structure in
Kikkawa-Sakita-Virasoro-type theories
Rowe, V. A., et al.
Nernst effect and flux flow in superconductors. III. Films of tin and
indium
Schumacher, M.
Elastic scattering of 145-, 279-, 412-, and 662-keV γ rays from lead
Shugart, C. G., et al.
Isobaric-analog studies with 107Ag and 109Ag
Sommers, C. B., et al.
Relativistic band structure of gold
Spector, R. M.
Implications of direct- and crossed-channel Regge self-consistency
Stein, T. S., et al.
Permanent electric dipole moment of the cesium atom. An upper
limit to the electric dipole moment of the electron
Sweeney, W. E., Jr.
Angular correlations in the Li7(p,γ)Be8*(16.63MeV)-2α reaction
Thompson, R. L., et al.
Inelastic nuclear interactions of high-energy electrons and muons in
emulsion
van der Ziel, J. P., et al.
Optical emission spectrum of Cr3+-Eu3+ pairs in europium gallium
garnet
Ying, S. C., et al.
Spin-independent oscillations of a degenerate electron liquid
Zeller, M. E., et al.
K+-Meson branching-ratio measurement
Psychological Review, 1921
Bernard, L. L.
The misuse of instinct in the social sciences
Boodin, J. E.
Sensation, imagination and consciousness
Bronner, A. F.
Apperceptive abilities
Calkins, M. W.
The truly psychological behaviorism
Calkins, M. W.
Fact and inference in Raymond Wheeler's doctrine of will and
self-activity
English, H. B.
Dynamic psychology and the problem of motivation
Franz, S. I.
Cerebral-mental relations
Hull, C. L.
A device for determining coefficients of partial correlation
Kantor, J. R.
An attempt toward a naturalistic description of emotions (II)
Kantor, J. R.
Association as a fundamental process of objective psychology
Kantor, J. R.
How do we acquire our basic reactions?
Lundholm, H.
The affective tone of lines: Experimental researches
Melrose, J. A.
The structure of animal learning
Pepper, S. C.
The law of habituation
Russell, F. T.
A poet's portrayal of emotion
Schilling, W.
The effect of caffein and acetanilid on simple reaction time
Stratton, G. M.
The control of another person by obscure signs
Thorndike, E. L.
The correlation between interests and abilities in college courses
Warren, H. C.
Psychology and the central nervous system
Warren, H. C.
Some unusual visual after-effects
Wolfe, A. B.
The motivation of radicalism
Psychological Review, 1969
Arbib, M. A.
Memory limitations of stimulus-response models
Berger, R. J.
Oculomotor control: A possible function of REM sleep
Buchwald, A. M.
Effects of "right" and "wrong" on subsequent behavior: A new
interpretation
Capehart, J., et al.
A theory of stimulus equivalence
Catlin, J.
On the word-frequency effect
Chalmers, D. K.
Meanings, impressions, and attitudes: A model of the evaluation
process
Clark, H. H.
Linguistic processes in deductive reasoning
Henley, N. M., et al.
Goodness of figure and social structure
Herrnstein, R. J.
Method and theory in the study of avoidance
Kinchla, R. A., et al.
A theory of visual movement perception
Klinger, E., et al.
Fantasy need achievement and performance: A role analysis
Kornblum, S.
Sequential determinants of information processing in serial and
discrete choice reaction time
Krantz, D. H.
Threshold theories of signal detection
Landauer, T. K.
Reinforcement as consolidation
Lewis, D. J.
Sources of experimental amnesia
Paivio, A.
Mental imagery in associative learning and memory
Pfaff, D. W.
Parsimonious biological models of memory and reinforcement
Polson, P. G., et al.
Nonstationary performance before all-or-none learning
Seligman, M. E. P.
Control group and conditioning: A comment on operationism
Shimp, C. P.
Optimal behavior in free-operant experiments
Singer, G., et al.
Comment on roles of activation and inhibition in sex differences in
cognitive abilities
Treisman, A. M.
Strategies and models of selective attention
Trowill, J. A., et al.
An incentive model of rewarding brain stimulation
Tversky, A.
Intransitivity of preferences
Underwood, B. J.
Attributes of memory
Warren, R. M.
Visual intensity judgments: An empirical rule and a theory
APPENDIX B
LIST OF WORDS DEFINED AS "COMMON" WORDS IN THIS STUDY
The words listed in this appendix constitute the set defined for this study as "common"
words. The words remaining in each corpus, after deletion of these "common" words, were
considered to be the technical vocabulary of the corpus for the purposes of this study, as defined in
section 3.1.5.
There are 248 words in this list (plus a period-less variant of each of "e.g." and "i.e."). It is
essentially the union of two lists that have been developed through much experimentation and
practical experience. One list was kindly provided by Professor Gerard Salton of Cornell University;
it is the list of words he found to be nonsignificant in many scientific fields and used as a list of
words to be excluded from consideration as content indicators in the SMART system (Salton, 1968).
The other list was kindly provided by Dr. Melvin Weinstock of the Institute for Scientific
Information (ISI); it is the list of "full-stop words" used in the preparation of ISI's Permuterm Subject
Index of the Science Citation Index (Weinstock, 1970). The 1972 Science Citation Index Guide and Journal
Lists (1973) describes "full-stop words" as "terms that have no practical semantic value" and
therefore "are completely suppressed" when the Permuterm Subject Index is prepared.
A
ABOUT
ABOVE
ACCORDING
ACROSS
ACTUAL
AFTER
AGAIN
AGAINST
ALL
ALMOST
ALONG
ALREADY
ALSO
ALTHOUGH
ALWAYS
AMONG
AN
AND
ANOTHER
ANY
ARE
AROUND
AS
AT
AWAY
BE
BECAUSE
BEEN
BEFORE
BEHIND
BEING
BELOW
BEST
BETTER
BETWEEN
BEYOND
BOTH
BRIEF
BUT
BY
CAN
CANNOT
CERTAIN
CERTAINLY
CF
COMING
COMPLETELY
CONCERNING
CONSEQUENTLY
COULD
DID
DISCUSSION
DO
DOES
DOING
DONE
DOWN
DR
DUE
DURING
EACH
EG
E.G.
EIGHT
EITHER
ENOUGH
ESPECIALLY
EVER
FEW
FIFTH
FIRST
FIVE
FOR
FOUR
FROM
FURTHER
FURTHERMORE
GET
GIVE
GIVEN
GIVING
GO
GOING
GONE
HAD
HAS
HAVE
HAVING
HE
HER
HERE
HERSELF
HIM
HIMSELF
HIS
HOW
HOWEVER
I
IE
I.E.
IF
IMMEDIATE
IN
INSIDE
INSTEAD
INTO
IS
IT
ITEMS
ITS
ITSELF
LESS
LET
LIKE
LITTLE
LOOK
LOOKS
MADE
MAKE
MAKES
MAKING
MANY
MAY
ME
MORE
MOST
MUCH
MUST
MY
MYSELF
NAMELY
NEAR
NEARLY
NEVER
NEW
NEXT
NINE
NONE
NOT
NOW
OF
OFF
OFTEN
OLD
ON
ONCE
ONE
ONLY
ONTO
OR
OTHER
OUR
OUT
OUTSIDE
OVER
OVERALL
OVER-ALL
OWN
PARTICULAR
PER
PUT
REALLY
REGARDING
RELATIVELY
RESPECTIVELY
SAME
SECOND
SECONDLY
SEEN
SENSIBLE
SERIOUS
SEVEN
SEVERAL
SHALL
SHE
SHOULD
SHOWN
SINCE
SIX
SO
SOME
STILL
SUCH
TAKE
TAKEN
TAKES
TAKING
THAN
THAT
THE
THEIR
THEM
THEN
THERE
THEREFORE
THEREFROM
THESE
THEY
THIRD
THIS
THOSE
THREE
THROUGH
THROUGHOUT
THUS
TO
TOGETHER
TOO
TOWARD
TOWARDS
TWICE
TWO
UNDER
UNDERGOING
UNTIL
UP
UPON
UPWARD
VARIOUS
VERY
VIA
VIZ
WAS
WAY
WAYS
WE
WELL
WENT
WERE
WHAT
WHEN
WHERE
WHEREAS
WHICH
WHILE
WHO
WHOM
WHOEVER
WHOMEVER
WHOSE
WHY
WILL
WITH
WITHIN
WITHOUT
WOULD
YET
YOU
YOUR
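The deletion of "common" words described at the head of this appendix can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the tokenizing rule, the function names, and the short COMMON_WORDS excerpt are all assumptions made for the example. Periods are stripped before comparison, which corresponds to the period-less variants "EG" and "IE" carried in the list.

```python
import re

# Illustrative excerpt only (an assumption for this sketch);
# the full list in this appendix has 248 entries.
COMMON_WORDS = {"A", "AND", "BY", "EG", "IE", "IN", "OF",
                "OVER-ALL", "OVERALL", "THE", "TO"}

# Hypothetical tokenizing rule: runs of letters, optionally joined
# by internal periods or hyphens (e.g. "e.g", "semi-arid").
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:[.\-][A-Za-z]+)*")


def tokenize(text):
    """Return upper-case tokens with periods removed ("e.g." -> "EG")."""
    return [m.upper().replace(".", "") for m in TOKEN_PATTERN.findall(text)]


def technical_vocabulary(text, common_words=COMMON_WORDS):
    """Return the tokens that remain after the common words are deleted."""
    return [t for t in tokenize(text) if t not in common_words]


if __name__ == "__main__":
    sample = "The distribution of Douglas fir, e.g. in semi-arid regions"
    print(technical_vocabulary(sample))
```

Every remaining token is kept, in running order and with repetitions, so that the result can be passed directly to a frequency count.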
APPENDIX C
GRAPHS OF RANK-FREQUENCY PAIRS AND REGRESSION LINES
FOR THE CORPORA
This appendix contains graphs showing the observed rank-frequency pairs for each corpus,
with and without the common words. Each graph also shows, as a solid line, the line of best fit, in
the least-squares sense, to the observations. The graphs of the original texts of the samples further
show, as a dashed line, the line of slope -1 that passes through the midpoint of the regression line.
The graphs were made on the CalComp plotter of the Computation Center, University of
Texas at Austin.
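The construction of the two lines on each graph can be sketched as follows. This is an illustration under stated assumptions, not the program used for the study: the function names are hypothetical, tied frequencies are simply given consecutive ranks, logarithms are taken to base 10, and the "midpoint of the regression line" is taken to be the point on the fitted line halfway between the smallest and largest observed log-ranks.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs in order of decreasing frequency."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def fit_log_log(pairs):
    """Least-squares fit of log10(frequency) on log10(rank).

    Returns (slope, intercept) of the fitted line.
    """
    xs = [math.log10(rank) for rank, _ in pairs]
    ys = [math.log10(freq) for _, freq in pairs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def reference_intercept(pairs, slope, intercept):
    """Intercept b of the line y = -x + b through the assumed midpoint."""
    x_mid = (math.log10(pairs[0][0]) + math.log10(pairs[-1][0])) / 2
    y_mid = slope * x_mid + intercept
    return y_mid + x_mid


if __name__ == "__main__":
    pairs = rank_frequency("a b a c a b d a".split())
    slope, intercept = fit_log_log(pairs)
    print(pairs, slope, intercept, reference_intercept(pairs, slope, intercept))
```

For data that follow the rank-frequency relation exactly with slope -1, the fitted line and the reference line coincide; a departure between the solid and dashed lines therefore shows how far the fitted slope differs from -1.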
REFERENCES
Barber, C. L. 1962. "Characteristics of Modern Scientific Prose." In Contributions to English Syntax and
Philology, pp. 21-43. Edited by Frank Behre. Gothenburg Studies in English, no. 14.
Göteborg: University of Göteborg
Barnhart, Clarence L., Steinmetz, Sol, and Barnhart, Robert K. 1973. The Barnhart Dictionary of New
English since 1963. Bronxville, NY: Barnhart/Harper and Row
Belevitch, V. 1959. "On the Statistical Laws of Linguistic Distributions," Annales de la Société
Scientifique de Bruxelles 73:310-26
Bross, I. D. J., Shapiro, P. A., and Anderson, B. B. 1972. "How Information Is Carried in Scientific
Sub-Languages," Science 176:1303-7
Brotherton, M. 1964. Masers and Lasers. New York: McGraw-Hill
Campbell, David T. H., and Edmisten, Jane. "Characteristics of Professional Scientific Journals,
1962." Washington: Herner and Company
Carroll, John B. 1953. The Study of Language: A Survey of Linguistics and Related Disciplines in America.
Cambridge, MA: Harvard University Press
________. 1967. "On Sampling from a Lognormal Model of Word-Frequency Distribution." In
Computational Analysis of Present-Day American English, pp. 406-24. By Henry Kučera and W.
Nelson Francis. Providence, RI: Brown University Press
Chase, Janet M. 1970. "Normative Criteria for Scientific Publication," The American Sociologist
5:262-65
Dixon, Wilfrid J., and Massey, Frank J., Jr. 1969. Introduction to Statistical Analysis, 3rd ed. New York:
McGraw-Hill
Draper, N. R., and Smith, H. 1966. Applied Regression Analysis. New York: Wiley
Edmundson, H. P. 1972. "The Rank Hypothesis: A Statistical Relation between Rank and
Frequency." Technical Report TR-186. College Park, MD: Computer Science Center,
University of Maryland
Edmundson, H. P., Crook, D., and Tung, I. 1972. "Bibliography of Mathematical and
Computational Linguistics." Technical Report TR-189. College Park, MD: Computer
Science Center, University of Maryland
Edmundson, H. P., Fostel, G., Tung, I., and Underwood, W. 1972a. "Approximation Formulas for
Vocabulary Size for the One-, Two-, and Three-Parameter Rank Distributions." Technical
Report TR-188. College Park, MD: Computer Science Center, University of Maryland
Edmundson, H. P., Fostel, G., Tung, I., and Underwood, W. 1972b. "Methods of Computing
Vocabulary Size for the Two-Parameter Rank Distribution." Technical Report TR-187.
College Park, MD: Computer Science Center, University of Maryland
Fairthorne, Robert A. 1969. "Empirical Hyperbolic Distributions (Bradford-Zipf-Mandelbrot) for
Bibliometric Description and Prediction," Journal of Documentation 25:319-43
Good, Irving John. 1957. "Distribution of Word Frequencies," Nature 179:595
________. 1969. "Statistics of Language: Introduction." In Encyclopædia of Linguistics, Information and
Control. Edited by A. R. Meetham. Oxford: Pergamon Press
Guiraud, Pierre. 1954a. Bibliographie critique de la statistique linguistique. Revised and completed by
Thomas D. Houchin, Jaan Puhvel, and Calvert W. Watkins, under the direction of Joshua
Whatmough. Utrecht: Editions Spectrum
________. 1954b. Les caractères statistiques du vocabulaire. Paris: Presses Universitaires de France
________. 1959. Problèmes et Méthodes de la statistique linguistique. Dordrecht, Netherlands: D. Reidel
Hagstrom, Warren O. 1965. The Scientific Community. New York: Basic Books
Herdan, Gustav. 1956. Language as Choice and Chance. Groningen: P. Noordhoff
________. 1960. Type-Token Mathematics. The Hague: Mouton
________. 1962. The Calculus of Linguistic Observations. The Hague: Mouton
________. 1964. Quantitative Linguistics. London: Butterworth
________. 1966. The Advanced Theory of Language as Choice and Chance. Berlin: Springer-Verlag
Hockett, Charles F. 1958. A Course in Modern Linguistics. New York: Macmillan
Hogben, David, Peavy, Sally T., and Varner, Ruth N. 1971. OMNITAB II User's Reference Manual.
National Bureau of Standards Technical Note 552. Washington: National Bureau of
Standards
Irwin, J. O. 1963. "The Place of Mathematics in Medical and Biological Statistics," Journal of the Royal
Statistical Society, part 1 126:1-41
Joos, Martin. 1936. Review of The Psycho-Biology of Language by George K. Zipf, Language 12:196-210
Kendall, Maurice G. 1960. "The Bibliography of Operational Research," Operational Research Quarterly
11:31-36
Klimeš, Lumír. 1972. "An Attempt at a Quantitative Analysis of Social Dialects." In Prague Studies in
Mathematical Linguistics 4, pp. 77-93. Edited by Ján Horecký, Petr Sgall, and Marie
Těšitelová. University, AL: University of Alabama Press
Kučera, Henry, and Francis, W. Nelson. 1967. Computational Analysis of Present-Day American English.
Providence, RI: Brown University Press
Mandelbrot, Benoît. 1951a. "Adaptation du message à la ligne de transmission: I. Quanta
d'information," Comptes Rendus de l'Académie des Sciences 232:1638-40
________. 1951b. "Adaptation du message à la ligne de transmission: II. Interprétations physiques,"
Comptes Rendus de l'Académie des Sciences 232:2003-5
________. 1953a. "An Informational Theory of the Statistical Structure of Language." In
Communication Theory: Papers Read at a Symposium on "Applications of Communication Theory," pp.
486-502. Edited by Willis Jackson. London: Butterworth. [Also known as "Second
London Symposium on Information Theory."]
________. 1953b. "Contributions à la théorie mathématique des jeux de communication."
Publications de l'Institut de Statistique de l'Université de Paris, vol. 2, fascicules 1 et 2
________. 1954a. "Simple Games of Strategy Occurring in Communication through Natural
Languages." In "Symposium on Statistical Methods in Communication Engineering,"
Transactions of the IRE Professional Group on Information Theory, (PGIT-3), pp. 124-37. [Also
known as "First London Symposium on Information Theory."]
________. 1954b. "Structure formelle des textes et communication," Word 10:1-27, 424-25
________. 1956. "On the Language of Taxonomy: An Outline of a 'Thermostatistical' Theory of
Systems of Categories with Willis (Natural) Structure." In Information Theory: Proceedings of the
Third London Symposium on Information Theory. Edited by Colin Cherry. London: Butterworth
________. 1957a. "Linguistique statistique macroscopique." In Logique, langage, et théorie de
l'information, pp. 1-78. By L. Apostel, B. Mandelbrot, and A. Morf. Volume III of Études
d'épistémologie génétique. Edited by J. Piaget. Paris: Presses Universitaires de France
________. 1957b. Linguistique statistique macroscopique: Théorie mathématique de la loi de Zipf. Paris:
Institut Henri Poincaré de l'Université
________. 1957c. "A Note on a Law of Berry and on Insistence Stress," Information and Control
1:76-81
________. 1957d. Théorie mathématique de la loi d'Estoup-Zipf. Paris: Institut de Statistique de
l'Université
________. 1959. "A Note on a Class of Skew Distribution Functions: Analysis and Critique of a
Paper by H. A. Simon," Information and Control 2:90-99
________. 1961. "On the Theory of Word Frequencies and on Related Markovian Models of
Discourse." In Proceedings of the Twelfth Symposium in Applied Mathematics, pp. 190-219. Edited
by Roman Jakobson. Providence, RI: American Mathematical Society
________. 1966. "Information Theory and Psycholinguistics: A Theory of Word Frequencies." In
Readings in Mathematical Social Science, pp. 350-70. Edited by Paul F. Lazarsfeld and Neil W.
Henry. Chicago: Science Research Associates; reprint ed., Cambridge, MA: M.I.T. Press,
1968
Miller, George A. 1951. Language and Communication. New York: McGraw-Hill; reprint ed., 1963
________. 1954. "Communication," Annual Review of Psychology 5:401-20
Miller, George A., and Newman, Edwin B. 1958. "Tests of a Statistical Explanation of the
Rank-Frequency Relation for Words in Written English," American Journal of Psychology
71:209-18
Miller, George A., Newman, Edwin B., and Friedman, E. A. 1958. "Length-Frequency Statistics for
Written English," Information and Control 1:370-89
Mizutani Sizuo. 1973. "On the Relation between the Numbers of Running Words and Different
Words in Random Samples," ITL: Review of the Institute of Applied Linguistics, Louvain 19:1-14
Muller, Charles. 1969. "Lexical Distribution, Reconsidered: The Waring-Herdan Formula." In
Statistics and Style, pp. 42-56. Edited by Lubomír Doležel and Richard W. Bailey. New
York: American Elsevier. [Translation of "Du nouveau sur les distributions lexicales: La
formule de Waring-Herdan," Cahiers de Lexicologie 1:35-53 (1965).]
Nie, Norman, Bent, Dale H., and Hull, C. Hadlai. 1970. SPSS: Statistical Package for the Social Sciences.
New York: McGraw-Hill
Parker-Rhodes, A. F., and Joyce, T. 1956. "A Theory of Word-Frequency Distribution," Nature
178:1308
Parker-Rhodes, A. F., and Joyce, T. 1957. "Distribution of Word Frequencies," Nature 179:595-96
Plath, Warren. 1963. "Mathematical Linguistics." In Trends in European and American Linguistics, pp.
21-57. Edited by Christine Mohrmann, Alf Sommerfelt, and Joshua Whatmough. Utrecht:
Editions Spectrum
Pratt, Fletcher. 1939. Secret and Urgent. New York: Blue Ribbon Books
Price, Derek J. de Solla. 1970. "Citation Measures of Hard Science, Soft Science, Technology, and
Nonscience." In Communication Among Scientists and Engineers, pp. 4-22, 327. Edited by
Carnot E. Nelson and Donald K. Pollock. Lexington, MA: Heath
Salton, Gerard. 1968. Automatic Information Organization and Retrieval. New York: McGraw-Hill
1973. 1972 Science Citation Index Guide and Journal Lists. Philadelphia: Institute for Scientific
Information
Simon, Herbert A. 1955. "On a Class of Skew Distribution Functions," Biometrika 42:425-40.
[Reprinted with an introduction in Models of Man, by Herbert A. Simon. New York: Wiley,
1957.]
________. 1957. Models of Man. New York: Wiley
________. 1960. "Some Further Notes on a Class of Skew Distribution Functions," Information and
Control 3:80-88
Storer, Norman W. 1967. "The Hard Sciences and the Soft: Some Sociological Observations,"
Bulletin of the Medical Library Association 55:75-84
________. 1972a. "Comparative Study of Scientific Disciplines: Opportunities for Research." Paper
presented at the American Sociological Association Meetings, New Orleans, LA, 30
August 1972.
________. 1972b. "Relations Among Scientific Disciplines." In The Social Contexts of Research, pp.
229-68. Edited by Saad Z. Nagi and Ronald G. Corwin. New York: Wiley-Interscience
Twaddell, W. F. 1967. Foreword to Computational Analysis of Present-Day American English, by Henry
Kučera and W. Nelson Francis. Providence, RI: Brown University Press
Wallace, Everett M. 1965. "Rank Order Patterns of Common Words as Discriminators of Subject
Content in Scientific and Technical Prose." In Statistical Association Methods for Mechanized
Documentation, pp. 225-29. Edited by Mary Elizabeth Stevens, Vincent E. Giuliano, and
Laurence B. Heilprin. National Bureau of Standards Miscellaneous Publication 269.
Washington: National Bureau of Standards
1965. Webster's Seventh New Collegiate Dictionary. Springfield, MA: G. and C. Merriam
Weinstock, M., Fenichel, C., and Williams, M. V. V. 1970. "System Design Implications of the Title
Words of Scientific Journal Articles in the Permuterm Subject Index. I. Conformity to
Zipf's Law." Paper presented at the 7th Annual National Information Retrieval
Colloquium, Philadelphia, PA, 7-8 May 1970.
Williams, Carrington B. 1970. Style and Vocabulary: Numerical Studies. New York: Hafner
Yule, G. Udny. 1944. The Statistical Study of Literary Vocabulary. Cambridge: Cambridge University
Press
Zipf, George Kingsley. 1929. "Relative Frequency as a Determinant of Phonetic Change," Harvard
Studies in Classical Philology 40:1-95
________. 1932. Selected Studies of the Principle of Relative Frequency in Language. Cambridge, MA:
Harvard University Press
________. 1935. The Psycho-Biology of Language: An Introduction to Dynamic Philology. Boston: Houghton
Mifflin; reprint ed., Cambridge, MA: M.I.T. Press, 1965
________. 1949. Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology.
Cambridge, MA: Addison-Wesley; reprint ed., New York: Hafner, 1965
Zuckerman, Harriet, and Merton, Robert K. 1971. "Patterns of Evaluation in Science:
Institutionalisation, Structure and Functions of the Referee System," Minerva 9:66-100.
[Reprinted as "Institutionalized Patterns of Evaluation in Science," in The Sociology of Science:
Theoretical and Empirical Investigations, pp. 460-96. Edited by Norman W. Storer. Chicago:
University of Chicago Press, 1973.]