An experimental computer-stored, augmented catalog of professional literature
by RICHARD S. MARCUS, PETER KUGEL
and ROBERT L. KUSIK
Massachusetts Institute of Technology
Cambridge, Massachusetts
like instructional aids and a stemming algorithm, have
been developed to a greater degree than they have in
the cited efforts. On the other hand, because of a desire to get experiments under way as soon as possible,
some techniques, like an automated thesaurus and
sophisticated matching criteria, have been deferred until experience has been gained with initial experiments.
INTRODUCTION
This paper reports on progress in the development
and application of computer programs for storage
and retrieval operations associated with the Project
Intrex augmented catalog experiments. A general review of Project Intrex and details of other Intrex developments are given in companion papers.1-3
The environment for which these programs are currently being developed includes the following features:
computer system: time-sharing computer utility
with satellite computer links emphasizing man-machine interaction via typewriter and display
consoles.
data base: an in-depth catalog to a moderate-sized
(10,000 documents), but growing, data base featuring free-vocabulary indexing and a variety of
document types.
experiments: experiments with researchers using the
system to satisfy real information needs to determine the relative values of system features.
While this mix of features is, perhaps, unique, there
have been, of course, many other efforts having several of these features as components. Our own work is,
in many ways, a blend of various techniques found in
one or more of these efforts. Among those systems having the greatest influence on the design of our system
are systems implemented at SDC,4 NASA (Bunker-Ramo),5 University of Pennsylvania,6 Harvard-Cornell,7 MIT Project TIP,8 Stanford,9 and Bolt, Beranek
and Newman.10
Because of the complicated nature of our catalog
and an emphasis on untutored users, some techniques
Creation of catalog data
Data base and literature selection
The literature base for the augmented catalog experiments
is drawn from the broad subject area of
materials science and engineering, thus providing an
interdisciplinary subject area within the scope of the
M.I.T. Engineering Library which is to serve as the
experimental laboratory for Intrex. Because the current literature for this entire field greatly exceeds the
10,000 document initial size of the first experimental
catalog, only literature in selected areas of materials
science and engineering is cataloged. These selected
areas reflect the research interests of particular groups
at M.I.T., thus assuring Intrex of a specific population
of experimental users. Five research groups have been
selected, two in physics and three in metallurgy.
The documents for this data base have been selected
primarily (so far) from the journal and conference proceedings literature after January 1, 1967. (Eventually,
we intend to include significant numbers of other
literature forms; e.g., books, reports, theses, memoranda, etc.) Each research group chooses the journals
of interest to it. Then members of each group select
articles relevant to their group from journal-issue
tables of contents.
From the collection of the Computer History Museum (www.computerhistory.org)
Spring Joint Computer Conference, 1969
The catalog and its fields
The augmented nature of the catalog is indicated by
the range of the 50 fields shown in Table I. (See Benenfeld11 for a more complete description of the nature
and preparation of the catalog.) Of course, only a fraction of the fields are applicable for the catalog record
of a particular document. A typical catalog record, as
typed for input to the computer, is shown in Figure 1.
It may be noted that some fields contain encoded information; e.g., in field 36 "e" stands for "English."
The most important field in terms of retrieval is,
perhaps, the subject index terms field. Our current
procedures call for generating index terms based upon
the text of a document. In general, terms are combinations of phrases. Each term is structured to provide
sufficient context so that the term may be understood
in its own right. Further, each term is given a "range"
number to reflect that proportion of a document devoted to discussing the represented concept (see
Field 73 of Figure 1).
Several features of this indexing may be noted. In
the first place, "free vocabulary" is used; that is, the
indexer may choose any words to make up the subject
terms and is not restricted to an "authority" list of
terms. In practice, the indexer primarily chooses terms
in the author's own words. Secondly, the meanings of,
and relationships among, the terms, which in some
other retrieval systems are given explicitly by a formal
system of "roles" and "links," are implicitly expressed
in the Intrex terms through the context given by these
relatively long terms.
The rationale behind using free vocabulary and
long "stand-alone" terms is one of ease of indexing
and flexibility. It is evidently easier to use the author's
own words to indicate the subject content than to have
to re-analyze the content in terms of a fixed authority
list and previously established set of roles and links.
Presumably automatic indexing could be more readily
developed on this basis. Similarly, there seems to be a
greater flexibility in using natural language, which
adapts naturally to changing conditions and conventions of meaning, rather than using authority lists,
whose organization is always tending to lag behind
current usage. Another feature of "stand-alone" terms
is that they can be displayed to a user and, hopefully,
give a more comprehensive understanding of the subject content than a list of keywords. Some experiments
indicating the value of free-vocabulary indexing have
recently been done by Shaw and Rothman.12
One disadvantage of the natural language approach
is the possible reduction in retrieval effectiveness due
to the diversity in ways of expressing the "same" subject. Means to circumvent this problem are discussed
//1/
4299
//2/
A24
//5/
7-Cl(}cD3
//20/
//30/
//69/
1e(4)
//33/
illus.
//36/
e
//37/
//65/
//66/
3
//31/
dd
//47/
pp. 1453-1458. IN S+4180S
//24/
An internally reflecting optical resonator with confocal
properties
//21/
Holshouser, D.F.
//22/
University of 'Illinois', 'Urbana'. Electrical Engineering
Dept.
//40/
'U. S.' Air Force Office of Scientific Research
*AF-49-(638)-556 ....AfOSR-62-250.
//68/
Contains diagram of geometry for confocal internal reflection
//70/
Optical resonators using spherical mirrors, e.g. confocal
systems, have been shown to have significant advantages
over configurations using planar mirrors. In particular,
diffraction losses can be much lower and alignment is less
critical. However, planar systems have had an advantage
heretofore in that coated mirrors could be replaced by
internally reflecting prisms, thereby eliminating the
problems associated with lossy or fragile coatings.
Also, undesired modes are reduced since rays not parallel
to the axis are not completely reflected. This paper
describes the configuration for an internally reflecting
surface which exhibits properties of a spherical mirror,
and presents experimental results obtained with a
semi-confocal maser using this configuration.
~ext, p. 14~
//73/
internally reflecting optical resonator with confocal
properties (1);
configuration for an internally reflecting surface which
exhibits properties of a spherical mirror
confocal system (1);
basic properties of a confocal resonator (2);
analytic expression for the internally reflecting surface
which satisfies confocal requirements (3);
Schott barium crown glass doped with neodymium (4);
fabrication of semi-confocal optical maser (3);
//3/
~7/2,
b
./1,
1/7
1"1;4
032968, 040168, 040268, 040468,
11:15-11:25; 9:26-9:29; 11:06-11:08; 11:35-11:50;
Figure 1-Sample catalog record
in the section on retrieval below. Another disadvantage
of the long, stand-alone terms is their redundancy,
which requires additional storage in the computer. On a
purely keyword basis the redundancy is about 66
percent; that is, for each document each unique word
(stem) type is used about three times in the subject
terms. Determining the usefulness of this redundancy
in aiding retrieval effectiveness is one object of our
experimental investigations.
A third feature of the indexing philosophy is the
extended depth of indexing. Not only are the major
subjects of the document indexed but so also are subjects covered to a lesser degree or mentioned only
briefly in the document. Of course, these minor subjects are given a lesser range number reflecting the
smaller portion of document devoted to them. A typical
journal article of five pages may be indexed by about
20 terms, each term averaging about five words. Assuming about 400 words per page, this represents an
index word to text word ratio of about 0.05.
The index terms (Field 73) make up about 30 percent of the bulk of the catalog records. The abstract
(Field 71) or the excerpts (Field 70)-usually only one
or the other is present-make up about 40 percent of
the catalog record for a given document. The other
fields comprise the remaining 30 percent.
Under moderate loading conditions-15 to 20 consoles
with moderate demand-typical response time (from
the end of a user statement to the beginning of the
typed response) is about five seconds. User programs
typically run in time slices of up to 2, 4, or 8 seconds.
Over the time period of the work reported in this
paper neither the text-access equipment 3 nor the Intrex
display console 2 was available. Therefore, the bulk of
the work reported concerns use of the IBM 2741 typewriter console with some use of the Computer Display
ARDS display console. Of course, planning has taken
this equipment into account in the retrieval programs
(see below).
The M.I.T. Engineering Library is being physically
reconstructed and expanded to provide an operational
environment in which regular library users can experiment with the facilities of Project Intrex. Of course, it
is also planned to use consoles in or near the laboratory
facilities of our user groups.
Inputting and editing of catalog data
Computer system and library facilities
The Intrex Storage and Retrieval programs are presently operating in the environment of the M.I.T.-modified IBM 7094 Compatible Time-Sharing System
(CTSS).13 CTSS includes:
1. 32K (36 bit) word core for the system supervisor;
2. Another 32K core for the user working programs;
3. A high-speed drum for core images of user programs that have been "swapped out" awaiting
I/O or additional service from the CPU as
allocated through the service queue;
4. A low-speed drum for storage of directories to
users' files;
5. Two IBM 2301 disc files (40,000,000 words
each) serving as the primary storage medium
for program and data files;
6. An IBM 7750 communications interface for
servicing I/O needs to user consoles; and
7. Magnetic tape drives for auxiliary reading and
writing of large files.
Approximately 200 typewriter consoles (including
IBM Models 2741 and 1050, and Teletypewriter
Model 37) and several Computer Display ARDS display consoles14 are located on or near the campus. (Of
course, consoles can be located wherever telephone
lines exist.) Approximately 30 consoles can be connected to CTSS at any one time, although heavy demand by several consoles can reduce the number that
can be handled practically at any one time to about 20.
Friden 2303 Flexowriters are used to produce
machine-readable punched paper tapes of the catalog
records simultaneously with typed copies. A paper
tape file of 10 catalog records is read into the CTSS
IBM 7094 computer through a satellite PDP-7 computer in which Flexowriter codes are converted into
ASCII codes. (One of the reasons for choosing ASCII
for internal representation was to allow for upper and
lower case alphabetic characters.) The file is stored on
disc and also output on magnetic tape for printing on
an IBM 1403 line printer, equipped with an extended
character-set print chain. The printout is returned to
the catalogers who proofread and mark errors. Correction of errors in the computer-stored working file is
done by a typist at an IBM 2741 console using an
online context editing program. It takes about three
seconds of computer time per catalog record to perform
the editing process. On the basis of our present error
rate of 1.05 errors per catalog record, our error-correction cost is approximately twenty-five cents per entry.
This amount represents computer time only; the
typist's time (about 2.3 minutes) and the proofreader's
time must also be added to determine total error-correction cost.
An analysis of the economics of replacing paper
tape input with the use of online inputting showed
that in the present CTSS environment online inputting
would be much more expensive. Present inputting
costs-both for man and machine-run about $2.50
per catalog record whereas the estimated costs for
online inputting would be over $4.00 per record. The
cost differential is largely due to increased computer
processing time in the online mode. It is planned to
redo the analysis when the Intrex Console2 with its
buffer/controller satellite computer becomes available.
Computer files for retrieval
File organization (See Figure 2)
The files are organized on two levels and permit
searching on three levels. The first level of file structure
consists of the inverted files. An entry in the inverted
files is a list of references to catalog records (documents) associated with a single primary key (title or
subject word stem, or author's last name). The references contain not only document numbers but also
word specifiers and reference attributes. The word
specifiers establish the word position within the subject phrase, the position number of the subject phrase
in the subject field and the ending that has been taken
off the word to form the stem. The attributes presently include parameters such as: subject-term range
number; the initials of authors' given names (note that
only the author's last name is a primary key); and document
information including whether it is a whole work or
part of a larger work, whether it is a textbook or
review article, whether it has been refereed before
publication and whether it is of professional level
(rather than for lay consumption). A small inverted
file directory in core memory serves to localize the position of any list in the disc-stored inverted file.
The second level of file consists of the full catalog
records. These records are stored on the disc file in the
order created and are located by means of a catalog
directory, also located on the disc.
Thus, a first-level search may be made on author,
title, or subject terms used as the primary key. A
second-level search is then possible on the word specifiers
or document attributes used as secondary keys.
The third search level is a search through the catalog
records themselves. Thus, the speed with which a given
kind of information can be found is clearly dependent
upon the distribution of the information among the
various levels. The determination of the optimum distribution is therefore one of the main objectives of our
experimentation.
Figure 2-File organization
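The two file levels and three search levels described above can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not the CTSS implementation: the names `Reference`, `INVERTED`, `CATALOG`, and `search` are hypothetical, the sample documents are invented, and the way a range number is used as a filter is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    doc: int           # document number
    term: int          # position of the subject phrase in the subject field
    word: int          # position of the word within that phrase
    ending: str        # ending removed from the word to form the stem
    range_number: int  # attribute carried along with the reference


# First file level: inverted file, primary key (stem) -> list of references.
# All entries are invented sample data.
INVERTED = {
    "transit": [Reference(101, 1, 2, "ion", 1), Reference(102, 3, 1, "ions", 4)],
    "phas": [Reference(101, 1, 1, "e", 1)],
}

# Second file level: full catalog records, located by document number.
CATALOG = {
    101: {"title": "Phase transition in a sample solid", "language": "e"},
    102: {"title": "Transitions in a second sample", "language": "g"},
}


def search(stem, ranges=None, record_test=None):
    """Return sorted document numbers matching a stem.

    stem        -- primary key (first-level search)
    ranges      -- optional set of acceptable range numbers
                   (second-level search on a reference attribute)
    record_test -- optional predicate applied to the full catalog
                   record (third-level search)
    """
    refs = INVERTED.get(stem, [])
    if ranges is not None:
        refs = [r for r in refs if r.range_number in ranges]
    docs = sorted({r.doc for r in refs})
    if record_test is not None:
        docs = [d for d in docs if record_test(CATALOG[d])]
    return docs
```

Each additional level narrows the result but touches more data: the first two levels read only the inverted list, while the third must fetch every candidate catalog record, which is why the distribution of information among levels affects speed.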
Further details of the file organization and generation are given below, including observed time values
for some of the critical operations. It should be pointed
out that these values are a consequence of the particular software-hardware combinations we are presently
using in the CTSS system (often as somewhat inefficient expedients to getting a working system going)
and do not represent optimum values that are possible
for magnetic-disc hardware.
Formatting and extracting
The formatting and extracting program operates
on catalog records new to the system to produce three
major types of output: (1) an updated main catalog-record file is produced by adding the new catalog
records to the existing file; (2) index-term files are
produced which contain the subject, title, and author
terms and the document attributes used for updating
the inverted files; and (3) a catalog-directory file is
generated through which the retrieval system accesses the catalog records.
The catalog-record file is structured so that the
formatting information (which indicates where records,
and fields within records, begin and end) is separated
from content information. The formatting information
is contained in the header (Figure 3) which indicates
where a given record or field begins and ends. The rest
of the information is stored in the upper and lower
bodies. The upper body contains the information that
does not require a free format. Such information can be
both compressed to save space and preformatted to
simplify retrieval.
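The compression of fixed-format information into the upper body can be illustrated with a short Python sketch that packs the seven coded fields whose bit ranges appear in the legend of Figure 3 (DC, SC, LE, LA, ME, FO, PU) into one 36-bit word. The function names `pack` and `unpack` are hypothetical, and the sketch assumes that bit 0 is the leftmost (most significant) bit of the word; the paper does not state the numbering convention.

```python
WORD_BITS = 36

# Field name -> (first bit, last bit), taken from the Figure 3 legend.
# Assumption: bit 0 is the leftmost bit of the 36-bit word.
FIELDS = {
    "DC": (0, 5),    # Descriptive Cataloger
    "SC": (6, 11),   # Subject Cataloger
    "LE": (12, 14),  # Level of approach
    "LA": (15, 19),  # Language
    "ME": (20, 24),  # Medium
    "FO": (25, 30),  # Format
    "PU": (31, 35),  # Purpose
}


def pack(values):
    """Pack small integer codes into one 36-bit word (missing fields are 0)."""
    word = 0
    for name, (first, last) in FIELDS.items():
        width = last - first + 1
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << (WORD_BITS - 1 - last)
    return word


def unpack(word):
    """Recover the field codes from a packed 36-bit word."""
    result = {}
    for name, (first, last) in FIELDS.items():
        width = last - first + 1
        result[name] = (word >> (WORD_BITS - 1 - last)) & ((1 << width) - 1)
    return result
```

The seven widths (6 + 6 + 3 + 5 + 5 + 6 + 5) add up to exactly 36 bits, so the coded fields fill one computer word with no padding.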
The bulk of the information in a catalog record is
stored as straight text in the lower body. Characters
that serve to delimit fields and record entries are removed, as are formatting characters (carriage returns
and tabs). These format characters are reinserted later
by the output program to fit into the various line
widths of the output devices (2741 typewriter: 120
characters; ARDS display: 78 characters; Intrex console: 56 characters). The average formatted catalog
record requires approximately 600 computer words for
storage whereas the average preformatted record as
input by the catalogers takes up about 700 computer
words (4 characters stored per 36-bit word).

Layout labels: HEADER, UPPER BODY, LOWER BODY, FENCE, DATA
RN: Record Number (Bits 0-20)
OD: Online Date (Bits 21-35)
DC: Descriptive Cataloger (Bits 0-5)
SC: Subject Cataloger (Bits 6-11)
LE: Level of approach (Bits 12-14)
LA: Language (Bits 15-19)
ME: Medium (Bits 20-24)
FO: Format (Bits 25-30)
PU: Purpose (Bits 31-35)
MN: Method Number (Bits 21-35)
CP: Continuation Pointer
SH: Size of Header (in computer words)
FN(i): Field number of i-th field (address portion)
NI: Note Indicator (tag portion)
BP(i): Byte pointer to bottom of field i (decrement portion)
Figure 3-The catalog record format
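Since carriage returns and tabs are stripped before storage and reinserted at output time, the same stored text can be fitted to each device. A minimal Python sketch of that output step follows; it uses the standard `textwrap` module as a stand-in for the actual output program, and the names `LINE_WIDTHS` and `render` are hypothetical.

```python
import textwrap

# Line widths quoted in the text for the three output devices.
LINE_WIDTHS = {
    "2741 typewriter": 120,
    "ARDS display": 78,
    "Intrex console": 56,
}


def render(text, device):
    """Break stored straight text into lines that fit the given device."""
    return textwrap.wrap(text, width=LINE_WIDTHS[device], break_on_hyphens=False)


stored = (
    "Optical resonators using spherical mirrors have been shown to have "
    "significant advantages over configurations using planar mirrors."
)
console_lines = render(stored, "Intrex console")
```

Because line breaks are chosen only at output time, nothing about the device has to be stored with the record.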
Table I-Information fields in catalog
I. CATALOG CONTROL FIELDS
1. Document Number
2. Document Selection
3. Input Control
4. On-Line Date
5. Microfiche Location
II. PHYSICAL DOCUMENT CONTROL FIELDS
10. L. C. Card Number
11. Library Location
12. Serial Holdings
III. DESCRIPTIVE CATALOGING FIELDS
20. Main Entry Pointer
21. Personal Names
22. Personal Affiliations
23. Corporate Names
24. Title
25. Coden Title
26. Edition Statement
27. Publisher
28. Place of Publication
29. Dates of Publication
30. Medium
31. Format
32. Pagination
33. Illustrations
34. Dimensions
35. Serial Frequency
36. Language of Document
37. Language of Abstract
38. Series Statement
39. Report/Patent Numbers
40. Contract Statement
41. Supplement Referral
42. Errata
43. Thesis
44. Variants
45. Titles of Variants
46. Article Receipt Date
47. Analytical Citation
48. Abstract Services
49. Cost-Text Access
50. Commercial Cost
IV. SUBJECT CONTENT FIELDS
65. Author's Purpose
66. Level of Approach
67. Table of Contents
68. Special Features
69. Bibliography
70. Excerpts
71. Abstracts
72. Reviews
73. Subject Indexing
V. ARTICLE CITATION FIELD
80. References/Citations
VI. USER FEEDBACK FIELD
85. User Comments
The formatting and extracting program requires approximately two to three seconds to process a single
catalog record.
Phrase decomposition and stemming
The subject and title terms are broken down into
individual words and these words are stemmed by
dropping off endings.
A two-phase stemming algorithm has been developed.15
In the first phase, the longest possible ending
from a list of about 280 endings is dropped from the
word. Before an ending is dropped it must satisfy a
context rule. (For example, do not drop s after s.) The
second phase of the algorithm includes transformational
rules to account for certain spelling anomalies in
English (for example, absorb/absorp-tion; split/split-t-ing).
Because the transformational rules of the second
phase involve various complexities in actually performing the stemming and keeping account of it in the
inverted files, and because the number of additional
cases it handled seemed relatively small, it was decided to try out only the first-phase procedures in
initial Intrex systems. A list of endings and context
rules for applying them are contained in the reference
document.15
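The first-phase procedure (longest matching ending, subject to a context rule) can be sketched in Python as follows. This is a simplified illustration, not the algorithm of the reference document: `ENDINGS` is a small sample drawn from Table II rather than the full list of about 280 endings, the only context rule taken from the text is "do not drop s after s", and both the minimum stem length and the fallback to the next-longest ending when a rule fails are assumptions of this sketch.

```python
# A small sample of endings from Table II (the full list has about 280).
ENDINGS = ["ization", "ations", "ation", "ions", "ion", "ing",
           "ed", "es", "s", "e"]

MIN_STEM_LENGTH = 2  # assumption of this sketch, not stated in the text


def context_ok(stem, ending):
    """Return True if the ending may be dropped from this stem."""
    if len(stem) < MIN_STEM_LENGTH:
        return False
    if ending == "s" and stem.endswith("s"):
        return False  # the rule quoted in the text: do not drop s after s
    return True


def stem_word(word):
    """Return (stem, ending) using the longest acceptable ending."""
    lower = word.lower()
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if lower.endswith(ending):
            stem = lower[: -len(ending)]
            if context_ok(stem, ending):
                return stem, ending
    return lower, ""
```

With this sample list, "transitions" splits into "transit" and "ions" and "phase" into "phas" and "e", while "glass" is left whole because the context rule blocks dropping the final "s".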
About 100,000 subject-term words from about 1000
catalog records have been stemmed according to this
algorithm. So far the results seem promising. Table II
gives a statistical compilation of the number of endings found for each ending type in one run of 2,382
words stemmed.
The output from the phrase decomposition and
stemming program, which takes about seven seconds
per document, is a set of "shreds": one for each subject
or title word and one for each full phrase not longer
than a certain number of words. The maximum number
of words for full phrases retained for the inverted files
is currently four. This covers only about 20 percent of
the terms in the catalog.
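The decomposition into shreds can be sketched in Python as below. The record layout (a dictionary with `doc`, `term`, `word`, `kind`, and `key` entries) and the function name `make_shreds` are hypothetical, chosen only to mirror the word specifiers mentioned earlier; stemming is left out to keep the example short.

```python
MAX_PHRASE_WORDS = 4  # current limit on full phrases kept for the inverted files


def make_shreds(doc, terms):
    """Produce one shred per word and one per sufficiently short full phrase."""
    shreds = []
    for term_no, term in enumerate(terms, start=1):
        words = term.split()
        for word_no, word in enumerate(words, start=1):
            shreds.append({"doc": doc, "term": term_no, "word": word_no,
                           "kind": "W", "key": word})
        if len(words) <= MAX_PHRASE_WORDS:
            # Word position 0 marks a full-phrase shred in this sketch.
            shreds.append({"doc": doc, "term": term_no, "word": 0,
                           "kind": "P", "key": term})
    return shreds


sample = make_shreds(4299, ["confocal system",
                            "basic properties of a confocal resonator"])
```

Here the two-word term yields two word shreds plus a phrase shred, while the six-word term yields word shreds only, which is consistent with the observation that only a minority of terms are short enough to be retained as full phrases.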
Sorting, common word culling, and merging
Sorting and merging are accomplished using a generalized sort-merge package developed by the staff of
the M.I.T. Technical Information Program.8 This
package features manipulation of variable length
records with a variable number of variable length keys.
The first operation involves an alphabetic sort of the
shreds with the word stem or author last name as primary key and the word ending code or author's initials
string as a secondary key. The secondary key is used to
facilitate structuring the inverted files (see below). The
distinction between upper and lower case alphabetic
characters is suppressed during sorting. This operation
takes about six seconds per document or about 80
msec. per index-word shred (which averages about ten
computer words in length at this stage of processing).
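The two-key, case-suppressed sort can be expressed compactly in Python. The tuple layout (stem, ending, document number) and the name `sort_shreds` are assumptions made for this sketch; the actual package handled variable-length records and keys.

```python
def sort_key(shred):
    """Primary key: stem; secondary key: ending; case is suppressed."""
    stem, ending, _doc = shred
    return (stem.lower(), ending.lower())


def sort_shreds(shreds):
    """Return shreds ordered by stem, then by ending, ignoring case."""
    return sorted(shreds, key=sort_key)


batch = [
    ("Transit", "ions", 3430),
    ("phas", "e", 2851),
    ("transit", "ion", 2851),
    ("Phas", "e", 3430),
]
ordered = sort_shreds(batch)
```

Sorting on the ending as a secondary key places all references for one stem together and, within a stem, groups them by ending, which is the grouping the inverted-file lists need.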
The second operation culls out the 13 most common
function words. These are listed in Table III in order of
frequency. The culling operation, which takes about
Table II-Endings found in stemming 2,382 words
Endings:
arization, entations, ableness, entation, ability, ational, ibility, ization,
ations, encies, ential, istics, acity, aries, arity, ately, ating, ation,
ative, ators, atory, ement, ening, ental, ially, icity, ional, istic, ities,
ivity, able, ally, ance, ants, ated, atic, ator, ence, ency, ents, eous, ible,
ical, ions, ious, ized, less, ness, ogen, wise, ying, age, als, ant, ary, ate,
ely, ene, ent, ial, ian, ics, ied, ier, ies, ine, ing, ion, ism, ity, ive,
one, ons, ora, ous, 's, al, ed, en, es, ia, ic, is, ly, on, or, um, a, e, o,
s, y
Occurrences (in source order; alignment with individual endings not recoverable):
9, 2, 1, 3, 2, 5, 16, 9, 20, 3, 1, 4, 9, 7, 5, 2, 10, 1, 4,
23, 20, 3, 3, 1, 25, 105, 3, 2, 1, 11, 3, 2, 1, 6, 15, 1, 1, 30, 8, 8, 10,
4, 3, 16, 14, 83, 215, 4, 30, 5, 1, 3, 7, 10, 1, 3, 4,
24, 9..1', 27, 8, 9, 1, 24, 16, 12, 2, 7, 31, 15, 30, 2, 4, 62, 40,
57, 106, 34, 148, 1, 127, 12, 2, 11, 11, 30, 25, 449, 23, 1, 16, 11,
118, 85, 3
Table III-Common words excluded from index lists
1. of
2. in
3. the
4. for
5. a
6. on
7. to
8. at
9. with
10. and
11. as
12. by
13. from
range number). Generation of the author inverted files,
which contain about two author names per document,
takes about 50 msec. per document.
Listings can be made, for analysis purposes, of all
(or sections of) these inverted files and they may contain the full references or just counts of the number of
references and documents. See Figure 5 for an excerpt
from a full listing. A full listing requires about three
seconds of 7094 CPU time per document plus some
offline 1401 time.
Retrieval procedures
An initial version of the system (see Figure 6 for a
one second per document, reduces the size of the files
by about 20 percent.
The third operation merges the batch of sorted
shreds from the latest operation with the cumulative
batch of sorted shreds from previous runs. Merging
takes about 0.6 seconds per document in the total data
base.
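The culling and merging operations can be sketched together in Python. The shred layout (stem, ending, document number) and the function names `cull` and `merge_batches` are hypothetical; `heapq.merge` from the standard library stands in for the merge phase of the sort-merge package and assumes both inputs are already sorted on the same key.

```python
import heapq

# The 13 common function words of Table III.
COMMON_WORDS = {"of", "in", "the", "for", "a", "on", "to", "at",
                "with", "and", "as", "by", "from"}


def sort_key(shred):
    """Stem as primary key, ending as secondary key, case suppressed."""
    stem, ending, _doc = shred
    return (stem.lower(), ending.lower())


def cull(shreds):
    """Drop shreds whose key word is one of the common function words."""
    return [s for s in shreds if s[0].lower() not in COMMON_WORDS]


def merge_batches(cumulative, latest):
    """Merge two batches that are each already sorted by sort_key."""
    return list(heapq.merge(cumulative, latest, key=sort_key))


cumulative = [("phas", "e", 2851), ("transit", "ion", 2851)]
latest = sorted(
    cull([("of", "", 3430), ("solid", "", 3430), ("phas", "e", 3430)]),
    key=sort_key,
)
merged = merge_batches(cumulative, latest)
```

Because the merge walks the whole cumulative batch, its cost grows with the size of the data base rather than with the size of the new batch, which matches the per-document-in-total-data-base timing quoted above.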
Inverted file generation and listing
The merged shreds are processed into the inverted
file structure shown in Figure 4. This operation takes
about 1.5 seconds per document for the combined
subject/title inverted file (title words and terms are
distinguished from subject words and terms by a unique

List header:
BWL: Number of blanks at end of list (for expansion)
CWL: Total number of computer words on list
RFL: Number of references
DCL: Number of distinct documents among references
Term header (term stem):
BYN: Number of bytes in term stem (here 6)
EWN: Number of words in term (here 1)
EDS: Number of different endings (here N)
CAP: Bit to indicate variable capitalization
Header for each ending (END CODE 1, e.g., "ic"; END CODE 2, e.g., "ics"; END CODE N, e.g., "s"):
REF1: Number of references for the word "magnetic"
DCE1: Number of distinct documents referring to "magnetic"
Reference-word format: W/P | WN | TN | EN | WT | W | J | O | P | DN
W/P: Is term a single word (W) or full phrase (P)?
WN: Word number within phrase (for W/P = W)
TN: The term number of this term for given document
EN: Word ending number for this reference (from 1 to N)
WT: The subject/title weight (level)
W: Is document whole work?
J: Is document journal article?
O: Does document reflect original work?
P: Is document written for professional?
DN: Document number
Figure 4-Format for subject-term list

[Listing residue: entries for the stems "correct" (correct, correction, corrections), "correl" (correlate, correlation, correlations) and "cor" (core, cores), and for the phrase "core steel", each with its number of references and reference entries]
Figure 5-Inverted file listing (excerpt)
[Block diagram with GENERATION and RETRIEVAL sections]
Figure 6-Intrex storage and retrieval system
diagram of the storage and retrieval programs) for the
interactive interrogation of the catalog from remote
consoles has been implemented. The system, termed the
Prototype System, has been used, in conjunction with
a data base including about 1000 documents, to begin
experiments with users as described in a later section.
1S   TST7X5: READY. USERS = 16, MAX = 47.
1U   login m5806 marcus
2S   W 1315.7
     Password
2U
3S   STANDBY LINE HAS BEEN ASSIGNED
     M5806 160 LOGGED IN 08/19/68 1315.9 FROM 800277
     LAST LOGOUT WAS 08/15/68 944.4 FROM 800277
     HOME FILE DIRECTORY IS M5806 CMFL01
     DUE TO HARDWARE DIFFICULTIES, CTSS OPERATION MAY BE
     IRREGULAR.
     CTSS BEING USED IS TST7X5
     R 6.166+1.016
3U   resume intrex
4S   W 1316.8
     Greetings. This is Intrex 1a. Please sign in by typing your name and
     address as in the following example:
     smith, r i/mit 13-5251
     Note that your sign-in statement should end with a carriage return.
     READY
4U   marcus, r s/mit 35-406
5S   If you already know how to use Intrex, you may go ahead and type
     in commands. (Remember, each command ends in a carriage return.)
     Otherwise, for information on how to make simple queries of the
     catalog, type
     info 2
     or, to see the Table of Contents (Part 1) of Intrex-1 Guide which will
     direct you to other parts of the Guide explaining how to make more detailed
     queries, type
     info 1
     READY
5U   info 2
6S   Part 2 of Intrex 1A Guide:
     Simple Queries
     To find documents in the system specify your query by subject, author,
     and/or title terms, as shown in the 3 examples below:
     subject ferroelectric transitions
     author Hess, G. B./subject helium
     title sulfurization/author Swisher
     In order to specify additional restrictions (e.g., where author comes
     from, journal, word variations not to use, etc.), see Part 9 of the Guide
     (or type info 9). For other than standard output (document numbers, titles,
     and authors) see Part 8. For general Intrex command format and abbreviations
     see Parts 6.1 and 6.2.
     To see Table of Contents for Intrex 1a Guide and how to use
     the Guide on line, type
     info 1
     Otherwise, you may make simple queries or use any other command.
     READY
6U   subject solid phase transitions
7S   A search on your query SUBJECT solid phas-e transit-ions found 7 documents.
     To output the catalog fields DOCUMENT NUMBER, TITLE, AUTHOR on those
     documents type
     go
     This output will take about 15 seconds per document. You may terminate
     this output at any time by hitting the ATTN key ONCE. Otherwise, you may
     change your output request. For information see Part 8 of Guide or type
     info 8
     or change your field restriction (see Part 9.5) or make another request of
     Intrex (see Part 1)
     READY
7U   go
8S   1. DOCUMENT NUMBER 2851
     (21) AUTHOR
     Hoshino, Sadao;
     Shimaoka, Kohji (JA);
     Niimura, Nobuo (JA)
     (24) TITLE
     Ferroelectricity in solid hydrogen halides
     2. DOCUMENT NUMBER 3430
     (21) AUTHOR
     Sihvonen, Y. T.
     (24) TITLE
     Photoluminescence, photocurrent, and phase-transition correlations
     3. DOCUMENT NUMBER 3174
     7. DOCUMENT NUMBER 1690
     (21) AUTHOR
     Willens, R. H.;
     Buehler, E. (JA)
     Matthias, B. T. (JA)
     (24) TITLE
     Superconductivity of the transition-metal carbides
     Output completed. Total of 7 documents found. You may now see
     additional output on these documents by making a new 'output' request (for
     information on how to do this, see Part 8 of the guide or type info 8).
     You may also select a portion of these documents by making a new 'infield'
     request (see Part 9.5). Otherwise, you may make a new search (see Part
     2) or make other requests (see Part 1).
     READY
8U   output affiliation matchsub relevance/go
9S   1. DOCUMENT NUMBER 2851; RELEVANCE 3/3
     (22) AFFILIATION
     University of >Tokyo<. Institute for Solid State Physics;
     University of >Tokyo<. Institute for Solid State Physics;
     University of >Tokyo<. Institute for Solid State Physics
     (74) MATCHSUB
     phase transition at low temperature in solid hydrogen halides (0).
     2. DOCUMENT NUMBER 3430; RELEVANCE 2/3
     (22) AFFILIATION
     >Texas< Instruments, >Dallas<
     (74) MATCHSUB
     (TITLE)
     3. DOCUMENT NUMBER 3174; RELEVANCE 2/3
     (22) AFFILIATION
     Sandia Laboratory, >Albuquerque<, >N. M.<
     (74) MATCHSUB
     second-order phase transition in perovskites (3);
     first-order phase transition (0);
     6. DOCUMENT NUMBER 1715; RELEVANCE 2/3, 2/3
     7. DOCUMENT NUMBER 1690; RELEVANCE 2/3, 2/3
     READY
9U   infield affiliation harvard/o 71/go
10S  1. DOCUMENT NUMBER 3174
     (71) ABSTRACT
     The high-temperature series expansion of the zero-field magnetic susceptibility
     *chi*/*chi**sub Curie* = 1 + *SIGMA*-sub 1 l*sup *infinity**a*lubl
     is related to the diagrammatic representation of the corresponding
     expansion of the zero-field static spin correlation function INT. 1
     Does the cri
     READY
10U  quit
11S  Thank you for using Intrex.
     R 109.583+20.800
Figure 7-Sample demonstration system dialog

Description of prototype system
The Prototype System permits the user to search the
data base for documents by specifying subjects, authors and/or titles. The user may then make a selection
among the documents retrieved by requesting that
their catalog records contain specified information in
certain fields. Finally, he may request that the information contained in any or all of the catalog fields be
printed out.
In order to illustrate the nature of the Prototype
System more concretely, a sample user-system dialog
is given in Figure 7. The dialog has been retyped in
shortened line-width form from the typewritten copy
prepared on an IBM 2741 teletypewriter console attached to the time-sharing system (CTSS). For illustrative purposes each user statement is flagged by a
number and the letter U in the left margin. Similarly,
system messages are flagged by numbers and the
letter S.
In his first two statements the user has logged-in
to the time-sharing system (CTSS) which hosts the
Prototype System as well as many other computer
programs. Note that the user's second statement, his
password to CTSS, is not printed because the 2741 is
set to the nonprinting mode by CTSS to protect the
security of the password. This log-in procedure will be
unnecessary for individual users situated at consoles
dedicated to Intrex use.
The third user statement "resumes" the "Intrex"
system (at this point the user could have called for any
other program currently resident in CTSS) and in the
fourth statement the user "signs in" to Intrex by
typing his name and address. The "sign-in" statement,
in conjunction with the monitoring procedures (see
below), provides us, as system analysts, with a means
for keeping track of system use. It also serves to introduce the user to certain system procedures. For example, the user is apprised of the fact that his statement must be terminated with a carriage return. (Note
that a more natural statement-terminator button or
switch will be possible with the Intrex display console.)
While, at present, the sign-in statement merely serves
to aid in monitoring system use, we anticipate future
system developments whereby some history of past
users is kept so that, when someone signs in, the system
can take account of his previous experience to help
direct him.
Intrex response to the user's sign-in statement,
message 5S, is illustrative of several features of the
prompting and instructional techniques employed by
the system. In the first place, the user is told of the
various alternative actions that he may take at any
given time. Secondly, the specific form of the statement he should type to invoke one of these actions is
explicitly stated, where possible. Thirdly, where it is
not possible to explain the alternatives completely, the
user is referred to a Guide for further details.
The Guide is available both in hard-copy form and
online. Selected pages of the Guide are shown in
Figure 8. The user may request that a section of the
Guide be printed online by using the INFO command
(see user statement 5U and system response 6S). The
Guide also attempts to use the techniques of presentation of alternatives, example, and reference to more
detailed information. The sections of the Guide are
sized for convenient printing and viewing online.
Part 1 of Intrex 1A Guide: Table of Contents
To have a part of the Guide printed out on line use the
"info" command. For example, for information on making
simple queries (i.e., to print out part 2), type
     info 2
PART    CONTENTS
1       Table of Contents
2       Simple Queries
3       General Remarks - How to Get Guide (printed copy)
4       Log-In to CTSS and Call Intrex
5       Typing Errors - How to Correct
6.1     Commands, Modes (LONG, SHORT), Time Checks
6.2     Command Names and Abbreviations
7       Preliminary Output
8       Final Output
9       Generalized Queries
10      Scanning Index Terms
11      Interrupting System Messages
12.1    Text Access
12.2    Library Services
13      User Comments and Questions
14      Documents in the Collection
15      The Catalog and Its Fields
16      Sample Catalog Record
17      Exit from the System
This online guide INTREX 1A as of 2? JUL 68 was last revised on 7/2?/68.
Part 2 of Intrex 1A Guide: Simple Queries
To find documents in the system specify your query by
subject, author, and/or title terms, as shown in the 3
examples below:
     subject ferroelectric transitions
     author Hess, G.B./subject helium
     title sulfurization/author Swisher
In order to specify additional restrictions (e.g.,
where author comes from, journal, word variations not to
use, etc.), see Part 9 of the Guide (or type info 9).
For other than standard output (document numbers, titles, and
authors) see Part 8. For general Intrex command format and
abbreviations see Parts 6.1 and 6.2.
Figure 8-Sections from Intrex guide
The user's sixth statement initiates a search in the
inverted files for documents on a given subject. Searching may also be done on title or author terms or combinations of subject, title, and/or author terms. It may
be noted that the form of the user's statements is a
compromise between the precise, but esoteric and
complicated, form of many programming languages and
the familiar, but ambiguous (and, therefore, difficult to
interpret automatically), form of natural English. Command and argument names are simple and mnemonic.
Spring Joint Computer Conference, 1969
Format is kept simple with only three basic delimiters
required: spaces to separate arguments from each other
and from command names, slashes to separate commands, and a carriage return to terminate the
statement.
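The three-delimiter format just described can be sketched as follows. This is a minimal illustration in Python, not the Prototype System's own code; the function name `parse_statement` is hypothetical, and the sketch assumes that a slash never occurs inside an argument.

```python
def parse_statement(line):
    """Split one typed statement into (command, arguments) pairs.

    Delimiters follow the text: a carriage return ends the statement,
    slashes separate commands, spaces separate arguments.
    Assumption: no slash appears inside an argument.
    """
    commands = []
    # Drop the terminating carriage return (and any newline) first.
    for part in line.rstrip("\r\n").split("/"):
        words = part.split()
        if not words:
            continue  # ignore an empty segment such as a trailing slash
        commands.append((words[0].lower(), words[1:]))
    return commands


example = parse_statement("output affiliation matchsub relevance/go\r")
```

Here `example` holds two commands: `output` with three arguments, and `go` with none.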
In response to the user's search request, Intrex replies with a message (7S) illustrative of several other
features of system dialog. In the first place, the system
plays back its understanding of the user's statement.
Also the system indicates, by hyphenating word endings, how it has stemmed the words in the user's search
specification. This is important because Intrex matches
these word stems to word stems in the inverted file. As
a further indication of the progress of the retrieval
process, the number of matching documents found in
the inverted files is printed. Since the user made no
special output request in statement 6U, Intrex reports
the estimated time to output the standard catalog
fields. This system message, then, gives feedback which
may interact with the user's original intentions and
expectations and allow him to redirect his search.
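The hyphenated playback in message 7S can be mimicked with a short sketch. It is written in Python for illustration only. The stems are supplied by hand, since the stemming algorithm itself is not reproduced here, and the names `show_stem` and `playback` are hypothetical.

```python
def show_stem(word, stem):
    """Mark where a word was cut by hyphenating the removed ending."""
    if not word.startswith(stem):
        raise ValueError(f"{stem!r} is not a leading part of {word!r}")
    ending = word[len(stem):]
    # A word that was not shortened is shown unchanged.
    return f"{stem}-{ending}" if ending else word


def playback(field, stemmed_words, doc_count):
    """Build a 7S-style message from (word, stem) pairs."""
    shown = " ".join(show_stem(word, stem) for word, stem in stemmed_words)
    return (f"A search on your query {field.upper()} {shown} "
            f"found {doc_count} documents.")


message = playback(
    "subject",
    [("solid", "solid"), ("phase", "phas"), ("transitions", "transit")],
    7,
)
```

With the stems shown, `message` has the same form as the playback line in Figure 7.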
The points at which the system reports to the user
have been chosen in light of the operating characteristics
of the host CTSS time-sharing system. The intention is
to report back soon enough so that the user experiences quick response but not to report so often that
the user is forced into unnecessary additional responses
of his own. The incorporation of the buffer-controller
computer for the Intrex display consoles will improve
the operating characteristics of the time-sharing environment and may allow more frequent feedback with
less cost at the central-computer level.
In our sample dialog the user takes the first alternative (statement 7U) and the system responds with
the standard output (message 8S) for the matching
documents. The ellipses (...) in the figure indicate
where portions of the system response have been
deleted to reduce figure length.
At this point in the dialog, let us assume that the
user already knows how to make an output request or
that he refers to his hard-copy version of the Guide.
In any case, the user's eighth statement requests additional output information. Note that by appending
the GO command to the OUTPUT command, the user
signifies he is sufficiently sure of his statement not to
want Intrex to respond with its interpretation and
timing estimate but rather to print the requested
output immediately.
The system then responds (message 9S) as directed.
Note the special output information giving those subject terms that matched (MATCHSUB) and the estimated relevance of these terms. The relevance of a
subject term to a user query is currently estimated
simply by the ratio of the number of words in the
query which match words in the term to the total
number of words in the query.
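The ratio just described can be worked through in a few lines of Python. This is a sketch under stated assumptions: `toy_stem` is a crude stand-in written for this example and is not the Intrex stemming algorithm, and the function names are hypothetical.

```python
# Toy suffix list: an assumption for illustration, NOT the Intrex stemmer.
SUFFIXES = ("ions", "ion", "s")


def toy_stem(word):
    """Strip one suffix from a word, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def relevance(query, term):
    """Return (matching query words, total query words) for one subject term."""
    query_stems = [toy_stem(w) for w in query.split()]
    term_stems = {toy_stem(w) for w in term.split()}
    matched = sum(1 for stem in query_stems if stem in term_stems)
    return matched, len(query_stems)


full = relevance("solid phase transitions",
                 "phase transition at low temperature in solid hydrogen halides")
partial = relevance("solid phase transitions",
                    "second-order phase transition in perovskites")
```

Under these assumptions the first term matches all three query words (3/3) and the second matches two of them (2/3), the same form as the RELEVANCE values printed in Figure 7.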
In statement 9U the user is selecting, by means of
an INFIELD command, a subset of the seven documents which his original search found. This command
enables the user to request only those documents in
which a specific character string (here, "harvard")
appears in a particular catalog field (here, "affiliation").
Note that, at the same time, the user is changing his
output request and using abbreviation "o" for the
command name "output" and "71" for the field name
"abstract".
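The effect of INFIELD can be pictured as a filter over the documents already found. The Python sketch below is illustrative only: the record layout, the sample records, and the case-insensitive comparison are assumptions, not details of the Prototype System.

```python
def infield(records, field, text):
    """Keep only the records whose given field contains the character string.

    Assumption: matching ignores letter case, and a record that lacks
    the field never matches.
    """
    needle = text.lower()
    return [rec for rec in records if needle in rec.get(field, "").lower()]


# Hypothetical result list standing in for the documents found by a search.
found = [
    {"document": 1, "affiliation": "Harvard University, Cambridge, Mass."},
    {"document": 2, "affiliation": "Sandia Laboratory, Albuquerque, N. M."},
    {"document": 3},
]
subset = infield(found, "affiliation", "harvard")
```

Only the first record survives the restriction, so later output requests apply to that subset alone.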
In the system's response (message lOS) to the above
request, the user has availed himself of the interrupt
capability and halted the output at the point indicated
by the letters "INT. 1." The system then responds
with the READY message indicating the user may go
ahead with other requests. The interrupt capability is
a general facility allowing the user to cut short system
messages. After the user has become familiar with the
system he can reduce the verbosity of system messages
by entering the SHORT mode. He may do this at any
time during the dialog or even upon resuming the
system as shown in Figure 9.
resume intrex short
W 1355.1
Please sign in.
R
marcus r s/mit 35-406
R
s solid phase transitions/in affiliation harvard/o 22 abstract/go
1. 0 3174
(22) Lyman Laboratory of Physics Harvard University, >Cambridge<,
>Mass.<; / Lincoln Laboratory M. I. T., >Lexington<, >Mass.<
(71) The high-temperature series expansion of the zero-field magnetic
susceptibility, *chi*/*chi**sub Curie* = 1 + *SIGMA**sub 1 = 1*sup
*infinity**a*sub 1*( J/k T)*sup 1*, is related to the diagrammatic
representation of the corresponding hh-tiNT. 1
are then
R
s transint*'" tions/a hoshin%
21 74 75
S: transit-ions / A: hoshino found: 1 doc
secs/doc.
O:
21, rel, msub
R
ogo
Sorry, I can't understand you.
R
go
1. 0 2851
(21) Hoshino, Sadao;
Shimaoka, Kohji (JA);
Niimura, Nobuo (JA)
(24) Ferroelectricity in solid hydrogen halides
1 docs found
quit
Thank you for using Intrex.
R 36.050+ 10.433
Figure 9-Sample dialog in SHORT mode
Monitoring system use
Embedded in the retrieval system is a monitoring
system which records, on a disc file for later analysis,
all user commands and system responses as well as
certain timing information. In addition, a shared console remote from the user console may be employed to
monitor experiments in real time.
With the COMMENT command a user may input a
comment about the system, the cataloging, or the documents
in the collection. These comments are recorded
by the monitoring system. Comments about the catalog
may result in modifications to the catalog at some
subsequent update whereas comments about the documents may get entered into Field 85 (see Table I) of
the pertinent catalog records.
User experiments
The users
When the size of the data base reached about 1000
documents, experiments utilizing the Prototype Retrieval System were begun with users having a real need
for information. The first user was a second-year graduate student in physics (from the first of the research
groups mentioned in an earlier section) who was starting
a project to measure the magnetic susceptibility of
europium sulfide near the critical point. He had already
compiled a bibliography on this subject through conventional library techniques but was still seeking information on the light absorption properties of EuS to
properly set up his experimental equipment. Six additional users were taken from the ranks of the Intrex
catalogers by giving them a description of the student's
problem and asking them to serve as reference librarians
using the Intrex system.
Experimental environment
Users were seated at a 2741 console with no personal
instruction. They had previously been given a hardcopy version of the Guide (at least a day in advance) to
which they could refer during the retrieval session.
Other user aids (besides the system dialog as exemplified in Section 6) included messages pasted on the
console (e.g., "Don't forget the carriage return"); wall
charts (to remind the user how to perform common
functions); the NASA Thesaurus 16 (to suggest semantically related words for user search requests); and the
Inverted File listings (to suggest additional search
words as well as show document counts for index
terms). Users were given extensive debriefings (up to
an hour and a half) by systems analysts after their
console sessions (which lasted between about one-half
and one hour).
Results
Results of these first experiments are still under
analysis and, in any case, the small size of the sample
user population and still modest size of the data base
make it clear that these "results" can only suggest
future lines of investigation and cannot provide definitive conclusions. With these caveats in mind we make
the following tentative observations:
1. Users learned to operate the basic features of
the system fairly easily and found all or most of
the relevant documents rather quickly.
2. The users, in the hour or so of their acquaintance with the system, mastered few of the
sophisticated features of the system nor did
they really understand the nature of the matching algorithm.
3. Users without previous computer experience
tended to be awed by the computer which inhibited their trying, and learning, system
features.
4. The difficulties listed under (2) and (3), while
not hindering user retrieval seriously for the
sample problem, could adversely affect results on
other problems and could degrade our efforts to
determine the relative merits of various augmented-catalog features.
5. One possible solution to some of these problems
is to advance the user's understanding in stages
by starting with simplified guides or with personalized instruction.
6. Users dislike and are confused by command
mnemonics that are not single English words
(e.g., MATCHSUB, INFIELD).
Future plans
Enlarging the data base, improving system efficiency, improving user aids (see previous section), and
expanding user experiments are high on the list of
planned projects. Incorporation of the Intrex console
and text access systems into the retrieval system is also
an immediate prospect. The first full version incorporating all Intrex subsystems will find the Intrex
console operating in a "transparent" mode so that this
console will look like a standard CTSS console and
present retrieval programs can be used essentially unchanged. As experience is gained with this simple
configuration an attempt will be made to shift some
CTSS operations to the satellite computer associated
with the Intrex consoles.
As mentioned in the Introduction, a number of
features have been deferred in order to get the
Prototype System into operation as soon as possible.
Some of these features are listed below. The schedule of
their incorporation will undoubtedly be determined to
some extent by our experimental findings.
One major area under study is the question of
matching algorithms and relevance. As indicated above,
matching is done on a modified "anding" of all query
terms. One would like the user to have the ability to
specify any combination of "ands", "ors", and "nots"
among query terms, to control the relative emphasis
on words within the search specifications, to make
online modifications to the relevance and matching
criteria (by term ranges, for example), and to selectively override the stemming and phrase decomposition
algorithm.
Other improvements being planned include: searching restrictions on document properties or subject
term ranges; decoding of catalog fields (for example,
"English" for "e") on output; more general INFIELD
specifications (for example, ranges on dates); online
display of inverted file terms and frequency counts;
naming lists of documents or commands for later
reference; and an overlay procedure for reading in
sections of the retrieval program from disc storage as
the overall size of the retrieval system expands beyond
core memory size.
Related work
Two further capabilities that we hope to incorporate into future augmented-catalog experiments have
been studied by two students working toward M.S.
degrees. Mr. Richard Domercq 17 has studied the automatic derivation of synonymy and hierarchical relationships among the subject terms on the basis of
co-occurrence. Mr. William Kampe 18 has investigated
automatic methods for deriving subject terms from the
title and abstract of a document.
Throughout the work on storage and retrieval of
catalog data we have been conscious of the problems
that will be encountered in scaling up a computer-stored augmented catalog by two orders of magnitude.
For a collection of one million documents, we estimate
that the total information stored in the catalog will be
of the order of 2 X 10^10 bits. About 15 percent of this
information will reside in the inverted files. A preliminary study of file organization, cost, and speed of
response of such a catalog has been conducted by
Professor A. K. Susskind.19 He has arrived at a conceptual design that will perform inverted-file searches
at an average rate of 40 per second with a storage
device that costs about $250,000. On-line storage of the
complete catalog entries appears prohibitively expensive with today's technology, and new mass-storage
concepts must be examined. Multiple reading-head,
high density, continuous motion magnetic tape devices
appear promising and are being studied.
ACKNOWLEDGMENT
The research reported in this paper was supported
through grants from the Council on Library Resources,
Inc., the National Science Foundation and the Carnegie Corporation.
REFERENCES
1 J F REINTJES
System characteristics of Intrex
Proc S J C C 1969
2 D R HARING J K ROBERGE
A combined display for computer-generated data and
scanned photographic images
Proc S J C C 1969
3 D R KNUDSON S N TEICHER
Remote text-access in a computerized library information retrieval system
Proc S J C C 1969
4 H P BURNAUGH
The BOLD (bibliographic on-line display) system
In: Schecter George ed. Information retrieval-a critical
review Thompson Washington DC 1967 53-66
5 D J SULLIVAN D M MEISTER
Evaluation of user reactions to a prototype on-line information
retrieval system [RECON]
In: Proc of the 30th Annual Meeting of the American
Documentation Institute New York October 1967
6 M RUBINOFF S BERGMAN W FRANKS
E R RUBINOFF
Experimental evaluation of information retrieval through
a teletypewriter
C A C M Vol 11 No 9 September 1968
7 G SALTON
Automatic information organization and retrieval
McGraw Hill New York 1968
8 M M KESSLER
The "on-line" technical information system at M I T
Project TIP
In: 1967 IEEE International Convention Record
Institute of Electrical and Electronics Engineers
New York 1967 part 10 40-43
9 E B PARKER
Stanford physics information retrieval system (SPIRES)
Annual Report
Stanford Institute for Communications Research
Stanford California December 1967
10 S I ALLEN G O BARNETT
P A CASTLEMAN
Use of a time-shared general-purpose file-handling system
in hospital research
Proc IEEE Vol 54 No 12 December 1966
11 A R BENENFELD
Generation and encoding of the Project Intrex augmented
catalog data base
Proc of the 6th Annual Clinic on Library Applications of
Data Processing University of Illinois Urbana Illinois
May 7 1968
12 T N SHAW H ROTHMAN
An experiment in indexing by word-choosing
Journal of Documentation Vol 24 No 3 September 1968
13 R M FANO
The MAC system: the computer-utility approach
IEEE Spectrum January 1965
14 R H STOTZ
A new display terminal
Computer Design April 1968
15 J B LOVINS
Development of a stemming algorithm
MIT Electronic Systems Laboratory Technical
Memorandum ESL-TM-353 June 1968 (To appear in
MT Vol 11 No 1 March 1969)
16 NASA thesaurus
NASA report SP-7030 Scientific and Technical Information
Division National Aeronautics and Space Administration
December 1967
17 R J DOMERCQ
A machine-aided thesaurus generation system
M S Thesis Electrical Engineering Department
MIT September 1967
18 W R KAMPE
Pre-indexing by machine
M S Thesis Electrical Engineering Department
MIT June 1968 also Electronic Systems Laboratory
Report ESL-R-355 July 1968
19 INTREX Staff
Project Intrex Semi-annual Activity Report PR-4
MIT September 15 1967