Where are the pictures? Linking photographic records
across collections using fuzzy logic
Paper for Museums and the Web Asia 2013, December 9-12, 2013
Stephen Brown, Simon Coupland, David Croft, Jethro Shell, Alexander von Lünen, De Montfort
University, United Kingdom
Abstract
This paper describes a novel approach to interrogating different online collections to identify
potential matches between them, using fuzzy logic based data mining algorithms. Potentially,
information about objects from one collection could be used to enrich records in another where
there are overlaps. But although there is a considerable amount of bibliographic and other kinds of
data on the Web that share similar information, a standardized way of structuring such data in a
way that makes it easy to identify significant relationships does not yet exist. In the case of
historical photographs, the challenge is further exacerbated by the enormous breadth of subjects
depicted and the fact that surviving records are not always complete, accurate or consistent, and the
amount of text available per record is very small. Fuzzy matching algorithms and semantic
similarity techniques offer a way of finding potential matches between such items when standard
ontology- and corpus-based approaches are inadequate, in this case helping researchers to match
photographs held in different archives to historical exhibition catalogue records for the first time.
Introduction
The volume of online resources has grown exponentially and is forecast to continue to expand
similarly into the foreseeable future (figure 1). This growth creates enormous potential for
cross-linkage between records, where there are overlaps, with the possibility of using information
about objects from one data set to enrich records in another, thus enhancing their value.
1 zettabyte = 1 trillion (10¹²) gigabytes
Figure 1: Information availability online from 2005 to 2015 (estimated). Source: IDC VIEW
report ‘Extracting value from chaos’, June 2011. Cited in Fullan and Donnelly (2013), p. 9.
Galleries, Libraries, Archives and Museums (GLAMs) are adding to this content by digitizing
their collections. In the USA an analysis of 18,142 museums and libraries by the Institute of
Museum and Library Services (IMLS, 2006) revealed that the majority of museums and larger
public libraries make digital records available to the public via the Internet and the same was true
for a smaller proportion of public and smaller academic libraries. Individual institutional postings
can run to hundreds of thousands of objects. As of January 2013, the Museum of Modern Art had
posted more than 540,000 images, the British Museum offers images of more than 730,000 works
and the Victoria and Albert museum (V&A) website offers access to more than 300,000 images of
works in its collection (Kelly, 2013). Collectively these resources amount to a vast quantity of data.
The Europeana aggregator website offers access to 21.3 million objects from 33 countries
(Europeana, 2012).
However, finding, evaluating and gaining access to this content is challenging. GLAMs object
data typically resides in databases, which, while they may have Web-facing search interfaces, are
not well integrated with other data sources on the Web. Although there are considerable amounts
of bibliographic and other kinds of data on the Web that share information that could be linked,
“[w]ithout tools and methods for filtering, categorizing, clustering, weighting, and disambiguating
our expanding datasets, museums will face increasing difficulty in managing and organizing their
online resources, while their visitors will struggle to locate and make use of them” (Klavans et al.,
2011). Most of the information in GLAMs records is encoded as natural-language text which is
not readily machine processable and the GLAMs and Semantic Web communities have different
terminology for similar metadata concepts, making implementation of semantic approaches
difficult. Considerable interest has been expressed in the concept of Linked Data (Bizer et al.,
2009; Heath and Bizer, 2011; Oommen and Baltussen, 2012) and some progress has been made in
the form of REpresentational State Transfer (REST) and SPARQL Protocol and RDF Query Language (SPARQL)
interfaces (ACRL Research Planning and Review Committee, 2012). But building a data pipeline
to convert heterogeneous data from various source collections into common RDF in order to
create linked open data that can facilitate searching across multiple collections requires
considerable technical expertise (Henry and Brown, 2012). While the onus is on GLAMs to
standardize and convert their collections metadata, the prospect of universally linked data remains
remote.
This paper describes an alternative approach for helping researchers to find potential matches
between such items based on similarity identification, in this case helping researchers to match
photographs held in different archives to historical exhibition catalogue records for the first time.
Exhibitions of the Royal Photographic Society
From its foundation in 1853, the Photographic Society in London always considered itself as the
'parent society' and, after suffering several setbacks during the late 1860s, re-emerged to establish
itself as the pre-eminent photographic society of Britain. During the years that followed, the
Photographic Society changed its name, first to the Photographic Society of Great Britain in 1874,
and then in 1894 to the Royal Photographic Society, the name it still retains today. Although other
photographic societies flourished elsewhere in Britain and held their own annual exhibitions,
catalogues from these societies have not survived in any significant number. In contrast, the 46
surviving catalogues from the Photographic Society's annual exhibitions, 1870 - 1915, contain
34,917 individual exhibit records. Collectively, these exhibition records are a highly significant
research resource, offering a unique insight into the evolution of aesthetic trends and photographic
technologies, the response of a burgeoning group of photographic manufacturers to business
opportunities and the activities and fortunes of individuals concerned with the technical, artistic
and commercial development of photography. The society's exhibitions attracted a wide
constituency of photographers, from Britain, Europe and America and many individuals launched
their photographic career by exhibiting at the Royal Photographic Society.
Exhibitions of the Royal Photographic Society 1870-1915 (ERPS) http://erps.dmu.ac.uk is one of
a number of photographic research resources hosted by De Montfort University
(http://kmd.dmu.ac.uk/kmd_photohistory_page). These resources have enormous potential to
enrich and contextualise records of historical photographs in GLAMs collections, if co-referent
photographs exist. Equally, matching surviving pictures to the exhibition catalogue records would
enhance their value to researchers, particularly since although the ERPS exhibition catalogues
contain information about many different aspects of the exhibits, they contain very few
reproductions of the photographic exhibits themselves. There are only 1,040 images, most of
which are merely artists’ sketches of the originals because, at the time of the exhibitions,
mechanical reproduction of photographic images was technically impossible. While many
researchers have commended the ERPS database for its usefulness, a common criticism is that
most of the pictures are missing. The “FuzzyPhoto” project described here is developing and
testing computer-based finding aids aimed at answering the questions:
• Where are the surviving pictures referred to in the exhibition catalogues?
• What do those pictures look like?
Research challenges and methods
GLAMs collection records generally and the ERPS records in particular present a number of
challenges. In the case of the ERPS exhibition records, a new exhibition committee each year
structured the exhibition into sections that seemed appropriate and labeled exhibits in particular
ways. Thus some records have extensive metadata within a large number of fields while others
have only minimal entries under just a few headings. For example, for some exhibitors a complete
address is listed, while for others there is only a neighborhood or a town name. It is
unsurprising that address information should change and equally unsurprising that in some cases
exhibitor names changed as well, either through marriage or through promotion. For example one
can trace through the records the career of Lieut., later Captain and finally Sir, C.E. Abney. Less
expected was that sometimes the same photograph was exhibited in different years by different
people and sometimes under a different title.
GLAMs records have their own issues. Record management systems and cataloguing conventions
vary between institutions, which creates structural differences between their respective
data sets. Clustering algorithms can be used to identify similarities between different
data sets (Dhillon et al., 2003), but when we ran a clustering algorithm on a sample of data from a
range of institutions we found that data clustered most strongly within the originating collections
and that clusters of data between collections were very much less evident. In other words, the
institutional cataloguing style was so strong that it masked any other similarities that might exist
between records in different collections. Also the quality of records is highly variable. In some
cases records are kept as simple spreadsheets, with little or no structure, while other institutions
keep meticulous records clearly differentiated into multiple fields. The problem of lack of clear
structure is exacerbated when data are entered inconsistently, making it difficult to develop rules
for separating out different data elements. For example, examination of the records for one
collection revealed that 20 different ways of formatting date information had been used.
In order to be able to compare records across ERPS and GLAMs collections an approach was
required that could cope with incompleteness, uncertainty, and inconsistency. The methodology
we are using comprises four steps:
1. Data preparation (correcting typographical errors, standardizing data such as dates,
removing duplicate entries and mapping data to a common metadata schema).
2. Data aggregation (combining standardized records in a single XML database where they
can be mined for similarities).
3. Query expansion (extending the range of keywords that are searched for).
4. Field comparison (comparing the contents of individual fields and combining these to
produce an overall similarity metric).
Figure 2 outlines the overall workflow of the project, showing schematically how collection
records are ingested, aggregated and mined for links which can then be output to partners’ own
Web sites where they can be accessed by visitors to those sites.
Figure 2: Overall project workflow.
Data preparation and aggregation
Thus far the FuzzyPhoto project has acquired circa 1.5 million historical photographic records
from GLAMs including the V&A, British Library, Metropolitan Museum New York, the National
Media Museum, National Museums Scotland, Birmingham City Library, St Andrews University,
the National Archives, the Musée d’Orsay in Paris and open sources such as Culture Grid. The
first step is to convert all the data to a common MySQL table format. CSV files can be imported
directly into MySQL, but for XML data an XML data store (BaseX) and XQuery queries have to be
used to convert the data to MySQL tables, and an intermediate Microsoft SQL database was
necessary to convert the British Library (BL) data, as figure 3 shows. After the data are imported,
“cleaning up” such as eliminating duplicates or spelling variants is done, chiefly with SQL queries.
Some operations, however, are too complex to be handled in SQL, so Java and Python scripts are
used externally.
Figure 3: Workflow for mapping different metadata sources to LIDO.
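As a small illustration of the kind of clean-up involved (recall the twenty different date formats noted earlier), the Python sketch below normalizes a handful of free-text date strings into year spans. The patterns and the ±5-year window for "circa" are our own illustrative assumptions, not the project's actual cleaning scripts.

```python
import re

def normalise_date(raw):
    """Map a free-text date string to a (start_year, end_year) span.

    Only a few of the formats met in practice are handled; the patterns
    below are illustrative assumptions rather than a complete rule set.
    """
    text = raw.strip().lower()
    # "circa 1890" or "c. 1890": treat as a small span around the year
    # (the +/-5 year window is an arbitrary choice for this sketch)
    m = re.match(r'^(?:circa|c\.?)\s*(\d{4})$', text)
    if m:
        year = int(m.group(1))
        return (year - 5, year + 5)
    # "1890s": the whole decade
    m = re.match(r'^(\d{3})0s$', text)
    if m:
        start = int(m.group(1)) * 10
        return (start, start + 9)
    # "19th century" or "the 19th century": conventional reckoning 1801-1900
    m = re.match(r'^(?:the\s+)?(\d{1,2})(?:st|nd|rd|th)\s+century$', text)
    if m:
        c = int(m.group(1))
        return ((c - 1) * 100 + 1, c * 100)
    # plain four-digit year, e.g. "1886"
    m = re.match(r'^(\d{4})$', text)
    if m:
        year = int(m.group(1))
        return (year, year)
    return None  # unrecognised format, left for manual inspection

print(normalise_date("circa 1890"))    # (1885, 1895)
print(normalise_date("1890s"))         # (1890, 1899)
print(normalise_date("19th century"))  # (1801, 1900)
```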
The next step is to map the records in the temporary MySQL tables to a common metadata schema.
This is necessary because the original collections were catalogued using different schemas, which
makes it difficult to compare records between them. We chose the ICOM LIDO metadata schema
(http://network.icom.museum/cidoc/working-groups/data-harvesting-and-interchange/what-is-lido
/). LIDO is an XML harvesting schema developed for exposing, connecting and aggregating
information about museum objects on the Web, ideally suited to the task of standardizing the
metadata provided by each of the contributors.
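To give a rough sense of what the mapping step produces, the sketch below converts a single row from a hypothetical temporary table into a pared-down LIDO-style XML record. The column names are invented and only a few LIDO elements are shown; a real mapping populates many more wrappers (events, actors, repository and so on).

```python
import xml.etree.ElementTree as ET

LIDO_NS = "http://www.lido-schema.org"  # LIDO XML namespace
ET.register_namespace("lido", LIDO_NS)

def row_to_lido(row):
    """Convert one temporary-table row (a dict) into a minimal LIDO-style record.

    Only the record identifier and title are mapped here; this is a
    simplified subset of the schema, not a complete LIDO document.
    """
    q = lambda tag: f"{{{LIDO_NS}}}{tag}"   # qualify a tag with the namespace
    lido = ET.Element(q("lido"))
    ET.SubElement(lido, q("lidoRecID")).text = row["record_id"]
    desc = ET.SubElement(lido, q("descriptiveMetadata"))
    obj_id = ET.SubElement(desc, q("objectIdentificationWrap"))
    title_set = ET.SubElement(ET.SubElement(obj_id, q("titleWrap")), q("titleSet"))
    ET.SubElement(title_set, q("appellationValue")).text = row["title"]
    return lido

# Hypothetical row pulled from a temporary MySQL table
row = {"record_id": "erps28409", "title": "George Meredith"}
print(ET.tostring(row_to_lido(row), encoding="unicode"))
```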
Query expansion
The endpoint of the first stage therefore is a MySQL database of records in LIDO format. The aim
is to compare these records in order to identify similarities. Similarity identification, known
variously as co-reference identification, record linkage and entity resolution, is a common problem
in many fields and a correspondingly wide range of approaches have been developed to deal with
it (Elmagarmid et al., 2007). Effective techniques have been developed for contexts where there
are sufficiently large volumes of text to allow matching of key words, for example Luo and Xie
(2009), or where there are very strong identifiers such as book titles, post codes or telephone
numbers to link different items. However, such approaches are of only limited use in the context
of GLAMs collections which typically have only limited metadata, often just a title, date and
name, and sometimes not even that, and records are not always complete, accurate or consistent
(Tzompanaki, 2012). In addition to these general problems, most ERPS exhibit titles are just a few
words long, although some extend to several lines. The average title length is just 8.1 words, of
which only 5.4 are useful, making it very difficult to apply document classification or corpus
linguistics based approaches that rely on matching clusters of similar words in different
documents. There are just not enough words in the titles as they stand to find such clusters.
Furthermore, the range of subject matter depicted is enormous, including landscapes, events,
scientific experiments, medicine, architecture, fashion, museums, folk customs, celebrities,
photographic equipment, etc., against which ontology-based matching approaches are ineffective.
The requisite ontology would have to cover just about every subject imaginable in the mid to late
nineteenth century.
One way of tackling the problem of very limited amounts of text is query expansion. Query
expansion uses semantically similar terms to those in a query to increase the chances of locating
matching words (Xu and Croft, 1996). Query expansion is relevant in the case of ERPS records
because the titles contain so few words and because we know that variants of some photographs
appear under similar but not identical titles. So by searching for similar words we increase our
chances of finding the same photograph even though its title may be different.
Before a term can be semantically expanded its meaning has to be unambiguously defined
(Stevenson and Wilks, 2003). Many English language words have multiple meanings, for example
“fair” can mean “blonde”, “attractive”, “festival” or “equitable”. Therefore when matching an
exhibit title such as “Fair Daffodils” the term “fair” has to be disambiguated first. Many semantic
disambiguation techniques employ a corpus taken from a specific subject area with text structures
that encompass long sentences, or even whole paragraphs, to derive a deep understanding of the
context. Since the subject matter of the ERPS exhibits is so diverse, and the exhibit titles are so
short, an alternative approach is required. We are using the WordNet Lexical database
(http://wordnet.princeton.edu/) to identify synsets of keywords, each of which represents a
different meaning for that term, and supplementing this with Part-Of-Speech Tagging (POST).
POST applies descriptors to each element in a sentence, such as noun, verb, participle,
article, pronoun, preposition, adverb, and conjunction, to help disambiguate the words
(Voutilainen, 2003). A comparison of alternative software Part-of-Speech taggers indicated that
the Stanford POST software (http://nlp.stanford.edu/software/tagger.shtml) is the most accurate
when used against a test dataset of ERPS records.
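The sketch below illustrates the general idea of POS-constrained expansion with WordNet synsets. For brevity it uses NLTK's built-in tagger rather than the Stanford POST software adopted by the project, and the mapping from Penn Treebank tags to WordNet parts of speech, like the cap on the number of synonyms, is a simplifying assumption.

```python
import nltk
from nltk.corpus import wordnet as wn

# Requires the NLTK data packages 'punkt', 'averaged_perceptron_tagger'
# and 'wordnet', e.g. via nltk.download('wordnet').

PENN_TO_WN = {"N": wn.NOUN, "V": wn.VERB, "J": wn.ADJ, "R": wn.ADV}

def expand_title(title, max_lemmas=5):
    """Expand each content word of a short title with WordNet synonyms.

    The POS tag restricts which synsets are considered, a rough stand-in
    for full word-sense disambiguation.
    """
    expanded = {}
    for word, tag in nltk.pos_tag(nltk.word_tokenize(title.lower())):
        wn_pos = PENN_TO_WN.get(tag[0].upper())
        if wn_pos is None:
            continue  # skip determiners, prepositions and so on
        lemmas = []
        for synset in wn.synsets(word, pos=wn_pos):
            for lemma in synset.lemma_names():
                lemma = lemma.replace("_", " ")
                if lemma != word and lemma not in lemmas:
                    lemmas.append(lemma)
        expanded[word] = lemmas[:max_lemmas]
    return expanded

print(expand_title("Fair daffodils"))
# e.g. {'fair': [...adjective senses...], 'daffodils': ['daffodil', 'Narcissus pseudonarcissus', ...]}
```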
Field comparison
Field comparison entails comparing the expanded set of keywords relating to each individual
record from the ERPS database (defined as a “seed record”) with the expanded query terms
generated for other records in the data warehouse to assess their similarity. Similarity between
terms in different records is assessed firstly by comparison between individual fields and
individual field similarity metrics are then combined to produce an overall similarity metric for
each pair of records. Overall metrics are then ranked in order of most to least similar.
Thus far the research has concentrated on just four fields: title, date, creator and process. Of these,
date, creator and process are the most straightforward. The key factor in relation to date is the
amount of time between the dates described in the fields. The greater the amount of time between
them, the less similar they are. Some of the dates given describe a span of time rather than a
specific year (e.g. "1890s", "the 19th century"). Greater differences between time spans indicate
less similarity between fields. For example "19th century" and "1900" are less similar than
"1900s" and "1900". In the case of the creator field, name comparison is a well-established
problem in many areas and a large number of algorithms exist in order to compare names which
can handle typographical errors, alternative spellings etc. We are using established edit distance
techniques that measure similarity in terms of the number of changes (edits) that are required in
order to convert one string into another (Winkler, 2006). The technique we are using to assess the
similarity of process field data matches the stated process with a list of preset keywords describing
various known processes. This allows for typographical errors and spelling variations. The various
photographic processes are organised into a branched dendrogram structure in which processes
sharing specific traits appear on the same branch. Once a field has been matched to a specific
locus it is compared to other fields by finding the shortest path between the processes in the
dendrogram. The shorter the distance between the processes, the more similar they are
considered to be.
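To indicate how an edit-distance comparison of creator names might look, the sketch below normalizes a plain Levenshtein distance into a similarity score between 0 and 1. It is a generic illustration rather than the project's specific technique, which draws on the record-linkage methods surveyed by Winkler (2006).

```python
def levenshtein(a, b):
    """Edit distance: the number of insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def name_similarity(a, b):
    """Normalise the edit distance to a 0..1 similarity score."""
    a, b = a.lower().strip(), b.lower().strip()
    if not a or not b:
        return 0.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(round(name_similarity("Hollyer, Frederick", "Hollyer, Fredk."), 2))  # 0.72
```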
The title field
The extreme brevity of most ERPS exhibit titles rules out standard statistical text similarity
metrics based on analysis of long text strings. Short text semantic similarity tools such as Latent
Semantic Analysis (LSA) and Sentence Similarity (STASIS) overcome this problem, but they are
most effective with small volumes of text (O’Shea et al., 2008). As the volume of text increases,
the time taken to process it becomes prohibitive. Although the titles of ERPS records are short,
there are some 35,000 of them, and in total we have 1.5 million records to process. To deal with this volume we
have developed a simplified semantic similarity measure called Lightweight Semantic Similarity
(LSS) which significantly lowers the computational overhead without significantly reducing the
accuracy of results (Croft et al., 2013). LSS is based on standard statistical similarity metrics
known as “cosine similarity” (Manning and Schütze, 1999), but additionally takes into account the
semantic similarity between words, with the aid of WordNet and Part-of-Speech Tagging.
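The details of LSS are given in Croft et al. (2013); the sketch below is not LSS itself, but illustrates the underlying idea using a standard "soft cosine" formulation in which non-identical words still contribute to the score when WordNet judges their senses to be close. The use of path similarity and a plain bag-of-words vector are assumptions made for this illustration.

```python
import math
from nltk.corpus import wordnet as wn   # requires the NLTK 'wordnet' data

def term_similarity(w1, w2):
    """1.0 for identical terms, otherwise the best WordNet path similarity
    between any pair of their synsets (0.0 if neither is in WordNet)."""
    if w1 == w2:
        return 1.0
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def soft_cosine(title_a, title_b):
    """Cosine similarity over bag-of-words vectors, softened so that
    semantically related words are counted as partial matches."""
    a, b = title_a.lower().split(), title_b.lower().split()
    vocab = sorted(set(a) | set(b))
    va = [a.count(w) for w in vocab]
    vb = [b.count(w) for w in vocab]
    sim = [[term_similarity(w1, w2) for w2 in vocab] for w1 in vocab]
    def dot(x, y):
        return sum(sim[i][j] * x[i] * y[j]
                   for i in range(len(vocab)) for j in range(len(vocab)))
    denom = math.sqrt(dot(va, va)) * math.sqrt(dot(vb, vb))
    return dot(va, vb) / denom if denom else 0.0

print(soft_cosine("Fair daffodils", "Beautiful narcissus"))
```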
Combined similarity metric
The final step in our workflow is to combine the individual field metrics into an overall record
similarity metric. A series of rules was created describing how to combine and weight the
individual metrics. This is an expert knowledge approach: it relies on the programmed knowledge
of domain experts (Winkler, 2006), in our case derived from a survey of members of the GLAM
community to establish the relative importance of the different variables. The result is a series of
“if, then” statements (i.e. IF x THEN y) which represent the rules that a person would use to solve
the same task. Rule based systems are popular for commercial applications, but they often
function poorly when faced with imprecise information, where the identity of x may be uncertain.
In our case uncertainty arises from factors such as imprecise dates (e.g. “circa 1890”), lack of
definitive information about historical photographic processes, especially since the production of a
photograph could entail more than one process, and the potential existence of variants resulting
from different prints from the same negative. The solution to this difficulty is to use a fuzzy
rule-based approach. Fuzzy logic allows an object to belong to more than one category but to
various degrees. For example, consider a number of people of different heights. While some are
taller than others, the distinction between tall and not tall is not clear cut and in different contexts
a person of average height might be considered both tall and short. Fuzzy logic can accommodate
this relativity by assigning such a person a value that quantifies the extent to which they belong
to the group “tall”, which makes it more resilient to imprecision than crisp logic.
Fuzzy logic has been applied previously to resource discovery challenges (Feng, 2012; Lai et al.,
2011; Li et al., 2009). However, all these approaches were based on analysis of large volumes of
text and are thus not applicable in our context. Our approach is tailored to the challenge of comparing
records of titles, person names, dates and processes, in which most fields contain only small
amounts of text. The full set of fuzzy rules we are using draws on the individual field similarity
metrics to ascertain if the match between any pair of fields is good and then combines that
comparison with other field comparisons as follows:
IF title is good AND person is good THEN match is good.
IF title is good AND (date is good OR process is good) THEN match is ok.
IF person is good AND title is bad THEN match is ok.
IF title is bad AND person is bad THEN match is bad.
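The sketch below shows one way rules of this kind could be evaluated, using the common convention of min() for AND, max() for OR and a crude weighted average as the defuzzification step. The membership functions and output weights are assumptions made purely for illustration; they are not the project's tuned fuzzy system.

```python
def fuzzy_match(title, person, date, process):
    """Combine field similarity scores (each in [0, 1]) with the four
    rules listed above, Mamdani-style: AND -> min(), OR -> max().

    'good' membership is taken to be the score itself and 'bad' its
    complement -- a deliberately crude fuzzification chosen for brevity.
    """
    good = lambda x: x
    bad = lambda x: 1.0 - x
    rules = {
        "good": min(good(title), good(person)),
        "ok": max(min(good(title), max(good(date), good(process))),
                  min(good(person), bad(title))),
        "bad": min(bad(title), bad(person)),
    }
    # Defuzzify with a weighted average of the rule activations
    weights = {"good": 1.0, "ok": 0.5, "bad": 0.0}
    total = sum(rules.values())
    return sum(weights[k] * v for k, v in rules.items()) / total if total else 0.0

# Strong title and person matches, weak date and process evidence
print(round(fuzzy_match(title=0.9, person=0.8, date=0.2, process=0.1), 2))  # ~0.82
```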
Results
Figure 4 illustrates how the overall similarity values are used to create connections between the
records. Starting with a seed record at the root node, the combined rules identify the record with
the highest similarity to that seed record and add it as a child node. The record with the highest
similarity to either of those two nodes is then added, and so on, creating a branching tree-like
structure until all the records have been added to the tree and the process ends. The connections shown in figure
4 are just the topmost level of the complete dendrogram.
Figure 4: Top level dendrogram of search results for a single ERPS seed record.
The result is that those records with the greatest similarity to the seed record appear in the highest
layers of the tree. Of course a similar effect could be achieved by simply selecting the records with
the highest similarity to the seed record. However, this approach does more than just select the
records with the greatest similarity to the seed record, it also groups similar records together
within the hierarchy. For example, records with the same/similar person fields will be grouped
together, which allows for easy exploration of all the records by a single photographer.
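Read procedurally, the greedy growth described above amounts to building a maximum-similarity spanning tree (essentially Prim's algorithm). A minimal sketch is given below; the similarity function is assumed to return the combined record-level metric discussed in the previous section.

```python
def build_similarity_tree(seed, candidates, similarity):
    """Grow a tree from the seed record: at each step attach the candidate
    most similar to any record already in the tree, as a child of that record.

    `similarity(a, b)` is assumed to return a combined score in [0, 1].
    Returns a list of (parent, child, score) edges.
    """
    in_tree = [seed]
    remaining = list(candidates)
    edges = []
    while remaining:
        parent, child, score = max(
            ((p, c, similarity(p, c)) for p in in_tree for c in remaining),
            key=lambda edge: edge[2])
        edges.append((parent, child, score))
        in_tree.append(child)
        remaining.remove(child)
    return edges
```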
Figure 5 shows the results of one such search relating to ERPS record erps28409, a portrait of
George Meredith, exhibited by Fred Hollyer in 1909. Examining the metadata for the records, it
appears that erps28409 was not the original study. In fact V&A exhibit vaO75248 was created
first by Frederick Hollyer in 1886, then later erps28409 was created for exhibition in 1909 by
enlarging a portion of vaO75248. An etching was also produced by Frederick's brother Samuel
in 1900 which resulted in the Library of Congress record loc92512466.
Figure 5: Comparison of results of search for seed record erps28409.
What is remarkable about these results is that while the images are patently similar, from a
computational perspective the metadata for each is quite different. For example, as human
beings we can see that the creator names “Fred. Hollyer”, “Hollyer, Samuel,” and “Hollyer,
Frederick” are similar, but as they are not identical in the way they have been expressed a simple
word-by-word comparison would not work. However, it should be noted that these are preliminary
results and confirmation of the efficacy of the proposed approach must await full scale
implementation. Thus far we have tested a sample of the outcomes on a panel of subject experts,
to compare speed and accuracy of co-reference identification between expert human beings and
our computer algorithms. The results indicate our new approach is at least as accurate as
techniques employed by experienced researchers, historians and curators and considerably faster if
the off-line processing time needed to create the links database is discounted.
Conclusions
Cultural heritage institutions are beginning to explore the added value of sharing and connecting
data. Recent moves towards making GLAMs records available in computer readable formats such
as XML and JSON and making the collections searchable using REST and SPARQL interfaces go
some way towards improving access to collections records but a major barrier to effective sharing
across multiple platforms is the absence of machine-readable text standards that make it easy to
automate the process of identifying significant relationships between records. Most of the
information in GLAMs records is encoded as human-readable text which is not readily
machine-processable, and the GLAMs and Semantic Web communities have different terminologies for
similar metadata concepts, making implementation of semantic approaches difficult. While Linked
Open Data offers a tantalizing glimpse of what the future could contain, the reality is that
universally applied and adopted standards are a remote possibility for the foreseeable future.
Alternative approaches are needed that can work with the messy characteristics of existing
collection records. This research shows that low-cost, lightweight approaches to co-reference
identification may be possible, although further work is required to prove the results beyond the
pilot study reported here, and refinements would be needed to reduce processing time and extend
the approach to other domains. A further problem is variation in the structure and quality of GLAMs records.
Each contributing institution has its own records management system and distinctive cataloguing
style. This has necessitated manual data cleansing and mapping to an appropriate common
metadata schema. In order to open this approach more widely to other institutions ways need to be
found to automate these labour intensive data preparation stages.
Acknowledgements
This research is funded by the UK Arts and Humanities Research Council AHRC Research Grant
AH/J004367/1. Thanks are due to Birmingham City Library, the British Library, the Louvre, the
Metropolitan Museum, Musée d’Orsay, the National Archives, the National Media Museum,
National Museums Scotland, St Andrews University, the V&A and Professor Roger Taylor for
their generous support.
References
ACRL Research Planning and Review Committee (2012). “2012 top ten trends in academic
libraries,” College & Research Libraries News 73 (6), 311–320. Available
http://crln.acrl.org/content/73/6/311.full.pdf+html Consulted October 28, 2013.
Bizer, C., Heath, T. and Berners-Lee, T. (2009). “Linked Data - The Story So Far”. International
Journal on Semantic Web and Information Systems 5 (3), 1-22.
Croft, D., Coupland, S., Shell, J. and Brown, S. (2013). “A fast and efficient semantic short text
similarity metric”. In Y. Jin and S.A. Thomas (Eds.) Proceedings of the 2013 UK Workshop on
Computational Intelligence (UKCI 2013). Surrey: University of Surrey. (In press).
Europeana (2012). Breaking New Ground: Europeana Annual Report and Accounts 2011. The
Hague: Europeana Foundation. Available at
http://pro.europeana.eu/documents/858566/ade92d1f-e15e-4906-97db-16216f82c8a6 Consulted
October 28, 2013.
Dhillon, I., Kogan, J. and Nicholas, C. (2003). “Feature Selection and Document Clustering”. In:
M. Berry (Ed.), A Comprehensive Survey of Text Mining. Berlin/Heidelberg: Springer-Verlag,
73-100.
Elmagarmid, A., Ipeirotis, P., and Verykios, V. (2007). “Duplicate record detection: A survey”.
IEEE Transactions on Knowledge and Data Engineering 19 (1), 1–16.
Feng, J. (2012). “Efficient Fuzzy Type-Ahead Search in XML data”. IEEE Transactions on
Knowledge and Data Engineering 24 (5), 882-895.
Fullan, M. and Donnelly, K. (2013). Alive in the Swamp: Assessing digital innovations in
education. London: Nesta. Available at
http://www.nesta.org.uk/areas_of_work/public_services_lab/digital_education/assets/features/alive
_in_the_swamp_assessing_digital_innovations_in_education Consulted September 5, 2013.
Heath, T. and Bizer, C. (2011). “Linked Data: Evolving the Web into a Global Data Space.”
Synthesis Lectures on the Semantic Web: Theory and Technology 1 (1), 1–136.
Henry, D. and Brown, E. (2012). Using an RDF Data Pipeline to Implement Cross-Collection
Search. In D. Bearman and J. Trant (Eds.). Museums and the Web 2012: Proceedings. Toronto:
Archives & Museum Informatics, 2012. Last updated March 25, 2012, consulted September 5,
2013
http://www.museumsandtheweb.com/mw2012/papers/using_an_rdf_data_pipeline_to_implement_
cross_.html
IMLS (2006). State of Technology and Digitization in the Nation’s Museums and Libraries.
Washington, DC: The Institute of Museum and Library Services. Available at
http://web.archive.org/web/20060926090433/http://www.imls.gov/resources/TechDig05/Technolo
gy%2BDigitization.pdf Consulted October 25, 2013.
Kelly, K. (2013). Images of Works of Art in Museum Collections: The Experience of Open Access
A Study of 11 Museums. Washington, DC: The Council on Library and Information Resources.
Available at http://www.clir.org/pubs/reports/pub157/pub157.pdf Consulted October 28, 2013.
Klavans, J., Stein, R., Chun, S. and Guerra, R.D. (2011). Computational Linguistics in Museums:
Applications for Cultural Datasets. In J. Trant and D. Bearman (Eds.). Museums and the Web 2011:
Proceedings. Toronto: Archives & Museum Informatics, 2011. Last updated March 27, 2011,
consulted September 5, 2013
http://conference.archimuse.com/mw2011/papers/computational_linguistics_in_museums
Lai, L.F., Wu, C.C., Lin, P.Y. and Huang, L.T. (2011). Developing a Fuzzy Search Engine Based
on Fuzzy Ontology and Semantic Search. In Proceedings of the IEEE International Conference on
Fuzzy Systems. Taipei: IEEE, 2684-2689.
Li, F.Z., Luo, D.Y. and Xie, D. (2009). Fuzzy search on non-numeric attributes of keyword query
over relational databases. In Proceedings of ICCSE ’09, 4th International Conference on Computer
Science and Education. Nanning, China: IEEE, 811-814.
Manning, C. D. and Schütze, H. (1999). Foundations of statistical natural language processing.
Cambridge, MA: MIT Press.
Oommen, J., Baltussen, L. and van Erp, M. (2012). Sharing cultural heritage the linked open data
way: why you should sign up. In J. Trant and D. Bearman (Eds.). Museums and the Web 2012:
Proceedings. Toronto: Archives & Museum Informatics, 2012. Last updated March 25, 2012,
consulted September 5, 2013
http://www.museumsandtheweb.com/mw2012/papers/sharing_cultural_heritage_the_linked_open
_data.
O’Shea, J., Bandar, Z., Crockett, K. and McLean, D. (2008). “A Comparative Study of Two Short
Text Semantic Similarity Measures”. Lecture Notes in Computer Science, 4953, 172-181.
Stevenson, M. and Wilks, Y. (2003). “Word sense disambiguation”. In R. Mitkov (Ed.) The Oxford
Handbook of Computational Linguistics. Oxford: Oxford University Press, 249-265.
Tzompanaki, K. (2012). A New Framework for Querying Semantic Networks. In J. Trant and D.
Bearman (Eds.). Museums and the Web 2012: Proceedings. Toronto: Archives & Museum
Informatics, 2012. Last updated March 25, 2012, consulted September 5, 2013
http://www.museumsandtheweb.com/mw2012/papers/a_new_framework_for_querying_semantic_
networks
Voutilainen, A. (2003). “Part-of-speech tagging”. In R. Mitkov (Ed.) The Oxford Handbook of
Computational Linguistics. Oxford: Oxford University Press, 219-232.
Winkler, W. E. (2006). "Overview of Record Linkage and Current Research Directions". Research
Report Series, (Statistics #2006-). Washington, DC: US Census Bureau. Available at
http://www.census.gov/srd/papers/pdf/rrs2006-02.pdf Consulted October 28, 2013.
Xu, J. and Croft, W. B. (1996). “Query expansion using local and global document analysis”. In
H.P. Frei, D. Harman, P. Schäuble and R. Wilkinson (Eds.) Proceedings of the 19th annual
international ACM SIGIR conference on Research and development in information retrieval. ACM,
4-11.