File Share and SharePoint Duplication Comparison
Comparison of levels of exact and near-duplication
in text document collections in file share and
SharePoint environments
Simon Kravis, Principal, Aleka Consulting
Abstract
Levels of binary checksum and near-duplication in text document collections from two large
software development projects are compared. One collection was stored on a shared filesystem, the
other in the SharePoint document management system. In both collections, the level of
near-duplication was much higher than the level of binary checksum duplication. The use of SharePoint
did not result in a lower level of near-duplication, and examples indicated that, although SharePoint
can offer version control of documents, users did not utilise this feature extensively, and little
advantage was gained over the use of a file share as a document repository.
Introduction
Levels of duplication in electronic document collections are a frequent cause for concern, as
document users may not be aware of which document is the ‘definitive version’ when making
changes to documents. The definition of duplication is important: Word documents with the same
text content created by different authors, or given different titles by the same author, will not share
the same binary checksum, because the author name and file name are included in the Word file and
so give each document a different checksum. Despite this, there is a significant level
of checksum duplication in most document collections.
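As an illustration of binary checksum duplication, the following is a minimal sketch (not the tooling used in this study) that groups files by MD5 digest, the checksum algorithm named later in this paper. The function names md5_of_file and binary_duplicate_clusters, and the chunk size, are assumptions made for the example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file's raw bytes, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def binary_duplicate_clusters(root):
    """Group every file under `root` by its MD5 checksum.

    Returns a dict mapping checksum -> list of paths. A list with more
    than one entry is a cluster of exact binary duplicates.
    """
    clusters = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            clusters[md5_of_file(path)].append(path)
    return clusters
```

Because the digest is computed over the raw bytes, two Word files with identical text but different embedded author or file names fall into different clusters, which is the limitation described above.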
The term near-duplication for text documents is taken to mean duplication or similarity of authored
text content. This definition implies that the definition of duplication is dependent on the algorithm
used for text extraction and for determining similarity. For example, PDF documents created from
Word documents are usually taken to be the same as the parent Word document, but text
extraction software applied to PDF and parent Word documents gives different results. In this study,
near-duplication was assessed on the basis of the identity of vectors of words created from
statistical analysis of the words in parent documents. Other near-matching methods use
sophisticated hashing algorithms [1]. Word vector approaches to similarity estimation for short texts
are described by Ma and Suel [2]. The MD5 algorithm was used to obtain the binary checksum of
files.
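The paper does not specify how the word vectors were built, so the following is only a hypothetical sketch of the general idea: reduce each document's extracted text to a small vector of its most frequent non-stop words, and treat documents with identical vectors as near-duplicates. The stop-word list, the vector length of 5, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list and vector length; the paper specifies neither.
STOP_WORDS = frozenset(
    {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on", "by", "with"}
)
VECTOR_LENGTH = 5


def word_vector(text, length=VECTOR_LENGTH):
    """Return the `length` most frequent non-stop words as a sorted tuple."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words)
    # Rank by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return tuple(sorted(ranked[:length]))


def near_duplicate_clusters(documents):
    """Group document names whose word vectors are identical.

    `documents` maps a document name to its already-extracted text.
    """
    clusters = defaultdict(list)
    for name, text in documents.items():
        clusters[word_vector(text)].append(name)
    return clusters
```

Under this scheme, differences in case, punctuation, whitespace and file metadata do not affect the vector, so a Word document and a PDF made from it would usually match provided text extraction yields the same words.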
[1] CHARIKAR, M.S. Similarity Estimation Techniques from Rounding
Algorithms, STOC'02, May 19-21, 2002, Montreal, Quebec, Canada. Retrieved from
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
[2] MA, W.; SUEL, T. Structural Sentence Similarity Estimation for Short Texts. Florida Artificial Intelligence
Research Society Conference, North America, Mar. 2016. Available at:
<http://www.aaai.org/ocs/index.php/FLAIRS/FLAIRS16/paper/view/12940>. Date accessed: 17 Oct. 2016.
This paper examines binary and near-duplication in two similar document collections created in the
course of two large software development projects extending over a number of years and utilising
10 to 50 staff for design, management, development and testing. One project (FS) used a file share as
a document store, and the other (SP) used SharePoint as the document repository. Both projects
used the same folder structure for documents, but the SP project supported storage of multiple
versions of a document in a systematic fashion, whereas the FS project did not. Files in the SP
collection were accessed via the Windows Explorer interface, and where multiple versions of
SharePoint files existed, only the most recent version was profiled.
Results
File type, age, and binary and near-duplicate profiles for the FS collection of 10,523 text files are
shown in Figure 1 below:
Figure 1 FS collection profiles - File Type (Top L), Binary Duplicate Spectrum (Top R), Modified & Created Date (Bottom
L), Near-duplicate Spectrum (Bottom R)
The first point to note is that 85% of the text files have a unique binary checksum, and the
largest cluster of binary duplicates has 15 members, whereas only 40% of text files have a unique
word vector, with the largest cluster of near-duplicate files being of size 129. This indicates that
near-duplication is much more common than exact binary duplication.
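The summary figures quoted above (proportion of files with a unique key, and size of the largest cluster) can be derived from one key per file, where the key is either a binary checksum or a word vector. The sketch below is one plausible reading of how such a duplicate spectrum might be tabulated. The function name duplicate_profile and the returned field names are assumptions, not the profiling tool used in this study.

```python
from collections import Counter


def duplicate_profile(keys):
    """Summarise duplication given one key (checksum or word vector) per file."""
    keys = list(keys)
    cluster_sizes = Counter(keys)               # key -> number of files sharing it
    spectrum = Counter(cluster_sizes.values())  # cluster size -> number of clusters
    unique_files = spectrum.get(1, 0)           # files whose key occurs exactly once
    return {
        "files": len(keys),
        "unique_fraction": unique_files / len(keys) if keys else 0.0,
        "largest_cluster": max(cluster_sizes.values(), default=0),
        "spectrum": dict(spectrum),
    }
```

For a toy input of ten files with keys a, b, b, c, c, c, d, e, f, g, the function reports a unique fraction of 0.5 and a largest cluster of 3. Running it once with checksums and once with word vectors gives the two spectra compared in each figure.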
The corresponding profiles for the SP collection of 10,814 files are shown in Figure 2 below:
Figure 2 SP Collection profiles - File Type (Top L), Binary Duplicate Spectrum (Top R), Modified & Created Date (Bottom
L), Near-duplicate Spectrum (Bottom R)
The profiles shown in Figure 2 are broadly similar to those of the FS collection shown in Figure
1, except that the median Modified date is some 6 months later and there are more Office 2007 and
later format documents (extensions docx, xlsx). 80% of the files have a unique binary checksum and
the largest binary checksum cluster is of size 23. 38% of the files have a unique word vector,
and the largest near-duplicate cluster is of size 161.
Because only the most recent version of each SharePoint file was profiled, including the earlier
versions held in the SP (SharePoint) collection would probably increase the proportion of
near-duplicates somewhat, but the number of files stored in this way was not available.
The similarity of duplicate spectrum profiles in the two collections indicates that although
SharePoint can support document version control, files which were similar or identical to each other
were not stored in SharePoint in this way. An example of a near-duplicate cluster of 4 files (in this
case identical, with the same name) in 4 different folders is shown in Figure 3:
Figure 3 Example of a cluster of 4 identical files in different SharePoint folders
Storing multiple copies of the same file may well have been the most practical choice for
the progress of the software development.
The problem of multiple versions of files and files modified by different authors being stored in
separate SharePoint folders is shown in the cluster of near-duplicate files shown in Figure 4.
Figure 4 Example of a cluster of different versions of files stored in separate SharePoint folders and distinguished by
editor initials.
Separate folders are used for versions 0.2 and 0.3 of the draft Traceability Matrix, and the approved
version is distinguished by its file name including the word Approved, omitting the word Draft and
including a date. One variant of the version 0.3 draft is distinguished by the last editor's initials (RL)
being appended to the file name.
If this complex convention were adhered to by all users, it would be possible to locate the definitive
version of a document, but the version control facilities of SharePoint (at the time new to most of
the project staff) were not used, and the majority of users treated SharePoint very much like a
file share.
Conclusions
Analysis of working sets of text documents from two software development environments reveals a
modest level of binary checksum duplication (80-85% unique) but a much higher level of
near-duplication of text content (38-40% unique), using a word vector similarity measure. Duplication levels
differed little between a file share and a SharePoint document repository. The lack of
reduction in near-duplication when using SharePoint may be attributable to user inexperience, but it
does underscore the fact that the availability of version control in a document management system
does not necessarily mean that it will be used.