File Share and SharePoint Duplication Comparison

Comparison of levels of exact and near-duplication in text document collections in file share and SharePoint environments

Simon Kravis, Principal, Aleka Consulting

Abstract

Levels of binary checksum duplication and near-duplication in text document collections from two large software development projects are compared. One collection was stored on a shared filesystem, the other in the SharePoint document management system. In both collections, the level of near-duplication was much higher than the level of binary checksum duplication. The use of SharePoint did not result in a lower level of near-duplication. Examples indicated that although SharePoint can offer version control of documents, users did not utilise this feature extensively, and little advantage was gained over the use of a file share as a document repository.

Introduction

Levels of duplication in electronic document collections are a frequent cause for concern, as document users may not be aware of which document is the 'definitive version' when making changes to documents. The definition of duplication is important: Word documents with the same text content created by different authors, or given different titles by the same author, will not share the same binary checksum, because the author name and file name are included in the Word file and each document therefore has a different checksum. Despite this, there is a significant level of checksum duplication in most document collections.

The term near-duplication for text documents is taken to mean duplication or similarity of authored text content. This implies that whether two documents count as duplicates depends on the algorithm used for text extraction and for determining similarity. For example, PDF documents created from Word documents are usually taken to be the same as the parent Word document, but text extraction software applied to the PDF and to the parent Word document gives different results.
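The effect of embedded metadata on a binary checksum can be illustrated with a minimal sketch. The two byte strings below are hypothetical stand-ins for saved copies of one document (real Word files are binary containers, not plain text), and the function and variable names are illustrative assumptions. MD5 is used because it is the checksum algorithm applied in this study.

```python
import hashlib


def binary_checksum(data: bytes) -> str:
    """Return the MD5 hex digest of a file's raw bytes."""
    return hashlib.md5(data).hexdigest()


# Hypothetical stand-ins for two saved copies of the same text: the body is
# identical, but the embedded author metadata differs.
body = "The system shall log every transaction."
copy_a = f"author=Alice\n{body}".encode("utf-8")
copy_b = f"author=Bob\n{body}".encode("utf-8")

# Compare the whole-file checksums, then compare only the authored text
# (everything after the first line, which holds the metadata here).
same_checksum = binary_checksum(copy_a) == binary_checksum(copy_b)
same_text = copy_a.split(b"\n", 1)[1] == copy_b.split(b"\n", 1)[1]

print(same_checksum, same_text)
```

The checksums differ even though the authored text is the same, which is why checksum comparison alone can understate the level of duplication in a collection.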
In this study, near-duplication was assessed on the basis of the identity of vectors of words created from statistical analysis of the words in parent documents. Other near-matching methods use sophisticated hashing algorithms [1]. Word vector approaches to similarity estimation for short texts are described by Ma and Suel [2]. The MD5 algorithm was used to obtain the binary checksum of files.

[1] Charikar, M. S. Similarity Estimation Techniques from Rounding Algorithms. STOC'02, May 19-21, 2002, Montreal, Quebec, Canada. Retrieved from http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
[2] Ma, W.; Suel, T. Structural Sentence Similarity Estimation for Short Texts. Florida Artificial Intelligence Research Society Conference, North America, Mar. 2016. Available at: http://www.aaai.org/ocs/index.php/FLAIRS/FLAIRS16/paper/view/12940. Date accessed: 17 Oct. 2016.

This paper examines binary duplication and near-duplication in two similar document collections created in the course of two large software development projects, each extending over a number of years and utilising 10-50 staff for design, management, development and testing. One project (FS) used a file share as a document store, and the other (SP) used SharePoint as the document repository. Both projects used the same folder structure for documents, but the SP project supported storage of multiple versions of a document in a systematic fashion, whereas the FS project did not. Files in the SP collection were accessed via the Windows Explorer interface, and where multiple versions of a SharePoint file existed, only the most recent version was profiled.
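Grouping documents by identity of word vectors can be sketched as follows. The paper does not specify how its vectors were constructed, so this sketch assumes a vector made up of the most frequent non-stopword words in each document. The stopword list, the vector size, the function names and the sample file names are all hypothetical, and this should not be read as the method actually used in the study.

```python
import re
from collections import Counter, defaultdict

# Assumed, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"})


def word_vector(text: str, size: int = 5) -> tuple:
    """Return the `size` most frequent non-stopword words as a tuple."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    counts = Counter(words)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return tuple(word for word, _ in ranked[:size])


def near_duplicate_clusters(docs: dict) -> list:
    """Group document names whose word vectors are identical."""
    groups = defaultdict(list)
    for name, text in docs.items():
        groups[word_vector(text)].append(name)
    return [sorted(names) for names in groups.values()]


# Hypothetical documents: the two drafts differ only in a stopword and
# punctuation, so their bytes (and checksums) differ but their vectors match.
docs = {
    "draft_v0.2.txt": "Traceability matrix for the billing module. "
                      "The matrix maps requirements to tests.",
    "draft_v0.3.txt": "Traceability matrix for the billing module. "
                      "The matrix maps requirements to the tests!",
    "plan.txt": "Test plan for the billing module release schedule.",
}

clusters = near_duplicate_clusters(docs)
print(clusters)
```

Under these assumptions the two drafts fall into one cluster and the plan into another. Requiring identical vectors, rather than applying a similarity threshold, keeps the grouping a simple dictionary lookup.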
Results

File type, age, and binary and near-duplicate profiles for the FS collection of 10,523 text files are shown in Figure 1.

Figure 1 FS collection profiles - File Type (Top L), Binary Duplicate Spectrum (Top R), Modified & Created Date (Bottom L), Near-duplicate Spectrum (Bottom R)

The first point to note is that 85% of the text files have a unique binary checksum, and the largest cluster of binary duplicates has 15 members, whereas only 40% of the text files have a unique word vector, with the largest cluster of near-duplicate files being of size 129. This indicates that near-duplication is much more common than exact binary duplication.

The corresponding profiles for the SP collection of 10,814 files are shown in Figure 2.

Figure 2 SP collection profiles - File Type (Top L), Binary Duplicate Spectrum (Top R), Modified & Created Date (Bottom L), Near-duplicate Spectrum (Bottom R)

The profiles shown in Figure 2 are broadly similar to those of the FS collection shown in Figure 1, except that the median Modified date is some 6 months later and there are more Office 2007 and later format documents (extensions docx, xlsx). 80% of the files have a unique binary checksum, and the largest binary checksum cluster is of size 23. 38% of the files have a unique document vector, and the largest near-duplicate cluster is of size 161. The presence of multiple versions of files in the SP (SharePoint) collection would probably increase the proportion of near-duplicates somewhat, but the number of files stored in this way was not available.

The similarity of the duplicate spectrum profiles in the two collections indicates that although SharePoint can support document version control, files which were similar or identical to each other were not stored in SharePoint in this way.
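The summary figures quoted in this section (percentage of files with a unique signature and size of the largest cluster) can be derived from a list of per-file signatures, whether binary checksums or word vectors. The sketch below assumes that a "duplicate spectrum" is a histogram of cluster sizes; that reading, the function name and the sample signatures are assumptions made for illustration, not details taken from the profiling tool used in the study.

```python
from collections import Counter


def duplicate_spectrum(signatures: list) -> dict:
    """Summarise duplication in a list of per-file signatures."""
    cluster_sizes = Counter(signatures)         # signature -> number of files
    spectrum = Counter(cluster_sizes.values())  # cluster size -> number of clusters
    total = len(signatures)
    unique_files = spectrum.get(1, 0)           # clusters of size 1 are unique files
    return {
        "unique_fraction": unique_files / total if total else 0.0,
        "largest_cluster": max(cluster_sizes.values(), default=0),
        "spectrum": dict(sorted(spectrum.items())),
    }


# Hypothetical signatures for ten files: five are unique, two share "b",
# and three share "c".
signatures = ["a", "b", "b", "c", "c", "c", "d", "e", "f", "g"]
profile = duplicate_spectrum(signatures)
print(profile)
```

Running the same function once over checksums and once over word vectors gives directly comparable profiles, which is the comparison made between the binary and near-duplicate panels of Figures 1 and 2.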
An example of a near-duplicate cluster of 4 files (in this case identical, with the same name) in 4 different folders is shown in Figure 3.

Figure 3 Example of a cluster of 4 identical files in different SharePoint folders

It may well have been that the storage of multiple copies of the same file was optimal in terms of the progress of the software development. The problem of multiple versions of files, and files modified by different authors, being stored in separate SharePoint folders is illustrated by the cluster of near-duplicate files shown in Figure 4.

Figure 4 Example of a cluster of different versions of files stored in separate SharePoint folders and distinguished by editor initials

Separate folders are used for versions 0.2 and 0.3 of the draft Traceability Matrix, and the approved version is distinguished by its file name, which includes the word Approved, omits the word Draft and includes a date. One variant of the draft of version 0.3 is distinguished by the last editor's initials (RL) being appended to the file name. If this complex convention were adhered to by all users, it would be possible to locate the definitive version of a document, but the version control facilities of SharePoint (at the time new to most of the project staff) were not used, and SharePoint was used very much like a file share by the majority of users.

Conclusions

Analysis of working sets of text documents from two software development environments reveals a modest level of binary checksum duplication (80-85% unique) but a much higher level of near-duplication of text content (38-40% unique), using a word vector similarity measure. Duplication levels did not vary significantly between a file share and a SharePoint document repository.
The lack of reduction in near-duplication when using SharePoint may be attributable to user inexperience, but it underscores the fact that the availability of version control in a document management system does not necessarily mean that it will be used.