Recommendations for uploading data - ETH

ETH Zurich
ETH-Bibliothek
Digital Curation Office
Phone +41 44 632 41 16
[email protected]
www.library.ethz.ch/Digital-Curation
Recommendations for uploading data
The following recommendations apply when files are uploaded manually using the web interface of the
ETH Data Archive (http://data-archive.ethz.ch/deposit).
If this web upload does not meet your needs, please do not hesitate to contact us to discuss further
options.
The first section of this document explains how you may prepare your files and folders to ensure
long-term readability of your data.
For archiving large collections of heterogeneous research data sets over a limited time period, we
currently recommend packing the data into container formats. The second section of this
document explains how to create ZIP or tar containers and recommends suitable tools.
1. Data preparation
Data selection
We recommend selecting the data carefully, so that the archived data is of scientific relevance and
worth archiving over the long term. Please remove unneeded data and avoid storing identical files in
several places, such as ZIP files together with their unzipped contents, multiple backups or temporary
files. Private information does not belong in the ETH Data Archive.
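As an aid for spotting identical files stored in several places, the following minimal Python sketch groups files by the SHA-256 hash of their content. It is an illustration only and not a tool provided by the ETH Data Archive; the function name find_identical_files is hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def find_identical_files(root: str) -> list[list[str]]:
    """Group the files below root that have byte-identical content."""
    by_digest = defaultdict(list)
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large files need not fit in memory.
                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                    digest.update(chunk)
            by_digest[digest.hexdigest()].append(path)
    # Keep only groups with at least two files, in a stable order.
    return sorted(sorted(paths) for paths in by_digest.values() if len(paths) > 1)
```

Each returned group lists paths with the same content. Deciding which copy to keep remains a manual step.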
Choose open formats
To allow for long-term readability of your files, non-proprietary file formats that follow open and
properly-documented standards should be preferred. If you plan to archive your data for more than 10
years, it is recommended to convert unusual file formats into more popular formats. Please consult the
fact sheet File Formats for Archiving for further information on this topic.
Avoid special characters
Avoid special characters in names of files and folders. These characters hamper compatibility
because they lead to undesired effects that depend on the operating system.
Avoid the following characters:
• \/?:*"><|
  These characters are not allowed in Windows file names. If a folder is unpacked by WinZip,
  these characters are usually replaced by underscores.
• Non-ASCII characters, such as ¢ ™ ® ä ö ü à é ô and other characters with diacritics
  If files with such names are packed with WinZip, they are moved to locations outside of their
  original folder due to a flaw in Linux.
The following ASCII characters are permitted:
!#$%&'()+,-.0123456789;=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`
abcdefghijklmnopqrstuvwxyz{}~
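The permitted-character list above can be checked automatically. The following Python sketch reports which characters of a single file or folder name are not on that list. The function name disallowed_characters is hypothetical. Note that the list as printed does not contain the space character, so this sketch flags spaces as well.

```python
# Characters copied from the permitted list above.
PERMITTED = set(
    "!#$%&'()+,-.0123456789;=@ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`"
    "abcdefghijklmnopqrstuvwxyz{}~"
)


def disallowed_characters(name: str) -> list[str]:
    """Return the characters of one file or folder name that are not permitted."""
    return sorted({character for character in name if character not in PERMITTED})
```

The function expects a single name, not a full path, because path separators are themselves not on the list.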
June 18, 2015
Proper use of file extensions
File extensions (such as .txt, .pdf) should be consistent with the file format. Avoid saving files without
file extensions or using special characters in the file extension.
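A simple check for the two problems named above (missing extensions and special characters in extensions) might look like the following Python sketch. The function name extension_issue is hypothetical, and the sketch does not verify that the extension matches the actual file format.

```python
import os
from typing import Optional


def extension_issue(filename: str) -> Optional[str]:
    """Return a short description of a problem with the extension, or None."""
    _stem, extension = os.path.splitext(filename)
    suffix = extension[1:]  # drop the leading dot
    if not suffix:
        return "no file extension"
    # Accept only ASCII letters and digits in the extension.
    if not (suffix.isascii() and suffix.isalnum()):
        return "special characters in file extension"
    return None
```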
Limit the lengths of file and folder names
Avoid overly long path lengths in your folder structure. Long file names combined with a detailed folder
hierarchy may lead to path lengths exceeding 256 characters, which causes some issues for Windows
users 1. Such containers cannot be completely unpacked with WinZip. Effective path lengths are
further increased when special characters are used in file names and when container files are
unpacked within subfolders. We thus recommend using path lengths of less than roughly 200
characters.
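The 200-character recommendation can be checked before packaging. The Python sketch below lists all file paths in a folder and flags those with 200 or more characters. The function names are hypothetical, and measuring the path relative to the parent of the packaged folder is an assumption: the effective length after unpacking also depends on where the container is unpacked.

```python
import os
from typing import Iterable, Iterator


def relative_paths(root: str) -> Iterator[str]:
    """Yield every file path below root, relative to the parent folder of root."""
    parent = os.path.dirname(os.path.abspath(root))
    for current, _subfolders, files in os.walk(root):
        for name in files:
            yield os.path.relpath(os.path.join(current, name), parent)


def overlong_paths(paths: Iterable[str], limit: int = 200) -> list[str]:
    """Return the paths whose length in characters reaches or exceeds the limit."""
    return [path for path in paths if len(path) >= limit]
```

A typical call would be overlong_paths(relative_paths("my_dataset")), where my_dataset is a placeholder folder name.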
2. How to package data into ZIP or tar archives
We currently recommend packing the data into ZIP or tar container files in order to archive large
collections of heterogeneous research data sets in the ETH Data Archive (without active validation
and preservation measures) over a limited time period. Using container files has the advantage that all
files in an archival package are uploaded (and downloaded) in a single batch. Furthermore, the folder
structure remains unchanged.
Data preparation
Even when using file containers, we strongly recommend preparing the data as described in the first
section of this document. The data should be carefully selected and the contents should be
documented. Furthermore, the file formats used should still be readable in 10 or 15 years.
Limit the length of file and folder names
Please consider that the original folder structure may need to be recovered from the container files in
various operating systems. Therefore avoid overly long path lengths when organizing your data. Path
lengths exceeding 256 characters hamper further processing in Windows, and WinZip fails to unpack
such containers. See also the recommendations described in section 1.
Split large data packages
Large data sets can cause difficulties both when uploading your data and when downloading it using
the viewer. We have no influence on several factors that cause these difficulties (such as browser and
internet connection). Uploading data packages up to 15 GB is possible, but downloading packages of
this size with a browser is usually not feasible. Therefore, we recommend using ZIP or tar files
no larger than 2 GB. If your archival package exceeds this size, please split it into
meaningful subunits and use one ZIP or tar container for each subunit. You will then be able to upload
all your container files in a single batch.
Please do not use the split feature of WinZip when splitting your data!
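To get a first idea of how many containers a package needs, the sizes of its subfolders can be grouped so that each group stays within the 2 GB limit. The Python sketch below uses a simple first-fit heuristic. It is only a starting point, because which subunits are meaningful remains a judgement about the content. The names LIMIT_BYTES and group_by_size are hypothetical, and reading 2 GB as 2 * 1024**3 bytes is an assumption.

```python
LIMIT_BYTES = 2 * 1024**3  # assumption: "2 GB" interpreted as 2 GiB


def group_by_size(sizes: dict[str, int], limit: int = LIMIT_BYTES) -> list[list[str]]:
    """Group named items so that each group's total size stays within the limit."""
    groups: list[list[str]] = []
    totals: list[int] = []
    # Place the largest items first (first-fit decreasing heuristic).
    for name, size in sorted(sizes.items(), key=lambda item: (-item[1], item[0])):
        for index, total in enumerate(totals):
            if total + size <= limit:
                groups[index].append(name)
                totals[index] += size
                break
        else:
            # No existing group has room. An item larger than the limit
            # also ends up alone here and must be subdivided by hand.
            groups.append([name])
            totals.append(size)
    return groups
```

The input maps a name (for example a subfolder) to its size in bytes; each returned group corresponds to one container file.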
General comments on creating container files
• Only use archives with the extensions .zip or .tar (do not use .7z, .tar.gz, .rar, and so on).
• If you create ZIP archives, please zip your data without any data compression.
• Avoid encrypting your files.
[1] For file names, the length is limited by most operating systems to less than 256 characters.
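The comments above on uncompressed, unencrypted ZIP containers can be illustrated with Python's standard zipfile module, shown here as an alternative sketch for users who have Python available. It is not one of the recommended tools, and the function name zip_without_compression is hypothetical. The zipfile module cannot create encrypted archives, so the result is unencrypted.

```python
import os
import zipfile


def zip_without_compression(folder: str, archive_path: str) -> None:
    """Pack folder into a ZIP container, storing all files uncompressed."""
    parent = os.path.dirname(os.path.abspath(folder))
    # ZIP_STORED writes file contents as they are, without data compression.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for current, _subfolders, files in os.walk(folder):
            for name in sorted(files):
                path = os.path.join(current, name)
                # Store paths relative to the parent so the top-level folder
                # is kept. Empty folders are not recorded by this sketch.
                archive.write(path, arcname=os.path.relpath(path, parent))
```

The archive should be written to a location outside the folder being packed, so that it does not include itself.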
Container formats and suitable software tools
We suggest selecting a format for your container files that is convenient to create on your operating
system. On a Windows operating system you may generate ZIP containers whereas Mac OS users
usually prefer creating tar containers.
The tar format is preferred for long-term archiving because it is an openly-documented format that
does not depend on a single producer.
Windows:
  Format: .zip, uncompressed
  Recommended tool: 7-Zip [2]
Mac:
  Format: .tar
  Recommended tool: Keka [3], or use the "tar" function on the command line
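Where Python is available, a plain .tar container can also be created with the standard tarfile module. The sketch below is an illustration alongside the tools listed above, not a replacement for them; the function name make_tar is hypothetical.

```python
import os
import tarfile


def make_tar(folder: str, archive_path: str) -> None:
    """Pack folder, including its subfolders, into an uncompressed tar container."""
    # Mode "w" writes a plain tar file without gzip, bzip2 or xz compression.
    with tarfile.open(archive_path, "w") as archive:
        # Keep only the folder's own name as the top-level entry in the container.
        archive.add(folder, arcname=os.path.basename(os.path.normpath(folder)))
```

The archive path should end in .tar and lie outside the folder being packed.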
[2] Download is free of charge at http://www.7-zip.de/ (accessed 03.03.2015). Please contact your IT support.
[3] Download is free of charge at http://www.kekaosx.com/de/ (accessed 03.03.2015). Please contact your IT support.