Command Line Recipes
Copying very large files with error handling
Once in a while I need to copy multi-gigabyte files over my very slow internet
connection to one of the servers. That means the copy runs for days, saturating
my network, and any failure means redoing the whole copy. So, for example, if
you use plain scp
scp bigfile server.example.com:
and the connection fails when you are 99% done, you have to start all over
again.
My perfect solution would be some scheme where, on failure, the data already
copied to the server is validated and the copy resumes from that point. That would
let me stop the copy manually at any time, get my network bandwidth back when I
need it, and restart the copy later with minimal recopying.
I'm sure there are many solutions, probably some better; some kind of super copy
with resume. But the following works for me: I use the unix command split to break
the file into smaller pieces, copy the pieces with rsync, then join them back
together on the target with cat.
Note: this requires twice as much free space on both the source and the target,
since we split the large file into several smaller pieces, copy them, then join them
back together.
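A quick way to confirm there is enough room on each side (assuming, as in the
commands below, that the working directory will live under /tmp on both machines)
is to check free space before starting:
df -h /tmp
ssh server.example.com 'df -h /tmp'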
Assuming the file to be copied is named bigfile and the target server is
server.example.com, the following works. I create a temporary directory on both
source and target named /tmp/copying.
mkdir -p /tmp/copying
split -n676 bigfile /tmp/copying/bigfile.
rsync -avc --progress /tmp/copying server.example.com:/tmp
ssh server.example.com 'cat /tmp/copying/bigfile* > ~/bigfile'
ssh server.example.com 'rm -fRv /tmp/copying'
rm -fRv /tmp/copying
The first line simply creates the requisite local working directory. This directory
must be on a partition large enough to hold the entire file, in 676 chunks.
The second line splits bigfile into 676 chunks of approximately equal size, storing
them as /tmp/copying/bigfile.aa through /tmp/copying/bigfile.zz; aa through zz gives
676 possible suffixes, i.e. 26 * 26.
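If you want to sanity-check the split before starting the copy, a quick count of the
chunks should come back as 676, and the shell glob lists them in the same order cat
will later use to join them:
ls /tmp/copying/bigfile.* | wc -l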
Line three is the workhorse. It may be started and stopped at will. Stopping the
rsync command and restarting it will first verify that all files already copied to the
server are correct (the -c option uses checksums to do the match), then resume
copying any files which have not yet transferred successfully. Thus the worst case is
stopping the copy just before the last byte of a chunk arrives, then starting it back
up again: that whole chunk gets recopied, but it is only 1/676th of the total file size,
so little is lost. Note that the --progress flag gives some more information about how
far along each file transfer is, and that the -c option takes a lot of time every time
the sync is restarted. I generally do NOT use -c during copying, but do one final
checksummed pass at the end to make sure the copies succeeded.
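In practice that can look like running the command from line three twice: without
-c for the copying passes (restart it as often as needed), then once more with -c as
the final verification pass:
rsync -av --progress /tmp/copying server.example.com:/tmp
rsync -avc --progress /tmp/copying server.example.com:/tmp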
Line four concatenates all the chunks together and writes the output to ~/bigfile (a
file named bigfile in the user's home directory on the server).
Lines five and six simply clean up the temporary directories on both machines.
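For convenience, here is the whole recipe collected into one rough script. The
variable names (SRC_FILE, TARGET_HOST, WORKDIR) are placeholders of my own
choosing, and the sha256sum comparison at the end is an extra sanity check rather
than part of the recipe above; adjust to taste.
#!/bin/sh
# Sketch of the recipe above; the variable names are placeholders.
SRC_FILE=bigfile
TARGET_HOST=server.example.com
WORKDIR=/tmp/copying

mkdir -p "$WORKDIR"
split -n676 "$SRC_FILE" "$WORKDIR/$SRC_FILE."

# Run (and re-run after any interruption) without -c, then verify with -c.
rsync -av --progress "$WORKDIR" "$TARGET_HOST":/tmp
rsync -avc --progress "$WORKDIR" "$TARGET_HOST":/tmp

# Reassemble on the target, then clean up both sides.
ssh "$TARGET_HOST" "cat $WORKDIR/$SRC_FILE.* > ~/$SRC_FILE"
ssh "$TARGET_HOST" "rm -fRv $WORKDIR"
rm -fRv "$WORKDIR"

# Extra check, not in the recipe above: the two checksums should match.
sha256sum "$SRC_FILE"
ssh "$TARGET_HOST" "sha256sum ~/$SRC_FILE"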
Analysis
The big decision is the chunk size. I simply told split to create the largest number of
files possible with its default two-letter suffixes, 'aa' through 'zz', which is 676. That
seems like a good number, since keeping more than a thousand files in one directory
can cause real filesystem headaches, and if bigfile is 10 gig, this results in chunks of
about 15M each. A chunk that size takes about 5 minutes to copy on my connection,
so it is acceptable to me.
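The arithmetic is easy to check from the shell; for a 10 gig file split into 676 chunks,
the per-chunk size in KiB is:
echo $(( 10 * 1024 * 1024 / 676 ))   # prints 15511, i.e. roughly 15M per chunk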
Using smaller chunks (and thus more of them) would result in less recopying after
an interruption. Increasing the number of chunks to 1024 would mean each chunk of
a 10G file would be about 10M. You would need to increase the suffix length with the
-a parameter, changing the above split command to:
split -a3 -n1024 bigfile /tmp/copying/bigfile.
That would give you files from bigfile.aaa to bigfile.bnj, 1024 in all. With the -a3
parameter you can create up to 17576 (26*26*26) files, from .aaa to .zzz, but of
course most file systems are going to barf on that many files in one directory.
However, each chunk would then be only about 597k in size!
Unique solution ID: #1244
Author: Rod
Last update: 2015-08-23 23:13