Tuesday, September 25, 2012
Using rsync over multi-hop ssh
Rsync is a wonderful tool; I can't believe that I wasn't using it years ago.
Nowadays I have a couple of large disks, well actually a couple of IcyBox IB-RD4320StU3 hard disk enclosures each with 2 disks in - of course! Out of a healthy paranoia I keep one of these enclosures on site and one off site. But how to keep them in sync?
Well, it's not completely automated.
On the one hand I keep a list of cksums (md5sum) of the files to be able to detect duplicates and compare against previous cksum lists. I have a Perl script for that which I must put in my github account some day ... as soon as I've made the 3600 line script readable. Don't hold your breath. It might be quicker for me to rewrite it in Python.
If you want to cut to the chase, how to do rsync over multi-hop ssh, scroll down to the last paragraph ...
I recently started using rsync to synchronize these disks when I have them on the same site, but this is not practical when they're on separate sites.
It's not my intention to give an rsync tutorial here, but simply to show a few steps on the way to the commands I use for multi-hop rsync usage.
rsync from DISK1 to DISK2 on same machine
So suppose I have DISK1 and DISK2 mounted on /DISK1 and /DISK2 respectively on the same machine. To sync changes from 1 to 2:
rsync -av /DISK1/ /DISK2/
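Any files that are new or changed on DISK1 will be propagated to DISK2. By default rsync decides by comparing size and modification time, so a file that differs is overwritten on DISK2 even if the DISK2 copy happens to be the newer one (add -u if you want to skip those).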
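The trailing slashes in that command matter, and a quick experiment on scratch directories shows why. This is only a sketch: the temporary paths and the with_slash / without_slash names are made up for illustration, and the rsync calls are skipped if rsync isn't installed.

```shell
# Scratch directories stand in for the real mount points (hypothetical paths).
work=$(mktemp -d)
mkdir -p "$work/DISK1/DIR1" "$work/with_slash" "$work/without_slash"
echo "hello" > "$work/DISK1/DIR1/FILE1"

if command -v rsync >/dev/null 2>&1; then
    # Trailing slash on the source: copy the *contents* of DISK1.
    rsync -a "$work/DISK1/" "$work/with_slash/"
    # No trailing slash: copy the directory itself, creating DISK1/ inside the target.
    rsync -a "$work/DISK1" "$work/without_slash/"
    # List the resulting files so the two layouts can be compared.
    find "$work/with_slash" "$work/without_slash" -type f
fi
```

With the slash the file should land at with_slash/DIR1/FILE1; without it, at without_slash/DISK1/DIR1/FILE1, which is rarely what you want when mirroring one disk onto another.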
Rsync has many many options to consider.
Here are a few of them.
-n: is your friend! Use this option a lot: it says "do nothing", letting you see what your rsync command would do without touching any files. It's very useful for performing comparisons, or for sanity-checking a command before it deletes valuable files by mistake ...
--progress: shows the progress of the transfers. At any moment we see the percentage transferred of the file currently being copied, as well as how many more *known* files are left to check. Note: rsync examines the disk hierarchies as it works, so the total number of files it knows about increases with time and you will see jumps in the number of files left to synchronize as new directories are discovered.
--delete: deletes from DISK2 any files which do not exist on DISK1.
rsync -av --progress --delete /DISK1/ /DISK2/
will mirror from DISK1 to DISK2 showing progress.
To compare 2 disks I use
rsync -anv --delete /DISK1/ /DISK2/
Below is an example of the output where both disks have directories DIR1 and DIR2, but:
DISK1 has extra files DIR1/FILE1 and DIR2/FILE2,
DISK1 also has DIR2/FILE1, which is more recent than the same file on DISK2,
and DISK2 has the extra file DIR1/FILE3.
rsync -anv --delete /DISK1/ /DISK2/
sending incremental file list
sent 129 bytes received 32 bytes 322.00 bytes/sec
total size is 0 speedup is 0.00 (DRY RUN)
We see that new files DIR1/FILE1, DIR2/FILE2 would be created in DISK2,
the file DIR2/FILE1 would be updated as the copy on DISK1 is more recent,
and DIR1/FILE3 would be deleted on DISK2 because it doesn't exist on DISK1.
Note that we see the names of the directories as they are traversed whether they are different or not.
So to see changes I tend to do
rsync -anv --delete /DISK1/ /DISK2/ | grep -v /$
to see only the file changes.
Of course none of these updates actually happen because we used the -n option.
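If you'd like to try this comparison without risking real disks, the scenario above can be rebuilt in a scratch directory. This is a rough sketch: the temporary paths are stand-ins for /DISK1 and /DISK2, the file contents are invented, and the rsync call is skipped if rsync isn't installed.

```shell
# Recreate the example layout under a temporary directory (hypothetical paths).
work=$(mktemp -d)
mkdir -p "$work/DISK1/DIR1" "$work/DISK1/DIR2" "$work/DISK2/DIR1" "$work/DISK2/DIR2"

# Files that exist only on DISK1.
echo one > "$work/DISK1/DIR1/FILE1"
echo two > "$work/DISK1/DIR2/FILE2"
# A file present on both sides; the DISK2 copy gets an older timestamp.
echo new > "$work/DISK1/DIR2/FILE1"
echo old > "$work/DISK2/DIR2/FILE1"
touch -t 201201010000 "$work/DISK2/DIR2/FILE1"
# A file that exists only on DISK2.
echo three > "$work/DISK2/DIR1/FILE3"

if command -v rsync >/dev/null 2>&1; then
    # Dry run: list what a mirror would change, hiding lines that end in "/".
    rsync -anv --delete "$work/DISK1/" "$work/DISK2/" | grep -v '/$'
fi
```

Because of -n, the scratch DISK2 is left exactly as it was: FILE3 is still there and DIR2/FILE1 still holds the old content. Drop the n only once the listing looks right.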
rsync from DISK1 to DISK2 on different machines
So now if I want to do this over an ssh connection between two machines, possibly at two different sites, I can do
rsync -anv --delete /DISK1/ user@home:/DISK2/ | grep -v /$
It's best to have your ssh keys configured beforehand so that you're not prompted for passwords here.
By default rsync assumes ssh as the underlying transport but this can be changed using various options. In the next step we'll use the -e option to specify our transport as a multi-hop ssh connection.
rsync from DISK1 to DISK2 on different machines across a multi-hop ssh connection
OK, we got there at last. The point of this article is to see how we can do the same thing when we can't connect directly to the target machine but have to connect to an intermediate machine before performing a second ssh connection to the target machine.
In my case I have my Raspberry Pi which I can connect to at the address 'home' and from there I can connect to the final 'target' machine where my DISK2 is.
So to connect from my location to the target by ssh I either have to log in to 'home' first and then, from there, do
home> ssh user@target
or go there in one step with
ssh -A -t user@home ssh -A -t user@target
The -e parameter of the rsync command lets us specify the underlying transport, so we can tell rsync to use this same multi-hop ssh chain to connect between the machines.
Our rsync 'mirror comparison' command now becomes
rsync -anv --delete -e "ssh -A -t user@home ssh -A -t user@target" /DISK1/ :/DISK2/ | grep -v /$
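Note that now we specify the DISK2 path as being :/DISK2/ with nothing before the colon: the hosts are already named in the -e command, and the leading colon indicates that the path is on the remote end of the connection.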
That's quite a handful, and you may wish to set up an alias for all or part of this command if you plan on using it a lot.
alias rs12='rsync -anv --delete -e "ssh -A -t user@home ssh -A -t user@target" /DISK1/ :/DISK2/'
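An alias is fixed at dry-run mode, so a small shell function may be handier if you want one name for both comparing and mirroring. The following is only a sketch under my own assumptions: multihop_transport and the "apply" keyword are names invented here, and HOP and TARGET simply reuse the placeholder hosts from the commands above.

```shell
# Placeholder hosts taken from the article's examples; adjust to your own setup.
HOP="user@home"
TARGET="user@target"

# Build the two-hop ssh command string passed to rsync's -e option.
multihop_transport() {
    printf 'ssh -A -t %s ssh -A -t %s' "$1" "$2"
}

# rs12         : dry-run comparison (nothing is changed)
# rs12 apply   : really mirror /DISK1/ onto the remote /DISK2/
# Any further arguments (e.g. --progress) are passed through to rsync.
rs12() {
    if [ "${1:-}" = "apply" ]; then
        shift
        rsync -av --delete -e "$(multihop_transport "$HOP" "$TARGET")" "$@" /DISK1/ :/DISK2/
    else
        rsync -anv --delete -e "$(multihop_transport "$HOP" "$TARGET")" "$@" /DISK1/ :/DISK2/
    fi
}
```

Typing rs12 alone then behaves like the alias, while rs12 apply --progress performs the actual mirror. Keeping the dry run as the default means a slip of the fingers compares rather than deletes.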
I thoroughly recommend looking at rsync's man pages to see the many options and capabilities of this wonderful tool.