Tuesday, September 25, 2012

Using rsync over multi-hop ssh

Rsync is a wonderful tool; I can't believe that I wasn't using it years ago.

Nowadays I have a couple of large disks, well actually a couple of IcyBox IB-RD4320StU3 hard disk enclosures each with 2 disks in - of course!  Out of a healthy paranoia I keep one of these enclosures on site and one off site.  But how to keep them in sync?

Well, it's not completely automated.
On the one hand I keep a list of checksums (md5sum) of the files, so that I can detect duplicates and compare against previous checksum lists.  I have a Perl script for that which I must put in my GitHub account some day ... as soon as I've made the 3600 line script readable.  Don't hold your breath.  It might be quicker for me to rewrite it in Python.
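The duplicate-detection idea can be sketched in a few lines of shell. This is not the Perl script mentioned above, just a minimal illustration; the function name find_duplicates is made up for the example, and it assumes GNU md5sum and file names without newlines.

```shell
# Hypothetical helper (not the author's script): print every file under
# directory $1 whose MD5 checksum is shared with at least one other file.
find_duplicates() {
    find "$1" -type f -exec md5sum {} + |
        sort |
        awk '{
            count[$1]++                      # occurrences of each checksum
            lines[$1] = lines[$1] $0 "\n"    # remember "checksum  path" lines
        }
        END {
            for (sum in count)
                if (count[sum] > 1)
                    printf "%s", lines[sum]
        }'
}

# Usage: find_duplicates /DISK1
```

Saving the output of the md5sum step to a file and running diff against an older list would cover the "compare against previous lists" part.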

If you want to cut to the chase, how to do rsync over multi-hop ssh, scroll down to the last section ...

I recently started using rsync to synchronize these disks when I have them on the same site, but this is not practical when they're on separate sites.

It's not my intention to give an rsync tutorial here, but simply to show a few steps on the way to the commands I use for multi-hop rsync usage.

rsync from DISK1 to DISK2 on same machine
So suppose I have DISK1 and DISK2 mounted on /DISK1 and /DISK2 respectively on the same machine.  To sync changes from DISK1 to DISK2:
    rsync -av /DISK1/ /DISK2/

Any newer files or file changes on DISK1 will be propagated to DISK2.
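The trailing slash on the source path matters: with it, rsync copies the contents of the directory; without it, rsync copies the directory itself into the destination. A small sketch using scratch directories (created with mktemp as stand-ins for the real mount points, which are not touched) shows the difference:

```shell
# Scratch directories stand in for the real mount points /DISK1 and /DISK2.
demo=$(mktemp -d)
mkdir -p "$demo/DISK1/DIR1" "$demo/DISK2" "$demo/DISK3"
echo hello > "$demo/DISK1/DIR1/FILE1"

# Trailing slash on the source: the *contents* of DISK1 land in DISK2.
rsync -a "$demo/DISK1/" "$demo/DISK2/"

# No trailing slash on the source: the directory DISK1 itself lands in DISK3.
rsync -a "$demo/DISK1" "$demo/DISK3/"

# List what ended up where.
find "$demo/DISK2" "$demo/DISK3" -type f
```

The first copy produces DISK2/DIR1/FILE1, the second DISK3/DISK1/DIR1/FILE1, which is rarely what you want when mirroring one disk onto another.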

Rsync has many many options to consider.
Here are a few of them.

-n: is your friend!  You should use this option a lot as it says "do nothing" (a dry run), allowing you to see what your rsync command would do.   It's very useful for performing comparisons, or for doing a sanity check of your command before an erroneous command deletes valuable files ...

--progress: shows the progress of the transfers.  At any moment we see the percentage transferred of the file currently being transferred, as well as how many more *known* files are left to check.  Note: rsync examines the disk hierarchies as it works, so the total number of files it knows about increases with time and you will see jumps in the number of files left to synchronize as new directories are discovered.

--delete: deletes from DISK2 any files which do not exist on DISK1.

    rsync -av --progress --delete /DISK1/ /DISK2/
will mirror from DISK1 to DISK2 showing progress.

To compare 2 disks I use
    rsync -anv --delete /DISK1/ /DISK2/

Below is an example of the output where both disks have directories DIR1 and DIR2,
but DISK1 has extra files DIR1/FILE1 and DIR2/FILE2,
     it also has DIR2/FILE1 which is more recent than the same file on DISK2,
and DISK2 has an extra file DIR1/FILE3.

  rsync -anv --delete /DISK1/ /DISK2/
sending incremental file list
deleting DIR1/FILE3

sent 129 bytes  received 32 bytes  322.00 bytes/sec
total size is 0  speedup is 0.00 (DRY RUN)

We see that new files DIR1/FILE1, DIR2/FILE2 would be created in DISK2,
the file DIR2/FILE1 would be updated as the copy on DISK1 is more recent,
and DIR1/FILE3 would be deleted on DISK2 because it doesn't exist on DISK1.

Note that we see the names of the directories as they are traversed whether they are different or not.
So to see changes I tend to do
  rsync -anv --delete /DISK1/ /DISK2/ | grep -v '/$'

to see only the file changes.

Of course none of these updates actually happen because we used the -n option.
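To see what the grep filter does in isolation, it can be run against a small listing. The listing below is invented for illustration (it is not captured rsync output) and simply reuses the file names from the example; sample_listing and files_only are hypothetical names.

```shell
# Made-up dry-run style listing, for illustration only (not real rsync output).
sample_listing() {
    printf '%s\n' \
        'deleting DIR1/FILE3' \
        './' \
        'DIR1/' \
        'DIR1/FILE1' \
        'DIR2/' \
        'DIR2/FILE1' \
        'DIR2/FILE2'
}

# Drop every line ending in "/" (the directories), keeping only file changes.
# The pattern is quoted so the shell passes it to grep untouched.
files_only() {
    grep -v '/$'
}

sample_listing | files_only
```

Only the "deleting" line and the three file lines survive the filter; the directory lines are dropped.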

rsync from DISK1 to DISK2 on different machines

So now if I want to do this over an ssh connection between two machines, possibly at two different sites, I can do

  rsync -anv --delete /DISK1/ user@home:/DISK2/ | grep -v '/$'

I recommend configuring your ssh keys beforehand so that you're not prompted for passwords here.

By default rsync assumes ssh as the underlying transport but this can be changed using various options.  In the next step we'll use the -e option to specify our transport as a multi-hop ssh connection.

rsync from DISK1 to DISK2 on different machines across a multi-hop ssh connection

OK, we got there at last.  The point of this article is to see how we can do the same thing when we can't connect directly to the target machine but have to connect to an intermediate machine before performing a second ssh connection to the target machine.

In my case I have my Raspberry Pi which I can connect to at the address 'home' and from there I can connect to the final 'target' machine where my DISK2 is.

So to connect from my location to the target by ssh I either have to do
    ssh user@home
followed by
    home> ssh user@target

or directly by
    ssh -A -t user@home ssh -A -t user@target

So thanks to the -e parameter of the rsync command, which allows us to specify the underlying transport, we can use this same multi-hop ssh chain to connect between machines.

Our rsync 'mirror comparison' command now becomes
  rsync -anv --delete -e "ssh -A -t user@home ssh -A -t user@target" /DISK1/ :/DISK2/ | grep -v '/$'

Note that now we specify the DISK2 path as being :/DISK2/ indicating that it is on the remote end of the connection.

That's quite a handful and you may wish to set up an alias for all or part of this command if you plan to use it a lot.
    alias rs12='rsync -anv --delete -e "ssh -A -t user@home ssh -A -t user@target" /DISK1/ :/DISK2/'
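An alias cannot take arguments in the middle of the command, so a shell function is one possible alternative. The sketch below keeps the name rs12 and the user@home / user@target placeholders from above; the "go" argument and the RSYNC variable are hypothetical additions for this example, not rsync features.

```shell
# Sketch of a function alternative to the alias. "go" and RSYNC are
# hypothetical conveniences invented for this example.
rs12() {
    dry_run=-n                      # default: compare only, change nothing
    if [ "${1:-}" = "go" ]; then
        dry_run=                    # "go" given: really mirror
        shift
    fi
    # RSYNC lets the command be swapped out (e.g. RSYNC=echo to inspect it).
    # $dry_run is left unquoted on purpose so it vanishes when empty.
    "${RSYNC:-rsync}" -av $dry_run --delete \
        -e "ssh -A -t user@home ssh -A -t user@target" \
        "$@" /DISK1/ :/DISK2/
}

# Usage:
#   rs12                 # dry run, same as the alias
#   rs12 go --progress   # real mirror, with extra options passed through
```

Defaulting to a dry run means a mistyped command compares rather than deletes.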

I thoroughly recommend looking at rsync's man pages to see the many options and capabilities of this wonderful tool.


Anonymous said...

For the last commands, this will work if we issue them on the source machine. What if we want to issue them on the target machine (to pull the files)?

Imagine that, due to a firewall, the source cannot reach the target, but the target can reach the source.

eu4 console commands said...

If you rename a directory and then run rsync again, does it make a backup of that directory? I suppose you could use the --delete feature, but if it is a large directory (say 500GB) that is a lot of data to rsync again when the only thing that changed was the directory name. If you never rename existing files and folders then rsync works really well.

mjbright said...

Yes, you are right.

If you rename a directory rsync will effectively make a copy.

If you use --delete that will produce the desired "result" in terms of resulting files/dirs but as you say rsync would do this by unnecessarily copying that possibly large directory across (and delete the differently named copy already on the remote).

mjbright said...

Replying to Anonymous, well swapping source and target would work fine but only if you still have the ssh access to the jump host in the "reverse" direction - that may be the case ...
