Risk of Ruin: Rsync Magic

I do not pretend to be anything other than an rsync tourist. But I have had need of its services on two occasions, today being the latest, and I found it frustrating how difficult it is to get simple and clear information on how to set up its filtering to get all the files you want and nothing else. Both times I've needed it, I've had to do a lot of relearning and searching.

I won't be solving the world's rsync problems on this blog. But if nothing else I'll have another page to find on the internet the next time I need rsync, and with any luck it will answer my questions. So here are the two ways I've used rsync, and how I set it up.

Task #1: Back up certain files to the NAS
The McHouse Enterprise Computing Cluster includes a machine running FreeNAS where I store all of our backups, MP3s, ISO images, and other assorted stuff I or the bintgoddess might need in our constant quest for entertainment. I use Bacula to do full hard drive backups of the Windows 7 laptop she uses at school, her Windows XP desktop, and my Ubuntu desktop. This is great, but certain files need more frequent backups than this, such as my machine's home directory.

To do this, first I enabled the rsync service in FreeNAS. Within FreeNAS' configuration web page, I set up an rsync path called HildeMark for my home directory. In the image below, I've cropped off the stuff I didn't change.

[Image: settings for the home directory rsync path]

Then on my machine, I added the following command to my crontab to run at 4:15pm Monday-Friday:

/usr/bin/rsync -aFx --delete /home/mark/ kinakuta::HildeMark

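The rsync manpage has the precise meaning of each of these parameters, so I won't repeat it all here. In a nutshell: -a (archive) recurses and preserves permissions, times, ownership, and symlinks; -F tells rsync to honor per-directory .rsync-filter files; -x keeps it from crossing filesystem boundaries; and --delete removes files from the destination that no longer exist in the source. The net effect is that the contents of HildeMark's physical location should look exactly like my home directory right after the transfer completes.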
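The schedule itself lives in the fields that precede the command in the crontab. As a rough sketch, assuming the standard five-field crontab layout, 4:15pm on weekdays could be written as follows. The variable name entry is only there for illustration.

```shell
# Five schedule fields, then the command:
#   minute  hour  day-of-month  month  day-of-week
#   15      16    *             *      1-5     -> 16:15, Monday through Friday
entry='15 16 * * 1-5 /usr/bin/rsync -aFx --delete /home/mark/ kinakuta::HildeMark'

# Print the line so it can be pasted into the editor opened by `crontab -e`.
printf '%s\n' "$entry"
```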

But then I noticed that I was copying a bunch of stuff I didn't want, like gigantic source trees from work (those are backed up at the office, no need for me to do it again here) and Firefox's cache, which likewise doesn't need backing up. That's where the .rsync-filter file comes in. In each directory traversed by rsync, you can optionally create a file called .rsync-filter containing exclusions of things that should not be backed up at or below the current level. For example, my very simple ~/.mozilla/.rsync-filter excludes Firefox's cache from the backup:

exclude Cache/

Similarly the .rsync-filter in my ~/src directory excludes anything that is source-controlled, since the SVN server's storage is already backed up and that's the master copy. I can adjust any directory's .rsync-filter without having to worry about whether my changes have unintended side-effects on some other directory - it only affects file selection within that directory sub-tree.
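To see the per-directory behavior without touching a real home directory, here is a small sketch that runs against throwaway directories. The file names under .mozilla are made up for the demonstration, and the copy is local rather than to a NAS.

```shell
# Scratch source and destination; nothing here touches a real home directory.
src=$(mktemp -d)
dst=$(mktemp -d)

# A pretend profile with a cache directory that should not be backed up.
mkdir -p "$src/.mozilla/Cache" "$src/.mozilla/profile"
echo "junk"  > "$src/.mozilla/Cache/blob"
echo "prefs" > "$src/.mozilla/profile/prefs.js"

# The filter file only affects .mozilla and everything beneath it.
printf 'exclude Cache/\n' > "$src/.mozilla/.rsync-filter"

# -F makes rsync read the .rsync-filter files it meets while traversing.
rsync -aF "$src/" "$dst/"

# List what arrived in the destination.
(cd "$dst" && find . | sort)
```

With a single -F the .rsync-filter file itself is copied along with everything else, which is handy for a backup since the rules get backed up too.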

Task #2: Roll Out Updates to a Customer
I just started a new project at the office wherein I am writing library code to be immediately used by another developer. For reasons upon which I cannot elaborate, this developer cannot simply be given SVN access to the full source code tree. He gets headers and libraries only for the things he needs to build. Everything else is held back. Additionally, since he will be developing against library code that I am actively writing, I can't just let him have whatever file I just saved... I need to at least make sure it compiles before I give it to him. Likewise, he needs to control when he updates his copy of my library so that he can reach a good stopping place in his own code before having the library change on him.

This guy is basically a customer, so I need a staging area where I place the files he is allowed to see. I do this when it is appropriate, and then when he's ready he copies everything from that staging area to his own development machine. Most times, a small subset of files - or a small portion of a file - have changed.

Again I started by setting up an rsync daemon, but this time I didn't have the luxury of using FreeNAS' administrative web page. On my development workstation, I created the following configuration in /etc/rsync.config:

lock file = /var/run/rsync.lock
log file = /var/log/rsync.log

[snapshot]
path = /staging
uid = nobody
gid = nobody
read only = yes
list = yes
hosts allow = 172.0.0.99

This creates a visible path called "snapshot" with storage at /staging. I made it read-only since there is no reason for him to upload to me, and gave his IP address access to it without him needing a password to my machine. The rest is pretty boilerplate.

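Next I started rsync in daemon mode with the logical syntax of: rsync --daemon. With no other options the daemon looks for its configuration in /etc/rsyncd.conf, so a file stored under any other name has to be passed with --config=FILE.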
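A quick way to check a daemon configuration before pointing anyone at it is to run a scratch copy on an unprivileged port and ask it for its module list. This is only a sketch under several assumptions: the temporary paths, the port 8730, and the trimmed-down config (no uid/gid switch and no chroot, since a non-root test daemon can do neither) are all mine, not part of the real setup.

```shell
# Scratch area standing in for /staging.
work=$(mktemp -d)
mkdir -p "$work/staging"
echo "hello" > "$work/staging/README"

# Cut-down config with the same [snapshot] module idea.
cat > "$work/rsyncd.conf" <<EOF
log file = $work/rsync.log
pid file = $work/rsync.pid

[snapshot]
path = $work/staging
use chroot = no
read only = yes
list = yes
EOF

# Start the daemon as a background job so it is easy to stop again.
rsync --daemon --no-detach --config="$work/rsyncd.conf" \
      --address=127.0.0.1 --port=8730 &
daemon_pid=$!
sleep 1

# Naming no module at all makes the daemon list the modules it exports.
modules=$(rsync rsync://127.0.0.1:8730/)
echo "$modules"

# Shut the test daemon down.
kill "$daemon_pid"
wait "$daemon_pid" 2>/dev/null || true
```

If the listing shows snapshot, the module definition parsed and the daemon is answering. Access rules such as hosts allow still need checking from the real client machine.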

To copy into this location, I needed to reproduce my source tree but redact those things that he didn't need - namely unrelated modules and all C++ implementation files. The paths to various headers should not change so that cascading #include directives continue to work nicely. I also needed to take all the libraries scattered throughout my source tree and assemble them in a single directory to make it easier for him to link to them. To do these two things, I wrote a little script:

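#!/bin/bash
# Mirror the source tree into the staging area, applying the rules in filter.txt.
rsync --progress --stats --recursive --times --delete \
    --exclude-from="$HOME/rsync/filter.txt" \
    "$HOME/src/proj1/source/" /staging/source
# Collect every static library into one directory, preserving timestamps (-p).
mkdir -p /staging/lib
find "$HOME/src/proj1/source" -name "*.a" -exec cp -p {} /staging/lib/ \;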

The rsync command gives me progress and statistics, deleting any files from the target directory that I remove from my source directory, and filtering based on the rules in ~/rsync/filter.txt. I'll get to that in a moment.

The find command returns all the files ending in ".a" under ~/src/proj1/source, which happen to be exactly the static libraries I want him to be able to link against. Using find's -exec syntax, it copies each file to /staging/lib while preserving the modified date/time. This is important, because otherwise when I run this script it will make it appear as though I changed all the libraries when maybe I only changed one or two.
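The effect of cp -p is easy to confirm on a scratch tree. The sketch below uses invented library names and a backdated file so that a copy made without -p would stand out. One caveat of flattening everything into a single directory is that two libraries with the same file name in different modules would overwrite each other.

```shell
# Scratch tree with two static libraries at different depths.
work=$(mktemp -d)
mkdir -p "$work/source/modA" "$work/source/modB/sub" "$work/lib"
echo "one" > "$work/source/modA/libone.a"
echo "two" > "$work/source/modB/sub/libtwo.a"

# Backdate one library (2010-10-01 12:00) so a fresh timestamp would be obvious.
touch -t 201010011200 "$work/source/modA/libone.a"

# Same shape as the find line in the script: flatten every *.a into one place.
find "$work/source" -name "*.a" -exec cp -p {} "$work/lib/" \;

# With -p the copy is neither newer nor older than the original.
if [ ! "$work/lib/libone.a" -nt "$work/source/modA/libone.a" ] &&
   [ ! "$work/lib/libone.a" -ot "$work/source/modA/libone.a" ]; then
    echo "timestamp preserved"
fi
```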

And here's a representation of the filter.txt file with some details removed for confidentiality.

- **/.svn*
- CMakeFiles*
- /build
- /secret1
- /Libs/secretLib
- /secret2
+ /examples/example1/main.cpp
+ /Tools/tool1/*.cpp
+ /Tools/toolsuite1/*/*.cpp
+ **/
+ **/*.h
+ **/*.inl
- *

Leading forward-slashes (/) refer to the top of the source directory, not the top of the volume. So "/build" really means "~/src/proj1/source/build".
"-" at the beginning of the line means "exclude stuff matching this pattern."
"+" at the beginning means "include stuff matching this pattern."
The "**" token matches any run of characters, including the slashes between directories, so a pattern that uses it applies at every depth of the tree. So "- **/.svn*" will exclude x/.svn-info, x/y/.svn-stuff, and x/y/z/.svn.
"- CMakeFiles*" excludes all files and directories starting with "CMakeFiles" - this is where a lot of the temporary build scripts go, so this line removes a ton of useless gunk.
The next four exclusion lines keep rsync from descending into the named directories. It is free to descend into Libs, just not Libs/secretLib.
The next two lines explicitly add some C++ implementation files he's allowed to have and might find useful. The third inclusion line adds C++ implementation files from all the directories within toolsuite1.
"+ **/" is a catch-all rule to say: descend into every directory from here if you haven't matched an exclusion rule yet. It only matches directories, so it controls where files are taken from, not which files.
"+ **/*.h" and "+ **/*.inl" explicitly specify that all files ending in ".h" and ".inl" in all subdirectories (if not previously excluded) should get copied.
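"- *" means "and nothing else". Order matters here: rsync acts on the first rule that matches a given file or directory, so this catch-all has to come last.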
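Rules like these are much easier to trust after trying them on a toy tree. The following sketch builds a tiny stand-in source directory and applies a shortened version of the filter. Every file and directory name in it is invented for the demonstration, apart from the rule shapes taken from the list above.

```shell
work=$(mktemp -d)

# Toy source tree: headers, implementation files, build output, a secret library.
mkdir -p "$work/src/build" "$work/src/Libs/pub/.svn" \
         "$work/src/Libs/secretLib" "$work/src/Tools/tool1"
echo x > "$work/src/build/out.h"
echo x > "$work/src/Libs/pub/.svn/entries"
echo x > "$work/src/Libs/pub/api.h"
echo x > "$work/src/Libs/pub/api.cpp"
echo x > "$work/src/Libs/secretLib/secret.h"
echo x > "$work/src/Tools/tool1/tool.h"
echo x > "$work/src/Tools/tool1/tool.cpp"

# Shortened filter: first matching rule wins, so the catch-all goes last.
cat > "$work/filter.txt" <<'EOF'
- **/.svn*
- /build
- /Libs/secretLib
+ /Tools/tool1/*.cpp
+ **/
+ **/*.h
- *
EOF

rsync --recursive --times --exclude-from="$work/filter.txt" \
      "$work/src/" "$work/dst"

# Show which files made it through the filter.
(cd "$work/dst" && find . -type f | sort)
```

Under these rules only Libs/pub/api.h and the two files in Tools/tool1 should come through. The build output, the .svn directory, the secret library, and api.cpp are all held back.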

Whew! That's only half the job: getting the files from my work area to the staging area when I deem them ready for roll-out. Thankfully the other half of the job is a single command, and much simpler.

On the other developer's computer, he executes the following to copy everything down from my staging area to his library area:

rsync --progress --delete --recursive dev-mark1::snapshot ~/src/markLib/

The dev-mark1::snapshot notation, which is in the form machine::path, is interesting. The double-colon (::) indicates that his rsync client should attempt to connect to a remote rsync daemon running on dev-mark1, and then request files from its snapshot path. Since I set snapshot up to point to my staging area, this gives him two directories under ~/src/markLib/: source/ and lib/. The source/ directory contains the source tree with only headers in it (plus the few implementation files I explicitly allowed through), and lib/ contains all the libraries I copied into it using find.
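Since --delete is in play on the receiving side, it can be reassuring to preview a pull before running it for real. The sketch below imitates that with two local scratch directories standing in for the staging area and his library area. The file names are invented, and against the real daemon the source would be dev-mark1::snapshot instead. I have also added --times, which is my own assumption rather than part of his command: without preserved modification times, rsync's quick check has no matching timestamp to go on and will treat unchanged files as candidates for transfer again.

```shell
work=$(mktemp -d)
mkdir -p "$work/staging/source" "$work/markLib"
echo "int one();" > "$work/staging/source/one.h"
echo "int two();" > "$work/staging/source/two.h"

# First pull: copies everything and keeps the modification times.
rsync --recursive --times --delete "$work/staging/" "$work/markLib/"

# A mini-release changes a single header.
echo "int two_more();" >> "$work/staging/source/two.h"

# Preview the next pull: nothing is written, changed items are listed.
preview=$(rsync --recursive --times --delete --dry-run --itemize-changes \
          "$work/staging/" "$work/markLib/")
echo "$preview"
```

Only the changed header should show up in the preview. Dropping --dry-run then performs the update.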

Now whenever I feel like sending out a mini-release, I run my script. Whenever he wants to check for a mini-release, he runs his rsync command. Rsync and find take care of the rest. Voila!

12 October 2010
