DaveWentzel.com All Things Data
Today's blog post in my Vertica series is about Vertica backups. Vertica stores its data in a series of many files; once a file is written it is never modified. Data updates are handled as deletes-followed-by-inserts, and the deleted data is only "logically deleted" via delete vectors. The deleted data is physically removed later, when the Tuple Mover merges storage containers. This makes backups very simple...you just back up the files in the data directory.
Backing up a Vertica database is roughly equivalent to a "hot backup" in Oracle (there are separate configuration backup utilities for MC and even admintools). Vertica provides a Python script, vbr.py, with a bunch of options to make life simple. In order to use the Vertica backup tools your db must be up and running. Essentially, vbr.py makes copies of your Vertica files to another storage area. The script is written such that the copied files are transactionally consistent and impose minimal locking overhead on the running Vertica system.
There are no transaction logs in Vertica...nothing equivalent to Oracle's rollback segments or redo logs. In a later blog post I'll cover how Vertica maintains transactional consistency without anything akin to transaction logs. Also remember that your data is sharded across many nodes; Vertica manages all of that for you as well. In the worst-case scenario Vertica will require you to run recovery on a node if it gets out of sync for whatever reason...even that is an automated process. And even though data is sharded across nodes, Vertica is not "eventually consistent". All queries are transactionally consistent. Again, I'll cover how this is handled in a later post.
Vertica backup utilities
As mentioned before, you can always roll your own scripts to copy the data directory on your node to somewhere else.
rsync is great for this. Newer versions of Vertica provide better utilities to handle this for you.
- backup.sh and restore.sh are the original Vertica-supplied bash scripts.
- these can be found in /opt/vertica/scripts
- these use ssh and rsync to copy the data files, catalog, and configuration files to another area.
- the restore script takes a similar set of arguments but shuts down your db first, copies the data, then bootstraps your db.
- Backups are always done per node, so any data that changed since backup.sh was run is recovered automatically from the other nodes (assuming the db is k-safe).
- With Release 5.1 Vertica deprecated these scripts; although they still work, they are unsupported.
- vbr.py is the replacement and is fully-supported by HP. This is a Python script located in /opt/vertica/bin.
- Fundamentally backups and restores are still handled via rsync and ssh.
- Roll your own scripts. None of these utilities are documented as being available on Community Edition but they are there. I'm not sure if HP changed its mind and simply forgot to update the documentation to reflect this fact. (This is also the case with FlexTables which we'll cover in a future blog post). How do you roll your own scripts?
- Create a tarball of your /opt/vertica directory (google can help with this), then copy the file to a safe place. The script I use (which I'm reluctant to share just in case my testing is lax...I don't want to be responsible for your data loss) does something like this:
tar -czvf /tmp/vertica.tgz /opt/vertica
More on vbr.py
There are lots of nifty features in vbr.py other than simple node backups and restores. For instance, besides full backups we can also take incremental and snapshot backups, as well as backups of specific schemas and tables (helpful if you do multi-tenancy in Vertica). You can also use vbr.py to copy a database to another cluster (a non-prod staging cluster for example). Let's look at some of these.
Running vbr.py with the --setupconfig task launches a pseudo-wizard that generates a configuration file for subsequent backup/restore tasks. The benefit of taking backups this way is that the restore uses the same configuration file. This makes it harder to screw things up.
This will provide you with the following options...
...but it will not actually take a backup. Rather, it saves the backup configuration to the ini file. Let's take a look at that. Note that all of the options I chose are saved, and that I am backing up to another Linux-based host (or really any host that supports ssh, rsync, and scp). When I actually run the backup I must ensure that the path exists on the remote backup host.
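For reference, the generated ini is plain text you can edit by hand. The section and key names below are my approximation of the format — they vary by Vertica release, so treat this as a hedged sketch of what MyBackup.ini might look like, not a canonical template:

```shell
# Write out a sample vbr.py config file. The exact sections/keys vary
# by Vertica release; the ones below are illustrative, not canonical.
cat > /tmp/MyBackup.ini <<'EOF'
[Misc]
snapshotName = MyBackup
restorePointLimit = 1

[Database]
dbName = mydb

[Transmission]
encrypt = False
checksum = False

[Mapping]
; node = backupHost:backupDir  (the path must already exist on the host)
v_mydb_node0001 = backuphost:/home/dbadmin/backups
EOF
```

The [Mapping] section is the important part: one entry per node, pointing at the backup host and directory for that node's files.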
We can now run the backup using the config file:
vbr.py --task backup --config-file MyBackup.ini
As mentioned earlier this is essentially copying all of the relevant data from your data and catalog directories.
Here you'll note that Vertica wants you to verify that you really trust this remote backup host.
Again, all vbr.py is doing is copying all files and directories under data and catalog to your backup location, under the MyBackup folder that corresponds to the .ini file you used for your backup:
You can have many different configuration files for different backup purposes, but hopefully you get the general idea.
You can restore using your config file as well (vbr.py --task restore --config-file MyBackup.ini). To restore a full db:
- The "target" database must be down.
- The entire "target" cluster must also be down.
- The "target" cluster must also have the same number of nodes as the "source" of the backup. In other words, you can't back up a 5-node cluster and restore it onto a 4-node cluster. You have to have the same number of nodes...the restore process will not juggle segmentation for you or reduce k-safety. Remember, the restore process is merely copying files from the backup host to the cluster nodes.
- Node names and IP addresses must be the same on the target cluster.
To copy a database to another cluster
vbr.py --task copycluster --config-file
This is essentially a backup/restore in one atomic operation. The data and catalog paths must be identical. In the configuration file you specify the mappings of nodes in the source/target clusters. The target cluster must be down.
So, this is just like a standard restore except you are mapping old node names to new node names.
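The node mapping lives in the config file's [Mapping] section: one line per source node, pointing at its counterpart host in the target cluster. As before, the key names here are my approximation from memory, not a verbatim template — check your release's vbr.py documentation for the exact format:

```shell
# Illustrative copycluster mapping: each source node maps to a host in
# the target cluster. Section/key names are approximate, not canonical.
cat > /tmp/CopyCluster.ini <<'EOF'
[Misc]
snapshotName = CopyProdToStage

[Database]
dbName = mydb

[Mapping]
; sourceNode = targetHost (data and catalog paths must match the source)
v_mydb_node0001 = stagehost01
v_mydb_node0002 = stagehost02
EOF
```

Because the data and catalog paths must be identical on both clusters, the mapping only has to say which target host receives each node's files.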