HDFS snapshots

We mentioned earlier that HDFS replication alone is not a suitable backup strategy. Hadoop 2 adds snapshots to the filesystem, which bring another level of data protection to HDFS.

Filesystem snapshots have been used for some time across a variety of technologies. The basic idea is that it becomes possible to view the exact state of the filesystem at particular points in time. This is achieved by taking a copy of the filesystem metadata at the point the snapshot is made and making this available to be viewed in the future.

As changes are made to the filesystem, any change that would affect a snapshot is treated specially. For example, if a file that exists in the snapshot is deleted, it is removed from the current state of the filesystem, but its metadata remains in the snapshot. The blocks that hold its data also remain on the filesystem, although they are accessible only through the snapshot and not through any other view of the system.

An example might illustrate this point. Say you have a filesystem containing the following files:

/data1 (5 blocks)
/data2 (10 blocks)

You take a snapshot and then delete the file /data2. If you view the current state of the filesystem, then only /data1 will be visible. If you examine the snapshot, you will see both files. Behind the scenes, all 15 blocks still exist, but only the 5 associated with the undeleted file /data1 are part of the current filesystem. The 10 blocks for the file /data2 will be released only when the snapshot itself is removed. Snapshots are read-only views, so the file cannot be changed or deleted through the snapshot.
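The retention behavior in this example can be mimicked on a local filesystem. The following sketch is only an analogy and is not how HDFS implements snapshots: it uses hard links in a temporary directory to stand in for the references a snapshot keeps, and the directory names current and snapshot are made up for illustration.

```shell
# Local-filesystem analogy only; HDFS snapshots are NOT implemented with hard links.
work=$(mktemp -d)
mkdir "$work/current" "$work/snapshot"

# Two files in the "current" view of the filesystem.
echo "contents of data1" > "$work/current/data1"
echo "contents of data2" > "$work/current/data2"

# "Take a snapshot": add a second reference to each file without copying data.
for f in "$work"/current/*; do
  ln "$f" "$work/snapshot/$(basename "$f")"
done

# Delete data2 from the current view.
rm "$work/current/data2"

current_view=$(ls "$work/current" | tr '\n' ' ')
snapshot_view=$(ls "$work/snapshot" | tr '\n' ' ')
echo "current:  $current_view"    # only data1 remains here
echo "snapshot: $snapshot_view"   # data1 and data2 are both still listed

# The data for data2 is still readable through the snapshot reference.
cat "$work/snapshot/data2"
```

The analogy has limits: editing a file through one hard link changes what the other link sees, whereas an HDFS snapshot continues to show the file as it was when the snapshot was taken.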

Snapshots in Hadoop 2 can be applied at either the full filesystem level or only on particular paths. A path must first be marked as snapshottable, and a path cannot be made snapshottable if any of its ancestor or descendant paths is already snapshottable.
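The nesting rule can be checked before asking HDFS to make a path snapshottable. The following is a minimal sketch that works purely on path strings; the function name conflicts_with_snapshottable is hypothetical, and the list of existing snapshottable directories is assumed to be supplied by the caller (the hdfs lsSnapshottableDir command is one way to obtain it).

```shell
# conflicts_with_snapshottable CANDIDATE EXISTING...
# Hypothetical helper: prints the first EXISTING path that is an ancestor or
# descendant of CANDIDATE and returns 0; returns 1 if there is no conflict.
conflicts_with_snapshottable() {
  local candidate="${1%/}"   # drop a trailing slash; "/" becomes ""
  shift
  local existing
  for existing in "$@"; do
    existing="${existing%/}"
    # The same path is not a nesting conflict, so skip it.
    if [ "$existing" = "$candidate" ]; then
      continue
    fi
    # EXISTING is an ancestor of CANDIDATE.
    case "$candidate/" in
      "$existing"/*) echo "${existing:-/}"; return 0 ;;
    esac
    # EXISTING is a descendant of CANDIDATE.
    case "$existing/" in
      "$candidate"/*) echo "${existing:-/}"; return 0 ;;
    esac
  done
  return 1
}
```

For instance, conflicts_with_snapshottable /user/cloudera/testdir /user/cloudera reports /user/cloudera as a conflict, while a sibling such as /user/cloudera/other does not conflict with /user/cloudera/testdir.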

Let's take a simple example based on the directory we created earlier to illustrate the use of snapshots. Enabling snapshots on a directory requires superuser privileges, which can be obtained with sudo -u hdfs; for simplicity, we run the commands that create and delete snapshots in the same way.

First, use the dfsadmin subcommand of the hdfs CLI utility to enable snapshots of a directory, as follows:

$ sudo -u hdfs hdfs dfsadmin -allowSnapshot \
/user/cloudera/testdir
Allowing snapshot on testdir succeeded

Now, we create the snapshot and examine it. Snapshots are available through the .snapshot subdirectory of the snapshottable directory, although the .snapshot directory is not visible in a normal listing of that directory:

$ sudo -u hdfs hdfs dfs -createSnapshot \
/user/cloudera/testdir sn1
Created snapshot /user/cloudera/testdir/.snapshot/sn1

$ sudo -u hdfs hdfs dfs -ls \
/user/cloudera/testdir/.snapshot/sn1

Found 1 items
-rw-r--r-- 1 cloudera cloudera 12 2014-11-13 11:21 /user/cloudera/testdir/.snapshot/sn1/testfile.txt
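When snapshots are taken regularly, a fixed name such as sn1 soon becomes awkward. The following sketch wraps the same createSnapshot command with a name built from a prefix and a UTC timestamp. The function names and the naming scheme are our own assumptions, not an HDFS convention, and the functions need to run as a user with sufficient rights on the directory.

```shell
# snapshot_name PREFIX
# Hypothetical helper: prints PREFIX-YYYYMMDD-HHMMSS using the current UTC time.
snapshot_name() {
  printf '%s-%s\n' "$1" "$(date -u +%Y%m%d-%H%M%S)"
}

# create_named_snapshot DIR PREFIX
# Creates a snapshot of DIR whose name starts with PREFIX.
create_named_snapshot() {
  hdfs dfs -createSnapshot "$1" "$(snapshot_name "$2")"
}

# Example (requires a running cluster and a snapshottable directory):
#   create_named_snapshot /user/cloudera/testdir pre-upgrade
```

If the snapshot name is omitted altogether, HDFS generates a timestamp-based name of its own, so a wrapper like this is mainly useful when a descriptive prefix is wanted.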

Now, we remove the test file from the main directory and verify that it is now empty:

$ sudo -u hdfs hdfs dfs -rm \
/user/cloudera/testdir/testfile.txt
14/11/13 13:13:51 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://localhost.localdomain:8020/user/cloudera/testdir/testfile.txt' to trash at: hdfs://localhost.localdomain:8020/user/hdfs/.Trash/Current
$ hdfs dfs -ls /user/cloudera/testdir
$

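Note the mention of trash directories. When trash is enabled, as it is here, HDFS moves deleted files into a .Trash directory under the home directory of the user who ran the command (the hdfs user in this case), which helps to defend against slipping fingers. These files can be removed through hdfs dfs -expunge, or they will be purged automatically once the deletion interval expires, which is 1,440 minutes (one day) in the configuration reported in the preceding output.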
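If a deletion like this one had been a mistake, the file could be moved back out of the trash while it is still there. The following is a sketch under two assumptions: home directories live under /user, as they do in the preceding output, and the helper names trash_path and recover_from_trash are hypothetical. The move has to be run by a user who can read the trash directory in question.

```shell
# trash_path USER ORIGINAL_PATH
# Hypothetical helper: prints where ORIGINAL_PATH is expected to sit in USER's
# trash, assuming home directories are under /user.
trash_path() {
  printf '/user/%s/.Trash/Current%s\n' "$1" "$2"
}

# recover_from_trash USER ORIGINAL_PATH
# Moves the file from USER's trash back to its original location.
recover_from_trash() {
  hdfs dfs -mv "$(trash_path "$1" "$2")" "$2"
}

# Example (requires a running cluster):
#   recover_from_trash hdfs /user/cloudera/testdir/testfile.txt
```

To bypass the trash entirely for a deletion, hdfs dfs -rm accepts the -skipTrash option.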

Now, we examine the snapshot where the now-deleted file is still available:

$ hdfs dfs -ls testdir/.snapshot
Found 1 items
drwxr-xr-x - cloudera cloudera 0 2014-11-13 13:12 testdir/.snapshot/sn1
$ hdfs dfs -tail testdir/.snapshot/sn1/testfile.txt
Hello world
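Rather than comparing listings by eye, the hdfs snapshotDiff command can report what has changed since a snapshot was taken. The following is a minimal sketch; the wrapper name diff_since_snapshot is hypothetical, and it relies on snapshotDiff accepting "." to mean the current state of the directory.

```shell
# diff_since_snapshot DIR SNAPSHOT
# Hypothetical wrapper: reports the differences under DIR between SNAPSHOT
# and the current state ("." stands for the current state).
diff_since_snapshot() {
  hdfs snapshotDiff "$1" "$2" .
}

# Example (requires a running cluster):
#   diff_since_snapshot /user/cloudera/testdir sn1
```

In the report, each entry is prefixed with + for created, - for deleted, M for modified, or R for renamed, so the removed testfile.txt should appear as a deleted entry.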

Then, we can delete the snapshot, freeing up any blocks held by it, as follows:

$ sudo -u hdfs hdfs dfs -deleteSnapshot \
/user/cloudera/testdir sn1 
$ hdfs dfs -ls testdir/.snapshot
$
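Once a directory no longer needs snapshots at all, snapshot support can be switched off again. The following sketch wraps the dfsadmin counterpart of -allowSnapshot; the function name disable_snapshots is hypothetical, and the command needs superuser privileges.

```shell
# disable_snapshots DIR
# Hypothetical wrapper: turns off snapshot support for DIR. HDFS rejects this
# while DIR still has snapshots, so delete them first.
disable_snapshots() {
  hdfs dfsadmin -disallowSnapshot "$1"
}

# Example (requires a running cluster and superuser rights):
#   disable_snapshots /user/cloudera/testdir
```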

As can be seen, the files within a snapshot are fully available to be read and copied, providing access to the historical state of the filesystem at the point when the snapshot was made. Each snapshottable directory can hold up to 65,536 snapshots at a time, and HDFS manages snapshots in such a way that they have little impact on normal filesystem operations. They are a great mechanism to use prior to any activity that might have adverse effects, such as trying a new version of an application that accesses the filesystem. If the new software corrupts files, the old state of the directory can be restored by copying the affected files back out of the snapshot. If, after a period of validation, the software is accepted, then the snapshot can be deleted instead.
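Restoring from a snapshot is an ordinary copy out of the .snapshot directory. The following is a minimal sketch for a single file; the function name restore_from_snapshot is hypothetical, and the -f option is used so that a damaged copy in the live directory is overwritten.

```shell
# restore_from_snapshot DIR SNAPSHOT RELATIVE_FILE
# Hypothetical helper: copies RELATIVE_FILE from the named snapshot of DIR
# back into DIR, overwriting any existing file of the same name.
restore_from_snapshot() {
  hdfs dfs -cp -f "$1/.snapshot/$2/$3" "$1/$3"
}

# Example (requires a running cluster and an existing snapshot):
#   restore_from_snapshot /user/cloudera/testdir sn1 testfile.txt
```

This has to happen before the snapshot is deleted; once the snapshot is gone, the blocks it was holding are released.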