Apache ZooKeeper – a different type of filesystem
Within Hadoop, we will mostly talk about HDFS when discussing filesystems and data storage. But, inside almost all Hadoop 2 installations, there is another service that looks somewhat like a filesystem yet provides capabilities crucial to the proper functioning of distributed systems. This service is Apache ZooKeeper (http://zookeeper.apache.org) and, as it is a key part of the implementation of HDFS HA, we will introduce it in this chapter. It is, however, also used by multiple other Hadoop components and related projects, so we will touch on it several more times throughout the book.
ZooKeeper started out as a subcomponent of HBase and was used to enable several operational capabilities of the service. When any complex distributed system is built, there are a series of activities that are almost always required and which are always difficult to get right. These activities include things such as handling shared locks, detecting component failure, and supporting leader election within a group of collaborating services. ZooKeeper was created as the coordination service that would provide a series of primitive operations upon which HBase could implement these types of operationally critical features. Note that ZooKeeper also takes inspiration from the Google Chubby system described at http://research.google.com/archive/chubby-osdi06.pdf.
ZooKeeper runs as a cluster of instances referred to as an ensemble. The ensemble provides a data structure, which is somewhat analogous to a filesystem. Each location in the structure is called a ZNode and can have children as if it were a directory but can also have content as if it were a file. Note that ZooKeeper is not a suitable place to store very large amounts of data, and by default, the maximum amount of data in a ZNode is 1 MB. At any point in time, one server in the ensemble is the master and makes all decisions about client requests. There are very well-defined rules around the responsibilities of the master, including that it has to ensure that a request is only committed when a majority of the ensemble have committed the change, and that once committed any conflicting change is rejected.
You should have ZooKeeper installed within your Cloudera Virtual Machine. If not, use Cloudera Manager to install it as a single node on the host. In production systems, ZooKeeper has very specific semantics around absolute majority voting, so some of the logic only makes sense in a larger ensemble (3, 5, or 7 nodes are the most common sizes).
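For reference, a minimal configuration for a three-node ensemble might look like the following zoo.cfg sketch; the hostnames and dataDir here are illustrative assumptions, and each server additionally needs a myid file identifying it:

tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888

The two ports on each server line are used for follower connections to the leader and for leader election, respectively.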
There is a command-line client to ZooKeeper called zookeeper-client in the Cloudera VM; note that in the vanilla ZooKeeper distribution, it is called zkCli.sh. If you run it with no arguments, it will connect to the ZooKeeper server running on the local machine. From here, you can type help to get a list of commands.
The most immediately interesting commands will be create, ls, and get. As the names suggest, these create a ZNode, list the ZNodes at a particular point in the filesystem, and get the data stored at a particular ZNode. Here are some examples of usage:
- Create a ZNode with no data:
$ create /zk-test ''
- Create a child of the first ZNode and store some text in it:
$ create /zk-test/child1 'sampledata'
- Retrieve the data associated with a particular ZNode:
$ get /zk-test/child1
The client can also register a watch on a given ZNode; this will raise an alert if the ZNode in question changes, that is, if its data or its children are modified.
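As a quick illustration, assuming the 3.4-era CLI found in the Cloudera VM (newer releases use a -w flag instead), a watch can be set by passing the optional watch argument:

$ get /zk-test/child1 watch

If another client now changes the data with set /zk-test/child1 'newdata', the watching session receives a NodeDataChanged notification. Note that watches are one-shot triggers and must be re-registered after they fire.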
This might not sound very useful, but ZNodes can additionally be created as both sequential and ephemeral nodes, and this is where the magic starts.
Implementing a distributed lock with sequential ZNodes
If a ZNode is created within the CLI with the -s option, it will be created as a sequential node. ZooKeeper will suffix the supplied name with a 10-digit integer guaranteed to be unique and greater than that of any other sequential child of the same ZNode. We can use this mechanism to create a distributed lock. ZooKeeper itself does not hold the actual lock; instead, the clients must agree on what particular states in ZooKeeper mean in terms of the application locks in question.
If we create a (non-sequential) ZNode at /zk-lock, then any client wishing to hold the lock will create a sequential child node. For example, the create -s /zk-lock/locknode command might create the node /zk-lock/locknode-0000000001 in the first case, with increasing integer suffixes for subsequent calls. When a client creates a ZNode under the lock, it will then check whether its sequential node has the lowest integer suffix. If it does, it is treated as holding the lock. If not, it will need to wait until the node holding the lock is deleted. The client will usually put a watch on the node with the next lowest suffix and then be alerted when that node is deleted, indicating that it now holds the lock.
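To make the recipe concrete, here is a minimal Java sketch of the acquisition logic against the raw ZooKeeper API. The connection handling and error handling are simplified assumptions rather than production code; real implementations, such as Curator's lock recipe, also make the lock nodes ephemeral so that a lock is released if its holder dies:

import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SimpleLock {
    private final ZooKeeper zk;
    private String myNode; // full path, e.g. /zk-lock/locknode-0000000003

    public SimpleLock(ZooKeeper zk) { this.zk = zk; }

    public void acquire() throws Exception {
        // Create our sequential entry under the lock parent.
        myNode = zk.create("/zk-lock/locknode-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        while (true) {
            List<String> children = zk.getChildren("/zk-lock", false);
            Collections.sort(children);
            if (myNode.endsWith(children.get(0))) {
                return; // we have the lowest suffix, so we hold the lock
            }
            // Otherwise, watch the node with the next lowest suffix and
            // re-check once it changes (typically when it is deleted).
            int myIndex = children.indexOf(myNode.substring("/zk-lock/".length()));
            String predecessor = "/zk-lock/" + children.get(myIndex - 1);
            CountDownLatch latch = new CountDownLatch(1);
            Stat stat = zk.exists(predecessor, event -> latch.countDown());
            if (stat != null) {
                latch.await();
            }
        }
    }

    public void release() throws Exception {
        zk.delete(myNode, -1); // a version of -1 matches any version
    }
}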
Implementing group membership and leader election using ephemeral ZNodes
Any ZooKeeper client will send heartbeats to the server throughout the session, showing that it is alive. For the ZNodes we have discussed until now, we can say that they are persistent and will survive across sessions. We can, however, create a ZNode as ephemeral, meaning it will disappear once the client that created it either disconnects or is detected as dead by the ZooKeeper server. Within the CLI, an ephemeral ZNode is created by adding the -e flag to the create command.
Ephemeral ZNodes are a good mechanism to implement group membership discovery within a distributed system. For any system where nodes can fail, join, and leave without notice, knowing which nodes are alive at any point in time is often a difficult task. Within ZooKeeper, we can provide the basis for such discovery by having each node create an ephemeral ZNode at a certain location in the ZooKeeper filesystem. The ZNodes can hold data about the service nodes, such as host name, IP address, port number, and so on. To get a list of live nodes, we can simply list the child nodes of the parent group ZNode. Because of the nature of ephemeral nodes, we can have confidence that the list of live nodes retrieved at any time is up to date.
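As a hedged sketch of this pattern, assuming a persistent parent ZNode /group has already been created, each service instance could register itself and discover its peers as follows; the paths and payload are illustrative assumptions:

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class GroupMember {
    public static void main(String[] args) throws Exception {
        // Connect to a local server; the lambda is a Watcher that ignores events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

        // Register this instance; the ephemeral node vanishes when the session ends.
        zk.create("/group/member-", "host1:8080".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // Discover the live members by listing the parent's children.
        List<String> members = zk.getChildren("/group", false);
        for (String member : members) {
            byte[] data = zk.getData("/group/" + member, false, null);
            System.out.println(member + " -> " + new String(data));
        }
    }
}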
If we have each service node create ZNode children that are not just ephemeral but also sequential, then we can also build a mechanism for leader election for services that need a single master node at any one time. The mechanism is the same as for locks: the client service node creates a sequential and ephemeral ZNode and then checks whether it has the lowest sequence number. If so, it is the master. If not, it will register a watch on the node with the next lowest sequence number to be alerted when it might become the master.
Java API
The org.apache.zookeeper.ZooKeeper class is the main programmatic client used to access a ZooKeeper ensemble. Refer to the Javadocs for the full details, but the basic interface is relatively straightforward, with an obvious one-to-one correspondence to commands in the CLI. For example:
- create: equivalent to the CLI create command
- getChildren: equivalent to the CLI ls command
- getData: equivalent to the CLI get command
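As an illustrative sketch of that correspondence, assuming a server on localhost and the /zk-test node created earlier:

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkApiExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

        // Equivalent to: create /zk-test/child2 'moredata'
        zk.create("/zk-test/child2", "moredata".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Equivalent to: ls /zk-test
        List<String> children = zk.getChildren("/zk-test", false);
        System.out.println(children);

        // Equivalent to: get /zk-test/child2
        byte[] data = zk.getData("/zk-test/child2", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}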
Building blocks
As can be seen, ZooKeeper provides a small number of well-defined operations with very strong semantic guarantees that can be built into higher-level services, such as the locks, group membership, and leader election we discussed earlier. It's best to think of ZooKeeper as a toolkit of well-engineered and reliable functions critical to distributed systems that can be built upon without having to worry about the intricacies of their implementation. The provided ZooKeeper interface is quite low-level, though, and there are several higher-level interfaces emerging that map its low-level primitives onto application-level logic. The Curator project (http://curator.apache.org/) is a good example of this.
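As a brief, hedged illustration of the difference in abstraction level, Curator's recipes module packages the sequential-node lock pattern described earlier as a ready-made mutex; this sketch assumes the curator-framework and curator-recipes artifacts are on the classpath:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorLockExample {
    public static void main(String[] args) throws Exception {
        // Curator layers connection management and retry policies on the raw client.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // The distributed lock recipe, hiding the sequential-node bookkeeping.
        InterProcessMutex lock = new InterProcessMutex(client, "/zk-lock");
        lock.acquire();
        try {
            System.out.println("Holding the lock");
        } finally {
            lock.release();
        }
        client.close();
    }
}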
ZooKeeper was used sparingly within Hadoop 1, but it's now quite ubiquitous. It's used by both MapReduce and HDFS for the high availability of their JobTracker and NameNode components. Hive and Impala, which we will explore later, use it to place locks on data tables that are being accessed by multiple concurrent jobs. Kafka, which we'll discuss in the context of Samza, uses ZooKeeper for node (broker, in Kafka terminology) membership, leader election, and state management.
Further reading
We have not described ZooKeeper in much detail and have completely omitted aspects such as its ability to apply quotas and access control lists to ZNodes within the filesystem and its mechanisms for building callbacks. Our purpose here was to give enough detail that you have some idea of how it is used within the Hadoop services we explore in this book. For more information, consult the project home page.