Apache ZooKeeper – a different type of filesystem
Within Hadoop, we will mostly talk about HDFS when discussing filesystems and data storage. But, inside almost all Hadoop 2 installations, there is another service that looks somewhat like a filesystem yet provides capabilities crucial to the proper functioning of distributed systems. This service is Apache ZooKeeper (http://zookeeper.apache.org) and, as it is a key part of the implementation of HDFS HA, we will introduce it in this chapter. It is, however, also used by multiple other Hadoop components and related projects, so we will touch on it several more times throughout the book.
ZooKeeper started out as a subcomponent of HBase and was used to enable several operational capabilities of the service. When any complex distributed system is built, there are a series of activities that are almost always required and which are always difficult to get right. These activities include things such as handling shared locks, detecting component failure, and supporting leader election within a group of collaborating services. ZooKeeper was created as the coordination service that would provide a series of primitive operations upon which HBase could implement these types of operationally critical features. Note that ZooKeeper also takes inspiration from the Google Chubby system described at http://research.google.com/archive/chubby-osdi06.pdf.
ZooKeeper runs as a cluster of instances referred to as an ensemble. The ensemble provides a data structure, which is somewhat analogous to a filesystem. Each location in the structure is called a ZNode and can have children as if it were a directory but can also have content as if it were a file. Note that ZooKeeper is not a suitable place to store very large amounts of data, and by default, the maximum amount of data in a ZNode is 1 MB. At any point in time, one server in the ensemble is the master and makes all decisions about client requests. There are very well-defined rules around the responsibilities of the master, including that it has to ensure that a request is only committed when a majority of the ensemble have committed the change, and that once committed any conflicting change is rejected.
You should have ZooKeeper installed within your Cloudera Virtual Machine. If not, use Cloudera Manager to install it as a single node on the host. In production systems, ZooKeeper has very specific semantics around absolute majority voting, so some of the logic only makes sense in a larger ensemble (3, 5, or 7 nodes are the most common sizes).
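For reference, a minimal configuration for a three-node ensemble might look like the following zoo.cfg sketch; the hostnames and dataDir here are illustrative assumptions, and each server additionally needs a myid file identifying it:

tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888

The two ports on each server line are used for follower connections to the leader and for leader election, respectively.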
There is a command-line client to ZooKeeper called zookeeper-client in the Cloudera VM; note that in the vanilla ZooKeeper distribution, it is called zkCli.sh. If you run it with no arguments, it will connect to the ZooKeeper server running on the local machine. From here, you can type help to get a list of commands.
The most immediately interesting commands will be create, ls, and get. As the names suggest, these create a ZNode, list the ZNodes at a particular point in the filesystem, and get the data stored at a particular ZNode. Here are some examples of usage:
- Create a ZNode with no data:
$ create /zk-test ''
- Create a child of the first ZNode and store some text in it:
$ create /zk-test/child1 'sampledata'
- Retrieve the data associated with a particular ZNode:
$ get /zk-test/child1
The client can also register a watch on a given ZNode; this will raise an alert if the ZNode in question changes, that is, if its data or its children are modified.
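As a quick illustration, assuming the 3.4-era CLI found in the Cloudera VM (newer releases use a -w flag instead), a watch can be set by passing the optional watch argument:

$ get /zk-test/child1 watch

If another client now changes the data with set /zk-test/child1 'newdata', the watching session receives a NodeDataChanged notification. Note that watches are one-shot triggers and must be re-registered after they fire.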
This might not sound very useful, but ZNodes can additionally be created as both sequential and ephemeral nodes, and this is where the magic starts.
Implementing a distributed lock with sequential ZNodes
If a ZNode is created within the CLI with the -s option, it will be created as a sequential node. ZooKeeper will suffix the supplied name with a 10-digit integer guaranteed to be unique and greater than that of any other sequential child of the same ZNode. We can use this mechanism to create a distributed lock. ZooKeeper itself does not hold the actual lock; instead, the clients must agree on what particular states in ZooKeeper mean in terms of the application locks in question.
If we create a (non-sequential) ZNode at /zk-lock, then any client wishing to hold the lock will create a sequential child node. For example, the create -s /zk-lock/locknode command might create the node /zk-lock/locknode-0000000001 in the first case, with increasing integer suffixes for subsequent calls. When a client creates a ZNode under the lock, it will then check whether its sequential node has the lowest integer suffix. If it does, it is treated as holding the lock. If not, it will need to wait until the node holding the lock is deleted. The client will usually put a watch on the node with the next lowest suffix and then be alerted when that node is deleted, indicating that it now holds the lock.
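To make the recipe concrete, here is a minimal Java sketch of the acquisition logic against the raw ZooKeeper API. The connection handling and error handling are simplified assumptions rather than production code; real implementations, such as Curator's lock recipe, also make the lock nodes ephemeral so that a lock is released if its holder dies:

import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SimpleLock {
    private final ZooKeeper zk;
    private String myNode; // full path, e.g. /zk-lock/locknode-0000000003

    public SimpleLock(ZooKeeper zk) { this.zk = zk; }

    public void acquire() throws Exception {
        // Create our sequential entry under the lock parent.
        myNode = zk.create("/zk-lock/locknode-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
        while (true) {
            List<String> children = zk.getChildren("/zk-lock", false);
            Collections.sort(children);
            if (myNode.endsWith(children.get(0))) {
                return; // we have the lowest suffix, so we hold the lock
            }
            // Otherwise, watch the node with the next lowest suffix and
            // re-check once it changes (typically when it is deleted).
            int myIndex = children.indexOf(myNode.substring("/zk-lock/".length()));
            String predecessor = "/zk-lock/" + children.get(myIndex - 1);
            CountDownLatch latch = new CountDownLatch(1);
            Stat stat = zk.exists(predecessor, event -> latch.countDown());
            if (stat != null) {
                latch.await();
            }
        }
    }

    public void release() throws Exception {
        zk.delete(myNode, -1); // a version of -1 matches any version
    }
}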
Implementing group membership and leader election using ephemeral ZNodes
Any ZooKeeper client will send heartbeats to the server throughout the session, showing that it is alive. For the ZNodes we have discussed until now, we can say that they are persistent and will survive across sessions. We can, however, create a ZNode as ephemeral, meaning it will disappear once the client that created it either disconnects or is detected as dead by the ZooKeeper server. Within the CLI, an ephemeral ZNode is created by adding the -e flag to the create command.
Ephemeral ZNodes are a good mechanism to implement group membership discovery within a distributed system. For any system where nodes can fail, join, and leave without notice, knowing which nodes are alive at any point in time is often a difficult task. Within ZooKeeper, we can provide the basis for such discovery by having each node create an ephemeral ZNode at a certain location in the ZooKeeper filesystem. The ZNodes can hold data about the service nodes, such as host name, IP address, port number, and so on. To get a list of live nodes, we can simply list the child nodes of the parent group ZNode. Because of the nature of ephemeral nodes, we can have confidence that the list of live nodes retrieved at any time is up to date.
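As a hedged sketch of this pattern, assuming a persistent parent ZNode /group has already been created, each service instance could register itself and discover its peers as follows; the paths and payload are illustrative assumptions:

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class GroupMember {
    public static void main(String[] args) throws Exception {
        // Connect to a local server; the lambda is a Watcher that ignores events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

        // Register this instance; the ephemeral node vanishes when the session ends.
        zk.create("/group/member-", "host1:8080".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);

        // Discover the live members by listing the parent's children.
        List<String> members = zk.getChildren("/group", false);
        for (String member : members) {
            byte[] data = zk.getData("/group/" + member, false, null);
            System.out.println(member + " -> " + new String(data));
        }
    }
}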
If we have each service node create ZNode children that are not just ephemeral but also sequential, then we can also build a mechanism for leader election for services that need a single master node at any one time. The mechanism is the same as for locks: the client service node creates a sequential and ephemeral ZNode and then checks whether it has the lowest sequence number. If so, it is the master. If not, it will register a watch on the node with the next lowest sequence number to be alerted when it might become the master.
Java API
The org.apache.zookeeper.ZooKeeper class is the main programmatic client used to access a ZooKeeper ensemble. Refer to the Javadocs for the full details, but the basic interface is relatively straightforward, with an obvious one-to-one correspondence to commands in the CLI. For example:
- create: equivalent to the CLI create command
- getChildren: equivalent to the CLI ls command
- getData: equivalent to the CLI get command
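As an illustrative sketch of that correspondence, assuming a server on localhost and the /zk-test node created earlier:

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkApiExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

        // Equivalent to: create /zk-test/child2 'moredata'
        zk.create("/zk-test/child2", "moredata".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Equivalent to: ls /zk-test
        List<String> children = zk.getChildren("/zk-test", false);
        System.out.println(children);

        // Equivalent to: get /zk-test/child2
        byte[] data = zk.getData("/zk-test/child2", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}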
Building blocks
As can be seen, ZooKeeper provides a small number of well-defined operations with very strong semantic guarantees that can be built into higher-level services, such as the locks, group membership, and leader election we discussed earlier. It's best to think of ZooKeeper as a toolkit of well-engineered and reliable functions critical to distributed systems that can be built upon without having to worry about the intricacies of their implementation. The provided ZooKeeper interface is quite low-level, though, and there are several higher-level interfaces emerging that map its low-level primitives onto application-level logic. The Curator project (http://curator.apache.org/) is a good example of this.
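As a brief, hedged illustration of the difference in abstraction level, Curator's recipes module packages the sequential-node lock pattern described earlier as a ready-made mutex; this sketch assumes the curator-framework and curator-recipes artifacts are on the classpath:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorLockExample {
    public static void main(String[] args) throws Exception {
        // Curator layers connection management and retry policies on the raw client.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // The distributed lock recipe, hiding the sequential-node bookkeeping.
        InterProcessMutex lock = new InterProcessMutex(client, "/zk-lock");
        lock.acquire();
        try {
            System.out.println("Holding the lock");
        } finally {
            lock.release();
        }
        client.close();
    }
}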
ZooKeeper was used sparingly within Hadoop 1, but it's now quite ubiquitous. It's used by both MapReduce and HDFS for the high availability of their JobTracker and NameNode components. Hive and Impala, which we will explore later, use it to place locks on data tables that are being accessed by multiple concurrent jobs. Kafka, which we'll discuss in the context of Samza, uses ZooKeeper for node (broker, in Kafka terminology) membership, leader election, and state management.
Further reading
We have not described ZooKeeper in much detail and have completely omitted aspects such as its ability to apply quotas and access control lists to ZNodes within the filesystem and its mechanisms for building callbacks. Our purpose here was to give enough detail that you have some idea of how it is used within the Hadoop services we explore in this book. For more information, consult the project home page.