Hops-HDFS is a new implementation of the Hadoop Distributed File System (HDFS) that supports multiple stateless NameNodes, with the metadata stored in an in-memory distributed database (MySQL Cluster). Hops-HDFS enables more scalable clusters than Apache HDFS (up to ten times larger), and enables NameNode metadata to be both customized and analyzed, because it can now be easily accessed via a SQL API.
HDFS v1 Architecture
In HDFS v1, all metadata is stored on the heap of a single JVM, and mutations to it are also persisted in a log (the editLog) along with periodic snapshots of the NameNode's state, so that the NameNode can recover the metadata in the event of failure. HDFS v1 has higher performance than HDFS v2, as updates are only logged to a local disk on the NameNode instead of to a quorum of remote journal nodes. EditLog updates are batched, so threads can return NameNode operations out of order, which causes confusion about the exact semantics of HDFS filesystem operations.
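The batching behaviour described above can be sketched with a toy edit log. This is purely illustrative (it is not Hadoop's `FSEditLog`): edits accumulate in a memory buffer and only become durable when a batch is flushed, so an operation can return to its client before its edit has reached disk.

```python
import threading

class BatchedEditLog:
    """Illustrative sketch, not Hadoop's implementation: edits are
    buffered in memory and flushed in batches, so an operation may
    return before its edit is durable."""
    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.buffer = []
        self.durable = []          # stands in for the on-disk editLog
        self.lock = threading.Lock()

    def log_edit(self, edit):
        with self.lock:
            self.buffer.append(edit)
            if len(self.buffer) >= self.batch_size:
                # group commit: one (simulated) disk sync covers the batch
                self.durable.extend(self.buffer)
                self.buffer.clear()

log = BatchedEditLog(batch_size=3)
log.log_edit("mkdir /a")
log.log_edit("create /a/f")
# both operations have returned to their callers, but neither edit is
# durable yet -- a crash at this point would lose them
assert log.durable == []
log.log_edit("delete /a/f")        # triggers the batch flush
assert len(log.durable) == 3
```

Grouping several edits under one disk sync is what buys the performance, at the cost of the ordering ambiguity described above.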
HDFS v2 Architecture
HDFS v2 is a highly available architecture, with eventually consistent replication of the NameNode state from an Active NameNode to a Standby NameNode. If the Active NameNode fails, the Standby NameNode takes over after it has applied all outstanding updates (which can take up to several minutes). There is no partitioning of the metadata in this solution, so it does not improve the throughput of HDFS. All metadata is still stored on the heap of the Active NameNode, while updates are now propagated to a quorum of Journal Nodes, which negatively impacts the performance of write/update operations.
As network partitions may occur, all nodes in the cluster need to reliably reach agreement on which NameNode is currently Active and which is Standby, and this problem is solved by adding a number of ZooKeeper instances on different nodes (typically 3).
In large clusters, there is very little free memory on the heap of the NameNode(s), and it is not possible to efficiently create a snapshot of the NameNode's state, which is needed periodically to reduce node recovery time on failure. As such, an additional Checkpoint server is needed with at least as much RAM as the NameNodes. It is used to periodically store a checkpoint of the NameNode's state to disk.
The basic approach we took was to migrate the main data structures of the Apache Hadoop NameNode into tables in the MySQL Cluster database: inodes, blocks, and replicas. We index columns where necessary, and we partition the data such that all inodes in the same directory are located on the same database node, and all blocks and replicas associated with an inode are located on the same database node. We have published a paper, Scaling HDFS with a Strongly Consistent Relational Model for Metadata, on how we maintain the consistency of the metadata when migrating it to the relational database. The TL;DR for the paper is that we reorder all filesystem and block operations to follow the same order when acquiring locks on inodes, starting at the root inode and then following a depth-first-search path.
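The global lock-ordering idea can be sketched as follows. This is a simplified model of our own devising, using in-process locks rather than database row locks: every operation locks the inodes on its path root-first, so any two operations that contend on a common ancestor acquire it in the same order and can never deadlock.

```python
import threading

# Hypothetical sketch of the lock-ordering idea: every operation
# acquires inode locks in the same global order (root first, then
# each path component), so concurrent operations cannot deadlock.
inode_locks = {}                 # path -> threading.Lock
registry_lock = threading.Lock()

def lock_for(path):
    with registry_lock:
        return inode_locks.setdefault(path, threading.Lock())

def acquire_path_locks(path):
    """Lock every inode on the path, root first."""
    parts = [p for p in path.split("/") if p]
    acquired, prefix = [], ""
    for part in ["/"] + parts:
        prefix = "/" if part == "/" else prefix.rstrip("/") + "/" + part
        lk = lock_for(prefix)    # locks "/", "/user", "/user/alice", ...
        lk.acquire()
        acquired.append(lk)
    return acquired

def release(locks):
    for lk in reversed(locks):
        lk.release()

locks = acquire_path_locks("/user/alice/file")
# ... the metadata mutation would run transactionally here ...
release(locks)
assert len(locks) == 4           # "/", "/user", "/user/alice", "/user/alice/file"
```

In HopsFS the locks are row locks taken inside a MySQL Cluster transaction rather than in-process mutexes, but the ordering discipline is the same.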
There are, however, some filesystem operations, such as rename, delete, setquota, and move, that operate on a subtree of inodes and may potentially need to be executed atomically on millions of inodes. Firstly, we assume that operations on the NameNode should be executed atomically and that concurrent clients are isolated from one another; we use pessimistic concurrency control to serialize updates to individual inodes. However, databases, even NewSQL databases such as MySQL Cluster, cannot execute transactions containing millions of operations atomically. As such, we designed our own subtree protocol to make subtree operations atomic and to isolate concurrent clients.
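One way to picture such a protocol is sketched below. The names and details are ours, not the HopsFS API: a short transaction marks the subtree root with a lock flag, the operation then mutates the subtree in small batches (each batch one small transaction), and ordinary operations that find the flag on an ancestor back off.

```python
# Hypothetical sketch of a subtree-operation protocol: a flag on the
# subtree root isolates concurrent clients while the operation runs
# in small batches instead of one huge transaction.
class SubtreeLockedError(Exception):
    pass

inodes = {
    "/a":      {"subtree_locked": False},
    "/a/b":    {"subtree_locked": False},
    "/a/b/f1": {"subtree_locked": False},
}

def check_not_locked(path):
    """An ordinary operation checks ancestor flags before proceeding."""
    for node, attrs in inodes.items():
        if attrs["subtree_locked"] and (path == node or path.startswith(node + "/")):
            raise SubtreeLockedError(path)

def subtree_delete(root):
    inodes[root]["subtree_locked"] = True    # short transaction 1: set the flag
    victims = [p for p in inodes if p.startswith(root + "/")]
    for i in range(0, len(victims), 2):      # each slice = one small transaction
        for path in victims[i:i + 2]:
            del inodes[path]
    del inodes[root]                         # final transaction removes the root

# while the flag is set, concurrent operations on the subtree back off
inodes["/a"]["subtree_locked"] = True
try:
    check_not_locked("/a/b/f1")
    blocked = False
except SubtreeLockedError:
    blocked = True
assert blocked
inodes["/a"]["subtree_locked"] = False

subtree_delete("/a/b")
assert "/a/b" not in inodes and "/a/b/f1" not in inodes
assert "/a" in inodes
```

To clients the operation still appears atomic, because no other operation can observe the subtree in a half-mutated state while the flag is held.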
In HDFS, there are a number of background tasks that are problematic if multiple NameNodes attempt to perform them concurrently. Examples of such tasks include:
- replication monitoring,
- lease management,
- block token generation,
- and the decommissioning of DataNodes.
Without any coordination between NameNodes, and since all NameNodes have identical behaviour, we would have problems with many of these background tasks. For example, if a block becomes under-replicated, several NameNodes could independently identify this under-replicated state, and each would select a DataNode to replicate that block to. This would cause multiple re-replications of the block, leading it to enter an over-replicated state, upon which multiple NameNodes could recognize the over-replication and attempt to remove replicas, possibly leading to an under-replicated state, back where we started. We solve this coordination problem by implementing a leader election algorithm, where only the leader NameNode is assigned the task of performing the above background tasks. Our leader election algorithm uses the shared, transactional memory abstraction provided by MySQL Cluster to coordinate the election process.
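A lease-style election over a shared transactional table can be sketched as follows. This is a simplified model of our own devising (a dict guarded by a lock stands in for a MySQL Cluster transaction): each NameNode periodically bumps a heartbeat counter in the shared table; a node whose counter stops advancing is considered dead, and the live node with the smallest id becomes leader.

```python
import threading

# Hypothetical sketch of leader election over shared, transactional
# memory: the lock-guarded dicts stand in for rows in a MySQL Cluster
# table updated inside transactions.
table_lock = threading.Lock()
members = {}             # namenode_id -> heartbeat counter
last_seen = {}           # counters observed in the previous round

def heartbeat(nn_id):
    with table_lock:     # one row update per "transaction"
        members[nn_id] = members.get(nn_id, 0) + 1

def elect_leader():
    with table_lock:
        # alive = nodes whose counter advanced since the last round
        alive = [nn for nn, c in members.items()
                 if c > last_seen.get(nn, -1)]
        for nn, c in members.items():
            last_seen[nn] = c
        return min(alive) if alive else None

heartbeat(1); heartbeat(2); heartbeat(3)
assert elect_leader() == 1       # smallest live id wins
heartbeat(2); heartbeat(3)       # NameNode 1 stops heartbeating
assert elect_leader() == 2       # leadership fails over
```

Because every NameNode reads and writes the same table under the database's transaction isolation, all nodes deterministically agree on the leader without any extra coordination service such as ZooKeeper.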