HDFS architecture and its associated processes allow the technology to make better use of storage resources than many enterprise storage arrays on the market today.
The Hadoop Distributed File System offers storage performance at large scale and low cost -- three things enterprise arrays have difficulty doing at the same time.
Apache HDFS is a parallelized, distributed, Java-based file system designed for use in Hadoop clusters that currently scale to 200 PB and can support single Hadoop clusters of 4,000 nodes. HDFS supports simultaneous data access from multiple applications and Apache Yet Another Resource Negotiator. It is designed to be fault-tolerant, meaning it can withstand disk and node failures without interruption in cluster availability. Failed disks and cluster nodes can be replaced as needed.
The following HDFS features are found in most deployments. Each one aids performance, minimizes workload and helps to keep costs down.
Core Hadoop Distributed File System features
Shared nothing, asymmetrical architecture. The major physical components of a Hadoop cluster are commodity servers with JBOD storage -- typically six to 12 disks in each server -- interconnected by an Ethernet network. InfiniBand is supported and used for performance acceleration.
The internal network is the only major physical component shared among cluster nodes; compute and storage are not shared.
Servers within the cluster are designated as control nodes that manage data placement and perform many other tasks or DataNodes. With earlier versions of Apache Hadoop, control node failure could result in a cluster outage. This problem has been addressed in recent Apache versions by vendors offering commercial distributions of Hadoop, as well as by enterprise storage vendors.
Data locality. The Hadoop Distributed File System minimizes the distance between compute and data by moving compute processes to the data stored on Hadoop DataNodes where the actual processing takes place. This minimizes the load on the cluster's internal network and results in a high aggregate bandwidth.
Multiple full copies of data. As Hadoop ingests raw data, often from multiple sources, it spreads data across many disks within a Hadoop cluster. Query processes are then assigned by the control node to DataNodes in the cluster that operate in parallel on the data spread across the disks directly attached to each processor in the cluster. HDFS makes multiple copies of data on ingest -- normal default is three copies -- and they are spread across disks.
Eventual consistency. Because data is replicated across multiple DataNodes, a change made on one node takes time to propagate across the cluster, depending on network bandwidth and other performance factors. Eventually, the change will be replicated across all copies. At that point, HDFS will be in a consistent state. However, it is possible for HDFS to encounter simultaneous queries and run them against two different replicated versions of the data, leading to inconsistent results.
Rack awareness and backup/archiving. HDFS tries to optimize performance by considering a DataNode's physical location when placing data, allocating storage resources and scheduling automated tasks. Some nodes can be configured with more storage for capacity while others can be configured to be more compute-heavy. However, the latter is not a standard practice. For backup and archiving, a storage-heavy Hadoop cluster is usually recommended.
Erasure coding. By default, the Hadoop Distributed File System creates three copies of data by replicating each data block three times upon ingestion. While this practice is common across other distributed file systems as a way to keep clusters running despite failure, it makes for very inefficient use of storage resources. This is particularly true when many advanced users are considering the use of flash drives to populate Hadoop clusters.
The alternative, which will be delivered in an upcoming release by the Apache Software Foundation, is to use erasure coding in place of replication. According to Apache, the default erasure coding policy will save 50% of the storage space consumed by storing three replicas and will also tolerate more storage failure per cluster.
Individual directories can be configured with an erasure coding policy or the normal three-replica policy. However, there is a performance penalty in using an erasure-coding policy because data locality is decreased for erasure code-configured directories. I/O to these directories will be elongated. As a result, Hadoop developers now see erasure codes as a form of storage tiering, whereby less frequently accessed and cold data is allocated to erasure code directories. Source