It makes sense to get excited about the substantial business value that Apache™ Hadoop® YARN-based applications such as Spark, Storm and Presto can provide. However, the actual tasks of managing and maintaining the environment should not get short shrift. Without best practices in place to ensure big data system performance and stability, business users will slowly lose faith and trust in Hadoop as a difference maker for the enterprise.
With a goal of increasing big data application adoption, the Hadoop environment must run optimally to meet end-user expectations. Think Big, a Teradata company, runs Hadoop platforms for numerous global customers and has identified three best practices that can help you improve operations.
1. LEVERAGE WORKLOAD MANAGEMENT CAPABILITIES
Workload management is important in a Hadoop environment. Why? Because as your big data systems are used more widely in production, the needs of different business teams will invariably put their applications in competition for system resources.
Even though your Hadoop cluster can be deployed with guidelines provided by the distribution provider, it should be configured for your own specific workloads. Administrators can use YARN’s workload management capabilities to decide which users get which system resources, and when, in order to meet service levels.
When workload management settings are properly identified and adjusted, administrators can schedule jobs to gain maximum utilization of cluster resources. This not only keeps the Hadoop cluster’s footprint to an appropriate size, it also makes it easier to match resources to changing business requirements.
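As one concrete illustration, YARN’s Capacity Scheduler lets administrators carve the cluster into queues with guaranteed shares. The sketch below shows the general shape of a `capacity-scheduler.xml`; the queue names and percentages are hypothetical and would need to be tuned to your actual workloads.

```xml
<!-- Illustrative Capacity Scheduler queue split. Queue names ("etl",
     "analytics") and percentages are hypothetical examples only. -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>etl,analytics</value>
  </property>
  <property>
    <!-- guaranteed share for scheduled ETL jobs -->
    <name>yarn.scheduler.capacity.root.etl.capacity</name>
    <value>60</value>
  </property>
  <property>
    <!-- guaranteed share for ad hoc analytics -->
    <name>yarn.scheduler.capacity.root.analytics.capacity</name>
    <value>40</value>
  </property>
  <property>
    <!-- let analytics borrow idle capacity up to this ceiling -->
    <name>yarn.scheduler.capacity.root.analytics.maximum-capacity</name>
    <value>70</value>
  </property>
</configuration>
```

The `maximum-capacity` ceiling is what gives the cluster its adaptability: a queue can elastically borrow idle resources without starving the guarantees of other teams.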
2. STRIVE FOR BUSINESS CONTINUITY
As valuable data is staged and housed in Hadoop, continuous system availability and data protection become more essential. However, Hadoop’s data replication capabilities are not enough to protect vital data sets from disaster. Standard three-way replication is sufficient to protect data from the corruption or loss of individual disks and nodes, but it is not an adequate backup and disaster recovery strategy.
Hadoop’s replication is designed to enable better fault tolerance and data locality for processing. But having three copies of the data in the same cluster will not protect it from the inevitable problems that will arise. That’s why data must be backed up daily to another data center using an enterprise data archive tool or cloud instance. These efforts help protect the information from a natural disaster, cyberattack or other unforeseen incident.
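A common building block for this kind of off-cluster copy is Hadoop’s DistCp tool, which runs the transfer as a distributed job. The command below is a hedged sketch of a nightly mirror; the cluster hostnames and paths are placeholders, not real endpoints.

```shell
# Hypothetical nightly mirror of a critical dataset to a second
# cluster using DistCp. Hostnames, ports and paths are placeholders.
# -update copies only changed files; -delete removes files from the
# target that no longer exist at the source.
hadoop distcp -update -delete \
  hdfs://prod-cluster:8020/data/critical \
  hdfs://dr-cluster:8020/backups/critical
```

In practice such a job would be scheduled (for example via cron or an orchestrator) and its exit status monitored, since a silently failing backup is as dangerous as no backup at all.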
For business continuity, don’t forget about NameNode backup. The NameNode stores the directory tree of files in the Hadoop Distributed File System (HDFS) and records where data is kept in the cluster. Without a high-availability configuration it is a single point of failure, and rebuilding the NameNode from scratch is a time-consuming endeavor fraught with the potential for considerable data loss. That’s why as your production system grows, it’s increasingly important to back up not only business data but also the NameNode.
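One simple safeguard is to regularly pull the NameNode’s metadata checkpoint (the fsimage) off the cluster. The sketch below uses the standard `hdfs dfsadmin -fetchImage` command; the destination path is a placeholder, and this complements rather than replaces HA or secondary/checkpoint NameNodes.

```shell
# Hedged sketch: fetch the most recent fsimage checkpoint from the
# NameNode to local storage, then ship it off-cluster with your
# enterprise backup tooling. Destination path is a placeholder.
hdfs dfsadmin -fetchImage /backup/namenode/$(date +%F)
```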
Critical applications relying on Hadoop resources also require a high-availability strategy. This requires a plan that can be quickly enacted to ensure production workloads aren’t troubled by unanticipated circumstances. Be sure to include a process to reconstruct data sets from raw sources and/or readily restorable offline backups of irreplaceable data sets.
3. UTILIZE HADOOP EXPERIENCE
While detailed documentation on Hadoop architecture, daily monitoring tasks and issue resolution is essential, there is no substitute for experience. Even if application support processes are documented, challenges will undoubtedly arise, which is where experience comes into play. A specific skill set is needed to administer and develop on big data open-source platforms, far beyond what a typical DBA is trained to perform.
In addition to Hadoop admin experience, your big data application support teams should have a solid technical background that allows for adapting to non-standard issues. A senior technical person who can help resolve particularly thorny challenges should be part of that team. Ideally, he or she will have a detailed knowledge of custom application development in Hadoop, strong Linux skills and the ability to troubleshoot complex problems.
Think Big recognizes that even the most experienced Hadoop administrators need the right tools to competently perform their jobs. For example, while a support and development team may utilize open-source administration tools such as Ambari and Nagios, they will ultimately discover that many of these tools are immature. That’s why creating additional tooling or purchasing off-the-shelf software that provides monitoring and repair of common and abnormal issues will be needed to keep big data systems up and running with significantly less downtime.
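Much of that additional tooling boils down to polling cluster metrics and alerting on anomalies. The NameNode, for example, publishes its metrics as JSON through its `/jmx` HTTP endpoint. The sketch below (our illustration, not any vendor’s tool) parses an FSNamesystem payload and flags low free capacity or dead DataNodes; the threshold values are assumptions to tune per cluster.

```python
import json

# Illustrative thresholds; tune these for your own cluster.
MIN_FREE_RATIO = 0.15   # alert if less than 15% of HDFS capacity is free
MAX_DEAD_NODES = 0      # alert on any dead DataNode

def check_fsnamesystem(jmx_payload: str) -> list[str]:
    """Parse a NameNode /jmx JSON payload and return alert strings.

    The NameNode serves this bean at a URL like
    http://<namenode>:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem
    (the port varies by Hadoop version and configuration).
    """
    beans = json.loads(jmx_payload)["beans"]
    fs = next(b for b in beans if b["name"].endswith("name=FSNamesystem"))
    alerts = []
    free_ratio = fs["CapacityRemaining"] / fs["CapacityTotal"]
    if free_ratio < MIN_FREE_RATIO:
        alerts.append(f"low HDFS capacity: {free_ratio:.1%} free")
    if fs["NumDeadDataNodes"] > MAX_DEAD_NODES:
        alerts.append(f"{fs['NumDeadDataNodes']} dead DataNode(s)")
    return alerts

# Sample payload shaped like a real /jmx response (values invented):
sample = json.dumps({"beans": [{
    "name": "Hadoop:service=NameNode,name=FSNamesystem",
    "CapacityTotal": 1000,
    "CapacityRemaining": 100,
    "NumDeadDataNodes": 1,
}]})
print(check_fsnamesystem(sample))
```

A real deployment would fetch the payload on a schedule and route alerts into whatever paging or ticketing system the team already uses; the parsing and thresholding logic is the part worth unit testing.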
GET THE OPTIMAL HADOOP ENVIRONMENT
Although Hadoop is not a database, many of the same data management techniques are relevant, such as prioritizing workloads to meet business demands and ensuring continuity to mitigate the risk of downtime and lost information. Additionally, experience is critical in properly storing, managing and analyzing big data sets in Hadoop. After all, a highly optimized environment doesn’t happen by chance—it is the result of smartly managed day-to-day operations that keep your big data applications humming and your business users satisfied.