Four Ways Automation Can Rescue Your Data Lake
Posted on: Feb 21, 2019

Ask Gartner Research and you’ll find that as of late 2017, 60% of big data projects failed to survive the pilot phase and only 17% of Hadoop deployments went on to the production phase. However, it’s often not due to a lack of desire or appreciation of big data’s value among executive leadership. It’s simply that most organizations aren’t aware of many of big data’s most common and formidable challenges.

There are two big challenges. The first is underestimating the technical complexity of developing a data lake and the expertise required to overcome it. The second is underestimating the ongoing effort required to maintain an often brittle operational environment in which every successive analytics project takes longer and costs more.

Let’s face it: big data is extremely complex, a fact that vendors and deploying organizations aren’t always willing to admit publicly. Even open source platforms aren’t enterprise-ready right out of the box without significant effort from your team or a third-party systems integrator.

2010 saw the First Wave of companies using data lakes as a mechanism to store all their raw data in a cost-effective way. The problem was that while the data lake turned out to be a great way to store data cheaply, it also became a dumping ground and a terrible way to generate actual value, with data left languishing unknown, ungoverned and unusable. But that’s changing with new approaches that leverage the underlying compute power of the data lake itself to automate and simplify much of the development, ongoing management and governance of the data engineering and DataOps processes. Data lakes are now starting to recapture their original luster, evolving to deliver the best combination of agility and governance.

But how can agility and self-service be achieved if you’re also enforcing rules to make the data lake a fully governed environment? Those objectives have historically been seen as mutually exclusive: if you wanted agility, you had to give up control, and vice versa. But it turns out that by leveraging the compute power of the data lake to manage the data lake itself (using statistics, heuristics and machine learning to automate development, performance tuning and ongoing management), data lakes can deliver both agility and governability. The key is to automate the data lake wherever possible and avoid hand coding. Fortunately, a new Second Wave of data lake technologies provides some of this automation.
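
The article doesn’t prescribe an implementation, but a minimal sketch can make the idea concrete: use the lake’s own compute to profile data as it lands and record the results as governance metadata. The sketch below assumes PySpark, and the paths and statistics are hypothetical placeholders, not a reference to any particular product; a real Second Wave platform would automate far more (performance tuning, change detection, lineage).

```python
# A minimal sketch, assuming PySpark, of using the lake's own compute to
# profile newly landed data and record the results in a small catalog dataset.
# The paths ("/lake/raw/orders", "/lake/catalog/dataset_profiles") are
# hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("auto-profile").getOrCreate()

raw_path = "/lake/raw/orders"            # hypothetical landing zone for raw data
df = spark.read.parquet(raw_path)

# Collect simple per-column statistics: row count, null count, distinct count.
total_rows = df.count()
rows = []
for col in df.columns:
    null_count = df.filter(F.col(col).isNull()).count()
    distinct_count = df.select(col).distinct().count()
    rows.append((raw_path, col, total_rows, null_count, distinct_count))

profile = spark.createDataFrame(
    rows, ["dataset", "column_name", "row_count", "null_count", "distinct_count"]
).withColumn("profiled_at", F.current_timestamp())

# Append the profile to a catalog location so downstream jobs (and people)
# can see what landed, how complete it is, and how it changes over time.
profile.write.mode("append").parquet("/lake/catalog/dataset_profiles")
```

The point of the sketch is not the specific statistics but where the work happens: the profiling runs on the cluster that already holds the data, so governance metadata is produced as a by-product of ingestion rather than by a separate, hand-coded process.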

If you liked the concept of the data lake but either built one and hit a scalability wall, or never tried because you didn’t have the in-house skills to make it work, now is a good time to revisit the idea. But before you do, it’s important to understand why building a successful data lake was so complex in the first place.