Xpand IT is a global consultancy that specialises in implementing Business Intelligence and Big Data solutions and is a partner of Cloudera and Pentaho. This article draws on our experience with these projects and describes strategies that help organisations adopt these technologies successfully.
Over the years we have implemented many projects, but most of them were not for the typical big data cliché company: an internet company where analysing data is part of the culture. Instead, we are talking about companies from traditional sectors, mostly finance and retail, that saw in Hadoop a great opportunity to reduce costs and implement use cases that could bring new insights and nurture a data-driven culture.
For these companies, the hype around big data means that everybody has a high-level understanding of what can be achieved. Nevertheless, because these technologies are so technical, it is hard to understand the implementation path and to draw a line between myth and reality. That is what makes Big Data a challenge for them.
How can we make it work?
In our experience, the benefits and the implementation of Big Data initiatives have to be made clear right from the start. When selecting the first use case, we always advise companies to look at where data is already valued within the business. This isn’t about unstructured data, social networks or any other fancy use case; it is about the down-to-earth requirements that were already identified as high value but never left the drawing board because they simply could not be implemented with traditional technologies. This allows companies to move to Hadoop gradually while keeping the use cases familiar, so that the results are easy to understand and evaluate.
The good news is that Hadoop is a constantly evolving open source ecosystem. However, because it is composed of many different projects with completely different purposes, Hadoop requires in-depth know-how to master. Again, keeping it simple is the best strategy: choosing a typical ETL and analytics use case allowed us to use projects such as Hive or Impala, which have a SQL interface, to make the transition smoother. The team still needs to learn the principles behind these projects, but at least everybody is already familiar with the language used and the way to express solutions. It might not be as trendy as other projects like Spark, but it will answer these use cases while reducing complexity.
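To illustrate how familiar this way of expressing solutions is, here is a minimal sketch of a summarisation step written as a plain SQL aggregation. It is not taken from one of our projects: the `sales` table, its columns and the `summarise_sales` function are hypothetical, and Python's built-in `sqlite3` module stands in for Hive or Impala so that the example runs anywhere. The point is only that the query itself is the kind of SQL any BI team already writes.

```python
import sqlite3


def summarise_sales(rows):
    """Aggregate (store, amount) rows per store with a plain SQL query.

    sqlite3 is only a stand-in here; the table and column names are
    hypothetical and chosen for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (store TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

    # The summarisation step: an ordinary GROUP BY aggregation.
    query = """
        SELECT store, COUNT(*) AS n_sales, SUM(amount) AS total
        FROM sales
        GROUP BY store
        ORDER BY store
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [("Lisbon", 10.0), ("Porto", 5.0), ("Lisbon", 2.5)]
    for store, n_sales, total in summarise_sales(sample):
        print(store, n_sales, total)
```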
Now that we have the use case and the proper Hadoop projects to work with, it is time to dive into the data. Data won’t show up on Hadoop magically; it has to come from somewhere, and on all of our projects the source of the data has been traditional technologies such as databases or files. Besides that, we have always kept an existing data warehouse on a traditional database, meaning that in the end the summarised data had to leave Hadoop and be published there. So an important part of the work is integrating with traditional technologies to ingest or publish data, as well as orchestrating the whole process.
In our experience, Pentaho Data Integration is the right tool for the job: it allows us to implement the whole ETL process in a single drag-and-drop user interface, whether the data sits on big data platforms or on traditional technologies. We have published a freely available eBook about these advantages and how you can leverage them. Besides that, thanks to its flexibility, we have been able to apply the same design patterns across all of the projects, promoting reuse and reducing the complexity of scaling the solution. Examples include table imports configured through variables and metadata-driven ingestion, which make it possible to create a single ETL flow that processes any number of different files or tables.
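The metadata-driven idea can be sketched outside of any particular tool. The example below is a minimal illustration in Python, not Pentaho Data Integration code and not the implementation used in our projects: `SourceSpec`, `ingest`, `run_all` and all field names are hypothetical. It assumes each source is a delimited text file described by one metadata record, and shows a single generic flow handling every source without source-specific code.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpec:
    """One metadata record describing a source (all fields are hypothetical)."""
    name: str           # key used to look up the raw input
    delimiter: str      # field separator used by this source
    columns: tuple      # column names, in file order
    target_table: str   # where the parsed rows should be staged


def ingest(spec, raw_text):
    """Generic flow: parse any delimited source according to its metadata."""
    reader = csv.reader(io.StringIO(raw_text), delimiter=spec.delimiter)
    rows = []
    for values in reader:
        if not values:
            continue  # skip blank lines
        if len(values) != len(spec.columns):
            raise ValueError(
                f"{spec.name}: expected {len(spec.columns)} fields, got {len(values)}"
            )
        rows.append(dict(zip(spec.columns, values)))
    return spec.target_table, rows


def run_all(specs, raw_inputs):
    """Run the same flow for every metadata record and stage rows per target."""
    staged = {}
    for spec in specs:
        table, rows = ingest(spec, raw_inputs[spec.name])
        staged.setdefault(table, []).extend(rows)
    return staged


if __name__ == "__main__":
    specs = [
        SourceSpec("customers", ",", ("id", "name"), "stg_customers"),
        SourceSpec("sales", ";", ("id", "customer_id", "amount"), "stg_sales"),
    ]
    raw_inputs = {
        "customers": "1,Ana\n2,Rui\n",
        "sales": "10;1;99.5\n",
    }
    for table, rows in run_all(specs, raw_inputs).items():
        print(table, rows)
```

Adding a new file or table then means adding one metadata record rather than building a new flow, which is the property that keeps the solution manageable as the number of sources grows.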
Following these strategies will help reduce the adoption gap while establishing the foundation for future projects that open up a new world of possibilities. Once the project matures, it becomes possible to take on new business cases, perhaps including real-time requirements or semi-structured data. This is something Hadoop is great at, making it the Swiss army knife for data-related projects.