Accelerate Apache Spark to boost big data platforms

Posted on: Apr 22, 2017

Big data platforms like Apache Spark process massive volumes of data faster than other options. As data volumes grow, enterprises seek ways to speed up Spark.

So, we have data -- lots and lots of data. We have blocks, files and objects in storage. We have tables, key values and graphs in databases. And increasingly, we have media, machine data and event streams flowing in.

It must be a fun time to be an enterprise data architect, figuring out how to best take advantage of all this potential intelligence -- without missing or dropping a single byte.

Big data platforms such as Spark help process this data quickly and converge traditional transactional data center applications with advanced analytics. If you haven't yet seen Spark show up in the production side of your data center, you will soon. Organizations that don't, or can't, adopt big data platforms to add intelligence to their daily business processes are soon going to find themselves way behind their competition.

Spark, with its distributed in-memory processing architecture -- and native libraries providing both expert machine learning and SQL-like data structures -- was expressly designed for performance with large data sets. Even with such a fast start, competition and larger data volumes have made Spark performance acceleration a sizzling hot topic. You can see this trend at big data shows, such as the recent, sold-out Spark Summit in Boston, where it seemed every vendor was touting some way to accelerate Spark.
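For a flavor of those SQL-like data structures, here is a minimal sketch using Spark's DataFrame API in Scala. The input file and column names (events.csv, userId) are hypothetical stand-ins, not anything from a specific deployment:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SqlLikeExample {
  def main(args: Array[String]): Unit = {
    // Start a local session; in production, Spark distributes this work
    // across a cluster's pooled memory.
    val spark = SparkSession.builder()
      .appName("sql-like-example")
      .master("local[*]")
      .getOrCreate()

    // Load a (hypothetical) CSV of events into a DataFrame, Spark's
    // SQL-like tabular structure.
    val events = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/events.csv")

    // Declare the query; Spark's Catalyst planner optimizes the execution.
    events.groupBy(col("userId"))
      .agg(count("*").as("eventCount"))
      .orderBy(desc("eventCount"))
      .show(10)

    spark.stop()
  }
}
```

The point of the declarative style is that the same query runs unchanged on a laptop or a thousand-node cluster; the engine, not the developer, decides how to execute it.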

If Spark already runs in memory and scales out to large clusters of nodes, how can you make it faster, processing more data than ever before? Here are five Spark acceleration angles we've noted:

In-memory improvements. Spark can use a distributed pool of memory-heavy nodes. Still, there is always room to improve how memory management works -- such as sharding and caching -- how much memory can be stuffed into each node and how far clusters can effectively scale out. Recent versions of Spark use the Tungsten engine's native off-heap memory management and compact data encoding, together with the optimizing Catalyst query planner, to greatly reduce both execution time and memory demand. According to Databricks, the leading Spark sponsor, future releases will continue to aggressively pursue greater Spark acceleration.
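As a concrete illustration, the sketch below exercises two of the memory-related knobs mentioned above: enabling Tungsten's off-heap allocation via configuration, and choosing an explicit storage level when caching. The 2 GB size and the row count are placeholder values for illustration, not tuning advice:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object MemoryTuningExample {
  def main(args: Array[String]): Unit = {
    // Enable Tungsten off-heap allocation; the 2 GB size is a placeholder.
    val spark = SparkSession.builder()
      .appName("memory-tuning-example")
      .master("local[*]")
      .config("spark.memory.offHeap.enabled", "true")
      .config("spark.memory.offHeap.size", 2L * 1024 * 1024 * 1024)
      .getOrCreate()

    val df = spark.range(0, 100000000L).toDF("id")

    // Cache with an explicit storage level: keep serialized data in
    // memory and spill to disk if it does not fit.
    df.persist(StorageLevel.MEMORY_AND_DISK_SER)
    df.count()  // materialize the cache

    spark.stop()
  }
}
```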

Native streaming data. The hottest topic in big data is how to deal with streaming data -- that is, how to process records as they arrive. Real-time streams require special handling, and this presents quite a management challenge. In the past, it often meant complex workflow management and dedicated messaging and queuing systems; sometimes the answer was a separate infrastructure cluster running a different software stack altogether. Today, streaming support is converging into -- and under -- friendlier paradigms. Spark 2.0, for instance, natively supports Structured Streaming, which folds new kinds of streaming data sources into the existing developer-friendly big data platform.
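To show how naturally streaming folds into the same API, here is a minimal Structured Streaming sketch in Scala. It reads lines from a local socket -- a toy source standing in for something like Kafka or files landing in a directory -- and maintains a running word count; the host and port are arbitrary example values:

```scala
import org.apache.spark.sql.SparkSession

object StructuredStreamingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Read a stream of lines from a local socket (example source only).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The same DataFrame operations used on static data apply to streams.
    val wordCounts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    // Continuously print the running counts; "complete" mode re-emits
    // the full aggregate on every trigger.
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

Because the batch and streaming APIs share the same DataFrame abstractions, a team can reuse most of its existing query logic rather than maintaining a second stack for real-time data.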