Big Data Requires More than Just Big Storage

Posted on: Aug 26, 2016

Big Data and big storage go hand in hand in the age of the Internet of Things, but while scale-out solutions are coming fast and furious, the enterprise should still keep in mind that capacity is not the only consideration when planning for device-driven workloads.

As most experts will confirm, speed counts just as much as size when dealing with extreme volumes, and not only in the form of faster throughput in support of rapid, even real-time, analytics. The ability to quickly configure and reconfigure storage environments is crucial as self-service, on-demand resource provisioning becomes more common.

If you look back at what storage was like in the very early days of Spark and Hadoop, you’ll realize how much progress has been made, says InfoStor’s Paul Rubens. Back then, storage was provided by locally attached disk drives, which kept data close to processors but at the expense of numerous enterprise requirements such as compliance and regulatory controls, audit capabilities and even some key security functions. Since then, we’ve seen the Hadoop Distributed File System (HDFS) exposed as an API that allows for massive, dedicated storage clusters, as well as software-defined storage (SDS), hyperconvergence on the physical layer and container-based virtualization, all of which result in tighter, more agile data infrastructure.
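The HDFS point is easy to see in practice: because the file system is reachable through an API (WebHDFS over HTTP, among others), compute nodes no longer need to sit on the same boxes as the disks. Below is a minimal sketch using the Python hdfs (HdfsCLI) package; the namenode address, user and paths are hypothetical placeholders, not details from the article.

```python
# Minimal sketch: reading and writing a remote HDFS storage cluster over
# WebHDFS. The namenode host, port, user and paths are invented for
# illustration only.
from hdfs import InsecureClient

# Connect to the namenode's WebHDFS endpoint of a dedicated storage cluster.
client = InsecureClient('http://namenode.example.com:50070', user='analyst')

# Write a small sensor reading into the cluster...
client.write('/data/iot/readings/device42.csv',
             data='timestamp,temp\n2016-08-26T12:00:00,21.7\n',
             overwrite=True)

# ...and read it back, exactly as a compute node elsewhere on the network would.
with client.read('/data/iot/readings/device42.csv') as reader:
    print(reader.read().decode('utf-8'))
```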

At the same time, there has been a steady influx of object storage in Big Data environments, according to Jonathan Ring, CEO of Austin, Texas, storage software developer Caringo. One aspect that is often overlooked in Big Data architectures is that they create both data and metadata as part of the routine process of turning outside streams into actionable intelligence. By combining object storage with a highly sophisticated automation stack, organizations will be able to create a search-friendly archive while maintaining a high level of flexibility at critical points in the data-processing chain, such as ingestion and analytics. And they can do this using simple commodity storage hardware and common HTTP access.
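The object-plus-metadata pattern looks roughly like the sketch below, written with boto3 against a hypothetical S3-compatible endpoint; the endpoint, bucket and metadata fields are illustrative assumptions, not Caringo's own API.

```python
# Sketch of storing an object with searchable metadata over plain HTTP,
# assuming a generic S3-compatible object store. Endpoint, credentials,
# bucket and keys are hypothetical.
import boto3

s3 = boto3.client('s3',
                  endpoint_url='http://objectstore.example.com:8080',
                  aws_access_key_id='demo',
                  aws_secret_access_key='demo')

# Ingestion: archive the raw stream as an object and attach metadata
# that later searches and analytics jobs can key on.
s3.put_object(Bucket='iot-archive',
              Key='2016/08/26/device42.json',
              Body=b'{"temp": 21.7}',
              Metadata={'device': '42', 'site': 'austin', 'schema': 'v1'})

# Analytics: fetch only the metadata to decide whether the object is relevant,
# without pulling the payload itself.
head = s3.head_object(Bucket='iot-archive', Key='2016/08/26/device42.json')
print(head['Metadata'])  # {'device': '42', 'site': 'austin', 'schema': 'v1'}
```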

Flash storage is also making inroads into the data lake as new high-capacity solutions enter the channel. One of the latest is IBM’s DeepFlash 150, which provides upwards of 170 TB per rack managed under the Spectrum Scale GPFS (General Parallel File System) software stack. The company says it can deliver this solution at less than $1 per GB while maintaining key enterprise features like snapshots, replication, compression and encryption. The system can also be configured for other data-heavy workloads like rich media streaming, HPC storage and even in-memory data analytics. (Disclosure: I provide web content services for IBM.)

Even with these developments, Big Data will still pose a challenge when it comes to preserving and retrieving information. As Enterprise Storage Forum’s Henry Newman points out, data tiering is likely to emerge as a particularly thorny issue considering data collection is expected to exceed the capabilities of even the most robust network architecture. Getting the right data to the right tier within a reasonable timeframe will require a fair bit of processing, and this is before the real work begins in the analysis and modeling engines. As well, data movement will likely be continuous throughout the analytics lifecycle because knowledge of each data set will change constantly, and this will affect, among other things, where and how it is to be stored.
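To make the tiering problem concrete, here is a toy sketch of the kind of placement rule Newman is describing. Nothing in it comes from the article: the tiers, thresholds and data-set attributes are invented purely to illustrate that placement is recomputed as what we know about a data set changes.

```python
# Toy tiering policy: map each data set to a storage tier from its current
# profile, and re-run the policy periodically so data migrates as that
# profile changes. All names and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class DataSet:
    name: str
    days_since_access: int
    in_active_model: bool  # is an analytics/modeling job still using it?

def choose_tier(ds: DataSet) -> str:
    """Pick a storage tier for a data set based on its current profile."""
    if ds.in_active_model:
        return 'flash'           # hot: feeds running analytics
    if ds.days_since_access <= 30:
        return 'disk'            # warm: recently touched, keep near compute
    return 'object-archive'      # cold: cheap, HTTP-accessible capacity

# Each pass over the catalog may move data to a different tier than last time.
for ds in [DataSet('clickstream', 2, True),
           DataSet('2015-sensor-logs', 210, False)]:
    print(ds.name, '->', choose_tier(ds))
```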

To be sure, virtually all aspects of Big Data infrastructure will be works in progress for some time to come, if not indefinitely. But the very fact that the technology has matured to the point that the enterprise can finally embark on this journey is quite extraordinary, and there is no telling what digital treasures will be discovered once the full scope of Big Data analytics hits its stride.