Researchers Demonstrate Breakthrough Storage Performance for Big Data Applications
Researchers from IBM today demonstrated the future of large-scale storage systems by scanning 10 billion files on a single system in just 43 minutes, shattering the previous record of one billion files in three hours by a factor of 37.
With data growing at unprecedented scales, this advance allows data environments to be unified on a single platform instead of being distributed across several systems that must be managed separately. It also reduces and simplifies data management tasks, allowing more information to be stored on the same system rather than requiring the continual purchase of additional storage.
In 1998, IBM researchers unveiled a highly scalable, clustered parallel file system called General Parallel File System (GPFS), which was further tuned to make this breakthrough possible. GPFS represents an advance in scaling storage performance and capacity while keeping management costs flat. This innovation could help organizations cope with the exploding growth of data, transactions, and digitally aware sensors and other devices that comprise Smarter Planet systems. It is suited for applications requiring high-speed access to large volumes of data, such as data mining to determine customer buying behaviors across massive data sets, seismic data processing, risk management and financial analysis, weather modeling, and scientific research.
Today's breakthrough was achieved using GPFS running on a cluster of ten eight-core systems with solid-state storage, performing the selection in 43 minutes. The GPFS management rules engine provides the comprehensive capabilities needed to service data management tasks such as this one.
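For readers unfamiliar with policy-driven storage, the idea behind such a rules engine can be sketched briefly. GPFS expresses rules in its own SQL-like policy language (applied with tools such as mmapplypolicy); the Python below only illustrates the underlying concept of selecting files by evaluating a declarative predicate against each file's metadata. The FileMeta fields, thresholds, and dates are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class FileMeta:
    """A file's metadata record, as a policy scan would see it (illustrative)."""
    path: str
    size: int          # bytes
    atime: datetime    # last access time

def rule_cold_and_large(meta: FileMeta, now: datetime) -> bool:
    # Hypothetical rule: larger than 1 GiB and untouched for more than 30 days.
    return meta.size > 2**30 and (now - meta.atime) > timedelta(days=30)

def apply_rule(files, rule, now):
    """Evaluate the rule against every metadata record; collect matching paths."""
    return [m.path for m in files if rule(m, now)]

# Tiny usage example with two fabricated records.
files = [
    FileMeta("/fs/big_old.dat", 3 * 2**30, datetime(2011, 1, 1)),
    FileMeta("/fs/small_new.dat", 10, datetime(2011, 7, 1)),
]
print(apply_rule(files, rule_cold_and_large, datetime(2011, 7, 22)))
# -> ['/fs/big_old.dat']
```

In GPFS itself an equivalent rule is written declaratively and the engine drives the scan; the sketch shows only the select-by-metadata idea, evaluated here over two records rather than 10 billion.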
GPFS's algorithm makes full use of every processor core on all of these machines in all phases of the task: data read, sorting, and rules evaluation. GPFS exploits solid-state storage appliances, with only 6.8 terabytes of capacity, whose excellent random-access performance and high data-transfer rates make them well suited to holding the metadata. The appliances sustain hundreds of millions of data input/output operations while GPFS continuously identifies, selects, and sorts the right set of files among the 10 billion on the system.
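As a rough illustration of that divide-and-conquer structure, the sketch below partitions a metadata scan across worker processes, evaluates a rule in each partition, sorts locally, and merges the sorted partial results. The inode-range partitioning, the stand-in reader, and the predicate are assumptions for illustration, not IBM's implementation:

```python
import heapq
from collections import namedtuple
from multiprocessing import Pool

# Hypothetical metadata record; a real scan reads these from the file
# system's metadata structures on the solid-state appliances.
Meta = namedtuple("Meta", ["inode", "path", "size"])

def read_range(start, end):
    """Stand-in metadata reader: fabricates records for one inode range."""
    return [Meta(i, f"/fs/file{i}", (i * 37) % 4096) for i in range(start, end)]

def matches_rule(meta):
    """Stand-in policy predicate (illustrative size threshold only)."""
    return meta.size > 2048

def scan_partition(bounds):
    """One worker: read its slice, evaluate the rule, sort the matches locally."""
    start, end = bounds
    selected = [m for m in read_range(start, end) if matches_rule(m)]
    selected.sort(key=lambda m: m.path)
    return selected

def parallel_scan(total_inodes=1_000_000, workers=8):
    """Fan the scan out across all cores, then merge the sorted partial results."""
    step = total_inodes // workers
    bounds = [(i * step, (i + 1) * step) for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(scan_partition, bounds)
    # A k-way merge of already-sorted partitions avoids a second full sort.
    return list(heapq.merge(*partials, key=lambda m: m.path))

if __name__ == "__main__":
    print(len(parallel_scan()))
```

The point of the pattern is that every phase (reading metadata, evaluating the rule, sorting, merging) is spread across all available cores, which is what keeps a scan of this size tractable.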
"Today's demonstration of GPFS scalability will pave the way for new products that address the challenges of a rapidly growing, multi-zettabyte world," said Doug Balog, vice president, storage platforms, IBM. "This has the potential to enable much larger data environments to be unified on a single platform and dramatically reduce and simplify data management tasks such as data placement, aging, backup and migration of individual files."
The previous record was also set by IBM researchers at the Supercomputing 2007 conference in Reno, NV, where they demonstrated the ability to scan one billion files in three hours.