Nvidia and Google Post New MLPerf AI Training Results
Today the MLPerf effort released results for MLPerf Training v0.6, the second round of results from its machine learning training performance benchmark suite.
MLPerf is a consortium of over 40 companies and researchers from universities, and the MLPerf benchmark suites are becoming the industry standard for measuring machine learning performance. The MLPerf Training benchmark suite measures the time it takes to train one of six machine learning models to a standard quality target in tasks including image classification, object detection, translation, and playing Go. To see the results, go to mlperf.org/training-results-0-6.
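The benchmark's contract is simple: train until the model first reaches the fixed quality target, then report the elapsed wall-clock time. Below is a minimal sketch of such a time-to-train harness; `train_one_epoch`, `evaluate`, and `quality_target` are illustrative placeholders, not MLPerf reference code.

```python
import time

def time_to_train(train_one_epoch, evaluate, quality_target, max_epochs=100):
    """Train until evaluate() first meets the quality target and report
    the elapsed wall-clock time, MLPerf-style (illustrative sketch)."""
    start = time.time()  # in v0.6 the clock starts at first access to the training data
    for epoch in range(max_epochs):
        train_one_epoch()
        if evaluate() >= quality_target:  # e.g. 0.759 top-1 accuracy for ResNet
            return time.time() - start
    raise RuntimeError("quality target not reached within max_epochs")
```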
The first version of MLPerf Training was v0.5; this release, v0.6, improves on the first round in several ways. According to MLPerf Training Special Topics Chairperson Paulius Micikevicius, “these changes demonstrate MLPerf’s commitment to its benchmarks’ representing the current industry and research state.” The improvements include:
Raises the quality targets for image classification (ResNet) to 75.9% top-1 accuracy, light-weight object detection (SSD) to 23% mAP, and recurrent translation (GNMT) to 24 SacreBLEU. These changes better align the quality targets with the state of the art for these models and datasets.
Allows use of the LARS optimizer for ResNet, enabling additional scaling (see the sketch after this list).
Experimentally allows a slightly larger set of hyperparameters to be tuned, enabling faster performance and some additional scaling.
Changes timing to start the first time the application accesses the training dataset, thereby excluding startup overhead. This change was made because the large scale systems measured are typically used with much larger datasets than those in MLPerf, and hence normally amortize the startup overhead over much greater training time.
Improves the MiniGo benchmark in two ways. First, it now uses a standard C++ engine for the non-ML compute, which is substantially faster than the prior Python engine. Second, it now assesses quality by comparing to a known-good checkpoint, which is more reliable than the previous very small set of game data.
Suspends the Recommendation benchmark while a larger dataset and model are being created.
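For reference, LARS (layer-wise adaptive rate scaling) rescales each layer's learning rate by the ratio of the weight norm to the gradient norm, which is what makes very large batch sizes viable for models like ResNet. Here is a minimal NumPy sketch of the per-layer update with illustrative hyperparameter values; `lars_update` is a hypothetical helper, not the MLPerf reference implementation.

```python
import numpy as np

def lars_update(w, grad, velocity, lr, momentum=0.9,
                weight_decay=1e-4, trust_coeff=0.001):
    """One LARS step for a single layer's weights (illustrative sketch)."""
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(grad)
    # Layer-wise trust ratio ||w|| / (||g|| + wd * ||w||), scaled by the
    # trust coefficient; keeps each update proportional to the weight norm.
    local_lr = trust_coeff * w_norm / (g_norm + weight_decay * w_norm + 1e-9)
    velocity = momentum * velocity + lr * local_lr * (grad + weight_decay * w)
    return w - velocity, velocity
```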
In this second slate of training results, both Nvidia and Google demonstrated their ability to cut the compute time needed to train the deep neural networks underlying common AI applications from days to hours.
However, these impressive results come at a high cost. For example, the Nvidia DGX-2h SuperPOD used to perform these training jobs has an estimated retail price of some $38 million. Consequently, Google seeks to exploit its position as the only major public cloud provider delivering AI supercomputing as a service to researchers and AI developers, built on its in-house Tensor Processing Units (TPUs) as an alternative to Nvidia GPUs.
Nvidia and Google each claim the #1 performance spot in three of the six “Max Scale” benchmarks. Nvidia reduced its run-times dramatically (by up to 80%) while using the same V100 Tensor Core accelerators in its DGX-2h building blocks.
While over 40 companies around the world are developing AI-specific accelerators, most are building chips for inference, not model training, the segment where Nvidia enjoys a massive share of the multi-billion-dollar market. For those companies, the MLPerf organization plans to release inference benchmark results in early September. Even among companies building silicon for training, the cost of competing in this marathon will preclude most if not all startups from participating. Intel is expected to enter the market later this year with its highly anticipated Nervana NNP-T.
According to the results, Nvidia’s best absolute performance comes on the more complex neural network models, perhaps showing that the programmability and flexibility of its hardware help it keep pace with newer, deeper, and more complex models.
Google has decided to cast a wider net to capture TPU users, working now to support the popular PyTorch AI framework in addition to Google’s own TensorFlow tool set. This will remove one of the two largest barriers to adoption; the other is the TPU’s exclusivity to Google Cloud Platform (GCP).
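As a rough illustration of what PyTorch-on-TPU support looks like in practice, here is a minimal training step via the torch_xla package, assuming a TPU-enabled environment with torch_xla installed; the tiny model and batch are placeholders.

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()              # acquire the Cloud TPU device
model = nn.Linear(10, 2).to(device)  # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 10).to(device)    # placeholder batch
y = torch.randint(0, 2, (8,)).to(device)

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
xm.optimizer_step(opt, barrier=True)  # step and flush the lazy XLA graph
```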
Google’s TPU is a force to be reckoned with for cloud-hosted training on GCP, while Nvidia continues to deliver excellent performance for in-house infrastructure and non-GCP public cloud services.