Nvidia Says Intel Is Misleading With Outdated Deep Learning Benchmarks
Nvidia, which has been promoting its GPUs for deep learning applications, claims that Intel published some incorrect "facts" about Xeon Phi processors in benchmarks related to deep learning. Deep learning has the potential to revolutionize computing, improve the efficiency and intelligence of our business systems, and deliver advancements that will help humanity in profound ways.
Nvidia has been pitching high-performance GPUs with its Tesla products for deep learning, and Intel has just announced at IDF a specialized Xeon Phi chip called Knights Mill optimized for deep learning. Knights Mill is scheduled for release in 2017.
Intel has not shied away from comparing its technology to GPUs and touting why it believes Xeon Phi is superior.
For example, Knights Mill is capable of acting as a host processor. So expect to see Intel promoting the benefits of not needing separate host processors and co-processors, and of attaching Knights Mill directly to system RAM. This, along with the performance differences between GPU architectures and Knights Mill, will be a recurring point of contention between the two companies both now and next year.
Nvidia used a blog post to outline Intel's "Deep Learning benchmark mistakes".
Intel recently published some benchmarks to make three claims about deep learning performance with Knights Landing Xeon Phi processors:
- Xeon Phi is 2.3x faster in training than GPUs
- Xeon Phi offers 38% better scaling than GPUs across nodes
- Xeon Phi delivers strong scaling to 128 nodes while GPUs do not
According to Nvidia, Intel used Caffe AlexNet data that is 18 months old, comparing a system with four Maxwell GPUs to four Xeon Phi servers. "With the more recent implementation of Caffe AlexNet, Intel would have discovered that the same system with four Maxwell GPUs delivers 30% faster training time than four Xeon Phi servers," Nvidia said. "In fact, a system with four Pascal-based NVIDIA TITAN X GPUs trains 90% faster and a single NVIDIA DGX-1 is over 5x faster than four Xeon Phi servers," the company continued.
Intel is comparing Caffe GoogleNet training performance on 32 Xeon Phi servers to 32 servers from Oak Ridge National Laboratory’s Titan supercomputer. Titan uses four-year-old GPUs (Tesla K20X) and an interconnect technology inherited from the prior Jaguar supercomputer. Xeon Phi results were based on recent interconnect technology.
According to Nvidia, using more recent Maxwell GPUs and interconnect, Baidu has shown that its speech training workload scales almost linearly up to 128 GPUs.
"Scalability relies on the interconnect and architectural optimizations in the code as much as the underlying processor. GPUs are delivering great scaling for customers like Baidu," Nvidia claims.
Intel says that 128 Xeon Phi servers deliver 50x faster performance compared with a single Xeon Phi server, while no such scaling data exists for GPUs. But as Nvidia noted above, Baidu already published results showing near-linear scaling up to 128 GPUs.
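Taking Intel's 50x figure at face value, the implied scaling efficiency is straightforward to check. A minimal sketch (the numbers come from the claims above; the calculation itself is just arithmetic):

```python
# Intel's claim: 128 Xeon Phi servers are 50x faster than one server.
nodes = 128
claimed_speedup = 50

# Ideal linear scaling would give a 128x speedup, so the implied
# parallel efficiency is speedup / nodes.
efficiency = claimed_speedup / nodes
print(f"Implied scaling efficiency: {efficiency:.1%}")  # 39.1%
```

By contrast, the "almost linear" scaling Nvidia cites for Baidu's 128-GPU runs would correspond to an efficiency approaching 100%.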
According to Nvidia:
"For strong-scaling, we believe strong nodes are better than weak nodes. A single strong server with numerous powerful GPUs delivers superior performance than lots of weak nodes, each with one or two sockets of less-capable processors, like Xeon Phi. For example, a single DGX-1 system offers better strong-scaling performance than at least 21 Xeon Phi servers (DGX-1 is 5.3x faster than 4 Xeon Phi servers)."
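Nvidia's "at least 21 servers" figure follows directly from its own 5.3x benchmark claim, as a quick check shows:

```python
# Nvidia's claim: one DGX-1 is 5.3x faster than a 4-server Xeon Phi cluster.
dgx1_vs_four_phi = 5.3
phi_servers_in_baseline = 4

# So matching one DGX-1 would take roughly 5.3 * 4 Xeon Phi servers.
equivalent_phi_servers = dgx1_vs_four_phi * phi_servers_in_baseline
print(f"~{equivalent_phi_servers:.1f} Xeon Phi servers")  # ~21.2
```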
It is obvious that Nvidia and Intel have chosen different approaches to address the demands of deep learning applications.
The new Intel Knights Mill chip, scheduled for release in 2017, will be built on a 10nm process.
Among the features/design tweaks for the new processor, Intel is adding what they are calling "variable precision" support. What that fully entails isn’t clear, but the use of lower precision modes has been a major factor in the development and subsequent high performance of machine learning-focused processors, so it’s likely that this means that Intel is adding FP16 and possibly other lower-precision modes, something the current Knights Landing lacks.
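To illustrate why lower precision matters for deep learning hardware, here is a small sketch using NumPy's `float16` type. This is a generic demonstration of the FP32/FP16 trade-off, not anything Intel has confirmed about how "variable precision" works on Knights Mill:

```python
import numpy as np

# FP16 halves the memory and bandwidth cost per value versus FP32,
# which is the main reason lower-precision modes boost training throughput.
fp32 = np.ones(1024, dtype=np.float32)
fp16 = fp32.astype(np.float16)
print(fp32.nbytes, fp16.nbytes)  # 4096 2048

# The trade-off: fewer mantissa bits means coarser numerical resolution.
print(np.finfo(np.float32).eps)  # ~1.19e-07
print(np.finfo(np.float16).eps)  # ~9.77e-04
```

Deep learning training is often tolerant of this reduced resolution, which is why GPU vendors added fast FP16 paths and why Intel is presumed to be following suit.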
Also on the feature list is improved scale-out performance. It’s not clear right now if this is some kind of fabric/interconnect change. But the ultimate goal is to make clusters of Xeon Phi processors perform better, which is an important factor in bringing down the training time of very large and complex datasets. Meanwhile there are also unspecified memory changes for Knights Mill, with Intel touting the chip’s "flexible, high capacity memory."