Wednesday, April 05, 2017
Google Says Its Tensor Processing Unit Outperforms CPUs And GPUs


Google claims that its Tensor Processing Unit (TPU), dedicated hardware for running machine-learning applications such as voice recognition, is more powerful than Intel's Haswell CPUs and Nvidia's Tesla K80 GPUs.

A Tensor Processing Unit is a custom application-specific integrated circuit (ASIC) developed specifically for machine learning.

Today, Google released a comparison of the TPU's performance and efficiency against Haswell CPUs and Nvidia Tesla K80 GPUs in a datacenter environment.

Google's paper evaluates a Tensor Processing Unit (TPU), deployed in its datacenters since 2015, that accelerates the inference phase of neural networks (NNs). The heart of the TPU is a matrix multiply unit containing 65,536 8-bit MACs, offering a peak throughput of 92 TeraOps/second (TOPS), alongside a large (28 MiB) software-managed on-chip memory.
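For a rough sanity check of the 92 TOPS figure: each MAC performs one multiply and one add per cycle, and assuming the 700 MHz clock reported in Google's paper (a figure not quoted in this article), the arithmetic works out as follows:

```python
# Back-of-the-envelope check of the TPU's quoted 92 TOPS peak figure.
# The 700 MHz clock comes from Google's paper; it is not stated in this article.
macs = 256 * 256       # 65,536 8-bit MAC units in the matrix multiply unit
ops_per_mac = 2        # each MAC does one multiply and one add per cycle
clock_hz = 700e6       # assumed TPU clock frequency, per the paper

peak_tops = macs * ops_per_mac * clock_hz / 1e12
print(f"Peak throughput: {peak_tops:.1f} TOPS")  # ~91.8, i.e. ~92 TOPS
```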

Google says that the TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of its NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, and so on) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low-power. Google compares the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.
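To see why a 99th-percentile target differs from an average-case one, consider a minimal sketch with invented latency numbers: a device whose average-case optimizations occasionally stall looks fast on mean latency but poor at the 99th percentile, while a deterministic device is the reverse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented latencies (milliseconds), purely for illustration: a device tuned
# for the average case (fast typical response, ~2% slow outliers) vs. a
# deterministic device (slower on average, but tightly bounded).
optimized = np.concatenate([rng.normal(3.0, 0.5, 980),   # typical fast path
                            rng.normal(20.0, 2.0, 20)])  # occasional stalls
deterministic = rng.normal(6.0, 0.2, 1000)

for name, lat in [("average-case optimized", optimized),
                  ("deterministic", deterministic)]:
    print(f"{name:>22}: mean={lat.mean():.1f} ms, "
          f"p99={np.percentile(lat, 99):.1f} ms")
```

The optimized device wins on mean latency but loses badly at the 99th percentile, which is the metric Google says its NN services are held to.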

Google's workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of its datacenters' NN inference demand.

Google claims that despite low utilization for some applications, the TPU is on average about 15X-30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X-80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.
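As a back-of-the-envelope illustration, dividing the system-level figures quoted later in this article (92 TOPS in a 384-watt TPU server versus 1.3 TOPS in a 455-watt Haswell server) lands near the top of that range. Note that Google's 30X-80X claim is based on measured application performance rather than this peak-rate arithmetic, so the match is only approximate.

```python
# Rough TOPS/Watt comparison from the system-level numbers quoted in the
# article; peak rates, not measured application performance.
tpu_tops, tpu_watts = 92.0, 384.0   # TPU throughput, host-server power
cpu_tops, cpu_watts = 1.3, 455.0    # Haswell Xeon throughput, server power

tpu_eff = tpu_tops / tpu_watts      # ~0.24 TOPS/W
cpu_eff = cpu_tops / cpu_watts      # ~0.0029 TOPS/W
print(f"TPU is ~{tpu_eff / cpu_eff:.0f}X more efficient (TOPS/Watt)")  # ~84X
```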

The TPU fits into a hard-drive slot within a datacenter rack, according to Google Distinguished Hardware Engineer Norm Jouppi, who joined Google three years ago and was one of the chief architects of the MIPS processor, possibly the best processor architecture ever.

Despite its multitude of multiply-accumulate units, the TPU has no stored program; it simply executes instructions sent from the host. The DRAM on the TPU is operated as a single unit in parallel because so many weights must be fetched to feed the matrix multiplication unit.
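A hypothetical sketch of that host-driven model, with illustrative names only loosely modeled on the instruction set described in Google's paper (the host owns all control flow; the device merely executes what it is sent):

```python
# Hypothetical sketch of a host-driven accelerator: the device keeps no
# program of its own, so the host streams each instruction to it.
# Instruction names here are illustrative, not Google's actual ISA.
from dataclasses import dataclass


@dataclass
class Instruction:
    op: str            # e.g. "READ_WEIGHTS", "MATMUL", "ACTIVATE"
    args: tuple = ()


class HostDrivenAccelerator:
    """Executes exactly what the host sends; has no program counter."""
    def execute(self, inst: Instruction) -> None:
        print(f"device executing {inst.op}{inst.args}")


device = HostDrivenAccelerator()
for inst in [Instruction("READ_WEIGHTS", ("layer0",)),
             Instruction("MATMUL", ("batch0",)),
             Instruction("ACTIVATE", ("relu",))]:
    device.execute(inst)   # host pushes one instruction at a time
```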

More details about the test

In Google's tests, a Haswell Xeon E5-2699 v3 processor with 18 cores running at 2.3 GHz using 64-bit floating point math units was able to handle 1.3 Tera Operations Per Second (TOPS) and delivered 51 GB/sec of memory bandwidth; the Haswell chip consumed 145 watts and its system (which had 256 GB of memory) consumed 455 watts when it was busy.

The TPU, by comparison, used 8-bit integer math and had access to 256 GB of host memory plus 32 GB of its own; it delivered 34 GB/sec of memory bandwidth on the card and processed 92 TOPS, a factor of 71X more throughput on inferences, within a 384-watt thermal envelope for the server that hosted the TPU.
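The quoted 71X is simply the ratio of the two raw throughput figures, keeping in mind that it compares 8-bit integer operations against 64-bit floating-point ones:

```python
# The article's "factor of 71X" is the ratio of the two throughput figures.
tpu_tops = 92.0   # TPU, 8-bit integer operations
cpu_tops = 1.3    # Haswell Xeon, 64-bit floating-point operations
print(f"Throughput ratio: {tpu_tops / cpu_tops:.0f}X")  # -> 71X
```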

Google also examined the throughput, in inferences per second, that the CPUs, GPUs, and TPUs could sustain at different inference batch sizes under a 7-millisecond ceiling on 99th-percentile response time.

With a small batch size of 16 and a 99th-percentile response time near 7 milliseconds, the Haswell CPU delivered 5,482 inferences per second (IPS). That is 42 percent of its maximum throughput of 13,194 IPS, reached at a batch size of 64 when response times were unconstrained and the 99th percentile stretched to 21.3 milliseconds. The TPU, by contrast, could run batch sizes of 200 and still meet the 7-millisecond ceiling, delivering 225,000 IPS on the inference benchmark; that is 80 percent of its peak performance, which comes at a batch size of 250 with the 99th-percentile response at 10 milliseconds.
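Those utilization figures can be reproduced from the raw numbers above (a quick check in Python):

```python
# Reproducing the utilization percentages quoted above.
cpu_ips_7ms = 5_482       # Haswell IPS under the 7 ms p99 ceiling (batch 16)
cpu_ips_max = 13_194      # Haswell IPS with unconstrained latency (batch 64)
print(f"CPU: {cpu_ips_7ms / cpu_ips_max:.0%} of max throughput")  # -> 42%

tpu_ips_7ms = 225_000     # TPU IPS under the 7 ms ceiling (batch 200)
tpu_peak_fraction = 0.80  # stated fraction of TPU peak (batch 250, 10 ms p99)
print(f"TPU implied peak: {tpu_ips_7ms / tpu_peak_fraction:,.0f} IPS")  # 281,250
```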


