ARM Details New Cortex-A72 Processor
Following its official announcement at the beginning of February, ARM on Thursday provided better insight into its new Cortex-A72 CPU during the company's TechDay 2015 event in London. The Cortex-A72 is ARM's highest-performance and most advanced processor. Based on the ARMv8-A architecture and launched in early 2015, the Cortex-A72 builds on the success of the Cortex-A57 across mobile and enterprise markets, and promises to deliver three and a half times the performance of Cortex-A15 based devices within a smartphone power budget, as well as significant reductions in overall power consumption.
ARM is promoting the A72 on the new FinFET processes from Samsung/GlobalFoundries and TSMC, which are referred to as 14nm and 16nm in the slides the company showed during the event.
The figures you see below have been measured at the same frequency, on the same process and with identical memory system interfaces:
ARM has made several optimizations to the architecture to improve performance compared to the A57. The redesigned Cortex-A72 processor offers increased SIMD / floating point and integer compute performance. ARM showed a general 15-30% increase in Instructions Per Clock (IPC), depending on the kind of workload.
With the Cortex-A72, ARM has improved the performance, power and size of the processor compared to the A57. ARM says the CPU area is just 1.15 square mm on TSMC's 16FF design.
Optimizations include a redesigned CPU architecture, a new branch predictor and improvements in the decoder pipeline.
Instruction fetch capabilities have been improved through a new branch-prediction algorithm that increases performance and reduces power: compared to the A57, mispredictions have been cut by 50% and speculation by 25%. Superfluous branch-predictor accesses have also been suppressed. There has also been general power optimization in the RAM organization by coupling IP blocks closer together.
The A72's decoder/rename capabilities have also been improved. The 3-wide decoder features increased effective decode bandwidth and has received some AArch64 instruction-fusion enhancements. Power consumption has also been reduced through optimizations in decoding, the buffers and flow-control hardware.
The dispatch/retire stage sees the biggest improvements to performance. ARM's dispatch unit can feed the execution units with smaller micro-ops and has increased effective dispatch bandwidth (5-wide). This increases decoder throughput while also increasing the total number of micro-ops created by the dispatcher and eventually executed per cycle. ARM says that the result is an average of 1.08 micro-ops per instruction in code. In addition, ARM has reworked the register file, reducing the number of read ports by introducing port sharing and further reducing superfluous accesses.
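As a rough, back-of-the-envelope illustration of how the quoted widths relate, the following Python sketch multiplies the 3-wide decode figure by ARM's average of 1.08 micro-ops per instruction and compares the result with the 5-wide dispatch figure. The assumption that the decoder sustains its full width every cycle, and all variable names, are illustrative and do not come from ARM.

```python
# Back-of-the-envelope sketch; assumes the decoder sustains its full
# width every cycle, which real code will rarely achieve.
DECODE_WIDTH = 3           # instructions decoded per cycle (from the article)
UOPS_PER_INSTRUCTION = 1.08  # ARM's quoted average (from the article)
DISPATCH_WIDTH = 5         # micro-ops dispatched per cycle (from the article)

# Average micro-ops produced per cycle at peak decode rate.
PEAK_UOPS_PER_CYCLE = DECODE_WIDTH * UOPS_PER_INSTRUCTION

# Share of the dispatch width used by that average flow (hypothetical metric).
DISPATCH_UTILIZATION = PEAK_UOPS_PER_CYCLE / DISPATCH_WIDTH

print(f"Average micro-ops per cycle at peak decode: {PEAK_UOPS_PER_CYCLE:.2f}")
print(f"Share of 5-wide dispatch used on average: {DISPATCH_UTILIZATION:.0%}")
```

Under these assumptions the average micro-op flow stays below the dispatch width, which would leave headroom for instructions that split into more than one micro-op.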
ARM also introduced new FP/Advanced SIMD units, which allow for significantly reduced latency as the FP pipeline length is reduced from 9 to 6. FMUL is reduced from 5 cycles down to 3 (40% latency reduction), FADD goes from 4 to 3 (25% latency reduction), FMAC from 9 to 6 (33% latency reduction), and the CVT units go from 4 to 2 cycles (50% latency reduction).
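The percentages quoted above follow directly from the before and after figures. A minimal Python sketch of the arithmetic, treating all four pairs (including CVT) as cycle counts, which is an assumption; the helper name and table are illustrative only:

```python
def latency_reduction_pct(before_cycles, after_cycles):
    """Return the latency reduction as a whole-number percentage."""
    return round(100 * (before_cycles - after_cycles) / before_cycles)

# (A57 latency, A72 latency) in cycles, as quoted in the article.
LATENCIES = {
    "FMUL": (5, 3),
    "FADD": (4, 3),
    "FMAC": (9, 6),
    "CVT": (4, 2),
}

REDUCTIONS = {
    op: latency_reduction_pct(before, after)
    for op, (before, after) in LATENCIES.items()
}

for op, pct in REDUCTIONS.items():
    before, after = LATENCIES[op]
    print(f"{op}: {before} -> {after} cycles ({pct}% latency reduction)")
```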
The integer units also see an improvement, as the Radix-16 divider has seen its bandwidth doubled, while the CRC unit now becomes a pipelined block with just 1-cycle latency, a 3x increase in bandwidth over the A57.
The A72 also features improvements in its Load/Store unit. ARM says that bandwidth to L1/L2 has been improved by up to 30%. This was achieved by introducing a sophisticated L1/L2 data prefetcher, along with improvements in the L1-hit pipeline, forwarding network, and L1 way-predictor.
It is clear that the new Cortex-A72 processor is more powerful, while ARM has managed to keep the core small. The ARM Cortex-A72 will find its way into high-end mobile devices and also into low-power servers, possibly next year.