AMD Details High Floating Point Capabilities Of Upcoming Bulldozer Chips
One of the most interesting features planned for AMD's next generation core architecture, built around the new "Bulldozer" core, is something called "Flex FP," which promises to deliver tremendous floating point capabilities for technical and financial applications.
For those of you not familiar with floating point math, this is the arithmetic of fractional, very large and very small values, as opposed to the whole-number integer math that most applications use. In computing, floating point describes a system for representing numbers that would be too large or too small to be represented as integers. Numbers are generally represented approximately, to a fixed number of significant digits, and scaled using an exponent. AMD claims that its "Flex FP" floating point unit could give technical and financial applications that rely heavily on floating point math huge performance increases over existing architectures, as well as far more flexibility.
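To make the "significant digits scaled by an exponent" idea concrete, here is a minimal Python sketch. It is only an illustration and has nothing to do with AMD's hardware; the helper name `decompose` is made up for this example, while `math.frexp` is the standard library call that splits a float into its fraction and power-of-two exponent.

```python
import math

def decompose(x: float) -> tuple[float, int]:
    """Split x into (mantissa, exponent) so that x == mantissa * 2**exponent."""
    return math.frexp(x)

mantissa, exponent = decompose(6.0)
print(mantissa, exponent)        # 6.0 is stored as 0.75 * 2**3

# Values are held approximately: 0.1, 0.2 and 0.3 have no exact binary
# representation, so the rounded sum differs from the rounded constant.
print(0.1 + 0.2 == 0.3)          # False
```

The second print shows why floating point results are approximate: each value is rounded to a fixed number of significant bits before and after the addition.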
Flex FP is a single floating point unit that is shared between two integer cores in a module (so an AMD 16-core "Interlagos" would have 8 Flex FP units). Each Flex FP has its own scheduler; it does not rely on the integer scheduler to schedule FP commands, nor does it take integer resources to schedule 256-bit executions. This helps to ensure that the FP unit stays full as floating point commands occur. AMD says that Intel's and other competitors' architectures have used a single scheduler for both integer and floating point, so both kinds of commands are issued by one shared scheduler rather than by dedicated integer and floating point schedulers.
There will be some instruction set extensions that include SSSE3, SSE 4.1 and 4.2, AVX, AES, FMA4, XOP, PCLMULQDQ and others.
One of these new instruction set extensions, AVX, can handle 256-bit FP executions. However, there is no such thing as a 256-bit command. Single precision commands are 32-bit and double precision commands are 64-bit. With today's standard 128-bit FPUs, you execute four single precision commands or two double precision commands in parallel per cycle. With AVX you can double that, executing eight 32-bit commands or four 64-bit commands per cycle, but only if your application supports AVX. If it doesn't support AVX, then that flashy new 256-bit FPU only executes in 128-bit mode (half the throughput). That is, unless you have a Flex FP.
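The lane counts quoted above follow from simple division of the register width by the element width. The short Python sketch below just reproduces that arithmetic; the function name `lanes` is a hypothetical helper, not part of any AMD or Intel tool.

```python
def lanes(register_bits: int, element_bits: int) -> int:
    """Number of values of element_bits each that fit in one register."""
    return register_bits // element_bits

for width in (128, 256):
    singles = lanes(width, 32)   # 32-bit single precision values
    doubles = lanes(width, 64)   # 64-bit double precision values
    print(f"{width}-bit: {singles} single precision or {doubles} double precision values")
```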
In today's typical data center workloads, the bulk of the processing is integer and a smaller portion is floating point. So, in most cases you don't want one massive 256-bit floating point unit per core consuming all of that die space and all of that power just to sit around watching the integer cores do all of the heavy lifting. By sharing one 256-bit floating point unit for every two cores, AMD can keep die size and power consumption down, helping hold down both the acquisition cost and long-term management costs.
The Flex FP unit is built on two 128-bit FMAC units. The FMAC building blocks are quite robust on their own. Each FMAC can do an FMAC, FADD or a FMUL per cycle.
"When you compare that to competitive solutions that can only do an FADD on their single FADD pipe or an FMUL on their single FMUL pipe, you start to see the power of the Flex FP: whether 128-bit or 256-bit, there is flexibility for your technical applications. With FMAC, the multiplication or addition commands don't start to stack up like a standard FMUL or FADD; there is flexibility to handle either math on either unit," said John Fruehe, the director of product marketing for server/workstation products at AMD.
Here are some additional benefits:
* Non-destructive DEST via FMA4 support (which helps reduce register pressure)
* Higher accuracy (via elimination of the intermediate rounding step)
* Can accommodate FMUL OR FADD ops (if an app is FADD limited, then both FMACs can do FADDs, etc), which is a huge benefit
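The accuracy point can be illustrated in software. The Python sketch below is an emulation, not AMD's hardware behavior: it computes a*b + c once with two roundings (a separate multiply and add) and once with a single rounding at the end, which is what a fused multiply-accumulate is meant to achieve. The function names are hypothetical, and exact rational arithmetic from the standard `fractions` module stands in for the fused unit. Writing the result to a separate variable also mirrors the non-destructive destination form, d = a*b + c, where no input is overwritten.

```python
from fractions import Fraction

def separate_multiply_add(a: float, b: float, c: float) -> float:
    """a*b is rounded to a double first, then the sum is rounded again."""
    return a * b + c

def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulated fused operation: exact a*b + c, rounded only once at the end."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 as a double,
# so the separate version loses the small term entirely.
print(separate_multiply_add(a, b, c))   # 0.0
print(fused_multiply_add(a, b, c))      # about -8.67e-19, i.e. -2**-60
```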
The new AES instructions allow hardware to accelerate the large base of applications that use this type of standard encryption (FIPS 197). The "Bulldozer" Flex FP is able to execute these instructions, which operate on 16 bytes at a time, at a rate of 1 per cycle, which provides 2X more bandwidth than current offerings, AMD added.
By having a shared Flex FP, the power budget for the processor is held down. This allows AMD to add more integer cores into the same power budget. By sharing FP resources (that are often idle in any given cycle) AMD can add more integer execution resources (which are more often busy with commands waiting in line). In fact, the Flex FP is designed to reduce its active idle power consumption to a mere 2% of its peak power consumption.
"The Flex FP gives you the best of both worlds: performance where you need it yet smart enough to save power when you don't need it," Mr. Fruehe said.
The beauty of the Flex FP is that it is a single 256-bit FPU that is shared by two integer cores. With each cycle, either core can operate on 256 bits of parallel data via two 128-bit instructions or one 256-bit instruction, OR each of the integer cores can execute 128-bit commands simultaneously. This is not something hard coded in the BIOS or in the application; it can change with each processor cycle to meet the needs at that moment. Because servers spend most of their time executing integer commands, when a set of FP commands needs to be dispatched there is a high likelihood that only one core needs to do so, and that core then has all 256 bits to schedule.
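The per-cycle sharing described here can be pictured with a toy model. The Python sketch below is an assumption-laden illustration, not AMD's actual scheduling logic: the function `allocate_pipes`, the way demand is expressed, and the even split when both cores have work are all invented for this example. It only captures the idea that one core may use both 128-bit pipes when the other is idle, and that both cores may each use one pipe in the same cycle.

```python
PIPES = 2  # the two 128-bit FMAC units in one Flex FP

def allocate_pipes(core0_ops: int, core1_ops: int) -> tuple[int, int]:
    """Return how many 128-bit pipes each core gets in one cycle.

    core0_ops / core1_ops: number of 128-bit operations each core has ready
    (in this toy model a 256-bit AVX operation counts as 2).
    """
    if core0_ops == 0:
        return 0, min(core1_ops, PIPES)
    if core1_ops == 0:
        return min(core0_ops, PIPES), 0
    # Assumption for this sketch: when both cores have work, each gets one pipe.
    return 1, 1

# The demand pattern can change every cycle.
for demand in [(2, 0), (0, 1), (1, 1), (2, 2), (0, 0)]:
    print(demand, "->", allocate_pipes(*demand))
```

In the common case the article describes, where only one core has floating point work in a given cycle, the model hands that core the full 256 bits.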
Floating point operations typically have longer latencies, so their utilization is typically much lower; two threads are able to easily interleave with minimal performance impact. So the idea of sharing doesn't necessarily present a dramatic trade-off because of the types of operations being handled.
Also, each of AMD's pipes can handle SSE or AVX as well as FMUL, FADD, or FMAC, providing the greatest flexibility for any given application. Existing apps will be able to take full advantage of AMD's hardware with potential for improvement by leveraging the new ISAs, the company said.
"Obviously, there are benefits of recompiled code that will support the new AVX instructions. But, if you think that you will have some older 128-bit FP code hanging around (and let's face it, you will), then don't you think having a flexible floating point solution is a more flexible choice for your applications? For applications to support the new 256-bit AVX capabilities they will need to be recompiled; this takes time and testing, so I wouldn't expect to see rapid movement to AVX until well after platforms are available on the streets. That means in the meantime, as we all work through this transition, having flexibility is a good thing. Which is why we designed the Flex FP the way that we have," Mr. Fruehe added.