Processors Push The Edge at ISSCC 2010
Intel, AMD, Sun, and IBM will describe their ongoing efforts to develop high-performance processors at lower cost through migration to more-advanced processes at the International Solid-State Circuits Conference next February.
These new processes present significant technological hurdles, which reduce the gain in transistor density below what simple process scaling would predict. The most recent migration from 45nm to 32nm illustrates these constraints.
At ISSCC 2010, the first processors fabricated in 32nm CMOS technology will be presented. Intel's "Westmere" family of processors implements six cores at higher clock frequencies in the same power envelope as the four-core 45nm "Nehalem". The Westmere chip contains 1.17B transistors and a 12MB shared L3 cache in approximately the same die area as Nehalem, which had 731M transistors. The 50% increase in transistor density, less than observed in previous generations, illustrates the constraints of the 32nm process. Other integration features include an anti-resonance feature for the QuickPath I/O supply and an on-package LC power filter that reduces the DDR clock drivers' jitter. Dynamic voltage-frequency scaling is used to minimize power, as in previous generations.
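As a rough illustration (not drawn from Intel's paper), dynamic switching power scales approximately as P ≈ αCV²f, which is why lowering voltage and frequency together yields better-than-linear savings. The sketch below uses purely hypothetical capacitance and operating-point values:

```python
# Illustrative only (not from Intel's paper): dynamic switching power follows
# P ~ alpha * C * V^2 * f, so scaling voltage and frequency together gives
# better-than-linear power savings. Capacitance and operating points are
# hypothetical example values.

def dynamic_power(c_eff_farads, v_supply, f_hz, activity=1.0):
    """Estimate switching power for effective capacitance C at supply V and clock f."""
    return activity * c_eff_farads * v_supply ** 2 * f_hz

nominal = dynamic_power(1e-9, v_supply=1.0, f_hz=3.0e9)   # ~3.0 W per core
scaled  = dynamic_power(1e-9, v_supply=0.8, f_hz=2.0e9)   # ~1.3 W per core

print(f"{100 * (1 - scaled / nominal):.0f}% lower switching power at the reduced point")
```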
Engineers from AMD will describe a 3+GHz, 32nm Silicon-on-Insulator (SOI) x86-64 processor core with power controllable over a 2.5W to 25W range. Regular-Vt nFETs with a total width of 1.38m are used to construct the power-gating ring, taking advantage of the isolated substrates inherent in SOI technology. This enables a lower voltage drop during operation and a low-impedance virtual ground implemented in a package-metal layer. The L1 cache uses 8T cells, enabling operation at lower supply voltages for reduced power dissipation. Power-monitor circuitry calculates the dynamic power dissipation with respect to the thermal design limits, enabling the system controls to enact power reduction by methods such as voltage-frequency scaling.
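The following is a conceptual sketch, not AMD's actual monitor circuitry or control firmware: it shows how a power estimate can be compared against a thermal design limit to select a lower voltage-frequency operating point. The operating points and effective-capacitance value are invented for illustration:

```python
# Conceptual sketch only (not AMD's monitor or controller): a power estimate
# derived from P ~ alpha * C * V^2 * f is compared against the thermal design
# limit, and the fastest voltage-frequency point that fits is selected.
# All numeric values below are invented for illustration.

OPERATING_POINTS = [(1.10, 3.2), (1.00, 2.8), (0.90, 2.2), (0.80, 1.6)]  # (V, GHz)

def estimated_power(v, f_ghz, activity, c_eff=1.9):
    """Rough dynamic-power estimate in watts for a hypothetical core."""
    return activity * c_eff * v * v * f_ghz

def select_operating_point(tdp_watts, activity):
    """Return the fastest (V, GHz) point whose estimate stays under the limit."""
    for v, f in OPERATING_POINTS:
        if estimated_power(v, f, activity) <= tdp_watts:
            return v, f
    return OPERATING_POINTS[-1]  # fall back to the lowest-power point

print(select_operating_point(tdp_watts=5.0, activity=0.9))  # -> (1.0, 2.8)
```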
IBM's POWER7 processor integrates eight quad-threaded cores on a 567mm² die in 45nm SOI technology. New integration features include the use of embedded DRAM (eDRAM) for a 32MB shared L3 cache, utilizing deep-trench technology for a 0.065μm² cell size; this is the first time this dense memory technology has been used in a volume commercial processor. The deep trench is also used for very-high-density 100fF/μm² on-chip supply decoupling capacitors. To aid in power management, each core can run at a separate frequency appropriate to its operating mode. The processor is designed in a modular fashion, with multiple voltage and clock domains controllable according to workload requirements.
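A back-of-envelope check, assuming one eDRAM cell per stored bit and ignoring ECC, redundancy, and peripheral circuitry, shows how compact the deep-trench cells make the 32MB array relative to the die:

```python
# Back-of-envelope area check, assuming one deep-trench eDRAM cell per bit and
# ignoring ECC, redundancy, and peripheral circuitry.

bits = 32 * 2**20 * 8                     # 32 MB of L3 expressed in bits
cell_area_um2 = 0.065                     # quoted deep-trench cell size
raw_cell_area_mm2 = bits * cell_area_um2 / 1e6

print(f"{bits:,} bits -> ~{raw_cell_area_mm2:.1f} mm^2 of raw cell area, "
      f"about {100 * raw_cell_area_mm2 / 567:.0f}% of the 567 mm^2 die")
```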
Sun Microsystems will describe a 16-core SoC implemented in 40nm technology that enables an unprecedented 512 threads in a 4-way glueless system, delivering double the throughput in the same power envelope. The processor uses six clock and four voltage domains, together with power-management and circuit techniques, to optimize performance, power, variability, and yield across the 377mm² die. The unified 6MB L2 cache delivers a peak bandwidth of 461GB/s, and the high-speed SerDes provides a remarkable aggregate bandwidth of 2.4Tb/s.
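For context, the thread count follows directly from figures quoted in this piece: the chip is described below as 128-threaded, so 16 cores implies 8 hardware threads per core, and four sockets give 512 threads. A trivial consistency check:

```python
# Consistency check using numbers quoted in this piece: the Rainbow Falls chip
# is described below as 128-threaded, so 16 cores implies 8 threads per core,
# and a 4-way glueless system reaches 512 threads.

cores_per_chip = 16
threads_per_chip = 128        # from the network-on-chip discussion below
sockets = 4                   # 4-way glueless system

print(threads_per_chip // cores_per_chip, threads_per_chip * sockets)  # -> 8 512
```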
Using general-purpose processors for specialized functions results in lower performance than can be attained with special-purpose hardware. This is especially true in data centers, which must combine networking capability with server functions. Engineers from IBM will describe a "wire-speed" processor with network and server attributes that integrates special accelerators for cryptography, compression, pattern matching, and web servicing/security. Similarly, digital TV processors combine high-performance requirements with special processing features that cannot be satisfied by a multi-core processor using homogeneous general-purpose cores. A team led by Renesas will describe a multi-core processor using heterogeneous cores: the chip combines special-purpose processors, such as video accelerators, audio processors, and two 1024-way matrix processors, with dynamically reconfigurable cores that provide image recognition and database search.
Can Many-Core Processors Realize Their Potential?
Modern microprocessors have heavily embraced parallelism as a way to continue scaling up their performance while placing increased emphasis on energy efficiency. As the core count within the processor chip increases, so do the demands placed on both the on-chip and off-chip communication fabrics that supply the cores with the data they need. On-chip networks must be designed with a high degree of flexibility to accommodate the wide variety of application loads inherent in general-purpose computing, ensuring that application performance can scale with increasing core count. All of this must be achieved while overcoming the energy-efficiency challenges in switching and routing imposed by the increasingly minuscule metal wires used for on-chip interconnects, several million of which fit within a single inch.
At ISSCC 2010, innovations in networks-on-chip feature solutions spanning both the architectural and circuit levels, improving computing performance by taking energy efficiency and throughput to unprecedented levels. The multi-stage crossbar in Sun's 128-thread Rainbow Falls "data-center-on-a-chip" enables core-to-L2 communication at an impressive 461GB/s. The first of Intel's many contributions this year proposes an alternative approach, utilizing a ring-interconnect bus to deliver an eye-popping 1.2TB/s of communication bandwidth among the eight on-chip Xeon cores. Intel also describes a message-passing scheme using on-die shared memory in a 48-core system with a 6x4 virtual cut-through network, enabling a 256GB/s bisection bandwidth; dynamic voltage and frequency scaling across 8 voltage and 28 frequency domains yields an impressive network efficiency of approximately 0.2Tb/s/W. Intel will also describe an experimental circuit-switched streaming network protocol applied to a 64-node on-chip mesh network: router energy efficiency is significantly improved using circuit switching and dual-supply channels, and network throughput is increased using back-to-back streaming, achieving an aggregate throughput of 2.6Tb/s at energy efficiencies of 0.66 to 2.0pJ/b (that is, 0.5Tb/s/W to 1.5Tb/s/W). Lastly, Intel describes experiments on the interaction of dynamic voltage-frequency scaling with application-level core mapping and thread hopping, achieving improvements of 20% to 60% in both the throughput and the energy efficiency of many-core processors.
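The quoted energy-per-bit and throughput-per-watt figures are two views of the same number: 1pJ/b corresponds to exactly 1W per Tb/s of traffic, so efficiency in Tb/s/W is simply the reciprocal of the pJ/b value. A short conversion sketch:

```python
# The quoted pJ/b and Tb/s/W figures are reciprocals: 1 pJ per bit is exactly
# 1 W per Tb/s of traffic.

def tbps_per_watt(pj_per_bit):
    """Convert energy per bit (pJ/b) into throughput per watt (Tb/s/W)."""
    return 1.0 / pj_per_bit

for e in (2.0, 0.66):
    print(f"{e} pJ/b -> {tbps_per_watt(e):.2f} Tb/s/W")   # 0.50 and 1.52 Tb/s/W
```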
As core counts increase toward hundreds per chip in the very near future, it is becoming nearly impossible to predict the sustained performance of a network architecture or application-mapping protocol using conventional simulation-based analysis. Hence, the insights gained by fabricating network-on-chip hardware prototypes are proving to be an essential basis for understanding and advancing the sustained application performance of future many-core processor chips.
This and other related topics will be discussed at length at ISSCC 2010, the foremost global forum for new developments in the integrated-circuit industry. ISSCC, the International Solid-State Circuits Conference, will be held on February 7-11, 2010, at the San Francisco Marriott Marquis Hotel.