At the GTC keynote, Nvidia announced its next-generation GPU and Cuda architecture: Fermi (G300). The Californians concentrated on flexible usability and high utilization of the 512 shader ALUs; DirectX 11 was only an aside.
High-Level Diagram of the G300/Fermi
The architecture of the G300 is code-named Fermi and features about 3 billion transistors, 512 ALUs, up to 6 GiByte of GDDR5 RAM and a 384-bit memory interface. Nvidia has not revealed anything about clock rates yet, so all performance figures are given per clock and do not necessarily reflect how the final products will compare with their predecessors.
With the Fermi architecture, Nvidia focuses more and more on GPU computing and uses the corresponding terminology in its presentation. The former texture units have become load/store units, and the shader ALUs (which Nvidia formerly called stream processors) are now Cuda cores or Cuda processors. Chips based on the Fermi architecture will certainly be DirectX 11-compatible, but Nvidia does not talk much about that.
Specifications: G300 Fermi
Fermi Streaming Multiprocessor or SIMD
A total of 512 Cuda cores will be on the G300 chip, organized in 16 SIMD units. Every SIMD therefore has 32 ALUs, which share 16 load/store units (LS units, the former TMUs). So far Nvidia has not revealed details about the capabilities of the LS units, and the presented specifications give no hint about texture filtering performance. Between 4 and 16 texture filters per SIMD seem possible.
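The unit counts above can be cross-checked with a little arithmetic. The following is a minimal sketch that uses only the figures quoted in the article; the variable names are illustrative, and the texture filter range is the article's speculation, not a confirmed specification.

```python
# Figures quoted in the article; names are illustrative only.
SIMD_COUNT = 16          # SIMD units (Streaming Multiprocessors) per chip
ALUS_PER_SIMD = 32       # Cuda cores per SIMD
LS_UNITS_PER_SIMD = 16   # load/store units per SIMD

total_alus = SIMD_COUNT * ALUS_PER_SIMD
total_ls_units = SIMD_COUNT * LS_UNITS_PER_SIMD
alus_per_ls_unit = ALUS_PER_SIMD // LS_UNITS_PER_SIMD

# Speculated range of 4 to 16 texture filters per SIMD, scaled to the chip.
filters_low = 4 * SIMD_COUNT
filters_high = 16 * SIMD_COUNT

print(f"ALUs per chip:      {total_alus}")
print(f"LS units per chip:  {total_ls_units}")
print(f"ALUs per LS unit:   {alus_per_ls_unit}")
print(f"Texture filters:    {filters_low} to {filters_high} (speculative)")
```

If the speculated range holds, the whole chip would end up with somewhere between 64 and 256 texture filters.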
Special Function Units (SFUs) execute transcendental instructions such as sine, cosine, reciprocal and square root. Each SFU executes one instruction per thread, per clock; a warp executes over eight clocks. The SFU pipeline works independently from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied.
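The eight clocks per warp follow from the warp size of 32 threads and the four SFUs per SIMD mentioned elsewhere in this article. A minimal sketch of that calculation, assuming each unit handles exactly one thread per clock (the helper `clocks_per_warp` is hypothetical and only serves the illustration):

```python
import math

WARP_SIZE = 32       # threads per warp
SFUS_PER_SIMD = 4    # special function units per SIMD
CORES_PER_GROUP = 16 # Cuda cores in one issue group

def clocks_per_warp(units: int, warp_size: int = WARP_SIZE) -> int:
    """Clocks needed to run one warp-wide instruction on `units` lanes,
    assuming every lane processes one thread per clock."""
    return math.ceil(warp_size / units)

print("SFU instruction: ", clocks_per_warp(SFUS_PER_SIMD), "clocks per warp")
print("Core instruction:", clocks_per_warp(CORES_PER_GROUP), "clocks per warp")
```

Under the same assumption, an instruction issued to a group of sixteen cores would occupy that group for two clocks per warp.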
Double-precision arithmetic has been improved as well. Fermi not only complies with the IEEE 754-2008 floating-point standard (previously IEEE 754-1985) by offering the more precise FMA (Fused Multiply-Add, which AMD offers with the HD 5800 series and which Nvidia offered on the GT200 only for DP), it also increases DP throughput per clock cycle by a factor of about 8 compared with the GT200. Every SIMD (which Nvidia calls a Streaming Multiprocessor) can execute 16 DP FMA operations per clock, which adds up to 256 per chip; the GT200 was only capable of 30 DP MADs.
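Both claims can be illustrated with a short sketch. The first part recomputes the per-clock DP ratio from the figures quoted above. The second part shows why a fused multiply-add is more precise than a separate multiply and add: the fused result is rounded only once. The rounding demo runs in ordinary double precision on the CPU and emulates the single rounding with exact rational arithmetic; it is an illustration of the principle, not a model of Fermi's hardware.

```python
from fractions import Fraction

# Part 1: DP throughput per clock, using the article's figures.
FERMI_SIMDS = 16
FERMI_DP_FMA_PER_SIMD = 16
GT200_DP_MAD_PER_CLOCK = 30

fermi_dp_fma_per_clock = FERMI_SIMDS * FERMI_DP_FMA_PER_SIMD
dp_ratio = fermi_dp_fma_per_clock / GT200_DP_MAD_PER_CLOCK
print(f"Fermi DP FMAs per clock: {fermi_dp_fma_per_clock}")
print(f"Ratio to GT200:          {dp_ratio:.2f}")

# Part 2: one rounding (fused) versus two roundings (multiply, then add).
def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate FMA: compute a*b+c exactly, then round once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = a * b + c               # a*b is rounded to 1.0 before the add
fused = fused_multiply_add(a, b, c)

print(f"multiply then add: {unfused!r}")
print(f"fused:             {fused!r}")
```

In this example the exact product is 1 - 2^-60, which a separate multiply rounds to 1.0, so the subsequent add returns zero. The fused variant keeps the small term and returns -2^-60.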
The SIMDs schedule threads in groups of 32 parallel threads called warps. Each SIMD has two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. Each warp scheduler issues an instruction to either a group of sixteen cores, sixteen load/store units or four SFUs. Since the warps execute independently, Fermi's scheduler does not need to check for dependencies within the instruction stream.
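The dual-issue idea can be sketched with a toy model. This is a deliberately simplified illustration, not Nvidia's scheduling algorithm: it assumes round-robin selection, ignores unit occupancy and latency, and only shows that two different warps receive an instruction in each cycle. The function `simulate_dual_issue` and the unit labels are hypothetical.

```python
from collections import deque

def simulate_dual_issue(warps, schedulers=2):
    """Toy model: in every cycle, each scheduler issues the next instruction
    of one distinct pending warp. Occupancy and latency are ignored."""
    pending = deque(
        (warp_id, deque(instrs)) for warp_id, instrs in warps.items() if instrs
    )
    timeline = []
    while pending:
        issued = []
        # At most one instruction per warp and per scheduler in a cycle.
        for _ in range(min(schedulers, len(pending))):
            warp_id, instrs = pending.popleft()
            issued.append((warp_id, instrs.popleft()))
            if instrs:
                # Warp still has work: queue it behind the waiting warps.
                pending.append((warp_id, instrs))
        timeline.append(issued)
    return timeline

# Each warp lists the unit group its next instructions target.
example = {0: ["core", "sfu"], 1: ["ls"], 2: ["core"]}
timeline = simulate_dual_issue(example)
for cycle, issued in enumerate(timeline):
    print(f"cycle {cycle}: {issued}")
```

With two schedulers, the four instructions in the example are issued in two cycles, and no cycle contains two instructions from the same warp.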
Fermi Cache and RAM
Speeds & Feeds
A clever trick is the way the SIMDs' on-chip memory is divided. Physically there are 64 KiByte per SIMD/Streaming Multiprocessor. 16 KiByte are fixed as shared memory (as with G80 and GT200) and 16 KiByte are configured as Level 1 cache; the remaining 32 KiByte can be assigned freely.
Furthermore, the G300 with Fermi architecture offers a unified Level 2 cache with a capacity of 768 KiByte.
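The cache figures above translate into the following split options and chip-wide totals. This is a minimal sketch based only on the numbers in the article; the helper `split` is hypothetical, and the granularity with which the flexible 32 KiByte can be assigned is an assumption, since the article only says it can be used freely.

```python
SIMD_COUNT = 16
KIB_PER_SIMD = 64        # physical on-chip memory per SIMD
FIXED_SHARED_KIB = 16    # always shared memory
FIXED_L1_KIB = 16        # always Level 1 cache
L2_KIB = 768             # unified Level 2 cache

flexible_kib = KIB_PER_SIMD - FIXED_SHARED_KIB - FIXED_L1_KIB

def split(flexible_to_shared_kib: int) -> tuple[int, int]:
    """Return (shared, l1) in KiByte when the given part of the flexible
    memory goes to shared memory and the remainder goes to L1."""
    if not 0 <= flexible_to_shared_kib <= flexible_kib:
        raise ValueError("share must lie within the flexible 32 KiByte")
    shared = FIXED_SHARED_KIB + flexible_to_shared_kib
    return shared, KIB_PER_SIMD - shared

total_simd_kib = SIMD_COUNT * KIB_PER_SIMD
total_on_chip_kib = total_simd_kib + L2_KIB

print("All flexible memory as shared:", split(flexible_kib))
print("All flexible memory as L1:    ", split(0))
print("Per-SIMD memory, whole chip:  ", total_simd_kib, "KiByte")
print("Including L2:                 ", total_on_chip_kib, "KiByte")
```

The two extremes are 48 KiByte shared memory with 16 KiByte L1, or 16 KiByte shared memory with 48 KiByte L1.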
In our gallery you will also find brand-new screenshots from physics and raytracing tech demos. At the GTC, the already familiar GPU raytracing demo with a Bugatti Veyron was shown in an updated version that uses much more realistic lighting based on global illumination. Other demos show a physically correct water simulation of the kind used in films, as well as particle effects and a destruction simulation.
Fermi will probably not launch until 2010. Nvidia is still working on details, so it will take another couple of months before the graphics cards hit the retail market.