Hot Chips 2018: AMD APU Optimization Live Blog (Noon PT, 7pm UTC)
by Ian Cutress on August 20, 2018 2:55 PM EST- Posted in
- CPUs
- AMD
- Hot Chips
- APUs
- GPUs
- Trade Shows
- SoCs
- Live Blog
- Ryzen
- Raven Ridge
03:01PM EDT - AMD is also at Hot Chips, speaking about Raven Ridge and its APUs. The key elements to this talk will be the optimizations made for Raven Ridge, specifically around power and data management. We'll be live blogging the talk for everyone to follow.
03:02PM EDT - Working on APUs for several years
03:02PM EDT - Raven Ridge allowed the users to experience an uplift in performance
03:03PM EDT - Focusing on power and power efficiency
03:03PM EDT - GPUs have an appetite for memory - getting efficiency is important, as is power
03:03PM EDT - Battery life is a key consideration too
03:03PM EDT - 4C/8T, 11 Vega CUs
03:04PM EDT - ALmost everything is new - new Upgraded display engine, new audio coprocessor, new IO subsystem, USB 3.1 and USB-C native
03:05PM EDT - Went big on the GPU
03:05PM EDT - Up 60% transistors from Bristol
03:05PM EDT - But 16% smaller based on process note
03:05PM EDT - BGA package
03:06PM EDT - Also in desktop
03:06PM EDT - Dedicated L2 cache in the GPU
03:06PM EDT - Flexible geometry engine
03:06PM EDT - 16 pixel units
03:06PM EDT - two render back-ends
03:06PM EDT - 1200 Tri/sec at 1200 MHz
03:08PM EDT - Smaller L3 cache than desktop
03:08PM EDT - Smaller L3 cache than desktop
03:08PM EDT - Precision Boost 2 for dynamic core frequency
03:09PM EDT - GPU workloads tends to go through phases of render and physics phases
03:09PM EDT - Can adjust power based on where it is needed
03:09PM EDT - Fine grained p-states
03:09PM EDT - On die power regulation
03:09PM EDT - Can exploit all the power
03:10PM EDT - Heart of the chip is the fabric
03:10PM EDT - oen coherent protocol - Infinity Fabric
03:10PM EDT - Manages the full SoC
03:10PM EDT - Overall power budget and management
03:10PM EDT - Enhanced flow for powering on and off components
03:11PM EDT - Knew in 2013/2014 AMD was going to major reset the CPU portfolio
03:11PM EDT - IF was designed to scale from Server to high-end graphics into smaller mobile SoCs and desktop
03:11PM EDT - IF is a Transport Layer
03:12PM EDT - Scalable Data Ports and SDP interface modules to the transport layer
03:12PM EDT - SDP hides complexities of coherence protocol from connected IP
03:12PM EDT - CPU, GPU, Memory, IO, all use an SDPIM into the transport layer
03:13PM EDT - Transport Layer Switches are crossbars
03:13PM EDT - Coherent ports in region A/B
03:13PM EDT - Structured for multi-region power gating
03:13PM EDT - Up to 5 transfers per clock per switch
03:14PM EDT - Turn off bits of the fabric that are not needed to save power
03:14PM EDT - Transport request queues at each entry point into the fabric
03:14PM EDT - Hard Real Time, Soft Real Time, and Non-Real Time
03:15PM EDT - Within this, multiple virtual queues and channels
03:15PM EDT - Also priority classes end-to-end
03:15PM EDT - Can escalate through the entire fabric
03:15PM EDT - Need to improve mem bandwidth for GPU
03:15PM EDT - Bigger CPU and GPU caches helped a lot
03:16PM EDT - Caching algorithms and lossless compression (DCC) helps
03:16PM EDT - Direct reads of compressed memory
03:16PM EDT - Shadow of Mordor seems more memory active
03:17PM EDT - Direct compare Bristol Ridge and Ravin Ridge
03:17PM EDT - Draw Stream Binning Radterizer improves bandwidth
03:18PM EDT - Increase display engines - reset the engine and set a target to do a 4K display at Vmin (lowest voltage of process)
03:18PM EDT - Four pipes and flexibility to combine pipes or act independent
03:18PM EDT - 4x throuput over Bristol
03:18PM EDT - Keeping up with codec improvements
03:19PM EDT - Rather than 3 rails (CPU, GPU, IO), now have one
03:19PM EDT - 3 rail methods meant overprovisioning required - very inefficient. Now can be very efficient
03:19PM EDT - In-chip, rather than do FIVR, do individual LDO in each CPU core and graphics
03:19PM EDT - Allows more fine grained power delivery
03:20PM EDT - Allows for better control
03:20PM EDT - From Bristol Ridge, higher boost current but smaller regulator, overall lower ICCMax for CPU+GPU
03:21PM EDT - Each core has a regulator, so can control power states and power down each core when idle
03:21PM EDT - Fast graphics off and power down when needed
03:21PM EDT - New power modes
03:22PM EDT - managing a lot of skin temps through STAPM
03:22PM EDT - Skin Temperature Aware Power management
03:22PM EDT - Previous gen didn't exploit thermal headroom
03:22PM EDT - Now can go into thermal budgets and maximise performance
03:23PM EDT - Performance goes up
03:23PM EDT - Now 25x20 goal
03:24PM EDT - Above progress goal
03:24PM EDT - New systems will come out in the future to meet the target
03:25PM EDT - Q&A Time
03:27PM EDT - Q: Talk coherency and master/slave. A: The masters are participating in the protocol, slaves are the memory controllers. Manage all coherent / non coherent traffic
03:28PM EDT - Q: Are the display controllers coherent? A: The frame buffer is considered non-coherent memory
03:29PM EDT - Q: Moore's Law is painful but you can stay on course for 25x20 ? A: We are seeing improvements, looking forward to 7nm. Looking beyond that. Challenges today are more like IO scaling. If we start looking at IO bottlenecks, such as mem and mem throughput, that's our bigger challenges. Also, thermal management, thermal density, trying to tune every last bit in the device.
03:29PM EDT - That's a wrap. Next Live Blog at 4pm PT.
9 Comments
View All Comments
jjj - Monday, August 20, 2018 - link
Covering Xilinx, Mythic and ARM tomorrow?Ian Cutress - Monday, August 20, 2018 - link
https://twitter.com/IanCutress/status/103171584120...Mythic isn't on my list.
jjj - Tuesday, August 21, 2018 - link
Maybe take a closer look at what Mythic is trying to do and reconsider?SquarePeg - Monday, August 20, 2018 - link
I wish that AMD would give their APU's a quad channel memory controller for better bandwidth. A quad core APU with 24 CU's and 16gb (4x4gb) of ram @ 3466mhz would make for a solid little machine.PeachNCream - Tuesday, August 21, 2018 - link
A small amount of optional dedicated GDDR5 like was done with TurboCache and Sideport a number of years about might also be a cost-effective option. iGPUs more memory bandwidth and lower latency than system memory can provide alone and it might not be acceptable to add in two more memory channels.Targon - Tuesday, August 21, 2018 - link
Due to the AM4 socket on the desktop, AMD can't add more memory channels without changing the socket. If AMD adopts Gen-Z in the next few years, that has the potential to change things a lot in terms of how memory is addressed, because it can do away with the need for more memory channels to get higher bandwidth access to memory. Memory would be on the Gen-Z bus, and the CPU would just talk to the bus. This might require the new socket, or Ryzen fourth generation might be able to support it on AM4(if the motherboard is Gen-Z, then all pins for PCI Express slots and RAM could be used for the Gen-Z connection, if the motherboard is "legacy", then the CPU would use legacy mode and be limited to the 2 channels. How long it will take for Gen-Z to show up is the question.Blitzed Penguin - Tuesday, August 21, 2018 - link
Did AMD say if they are going to release optimized drivers for thier mobile Raven Ridge APU's? Current drivers for Ryzen 5 2500u are still stuck on 17.7 and are released from OEMs not AMD. Power management between the CPU and GPU is all wrong.denywinarto - Wednesday, August 22, 2018 - link
Is this desktop or laptop apu? Kinda to soon for an upgrade isnt it? 2019 would make more sense..abufrejoval - Friday, August 24, 2018 - link
As good as this is, what I really expected AMD to produce is what they gave away to Intel or did with the Chinese: Produce an APU which doesn't compromise on the graphics, even if that precludes the 15Watt mobile space (BTW: Is there any initiative to produce mobile GDDRx or HBM?).So put some HBMx on the die carrier and then use DDRx only for the CPU (would even work with an AM4 socket, right?) or go all GDDR5 like they did with the console and perhaps allow to combine several of these consoles into something beefier (2x, 4x) connected via IF.