Comments Locked

47 Comments


  • FireSnake - Tuesday, December 24, 2019 - link

    Having 3rd player will sure be interesting...
  • coburn_c - Tuesday, December 24, 2019 - link

    Intel sees a future where everything will be connected and capturing data, and they are gonna build the tools to data mine every bit of it and sell it to the highest bidder. Goodbye freedom, hello profit.
  • surt - Tuesday, December 24, 2019 - link

    Just wait a little bit. Very soon there will be some non-tragically-dumb police officer investigating a child death who will realize an Alexa device captured the event. There will be calls for secure storage and Megan's law reporting. It will clear up all the data listening problems very quickly.
  • 29a - Friday, December 27, 2019 - link

    This.
  • Alexvrb - Tuesday, December 24, 2019 - link

    Well they're late to the party, Alphabet Inc. already does that.
  • PeachNCream - Tuesday, December 24, 2019 - link

    I see you are unfamiliar with Google, Facebook, Amazon, or Microsoft (and a bunch of smaller fish that fly well below the radar, but are actively getting to know you very well).
  • imaheadcase - Tuesday, December 24, 2019 - link

    So like what most companies do now?
    bank account? shared
    medical record? shared (yep they can)
    insurance? Shared (often with banks)
    The number of data points being shared is insane, and most people don't even know their info is being shared.
  • Pro-competition - Wednesday, December 25, 2019 - link

    "Having 3rd player will sure be interesting..." What a vacuous statement. Are you a bot?
  • Flunk - Wednesday, December 25, 2019 - link

    Yes, I think he might be the type of bot that randomly quotes the OP and then adds "What a vacuous statement. Are you a bot?". I deeply regret writing that one but it's out there now.
  • 29a - Tuesday, December 24, 2019 - link

    I’ll believe it when I see it.
  • martinw - Tuesday, December 24, 2019 - link

    > we’re looking at 66.6 TeraFLOPs per GPU. Current GPUs will do in the region of 14 TF on FP32, so we could assume that Intel is looking at a ~5x increase in per-GPU performance by 2021/2022 for HPC.

    But HPC ExaFLOPs are traditionally measured using FP64, so that means a ~10x increase.
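    For anyone checking that arithmetic (and taking FP64 at roughly half the quoted 14 TF FP32 figure, which is about right for current HPC cards), the ratios come out as:

        \[
        \frac{66.6\ \mathrm{TF}}{14\ \mathrm{TF}} \approx 4.8\times \ (\mathrm{FP32}), \qquad
        \frac{66.6\ \mathrm{TF}}{7\ \mathrm{TF}} \approx 9.5\times \ (\mathrm{FP64})
        \]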
  • Santoval - Tuesday, December 24, 2019 - link

    If Intel manage to deliver ~67 TFLOPs of *double* precision in a single GPU package -even if it consists of multiple GPU chiplets- I will eat the hat I don't have. ~67 TFLOPs of single precision in a single GPU package might be possible (at a 480 - 500W TDP) due to Intel's new GPU design and its 7nm node, which should be quite a bit more power efficient than their 10nm & 14nm nodes, assuming Intel can fab it at a tolerable yield, that is.

    The use of Foveros and EMIB also reduces the power budget and increases performance further, because they alleviate I/O power draw and, along with that "Rambo cache", mitigate the memory bottleneck. The graphics memory will also probably be HBM3, with quite a bit higher performance and energy efficiency.

    So ~5x the performance at roughly 2x the TDP of the RTX 2080 Ti might be doable. That is ~2.5 times the performance per watt, which is high but not excessive. To double that performance further though is impossible. Intel are a semiconductor company, they are not wizards.
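    As a rough sanity check on that performance-per-watt figure (assuming the usual ~13.4 TF FP32 at 250 W for the RTX 2080 Ti):

        \[
        \frac{67\ \mathrm{TF} / 500\ \mathrm{W}}{13.4\ \mathrm{TF} / 250\ \mathrm{W}} = \frac{0.134}{0.0536} \approx 2.5\times\ \mathrm{perf/W}
        \]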
  • nft76 - Wednesday, December 25, 2019 - link

    I'm guessing the number of nodes (and GPUs) will be at least two, probably more like three to four times larger than estimated in the article. I'm guessing that the ~200 racks is without storage included and there will be more nodes per rack. If I'm not mistaken, Cray Shasta high-density racks are much larger than standard.
  • eastcoast_pete - Tuesday, December 24, 2019 - link

    Thanks Ian, Happy Holidays to All at AT and here in "Comments"!
    My first thought was: boy, that lower case/upper case in oneAPI really is necessary; reading the subheading, I almost thought it was about an unusual Irish name (O'NEAPI), w/o the apostrophe.
    On a more serious note, this also shows how important the programming ecosystem is; IMO, a key reason why NVIDIA remains the market leader in graphics and HPC.
  • UltraWide - Tuesday, December 24, 2019 - link

    Nvidia recognized this more than 10 years ago; everyone else is playing catch-up.
  • JayNor - Tuesday, December 24, 2019 - link

    Intel is extending SYCL for FPGA configuration using data flow pipes. They've mentioned previously that Agilex will have the first implementation of PCIe 5.0 and CXL. Perhaps oneAPI will do something to simplify FPGA design.

    https://github.com/intel/llvm/blob/sycl/sycl/doc/e...
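    For the curious, here is a rough sketch of what those data flow pipes look like in SYCL source, based on the linked extension document. The kernel and pipe names are illustrative, and the intel::pipe namespace/spelling has moved around between DPC++ releases, so treat this as approximate rather than the final API:

        #include <CL/sycl.hpp>
        using namespace cl::sycl;

        // A compile-time-identified FIFO between two kernels (Intel's SYCL
        // "data flow pipes" extension for FPGAs).
        using my_pipe = intel::pipe<class my_pipe_id, int, 8>;  // capacity hint of 8

        int main() {
          queue q;  // in practice this would select the FPGA device or emulator

          // Producer kernel: pushes 8 values into the pipe.
          q.submit([&](handler &h) {
            h.single_task<class producer>([=]() {
              for (int i = 0; i < 8; ++i)
                my_pipe::write(i);
            });
          });

          int result = 0;
          {
            buffer<int, 1> out_buf(&result, range<1>(1));
            // Consumer kernel: pops the same 8 values and sums them.
            q.submit([&](handler &h) {
              auto out = out_buf.get_access<access::mode::write>(h);
              h.single_task<class consumer>([=]() {
                int sum = 0;
                for (int i = 0; i < 8; ++i)
                  sum += my_pipe::read();
                out[0] = sum;
              });
            });
          }  // buffer destructor waits and copies back; result == 28
          return 0;
        }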
  • JayNor - Tuesday, December 24, 2019 - link

    Intel's current NNP chips don't have PCIe 5.0 or CXL, and I recall some discussion that manual memory management on the NNP-I chips was presented as a feature.

    Is Intel enthusiastically pushing shared memory for high-performance GPU programming, or is it just a convenience during development to get CPU solutions working on the GPU quickly?
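    For context, the shared memory in question is oneAPI's Unified Shared Memory, which lets host and device dereference the same pointer instead of going through explicit buffers. A minimal sketch, assuming the DPC++ USM extension as documented around this time (exact namespaces and argument order have shifted between releases):

        #include <CL/sycl.hpp>
        using namespace cl::sycl;

        int main() {
          queue q;  // whatever device the default selector finds

          const size_t N = 1024;
          // One allocation visible to both host and device; the runtime migrates
          // pages as needed rather than the programmer copying explicitly.
          float *data = malloc_shared<float>(N, q);

          for (size_t i = 0; i < N; ++i) data[i] = float(i);   // host writes

          q.submit([&](handler &h) {
            h.parallel_for(range<1>(N), [=](id<1> i) {
              data[i] *= 2.0f;                                  // device touches the same pointer
            });
          }).wait();

          float sum = 0.0f;
          for (size_t i = 0; i < N; ++i) sum += data[i];        // host reads the result directly

          free(data, q);
          return 0;
        }

    Whether Intel intends that as the default high-performance path or mainly as a porting convenience is exactly the open question; explicit buffers/accessors and device-only allocations remain available for code that wants manual control.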
  • ksec - Tuesday, December 24, 2019 - link

    >The promise with Xe-HPC is a 40x increate in FP64 compute performance.

    Increase

    One or two other spelling mistakes too, but I can no longer find them.

    >The CPUs will be Sapphire Rapids CPUs, Intel’s second generation of 10nm server processors coming after the Ice Lake Xeons

    First time I've heard SR will be a 10nm++ CPU; I always thought it was destined for 7nm. Possibly another roadmap shift.

    Other than that, great article. But as with anything recent from Intel, I will believe it when I see it. They are (intentionally or not) leaking lots of benchmarks and roadmaps, and a lot more "communication" about the (far) future, as some sort of distraction against AMD.

    I have my doubts about their GPU drivers, and I'm not entirely sure their 10nm yield and cost could compete against NV and AMD without lowering margins. But at least in terms of GPGPU it will bring some competition to Nvidia's ridiculously expensive solutions.
  • Alexvrb - Tuesday, December 24, 2019 - link

    Yeah they'll probably be more competitive in HPC in the short term. For gaming... we'll see. I suspect they'll get murdered in the short term unless they are really aggressive with pricing. If they go this route most likely they'll do CPU+GPU bundle deals with OEMs to force their way into the "gaming" market.
  • Spunjji - Friday, December 27, 2019 - link

    That approach seems highly likely.
  • MenhirMike - Tuesday, December 24, 2019 - link

    This is way above the stuff I work with, but now I want RAMBO Cache on all my stuff.
  • Batmeat - Tuesday, December 24, 2019 - link

    If you’ve skipped to this final page....


    How did he know I would do that?
  • Duncan Macdonald - Tuesday, December 24, 2019 - link

    Given Intel's brilliant(!!!) success in getting its 10nm process to work, I would take the dates with a few megatons of salt!!!
  • repoman27 - Wednesday, December 25, 2019 - link

    Ian, I think your block diagram is a little off. Although the Intel illustrations clearly involve a certain amount of artistic license, I think we can agree that there's an organic package substrate with 8 HBM stacks and 2 transceiver tiles which are connected via EMIB to two larger modules. The modules appear to be a stack with two interposers sandwiched together. The bottom interposer has 8 large chips which are most likely the XeMF dies, as well as several color coded regions representing EMIB zones along with a bunch of vias. The top interposer has the 8 XeHPC chiplets and 4 additional chips which are almost certainly the RAMBO caches, seeing as they look exactly like the depiction of said caches in the other slide. Then there is one giant ball grid connecting the top and bottom layers of the sandwich.

    That looks an awful lot like Co-EMIB to me. The 7nm compute chiplets and SDRAM caches (built on whatever process is the best fit) are bonded directly (Foveros) to a wafer with the memory fabric dies (probably on 14nm) and riddled with TSVs. Those modules then get singulated and plunked onto a substrate with a bunch of EMIBs inserted into it which connect them to each other as well as to the HBM stacks and transceiver tiles.

    Also, this point seems a little harsh: "Transition through DDR3 to DDR4 (and DDR5?) in that time frame". Intel may be way behind on their roadmap, but they made the transition to DDR4 several years ago with Skylake.
  • repoman27 - Wednesday, December 25, 2019 - link

    In fact, Intel may have already shown off a prototype wafer of the modules themselves: https://pbs.twimg.com/media/D_C-9b3U0AAeyv7.jpg

    via Anshel Sag on Twitter: https://twitter.com/anshelsag/status/1148627973882...
  • thetrashcanisfull - Wednesday, December 25, 2019 - link

    This seems worryingly light on technical details for the number of bold performance claims being made, particularly the architectural stuff.

    If Intel really has managed to execute a proper chiplet style GPU with EMIB / chip stacking, that would certainly open the door to major performance uplifts, but they are staying super vague on the underlying architecture and topology. Honestly, this slideware feels reminiscent of 3D XPoint, which, while still a solid technology, was years late and never delivered on the sort of hype it was announced with.

    I'll remain skeptical until we get more details - the advances in packaging and interconnects that Intel is touting could certainly enable improvements on this scale, but Intel's execution over the last decade leaves a lot of room for doubt.
  • smilingcrow - Wednesday, December 25, 2019 - link

    'Intel's execution over the last decade leaves a lot of room for doubt.'

    Decade! I thought they were ahead of the pack generally until Zen 2 was released 18 months ago!
    They have had a terrible 2 years but if you want to look at the last decade the real underachievers surely were AMD.
    The next few years are crucial so we will have to see how things pan out.
  • thetrashcanisfull - Wednesday, December 25, 2019 - link

    Decade may be an exaggeration, but not by much. Look at all of Intel's attempts to break into new markets: mobile/cellular, Larrabee/MIC, FPGAs (Altera), 3D XPoint...

    Intel has shown that it can be fairly successful as an incumbent in the server/desktop/laptop CPU market (or at least it could until the 10nm problems) but outside of that Intel has consistently struggled to deliver on anything over the last 8+ years.
  • jabber - Wednesday, December 25, 2019 - link

    Maybe it could be said that, with AMD struggling, they let off the gas pedal a bit and coasted for a while.
  • thetrashcanisfull - Wednesday, December 25, 2019 - link

    I think that's certainly true. Intergenerational improvements post Sandy Bridge were pretty anemic in the consumer market, largely because Intel refused to put out more than 4 cores on a mainstream platform until Coffee Lake. In server/HEDT, Intel was doing pretty well for a while by virtue of increasing core counts, but the 10nm woes have halted any progress on that front.
  • Spunjji - Friday, December 27, 2019 - link

    They were even worse on the notebook side of things. They were happy to sling dual-core + HT CPUs as "i7" processors until AMD announced the 2500U / 2700U; suddenly Intel came up with a "new" 15W Kaby Lake R CPU that looked suspiciously like a TDP-limited Kaby Lake, which itself was just a voltage-tweaked Skylake on an improved 14nm process.

    Any interested hobbyists can observe the extent of the truth in this by taking a notebook with a 45W quad-core Skylake CPU, undervolting it by 100-125mV (the vast majority will do this) and dropping in a TDP limit. The barely perceptible change in performance that results is truly something to behold. My own tweaked 6700HQ averages a 22W TDP under load just from the undervolt.
  • extide - Monday, December 30, 2019 - link

    Yeah, you just described what a mobile CPU is, so why are you surprised?
  • JayNor - Wednesday, January 1, 2020 - link

    I think if you asked Intel, they'd say ADAS, FPGAs and Optane are still very exciting programs for them, and they're probably already making money on ADAS and FPGAs.

    Intel shipped 88 million LTE modems for iPhones this year. How many LTE modems did AMD deliver?
  • Korguz - Thursday, January 2, 2020 - link

    sources ????
  • Spunjji - Friday, December 27, 2019 - link

    Depends whether you're focusing entirely on the CPU side of their business or making a more general assessment. Generally speaking, they've burned money on all sorts of unsuccessful projects (see: Atom cores in phones and tablets, their 4G modem projects).

    Even on the CPU side they had their fair share of struggles prior to the 10nm-induced disasters. Ivy Bridge was a weak and somewhat cheapened follow-up to the absolute blockbuster that was Sandy, while Broadwell arrived late and barely showed its face at all on the desktop due to early struggles with 14nm. These things didn't have larger effects because AMD and their foundry competitors were performing so terribly at the time, but they were missteps nonetheless.
  • Spunjji - Friday, December 27, 2019 - link

    Agreed re: worryingly light on technical details. I'm sure it all feels very real to the engineers working on it, but from an end-user perspective this is still very much a marketing exercise.
  • Sychonut - Wednesday, December 25, 2019 - link

    Raja "The Wood Elf" Koduri better not let his mouth write a check his ass can't cash.
  • Spunjji - Friday, December 27, 2019 - link

    Those who observed the Vega launch know he has no such compunctions :D
  • GreenReaper - Thursday, December 26, 2019 - link

    Hmm. That HPC block diagram . . . where have I seen it before . . .
    https://www.resetera.com/threads/how-would-a-ryzen...
  • peevee - Monday, December 30, 2019 - link

    Can it deal with compressed representations of sparse matrices?
  • peevee - Monday, December 30, 2019 - link

    "Xe contains two fundamental units: SIMT and SIMD. In essence, SIMD (single instruction, multiple data) is CPU like and can be performed on single elements with multiple data sources, while SIMT (single instruction, multiple threads) involves using the same instructions on blocks of data"

    That phrase makes absolutely no sense. "CPU-like" SIMD executes the same instruction on multiple data elements, not on "single elements".
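    For what it's worth, the usual way the distinction is drawn: SIMD is one thread issuing an instruction over a packed vector register of elements, while SIMT is many threads each running the same scalar-looking kernel. A rough illustration of the two styles (the SIMT half uses SYCL-style syntax and assumes x and y are device-accessible allocations, e.g. USM):

        #include <immintrin.h>
        #include <CL/sycl.hpp>
        #include <cstddef>

        // SIMD: a single instruction operates on a register of 8 packed floats.
        void scale_simd(const float *x, float *y, std::size_t n, float a) {
          __m256 va = _mm256_set1_ps(a);
          for (std::size_t i = 0; i + 8 <= n; i += 8) {        // remainder handling omitted
            __m256 vx = _mm256_loadu_ps(x + i);
            _mm256_storeu_ps(y + i, _mm256_mul_ps(vx, va));
          }
        }

        // SIMT: the body is written per element; the hardware runs it across many
        // threads in lockstep groups.
        void scale_simt(cl::sycl::queue &q, const float *x, float *y, std::size_t n, float a) {
          q.submit([&](cl::sycl::handler &h) {
            h.parallel_for(cl::sycl::range<1>(n), [=](cl::sycl::id<1> i) {
              y[i] = a * x[i];
            });
          }).wait();
        }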
  • peevee - Monday, December 30, 2019 - link

    What the H is Lenovo, a Chinese company, doing developing a critical tool for top-secret projects within the DoE?
  • henryiv - Thursday, January 2, 2020 - link

    Thanks for the great article. DPC++ stands for Data Parallel C++, by the way (which is basically Intel's implementation of SYCL).
  • Deicidium369 - Wednesday, January 27, 2021 - link

    Xe HP was shown with 4 tiles and 42 TFLOPS, so each tile = 10.5 TFLOPS at FP32, or half of that for FP64. Assuming FP64 is the most likely metric:

    Xe HPC has 16 Tiles x 5.25 TFLOPS per tile = 84 TFLOPS per Xe HPC. There are 6 Xe HPC per sled = 504 TFLOPS per sled or roughly 0.5 PFLOPS - so ~2000 sleds needed for 1 ExaFLOP FP64.

    2000 sleds - 20 sleds per rack = 100 racks at FP64

    230 Petabytes of storage: at the densest config (1 PB per 1U), that's 230 x 1U = 230U of storage, i.e. fewer than 6 racks...

    Even if using 2.5" would not need more than 20 racks for storage

    So 100 rack cabinets of compute + up to 20 rack cabinets of storage to reach 1 ExaFLOP FP64 and 230 PB. Networking could be 1-2 racks; not sure whether the water cooling components are in standalone racks or not. So ~122 cabinets + ??? for cooling.
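    Condensed, and taking the FP64-per-tile assumption above at face value, that chain is:

        \[
        16 \times 5.25\ \mathrm{TF} = 84\ \mathrm{TF/package}, \quad
        6 \times 84 = 504\ \mathrm{TF/sled}, \quad
        \frac{10^{6}\ \mathrm{TF}}{504\ \mathrm{TF/sled}} \approx 1984\ \mathrm{sleds} \approx 100\ \mathrm{racks\ at\ 20\ sleds/rack}
        \]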
