NVIDIA Announces Quadro GP100 - Big Pascal Comes to Workstations
by Ryan Smith on February 5, 2017 6:01 PM EST

Kicking off on this Sunday afternoon is CAD & CAE software developer Dassault Systèmes’ annual trade show, the aptly named SOLIDWORKS World. One of the major yearly gatherings for workstation hardware and software vendors, it’s often used as a backdrop for announcing new products. And this year NVIDIA is doing just that with a literal Big Pascal product launch for workstations.
The last time we checked in on NVIDIA’s Quadro workstation video card group, they had just launched the Quadro P6000. Based on a fully enabled version of NVIDIA’s GP102 GPU, the P6000 was the first high-end Quadro card to be released based on the Pascal generation. This is a notable distinction, as NVIDIA’s GPU production strategy has changed since the days of Kepler and Maxwell. No longer does NVIDIA’s biggest GPU pull triple duty across the consumer, workstation, and server markets. Instead, the server (and broader compute) market is large enough to justify going all-in on a compute-centric GPU. This resulted in Big Pascal coming to life as the unique GP100, while NVIDIA’s graphical workhorse became the smaller and more conventional (but still very powerful) GP102.
Because of this split in NVIDIA’s GPU designs, it wasn’t clear where this new compute-centric GPU would fit across NVIDIA’s product lines. It’s the backbone of the Tesla server cards, of course, and meanwhile it’s very unlikely to show up in consumer GeForce products. But what about the Quadro market, which in previous generations has catered to both graphics and compute users at the high end (if only because of the mixed-use nature of previous-generation GPUs)? The answer, as it turns out, is that Big Pascal has a place in the Quadro family after all – an interesting place at the top of the lineup, in a card NVIDIA calls the Quadro GP100.
Based on NVIDIA’s GP100 GPU, the Quadro GP100 defies a simple explanation, due in large part to GP100’s unique place in NVIDIA’s Pascal GPU family. On one hand, the Quadro GP100 is a return to form for NVIDIA’s Quadro lineup. It’s the jack-of-all-trades card that does everything – graphics and compute – including features that the Tesla cards don’t offer, a job previously fulfilled by cards like the Quadro K6000. On the other hand, it’s not necessarily NVIDIA’s most powerful workstation card: on paper its FP32/graphics performance is lower than the Quadro P6000’s. So where does the Quadro GP100 fit into the big picture?
The long and short of it is that the Quadro GP100 is meant to be a Tesla/GP100 card for workstations, but with even more functionality. While NVIDIA offers PCIe Tesla P100 cards, those cards only feature passive cooling and are designed for servers; the lack of active cooling means you can’t put them in (conventional) workstations. The Quadro GP100, on the other hand, is a traditional actively cooled card with a fan and shroud, like the rest of the Quadro lineup. And NVIDIA doesn’t stop there, enabling graphics functionality that isn’t available on the Tesla cards. The fact that NVIDIA isn’t even giving it a P-series name – rather naming it after the GPU underneath – is a good hint of where NVIDIA is going.
NVIDIA Quadro Specification Comparison
| GP100 | P6000 | M6000 | K6000
CUDA Cores | 3584 | 3840 | 3072 | 2880
Texture Units | 224 | 240 | 192 | 240
ROPs | 128? | 96 | 96 | 48
Boost Clock | ~1430MHz | ~1560MHz | ~1140MHz | N/A
Memory Clock | 1.4Gbps HBM2 | 9Gbps GDDR5X | 6.6Gbps GDDR5 | 6Gbps GDDR5
Memory Bus Width | 4096-bit | 384-bit | 384-bit | 384-bit
VRAM | 16GB | 24GB | 24GB | 12GB
ECC | Full | Partial | Partial | Full
FP64 | 1/2 FP32 | 1/32 FP32 | 1/32 FP32 | 1/3 FP32
TDP | 235W | 250W | 250W | 225W
GPU | GP100 | GP102 | GM200 | GK110
Architecture | Pascal | Pascal | Maxwell 2 | Kepler
Manufacturing Process | TSMC 16nm | TSMC 16nm | TSMC 28nm | TSMC 28nm
Launch Date | March 2017 | October 2016 | March 2016 | July 2013
The Quadro GP100 then is being pitched at an interesting mix of users. For compute users who need a workstation-suitable GP100 card, the Quadro GP100 is meant to be their card. It offers all of GP100’s core functionality, including ECC memory, half-speed FP64, and double-speed (packed) FP16 instructions. As an added kicker, the Quadro GP100 introduces a new NVLink connector for PCIe cards. This allows a pair of Quadro GP100 cards to be linked up in a 2-way NVLink configuration, bringing NVLink’s memory access and low latency data transfer benefits to PCIe cards. Notably, this isn’t available on the Tesla PCIe cards.
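For readers curious what “double-speed (packed) FP16” looks like in practice, here is a minimal CUDA sketch (my own illustration, not NVIDIA sample code): the half2 type packs two FP16 values per register, so intrinsics such as __hfma2 issue two multiply-adds per instruction, which is the path GP100 can run at twice its FP32 rate.

```cuda
#include <cuda_fp16.h>

// Minimal sketch of packed FP16: y = a*x + y on half2 data.
// Each half2 holds two 16-bit floats, so one __hfma2 performs two FMAs --
// the "double-speed (packed) FP16" path exposed on GP100.
__global__ void axpy_half2(half2 a, const half2* x, half2* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __hfma2(a, x[i], y[i]);   // two FP16 multiply-adds per instruction
}

// Illustrative launch (d_x/d_y are device buffers of n half2 elements):
//   half2 a = __floats2half2_rn(2.0f, 2.0f);
//   axpy_half2<<<(n + 255) / 256, 256>>>(a, d_x, d_y, n);
```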
As NVIDIA discusses it, they sound rather confident that Quadro GP100 will sell well to compute users, and for good reason. The Tesla P100 cards have been a hit with neural network programmers, and now researchers have a card suitable for dropping into a workstation to develop against.
Meanwhile the second market for the Quadro GP100 is the traditional high-end CAD/CAE market. For those more specialized users who need a workstation card with fast FP64 performance and ECC memory for maximum accuracy and reliability, the Quadro GP100 is the first Quadro card since the K6000 to offer that functionality. Arguably this is a bit of a niche, since most CAD users don’t need that kind of reliability, but for those who do – for complex engineering simulations and the like – it’s critical (not to mention a lucrative market for NVIDIA). Serving this market also makes the Quadro GP100 unique in that it’s the only GP100 card with its graphical functionality turned on.
However, when it comes to those graphical workloads, the line between the Quadro GP100 and P6000 gets a lot murkier. The Quadro P6000 is rated for 12 TFLOPS FP32, versus the Quadro GP100’s 10.3 TFLOPS, and similarly the Quadro GP100 offers only around 86% of the P6000’s texture throughput. Paper specs aren’t everything, of course, but in pure SM throughput-bound scenarios the P6000 should be the faster card. This is the advantage of the more compact (and more manufacturable) GP102 over the massive GP100.
The one wildcard here is the HBM2 memory interface and the associated ROPs. NVIDIA is specifically touting the Quadro GP100 as offering their fastest rendering performance, and depending on the scenario that can very well be the case. With 720GB/sec of memory bandwidth – thanks to 4 HBM2 stacks clocked at 1.4Gbps each – the Quadro GP100 has 66% more memory bandwidth than the Quadro P6000’s mere 432GB/sec. Coupled with what’s almost certainly a ROP count advantage – NVIDIA still hasn’t disclosed GP100’s ROP count, but based on what we know of GP102, 128 ROPs is a safe bet – the Quadro GP100’s pure pixel-pushing power should be around 22% greater than even the P6000’s. Given that CAD/CAE workloads can be very pixel-bound, this should be a tangible benefit for some Quadro customers.
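For those who want to check the arithmetic behind these comparisons, the following sketch works through the paper numbers (plain host-side code, no GPU required; the boost clocks are approximations and GP100’s 128-ROP count is, as noted above, an assumption):

```cuda
// Back-of-the-envelope check on the paper specs.
// Clocks are approximate boost clocks; GP100's 128 ROPs are an assumption.
#include <cstdio>

int main()
{
    const double gp100_clk = 1.43;   // GHz
    const double p6000_clk = 1.56;   // GHz

    // FP32 throughput: CUDA cores x 2 ops per FMA x clock
    printf("GP100 FP32:  %.1f TFLOPS\n", 3584 * 2 * gp100_clk / 1000.0);  // ~10.3
    printf("P6000 FP32:  %.1f TFLOPS\n", 3840 * 2 * p6000_clk / 1000.0);  // ~12.0

    // Texture throughput: texture units x clock (GP100 at ~86% of P6000)
    printf("Texture ratio: %.0f%%\n", 100 * (224 * gp100_clk) / (240 * p6000_clk));

    // Pixel throughput: ROPs x clock (GP100 ~22% ahead, if 128 ROPs is right)
    printf("Pixel ratio:   %.0f%%\n", 100 * (128 * gp100_clk) / (96 * p6000_clk));

    // Memory bandwidth: bus width (bytes) x data rate (GP100 ~66% ahead)
    printf("GP100 BW: %.0f GB/s\n", 4096 / 8.0 * 1.4);   // ~717
    printf("P6000 BW: %.0f GB/s\n",  384 / 8.0 * 9.0);   //  432
    return 0;
}
```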
The one drawback though is memory capacity. While the Quadro P6000 offers 24GB of VRAM thanks to the greater practical capacity of GDDR5X, like all GP100 products the Quadro GP100 tops out at 16GB of HBM2. This means that for very large dataset users, a single Quadro GP100 offers a good deal less memory than they can get out of the P6000. It’s worth noting that NVIDIA is touting NVLink as helping out with memory crunch issues, however I suspect that’s rooted in compute more than graphics.
Moving on then, outside of the GPU underneath, the Quadro GP100 packs the typical Quadro family hardware features. This includes 4 DisplayPort 1.4 ports and a single DVI port for display outputs, and NVIDIA is classifying it as VR Ready as well. Meanwhile towards the rear of the card are the Quadro Sync and Stereo connectors for synchronized refresh and quad-buffered stereo respectively.
Wrapping things up, like the rest of the Quadro cards being launched today, NVIDIA is expecting the Quadro GP100 to ship in March. Pricing has yet to be determined, but as the Quadro GP100 is the jack-of-all-trades GP100 card, I'm told that pricing will be slightly above the Quadro P6000's, which would put it somewhere north of $5,000.
54 Comments
ddriver - Tuesday, February 7, 2017 - link
What you are missing is that hbm is not without its problems either - cost, availability, production delays. Obviously hbm is not intrinsically better; if it were, nobody would use gddrx. Also, it would only take a 768bit memory interface for gddrx to beat hbm, and that's not too much of a stretch either - there have already been 512bit memory controllers on gpus, so a 25% increase ought to be possible.

What defines products is not possibility but commercial viability. If you had the money to spare for a larger MC on the die and the extra power, you wouldn't even bother with hbm.
The reason hbm is more energy efficient is that it is SLOWER than gddrx. So it is just not objectively true to say that hbm is faster.
Yojimbo - Wednesday, February 8, 2017 - link
I'm not missing that HBM2 is not a perfect solution. I also never said it is "intrinsically better". I said it's faster. Within 9 months you will see HBM2 at 1 TB/s. Already you can see it at 720 GB/s. There's a reason you don't see GDDR5X used for bandwidths that high, and it's not because it's faster.

There is always a cost and power trade-off for performance. It's irrelevant how fast GDDR5X could be if you allowed it to run off a small nuclear reactor. The only speed that matters is the speed that works within reasonable constraints of cost and power. Within those reasonable constraints HBM2 is faster. It's faster because it's more energy efficient. It is most certainly objectively true to say HBM2 is faster. There's a context to every statement. And an engineering statement has an engineering context. I don't have to read an HBM2 whitepaper to determine its scalability to infinity to counter your argument. Such arguments are entirely irrelevant.
ddriver - Wednesday, February 8, 2017 - link
And yet that doesn't make it faster. It just makes it more energy efficient. It could not possibly be technologically faster, as it runs at a much lower clock rate. It doesn't scale to infinity either; you don't simply add more dies and they work by magic, this has to be facilitated at the memory controller.

HBM mainly owes its efficiency to two factors - 1) it is produced at a better process node, and 2) it is optimized for the best performance-to-power ratio. There is nothing preventing you from doing the same with gddrx chips, aside from economic viability. It is more economically efficient to use older production lines for gddrx and keep the new production lines free for chips where efficiency is more critical, and since gddrx doesn't live on the same chip as a 200 watt gpu, it can be pushed to its limit in order to maximize raw performance at the expense of some extra power you have the TDP budget to dissipate.
Today it is entirely possible to create a GPU with a 1024 bit MC, which would amply beat hbm2 in terms of bandwidth. And it would be able to support at least 128 GB of memory, whereas with your fabled "scalability to infinity" for hbm2, you hit a brick wall at 32gb - and that only once the 8gb stacks become available; for now 16gb is the limit. Sure, you could possibly make a GPU that could support 8 hbm modules, but newsflash - that would require an 8192 bit memory controller, which is about 8 times more complex than a 1024 bit memory controller, with a lot more transistors that can turn out faulty during manufacturing, rendering the entire chip useless.
So get the fanboyism out of your head and realize that hbm is not faster, it is just more energy efficient.
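For reference, the raw numbers both sides are invoking come from the same simple relationship - peak bandwidth is bus width (in bytes) times per-pin data rate. A quick host-side sketch, taking 10Gbps as a representative GDDR5X speed and treating the 768-bit and 1024-bit buses as the hypotheticals they are:

```cuda
// Peak memory bandwidth = (bus width / 8) bytes * per-pin data rate.
// Plain host-side arithmetic; the wide GDDR5X buses are hypothetical.
#include <cstdio>

int main()
{
    printf("HBM2,   4096-bit @ 1.4Gbps: %.0f GB/s\n", 4096 / 8.0 * 1.4);  // ~717 (Quadro GP100)
    printf("GDDR5X,  384-bit @ 10Gbps:  %.0f GB/s\n",  384 / 8.0 * 10);   //  480
    printf("GDDR5X,  768-bit @ 10Gbps:  %.0f GB/s\n",  768 / 8.0 * 10);   //  960 (hypothetical)
    printf("GDDR5X, 1024-bit @ 10Gbps:  %.0f GB/s\n", 1024 / 8.0 * 10);   // 1280 (hypothetical)
    return 0;
}
```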
Yojimbo - Wednesday, February 8, 2017 - link
Oh yes I am an HBM fanboy. HBM! HBM! HBM! Though, obviously one of us clearly can't see straight. Probably the one that doesn't realize that engineering solutions are designed to solve engineering problems, and engineering problems have constraints. 128 GB of GDDR5 memory would take up a huge amount of real estate, just like a 1024-bit GDDR bus would add too much complexity and draw too much power.

I have no idea how easy it would be to extend HBM2 past 32 channels. But it really doesn't matter. It's designed to operate within the realm of practicality. No one is going to build your 1024 bit GDDR5 solution. If they wanted to, they would already be doing it and wouldn't be using HBM2. Within the realm of practicality HBM2 is faster and that's all that matters.
Anyway, we're obviously getting nowhere here. Take care.
JoeyJoJo123 - Monday, February 6, 2017 - link
>Games don't need it...

Man I sure do hate tired old arguments that are structured like this.
Yeah, and nobody _needs_ every car to have self-driving functionality. And nobody _needs_ phones with battery lifespans that exceed one day of usage. And nobody _needs_ monitors with high refresh rates and high color fidelity.
But it sure as hell would improve the bottom-line performance for the lowest ranking product being sold, and it improves the user experience for anyone buying into this technology.
So stuff it with your crappy argument.
Yojimbo - Monday, February 6, 2017 - link
"Man I sure do hate tired old arguments that are structured like this.
Yeah, and nobody _needs_ every car to have self-driving functionality. And nobody _needs_ phones with battery lifespans that exceed one day of usage. And nobody _needs_ monitors with high refresh rates and high color fidelity.
But it sure as hell would improve the bottom-line performance for the lowest ranking product being sold, and it improves the user experience for anyone buying into this technology.
So stuff it with your crappy argument."
Games are not memory bandwidth bound. When I say they don't need it, it means they are not really helped by an increase in memory bandwidth. The "lowest ranking product" needs HBM2 less than the top end product. GDDR5 is plenty fast enough to keep up with the number of compute cores in low end and mainstream graphics cards, even with AMD's more bandwidth-hungry architecture. If the memory subsystem is fast enough to keep up with the demands placed on it by the compute subsystem, that's 100% of what you can get from it. I'm not sure where you expect this performance boost from increasing memory bandwidth to come from.
Yojimbo - Monday, February 6, 2017 - link
I think maybe what you're missing is that there is always a cost-benefit analysis to any design decision. Increasing the manufacturing cost of a card reduces its competitiveness unless it gives some offsetting benefit. Suppose, just for illustrative purposes, it costs $20 more per card to have HBM2 memory instead of GDDR5. NVIDIA or AMD could spend that $20 on HBM2 or they could spend that $20 on increasing the number of compute cores on the card. If the resulting performance across a range of critical benchmarks favors spending the money on compute cores then that's what they should do. It's not that there is never a situation where more memory bandwidth could help, it's just that the benefit is smaller than what is achieved by the same use of resources in a different area. NVIDIA and AMD need to make the choices that result in the best price/performance ratio for the market segment they are targeting. An HBM2 configuration is not an optimum one for NVIDIA and is only an optimum one for AMD at the very high end.

Michael Bay - Tuesday, February 7, 2017 - link
Self-driving car will carry you right into slavery.

Yojimbo - Tuesday, February 7, 2017 - link
Yes, just like the self-cleaning oven.

sna1970 - Monday, February 6, 2017 - link
Why put HBM2 on a high end card? Simple - card size... you can have a Titan X card 170mm long instead of 280mm... putting the memory on the GPU itself reduces the card size by 50%.