Comments on: The Shape Of AMD HPC And AI Iron To Come
https://www.nextplatform.com/2017/08/08/shape-amd-hpc-ai-iron-come/

By: peter j connell https://www.nextplatform.com/2017/08/08/shape-amd-hpc-ai-iron-come/#comment-95182 Sat, 09 Dec 2017 20:36:31 +0000
In reply to peter j connell.

PS: further, what of multi-GPU setups? Can each GPU be configured separately? Could each GPU have a discrete NVMe drive allocated to its cache pool, e.g., if it suits?

Any links I could check?

By: peter j connell https://www.nextplatform.com/2017/08/08/shape-amd-hpc-ai-iron-come/#comment-95181 Sat, 09 Dec 2017 20:29:51 +0000
I cannot seem to find any details on user settings within HBCC.

We know HBCC allows setting aside a portion of system RAM AND a segment of NVMe storage space on a RAID array, but I don't know if multiple storage devices can be allocated to HBCC use (an array AND a single NVMe SSD, for example).

We also know that for big sequential read/write operations, RAID 0 arrays of up to 7 drives can provide massive, near-memory-like bandwidth on Threadripper/Epyc.
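A quick back-of-the-envelope check of that claim (the per-drive throughput here is an assumed figure, typical of 2017-era Gen3 x4 NVMe SSDs):

    # Rough sequential-bandwidth estimate for a RAID 0 NVMe array (assumed figures).
    per_drive_gbps = 3.2      # assumed sequential read, GB/s, for a Gen3 x4 NVMe SSD
    drives = 7                # array width mentioned above
    raid0_gbps = per_drive_gbps * drives
    print(f"RAID 0 x{drives}: ~{raid0_gbps:.1f} GB/s sequential")   # ~22.4 GB/s
    # For comparison, dual-channel DDR4-2666 is ~42.7 GB/s and quad-channel ~85 GB/s,
    # so "memory-like" holds only loosely, and only for large sequential transfers.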

We also know that for random I/O of smaller files, RAID is little better, if at all, than cheap and plentiful single NVMe SSDs.

Putting aside the subjective (and, I'd argue, wrong) opinion that latency is a universal deal breaker for NVMe: if, IF, the extra latency does not preclude NVMe from being useful as a cache/“memory” extender in all apps, then…

My question is: does HBCC allow users to specify only one storage device as a cache extender, or can HBCC intelligently manage multiple storage resources in its pool of cache resources?

If so, it would improve HBCC if the powerful array could focus on large sequential reads/writes, with supplementary secondary device(s) used where the array has no great advantage.

Just as HBCC decides between fast but limited system RAM and vast but slower NVMe storage to extend the GPU cache/memory address space, might it also decide between multiple storage resources with different properties?
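AMD doesn't document a public interface for this, so purely as an illustration of the routing policy I'm asking about, here is a hypothetical sketch (all names invented, nothing here reflects AMD's actual HBCC):

    # Hypothetical illustration of a multi-device cache-extender policy.
    # None of this is AMD's actual HBCC interface; names are invented.
    SEQUENTIAL_CUTOFF = 1 << 20   # 1 MiB: treat larger transfers as sequential

    def pick_backing_device(transfer_bytes, is_sequential, raid_array, single_ssd):
        """Route big sequential traffic to the RAID 0 array and small random
        traffic to the single NVMe SSD, per the trade-off noted above."""
        if is_sequential and transfer_bytes >= SEQUENTIAL_CUTOFF:
            return raid_array   # wide stripes shine on large sequential I/O
        return single_ssd       # random small I/O gains little from striping

    # e.g. pick_backing_device(8 << 20, True, "array", "ssd") returns "array"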

By: SupplyAndEconomysOfScale https://www.nextplatform.com/2017/08/08/shape-amd-hpc-ai-iron-come/#comment-95180 Sat, 12 Aug 2017 19:00:34 +0000
In reply to OranjeeGeneral.

The Radeon Pro FE SKUs, as well as the professional Radeon Pro WX/Instinct SKUs, currently offer 16GB of VRAM across only 2 HBM2 (8-Hi) stacks. And AMD has already created a design with 4 HBM (first generation) stacks on a GPU/interposer with the Fury X (Fiji GPU micro-architecture). The JEDEC standard for HBM2 allows up to 8GB per stack, so 4 stacks of HBM2 would net 32GB of VRAM for any Vega-based variant, should AMD decide it needed that.

There will be dual-Vega (2 dies per PCIe card) SKUs offered for the consumer and professional markets, and starting with the Vega GPU micro-architecture AMD will be able to use its Infinity Fabric IP to wire those dual Vega GPUs on a single PCIe card together, for an even faster connection between the 2 dies than the PCIe protocol affords.

Then there is the fact that even with 8GB of VRAM, the new Vega GPU micro-architecture's included HBCC/HBC technology can utilize that 8GB as a last-level cache and leverage regular system DRAM as a secondary VRAM pool, with Vega's HBCC performing the swapping in the background to and from that 8GB of HBM2-based VRAM cache, effectively increasing the VRAM size to whatever the system DRAM size may be. The Vega GPU IP also includes, in that HBCC/HBC/memory-controller subsystem, the ability to manage the GPU's own virtual memory paging swap space of up to 512TB of total GPU virtual memory address space. So that's 512TB addressable into any system DRAM and onto the system's memory swap space on any NVM/SSD or hard-drive storage devices attached, and Vega's HBCC/HBC manages all that rather effortlessly. There are also the Radeon SSG SKUs that make use of their own NVM stores included on the PCIe card, and there will be Vega micro-architecture based “SSG”-branded variants for both acceleration and AI workloads.
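As a sanity check on that 512TB figure (Vega's HBCC advertises a 49-bit virtual address space):

    # Vega's HBCC exposes a 49-bit virtual address space.
    address_bits = 49
    tib = (1 << address_bits) / (1 << 40)   # bytes -> TiB
    print(f"2**{address_bits} bytes = {tib:.0f} TiB")   # 512 TiB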

8GB of VRAM is nothing to laugh at if that 8GB of GPU VRAM is actually a VRAM cache with up to 512TB of virtual memory address space at its disposal across regular system DRAM or SSD/NVM, keeping in that 8GB cache only the data/textures the GPU actually requires for its more immediate needs. That new Vega HBCC/HBC IP also has implications for discrete mobile Vega GPU SKUs, which usually come with 4GB or less of total GPU VRAM, so look for even greater gains for any Vega discrete mobile SKUs when they arrive.

By: SupplyAndEconomysOfScale https://www.nextplatform.com/2017/08/08/shape-amd-hpc-ai-iron-come/#comment-95179 Sat, 12 Aug 2017 17:53:31 +0000
In reply to OranjeeGeneral.

It's interposer, not interpose, and if I may interpose for a moment concerning silicon interposers: those interposers, of the silicon variety, are made of the very same material as the processor dies that are (for GPUs) or will be attached to them to create those AMD interposer-based APUs. Also note that silicon interposers are being used in greater numbers by more than just AMD/Nvidia, and the method of producing the rather large silicon boules necessary for the computing industry's needs is a polished and mature technology. So supplies of the silicon wafers necessary for interposer use are rather assured by default and can easily be ramped up to meet demand. There are, however, problems with the reticle limit of current IC lithography IP, but as demand for silicon interposers increases, so does the economy of scale needed for specialized equipment designed to get past current reticle limits.
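For a sense of scale (the reticle figure is the commonly cited approximate single-exposure field; the Fiji number is as widely reported):

    # Approximate single-exposure reticle limit vs. Fiji's interposer area.
    reticle_mm2 = 26 * 33            # ~858 mm^2 exposure field (approximate)
    fiji_interposer_mm2 = 1011       # Fiji's interposer, reportedly ~1011 mm^2
    print(f"Reticle field: ~{reticle_mm2} mm^2")
    print(f"Fiji exceeds it by ~{fiji_interposer_mm2 - reticle_mm2} mm^2")
    # Fiji already slightly exceeded the reticle, which is why stitching and
    # multi-exposure tricks matter for larger interposer designs.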

And yes, the silicon interposer currently costs more to produce and populate with dies than, say, an MCM or other IP, but there is no better way to get thousands of traces etched onto a single level; the fatter traces across many-layered PCBs cannot compete with a silicon etching process's ability to lay down thousands of metal traces. One need only look at the BEOL metal layers on silicon processor dies to realize that no PCB could carry such thin wires using any current PCB process technology.

Now, owing to the simple fact that silicon interposers are currently basically nothing more than dies etched with passive traces, we can extend that to imply that there will at some point be active interposer designs that contain actual circuitry. So maybe the silicon interposer will eventually comprise a whole coherent connection fabric of tens of thousands of traces along with its associated control circuitry, buffer memory, and whatnot. There are really no great technological impediments to making such silicon interposers, or to creating the equipment to get past any reticle limits, other than that the economy of scale has to arrive before the increased demand is met, and that comes via wider adoption of interposer-based processor designs across the larger CPU/GPU/other processor marketplace.

It would be a good idea to take the time to read the research PDF linked in the post above your reply, and do note that it's not at all infeasible to splice 2 interposers together to create a larger area playing host to many more processor/HBM2/whatever dies.

Also of note from AMD's patent filings: there is a pending patent for FPGA-in-local-HBM2 compute, with FPGA die/s added to the HBM2 stacks for the sort of localized compute that will become important for the exascale market and on down into the general server/workstation markets, and most likely the consumer market as well. It's very likely that the Exascale Initiative funding from the US government to AMD/Nvidia/others will result in IP that gets utilized across all the relevant processor markets, as always happens when Uncle Sam or other governments provide R&D matching grants to meet government needs. I'm relatively sure that Big Blue and its System/360 IP benefited from that sort of matching government R&D funding back when the 360 represented the heights of computing technology; ditto for Burroughs (stack machines) and Sperry/others.

By: OranjeeGeneral https://www.nextplatform.com/2017/08/08/shape-amd-hpc-ai-iron-come/#comment-95178 Fri, 11 Aug 2017 13:19:26 +0000
In reply to Keep_DP_On_The_GPU_Mostly.

I'll bet they can't get that interposer manufactured in serious enough quantities; that's why they are super quiet about it.

By: SeeHereForTheBrainStorm https://www.nextplatform.com/2017/08/08/shape-amd-hpc-ai-iron-come/#comment-95177 Fri, 11 Aug 2017 01:48:23 +0000
In reply to peter j connell.

From this Next Platform article (1) comes a reference to an AMD/academic research paper (PDF) (2), and that second reference will give you an idea of where interposer-based APU designs can go. It is a very good read. Silicon interposers are a bit more interesting than MCMs: a silicon interposer can not only be etched with thousands of parallel traces, it can also be etched/doped with logic/math circuitry. The whole coherent connection fabric could be made on the silicon interposer, creating an active interposer design on which to host the various processor dies, HBM2, and whatnot!

(1)

“AMD Researchers Eye APUs For Exascale”

https://www.nextplatform.com/2017/02/28/amd-researchers-eye-apus-exascale/

(2)

“Design and Analysis of an APU for Exascale Computing”

http://www.computermachines.org/joe/publications/pdfs/hpca2017_exascale_apu.pdf

By: peter j connell https://www.nextplatform.com/2017/08/08/shape-amd-hpc-ai-iron-come/#comment-95176 Thu, 10 Aug 2017 15:01:58 +0000
In reply to Keep_DP_On_The_GPU_Mostly.

Amen to the workstation APUs. What a killer product!

Care to fantasize about general configurations, assuming an Epyc-sized interposer/MCM?

By: Keep_DP_On_The_GPU_Mostly https://www.nextplatform.com/2017/08/08/shape-amd-hpc-ai-iron-come/#comment-95175 Wed, 09 Aug 2017 23:50:17 +0000
I fail to see any reason to add wider AVX units on Epyc/”Rome” in 2018 when AMD has its Vega GPU IP able to communicate with its Epyc systems over the Infinity Fabric. I can see AMD maybe offering some Zen2 variants with larger/wider AVX units. But with all of AMD's GPU IP available for FP number crunching, I'd rather see AMD keep its Epyc SKUs saving power with the smaller AVX units on the CPU cores, and for those that need the extra FP performance, maybe develop GPU SKUs with all-double-precision shaders/cores as an accelerator product, specifically to give AMD's current/future Epyc customers the option of getting their DP FP needs met via an accelerator.

I'd also like an update from AMD on the development/road-map status of its workstation-grade interposer-based APU SKUs, where the Epyc cores die, the Vega/newer GPU die/s, and the HBM2 stacks are all married together on the interposer, with the CPU cores able, via the Infinity Fabric, to directly dispatch FP workloads to the interposer-based GPU's NCUs (next-generation Vega compute units) through a coherent CPU-to-GPU Infinity Fabric process, with everything managed coherently. There is that exascale APU research article in which AMD and an academic partner made a very interesting proposal, funded in part by US government exascale-initiative grant money.

AMD's current Zen (first generation) CPU micro-architecture needs a dancing partner for heavy DP FP workloads in the HPC market, so AMD has its Vega-based SKUs, which may only need a variant that forgoes some 32-bit units in favor of more 64-bit units, specialized for DP-FP-intensive workloads. Vega is not as DP-FP-unit heavy a micro-architecture as some of AMD's pre-Polaris GPU micro-architectures, which had higher ratios of DP FP resources relative to SP FP resources. The Vega 10 micro-architecture's peak double-precision compute performance is around 720-859 GFLOPS depending on the base/boost clock speed of the GPU (Radeon Pro FE / liquid-cooled SKU). But AMD's Vega 20 micro-architecture is rumored to raise that DP FP ratio from 1/16 on Vega 10 to maybe 1/2 on Vega 20.
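Those figures follow from the shader count and clocks (clock values approximate):

    # Peak FLOPS = shaders * 2 ops/clock (FMA) * clock; Vega 10 DP rate is 1/16 of SP.
    shaders = 4096
    for label, clock_ghz in [("base ~1.38 GHz", 1.382), ("boost ~1.60 GHz", 1.600)]:
        sp_tflops = shaders * 2 * clock_ghz / 1000
        dp_gflops = sp_tflops / 16 * 1000
        print(f"{label}: SP ~{sp_tflops:.1f} TFLOPS, DP ~{dp_gflops:.0f} GFLOPS")
    # base ~1.38 GHz: SP ~11.3 TFLOPS, DP ~708 GFLOPS
    # boost ~1.60 GHz: SP ~13.1 TFLOPS, DP ~819 GFLOPS
    # which lines up roughly with the 720-859 GFLOPS range quoted above.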

With Polaris and Vega the DP FP unit ratios are lower. Maybe AMD can extend the Rapid Packed Math concept in its Vega designs, where each 32-bit FP unit's ALU can perform two 16-bit computations, and take it a step further with a single DP ALU able to work on two 32-bit values. With most server workloads not needing or utilizing large AVX units much, I'd rather that work be given over to accelerator products that excel at DP FP/other-precision math, keeping the CPU designs more general-purpose in nature.
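Purely as a toy illustration of the packing idea (not AMD's hardware implementation):

    # Toy illustration of packed math: two 16-bit values share one 32-bit lane,
    # so one lane-wide operation advances two elements at once.
    import numpy as np

    a = np.arange(4, dtype=np.float16)   # four fp16 elements = 8 bytes
    lanes = a.view(np.uint32)            # two fp16 values per 32-bit lane
    print(len(a), "fp16 elements occupy", len(lanes), "32-bit lanes")

    # At the peak-rate level, Rapid Packed Math doubles fp16 throughput:
    fp32_tflops = 13.1                   # Vega FE peak SP (approximate)
    print(f"fp16 via packed math: ~{fp32_tflops * 2:.1f} TFLOPS")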

By: OranjeeGeneral https://www.nextplatform.com/2017/08/08/shape-amd-hpc-ai-iron-come/#comment-95174 Wed, 09 Aug 2017 16:53:19 +0000
8GB of VRAM is a laugh; they're not going to make much inroads in DL with such a low memory config.

By: puddlefunk https://www.nextplatform.com/2017/08/08/shape-amd-hpc-ai-iron-come/#comment-95173 Wed, 09 Aug 2017 03:27:34 +0000
“..Radeon Fury X GPU accelerators based on the “Polaris” GPUs..”

Um, what? In what world was the Fury X running Polaris silicon?

If you'd done the slightest research you'd know it was running Fiji; I mean, it's even built on a different node (28nm) rather than Polaris's 14nm FinFET process.
