Comments on: What Would You Do With A 16.8 Million Core Graph Processing Beast?
https://www.nextplatform.com/2023/09/01/what-would-you-do-with-a-16-8-million-core-graph-processing-beast/
In-depth coverage of high-end computing at large enterprises, supercomputing centers, hyperscale data centers, and public clouds.

By: Timothy Prickett Morgan https://www.nextplatform.com/2023/09/01/what-would-you-do-with-a-16-8-million-core-graph-processing-beast/#comment-213193 Tue, 05 Sep 2023 19:07:42 +0000 In reply to Laurent.

Nope. You’re right. I pointed to cores in my sheets instead of chips for the higher two numbers.

]]>
By: Laurent https://www.nextplatform.com/2023/09/01/what-would-you-do-with-a-16-8-million-core-graph-processing-beast/#comment-213191 Tue, 05 Sep 2023 17:33:27 +0000 https://www.nextplatform.com/?p=142868#comment-213191 Hi,

I believe the numbers for the RAM are not correct:

> Let’s walk through this.
> A single sled with sixteen sockets has 128 cores with 8,448 threads and 512 GB of memory.
> The first level of the HyperX network has 256 sleds, 32,768 cores, 270,336 threads, and 1 TB of memory.
> Step up to level two of the HyperX network, and you can build out a PIUMA cluster with 16,384 sleds, 2.1 million cores, 17.3 million threads, and 64 TB of shared memory.
> And finally, up at level three of the HyperX network, you can expand out to 131,072 sleds, 16.8 million cores, 138.4 million threads, and 512 PB of shared memory.

1 sled: 512 GB (seems correct from what I can read at the beginning of the article)
256 sleds: the 1 TB should be 128 TB
16,384 sleds: the 64 TB should be 8 PB
131,072 sleds: the 512 PB should be 64 PB
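
A quick back-of-the-envelope check, assuming the 512 GB per sled figure from the start of the article (just my own arithmetic, sketched in C):

```c
#include <stdio.h>

/* Total memory per HyperX level, assuming 512 GB of DRAM per
   16-socket sled as described at the start of the article. */
int main(void) {
    const double gb_per_sled = 512.0;
    const long sleds[] = {1, 256, 16384, 131072};

    for (int i = 0; i < 4; i++) {
        double gb = sleds[i] * gb_per_sled;
        printf("%7ld sleds -> %12.0f GB = %9.1f TB = %7.3f PB\n",
               sleds[i], gb, gb / 1024.0, gb / (1024.0 * 1024.0));
    }
    return 0;
}
/* The four levels come out to 512 GB, 128 TB, 8 PB, and 64 PB. */
```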

Or maybe there is something I’ve misunderstood?

]]>
By: HuMo https://www.nextplatform.com/2023/09/01/what-would-you-do-with-a-16-8-million-core-graph-processing-beast/#comment-213129 Mon, 04 Sep 2023 03:26:21 +0000 https://www.nextplatform.com/?p=142868#comment-213129 In reply to Slim Albert.

P.S. A most happy of Labor Days to folks in North America (best holiday of the year)!

For my money, I’d say 32-byte memory access granularity and cache-line size may be the sweet spot for mixed graph-based and non-graph workloads, somewhere between this PIUMA’s 8-byte and the more common 64-byte. At 64-bit, the LISP-standard cons cell (car+cdr pointers for lists and trees) is a 16-byte item, and an N-E-S-W pointer cell for quadtrees is a 32-byte item, both a better fit to 32B than 64B cache lines (the main drawback being “moderately” more die area for cache tags). 32 bytes also matches the 256-bit vector length that some seem to be standardizing on (e.g. AVX10, Neoverse V1 TRM), and is already the line size of cache “sectors” in some high-performing xPUs (POWER9, Fermi, Kepler, …).
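
To make the fit concrete, here is a rough sketch of those two node layouts on a typical 64-bit (LP64) machine (nothing from the PIUMA design, just plain C structs to show the sizes):

```c
#include <stdio.h>

/* Illustrative 64-bit node layouts: a LISP-style cons cell is two
   8-byte pointers (16 bytes), and a quadtree node with N/E/S/W
   children is four 8-byte pointers (32 bytes). Fetching one node
   from a 32B line pulls in little beyond the node itself, whereas
   a 64B line drags in a second, possibly unrelated, node. */
struct cons_cell { void *car, *cdr; };      /* 16 bytes on LP64 */
struct quad_node { void *n, *e, *s, *w; };  /* 32 bytes on LP64 */

int main(void) {
    printf("cons cell: %zu bytes\n", sizeof(struct cons_cell));
    printf("quad node: %zu bytes\n", sizeof(struct quad_node));
    printf("per 32B line: %zu cons cells, %zu quad node(s)\n",
           (size_t)32 / sizeof(struct cons_cell),
           (size_t)32 / sizeof(struct quad_node));
    return 0;
}
```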

The DDR5+ DIMM could then be an MCR job with four independent 32-bit channels (the same number of 32b channels as an HBM3 die), 8x pumped, so that each channel fills a 32B cache line per burst. This should give a 2x to 8x perf boost on graphs, without slowing down linear jobs. As a bonus, 8 of these DIMMs would essentially mimic an 8-die HBM3 single stack (32x 32b chans). Extra graph grit might then come from many-threading, if really desired (and double-checked).
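
The per-burst arithmetic behind that guess (a sketch of the idea only, not any actual MCR DIMM spec):

```c
#include <stdio.h>

/* Bytes delivered per burst by one independent channel:
   width_bits * burst_len / 8. */
static unsigned bytes_per_burst(unsigned width_bits, unsigned burst_len) {
    return width_bits * burst_len / 8;
}

int main(void) {
    /* Hypothetical MCR-style DIMM: four 32-bit channels, 8x pumped,
       so each burst fills exactly one 32B cache line. */
    printf("32-bit channel, BL8: %u bytes per burst\n", bytes_per_burst(32, 8));

    /* Eight such DIMMs = 8 * 4 = 32 independent 32b channels, the
       same 32x 32b channel count as the HBM3 comparison above. */
    printf("32b channels across 8 DIMMs: %u\n", 8u * 4u);
    return 0;
}
```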

The CPU-design pros could surely do worse than simulate such a system to verify it’s the most awesome in the world of mixed-load oomphs, prior to fabbing it in the commensurate volumes it truly deserves!

]]>
By: Hubert https://www.nextplatform.com/2023/09/01/what-would-you-do-with-a-16-8-million-core-graph-processing-beast/#comment-213121 Sun, 03 Sep 2023 23:40:40 +0000 https://www.nextplatform.com/?p=142868#comment-213121 In reply to Slim Albert.

Yep, I can see how graphs become more important when dealing with FEMs that produce irregular sparse matrices (unlike regular sparsity of FDM, tackled by ADI) and in pruned/culled AI models — block-based memory accesses then naturally give way to retro-futuristic kung-fu!

]]>
By: Slim Albert https://www.nextplatform.com/2023/09/01/what-would-you-do-with-a-16-8-million-core-graph-processing-beast/#comment-213058 Sat, 02 Sep 2023 03:36:56 +0000 https://www.nextplatform.com/?p=142868#comment-213058 Interesting outside-the-box innovative stuff (retro-futuristic?), especially as graphs and arborescence are key to so many computations, as noted here last month by Tenstorrent’s Jim Keller. I like the improvements that these newfangled thingamabobs bring to prior cuckoos (ahem: https://www.theregister.com/2011/02/20/cscs_cray_xmt_2/), especially the Co-Packaged Optics (CPO), 8-byte-granular DDR5, and the combo of single- and multi-thread pipelines.

Heavy multi-threading pretty much did Sun in back in the Niagara days, I think, because of CMT’s poor performance on common single-threaded workloads (though CMT is probably great for graph-oriented Oracle database analysis and search applications). Combining single- and multi-threaded pipelines in one chip, as in this computational feline, looks like a more balanced way to prowl through both tree and trail.

CPO is one of those oft-promised techs that we seem to be just “way behind” on (permanently in the next 2-5 years) and I’m glad Intel/Ayar jumped right in to figure out the right materials, techniques, and tools for this hybrid packaging, as needed to foster composable disaggregation of heterogeneous cephalopods (computational units). Google’s TPUs already demonstrated several tangible benefits of this kind of flexible optical networking (if I read that previous article well).

I’m not sure how they approached 8-byte granularity on DDR5, but if it means eight 8-bit wide channels read independently via burst-8 (octal data rate), then that could also be a winner (not sure how else this could be done really)!
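
If that guess is right, the arithmetic works out to exactly one 64-bit word per access (a sketch under that assumption, not Intel’s documented scheme):

```c
#include <stdio.h>

/* Hypothetical reading of 8-byte granularity on DDR5: eight
   independent 8-bit sub-channels, each running burst-8, so one
   access moves 8 bits * 8 beats / 8 = 8 bytes -- a single 64-bit
   pointer or edge word rather than a whole 64-byte line. */
int main(void) {
    unsigned width_bits = 8, burst_len = 8;
    printf("bytes per access = %u\n", width_bits * burst_len / 8);
    return 0;
}
```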

IMHO, a key to future systems (next platforms) will be how these retro-futuristic innovations can be woven into and merged with current single-threaded, large-cache-line, copper archs, so that both kinds of workloads are processed efficiently.

]]>
By: Timothy Prickett Morgan https://www.nextplatform.com/2023/09/01/what-would-you-do-with-a-16-8-million-core-graph-processing-beast/#comment-213045 Fri, 01 Sep 2023 22:06:56 +0000 https://www.nextplatform.com/?p=142868#comment-213045 In reply to ze.

That’s because I didn’t use the term correctly. All fixed.

]]>
By: ze https://www.nextplatform.com/2023/09/01/what-would-you-do-with-a-16-8-million-core-graph-processing-beast/#comment-213044 Fri, 01 Sep 2023 22:00:08 +0000 https://www.nextplatform.com/?p=142868#comment-213044 Thank you TPM, I really enjoyed this article.

I was confused by the use of the phrase ‘core complex’; would it be possible to elaborate on this?

]]>
By: Timothy Prickett Morgan https://www.nextplatform.com/2023/09/01/what-would-you-do-with-a-16-8-million-core-graph-processing-beast/#comment-213043 Fri, 01 Sep 2023 21:56:46 +0000 https://www.nextplatform.com/?p=142868#comment-213043 In reply to UK.

Yup, I would love to see that too. HA! No, it was 500 MHz for the XMT.

]]>
By: hoohoo https://www.nextplatform.com/2023/09/01/what-would-you-do-with-a-16-8-million-core-graph-processing-beast/#comment-213042 Fri, 01 Sep 2023 21:22:57 +0000 https://www.nextplatform.com/?p=142868#comment-213042 500 GHZ!!!

I’ll take a dozen, please.

]]>
By: UK https://www.nextplatform.com/2023/09/01/what-would-you-do-with-a-16-8-million-core-graph-processing-beast/#comment-213038 Fri, 01 Sep 2023 20:36:22 +0000 https://www.nextplatform.com/?p=142868#comment-213038 Hello Mr. Morgan,

you do not need to publish this…just wanted to point out a typo I found while skimming some articles before reading, including yours here:

” The XMT line from a decade ago culminated with a massive shared memory thread monster that was perfect for graph analysis, which had up to 8,192 processors, each with 128 threads running at 500 GHz”
-> I think the technically interested part of me would love to see such a thing as much as you would, for sure, but having skimmed the ElReg article, I think it should be MHz

]]>