Comments on: HPC Gets A Reconfigurable Dataflow Engine To Take On CPUs And GPUs

By: Timothy Prickett Morgan

Timothy Prickett Morgan — Tue, 05 Nov 2024 19:17:12 +0000

In reply to UK. Oddly enough, I have been to a kindergarten event where I experienced angst because my child is too chatty sometimes and also had sauerkraut on a frankfurter. HA!

By: UK

UK — Tue, 05 Nov 2024 10:03:19 +0000

…at least Tachyum says it now has the last FPGA emulator before tape-out next year…time will tell…
(btw. typo in my text…should be “milk” and not “milch” of course, unless “milch” has made it (without my knowledge) into the English vocabulary as “Kindergarten”, “Angst” or “Sauerkraut” did somehow 🙂)

By: Timothy Prickett Morgan

Timothy Prickett Morgan — Thu, 31 Oct 2024 13:31:13 +0000

In reply to UK. I would be happy to talk about Tachyum as soon as they show me a real chip. I still see this ICA, as NextSilicon calls it, as an accelerator. My understanding is Prodigy is an actual host processor plus accelerator wrapped in one, and like NextSilicon, has a compiler that accelerates the most common routines in code. That was the idea, anyway.

By: UK

UK — Thu, 31 Oct 2024 11:19:22 +0000

Thanks a lot Mr. Morgan for this very interesting insight.
Still curious what Tachyum’s approach looks like in comparison…perhaps not much different, as they also talk about a general “faster in everything at lower power consumption” – in Germany this is called “Eierlegende Wollmilchsau” (a pig that produces eggs, milch and wool)?
Let’s wait and see.

By: Carl Schumacher

Carl Schumacher — Wed, 30 Oct 2024 16:38:34 +0000

“And now for something completely different” (as Monty Python may have titled this article)…Seems that others have tried to move beyond von Neumann’s fetch/execute/write-back loop in various ways over the years (1984 Yale-infused Multiflow with its very wide VLIW and a decade later with Intel’s light VLIW Itanic)…Always looking to a “super compiler” to draw out the latent parallelism in the Universe’s serial IF/THEN/ELSE code base…Hmmm and yet in a half-century+ only has the GPU risen to be on par with the serial CPU…Well maybe this time will be different.

By: Slim Albert

Slim Albert — Wed, 30 Oct 2024 16:04:03 +0000

Way to go NextSilicon! And I’m glad to see Sandia’s Vanguard program testing out this emerging tech for NNSA viability ( https://www.sandia.gov/research/news/sandia-partners-with-nextsilicon-and-penguin-solutions-to-deliver-first-of-its-kind-runtime-reconfigurable-accelerator-technology/ ).

It seems that there’s a lot of interest in using reconfigurable connections between computational units to allow systems to adapt flexibly to workloads that range from “standard flow” (or even no flow) to dataflow. Google’s reconfigurable optical interconnect for TPUs would be a large scale example, while SambaNova’s Reconfigurable Dataflow Units (RDUs), Groq’s Software-defined Scale-out TSP/LPU, or RipTide’s Coarse-Grained Reconfigurable Array (CGRA) would be finer scale examples. In my mind, such reconfigurability should mean (as a goal) that the resulting machine performs as well on dense matrix-vector workloads as on graph-oriented workloads (hopefully), through those in-between (HPCG?), by reconfiguring itself accordingly ( https://www.nextplatform.com/2018/08/30/intels-exascale-dataflow-engine-drops-x86-and-von-neuman/ ).

That the compiler (software) is an important aspect of this was certainly stressed by Tenstorrent’s Jim Keller in an excellent interview last year ( https://www.nextplatform.com/2023/08/02/unleashing-an-open-source-torrent-on-cpus-and-ai-engines/ ): “the graph needs to be lowered with interesting software transformations and map that to the hardware”. Here, with NextSilicon’s Maverick, the compiler additionally raises the (reconfigurable) hardware to the graph, which demands some cool extra sophistication (for extra performance), if I understand well.

In time, the reconfigurable NoC between processing units might be advantageously implemented using high-bandwidth reconfigurable optical interposers, with integrated controllers (e.g. https://www.eetimes.com/lightmatter-raises-400-million-series-d/ ) … and it should prove worthwhile to evaluate RAM amounts and distribution, as suggested in the Cerebras article from the day before yesterday, where SwarmX and MemoryX are used to provide supplemental memory where needed (to reap the full benefits of near- and in-memory computing).

Cool stuff!

By: Paul Berry

Paul Berry — Wed, 30 Oct 2024 15:00:25 +0000

I’m looking forward to seeing what types of sample codes they have that work well on this. My concern is that so many of the codes that perform poorly on GPUs and CPUs do so because of memory latency, or because existing optimizations can’t be used due to ordering/correctness constraints (often false, but unclear due to the language). Often, taking care of the ‘unlikely flow’ cases is very, very costly to the likely flow.