Comments on: Intel’s Exascale Dataflow Engine Drops X86 And Von Neumann
https://www.nextplatform.com/2018/08/30/intels-exascale-dataflow-engine-drops-x86-and-von-neuman/
In-depth coverage of high-end computing at large enterprises, supercomputing centers, hyperscale data centers, and public clouds.

By: Derrick Hamlin (Mon, 10 Sep 2018 17:56:18 +0000)
https://www.nextplatform.com/2018/08/30/intels-exascale-dataflow-engine-drops-x86-and-von-neuman/#comment-104319
At last, fine-grained massively-parallel processing is back on track after thirty years in the wings. It’s the only way ahead. (Neural nets are inscrutable and probably single-paradigm; quantum ‘computers’, nuclear fusion, and a family on Mars are proving to be viable – about equally.)
Good article and valuable blogs.
What’s at stake, technically speaking, is a billion-processor Universal-Computing machine that leapfrogs the current, non-Universal million-processor machines – or rather, the chip that makes it happen.
A typical current transistor count: 10^9 transistors on a 1 cm^2 chip at 22 nm feature size. For comparison, the theoretical maximum (transistors packed edge-to-edge) would be about a trillion, i.e., roughly 30:1 in linear spacing between the theoretical maximum and actual practice, giving healthy scope for a highly regular matrix of processors and network layout.
Since each minimalist processing element contains about 10^4 transistors (e.g., an Intel 4004, including serial interfaces and memory), the total accommodated in 1 cm^2 = 10^9 / 10^4 = 100,000 processors per chip (and easy to make duplex-redundant for chip yield).
Total serial floating-point ADD-time including a serial transaction between processors = 10 ns (say).
=== 1 billion elemental processors in 10,000 chips; 400 kW; 10^9 x 10^8 = 10^17 flop/s = 100 petaflops; in a 5 ft cube. Of course, this is merely dreaming . . .
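As a sanity check, the back-of-envelope arithmetic above can be reproduced in a few lines (every figure here is the commenter's assumption, not a measured value):

```python
# Reproduce the back-of-envelope figures above (all assumed, not measured).
transistors_per_chip = 10**9       # ~1 cm^2 at 22 nm, per the estimate above
transistors_per_pe = 10**4         # minimalist PE, roughly an Intel 4004
pes_per_chip = transistors_per_chip // transistors_per_pe   # 100,000
chips = 10_000
total_pes = pes_per_chip * chips   # 10^9 elemental processors

add_time_s = 10e-9                 # serial FP add incl. inter-PE transaction
ops_per_pe = 1 / add_time_s        # 10^8 op/s per processor
total_flops = total_pes * ops_per_pe

print(f"{pes_per_chip:,} PEs/chip, {total_flops / 1e15:.0f} petaflops")
```

The 100-petaflop figure falls straight out of the assumed 10 ns add time; at 100 ns it would be 10 petaflops.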

Intel’s “Exaflow” is not proposing this sort of package, but it could, if only the massive chip-pinout count could be resolved with favourable connectivity (it’s only mechanical engineering, after all!)

Even for more modest processor counts (e.g., 100 processors on a chip), there still remain the thorny questions of:
1) A workable interconnect topology between nodal processors for Turing-universal computation. The Fleming team outlines only 2D Cartesian links with (presumably) crossbar signal-directors (like a field-programmable gate array’s switch matrix), each pre-set to match the exact topology of a single exascale-sized user application.
2) Concurrent user-access to the architecture. There does not yet appear to be a concept of multi-user, multi-port accesses capable of mix-sharing the resources in real-time.
3) Garbage-Collection. In any practical Massively-Parallel Processor, terminated tasks and sub-tasks need to be eliminated in order to free-up resources for further incoming tasks. It appears that each Intel application exclusively and statically owns the machine which must then be expunged before a new application can be statically compiled onto the network-switches, as indeed is currently the case for the logic of other programmable arrays. An example, well-known in the field, is a Xilinx FPGA with an array of PicoBlaze core processors (Chapman – KCPSM3, Xilinx 2003).
4) Secure segregation between multiple users (if any). The whole world is panting for trustworthy, burglar-free computing – this is the great opportunity.
5) Most important: the exact choice of hardwired chip design (carved in stone) that needs to satisfy programmers for the next 30 years, as the x86 has. Will the Fleming design define that corporate strategy?
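On point 1, the statically-configured interconnect can be pictured with a toy sketch (all names here are hypothetical; this is not Intel’s actual switch design): a 2D Cartesian mesh of crossbar signal-directors whose routes are fixed at configuration time, FPGA-style, for one application.

```python
# Toy model of point 1: a 2D mesh of crossbar "signal-directors" whose routes
# are fixed at configuration time (FPGA-style) for a single application.
class Crossbar:
    def __init__(self):
        self.routes = {}                  # input port -> output port

    def configure(self, in_port, out_port):
        self.routes[in_port] = out_port   # set once, before the app runs

    def forward(self, in_port):
        return self.routes[in_port]       # static routing: no arbitration

# A 2x2 mesh of directors; ports are 'N', 'S', 'E', 'W', and 'PE' (local).
mesh = {(x, y): Crossbar() for x in range(2) for y in range(2)}

# Statically compile one application's path: PE(0,0) -> east -> PE(1,0).
mesh[(0, 0)].configure('PE', 'E')
mesh[(1, 0)].configure('W', 'PE')

print(mesh[(0, 0)].forward('PE'))   # 'E'  (leave node (0,0) eastward)
print(mesh[(1, 0)].forward('W'))    # 'PE' (deliver to node (1,0)'s processor)
```

Because every route is baked in before execution, a second application cannot share the fabric without a full reconfiguration, which is exactly the multi-user and garbage-collection worry in points 2 and 3.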

Connectivity, connectivity, and connectivity are the key. No more chunky processors, please, and it’s time to ditch millstone legacy software; if you want it, you can still run it for as long as Microsoft and Apple allow, but there are much better ways of doing it.

By: H!GHGuY (Tue, 04 Sep 2018 18:47:53 +0000)
https://www.nextplatform.com/2018/08/30/intels-exascale-dataflow-engine-drops-x86-and-von-neuman/#comment-103807
Would Intel be on track to use EDGE architectures? (https://en.wikipedia.org/wiki/Explicit_data_graph_execution)
If every PE is capable of executing an instruction block, then you’re basically there. The utility CPU is then only responsible for uploading all the instruction blocks and wiring the interconnect and off you go.
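That firing model can be sketched in a few lines. This is a toy illustration of the EDGE idea, not Intel’s design: each PE holds one instruction block that fires once all its operands have arrived, and the “utility CPU” phase merely uploads the blocks and wires their targets.

```python
# Toy EDGE-style dataflow fabric: a PE fires its instruction block when all
# operands have arrived, forwarding the result to its wired targets.
import operator

fabric, output = {}, []

class PE:
    """One instruction block resident on one processing element."""
    def __init__(self, op, n_operands, targets):
        self.op, self.n, self.targets = op, n_operands, targets
        self.operands = []

    def deliver(self, value):
        self.operands.append(value)
        if len(self.operands) == self.n:        # dataflow firing rule
            result = self.op(*self.operands)
            for t in self.targets:
                if t is None:
                    output.append(result)       # machine output port
                else:
                    fabric[t].deliver(result)   # forward over the interconnect

# "Utility CPU" step: upload blocks and wire the interconnect for (2 + 3) * 4.
fabric['add'] = PE(operator.add, 2, targets=['mul'])
fabric['mul'] = PE(operator.mul, 2, targets=[None])

# Inject inputs in any order; the fabric runs itself.
fabric['mul'].deliver(4)
fabric['add'].deliver(2)
fabric['add'].deliver(3)
print(output)   # [20]
```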

I’m also wondering how much difference there is from today’s graphics pipelines, aside from the interconnect. Those might be a shorter mental path if used as a starting point. If so, AMD might be in a comfy position if it needs to catch up. On the other hand, the interconnect network might have borrowed heavily from Altera, in which case AMD has nothing to show on that front.

Finally, choosing C/C++/Fortran as the languages to start from is probably suboptimal if you can use functional languages instead. These are already in the desired form and don’t need expensive analysis to reverse-engineer data flows from imperative code.
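A quick, minimal illustration of that point (hypothetical names throughout): in a pure expression the dataflow graph is the program itself, so it can be read off directly by overloading the operators, with no dependence analysis.

```python
# In a pure/functional form, the expression tree IS the dataflow graph:
# overloading '+' and '*' on a tiny Node type recovers it directly.
class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs
    def __add__(self, other): return Node('add', self, other)
    def __mul__(self, other): return Node('mul', self, other)

def leaf(name):
    return Node(name)          # a value source with no inputs

a, b, c = leaf('a'), leaf('b'), leaf('c')
g = (a + b) * c                # building the value builds the graph

def edges(n):
    """List (producer -> consumer) edges of the dataflow graph."""
    return [(i.op, n.op) for i in n.inputs] + \
           [e for i in n.inputs for e in edges(i)]

print(edges(g))  # [('add', 'mul'), ('c', 'mul'), ('a', 'add'), ('b', 'add')]
```

Extracting the same graph from imperative C would mean alias and dependence analysis; here it falls out of the program’s structure for free.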

By: Ronald G (Tue, 04 Sep 2018 08:13:29 +0000)
https://www.nextplatform.com/2018/08/30/intels-exascale-dataflow-engine-drops-x86-and-von-neuman/#comment-103782
wrt the ‘last thought’: I see the CSA as an evolution of an FPGA. An FPGA has switches that configure the wiring between gates; the CSA has switches that configure the buses between PEs. The investment in Altera is therefore a foundation for a different level of abstraction that could be called an FPPA (field-programmable PE array).

By: Joshua (Sat, 01 Sep 2018 23:04:37 +0000)
https://www.nextplatform.com/2018/08/30/intels-exascale-dataflow-engine-drops-x86-and-von-neuman/#comment-103639
Looking forward to seeing:
- how inherently sequential portions of code scale in this architecture
- task-dependency synchronization
- dynamic congestion avoidance (QoS)
- large-scale virtualization (processing and I/O)
- streaming-like parallel programming/debugging paradigms
- scalable checkpointing/restarting
- link fault tolerance
- topology-aware data/task algorithms
- collective/broadcast/reduction operations
- low-latency point-to-point
- PGAS
- scalable uniform RDMA

By: hoohoo (Thu, 30 Aug 2018 22:24:48 +0000)
https://www.nextplatform.com/2018/08/30/intels-exascale-dataflow-engine-drops-x86-and-von-neuman/#comment-103503
“This is obviously easy to block diagram, but terribly difficult to do with any computational efficiency in the real world.”

Sometimes the patent office should demand a working prototype. Just to keep things clear and simple.

By: A. Bacus (Thu, 30 Aug 2018 22:21:51 +0000)
https://www.nextplatform.com/2018/08/30/intels-exascale-dataflow-engine-drops-x86-and-von-neuman/#comment-103502
I dunno. You might need racks of pizza boxes just to model the program for this sort of system, prior to attempting a compile/run.
