New Research Shows Power of 41 Harnessed GPUs

Can silicon interconnect fabric (SiIF) connect existing GPUs better and with less probability of failure than using standard interconnects?

Parallel processing, which is performed by billions of transistors on today’s GPUs, has deep and complicated roots. The term refers to a method of segmenting software instructions and running them simultaneously on multiple microprocessors. Also known as parallel computing, the concept is extremely simple: large problems that require a lot of calculation can be subdivided into smaller problems, and those smaller problems can be solved at the same time.

Modern GPUs use parallel processing (among other things) to power graphics in video games, CAD software and artificial intelligence applications such as deep learning, reinforcement learning and deep reinforcement learning.

NVIDIA pioneered the modern GPU by selling the GeForce 256 in 1999, and it was the first company to call the chip a “GPU”. The company’s next major contribution was CUDA, which opened the floodgates for developers unfamiliar with graphics. As a general-purpose interface, CUDA gave developers new access to the GPU’s parallel hardware, and computationally intensive applications could now be developed on the GPU at a wider scale than ever before. (Image courtesy of NVIDIA.)
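
To make that shift concrete, here is a minimal sketch (not code from NVIDIA or from the research discussed below) of the kind of general-purpose program CUDA enabled: a kernel that adds two vectors, with the work split across thousands of GPU threads.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each GPU thread adds one pair of elements -- a tiny example of splitting
// a large problem into many small ones that run in parallel.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;            // one million elements
    const size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);     // unified memory keeps the sketch short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);

    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %.1f\n", c[0]);    // expect 3.0

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```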

One theory critical to parallel computing is known as Amdahl’s Law. It is named after Gene Amdahl, a 20th-century computing pioneer whose contributions to computer architecture culminated in two mainframe systems: the IBM System/360 and the Amdahl 470.

What is Amdahl’s Law?

At the AFIPS Spring Joint Computer Conference in 1967, Gene Amdahl presented his law as a framework for understanding how much a computation can be sped up by dividing the task at hand among multiple processors. (Image courtesy of the Computer History Museum.)

The law works by dividing a program’s work according to whether or not it can be parallelized. Though that sounds like an obvious conclusion drawn from an obvious premise, it had a profound, if gradual, impact on the notion of parallel computing.

What it means is this: say a program needs 40 hours on one processor core, but one part of the program, which takes an hour to complete, cannot be parallelized. That gives software engineers a certainty to work around: no matter what they are designing and no matter how many processors are available to solve the problem in parallel, execution time can never drop below one hour, so the overall speedup can never exceed 40 times.
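
Written out, Amdahl’s Law gives the speedup S on N processors when a fraction p of the work can run in parallel; plugging in the 40-hour example shows where the one-hour floor (and 40-times ceiling) comes from:

```latex
% Amdahl's Law: speedup on N processors when a fraction p of the work is parallelizable
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}

% 40-hour example: 1 hour is serial and 39 hours are parallelizable, so p = 39/40
\lim_{N \to \infty} S(N) = \frac{1}{1 - p} = \frac{1}{1/40} = 40
```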

To computer scientists at the time, Amdahl’s Law simply meant that parallel computing is only useful for programs that are highly parallelizable. Amdahl himself started Trilogy Systems Corporation in 1980 with the express mission of designing a chip that would lessen the cost of building mainframe computers. Trilogy raised USD 230 million from venture capitalists, the most ever at the time. The key piece of technology driving this ambition was a silicon-wafer-sized processor that could keep enough data on the wafer itself that it wouldn’t expend energy sending data to memory chips across the circuit board. It was essentially like having one GPU with multiple GPUs built into it.

Unfortunately, Trilogy Systems ended in disaster. Its attempt to create the first commercial wafer-scale integration failed so spectacularly that it led to another first: the phrase “to crater”, which the press used to describe the monumental collapse of the company. (Image courtesy of Nexus.)

Time moves on, and new technologies enable researchers to look at old ideas in a fresh light. A group of engineers from UCLA and the University of Illinois Urbana-Champaign is sharpening its argument in favor of developing a wafer-scale computer made of 40 GPUs. Benchmarks from simulations of the theoretical multiprocessor showed energy use and propagation delay reduced by a factor of 140 and calculation speed increased 19-fold.

A Fresh Take

Rakesh Kumar is part of this group at the University of Illinois Urbana-Champaign. He is an associate professor of computer engineering working on the problem of reducing the energy use and propagation delay that occur between computational units in typical supercomputers. The way supercomputers work now is inefficient: applications are parceled out across circuit boards that host hundreds of disparate GPUs. This hardware communicates over what are called long-haul data links, which are slow and energy-intensive compared to the interconnects within a GPU.

A further problem complicates things. A mismatch between the mechanical properties of PCBs and chips imposes a limitation: the processors have to be kept in packages, which reduces the usable number of inputs and outputs on each chip. This is exactly why sending data over these long-haul links is slow and consumes a relatively large amount of energy.

What Kumar and his group are essentially trying to do is boost the connectivity between 40 GPUs so that they act as one monster GPU. This will also make things a bit easier on programmers, who will be able to see the whole application through the lens of one GPU rather than hundreds of them.

How can you turn many GPUs into one?

At Trilogy Systems, Amdahl worked on the premise that standard chip manufacturing methods would be enough to build multiple processors on one silicon wafer and wire them together with interconnects. This did not work for Trilogy because defects become more common as the size of a chip increases, and each individual defect ruins a larger, more expensive piece of silicon.
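
A standard first-order yield model (an illustration of the general effect, not a figure from Trilogy’s actual process) makes the problem plain: with a fixed density D of defects per unit area, the fraction of working chips falls off exponentially as die area A grows.

```latex
% Poisson yield model: A = die area, D = defects per unit area
Y = e^{-A D}

% Illustrative numbers: a modest die with A D = 0.1 yields about 90\% good parts,
% while scaling the die 100x toward wafer size gives A D = 10 and Y = e^{-10} \approx 0.005\%
```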

Kumar and his research team believe that a technology called silicon interconnect fabric (SiIF) can connect existing GPUs better, and with a lower probability of failure, than using standard interconnects while growing the size of the GPU itself, as Amdahl tried and failed to do at Trilogy Systems. They also believe SiIF will cancel out the inherent mismatch between chip and board, thereby eliminating the chip package.

What makes the SiIF wafer promising is the way its interconnects are patterned. The wafer carries 2-micrometer-wide copper interconnects spaced only 4 micrometers from one another (like the top-level interconnects on a normal chip), along with copper pillars spaced appropriately for GPUs to plug in; thermal compression bonding fuses the pillars to the interconnects. Kumar and his fellow researchers claim this can provide up to 25 times more inputs and outputs per chip.
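
One way to see how a figure like 25 times can arise (a back-of-the-envelope scaling argument, not a number taken from the team’s measurements) is that the I/O count of an area array grows with the inverse square of the pad pitch, so shrinking the pitch roughly fivefold in each direction yields about 25 times as many connections in the same footprint:

```latex
% Area-array I/O density vs. pad pitch p (illustrative scaling argument only)
N_{\mathrm{I/O}} \propto \frac{1}{p^{2}}
\qquad\Longrightarrow\qquad
\frac{N_{\mathrm{SiIF}}}{N_{\mathrm{package}}}
  = \left(\frac{p_{\mathrm{package}}}{p_{\mathrm{SiIF}}}\right)^{2}
  \approx 5^{2} = 25
```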

To accommodate a roughly two-fold increase in the kilowatts the SiIF wafer’s wiring must carry, the researchers increased the supply voltage, which reduces the amount of power lost along the way. This solution cost the team space that could have gone to additional GPUs, because it requires components like signal-conditioning capacitors and voltage regulators on the wafer.
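
The logic behind raising the supply voltage is ordinary power-delivery arithmetic: for a fixed amount of delivered power, a higher voltage means less current, and resistive loss in the wiring falls with the square of that current.

```latex
% Delivering power P at supply voltage V through wiring of resistance R
I = \frac{P}{V}, \qquad P_{\mathrm{loss}} = I^{2} R = \frac{P^{2} R}{V^{2}}

% Doubling V cuts the resistive loss in the delivery network by a factor of four
```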

Bottom Line

The research team’s design connects 41 GPUs in total, and simulations of it sped up the movement of data and increased computation power while using less energy than 40 standard GPU servers. A prototype for physical testing is in progress, but details are sketchy at the moment. Cooling will likely be a challenging bottleneck due to higher-than-normal levels of heat dissipation.