UPDATED 10:57 EDT / NOVEMBER 14 2022

How AI has made hardware interesting again

SUPERCOMPUTING SPECIAL REPORT by Paul Gillin

Lawrence Livermore National Laboratory has long been one of the world’s largest consumers of supercomputing capacity. With computing power of more than 200 petaflops, or 200 quadrillion floating-point operations per second, the U.S. Department of Energy-operated institution runs supercomputers from every major U.S. manufacturer.

For the past two years, that lineup has included two newcomers: Cerebras Systems Inc. and SambaNova Systems Inc. The two startups, which have collectively raised more than $1.8 billion in funding, are attempting to upend a market that has been dominated so far by off-the-shelf x86 central processing units and graphics processing units. Their alternative is hardware that’s purpose-built for developing artificial intelligence models and for the inference processing that runs those models.

Cerebras says its WSE-2 chip, built on a wafer-scale architecture, can bring 2.6 trillion transistors and 850,000 cores to bear on the task of training neural networks. That’s roughly 50 times as many transistors and more than 100 times as many cores as are found on a high-end GPU. With 40 gigabytes of onboard memory and the ability to access up to 2.4 petabytes of external memory, the company claims, the architecture can process AI models that are too massive to be practical on GPU-based machines. The company has raised $720 million on a $4 billion valuation.

Lawrence Livermore National Lab’s de Supinski sees specialized hardware delivering a five-fold performance boost. Photo: LLNL

SambaNova says the integrated hardware-software system it calls DataScale can train deep learning models — which are widely used in language processing and image recognition — six times faster than Nvidia Corp.’s AI-optimized DGX A100 computer, in part because it boasts nearly 13 times the memory capacity.

LLNL is using a combination of the National Nuclear Security Administration’s Lassen supercomputer (pictured) and a dedicated Cerebras chip that’s 57 times the size of a standard data center graphics card and can thus pack in more than 1.2 trillion transistors.

Bronis de Supinski, Lawrence Livermore’s chief technology officer, said early experience with AI-optimized machines has been promising. “Many people think we’re coming to the end of Moore’s Law,” he said in a SambaNova video testimonial. “By bringing in machine learning models that greatly reduce the computation we have to perform, we think we can gain significant improvements in speed.” Early indications are that the purpose-built AI supercomputers deliver five times the performance of GPU-based systems, he said.

Chips reborn

Computer hardware has been a less-than-electrifying market for years. The dominant x86 microprocessor architecture is reaching the limits of the performance gains that can be realized through miniaturization, so manufacturers have focused mainly on packing more cores into their chips.

For the rapidly evolving disciplines of machine learning and deep learning – which are two of the main types of AI in use today – the salvation has been in GPUs. Initially designed for graphics processing, GPUs can have thousands of small cores, making them ideal for the parallel processing power needed for AI training.

“The nature of AI is that it benefits from processing in parallel,” said Peter Rutten, research vice president within the worldwide infrastructure practice at International Data Corp. “About 10 years ago it was discovered that GPUs, which were designed to put pixels on the screen, are good for this because they’re parallel processing engines and you can put a lot of cores into them.”

That was good news for Nvidia Corp., which saw its market capitalization soar from less than $18 billion in 2015 to $735 billion last year before the market contraction set in. Until recently, the company had the market nearly all to itself. But a host of competitors is looking to change that.

For AI workloads, “it’s been mainly Nvidia GPUs till now, but users are looking at technologies that can take them to the next level,” said Chirag Dekate, a Gartner Inc. analyst who specializes in operational AI systems. “This is not to imply that GPUs are over, but as high-performance computing and AI workloads continue to converge, you’ll see a greater variety of accelerators emerge.”

Everyone in the pool

Gartner’s Dekate sees the convergence of HPC and AI workloads driving the need for new types of hardware acceleration. Photo: Gartner

The big chipmakers aren’t standing still. Three years ago Intel Corp. acquired Habana Labs Ltd. and made the Israel-based chipmaker the focal point of its AI development efforts. Habana’s Gaudi 2 training-optimized processor and Greco inference processor, which were introduced last spring, are claimed to be at least twice as fast as Nvidia’s flagship A100 GPU. Advanced Micro Devices Inc. has been less vocal about its AI aspirations, but it says its new series of Ryzen processors includes built-in machine learning capabilities.

In March, Nvidia rolled out the H100 accelerator GPU with 80 billion transistors and support for the company’s high-speed NVLink interconnect. It features a dedicated engine that can speed up the execution of the Transformer-based models used in natural language processing up to sixfold compared with the previous generation. The most recent tests using the MLPerf benchmark showed the H100 besting Gaudi 2 on most deep learning tests. Nvidia is also considered to have an advantage in its software stack.

“Many users choose GPUs because they can tap into ecosystems of containerized software,” Dekate said. “The reason Nvidia is as successful as it is is because of the ecosystem strategy they have built.”

Cloud hyperscalers got into the business even earlier than the chipmakers. Google LLC’s Tensor Processing Unit, which is an application-specific integrated circuit, was launched in 2016 and is already in its fourth generation. Amazon Web Services Inc. introduced the machine learning-oriented Inferentia inference processing accelerator in 2018, claiming more than double the performance of GPU-accelerated instances. Last month it announced the general availability of cloud instances based on its Trainium chip, saying they cost up to 50% less than GPU-based EC2 instances at comparable performance in deep learning model training scenarios. Both companies’ efforts are mainly focused on delivery via their cloud services.

While the established market leaders focus on incremental improvement, much of the more interesting innovation is going on among startups building AI-specific hardware. They brought in the lion’s share of the $1.8 billion venture capitalists invested in chip startups last year, more than double the amount invested in 2017, according to PitchBook Data Inc. They’re chasing what could be a massive payday: Allied Market Research, the research and business consulting wing of Allied Analytics LLP, expects the global AI chip market to grow from $8 billion in 2020 to nearly $195 billion by 2030.

Smaller, faster, cheaper

Few startups are angling to replace x86 CPUs because there’s relatively little leverage in doing so. “The chip is no longer the bottleneck,” said Mike Gualtieri, a principal analyst at Forrester Research Inc. “It’s the communication between the different chips that’s a huge bottleneck.”

Cerebras uses “wafer-scale” integration to apply trillions of transistors to neural networks. Photo: Cerebras

CPUs perform low-level operations like managing files and allocating tasks, but “a pure CPU-exclusive approach no longer works for scaling,” said Gartner’s Dekate. “The CPU is designed for all sorts of activities from open files to managing memory caching. It has to be general-purpose.” That means it’s poorly suited for the massively parallel matrix arithmetic operations that AI model training demands.
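To see why matrix arithmetic parallelizes so readily, consider the minimal sketch below. It is an illustration only, not any vendor's implementation: the function names and the thread pool are assumptions chosen for clarity. The point is that each output row of a matrix product depends on just one row of the left-hand matrix, so rows can be handed to separate workers with no coordination.

```python
from concurrent.futures import ThreadPoolExecutor


def row_times_matrix(row, b):
    """Compute one output row; it needs no data from any other output row."""
    cols = len(b[0])
    return [sum(row[k] * b[k][j] for k in range(len(b))) for j in range(cols)]


def parallel_matmul(a, b, workers=4):
    """Multiply a (n x m) by b (m x p), submitting one task per output row."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so rows come back in position.
        return list(pool.map(lambda r: row_times_matrix(r, b), a))


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(parallel_matmul(a, b))  # [[19, 22], [43, 50]]
```

Python threads are used here only to show that the work splits cleanly; they do not speed up this arithmetic. Accelerators exploit the same independence in hardware, across thousands of cores at once.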

Most of the action in the market is around co-processor accelerators, application-specific integrated circuits and, to a lesser extent, field-programmable gate arrays that can be fine-tuned for specific uses. “Nobody’s trying to replace CPUs other than Arm and AMD,” said IDC’s Rutten, referring to Arm Ltd., the creator of reference designs for low-power CPUs.

“Everybody is following the Google narrative, which is to develop co-processors that work together with a CPU and target a specific portion of an AI workload by hard-coding the algorithm into the processor rather than running as software,” Rutten said. “These are ASICs that are only built for doing one thing very well.”

Acceleration equation

That’s essentially what Blaize Inc. is doing. The El Dorado Hills, California-based company, which was founded by three former Intel engineers, has raised $155 million to build what it calls a Graph Streaming Processor for use in edge computing scenarios like autonomous vehicles and video surveillance. The fully programmable chipset takes on many of the functions of a CPU but is optimized for task-level parallelism and streaming execution processing, operating on only seven watts of power.

Blaize’s Graph Streaming Processor uses a graph data structure to support neural network processing. Photo: Blaize

Blaize’s architecture is based on graph data structures in which relationships between objects are represented as connected nodes and edges. “Every machine learning framework uses graph concepts,” said Val Cook, the company’s co-founder and chief software architect. “We maintain that same semantic throughout the design of the chip. We can execute an entire graph that includes CNNs but can include custom nodes. We can accelerate anything parallel in those graphs.”
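The graph idea Cook describes can be sketched in a few lines. The example below is a generic illustration of a computation graph, not Blaize's design or API: the node names, the dictionary layout and the `run_graph` helper are all hypothetical. Each node is an operation, each edge says whose output it consumes, and a node runs once its inputs are ready.

```python
from graphlib import TopologicalSorter

# Hypothetical four-node graph: two inputs feed a multiply, then an add.
# Each node maps to (operation, names of the nodes it takes input from).
graph = {
    "x": (lambda: 3.0, []),
    "w": (lambda: 2.0, []),
    "mul": (lambda x, w: x * w, ["x", "w"]),
    "out": (lambda m, x: m + x, ["mul", "x"]),
}


def run_graph(graph):
    """Evaluate every node after the nodes it depends on."""
    deps = {name: set(inputs) for name, (_, inputs) in graph.items()}
    values = {}
    for name in TopologicalSorter(deps).static_order():
        op, inputs = graph[name]
        values[name] = op(*(values[i] for i in inputs))
    return values


print(run_graph(graph)["out"])  # 9.0
```

Nodes with no edge between them, such as `x` and `w` here, have no ordering constraint, which is the kind of parallelism a graph-oriented processor can exploit.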

The company says its graph-based architecture solves some of the capacity limitations of GPUs and CPUs and adapts more flexibly to different kinds of AI tasks. It also allows developers to move more processing to the edge for better inferencing. “If you can pre-process 80% of the processing on the camera, you’re saving tons of time and cost,” said Dmitry Zakharchenko, vice president of software development.

Blaize is one of several startups that are targeting edge applications where intelligence is moved closer to the data to enable split-second decision-making. Most are aimed at inferencing, which is the field deployment of AI models, rather than the more compute-intensive task of training.

Axelera AI B.V. is building a chip that uses in-memory computation to reduce latency and the need for outboard storage devices. “Our AI platform will deliver flexibility and the ability to run multiple neural networks while keeping high accuracy,” Marketing and Communications Manager Merlijn Linschooten said in an email response to questions.

Kalray SA calls its line of data processing units “massively parallel processor arrays” with a scalable 80-core processor capable of executing dozens of tasks in parallel. “The key innovations of Kalray are the tight integration of a tensor coprocessor inside each processing element, and the support of direct tensor data exchange between the elements to avoid memory bandwidth bottlenecks,” Chief Executive Eric Baissus said in an email interview. “This enables highly efficient AI application acceleration since pre- and post-processing is performed on the same processing elements.”

Tel Aviv-based Hailo Technologies Ltd. focuses on inferencing of deep learning models with a thumbnail-sized chipset that it claims can perform 26 trillion operations per second while consuming less than three watts of power. It does that, in part, by breaking down each of the network layers that are used to train deep learning models into the required computing elements and consolidating them all on one chip that’s purpose-built for deep learning.

The use of onboard memory further reduces overhead, said Liran Bar, vice president of business development. “Hailo has the entire network within the chip itself. We don’t have external memory,” he said. That means chips can be smaller and consume less power. Hailo says its chips can run deep learning models across high-definition images in near-real-time, enabling a single device to run automated license plate recognition on four lanes of traffic simultaneously.

The game changers

Some startups are taking more of a moonshot approach, aiming to redefine the entire platforms on which AI models are trained and run.

Graphcore’s 3-D chip design packs nearly 1,500 parallel processing cores onto a chip. Photo: Graphcore

Graphcore Ltd. says its AI processors, which are optimized for machine learning, can manage up to 350 trillion processing operations per second with nearly 9,000 concurrent threads and 900 megabytes of in-processor memory. Its integrated computing system, called the Bow-2000 IPU Machine, is claimed to deliver 1.4 petaflops, or 1.4 quadrillion floating-point operations per second.

The difference is its three-dimensional stacked-wafer design, which enables it to pack nearly 1,500 parallel processing cores onto a chip. “All of those are capable of running completely different operations,” CEO Nigel Toon said in an email interview. “This differentiates it from the widely used GPU architecture which favors running the same operations on large blocks of data.”

Graphcore has raised over $680 million and counts the U.S. Department of Energy’s Pacific Northwest National Laboratory, Argonne and Sandia Labs among its customers. Toon said its systems have been used to train protein-folding models and for climate modeling.

Boston-based Lightmatter Inc. is addressing the interconnect, which is the wiring that links components on an integrated circuit to each other. As processors have reached their theoretical maximum speed, the pathways that move bits around have increasingly become a chokepoint, particularly when multiple processors are accessing memory at the same time.

Lightmatter uses photons instead of electrons to achieve high-speed interconnect performance at low power. Photo: Lightmatter

Lightmatter’s chips use nano-photonic waveguides in an AI platform that it says combines high speed and massive bandwidth in a low-energy package. It essentially functions as an optical communication layer to which multiple other processors and accelerators can be attached.

“The quality of the AI results comes from the ability to simultaneously support very large and complex models while achieving very high-throughput responses, both of which we can achieve,” CEO Nick Harris said in emailed comments. “This is applicable for any operations that can be done using linear algebra,” which includes most AI uses.

The company, which has raised $113 million, touts the energy efficiency of its approach as a virtue. That may resonate with organizations that rely heavily on GPU-based model training, given that Nvidia’s new H100 chip clocks in at an eye-popping 700 watts of maximum power consumption.

“The chip is now less the bottleneck than the interconnect,” said Forrester’s Gualtieri. “Lightmatter could be very disruptive.”

The Unicorn

No startup has generated as much interest – and investment – as SambaNova. With $1.1 billion in funding and a $5 billion valuation, expectations are sky-high for its integrated hardware and software platform, which it claims can run AI and other data-intensive applications anywhere from the data center to the edge.

SambaNova’s DataScale hardware platform uses custom seven-nanometer chips designed for machine and deep learning. Its reconfigurable Dataflow Architecture runs an AI-optimized software stack and its hardware architecture is built to minimize memory access and thus interconnect bottlenecks.

“Its processor can be reconfigured for AI or high-performance computing workloads and its processors are designed to process large-scale matrix operations at a higher level of performance,” said Dekate. That’s a plus for customers with variable workloads.

Marshall Choy, senior vice president of product at SambaNova, said the company is taking a clean-slate approach because CPUs, GPUs and even FPGAs “are really well-built for deterministic software like transactional systems and ERP.” Machine learning algorithms, however, are probabilistic, meaning that the outcome isn’t known in advance. “A whole different type of hardware infrastructure is required for that,” he said.

SambaNova’s platform minimizes interconnect problems by bolting a terabyte of high-speed double data rate, or DDR, synchronous memory onto its processor. “We can basically mask the DDR controller latency with 20-times faster on-chip memory so that’s transparent to the user,” Choy said. “That enables us to train much higher parameter-count language models and highest resolution images without tiling or downsampling.”

Tiling is a technique used in image analysis that reduces the need for computing power by slicing images into smaller pieces, analyzing each piece and then reassembling the results. Downsampling shrinks images to a lower resolution before training, which saves time and compute resources at the cost of fine detail.
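As a rough illustration of tiling, the sketch below slices a small image into square pieces, analyzes each piece on its own and then combines the per-piece results. The function names and the toy "brightest pixel" analysis are assumptions made for the example, and it assumes image dimensions that divide evenly by the tile size.

```python
def tile(image, size):
    """Split a 2D image (a list of rows) into size x size tiles, row-major."""
    tiles = []
    for top in range(0, len(image), size):
        for left in range(0, len(image[0]), size):
            tiles.append([row[left:left + size] for row in image[top:top + size]])
    return tiles


def brightest_pixel(image, size):
    """Analyze each tile separately, then combine the per-tile answers."""
    per_tile = [max(max(row) for row in t) for t in tile(image, size)]
    return max(per_tile)


# A 4x4 test image with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
print(tile(image, 2)[0])          # [[0, 1], [4, 5]]
print(brightest_pixel(image, 2))  # 15
```

Each tile needs only a fraction of the memory of the full image. The trade-off is that any pattern spanning a tile boundary is split across pieces, which is why working on whole images can matter.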

The result is a system that the company says is not only faster than one based on GPUs but able to tackle much bigger problems. “If you try to train a GPU model with 512 cubed images, the result is an out-of-memory error; you just can’t do it,” Choy said. “We’re enabling people to do things they could not do before.”
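A back-of-the-envelope calculation suggests why volumes of that size strain memory. The figures below rest on assumptions that are not from SambaNova: 32-bit floating-point values, a single input channel and a hypothetical first layer that produces 64 feature maps at full resolution.

```python
def volume_bytes(side, channels=1, bytes_per_value=4):
    """Bytes needed to hold one cubic volume of side**3 values per channel."""
    return side ** 3 * channels * bytes_per_value


GIB = 1024 ** 3

one_input = volume_bytes(512)               # a single raw 512-cubed volume
one_layer = volume_bytes(512, channels=64)  # hypothetical 64 feature maps

print(one_input / GIB)  # 0.5
print(one_layer / GIB)  # 32.0
```

Under those assumptions, one layer's activations for a single image take 32 gibibytes before weights, gradients or any further layers are counted.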

A new kind of computing

SambaNova’s Choy: “We’re enabling people to do things they could not do before.” Photo: SambaNova

With so many companies seeking solutions to the same problems, a shakeout is inevitable, but no one foresees one coming soon. GPUs will be around for a long time and probably remain the most cost-effective solution for AI training and inferencing projects that don’t require extreme performance.

As models at the high end of the market grow larger and more sophisticated, though, so will the need for function-specific architectures. “Three to five years from now you’re likely to see a diversity of GPUs and AI accelerators,” said Gartner’s Dekate. “That is the only way we can scale to meet the needs at the end of the decade and beyond.”

Expect the leading chipmakers to continue to do what they do well and incrementally build on established technology. Many will also follow Intel’s lead by acquiring AI-focused startups. The high-performance computing world is also looking at the potential of AI to help tackle classic problems such as large-scale simulations and climate modeling.

“HPC ecosystems are always looking at new technologies they can ingest to stay at the forefront,” Dekate said. “They are exploring what AI can bring to the table.” Lurking in the background is quantum computing, a technology that is still more theoretical than practical but that has the potential to revolutionize the way computing is done.

Regardless of which new architectures gain favor, the AI surge has unquestionably rekindled interest in the potential of hardware innovations to open up new frontiers in software.

“For every order of magnitude you grow a system, you have to look at a complete refactoring of the system architecture,” said SambaNova’s Choy. “The jump from exascale to zettascale is going to require a ton of hardware and software optimization and silicon is going to be a big part of it. It isn’t going to be the same old same old for the next 10 years.”

Photo: Lawrence Livermore National Lab
