
Untether AI in Canada has developed an AI device with over 1400 RISC-V processors called Boqueria for ‘at memory’ computing.

Boqueria, discussed at the Hot Chips conference today, is built on TSMC's 7nm process with 238MB of SRAM. The device delivers 2 PetaFlops of performance for the 8-bit FP8 AI data type at an energy efficiency of 30 TFlops/W, which comes from keeping processing close to memory across 729 memory banks, each with two RISC-V processors.

Because at-memory compute is significantly more energy efficient than traditional von Neumann architectures, more TFlops can be delivered within a given power envelope. With the introduction of the runAI devices in 2020, Untether AI reached an energy efficiency of 8 TOPs/W for the INT8 datatype.

The speedAI architecture used in Boqueria improves upon that, delivering 30 TFlops/W. This energy efficiency is a product of the second-generation at-memory compute architecture, over 1,400 optimized RISC-V processors with custom instructions, energy-efficient dataflow, and the adoption of a new FP8 datatype, all of which help quadruple efficiency compared to the previous-generation runAI device.
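The generational gain can be checked with a quick calculation. The following is a minimal sketch that uses only the two figures quoted above; note that it compares INT8 TOPs/W against FP8 TFlops/W, as the article does, and the constant and function names are illustrative.

```python
# Efficiency figures quoted in the article.
RUNAI_TOPS_PER_W = 8.0        # runAI (2020), INT8 datatype
SPEEDAI_TFLOPS_PER_W = 30.0   # speedAI, FP8 datatype


def efficiency_gain(new: float, old: float) -> float:
    """Return the ratio of the new energy efficiency to the old one."""
    return new / old


gain = efficiency_gain(SPEEDAI_TFLOPS_PER_W, RUNAI_TOPS_PER_W)
print(f"speedAI vs runAI efficiency: {gain:.2f}x")
```

The ratio comes out at 3.75, which the article rounds to a quadrupling.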

Each memory bank of the speedAI architecture has 512 processing elements with direct attachment to dedicated SRAM. These processing elements support INT4, FP8, INT8, and BF16 datatypes, along with zero-detect circuitry for energy conservation and support for 2:1 structured sparsity.

The processing elements are arranged in 8 rows of 64. Each row has its own dedicated row controller and hardwired reduce functionality to allow flexibility in programming and efficient computation of transformer network functions such as Softmax and LayerNorm. The rows are managed by two RISC-V processors with over 20 custom instructions designed for inference acceleration. The flexibility of the memory bank allows it to adapt to a variety of neural network architectures, including convolutional, transformer, and recommendation networks, as well as linear algebra models.
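The per-bank figures can be tied back to the chip-level counts with simple arithmetic. The sketch below assumes every one of the 729 memory banks mentioned earlier has the same layout; the derived totals are not stated in the article, and the names are illustrative.

```python
# Per-bank figures from the article.
ROWS_PER_BANK = 8
PES_PER_ROW = 64
RISCV_PER_BANK = 2
MEMORY_BANKS = 729  # chip-level bank count quoted earlier in the article

# Derived values (assuming all banks are identical).
pes_per_bank = ROWS_PER_BANK * PES_PER_ROW
total_pes = pes_per_bank * MEMORY_BANKS
total_riscv = RISCV_PER_BANK * MEMORY_BANKS

print(f"Processing elements per bank: {pes_per_bank}")
print(f"Processing elements per chip: {total_pes}")
print(f"RISC-V processors per chip:   {total_riscv}")
```

This gives 512 processing elements per bank, matching the figure above, and 1,458 RISC-V processors per chip, consistent with the "over 1,400" claim.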

The first member of the family, the speedAI240, provides 2 PetaFlops of FP8 performance and 1 PetaFlop of BF16 performance. This translates into higher performance, for example running the BERT model at over 750 queries per second per watt (qps/W), 15x greater than the current state of the art from leading GPUs.

Untether AI's research determined that two different FP8 formats provided the best mix of precision, range, and efficiency. A version with a 4-bit mantissa (FP8p for "precision") and a version with a 3-bit mantissa (FP8r for "range") provided the best accuracy and throughput for inference across a variety of different networks. For both convolutional networks like ResNet-50 and transformer networks like BERT-Base, Untether AI's implementation of FP8 results in less than 1/10th of 1 percent of accuracy loss compared to using BF16 data types, with a fourfold increase in throughput and energy efficiency.
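The trade-off between the two formats can be illustrated with a small decoder. This is a sketch under stated assumptions, not Untether AI's specification: it assumes one sign bit, so that the remaining bits form a 3-bit exponent for FP8p and a 4-bit exponent for FP8r, an IEEE-754-style exponent bias and subnormals, and no encodings reserved for infinity or NaN. The function name `decode_fp8` is hypothetical.

```python
def decode_fp8(byte: int, mantissa_bits: int) -> float:
    """Decode an 8-bit float with 1 sign bit and the given mantissa width.

    Assumptions (not from the article): IEEE-754-style biased exponent,
    subnormal support, and no encodings reserved for Inf or NaN.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("byte must be in the range 0..255")
    exponent_bits = 7 - mantissa_bits
    bias = (1 << (exponent_bits - 1)) - 1
    sign = -1.0 if byte >> 7 else 1.0
    exponent = (byte >> mantissa_bits) & ((1 << exponent_bits) - 1)
    fraction = (byte & ((1 << mantissa_bits) - 1)) / (1 << mantissa_bits)
    if exponent == 0:
        # Subnormal numbers: no implicit leading 1.
        return sign * fraction * 2.0 ** (1 - bias)
    return sign * (1.0 + fraction) * 2.0 ** (exponent - bias)


FP8P_MANTISSA_BITS = 4  # "precision" format
FP8R_MANTISSA_BITS = 3  # "range" format

# Largest positive value in each format under the assumptions above.
max_fp8p = decode_fp8(0x7F, FP8P_MANTISSA_BITS)
max_fp8r = decode_fp8(0x7F, FP8R_MANTISSA_BITS)

# Gap between 1.0 and the next representable value in each format.
step_fp8p = decode_fp8(0x31, FP8P_MANTISSA_BITS) - decode_fp8(0x30, FP8P_MANTISSA_BITS)
step_fp8r = decode_fp8(0x39, FP8R_MANTISSA_BITS) - decode_fp8(0x38, FP8R_MANTISSA_BITS)

print(f"FP8p: max {max_fp8p}, step near 1.0 {step_fp8p}")
print(f"FP8r: max {max_fp8r}, step near 1.0 {step_fp8r}")
```

Under these assumptions FP8p resolves steps of 1/16 near 1.0 but tops out at 31, while FP8r reaches 480 at a coarser step of 1/8, which is the precision-versus-range trade the two names describe. The actual hardware encodings may differ.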

The speedAI240 device is designed to scale to large models. The memory architecture is multi-leveled, with 238MB of SRAM dedicated to the processing elements offering 1 petabyte/s of memory bandwidth, four 1MB scratchpads, and two 64-bit wide ports of LPDDR5, providing up to 32GB of external DRAM.
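To put the on-chip figures in perspective, they can be divided across the memory banks. The sketch below assumes the SRAM and bandwidth are spread evenly over the 729 banks and their 512 processing elements each, and that MB and petabyte are decimal units; the article states none of this, so the per-bank numbers are rough estimates only.

```python
# Chip-level figures from the article (decimal units assumed).
SRAM_BYTES = 238e6               # 238 MB of SRAM
BANDWIDTH_BYTES_PER_S = 1e15     # 1 petabyte/s of memory bandwidth
MEMORY_BANKS = 729
PES_PER_BANK = 512

# Even distribution across banks is an assumption, not a stated fact.
sram_per_bank_kb = SRAM_BYTES / MEMORY_BANKS / 1e3
bw_per_bank_tb_s = BANDWIDTH_BYTES_PER_S / MEMORY_BANKS / 1e12
bw_per_pe_gb_s = BANDWIDTH_BYTES_PER_S / (MEMORY_BANKS * PES_PER_BANK) / 1e9

print(f"SRAM per bank:      {sram_per_bank_kb:.0f} KB")
print(f"Bandwidth per bank: {bw_per_bank_tb_s:.2f} TB/s")
print(f"Bandwidth per PE:   {bw_per_pe_gb_s:.2f} GB/s")
```

On these assumptions each bank holds roughly 326 KB of SRAM and sees about 1.37 TB/s, or around 2.7 GB/s per processing element.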

There are 16 lanes of PCIe Gen5 for host connectivity at 63GB/s with three ports of PCIe Gen5 x8 for chip-to-chip and card-to-card connectivity, each providing 31.5GB/s.
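The quoted bandwidths follow from the PCIe Gen5 signalling rate of 32 GT/s per lane with 128b/130b encoding. A minimal sketch of that calculation, giving the per-direction rate before packet and protocol overhead (the function name is illustrative):

```python
GEN5_GT_PER_S = 32.0               # raw transfer rate per lane, in GT/s
ENCODING_EFFICIENCY = 128 / 130    # 128b/130b line encoding


def pcie_gen5_gb_per_s(lanes: int) -> float:
    """Approximate per-direction bandwidth in GB/s, ignoring protocol overhead."""
    return GEN5_GT_PER_S * ENCODING_EFFICIENCY / 8 * lanes


host_link = pcie_gen5_gb_per_s(16)   # x16 host connection
chip_link = pcie_gen5_gb_per_s(8)    # each x8 chip-to-chip port

print(f"x16 host link: {host_link:.1f} GB/s")
print(f"x8 port:       {chip_link:.1f} GB/s")
```

This reproduces the 63GB/s host figure and 31.5GB/s per x8 port.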

"The merits of at-memory compute have been proven with the first generation runAI device, and the second generation speedAI architecture enhances the energy efficiency, throughput, accuracy, and scalability of our offering," said Arun Iyengar, CEO of Untether AI. "speedAI devices offer an ability that is unmatched by any other inference offering in the marketplace."

Untether AI has a Software Development Kit (SDK) called imAIgine that provides a path to running networks at high performance, with push-button quantization, optimization, physical allocation, and multi-chip partitioning. The imAIgine SDK, which is available now, also provides an extensive visualization toolkit, a cycle-accurate simulator, and an easily integrated runtime API.

speedAI devices will be offered as standalone chips as well as in a variety of M.2 and PCI Express form factor cards. Sampling of speedAI240 devices and cards to early access customers is expected to begin in the first half of 2023.