Sitemap

A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.

Pages

Posts

portfolio

publications

A 3-D Multi-Precision Scalable Systolic FMA Architecture

Published in IEEE Transactions on Circuits and Systems I: Regular Papers, 2024

Abstract: Artificial Intelligence (AI) has almost become the default approach in a wide range of applications, such as computer vision, chatbots, and natural language processing. These AI-based applications require computing large-scale data with sufficient precision, typically in floating-point numbers, within a limited time window. A primary target for AI acceleration is matrix multiplication, which mainly involves dot products computed through Multiply-Accumulate (MAC) operations. Current research employs the Fused Multiply-Add (FMA) operation, based on the IEEE-754 Floating-Point (FP) standard, to meet these requirements. However, existing work focuses more on simplifying the internal digital circuits of the Processing Elements (PEs) that perform FMA operations than on optimizing the FMA process specifically for MAC tasks. Current PE arrays often use a two-dimensional (2-D) systolic array design without specific optimization for MAC operations, so their parallelism is not fully utilized. Additionally, these designs lack reconfigurability and flexibility, leading to suboptimal performance on Field-Programmable Gate Arrays (FPGAs). Moreover, some designs adopt lower-precision computing in AI inference for higher performance, yet some AI models still rely on high-precision computing to maintain accuracy; thus, multi-precision computing is commonly used in AI accelerators. To address these challenges, this paper proposes a novel Multi-Fused Multiply-Accumulate (MFMA) scheme and a corresponding three-dimensional (3-D) scalable systolic FP computing architecture. The MFMA scheme addresses the limitations of the classical FMA scheme: it optimizes FMA for MAC operations through the Fused Multiply-Accumulate (FMAC) operation, and it combines multi-precision and mixed-precision FP computing methods for higher accuracy and lower overflow error.
The proposed architecture integrates two 2-D systolic arrays into each PE to form a 3-D systolic array, achieving higher parallelism and flexibility. The scalable architecture can be customized to suit various FMAC operations. Compared with existing state-of-the-art FP architectures on FPGAs, our proposed architecture achieves 47%, 10%, and 159% energy-efficiency improvements in FP32, FP16, and INT8 operations, respectively. Furthermore, it achieves energy-efficiency improvements of 105%, 54%, and 262% under efficiency-saturation conditions, outperforming the existing state-of-the-art design.
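The accuracy argument for mixed-precision MAC can be illustrated with a small software sketch (this is an illustrative model, not the paper's hardware design; the function names and the FP16/FP64 pairing are my own choices). It emulates a dot product whose accumulator is rounded to IEEE-754 binary16 at every step, versus one that keeps the accumulator in wider precision, using Python's `struct` half-float support:

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE-754 binary16 and back (round-to-nearest)."""
    return struct.unpack('e', struct.pack('e', x))[0]

def mac_fp16(a, b):
    """MAC with the product AND the accumulator rounded to FP16 each step."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x * y))
    return acc

def mac_mixed(a, b):
    """Mixed precision: FP16 inputs, products accumulated in a wide accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += to_fp16(x) * to_fp16(y)
    return acc

a, b = [0.1] * 4096, [1.0] * 4096
# With a narrow accumulator, the running sum stalls once each addend falls
# below half an FP16 ulp of the accumulator; the wide accumulator does not.
print(mac_fp16(a, b))   # stalls far below the true sum
print(mac_mixed(a, b))  # close to 4096 * fp16(0.1) = 409.5
```

This stalling effect is one reason wide-accumulator FMAC, as in the MFMA scheme, reduces accumulation error for long dot products.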

A High Accuracy and Hardware Efficient Approximate Computing based Leaky Integrate-and-Fire Neuron Model

Published in 2025 IEEE 38th International System-on-Chip Conference (SOCC), 2025

Abstract: The adoption of Spiking Neural Networks (SNNs) has grown significantly, driven by their potential to enhance image processing, robotics, and motor control. These applications typically demand both high performance and low power consumption, especially when deployed on edge devices. Achieving high-performance, low-power SNNs in hardware remains challenging due to their computational complexity and large-scale design. Balancing accuracy and speed often increases power and resource usage, making efficient implementation essential. Optimizing a single neuron model, a fundamental unit replicated thousands of times, is key to improving overall hardware efficiency. The Leaky Integrate-and-Fire (LIF) model, widely used in SNNs, offers a more efficient alternative to other neuron models by improving computational and energy efficiency. This paper presents a LIF neuron model design based on Approximate Multiplication (AM). Given the high robustness of SNNs, they are well suited for approximate computing. The proposed design significantly reduces multiplication complexity with only 2.6099% error. This minimal error has little impact on SNN performance, as shown by similar training results between approximate and precise LIF-based SNNs across various datasets and network sizes. To further demonstrate the advantages of our AM-based LIF neuron model, we evaluate it on a Xilinx FPGA. The FPGA implementation results demonstrate that the AM-based LIF neuron achieves an 8.26% to 84.36% reduction in Look-Up Table utilization, a 17.04% to 86.49% reduction in slice register utilization, and a 10.31-fold increase in energy efficiency relative to the state-of-the-art LIF neuron model.
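The idea of running a LIF neuron on an approximate multiplier can be sketched in a few lines. This is not the paper's AM circuit; as a stand-in I use Mitchell's classic logarithmic approximate multiplier (worst-case error around 11%, different from the 2.6099% design above), and the leak factor, threshold, and input current are hypothetical values chosen for illustration:

```python
import math

def mitchell_mul(a: float, b: float) -> float:
    """Mitchell's logarithmic approximate multiply for non-negative operands."""
    if a == 0.0 or b == 0.0:
        return 0.0
    ka, kb = math.floor(math.log2(a)), math.floor(math.log2(b))
    ma, mb = a / 2.0 ** ka - 1.0, b / 2.0 ** kb - 1.0   # mantissa fractions in [0, 1)
    s = ma + mb
    if s < 1.0:                       # antilog approximation 2**s ~ 1 + s
        return 2.0 ** (ka + kb) * (1.0 + s)
    return 2.0 ** (ka + kb + 1) * s   # carry into the exponent when s >= 1

def run_lif(current, beta=0.9, theta=1.0, mul=mitchell_mul):
    """LIF neuron: v <- mul(beta, v) + I, spike and hard-reset when v >= theta."""
    v, spikes = 0.0, []
    for i_in in current:
        v = mul(beta, v) + i_in       # leaky integration via the chosen multiplier
        if v >= theta:
            spikes.append(1)
            v = 0.0                   # hard reset after a spike
        else:
            spikes.append(0)
    return spikes

exact  = run_lif([0.3] * 20, mul=lambda a, b: a * b)
approx = run_lif([0.3] * 20)          # approximate leak; spike train stays similar
```

Because Mitchell's scheme only ever underestimates the product, the approximate neuron integrates slightly more slowly, so its spike count can only match or trail the exact one, which mirrors the abstract's point that small multiplier error barely perturbs SNN behavior.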

An Approximate Computing-Based Spiking Neural Networks Neuron Model and STDP Learning

Published in IEEE Transactions on Circuits and Systems I: Regular Papers, 2025

Abstract: Spiking Neural Networks (SNNs) show great potential in applications such as image processing, robotics, and communications. However, the vast number of neuron models and learning algorithms in large-scale SNNs imposes significant hardware and energy overhead, with multiplication remaining the most critical operation. To address this challenge, this paper presents the hardware design of the Logarithmic Linear Multiply (LLMu) and the Logarithmic Linear Segmented Multiply (LLSMu). These two components are specifically designed for neuron models and learning algorithms, achieving high accuracy with low hardware resource utilization and energy consumption. To demonstrate the capabilities of LLMu and LLSMu, we implement two mainstream SNN neuron models, Leaky Integrate-and-Fire (LIF) and Izhikevich, as well as the Spike Timing-Dependent Plasticity (STDP) learning algorithm, and compare their performance with state-of-the-art approaches on FPGA and ASIC platforms. The scope of this work is limited to these models and algorithms. The LLMu- and LLSMu-based implementations exhibit significantly improved energy efficiency over existing methods. Specifically, in the FPGA implementation, the LLSMu-based LIF neuron model achieves a 6.75× improvement, the LLSMu-based Izhikevich neuron model achieves a 2.70× to 3.72× improvement, and the LLMu-based STDP achieves a 21.03× to 48.78× improvement in energy efficiency. In the ASIC implementation, the LLSMu-based Izhikevich neuron model further improves energy efficiency by 5.58× to 5.69×, while the LLMu-based STDP achieves 5.96× and 3.69× improvements compared to prior designs.
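To see where the multiplications that LLMu/LLSMu accelerate come from, here is the standard pair-based STDP update in plain Python (a reference model only, not the paper's hardware; the amplitudes and time constants are hypothetical textbook-style values, and the exponentials are computed exactly rather than with the paper's logarithmic-linear units):

```python
import math

def stdp_dw(dt, a_plus=0.01, a_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """Pair-based STDP weight change for dt = t_post - t_pre (milliseconds).

    Pre-before-post (dt >= 0) potentiates; post-before-pre depresses.
    Each branch is a multiply of an amplitude by an exponential decay,
    which is exactly the kind of operation a hardware approximate
    multiplier targets.
    """
    if dt >= 0:
        return a_plus * math.exp(-dt / tau_plus)     # potentiation window
    return -a_minus * math.exp(dt / tau_minus)       # depression window

# Weight change shrinks as the spike pair moves apart in time:
for dt in (0.0, 10.0, 40.0, -10.0):
    print(dt, stdp_dw(dt))
```

Every synapse evaluates this update on each spike pair, so in a large SNN the amplitude-times-decay multiply dominates the learning-rule cost, which is why replacing it with a cheap logarithmic-linear approximation pays off.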

talks

teaching