Innovations in Memory Architecture and Near/In-Memory Computing
Session Chair: Li Du, Nanjing University
A Hybrid Encryption Framework for Federated Learning Accelerated on RRAM
Presenter: Qisheng Yang, Hunan University
Abstract: Federated learning (FL) enables decentralized collaborative training while preserving data privacy, yet it remains vulnerable to dual privacy threats: client-side training data inference attacks and server-side insecure aggregation. To address these challenges, this study proposes a hybrid encryption framework accelerated on resistive random access memory (RRAM) that integrates differential privacy (DP) and homomorphic encryption (HE) for multilevel security protection. The framework innovatively harnesses two key characteristics of RRAM devices: (1) the parallel in-memory computing capability enabled by crossbar array architectures reduces the computational complexity of HE-based matrix encryption, achieving real-time encryption throughput; (2) the stochastic cycle-to-cycle read noise inherent to RRAM devices is systematically harnessed as a DP-compliant noise injection mechanism, enabling tunable privacy budgets without additional computational overhead. Experiments demonstrate that the proposed scheme achieves a privacy budget of 1.737 to 5.791 on the CIFAR-10 dataset while limiting accuracy loss to 1% to 4%. Benchmark results reveal a 270x acceleration over CPU implementations. This framework provides a new hardware implementation path for privacy-preserving machine learning on resource-constrained edge devices.
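As a purely illustrative sketch of the DP half of such a pipeline (not the paper's implementation): if cycle-to-cycle read noise is modeled as i.i.d. Gaussian noise added to clipped updates, the standard Gaussian-mechanism bound converts a device noise level into a privacy budget. The function names, the clipping rule, and the delta value below are all assumptions.

```python
import numpy as np

def clip_and_perturb(update, clip_norm, device_noise_std, rng=None):
    """Clip a model update to bound its sensitivity, then add read noise
    modeled as i.i.d. Gaussian (a stand-in for RRAM cycle-to-cycle noise)."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    update = update * min(1.0, clip_norm / (norm + 1e-12))
    return update + rng.normal(0.0, device_noise_std, update.shape)

def epsilon_gaussian(clip_norm, noise_std, delta=1e-5):
    """Privacy budget of one Gaussian-mechanism release:
    eps = clip_norm * sqrt(2 * ln(1.25/delta)) / sigma."""
    return clip_norm * np.sqrt(2.0 * np.log(1.25 / delta)) / noise_std
```

In this simplified model, a higher device noise level yields a smaller epsilon, which is one way to read the "tunable privacy budgets" the abstract describes.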
BBP-DNS: Batch-Block Parallelism and Dual-NoC Scheduling for Accelerating GPT on Edge Devices
Presenter: Dan Niu, Southeast University
Abstract: Transformer-based GPT models have exhibited remarkable performance advantages across generative tasks in artificial intelligence. However, edge-side deployment of GPT models faces significant challenges due to the limited memory and computational resources of edge devices, and the lack of specialized compiler toolchains often necessitates manual compilation, complicating deployment. To address these challenges, we propose a Batch-Block Parallelism and Dual-NoC Scheduling algorithm (BBP-DNS) for edge devices, which automates tensor partitioning and batch scheduling to enhance data locality and reduce overheads in data storage, movement, and kernel scheduling, thereby improving inference speed. To mitigate hardware constraints on accelerator memory, bandwidth, and compute resources, we extend tensor parallelism with a fine-grained batch-level scheduling strategy. Additionally, we introduce a novel operator mapping methodology that automates accelerator data management, addressing the inefficiencies of traditional manual compilation workflows. Experimental results demonstrate BBP-DNS's superior performance: on a simulator, a single GPT layer block achieves a 30x speedup over TVM and PyTorch evaluations on CPUs.
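As a software-level analogy of the batch-block idea (a sketch under stated assumptions, not BBP-DNS itself; the block size, core count, and round-robin policy are all placeholders): split a batch along its leading dimension into blocks and alternate the blocks across cores, so each core works on small, locality-friendly chunks.

```python
def batch_block_schedule(batch, n_cores, block_size):
    """Partition a batch into contiguous blocks and assign them to cores
    round-robin, approximating batch-level scheduling on top of tensor
    parallelism. `batch` is any sliceable sequence (e.g. a list of samples)."""
    blocks = [batch[i:i + block_size] for i in range(0, len(batch), block_size)]
    plan = {core: [] for core in range(n_cores)}
    for idx, block in enumerate(blocks):
        plan[idx % n_cores].append(block)  # alternate blocks across cores
    return plan
```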
An Efficient Single-Cell Associative Search Engine via Conditional Execution
Presenter: Jiayi Wang, Zhejiang University
Abstract: Content addressable memories (CAMs) embed parallel associative search directly into memory blocks, making them essential for associative memory (AM) applications. CAMs can efficiently perform best-match or exact-match search operations, which identify the database entry that is closest to, or exactly the same as, the input entry, respectively. Recently, a compact and highly efficient single-cell FeFET-based CAM design has been proposed. However, its multi-step sensing scheme is tailored for best-match search and is thus inefficient for exact-match search operations. To address this, we propose a sensing scheme built on the concept of conditional execution, which dynamically prunes execution steps according to the outputs of earlier steps. For higher efficiency, we implement the proposed sensing scheme in the voltage domain and propose an efficient transformation scheme to support current-domain arrays. The proposed scheme achieves up to 3.66x lower energy and 3x lower latency, as well as a 2.5x lower error rate.
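The conditional-execution idea can be mimicked in software as early termination across sensing steps (a sketch only; the actual scheme operates on analog sense outputs in the voltage domain, and the step slicing below is an assumption):

```python
def exact_search(query, rows, n_steps=4):
    """Exact-match search with conditional execution: each row is compared
    step by step, and remaining steps are pruned as soon as one step
    reports a mismatch. Returns the indices of exact matches."""
    step = max(1, len(query) // n_steps)
    matches = []
    for rid, row in enumerate(rows):
        for s in range(0, len(query), step):
            if row[s:s + step] != query[s:s + step]:
                break              # mismatch: skip this row's later steps
        else:
            matches.append(rid)    # every step matched
    return matches
```

Exact search benefits most from this pruning because a single mismatching step settles a row's outcome, whereas best-match search must weigh the results of all steps.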
NandPIM: An Optimization Framework for 2D/3D NAND-Based PIM Featuring Genetic Algorithm-based Search of Design Space
Presenter: Liang Zhao, Zhejiang University
Abstract: This paper presents NandPIM, an innovative NAND flash-based processing-in-memory (PIM) accelerator design coupled with an automatic deep neural network (DNN) quantization framework. By integrating spatial partition mapping and weight duplication strategies, NandPIM fully leverages the high-density storage capability of NAND flash memory and significantly improves computational parallelism. Moreover, alternating the mapping of weights within the flash memory array improves the effective endurance of the PIM system. The quantization framework of NandPIM, powered by a genetic algorithm, automatically optimizes the bit-widths of DNN weights to determine the optimal layer-by-layer quantization strategy within the constraints of hardware resources and inference accuracy. Experimental results demonstrate that the proposed design outperforms traditional PIM approaches in area, latency, and energy efficiency. Furthermore, NandPIM achieves automatic bit-width compression across various DNN models, boosting computational efficiency while preserving model accuracy. This framework provides a compelling solution for the efficient deployment of DNNs on 2D/3D NAND-based PIM chips.
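A toy version of a genetic search over per-layer bit-widths might look like the sketch below; the accuracy proxy `eval_acc`, the budget penalty, and every hyperparameter are assumptions, not NandPIM's actual framework.

```python
import random

def ga_bitwidths(n_layers, eval_acc, bit_budget, choices=(2, 4, 8),
                 pop_size=20, generations=30, mut_prob=0.1):
    """Evolve per-layer bit-width vectors: keep the fitter half, recombine
    pairs by one-point crossover, and mutate single genes. `eval_acc` is a
    caller-supplied accuracy proxy; over-budget individuals are penalized."""
    def fitness(ind):
        return eval_acc(ind) - (100.0 if sum(ind) > bit_budget else 0.0)

    popn = [[random.choice(choices) for _ in range(n_layers)]
            for _ in range(pop_size)]
    for _ in range(generations):
        popn.sort(key=fitness, reverse=True)
        elite = popn[:pop_size // 2]        # survivors
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            cut = random.randrange(1, n_layers)
            child = a[:cut] + b[cut:]       # one-point crossover
            if random.random() < mut_prob:
                child[random.randrange(n_layers)] = random.choice(choices)
            children.append(child)
        popn = elite + children
    return max(popn, key=fitness)
```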
Enabling High-Throughput Inference of Transformers on Near-Data Processing Architectures
Presenter: Yingjian Zhong, Anhui University
Abstract: The rapid adoption of Transformer models in AI has exposed critical inefficiencies in conventional computing architectures, particularly due to their large memory footprint and low data reuse. Near-Data Processing (NDP) architectures have emerged as a promising solution to the memory wall problem, especially for memory-intensive neural network (NN) workloads. However, existing AI compilation frameworks, optimized for von Neumann architectures, fail to exploit the distributed nature and fine-grained parallelism inherent in NDP architectures, especially for memory-intensive Transformer models. This work proposes a mapping framework that leverages novel pipeline parallelism to maximize the inference throughput of Transformers on NDP architectures. First, we propose a pipeline layout strategy called Rect-zigzag, which offers greater flexibility than existing pipeline layout strategies (such as the 1D static layout and the zigzag layout) while adapting to the fine-grained partitioning required by Transformer models. Second, we propose a dynamic programming-based mapping algorithm that jointly optimizes partitioning and pipeline layout for Transformer models. Experiments demonstrate that our mapping method raises the inference throughput of Transformers to 1.14x to 3.92x that of the baseline methods.
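The dynamic-programming step can be illustrated with the textbook DP for bottleneck-minimizing pipeline partitioning (a simplified stand-in that assumes known scalar per-layer costs and ignores the Rect-zigzag layout dimension):

```python
import math

def dp_pipeline_split(costs, n_stages):
    """Partition consecutive layer costs into n_stages pipeline stages so
    the most expensive stage is as cheap as possible; in a balanced
    pipeline, that bottleneck cost bounds achievable throughput."""
    n = len(costs)
    prefix = [0.0]
    for c in costs:
        prefix.append(prefix[-1] + c)
    # best[k][i]: minimal bottleneck when the first i layers fill k stages
    best = [[math.inf] * (n + 1) for _ in range(n_stages + 1)]
    best[0][0] = 0.0
    for k in range(1, n_stages + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                stage_cost = prefix[i] - prefix[j]
                best[k][i] = min(best[k][i], max(best[k - 1][j], stage_cost))
    return best[n_stages][n]
```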