System-Level Design Exploration and NoC Innovations
Session Chair: Jianan Mu, Institute of Computing Technology, Chinese Academy of Sciences
CIR-NoC: Accelerating CNN Inference Through In-Router Computation During Network Congestion
Presenter: Cuiyu Qi, Nanjing University of Aeronautics and Astronautics
Abstract: Traditional Network-on-Chip (NoC) architectures frequently suffer from congestion when processing complex Convolutional Neural Network (CNN) models, primarily due to the massive data volumes and computation-intensive workloads involved. To address this issue, this paper proposes a new NoC-based CNN acceleration architecture. The core innovation is the integration of computational units into routers, which exploits network congestion periods for in-router computation and alleviates the impact of NoC congestion on CNN inference efficiency. Focusing on activation operations as a representative case, we conduct experiments with three CNN models: LeNet, AlexNet, and ResNet. Compared to a conventional virtual-channel (VC) router baseline, the proposed architecture reduces inference cycles by 5.72%–6.64%, 5.12%–5.53%, and 4.53%–5.30% for the three models across NoC configurations of 4×4, 8×8, and 16×16 mesh topologies, confirming that the architecture significantly improves inference efficiency. Additionally, we design two optimization strategies on top of the architecture and provide a systematic analysis of their effectiveness and applicability across different CNN models.
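To make the mechanism concrete, here is a minimal Python sketch of the idea the abstract describes: a router that spends congestion-stall cycles applying an activation function (ReLU here) to flits waiting in its buffer. All names (Flit, CIRRouter) and the one-flit-per-stall-cycle policy are illustrative assumptions, not details from the paper.

    import random
    from collections import deque

    def relu(x):
        return x if x > 0.0 else 0.0

    class Flit:
        """A hypothetical flit carrying one pre-activation value."""
        def __init__(self, payload):
            self.payload = payload   # partial sum awaiting activation
            self.activated = False   # set once an in-router unit processes it

    class CIRRouter:
        """Sketch of a router that computes during congestion stalls."""
        def __init__(self):
            self.buffer = deque()

        def cycle(self, output_blocked):
            if output_blocked:
                # Congestion stall: use the otherwise idle cycle to
                # activate one buffered flit.
                for flit in self.buffer:
                    if not flit.activated:
                        flit.payload = relu(flit.payload)
                        flit.activated = True
                        break
                return None
            # No congestion: forward normally; any flit the router did
            # not finish would be activated at the destination PE.
            return self.buffer.popleft() if self.buffer else None

    # Toy run: under random congestion, flits tend to arrive pre-activated.
    router = CIRRouter()
    for v in [-1.5, 0.3, 2.0, -0.2]:
        router.buffer.append(Flit(v))
    while router.buffer:
        out = router.cycle(output_blocked=random.random() < 0.5)
        if out is not None:
            print(out.payload, out.activated)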
MULiN: A NoC-DNN Accelerator Based on In-Network Multiplication
Presenter: Yi Liu, Nanjing University
Abstract: The rapid advancement of deep learning has driven the remarkable success of Deep Neural Networks (DNNs) across various domains. However, the computational and data-transmission complexity of DNN models poses significant challenges to traditional hardware architectures. Network-on-Chip (NoC)-based DNN accelerators have emerged as a promising solution to these challenges. To improve DNN inference speed and alleviate the impact of network congestion on computational efficiency, this paper presents a NoC architecture with in-network multiplication. By embedding multiplication units within the NoC routers, partial computations are offloaded during periods of network congestion, enhancing overall system efficiency. We develop a cycle-accurate NoC-DNN simulator. Under NoC scales of 8×8 and 16×16, the inference speeds of LeNet and AlexNet-like models improve by 10.7%, 6.9%, 16.3%, and 15.5%, respectively.
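The in-network multiplication idea can be sketched the same way. In the toy Python below, a flit carries an (activation, weight) operand pair; a router multiplies the pair while its output stalls, and the destination PE reuses any product computed en route. The flit layout and the fallback multiply at the PE are assumptions for illustration, not MULiN's actual design.

    from collections import deque

    class MACFlit:
        """Hypothetical flit carrying an (activation, weight) operand pair."""
        def __init__(self, act, wgt):
            self.act, self.wgt = act, wgt
            self.product = None   # filled in if a router multiplies en route

    def router_cycle(buffer, output_blocked):
        """One router cycle: multiply a buffered flit while the output stalls."""
        if output_blocked:
            for f in buffer:
                if f.product is None:
                    f.product = f.act * f.wgt   # offloaded in-network multiply
                    break
            return None
        return buffer.popleft() if buffer else None

    def pe_accumulate(flits):
        """Destination PE: reuse in-network products, multiply the rest."""
        return sum(f.product if f.product is not None else f.act * f.wgt
                   for f in flits)

    buf = deque([MACFlit(0.5, 2.0), MACFlit(-1.0, 0.25)])
    router_cycle(buf, output_blocked=True)   # one stall cycle does one multiply
    print(pe_accumulate(list(buf)))          # -> 0.75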
VeriRAG: Design AI-Specific CPU Co-processor with RAG-Enhanced LLMs
Presenter: Kangbo Bai, Peking University
Abstract: In recent years, many industrial CPUs have integrated AI-specific co-processors as a "CPU for AI" solution. However, the implementation details of these co-processors vary widely. In this work, we present VeriRAG, an LLM-assisted agile design methodology for AI-specific CPU co-processors that includes a summary-template RAG-enhanced LLM flow for agile RTL generation. We first present a general RISC-V-based AI-specific co-processor architecture template that supports both matrix and nonlinear computations with detailed instruction extensions. We then develop a series of AI-specific co-processors based on VeriRAG for different CPUs, i.e., the high-end Xuantie C910 and the low-power CVA6. We observe that VeriRAG can design these AI co-processors effectively within a short design cycle, and that the co-processors deliver notable performance gains on AI tasks.
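As a rough illustration of what a summary-template RAG flow might look like, the Python sketch below ranks stored (summary, Verilog template) pairs against a design request and stitches the top hits into an LLM prompt. The bag-of-words similarity, the library entries, and the llm_generate placeholder are all hypothetical stand-ins; the paper's actual retrieval and prompting scheme is not specified here.

    def similarity(a, b):
        """Crude bag-of-words overlap; a real flow would use embeddings."""
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(1, len(wa | wb))

    def retrieve(request, library, k=2):
        """Return the k template entries whose summaries best match."""
        return sorted(library, key=lambda e: similarity(request, e["summary"]),
                      reverse=True)[:k]

    def build_prompt(request, hits):
        """Stitch retrieved summary-template pairs into one LLM prompt."""
        context = "\n\n".join(f"// {h['summary']}\n{h['template']}" for h in hits)
        return (f"Reference templates:\n{context}\n\n"
                f"Task: write synthesizable Verilog for: {request}\n")

    library = [
        {"summary": "weight-stationary matrix-multiply unit",
         "template": "module matmul_ws (/* ports */); /* body */ endmodule"},
        {"summary": "piecewise-linear nonlinear activation unit",
         "template": "module pwl_act (/* ports */); /* body */ endmodule"},
    ]
    prompt = build_prompt("matrix-multiply instruction extension",
                          retrieve("matrix multiply extension", library))
    # rtl = llm_generate(prompt)   # hypothetical call to the chosen LLM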
ArchBot: A Labour-free Processor Architecture Design Framework via Experienced LLM
Presenter: Zheng Wu, Fudan University
Abstract: Processors are the cornerstone of computing systems, yet their design process remains highly complex and labor-intensive. Current automated architecture design methods often rely on pre-configured models, limiting flexibility and requiring extensive manual effort. Additional challenges, such as capturing architectural knowledge, optimizing parameters within constrained simulation budgets, and managing complex decision-making processes, further complicate the task. To address these issues, we propose ArchBot, a labor-free processor architecture design framework leveraging large language models (LLMs) and reinforcement learning (RL). ArchBot integrates a curated RISC-V architecture knowledge base, RL-driven microarchitecture exploration, and a machine-learning-based fast simulation model to accelerate optimization. It also uses LLMs to automate design-requirement analysis and task decomposition, generating gem5 code as output. Experiments demonstrate that ArchBot achieves a 95% success rate in meeting design requirements, surpassing existing frameworks and offering a novel solution for automated processor design.
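The exploration loop the abstract outlines can be pictured with a short Python sketch: a cheap surrogate scores candidate microarchitecture configurations so that many more points can be tried than full gem5 runs would allow. The knob names, the linear surrogate, and the epsilon-greedy search below are illustrative simplifications of ArchBot's RL and ML components, not its actual algorithms.

    import random

    # Hypothetical microarchitecture knobs; the real framework explores a
    # far larger gem5 parameter space.
    SPACE = {"rob_entries": [64, 128, 192],
             "issue_width": [2, 4, 8],
             "l1d_kb": [32, 64]}

    def surrogate_ipc(cfg):
        """Stand-in for the ML fast-simulation model: cheap IPC estimate."""
        return (0.004 * cfg["rob_entries"] + 0.08 * cfg["issue_width"]
                + 0.002 * cfg["l1d_kb"] + random.gauss(0, 0.01))

    def random_cfg():
        return {k: random.choice(v) for k, v in SPACE.items()}

    def explore(steps=200, eps=0.2):
        """Epsilon-greedy search over the space, scored by the surrogate."""
        best = random_cfg()
        best_score = surrogate_ipc(best)
        for _ in range(steps):
            if random.random() < eps:
                cand = random_cfg()              # explore a fresh point
            else:
                cand = dict(best)                # exploit: mutate one knob
                knob = random.choice(list(SPACE))
                cand[knob] = random.choice(SPACE[knob])
            score = surrogate_ipc(cand)
            if score > best_score:
                best, best_score = cand, score
        return best, best_score

    best, score = explore()
    print(best, round(score, 3))
    # The winning configuration would then be emitted as gem5 code and
    # validated with a full simulation run.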