Joint Hardware-Workload Co-Optimization for In-Memory Computing Accelerators

A novel joint hardware-workload co-optimization framework for in-memory computing accelerators uses evolutionary algorithms to design generalized hardware for multiple neural network workloads. The approach achieves Energy-Delay-Area Product reductions of 76.2% for 4 workloads and 95.5% for 9 workloads compared to baseline designs, bridging the gap between specialized and generalized IMC platforms. The framework has been validated across both RRAM- and SRAM-based architectures and enables single platforms to efficiently support diverse AI applications.

The development of a novel software-hardware co-design framework for in-memory computing (IMC) accelerators marks a significant shift from single-workload optimization toward generalized, multi-model hardware platforms. This approach directly addresses a critical bottleneck in AI hardware deployment, where the need for flexible, efficient systems that can run diverse neural networks is paramount for real-world applications from edge devices to data centers.

Key Takeaways

  • A new joint hardware-workload co-optimization framework uses an evolutionary algorithm to design generalized IMC accelerators for multiple neural network workloads, not just one.
  • The method significantly bridges the performance gap between specialized and generalized designs, achieving energy-delay-area product (EDAP) reductions of up to 76.2% (4 workloads) and 95.5% (9 workloads) compared to baselines.
  • The framework is robust across different IMC technologies, with validation on both RRAM- and SRAM-based architectures.
  • By capturing cross-workload trade-offs, it enables a single IMC platform to efficiently support varied applications, a key requirement for practical deployment.
  • The source code is publicly available on GitHub, promoting reproducibility and further research in hardware-software co-design.

A Framework for Generalized In-Memory Computing Accelerators

Current optimization frameworks for in-memory computing hardware typically focus on maximizing performance for a single, specific neural network workload. This results in highly specialized accelerator designs that suffer from poor generalization, creating inefficiency and cost barriers when deploying multiple AI models on the same hardware platform. The proposed research tackles this fundamental limitation head-on.

The core innovation is a joint hardware-workload co-optimization framework built upon an optimized evolutionary algorithm. Instead of tuning an architecture for one model, the algorithm explicitly explores and captures the trade-offs across a suite of target workloads. This allows it to search the design space for IMC accelerator configurations—considering factors like memory array size, dataflow, and peripheral circuitry—that deliver the best aggregate performance across all required applications.
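The search loop can be pictured with a minimal sketch. The paper's actual simulator, genome encoding, and evolutionary operators are not specified here, so every name below (`HwConfig`, its fields, `evaluate_edap`, the mutation choices) is a hypothetical stand-in with a toy cost model, not the framework's real API.

```python
# Minimal sketch of a multi-workload evolutionary hardware search.
# All configuration fields and the cost model are illustrative assumptions;
# a real framework would query an IMC simulator for energy/delay/area.
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class HwConfig:
    array_size: int  # rows/cols of each IMC crossbar (assumed knob)
    num_tiles: int   # number of IMC tiles on chip (assumed knob)
    adc_bits: int    # ADC resolution in the peripheral circuitry (assumed knob)

def evaluate_edap(cfg: HwConfig, workload: str) -> float:
    """Placeholder cost model: Energy * Delay * Area for one workload
    on one hardware configuration. Purely a toy stand-in."""
    energy = cfg.adc_bits * cfg.num_tiles * (1.0 + len(workload) % 5)
    delay = 1e4 / (cfg.array_size * cfg.num_tiles)
    area = cfg.array_size ** 2 * cfg.num_tiles * 1e-6
    return energy * delay * area

def fitness(cfg: HwConfig, workloads: list[str]) -> float:
    # Aggregate objective: summing per-workload EDAP pushes the search
    # toward configurations that are good for *all* target workloads.
    return sum(evaluate_edap(cfg, w) for w in workloads)

def mutate(cfg: HwConfig) -> HwConfig:
    return HwConfig(
        array_size=random.choice([64, 128, 256, 512]),
        num_tiles=max(1, cfg.num_tiles + random.choice([-2, -1, 1, 2])),
        adc_bits=random.choice([4, 5, 6, 7, 8]),
    )

def evolve(workloads: list[str], pop_size: int = 20, generations: int = 50) -> HwConfig:
    pop = [mutate(HwConfig(128, 8, 6)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, workloads))  # lower EDAP is better
        survivors = pop[: pop_size // 2]               # truncation selection
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return min(pop, key=lambda c: fitness(c, workloads))

best = evolve(["resnet18", "mobilenet", "bert", "vit"])
print(best)
```

The key property is that the fitness function sums the cost over the whole workload suite, so selection pressure favors configurations that generalize rather than ones that excel on a single model, which is the cross-workload trade-off the paper's algorithm is designed to capture.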

The framework's effectiveness is quantified using the Energy-Delay-Area Product (EDAP), a holistic metric that balances efficiency, speed, and silicon footprint. When optimizing across a set of four workloads, the framework's designs achieved an EDAP reduction of 76.2% compared to baseline generalized designs. Remarkably, when scaling to a more diverse set of nine workloads, the EDAP improvement reached 95.5%, nearly closing the gap with idealized, workload-specific hardware. The framework was validated as technology-agnostic, showing strong results for accelerators based on both non-volatile Resistive RAM (RRAM) and conventional Static RAM (SRAM).
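For concreteness, EDAP is simply the product of the three quantities, and the reported reductions are relative to a baseline design's product. The numbers below are invented purely to make the arithmetic visible (they happen to reproduce a 76.2% reduction); they are not measurements from the paper.

```python
# EDAP as a figure of merit, and how a reduction percentage is computed.
# The energy/delay/area values are made-up illustrations, not paper data.
def edap(energy_j: float, delay_s: float, area_mm2: float) -> float:
    return energy_j * delay_s * area_mm2

baseline   = edap(energy_j=2.0, delay_s=0.010, area_mm2=50.0)  # = 1.0
codesigned = edap(energy_j=1.1, delay_s=0.006, area_mm2=36.0)  # = 0.2376

reduction = 1.0 - codesigned / baseline
print(f"EDAP reduction: {reduction:.1%}")  # -> 76.2%
```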

Industry Context & Analysis

This work arrives at a pivotal moment in AI hardware, where the industry is grappling with the tension between specialization and generalization. Companies like Graphcore with their IPU and Groq with their LPU have pioneered specialized architectures for specific domains (graph neural networks, LLM inference), while others like NVIDIA continue to refine general-purpose GPUs with dedicated tensor cores. The IMC accelerator space mirrors this conflict; most academic and industrial designs are optimized for a single benchmark model, limiting real-world utility.

The proposed framework's use of an evolutionary algorithm for co-design is a sophisticated alternative to more common approaches. Unlike one-shot, gradient-based neural architecture search (NAS) methods, which often target software-only networks, this method iteratively evolves hardware parameters alongside workload considerations. It is more akin to the differentiable hardware-software co-design explored by companies like Google and Tesla for their TPU and Dojo systems, but with an explicit multi-workload objective. The reported EDAP gains of up to 95.5% are substantial, suggesting the performance penalty for generalization can be almost eliminated with intelligent design.

Technologically, the validation across both RRAM and SRAM is crucial. RRAM-based IMC is a leading candidate for next-generation, ultra-efficient accelerators due to its high density and non-volatility, championed by research consortia and startups. SRAM-based IMC, while less dense, benefits from CMOS compatibility and maturity. Showing that the co-design framework works for both proves its methodology is fundamental, not tied to an emerging or niche technology. This adaptability is its greatest strength for an industry still debating the winning IMC substrate.

What This Means Going Forward

For chip designers and AI hardware companies, this framework provides a concrete methodology to build more versatile and commercially viable accelerators. The ability to support a portfolio of models—from computer vision CNNs to transformer-based LLMs—on a single, efficient IMC chip reduces development costs, simplifies system integration, and accelerates time-to-market. This is especially valuable for edge AI and IoT applications, where hardware resources are severely constrained and models may need to be updated or swapped frequently.

The public release of the source code on GitHub will accelerate research in this niche but critical field. It provides a benchmark and a tool for others to build upon, potentially leading to more advanced co-design algorithms that incorporate real-world constraints like manufacturing variability or reliability. The next logical steps for this research direction include scaling the optimization to even larger and more diverse workload sets (e.g., 50+ models) and integrating the framework with industry-standard electronic design automation (EDA) tools for physical layout synthesis.

Looking ahead, the success of this multi-workload co-design approach will pressure the industry to move beyond single-benchmark hero numbers. Just as MLPerf has become the standard for benchmarking AI system performance across diverse tasks, hardware design competitions and evaluations may need to adopt similar multi-model suites. The ultimate winners in the AI accelerator race may not be those with the highest peak TOPS/W for one model, but those whose architectures, guided by frameworks like this one, deliver consistently high efficiency across the ever-expanding landscape of AI workloads.