Optimizing Compute Core Assignment for Dynamic Batch Inference in AI Inference Accelerator

Mar 1, 2025 · Ze-Wei Liou, D.-Y. Hong
Abstract
Modern AI inference accelerators offer high-performance and power-efficient computations for machine learning models. Most accelerators employ static inference to enhance performance, which requires models to be compiled with predetermined input batch sizes and intermediate tensor shapes. However, static inference can lead to program failures or inefficient execution when processing batched data of varying sizes, a scenario known as dynamic batch inference. This work addresses this challenge by focusing on the emerging multicore AI inference accelerators that offer flexible compute core assignment. We propose to dynamically partition the input batch data into smaller batches, and create multiple model instances to process each partition in parallel. The challenge lies in how to determine the optimal number of model instances, the proper batch size for each handling model, and the assignment of compute cores among the models, to minimize the inference time. To solve the problem, we construct an accurate profiling-based cost model and devise a dynamic programming algorithm to determine the best configuration. Experimental results indicate that our method achieves 3.05× higher throughput on average in multi-person pose estimation benchmarks, compared to the EdgeTPU-like inference strategy.
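As a rough illustration of the kind of search the abstract describes, the sketch below splits a batch across parallel model instances and assigns cores to each. It is not the authors' algorithm. It assumes the instances run concurrently, so the objective is the latency of the slowest instance, and it assumes a profiled cost function `cost(batch_size, num_cores)` is available. The names `best_partition` and `toy_cost` and the toy latency formula are hypothetical and stand in for real profiling data.

```python
import math
from functools import lru_cache
from typing import Callable, List, Tuple

Plan = List[Tuple[int, int]]  # one (batch_size, num_cores) pair per model instance


def best_partition(batch: int, cores: int,
                   cost: Callable[[int, int], float]) -> Tuple[float, Plan]:
    """Split `batch` inputs over at most `cores` cores.

    Assumption: all instances run in parallel, so the total time is the
    latency of the slowest instance (the makespan), which we minimize.
    """

    @lru_cache(maxsize=None)
    def solve(b: int, c: int):
        if b == 0:
            return 0.0, ()              # nothing left to process
        if c == 0:
            return float("inf"), ()     # inputs remain but no cores are free
        best = (float("inf"), ())
        # Choose the next instance (bi inputs on ci cores), then solve the rest.
        for bi in range(1, b + 1):
            for ci in range(1, c + 1):
                rest_time, rest_plan = solve(b - bi, c - ci)
                t = max(cost(bi, ci), rest_time)
                if t < best[0]:
                    best = (t, ((bi, ci),) + rest_plan)
        return best

    makespan, plan = solve(batch, cores)
    return makespan, list(plan)


def toy_cost(batch_size: int, num_cores: int) -> float:
    """Hypothetical latency model: fixed overhead plus sublinear core scaling."""
    return 1.0 + batch_size / math.sqrt(num_cores)


if __name__ == "__main__":
    makespan, plan = best_partition(batch=8, cores=4, cost=toy_cost)
    print(f"makespan={makespan:.2f}, plan={plan}")
```

In practice the cost function would be a lookup into measured latencies rather than a formula. The search visits every (batch, cores) state and tries every choice for the next instance, which is only practical for the small batch sizes and core counts typical of edge accelerators.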
Publication
Proceedings of the 40th ACM/SIGAPP Symposium on Applied Computing
Authors
Ze-Wei Liou
PhD student @ Princeton

I am a first-year PhD student at Princeton, advised by Prof. Tri Dao. My research focuses on ML Systems.

My official name is Ze-Wei Liou, but I also go by the name “Johnny.” I did my undergrad in EE at National Taiwan University. Feel free to reach out!