Cross-Filter Structured Pruning for Efficient Sparse CNN Acceleration
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11072696/ |
| Summary: | Convolutional Neural Networks (CNNs) are widely used in vision tasks for resource-constrained environments due to their computational efficiency and strong generalization. However, the dominance of $1 \times 1$ convolutions in modern CNN architectures introduces challenges for sparsity-aware hardware accelerators, particularly in processing element (PE) load balancing, which limits speedup in sparse inference. To address this issue, this paper proposes a cross-filter structured pruning method that enforces a uniform sparsity pattern across multiple filters to ensure balanced workload distribution among PEs. This approach is further extended to $k \times k$ convolutions by decomposing them into $1 \times 1$ filters, improving applicability across various CNN layers. This paper also proposes an intra-kernel parallelism technique that significantly reduces the size of the PEs' local buffers, a critical bottleneck in sparse CNN accelerators. Experimental results show that the proposed approach maintains accuracy comparable to globally unstructured pruning while significantly enhancing inference speed. FPGA implementation and cycle-accurate simulations confirm improvements in processing speed, energy efficiency, and hardware utilization, making this method well-suited for edge and mobile AI applications. Specifically, the proposed architecture achieves a $1.14\times$ to $1.6\times$ speedup over SparTen for various CNN models and delivers $7.6\times$ and $1.9\times$ higher energy efficiency than SparTen and StarSPA, respectively. In terms of area efficiency, synthesis results show a $1.73\times$–$10.95\times$ reduction in required hardware primitives compared to SparTen and StarSPA. |
|---|---|
| ISSN: | 2169-3536 |
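The core idea described in the summary — pruning so that every filter in a group shares one sparsity pattern, keeping PE workloads balanced — can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the paper's actual algorithm: the function name `cross_filter_prune`, the group-magnitude ranking criterion, and the weight layout (flattened $1 \times 1$ filters as rows) are all hypothetical choices for demonstration.

```python
import numpy as np

def cross_filter_prune(weights, group_size, sparsity):
    """Prune 1x1-conv weights so all filters in a group share one mask.

    weights:    (num_filters, in_channels) array; each row is one
                1x1 filter's weights across input channels.
    group_size: number of filters pruned with a common mask
                (e.g. the filters mapped to one PE group).
    sparsity:   fraction of input-channel positions zeroed per group.
    """
    pruned = weights.copy()
    num_filters, in_channels = weights.shape
    n_prune = int(in_channels * sparsity)
    for start in range(0, num_filters, group_size):
        group = pruned[start:start + group_size]
        # Rank input-channel positions by aggregate magnitude across
        # the whole group, then zero the weakest positions in every
        # filter of the group -> identical sparsity pattern, so each
        # PE in the group processes the same number of nonzeros.
        importance = np.abs(group).sum(axis=0)
        prune_idx = np.argsort(importance)[:n_prune]
        group[:, prune_idx] = 0.0
    return pruned
```

Because every filter in a group keeps nonzeros at the same channel positions, the per-PE multiply count is identical within a group, which is what removes the load-imbalance stalls that globally unstructured pruning causes.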