Beef Carcass Grading with EfficientViT: A Lightweight Vision Transformer Approach

Bibliographic Details
Main Authors: Hyunwoo Lim, Eungyeol Song
Format: Article
Language: English
Published: MDPI AG 2025-06-01
Series: Applied Sciences
Online Access: https://www.mdpi.com/2076-3417/15/11/6302
Description
Summary: Beef carcass grading plays a pivotal role in determining market value and consumer preferences. While traditional visual inspection by experts remains the industry standard, it suffers from subjectivity and inconsistencies, particularly in high-throughput slaughterhouse environments. To address these limitations, we propose a one-stage automated grading model based on EfficientViT, a lightweight vision transformer architecture. Unlike conventional two-stage methods that require prior segmentation of the loin region, our model directly predicts beef quality grades from raw RGB images, significantly simplifying the pipeline and reducing computational overhead. We evaluate the proposed model against representative convolutional neural networks (VGG-16, ResNeXt-50, DenseNet-121) as well as two-stage combinations of segmentation and classification models. Experiments were conducted on a publicly available beef carcass dataset consisting of over 77,000 labeled images. EfficientViT achieves the highest accuracy (98.46%) and F1-score (0.9867) among all evaluated models while maintaining low inference latency (3.92 ms) and compact parameter size (36.4 MB). In particular, it outperforms CNNs in predicting the top grade (1++), where global visual patterns such as marbling distribution are crucial. Furthermore, we employ Grad-CAM and attention map visualizations to analyze the model’s focus regions and demonstrate that EfficientViT captures holistic contextual features better than CNNs. The model also exhibits robustness across varying loin area proportions. Our findings suggest that EfficientViT is not only accurate but also efficient and interpretable, making it a strong candidate for real-time industrial applications in beef quality grading.
ISSN: 2076-3417
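
To make the one-stage pipeline in the summary concrete, below is a minimal sketch of a carcass-grade classifier built on an EfficientViT backbone via the timm library. The backbone variant (efficientvit_b1), the grade label list, the preprocessing, and the choice of timm are assumptions rather than details taken from this record, and the classification head would still need fine-tuning on the labeled carcass dataset before its outputs are meaningful.

# Minimal sketch (not the authors' code): one-stage grade prediction from a raw
# RGB carcass image with an EfficientViT backbone. Variant, labels, and
# preprocessing are assumptions; the head is untrained for these classes.
import torch
import timm
from PIL import Image
from timm.data import resolve_data_config, create_transform

# Hypothetical grade labels; the abstract names only the top grade, 1++.
GRADES = ["1++", "1+", "1", "2", "3"]

# EfficientViT backbone with a len(GRADES)-way classification head.
model = timm.create_model("efficientvit_b1", pretrained=True, num_classes=len(GRADES))
model.eval()

# Preprocessing that matches the backbone's pretraining configuration.
config = resolve_data_config({}, model=model)
transform = create_transform(**config)

def predict_grade(image_path: str) -> str:
    """Predict a beef quality grade directly from a raw RGB image (no segmentation step)."""
    img = Image.open(image_path).convert("RGB")
    x = transform(img).unsqueeze(0)  # shape: (1, 3, H, W)
    with torch.no_grad():
        logits = model(x)
    return GRADES[logits.argmax(dim=1).item()]

# Example usage (the path is hypothetical):
# print(predict_grade("carcass_sample.jpg"))

The same model object could also be passed to a Grad-CAM or attention-rollout tool to reproduce the kind of focus-region visualization the summary describes.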