Visual Automatic Localization Method Based on Multi-level Video Transformer
| Main Authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Editorial Department of Journal of Sichuan University (Engineering Science Edition), 2024-11-01 |
| Series: | 工程科学与技术 (Advanced Engineering Sciences) |
| Subjects: | |
| Online Access: | http://jsuese.scu.edu.cn/thesisDetails#10.12454/j.jsuese.202400072 |
Summary:

Objective: This study investigates the application of a six-axis robotic arm equipped with a high-resolution industrial camera to capture precise images of workpiece surfaces. The setup acquires a dynamic video sequence illustrating the transition of image clarity: starting blurry, reaching optimal clarity, and reverting to blurry. The primary goal is to select the clearest frame from this sequence, which is critical for determining the precise focusing distance required in automated machining processes. The industrial camera is mounted on the robotic arm, which controls the camera's downward trajectory to capture images of varying quality. As the camera descends, it records the shifting focus on the workpiece surface, from out-of-focus (blurry) to in-focus (clear) and back to out-of-focus. This fluctuation matters because blurry images can significantly impair subsequent tasks, particularly the deep learning-based intelligent recognition systems used in modern manufacturing: blurry images may cause inaccurate feature recognition, adversely affecting the quality and precision of automated operations. An effective and precise video processing methodology is employed to address these challenges. The approach applies advanced image processing techniques to the video sequences captured by the industrial camera, and its algorithms identify the frame with optimal clarity and sharpness. This frame provides critical feedback for adjusting the robotic arm, ensuring that the camera aligns precisely with the position where the focal length is accurately calibrated to the workpiece's surface. This process ensures high-quality captured images and boosts the overall efficiency of the machining process.
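The clearest-frame criterion described above can be illustrated with a classical focus measure: the variance of a Laplacian response, which is low for smooth (defocused) images and high for sharp ones. The sketch below is a minimal baseline under that assumption, not the paper's transformer-based method; the function names and the 4-neighbour Laplacian kernel are illustrative choices.

```python
def laplacian_variance(img):
    """Variance of a 4-neighbour Laplacian response over the image interior.
    A classical sharpness score: higher means sharper. `img` is a 2-D list
    of grayscale float values."""
    h, w = len(img), len(img[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (img[y - 1][x] + img[y + 1][x]
                   + img[y][x - 1] + img[y][x + 1]
                   - 4 * img[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

def clearest_frame_index(frames):
    """Return the index of the sharpest frame in a video sequence."""
    return max(range(len(frames)), key=lambda i: laplacian_variance(frames[i]))
```

A smooth intensity ramp scores zero (its Laplacian vanishes everywhere), while a high-contrast pattern scores high, so the index returned for a blurry-sharp-blurry sequence is the middle frame.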
By automating focus adjustment based on the clearest image, the system significantly reduces human error and improves output consistency, which is crucial in high-precision manufacturing environments. In addition, integrating this technology into existing industrial setups is expected to streamline operations, decrease waste, and enhance the speed and accuracy of production cycles. This study highlights the technological integration, addresses the associated challenges, and describes the substantial enhancements to automated machining processes facilitated by this approach.

Methods: This study introduces a video classification model, the Multi-level Video Transformer, designed for high-level semantic video representation learning. The model is developed to identify the clearest frame within a video sequence, a pivotal step for enhancing automated machining precision. The methodology begins with a novel token segmentation approach named Multi-level Tokenization (MLT), which divides the original video data into token sequences across four levels: 2D Patch, 3D Patch, Frame, and Clip, capturing a comprehensive range of spatial and temporal details. After token segmentation, positional encodings are applied to the tokens to preserve sequence order, which is crucial for processing time-dependent data. The tokens are then input into the newly developed Multi-level Encoder (MLE) for attention computation. At the core of the MLE are its dual attention modules, Level-wise Learnable Attention (LWLA) and Multi-level Cross Attention (MLCA), each stacked multiple times to deepen learning and integrate features more effectively.
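The four-level split of MLT can be made concrete by counting how many tokens each level contributes for a given clip size. The patch size and temporal extent below are illustrative assumptions, not values reported in the paper.

```python
def mlt_token_counts(T, H, W, p=16, t=2):
    """Token counts for a hypothetical Multi-level Tokenization (MLT) of a
    T x H x W video. Patch size `p` and temporal extent `t` are assumed
    values for illustration:
      2D Patch: one token per p x p spatial patch in every frame
      3D Patch: one token per t x p x p spatio-temporal cube
      Frame:    one token per frame
      Clip:     one token for the whole clip
    """
    return {
        "2D Patch": T * (H // p) * (W // p),
        "3D Patch": (T // t) * (H // p) * (W // p),
        "Frame": T,
        "Clip": 1,
    }
```

For a 16-frame 224x224 clip this gives 3136 2D-patch tokens, 1568 3D-patch tokens, 16 frame tokens, and 1 clip token, showing how the levels range from fine spatial detail to a single global descriptor.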
LWLA employs a deformable attention mechanism, a replacement for global attention that computes feature similarity more flexibly and efficiently, reducing computational cost and mitigating the slow convergence commonly associated with traditional models. In contrast, MLCA transcends layer boundaries by performing global attention over the entire token sequence, fostering a deeper integration of features at all levels. This integration is further enhanced by classification tokens incorporated in the MLCA layer, which develop concurrently with the global tokens and are maintained throughout all processing stages. After multiple passes through the MLEs, these tokens are fed into a multi-layer perceptron for the final classification prediction, yielding semantic-level video classification. The model is validated empirically on a unique dataset gathered on-site with a robotic arm carrying an industrial camera. The operational setup positions the camera vertically, facing downward above metal samples. The data collection protocol is carefully designed: the camera starts at a height where the sample appears blurry because the distance exceeds the focal length, is gradually lowered until the clearest possible image is captured, and then continues downward to capture the re-emergence of blurriness. This movement generates a video sequence of "blurry - gradually clear - clearest - gradually blurry - blurry" images. From this sequence, the clearest image is selected and its index is recorded as the ground-truth label, producing a dataset for training and testing the proposed model.
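The role of the classification token in the cross-level attention step can be sketched with plain scaled dot-product attention: the token attends once over the full multi-level token sequence and is updated with the weighted sum. This is a deliberately simplified stand-in for MLCA (the real module uses learned projections and stacked layers); all names here are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cls_global_attention(cls_tok, tokens):
    """One global attention step for a classification token over the whole
    multi-level token sequence (a scaled dot-product sketch of the kind of
    cross-level aggregation MLCA performs).
    cls_tok: list[float]; tokens: list of token vectors from all four levels."""
    d = len(cls_tok)
    scores = [sum(c * t for c, t in zip(cls_tok, tok)) / math.sqrt(d)
              for tok in tokens]
    weights = softmax(scores)
    # The classification token becomes a weighted sum of all tokens.
    return [sum(w * tok[i] for w, tok in zip(weights, tokens))
            for i in range(d)]
```

Because the weights are a softmax over similarities, tokens most aligned with the classification token dominate its update, which is how a single token can summarize the whole sequence for the final MLP classifier.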
This comprehensive approach ensures precise frame selection and significantly contributes to the reliability and efficiency of automated processes that rely on accurate visual data.

Results and Discussions: The empirical assessment of the Multi-level Video Transformer demonstrated encouraging outcomes across its three variants, which achieved classification accuracies of 87.2%, 88.6%, and 88.9% on a custom video dataset. These results mark substantial progress in video processing for precision tasks. Compared to mainstream video transformers of comparable parameter sizes, the proposed variants display superior performance. This success highlights the effectiveness of the specialized approach in managing the complexity of selecting the clearest frame from video sequences. The refinement and precision inherent in the proposed models facilitate identification of the sharpest frames and reduce potential errors in subsequent automated tasks that depend critically on image clarity. By attaining such high classification accuracies, the Multi-level Video Transformer affirms its robustness and reliability, establishing a new benchmark for video classification in industrial applications. In addition, these results offer compelling evidence that the proposed methodological innovations, such as MLT, the deformable attention mechanism, and cross-level attention integration, significantly enhance model performance. These advancements are particularly advantageous for tasks requiring great detail and precision in frame selection, which are crucial in many industrial and manufacturing environments.

Conclusions: The Multi-level Video Transformer thus meets and surpasses current industry standards for video classification, marking a significant advance over existing technologies.
This model's success lays the groundwork for more nuanced and effective automated systems capable of operating with heightened accuracy and reduced human intervention. It will be particularly transformative in sectors where precision is critical, laying a robust foundation for further research and development in intelligent automation and machine-learning applications for visual data processing.
| ISSN: | 2096-3246 |
|---|---|