Optimizing the performance of a server-based classification for a large business document flow

The document categorization problem in the case of a large business document flow is considered. Textual and visual embeddings were employed for classification. Textual embeddings were extracted via OCR Tesseract. The Viola and Jones method was applied to generate visual embeddings. This paper descr...

Bibliographic Details
Main Author:	O. A. Slavin
Format:	Article
Language:	English
Published:	Belarusian National Technical University 2023-02-01
Series:	Системный анализ и прикладная информатика
Subjects:	text analysis; document recognition; document classification; speedup
Online Access:	https://sapi.bntu.by/jour/article/view/595

_version_	1832543651739205632
author	O. A. Slavin
author_facet	O. A. Slavin
author_sort	O. A. Slavin
collection	DOAJ
description	The document categorization problem in the case of a large business document flow is considered. Textual and visual embeddings were employed for classification. Textual embeddings were extracted via OCR Tesseract. The Viola and Jones method was applied to generate visual embeddings. This paper describes the performance optimization technology for the implemented classification algorithm. Servers with Intel CPUs were used for the algorithm execution. For single-threaded implementation, high-level and low-level optimizations were performed. High-level optimization was based on the parametrization of the recognition algorithms and the employment of intermediate data. Low-level optimization was carried out via compiler tools allowing for an extended set of SIMD instructions. The implementation of parallelization with several multithreaded applications on multiple servers is also described. The proposed solution was tested on the author's own test data sets of business documents. The proposed method can be applied in modern information systems to analyze the content of a large flow of digital document images.
format	Article
id	doaj-art-922c6c5eadff4cc58ffd0029da62d89d
institution	Kabale University
issn	2309-4923 2414-0481
language	English
publishDate	2023-02-01
publisher	Belarusian National Technical University
record_format	Article
series	Системный анализ и прикладная информатика
spelling	doaj-art-922c6c5eadff4cc58ffd0029da62d89d | 2025-02-03T11:37:40Z | eng | Belarusian National Technical University | Системный анализ и прикладная информатика | 2309-4923 | 2414-0481 | 2023-02-01 | 0 | 4 | 60 | 64 | 10.21122/2309-4923-2022-4-60-64 | 444 | Optimizing the performance of a server-based classification for a large business document flow | O. A. Slavin | 0 | Federal Research Center “Informatics and Management” of the Russian Academy of Sciences; Smart Engines Service LLC | The document categorization problem in the case of a large business document flow is considered. Textual and visual embeddings were employed for classification. Textual embeddings were extracted via OCR Tesseract. The Viola and Jones method was applied to generate visual embeddings. This paper describes the performance optimization technology for the implemented classification algorithm. Servers with Intel CPUs were used for the algorithm execution. For single-threaded implementation, high-level and low-level optimizations were performed. High-level optimization was based on the parametrization of the recognition algorithms and the employment of intermediate data. Low-level optimization was carried out via compiler tools allowing for an extended set of SIMD instructions. The implementation of parallelization with several multithreaded applications on multiple servers is also described. The proposed solution was tested on the author's own test data sets of business documents. The proposed method can be applied in modern information systems to analyze the content of a large flow of digital document images. | https://sapi.bntu.by/jour/article/view/595 | text analysis | document recognition | document classification | speedup
spellingShingle	O. A. Slavin | Optimizing the performance of a server-based classification for a large business document flow | Системный анализ и прикладная информатика | text analysis | document recognition | document classification | speedup
title	Optimizing the performance of a server-based classification for a large business document flow
title_full	Optimizing the performance of a server-based classification for a large business document flow
title_fullStr	Optimizing the performance of a server-based classification for a large business document flow
title_full_unstemmed	Optimizing the performance of a server-based classification for a large business document flow
title_short	Optimizing the performance of a server-based classification for a large business document flow
title_sort	optimizing the performance of a server based classification for a large business document flow
topic	text analysis; document recognition; document classification; speedup
url	https://sapi.bntu.by/jour/article/view/595
work_keys_str_mv	AT oaslavin optimizingtheperformanceofaserverbasedclassificationforalargebusinessdocumentflow

Similar Items