A Bag-of-Words Approach for Information Extraction from Electricity Invoices

Bibliographic Details
Main Authors: Javier Sánchez, Giovanny A. Cuervo-Londoño
Format: Article
Language: English
Published: MDPI AG 2024-10-01
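Series: AI
Subjects: electricity invoice; information extraction; semi-structured document; machine learning; support vector machine
Online Access: https://www.mdpi.com/2673-2688/5/4/91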
_version_ 1850049963867766784
author Javier Sánchez
Giovanny A. Cuervo-Londoño
author_facet Javier Sánchez
Giovanny A. Cuervo-Londoño
author_sort Javier Sánchez
collection DOAJ
description In the context of digitization and automation, extracting relevant information from business documents remains a significant challenge. Machine-learning techniques are typically used to automate the process, reduce manual labor, and minimize errors. This work introduces a new model for extracting key values from electricity invoices, including customer data, bill breakdown, electricity consumption, and marketer data. We evaluate several machine-learning techniques, such as Naive Bayes, Logistic Regression, Random Forests, and Support Vector Machines. Our approach relies on a bag-of-words strategy and custom-designed features tailored for electricity data. We validate our method on the IDSEM dataset, which includes 75,000 electricity invoices with eighty-six fields. The model converts PDF invoices into text and processes each word separately using a context of eleven words. Our experiments indicate that Support Vector Machines and Random Forests capture numerous values with high precision. The study also explores the advantages of our custom features and evaluates performance on unseen documents. The precision obtained with Support Vector Machines is 91.86% on average, peaking at 98.47% for one document template. These results demonstrate the effectiveness of our method in accurately extracting key values from invoices.
format Article
id doaj-art-9d5488d02be14b0aaca361a5c511a5ec
institution DOAJ
issn 2673-2688
language English
publishDate 2024-10-01
publisher MDPI AG
record_format Article
series AI
spelling doaj-art-9d5488d02be14b0aaca361a5c511a5ec
2025-08-20T02:53:34Z
eng
MDPI AG
AI
2673-2688
2024-10-01
volume 5, issue 4, pages 1837-1857
10.3390/ai5040091
A Bag-of-Words Approach for Information Extraction from Electricity Invoices
Javier Sánchez (0)
Giovanny A. Cuervo-Londoño (1)
(0) Centro de Tecnologías de la Imagen (CTIM), Instituto Universitario de Cibernética, Empresas y Sociedad (IUCES), 3507 Las Palmas de Gran Canaria, Spain
(1) Centro de Tecnologías de la Imagen (CTIM), Instituto Universitario de Cibernética, Empresas y Sociedad (IUCES), 3507 Las Palmas de Gran Canaria, Spain
https://www.mdpi.com/2673-2688/5/4/91
electricity invoice
information extraction
semi-structured document
machine learning
support vector machine
title A Bag-of-Words Approach for Information Extraction from Electricity Invoices
title_sort bag of words approach for information extraction from electricity invoices
topic electricity invoice
information extraction
semi-structured document
machine learning
support vector machine
url https://www.mdpi.com/2673-2688/5/4/91
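
The description states that the model processes each word separately using a context of eleven words and a bag-of-words strategy. The following is a minimal sketch of what that feature step could look like, not the authors' implementation. It assumes the window is centered on the target word (five words on each side), that text is split on whitespace, and that positions beyond the text are filled with a hypothetical `<pad>` token; the names `context_windows` and `bag_of_words` and the sample sentence are illustrative only. The classification stage (for example a Support Vector Machine) and the custom electricity features are omitted.

```python
from collections import Counter

PAD = "<pad>"  # hypothetical padding token for positions beyond the text


def context_windows(words, size=11):
    """Yield (target, window) pairs; each window holds `size` words
    centered on the target (an assumption, not confirmed by the record)."""
    if size % 2 == 0:
        raise ValueError("size must be odd so the target sits in the center")
    half = size // 2
    padded = [PAD] * half + list(words) + [PAD] * half
    for i, target in enumerate(words):
        yield target, padded[i:i + size]


def bag_of_words(window, vocabulary):
    """Count how often each vocabulary term occurs in the window,
    ignoring word order. Terms outside the vocabulary are dropped."""
    counts = Counter(w.lower() for w in window)
    return [counts[term] for term in vocabulary]


# Illustrative invoice-like sentence (made up for this sketch).
text = "Total amount due 45.30 EUR for consumption of 120 kWh in March"
words = text.split()
vocabulary = sorted({w.lower() for w in words})

# One feature vector per word; a classifier would map each vector to a field label.
features = [
    (target, bag_of_words(window, vocabulary))
    for target, window in context_windows(words)
]
```

Each entry of `features` pairs a word with the term counts of its surrounding window, so words near the start or end of the text see fewer real neighbors than words in the middle.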