Intelligent and Adaptive Web Data Extraction System Using Convolutional and Long Short-Term Memory Deep Learning Networks

Bibliographic Details
Main Authors: Sudhir Kumar Patnaik, C. Narendra Babu, Mukul Bhave
Format: Article
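Language: English
Published: Tsinghua University Press 2021-12-01
Series: Big Data Mining and Analytics
Subjects: adaptive web scraping; deep learning; long short-term memory (LSTM); web data extraction; you only look once (YOLO)
Online Access: https://www.sciopen.com/article/10.26599/BDMA.2021.9020012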
author Sudhir Kumar Patnaik
C. Narendra Babu
Mukul Bhave
collection DOAJ
description Data are crucial to the growth of e-commerce in today’s world of highly demanding, hyper-personalized consumer experiences, and these data are collected using advanced web scraping technologies. However, core data extraction engines fail because they cannot adapt to dynamic changes in website content. This study investigates an intelligent and adaptive web data extraction system built on convolutional and Long Short-Term Memory (LSTM) networks. The system uses the You Only Look Once (YOLO) algorithm for automated web page detection and Tesseract LSTM to extract product details, which are detected as images from web pages. Because this system does not need a core data extraction engine, it can adapt to dynamic changes in website layout. Experiments conducted on real-world retail cases demonstrate an image detection precision of 97% and a character extraction precision of 99%. In addition, a mean average precision of 74% is obtained with an input dataset of 45 objects or images.
format Article
id doaj-art-a826c864be2d4aa081bc5069b7b2b972
institution Kabale University
issn 2096-0654
language English
publishDate 2021-12-01
publisher Tsinghua University Press
record_format Article
series Big Data Mining and Analytics
spelling doaj-art-a826c864be2d4aa081bc5069b7b2b972 (record updated 2025-02-02T06:14:04Z)
Citation: Big Data Mining and Analytics (Tsinghua University Press, ISSN 2096-0654), vol. 4, no. 4, pp. 279-297, 2021-12-01, DOI 10.26599/BDMA.2021.9020012
Affiliations:
Sudhir Kumar Patnaik: Department of Computer Science and Engineering, M. S. Ramaiah University of Applied Sciences, Bangalore 560054, India
C. Narendra Babu: Department of Computer Science and Engineering, M. S. Ramaiah University of Applied Sciences, Bangalore 560054, India
Mukul Bhave: Gibraltar India Solutions LLP, Bangalore 560103, India
title Intelligent and Adaptive Web Data Extraction System Using Convolutional and Long Short-Term Memory Deep Learning Networks
topic adaptive web scraping
deep learning
long short-term memory (lstm)
web data extraction
you only look once (yolo)
url https://www.sciopen.com/article/10.26599/BDMA.2021.9020012
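The description above reports detection precision and a mean average precision (mAP). As a minimal sketch of how such figures are commonly computed, the Python below uses non-interpolated average precision over a confidence-ranked list of detections. This record does not state the evaluation protocol the authors used, so the function names, the class labels, and the sample hit lists are all hypothetical illustrations, not the paper's method or data.

```python
from typing import Dict, List, Tuple


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of reported detections that are correct."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0


def average_precision(ranked_hits: List[bool], num_ground_truth: int) -> float:
    """Non-interpolated AP for one class.

    ranked_hits lists detections in descending confidence order; True marks
    a detection that matches a ground-truth object. AP sums the precision
    at each rank where a hit occurs and divides by the number of
    ground-truth objects.
    """
    if num_ground_truth == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_hit in enumerate(ranked_hits, start=1):
        if is_hit:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_ground_truth


def mean_average_precision(per_class: Dict[str, Tuple[List[bool], int]]) -> float:
    """Mean of the per-class AP values."""
    if not per_class:
        return 0.0
    total = sum(average_precision(hits, n) for hits, n in per_class.values())
    return total / len(per_class)


if __name__ == "__main__":
    # Hypothetical data: two object classes on a product page.
    sample = {
        "price": ([True, True, False, True], 4),
        "title": ([True, False], 1),
    }
    print(precision(97, 3))
    print(mean_average_precision(sample))
```

Object-detection benchmarks often use an interpolated variant of average precision with an intersection-over-union threshold to decide what counts as a hit; the simpler form here is only meant to show how a per-class ranking turns into a single mAP number.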