Information extraction from massive Web pages based on node property and text content

To address the problem of extracting valuable information from massive Web pages in big data environments, a novel information extraction method based on node property and text content for massive Web pages was put forward. Web pages were converted into a document object model (DOM) tree, and a pruning...
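
The DOM conversion and pruning step named in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the record gives no definitions, so the noise-tag list, the rule "drop subtrees with no text", and the density score (characters of text per tag in a subtree) are all assumptions made for illustration, and every name below (`Node`, `TreeBuilder`, `prune`, `text_density`, `build_tree`) is hypothetical. The fusion step, the vision property, and the MapReduce stage are left out.

```python
from html.parser import HTMLParser

# Tags treated as noise in this sketch (an assumption, not the paper's list).
NOISE_TAGS = {"script", "style", "noscript", "iframe"}
# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a simplified DOM tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text that sits directly inside this element

    def all_text(self):
        return self.text + "".join(c.all_text() for c in self.children)

    def tag_count(self):
        return 1 + sum(c.tag_count() for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def prune(node):
    """Drop noise tags and any subtree that carries no text (assumed rule)."""
    kept = []
    for child in node.children:
        if child.tag in NOISE_TAGS:
            continue
        prune(child)
        if child.all_text():
            kept.append(child)
    node.children = kept
    return node


def text_density(node):
    """Characters of text per tag in the subtree (hypothetical definition)."""
    return len(node.all_text()) / node.tag_count()


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return prune(builder.root)


if __name__ == "__main__":
    page = ("<html><head><script>var x = 1;</script></head><body>"
            "<div><a href='#'>Home</a><a href='#'>About</a></div>"
            "<div><p>A long paragraph of article text goes here.</p></div>"
            "</body></html>")
    body = build_tree(page).children[0].children[0]
    for block in body.children:
        print(block.tag, round(text_density(block), 2), block.all_text())
```

Under this assumed score, a block of running text ranks above a link-heavy navigation block, which is the kind of signal a density-based node property is meant to provide.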

Bibliographic Details
Main Authors: Hai-yan WANG, Pan CAO
Format: Article
Language: Chinese (zho)
Published: Editorial Department of Journal on Communications 2016-10-01
Series: Tongxin xuebao
Subjects:
Online Access: http://www.joconline.com.cn/zh/article/doi/10.11959/j.issn.1000-436x.2016190/
collection DOAJ
description To address the problem of extracting valuable information from massive Web pages in big data environments, an information extraction method based on node properties and text content was proposed. Each Web page was converted into a document object model (DOM) tree, and a pruning and fusion algorithm was introduced to simplify the tree. For each node in the DOM tree, a density property and a vision property were defined, and Web pages were preprocessed based on these property values. A MapReduce framework was employed to extract information from massive Web pages in parallel. Simulation and experimental results demonstrate that the proposed method achieves better performance and higher scalability than other methods.
id doaj-art-e45161a5acca4d059c30ac4b42ef7522
institution DOAJ
issn 1000-436X
topic Web information
extraction
MapReduce
DOM tree