Information extraction from massive Web pages based on node property and text content

To address the problem of extracting valuable information from massive Web pages in big data environments, a novel information extraction method based on node property and text content for massive Web pages was put forward. Web pages were converted into a document object model (DOM) tree, and a pruning...

Bibliographic Details
Main Authors:	Hai-yan WANG, Pan CAO
Format:	Article
Language:	zho
Published:	Editorial Department of Journal on Communications 2016-10-01
Series:	Tongxin xuebao
Subjects:	Web information extraction; MapReduce; DOM tree
Online Access:	http://www.joconline.com.cn/zh/article/doi/10.11959/j.issn.1000-436x.2016190/

_version_	1850096352046874624
author	Hai-yan WANG Pan CAO
author_facet	Hai-yan WANG Pan CAO
author_sort	Hai-yan WANG
collection	DOAJ
description	To extract valuable information from massive numbers of Web pages in big data environments, a novel information extraction method based on node properties and text content is proposed. Web pages are converted into document object model (DOM) trees, and a pruning and fusion algorithm is introduced to simplify each DOM tree. For each node in the DOM tree, a density property and a vision property are defined, and Web pages are preprocessed based on these property values. A MapReduce framework is employed to extract information from massive numbers of Web pages in parallel. Simulation and experimental results demonstrate that the proposed method achieves better performance and higher scalability than other methods.
format	Article
id	doaj-art-e45161a5acca4d059c30ac4b42ef7522
institution	DOAJ
issn	1000-436X
language	zho
publishDate	2016-10-01
publisher	Editorial Department of Journal on Communications
record_format	Article
series	Tongxin xuebao
spelling	doaj-art-e45161a5acca4d059c30ac4b42ef7522 2025-08-20T02:41:15Z zho Editorial Department of Journal on Communications Tongxin xuebao 1000-436X 2016-10-01 3791759703782 Information extraction from massive Web pages based on node property and text content Hai-yan WANG Pan CAO To extract valuable information from massive numbers of Web pages in big data environments, a novel information extraction method based on node properties and text content is proposed. Web pages are converted into document object model (DOM) trees, and a pruning and fusion algorithm is introduced to simplify each DOM tree. For each node in the DOM tree, a density property and a vision property are defined, and Web pages are preprocessed based on these property values. A MapReduce framework is employed to extract information from massive numbers of Web pages in parallel. Simulation and experimental results demonstrate that the proposed method achieves better performance and higher scalability than other methods. http://www.joconline.com.cn/zh/article/doi/10.11959/j.issn.1000-436x.2016190/ Web information extraction MapReduce DOM tree
spellingShingle	Hai-yan WANG Pan CAO Information extraction from massive Web pages based on node property and text content Tongxin xuebao Web information extraction MapReduce DOM tree
title	Information extraction from massive Web pages based on node property and text content
title_full	Information extraction from massive Web pages based on node property and text content
title_fullStr	Information extraction from massive Web pages based on node property and text content
title_full_unstemmed	Information extraction from massive Web pages based on node property and text content
title_short	Information extraction from massive Web pages based on node property and text content
title_sort	information extraction from massive web pages based on node property and text content
topic	Web information extraction; MapReduce; DOM tree
url	http://www.joconline.com.cn/zh/article/doi/10.11959/j.issn.1000-436x.2016190/
work_keys_str_mv	AT haiyanwang informationextractionfrommassivewebpagesbasedonnodepropertyandtextcontent AT pancao informationextractionfrommassivewebpagesbasedonnodepropertyandtextcontent
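
The description above outlines a pipeline: build a DOM tree, prune it, score nodes by a density property, and extract in parallel. The record does not define the density or vision properties, so the following is only an illustrative sketch under stated assumptions, not the authors' algorithm. It assumes density means text characters per tag in a subtree, omits the vision property and the fusion step entirely, and stands in for the MapReduce map phase with an ordinary per-page function. All names (Node, TreeBuilder, prune, extract_main_text, extract_all) are hypothetical.

```python
from html.parser import HTMLParser

# Tags with no closing tag, and tags whose content is never page text.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link", "wbr"}
SKIP_TAGS = {"script", "style"}


class Node:
    """One element of a simplified DOM tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

    def full_text(self):
        parts = [self.text] + [c.full_text() for c in self.children]
        return " ".join(p for p in parts if p)

    def text_length(self):
        return len(self.text) + sum(c.text_length() for c in self.children)

    def tag_count(self):
        return 1 + sum(c.tag_count() for c in self.children)

    def density(self):
        # ASSUMPTION: density = text characters per tag in the subtree.
        return self.text_length() / self.tag_count()


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest matching open tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        stripped = data.strip()
        if stripped and self.current.tag not in SKIP_TAGS:
            self.current.text = (self.current.text + " " + stripped).strip()


def prune(node):
    """Drop script/style subtrees and subtrees that carry no text."""
    node.children = [c for c in node.children if c.tag not in SKIP_TAGS]
    for child in node.children:
        prune(child)
    node.children = [c for c in node.children if c.text_length() > 0]


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def extract_main_text(html):
    """Return the text of the densest node, or "" if the page has no text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    candidates = [n for n in iter_nodes(builder.root) if n is not builder.root]
    if not candidates:
        return ""
    return max(candidates, key=Node.density).full_text()


def extract_all(pages):
    """Sequential stand-in for a map phase: page id -> extracted text."""
    return {page_id: extract_main_text(html) for page_id, html in pages.items()}
```

Because each page is processed independently, a function like extract_main_text could in principle serve as the per-record map step in a parallel framework, which is presumably what makes the approach a fit for MapReduce.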

Similar Items