Information extraction from massive Web pages based on node property and text content
To address the problem of extracting valuable information from massive numbers of Web pages in big data environments, a novel information extraction method based on node properties and text content was proposed. Web pages were converted into a document object model (DOM) tree, and a pruning...
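The pipeline summarized in this record's description (DOM conversion, pruning, a per-node density property) can be illustrated with a minimal sketch. It is not the authors' algorithm: the record does not define the density property, so the code assumes a common stand-in, characters of text per tag in a subtree. The names `Node`, `TreeBuilder`, `prune`, `densest`, and `extract` are hypothetical, and the vision property and the fusion step are left out.

```python
from html.parser import HTMLParser

# Assumption: these tags carry no reader-visible content and are pruned.
SKIPPED_TAGS = {"script", "style", "noscript"}
# Void elements never receive a closing tag, so they are not tracked as nodes.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""  # text placed directly inside this node

    def text_length(self):
        return len(self.text) + sum(c.text_length() for c in self.children)

    def tag_count(self):
        return 1 + sum(c.tag_count() for c in self.children)

    def density(self):
        # Assumed definition: characters of text per tag in the subtree.
        return self.text_length() / self.tag_count()


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree from HTML markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Walk up to the nearest open element with this tag, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def prune(node):
    """Drop skipped tags and any subtree that contains no text."""
    kept = []
    for child in node.children:
        if child.tag in SKIPPED_TAGS:
            continue
        prune(child)
        if child.text_length() > 0:
            kept.append(child)
    node.children = kept


def densest(node):
    """Return the node with the highest density in this subtree."""
    best = node
    for child in node.children:
        candidate = densest(child)
        if candidate.density() > best.density():
            best = candidate
    return best


def extract(html):
    builder = TreeBuilder()
    builder.feed(html)
    prune(builder.root)
    return densest(builder.root)


SAMPLE = (
    "<html><head><script>var x = 1;</script></head><body>"
    "<div><a>Home</a><a>News</a></div>"
    "<div><p>A long paragraph of article text lives here.</p></div>"
    "</body></html>"
)
best = extract(SAMPLE)
print(best.tag, best.text)
```

Because `extract` handles each page independently, a function of this shape could serve as the map step of a MapReduce job, which is one plausible reading of the parallel extraction the description mentions.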
| Main Authors: | Hai-yan WANG, Pan CAO |
|---|---|
| Format: | Article |
| Language: | Chinese (zho) |
| Published: | Editorial Department of Journal on Communications, 2016-10-01 |
| Series: | Tongxin xuebao |
| Subjects: | |
| Online Access: | http://www.joconline.com.cn/zh/article/doi/10.11959/j.issn.1000-436x.2016190/ |
| _version_ | 1850096352046874624 |
|---|---|
| author | Hai-yan WANG Pan CAO |
| collection | DOAJ |
| description | To address the problem of extracting valuable information from massive numbers of Web pages in big data environments, a novel information extraction method based on node properties and text content was proposed. Web pages were converted into a document object model (DOM) tree, and a pruning and fusion algorithm was introduced to simplify the DOM tree. For each node in the DOM tree, a density property and a vision property were defined, and Web pages were preprocessed based on these property values. A MapReduce framework was employed to extract information from massive numbers of Web pages in parallel. Simulation and experimental results demonstrate that the proposed extraction method achieves better performance and higher scalability than other methods. |
| format | Article |
| id | doaj-art-e45161a5acca4d059c30ac4b42ef7522 |
| institution | DOAJ |
| issn | 1000-436X |
| language | zho |
| publishDate | 2016-10-01 |
| publisher | Editorial Department of Journal on Communications |
| record_format | Article |
| series | Tongxin xuebao |
| title | Information extraction from massive Web pages based on node property and text content |
| topic | Web information extraction MapReduce DOM tree |
| url | http://www.joconline.com.cn/zh/article/doi/10.11959/j.issn.1000-436x.2016190/ |