Providing Web Archive News Articles as Corpus Data
Main Authors: | , |
---|---|
Format: | Article |
Language: | English |
Published: | Ubiquity Press, 2025-01-01 |
Series: | Journal of Open Humanities Data |
Subjects: | |
Online Access: | https://account.openhumanitiesdata.metajnl.com/index.php/up-j-johd/article/view/281 |
Summary: | While the huge data repositories of web archives hold great potential for knowledge production in academia, researchers have described significant challenges when trying to access and make use of web archives in research. This article describes the creation of a “Web News Collection” where content from the National Library of Norway’s web archive has been made available for computational text analysis, in a manner that facilitates access for research and beyond, aligning with FAIR principles while also accounting for copyright restrictions. Developing the warc2corpus pipeline, we detail the processes for extracting natural language from WARC files, curating content, and enhancing metadata for analytical purposes. This structured collection, consisting of 1.5 million news articles accessible via a REST API, enables distant reading of news from the web, with tools for building corpora and computing word frequencies and collocations. To support usage, both programming interfaces and user-friendly web apps are offered, representing a significant step forward in making web archives usable and valuable for digital scholars. |
---|---|
ISSN: | 2059-481X |
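
The summary mentions word frequencies and collocations as the analyses the collection is meant to support. As a rough illustration of what those two measures involve, the following sketch computes them over a tiny in-memory sample using only the Python standard library. It does not use the article's REST API or the warc2corpus pipeline; the function names, the simple tokenizer, the window size, and the sample sentences are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def word_frequencies(docs):
    """Count how often each token occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def collocations(docs, target, window=2):
    """Count tokens that occur within `window` positions of `target`."""
    counts = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # skip the target occurrence itself
                    counts[tokens[j]] += 1
    return counts


# Hypothetical sample texts standing in for news articles.
sample_docs = [
    "The web archive preserves news from the web.",
    "News articles in the web archive support distant reading.",
]

freqs = word_frequencies(sample_docs)
near_archive = collocations(sample_docs, "archive", window=2)

print(freqs.most_common(3))
print(near_archive.most_common(3))
```

Counting raw co-occurrences within a fixed window is the simplest form of collocation analysis; real corpus tools typically also weight the counts against each word's overall frequency.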