News Mine - front-page news for text mining.

Here we make available the results of text mining the front pages of major online news outlets, released to accompany our pre-print here. We collected front pages from 172 outlets in 11 countries, summarized in the table below. We extracted news article content from links appearing on each of the front pages. The articles were retrieved for the period 2015-2020 and currently encompass ca. 26 million articles.


Data availability.

The data are available for download in either raw or processed form, as described below; the relationship between the two is illustrated in Figure 1.

Figure 1. Relation between the raw and processed file formats.

Raw Front Pages data format.

Upon downloading and uncompressing the archive containing raw front pages, you will see that the data is split between different outlets (e.g. nytimes.com). Within each of those folders are WebArchive timestamps (e.g. nytimes.com/20201105050723). The timestamp is a folder containing the front-page HTML captured at that specific point in time. For most pages the timestamp folder will contain the index file of the domain's landing page (nytimes.com/20201105050723/index.html). There are exceptions to this, and sometimes you will encounter subfolders before reaching the landing page's index file (e.g. dailymail.co.uk/20201105211235/home/index.html).
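A minimal sketch in Python of how one might walk this layout to collect every captured landing page; the root path data/raw_front_pages below is a hypothetical placeholder for wherever you uncompressed the archive.

import os

ROOT = "data/raw_front_pages"  # hypothetical location of the uncompressed archive

# Each outlet folder contains timestamp folders; the index.html is not
# always directly inside the timestamp folder (e.g. dailymail.co.uk nests
# it under home/), so we search each snapshot recursively.
for outlet in sorted(os.listdir(ROOT)):
    outlet_dir = os.path.join(ROOT, outlet)
    if not os.path.isdir(outlet_dir):
        continue
    for timestamp in sorted(os.listdir(outlet_dir)):
        snapshot_dir = os.path.join(outlet_dir, timestamp)
        for dirpath, _dirnames, filenames in os.walk(snapshot_dir):
            if "index.html" in filenames:
                print(outlet, timestamp, os.path.join(dirpath, "index.html"))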

Raw Individual Pages data format.

Upon downloading and uncompressing the archive containing raw individual pages, you will see that the data is split between different outlets (e.g. edition.cnn.com). Within each outlet, the individual pages are sorted between folders with three-character names (data/raw_pages/edition.cnn.com/ffd/). The three-character folder names are derived from the first three characters of the hash of an article's link. Within each folder you can find archived versions of individual pages, with filenames corresponding to their hashed links (edition.cnn.com/ffd/ffdf624aa94cbde4da8eb1591575940b.gz).
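If you need to locate the archived copy of a specific link, the sketch below reconstructs the expected path. The 32-character hex filenames are consistent with MD5 digests, but the exact hash function and the exact form of the link string being hashed (with or without scheme, trailing slash, etc.) are assumptions here.

import gzip
import hashlib
import os

ROOT = "data/raw_pages"

def page_path(outlet, link):
    # Assumption: the filename is the MD5 hex digest of the raw link string,
    # and the containing folder is the digest's first three characters.
    digest = hashlib.md5(link.encode("utf-8")).hexdigest()
    return os.path.join(ROOT, outlet, digest[:3], digest + ".gz")

path = page_path("edition.cnn.com", "https://edition.cnn.com/some/article")  # illustrative link
if os.path.exists(path):
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        html = fh.read()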

Processed Front Pages data format.

This data is an extract of the titles, metadata, and links of articles appearing on the front pages, together with derived information (stemming, sentiment, etc.). Structural and malformed links were removed, leaving (mostly) news articles and news-like items. This does not mean that the articles are 100% error-free; rather, the collection of such 'news' items reflects the heterogeneity of news forms available on contemporary online news sites fairly well.

Each outlet is sorted into a folder by its domain. Within that folder there is another folder called per_day, containing collections of articles from the given outlet, one file per day, with the date indicated by the file name. For instance, 20170915.gz corresponds to 15 September 2017.

foxnews.com/days.tar.gz ->
    ./per_day
        ./20170915.gz
        ./20170415.gz
        ./...

bbc.com/days.tar.gz ->
    ./per_day
        ./20170215.gz
        ./20150912.gz
        ./...

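As a sketch, one outlet's archive can be unpacked and its available days listed with standard-library tools; the paths below are illustrative.

import os
import tarfile
from datetime import datetime

# Unpack days.tar.gz so that foxnews.com/per_day/ appears on disk.
with tarfile.open("foxnews.com/days.tar.gz", "r:gz") as tar:
    tar.extractall("foxnews.com")

# File names encode the date as YYYYMMDD.
for name in sorted(os.listdir("foxnews.com/per_day")):
    if name.endswith(".gz"):
        day = datetime.strptime(name[:-3], "%Y%m%d").date()
        print(name, "->", day)  # e.g. 20170915.gz -> 2017-09-15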
Each day file holds a JSON-serialized dictionary of the articles from that day, in the following format:
_id -> {
	'title':'',# Raw title
	'title_stem1':'',# Keyword stems for the raw title
	'title_stem2':'',# 2-gram stems for the title, English stopwords removed
	'description':'',# Raw description
	'description_stem':'',# Keyword stems for the raw description
	'description_stem2':'',# 2-gram stems for the description, English stopwords removed
	'link':'',# Link of the news item
	'sentiment':'',# VADER sentiment of the raw title + description
	'is_covid':'',# Whether the item was identified as a COVID-related topic
}
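A sketch for reading one day's file, assuming each .gz holds a single UTF-8 JSON dictionary keyed by article id (the exact serialization is an assumption):

import gzip
import json

with gzip.open("foxnews.com/per_day/20170915.gz", "rt", encoding="utf-8") as fh:
    articles = json.load(fh)

# Each key is an article id; each value follows the record format above.
for _id, record in articles.items():
    print(_id, record["title"], record["sentiment"])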

Contact.

In case of any queries, please contact Konrad Krawczyk, konradk@imada.sdu.dk

To cite this work, please refer to the pre-print Quantifying the online news media coverage of the COVID-19 pandemic.