Title: Breaking news detection from the web documents through text mining and seasonality

Authors: Syed Tanveer Jishan; Md. Nuruddin Monsur; Hafiz Abdur Rahman

Addresses: Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6, Canada; Department of Electrical and Computer Engineering, North South University, Plot-15, Block-B, Bashundhara, Dhaka 1229, Bangladesh; Department of Electrical and Computer Engineering, North South University, Plot-15, Block-B, Bashundhara, Dhaka 1229, Bangladesh

Abstract: In recent years, news distribution through the internet has increased significantly, and so has our dependency on online news sources. As vast numbers of web documents from different news websites are readily available, it is possible to extract information that can be used for various applications. One possible application is breaking news detection through text and property analysis of these web documents. In this paper, we present an approach to detect breaking news from web documents by extracting keywords using Brill's tagger and HTML tag attributes. Once the keywords are extracted, the seasonality of each keyword is calculated as the ratio of the linear weighted moving averages (LWMA) at each point of the time series. Our approach has been validated, and its performance metrics have been evaluated, on two online newspapers.
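
As a rough illustration of the LWMA ratio mentioned in the abstract, the sketch below computes a linear weighted moving average over a keyword's count series and divides a short-window LWMA by a long-window LWMA at each time point. The abstract does not state which averages are compared or what window lengths are used, so the short/long pairing, the window sizes, and the function names `lwma` and `lwma_ratio` are assumptions made for illustration only, not the authors' implementation.

```python
from typing import List, Optional


def lwma(series: List[float], window: int) -> List[Optional[float]]:
    """Linear weighted moving average.

    Within each window the oldest value gets weight 1 and the newest
    gets weight `window`. Positions without enough history yield None.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    weights = range(1, window + 1)
    denom = window * (window + 1) / 2  # sum of the weights 1..window
    out: List[Optional[float]] = []
    for t in range(len(series)):
        if t + 1 < window:
            out.append(None)  # not enough history yet
            continue
        chunk = series[t - window + 1 : t + 1]
        out.append(sum(w * x for w, x in zip(weights, chunk)) / denom)
    return out


def lwma_ratio(
    series: List[float], short_window: int, long_window: int
) -> List[Optional[float]]:
    """Ratio of a short-window LWMA to a long-window LWMA at each point.

    Assumption: this short/long pairing is one plausible reading of
    "ratio of the LWMAs"; the abstract does not specify it.
    """
    short = lwma(series, short_window)
    long_ = lwma(series, long_window)
    ratios: List[Optional[float]] = []
    for s, l in zip(short, long_):
        if s is None or l is None or l == 0:
            ratios.append(None)  # undefined without history or baseline
        else:
            ratios.append(s / l)
    return ratios


# Hypothetical daily counts for one keyword, with a jump on the last day.
counts = [1, 1, 1, 1, 10]
ratios = lwma_ratio(counts, short_window=2, long_window=4)
```

Under this reading, a ratio well above 1 suggests the keyword is appearing more often than its recent baseline. Any threshold for flagging a keyword as breaking news would have to come from the full paper.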

Keywords: information extraction; web mining; data cleaning; time series analysis; Brill tagger; breaking news; news detection; web documents; text mining; seasonality; data mining; news distribution; online news sources; HTML tags; keywords; keyword extraction; online newspapers.

DOI: 10.1504/IJKWI.2016.078714

International Journal of Knowledge and Web Intelligence, 2016 Vol.5 No.3, pp.190 - 207

Received: 18 Sep 2015
Accepted: 10 Dec 2015

Published online: 01 Sep 2016
