Authors: Syed Tanveer Jishan; Md. Nuruddin Monsur; Hafiz Abdur Rahman
Addresses: Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6, Canada; Department of Electrical and Computer Engineering, North South University, Plot-15, Block-B, Bashundhara, Dhaka 1229, Bangladesh; Department of Electrical and Computer Engineering, North South University, Plot-15, Block-B, Bashundhara, Dhaka 1229, Bangladesh
Abstract: In recent years, news distribution through the internet has increased significantly, and so has our dependency on online news sources. As vast numbers of web documents from different news websites are readily available, it is possible to extract information that can be used for various applications. One possible application is breaking news detection through text and property analysis of these web documents. In this paper, we present an approach to detect breaking news from web documents by using keyword extraction through Brill's tagger and HTML tag attributes. Once the keywords are extracted, the seasonality of each keyword is calculated from the ratio of the linear weighted moving averages (LWMA) at each point of the time series. Our approach has been validated, and its performance metrics have been evaluated, on two online newspapers.
Keywords: information extraction; web mining; data cleaning; time series analysis; Brill tagger; breaking news; news detection; web documents; text mining; seasonality; data mining; news distribution; online news sources; HTML tags; keywords; keyword extraction; online newspapers.
International Journal of Knowledge and Web Intelligence, 2016 Vol.5 No.3, pp.190 - 207
Received: 18 Sep 2015
Accepted: 10 Dec 2015
Published online: 26 Aug 2016
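
The abstract mentions a ratio of linear weighted moving averages (LWMA) computed at each point of a keyword's time series, but does not spell out which two averages are compared. The Python sketch below is one possible reading, not the authors' implementation: it assumes the ratio of a short-window LWMA to a long-window LWMA over per-period keyword counts. The function names `lwma` and `lwma_ratio`, the window sizes, and the sample counts are all hypothetical.

```python
from typing import List, Optional


def lwma(series: List[float], window: int, end: int) -> Optional[float]:
    """Linear weighted moving average of the `window` points ending at index `end`.

    The oldest point in the window gets weight 1 and the newest gets weight
    `window`. Returns None when fewer than `window` points are available.
    """
    start = end - window + 1
    if window <= 0 or start < 0:
        return None
    weighted_sum = sum((i + 1) * series[start + i] for i in range(window))
    return weighted_sum / (window * (window + 1) / 2)


def lwma_ratio(series: List[float], short: int, long: int) -> List[Optional[float]]:
    """Ratio of short-window LWMA to long-window LWMA at each point.

    ASSUMPTION: the short/long pairing is an illustrative choice; the abstract
    only says that a ratio of LWMAs is taken at each point of the series.
    """
    ratios: List[Optional[float]] = []
    for t in range(len(series)):
        s = lwma(series, short, t)
        l = lwma(series, long, t)
        if s is None or l is None or l == 0:
            ratios.append(None)  # not enough history, or undefined ratio
        else:
            ratios.append(s / l)
    return ratios


if __name__ == "__main__":
    # Hypothetical hourly counts for one keyword, with a burst at the end.
    counts = [1, 1, 1, 10]
    print(lwma_ratio(counts, short=2, long=3))
```

Under this reading, a ratio well above 1 indicates that recent mentions of a keyword are outpacing its longer-term level, which is the kind of signal a breaking news detector might threshold on. The threshold itself would have to come from the paper or from tuning.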