Title: A distributed architecture for large scale news and social media processing

Authors: Iraklis Varlamis; Dimitrios Michail; Pavlos Polydoras; Panagiotis Tsantilas

Addresses: Department of Informatics and Telematics, Harokopio University of Athens, Athens, Greece; Department of Informatics and Telematics, Harokopio University of Athens, Athens, Greece; Palo Ltd, Kokkoni, Greece; Palo Ltd, Kokkoni, Greece

Abstract: When designing a data processing and analytics pipeline for data streams, it is important to anticipate the data load and to balance it successfully over the available resources. This can be achieved more easily if small processing modules, which require limited resources, replace large monolithic processing software. In this work, we present the case of a social media and news analytics platform, called PaloAnalytics, which performs a series of content aggregation, information extraction (e.g., NER, sentiment tagging) and visualisation tasks on a large amount of data on a daily basis. We present the architecture of the platform, which relies on micro-modules and message-oriented middleware to deliver distributed content processing. Early results show that the proposed architecture can easily withstand the increased content load that occasionally occurs in social media (e.g., when a major event takes place) and quickly release unused resources when the content load returns to its normal flow.
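
As an illustration only, the idea of small processing modules connected by message-oriented middleware can be sketched in Python. This is not the PaloAnalytics implementation: the in-process `queue.Queue` objects stand in for a real message broker, and the names `run_stage`, `tag_entities`, `tag_sentiment` and `process` are hypothetical. The two toy handlers only mimic NER and sentiment tagging. The number of workers per stage is a parameter, which hints at how capacity could be raised under heavy load and released afterwards.

```python
import queue
import threading

STOP = object()  # sentinel message telling a worker to shut down


def run_stage(handler, inbox, outbox, workers):
    """Start `workers` threads that apply `handler` to each message in `inbox`."""
    def loop():
        while True:
            msg = inbox.get()
            if msg is STOP:
                break
            outbox.put(handler(msg))

    threads = [threading.Thread(target=loop) for _ in range(workers)]
    for t in threads:
        t.start()
    return threads


def tag_entities(doc):
    # Toy stand-in for NER (assumption): treat capitalised tokens as entities.
    entities = [w for w in doc["text"].split() if w[:1].isupper()]
    return {**doc, "entities": entities}


def tag_sentiment(doc):
    # Toy stand-in for sentiment tagging (assumption): tiny positive-word lexicon.
    positive = {"good", "great"}
    hits = sum(1 for w in doc["text"].lower().split() if w in positive)
    return {**doc, "sentiment": "positive" if hits else "neutral"}


def process(texts, workers_per_stage=2):
    """Push texts through two queue-connected stages and collect the results."""
    q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
    ner_workers = run_stage(tag_entities, q_in, q_mid, workers_per_stage)
    sentiment_workers = run_stage(tag_sentiment, q_mid, q_out, workers_per_stage)

    for i, text in enumerate(texts):
        q_in.put({"id": i, "text": text})

    # Shut the stages down in order: one STOP per worker, first stage first.
    for _ in ner_workers:
        q_in.put(STOP)
    for t in ner_workers:
        t.join()
    for _ in sentiment_workers:
        q_mid.put(STOP)
    for t in sentiment_workers:
        t.join()

    results = []
    while not q_out.empty():
        results.append(q_out.get())
    return sorted(results, key=lambda d: d["id"])
```

Calling `process(["Athens hosts a great event"], workers_per_stage=3)` would run each stage with three workers; in a real deployment the queues would be broker topics or queues and each stage a separately deployed module.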

Keywords: web crawler; text processing; distributed data processing; message-oriented middleware.

DOI: 10.1504/IJWET.2020.114029

International Journal of Web Engineering and Technology, 2020 Vol.15 No.4, pp.383 - 406

Published online: 06 Apr 2021
