Article: Big data multi-query optimisation with Apache Flink. Journal: International Journal of Web Engineering and Technology (IJWET), 2018 Vol.13 No.1, pp.78 - 97

Title: Big data multi-query optimisation with Apache Flink

Authors: Radhya Sahal; Mohamed H. Khafagy; Fatma A. Omara

Addresses: Department of Computer Science, Computers and Engineering College, Hodeida University, Yemen, and Department of Computer Science, Faculty of Computers and Information, Cairo University, Egypt; Department of Computer Science, Faculty of Computers and Information, Fayoum University, Egypt; Department of Computer Science, Faculty of Computers and Information, Cairo University, Egypt

Abstract: Big data analytic frameworks, such as MapReduce, Spark and Flink, have recently gained popularity for processing large data sets. Flink is an open-source, Apache-hosted big data analytic framework for processing batch and streaming data. For historical (batch) data processing, Flink's query optimiser builds on techniques used in parallel database systems. The optimiser translates queries into jobs, and these jobs are repeatedly submitted with similar tasks, so exploiting the similarity between tasks can avoid redundant computation. This paper proposes Flink-MQO, a multi-query optimisation system built on top of the Flink software stack. It is designed as an add-on to Apache Flink that optimises multiple queries through data sharing. Flink-MQO exploits the data sharing opportunities of selection operators to eliminate redundant and duplicated in-network data movement across multiple queries. Experimental results show that exploiting shared selection operators in big data multi-query workloads yields promising query execution times. Flink-MQO can therefore potentially be applied to stream processing to improve the performance of real-time applications.
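The idea of sharing selection operators across queries can be illustrated with a small sketch. The code below is not Flink-MQO's implementation and does not use the Flink API; it is a plain Python illustration under the assumption that each query is a conjunction of named selection predicates. All names (`PREDICATES`, `ROWS`, `QUERIES`, `evaluate_shared`, `evaluate_separately`) and the sample data are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Row = Dict[str, int]

# Hypothetical selection predicates, keyed by a readable name.
PREDICATES: Dict[str, Callable[[Row], bool]] = {
    "year>=2015": lambda r: r["year"] >= 2015,
    "price<100": lambda r: r["price"] < 100,
    "qty>5": lambda r: r["qty"] > 5,
}

# Hypothetical input data.
ROWS: List[Row] = [
    {"year": 2014, "price": 50, "qty": 10},
    {"year": 2016, "price": 50, "qty": 2},
    {"year": 2017, "price": 150, "qty": 8},
    {"year": 2018, "price": 80, "qty": 9},
]

# Each query is a conjunction of predicate names (an assumption of this sketch).
QUERIES: Dict[str, FrozenSet[str]] = {
    "q1": frozenset({"year>=2015", "price<100"}),
    "q2": frozenset({"year>=2015", "qty>5"}),
}


def evaluate_separately(
    rows: List[Row], queries: Dict[str, FrozenSet[str]]
) -> Dict[str, List[Row]]:
    """Baseline: every query filters the full input on its own."""
    return {
        name: [r for r in rows if all(PREDICATES[p](r) for p in preds)]
        for name, preds in queries.items()
    }


def evaluate_shared(
    rows: List[Row], queries: Dict[str, FrozenSet[str]]
) -> Tuple[Dict[str, List[Row]], FrozenSet[str]]:
    """Apply predicates common to all queries once, then the per-query rest."""
    if not queries:
        return {}, frozenset()
    # Predicates that appear in every query form the shared selection.
    shared = frozenset.intersection(*queries.values())
    # The shared selection is evaluated once over the input.
    candidates = [r for r in rows if all(PREDICATES[p](r) for p in shared)]
    results = {}
    for name, preds in queries.items():
        residual = preds - shared  # predicates specific to this query
        results[name] = [
            r for r in candidates if all(PREDICATES[p](r) for p in residual)
        ]
    return results, shared
```

In this sketch both strategies return the same rows per query; the shared variant evaluates the common predicate once and passes a smaller intermediate result to each query, which is the kind of redundancy the abstract refers to.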

Keywords: big data; Flink; batch processing; multi-query optimisation; MQO; sharing opportunity; selection predicates filter.

DOI: 10.1504/IJWET.2018.092401

International Journal of Web Engineering and Technology, 2018 Vol.13 No.1, pp.78 - 97

Published online: 17 Jun 2018
