International Journal of High Performance Systems Architecture (9 papers in press)
Heterogeneous Computing on Mobile GPU-FPGA Cooperation Platform
by Nan Hu, Xuehai Zhou, Xi Li
Abstract: In recent years, mobile GPUs have been widely adopted in System-on-Chip (SoC) platforms, especially for graphics. Meanwhile, reconfigurable processors and emerging FPGA computing devices are also widely used. However, research on mobile GPUs cooperating with FPGAs for general-purpose computing is still scarce. Such heterogeneous systems pose a great challenge to parallel programming. In this paper, a Flow-Lead-In Architecture (FLIA) is proposed as a unified, data-flow-driven development model based on a coupled GPU-FPGA platform. A servant represents an intermediate language module that is compiled from a high-level programming language and is compiled for different types of processors at runtime. An execution-flow abstracts the communication task between servants and controls the pipeline execution for spatial parallelism. By scheduling multiple servants to heterogeneous processors, the cooperation system uses fewer resources to achieve performance and power consumption close to those of a pure FPGA system.
Keywords: heterogeneous computing; GPU-FPGA cooperation; mobile GPU; ARM GPU FPGA partitioning; reconfigurable computing.
A framework for evaluating branch predictors using multiple performance parameters
by Moumita Das, Ansuman Banerjee, Bhaskar Sardar
Abstract: Selecting a branch predictor for a given program is a challenging task. The performance of a branch predictor is measured not only by its prediction accuracy: parameters such as predictor size, energy expenditure and execution latency also play a key role in predictor selection. For a specific program, a predictor that provides the best results for one of these parameters may not be the best when another parameter is considered. Selecting the best predictor with all the parameters taken into account is therefore a non-trivial task and is considered one of the foremost challenges. In this paper, we propose a framework to systematically address this challenge using the concepts of aggregation and unification. For a given program, our framework considers the performance of the different predictors with respect to the different parameters and makes a predictor selection based on all of them. On the one hand, our framework can be an important aid in deciding on the best predictor to use at runtime. On the other hand, a newly proposed predictor can be systematically evaluated and placed in the context of existing ones, considering the parameters of choice. We present experimental results of our framework on the Siemens, SPEC 2006 and SPEC 2017 benchmarks.
Keywords: Branch prediction; prediction accuracy; execution latency; rank aggregation.
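The abstract above does not say which aggregation method the framework uses. As a rough illustration of what rank aggregation over several parameters can look like, the following minimal sketch applies a Borda count, which is an assumption chosen for simplicity and not necessarily the authors' method. The predictor names "A", "B", "C" and all rankings are hypothetical.

```python
from typing import Dict, List


def borda_aggregate(rankings: Dict[str, List[str]]) -> List[str]:
    """Combine per-parameter rankings (best first) into one ordering.

    Borda count: in a ranking of n items, the item at position i
    (0-based) earns n - 1 - i points. Points are summed over all
    parameters and items are sorted by total score.
    """
    scores: Dict[str, int] = {}
    for ranking in rankings.values():
        n = len(ranking)
        for position, predictor in enumerate(ranking):
            scores[predictor] = scores.get(predictor, 0) + (n - 1 - position)
    # Higher total is better; ties are broken by name for determinism.
    return sorted(scores, key=lambda p: (-scores[p], p))


# Hypothetical rankings of three predictors, best first, per parameter.
rankings = {
    "accuracy": ["A", "B", "C"],
    "size": ["C", "B", "A"],
    "energy": ["C", "A", "B"],
    "latency": ["B", "C", "A"],
}
overall = borda_aggregate(rankings)
print(overall)
```

With these made-up inputs, predictor "A" wins on accuracy alone but ends up last overall, which is the kind of trade-off the abstract describes.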
Image saliency and co-saliency detection by low-rank multiscale fusion
by Rui Huang, Wei Feng, Jizhou Sun, Yaobin Zou
Abstract: Saliency and co-saliency detection aim to distinguish conspicuous foreground objects in single and multiple images, and are thus essential in many multimedia and vision applications. To balance efficiency and accuracy, most recent successful saliency detectors are based on superpixels. However, saliency detection with single-scale superpixel segmentation may fail to capture intrinsic salient objects in complex natural scenes with small-scale, high-contrast backgrounds. To tackle this problem and realise reliable saliency and co-saliency detection, we present a simple strategy that uses multiscale superpixels to jointly detect salient objects via low-rank analysis. Specifically, we first build a multiscale superpixel pyramid and derive the corresponding saliency map from multimodal saliency features and priors at each single scale. Then, we use joint low-rank analysis of the multiscale saliency maps to obtain a more reliable, adaptively fused saliency map that properly takes the saliency at all scales into account. We further propose a GMM-based co-saliency prior that enables the above approach to detect co-salient objects in multiple images. Extensive experiments on benchmark datasets validate the effectiveness and superiority of the proposed saliency and co-saliency detector over state-of-the-art methods.
Keywords: Saliency; co-saliency; co-saliency prior; generative model; GMM; low-rank analysis; multiscale.
Soft skills requirements in mobile applications development employment market
by Jingdong Jia, Zupeng Chen, Xi Liu
Abstract: The soft skills of developers have a major influence on the quality of software products and projects. However, which soft skills are important for mobile applications development remains unknown. Additionally, it is necessary to examine the differences in soft skills requirements between traditional software development and mobile applications development. In this article, based on text mining including word segmentation, similarity calculation and clustering analysis, we analyse a large number of job advertisements and extract 13 categories of soft skills requirements for mobile applications development. We also compare these categories with those for traditional software development. We find that communication and teamwork are still the two most important soft skills. However, fast learning is more important for mobile developers, and we identify four soft skills that have not been proposed before. Additionally, season has a minor impact on the soft skills requirements of mobile applications development.
Keywords: soft skill requirements; mobile application development; employment market; job advertisement; text mining; word segmentation; similarity calculation; cluster analysis; traditional software development.
Energy optimised cryptography for low power devices in internet of things
by G. Rajesh, C. Vamsi Krishna, B. Christopher Selvaraj, S. Roshan Karthik, Arun Kumar Sangaiah
Abstract: The internet of things (IoT) comprises a plethora of devices, ranging from high-capacity servers to low-power devices that work with Bluetooth, ZigBee, GPRS, RFID, Wi-Fi, etc. These low-power devices are constrained by power management, reliability, security and privacy limitations. Existing traditional security algorithms cannot be applied to these low-power devices because of their high processing and battery power requirements. Here, an energy optimised cryptography (EOC) scheme for low-power devices in IoT is proposed. Security for low-power devices is provided by two lightweight security techniques: R2CV, a sub-key generation method, and an optimised message authentication code generation function (OMGF), which maintain security without compromising energy consumption or processing power consumption. The proposed security algorithms reduce the computational requirements for sub-key generation and MAC generation in low-power devices. The experimental results are compared with those of existing algorithms such as RC5 and SHA, and show that R2CV and OMGF reduce time consumption, increase battery life and in turn extend network lifetime.
Keywords: internet of things; IoT; low power devices; message authentication code; battery life.
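For readers unfamiliar with message authentication codes, the sketch below shows the generic generate-and-verify pattern using standard HMAC-SHA256 from the Python standard library. It is not OMGF or R2CV, whose details are not given in the abstract; the key and message values are hypothetical.

```python
import hashlib
import hmac


def make_tag(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the message with a shared key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify_tag(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(make_tag(key, message), tag)


# Hypothetical shared key and sensor reading, for illustration only.
key = b"example-shared-key"
message = b"temperature=21.5"
tag = make_tag(key, message)

print(verify_tag(key, message, tag))              # genuine message
print(verify_tag(key, b"temperature=99.9", tag))  # tampered message
```

A lightweight scheme such as the one proposed would replace the tag computation with something cheaper while keeping the same sender and receiver roles.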
Real-time physical register file allocation with neural networks for simultaneous multi-threading processors
by Wenjun Wang, Wei-Ming Lin
Abstract: Simultaneous multi-threading (SMT) processors improve system performance by allowing concurrent execution of multiple independent threads that share key resources. The physical register file, shared among the threads in real time, is one of the most critical resources in determining overall system performance. A disproportionate distribution of registers among the threads may easily hamper the normal processing of some threads. In this paper, we develop a machine learning algorithm to efficiently allocate registers among concurrently executing threads based on current resource utilisation. An offline training process is first employed to establish a well-trained neural network, which is then applied to dynamically adjust the resource distribution in real time. Our experimental results on M-sim, a multi-threaded micro-architectural simulation environment, show that the proposed technique improves the average system throughput by up to 42% without sacrificing execution fairness among the threads.
Keywords: simultaneous multi-threading; SMT; register renaming; physical register file; neural networks; machine learning.
Multiprocessing scalable string matching algorithm for network intrusion detection system
by Adnan A. Hnaif, Ali Aldahoud, Mohammad A. Alia, Issa S. Al'otoum, Duaa Nazzal
Abstract: With the rapidly increasing speed of today's computer networks, which affects security performance in terms of detection speed, traditional security tools such as firewalls are insufficient to protect networks from external threats. Intrusion detection systems (IDS) are among the most reliable tools for monitoring all network traffic to identify unauthorised usage of computer system networks. In this paper, we propose a scalable string matching algorithm for network IDS (NIDS) to enhance the speed of the NIDS detection engine, called the multiprocessing scalable string matching algorithm for network intrusion detection system (MSNIDS). MSNIDS is implemented using the enhanced weighted exact matching algorithm (EWEMA) in both sequential and parallel processing. MSNIDS based on EWEMA achieves an improvement of more than 89% in sequential processing time compared with WEMA, and of 86% in parallel processing time compared with sequential matching processing.
Keywords: string matching algorithms; distributed architecture; parallel processing; network intrusion detection system; NIDS.
Parallel video processing on FPGA architecture
by Lamjed Touil, Abdessalem Ben Abdelali, Lilia Kechiche, Chiheb Chaieb, Bouraoui Ouni, Abdellatif Mtibaa
Abstract: Real-time video applications are becoming widely used in many domains, with growing demand for high performance. Video processing is computationally intensive and usually comes with real-time or super-real-time requirements. For example, multiple cameras are used in monitoring and surveillance systems to automatically analyse video in real time and detect unusual events. Because of the heavy computational load imposed by video algorithms, real-time video processing is notably amenable to concurrent processing. Classical implementation solutions, whether based on general-purpose processors or on dedicated ones such as DSPs, cannot deliver the required performance. In this article, we focus on the applicability of reconfigurable computing architectures to parallel video processing applications. The experimental results show that the proposed hardware-oriented multi-treatment architecture can provide an average frame rate of 45 frames/s at high-definition resolution. Resource statistics show a consumption of about 18% of logic resources and 27% of on-chip memory, which leaves room to integrate additional treatments.
Keywords: FPGA; multi-port memory controller; MPMC; video processing; cut detection; picture in picture.
An efficient VLSI architecture for two-dimensional discrete wavelet transform
by Rohan Pinto, Kumara Shama
Abstract: In this paper, a memory-efficient 2-D discrete wavelet transform (DWT) structure is presented for high-speed applications. The architecture is based on the modified lifting scheme to reduce the critical path to one multiplier delay. To increase the processing speed, four pipeline stages are introduced in the structure. The computation time for an N × N image is N²/4, as the throughput rate of the structure is four. Comparison results reveal that the proposed architecture requires less temporal memory than other DWT architectures. The Z-scan method is employed to fetch the input data, which suits the transpose unit design. Five registers and a multiplexer constitute a transpose unit, which is required to transpose the data between the row and the column processors. The proposed 2-D dual-scan DWT architecture has the merits of low latency, low control complexity and regular signal flow, making it suitable for very large-scale integration (VLSI) implementation. The architecture is modelled in VHDL and synthesised with CMOS 180 nm technology.
Keywords: discrete wavelet transform; DWT; lifting scheme; pipeline; VLSI architecture.
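To make the lifting idea and the N²/4 figure concrete, here is a minimal software sketch. It uses the integer Haar wavelet as the simplest lifting example, which is an assumption: the abstract does not state which filter the modified lifting scheme implements, and this code does not model the hardware. The cycle count assumes that a throughput of four means four input samples consumed per clock cycle.

```python
from typing import List, Tuple


def haar_lifting_forward(x: List[int]) -> Tuple[List[int], List[int]]:
    """One level of 1-D integer Haar lifting (len(x) must be even)."""
    even, odd = x[0::2], x[1::2]
    # Predict step: detail is the error of predicting odd from even.
    detail = [o - e for e, o in zip(even, odd)]
    # Update step: approximation keeps the (floored) pairwise average.
    approx = [e + (d >> 1) for e, d in zip(even, detail)]
    return approx, detail


def haar_lifting_inverse(approx: List[int], detail: List[int]) -> List[int]:
    """Undo the lifting steps in reverse order for exact reconstruction."""
    even = [s - (d >> 1) for s, d in zip(approx, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    out: List[int] = []
    for e, o in zip(even, odd):
        out.extend((e, o))
    return out


def cycles_for_image(n: int, throughput: int = 4) -> int:
    """Clock cycles to stream an n x n image at `throughput` samples/cycle."""
    return (n * n) // throughput


signal = [4, 6, 10, 12]
approx, detail = haar_lifting_forward(signal)
restored = haar_lifting_inverse(approx, detail)
print(approx, detail, restored, cycles_for_image(512))
```

A 2-D transform applies such a 1-D step along rows and then along columns, which is why the architecture needs a transpose unit between the row and column processors.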