Forthcoming and Online First Articles

International Journal of Business Intelligence and Data Mining (IJBIDM)

Forthcoming articles have been peer-reviewed and accepted for publication but are pending final changes, are not yet published and may not appear here in their final order of publication until they are assigned to issues. Therefore, the content conforms to our standards but the presentation (e.g. typesetting and proof-reading) is not necessarily up to the Inderscience standard. Additionally, titles, authors, abstracts and keywords may change before publication. Articles will not be published until the final proofs are validated by their authors.

Forthcoming articles must be purchased for the purposes of research, teaching and private study only. These articles can be cited using the expression "in press". For example: Smith, J. (in press). Article Title. Journal Title.

Articles marked with this shopping trolley icon are available for purchase - click on the icon to send an email request to purchase.

Online First articles are published online here, before they appear in a journal issue. Online First articles are fully citeable, complete with a DOI. They can be cited, read, and downloaded. Online First articles are published as Open Access (OA) articles to make the latest research available as early as possible.

Open Access: Articles marked with this Open Access icon are Online First articles. They are freely available and openly accessible to all without any restriction except the ones stated in their respective CC licenses.

Register for our alerting service, which notifies you by email when new issues are published online.

We also offer RSS feeds, which provide timely updates of tables of contents, newly published articles and calls for papers.

International Journal of Business Intelligence and Data Mining (49 papers in press)

Regular Issues

  • Analysis and Prediction of Heart Disease Aid of Various Data Mining Techniques: A Survey   Order a copy of this article
    by V. Poornima, D. Gladis 
    Abstract: In recent times, health diseases have been increasing gradually because of hereditary factors. In particular, heart disease has become more common nowadays, putting the lives of individuals at risk. Data mining strategies, specifically decision tree, Naive Bayes, neural network, K-means clustering, association classification, support vector machine (SVM), fuzzy, rough set theory and orthogonal local preserving methodologies, are examined on heart disease databases. In this paper, we survey distinct papers in which at least one data mining algorithm is utilised for the forecast of heart disease. This survey covers the current procedures involved in risk prediction of heart disease for classification in data mining. The survey of pertinent data mining strategies involved in risk prediction of heart disease shows that hybrid approaches give better prediction models than single-model approaches.
    Keywords: Data mining; Heart Disease Prediction; performance measure; Fuzzy; clustering.
    DOI: 10.1504/IJBIDM.2018.10014620
     
  • Mining the Productivity Data of Garment Industry   Order a copy of this article
    by Abdullah Al Imran, Md Shamsur Rahim, Tanvir Ahmed 
    Abstract: The garment industry is one of the key examples of the industrial globalization of this modern era. It is a highly labour-intensive industry with lots of manual processes. Satisfying the huge global demand for garment products is mostly dependent on the production and delivery performance of the employees in the garment manufacturing companies. So, it is highly desirable for the decision makers in the garment industry to track, analyse and predict the productivity performance of the working teams in their factories. This study explores the application of state-of-the-art data mining techniques for analysing industrial data, revealing meaningful insights, and predicting the productivity performance of the working teams in a garment company. As part of our exploration, we have applied eight different data mining techniques with six evaluation metrics. Our experimental results show that the tree ensemble model and the gradient boosted tree model are the best performing models in the application scenario.
    Keywords: Data Mining; Productivity Prediction; Pattern Mining; Classification; Garment Industry; Industrial Engineering.
    DOI: 10.1504/IJBIDM.2021.10028084
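    The abstract above does not disclose the paper's features or tooling, so the snippet below is only a hedged sketch of the kind of multi-metric benchmarking it describes: a gradient-boosted tree classifier cross-validated on synthetic stand-in data with scikit-learn. All data and settings are illustrative assumptions.

```python
# Hedged sketch: multi-metric cross-validation of a gradient-boosted tree model,
# in the spirit of the benchmarking described above; data and settings are invented.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the garment productivity records (not the paper's data).
X, y = make_classification(n_samples=500, n_features=12, random_state=42)

metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]
cv = cross_validate(GradientBoostingClassifier(random_state=42), X, y, cv=5, scoring=metrics)
for m in metrics:
    print(f"{m}: {cv['test_' + m].mean():.3f}")
```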
     
  • General Crime from the Data Mining Point of View: A Systematic Literature Review   Order a copy of this article
    by Maria Antonia Walteros Alcazar, Nicolas Aguirre Yacup, Sandra P. Castillo Landinez, Pablo E. Caicedo Rodríguez 
    Abstract: In recent decades, crime has become an issue of great concern to nations, which is why there has been significant progress in the development of investigations in different areas. This literature review considers the data mining techniques applied to crime research through the analysis of four thematic axes: countries, data sources, data mining techniques and software employed in the different articles. The analysis used a systematic methodology to examine 111 articles published between 2008 and 2018, selected from almost 70 journals. The articles of this review are focused on different types of crime. The findings indicate that the USA is the most active country analysing crime using data mining techniques, and that the most common sources are open data websites and crime studies. In general, studies that address crime broadly are more frequent than those that cover a specific type of crime; the algorithm mainly used in the studies is clustering, followed by classification, and the most widely used software is WEKA.
    Keywords: Data Mining (DM); Crime; Criminal Patterns; Law Enforcement; Data Mining Techniques; Algorithms; Review; Knowledge Discovery; Literature Review (LR).
    DOI: 10.1504/IJBIDM.2021.10029504
     
  • A Clustering and Treemap-based Approach for Query Reuse and Visualization in Large Data Repositories   Order a copy of this article
    by Yousra Harb, Surendra Sarnikar, Omar F. El-Gayar 
    Abstract: This study presents a query clustering and treemap approach that facilitates access and reuse of pre-developed data retrieval models (queries) to analyze data and satisfy user information needs. The approach seeks to meet the following requirements: knowledge (represented as previously constructed queries) reuse, query exploration, and ease of use by data users. The approach proposes a feature space for representing queries, applies Hierarchical Agglomerative Clustering (HAC) techniques to cluster the queries, and leverages treemaps to visualize and navigate the resultant query clusters. We demonstrate the viability of the approach by building a prototype data exploration interface for health data from the Behavioral Risk Factor Surveillance System (BRFSS). We conduct cognitive walkthroughs and a user study to further evaluate the effectiveness of the artifact. Overall, the results indicate that the proposed approach demonstrates the ability to meet its design requirements.
    Keywords: Query clustering; Query reuse; Query visualization; Query exploration; Information retrieval; Treemap.
    DOI: 10.1504/IJBIDM.2021.10030448
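    As a minimal sketch of the clustering step only (not the paper's query feature space or treemap interface), the snippet below groups a few hypothetical query strings with hierarchical agglomerative clustering over TF-IDF vectors in scikit-learn.

```python
# Hedged sketch: hierarchical agglomerative clustering of query strings.
# The feature space and the queries here are illustrative, not the paper's BRFSS queries.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

queries = [
    "average smoking rate by state",
    "smoking prevalence among adults",
    "diabetes rate by age group",
    "diabetes prevalence by state",
    "exercise frequency by income",
]
X = TfidfVectorizer().fit_transform(queries).toarray()   # dense array for ward linkage
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
for query, cluster in zip(queries, labels):
    print(cluster, query)
```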
     
  • Application of structural modeling to measure the impact of quality on growth factors: Case of the young industrial enterprises installed in the Northwest of Morocco   Order a copy of this article
    by Mohamed Ben Ali, Mohammed Hadini, Said Barijal, Saif Rifai 
    Abstract: This study aims to provide a conceptual model measuring the impact of quality practices on the growth factors of young industrial enterprises located in northwestern Morocco, and to see how quality can stimulate and improve the growth factors of this kind of enterprise. The present study is empirical, based on surveys (face-to-face interviews) via questionnaires administered to the owners/managers of young industrial enterprises, using latent variable structural modeling according to the PLS-Path Modeling approach. A total of 220 questionnaires were administered and exploited to assess the degree of use and application of quality practices; five practices were chosen, and PLS (Partial Least Squares) Path Modeling was used. We concluded that, in general, the quality practices concerning “Leadership” and “Process Management” have a positive impact on the growth factors of this type of enterprise, with "strong to medium" importance of effects. In contrast, the quality practices concerning “Human Resources”,
    Keywords: Growth factors; Growth phase; Modeling; Quality Practices; Young Industrial Enterprises.
    DOI: 10.1504/IJBIDM.2021.10030835
     
  • Mining Trailer Reviews for Predicting Ratings and Box Office Success of Upcoming Movies   Order a copy of this article
    by Nirmalya Chowdhury, Debaditya Barman, Chandrai Kayal 
    Abstract: Around 60% of the movies produced worldwide are box office failures. Since it affects a large number of stakeholders, movie business prediction is a very relevant as well as challenging problem. There have been many attempts to predict the box-office earnings of a movie after its theatrical release; comparatively few research works predict a movie’s fate before its release. Viewers are introduced to a movie via trailers before its theatrical release, and the reviews of these trailers are indicative of a movie’s initial success. This work is focused on movie rating and business prediction on the basis of trailer reviews as well as other attributes. Several experiments have been performed using multiple classifiers to find the appropriate classifier(s) which can predict the rating and box-office performance of a movie to be launched. Experimentally it has been found that the Random Forest (RF) classifier outperformed the others and produced very promising results.
    Keywords: Text Mining; Sentiment Analysis; Machine Learning; Movie Rating; Opening Weekend Income; Gross Income; Movie Trailer; Sensitivity Analysis.
    DOI: 10.1504/IJBIDM.2021.10030880
     
  • Improvement Assessment Method for Special Kids By Observing The Social and Behaviour Activity Using Data Mining Techniques   Order a copy of this article
    by Dhanalakshmi Radhakrishnan, Muthukumar B 
    Abstract: In recent studies, high-throughput innovations have given rise to the accumulation of substantial amounts of heterogeneous data that provide diverse information. Clustering is the process of grouping distinct items into classes of comparable objects. To overcome the drawbacks of classification methods, clustering is used. Earlier, clustering algorithms such as hierarchical clustering and density-based clustering, which are based on either numerical or categorical attributes, were commercially used in software. In this proposed work, k-means clustering, an unsupervised learning algorithm, is used for prediction. Taking the clinical data of special kids, clusters are formed and categorized by rank with the help of relevant symptoms. In this context, the data of special kids make a statistical impact on categorization and enable earlier detection of the associated conditions of a child. As a result, the proposed method has validated the database of special kids’ information with global purity.
    Keywords: High-throughput development; Special kids; Categorical attributes; unsupervised k-means Clustering; Gene expressional values.
    DOI: 10.1504/IJBIDM.2021.10031032
     
  • Ensemble Feature Selection Approach for Imbalanced Textual Data Using MapReduce   Order a copy of this article
    by Houda Amazal, Kissi Mohamed, Mohammed Ramdani 
    Abstract: Feature selection is a fundamental preprocessing phase in text classification. It speeds up machine learning algorithms and improves classification accuracy. In a big data context, feature selection techniques have to deal with two major issues: the huge dimensionality and the imbalanced nature of the data. However, the libraries of big data frameworks, such as Hadoop, only implement a few single feature selection methods whose robustness does not meet the requirements imposed by the large amount of data. To deal with this, we propose in this paper a Distributed Ensemble Feature Selection (DEFS) approach for large imbalanced datasets. The first step of the proposal focuses on tackling the imbalanced distribution of data, using the Hadoop environment to transform the usual documents of the dataset into big documents. Afterwards, we introduce a novel feature selection method we call Term Frequency-Inverse Category Frequency (TFICF), which is both frequency and category based.
    Keywords: Ensemble feature selection; Imbalance data; MapReduce; Text classification.
    DOI: 10.1504/IJBIDM.2022.10031100
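    The abstract does not give the exact TF-ICF formula or its MapReduce implementation; the plain-Python sketch below shows one plausible term frequency-inverse category frequency weighting, analogous to TF-IDF but with the inverse part computed over categories. The toy corpus is invented.

```python
# Hedged sketch of a term frequency-inverse category frequency (TF-ICF) weight:
# analogous to TF-IDF, but the inverse part counts categories containing the term.
# This is an illustrative formulation, not necessarily the paper's exact definition.
import math
from collections import Counter

# Toy corpus: category -> concatenated documents of that category (illustrative).
corpus = {
    "sports":   "match goal team win goal",
    "politics": "vote election win debate",
    "tech":     "chip code release team",
}

n_categories = len(corpus)
tokens = {c: text.split() for c, text in corpus.items()}
category_freq = Counter(t for toks in tokens.values() for t in set(toks))

def tf_icf(term, category):
    tf = tokens[category].count(term) / len(tokens[category])
    icf = math.log(n_categories / category_freq[term])
    return tf * icf

print(tf_icf("goal", "sports"))   # frequent and exclusive to one category -> high weight
print(tf_icf("win", "sports"))    # appears in two categories -> lower weight
```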
     
  • The Five Key Components for Building An Operational Business Intelligence Ecosystem   Order a copy of this article
    by SARMA A.D.N. 
    Abstract: Business intelligence (BI) plays a vital role in decision making in almost all private, business and government organizations. An operational BI system is a hybrid system, an emerging concept in the BI space that has been gaining popularity over the last five years. In this paper, the key components of an operational BI system are presented and their workings explained. The methodology adopted for the identification of components is based on the modularization principles of software engineering, using cohesion and coupling parameters. The proposed components of the system leverage the principles of component-based software engineering. An orderly arrangement of the key components constitutes an operational BI ecosystem. Further, it is explained how these individual key components of the system provide increased business value and timely decision-making information to all the users in the organization.
    Keywords: Business intelligence; operational BI; business performance management; operational analytics; operational reporting; event monitoring and notification; action time; and business value.
    DOI: 10.1504/IJBIDM.2021.10031395
     
  • A Novel Approach to Retrieve Unlabelled Images   Order a copy of this article
    by Deepali Kamthania, Ashish Pahwa, Aayush Gupta, Chirag Jain 
    Abstract: In this paper an attempt has been made to propose an architecture of a search engine for retrieving photographs from a photo bank of unlabeled images. The primary purpose of the system is to retrieve images from an image repository through string-based queries on an interactive interface. To achieve this, the image dataset is transformed into a space where queries can execute significantly faster, by developing a data pipeline through which each image is passed after entering the system. The pipeline consists of HOG-based face detection and extraction, face landmark estimation, an indexer and a transformer. The image is passed through the data pipeline, where each encoded face in the input image is compared with other vectors by computing the L2 norm distance between them. The top N results (addresses of faces and corresponding images) are returned to the user. Once the image passes out of the pipeline, retrieval methods and feedback mechanisms are performed.
    Keywords: Face Recognition (FR); Deep Learning; Histogram of Oriented Gradients (HOG); FaceNet Architecture; Machine Learning; Support Vector Machine (SVM).
    DOI: 10.1504/IJBIDM.2021.10031519
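    The detection and landmark-estimation stages depend on models not reproduced here; the sketch below covers only the ranking step described above: comparing a query face encoding against stored encodings by L2 norm and returning the top-N matches. The 128-dimensional encodings are random placeholders.

```python
# Hedged sketch of the retrieval step: rank stored face encodings by L2 distance
# to a query encoding and return the top-N. The encodings are random placeholders
# standing in for the HOG/FaceNet-style vectors described above.
import numpy as np

rng = np.random.default_rng(0)
stored_encodings = rng.normal(size=(1000, 128))   # pretend index of 1,000 faces
query_encoding = rng.normal(size=128)

def top_n_matches(query, index, n=5):
    distances = np.linalg.norm(index - query, axis=1)   # L2 norm per stored face
    order = np.argsort(distances)[:n]
    return list(zip(order.tolist(), distances[order].tolist()))

for idx, dist in top_n_matches(query_encoding, stored_encodings):
    print(f"face #{idx}: distance {dist:.3f}")
```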
     
  • Prediction of Box-office Success: A Review of Trends and Machine Learning Computational Models   Order a copy of this article
    by Elliot Mbunge, Stephen Fashoto, Happyson Bimha 
    Abstract: The movie industry is faced with high uncertainty owing to the challenges businesses have in forecasting sales and revenues. The huge upfront investments associated with the movie industry require investments to be informed by reliable methods of predicting success or returns. This study set out to identify the best forecasting techniques for box-office products. Previous studies focused on predicting box-office success using pre-release and post-release features during and after the production phase. This study focuses on reviewing the existing literature on predicting box-office success, with the ultimate goal of determining the most frequently used prediction algorithm(s), dataset sources and their accuracy results. We applied the PRISMA model to review published papers from 2010 to 2019 extracted from Google Scholar, Science Direct, IEEE Xplore Digital Library, ACM Digital Library and Springer Link. The study shows that the support vector machine was most frequently used to predict box-office success, with 21.74% of the total frequency contribution, followed by linear regression with 17.39%. The study also revealed that the Internet Movie Database (IMDb) is the most used box-office dataset source, with 40.741% of the total frequency, followed by Wikipedia with 11.111%.
    Keywords: Box-office; machine learning; movie industry; pre-release; post-release features.
    DOI: 10.1504/IJBIDM.2021.10032162
     
  • A Novel Framework for Forecasting Time Series Data Based on Fuzzy Logic and Variants of Hidden Markov Models   Order a copy of this article
    by S. Sridevi, Parthasarathy Sudhaman, T. Chandrakumar, S. Rajaram 
    Abstract: Traditional time series forecasting methods such as the naive, smoothing and moving average models assume that the time series is stationary and cannot handle linguistic terms. To provide a solution to this problem, fuzzy time series forecasting methods are considered in this research work. The objective of this research is to improve accuracy by introducing a new partitioning method called the Relative Differences (RD) based interval method. This research work implements variants of RD-based Hidden Markov Models (HMM), namely classic HMM, stochastic HMM, Laplace stochastic smoothing HMM, and Probabilistic Smoothing HMM (PsHMM), for forecasting time series data. In the proposed work, the performances of the above models were tested with the Australian Electricity Market dataset and the Tamilnadu Weather dataset. The results show that the proposed model, the Relative Differences (RD) based PsHMM, performs much better in terms of precision than the other existing models.
    Keywords: Forecasting; Time series; Fuzzy; Time Variant Model; Markov Model; Relative Differences (RD) based interval method.
    DOI: 10.1504/IJBIDM.2021.10032543
     
  • Disease Prediction and Knowledge Extraction in Banana Crop Cultivation using Decision Tree Classifiers   Order a copy of this article
    by A. Anitha 
    Abstract: Agriculture plays a vital role in determining the economic status of a country. To meet the growing needs of society and to improve crop productivity, researchers are focusing on the development of various technologies. In India, banana is one of the leading crops with high demand. To improve the yield of banana, it is necessary to detect diseases at an early stage. Also, in order to acquire new farmers and to retain existing banana farmers, it is essential to extract knowledge about the hidden causes of various diseases in the banana crop. This work aims to apply data mining techniques such as decision tree classifiers to a banana cultivation dataset. The agricultural dataset used for experimentation is collected from farmers cultivating banana in regions fed by the Thamirabharani River, such as the Kanyakumari, Tirunelveli and Tuticorin districts of Tamil Nadu. The higher the disease detection accuracy, the greater the crop productivity. The performance of classifiers such as J48, REP tree and random forest is compared based on classification accuracy, precision, recall and F-measure. Among the various classification techniques applied to the agricultural dataset, it has been identified that the random forest algorithm outperforms the other techniques with respect to classification accuracy.
    Keywords: Attribute Selection; Decision Tree; Classification; Accuracy.
    DOI: 10.1504/IJBIDM.2022.10033424
     
  • Heart Disease Patient Risk Classification Based On Neutrosophic Sets   Order a copy of this article
    by Wael Hanna, Nouran Radwan 
    Abstract: Medical statistics show that heart disease is one of the biggest causes of mortality among the population. In developing countries, people have less concern about their health. The risk is increasing, as five hundred deaths per one hundred thousand people occur annually in Egypt. The diagnosis of heart disease remains an ambiguous task in the medical field, as many features are involved in taking the decision. Besides, data gained for diagnosis are often vague and ambiguous. The main contribution of this paper is proposing a novel model of heart disease patient risk classification based on neutrosophic sets. The proposed model is applied to the most relevant attributes of the selected dataset and compared to other well-known classification techniques such as Naive Bayesian, JRip, and random forest for validation. The experimental results indicate that the proposed heart disease classification model achieves the highest accuracy and F-measure results.
    Keywords: Heart disease; supervised machine learning classification; and neutrosophic sets.
    DOI: 10.1504/IJBIDM.2021.10034129
     
  • A Semi-Supervised clustering based classification model for classifying imbalanced data streams in the presence of scarcely labelled data   Order a copy of this article
    by Kiran Bhowmick, Meera Narvekar 
    Abstract: Classification of data streams is still a current topic of research and a lot of research is focussed in this direction. Online frameworks for classifying data streams are generally supervised in nature so they assume the availability of labelled data all the time. Data streams in real time however are potentially infinite in length, massive, fast changing and scarcely labelled. It is practically impossible to label all the observed instances. Hence these existing frameworks cannot be used in most of the real time scenarios. Semi-supervised learning (SSL) addresses this problem of scarcely labelled data by using large amount of unlabelled data together with labelled data to build classifiers. Data streams may also suffer with the problem of imbalanced data. This paper proposes a model using a semi supervised clustering technique to classify an imbalanced data stream in the presence of scarcely labelled data.
    Keywords: data streams; imbalanced data; semi-supervised clustering; expectation maximization; partially labelled.
    DOI: 10.1504/IJBIDM.2022.10034300
     
  • Analysing traveller ratings for tourist satisfaction and tourist spot recommendation   Order a copy of this article
    by Angel Arul Jothi Joseph, Rajeni Nagarajan 
    Abstract: In this study, we propose an automated system to classify traveller ratings on travel destinations in 10 categories across East Asia using the UCI Travel Reviews dataset. The automated system developed in this study is called Traveller Rating Classification System (TRCS). Since the Travel Reviews dataset is an unlabelled dataset, K-means clustering algorithm is used to group the samples from the dataset into three clusters. The cluster numbers obtained from K-means clustering are assigned as class labels for the samples and the dataset is converted into a labelled dataset. Popular individual classifiers and ensemble classifiers are used to classify the samples present in the labelled dataset. In this study, Bagging with decision tree classifier achieved the best classification accuracy of 97.95%. The study further analyses the attributes in the dataset using visualization techniques to draw inferences by performing small transformations on them. The proposed system will be useful to understand traveller satisfaction and as a tourist spot recommendation system.
    Keywords: Tourist spot recommendation; Tourist satisfaction; Traveller rating; K-means Clustering; Classification; Ensemble; Visualization.
    DOI: 10.1504/IJBIDM.2022.10034520
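    A compressed sketch of the two-stage idea described above, on synthetic data: k-means assigns pseudo-labels to an unlabelled dataset, and a bagging ensemble (whose default base learner in scikit-learn is a decision tree) is trained to predict them. The dataset, number of clusters and parameters are illustrative assumptions, not TRCS itself.

```python
# Hedged sketch of the TRCS-style workflow: label an unlabelled dataset with k-means,
# then train a bagged decision-tree classifier on the generated labels.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.ensemble import BaggingClassifier   # default base learner is a decision tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, _ = make_blobs(n_samples=600, centers=3, n_features=10, random_state=7)  # stand-in ratings
y = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)          # pseudo-labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7)
clf = BaggingClassifier(n_estimators=25, random_state=7).fit(X_tr, y_tr)
print("accuracy on k-means labels:", accuracy_score(y_te, clf.predict(X_te)))
```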
     
  • Correlating pre-search and in-search context to predict search intent for exploratory search   Order a copy of this article
    by Vikram Singh 
    Abstract: Modern information systems are expected to respond to a wide variety of information needs from users with diverse goals. The topical dimension (what the user is searching for) of these information needs is well studied; however, the intent dimension (why the user is searching) has received relatively less attention. Traditionally, the intent is an immediate reason, purpose, or goal that motivates the user's search, and is captured in search contexts (pre-search, in-search, pro-search), which an ideal information system would be able to use. This article proposes a novel intent estimation strategy, based on the intuition that captured intent proactively extracts potential results. The captured pre-search context adapts query term proximities within matched results, alongside document-term statistics, and pseudo-relevance feedback is combined with user-relevance feedback for the in-search context. The assessment asserts the superior performance of the proposed strategy over equivalents on trade-offs, e.g., novelty, diversity (coverage, topicality), retrieval (precision, recall, F-measure) and exploitation vs. exploration.
    Keywords: Ambient Information; Exploratory Search; Human-Computer Interaction; Information Retrieval; Proactive Search; Query Term Proximity; Search Contexts; Relevance; Retrieval Model.
    DOI: 10.1504/IJBIDM.2022.10034960
     
  • Prediction of Students’ Failure using VLE and Demographic data: Case study Open University Data   Order a copy of this article
    by Rahila Umer, Sohrab Khan, Jun Ren, Shumaila Umer, Ayesha Shaukat 
    Abstract: The use of technology such as learning management systems (LMS) in higher education institutes is becoming very common. An LMS provides support to teaching staff for communication, delivery of resources and the design of learning activities. A large amount of data is produced using these technologies, which can be analysed using machine learning methods to extract knowledge regarding students’ behaviour and learning processes. In this study we focus on the Open University’s project for predicting students’ failure in a course by using their data. Multiple machine learning algorithms are applied to historical virtual learning environment (VLE) data and demographic data. This study confirms the importance of VLE and demographic data in the prediction of academic performance. In particular, it highlights the importance of demographic data, which improves the accuracy of models for predicting students’ outcomes in the courses they are enrolled in.
    Keywords: predictive learning analytics; student performance; retention; higher education; machine learning.
    DOI: 10.1504/IJBIDM.2022.10035109
     
  • Chaotic activities recognizing during the pre-processing event data phase   Order a copy of this article
    by Zineb Lamghari, Rajaa Saidi, Maryam Radgui, Moulay Driss Rahmani 
    Abstract: Process mining aims at obtaining insights into business processes by extracting knowledge from event data. Indeed, the quality of events is a crucial element for generating process models that reflect business process reality. To this end, pre-processing methods have appeared to clean events of deficiencies (noise, incompleteness and infrequent behaviours), up to the limit of chaotic activities’ emergence. Chaotic activities are executed arbitrarily in the process and impact the quality of discovered models. Previously, a supervised learning approach was proposed, using labelled samples to detect chaotic activities. This puts forward the difficulty of defining chaotic activities when there is no ground knowledge of which activities are truly chaotic. To that end, we develop an approach for recognising chaotic activities without labelled training data, using unsupervised learning techniques.
    Keywords: pre-processing; process discovery; process mining; chaotic activity; business process intelligent; machine learning algorithms.
    DOI: 10.1504/IJBIDM.2022.10035223
     
  • Favourable subpopulation migration strategy for Travelling salesman problem   Order a copy of this article
    by Abhishek Chandar, Akshay Srinivasan, G. Paavai Anand 
    DOI: 10.1504/IJBIDM.2022.10035424
     
  • An evaluation method for searching the functional relationships between property prices and influencing factors in the detected data   Order a copy of this article
    by Pierluigi Morano, Francesco Tajani, Vincenzo Del Giudice, Pierfrancesco De Paola, Felicia Di Liddo 
    Abstract: The economic crisis of the last decade, which started in the real estate sector, has spread awareness of the importance of using advanced evaluation models as a support in the assessment and in the periodic value updates of public and private property assets. With reference to a sample of recently sold properties located in the city of Rome (Italy), an innovative automated valuation model is explained and applied. The outputs are represented by different mathematical expressions, able to interpret and simulate the investigated phenomena (i.e. market price formation). The application carried out outlines, in the selection phase of the best model, the fundamental condition that the valuer must adequately know the reference market. In this way, it is possible to identify the existing patterns in the detected data in terms of mathematical expressions, according to empirical knowledge of the economic phenomena.
    Keywords: price property formation; office market; retail market; automated valuation methods; AVMs; genetic algorithm; reliable valuations.
    DOI: 10.1504/IJBIDM.2022.10035383
     
  • Predicting students' academic performance using machine learning techniques: a literature review   Order a copy of this article
    by Aya Nabil, Mohammed Seyam, Ahmed Aboul-Fotouh 
    Abstract: The amount of students’ data stored in educational databases is increasing rapidly. These databases contain hidden patterns and useful information about students’ behaviour and performance. Data mining is the most effective method to analyse the stored educational data. Educational data mining (EDM) is the process of applying different data mining techniques in educational environments to analyse huge amounts of educational data. Several researchers applied different machine learning techniques to analyse students’ data and extract hidden knowledge from them. Prediction of students’ academic performance is necessary for educational environments to measure the quality of the learning process. Therefore, it is one of the most common applications of EDM. In this survey paper, we present a review of data mining techniques, EDM and its applications, and discuss previous studies in predicting students’ academic performance. An analysis of different machine learning techniques used in previous studies is also presented in this paper.
    Keywords: data mining; educational data mining; EDM; prediction; student academic performance; machine learning techniques; deep learning.
    DOI: 10.1504/IJBIDM.2022.10035540
     
  • Harnessing the Meteorological Effect for Predicting the Retail Price of Rice in Bangladesh   Order a copy of this article
    by Abdullah Al Imran, Zaman Wahid, Alpana Akhi Prova, Md. Hannan 
    Abstract: Bangladesh has seen a steep price hike over the last couple of years in one of the most consumed foods, taken by millions of people every single day: rice. The impact of this phenomenon, however, is critically important, especially to those striving for daily meals. Thus, understanding the latent facts is vital for policymakers for better strategic measures and decision-making. In this paper, we have applied five different machine learning algorithms to predict the retail price of rice, find out the top-most factors responsible for the price hike, and determine the best model that produces the highest prediction results. Leveraging six evaluation metrics, we found that random forest produces the best result, with an explained variance score of 0.87 and an R2 score of 0.86, whereas gradient boosting produces the least, meanwhile discovering that average wind speed is the topmost reason for the rice price hike in retail markets.
    Keywords: data mining; rice price prediction; pattern mining; regression; retail markets.
    DOI: 10.1504/IJBIDM.2022.10035542
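    As an illustrative sketch of the evaluation reported above (not the paper's meteorological features or Bangladeshi price data), the snippet below scores a random forest regressor with the explained variance and R² metrics and lists the most important features.

```python
# Hedged sketch: random forest regression scored with explained variance and R^2,
# the two headline metrics quoted above; data and features are synthetic stand-ins.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import explained_variance_score, r2_score

X, y = make_regression(n_samples=800, n_features=6, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

model = RandomForestRegressor(n_estimators=200, random_state=1).fit(X_tr, y_tr)
pred = model.predict(X_te)
print("explained variance:", round(explained_variance_score(y_te, pred), 3))
print("R^2:", round(r2_score(y_te, pred), 3))
print("top factors by importance:", model.feature_importances_.argsort()[::-1][:3])
```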
     
  • Privacy Preservation of the User Data and Properly Balancing Between Privacy and Utility   Order a copy of this article
    by N. Yuvaraj, K. Praghash, T. Karthikeyan 
    Abstract: Privacy and utility are trade-off factors, where the performance of one factor must be sacrificed to achieve the other. If privacy is achieved without publishing the data, then efficient utility cannot be achieved; hence the original dataset tends to get published without privacy. Therefore, it is essential to maintain the equilibrium between the privacy and utility of datasets. In this paper, we propose a new privacy-utility method, where privacy is maintained by lightweight elliptical curve cryptography (ECC), and utility is maintained through ant colony optimisation (ACO) clustering. Initially, the datasets are clustered using ACO and then the privacy of the clustered datasets is maintained using ECC. The proposed method has been evaluated on medical datasets and compared with existing methods through several performance metrics such as clustering accuracy, F-measure, data utility, and privacy metrics. The analysis shows that the proposed method obtains improved privacy preservation using the clustering algorithm compared to existing methods.
    Keywords: ant colony optimisation; ACO; elliptical curve cryptography; ECC; privacy preservation; utility.
    DOI: 10.1504/IJBIDM.2022.10035576
     
  • overdisp: An R Package for Direct Detection of Overdispersion in Count Data Multiple Regression Analysis   Order a copy of this article
    by Rafael Freitas Souza, Luiz Paulo Fávero, Patrícia Belfiore, Luiz Corrêa 
    Abstract: In multiple areas, log-linear count data regression is one of the most popular techniques for predictive modelling where there is a non-negative discrete quantitative dependent variable. In order to ensure that the inferences from the use of count data models are appropriate, researchers may choose between the estimation of a Poisson model and a negative binomial model, and the correct decision for prediction from a count data estimation is directly linked to the existence of overdispersion of the dependent variable, conditional on the explanatory variables. That said, the overdisp() command is a contribution to researchers, providing a fast and secure solution for the detection of overdispersion in count data. Real and simulated data were used to test the proposed solution, which proved to be computationally efficient, with no difference in the detection of overdispersion compared to the test postulated by the cited authors.
    Keywords: overdispersion; detection of overdispersion; count data; multiple regression analysis; non-negative discrete quantitative dependent variable; Poisson model; negative binomial model; R package.
    DOI: 10.1504/IJBIDM.2022.10035616
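    overdisp() is an R package; the Python sketch below is only a rough analogue of the kind of check it automates, running a Cameron-Trivedi style auxiliary regression for overdispersion after a Poisson fit. This is a generic test, not necessarily the exact procedure the package implements.

```python
# Hedged sketch of an overdispersion check after a Poisson regression:
# a Cameron-Trivedi style auxiliary OLS of ((y - mu)^2 - y) / mu on mu, without intercept.
# This is a generic illustration, not a reimplementation of the overdisp R package.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=500)
mu_true = np.exp(0.5 + 0.8 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + mu_true))   # deliberately overdispersed counts

X = sm.add_constant(x)
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
mu_hat = poisson_fit.mu

aux_y = ((y - mu_hat) ** 2 - y) / mu_hat
aux_fit = sm.OLS(aux_y, mu_hat).fit()                  # no intercept: regressor is mu_hat itself
print("alpha estimate:", round(aux_fit.params[0], 3))
print("p-value (H0: no overdispersion):", round(aux_fit.pvalues[0], 4))
```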
     
  • Web mining based on word-centric search with clustering approach using MLP-PSO hybrid   Order a copy of this article
    by Reza Samizadeh, Samaneh Tafahomi 
    Abstract: With the development of the web, sometimes when keeping track of information on the web the semantic meaning of words is not important, and the mere presence of words in the text is enough to extract information. In this research, a word-centric search method is presented to prepare web data for clustering. Multi-layer perceptron (MLP) networks are among the most successful neural networks for learning, clustering and prediction. The researcher clusters the web data from the word-centric search method by using the K-means method and considers the results of clustering as the expected output of the MLP neural network. The weights of the neural network are selected randomly and may not be optimal after network training; therefore, a particle swarm optimisation algorithm is applied, and its effect on the performance of the final neural network is investigated in the training and initial weighting steps.
    Keywords: web mining; clustering; multi-layer perceptron neural networks; particle swarm optimisation algorithm.
    DOI: 10.1504/IJBIDM.2022.10035725
     
  • Estimating Cluster Validity Using Compactness Measure and Overlap Measure for Fuzzy Clustering   Order a copy of this article
    by Bindu Rani, Shri Kant 
    Abstract: Cluster analysis discovers valuable patterns in data by partitioning n data points into a valid number of clusters. A cluster validity index (CVI) helps in selecting the best partition that fits the underlying structure of the data. After presenting a brief review of existing CVIs, this study formulates a competent overlap-compactness validity index (OCVI). The proposed index combines the overlap measure of Kim et al. (2004b) with a compactness measure. The compactness measure considers the geometrical aspects of the membership matrix (U) through cluster centres, with an approach to reduce its monotonic tendency. The overlap measure calculates the average value of the overlapping degree of all probable pairs of fuzzy clusters. Experiments are implemented on two artificial, two real and one biological dataset. Comparison of the partition coefficient, partition entropy, modified partition coefficient, Xie-Beni and Kim indices with the suggested index (OCVI) implies that the suggested index outperforms the other validity indices, with maximum compactness and minimum overlap.
    Keywords: cluster validity index; CVI; clustering; fuzzy clustering; fuzzy c-means algorithm.
    DOI: 10.1504/IJBIDM.2022.10036057
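    The proposed OCVI is not fully specified in the abstract; as a sketch of two of the classical baselines it is compared against, the snippet below computes Bezdek's partition coefficient and partition entropy from a fuzzy membership matrix.

```python
# Hedged sketch of two classical fuzzy cluster validity indices used as baselines above:
# Bezdek's partition coefficient (higher is better) and partition entropy (lower is better).
import numpy as np

def partition_coefficient(U):
    """U has shape (n_samples, n_clusters); each row sums to 1."""
    return float(np.mean(np.sum(U ** 2, axis=1)))

def partition_entropy(U, eps=1e-12):
    return float(-np.mean(np.sum(U * np.log(U + eps), axis=1)))

# Toy membership matrix for 4 samples and 2 fuzzy clusters (illustrative).
U = np.array([[0.9, 0.1],
              [0.8, 0.2],
              [0.2, 0.8],
              [0.5, 0.5]])
print("partition coefficient:", round(partition_coefficient(U), 3))
print("partition entropy:", round(partition_entropy(U), 3))
```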
     
  • SHOMAN: An Efficient Method for Finding the Important Nodes in a Network   Order a copy of this article
    by Shivam Bathla, Omprakash Sah, Anurag Singh 
    Abstract: In this paper, we propose and study the SHOMAN metric for determining the importance of a node in a network. The method is based on a significant feature of networks, namely the clustering coefficient, and uses the principle of six degrees of separation to go through only up to six nodes to find the node influence score. We demonstrate that our algorithm is highly effective at calculating the importance of nodes when compared to other centrality measures. We also propose that our method can be used in viral marketing and in controlling disease spreading.
    Keywords: six-degrees of separation; clustering coefficient; robustness; centrality; viral marketing; disease spreading.
    DOI: 10.1504/IJBIDM.2021.10036357
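    The SHOMAN score itself is not given in the abstract; the networkx snippet below only illustrates its two stated ingredients: the local clustering coefficient of a node and the neighbourhood reachable within six hops.

```python
# Hedged sketch of the two ingredients named above, using networkx:
# the local clustering coefficient and the neighbourhood reachable within six hops.
# This is not the SHOMAN score itself, whose formula the abstract does not give.
import networkx as nx

G = nx.karate_club_graph()                       # small benchmark network
node = 0
cc = nx.clustering(G, node)                      # local clustering coefficient
within_six = nx.single_source_shortest_path_length(G, node, cutoff=6)
print(f"node {node}: clustering coefficient = {cc:.3f}, "
      f"nodes within six hops = {len(within_six) - 1}")   # exclude the node itself
```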
     
  • Solving restriction of Bayesian network in giving domain knowledge by introducing factor nodes   Order a copy of this article
    by Yutaka Iwakami, Hironori Takuma, Motoi Iwashita 
    Abstract: A Bayesian network is a probabilistic inference model that is effective for decision-making in business, such as product development. Multiple events are represented as oval nodes and their relationships are drawn as edges among them. However, in order to obtain a sufficient effect, it is necessary to appropriately configure domain knowledge, for example that more customer responses to the product lead to more clarity of the requirements for products. Such domain knowledge is configured as an edge connecting nodes, but in some cases the structural constraints of the Bayesian network prevent this configuration. In this study, the authors propose a method to avoid this constraint by introducing redundant factor nodes generated by applying factor analysis to the data related to the domain knowledge. With this approach more domain knowledge can be applied to the Bayesian network, and the accuracy of decision-making in business is expected to be improved.
    Keywords: model improvement; data extraction; data driven insight; probabilistic inference; decision-making; product development; Bayesian network; factor analysis; key goal indicator.
    DOI: 10.1504/IJBIDM.2022.10036731
     
  • Customer Segmentation Using Various Machine Learning Techniques   Order a copy of this article
    by Samyuktha Palangad Othayoth, Raja Muthalagu 
    Abstract: In the retail industry and marketing, customer segmentation is one of the most important tasks. Proper customer segmentation can help managers to enhance the quality of products and provide better services for the targeted segments. Various machine learning algorithm-based customer segmentation techniques are used to gain insight into customers’ behaviour and the potential customers that could be targeted to maximise profit. Based on previous studies, this paper proposes improved machine learning models for customer segmentation in e-commerce. Agglomerative clustering algorithms have been implemented to segment the customers with a new metric for customer behaviour. We have also proposed a systematic approach for combining an agglomerative clustering algorithm and a filtering-based recommender system to improve customer experience and customer retention. In the experiment, the results were compared with a K-means clustering model, and it was found that BLS greatly reduced training time while guaranteeing accuracy.
    Keywords: customer segmentation; agglomerative clustering algorithms; machine learning algorithms; K-means.
    DOI: 10.1504/IJBIDM.2022.10036753
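    As a rough sketch of the comparison described above (the paper's behaviour metric and recommender stage are not reproduced), the snippet below runs agglomerative clustering and k-means on standardised synthetic RFM-style features and compares them by silhouette score; the features and cluster count are assumptions.

```python
# Hedged sketch: agglomerative clustering vs. k-means on synthetic RFM-style features,
# compared with the silhouette score. Features and cluster count are illustrative.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)
# Columns: recency (days), frequency (orders), monetary (spend) -- illustrative only.
rfm = np.column_stack([rng.exponential(30, 400),
                       rng.poisson(5, 400),
                       rng.gamma(2.0, 50.0, 400)])
X = StandardScaler().fit_transform(rfm)

for name, model in [("agglomerative", AgglomerativeClustering(n_clusters=4)),
                    ("k-means", KMeans(n_clusters=4, n_init=10, random_state=5))]:
    labels = model.fit_predict(X)
    print(f"{name}: silhouette = {silhouette_score(X, labels):.3f}")
```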
     
  • A Unified Workflow Strategy for Analysing Large Scale TripAdvisor Reviews with BOW Model   Order a copy of this article
    by Jale Bektas, Arwa Abdalmajed 
    Abstract: Nowadays, firms need to transform customer online reviews data properly into information to achieve goals such as having a competitive edge and improving the quality of service. This paper presents a unified workflow to solve the problems of analysing large-scale data with 710,450 reviews for 1,134 hotels by using text mining methods among the different touristic regions of Turkey. Firstly, a star schema dimensional data mart is built that includes one fact table and two dimensional tables. Then, a series of text mining processes which includes data cleaning, tokenisation, and analysis are applied. Text mining is implemented through standard BOW and the extended BON model. The results show significant findings through this workflow. We propose to build a dimensional model dataset before performing any text mining process, since building such a dataset will optimise the data retrieval process and help to represent the data along with different measures of interest.
    Keywords: online TripAdvisor reviews; text mining; big data; N-gram tokenisation; dimensional data mart; data mining; BOW; BON.
    DOI: 10.1504/IJBIDM.2022.10037062
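    A minimal sketch of the standard BOW versus n-gram (BON) tokenisation contrast mentioned above, using scikit-learn's CountVectorizer on a few invented review snippets rather than the paper's TripAdvisor data mart.

```python
# Hedged sketch: standard bag-of-words vs. bag-of-n-grams tokenisation
# on a few invented review snippets (not the paper's TripAdvisor data).
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["great sea view and friendly staff",
           "friendly staff but small rooms",
           "great location, sea view rooms"]

bow = CountVectorizer(ngram_range=(1, 1)).fit(reviews)   # unigrams only (BOW)
bon = CountVectorizer(ngram_range=(1, 2)).fit(reviews)   # unigrams + bigrams (extended model)

print("BOW vocabulary size:", len(bow.vocabulary_))
print("BON vocabulary size:", len(bon.vocabulary_))
print("sample bigrams:", [t for t in bon.vocabulary_ if " " in t][:5])
```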
     
  • Text Mining for Opinion Analysis: The Case of Recent Flood of Iran on Twitter   Order a copy of this article
    by Reza Kamranrad, Ali Jozi, Ehsan Mardan 
    Abstract: Sentiment analysis relates to the study and understanding of emotions and beliefs in a particular text, and this analysis gives us a lot of information. Twitter has been a popular social network in recent years, in which users express their opinions and feelings about various topics. By analysing this information, we can get an overview of public opinion about any particular topic. The classification of information is effective in understanding it, and we also cluster the information. In this article, we analyse Twitter data to monitor the emotions of people about the recent flood events in Iran.
    Keywords: text mining; Twitter; sentiment analysis; machine learning; language processing; NLP; Python; clustering.
    DOI: 10.1504/IJBIDM.2022.10037064
     
  • Apriori-Roaring: Frequent Pattern Mining Method Based on Compressed Bitmaps   Order a copy of this article
    by Alexandre Colombo, Roberta Spolon, Aleardo Junior Manacero, Renata Spolon Lobato, Marcos Antônio Cavenaghi 
    Abstract: Association rule mining is one of the most common tasks in data analysis. It has a descriptive feature used to discover patterns in sets of data. Most existing approaches to data analysis have a constraint related to execution time. However, as the size of datasets used in the analysis grows, memory usage tends to be the constraint instead, and this prevents these approaches from being used. This article presents a new method for data analysis called apriori-roaring. The apriori-roaring method is designed to identify frequent items with a focus on scalability. The implementation of this method employs compressed bitmap structures, which use less memory to store the original dataset and to calculate the support metric. The results show that apriori-roaring allows the identification of frequent elements with much lower memory usage and shorter execution time. In terms of scalability, our proposed approach outperforms the various traditional approaches available.
    Keywords: frequent pattern mining; bitmap compression; data mining; association rules; knowledge discovery.
    DOI: 10.1504/IJBIDM.2022.10037305
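    The paper uses Roaring compressed bitmaps; the dependency-free sketch below shows only the underlying idea, with plain Python integers standing in for the compressed structures: one bit per transaction per item, and support computed by AND-ing bitmaps.

```python
# Hedged sketch of bitmap-based support counting, the core idea behind apriori-roaring.
# Plain Python integers stand in for Roaring bitmaps: bit i is set if transaction i
# contains the item; the support of an itemset is the popcount of the AND of its bitmaps.
transactions = [{"bread", "milk"},
                {"bread", "butter"},
                {"milk", "butter", "bread"},
                {"milk"}]

bitmaps = {}
for tid, items in enumerate(transactions):
    for item in items:
        bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)

def support(itemset):
    combined = ~0
    for item in itemset:
        combined &= bitmaps.get(item, 0)
    return bin(combined & ((1 << len(transactions)) - 1)).count("1")

print(support({"bread"}))           # 3
print(support({"bread", "milk"}))   # 2
```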
     
  • Financial accounts reconciliation systems using enhanced mapping algorithm   Order a copy of this article
    by Olufunke Oluyemi Sarumi, Bolanle A. Ojokoh, Oluwafemi A. Sarumi, Olumide S. Adewale 
    Abstract: Account reconciliation has become a daunting task for many financial organisations due to the heterogeneity of the data involved in the accounts reconciliation process, coupled with the recent data deluge in many accounting firms. Many organisations are using heuristic-based algorithms for their account reconciliation process, while in some firms the process is completely manual. These methods are already overwhelmed and no longer efficient in light of the recent data explosion and, as such, are prone to many errors that could expose the organisations to several financial risks. In this regard, there is a need to develop a robust financial data analytic algorithm that can effectively handle the account reconciliation needs of financial organisations. In this paper, we propose a computational data analytic model that provides an efficient solution to the account reconciliation bottlenecks in financial organisations. Evaluation results show the effectiveness of our data analytic model for enhancing faster decision making in financial account reconciliation systems.
    Keywords: accounts reconciliation; financial analytics; functions; fraud; big data.
    DOI: 10.1504/IJBIDM.2022.10037414
     
  • Privacy Preserving Data Mining - Past and Present   Order a copy of this article
    by G. Sathish Kumar, K. Premalatha 
    Abstract: Data mining is the process of discovering patterns and correlations within a huge volume of data to forecast outcomes. Serious challenges occur in data mining techniques due to privacy violations and sensitive information disclosure when providing the dataset to third parties. It is necessary to protect users’ private and sensitive data from exposure without the authorisation of data holders or providers when extracting useful information and revealing patterns from the dataset. Also, internet phishing poses a further threat of widespread disclosure of private information over the web. Privacy preserving data mining (PPDM) is essential for exchanging confidential information in terms of data analysis, validation, and publishing. To achieve data privacy, a number of algorithms have been designed in the data mining sector. This article delivers a broad survey of privacy preserving data mining algorithms and the different datasets used in the research, and analyses the techniques based on certain parameters. The survey highlights the outcome of each piece of research along with its advantages and disadvantages, and will guide future researchers in PPDM in choosing the appropriate techniques for their research.
    Keywords: data mining; privacy preserving data mining; PPDM; privacy preserving techniques; sensitive attributes; privacy threats.
    DOI: 10.1504/IJBIDM.2022.10037595
     
  • STEM: STacked Ensemble Model design for aggregation technique in Group Recommendation System   Order a copy of this article
    by Nagarajan Kumar, P. Arun Raj Kumar 
    Abstract: A group recommendation system is required to provide a list of recommended items to a group of users. The challenge lies in aggregating the preferences of all members in a group to provide well-suited suggestions. In this paper, we propose an aggregation technique using stacked ensemble model (STEM). STEM involves two stages. In stage 1, the k-nearest neighbour (k-NN), singular value decomposition (SVD), and a combination of user-based and item-based collaborative filtering is used as base learners. In the second stage, the decision trees predictive model is used to aggregate the outputs obtained from the base learners by prioritising the most preferred items. From the experiments, it is evident that STEM provides a better group recommendation strategy than the existing techniques.
    Keywords: group recommendation system; aggregating user preferences; decision trees; stacked ensemble; machine learning.
    DOI: 10.1504/IJBIDM.2022.10037757
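    The recommender-specific base learners (k-NN, SVD, collaborative filtering) are not reproduced here; as a generic sketch of the stacking pattern the abstract describes, scikit-learn's StackingRegressor combines a k-NN model and a linear model, with a decision tree as the second-stage learner, on synthetic data.

```python
# Hedged sketch of the stacking pattern: base learners whose predictions are
# aggregated by a decision-tree meta-learner. A generic regression stand-in,
# not the paper's group-recommendation pipeline.
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=2)

stack = StackingRegressor(
    estimators=[("knn", KNeighborsRegressor(n_neighbors=10)), ("ridge", Ridge())],
    final_estimator=DecisionTreeRegressor(max_depth=4, random_state=2),
)
print("stacked R^2:", round(cross_val_score(stack, X, y, cv=5).mean(), 3))
```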
     
  • Convolutional Neural Network for Classification of SiO2 Scanning Electron Microscope Images   Order a copy of this article
    by Kavitha Jayaram, G. Prakash, V. Jayaram 
    Abstract: Recent developments in deep learning have made image and speech classification and recognition tasks possible with better accuracy. An attempt was made to automatically extract the required sections from literature published in journals to analyse and classify them according to their application. This paper presents the classification of high-temperature materials into four categories according to their wide applications, such as electronics, high temperature, semiconductors, and ceramics. The challenging task is to extract the unique features of SEM images, as they are microscopic with different resolutions. A total of 10,000 Scanning Electron Microscope (SEM) images are classified into two labelled categories, namely crystalline and amorphous structures. The image classification and recognition process for SiO2 was implemented using a Convolutional Neural Network (CNN) deep learning framework. Our algorithm successfully classified the test dataset of SEM images with a precision of 96% and an accuracy of 95.5%.
    Keywords: deep learning; machine learning; image classification; convolution neural network; CNN; material.
    DOI: 10.1504/IJBIDM.2022.10038244
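    The paper's network architecture and SEM dataset are not given in the abstract; the Keras sketch below is a generic small CNN for a binary (crystalline vs. amorphous) image classification task, with the input size, layers and hyperparameters chosen as illustrative assumptions.

```python
# Hedged sketch: a small binary-classification CNN in Keras, in the spirit of the
# crystalline-vs-amorphous SEM classifier described above. Architecture, input size
# and hyperparameters are illustrative assumptions, not the paper's network.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(128, 128, 1)),            # grayscale SEM-style images
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),        # crystalline vs. amorphous
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(train_images, train_labels, validation_split=0.2, epochs=10)  # with real data
```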
     
  • Rule-based Database Intrusion Detections Using Coactive Artificial Neuro-Fuzzy Inference System and Genetic Algorithm   Order a copy of this article
    by Anitarani Brahma, Suvasini Panigrahi, Neelamani Samal, Debasis Gountia 
    Abstract: Recently, fuzzy systems having learning and adaptation capabilities have been gaining a lot of interest in research communities. In the current approach, two of the most successful soft computing approaches with learning capabilities, neural networks and genetic algorithms, are hybridised with the approximate reasoning method of fuzzy systems. The objective of this paper is to develop a coactive neuro-fuzzy inference system with a genetic algorithm-based database intrusion detection system that can detect malicious transactions in a database very efficiently. Experimental investigation and comparative assessment have been conducted with an existing statistical database intrusion technique to justify the efficacy of the proposed system.
    Keywords: fuzzy inference system; database intrusion detection; neural network; genetic algorithm; artificial neuro-fuzzy inference system; coactive artificial neuro-fuzzy inference system.
    DOI: 10.1504/IJBIDM.2022.10038259
     
  • A Regression Model to Evaluate Interactive Question Answering using GEP   Order a copy of this article
    by Mohammad Mehdi Hosseini 
    Abstract: Evaluation plays a pivotal role in interactive question answering (IQA) systems. However, much uncertainty still exists on how to evaluate IQA systems, and there is practically no specific methodology to evaluate these systems. One of the main challenges in designing an assessment method for IQA systems lies in the fact that it is rarely possible to predict the interaction part; to this end, humans need to be involved in the evaluation process. In this paper, an appropriate model is presented by introducing a set of characteristic features for evaluating IQA systems. Data were collected from four IQA systems over various timespans. For the purpose of analysis, pre-processing is performed on each conversation, and the statistical characteristics of the conversations are extracted to form the characteristics matrix. The characteristics matrix is classified into three separate clusters using K-means. Then, an equation is allotted to each of the clusters with an application of gene expression programming (GEP). The results reveal that the proposed model has the least error, with an average root mean square error of 0.09 between the real data and the GEP model.
    Keywords: evaluation; interactive question; answering systems; nonlinear regression; gene expression programming; GEP; feature extraction.
    DOI: 10.1504/IJBIDM.2022.10038261
     
  • Examining the impact of business intelligence related practices on organizational performance in Oman   Order a copy of this article
    by Robin Zarine, Muhammad Saqib 
    Abstract: Business intelligence can greatly enhance organisational capabilities in devising profitable business actions and activities. It provides an understanding of both current and future trends relating to customers, markets, competitors, or regulation, and most importantly, an understanding of organisations’ own capabilities to compete. Business intelligence is arguably one of the key drivers of organisational competitiveness. This paper examines the extent to which organisations in Oman embrace business intelligence and the contributions of the different business intelligence components to organisational performance. A quantitative empirical approach is used, with the Microsoft Excel data analysis tool pack as the investigative tool, to analyse and develop a regression model to better understand the impact of business intelligence related components on organisational performance. The findings show a strong correlation between business intelligence and organisational performance. They also show that having the right IT functionalities, with capable employees using them, is the key to performance enhancement. Furthermore, having IT infrastructure without the appropriate functionalities and personnel, or not embracing business intelligence, will not result in any performance gain.
    Keywords: business intelligence; business intelligence components; organisational performance; Oman.
    DOI: 10.1504/IJBIDM.2022.10038337
     
  • Next location prediction using Transformers   Order a copy of this article
    by Salah Eddine Henouda, Laallam Fatima Zohra, Okba KAZAR, Abdessamed Sassi 
    Abstract: This work seeks to solve the next location prediction problem for mobile users. Chiefly, we focus on the RoBERTa architecture (robustly optimised BERT approach) in order to build a next location prediction model through the use of a subset of a large real mobility trace database. The latter was made available to the public through the CRAWDAD project. RoBERTa, which is a well-known model in natural language processing (NLP), works intentionally on predicting hidden sections of text based on a language masking strategy. The current paper follows a similar architecture to RoBERTa and proposes a new combination of the BERT WordPiece tokeniser and RoBERTa for location prediction that we call WP-BERTA. The results demonstrated that our proposed model WP-BERTA outperformed the state-of-the-art models. They also indicated that the proposed model provided a significant improvement in next location prediction accuracy compared to the state-of-the-art models. We particularly revealed that WP-BERTA outperformed Markovian models, support vector machines (SVM), convolutional neural networks (CNNs), and long short-term memory networks (LSTMs).
    Keywords: machine learning; deep learning; transformer; neural networks; Wi-Fi; mobility traces; next location prediction; big data.
    DOI: 10.1504/IJBIDM.2022.10038854
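    A sketch in the spirit of WP-BERTA: a WordPiece tokeniser trained on location-ID "sentences" feeding a RoBERTa-style masked language model. The file name, vocabulary size and model dimensions are assumptions, and the training loop is omitted, so the prediction below is only structural.
```python
# Next-location prediction framed as masked-token prediction (untrained illustration).
import torch
from tokenizers import BertWordPieceTokenizer
from transformers import RobertaConfig, RobertaForMaskedLM

# 1) Train a WordPiece vocabulary over mobility traces (one trace of AP IDs per line).
tok = BertWordPieceTokenizer(lowercase=False)
tok.train(files=["mobility_traces.txt"], vocab_size=5000)

# 2) A small RoBERTa-style masked LM over that vocabulary (weights random; training omitted).
config = RobertaConfig(vocab_size=tok.get_vocab_size(), hidden_size=256,
                       num_hidden_layers=4, num_attention_heads=4)
model = RobertaForMaskedLM(config)

# 3) Append a [MASK] token to a trace and read off the most likely location for it.
prefix = tok.encode("AP17 AP03 AP22")
input_ids = torch.tensor([prefix.ids + [tok.token_to_id("[MASK]")]])
with torch.no_grad():
    logits = model(input_ids).logits
print("predicted next location token:", tok.id_to_token(int(logits[0, -1].argmax())))
```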
     
  • Supervised and Unsupervised learning for characterizing the industrial material defects   Order a copy of this article
    by P. Radha, N. Selvakumar, J. Raja Sekar, J.V. Johnsonselva 
    Abstract: Ultrasonic-based NDT is used in industry to examine internal defects without damaging components, since the materials used in industrial-standard components must be flawless. Ultrasonic signals are difficult to interpret, and a domain expert has to concentrate on every sampling point to identify a defect. Hence, the existing ultrasonic NDT method is improved by applying IoT, machine learning and deep learning techniques to process the ultrasonic data. This work integrates NDT and IoT to analyse the properties of materials using a deep-learning-based supervised model and to filter outliers using an unsupervised model, namely a density-based clustering method. After the different categories of defect are analysed, notifications are sent to the various stakeholders through their mobile devices, using advanced communication techniques, so that defective components can be repaired or replaced without expensive experimentation or maintenance.
    Keywords: ultrasonic testing; internet of things; IoT; machine learning; density based clustering; deep learning; deep neural network; DNN.
    DOI: 10.1504/IJBIDM.2022.10039148
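    A minimal sketch of the two-stage idea in the abstract: filter outlying feature vectors with a density-based clusterer, then train a supervised neural model on the retained samples. The data are synthetic stand-ins; ultrasonic feature extraction and the IoT layer are not shown.
```python
# Density-based outlier filtering followed by supervised defect classification.
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: two summary features per ultrasonic A-scan, three defect categories.
X, y = make_blobs(n_samples=600, centers=3, cluster_std=0.8, random_state=0)

# Unsupervised stage: DBSCAN labels low-density points -1; treat them as outliers.
noise = DBSCAN(eps=0.5, min_samples=5).fit_predict(X) == -1
X_clean, y_clean = X[~noise], y[~noise]

# Supervised stage: a small fully connected network classifies the defect category.
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
clf.fit(X_clean, y_clean)
print(f"filtered {noise.sum()} outliers; training accuracy = {clf.score(X_clean, y_clean):.2f}")
```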
     
  • Detection of Suspicious Text Messages and Profiles using Ant Colony Decision Tree Approach   Order a copy of this article
    by Asha Kumari, Balkishan 
    Abstract: The ease of human communication through short messaging services (SMS) and social networking has also attracted suspicious activities that menace legitimate users. Unsolicited or uninvited messages that can lead to rumours, spam, malicious behaviour or other threatening activities are termed suspicious. This work combines attributes of the ant colony optimisation (ACO) approach with a decision tree for the detection of suspicious content and profiles (ACDTDSCP). In the ACDTDSCP approach, the construction of the decision tree and the splitting of its nodes are based on the attributes chosen by each ant according to the pheromone trail and a heuristic function. The experimentation is conducted on two Twitter datasets (the Social Honeypot dataset and the 1KS-10KN dataset) and two SMS text corpora (SMS Spam Collection v.1 and SMS Spam Corpus v.0.1 Big). The experimental results indicate the efficacy and potential of the proposed ACDTDSCP approach.
    Keywords: ant colony optimisation; ACO; decision tree; suspicious messages; spam; short message service; SMS; Twitter microblogs.
    DOI: 10.1504/IJBIDM.2021.10039529
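    A toy sketch of ACO-guided split selection in the spirit of the abstract: an ant picks the split attribute with probability proportional to pheromone^alpha times heuristic^beta, with information gain as the heuristic. The data, constants and pheromone update are illustrative; full tree construction and SMS feature extraction are omitted.
```python
# Pheromone- and information-gain-weighted choice of a decision-tree split attribute.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(X, y, attr):
    """Information gain of a binary attribute column `attr`."""
    gain = entropy(y)
    for value in (0, 1):
        mask = X[:, attr] == value
        if mask.any():
            gain -= mask.mean() * entropy(y[mask])
    return gain

def ant_pick_attribute(X, y, pheromone, rng, alpha=1.0, beta=2.0):
    heur = np.array([information_gain(X, y, j) for j in range(X.shape[1])])
    weights = (pheromone ** alpha) * ((heur + 1e-9) ** beta)
    probs = weights / weights.sum()
    return rng.choice(len(probs), p=probs), heur

# Example: binary keyword-presence features for messages, label 1 = suspicious (synthetic).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6))
y = (X[:, 2] | X[:, 4]).astype(int)
pheromone = np.ones(X.shape[1])

chosen, heur = ant_pick_attribute(X, y, pheromone, rng)
pheromone *= 0.9                    # evaporation
pheromone[chosen] += heur[chosen]   # reinforce the attribute the ant chose
print("ant chose attribute", chosen, "pheromone:", np.round(pheromone, 2))
```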
     
  • Performance evaluation of outlier rules for labelling outliers in multidimensional dataset   Order a copy of this article
    by Kelly C. Ramos Da Silva, Helder L. Costa De Oliveira, André C.P.L.F. De Carvalho 
    Abstract: The output of an outlier detection algorithm applied to a multidimensional dataset usually consists of scores defining the level of abnormality of each instance. However, this process in itself does not identify the outlying instances; for this purpose, it is common to use an outlier rule to convert outlier scores into labels. The problem is therefore to determine an appropriate outlier rule based only on patterns in the scores. To deal with this problem, we studied and evaluated several traditional robust outlier rules following a pragmatic approach. The analysis of the results was facilitated by an evaluation measure that we developed, which proved more effective than traditional measures involving only true positive and true negative rates. Using this measure, we were able to study the behaviour of different outlier rules whose performance was evaluated under varying skewness and contamination levels.
    Keywords: outlier detection; outlier rule; evaluation measure; boxplot; adjusted boxplot; k-NN.
    DOI: 10.1504/IJBIDM.2021.10038643
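    A minimal sketch of one outlier rule from the keywords: converting k-NN outlier scores into binary labels with the classical boxplot (Tukey) fence. The data and planted outliers are synthetic; the paper's adjusted boxplot and its own evaluation measure are not shown.
```python
# k-NN outlier scores converted to labels with the boxplot rule.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(290, 4)),
               rng.normal(loc=6.0, size=(10, 4))])   # 10 planted outliers

# Outlier score: distance to the k-th nearest neighbour (first column is the point itself).
k = 5
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
scores = dist[:, -1]

# Boxplot rule: anything above Q3 + 1.5 * IQR is labelled an outlier.
q1, q3 = np.percentile(scores, [25, 75])
labels = scores > q3 + 1.5 * (q3 - q1)
print(f"{labels.sum()} instances labelled as outliers")
```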
     
  • Business intelligence: fuzzy logic in the risk client analysis   Order a copy of this article
    by Jorge Morris Arredondo, Victor Escobar-Jeria, Juan Luis Castro Peña 
    Abstract: This paper focuses on achieving accurate results from rough data. Using an inference model based on fuzzy logic, human reasoning is emulated, under certain conditions, in order to deal with the possibility of client loss due to service quality. The experimentation is carried out on information relating to complaints received over a period of two years (70,000 records). For that purpose, a prototype program is implemented in C++, which receives as input the crisp values resulting from the failure resolution of each relevant service. The proposed model is intended to classify clients according to the risk they present to the contractual relationship with the company.
    Keywords: business intelligence; fuzzy logic; soft computing; decision.
    DOI: 10.1504/IJBIDM.2021.10030559
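    A hedged illustration of the fuzzy-inference idea (the authors' prototype is in C++; Python is used here for consistency with the other sketches). Crisp failure-resolution times are fuzzified with triangular membership functions and mapped to a risk score; the membership breakpoints and rule base are invented.
```python
# Triangular fuzzification of resolution time and a simple weighted-average defuzzification.
def triangular(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def client_risk(resolution_hours):
    # Fuzzify the crisp input: how "fast", "acceptable" or "slow" the failure resolution was.
    fast = triangular(resolution_hours, -1, 0, 12)
    acceptable = triangular(resolution_hours, 6, 24, 48)
    slow = triangular(resolution_hours, 24, 72, 1000)

    # Invented rules: fast -> low risk (0), acceptable -> medium (50), slow -> high (100),
    # combined by a weighted average as a crude defuzzification.
    weight = fast + acceptable + slow
    return (fast * 0 + acceptable * 50 + slow * 100) / weight if weight else 0.0

for hours in (2, 30, 96):
    print(f"resolution in {hours:>3} h -> risk score {client_risk(hours):.1f}")
```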
     
  • MapReduce fuzzy C-means ensemble clustering with gentle AdaBoost for big data analytics   Order a copy of this article
    by K.M. Padmapriya, B. Anandhi, M. Vijayakumar 
    Abstract: Big data clustering is one of the significant processes employed in numerous application domains. Existing clustering algorithms do not cope well with large-scale data, resulting in a higher false positive rate. In order to cluster such large datasets with higher accuracy, the MapReduce gradient descent gentle AdaBoost clustering (MGDGAC) technique is proposed. The MGDGAC technique designs MapReduce fuzzy C-means (MFCM) clustering, in which the large dataset is initially subdivided into a number of chunks that are processed in parallel on different nodes to perform the clustering in minimal time. Data with larger membership values are grouped into clusters with the help of the mappers. The reducer in MFCM clustering then re-estimates the centroid values, which are fed back to the mappers iteratively until a given number of iterations is reached and similar data are grouped together. Finally, the MGDGAC technique applies gentle AdaBoost with the intention of reducing the training error of large-data clustering.
    Keywords: big data clustering; MapReduce; gradient descent; gentle AdaBoost; fuzzy C-means.
    DOI: 10.1504/IJBIDM.2021.10038642
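    A compact sketch of one MapReduce-style fuzzy C-means round as described in the abstract: each mapper computes memberships and partial centroid sums for its chunk, and the reducer aggregates them into new centroids. Chunking is simulated with an in-memory list, the data and fuzzifier are invented, and the gentle AdaBoost stage is not shown.
```python
# One map/reduce round of fuzzy C-means, repeated for a fixed number of iterations.
import numpy as np

def fcm_map(chunk, centroids, m=2.0):
    """Mapper: fuzzy memberships for one chunk plus partial numerator/denominator sums."""
    d = np.linalg.norm(chunk[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
    u = 1.0 / (d ** (2 / (m - 1)))
    u /= u.sum(axis=1, keepdims=True)          # memberships: each row sums to 1
    um = u ** m
    return um.T @ chunk, um.sum(axis=0)        # (k x dim numerator, length-k denominator)

def fcm_reduce(partials):
    """Reducer: combine the mappers' partial sums into updated centroids."""
    num = sum(p[0] for p in partials)
    den = sum(p[1] for p in partials)
    return num / den[:, None]

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 3))
chunks = np.array_split(data, 8)               # stand-in for distributed input splits
centroids = data[rng.choice(len(data), size=4, replace=False)]

for _ in range(10):                            # a few full map/reduce rounds
    centroids = fcm_reduce([fcm_map(c, centroids) for c in chunks])
print(np.round(centroids, 2))
```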
     
  • A parallel approach for web session identification to make recommendations efficient   Order a copy of this article
    by M.S. Bhuvaneswari, K. Muneeswaran 
    Abstract: A sequence of web pages visited by a client over a particular timeframe is called a session. Web log mining is performed to analyse the behaviour of users through their web access patterns, and session identification is a significant part of constructing the recommendation model. The novelty of this work lies in exploiting the backward moves made by the user, considering both the referrer URL and the requested URL extracted from the extended web log for session identification, which existing heuristic-based approaches do not take into account. Two noteworthy issues in session identification are: 1) producing an excessive number of very short sessions; 2) taking too long to identify the sessions. In the proposed work, the length of the sessions is maximised using a split-and-merge technique, and the time taken for session identification is reduced using thread parallelisation. For efficient storage and retrieval of information, a hash map data structure is used. The proposed work shows significant improvement in performance in terms of execution time, standard error, correlation coefficient and objective value.
    Keywords: extended web server logs; session identification; sequential pattern mining; split and merge technique; multithreaded; hash data structure.
    DOI: 10.1504/IJBIDM.2021.10029835
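    A simplified sketch of referrer-aware session identification run in parallel per user, in the spirit of the abstract. The log records, timeout threshold and field layout are invented, and the split-and-merge refinement is not shown.
```python
# Sessionise per-user log records using a timeout plus a referrer-continuity check,
# with one thread per user and a hash map (dict) holding the per-user records.
from concurrent.futures import ThreadPoolExecutor

TIMEOUT = 30 * 60  # seconds of inactivity before a new session is forced

def sessions_for_user(records):
    """records: time-ordered (timestamp, requested_url, referrer_url) tuples for one user."""
    sessions, current, seen = [], [], set()
    for ts, url, ref in records:
        new_session = (current and ts - current[-1][0] > TIMEOUT) or \
                      (current and ref is not None and ref not in seen)
        if new_session:
            sessions.append(current)
            current, seen = [], set()
        current.append((ts, url, ref))
        seen.add(url)
    if current:
        sessions.append(current)
    return sessions

logs = {  # toy extended-log excerpt keyed by client IP
    "10.0.0.1": [(0, "/home", None), (40, "/a", "/home"), (5000, "/b", "/x")],
    "10.0.0.2": [(0, "/home", None), (20, "/c", "/home")],
}

with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(zip(logs, pool.map(sessions_for_user, logs.values())))
for user, sess in results.items():
    print(user, "->", len(sess), "sessions")
```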
     
  • Extracted information quality, a comparative study in high and low dimensions   Order a copy of this article
    by Leandro Ariza-Jiménez, Luisa F. Villa, Nicolás Pinel, O. Lucia Quintero 
    Abstract: Uncovering interesting groups in either multidimensional or network spaces has become an essential mechanism for data exploration and understanding. Decision making requires relevant information as well as high-quality retrieved conclusions. We present a comparative study of two compact representations drawn from the same set of data objects: one obtained by clustering the high-dimensional space and the other by clustering low-dimensional Barnes-Hut t-distributed stochastic neighbour embeddings. There is no consensus on how the problem should be addressed or how these representations/models should be analysed, because they rest on different notions. We introduce a measure to compare their results and their capability to provide insights into the information retrieved. We consider low-dimensional embeddings a potentially revealing strategy for uncovering dynamics that may not be uncovered in big-data spaces. We demonstrate that a non-guided approach can be as revealing as a user-guided approach for data exploration, and we present coherent results showing good uncertainty-modelling capability in terms of fuzziness and densities.
    Keywords: high-dimensional clustering; BH-SNE embeddings; cluster fuzziness; reliable information; decision making; consistency.
    DOI: 10.1504/IJBIDM.2021.10033994
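    A brief sketch of the kind of comparison the abstract describes: cluster the original high-dimensional data and a 2-D Barnes-Hut t-SNE embedding of it, then compare the two partitions. The dataset is a stand-in, and the adjusted Rand index is used here only as a generic agreement measure, not the paper's own.
```python
# Compare clusterings obtained in the high-dimensional space and in a BH t-SNE embedding.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, _ = load_digits(return_X_y=True)                        # 64-dimensional data
emb = TSNE(n_components=2, method="barnes_hut", random_state=0).fit_transform(X)

labels_high = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
labels_low = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(emb)

print("partition agreement (ARI):", round(adjusted_rand_score(labels_high, labels_low), 3))
```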
     
  • Integrated robotics architecture with Kansei computing and its application   Order a copy of this article
    by Y. Itabashi, Y. Kiyoki 
    Abstract: Robots that communicate with humans have become widely utilised in our society. This type of communication can be limited when only one robot is involved; meanwhile, expressing the same unified mood and emotion is difficult when several robots are working together. Furthermore, realising expressions that represent different cultures in the context of a global society is also challenging. To address these issues, we propose an integrated robot architecture (IRA) providing a communication environment in which various robots and devices can be combined. In this architecture, meta-level messages are communicated between various kinds of devices using a unified communication protocol, and the framework of Kansei computing is integrated to realise non-verbal expression. This method makes it possible to express emotions with colours synchronised with each utterance, according to the characteristics of the device and the required culture. This paper also presents a system prototype and verifies its feasibility.
    Keywords: robotics; emotional expression; communication robot; human-robot communication; Kansei; emotion; traditional colours.
    DOI: 10.1504/IJBIDM.2021.10038641
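    A purely hypothetical sketch of the sort of meta-level message a unified protocol like the one in the abstract might carry: an utterance, an emotion label and a culture-dependent colour serialised as JSON for heterogeneous devices. The field names and colour table are invented and do not reflect the IRA specification.
```python
# A hypothetical meta-level message with emotion- and culture-dependent colour.
import json
from dataclasses import dataclass, asdict

# Invented (emotion, culture) -> colour mapping, for illustration only.
COLOUR_TABLE = {("joy", "jp"): "#F8B500", ("joy", "generic"): "#FFD700",
                ("calm", "jp"): "#A8D8B9", ("calm", "generic"): "#87CEEB"}

@dataclass
class MetaMessage:
    device_id: str
    utterance: str
    emotion: str
    culture: str

    def to_wire(self) -> str:
        colour = COLOUR_TABLE.get((self.emotion, self.culture),
                                  COLOUR_TABLE[(self.emotion, "generic")])
        return json.dumps({**asdict(self), "colour": colour})

print(MetaMessage("robot-01", "Welcome!", "joy", "jp").to_wire())
```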