Forthcoming articles

International Journal of Business Intelligence and Data Mining (IJBIDM)

These articles have been peer-reviewed and accepted for publication but are pending final changes, are not yet published and may not appear here in their final order of publication until they are assigned to issues. Therefore, the content conforms to our standards but the presentation (e.g. typesetting and proof-reading) is not necessarily up to the Inderscience standard. Additionally, titles, authors, abstracts and keywords may change before publication. Articles will not be published until the final proofs are validated by their authors.

Forthcoming articles may be purchased for the purposes of research, teaching and private study only. These articles can be cited using the expression "in press". For example: Smith, J. (in press). Article Title. Journal Title.


International Journal of Business Intelligence and Data Mining (53 papers in press)

Regular Issues

  • Analysis and Prediction of Heart Disease with the Aid of Various Data Mining Techniques: A Survey   Order a copy of this article
    by V. Poornima, D. Gladis 
    Abstract: In recent times, health diseases have been increasing gradually, partly because of inherited factors. Heart disease in particular has become more common nowadays, putting individuals' lives at risk. Data mining strategies, namely decision tree, Naive Bayes, neural network, k-means clustering, associative classification, support vector machine (SVM), fuzzy logic, rough set theory and orthogonal locality preserving methodologies, are examined on heart disease databases. In this paper, we survey distinct papers in which at least one data mining algorithm is utilised for the forecast of heart disease. This survey covers the current procedures involved in heart disease risk prediction and classification in data mining. The survey of pertinent data mining strategies involved in risk prediction of heart disease shows that hybrid approaches give better prediction models than single-model approaches.
    Keywords: Data mining; Heart Disease Prediction; performance measure; Fuzzy; and clustering.
    DOI: 10.1504/IJBIDM.2018.10014620
  • Worldwide Gross Revenue Prediction for Bollywood Movies using Hybrid Ensemble Model   Order a copy of this article
    by Alina Zaidi, Siddhaling Urolagin 
    Abstract: Prediction of revenue before a movie is released can be very beneficial for stakeholders and investors in the movie industry. Even though Indian cinema is a booming industry, the literature on movie revenue prediction is more inclined towards non-Indian movies. In this study we built a novel hybrid prediction model to predict worldwide gross for Bollywood movies. A Bollywood movies dataset consisting of 674 movies is prepared by downloading movie-related features from IMDb and YouTube movie trailers. K-means clustering is performed on the movie dataset and two major clusters are identified. Important features specific to each cluster are selected. The proposed hybrid prediction model segregates movies into the two clusters and employs a prediction model for each cluster. The prediction models we tested included various basic machine learning models and ensemble models. The ensemble model that combined predictions from support vector regression, neural network and ridge regression gave the best result for both clusters, and we chose it as our final model. We obtain an overall MAE of 0.0272 and R2 of 0.80 after 10-fold cross-validation.
    Keywords: Bollywood; Movie Revenue Prediction; Box office; Regression; Ensemble; Feature Selection; Machine Learning; Scikit-Learn.
    DOI: 10.1504/IJBIDM.2019.10019858
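
The abstract's cluster-then-predict idea can be illustrated with a minimal stdlib sketch: segregate samples with a tiny two-means pass, then fit a separate least-squares line per cluster. Everything below (the data, the single feature, plain linear fits in place of the paper's SVR/neural-network/ridge ensemble) is invented for illustration, not the authors' implementation:

```python
def two_means(xs, iters=20):
    """Tiny 1-D k-means with k = 2, initialised at the extremes."""
    c0, c1 = min(xs), max(xs)
    for _ in range(iters):
        a = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        b = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = sum(a) / len(a), sum(b) / len(b)
    return c0, c1

def fit_line(pairs):
    """Ordinary least-squares line through (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    w = sxy / sxx
    return w, my - w * mx

# invented (budget, gross) pairs with two regimes: small and big movies
data = [(1, 3), (2, 5), (3, 7), (10, 40), (11, 44), (12, 48)]
c0, c1 = two_means([x for x, _ in data])

def assign(x):
    """Route a sample to the nearer of the two cluster centres."""
    return 0 if abs(x - c0) <= abs(x - c1) else 1

models = [fit_line([(x, y) for x, y in data if assign(x) == i]) for i in (0, 1)]

def predict(x):
    w, b = models[assign(x)]   # use the cluster-specific model
    return w * x + b
```

The point of the hybrid structure is visible even at this scale: each regime gets its own slope instead of one global compromise fit.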
  • Health Data Warehouses: Reviewing Advanced Solutions for Medical Knowledge Discovery   Order a copy of this article
    by Norah Alghamdi 
    Abstract: The implementation of data warehouses and decision support systems that utilise the capabilities of information retrieval and knowledge discovery tools in the healthcare field has allowed for enhancement of the healthcare offered. In this work, we present a review of recent data warehouses and decision support systems in the healthcare domain, their significance, and their applications in evidence-based medicine, electronic health records, and nursing. Given the growing trend of their implementation in healthcare services, research, and education, we present here the most recent publications that employ these tools to produce suitable decisions for patients or health providers. For all the reviewed publications, we have intensively explored their problems, suggested solutions, utilised methods, and findings. We have also highlighted the strengths of the existing approaches and identified potential drawbacks, including data correctness, completeness, consistency, and integration, to support proper medical decision-making.
    Keywords: Data warehouses; Data Mining; Health Data; Medical Records; Quality; Knowledge Discovery; OLAP.
    DOI: 10.1504/IJBIDM.2019.10019971
  • Clustering Student Instagram Accounts using the Author-Topic Model   Order a copy of this article
    by Nur Rakhmawati, Faiz NF, Irmasari Hafidz, Indra Raditya, Pande Dinatha, Andrianto Suwignyo 
    Abstract: This study proposes a topic model to cluster high school teenagers' Instagram accounts in Surabaya, Indonesia, using the author-topic models method. We collected 235 valid Instagram accounts (133 female, 102 male students) and gathered a total of 3,346 captions of Instagram posts from 18 senior high schools. Our major finding is the set of topics that define the students' posts and captions, namely: feeling, Surabaya events, photography, artists, vacation, religion and music. Through the process, the lowest perplexity comes from 90 iterations, which suggests six groups of topics. The six topics are concluded based on the lowest perplexity value and labelled according to the words included in each topic. The topic of photography is discussed by six schools. Photography-artists and vacation are discussed by three schools, while feeling, religion and music are discussed by two and one school respectively.
    Keywords: Topic Modelling ; Senior High School Students ; Author-Topic Models.
    DOI: 10.1504/IJBIDM.2020.10020280
  • Stock Price Forecasting and News Sentiment Analysis Model using Artificial Neural Network   Order a copy of this article
    by Sriram K. V, Somesh Yadav, Ritesh Singh Suhag 
    Abstract: The stock market is highly volatile, and the prediction of stock prices has always been an area of interest to many statisticians and researchers. This study is an attempt to predict stock prices using an Artificial Neural Network (ANN). Three models have been built: one for future prediction of stock prices based on previous trends, a second for prediction of the next day's closing price based on today's opening price, and a third that analyses the sentiment of news articles and gives scores based on the news impact. The ANN is trained with historical data on the R-Studio platform and then used to predict future values. Our experimental results for various stock prices showed that the ANN-based model is effective.
    Keywords: Stock Pricing; Forecasting; Artificial Neural Network; News sentiment; Opening price; Closing price; R Studio; Data analytics;.
    DOI: 10.1504/IJBIDM.2021.10025494
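
The second model (next-day close from today's open) can be miniaturised to a single linear neuron trained by batch gradient descent; the paper's actual ANN, built in R-Studio, is of course larger, and the prices below are invented:

```python
# Next-day close from today's open, shrunk to one linear neuron
# (w * open + b) trained by batch gradient descent on made-up prices.
pairs = [(0.5, 0.51), (1.0, 1.02), (1.5, 1.53), (2.0, 2.04)]  # (open, close)

w, b = 0.0, 0.0
lr, epochs, n = 0.1, 2000, len(pairs)
for _ in range(epochs):
    # gradients of the mean squared error with respect to w and b
    gw = sum((w * x + b - y) * x for x, y in pairs) / n
    gb = sum((w * x + b - y) for x, y in pairs) / n
    w -= lr * gw
    b -= lr * gb

def next_close(today_open):
    """Predict tomorrow's close from today's open with the fitted neuron."""
    return w * today_open + b
```

On this toy series the neuron recovers the underlying close ≈ 1.02 × open relationship; a real ANN adds hidden layers and nonlinearities on top of exactly this training loop.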
  • Associative Classification Model for Forecasting Stock Market Trends   Order a copy of this article
    by Everton Castelão Tetila, Bruno Brandoli Machado, Jose F. Rorigues-Jr, Nícolas Alessando De Souza Belete, Diego A. Zanoni, Thayliny Zardo, Michel Constantino, Hemerson Pistori 
    Abstract: This paper proposes an associative classification model based on three technical indicators to forecast future trends of the stock market. Our methodology assessed the performance of nine technical indicators, using a portfolio of ten stocks and a twelve-year time series. The experimental results showed that the use of a set of technical indicators leads to higher classification rates than the use of sole technical indicators, reaching an accuracy of 88.77%. The proposed approach also uses a multidimensional data cube that allows automatic updating of stock market asset values, which is essential to keep the forecast updated. The results indicate that our approach can support investors and analysts operating in the stock market.
    Keywords: stock market trends; technical indicators; associative classification; data mining; business intelligence.
    DOI: 10.1504/IJBIDM.2021.10025495
  • Mining the Productivity Data of Garment Industry   Order a copy of this article
    by Abdullah Al Imran, Md Shamsur Rahim, Tanvir Ahmed 
    Abstract: The garment industry is one of the key examples of the industrial globalisation of the modern era. It is a highly labour-intensive industry with many manual processes. Satisfying the huge global demand for garment products mostly depends on the production and delivery performance of the employees in garment manufacturing companies. So, it is highly desirable for decision makers in the garment industry to track, analyse and predict the productivity performance of the working teams in their factories. This study explores the application of state-of-the-art data mining techniques for analysing industrial data, revealing meaningful insights, and predicting the productivity performance of the working teams in a garment company. As part of our exploration, we have applied eight different data mining techniques with six evaluation metrics. Our experimental results show that the tree ensemble model and gradient boosted tree model are the best performing models in this application scenario.
    Keywords: Data Mining; Productivity Prediction; Pattern Mining; Classification; Garment Industry; Industrial Engineering.
    DOI: 10.1504/IJBIDM.2021.10028084
    by Maria Antonia Walteros Alcazar, Nicolas Aguirre Yacup, Sandra P. Castillo Landinez, Pablo E. Caicedo Rodríguez 
    Abstract: In recent decades, crime has become an issue of great concern to nations, which is why there has been significant progress in the development of investigations in different areas. This literature review considers the data mining techniques applied to crime research through the analysis of four thematic axes: countries, data sources, data mining techniques and software employed in different articles. The analysis used a systematic methodology to examine 111 articles, selected from almost 70 journals and published between 2008 and 2018. The articles in this review focus on different types of crime. The findings indicate that the USA is the most active country in analysing crimes using data mining techniques, and that the most common sources are open data websites and crime studies. In general, studies covering crime broadly are more frequent than those covering a specific type of crime; the algorithm mainly used in studies is clustering, followed by classification; and the most widely used software is WEKA.
    Keywords: Data Mining DM; Crime; Criminal Patterns; Law Enforcement; Data Mining Techniques; Algorithms; Review; Knowledge Discovery; Literature Review LR;.
    DOI: 10.1504/IJBIDM.2021.10029504
  • A Parallel Approach for Web Session Identification to make Recommendation Efficient   Order a copy of this article
    by Bhuvaneswari M.S, K. Muneeswaran 
    Abstract: Web sessions are identified as a significant part of the construction of a recommendation model. The novel part of this work makes use of backward moves made by the user, considering both the referrer URL and the requested URL extracted from the extended web log for session identification, which is not taken into consideration in existing heuristic-based approaches. Two noteworthy issues in session identification are: i) framing excessively numerous smaller-length sessions; and ii) taking longer time to identify the sessions. In the proposed work, the length of the sessions is maximised using a split and merge technique, and the time taken for session identification is reduced using thread parallelisation. For efficient storage and retrieval of information, the hash map data structure is used. The proposed work shows significant improvement in performance in terms of execution time, standard error, correlation coefficient and the objective value.
    Keywords: Extended Web server logs ; Session identification ; Split and merge technique ; Multithreaded ; Hash data structure.
    DOI: 10.1504/IJBIDM.2021.10029835
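
The thread-parallel flavour of the approach can be sketched with the classic inactivity-timeout heuristic: hash log entries by user, then give each user's request list to its own worker. The paper's own method additionally exploits referrer URLs, backward moves and a split-and-merge step, which this toy version omits; the log entries are made up:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

TIMEOUT = 30 * 60  # classic 30-minute inactivity threshold, in seconds

# (ip, unix_timestamp, url) tuples standing in for parsed web-log entries
log = [
    ("10.0.0.1", 0,    "/home"),
    ("10.0.0.1", 600,  "/products"),
    ("10.0.0.1", 4000, "/home"),      # > 30-minute gap -> new session
    ("10.0.0.2", 100,  "/home"),
]

def split_user(requests):
    """Split one user's time-ordered requests on inactivity gaps."""
    sessions, current = [], [requests[0]]
    for prev, cur in zip(requests, requests[1:]):
        if cur[1] - prev[1] > TIMEOUT:
            sessions.append(current)
            current = []
        current.append(cur)
    sessions.append(current)
    return sessions

# hash-map grouping of entries by user, as in the paper's data structure
by_ip = defaultdict(list)
for entry in sorted(log, key=lambda e: e[1]):
    by_ip[entry[0]].append(entry)

# one worker per user: session splitting is embarrassingly parallel
with ThreadPoolExecutor() as pool:
    sessions = [s for user in pool.map(split_user, by_ip.values()) for s in user]
```

Because users' logs are independent, the per-user splits need no locking, which is what makes the thread parallelisation pay off on large logs.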
  • A Clustering and Treemap-based Approach for Query Reuse and Visualization in Large Data Repositories   Order a copy of this article
    by Yousra Harb, Surendra Sarnikar, Omar F. El-Gayar 
    Abstract: This study presents a query clustering and treemap approach that facilitates access to and reuse of pre-developed data retrieval models (queries) to analyse data and satisfy user information needs. The approach seeks to meet the following requirements: knowledge (represented as previously constructed queries) reuse, query exploration, and ease of use for data users. The approach proposes a feature space for representing queries, applies Hierarchical Agglomerative Clustering (HAC) techniques to cluster the queries, and leverages treemaps to visualise and navigate the resultant query clusters. We demonstrate the viability of the approach by building a prototype data exploration interface for health data from the Behavioral Risk Factor Surveillance System (BRFSS). We conduct cognitive walkthroughs and a user study to further evaluate the effectiveness of the artifact. Overall, the results indicate that the proposed approach meets its design requirements.
    Keywords: Query clustering; Query reuse; Query visualization; Query exploration; Information retrieval; Treemap.
    DOI: 10.1504/IJBIDM.2021.10030448
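
As a rough illustration of the HAC step, here is a hand-rolled average-linkage agglomerative clustering over toy two-dimensional "query feature" vectors; the paper's real feature space and the treemap layout are not reproduced:

```python
def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def hac(points, k):
    """Merge the closest pair of clusters (average linkage) until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average linkage: mean pairwise distance between clusters
                d = sum(dist(p, q) for p in clusters[i] for q in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # j > i, so pop is safe
    return clusters

# invented 2-D query features: two queries near the origin, two far away
queries = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0)]
clusters = hac(queries, k=2)
```

Each resulting cluster would become one tile region of the treemap, sized and navigated per the paper's visualisation design.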
  • BUSINESS INTELLIGENCE: The Fuzzy Logic in the risk client analysis   Order a copy of this article
    by Jorge Morris, Victor Escobar-Jeria, Juan Luis Castro Peña 
    Abstract: The following paper focuses on achieving accurate results from rough data. Using an inference model based on fuzzy logic, human reasoning was proactively stimulated, under certain conditions, in order to deal with the possibility of client loss due to service quality. The experimentation is carried out using information related to complaint receipts over a period of two years (70,000 registers). For that purpose, a prototype program was written in C++, which receives as input the crisp values that result from the failure resolution for each relevant service. The proposed model is intended to classify clients according to the risk they may pose in their contractual relationship with the company.
    Keywords: Business Intelligence; Fuzzy Logic; Soft Computing; Decision.
    DOI: 10.1504/IJBIDM.2021.10030559
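
A minimal sketch of a fuzzy risk classifier in this spirit, with invented membership shapes, rule outputs and a 0-10 service score; the paper's actual system is a C++ prototype driven by complaint-receipt data:

```python
def poor(score):
    """Membership of a 0-10 service score in 'poorly resolved'."""
    return max(0.0, min(1.0, (6 - score) / 6))

def good(score):
    """Membership in 'well resolved'; overlaps poor() around score 5."""
    return max(0.0, min(1.0, (score - 4) / 6))

def client_risk(score):
    """Two rules: poor service -> risk 0.9, good service -> risk 0.1.
    Defuzzify with the weighted average of the fired rule outputs."""
    w_poor, w_good = poor(score), good(score)
    return (w_poor * 0.9 + w_good * 0.1) / (w_poor + w_good)

def classify(score):
    """Crisp in, fuzzy inference in the middle, crisp class out."""
    return "at-risk client" if client_risk(score) > 0.5 else "stable client"
```

The overlapping memberships are what let a borderline score (around 5) yield an intermediate risk instead of a hard cutoff, which is the practical appeal of the fuzzy approach described in the abstract.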
  • Application of structural modeling to measure the impact of quality on growth factors: Case of the young industrial enterprises installed in the Northwest of Morocco   Order a copy of this article
    by Mohamed B.E.N. ALI, MOHAMMED HADINI, Said Barijal, Saif Rifai 
    Abstract: This study aims to provide a conceptual model measuring the impact of quality practices on the growth factors of young industrial enterprises located in northwestern Morocco, and to see how quality can stimulate and improve the growth factors of this kind of enterprise. The present study is empirical, based on surveys (face-to-face interviews) via questionnaires administered to the owners/managers of young industrial enterprises, using latent variable structural modelling according to the PLS (Partial Least Squares) Path Modelling approach. A total of 220 questionnaires were administered and exploited to assess the degree of use and application of quality practices; five practices were chosen. We concluded that, in general, the quality practices concerning “Leadership” and “Process Management” have a positive impact on the growth factors of this type of enterprise, with effects of "strong to medium" importance. In contrast, the quality practices concerning “Human Resources”,
    Keywords: Growth factors; Growth phase; Modeling; Quality Practices; Young Industrial Enterprises.
    DOI: 10.1504/IJBIDM.2021.10030835
  • Mining Trailer Reviews for Predicting Ratings and Box Office Success of Upcoming Movies   Order a copy of this article
    by Nirmalya Chowdhury, Debaditya Barman, Chandrai Kayal 
    Abstract: Around 60% of the movies produced worldwide are box office failures. Since it affects a large number of stakeholders, movie business prediction is a very relevant as well as challenging problem. There have been many attempts to predict the box-office earnings of a movie after its theatrical release; comparatively few research works predict a movie's fate before its release. Viewers are introduced to a movie via trailers before its theatrical release, and the reviews of these trailers are indicative of a movie's initial success. This work focuses on movie rating and business prediction on the basis of trailer reviews as well as other attributes. Several experiments have been performed using multiple classifiers to find the appropriate classifier(s) that can predict the rating and box-office performance of a movie to be launched. Experimentally it has been found that the Random Forest (RF) classifier outperformed the others and produced very promising results.
    Keywords: Text Mining; Sentiment Analysis; Machine Learning; Movie Rating; Opening Weekend Income; Gross Income; Movie Trailer; Sensitivity Analysis.
    DOI: 10.1504/IJBIDM.2021.10030880
  • Improvement Assessment Method for Special Kids By Observing The Social and Behaviour Activity Using Data Mining Techniques   Order a copy of this article
    Abstract: In recent studies, high-throughput innovations have given rise to the accumulation of substantial amounts of heterogeneous data that provide diverse information. Clustering is the process of gathering unique items into classes of comparable articles, and it is used to overcome the drawbacks of classification methods. Earlier clustering algorithms, such as hierarchical clustering and density-based clustering, which are based on either numerical or categorical attributes, were commercially used in software. In this proposed work, k-means clustering, an unsupervised learning algorithm, is used for prediction. Taking the clinical data of special kids, clusters are formed and categorised by rank with the help of relevant symptoms. In this context, the data of special kids has a statistical impact on categorisation and on the early detection of a child's associated conditions. As the results show, the proposed method has validated the database of special kids' information with global purity.
    Keywords: High-throughput development; Special kids; Categorical attributes; unsupervised k-means Clustering; Gene expressional values.
    DOI: 10.1504/IJBIDM.2021.10031032
  • Ensemble Feature Selection Approach for Imbalanced Textual Data Using MapReduce   Order a copy of this article
    by Houda Amazal, Kissi Mohamed, Mohammed Ramdani 
    Abstract: Feature selection is a fundamental preprocessing phase in text classification. It speeds up machine learning algorithms and improves classification accuracy. In a big data context, feature selection techniques have to deal with two major issues: the huge dimensionality and the imbalanced nature of the data. However, the libraries of big data frameworks, such as Hadoop, implement only a few single feature selection methods, whose robustness does not meet the requirements imposed by the large amount of data. To deal with this, we propose in this paper a Distributed Ensemble Feature Selection (DEFS) approach for imbalanced large datasets. The first step of the proposal focuses on tackling the imbalanced distribution of the data, using the Hadoop environment to transform the usual documents of the dataset into big documents. Afterwards, we introduce a novel feature selection method called Term Frequency-Inverse Category Frequency (TFICF), which is both frequency and category based.
    Keywords: Ensemble feature selection; Imbalance data; MapReduce; Text classification.
    DOI: 10.1504/IJBIDM.2022.10031100
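
The abstract names TF-ICF but gives no formula; by direct analogy with TF-IDF, one plausible form is tf(t, c) · log(|C| / cf(t)), where cf(t) counts the categories whose big document contains term t. A toy sketch under that assumption (the corpus below is invented, and the Hadoop big-document step is simulated in plain Python):

```python
import math
from collections import Counter

# category -> tokens of its merged "big document" (the paper builds
# such per-class big documents on Hadoop; here it is toy data)
categories = {
    "sports":   "goal match goal team".split(),
    "finance":  "stock market stock team".split(),
    "politics": "vote election vote market".split(),
}

def tficf(term, category):
    """Assumed form: tf(term, category) * log(|C| / cf(term))."""
    tf = Counter(categories[category])[term]
    cf = sum(1 for toks in categories.values() if term in toks)
    return tf * math.log(len(categories) / cf)
```

On this data, "goal" (exclusive to sports) scores 2·log 3 ≈ 2.20 for sports, while "team" (shared by two of three categories) is discounted to log 1.5 ≈ 0.41 — exactly the category-discriminating behaviour a TF-IDF analogue should show.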
  • The Five Key Components for Building An Operational Business Intelligence Ecosystem   Order a copy of this article
    by SARMA A.D.N. 
    Abstract: Business intelligence (BI) plays a vital role in decision making in almost all private, business and government organisations. Operational BI is a hybrid system and an emerging concept in the BI space that has gained popularity in the last five years. In this paper, the key components of an operational BI system are presented and their workings explained. The methodology adopted for the identification of components is based on the modularisation principles of software engineering, using cohesion and coupling parameters. The proposed components of the system leverage the principles of component-based software engineering. An orderly arrangement of the key components constitutes an operational BI ecosystem. Further, it is explained how these individual key components provide increased business value and timely decision-making information to all users in the organisation.
    Keywords: Business intelligence; operational BI; business performance management; operational analytics; operational reporting; event monitoring and notification; action time; and business value.
    DOI: 10.1504/IJBIDM.2021.10031395
  • A Novel Approach to Retrieve Unlabelled Images   Order a copy of this article
    by Deepali Kamthania, Ashish Pahwa, Aayush Gupta, Chirag Jain 
    Abstract: In this paper an attempt has been made to propose a search engine architecture for retrieving photographs from a photo bank of unlabelled images. The primary purpose of the system is to retrieve images from an image repository through string-based queries on an interactive interface. To achieve this, the image dataset is transformed into a space where queries can execute significantly faster, by developing a data pipeline through which each image is passed after entering the system. The pipeline consists of HOG-based face detection and extraction, face landmark estimation, an indexer and a transformer. As an image passes through the data pipeline, each encoded face in the input image is compared with other vectors by computing the l2 norm distance between them. The top N results (addresses of faces and corresponding images) are returned to the user. Once the image passes out of the pipeline, retrieval methods and feedback mechanisms are applied.
    Keywords: Face Recognition (FR); Deep Learning; Histogram of Oriented Gradients (HOG); FaceNet Architecture; Machine Learning; Support Vector Machine (SVM).
    DOI: 10.1504/IJBIDM.2021.10031519
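
The comparison step of the pipeline reduces to a simple operation: rank the indexed face encodings by l2 distance to the query encoding and return the top N. The 3-D vectors and file names below are stand-ins for real FaceNet-style encodings (which are typically 128-dimensional):

```python
def l2(a, b):
    """l2 (Euclidean) norm distance between two face encodings."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# indexer output: image address -> its face encoding (toy 3-D stand-ins)
index = {
    "img_001.jpg": (0.1, 0.9, 0.3),
    "img_002.jpg": (0.8, 0.1, 0.4),
    "img_003.jpg": (0.1, 0.8, 0.35),
}

def top_n(query, n=2):
    """Return the addresses of the n indexed faces nearest to the query."""
    return sorted(index, key=lambda name: l2(index[name], query))[:n]
```

A production system would replace the linear scan with an approximate nearest-neighbour index, but the ranking criterion is the same.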
  • Prediction of Box-office Success: A Review of Trends and Machine Learning Computational Models   Order a copy of this article
    by Elliot Mbunge, Stephen Fashoto, Happyson Bimha 
    Abstract: The movie industry is faced with high uncertainty owing to the challenges businesses have in forecasting sales and revenues. The huge upfront investments associated with the movie industry require investment decisions to be informed by reliable methods of predicting success or returns. This study set out to identify the best forecasting techniques for box-office products. Previous studies focused on predicting box-office success using pre-release and post-release features during and after the production phase. This study reviews the existing literature on predicting box-office success, with the ultimate goal of determining the most frequently used prediction algorithm(s), dataset sources and their accuracy results. We applied the PRISMA model to review papers published from 2010 to 2019, extracted from Google Scholar, Science Direct, IEEE Xplore Digital Library, ACM Digital Library and Springer Link. The study shows that the support vector machine was the most frequently used technique for predicting box-office success, with 21.74% of the total frequency, followed by linear regression with 17.39%. The study also revealed that the Internet Movie Database (IMDb) is the most used box-office dataset source, with 40.741% of the total frequency, followed by Wikipedia with 11.111%.
    Keywords: Box-office; machine learning; movie industry; pre-release; post-release features.
    DOI: 10.1504/IJBIDM.2021.10032162
    by S. Sridevi, Parthasarathy Sudhaman, T. Chandrakumar, S. Rajaram 
    Abstract: Traditional time series forecasting methods such as the naive, smoothing and moving average models assume that the time series is stationary and cannot handle linguistic terms. To provide a solution to this problem, fuzzy time series forecasting methods are considered in this research work. The objective of this research is to improve accuracy by introducing a new partitioning method called the Relative Differences (RD) based interval method. This work implements variants of RD-based Hidden Markov Models (HMM), namely classic HMM, stochastic HMM, Laplace stochastic smoothing HMM, and probabilistic smoothing HMM (PsHMM), for forecasting time series data. The performances of the above models were tested with the Australian Electricity Market dataset and the Tamilnadu Weather dataset. The results show that the proposed model, the Relative Differences (RD) based PsHMM, performs much better in terms of precision than the other existing models.
    Keywords: Forecasting; Time series; Fuzzy; Time Variant Model; Markov Model; Relative Differences (RD) based interval method.
    DOI: 10.1504/IJBIDM.2021.10032543
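
The mechanics behind interval-based Markov forecasting can be shown with a deliberately simplified first-order Markov chain over uniformly partitioned intervals: discretise the series, count interval-to-interval transitions, and forecast the most probable next interval. The paper's RD-based partitioning and HMM variants are richer than this, and the series below is invented:

```python
from collections import Counter, defaultdict

series = [21, 23, 25, 40, 42, 22, 24, 41, 43]  # toy demand readings

def bucket(x, width=10):
    """Crude uniform partitioning into intervals of fixed width; the
    paper instead sizes intervals by relative differences (RD)."""
    return x // width

# count first-order transitions between interval states
states = [bucket(x) for x in series]
trans = defaultdict(Counter)
for s, t in zip(states, states[1:]):
    trans[s][t] += 1

def forecast(state):
    """Most probable next interval given the current one."""
    return trans[state].most_common(1)[0][0]
```

An HMM additionally keeps the state hidden behind noisy observations and learns emission probabilities; this visible-state chain is the degenerate case that makes the transition-counting step easy to see.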
  • Disease Prediction and Knowledge Extraction in Banana Crop Cultivation using Decision Tree Classifiers   Order a copy of this article
    by A. Anitha 
    Abstract: Agriculture plays a vital role in determining the economic status of a country. To meet the growing needs of society and to improve crop productivity, researchers are focusing on the development of various technologies. In India, banana is one of the leading crops, with high demand. To improve the yield of banana, it is necessary to detect diseases at an early stage. Also, in order to acquire new farmers and retain existing banana farmers, it is essential to extract knowledge about the hidden causes of various diseases in the banana crop. This work applies data mining techniques, namely decision tree classifiers, to a banana cultivation dataset. The agricultural dataset used for experimentation was collected from farmers cultivating banana in regions fed by the Thamirabharani River, such as the Kanyakumari, Tirunelveli and Tuticorin districts of Tamil Nadu. The higher the disease detection accuracy, the greater the crop productivity. The performance of classifiers such as J48, REP tree and random forest is compared based on classification accuracy, precision, recall and F-measure. Among the various classification techniques applied to the agricultural dataset, the random forest algorithm was found to outperform the other techniques with respect to classification accuracy.
    Keywords: Attribute Selection; Decision Tree; Classification; Accuracy.
    DOI: 10.1504/IJBIDM.2022.10033424
  • Extracted information quality, a comparative study in high and low dimensions   Order a copy of this article
    by Leandro Ariza-Jiménez, Luisa F. Villa, Nicolás Pinel, Olga Lucia Quintero Montoya 
    Abstract: Uncovering interesting groups in either multidimensional or network spaces has become an essential mechanism for data exploration and understanding. Decision making requires relevant information as well as high quality in the retrieved conclusions. We present a comparative study of two compact representations drawn from the same set of data objects, obtained by clustering high-dimensional spaces and low-dimensional Barnes-Hut t-stochastic neighbour embeddings. There is no consensus on how the problem should be addressed or how these representations/models should be analysed, because of their different notions. We introduce a measure to compare their results and their capability to provide insights into the information retrieved. We consider low-dimensional embeddings a potentially revealing strategy for uncovering dynamics possibly not uncovered in big-data spaces. We demonstrate that a non-guided approach can be as revealing as a user-guided approach for data exploration, and present coherent results indicating good uncertainty modelling capability in terms of fuzziness and densities.
    Keywords: High-dimensional Clustering; BH-SNE Embeddings; cluster Fuzzyness; Reliable Information; Decision Making; Consistency.
    DOI: 10.1504/IJBIDM.2021.10033994
  • Heart Disease Patient Risk Classification Based On Neutrosophic Sets   Order a copy of this article
    by Wael Hanna, Nouran Radwan 
    Abstract: Medical statistics show that heart disease is one of the biggest causes of mortality among the population. In developing countries, people have less concern about their health, and the risk is increasing: five hundred deaths per one hundred thousand people occur annually in Egypt. The diagnosis of heart disease remains an ambiguous task in the medical field, as many features are involved in making the decision, and the data used for diagnosis are often vague and ambiguous. The main contribution of this paper is a novel model for heart disease patient risk classification based on neutrosophic sets. The proposed model is applied to the most relevant attributes of the selected dataset and compared, for validation, with well-known classification techniques such as naive Bayes, JRip, and random forest. The experimental results indicate that the proposed heart disease classification model achieves the highest accuracy and F-measure results.
    Keywords: Heart disease; supervised machine learning classification; and neutrosophic sets.
    DOI: 10.1504/IJBIDM.2021.10034129
  • A Semi-Supervised clustering based classification model for classifying imbalanced data streams in the presence of scarcely labelled data   Order a copy of this article
    by Kiran Bhowmick, Meera Narvekar 
    Abstract: Classification of data streams is still a current topic of research and a lot of research is focused in this direction. Online frameworks for classifying data streams are generally supervised in nature, so they assume the availability of labelled data at all times. Data streams in real time, however, are potentially infinite in length, massive, fast changing and scarcely labelled. It is practically impossible to label all the observed instances, hence these existing frameworks cannot be used in most real-time scenarios. Semi-supervised learning (SSL) addresses this problem of scarcely labelled data by using a large amount of unlabelled data together with labelled data to build classifiers. Data streams may also suffer from the problem of imbalanced data. This paper proposes a model using a semi-supervised clustering technique to classify an imbalanced data stream in the presence of scarcely labelled data.
    Keywords: data streams; imbalanced data; semi-supervised clustering; expectation maximization; partially labelled.
    DOI: 10.1504/IJBIDM.2022.10034300
  • Analysing traveller ratings for tourist satisfaction and tourist spot recommendation   Order a copy of this article
    by Angel Arul Jothi Joseph, Rajeni Nagarajan 
    Abstract: In this study, we propose an automated system to classify traveller ratings on travel destinations in 10 categories across East Asia using the UCI Travel Reviews dataset. The automated system developed in this study is called the Traveller Rating Classification System (TRCS). Since the Travel Reviews dataset is unlabelled, the K-means clustering algorithm is used to group its samples into three clusters. The cluster numbers obtained from K-means are assigned as class labels for the samples, converting the dataset into a labelled one. Popular individual classifiers and ensemble classifiers are then used to classify the samples in the labelled dataset. In this study, bagging with a decision tree classifier achieved the best classification accuracy of 97.95%. The study further analyses the attributes in the dataset using visualization techniques, performing small transformations on them to draw inferences. The proposed system will be useful for understanding traveller satisfaction and as a tourist spot recommendation system.
    Keywords: Tourist spot recommendation; Tourist satisfaction; Traveller rating; K-means Clustering; Classification; Ensemble; Visualization.
    DOI: 10.1504/IJBIDM.2022.10034520
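The labelling step described in the abstract — cluster the unlabelled ratings, then treat the cluster index as a class label — can be sketched with a minimal Lloyd's k-means. The toy ratings and the deterministic initialisation below are illustrative assumptions, not the authors' implementation:

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=20):
    # Deterministic init: spread the initial centres across the data.
    step = max(1, len(points) // k)
    centres = [points[i * step] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: sq_dist(p, centres[c]))].append(p)
        # Recompute each centre as the mean of its cluster (keep old centre if empty).
        centres = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centres[j]
                   for j, cl in enumerate(clusters)]
    return [min(range(k), key=lambda c: sq_dist(p, centres[c])) for p in points]

# Toy "ratings" of six travellers on two categories; the real dataset has 10.
ratings = [(4.8, 4.6), (4.7, 4.9), (2.1, 2.0), (2.3, 1.9), (3.5, 3.4), (3.6, 3.6)]
labels = kmeans(ratings, k=3)
print(labels)  # [0, 0, 1, 1, 2, 2]
labelled_dataset = list(zip(ratings, labels))  # now usable by any supervised classifier
```

Once the cluster indices are attached, any classifier (the paper uses bagging with decision trees, among others) can be trained on the resulting labelled dataset.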
  • Correlating pre-search and in-search context to predict search intent for exploratory search   Order a copy of this article
    by Vikram Singh 
    Abstract: Modern information systems are expected to respond to a wide variety of information needs from users with diverse goals. The topical dimension (what the user is searching for) of these information needs is well studied; however, the intent dimension (why the user is searching) has received relatively less attention. Traditionally, the intent is an immediate reason, purpose, or goal that motivates the user search, and it is captured in search contexts (pre-search, in-search, pro-search); an ideal information system would be able to exploit these contexts. This article proposes a novel intent estimation strategy based on the intuition that captured intent can proactively extract potential results: the captured pre-search context adapts query-term proximities within matched results alongside document-term statistics, while pseudo-relevance feedback is combined with user-relevance feedback for the in-search context. The assessment shows the superior performance of the proposed strategy over equivalent approaches on several trade-offs, e.g., novelty, diversity (coverage, topicality), retrieval (precision, recall, F-measure) and exploitation vs. exploration.
    Keywords: Ambient Information; Exploratory Search; Human-Computer Interaction; Information Retrieval; Proactive Search; Query Term Proximity; Search Contexts; Relevance; Retrieval Model.
    DOI: 10.1504/IJBIDM.2022.10034960
  • Prediction of Students’ Failure using VLE and Demographic data: Case study Open University Data   Order a copy of this article
    by Rahila Umer, Sohrab Khan, Jun Ren, Shumaila Umer, Ayesha Shaukat 
    Abstract: The use of technologies such as learning management systems (LMS) in higher education institutions is becoming very common. An LMS supports teaching staff in communication, delivery of resources and the design of learning activities. These technologies produce large amounts of data that can be analysed using machine learning methods to extract knowledge about students’ behaviour and learning processes. In this study, we focus on the Open University’s project for predicting students’ failure in a course by using their data. Multiple machine learning algorithms are applied to historical virtual learning environment (VLE) data and demographic data. The study confirms the importance of VLE and demographic data in predicting academic performance, and highlights that demographic data improve the accuracy of models for predicting students’ outcomes in the courses they are enrolled in.
    Keywords: predictive learning analytics; student performance; retention; higher education; machine learning.
    DOI: 10.1504/IJBIDM.2022.10035109
  • Chaotic activities recognizing during the pre-processing event data phase   Order a copy of this article
    by Zineb Lamghari, Rajaa Saidi, Maryam Radgui, Moulay Driss Rahmani 
    Abstract: Process mining aims at obtaining insights into business processes by extracting knowledge from event data. Indeed, the quality of events is a crucial element for generating process models that reflect business process reality. To this end, pre-processing methods have appeared that clean events of deficiencies (noise, incompleteness and infrequent behaviours), but they are limited when chaotic activities emerge. Chaotic activities are executed arbitrarily in the process and degrade the quality of discovered models. Previously, a supervised learning approach was proposed that uses labelled samples to detect chaotic activities. This raises the difficulty of defining chaotic activities when there is no ground knowledge of which activities are truly chaotic. To that end, we develop an approach for recognising chaotic activities without labelled training data, using unsupervised learning techniques.
    Keywords: pre-processing; process discovery; process mining; chaotic activity; business process intelligent; machine learning algorithms.
    DOI: 10.1504/IJBIDM.2022.10035223
  • Favourable subpopulation migration strategy for Travelling salesman problem   Order a copy of this article
    by Abhishek Chandar, Akshay Srinivasan, G. Paavai Anand 
    DOI: 10.1504/IJBIDM.2022.10035424
  • An evaluation method for searching the functional relationships between property prices and influencing factors in the detected data   Order a copy of this article
    by Pierluigi Morano, Francesco Tajani, Vincenzo Del Giudice, Pierfrancesco De Paola, Felicia Di Liddo 
    Abstract: The economic crisis of the last decade, which started in the real estate sector, has spread awareness of the importance of advanced evaluation models as a support for assessments and for the periodic value updates of public and private property assets. With reference to a sample of recently sold properties located in the city of Rome (Italy), an innovative automated valuation model is explained and applied. The outputs are different mathematical expressions able to interpret and simulate the investigated phenomenon (i.e., market price formation). The application highlights, in the phase of selecting the best model, the fundamental condition that the valuer must know the reference market adequately. In this way, it is possible to identify the patterns existing in the detected data in terms of mathematical expressions, according to empirical knowledge of the economic phenomena.
    Keywords: price property formation; office market; retail market; automated valuation methods; AVMs; genetic algorithm; reliable valuations.
    DOI: 10.1504/IJBIDM.2022.10035383
  • Predicting students' academic performance using machine learning techniques: a literature review   Order a copy of this article
    by Aya Nabil, Mohammed Seyam, Ahmed Aboul-Fotouh 
    Abstract: The amount of students’ data stored in educational databases is increasing rapidly. These databases contain hidden patterns and useful information about students’ behaviour and performance. Data mining is the most effective method to analyse the stored educational data. Educational data mining (EDM) is the process of applying different data mining techniques in educational environments to analyse huge amounts of educational data. Several researchers applied different machine learning techniques to analyse students’ data and extract hidden knowledge from them. Prediction of students’ academic performance is necessary for educational environments to measure the quality of the learning process. Therefore, it is one of the most common applications of EDM. In this survey paper, we present a review of data mining techniques, EDM and its applications, and discuss previous studies in predicting students’ academic performance. An analysis of different machine learning techniques used in previous studies is also presented in this paper.
    Keywords: data mining; educational data mining; EDM; prediction; student academic performance; machine learning techniques; deep learning.
    DOI: 10.1504/IJBIDM.2022.10035540
  • Harnessing the Meteorological Effect for Predicting the Retail Price of Rice in Bangladesh   Order a copy of this article
    by Abdullah Al Imran, Zaman Wahid, Alpana Akhi Prova, Md. Hannan 
    Abstract: Bangladesh has seen a steep price hike over the last couple of years in one of the most consumed foods, eaten by millions of people every single day: rice. The impact of this phenomenon is critical, especially for those striving for daily meals. Thus, understanding the latent factors is vital for policymakers to take better strategic measures and decisions. In this paper, we apply five different machine learning algorithms to predict the retail price of rice, find the top factors responsible for the price hike, and determine the best-performing model. Leveraging six evaluation metrics, we found that random forest produces the best result, with an explained variance score of 0.87 and an R2 score of 0.86, whereas gradient boosting produces the worst; we also discovered that average wind speed is the top factor behind rice price hikes in retail markets.
    Keywords: data mining; rice price prediction; pattern mining; regression; retail markets.
    DOI: 10.1504/IJBIDM.2022.10035542
  • Privacy Preservation of the User Data and Properly Balancing Between Privacy and Utility   Order a copy of this article
    by N. Yuvaraj, K. Praghash, T. Karthikeyan 
    Abstract: Privacy and utility are trade-off factors: the performance of one must be sacrificed to achieve the other. If privacy is achieved without publishing the data, efficient utility cannot be achieved; hence, original datasets tend to be published without privacy. It is therefore essential to maintain an equilibrium between the privacy and utility of datasets. In this paper, we propose a new privacy-utility method in which privacy is maintained by lightweight elliptical curve cryptography (ECC) and utility is maintained through ant colony optimisation (ACO) clustering. The datasets are first clustered using ACO, and the privacy of the clustered datasets is then maintained using ECC. The proposed method is evaluated on medical datasets and compared with existing methods through several performance metrics, such as clustering accuracy, F-measure, data utility, and privacy metrics. The analysis shows that the proposed method achieves better privacy preservation through the clustering algorithm than existing methods.
    Keywords: ant colony optimisation; ACO; elliptical curve cryptography; ECC; privacy preservation; utility.
    DOI: 10.1504/IJBIDM.2022.10035576
  • overdisp: An R Package for Direct Detection of Overdispersion in Count Data Multiple Regression Analysis   Order a copy of this article
    by Rafael Freitas Souza, Luiz Paulo Fávero, Patrícia Belfiore, Luiz Corrêa 
    Abstract: Within multiple areas, log-linear count data regression is one of the most popular techniques for predictive modelling where there is a non-negative discrete quantitative dependent variable. In order to ensure the inferences from the use of count data models are appropriate, researchers may choose between the estimation of a Poisson model and a negative binomial model, and the correct decision for prediction from a count data estimation is directly linked to the existence of overdispersion of the dependent variable, conditional to the explanatory variables. That said, the overdisp() command is a contribution to researchers, providing a fast and secure solution for the detection of overdispersion in count data. Real and simulated data were used to test the proposed solution, which proved to be computationally efficient, with no difference in the detection of overdispersion compared to the test postulated by the cited authors.
    Keywords: overdispersion; detection of overdispersion; count data; multiple regression analysis; non-negative discrete quantitative dependent variable; Poisson model; negative binomial model; R package.
    DOI: 10.1504/IJBIDM.2022.10035616
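The decision the package automates — Poisson vs. negative binomial — rests on whether the conditional variance exceeds the mean. A rough Python analogue of the idea, for an intercept-only model, is the Pearson dispersion statistic; overdisp itself is an R package, and this sketch with invented counts only illustrates the underlying check:

```python
def pearson_dispersion(counts):
    """Pearson chi-square divided by degrees of freedom for an
    intercept-only Poisson fit; values well above 1 suggest overdispersion,
    pointing towards a negative binomial model."""
    n = len(counts)
    mu = sum(counts) / n                        # MLE of the Poisson mean
    chi2 = sum((y - mu) ** 2 / mu for y in counts)
    return chi2 / (n - 1)

equi = [3, 4, 3, 5, 4, 3, 4, 5, 4, 3]           # low spread: no overdispersion
over = [0, 0, 1, 0, 12, 0, 15, 0, 1, 0]         # variance far above the mean
print(pearson_dispersion(equi))                 # below 1
print(pearson_dispersion(over))                 # well above 1
```

In a real regression setting the fitted means vary per observation and the degrees of freedom account for all estimated coefficients, which is exactly the bookkeeping a dedicated test command takes off the researcher's hands.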
  • Web mining based on word-centric search with clustering approach using MLP-PSO hybrid   Order a copy of this article
    by Reza Samizadeh, Samaneh Tafahomi 
    Abstract: With the growth of the web, tracking information sometimes does not require the semantic meaning of words: the mere presence of words in the text is enough to extract information. In this research, a word-centric search method is presented to prepare web data for clustering. Multi-layer perceptron (MLP) networks are among the most successful neural networks for learning, clustering and prediction. We cluster the web data obtained from the word-centric search method using the K-means method and treat the clustering results as the expected output of the MLP neural network. Since the weights of the neural network are selected randomly, they may not be optimal after training; therefore, a particle swarm optimisation algorithm is applied in the training and initial weighting steps, and its effect on the performance of the final neural network is investigated.
    Keywords: web mining; clustering; multi-layer perceptron neural networks; particle swarm optimisation algorithm.
    DOI: 10.1504/IJBIDM.2022.10035725
  • Estimating Cluster Validity Using Compactness Measure and Overlap Measure for Fuzzy Clustering   Order a copy of this article
    by Bindu Rani, Shri Kant 
    Abstract: Cluster analysis discovers valuable patterns in data by partitioning n data points into a valid number of clusters. A cluster validity index (CVI) helps in selecting the partition that best fits the underlying structure of the data. After a brief review of existing CVIs, this study formulates an overlap-compactness validity index (OCVI). The proposed index combines the overlap measure of Kim et al. (2004b) with a compactness measure. The compactness measure considers the geometrical aspects of the membership matrix (U) through the cluster centres, with an approach that reduces its monotonic tendency. The overlap measure calculates the average overlapping degree over all possible pairs of fuzzy clusters. Experiments are conducted on two artificial, two real and one biological dataset. Comparisons with the partition coefficient, partition entropy, modified partition coefficient, Xie-Beni and Kim indices show that the suggested index (OCVI) outperforms the other validity indices, achieving maximum compactness and minimum overlap.
    Keywords: cluster validity index; CVI; clustering; fuzzy clustering; fuzzy c-means algorithm.
    DOI: 10.1504/IJBIDM.2022.10036057
  • SHOMAN: An Efficient Method for Finding the Important Nodes in a Network   Order a copy of this article
    by Shivam Bathla, Omprakash Sah, Anurag Singh 
    Abstract: In this paper, we propose and study the SHOMAN metric for determining the importance of a node in a network. The method is based on a significant feature of networks, namely the clustering coefficient, and uses the principle of six degrees of separation, traversing at most six nodes to compute a node’s influence score. We demonstrate that our algorithm is highly effective at calculating the importance of nodes compared to other centrality measures. We also propose that our method can be used in viral marketing and in controlling disease spreading.
    Keywords: six-degrees of separation; clustering coefficient; robustness; centrality; viral marketing; disease spreading.
    DOI: 10.1504/IJBIDM.2021.10036357
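The network feature the metric builds on, the local clustering coefficient, is simple to state: the fraction of a node's neighbour pairs that are themselves connected. A minimal sketch on a toy graph (the graph and scoring below are illustrative, not the SHOMAN algorithm itself):

```python
from itertools import combinations

def clustering_coefficient(adj, node):
    """Fraction of pairs of `node`'s neighbours that are directly linked."""
    neigh = adj[node]
    if len(neigh) < 2:
        return 0.0
    linked = sum(1 for u, v in combinations(neigh, 2) if v in adj[u])
    return linked / (len(neigh) * (len(neigh) - 1) / 2)

# Undirected toy graph as an adjacency map.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(clustering_coefficient(adj, "a"))  # 1 of 3 neighbour pairs linked -> 0.333...
```

SHOMAN combines this local quantity with a bounded six-hop traversal; restricting exploration depth is what keeps the score cheap compared with global centrality measures.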
  • Solving restriction of Bayesian network in giving domain knowledge by introducing factor nodes   Order a copy of this article
    by Yutaka Iwakami, Hironori Takuma, Motoi Iwashita 
    Abstract: A Bayesian network is a probabilistic inference model that is effective for decision-making in business areas such as product development. Multiple events are represented as oval nodes and their relationships are drawn as edges among them. However, to obtain a sufficient effect, it is necessary to configure domain knowledge appropriately, for example, that more customer response to a product leads to more clarity in the product requirements. Such domain knowledge is configured as an edge connecting nodes, but in some cases the structural constraints of a Bayesian network prevent this configuration. In this study, the authors propose a method to avoid this constraint by introducing redundant factor nodes generated by applying factor analysis to the data related to the domain knowledge. With this approach, more domain knowledge can be applied to the Bayesian network, and the accuracy of decision-making in business is expected to be improved.
    Keywords: model improvement; data extraction; data driven insight; probabilistic inference; decision-making; product development; Bayesian network; factor analysis; key goal indicator.
    DOI: 10.1504/IJBIDM.2022.10036731
  • Customer Segmentation Using Various Machine Learning Techniques   Order a copy of this article
    Abstract: In the retail industry and marketing, customer segmentation is one of the most important tasks. Proper customer segmentation can help managers enhance the quality of products and provide better services for the targeted segments. Various machine learning-based customer segmentation techniques are used to gain insight into customers’ behaviour and identify the potential customers that could be targeted to maximise profit. Based on previous studies, this paper proposes improved machine learning models for customer segmentation in e-commerce. Agglomerative clustering algorithms are implemented to segment the customers with a new metric for customer behaviour. We also propose a systematic approach for combining an agglomerative clustering algorithm with a filtering-based recommender system to improve customer experience and customer retention. In the experiment, the results were compared with a K-means clustering model, and it was found that BLS greatly reduced training time while guaranteeing accuracy.
    Keywords: customer segmentation; agglomerative clustering algorithms; machine learning algorithms; K-means.
    DOI: 10.1504/IJBIDM.2022.10036753
  • Use of the BI systems for organizing the information space of the university   Order a copy of this article
    by Oksana Leonidovna Kopnova 
    Abstract: This article is devoted to a method of constructing information-analytical systems for a university. For implementation, the author proposes to use business analytics systems. The article provides a comparative analysis of the use of business intelligence tools in business and in universities, together with a comparative table of the capabilities of existing business intelligence systems. The optimal system for building the university’s information-analytical system is proposed, a model of the analytical report is developed, and an example of an interactive form of data analysis is given. The article is aimed at heads of higher educational institutions, as well as at companies that develop data analysis systems and provide marketing services.
    Keywords: business intelligence; university intelligence; business intelligence system; information analytical system; management decision making.
    DOI: 10.1504/IJBIDM.2021.10036840
  • A Unified Workflow Strategy for Analysing Large Scale TripAdvisor Reviews with BOW Model   Order a copy of this article
    by Jale Bektas, Arwa Abdalmajed 
    Abstract: Nowadays, firms need to transform customer online reviews data properly into information to achieve goals such as having a competitive edge and improving the quality of service. This paper presents a unified workflow to solve the problems of analysing large-scale data with 710,450 reviews for 1,134 hotels by using text mining methods among the different touristic regions of Turkey. Firstly, a star schema dimensional data mart is built that includes one fact table and two dimensional tables. Then, a series of text mining processes which includes data cleaning, tokenisation, and analysis are applied. Text mining is implemented through standard BOW and the extended BON model. The results show significant findings through this workflow. We propose to build a dimensional model dataset before performing any text mining process, since building such a dataset will optimise the data retrieval process and help to represent the data along with different measures of interest.
    Keywords: online TripAdvisor reviews; text mining; big data; N-gram tokenisation; dimensional data mart; data mining; BOW; BON.
    DOI: 10.1504/IJBIDM.2022.10037062
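The BOW and n-gram tokenisation steps in the workflow can be sketched in a few lines of plain Python; the toy review below is invented, not from the TripAdvisor dataset:

```python
import re
from collections import Counter

def tokenise(text):
    """Lowercase and split into word tokens (data-cleaning step)."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_ngrams(tokens, n=1):
    """Count n-gram occurrences; n=1 gives the standard BOW representation."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

review = "great location great staff but the room was small"
tokens = tokenise(review)
bow = bag_of_ngrams(tokens)            # unigram counts (standard BOW)
bigrams = bag_of_ngrams(tokens, n=2)   # extended n-gram representation
print(bow[("great",)])                 # 2
```

Building the dimensional data mart first, as the paper recommends, means these token counts can be sliced by hotel, region, or rating dimension without re-reading the raw reviews.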
  • Text Mining for Opinion Analysis: The Case of Recent Flood of Iran on Twitter   Order a copy of this article
    by Reza Kamranrad, Ali Jozi, Ehsan Mardan 
    Abstract: Sentiment analysis is the study and understanding of the emotions and beliefs expressed in a particular text, and it yields a lot of information. Twitter is a popular social network on which users express their opinions and feelings about various topics. By analysing this information, we can get an overview of public opinion about any particular topic. Classifying the information is effective for understanding it, and we also cluster the information. In this article, we analyse Twitter data to monitor people’s emotions about the recent flood events in Iran.
    Keywords: text mining; Twitter; sentiment analysis; machine learning; language processing; NLP; Python; clustering.
    DOI: 10.1504/IJBIDM.2022.10037064
  • Dengue Fever Prediction Modelling using Data Mining Techniques   Order a copy of this article
    by Wipawan Buathong, Pita Jarupunphol 
    Abstract: This research experiments with several combinations of feature selection techniques and classifiers to obtain the most efficient classification model for predicting dengue fever. The features of relationship patterns for predicting dengue fever were investigated. To obtain the most effective classification model, several feature selection techniques were ranked and experimented with well-recognised classifiers, and the measurement results of the different models were compared. The most efficient model is a neural network with three layers, each containing 100 nodes with the ReLU activation function. Five features were selected using information gain, yielding 64.9% accuracy, 71.8% F-measure, 65.7% precision, and 79.0% recall. Other competitive machine learning models with slightly similar efficiency are: 1) Naive Bayes combined with information gain; 2) the neural network combined with ReliefF; 3) Naive Bayes combined with FCBF. SVM, on the other hand, is the least efficient model when experimented with the selected feature selection techniques.
    Keywords: dengue fever; data mining; classification; feature selection; ranking.
    DOI: 10.1504/IJBIDM.2022.10037218
  • Managing Manufacturing Efficiency Using the Concept of “Automation with Human Touch”   Order a copy of this article
    by Arturo Alatrista Corrales, María Moreno-Arévalo, Carlos Carlos Zevallos-Pacheco, Marcos Rueda-Enríquez 
    Abstract: In the era of what is called Industry 4.0, information technologies have become crucial for increasing productivity in the manufacturing sector. They require a certain level of electronic integration with the processes and machines in the production plant in order to obtain relevant data. The purpose of this paper is to analyse the suitability of the concept of automation with a human touch as an alternative approach for developing information systems based on data obtained partially from a manual source (human-machine interaction). The case of Mentor Monitor is used for this purpose. It is argued that the concept reduces implementation costs and technological complexity while still providing reliable data for key performance indicators such as overall equipment efficiency (OEE) and specific energy consumption (SEC). The need for mechanisms to prevent errors in manual data is also discussed. Mentor Monitor is an information system developed by Calidad Total Mecatr
    Keywords: efficiency; overall equipment efficiency; OEE; specific energy consumption; SEC; production line; manufacturing; key performance indicators.
    DOI: 10.1504/IJBIDM.2021.10037304
  • Apriori-Roaring: Frequent Pattern Mining Method Based on Compressed Bitmaps   Order a copy of this article
    by Alexandre Colombo, Roberta Spolon, Aleardo Junior Manacero, Renata Spolon Lobato, Marcos Antônio Cavenaghi 
    Abstract: Association rule mining is one of the most common tasks in data analysis. It has a descriptive feature used to discover patterns in sets of data. Most existing approaches to data analysis have a constraint related to execution time. However, as the size of datasets used in the analysis grows, memory usage tends to be the constraint instead, and this prevents these approaches from being used. This article presents a new method for data analysis called apriori-roaring. The apriori-roaring method is designed to identify frequent items with a focus on scalability. The implementation of this method employs compressed bitmap structures, which use less memory to store the original dataset and to calculate the support metric. The results show that apriori-roaring allows the identification of frequent elements with much lower memory usage and shorter execution time. In terms of scalability, our proposed approach outperforms the various traditional approaches available.
    Keywords: frequent pattern mining; bitmap compression; data mining; association rules; knowledge discovery.
    DOI: 10.1504/IJBIDM.2022.10037305
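The core trick of the method — storing each item's transaction list as a bitmap so that itemset support becomes a bitwise AND plus a popcount — can be sketched with plain Python integers. The paper uses Roaring compressed bitmaps and full Apriori candidate generation; the transactions below are invented and only the support-counting step is shown:

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# One bitmap per item: bit t is set iff transaction t contains the item.
bitmaps = {}
for t, items in enumerate(transactions):
    for item in items:
        bitmaps[item] = bitmaps.get(item, 0) | (1 << t)

def support(itemset):
    """Support = popcount of the AND of the items' bitmaps."""
    acc = (1 << len(transactions)) - 1   # start with all transactions set
    for item in itemset:
        acc &= bitmaps.get(item, 0)
    return bin(acc).count("1")

print(support({"bread", "milk"}))  # 2 (transactions 0 and 2)
```

Compressed bitmaps keep the same AND/popcount interface while using far less memory on sparse, large datasets, which is where the scalability gains reported in the paper come from.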
  • Financial accounts reconciliation systems using enhanced mapping algorithm   Order a copy of this article
    by Olufunke Oluyemi Sarumi, Bolanle A. Ojokoh, Oluwafemi A. Sarumi, Olumide S. Adewale 
    Abstract: Account reconciliation has become a daunting task for many financial organisations due to the heterogeneity of the data involved in the reconciliation process, coupled with the recent data deluge in many accounting firms. Many organisations use a heuristic-based algorithm for their account reconciliation process, while in some firms the process is completely manual. These methods are overwhelmed and no longer efficient in light of the recent data explosion and, as such, are prone to errors that could expose organisations to several financial risks. There is therefore a need for a robust financial data analytic algorithm that can effectively handle the account reconciliation needs of financial organisations. In this paper, we propose a computational data analytic model that provides an efficient solution to the account reconciliation bottlenecks in financial organisations. Evaluation results show the effectiveness of our model in enabling faster decision-making in financial account reconciliation systems.
    Keywords: accounts reconciliation; financial analytics; functions; fraud; big data.
    DOI: 10.1504/IJBIDM.2022.10037414
  • Privacy Preserving Data Mining - Past and Present   Order a copy of this article
    by G. SATHISH KUMAR, K. Premalatha 
    Abstract: Data mining is the process of discovering patterns and correlations within huge volumes of data to forecast outcomes. Data mining techniques face serious challenges from privacy violations and sensitive information disclosure when datasets are provided to third parties. It is necessary to protect users’ private and sensitive data from exposure, without the authorisation of data holders or providers, when extracting useful information and revealing patterns from a dataset. Internet phishing also poses a serious threat through the extensive spread of private information over the web. Privacy preserving data mining (PPDM) is essential for exchanging confidential information in terms of data analysis, validation, and publishing. To achieve data privacy, a number of algorithms have been designed in the data mining sector. This article delivers a broad survey of privacy preserving data mining algorithms and the datasets used in the research, and analyses the techniques based on certain parameters. The survey highlights the outcome of each work along with its advantages and disadvantages, and will guide future research in PPDM towards choosing appropriate techniques.
    Keywords: data mining; privacy preserving data mining; PPDM; privacy preserving techniques; sensitive attributes; privacy threats.
    DOI: 10.1504/IJBIDM.2022.10037595
  • STEM: STacked Ensemble Model design for aggregation technique in Group Recommendation System   Order a copy of this article
    by Nagarajan Kumar, P. Arun Raj Kumar 
    Abstract: A group recommendation system provides a list of recommended items to a group of users. The challenge lies in aggregating the preferences of all members of a group to provide well-suited suggestions. In this paper, we propose an aggregation technique using a stacked ensemble model (STEM). STEM involves two stages. In stage 1, k-nearest neighbours (k-NN), singular value decomposition (SVD), and a combination of user-based and item-based collaborative filtering are used as base learners. In the second stage, a decision tree predictive model aggregates the outputs obtained from the base learners by prioritising the most preferred items. The experiments show that STEM provides a better group recommendation strategy than the existing techniques.
    Keywords: group recommendation system; aggregating user preferences; decision trees; stacked ensemble; machine learning.
    DOI: 10.1504/IJBIDM.2022.10037757
  • Portfolio selection with support vector regression: multiple kernels comparison   Order a copy of this article
    by Pedro Alexandre Moura Barros Henrique, Pedro Henrique Melo Albuquerque, Sarah Sabino De Freitas Marcelino, Yaohao Peng 
    Abstract: This study aimed to verify whether the use of support vector regression (SVR) makes a portfolio’s return exceed the market. To this end, SVR was applied with 15 different kernel functions to select the best stocks for each quarter, and the quarterly portfolio return and cumulative return over the period were calculated. The returns of these portfolios were then compared with the returns of a market benchmark, and White’s (2000) test was applied to avoid the data-snooping effect when assessing the statistical significance of the portfolios produced by the training strategies. The portfolio selected by SVR with the inverse multiquadric kernel presented the highest cumulative return, of 374.40%, with a value at risk (VaR) of −6.87%. The results corroborate the hypothesis that SVR is superior in portfolio formation, constituting a robust predictive method capable of coping with high-dimensional interactions.
    Keywords: statistical learning theory; optimisation theory; financial econometrics; support vector machine; SVM; kernel methods.
    DOI: 10.1504/IJBIDM.2019.10019195
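The kernel comparison at the heart of this abstract can be sketched with scikit-learn's SVR. The return series and lag structure below are hypothetical; the paper evaluates 15 kernels, and the inverse multiquadric highlighted in the results is not built into scikit-learn, so it is supplied here as a custom kernel callable.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)

# Hypothetical quarterly returns for one stock: predict the next quarter's
# return from the previous three (lagged) returns.
returns = rng.normal(0.02, 0.05, size=60)
X = np.column_stack([returns[i:i + 57] for i in range(3)])  # 3 lagged features
y = returns[3:]

# A few of scikit-learn's built-in kernels for comparison.
for kernel in ("linear", "rbf", "poly", "sigmoid"):
    model = SVR(kernel=kernel).fit(X[:-1], y[:-1])
    print(kernel, round(float(model.predict(X[-1:])[0]), 4))

def inv_multiquadric(X1, X2, c=1.0):
    # K(x, y) = 1 / sqrt(||x - y||^2 + c^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return 1.0 / np.sqrt(d2 + c ** 2)

# SVR accepts a callable kernel returning the Gram matrix.
imq = SVR(kernel=inv_multiquadric).fit(X[:-1], y[:-1])
print("inverse multiquadric", round(float(imq.predict(X[-1:])[0]), 4))
```

In the paper's setting, the predicted returns would rank stocks each quarter, with the top-ranked stocks forming the portfolio.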
  • An efficient approach for defect detection in pattern texture analysis using an improved support vector machine   Order a copy of this article
    by I. Manimozhi, S. Janakiraman 
    Abstract: Texture defect detection can be defined as the process of determining the location and size of a collection of pixels in a textured image that deviate in their intensity values or spatial arrangement in comparison to the background texture. The detection of such abnormalities is a challenging problem in computer vision. We propose a method for detecting defects in pattern texture analysis. Initially, features are extracted from the input image using the grey level co-occurrence matrix (GLCM) and the grey level run-length matrix (GLRLM). The extracted features are then fed as input to the classification stage, where classification is performed by an improved support vector machine (ISVM), in which the traditional support vector machine is improved by means of kernel methods. In the final stage, the classified features are segmented using the modified fuzzy c-means (MFCM) algorithm.
    Keywords: texture defect detection; preprocessing; grey level co-occurrence matrix; GLCM; grey level run-length matrix; GLRLM; improved support vector machine; ISVM; modified fuzzy c means; MFCM.
    DOI: 10.1504/IJBIDM.2019.10018937
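The GLCM feature-extraction step of the pipeline can be illustrated in a few lines of NumPy. The patch, grey-level count, and single (0, 1) offset below are simplifying assumptions; the paper's full pipeline also computes GLRLM features and feeds both into the ISVM classifier.

```python
import numpy as np

def glcm(img, levels=4):
    """Grey level co-occurrence matrix for the horizontal (0, 1) offset."""
    m = np.zeros((levels, levels))
    for i, j in zip(img[:, :-1].ravel(), img[:, 1:].ravel()):
        m[i, j] += 1                       # count horizontally adjacent pairs
    return m / m.sum()                     # normalise to joint probabilities

# Hypothetical 4-grey-level texture patch.
patch = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [2, 2, 3, 3],
                  [2, 2, 3, 3]])

p = glcm(patch)
i, j = np.indices(p.shape)
contrast = float((p * (i - j) ** 2).sum())   # GLCM contrast feature
energy = float((p ** 2).sum())               # GLCM energy (angular second moment)
print(round(contrast, 4), round(energy, 4))  # → 0.3333 0.1667
```

Such scalar texture features, computed over several offsets and angles, form the feature vector passed to the classifier.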
  • Discovery of inconsistent generalised coherent rules   Order a copy of this article
    by R. Anuradha, N. Rajkumar, G. Rathi, S. Prince Sahaya Brighty 
    Abstract: Mining multiple-level association rules in a predefined taxonomy hierarchy paves the way for generalised rule mining using interestingness measures such as support and confidence. Coherent rule mining, by contrast, identifies significant rules in a database without using interestingness measures. In this paper, we propose a new mining algorithm called generalised inconsistent coherent rule mining (GICRM) for mining a new form of generalised coherent rules called inconsistent coherent rules. The discovered rules are called inconsistent because their correlation changes from one level of the taxonomy to another. The rules are mined from a structured dataset with a predefined taxonomy. The inconsistent rules mined would be noteworthy from a business point of view for taking strategic decisions in market basket analysis.
    Keywords: generalised inconsistent coherent rule mining; GICRM; multiple-level; generalised inconsistent coherent rule; taxonomy.
    DOI: 10.1504/IJBIDM.2019.10021264
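A coherence test without support/confidence thresholds can be sketched as below. This uses one common formulation of coherent rules (co-presence and co-absence counts must each outweigh the mixed cases); the baskets are hypothetical, and the paper's GICRM algorithm additionally compares such rules across taxonomy levels to flag the inconsistent ones.

```python
def is_coherent(transactions, x, y):
    """X -> Y is coherent when support for (X and Y) and for (not X and not Y)
    each exceeds support for the two mixed cases (one common formulation)."""
    q1 = sum(1 for t in transactions if x in t and y in t)          # X and Y
    q2 = sum(1 for t in transactions if x in t and y not in t)      # X, not Y
    q3 = sum(1 for t in transactions if x not in t and y in t)      # not X, Y
    q4 = sum(1 for t in transactions if x not in t and y not in t)  # neither
    return q1 > q2 and q1 > q3 and q4 > q2 and q4 > q3

# Hypothetical market-basket data at the leaf level of a taxonomy.
baskets = [{"milk", "bread"}, {"milk", "bread"}, {"beer"},
           {"beer", "chips"}, {"milk", "bread", "chips"}]

print(is_coherent(baskets, "milk", "bread"))   # → True
print(is_coherent(baskets, "milk", "beer"))    # → False
```

A rule that is coherent at the leaf level but not after generalising its items to parent taxonomy nodes (or vice versa) would be an inconsistent coherent rule in the paper's sense.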
  • Analysis of road accident data and determining affecting factors by using regression models and decision trees   Order a copy of this article
    by Ali Nazeri, Hanieh GharehGozlu, Farshid Faraji, Shabnam Asakareh 
    Abstract: This study analyses road accident data with the aim of predicting the probability of road accidents leading to death and determining the affecting factors. Regression models including logit, probit, complementary log-log and Gompertz, as well as decision trees based on the CART algorithm, were used to analyse actual data from the country's rail road police centre. The results show that the logit regression model is superior to the other models in terms of the accuracy indicator scales. The variables day of week, age, shoulder path, road side, road type, road position, maximum speed, safety belt use, specific safety equipment, vehicle type and vehicle manufacturer country are among those that significantly affect the probability of road deaths, which can therefore be managed by controlling their levels.
    Keywords: road accidents; regression models; decision tree model; accuracy indicator scales.
    DOI: 10.1504/IJBIDM.2021.10024451
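The logit model favoured by the study can be sketched on synthetic data. The variables, coefficients, and records below are hypothetical stand-ins for the paper's accident dataset; the point is only the shape of the analysis: fit a logistic (logit) model on accident attributes and read off fatality probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Hypothetical accident records: speed (km/h), safety-belt use (0/1),
# night-time (0/1); outcome 1 = fatal accident.
n = 200
speed = rng.uniform(40, 130, n)
belt = rng.integers(0, 2, n)
night = rng.integers(0, 2, n)
true_logit = -8 + 0.06 * speed - 1.5 * belt + 0.8 * night
y = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

X = np.column_stack([speed, belt, night])
model = LogisticRegression().fit(X, y)

# Predicted fatality probability for a 120 km/h, unbelted, night-time case.
p = float(model.predict_proba([[120, 0, 1]])[0, 1])
print(round(p, 3))
```

Significant coefficients in such a model identify the controllable factors the abstract refers to.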
  • Artisanal sand exploitation from the Mamfe Sedimentary Basin, Cameroon and resultant impacts   Order a copy of this article
    by Etah Enow Moses 
    Abstract: The Mamfe Basin in Cameroon contains a large amount of mineral resources that are still unexploited. The basin's topography has enabled the deposition of sand from the Cretaceous Era to the present. Exploiters currently use rudimentary techniques to extract this highly demanded non-ferrous mineral, which now serves as a livelihood asset for some of the inhabitants. This study seeks to identify the areas of sand exploitation in the region, assess the methods of extraction, determine the variation in output over time, examine the socio-economic impact of the activity and analyse the different problems affecting the sector. By combining secondary and primary techniques of data acquisition, it was observed that the mean annual production of sand in the region is 4,574.1 m3. Between 2010 and 2016, a total of 5,365,295 FCFA was paid to the State Treasury. Sand quarrying is purely artisanal, so there is a need for the sector to be modernised for it to become sustainable.
    Keywords: artisanal mining; fluvial deposit; Mamfe Basin; non-metallic minerals; sand quarrying.
    DOI: 10.1504/IJBIDM.2021.10036841
  • The approach of using ontology as a pre-knowledge source for semi-supervised labelled topic model by applying text dependency graph   Order a copy of this article
    by Phu Pham, Phuc Do 
    Abstract: Discovering multiple topics in text is an important task in text mining. Supervised approaches have failed to explore multiple topics in text. Topic modelling approaches such as LSI, pLSI and LDA are unsupervised methods that discover the distributions of multiple topics in text documents. The labelled LDA (LLDA) model is a supervised method that integrates human-labelled topics with the given text corpus during topic modelling. However, in real applications, we may not have enough high-quality knowledge to properly assign topics to all documents before applying LLDA. In this paper, we present two approaches that take advantage of the dependency graph-of-words (GOW) in text analysis. The GOW approach uses the frequent sub-graph mining (FSM) technique to extract graph-based concepts from text. Our first approach, called GC2Onto, uses the graph-based concepts to construct a domain-specific ontology. In our second approach, called LLDA-GOW, the graph-based concepts are applied to improve the quality of traditional LLDA. We combine the GC2Onto and LLDA-GOW models to leverage multiple-topic identification as well as other text mining tasks.
    Keywords: topic identification; labelled topic modelling; latent Dirichlet allocation; LDA; labelled LDA; LLDA; ontology-driven topic labelling; dependency graph.
    DOI: 10.1504/IJBIDM.2019.10020863
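The graph-of-words construction underlying both GC2Onto and LLDA-GOW can be sketched as a windowed co-occurrence graph. The undirected sliding-window variant, window size, and sample sentence below are illustrative assumptions; the paper builds dependency-based graphs and then mines frequent subgraphs from them.

```python
from collections import defaultdict

def graph_of_words(tokens, window=3):
    """Build an undirected co-occurrence graph: nodes are terms, and an edge
    links two terms that appear together within a sliding window."""
    edges = defaultdict(int)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[i] != tokens[j]:
                edge = tuple(sorted((tokens[i], tokens[j])))
                edges[edge] += 1           # edge weight = co-occurrence count
    return dict(edges)

doc = "topic model learns topic distribution over words".split()
g = graph_of_words(doc)
print(g[("model", "topic")])   # → 2
```

Frequent subgraphs mined over many such document graphs yield the graph-based concepts that either populate the domain ontology (GC2Onto) or supply label knowledge to LLDA (LLDA-GOW).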