Forthcoming and Online First Articles

International Journal of Business Intelligence and Data Mining (IJBIDM)

Forthcoming articles have been peer-reviewed and accepted for publication but are pending final changes, are not yet published and may not appear here in their final order of publication until they are assigned to issues. Therefore, the content conforms to our standards but the presentation (e.g. typesetting and proof-reading) is not necessarily up to the Inderscience standard. Additionally, titles, authors, abstracts and keywords may change before publication. Articles will not be published until the final proofs are validated by their authors.

Forthcoming articles must be purchased for the purposes of research, teaching and private study only. These articles can be cited using the expression "in press". For example: Smith, J. (in press). Article Title. Journal Title.

Online First articles are published online here, before they appear in a journal issue. Online First articles are fully citeable, complete with a DOI. They can be cited, read, and downloaded. Online First articles are published as Open Access (OA) articles to make the latest research available as early as possible.

Articles marked with the Open Access icon are Online First articles. They are freely available and openly accessible to all without any restriction except the ones stated in their respective CC licenses.

International Journal of Business Intelligence and Data Mining (46 papers in press)

Regular Issues

  • Analysis and Prediction of Heart Disease Aid of Various Data Mining Techniques: A Survey
    by V. Poornima, D. Gladis 
    Abstract: In recent times, health diseases have been increasing gradually because of hereditary factors. Heart disease in particular has become more common, putting lives at risk. Data mining strategies, specifically decision tree, Naive Bayes, neural network, K-means clustering, association classification, support vector machine (SVM), fuzzy, rough set theory and orthogonal local preserving methodologies, are examined on heart disease databases. In this paper, we survey papers in which at least one data mining algorithm is used for the prediction of heart disease. This survey covers the current procedures used in risk prediction of heart disease for classification in data mining. The survey of pertinent data mining strategies indicates that a hybrid approach gives a better prediction model than a single-model approach.
    Keywords: Data mining; Heart Disease Prediction; performance measure; Fuzzy; clustering.
    DOI: 10.1504/IJBIDM.2018.10014620
     
  • Application of structural modeling to measure the impact of quality on growth factors: Case of the young industrial enterprises installed in the Northwest of Morocco
    by Mohamed B.E.N. ALI, MOHAMMED HADINI, Said Barijal, Saif Rifai 
    Abstract: This study aims to provide a conceptual model measuring the impact of quality practices on the growth factors of young industrial enterprises located in northwestern Morocco, and to see how quality can stimulate and improve growth factors for this kind of enterprise. The study is empirical, based on surveys (face-to-face interviews) via questionnaires administered to the owners/managers of young industrial enterprises, using latent variable structural modeling according to the PLS (Partial Least Squares) Path Modeling approach. A total of 220 questionnaires were administered and used to assess the degree of use and application of quality practices; five practices were chosen. We concluded that, in general, the quality practices concerning “Leadership” and “Process Management” have a positive impact on the growth factors of this type of enterprise, with "strong to medium" importance of effects. In contrast, the quality practices concerning “Human Resources”,
    Keywords: Growth factors; Growth phase; Modeling; Quality Practices; Young Industrial Enterprises.
    DOI: 10.1504/IJBIDM.2021.10030835
     
  • Mining Trailer Reviews for Predicting Ratings and Box Office Success of Upcoming Movies
    by Nirmalya Chowdhury, Debaditya Barman, Chandrai Kayal 
    Abstract: Around 60% of the movies produced worldwide are box office failures. Since it affects a large number of stakeholders, movie business prediction is a very relevant as well as challenging problem. There have been many attempts to predict the box-office earnings of a movie after its theatrical release. Comparatively few research works attempt to predict a movie’s fate before its release. Viewers are introduced to a movie via trailers before its theatrical release. The reviews of these trailers are indicative of a movie’s initial success. This work is focused on movie rating and business prediction on the basis of trailer reviews as well as other attributes. Several experiments have been performed using multiple classifiers to find appropriate classifier(s) which can predict the rating and box-office performance of a movie to be launched. Experimentally, it has been found that the Random Forest (RF) classifier outperformed the others and produced very promising results.
    Keywords: Text Mining; Sentiment Analysis; Machine Learning; Movie Rating; Opening Weekend Income; Gross Income; Movie Trailer; Sensitivity Analysis.
    DOI: 10.1504/IJBIDM.2021.10030880
     
  • Improvement Assessment Method for Special Kids By Observing The Social and Behaviour Activity Using Data Mining Techniques
    by DHANALAKSHMI RADHAKRISHNAN, Muthukumar B 
    Abstract: In recent studies, high-throughput innovations have given rise to the accumulation of substantial amounts of heterogeneous data that provide diverse information. Clustering is the process of gathering items into classes of similar objects. Clustering is used to overcome the drawbacks of classification methods. Earlier, clustering algorithms such as hierarchical clustering and density-based clustering, which are based on either numerical or categorical attributes, were used in commercial software. In this proposed work, k-means clustering, an unsupervised learning algorithm, is used for prediction. Taking the clinical data of special kids, clusters are formed and categorized by rank with the help of relevant symptoms. In this context, the data of special kids have a statistical impact on categorization and allow easier, earlier detection of a child's associated conditions. As a result, the proposed method has validated the database of special kids’ information with global purity.
    Keywords: High-throughput development; Special kids; Categorical attributes; unsupervised k-means Clustering; Gene expressional values.
    DOI: 10.1504/IJBIDM.2021.10031032
     
  • Ensemble Feature Selection Approach for Imbalanced Textual Data Using MapReduce
    by Houda Amazal, Kissi Mohamed, Mohammed Ramdani 
    Abstract: Feature selection is a fundamental preprocessing phase in text classification. It speeds up machine learning algorithms and improves classification accuracy. In a big data context, feature selection techniques have to deal with two major issues: the huge dimensionality and the imbalanced nature of the data. However, the libraries of big data frameworks, such as Hadoop, only implement a few single feature selection methods whose robustness does not meet the requirements imposed by the large amount of data. To deal with this, we propose in this paper a Distributed Ensemble Feature Selection approach (DEFS) for imbalanced large datasets. The first step of the proposal focuses on tackling the imbalanced distribution of data, using the Hadoop environment to transform the usual documents of the dataset into big documents. Afterwards, we introduce a novel feature selection method we call Term Frequency-Inverse Category Frequency (TFICF), which is both frequency and category based.
    Keywords: Ensemble feature selection; Imbalance data; MapReduce; Text classification.
    DOI: 10.1504/IJBIDM.2022.10031100
     
  • A Novel Approach to Retrieve Unlabelled Images
    by Deepali Kamthania, Ashish Pahwa, Aayush Gupta, Chirag Jain 
    Abstract: In this paper, an attempt has been made to propose an architecture for a search engine that retrieves photographs from a photo bank of unlabeled images. The primary purpose of the system is to retrieve images from an image repository through string-based queries on an interactive interface. To achieve this, the image data set is transformed into a space where queries can execute significantly faster, by developing a data pipeline through which each image is passed after entering the system. The pipeline consists of HOG-based face detection and extraction, face landmark estimation, an indexer and a transformer. The image is passed through the data pipeline, where each encoded face in the input image is compared with other vectors by computing the L2 norm distance between them. The top N results (addresses of faces and corresponding images) are returned to the user. Once the image has passed through the pipeline, retrieval methods and feedback mechanisms are applied.
    Keywords: Face Recognition (FR); Deep Learning; Histogram of Oriented Gradients (HOG); FaceNet Architecture; Machine Learning; Support Vector Machine (SVM).
    DOI: 10.1504/IJBIDM.2021.10031519
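
    The comparison step described in this abstract (ranking stored face encodings by L2 distance to a query encoding and returning the top N) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `top_n_nearest`, the toy two-dimensional encodings and the brute-force scan are all assumptions made for the example.

```python
import math

def top_n_nearest(query, encodings, n):
    """Return the indices of the n encodings closest to `query` by L2 distance.

    `encodings` is a list of equal-length numeric tuples (hypothetical face
    encodings); a real system would likely use an index instead of a full scan.
    """
    # Pair each distance with its index so ties are broken by index order.
    scored = [(math.dist(query, vec), idx) for idx, vec in enumerate(encodings)]
    scored.sort()
    return [idx for _, idx in scored[:n]]

# Toy data: four 2-D "encodings" and one query vector.
faces = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.0), (5.0, 5.0)]
nearest = top_n_nearest((0.0, 0.1), faces, 2)
print(nearest)
```

    The returned indices would then be mapped back to the stored face addresses and their source images.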
     
  • Prediction of Box-office Success: A Review of Trends and Machine Learning Computational Models
    by Elliot Mbunge, Stephen Fashoto, Happyson Bimha 
    Abstract: The movie industry is faced with high uncertainty owing to the challenges businesses have in forecasting sales and revenues. The huge upfront investments associated with the movie industry require investment decisions to be informed by reliable methods of predicting success or returns. The study set out to identify the best forecasting techniques for box-office products. Previous studies focused on predicting box-office success using pre-release and post-release features, during and after the production phase. The study focused on reviewing existing literature on predicting box-office success, with the ultimate goal of determining the most frequently used prediction algorithm(s), dataset sources and their accuracy results. We applied the PRISMA model to review papers published from 2010 to 2019, extracted from Google Scholar, Science Direct, IEEE Xplore Digital Library, ACM Digital Library and Springer Link. The study shows that the support vector machine was most frequently used to predict box-office success, with 21.74% of total frequency, followed by linear regression with 17.39%. The study also revealed that the Internet Movie Database (IMDb) is the most used box-office dataset source, with 40.741% of the total frequency, followed by Wikipedia with 11.111%.
    Keywords: Box-office; machine learning; movie industry; pre-release; post-release features.
    DOI: 10.1504/IJBIDM.2021.10032162
     
  • Disease Prediction and Knowledge Extraction in Banana Crop Cultivation using Decision Tree Classifiers
    by A. Anitha 
    Abstract: Agriculture plays a vital role in determining the economic status of a country. To meet the growing needs of society and to improve crop productivity, researchers are focusing on the development of various technologies. In India, banana is one of the leading crops, with high demand. To improve the yield of banana, it is necessary to detect diseases at an early stage. Also, in order to attract new farmers and to retain existing banana farmers, it is essential to extract knowledge about hidden causes of various diseases in the banana crop. This work aims to apply data mining techniques such as decision tree classifiers to a banana cultivation dataset. The agricultural dataset used for experimentation was collected from farmers cultivating banana in regions fed by the Thamirabharani River, such as the Kanyakumari, Tirunelveli and Tuticorin districts of Tamil Nadu. The higher the disease detection accuracy, the greater will be the crop productivity. The performance of classifiers such as J48, REP tree and random forest is compared based on classification accuracy, precision, recall and F-measure. Among the classification techniques applied to the agricultural dataset, the random forest algorithm outperforms the other techniques with respect to classification accuracy.
    Keywords: Attribute Selection; Decision Tree; Classification; Accuracy.
    DOI: 10.1504/IJBIDM.2022.10033424
     
  • Heart Disease Patient Risk Classification Based On Neutrosophic Sets
    by Wael Hanna, Nouran Radwan 
    Abstract: Medical statistics show that heart disease is one of the biggest causes of mortality among the population. In developing countries, people have less concern about their health. The risk is increasing, as five hundred deaths per one hundred thousand people occur annually in Egypt. The diagnosis of heart disease remains an ambiguous task in the medical field, as many features are involved in making the decision. Besides, data gathered for diagnosis are often vague and ambiguous. The main contribution of this paper is a novel model of heart disease patient risk classification based on neutrosophic sets. The proposed model is applied to the most relevant attributes of a selected dataset, and compared for validation with other well-known classification techniques such as Naive Bayesian, JRip, and random forest. The experimental results indicate that the proposed heart disease classification model achieves the highest accuracy and F-measure results.
    Keywords: Heart disease; supervised machine learning classification; neutrosophic sets.
    DOI: 10.1504/IJBIDM.2021.10034129
     
  • A Semi-Supervised clustering based classification model for classifying imbalanced data streams in the presence of scarcely labelled data
    by Kiran Bhowmick, Meera Narvekar 
    Abstract: Classification of data streams is still an active topic, and a lot of research is focused in this direction. Online frameworks for classifying data streams are generally supervised in nature, so they assume the availability of labelled data all the time. Data streams in real time, however, are potentially infinite in length, massive, fast changing and scarcely labelled. It is practically impossible to label all the observed instances. Hence, these existing frameworks cannot be used in most real-time scenarios. Semi-supervised learning (SSL) addresses this problem of scarcely labelled data by using a large amount of unlabelled data together with labelled data to build classifiers. Data streams may also suffer from the problem of imbalanced data. This paper proposes a model using a semi-supervised clustering technique to classify an imbalanced data stream in the presence of scarcely labelled data.
    Keywords: data streams; imbalanced data; semi-supervised clustering; expectation maximization; partially labelled.
    DOI: 10.1504/IJBIDM.2022.10034300
     
  • Analysing traveller ratings for tourist satisfaction and tourist spot recommendation
    by Angel Arul Jothi Joseph, Rajeni Nagarajan 
    Abstract: In this study, we propose an automated system to classify traveller ratings on travel destinations in 10 categories across East Asia using the UCI Travel Reviews dataset. The automated system developed in this study is called Traveller Rating Classification System (TRCS). Since the Travel Reviews dataset is an unlabelled dataset, K-means clustering algorithm is used to group the samples from the dataset into three clusters. The cluster numbers obtained from K-means clustering are assigned as class labels for the samples and the dataset is converted into a labelled dataset. Popular individual classifiers and ensemble classifiers are used to classify the samples present in the labelled dataset. In this study, Bagging with decision tree classifier achieved the best classification accuracy of 97.95%. The study further analyses the attributes in the dataset using visualization techniques to draw inferences by performing small transformations on them. The proposed system will be useful to understand traveller satisfaction and as a tourist spot recommendation system.
    Keywords: Tourist spot recommendation; Tourist satisfaction; Traveller rating; K-means Clustering; Classification; Ensemble; Visualization.
    DOI: 10.1504/IJBIDM.2022.10034520
     
  • Correlating pre-search and in-search context to predict search intent for exploratory search
    by Vikram Singh 
    Abstract: Modern information systems are expected to respond to a wide variety of information needs from users with diverse goals. The topical dimension (what the user is searching for) of these information needs is well studied; however, the intent dimension (why the user is searching) has received relatively less attention. Traditionally, the intent is an immediate reason, purpose, or goal that motivates the user search, and captured in search contexts (pre-search, in-search, pro-search). An ideal information system would be able to use. This article proposed a novel intent estimation strategy; based on the intuition that captured intent proactively extracts potential results. Captured pre-search context adapts query term proximities within matched results beside document-terms statistics and pseudo-relevance feedback with user-relevance feedback for in-search. The assessment asserts the superior performance of the proposed strategy over equivalent on trade-offs, e.g., novelty, diversity (coverage, topicality), retrieval (precision, recall, F-measure) and exploitation vs. exploration.
    Keywords: Ambient Information; Exploratory Search; Human-Computer Interaction; Information Retrieval; Proactive Search; Query Term Proximity; Search Contexts; Relevance; Retrieval Model.
    DOI: 10.1504/IJBIDM.2022.10034960
     
  • Prediction of Students’ Failure using VLE and Demographic data: Case study Open University Data
    by Rahila Umer, Sohrab Khan, Jun Ren, Shumaila Umer, Ayesha Shaukat 
    Abstract: The use of technology such as learning management systems (LMS) in higher education institutes is becoming very common. An LMS provides support to teaching staff for communication, delivery of resources and the design of learning activities. These technologies produce a large amount of data, which can be analysed using machine learning methods to extract knowledge regarding students’ behaviour and learning processes. In this study, we focus on the Open University’s project for predicting students’ failure in a course by using their data. Multiple machine learning algorithms are applied to historical virtual learning environment (VLE) data and demographic data. This study confirms the importance of VLE and demographic data in the prediction of academic performance. It highlights the importance of demographic data, which improves the accuracy of models for predicting students’ outcomes in the courses they are enrolled in.
    Keywords: predictive learning analytics; student performance; retention; higher education; machine learning.
    DOI: 10.1504/IJBIDM.2022.10035109
     
  • Chaotic activities recognizing during the pre-processing event data phase
    by Zineb Lamghari, Rajaa Saidi, Maryam Radgui, Moulay Driss Rahmani 
    Abstract: Process mining aims at obtaining insights into business processes by extracting knowledge from event data. The quality of events is a crucial element for generating process models that reflect business process reality. To this end, pre-processing methods have appeared to clean events of deficiencies (noise, incompleteness and infrequent behaviours), limited by the emergence of chaotic activities. Chaotic activities are executed arbitrarily in the process and impact the quality of discovered models. Beyond this, a supervised learning approach has been proposed, using labelled samples to detect chaotic activities. This raises the difficulty of defining chaotic activities when there is no ground knowledge of which activities are truly chaotic. To that end, we develop an approach for recognising chaotic activities without labelled training data, using unsupervised learning techniques.
    Keywords: pre-processing; process discovery; process mining; chaotic activity; business process intelligent; machine learning algorithms.
    DOI: 10.1504/IJBIDM.2022.10035223
     
  • Favourable subpopulation migration strategy for Travelling salesman problem
    by Abhishek Chandar, Akshay Srinivasan, G. Paavai Anand 
    DOI: 10.1504/IJBIDM.2022.10035424
     
  • An evaluation method for searching the functional relationships between property prices and influencing factors in the detected data
    by Pierluigi Morano, Francesco Tajani, Vincenzo Del Giudice, Pierfrancesco De Paola, Felicia Di Liddo 
    Abstract: The economic crisis of the last decade, which started in the real estate sector, has spread awareness of the importance of advanced evaluation models as a support in the assessments and periodic value updates of public and private property assets. With reference to a sample of recently sold properties located in the city of Rome (Italy), an innovative automated valuation model is explained and applied. The outputs are represented by different mathematical expressions, able to interpret and to simulate the investigated phenomena (i.e. the formation of market prices). The application carried out highlights, in the selection phase of the best model, the fundamental condition that the valuer must adequately know the reference market. In this way, it is possible to identify the existing patterns in the detected data in terms of mathematical expressions, according to the empirical knowledge of the economic phenomena.
    Keywords: price property formation; office market; retail market; automated valuation methods; AVMs; genetic algorithm; reliable valuations.
    DOI: 10.1504/IJBIDM.2022.10035383
     
  • Predicting students' academic performance using machine learning techniques: a literature review
    by Aya Nabil, Mohammed Seyam, Ahmed Aboul-Fotouh 
    Abstract: The amount of students’ data stored in educational databases is increasing rapidly. These databases contain hidden patterns and useful information about students’ behaviour and performance. Data mining is the most effective method to analyse the stored educational data. Educational data mining (EDM) is the process of applying different data mining techniques in educational environments to analyse huge amounts of educational data. Several researchers applied different machine learning techniques to analyse students’ data and extract hidden knowledge from them. Prediction of students’ academic performance is necessary for educational environments to measure the quality of the learning process. Therefore, it is one of the most common applications of EDM. In this survey paper, we present a review of data mining techniques, EDM and its applications, and discuss previous studies in predicting students’ academic performance. An analysis of different machine learning techniques used in previous studies is also presented in this paper.
    Keywords: data mining; educational data mining; EDM; prediction; student academic performance; machine learning techniques; deep learning.
    DOI: 10.1504/IJBIDM.2022.10035540
     
  • Harnessing the Meteorological Effect for Predicting the Retail Price of Rice in Bangladesh
    by Abdullah Al Imran, Zaman Wahid, Alpana Akhi Prova, Md. Hannan 
    Abstract: Over the last couple of years, Bangladesh has seen an absurdly steep price hike in one of the foods consumed by millions of people every single day: rice. The impact of this phenomenon is critical, especially for those striving for daily meals. Thus, understanding the latent facts is vital for policymakers to take better strategic measures and decisions. In this paper, we have applied five different machine learning algorithms to predict the retail price of rice, find the top factors responsible for the price hike, and determine the model that produces the best prediction results. Leveraging six evaluation metrics, we found that random forest produces the best result, with an explained variance score of 0.87 and an R² score of 0.86, whereas gradient boosting produces the worst. We also found that average wind speed is the topmost factor in the rice price hike in retail markets.
    Keywords: data mining; rice price prediction; pattern mining; regression; retail markets.
    DOI: 10.1504/IJBIDM.2022.10035542
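
    The two scores quoted in this abstract are closely related but not identical, which is why they can differ slightly for the same model. Under their standard definitions, R² is one minus the ratio of the residual sum of squares to the total sum of squares, while the explained variance score is one minus the ratio of the residual variance to the target variance, so it ignores any constant bias in the predictions. A minimal sketch with made-up numbers (not data from the paper; the function names are chosen here for illustration):

```python
from statistics import fmean, pvariance

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def explained_variance(y_true, y_pred):
    """Explained variance score: 1 - Var(residuals) / Var(y_true)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - pvariance(residuals) / pvariance(y_true)

# Hypothetical prices and predictions that are all too high by 0.5.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
print(r2_score(actual, predicted))            # penalised by the constant bias
print(explained_variance(actual, predicted))  # unaffected by the constant bias
```

    When the residuals average to zero the two scores coincide; a gap between them indicates systematic over- or under-prediction.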
     
  • Privacy Preservation of the User Data and Properly Balancing Between Privacy and Utility
    by N. Yuvaraj, K. Praghash, T. Karthikeyan 
    Abstract: Privacy and utility are trade-off factors: the performance of one must be sacrificed to achieve the other. If privacy is achieved by not publishing the data, then efficient utility cannot be achieved; hence the original dataset tends to be published without privacy. Therefore, it is essential to maintain an equilibrium between the privacy and utility of datasets. In this paper, we propose a new privacy-utility method, where privacy is maintained by lightweight elliptical curve cryptography (ECC), and utility is maintained through ant colony optimisation (ACO) clustering. Initially, the datasets are clustered using ACO, and then the privacy of the clustered datasets is maintained using ECC. The proposed method was evaluated on medical datasets and compared with existing methods through several performance metrics, such as clustering accuracy, F-measure, data utility, and privacy metrics. The analysis shows that the proposed method, using the clustering algorithm, obtains improved privacy preservation compared with existing methods.
    Keywords: ant colony optimisation; ACO; elliptical curve cryptography; ECC; privacy preservation; utility.
    DOI: 10.1504/IJBIDM.2022.10035576
     
  • overdisp: An R Package for Direct Detection of Overdispersion in Count Data Multiple Regression Analysis
    by Rafael Freitas Souza, Luiz Paulo Fávero, Patrícia Belfiore, Luiz Corrêa 
    Abstract: Within multiple areas, log-linear count data regression is one of the most popular techniques for predictive modelling where there is a non-negative discrete quantitative dependent variable. In order to ensure the inferences from the use of count data models are appropriate, researchers may choose between the estimation of a Poisson model and a negative binomial model, and the correct decision for prediction from a count data estimation is directly linked to the existence of overdispersion of the dependent variable, conditional to the explanatory variables. That said, the overdisp() command is a contribution to researchers, providing a fast and secure solution for the detection of overdispersion in count data. Real and simulated data were used to test the proposed solution, which proved to be computationally efficient, with no difference in the detection of overdispersion compared to the test postulated by the cited authors.
    Keywords: overdispersion; detection of overdispersion; count data; multiple regression analysis; non-negative discrete quantitative dependent variable; Poisson model; negative binomial model; R package.
    DOI: 10.1504/IJBIDM.2022.10035616
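
    As background to this abstract: a Poisson model assumes that the variance of the counts equals their mean, and overdispersion means the variance exceeds the mean. The sketch below computes a crude, unconditional dispersion index (sample variance divided by mean) on made-up counts. It is not the test implemented by the overdisp package, which the abstract describes as conditional on the explanatory variables; the function name and data are assumptions for illustration only.

```python
from statistics import fmean, variance

def dispersion_index(counts):
    """Sample variance divided by sample mean of non-negative counts.

    Values well above 1 hint at overdispersion relative to a Poisson
    distribution; this ignores explanatory variables entirely.
    """
    return variance(counts) / fmean(counts)

# Hypothetical count data: mostly small values with a few large ones.
bursty = [0, 1, 0, 2, 9, 0, 1, 12, 0, 5]
steady = [2, 3, 2, 3]
print(dispersion_index(bursty))  # greater than 1
print(dispersion_index(steady))  # less than 1
```

    A large index suggests that a negative binomial model may be worth considering, but a formal regression-based test is still needed for inference.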
     
  • Web mining based on word-centric search with clustering approach using MLP-PSO hybrid
    by Reza Samizadeh, Samaneh Tafahomi 
    Abstract: With web development, sometimes when keeping track of information on the web, the semantic meaning of words is not important, and the mere presence of words in the text is enough to extract information. In this research, a word-centric search method is presented to prepare web data for clustering. Multi-layer perceptron (MLP) networks are among the most successful neural networks for learning, clustering and prediction. The researcher clusters the web data from the word-centric search method using the K-means method, and considers the results of clustering as the expected output of the MLP neural network. The weights of the neural network are selected randomly and may not be optimal after network training. Therefore, a particle swarm optimisation algorithm is used in the training and initial weighting step, and its effect on the performance of the final neural network is investigated.
    Keywords: web mining; clustering; multi-layer perceptron neural networks; particle swarm optimisation algorithm.
    DOI: 10.1504/IJBIDM.2022.10035725
     
  • Estimating Cluster Validity Using Compactness Measure and Overlap Measure for Fuzzy Clustering
    by Bindu Rani, Shri Kant 
    Abstract: Cluster analysis discovers valuable patterns in data by partitioning n data points into a valid number of clusters. The cluster validity index (CVI) helps in selecting the partition that best fits the underlying structure of the data. After presenting a brief review of existing CVIs, this study formulates a competent overlap-compactness validity index (OCVI). The proposed index combines the overlap measure of Kim et al. (2004b) with a compactness measure. The compactness measure considers the geometrical aspects of the membership matrix (U) through cluster centres, with an approach to reduce its monotonic tendency. The overlap measure calculates the average value of the overlapping degree of all possible pairs of fuzzy clusters. Experiments are run on two artificial, two real and one biological dataset. Comparison of the partition coefficient, partition entropy, modified partition coefficient, Xie-Beni and Kim indices with the suggested index (OCVI) implies that the suggested index outperforms the other validity indices, with maximum compactness and minimum overlap.
    Keywords: cluster validity index; CVI; clustering; fuzzy clustering; fuzzy c-means algorithm.
    DOI: 10.1504/IJBIDM.2022.10036057
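    Two of the baseline indices named in the abstract above, the partition coefficient and the partition entropy, can be computed directly from a fuzzy membership matrix. The following is a minimal sketch of those two standard definitions only; it is not the proposed OCVI, and the function names and the layout of U as a list of rows (one row per cluster, one column per data point) are assumptions made for illustration.

```python
import math

def partition_coefficient(U):
    """Partition coefficient: sum of squared memberships divided by n.

    U is assumed to be a list of c rows (clusters) by n columns (points),
    with each column summing to 1. Values lie in [1/c, 1]; 1 means crisp.
    """
    n = len(U[0])
    return sum(u * u for row in U for u in row) / n

def partition_entropy(U):
    """Partition entropy: -(1/n) * sum of u * ln(u), skipping zero memberships.

    Values lie in [0, ln(c)]; 0 means a crisp partition.
    """
    n = len(U[0])
    return -sum(u * math.log(u) for row in U for u in row if u > 0) / n

# Hypothetical memberships for two clusters and two points.
crisp = [[1.0, 0.0], [0.0, 1.0]]
fuzzy = [[0.5, 0.5], [0.5, 0.5]]
```

    For the crisp matrix the partition coefficient reaches its maximum of 1 and the entropy its minimum of 0; for the maximally fuzzy matrix the coefficient falls to 1/c and the entropy rises to ln(c).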
     
  • Solving restriction of Bayesian network in giving domain knowledge by introducing factor nodes   Order a copy of this article
    by Yutaka Iwakami, Hironori Takuma, Motoi Iwashita 
    Abstract: A Bayesian network is a probabilistic inference model that is effective for decision-making in business, such as product development. Multiple events are represented as oval nodes and their relationships are drawn as edges among them. However, in order to obtain a sufficient effect, it is necessary to appropriately configure domain knowledge; for example, more customer response to the product leads to more clarity of requirements for products. Such domain knowledge is configured as an edge connecting nodes. But in some cases, the structural constraint of the Bayesian network prevents this configuration. In this study, the authors propose a method to avoid this constraint by introducing redundant factor nodes generated by applying factor analysis to the data related to the domain knowledge. With this approach, more domain knowledge can be applied to the Bayesian network, and the accuracy of decision-making in business is expected to be improved.
    Keywords: model improvement; data extraction; data driven insight; probabilistic inference; decision-making; product development; Bayesian network; factor analysis; key goal indicator.
    DOI: 10.1504/IJBIDM.2022.10036731
     
  • Customer Segmentation Using Various Machine Learning Techniques   Order a copy of this article
    by SAMYUKTHA PALANGAD OTHAYOTH, Raja Muthalagu 
    Abstract: In the retail industry and marketing, customer segmentation is one of the most important tasks. Proper customer segmentation can help managers enhance the quality of products and provide better services for the targeted segments. Various machine learning-based customer segmentation techniques are used to gain insight into customers’ behaviour and the potential customers that could be targeted to maximise profit. Based on previous studies, this paper proposes improved machine learning models for customer segmentation in e-commerce. Agglomerative clustering algorithms have been implemented to segment the customers with a new metric for customer behaviour. Also, we have proposed a systematic approach for combining an agglomerative clustering algorithm and a filtering-based recommender system to improve customer experience and customer retention. In the experiment, the results were compared with a K-means clustering model, and it was found that BLS greatly reduced training time while guaranteeing accuracy.
    Keywords: customer segmentation; agglomerative clustering algorithms; machine learning algorithms; K-means.
    DOI: 10.1504/IJBIDM.2022.10036753
     
  • A Unified Workflow Strategy for Analysing Large Scale TripAdvisor Reviews with BOW Model   Order a copy of this article
    by Jale Bektas, Arwa Abdalmajed 
    Abstract: Nowadays, firms need to transform online customer review data properly into information to achieve goals such as having a competitive edge and improving the quality of service. This paper presents a unified workflow to solve the problems of analysing large-scale data, comprising 710,450 reviews for 1,134 hotels, by using text mining methods across the different touristic regions of Turkey. Firstly, a star schema dimensional data mart is built that includes one fact table and two dimension tables. Then, a series of text mining processes, including data cleaning, tokenisation, and analysis, are applied. Text mining is implemented through the standard BOW and the extended BON model. The results show significant findings through this workflow. We propose to build a dimensional model dataset before performing any text mining process, since building such a dataset will optimise the data retrieval process and help to represent the data along with different measures of interest.
    Keywords: online TripAdvisor reviews; text mining; big data; N-gram tokenisation; dimensional data mart; data mining; BOW; BON.
    DOI: 10.1504/IJBIDM.2022.10037062
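    The tokenisation step mentioned in the abstract above can be illustrated with a small sketch of a bag-of-words count and its n-gram extension. This assumes that BON refers to a bag of n-grams, as the keyword "N-gram tokenisation" suggests; the tokenising rule, the function names and the sample review are hypothetical and are not taken from the paper.

```python
import re
from collections import Counter

def tokenise(text):
    """Lower-case the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(text):
    """Count how often each single token occurs (word order is discarded)."""
    return Counter(tokenise(text))

def bag_of_ngrams(text, n=2):
    """Count each run of n consecutive tokens, keeping local word order."""
    tokens = tokenise(text)
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical review text, used only to exercise the functions.
review = "Great hotel, great view"
bow = bag_of_words(review)
bon = bag_of_ngrams(review, n=2)
```

    Here `bow` counts "great" twice, whereas `bon` distinguishes "great hotel" from "great view", which is the extra context an n-gram model keeps.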
     
  • Text Mining for Opinion Analysis: The Case of Recent Flood of Iran on Twitter   Order a copy of this article
    by Reza Kamranrad, Ali Jozi, Ehsan Mardan 
    Abstract: Sentiment analysis relates to the study and understanding of emotions and beliefs in a particular text. This analysis gives us a lot of information. Twitter has been a popular social network in recent years, on which users express their opinions and feelings about various topics. By analysing this information, we can get an overview of public opinion about any particular topic. Classifying information helps in understanding it, so we cluster the information. In this article, we analyse Twitter to monitor people's emotions about the recent flood events in Iran.
    Keywords: text mining; Twitter; sentiment analysis; machine learning; language processing; NLP; Python; clustering.
    DOI: 10.1504/IJBIDM.2022.10037064
     
  • Apriori-Roaring: Frequent Pattern Mining Method Based on Compressed Bitmaps   Order a copy of this article
    by Alexandre Colombo, Roberta Spolon, Aleardo Junior Manacero, Renata Spolon Lobato, Marcos Antônio Cavenaghi 
    Abstract: Association rule mining is one of the most common tasks in data analysis. It has a descriptive feature used to discover patterns in sets of data. Most existing approaches to data analysis have a constraint related to execution time. However, as the size of datasets used in the analysis grows, memory usage tends to be the constraint instead, and this prevents these approaches from being used. This article presents a new method for data analysis called apriori-roaring. The apriori-roaring method is designed to identify frequent items with a focus on scalability. The implementation of this method employs compressed bitmap structures, which use less memory to store the original dataset and to calculate the support metric. The results show that apriori-roaring allows the identification of frequent elements with much lower memory usage and shorter execution time. In terms of scalability, our proposed approach outperforms the various traditional approaches available.
    Keywords: frequent pattern mining; bitmap compression; data mining; association rules; knowledge discovery.
    DOI: 10.1504/IJBIDM.2022.10037305
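    The idea of computing the support metric from bitmaps, described in the abstract above, can be sketched as follows. This is only an illustration under stated assumptions: plain Python integers stand in for compressed (Roaring) bitmaps, so nothing here is compressed, and the function names and the level-wise loop are hypothetical rather than the authors' implementation.

```python
from itertools import combinations

def build_bitmaps(transactions):
    """Map each item to an integer whose bit t is set if transaction t contains it."""
    bitmaps = {}
    for t, items in enumerate(transactions):
        for item in items:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << t)
    return bitmaps

def support(itemset, bitmaps, n_transactions):
    """Support count = number of set bits in the AND of the items' bitmaps."""
    acc = (1 << n_transactions) - 1  # start with every transaction selected
    for item in itemset:
        acc &= bitmaps.get(item, 0)
    return bin(acc).count("1")

def frequent_itemsets(transactions, min_support):
    """Level-wise search: join frequent (k-1)-itemsets, keep those that are frequent.

    The subset-pruning step of classic Apriori is omitted for brevity; results
    are unaffected because every candidate's support is still checked.
    """
    n = len(transactions)
    bitmaps = build_bitmaps(transactions)
    current = [frozenset([i]) for i in sorted(bitmaps)
               if support([i], bitmaps, n) >= min_support]
    result = {}
    k = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset, bitmaps, n)
        k += 1
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        current = [c for c in candidates if support(c, bitmaps, n) >= min_support]
    return result

# Hypothetical toy dataset of four transactions.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
found = frequent_itemsets(transactions, min_support=2)
```

    Each support count is a bitwise AND followed by a population count, so the original transactions never need to be rescanned once the bitmaps are built.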
     
  • Financial accounts reconciliation systems using enhanced mapping algorithm   Order a copy of this article
    by Olufunke Oluyemi Sarumi, Bolanle A. Ojokoh, Oluwafemi A. Sarumi, Olumide S. Adewale 
    Abstract: Account reconciliation has become a daunting task for many financial organisations due to the heterogeneity of data involved in the accounts reconciliation process, coupled with the recent data deluge in many accounting firms. Many organisations are using a heuristic-based algorithm for their account reconciliation process, while in some firms the process is completely manual. These methods are already overwhelmed and are no longer efficient in the light of the recent data explosion and are, as such, prone to many errors that could expose the organisations to several financial risks. In this regard, there is a need to develop a robust financial data analytic algorithm that can effectively handle the account reconciliation needs of financial organisations. In this paper, we propose a computational data analytic model that provides an efficient solution to the account reconciliation bottlenecks in financial organisations. Evaluation results show the effectiveness of our data analytic model for enhancing faster decision making in financial account reconciliation systems.
    Keywords: accounts reconciliation; financial analytics; functions; fraud; big data.
    DOI: 10.1504/IJBIDM.2022.10037414
     
  • Privacy Preserving Data Mining - Past and Present   Order a copy of this article
    by G. SATHISH KUMAR, K. Premalatha 
    Abstract: Data mining is the process of discovering patterns and correlations within huge volumes of data to forecast outcomes. Serious challenges arise in data mining techniques due to privacy violation and sensitive information disclosure when providing the dataset to third parties. It is necessary to protect users’ private and sensitive data from exposure without the authorisation of data holders or providers when extracting useful information and revealing patterns from the dataset. Also, internet phishing poses a growing threat on the web through the extensive spread of private information. Privacy preserving data mining (PPDM) is essential for exchanging confidential information in terms of data analysis, validation, and publishing. To achieve data privacy, a number of algorithms have been designed in the data mining sector. This article delivers a broad survey of privacy preserving data mining algorithms and the different datasets used in the research, and analyses the techniques based on certain parameters. The survey highlights the outcome of each piece of research along with its advantages and disadvantages. This survey will guide future researchers in PPDM in choosing the appropriate techniques for their research.
    Keywords: data mining; privacy preserving data mining; PPDM; privacy preserving techniques; sensitive attributes; privacy threats.
    DOI: 10.1504/IJBIDM.2022.10037595
     
  • STEM: STacked Ensemble Model design for aggregation technique in Group Recommendation System   Order a copy of this article
    by Nagarajan Kumar, P. Arun Raj Kumar 
    Abstract: A group recommendation system is required to provide a list of recommended items to a group of users. The challenge lies in aggregating the preferences of all members in a group to provide well-suited suggestions. In this paper, we propose an aggregation technique using a stacked ensemble model (STEM). STEM involves two stages. In stage 1, the k-nearest neighbour (k-NN), singular value decomposition (SVD), and a combination of user-based and item-based collaborative filtering are used as base learners. In the second stage, a decision tree predictive model is used to aggregate the outputs obtained from the base learners by prioritising the most preferred items. From the experiments, it is evident that STEM provides a better group recommendation strategy than the existing techniques.
    Keywords: group recommendation system; aggregating user preferences; decision trees; stacked ensemble; machine learning.
    DOI: 10.1504/IJBIDM.2022.10037757
     
  • Convolutional Neural Network for Classification of SiO2 Scanning Electron Microscope Images   Order a copy of this article
    by Kavitha Jayaram, G. Prakash, V. Jayaram 
    Abstract: Recent developments in deep learning have made image and speech classification and recognition tasks possible with better accuracy. An attempt was made to automatically extract required sections from literature published in journals to analyse and classify them according to their application. This paper presents the classification of high-temperature materials into four categories according to their wide applications, such as electronic, high temperature, semiconductors, and ceramics. The challenge is to extract the unique features of SEM images, as they are microscopic with different resolutions. A total of 10,000 scanning electron microscope (SEM) images are classified into two labelled categories, namely crystalline and amorphous structure. The image classification and recognition process of SiO2 was implemented using a convolutional neural network (CNN) deep learning framework. Our algorithm successfully classified the test dataset of SEM images with a precision of 96% and an accuracy of 95.5%.
    Keywords: deep learning; machine learning; image classification; convolution neural network; CNN; material.
    DOI: 10.1504/IJBIDM.2022.10038244
     
  • Rule-based Database Intrusion Detections Using Coactive Artificial Neuro-Fuzzy Inference System and Genetic Algorithm   Order a copy of this article
    by Anitarani Brahma, SUVASINI PANIGRAHI, Neelamani Samal, Debasis Gountia 
    Abstract: Recently, fuzzy systems with learning and adaptation capabilities have been gaining a lot of interest in research communities. In the current approach, two of the most successful soft computing approaches with learning capabilities, neural networks and genetic algorithms, are hybridised to approximate the reasoning method of fuzzy systems. The objective of this paper is to develop a coactive neuro-fuzzy inference system with a genetic algorithm-based database intrusion detection system that can detect malicious transactions in a database very efficiently. Experimental investigation and comparative assessment have been conducted with an existing statistical database intrusion technique to justify the efficacy of the proposed system.
    Keywords: fuzzy inference system; database intrusion detection; neural network; genetic algorithm; artificial neuro-fuzzy inference system; coactive artificial neuro-fuzzy inference system.
    DOI: 10.1504/IJBIDM.2022.10038259
     
  • A Regression Model to Evaluate Interactive Question Answering using GEP   Order a copy of this article
    by Mohammad Mehdi Hosseini 
    Abstract: Evaluation plays a pivotal role in interactive question answering (IQA) systems. However, much uncertainty still exists on evaluating IQA systems and there is practically no specific methodology to evaluate these systems. One of the main challenges in designing an assessment method for IQA systems lies in the fact that it is rarely possible to predict the interaction part. To this end, humans need to be involved in the evaluation process. In this paper, an appropriate model is presented by introducing a set of characteristic features for evaluating IQA systems. Data were collected from four IQA systems at various timespans. For the purpose of analysis, pre-processing is performed on each conversation, and the statistical characteristics of the conversations are extracted to form the characteristic matrix. The characteristic matrix is classified into three separate clusters using K-means. Then, an equation is allotted to each of the clusters with an application of gene expression programming (GEP). The results reveal that the proposed model has the least error, with an average root mean square error of 0.09 between the real data and the GEP model.
    Keywords: evaluation; interactive question; answering systems; nonlinear regression; gene expression programming; GEP; feature extraction.
    DOI: 10.1504/IJBIDM.2022.10038261
     
  • Examining the impact of business intelligence related practices on organizational performance in Oman   Order a copy of this article
    by ROBIN ZARINE, MUHAMMAD SAQIB 
    Abstract: Business intelligence can greatly enhance organisational capabilities in devising profitable business actions and activities. It provides understanding of both current and future trends relating to customers, markets, competitors, or regulation, and most importantly, the understanding of organisations’ own capabilities to compete. Business intelligence is arguably one of the key drivers of organisational competitiveness. This paper examines the extent to which organisations in Oman embrace business intelligence and the contributions of the different business intelligence components to organisational performance. A quantitative empirical approach is used, with the Microsoft Excel data analysis tool pack as the investigative tool to analyse and develop a regression model to better understand the impact of business intelligence related components on organisational performance. The findings show a strong correlation between business intelligence and organisational performance. They also show that having the right IT functionalities, with capable employees using them, is the key to performance enhancement. Furthermore, having IT infrastructure without the appropriate functionalities and personnel, or not embracing business intelligence, will not result in any performance gain.
    Keywords: business intelligence; business intelligence components; organisational performance; Oman.
    DOI: 10.1504/IJBIDM.2022.10038337
     
  • Next location prediction using Transformers   Order a copy of this article
    by Salah Eddine Henouda, Laallam Fatima Zohra, Okba KAZAR, Abdessamed Sassi 
    Abstract: This work seeks to solve the next location prediction problem for mobile users. Chiefly, we focus on the ROBERTA architecture (robustly optimised BERT approach) in order to build a next location prediction model using a subset of a large real mobility trace database. The latter was made available to the public through the CRAWDAD project. ROBERTA, which is a well-known model in natural language processing (NLP), works by predicting intentionally hidden sections of text based on a language masking strategy. The current paper follows a similar architecture to ROBERTA and proposes a new combination of the BERT WordPiece tokeniser and ROBERTA for location prediction that we call WP-BERTA. The results demonstrated that our proposed model WP-BERTA outperformed the state-of-the-art models. They also indicated that the proposed model provided a significant improvement in next location prediction accuracy compared to the state-of-the-art models. In particular, WP-BERTA outperformed Markovian models, support vector machines (SVMs), convolutional neural networks (CNNs), and long short-term memory networks (LSTMs).
    Keywords: machine learning; deep learning; transformer; neural networks; Wi-Fi; mobility traces; next location prediction; big data.
    DOI: 10.1504/IJBIDM.2022.10038854
     
  • Supervised and Unsupervised learning for characterizing the industrial material defects   Order a copy of this article
    by P. Radha, N. Selvakumar, J. Raja Sekar, J.V. Johnsonselva 
    Abstract: Ultrasonic-based NDT is used in industry to examine internal defects without damaging the components, since the materials used in industrial standard components must be 100% perfect. The ultrasonic signals are difficult to interpret and the domain expert has to concentrate at every sampling point to identify the defect. Hence, the existing ultrasonic-based NDT method is improved by applying IoT, machine learning and deep learning techniques to process the ultrasonic data. This work integrates NDT and IoT to analyse the properties of materials using a deep learning based supervised model and to filter outliers using an unsupervised model such as a density-based clustering method. After analysing the different categories of defects, notifications are sent to various stakeholders through their mobile devices, using advanced communication techniques, to either repair or replace the defective components and so avoid expensive experimentation or maintenance.
    Keywords: ultrasonic testing; internet of things; IoT; machine learning; density based clustering; deep learning; deep neural network; DNN.
    DOI: 10.1504/IJBIDM.2022.10039148
     
  • Detection of Suspicious Text Messages and Profiles using Ant Colony Decision Tree Approach   Order a copy of this article
    by Asha Kumari, Balkishan 
    Abstract: The ease of human communication through short messaging services (SMS) and social networking has greatly attracted suspicious activities that menace legitimate users. Unsolicited or uninvited messages that can lead to rumours, spam, malicious, or any other threatening activities are termed suspicious activities. This work combines the attributes of the ant colony optimisation (ACO) approach with a decision tree for the detection of suspicious content and profiles (ACDTDSCP). In the ACDTDSCP approach, the construction of the decision tree and the splitting of nodes are based on the appropriate attributes of the pheromone trail and heuristic function chosen by each ant. The research experimentation is conducted on two Twitter datasets (Social Honeypot Dataset and 1KS-10KN dataset) and two SMS text corpora (SMS Spam Collection v.1 and SMS Spam Corpus v.0.1 Big). The experimental results indicate the efficacy and potential of the proposed ACDTDSCP approach.
    Keywords: ant colony optimisation; ACO; decision tree; suspicious messages; spam; short message service; SMS; Twitter microblogs.
    DOI: 10.1504/IJBIDM.2021.10039529
     
  • An Optimal Dimension Reduction Strategy and Experimental Evaluation for Parkinson’s Disease Classification   Order a copy of this article
    by Saidulu D, Sasikala Ramasamy 
    Abstract: The amount of data streamed and generated through various healthcare systems is increasing exponentially day by day. Applying traditional data mining algorithms to this massive data to construct automated decision support systems is a tedious and time-consuming task. In recent years, there has been increasing interest in the development of telediagnosis and telemonitoring systems for Parkinson’s disease (PD). Parkinson’s disease is a progressive neurodegenerative disease which affects movement characteristics. PD patients commonly face vocal impairments during the early stages of the disease. This work proposes a computationally efficient method for dimension reduction and classification of healthcare related data. The devised framework is capable of dealing with data having discrete as well as continuous features. The experimental evaluation is performed on the Parkinson’s disease classification database (Sakar et al., 2018). The statistical performance metrics used are validation and test accuracy, precision, recall, F1-score, etc. There will be computational complexity advantages when this reduced dimension data is further processed for modelling and building a prediction system. In order to prove the optimality of the proposed framework, comparative analysis is performed with significant existing approaches.
    Keywords: big data; learning; dimension reduction; machine learning; knowledge discovery; information retrieval.
    DOI: 10.1504/IJBIDM.2022.10040204
     
  • Detection of Spammers disseminating obscene content on Twitter   Order a copy of this article
    by Deepali Dhaka, Surbhi Kakar, Monica Mehrotra 
    Abstract: Spammers distributing adult content are becoming an apparent and yet intrusive problem with the increasing prevalence of online social networks among users. To improve user experience, and especially to prevent exposure of users in lower age groups, these accounts need to be detected efficiently. In this work, a model is proposed in which a lexicon-based approach is used to label users with their values. This study is based on the fact that users behave according to the values they possess. The amalgamation of content-based features such as values, the entropy of words, lexical diversity, and context-based word embeddings is found to be robust. Among several machine learning models, XGBoost performs exceedingly well, with accuracy (92.28 ± 1.28%) for all features. Feature importance and discriminative power have also been shown. A comparative study is also done with one of the latest approaches, and our approach is found to be more efficient.
    Keywords: values; emotions; Twitter; online social network; spammer; pornographic spammer.
    DOI: 10.1504/IJBIDM.2022.10040432
     
  • Suspicious Tweet Identification Using Machine Learning Approaches for Improving Social Media Marketing Analysis   Order a copy of this article
    by Senthil Arasu Balasubramanian, Jonath BackiaSeelan, Thamaraiselvan Natarajan 
    Abstract: Social media acts as one of the eminent platforms for communication. Twitter is one of the leading social media microblogging platforms, where users can post and interact. #Hashtags indicate the Twitter trends on a certain topic. Currently, the hashtag value or trend ranking for a particular hashtag is calculated based on the cumulative number of tweets. This type of cumulative hashtag ranking may result in an anonymous intervention of irrelevant tweets, which affects social media marketing. The proposed approach uses the relevance of tweets and #hashtags to identify the suspicious or irrelevant tweets and so improve social media marketing. The proposed research work uses the linear regression algorithm, which is one of the familiar machine learning approaches, to explain spam tweet generation and the method of identifying it. The test results show that the proposed system has 84% significance when compared with market analysis algorithms.
    Keywords: tweets; hashtags; trend prediction; linear regression; social media marketing.
    DOI: 10.1504/IJBIDM.2022.10040478
     
  • An Evolutionary-based Approach for Providing Accurate and Novel Recommendations   Order a copy of this article
    by Chemseddine Berbague, Hassina Seridi, Nour El-Islam Karabadji, Panagiotis Symeonidis, Markus Zanker 
    Abstract: For memory-based collaborative filtering, the quality of the target user’s neighbourhood plays an important role in providing him/her with successful item recommendations. The existing techniques for neighbourhood selection aim to maximise the pairwise similarity between the target user and his/her neighbours, which mainly improves only the recommendation accuracy. However, these methods do not consider other important aspects of successful recommendations, such as providing diversified and novel item recommendations, which also highly affect users’ satisfaction. In this paper, we linearly combine two probabilistic criteria for selecting the right neighbourhood of a target user and providing him/her with accurate and novel item recommendations. The combination of these two probabilistic quality measures forms a fitness function, which guides the evolution of a genetic algorithm. For each target user, the genetic algorithm explores the user’s whole search space and selects the most suitable neighbourhood for him/her. Thus, our approach strikes a balance between the accuracy and the novelty of the provided item recommendations, as will be experimentally shown on the MovieLens dataset.
    Keywords: genetic algorithm; neighbourhood selection; novelty; diversity; relevancy; cold start problem.
    DOI: 10.1504/IJBIDM.2022.10040584
     
  • Leveraging the Fog based Machine Learning Model for ECG based Coronary disease prediction   Order a copy of this article
    by Hanumantharaju R, Shreenath KN, Sowmya BJ, K.G. Srinivasa 
    Abstract: Smart healthcare systems need a remote monitoring system based on the Internet of Things. Smart healthcare services are an innovative way of synergising the benefits of sensors for large-scale analytics to deliver better patient care. This work provides the sick, as well as the healthy population, with healthcare services through remote observation, using detailed calculations, tools and methods for better care. The proposed system integrates an architecture based on IoT, fog computing and machine learning (ML) algorithms. The data collected about heart diseases are loaded and filtered, and attributes are extracted, at the fog layer; the classification model is built at the fog nodes. The output of the model is sent to the cloud layer to train classifiers. The cloud layer evaluates the ML algorithms used to predict disease. The results show that random forest has better feature extraction than naive Bayes, with gains of 3% in precision, 3% in recall and 13% in F-measure.
    Keywords: internet of things; IoT; machine learning; random forest; naive Bayes; fog layer; remote monitoring; feature extraction.
    DOI: 10.1504/IJBIDM.2022.10041200
     
  • A predictive model of electricity quality indicator in distribution subsidiaries   Order a copy of this article
    by Ana Flávia L. Gonçalves, Rafael Frinhani, Bruno G. Batista, Rafael P. Pagan, Edvard M. De Oliveira, Bruno T. Kuehne, João Paulo R. R. Leite, João Víctor De M. S. Gomes 
    Abstract: Electricity concessionaires pay out large sums annually in reparations to consumers who experience service unavailability. Availability of the energy supply is a major challenge because the distribution infrastructure is constantly affected by climatic, environmental, and social causes. To assist decision making in mitigating grid failures, this study aims to predict the number of incidences of electricity shortage for consumers. A predictive model was developed using predictive data analysis and conforms to a knowledge discovery process. A hybrid classifier was developed from the model, using both unsupervised and supervised methods. The experiments were carried out with real incidence and climatic data from four subsidiaries of an energy concessionaire. The results show the forecasting model’s feasibility, with classification accuracy between 58.33% and 91.66%. The results also show that peculiarities in terms of geographic location, energy demand, and climatic conditions make it difficult to use a generic prediction model.
    Keywords: electric quality indicator; predictive data analysis; machine learning; unsupervised methods; supervised methods; knowledge discovery in data.
    DOI: 10.1504/IJBIDM.2022.10041550
     
  • Real-Time Predictive Big Data Analytics System: Forecasting Stock Trend Using Technical Indicators   Order a copy of this article
    by Myat Cho Mon Oo  
    Abstract: The emergence of financial big data stocks has caused dramatic changes, and predictive analytics systems require a scalable architecture to intelligently process these data. In this paper, a real-time predictive big data analytics (RPBA) system is proposed that uses technical indicators to predict the stock market trend. Scalable random forest (SRF) is enhanced as a financial instrument through hyperparameter optimisation. This paper explores a novel alternative that combines feature engineering and the enhanced SRF to maximise the desired measure of stock prediction models, based on data from four stock periods: inactive, sub-active, active, and strong-active. The empirical findings indicate that the proposed RPBA system can provide high predictability, 85% for short-term and 99% for long-term predictions, over eight real-time financial stock markets.
    Keywords: big data; predictive analytics system; technical indicators; stock trend.
    DOI: 10.1504/IJBIDM.2022.10041467
     
  • Augmenting Keyword-Based Patent Prior Art Search Using Weighted Classification Code Hierarchies   Order a copy of this article
    by Alok Khode, Sagar Jambhorkar 
    Abstract: Patents are critical intellectual assets for any business. With the rapid increase in patent filings, patent prior art retrieval has become an important task. The goal of prior art retrieval is to find documents relevant to a patent application. Due to the special nature of patent documents, relying only on keyword-based queries does not prove effective in patent retrieval. Previous works have used the international patent classification (IPC) to improve the effectiveness of keyword-based search. However, these systems have used a two-stage retrieval process, using the IPC mostly to filter patent documents or to re-rank the documents retrieved by a keyword-based query. In the approach proposed in this paper, weighted IPC code hierarchies have been explored to augment keyword-based search, thereby eliminating the use of an additional processing step. Experiments on the CLEF-IP 2011 benchmark dataset show that the proposed approach outperforms the baseline on MAP, Recall and PRES.
    Keywords: patent retrieval; prior art search; international patent classification; IPC; query formulation; query expansion; information retrieval; IPC hierarchy; weighted IPC.
    DOI: 10.1504/IJBIDM.2022.10041582
     
  • A Review of Scalable Time Series Pattern Recognition   Order a copy of this article
    by Kwan Hua Sim, Kwan Yong Sim, Valliappan Raman 
    Abstract: Time series data mining helps derive new, meaningful and hidden knowledge from time series data. Thus, time series pattern recognition has been the core functionality in time series data mining applications. However, mining of unknown scalable time series patterns with variable lengths is by no means trivial. It could result in quadratic computational complexities to the search space, which is computationally untenable even with the state-of-the-art time series pattern mining algorithms. The mining of scalable unknown time series patterns also requires the superiority of the similarity measure, which is clearly beyond the comprehension of standard distance measure in time series. It has been a deadlock in the pursuit of a robust similarity measure, while trying to contain the complexity of the time series pattern search algorithm. This paper aims to provide a review of the existing literature in time series pattern recognition by highlighting the challenges and gaps in scalable time series pattern mining.
    Keywords: time series pattern recognition; scalable time series pattern matching; motif discovery; time series data mining; distance measure; dimension reduction; sliding window search.
    DOI: 10.1504/IJBIDM.2022.10041672
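    The sliding window search and distance measure named in the keywords above can be sketched for the simplest case, a pattern of known, fixed length. The example below assumes a z-normalised Euclidean distance, which is one common choice and not necessarily the measure the review favours; the function names and the toy series are hypothetical.

```python
import math

def z_normalise(seq):
    """Rescale a sequence to zero mean and unit variance (all zeros if constant)."""
    mean = sum(seq) / len(seq)
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / len(seq))
    if std == 0:
        return [0.0] * len(seq)
    return [(x - mean) / std for x in seq]

def best_match(series, pattern):
    """Slide a window of len(pattern) over series and return (index, distance)
    of the window closest to the pattern under z-normalised Euclidean distance.
    Ties keep the earliest window."""
    m = len(pattern)
    target = z_normalise(pattern)
    best_idx, best_dist = -1, math.inf
    for i in range(len(series) - m + 1):
        window = z_normalise(series[i:i + m])
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(window, target)))
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx, best_dist

# Hypothetical series containing the shape [1, 3, 1] at index 2
# and a scaled copy of the same shape at index 7.
series = [0, 0, 1, 3, 1, 0, 0, 10, 30, 10, 0]
idx, dist = best_match(series, [1, 3, 1])
```

    One pass costs on the order of n times m comparisons for a single window length. When the pattern length is unknown, the search has to be repeated for every candidate length, which is where the scalability problem described in the abstract arises.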