International Journal of Business Intelligence and Data Mining (40 papers in press)
Analysis and Prediction of Heart Disease with the Aid of Various Data Mining Techniques: A Survey
by V. Poornima, D. Gladis
Abstract: In recent times, health diseases have been increasing gradually, partly because of heredity. Heart disease in particular has become more common nowadays, putting individuals' lives at risk. Data mining strategies, namely decision trees, naive Bayes, neural networks, K-means clustering, associative classification, support vector machines (SVM), fuzzy logic, rough set theory and orthogonal locality preserving methodologies, have been examined on heart disease databases. In this paper, we survey papers in which at least one data mining algorithm is utilised for the prediction of heart disease. This survey covers the current procedures involved in risk prediction of heart disease through classification in data mining. The survey of pertinent data mining strategies involved in risk prediction of heart disease shows that hybrid approaches give better prediction models than single-model approaches.
Keywords: Data mining; Heart Disease Prediction; performance measure; Fuzzy; clustering.
Discovery of Rare Association Rules in the Distribution of Lawsuits in the Federal Justice System of Southern Brazil
by Lucia Gruginskie, Guilherme Vaccaro, Leonardo Chiwiakwosky, Attilla Blesz Jr
Abstract: In the context of data mining, infrequent association rules may be beneficial for analysing rare or extreme cases with very low support values and high confidence. In researching risky situations or allocating specific resources, such rules may have a much greater impact than rules with high support value. The objective of this study is to obtain association rules from the database of lawsuits filed in the Federal Court of Southern Brazil in 2016, including both frequent and rare rules. By finding these rules, especially rare ones, the information collected can assist in the decision-making process, in this case, such as training clerks or establishing specialised courts.
Keywords: Association Rules; Rare Rules; Distribution of lawsuits; Brazilian Federal Justice; Data mining.
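As a rough illustration of the underlying idea (not the authors' implementation), rules with very low support but high confidence can be surfaced by filtering pairwise itemset counts against a *maximum*-support threshold; the lawsuit-attribute transactions and thresholds below are invented for illustration:

```python
from itertools import combinations
from collections import Counter

def rare_rules(transactions, max_support, min_confidence):
    """Find 1->1 association rules with support <= max_support
    and confidence >= min_confidence (rare but reliable rules)."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), c in pair_counts.items():
        support = c / n
        if support > max_support:
            continue  # keep only rare itemsets
        for ant, cons in ((a, b), (b, a)):
            confidence = c / item_counts[ant]
            if confidence >= min_confidence:
                rules.append((ant, cons, support, confidence))
    return rules

# Hypothetical lawsuit-attribute transactions
data = [["tax", "appeal"], ["tax", "appeal"], ["pension"],
        ["pension"], ["pension"], ["fraud", "urgent"],
        ["pension"], ["pension"], ["tax"], ["pension"]]
print(rare_rules(data, max_support=0.25, min_confidence=0.9))
```

A production implementation would instead run Apriori or FP-growth with a maximum-support constraint, but the support/confidence filtering logic is the same.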
A Multiclass Classification Approach for Incremental Entity Resolution on Short Textual Data
by Denilson Pereira, João A. Silva
Abstract: Several web applications maintain data repositories containing references to thousands of real-world entities originating from multiple sources, and they continually receive new data. Identifying the distinct entities and associating the correct references with each one is a problem known as entity resolution. The challenge is to solve the problem incrementally, as the data arrive, especially when those data are described by a single textual attribute. In this paper, we propose a new approach for incremental entity resolution. The method we have implemented, called AssocIER, uses an ensemble of multiclass classifiers with self-training and detection of novel classes. We have evaluated our method on various real-world datasets and scenarios, comparing it with a traditional entity resolution approach. The results show that AssocIER is effective and efficient at resolving unstructured data in collections with a large number of entities and features, and is able to detect hundreds of novel classes.
Keywords: Entity Resolution; Associative Classification; Incremental Learning; Novel Class Detection; Self-training.
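A minimal stand-in for the incremental loop (not the actual AssocIER ensemble) treats each known entity as a class, assigns incoming short-text records to the most similar class, and creates a new class when similarity falls below a threshold; the records and threshold here are illustrative:

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

class IncrementalResolver:
    """Assign each incoming record to the most similar known entity,
    or create a new entity when similarity is below a threshold
    (a toy stand-in for AssocIER's novel-class detection)."""
    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.entities = []  # one token set per known entity

    def resolve(self, text):
        tokens = set(text.lower().split())
        best_id, best_sim = None, 0.0
        for eid, profile in enumerate(self.entities):
            sim = jaccard(tokens, profile)
            if sim > best_sim:
                best_id, best_sim = eid, sim
        if best_id is None or best_sim < self.threshold:
            self.entities.append(tokens)      # novel entity detected
            return len(self.entities) - 1
        self.entities[best_id] |= tokens      # self-training update
        return best_id

r = IncrementalResolver(threshold=0.4)
print(r.resolve("canon eos 5d camera"))          # → 0 (new entity)
print(r.resolve("canon eos 5d digital camera"))  # → 0 (matched)
print(r.resolve("apple iphone 13"))              # → 1 (novel)
```

The `|=` update on a matched profile mirrors self-training: confident assignments enrich the class model used for later records.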
Method for Improvement of Transparency: Use of Text Mining Techniques for Reclassification of Governmental Expenditures Records in Brazil
by Gustavo De Oliveira Almeida, Kate Revoredo, Claudia Cappelli, Cristiano Maciel
Abstract: Many countries have transparency laws requiring the availability of data. However, data is often available but not transparent. We present the case of the Transparency Portal of the Brazilian Federal Government and discuss limitations of public acquisitions data stored in free-text format. We employed text mining techniques to reclassify descriptive texts of measurement units related to products and services. The solution, implemented in KNIME and Java, aggregated measurements in the original sample (n = 69,372; 78% reduction in the number of descriptions; 94% of items classified) and in a cross-validation sample (n = 105,266; 88% reduction; 78% of items classified). In addition, we measured the computational time for processing texts over a wide range of input sizes, suggesting the stability and scalability of the solution for larger datasets. Finally, we produced analyses identifying probable input errors, suppliers and purchasing units with abnormal transactions, and factors affecting procurement prices. We present suggestions for future research and improvements.
Keywords: e-government; data mining; open government; text mining; transparency; KNIME; knowledge discovery; techniques; Brazil.
Data Mining in Credit Insurance Information System for Bank Loans Risk Management in Developing Countries
by Fouad J. Al Azzawi
Abstract: The task of credit risk insurance is critical in our time, since loans are taken by everyone and everywhere, and it is quite difficult to accurately estimate the possible losses incurred when those loans are not repaid. This work proposes an information system module for the banking system to improve the risk management operation, distributing losses on a fair basis while accepting the maximum number of loan requests. By insuring the risk associated with defaulted loans, the bank partially or completely shifts losses under this contract to the insurance company, thus minimising its own losses. The proposed module can determine the price at which the bank can buy such an insurance policy. It could also be a valuable motivation for developing countries to update their insurance market strategy by outsourcing part of the state's insurance functions to an independent insurance industry. Data mining techniques and mathematical induction have been used to implement this model. An optimal classification module for predicting risky loan requests has been successfully employed, and a new mathematical model has been developed for calculating the cost of an insurance policy in a crisis economy.
Keywords: Data mining; Credit insurance; information systems; Bank loans; risk management; developing countries.
CARs-RP: Lasso Based Class Association Rules Pruning
by AZMI Mohamed, Abdelaziz Berrado
Abstract: Classification based on association rules is attracting growing interest in research and practice. In many contexts, rules are mined from sparse data in high-dimensional spaces, which leads to a large number of rules with considerable containment and overlap. Pruning is often used in the search for an optimal subset of rules. This paper introduces a method for class association rules (CARs) pruning. It learns weights for a set of CARs by maximising the likelihood function subject to a constraint on the sum of the absolute values of the weights. The pruning strength is controlled by a shrinkage parameter λ. The suggested method allows the user to choose an appropriate subset of CARs, based on a trade-off between the accuracy and complexity of the resulting classifier, which is controlled by changing λ. Experimental analysis shows that the introduced method builds more concise classifiers with accuracy comparable to other methods.
Keywords: class association rules; pruning; regularization; weighting; associative classification.
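The weighting scheme can be approximated with off-the-shelf L1-regularised logistic regression, treating each CAR as a binary "rule matches record" feature; in scikit-learn the parameter C is inversely related to the shrinkage strength, so smaller C prunes more rules. The data below is synthetic, not from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_rules = 200, 12
# X[i, j] = 1 if (hypothetical) rule j matches record i
X = rng.integers(0, 2, size=(n_samples, n_rules)).astype(float)
# Only the first three rules actually drive the class label
y = (X[:, 0] + X[:, 1] + X[:, 2] >= 2).astype(int)

def surviving_rules(C):
    """Count CARs with non-zero weight after L1 shrinkage."""
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X, y)
    return int(np.sum(clf.coef_ != 0))

# Stronger shrinkage (smaller C) should keep fewer rules
print(surviving_rules(C=10.0), surviving_rules(C=0.05))
```

Sweeping C and checking held-out accuracy at each point reproduces the accuracy-versus-complexity trade-off the paper describes.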
PPM-HC: a Method for Helping Project Portfolio Management Based on Topic Hierarchy Learning
by Ricardo M. Marcacini, Ricardo A. M. Pinto, Flavia Bernardini
Abstract: Project categorisation is a crucial step in project portfolio management (PPM). Categorising projects allows the organisation to identify categories with a lack or excess of projects, according to its strategic objectives. In this work, we present a new method for project portfolio management based on hierarchical clustering (PPM-HC) to organise projects at several levels of abstraction. In PPM-HC, similar projects are allocated to the same clusters and subclusters. PPM-HC automatically learns an understandable topic hierarchy from the project portfolio dataset, thereby facilitating the (human) task of exploring, analysing and prioritising the organisation's projects. We also propose a card sorting-based technique which allows the evaluation of the project categorisation using an intuitive visual map. We carried out an experimental evaluation on a benchmark dataset and also present a real-world case study. The results show that the proposed PPM-HC method is promising.
Keywords: Project Portfolio Management; Projects Categorization; Topic Hierarchy Learning; Hierarchical Clustering.
An efficient approach for Defect Detection in Texture analysis using Improved Support Vector Machine
by Manimozhi I., Janakiraman S.
Abstract: Texture defect detection can be defined as the process of determining the location and size of a collection of pixels in a texture image that deviate in their intensity values or spatial distribution in comparison to a background texture. The detection of such abnormalities is a very challenging problem in computer vision. We propose a method for detecting defects through pattern texture analysis. Initially, features are extracted from the input image using the gray level co-occurrence matrix (GLCM) and gray level run-length matrix (GLRLM). The extracted features are then fed to the classification stage, where classification is done by an improved support vector machine (ISVM): the traditional support vector machine is improved by means of kernel methods. In the final stage, the classified features are segmented using the modified fuzzy C-means algorithm (MFCM).
Keywords: Texture defect detection; preprocessing; Gray Level Co-occurrence matrix; Gray Level Run-Length Matrix; Improved Support Vector Machine; modified fuzzy c means algorithm.
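A small sketch of the GLCM feature-extraction stage, hand-rolled in NumPy for a single "one pixel to the right" offset rather than using a library; in a full pipeline such features would feed the SVM classifier. The images and level count are illustrative:

```python
import numpy as np

def glcm_features(img, levels=4):
    """Gray level co-occurrence matrix for the horizontal offset,
    plus two classic Haralick-style features (contrast, energy)."""
    glcm = np.zeros((levels, levels))
    for i, j in zip(img[:, :-1].ravel(), img[:, 1:].ravel()):
        glcm[i, j] += 1
    glcm /= glcm.sum()  # normalise to a joint probability
    r, c = np.indices(glcm.shape)
    contrast = float(np.sum(glcm * (r - c) ** 2))
    energy = float(np.sum(glcm ** 2))
    return contrast, energy

uniform = np.zeros((8, 8), dtype=int)    # flat texture
noisy = np.arange(64).reshape(8, 8) % 4  # rapidly varying texture
print(glcm_features(uniform), glcm_features(noisy))
```

A flat texture yields zero contrast and maximal energy, while a rapidly varying one does the opposite, which is exactly the discriminative signal the classifier relies on.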
A DYNAMIC REPLICATIVE K-MEANS WITH SELF-COMPILING PARTICLE SWARM INTELLIGENCE FOR DATASET CLASSIFICATION
by A. M. Viswa Bharathy
Abstract: The classification techniques proposed so far are not sufficiently intelligent to classify datasets beyond two-level classification. To multi-classify network data, more hybrid algorithms are needed. In this paper we propose a hybrid technique combining a modified K-means algorithm, called dynamic replicative K-means (DRKM), with self-compiling particle swarm intelligence (SCPSI). The dataset chosen for the experiment is KDD Cup 99. DRKM-SCPSI performs better in terms of detection rate (DR), false positive rate (FPR) and accuracy, as is visible from the results presented.
Keywords: anomaly; detection; intrusion; K-Means; PSI.
PORTFOLIO SELECTION WITH SUPPORT VECTOR REGRESSION: MULTIPLE KERNELS COMPARISON
by Pedro Alexandre Henrique, Pedro Albuquerque, Peng Yao Hao, Sarah Sabino
Abstract: This study aimed to verify whether the use of support vector regression (SVR) makes portfolio returns exceed the market. For this purpose, SVR was applied with 15 different kernel functions to select the best stocks for each quarter, calculating the quarterly portfolio return and the cumulative return over the period. Subsequently, the returns of these portfolios were compared with the returns of a market benchmark. White's (2000) test was applied to avoid the data-snooping effect in assessing the statistical significance of the portfolios developed by the training strategies. The portfolio selected by SVR with the inverse multiquadric kernel presented the highest cumulative return, 374.40%, and a value at risk (VaR) of 6.87%. The results of this study corroborate the superiority hypothesis of support vector regression in the formation of portfolios, constituting a robust predictive method capable of coping with high-dimensional interactions.
Keywords: Statistical Learning Theory; Optimization Theory; Financial Econometrics; Support Vector Machine; Kernel methods.
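The kernel-comparison protocol can be illustrated with scikit-learn's SVR on synthetic data; note that scikit-learn ships only a handful of built-in kernels, so the paper's inverse multiquadric kernel would require a custom kernel callable. Everything below is an invented illustration, not the paper's setup:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
X = np.sort(rng.uniform(-3, 3, size=(120, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=120)  # nonlinear target

# Simple even/odd split into train and test halves
X_train, y_train = X[::2], y[::2]
X_test, y_test = X[1::2], y[1::2]

scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    model = SVR(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = r2_score(y_test, model.predict(X_test))
print(scores)
```

On a genuinely nonlinear target the flexible kernels dominate the linear one, which is the kind of ranking the paper's quarterly stock-selection experiment produces at larger scale.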
Worldwide Gross Revenue Prediction for Bollywood Movies using Hybrid Ensemble Model
by Alina Zaidi, Siddhaling Urolagin
Abstract: Predicting revenue before a movie is released can be very beneficial for stakeholders and investors in the movie industry. Even though Indian cinema is a booming industry, the literature on movie revenue prediction is more inclined towards non-Indian movies. In this study we built a novel hybrid prediction model to predict worldwide gross for Bollywood movies. A Bollywood movies dataset consisting of 674 movies was prepared by downloading movie-related features from IMDb and YouTube movie trailers. K-means clustering is performed on the movie dataset, two major clusters are identified, and important features specific to each cluster are selected. The proposed hybrid model segregates movies into the two clusters and employs a prediction model for each cluster. The prediction models we tested included various basic machine learning models and ensemble models. The ensemble model combining predictions from support vector regression, a neural network and ridge regression gave the best result for both clusters, and we chose it as our final model. We obtain an overall MAE of 0.0272 and R2 of 0.80 after 10-fold cross-validation.
Keywords: Bollywood; Movie Revenue Prediction; Box office; Regression; Ensemble; Feature Selection; Machine Learning; Scikit-Learn.
Health Data Warehouses: Reviewing Advanced Solutions for Medical Knowledge Discovery
by Norah Alghamdi
Abstract: The implementation of data warehouses and decision support systems in healthcare, utilising the capabilities of information retrieval and knowledge discovery tools, has enhanced the healthcare offered. In this work, we present a review of recent data warehouses and decision support systems in the healthcare domain, their significance, and applications in evidence-based medicine, electronic health records, and nursing. Given the growing trend of their implementation in healthcare services, research, and education, we present the most recent publications that employ these tools to produce suitable decisions for patients or health providers. For all the reviewed publications, we have intensively explored their problems, suggested solutions, utilised methods, and findings. We have also highlighted the strengths of the existing approaches and identified potential drawbacks, including data correctness, completeness, consistency, and integration, for proper medical decision-making.
Keywords: Data warehouses; Data Mining; Health Data; Medical Records; Quality; Knowledge Discovery; OLAP.
Survey on-demand: A versatile scientific article automated inquiry method using text mining applied to Asset Liability Management
by Pedro Henrique Albuquerque, Igor Nascimento, Peng Yao Hao
Abstract: We propose a methodology that automatically relates the content of text documents to lexical items. The model estimates whether an article addresses a specific research object based on the relevant words in its abstract and title, using text mining and partial least squares discriminant analysis. The model is efficient in accuracy, and its adjustment and validation indicators are superior or equal to those of other text classification models in the literature. In comparison to existing methods, our method offers highly interpretable outcomes and allows flexible measurements of word frequency. The proposed solution may aid scholars in the process of searching theoretical references, suggesting scientific articles based on similarities in the vocabulary used. Applied to the finance area, our framework indicates that approximately 10% of the publications in the selected journals address the subject of asset liability management. Moreover, we highlight the journals with the largest number of publications over time and the keywords of the subject, using only freely accessible information.
Keywords: dimensionality reduction; discriminant analysis; text classification; partial least square; bibliometrics.
Clustering Student Instagram Accounts using the Author-Topic Model
by Nur Rakhmawati, Faiz NF, Irmasari Hafidz, Indra Raditya, Pande Dinatha, Andrianto Suwignyo
Abstract: This study proposes a topic model to cluster the Instagram accounts of high school teenagers in Surabaya, Indonesia, using the author-topic model method. We collected 235 valid Instagram accounts (133 female and 102 male students) and gathered a total of 3,346 captions from the Instagram posts of 18 senior high schools. Our major findings concern the topics that define the students' Instagram posts and captions, namely: feeling, Surabaya events, photography, artists, vacation, religion and music. The lowest perplexity comes from 90 iterations, which suggests six groups of topics; these are concluded based on the lowest perplexity value and labelled according to the words included in each topic. The topic of photography is discussed by six schools, photography-artists and vacation are discussed by three schools each, while the remaining topics are discussed by one or two schools.
Keywords: Topic Modelling; Senior High School Students; Author-Topic Models.
The approach of using ontology as pre-knowledge source for semi-supervised labelled topic model by applying text dependency graph
by Phu Pham, Phuc Do
Abstract: Discovering multiple topics in text is an important task in text mining. Traditional supervised approaches fail to explore multiple topics in text. Topic modelling approaches such as LSI, pLSI and LDA are unsupervised methods that discover distributions of multiple topics in text documents. The labelled LDA (LLDA) model is a supervised method which integrates human-labelled topics with the given text corpus during topic modelling. However, in real applications, we may not have enough high-quality knowledge to properly assign topics to all documents before applying LLDA. In this paper, we present two approaches that take advantage of the dependency graph-of-words (GOW) in text analysis. The GOW approach uses the frequent sub-graph mining (FSM) technique to extract graph-based concepts from text. Our first approach, called GC2Onto, uses graph-based concepts to construct a domain-specific ontology. In our second approach, called LLDA-GOW, the graph-based concepts are applied to improve the quality of traditional LLDA. We combine the GC2Onto and LLDA-GOW models to leverage multiple topic identification as well as other text mining tasks.
Keywords: topic identification; labelled topic modelling; LDA; labelled LDA; ontology-driven topic labelling; dependency graph.
A comparison of cluster algorithms as applied to unsupervised surveys
by Kathleen C. Garwood, Arpit Dhobale
Abstract: When considering answering important questions with data, unsupervised data offers extensive insight opportunities and unique challenges. This study considers student survey data with the specific goal of clustering students into like groups, with the underlying aim of identifying different poverty levels. Fuzzy logic is applied during the data cleaning and organising phase, helping to create a logical dependent variable for comparison of the analyses. Using multiple data reduction techniques, the survey was reduced and cleaned. Finally, multiple clustering techniques (k-means, k-modes and hierarchical clustering) are applied and compared. Though each method has strengths, the goal was to identify which was most viable when applied to survey data, specifically when trying to identify the most impoverished students.
Keywords: Fuzzy logic; cluster analysis; unsupervised learning; survey analysis; decision support system; k-means; k-modes; hierarchical clustering.
Discovery of inconsistent generalized coherent rules
by Anuradha Radhakrishnan, Rajkumar N, Rathi Gopalakrishnan, Soosaimichael PrinceSahayaBrighty
Abstract: Mining multiple-level association rules in a predefined taxonomy of hierarchies paves the way for generalised rule mining using interestingness measures such as support and confidence. Coherent rule mining, by contrast, identifies significant rules in a database without using interestingness measures. In this paper we propose a new mining algorithm, called generalised inconsistent coherent rule mining (GICRM), for mining a new form of generalised coherent rules called inconsistent coherent rules. The discovered rules are called inconsistent because their correlation changes from one level of the taxonomy to another. The rules are mined from a structured dataset with a predefined taxonomy. The inconsistent rules mined are noteworthy from a business point of view for taking strategic decisions in market basket analysis.
Keywords: GICRM; multiple-level; generalized inconsistent coherent rule; taxonomy.
Time and Structural Anomalies Detection in Business Processes Using Process Mining
by Elham Saeedi, Faramarz Safi-Esfahani
Abstract: Information systems are increasingly integrated into operational processes, and as a result many events are recorded by these systems. Lack of compatibility between the process model and the observed behaviour is one of the challenges in constructing the process model in process mining. This incompatibility can be present both in the structure (the sequence of tasks) and in the time spent on each task. In this paper, a hybrid approach for detecting structural and time anomalies via process mining is proposed. A dataset from the Iran Insurance Company is used for a case study. The proposed method detected 98.5% of structural anomalies and 96.3% of time anomalies, which is one of the main achievements of this paper. A second, standard dataset, referred to as dataset 2, is used to further examine the proposed method. The proposed method demonstrated better performance compared with the baseline approach.
Keywords: Process mining; conformance checking; workflow mining; structural anomaly; time anomaly; flexible model; Insurance anomaly; anomaly detection; process model; control-flow perspective.
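The time-anomaly side of such a check can, in its simplest form, flag task durations with an outlying z-score; this sketch is far simpler than the paper's hybrid method, and the durations below are invented:

```python
from statistics import mean, stdev

def time_anomalies(durations, z_threshold=2.0):
    """Flag indices of task durations whose z-score exceeds the
    threshold (a toy stand-in for a time-anomaly check)."""
    mu, sigma = mean(durations), stdev(durations)
    return [i for i, d in enumerate(durations)
            if abs(d - mu) / sigma > z_threshold]

# Hypothetical processing times (minutes) for one claim-handling task
durations = [30, 32, 31, 29, 30, 33, 31, 30, 95, 31]
print(time_anomalies(durations))  # → [8]
```

Structural anomalies need the process model itself (e.g. conformance checking of task sequences against the discovered model), which is where process mining tools go beyond this per-task statistic.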
Analysis of road accident data and determining affecting factors by using regression models and decision tree
by Hanieh GharehGozlu
Abstract: This study analyses road accident data with the aim of predicting the probability of road accidents leading to death and determining the contributing factors. Regression models, including logit, probit, complementary log-log and Gompertz, and decision trees based on the CART algorithm, were used to analyse actual data from the rail road police centre of the country. The results show that the logit regression model is superior to the other models from the perspective of the accuracy indicator scales. The variables day of week, age, shoulder path, road side, road type, road position, maximum speed, safety belt use, specific safety equipment, vehicle type and vehicle manufacturer country are among the variables that significantly affect the probability of road deaths, which can be reduced by controlling their levels.
Keywords: Road accidents; Regression models; Decision tree model; Accuracy indicator scales.
A Review of Market Basket Analysis on Business Intelligence and Data Mining
by Nilam Nur Amir Sjarif, Nurulhuda Firdaus Mohd Azmi, Siti Sophiayati Yuhaniz, Doris Hooi-Ten Wong
Abstract: Business intelligence (BI) is an information-driven approach that encompasses an assortment of tools, technologies, applications, procedures and methodologies enabling the mining of helpful knowledge and information from operational data resources. Hidden patterns and trends derived from the tremendous volume of information contribute to informed and strategic decision making. Market basket analysis (MBA) is one of the most regularly utilised data mining techniques in BI, helping business organisations accomplish competitive advantage. Although the adoption of MBA as a data mining technique in BI tools is common in e-commerce, papers that survey BI and MBA together are limited. This paper gives a big picture of the current state of BI and the application of MBA as a BI technique. Literature related to BI and MBA from various sources, such as digital libraries and Google Scholar, is explored. The survey serves to some degree as a guide or platform for researchers and practitioners for future improvement.
Keywords: Market Basket Analysis; Business Intelligence; Data Mining.
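The core MBA quantities (support, confidence, lift) can be computed directly from transaction counts; the baskets below are hypothetical point-of-sale data, and a full analysis would use a library such as mlxtend's Apriori implementation instead:

```python
from itertools import combinations
from collections import Counter

baskets = [  # hypothetical point-of-sale transactions
    {"bread", "milk"}, {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"}, {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
n = len(baskets)
single = Counter(i for b in baskets for i in b)
pairs = Counter(p for b in baskets for p in combinations(sorted(b), 2))

def rule_stats(antecedent, consequent):
    """Support, confidence and lift of antecedent -> consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    support = pairs[pair] / n
    confidence = pairs[pair] / single[antecedent]
    lift = confidence / (single[consequent] / n)
    return support, confidence, lift

print(rule_stats("diapers", "beer"))
```

Lift above 1 signals that the two products co-occur more often than independence would predict, which is the signal retailers act on in cross-selling decisions.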
Stock Price Forecasting and News Sentiment Analysis Model using Artificial Neural Network
by Sriram K. V, Somesh Yadav, Ritesh Singh Suhag
Abstract: The stock market is highly volatile, and the prediction of stock prices has always been an area of interest to many statisticians and researchers. This study is an attempt to predict stock prices using an artificial neural network (ANN). Three models have been built: one for the future prediction of stock prices based on previous trends, a second for predicting the next day's closing price based on today's opening price, and a third that analyses the sentiment of news articles and gives scores based on the news impact. The ANN is trained on historical data using the R-Studio platform and then used to predict future values. Our experimental results for various stocks showed that the ANN-based model is effective.
Keywords: Stock Pricing; Forecasting; Artificial Neural Network; News sentiment; Opening price; Closing price; R Studio; Data analytics.
Associative Classification Model for Forecasting Stock Market Trends
by Everton Castelão Tetila, Bruno Brandoli Machado, Jose F. Rorigues-Jr, Nícolas Alessando De Souza Belete, Diego A. Zanoni, Thayliny Zardo, Michel Constantino, Hemerson Pistori
Abstract: This paper proposes an associative classification model based on three technical indicators to forecast future trends of the stock market. Our methodology assessed the performance of nine technical indicators, using a portfolio of ten stocks and a twelve-year time series. The experimental results showed that using a set of technical indicators leads to higher classification rates than using single technical indicators, reaching an accuracy of 88.77%. The proposed approach also uses a multidimensional data cube that allows automatic updating of stock market asset values, which is essential to keep the forecast up to date. The results indicate that our approach can support investors and analysts operating in the stock market.
Keywords: stock market trends; technical indicators; associative classification; data mining; business intelligence.
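Technical indicators of the kind used as classification features can be illustrated with a simple moving-average crossover signal; the indicator choice and price series here are invented and are not the paper's three indicators:

```python
def sma(prices, window):
    """Simple moving average, a basic technical indicator."""
    return [sum(prices[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(prices))]

def trend_signal(prices, short=3, long=5):
    """'up' when the short SMA is above the long SMA, else 'down'
    (a toy stand-in for indicator-derived class labels)."""
    s, l = sma(prices, short), sma(prices, long)
    offset = long - short  # align the two series on the same day
    return ["up" if s[i + offset] > l[i] else "down"
            for i in range(len(l))]

prices = [10, 11, 12, 13, 14, 13, 12, 11, 10, 9]
print(trend_signal(prices))
```

In an associative classification setting, discretised indicator values like these become the items from which class association rules ("SMA crossover = down AND RSI < 30 → trend = up", say) are mined.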
Mining the Productivity Data of Garment Industry
by Abdullah Al Imran, Md Shamsur Rahim, Tanvir Ahmed
Abstract: The garment industry is one of the key examples of the industrial globalisation of this modern era. It is a highly labour-intensive industry with many manual processes. Satisfying the huge global demand for garment products mostly depends on the production and delivery performance of the employees in garment manufacturing companies. It is therefore highly desirable for decision makers in the garment industry to track, analyse and predict the productivity performance of the working teams in their factories. This study explores the application of state-of-the-art data mining techniques for analysing industrial data, revealing meaningful insights, and predicting the productivity performance of the working teams in a garment company. As part of our exploration, we applied eight different data mining techniques with six evaluation metrics. Our experimental results show that the tree ensemble and gradient boosted tree models perform best in this application scenario.
Keywords: Data Mining; Productivity Prediction; Pattern Mining; Classification; Garment Industry; Industrial Engineering.
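A gradient-boosted-tree experiment of this kind can be mimicked with scikit-learn on synthetic data; the feature names and the target relationship below are assumptions for illustration, not the paper's dataset:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Hypothetical features: team size, overtime, incentive, idle time
X = rng.uniform(0, 1, size=(300, 4))
# Hypothetical productivity: driven mostly by incentive and overtime
y = 0.5 * X[:, 2] + 0.3 * X[:, 1] + 0.05 * rng.normal(size=300)

model = GradientBoostingRegressor(random_state=1)
scores = cross_val_score(model, X, y, cv=5,
                         scoring="neg_mean_absolute_error")
print(-scores.mean())  # cross-validated mean absolute error
```

Cross-validated error metrics like this MAE are what allow the eight candidate models to be ranked fairly on the same data.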
GENERAL CRIME FROM THE DATA MINING POINT OF VIEW. A SYSTEMATIC LITERATURE REVIEW
by Maria Antonia Walteros Alcazar, Nicolas Aguirre Yacup, Sandra P. Castillo Landinez, Pablo E. Caicedo Rodríguez
Abstract: In recent decades, crime has become an issue of great concern to nations, which is why there has been significant progress in the development of investigations in different areas. This literature review considers the data mining techniques applied to crime research through the analysis of four thematic axes: countries, data sources, data mining techniques and software employed in different articles. The analysis used a systematic methodology to examine 111 articles published between 2008 and 2018, selected from almost 70 journals. The articles in this review focus on different types of crime. The findings indicate that the USA is the most active country in analysing crime using data mining techniques, and that the most common sources are open data websites and crime studies. Studies of crime in general are more frequent than those covering a specific type of crime; the algorithms mainly used are clustering followed by classification, and the most widely used software is WEKA.
Keywords: Data Mining DM; Crime; Criminal Patterns; Law Enforcement; Data Mining Techniques; Algorithms; Review; Knowledge Discovery; Literature Review LR.
A Parallel Approach for Web Session Identification to make Recommendation Efficient
by Bhuvaneswari M.S, K. Muneeswaran
Abstract: Web sessions are a significant part of the construction of a recommendation model. The novelty of this work is that it makes use of the backward moves made by the user, considering both the referrer URL and the requested URL extracted from the extended web log for session identification, which is not taken into consideration in existing heuristic-based approaches. Two noteworthy issues in session identification are: i) framing too many sessions of small length; and ii) taking a long time to identify the sessions. In the proposed work, the length of the sessions is maximised using a split and merge technique, and the time taken for session identification is reduced using thread parallelisation. For efficient storage and retrieval of information, the hash map data structure is used. The proposed work shows significant improvement in performance in terms of execution time, standard error, correlation coefficient and objective value.
Keywords: Extended Web server logs; Session identification; Split and merge technique; Multithreaded; Hash data structure.
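The referrer-aware session identification step can be sketched as a splitting heuristic that starts a new session on a timeout or a broken referrer chain; this is a simplification of the paper's split-and-merge with thread parallelisation, and the field names and timeout are illustrative:

```python
def sessionize(requests, timeout=1800):
    """Group (timestamp, referrer, url) requests into sessions:
    start a new session when the referrer does not match the
    previous request's URL or the gap exceeds the timeout."""
    sessions, current = [], []
    for ts, ref, url in requests:
        if current:
            last_ts, _, last_url = current[-1]
            if ts - last_ts > timeout or ref != last_url:
                sessions.append(current)
                current = []
        current.append((ts, ref, url))
    if current:
        sessions.append(current)
    return sessions

# Hypothetical extended-log entries: (seconds, referrer, requested URL)
log = [(0, None, "/home"), (60, "/home", "/a"),
       (120, "/a", "/b"), (5000, None, "/home")]
print(len(sessionize(log)))  # → 2 sessions
```

The paper's approach additionally recognises backward moves (the referrer pointing at an earlier page in the same session) and merges over-fragmented sessions, which this sketch omits.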
A Clustering and Treemap-based Approach for Query Reuse and Visualization in Large Data Repositories
by Yousra Harb, Surendra Sarnikar, Omar F. El-Gayar
Abstract: This study presents a query clustering and treemap approach that facilitates access to and reuse of pre-developed data retrieval models (queries) to analyse data and satisfy user information needs. The approach seeks to meet the following requirements: reuse of knowledge (represented as previously constructed queries), query exploration, and ease of use by data users. The approach proposes a feature space for representing queries, applies hierarchical agglomerative clustering (HAC) to cluster the queries, and leverages treemaps to visualise and navigate the resulting query clusters. We demonstrate the viability of the approach by building a prototype data exploration interface for health data from the Behavioral Risk Factor Surveillance System (BRFSS). We conduct cognitive walkthroughs and a user study to further evaluate the effectiveness of the artifact. Overall, the results indicate that the proposed approach meets its design requirements.
Keywords: Query clustering; Query reuse; Query visualization; Query exploration; Information retrieval; Treemap.
Business Intelligence: Fuzzy Logic in Client Risk Analysis
by Jorge Morris, Victor Escobar-Jeria, Juan Luis Castro Peña
Abstract: This paper focuses on achieving accurate results from rough data. Using an inference model based on fuzzy logic, human reasoning was proactively simulated, under certain conditions, in order to deal with the possibility of client loss due to service quality. The experimentation is carried out on information related to complaint receipts over a period of two years (70,000 records). To that effect, a prototype program was written in C++, which receives as input the crisp values that result from the failure resolution for each relevant service. The proposed model is intended to classify clients according to the risk they pose to the contractual relationship with the company.
Keywords: Business Intelligence; Fuzzy Logic; Soft Computing; Decision.
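A minimal sketch of the fuzzification step such a model relies on: a crisp failure-resolution value is mapped to a membership degree. The trapezoidal shape, the hour thresholds and the function names are illustrative assumptions, not the authors' rule base.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership function: 0 below a, rising to 1 on
    [b, c], falling back to 0 at d."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)

def risk_degree(resolution_time):
    """Toy fuzzification of a crisp failure-resolution time (hours)
    into a 'risky client' membership degree."""
    return trapezoid(resolution_time, 12.0, 24.0, 72.0, 96.0)
```

A full fuzzy inference system would combine several such memberships through rules before defuzzifying into a risk class.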
Application of structural modeling to measure the impact of quality on growth factors: Case of the young industrial enterprises installed in the Northwest of Morocco
by Mohamed Ben Ali, Mohammed Hadini, Said Barijal, Saif Rifai
Abstract: This study aims to provide a conceptual model measuring the impact of quality practices on the growth factors of young industrial enterprises located in northwestern Morocco, and to see how quality can stimulate and improve the growth factors of this kind of enterprise. The study is empirical, based on surveys (face-to-face interviews) via questionnaires administered to the owners/managers of young industrial enterprises, using latent variable structural modeling according to the PLS (Partial Least Squares) Path Modeling approach. A total of 220 questionnaires were administered and exploited to assess the degree of use and application of quality practices; five practices were chosen. We conclude that, in general, the quality practices concerning leadership and process management have a positive impact on the growth factors of this type of enterprise, with "strong to medium" importance of effects. In contrast, the quality practices concerning human resources,
Keywords: Growth factors; Growth phase; Modeling; Quality Practices; Young Industrial Enterprises.
Mining Trailer Reviews for Predicting Ratings and Box Office Success of Upcoming Movies
by Nirmalya Chowdhury, Debaditya Barman, Chandrai Kayal
Abstract: Around 60% of the movies produced worldwide are box-office failures. Since it affects a large number of stakeholders, movie business prediction is a relevant as well as challenging problem. There have been many attempts to predict the box-office earnings of a movie after its theatrical release; comparatively few works predict a movie's fate before its release. Viewers are introduced to a movie via trailers before its theatrical release, and the reviews of these trailers are indicative of a movie's initial success. This work focuses on movie rating and business prediction on the basis of trailer reviews as well as other attributes. Several experiments have been performed using multiple classifiers to find appropriate classifier(s) that can predict the rating and box-office performance of a movie to be launched. Experimentally, it has been found that the Random Forest (RF) classifier outperforms the others and produces very promising results.
Keywords: Text Mining; Sentiment Analysis; Machine Learning; Movie Rating; Opening Weekend Income; Gross Income; Movie Trailer; Sensitivity Analysis.
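A toy illustration of the review-to-feature step such a pipeline needs before any classifier can run; the word lists and the function name are assumptions, not the paper's text-mining method.

```python
def sentiment_features(review, pos_words, neg_words):
    """Turn one trailer review into a small numeric feature vector:
    (positive hits, negative hits, polarity in [-1, 1]), using
    simple lexicon lookups on lower-cased, punctuation-stripped
    tokens."""
    tokens = review.lower().split()
    pos = sum(tok.strip(".,!?") in pos_words for tok in tokens)
    neg = sum(tok.strip(".,!?") in neg_words for tok in tokens)
    polarity = (pos - neg) / (pos + neg) if pos + neg else 0.0
    return pos, neg, polarity
```

Vectors like these, combined with other movie attributes, are the kind of input a Random Forest classifier would be trained on.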
Improvement Assessment Method for Special Kids by Observing Social and Behavioural Activity Using Data Mining Techniques
by Dhanalakshmi Radhakrishnan, Muthukumar B.
Abstract: In recent studies, high-throughput innovations have given rise to the accumulation of substantial amounts of heterogeneous data that provide diverse information. Clustering is the process of gathering distinct items into classes of similar objects, and it is used to overcome the drawbacks of classification methods. Earlier clustering algorithms, such as hierarchical clustering and density-based clustering, based on either numerical or categorical attributes, were commercially used in software. In the proposed work, k-means clustering, an unsupervised learning algorithm, is used for prediction. Taking the clinical data of special kids, clusters are formed and categorised by rank with the help of relevant symptoms. In this context, the data of special kids have a statistical impact on categorisation and the early detection of a child's associated conditions. As a result, the proposed method has validated the database of special kids' information with global purity.
Keywords: High-throughput development; Special kids; Categorical attributes; unsupervised k-means Clustering; Gene expressional values.
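For illustration of the clustering step, a plain k-means over numeric feature vectors is sketched below; the data, parameters and function name are hypothetical, not the study's clinical dataset.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on numeric feature vectors given as tuples:
    assign each point to its nearest centroid (squared Euclidean
    distance), then recompute centroids, for a fixed number of
    iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
            groups[i].append(p)
        centroids = [tuple(sum(col) / len(g) for col in zip(*g))
                     if g else centroids[i]
                     for i, g in enumerate(groups)]
    return centroids, groups
```

In the study's setting, each tuple would encode a child's symptom measurements, and the resulting clusters would then be ranked by the relevant symptoms.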
Ensemble Feature Selection Approach for Imbalanced Textual Data Using MapReduce
by Houda Amazal, Kissi Mohamed, Mohammed Ramdani
Abstract: Feature selection is a fundamental preprocessing phase in text classification. It speeds up machine learning algorithms and improves classification accuracy. In the big data context, feature selection techniques have to deal with two major issues: the huge dimensionality and the imbalanced nature of the data. However, the libraries of big data frameworks, such as Hadoop, implement only a few single feature selection methods, whose robustness does not meet the requirements imposed by the large amount of data. To deal with this, we propose in this paper a Distributed Ensemble Feature Selection (DEFS) approach for large imbalanced datasets. The first step of the proposal focuses on tackling the imbalanced distribution of data, using the Hadoop environment to transform the usual documents of a dataset into big documents. Afterwards, we introduce a novel feature selection method, which we call Term Frequency-Inverse Category Frequency (TF-ICF), that is both frequency and category based.
Keywords: Ensemble feature selection; Imbalance data; MapReduce; Text classification.
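Assuming TF-ICF mirrors TF-IDF with categories in place of documents (a plausible reading of the name, not necessarily the authors' exact formula), a single-machine sketch of the scoring could be:

```python
import math
from collections import Counter

def tficf(category_docs):
    """One plausible TF-ICF scoring:
    score(t, c) = tf(t, c) * log(|C| / cf(t)),
    where cf(t) is the number of categories whose merged "big
    document" contains term t.  `category_docs` maps a category
    name to its list of tokens."""
    n_cat = len(category_docs)
    cf = Counter()
    for tokens in category_docs.values():
        cf.update(set(tokens))          # count each category once
    scores = {}
    for cat, tokens in category_docs.items():
        tf = Counter(tokens)
        scores[cat] = {t: tf[t] * math.log(n_cat / cf[t]) for t in tf}
    return scores
```

In the distributed version described in the abstract, the per-category "big documents" would be built by MapReduce jobs before this scoring runs.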
The Five Key Components for Building An Operational Business Intelligence Ecosystem
by A.D.N. Sarma
Abstract: Business intelligence (BI) plays a vital role in decision making in almost all private, business and government organizations. Operational BI is a hybrid system and an emerging concept in the BI space that has gained popularity over the last five years. In this paper, the key components of an operational BI system are presented and their workings explained. The methodology adopted for identifying the components is based on the modularization principles of software engineering, using cohesion and coupling parameters. The proposed components of the system leverage the principles of component-based software engineering, and an orderly arrangement of the key components constitutes an operational BI ecosystem. Further, we explain how these individual key components provide increased business value and timely decision-making information to all users in the organization.
Keywords: Business intelligence; operational BI; business performance management; operational analytics; operational reporting; event monitoring and notification; action time; business value.
A Novel Approach to Retrieve Unlabelled Images
by Deepali Kamthania, Ashish Pahwa, Aayush Gupta, Chirag Jain
Abstract: In this paper, an attempt has been made to propose a search engine architecture for retrieving photographs from a photo bank of unlabelled images. The primary purpose of the system is to retrieve images from an image repository through string-based queries on an interactive interface. To achieve this, the image dataset is transformed into a space where queries can execute significantly faster, by developing a data pipeline through which each image is passed after entering the system. The pipeline consists of HOG-based face detection and extraction, face landmark estimation, an indexer and a transformer. As an image passes through the pipeline, each encoded face in it is compared with the stored vectors by computing the l2 norm distance between them, and the top N results (addresses of faces and corresponding images) are returned to the user. Once the image passes out of the pipeline, retrieval methods and feedback mechanisms are applied.
Keywords: Face Recognition (FR); Deep Learning; Histogram of Oriented Gradients (HOG); FaceNet Architecture; Machine Learning; Support Vector Machine (SVM).
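The l2-norm ranking step can be sketched as below. The function name and the toy two-dimensional index are illustrative; in the described system the vectors would be face encodings (e.g. 128-dimensional FaceNet embeddings) keyed by face address.

```python
import math

def top_n_faces(query, index, n=3):
    """Rank stored face encodings by L2 distance to the query
    encoding and return the n closest (address, distance) pairs.
    `index` maps a face address to its encoding vector."""
    ranked = sorted(
        ((addr, math.dist(query, vec)) for addr, vec in index.items()),
        key=lambda pair: pair[1],
    )
    return ranked[:n]
```

A production system would replace the linear scan with an approximate nearest-neighbour index, but the distance and ranking logic stay the same.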
Prediction of Box-office Success: A Review of Trends and Machine Learning Computational Models
by Elliot Mbunge, Stephen Fashoto, Happyson Bimha
Abstract: The movie industry faces high uncertainty owing to the difficulty businesses have in forecasting sales and revenues. The huge upfront investments associated with the movie industry require investment decisions to be informed by reliable methods of predicting success or returns. This study set out to identify the best forecasting techniques for box-office products. Previous studies focused on predicting box-office success using pre-release and post-release features gathered during and after the production phase. This study reviews the existing literature on predicting box-office success, with the ultimate goal of determining the most frequently used prediction algorithm(s), dataset sources and their accuracy results. We applied the PRISMA model to review papers published from 2010 to 2019, extracted from Google Scholar, Science Direct, IEEE Xplore Digital Library, ACM Digital Library and Springer Link. The study shows that the support vector machine was the most frequently used technique for predicting box-office success, with 21.74% of the total frequency contribution, followed by linear regression with 17.39%. The study also found that the Internet Movie Database (IMDb) is the most used box-office dataset source, with 40.741% of the total frequency, followed by Wikipedia with 11.111%.
Keywords: Box-office; machine learning; movie industry; pre-release; post-release features.
A Novel Framework for Forecasting Time Series Data Based on Fuzzy Logic and Variants of Hidden Markov Models
by S. Sridevi, Parthasarathy Sudhaman, T. Chandrakumar, S. Rajaram
Abstract: Traditional time series forecasting methods, such as the naive, smoothing and moving average models, assume that the time series is stationary and cannot handle linguistic terms. To provide a solution to this problem, fuzzy time series forecasting methods are considered in this research work. The objective of this research is to improve accuracy by introducing a new partitioning method called the Relative Differences (RD) based interval method. This work implements variants of RD-based Hidden Markov Models (HMMs), namely the classic HMM, stochastic HMM, Laplace stochastic smoothing HMM and Probabilistic Smoothing HMM (PsHMM), for forecasting time series data. The performances of the above models were tested on the Australian Electricity Market dataset and the Tamil Nadu weather dataset. The results show that the proposed model, the Relative Differences (RD) based PsHMM, performs much better in terms of precision than the other existing models.
Keywords: Forecasting; Time series; Fuzzy; Time Variant Model; Markov Model; Relative Differences (RD) based interval method.
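As a hedged sketch only: one plausible reading of a relative-differences interval partition is that the average absolute relative difference between consecutive observations sets the interval width over the universe of discourse. The width rule and the function name below are assumptions, not the authors' definition.

```python
def rd_intervals(series):
    """Partition the universe of discourse [min, max] into
    equal-width intervals whose width is driven by the average
    absolute relative difference of consecutive observations."""
    rds = [abs(b - a) / a for a, b in zip(series, series[1:]) if a != 0]
    if not rds:
        return [(min(series), max(series))]
    width = min(series) * (sum(rds) / len(rds))
    bounds = [min(series)]
    while bounds[-1] < max(series):
        bounds.append(bounds[-1] + width)
    return list(zip(bounds, bounds[1:]))
```

In a fuzzy time series model, each interval then becomes a fuzzy set, and the HMM variants operate on the resulting fuzzified sequence.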
Disease Prediction and Knowledge Extraction in Banana Crop Cultivation using Decision Tree Classifiers
by A. Anitha
Abstract: Agriculture plays a vital role in determining the economic status of a country. To meet the growing needs of society and to improve crop productivity, researchers are focusing on the development of various technologies. In India, banana is one of the leading crops with high demand. To improve the yield of banana, it is necessary to detect diseases at an early stage. Also, in order to acquire new farmers and retain existing banana farmers, it is essential to extract knowledge about the hidden causes of various diseases in the banana crop. This work applies data mining techniques, namely decision tree classifiers, to a banana cultivation dataset. The agricultural dataset used for experimentation was collected from farmers cultivating banana in regions fed by the Thamirabharani River, such as the Kanyakumari, Tirunelveli and Tuticorin districts of Tamil Nadu. The higher the disease detection accuracy, the greater the crop productivity. The performance of classifiers such as J48, REP tree and random forest is compared on the basis of classification accuracy, precision, recall and F-measure. Among the classification techniques applied to the agricultural dataset, the random forest algorithm has been found to outperform the other techniques with respect to classification accuracy.
Keywords: Attribute Selection; Decision Tree; Classification; Accuracy.
Extracted information quality, a comparative study in high and low dimensions
by Leandro Ariza-Jiménez, Luisa F. Villa, Nicolás Pinel, Olga Lucia Quintero Montoya
Abstract: Uncovering interesting groups in either multidimensional or network spaces has become an essential mechanism for data exploration and understanding. Decision making requires relevant information as well as high quality in the retrieved conclusions. We present a comparative study of two compact representations drawn from the same set of data objects, obtained by clustering high-dimensional spaces and low-dimensional Barnes-Hut t-distributed stochastic neighbour embeddings. There is no consensus on how the problem should be addressed or how these representations/models should be analysed, because of their different notions. We introduce a measure to compare their results and their capability to provide insights into the information retrieved. We consider low-dimensional embeddings a potentially revealing strategy to uncover dynamics possibly not uncovered in big-data spaces. We demonstrate that a non-guided approach can be as revealing as a user-guided approach for data exploration, and present coherent results for good uncertainty modelling capability in terms of fuzziness and densities.
Keywords: High-dimensional clustering; BH-SNE embeddings; cluster fuzziness; reliable information; decision making; consistency.
Heart Disease Patient Risk Classification Based On Neutrosophic Sets
by Wael Hanna, Nouran Radwan
Abstract: Medical statistics show that heart disease is one of the biggest causes of mortality among the population. In developing countries, people have less concern about their health, and the risk is increasing: five hundred deaths per one hundred thousand people occur annually in Egypt. The diagnosis of heart disease remains an ambiguous task in the medical field, as many features are involved in making the decision. Besides, the data obtained for diagnosis are often vague and ambiguous. The main contribution of this paper is a novel model of heart disease patient risk classification based on neutrosophic sets. The proposed model is applied to the most relevant attributes of the selected dataset and compared to other well-known classification techniques, such as Naive Bayes, JRip and random forest, for validation. The experimental results indicate that the proposed heart disease classification model achieves the highest accuracy and F-measure results.
Keywords: Heart disease; supervised machine learning classification; neutrosophic sets.
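A single-valued neutrosophic number carries independent truth, indeterminacy and falsity degrees, which is what lets such a model represent the vague and ambiguous diagnostic data the abstract mentions. The score function below is one standard choice from the neutrosophic literature; the threshold decision rule is an illustrative assumption, not the paper's classifier.

```python
def neutrosophic_score(t, i, f):
    """Score of a single-valued neutrosophic number (T, I, F),
    where the truth, indeterminacy and falsity degrees each lie in
    [0, 1] independently.  s = (2 + T - I - F) / 3 rises with truth
    and falls with indeterminacy and falsity."""
    assert all(0.0 <= x <= 1.0 for x in (t, i, f))
    return (2.0 + t - i - f) / 3.0

def classify_risk(t, i, f, threshold=0.5):
    """Toy two-way risk decision on the neutrosophic score."""
    return "high risk" if neutrosophic_score(t, i, f) >= threshold else "low risk"
```

Unlike a fuzzy membership, T, I and F need not sum to 1, so "unknown" evidence can be encoded without forcing it into truth or falsity.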
A Semi-Supervised Clustering-Based Classification Model for Classifying Imbalanced Data Streams in the Presence of Scarcely Labelled Data
by Kiran Bhowmick, Meera Narvekar
Abstract: Classification of data streams remains an active topic of research. Online frameworks for classifying data streams are generally supervised in nature, so they assume the availability of labelled data at all times. Real-time data streams, however, are potentially infinite in length, massive, fast changing and scarcely labelled, and it is practically impossible to label all the observed instances. Hence, these existing frameworks cannot be used in most real-time scenarios. Semi-supervised learning (SSL) addresses this problem of scarcely labelled data by using a large amount of unlabelled data together with labelled data to build classifiers. Data streams may also suffer from the problem of imbalanced data. This paper proposes a model using a semi-supervised clustering technique to classify an imbalanced data stream in the presence of scarcely labelled data.
Keywords: data streams; imbalanced data; semi-supervised clustering; expectation maximization; partially labelled.
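The cluster-then-label idea behind such semi-supervised models can be sketched minimally as follows; the function name and data shapes are assumptions, not the paper's framework.

```python
from collections import Counter

def cluster_then_label(assignments, labels):
    """Minimal cluster-then-label step for scarcely labelled data.

    `assignments[i]` is the cluster id of sample i (from any
    clustering run over labelled and unlabelled samples together),
    and `labels[i]` is the known class of sample i or None.  Each
    cluster takes the majority class of its labelled members, and
    every unlabelled sample inherits its cluster's class."""
    majority = {}
    for cid in set(assignments):
        known = [labels[i] for i, c in enumerate(assignments)
                 if c == cid and labels[i] is not None]
        majority[cid] = Counter(known).most_common(1)[0][0] if known else None
    return [lab if lab is not None else majority[assignments[i]]
            for i, lab in enumerate(labels)]
```

In a streaming, imbalanced setting, the clustering would run per chunk and rare-class clusters could be weighted or resampled before training the classifier.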
Analysing traveller ratings for tourist satisfaction and tourist spot recommendation
by Angel Arul Jothi Joseph, Rajeni Nagarajan
Abstract: In this study, we propose an automated system to classify traveller ratings on travel destinations in 10 categories across East Asia using the UCI Travel Reviews dataset. The system developed in this study is called the Traveller Rating Classification System (TRCS). Since the Travel Reviews dataset is unlabelled, the K-means clustering algorithm is used to group the samples into three clusters; the cluster numbers obtained are then assigned as class labels for the samples, converting the dataset into a labelled one. Popular individual classifiers and ensemble classifiers are used to classify the samples in the labelled dataset. In this study, bagging with a decision tree classifier achieved the best classification accuracy of 97.95%. The study further analyses the attributes in the dataset using visualization techniques, drawing inferences by performing small transformations on them. The proposed system is useful for understanding traveller satisfaction and can serve as a tourist spot recommendation system.
Keywords: Tourist spot recommendation; Tourist satisfaction; Traveller rating; K-means Clustering; Classification; Ensemble; Visualization.