Template-Type: ReDIF-Article 1.0
Author-Name: Imad Bouteraa
Author-X-Name-First: Imad
Author-X-Name-Last: Bouteraa
Author-Name: Makhlouf Derdour
Author-X-Name-First: Makhlouf
Author-X-Name-Last: Derdour
Author-Name: Ahmed Ahmim
Author-X-Name-First: Ahmed
Author-X-Name-Last: Ahmim
Title: Intrusion detection using classification techniques: a comparative study
Abstract: Today's highly connected world suffers from the increase and variety of cyber-attacks. To mitigate those threats, researchers have been continuously exploring different methods for intrusion detection in recent years. In this paper, we study the use of data mining techniques for intrusion detection. The research intends to compare the performances of classification techniques for intrusion detection. To reach the goal, we involve 74 classification techniques in this comparative study. The study shows that no technique outperforms the others in all situations. However, some classification methods lead to promising results and give clues for further combinations.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 65-86
Issue: 1
Volume: 12
Year: 2020
Keywords: data mining; classification; network security; intrusion detection; KDD99.
File-URL: http://www.inderscience.com/link.php?id=105596
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:1:p:65-86

Template-Type: ReDIF-Article 1.0
Author-Name: Sravani Nalluri
Author-X-Name-First: Sravani
Author-X-Name-Last: Nalluri
Author-Name: R. Sasikala
Author-X-Name-First: R.
Author-X-Name-Last: Sasikala
Title: An insight into application of big data analytics in healthcare
Abstract: The main aim of this paper is to comprehend and gain insight into the current trends in application of big data in healthcare, and to identify the various potential healthcare horizons.
A brief analysis was done on 'big data analytics in healthcare' focusing on collection of data, the tools employed, the aspects of health that were addressed, the types of machine learning algorithms and the statistics used to compare the performance of these algorithms. The focus was mainly on prediction of the diseases, emergency department visits or a disease outbreak, using the 'HADOOP' and 'WEKA' tools, by obtaining data from University of California machine learning repository, hospitals and government agencies. Support vector machine, artificial neural networks, naive Bayes and decision tree were commonly used algorithms whose efficacy was compared statistically using 'accuracy'. In my perspective, apart from prediction of disease, other domains of health are to be addressed.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 87-117
Issue: 1
Volume: 12
Year: 2020
Keywords: big data; Hadoop; machine learning algorithms; healthcare; map-reduce; chronic diseases; accuracy rate; prevention; analytics.
File-URL: http://www.inderscience.com/link.php?id=105598
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:1:p:87-117

Template-Type: ReDIF-Article 1.0
Author-Name: Aarti
Author-X-Name-First:
Author-X-Name-Last: Aarti
Author-Name: Geeta Sikka
Author-X-Name-First: Geeta
Author-X-Name-Last: Sikka
Author-Name: Renu Dhir
Author-X-Name-First: Renu
Author-X-Name-Last: Dhir
Title: Grey relational classification algorithm for software fault proneness with SOM clustering
Abstract: The estimation by the human judgment to deal with the inherent uncertainty of software gives a vague and imprecise solution. To cope with this challenge, we propose a new hybrid analogy model based on the integration of grey relational analysis (GRA) classification with self-organising map (SOM) clustering.
In this paper, a new classification approach is proposed to distribute the data to similar groups. The attributes are selected based on GRC values. In the proposed approach, the similarity measure between the reference project and the cluster head is computed to determine the cluster to which the target project belongs. The fault-proneness of the reference project is estimated based on the regression equation of the selected cluster. The proposed algorithm gives resilience to users to select features for both continuous and categorical attributes. In this study, two scenarios based on the integration of the proposed classification with regression have been proposed. Experimental results show significant results indicating that the proposed methodology can be used for the prediction of faults and produce conceivable results when compared with the results of multilayer-perceptron, logistic regression, bagging, naïve Bayes and sequential minimal optimisation (SMO).
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 28-64
Issue: 1
Volume: 12
Year: 2020
Keywords: self-organising map; SOM; grey relational analysis; GRA; unsupervised classification; fault-proneness; object-oriented; OO.
File-URL: http://www.inderscience.com/link.php?id=105599
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:1:p:28-64

Template-Type: ReDIF-Article 1.0
Author-Name: Imane Messaoudi
Author-X-Name-First: Imane
Author-X-Name-Last: Messaoudi
Author-Name: Nadjet Kamel
Author-X-Name-First: Nadjet
Author-X-Name-Last: Kamel
Title: Overlapping community detection with a novel hybrid metaheuristic optimisation algorithm
Abstract: Social networks are ubiquitous in our daily life. Due to the rapid development of information and electronic technology, social networks are becoming more and more complex in terms of sizes and contents.
It is of paramount significance to analyse the structures of social networks in order to unveil the myth beneath complex social networks. Network community detection is recognised as a fundamental tool towards social networks analytics. As a consequence, numerous community detection methods have been proposed in the literature. For a real-world social network, an individual may possess multiple memberships, while the existing community detection methods are mainly designed for non-overlapping situations. With regard to this, this paper proposes a hybrid metaheuristic method to detect overlapping communities in social networks. In the proposed method, the overlapping community detection problem is formulated as an optimisation problem and a novel bat optimisation algorithm is designed to solve the established optimisation model. To enhance the searchability of the proposed algorithm, a local search operator based on tabu search is introduced. To validate the effectiveness of the proposed algorithm, experiments on benchmark and real-world social networks are carried out. The experiments indicate that the proposed algorithm is promising for overlapping community detection.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 118-139
Issue: 1
Volume: 12
Year: 2020
Keywords: overlapping community; modified density; Tabu search; TS; Bat algorithm; BA; link clustering; social network.
File-URL: http://www.inderscience.com/link.php?id=105601
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:1:p:118-139

Template-Type: ReDIF-Article 1.0
Author-Name: Nawel Sekkal
Author-X-Name-First: Nawel
Author-X-Name-Last: Sekkal
Author-Name: Sidi Mohamed Benslimane
Author-X-Name-First: Sidi Mohamed
Author-X-Name-Last: Benslimane
Author-Name: Michael Mrissa
Author-X-Name-First: Michael
Author-X-Name-Last: Mrissa
Author-Name: Cheol Young Park
Author-X-Name-First: Cheol Young
Author-X-Name-Last: Park
Author-Name: Boudjemaa Boudaa
Author-X-Name-First: Boudjemaa
Author-X-Name-Last: Boudaa
Title: Proactive and reactive context reasoning architecture for smart web services
Abstract: The web of things (WoT) uses web technologies to connect embedded objects to each other and to deliver services to stakeholders. The context of these interactions (situation) is a key source of information which can be sometimes uncertain. In this paper, we focus on the development of intelligent web services. The main requirements for intelligent service are to deal with context diversity, semantic context representation and the capacity to reason with uncertain information. From this perspective, we propose a framework for intelligent services to deal with various contexts, to reactively respond to real-time situations and proactively predict future situations. For the semantic representation of context, we use PR-OWL, a probabilistic ontology based on multi-entity Bayesian networks. PR-OWL is flexible enough to represent complex and uncertain contexts. We validate our framework with an intelligent plant watering use case to show its reasoning capabilities.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 1-27
Issue: 1
Volume: 12
Year: 2020
Keywords: smart web service; the web of things; context reasoning; proactive; reactive; multi-entity Bayesian networks; MEBNs; PR-OWL.
File-URL: http://www.inderscience.com/link.php?id=105609
File-Format: text/html
File-Restriction: Open Access
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:1:p:1-27

Template-Type: ReDIF-Article 1.0
Author-Name: Yasmine Chaabani
Author-X-Name-First: Yasmine
Author-X-Name-Last: Chaabani
Author-Name: Jalel Akaichi
Author-X-Name-First: Jalel
Author-X-Name-Last: Akaichi
Title: Bees colonies for detecting communities evolution using data warehouse
Abstract: The analysis of social networks and their evolution has gained much interest in recent years. In fact, few methods revealed and tracked meaningful communities over time. These methods also dealt efficiently with structure and topic evolution of networks. In this paper, we propose a novel technique to track dynamic communities and their evolution behaviour. The main objective of our approach and using the artificial bee colony (ABC) is to trace the evolution of community and to optimise our objective function to keep proper partitioning. Moreover, we use a data warehouse as a mind of bees to store the information of different communities structure in every timestamp. The experimental results showed that the proposed method is efficient in discovering dynamic communities and tracking their evolution.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 192-206
Issue: 2
Volume: 12
Year: 2020
Keywords: social network; community detection; bees colony.
File-URL: http://www.inderscience.com/link.php?id=106720
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:2:p:192-206

Template-Type: ReDIF-Article 1.0
Author-Name: Fatima Meskine
Author-X-Name-First: Fatima
Author-X-Name-Last: Meskine
Author-Name: Safia Nait-Bahloul
Author-X-Name-First: Safia
Author-X-Name-Last: Nait-Bahloul
Title: A support architecture to MDA contribution for data mining
Abstract: The data mining process is the sequence of tasks applied to data, in order to discover relations between them to have knowledge. However, the data mining process lacks a formal specification that allows it to be modelled independently of platforms. Model driven architecture (MDA) is an approach for the development of software systems, based on the use of models to improve their productivity. Several research works have been elaborated to align the MDA approach with data mining on data warehouses, to specify the data mining process in a very high level of abstraction. In our work, we propose a support architecture that allows positioning these researches in different abstraction levels, on the basis of several criteria; with the aim to identify strengths for each level, in terms of modelling; and to have a clear visibility on the MDA contribution for data mining.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 207-236
Issue: 2
Volume: 12
Year: 2020
Keywords: data mining; model driven architecture; MDA; data warehouses; UML profiles; data multidimensional model; transformation.
File-URL: http://www.inderscience.com/link.php?id=106723
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:2:p:207-236

Template-Type: ReDIF-Article 1.0
Author-Name: Jaishree Ranganathan
Author-X-Name-First: Jaishree
Author-X-Name-Last: Ranganathan
Author-Name: Angelina A. Tzacheva
Author-X-Name-First: Angelina A.
Author-X-Name-Last: Tzacheva
Title: Emotion mining from text for actionable recommendations detailed survey
Abstract: In the era of Web 2.0, people express their opinion, feelings and thoughts about topics including political and cultural events, natural disasters, products and services, through mediums such as blogs, forums, and micro-blogs, like Twitter. Also, a large amount of text is generated through e-mail which contains the writer's feeling or opinion; for instance, customer care service e-mail. The texts generated through such platforms are a rich source of data which can be mined in order to gain useful information about user opinion or feeling which in turn can be utilised in specific applications such as: marketing, sale predictions, political surveys, health care, student-faculty culture, e-learning platforms, and social networks. This process of identifying and extracting information about the attitude of a speaker or writer about a topic, polarity, or emotion in a document is called sentiment analysis. There is a variety of sources for extracting sentiment, such as speech, music, and facial expression. Due to the rich source of information available in the form of text data, this paper focuses on sentiment analysis and emotion mining from text, as well as discovering actionable patterns. The actionable patterns may suggest ways to alter the user's sentiment or emotion to a more positive or desirable state.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 143-191
Issue: 2
Volume: 12
Year: 2020
Keywords: actionable pattern mining; data mining; text mining; sentiment analysis.
File-URL: http://www.inderscience.com/link.php?id=106729
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:2:p:143-191

Template-Type: ReDIF-Article 1.0
Author-Name: Abdullah Alsaeedi
Author-X-Name-First: Abdullah
Author-X-Name-Last: Alsaeedi
Title: A survey of term weighting schemes for text classification
Abstract: Text document classification approaches are designed to categorise documents into predefined classes. These approaches have two main components: document representation models and term-weighting methods. The high dimensionality of feature space has always been a major problem in text classification methods. To resolve high dimensionality issues and to improve the accuracy of text classification, various feature selection approaches were presented in the literature. Besides which, several term-weighting schemes were introduced that can be utilised for feature selection methods. This work surveys and investigates various term (feature) weighting approaches that have been presented in the text classification context.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 237-254
Issue: 2
Volume: 12
Year: 2020
Keywords: document frequency; supervised term weighting; text classification; unsupervised term weighting.
File-URL: http://www.inderscience.com/link.php?id=106741
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:2:p:237-254

Template-Type: ReDIF-Article 1.0
Author-Name: Mohammed Al-Sarem
Author-X-Name-First: Mohammed
Author-X-Name-Last: Al-Sarem
Author-Name: Abdel-Hamid Emara
Author-X-Name-First: Abdel-Hamid
Author-X-Name-Last: Emara
Author-Name: Ahmed Abdel Wahab
Author-X-Name-First: Ahmed Abdel
Author-X-Name-Last: Wahab
Title: Performance of authorship attribution classifiers with short texts: application of religious Arabic fatwas
Abstract: Although authorship attribution is a well-known problem in the authorship analysis domain, research on Arabic contexts is still limited.
In addition, examining the performance of the attribution methods on training sets with short textual documents is also not considered well in other languages, such as English, Chinese, Spanish and Dutch. Therefore, this current work aims at examining the performance of attribution classifiers in the context of short Arabic textual documents. The experimental part of this work is conducted with well-known classifiers namely: decision tree C4.5 method, naive Bayes model, K-NN method, Markov model, SMO and Burrows Delta method. We experiment with various feature combinations. The results show that combining the word-based lexical features with the structural features yields the best accuracy. To this end, we use this combination as a baseline for further investigation. We also examine the effect of combining the n-gram features. The results indicate that some classifiers show an improvement while the others do not. In addition, the results show that the naive Bayes method gives the highest accuracy among all the attribution classifiers.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 350-364
Issue: 3
Volume: 12
Year: 2020
Keywords: authorship attribution; AA; stylometric features; SF; attribution classifiers; JGAAP tool; Arabic language.
File-URL: http://www.inderscience.com/link.php?id=108719
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:3:p:350-364

Template-Type: ReDIF-Article 1.0
Author-Name: Akram Osman
Author-X-Name-First: Akram
Author-X-Name-Last: Osman
Author-Name: Naomie Salim
Author-X-Name-First: Naomie
Author-X-Name-Last: Salim
Title: Extracting useful reply-posts for text forum threads summarisation using quality features and classification methods
Abstract: Text forum threads contain a large amount of information furnished by users who discuss a specific topic.
At times, certain thread reply-posts are entirely off-topic, thereby deviating from the main discussion. It negatively affects the user's preference to continue replying to the discussion. Thus, there is a possibility that the user prefers to read certain selected reply-posts that provide a short summary of the topic of the discussion. The objective of the paper is to choose quality reply-posts regarding a topic considered in the initial-post, which also serve as a brief summary. We offer an exhaustive examination of the conversational patterns of the threads on the basis of 12 quality features for analysis. These features can ensure selection of relevant reply-posts for the thread summary. Experimental outcomes obtained using two datasets show that the presented techniques considerably enhanced the performance in selecting initial-post replies pairs for text forum threads summarisation.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 330-349
Issue: 3
Volume: 12
Year: 2020
Keywords: information retrieval; initial-post replies pairs; text data; text forum threads; TFThs; text forum threads summarisation; text summarisation; thread retrieval.
File-URL: http://www.inderscience.com/link.php?id=108725
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:3:p:330-349

Template-Type: ReDIF-Article 1.0
Author-Name: Hiba Zuhair
Author-X-Name-First: Hiba
Author-X-Name-Last: Zuhair
Author-Name: Ali Selamat
Author-X-Name-First: Ali
Author-X-Name-Last: Selamat
Title: Phish webpage classification using hybrid algorithm of machine learning and statistical induction ratios
Abstract: Although the conventional machine learning-based anti-phishing techniques outperform their competitors in phishing detection, they are still targeted by zero-hour phish webpages due to their constraints of phishing induction.
Therefore, phishing induction must be boosted up with the extraction of new features, the selection of robust subsets of decisive features, and the active learning of classifiers on a big webpage stream. In this paper, we propose a hybrid feature-based classification algorithm (HFBC) for decisive phish webpage classification. HFBC hybridises two statistical criteria, optimised feature occurrence (OFC) and phishing induction ratio (PIR), with the induction settings of the most salient machine learning algorithms, naïve Bayes and decision tree. Additionally, we propose two constituent algorithms of features extraction and features selection for holistic phish webpage characterisation. The superiority of our proposed approach is justified and proven throughout chronological, real-time, and comparative analyses against existing machine learning-based anti-phishing techniques.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 255-276
Issue: 3
Volume: 12
Year: 2020
Keywords: phish webpage; machine learning; optimised feature occurrence; OFC; phishing induction ratio; PIR; hybrid feature-based classifier; HFBC.
File-URL: http://www.inderscience.com/link.php?id=108727
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:3:p:255-276

Template-Type: ReDIF-Article 1.0
Author-Name: Meryem Amar
Author-X-Name-First: Meryem
Author-X-Name-Last: Amar
Author-Name: Bouabid El Ouahidi
Author-X-Name-First: Bouabid El
Author-X-Name-Last: Ouahidi
Title: Weighted LSTM for intrusion detection and data mining to prevent attacks
Abstract: The usage of cloud opportunities brings not only resources and storage availability, but also puts customer's privacy at stake. These services are carried out through web that generate log files. These files contain valuable information in tracking malicious behaviours. However, they are variant, voluminous and have high velocity.
This paper structures input log files using data preparation treatment (DPT), anticipates missing features, and performs a weighted conversion to ease the discrimination of malicious activities. Regarding the robustness of deep learning in analysing high dimension databases, selecting dynamically features and detecting intrusions, our architecture avails its strength and proposes a weighted long short-term memory (WLSTM) deep learning algorithm. WLSTM mines network traffic predictors considering past events, and minimizes the vanishing gradient. Results prove its effectiveness; it achieves 98% accuracy and reduces false alarm rates to 1.47%. For contextual malicious behaviours, the accuracy attained 97% and the loss was 22%.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 308-329
Issue: 3
Volume: 12
Year: 2020
Keywords: cloud security breaches; intrusion-detection; weight of evidence; WoE; deep learning; long short-term memory; LSTM.
File-URL: http://www.inderscience.com/link.php?id=108728
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:3:p:308-329

Template-Type: ReDIF-Article 1.0
Author-Name: E.O. Rodrigues
Author-X-Name-First: E.O.
Author-X-Name-Last: Rodrigues
Author-Name: D. Casanova
Author-X-Name-First: D.
Author-X-Name-Last: Casanova
Author-Name: M. Teixeira
Author-X-Name-First: M.
Author-X-Name-Last: Teixeira
Author-Name: V. Pegorini
Author-X-Name-First: V.
Author-X-Name-Last: Pegorini
Author-Name: F. Favarim
Author-X-Name-First: F.
Author-X-Name-Last: Favarim
Author-Name: E. Clua
Author-X-Name-First: E.
Author-X-Name-Last: Clua
Author-Name: A. Conci
Author-X-Name-First: A.
Author-X-Name-Last: Conci
Author-Name: Panos Liatsis
Author-X-Name-First: Panos
Author-X-Name-Last: Liatsis
Title: Proposal and study of statistical features for string similarity computation and classification
Abstract: Adaptations of features commonly applied in the field of visual computing, co-occurrence matrix (COM) and run-length matrix (RLM), are proposed for the similarity computation of strings in general (words, phrases, codes and texts). The proposed features are not sensitive to language related information. These are purely statistical and can be used in any context with any language or grammatical structure. Other statistical measures that are commonly employed in the field such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances are evaluated and compared. In the first synthetic set of experiments, the COM and RLM features outperform the remaining state-of-the-art statistical features. In 3 out of 4 cases, the RLM and COM features were statistically more significant than the second best group based on distances (P-value < 0.001). When it comes to a real text plagiarism dataset, the RLM features obtained the best results.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 277-307
Issue: 3
Volume: 12
Year: 2020
Keywords: word comparison; string similarity; classification; statistical features; text mining; optical character recognition; OCR; text plagiarism; text entailment; supervised learning.
File-URL: http://www.inderscience.com/link.php?id=108731
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:3:p:277-307

Template-Type: ReDIF-Article 1.0
Author-Name: Alexey G. Finogeev
Author-X-Name-First: Alexey G.
Author-X-Name-Last: Finogeev
Author-Name: Leyla A. Gamidullaeva
Author-X-Name-First: Leyla A.
Author-X-Name-Last: Gamidullaeva
Author-Name: Sergey M. Vasin
Author-X-Name-First: Sergey M.
Author-X-Name-Last: Vasin
Title: Application of hyper-convergent platform for big data in exploring regional innovation systems
Abstract: The authors developed a decentralised hyper-convergent analytical platform for the collection and processing of big data in order to explore the monitoring processes of distributed objects in the regions on the basis of multi-agent approach. The platform is intended for modular integration of tools for searching, collecting, processing and big data mining from cyber-physical and cyber-social objects. The results of the intellectual analysis are used to assess the integrated criteria for the effectiveness of innovation systems of distributed monitoring and forecasting the dynamics of the influence of various factors on technological and socio-economic processes. The work analyses convergent and hyper-convergent systems, substantiates the necessity of creating a multi-agent decentralised platform for big data collection and analytical processing. The article proposes the principles of streaming architecture for the data integration analytical processing to resolve the problems of searching, parallel processing, data mining and uploading of information into a cloud storage. The paper also considers the main components of the hyper-convergent analytical platform. A new concept of distributed extraction, transformation, loading, mining (ETLM) system is considered.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 365-385
Issue: 4
Volume: 12
Year: 2020
Keywords: innovation system; convergence; convergent platform; hyper-convergent system; intellectual analysis; big data; multi-agent approach; ETLM.
File-URL: http://www.inderscience.com/link.php?id=111395
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:4:p:365-385

Template-Type: ReDIF-Article 1.0
Author-Name: Mehdi Soleymani
Author-X-Name-First: Mehdi
Author-X-Name-Last: Soleymani
Title: A quest for better anomaly detectors
Abstract: Anomaly detection is a very popular method for detecting exceptional observations which are very rare. It has been frequently used in medical diagnosis, fraud detection, etc. In this article, we revisit some popular algorithms for anomaly detection and investigate why we are on a quest for a better algorithm for identifying anomalies. We propose a new algorithm, which unlike other popular algorithms, is not looking for outliers directly, but it searches for them by removing the inliers (opposite to outliers) in an iterative way. We present an extensive simulation study to show the performance of the proposed algorithm compared to its competitors.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 447-458
Issue: 4
Volume: 12
Year: 2020
Keywords: anomaly detection; algorithm; k-nearest neighbour.
File-URL: http://www.inderscience.com/link.php?id=111399
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:4:p:447-458

Template-Type: ReDIF-Article 1.0
Author-Name: Borislava Petrova Vrigazova
Author-X-Name-First: Borislava Petrova
Author-X-Name-Last: Vrigazova
Author-Name: Ivan Ganchev Ivanov
Author-X-Name-First: Ivan Ganchev
Author-X-Name-Last: Ivanov
Title: The bootstrap procedure in classification problems
Abstract: In classification problems, cross-validation chooses random samples from the dataset in order to improve the ability of the model to classify properly new observations in the respective class. Research articles from various fields show that when applied to regression problems, the bootstrap can improve either the prediction ability of the model or the ability for feature selection.
The purpose of our research is to show that the bootstrap as a model selection procedure in classification problems can outperform cross-validation. We compare the performance measures of cross-validation and the bootstrap on a set of classification problems and analyse their practical advantages and disadvantages. We show that the bootstrap procedure can accelerate execution time compared to the cross-validation procedure while preserving the accuracy of the classification model. This advantage of the bootstrap is particularly important in big datasets as the time needed for fitting the model can be reduced without decreasing the model's performance.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 428-446
Issue: 4
Volume: 12
Year: 2020
Keywords: logistic regression; decision tree; k-nearest neighbour; KNN; the bootstrap; cross-validation.
File-URL: http://www.inderscience.com/link.php?id=111400
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:4:p:428-446

Template-Type: ReDIF-Article 1.0
Author-Name: Mamoon Obiedat
Author-X-Name-First: Mamoon
Author-X-Name-Last: Obiedat
Author-Name: Ali Al-yousef
Author-X-Name-First: Ali
Author-X-Name-Last: Al-yousef
Author-Name: Mustafa Banikhalaf
Author-X-Name-First: Mustafa
Author-X-Name-Last: Banikhalaf
Author-Name: Khairallah Al Talafha
Author-X-Name-First: Khairallah Al
Author-X-Name-Last: Talafha
Title: A new quantitative method for simplifying complex fuzzy cognitive maps
Abstract: Fuzzy cognitive map (FCM) is a qualitative soft computing approach that addresses uncertain human perceptions of diverse real-world problems. The map depicts the problem in the form of problem nodes and cause-effect relationships among them. Complex problems often produce complex maps that may be difficult to understand or predict, and therefore, maps need to be simplified.
Previous studies used subjective simplification/condensation processes by grouping similar variables into one variable in a qualitative manner. This paper proposes a quantitative method for simplifying FCM. It uses the spectral clustering quantitative technique to classify/group related variables into new clusters without human intervention. Initially, improvements were added to this clustering technique to properly handle FCM matrix data. Then, the proposed method was examined by an application dataset to validate its appropriateness in FCM simplification. The results showed that the method successfully classified the dataset into meaningful clusters.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 415-427
Issue: 4
Volume: 12
Year: 2020
Keywords: soft computing; fuzzy cognitive map model; complex problems; FCM simplification; spectral clustering; topological overlap matrix; decision support systems.
File-URL: http://www.inderscience.com/link.php?id=111402
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:4:p:415-427

Template-Type: ReDIF-Article 1.0
Author-Name: Ahmet Arif Aydin
Author-X-Name-First: Ahmet Arif
Author-X-Name-Last: Aydin
Author-Name: Kenneth M. Anderson
Author-X-Name-First: Kenneth M.
Author-X-Name-Last: Anderson
Title: Data modelling for large-scale social media analytics: design challenges and lessons learned
Abstract: We live in a world of big data; organisations collect, store, and analyse large volumes of data for various purposes. The five V's of big data introduce new challenges for developers to handle when performing data processing and analysis. Indeed, data modelling is one of the most challenging and critical aspects of big data because it determines how data will be structured and stored; these decisions then impact how that data can be processed and analysed.
In this paper, we report on designing a data model for storing and analysing Twitter data in support of crisis informatics. In this work, we leverage the data model provided by columnar NoSQL data stores to design column families that can efficiently index, sort, store and analyse large Twitter datasets. In particular, our column families are designed to achieve efficient batch data processing. We evaluate these claims and discuss our future work.
Journal: Int. J. of Data Mining, Modelling and Management
Pages: 386-414
Issue: 4
Volume: 12
Year: 2020
Keywords: data modelling; social media analytics; big data analytics; NoSQL.
File-URL: http://www.inderscience.com/link.php?id=111409
File-Format: text/html
File-Restriction: Access to full text is restricted to subscribers.
Handle: RePEc:ids:ijdmmm:v:12:y:2020:i:4:p:386-414