International Journal of Data Mining, Modelling and Management (26 papers in press)
Performance of Authorship Attribution Classifiers with Short Texts: Application to Religious Arabic Fatwas
by Mohammed Al-Sarem, Abdel-Hamid Emara, Ahmed Abdel Wahab
Abstract: Although authorship attribution is a well-known problem in the authorship analysis domain, research on Arabic texts is still limited. In addition, the performance of attribution methods on training sets of short textual documents is not well studied even in other languages, such as English, Chinese, Spanish and Dutch. Therefore, this work examines the performance of attribution classifiers in the context of short Arabic textual documents. The experimental part of this work is conducted with well-known classifiers, namely the C4.5 decision tree method, the naive Bayes model, the k-NN method, the Markov model, SMO and Burrows' Delta method. We experiment with various feature combinations. The results show that combining word-based lexical features with structural features yields the best accuracy. Accordingly, we use this combination as a baseline for further investigation. We also examine the effect of adding n-gram features. The results indicate that some classifiers show an improvement while others do not. In addition, the results show that the naive Bayes method gives the highest accuracy among all the attribution classifiers.
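For readers unfamiliar with the pipeline the abstract describes, a minimal naive Bayes attributor over character n-grams can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the n-gram size, the Laplace smoothing and the toy author labels are all assumptions:

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Character n-grams, a common stylometric feature for short texts."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NaiveBayesAttributor:
    """Multinomial naive Bayes over n-gram counts with Laplace smoothing."""

    def fit(self, docs, authors):
        self.counts = {}            # author -> Counter of n-grams
        self.priors = Counter(authors)
        self.vocab = set()
        for doc, author in zip(docs, authors):
            grams = char_ngrams(doc)
            self.counts.setdefault(author, Counter()).update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, doc):
        grams = char_ngrams(doc)
        n_docs = sum(self.priors.values())
        best, best_lp = None, float("-inf")
        for author, cnt in self.counts.items():
            total = sum(cnt.values())
            lp = math.log(self.priors[author] / n_docs)    # log prior
            for g in grams:                                # log likelihoods
                lp += math.log((cnt[g] + 1) / (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = author, lp
        return best
```

Swapping the character n-grams for word-based lexical and structural features only changes the feature extractor; the classifier stays the same.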
Keywords: Authorship Attribution; Stylometric Features; Attribution Classifiers; JGAAP tool; Arabic Language.
Extracting useful reply-posts for text forum threads summarisation using quality features and classification methods
by Akram Osman, Naomie Salim
Abstract: Text forum threads contain a large amount of information furnished by users who discuss a specific topic. At times, certain thread reply-posts are entirely off-topic, deviating from the main discussion. This negatively affects users' willingness to continue replying to the discussion. Thus, users may prefer to read a selection of reply-posts that provides a short summary of the topic under discussion. The objective of this paper is to choose quality reply-posts regarding the topic raised in the initial-post, which also serve as a brief summary. We offer an exhaustive examination of the conversational patterns of the threads on the basis of 12 quality features. These features can ensure the selection of relevant reply-posts for the thread summary. Experimental outcomes obtained using two datasets show that the presented techniques considerably enhance the performance of selecting initial-post/reply pairs for text forum thread summarisation.
Keywords: information retrieval; initial-post replies pairs; text data; text forum threads; text forum threads summarisation; text summarisation; thread retrieval.
Phish Webpage Classification Using Hybrid Algorithm of Machine Learning and Statistical Induction Ratios
by Hiba Zuhair, Ali Selamat
Abstract: Although conventional machine learning-based anti-phishing techniques outperform their competitors in phishing detection, they are still targeted by zero-hour phish webpages due to the constraints of their phishing induction. Therefore, phishing induction must be boosted by the extraction of new features, the selection of robust subsets of decisive features, and the active learning of classifiers on a big webpage stream. In this paper, we propose a hybrid feature-based classification algorithm (HFBC) for decisive phish webpage classification. HFBC hybridises two statistical criteria, Optimized Feature Occurrence (OFC) and Phishing Induction Ratio (PIR), with the induction settings of the most salient machine learning algorithms, such as Naïve Bayes.
Keywords: phish webpage; machine learning; optimized feature occurrence; phishing induction ratio; hybrid feature-based classifier.
Online Detection of Blood-Spot Eggs Based on Levenberg-Marquardt Spectral Amplitude Space Conversion and BPNN
by Zhihui Zhu, Linfeng Wu, Huaixin Yu
Abstract: This paper aims to improve the generalization ability and discrimination accuracy of online blood-spot egg detection. To this end, it establishes a linear converter based on the Levenberg-Marquardt (LM) algorithm to transform spectral signals. First, the transmittance in the 500–599 nm band was chosen for our research. After moving average filtering, the LM algorithm was employed to transform the space of the spectral amplitude. Then, the converted amplitude signal was applied to establish a back propagation neural network (BPNN) model. The results show that the model established from the LM-converted signal withstood the impact of different batches and speeds on the discrimination accuracy, and improved the model accuracy to more than 96%. This study provides a feasible method for the online nondestructive detection of blood-spot eggs.
Keywords: Blood-spot eggs; Online detection; Levenberg-Marquardt; Space conversion; Back propagation neural network (BPNN).
Application of hyperconvergent platform for Big Data in exploring regional innovation systems
by Alexey Finogeev, Leyla Gamidullaeva, Sergey Vasin
Abstract: The authors developed a decentralized hyperconvergent analytical platform for the collection and processing of Big Data in order to explore the monitoring processes of distributed objects in the regions on the basis of a multi-agent approach. The platform is intended for the modular integration of tools for searching, collecting, processing and mining big data from cyber-physical and cyber-social objects. The results of the intelligent analysis are used to assess integrated criteria for the effectiveness of innovation systems when monitoring and forecasting the dynamics of the influence of various factors on technological and socio-economic processes. The work analyzes convergent and hyperconvergent systems and substantiates the necessity of creating a multi-agent decentralized platform for Big Data collection and analytical processing. The article proposes the principles of a streaming architecture for integrated analytical data processing to resolve the problems of searching, parallel processing, data mining and uploading information into cloud storage. The paper also considers the main components of the hyperconvergent analytical platform. A new concept of a distributed ETLM (Extraction, Transformation, Loading, Mining) system is considered.
Keywords: innovation system; convergence; convergent platform; hyperconvergent system; intelligent analysis; Big Data; multi-agent approach; ETLM (Extraction, Transformation, Loading, Mining) system.
A quest for better anomaly detectors
by Mehdi Soleymani
Abstract: Anomaly detection is a popular method for detecting exceptional observations, which are very rare. It has been frequently used in medical diagnosis, fraud detection, etc. In this article, we revisit some popular algorithms for anomaly detection and investigate why we are on a quest for a better algorithm for identifying anomalies. We propose a new algorithm which, unlike other popular algorithms, does not look for outliers directly, but searches for them by iteratively removing the inliers (the opposite of outliers). We present an extensive simulation study to show the performance of the proposed algorithm compared to its competitors.
Keywords: Anomaly detection; Algorithm; k-Nearest Neighbour.
The bootstrap procedure in classification problems
by Borislava Vrigazova, Ivan Ivanov
Abstract: In classification problems, cross-validation chooses random samples from the dataset in order to improve the ability of the model to properly classify new observations into the respective class. Research articles from various fields show that, when applied to regression problems, the bootstrap can improve either the prediction ability of the model or its ability for feature selection. The purpose of our research is to show that the bootstrap, as a model selection procedure in classification problems, can outperform cross-validation. We compare the performance measures of cross-validation and the bootstrap on a set of classification problems and analyze their practical advantages and disadvantages. We show that the bootstrap procedure can accelerate execution time compared to the cross-validation procedure while preserving the accuracy of the classification model. This advantage of the bootstrap is particularly important for big datasets, as the time needed for fitting the model can be reduced without decreasing the model's performance.
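The out-of-bag bootstrap that the abstract compares against cross-validation can be sketched in a few lines. This is a hedged illustration of the general resampling scheme, not the paper's experimental setup; the nearest-centroid toy classifier and the round count are assumptions:

```python
import random

def bootstrap_oob_score(X, y, fit, score, n_rounds=20, seed=0):
    """Train on a sample drawn with replacement; evaluate on the
    out-of-bag points (about 36.8% of the data) left out of the sample."""
    rng = random.Random(seed)
    n, scores = len(X), []
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        chosen = set(idx)
        oob = [i for i in range(n) if i not in chosen]
        if not oob:
            continue
        model = fit([X[i] for i in idx], [y[i] for i in idx])
        scores.append(score(model, [X[i] for i in oob], [y[i] for i in oob]))
    return sum(scores) / len(scores)

def fit_centroid(X, y):
    """Toy 1-D nearest-centroid classifier, used only for illustration."""
    return {c: sum(x for x, t in zip(X, y) if t == c) /
               sum(1 for t in y if t == c) for c in set(y)}

def score_centroid(centroids, X, y):
    preds = [min(centroids, key=lambda c: abs(x - centroids[c])) for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)
```

Unlike k-fold cross-validation, every round trains on a full-size resample and needs no fold bookkeeping, which is one source of the execution-time difference the abstract discusses.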
Keywords: logistic regression; decision tree; k-nearest neighbor; the bootstrap; cross validation.
Weighted LSTM for Intrusion Detection and Data Mining to Prevent Attacks
by Meryem Amar, Bouabid El Ouahidi
Abstract: The usage of cloud opportunities brings not only resources and storage availability, but puts also customers privacy at stake. Cloud services are carried out using web applications that generate log files at every level of the computing infrastructure, and they contain valuable information to track malicious behaviors and to identify the attackers. Unfortunately, they scale up to high Velocity, big Volume, and Variant formats.
Deep Learning is robust in analyzing high-dimensional databases, dynamically selecting relevant features, and detecting intrusions with greater accuracy and a reduced loss error rate. This paper first proposes a Data Preparation Treatment (DPT) method that structures diverse input log files, anticipates missing features, and performs a weighted conversion of categorical features to ease the discrimination of normal behaviors from malicious ones. It also draws on the strength of a Weighted Long Short-Term Memory (WLSTM) deep learning algorithm to mine network traffic predictors with regard to past events while minimizing vanishing gradient values. This solution also proposes a data mining approach, based on a decision tree model, to prevent the occurrence of a set of consecutive attacks.
The results prove the robustness of the proposed architecture: it achieved 98% accuracy in detecting attacks and reduced the false alarm rate to only 1.47%. For contextual malicious behaviors, the accuracy attained 97% and the loss was 22%.
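The weighted conversion of categorical features performed by the DPT step is, per the keywords, a weight-of-evidence encoding. A minimal sketch follows; the add-one smoothing and the 0 = normal / 1 = attack label convention are our assumptions:

```python
import math
from collections import Counter

def weight_of_evidence(values, labels):
    """WoE per category: ln(P(value | normal) / P(value | attack)).
    Positive values lean 'normal', negative lean 'attack'.
    Add-one smoothing keeps rare categories finite."""
    normal = Counter(v for v, l in zip(values, labels) if l == 0)
    attack = Counter(v for v, l in zip(values, labels) if l == 1)
    n_norm, n_atk = sum(normal.values()), sum(attack.values())
    cats = set(values)
    k = len(cats)
    return {c: math.log(((normal[c] + 1) / (n_norm + k)) /
                        ((attack[c] + 1) / (n_atk + k))) for c in cats}
```

The resulting numeric scores could then replace the raw categories before the log records are fed to a sequence model such as the WLSTM.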
Keywords: Cloud security breaches; Intrusion detection; Weight of Evidence; Deep Learning; Long Short-Term Memory (LSTM).
A New Quantitative Method for Simplifying Complex Fuzzy Cognitive Maps
by Mamoon Obiedat, Ali Al-yousef, Mustafa Banikhalaf, Khairallah Al Talafha
Abstract: The fuzzy cognitive map (FCM) is a qualitative soft computing approach that addresses uncertain human perceptions of diverse real-world problems. The map depicts the problem in the form of problem nodes and cause-effect relationships among them. Complex problems often produce complex maps that may be difficult to understand or predict, and therefore such maps need to be simplified. Previous studies performed simplification/condensation subjectively, grouping similar variables into one variable in a qualitative manner. This paper proposes a quantitative method for simplifying FCMs. It uses the spectral clustering technique to classify/group related variables into new clusters without human intervention. Initially, improvements were added to this clustering technique so that it properly handles FCM matrix data. Then, the proposed method was tested on an application dataset to validate its appropriateness for FCM simplification. The results showed that the method successfully classified the dataset into meaningful clusters.
Keywords: Soft computing; fuzzy cognitive map model; complex problems; FCM simplification; spectral clustering; topological overlap matrix; decision support systems.
Proposal and Study of Statistical Features for String Similarity Computation and Classification
by Erick Rodrigues, Aura Conci, Esteban Clua, Panos Liatsis
Abstract: Strings are usually compared among themselves using language-related information such as taxonomies and dictionaries, which can be challenging. In this work, adaptations of features commonly applied in the field of visual computing, the Co-Occurrence Matrix (COM) and the Run-Length Matrix (RLM), are used in the similarity computation of strings in general (words, phrases, codes and texts). The proposed features do not consider language-related information; they are purely statistical and can be used in any context with any language or textual structure. Other statistical measures commonly employed in the field, such as Longest Common Subsequence, Maximal Consecutive Longest Common Subsequence, Mutual Information and Edit Distances, are evaluated and compared to the proposed features. The devised experiments consist of training and testing classifiers on features extracted from two strings. Various classification algorithms are evaluated, including neural networks, function- or rule-based classification and decision tree classifiers. The proposed features provide interesting results. In the first, synthetic set of experiments, the Co-Occurrence and Run-Length features outperform the remaining state-of-the-art statistical groups of features. In 3 out of 4 cases, the RLM and COM features were statistically more significant than the second best group of features, based on distances (p-value < 0.001). The second set of experiments uses a real text plagiarism dataset. In this case, the RLM features obtained the best results.
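To make the adaptation concrete, here is a minimal sketch of a string co-occurrence "matrix" and a cosine similarity over it. This is our own illustration; the offset parameter and the cosine comparison are assumptions, not necessarily the feature set used in the paper:

```python
from collections import Counter

def char_cooccurrence(s, offset=1):
    """Counts of character pairs at a fixed offset: the string
    analogue of a grey-level co-occurrence matrix (COM)."""
    return Counter((s[i], s[i + offset]) for i in range(len(s) - offset))

def com_similarity(a, b, offset=1):
    """Cosine similarity between two sparse co-occurrence matrices."""
    ca, cb = char_cooccurrence(a, offset), char_cooccurrence(b, offset)
    dot = sum(ca[k] * cb[k] for k in ca)
    na = sum(v * v for v in ca.values()) ** 0.5
    nb = sum(v * v for v in cb.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0
```

Run-length features would be built analogously, from the lengths of runs of repeated characters rather than from character pairs.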
Keywords: word comparison; string similarity; classification; statistical features; text mining; ocr; text plagiarism; text entailment; supervised learning.
Emotion mining from text for actionable recommendations: a detailed survey
by Jaishree Ranganathan, Angelina A. Tzacheva
Abstract: In the era of Web 2.0, people express their opinions, feelings and thoughts about topics including political and cultural events, natural disasters, products and services, through mediums such as blogs, forums, and micro-blogs like Twitter. A large amount of text is also generated through e-mail that contains the writer's feelings or opinions; for instance, customer care service e-mail. The texts generated through such platforms are a rich source of data which can be mined to gain useful information about user opinions or feelings, which in turn can be utilised in specific applications such as marketing, sales predictions, political surveys, health care, student-faculty culture, e-learning platforms, and social networks. This process of identifying and extracting information about the attitude of a speaker or writer towards a topic, polarity, or emotion in a document is called sentiment analysis. There are a variety of sources for extracting sentiment, such as speech, music and facial expressions. Due to the rich source of information available in the form of text data, this paper focuses on sentiment analysis and emotion mining from text, as well as on discovering actionable patterns. The actionable patterns may suggest ways to alter the user's sentiment or emotion to a more positive or desirable state.
Keywords: actionable pattern mining; data mining; text mining; sentiment analysis.
A support architecture to MDA contribution for data mining
by Fatima Meskine, Safia Nait-Bahloul
Abstract: The data mining process is the sequence of tasks applied to data in order to discover relations between them and obtain knowledge. However, the data mining process lacks a formal specification that allows it to be modelled independently of platforms. Model driven architecture (MDA) is an approach for the development of software systems, based on the use of models to improve productivity. Several research works have aligned the MDA approach with data mining on data warehouses, in order to specify the data mining process at a very high level of abstraction. In our work, we propose a support architecture that allows these works to be positioned at different abstraction levels, on the basis of several criteria, with the aim of identifying the strengths of each level in terms of modelling and giving a clear view of the MDA contribution to data mining.
Keywords: data mining; model driven architecture; MDA; data warehouses; UML profiles; data multidimensional model; transformation.
A survey of term weighting schemes for text classification
by Abdullah Alsaeedi
Abstract: Text document classification approaches are designed to categorise documents into predefined classes. These approaches have two main components: document representation models and term-weighting methods. The high dimensionality of the feature space has always been a major problem in text classification methods. To resolve high dimensionality issues and to improve the accuracy of text classification, various feature selection approaches have been presented in the literature. In addition, several term-weighting schemes have been introduced that can be utilised in feature selection methods. This work surveys and investigates the various term (feature) weighting approaches that have been presented in the text classification context.
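As a point of reference for the surveyed schemes, the classic unsupervised tf-idf weighting can be stated in a few lines (a textbook formulation, not a scheme specific to this survey):

```python
import math
from collections import Counter

def tfidf(docs):
    """tf-idf(t, d) = tf(t, d) * log(N / df(t)): terms frequent in a
    document but rare across the collection get the highest weight."""
    N = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d.split()))          # document frequency per term
    weights = []
    for d in docs:
        tf = Counter(d.split())            # raw term frequency in this doc
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights
```

Supervised variants replace the idf factor with class-aware statistics such as chi-square or information gain, which is the main axis along which the survey groups the schemes.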
Keywords: document frequency; supervised term weighting; text classification; unsupervised term weighting.
Special Issue on: Data Mining and Computational Biology in Analysing Biological Data
Bees colonies for detecting communities evolution using data warehouse
by Yasmine Chaabani, Jalel Akaichi
Abstract: The analysis of social networks and their evolution has gained much interest in recent years. In fact, few methods have revealed and tracked meaningful communities over time or dealt efficiently with the structure and topic evolution of networks. In this paper, we propose a novel technique to track dynamic communities and their evolution behaviour. The main objective of our approach, which uses the artificial bee colony (ABC) algorithm, is to trace the evolution of communities and to optimise our objective function to maintain a proper partitioning. Moreover, we use a data warehouse as the bees' memory, storing information about the structure of the different communities at every timestamp. The experimental results showed that the proposed method is efficient in discovering dynamic communities and tracking their evolution.
Keywords: social network; community detection; bees colony.
Special Issue on: IFIP CIIA 2018 Advanced Research in Computational Intelligence
Recommendation of Items Using a Social-based Collaborative Filtering Approach and Classification Techniques
by Lamia Berkani
Abstract: With the development of Web 2.0 and social media, the study of social-based recommender systems has emerged. The social relationships among users can improve the accuracy of recommendation. However, with the large amount of data generated every day in social networks, the use of classification techniques becomes a necessity. Clustering-based approaches reduce the search space by clustering similar users or items together. In this paper, we focus on personalized item recommendation in a social context. Our approach combines, in different ways, a social filtering algorithm and the traditional user-based collaborative filtering algorithm. The social information is formalized by social-behavior metrics such as the friendship, commitment and trust degrees of users. Moreover, two classification techniques are used: an unsupervised technique applied initially to all users, and a supervised technique applied to newly added users. Finally, the proposed approach has been tested using different existing datasets. The results obtained show the contribution of integrating social information into collaborative filtering and the added value of using classification techniques in the different algorithms, in terms of recommendation accuracy.
Keywords: Item Recommendation; Collaborative filtering; Social filtering; Supervised Classification; Unsupervised classification.
Distributed Heterogeneous Ensemble Learning on Apache Spark for Ligand Based Virtual Screening
by Karima Sid, Mohamed Batouche
Abstract: Virtual screening is one of the most common computer-aided drug design techniques; it applies computational tools and methods to large libraries of molecules to extract drug candidates. Ensemble learning is a recent paradigm introduced to improve machine learning results in terms of predictive performance and robustness. It has been successfully applied in ligand-based virtual screening (LBVS) approaches. Applying ensemble learning to huge molecular libraries is computationally expensive, hence distributing and parallelizing the task has become a significant step, using sophisticated frameworks such as Apache Spark. In this paper, we propose HEnsL_DLBVS, a new heterogeneous ensemble learning approach, distributed on Spark, to improve large-scale LBVS results. To handle the problem of imbalanced big training datasets, we propose a novel hybrid technique. We generate new training datasets to evaluate the approach. Experimental results confirm the effectiveness of our approach, with satisfactory accuracy and superiority over homogeneous models.
Keywords: virtual screening; big data; computer-aided drug design; Apache Spark; machine learning; drug discovery; ensemble learning; imbalanced datasets; Spark MLlib; ligand-based virtual screening.
Hash-Processing of Universal Quantification-like Queries Dealing with Requirements and Prohibitions
by Noussaiba Benadjimi, Khaled Walid HIDOUCI
Abstract: This paper focuses on flexible universal quantification-like queries that simultaneously handle positive and negative preferences (requirements and prohibitions). We improve the performance of the considered operator by proposing new variants of the classical hash-division algorithm. The issue of ranking answers is also dealt with. We target in-memory database systems (also called main-memory database systems) with very large volumes of data, in which all data is primarily stored in the RAM of a computer. We introduce a parallel implementation of the operator that takes the data skew issue into account. Our empirical analysis of both the sequential and parallel versions shows the relevance of our approach: the new processing of the mixed operator in a main-memory database achieves better performance than the conventional ones, and becomes faster through parallelism.
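The classical hash-division operator that these variants build on can be sketched as follows. Only the "requirements" side is shown; handling prohibitions, ranking and parallelism are the paper's contributions and are not reproduced here:

```python
def hash_division(dividend, divisor):
    """Relational division via hashing: return the subjects related to
    ALL items of the divisor. `dividend` is a set of (subject, item)
    pairs; `divisor` is the set of required items."""
    required = set(divisor)
    matched = {}                    # subject -> divisor items seen so far
    for subject, item in dividend:
        if item in required:
            matched.setdefault(subject, set()).add(item)
    return {s for s, items in matched.items() if items == required}
```

A prohibition (anti-division) would instead discard any subject that matches a forbidden item, which is the mixed case the paper's operator handles.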
Keywords: Universal quantification queries; Relational division; Relational anti-division; Main-memory databases; Flexible division; Hash-division.
An Enhanced Cooperative Method to Solve Multiple Sequence Alignment Problem
by Lamiche Chaabane
Abstract: In this research study, we propose a novel cooperative approach called dynamic simulated particle swarm optimization (DSPSO), based on metaheuristics and the pairwise dynamic programming (DP) procedure, to find an approximate solution to the multiple sequence alignment (MSA) problem. The developed approach applies the particle swarm optimization (PSO) algorithm to explore the search space globally and the simulated annealing (SA) technique to improve the quality of the population leader, in order to overcome the local optimum problem. The dynamic programming technique is then integrated as an improvement mechanism, to improve the quality of the worst solution and to increase the convergence speed of the proposed approach. Simulation results on BaliBASE benchmarks have shown the ability of the proposed method to produce good-quality alignments compared to those given by other existing methods in the literature.
Keywords: Cooperative approach; multiple sequence alignment; DSPSO; PSO; SA; DP; BaliBASE benchmarks.
A Formal Theoretical Framework for a Flexible Classification Process
by Ismail Biskri
Abstract: The classification process is a complex technique that connects language, text, information and knowledge theories with computational formalization, statistical and symbolic approaches, standard and non-standard logics, etc. This process should always be under the control of the user, according to his subjectivity, his knowledge and the purpose of his analysis. It is therefore important to create platforms that support the design of classification tools, their management, and their adaptation to new needs and experiments. In recent years, several platforms for data mining, including textual data mining, in which classification is the main functionality, have emerged. However, they lack flexibility and formal foundations. In this paper, we propose a formal model with strong logical foundations based on applicative type systems.
Keywords: Classification; Flexibility; Applicative Systems; Operators/Operands; Combinatory Logics; Inferential Calculus; Compositionality; Processing Chains; Modules; Discovery Process; Collaborative Intelligent Science.
Graph-based Cumulative Score Using statistical features for multilingual automatic text summarization
by Abdelkrime Aries, Djamel Eddine Zegour, Walid Khaled Hidouci
Abstract: Multilingual summarization has begun to receive more attention in recent years. Many approaches can be used to achieve it, among them statistical and graph-based approaches. Our idea is to combine these two approaches into a new extractive text summarization method. Surface statistical features are used to calculate a primary score for each sentence. The graph is used to select candidate sentences and to calculate a final score for each sentence based on its primary score and those of its neighbours in the graph. We propose four variants for calculating the cumulative score of a sentence. Since the order of sentences is an important aspect of summary readability, we also propose algorithms that generate the summary based not just on final scores but on sentence connections in the graph. The method is tested using the MultiLing'15 workshop's MSS corpus and the ROUGE metric. It is evaluated against some well-known methods and gives promising results.
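One of the four cumulative-score variants could look like the following sketch. This is our own illustrative reading, in which the final score adds a damped average of the neighbours' primary scores; the damping factor is an assumption:

```python
def cumulative_scores(primary, edges, alpha=0.5):
    """final(i) = primary(i) + alpha * mean(primary(j)) over neighbours j.
    `primary` maps sentence id -> statistical score; `edges` are
    undirected similarity links between sentences."""
    neighbours = {i: [] for i in primary}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    final = {}
    for i, score in primary.items():
        ns = neighbours[i]
        bonus = alpha * sum(primary[j] for j in ns) / len(ns) if ns else 0.0
        final[i] = score + bonus
    return final
```

Ordering the selected sentences for the final summary would then follow the graph connections rather than the raw scores, as the abstract describes.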
Keywords: Automatic text summarization; Graph-based summarization; Statistical features; Multilingual summarization; Extractive summarization.
An ontology-based modelling and reasoning for alert correlation
by Tayeb Kenaza
Abstract: A good defense strategy utilizes multiple solutions such as firewalls, IDS, antivirus, AAA servers, VPN, etc. However, these tools can easily generate hundreds of thousands of events per day. A security information and event management system (SIEM for short) is a centralized solution that collects information from these tools and uses correlation techniques to build a reliable picture of the underlying monitored system. SIEM is a modern and powerful security tool thanks to several functions, such as the normalization and aggregation of data, that make it possible to benefit from the collected data. It provides security operators with a dashboard and helps them in forensic analysis when an incident is reported. Indeed, the most important function is event correlation, through which security operators can get a precise and quick picture of threats and attacks in real time. The quality of that picture depends on the efficiency of the reasoning approach adopted to put together the pieces of information provided by several analyzers. However, most proprietary SIEMs use their own data representation and their own correlation techniques, which are not always favorable to sharing knowledge or to incremental or collaborative reasoning. In this paper, we propose a semantic approach based on Description Logics (DLs), a powerful tool for knowledge representation and reasoning. Indeed, an ontology provides a comprehensive environment for representing intrusion detection information and allows that information to be easily maintained or extended. We implemented a rule-based engine for alert correlation based on the proposed ontology, and two attack scenarios are carried out to show the usefulness of the proposed approach.
Keywords: Intrusion detection; Alert correlation; Rules-based reasoning; Ontology; OWL.
Convolutional Neural Network with Stacked Autoencoders for Predicting Drug-Target Interaction and Binding Affinity
by Meriem Bahi, Mohamed Batouche
Abstract: The prediction of novel drug-target interactions (DTIs) is critically important for drug repositioning, as it can lead researchers to find new indications for existing drugs and reduce the cost and time of the de novo drug development process. To explore new ways towards this goal, we propose two novel methods, named SCA-DTIs and SCA-DTA, to predict drug-target interactions and drug-target binding affinities (DTA), respectively, based on a Convolutional Neural Network (CNN) with Stacked Autoencoders (SAE). Initializing a CNN's weights with filters from trained stacked autoencoders yields superior performance. Moreover, to boost the performance of DTI prediction, we propose a new method called RNDTIs to generate reliable negative samples. Tests on different benchmark datasets show that the proposed method can achieve excellent prediction performance, with an accuracy of more than 99%. These results demonstrate the potential of the proposed model for DTI and DTA prediction, thereby improving the drug repurposing process.
Keywords: Stacked Autoencoders; Convolutional Neural Network; Semi-Supervised Learning; Deep Learning; Drug Repositioning; Drug-Target Interaction; Binding Affinity.
Efficient Deployment Approach of Wireless Sensor Networks on 3D Terrains
by Mostefa Zafer, Mustapha Reda Senouci, Mohamed Aissani
Abstract: Ensuring the coverage of a Region of Interest (RoI) when deploying a Wireless Sensor Network (WSN) is an objective that depends on several factors, such as the detection capability of the sensor nodes used and the topography of the RoI. To address the topography challenges, in this paper we propose a new WSN deployment approach based on the idea of partitioning the RoI into sub-regions with relatively simple topography, then allocating to each constructed sub-region the necessary number of sensor nodes and finding their appropriate positions to maximize the coverage quality. The performance evaluation of this approach, coupled with three different deployment methods named DMSA (Deployment Method based on Simulated Annealing), GDM (Greedy Deployment Method) and RDM (Random Deployment Method), has revealed its relevance, since it significantly improved the coverage quality of the RoI.
Keywords: Wireless Sensor Networks; 3D terrains; Deployment; Coverage.
Special Issue on: ICBBD 2019 Business, Big Data and Decision Sciences
Modelling Attrition to Know Why Your Employees Leave or Stay
by Sachin Deshmukh, Seema Sant, Neerja Kashive
Abstract: Today's environmental factors influence every aspect of business, be it its marketing, finance, operations or human resources policies. Increased globalization and technological developments have resulted in fierce competition among companies, and talent shortage has become a global issue for organizations. One of the major challenges faced by any organization is an increase in the level of employee attrition. Attrition up to a certain limit is good for any organization, as it enables it to inject new blood and ideas which can help in developing a competitive advantage. But attrition beyond a certain limit can prove unhealthy, as talented employees may go elsewhere in search of greener pastures. Data analytics is used as an effective tool to delve into the problem of attrition. Predictive models are being used to understand the factors responsible for attrition and to predict the probability that an employee may leave the organization for some reason. The current study builds a predictive model using logistic regression to understand the specific factors that lead to attrition. This paper also compares the factors responsible for attrition in two time periods, from 1996 to 2008 (Holtom's model) and from 2009 to 2016, to find whether any changes have taken place in employees' expectations which, if not fulfilled, may lead to attrition. An analysis of an IT organization's data reveals that the factors responsible for attrition in the second period have changed compared to the first period.
Keywords: Attrition; Predictive Model; Logistic Regression.
Long Text to Image Converter for Financial Reports
by Chia-Hao Chiu, Yun-Cheng Tsai, Ho-Lin F.
Abstract: In this study, we propose a novel article analysis method. The method converts the article classification problem into an image classification problem by projecting texts into images and then applying CNN models for classification. We call the method the Long Text to Image Converter (LTIC). The features are extracted automatically from the generated images, hence there is no need for any explicit step of embedding the words or characters into numeric vector representations, which saves the time spent experimenting with pre-processing. This study uses the financial domain as an example. In a company's financial report, there is a chapter describing the company's financial trends, whose content uses many financial terms to infer the company's current and future financial position. The LTIC achieved an excellent confusion matrix and test data accuracy, with results indicating an 80% accuracy rate, and produced excellent results during practical application, classifying corporate financial reports under review with excellent performance. The return on simulated investment is 46%. In addition to tangible returns, the LTIC method reduces the time required for article analysis and can provide article classification references in a short period to facilitate decision making.
Keywords: Article Analysis; Convolutional Neural Network; Financial Analysis; Long Text to Image Converter.
E-Learning process through text mining for academic literacy
by Maira Alejandra Pulgarin Roriguez, Bárbara Maricely Fierro Chong, Erica María Ossa Taborda
Abstract: This paper presents the results of research carried out in a virtual Faculty of Education at a private university in Colombia. It consists of the characterization of students' reading and writing comprehension abilities for academic literacy. The study verifies the effectiveness of the implementation of an e-learning platform for all the programs in the Faculty. In line with the University's policies, there is a structured methodological procedure for text mining through specific keywords, applicable to different text typologies in specialized areas. The platform allows professors and students to develop expertise in their disciplines, using text mining as an interdisciplinary strategy to build knowledge and improve quality in their professional context.
Keywords: Text mining; terminological work; cognitive processes; E-learning; academic literacy; reading comprehension; academic writing.