Authors: Pradeep Kumar
Addresses: Indian Institute of Management Lucknow, Lucknow, India
Abstract: Clustering web usage data helps discover interesting patterns in user traversals, behaviour and usage characteristics. It is also useful for trend discovery and for building personalisation and recommendation engines. Because the web is dynamic, clusters of web user transactions take arbitrary shapes. Moreover, users access web pages in the order in which they are interested, so incorporating the sequential nature of their usage is crucial for clustering web transactions. In this paper, we present an approach that clusters web usage sequence data and removes noise using the DBSCAN algorithm. We also study the impact on the clustering process when both sequence and content information are incorporated in computing the similarity measure. We use the sequence and set similarity (S3M) measure to capture both the order of occurrence of page visits and the page information itself, and compare the results with those obtained using Euclidean distance and Jaccard similarity measures. The inter-cluster and intra-cluster distances are computed using average Levenshtein distance (ALD) to demonstrate the usefulness of the proposed approach in the context of web usage mining.
Keywords: sequence clustering; web usage data; similarity measures; average Levenshtein distance; ALD.
International Journal of Business Information Systems, 2018 Vol.28 No.1, pp.67-78
Received: 16 May 2016
Accepted: 29 Sep 2016
Published online: 13 Apr 2018