Performance Evaluation of K-Means Clustering Algorithm Using Some Robust Distances: a Case Study on Seismic Data in Sumatra
Ulfasari Rafflesia, Dedi Rosadi, Devni Prima Sari, Adhitya Ronnie Effendie
Statistika, 105(4): 538–554
https://doi.org/10.54694/stat.2024.72
Abstract
Clustering is an unsupervised learning technique that groups data according to inherent patterns and similarities, and K-means is one of the most widely used methods. K-means clustering is particularly susceptible to outliers because it relies on non-robust distances, most commonly the Euclidean distance. To address this issue, this paper discusses robust distance metrics, including a new Standardized Euclidean Robust distance and the Mahalanobis Robust distance, which reduce the influence of outliers and empirically improve clustering accuracy. The main objective of this study is to investigate the impact of applying robust distance metrics in K-means clustering and to identify the most suitable distance metric for seismic data containing outliers. The findings indicate that robust distance measures outperform non-robust distances in accuracy, yielding better values for indices that are to be minimized (the Davies-Bouldin, Xie-Beni, and Ball-Hall indices) as well as for indices that are to be maximized (the Calinski-Harabasz and Dunn indices).
Keywords
Clustering, k-means, outliers, robust distance
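Illustrative sketch (not the paper's definition): the abstract does not state how the Standardized Euclidean Robust distance is constructed. The Python fragment below shows one common way to make a standardized Euclidean distance less sensitive to outliers, under the assumption that each feature is scaled by its median absolute deviation (MAD) rather than its standard deviation. The function names `robust_scale`, `robust_seuclidean`, and `assign_clusters` and the toy data are hypothetical and serve only to show the idea.

```python
import numpy as np


def robust_scale(X):
    """Return the per-feature median and the MAD scaled by 1.4826.

    The factor 1.4826 makes the MAD comparable to the standard
    deviation for normally distributed data. Using the MAD as the
    scale is an assumption of this sketch.
    """
    med = np.median(X, axis=0)
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)
    return med, mad


def robust_seuclidean(x, y, scale):
    """Euclidean distance after dividing each feature by a robust scale."""
    return float(np.sqrt(np.sum(((x - y) / scale) ** 2)))


def assign_clusters(X, centers, scale):
    """K-means assignment step: index of the nearest center per point."""
    diff = (X[:, None, :] - centers[None, :, :]) / scale  # shape (n, k, p)
    dist = np.sqrt((diff ** 2).sum(axis=2))               # shape (n, k)
    return dist.argmin(axis=1)


# Toy data (hypothetical): the last row is an outlier in the second feature.
X = np.array([[1.0, 10.0],
              [2.0, 12.0],
              [3.0, 11.0],
              [4.0, 13.0],
              [5.0, 500.0]])

med, mad = robust_scale(X)
std = X.std(axis=0)

# The outlier inflates the standard deviation of the second feature,
# while the MAD stays close to the spread of the bulk of the data.
print("standard deviation:", std)
print("scaled MAD:        ", mad)

centers = np.array([[2.0, 11.0], [5.0, 500.0]])
print("labels:", assign_clusters(X, centers, mad))
```

With a non-robust scale, a single extreme value shrinks every standardized difference in that feature, so the feature contributes little to the distance. A robust scale keeps the feature informative for the bulk of the data. A robust Mahalanobis distance follows the same idea, replacing the sample mean and covariance with robust estimates of location and scatter.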