Abstract
Clustering is one of the most commonly used data mining techniques. The K-means and its variants are popular clustering methods. The simplistic Lloyd K-means algorithm, with randomly chosen initial cluster centers, suffers from poor clustering quality and high iteration numbers, especially unsuitable for clustering large datasets. Successful methods that choose a good set of initial cluster centers include the algorithm of Bradley and Fayyad [1] using sampled data subsets, and the bisecting K-means algorithm of Steinbach, Karypis, and Kumar [2]. Recently, it was discovered that iterations in the two-means algorithm used in bisecting K-means to bisect a subset can be limited to small numbers while still maintaining the final clustering quality for the bisecting K-means algorithm. In this paper, for datasets larger than memory capacity, we develop an iteration limiting strategy for bisecting K-means which adaptively determines the number of iterations for each call of the two-means bisecting subroutine based on the memory capacity of a computer and the size of the data subset to be partitioned. The strategy has been incorporated into the bisecting K-means algorithm, applied to the large challenge-response datasets of Physical Unclonable Functions that the authors are investigating, with comparison with the sampled-subsets algorithm of Bradley and Fayyad. Testing results show high computing efficiency for the bisecting K-means algorithm incorporated with the iteration limiting strategy, while exhibiting almost identical clustering quality.