A Memory Capacity-Aware Algorithm for Fast Clustering of Disk-Resident Big Datasets

Ahmad O. Aseeri; Yu Zhuang; Mohammed Saeed Alkatheiri; IEEE Computer Society

doi:10.1109/DASC-PICom-DataCom-CyberSciTec.2017.44


Conference proceeding


2017 IEEE 15TH INTL CONF ON DEPENDABLE, AUTONOMIC AND SECURE COMPUTING, 15TH INTL CONF ON PERVASIVE INTELLIGENCE AND COMPUTING, 3RD INTL CONF ON BIG DATA INTELLIGENCE AND COMPUTING AND CYBER SCIENCE AND TECHNOLOGY CONGRESS(DASC/PICOM/DATACOM/CYBERSCI, Vol.2018-, pp.194-201

01/01/2017

DOI: https://doi.org/10.1109/DASC-PICom-DataCom-CyberSciTec.2017.44

Subjects: Automation & Control Systems; Computer Science; Computer Science, Artificial Intelligence; Computer Science, Theory & Methods; Engineering; Engineering, Electrical & Electronic; Science & Technology; Technology

Abstract

Clustering is one of the most commonly used data mining techniques, and K-means and its variants are popular clustering methods. The simple Lloyd K-means algorithm, with randomly chosen initial cluster centers, suffers from poor clustering quality and high iteration counts, which makes it especially unsuitable for clustering large datasets. Successful methods for choosing a good set of initial cluster centers include the algorithm of Bradley and Fayyad [1], which uses sampled data subsets, and the bisecting K-means algorithm of Steinbach, Karypis, and Kumar [2]. Recently, it was discovered that the iterations of the two-means algorithm used in bisecting K-means to bisect a subset can be limited to a small number while still maintaining the final clustering quality of bisecting K-means. In this paper, for datasets larger than memory capacity, we develop an iteration-limiting strategy for bisecting K-means that adaptively determines the number of iterations for each call of the two-means bisecting subroutine, based on the memory capacity of the computer and the size of the data subset to be partitioned. The strategy has been incorporated into the bisecting K-means algorithm and applied to the large challenge-response datasets of Physical Unclonable Functions that the authors are investigating, and it is compared with the sampled-subsets algorithm of Bradley and Fayyad. Test results show high computing efficiency for bisecting K-means with the iteration-limiting strategy, with almost identical clustering quality.
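
The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the actual rule for choosing the iteration count, so `iteration_limit` below is a hypothetical stand-in (few iterations when the subset exceeds memory, more when it fits). The names `memory_capacity` (assumed here to be measured in points that fit in memory), `few`, and `many`, and the choice to split the cluster with the largest sum of squared errors, are likewise assumptions made for illustration. The data is held in an in-memory list; a disk-resident version would stream each pass from storage.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sse(points):
    """Sum of squared distances from each point to the cluster mean."""
    c = mean(points)
    return sum(sq_dist(p, c) for p in points)

def iteration_limit(subset_size, memory_capacity, few=2, many=20):
    """HYPOTHETICAL rule (not from the paper): subsets that do not fit in
    memory cost one disk pass per iteration, so they get a small cap."""
    return many if subset_size <= memory_capacity else few

def two_means(points, max_iters, rng):
    """Lloyd's algorithm with k=2, stopped after at most max_iters passes."""
    centers = rng.sample(points, 2)       # two random initial centers
    left, right = list(points), []        # fallback if no valid split is found
    for _ in range(max_iters):
        new_left, new_right = [], []
        for p in points:                  # one full pass over the subset
            if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]):
                new_left.append(p)
            else:
                new_right.append(p)
        if not new_left or not new_right:
            break                         # degenerate split: keep the previous one
        left, right = new_left, new_right
        new_centers = [mean(left), mean(right)]
        if new_centers == centers:
            break                         # converged before reaching the cap
        centers = new_centers
    return left, right

def bisecting_kmeans(points, k, memory_capacity, seed=0):
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) >= 2]
        if not candidates:
            break
        target = max(candidates, key=sse)
        limit = iteration_limit(len(target), memory_capacity)
        left, right = two_means(target, limit, rng)
        if not left or not right:
            break                         # could not split (e.g. duplicate points)
        clusters = [c for c in clusters if c is not target] + [left, right]
    return clusters
```

For example, `bisecting_kmeans(points, k=3, memory_capacity=4)` on eight points would run the first bisection with the small cap (the subset is larger than the assumed capacity) and later bisections of subsets with at most four points with the larger cap.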


Details

Title: A Memory Capacity-Aware Algorithm for Fast Clustering of Disk-Resident Big Datasets
Creators: Ahmad O. Aseeri (Texas Tech University); Yu Zhuang (Dalian University of Technology); Mohammed Saeed Alkatheiri (Jeddah University); IEEE Computer Society
Publication Details: 2017 IEEE 15TH INTL CONF ON DEPENDABLE, AUTONOMIC AND SECURE COMPUTING, 15TH INTL CONF ON PERVASIVE INTELLIGENCE AND COMPUTING, 3RD INTL CONF ON BIG DATA INTELLIGENCE AND COMPUTING AND CYBER SCIENCE AND TECHNOLOGY CONGRESS(DASC/PICOM/DATACOM/CYBERSCI, Vol.2018-, pp.194-201
Publisher: IEEE
Number of pages: 8
Grant note: CNS-1526055 / National Science Foundation (NSF)
Identifiers: 9925558608331
Academic Unit: Prince Sattam Bin Abdulaziz University; University of Jeddah
Language: English
Resource Type: Conference proceeding