Automated big security text pruning and classification

Khudran Alzhrani; Ethan M. Rudd; C. Edward Chow; Terrance E. Boult

doi:10.1109/BigData.2016.7841028

Conference proceeding

2016 IEEE International Conference on Big Data (Big Data), pp.3629-3637

12/2016

Keywords

Big data

Cloud computing

Cloud Storage

Companies

Data Leak Prevention

Privacy

Security

Security Classification

Sensitivity

Topic Modeling

Training

WikiLeaks

Abstract

Many security-related big data problems, including document, traffic, and system log analysis, require analysis of unstructured text. Consider the task of analyzing company documents for secure storage. Some may be too sensitive to put on a public cloud and require private storage with its associated backup overhead, some may be safe on the cloud in encrypted form, and some may be sufficiently non-sensitive to be stored on the cloud in plaintext, without encryption and decryption overhead. Being able to make such categorizations autonomously can significantly strengthen data security, organization, and storage efficiency. In this paper, we analyze several base machine-learning-based security risk assessment algorithms and develop techniques to improve upon them. In particular, we examine labeling document sensitivity by labeling each paragraph in the document with one of three levels of security risk. For evaluation, we use real sensitive texts from documents leaked by the WikiLeaks organization. We improve upon the base models by using probabilistic topic modeling via Latent Dirichlet Allocation to identify samples from impure subtopics in the training set prior to training a logistic regression classifier.
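
The pruning idea in the abstract can be illustrated with a minimal sketch. This is one possible reading, not the authors' implementation. It assumes each training paragraph has already been assigned a dominant topic by some LDA implementation, and it treats a topic as "impure" when its majority sensitivity label covers less than a chosen fraction of its paragraphs. The function names `topic_purity` and `prune_impure`, the label strings, and the `min_purity` threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def topic_purity(topics, labels):
    """Map each topic to the fraction of its samples that carry the
    topic's majority label (an assumed definition of purity)."""
    by_topic = defaultdict(Counter)
    for topic, label in zip(topics, labels):
        by_topic[topic][label] += 1
    return {
        topic: counts.most_common(1)[0][1] / sum(counts.values())
        for topic, counts in by_topic.items()
    }


def prune_impure(topics, labels, min_purity=0.8):
    """Return indices of samples whose dominant topic meets min_purity.

    The surviving samples would then be used to train the classifier.
    """
    if len(topics) != len(labels):
        raise ValueError("topics and labels must have the same length")
    purity = topic_purity(topics, labels)
    return [i for i, topic in enumerate(topics) if purity[topic] >= min_purity]


# Toy data: dominant topic per paragraph and its sensitivity label.
topics = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
labels = ["high", "high", "high", "high",
          "high", "low", "medium", "low",
          "low", "low"]

# Topic 1 mixes all three labels, so its paragraphs are dropped.
kept = prune_impure(topics, labels, min_purity=0.75)
print(kept)
```

In this toy run, topics 0 and 2 are fully pure and topic 1 has a purity of 0.5, so only the paragraphs from topics 0 and 2 remain for classifier training.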

Details

Title: Automated big security text pruning and classification
Creators: Khudran Alzhrani - University of Colorado Colorado Springs
Ethan M. Rudd - Vision & Security Technol. (VAST) Lab., Univ. of Colorado at Colorado Springs, Colorado Springs, CO, USA
C. Edward Chow - University of Colorado Colorado Springs
Terrance E. Boult - Vision & Security Technol. (VAST) Lab., Univ. of Colorado at Colorado Springs, Colorado Springs, CO, USA
Publication Details: 2016 IEEE International Conference on Big Data (Big Data), pp.3629-3637
Publisher: IEEE
Identifiers: 9930986708331
Academic Unit: Umm Al Qura University
Language: English
Resource Type: Conference proceeding