Abstract
The problem of sensitive information leaks became apparent in recent infamous security breaches such as WikiLeaks, the DNC emails, and the Panama Papers. Detecting sensitive texts on the fly enhances the ability of security solutions to monitor and protect critical information flow within the network. Automated text security classification is a relatively new research area, in which sensitive texts are marked with labels such as Secret, Confidential, and Unclassified without human interaction. This paper examines the performance of deep learning networks in detecting the sensitivity level of a given text. In deep text classification networks, each paragraph or sentence is represented by a single sequence, regardless of the length of the text sample. We propose techniques that expand the training set size, minimize the number of padding characters in sequences, and lower the input dimensionality by learning from segments of long paragraphs as independent instances. We also introduce a wide variation of the Convolutional Neural Network (CNN), evaluated on four large sets of U.S. embassy diplomatic cables. We are not aware of any paper that has applied deep networks to sensitive text classification. Thus, we further evaluate our multi-sequencing technique and CNN on well-researched non-sensitive text corpora. Our approach outperformed the state-of-the-art models on the non-sensitive text datasets and was competitive with traditional classifiers on the sensitive text datasets.
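
The idea of treating segments of a long paragraph as independent training instances can be illustrated with a minimal sketch. This is not the implementation evaluated in the paper: the function names, the `<pad>` token, the fixed segment length, and the choice to let every segment inherit its parent's label are all assumptions made for illustration only.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical padding token, not taken from the paper

Instance = Tuple[List[str], str]  # (token sequence, sensitivity label)


def segment_sample(tokens: List[str], label: str, seg_len: int) -> List[Instance]:
    """Split one labeled token sequence into fixed-length segments.

    Assumption: each segment inherits the label of the text it came from.
    Only the last segment of a text can need padding.
    """
    if seg_len <= 0:
        raise ValueError("seg_len must be positive")
    segments = []
    for start in range(0, len(tokens), seg_len):
        chunk = tokens[start:start + seg_len]
        chunk = chunk + [PAD] * (seg_len - len(chunk))
        segments.append((chunk, label))
    return segments


def count_padding(instances: List[Instance]) -> int:
    """Count padding tokens across all instances."""
    return sum(tok == PAD for seq, _ in instances for tok in seq)


# Toy corpus: one long text (10 tokens) and one short text (3 tokens).
corpus = [
    ("a b c d e f g h i j".split(), "Secret"),
    ("k l m".split(), "Unclassified"),
]

# Single-sequence baseline: pad every text to the longest one.
max_len = max(len(tokens) for tokens, _ in corpus)
single = [(tokens + [PAD] * (max_len - len(tokens)), label)
          for tokens, label in corpus]

# Multi-sequence variant: segments of 4 tokens become separate instances.
multi = [inst
         for tokens, label in corpus
         for inst in segment_sample(tokens, label, seg_len=4)]

print(len(single), max_len, count_padding(single))
print(len(multi), 4, count_padding(multi))
```

On this toy corpus the baseline yields 2 instances of length 10 with 7 padding tokens, while segmentation yields 4 instances of length 4 with 3 padding tokens, which shows all three effects named above: more training instances, less padding, and a lower input dimensionality.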