Abstract
The theft and exfiltration of sensitive data (e.g., state secrets, trade secrets, and company records) is one of the most damaging threats that malicious insiders can carry out against institutions and organizations. Over the last decade, data leak prevention (DLP) has emerged as a new mechanism to detect and block unauthorized data transfers out of the organization's perimeter. Although DLP has gained traction and led to several commercial products, many unresolved challenges still hamper its operation and full adoption. Existing DLP systems exhibit relatively limited accuracy and can be circumvented using various evasion tactics. In this paper, we present a new DLP model that tracks sensitive data using a summarized version of the content's semantics, called the document semantic signature (DSS). The DSS can be updated dynamically as the protected content changes, and it is resilient against evasion tactics such as content rewriting. In an evaluation on a public dataset, the DSS model achieved very encouraging results in terms of detection effectiveness. (C) 2019 Elsevier B.V. All rights reserved.