Abstract
A trojan horse is a program that surreptitiously performs its operation under the guise of a legitimate program. Traditional approaches using signatures to detect these programs pose little danger to new and unseen samples whose signatures are not available. The focus of malware research is shifting from using signature patterns to identifying the malicious behavior displayed by these malwares. This paper presents the novel idea of extracting variable length instruction sequences that can identify trojans from clean programs using data mining techniques. The analysis is facilitated by the program control flow information contained in the instruction sequences. Based on general statistics gathered from these instruction sequences, we formulated the problem as a binary classification problem and built random forest, bagging and support vector machine classifiers. Our approach showed a 94.0% detection rate on novel trojans whose data was not used in the model building process.