Abstract
This paper addresses the problem of equipping computerised machines with the ability to comprehend, to some degree, and act upon spoken natural-language input. We propose adding acoustic-based audio inputs to current closed-circuit television (CCTV) systems, both to compensate for incomplete data and to make fuller use of existing surveillance infrastructure. In the proposed model, we apply the isolated-word technique to a dataset of 8000 audio inputs associated with different individuals, using two distinct neural networks. The algorithm provides event-based detection: it detects unauthorised access by automatically recognising each spoken input together with the identity of the speaker. The proposed algorithm achieved accuracy rates of 84.1% and 80.1% for speaker identification and spoken-input recognition, respectively. It also outperformed a support vector machine (SVM) based model.