Abstract
Machine learning techniques are now among the most effective approaches for Voice and Emotion Recognition (VER), and automatic recognition of voice and emotions is essential for smooth psychosocial interaction between humans and machines. VER research has made great strides by combining spectrogram features with deep learning. However, although single Machine Learning (ML) methods deliver acceptable results, they do not yet meet the required standards. This motivates strategies that combine multiple ML techniques and target different aspects of voice recognition. This article proposes an ensemble classifier model that combines the outputs of two base classifiers, a Capsule Network (CapsNet) and a Recurrent Neural Network (RNN), for VER. The CapsNet model captures the spatial correlations of vital speech information in spectrograms through dynamic routing rather than pooling, while the RNN excels at processing time-series data; both are well known for their classification performance. Stacked generalization is used to construct the ensemble classifier that integrates the predictions of the CapsNet and RNN classifiers. This ensemble approach achieves an overall accuracy of 96.05%, outperforming either the CapsNet or the RNN individually. A notable benefit of the proposed classifier is its effective detection of the emotional class 'FEAR', with a recognition rate of 96.68% among the eight emotion classes.
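The stacked-generalization scheme described above can be sketched as follows. This is an illustrative sketch only, not the paper's pipeline: two off-the-shelf scikit-learn classifiers stand in for the CapsNet and RNN base learners, the synthetic eight-class dataset stands in for spectrogram features, and a logistic-regression meta-learner combines the base learners' out-of-fold class probabilities.

```python
# Sketch of stacked generalization with two base classifiers and a meta-learner.
# The MLP and random forest are hypothetical stand-ins for CapsNet and RNN.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic 8-class data standing in for spectrogram-derived features.
X, y = make_classification(n_samples=600, n_features=40, n_informative=20,
                           n_classes=8, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Base learners (stand-ins for the CapsNet and RNN classifiers).
base_learners = [
    ("caps_stub", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                                random_state=0)),
    ("rnn_stub", RandomForestClassifier(n_estimators=100, random_state=0)),
]

# Meta-learner trained on the base learners' cross-validated probabilities.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000),
                           stack_method="predict_proba", cv=5)
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
print(f"stacked ensemble accuracy: {acc:.3f}")
```

The key design choice is that the meta-learner sees only out-of-fold predictions from the base models (controlled by `cv`), which limits the leakage that would occur if base models predicted on their own training data.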