Abstract
Distinctive phonetic features (DPFs) abstractly describe the place of articulation, manner of articulation, and voicing of a language's phonemes. While DPFs are powerful features of speech signals that capture the unique articulatory characteristics of each phoneme, DPF extraction is challenged by the need for an efficient computational model. Unlike ordinary acoustic features, which can be determined directly from the speech waveform using closed-form expressions, DPF elements are extracted from acoustic features using machine learning (ML) techniques. Therefore, to develop an acoustic-to-phonetic converter of high accuracy and low complexity, it is important to select input acoustic features that are simple yet carry adequate information. This paper examines the effectiveness of using the spectrogram as the acoustic feature, with DPFs modeled using two deep learning techniques: the deep belief network (DBN) and the convolutional recurrent neural network (CRNN). The proposed method is applied to Modern Standard Arabic (MSA). Multi-label modeling is adopted in the proposed acoustic-to-phonetic converter. The learning techniques were evaluated using measures that accommodate the imbalanced nature of DPF elements. The results showed that the CRNN is more accurate than the DBN in extracting the DPFs.