Sequence-to-Sequence Acoustic-to-Phonetic Conversion Using Spectrograms and Deep Learning

Mustafa A. Qamhan; Yousef Ajami Alotaibi; Yasser Mohammad Seddiq; Ali Hamid Meftah; Sid Ahmed Selouani

doi:10.1109/ACCESS.2021.3083972

Journal article

Open access

Peer reviewed

IEEE access, Vol.9, pp.80209-80220

2021

DOI: https://doi.org/10.1109/ACCESS.2021.3083972

Keywords: Acoustics; Arabic; convolutional recurrent neural network; deep belief networks; Distinctive phonetic features; Feature extraction; Finite element analysis; KAPD corpus; Kernel; MSA; Phonetics; Spectrogram; spectrograms; speech processing; Training

Abstract

Distinctive phonetic features (DPFs) describe, in abstract terms, the place of articulation, manner of articulation, and voicing of a language's phonemes. While DPFs are powerful features of speech signals that capture the unique articulatory characteristics of each phoneme, DPF extraction is challenged by the need for an efficient computational model. Unlike ordinary acoustic features, which can be determined directly from the speech waveform using closed-form expressions, DPF elements are extracted from acoustic features using machine learning (ML) techniques. Therefore, to develop an acoustic-to-phonetic converter of high accuracy and low complexity, it is important to select input acoustic features that are simple yet carry adequate information. This paper examines the effectiveness of using the spectrogram as the acoustic feature, with DPFs modeled using two deep learning techniques: the deep belief network (DBN) and the convolutional recurrent neural network (CRNN). The proposed method is applied to Modern Standard Arabic (MSA). Multi-label modeling is considered in the proposed acoustic-to-phonetic converter. The learning techniques were evaluated using measures that accommodate the imbalanced nature of DPF elements. The results showed that the CRNN is more accurate than the DBN in extracting the DPFs.
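To illustrate what multi-label modeling and imbalance-aware evaluation can look like for DPFs, the following is a minimal sketch and not the authors' implementation. The DPF element names, the toy phoneme table, and the choice of per-label F1 as the measure are all assumptions made for illustration; the abstract does not list the paper's DPF inventory or name its evaluation measures.

```python
import numpy as np

# Hypothetical DPF elements and toy phoneme table, for illustration only;
# the paper's actual DPF inventory for MSA is not given in the abstract.
DPF_ELEMENTS = ["voiced", "nasal", "fricative", "bilabial"]
PHONEME_DPFS = {
    "m": {"voiced", "nasal", "bilabial"},
    "b": {"voiced", "bilabial"},
    "s": {"fricative"},
    "z": {"voiced", "fricative"},
}


def encode(phonemes):
    """Map a phoneme sequence to a (steps, elements) 0/1 multi-label matrix."""
    return np.array(
        [[int(e in PHONEME_DPFS[p]) for e in DPF_ELEMENTS] for p in phonemes]
    )


def per_label_f1(y_true, y_pred):
    """F1 score per DPF element (assumed measure); 0.0 if the element never occurs."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)
    denom = 2 * tp + fp + fn
    return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)


y_true = encode(["m", "b", "s", "z"])
y_pred = encode(["b", "b", "s", "s"])  # stand-in for a model's output

accuracy = float(np.mean(y_true == y_pred))  # element-wise accuracy
f1 = per_label_f1(y_true, y_pred)            # one score per DPF element
macro_f1 = float(f1.mean())                  # unweighted mean over elements

for name, score in zip(DPF_ELEMENTS, f1):
    print(f"{name}: F1 = {score:.2f}")
print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1:.2f}")
```

In this toy case the element-wise accuracy is 0.875 even though the rare "nasal" element is missed entirely (F1 of 0), which is the kind of effect that motivates measures suited to imbalanced labels.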

Files and links (1)

url

https://doi.org/10.1109/ACCESS.2021.3083972

Published (Version of record) Open

Details

Title: Sequence-to-Sequence Acoustic-to-Phonetic Conversion Using Spectrograms and Deep Learning
Creators - without role: Mustafa A. Qamhan - King Saud University
Yousef Ajami Alotaibi - King Saud University
Yasser Mohammad Seddiq - King Abdulaziz City for Science and Technology
Ali Hamid Meftah - King Saud University
Sid Ahmed Selouani - Université de Moncton
Publication Details: IEEE access, Vol.9, pp.80209-80220
Publisher: IEEE
Grant note: RG-1439-033 / Deanship of Scientific Research, King Saud University (10.13039/501100011665)
Identifiers: 9948469508331
Academic Unit: King Saud University
Language: English
Resource Type: Journal article