Title: Audio-visual speech recognition based on machine learning approach

Authors: Saswati Debnath; Pinki Roy

Addresses: Computer Science and Engineering Department, National Institute of Technology, Silchar, Assam, 788010, India; Computer Science and Engineering Department, National Institute of Technology, Silchar, Assam, 788010, India

Abstract: Machine-based audio-visual speech recognition plays an important role now that research in audio-only automatic speech recognition is approaching its performance limits. Audio alone performs well, but adding visual information can make the recognition system more robust when the audio signal is degraded by a noisy environment or varies with the transmission channel. This paper proposes an audio-visual automatic speech recognition (AV-ASR) system based on machine learning approaches. Visual information is captured from the lip contour. Pseudo-Zernike moments (PZMs) are extracted as visual features and 19th-order Mel frequency cepstral coefficients (MFCCs) as audio features. Two machine learning approaches, artificial neural networks (ANN) and support vector machines (SVM), are used to recognise speech from the audio and visual modalities. After the two systems perform recognition individually, their decisions are combined. The paper also evaluates the individual performance of the audio-only and visual-only speech recognisers under both machine learning approaches.
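The abstract does not state the PZM orders or how the lip region is preprocessed. As a minimal sketch, assuming a square grayscale (or binary) image already cropped around the lip contour, the pseudo-Zernike moment A_nm = (n+1)/pi * sum over the unit disk of f(x, y) R_nm(r) exp(-j m theta) can be computed as below, with the magnitudes |A_nm| used as rotation-invariant features. The function names, the pixel-to-disk mapping and the `max_order` parameter are illustrative assumptions, not the authors' implementation.

```python
import math

import numpy as np


def radial_poly(n: int, m: int, r: np.ndarray) -> np.ndarray:
    """Pseudo-Zernike radial polynomial R_{n,|m|}(r), valid for |m| <= n."""
    m = abs(m)
    out = np.zeros_like(r, dtype=float)
    for s in range(n - m + 1):
        coef = ((-1) ** s * math.factorial(2 * n + 1 - s)
                / (math.factorial(s) * math.factorial(n - m - s)
                   * math.factorial(n + m + 1 - s)))
        out += coef * r ** (n - s)
    return out


def pzm(img: np.ndarray, n: int, m: int) -> complex:
    """Pseudo-Zernike moment A_{n,m} of a square image mapped onto the unit disk."""
    size = img.shape[0]
    if img.shape != (size, size):
        raise ValueError("expected a square image")
    # Pixel centres are mapped into (-1, 1); each pixel covers (2/size)^2 of the plane.
    coords = (2 * np.arange(size) + 1 - size) / size
    x, y = np.meshgrid(coords, coords)  # x varies along columns, y along rows
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)
    inside = r <= 1.0  # only pixels inside the unit disk contribute
    kernel = radial_poly(n, m, r) * np.exp(-1j * m * theta)
    area = (2.0 / size) ** 2
    return (n + 1) / math.pi * np.sum(img[inside] * kernel[inside]) * area


def pzm_features(img: np.ndarray, max_order: int) -> np.ndarray:
    """Feature vector of |A_{n,m}| for 0 <= m <= n <= max_order (hypothetical layout)."""
    feats = [abs(pzm(img, n, m))
             for n in range(max_order + 1)
             for m in range(n + 1)]
    return np.array(feats)
```

Negative repetitions m are skipped because, for a real-valued image, |A_{n,-m}| equals |A_{n,m}|; with `max_order=4` this gives 15 values per frame. Rotating the lip image only shifts the phase of each A_nm, so the magnitudes stay unchanged, which is the usual reason PZM magnitudes are chosen as shape descriptors.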
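For the audio side, a minimal sketch of 19th-order MFCC extraction is shown below. It assumes common textbook settings that the abstract does not specify: 0.97 pre-emphasis, 25 ms Hamming frames with a 10 ms hop, a 512-point FFT, 26 triangular mel filters, and keeping c0 as the first of the 19 cepstral coefficients. All function names and defaults are hypothetical choices for illustration, not the authors' configuration.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale, shape (n_filters, n_fft//2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):  # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):  # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / (right - centre)
    return fbank


def dct_matrix(n_out, n_in):
    """Orthonormal DCT-II matrix keeping the first n_out coefficients."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    mat = np.sqrt(2.0 / n_in) * np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))
    mat[0] /= np.sqrt(2.0)
    return mat


def mfcc(signal, sample_rate, n_ceps=19, n_filters=26,
         frame_len=0.025, frame_step=0.010, n_fft=512):
    """Return an array of shape (n_frames, n_ceps) of MFCCs (assumed settings)."""
    signal = np.asarray(signal, dtype=float)
    # Pre-emphasis boosts high frequencies (assumed coefficient 0.97).
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    frame_size = int(round(frame_len * sample_rate))
    step = int(round(frame_step * sample_rate))
    if len(signal) < frame_size:
        raise ValueError("signal shorter than one frame")
    n_frames = 1 + (len(signal) - frame_size) // step
    idx = np.arange(frame_size)[None, :] + step * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_size)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft  # periodogram per frame
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(np.maximum(energies, 1e-10))  # floor avoids log(0)
    return log_energies @ dct_matrix(n_ceps, n_filters).T
```

For one second of 16 kHz audio these settings yield 98 frames of 19 coefficients each. Whether the paper's "19th order" includes or excludes c0 is not stated in the abstract; dropping c0 would only require computing `n_ceps + 1` coefficients and discarding the first column.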

Keywords: audio-visual speech recognition; lip tracking; pseudo-Zernike moment; Mel frequency cepstral coefficients; MFCC; artificial neural network; ANN; support vector machine; SVM.

DOI: 10.1504/IJAIP.2022.122193

International Journal of Advanced Intelligence Paradigms, 2022 Vol.21 No.3/4, pp.211 - 224

Received: 27 Apr 2018
Accepted: 06 Nov 2018
Published online: 12 Apr 2022
