Authors: Priyanka Varshney; Omar Farooq; Prashant Upadhyaya
Addresses: Department of Electronics Engineering, Aligarh Muslim University, Aligarh 202002, India ' Department of Electronics Engineering, Aligarh Muslim University, Aligarh 202002, India ' Department of Electronics Engineering, Aligarh Muslim University, Aligarh 202002, India
Abstract: Many automatic speech recognition (ASR) applications operate under noisy background conditions; hence, robustness has become an important area of research. Automated recognition of human speech using features from the visual domain along with audio has proved useful under these conditions. In this paper, the use of visual information is proposed to increase the recognition performance and robustness of a Hindi viseme recognition system. A database has been prepared comprising ten Hindi sentences uttered by five different speakers. Audio features based on mel frequency cepstral coefficients (MFCC) were extracted, and a subspace-based discrete cosine transform (DCT) was applied to extract visual features. The video-based features were integrated with the audio features before using a discriminant function-based classifier for five Hindi viseme classes. Integration of visual features improved viseme recognition for clean as well as noisy speech. A maximum improvement of 6.67% in viseme recognition accuracy was found at −10 dB SNR using a Mahalanobis distance-based classifier for speech corrupted by car noise.
Keywords: automatic speech recognition; viseme recognition; phoneme recognition; feature extraction; mel frequency cepstral coefficient; MFCC; discrete wavelet transform; DWT; discrete cosine transform; DCT; Hindi; audio features; video features; car noise; vehicle noise.
International Journal of Applied Pattern Recognition, 2014 Vol.1 No.3, pp.257 - 272
Received: 30 Mar 2013
Accepted: 31 Aug 2013
Published online: 17 Nov 2014
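
The abstract describes integrating video-based features with audio features and then classifying visemes with a Mahalanobis distance-based classifier. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes per-frame feature vectors have already been extracted, that integration is done by simple concatenation, and that a single pooled covariance matrix is shared across classes. The feature dimensions (13 audio, 6 visual), the synthetic data, and all function and class names are hypothetical and chosen only for illustration.

```python
import numpy as np


def fuse_features(audio, visual):
    """Feature-level integration (assumed here): concatenate the
    audio and visual vectors of each frame into one vector."""
    return np.concatenate([audio, visual], axis=1)


class MahalanobisClassifier:
    """Minimum-Mahalanobis-distance classifier with a pooled covariance.

    The pooled covariance is an assumption of this sketch; the abstract
    does not state how the covariance was estimated.
    """

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One mean vector per class.
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Centre every sample on its own class mean, then pool.
        idx = np.searchsorted(self.classes_, y)
        centered = X - self.means_[idx]
        cov = centered.T @ centered / (len(X) - len(self.classes_))
        # Small ridge term keeps the matrix invertible.
        cov += 1e-6 * np.eye(X.shape[1])
        self.inv_cov_ = np.linalg.inv(cov)
        return self

    def predict(self, X):
        # diff has shape (n_samples, n_classes, n_features).
        diff = X[:, None, :] - self.means_[None, :, :]
        # Squared Mahalanobis distance of each sample to each class mean.
        d2 = np.einsum("nkd,de,nke->nk", diff, self.inv_cov_, diff)
        return self.classes_[np.argmin(d2, axis=1)]


def make_synthetic(n_per_class=40, n_classes=5, audio_dim=13,
                   visual_dim=6, seed=0):
    """Synthetic stand-in for real MFCC and DCT features (hypothetical)."""
    rng = np.random.default_rng(seed)
    y = np.repeat(np.arange(n_classes), n_per_class)
    audio = rng.normal(size=(len(y), audio_dim)) + 3.0 * y[:, None]
    visual = rng.normal(size=(len(y), visual_dim)) + 3.0 * y[:, None]
    return audio, visual, y


if __name__ == "__main__":
    audio, visual, labels = make_synthetic()
    fused = fuse_features(audio, visual)
    clf = MahalanobisClassifier().fit(fused, labels)
    accuracy = np.mean(clf.predict(fused) == labels)
    print(f"fused feature shape: {fused.shape}, training accuracy: {accuracy:.3f}")
```

The synthetic classes are deliberately well separated, so the printed accuracy says nothing about the results reported in the paper; it only shows that the fusion and distance computation run end to end.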