Publisher: Inderscience Publishers

Title: DhwaniClone lite: a light-weight encoder framework for voice cloning

Authors: Jay Doshi; Jay Jani; Ruhina Karani

Addresses: Department of Computer Engineering, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India (all three authors)

Abstract: Voice cloning has garnered significant attention for its ability to replicate individuals' voices using artificial intelligence. Existing methods include mel spectrogram and vector embedding approaches, each with strengths and weaknesses. This study introduces a hybrid pipeline leveraging both. It proposes a lightweight feed-forward neural network, crucial to the pipeline's performance. Unlike GAN-based architectures, this approach requires fewer data samples while achieving comparable results. The system uses a novel network as an encoder, Tacotron2 as a synthesiser, and WaveNet as a vocoder. The encoder captures distinctive vocal characteristics, generating speaker embeddings. Tacotron2 creates mel spectrograms for synthesised speech, and WaveNet produces high-quality audio waveforms resembling natural speech. The system is accessible in low-data scenarios, enabling faster training. Objective evaluations, including MOS, PESQ, and SNR metrics, confirm its superiority. This study presents a lightweight, data-efficient voice cloning system with applications in voice assistants, personalised speech synthesis, and entertainment.
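As a rough illustration of the encoder stage described in the abstract, the following minimal sketch shows how a small feed-forward network could turn a sequence of mel-spectrogram frames into a fixed-size speaker embedding. It is not the authors' architecture: the layer sizes, the mean-pooling step, the unit-length normalisation, and all function names are assumptions made for illustration, and the weights are random rather than trained. In the pipeline the abstract describes, an embedding of this kind would condition the Tacotron2 synthesiser, whose mel spectrograms the WaveNet vocoder converts to audio.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for one dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, relu=True):
    """Apply one dense layer to vector x, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def encode_speaker(mel_frames, layers):
    """Map a list of mel frames to one unit-length speaker embedding."""
    # Mean-pool over time so utterances of any length give a fixed-size input
    # (an assumption; the paper may aggregate frames differently).
    n_frames = len(mel_frames)
    h = [sum(column) / n_frames for column in zip(*mel_frames)]
    # Feed-forward pass: ReLU on hidden layers, linear output layer.
    for i, layer in enumerate(layers):
        h = dense(h, layer, relu=(i < len(layers) - 1))
    # L2-normalise so embeddings can be compared by a plain dot product.
    norm = math.sqrt(sum(v * v for v in h)) or 1.0
    return [v / norm for v in h]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
N_MELS, HIDDEN, EMBED = 40, 64, 16  # hypothetical sizes, not from the paper
layers = [make_layer(N_MELS, HIDDEN, rng), make_layer(HIDDEN, EMBED, rng)]

# Stand-in for a real mel spectrogram: 50 frames of 40 mel bins each.
utterance = [[rng.random() for _ in range(N_MELS)] for _ in range(50)]
embedding = encode_speaker(utterance, layers)
```

With trained weights, embeddings of two utterances from the same speaker would be expected to score a higher `cosine` value than embeddings from different speakers; with the random weights used here the output only demonstrates the shape of the computation.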

Keywords: voice cloning; encoder; feed-forward neural network; voice embeddings; mel spectrogram.

DOI: 10.1504/IJCISTUDIES.2024.144046

International Journal of Computational Intelligence Studies, 2024, Vol. 13, No. 1/2, pp. 112-130

Received: 12 May 2023
Accepted: 30 May 2024
Published online: 22 Jan 2025
