Title: DhwaniClone lite: a light-weight encoder framework for voice cloning
Authors: Jay Doshi; Jay Jani; Ruhina Karani
Addresses: Department of Computer Engineering, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India; Department of Computer Engineering, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India; Department of Computer Engineering, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India
Abstract: Voice cloning has garnered significant attention for its ability to replicate individuals' voices using artificial intelligence. Existing methods include mel spectrogram and vector embedding approaches, each with strengths and weaknesses. This study introduces a hybrid pipeline leveraging both. It proposes a lightweight feed-forward neural network, crucial to the pipeline's performance. Unlike GAN-based architectures, this approach requires fewer data samples while achieving comparable results. The system uses a novel network as an encoder, Tacotron2 as a synthesiser, and WaveNet as a vocoder. The encoder captures distinctive vocal characteristics, generating speaker embeddings. Tacotron2 creates mel spectrograms for synthesised speech, and WaveNet produces high-quality audio waveforms resembling natural speech. The system is accessible in low-data scenarios, enabling faster training. Objective evaluations, including MOS, PESQ, and SNR metrics, confirm its superiority. This study presents a lightweight, data-efficient voice cloning system with applications in voice assistants, personalised speech synthesis, and entertainment.
Keywords: voice cloning; encoder; feed-forward neural network; voice embeddings; mel spectrogram.
DOI: 10.1504/IJCISTUDIES.2024.144046
International Journal of Computational Intelligence Studies, 2024 Vol.13 No.1/2, pp.112 - 130
Received: 12 May 2023
Accepted: 30 May 2024
Published online: 22 Jan 2025