Automatic Timbre Transformation Using Enhanced Diffusion Model

Bibliographic Details
Main Authors: Cheng-Han Wu, Pimpa Cheewaprakobkit, Timothy K. Shih, Yu-Cheng Lin, Bing-Ze Liu
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11004054/
Description
Summary: We present a novel timbre transfer model that uses an enhanced diffusion architecture to convert music played on various instruments into Erhu timbre. The Erhu, a traditional Chinese instrument, is difficult to simulate because of its rich vibrato and smooth note transitions, and existing Musical Instrument Digital Interface (MIDI) systems struggle to capture its nuanced dynamics. Our model integrates a Pitch Encoder, a Loudness Encoder, and a Diffusion Decoder: the encoders extract pitch features and dynamic loudness variations, which guide the decoder in generating realistic Erhu timbre. Because the extracted features are general musical properties, the system generalizes to unseen input types without retraining. Evaluations based on pitch accuracy, cosine similarity, and Fréchet Audio Distance show that the model achieves 96% pitch accuracy and high fidelity in Erhu timbre reproduction. This study demonstrates the potential of diffusion-based timbre transfer models and suggests directions for future work on both music generation and timbre transfer.
ISSN: 2169-3536
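
The summary names pitch accuracy and cosine similarity among its evaluation measures but does not define them in this record. The following is a minimal Python sketch of how such metrics are commonly computed, not the authors' evaluation code. The function names, the frame-wise comparison, the treatment of 0 Hz as an unvoiced frame, and the 50-cent tolerance are all assumptions made for illustration.

```python
import math


def cents_difference(f_ref_hz, f_est_hz):
    """Signed pitch difference in cents between two frequencies in Hz."""
    return 1200.0 * math.log2(f_est_hz / f_ref_hz)


def pitch_accuracy(ref_hz, est_hz, tolerance_cents=50.0):
    """Fraction of voiced reference frames matched within the tolerance.

    Assumptions (not from the paper): a reference of 0 Hz marks an
    unvoiced frame and is skipped; the default tolerance is 50 cents.
    """
    voiced = [(r, e) for r, e in zip(ref_hz, est_hz) if r > 0]
    if not voiced:
        return 0.0
    hits = sum(
        1
        for r, e in voiced
        if e > 0 and abs(cents_difference(r, e)) <= tolerance_cents
    )
    return hits / len(voiced)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for a zero vector; report 0
    return dot / (norm_a * norm_b)
```

In this sketch, `ref_hz` would hold the frame-wise fundamental frequency of the source performance and `est_hz` that of the generated audio, while `cosine_similarity` would be applied to whatever timbre feature vectors the evaluation uses.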