Sequence-to-Sequence Text Generation with Discrete Diffusion Models
| Main Author: | |
|---|---|
| Format: | Article |
| Language: | Chinese (zho) |
| Published: | Journal of Computer Engineering and Applications Beijing Co., Ltd., Science Press, 2025-03-01 |
| Series: | Jisuanji kexue yu tansuo |
| Subjects: | |
| Online Access: | http://fcst.ceaj.org/fileup/1673-9418/PDF/2405063.pdf |
| Summary: | Diffusion language models are currently the most promising non-autoregressive language models and are expected to replace autoregressive language models, which suffer from slow inference, to achieve efficient text generation without sacrificing quality. Sequence-to-sequence (Seq2Seq) text generation covers common application scenarios of diffusion language models, including text summarization, machine translation, and dialogue generation. Achieving high-quality Seq2Seq text generation with low latency remains a persistent challenge in natural language processing. To this end, this paper simplifies the training process of the discrete diffusion model by deriving an upper bound on the training objective, and then adapts the mask-and-predict decoding strategy of the conditional masked language model as the inference algorithm of the diffusion model. To further improve the quality of text generated in the first few rounds of inference, this paper also proposes a sinusoidal noise schedule. Compared with the original linear noise schedule, the high-noise interval over the time steps becomes larger, so the model focuses more on learning to recover data from the heavily noised inputs it commonly encounters in the first few rounds of inference. Inspired by curriculum learning, this paper further designs a new sampling distribution for time steps to realize an easy-to-hard learning strategy. Experiments on public datasets show that the proposed method effectively improves model performance. On the WMT16 EN-RO dataset, the diffusion model achieves generation quality comparable to the autoregressive baseline in only half the inference time. |
| ISSN: | 1673-9418 |
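
To make the two schedule ideas in the abstract concrete, below is a minimal Python sketch of a linear versus a sinusoidal mask-rate schedule and an easy-to-hard time-step sampler. The function names, the exact sine form `sin(pi*t/(2T))`, and the interpolated curriculum weights are illustrative assumptions, not the paper's actual definitions.

```python
import numpy as np

def linear_mask_rate(t, T):
    """Linear noise schedule: the fraction of masked tokens grows linearly with t."""
    return t / T

def sinusoidal_mask_rate(t, T):
    """Sinusoidal noise schedule (assumed form sin(pi*t/(2T))).

    The mask rate rises quickly and then saturates near 1, so a larger share of
    the time steps lies in the high-noise region that the model must handle in
    its first few inference rounds.
    """
    return np.sin(np.pi * t / (2 * T))

def curriculum_timestep_sampler(step, total_steps, T, rng):
    """Easy-to-hard sampling of training time steps (assumed form).

    Early in training, bias sampling toward small t (lightly noised, easy);
    as training progresses, shift probability mass toward large t (heavily
    noised, hard).
    """
    progress = step / total_steps                 # 0 -> 1 over training
    t = np.arange(1, T + 1)
    easy = (T + 1 - t).astype(float)              # weights favoring small t
    hard = t.astype(float)                        # weights favoring large t
    weights = (1 - progress) * easy + progress * hard
    return rng.choice(t, p=weights / weights.sum())

T = 1000
rng = np.random.default_rng(0)
print(linear_mask_rate(250, T), sinusoidal_mask_rate(250, T))   # 0.25 vs ~0.38
print(curriculum_timestep_sampler(step=0, total_steps=100_000, T=T, rng=rng))
```

Under this assumed form, a quarter of the way through the time steps the sinusoidal schedule already masks roughly 38% of tokens versus 25% for the linear schedule, which illustrates the sense in which its high-noise interval is larger.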