FMSeq

February 13, 2025

Official codebase for the paper "Enable Fast Sampling for Seq2Seq Text Diffusion".

Performance (left figure): BLEU scores of different models on the paraphrase task with the QQP dataset. FMSeq outperforms all the other models when using a single sampling step, and with only 10 steps it achieves performance comparable to DiffuSeq with 2000 steps.

Workflow (right figure): FMSeq uses an embedding layer to map the discrete token space into a continuous space. The forward process diffuses the target embedding along a linear path, and the model fits the velocity of the target part, conditioned on the clean source embedding and the noisy target embedding.
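
The linear path and velocity target described above can be illustrated with a small NumPy sketch. This is not the code in this repository: the time convention (noise at t=0, target embedding at t=1), the Euler integrator, and the names `linear_path`, `target_velocity`, and `euler_sample` are assumptions made for illustration, and an oracle velocity stands in for the trained network.

```python
import numpy as np

def linear_path(x0, x1, t):
    """Point on the straight line from x0 (t=0) to x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Time derivative of the linear path; it does not depend on t."""
    return x1 - x0

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with equal Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
target = rng.normal(size=(4, 8))  # stand-in for target token embeddings
noise = rng.normal(size=(4, 8))   # assumed starting point of the path

# Oracle velocity in place of a trained model (assumption). In FMSeq the model
# would also be conditioned on the clean source embedding.
oracle = lambda x, t: target_velocity(noise, target)

one_step = euler_sample(oracle, noise, num_steps=1)
ten_steps = euler_sample(oracle, noise, num_steps=10)
```

Because the velocity along a straight path is constant, an exact velocity reaches the target in a single Euler step, which gives some intuition for why few sampling steps can suffice. A learned velocity is only approximate, so more steps may still help in practice.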

Dataset

Prepare the datasets and put them under the datasets folder, for example datasets/CommonsenseConversation/train.jsonl. We use four datasets in our paper.

| Task | Datasets | Source |
| --- | --- | --- |
| Open-domain Dialogue | CommonsenseConversation | download |
| Question Generation | Quasar-T | download |
| Text Simplification | Wiki-alignment | download |
| Paraphrase | QQP-Official | download |
| Machine Translation | iwslt14-de-en | download |
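
To check a downloaded file before training, a JSONL file can be read line by line, as in the sketch below. The field names `src` and `trg` and the example sentences are assumptions for illustration; inspect your own train.jsonl to confirm the actual keys.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Hypothetical records; the "src"/"trg" keys are assumed, not confirmed here.
examples = [
    {"src": "how can i learn python quickly ?",
     "trg": "what is the fastest way to learn python ?"},
    {"src": "what is flow matching ?",
     "trg": "can you explain flow matching ?"},
]

# Write a tiny file to a temporary directory, then read it back.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "train.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

loaded = read_jsonl(path)
```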

Training

For non-MT tasks (MT: machine translation), run:

cd scripts
# qqp:
bash train_qqp.sh
# other tasks: modify learning_steps, dataset, data_dir, and notes

For MT tasks, run:

cd scripts
bash train_de2en.sh

The trained checkpoints are provided here: link of ckpt

Decoding

cd scripts
bash run_decode.sh
# core parameters: step and td

Evaluation

cd scripts
bash eval.sh
# you can evaluate a single file or multiple files in the same folder (MBR by default)
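
For readers unfamiliar with MBR (minimum Bayes risk) selection, the following sketch shows the general idea: among several generated candidates for the same input, keep the one that agrees most with the others. It is not the logic of eval.sh. The unigram F1 utility and the names `unigram_f1` and `mbr_select` are assumptions chosen to keep the example dependency-free; a BLEU-style utility is a common choice in practice.

```python
from collections import Counter

def unigram_f1(hyp, ref):
    """Token-overlap F1 between two whitespace-tokenized strings."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    overlap = sum((Counter(h) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def mbr_select(candidates, utility=unigram_f1):
    """Return the candidate with the highest total utility against all others."""
    best_idx, best_score = 0, float("-inf")
    for i, cand in enumerate(candidates):
        score = sum(utility(cand, other)
                    for j, other in enumerate(candidates) if j != i)
        if score > best_score:
            best_idx, best_score = i, score
    return candidates[best_idx]

# Hypothetical outputs from several sampling runs for one input.
candidates = [
    "how do i learn python fast",
    "how do i learn python quickly",
    "how can i learn python fast",
    "what is the weather today",
]
best = mbr_select(candidates)
```

The outlier candidate shares no tokens with the rest and scores zero, while the candidate closest to the consensus is selected.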

Citation

Please cite our paper if the paper or code helps you.

@inproceedings{liu2024enable,
  title={Enable Fast Sampling for Seq2Seq Text Diffusion},
  author={Liu, Pan and Tian, Xiaohua and Lin, Zhouhan},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2024},
  pages={8495--8505},
  year={2024}
}

Acknowledgements

This implementation is based on DiffuSeq.