August 2, 2025
Text2midi-InferAlign
Improving Symbolic Music Generation with Inference-Time Alignment
Text2midi-InferAlign is an inference-time technique that enhances symbolic music generation by improving alignment between generated compositions and textual prompts. It is designed to extend autoregressive models, such as Text2Midi, without requiring any additional training or fine-tuning.
Our method introduces two lightweight but effective alignment-based objectives into the generation process:
- Text-Audio Consistency: Encourages the temporal structure of the music to reflect the rhythm and pacing implied by the input caption.
- Harmonic Consistency: Penalizes musically inconsistent notes (e.g., out-of-key or dissonant phrases), promoting tonal coherence.
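As a rough illustration of the harmonic-consistency idea, the sketch below scores a sequence of MIDI pitches by the fraction that fall inside a given major key. The function name `in_key_ratio`, the restriction to major scales, and the use of this ratio as a reward are assumptions made for illustration; the repository's actual reward may be defined differently.

```python
# Hypothetical sketch: share of notes that belong to a major key.
# This is not the repository's reward function.

MAJOR_SCALE_STEPS = (0, 2, 4, 5, 7, 9, 11)  # semitone offsets from the tonic


def in_key_ratio(midi_pitches, tonic_pitch_class):
    """Return the share of pitches inside the major key on `tonic_pitch_class`.

    midi_pitches: iterable of MIDI note numbers (0-127).
    tonic_pitch_class: 0 for C, 1 for C#, ..., 11 for B.
    """
    pitches = list(midi_pitches)
    if not pitches:
        return 1.0  # an empty phrase has nothing out of key
    scale = {(tonic_pitch_class + step) % 12 for step in MAJOR_SCALE_STEPS}
    in_key = sum(1 for p in pitches if p % 12 in scale)
    return in_key / len(pitches)


# A C major phrase with one out-of-key note (MIDI 66, F#4).
phrase = [60, 62, 64, 66, 67]
score = in_key_ratio(phrase, tonic_pitch_class=0)
```

A score near 1.0 means almost every note is in key, so a decoder could prefer candidates with higher scores.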
By incorporating these alignment signals into the decoding loop, Text2midi-InferAlign produces music that is not only more faithful to textual descriptions but also harmonically robust.
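One way to picture alignment signals inside a decoding loop is chunked, reward-guided search: several candidate sequences are extended by a fixed number of tokens, scored, and the best ones kept. The sketch below is a toy version under stated assumptions. `extend` and `reward` are hypothetical stand-ins for the model's sampling step and the combined alignment rewards, and the pruning rule (keep the better half, then refill by copying survivors) is also an assumption rather than the logic in `progressive_explorer.py`. The parameter names mirror the optional arguments listed under usage.

```python
import random


def guided_generate(extend, reward, beams=4, batch_size=8, max_tokens=32, seed=0):
    """Toy reward-guided search (illustrative only).

    extend(seq, n, rng) -> new sequence with n more tokens (hypothetical sampler).
    reward(seq) -> float, higher is better (hypothetical alignment score).
    """
    rng = random.Random(seed)
    candidates = [[] for _ in range(beams)]
    while len(candidates[0]) < max_tokens:
        # Extend every candidate by one chunk of tokens.
        step = min(batch_size, max_tokens - len(candidates[0]))
        candidates = [extend(seq, step, rng) for seq in candidates]
        # Rank by reward and keep the better half.
        candidates.sort(key=reward, reverse=True)
        survivors = candidates[: max(1, beams // 2)]
        # Refill the pool by copying survivors so the beam count stays constant.
        candidates = [list(survivors[i % len(survivors)]) for i in range(beams)]
    return max(candidates, key=reward)


# Toy stand-ins: tokens are pitch classes, reward favours C major pitch classes.
C_MAJOR = {0, 2, 4, 5, 7, 9, 11}


def toy_extend(seq, n, rng):
    return seq + [rng.randrange(12) for _ in range(n)]


def toy_reward(seq):
    return sum(t in C_MAJOR for t in seq) / len(seq) if seq else 0.0


best = guided_generate(toy_extend, toy_reward)
```

Checking rewards only every `batch_size` tokens keeps the scoring cost low compared with scoring after each token, at the price of coarser steering.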
We evaluate our technique on Text2Midi, a state-of-the-art text-to-MIDI generation model, and report improvements in both objective metrics and human evaluations.
Installation & Usage
This repository contains the implementation of the Inference-Time Alignment module. Follow the steps below to get started.
1. Clone the Repository
git clone https://github.com/AMAAI-Lab/t2m-inferalign.git
cd t2m-inferalign
2. Set Up the Environment
We recommend using Python 3.10 and conda for environment management.
conda create -n alignment python=3.10
conda activate alignment
pip install -r requirements.txt
Export your Anthropic API key:
export ANTHROPIC_API_KEY=<your key>
Alternatively, you can set your key here.
3. Download Model Weights and Resources
- Download the pretrained Text2Midi model from HuggingFace:
  https://huggingface.co/amaai-lab/text2midi
- Also download the corresponding tokenizer and soundfonts:
  https://huggingface.co/amaai-lab/text2midi/tree/main/
You may choose to organize them like this:
t2m-inferalign/
├── checkpoints/
│   └── pytorch_model.bin
├── tokenizer/
│   └── vocab_remi.pkl
└── soundfonts/
    └── soundfont.sf2
Please fix the soundfont path here or here.
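Before running inference, it can help to confirm that the downloaded files are where the commands expect them. The following is a minimal sketch that assumes the suggested layout above; the helper name `missing_resources` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Relative paths taken from the suggested layout; adjust if yours differs.
EXPECTED_FILES = (
    "checkpoints/pytorch_model.bin",
    "tokenizer/vocab_remi.pkl",
    "soundfonts/soundfont.sf2",
)


def missing_resources(repo_root="."):
    """Return the expected files that are not present under `repo_root`."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


missing = missing_resources()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected files found.")
```

Run it from the repository root; an empty result means all three files were found.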
4. Run Inference with Alignment
python progressive_explorer.py --caption "A gentle piano lullaby with soft melodies" --model_path checkpoints/pytorch_model.bin --tokenizer_path tokenizer/vocab_remi.pkl --output_path outputs/lullaby.mid
Optional arguments:
- --max_tokens: Max number of tokens in the generated sequence.
- --batch_size: Number of tokens to generate before checking rewards.
- --beams: Number of parallel sequences to generate.
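When generating many captions from a script, the command line can be assembled programmatically. The sketch below builds the argument list for `progressive_explorer.py` from the flags documented above. The helper `build_command` is hypothetical, and the option values shown are illustrative, not documented defaults.

```python
import shlex


def build_command(caption, model_path, tokenizer_path, output_path, **optional):
    """Assemble the argument list for progressive_explorer.py.

    `optional` may contain max_tokens, batch_size, or beams; unset options are
    left out so the script's own defaults apply.
    """
    allowed = ("max_tokens", "batch_size", "beams")
    unknown = set(optional) - set(allowed)
    if unknown:
        raise ValueError(f"Unsupported options: {sorted(unknown)}")
    cmd = [
        "python", "progressive_explorer.py",
        "--caption", caption,
        "--model_path", model_path,
        "--tokenizer_path", tokenizer_path,
        "--output_path", output_path,
    ]
    for name in allowed:
        if name in optional:
            cmd += [f"--{name}", str(optional[name])]
    return cmd


cmd = build_command(
    "A gentle piano lullaby with soft melodies",
    "checkpoints/pytorch_model.bin",
    "tokenizer/vocab_remi.pkl",
    "outputs/lullaby.mid",
    beams=4,         # illustrative value, not a documented default
    max_tokens=512,  # illustrative value, not a documented default
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.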
Experimental Results
Objective Evaluation
We evaluate on the MidiCaps dataset using six metrics covering compression ratio, text-audio consistency, tempo, and key. Text2midi-InferAlign improves on the Text2Midi baseline on all six.
| Metric | Text2Midi | Text2midi-InferAlign |
|---|---|---|
| CR (Compression Ratio) ↑ | 2.16 | 2.31 |
| CLAP (Text-Audio Consistency) ↑ | 0.17 | 0.22 |
| TB (Tempo Bin %) ↑ | 29.73 | 35.41 |
| TBT (Tempo Bin w/ Tolerance %) ↑ | 60.06 | 62.59 |
| CK (Correct Key %) ↑ | 13.59 | 29.80 |
| CKD (Correct Key w/ Duplicates %) ↑ | 16.66 | 32.54 |
All results are averaged over the MidiCaps test set.
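To compare the size of the improvements across metrics with different scales, the short sketch below computes absolute and relative gains from the values in the table. The helper `gains` is introduced here for illustration only. For TB, TBT, CK, and CKD the absolute gain is in percentage points.

```python
# Scores copied from the objective evaluation table: (Text2Midi, Text2midi-InferAlign).
RESULTS = {
    "CR": (2.16, 2.31),
    "CLAP": (0.17, 0.22),
    "TB": (29.73, 35.41),
    "TBT": (60.06, 62.59),
    "CK": (13.59, 29.80),
    "CKD": (16.66, 32.54),
}


def gains(baseline, aligned):
    """Return (absolute gain, relative gain in percent)."""
    absolute = aligned - baseline
    return absolute, 100.0 * absolute / baseline


for metric, (baseline, aligned) in RESULTS.items():
    absolute, relative = gains(baseline, aligned)
    print(f"{metric:>4}: {absolute:+.2f} absolute, {relative:+.1f}% relative")
```

The key metrics (CK and CKD) show the largest gains in both absolute and relative terms.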
Subjective Evaluation
A user study was conducted with 24 participants, comparing outputs from Text2Midi and Text2midi-InferAlign. Participants rated musical quality and text-audio alignment.
Music Quality & Text-Audio Match
| Evaluation Criteria | Text2Midi (%) | Text2midi-InferAlign (%) |
|---|---|---|
| Music Quality | 31.25 | 68.75 |
| Text-Audio Match | 41.67 | 58.33 |
Caption Type Preference
| Caption Type | Text2Midi (%) | Text2midi-InferAlign (%) |
|---|---|---|
| MidiCaps Caption | 48.33 | 51.67 |
| Free Text Caption | 27.78 | 72.22 |
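These results indicate that participants preferred Text2midi-InferAlign for both music quality (68.75% vs. 31.25%) and text-audio match (58.33% vs. 41.67%). The preference is strongest for free-text captions (72.22% vs. 27.78%) and close to even for MidiCaps captions (51.67% vs. 48.33%).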
Citation
If you find this work useful in your research, please cite:
@article{text2midi-inferalign,
title={Text2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment},
author={Abhinaba Roy and Geeta Puri and Dorien Herremans},
year={2025},
journal={arXiv:2505.12669}
}
Resources
- Examples
- Text2Midi (Base Model)
- Text2Midi on HuggingFace