
August 2, 2025

🎼 Text2midi-InferAlign

Improving Symbolic Music Generation with Inference-Time Alignment



Text2midi-InferAlign is an inference-time technique that enhances symbolic music generation by improving alignment between generated compositions and textual prompts. It is designed to extend autoregressive models, such as Text2Midi, without requiring any additional training or fine-tuning.

Our method introduces two lightweight but effective alignment-based objectives into the generation process:

  • 🎵 Text-Audio Consistency: Encourages the temporal structure of the music to reflect the rhythm and pacing implied by the input caption.
  • 🎵 Harmonic Consistency: Penalizes musically inconsistent notes (e.g., out-of-key or dissonant phrases), promoting tonal coherence.
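
As a rough illustration of the harmonic-consistency idea, the sketch below scores a phrase by the fraction of notes that fall outside a given major key. This is not the repository's implementation: the function name `out_of_key_ratio`, the major-scale-only key model, and the use of raw MIDI pitch numbers are all assumptions made for the example.

```python
# Illustrative sketch only; not the repository's implementation.
MAJOR_SCALE_STEPS = (0, 2, 4, 5, 7, 9, 11)  # semitone offsets of a major scale


def out_of_key_ratio(pitches, tonic_pitch_class):
    """Return the fraction of MIDI pitches that fall outside a major key.

    pitches: iterable of MIDI note numbers (0-127).
    tonic_pitch_class: 0 for C, 1 for C#, ..., 11 for B.
    """
    pitches = list(pitches)
    if not pitches:
        return 0.0
    in_key = {(tonic_pitch_class + step) % 12 for step in MAJOR_SCALE_STEPS}
    outside = sum(1 for p in pitches if p % 12 not in in_key)
    return outside / len(pitches)


# C4, E4, G4 belong to C major; F#4 (MIDI 66) does not.
phrase = [60, 64, 67, 66]
penalty = out_of_key_ratio(phrase, tonic_pitch_class=0)
```

A higher ratio means more out-of-key notes, so a reward could be defined as one minus this value.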

By incorporating these alignment signals into the decoding loop, Text2midi-InferAlign produces music that is not only more faithful to textual descriptions but also harmonically robust.
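
The sketch below shows one way alignment signals could steer decoding: sample several candidate continuations, score each with a reward, and continue from the best one. It is a toy under stated assumptions, not the code in this repository. `guided_decode`, `extend`, and `reward` are hypothetical names, and the toy reward simply favors C-major pitch classes.

```python
import random


def guided_decode(extend, reward, max_tokens, chunk_size, num_beams):
    """Grow a sequence chunk by chunk, keeping the best-scoring candidate.

    extend(prefix, n) -> list: hypothetical sampler returning prefix plus n new tokens.
    reward(tokens) -> float: hypothetical alignment score (higher is better).
    """
    best = []
    while len(best) < max_tokens:
        n = min(chunk_size, max_tokens - len(best))
        # Sample several continuations of the current best prefix.
        candidates = [extend(best, n) for _ in range(num_beams)]
        # Keep only the candidate the reward prefers.
        best = max(candidates, key=reward)
    return best


# Toy stand-ins for a model and a reward (assumptions for illustration).
rng = random.Random(0)
C_MAJOR = {0, 2, 4, 5, 7, 9, 11}


def toy_extend(prefix, n):
    # Pretend each token is a pitch class sampled uniformly at random.
    return prefix + [rng.randrange(12) for _ in range(n)]


def toy_reward(tokens):
    # Fraction of tokens that are C-major pitch classes.
    return sum(t in C_MAJOR for t in tokens) / len(tokens)


result = guided_decode(toy_extend, toy_reward, max_tokens=16, chunk_size=4, num_beams=4)
```

Scoring after every chunk rather than every token keeps the number of reward evaluations small, at the cost of coarser steering.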

We evaluate our technique on Text2Midi, a state-of-the-art text-to-MIDI generation model, and report improvements in both objective metrics and human evaluations.


📦 Installation & Usage

This repository contains the implementation of the Inference-Time Alignment module. Follow the steps below to get started.

1. Clone the Repository

git clone https://github.com/AMAAI-Lab/t2m-inferalign.git
cd t2m-inferalign

2. Set Up the Environment

We recommend using Python 3.10 and conda for environment management.

conda create -n alignment python=3.10
conda activate alignment
pip install -r requirements.txt

Export your Anthropic API key:

export ANTHROPIC_API_KEY=<your key>

Alternatively, you can set your key here.
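
If you want a script to fail early when the key is missing, a small check like the one below can help. `get_api_key` is a hypothetical helper written for this example; it is not part of the repository.

```python
import os


def get_api_key(env=None, name="ANTHROPIC_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    # `env` defaults to the real process environment; pass a dict to test.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running inference.")
    return key
```

Calling `get_api_key()` once at startup surfaces a missing key before any model weights are loaded.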

3. Download Model Weights and Resources

You may choose to organize them like this:

t2m-inferalign/
├── checkpoints/
│   └── pytorch_model.bin
├── tokenizer/
│   └── vocab_remi.pkl
├── soundfonts/
│   └── soundfont.sf2

Please fix the soundfont path here or here.
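
To confirm the resources are where you expect them, you could run a quick check such as the sketch below. It assumes the suggested layout shown above; `missing_resources` and `EXPECTED_FILES` are names made up for this example.

```python
from pathlib import Path

# Relative paths taken from the suggested layout; adjust them if you
# organize the files differently.
EXPECTED_FILES = (
    "checkpoints/pytorch_model.bin",
    "tokenizer/vocab_remi.pkl",
    "soundfonts/soundfont.sf2",
)


def missing_resources(root="."):
    """Return the expected resource files that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Run `missing_resources()` from the repository root; an empty list means all three files are in place.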

4. Run Inference with Alignment

python progressive_explorer.py --caption "A gentle piano lullaby with soft melodies" --model_path checkpoints/pytorch_model.bin --tokenizer_path tokenizer/vocab_remi.pkl --output_path outputs/lullaby.mid

Optional arguments:

  • --max_tokens: Max number of tokens in the generated sequence.
  • --batch_size: Number of tokens to generate before checking rewards.
  • --beams: Number of parallel sequences to generate.
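
When generating several pieces, it can be convenient to build the command programmatically. The sketch below only assembles the argument list from the flags listed above; `build_command` is a hypothetical helper, and the assumption that the optional flags take integer values is mine, not confirmed by the README.

```python
import shlex


def build_command(caption, model_path, tokenizer_path, output_path,
                  max_tokens=None, batch_size=None, beams=None):
    """Build the argv list for progressive_explorer.py.

    Optional flags are added only when a value is given, so the script's
    own defaults apply otherwise.
    """
    cmd = [
        "python", "progressive_explorer.py",
        "--caption", caption,
        "--model_path", model_path,
        "--tokenizer_path", tokenizer_path,
        "--output_path", output_path,
    ]
    optional = (("--max_tokens", max_tokens),
                ("--batch_size", batch_size),
                ("--beams", beams))
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_command(
    "A gentle piano lullaby with soft melodies",
    "checkpoints/pytorch_model.bin",
    "tokenizer/vocab_remi.pkl",
    "outputs/lullaby.mid",
    beams=4,  # illustrative value, not a recommended setting
)
print(shlex.join(cmd))  # shell-safe string you can paste into a terminal
```

The list form can also be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell-quoting problems with captions that contain quotes.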

📊 Experimental Results

✅ Objective Evaluation

We evaluate on the MidiCaps dataset using six standard metrics. Our approach outperforms the Text2Midi baseline on all six, with the largest gains in key correctness (CK and CKD).

| Metric | Text2Midi | Text2midi-InferAlign |
|---|---|---|
| CR (Compression Ratio) ↑ | 2.16 | 2.31 |
| CLAP (Text-Audio Consistency) ↑ | 0.17 | 0.22 |
| TB (Tempo Bin %) ↑ | 29.73 | 35.41 |
| TBT (Tempo Bin w/ Tolerance %) ↑ | 60.06 | 62.59 |
| CK (Correct Key %) ↑ | 13.59 | 29.80 |
| CKD (Correct Key w/ Duplicates %) ↑ | 16.66 | 32.54 |

All results are averaged over the MidiCaps test set.
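
To read these numbers as gains over the baseline, the short script below computes the absolute and relative improvement for each metric. The scores are copied from the table; the helper name `gains` is just for this example.

```python
# (Text2Midi, Text2midi-InferAlign) scores copied from the objective table.
SCORES = {
    "CR": (2.16, 2.31),
    "CLAP": (0.17, 0.22),
    "TB": (29.73, 35.41),
    "TBT": (60.06, 62.59),
    "CK": (13.59, 29.80),
    "CKD": (16.66, 32.54),
}


def gains(baseline, ours):
    """Return (absolute gain, relative gain in percent) over the baseline."""
    absolute = ours - baseline
    return absolute, 100.0 * absolute / baseline


for name, (baseline, ours) in SCORES.items():
    absolute, relative = gains(baseline, ours)
    print(f"{name:<5}{absolute:+7.2f}{relative:+9.1f}%")
```

For example, CK rises by 16.21 percentage points, which more than doubles the baseline score.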


🎧 Subjective Evaluation

A user study was conducted with 24 participants, comparing outputs from Text2Midi and Text2midi-InferAlign. Participants rated musical quality and text-audio alignment.

Music Quality & Text-Audio Match

| Evaluation Criteria | Text2Midi (%) | Text2midi-InferAlign (%) |
|---|---|---|
| Music Quality | 31.25 | 68.75 |
| Text-Audio Match | 41.67 | 58.33 |

Caption Type Preference

| Caption Type | Text2Midi (%) | Text2midi-InferAlign (%) |
|---|---|---|
| MidiCaps Caption | 48.33 | 51.67 |
| Free Text Caption | 27.78 | 72.22 |

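In this study, participants preferred Text2midi-InferAlign for both music quality (68.75% vs. 31.25%) and text-audio match (58.33% vs. 41.67%). The preference was strongest for free-text captions (72.22% vs. 27.78%) and marginal for MidiCaps captions (51.67% vs. 48.33%).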


📌 Citation

If you find this work useful in your research, please cite:

@article{text2midi-inferalign,
  title={Text2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment},
  author={Abhinaba Roy and Geeta Puri and Dorien Herremans},
  year={2025},
  journal={arXiv:2505.12669}
}

🔗 Resources