FT-Agent: Autonomous LLM Fine-Tuning
May 6, 2026
This directory contains the RD-Agent LLM fine-tuning scenario used by FT-Agent in the ICML 2026 paper FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents.
FT-Agent automates a benchmark-driven fine-tuning loop:
- inspect the target benchmark and available raw datasets;
- generate data processing code and LLaMA-Factory training configs;
- run fail-fast validation before full training;
- fine-tune the target model;
- evaluate with OpenCompass and use feedback for the next iteration.
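The loop above can be sketched in Python as follows. This is a minimal illustration of the control flow only, not the RD-Agent implementation: `LoopState`, `run_loop`, and the three callables (`propose`, `validate`, `train_and_evaluate`) are hypothetical names introduced here, and the real scenario passes richer feedback between iterations.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LoopState:
    """Feedback carried from one iteration to the next (hypothetical structure)."""
    history: list = field(default_factory=list)
    best_score: Optional[float] = None


def run_loop(
    propose: Callable[[list], dict],
    validate: Callable[[dict], bool],
    train_and_evaluate: Callable[[dict], float],
    loop_n: int,
) -> LoopState:
    """Run up to loop_n iterations, skipping full training when validation fails."""
    state = LoopState()
    for i in range(loop_n):
        # Stand-in for generating data processing code and a training config.
        plan = propose(state.history)
        # Fail-fast check: a rejected plan never reaches the expensive step.
        if not validate(plan):
            state.history.append({"iteration": i, "status": "failed_validation"})
            continue
        # Stand-in for fine-tuning followed by benchmark evaluation.
        score = train_and_evaluate(plan)
        state.history.append({"iteration": i, "status": "ok", "score": score})
        if state.best_score is None or score > state.best_score:
            state.best_score = score
    return state
```

The point of the structure is that a failed validation costs one cheap iteration rather than a full training run, and its record still reaches the next proposal through `history`.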
The implementation is research-oriented. A full run can download large datasets, build training/evaluation environments, call LLM APIs for data processing, and consume GPU hours.
Supported Benchmarks
The scenario currently includes benchmark adapters for the FT-Dojo tasks and several related extensions.
| Domain | Benchmarks | Main raw dataset |
|---|---|---|
| Math | aime24, aime25 | deepscaler |
| Patent | panorama_par4pc, panorama_pi4pc, panorama_noc4pc | panorama |
| Chemistry | chemcotbench_mol_und, chemcotbench_mol_edit, chemcotbench_mol_opt, chemcotbench_reaction | chemcot |
| Table QA | tablebench_data_analysis, tablebench_fact_checking, tablebench_numerical_reasoning, tablebench_visualization | tableinstruct |
| Finance | FinanceIQ_gen | financeiq |
| Biology | bioprobench_gen, bioprobench_ord, bioprobench_err, bioprobench_pqa | bioprobench |
Dataset registration lives in rdagent/scenarios/finetune/datasets/__init__.py. Benchmark adapters live in rdagent/scenarios/finetune/benchmark/data/adaptor.py.
Current dataset behavior: scenario startup prepares the registered dataset resources under FT_FILE_PATH, so the first run can involve large downloads and take a long time. Reusing the same FT_FILE_PATH lets later runs reuse already downloaded assets. The --dataset option only constrains which dataset the agent selects after preparation; it does not change the preparation step itself.
Prerequisites
- Linux.
- Docker available to the current user without sudo, or a compatible conda setup.
- NVIDIA GPU for realistic fine-tuning runs.
- LLM API access for the RD-Agent planner and optional data processing calls.
- Hugging Face access for target models and datasets. Some datasets may require accepting upstream licenses or setting HF_TOKEN.
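A quick way to see which of these prerequisites look satisfied is a small pre-flight script. The sketch below is not part of RD-Agent, and `check_prerequisites` is a hypothetical helper. It only checks that the relevant executables are on `PATH` and that `HF_TOKEN` is set. It does not verify that Docker works without sudo, that a GPU is usable, or that the token is valid.

```python
import os
import platform
import shutil


def check_prerequisites() -> dict:
    """Report which prerequisites appear to be present on this machine."""
    return {
        "linux": platform.system() == "Linux",
        # Presence of the CLI only; permissions are not tested here.
        "docker_cli": shutil.which("docker") is not None,
        "conda_cli": shutil.which("conda") is not None,
        # nvidia-smi on PATH is a rough proxy for an installed NVIDIA driver.
        "nvidia_smi": shutil.which("nvidia-smi") is not None,
        "hf_token": bool(os.environ.get("HF_TOKEN")),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```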
Install RD-Agent from source when using this scenario:
git clone https://github.com/microsoft/RD-Agent
cd RD-Agent
make dev
Minimal .env
Create .env in the repository root. Adjust model names and API settings to your provider.
# LLM backend
BACKEND=rdagent.oai.backend.LiteLLMAPIBackend
CHAT_MODEL=gpt-4o
CHAT_TEMPERATURE=1
CHAT_STREAM=True
OPENAI_API_KEY=<your_api_key>
# OPENAI_API_BASE=<your_api_base_if_needed>
# Embedding model used by RD-Agent infrastructure
EMBEDDING_MODEL=text-embedding-3-small
# Fine-tuning workspace. Keep this stable to reuse downloaded models/datasets.
FT_FILE_PATH=/absolute/path/to/finetune_files
# Runtime environment: docker is the default path; conda is available for local setups.
FT_Coder_CoSTEER_env_type=docker
# Target task. You may also pass these through CLI arguments.
FT_TARGET_BENCHMARK=aime25
FT_BENCHMARK_DESCRIPTION="AIME 2025 math competition problems. Each answer is an integer from 0 to 999. Expected Output Format: put the final answer within \\boxed{}, for example \\boxed{42}."
# Target model and data-processing settings
FT_BASE_MODEL=Qwen/Qwen2.5-7B-Instruct
FT_UPPER_DATA_SIZE_LIMIT=2000
FT_API_MAX_WORKERS=8
FT_STRONG_MODELS='["gpt-4o"]'
FT_WEAK_MODELS='["gpt-4o-mini"]'
# Hugging Face token, if required by a model or dataset.
# HF_TOKEN=<your_hf_token>
FT_API_MAX_WORKERS defaults to 8 to avoid surprising rate-limit and cost spikes for public users. If you have a high-throughput internal endpoint, increase it explicitly in .env.
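To illustrate what a worker cap of this kind typically controls, here is a minimal sketch that bounds concurrent API calls with a thread pool. How RD-Agent consumes `FT_API_MAX_WORKERS` internally is not shown in this document; `max_workers`, `process_all`, and the `call_api` parameter are hypothetical names used only for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def max_workers(default: int = 8) -> int:
    """Read FT_API_MAX_WORKERS, falling back to the default on bad input."""
    raw = os.environ.get("FT_API_MAX_WORKERS", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def process_all(items, call_api):
    """Apply call_api to every item with at most max_workers() calls in flight."""
    with ThreadPoolExecutor(max_workers=max_workers()) as pool:
        # pool.map returns results in input order, regardless of finish order.
        return list(pool.map(call_api, items))
```

With a cap of 8, at most eight requests are outstanding at once, which limits both the request rate seen by the provider and how quickly cost can accumulate.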
Single-Task Run
Using the CLI entry point:
rdagent llm_finetune \
--benchmark aime25 \
--benchmark-description "AIME 2025 math competition problems. Each answer is an integer from 0 to 999. Expected Output Format: put the final answer within \\boxed{}, for example \\boxed{42}." \
--base-model Qwen/Qwen2.5-7B-Instruct \
--loop-n 3 \
--timeout 12h
Equivalent direct Python entry point:
dotenv run -- python rdagent/app/finetune/llm/loop.py \
--benchmark aime25 \
--benchmark-description "AIME 2025 math competition problems. Each answer is an integer from 0 to 999. Expected Output Format: put the final answer within \\boxed{}, for example \\boxed{42}." \
--base-model Qwen/Qwen2.5-7B-Instruct \
--loop-n 3 \
--timeout 12h
Useful arguments:
| Argument | Meaning |
|---|---|
| --base-model | Hugging Face model id to fine-tune. Required unless set by FT_BASE_MODEL. |
| --benchmark | Target benchmark key, such as aime25 or chemcotbench_mol_edit. |
| --benchmark-description | Natural-language task and output-format description. Required unless set in .env. |
| --dataset | Dataset name to select for the agent after dataset preparation. |
| --upper-data-size-limit | Maximum number of training examples used by one experiment. |
| --loop-n | Maximum number of RD-Agent loops. |
| --timeout | Overall wall-clock budget, such as 12h. |
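Several of these arguments are described as required unless the matching variable is set in .env. The sketch below shows one way such a fallback can work. It assumes that a CLI value takes precedence over the environment, which this document does not state explicitly, and `resolve_setting` is a hypothetical helper rather than an RD-Agent function.

```python
import os
from typing import Optional


def resolve_setting(cli_value: Optional[str], env_name: str) -> str:
    """Return the CLI value if given, else the environment variable, else raise."""
    if cli_value:
        return cli_value
    env_value = os.environ.get(env_name)
    if env_value:
        return env_value
    raise ValueError(f"Set {env_name} in .env or pass the matching CLI argument")
```

For example, `resolve_setting(None, "FT_BASE_MODEL")` returns the value from the environment when `--base-model` is omitted, and raises when neither source provides one.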
Batch Runs
For multiple benchmark/model runs, use the job helper in this directory:
cp rdagent/app/finetune/llm/job/tasks.json.example rdagent/app/finetune/llm/job/tasks.json
cp .env rdagent/app/finetune/llm/job/.env
bash rdagent/app/finetune/llm/job/run_ft_job.sh rdagent/app/finetune/llm/job/tasks.json
The job runner reads benchmark descriptions from rdagent/app/finetune/llm/job/scenarios.json when a task does not provide benchmark_description directly.
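The fallback described above can be sketched as a small lookup. The actual schemas of tasks.json and scenarios.json are not shown here, so this example assumes that each task is an object with a `benchmark` key and an optional `benchmark_description` key, and that scenarios.json maps benchmark keys to description strings. `resolve_description` is a hypothetical name.

```python
from typing import Mapping


def resolve_description(task: Mapping[str, str], scenarios: Mapping[str, str]) -> str:
    """Prefer the task's own description, else look it up by benchmark key."""
    description = task.get("benchmark_description")
    if description:
        return description
    benchmark = task["benchmark"]
    if benchmark not in scenarios:
        raise KeyError(f"No description found for benchmark {benchmark!r}")
    return scenarios[benchmark]
```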
Logs and UI
Runs write RD-Agent traces under the configured log directory. For the Streamlit FT UI:
streamlit run rdagent/app/finetune/llm/ui/app.py
For the generic RD-Agent log UI:
rdagent ui --port 19899 --log-dir <your_log_folder>
Notes
- The first run is expected to be slow because it may download models, datasets, LLaMA-Factory assets, and OpenCompass assets.
- Docker mode mounts models and datasets from FT_FILE_PATH into the training/evaluation environments.
- Generated training data must use Alpaca-style instruction, input, and output fields; the validator checks this before full training.
- Evaluation uses validation feedback for agent iteration and keeps the test split for final reporting/front-end display.
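As an illustration of the Alpaca-style field check mentioned above, the following sketch flags records that lack the three required fields. It is not the validator shipped with the scenario, which may apply additional or different checks. `find_invalid_records` is a hypothetical helper, and requiring string values is an assumption made for this example.

```python
REQUIRED_FIELDS = ("instruction", "input", "output")


def find_invalid_records(records):
    """Return (index, reason) pairs for records that are not Alpaca-style."""
    problems = []
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append((i, "record is not an object"))
            continue
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            problems.append((i, "missing fields: " + ", ".join(missing)))
        elif not all(isinstance(record[name], str) for name in REQUIRED_FIELDS):
            # Assumption for this sketch: all three fields hold strings.
            problems.append((i, "fields must be strings"))
    return problems
```

Running a check like this on a small sample before training is cheap, which is why it fits the fail-fast step rather than the full training run.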