MedicalGPT: Training Medical GPT Model

April 20, 2026 · View on GitHub


📖 Introduction

MedicalGPT trains a medical large language model using a ChatGPT-style training pipeline. It implements incremental (continued) pretraining, supervised fine-tuning, RLHF (reward modeling followed by reinforcement learning), and DPO (Direct Preference Optimization).


Training MedicalGPT model:

  • Stage 1: PT (Continue PreTraining). Continue pre-training the LLaMA model on large volumes of domain documents to inject domain knowledge.
  • Stage 2: SFT (Supervised Fine-tuning). Build an instruction fine-tuning dataset and fine-tune the pre-trained model on it so that it follows the intent of instructions.
  • Stage 3: RM (Reward Model). Build a human preference ranking dataset and train a reward model that reflects human preferences, mainly the "HHH" principle: "helpful, honest, harmless".
  • Stage 4: RL (Reinforcement Learning). Reinforcement learning from human feedback (RLHF): the reward model is used to train the SFT model, which updates its policy according to rewards or penalties so that it generates higher-quality text that better matches human preferences.
  • Stage 5: Agent fine-tuning. Agent function-call formatting is supported through the --tool_format argument during the SFT stage, for models such as Qwen, Mistral, LLaMA3, GLM4 and more.

▶️ Demo

  • Hugging Face Demo: in progress

We provide a simple Gradio-based interactive web interface. Once the service is started, open it in a browser, enter a question, and the model returns an answer. Start it with:

python demo/gradio_demo.py --base_model path_to_llama_hf_dir --lora_model path_to_lora_dir

Parameter Description:

  • --base_model {base_model}: directory containing the LLaMA model weights and configuration files in HF format, or a model name on the HF Model Hub
  • --lora_model {lora_model}: directory containing the LoRA files, or a model name on the HF Model Hub. If the LoRA weights have already been merged into the pre-trained model, omit this parameter
  • --tokenizer_path {tokenizer_path}: directory containing the tokenizer. If omitted, it defaults to the value of --lora_model; if --lora_model is also omitted, it defaults to the value of --base_model
  • --use_cpu: use only the CPU for inference
  • --gpus {gpu_ids}: IDs of the GPU devices to use; the default is 0. To use multiple GPUs, separate the IDs with commas, such as 0,1,2
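The fallback order for --tokenizer_path and the comma-separated --gpus value can be sketched in a few lines of Python. This is only an illustration of the rules described above; the helper names resolve_tokenizer_path and parse_gpu_ids are hypothetical and are not taken from demo/gradio_demo.py.

```python
from typing import List, Optional


def resolve_tokenizer_path(base_model: str,
                           lora_model: Optional[str] = None,
                           tokenizer_path: Optional[str] = None) -> str:
    """Apply the documented fallback: tokenizer_path, then lora_model, then base_model."""
    if tokenizer_path:
        return tokenizer_path
    if lora_model:
        return lora_model
    return base_model


def parse_gpu_ids(gpus: str = "0") -> List[int]:
    """Turn a comma-separated --gpus value such as "0,1,2" into a list of device IDs."""
    return [int(part) for part in gpus.split(",") if part.strip()]


# No --tokenizer_path given, so the LoRA directory is used.
print(resolve_tokenizer_path("path_to_llama_hf_dir", "path_to_lora_dir"))
print(parse_gpu_ids("0,1,2"))
```

With neither --tokenizer_path nor --lora_model set, the sketch falls back to the base model directory, which matches the case where the LoRA weights were already merged.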

📁 Project Structure

MedicalGPT/
├── training/                # Core training scripts (main training path)
│   ├── template.py                         # Conversation template definitions
│   ├── pretraining.py                      # Stage 1: Continue Pretraining (PT)
│   ├── supervised_finetuning.py            # Stage 2: Supervised Fine-tuning (SFT)
│   ├── reward_modeling.py                  # Stage 3: Reward Modeling (RM)
│   ├── ppo_training.py                     # Stage 4: Reinforcement Learning (PPO/RLOO)
│   ├── dpo_training.py                     # Stage 3: Direct Preference Optimization (DPO)
│   ├── orpo_training.py                    # Stage 3: ORPO
│   └── grpo_training.py                    # Stage 3: GRPO
│
├── scripts/                 # One-click run scripts + DeepSpeed configs
│   ├── run_pt.sh / run_sft.sh / run_dpo.sh / ...
│   └── zero1.json / zero2.json / zero3.json
│
├── demo/                    # Inference, deployment & application examples
│   ├── inference.py / gradio_demo.py / fastapi_server_demo.py
│   ├── openai_api.py / chatpdf.py
│   └── inference_multigpu_demo.py
│
├── tools/                   # Model merging, quantization & data processing
│   ├── merge_peft_adapter.py / merge_tokenizers.py
│   ├── model_quant.py / eval_quantize.py
│   └── convert_dataset.py / validate_jsonl.py
│
├── notebooks/               # Colab tutorial notebooks
│   ├── run_training_dpo_pipeline.ipynb
│   └── run_training_ppo_pipeline.ipynb
│
├── data/                    # Training data
├── docs/                    # Documentation
└── tests/                   # Tests

| Directory | Description | Target Audience |
|-----|------|------|
| training/ | Core training code covering the PT→SFT→RM→PPO/DPO/ORPO/GRPO pipeline | Developers learning training principles |
| scripts/ | One-click run scripts and DeepSpeed configs, ready to copy and use | Users who want to start training quickly |
| demo/ | Inference, Gradio UI, FastAPI server, RAG QA examples | Users who want to deploy and try models |
| tools/ | LoRA merging, quantization, vocab extension, data conversion | Users needing model post-processing |
| notebooks/ | End-to-end Colab tutorials, one-click run | Beginners who want quick hands-on experience |

All scripts are run from the project root, e.g.: bash scripts/run_sft.sh

🚀 Training Pipeline

Stage 1: Continue Pretraining

Starting from the llama-7b model, continue pre-training on medical encyclopedia data to inject medical knowledge into the base model, producing the llama-7b-pt model. This step is optional.

bash scripts/run_pt.sh

Training Detail wiki

Stage 2: Supervised FineTuning

Starting from the llama-7b-pt model, run supervised fine-tuning on medical question-and-answer data to obtain the llama-7b-sft model. This step is required.

bash scripts/run_sft.sh

Training Detail wiki

Stage 3: Reward Modeling

RM (Reward Model): reward modeling

In principle, we could fine-tune the model with RLHF using human annotations directly.

However, that would require sending samples to humans for scoring after every round of optimization. This is expensive and slow, because convergence needs a large number of training samples and humans can only read and annotate them at a limited speed. A better strategy than direct feedback is to train a reward model (RM) on a human-annotated set before entering the RL loop. The purpose of the reward model is to simulate human scoring of text.

The best practice for building a reward model is to rank predictions: for each prompt (input text) with two candidate results (yk, yj), the model predicts which one a human annotator would score higher. The RM is trained on human-labeled scores of the SFT model's outputs so that it can replace manual scoring. It is essentially a regression model used to align with human preferences, mainly the "HHH" principle: "helpful, honest, harmless".
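As an illustration of ranking two results for the same prompt, the sketch below computes a pairwise ranking loss, -log(sigmoid(score_chosen - score_rejected)). This Bradley-Terry style loss is a common choice for reward models, but it is an assumption here: the README does not state which loss training/reward_modeling.py uses, and the function name is hypothetical.

```python
import math


def pairwise_ranking_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the human-preferred answer gets the higher score...
print(pairwise_ranking_loss(2.0, 0.5))
# ...and large when the reward model ranks the pair the wrong way round.
print(pairwise_ranking_loss(0.5, 2.0))
```

Minimizing this loss over many annotated pairs pushes the scalar score of the preferred result above the score of the other one, which is what lets the trained model stand in for human scoring.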

Starting from the llama-7b-sft model, train a reward model on medical question-and-answer preference data (dialog pairs from the reward dataset) to obtain the llama-7b-reward model. This step is required.

bash scripts/run_rm.sh

Training Detail wiki

Stage 4: Reinforcement Learning

The goal of the RL (Reinforcement Learning) stage is to maximize the reward model's output. After the steps above, we have a fine-tuned language model (llama-7b-sft) and a reward model (llama-7b-reward), so the RL loop is ready to run.

This process is roughly divided into three steps:

  1. Given a prompt, the model generates a reply
  2. The reward model scores the reply
  3. Based on the score, run one round of reinforcement learning for policy optimization (PPO)

<img src=https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/blog/stackllama/trl_loop.png height=400 />
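The three steps can be mimicked with a toy loop. Everything in this sketch is a stand-in chosen for illustration: the "policy" is a softmax over three canned replies, the "reward model" is a lookup table, and the update is a plain REINFORCE policy-gradient step rather than PPO. It is not how training/ppo_training.py works; it only shows the generate, score, update cycle.

```python
import math
import random

# Hypothetical stand-ins: a real run uses an LLM policy and a trained reward model.
CANDIDATE_REPLIES = ["Rest and drink fluids.", "I don't know.", "Take any pill you find."]
TOY_REWARDS = {CANDIDATE_REPLIES[0]: 1.0, CANDIDATE_REPLIES[1]: 0.0, CANDIDATE_REPLIES[2]: -1.0}


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_model(prompt, reply):
    """Step 2: score a reply (toy lookup instead of a learned model)."""
    return TOY_REWARDS[reply]


def train(steps=500, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATE_REPLIES)  # parameters of the toy policy
    prompt = "I have a cold. What should I do?"
    for _ in range(steps):
        probs = softmax(logits)
        # Step 1: the policy generates (samples) a reply for the prompt.
        idx = rng.choices(range(len(probs)), weights=probs)[0]
        # Step 2: the reward model scores the reply.
        reward = reward_model(prompt, CANDIDATE_REPLIES[idx])
        # Step 3: policy-gradient update using d log pi(idx) / d logits.
        for j in range(len(logits)):
            grad = (1.0 if j == idx else 0.0) - probs[j]
            logits[j] += lr * reward * grad
    return softmax(logits)


final_probs = train()
best = CANDIDATE_REPLIES[final_probs.index(max(final_probs))]
print(best)
```

PPO adds clipping and, typically, a penalty for drifting away from the SFT model on top of this basic cycle; those parts are left out to keep the sketch short.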

Fine-tune llama-7b-sft with reinforcement learning, using llama-7b-reward as the reward model, to create llama-7b-rl:

bash scripts/run_ppo.sh

Training Detail wiki
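The order of the four stages can be written down as a small plan. The sketch below only builds the command list from the script names used in this README; the STAGES list and the plan_commands helper are hypothetical conveniences, not part of the repository.

```python
from typing import List

# (stage name, script, optional?) as described in this README; only Stage 1 is optional.
STAGES = [
    ("PT", "scripts/run_pt.sh", True),
    ("SFT", "scripts/run_sft.sh", False),
    ("RM", "scripts/run_rm.sh", False),
    ("PPO", "scripts/run_ppo.sh", False),
]


def plan_commands(skip_optional: bool = False) -> List[str]:
    """Return the shell commands for the stages, in pipeline order."""
    return [f"bash {script}" for _, script, optional in STAGES
            if not (skip_optional and optional)]


for command in plan_commands():
    print(command)
```

To execute the plan rather than print it, each command could be passed to subprocess.run from the project root, since all scripts are meant to be run from there.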

Supported Models

| Model Name | Model Size | Target Modules | Template |
|-----|------|------|------|
| Baichuan | 7B/13B | W_pack | baichuan |
| Baichuan2 | 7B/13B | W_pack | baichuan2 |
| BLOOMZ | 560M/1.1B/1.7B/3B/7.1B/176B | query_key_value | vicuna |
| ChatGLM | 6B | query_key_value | chatglm |
| ChatGLM2 | 6B | query_key_value | chatglm2 |
| ChatGLM3 | 6B | query_key_value | chatglm3 |
| Cohere | 104B | q_proj,v_proj | cohere |
| DeepSeek | 7B/16B/67B | q_proj,v_proj | deepseek |
| DeepSeek3 | 671B | q_proj,v_proj | deepseek3 |
| InternLM2 | 7B/20B | wqkv | intern2 |
| LLaMA | 7B/13B/33B/65B | q_proj,v_proj | alpaca |
| LLaMA2 | 7B/13B/70B | q_proj,v_proj | llama2 |
| LLaMA3 | 8B/70B | q_proj,v_proj | llama3 |
| Mistral | 7B/8x7B | q_proj,v_proj | mistral |
| Orion | 14B | q_proj,v_proj | orion |
| Qwen | 1.8B/7B/14B/72B | c_attn | chatml |
| Qwen1.5 | 0.5B/1.8B/4B/14B/72B | q_proj,v_proj | qwen |
| Qwen2.5 | 0.5B/1.8B/4B/14B/72B | q_proj,v_proj | qwen |
| Qwen3 | 0.6B/1.7B/4B/8B/14B/32B/235B | q_proj,v_proj | qwen3 |
| Qwen3.5 | 0.8B/2B/4B/9B/27B/35B/122B | q_proj,v_proj | qwen3_5 |
| XVERSE | 13B | query_key_value | xverse |
| Yi | 6B/34B | q_proj,v_proj | yi |

💾 Install

Updating the requirements

The requirements.txt changes from time to time. To install or update the dependencies, run:

git clone https://github.com/shibing624/MedicalGPT
cd MedicalGPT
pip install -r requirements.txt --upgrade

Hardware Requirement (VRAM)

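| Train Method | Bits | 7B | 13B | 30B | 70B | 110B | 8x7B | 8x22B |
|-----|------|------|------|------|------|------|------|------|
| Full | AMP | 120GB | 240GB | 600GB | 1200GB | 2000GB | 900GB | 2400GB |
| Full | 16 | 60GB | 120GB | 300GB | 600GB | 900GB | 400GB | 1200GB |
| LoRA | 16 | 16GB | 32GB | 64GB | 160GB | 240GB | 120GB | 320GB |
| QLoRA | 8 | 10GB | 20GB | 40GB | 80GB | 140GB | 60GB | 160GB |
| QLoRA | 4 | 6GB | 12GB | 24GB | 48GB | 72GB | 30GB | 96GB |
| QLoRA | 2 | 4GB | 8GB | 16GB | 24GB | 48GB | 18GB | 48GB |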
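To pick a training method for a given GPU budget, the table can be turned into a lookup. The figures below are copied from the table above; the VRAM_GB dictionary and the methods_that_fit helper are hypothetical names used only for this sketch, and the numbers are rough estimates rather than guarantees.

```python
# VRAM estimates in GB, copied from the hardware requirement table.
SIZES = ["7B", "13B", "30B", "70B", "110B", "8x7B", "8x22B"]
ROWS = {
    ("Full", "AMP"): [120, 240, 600, 1200, 2000, 900, 2400],
    ("Full", "16"): [60, 120, 300, 600, 900, 400, 1200],
    ("LoRA", "16"): [16, 32, 64, 160, 240, 120, 320],
    ("QLoRA", "8"): [10, 20, 40, 80, 140, 60, 160],
    ("QLoRA", "4"): [6, 12, 24, 48, 72, 30, 96],
    ("QLoRA", "2"): [4, 8, 16, 24, 48, 18, 48],
}
VRAM_GB = {key: dict(zip(SIZES, values)) for key, values in ROWS.items()}


def methods_that_fit(model_size, available_gb):
    """List (method, bits, required_gb) rows that fit, most demanding first."""
    fits = [(method, bits, table[model_size])
            for (method, bits), table in VRAM_GB.items()
            if table[model_size] <= available_gb]
    return sorted(fits, key=lambda row: row[2], reverse=True)


# Which methods fit a 7B model on a single 24 GB card?
for method, bits, required in methods_that_fit("7B", 24):
    print(f"{method} ({bits}-bit): about {required} GB")
```

For a 7B model and 24 GB of VRAM, the lookup returns LoRA at 16 bits plus the three QLoRA rows, and it returns an empty list when nothing in the table fits.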

🔥 Inference

After training completes, load the trained model to check the quality of the text it generates.

python demo/inference.py \
    --base_model path_to_llama_hf_dir \
    --lora_model path_to_lora \
    --with_prompt \
    --interactive

Parameter Description:

  • --base_model {base_model}: directory containing the LLaMA model weights and configuration files in HF format
  • --lora_model {lora_model}: directory containing the extracted LoRA files, or a model name on the HF Model Hub. If the LoRA weights have already been merged into the pre-trained model, omit this parameter
  • --tokenizer_path {tokenizer_path}: directory containing the tokenizer. If omitted, it defaults to the value of --lora_model; if --lora_model is also omitted, it defaults to the value of --base_model
  • --with_prompt: merge the input with the prompt template. Be sure to enable this option when loading an Alpaca model!
  • --interactive: start in interactive mode for multiple single-turn questions and answers
  • --data_file {file_name}: start in non-interactive mode and read file_name line by line for prediction
  • --predictions_file {file_name}: in non-interactive mode, write the predictions to file_name in JSON format
  • --use_cpu: use only the CPU for inference
  • --gpus {gpu_ids}: IDs of the GPU devices to use; the default is 0. To use multiple GPUs, separate the IDs with commas, such as 0,1,2
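What "merging the input with the prompt template" means can be shown with the widely used Stanford Alpaca no-input template. This is an assumption for illustration: the template that demo/inference.py actually applies may differ, and ALPACA_TEMPLATE and build_input are hypothetical names.

```python
# Assumed template: the standard Stanford Alpaca prompt for instructions without extra input.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_input(text: str, with_prompt: bool) -> str:
    """Wrap the raw question in the template only when with_prompt is set."""
    return ALPACA_TEMPLATE.format(instruction=text) if with_prompt else text


print(build_input("How to make yourself energetic, list 5 suggestions", with_prompt=True))
```

A model fine-tuned on templated instructions expects the same wrapping at inference time, which is why the option matters for Alpaca-style models.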

Inference Examples

| Human | Assistant |
|-----|------|
| How to make yourself energetic, list 5 suggestions | Here are some suggestions to keep you energetic:<br/>1. Get enough sleep. Make sure to get enough sleep every night.<br/>2. Exercise more. Doing moderate aerobic exercise, such as jogging and swimming, can help improve your energy and stamina.<br/>3. Eat healthy food. Choose protein-rich foods such as chicken, fish, and eggs, as well as fresh fruits and vegetables.<br/>4. Reduce caffeine intake. Try to avoid drinking caffeinated beverages or eating caffeinated foods during the day.<br/>5. Relax. Try meditation, yoga, or other forms of relaxation to reduce stress and anxiety. |

📚 Dataset

  • Chinese medical dataset with 2.4 million entries (including pre-training, instruction fine-tuning and reward data): shibing624/medical

Attach links to some general datasets and medical datasets

☎️ Contact

  • Issues (suggestions): GitHub issues
  • Email: xuming (xuming624@qq.com)
  • WeChat: add WeChat ID xuming624 with the remark "Name-Company Name-NLP" to join the NLP exchange group.

⚠️ LICENSE

The project code is licensed under the Apache License 2.0 and is free for commercial use. The model weights and data may be used for research purposes only. Please include a link to MedicalGPT and the license agreement in your product description.

😇 Citation

If you use MedicalGPT in your research, please cite it as follows:

@misc{MedicalGPT,
   title={MedicalGPT: Training Medical GPT Model},
   author={Ming Xu},
   year={2023},
   howpublished={\url{https://github.com/shibing624/MedicalGPT}},
}

😍 Contribute

The project code is still rough. If you improve the code, you are welcome to contribute your changes back to this project. Before submitting, please note the following two points:

  • Add corresponding unit tests in tests
  • Run all unit tests with python -m pytest and make sure they all pass

Then you can submit a PR.

💕 Acknowledgements

Thanks for their great work!

  • shibing624/agentica: Framework for building LLM Agents, supporting various Agent types, including RAG, online search, Code interpreter, Vibe Coding, Claude Code, Copilot Agent, etc.