Training Perception Language Model (PLM)
April 28, 2025
We provide instructions for training or finetuning PLM on a custom dataset.
Data Format :open_file_folder:
We support both image and video conversation datasets in jsonl format. Each line of the jsonl file should follow one of the formats below.
For Image Conversation Dataset
{
"image": "<image path>",
"conversations": [
{
"from": "human",
"value": "human instruction"
},
{
"from": "assistant",
"value": "model response"
}
]
}
For Video Conversation Dataset
{
"video": "<video path>",
"conversations": [
{
"from": "human",
"value": "human instruction"
},
{
"from": "assistant",
"value": "model response"
}
]
}
Note that for images, we require the image key to be present in the jsonl line, while for videos we require the video key to be present in the jsonl line. The conversations key is common between the two types.
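The two formats above can be produced and checked with a few lines of Python. The following is a minimal sketch, not part of the repo: the helper names `make_record`, `validate_record` and `write_jsonl` are hypothetical, and the check for exactly one media key is an assumption that covers only the image and video formats shown here, not the other task types the repo supports.

```python
import json

MEDIA_KEYS = ("image", "video")  # assumption: exactly one of these per record


def make_record(media_key, media_path, turns):
    """Build one record from a list of (human_text, assistant_text) pairs."""
    if media_key not in MEDIA_KEYS:
        raise ValueError(f"media_key must be one of {MEDIA_KEYS}")
    conversations = []
    for human_text, assistant_text in turns:
        conversations.append({"from": "human", "value": human_text})
        conversations.append({"from": "assistant", "value": assistant_text})
    return {media_key: media_path, "conversations": conversations}


def validate_record(record):
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    present = [key for key in MEDIA_KEYS if key in record]
    if len(present) != 1:
        problems.append("expected exactly one of 'image' or 'video'")
    conversations = record.get("conversations")
    if not isinstance(conversations, list) or not conversations:
        problems.append("'conversations' must be a non-empty list")
        return problems
    for i, turn in enumerate(conversations):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: must be an object")
            continue
        if turn.get("from") not in ("human", "assistant"):
            problems.append(f"turn {i}: 'from' must be 'human' or 'assistant'")
        if not isinstance(turn.get("value"), str):
            problems.append(f"turn {i}: 'value' must be a string")
    return problems


def write_jsonl(records, path):
    """Write one JSON object per line, as the jsonl format requires."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

For example, `make_record("image", "images/0001.jpg", [("What is shown?", "A cat.")])` returns a record in the image format above, and `validate_record` returns an empty list for it. The path and texts here are placeholders.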
Tip
The repo also supports text-only, multi-image, image-region, video-region-caption (RCap), video-region-temporal-localization (RTLoc), and video-region-dense-captioning (RDCap) tasks. Please download the provided dummy-datasets for an example of each dataset type.
Registration of New Dataset
Given the dataset jsonl file, we can register a new dataset by adding an entry in apps/plm/configs/datasets.yaml.
custom_dataset_name:
annotation: path/to/the/jsonl/file.jsonl
root_dir: path/to/the/image-or-video/root-dir
Please refer to apps/plm/configs/datasets.yaml for the dummy image, video, and grounding datasets that are already registered.
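Before registering a dataset, it can help to confirm that every media path in the annotation file resolves under `root_dir`. A minimal sketch follows. `find_missing_media` is a hypothetical helper, not part of the repo, and it assumes that the `image` and `video` values are paths relative to `root_dir` (absolute paths are used as-is).

```python
import json
import os


def find_missing_media(annotation, root_dir):
    """Return (line_number, media_path) pairs for media files that do not exist.

    Assumption: "image"/"video" values are paths relative to root_dir.
    """
    missing = []
    with open(annotation, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            media_path = record.get("image") or record.get("video")
            if not isinstance(media_path, str):
                continue  # skip records without a single media path string
            # os.path.join returns media_path unchanged when it is absolute
            full_path = os.path.join(root_dir, media_path)
            if not os.path.exists(full_path):
                missing.append((line_number, media_path))
    return missing
```

Calling it with the same `annotation` and `root_dir` values you plan to put in the yaml entry returns an empty list when every referenced file is found.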
Training / Finetuning PLM :train:
Training PLM requires a .yaml configuration file that defines all configurable model and training parameters. Please refer to the provided plm_configs for details.
Tip
To run the following code, download the dummy-datasets and extract them to apps/plm/dummy_datasets.
Given a .yaml configuration file, run the following command to launch training on a single node with 8 GPUs:
torchrun --nproc-per-node 8 -m apps.plm.train config=apps/plm/configs/stage_3/plm_3b.yaml
Consolidate Checkpoints
To run inference or evaluation, first consolidate the checkpoints using the following command:
python apps/plm/consolidate.py --ckpt <path to the saved checkpoints.>
Run Inference / Evaluation
After consolidating the checkpoints, you can run inference using the following command:
# Use --media_type video to run inference on a video.
python apps/plm/generate.py \
--ckpt facebook/Perception-LM-3B \
--media_type image \
--media_path <path to image or video> \
--question <Question to be asked about the image or video.>
For evaluation, please refer to evaluation.md.
We also provide a script to launch distributed multi-node training on Slurm. Please use the provided utility, stool.py:
python -m core.stool script=apps.plm.train config=apps/plm/configs/stage_3/plm_8b.yaml qos=<QoS> nodes=<num_of_nodes>
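The single-node and Slurm launch commands in this section differ only in the launcher and its arguments. As a sketch, the hypothetical helpers below (not part of the repo) assemble the same argument lists in Python so they can be printed or passed to `subprocess.run`. They use only the flags and `key=value` arguments that appear in the commands above, and the default of 8 GPUs simply mirrors the single-node example.

```python
import shlex


def single_node_command(config, gpus=8):
    """Argument list for the single-node torchrun launch."""
    return [
        "torchrun", "--nproc-per-node", str(gpus),
        "-m", "apps.plm.train", f"config={config}",
    ]


def slurm_command(config, qos, nodes):
    """Argument list for the multi-node launch through core.stool."""
    return [
        "python", "-m", "core.stool",
        "script=apps.plm.train", f"config={config}",
        f"qos={qos}", f"nodes={nodes}",
    ]


if __name__ == "__main__":
    # Print the command rather than running it, so it can be inspected first.
    cmd = single_node_command("apps/plm/configs/stage_3/plm_3b.yaml")
    print(shlex.join(cmd))
```

Building the command as a list avoids shell quoting problems when a config path or QoS name contains unusual characters.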
We provide a step-by-step example of finetuning PLM on a public dataset, which covers each of the steps above in detail. Please see finetune_example.md.