Training Perception Language Model (PLM)

April 28, 2025 · View on GitHub

We provide instruction to train or finetune PLM on a custom dataset.

Tip

We provide configurations to run warm-up and sft to facilitate reproducibility of PLM training.

Data Format :open_file_folder:

We use support both image and video conversation datasets using jsonl. Each line of jsonl file should follow the following format,

For Image Conversation Dataset

  {
    "image": "<image path>",
    "conversations": [
      {
        "from": "human",
        "value": "human instruction"
      },
      {
        "from": "assistant",
        "value": "model response"
      }
    ]
  }

For Video Conversation Dataset

  {
    "video": "<video path>",
    "conversations": [
      {
        "from": "human",
        "value": " human instruction"
      },
      {
        "from": "assistant",
        "value": "model response"
      }
    ]
  }

Note that for images, we require the image key to be present in the jsonl line, while for videos we require the video key to be present in the jsonl line. The conversations key is common between the two types.

Tip

The repo also support text-only, multi-image, image-region, video-region-caption (RCap), video-region-temporal-localization (RTLoc) and video-region-dense-captioning (RDCap) tasks. Please download the provided dummy-datasets for an example of each dataset.

Registration of New Dataset

Given the dataset jsonl file, we can register a new dataset by adding an entry in apps/plm/configs/datasets.yaml.

custom_dataset_name:
    annotation: path/to/the/jsonl/file.jsonl
    root_dir: path/to/the/image-or-video/root-dir

Please refer to apps/plm/configs/datasets.yaml for already present dummy image, video and grounding datasets.

Training / Finetuning PLM :train:

Training PLM involves creating a .yaml configuration file, defining all model and training related configurable parameters. Please refer to the provided plm_configs for details.

Tip

To run the following code, download the dummy-datasets and extract them to apps/plm/dummy_datasets.

Given a .yaml configuration file, please run the following command to launch the training on a single node with 8 GPUs.

torchrun --nproc-per-node 8 -m apps.plm.train config=apps/plm/configs/stage_3/plm_3b.yaml

Consolidate Checkpoints

In order to run inference / evaluation, please consolidate checkpoints using the following command,

python apps/plm/consolidate.py --ckpt <path to the saved checkpoints.>

Run Inference / Evaluation

After consoldating the checkpoints, you can run inference using the following command,

python apps/plm/generate.py \
--ckpt facebook/Perception-LM-3B \
--media_type image \  # Replace with "video" for running inference on video
--media_path <path to image or video> \
--question <Question to be asked about the video.>

For evaluation, please refer to evaluation.md.

We also provide a script to launch a distributed multinode training on slurm. Please use the provided utility named stool.py.

python -m core.stool script=apps.plm.train config=apps/plm/configs/stage_3/plm_8b.yaml qos=<QoS> nodes=<num_of_nodes>

We provide a step-by-step example for how to finetune PLM on a public dataset that elaborates on each of the steps above in detail. Please see finetune_example.md.