September 29, 2025
(Repository cleanup in progress) Prompt2Act: Mapping Prompts into Sequence of Robotic Actions with Large Foundation Models
Official implementation of our system Prompt2Act, which maps open-ended multi-modal prompts into real-world robotic actions via large vision-language models, mixed execution agents, and visual grounding modules.
Highlights
- Multimodal Prompt Understanding: Supports vision-language prompts such as images, sketches, pointing gestures, and free-form instructions.
- Mixed Execution Agent: Combines predefined symbolic functions and on-the-fly code generation to execute diverse tasks.
- Visual Grounding via VG-Marker: Automatically identifies objects and assigns semantic anchors from raw scenes using open-vocabulary segmentation + GPT-4o.
- Supports Novel Tasks: From "Arrange pens by reference photo" to "Pick the toy pointed by hand", with no finetuning needed.
- Zero-shot generalization to occluded, unseen, and cluttered environments.
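To make the Mixed Execution Agent idea concrete, here is a minimal sketch of how a plan could mix predefined skills with generated code. It is an illustration, not the repository's executor: the step format, the `SKILLS` registry, and the functions `execute_step` and `run_plan` are all hypothetical, and the skills only return strings instead of driving a robot.

```python
from typing import Any, Callable, Dict, List

# Hypothetical skill registry. The names mirror the skills this README lists
# (pick, place, rotate); the signatures are assumptions for illustration.
SKILLS: Dict[str, Callable[..., str]] = {
    "pick": lambda obj: f"pick({obj})",
    "place": lambda obj, target: f"place({obj} -> {target})",
    "rotate": lambda obj, angle: f"rotate({obj}, {angle})",
}


def execute_step(step: Dict[str, Any], log: List[str]) -> None:
    """Run one plan step: either a predefined skill or a generated code snippet."""
    if step["type"] == "skill":
        # Predefined path: look the skill up by name and call it with its args.
        skill = SKILLS[step["name"]]
        log.append(skill(*step.get("args", [])))
    elif step["type"] == "code":
        # Code-generation path: run the snippet with only the skills and the
        # log in scope. Note that exec is NOT a security sandbox; a real
        # system should isolate model-written code properly.
        scope = {"__builtins__": {}, "log": log, **SKILLS}
        exec(step["source"], scope)
    else:
        raise ValueError(f"unknown step type: {step['type']!r}")


def run_plan(plan: List[Dict[str, Any]]) -> List[str]:
    """Execute every step in order and return the action log."""
    log: List[str] = []
    for step in plan:
        execute_step(step, log)
    return log


if __name__ == "__main__":
    plan = [
        {"type": "skill", "name": "pick", "args": ["red_pen"]},
        {"type": "code",
         "source": "for a in (0, 90):\n    log.append(rotate('red_pen', a))"},
        {"type": "skill", "name": "place", "args": ["red_pen", "tray"]},
    ]
    for action in run_plan(plan):
        print(action)
```

The point of the split is that common actions stay in fixed, testable functions, while the code path covers steps that no single predefined skill expresses, such as the loop above.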
Project Structure
Prompt2Act/
│
├── prompt2act/              # Core system modules
│   ├── planner/             # LLM-based sequence planner
│   ├── executor/            # Mixed Execution Agent (predefined + code-gen)
│   ├── visual_grounding/    # VG-Marker
│   └── utils/
│
├── data/                    # Demo trajectories & prompt configs
├── models/                  # LLM/VLM interface wrappers
├── scripts/                 # Evaluation scripts and launchers
└── README.md
Hardware Configuration (Preparing...)
Installation
We recommend Python 3.10 and Linux/Ubuntu.
# Clone repository
git clone https://github.com/Zero-coder/Prompt2Act.git
cd Prompt2Act
# Create virtual env (optional)
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Quick Start
Run a predefined task (e.g., Arrange Pens by Reference) in simulation:
python scripts/run_demo.py \
--task arrange_pens \
--prompt examples/prompts/arrange_pens.json
You can also modify the prompt file to test new tasks:
{
"instruction": "Please arrange these pens as shown in the image.",
"image": "reference_pens.jpg"
}
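Before launching a run with an edited prompt file, it can help to check that the file still parses and carries the two fields shown in the example. The sketch below is a standalone helper written for this README; `load_prompt` and `REQUIRED_KEYS` are hypothetical names, and whether `scripts/run_demo.py` performs a similar check or accepts further fields is not documented here.

```python
import json
from pathlib import Path

# The two fields that appear in the example prompt; treating both as
# required is an assumption made for this sketch.
REQUIRED_KEYS = ("instruction", "image")


def load_prompt(path):
    """Read a prompt JSON file and check that both example fields are usable."""
    prompt = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(prompt, dict):
        raise ValueError("prompt file must contain a JSON object")
    for key in REQUIRED_KEYS:
        value = prompt.get(key)
        # Reject missing keys, non-string values, and blank strings alike.
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"prompt field {key!r} must be a non-empty string")
    return prompt
```

For instance, `load_prompt("examples/prompts/arrange_pens.json")` returns the parsed dictionary, or raises `ValueError` if a field is missing or blank.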
Model & Assets
To fully enable Prompt2Act, you need:
- GPT-4o / GPT-4V (OpenAI API or local Azure proxy)
- SAM / Grounded-SAM for segmentation
- VG-Marker (included, no training needed)
- Predefined skills: pick, place, rotate, etc.
We provide starter checkpoints and test prompts in the data/ directory.
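As a rough picture of how segmentation output can be turned into anchors that a vision-language model can refer to, the sketch below gives each object mask a numeric marker placed at its centroid. This is not the VG-Marker implementation shipped in this repository. The mask format, the left-to-right ordering, and the names `mask_centroid` and `assign_markers` are assumptions chosen to keep the example self-contained.

```python
from typing import Dict, List, Tuple

# A binary mask as rows of 0/1, standing in for one object mask from a segmenter.
Mask = List[List[int]]


def mask_centroid(mask: Mask) -> Tuple[float, float]:
    """Return the (row, col) centroid of the nonzero pixels of a binary mask."""
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("empty mask")
    n = len(pixels)
    return (sum(r for r, _ in pixels) / n, sum(c for _, c in pixels) / n)


def assign_markers(masks: List[Mask]) -> Dict[int, Tuple[float, float]]:
    """Number the masks 1..N from left to right and map each ID to its centroid."""
    centroids = sorted((mask_centroid(m) for m in masks),
                       key=lambda rc: (rc[1], rc[0]))  # sort by column, then row
    return {marker_id: rc for marker_id, rc in enumerate(centroids, start=1)}


if __name__ == "__main__":
    right_object = [[0, 0, 0, 1],
                    [0, 0, 0, 1],
                    [0, 0, 0, 0]]
    left_object = [[1, 1, 0, 0],
                   [1, 1, 0, 0],
                   [0, 0, 0, 0]]
    print(assign_markers([right_object, left_object]))
    # → {1: (0.5, 0.5), 2: (0.5, 3.0)}
```

With numbered markers in place, an instruction can reference "object 2" rather than describing the object in words. How the actual module labels and orders its anchors may differ.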
Evaluation
To evaluate Prompt2Act under different generalization axes:
python scripts/eval_benchmark.py \
--benchmark occlusion \
--config configs/eval_occlusion.yaml
You can test:
- Visual generalization (new textures, occlusion)
- Reasoning (visual constraints, sketch understanding)
- Embodiment shift (sim vs real)
Citation
If you find this project helpful, please consider citing our paper:
@article{jiang2025prompt2act,
title={Prompt2Act: Mapping Prompts into Sequence of Robotic Actions with Large Foundation Models},
author={Maowei Jiang and Qi Wang and Hongfeng Ai and Zhiyong Dong and Yusong Hu and Ao Liang and Yifan Wang and Ruiqi Li and Quangao Liu and Moquan Chen and Peter Buš and Long Zeng},
journal={Information Fusion (under revision)},
year={2025}
}
Contact
Maintained by @Zero-coder. Please open issues or pull requests for contributions.