🌴 PALM-E: A Multi-Modal AI Model

January 29, 2024

This is an open-source implementation of the state-of-the-art multi-modal foundation model described in Google's paper "PaLM-E: An Embodied Multimodal Language Model". PaLM-E is a single large embodied multimodal model that can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments. It also exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains.

PAPER LINK: PaLM-E: An Embodied Multimodal Language Model

Note

  • This repository contains only the model architecture: no pretrained weights and no tokenizer.
  • To run inference, you would first need to set up a tokenizer for text and images, then train the model, and only then run inference.
  • If you are researching multi-modal models and would like to train this model and release it as open source, join the Agora lab by clicking on the banner!
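
Because no tokenizer ships with the repository, the text side of that pipeline has to be supplied by the user. The following is a minimal sketch of what a text tokenizer needs to produce, assuming a simple whitespace vocabulary. The names `build_vocab`, `encode`, `PAD_ID`, and `UNK_ID` are hypothetical and are not part of the `palme` package; a real setup would more likely use a trained subword tokenizer, plus a separate image preprocessing step.

```python
# Hypothetical toy tokenizer for illustration; NOT part of the palme package.
PAD_ID = 0  # id used to right-pad short sequences
UNK_ID = 1  # id used for words missing from the vocabulary


def build_vocab(texts, max_size=20000):
    """Map the most frequent whitespace-separated words to integer ids."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # Most frequent first; ties broken alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for word in ordered[: max_size - len(vocab)]:
        vocab[word] = len(vocab)
    return vocab


def encode(text, vocab, seq_len=1024):
    """Convert text to exactly seq_len ids, truncating or right-padding as needed."""
    ids = [vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:seq_len]
    return ids + [PAD_ID] * (seq_len - len(ids))


corpus = ["pick up the green block", "push the red block to the corner"]
vocab = build_vocab(corpus)
ids = encode("pick up the blue block", vocab)
print(len(ids), ids[:5])
```

Wrapping `ids` in a batch dimension, for example with `torch.tensor([ids])`, would give an integer tensor of shape `(1, 1024)`, the same shape as the `caption` input in the usage example.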

Appreciation

  • All the creators in Agora. Join Agora, the community of AI engineers changing the world with their creations.
  • LucidRains, for inspiring me to devote myself to open source AI.

🚀 Quick Start

Installation 📦

pip install palme

Usage 🎨

import torch
from palme.model import PalmE

# Example inputs: one random 256x256 RGB image and 1,024 random token ids below 20,000
img = torch.randn(1, 3, 256, 256)
caption = torch.randint(0, 20000, (1, 1024))

model = PalmE()
output = model(img, caption)
print(output.shape) # (1, 1024, 20000)
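
The output shape in the comment above suggests one score per vocabulary entry at every caption position. As a hedged sketch of how such an output could be consumed, the example below takes the greedy token prediction and computes a next-token cross-entropy loss. It assumes the output holds unnormalized logits and that training uses a shifted next-token objective; neither is confirmed by this README. A random tensor with smaller dimensions stands in for the model output so the sketch runs without `palme` installed.

```python
import torch
import torch.nn.functional as F

# Stand-in for `output = model(img, caption)`; small sizes keep the sketch light.
batch, seq_len, vocab_size = 1, 16, 100
logits = torch.randn(batch, seq_len, vocab_size)
caption = torch.randint(0, vocab_size, (batch, seq_len))

# Greedy prediction: the highest-scoring token id at every position.
predicted_ids = logits.argmax(dim=-1)  # shape (batch, seq_len)

# Assumed next-token objective: position t is scored against the token at t + 1.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_targets = caption[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_targets)

print(predicted_ids.shape, loss.item())
```

With an untrained model or random logits the loss value carries no meaning; the point is only the tensor shapes involved in a training step.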




Dataset Strategy

Here is a summary table of the key datasets mentioned in the paper:

| Dataset | Tasks | Size | Link |
| --- | --- | --- | --- |
| TAMP | Robotic manipulation planning, VQA | 96,000 scenes | Custom dataset |
| Language Table | Robotic manipulation planning | Custom dataset | Link |
| Mobile Manipulation | Robotic navigation and manipulation planning, VQA | 2,912 sequences | Based on SayCan dataset |
| WebLI | Image-text retrieval | 66M image-caption pairs | Link |
| VQAv2 | Visual question answering | 1.1M questions on COCO images | Link |
| OK-VQA | Visual question answering requiring external knowledge | 14,031 questions on COCO images | Link |
| COCO | Image captioning | 330K images with captions | Link |
| Wikipedia | Text corpus | N/A | Link |

The key robotics datasets were collected specifically for this work, while the larger vision-language datasets (WebLI, VQAv2, OK-VQA, COCO) are standard datasets in vision-language research. Sizes range from a few thousand to tens of thousands of examples for the robotics domains, up to tens of millions for the internet-scale vision-language data.
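
With sizes spanning several orders of magnitude, sampling training examples in proportion to dataset size would let the internet-scale data drown out the robotics data. One common way to soften that imbalance is temperature-scaled sampling, sketched below. This is an illustrative assumption, not the mixing scheme reported in the paper; the function `sampling_weights` and the `temperature` value are hypothetical, and treating scenes, sequences, questions, and image-caption pairs as comparable "examples" is a simplification.

```python
# Approximate sizes taken from the table above (units differ per dataset).
dataset_sizes = {
    "TAMP": 96_000,
    "Mobile Manipulation": 2_912,
    "WebLI": 66_000_000,
    "VQAv2": 1_100_000,
    "OK-VQA": 14_031,
    "COCO": 330_000,
}


def sampling_weights(sizes, temperature=1.0):
    """Return sampling probabilities proportional to size ** (1 / temperature)."""
    scaled = {name: size ** (1.0 / temperature) for name, size in sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


# temperature=1.0 is plain size-proportional sampling; higher values flatten it.
proportional = sampling_weights(dataset_sizes, temperature=1.0)
smoothed = sampling_weights(dataset_sizes, temperature=3.0)

for name in dataset_sizes:
    print(f"{name:20s} proportional={proportional[name]:.5f} smoothed={smoothed[name]:.5f}")
```

Raising the temperature shifts probability mass from WebLI toward the small robotics datasets, at the cost of repeating their examples more often.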


Contribute || Be Part of the PALM-E Adventure 🤝

Your brilliance is needed! Join us, and together, let's make PALM-E even more awe-inspiring:

  1. Get Your Copy: Fork the PALM-E repo.
  2. Make It Local: Clone your fork.
  3. Prep Your Tools: Install the necessities.
  4. Discover & Innovate: Dive into the code.
  5. Craft Your Magic: Branch and code away.
  6. Show & Tell: Push your changes and craft a pull request.

🐞 Fixes, 🎨 enhancements, 📝 docs, or 💡 ideas: all are welcome! Let's shape the future of AI, hand in hand.


Citation

@article{driess2023palme,
  title={PaLM-E: An Embodied Multimodal Language Model},
  author={Driess, Danny and Xia, Fei and Sajjadi, Mehdi S. M. and Lynch, Corey and Chowdhery, Aakanksha and Ichter, Brian and Wahid, Ayzaan and Tompson, Jonathan and Vuong, Quan and Yu, Tianhe and Huang, Wenlong and Chebotar, Yevgen and Sermanet, Pierre and Duckworth, Daniel and Levine, Sergey and Vanhoucke, Vincent and Hausman, Karol and Toussaint, Marc and Greff, Klaus and Zeng, Andy and Mordatch, Igor and Florence, Pete},
  journal={arXiv preprint arXiv:2303.03378},
  year={2023},
  url={https://doi.org/10.48550/arXiv.2303.03378}
}

Roadmap

  • URGENT: Debug Tokenizer, make sure multi-modal inputs work.
  • Create Dataset Strategy
  • Upload Training Documentation
  • Get training running with multi-modal inputs