
July 10, 2024

Symbol-LLM: Towards Foundational Symbol-centric Interface for Large Language Models

[🌐 Website] • [📜 Paper] • [🤗 HF Models] • [🤗 HF Dataset] • [🐱 GitHub]

Repo for "Symbol-LLM: Towards Foundational Symbol-centric Interface for Large Language Models"

🔥 News

[2024/05/16] 🔥🔥🔥 Symbol-LLM has been accepted to ACL 2024 (main conference)!
[2023/12/28] 🔥🔥🔥 We release the Symbolic Collection (~880K) on 🤗 HuggingFace! Download it and try it out!
[2023/10/08] 🔥🔥🔥 The model weights of Symbol-LLM are released on 🤗 HuggingFace!
[2023/11/15] We make the Symbol-LLM paper public!

Although Large Language Models (LLMs) demonstrate remarkable ability in processing and generating human-like text, they are limited in comprehending and expressing world knowledge that extends beyond the boundaries of natural language (e.g., chemical molecular formulas). Injecting a collection of symbolic data directly into the training of LLMs can be problematic, as it disregards the synergies among different symbolic families and overlooks the need for a balanced mixture of natural and symbolic data. In this work, we tackle these challenges from both a data and a framework perspective and introduce the Symbol-LLM series of models. First, we curate a data collection consisting of 34 tasks and incorporating approximately 20 distinct symbolic families, intended to capture the interrelations and foster synergies between symbols. Then, a two-stage tuning framework injects symbolic knowledge without loss of general ability. Extensive experiments on both symbol-centric and NL-centric tasks demonstrate the balanced and superior performance of the Symbol-LLM series of models.

🚀 Quick Start

To try Symbol-LLM, please use the Transformers library:

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Symbol-LLM/Symbol-LLM-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Symbol-LLM/Symbol-LLM-7B-Instruct")
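
Once the model and tokenizer load, generation could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' inference script: `build_prompt` and `generate` are hypothetical helpers, and the plain "instruction plus context" prompt layout is an assumption, since this README does not specify a prompt template. Only standard Transformers calls (`from_pretrained`, `generate`, `decode`) are used.

```python
def build_prompt(instruction: str, context: str = "") -> str:
    """Join an instruction and optional context into one plain-text prompt.

    Hypothetical layout: the README does not document a prompt template.
    """
    instruction = instruction.strip()
    context = context.strip()
    return f"{instruction}\n\n{context}" if context else instruction


def generate(
    prompt: str,
    model_name: str = "Symbol-LLM/Symbol-LLM-7B-Instruct",
    max_new_tokens: int = 128,
) -> str:
    """Load the model and return only the newly generated text."""
    # Imported lazily so build_prompt works without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer(prompt, return_tensors="pt")
    # Greedy decoding keeps the output deterministic for a quick check.
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    # Drop the prompt tokens so only the continuation is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example call (downloads the weights on first use):
# prompt = build_prompt(
#     "Translate the sentence into first-order logic.",
#     "Every student reads a book.",
# )
# print(generate(prompt))
```

Loading a 7B model in full precision needs substantial memory, so in practice you may want to pass a reduced-precision `torch_dtype` or a `device_map` to `from_pretrained`, depending on your hardware.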

To use our symbolic collection, please load the dataset:

from datasets import load_dataset

# If the dataset is gated/private, make sure you have run huggingface-cli login
dataset = load_dataset("Symbol-LLM/Symbolic_Collection")
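
Before training on the collection, it can help to check which splits and columns it exposes. The sketch below assumes that `load_dataset` returns a `DatasetDict` (the usual result when no `split` is given) and that the repository needs no configuration name. Neither is confirmed by this README, and `split_sizes` and `load_collection` are hypothetical helper names.

```python
from typing import Mapping, Sequence


def split_sizes(dataset: Mapping[str, Sequence]) -> dict:
    """Return {split_name: number_of_rows} for a DatasetDict-like mapping."""
    return {name: len(split) for name, split in dataset.items()}


def load_collection(name: str = "Symbol-LLM/Symbolic_Collection"):
    """Load the collection from the Hub (assumes no config name is required)."""
    # Imported lazily so split_sizes can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(name)


# Example usage (downloads the data on first use):
# dataset = load_collection()
# print(split_sizes(dataset))            # rows per split
# first_split = next(iter(dataset.values()))
# print(first_split.column_names)        # field names are not listed in this README
# print(first_split[0])                  # one raw example
```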

📃 Deployed As A WebUI

The WebUI implementation is adapted from text-generation-webui. Launch it as follows:

cd demo-webui/
python server.py --model <model_name> --api --share --gpu-memory 40 40 --compute_dtype float32 --bf16

📒 Note

The paper has been accepted to ACL 2024 (main conference). The model weights and the symbolic collection are already available on 🤗 HuggingFace (see News); we will also open-source the code.

🔧 Repo Structure

This repo contains the training scripts and the demo deployment. The detailed structure is as follows:

.
├── README.md
├── logo.png
├── demo-webui

Citation

If you find this work helpful, please cite the paper.

@article{xu2023symbol,
  title={Symbol-LLM: Towards Foundational Symbol-centric Interface For Large Language Models},
  author={Xu, Fangzhi and Wu, Zhiyong and Sun, Qiushi and Ren, Siyu and Yuan, Fei and Yuan, Shuai and Lin, Qika and Qiao, Yu and Liu, Jun},
  journal={arXiv preprint arXiv:2311.09278},
  year={2023}
}