:paperclip: How to configure MINT to evaluate your LLM
September 20, 2023
Add an evaluated LLM
MINT uses different classes to abstract over the APIs of different LLMs. You can find the list of implemented LLMs in mint/agents/__init__.py.
For closed-source models:
- OpenAILMAgent for the OpenAI API
- BardLMAgent for Bard
- ClaudeLMAgent for Anthropic Claude
For open-source models, we provide VLLMAgent, which can evaluate any LLM that can be served behind an OpenAI-compatible API with VLLM or FastChat.
If you want to evaluate an open-source LLM that can be served with VLLM or FastChat: First, refer to docs/SERVING.md to learn how to serve your model. Then, modify mint/configs/config_variables.py by adding a dictionary describing the model to be evaluated to EVALUATED_MODEL_LIST.
```python
# For Chat Model
{
    "agent_class": "VLLMAgent",
    "config": {
        "model_name": "<YOUR_MODEL_NAME>",
        "chat_mode": True,
        "max_tokens": 512,
        "temperature": 0.0,
        "openai.api_base": "<YOUR_API_BASE>",
        "add_system_message": False,
    },
},

# For Completion-only Model
{
    "agent_class": "VLLMAgent",
    "config": {
        "model_name": "Llama-2-70b-hf",
        "chat_mode": False,
        "max_tokens": 512,
        "temperature": 0.0,
        "openai.api_base": "<YOUR_API_BASE>",
        "add_system_message": False,
    },
},
```
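Before launching an evaluation, it can help to sanity-check a new entry. The snippet below is a standalone sketch, not part of MINT; it assumes the key set shown in the examples above is required and simply flags anything missing:

```python
# Standalone sanity check for EVALUATED_MODEL_LIST entries (illustrative only;
# MINT itself does not ship this helper).
REQUIRED_CONFIG_KEYS = {"model_name", "chat_mode", "max_tokens",
                        "temperature", "openai.api_base", "add_system_message"}

def validate_entry(entry):
    """Return a list of problems found in one model entry."""
    problems = []
    if entry.get("agent_class") != "VLLMAgent":
        problems.append("agent_class should be 'VLLMAgent' for served models")
    missing = REQUIRED_CONFIG_KEYS - set(entry.get("config", {}))
    if missing:
        problems.append(f"missing config keys: {sorted(missing)}")
    return problems

entry = {
    "agent_class": "VLLMAgent",
    "config": {
        "model_name": "Llama-2-70b-hf",
        "chat_mode": False,
        "max_tokens": 512,
        "temperature": 0.0,
        "openai.api_base": "http://localhost:8000/v1",
        "add_system_message": False,
    },
}
print(validate_entry(entry))  # → []
```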
If you want to evaluate another closed-source LLM whose API schema differs from the existing implementations: You need to implement a new agent class that inherits from LMAgent (PRs welcome!).
You can use mint/agents/openai_lm_agent.py as an example, then add the model configuration to mint/configs/config_variables.py as above.
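As a rough illustration of that step, a new agent might look like the sketch below. The base-class interface shown (a constructor taking a config dict and an `act` method taking the conversation state) is an assumption made for this sketch; check mint/agents for the actual LMAgent signature.

```python
# Illustrative sketch only: the real LMAgent lives in mint/agents and its
# interface may differ. A minimal stand-in base class is stubbed here so
# the example runs on its own.
class LMAgent:
    def __init__(self, config):
        self.config = config

    def act(self, state):
        raise NotImplementedError

class MyClosedSourceAgent(LMAgent):
    """Wraps a hypothetical proprietary API behind the agent interface."""
    def act(self, state):
        prompt = "\n".join(state)             # flatten history (assumption)
        # response = my_api.complete(prompt)  # hypothetical API call
        response = f"[model reply to {len(state)} message(s)]"
        return response

agent = MyClosedSourceAgent({"model_name": "my-model"})
print(agent.act(["Hello", "Solve 1+1"]))
```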
Add a feedback-providing LLM
We implemented three different feedback agent classes.
If you want to use an existing open-source model compatible with VLLM or FastChat, you can add a configuration similar to the above to FEEDBACK_PROVIDER_LIST in mint/configs/config_variables.py.
```python
FEEDBACK_PROVIDER_LIST = [
    ...
    {
        "agent_class": "VLLMFeedbackAgent",
        "model_name": "<YOUR_MODEL_NAME>",
        "openai.api_base": "<YOUR_API_BASE>",
        "chat_mode": True,  # Set to False if your model is completion-only
    },
    ...
]
```
If needed, you can use these classes as examples to implement your own feedback agent class (PRs welcome!). Then, add this model configuration to FEEDBACK_PROVIDER_LIST in mint/configs/config_variables.py. For example:
```python
FEEDBACK_PROVIDER_LIST = [
    ...
    {
        # Your custom feedback provider
        "agent_class": "<YOUR_FEEDBACK_AGENT_CLASS>",
        "model_name": "<YOUR_FEEDBACK_MODEL_NAME>",
    },
    ...
]
```
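To make the "implement your own feedback agent" step concrete, here is a toy provider. Everything in it is an assumption for illustration (the class names, the `give_feedback` method, the binary "correct"/"incorrect" convention); the real interface is defined in mint/agents.

```python
# Illustrative only: a hypothetical custom feedback provider that compares
# an attempt against the ground truth and emits binary feedback.
class FeedbackAgent:
    def __init__(self, config):
        self.config = config

class BinaryFeedbackAgent(FeedbackAgent):
    """Hypothetical provider emitting ground-truth-based binary feedback."""
    def give_feedback(self, attempt, ground_truth):
        return "correct" if attempt.strip() == ground_truth.strip() else "incorrect"

fb = BinaryFeedbackAgent({"model_name": "my-feedback-model"})
print(fb.give_feedback("42", "42"))  # → correct
```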
Change Experiment Configurations
Optionally, you can change different experiment settings in mint/configs/config_variables.py.
ENV_CONFIGS
This specifies the settings of the environment. Here is an example:
```python
ENV_CONFIGS = [
    ...,
    {
        "max_steps": 5,
        "use_tools": True,
        "max_propose_solution": 2,
        "count_down": True,
    },
    ...
]
```
where:

- max_steps corresponds to the interaction budget (k) in the paper.
- use_tools should always be True (the no-tool setting is not yet implemented).
- max_propose_solution is the maximum number of solutions the evaluated LLM can propose.
- count_down controls whether the environment counts down the remaining steps (read Section 2 of the paper for more detail).
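The interplay of these two budgets can be sketched with a toy interaction loop. This is not MINT code; it only assumes that an episode ends either when the step budget runs out or when the agent has used up its allowed solution proposals:

```python
def run_episode(policy, max_steps=5, max_propose_solution=2):
    """Toy loop: the agent may act up to max_steps times, and at most
    max_propose_solution of those acts may be solution proposals."""
    proposals = 0
    for step in range(max_steps):
        action = policy(step)               # returns "tool" or "propose"
        if action == "propose":
            proposals += 1
            if proposals >= max_propose_solution:
                return step + 1, proposals  # proposal budget exhausted
    return max_steps, proposals             # step budget exhausted

# An agent that proposes solutions at steps 1 and 3 ends after 4 steps:
print(run_episode(lambda s: "propose" if s in (1, 3) else "tool"))  # → (4, 2)
```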
FEEDBACK_TYPES
This specifies the types of feedback we instruct the feedback-providing LLM to provide. Here are all the settings we currently support:
```python
FEEDBACK_TYPES = [
    {"pseudo_human_feedback": "no_GT", "feedback_form": "textual"},  # default setting
    {"pseudo_human_feedback": "no_GT", "feedback_form": "binary"},
    {"pseudo_human_feedback": "GT", "feedback_form": "binary"},
    {"pseudo_human_feedback": "GT", "feedback_form": "textual"},
]
```
- pseudo_human_feedback specifies whether we provide a ground-truth solution of the problem to the feedback-providing LLM: no_GT means we do not provide one (default setting), and GT means we do.
- feedback_form specifies the form of feedback: textual means the provider gives free-form textual feedback (default setting), and binary means we instruct the provider to give binary feedback.
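The four supported settings are simply the 2×2 grid of these two fields, which can be reconstructed programmatically (ordering here is the product order, not the order in the config file):

```python
from itertools import product

# Rebuild the 2x2 grid of supported feedback settings.
FEEDBACK_TYPES = [
    {"pseudo_human_feedback": gt, "feedback_form": form}
    for gt, form in product(["no_GT", "GT"], ["textual", "binary"])
]
print(len(FEEDBACK_TYPES))  # → 4
```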