Legacy results

September 15, 2024

Solvable Pass Rate:

| Method | I1 Instruction | I1 Category | I1 Tool | I2 Category | I2 Instruction | I3 Instruction | Average |
|---|---|---|---|---|---|---|---|
| GPT-3.5-Turbo-0613 (CoT) | 52.2±1.1 | 47.3±0.6 | 53.6±1.3 | 42.5±2.1 | 35.8±2.0 | 48.1±0.8 | 46.6±1.3 |
| GPT-3.5-Turbo-0613 (DFS) | 60.3±1.3 | 66.2±1.2 | 67.1±0.0 | 59.1±0.4 | 51.3±1.2 | 73.8±2.3 | 63.0±1.1 |
| GPT-4-0613 (CoT) | 45.5±0.4 | 57.4±0.3 | 48.8±0.7 | 43.0±0.7 | 46.5±0.9 | 48.1±1.5 | 48.2±0.8 |
| GPT-4-0613 (DFS) | 57.3±0.6 | 57.3±0.3 | 60.9±1.0 | 57.9±1.0 | 51.3±0.8 | 66.4±2.4 | 58.5±1.0 |
| ToolLLaMA v2 (CoT) | 32.3±1.0 | 40.3±0.8 | 36.7±0.5 | 34.7±0.7 | 25.2±0.4 | 33.9±1.5 | 33.9±0.8 |
| ToolLLaMA v2 (DFS) | 44.5±0.9 | 49.6±1.3 | 48.9±2.7 | 50.8±1.1 | 31.9±1.9 | 53.6±2.0 | 46.6±1.7 |
| GPT-3.5-Turbo-1106 (CoT) | 50.4±0.5 | 45.1±1.4 | 50.8±0.3 | 48.7±0.8 | 42.1±0.4 | 55.7±0.0 | 48.8±0.6 |
| GPT-3.5-Turbo-1106 (DFS) | 62.8±0.3 | 63.9±1.2 | 65.6±0.3 | 56.5±0.7 | 56.9±1.2 | 67.2±1.3 | 62.2±0.8 |
| GPT-4-Turbo-Preview (CoT) | 52.8±1.3 | 56.6±0.9 | 51.9±0.5 | 51.9±1.0 | 52.8±0.8 | 52.5±0.0 | 53.1±0.8 |
| GPT-4-Turbo-Preview (DFS) | 59.2±0.5 | 61.7±0.7 | 65.7±1.0 | 55.6±0.6 | 55.2±0.4 | 66.1±4.3 | 60.6±1.3 |

Solvable Win Rate: (Reference model: ChatGPT-CoT)

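| Method | I1 Instruction | I1 Category | I1 Tool | I2 Instruction | I2 Category | I3 Instruction | Average |
|---|---|---|---|---|---|---|---|
| GPT-3.5-Turbo-0613 (DFS) | 60.7 | 67.3 | 59.5 | 63.2 | 62.1 | 75.4 | 64.7 |
| GPT-4-0613 (CoT) | 54.6 | 58.8 | 58.2 | 75.5 | 60.5 | 62.3 | 61.7 |
| GPT-4-0613 (DFS) | 62.6 | 62.7 | 58.2 | 74.5 | 62.9 | 67.2 | 64.7 |
| ToolLLaMA v2 (CoT) | 31.3 | 28.1 | 33.5 | 35.8 | 33.9 | 24.6 | 31.2 |
| ToolLLaMA v2 (DFS) | 44.8 | 45.8 | 44.3 | 59.4 | 41.1 | 50.8 | 47.7 |
| GPT-3.5-Turbo-1106 (CoT) | 47.2 | 47.7 | 44.9 | 50.9 | 54.0 | 62.3 | 51.2 |
| GPT-3.5-Turbo-1106 (DFS) | 55.8 | 53.6 | 51.9 | 68.9 | 59.7 | 68.9 | 59.8 |
| GPT-4-Turbo-Preview (CoT) | 71.2 | 77.1 | 61.4 | 79.2 | 71.8 | 67.2 | 71.3 |
| GPT-4-Turbo-Preview (DFS) | 73.0 | 75.2 | 68.4 | 77.4 | 66.9 | 60.7 | 70.2 |
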
We run each model once against GPT-3.5-Turbo-0613 + CoT and evaluate the results three times. Following the ToolBench implementation, we take the most frequent result for each query as its final evaluation.
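
The "most frequent result" rule amounts to a per-query majority vote over the three evaluations. The snippet below is a minimal sketch of that idea, not the ToolBench code itself: the function name `majority_label`, the `"win"`/`"lose"` labels, and the query IDs are hypothetical and chosen only for illustration.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label among repeated evaluations of one query.

    Assumption: ties (possible only with an even number of evaluations)
    resolve to the label seen first, which is how Counter.most_common
    orders equal counts.
    """
    if not labels:
        raise ValueError("need at least one evaluation result")
    return Counter(labels).most_common(1)[0][0]


# Hypothetical evaluator outputs: three evaluations per query.
evaluations = {
    "q1": ["win", "lose", "win"],
    "q2": ["lose", "lose", "win"],
    "q3": ["win", "win", "win"],
    "q4": ["lose", "win", "lose"],
}

# Collapse the three evaluations of each query into one final label.
final_labels = {query: majority_label(votes) for query, votes in evaluations.items()}

# Win rate in percent: share of queries whose final label is "win".
wins = sum(1 for label in final_labels.values() if label == "win")
win_rate_percent = 100.0 * wins / len(final_labels)

print(final_labels)
print(f"win rate: {win_rate_percent:.1f}%")
```

With an odd number of evaluations and a binary label, the vote always has a strict winner, so the tie-breaking assumption never comes into play for the three-evaluation setup described above.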