SyncBench: Agent Out-of-Sync Benchmark

February 12, 2025 ยท View on GitHub

SyncBench

Alt text

๐Ÿ€1. Environment Setup

To use SyncMind and SyncBench:

git clone https://github.com/xhguo7/SyncMind.git
  • Quick Install:

    conda env create -f environment.yml
    conda activate syncmind
    
  • Env Setup:

    cd your_desired_root_dir
    git clone https://github.com/xhguo7/SyncMind.git
    cd SyncMind
    python -m pip install -e .
    

๐Ÿ“ 2. Dataset Construction

(1) Set params

cd SyncMind/scripts/construction.sh
  • Set root_path to the dir with enough space to save generated benchmark instances

    ROOT_PATH="/home/xuehangg/"
    
  • Set the path to source repositories

    DATA_PATH="./source/my_repo_dict.json"
    
  • Set dataset type: caller or callee

    DATASET='caller'
    
  • Define function and method filtering strictness

    STRICT_FILTERING=0  # 0: not strict | 1: strict (may result in no filtered data being collected)
    

    If set STRICT_FILTERING=1, what will be filtered out?

    • Filter out functions with zero arguments
    • Filter out functions with no return statements
    • Filter out functions with literal return values
    • Filter out functions without a docstring
    • Filter out functions with bad names (e.g., "test", "temp", or "sample")
    • Filter out functions that are 5 lines or shorter
    • Filter out functions with syntax errors in the code
    • Filters out dunder functions (e.g., functions starting with '__')
  • Define execution test timeout

    TIMEOUT=600
    
  • Define the maximum length of data to be filtered

    MAX_LENGTH=1000
    
  • Set source repository range: [CONSTRUCT_START, CONSTRUCT_END), start from 0

    For example, if constructing SyncBench based on source repositories with ID 1-3:

    CONSTRUCT_START=0
    CONSTRUCT_END=3
    
  • Set out-of-sync mode

    [Execution test filtering mode]

    • fp: fail-to-pass only
    • pp: pass-to-pass only
    • both: fail-to-pass and pass-to-pass
    TEST_MODE="fp"
    
  • Set commit tracing mode

    • Trace all commits that satisfy TEST_MODE: TRACE_MODE=0
    • Trace only the oldest commit that satisfies TEST_MODE: TRACE_MODE=0
    TRACE_MODE=0
    

(2) (Optional) Check Gits

  • If would like to check git commits before constructing SyncBench
    cd SyncMind
    bash ./scripts/git.sh
    

(3) SyncBench Construction

  • Construct SyncBench

    cd SyncMind
    bash ./scripts/construction.sh
    

    This will save both the structured data in .json format and the instantiated data in .csv format

    • JSON data: will be saved to ./syncbench_build/dataset in .json format
    • CSV data: will be saved to ./syncbench_build/syncbench in .csv format

    Where syncbench_build shares the same parent directory as SyncMind.

(4) (Optional) SyncBench Instantiation

  • Want to customize instances? Run syncbench.sh:
    • To instantiate JSON data into CSV instances (after running syncbench construction bash ./scripts/construction.sh to generate JSON data):

      cd SyncMind
      bash ./scripts/syncbench.sh
      

      This will convert structured .json data into instantiated datasets in .csv format

      • CSV data: will be saved to ./syncbench_build/syncbench in .csv format

      Where syncbench_build shares the same parent directory as SyncMind.

  • Noted that this step in totally optional, just in case if you would like to change instance attributes.
    • Running construction.sh already includes this instantiation step with default attributes for agent out-of-sync recovery evaluation.

๐ŸŽจ3. Customize Your SyncBench

In our current version, SyncBench is built upon 21 popular GitHub repositories.

SyncBench can be readily scale up by applying to diverse qualified Python repositories, and can also be quickly downsampled to smaller evaluation subsets.

๐Ÿ“ˆ 3.1 Scale Up

SyncBench can be readily scale up by applying to diverse Python repositories that meet the following prerequisites:

  • Have Python as the primary language
  • Possess well-developed unit tests
  • (Optional) Support easy env setup is a plus, be not required
    • Repositories with env setup files, such as setup.py, .toml, .yml, etc., can help quickly build up the docker environment
    • Meanwhile, please be rest assured that you can also manually specified certain packages to install when your selected repositories may not include these env setup files.
      • Need manual package installation? See examples at my_repo_dict.json
        • Add custom source repos to this file (my_repo_dict.json) and then run construction.sh, SyncBench construction will install your specified packages automatically in corresponding isolated environments.

3.1.1 Prepare Source Repo

Source Repository

Edit source repo at: ./source/my_repo_dict.json

  • Append new source repositories to this dictionary
  • One may preset environment dependencies in this dictionary if the source repository does not prepare environment setup necessities

Set source repo in SyncBench construction command to specify which source repositories to use.

3.1.2 Let's Expand SyncBench!

(1) (Optional) Check gits

cd SyncMind
bash ./scripts/git.sh

(2) Construct datasets

cd SyncMind
bash ./scripts/construction.sh

This will save the constructed datasets in .json format

(3) (Optional) Instantiate SyncBench for Agent Out-of-Sync Recovery

cd SyncMind
bash ./scripts/syncbench.sh

This will save the constructed datasets in .csv format

๐Ÿ“‰ 3.2 Scale Down

For small-scale evaluation, SyncBench can be readily downsampled to fewer instances:

(1) 300 Instances:

  • We have sampled a small evaluation dataset through weighted downsampling: 300 Instances
    • 300 out-of-sync instances derived from 21 GitHub repositories
      • 150 Caller instances
      • 150 Callee instances

(2) Custom subset

  • Choose a proper method to downsample a custom SyncBench subset