Data Pipeline Ablation study

April 22, 2025 ยท View on GitHub

This includes the experiments answering the following research questions RQ5, RQ6, RQ7, RQ8 mentioned in the paper:

  • RQ5. Impact of component granularity: How do different component granularities affect the trained models' performance?
  • RQ6. Impact of component selection: How do different component selection strategies affect the trained models' performance?
  • RQ7. Impact of model size: How does the size of the model used for component rewriting affect the trained models' performance?
  • RQ8. Ground-truth extraction strategies: How well does the model perform when being trained on reverse patch diff compared to that when being trained on SWE-Synth with rollout?

Setup for RQ5, RQ6, RQ7, RQ8

After created the synthetic dataset, you need to install llama-factory and moatless as described in setup instructions

bash swesynth/lib/llama_factory/install-llama-factory.sh
bash swesynth/lib/moatless/install-moatless-fork.sh

RQ5 + RQ6

Create the synthetic dataset using the following notebook component-ablation.ipynb

The generated dataset can be later used for training models with different component granularity and selection strategies using the following command:

llamafactory-cli train swesynth/experiments/data_pipeline_ablation/component_granularity/ablation_EmptyClassStrategy.yaml
llamafactory-cli train swesynth/experiments/data_pipeline_ablation/component_granularity/ablation_EmptyFunctionStrategy.yaml
llamafactory-cli train swesynth/experiments/data_pipeline_ablation/component_selection/ablation_PriorityAwareMutationStrategy.yaml

RQ7

Create the synthetic dataset using the following command:

bash swesynth/experiments/data_pipeline_ablation/mutator_model_size/collect.sh

The generated dataset can be later used for training models with different sizes using the following command:

llamafactory-cli train swesynth/experiments/data_pipeline_ablation/mutator_model_size/ablation_05.yaml
llamafactory-cli train swesynth/experiments/data_pipeline_ablation/mutator_model_size/ablation_3B.yaml
llamafactory-cli train swesynth/experiments/data_pipeline_ablation/mutator_model_size/ablation_14B.yaml

RQ8

First we need to prepare Gold RAG dataset

bash swesynth/experiments/data_pipeline_ablation/groundtruth_generation/create-data.sh

Then sampling 3 times

bash swesynth/experiments/data_pipeline_ablation/groundtruth_generation/rejection-sampling-rollout.sh

The generated dataset can be later used for training RAG reverse patch diff model and RAG rejection sampling model using the following command:

llamafactory-cli train swesynth/experiments/data_pipeline_ablation/groundtruth_generation/gold-rollout.yaml
llamafactory-cli train swesynth/experiments/data_pipeline_ablation/groundtruth_generation/reverse.yaml