Dataset Usage Examples
December 10, 2021
The current dataset contains samples generated from 6 open-source projects, namely OpenSSL, FFmpeg, HTTPD, NGINX, Libtiff, and Libav.
For each project, there are 3 pickle.gz files like nginx_after_fix_extractor_0.pickle.gz, nginx_labeler_1.pickle.gz, and nginx_labeler_0.pickle.gz, which are generated by two slightly different extractors (see label_source field in the sample format description).
Viewing the Samples in Pickle Files
Each pickle.gz file contains compressed samples in JSON (e.g. auto_labeler_0.json).
The function print_sample() in read_pickled_samples.py reads the JSON objects from a pickle file and decodes the compressed static analysis output.
Note that by default, print_sample() only reads the first issue in each file, so it will display 1 issues loaded even though there may be more issues in the file. You can comment out lines 36-37 (if cnt == 1: break) to load all issues in the file.
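The reading loop described above can be sketched as follows. This is a minimal sketch, not the code in read_pickled_samples.py: it assumes that each pickle.gz file is a gzip stream with one pickled object per sample written back to back, and the helper name iter_samples and its max_samples parameter are hypothetical.

```python
import gzip
import pickle


def iter_samples(path, max_samples=None):
    """Yield unpickled objects from a gzip-compressed pickle stream.

    Assumption: the objects were written one after another with
    pickle.dump(), so pickle.load() can be called until EOF.
    """
    with gzip.open(path, "rb") as fp:
        count = 0
        # Stop early when a limit is given, otherwise read to the end.
        while max_samples is None or count < max_samples:
            try:
                yield pickle.load(fp)
            except EOFError:
                break
            count += 1
```

Calling iter_samples(path, max_samples=1) mirrors the default behaviour of reading only the first issue, while omitting max_samples reads every object in the file.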
Data Preparation Example
Split
We provide a global split file splits.csv, which specifies the train, dev, and test sets:
id,split,project
httpd_82b42a45bba53a76fbf167dfe944131e785f5514_1,dev,httpd
...
httpd_1bd8218a89d7b01a14f6172cacfe0e61bee86689_1,test,httpd
...
httpd_598682ce281bf6f4783e9ad3b09639c1686add8e_1,train,httpd
...
For example, the sample identified by httpd_82b42a45bba53a76fbf167dfe944131e785f5514_1 belongs to the dev set.
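As an illustration, the split file can be loaded with the standard csv module. This is a minimal sketch: the column names id, split, and project come from the header shown above, while the helper name load_splits is hypothetical and the in-memory example only contains the three rows from the excerpt.

```python
import csv
import io
from collections import defaultdict


def load_splits(csv_file):
    """Map each split name to the list of sample ids assigned to it."""
    ids_by_split = defaultdict(list)
    for row in csv.DictReader(csv_file):
        ids_by_split[row["split"]].append(row["id"])
    return dict(ids_by_split)


# Three rows copied from the excerpt above; the real file has many more.
example = io.StringIO(
    "id,split,project\n"
    "httpd_82b42a45bba53a76fbf167dfe944131e785f5514_1,dev,httpd\n"
    "httpd_1bd8218a89d7b01a14f6172cacfe0e61bee86689_1,test,httpd\n"
    "httpd_598682ce281bf6f4783e9ad3b09639c1686add8e_1,train,httpd\n"
)
splits = load_splits(example)
print({name: len(ids) for name, ids in splits.items()})
```

For the real file, pass an open file handle instead, e.g. open("splits.csv", newline="").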
Note:
- The sample ids are unique.
- If you see sample ids like openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_0 and openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_1, it doesn't mean they are the same sample with conflicting labels. Instead, they are different samples: openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_1 is an auto_labeler sample (see Sample Types) from openssl_labeler_1.pickle.gz. openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_0 is an after_fix_extractor sample from openssl_after_fix_extractor_0.pickle.gz. Basically, we took the positive auto-labeler sample openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_1, extracted the corresponding functions from the after-fix version, produced the corresponding after-fix-extractor sample and assigned 0 as its label.
- The splits.csv doesn't have the sample types. The data preparation script will add the sample types in the output.
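To spot the paired ids mentioned in the note, the ids can be split into their parts and grouped. This is a minimal sketch under stated assumptions: it assumes ids have the form project, a 40-character hash, and a trailing digit separated by underscores, and that project names contain no underscore. The helper names parse_sample_id and group_by_hash are hypothetical.

```python
from collections import defaultdict


def parse_sample_id(sample_id):
    """Split '<project>_<hash>_<suffix>' into its three parts.

    Assumption: the project name contains no underscore, so splitting
    twice from the right is enough.
    """
    project, hash_part, suffix = sample_id.rsplit("_", 2)
    return project, hash_part, suffix


def group_by_hash(sample_ids):
    """Group ids that share the same project and hash."""
    groups = defaultdict(list)
    for sample_id in sample_ids:
        project, hash_part, _ = parse_sample_id(sample_id)
        groups[(project, hash_part)].append(sample_id)
    return dict(groups)


ids = [
    "openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_0",
    "openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_1",
]
pairs = group_by_hash(ids)
```

Groups with two entries correspond to an auto_labeler sample and the after_fix_extractor sample derived from it, as described in the note.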
Script
The example script split_data.py takes the splits.csv (line 9) and the folder with the pickle.gz files (line 13) as input. It creates the inputs for BERT model training and testing in a folder specified at line 14.