Dataset Usage Examples

December 10, 2021

The current dataset contains samples generated from 6 open-source projects, namely OpenSSL, FFmpeg, HTTPD, NGINX, Libtiff, and Libav.

For each project, there are 3 pickle.gz files, e.g. nginx_after_fix_extractor_0.pickle.gz, nginx_labeler_1.pickle.gz, and nginx_labeler_0.pickle.gz. They are generated by two slightly different extractors (see the label_source field in the sample format description).

Viewing the Samples in Pickle Files

Each pickle.gz file contains compressed samples in JSON (e.g. auto_labeler_0.json).

The function print_sample() in read_pickled_samples.py reads the JSON objects from a pickle file and decodes the compressed static analysis output.

Note that by default, print_sample() only reads the first issue in each file, so it will display "1 issues loaded" even though the file may contain more issues. You can comment out lines 36-37 (if cnt == 1: break) to load all issues in the file.
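If you would rather control the number of loaded issues from your own code than edit the script, a loop like the following may help. It is a minimal sketch, not the code in read_pickled_samples.py: the helper name iter_pickled_objects and its limit parameter are hypothetical, and it assumes the objects are pickled back-to-back inside the gzip stream. It does not decode the compressed static analysis output.

```python
import gzip
import pickle


def iter_pickled_objects(path, limit=None):
    """Yield objects stored in a gzip-compressed pickle file.

    Assumption: the file holds several objects pickled one after another.
    `limit` caps how many objects are read (None means read everything).
    """
    with gzip.open(path, "rb") as fp:
        count = 0
        while limit is None or count < limit:
            try:
                obj = pickle.load(fp)
            except EOFError:
                # End of the gzip stream: no more pickled objects.
                break
            yield obj
            count += 1
```

With this, something like `list(iter_pickled_objects("nginx_labeler_1.pickle.gz", limit=1))` would mimic the default "first issue only" behaviour, and omitting limit would read the whole file. As with any pickle data, only load files you trust.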

Data Preparation Example

Split

We provide a global split file splits.csv, which specifies the train, dev, and test sets:

id,split,project
httpd_82b42a45bba53a76fbf167dfe944131e785f5514_1,dev,httpd
...
httpd_1bd8218a89d7b01a14f6172cacfe0e61bee86689_1,test,httpd
...
httpd_598682ce281bf6f4783e9ad3b09639c1686add8e_1,train,httpd
...

For example, the sample identified by httpd_82b42a45bba53a76fbf167dfe944131e785f5514_1 belongs to the dev set.
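The split file can be read with the standard csv module. The sketch below groups sample ids by split using the id and split columns from the header shown above; the function name load_splits is hypothetical, and the demo uses only the three rows quoted in this section rather than the real file.

```python
import csv
import io
from collections import defaultdict


def load_splits(fileobj):
    """Return {split_name: [sample_id, ...]} from a splits.csv-style file."""
    ids_by_split = defaultdict(list)
    for row in csv.DictReader(fileobj):
        ids_by_split[row["split"]].append(row["id"])
    return dict(ids_by_split)


# Demo on the three rows quoted above (the "..." lines are omitted).
demo = io.StringIO(
    "id,split,project\n"
    "httpd_82b42a45bba53a76fbf167dfe944131e785f5514_1,dev,httpd\n"
    "httpd_1bd8218a89d7b01a14f6172cacfe0e61bee86689_1,test,httpd\n"
    "httpd_598682ce281bf6f4783e9ad3b09639c1686add8e_1,train,httpd\n"
)
splits = load_splits(demo)
print({name: len(ids) for name, ids in splits.items()})
# prints {'dev': 1, 'test': 1, 'train': 1}
```

For the real file, pass an open handle instead, e.g. `open("splits.csv", newline="")`.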

Note:

- The sample ids are unique.
- If you see sample ids like openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_0 and openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_1, it doesn't mean they are the same sample with conflicting labels. Instead, they are different samples:
  - openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_1 is an auto_labeler sample (see Sample Types) from openssl_labeler_1.pickle.gz.
  - openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_0 is an after_fix_extractor sample from openssl_after_fix_extractor_0.pickle.gz. Basically, we took the positive auto-labeler sample openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_1, extracted the corresponding functions from the after-fix version, produced the corresponding after-fix-extractor sample and assigned 0 as its label.
- The splits.csv doesn't have the sample types. The data preparation script will add the sample types in the output.
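To spot such pairs programmatically, the ids can be split into their parts. The sketch below assumes ids have the form project_hash_suffix, which the examples above suggest but this section does not state outright; the names SampleId and parse_sample_id are hypothetical, and treating the middle part as a commit hash and the suffix as the label is an assumption.

```python
from typing import NamedTuple


class SampleId(NamedTuple):
    project: str
    commit: str  # assumed to be a commit hash
    suffix: int  # trailing 0 or 1, assumed to be the label


def parse_sample_id(sample_id):
    """Split an id of the assumed form <project>_<hash>_<suffix>."""
    # rsplit from the right so the last two fields are always hash and suffix.
    project, commit, suffix = sample_id.rsplit("_", 2)
    return SampleId(project, commit, int(suffix))


a = parse_sample_id("openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_0")
b = parse_sample_id("openssl_fffe56733db3d1a8a2c81c40dedb4f0103a4406a_1")

# Same project and hash, different suffix: two distinct samples.
print(a.commit == b.commit, a.suffix, b.suffix)
# prints True 0 1
```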

Script

The example script split_data.py takes the splits.csv (line 9) and the folder with the pickle.gz files (line 13) as input. It creates the inputs for BERT model training and testing in a folder specified at line 14.
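The core of such a preparation step is joining the loaded samples with the split assignment. The following is a rough sketch of that join only, not the logic of split_data.py: the function bucket_by_split is hypothetical, the "id" key on each sample is an assumption about the sample format, and the BERT-specific output files are not reproduced here.

```python
from collections import defaultdict


def bucket_by_split(samples, split_of, id_key="id"):
    """Group sample dicts by split.

    `split_of` maps sample id -> split name (e.g. built from splits.csv).
    Samples whose id is missing from `split_of` are collected under None.
    """
    buckets = defaultdict(list)
    for sample in samples:
        buckets[split_of.get(sample[id_key])].append(sample)
    return dict(buckets)


# Tiny demo with made-up sample dicts; only the ids come from this document.
split_of = {
    "httpd_82b42a45bba53a76fbf167dfe944131e785f5514_1": "dev",
    "httpd_598682ce281bf6f4783e9ad3b09639c1686add8e_1": "train",
}
samples = [
    {"id": "httpd_82b42a45bba53a76fbf167dfe944131e785f5514_1"},
    {"id": "httpd_598682ce281bf6f4783e9ad3b09639c1686add8e_1"},
]
buckets = bucket_by_split(samples, split_of)
print(sorted(buckets))
# prints ['dev', 'train']
```

Collecting unmatched samples under None rather than dropping them makes it easy to notice ids that are absent from the split file.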