StackOverflow Dataset Generation
November 27, 2023
The SQL parser was built for Python 2 and only works in a Python 2 environment. Because NaturalCC is designed for Python 3, we have already processed the SQL/C#/Python data in a Python 2 environment and saved the results in stack_overflow.zip. If you are interested in the data processing itself, you can follow the original stack_overflow.
Step 1. Download StackOverflow C#/SQL/Python datasets
bash dataset/stack_overflow/download.sh
Step 2. SQL Generation
- flatten SQL code/docstring at ~/stack_overflow/flatten/sql
python -m dataset.stack_overflow.flatten -l sql
- extract the pre-generated SQL token files to ~/stack_overflow/flatten/sql
unzip dataset/stack_overflow/sql_tokens.zip -d ~/stack_overflow/flatten/sql
- binarize SQL dataset
python -m dataset.stack_overflow.summarization.preprocess -f config/sql
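For readers unfamiliar with the term, "binarize" in pipelines of this kind usually means mapping each token to an integer ID and storing the ID sequences in a compact binary form. The sketch below illustrates that general idea only. It is not the code behind dataset.stack_overflow.summarization.preprocess, and the names build_vocab, encode, and the <unk> symbol are hypothetical.

```python
from array import array
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary tokens


def build_vocab(token_lists, min_count=1):
    """Assign an integer ID to every token seen at least `min_count` times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {UNK: 0}
    # Most frequent tokens get the smallest IDs; ties are broken alphabetically.
    for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Turn a token list into a compact array of C ints (unknown tokens map to UNK)."""
    return array("i", (vocab.get(tok, vocab[UNK]) for tok in tokens))


# Two already-tokenized SQL snippets (made-up examples, not dataset records).
corpus = [
    ["select", "name", "from", "users"],
    ["select", "id", "from", "orders"],
]
vocab = build_vocab(corpus)
ids = encode(["select", "email", "from", "users"], vocab)
raw = ids.tobytes()  # bytes that could be written to a binary file
```

Here "email" never appears in the corpus, so it is encoded with the ID reserved for <unk>.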
Step 3. C# Generation
- install antlr4-python3-runtime
pip install antlr4-python3-runtime==4.5.2
- flatten C# code/docstring at ~/stack_overflow/flatten/csharp
python -m dataset.stack_overflow.flatten -l csharp
- tokenize code/docstring into code_token/docstring_token
python -m dataset.stack_overflow.tokenization -l csharp
Since generating code_token/docstring_token is slow, you can instead extract the pre-generated token files to ~/stack_overflow/flatten/csharp
unzip dataset/stack_overflow/csharp_tokens.zip -d ~/stack_overflow/flatten/csharp
- binarize C# dataset
python -m dataset.stack_overflow.summarization.preprocess -f config/csharp
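To make the tokenize step more concrete, the sketch below splits a code string and a docstring into token lists. It is only an illustration: the repository's C# tokenizer presumably relies on ANTLR (given the antlr4-python3-runtime requirement), whereas this stand-in uses a simple regular expression, and the function names tokenize_code and tokenize_docstring are hypothetical.

```python
import re

# An identifier, a number, or any single non-space symbol.
CODE_TOKEN = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")


def tokenize_code(code):
    """Split source code into identifier, number, and symbol tokens."""
    return CODE_TOKEN.findall(code)


def tokenize_docstring(docstring):
    """Lowercase a natural-language description and split it into words."""
    return re.findall(r"[a-z0-9]+", docstring.lower())


# Made-up C# snippet and description, not taken from the dataset.
code_token = tokenize_code("int n = list.Count;")
docstring_token = tokenize_docstring("Count items in a List")
```

A grammar-based tokenizer would additionally keep multi-character operators and string literals together, which this regular expression does not attempt.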
Step 4. Python Generation
- flatten Python code/docstring at ~/stack_overflow/flatten/python
python -m dataset.stack_overflow.flatten -l python
- tokenize code/docstring into code_token/docstring_token
python -m dataset.stack_overflow.tokenization -l python
- binarize Python dataset
python -m dataset.stack_overflow.summarization.preprocess -f config/python
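The flatten, tokenize, and binarize commands follow the same pattern for each language, so they can be chained in a small driver. The following is a minimal sketch, assuming it is run from the repository root and that the module paths are spelled as in Steps 2 and 3. The helper names build_steps and run_steps are hypothetical and not part of NaturalCC.

```python
import shlex
import subprocess


def build_steps(lang):
    """Return the flatten, tokenize, and binarize commands for `lang`."""
    return [
        f"python -m dataset.stack_overflow.flatten -l {lang}",
        f"python -m dataset.stack_overflow.tokenization -l {lang}",
        f"python -m dataset.stack_overflow.summarization.preprocess -f config/{lang}",
    ]


def run_steps(commands, dry_run=True):
    """Print each command and, unless `dry_run` is set, execute it."""
    for cmd in commands:
        print("$", cmd)
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps are skipped when an earlier one fails.
            subprocess.run(shlex.split(cmd), check=True)


run_steps(build_steps("python"))  # dry run: only prints the three commands
```

Passing dry_run=False would actually execute the commands. The same helper could be called with "csharp"; for SQL the tokenization step would need to be replaced by extracting sql_tokens.zip, as described in Step 2.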