Korean Wellness Chatbot Models

January 12, 2026

Formerly: WellnessConversation-LanguageModel

Korean README: README.ko.md

Korean wellness/counseling chatbot models built with PyTorch and Hugging Face Transformers. This repository is a research/learning snapshot and covers two approaches:

Text classification (query → category): KoELECTRA, KoBERT
Text generation (query → response): KoGPT2 (autoregressive)

What’s in this repo

1) Category classification (KoELECTRA / KoBERT)

Goal: given a user query, predict a wellness category (the dataset includes 359 category classes).

Typical training example format:

input: a user query sentence
label: an integer category id (0..358)
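
As a minimal sketch of the example format above, the record below pairs a query with an integer label and rejects labels outside 0..358. The class and field names (`ClassificationExample`, `text`, `label`) are illustrative assumptions, not the names used in dataloader/wellness.py, and the sample query and label are placeholders.

```python
from dataclasses import dataclass

NUM_CLASSES = 359  # number of category classes stated in this README


@dataclass(frozen=True)
class ClassificationExample:
    """One training example: a query sentence and its category id (hypothetical type)."""
    text: str
    label: int

    def __post_init__(self):
        # Valid ids are 0..358 inclusive.
        if not 0 <= self.label < NUM_CLASSES:
            raise ValueError(f"label {self.label} is outside 0..{NUM_CLASSES - 1}")


# Placeholder query and label, not taken from the dataset.
example = ClassificationExample(text="요즘 잠이 안 와요", label=12)
```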

See:

Training: train/run_koelectra.py, train/run_text_classification.py
Models: model/koelectra.py, model/kobert.py
Dataset loader: dataloader/wellness.py
Example inference: example/koelectra-wellness-qa.py, example/kobert-wellness-qa.py

In the example scripts, the predicted category is used to retrieve a canned answer from the dataset (randomly sampled among answers for the category).
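
The retrieval step described above can be sketched as follows: group the dataset's answers by category, then sample one answer for the predicted category. The function names and the toy rows are hypothetical; the example scripts may organize this differently.

```python
import random
from collections import defaultdict


def build_answer_index(rows):
    """Group answers by category id; `rows` is an iterable of (category_id, answer)."""
    index = defaultdict(list)
    for category_id, answer in rows:
        index[category_id].append(answer)
    return index


def pick_answer(index, predicted_category, rng=random):
    """Return a randomly sampled answer for the category, or None if it has none."""
    candidates = index.get(predicted_category)
    if not candidates:
        return None
    return rng.choice(candidates)


# Toy data for illustration; real answers come from the wellness dataset.
rows = [(0, "answer A"), (0, "answer B"), (1, "answer C")]
index = build_answer_index(rows)
reply = pick_answer(index, 0, random.Random(0))  # seeded for reproducibility
```

Passing a seeded `random.Random` keeps the sampled answer reproducible in tests, while the default uses the module-level generator.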

2) Response generation (KoGPT2)

Goal: train an autoregressive model to generate the next response given a user query.

Typical training example format:

input: question-answer pairs from the wellness dataset
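
To make "autoregressive" concrete: the model is trained to predict each token from the tokens before it, so the targets are the inputs shifted by one position. The sketch below shows that shift on a list of token ids. The ids are placeholders rather than real KoGPT2 tokenizer output, `next_token_pairs` is a hypothetical helper, and how train/run_auto_regressive.py actually serializes a pair is not specified here.

```python
def next_token_pairs(token_ids):
    """Split a token sequence into (inputs, targets) for next-token prediction."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction target")
    inputs = token_ids[:-1]   # every token except the last
    targets = token_ids[1:]   # the same sequence shifted left by one
    return inputs, targets


# Placeholder ids standing in for a tokenized "question + answer" sequence.
sequence = [101, 7, 42, 9, 102]
inputs, targets = next_token_pairs(sequence)
```

Hugging Face causal language model heads such as GPT2LMHeadModel apply this shift internally when given `labels`, so training code typically passes the same ids as both input and labels.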

See:

Training: train/run_auto_regressive.py
Model: model/kogpt2.py
Dataset loader: dataloader/wellness.py
Example inference: example/kogpt2-text-generation.py

Data

This repo does not ship the AI Hub dataset itself.

AI Hub mental health counseling dataset (Korean): requires signup and approval
http://www.aihub.or.kr/keti_data_board/language_intelligence
Additional public dataset often used in experiments: songys/Chatbot_data
https://github.com/songys/Chatbot_data

The original wellness dataset is organized around:

category / question / answer, with multiple candidate answers per category.

Dataset sizes used in this project (snapshot)

From the original Korean README:

Number of category classes: 359
Number of (query, category) pairs used for classification: 5231
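
For a rough sense of scale, the two figures above imply only a handful of examples per class on average. The short calculation below works that out, along with the expected accuracy of uniform random guessing as a floor for comparison. The per-class distribution is not stated here, so the average says nothing about how balanced the classes are.

```python
NUM_CLASSES = 359   # category classes, from the figures above
NUM_PAIRS = 5231    # (query, category) pairs, from the figures above

avg_per_class = NUM_PAIRS / NUM_CLASSES
uniform_guess_accuracy = 1 / NUM_CLASSES

print(f"{avg_per_class:.2f} examples per class on average")                 # 14.57
print(f"{uniform_guess_accuracy:.2%} accuracy for uniform random guessing")  # 0.28%
```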

Setup

Install dependencies:

pip install -r requirements.txt

Notes:

The code targets an older transformers release; requirements.txt pins the versions it was written against (e.g. transformers==3.0.2).
Training scripts assume local folders like data/ and checkpoint/ under the repo root.

Environment used (snapshot)

GPU: Colab Pro, P100
Core packages:
- kogpt2-transformers
- kobert-transformers
- transformers==3.0.2
- torch

Training (best-effort)

The training scripts use hard-coded paths (examples below). You may need to adjust them to match your local data layout.

KoELECTRA classification: train/run_koelectra.py
- expects data/wellness_dialog_for_text_classification_train.txt
- saves checkpoint to checkpoint/koelectra-wellnesee-text-classification.pth
KoBERT classification: train/run_text_classification.py
- expects data/wellness_dialog_for_text_classification_train.txt
- saves checkpoint to checkpoint/kobert-wellnesee-text-classification.pth
KoGPT2 generation: train/run_auto_regressive.py
- expects data/wellness_dialog_for_autoregressive_train.txt
- saves checkpoint to checkpoint/kogpt2-wellnesee-auto-regressive.pth
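
Because the paths are hard-coded, it can help to check the layout before starting a long training run. The sketch below is a hypothetical pre-flight helper, not part of the repo: it creates checkpoint/ if it is missing and reports which of the input files listed above are absent. It assumes the scripts resolve paths relative to the repo root.

```python
from pathlib import Path

# Input files named in this README, relative to the repo root.
EXPECTED_INPUTS = (
    "data/wellness_dialog_for_text_classification_train.txt",
    "data/wellness_dialog_for_autoregressive_train.txt",
)


def preflight(repo_root="."):
    """Create checkpoint/ if needed and return the expected inputs that are missing."""
    root = Path(repo_root)
    (root / "checkpoint").mkdir(parents=True, exist_ok=True)
    return [rel for rel in EXPECTED_INPUTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = preflight()
    for rel in missing:
        print(f"missing input: {rel}")
```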

preprocess/training_data.py contains helper functions used during dataset preparation (Excel → text files, splitting train/test, etc.). It is best treated as a reference script; you may need to adapt it depending on how you store the dataset locally.

Serving (Flask)

A simple REST API is provided:

service/api.py
- /api/wellness/dialog/bert?s=...
- /api/wellness/dialog/electra?s=...
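
Korean text in the `s` parameter needs to be percent-encoded when the request URL is built by hand. The sketch below builds such a URL with the standard library. The base URL is an assumption (Flask's default development host and port); service/api.py may bind elsewhere, and the response format is not covered here. `dialog_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://127.0.0.1:5000"  # assumption: Flask's default dev server address
MODELS = ("bert", "electra")        # the two endpoints listed above


def dialog_url(model, sentence, base_url=BASE_URL):
    """Build the request URL for one of the dialog endpoints."""
    if model not in MODELS:
        raise ValueError(f"unknown model {model!r}; expected one of {MODELS}")
    query = urlencode({"s": sentence})  # percent-encodes non-ASCII text as UTF-8
    return f"{base_url}/api/wellness/dialog/{model}?{query}"


url = dialog_url("electra", "요즘 잠이 안 와요")
# Decoding the query string recovers the original sentence.
decoded = parse_qs(urlsplit(url).query)["s"][0]
```

The resulting URL can be passed to any HTTP client, for example `urllib.request.urlopen`, once the Flask service is running.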

Project period

June 2020 to July 2020

References

KoBERT: https://github.com/SKTBrain/KoBERT
KoBERT-Transformers: https://github.com/monologg/KoBERT-Transformers
KoGPT2: https://github.com/SKT-AI/KoGPT2
KoGPT2-Transformers: https://github.com/taeminlee/KoGPT2-Transformers/
KoELECTRA: https://github.com/monologg/KoELECTRA
enlipleai/kor_pretrain_LM: https://github.com/enlipleai/kor_pretrain_LM
Hugging Face blog (text generation): https://huggingface.co/blog/how-to-generate