Korean Wellness Chatbot Models

January 12, 2026

Formerly: WellnessConversation-LanguageModel

Korean README: README.ko.md

Korean wellness/counseling chatbot models built with PyTorch and Hugging Face Transformers. This repository is a research/learning snapshot that covers two main directions:

  • Text classification (query → category): KoELECTRA, KoBERT
  • Text generation (query → response): KoGPT2 (autoregressive)

What’s in this repo

1) Category classification (KoELECTRA / KoBERT)

Goal: given a user query, predict a wellness category (the dataset includes 359 category classes).

Typical training example format:

  • input: a user query sentence
  • label: an integer category id (0..358)

See:

  • Training: train/run_koelectra.py, train/run_text_classification.py
  • Models: model/koelectra.py, model/kobert.py
  • Dataset loader: dataloader/wellness.py
  • Example inference: example/koelectra-wellness-qa.py, example/kobert-wellness-qa.py

In the example scripts, the predicted category is used to retrieve a canned answer from the dataset (randomly sampled among answers for the category).
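
The retrieval step described above can be sketched roughly as follows. This is an illustration only, not the code from the example scripts: `answers_by_category`, `pick_answer`, and the sample answers are hypothetical, and the classifier is assumed to have already produced an integer category id.

```python
import random
from typing import Dict, List, Optional

# Hypothetical lookup table: category id -> candidate canned answers.
answers_by_category: Dict[int, List[str]] = {
    0: ["answer A for category 0", "answer B for category 0"],
    1: ["answer for category 1"],
}


def pick_answer(
    category_id: int,
    table: Dict[int, List[str]],
    rng: Optional[random.Random] = None,
) -> str:
    """Return one randomly chosen canned answer for the predicted category."""
    candidates = table.get(category_id)
    if not candidates:
        raise KeyError(f"no answers registered for category {category_id}")
    # Fall back to the module-level generator when no seeded RNG is supplied.
    chooser = rng if rng is not None else random
    return chooser.choice(candidates)
```

Passing a seeded `random.Random` makes the choice reproducible, which can help when comparing runs.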

2) Response generation (KoGPT2)

Goal: train an autoregressive model to generate the next response given a user query.

Typical training example format:

  • input: question-answer pairs from the wellness dataset
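
One common way to feed question-answer pairs to an autoregressive model is to join each pair into a single sequence. The sketch below assumes that layout; the delimiter strings and the helper names `build_training_text` and `build_prompt` are hypothetical, and the actual format produced by dataloader/wellness.py may differ.

```python
# Assumed sequence delimiters; the real tokenizer's special tokens may differ.
BOS = "<s>"
EOS = "</s>"


def build_training_text(question: str, answer: str,
                        bos: str = BOS, eos: str = EOS) -> str:
    """Join one question-answer pair into a single string for causal LM training."""
    q = question.strip()
    a = answer.strip()
    if not q or not a:
        raise ValueError("question and answer must both be non-empty")
    return f"{bos}{q}{eos}{bos}{a}{eos}"


def build_prompt(question: str, bos: str = BOS, eos: str = EOS) -> str:
    """Prefix given to the model at inference time; generation continues after it."""
    return f"{bos}{question.strip()}{eos}{bos}"
```

With this layout, the inference prompt is a prefix of the training text, so the model is asked to continue exactly where the answer began during training.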

See:

  • Training: train/run_auto_regressive.py
  • Model: model/kogpt2.py
  • Dataset loader: dataloader/wellness.py
  • Example inference: example/kogpt2-text-generation.py

Data

This repo does not ship the AI Hub dataset itself.

The original wellness dataset is organized as category / question / answer records, with multiple candidate answers per category.

Dataset sizes used in this project (snapshot)

From the original Korean README:

  • Number of category classes: 359
  • Number of (query, category) pairs used for classification: 5231

Setup

Install dependencies:

pip install -r requirements.txt

Notes:

  • The code was written against older transformers versions (see requirements.txt, e.g. transformers==3.0.2).
  • Training scripts assume local folders like data/ and checkpoint/ under the repo root.

Environment used (snapshot)

  • GPU: Colab Pro, P100
  • Core packages:
    • kogpt2-transformers
    • kobert-transformers
    • transformers==3.0.2
    • torch

Training (best-effort)

The training scripts use hard-coded paths (examples below). You may need to adjust them to match your local data layout.

  • KoELECTRA classification: train/run_koelectra.py
    • expects data/wellness_dialog_for_text_classification_train.txt
    • saves checkpoint to checkpoint/koelectra-wellnesee-text-classification.pth
  • KoBERT classification: train/run_text_classification.py
    • expects data/wellness_dialog_for_text_classification_train.txt
    • saves checkpoint to checkpoint/kobert-wellnesee-text-classification.pth
  • KoGPT2 generation: train/run_auto_regressive.py
    • expects data/wellness_dialog_for_autoregressive_train.txt
    • saves checkpoint to checkpoint/kogpt2-wellnesee-auto-regressive.pth
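
Because the paths are hard-coded, it can help to check the layout before starting a run. The sketch below is not part of the repository; `missing_inputs` and `ensure_checkpoint_dir` are hypothetical helpers, and the file names are copied from the list above.

```python
from pathlib import Path
from typing import List

# Input files named in the list above, relative to the repo root.
EXPECTED_DATA_FILES = [
    "data/wellness_dialog_for_text_classification_train.txt",
    "data/wellness_dialog_for_autoregressive_train.txt",
]


def missing_inputs(repo_root: str = ".") -> List[str]:
    """Return the expected training files that are not present under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_DATA_FILES if not (root / rel).is_file()]


def ensure_checkpoint_dir(repo_root: str = ".") -> Path:
    """Create checkpoint/ under repo_root if needed and return its path."""
    ckpt = Path(repo_root) / "checkpoint"
    ckpt.mkdir(parents=True, exist_ok=True)
    return ckpt
```

Calling `missing_inputs()` from the repo root before training gives an early, readable error list instead of a failure partway through a script.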

Preprocessing

preprocess/training_data.py contains helper functions used during dataset preparation (Excel → text files, splitting train/test, etc.). It is best treated as a reference script; you may need to adapt it depending on how you store the dataset locally.
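
As a rough illustration of the train/test splitting step, the sketch below shuffles lines with a fixed seed and cuts off a test portion. It is not the logic in preprocess/training_data.py; the function name `split_train_test`, the 80/20 ratio, and the seed are assumptions chosen for the example.

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(
    lines: Sequence[str],
    test_ratio: float = 0.2,  # assumed ratio, not taken from the repo
    seed: int = 42,           # fixed seed so the split is reproducible
) -> Tuple[List[str], List[str]]:
    """Shuffle lines deterministically and return (train, test)."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one test example whenever there is any data.
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]
```

With 359 classes and 5231 pairs, some categories have few examples, so a purely random split like this one may leave a category out of the test set; a per-category split would avoid that.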

Serving (Flask)

A simple REST API is provided:

  • service/api.py
    • /api/wellness/dialog/bert?s=...
    • /api/wellness/dialog/electra?s=...

Project period

2020.06 ~ 2020.07

References