Ttuyssubot
October 23, 2019
Contextual Spacing for Conversation-style (and non-normalized) Text
Requirements
fasttext, Keras (TensorFlow), NumPy
Word Vector
Pretrained 100-dimensional fastText vectors
- Download the pretrained vectors and unzip the folder into the same directory as 'csct.py'
- The model is then loaded with load_model('vectors/model')
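A misplaced archive is the most likely cause of a failed load, so a quick layout check before running the script may help. The sketch below is not part of this repository: the helper name `find_vector_files` is hypothetical. It assumes only what the load_model('vectors/model') call implies, namely that a folder called `vectors` sits next to 'csct.py' and contains at least one file whose name starts with `model`. The actual file names inside the archive are not stated here.

```python
from pathlib import Path
from typing import List


def find_vector_files(script_dir: str = ".") -> List[Path]:
    """Return files under <script_dir>/vectors whose name starts with 'model'.

    Hypothetical helper: the folder name 'vectors' and the 'model' prefix are
    inferred from the load_model('vectors/model') call, not from the archive.
    """
    vectors_dir = Path(script_dir) / "vectors"
    if not vectors_dir.is_dir():
        return []
    return sorted(
        p for p in vectors_dir.iterdir()
        if p.is_file() and p.name.startswith("model")
    )


if __name__ == "__main__":
    # Run this from the directory that contains csct.py.
    found = find_vector_files(".")
    if found:
        print("Found vector files:", ", ".join(p.name for p in found))
    else:
        print("No 'vectors/model*' files found; check where the archive was unzipped.")
```

An empty result usually means the archive was extracted one level too deep or too shallow.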
System Description
- Quick start: run the script with Python 3
python3 csct.py
- This system assigns contextual spacing to conversation-style and non-normalized Korean text
- ex1) 아버지친구분당선되셨더라 >> "아버지 친구분 당선 되셨더라"
- ex2) 너본지꽤된듯 >> "너 본지 꽤 된 듯"
- ex3) 뭣이중헌지도모름서 >> "뭣이 중헌지도 모름서"
- ex4) 나얼만큼사랑해 >> "나 얼만큼 사랑해"
- The spacing may not be strictly correct, but the system was trained to give plausible durations for speech synthesis, treating the input as non-canonical spoken language.
- Importing the automatic spacer
from csct_dist import correct as cor
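The ex1 to ex4 pairs above show unspaced text on the left and spaced text on the right. One common way to frame this task is per-character binary labelling, where each character carries a 1 if a space should follow it and a 0 otherwise. Whether 'csct.py' uses exactly this encoding is an assumption, not something stated here, and the helper names `to_labels` and `apply_labels` are hypothetical. The sketch only illustrates how a spaced sentence maps to labels and back.

```python
from typing import List, Tuple


def to_labels(spaced: str) -> Tuple[str, List[int]]:
    """Split a spaced sentence into its raw characters and per-character labels.

    labels[i] == 1 means "a space follows character i". This encoding is an
    illustrative assumption, not the confirmed format used by csct.py.
    """
    chars: List[str] = []
    labels: List[int] = []
    for token in spaced.split():
        for ch in token:
            chars.append(ch)
            labels.append(0)
        labels[-1] = 1  # a word boundary follows the token's last character
    if labels:
        labels[-1] = 0  # no space after the final character of the sentence
    return "".join(chars), labels


def apply_labels(text: str, labels: List[int]) -> str:
    """Rebuild a spaced sentence from raw text and its space labels."""
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    out: List[str] = []
    for ch, label in zip(text, labels):
        out.append(ch)
        if label:
            out.append(" ")
    return "".join(out).strip()


if __name__ == "__main__":
    raw, labels = to_labels("너 본지 꽤 된 듯")  # ex2 from the list above
    print(raw, labels)
    print(apply_labels(raw, labels))
```

For ex2, `to_labels` returns the raw string 너본지꽤된듯 with labels [1, 0, 1, 1, 1, 0], and `apply_labels` restores the spaced sentence. If the imported `cor` takes a raw string and returns the spaced string (an assumption about its signature), its output could be passed to `to_labels` for comparison against reference spacings.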
Reference (as a toolkit)
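- Won Ik Cho, Sung Jun Cheon, Ji Won Kim, Nam Soo Kim, "Concept and Application of Deep Learning-based Automatic Spacing Considering Sentence Information," Proceedings of the 30th Annual Conference on Human and Language Technology, 2018, pp. 181-184 (in Korean). [Paper] [Slide]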
@inproceedings{cho2018concept,
title={Concept and Application of Deep learning-based Automatic Spacing},
author={Cho, Won Ik and Cheon, Sung Jun and Kim, Ji Won and Kim, Nam Soo},
booktitle={Annual Conference on Human and Language Technology},
pages={181--184},
year={2018},
organization={Human and Language Technology}
}
- For the English version, see RAWS