Ttuyssubot

October 23, 2019

Contextual Spacing for Conversation-style (and non-normalized) Text

Requirements

fasttext, Keras (TensorFlow), NumPy

Word Vector

Pretrained 100-dimensional fastText vectors

  • Download the pretrained vectors and unzip THE FOLDER into the same folder as 'csct.py'
  • The model is then loaded with load_model('vectors/model')

System Description

  • Quick start: run the script with Python 3
 python3 csct.py 
  • This system assigns contextual spacing to conversation-style and non-normalized Korean text
  • ex1) 아버지친구분당선되셨더라 >> "아버지 친구분 당선 되셨더라"
  • ex2) 너본지꽤된듯 >> "너 본지 꽤 된 듯"
  • ex3) 뭣이중헌지도모름서 >> "뭣이 중헌지도 모름서"
  • ex4) 나얼만큼사랑해 >> "나 얼만큼 사랑해"
  • The spacing may not be strictly correct. The system treats its input as non-canonical spoken language and was trained to produce spacing that gives plausible durations for speech synthesis.
  • Importing the automatic spacer:
 from csct_dist import correct as cor 
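
The examples above can be read as a per-character labeling problem: each character of the unspaced input gets a label saying whether a space follows it. The sketch below illustrates that framing only. It is an assumption about how spacing tasks are commonly set up, not a description of this repository's model, and the helper names `to_labels` and `apply_labels` are hypothetical and not part of `csct.py` or `csct_dist`.

```python
from typing import List, Tuple


def to_labels(spaced: str) -> Tuple[str, List[int]]:
    """Split a spaced sentence into its characters and per-character labels.

    labels[i] is 1 when a space follows chars[i], otherwise 0.
    (Hypothetical helper for illustration; not part of this repository.)
    """
    chars: List[str] = []
    labels: List[int] = []
    for ch in spaced.strip():
        if ch == " ":
            if labels:
                labels[-1] = 1  # mark the previous character as "space follows"
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels


def apply_labels(chars: str, labels: List[int]) -> str:
    """Rebuild a spaced sentence from characters and their labels."""
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    pieces: List[str] = []
    for ch, label in zip(chars, labels):
        pieces.append(ch)
        if label:
            pieces.append(" ")
    return "".join(pieces).strip()


if __name__ == "__main__":
    # ex2 from the list above, starting from the spaced target
    chars, labels = to_labels("너 본지 꽤 된 듯")
    print(chars, labels)
    print(apply_labels(chars, labels))
```

Under this framing, a spacing model only has to predict the label sequence for an unspaced string, and `apply_labels` turns that prediction back into text.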

Reference (as a toolkit)

  • 조원익, 천성준, 김지원, 김남수, "문장 정보를 고려한 딥 러닝 기반 자동 띄어쓰기의 개념 및 활용," 제30회 한글 및 한국어 정보처리 학술대회 논문집, 2018, pp. 181-184. [Paper] [Slide]
@inproceedings{cho2018concept,
  title={Concept and Application of Deep learning-based Automatic Spacing},
  author={Cho, Won Ik and Cheon, Sung Jun and Kim, Ji Won and Kim, Nam Soo},
  booktitle={Annual Conference on Human and Language Technology},
  pages={181--184},
  year={2018},
  organization={Human and Language Technology}
}
  • For the English version, see RAWS

DISCLAIMER: This model is trained on drama scripts and targets noisy user-generated text. For accurate spacing of literary-style text, refer to PyKoSpacing.

Demonstration