SEFR CUT (Stacked Ensemble Filter and Refine for Word Segmentation)

February 2, 2024

Domain Adaptation of Thai Word Segmentation Models using Stacked Ensemble (EMNLP 2020)
SEFR CUT uses a CRF as the stacked model and DeepCut as the baseline model.

Citation

@inproceedings{limkonchotiwat-etal-2020-domain,
    title = "Domain Adaptation of {T}hai Word Segmentation Models using Stacked Ensemble",
    author = "Limkonchotiwat, Peerat  and
      Phatthiyaphaibun, Wannaphong  and
      Sarwar, Raheem  and
      Chuangsuwanich, Ekapol  and
      Nutanong, Sarana",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.315",
}

Install

pip install sefr_cut

How to use

Requirements

python >= 3.6
python-crfsuite >= 0.9.7
pyahocorasick == 1.4.0

Example

Example files are in the SEFR Example notebook.
Try it on Colab

Load Engine & Engine Mode

ws1000, tnhc, and best
- ws1000: The model trained on Wisesight-1000 and tested on Wisesight-160
- tnhc: The model trained on TNHC (80:20 train/test split with random seed 42)
- best: The model trained on the BEST-2010 corpus (NECTEC)
```
import sefr_cut

sefr_cut.load_model(engine='ws1000')
# OR
sefr_cut.load_model(engine='tnhc')
# OR
sefr_cut.load_model(engine='best')
```
tl-deepcut-XXXX
- We also provide DeepCut models adapted by transfer learning to Wisesight (tl-deepcut-ws1000) and TNHC (tl-deepcut-tnhc)
```
sefr_cut.load_model(engine='tl-deepcut-ws1000')
# OR
sefr_cut.load_model(engine='tl-deepcut-tnhc')
```
deepcut
- We also provide the original deepcut
```
sefr_cut.load_model(engine='deepcut')
```

Segment Example

The $k$ value sets the percentage of characters that get refined; see the paper for the reasoning behind it.

Tokenize with default k-value

sefr_cut.load_model(engine='ws1000')
print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ']))
print(sefr_cut.tokenize(['สวัสดีประเทศไทย']))
print(sefr_cut.tokenize('สวัสดีประเทศไทย'))

[['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]
[['สวัสดี', 'ประเทศ', 'ไทย']]
[['สวัสดี', 'ประเทศ', 'ไทย']]

Tokenize with various k-values

sefr_cut.load_model(engine='ws1000')
print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'],k=5)) # refine only 5% of the characters
print(sefr_cut.tokenize(['สวัสดีประเทศไทย','ลุงตู่สู้ๆ'],k=100)) # refine 100% of the characters

[['สวัสดี', 'ประเทศไทย'], ['ลุงตู่', 'สู้', 'ๆ']]
[['สวัสดี', 'ประเทศ', 'ไทย'], ['ลุง', 'ตู่', 'สู้', 'ๆ']]

Evaluation

We also provide character-level and word-level evaluation through the evaluation() function.

For example:

answer = 'สวัสดี|ประเทศไทย'
pred = 'สวัสดี|ประเทศ|ไทย'
char_score,word_score = sefr_cut.evaluation(answer,pred)
print(f'Word Score: {word_score} Char Score: {char_score}')

Word Score: 0.4 Char Score: 0.8

answer = ['สวัสดี|ประเทศไทย']
pred = ['สวัสดี|ประเทศ|ไทย']
char_score,word_score = sefr_cut.evaluation(answer,pred)
print(f'Word Score: {word_score} Char Score: {char_score}')

Word Score: 0.4 Char Score: 0.8


answer = [['สวัสดี|'],['ประเทศไทย']]
pred = [['สวัสดี|'],['ประเทศ|ไทย']]
char_score,word_score = sefr_cut.evaluation(answer,pred)
print(f'Word Score: {word_score} Char Score: {char_score}')

Word Score: 0.4 Char Score: 0.8
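For readers wondering where 0.8 and 0.4 come from, the sketch below computes F1 over word-start positions (character level) and over exact word spans (word level) from `|`-delimited strings. It is an assumption about the metric definitions, not the package's code, and `word_spans`, `f1`, and `evaluate` are hypothetical names. These standard definitions do reproduce the scores shown above for the first example.

```python
from typing import Set, Tuple


def word_spans(segmented: str, sep: str = "|") -> Set[Tuple[int, int]]:
    """Convert 'ab|cd' into character spans {(0, 2), (2, 4)}."""
    spans = set()
    start = 0
    for word in segmented.split(sep):
        if not word:  # ignore empty pieces from leading/trailing separators
            continue
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def f1(n_correct: int, n_pred: int, n_true: int) -> float:
    """F1 written as 2*TP / (predicted + true), equal to 2PR / (P + R)."""
    total = n_pred + n_true
    return 2 * n_correct / total if total else 0.0


def evaluate(answer: str, pred: str, sep: str = "|") -> Tuple[float, float]:
    true_spans = word_spans(answer, sep)
    pred_spans = word_spans(pred, sep)
    # Character level: does the predicted word start match a true word start?
    true_starts = {s for s, _ in true_spans}
    pred_starts = {s for s, _ in pred_spans}
    char_score = f1(len(true_starts & pred_starts), len(pred_starts), len(true_starts))
    # Word level: a word counts only if both of its boundaries match.
    word_score = f1(len(true_spans & pred_spans), len(pred_spans), len(true_spans))
    return char_score, word_score


char_score, word_score = evaluate('สวัสดี|ประเทศไทย', 'สวัสดี|ประเทศ|ไทย')
print(f'Word Score: {word_score} Char Score: {char_score}')
# Word Score: 0.4 Char Score: 0.8
```

Here the prediction has three word starts, two of which are correct against two true starts (2·2/5 = 0.8), and three words, one of which is correct against two true words (2·1/5 = 0.4).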

Performance

How to re-train the model?

You can re-train the model. The examples are in the Notebooks folder, which contains everything you need.
Re-train Model
- Run notebook #2. The corpus inside 'Notebooks/corpus/' is Wisesight-1000; you can also try BEST, TNHC, or LST20.
- Change the value of the variable CRF_model_name
- Link:HERE
Filter and Refine Example
- Set the variable CRF_model_name to the same value as in notebook #2
- To see why we use filter-and-refine, uncomment these 3 lines in the score_() function
```
#answer = scoring_function(y_true,cp.deepcopy(y_pred),entropy_index_og)
#f1_hypothesis.append(eval_function(y_true,answer))
#ax.plot(range(start,K_num,step),f1_hypothesis,c="r",marker='o',label='Best case')
```
- Link:HERE
Use your trained model?
- Move your model from 'Notebooks/model/' to 'sefr_cut/model/' and load it in one line.
```
sefr_cut.load_model(engine='my_model')
```
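If the package was installed with pip, its `model/` folder sits inside site-packages rather than in your working directory. The sketch below shows one way to locate it and copy a trained model there. It is a minimal sketch, not a supported installer: `find_package_dir` and `install_model` are hypothetical helpers, the assumption that the model folder sits next to the package's `__init__.py` is mine, and the file is copied rather than moved so the original stays in the Notebooks folder.

```python
import importlib.util
import shutil
from pathlib import Path


def find_package_dir(package_name: str) -> Path:
    """Return the directory that holds an installed package's __init__.py."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"package not installed: {package_name}")
    return Path(spec.origin).parent


def install_model(model_file, package_dir) -> Path:
    """Copy a trained model file into <package_dir>/model/ and return its new path."""
    source = Path(model_file)
    if not source.is_file():
        raise FileNotFoundError(f"model file not found: {source}")
    target_dir = Path(package_dir) / "model"
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source, target_dir / source.name))


# Usage (requires sefr_cut to be installed; substitute your own model file):
# package_dir = find_package_dir("sefr_cut")
# install_model("Notebooks/model/<your model file>", package_dir)
```

After copying, the model should load with the `load_model` call shown above.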

Thanks for code from

Deepcut (Baseline Model): We used some code from Deepcut to perform transfer learning
@bact (CRF training code): We used some code from https://github.com/bact/nlp-thai