FDHS
September 4, 2018
Fully Data-driven Contextual Hashtag Segmentation
Requirements
Keras (TensorFlow), NumPy
Dictionary
Visit: https://nlp.stanford.edu/projects/glove/
Download: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download): glove.twitter.27B.zip
Download link: http://nlp.stanford.edu/data/glove.twitter.27B.zip
- Download the archive, extract the 100-dimensional dictionary, and place it in the same folder as 'hashseg.py' under the file name 'glove100.txt'.
- A dictionary-free version is under development!
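The dictionary step can be scripted. The following is a minimal sketch, not part of the repository: the helper names 'install_glove' and 'vector_dim' are hypothetical, and the member name 'glove.twitter.27B.100d.txt' is an assumption about how the 100-dimensional file is named inside the Stanford archive, so check the archive listing if extraction fails.

```python
import shutil
import zipfile
from pathlib import Path

# Assumed name of the 100-dimensional file inside glove.twitter.27B.zip.
MEMBER = "glove.twitter.27B.100d.txt"


def install_glove(zip_path, target_dir, member=MEMBER, out_name="glove100.txt"):
    """Extract `member` from the archive and save it as `out_name` in `target_dir`."""
    target = Path(target_dir) / out_name
    with zipfile.ZipFile(zip_path) as archive:
        # Stream only the requested member to disk instead of unpacking everything.
        with archive.open(member) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target


def vector_dim(path):
    """Return the vector dimension implied by the first line of a GloVe text file."""
    with open(path, encoding="utf-8") as handle:
        fields = handle.readline().rstrip("\n").split(" ")
    return len(fields) - 1  # the first field is the token, the rest are floats
```

For example, 'install_glove("glove.twitter.27B.zip", "fdhs")' would write 'fdhs/glove100.txt', and 'vector_dim("fdhs/glove100.txt")' should then report 100 if the correct file was picked.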
System Description
- The system was trained with 'train.py' (run line by line; the script will be released later!)
- Quick start: clone the repository into your working folder
git clone https://github.com/warnikchow/fdhs
- Place the dictionary inside the cloned folder
- Import the segmentation function with the following command:
from fdhs.hashseg import segment as seg
- Sample usage:
seg('#what_do_you_want')
>> 'what do you want'
seg('#WhatDoYouWant')
>> 'What Do You Want'
seg('#whatdoyouwant')
>> 'what do you want'
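
To apply the segmenter to whole tweets rather than single hashtags, one option is a small wrapper like the sketch below. It is not part of the toolkit: 'expand_hashtags' is a hypothetical helper, the hashtag pattern is an assumption, and 'underscore_segment' is only a stand-in so that the example runs without the trained model or the dictionary.

```python
import re

# Assumption: a hashtag is '#' followed by letters, digits, or underscores.
HASHTAG = re.compile(r"#\w+")


def expand_hashtags(text, segment):
    """Replace every hashtag in `text` with the output of `segment(hashtag)`."""
    return HASHTAG.sub(lambda match: segment(match.group(0)), text)


def underscore_segment(tag):
    """Stand-in segmenter for the demo: handles only underscore-delimited tags."""
    return tag.lstrip("#").replace("_", " ")


print(expand_hashtags("tell me #what_do_you_want now", underscore_segment))
# prints: tell me what do you want now
```

With the toolkit installed, 'seg' from 'fdhs.hashseg' could be passed in place of 'underscore_segment', assuming it accepts the leading '#' as in the samples above.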