FDHS

September 4, 2018

Fully Data-driven Contextual Hashtag Segmentation

Requirements

Keras (TensorFlow), NumPy

Dictionary

Visit: https://nlp.stanford.edu/projects/glove/

Download: Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download): glove.twitter.27B.zip

Download link: http://nlp.stanford.edu/data/glove.twitter.27B.zip

  • Download the archive and place the 100-dimensional dictionary in the same folder as 'hashseg.py', under the file name 'glove100.txt'.
  • A dictionary-free version is under development!
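
The dictionary step above can be scripted. The following is only a sketch, not part of the repository: the helper names `extract_glove100` and `vector_dim` are hypothetical, and the member name `glove.twitter.27B.100d.txt` is assumed to be how the 100d vectors are stored inside the archive. It copies the 100d file out of the zip as 'glove100.txt' and reads the vector dimension off the first line as a quick sanity check.

```python
import shutil
import zipfile
from pathlib import Path

# Assumption: name of the 100-dimensional vector file inside the archive.
GLOVE_MEMBER = "glove.twitter.27B.100d.txt"


def extract_glove100(zip_path, target_dir, member=GLOVE_MEMBER):
    """Copy the 100d vectors out of the archive as <target_dir>/glove100.txt."""
    target = Path(target_dir) / "glove100.txt"
    with zipfile.ZipFile(zip_path) as archive:
        # Stream the member to disk instead of loading it into memory.
        with archive.open(member) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target


def vector_dim(path):
    """Return the vector dimension implied by the first line of a GloVe text file."""
    with open(path, encoding="utf-8") as handle:
        fields = handle.readline().rstrip("\n").split(" ")
    # The first field is the token; the remaining fields are the vector values.
    return len(fields) - 1
```

For example, `extract_glove100("glove.twitter.27B.zip", "fdhs")` would write 'fdhs/glove100.txt' next to 'hashseg.py', and `vector_dim("fdhs/glove100.txt")` should then report 100 if the right file was picked.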

System Description

  • The system was trained with 'train.py' (line by line; the script will be released later!)
  • Easy start: clone the repository into your working folder
 git clone https://github.com/warnikchow/fdhs 
  • Place the dictionary ('glove100.txt') inside the cloned folder
  • Use the segmentation toolkit by importing it as follows:
 from fdhs.hashseg import segment as seg 
  • Sample usage:
 seg('#what_do_you_want') 
 >> 'what do you want' 
 seg('#WhatDoYouWant')  
 >> 'What Do You Want' 
 seg('#whatdoyouwant') 
 >> 'what do you want'
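
  • Applying it to running text: the samples above pass one hashtag at a time, so a small wrapper can help when hashtags sit inside a longer string. This is a sketch only; `segment_hashtags` is a hypothetical helper, not part of fdhs, and it assumes the segmenter takes a string starting with '#' and returns the segmented text, as in the samples.

```python
import re

# A hashtag is taken to be '#' followed by word characters (letters, digits, '_').
HASHTAG = re.compile(r"#\w+")


def segment_hashtags(text, seg):
    """Replace every hashtag in `text` with the output of the segmenter `seg`."""
    return HASHTAG.sub(lambda match: seg(match.group(0)), text)
```

With the toolkit imported as above, `segment_hashtags('tell me #whatdoyouwant', seg)` would hand '#whatdoyouwant' to `seg` and splice the result back into the sentence.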