neural-chatbot

July 26, 2017

A chatbot based on seq2seq architecture.

About

This is the successor to my previous Torch project, 'seq2seq-chatbot'. Portions of this project were adapted directly from the TensorFlow translation example.

It is based on the research in these papers: Sutskever et al., 2014, and Vinyals & Le, 2015.

Dependencies

  1. Python 3
  2. TensorFlow 1.2
  3. matplotlib (optional): $ pip install matplotlib

How to use

To use your own data, read Data Format below. Otherwise, to train on the included Cornell Movie-Dialogs Corpus data:

$ python train.py

Data Format

If you wish to provide your own data, use the format below.

Each continuous conversation must be in its own text file. Each line of a text file is a response to the previous line; the exception is the first line, which starts the conversation. After adding your own data, you can plot its sequence lengths with:

$ python sequencelengthplotter.py --data_dir="path/to/text_files"

The options for the above command are:

| Name | Type | Description |
| --- | --- | --- |
| data_dir | string | path to raw data files |
| plot_histograms | boolean | plots histograms for target and source sequences |
| plot_scatter | boolean | plots an x-y scatter plot of target length vs. source length |
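
As a rough illustration of the data format described above, the sketch below turns one conversation file into (source, target) pairs and measures their lengths, which is the kind of information the plots summarize. The function names `conversation_pairs`, `pair_lengths` and `load_conversation` are hypothetical, and both whitespace tokenization and the skipping of blank lines are assumptions; the project's own preprocessing may differ.

```python
from pathlib import Path


def conversation_pairs(lines):
    """Pair each line with the line that replies to it.

    `lines` is the content of one conversation file, in order.
    Blank lines are skipped (an assumption, not documented behaviour).
    """
    turns = [line.strip() for line in lines if line.strip()]
    # zip stops at the shorter sequence, so the first turn is only ever
    # a source and the last turn is only ever a target.
    return list(zip(turns, turns[1:]))


def pair_lengths(pairs):
    """Return (source_length, target_length) in whitespace tokens."""
    return [(len(src.split()), len(tgt.split())) for src, tgt in pairs]


def load_conversation(path):
    """Read one conversation text file into (source, target) pairs."""
    text = Path(path).read_text(encoding="utf-8")
    return conversation_pairs(text.splitlines())


if __name__ == "__main__":
    example = [
        "Hello, how are you?",
        "Fine, thanks. And you?",
        "Not bad at all.",
    ]
    pairs = conversation_pairs(example)
    print(pairs)
    print(pair_lengths(pairs))
```

A file with N non-blank lines yields N - 1 pairs under this reading, so a one-line file contributes no training examples.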

In buckets.cfg you can then modify the bucket values accordingly; you can also add or remove buckets. All bucket values under the [buckets] subheading are parsed. Each line under [buckets] should have the format:

bucket_name: source_length,target_length

In the same configuration file the data settings can also be changed under the [max_data_sizes] subheading.

| Name | Type | Description |
| --- | --- | --- |
| num_lines | int | number of lines in conversation to go back |
| max_target_length | int | max length of target sequences |
| max_source_length | int | max length of source sequences |
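
For illustration, a buckets.cfg in this format might look like the string embedded in the sketch below; the bucket names and all numeric values are made up. The sketch parses it with Python's standard `configparser` and picks the first bucket, in ascending order, that fits a given pair. That is a common bucketing rule for seq2seq training, but whether the project selects buckets exactly this way is an assumption, and `parse_buckets` and `choose_bucket` are hypothetical names.

```python
import configparser

# Hypothetical file contents: names and numbers are examples only.
EXAMPLE_CFG = """
[buckets]
bucket_1: 10,10
bucket_2: 20,20
bucket_3: 40,40

[max_data_sizes]
num_lines: 2
max_target_length: 40
max_source_length: 40
"""


def parse_buckets(config):
    """Return the [buckets] entries as sorted (source_length, target_length) tuples."""
    buckets = []
    for _name, value in config.items("buckets"):
        source_length, target_length = (int(part) for part in value.split(","))
        buckets.append((source_length, target_length))
    return sorted(buckets)


def choose_bucket(buckets, source_len, target_len):
    """Index of the first bucket that fits the pair, or None if none does."""
    for index, (max_source, max_target) in enumerate(buckets):
        if source_len <= max_source and target_len <= max_target:
            return index
    return None


config = configparser.ConfigParser()
config.read_string(EXAMPLE_CFG)

buckets = parse_buckets(config)
max_source_length = config.getint("max_data_sizes", "max_source_length")
```

With these example values, a pair of lengths (12, 8) lands in the second bucket, and a pair whose source is longer than every bucket gets `None` and would have to be dropped or truncated.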

A configuration file is used because passing bucket values on the command line proved messy, and this seemed like a halfway decent solution. It also leaves room for sequencelengthplotter.py to adjust these values automatically in the future, without requiring any user input.

Training Network

Once you are satisfied with the bucket sizes, you can run the optimizer. This also cleans and processes your raw text data. To do this:

$ python train.py --raw_data_dir="data/cornell_lines"

There are several options that can be employed:

| Name | Type | Description |
| --- | --- | --- |
| hidden_size | int | number of hidden units in hidden layers |
| num_layers | int | number of hidden layers |
| batch_size | int | size of batches in training |
| max_epoch | int | max number of epochs to train for |
| learning_rate | float | beginning learning rate |
| steps_per_checkpoint | int | number of steps before running test set |
| lr_decay_factor | float | factor by which to decay learning rate |
| checkpoint_dir | string | directory to store/restore checkpoints |
| dropout | float | probability of hidden inputs being removed |
| grad_clip | int | max gradient norm |
| max_train_data_size | int | use a subset of available data to train |
| vocab_size | int | max vocab size |
| train_frac | int | percentage of data to use for training (rest is used for testing) |
| raw_data_dir | string | raw conversation text files are stored here |
| data_dir | string | directory where the data processor will store train/test/vocab files |
| max_source_length | int | how long the source sentences can be at most |
| max_target_length | int | how long the target sentences can be at most |
| convo_limits | int | how far back the bot's memory should go for the conversation |
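
The interaction between learning_rate, lr_decay_factor and steps_per_checkpoint can be sketched as follows. The rule shown, decaying when a checkpoint loss is worse than each of the previous three, is the one used in the TensorFlow translation example that this project adapts; whether train.py applies exactly this rule is an assumption. The class name `LearningRateSchedule` and its `patience` argument are hypothetical and are not train.py options.

```python
class LearningRateSchedule:
    """Decay the learning rate when the checkpoint loss stops improving."""

    def __init__(self, learning_rate, lr_decay_factor, patience=3):
        self.learning_rate = learning_rate
        self.lr_decay_factor = lr_decay_factor
        self.patience = patience  # hypothetical knob, not a train.py option
        self.previous_losses = []

    def update(self, loss):
        """Record one checkpoint loss and return the (possibly decayed) rate."""
        recent = self.previous_losses[-self.patience:]
        # Decay only once enough history exists and the new loss is worse
        # than every one of the recent checkpoint losses.
        if len(recent) == self.patience and loss > max(recent):
            self.learning_rate *= self.lr_decay_factor
        self.previous_losses.append(loss)
        return self.learning_rate


if __name__ == "__main__":
    schedule = LearningRateSchedule(learning_rate=0.5, lr_decay_factor=0.5)
    # One loss value per checkpoint, i.e. every steps_per_checkpoint steps.
    for loss in [3.0, 2.8, 2.9, 3.1, 2.7]:
        print(loss, schedule.update(loss))
```

In this example the rate is halved at the fourth checkpoint, where the loss of 3.1 exceeds all three earlier values, and then stays put once the loss improves again.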

TensorBoard Usage

TensorBoard support is not yet implemented.

Sampling output

Note: this still needs to be tested since the update to TensorFlow 1.2.

To have a real conversation with your bot, you can start an interactive prompt with:

$ python sample.py --checkpoint_dir="path/to/checkpointdirectory" --data_dir="path/to/datadirectory"

Note: sample.py does not currently read the bucket sizes set in the config file (this will be fixed shortly).

A prompt will open, something like:

>

You can then type at the prompt, for example:

> Hello chatbot, how are you?

Then press Enter for the response.

Summary of options below:

| Name | Type | Description |
| --- | --- | --- |
| checkpoint_dir | string | path to saved checkpoint |
| data_dir | string | path to directory where vocab file is |

Results

Coming soon.

Future Features

  • Downloadable pretrained model