README

November 9, 2020

            Improving methods to learn word representations
            ===============================================
            for efficient semantic similarities computations
            ================================================

ABOUT
This repository contains the PhD thesis of Julien Tissier, entitled

"Improving  methods  to  learn  word   representations   for   efficient
semantic similarities computations"

It also contains all the source materials used to produce the thesis,
including the LaTeX .tex source files, the images and their respective
source files to generate or modify them (either the LibreOffice Draw
source or the Python code), and the slides of the PhD defense.

CONTENT
This repository is composed of:

  - chapters/: this folder contains all the chapters of the thesis, as
    ".tex" source files. There are 10 chapters (from 00-introduction.tex
    to 09-software.tex), a cover page (000-garde.tex) and the
    bibliography (99-bibliography.bib).
  - images/: this folder contains all the images used in the thesis
    (i.e. with the \includegraphics{} command in the .tex files) either
    as PNG or PDF.
  - images-code/: this folder contains the Python code used to generate
    some plots or illustration images of the thesis with the Matplotlib
    library.
  - images-src/: this folder contains the source files of some
    illustration images used in the thesis, as LibreOffice Draw files
    (.odg).
  - PhD-Defense-Julien-Tissier.pdf: the defense presentation as PDF,
    48 slides.
  - PhD-Thesis-Julien-Tissier.pdf: the thesis as PDF, 127 pages.
  - makefile: used to generate the thesis from source files. Use the
    command "make" at the root of this repository to produce it. You
    will need the following tools: make, pdflatex and bibtex.
  - phd-thesis.tex: the main .tex file, containing all the LaTeX
    packages to use and the different chapters to include.
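
Before running "make", it can help to confirm that the three tools
listed above are installed. The following is a minimal sketch, not part
of this repository: the function name missing_tools is hypothetical,
and it only checks that each tool is found on the PATH, not that its
version is suitable.

```python
import shutil

# Tools the README lists as required to build the thesis.
REQUIRED_TOOLS = ("make", "pdflatex", "bibtex")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found; run `make` at the repository root.")
```

An empty result means that all three tools were found.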

SUMMARY
Many natural language processing (NLP) applications rely on word
embeddings (also called word representations) to achieve
state-of-the-art results. These numerical representations of the
language should encode both syntactic and semantic information to
perform well in downstream tasks. However, common models (word2vec,
GloVe) learn them from generic corpora such as Wikipedia, so they lack
specific semantic information. Moreover, storing them requires a large
amount of memory because the number of representations to save can be
on the order of a million.

The topic of my thesis is to develop new learning algorithms that both
improve the semantic information encoded within the representations and
reduce the memory they require for storage and for their use in NLP
tasks.

The first part of  my  work  is  to  improve  the  semantic  information
contained in word embeddings.  I developed dict2vec, a model  that  uses
additional information from online lexical  dictionaries  when  learning
word representations. The dict2vec word embeddings perform ∼15% better
than the embeddings learned by other models on word semantic
similarity tasks.
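
To make "word semantic similarity tasks" concrete, here is a minimal
sketch of a common evaluation protocol, assumed here rather than taken
from this README: the cosine similarity of each word pair is compared
with a human rating using Spearman's rank correlation. The 2-dimensional
vectors and the human scores below are invented for illustration only;
they are not dict2vec outputs or data from the thesis.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy embeddings and human similarity ratings.
embeddings = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "car": [0.1, 1.0],
    "truck": [0.3, 0.9],
}
rated_pairs = [
    ("cat", "dog", 9.0),
    ("car", "truck", 8.5),
    ("cat", "car", 2.0),
    ("dog", "truck", 3.0),
]

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in rated_pairs]
human_scores = [score for _, _, score in rated_pairs]
correlation = spearman(model_scores, human_scores)
print(f"Spearman correlation: {correlation:.3f}")
```

A higher correlation means the embeddings order word pairs more like
human annotators do.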

The second part of  my  work  is  to  reduce  the  memory  size  of  the
embeddings.  I developed an architecture  based  on  an  autoencoder  to
transform commonly used real-valued embeddings into  binary  embeddings,
reducing their size in memory by 97% with a loss of only ∼2% in accuracy
in downstream NLP tasks.
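
The size reduction can be illustrated with a short calculation. The
figures below are assumptions chosen for illustration and are not stated
in this README: 300-dimensional vectors stored as 32-bit floats, 256-bit
binary codes, and a vocabulary of one million words. The
hamming_similarity helper is likewise a hypothetical sketch of one way
binary codes could be compared; it is not the method of the thesis.

```python
def real_vector_bits(dimensions, bits_per_value=32):
    """Bits needed to store one real-valued vector."""
    return dimensions * bits_per_value


def size_reduction(real_bits, binary_bits):
    """Fraction of memory saved when replacing real_bits by binary_bits."""
    return 1 - binary_bits / real_bits


def hamming_similarity(a, b, n_bits):
    """Fraction of the n_bits positions where integer codes a and b agree."""
    differing = bin(a ^ b).count("1")
    return 1 - differing / n_bits


# Assumed sizes (illustrative only, not taken from this README).
real_bits = real_vector_bits(300)      # 300 float32 values per word
binary_bits = 256                      # one 256-bit code per word
vocabulary_size = 1_000_000

saving = size_reduction(real_bits, binary_bits)
real_megabytes = vocabulary_size * real_bits / 8 / 1e6
binary_megabytes = vocabulary_size * binary_bits / 8 / 1e6

print(f"Reduction: {saving:.1%}")
print(f"Real-valued: {real_megabytes:.0f} MB, binary: {binary_megabytes:.0f} MB")
```

Under these assumptions, the saving is consistent with the 97% figure
given above.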

AUTHOR
Written by Julien Tissier <30314448+tca19@users.noreply.github.com>.

COPYRIGHT
This thesis and all the files in this repository are licensed under the
"Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
Public License". By using or downloading this repository, you agree to:

  1. NonCommercial - You may not use the material for commercial
     purposes.
  2. Attribution - You must give appropriate credit, provide a link to
     the licensor, and indicate if changes were made. You may do so in
     any reasonable manner, but not in any way that suggests the
     licensor endorses you or your use.
  3. ShareAlike - If you remix, transform, or build upon the material,
     you must distribute your contributions under the same license as
     the original.
  4. No additional restrictions - You may not apply legal terms or
     technological measures that legally restrict others from doing
     anything the license permits.

For more details, see https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode