Vietnamese NLP tasks

December 6, 2021

Dependency parsing

  • Experiments employ the benchmark Vietnamese dependency treebank VnDT of 10K+ sentences, using 1,020 sentences for test, 200 sentences for development and the remaining sentences for training. LAS and UAS scores are computed on all tokens (i.e. including punctuation).
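As a rough illustration of the scoring convention described above, the following minimal sketch computes LAS and UAS over all tokens, with no punctuation filtering. The function name `attachment_scores`, the `(head, label)` token representation, and the dependency labels in the toy example are assumptions made for illustration; they are not taken from the VnDT evaluation scripts.

```python
from typing import List, Tuple

# Hypothetical representation: each token is (head_index, dependency_label),
# where head_index 0 denotes the artificial root.
Token = Tuple[int, str]


def attachment_scores(gold: List[List[Token]],
                      pred: List[List[Token]]) -> Tuple[float, float]:
    """Return (LAS, UAS) as percentages over all tokens, punctuation included."""
    total = uas_hits = las_hits = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences must have the same length")
        for (g_head, g_label), (p_head, p_label) in zip(gold_sent, pred_sent):
            total += 1
            if g_head == p_head:
                uas_hits += 1          # correct head: counts towards UAS
                if g_label == p_label:
                    las_hits += 1      # correct head and label: counts towards LAS
    if total == 0:
        raise ValueError("no tokens to score")
    return 100.0 * las_hits / total, 100.0 * uas_hits / total


# Toy example with illustrative labels; the last token is punctuation and is scored too.
gold = [[(2, "nsubj"), (0, "root"), (2, "punct")]]
pred = [[(2, "dobj"), (0, "root"), (1, "punct")]]
las, uas = attachment_scores(gold, pred)
print(f"LAS={las:.2f} UAS={uas:.2f}")  # LAS=33.33 UAS=66.67
```

Because punctuation tokens are counted, scores computed this way are not directly comparable with evaluations that exclude punctuation.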

VnDT v1.1:

| Setting | Model | LAS | UAS | Paper | Code |
| --- | --- | --- | --- | --- | --- |
| Predicted POS | PhoNLP (2021) | 79.11 | 85.47 | PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing | Official |
| Predicted POS | PhoBERT-base (2020) | 78.77 | 85.22 | PhoBERT: Pre-trained language models for Vietnamese | Official |
| Predicted POS | PhoBERT-large (2020) | 77.85 | 84.32 | PhoBERT: Pre-trained language models for Vietnamese | Official |
| Predicted POS | Biaffine (2017) | 74.99 | 81.19 | Deep Biaffine Attention for Neural Dependency Parsing | |
| Predicted POS | jointWPD (2018) | 73.90 | 80.12 | A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing | |
| Predicted POS | jPTDP-v2 (2018) | 73.12 | 79.63 | An improved neural network model for joint POS tagging and dependency parsing | |
| Predicted POS | VnCoreNLP (2018) | 71.38 | 77.35 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official |

VnDT v1.0:

| Setting | Model | LAS | UAS | Paper | Code |
| --- | --- | --- | --- | --- | --- |
| Predicted POS | VnCoreNLP (2018) | 70.23 | 76.93 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official |
| Gold POS | VnCoreNLP (2018) | 73.39 | 79.02 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official |
| Gold POS | BIST BiLSTM graph-based parser (2016) | 73.17 | 79.39 | Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations | Official |
| Gold POS | BIST BiLSTM transition-based parser (2016) | 72.53 | 79.33 | Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations | Official |
| Gold POS | MSTparser (2006) | 70.29 | 76.47 | Online large-margin training of dependency parsers | |
| Gold POS | MaltParser (2007) | 69.10 | 74.91 | MaltParser: A language-independent system for data-driven dependency parsing | |

Intent detection and Slot filling

PhoATIS

  • The first dataset for intent detection and slot filling for Vietnamese, based on the common ATIS benchmark in the flight booking domain. Data is localized (e.g. replacing slot values with Vietnamese-specific entities) to fit the context of flight booking in Vietnam.
  • Training set: 4478 sentences
  • Development set: 500 sentences
  • Test set: 893 sentences
| Model | Intent Acc. | Slot F1 | Sentence Acc. | Paper | Code | Note |
| --- | --- | --- | --- | --- | --- | --- |
| JointIDSF (2021) | 97.62 | 94.98 | 86.25 | Intent Detection and Slot Filling for Vietnamese | Official | Text is automatically word-segmented using RDRSegmenter |
| JointBERT (2019) with PhoBERT encoder | 97.40 | 94.75 | 85.55 | Intent Detection and Slot Filling for Vietnamese | Official | Text is automatically word-segmented using RDRSegmenter |
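The intent and sentence-level metrics reported for PhoATIS can be sketched as follows. This is a minimal illustration, assuming that sentence accuracy counts a sentence as correct only when both its intent and every slot tag are predicted correctly (the usual convention for joint intent detection and slot filling). The function names and the toy labels are hypothetical, and the span-level slot F1 is not reproduced here.

```python
from typing import List


def intent_accuracy(gold_intents: List[str], pred_intents: List[str]) -> float:
    """Percentage of sentences whose intent label is predicted correctly."""
    hits = sum(g == p for g, p in zip(gold_intents, pred_intents))
    return 100.0 * hits / len(gold_intents)


def sentence_accuracy(gold_intents: List[str], pred_intents: List[str],
                      gold_slots: List[List[str]], pred_slots: List[List[str]]) -> float:
    """Percentage of sentences with a correct intent AND a fully correct slot sequence."""
    hits = 0
    for g_int, p_int, g_tags, p_tags in zip(gold_intents, pred_intents,
                                            gold_slots, pred_slots):
        if g_int == p_int and g_tags == p_tags:
            hits += 1
    return 100.0 * hits / len(gold_intents)


# Toy example (illustrative labels): sentence 1 is fully correct,
# sentence 2 has one wrong slot tag, sentence 3 has a wrong intent.
gold_intents = ["flight", "flight", "airfare"]
pred_intents = ["flight", "flight", "flight"]
gold_slots = [["O", "B-fromloc.city_name"], ["O", "B-toloc.city_name"], ["O", "O"]]
pred_slots = [["O", "B-fromloc.city_name"], ["O", "O"], ["O", "O"]]

intent_acc = intent_accuracy(gold_intents, pred_intents)
sent_acc = sentence_accuracy(gold_intents, pred_intents, gold_slots, pred_slots)
print(f"Intent Acc.={intent_acc:.2f} Sentence Acc.={sent_acc:.2f}")
```

Sentence accuracy is the strictest of the three metrics, which is consistent with it being the lowest column in the table.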

Machine translation

PhoMT Dataset

  • A large-scale and high-quality dataset for Vietnamese-English Machine Translation with 3.02M sentence pairs, available at https://github.com/VinAIResearch/PhoMT.
    • Consists of 6 domains: TED Talks, WikiHow, MediaWiki, OpenSubtitles, News and Blog.
    • Training set: 2.9M sentence pairs
    • Validation set: 18719 sentence pairs
    • Test set: 19151 sentence pairs
| Model | EN-VI (BLEU) | VI-EN (BLEU) | Paper | Code |
| --- | --- | --- | --- | --- |
| mBART (2020) | 43.46 | 39.78 | Multilingual Denoising Pre-training for Neural Machine Translation | Link |
| Transformer-big (2017) | 42.94 | 37.83 | Attention is all you need | Link |
| Transformer-base (2017) | 42.12 | 37.19 | Attention is all you need | Link |

IWSLT2015 Dataset

English-to-Vietnamese

tst2015 is used for test

| Model | BLEU | Paper | Code |
| --- | --- | --- | --- |
| Stanford (2015) | 26.4 | Stanford Neural Machine Translation Systems for Spoken Language Domains | |

tst2013 is used for test

| Model | BLEU | Paper | Code |
| --- | --- | --- | --- |
| Nguyen and Salazar (2019) | 32.8 | Transformers without Tears: Improving the Normalization of Self-Attention | Official |
| Provilkov et al. (2019) | 33.27 (uncased) | BPE-Dropout: Simple and Effective Subword Regularization | |
| Xu et al. (2019) | 31.4 | Understanding and Improving Layer Normalization | Official |
| CVT (2018) | 29.6 (SST) | Semi-Supervised Sequence Modeling with Cross-View Training | |
| ELMo (2018) | 29.3 (SST) | Deep contextualized word representations | |
| Transformer (2017) | 28.9 | Attention is all you need | Link |
| Kudo (2018) | 28.5 | Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates | |
| Google (2017) | 26.1 | Neural machine translation (seq2seq) tutorial | Official |
| Stanford (2015) | 23.3 | Stanford Neural Machine Translation Systems for Spoken Language Domains | |

Vietnamese-to-English

tst2013 is used for test

| Model | BLEU | Paper | Code |
| --- | --- | --- | --- |
| Provilkov et al. (2019) | 32.99 (uncased) | BPE-Dropout: Simple and Effective Subword Regularization | |
| Kudo (2018) | 26.31 | Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates | |

Named entity recognition

PhoNER_COVID19

  • A named entity recognition dataset for Vietnamese with 10 newly-defined entity types in the context of the COVID-19 pandemic. Data is extracted from news articles and manually annotated. In total, there are 34,984 entities over 10,027 sentences.
  • Training set: 5027 sentences
  • Development set: 2000 sentences
  • Test set: 3000 sentences
| Model | F1 | Paper | Code | Note |
| --- | --- | --- | --- | --- |
| PhoBERT-large (2020) | 94.5 | PhoBERT: Pre-trained language models for Vietnamese | Official | |
| PhoBERT-base (2020) | 94.2 | PhoBERT: Pre-trained language models for Vietnamese | Official | |
| XLM-R-large (2019) | 93.8 | Unsupervised Cross-lingual Representation Learning at Scale | Official | |
| XLM-R-base (2019) | 92.5 | Unsupervised Cross-lingual Representation Learning at Scale | Official | |
| BiLSTM-CRF + CNN-char (2016) + Word Segmentation | 91 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Link | Text is automatically word-segmented using RDRSegmenter |
| BiLSTM-CRF + CNN-char (2016) | 90.6 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Link | No word segmentation |

VLSP

  • 16,861 sentences for training and development from the VLSP 2016 NER shared task:
    • 14,861 sentences are used for training.
    • 2k sentences are used for development.
  • Test data: 2,831 test sentences from the VLSP 2016 NER shared task.
  • NOTE that in the VLSP 2016 NER data, each word representing a full personal name is separated into the syllables that constitute the word. The VLSP 2016 NER data also includes gold POS and chunking tags, as reconfirmed by the VLSP 2016 organizers. This scheme results in an unrealistic scenario for a pipeline evaluation:
    • The standard annotation for Vietnamese word segmentation and POS tagging forms each full name as a word token, thus all word segmenters have been trained to output a full name as a word and all POS taggers have been trained to assign a POS label to the entire full-name.
    • Gold POS and chunking tags are NOT available in a real-world application.
  • For a realistic scenario, contiguous syllables constituting a full name are merged to form a word. POS/chunking tags--if used--have to be automatically predicted!
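The merging step described in the note above can be sketched as follows. This is only an illustration under stated assumptions: that the data uses BIO tags such as `B-PER` and `I-PER`, and that merged syllables are joined with an underscore, as is common for word-segmented Vietnamese text. The function name `merge_name_syllables` and the example sentence are hypothetical.

```python
from typing import List, Tuple


def merge_name_syllables(tokens: List[str], tags: List[str],
                         entity: str = "PER") -> Tuple[List[str], List[str]]:
    """Merge contiguous syllables of one entity mention into a single word token.

    Assumes BIO tagging: a mention starts with B-<entity> and continues with I-<entity>.
    """
    begin, inside = f"B-{entity}", f"I-{entity}"
    merged_tokens: List[str] = []
    merged_tags: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == inside and merged_tags and merged_tags[-1] == begin:
            # Continuation syllable: append it to the word started by B-<entity>.
            merged_tokens[-1] += "_" + token
        else:
            merged_tokens.append(token)
            merged_tags.append(tag)
    return merged_tokens, merged_tags


# Illustrative sentence: the three name syllables become one word token.
tokens = ["Ông", "Nguyễn", "Văn", "A", "đến", "Hà_Nội"]
tags = ["O", "B-PER", "I-PER", "I-PER", "O", "B-LOC"]
new_tokens, new_tags = merge_name_syllables(tokens, tags)
print(new_tokens)  # ['Ông', 'Nguyễn_Văn_A', 'đến', 'Hà_Nội']
print(new_tags)    # ['O', 'B-PER', 'O', 'B-LOC']
```

After merging, each full name is a single word token, which matches what standard Vietnamese word segmenters and POS taggers are trained to produce.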
| Model | F1 | Paper | Code | Note |
| --- | --- | --- | --- | --- |
| PhoBERT-large (2020) | 94.7 | PhoBERT: Pre-trained language models for Vietnamese | Official | |
| PhoNLP (2021) | 94.41 | PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing | Official | |
| vELECTRA (2020) | 94.07 | Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models | Official | |
| PhoBERT-base (2020) | 93.6 | PhoBERT: Pre-trained language models for Vietnamese | Official | |
| VnCoreNLP (2018) [1] | 91.30 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official | Used ETNLP embeddings |
| BiLSTM-CRF + CNN-char (2016) [1] | 91.09 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Official / Link | Used ETNLP embeddings |
| VNER (2019) | 89.58 | Attentive Neural Network for Named Entity Recognition in Vietnamese | | |
| VnCoreNLP (2018) | 88.55 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | Official | Pre-trained embeddings learned from Baomoi corpus |
| BiLSTM-CRF + CNN-char (2016) [2] | 88.28 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Official / Link | Pre-trained embeddings learned from Baomoi corpus |
| BiLSTM-CRF + LSTM-char (2016) [2] | 87.71 | Neural Architectures for Named Entity Recognition | Link | Pre-trained embeddings learned from Baomoi corpus |
| BiLSTM-CRF (2015) [2] | 86.48 | Bidirectional LSTM-CRF Models for Sequence Tagging | Link | Pre-trained embeddings learned from Baomoi corpus |

Part-of-speech tagging

  • 27,870 sentences for training and development from the VLSP 2013 POS tagging shared task:
    • 27k sentences are used for training.
    • 870 sentences are used for development.
  • Test data: 2120 test sentences from the VLSP 2013 POS tagging shared task.
| Model | Accuracy | Paper | Code |
| --- | --- | --- | --- |
| PhoBERT-large (2020) | 96.8 | PhoBERT: Pre-trained language models for Vietnamese | Official |
| vELECTRA (2020) | 96.77 | Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models | Official |
| PhoNLP (2021) | 96.76 | PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing | Official |
| PhoBERT-base (2020) | 96.7 | PhoBERT: Pre-trained language models for Vietnamese | Official |
| jointWPD (2018) | 95.97 | A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing | |
| VnCoreNLP-VnMarMoT (2017) | 95.88 | From Word Segmentation to POS Tagging for Vietnamese | Official |
| jPTDP-v2 (2018) | 95.70 | An improved neural network model for joint POS tagging and dependency parsing | |
| BiLSTM-CRF + CNN-char (2016) | 95.40 | End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF | Official / Link |
| BiLSTM-CRF + LSTM-char (2016) | 95.31 | Neural Architectures for Named Entity Recognition | Link |
| BiLSTM-CRF (2015) | 95.06 | Bidirectional LSTM-CRF Models for Sequence Tagging | Link |
| RDRPOSTagger (2014) | 95.11 | RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger | Official |

Semantic parsing

ViText2SQL

  • The first public large-scale Text-to-SQL semantic parsing dataset for Vietnamese, consisting of about 10K question and SQL query pairs.
  • Training set: 6831 question and query pairs
  • Development set: 954 question and query pairs
  • Test set: 1906 question and query pairs
| Model | Exact Match Accuracy | Paper | Code | Note |
| --- | --- | --- | --- | --- |
| IRNet (2019) | 53.2 | A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese | Link | Using PhoBERT as encoder |
| EditSQL (2019) | 52.6 | A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese | Link | Using PhoBERT as encoder |

Word segmentation

  • Training & development data: 75k manually word-segmented training sentences from the VLSP 2013 word segmentation shared task.
  • Test data: 2120 test sentences from the VLSP 2013 POS tagging shared task.
| Model | F1 | Paper | Code |
| --- | --- | --- | --- |
| UITws-v1 (2019) | 98.06 | Vietnamese Word Segmentation with SVM: Ambiguity Reduction and Suffix Capture | Official |
| VnCoreNLP-RDRsegmenter (2018) | 97.90 | A Fast and Accurate Vietnamese Word Segmenter | Official |
| UETsegmenter (2016) | 97.87 | A hybrid approach to Vietnamese word segmentation | Official |
| jointWPD (2018) | 97.81 | A neural joint model for Vietnamese word segmentation, POS tagging and dependency parsing | |
| vnTokenizer (2008) | 97.33 | A Hybrid Approach to Word Segmentation of Vietnamese Texts | |
| JVnSegmenter (2006) | 97.06 | Vietnamese Word Segmentation with CRFs and SVMs: An Investigation | |
| DongDu (2012) | 96.90 | Ứng dụng phương pháp Pointwise vào bài toán tách từ cho tiếng Việt | |
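For readers unfamiliar with how a word segmentation F1 score is typically obtained, the sketch below computes it by comparing word spans over syllable positions. It assumes that the syllables of a word are joined by underscores and that a predicted word counts as correct only if both of its boundaries match the gold segmentation. The function names are hypothetical, and the shared-task scorer may differ in detail.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmented sentence to the set of (start, end) syllable spans of its words."""
    spans = set()
    start = 0
    for word in words:
        n_syllables = len(word.split("_"))  # assumption: "_" joins syllables of a word
        spans.add((start, start + n_syllables))
        start += n_syllables
    return spans


def segmentation_f1(gold_sents: List[List[str]], pred_sents: List[List[str]]) -> float:
    """Corpus-level word F1 (as a percentage) based on exact span matches."""
    true_pos = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        gold_spans, pred_spans = word_spans(gold), word_spans(pred)
        true_pos += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)


# Toy example: the prediction wrongly splits "sinh_viên" into two words.
gold_sents = [["Tôi", "là", "sinh_viên"]]
pred_sents = [["Tôi", "là", "sinh", "viên"]]
f1 = segmentation_f1(gold_sents, pred_sents)
print(f"F1={f1:.2f}")  # F1=57.14
```

In the toy example, two of the four predicted words match the three gold words, giving a precision of 0.5 and a recall of 2/3.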