
TensorFlow-2.0-Question-Answering

Introduction

  1. This case study builds toward an open-domain question answering (QA) system: given a real question, the system should be able to respond with the correct answer.
  2. Concretely, the goal is to predict short and long answer responses to real questions about Wikipedia articles. The dataset is provided by Google's Natural Questions, but contains its own unique private test set.
  3. A visualization of examples shows long and—where available—short answers.

Data Overview

  1. Data Source

    Kaggle
  2. Data Format

    1. Each sample contains a Wikipedia article, a related question, and the candidate long form answers.
    2. The training examples also provide the correct long and short form answer or answers for the sample, if any exist.
  3. What we have to Predict

    1. For each article + question pair, we must predict / select long and short form answers to the question, drawn directly from the article.
    2. A long answer would be a longer section of text that answers the question - several sentences or a paragraph.
    3. A short answer might be a sentence or phrase, or in some cases simply YES/NO. The short answers are always contained within (a subset of) one of the plausible long answers.
    4. A given article can (and very often will) allow for both long and short answers, depending on the question.
    5. There is more detail about the data and what you're predicting on the GitHub page for the Natural Questions dataset.
  4. File Description

    1. simplified-nq-train.jsonl - the training data, in newline-delimited JSON format.
    2. simplified-nq-kaggle-test.jsonl - the test data, in newline-delimited JSON format.
    3. sample_submission.csv - a sample submission file in the correct format
  5. Data Attributes

    1. document_text - the text of the article in question (with some HTML tags to provide document structure). The text can be tokenized by splitting on whitespace, as in the loading sketch after this list.
    2. question_text - the question to be answered
    3. long_answer_candidates - a JSON array containing all of the plausible long answers.
    4. annotations - a JSON array containing all of the correct long + short answers. Only provided for train.
    5. document_url - the URL for the full article. Provided for informational purposes only. This is NOT the simplified version of the article so indices from this cannot be used directly. The content may also no longer match the html used to generate document_text. Only provided for train.
    6. example_id - unique ID for the sample.
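
The snippet below is a minimal loading sketch, not part of any competition code: it streams the newline-delimited training file and prints the attributes listed above, tokenizing the document by splitting on whitespace. The file path and the `limit` parameter are assumptions for illustration.

```python
import json

# Minimal sketch: stream the newline-delimited training file and inspect the
# attributes listed above. The path and `limit` are illustrative assumptions.
def read_nq_examples(path, limit=3):
    with open(path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= limit:
                break
            yield json.loads(line)

for example in read_nq_examples("simplified-nq-train.jsonl"):
    tokens = example["document_text"].split(" ")  # tokenize by splitting on whitespace
    print(example["example_id"], "|", example["question_text"])
    print("  long answer candidates:", len(example["long_answer_candidates"]))
    print("  document tokens:", len(tokens))
```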

Evaluation

  1. Submissions are evaluated using micro F1 between the predicted and expected answers. Predicted long and short answers must exactly match the token indices of one of the ground truth labels (or match YES/NO if the question has a yes/no short answer). There may be up to five labels for long answers, and more for short. If no answer applies, leave the prediction blank/null. A simplified sketch of this metric follows this list.
  2. Refer here for more details
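
The sketch below illustrates micro-averaged F1 over exact-match answers. It is not the official scorer; the answer representation is an assumption (a token-index span, a "YES"/"NO" string, or None for a blank prediction, with the gold labels given as a set that is empty when the question has no answer).

```python
# Simplified sketch of micro-averaged F1 over exact-match answers; NOT the
# official scorer. Assumed representation: prediction is a (start, end)
# token-index span, a "YES"/"NO" string, or None for blank; gold is a set of
# acceptable labels, empty when the question has no answer.
def micro_f1(predictions, gold_sets):
    tp = fp = fn = 0
    for pred, gold in zip(predictions, gold_sets):
        if pred is None and not gold:
            continue            # correctly predicted "no answer" (no credit either way)
        if pred is None:
            fn += 1             # missed an existing answer
        elif not gold:
            fp += 1             # predicted an answer where none exists
        elif pred in gold:
            tp += 1             # exact match with one of the ground-truth labels
        else:
            fp += 1
            fn += 1             # a wrong answer counts against precision and recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Illustrative call with three toy predictions.
print(micro_f1([(6, 18), None, "YES"], [{(6, 18)}, {(3, 9)}, {"YES"}]))  # 0.8
```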

Modelling

Coming soon

References and concepts learnt

  1. BERT:-

    1. [1810.04805] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
    2. A Visual Guide to Using BERT for the First Time
    3. BERT is a bi-directional transformer.
    4. BERT works in two steps. First, it uses a large amount of unlabeled data to learn a language representation in an unsupervised fashion; this step is called pre-training.
    5. Then, the pre-trained model can be fine-tuned in a supervised fashion using a small amount of labeled training data to perform various supervised tasks.
    6. BERT pre-training is done with two tasks, called Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
    7. MLM makes it possible to perform bidirectional learning from the text. The MLM pre-training task converts the text into tokens and uses the token representation as an input and output for the training. A random subset of the tokens (15%) is masked, i.e. hidden during training, and the objective is to predict the correct identities of the masked tokens (a toy masking sketch follows these notes).
    8. The NSP task allows BERT to learn relationships between sentences by predicting if the next sentence in a pair is the true next or not.
    9. BERT was trained using 3.3 Billion words total with 2.5B from Wikipedia and 0.8B from BooksCorpus.
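
As a toy illustration of the MLM objective described above: mask roughly 15% of the tokens and ask the model to recover them. This is not the exact BERT recipe (which also replaces some selected tokens with random tokens or leaves them unchanged); it only shows the core idea.

```python
import random

# Toy MLM masking: select ~15% of token positions, hide them behind [MASK],
# and record the original tokens as the prediction targets.
def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=1):
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok          # the model must predict the original token here
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens)
print(masked)     # input to the model, with some positions hidden
print(targets)    # positions and identities the model must recover
```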
  2. ALBERT:-

    1. [1909.11942] ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
    2. ALBERT is an advancement over BERT, where the researchers introduced three new concepts:
      1. Factorized embedding parameterization:
        1. The model isolates the size of the hidden layers from the size of the vocabulary embeddings by projecting one-hot vectors into a lower-dimensional embedding space and then to the hidden space (a parameter-count sketch follows these notes).
      2. Cross-layer parameter sharing:
        1. It shares all parameters across layers to prevent the parameters from growing along with the depth of the network.
        2. This results in 18 times fewer parameters compared to BERT-large.
      3. Inter-sentence coherence loss:
        1. For pre-training, instead of the next sentence prediction (NSP) task, it uses a sentence-order prediction (SOP) loss, which enables more robust learning for multi-sentence encoding tasks.
    3. For pretraining the baseline models, the researchers used BooksCorpus and English Wikipedia, which together contain around 16GB of uncompressed text.
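
A back-of-the-envelope sketch of the factorized embedding parameterization: BERT ties the embedding size to the hidden size H, while ALBERT first projects one-hot vectors to a small embedding size E and then up to H. The values below are typical (a ~30k WordPiece vocabulary, H=1024 as in BERT-large, E=128 as in the ALBERT paper).

```python
# Compare parameter counts for the two embedding strategies.
V, H, E = 30000, 1024, 128

bert_embedding_params = V * H              # one-hot -> hidden space directly
albert_embedding_params = V * E + E * H    # one-hot -> E, then E -> hidden space

print(f"BERT-style embedding table:   {bert_embedding_params:,}")    # 30,720,000
print(f"ALBERT factorized embeddings: {albert_embedding_params:,}")  #  3,971,072
```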
  3. XLNet:-

    1. [1906.08237] XLNet: Generalized Autoregressive Pretraining for Language Understanding
    2. XLNet was trained with over 130 GB of textual data and 512 TPU chips.
    3. XLNet is a large bidirectional transformer that improves on BERT's training methodology and is trained with more data, achieving better performance than BERT on 20 language tasks.
    4. XLNet introduces the concept of permutation language modeling, i.e. all tokens are predicted, but in a random order. This helps the model learn bidirectional relationships and therefore better handle dependencies and relations between words (a toy permutation sketch follows these notes).
    5. The base architecture for XLNet is Transformer-XL, which showed good performance even in the absence of permutation-based training.
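
A toy sketch of the permutation language modeling idea: sample a random factorization order over the positions and let each position be predicted only from the positions that precede it in that order, not in the original left-to-right order. This illustrates the objective only; real XLNet implements it with two-stream attention rather than by literally permuting the input.

```python
import random

# For each position, record which other positions are visible when that
# position is predicted under a randomly sampled factorization order.
def permutation_contexts(tokens, seed=0):
    order = list(range(len(tokens)))
    random.Random(seed).shuffle(order)      # random prediction order
    seen = set()
    contexts = {}
    for pos in order:
        contexts[pos] = sorted(seen)        # positions visible when predicting `pos`
        seen.add(pos)
    return order, contexts

tokens = ["New", "York", "is", "a", "city"]
order, contexts = permutation_contexts(tokens)
for pos in order:
    visible = [tokens[i] for i in contexts[pos]]
    print(f"predict {tokens[pos]!r} from {visible}")
```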
  4. RoBERTa:-

    1. RoBERTa uses 160 GB of text for pre-training, including the 16GB of BooksCorpus and English Wikipedia used in BERT.
    2. RoBERTa (Robustly Optimized BERT Approach) is a retraining of BERT with an improved training methodology, roughly 10x more data, and more compute power.
    3. RoBERTa removes the Next Sentence Prediction (NSP) task from BERT's pre-training and introduces dynamic masking, so that the masked tokens change across training epochs (a toy sketch follows these notes).
    4. As a result, RoBERTa outperforms both BERT and XLNet on GLUE benchmark results.
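
A small sketch of the dynamic masking idea, as an illustration rather than RoBERTa's exact preprocessing: the masked positions are re-drawn every time the example is seen, whereas static (BERT-style) masking fixes one mask during preprocessing.

```python
import random

# Re-draw the mask each epoch so the model sees different masked positions.
def random_mask(tokens, mask_prob=0.3, rng=None):
    # BERT/RoBERTa mask ~15% of tokens; a higher rate is used here only so
    # this tiny example visibly changes between epochs.
    rng = rng or random.Random()
    return [("[MASK]" if rng.random() < mask_prob else tok) for tok in tokens]

tokens = "natural questions contains real questions from google search users".split()

for epoch in range(3):
    # A fresh mask each epoch (dynamic); static masking would reuse the epoch-0 mask.
    print(f"epoch {epoch}:", random_mask(tokens, rng=random.Random(epoch)))
```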
  5. Comparison Between all the above models
  6. Hugging Face Transformers for Question Answering (a minimal pipeline sketch follows)
    1. BERT Fine-Tuning Code Walk-Through
    2. Hugging Face Transformers Usage
    3. Text Extraction From a Corpus Using BERT (AKA Question Answering)
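
To make the Hugging Face references concrete, here is a minimal extractive question-answering sketch using the `transformers` pipeline API. The checkpoint name is an illustrative assumption (a DistilBERT model fine-tuned on SQuAD); any extractive-QA checkpoint could be substituted, and this is separate from the competition training code.

```python
# Requires: pip install transformers (plus a backend such as TensorFlow 2 or PyTorch).
from transformers import pipeline

# Illustrative checkpoint choice; swap in any model fine-tuned for extractive QA.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = (
    "The Natural Questions dataset contains real user questions issued to "
    "Google Search, paired with Wikipedia pages that may or may not contain the answer."
)
result = qa(question="Where do the questions in Natural Questions come from?", context=context)
print(result["answer"], result["score"])  # the extracted span and the model's confidence
```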