BERT - Interpretation of Papers

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

There are two general strategies for applying pre-trained language representations to downstream tasks: feature-based approaches and fine-tuning approaches.

The authors argue that the main bottleneck of current pre-trained language models is that they are unidirectional. For example, GPT uses a left-to-right architecture, in which each token can only attend to the tokens that precede it. This matters less for sentence-level tasks, but it is very harmful for token-level tasks; in question answering, for instance, it is crucial to incorporate context from both directions.

BERT alleviates the unidirectionality constraint of previous models by using a masked language model (MLM) objective, inspired by the Cloze task: some tokens in the input text are randomly masked out, and the model predicts the masked tokens from the remaining context. In addition to the masked language model, the authors also propose a Next Sentence Prediction (NSP) task to jointly train text-pair representations.

The paper's contributions are as follows:

Pre-training general language representations has a long history; in this section we briefly review the most widely used approaches.

2.1 Unsupervised Feature-Based Approaches :

Learning broadly applicable word representations has been an active area of research for decades, covering both non-neural and neural approaches. Pre-trained word embeddings are an integral part of modern NLP systems and offer significant improvements over embeddings learned from scratch (Turian et al., 2010). To pre-train word embedding vectors, left-to-right language modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives that discriminate correct from incorrect words in left and right context (Mikolov et al., 2013).

These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). To train sentence representations, prior work has used objectives that rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), generate the words of the next sentence left to right given a representation of the previous sentence (Kiros et al., 2015), or derive from denoising autoencoders (Hill et al., 2016).

ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding research along a different dimension. They extract context-sensitive features from a left-to-right and a right-to-left language model; the contextual representation of each token is the concatenation of its left-to-right and right-to-left representations. By combining contextual word embeddings with existing task-specific architectures, ELMo advanced the state of the art on several major NLP benchmarks (Peters et al., 2018a), including question answering (Rajpurkar et al., 2016), sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003). Melamud et al. (2016) proposed learning contextual representations with a task that uses LSTMs to predict a single word from its left and right context. Like ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) showed that the Cloze task can be used to improve the robustness of text generation models.

2.2 Unsupervised fine-tuning methods:

As with the feature-based approaches, the first methods in this direction pre-trained only word embedding parameters from unlabeled text. More recently, sentence or document encoders that produce contextual token representations have been pre-trained on unlabeled text and then fine-tuned on supervised downstream tasks.

The advantage of these approaches is that few parameters need to be learned from scratch. At least partly because of this advantage, OpenAI GPT achieved previously state-of-the-art results on many sentence-level tasks of the GLUE benchmark. Left-to-right language modeling and auto-encoder objectives have been used for pre-training such models.

Note: the overall pre-training and fine-tuning procedure for BERT. The same architecture is used for pre-training and fine-tuning, except for the output layer. The same pre-trained model parameters are used to initialize the models for different downstream tasks, and during fine-tuning all parameters are updated.

2.3 Transfer Learning from Supervised Data:

Work has also shown effective transfer from supervised tasks with large datasets, such as natural language inference and machine translation. Computer vision research has likewise demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained on ImageNet.

This section describes BERT and its detailed implementation. There are two steps in our framework: pre-training and fine-tuning.

One of the distinguishing features of BERT is its unified architecture across different tasks. The differences between the pre-trained architecture and the final downstream architecture are minimized.

The model architecture of BERT is a multi-layer bidirectional Transformer encoder; its implementation is almost identical to the original Transformer encoder.

Definitions: the number of Transformer blocks (layers) is denoted L, the hidden size H, and the number of self-attention heads A. The authors mainly report results for two model sizes:

BERT-base: L=12, H=768, A=12 (about 110M parameters)

BERT-large: L=24, H=1024, A=16 (about 340M parameters)
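As a reference point, these two configurations can be written down as a small sketch. The dataclass below is purely illustrative and is not BERT's actual configuration API; the sizes are those reported in the paper.

```python
from dataclasses import dataclass

@dataclass
class BertConfig:
    num_layers: int   # L: number of Transformer blocks
    hidden_size: int  # H: hidden dimension
    num_heads: int    # A: number of self-attention heads

# Model sizes reported in the paper.
BERT_BASE  = BertConfig(num_layers=12, hidden_size=768,  num_heads=12)   # ~110M parameters
BERT_LARGE = BertConfig(num_layers=24, hidden_size=1024, num_heads=16)   # ~340M parameters
```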

For comparison, BERT-base was chosen to have the same model size as OpenAI GPT. Crucially, however, the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained (causal) self-attention, in which each token can only attend to the context to its left.
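The difference between the two attention patterns can be made concrete with a small sketch; NumPy is used here only for illustration, not because either model is implemented this way.

```python
import numpy as np

seq_len = 5

# BERT: bidirectional self-attention -- every token may attend to every other token.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

# GPT: constrained (causal) self-attention -- token i may attend only to positions <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```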

To enable BERT to handle a wide range of downstream tasks, the model's input can be either a single sentence or a pair of sentences, both represented as a single token sequence. The authors use WordPiece embeddings with a 30,000-token vocabulary.
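As an illustration of how a sentence pair is packed into one token sequence, consider the sketch below. The token strings and segment ids are hand-written for clarity and are not the output of the actual WordPiece tokenizer.

```python
# A single packed sequence for the pair ("my dog is cute", "he likes playing"):
tokens      = ["[CLS]", "my", "dog", "is", "cute", "[SEP]", "he", "likes", "play", "##ing", "[SEP]"]
segment_ids = [0,        0,    0,     0,    0,      0,       1,    1,       1,      1,       1]
# The final input embedding at each position is the sum of its token embedding,
# segment (A/B) embedding, and position embedding.
```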

3.1 Pre-training BERT :

We do not use the traditional left-to-right or right-to-left language models to pre-train BERT. Instead, we pre-train BERT using the two unsupervised tasks described in this section. This step is shown in the left half of Figure 1.

Task #1: Masked LM

Standard language models can only be trained left-to-right or right-to-left, not truly bidirectionally, because bidirectional conditioning would allow each word to indirectly "see itself", letting the model trivially predict the target word in a multi-layer context.

To pre-train deep bidirectional representations, the authors randomly mask out some proportion of the input tokens and then predict those masked tokens: the final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary, just as in a standard language model. The authors call this procedure a "masked LM" (MLM); it is also known as a Cloze (fill-in-the-blank) task.
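A minimal sketch of that prediction head is shown below. The hidden vector and projection matrix here are random placeholders standing in for the encoder output and the learned output embedding matrix; this is an illustration of the softmax-over-vocabulary step, not trained BERT parameters.

```python
import numpy as np

vocab_size, hidden_size = 30000, 768

# Hidden vector T_i of a masked position, as produced by the encoder (placeholder here).
h_masked = np.random.randn(hidden_size)

# Project to vocabulary logits and apply a softmax, as in a standard LM head.
W = np.random.randn(vocab_size, hidden_size) * 0.02
logits = W @ h_masked
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted_token_id = int(np.argmax(probs))
```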

○ A disadvantage of the masked LM pre-training task: the [MASK] token never appears during fine-tuning, which creates a mismatch between pre-training and fine-tuning. To mitigate this, the authors adopt a compromise:

○ BERT's masking strategy: 15% of the token positions are chosen at random for prediction; of those, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged (see the sketch below).
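A rough sketch of this masking procedure follows. The handling of special tokens and the vocabulary argument are simplifying assumptions, not the paper's exact data pipeline.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Apply BERT-style masking: 80% [MASK], 10% random token, 10% unchanged."""
    output, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            continue
        labels[i] = tok                       # the model must predict the original token
        r = random.random()
        if r < 0.8:
            output[i] = "[MASK]"              # 80%: replace with [MASK]
        elif r < 0.9:
            output[i] = random.choice(vocab)  # 10%: replace with a random token
        # else: 10%: keep the original token unchanged
    return output, labels
```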

Task #2: Next Sentence Prediction (NSP)

Many downstream tasks depend on understanding the relationship between two sentences, which language modeling does not capture directly. To train a model that understands such inter-sentence relationships, the authors designed next sentence prediction (NSP), a binary classification task. Specifically, each training example consists of two sentences A and B: with 50% probability B is the actual next sentence following A (labeled IsNext), and with 50% probability B is a random sentence from the corpus (labeled NotNext). For prediction, the final hidden state C corresponding to the [CLS] token is fed into a binary classifier.
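A minimal sketch of how such NSP training pairs might be drawn from a document-level corpus is given below. The corpus structure (a list of documents, each a list of at least two sentences) is an assumption made for illustration, not the paper's exact sampling code.

```python
import random

def make_nsp_example(documents):
    """Pick (sentence A, sentence B, is_next) with a 50/50 split, as in the NSP task."""
    doc = random.choice(documents)           # each document: a list of >= 2 sentences
    idx = random.randrange(len(doc) - 1)
    sent_a = doc[idx]
    if random.random() < 0.5:
        sent_b, is_next = doc[idx + 1], 1             # actual next sentence (IsNext)
    else:
        other = random.choice(documents)
        sent_b, is_next = random.choice(other), 0     # random sentence (NotNext)
    return sent_a, sent_b, is_next
```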

○ Pre-training data:

The authors chose BooksCorpus (800M words) and English Wikipedia (2,500M words) as the pre-training corpus. From Wikipedia they keep only the text passages and ignore lists, tables, and headers. To obtain long contiguous text sequences, it is critical to use a document-level corpus rather than a shuffled sentence-level corpus such as the Billion Word Benchmark.

3.2 Fine-tuning BERT:

Because the self-attention mechanism in the Transformer fits many downstream tasks, fine-tuning the model is straightforward. For tasks involving text pairs, a common pattern is to encode the two texts independently and then apply bidirectional cross-attention between them; BERT unifies these two stages with self-attention, since encoding a concatenated sentence pair with self-attention effectively includes bidirectional cross-attention between the two sentences.

For each task, we simply plug the task-specific inputs and outputs into BERT and fine-tune all parameters end to end.

Fine-tuning is relatively inexpensive compared to pre-training. Starting from the exact same pre-trained model, all the results in the paper can be replicated in at most 1 hour on a single Cloud TPU, or a few hours on a GPU.

In this section, we present BERT fine-tuning results for 11 NLP tasks.

4.1 GLUE:

GLUE (General Language Understanding Evaluation) is a benchmark comprising a diverse collection of NLP tasks. The authors set the batch size to 32, train for 3 epochs, and select the best learning rate from {5e-5, 4e-5, 3e-5, 2e-5} on the validation set, as sketched below.
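In pseudocode-like form, this amounts to a small grid search over learning rates. The `train_and_evaluate` helper below is a hypothetical placeholder standing in for a full fine-tuning run plus dev-set evaluation; only the hyperparameter values come from the paper.

```python
def train_and_evaluate(lr, batch_size, epochs):
    """Placeholder: fine-tune BERT with these hyperparameters and return the dev score."""
    return 0.0  # replace with an actual fine-tuning + evaluation run

batch_size = 32
num_epochs = 3
candidate_lrs = [5e-5, 4e-5, 3e-5, 2e-5]

best_lr, best_dev_score = None, float("-inf")
for lr in candidate_lrs:
    dev_score = train_and_evaluate(lr=lr, batch_size=batch_size, epochs=num_epochs)
    if dev_score > best_dev_score:
        best_lr, best_dev_score = lr, dev_score
```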

The results are shown in Table 1. BERT-base and BERT-large outperform all systems on all tasks, with average accuracy improvements of 4.5% and 7.0%, respectively, over the prior state of the art. Note that BERT-base and OpenAI GPT are nearly identical in model architecture, apart from the attention masking.

For the largest and most widely reported GLUE task, MNLI, BERT obtains a 4.6% absolute accuracy improvement. On the official GLUE leaderboard, BERT-large scores 80.5, compared with OpenAI GPT's 72.8 as of the time of writing. We find that BERT-large significantly outperforms BERT-base across all tasks, especially those with very little training data.

4.2 SQuAD v1.1 :

The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100,000 crowdsourced question/answer pairs. Given a question and a Wikipedia passage containing the answer, the task is to predict the answer text span within the passage.

As shown in Figure 1, in the question answering task we represent the input question and passage as a single packed sequence, with the question using the A embedding and the passage using the B embedding. During fine-tuning we introduce only a start vector S and an end vector E. The probability of word i being the start of the answer span is computed as a dot product between Ti and S, followed by a softmax over all words in the passage:
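Written out, with $T_i \in \mathbb{R}^{H}$ denoting the final hidden vector of token $i$ in the passage, this start probability is

$$P_i = \frac{e^{\,S \cdot T_i}}{\sum_{j} e^{\,S \cdot T_j}}$$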

An analogous formula is used for the end of the answer span. The score of a candidate span from position i to position j is defined as S·Ti + E·Tj, and the maximum-scoring span with j ≥ i is used as the prediction (see the sketch below). The training objective is the sum of the log-likelihoods of the correct start and end positions. We fine-tune for 3 epochs with a learning rate of 5e-5 and a batch size of 32.
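A small sketch of this span-selection rule is given below. The `start_scores`/`end_scores` lists stand in for the dot products S·Ti and E·Tj, and the maximum answer length is an illustrative assumption rather than a value from the paper.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (i, j) maximizing start_scores[i] + end_scores[j] subject to j >= i."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_answer_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best
```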

Table 2 shows the results for the top leaderboard entries as well as the top published systems. The top SQuAD leaderboard entries do not have up-to-date public descriptions of their systems and are allowed to use any public data when training. Our system therefore uses modest data augmentation: it is first fine-tuned on TriviaQA and then fine-tuned on SQuAD.

Our best-performing system outperforms the top-ranked ensemble system by +1.5 F1 and the top-ranked single system by +1.3 F1. In fact, our single BERT model outperforms the top ensemble system in F1. Without the TriviaQA fine-tuning data we lose only 0.1-0.4 F1, still far outperforming all existing systems.

Other experiments: omitted

In this section, we perform ablation experiments on many aspects of BERT to better understand their relative importance. Additional ablation studies can be found in Appendix C.

5.1 Effectiveness of Pre-Training Tasks :

○ The following ablation tests were conducted:

○ The results are as follows:

5.2 Impact of model size:

○ The results are as follows:

The authors show that, provided the model has been sufficiently pre-trained, scaling to very large model sizes also leads to large improvements on downstream tasks with very small training sets.

5.3 Applying Bert to Feature-Based Approaches:

○ A feature-based approach extracts fixed features from a pre-trained model without fine-tuning it for a specific task.

○ This approach also has certain advantages: not every task is easily represented by a Transformer encoder architecture, so some tasks need a task-specific model added on top; and there is a large computational benefit in pre-computing an expensive representation of the training data once and then running many cheaper experiments on top of it.

The authors conducted the following experiment: NER on the CoNLL-2003 dataset, without a CRF layer on the output. Contextual activations are extracted from one or more encoder layers and fed into a two-layer, 768-dimensional BiLSTM, followed directly by the classification layer (see the sketch below). The results are as follows:
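A hedged sketch of this feature-based setup, using the HuggingFace transformers and PyTorch libraries: these libraries, the checkpoint name, the input sentence, and the label count are assumptions of the sketch, not the paper's original implementation; only the overall recipe (frozen encoder, top-four-layer features, two-layer 768-dim BiLSTM, token classifier) follows the description above.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
bert = BertModel.from_pretrained("bert-base-cased", output_hidden_states=True)
bert.eval()  # frozen: the feature-based approach does not fine-tune BERT

num_labels = 9  # CoNLL-2003 BIO tag set size (illustrative)
bilstm = nn.LSTM(input_size=4 * 768, hidden_size=768, num_layers=2,
                 bidirectional=True, batch_first=True)
classifier = nn.Linear(2 * 768, num_labels)

inputs = tokenizer("John lives in New York", return_tensors="pt")
with torch.no_grad():
    hidden_states = bert(**inputs).hidden_states   # tuple: embeddings + 12 layers

# Concatenate the activations of the top four layers as fixed features.
features = torch.cat(hidden_states[-4:], dim=-1)    # (1, seq_len, 4*768)
lstm_out, _ = bilstm(features)                      # (1, seq_len, 2*768)
logits = classifier(lstm_out)                       # per-token NER logits
```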

The results show that BERT is effective for both the fine-tuning and the feature-based approaches.

Personally, I think the significance of BERT is:

Recent empirical improvements due to transfer learning of language models suggest that rich, unsupervised pre-training is an integral part of many language understanding systems. In particular, these results allow even low-resource tasks to benefit from deep unidirectional architectures. Our main contribution is to further generalize these findings to deep bi-directional architectures, enabling the same pre-trained models to successfully handle a wide range of NLP tasks.