While architectures and pre-training objectives should not be conflated conceptually [545], in practice, there exist popular combinations that achieve good performances. Attending to all tokens, as shown in Fig. 3 (left), is the most data-efficient strategy since it uses context from before and after the token to be predicted. However, for that reason, it is unsuitable for text generation [120], since it considers future context for prediction. We typically employ it in natural language understanding (NLU) tasks [120], where it has shown strong results. The next-token prediction objective is most suitable for natural language generation (NLG) but is also the least data-efficient since it only attends to the past context (Fig. 3 (middle)). More recent advances in pre-training objectives aim to find a middle ground that increases data efficiency by providing stronger and more diverse training signals, e.g., the prefix LM, which partly attends to past tokens, as illustrated in Fig. 3 (right) and discussed below.

The following discusses the trade-offs between some of the recently proposed objectives. Fig. 4 visually depicts the different pre-training objectives. Notation-wise, we denote a sequence of N tokens x as x = x_1, ..., x_N.

We start with the most basic and still widely used Language Modeling [59] (or next-token prediction) objective. Here, we learn parameters θ by maximizing the likelihood of the next token given the previous tokens,

L(x) = \sum_{i=1}^{N} \log P(x_i \mid x_1, \ldots, x_{i-1}; \theta).    (1)

Masked Language Modeling (MLM; or Cloze) [549, 120] hides a set proportion of tokens in the sequence by replacing them with a special [MASK] token. The literature employs the MLM objective for non-autoregressive, i.e., non-generative, bidirectional context models, where the model uses tokens before and after the target token for predictions, leveraging a more holistic understanding of its context than the NTP objective. Furthermore, we can use each input sentence to predict multiple masked tokens in a single pass, while the NTP objective typically learns from predicting one token at a time. Let x_MASK denote the set of indices of the masked tokens and x_¬MASK the unmasked tokens. The objective of MLM is then to maximize the likelihood given the parameters θ,

L(x_{MASK} \mid x_{\neg MASK}) = \frac{1}{|x_{MASK}|} \sum_{i \in x_{MASK}} \log P(x_{MASK_i} \mid x_{\neg MASK}; \theta).    (2)

Patel et al. [410] show that such models produce representations more suitable for transfer learning; however, they come with difficulties in performing in-context learning (Sec. 2.7).

To further improve the training efficiency of the MLM objective, Bajaj et al. [33] propose to replace input tokens with ones generated by an auxiliary language model (ALM), resulting in a Model generated dEnoising TRaining Objective (METRO). Their approach consists of roughly three components: (i) train an ALM using the MLM objective, (ii) given some inputs with masked positions, predict the tokens (with the ALM), (iii) train the main model to correct these tokens inserted in the masked positions, i.e., 1) predict whether the ALM has replaced a token and if so, 2) predict the original token. They train the auxiliary and main model jointly.
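To make these two objectives concrete, the following is a minimal sketch of both losses, assuming a PyTorch-style model that maps token ids to per-position vocabulary logits (causal attention for next-token prediction, bidirectional attention for MLM). The function names, the `mask_id` argument, and the 15% masking rate are illustrative assumptions rather than details from the text; minimizing these cross-entropies is equivalent to maximizing Eqs. (1) and (2).

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, x):
    """Eq. (1): predict token i from tokens < i (model assumed to use causal attention)."""
    logits = model(x[:, :-1])                       # (batch, N-1, vocab)
    targets = x[:, 1:]                              # targets are the inputs shifted by one
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def masked_lm_loss(model, x, mask_id, mask_prob=0.15):
    """Eq. (2): mask a proportion of tokens and predict only the masked positions."""
    is_masked = torch.rand(x.shape, device=x.device) < mask_prob
    corrupted = x.clone()
    corrupted[is_masked] = mask_id                  # replace sampled tokens with [MASK]
    logits = model(corrupted)                       # model assumed to use bidirectional attention
    targets = x.clone()
    targets[~is_masked] = -100                      # ignore unmasked positions in the loss
    # Averaging over the non-ignored positions matches the 1/|x_MASK| normalization in Eq. (2).
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1),
                           ignore_index=-100)
```

Note that BERT-style MLM implementations additionally keep some sampled positions unchanged or replace them with random tokens; the sketch above omits this detail.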
Prefix Language Modeling [443] generalizes language modeling by allowing prefix tokens with a bidirectional receptive field to be added to the input (without a prefix, it is equivalent to standard LM). Note that this is still different from the bidirectional context as in MLM, where we always condition on all the tokens before and after the masked ones (see Fig. 3 left). For computing the hidden states of the prefix, the prefix LM attends to tokens before and after it (see Fig. 3 right); a minimal sketch of the corresponding attention mask is given at the end of this section.

Span Corruption [303, 443, 132] or span denoising refers to a group of denoising objectives that generalize MLM to denoise contiguous sequences of tokens within a given text, called spans. The denoising objectives typically replace the sampled spans with a single unique masking token and train the model to fill it in. Raffel et al. [443] show that this can speed up training, since replacing whole spans with single mask tokens produces shorter sequences than corrupting individual tokens.
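As a rough illustration of span corruption, the sketch below replaces hand-picked spans with unique sentinel tokens and builds the corresponding reconstruction target. The hard-coded span positions are illustrative assumptions (real implementations sample span locations and lengths, e.g., corrupting roughly 15% of tokens), and the sentinel naming follows T5's <extra_id_k> convention.

```python
from typing import List, Tuple

def span_corrupt(tokens: List[str], spans: List[Tuple[int, int]]):
    """spans: non-overlapping, sorted (start, end) index pairs to corrupt."""
    inputs, targets, cursor = [], [], 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs.extend(tokens[cursor:start])   # keep uncorrupted tokens
        inputs.append(sentinel)               # one unique sentinel per span
        targets.append(sentinel)
        targets.extend(tokens[start:end])     # the model must reconstruct the span
        cursor = end
    inputs.extend(tokens[cursor:])
    # T5 additionally appends a final sentinel to the target; omitted here for brevity.
    return inputs, targets

toks = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(toks, [(1, 3), (6, 7)])
# inp: ['Thank', '<extra_id_0>', 'inviting', 'me', 'to', '<extra_id_1>', 'party', 'last', 'week']
# tgt: ['<extra_id_0>', 'you', 'for', '<extra_id_1>', 'your']
```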

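Returning to the prefix LM, the following is a minimal sketch of the attention mask depicted in Fig. 3 (right), assuming a boolean convention in which mask[i, j] = True means position i may attend to position j; the function name is an illustrative assumption.

```python
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Bidirectional attention within the prefix, causal attention afterwards."""
    causal = torch.tril(torch.ones(seq_len, seq_len)).bool()   # standard causal mask
    mask = causal.clone()
    mask[:, :prefix_len] = True     # every position may attend to all prefix positions
    return mask                     # mask[i, j] == True  =>  token i attends to token j

# Example: 6 tokens, the first 3 of which form the (bidirectionally attended) prefix.
print(prefix_lm_mask(6, 3).int())
```

With prefix_len = 0 the mask reduces to the standard causal LM mask, matching the note above that a prefix LM without a prefix is equivalent to standard language modeling.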