
I have some custom data I want to use to further pre-train the BERT model. I’ve tried the following two approaches so far:

  1. Starting from a pre-trained BERT checkpoint and continuing pre-training with the Masked Language Modeling (MLM) + Next Sentence Prediction (NSP) heads (e.g. using the BertForPreTraining model)
  2. Starting from a pre-trained BERT checkpoint and continuing with the MLM objective only (e.g. using the BertForMaskedLM model, assuming we don’t need NSP for the pre-training part)

But I’m still confused about whether using BertForPreTraining or BertForMaskedLM actually continues the pre-training of BERT, or whether these are just two models for fine-tuning BERT with MLM+NSP and MLM, respectively. Is there even any difference between fine-tuning BERT with MLM+NSP and continuing its pre-training with these two heads, or is this something we need to test?
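
For reference, here is a rough sketch of how I'm loading the two variants (using bert-base-uncased as the starting checkpoint just as an example; the data pipeline and training loop are omitted):

    from transformers import BertForPreTraining, BertForMaskedLM

    # Approach 1: checkpoint weights with both the MLM and NSP heads
    model_mlm_nsp = BertForPreTraining.from_pretrained("bert-base-uncased")

    # Approach 2: checkpoint weights with the MLM head only
    model_mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")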

I've reviewed similar questions such as this one, but I still want to make sure whether there is technically any difference between continuing to pre-train a model from an initial checkpoint and fine-tuning it with the same objective/head.

2 Answers


The difference is merely one of terminology. When the model is trained on a large generic corpus, it is called 'pre-training'. When it is adapted to a particular task or dataset, it is called 'fine-tuning'.

Technically speaking, in either case ('pre-training' or 'fine-tuning') there are updates to the model weights.

For example, usually you can just take the pre-trained model and then fine-tune it for a specific task (such as classification, question answering, etc.). However, if the target dataset is from a specific domain and you have some unlabeled data that might help the model adapt to that domain, you can first do MLM or MLM+NSP 'fine-tuning' (unsupervised learning) on that data (some researchers do call this 'pre-training', especially when a huge corpus is used), and then fine-tune on the target task with the target corpus.
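
For instance, a minimal sketch of that intermediate MLM step with the Hugging Face Trainer could look like the following (the corpus file name, model name, and hyperparameters are just placeholders):

    from datasets import load_dataset
    from transformers import (
        BertForMaskedLM,
        BertTokenizerFast,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")

    # Unlabeled in-domain text, one example per line (path is hypothetical)
    raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})
    tokenized = raw.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True,
        remove_columns=["text"],
    )

    # The collator masks 15% of the tokens on the fly and builds the MLM labels
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="bert-domain-adapted", num_train_epochs=1),
        train_dataset=tokenized["train"],
        data_collator=collator,
    )
    trainer.train()
    trainer.save_model("bert-domain-adapted")  # reload this later for target-task fine-tuning

The saved checkpoint is then loaded with the task-specific head (e.g. BertForSequenceClassification) for the supervised fine-tuning on the labelled target data.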

  • I also came to the conclusion that it's more of a difference in terminology.
    – Pedram
    Jul 25, 2021 at 5:56
  • Does anyone have a script for continual pre-training using MLM and NSP from a checkpoint?
    – Kamil
    Sep 8, 2021 at 1:55
  • The fact is that pre-training is an unsupervised process (for BERT, the tasks are MLM and NSP), while fine-tuning to a specific task requires a labelled dataset. While reading papers and blogs these terms become ambiguous; even in Hugging Face blogs the process is usually called fine-tuning whenever it isn't training from scratch. In Hugging Face it is only a matter of how you initialize the model, either from a configuration or from an existing pre-trained model. More in part 3 and the updated colab it provides: huggingface.co/blog/how-to-train.
    – Zisis F
    Mar 28, 2022 at 13:55

Yes, there is a difference between pre-training and "further pre-training".

Pre-training usually means taking the model architecture, initializing the weights randomly, and training the model from absolute scratch on some large corpus.

Further pre-training means taking an already pre-trained model and basically applying transfer learning: start from the saved weights of the trained model and continue training it on data from some new domain. This is usually beneficial if you don't have a very large corpus.

Regarding BertForPreTraining and BertForMaskedLM, you can use either one of them for both of the above purposes. It has been shown that further pre-training on MLM is very beneficial, and often NSP is not needed at all. So you will be good to go with BertForMaskedLM.

NB! You can initialize the model from a checkpoint with BertForMaskedLM.from_pretrained({model_name}) and then apply the training procedure; otherwise, to train from scratch, just pass it a config (from the Hugging Face API).
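
For example, a minimal sketch of the two initialization routes (model name and config values are only illustrative):

    from transformers import BertConfig, BertForMaskedLM

    # Further pre-training: start from the weights of an already trained checkpoint
    model = BertForMaskedLM.from_pretrained("bert-base-uncased")

    # Pre-training from scratch: randomly initialized weights built from a config
    config = BertConfig()  # defaults correspond to BERT-base hyperparameters
    model_from_scratch = BertForMaskedLM(config)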

  • Note: catastrophic forgetting is a problem in further pre-training; some solutions involve freezing some layers of the network.
    – Petar Ulev
    May 1, 2022 at 12:16
  • Also note: this article, aclanthology.org/2021.insights-1.9.pdf, states that further pre-training may not be very beneficial (or may be nearly useless) if the data for the downstream task is very large. But in short, yes, it is beneficial to have a pre-trained model for a specific domain.
    – Petar Ulev
    May 1, 2022 at 15:36
  • NB! = nota bene = note well, an important note, take notice.
    – K. Symbol
    Jul 19, 2022 at 3:10
