
Given a sentiment classification dataset, I want to fine-tune BERT.

As you know, BERT was pre-trained (in part) to predict whether one sentence follows another. To make the network aware of this, they insert a [CLS] token at the beginning of the first sentence, then a [SEP] token to separate the first sentence from the second, and finally another [SEP] at the end of the second sentence (it's not clear to me why they append this last token).

Anyway, for text classification, what I noticed in some of the examples online (see BERT in Keras with TensorFlow Hub) is that they add the [CLS] token, then the sentence, and another [SEP] token at the end.

Whereas other research works (e.g. Enriching Pre-trained Language Model with Entity Information for Relation Classification) remove the last [SEP] token.

Why is it (or isn't it) beneficial to add the [SEP] token at the end of the input text when my task uses only a single sentence?
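
To make the two conventions concrete, here is a minimal sketch of both input formats. It uses the HuggingFace transformers tokenizer rather than the Keras/TensorFlow Hub code from the example I linked, so the library calls and model name are just my choice for illustration:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    sentence = "The movie was boring"

    # Variant A: the usual single-sentence format, [CLS] tokens [SEP]
    tokens_with_sep = ["[CLS]"] + tokenizer.tokenize(sentence) + ["[SEP]"]
    print(tokens_with_sep)      # ['[CLS]', 'the', 'movie', 'was', 'boring', '[SEP]']

    # Variant B: the same input without the trailing [SEP]
    tokens_without_sep = ["[CLS]"] + tokenizer.tokenize(sentence)
    print(tokens_without_sep)   # ['[CLS]', 'the', 'movie', 'was', 'boring']

    # Either token list is then converted to ids and padded as usual before
    # being fed to the classifier.
    ids_a = tokenizer.convert_tokens_to_ids(tokens_with_sep)
    ids_b = tokenizer.convert_tokens_to_ids(tokens_without_sep)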

  • .@user_007 interesting question, have you had any further insights?
    – MJimitater
    Jan 20, 2021 at 14:47
  • @MJimitater unfortunately no.
    – Minions
    Jan 20, 2021 at 14:53
  • I proposed some (unfortunately rather unsatisfactory) ideas of mine in an answer below; please let me know your thoughts on this, so we can both move further towards the truth
    – MJimitater
    Jan 20, 2021 at 14:56

2 Answers


I'm not quite sure why BERT needs the separator token [SEP] at the end for single-sentence tasks, but my guess is the following: BERT is an autoencoding model that, as mentioned, was originally designed for Masked Language Modelling and Next Sentence Prediction. So BERT was trained to always expect the [SEP] token, which means that the token is woven into the underlying knowledge that BERT built up during pre-training.

Downstream tasks that came later, such as single-sentence use cases (e.g. text classification), turned out to work with BERT as well; however, the [SEP] token was left in as a relic that BERT expects in order to work properly, and thus it is needed even for these tasks.

BERT might learn faster if [SEP] is appended at the end of a single sentence, because it has encoded some knowledge in that token about it marking the end of the input. Without it, BERT would still know where the sentence ends (thanks to the padding tokens), which explains why the aforementioned research leaves the token out; but this might slow down training slightly, since BERT might learn faster with an appended [SEP] token, especially when the input is truncated and there are no padding tokens at all.
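
A quick way to see the padding point above (my own sketch with the HuggingFace tokenizer, not code from any of the works mentioned): with padding, the end of the sentence is visible even without [SEP], but in a truncated input there are no padding tokens left to signal it.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # A short input: its end is marked both by [SEP] and by the padding that follows.
    short_enc = tokenizer("The movie was boring",
                          padding="max_length", truncation=True, max_length=10)
    print(tokenizer.convert_ids_to_tokens(short_enc["input_ids"]))
    # ['[CLS]', 'the', 'movie', 'was', 'boring', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']

    # A truncated input: no padding remains, so the trailing [SEP] is the only
    # explicit end-of-input signal.
    long_enc = tokenizer("The movie was boring and far too long for its thin plot",
                         padding="max_length", truncation=True, max_length=10)
    print(tokenizer.convert_ids_to_tokens(long_enc["input_ids"]))
    # ['[CLS]', 'the', 'movie', 'was', 'boring', 'and', 'far', 'too', 'long', '[SEP]']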

  • Thanks @MJimitater .. I think this is the most probable answer. I still don't understand why it works both with and without it (scientifically). If no one else answers this, I'll accept yours as the answer ;)
    – Minions
    Jan 20, 2021 at 16:17
  • I guess this is close to the truth. The model has always seen it and it probably expects it. Jan 30, 2021 at 15:39

As mentioned in BERT's paper, BERT is pre-trained using two novel unsupervised prediction tasks: Masked Language Modelling and Next Sentence Prediction. In the Next Sentence Prediction task, the model takes a pair of sentences as input and learns to predict whether the second sentence is the next sentence in the original document or not.

Accordingly, I think the BERT model uses the relationship between two text sentences in the text classification task as well as in other tasks. This relationship can be used to predict whether the two sentences belong to the same class or not. Therefore, the [SEP] token is needed to merge the two sentences and determine the relationship between them.
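
For concreteness, here is a small sketch (my own, using the HuggingFace tokenizer; the two example sentences are just placeholders) of the pair format this refers to, [CLS] sentence A [SEP] sentence B [SEP], together with the token_type_ids that mark the two segments:

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # Encoding a sentence pair: the tokenizer inserts [SEP] between the two
    # sentences and appends another [SEP] at the end.
    enc = tokenizer("The movie was boring", "The actors were fantastic")
    print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
    # ['[CLS]', 'the', 'movie', 'was', 'boring', '[SEP]',
    #  'the', 'actors', 'were', 'fantastic', '[SEP]']

    # Segment ids: 0 for [CLS] + sentence A + its [SEP], 1 for sentence B + the final [SEP].
    print(enc["token_type_ids"])
    # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]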

  • This doesn't answer my question: when there is a single sentence to classify, why do some works/papers add [SEP] at the end of that sentence while others don't?
    – Minions
    Aug 15, 2020 at 21:46
  • I don't know why some works/papers don't add the [SEP] token at the end of the sentence, but I think this is why others do: imagine there is a set of sentences that you want to classify, for example sentence A, sentence B, sentence C, and sentence D. In the first stage, BERT merges A and B to understand the relationship between them and predict whether they belong to the same class or not. So the merged sequence will look like this: [CLS]A[SEP]B[SEP], and this step should be repeated for AC, AD, BC, etc. In my opinion, this is the reason for using the [SEP] token at the end of each sentence. Aug 15, 2020 at 22:18
  • Thanks, but I am aware of this .. this is the core idea of BERT. Please read my question again, especially the last sentence (when my task uses only a single sentence).
    – Minions
    Aug 15, 2020 at 23:56
  • A classification task can't be applied to a single sentence. It should be a set of sentences. I can't understand what you mean. Can you give an example? Aug 16, 2020 at 13:55
  • @SoroushFaridan What is meant is a text classification task like sentiment classification: "The movie was boring" -> 0; "The actors were fantastic" -> 1
    – MJimitater
    Jan 20, 2021 at 14:49
