Given a sentiment classification dataset, I want to fine-tune BERT.
As you may know, one of BERT's pretraining objectives is next sentence prediction: given a pair of sentences, the model predicts whether the second sentence actually follows the first. To make the network aware of the sentence boundaries, the input is built by inserting a [CLS] token at the beginning of the first sentence, a [SEP] token to separate the first sentence from the second, and another [SEP] at the end of the second sentence (it's not clear to me why this final token is appended).
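For reference, this is the pair format a standard tokenizer produces. The snippet below is only an illustration using the Hugging Face transformers BertTokenizer (not the TensorFlow Hub setup mentioned below), and the example sentences are made up:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Sentence-pair input as used for next sentence prediction:
# [CLS] sentence A [SEP] sentence B [SEP]
pair = tokenizer("the movie was great", "i would watch it again")
print(tokenizer.convert_ids_to_tokens(pair["input_ids"]))
# ['[CLS]', 'the', 'movie', 'was', 'great', '[SEP]',
#  'i', 'would', 'watch', 'it', 'again', '[SEP]']
```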
Anyway, for text classification, what I noticed in some of the examples online (see BERT in Keras with TensorFlow Hub) is that they add a [CLS] token at the beginning of the sentence and a [SEP] token at the end.
In other research works (e.g. Enriching Pre-trained Language Model with Entity Information for Relation Classification), however, the last [SEP] token is removed.
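To make the two conventions concrete, here is a small sketch of both single-sentence formats, again assuming the Hugging Face tokenizer and a made-up example sentence:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Convention 1: [CLS] sentence [SEP] (the tokenizer's default for a single sentence)
single = tokenizer("the movie was great")
print(tokenizer.convert_ids_to_tokens(single["input_ids"]))
# ['[CLS]', 'the', 'movie', 'was', 'great', '[SEP]']

# Convention 2: [CLS] sentence, with the trailing [SEP] dropped (built manually)
tokens = ["[CLS]"] + tokenizer.tokenize("the movie was great")
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
# ['[CLS]', 'the', 'movie', 'was', 'great']
```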
Why is it (or isn't it) beneficial to add the [SEP] token at the end of the input text when my task uses only a single sentence?