
What exactly is the difference between a "token" and a "special token"?

I understand the following:

  • what a typical token is
  • what a typical special token is: MASK, UNK, SEP, etc.
  • when to add a token (when you want to expand your vocab)

What I don't understand is: under what circumstances would you want to create a new special token? What are some examples of what we need it for, and when would we want to create a special token other than the default special tokens? If an example uses a special token, why can't a normal token achieve the same objective?

tokenizer.add_tokens(['[EOT]'], special_tokens=True)
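For context, a minimal sketch of the usual workflow around that call (bert-base-uncased is just a placeholder checkpoint): after adding a token, the model's embedding matrix also has to be resized so the new id actually has an embedding row.

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

#register [EOT] as an additional special token; returns the number of tokens added
num_added = tokenizer.add_tokens(['[EOT]'], special_tokens=True)

#grow the embedding matrix to cover the new vocabulary entry,
#otherwise the new id would index out of range
model.resize_token_embeddings(len(tokenizer))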

And I also don't quite understand the following description from the documentation. What difference does it make to our model if we set add_special_tokens to False?

add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.



Special tokens are called special because they are not derived from your input. They are added for a certain purpose and are independent of the specific input.

What I don't understand is: under what circumstances would you want to create a new special token? What are some examples of what we need it for, and when would we want to create a special token other than the default special tokens?

Just one example: in extractive conversational question answering, it is not unusual to add the question and answer of the previous dialog turn to your input to provide some context for your model. Those previous dialog turns are separated from the current question with special tokens. Sometimes people reuse the model's separator token for this, and sometimes they introduce new special tokens. The following is an example with a new special token [Q]:

#first dialog turn - no conversation history
[CLS] current question [SEP] text [EOS]
#second dialog turn - with previous question to have some context
[CLS] previous question [Q] current question [SEP] text [EOS]
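A rough sketch of how such an input could be built in code (this uses bert-base-uncased and its [CLS]/[SEP] convention rather than the [EOS] in the schematic above; the question and passage strings are made up):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(['[Q]'], special_tokens=True)

previous_question = "who wrote the book"
current_question = "when was it published"
text = "the book was published in 1851 by an unknown author"

#second dialog turn: prepend the previous question, separated by [Q]
question_part = previous_question + " [Q] " + current_question
enc = tokenizer(question_part, text)  #adds [CLS] and [SEP] automatically
print(tokenizer.convert_ids_to_tokens(enc['input_ids']))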

And I also don't quite understand the following description from the documentation. What difference does it make to our model if we set add_special_tokens to False?

from transformers import RobertaTokenizer
t = RobertaTokenizer.from_pretrained("roberta-base")

t("this is an example")
#{'input_ids': [0, 9226, 16, 41, 1246, 2], 'attention_mask': [1, 1, 1, 1, 1, 1]}

t("this is an example", add_special_tokens=False)
#{'input_ids': [9226, 16, 41, 1246], 'attention_mask': [1, 1, 1, 1]}

As you can see, the second input is missing two tokens (the special tokens). Those special tokens have a meaning for your model, since the model was trained with them. The last_hidden_state will be different due to the lack of those two tokens and will therefore lead to a different result for your downstream task.
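The two missing ids map to RoBERTa's <s> and </s> tokens (its equivalents of BERT's [CLS] and [SEP]), which you can verify with convert_ids_to_tokens:

t.convert_ids_to_tokens([0, 2])
#['<s>', '</s>']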

Some tasks, like sequence classification, often use the [CLS] token to make their predictions. If you remove it, a model that was pre-trained with a [CLS] token will struggle.
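As a rough illustration (a minimal sketch; the checkpoint is just an example), a classification head typically reads the hidden state at position 0, which is the [CLS] representation only when the special tokens were added:

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

enc = tok("this is an example", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

#classification heads usually pool the first position; with
#add_special_tokens=False this would be an ordinary word piece
#instead of [CLS]
cls_vector = out.last_hidden_state[:, 0]
print(cls_vector.shape)  #torch.Size([1, 768])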

  • Thank you so much for the clarification on the special tokens. However, if I need to add a special token [Q], what would happen if I added [Q] as a regular token instead of a special token and trained the model as usual? Does it lead to any difference in training?
    – ShaoMinLiu
    Apr 4, 2022 at 0:49
  • @ShaoMinLiu The model isn't aware of the difference between special tokens and ordinary tokens. It learns that some tokens are a bit different, but there is no line of code that treats them differently. For the tokenizer, on the other hand, it can make a difference: depending on the strategy, it could lead to a different mapping of the ids (e.g. mapping the special tokens first). See the sketch after these comments.
    – cronoik
    Apr 4, 2022 at 1:00
  • @cronoik, can you please say which one is best, adding a token as special or as regular, and why? I have tried both, and they work the same. Another question: when I add [Q] as a new special token, it overrides the default one: tokenizer.add_special_tokens({"sep_token": "[Q]"}, replace_additional_special_tokens=False)
    – Kaoutar
    Sep 3, 2023 at 12:41
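
To make the comment discussion concrete, here is a small sketch of the tokenizer-level difference (roberta-base is used purely as an example). Both variants keep [Q] as a single token, but only the special variant is registered in all_special_tokens and skipped when decoding with skip_special_tokens=True:

from transformers import AutoTokenizer

t_special = AutoTokenizer.from_pretrained("roberta-base")
t_regular = AutoTokenizer.from_pretrained("roberta-base")

t_special.add_tokens(['[Q]'], special_tokens=True)
t_regular.add_tokens(['[Q]'], special_tokens=False)

ids_s = t_special("previous question [Q] current question")['input_ids']
ids_r = t_regular("previous question [Q] current question")['input_ids']

#'[Q]' appears in all_special_tokens only for the special variant
print('[Q]' in t_special.all_special_tokens)  #True
print('[Q]' in t_regular.all_special_tokens)  #False

#skip_special_tokens drops [Q] only when it was added as special
print(t_special.decode(ids_s, skip_special_tokens=True))  #'[Q]' is gone
print(t_regular.decode(ids_r, skip_special_tokens=True))  #'[Q]' is still there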
