Special tokens are called special because they are not derived from your input. They are added for a certain purpose and are independent of the specific input.
What I don't understand is: in what circumstances would you want
to create a new special token? Are there any examples of what we need
one for, and when we would want to create a special token other than
the default special tokens?
Just as an example: in extractive conversational question answering, it is not unusual to add the question and answer of the previous dialog turn to your input to provide some context for your model. These previous dialog turns are separated from the current question with special tokens. Sometimes people reuse the model's separator token for this, and sometimes they introduce new special tokens. The following is an example with a new special token [Q]:
#first dialog turn - no conversation history
[CLS] current question [SEP] text [EOS]
#second dialog turn - with previous question to have some context
[CLS] previous question [Q] current question [SEP] text [EOS]
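If you introduce a new special token like [Q], both the tokenizer and the model have to know about it: you register it as an additional special token (so it is never split into sub-word pieces) and resize the model's embedding matrix so the new token gets its own vector, which is then learned during fine-tuning. A rough sketch with transformers ([Q] is just the illustrative name from above):
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")
# register [Q] so the tokenizer keeps it as a single, never-split token
tokenizer.add_special_tokens({"additional_special_tokens": ["[Q]"]})
# grow the embedding matrix so the new token gets its own (randomly initialized) vector
model.resize_token_embeddings(len(tokenizer))
tokenizer.tokenize("previous question [Q] current question")
# '[Q]' shows up as a single token in the output instead of being split into pieces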
And I also don't quite understand the following description in the
source documentation. What difference does it make to our model if we
set add_special_tokens to False?
from transformers import RobertaTokenizer
t = RobertaTokenizer.from_pretrained("roberta-base")
t("this is an example")
#{'input_ids': [0, 9226, 16, 41, 1246, 2], 'attention_mask': [1, 1, 1, 1, 1, 1]}
t("this is an example", add_special_tokens=False)
#{'input_ids': [9226, 16, 41, 1246], 'attention_mask': [1, 1, 1, 1]}
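If you map the ids back to tokens (a quick check, reusing the tokenizer t from above), you can see exactly what gets dropped:
t.convert_ids_to_tokens(t("this is an example")["input_ids"])
#['<s>', 'this', 'Ġis', 'Ġan', 'Ġexample', '</s>']
t.convert_ids_to_tokens(t("this is an example", add_special_tokens=False)["input_ids"])
#['this', 'Ġis', 'Ġan', 'Ġexample']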
As you can see, the input without add_special_tokens is missing the two special tokens. Those special tokens have a meaning for your model since it was trained with them. The last_hidden_state will be different due to the lack of those two tokens and will therefore lead to a different result for your downstream task.
Some tasks, like sequence classification, often use the [CLS] token (RoBERTa's <s>) to make their predictions. When you remove it, a model that was pre-trained with a [CLS] token will struggle.
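To make this concrete, here is a minimal sketch (using the plain roberta-base encoder, so the comparison only illustrates that the outputs change; the variable names and prints are my own): the hidden states of the remaining word tokens change when <s>/</s> are missing, and a classification head such as the one in RobertaForSequenceClassification makes its prediction from exactly the first position, i.e. the slot where <s>/[CLS] is expected to sit.
import torch
from transformers import RobertaModel, RobertaTokenizer

t = RobertaTokenizer.from_pretrained("roberta-base")
m = RobertaModel.from_pretrained("roberta-base")
m.eval()

enc_with = t("this is an example", return_tensors="pt")
enc_without = t("this is an example", add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    h_with = m(**enc_with).last_hidden_state        # shape (1, 6, 768)
    h_without = m(**enc_without).last_hidden_state  # shape (1, 4, 768)

# the four word tokens get different representations in the two runs, because
# self-attention can no longer attend to the (missing) <s>/</s> positions
print(torch.allclose(h_with[0, 1:5], h_without[0]))
#False

# a sequence-classification head predicts from the first position, i.e. the
# <s>/[CLS] slot; without special tokens it would instead be fed the vector
# of an ordinary word token
cls_vector = h_with[:, 0]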