
I want to build a multi-class classification model for which I have conversational data as input to a BERT model (bert-base-uncased):

QUERY: I want to ask a question.
ANSWER: Sure, ask away.
QUERY: How is the weather today?
ANSWER: It is nice and sunny.
QUERY: Okay, nice to know.
ANSWER: Would you like to know anything else?

Apart from this, I have two more inputs.

I was wondering if I should put special tokens in the conversation to make it more meaningful to the BERT model, like:

[CLS]QUERY: I want to ask a question. [EOT]
ANSWER: Sure, ask away. [EOT]
QUERY: How is the weather today? [EOT]
ANSWER: It is nice and sunny. [EOT]
QUERY: Okay, nice to know. [EOT]
ANSWER: Would you like to know anything else? [SEP]

But I am not able to add a new [EOT] special token.
Or should I use the [SEP] token for this?

EDIT: steps to reproduce

from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

print(tokenizer.all_special_tokens) # --> ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
print(tokenizer.all_special_ids)    # --> [100, 102, 0, 101, 103]

num_added_toks = tokenizer.add_tokens(['[EOT]'])
model.resize_token_embeddings(len(tokenizer))  # --> Embedding(30523, 768)

tokenizer.convert_tokens_to_ids('[EOT]')  # --> 30522

text_to_encode = '''QUERY: I want to ask a question. [EOT]
ANSWER: Sure, ask away. [EOT]
QUERY: How is the weather today? [EOT]
ANSWER: It is nice and sunny. [EOT]
QUERY: Okay, nice to know. [EOT]
ANSWER: Would you like to know anything else?'''

enc = tokenizer.encode_plus(
  text_to_encode,
  max_length=128,
  add_special_tokens=True,
  return_token_type_ids=False,
  return_attention_mask=False,
)['input_ids']

print(tokenizer.convert_ids_to_tokens(enc))

Result:

['[CLS]', 'query', ':', 'i', 'want', 'to', 'ask', 'a', 'question', '.', '[', 'e', '##ot', ']', 'answer', ':', 'sure', ',', 'ask', 'away', '.', '[', 'e', '##ot', ']', 'query', ':', 'how', 'is', 'the', 'weather', 'today', '?', '[', 'e', '##ot', ']', 'answer', ':', 'it', 'is', 'nice', 'and', 'sunny', '.', '[', 'e', '##ot', ']', 'query', ':', 'okay', ',', 'nice', 'to', 'know', '.', '[', 'e', '##ot', ']', 'answer', ':', 'would', 'you', 'like', 'to', 'know', 'anything', 'else', '?', '[SEP]']

2 Answers


As the intention of the [SEP] token is to act as a separator between two sentences, it fits your objective of using the [SEP] token to separate the QUERY and ANSWER sequences.
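
For example, here is a minimal sketch of joining the turns with the existing [SEP] token, reusing the conversation from the question; since [SEP] is already a special token, the tokenizer keeps it intact:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

turns = [
    "QUERY: I want to ask a question.",
    "ANSWER: Sure, ask away.",
    "QUERY: How is the weather today?",
    "ANSWER: It is nice and sunny.",
]

# Join turns with the existing [SEP] token; encode_plus adds [CLS] and the final [SEP]
text = f" {tokenizer.sep_token} ".join(turns)
enc = tokenizer.encode_plus(text, add_special_tokens=True)['input_ids']
print(tokenizer.convert_ids_to_tokens(enc))
# ['[CLS]', 'query', ':', ..., '[SEP]', 'answer', ':', ..., '[SEP]']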

You could also try adding dedicated tokens to mark the beginning and end of each turn: <BOQ> and <EOQ> for a QUERY, and likewise <BOA> and <EOA> for an ANSWER.

Sometimes, using the existing tokens works much better than adding new ones to the vocabulary, as learning a new token embedding requires a huge number of training iterations and a large amount of data.
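
The cost comes from the fact that a token added after pretraining starts from a randomly initialized embedding row, which only fine-tuning can give meaning to. A minimal sketch of inspecting that row, assuming the same bert-base-uncased setup as in the question:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Register the new token and grow the embedding matrix by one row
tokenizer.add_special_tokens({'additional_special_tokens': ['[EOT]']})
model.resize_token_embeddings(len(tokenizer))

# The new row is randomly initialized and carries no meaning until fine-tuned
eot_id = tokenizer.convert_tokens_to_ids('[EOT]')
print(model.get_input_embeddings().weight[eot_id][:5])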

However, if your application demands a new token, it can be added as follows:

num_added_toks = tokenizer.add_tokens(['[EOT]'], special_tokens=True) ## This line is updated
model.resize_token_embeddings(len(tokenizer))

### The tokenizer has to be saved if it is to be reused
tokenizer.save_pretrained(<output_dir>)
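
If the saved tokenizer is reloaded later, the added token travels with it. A small sketch, where "my_tokenizer" is a hypothetical stand-in for the output directory above:

tokenizer.save_pretrained("my_tokenizer")  # hypothetical directory name
tokenizer = AutoTokenizer.from_pretrained("my_tokenizer")
print(tokenizer.convert_tokens_to_ids('[EOT]'))  # --> 30522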
  • I have added the [EOT] token to the tokenizer using add_tokens, then added [EOT] in the data after every turn. But while tokenizing, it is breaking [EOT] into '[', 'e', '##ot', ']'.
    – sid8491
    Sep 15, 2021 at 18:29
  • Can you please share a small reproducible snippet? Sep 15, 2021 at 19:03
  • I have added the steps in the question detail. Let me know if there is any confusion. Appreciate the help.
    – sid8491
    Sep 15, 2021 at 19:39
  • Hi, I found the error. Since [EOT] was added as a special token, we had to pass special_tokens=True as a parameter. This prevents the text from being lowercased; after lowercasing, the added token would not be found in the vocabulary. Sep 15, 2021 at 21:52
  • It's a little tricky. If you have sufficient data to train the system, you can go with <BOQ> and <EOQ>. But there is no one perfect answer for what a 'sufficient amount' of data is, so it's more of an empirical approach: just try both of them. Sep 16, 2021 at 8:55

You should add it as a special token, not as a normal token; i.e., use the "add_special_tokens" method instead of the "add_tokens" method.

Here is a code example:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

print("Before")
print(tokenizer.all_special_tokens) # --> ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
print(tokenizer.all_special_ids)    # --> [100, 102, 0, 101, 103]

special_tokens_dict = {'additional_special_tokens': ['[EOT]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))  # --> Embedding(30523, 768); the model needs a row for the new token

tok_id = tokenizer.convert_tokens_to_ids('[EOT]')  # --> 30522

print("After")
print(tokenizer.all_special_tokens) # --> ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]', '[EOT]']
print(tokenizer.all_special_ids)    # --> [100, 102, 0, 101, 103, 30522]

Then, to encode the text, we use:

text_to_encode = '''QUERY: I want to ask a question. [EOT]
ANSWER: Sure, ask away. [EOT]
QUERY: How is the weather today? [EOT]
ANSWER: It is nice and sunny. [EOT]
QUERY: Okay, nice to know. [EOT]
ANSWER: Would you like to know anything else?'''

enc = tokenizer.encode_plus(
  text_to_encode,
  max_length=128,
  truncation=True,
  add_special_tokens=True,
  return_token_type_ids=False,
  return_attention_mask=False,
)['input_ids']

tokenizer.convert_ids_to_tokens(enc)
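
With [EOT] registered as a special token, it now survives tokenization intact (expected output, abbreviated):

['[CLS]', 'query', ':', 'i', 'want', 'to', 'ask', 'a', 'question', '.', '[EOT]', 'answer', ':', 'sure', ',', 'ask', 'away', '.', '[EOT]', ..., '[SEP]']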

To get back the original text without the special tokens:

tokenizer.convert_ids_to_tokens(enc, skip_special_tokens=True)
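
If you want a plain string rather than a list of tokens, tokenizer.decode can do the same in one step:

print(tokenizer.decode(enc, skip_special_tokens=True))
# --> e.g. 'query : i want to ask a question. answer : sure, ask away. ...'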
