
I want to build a multi-class classification model for which I have conversational data as input to a BERT model (bert-base-uncased):

QUERY: I want to ask a question.
ANSWER: Sure, ask away.
QUERY: How is the weather today?
ANSWER: It is nice and sunny.
QUERY: Okay, nice to know.
ANSWER: Would you like to know anything else?

Apart from this, I have two more inputs.

I was wondering if I should put special tokens in the conversation to make it more meaningful to the BERT model, like:

[CLS]QUERY: I want to ask a question. [EOT]
ANSWER: Sure, ask away. [EOT]
QUERY: How is the weather today? [EOT]
ANSWER: It is nice and sunny. [EOT]
QUERY: Okay, nice to know. [EOT]
ANSWER: Would you like to know anything else? [SEP]

But I am not able to add a new [EOT] special token.
Or should I use the [SEP] token for this?

EDIT: steps to reproduce

from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

print(tokenizer.all_special_tokens) # --> ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
print(tokenizer.all_special_ids)    # --> [100, 102, 0, 101, 103]

num_added_toks = tokenizer.add_tokens(['[EOT]'])
model.resize_token_embeddings(len(tokenizer))  # --> Embedding(30523, 768)

tokenizer.convert_tokens_to_ids('[EOT]')  # --> 30522

text_to_encode = '''QUERY: I want to ask a question. [EOT]
ANSWER: Sure, ask away. [EOT]
QUERY: How is the weather today? [EOT]
ANSWER: It is nice and sunny. [EOT]
QUERY: Okay, nice to know. [EOT]
ANSWER: Would you like to know anything else?'''

enc = tokenizer.encode_plus(
  text_to_encode,
  max_length=128,
  add_special_tokens=True,
  return_token_type_ids=False,
  return_attention_mask=False,
)['input_ids']

print(tokenizer.convert_ids_to_tokens(enc))

Result:

['[CLS]', 'query', ':', 'i', 'want', 'to', 'ask', 'a', 'question', '.', '[', 'e', '##ot', ']', 'answer', ':', 'sure', ',', 'ask', 'away', '.', '[', 'e', '##ot', ']', 'query', ':', 'how', 'is', 'the', 'weather', 'today', '?', '[', 'e', '##ot', ']', 'answer', ':', 'it', 'is', 'nice', 'and', 'sunny', '.', '[', 'e', '##ot', ']', 'query', ':', 'okay', ',', 'nice', 'to', 'know', '.', '[', 'e', '##ot', ']', 'answer', ':', 'would', 'you', 'like', 'to', 'know', 'anything', 'else', '?', '[SEP]']

2 Answers


As the intention of the [SEP] token is to act as a separator between two sentences, it fits your objective of using the [SEP] token to separate the QUERY and ANSWER sequences.
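
For example, here is a minimal sketch of joining the turns with the existing [SEP] token, reusing the conversation from the question; since [SEP] is already a special token, the tokenizer keeps it intact:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

turns = [
    "QUERY: I want to ask a question.",
    "ANSWER: Sure, ask away.",
    "QUERY: How is the weather today?",
    "ANSWER: It is nice and sunny.",
]

# Join turns with the existing [SEP] token; encode_plus adds [CLS] and the final [SEP]
text = f" {tokenizer.sep_token} ".join(turns)
enc = tokenizer.encode_plus(text, add_special_tokens=True)['input_ids']
print(tokenizer.convert_ids_to_tokens(enc))
# ['[CLS]', 'query', ':', ..., '[SEP]', 'answer', ':', ..., '[SEP]']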

You could also try adding dedicated tokens to mark the beginning and end of each turn: <BOQ> and <EOQ> for a QUERY, and likewise <BOA> and <EOA> for an ANSWER.

Sometimes, using the existing tokens works much better than adding new ones to the vocabulary, as learning a new token embedding requires a huge number of training iterations and a large amount of data.
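
The cost comes from the fact that a token added after pretraining starts from a randomly initialized embedding row, which only fine-tuning can give meaning to. A minimal sketch of inspecting that row, assuming the same bert-base-uncased setup as in the question:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Register the new token and grow the embedding matrix by one row
tokenizer.add_special_tokens({'additional_special_tokens': ['[EOT]']})
model.resize_token_embeddings(len(tokenizer))

# The new row is randomly initialized and carries no meaning until fine-tuned
eot_id = tokenizer.convert_tokens_to_ids('[EOT]')
print(model.get_input_embeddings().weight[eot_id][:5])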

However, if your application demands a new token, it can be added as follows:

num_added_toks = tokenizer.add_tokens(['[EOT]'], special_tokens=True) ## This line is updated
model.resize_token_embeddings(len(tokenizer))

### The tokenizer has to be saved if it is to be reused
tokenizer.save_pretrained(<output_dir>)
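
If the saved tokenizer is reloaded later, the added token travels with it. A small sketch, where "my_tokenizer" is a hypothetical stand-in for the output directory above:

tokenizer.save_pretrained("my_tokenizer")  # hypothetical directory name
tokenizer = AutoTokenizer.from_pretrained("my_tokenizer")
print(tokenizer.convert_tokens_to_ids('[EOT]'))  # --> 30522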
  • I have added the [EOT] token to the tokenizer using add_tokens, then added [EOT] in the data after every turn. But while tokenizing, it is breaking [EOT] into '[', 'e', '##ot', ']'.
    – sid8491
    Sep 15, 2021 at 18:29
  • Can you please share a small reproducible snippet? Sep 15, 2021 at 19:03
  • I have added the steps in the question detail. Let me know if there is any confusion. Appreciate the help.
    – sid8491
    Sep 15, 2021 at 19:39
  • Hi, I found the error. Since [EOT] was added as a special token, we had to pass special_tokens=True as a parameter. This prevents the text from being lowercased; after lowercasing, the added token would not be found in the vocabulary. Sep 15, 2021 at 21:52
  • It's a little tricky. If you have sufficient data to train the system, you can go with <BOQ> and <EOQ>. But there is no one perfect answer for what a 'sufficient amount' of data is, so it's more of an empirical approach: just try both of them. Sep 16, 2021 at 8:55

You should add it as a special token, not as a normal token; i.e., use the "add_special_tokens" method instead of the "add_tokens" method.

Here is a code example:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

print("Before")
print(tokenizer.all_special_tokens) # --> ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']
print(tokenizer.all_special_ids)    # --> [100, 102, 0, 101, 103]

special_tokens_dict = {'additional_special_tokens': ['[EOT]']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))  # --> Embedding(30523, 768); the model needs a row for the new token

tok_id = tokenizer.convert_tokens_to_ids('[EOT]')  # --> 30522

print("After")
print(tokenizer.all_special_tokens) # --> ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]', '[EOT]']
print(tokenizer.all_special_ids)    # --> [100, 102, 0, 101, 103, 30522]

Then, to encode the text, we use:

text_to_encode = '''QUERY: I want to ask a question. [EOT]
ANSWER: Sure, ask away. [EOT]
QUERY: How is the weather today? [EOT]
ANSWER: It is nice and sunny. [EOT]
QUERY: Okay, nice to know. [EOT]
ANSWER: Would you like to know anything else?'''

enc = tokenizer.encode_plus(
  text_to_encode,
  max_length=128,
  truncation=True,
  add_special_tokens=True,
  return_token_type_ids=False,
  return_attention_mask=False,
)['input_ids']

tokenizer.convert_ids_to_tokens(enc)
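
With [EOT] registered as a special token, it now survives tokenization intact (expected output, abbreviated):

['[CLS]', 'query', ':', 'i', 'want', 'to', 'ask', 'a', 'question', '.', '[EOT]', 'answer', ':', 'sure', ',', 'ask', 'away', '.', '[EOT]', ..., '[SEP]']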

To get back the original text without the special tokens:

tokenizer.convert_ids_to_tokens(enc, skip_special_tokens=True)
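
If you want a plain string rather than a list of tokens, tokenizer.decode can do the same in one step:

print(tokenizer.decode(enc, skip_special_tokens=True))
# --> e.g. 'query : i want to ask a question. answer : sure, ask away. ...'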
