When using Transformers from HuggingFace I am facing a problem with the encode and decode methods.
I have the following string:
test_string = 'text with percentage%'
Then I am running the following code:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
test_string = 'text with percentage%'
# encode() converts a string into a sequence of ids (integers), using the tokenizer and vocabulary
input_ids = tokenizer.encode(test_string)
output = tokenizer.decode(input_ids)
And the output looks like this:
'text with percentage %'
There is an extra space before the %. I have tried extra arguments like clean_up_tokenization_spaces, but that is for something different.
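Out of curiosity I also printed the tokens, and the percent sign comes back as its own token, so I assume the space is introduced when decode joins the tokens back together (a minimal check with the same tokenizer as above):

print(tokenizer.tokenize(test_string))
# prints something like: ['text', 'with', 'percentage', '%']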
What should I use when encoding and decoding to get exactly the same text before and after? This also happens for other special signs when using BertTokenizer to get subwords.
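For example, a dollar sign shows the same behaviour (a quick check with the same tokenizer; skip_special_tokens just drops [CLS]/[SEP] in case encode added them):

ids = tokenizer.encode('costs $5')
print(tokenizer.decode(ids, skip_special_tokens=True))
# prints something like: 'costs $ 5'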