
When using Transformers from HuggingFace, I am facing a problem with the encode and decode methods.

I have the following string:

test_string = 'text with percentage%'

Then I am running the following code:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

test_string = 'text with percentage%'

# encode converts a string into a sequence of ids (integers),
# using the tokenizer and its vocabulary
input_ids = tokenizer.encode(test_string)
output = tokenizer.decode(input_ids)

And the output looks like this:

'text with percentage %'

Note the extra space before the %. I have tried extra arguments like clean_up_tokenization_spaces, but that is for something different.

What should I use when encoding and decoding to get exactly the same text before and after? This also happens with other special characters.
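
For context, the space comes from the tokenization itself: BERT's basic tokenizer splits punctuation into standalone tokens, and decode() rejoins tokens with spaces, so the original spacing is lost. A quick check of the intermediate tokens (with bert-base-cased) makes this visible:

print(tokenizer.tokenize(test_string))
# expected: ['text', 'with', 'percentage', '%']; '%' becomes its own
# token, and decode() cannot tell it was not preceded by a space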

  • I don't think the BERT tokenization process is 100% reversible, as you've noticed. Why do you need it to be? There may be other ways to accomplish what you want, e.g. by keeping around the original string instead of reconstructing it from tokens. Nov 21, 2019 at 16:54
  • In contrast, things like github.com/kovalevfm/SubTokenizer actually are fully reversible. I wish BERT was careful about this, but attention to detail in text segmentation seems to be a "production" issue, not a "research" issue :( Nov 21, 2019 at 17:01
  • This is just a snippet from my script to show the problem. In between I am doing question answering, and what I would like to achieve is fully traceable text. Nov 21, 2019 at 17:35
  • Ah, so ideally you would have something like github.com/huggingface/transformers/pull/1274. Assuming you're okay with snapping answer spans to whole words, you can use something like bistring.readthedocs.io/en/latest/Python/Tokenizer.html to split into words while keeping track of string indexes, then just use BertTokenizer to get subwords (see the sketch after these comments). Nov 21, 2019 at 17:45
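
A minimal sketch of the offset-tracking idea from that last comment, using re.finditer in place of bistring (the regex and variable names are illustrative, not from the thread):

import re
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
test_string = 'text with percentage%'

# split into "words" while recording where each one starts and ends
# in the original string; \S+ is a stand-in for a smarter word splitter
word_spans = [(m.group(), m.start(), m.end())
              for m in re.finditer(r'\S+', test_string)]

for word, start, end in word_spans:
    # subword-tokenize each word separately; the (start, end) pair lets
    # you map any whole-word answer span back to the exact original text
    print(repr(test_string[start:end]), '->', tokenizer.tokenize(word))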

3 Answers


If you are trying to use BERT for token classification in order to find a span in your original string, then one workaround is to use BertTokenizerFast with the option return_offsets_mapping=True.

from transformers import BertTokenizerFast

test_string = 'text with percentage%'

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
tokens = tokenizer(test_string, return_offsets_mapping=True)
input_ids = tokens["input_ids"]

# some_model is a stand-in for your token-classification model; it is
# assumed to return start and end token indices for the predicted span
span_start_index, span_stop_index = some_model(input_ids)

Then once you get the token classification results, you can do something like

start_char = tokens.encodings[0].offsets[span_start_index][0]
end_char = tokens.encodings[0].offsets[span_stop_index][1]
predicted_span = test_string[start_char:end_char]
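
For reference, here is a small self-contained check (assuming bert-base-cased) of what the offset mapping contains; special tokens such as [CLS] and [SEP] map to (0, 0):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
test_string = 'text with percentage%'

tokens = tokenizer(test_string, return_offsets_mapping=True)
for token_id, (start, end) in zip(tokens["input_ids"],
                                  tokens["offset_mapping"]):
    # each offset pair points back into the original string, so slicing
    # recovers the exact text, e.g. '%' with no inserted space
    print(tokenizer.convert_ids_to_tokens(token_id),
          repr(test_string[start:end]))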

According to https://github.com/huggingface/transformers/pull/1274, they're working on it. Hopefully there will be a solution sometime next week.


Here is one way to combine "percentage" and "%" into a single word, though I am not sure whether it is useful for you.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

test_string = 'text with percentage%'
words = test_string.split()

collect_tokens = []
for word in words:
    tokens = tokenizer.tokenize(word)
    # mark every non-initial piece as a continuation ("##") so that
    # decode() joins it to the previous piece without a space
    for index in range(1, len(tokens)):
        if not tokens[index].startswith("##"):
            tokens[index] = "##" + tokens[index]
    collect_tokens += tokens

# encode tokens to input_ids
input_ids = tokenizer.convert_tokens_to_ids(collect_tokens)

# decode
output = tokenizer.decode(input_ids)
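
Note that this relies on two things: decode() joins "##" continuation pieces onto the previous token without a space, and convert_tokens_to_ids only works as intended if the manufactured "##"-prefixed piece (here "##%") actually exists in the vocabulary; otherwise it falls back to the unknown-token id.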
