
When using Transformers from HuggingFace, I am facing a problem with the encode and decode methods.

I have the following string:

test_string = 'text with percentage%'

Then I am running the following code:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

test_string = 'text with percentage%'

# encode converts a string into a sequence of ids (integers),
# using the tokenizer and its vocabulary
input_ids = tokenizer.encode(test_string)
output = tokenizer.decode(input_ids)

And the output looks like this:

'text with percentage %'

Note the extra space before the %. I have tried extra arguments like clean_up_tokenization_spaces, but that is for something different.

What should I use when encoding and decoding to get exactly the same text before and after? This also happens with other special characters.
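
For context, the space comes from the tokenization itself: BERT's basic tokenizer splits punctuation into standalone tokens, and decode() rejoins tokens with spaces, so the original spacing is lost. A quick check of the intermediate tokens (with bert-base-cased) makes this visible:

print(tokenizer.tokenize(test_string))
# expected: ['text', 'with', 'percentage', '%']; '%' becomes its own
# token, and decode() cannot tell it was not preceded by a space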

  • I don't think the BERT tokenization process is 100% reversible, as you've noticed. Why do you need it to be? There may be other ways to accomplish what you want, e.g. by keeping around the original string instead of reconstructing it from tokens. Nov 21, 2019 at 16:54
  • In contrast, things like github.com/kovalevfm/SubTokenizer actually are fully reversible. I wish BERT was careful about this, but attention to detail in text segmentation seems to be a "production" issue, not a "research" issue :( Nov 21, 2019 at 17:01
  • This is just a snippet from my script to show the problem. In between I am doing question answering, and what I would like to achieve is fully traceable text. Nov 21, 2019 at 17:35
  • Ah, so ideally you would have something like github.com/huggingface/transformers/pull/1274. Assuming you're okay with snapping answer spans to whole words, you can use something like bistring.readthedocs.io/en/latest/Python/Tokenizer.html to split into words while keeping track of string indexes, then just use BertTokenizer to get subwords (see the sketch after these comments). Nov 21, 2019 at 17:45
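
A minimal sketch of the offset-tracking idea from that last comment, using re.finditer in place of bistring (the regex and variable names are illustrative, not from the thread):

import re
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
test_string = 'text with percentage%'

# split into "words" while recording where each one starts and ends
# in the original string; \S+ is a stand-in for a smarter word splitter
word_spans = [(m.group(), m.start(), m.end())
              for m in re.finditer(r'\S+', test_string)]

for word, start, end in word_spans:
    # subword-tokenize each word separately; the (start, end) pair lets
    # you map any whole-word answer span back to the exact original text
    print(repr(test_string[start:end]), '->', tokenizer.tokenize(word))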

3 Answers


If you are trying to use BERT for token classification in order to find a span in your original string, then one workaround is to use BertTokenizerFast with the option return_offsets_mapping=True.

from transformers import BertTokenizerFast

test_string = 'text with percentage%'

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
tokens = tokenizer(test_string, return_offsets_mapping=True)
input_ids = tokens["input_ids"]

# some_model is a stand-in for your token-classification model; it is
# assumed to return start and end token indices for the predicted span
span_start_index, span_stop_index = some_model(input_ids)

Then once you get the token classification results, you can do something like

start_char = tokens.encodings[0].offsets[span_start_index][0]
end_char = tokens.encodings[0].offsets[span_stop_index][1]
predicted_span = test_string[start_char:end_char]
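
For reference, here is a small self-contained check (assuming bert-base-cased) of what the offset mapping contains; special tokens such as [CLS] and [SEP] map to (0, 0):

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
test_string = 'text with percentage%'

tokens = tokenizer(test_string, return_offsets_mapping=True)
for token_id, (start, end) in zip(tokens["input_ids"],
                                  tokens["offset_mapping"]):
    # each offset pair points back into the original string, so slicing
    # recovers the exact text, e.g. '%' with no inserted space
    print(tokenizer.convert_ids_to_tokens(token_id),
          repr(test_string[start:end]))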

According to https://github.com/huggingface/transformers/pull/1274, they're working on it. Hopefully there will be a solution sometime next week.


Here is one way to combine "percentage" and "%" into a single word, though I am not sure whether it is useful for you.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

test_string = 'text with percentage%'
words = test_string.split()

collect_tokens = []
for word in words:
    tokens = tokenizer.tokenize(word)
    # mark every non-initial piece as a continuation ("##") so that
    # decode() joins it to the previous piece without a space
    for index in range(1, len(tokens)):
        if not tokens[index].startswith("##"):
            tokens[index] = "##" + tokens[index]
    collect_tokens += tokens

# encode tokens to input_ids
input_ids = tokenizer.convert_tokens_to_ids(collect_tokens)

# decode
output = tokenizer.decode(input_ids)
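
Note that this relies on two things: decode() joins "##" continuation pieces onto the previous token without a space, and convert_tokens_to_ids only works as intended if the manufactured "##"-prefixed piece (here "##%") actually exists in the vocabulary; otherwise it falls back to the unknown-token id.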
