All Questions
Tagged with bert-language-model tokenize
57
questions
46
votes
5
answers
58k
views
ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - Tokenizing BERT / Distilbert Error
def split_data(path):
    df = pd.read_csv(path)
    return train_test_split(df, test_size=0.1, random_state=100)
train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), ...
17
votes
2
answers
33k
views
The size of tensor a (707) must match the size of tensor b (512) at non-singleton dimension 1
I am trying to do text classification using a pretrained BERT model. I trained the model on my dataset and am now in the testing phase. I know that BERT can only take up to 512 tokens, so I wrote an if condition ...
11
votes
1
answer
6k
views
what is so special about special tokens?
What exactly is the difference between a "token" and a "special token"?
I understand the following:
what is a typical token
what is a typical special token: MASK, UNK, SEP, etc
when ...
10
votes
3
answers
12k
views
BertTokenizer - when encoding and decoding sequences extra spaces appear
When using Transformers from HuggingFace I am facing a problem with the encoding and decoding method.
I have the following string:
test_string = 'text with percentage%'
Then I am running the ...
7
votes
1
answer
5k
views
Token indices sequence length error when using encode_plus method
I got a strange error when trying to encode question-answer pairs for BERT using the encode_plus method provided in the Transformers library.
I am using data from this Kaggle competition. Given a ...
6
votes
2
answers
9k
views
How to untokenize BERT tokens?
I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word.
from transformers import BertTokenizer
tz = BertTokenizer.from_pretrained("...
6
votes
3
answers
5k
views
How to stop BERT from breaking apart specific words into word pieces
I am using a pre-trained BERT model to tokenize a text into meaningful tokens. However, the text has many specific words and I don't want the BERT model to break them into word pieces. Is there any ...
6
votes
0
answers
2k
views
How to slice string depending on length of tokens
When I use (with a long test_text and short question):
from transformers import BertTokenizer
import torch
from transformers import BertForQuestionAnswering
tokenizer = BertTokenizer.from_pretrained('...
5
votes
2
answers
4k
views
BERT model : "enable_padding() got an unexpected keyword argument 'max_length'"
I am trying to implement the BERT model architecture using Hugging Face and Keras. I am learning this from Kaggle (link) and trying to understand it. When I tokenize my data, I run into some problems ...
5
votes
1
answer
2k
views
What does the merge.txt file mean in BERT-based models in the HuggingFace library?
I am trying to understand what the merge.txt file represents in tokenizers for the RoBERTa model in the HuggingFace library. However, nothing is said about it on their website. Any help is appreciated.
4
votes
1
answer
2k
views
PyTorch tokenizers: how to truncate tokens from left?
As we can see in the code snippet below, specifying max_length and truncation for a tokenizer cuts excess tokens from the right:
tokenizer("hello, my name", truncation=True, max_length=6).input_ids
...
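Truncating from the left can also be done by hand on the token ids, independent of any library. The sketch below assumes BERT-style input where the first id is [CLS] and the last is [SEP] (101 and 102 in bert-base-uncased); truncate_left is a hypothetical helper, not a Transformers API, and the other ids are arbitrary example values.

```python
CLS_ID, SEP_ID = 101, 102  # assumed special-token ids (bert-base-uncased)

def truncate_left(input_ids, max_length):
    """Keep [CLS], the last tokens that fit, and the final [SEP]."""
    if len(input_ids) <= max_length:
        return list(input_ids)
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    body = input_ids[1:-1]            # strip [CLS] ... [SEP]
    keep = max_length - 2             # room left for ordinary tokens
    tail = body[len(body) - keep:]    # drop tokens from the left
    return [input_ids[0]] + tail + [input_ids[-1]]

# Arbitrary example ids; only the first and last positions are special.
ids = [CLS_ID, 7592, 1010, 2026, 2171, 2003, 3960, SEP_ID]
print(truncate_left(ids, 6))  # [101, 2026, 2171, 2003, 3960, 102]
```

Recent versions of Transformers also accept truncation_side='left' when loading a tokenizer, which may make a manual helper unnecessary; check whether your installed version supports it.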
3
votes
1
answer
10k
views
Huggingface's BERT tokenizer not adding pad token
It's not entirely clear from the documentation, but I can see that BertTokenizer is initialised with pad_token='[PAD]', so I assume when you encode with add_special_tokens=True then it would ...
3
votes
2
answers
933
views
BPE multiple ways to encode a word
With BPE or WordPiece there might be multiple ways to encode a word. For instance, assume (for simplicity) the token vocabulary contains all letters as well as the merged symbols ("to", ...
3
votes
0
answers
3k
views
Explaining BERT output through SHAP values without WordPiece tokenization
I fine-tuned BERT on a sentiment analysis task in PyTorch.
Now I want to use SHAP to explain which tokens led the model to the prediction (positive or negative sentiment).
Currently, SHAP returns a ...
3
votes
1
answer
6k
views
How to get the vocab file for Bert tokenizer from TF Hub
I'm trying to use Bert from TensorFlow Hub and build a tokenizer, this is what I'm doing:
>>> import tensorflow_hub as hub
>>> from bert.tokenization import FullTokenizer
>>> ...
2
votes
2
answers
3k
views
Is there a way to get the location of the substring from which a certain token has been produced in BERT?
I am feeding sentences to a BERT model (Hugging Face library). These sentences get tokenized with a pretrained tokenizer. I know that you can use the decode function to go back from tokens to strings.
...
2
votes
2
answers
1k
views
Translation between different tokenizers
Sorry if this question is too basic to be asked here. I tried but I couldn't find solutions.
I'm now working on an NLP project that requires using two different models (BART for summarization and BERT ...
2
votes
2
answers
5k
views
"Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers." ValueError: Input is not valid
I am using the BERT tokenizer for French and I am getting this error, but I cannot seem to resolve it. Any suggestions would be appreciated.
Traceback (most recent call last):
File "training_cross_data_2....
2
votes
1
answer
2k
views
How does the BERT tokenizer result in an input tensor shape of (b, 24, 768)?
I understand how the BERT tokenizer works thanks to this article:
https://albertauyeung.github.io/2020/06/19/bert-tokenization.html
However, I am confused about how this ends up as the final input ...
2
votes
1
answer
2k
views
How to combine two tokenized bert sequences
Say I have two tokenized BERT sequences:
seq1 = tensor([[ 101, 2023, 2003, 1996, 23032, 102]])
seq2 = tensor([[ 101, 2023, 2003, 6019, 1015, 102]])
This is produced with huggingface's tokenizer:...
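A minimal sketch of one way to merge the two sequences above into a single sentence-pair input, using plain Python lists rather than tensors. It assumes that 101 and 102 are the [CLS] and [SEP] ids (as in bert-base-uncased) and follows BERT's pair layout [CLS] A [SEP] B [SEP]; merge_pair is a hypothetical helper name.

```python
def merge_pair(seq1, seq2):
    """Merge two single-sentence encodings into [CLS] A [SEP] B [SEP]."""
    merged = seq1 + seq2[1:]  # drop the second sequence's leading [CLS]
    # Segment ids: 0 for the first sentence (incl. its [SEP]), 1 for the second.
    token_type_ids = [0] * len(seq1) + [1] * (len(seq2) - 1)
    return merged, token_type_ids

seq1 = [101, 2023, 2003, 1996, 23032, 102]
seq2 = [101, 2023, 2003, 6019, 1015, 102]
input_ids, token_type_ids = merge_pair(seq1, seq2)
print(input_ids)       # [101, 2023, 2003, 1996, 23032, 102, 2023, 2003, 6019, 1015, 102]
print(token_type_ids)  # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

With tensors of shape (1, n), the same idea is torch.cat([seq1, seq2[:, 1:]], dim=1). If the raw texts are still available, passing them to the tokenizer as a pair is usually simpler than merging ids afterwards.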
2
votes
1
answer
2k
views
Why was BERT's default vocabulary size set to 30522?
I have been trying to build a BERT model for a specific domain. However, my model is trained on non-English text, so I'm worried that the default token size, 30522, won't fit my model.
Does anyone ...
2
votes
1
answer
2k
views
Can I pre-train a BERT model from scratch using a tokenized input file and a custom vocabulary file for the Khmer language?
I would like to know if it's possible for me to use my own tokenized/segmented documents (with my own vocab file as well) as the input file to the create_pretraining_data.py script (git source: https:/...
2
votes
0
answers
352
views
KeyError when trying to fine tuning Bert for text classification
I am trying to fine tune Bert for text classification on my dataset and I am getting the following error:
KeyError: 'Indexing with integers (to access backend Encoding for a given batch index) is not ...
2
votes
0
answers
2k
views
How to encode empty string using BERT
I have recently been trying to encode an empty string with CamemBERT (a BERT model for French). I wasn't sure how to do that. If I try to simply encode an empty string,
from transformers import ...
2
votes
0
answers
796
views
UnparsedFlagAccessError: Trying to access flag --preserve_unused_tokens before flags were parsed
Hello, I am a beginner in ML. I tried to use BERT, but the tokenizer failed as shown below.
train_input = bert_encode(train.text.values, tokenizer, max_len=160)
test_input = bert_encode(test.text.values, ...
2
votes
0
answers
165
views
Explicit likelihood of WordPiece used for pre-processing of BERT
At each iteration the WordPiece algorithm for subword tokenization merges the two symbols which increase the likelihood the most. Now, in the literature it is only mentioned that this likelihood is ...
2
votes
0
answers
160
views
Wordpiece Tokenization Model
Can somebody tell me how exactly the WordPiece model works? I am having a hard time trying to understand how exactly the WordPiece model works. I understand BPE, in that it is based on merging ...
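For the tokenization side (as opposed to vocabulary training), BERT's WordPiece uses greedy longest-match-first lookup against a fixed vocabulary and marks word-internal pieces with "##". A minimal sketch of that lookup follows; the toy vocabulary and the function name wordpiece_tokenize are invented for illustration, and real implementations add details such as a maximum word length.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1                           # try a shorter substring
        if match is None:
            return [unk_token]                 # no piece fits: whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, made up for this example.
toy_vocab = {"un", "##aff", "##able", "##a", "##ff", "play", "##ing"}
print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("playing", toy_vocab))    # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```

Because the longest matching piece is always taken first, a given word and vocabulary produce exactly one segmentation, even when shorter pieces such as "##a" and "##ff" could also cover the same characters.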
2
votes
1
answer
161
views
bert_vocab.bert_vocab_from_dataset returning wrong vocabulary [closed]
I'm trying to build a tokenizer following the TensorFlow tutorial https://www.tensorflow.org/text/guide/subwords_tokenizer. I'm basically doing the same thing, only with a different dataset. The dataset in ...
1
vote
1
answer
2k
views
How to preprocess a dataset for BERT model implemented in Tensorflow 2.x?
Overview
I have a dataset made for a classification problem. There are two columns: one holds sentences and the other holds labels (total: 10 labels). I'm trying to convert this dataset to implement it in a ...
1
vote
1
answer
8k
views
Bert Tokenizing error ValueError: Input nan is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers
I am using BERT for a text classification task. When I try to tokenize one data sample using the code:
encoded_sent = tokenizer.encode(
    sentences[7],
    ...
1
vote
1
answer
535
views
NER Classification Deberta Tokenizer error : You need to instantiate DebertaTokenizerFast
I'm trying to perform a NER classification task using Deberta, but I'm stuck with a tokenizer error. This is my code (my input sentence must be split word by word by ",:):
from transformers ...
1
vote
1
answer
755
views
How to use BertTokenizer to load a Tokenizer model?
I use tokenizers to train a Tokenizer and save the model like this:
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()
tokenizer.decoder = ByteLevelDecoder()
trainer = BpeTrainer(...
1
vote
1
answer
488
views
getting word-level encodings from sub-word tokens encodings
I'm looking into using a pretrained BERT ('bert-base-uncased') model to extract contextualised word-level encodings from a bunch of sentences.
Wordpiece tokenisation breaks down some of the words in my ...
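One common way to get a single vector per word is to average the vectors of its word pieces. The sketch below illustrates that idea in plain Python; the tokens and two-dimensional vectors are toy values standing in for real model outputs, pool_wordpieces is a hypothetical helper, and mean pooling is only one of several reasonable choices (taking the first piece is another). Special tokens such as [CLS] and [SEP] are assumed to have been removed beforehand.

```python
def pool_wordpieces(tokens, vectors):
    """Average sub-word vectors so that each original word gets one vector."""
    words, groups = [], []
    for token, vec in zip(tokens, vectors):
        if token.startswith("##") and groups:
            words[-1] += token[2:]      # glue the piece back onto the word
            groups[-1].append(vec)
        else:
            words.append(token)         # a new word starts here
            groups.append([vec])
    pooled = []
    for group in groups:
        dim = len(group[0])
        pooled.append([sum(v[i] for v in group) / len(group) for i in range(dim)])
    return words, pooled

tokens = ["em", "##bed", "##ding", "layer"]
vectors = [[1.0, 0.0], [3.0, 0.0], [5.0, 3.0], [2.0, 2.0]]
words, pooled = pool_wordpieces(tokens, vectors)
print(words)   # ['embedding', 'layer']
print(pooled)  # [[3.0, 1.0], [2.0, 2.0]]
```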
1
vote
1
answer
31
views
Why do Tokenizer and TokenizerFast give different results when encoding the same sentence?
When I encode text with the tokenizer and set 'do_basic_tokenize=False', I get two different results.
But when I set 'do_basic_tokenize=True', the results are the same.
The text is 'LUXURY HOTEL ...
1
vote
1
answer
27
views
bert-base-uncased does not use newly added suffix token
I want to add custom tokens to the BertTokenizer. However, the model does not use the new token.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-...
1
vote
0
answers
110
views
How to obtain the [CLS] sentence embedding of multiple sentences successively without facing a RAM crash?
I would like to obtain the [CLS] token's sentence embedding (as it represents the whole sentence's meaning) using BERT. I have many sentences (about 40) that belong to a Document, and 246 such ...
1
vote
1
answer
351
views
Equivalent to tokenizer() in Transformers 2.5.0?
I am trying to convert the following code to work with Transformers 2.5.0. As written, it works in version 4.18.0, but not 2.5.0.
# Converting pretrained BERT classification model to regression model
#...
1
vote
0
answers
174
views
'AutoTrackable' object is not callable
I tried to instantiate the tokenizer with this statement:
tokenizer = create_tokenizer_from_hub_module(bert_path=BERT_PATH)
I tried to fix it using some recommendations from other topics
But the ...
0
votes
1
answer
3k
views
Token indices sequence length is longer than the specified maximum sequence length for this model (28627 > 512)
I am using Hugging Face's DistilBERT model as a backend for a question-answering application. The text I am using to train the model is one very large single text field. Even though ...
0
votes
1
answer
2k
views
How to specify input sequence length for BERT tokenizer in Tensorflow?
I am following this example to use BERT for sentiment classification.
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
preprocessor = hub.KerasLayer(
"https://tfhub.dev/...
0
votes
2
answers
3k
views
Bert Tokenizer is not working despite importing all packages. Is there a new syntax change to this?
I am trying to run the tokenizer for BERT but I keep getting errors. Can anyone help me see where I am going wrong?
FullTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub....
0
votes
1
answer
47
views
Truncate texts in the middle for Bert
I am learning about BERT, which only handles texts of up to 512 tokens, and came across this answer which says that truncating text in the middle (as opposed to at the start or at the end) ...
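Truncating in the middle amounts to keeping the first few tokens and the last few tokens and dropping what lies between. A minimal sketch on a plain list of token ids, assuming the ids do not yet include [CLS] and [SEP] (so for a 512-token limit you would pass max_tokens=510); truncate_middle and the head/tail split are illustrative choices, not values taken from the linked answer.

```python
def truncate_middle(token_ids, max_tokens, head_tokens):
    """Keep the first head_tokens ids and the last (max_tokens - head_tokens) ids."""
    if not 0 <= head_tokens <= max_tokens:
        raise ValueError("head_tokens must be between 0 and max_tokens")
    if len(token_ids) <= max_tokens:
        return list(token_ids)          # short enough, nothing to drop
    tail_tokens = max_tokens - head_tokens
    tail = token_ids[len(token_ids) - tail_tokens:]
    return list(token_ids[:head_tokens]) + list(tail)

ids = list(range(20))  # stand-in for 20 token ids
print(truncate_middle(ids, max_tokens=8, head_tokens=3))
# [0, 1, 2, 15, 16, 17, 18, 19]
```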
0
votes
1
answer
145
views
Understand the difference between the arguments "text" and "text_target" in the bert tokenizer from the huggingface transformers library [duplicate]
From the transformers library by huggingface
from transformers import BertTokenizer
tb = BertTokenizer.from_pretrained("bert-base-uncased")
tb is now a WordPiece tokenizer. It has arguments ...
0
votes
1
answer
1k
views
Loading local tokenizer
I'm trying to load a local tokenizer using:
from transformers import RobertaTokenizerFast
tokenizer = RobertaTokenizerFast.from_pretrained(r'file path\tokenizer')
however, this gives me the ...
0
votes
1
answer
373
views
bert_vocab.bert_vocab_from_dataset taking too long
I'm following this tutorial (https://colab.research.google.com/github/tensorflow/text/blob/master/docs/guide/subwords_tokenizer.ipynb#scrollTo=kh98DvoDz7Jn) to generate a vocabulary from a custom ...
0
votes
1
answer
2k
views
Split a sentence by words just as BERT Tokenizer would do?
I'm trying to locate all the [UNK] tokens of the BERT tokenizer in my text. Once I have the position of the UNK token, I need to identify what word it belongs to. For that, I tried to get the position ...
0
votes
0
answers
48
views
Issues with Training RoBERTa Model for Text Prediction with Fill Mask Task in Python
I have a problem. I am working on pretraining a RoBERTa MLM model from scratch on Slovak language text in Python. I have trained my own BPE tokenizer and tokenized texts with it. I obtained the ...
0
votes
1
answer
102
views
Map BERT token indices to Spacy token indices
I’m trying to make Bert’s (bert-base-uncased) tokenization token indices (not ids, token indices) map to Spacy’s tokenization token indices. In the following example, my approach doesn’t work because ...
0
votes
0
answers
44
views
Value Error when using add_tokens, 'the truth value of an array with more than one element is ambiguous'
I'm trying to improve a basic pretrained BERT tokenizer model. I'm adding new tokens using add_tokens, but running into issues with the built-in method.
Namely:
ValueError ...
0
votes
1
answer
36
views
How to model with NLP when the token is not relevant (by itself) but its type is?
I would like to build an NLP classification model.
My input is a paragraph or a sentence. Ideally, my output is a score or probability (between 0 and 1).
I have defined specific entities ex-ante, each ...