All Questions

46 votes
5 answers
58k views

ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - Tokenizing BERT / Distilbert Error

def split_data(path): df = pd.read_csv(path) return train_test_split(df , test_size=0.1, random_state=100) train, test = split_data(DATA_DIR) train_texts, train_labels = train['text'].to_list(), ...
Raoof Naushad
17 votes
2 answers
33k views

The size of tensor a (707) must match the size of tensor b (512) at non-singleton dimension 1

I am trying to do text classification using a pretrained BERT model. I trained the model on my dataset, and now I am in the testing phase. I know that BERT can only take up to 512 tokens, so I wrote an if condition ...
Mee
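
One common workaround for inputs longer than the 512-token limit is to split the token ids into overlapping windows and run the model on each window. The following is only a minimal sketch in plain Python: the function name `chunk_token_ids`, the `stride` value, and the ids 101/102 for [CLS]/[SEP] (as in bert-base-uncased) are assumptions for illustration, not part of the question.

```python
def chunk_token_ids(token_ids, max_len=512, stride=128, cls_id=101, sep_id=102):
    """Split content token ids (no special tokens) into overlapping windows.

    Each returned chunk is [CLS] + window + [SEP] and has at most max_len ids.
    Consecutive windows share `stride` tokens of context.
    """
    body = max_len - 2  # room left once [CLS] and [SEP] are added
    if body <= stride:
        raise ValueError("max_len must be larger than stride + 2")
    step = body - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        window = token_ids[start:start + body]
        chunks.append([cls_id] + window + [sep_id])
        if start + body >= len(token_ids):
            break  # this window already reached the end of the input
    return chunks


# Hypothetical input: 705 content tokens, i.e. 707 once [CLS]/[SEP] are added.
ids = list(range(1000, 1705))
chunks = chunk_token_ids(ids)
print([len(c) for c in chunks])
```

Per-chunk predictions then still have to be combined (for example by averaging logits), which is a modelling choice this sketch does not make.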
11 votes
1 answer
6k views

What is so special about special tokens?

What exactly is the difference between a "token" and a "special token"? I understand the following: what a typical token is, what a typical special token is (MASK, UNK, SEP, etc.), when ...
ShaoMin Liu
10 votes
3 answers
12k views

BertTokenizer - when encoding and decoding sequences extra spaces appear

When using Transformers from HuggingFace I am facing a problem with the encoding and decoding method. I have the following string: test_string = 'text with percentage%' Then I am running the ...
Henryk Borzymowski
7 votes
1 answer
5k views

Token indices sequence length error when using encode_plus method

I got a strange error when trying to encode question-answer pairs for BERT using the encode_plus method provided in the Transformers library. I am using data from this Kaggle competition. Given a ...
Niels
6 votes
2 answers
9k views

How to untokenize BERT tokens?

I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word. from transformers import BertTokenizer tz = BertTokenizer.from_pretrained("...
JayJay
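
For WordPiece tokens, a rough way to get text back is to glue every piece that starts with "##" onto the previous piece. The sketch below assumes the tokens are already available as strings; `join_wordpieces` and `context_window` are hypothetical helper names, and the result does not restore original casing or punctuation spacing. The Transformers tokenizers also provide `convert_tokens_to_string`, which may be the better choice when the library is available.

```python
def join_wordpieces(tokens):
    """Merge WordPiece tokens back into whitespace-separated words."""
    words = []
    for tok in tokens:
        if tok.startswith("##"):
            piece = tok[2:]
            if words:
                words[-1] += piece   # continuation of the previous word
            else:
                words.append(piece)  # window started in the middle of a word
        else:
            words.append(tok)
    return " ".join(words)


def context_window(tokens, index, n):
    """Text covering tokens[index] plus up to n tokens on each side."""
    left = max(0, index - n)
    return join_wordpieces(tokens[left:index + n + 1])


# Hypothetical tokenization, for illustration only.
tokens = ["the", "token", "##izer", "splits", "un", "##usual", "words"]
print(join_wordpieces(tokens))
print(context_window(tokens, 3, 2))
```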
6 votes
3 answers
5k views

How to stop BERT from breaking apart specific words into word-pieces

I am using a pre-trained BERT model to tokenize a text into meaningful tokens. However, the text has many specific words and I don't want the BERT model to break them into word-pieces. Is there any ...
parvaneh shayegh
6 votes
0 answers
2k views

How to slice string depending on length of tokens

When I use (with a long test_text and short question): from transformers import BertTokenizer import torch from transformers import BertForQuestionAnswering tokenizer = BertTokenizer.from_pretrained('...
5 votes
2 answers
4k views

BERT model : "enable_padding() got an unexpected keyword argument 'max_length'"

I am trying to implement the BERT model architecture using Hugging Face and Keras. I am learning this from Kaggle (link) and trying to understand it. When I tokenized my data, I faced some problems ...
Samrat Alam
5 votes
1 answer
2k views

What does merge.txt file mean in BERT-based models in HuggingFace library?

I am trying to understand what the merge.txt file means in tokenizers for the RoBERTa model in the HuggingFace library. However, nothing is said about it on their website. Any help is appreciated.
Akim
4 votes
1 answer
2k views

PyTorch tokenizers: how to truncate tokens from left?

As we can see in the code snippet below, specifying max_length and truncation for a tokenizer cuts excess tokens from the right: tokenizer("hello, my name", truncation=True, max_length=6).input_ids ...
aayc
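
To make the idea of truncating from the left concrete, here is a minimal sketch on a plain list of token ids. It assumes the sequence starts with one special token that should be kept; `truncate_left` and the ids used are hypothetical. Recent versions of Transformers also expose a `truncation_side` setting on tokenizers, so it is worth checking the documentation of the installed version before hand-rolling this.

```python
def truncate_left(token_ids, max_length, num_leading_special=1):
    """Keep the leading special token(s) and the LAST tokens of the sequence."""
    if len(token_ids) <= max_length:
        return list(token_ids)
    head = list(token_ids[:num_leading_special])
    keep = max_length - num_leading_special  # budget left for the tail
    tail = list(token_ids[-keep:]) if keep > 0 else []
    return head + tail


# Hypothetical ids: 0 = start token, 2 = end token.
ids = [0, 11, 12, 13, 14, 15, 16, 2]
print(truncate_left(ids, 6))
```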
3 votes
1 answer
10k views

Huggingface's BERT tokenizer not adding pad token

It's not entirely clear from the documentation, but I can see that BertTokenizer is initialised with pad_token='[PAD]', so I assume when you encode with add_special_tokens=True then it would ...
doctopus
3 votes
2 answers
933 views

BPE multiple ways to encode a word

With BPE or WordPiece there might be multiple ways to encode a word. For instance, assume (for simplicity) the token vocabulary contains all letters as well as the merged symbols ("to", ...
SweetSpot
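
In BERT's WordPiece, the ambiguity is resolved at tokenization time by greedy longest-match-first: at each position the longest vocabulary entry wins. The sketch below illustrates that rule with a toy vocabulary that is entirely made up; it is not the library implementation. BPE differs in that it replays its learned merges in priority order.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # non-initial pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no prefix matched: the whole word is unknown
        tokens.append(match)
        start = end
    return tokens


# Toy vocabulary (hypothetical): both "to"+"##ken" and "tok"+"##en" are possible.
vocab = {"t", "to", "tok", "##o", "##k", "##e", "##n", "##en", "##ken"}
print(wordpiece_tokenize("token", vocab))
```

Here "to" + "##ken" would also be a valid segmentation, but the greedy rule picks the longer first piece "tok".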
3 votes
0 answers
3k views

Explaining BERT output through SHAP values without WordPiece tokenization

I fine-tuned BERT on a sentiment analysis task in PyTorch. Now I want to use SHAP to explain which tokens led the model to the prediction (positive or negative sentiment). Currently, SHAP returns a ...
Maria
3 votes
1 answer
6k views

How to get the vocab file for Bert tokenizer from TF Hub

I'm trying to use Bert from TensorFlow Hub and build a tokenizer, this is what I'm doing: >>> import tensorflow_hub as hub >>> from bert.tokenization import FullTokenizer >>> ...
bachr
2 votes
2 answers
3k views

Is there a way to get the location of the substring from which a certain token has been produced in BERT?

I am feeding sentences to a BERT model (Hugging Face library). These sentences get tokenized with a pretrained tokenizer. I know that you can use the decode function to go back from tokens to strings. ...
MarciBE
2 votes
2 answers
1k views

Translation between different tokenizers

Sorry if this question is too basic to be asked here. I tried but I couldn't find solutions. I'm now working on an NLP project that requires using two different models (BART for summarization and BERT ...
exitialium
2 votes
2 answers
5k views

"Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers." ValueError: Input is not valid

I am using the BERT tokenizer for French and I am getting this error, but I cannot seem to solve it. Any suggestions are welcome. Traceback (most recent call last): File "training_cross_data_2....
emma
2 votes
1 answer
2k views

How does the BERT tokenizer result in an input tensor shape of (b, 24, 768)?

I understand how the BERT tokenizer works thanks to this article: https://albertauyeung.github.io/2020/06/19/bert-tokenization.html However, I am confused about how this ends up as the final input ...
Joshua Clancy
2 votes
1 answer
2k views

How to combine two tokenized bert sequences

Say I have two tokenized BERT sequences: seq1 = tensor([[ 101, 2023, 2003, 1996, 23032, 102]]) seq2 = tensor([[ 101, 2023, 2003, 6019, 1015, 102]]) This is produced with huggingface's tokenizer:...
Union find
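
BERT's sentence-pair input has the layout [CLS] A [SEP] B [SEP], so one way to combine two already-encoded sequences is to drop the second sequence's leading [CLS] and concatenate. This sketch uses plain Python lists instead of tensors and assumes 101/102 are the [CLS]/[SEP] ids, as they appear in the question; `combine_pair` is a hypothetical helper. Whether this layout is what the downstream model expects depends on the task.

```python
CLS_ID, SEP_ID = 101, 102  # assumed [CLS]/[SEP] ids, matching the ids in the question


def combine_pair(seq1, seq2):
    """Build [CLS] A [SEP] B [SEP] from two separately encoded sequences."""
    first = list(seq1)
    second = list(seq2)
    if second and second[0] == CLS_ID:
        second = second[1:]  # the pair needs only one [CLS], at the very start
    input_ids = first + second
    # Segment ids: 0 for the first sequence, 1 for the second.
    token_type_ids = [0] * len(first) + [1] * len(second)
    return input_ids, token_type_ids


seq1 = [101, 2023, 2003, 1996, 23032, 102]
seq2 = [101, 2023, 2003, 6019, 1015, 102]
input_ids, token_type_ids = combine_pair(seq1, seq2)
print(input_ids)
print(token_type_ids)
```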
2 votes
1 answer
2k views

Why was BERT's default vocabulary size set to 30522?

I have been trying to build a BERT model for a specific domain. However, my model is trained on non-English text, so I'm worried that the default token size, 30522, won't fit my model. Does anyone ...
Byoungchan Han
2 votes
1 answer
2k views

Can I pre-train a BERT model from scratch using a tokenized input file and a custom vocabulary file for the Khmer language?

I would like to know if it's possible for me to use my own tokenized/segmented documents (with my own vocab file as well) as the input file to the create_pretraining_data.py script (git source: https:/...
Nik Muhammad Naim
2 votes
0 answers
352 views

KeyError when trying to fine-tune BERT for text classification

I am trying to fine-tune BERT for text classification on my dataset and I am getting the following error: KeyError: 'Indexing with integers (to access backend Encoding for a given batch index) is not ...
Jules
2 votes
0 answers
2k views

How to encode empty string using BERT

I have recently been trying to encode an empty string with CamemBERT (BERT model for French). I wasn't sure how to do that. If I try to simply encode an empty string, from transformers import ...
NoIdea
2 votes
0 answers
796 views

UnparsedFlagAccessError: Trying to access flag --preserve_unused_tokens before flags were parsed

Hello, I am a beginner in ML. I tried to use BERT, and the tokenizer didn't work, as shown below. train_input = bert_encode(train.text.values, tokenizer, max_len=160) test_input = bert_encode(test.text.values, ...
Tony
2 votes
0 answers
165 views

Explicit likelihood of WordPiece used for pre-processing of BERT

At each iteration the WordPiece algorithm for subword tokenization merges the two symbols which increase the likelihood the most. Now, in the literature it is only mentioned that this likelihood is ...
SweetSpot
2 votes
0 answers
160 views

Wordpiece Tokenization Model

Can somebody tell me how exactly the WordPiece model works? I am having a hard time trying to understand how exactly the WordPiece model works. I understand that BPE is based on merging ...
Bika
2 votes
1 answer
161 views

bert_vocab.bert_vocab_from_dataset returning wrong vocabulary [closed]

I'm trying to build a tokenizer following the TensorFlow tutorial https://www.tensorflow.org/text/guide/subwords_tokenizer. I'm basically doing the same thing only with a different dataset. The dataset in ...
Niccolò Tiezzi
1 vote
1 answer
2k views

How to preprocess a dataset for BERT model implemented in Tensorflow 2.x?

Overview: I have a dataset made for a classification problem. There are two columns: one is sentences and the other is labels (total: 10 labels). I'm trying to convert this dataset to implement it in a ...
Y4RD13
1 vote
1 answer
8k views

Bert Tokenizing error ValueError: Input nan is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers

I am using BERT for a text classification task. When I try to tokenize one data sample using the code: encoded_sent = tokenizer.encode( sentences[7], ...
Yaman Afadar
1 vote
1 answer
535 views

NER Classification Deberta Tokenizer error : You need to instantiate DebertaTokenizerFast

I'm trying to perform a NER classification task using Deberta, but I'm stuck with a tokenizer error. This is my code (my input sentence must be split word by word by ",:): from transformers ...
Chiara
1 vote
1 answer
755 views

How to use BertTokenizer to load a Tokenizer model?

I use tokenizers to train a Tokenizer and save the model like this: tokenizer = Tokenizer(BPE()) tokenizer.pre_tokenizer = Whitespace() tokenizer.decoder = ByteLevelDecoder() trainer = BpeTrainer(...
Jack.Sparrow
1 vote
1 answer
488 views

getting word-level encodings from sub-word tokens encodings

I'm looking into using a pretrained BERT ('bert-base-uncased') model to extract contextualised word-level encodings from a bunch of sentences. Wordpiece tokenisation breaks down some of the words in my ...
rbroc
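
One simple way to get word-level vectors from sub-word vectors is to average the vectors of all pieces that belong to the same word. The sketch below works on plain Python lists and assumes WordPiece "##" continuation markers; `pool_subwords`, the tokens, and the two-dimensional vectors are made up for illustration. Averaging is only one option (taking the first piece's vector is another).

```python
def pool_subwords(tokens, vectors):
    """Average sub-word vectors into one vector per reconstructed word."""
    words, sums, counts = [], [], []
    for tok, vec in zip(tokens, vectors):
        if tok.startswith("##") and words:
            words[-1] += tok[2:]                               # extend the current word
            sums[-1] = [a + b for a, b in zip(sums[-1], vec)]  # accumulate its vectors
            counts[-1] += 1
        else:
            words.append(tok)
            sums.append(list(vec))
            counts.append(1)
    means = [[x / n for x in total] for total, n in zip(sums, counts)]
    return words, means


# Hypothetical tokens and 2-dimensional "embeddings".
tokens = ["em", "##bed", "##ding", "layer"]
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
words, word_vectors = pool_subwords(tokens, vectors)
print(words)
print(word_vectors)
```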
1 vote
1 answer
31 views

Why do Tokenizer and TokenizerFast give different results when encoding the same sentence?

When I use the tokenizer to encode text with ‘do_basic_tokenize=False’, I get two different results. But when I set ‘do_basic_tokenize=True’, the results are the same. The text is 'LUXURY HOTEL ...
feng shen
1 vote
1 answer
27 views

bert-base-uncased does not use newly added suffix token

I want to add custom tokens to the BertTokenizer. However, the model does not use the new token. from transformers import BertTokenizer tokenizer = BertTokenizer.from_pretrained("bert-base-...
Lulacca
1 vote
0 answers
110 views

How to obtain the [CLS] sentence embedding of multiple sentences successively without facing a RAM crash?

I would like to obtain the [CLS] token's sentence embedding (as it represents the whole sentence's meaning) using BERT. I have many sentences (about 40) that belong to a Document, and 246 such ...
Aadithya Seshadri
1 vote
1 answer
351 views

Equivalent to tokenizer() in Transformers 2.5.0?

I am trying to convert the following code to work with Transformers 2.5.0. As written, it works in version 4.18.0, but not 2.5.0. # Converting pretrained BERT classification model to regression model #...
galactic_tok
1 vote
0 answers
174 views

'AutoTrackable' object is not callable

I've tried to instantiate the tokenizer with this statement: tokenizer = create_tokenizer_from_hub_module(bert_path=BERT_PATH) I tried to fix it with some recommendations from other topics, but the ...
Agustina Villaggi
0 votes
1 answer
3k views

Token indices sequence length is longer than the specified maximum sequence length for this model (28627 > 512)

I am using Hugging Face's DistilBERT model as a backend for a question and answer application. The text I am using to train the model is one very large single text field. Even though ...
Scott Bing
0 votes
1 answer
2k views

How to specify input sequence length for BERT tokenizer in Tensorflow?

I am following this example to use BERT for sentiment classification. text_input = tf.keras.layers.Input(shape=(), dtype=tf.string) preprocessor = hub.KerasLayer( "https://tfhub.dev/...
Jane Sully
0 votes
2 answers
3k views

Bert Tokenizer is not working despite importing all packages. Is there a new syntax change to this?

Trying to run the tokenizer for Bert but I keep getting errors. Can anyone help me see where I am going wrong? FullTokenizer = bert.bert_tokenization.FullTokenizer bert_layer = hub.KerasLayer("https://tfhub....
L Akshay
0 votes
1 answer
47 views

Truncate texts in the middle for Bert

I am learning about Bert, which only deals with texts with fewer than 512 tokens, and came across this answer which says that truncating text in the middle (as opposed to at the start or at the end) ...
Tuan Do
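
Truncating in the middle usually means keeping a head and a tail of the token sequence and dropping what lies between them. A minimal sketch on a list of token ids follows; `truncate_middle` and the `head_len` split are assumptions for illustration, and special tokens would still need to be added afterwards within the model's length limit.

```python
def truncate_middle(token_ids, max_length, head_len):
    """Keep the first head_len ids and the last (max_length - head_len) ids."""
    if len(token_ids) <= max_length:
        return list(token_ids)
    if not 0 <= head_len <= max_length:
        raise ValueError("head_len must be between 0 and max_length")
    tail_len = max_length - head_len
    head = list(token_ids[:head_len])
    tail = list(token_ids[-tail_len:]) if tail_len > 0 else []
    return head + tail


# Hypothetical ids: 20 tokens cut down to 8 (3 from the start, 5 from the end).
ids = list(range(20))
print(truncate_middle(ids, 8, 3))
```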
0 votes
1 answer
145 views

Understand the difference between the arguments "text" and "text_target" in the bert tokenizer from the huggingface transformers library [duplicate]

From the transformers library by huggingface from transformers import BertTokenizer tb = BertTokenizer.from_pretrained("bert-base-uncased") tb is not a wordpiece tokenizer. It has arguments ...
figs_and_nuts
0 votes
1 answer
1k views

Loading local tokenizer

I'm trying to load a local tokenizer using: from transformers import RobertaTokenizerFast tokenizer = RobertaTokenizerFast.from_pretrained(r'file path\tokenizer') However, this gives me the ...
Jon
0 votes
1 answer
373 views

bert_vocab.bert_vocab_from_dataset taking too long

I'm following this tutorial (https://colab.research.google.com/github/tensorflow/text/blob/master/docs/guide/subwords_tokenizer.ipynb#scrollTo=kh98DvoDz7Jn) to generate a vocabulary from a custom ...
Kurt
0 votes
1 answer
2k views

Split a sentence by words just as BERT Tokenizer would do?

I'm trying to locate all the [UNK] tokens of the BERT tokenizer in my text. Once I have the position of the UNK token, I need to identify what word it belongs to. For that, I tried to get the position ...
Andrea NR
0 votes
0 answers
48 views

Issues with Training RoBERTa Model for Text Prediction with Fill Mask Task in Python

I have a problem. I am working on pretraining a RoBERTa MLM model from scratch on Slovak language text in Python. I have trained my own BPE tokenizer and tokenized texts with it. I obtained the ...
daviddo
0 votes
1 answer
102 views

Map BERT token indices to Spacy token indices

I’m trying to map Bert’s (bert-base-uncased) token indices (not ids, token indices) to Spacy’s token indices. In the following example, my approach doesn’t work because ...
lrthistlethwaite
0 votes
0 answers
44 views

Value Error when using add_tokens, 'the truth value of an array with more than one element is ambiguous'

I'm trying to improve a basic BERT, pretrained tokenizer model. I'm adding new tokens using add_tokens, but running into issues with the built-in method. Namely: ValueError ...
Manny
0 votes
1 answer
36 views

How to model with NLP when the token is not relevant (by itself) but its type is?

I would like to build an NLP classification model. My input is a paragraph or a sentence. Ideally, my output is a score or probability (between 0 and 1). I have defined specific entities ex-ante, each ...
Maxou