Highest scored 'bert-language-model+tokenize' questions

46 votes

5 answers

58k views

ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - Tokenizing BERT / Distilbert Error

def split_data(path): df = pd.read_csv(path) return train_test_split(df , test_size=0.1, random_state=100) train, test = split_data(DATA_DIR) train_texts, train_labels = train['text'].to_list(), ...

Raoof Naushad

736

asked Aug 21, 2020 at 5:59

17 votes

2 answers

33k views

The size of tensor a (707) must match the size of tensor b (512) at non-singleton dimension 1

I am trying to do text classification using a pretrained BERT model. I trained the model on my dataset, and in the testing phase; I know that BERT can only take up to 512 tokens, so I wrote an if condition ...

Mee

1,561

asked Oct 12, 2020 at 15:34

11 votes

1 answer

6k views

what is so special about special tokens?

what exactly is the difference between "token" and a "special token"? I understand the following: what is a typical token what is a typical special token: MASK, UNK, SEP, etc when ...

ShaoMin Liu

123

asked Mar 30, 2022 at 14:58

10 votes

3 answers

12k views

BertTokenizer - when encoding and decoding sequences extra spaces appear

When using Transformers from HuggingFace I am facing a problem with the encoding and decoding method. I have the following string: test_string = 'text with percentage%' Then I am running the ...

Henryk Borzymowski

1,058

asked Nov 21, 2019 at 16:43

7 votes

1 answer

5k views

Token indices sequence length error when using encode_plus method

I got a strange error when trying to encode question-answer pairs for BERT using the encode_plus method provided in the Transformers library. I am using data from this Kaggle competition. Given a ...

Niels

1,191

asked Apr 20, 2020 at 12:12

6 votes

2 answers

9k views

How to untokenize BERT tokens?

I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word. from transformers import BertTokenizer tz = BertTokenizer.from_pretrained("...

JayJay

183

asked Feb 16, 2021 at 22:14
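A minimal sketch related to the question above, assuming the usual BERT WordPiece convention that continuation pieces start with `##`. The helper name `join_wordpieces` is hypothetical, and the result does not restore original casing or the spacing around punctuation.

```python
def join_wordpieces(tokens):
    """Merge WordPiece tokens back into words; '##' marks a continuation piece."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Glue the continuation piece onto the previous word, dropping '##'.
            words[-1] += token[2:]
        else:
            words.append(token)
    return words


print(join_wordpieces(["un", "##tok", "##eni", "##ze", "this"]))
```

Taking N tokens to the left and right of a target word then becomes a slice over the token list before joining.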

6 votes

3 answers

5k views

How to stop BERT from breaking apart specific words into word-piece

I am using a pre-trained BERT model to tokenize a text into meaningful tokens. However, the text has many specific words and I don't want the BERT model to break them into word-pieces. Is there any ...

parvaneh shayegh

517

asked May 29, 2020 at 9:37

6 votes

0 answers

2k views

How to slice string depending on length of tokens

When I use (with a long test_text and short question): from transformers import BertTokenizer import torch from transformers import BertForQuestionAnswering tokenizer = BertTokenizer.from_pretrained('...

user12975267

asked Jun 21, 2020 at 18:20

5 votes

2 answers

4k views

BERT model : "enable_padding() got an unexpected keyword argument 'max_length'"

I am trying to implement the BERT model architecture using Hugging Face and KERAS. I am learning this from the Kaggle (link) and try to understand it. When I tokenized my data, I face some problems ...

Samrat Alam

568

asked Mar 22, 2021 at 9:47

5 votes

1 answer

2k views

What does merge.txt file mean in BERT-based models in HuggingFace library?

I am trying to understand what merge.txt file infers in tokenizers for RoBERTa model in HuggingFace library. However, nothing is said about it on their website. Any help is appreciated.

Akim

149

asked May 31, 2020 at 16:30

4 votes

1 answer

2k views

PyTorch tokenizers: how to truncate tokens from left?

As we can see in the code snippet below, specifying max_length and truncation for a tokenizer cuts excess tokens from the right: tokenizer("hello, my name", truncation=True, max_length=6).input_ids ...

aayc

41

asked Feb 13, 2022 at 18:44
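A minimal sketch of left-side truncation on a plain list of token ids, assuming the sequence is wrapped in exactly one leading and one trailing special token. The function name `truncate_left` and the example ids are hypothetical. Recent releases of Transformers also expose a `truncation_side` setting on tokenizers, which is worth checking first.

```python
def truncate_left(ids, max_length):
    """Keep the first and last (special) ids plus the rightmost content ids."""
    if max_length < 2:
        raise ValueError("max_length must leave room for both special tokens")
    if len(ids) <= max_length:
        return list(ids)
    keep = max_length - 2          # content slots left after the two specials
    content = ids[1:-1]
    tail = content[-keep:] if keep > 0 else []
    return [ids[0]] + tail + [ids[-1]]


example_ids = [0, 11, 12, 13, 14, 15, 2]  # made-up ids, 0 and 2 as specials
print(truncate_left(example_ids, 5))
```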

3 votes

1 answer

10k views

Huggingface's BERT tokenizer not adding pad token

It's not entirely clear from the documentation, but I can see that BertTokenizer is initialised with pad_token='[PAD]', so I assume when you encode with add_special_tokens=True then it would ...

doctopus

5,557

asked Apr 26, 2020 at 15:37

3 votes

2 answers

933 views

BPE multiple ways to encode a word

With BPE or WordPiece there might be multiple ways to encode a word. For instance, assume (for simplicity) the token vocabulary contains all letters as well as the merged symbols ("to", ...

SweetSpot

101

asked Aug 5, 2020 at 11:07

3 votes

0 answers

3k views

Explaining BERT output through SHAP values without WordPiece tokenization

I fine-tuned BERT on a sentiment analysis task in PyTorch. Now I want to use SHAP to explain which tokens led the model to the prediction (positive or negative sentiment). Currently, SHAP returns a ...

Maria

105

asked Nov 19, 2021 at 14:19

3 votes

1 answer

6k views

How to get the vocab file for Bert tokenizer from TF Hub

I'm trying to use Bert from TensorFlow Hub and build a tokenizer, this is what I'm doing: >>> import tensorflow_hub as hub >>> from bert.tokenization import FullTokenizer >>> ...

bachr

5,898

asked Jan 8, 2020 at 21:39

2 votes

2 answers

3k views

Is there a way to get the location of the substring from which a certain token has been produced in BERT?

I am feeding sentences to a BERT model (Hugging Face library). These sentences get tokenized with a pretrained tokenizer. I know that you can use the decode function to go back from tokens to strings. ...

MarciBE

23

asked Aug 14, 2020 at 13:09

2 votes

2 answers

1k views

Translation between different tokenizers

Sorry if this question is too basic to be asked here. I tried but I couldn't find solutions. I'm now working on an NLP project that requires using two different models (BART for summarization and BERT ...

exitialium

43

asked Jun 15, 2022 at 3:12

2 votes

2 answers

5k views

"Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers." ValueError: Input is not valid

I am using the BERT tokenizer for French and I am getting this error, but I can't seem to resolve it. Any suggestions would be appreciated. Traceback (most recent call last): File "training_cross_data_2....

emma

323

asked May 6, 2021 at 13:15

2 votes

1 answer

2k views

How does the BERT tokenizer result in an input tensor shape of (b, 24, 768)?

I understand how the BERT tokenizer works thanks to this article: https://albertauyeung.github.io/2020/06/19/bert-tokenization.html However, I am confused about how this ends up as the final input ...

Joshua Clancy

131

asked Jan 19, 2021 at 18:25

2 votes

1 answer

2k views

How to combine two tokenized bert sequences

Say I have two tokenized BERT sequences: seq1 = tensor([[ 101, 2023, 2003, 1996, 23032, 102]]) seq2 = tensor([[ 101, 2023, 2003, 6019, 1015, 102]]) This is produced with huggingface's tokenizer:...

Union find

8,034

asked Aug 3, 2020 at 20:58
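A minimal sketch of merging two already-encoded sequences into one sentence-pair input, written on plain Python lists rather than tensors. It assumes 101 and 102 are the `[CLS]` and `[SEP]` ids (as in `bert-base-uncased`); the function name `combine_pair` is hypothetical.

```python
CLS_ID, SEP_ID = 101, 102  # assumed ids for [CLS] and [SEP]


def combine_pair(seq1, seq2):
    """Concatenate two encoded sequences, dropping the second [CLS]."""
    if seq2 and seq2[0] == CLS_ID:
        seq2 = seq2[1:]
    input_ids = list(seq1) + list(seq2)
    # Segment ids: 0 for the first sequence, 1 for the second.
    token_type_ids = [0] * len(seq1) + [1] * len(seq2)
    return input_ids, token_type_ids


seq1 = [101, 2023, 2003, 1996, 23032, 102]
seq2 = [101, 2023, 2003, 6019, 1015, 102]
print(combine_pair(seq1, seq2))
```

The result has the `[CLS] A [SEP] B [SEP]` layout that BERT expects for pairs; encoding both texts together in a single tokenizer call yields the same layout without manual merging.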

2 votes

1 answer

2k views

Why was BERT's default vocabulary size set to 30522?

I have been trying to build a BERT model for a specific domain. However, my model is trained on non-English text, so I'm worried that the default token size, 30522, won't fit my model. Does anyone ...

Byoungchan Han

23

asked Aug 4, 2022 at 8:06

2 votes

1 answer

2k views

Can I pre-train a BERT model from scratch using a tokenized input file and a custom vocabulary file for the Khmer language?

I would like to know if it's possible for me to use my own tokenized/segmented documents (with my own vocab file as well) as the input file to the create_pretraining_data.py script (git source: https:/...

Nik Muhammad Naim

558

asked Nov 27, 2019 at 9:08

2 votes

0 answers

352 views

KeyError when trying to fine tuning Bert for text classification

I am trying to fine tune Bert for text classification on my dataset and I am getting the following error: KeyError: 'Indexing with integers (to access backend Encoding for a given batch index) is not ...

Jules

21

asked Dec 29, 2022 at 15:03

2 votes

0 answers

2k views

How to encode empty string using BERT

I have recently been trying to encode an empty string with CamemBERT (BERT model for French). I wasn't sure on how to do that. If I try to simply encode an empty string, from transformers import ...

NoIdea

113

asked Jun 21, 2021 at 11:07

2 votes

0 answers

796 views

UnparsedFlagAccessError: Trying to access flag --preserve_unused_tokens before flags were parsed

Hello, I am a beginner in ML. I tried to use BERT and the tokenizer didn't work, as shown below. train_input = bert_encode(train.text.values, tokenizer, max_len=160) test_input = bert_encode(test.text.values, ...

Tony

21

asked Apr 7, 2021 at 13:32

2 votes

0 answers

165 views

Explicit likelihood of WordPiece used for pre-processing of BERT

At each iteration the WordPiece algorithm for subword tokenization merges the two symbols which increase the likelihood the most. Now, in the literature it is only mentioned that this likelihood is ...

SweetSpot

101

asked Aug 4, 2020 at 9:58

2 votes

0 answers

160 views

Wordpiece Tokenization Model

Can somebody tell me how exactly the WordPiece model works? I am having a hard time understanding it. I understand that BPE is based on merging ...

Bika

21

asked Apr 28, 2020 at 21:10
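A minimal sketch of the encoding side of WordPiece only (not the training step that decides which symbols to merge): a greedy longest-match-first lookup against a fixed vocabulary, with `##` marking pieces that continue a word. The toy vocabulary and the function name `wordpiece_tokenize` are hypothetical.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens


toy_vocab = {"un", "##aff", "##able", "##a", "##ble"}
print(wordpiece_tokenize("unaffable", toy_vocab))
```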

2 votes

1 answer

161 views

bert_vocab.bert_vocab_from_dataset returning wrong vocabulary [closed]

I'm trying to build a tokenizer following the TensorFlow tutorial https://www.tensorflow.org/text/guide/subwords_tokenizer. I'm basically doing the same thing only with a different dataset. The dataset in ...

Niccolò Tiezzi

77

asked Apr 8, 2023 at 10:30

1 vote

1 answer

2k views

How to preprocess a dataset for BERT model implemented in Tensorflow 2.x?

Overview I have a dataset made for classification problem. There are two columns one is sentences and the other is labels (total: 10 labels). I'm trying to convert this dataset to implement it in a ...

Y4RD13

966

asked May 8, 2021 at 20:52

1 vote

1 answer

8k views

Bert Tokenizing error ValueError: Input nan is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers

I am using BERT for a text classification task. When I try to tokenize one data sample using the code: encoded_sent = tokenizer.encode( sentences[7], ...

Yaman Afadar

53

asked Nov 4, 2020 at 13:18
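A minimal sketch of one common cause of this kind of error: missing values (NaN or None) in the text column being passed to the tokenizer. It assumes the inputs are a plain Python list; the helper name `clean_texts` is hypothetical, and replacing missing entries with an empty string (rather than dropping them) is a choice made here to keep texts aligned with their labels.

```python
import math


def clean_texts(values):
    """Return a list of strings, replacing NaN/None with an empty string."""
    cleaned = []
    for value in values:
        if isinstance(value, str):
            cleaned.append(value)
        elif value is None or (isinstance(value, float) and math.isnan(value)):
            cleaned.append("")          # missing value
        else:
            cleaned.append(str(value))  # e.g. numbers read from a CSV
    return cleaned


print(clean_texts(["a sentence", float("nan"), None, 3]))
```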

1 vote

1 answer

535 views

NER Classification Deberta Tokenizer error : You need to instantiate DebertaTokenizerFast

I'm trying to perform a NER classification task using DeBERTa, but I'm stuck with a tokenizer error. This is my code (my input sentence must be split word by word by ",:): from transformers ...

Chiara

380

asked Jan 21, 2022 at 9:42

1 vote

1 answer

755 views

how to use BertTokenizer to load Tokenizer model?

I use tokenizers to train a Tokenizer and save the model like this: tokenizer = Tokenizer(BPE()) tokenizer.pre_tokenizer = Whitespace() tokenizer.decoder = ByteLevelDecoder() trainer = BpeTrainer(...

Jack.Sparrow

141

asked Sep 6, 2021 at 9:57

1 vote

1 answer

488 views

getting word-level encodings from sub-word tokens encodings

I'm looking into using a pretrained BERT ('bert-base-uncased') model to extract contextualised word-level encodings from a bunch sentences. Wordpiece tokenisation breaks down some of the words in my ...

rbroc

13

asked Jan 28, 2020 at 19:02

1 vote

1 answer

31 views

Why do Tokenizer and TokenizerFast give different results when encoding the same sentence?

When I encode text with the tokenizer and use 'do_basic_tokenize=False', I get two different results. But when I set 'do_basic_tokenize=True', the results are the same. The text is 'LUXURY HOTEL ...

feng shen

11

asked Mar 8 at 6:57

1 vote

1 answer

27 views

bert-base-uncased does not use newly added suffix token

I want to add custom tokens to the BertTokenizer. However, the model does not use the new token. from transformers import BertTokenizer tokenizer = BertTokenizer.from_pretrained("bert-base-...

Lulacca

13

asked Jul 14, 2023 at 13:53

1 vote

0 answers

110 views

How to obtain the [CLS] sentence embedding of multiple sentences successively without facing a RAM crash?

I would like to obtain the [CLS] token's sentence embedding (as it represents the whole sentence's meaning) using BERT. I have many sentences (about 40) that belong to a Document, and 246 such ...

Aadithya Seshadri

21

asked Dec 4, 2022 at 4:09

1 vote

1 answer

351 views

Equivalent to tokenizer() in Transformers 2.5.0?

I am trying to convert the following code to work with Transformers 2.5.0. As written, it works in version 4.18.0, but not 2.5.0. # Converting pretrained BERT classification model to regression model #...

galactic_tok

13

asked Jul 26, 2022 at 16:55

1 vote

0 answers

174 views

'AutoTrackable' object is not callable

I've tried to instantiate the tokenizer with this statement: tokenizer = create_tokenizer_from_hub_module(bert_path=BERT_PATH) I tried to fix it with some of the recommendations from other topics, but the ...

Agustina Villaggi

11

asked Nov 28, 2021 at 13:00

0 votes

1 answer

3k views

Token indices sequence length is longer than the specified maximum sequence length for this model (28627 > 512)

I am using Hugging Face's DistilBERT model as a backend for a question and answer application. The text I am using to train the model is one very large single text field. Even though ...

Scott Bing

125

asked Aug 22, 2021 at 21:36

0 votes

1 answer

2k views

How to specify input sequence length for BERT tokenizer in Tensorflow?

I am following this example to use BERT for sentiment classification. text_input = tf.keras.layers.Input(shape=(), dtype=tf.string) preprocessor = hub.KerasLayer( "https://tfhub.dev/...

Jane Sully

3,247

asked Aug 26, 2021 at 10:27

0 votes

2 answers

3k views

Bert Tokenizer is not working despite importing all packages. Is there a new syntax change to this?

Trying to run the tokenizer for Bert but I keep getting errors. Can anyone help where I am going wrong. FullTokenizer = bert.bert_tokenization.FullTokenizer bert_layer = hub.KerasLayer("https://tfhub....

L Akshay

13

asked May 31, 2020 at 18:17

0 votes

1 answer

47 views

Truncate texts in the middle for Bert

I am learning about Bert, which only deals with texts with fewer than 512 tokens, and came across this answer which says that truncating text in the middle (as opposed to at the start or at the end) ...

Tuan Do

159

asked Jan 17 at 21:05
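A minimal sketch of middle truncation on a list of tokens: keep a fixed number of tokens from the start and fill the remaining budget from the end. The function name `truncate_middle` and its parameters are hypothetical, and how to split the budget between head and tail is left to the caller.

```python
def truncate_middle(tokens, max_tokens, head_tokens):
    """Keep `head_tokens` from the start and the rest of the budget from the end."""
    if not 0 <= head_tokens <= max_tokens:
        raise ValueError("head_tokens must be between 0 and max_tokens")
    if len(tokens) <= max_tokens:
        return list(tokens)
    tail_tokens = max_tokens - head_tokens
    tail = tokens[-tail_tokens:] if tail_tokens > 0 else []
    return list(tokens[:head_tokens]) + list(tail)


print(truncate_middle(list(range(10)), max_tokens=6, head_tokens=2))
```

When applying this to BERT inputs, the budget should leave room for the special tokens that the tokenizer adds around the text.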

0 votes

1 answer

145 views

Understand the difference between the arguments "text" and "text_target" in the bert tokenizer from the huggingface transformers library [duplicate]

From the transformers library by huggingface from transformers import BertTokenizer tb = BertTokenizer.from_pretrained("bert-base-uncased") tb is now a wordpiece tokenizer. It has arguments ...

figs_and_nuts

5,126

asked Nov 25, 2023 at 10:48

0 votes

1 answer

1k views

Loading local tokenizer

I'm trying to load a local tokenizer using: from transformers import RobertaTokenizerFast tokenizer = RobertaTokenizerFast.from_pretrained(r'file path\tokenizer') However, this gives me the ...

Jon

91

asked Jun 3, 2023 at 8:36

0 votes

1 answer

373 views

bert_vocab.bert_vocab_from_dataset taking too long

I'm following this tutorial (https://colab.research.google.com/github/tensorflow/text/blob/master/docs/guide/subwords_tokenizer.ipynb#scrollTo=kh98DvoDz7Jn) to generate a vocabulary from a custom ...

Kurt

186

asked Jan 20, 2022 at 14:27

0 votes

1 answer

2k views

Split a sentence by words just as BERT Tokenizer would do?

I'm trying to localize all the [UNK] tokens of BERT tokenizer on my text. Once I have the position of the UNK token, I need to identify what word it belongs to. For that, I tried to get the position ...

Andrea NR

1,567

asked Feb 22, 2021 at 12:41

0 votes

0 answers

48 views

Issues with Training RoBERTa Model for Text Prediction with Fill Mask Task in Python

I have a problem. I am working on pretraining a RoBERTa MLM model from scratch on Slovak language text in Python. I have trained my own BPE tokenizer and tokenized texts with it. I obtained the ...

daviddo

1

asked Mar 13 at 18:58

0 votes

1 answer

102 views

Map BERT token indices to Spacy token indices

I’m trying to make Bert’s (bert-base-uncased) tokenization token indices (not ids, token indices) map to Spacy’s tokenization token indices. In the following example, my approach doesn’t work becos ...

lrthistlethwaite

514

asked Oct 25, 2023 at 13:58

0 votes

0 answers

44 views

Value Error when using add_tokens, 'the truth value of an array with more than one element is ambiguous'

I'm trying to improve a basic BERT pretrained tokenizer model. I'm adding new tokens using add_tokens, but running into issues with the built-in method. Namely: ValueError ...

Manny

35

asked Apr 27, 2023 at 11:40

0 votes

1 answer

36 views

How to model with NLP when the token is not relevant (by itself) but its type is?

I would like to build an NLP classification model. My input is a paragraph or a sentence. Ideally, my output is a score or probability (between 0 and 1). I have defined specific entities ex-ante, each ...

Maxou

25

asked Sep 17, 2022 at 21:54
