All Questions Tagged with bert-language-model huggingface-tokenizers
83 questions
46 votes · 5 answers · 58k views
ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - Tokenizing BERT / Distilbert Error
def split_data(path):
df = pd.read_csv(path)
return train_test_split(df , test_size=0.1, random_state=100)
train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), ...
24 votes · 1 answer · 49k views
How do the max_length, padding and truncation arguments work in HuggingFace's BertTokenizerFast.from_pretrained('bert-base-uncased')?
I am working on a text classification problem where I want to use the BERT model as the base followed by dense layers. I want to know how the 3 arguments work. For example, if I have 3 sentences ...
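As a rough illustration of how the three arguments interact, here is a minimal sketch assuming bert-base-uncased and made-up sample sentences:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
sentences = ["A short sentence.", "A somewhat longer second sentence for comparison.", "Third."]

# padding='max_length' pads every sequence up to max_length;
# truncation=True cuts anything longer down to max_length (including [CLS]/[SEP]).
encoded = tokenizer(sentences, max_length=16, padding='max_length',
                    truncation=True, return_tensors='pt')
print(encoded['input_ids'].shape)  # torch.Size([3, 16])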
16 votes · 2 answers · 31k views
Download pre-trained sentence-transformers model locally
I am using the SentenceTransformers library (here: https://pypi.org/project/sentence-transformers/#pretrained-models) for creating embeddings of sentences using the pre-trained model bert-base-nli-...
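A minimal sketch of one way to persist a SentenceTransformers model locally and reload it without network access; the directory path is hypothetical:

from sentence_transformers import SentenceTransformer

# Downloads (and caches) the model once, then writes it to a local directory.
model = SentenceTransformer('bert-base-nli-mean-tokens')
model.save('./bert-base-nli-mean-tokens-local')  # hypothetical path

# Later, load entirely from disk by passing the directory instead of the model name.
model = SentenceTransformer('./bert-base-nli-mean-tokens-local')
embeddings = model.encode(["An example sentence."])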
15 votes · 2 answers · 10k views
BertModel transformers outputs string instead of tensor
I'm following this tutorial that codes a sentiment analysis classifier using BERT with the huggingface library, and I'm seeing very odd behavior. When trying the BERT model with a sample text I get a ...
12 votes · 8 answers · 37k views
SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /dslim/bert-base-NER/resolve/main/tokenizer_config.json
I am facing the issue below while loading the pretrained BERT model from HuggingFace due to an SSL certificate error.
Error:
SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries ...
11 votes · 1 answer · 6k views
What is so special about special tokens?
What exactly is the difference between a "token" and a "special token"?
I understand the following:
what is a typical token
what is a typical special token: MASK, UNK, SEP, etc
when ...
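One way to see what a tokenizer treats as special, sketched with bert-base-uncased as an assumed example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.special_tokens_map)
# e.g. {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]',
#       'cls_token': '[CLS]', 'mask_token': '[MASK]'}

# Special tokens are inserted automatically and can be skipped when decoding.
ids = tokenizer.encode("hello world")
print(tokenizer.decode(ids))                           # '[CLS] hello world [SEP]'
print(tokenizer.decode(ids, skip_special_tokens=True)) # 'hello world'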
10 votes · 2 answers · 21k views
How to add new special token to the tokenizer?
I want to build a multi-class classification model for which I have conversational data as input for the BERT model (using bert-base-uncased).
QUERY: I want to ask a question.
ANSWER: Sure, ask away.
...
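A minimal sketch of registering conversational markers as special tokens; the [QUERY]/[ANSWER] names are hypothetical stand-ins for whatever markers the data uses, and the embedding matrix has to be resized afterwards:

from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

# Registered special tokens are never split by the WordPiece algorithm.
tokenizer.add_special_tokens({'additional_special_tokens': ['[QUERY]', '[ANSWER]']})
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize('[QUERY] I want to ask a question.'))
# ['[QUERY]', 'i', 'want', 'to', 'ask', 'a', 'question', '.']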
8 votes · 6 answers · 6k views
Problem with inputs when building a model with TFBertModel and AutoTokenizer from HuggingFace's transformers
I'm trying to build the model illustrated in this picture:
I obtained a pre-trained BERT and respective tokenizer from HuggingFace's transformers in the following way:
from transformers import ...
7 votes · 1 answer · 8k views
max_seq_length for transformer (Sentence-BERT)
I'm using sentence-BERT from Huggingface in the following way:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
model.max_seq_length = 512
model....
7 votes · 1 answer · 14k views
How does padding in the huggingface tokenizer work?
I tried the following tokenization example:
tokenizer = BertTokenizer.from_pretrained(MODEL_TYPE, do_lower_case=True)
sent = "I hate this. Not that.",
_tokenized = tokenizer(sent, ...
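A small sketch of how the two padding modes behave, assuming bert-base-uncased:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
sents = ["I hate this.", "Not that."]

# padding=True pads to the longest sequence in this particular batch,
# while padding='max_length' would pad every sequence to max_length instead.
batch = tokenizer(sents, padding=True, return_tensors='pt')
print(batch['input_ids'])
print(batch['attention_mask'])  # zeros mark the padded positions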
6 votes · 2 answers · 9k views
How to untokenize BERT tokens?
I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word.
from transformers import BertTokenizer
tz = BertTokenizer.from_pretrained("...
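A brief sketch of going from ids back to tokens and then to text, assuming bert-base-uncased; decode and convert_tokens_to_string merge the '##' WordPiece continuations:

from transformers import BertTokenizer

tz = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tz("The quick brown fox jumps over the lazy dog")

tokens = tz.convert_ids_to_tokens(enc['input_ids'])
print(tokens)                                     # WordPiece tokens incl. [CLS]/[SEP]
print(tz.convert_tokens_to_string(tokens[1:-1]))  # drop [CLS]/[SEP], rejoin subwords
print(tz.decode(enc['input_ids'], skip_special_tokens=True))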
6 votes · 2 answers · 11k views
BERT get sentence embedding
I am replicating code from this page. I have downloaded the BERT model to my local system and am getting sentence embeddings.
I have around 500,000 sentences for which I need sentence embeddings, and it is ...
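One common way to turn BertModel outputs into sentence embeddings is mean pooling over the last hidden states; a minimal sketch assuming bert-base-uncased (for 500,000 sentences this would be run in batches, ideally on GPU):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

sentences = ["First example sentence.", "A second, slightly longer example sentence."]
enc = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    out = model(**enc)

# Mean-pool token embeddings, ignoring padded positions via the attention mask.
mask = enc['attention_mask'].unsqueeze(-1).float()
embeddings = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, 768])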
6 votes · 1 answer · 11k views
BertWordPieceTokenizer vs BertTokenizer from HuggingFace
I have the following pieces of code and trying to understand the difference between BertWordPieceTokenizer and BertTokenizer.
BertWordPieceTokenizer (Rust based)
from tokenizers import ...
6 votes · 3 answers · 9k views
Huggingface BERT Tokenizer add new token
I am using Huggingface BERT for an NLP task. My texts contain names of companies which are split up into subwords.
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
tokenizer....
4 votes · 3 answers · 5k views
How to apply max_length to truncate the token sequence from the left in a HuggingFace tokenizer?
In the HuggingFace tokenizer, applying the max_length argument specifies the length of the tokenized text. I believe it truncates the sequence to max_length-2 (if truncation=True) by cutting the ...
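In recent transformers versions the tokenizer exposes a truncation_side attribute, which is a plausible fit here; a minimal sketch assuming bert-base-uncased:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
tokenizer.truncation_side = 'left'   # drop tokens from the start instead of the end

enc = tokenizer("word " * 1000, max_length=32, truncation=True)
print(len(enc['input_ids']))  # 32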
4 votes · 1 answer · 12k views
HuggingFace Bert Sentiment analysis
I am getting the following error :
AssertionError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples)., ...
3 votes · 1 answer · 4k views
Fast and slow tokenizers yield different results
Using HuggingFace's pipeline tool, I was surprised to find that there was a significant difference in output when using the fast vs slow tokenizer.
Specifically, when I run the fill-mask pipeline, ...
3 votes · 1 answer · 9k views
resize_token_embeddings on a pretrained model with a different embedding size
I would like to ask about the way to change the embedding size of the trained model.
I have a trained model models/BERT-pretrain-1-step-5000.pkl.
Now I am adding a new token [TRA] to the tokenizer and ...
3 votes · 1 answer · 4k views
How to map token indices from the SQuAD data to tokens from BERT tokenizer?
I am using the SQuAD dataset for answer span selection. After using the BertTokenizer to tokenize the passages, for some samples the start and end indices of the answer don't match the real answer ...
3 votes · 1 answer · 1k views
BERT - Is that needed to add new tokens to be trained in a domain specific environment?
My question here is not how to add new tokens, or how to train using a domain-specific corpus; I'm already doing that.
The thing is, am I supposed to add the domain-specific tokens before the MLM ...
3 votes · 1 answer · 2k views
Using Hugging-face transformer with arguments in pipeline
I am working on using a transformers pipeline to get BERT embeddings for my input. Using it without a pipeline I am able to get constant outputs, but not with the pipeline, since I was not able to pass ...
3 votes · 2 answers · 933 views
BPE multiple ways to encode a word
With BPE or WordPiece there might be multiple ways to encode a word. For instance, assume (for simplicity) the token vocabulary contains all letters as well as the merged symbols ("to", ...
3 votes · 0 answers · 923 views
Create custom data_collator for Huggingface Trainer
I need to create a custom data_collator for finetuning with Huggingface Trainer API.
HuggingFace offers DataCollatorForWholeWordMask for masking whole words within the sentences with a given ...
3 votes · 0 answers · 3k views
How to get tokens to words in BERT tokenizer
I have a list; using the huggingface BERT tokenizer I can get the mapped numerical representation.
X = ['[CLS]', '[MASK]', 'love', 'this', '[SEP]']
tokens = tokenizer.convert_tokens_to_ids(X)
tokens: [...
3 votes · 3 answers · 4k views
How to save a tokenizer after training it?
I have just followed this tutorial on how to train my own tokenizer.
Now, from training my tokenizer, I have wrapped it inside a Transformers object, so that I can use it with the transformers library:...
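A minimal sketch of saving and reloading a trained tokenizers-library object once it is wrapped for transformers; the untrained WordPiece tokenizer and the directory name are placeholders for whatever the tutorial produces:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from transformers import PreTrainedTokenizerFast

# Placeholder for the tokenizer trained in the tutorial.
raw_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

wrapped = PreTrainedTokenizerFast(tokenizer_object=raw_tokenizer, unk_token="[UNK]")
wrapped.save_pretrained('./my-tokenizer')    # writes tokenizer.json and config files
reloaded = PreTrainedTokenizerFast.from_pretrained('./my-tokenizer')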
2 votes · 2 answers · 3k views
Is there a way to get the location of the substring from which a certain token has been produced in BERT?
I am feeding sentences to a BERT model (Hugging Face library). These sentences get tokenized with a pretrained tokenizer. I know that you can use the decode function to go back from tokens to strings.
...
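With a fast tokenizer, return_offsets_mapping gives character spans back into the original string, which is one way to approach this; a small sketch assuming bert-base-uncased:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
text = "Huggingface tokenizers are handy."

enc = tokenizer(text, return_offsets_mapping=True)
tokens = tokenizer.convert_ids_to_tokens(enc['input_ids'])
for token, (start, end) in zip(tokens, enc['offset_mapping']):
    print(token, (start, end), repr(text[start:end]))
# Special tokens such as [CLS]/[SEP] map to the empty span (0, 0).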
2 votes · 1 answer · 720 views
BART Tokenizer tokenises same word differently?
I have noticed that if I tokenize a full text with many sentences, I sometimes get a different number of tokens than if I tokenise each sentence individually and add up the tokens. I have done some ...
2 votes · 1 answer · 984 views
Why does Transformer's BERT (for sequence classification) output depend heavily on maximum sequence length padding?
I am using Transformer's RobBERT (the Dutch version of RoBERTa) for sequence classification, trained for sentiment analysis on the Dutch Book Reviews dataset.
I wanted to test how well it works on a ...
2 votes · 3 answers · 2k views
Any reason to save a pretrained BERT tokenizer?
Say I am using tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True), and all I am doing with that tokenizer during fine-tuning of a new model is the standard tokenizer....
2 votes · 1 answer · 2k views
Are these normal speeds for BERT pretrained model inference in PyTorch?
I am testing the BERT base and BERT distilled models in Huggingface with 4 speed scenarios, batch_size = 1:
1) bert-base-uncased: 154ms per request
2) bert-base-uncased with quantization: 94ms per ...
2 votes · 1 answer · 3k views
Calculate precision, recall, f1 score for custom dataset for multiclass classification Huggingface library
I am trying to do multiclass classification for the sentence pair task. I uploaded my custom dataset of train and test separately in the hugging face data set and trained my model and tested it and ...
2 votes · 0 answers · 330 views
how to make BERT predict new token
my problem looks like this:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
fill_mask_pipeline_pre = pipeline("fill-...
2 votes · 0 answers · 422 views
Huggingface pre-trained model
I am trying to use the code below:
from transformers import AutoTokenizer, AutoModel
t = "ProsusAI/finbert"
tokenizer = AutoTokenizer.from_pretrained(t)
model = AutoModel.from_pretrained(t)
The ...
2 votes · 0 answers · 492 views
Train BERT model from scratch on a different language
First I create a tokenizer as follows:
from tokenizers import Tokenizer
from tokenizers.models import BPE,WordPiece
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
from tokenizers.trainers ...
2 votes · 0 answers · 1k views
huggingface pipeline: bert NER task throws RuntimeError: The size of tensor a (921) must match the size of tensor b (512) at non-singleton dimension 1
I am trying to set up a German NER model, pretrained with BERT, via the huggingface pipeline. For some texts the following code throws an error "RuntimeError: The size of tensor a (921) must match the size of ...
1 vote · 1 answer · 3k views
How to add tokens to vocab.txt which are decoded as [UNK] by the BERT tokenizer
I was decoding the tokenized tokens from the BERT tokenizer and it was giving [UNK] for the € symbol. I tried adding a ##€ token to the vocab.txt file, but it was not reflected; the prediction result was the same as ...
1 vote · 1 answer · 2k views
BERT tokenize URLs
I want to classify a bunch of tweets and therefore I'm using the huggingface implementation of BERT. However, I noticed that the default BertTokenizer does not use special tokens for URLs.
>>> ...
1 vote · 1 answer · 137 views
Truncating a training dataset so that it fits exactly within the context window
I have a dataset where the total number of tokens once tokenised is around 5000. I want to feed that into a BERT-style model, so I have to shrink it down to 512 tokens, but I want to rearrange the text to train it ...
1 vote · 2 answers · 2k views
How can we pass a list of strings to a fine tuned bert model?
I want to pass a list of strings instead of a single string input to my fine-tuned BERT question classification model.
This is my code, which accepts a single string input.
questionclassification_model ...
1 vote · 1 answer · 1k views
How truncation works when applying BERT tokenizer on the batch of sentence pairs in HuggingFace?
Say, I have three sample sentences:
s0 = "This model was pretrained using a specific normalization pipeline available here!"
s1 = "Thank to all the people around,"
s2 = "...
1 vote · 1 answer · 2k views
Encoding/tokenizing dataset dictionary (BERT/Huggingface)
I am trying to fine-tune my sentiment analysis model. Therefore, I have split my pandas DataFrame (a column with reviews, a column with sentiment scores) into train and test DataFrames and transformed ...
1 vote · 0 answers · 22 views
BERT MLM model fine-tuning bad results on new dataset
I'm trying to fine-tune an MLM model on a new kind of small dataset (train.csv 43285 lines, validation.csv 3597); my data looks like this:
text
בראשית ברא אלהים את השמים ואת הארץ
והארץ היתה תהו ובהו וחשך על ...
1 vote · 1 answer · 31 views
Why do Tokenizer and TokenizerFast give different results when encoding the same sentence?
error1
When I use the tokenizer to encode text with do_basic_tokenize=False, I get two different results.
But when I set do_basic_tokenize=True, the results are the same.
The text is 'LUXURY HOTEL ...
1 vote · 0 answers · 184 views
The tokenizer Doesn't recognize the new special tokens
When I run the code below, the tokenizer doesn't recognize the new special tokens that I added ([SP] and [EMPTY]). I want to tokenize Arabic text.
from tokenizers import BertWordPieceTokenizer
from ...
1 vote · 0 answers · 25 views
Training a Huggingface model without n_epochs
I would like to train a RobertaForMaskedLM in Huggingface from scratch.
However, I would like not to specify any stopping time, but to stop only when there is no more improvement in training. ...
1 vote · 1 answer · 351 views
Equivalent to tokenizer() in Transformers 2.5.0?
I am trying to convert the following code to work with Transformers 2.5.0. As written, it works in version 4.18.0, but not 2.5.0.
# Converting pretrained BERT classification model to regression model
#...
1 vote · 1 answer · 2k views
How to run Huggingface BERT tokenizer in offline mode?
While running the above code on my work laptop I'm getting the following error, but the same error does not occur when I run it on my personal laptop. I wanted to check whether there is a way to fix this SSL error?...
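Two approaches that usually work for offline use, sketched with a hypothetical local directory: pointing from_pretrained at a folder that already contains the tokenizer files, or forcing the library to rely on its local cache only:

from transformers import BertTokenizer

# Option 1: load from a directory that already holds vocab.txt,
# tokenizer_config.json, etc. (copied over from a machine with access).
tokenizer = BertTokenizer.from_pretrained('./bert-base-uncased-local')

# Option 2: keep the hub identifier but forbid any network calls,
# relying purely on the local cache.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', local_files_only=True)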
1 vote · 1 answer · 240 views
Optimize Albert HuggingFace model
Goal: Amend this Notebook to work with albert-base-v2 model
Kernel: conda_pytorch_p36.
Section 2.1 exports the finalised model. It too uses a BERT specific function. However, I cannot find an ...
1 vote · 0 answers · 854 views
how to extend a pretrained transformer model configured with small max_position_embeddings to a longer one
Suppose I want to use the existing pre-trained model
https://huggingface.co/Salesforce/grappa_large_jnt/
as the initial checkpoint for fine-tuning.
This grappa model has a max position embedding of 514 ...
1 vote · 0 answers · 126 views
DistilBERT Prediction Output - "TypeError: only size-1 arrays can be converted to Python scalars"
I am trying to apply a DistilBERT model to predict whether a sentence is a Claim, Premise or Non-Argumentative (3 outputs).
However, when I apply the model and want to create a prediction ...