All Questions

46 votes
5 answers
58k views

ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - Tokenizing BERT / Distilbert Error

def split_data(path): df = pd.read_csv(path) return train_test_split(df , test_size=0.1, random_state=100) train, test = split_data(DATA_DIR) train_texts, train_labels = train['text'].to_list(), ...
Raoof Naushad
24 votes
1 answer
49k views

How do the max_length, padding and truncation arguments work in HuggingFace's BertTokenizerFast.from_pretrained('bert-base-uncased')?

I am working on a text classification problem where I want to use the BERT model as the base, followed by Dense layers. I want to know how the 3 arguments work. For example, if I have 3 sentences ...
Deshwal
  • 3,872
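Not taken from the answers, just a minimal sketch of how the three arguments are usually combined when calling a fast BERT tokenizer; the sentences and the max_length value are made up for illustration.

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

sentences = ["a short sentence", "a noticeably longer example sentence for illustration"]

# padding="max_length" pads every sequence up to max_length;
# truncation=True cuts anything longer than max_length (special tokens included).
encoded = tokenizer(
    sentences,
    max_length=16,            # illustrative value
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([2, 16])
```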
16 votes
2 answers
31k views

Download pre-trained sentence-transformers model locally

I am using the SentenceTransformers library (here: https://pypi.org/project/sentence-transformers/#pretrained-models) for creating embeddings of sentences using the pre-trained model bert-base-nli-...
neha tamore
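One common approach, sketched below under the assumption that the machine has internet access at least once: download the model with sentence-transformers, persist it with model.save(), and load it from the local folder afterwards. The folder name is arbitrary.

```python
from sentence_transformers import SentenceTransformer

# First run (online): download the pretrained model, then persist it locally.
model = SentenceTransformer("bert-base-nli-mean-tokens")
model.save("./bert-base-nli-mean-tokens-local")      # arbitrary local folder

# Later runs (offline): load straight from that folder.
model = SentenceTransformer("./bert-base-nli-mean-tokens-local")
embeddings = model.encode(["An example sentence."])
print(embeddings.shape)
```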
15 votes
2 answers
10k views

BertModel transformers outputs string instead of tensor

I'm following this tutorial that codes a sentiment analysis classifier using BERT with the huggingface library and I'm seeing some very odd behavior. When trying the BERT model with a sample text I get a ...
Miguel
  • 2,922
12 votes
8 answers
37k views

SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /dslim/bert-base-NER/resolve/main/tokenizer_config.json

I am facing below issue while loading the pretrained BERT model from HuggingFace due to SSL certificate error. Error: SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries ...
Nikita Malviya
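A hedged workaround sketch (not necessarily the accepted answer): copy the repository files to a local folder on a machine where the certificate chain is trusted, then point from_pretrained at that folder; local_files_only=True keeps transformers from attempting any network call. The local path is illustrative.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumes the repo contents were copied here beforehand, e.g. via
# `git clone https://huggingface.co/dslim/bert-base-NER` on a machine
# whose SSL setup works.
local_dir = "./bert-base-NER"

tokenizer = AutoTokenizer.from_pretrained(local_dir, local_files_only=True)
model = AutoModelForTokenClassification.from_pretrained(local_dir, local_files_only=True)
```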
11 votes
1 answer
6k views

What is so special about special tokens?

What exactly is the difference between a "token" and a "special token"? I understand the following: what a typical token is; what a typical special token is (MASK, UNK, SEP, etc.); when ...
ShaoMin Liu
10 votes
2 answers
21k views

How to add new special token to the tokenizer?

I want to build a multi-class classification model for which I have conversational data as input for the BERT model (using bert-base-uncased). QUERY: I want to ask a question. ANSWER: Sure, ask away. ...
sid8491
  • 6,740
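A rough sketch of the usual recipe with add_special_tokens plus resize_token_embeddings; the [QUERY]/[ANSWER] markers and num_labels=3 are assumptions based on the excerpt, not taken from the thread.

```python
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)

# Register the markers as special tokens so the tokenizer never splits them.
tokenizer.add_special_tokens({"additional_special_tokens": ["[QUERY]", "[ANSWER]"]})

# Grow the embedding matrix to cover the enlarged vocabulary.
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("[QUERY] I want to ask a question. [ANSWER] Sure, ask away."))
```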
8 votes
6 answers
6k views

Problem with inputs when building a model with TFBertModel and AutoTokenizer from HuggingFace's transformers

I'm trying to build the model illustrated in this picture: I obtained a pre-trained BERT and respective tokenizer from HuggingFace's transformers in the following way: from transformers import ...
Gerardo Zinno
7 votes
1 answer
8k views

max_seq_length for transformer (Sentence-BERT)

I'm using sentence-BERT from Huggingface in the following way: from sentence_transformers import SentenceTransformer model = SentenceTransformer('all-MiniLM-L6-v2') model.max_seq_length = 512 model....
BlackHawk
  • 779
7 votes
1 answer
14k views

How does padding in the huggingface tokenizer work?

I tried the following tokenization example: tokenizer = BertTokenizer.from_pretrained(MODEL_TYPE, do_lower_case=True) sent = "I hate this. Not that.", _tokenized = tokenizer(sent, ...
MsA
  • 2,829
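For reference, a small sketch contrasting padding=True (pad to the longest sequence in the batch) with padding="max_length" (pad everything to a fixed length); the sentences and max_length are illustrative.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", do_lower_case=True)

batch = ["I hate this.", "Not that."]

dynamic = tokenizer(batch, padding=True)                       # pad to longest in batch
fixed = tokenizer(batch, padding="max_length", max_length=10,  # pad to a fixed length
                  truncation=True)

print([len(ids) for ids in dynamic["input_ids"]])  # both equal to the batch maximum
print([len(ids) for ids in fixed["input_ids"]])    # [10, 10]
```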
6 votes
2 answers
9k views

How to untokenize BERT tokens?

I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word. from transformers import BertTokenizer tz = BertTokenizer.from_pretrained("...
JayJay
  • 183
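A minimal sketch of going back from ids to text with convert_ids_to_tokens / convert_tokens_to_string (or decode); the example sentence and the token window are made up.

```python
from transformers import BertTokenizer

tz = BertTokenizer.from_pretrained("bert-base-uncased")

text = "The quick brown fox jumps over the lazy dog"
ids = tz.encode(text, add_special_tokens=False)

window = ids[2:6]                               # an arbitrary window of tokens
tokens = tz.convert_ids_to_tokens(window)
print(tokens)                                   # WordPiece tokens, possibly with '##'
print(tz.convert_tokens_to_string(tokens))      # rejoined surface text
print(tz.decode(window))                        # same idea, directly from ids
```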
6 votes
2 answers
11k views

BERT get sentence embedding

I am replicating code from this page. I have downloaded the BERT model to my local system and getting sentence embedding. I have around 500,000 sentences for which I need sentence embedding and it is ...
user2543622
  • 6,258
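One common way to get sentence embeddings from a plain BertModel is mean pooling over the last hidden state, using the attention mask to ignore padding; a sketch with made-up sentences (for 500,000 sentences you would batch this loop and move the model to GPU).

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["First example sentence.", "Second example sentence."]
enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(**enc)

# Mean-pool the token embeddings, ignoring padding positions.
mask = enc["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embeddings.shape)  # torch.Size([2, 768])
```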
6 votes
1 answer
11k views

BertWordPieceTokenizer vs BertTokenizer from HuggingFace

I have the following pieces of code and trying to understand the difference between BertWordPieceTokenizer and BertTokenizer. BertWordPieceTokenizer (Rust based) from tokenizers import ...
HopeKing
  • 3,413
6 votes
3 answers
9k views

Huggingface BERT Tokenizer add new token

I am using Huggingface BERT for an NLP task. My texts contain names of companies which are split up into subwords. tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased') tokenizer....
Nui
  • 111
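A sketch of the usual add_tokens + resize_token_embeddings pattern; "acmecorp" is a hypothetical company name, and the newly added embedding row is randomly initialised, so it only becomes meaningful after further training.

```python
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("acmecorp"))          # split into subwords

tokenizer.add_tokens(["acmecorp"])             # add as a regular (non-special) token
model.resize_token_embeddings(len(tokenizer))  # make room for the new embedding row

print(tokenizer.tokenize("acmecorp"))          # now a single token
```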
4 votes
3 answers
5k views

How to apply max_length to truncate the token sequence from the left in a HuggingFace tokenizer?

In the HuggingFace tokenizer, applying the max_length argument specifies the length of the tokenized text. I believe it truncates the sequence to max_length-2 (if truncation=True) by cutting the ...
Ondrej Sotolar
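Assuming a reasonably recent transformers version, tokenizers expose a truncation_side attribute that controls which end gets cut; a small sketch (the input text and max_length are illustrative).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.truncation_side = "left"   # default is "right"

enc = tokenizer(
    "a long input whose beginning we are willing to lose " * 50,
    max_length=32,
    truncation=True,
)
print(len(enc["input_ids"]))  # 32, with the start of the text cut off
```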
4 votes
1 answer
12k views

HuggingFace Bert Sentiment analysis

I am getting the following error : AssertionError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples)., ...
paris
  • 43
3 votes
1 answer
4k views

Fast and slow tokenizers yield different results

Using HuggingFace's pipeline tool, I was surprised to find that there was a significant difference in output when using the fast vs slow tokenizer. Specifically, when I run the fill-mask pipeline, ...
Michael
  • 153
3 votes
1 answer
9k views

resize_token_embeddings on a pretrained model with a different embedding size

I would like to ask about the way to change the embedding size of the trained model. I have a trained model models/BERT-pretrain-1-step-5000.pkl. Now I am adding a new token [TRA] to the tokeniser and ...
tw0930
  • 61
3 votes
1 answer
4k views

How to map token indices from the SQuAD data to tokens from BERT tokenizer?

I am using the SQuAD dataset for answer span selection. After using the BertTokenizer to tokenize the passages, for some samples, the start and end indices of the answer don't match the real answer ...
KoalaJ
  • 145
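With a fast tokenizer, return_offsets_mapping=True and BatchEncoding.char_to_token make it possible to map character-level answer spans onto WordPiece token indices; a sketch with a made-up passage and answer.

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

context = "The quick brown fox jumps over the lazy dog."
answer = "brown fox"
answer_start = context.find(answer)
answer_end = answer_start + len(answer)

enc = tokenizer(context, return_offsets_mapping=True)

# char_to_token maps a character position in the passage to the index of
# the WordPiece token covering it (only available on fast tokenizers).
start_token = enc.char_to_token(answer_start)
end_token = enc.char_to_token(answer_end - 1)
print(start_token, end_token)
print(enc["offset_mapping"][start_token:end_token + 1])
```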
3 votes
1 answer
1k views

BERT - Is that needed to add new tokens to be trained in a domain specific environment?

My question here is not how to add new tokens, or how to train using a domain-specific corpus; I'm already doing that. The thing is, am I supposed to add the domain-specific tokens before the MLM ...
rdemorais
  • 253
3 votes
1 answer
2k views

Using Hugging-face transformer with arguments in pipeline

I am working on using a transformers Pipeline to get BERT embeddings for my input. Using this without a pipeline I am able to get constant outputs, but not with the pipeline, since I was not able to pass ...
Israel-abebe
3 votes
2 answers
933 views

BPE multiple ways to encode a word

With BPE or WordPiece there might be multiple ways to encode a word. For instance, assume (for simplicity) the token vocabulary contains all letters as well as the merged symbols ("to", ...
SweetSpot
  • 101
3 votes
0 answers
923 views

Create custom data_collator for Huggingface Trainer

I need to create a custom data_collator for finetuning with Huggingface Trainer API. HuggingFace offers DataCollatorForWholeWordMask for masking whole words within the sentences with a given ...
kkgarg
  • 1,346
3 votes
0 answers
3k views

How to get tokens to words in BERT tokenizer

I have a list; using the huggingface bert tokenizer I can get the mapped numerical representation. X = ['[CLS]', '[MASK]', 'love', 'this', '[SEP]'] tokens = tokenizer.convert_tokens_to_ids(X) tokens: [...
kowser66
  • 155
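convert_ids_to_tokens is the inverse of convert_tokens_to_ids, and decode additionally joins the pieces into a string; a quick sketch using the list from the excerpt.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

X = ['[CLS]', '[MASK]', 'love', 'this', '[SEP]']
ids = tokenizer.convert_tokens_to_ids(X)

print(tokenizer.convert_ids_to_tokens(ids))              # back to the original tokens
print(tokenizer.decode(ids, skip_special_tokens=True))   # "love this"
```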
3 votes
3 answers
4k views

How to save a tokenizer after training it?

I have just followed this tutorial on how to train my own tokenizer. Now, from training my tokenizer, I have wrapped it inside a Transformers object, so that I can use it with the transformers library:...
user
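A sketch of one way to persist such a tokenizer: wrap the trained tokenizers object in PreTrainedTokenizerFast and use save_pretrained / from_pretrained. The tiny training corpus and the output directory are placeholders, not values from the question.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordPieceTrainer
from transformers import PreTrainedTokenizerFast

# Stand-in for the tutorial's training step.
tok = Tokenizer(WordPiece(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
trainer = WordPieceTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tok.train_from_iterator(["some training text", "more training text"], trainer)

# Wrap it in a transformers tokenizer and save it in the usual layout ...
wrapped = PreTrainedTokenizerFast(tokenizer_object=tok, unk_token="[UNK]")
wrapped.save_pretrained("./my-tokenizer")

# ... so it can be reloaded later.
reloaded = PreTrainedTokenizerFast.from_pretrained("./my-tokenizer")
```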
2 votes
2 answers
3k views

Is there a way to get the location of the substring from which a certain token has been produced in BERT?

I am feeding sentences to a BERT model (Hugging Face library). These sentences get tokenized with a pretrained tokenizer. I know that you can use the decode function to go back from tokens to strings. ...
MarciBE
  • 23
2 votes
1 answer
720 views

BART Tokenizer tokenises same word differently?

I have noticed that if I tokenize a full text with many sentences, I sometimes get a different number of tokens than if I tokenise each sentence individually and add up the tokens. I have done some ...
andrea
  • 682
2 votes
1 answer
984 views

Why does Transformer's BERT (for sequence classification) output depend heavily on maximum sequence length padding?

I am using Transformer's RobBERT (the dutch version of RoBERTa) for sequence classification - trained for sentiment analysis on the Dutch Book Reviews dataset. I wanted to test how well it works on a ...
Wouter S
2 votes
3 answers
2k views

Any reason to save a pretrained BERT tokenizer?

Say I am using tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True), and all I am doing with that tokenizer during fine-tuning of a new model is the standard tokenizer....
ginobimura
2 votes
1 answer
2k views

Are these normal speeds for BERT pretrained model inference in PyTorch?

I am testing Bert base and Bert distilled model in Huggingface with 4 scenarios of speeds, batch_size = 1: 1) bert-base-uncased: 154ms per request 2) bert-base-uncased with quantization: 94ms per ...
marlon
  • 6,847
2 votes
1 answer
3k views

Calculate precision, recall, f1 score for custom dataset for multiclass classification Huggingface library

I am trying to do multiclass classification for a sentence pair task. I uploaded my custom train and test datasets separately as a Hugging Face dataset, trained my model, tested it, and ...
Alex Kujur
2 votes
0 answers
330 views

How to make BERT predict a new token

my problem looks like this: tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') model = BertForMaskedLM.from_pretrained('bert-base-uncased') fill_mask_pipeline_pre = pipeline("fill-...
Maximilian Huber
2 votes
0 answers
422 views

Huggingface pre-trained model

I try to use the below code: from transformers import AutoTokenizer, AutoModel t = "ProsusAI/finbert" tokenizer = AutoTokenizer.from_pretrained(t) model = AutoModel.from_pretrained(t) The ...
Learner91
  • 103
2 votes
0 answers
492 views

Train BERT model from scratch on a different language

First I create a tokenizer as follows: from tokenizers import Tokenizer from tokenizers.models import BPE,WordPiece tokenizer = Tokenizer(WordPiece(unk_token="[UNK]")) from tokenizers.trainers ...
Talha Anwar
  • 2,873
2 votes
0 answers
1k views

huggingface pipeline: bert NER task throws RuntimeError: The size of tensor a (921) must match the size of tensor b (512) at non-singleton dimension 1

I am trying to set up a German NER model, pretrained with BERT, via the huggingface pipeline. For some texts the following code throws the error "RuntimeError: The size of tensor a (921) must match the size of ...
Michael Göggelmann
1 vote
1 answer
3k views

How to add tokens to vocab.txt which are decoded as [UNK] by the BERT tokenizer

I was decoding the tokenized tokens from the BERT tokenizer and it was giving [UNK] for the € symbol. I tried adding a ##€ token to the vocab.txt file, but it was not reflected; the prediction result was the same as ...
Ramakant Shakya
1 vote
1 answer
2k views

BERT tokenize URLs

I want to classify a bunch of tweets and therefore I'm using the huggingface implementation of BERT. However I noticed that the default BertTokenizer does not use special tokens for URLs. >>> ...
random314
1 vote
1 answer
137 views

Truncating a training dataset so that it fits exactly within the context window

I have a dataset where the total number of tokens once tokenised is around 5000. I want to feed that into a BERT-style model, so I have to shrink it down to 512 tokens, but I want to rearrange the text to train it ...
Shafiq Jetha
  • 1,437
1 vote
2 answers
2k views

How can we pass a list of strings to a fine tuned bert model?

I want to pass a list of strings instead of a single string input to my fine tuned bert question classification model. This is my code which accepts a single string input. questionclassification_model ...
Abin Jilson
1 vote
1 answer
1k views

How does truncation work when applying the BERT tokenizer on a batch of sentence pairs in HuggingFace?

Say, I have three sample sentences: s0 = "This model was pretrained using a specific normalization pipeline available here!" s1 = "Thank to all the people around," s2 = "...
Abu Ubaida
1 vote
1 answer
2k views

Encoding/tokenizing dataset dictionary (BERT/Huggingface)

I am trying to finetune my Sentiment Analysis Model. Therefore, I have split my pandas Dataframe (column with reviews, column with sentiment scores) into a train and test Dataframe and transformed ...
soulwreckedyouth
1 vote
0 answers
22 views

BERT MLM model fine-tuning bad results on new dataset

I'm trying to fine tune a MLM model on new kind of small data (train.csv 43285 lines, validation.csv 3597), my data looks like this: text בראשית ברא אלהים את השמים ואת הארץ והארץ היתה תהו ובהו וחשך על ...
bsteo
  • 1,725
1 vote
1 answer
31 views

Why do Tokenizer and TokenizerFast encode the same sentence with different results?

When I use the tokenizer to encode text with do_basic_tokenize=False, I get two different results. But when I set do_basic_tokenize=True, the results are the same. The text is 'LUXURY HOTEL ...
feng shen
1 vote
0 answers
184 views

The tokenizer Doesn't recognize the new special tokens

When I run the code below, the tokenizer doesn't recognize the new special tokens that I added ([SP] and [EMPTY]). I wanted to tokenize Arabic text. from tokenizers import BertWordPieceTokenizer from ...
FQ912
  • 11
1 vote
0 answers
25 views

Training a Huggingface model without n_epochs

I would like to train a RobertaForMaskedLM from scratch in Huggingface. However, I would like not to specify any stopping time, but to stop only when there is no more improvement in training. ...
Chiara
  • 380
1 vote
1 answer
351 views

Equivalent to tokenizer() in Transformers 2.5.0?

I am trying to convert the following code to work with Transformers 2.5.0. As written, it works in version 4.18.0, but not 2.5.0. # Converting pretrained BERT classification model to regression model #...
galactic_tok
1 vote
1 answer
2k views

How to run Huggingface BERT tokenizer in offline mode?

While running the above code on my work laptop I'm getting the following error, but the same error does not occur when I run it on my personal laptop. I wanted to check whether there is a way to fix this SSL error. ...
CreaTorr
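Assuming the tokenizer files are already available on disk (e.g. copied over from the personal laptop), a sketch of loading fully offline with the TRANSFORMERS_OFFLINE environment variable and local_files_only; the local directory name is illustrative.

```python
import os

# Block any attempt to reach the Hub; must be set before importing transformers.
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from transformers import BertTokenizer

# Load from a directory that already contains vocab.txt, tokenizer_config.json, etc.
tokenizer = BertTokenizer.from_pretrained(
    "./bert-base-uncased-local",      # illustrative local path
    local_files_only=True,
)
```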
1 vote
1 answer
240 views

Optimize Albert HuggingFace model

Goal: amend this Notebook to work with the albert-base-v2 model. Kernel: conda_pytorch_p36. Section 2.1 exports the finalised model. It too uses a BERT-specific function. However, I cannot find an ...
DanielBell99
  • 1,321
1 vote
0 answers
854 views

How to extend a pretrained transformer model configured with a small max_position_embeddings to a longer one

Suppose I want to use the existing pre-trained model https://huggingface.co/Salesforce/grappa_large_jnt/ as the initial checkpoint for finetuning. This grappa model has a max position embedding of 514 ...
mt07tm
  • 11
1 vote
0 answers
126 views

DistilBERT Prediction Output - "TypeError: only size-1 arrays can be converted to Python scalars"

I am trying to apply a DistilBERT model to create a prediction, whether a sentence is a Claim, Premise or Non-Argumentative (3 Outputs) However when I apply the model and want to create a prediction ...
Philipp
  • 11