All Questions Tagged with bert-language-model huggingface-tokenizers
83 questions
46 votes · 5 answers · 58k views
ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - Tokenizing BERT / Distilbert Error
def split_data(path):
df = pd.read_csv(path)
return train_test_split(df , test_size=0.1, random_state=100)
train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), ...
24 votes · 1 answer · 49k views
How do the max_length, padding and truncation arguments work in HuggingFace's BertTokenizerFast.from_pretrained('bert-base-uncased')?
I am working on a text classification problem where I want to use the BERT model as the base followed by dense layers. I want to know how the 3 arguments work. For example, if I have 3 sentences ...
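As a rough illustration of how the three arguments interact, here is a minimal sketch assuming bert-base-uncased and made-up sample sentences:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
sentences = ["A short sentence.", "A somewhat longer second sentence for comparison.", "Third."]

# padding='max_length' pads every sequence up to max_length;
# truncation=True cuts anything longer down to max_length (including [CLS]/[SEP]).
encoded = tokenizer(sentences, max_length=16, padding='max_length',
                    truncation=True, return_tensors='pt')
print(encoded['input_ids'].shape)  # torch.Size([3, 16])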
16 votes · 2 answers · 31k views
Download pre-trained sentence-transformers model locally
I am using the SentenceTransformers library (here: https://pypi.org/project/sentence-transformers/#pretrained-models) for creating embeddings of sentences using the pre-trained model bert-base-nli-...
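A minimal sketch of one way to persist a SentenceTransformers model locally and reload it without network access; the directory path is hypothetical:

from sentence_transformers import SentenceTransformer

# Downloads (and caches) the model once, then writes it to a local directory.
model = SentenceTransformer('bert-base-nli-mean-tokens')
model.save('./bert-base-nli-mean-tokens-local')  # hypothetical path

# Later, load entirely from disk by passing the directory instead of the model name.
model = SentenceTransformer('./bert-base-nli-mean-tokens-local')
embeddings = model.encode(["An example sentence."])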
15 votes · 2 answers · 10k views
BertModel transformers outputs string instead of tensor
I'm following this tutorial that codes a sentiment analysis classifier using BERT with the huggingface library, and I'm seeing very odd behavior. When trying the BERT model with a sample text I get a ...
12 votes · 8 answers · 37k views
SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /dslim/bert-base-NER/resolve/main/tokenizer_config.json
I am facing the issue below while loading the pretrained BERT model from HuggingFace due to an SSL certificate error.
Error:
SSLError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries ...
11 votes · 1 answer · 6k views
What is so special about special tokens?
What exactly is the difference between a "token" and a "special token"?
I understand the following:
what is a typical token
what is a typical special token: MASK, UNK, SEP, etc
when ...
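One way to see what a tokenizer treats as special, sketched with bert-base-uncased as an assumed example:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.special_tokens_map)
# e.g. {'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]',
#       'cls_token': '[CLS]', 'mask_token': '[MASK]'}

# Special tokens are inserted automatically and can be skipped when decoding.
ids = tokenizer.encode("hello world")
print(tokenizer.decode(ids))                           # '[CLS] hello world [SEP]'
print(tokenizer.decode(ids, skip_special_tokens=True)) # 'hello world'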
10 votes · 2 answers · 21k views
How to add new special token to the tokenizer?
I want to build a multi-class classification model for which I have conversational data as input for the BERT model (using bert-base-uncased).
QUERY: I want to ask a question.
ANSWER: Sure, ask away.
...
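A minimal sketch of registering conversational markers as special tokens; the [QUERY]/[ANSWER] names are hypothetical stand-ins for whatever markers the data uses, and the embedding matrix has to be resized afterwards:

from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

# Registered special tokens are never split by the WordPiece algorithm.
tokenizer.add_special_tokens({'additional_special_tokens': ['[QUERY]', '[ANSWER]']})
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize('[QUERY] I want to ask a question.'))
# ['[QUERY]', 'i', 'want', 'to', 'ask', 'a', 'question', '.']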
8 votes · 6 answers · 6k views
Problem with inputs when building a model with TFBertModel and AutoTokenizer from HuggingFace's transformers
I'm trying to build the model illustrated in this picture:
I obtained a pre-trained BERT and respective tokenizer from HuggingFace's transformers in the following way:
from transformers import ...
7 votes · 1 answer · 8k views
max_seq_length for transformer (Sentence-BERT)
I'm using sentence-BERT from Huggingface in the following way:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
model.max_seq_length = 512
model....
7 votes · 1 answer · 14k views
How does padding in the huggingface tokenizer work?
I tried the following tokenization example:
tokenizer = BertTokenizer.from_pretrained(MODEL_TYPE, do_lower_case=True)
sent = "I hate this. Not that.",
_tokenized = tokenizer(sent, ...
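A small sketch of how the two padding modes behave, assuming bert-base-uncased:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
sents = ["I hate this.", "Not that."]

# padding=True pads to the longest sequence in this particular batch,
# while padding='max_length' would pad every sequence to max_length instead.
batch = tokenizer(sents, padding=True, return_tensors='pt')
print(batch['input_ids'])
print(batch['attention_mask'])  # zeros mark the padded positions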
6 votes · 2 answers · 9k views
How to untokenize BERT tokens?
I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word.
from transformers import BertTokenizer
tz = BertTokenizer.from_pretrained("...
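A brief sketch of going from ids back to tokens and then to text, assuming bert-base-uncased; decode and convert_tokens_to_string merge the '##' WordPiece continuations:

from transformers import BertTokenizer

tz = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tz("The quick brown fox jumps over the lazy dog")

tokens = tz.convert_ids_to_tokens(enc['input_ids'])
print(tokens)                                     # WordPiece tokens incl. [CLS]/[SEP]
print(tz.convert_tokens_to_string(tokens[1:-1]))  # drop [CLS]/[SEP], rejoin subwords
print(tz.decode(enc['input_ids'], skip_special_tokens=True))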
6 votes · 2 answers · 11k views
BERT get sentence embedding
I am replicating code from this page. I have downloaded the BERT model to my local system and am getting sentence embeddings.
I have around 500,000 sentences for which I need sentence embeddings, and it is ...
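One common way to turn BertModel outputs into sentence embeddings is mean pooling over the last hidden states; a minimal sketch assuming bert-base-uncased (for 500,000 sentences this would be run in batches, ideally on GPU):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

sentences = ["First example sentence.", "A second, slightly longer example sentence."]
enc = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    out = model(**enc)

# Mean-pool token embeddings, ignoring padded positions via the attention mask.
mask = enc['attention_mask'].unsqueeze(-1).float()
embeddings = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, 768])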
6 votes · 1 answer · 11k views
BertWordPieceTokenizer vs BertTokenizer from HuggingFace
I have the following pieces of code and trying to understand the difference between BertWordPieceTokenizer and BertTokenizer.
BertWordPieceTokenizer (Rust based)
from tokenizers import ...
6 votes · 3 answers · 9k views
Huggingface BERT Tokenizer add new token
I am using Huggingface BERT for an NLP task. My texts contain names of companies which are split up into subwords.
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
tokenizer....
4 votes · 3 answers · 5k views
How to apply max_length to truncate the token sequence from the left in a HuggingFace tokenizer?
In the HuggingFace tokenizer, applying the max_length argument specifies the length of the tokenized text. I believe it truncates the sequence to max_length-2 (if truncation=True) by cutting the ...
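In recent transformers versions the tokenizer exposes a truncation_side attribute, which is a plausible fit here; a minimal sketch assuming bert-base-uncased:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
tokenizer.truncation_side = 'left'   # drop tokens from the start instead of the end

enc = tokenizer("word " * 1000, max_length=32, truncation=True)
print(len(enc['input_ids']))  # 32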
4 votes · 1 answer · 12k views
HuggingFace Bert Sentiment analysis
I am getting the following error :
AssertionError: text input must of type str (single example), List[str] (batch or single pretokenized example) or List[List[str]] (batch of pretokenized examples)., ...
3 votes · 1 answer · 4k views
Fast and slow tokenizers yield different results
Using HuggingFace's pipeline tool, I was surprised to find that there was a significant difference in output when using the fast vs slow tokenizer.
Specifically, when I run the fill-mask pipeline, ...
3 votes · 1 answer · 9k views
resize_token_embeddings on a pretrained model with a different embedding size
I would like to ask about the way to change the embedding size of the trained model.
I have a trained model models/BERT-pretrain-1-step-5000.pkl.
Now I am adding a new token [TRA] to the tokenizer and ...
3 votes · 1 answer · 4k views
How to map token indices from the SQuAD data to tokens from BERT tokenizer?
I am using the SQuAD dataset for answer span selection. After using the BertTokenizer to tokenize the passages, for some samples the start and end indices of the answer don't match the real answer ...
3 votes · 1 answer · 1k views
BERT - Is that needed to add new tokens to be trained in a domain specific environment?
My question here is not how to add new tokens, or how to train using a domain-specific corpus; I'm already doing that.
The thing is, am I supposed to add the domain-specific tokens before the MLM ...
3 votes · 1 answer · 2k views
Using Hugging-face transformer with arguments in pipeline
I am working on using a transformers pipeline to get BERT embeddings for my input. Using it without a pipeline I am able to get constant outputs, but not with the pipeline, since I was not able to pass ...
3 votes · 2 answers · 933 views
BPE multiple ways to encode a word
With BPE or WordPiece there might be multiple ways to encode a word. For instance, assume (for simplicity) the token vocabulary contains all letters as well as the merged symbols ("to", ...
3 votes · 0 answers · 923 views
Create custom data_collator for Huggingface Trainer
I need to create a custom data_collator for finetuning with Huggingface Trainer API.
HuggingFace offers DataCollatorForWholeWordMask for masking whole words within the sentences with a given ...
3 votes · 0 answers · 3k views
How to get tokens to words in BERT tokenizer
I have a list; using the huggingface BERT tokenizer I can get the mapped numerical representation.
X = ['[CLS]', '[MASK]', 'love', 'this', '[SEP]']
tokens = tokenizer.convert_tokens_to_ids(X)
tokens: [...
3 votes · 3 answers · 4k views
How to save a tokenizer after training it?
I have just followed this tutorial on how to train my own tokenizer.
Now, from training my tokenizer, I have wrapped it inside a Transformers object, so that I can use it with the transformers library:...
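A minimal sketch of saving and reloading a trained tokenizers-library object once it is wrapped for transformers; the untrained WordPiece tokenizer and the directory name are placeholders for whatever the tutorial produces:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from transformers import PreTrainedTokenizerFast

# Placeholder for the tokenizer trained in the tutorial.
raw_tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

wrapped = PreTrainedTokenizerFast(tokenizer_object=raw_tokenizer, unk_token="[UNK]")
wrapped.save_pretrained('./my-tokenizer')    # writes tokenizer.json and config files
reloaded = PreTrainedTokenizerFast.from_pretrained('./my-tokenizer')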
2 votes · 2 answers · 3k views
Is there a way to get the location of the substring from which a certain token has been produced in BERT?
I am feeding sentences to a BERT model (Hugging Face library). These sentences get tokenized with a pretrained tokenizer. I know that you can use the decode function to go back from tokens to strings.
...
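With a fast tokenizer, return_offsets_mapping gives character spans back into the original string, which is one way to approach this; a small sketch assuming bert-base-uncased:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
text = "Huggingface tokenizers are handy."

enc = tokenizer(text, return_offsets_mapping=True)
tokens = tokenizer.convert_ids_to_tokens(enc['input_ids'])
for token, (start, end) in zip(tokens, enc['offset_mapping']):
    print(token, (start, end), repr(text[start:end]))
# Special tokens such as [CLS]/[SEP] map to the empty span (0, 0).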
2 votes · 1 answer · 720 views
BART Tokenizer tokenises same word differently?
I have noticed that if I tokenize a full text with many sentences, I sometimes get a different number of tokens than if I tokenise each sentence individually and add up the tokens. I have done some ...
2 votes · 1 answer · 984 views
Why does Transformer's BERT (for sequence classification) output depend heavily on maximum sequence length padding?
I am using Transformer's RobBERT (the Dutch version of RoBERTa) for sequence classification, trained for sentiment analysis on the Dutch Book Reviews dataset.
I wanted to test how well it works on a ...
2 votes · 3 answers · 2k views
Any reason to save a pretrained BERT tokenizer?
Say I am using tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True), and all I am doing with that tokenizer during fine-tuning of a new model is the standard tokenizer....
2 votes · 1 answer · 2k views
Are these normal speeds for BERT pretrained model inference in PyTorch?
I am testing the BERT base and BERT distilled models in Huggingface with 4 speed scenarios, batch_size = 1:
1) bert-base-uncased: 154ms per request
2) bert-base-uncased with quantization: 94ms per ...
2 votes · 1 answer · 3k views
Calculate precision, recall, f1 score for custom dataset for multiclass classification Huggingface library
I am trying to do multiclass classification for the sentence pair task. I uploaded my custom dataset of train and test separately in the hugging face data set and trained my model and tested it and ...
2 votes · 0 answers · 330 views
how to make BERT predict new token
my problem looks like this:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
fill_mask_pipeline_pre = pipeline("fill-...
2 votes · 0 answers · 422 views
Huggingface pre-trained model
I am trying to use the code below:
from transformers import AutoTokenizer, AutoModel
t = "ProsusAI/finbert"
tokenizer = AutoTokenizer.from_pretrained(t)
model = AutoModel.from_pretrained(t)
The ...
2 votes · 0 answers · 492 views
Train BERT model from scratch on a different language
First I create a tokenizer as follows:
from tokenizers import Tokenizer
from tokenizers.models import BPE,WordPiece
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
from tokenizers.trainers ...
2 votes · 0 answers · 1k views
huggingface pipeline: bert NER task throws RuntimeError: The size of tensor a (921) must match the size of tensor b (512) at non-singleton dimension 1
I am trying to set up a German NER model, pretrained with BERT, via the huggingface pipeline. For some texts the following code throws an error "RuntimeError: The size of tensor a (921) must match the size of ...
1 vote · 1 answer · 3k views
How to add tokens to vocab.txt which are decoded as [UNK] by the BERT tokenizer
I was decoding the tokenized tokens from the BERT tokenizer and it was giving [UNK] for the € symbol. I tried adding a ##€ token to the vocab.txt file, but it was not reflected; the prediction result was the same as ...
1 vote · 1 answer · 2k views
BERT tokenize URLs
I want to classify a bunch of tweets and therefore I'm using the huggingface implementation of BERT. However, I noticed that the default BertTokenizer does not use special tokens for URLs.
>>> ...
1 vote · 1 answer · 137 views
Truncating a training dataset so that it fits exactly within the context window
I have a dataset where the total number of tokens once tokenised is around 5000. I want to feed that into a BERT-style model, so I have to shrink it down to 512 tokens, but I want to rearrange the text to train it ...
1 vote · 2 answers · 2k views
How can we pass a list of strings to a fine tuned bert model?
I want to pass a list of strings instead of a single string input to my fine-tuned BERT question classification model.
This is my code, which accepts a single string input.
questionclassification_model ...
1 vote · 1 answer · 1k views
How truncation works when applying BERT tokenizer on the batch of sentence pairs in HuggingFace?
Say, I have three sample sentences:
s0 = "This model was pretrained using a specific normalization pipeline available here!"
s1 = "Thank to all the people around,"
s2 = "...
1 vote · 1 answer · 2k views
Encoding/tokenizing dataset dictionary (BERT/Huggingface)
I am trying to fine-tune my sentiment analysis model. Therefore, I have split my pandas DataFrame (a column with reviews, a column with sentiment scores) into train and test DataFrames and transformed ...
1 vote · 0 answers · 22 views
BERT MLM model fine-tuning bad results on new dataset
I'm trying to fine-tune an MLM model on a new kind of small dataset (train.csv 43285 lines, validation.csv 3597); my data looks like this:
text
בראשית ברא אלהים את השמים ואת הארץ
והארץ היתה תהו ובהו וחשך על ...
1 vote · 1 answer · 31 views
Why do Tokenizer and TokenizerFast give different results when encoding the same sentence?
error1
When I use the tokenizer to encode text with do_basic_tokenize=False, I get two different results.
But when I set do_basic_tokenize=True, the results are the same.
The text is 'LUXURY HOTEL ...
1 vote · 0 answers · 184 views
The tokenizer Doesn't recognize the new special tokens
When I run the code below, the tokenizer doesn't recognize the new special tokens that I added ([SP] and [EMPTY]). I want to tokenize Arabic text.
from tokenizers import BertWordPieceTokenizer
from ...
1 vote · 0 answers · 25 views
Training a Huggingface model without n_epochs
I would like to train a RobertaForMaskedLM in Huggingface from scratch.
However, I would like not to specify any stopping time, but to stop only when there is no more improvement in training. ...
1 vote · 1 answer · 351 views
Equivalent to tokenizer() in Transformers 2.5.0?
I am trying to convert the following code to work with Transformers 2.5.0. As written, it works in version 4.18.0, but not 2.5.0.
# Converting pretrained BERT classification model to regression model
#...
1 vote · 1 answer · 2k views
How to run Huggingface BERT tokenizer in offline mode?
While running the above code on my work laptop I'm getting the following error, but the same error does not occur when I run it on my personal laptop. I wanted to check whether there is a way to fix this SSL error?...
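Two approaches that usually work for offline use, sketched with a hypothetical local directory: pointing from_pretrained at a folder that already contains the tokenizer files, or forcing the library to rely on its local cache only:

from transformers import BertTokenizer

# Option 1: load from a directory that already holds vocab.txt,
# tokenizer_config.json, etc. (copied over from a machine with access).
tokenizer = BertTokenizer.from_pretrained('./bert-base-uncased-local')

# Option 2: keep the hub identifier but forbid any network calls,
# relying purely on the local cache.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', local_files_only=True)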
1 vote · 1 answer · 240 views
Optimize Albert HuggingFace model
Goal: Amend this Notebook to work with albert-base-v2 model
Kernel: conda_pytorch_p36.
Section 2.1 exports the finalised model. It too uses a BERT specific function. However, I cannot find an ...
1 vote · 0 answers · 854 views
how to extend a pretrained transformer model configured with small max_position_embeddings to a longer one
Suppose I want to use the existing pre-trained model
https://huggingface.co/Salesforce/grappa_large_jnt/
as the initial checkpoint for fine-tuning.
This grappa model has a max position embedding of 514 ...
1 vote · 0 answers · 126 views
DistilBERT Prediction Output - "TypeError: only size-1 arrays can be converted to Python scalars"
I am trying to apply a DistilBERT model to predict whether a sentence is a Claim, Premise or Non-Argumentative (3 outputs).
However, when I apply the model and want to create a prediction ...