
I'm trying to build a model for document classification. I'm using BERT with PyTorch.

I load the BERT model with the code below.

bert = AutoModel.from_pretrained('bert-base-uncased')
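
My tokenization code isn't shown here, but for context, here is a minimal sketch of how the inputs are assumed to be built (the max_length of 4000 is an assumption that matches the tensor shapes printed further below; my real preprocessing lives in a separate module, proc):

from transformers import AutoTokenizer

# sketch only: assumed preprocessing, not my exact code
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

encoded = tokenizer(
    texts,                    # list of raw document strings
    max_length=4000,          # assumption: pad/truncate everything to 4000 tokens
    padding='max_length',
    truncation=True,
    return_tensors='pt',
)
sent_id = encoded['input_ids']        # fed to the model as sent_id
mask = encoded['attention_mask']      # fed to the model as mask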

This is the code for training.

for epoch in range(epochs):
 
    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))

    #train model
    train_loss, _ = modhelper.train(proc.train_dataloader)

    #evaluate model
    valid_loss, _ = modhelper.evaluate()

    #save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(modhelper.model.state_dict(), 'saved_weights.pt')

    # append training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)

    print(f'\nTraining Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')
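
The variables referenced by this loop are set up just before it, roughly like this (a minimal sketch; the exact values are assumptions, except epochs, which matches the Epoch 1 / 1 output below):

# set up before the training loop (sketch)
best_valid_loss = float('inf')   # so the first epoch always saves a checkpoint
train_losses = []
valid_losses = []
epochs = 1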

This is my train method, accessible through the modhelper object.

def train(self, train_dataloader):
    self.model.train()
    total_loss, total_accuracy = 0, 0
    
    # empty list to save model predictions
    total_preds=[]
    
    # iterate over batches
    for step, batch in enumerate(train_dataloader):
        
        # progress update after every 50 batches.
        if step % 50 == 0 and not step == 0:
            print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(train_dataloader)))
        
        # push the batch to gpu
        #batch = [r.to(device) for r in batch]
        
        sent_id, mask, labels = batch
        
        # clear previously calculated gradients 
        self.model.zero_grad()        

        print(sent_id.size(), mask.size())
        # get model predictions for the current batch
        preds = self.model(sent_id, mask) #This line throws the error
        
        # compute the loss between actual and predicted values
        self.loss = self.cross_entropy(preds, labels)
        
        # add on to the total loss
        total_loss = total_loss + self.loss.item()
        
        # backward pass to calculate the gradients
        self.loss.backward()
        
        # clip the gradients to 1.0; this helps prevent the exploding-gradient problem
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
        
        # update parameters
        self.optimizer.step()
        
        # model predictions are stored on GPU. So, push it to CPU
        #preds=preds.detach().cpu().numpy()
        
        # append the model predictions
        total_preds.append(preds)
      
    # compute the training loss of the epoch
    avg_loss = total_loss / len(train_dataloader)
    
    # predictions are in the form of (no. of batches, size of batch, no. of classes).
    # reshape the predictions in form of (number of samples, no. of classes)
    total_preds  = np.concatenate(total_preds, axis=0)
      
    #returns the loss and predictions
    return avg_loss, total_preds
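
The train method relies on a few attributes of the helper object. For reference, a minimal sketch of how they are assumed to be wired together (the class name and hyperparameters here are placeholders, not my exact code):

import torch

class ModelHelper:
    def __init__(self, model):
        self.model = model                                               # the BERT-based classifier
        self.cross_entropy = torch.nn.CrossEntropyLoss()                 # assumption: plain cross-entropy loss
        self.optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # assumption: AdamW with a typical BERT learning rate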

The line preds = self.model(sent_id, mask) throws the following error (full traceback included).

 Epoch 1 / 1
torch.Size([32, 4000]) torch.Size([32, 4000])
Traceback (most recent call last):

File "<ipython-input-39-17211d5a107c>", line 8, in <module>
train_loss, _ = modhelper.train(proc.train_dataloader)

File "E:\BertTorch\model.py", line 71, in train
preds = self.model(sent_id, mask)

File "E:\BertTorch\venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "E:\BertTorch\model.py", line 181, in forward
#pass the inputs to the model

File "E:\BertTorch\venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "E:\BertTorch\venv\lib\site-packages\transformers\modeling_bert.py", line 837, in forward
embedding_output = self.embeddings(

File "E:\BertTorch\venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)

File "E:\BertTorch\venv\lib\site-packages\transformers\modeling_bert.py", line 201, in forward
embeddings = inputs_embeds + position_embeddings + token_type_embeddings

RuntimeError: The size of tensor a (4000) must match the size of tensor b (512) at non-singleton dimension 1

Note that I've printed the tensor sizes in the code: print(sent_id.size(), mask.size())

The output of that line of code is torch.Size([32, 4000]) torch.Size([32, 4000]).

As you can see, the two sizes match, yet it still throws the error. Please share your thoughts; I'd really appreciate it.

Please comment if you need further information; I'll add whatever is required quickly.

  • The error is thrown specifically at this line: embeddings = inputs_embeds + position_embeddings + token_type_embeddings. There's probably a shape mismatch between the three variables, hence the error. Nov 26, 2020 at 14:05
  • @planet_pluto I hope you checked the line showing the size of both tensors: torch.Size([32, 4000]) torch.Size([32, 4000]). Nov 26, 2020 at 14:18
  • @Venkatesh I know that the self.model() throws the error. But if you look carefully at the stack trace, you can find out where exactly during the forward pass of the model the error occurs. Nov 26, 2020 at 14:35
  • The BERT model you have loaded was trained to handle sequences of at most 512 tokens. You are providing a sequence of 4,000, and the model is telling you that it can't handle that. You can either use a different model (like Longformer) or a sliding-window approach; that depends on your task.
    – cronoik
    Nov 27, 2020 at 22:47
  • @cronoik, yes, that was the problem; I reduced the word count. Thank you, by the way. Nov 28, 2020 at 17:57

1 Answer


The issue is BERT's limit on sequence length. I passed sequences of 4,000 tokens, while the maximum supported is 512 (two of which go to the [CLS] and [SEP] tokens at the beginning and end of the sequence, so effectively only 510 remain). Reduce the word count, or use another model for your problem, such as Longformer, as suggested by @cronoik in the comments above.
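
For example, truncating at tokenization time keeps every sequence within BERT's limit (a minimal sketch; plug in your own texts and preprocessing):

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# pad/truncate every document to 512 tokens; [CLS] and [SEP] are added by the
# tokenizer and counted inside max_length
encoded = tokenizer(
    texts,                              # your list of documents
    max_length=512,
    padding='max_length',
    truncation=True,
    return_tensors='pt',
)
sent_id = encoded['input_ids']          # shape: (batch, 512)
mask = encoded['attention_mask']        # shape: (batch, 512)

# alternatively, switch to a long-context model such as Longformer
# longformer = AutoModel.from_pretrained('allenai/longformer-base-4096')  # handles up to 4096 tokens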

Thanks.

  • Does this issue apply to all BERT variants, such as RoBERTa and DeBERTa? If the limit is only 512 tokens, it means we lose information with longer texts :|
    – mah65
    Sep 28, 2021 at 15:55
  • You can also use chunking, as described here: huggingface.co/course/chapter7/6?fw=tf (a sketch of the idea is below).
    – S. P
    Mar 29, 2022 at 6:27
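
A minimal sketch of that chunking idea using a fast tokenizer's overflow support (the stride value here is just an example; see the linked course chapter for the full recipe):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# split one long document into overlapping 512-token chunks
encoded = tokenizer(
    long_text,
    max_length=512,
    padding='max_length',
    truncation=True,
    stride=128,                        # overlap between consecutive chunks
    return_overflowing_tokens=True,    # one row of input_ids per chunk
    return_tensors='pt',
)
chunk_ids = encoded['input_ids']       # shape: (num_chunks, 512)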
