
When using pre-trained BERT embeddings from PyTorch (which are then fine-tuned), should the text data fed into the model be pre-processed as in any standard NLP task?

For instance, should stemming, removal of low-frequency words, and de-capitalisation be performed, or should the raw text simply be passed to `transformers.BertTokenizer`?
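For concreteness, this is what I mean by passing the raw text straight through (a minimal sketch; the checkpoint name is just an example):

```python
from transformers import BertTokenizer

# Example checkpoint; could equally be a cased model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Raw, un-preprocessed sentence straight into the tokenizer
encoding = tokenizer(
    "The striped bats were hanging on their feet.",
    padding="max_length",
    truncation=True,
    max_length=32,
    return_tensors="pt",
)
print(encoding["input_ids"].shape)  # torch.Size([1, 32])
```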

3 Answers


I think preprocessing will not change your output predictions. I will try to explain each case you mentioned:

  1. Stemming or lemmatization: BERT uses WordPiece (a subword tokenizer in the byte-pair-encoding family) to shrink its vocab size, so words like running will ultimately be split into pieces such as run + ##ing. It's better not to convert running into run because, in some NLP problems, you need that information (see the tokenizer sketch after this list).
  2. De-capitalization: BERT provides two kinds of models (cased and uncased). The uncased one lowercases your sentence during tokenization, while the cased one leaves capitalization untouched. So you don't have to change anything here; just select the model for your use case.
  3. Removing high-frequency words: BERT uses the Transformer architecture, which works on the attention principle. So when you fine-tune it on any problem, it will attend to the words that impact the output and not to words that are common across all the data.
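As a quick sanity check, you can look at what the tokenizer does to raw text (a minimal sketch; the exact splits depend on the checkpoint's vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# WordPiece decides the splits; no manual stemming/lemmatization needed
print(tokenizer.tokenize("running"))     # e.g. ['running'] or ['run', '##ning'], depending on the vocab
print(tokenizer.tokenize("unrunnable"))  # a rare word falls back to subword pieces

# The uncased tokenizer lowercases for you, so manual de-capitalization is redundant
print(tokenizer.tokenize("Running"))     # same pieces as for "running"
```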

For the casing part, check the pretrained models:

[screenshot of the pretrained BERT model list, showing cased and uncased checkpoints]

Based on how they were trained, there are both cased and uncased BERT models available.
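For instance, the two variants tokenize the same raw sentence differently (a rough sketch; the actual pieces depend on each checkpoint's vocabulary):

```python
from transformers import BertTokenizer

cased = BertTokenizer.from_pretrained("bert-base-cased")
uncased = BertTokenizer.from_pretrained("bert-base-uncased")

sentence = "Apple hired a new CEO."
print(cased.tokenize(sentence))    # capitalization preserved, e.g. ['Apple', 'hired', ...]
print(uncased.tokenize(sentence))  # lowercased first, e.g. ['apple', 'hired', ...]
```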

BERT is usually trained on raw text, using its WordPiece tokenizer.

So no stemming, lemmatization, or similar NLP preprocessing is required.

Lemmatization uses morphological analysis to return the base form of a word, while stemming is the brute-force removal of word endings or affixes in general.
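To make the contrast concrete, here is a rough sketch (it assumes `nltk` and its WordNet data are installed; the BERT checkpoint name is just an example):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
from transformers import BertTokenizer

# import nltk; nltk.download("wordnet")  # one-time download for the lemmatizer

print(PorterStemmer().stem("studies"))                    # 'studi'  -> crude suffix stripping
print(WordNetLemmatizer().lemmatize("studies", pos="v"))  # 'study'  -> morphological base form

# BERT keeps the surface form and lets WordPiece split rare words into subwords instead
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("studies"))                      # e.g. ['studies']
```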


In most cases, feeding raw text works fine. Share sample data from your use case if you would like a more specific answer.
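For example, a minimal sketch of fine-tuning input preparation on completely raw sentences (dummy data and an illustrative checkpoint name):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Raw sentences: no stemming, stop-word removal, or manual lowercasing
texts = ["I loved this movie!", "Terrible pacing and a weak plot."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
print(outputs.loss, outputs.logits.shape)
```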
