
When using pre-trained BERT embeddings from PyTorch (which are then fine-tuned), should the text data fed into the model be pre-processed as in any standard NLP task?

For instance, should stemming, removal of low-frequency words, and de-capitalisation be performed, or should the raw text simply be passed to `transformers.BertTokenizer`?
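For concreteness, this is what I mean by passing the raw text straight through (a minimal sketch; the checkpoint name is just an example):

```python
from transformers import BertTokenizer

# Example checkpoint; could equally be a cased model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Raw, un-preprocessed sentence straight into the tokenizer
encoding = tokenizer(
    "The striped bats were hanging on their feet.",
    padding="max_length",
    truncation=True,
    max_length=32,
    return_tensors="pt",
)
print(encoding["input_ids"].shape)  # torch.Size([1, 32])
```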

3 Answers


I think preprocessing will not change your output predictions. I will try to explain each case you mentioned:

  1. Stemming or lemmatization: BERT uses WordPiece (a subword tokenizer in the byte-pair-encoding family) to shrink its vocab size, so words like running will ultimately be split into pieces such as run + ##ing. It's better not to convert running into run because, in some NLP problems, you need that information (see the tokenizer sketch after this list).
  2. De-capitalization: BERT provides two kinds of models (cased and uncased). The uncased one lowercases your sentence during tokenization, while the cased one leaves capitalization untouched. So you don't have to change anything here; just select the model for your use case.
  3. Removing high-frequency words: BERT uses the Transformer architecture, which works on the attention principle. So when you fine-tune it on any problem, it will attend to the words that impact the output and not to words that are common across all the data.
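As a quick sanity check, you can look at what the tokenizer does to raw text (a minimal sketch; the exact splits depend on the checkpoint's vocabulary):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# WordPiece decides the splits; no manual stemming/lemmatization needed
print(tokenizer.tokenize("running"))     # e.g. ['running'] or ['run', '##ning'], depending on the vocab
print(tokenizer.tokenize("unrunnable"))  # a rare word falls back to subword pieces

# The uncased tokenizer lowercases for you, so manual de-capitalization is redundant
print(tokenizer.tokenize("Running"))     # same pieces as for "running"
```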

For the casing part, check the pretrained models:

[screenshot of the pretrained BERT model list, showing cased and uncased checkpoints]

Based on how they were trained, there are both cased and uncased BERT models available.
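For instance, the two variants tokenize the same raw sentence differently (a rough sketch; the actual pieces depend on each checkpoint's vocabulary):

```python
from transformers import BertTokenizer

cased = BertTokenizer.from_pretrained("bert-base-cased")
uncased = BertTokenizer.from_pretrained("bert-base-uncased")

sentence = "Apple hired a new CEO."
print(cased.tokenize(sentence))    # capitalization preserved, e.g. ['Apple', 'hired', ...]
print(uncased.tokenize(sentence))  # lowercased first, e.g. ['apple', 'hired', ...]
```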

BERT is usually trained on raw text, using its WordPiece tokenizer.

So no stemming, lemmatization, or similar NLP preprocessing is required.

Lemmatization uses morphological analysis to return the base form of a word, while stemming is the brute-force removal of word endings or affixes in general.
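To make the contrast concrete, here is a rough sketch (it assumes `nltk` and its WordNet data are installed; the BERT checkpoint name is just an example):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
from transformers import BertTokenizer

# import nltk; nltk.download("wordnet")  # one-time download for the lemmatizer

print(PorterStemmer().stem("studies"))                    # 'studi'  -> crude suffix stripping
print(WordNetLemmatizer().lemmatize("studies", pos="v"))  # 'study'  -> morphological base form

# BERT keeps the surface form and lets WordPiece split rare words into subwords instead
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("studies"))                      # e.g. ['studies']
```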


In most cases, feeding raw text works fine. Share sample data from your use case if you would like a more specific answer.
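For example, a minimal sketch of fine-tuning input preparation on completely raw sentences (dummy data and an illustrative checkpoint name):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Raw sentences: no stemming, stop-word removal, or manual lowercasing
texts = ["I loved this movie!", "Terrible pacing and a weak plot."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
print(outputs.loss, outputs.logits.shape)
```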
