10

Are stopword removal, stemming, and lemmatization necessary for text classification when using spaCy, BERT, or other advanced NLP models to get vector embeddings of the text?

text="The food served in the wedding was very delicious"

1. Since spaCy and BERT were trained on huge raw datasets, is there any benefit to applying stopword removal, stemming, and lemmatization to the text before generating embeddings with BERT/spaCy for a text classification task?

2. I can understand that stopword removal, stemming, and lemmatization help when we use CountVectorizer or TfidfVectorizer to get sentence embeddings.

1
  • You can test whether stemming, lemmatization, and stopword removal help. They don't always. I usually do them if I'm going to graph the results, as the stopwords clutter up the output. Aug 28, 2020 at 14:05

4 Answers

15

You can test whether stemming, lemmatization, and stopword removal help. They don't always. I usually do them if I'm going to graph the results, as the stopwords clutter up the output.

A case for not removing stopwords: keeping them preserves context about the user's intent, which matters when you use a contextual model like BERT. In such models, all stopwords are kept to provide enough context information, including negation words (not, nor, never), which are often treated as stopwords.

According to https://arxiv.org/pdf/1904.07531.pdf

"Surprisingly, the stopwords received as much attention as non-stop words, but removing them has no effect in MRR performances."

4

With BERT you should not preprocess the text; otherwise you lose context (stemming, lemmatization) or change the text outright (stopword removal).

Some more basic models (rule-based or bag-of-words) would benefit from some processing, but you must be very careful with stop words removal: many words that change the meaning of an entire sentence are stop words (not, no, never, unless).
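To make the point about negation concrete, here is a minimal sketch using only the Python standard library. The `STOPWORDS` set and the `bag_of_words` helper are hypothetical names invented for this illustration, and the stopword list is hand-picked; real lists from NLTK, spaCy, or scikit-learn are longer and differ from one another. It takes the sentence from the question, builds a negated variant, and compares their bag-of-words counts with and without stopword removal.

```python
from collections import Counter

# Hypothetical, hand-picked stopword list for illustration only;
# real lists are longer and vary between libraries.
STOPWORDS = {"the", "was", "in", "a", "very", "not", "no", "never"}


def bag_of_words(text, remove_stopwords=False):
    """Lowercase the text, split on whitespace, and count the tokens."""
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return Counter(tokens)


positive = "The food served in the wedding was very delicious"
negative = "The food served in the wedding was not delicious"

# With stopwords kept, the two sentences get different representations.
differs_raw = bag_of_words(positive) != bag_of_words(negative)

# With stopwords removed, "not" disappears and the two sentences
# collapse to exactly the same bag of words.
same_filtered = bag_of_words(positive, True) == bag_of_words(negative, True)

print(differs_raw, same_filtered)
```

If the stopword list you use contains negations, as this one does, a classifier downstream has no way to tell the two sentences apart, so it is worth inspecting the list before applying it.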

2
  • Do not remove stopwords when they add information (context awareness) to the sentence, e.g., text summarization, machine translation, language modeling, question answering.

  • Remove stopwords if we only want the general idea of the sentence, e.g., sentiment analysis, language/text classification, spam filtering, caption generation, auto-tag generation, topic/document

1

It's not mandatory. Removing stopwords can sometimes help and sometimes not. You should try both.
