I'm currently working on a project involving sentence vectors (from a RoBERTa pretrained model). These vectors are lower quality when sentences are long, and my corpus contains many long sentences with subclauses.
I've been looking for methods for clause extraction / long sentence segmentation, but I was surprised to see that none of the major NLP packages (e.g., spacy or stanza) offer this out of the box.
I suppose this could be done by using either spacy or stanza's dependency parsing, but it would probably be quite complicated to handle all kinds of convoluted sentences and edge cases properly.
I've come across this implementation of the ClausIE information extraction system with spacy that does something similar, but it hasn't been updated and doesn't work on my machine.
I've also come across this repo for sentence simplification, but I get an annotation error from Stanford coreNLP when I run it locally.
Is there any obvious package/method that I've overlooked? If not, is there a simple way to implement this with stanza or spacy?
You can look for a `conj` or `ccomp` dependency and break on that; essentially just take the `.subtree` of those verbs. The last sentence is already split for you too. explosion.ai/demos/displacy
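The suggested approach can be sketched as follows. To keep the example self-contained (no spaCy model download), a toy `Tok` class stands in for a parsed spaCy token; with spaCy itself you would read `token.dep_`, `token.head`, and `token.subtree` directly, and the example sentence and its parse are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Tok:
    i: int      # token index, like spaCy's token.i
    text: str
    head: int   # index of the syntactic head (the root points to itself)
    dep: str    # dependency label, like spaCy's token.dep_

# Dependency labels whose heads start a new clause.
CLAUSE_DEPS = {"conj", "ccomp"}

def subtree(tokens, root_i, stops=frozenset()):
    """Indices of root_i plus its descendants, not descending into stops."""
    kids = {}
    for t in tokens:
        if t.head != t.i:
            kids.setdefault(t.head, []).append(t.i)
    out, stack = [], [root_i]
    while stack:
        i = stack.pop()
        out.append(i)
        stack.extend(c for c in kids.get(i, ()) if c not in stops)
    return sorted(out)

def split_clauses(tokens):
    """One clause per conj/ccomp head, plus the remaining main clause."""
    clause_roots = [t.i for t in tokens if t.dep in CLAUSE_DEPS]
    sent_root = next(t.i for t in tokens if t.head == t.i)
    stops = set(clause_roots)
    pieces = []
    for r in [sent_root] + clause_roots:
        idxs = subtree(tokens, r, stops)
        pieces.append(" ".join(tokens[i].text for i in idxs))
    return pieces

# Hand-built parse of "He said that she left and he stayed."
sent = [
    Tok(0, "He", 1, "nsubj"), Tok(1, "said", 1, "ROOT"),
    Tok(2, "that", 4, "mark"), Tok(3, "she", 4, "nsubj"),
    Tok(4, "left", 1, "ccomp"), Tok(5, "and", 4, "cc"),
    Tok(6, "he", 7, "nsubj"), Tok(7, "stayed", 4, "conj"),
    Tok(8, ".", 1, "punct"),
]
print(split_clauses(sent))
# → ['He said .', 'that she left and', 'he stayed']
```

With a real spaCy pipeline the same idea is shorter: parse the text, collect tokens whose `dep_` is `conj` or `ccomp`, and join `token.subtree` for each; how to handle leftover coordinators/markers (the dangling "and" above) and deeper nesting is where the edge cases the question mentions come in.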