
I'm currently working on a project involving sentence vectors (from a RoBERTa pretrained model). These vectors are lower quality when sentences are long, and my corpus contains many long sentences with subclauses.

I've been looking for methods for clause extraction / long-sentence segmentation, but I was surprised to see that none of the major NLP packages (e.g., spaCy or Stanza) offer this out of the box.

I suppose this could be done using spaCy's or Stanza's dependency parsing, but it would probably be quite complicated to handle all kinds of convoluted sentences and edge cases properly.

I've come across this implementation of the ClausIE information extraction system with spaCy that does something similar, but it hasn't been updated and doesn't work on my machine.

I've also come across this repo for sentence simplification, but I get an annotation error from Stanford CoreNLP when I run it locally.

Is there any obvious package/method that I've overlooked? If not, is there a simple way to implement this with stanza or spacy?

  • Can you give examples of your long sentences and how you expect them to be split? This shouldn't be hard with dependency parsing, but it depends on the kinds of sentences you have: are they like grocery lists, do they have semicolons, are they like stories with multiple verbs, or something else?
    – polm23
    Dec 10, 2020 at 12:23
  • This is an example: "This all encompassing experience wore off for a moment and in that moment, my awareness came gasping to the surface of the hallucination and I was able to consider momentarily that I had killed myself by taking an outrageous dose of an online drug and this was the most pathetic death experience of all time." Dec 10, 2020 at 16:13
  • I expect it to split as follows: - "This all encompassing experience wore off for a moment" - "in that moment, my awareness came gasping to the surface of the hallucination" - "I was able to consider momentarily that I had killed myself by taking an outrageous dose of an online drug" - "this was the most pathetic death experience of all time." Dec 10, 2020 at 16:14
  • I shoved your sentence into displaCy (explosion.ai/demos/displacy). Looking at the parse, you can see that you can break sentences by finding verbs with a conj or ccomp dependency and splitting there; essentially just take the .subtree of those verbs. The last sentence is already split for you too.
    – polm23
    Dec 10, 2020 at 16:20
  • Thanks! Would you be able to help me write the code for that? I'm not sure how to proceed. Dec 14, 2020 at 14:58

1 Answer


Here is code that works on your specific example. Expanding this to cover all cases is not simple, but it can be approached incrementally on an as-needed basis.

import spacy
import deplacy  # optional, for visualizing the dependency parse

en = spacy.load('en_core_web_sm')

text = "This all encompassing experience wore off for a moment and in that moment, my awareness came gasping to the surface of the hallucination and I was able to consider momentarily that I had killed myself by taking an outrageous dose of an online drug and this was the most pathetic death experience of all time."

doc = en(text)
# deplacy.render(doc)

seen = set()  # keep track of words already assigned to a chunk
chunks = []
for sent in doc.sents:
    # each conjoined clause hangs off the sentence root via a 'conj' dependency
    heads = [cc for cc in sent.root.children if cc.dep_ == 'conj']

    for head in heads:
        words = list(head.subtree)
        seen.update(words)
        chunk = ' '.join(ww.text for ww in words)
        chunks.append((head.i, chunk))

    # whatever is left over belongs to the root's own clause
    unseen = [ww for ww in sent if ww not in seen]
    chunk = ' '.join(ww.text for ww in unseen)
    chunks.append((sent.root.i, chunk))

# restore original clause order using the index of each clause's head token
chunks = sorted(chunks, key=lambda x: x[0])

for ii, chunk in chunks:
    print(chunk)

deplacy is optional but I find it useful for visualizing dependencies.

Also, I see you express surprise that this is not a built-in feature of common NLP libraries. The reason for that is simple: most applications don't need it, and while it seems like a simple task, it ends up being really complicated and application-specific the more cases you try to cover. On the other hand, for any specific application, like the example above, it's relatively easy to hack together a good-enough solution.

  • Thanks a lot! As you suspected, this didn't quite work for other complex sentences. However, I found a pretty robust rule-based solution in this paper. I will paste it below: Dec 16, 2020 at 20:44
  • Given a complex sentence, the model runs the following processes once each: 1. Wh Handling. Using semantic role labeling, the model looks for a Relational Argument (R-ARG) and the Subject Argument (asserted to be the ARG preceding the R-ARG). Then, a split is made with the Relational Argument replaced by the Subject Argument. Dec 16, 2020 at 20:45
  • 2. Conjunction Handling. The model looks for the word “and”. Using semantic role labeling, if the word following “and” is an argument (ARG), the model asserts that “and” is followed by a sentence, and a split is made. Or, if the word following “and” is a verb (V), the model asserts the Subject Argument to be the ARG preceding the V; a split is made with “and” replaced by the Subject Argument. Dec 16, 2020 at 20:45
  • 3. Insertion Handling. Using dependency parsing, the model looks for a node of type participle modifier, relative clause modifier, prepositional modifier, adjective modifier, or appositional modifier. The clause with that node as its root is extracted, prepended with the subject, and split off as a new simple sentence. The rest of the original complex sentence is split off as another new simple sentence. Dec 16, 2020 at 20:45
  • Implementing a paper is way beyond the scope of a Stack Overflow answer, but if you want to hire me, contact info is in my profile.
    – polm23
    Dec 17, 2020 at 3:27
