Segmenting sentences into subsentences with CoreNLP - stanford-nlp

I am working on the following problem: I would like to split sentences into subsentences using Stanford CoreNLP. The example sentence could be:
"Richard is working with CoreNLP, but does not really understand what he is doing"
I would now like my sentence to be split into single "S" constituents, as shown in the tree diagram below.
I would like the output to be a list with the single "S" as follows:
['Richard is working with CoreNLP', ', but', 'does not really understand what', 'he is doing']
I would be really thankful for any help :)

I suspect the tool you're looking for is Tregex, described in more detail in the PowerPoint here or in the Javadoc of the class itself.
In your case, I believe the pattern you're looking for is simply S. So, something like:
tregex.sh "S" <path_to_file>
where the file is a Penn Treebank formatted tree -- that is, something like (ROOT (S (NP (NNS dogs)) (VP (VB chase) (NP (NNS cats))))).
As an aside: I believe the fragment ", but" is not actually a sentence, as you've highlighted in the figure. Rather, the node you've highlighted subsumes the whole sentence "Richard is working with CoreNLP, but does not really understand what he is doing". Tregex would then print out this whole sentence as one of the matches. Similarly, "does not really understand what" is not a sentence unless it subsumes the entire SBAR: "does not really understand what he is doing".
If you want just the "leaf" sentences (i.e., a sentence that's not subsumed by another sentence), you can try a pattern more like:
S !>> S
Note: I haven't tested the patterns -- use at your own risk!
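So, to print only the leaf sentences of the same kind of tree file, the call would presumably be:
tregex.sh "S !>> S" <path_to_file>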

OK, I found that one can do this as follows:
import requests

# assumes a CoreNLP server with the tregex endpoint is running on localhost:9000
url = "http://localhost:9000/tregex"
request_params = {"pattern": "S"}
text = "Pusheen and Smitha walked along the beach."
r = requests.post(url, data=text, params=request_params)
print(r.json())
Does anybody know how to use other languages (I need German)?

Related

Algorithm to tell when we've processed a complex variable path expression while parsing?

I am working on a compiler for a homemade programming language, and I am stuck on how to convert the lexical token stream into a tree of commands for constructing a DOM-like tree. The "tree of commands" will still be a list, essentially emitting events in a way that describes how to create a tree from the partial information provided by the lexer. (This language is CoffeeScript-like in a way: indentation based, or like XML with a focus on indentation.)
I am stuck on how to tell when a variable path has been discovered. A variable path can be simple, or complex, as these examples demonstrate:
foo
foo.bar
foo.bar[baz].hello[and][goodday].there
this[is[even[more.complicated].wouldnt.you[say]]]
They could get more complicated still, if we handled dynamic interpolation of strings, such as:
foo[`bar${x}abc`].baz
But in my simple lang there are two relevant things: "paths" and "terms". Terms are anything /a-z/ for now, and paths chain together and nest, like the first examples.
For demonstration purposes, everything else is a simple "term" of 1 word, so you might have this:
abc foo.bar[baz].hello[and][goodday].there, one foo.bar
It forms a simple tree.
Right now I have a lexer which spits out the tokens, so basically:
abc
[SPACE]
foo
.
bar
[
baz
]
.
hello
[
and
]
[
goodday
]
.
there
,
[SPACE]
one
[SPACE]
foo
.
bar
That is at least how I broke it up initially.
So given that sequence of strings, how can you generate messages to tell the parser how to build a tree?
term
nest-down
term
period
term
open-square
and
close-square
...
That is the stream of tokens with a name now, but it is not a tree yet. I would like this:
term-start
term # value: abc
term-end
nest-down
term-path-start
term-start
term # value: foo
term-end
period
term-start
term # value: bar
term-end
term-nest-start
term-start
term # value: and
term-and
term-nest-end
...
I have been struggling with this example for several days now (boiled down from a complex real-world scenario). I can't seem to figure out how to keep track of all the information needed to decide when to say "this structure is done now, close it out". Wondering if you know how to get past this.
Note, I don't need the last tree to actually be a tree structure visually, I just need it to generate those messages which can be interpreted on the other end and used to construct a tree at runtime.
There is no way to construct a tree from a list without having a description of the tree in some form. Often, in relation to parsing, this description is given by a context-free grammar (CFG).
You then create a parser on the basis of this CFG. The lexical token stream is given as input to the parser, which organizes the lexical tokens into a tree using some parsing algorithm.
The parser emits commands for syntax tree construction based on the rules it uses during parsing. On entering a rule, a "rule X enter" command is emitted; on exiting a rule, an "exit rule X" command is emitted. When a lexical token is accepted, a "token forward" command is emitted with its lexeme characters. Some grammars, namely those in ABNF format, support repetitions of elements; depending on these repetitions, parts of the syntax tree might be represented as lists or arrays.
A builder module then receives these commands and builds a tree, or uses the commands directly for a specific task, in the style of the listener pattern.
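To make the enter/exit idea concrete for the path expressions in the question, here is a minimal recursive-descent sketch in Python (the event names and the emit callback are illustrative inventions, not a standard API). Knowing when to "close out" a structure falls out of the call structure: each rule emits an enter command, consumes its tokens, and emits an exit command when it returns:

def parse_term(tokens, i, emit):
    # a term is a single /a-z/ word
    emit("term", tokens[i])
    return i + 1

def parse_path(tokens, i, emit):
    # path := term ('.' term | '[' path ']')*
    emit("path-enter")
    i = parse_term(tokens, i, emit)
    while i < len(tokens):
        if tokens[i] == ".":
            emit("token", ".")
            i = parse_term(tokens, i + 1, emit)
        elif tokens[i] == "[":
            emit("nest-enter")
            i = parse_path(tokens, i + 1, emit)
            assert i < len(tokens) and tokens[i] == "]", "unbalanced brackets"
            emit("nest-exit")
            i += 1
        else:
            break  # token does not continue the path: the structure is done
    emit("path-exit")
    return i

events = []
parse_path(["foo", ".", "bar", "[", "baz", "]"], 0,
           lambda *event: events.append(event))
# events now holds: path-enter, term foo, token ., term bar,
# nest-enter, path-enter, term baz, path-exit, nest-exit, path-exit

A builder on the receiving end can then push a new node on each enter event, attach term/token events to the current node, and pop on each exit event.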
I have co-authored a paper (2021) describing a list of commands for building concrete/abstract syntax trees, depending on the CFG's structure, that are used in the parsers generated by the parser generator Tunnel Grammar Studio.
The paper is named "The Expressive Power of the Statically Typed Concrete Syntax Trees" and is in an open-access journal (intentionally). The commands are in section "4.3 Syntax Structure Construction Commands". The article is a bit "compressed" due to space limitations, and it is not really intended as a software development guide, but rather notes the approach taken. It might give you some ideas.
Another co-authored paper of mine from 2021, "A Parsing Machine Architecture Encapsulating Different Parsing Approaches" (also in an open-access journal), describes a general form of a parsing machine and its modules. Fig. 1 on p. 33 gives a quick overview.
Disclaimer: I have made the parser generator.

Count the number of sentences in a paragraph using Ruby

I have gotten to the point where I can split and count sentences with simple end-of-sentence punctuation like ! ? .
However, I need it to work for complex sentences such as:
"Learning Ruby is a great endeavor!!!! Well, it can be difficult at times..."
Here you can see the punctuation repeats itself.
What I have so far, that works with simple sentences:
def count_sentences
  sentence_array = self.split(/[.?!]/)
  return sentence_array.count
end
Thank you!
It's pretty easy to adapt your code to be a little more forgiving:
def count_sentences
  self.split(/[.?!]+/).count
end
There's no need for the intermediate variable or return.
Note that empty strings will also be caught up in this, so you may want to filter those out:
test = "This is junk! There's a space at the end! "
That would return 3 with your code. Here's a fix for that:
def count_sentences
  self.split(/[.?!]+/).grep(/\S/).count
end
That will select only those strings that have at least one non-space character.
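A quick check of that version against the earlier test string (assuming the method is defined on String, as in the question):
test = "This is junk! There's a space at the end! "
test.count_sentences
#=> 2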
class String
  def count_sentences
    scan(/[.!?]+(?=\s|\z)/).size
  end
end
str = "Learning Ruby is great!!!! The course cost $2.43... How much??!"
str.count_sentences
#=> 3
(?=\s|\z) is a positive lookahead, requiring the match to be immediately followed by a whitespace character or the end of the string.
String#count might be easiest.
"Who will treat me to a beer? I bet, alexnewby will!".count('.!?')
Compared to tadman's solution, no intermediate array needs to be constructed. However, it yields incorrect results if, for instance, a run of periods or exclamation marks is found in the string:
"Now thinking .... Ah, that's it! This is what we have to do!!!".count('.!?')
=> 8
The question therefore is: do you need absolute, exact results, or just approximate ones (which might be sufficient if this is used for statistical analysis of, say, large printed texts)? If you need exact results, you need to define what is a sentence and what is not. Think about the following text - how many sentences are in it?
Louise jumped out of the ground floor window.
"Stop! Don't run away!", cried Andy. "I did not
want to eat your chocolate; you have to believe
me!" - and, after thinking for a moment, he
added: "If you come back, I'll buy you a new
one! Large one! With hazelnuts!".
BTW, even tadman's solution is not exact. It would give a count of five for the following single sentence:
The IP address of Mr. Sloopsteen's dishwasher is 192.168.101.108!
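For instance, the abbreviated "Mr." and the dots inside the IP address are all treated as sentence boundaries:
"The IP address of Mr. Sloopsteen's dishwasher is 192.168.101.108!".split(/[.?!]+/).count
#=> 5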

Recognize phrases ending by "?" from a given text in Prolog

I'm going to write a program in Prolog in order to analyze a text and recognize the questions within it.
Given a text, the program has to recognize all sentences ending with a question mark and save them in a list. Then every element of that list (that is, each phrase ending with "?") will be analyzed and simplified to make sure it starts with a WH-question word.
Here an example:
"What is climate change?
The planet's climate has constantly been changing over geological time. [...]
What is the "greenhouse effect"?
The greenhouse effect refers to the way the Earth's atmosphere traps some of the energy from the Sun. [...].
The question is: how will these balance out? "
The list should contain: ["What is climate change?","What is the greenhouse effect ?", " how will these balance out?"]
Using split_string/4 I obtain this list:
L = ["What is climate change", "The planet's (...). What is the greenhouse effect" , "The greenhouse (...). The question is: how will these balance out?"]
I don't know how to analyze and further split each element of the list in order to obtain the first list I've shown you.
Can you help me, please? Thanks :)
I suggest feeding a DCG with the output of tokenize_atom:
?- tokenize_atom('What is climate change?', L).
L = ['What', is, climate, change, ?].
Then you can capture all the content between the literals 'What' and ?.
To accomplish the capture, library(dcg/basics) has string//1, which could help.
Example:
:- use_module(library(dcg/basics)).
wh_capture(P, Cs) :-
    tokenize_atom(P, Tks),
    phrase(wh_capture(Cs), Tks).

wh_capture([]) --> [].
wh_capture([C|Cs]) -->
    ['What'], string(Content), [?], {C=['What'|Content]},
    wh_capture(Cs).
wh_capture(Cs) --> string(_), [.], wh_capture(Cs).
Usage:
?- wh_capture('What about you? Phrase to skip. What now?',L).
L = [['What', about, you], ['What', now]]
string//1 has a peculiar behaviour... I would usually place a cut after the end-sequence delimiter, like:
wh_capture([C|Cs]) -->
    ['What'], string(Content), [?], {C=['What'|Content]},
    !, wh_capture(Cs).
Your approach is naive for any language (and this is a very deep subject), so don't try to reinvent the wheel (at least until you know what to reinvent). Google for a) parsing and then b) [Prolog] natural language processing.
Basically, before any further analysis, you need to tokenize first (in the sense that doing so spares you a million problems later).

Parsing s-expressions in Go

Here's a link to lis.py if you're unfamiliar: http://norvig.com/lispy.html
I'm trying to implement a tiny lisp interpreter in Go. I've been inspired by Peter Norvig's Lis.py lisp implementation in Python.
My problem is I can't think of a single somewhat efficient way to parse the s-expressions. I had thought of a counter that increments by 1 when it sees a "(" and decrements when it sees a ")", so that when the counter is 0 you know you've got a complete expression.
The problem with that is that you have to loop over the input for every single expression, which would make the interpreter incredibly slow for any large program.
Any alternative ideas would be great because I can't think of any better way.
There is an S-expression parser implemented in Go at Rosetta code:
S-expression parser in Go
It might give you an idea of how to attack the problem.
You'd probably need an interface "Sexpr" and ensure that your symbol and list data structures match the interface. Then you can use the fact that an S-expression is simply either a single symbol or a list of S-expressions.
That is, if the first character is "(", it's not a symbol but a list, so start accumulating a []Sexpr, reading one contained Sexpr at a time, until you hit a ")" in your input stream. Any contained list will already have had its terminal ")" consumed.
If it's not a "(", you're reading a symbol, so read until you hit a non-symbol-constituent character, unconsume it and return the symbol.
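Here is a minimal sketch of that recursive strategy (the type and function names are my own illustrative choices, not from any particular library). Because each call consumes exactly its own parentheses, the input is parsed in a single pass and no bracket counter is needed:

package main

import (
	"fmt"
	"strings"
)

type Sexpr interface{} // either a Symbol or a List
type Symbol string
type List []Sexpr

// skipSpace consumes whitespace, leaving the next non-space rune unread.
func skipSpace(r *strings.Reader) {
	for {
		c, _, err := r.ReadRune()
		if err != nil {
			return
		}
		if c != ' ' && c != '\t' && c != '\n' {
			r.UnreadRune()
			return
		}
	}
}

// parse reads one S-expression: a list if it starts with '(', else a symbol.
func parse(r *strings.Reader) (Sexpr, error) {
	skipSpace(r)
	c, _, err := r.ReadRune()
	if err != nil {
		return nil, err
	}
	if c == '(' {
		list := List{}
		for {
			skipSpace(r)
			c, _, err := r.ReadRune()
			if err != nil {
				return nil, fmt.Errorf("unterminated list")
			}
			if c == ')' { // the list's terminal ')' is consumed here
				return list, nil
			}
			r.UnreadRune()
			sub, err := parse(r)
			if err != nil {
				return nil, err
			}
			list = append(list, sub)
		}
	}
	// a symbol: read until a non-constituent character, then unconsume it
	var sb strings.Builder
	sb.WriteRune(c)
	for {
		c, _, err := r.ReadRune()
		if err != nil {
			break
		}
		if c == '(' || c == ')' || c == ' ' || c == '\t' || c == '\n' {
			r.UnreadRune()
			break
		}
		sb.WriteRune(c)
	}
	return Symbol(sb.String()), nil
}

func main() {
	expr, err := parse(strings.NewReader("(define (sq x) (* x x))"))
	fmt.Println(expr, err) // [define [sq x] [* x x]] <nil>
}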
In 2022, you can also test eigenhombre/l1, a small Lisp 1 written in Go by John Jacobsen.
It is presented in "(Yet Another) Lisp In Go".
It includes, in commit b3a84e1, parsing and tests for S-expressions:
func TestSexprStrings(T *testing.T) {
	var tests = []struct {
		input sexpr
		want  string
	}{
		{Nil, "()"},
		{Num(1), "1"},
		{Num("2"), "2"},
		{Cons(Num(1), Cons(Num("2"), Nil)), "(1 2)"},
		{Cons(Num(1), Cons(Num("2"), Cons(Num(3), Nil))), "(1 2 3)"},
		{Cons(
			Cons(
				Num(3),
				Cons(
					Num("1309875618907812098"),
					Nil)),
			Cons(Num(5), Cons(Num("6"), Nil))), "((3 1309875618907812098) 5 6)"},
	}
	// loop not in the original excerpt; assumes sexpr exposes a String()
	// method, as the test name suggests
	for _, test := range tests {
		if got := test.input.String(); got != test.want {
			T.Errorf("got %q, want %q", got, test.want)
		}
	}
}

list intersection, Prolog

OK, so there are basically three tasks this program must carry out:
Parse a sentence given in the form of a list; in this case (and throughout the example) the sentence will be [the,traitorous,tostig_godwinson,was,slain]. (It's history, don't ask!) So this would look like:
sentence(noun_phrase(det(the),np2(adj(traitorous),np2(noun(tostig_godwinson)))),verb_phrase(verb(slain),np(noun(slain)))).
Use the parsed sentence to extract the subject, verb and object, and output them as a list, e.g. [tostig_godwinson,was,slain] using the current example. I had this working too, until I attempted number 3.
Use the target list and compare it against a knowledge base to answer the question that was asked in the first place (see code below). Using this question and the knowledge base, the program would print out 'battle_of_Stamford_Bridge', as this is the entry in the knowledge base with the most matches to the list in question.
So here's where I am so far:
history('battle_of_Winwaed',[penda, king_of_mercia, was, slain, killed, oswui, king_of_bernicians, took_place, '15_November_1655']).
history('battle_of_Stamford_Bridge',[tostig_godwinson, herald_hardrada, was, slain, took_place, '25_September_1066']).
history('battle_of_Boroughbridge',[edwardII, defeated, earl_of_lancaster, execution, took_place, '16_march_1322']).
history('battle_of_Towton',[edwardIV, defeated, henryVI, palm_Sunday]).
history('battle_of_Wakefield',[richard_of_york, took_place, '30_December_1490', was, slain, war_of_the_roses]).
history('battle_of_Adwalton_Moor',[earl_of_newcastle, defeats, fairfax, took_place, '30_June_1643', battle, bradford, bloody]).
history('battle_of_Marston_Moor',[prince_rupert, marquis_of_newcastle, defeats, fairfax, oliver_cromwell, ironsides, took_place, '2_June_1644', bloody]).
noun(penda).
noun(king_of_mercia).
noun(oswui).
noun(king_of_bernicians).
noun('15_November_1655').
noun(tostig_godwinson).
noun(herald_hardrada).
noun('25_September_1066').
noun(edwardII).
noun(earl_of_lancaster).
noun('16_march_1322').
noun(edwardIV).
noun(henryVI).
noun(palm_Sunday).
noun(richard_of_york).
noun('30_December_1490').
noun(war_of_the_roses).
noun(earl_of_newcastle).
noun(fairfax).
noun('30_June_1643').
noun(bradford).
noun(prince_rupert).
noun(marquis_of_newcastle).
noun(fairfax).
noun(oliver_cromwell).
noun('2_June_1644').
noun(battle).
noun(slain).
noun(defeated).
noun(killed).
adj(bloody).
adj(traitorous).
verb(defeats).
verb(was).
det(a).
det(the).
prep(on).
best_match(Subject,Object,Verb):-
    history(X,Y),
    member(Subject,knowledgebase),
    member(Object,knowledgebase),
    member(Verb,knowledgebase),
    write(X),nl,
    fail.

micro_watson:-
    write('micro_watson: Please ask me a question:'), read(X),
    sentence(X,Sentence,Subject,Object,Verb),nl,write(Subject),nl,write(Verb),nl,write(Object).

sentence(Sentence,sentence(Noun_Phrase, Verb_Phrase),Subject,Object,Verb):-
    np(Sentence,Noun_Phrase,Rem),
    vp(Rem,Verb_Phrase),
    nl, write(sentence(Noun_Phrase,Verb_Phrase)),
    noun(Subject),
    member(Subject,Sentence),
    noun(Object),
    member(Object,Rem),
    verb(Verb),
    member(Verb,Rem),
    best_match(Subject,Object,Verb).

member(X,[X|_]).
member(X,[_|Tail]):-
    member(X,Tail).

np([X|T],np(det(X),NP2),Rem):-
    det(X),
    np2(T,NP2,Rem).
np(Sentence,Parse,Rem):- np2(Sentence,Parse,Rem).
np(Sentence,np(NP,PP),Rem):-
    np(Sentence,NP,Rem1),
    pp(Rem1,PP,Rem).

np2([H|T],np2(noun(H)),T):- noun(H).
np2([H|T],np2(adj(H),Rest),Rem):- adj(H), np2(T,Rest,Rem).

pp([H|T],pp(prep(H),Parse),Rem):-
    prep(H),
    np(T,Parse,Rem).

vp([H|[]],verb(H)):-
    verb(H).
vp([H|T],vp(verb(H),Rest)):-
    verb(H),
    pp(T, Rest,_).
vp([H|T],vp(verb(H),Rest)):-
    verb(H),
    np(T, Rest,_).
As I said, I had number 2 working until I tried number 3; now it just prints the parsed sentence and then gives me an 'Error: Out of local stack' message. Any help is greatly appreciated! At the top is the knowledge base against which we compare our list to find the best match; it is called (albeit incorrectly at this stage) by the best_match predicate, which executes immediately after the sentence predicate, which parses the sentence and extracts the keywords. Also, I apologise if the code is terribly laid out!
Cheers
I assume the person who posted this is never coming back, but I wanted to remind myself of some Prolog, so here it is.
There are two major issues with this code, apart from the fact that there are still some logical problems in some predicates.
Problem 1:
You ignored singleton warnings, and they are usually something not to be ignored. The best_match predicate should look like this:
best_match(Subject,Object,Verb):-
    history(X,Y),
    member(Subject,Y),
    member(Object,Y),
    member(Verb,Y),
    write(X),nl,
    fail.
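With Y actually used, a query now prints each matching battle as a side effect and then fails, because of the fail at the end of the loop (result worked out by hand against the knowledge base above, untested):

?- best_match(tostig_godwinson, slain, was).
battle_of_Stamford_Bridge
false.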
The other warning was about the Sentence variable in the sentence predicate, so it goes like this:
sentence(X,Subject,Object,Verb),nl,write(Subject),nl,write(Verb),nl,write(Object).

sentence(Sentence,Subject,Object,Verb):-
    np(Sentence,_,Rem),
    vp(Rem,_),
    nl,
    noun(Subject),
    member(Subject,Sentence),
    noun(Object),
    member(Object,Rem),
    verb(Verb),
    member(Verb,Rem),
    best_match(Subject,Object,Verb).
Problem 2:
I assume you divided the np logic into np and np2 to avoid infinite loops, but then forgot to apply this division just where it was necessary. The longest np clause should be:
np(Sentence,np(NP,PP),Rem):-
    np2(Sentence,NP,Rem1),
    pp(Rem1,PP,Rem).
If you really wanted to allow more complicated np there, which I doubt, you can do it like this:
np(Sentence,np(NP,PP),Rem):-
    append(List1,List2,Sentence),
    List1 \= [],
    List2 \= [],
    np(List1,NP,Rem1),
    append(Rem1,List2,Rem2),
    pp(Rem2,PP,Rem).
This way you will not end up calling np with the same arguments over and over again, because you make sure that the sentence checked is shorter each time.
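For instance, with the corrected clause in place, the noun phrase of the example sentence now parses in finite time (result worked out by hand, untested):

?- np([the,traitorous,tostig_godwinson], Parse, Rem).
Parse = np(det(the), np2(adj(traitorous), np2(noun(tostig_godwinson)))),
Rem = [].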
Minor issues:
(How the program works, after the infinite loop problem has been fixed)
The last vp is repeated
I am not sure about your grammar, and e.g. why "defeated" is a noun...
Just to check that the program works I used the sentence [edwardIV,defeated,henryVI,on,palm_Sunday].
I changed "defeated" to a verb, and also changed the last vp clause to:
vp([H|T],vp(verb(H),Rest)):-
    verb(H),
    np(T,_,Rest1),
    pp(Rest1, Rest,_).
For the example sentence I got battle_of_Boroughbridge and battle_of_Towton as results.
