How can I get a binary parse tree from the CoreNLP parser? - binary-tree

I need the binary parse tree of a sentence for my experiment, but both the Stanford Parser and the CoreNLP parser give me a non-binary tree. I tried adding the property "parse.binaryTrees": "true", but it didn't work. I also tried starting the server from the command line with "-binarize", and that failed too.
So how can I get a binary tree from the parser?
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP.properties -port 9000 -timeout 15000
nlp = StanfordCoreNLP(r'/home/lsl/stanford-corenlp-full-2018-10-05')
output = nlp.annotate(sentence, properties={'annotators': 'parse',
'parse.binaryTrees': 'true',
'outputFormat': 'json'})
I want to use python to solve this problem. Thank you all!

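from functools import reduce

from nltk import Tree


def binarize(tree):
    """
    Recursively turn a tree into a binary tree of nested tuples (node labels are dropped).
    """
    if isinstance(tree, str):
        # A leaf (word) is returned as-is.
        return tree
    elif len(tree) == 1:
        # Collapse unary nodes.
        return binarize(tree[0])
    else:
        # Fold the children pairwise from the left; the intermediate results are
        # plain tuples, so tree.label() must not be called on them.
        return reduce(lambda x, y: (binarize(x), binarize(y)), tree)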
t = Tree.fromstring("(ROOT (S (S (VP (VBD ليس) (ADVP (RB هناك)) (SBAR (WHNP (WP ما)) (S (VP (VBN يمنع) (PP (IN من) (NP (NN اخذ) (NP (DTNNS المكملات) (DTJJ الغذائية))))))))) (CC و) (S (NP (DTNNS البروتين)) (NP (NN شرط) (SBAR (SBAR (IN ان) (S (VP (VBP تكون) (ADJP (JJ مسجلة))))) (CC و) (SBAR (S (VP (VN موافق) (PP (IN على) (NP (PRP ها))) (PP (IN من) (NP (NP (NN وزارة) (NP (DTNN الصحة))) (CC او) (NP (NN مؤسسة) (NP (DTNN الغذاء) (CC و) (DTNN الدواء))) (CC و) (NP (NN دون) (NP (NP (NN مبالغة)) (PP (IN في) (NP (DT ذلك))))))))))))) (PUNC ;) (S (PP (PP (IN في) (NP (DTNN الاونة) (DTJJ الاخيرة))) (CC و) (PP (IN في) (NP (NOUN_QUANT معظم) (NP (DTNN البلدان) (DTJJ العربية))))) (VP (VBD اصبح) (VP (VBP يتسلل) (PP (IN ل) (NP (NN اسواق) (NP (PRP$ ها)))) (NP (NP (NN اصناف) (NP (NN غير) (JJ مرخصة))) (CC و) (NP (NN غير) (NP (NN مسموح)))) (PP (IN ب) (NP (PRP ها)))))) (CC و) (S (VP (VBN لوحظ) (NP (NOUN_QUANT بعض) (NP (DTNNS المضاعفات))) (PP (IN على) (NP (DTNN القلب) (CC و) (DTNN الكلى))))) (PUNC ;) (FRAG (VBP ينصح) (NP (NN عدم) (NP (NN اخذ) (NP (NP (NNS مكملات) (JJ غذائية)) (PP (PP (IN من) (NP (NN تحت) (NP (DTNN الطاولة)))) (CC او) (PP (IN من) (NP (NN خلال) (NP (NN صديق))))))))) (PUNC .)))")
bt = binarize(t)
print(bt)
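The reduce-based approach above drops the node labels. If the labels matter for the experiment, one possible alternative is sketched below using only the standard library. It parses the bracketed string into nested lists and folds extra children to the right under an intermediate node. The helper names (parse_bracketed, binarize_keep_labels, to_string) and the "@LABEL" naming for intermediate nodes are assumptions made for this illustration, not part of NLTK or CoreNLP.

```python
def parse_bracketed(s):
    """Parse '(S (NP (DT the)))' into nested lists [label, child, ...]; leaves are str."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok  # a leaf word
        node = [tokens[pos]]  # the node label follows the opening parenthesis
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # consume ")"
        return node

    return read()


def binarize_keep_labels(node):
    """Right-binarize a nested-list tree, keeping every original label."""
    if isinstance(node, str):
        return node
    label = node[0]
    children = [binarize_keep_labels(c) for c in node[1:]]
    while len(children) > 2:
        # Group the two rightmost children under a hypothetical "@LABEL" node.
        children = children[:-2] + [["@" + label, children[-2], children[-1]]]
    return [label] + children


def to_string(node):
    """Turn a nested-list tree back into a bracketed string."""
    if isinstance(node, str):
        return node
    return "(" + " ".join([node[0]] + [to_string(c) for c in node[1:]]) + ")"


tree = parse_bracketed("(NP (DT the) (JJ big) (JJ red) (NN dog))")
print(to_string(binarize_keep_labels(tree)))
# prints: (NP (DT the) (@NP (JJ big) (@NP (JJ red) (NN dog))))
```

Unary nodes are left untouched here; collapsing them, as the answer above does, would be a separate step.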

Related

How to get parse tree from CoreNLP server's returned string in python?

I'm using pycorenlp with the corenlp server. I can get the parse tree in the string format. But can I get it as a tree like the NLTK library?
from pycorenlp import StanfordCoreNLP
import pprint
import nltk
nlp = StanfordCoreNLP('http://localhost:9000')
text = ('Purgrug Vobter and Juklog Qligjar vruled into the Battlefield. Vobter was about to Hellfire. Juklog Qligjar started kiblaring.')
output = nlp.annotate(text, properties={
'annotators': 'tokenize,ssplit,pos,depparse,parse',
'outputFormat': 'json'
})
print [s['parse'] for s in output['sentences']]
Output:
[u'(ROOT\r\n (S\r\n (NP (NNP Purgrug) (NNP Vobter)\r\n (CC and)\r\n (NNP Juklog) (NNP Qligjar))\r\n (VP (VBD vruled)\r\n (PP (IN into)\r\n (NP (DT the) (NN Battlefield))))\r\n (. .)))', u'(ROOT\r\n (S\r\n (NP (NNP Vobter))\r\n (VP (VBD was)\r\n (ADJP (IN about)\r\n (PP (TO to)\r\n (NP (NNP Hellfire)))))\r\n (. .)))', u'(ROOT\r\n (S\r\n (NP (NNP Juklog) (NNP Qligjar))\r\n (VP (VBD started)\r\n (S\r\n (VP (VBG kiblaring))))\r\n (. .)))']
Import Tree from nltk:
from nltk.tree import *
Next, given
a = [u'(ROOT\r\n (S\r\n (NP (NNP Purgrug) (NNP Vobter)\r\n (CC and)\r\n (NNP Juklog) (NNP Qligjar))\r\n (VP (VBD vruled)\r\n (PP (IN into)\r\n (NP (DT the) (NN Battlefield))))\r\n (. .)))', u'(ROOT\r\n (S\r\n (NP (NNP Vobter))\r\n (VP (VBD was)\r\n (ADJP (IN about)\r\n (PP (TO to)\r\n (NP (NNP Hellfire)))))\r\n (. .)))', u'(ROOT\r\n (S\r\n (NP (NNP Juklog) (NNP Qligjar))\r\n (VP (VBD started)\r\n (S\r\n (VP (VBG kiblaring))))\r\n (. .)))']
Tree.fromstring(a[0]).pretty_print()
And that's all.

Extracting contents from the most inner parenthesis?

I want to convert this
(TOP (S (NP (NP (JJ Influential) (NNS members)) (PP (IN of) (NP (DT the) (NNP House) (NNP Ways) (CC and) (NNP Means) (NNP Committee)))) (VP (VBD introduced) (NP (NP (NN legislation)) (SBAR (WHNP (WDT that)) (S (VP (MD would) (VP (VB restrict) (SBAR (WHADVP (WRB how)) (S (NP (DT the) (JJ new) (NN savings-and-loan) (NN bailout) (NN agency)) (VP (MD can) (VP (VB raise) (NP (NN capital)))))) (, ,) (S (VP (VBG creating) (NP (NP (DT another) (JJ potential) (NN obstacle)) (PP (TO to) (NP (NP (NP (DT the) (NN government) (POS 's)) (NN sale)) (PP (IN of) (NP (JJ sick) (NNS thrifts)))))))))))))) (. .)))
(TOP (S (NP (DT The) (JJ interest-only) (NNS securities)) (VP (VBD were) (VP (VBN priced) (PP (IN at) (NP (QP (CD 35) (CD 1\/2)))) (S (VP (TO to) (VP (VB yield) (NP (CD 10.72) (NN %))))))) (. .)))
(TOP (S (NP (EX There)) (VP (VBD were) (NP (DT no) (JJ major) (NNP Eurobond) (CC or) (JJ foreign) (NN bond) (NNS offerings)) (PP (IN in) (NP (NNP Europe))) (NP (NNP Friday))) (. .)))
To the following sequence in which only the innermost open&close parenthesis pair of each scope is captured:
(JJ Influential) (NNS members) (IN of) (DT the) (NNP House) (NNP Ways) (CC and) (NNP Means) (NNP Committee) (VBD introduced) (NN legislation) (WDT that) (MD would) (VB restrict) (WRB how) (DT the) (JJ new) (NN savings-and-loan) (NN bailout) (NN agency) (MD can) (VB raise) (NN capital) (, ,) (VBG creating) (DT another) (JJ potential) (NN obstacle) (TO to) (DT the) (NN government) (POS 's) (NN sale) (IN of) (JJ sick) (NNS thrifts) (. .)
(DT The) (JJ interest-only) (NNS securities) (VBD were) (VBN priced) (IN at) (CD 35) (CD 1\/2) (TO to) (VB yield) (CD 10.72) (NN %) (. .)
(EX There) (VBD were) (DT no) (JJ major) (NNP Eurobond) (CC or) (JJ foreign) (NN bond) (NNS offerings) (IN in) (NNP Europe) (NNP Friday) (. .)
You can print the line numbers of the matches and let awk join the lines:
$ grep -oPn "\([^()]*\)" line |
awk -F: 'p==$1{a=a OFS $2} p!=$1{if(NR>1)print a;a=$2;p=$1} END{print a}'
(JJ Influential) (NNS members) (IN of) (DT the) (NNP House) (NNP Ways) (CC and) (NNP Means) (NNP Committee) (VBD introduced) (NN legislation) (WDT that) (MD would) (VB restrict) (WRB how) (DT the) (JJ new) (NN savings-and-loan) (NN bailout) (NN agency) (MD can) (VB raise) (NN capital) (, ,) (VBG creating) (DT another) (JJ potential) (NN obstacle) (TO to) (DT the) (NN government) (POS 's) (NN sale) (IN of) (JJ sick) (NNS thrifts) (. .)
Look for sets of parentheses that don't contain other parentheses inside.
egrep -o '\([^()]*\)'
To keep the results on the same line, you could do:
while read line; do
egrep -o '\([^()]*\)' <<< "$line" | tr '\n' ' '
echo
done
Or using Perl:
perl -e 'while(<>) { my @m = $_ =~ /\([^()]*\)/g; print "@m\n" }'
(There must be a simpler way, but I'm drawing a blank.)
With GNU awk for FPAT all you need is:
awk -v FPAT='[(][^()]*[)]' '{$1=$1}1' file
e.g.:
$ awk -v FPAT='[(][^()]*[)]' '{$1=$1}1' file
(JJ Influential) (NNS members) (IN of) (DT the) (NNP House) (NNP Ways) (CC and) (NNP Means) (NNP Committee) (VBD introduced) (NN legislation) (WDT that) (MD would) (VB restrict) (WRB how) (DT the) (JJ new) (NN savings-and-loan) (NN bailout) (NN agency) (MD can) (VB raise) (NN capital) (, ,) (VBG creating) (DT another) (JJ potential) (NN obstacle) (TO to) (DT the) (NN government) (POS 's) (NN sale) (IN of) (JJ sick) (NNS thrifts) (. .)
(DT The) (JJ interest-only) (NNS securities) (VBD were) (VBN priced) (IN at) (CD 35) (CD 1\/2) (TO to) (VB yield) (CD 10.72) (NN %) (. .)
(EX There) (VBD were) (DT no) (JJ major) (NNP Eurobond) (CC or) (JJ foreign) (NN bond) (NNS offerings) (IN in) (NNP Europe) (NNP Friday) (. .)
With other awks it'd just be a while(match()) loop:
$ awk '{r=""; while (match($0,/[(][^()]*[)]/)) {r=r (r?OFS:"") substr($0,RSTART,RLENGTH); $0=substr($0,RSTART+RLENGTH)} print r}' file
(JJ Influential) (NNS members) (IN of) (DT the) (NNP House) (NNP Ways) (CC and) (NNP Means) (NNP Committee) (VBD introduced) (NN legislation) (WDT that) (MD would) (VB restrict) (WRB how) (DT the) (JJ new) (NN savings-and-loan) (NN bailout) (NN agency) (MD can) (VB raise) (NN capital) (, ,) (VBG creating) (DT another) (JJ potential) (NN obstacle) (TO to) (DT the) (NN government) (POS 's) (NN sale) (IN of) (JJ sick) (NNS thrifts) (. .)
(DT The) (JJ interest-only) (NNS securities) (VBD were) (VBN priced) (IN at) (CD 35) (CD 1\/2) (TO to) (VB yield) (CD 10.72) (NN %) (. .)
(EX There) (VBD were) (DT no) (JJ major) (NNP Eurobond) (CC or) (JJ foreign) (NN bond) (NNS offerings) (IN in) (NNP Europe) (NNP Friday) (. .)
You could also put a placeholder in for the newlines, then delete the grep-induced newlines and switch the placeholder back with sed:
sed 's/.$/&_NL/g' file | grep -oP "\([^()]*\)" | tr -d '\n' | sed 's/_NL/\n/g'
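For anyone who would rather do this in Python, the same pattern the shell answers use (a parenthesis pair with no parentheses inside) can be sketched with the standard re module. The function name innermost_groups is just an illustrative choice.

```python
import re

# A "(" followed by any run of non-parenthesis characters and a ")".
INNERMOST = re.compile(r"\([^()]*\)")


def innermost_groups(line):
    """Return the innermost parenthesised groups of one line, joined by spaces."""
    return " ".join(INNERMOST.findall(line))


sample = "(TOP (S (NP (EX There)) (VP (VBD were) (NP (NNP Friday))) (. .)))"
print(innermost_groups(sample))
# prints: (EX There) (VBD were) (NNP Friday) (. .)
```

Calling it once per input line keeps each sentence's groups on their own line, like the awk versions above.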

Regex to extract Noun Phrases from a Part of Speech parse tree?

I am trying to extract all three word noun phrases from a Stanford POS Parse Tree. Basically, anything that looks like:
(NP (TAG WORD) (TAG WORD) (TAG WORD))
Or:
(NP (TAG WORD) (TAG (TAG WORD) (TAG WORD)))
This is what a parse tree can look like:
(ROOT (SQ (VBZ Is) (NP (DT this)) (NP (DT an) (NN asthma) (NN attack)) (. ?)))
When I do this regex, it extracts the correct 3 word noun phrase:
threeWordNounPhrases = full.scan(/\(NP \([^()]+ [^()]+\) \([^()]+ [^()]+\) \([^()]+ [^()]+\)\)/)
# => "(NP (DT an) (NN asthma) (NN attack))"
However, this does not work for something like:
(ROOT (SQ (NNP Should) (NP (PRP I)) (VP (VB watch) (NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones)))) ) (. ?)))
Which should return:
(NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones))))
Specifically for three words, it is possible, but not pretty. For N words, the complexity of the regexp rises. Note that this is just for fun (and regexp/Oniguruma education); in reality, I'd suggest going with what everyone else says: use a tree parsing library and manipulate the tree.
str = "(ROOT (SQ (NNP Should) (NP (PRP I)) (VP (VB watch) (NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones)))) ) (. ?)))"
re = /
(?<tag>
[A-Z]+
){0}
(?<word>
\( \g<tag> \s
(?:
[^()]+ |
\g<word>
)
\)
){0}
(?<word2>
\g<word> \s \g<word> |
\( \g<tag> \s \g<word2> \)
){0}
(?<word3>
\g<word> \s \g<word> \s \g<word> |
\g<word2> \s \g<word> |
\g<word> \s \g<word2> |
\( \g<tag> \s \g<word3> \)
){0}
\( NP \s \g<word3> \)
/x;
puts str[re]
# => (NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones))))
I don't see a way to use regular expressions unless you are able to consider all possible structures. What you did works for simple cases, but as you found, it fails with deeper, nested structures. I see two options:
1. From the point where you encounter (NP in the text, read in additional characters and keep a running tally of parentheses. Add one when you see (, subtract one when you see ). When the tally gets back to zero, you've reached the end of the NP.
2. Parse the tree using rubytree. Extract all subtrees that are dominated by a node with the label NP. Convert each subtree back to string form by concatenating its leaf nodes.
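The parenthesis-counting option can be sketched roughly as follows. This is in Python rather than Ruby, and the function names (subtrees_with_label, count_words) are made up for the illustration. It assumes a well-formed bracketed parse in which every word sits in its own (TAG WORD) pair.

```python
import re


def subtrees_with_label(parse, label="NP"):
    """Yield every bracketed subtree labelled `label`, found by counting parentheses."""
    opener = "(" + label + " "
    start = parse.find(opener)
    while start != -1:
        depth = 0
        for i in range(start, len(parse)):
            if parse[i] == "(":
                depth += 1
            elif parse[i] == ")":
                depth -= 1
                if depth == 0:  # back to zero: the subtree is complete
                    yield parse[start:i + 1]
                    break
        start = parse.find(opener, start + 1)


def count_words(subtree):
    """Count (TAG WORD) pairs, i.e. groups with no nested parentheses."""
    return len(re.findall(r"\([^()]*\)", subtree))


parse = ("(ROOT (SQ (NNP Should) (NP (PRP I)) (VP (VB watch) "
         "(NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones)))) ) (. ?)))")
three_word_nps = [t for t in subtrees_with_label(parse) if count_words(t) == 3]
print(three_word_nps)
# prints: ['(NP (NP (NNP Game)) (PP (IN of) (NP (NNP Thrones))))']
```

Nested NPs are found too, because the search restarts one character after each match rather than after the end of the matched subtree.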

Taking a name and adding it to a binary tree structure

A node in this tree is a list of three items (name left right), where name is a string, and left and right are the child trees. I feel like I have gotten off track. Is there an easy way to write this with just (define (insert name left right))?
(define tree
  (lambda (node word)
    (cond
      ((null? node) (make-tree word))
      ((string=? word (tree-word node))
       (set-tree-count! node (+ (tree-count node) 1))
       node)
      ((string<? word (tree-word node))
       (set-tree-left! node (tree (tree-left node) word))
       node)
      (else
       (set-tree-right! node (tree (tree-right node) word))
       node))))
There's no need to use mutation operations. In general we avoid them in Scheme, and in this case in particular it is easy (and recommended) to build a new tree as we go. And why the count? This problem has nothing to do with adding numbers. Also notice that the definition (insert name left right) doesn't make much sense: we want to insert a word starting from a tree's root node, so left and right aren't useful as parameters. Let's start again from scratch.
(define (insert node word)
  (cond ((null? node) (make-tree word '() '()))
        ((string=? word (tree-word node)) node)
        ((string>=? word (tree-word node))
         (make-tree (tree-word node)
                    (tree-left node)
                    (insert (tree-right node) word)))
        (else
         (make-tree (tree-word node)
                    (insert (tree-left node) word)
                    (tree-right node)))))
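For comparison, the same non-mutating insert can be sketched in Python. This assumes a node is a tuple (name, left, right) and an empty tree is None; both representations are choices made for this illustration, not something taken from the Scheme code.

```python
def insert(node, word):
    """Return a new tree containing `word`; the input tree is never modified."""
    if node is None:
        return (word, None, None)  # an empty spot becomes a new leaf
    name, left, right = node
    if word == name:
        return node  # already present, nothing to do
    if word > name:
        return (name, left, insert(right, word))  # rebuild along the right path
    return (name, insert(left, word), right)  # rebuild along the left path


tree = None
for w in ["m", "c", "x"]:
    tree = insert(tree, w)
print(tree)
# prints: ('m', ('c', None, None), ('x', None, None))
```

Only the nodes on the path from the root to the insertion point are rebuilt; the untouched subtrees are shared between the old tree and the new one.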

stanford core nlp: sentiment tree and pos tags

The sentiment annotation provides an annotated tree, with annotations attached to its nodes, that is used for predicting sentiment.
This tree is different from the parse tree, provided by the parse annotation.
For example, for this sentence:
I don't know half of you half as well as I should like; and I like less than half of you half as well as you deserve.
This is the parse tree:
(ROOT
(S
(S
(NP (PRP I))
(VP (VBP do) (RB n't)
(VP (VB know)
(NP
(NP (NN half))
(PP (IN of)
(NP (PRP you))
(ADVP (DT half) (RB as) (RB well))))
(SBAR (IN as)
(S
(NP (PRP I))
(VP (MD should)
(VP (VB like))))))))
(: ;)
(CC and)
(S
(NP (PRP I))
(VP (VBP like)
(NP
(NP
(QP (JJR less) (IN than) (NN half)))
(PP (IN of)
(NP (PRP you) (DT half))))
(ADVP
(ADVP (RB as) (RB well))
(SBAR (IN as)
(S
(NP (PRP you))
(VP (VBP deserve)))))))
(. .)))
This is the sentiment annotated tree:
(ROOT
(@S
(@S
(@S
(S (NP I)
(VP
(@VP (VBP do) (RB n't))
(VP
(@VP (VB know)
(NP (NP half)
(PP
(@PP (IN of) (NP you))
(ADVP (DT half)
(@ADVP (RB as) (RB well))))))
(SBAR (IN as)
(S (NP I)
(VP (MD should) (VP like)))))))
(: ;))
(CC and))
(S (NP I)
(VP
(@VP (VBP like)
(NP
(NP (JJR less)
(@QP (IN than) (NN half)))
(PP (IN of)
(NP (PRP you) (DT half)))))
(ADVP
(ADVP (RB as) (RB well))
(SBAR (IN as)
(S (NP you) (VP deserve)))))))
(. .))
Why are the trees slightly different? I think that verbs are verb phrases in the sentiment tree, nouns noun phrases etc.
What do these "@" mean?
How can I access the pos tags of the nodes, while iterating through the sentiment tree? Is it the value of the node?
If I want to get the pos tag of sentiment tree nodes from the parse tree, is searching for the parse tree node with the same yield(leaves) the only way? If the trees were the same, I could iterate through both of them for example.
What is the suggested way to get the pos tag of sentiment nodes?
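To make the "match by yield" idea from the question concrete, here is a rough sketch that indexes both trees by leaf span and looks up each sentiment node's span in the parse tree. It is plain Python and not a CoreNLP API. The helper names and the two short example trees are invented for the illustration, and it assumes both trees have exactly the same leaves in the same order.

```python
def parse_bracketed(s):
    """Parse a bracketed tree into nested lists [label, child, ...]; leaves are str."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok
        node = [tokens[pos]]
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # consume ")"
        return node

    return read()


def label_spans(node, start=0, out=None):
    """Map each (start, end) leaf span to the labels covering it, innermost first."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return start + 1, out
    end = start
    for child in node[1:]:
        end, _ = label_spans(child, end, out)
    out.setdefault((start, end), []).append(node[0])
    return end, out


# Made-up miniature trees in the style of the two outputs above.
parse = parse_bracketed("(ROOT (S (NP (PRP I)) (VP (VBP like) (NP (PRP you)))))")
sentiment = parse_bracketed("(ROOT (S (NP I) (VP (VBP like) (NP you))))")

_, parse_spans = label_spans(parse)
_, sentiment_spans = label_spans(sentiment)

# For every sentiment span that also exists in the parse tree, take the
# innermost parse label; for single-word spans that is the POS tag.
tags = {span: parse_spans[span][0] for span in sentiment_spans if span in parse_spans}
print(tags[(0, 1)], tags[(1, 2)], tags[(2, 3)])
# prints: PRP VBP PRP
```

Spans that exist only in the sentiment tree, such as those of intermediate binarization nodes, have no counterpart in the parse tree and are simply skipped by this lookup.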

Resources