Has anyone ever tried parsing out phrasal verbs with Stanford NLP?
The problem is with separable phrasal verbs, e.g. "climb up" and "do over": "We climbed that hill up." "I have to do this job over."
The first phrase looks like this in the parse tree:
(VP
  (VBD climbed)
  (ADVP (IN that)
    (NP (NN hill)))
  (ADVP (RB up)))
The second phrase:
(VB do)
(NP (DT this) (NN job))
(PP (IN over))
So it seems like reading the parse tree would be the right way, but how do I know that the verb is phrasal?
Dependency parsing, dude. Look at the prt (phrasal verb particle) dependency in both sentences. See the Stanford typed dependencies manual for more info.
nsubj(climbed-2, We-1)
root(ROOT-0, climbed-2)
det(hill-4, that-3)
dobj(climbed-2, hill-4)
prt(climbed-2, up-5)

nsubj(have-2, I-1)
root(ROOT-0, have-2)
aux(do-4, to-3)
xcomp(have-2, do-4)
det(job-6, this-5)
dobj(do-4, job-6)
prt(do-4, over-7)
The Stanford parser gives you very nice dependency parses. I have code for programmatically accessing these if you need it: https://gist.github.com/2562754
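If it helps, here's a minimal sketch of that kind of access using the py-corenlp wrapper against a local CoreNLP server; the server setup, port, and the compound:prt fallback (the particle relation's name in newer Universal Dependencies models) are my assumptions, not part of the gist:

# Minimal sketch: pull out phrasal-verb particles via py-corenlp.
# Assumes a CoreNLP server is already running locally, e.g.:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')
output = nlp.annotate(
    'We climbed that hill up. I have to do this job over.',
    properties={'annotators': 'tokenize,ssplit,pos,depparse',
                'outputFormat': 'json'})

for sentence in output['sentences']:
    for dep in sentence['basicDependencies']:
        # Older Stanford Dependencies models emit 'prt';
        # newer Universal Dependencies models emit 'compound:prt'.
        if dep['dep'] in ('prt', 'compound:prt'):
            print(dep['governorGloss'], dep['dependentGloss'])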
I am writing my first Prolog code, and I am having some difficulties with it. I was wondering if anyone could help me out.
I am writing a program that needs to follow these rules:
for verb phrases, noun phrases come before transitive verbs
subjects (nominative noun phrases) are followed by ga
direct objects (accusative noun phrases) are followed by o
It must be able to form these sentences from the words given in the code:
Adamu ga waraimasu (adam laughs)
Iivu ga nakimasu (eve cries)
Adamu ga Iivu o mimasu (adam watches eve)
Iivu ga Adamu o tetsudaimasu (eve helps adam)
Here is my code. It is mostly complete, except I don't know whether the rules in it are correct:
japanese([adamu], [nounphrase], [adam], [entity]).
japanese([iivu], [nounphrase], [eve], [entity]).
japanese([waraimasu], [verb, intransitive], [laughs], [property]).
japanese([nakimasu], [verb, intransitive], [cries], [property]).
japanese([mimasu], [verb, transitive], [watches], [relation]).
japanese([tetsudaimasu], [verb, transitive], [helps], [relation]).

japanese(A, [verbphrase], B, [property]) :-
    japanese(A, [verb, intransitive], B, [property]).

japanese(A, [nounphrase, accusative], B, [entity]) :-
    japanese(C, [nounphrase], B, [entity]),
    append([ga], C, A).

japanese(A, [verbphrase], B, [property]) :-
    japanese(C, [verb, transitive], D, [relation]),
    japanese(E, [nounphrase, accusative], F, [entity]),
    append(C, E, A),
    append(D, F, B).

japanese(A, [sentence], B, [proposition]) :-
    japanese(C, [nounphrase], D, [entity]),
    japanese(E, [verbphrase], F, [property]),
    append(E, C, A),
    append(F, D, B).
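For what it's worth, here is a sketch of one way the stated rules could be encoded: the particle is appended after its noun phrase, a separate accusative rule uses o, and the append argument orders put the object before the verb and the subject first. This is my reading of the rules above, not a verified answer; the lexicon facts and the intransitive rule are kept as in the question.

% Sketch: ga marks the nominative (subject) NP, o marks the accusative
% (object) NP, and the particle follows its noun phrase.
japanese(A, [nounphrase, nominative], B, [entity]) :-
    japanese(C, [nounphrase], B, [entity]),
    append(C, [ga], A).
japanese(A, [nounphrase, accusative], B, [entity]) :-
    japanese(C, [nounphrase], B, [entity]),
    append(C, [o], A).

% Transitive verb phrase: the accusative NP precedes the verb (SOV),
% while the English gloss keeps verb-object order.
japanese(A, [verbphrase], B, [property]) :-
    japanese(C, [verb, transitive], D, [relation]),
    japanese(E, [nounphrase, accusative], F, [entity]),
    append(E, C, A),
    append(D, F, B).

% Sentence: nominative NP first, then the verb phrase, in both languages.
japanese(A, [sentence], B, [proposition]) :-
    japanese(C, [nounphrase, nominative], D, [entity]),
    japanese(E, [verbphrase], F, [property]),
    append(C, E, A),
    append(D, F, B).

% Example query:
% ?- japanese(S, [sentence], [adam, watches, eve], [proposition]).
% S = [adamu, ga, iivu, o, mimasu]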
I'm using the latest version (3.8.0) of CoreNLP with the Python wrapper py-corenlp, and I've noticed an inconsistency between the output I get from CoreNLP when I annotate with the annotators tokenize, ssplit, pos, depparse, parse, and the output of the Online Demo. What is more, the Stanford Parser, whether I call it in my code or run it online, gives me the same results as CoreNLP.
For instance, I have the following question (borrowed from the Free917 question corpus):
at what institutions was Marshall Hall a professor
Using CoreNLP I get the following parsing:
(ROOT
  (SBAR
    (WHPP (IN at)
      (WHNP (WDT what)))
    (S
      (NP (NNS institutions))
      (VP (VBD was)
        (NP
          (NP (NNP Marshall) (NNP Hall))
          (NP (DT a) (NN professor)))))))
Same with Stanford's Parser:
[Tree('ROOT', [Tree('SBAR', [Tree('WHPP', [Tree('IN', ['at']), Tree('WHNP', [Tree('WP', ['what'])])]), Tree('S', [Tree('NP', [Tree('NNS', ['institutions'])]), Tree('VP', [Tree('VBD', ['was']), Tree('NP', [Tree('NP', [Tree('NNP', ['Marshall']), Tree('NNP', ['Hall'])]), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['professor'])])])])])])])]
The Online Demo gives the correct parse, though:
(screenshot: parse from the Online Demo)
How can I get the results I get using the Online Demo?
Thank you in advance!
The demo runs the shift-reduce parser, which is both faster and more accurate, at the expense of a [much] larger serialized model size. See https://nlp.stanford.edu/software/srparser.shtml
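For reference, here is a sketch of how to point the parse annotator at the shift-reduce model from py-corenlp; it assumes the separate SR model jar is on the server's classpath, and the model path is the standard English one from the page above:

from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')
output = nlp.annotate(
    'at what institutions was Marshall Hall a professor',
    properties={
        'annotators': 'tokenize,ssplit,pos,parse',
        # Use the shift-reduce constituency model instead of the default PCFG.
        'parse.model': 'edu/stanford/nlp/models/srparser/englishSR.ser.gz',
        'outputFormat': 'json',
    })
print(output['sentences'][0]['parse'])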
Given a sentence:
I had peanut butter and jelly sandwich and a cup of coffee for breakfast
I want to be able to extract the following food items from it:
peanut butter and jelly sandwich
coffee
So far, using POS tagging, I have been able to extract the individual food items, i.e.
peanut, butter, jelly, sandwich, coffee
But like I said, what I need is peanut butter and jelly sandwich instead of the individual items.
Is there some way of doing this without having a corpus or database of food items in the backend?
You can attempt this with a trained set that contains a corpus of food items, but the approach will work without one too.
Instead of doing simple POS tagging, do dependency parsing combined with POS tagging.
That way you will be able to find relations between the tokens of a phrase, and by walking the dependency tree under restricted conditions, such as noun-noun dependencies, you will be able to find the relevant chunk.
You can use spaCy for dependency parsing. Here is the output from displaCy:
https://demos.explosion.ai/displacy/?text=peanut%20butter%20and%20jelly%20sandwich%20is%20delicious&model=en&cpu=1&cph=1
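Programmatically, a minimal spaCy sketch of the same idea might look like this (en_core_web_sm is just the small English model; any English model with a parser works):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('peanut butter and jelly sandwich is delicious')

# Built-in noun chunks already merge compounds like 'peanut butter'.
print([chunk.text for chunk in doc.noun_chunks])

# Or walk the tree yourself, keeping noun-noun (compound) relations.
for token in doc:
    if token.dep_ == 'compound' and token.head.pos_ == 'NOUN':
        print(token.text, '->', token.head.text)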
You can use the freely available data at https://en.wikipedia.org/wiki/Lists_of_foods (or something better) as a training set to create a base set of food items (the hyperlinks in the crawled tree).
Based on the dependency parsing of your new data, you can keep enriching the base data. For example: if 'butter' exists in your corpus, and 'peanut butter' is a frequently encountered pair of tokens, then 'peanut' and 'peanut butter' also get added to the corpus.
The corpus can be maintained in a file which can be loaded in memory while processing, or in a database like Redis or Aerospike.
Make sure you work with normalized text, i.e. lower-cased, special characters cleaned, words lemmatized/stemmed, both in the corpus and in the data being processed. That will increase your coverage and accuracy.
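As a sketch of that normalization step (spaCy lemmas used here purely for illustration; any lemmatizer or stemmer works):

import spacy

nlp = spacy.load('en_core_web_sm')

def normalize(text):
    # Lower-case, drop punctuation, and lemmatize before corpus lookups.
    doc = nlp(text.lower())
    return ' '.join(tok.lemma_ for tok in doc if not tok.is_punct)

print(normalize('Peanut Butters,'))  # e.g. 'peanut butter'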
First, extract all noun phrases using NLTK's chunking (code copied from here):
import nltk

patterns = """
    NP: {<JJ>*<NN*>+}
        {<JJ>*<NN*><CC>*<NN*>+}
        {<NP><CC><NP>}
        {<RB><JJ>*<NN*>+}
    """
NPChunker = nltk.RegexpParser(patterns)

def prepare_text(input):
    sentences = nltk.sent_tokenize(input)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    sentences = [NPChunker.parse(sent) for sent in sentences]
    return sentences

def parsed_text_to_NP(sentences):
    nps = []
    for sent in sentences:
        tree = NPChunker.parse(sent)
        print(tree)
        for subtree in tree.subtrees():
            if subtree.label() == 'NP':
                t = subtree
                t = ' '.join(word for word, tag in t.leaves())
                nps.append(t)
    return nps

def sent_parse(input):
    sentences = prepare_text(input)
    nps = parsed_text_to_NP(sentences)
    return nps

if __name__ == '__main__':
    print(sent_parse('I ate peanut butter and beef burger and a cup of coffee for breakfast.'))
This POS-tags your sentences and uses a regex parser to extract noun phrases. A few things to improve:
1. Define and refine your noun phrase regex
You'll need to change the patterns regex to define and refine your noun phrases. For example, the rule {<NP><CC><NP>} tells the parser that an NP followed by a coordinator (CC) like "and" and another NP is itself an NP.
2. Change from the NLTK POS tagger to the Stanford POS tagger
I also noted that NLTK's POS tagger is not performing very well (e.g. it considers "had peanut" to be a verb phrase). You can change the POS tagger to the Stanford one if you want.
3. Remove smaller noun phrases
After you have extracted all the noun phrases for a sentence, you can remove the ones that are part of a bigger noun phrase. For example, in the example below, "beef burger" and "peanut butter" should be removed because they're part of the bigger noun phrase "peanut butter and beef burger" (see the sketch after the example output).
4. Remove noun phrases in which none of the words are in a food lexicon
You will get noun phrases like "school bus". If neither "school" nor "bus" is in a food lexicon that you can compile from Wikipedia or WordNet, then you remove the noun phrase. In this case, "cup" and "breakfast" get removed because they're hopefully not in your food lexicon.
The current code returns
['peanut butter and beef burger', 'peanut butter', 'beef burger', 'cup', 'coffee', 'breakfast']
for input
print(sent_parse('I ate peanut butter and beef burger and a cup of coffee for breakfast.'))
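A small helper for step 3 might look like this; the function name is mine, and note that plain substring matching ignores word boundaries (a stricter version would compare token lists):

def remove_sub_phrases(nps):
    # Keep only noun phrases that are not contained in a longer one,
    # scanning from longest to shortest.
    kept = []
    for np in sorted(set(nps), key=len, reverse=True):
        if not any(np in longer for longer in kept):
            kept.append(np)
    return kept

nps = ['peanut butter and beef burger', 'peanut butter', 'beef burger',
       'cup', 'coffee', 'breakfast']
print(remove_sub_phrases(nps))
# ['peanut butter and beef burger', 'breakfast', 'coffee', 'cup']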
Too much for a comment, but not really an answer:
I think you would at least get closer if, when you got two foods without a proper separator, you combined them into one food. That would give peanut butter, jelly sandwich, coffee.
If you have correct English you could detect this case by count/non-count distinctions. Correcting the original to "I had a peanut butter and jelly sandwich and a cup of coffee for breakfast": butter is non-count, so you can't have "a butter", but you can have "a sandwich". Thus the "a" must apply to "sandwich", and despite the "and", "peanut butter" and "jelly sandwich" must be the same item: "peanut butter and jelly sandwich". Your mistaken sentence would parse the other way, though!
I would be very surprised if you could come up with general rules that cover every case, though. I would approach this sort of thing figuring that a few cases will leak through and need a database to catch them.
You could search for n-grams in your text, varying the value of n. For example, with n = 5 you would extract "peanut butter and jelly sandwich" and "cup of coffee for breakfast", depending on where in the text you start your search for groups of five words. You won't need a text corpus or a database to make the algorithm work.
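For instance, a minimal n-gram enumerator (the function name is mine):

def ngrams(tokens, n):
    # All contiguous runs of n tokens.
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'I had peanut butter and jelly sandwich and a cup of coffee for breakfast'.split()
for n in (2, 5):
    print(n, ngrams(tokens, n))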
A rule-based approach with a lexicon of all food items would work here.
You can use GATE for this and write JAPE rules with it.
In the above example your JAPE rule would have a condition to find all (NP CC NP) && NP in "FOOD LEXICON".
I can share detailed JAPE code in the event you plan to go this route.
I just started using the Stanford Parser, but I do not understand the tags very well. This might be a stupid question, but can anyone tell me what the SBARQ and SQ tags represent, and where I can find a complete list of them? I know what the Penn Treebank tags look like, but these are slightly different.
Sentence: What is the highest waterfall in the United States ?
(ROOT
(SBARQ
(WHNP (WP What))
(SQ (VBZ is)
(NP
(NP (DT the) (JJS highest) (NN waterfall))
(PP (IN in)
(NP (DT the) (NNP United) (NNPS States)))))
(. ?)))
I have looked at the Stanford Parser website and read a few of the journal papers listed there, but there is no explanation of the tags mentioned above. I found a manual describing all the dependencies used, but it doesn't explain what I am looking for. Thanks!
This reference looks to have an extensive list - not sure if it is complete or not.
Specifically, it lists the ones you're asking about as:
SBARQ - Direct question introduced by a wh-word or a wh-phrase. Indirect
questions and relative clauses should be bracketed as SBAR, not SBARQ.
SQ - Inverted yes/no question, or main clause of a wh-question,
following the wh-phrase in SBARQ.
To see the entire list, just print the tagIndex of the parser:
LexicalizedParser lp = LexicalizedParser.loadModel();
System.out.println(lp.tagIndex); // print the tag index
My googling didn't turn up how to write a switch statement in an algorithm using the algorithm and algorithmic packages, but I'm assuming you can. Most guides just don't mention it either way.
\begin{algorithm}
\caption{send(...) method}
\begin{algorithmic}
\IF{dest equals..}
%\SWITCH{nature}
\STATE cast data...
\STATE extract data...
\STATE copy...
%\ENDSWITCH
\ELSE
\STATE match dest....
%\SWITCH{nature}
\STATE cast data...
\STATE extract data...
\STATE send...
%\ENDSWITCH
\ENDIF
\end{algorithmic}
\end{algorithm}
Thanks!
I wrote the following definitions in my LaTeX document, and they seem to work.
Just insert the lines below anywhere after your inclusion statement for the algorithmic package. To keep the algorithm presentation concise, I distinguish between compound cases and one-line cases: one-line cases begin with \CASELINE, while compound cases begin with \CASE and end with \ENDCASE. Similarly for the default statements.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% The following definitions extend the LaTeX algorithmic
%% package with SWITCH statements and one-line structures.
%% The extension is by
%% Prof. Farn Wang
%% Dept. of Electrical Engineering,
%% National Taiwan University.
%%
\makeatletter % ALC@g is an internal algorithmic environment
\newcommand{\SWITCH}[1]{\STATE \textbf{switch} (#1)}
\newcommand{\ENDSWITCH}{\STATE \textbf{end switch}}
\newcommand{\CASE}[1]{\STATE \textbf{case} #1\textbf{:} \begin{ALC@g}}
\newcommand{\ENDCASE}{\end{ALC@g}}
\newcommand{\CASELINE}[1]{\STATE \textbf{case} #1\textbf{:} }
\newcommand{\DEFAULT}{\STATE \textbf{default:} \begin{ALC@g}}
\newcommand{\ENDDEFAULT}{\end{ALC@g}}
\newcommand{\DEFAULTLINE}{\STATE \textbf{default:} }
\makeatother
%%
%% End of the LaTeX algorithmic package extension.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
You can try the following example.
\begin{algorithmic}
\SWITCH {$\theta$}
  \CASE {1}
    \STATE Hello
  \ENDCASE
  \CASELINE {2} Good-bye
  \DEFAULT
    \STATE Again ?
  \ENDDEFAULT
\ENDSWITCH
\end{algorithmic}
Farn Wang
Dept. of Electrical Eng.
National Taiwan University
If you have a look at the official documentation from CTAN on the algorithm package, you will notice that there is no default SWITCH-CASE statement. I assume that is the reason why so many guides leave it out ;)