Let's take the following"
# import for the stanza client; older releases shipped it as stanfordnlp.server.CoreNLPClient
from stanza.server import CoreNLPClient

client = CoreNLPClient(memory='1G', threads=1,
                       annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'depparse'],
                       timeout=1000)  # note: timeout is in milliseconds
ann = client.annotate('Wow a nice sentence here')
sentence = ann.sentence[0]
Then I process the tags, dependencies, etc., but I also want to use TokensRegex to extract specific words. I saw the answer using requests (here), but it seems odd to have to send another request (and do the tagging again) just to use TokensRegex. Can we use the already annotated sentence with TokensRegex instead?
Edit
I see that we can use client.tokensregex('Wow a nice sentence here', <pattern>), but I guess this still sends another request.
There is a tokensregex annotator that you can place at the end of your pipeline; it will run your rules as part of the same annotation pass.
See here: https://stanfordnlp.github.io/CoreNLP/tokensregex.html
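For example, here is a minimal sketch with the Python client, assuming a TokensRegex rules file named my_rules.rules that you supply (see the page above for the rule syntax):

from stanza.server import CoreNLPClient  # older releases: stanfordnlp.server

# 'my_rules.rules' is a hypothetical rules file path; adjust to your setup
with CoreNLPClient(
        memory='1G', threads=1,
        annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner',
                    'depparse', 'tokensregex'],
        properties={'tokensregex.rules': 'my_rules.rules'},
        timeout=15000) as client:
    ann = client.annotate('Wow a nice sentence here')
    sentence = ann.sentence[0]  # rule matches are applied in the same pass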
Related
Here is a pattern I've created:
({SeminarsList}|seminar)[s]((list|lists)|(info|information))[(?|.|!)]
I would expect the optional [s] to work both for the entity list and for the word, i.e. allowing an optional trailing s, as in seminar vs. seminars. However, only the entity list works as expected: the s for seminars is ignored, and the pattern isn't recognized for seminars info.
Is this a bug or expected behavior? I would rather it work the way the entity list does, since that makes perfect sense and matches what the documentation shows.
Update
Also, the word on its own, without being in a group, works as expected.
So, for example, this works:
where[(are|is)][the](SeminarsList|seminar)[[']s][seminar][[']s] [(location|locate|located)]
i.e. the second seminar, with optional punctuation, works as expected, just not in a grouping.
Update
Here is an example from the documentation
Select the OrgChart-Manager intent, then enter the following template utterances:
Template utterances
Who is {Employee} the subordinate of[?]
Who does {Employee} report to[?]
Who is {Employee}['s] manager[?]
Who does {Employee} directly report to[?]
Who is {Employee}['s] supervisor[?]
Who is the boss of {Employee}[?]
The above example from the documentation shows how this works, including adding optional "punctuation" to the end of the sentence. If that is expected to work, I would expect the other usage to work too.
Per the docs (emphasis mine):
Pattern syntax is a template for an utterance. The template should contain words and entities you want to match as well as words and punctuation you want to ignore. It is not a regular expression.
So the pattern syntax is not meant to be used for single letters, but for full words. It has to do with the tokenization of utterances.
If you'd like this feature added, I'd recommend upvoting this LUIS UserVoice ticket.
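Given that, one workaround (untested, purely illustrative) is to spell out the optional s as a full alternative word instead of a single letter:

({SeminarsList}|seminar|seminars)(list|lists|info|information)[?]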
Can anyone explain why the POS tags in a tregex response differ from the tags obtained from the constituency parse, as shown in the figure below?
In the above figure, engineer is tagged as NN by the constituency parse annotator, but tregex outputs it as NNP.
Is it because the annotator pipeline used to perform the constituency parse uses a different parse model than the pipeline used to perform tregex?
It appears different pipelines are being used.
When you run the standard annotation process, it will use the pipeline you specify, which in your example includes the pos annotator. Since the pos annotator's tags are being used, you are seeing NN.
When you submit a tregex request, it simply runs a pipeline with tokenize,ssplit,parse (you can see this in the code for StanfordCoreNLPServer.java, which has a specific tregex handler).
This means it is using the constituency parser's part-of-speech tagging, which produces a different result than the dedicated part-of-speech tagger; in this case the constituency parser applies the tag NNP. Note that if you use the shift-reduce parser, it requires the part-of-speech tags to be provided by the part-of-speech tagger, whereas the lexicalized parser can create its own part-of-speech tags.
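If you want to see the difference for yourself, here is a minimal sketch with the Python client (the sentence and pattern are illustrative; client.tregex hits the server's dedicated tregex handler):

from stanza.server import CoreNLPClient

with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'parse'],
                   memory='2G', timeout=30000) as client:
    text = 'the engineer fixed it'
    ann = client.annotate(text)
    print(ann.sentence[0].token[1].pos)  # tag chosen by the pos annotator
    # the tregex endpoint re-runs tokenize,ssplit,parse on the server,
    # so the constituency parser supplies the tags the pattern matches against
    print(client.tregex(text, 'NN|NNP'))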
The Stanford NLP POS tagger claims imperative verbs were added to a recent version. I've input lots of text with abundant and obvious imperatives, but there seems to be no tag for them in the output. Must one, after all, train it for this POS?
There is no special tag for imperatives, they are simply tagged as VB.
The info on the website refers to the fact that we added a bunch of manually annotated imperative sentences to our training data such that the POS tagger gets more of them right, i.e. tags the verb as VB.
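For example, a quick check with the Python client (the sentence is arbitrary; this assumes the standard English models):

from stanza.server import CoreNLPClient

with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos'],
                   memory='1G', timeout=30000) as client:
    ann = client.annotate('Close the door.')
    print([(t.word, t.pos) for t in ann.sentence[0].token])
    # the imperative 'Close' should come back tagged as plain VB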
I want to use gate-EN-twitter.model for POS tagging during parsing with the Stanford parser. Is there a command-line option that does that, like -pos.model gate-EN-twitter.model? Or do I have to run the Stanford POS tagger with the gate model first and then use its output as input to the parser?
Thanks!
If I understand you correctly, you want to force the Stanford Parser to use the tags generated by this Twitter-specific POS tagger. That's definitely possible, though this tweet from Stanford NLP about this exact model should serve as a warning:
Tweet from Stanford NLP, 13 Apr 2014:
Using CoreNLP on social media? Try GATE Twitter model (iff not parsing…) -pos.model gate-EN-twitter.model https://gate.ac.uk/wiki/twitter-postagger.html #nlproc
(https://twitter.com/stanfordnlp/status/455409761492549632)
That being said, if you really want to try, we can't stop you :)
There is a parser FAQ entry on forcing in your own tags. See http://nlp.stanford.edu/software/parser-faq.shtml#f
Basically, you have two options (see the FAQ for full details):
If calling the parser from the command line, you can pre-tag your text file and then alert the parser to the fact that the text is pre-tagged using some command-line options.
If parsing programmatically, the LexicalizedParser#parse method will accept any List<? extends HasTag> and treat the tags in that list as golden. Just pre-tag your list (using the CoreNLP pipeline or MaxentTagger) and pass on that token list to the parser.
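For the programmatic route, here is a rough Python sketch using NLTK's (now-deprecated) Stanford wrappers rather than the Java API; all jar and model paths below are placeholders for your local setup:

from nltk.tag.stanford import StanfordPOSTagger
from nltk.parse.stanford import StanfordParser

twitter_tagger = StanfordPOSTagger('gate-EN-twitter.model',
                                   path_to_jar='stanford-postagger.jar')
parser = StanfordParser(path_to_jar='stanford-parser.jar',
                        path_to_models_jar='stanford-parser-models.jar')

tokens = 'gonna watch the game 2nite'.split()
tagged = twitter_tagger.tag(tokens)        # pre-tag with the Twitter model
trees = list(parser.tagged_parse(tagged))  # parser treats these tags as given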
I haven't found anything in the documentation about adding more tagged words to the tagger, specifically the bi-directional one.
Thanks
At present, you can't. Model training is an all-at-one-time operation. (Since the tagger uses weights that take into account contexts and frequencies, it isn't trivial to add new words to it post hoc.)
There is a workaround. It is ugly but should do the trick:
build a list of "your" words
scan the text for these words
if any matches are found, do the POS tagging yourself (NLTK can help you here)
feed it to the Stanford parser (a sketch follows the quote below)
From http://www.cs.ucf.edu/courses/cap5636/fall2011/nltk.pdf:
"You can also give it POS tagged text; the parser will try to use your tags if they make sense. You might want to do this if the parser makes tagging mistakes in your text domain."
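A rough sketch of that workaround in Python (the custom lexicon and sentence are made up, and nltk.pos_tag stands in for whatever tagger you prefer):

import nltk  # assumes the punkt and averaged_perceptron_tagger data are installed

my_words = {'doge': 'NN', 'pwned': 'VBD'}  # hypothetical list of "your" words

tokens = nltk.word_tokenize('the doge pwned everyone')
tagged = nltk.pos_tag(tokens)  # baseline tagging
# override the tags for your words before handing the tokens to the parser
tagged = [(word, my_words.get(word.lower(), tag)) for word, tag in tagged]
# then feed `tagged` to the Stanford parser, e.g. via tagged_parse as shown earlier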