I am using Stanford CoreNLP to parse my sentences and it works surprisingly well. But I am wondering: since CoreNLP contains a probabilistic parser, how does the software deal with ambiguities?
"I saw the girl with the glasses".
(1) If I understand it the right way, CoreNLP prints the MOST probable tree. So there is no way to check whether there is an ambiguity, right?
(2) Does that actually mean that CoreNLP ignores syntactic ambiguities?
Yes, CoreNLP will pick one of the two interpretations and return that. It's important to note, though, that the "most probable tree" is the one that's syntactically most probable (i.e., most like trees it's seen in the training data), rather than most probable based on any sort of real-world knowledge. Chances are, "I ate the cake with a cherry" and "I ate the cake with a fork" will have the same parse.
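If you want to surface the ambiguity yourself, one option is to enumerate every parse of the sentence with a chart parser. Here is a minimal sketch using NLTK with a toy grammar of my own (not CoreNLP's model), which yields both attachments of the prepositional phrase; I believe CoreNLP's Java LexicalizedParser can also return k-best parses if you dig into its API.

```python
# Minimal sketch using NLTK (not CoreNLP): enumerate every parse an
# ambiguous sentence admits under a toy grammar of my own devising.
import nltk

grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    PP  -> P NP
    NP  -> Det N | Det N PP | 'I'
    VP  -> V NP | VP PP
    Det -> 'the'
    N   -> 'girl' | 'glasses'
    V   -> 'saw'
    P   -> 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "I saw the girl with the glasses".split()

# Two trees come out: PP attached to the NP ("the girl with the
# glasses") and PP attached to the VP ("saw ... with the glasses").
for tree in parser.parse(sentence):
    tree.pretty_print()
```

If more than one tree is printed, the sentence is syntactically ambiguous under the grammar; CoreNLP simply returns the highest-scoring one.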
Related
I noticed that the Stanford Parser tags "anyone" and "anybody" as nouns, whereas they are pronouns. I tried "anyone" in different contexts and got the same result. Can anyone tell me if this has happened to them, and whether there is a way to correct it (perhaps some setting?).
Thank you!
This is more a linguistics question than a code question. The analysis of these words is complex. Many linguists would describe them as a fused determiner and noun, as they fairly transparently are, at least historically.
In general we currently use the Penn Treebank standards for tokenization, part-of-speech, and phrasal labels for English. The Penn Treebank annotates these words as nouns (NN) - rightly or wrongly - so that is what our tools currently do.
However, you might be pleased to know that under the Universal Dependencies guidelines, these words are indeed pronouns (PRON). We are sure to be moving towards greater use of Universal Dependencies in future releases.
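In the meantime, if you want the coarser tags, NLTK can map Penn Treebank output onto the universal tagset. A small sketch; note that this is a mechanical tag mapping, so a word tagged NN comes out as NOUN, not PRON; only a model actually trained on Universal Dependencies data would give PRON here.

```python
# Sketch: Penn Treebank tags vs. the coarser universal tagset in NLTK.
# This is a mechanical tag mapping, so a word the tagger marks NN
# comes out NOUN; only a UD-trained model would tag "anyone" as PRON.
import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# nltk.download('universal_tagset')

tokens = nltk.word_tokenize("Can anyone tell me if this works?")
print(nltk.pos_tag(tokens))                      # Penn Treebank tags
print(nltk.pos_tag(tokens, tagset='universal'))  # mapped universal tags
```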
My aim is to detect basic elements in any sentence, such as the verb, noun, or adjective. Is there any gem in Ruby for achieving that? For example:
They elected him president yesterday.
Output:
["subject","verb", "object", "predicative", adverbial"]
These are the only natural language processing options for Ruby that I know of.
Treat
Stanford Core NLP
Open NLP
Interestingly, they are all by the same person.
EDIT
Here is one more option that I found. It's a tutorial on n-gram analysis.
Natural Language Processing with Ruby: n-grams
I've used engtagger with good success in the past. It's ported from a Perl program called Lingua::EN::Tagger. It takes a bit of work to get it to do what you want, but I think it's the best tool available for this application (at least at the moment).
I am trying to develop a Sinhala (my native language) to English translator. I am still thinking about an approach.
If I parse a sentence of my language, can I then use that for generating the English sentence with the help of the Stanford parser or any other parser? Or is there any other method you can recommend?
I am also thinking of a bottom-up parser for my language, but I still have no idea how to implement one. Any suggestions for steps I can follow?
Thanks Mathee
This course on Coursera may help you implement a translator. From what I know based on that course, you can use a training set tagged by parts of speech (i.e. noun, verb, etc.) and use that training set to tag other sentences. I suggest looking into hidden Markov models.
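To make the HMM suggestion concrete, here is a hedged sketch of training a supervised HMM tagger with NLTK on its bundled Penn Treebank sample; for Sinhala you would substitute your own part-of-speech-tagged corpus.

```python
# Sketch: train a supervised HMM part-of-speech tagger with NLTK.
# Uses the bundled Penn Treebank sample; swap in your own tagged
# Sinhala corpus (a list of [(word, tag), ...] sentences) instead.
import nltk
from nltk.corpus import treebank
from nltk.tag import hmm
# nltk.download('treebank')

train_sents = treebank.tagged_sents()[:3000]
tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train_sents)

print(tagger.tag("This is a test .".split()))
```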
My Pyramids parser is an unconventional single-sentence parser for English. (It is capable of parsing other languages too, but a grammar must be specified.) The parser can not only parse English into parse trees, but can also convert back and forth between parse trees and word-level semantic graphs, which describe the semantic relationships between all the words in a sentence. The correct word order is reconstructed based on the contents of the graph; all that needs to be provided, other than the words and their relationships, is the type of sentence (statement, question, command) and the linguistic category of each word (noun, determiner, verb, etc.). From there it is straightforward to join the tokens of the parse tree into a sentence.
The parser is a (very early) alpha pre-release, but it is functional and actively maintained. I am currently using it to translate back and forth between English and an internal semantic representation used by a conversational agent (a "chat bot", but capable of more in-depth language understanding). If you decide to use the parser, do let me know. I will be happy to provide any assistance you might need with installing, using, or improving it.
What Algorithm/method do I use for a Question Answering System's Question Processing?
I have been searching for possible algorithms for my Question Answering System. The only thing I think would be possible to use is parsing, but I asked about parsing in my last question, and from the answers there I think it can't be used (I'm not sure).
My idea for using parsing is to cut the question into pieces, word by word, and then run each word through a storage of words that determines what kind of word (noun, adjective, verb, etc.) it is. My purpose in using parsing is to determine the topic of the question.
My other idea is a chatterbot. Correct me if I'm mistaken, but a chatterbot uses a set of trigger words, and each trigger word is assigned to possible replies; it randomly chooses a reply from that set.
Example: User's statement: Hello > Chatterbot's possible replies: Hi, Hello, Hey
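To pin down that idea, here is a minimal sketch (in Python for illustration, though I'm developing in JavaScript; the trigger words and replies are made up):

```python
# Minimal sketch of the reply-picking idea: each trigger word maps to
# a pool of canned replies and one is chosen at random.
import random

replies = {
    "hello": ["Hi", "Hello", "Hey"],
    "bye":   ["Goodbye", "See you", "Later"],
}

def respond(statement):
    pool = replies.get(statement.strip().lower(), ["Sorry, I don't follow."])
    return random.choice(pool)

print(respond("Hello"))  # -> one of: Hi / Hello / Hey
```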
I'm not quite sure what method/algorithm to use for question answering. I have read the Wikipedia article (http://en.wikipedia.org/wiki/Question_answering), but I still do not understand what algorithm to use for question processing.
Thank you.
PS: I'm developing in JavaScript. Q = Question
You could use a naive Bayes classifier in order to look at the questions and determine their subject. You'd need a lot of training data and a fairly narrow domain.
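A hedged sketch of that idea with NLTK's naive Bayes classifier; the tiny training set and the bag-of-words features are invented purely for illustration:

```python
# Sketch: a naive Bayes question-topic classifier with NLTK.
# The tiny training set and the bag-of-words features are invented
# purely for illustration; a real system needs far more data.
import nltk

def features(question):
    return {word.lower(): True for word in question.split()}

train = [
    ("What time does the store open?",  "hours"),
    ("When do you close on Sundays?",   "hours"),
    ("How much does shipping cost?",    "pricing"),
    ("What is the price of delivery?",  "pricing"),
]

classifier = nltk.NaiveBayesClassifier.train(
    [(features(q), label) for q, label in train])

print(classifier.classify(features("When does the store close?")))
```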
The sophisticated responses to this problem involve a lot of machine inference techniques which are a bit beyond my skill level to explain well. My idea is to use a Markov network in which each word has an edge to one or two words next to it. A series of tests is applied to each word to indicate the likely membership of that word in one of its possible meanings (for example, "Mark" is more likely a name if it's capitalized, but if the next word is 'a', it is probably being used in the sense of a verb). From there the machine can attempt to determine the actual meaning of the sentence, which will rely on, again, unimaginably large amounts of training data.
Coursera's Probabilistic Graphical Models class (probably their NLP class too) would be the best resource if you're interested in becoming skilled in this area. (PGM is the only reason I know anything about this!)
Here's a great book you may want to read to learn a lot about NLP and question answering systems: http://www.amazon.com/Speech-Language-Processing-2nd-Edition/dp/0131873210
The book has a full section (V. Applications) that will help you a lot in developing a good system.
Note, though, that the book discusses theories and algorithms only (no code).
It's not just about parsing text; you'll need to understand the context to provide a better answer. In practice you need to extract some keywords and ignore everything else.
You may also read up on topics like keywords (bag of words) and algorithms like TF-IDF.
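To make TF-IDF concrete, here is a small self-contained sketch that computes it from scratch on made-up documents:

```python
# Sketch: TF-IDF from scratch on toy documents.
# tf(t, d) = count of t in d / number of terms in d
# idf(t)   = log(N / number of documents containing t)
import math

docs = [
    "what time does the store open".split(),
    "how much does shipping cost".split(),
    "what is the shipping time".split(),
]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

# Rare terms ("store") score higher than common ones ("does").
for term in ("shipping", "store", "does"):
    print(term, tf(term, docs[1]) * idf(term))
```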
I need to calculate word frequencies of a given set of adjectives in a large set of customer support reviews. However, I don't want to include those that are negated.
For example, suppose my list of adjectives was: [helpful, knowledgeable, friendly]. I want to make sure "friendly" isn't counted in a sentence such as "The representative was not very friendly."
Do I need to do a full NLP parse of the text or is there an easier approach? I don't need super high accuracy.
I'm not at all familiar with NLP. I'm hoping for something that doesn't have such a steep learning curve and isn't so processor intensive.
Thanks
If all you want is adjective frequencies, then the problem is relatively simple, as opposed to some brutal, not-so-good machine learning solution.
Wat do?
Do POS tagging on your text. This annotates your text with part of speech tags, so you'll have 95% accuracy or more on that. You can tag your text using the Stanford Parser online to get a feel for it. The parser actually also gives you the grammatical structure, but you only care about the tagging.
You also want to make sure the sentences are broken up properly. For this you need a sentence breaker. That's included with software like the Stanford parser.
Then just break up the sentences, tag them, and count all things with the tag ADJ or whatever tag they use. If the tags don't make sense, look up the Penn Treebank tagset (Treebanks are used to train NLP tools, and the Penn Treebank tags are the common ones).
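Putting those steps together, here is a hedged NLTK sketch that also skips an adjective when a negation word appears shortly before it (since you wanted negated adjectives excluded); the three-token window and the negation list are crude heuristics of my own:

```python
# Sketch: sentence-split, POS-tag, and count target adjectives,
# skipping ones preceded by a nearby negation word. The 3-token
# window and negation list are crude, arbitrary heuristics.
import nltk
from collections import Counter
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

TARGETS = {"helpful", "knowledgeable", "friendly"}
NEGATIONS = {"not", "never", "hardly", "n't", "no"}

def count_adjectives(text, window=3):
    counts = Counter()
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        for i, (word, tag) in enumerate(tagged):
            w = word.lower()
            # Penn Treebank adjective tags all start with "JJ".
            if tag.startswith("JJ") and w in TARGETS:
                before = {t.lower() for t, _ in tagged[max(0, i - window):i]}
                if not before & NEGATIONS:
                    counts[w] += 1
    return counts

text = "The rep was very helpful. The rep was not very friendly."
print(count_adjectives(text))  # expect: helpful counted, friendly skipped
```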
How?
Java and Python are the languages of NLP tools. For Python, use NLTK. It's easy, well documented, and well understood.
For Java, you have GATE, LingPipe, and the Stanford Parser, among others. The Stanford Parser is a complete pain in the ass to use; fortunately, I've suffered so you do not have to if you choose to go that route. See my google page for some code examples (at the bottom of the page) with the Stanford Parser.
Das all?
Nah, you might want to stem the adjectives too; that's where you get the root form of a word:
cars -> car
I can't actually think of a situation where this is necessary with adjectives, but it might happen. When you look at your output it'll be apparent if you need to do this. A POS tagger/parser/etc will get you your stemmed words (also called lemmas).
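If you do end up needing lemmas, NLTK's WordNet lemmatizer gives them to you directly; a quick sketch:

```python
# Sketch: getting lemmas with NLTK's WordNet lemmatizer.
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))           # -> car
print(lemmatizer.lemmatize("glasses"))        # -> glass
print(lemmatizer.lemmatize("was", pos="v"))   # -> be (default pos is noun)
```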
More NLP Explanations
See this question.
It depends on the source of your data. If the sentences come from some kind of generator, you can probably split them automatically. Otherwise you will need NLP, yes.
Properly parsing natural language is pretty much an open issue. It works "largely" for English, in particular since English sentences tend to stick to SVO order. German, for example, is quite nasty here, as different word orders convey different emphasis (and thus can convey different meanings, in particular when irony is used). Additionally, German tends to use subordinate clauses much more.
NLP clearly is the way to go. At least some basic parser will be needed. It really depends on your task, too: do you need to make sure every one is correct, or is a probabilistic approach good enough? Can "difficult" cases be discarded or fed to a human for review? etc.