It looks like Stanford NLP has had these SR models for some time.
I am really new to NLP, but we are currently using the PCFG parser and having serious performance issues (so much so that we cut the maximum parse length down to 35).
I was wondering if we could try SR instead. I tried it with the Stanford POS tagger (english-left3words-distsim.tagger).
Would you know how SR compares to PCFG in accuracy?
I also see sentence-root detection issues with SR and the dependency parse. Example:
Michael Jeffrey Jordan, also known by his initials, MJ, is an American former professional basketball player, entrepreneur, and current majority owner and chairman of the Charlotte Bobcats
The PCFG parser is really accurate with the root here and detects "player" as the root.
I would also appreciate a little insight into the NN (neural network) parser people use, e.g. in the post at https://mailman.stanford.edu/pipermail/java-nlp-user/2014-November/006513.html.
Do I need to use another tagger, like left3words, with this?
I am sorry if this sounds a little naive.
But all I want is a correct sentence root and its dependencies.
Does POS tagging up front make it faster?
Thanks a lot in advance.
The English shift-reduce parser shipped with CoreNLP is actually slightly better than the PCFG parser on our test data. You can see performance metrics at the bottom of the shift-reduce parser homepage.
I've asked for clarification in a comment above.
I noticed that the Stanford Parser tags "anyone" and "anybody" as nouns, whereas they are pronouns. I tried "anyone" in different contexts and got the same result. Can anyone tell me if this has happened to them, and whether there is a way to correct it (perhaps some setting?).
Thank you!
This is more a linguistics question than a code question. The analysis of these words is complex. Many linguists would describe them as being a fused determiner and noun, as they fairly transparently are at least historically.
In general we currently use the Penn Treebank standards for tokenization, part-of-speech, and phrasal labels for English. The Penn Treebank annotates these words as noun (NN) - rightly or wrongly - so that is what our tools currently do.
However, you might be pleased to know that under the Universal Dependencies guidelines these words are indeed pronouns (PRON). We expect to move towards greater use of Universal Dependencies in future releases.
I am trying to develop a Sinhala (my native language) to English translator.
I am still thinking about an approach.
If I parse a sentence of my language, can I then use that to generate an English sentence with the help of the Stanford parser? Or is there another method you can recommend?
I am also thinking of a bottom-up parser for my language, but still have no idea how to implement it. Any suggestions for steps I can follow?
Thanks
Mathee
If you have a large enough bilingual corpus, then the best thing you can do is train a model using an existing statistical machine translation system like Moses. Modern phrase-based SMT systems resemble a reverse parsing mechanism in the sense that they find the most probable combination of target-language phrases for translating a specific source-language sentence. There is more information in this Wikipedia article.
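To sketch the idea (this is the standard noisy-channel formulation of SMT, not anything specific to Moses): the decoder picks the target sentence e* that maximizes the product of a translation model, estimated from the bilingual corpus, and a language model, estimated from monolingual target-language text:

    e* = argmax_e P(e | f) = argmax_e P(f | e) * P(e)

where f is the source-language sentence; the phrase-based variant decomposes P(f | e) over phrase pairs.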
This course on Coursera may help you implement a translator. From what I know based on that course, you can use a training set tagged with parts of speech (i.e. noun, verb, etc.) and use that training set to tag and parse other sentences. I suggest looking into hidden Markov models.
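For instance, here is a minimal sketch of a supervised HMM part-of-speech tagger using NLTK's bundled Penn Treebank sample; the corpus choice and the train/test split sizes are just illustrative assumptions, not anything from the course:

    # Minimal supervised HMM POS-tagger sketch using NLTK's Treebank sample.
    import nltk
    from nltk.corpus import treebank
    from nltk.tag import hmm

    # nltk.download('treebank')  # uncomment on first use

    tagged_sents = list(treebank.tagged_sents())
    train, test = tagged_sents[:3000], tagged_sents[3000:3100]

    # Estimate transition/emission probabilities from the tagged training data.
    tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train)

    print(tagger.tag("I am thinking of a bottom up parser".split()))

    # Simple accuracy check on held-out sentences.
    correct = total = 0
    for sent in test:
        words = [w for w, _ in sent]
        gold = [t for _, t in sent]
        pred = [t for _, t in tagger.tag(words)]
        correct += sum(p == g for p, g in zip(pred, gold))
        total += len(gold)
    print("tag accuracy:", correct / total)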
My Pyramids parser is an unconventional single-sentence parser for English. (It is capable of parsing other languages too, but a grammar must be specified.) The parser can not only parse English into parse trees, but can also convert back and forth between parse trees and word-level semantic graphs, which are graphs that describe the semantic relationships between all the words in a sentence. The correct word order is reconstructed based on the contents of the graph; all that needs to be provided, other than the words and their relationships, is the type of sentence (statement, question, command) and the linguistic category of each word (noun, determiner, verb, etc.). From there it is straightforward to join the tokens of the parse tree into a sentence.
The parser is a (very early) alpha pre-release, but it is functional and actively maintained. I am currently using it to translate back and forth between English and an internal semantic representation used by a conversational agent (a "chat bot", but capable of more in-depth language understanding). If you decide to use the parser, do let me know. I will be happy to provide any assistance you might need with installing, using, or improving it.
This topic has many threads, but I am posting another one. The existing posts may each suggest a way to do sentiment analysis, but I have not found a clear one.
I want to implement sentiment analysis, so I would like to be shown a way. From my research, I gather that a Bayesian algorithm is commonly used: count positive and negative words, then calculate the probability of the sentence being positive or negative using a bag-of-words model.
That only covers the words, though; I guess we have to do some language processing too. Is there anyone with more knowledge who can guide me to some algorithms, with links for reference, so that I can implement them? Anything in particular that may help my analysis is welcome.
Also, can you recommend a language to work with? Some say Java is comparatively slow, so they don't recommend working with it.
Any type of help is much appreciated.
First of all, sentiment analysis is done at various levels, such as the document, sentence, phrase, and feature level. Which one are you working on? There are many different approaches to each of them. You can find a very good intro to this topic here. For machine-learning approaches, the most important element is feature engineering, and it is not limited to bag of words. You can find many other useful features for different applications in the tutorial I linked. What language processing you need to do depends on what features you want to use; for example, you may need POS tagging if POS information is needed for your features.
For classifiers, you can try Support Vector Machines, Maximum Entropy, and Naive Bayes (probably as a baseline); all of these are frequently used in the literature, and you can find a pretty comprehensive list of them in the link as well. The Mallet toolkit contains MaxEnt and Naive Bayes, and if you use SVMlight, you can easily convert the feature formats to the Mallet format with a function. Of course there are many other implementations of these classifiers.
For rule-based methods, Pointwise Mutual Information and various kinds of scoring-based methods are frequently used.
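If you want a quick hands-on baseline before reaching for Mallet or SVMlight, here is a rough bag-of-words Naive Bayes sketch in Python with NLTK, using its movie_reviews corpus as stand-in data; the corpus choice, the 2000-word vocabulary, and the train/test split are just illustrative assumptions:

    # Toy document-level sentiment baseline: bag-of-words features + Naive Bayes.
    import random
    import nltk
    from nltk.corpus import movie_reviews

    # nltk.download('movie_reviews')  # uncomment on first use

    docs = [(list(movie_reviews.words(fid)), cat)
            for cat in movie_reviews.categories()
            for fid in movie_reviews.fileids(cat)]
    random.shuffle(docs)

    # Binary presence features over the 2000 most frequent words.
    freq = nltk.FreqDist(w.lower() for w in movie_reviews.words())
    vocab = [w for w, _ in freq.most_common(2000)]

    def doc_features(words):
        present = {w.lower() for w in words}
        return {"has(%s)" % w: (w in present) for w in vocab}

    featuresets = [(doc_features(words), cat) for words, cat in docs]
    train, test = featuresets[200:], featuresets[:200]

    classifier = nltk.NaiveBayesClassifier.train(train)
    print("accuracy:", nltk.classify.accuracy(classifier, test))
    classifier.show_most_informative_features(10)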
Hope this helps.
For text analysis there is no language stronger than SNOBOL. In SNOBOL-4, for example, a Fortran interpreter takes only 60 lines.
NLTK offers really good algorithms for sentiment analysis. It is open source, so you can look at the source code and check out the algorithms used. You can even download the NLTK book, which is free and has some good material on sentiment analysis.
Coming to your second point, I don't think Java is that slow. I have been coding in C++ for years but lately also started with Java, since a lot of very popular open-source software - Lucene, Solr, Hadoop, Neo4j - is written in Java.
I need to calculate the word frequencies of a given set of adjectives in a large set of customer support reviews. However, I don't want to include adjectives that are negated.
For example, suppose my list of adjectives is [helpful, knowledgeable, friendly]. I want to make sure "friendly" isn't counted in a sentence such as "The representative was not very friendly."
Do I need to do a full NLP parse of the text or is there an easier approach? I don't need super high accuracy.
I'm not at all familiar with NLP. I'm hoping for something that doesn't have such a steep learning curve and isn't so processor intensive.
Thanks
If all you want is adjective frequencies, then the problem is relatively simple and doesn't call for some brutal, not-so-good machine-learning solution.
Wat do?
Do POS tagging on your text. This annotates your text with part of speech tags, so you'll have 95% accuracy or more on that. You can tag your text using the Stanford Parser online to get a feel for it. The parser actually also gives you the grammatical structure, but you only care about the tagging.
You also want to make sure the sentences are broken up properly. For this you need a sentence breaker. That's included with software like the Stanford parser.
Then just break up the sentences, tag them, and count everything with the adjective tag - JJ (plus JJR/JJS for comparatives and superlatives) in the Penn Treebank tagset. If the tags don't make sense, look up the Penn Treebank tagset (treebanks are used to train NLP tools, and the Penn Treebank tags are the common ones).
How?
Java and Python are the languages of NLP tools. For Python, use NLTK. It's easy, well documented, and well understood.
For Java, you have GATE, LingPipe, and the Stanford Parser, among others. The Stanford Parser is a complete pain in the ass to use; fortunately, I've suffered so you don't have to if you choose to go that route. See my Google page for some code examples with the Stanford Parser (at the bottom of the page).
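To make the NLTK route concrete, here is a rough sketch of the whole pipeline: sentence splitting, POS tagging, counting the target adjectives, and skipping ones with a negation word shortly before them. The negation cue list and the three-token window are just illustrative assumptions, not a standard recipe:

    # Rough sketch: count target adjectives, skipping ones negated nearby.
    from collections import Counter
    import nltk

    # nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

    TARGET_ADJS = {"helpful", "knowledgeable", "friendly"}
    NEGATIONS = {"not", "n't", "never", "no", "hardly"}   # assumption: simple cue list
    WINDOW = 3                                            # assumption: look 3 tokens back

    def count_adjectives(text):
        counts = Counter()
        for sent in nltk.sent_tokenize(text):
            tagged = nltk.pos_tag(nltk.word_tokenize(sent))
            for i, (word, tag) in enumerate(tagged):
                # JJ/JJR/JJS are the Penn Treebank adjective tags.
                if tag.startswith("JJ") and word.lower() in TARGET_ADJS:
                    prior = {w.lower() for w, _ in tagged[max(0, i - WINDOW):i]}
                    if not prior & NEGATIONS:
                        counts[word.lower()] += 1
        return counts

    print(count_adjectives("The representative was not very friendly. "
                           "She was knowledgeable and helpful."))
    # -> Counter({'knowledgeable': 1, 'helpful': 1})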
Das all?
Nah, you might want to stem the adjectives too - that's where you get the root form of a word:
cars -> car
I can't actually think of a situation where this is necessary with adjectives, but it might happen. When you look at your output it'll be apparent if you need to do this. A POS tagger/parser/etc will get you your stemmed words (also called lemmas).
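If it turns out you do need lemmas, NLTK's WordNet lemmatizer is one easy option (assuming the 'wordnet' data has been downloaded):

    # Lemmatization with NLTK's WordNet lemmatizer.
    # nltk.download('wordnet')  # uncomment on first use
    from nltk.stem import WordNetLemmatizer

    lem = WordNetLemmatizer()
    print(lem.lemmatize("cars"))             # -> car
    print(lem.lemmatize("better", pos="a"))  # -> good ('a' = adjective)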
More NLP Explanations
See this question.
It depends on the source of your data. If the sentences come from some kind of generator, you can probably split them automatically. Otherwise you will need NLP, yes.
Properly parsing natural language is pretty much an open issue. It works "largely" for English, in particular since English sentences tend to stick to SVO order. German, for example, is quite nasty here, as different word orders convey different emphasis (and thus can convey different meanings, in particular when irony is used). Additionally, German tends to use subordinate clauses much more.
NLP clearly is the way to go. At least some basic parser will be needed. It really depends on your task, too: do you need to make sure every one is correct, or is a probabilistic approach good enough? Can "difficult" cases be discarded or fed to a human for review? etc.