The road map for CoreNLP is unclear. Is it in maintenance mode? I'm happy to see the emphasis on StanfordNLP, but the lack of visibility into the project's direction is concerning. If the new neural models are better, will we see them wrapped in the Java CoreNLP APIs?
CoreNLP is not yet in maintenance mode. We are going to put in some quite significant (and compatibility-breaking) changes over the summer. Among other things, we're going to convert across to using UDv2 (from the current UDv1), we're going to make tokenization changes to English and perhaps other languages to better align with UD and "new" (since about 2004!) Penn Treebank tokenization, and we'll have more consistent availability and use of word vectors. These changes should increase compatibility between the Java and Python packages, and over time also make it possible for us to use more data to train Python stanfordnlp models. Now that the Python stanfordnlp v0.2 is out, work on CoreNLP should pick up.
On the other hand, most of the research energy in the Stanford NLP group has now moved to exploring neural models built in Python on top of the major deep learning frameworks. (Hopefully that's not a surprise to hear!) It is therefore less likely that major new components will be added to CoreNLP. It's hard to predict the future, but it is reasonable to expect that CoreNLP will head more in the direction of being a stable, efficient-on-CPU NLP package, rather than something implementing the latest neural models.
The paper "Improved Pattern Learning for Bootstrapped Entity Extraction" (Sonal Gupta and Christopher D. Manning, Proceedings of the Eighteenth Conference on Computational Natural Language Learning (CoNLL), 2014) gives http://nlp.stanford.edu/software/patternviz.shtml as the link for the tool's implementation, but that page seems to have been taken down.
Oops! We should fix that; in the meantime, the new link is http://nlp.stanford.edu/software/patternslearning.html. The code is distributed with Stanford CoreNLP, so there's no extra download. An example invocation is:
java -cp stanford-corenlp-3.5.1.jar:stanford-corenlp-3.5.1-models.jar:javax.json.jar:joda-time.jar:jollyday.jar edu.stanford.nlp.patterns.GetPatternsFromDataMultiClass -props patterns/example.properties
I am confused: does the Stanford dependency parser perform tokenization of sentences and words using probabilistic methods or rule-based methods? I would also like to know what dependency grammar and dependency parsing are.
Please help!
Thanks
The tokenization is entirely rule-based. If you're curious, you can take a look at the (very lengthy) tokenizer definition for English.
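As a toy illustration of the rule-based idea (this is nowhere near the actual PTBTokenizer rules, which handle hundreds of special cases), a tokenizer can be written as an ordered list of regular-expression rules:

```ruby
# A toy rule-based tokenizer: an ordered list of regex rules.
# This is only a sketch of the approach, not the real English rules.
TOKEN_RULES = [
  /[A-Za-z]+(?:'[A-Za-z]+)?/, # words, with optional clitics like "don't"
  /\d+(?:\.\d+)?/,            # numbers, including decimals
  /[.,!?;:]/                  # common punctuation marks
]

def tokenize(text)
  # Regexp.union tries the rules in order at each position in the text.
  text.scan(Regexp.union(TOKEN_RULES))
end

p tokenize("Don't panic, it's 3.14!")
# => ["Don't", "panic", ",", "it's", "3.14", "!"]
```

A production tokenizer adds rules for abbreviations, URLs, hyphenation, quotes, and so on, which is why the real definition file is so lengthy.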
There is a short introduction to dependency parsing on this Stanford page, with some links to relevant papers as well.
I am trying to develop a Sinhala (my native language) to English translator, and I am still deciding on an approach.
If I parse a sentence of my language, can I then use that to generate an English sentence with the help of the Stanford parser or some other parser? Or is there another method you can recommend?
I am also thinking of writing a bottom-up parser for my language, but I have no idea yet how to implement one. Any suggestions for steps I can follow?
Thanks, Mathee
This course on Coursera may help you implement a translator. From what I know based on that course, you can use a training set tagged with parts of speech (i.e. noun, verb, etc.) and then use that training set to parse other sentences. I suggest looking into hidden Markov models.
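To make the HMM suggestion concrete, here is a toy sketch of Viterbi decoding over a hand-made model (the tags and all probabilities below are invented for illustration; a real tagger estimates them from a tagged training corpus):

```ruby
# Toy HMM part-of-speech tagger using Viterbi decoding.
# All probabilities are made up for illustration only.
TAGS = %i[noun verb det]

START = { noun: 0.3, verb: 0.1, det: 0.6 }            # P(tag at sentence start)
TRANSITION = {                                        # P(next tag | previous tag)
  det:  { noun: 0.9, verb: 0.05, det: 0.05 },
  noun: { noun: 0.1, verb: 0.8,  det: 0.1 },
  verb: { noun: 0.3, verb: 0.1,  det: 0.6 }
}
EMISSION = {                                          # P(word | tag)
  det:  { 'the' => 0.9 },
  noun: { 'dog' => 0.4, 'runs' => 0.1 },
  verb: { 'runs' => 0.5, 'dog' => 0.05 }
}

def viterbi(words)
  # best[tag] = [probability of the best tag path ending in tag, that path]
  best = {}
  TAGS.each do |t|
    best[t] = [START[t] * EMISSION[t].fetch(words.first, 1e-6), [t]]
  end
  words.drop(1).each do |word|
    new_best = {}
    TAGS.each do |t|
      # Pick the previous tag that maximizes path prob * transition prob.
      trans_prob, prev_tag = TAGS.map { |pt| [best[pt][0] * TRANSITION[pt][t], pt] }.max
      new_best[t] = [trans_prob * EMISSION[t].fetch(word, 1e-6), best[prev_tag][1] + [t]]
    end
    best = new_best
  end
  best.values.max_by { |prob, _| prob }.last
end

p viterbi(%w[the dog runs])  # => [:det, :noun, :verb]
```

Training replaces the hand-made tables with counts from the tagged corpus, which is exactly what the course material walks through.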
My Pyramids parser is an unconventional single-sentence parser for English. (It is capable of parsing other languages too, but a grammar must be specified.) The parser can not only parse English into parse trees, but can also convert back and forth between parse trees and word-level semantic graphs: graphs that describe the semantic relationships between all the words in a sentence. The correct word order is reconstructed from the contents of the graph; all that needs to be provided, other than the words and their relationships, is the type of sentence (statement, question, command) and the linguistic category of each word (noun, determiner, verb, etc.). From there it is straightforward to join the tokens of the parse tree into a sentence.
The parser is a (very early) alpha pre-release, but it is functional and actively maintained. I am currently using it to translate back and forth between English and an internal semantic representation used by a conversational agent (a "chat bot", but capable of more in-depth language understanding). If you decide to use the parser, do let me know. I will be happy to provide any assistance you might need with installing, using, or improving it.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I'm looking to do some sentence analysis (mostly for twitter apps) and infer some general characteristics. Are there any good natural language processing libraries for this sort of thing in Ruby?
Similar to "Is there a good natural language processing library", but for Ruby. I'd prefer something very general, but any leads are appreciated!
Three excellent and mature NLP packages are Stanford Core NLP, Open NLP and LingPipe. There are Ruby bindings to the Stanford Core NLP tools (GPL license) as well as the OpenNLP tools (Apache License).
On the more experimental side of things, I maintain a Text Retrieval, Extraction and Annotation Toolkit (Treat), released under the GPL, that provides a common API for almost every NLP-related gem that exists for Ruby. The following list of Treat's features can also serve as a good reference in terms of stable natural language processing gems compatible with Ruby 1.9.
Text segmenters and tokenizers (punkt-segmenter, tactful_tokenizer, srx-english, scalpel)
Natural language parsers for English, French and German and named entity extraction for English (stanford-core-nlp).
Word inflection and conjugation (linguistics), stemming (ruby-stemmer, uea-stemmer, lingua, etc.)
WordNet interface (rwordnet), POS taggers (rbtagger, engtagger, etc.)
Language (whatlanguage), date/time (chronic, kronic, nickel), keyword (lda-ruby) extraction.
Text retrieval with indexation and full-text search (ferret).
Named entity extraction (stanford-core-nlp).
Basic machine learning with decision trees (decisiontree), MLPs (ruby-fann), SVMs (rb-libsvm) and linear classification (tomz-liblinear-ruby-swig).
Text similarity metrics (levenshtein-ffi, fuzzy-string-match, tf-idf-similarity).
Not included in Treat, but relevant to NLP: hotwater (string distance algorithms), yomu (bindings to Apache Tika for reading .doc, .docx, .pages, .odt, .rtf, .pdf), graph-rank (an implementation of GraphRank).
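To illustrate what the text similarity gems above compute, here is a plain-Ruby sketch of the classic dynamic-programming edit distance (gems like levenshtein-ffi compute the same quantity in C for speed):

```ruby
# Plain-Ruby Levenshtein edit distance via dynamic programming.
# Gems like levenshtein-ffi produce the same result, but much faster.
def levenshtein(a, b)
  prev = (0..b.length).to_a           # distances for the empty prefix of a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]                        # deleting i chars of a to reach empty b
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,           # deletion
               curr[j - 1] + 1,       # insertion
               prev[j - 1] + cost     # substitution (or free match)
              ].min
    end
    prev = curr
  end
  prev.last
end

p levenshtein("kitten", "sitting")  # => 3
```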
There are some things at Ruby Linguistics and some links therefrom, though it doesn't seem anywhere close to what NLTK is for Python, yet.
You can always use JRuby and call the Java libraries.
EDIT: The ability to run Ruby natively on the JVM and easily leverage Java libraries is a big plus for Rubyists. This is a good option that should be considered in a situation like this.
Which NLP toolkit to use in JAVA?
I found an excellent article detailing some NLP algorithms in Ruby here. This includes stemmers, date time parsers and grammar parsers.
TREAT – the Text REtrieval and Annotation Toolkit – is the most comprehensive toolkit I know of for Ruby: https://github.com/louismullie/treat/wiki/
I maintain a list of Ruby Natural Language Processing resources (libraries, APIs, and presentations) on GitHub that covers the libraries listed in the other answers here as well as some additional libraries.
Also consider using SaaS APIs like MonkeyLearn. You can easily train text classifiers with machine learning and integrate via an API. There's a Ruby SDK available.
Besides creating your own classifiers, you can pick pre-created modules for sentiment analysis, topic classification, language detection and more.
We also have extractors for keywords and entities, and we'll keep adding more public modules.
Other nice features:
You have a GUI to create/test algorithms.
Algorithms run really fast in our cloud computing platform.
You can integrate with Ruby or any other programming language.
Try this one
https://github.com/louismullie/stanford-core-nlp
About stanford-core-nlp gem
This gem provides high-level Ruby bindings to the Stanford Core NLP package, a set of natural language processing tools for tokenization, sentence segmentation, part-of-speech tagging, lemmatization, and parsing of English, French and German. The package also provides named entity recognition and coreference resolution for English.
http://nlp.stanford.edu/software/corenlp.shtml
demo page
http://nlp.stanford.edu:8080/corenlp/
You need to be much more specific about what these "general characteristics" are.
In NLP, the "general characteristics" of a sentence can mean a million different things: sentiment analysis (i.e., the attitude of the speaker), basic part-of-speech tagging, use of personal pronouns, whether the sentence contains active or passive verbs, the tense and voice of the verbs...
I don't mind if you're vague about describing it, but if we don't know what you're asking it's highly unlikely we can be specific in helping you.
My general suggestion, especially for NLP, is you should get the tool best designed for the job instead of limiting yourself to a specific language. Limiting yourself to a specific language is fine for some tasks where the general tools are implemented everywhere, but NLP is not one of those.
The other issue in working with Twitter is that many of the sentences there will be half-baked or compressed in strange and wonderful ways, which most NLP tools aren't trained for. To help there, the NUS SMS Corpus consists of "about 10,000 SMS messages collected by students". Due to the similar restrictions and usage, analysing that corpus may be helpful in your explorations with Twitter.
If you're more specific I'll try and list some tools that will help.
I would check out Mark Watson's free book Practical Semantic Web and Linked Data Applications, Java, Scala, Clojure, and JRuby Edition. He has chapters on NLP using Java, Clojure, Ruby, and Scala. He also provides links to the resources you need.
For people looking for something more lightweight and simple to implement this option worked well for me.
https://github.com/yohasebe/engtagger