The road map for CoreNLP is unclear. Is it in maintenance mode? I'm happy to see the emphasis on StanfordNLP, but the lack of visibility into the direction is concerning. If the new neural models are better, will we see them wrapped in the Java CoreNLP APIs?
CoreNLP is not yet in maintenance mode. We are going to put in some quite significant (and compatibility-breaking) changes over the summer. Among other things, we're going to convert across to using UDv2 (from the current UDv1), we're going to make tokenization changes to English and perhaps other languages to better align with UD and "new" (since about 2004!) Penn Treebank tokenization, and we'll have more consistent availability and use of word vectors. These changes should increase compatibility between the Java and Python packages, and over time also make it possible for us to use more data to train Python stanfordnlp models. Now that the Python stanfordnlp v0.2 is out, work on CoreNLP should pick up.
On the other hand, most of the research energy in the Stanford NLP group has now moved to exploring neural models built in Python on top of the major deep learning frameworks. (Hopefully that's not a surprise to hear!) It is therefore less likely that major new components will be added to CoreNLP. It's hard to predict the future, but it is reasonable to expect that CoreNLP will head more in the direction of being a stable, efficient-on-CPU NLP package, rather than something implementing the latest neural models.
I used Anglican, which is based on Clojure, and I don't think it is a good fit for me: the documentation is poor and the community is too small to find help. Also, I still can't get comfortable with Lisp/Scheme-style languages. So I want to switch to something based on Python.
Maybe Pyro or PyMC could be the answer, but I know very little about either of them.
What are the differences between the two frameworks?
Can they be used for the same problems?
Are there examples, where one shines in comparison?
(Updated for 2022)
Pyro is built on PyTorch. It has full MCMC, HMC and NUTS support. It has excellent documentation and few if any drawbacks that I'm aware of.
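For concreteness, here is a minimal sketch of what NUTS-based inference looks like in Pyro; the toy normal-mean model and the sample counts are my own invention for illustration:

    # Toy example: infer the mean of a normal distribution with NUTS in Pyro.
    # The data and sample counts are made up purely for illustration.
    import torch
    import pyro
    import pyro.distributions as dist
    from pyro.infer import MCMC, NUTS

    data = torch.tensor([0.2, 0.1, -0.3, 0.4, 0.0])

    def model(obs):
        mu = pyro.sample("mu", dist.Normal(0.0, 1.0))      # prior on the mean
        with pyro.plate("data", len(obs)):
            pyro.sample("x", dist.Normal(mu, 1.0), obs=obs)  # likelihood

    mcmc = MCMC(NUTS(model), num_samples=500, warmup_steps=200)
    mcmc.run(data)
    print(mcmc.get_samples()["mu"].mean())  # posterior mean estimate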
PyMC was built on Theano, a framework that is now largely dead but has been revived by a project called Aesara. PyMC3 is now simply called PyMC, and it still exists and is actively maintained. Its reliance on an obscure tensor library other than PyTorch/TensorFlow likely makes it less appealing for wide-scale adoption, but as I note below, probabilistic programming is not really a wide-scale thing, so this matters much, much less in the context of this question than it would for a deep learning framework.
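For comparison, here is roughly the same toy model in the current PyMC API (assuming PyMC >= 4, where pm.sample returns an ArviZ InferenceData object by default); again, the model and numbers are just illustrative:

    # Same toy normal-mean model, sketched in PyMC (v4+ API assumed).
    import numpy as np
    import pymc as pm

    data = np.array([0.2, 0.1, -0.3, 0.4, 0.0])

    with pm.Model():
        mu = pm.Normal("mu", mu=0.0, sigma=1.0)          # prior on the mean
        pm.Normal("x", mu=mu, sigma=1.0, observed=data)  # likelihood
        idata = pm.sample(500, tune=200, progressbar=False)

    print(idata.posterior["mu"].mean())  # posterior mean estimate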
There still is something called Tensorflow Probability, with the same great documentation we've all come to expect from Tensorflow (yes that's a joke). My personal opinion as a nerd on the internet is that Tensorflow is a beast of a library that was built predicated on the very Googley assumption that it would be both possible and cost-effective to employ multiple full teams to support this code in production, which isn't realistic for most organizations let alone individual researchers.
That said, they're all pretty much the same thing, so try them all, try whatever the guy next to you uses, or just flip a coin. The best library is generally the one you actually use to make working code, not the one that someone on StackOverflow says is the best. As for which one is more popular, probabilistic programming itself is very specialized so you're not going to find a lot of support with anything.
From here
Pyro is a deep probabilistic programming language that focuses on variational inference, supports composable inference algorithms. Pyro aims to be more dynamic (by using PyTorch) and universal (allowing recursion).
Pyro embraces deep neural nets and currently focuses on variational inference. Pyro doesn't do Markov chain Monte Carlo (unlike PyMC and Edward) yet.
Pyro is built on PyTorch whereas PyMC3 is built on Theano. So you get PyTorch's dynamic computation graphs, and it was recently announced that Theano will not be maintained after another year. However, I found that PyMC has excellent documentation and wonderful resources. Another alternative is Edward, built on top of TensorFlow, which is more mature and feature-rich than Pyro at the moment. The authors of Edward claim it is faster than PyMC3.
I guess the decision boils down to the features, documentation and programming style you are looking for.
Let me begin by saying that I am struggling to understand what is going on in deep learning. From what I gather, it is an approach that tries to have a computer engineer different layers of representations and features so that it can learn things on its own. SIFT seems to be a common way to detect things by tagging and hunting for scale-invariant features in some representation. Again, I am completely lost and in awe and wonder about how this magic is achieved. How does one have a computer do this by itself? I have looked at this paper https://www.cs.ubc.ca/~lowe/papers/ijcv04.pdf and I must say at this point I think it is magic. Can somebody help me distill the main points of how this works and why a computer can do it on its own?
SIFT and CNNs are both methods for extracting features from images, but they work in different ways and produce different kinds of output.
SIFT/SURF/ORB and similar feature extraction algorithms are "hand-crafted": independent of the specific real-world use case, they aim to extract generically meaningful features. This approach has some advantages and disadvantages.
Advantages:
You don't have to worry much about input image conditions, and you probably don't need any pre-processing step to extract these features.
You can take an existing SIFT implementation and integrate it directly into your application (see the short sketch after the disadvantages list below).
With GPU-based implementations (e.g. GPU-SIFT), you can achieve high extraction speed.
Disadvantages:
They are limited in where they can find features; you will have trouble extracting features from fairly plain, textureless surfaces.
SIFT/SURF/ORB cannot solve every problem that requires feature classification or matching. Consider face recognition: do you think extracting and classifying SIFT features over a face would be enough to recognize people?
Because these are hand-crafted techniques, they do not improve over time (unless, of course, a better technique is introduced).
Developing such a feature extraction technique requires a lot of research work.
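As referenced in the advantages above, here is a minimal sketch of extracting SIFT keypoints and descriptors with OpenCV's Python bindings. It assumes an OpenCV build that ships SIFT (cv2.SIFT_create, available since OpenCV 4.4; older builds used cv2.xfeatures2d.SIFT_create), and the image path is a placeholder:

    # Minimal sketch: extract SIFT keypoints and 128-dimensional descriptors.
    # "example.jpg" is a placeholder path.
    import cv2

    img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)

    print(f"Found {len(keypoints)} keypoints")
    print(f"Descriptor shape: {descriptors.shape}")  # (num_keypoints, 128)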
On the other hand, with deep learning you can analyze much more complex features that would be impossible for a human to design by hand. CNNs are currently an excellent approach for building hierarchical filter responses and combining them into increasingly complex features as the network gets deeper.
The main power of CNNs comes from not extracting features by hand: we only define how the network has to look for features (a minimal sketch of a small CNN feature extractor appears at the end of this answer). Of course, this method has some pros and cons too.
Advantages:
More data, better results! It all depends on data: if you have enough data to cover your case, deep learning outperforms hand-crafted feature extraction techniques.
Once you have extracted features from an image, you can use them for many purposes: segmenting the image, generating descriptive captions, detecting objects, recognizing them, and so on. The best part is that all of these can be obtained in one shot, rather than through a complex sequence of separate processes.
Disadvantages:
You need data. Probably a lot.
These days it is better to use supervised or reinforcement learning methods, as unsupervised learning is still not good enough.
It takes time and resources to train a good neural net. A complex architecture like Google's Inception took two weeks to train on an 8-GPU server rack. Of course, not all networks are that hard to train.
There is a learning curve. You don't have to know how SIFT works to use it in your application, but you do have to understand how CNNs work to adapt them to your own purposes.
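As promised above, here is a minimal sketch of a learned feature extractor. The framework (PyTorch) and the layer sizes are illustrative choices of mine, not something prescribed by this answer; the point is only that the stacked convolutions learn low-level filters that get combined into more abstract features as you go deeper:

    # Minimal sketch of a learned, hierarchical feature extractor.
    # Filters are trained from data rather than designed by hand.
    import torch
    import torch.nn as nn

    class TinyFeatureExtractor(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level filters (edges, blobs)
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1),  # combinations of low-level responses
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, kernel_size=3, padding=1),  # higher-level, more abstract features
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )

        def forward(self, x):
            return self.features(x).flatten(1)  # one 64-dim feature vector per image

    model = TinyFeatureExtractor()
    images = torch.randn(4, 3, 224, 224)  # dummy batch of RGB images
    print(model(images).shape)            # torch.Size([4, 64])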
This topic has many threads, but I am posting another one anyway. Each of those posts may describe a way to do sentiment analysis, but I couldn't find one that worked for me.
I want to implement sentiment analysis myself, so I would appreciate being shown a way to do it. During my research I found that a Bayesian algorithm is commonly used: it counts positive and negative words and computes the probability of a sentence being positive or negative using a bag-of-words representation.
That only covers individual words; I guess we have to do some language processing too. Does anyone have more knowledge about this? If so, can you guide me to some algorithms, with links for reference, so that I can implement them? Anything in particular that might help my analysis would be welcome.
Also, can you recommend a language I can work with? Some say Java is comparatively time-consuming, so they don't recommend working with it.
Any type of help is much appreciated.
First of all, sentiment analysis is done at various levels, such as the document, sentence, phrase, and feature level. Which one are you working on? There are many different approaches to each of them. You can find a very good intro to this topic here. For machine-learning approaches, the most important element is feature engineering, and it's not limited to bag of words. You can find many other useful features for different applications in the tutorial I linked. What language processing you need to do depends on what features you want to use. For example, you may need POS tagging if POS information is needed for your features.
For classifiers, you can try Support Vector Machines, Maximum Entropy, and Naive Bayes (probably as a baseline); these are frequently used in the literature, and you can find a pretty comprehensive list of them in the link as well. The Mallet toolkit contains MaxEnt and Naive Bayes, and if you use SVMlight, you can easily convert the feature format to the Mallet format with a small function. Of course, there are many other implementations of these classifiers.
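To make the bag-of-words plus classifier idea concrete, here is a minimal sketch using scikit-learn in Python; this is an illustrative substitute I am choosing, not one of the tools named above (Mallet, SVMlight), and the tiny training set is invented:

    # Minimal bag-of-words sentiment classifier sketch (scikit-learn used purely
    # as an illustration; the toy dataset is made up).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    train_texts = ["I love this movie", "great acting and plot",
                   "terrible film", "I hated every minute"]
    train_labels = ["pos", "pos", "neg", "neg"]

    model = make_pipeline(CountVectorizer(), LinearSVC())
    model.fit(train_texts, train_labels)

    print(model.predict(["what a great film", "this was terrible"]))
    # e.g. ['pos' 'neg']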
For rule-based methods, Pointwise Mutual Information is frequently used, along with various kinds of scoring-based methods, etc.
Hope this helps.
For text analysis there is no language stronger than SNOBOL. In SNOBOL4, for example, a Fortran interpreter takes only 60 lines.
NLTK offers really good algorithms for sentiment analysis. It is open source, so you can look at the source code and check out the algorithms used. You can even download the NLTK book, which is free and has some good material on sentiment analysis.
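As a concrete starting point, here is a minimal sketch in the spirit of the NLTK book's document-classification chapter: a Naive Bayes classifier over bag-of-words features on the movie_reviews corpus (which you need to download first via nltk.download):

    # Sketch of Naive Bayes sentiment classification with NLTK.
    # Requires: nltk.download('movie_reviews')
    import random
    import nltk
    from nltk.corpus import movie_reviews

    docs = [(list(movie_reviews.words(fid)), cat)
            for cat in movie_reviews.categories()
            for fid in movie_reviews.fileids(cat)]
    random.shuffle(docs)

    # Use the 2000 most frequent words in the corpus as the feature vocabulary.
    all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
    word_features = [w for w, _ in all_words.most_common(2000)]

    def doc_features(words):
        word_set = set(words)
        return {f"contains({w})": (w in word_set) for w in word_features}

    featuresets = [(doc_features(d), c) for d, c in docs]
    train_set, test_set = featuresets[200:], featuresets[:200]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features(5)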
Coming to your second point, I don't think Java is that slow. I have been coding in C++ for years myself, but lately I have also started with Java; if you look around, a lot of very popular open-source software such as Lucene, Solr, Hadoop, and Neo4j is written in Java.
I need to do a project for a Computational Linguistics course. Is there an interesting "linguistic" problem that is data-intensive enough to be worth tackling with Hadoop MapReduce? The solution or algorithm should try to analyse and provide some insight into the "linguistic" domain, and it should be applicable to large datasets so that I can use Hadoop for it. I know there is a Python natural language processing toolkit that works with Hadoop.
If you have large corpora in some "unusual" languages (in the sense of "ones for which limited amounts of computational linguistics have been performed"), repeating some existing computational linguistics work already done for very popular languages (such as English, Chinese, Arabic, ...) is a perfectly appropriate project, especially in an academic setting, though it might be quite suitable for industry too. Back when I was in computational linguistics with IBM Research, I got interesting mileage from putting together a corpus for Italian and repeating, in the then relatively new IBM Scientific Center in Rome, very similar work to what the IBM Research team in Yorktown Heights (of which I had been a part) had already done for English.
The hard work is usually finding / preparing such corpora (it was definitely the greatest part of my work back then, despite wholehearted help from IBM Italy to put me in touch with publishing firms who owned relevant data).
So, the question looms large, and only you can answer it: what corpora do you have access to, or can you procure access to (and clean up, etc.), especially in "unusual" languages? If all you can work with is, e.g., English, using already popular corpora, the chances of doing work that's novel and interesting are of course slimmer, though there may still be some.
BTW, I assume you're thinking strictly about processing "written" text, right? If you had a corpus of spoken material (ideally with good transcripts), the opportunities would be endless (there has been much less work on processing spoken text, e.g. to parameterize pronunciation variants by different native speakers on the same written text -- indeed, such issues are often not even mentioned in undergrad CL courses!).
One computation-intensive problem in CL is inferring semantics from large corpora. The basic idea is to take a big collection of text and infer the semantic relationships between words (synonyms, antonyms, hyponyms, hypernyms, etc) from their distributions, i.e. what words they occur with or close to.
This involves a lot of data pre-processing and then can involve many nearest neighbor searches and N x N comparisons, which are well-suited for MapReduce-style parallelization.
Have a look at this tutorial:
http://wordspace.collocations.de/doku.php/course:acl2010:start
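To make the basic idea concrete, here is a toy sketch of distributional similarity in plain Python: build co-occurrence vectors and compare words by cosine similarity. The miniature corpus and window size are invented, and at real scale the co-occurrence counting pass is the part you would push into MapReduce:

    # Toy sketch of distributional semantics: co-occurrence vectors plus cosine
    # similarity. Corpus and window size are made up for illustration.
    from collections import Counter, defaultdict
    import math

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the rug",
        "a cat and a dog played in the garden",
    ]
    window = 2

    # Count, for each word, the words appearing within +/- window positions.
    cooc = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    cooc[w][tokens[j]] += 1

    def cosine(a, b):
        shared = set(a) & set(b)
        num = sum(a[w] * b[w] for w in shared)
        denom = (math.sqrt(sum(v * v for v in a.values()))
                 * math.sqrt(sum(v * v for v in b.values())))
        return num / denom if denom else 0.0

    print(cosine(cooc["cat"], cooc["dog"]))     # words with similar contexts score higher
    print(cosine(cooc["cat"], cooc["garden"]))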
Download 300M words from 60K OA papers published by BioMed Central. Try to discover propositional attitudes and related sentiment constructions. Point being that the biomed literature is chock full of hedging and related constructions, because of the difficulty of making flat declarative statements about the living world and its creatures - their form and function and genetics and biochemistry.
My feeling about Hadoop is that it's a tool to consider, but only after you have done the important work of setting goals. Your goals, strategies, and data should dictate how you proceed computationally. Beware the hammer-in-search-of-a-nail approach to research.
This is part of what my lab is hard at work on.
Bob Futrelle
BioNLP.org
Northeastern University
As you mention, there is a Python toolkit called NLTK which can be used with Dumbo to make use of Hadoop.
PyCon 2010 had a good talk on just this subject. You can access the slides from the talk using the link below.
The Python and the Elephant: Large Scale Natural Language Processing with NLTK and Dumbo
I'm looking to do some sentence analysis (mostly for twitter apps) and infer some general characteristics. Are there any good natural language processing libraries for this sort of thing in Ruby?
Similar to Is there a good natural language processing library but for Ruby. I'd prefer something very general, but any leads are appreciated!
Three excellent and mature NLP packages are Stanford Core NLP, Open NLP and LingPipe. There are Ruby bindings to the Stanford Core NLP tools (GPL license) as well as the OpenNLP tools (Apache License).
On the more experimental side of things, I maintain a Text Retrieval, Extraction and Annotation Toolkit (Treat), released under the GPL, that provides a common API for almost every NLP-related gem that exists for Ruby. The following list of Treat's features can also serve as a good reference in terms of stable natural language processing gems compatible with Ruby 1.9.
Text segmenters and tokenizers (punkt-segmenter, tactful_tokenizer, srx-english, scalpel)
Natural language parsers for English, French and German and named entity extraction for English (stanford-core-nlp).
Word inflection and conjugation (linguistics), stemming (ruby-stemmer, uea-stemmer, lingua, etc.)
WordNet interface (rwordnet), POS taggers (rbtagger, engtagger, etc.)
Language (whatlanguage), date/time (chronic, kronic, nickel), keyword (lda-ruby) extraction.
Text retrieval with indexation and full-text search (ferret).
Named entity extraction (stanford-core-nlp).
Basic machine learning with decision trees (decisiontree), MLPs (ruby-fann), SVMs (rb-libsvm) and linear classification (tomz-liblinear-ruby-swig).
Text similarity metrics (levenshtein-ffi, fuzzy-string-match, tf-idf-similarity).
Not included in Treat, but relevant to NLP: hotwater (string distance algorithms), yomu (bindings to Apache Tika for reading .doc, .docx, .pages, .odt, .rtf, .pdf), graph-rank (an implementation of GraphRank).
There are some things at Ruby Linguistics and some links therefrom, though it doesn't seem anywhere close to what NLTK is for Python, yet.
You can always use JRuby and use the Java libraries.
EDIT: The ability to run Ruby natively on the JVM and easily leverage Java libraries is a big plus for Rubyists. This is a good option that should be considered in a situation like this.
Which NLP toolkit to use in JAVA?
I found an excellent article detailing some NLP algorithms in Ruby here. This includes stemmers, date time parsers and grammar parsers.
TREAT – the Text REtrieval and Annotation Toolkit – is the most comprehensive toolkit I know of for Ruby: https://github.com/louismullie/treat/wiki/
I maintain a list of Ruby Natural Language Processing resources (libraries, APIs, and presentations) on GitHub that covers the libraries listed in the other answers here as well as some additional libraries.
Also consider using SaaS APIs like MonkeyLearn. You can easily train text classifiers with machine learning and integrate via an API. There's a Ruby SDK available.
Besides creating your own classifiers, you can pick pre-created modules for sentiment analysis, topic classification, language detection and more.
We also have extractors for things like keywords and entities, and we'll keep adding more public modules.
Other nice features:
You have a GUI to create/test algorithms.
Algorithms run really fast in our cloud computing platform.
You can integrate with Ruby or any other programming language.
Try this one
https://github.com/louismullie/stanford-core-nlp
About stanford-core-nlp gem
This gem provides high-level Ruby bindings to the Stanford Core NLP package, a set of natural language processing tools for tokenization, sentence segmentation, part-of-speech tagging, lemmatization, and parsing of English, French, and German. The package also provides named entity recognition and coreference resolution for English.
http://nlp.stanford.edu/software/corenlp.shtml
demo page
http://nlp.stanford.edu:8080/corenlp/
You need to be much more specific about what these "general characteristics" are.
In NLP "general characteristics" of a sentence can mean a million different things - sentiment analysis (ie, the attitude of the speaker), basic part of speech tagging, use of personal pronoun, does the sentence contain active or passive verbs, what's the tense and voice of the verbs...
I don't mind if you're vague about describing it, but if we don't know what you're asking it's highly unlikely we can be specific in helping you.
My general suggestion, especially for NLP, is you should get the tool best designed for the job instead of limiting yourself to a specific language. Limiting yourself to a specific language is fine for some tasks where the general tools are implemented everywhere, but NLP is not one of those.
The other issue in working with Twitter is that a great deal of the sentences there will be half-baked or compressed in strange and wonderful ways, which most NLP tools aren't trained for. To help there, the NUS SMS Corpus consists of "about 10,000 SMS messages collected by students". Due to the similar restrictions and usage, analysing it may be helpful in your explorations with Twitter.
If you're more specific I'll try and list some tools that will help.
I would check out Mark Watson's free book Practical Semantic Web and Linked Data Applications, Java, Scala, Clojure, and JRuby Edition. He has chapters on NLP using Java, Clojure, Ruby, and Scala. He also provides links to the resources you need.
For people looking for something more lightweight and simple to implement this option worked well for me.
https://github.com/yohasebe/engtagger