Natural Language Processing in Ruby [closed] - ruby

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I'm looking to do some sentence analysis (mostly for twitter apps) and infer some general characteristics. Are there any good natural language processing libraries for this sort of thing in Ruby?
Similar to Is there a good natural language processing library but for Ruby. I'd prefer something very general, but any leads are appreciated!

Three excellent and mature NLP packages are Stanford Core NLP, Open NLP and LingPipe. There are Ruby bindings to the Stanford Core NLP tools (GPL license) as well as the OpenNLP tools (Apache License).
On the more experimental side of things, I maintain a Text Retrieval, Extraction and Annotation Toolkit (Treat), released under the GPL, that provides a common API for almost every NLP-related gem that exists for Ruby. The following list of Treat's features can also serve as a good reference in terms of stable natural language processing gems compatible with Ruby 1.9.
Text segmenters and tokenizers (punkt-segmenter, tactful_tokenizer, srx-english, scalpel)
Natural language parsers for English, French and German and named entity extraction for English (stanford-core-nlp).
Word inflection and conjugation (linguistics), stemming (ruby-stemmer, uea-stemmer, lingua, etc.)
WordNet interface (rwordnet), POS taggers (rbtagger, engtagger, etc.)
Language (whatlanguage), date/time (chronic, kronic, nickel), keyword (lda-ruby) extraction.
Text retrieval with indexation and full-text search (ferret).
Named entity extraction (stanford-core-nlp).
Basic machine learning with decision trees (decisiontree), MLPs (ruby-fann), SVMs (rb-libsvm) and linear classification (tomz-liblinear-ruby-swig).
Text similarity metrics (levenshtein-ffi, fuzzy-string-match, tf-idf-similarity).
Not included in Treat, but relevant to NLP: hotwater (string distance algorithms), yomu (bindings to Apache Tika for reading .doc, .docx, .pages, .odt, .rtf, .pdf), graph-rank (an implementation of GraphRank).
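To give a feel for what the string-distance gems mentioned above compute, here is a minimal pure-Ruby sketch of Levenshtein edit distance. It is not the code of levenshtein-ffi or hotwater (those wrap native implementations); the method name `levenshtein` is just an illustrative choice.

```ruby
# Levenshtein edit distance via dynamic programming, keeping one row at a time.
def levenshtein(a, b)
  b_chars = b.chars
  # prev[j] = distance between the empty prefix of a and the first j chars of b
  prev = (0..b_chars.length).to_a
  a.chars.each_with_index do |ca, i|
    curr = [i + 1]
    b_chars.each_with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j + 1] + 1,     # deletion
               curr[j] + 1,         # insertion
               prev[j] + cost].min  # substitution (or match)
    end
    prev = curr
  end
  prev.last
end

puts levenshtein("kitten", "sitting")  # prints 3
```

For anything beyond toy inputs, a native-backed gem will likely be much faster than this.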

There are some things at Ruby Linguistics and some links therefrom, though it doesn't seem anywhere close to what NLTK is for Python, yet.

You can always use JRuby and call the Java libraries.
EDIT: The ability to run Ruby natively on the JVM and easily leverage Java libraries is a big plus for Rubyists. This is a good option that should be considered in a situation like this.
See: Which NLP toolkit to use in JAVA?

I found an excellent article detailing some NLP algorithms in Ruby here. This includes stemmers, date time parsers and grammar parsers.

TREAT – the Text REtrieval and Annotation Toolkit – is the most comprehensive toolkit I know of for Ruby: https://github.com/louismullie/treat/wiki/

I maintain a list of Ruby Natural Language Processing resources (libraries, APIs, and presentations) on GitHub that covers the libraries listed in the other answers here as well as some additional libraries.

Also consider using SaaS APIs like MonkeyLearn. You can easily train text classifiers with machine learning and integrate via an API. There's a Ruby SDK available.
Besides creating your own classifiers, you can pick pre-created modules for sentiment analysis, topic classification, language detection and more.
We also have extractors like keyword extraction and entities, and we'll keep adding more public modules.
Other nice features:
You have a GUI to create/test algorithms.
Algorithms run really fast in our cloud computing platform.
You can integrate with Ruby or any other programming language.

Try this one
https://github.com/louismullie/stanford-core-nlp
About stanford-core-nlp gem
This gem provides high-level Ruby bindings to the Stanford Core NLP package, a set of natural language processing tools for tokenization, sentence segmentation, part-of-speech tagging, lemmatization, and parsing of English, French and German. The package also provides named entity recognition and coreference resolution for English.
http://nlp.stanford.edu/software/corenlp.shtml
demo page
http://nlp.stanford.edu:8080/corenlp/

You need to be much more specific about what these "general characteristics" are.
In NLP, "general characteristics" of a sentence can mean a million different things: sentiment analysis (i.e., the attitude of the speaker), basic part-of-speech tagging, use of personal pronouns, whether the sentence contains active or passive verbs, the tense and voice of the verbs...
I don't mind if you're vague about describing it, but if we don't know what you're asking it's highly unlikely we can be specific in helping you.
My general suggestion, especially for NLP, is you should get the tool best designed for the job instead of limiting yourself to a specific language. Limiting yourself to a specific language is fine for some tasks where the general tools are implemented everywhere, but NLP is not one of those.
The other issue in working with Twitter is a great deal of the sentences there will be half baked or compressed in strange and wonderful ways - which most NLP tools aren't trained for. To help there, the NUS SMS Corpus consists of "about 10,000 SMS messages collected by students". Due to the similar restrictions and usage, analysing that may be helpful in your explorations with Twitter.
If you're more specific I'll try and list some tools that will help.

I would check out Mark Watson's free book Practical Semantic Web and Linked Data Applications, Java, Scala, Clojure, and JRuby Edition. He has chapters on NLP using Java, Clojure, Ruby, and Scala. He also provides links to the resources you need.

For people looking for something more lightweight and simple to implement, this option worked well for me.
https://github.com/yohasebe/engtagger

Related

Tutorials For Natural Language Processing [closed]

Closed 6 years ago.
I recently took a class on Coursera about "Natural Language Processing" and learnt a lot about parsing, IR and other interesting aspects like Q&A. I grasped the concepts well, but I did not get any practical knowledge. Can anyone suggest good online tutorials or books for Natural Language Processing?
Thanks
You could read Jurafsky and Martin's Speech and Language Processing (2008 edition), which is the standard textbook in the field. It's long, and has a variety of topics, so I'd suggest reading just the chapters that really apply to your interests.
Further, the best way to learn is almost certainly to actually implement NLP algorithms from scratch. You could pick some standard tasks (language modeling, text classification, POS-tagging, NER, parsing) and implement various algorithms from the ground up (ngram models, HMMs, Naive Bayes, MaxEnt, CKY) to really understand what makes them work. It also shouldn't be too hard to find some free dataset to test your implementations on.
Finally, there are lots of tutorials out there for specific NLP algorithms that are excellent. For example, if you want to build an HMM, I suggest Jason Eisner's tutorial which also covers smoothing and unsupervised training with EM. If you want to implement Gibbs sampling for unsupervised Naive Bayes training, I suggest Philip Resnik's tutorial.
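As one small instance of the "implement it from the ground up" advice, here is a sketch of a bigram language model with add-one (Laplace) smoothing. It is a toy under simple assumptions (whitespace tokenization, `<s>`/`</s>` boundary markers, the class name `BigramModel` is made up), not the method from any of the tutorials mentioned.

```ruby
# Bigram language model with add-one (Laplace) smoothing.
class BigramModel
  def initialize(sentences)
    @unigrams = Hash.new(0)
    @bigrams = Hash.new(0)
    sentences.each do |sentence|
      tokens = ["<s>"] + sentence.downcase.split + ["</s>"]
      tokens.each { |t| @unigrams[t] += 1 }
      tokens.each_cons(2) { |pair| @bigrams[pair] += 1 }
    end
    @vocab_size = @unigrams.size
  end

  # P(word | prev) = (count(prev, word) + 1) / (count(prev) + V)
  def probability(prev, word)
    (@bigrams[[prev, word]] + 1.0) / (@unigrams[prev] + @vocab_size)
  end
end

model = BigramModel.new(["the cat sat", "the dog sat"])
puts model.probability("the", "cat")  # seen bigram
puts model.probability("cat", "dog")  # unseen bigram still gets some probability mass
```

Add-one smoothing is the crudest option; the smoothing methods covered in the textbooks above generally work better on real corpora.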
Aside from Jurafsky and Martin's book, Christopher D. Manning and Hinrich Schütze's Foundations of Statistical Natural Language Processing is also widely used. For IR, Manning et al. also wrote Introduction to Information Retrieval which can be read or downloaded online at their site.
If you want practical knowledge of how to work with natural language, you should start implementing things.
I suggest using NLTK (Natural Language Toolkit) with Python. It's easy to implement NLP in Python.
You can refer to this link:
http://nltk.org/
Or you can try it online on
http://cst.dk/online/pos_tagger/uk/
Instead of reading a specific book, diving into the sea of papers might be just as good an idea. http://www.aclweb.org, for example, covers many topics in NLP. Through those papers, you get references to more papers, some of which are the foundations of a certain branch of NLP. And because they were written by different authors, you are unlikely to be influenced too much by one point of view.
If you are a Java developer, there is an extensive list of tutorials on how to build components of NLP systems using LingPipe at http://alias-i.com/lingpipe/demos/tutorial/read-me.html. Full disclosure: I wrote some of those tutorials and one of the books below.
There are a few books that are more industrially oriented:
1) Natural Language Processing with Java by Richard M Reese
This covers how to do some common tasks with a range of open source toolkits (including LingPipe).
2) Natural Language Processing with Java and LingPipe Cookbook by Breck Baldwin and Krishna Dayanidhi
This book is task driven at the level of "get the component built" and covers the major technologies driving most NLP systems that are text driven. It does not cover translation. It goes into more detail than the first book and has broader coverage than the LingPipe tutorials but is sometimes less detailed than the tutorials.
Breck
There is a hub for teaching and learning materials called TeLeMaCo. You can find resources for many aspects of NLP, and you can easily add more materials that you have found on the web.

Sentiment Analysis of given text

This topic has many threads, but I am posting another one. Each of those posts may describe a way to do sentiment analysis, but I found none that I could follow.
I want to implement sentiment analysis myself, so I would like to be shown a way. During my research, I found what seems to be the usual approach: I guess a Bayesian algorithm is used to count positive and negative words and to calculate the probability of a sentence being positive or negative using a bag of words.
That only covers the words; I guess we have to do language processing too. Does anyone have more knowledge of this? If so, can you point me to some algorithms, with links for reference, that I can implement? Anything in particular that may help my analysis is welcome.
Also, can you recommend a language to work with? Some say Java is comparatively time-consuming, so they don't recommend it.
Any type of help is much appreciated.
First of all, sentiment analysis is done on various levels, such as document, sentence, phrase, and feature level. Which one are you working on? There are many different approaches to each of them. You can find a very good intro to this topic here. For machine-learning approaches, the most important element is feature engineering and it's not limited to bag of words. You can find many other useful features in different applications from the tutorial I linked. What language processing you need to do depends on what features you want to use. You may need POS-tagging if POS information is needed for your features for example.
For classifiers, you can try Support Vector Machines, Maximum Entropy, and Naive Bayes (probably as a baseline). These are frequently used in the literature, and you can find a pretty comprehensive list of that literature in the link. The Mallet toolkit contains ME and NB, and if you use SVMlight, you can easily convert the feature formats to the Mallet format with a function. Of course there are many other implementations of these classifiers.
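To make the Naive Bayes baseline concrete, here is a minimal bag-of-words sketch with add-one smoothing. It is a toy with a hypothetical `NaiveBayes` class and a four-sentence training set, not what Mallet or any other toolkit does internally.

```ruby
# Multinomial Naive Bayes over bag-of-words features, with add-one smoothing.
class NaiveBayes
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) }  # label => word => count
    @doc_counts = Hash.new(0)                              # label => number of documents
    @vocab = {}
  end

  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |w|
      @word_counts[label][w] += 1
      @vocab[w] = true
    end
  end

  # Returns the label with the highest log posterior for the given text.
  def classify(text)
    total_docs = @doc_counts.values.sum
    scores = @doc_counts.keys.map do |label|
      total_words = @word_counts[label].values.sum
      score = Math.log(@doc_counts[label].to_f / total_docs)  # log prior
      tokenize(text).each do |w|
        score += Math.log((@word_counts[label][w] + 1.0) / (total_words + @vocab.size))
      end
      [label, score]
    end
    scores.max_by { |_, s| s }.first
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

nb = NaiveBayes.new
nb.train(:pos, "I love this great movie")
nb.train(:pos, "what a wonderful happy day")
nb.train(:neg, "I hate this terrible movie")
nb.train(:neg, "what an awful sad day")
puts nb.classify("a great and wonderful film")
```

With real data you would train on a labelled corpus and likely add better features (negation handling, bigrams, POS-based features), as the answer suggests.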
For rule-based methods, Pointwise Mutual Information is frequently used, along with various scoring-based methods.
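As a rough illustration of Pointwise Mutual Information, the sketch below estimates PMI(x, y) = log2(P(x, y) / (P(x) P(y))) from document-level co-occurrence. The `pmi` method, the tokenizer, and the tiny corpus are assumptions for the example; a well-known use is to compare a phrase's PMI with a positive seed word against its PMI with a negative one.

```ruby
# PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), where each probability is the
# fraction of documents containing the word(s). Returns nil if x and y never co-occur.
def pmi(docs, x, y)
  sets = docs.map { |d| d.downcase.scan(/[a-z']+/).uniq }
  n = sets.size.to_f
  px = sets.count { |s| s.include?(x) } / n
  py = sets.count { |s| s.include?(y) } / n
  pxy = sets.count { |s| s.include?(x) && s.include?(y) } / n
  return nil if pxy.zero?
  Math.log2(pxy / (px * py))
end

docs = ["excellent great film", "excellent acting", "poor plot", "poor great effort"]
puts pmi(docs, "poor", "plot")
```

A positive score means the two words co-occur more often than independence would predict, and a negative score means less often.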
Hope this helps.
For text analysis there is no language stronger than SNOBOL. In SNOBOL-4, a Fortran interpreter, for example, takes only 60 lines.
NLTK offers a really good algorithm for sentiment analysis. It is open source, so you can have a look at the source code and check out the algorithm used. You can even download the NLTK book, which is free and has some good material on sentiment analysis.
Coming to your second point, I don't think Java is that slow. I have been coding in C++ for years, but lately I also started with Java, since a lot of very popular open source software like Lucene, Solr, Hadoop and Neo4j is written in Java.

Ruby Text Analysis

Is there any Ruby gem or else for text analysis? Word frequency, pattern detection and so forth (preferably with an understanding of french)
The generalization of word frequencies is Language Models, e.g. unigrams (= single word frequency), bigrams (= frequency of word pairs), trigrams (= frequency of word triples), ..., in general: n-grams.
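For small inputs, counting n-grams needs nothing beyond core Ruby. The sketch below is only an illustration (the `ngram_counts` name and the `\p{L}+` tokenizer are my assumptions); matching on Unicode letters keeps accented French words in one piece.

```ruby
# Count n-grams (as arrays of n consecutive words) in a text.
def ngram_counts(text, n)
  tokens = text.downcase.scan(/\p{L}+/)  # runs of Unicode letters, so "été" is one token
  counts = Hash.new(0)
  tokens.each_cons(n) { |gram| counts[gram] += 1 }
  counts
end

text = "le chat dort et le chat mange"
p ngram_counts(text, 1)  # unigram (single word) frequencies
p ngram_counts(text, 2)  # bigram (word pair) frequencies
```

This is fine for exploration. For huge corpora the dedicated toolkits are the better fit.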
You should look for an existing toolkit for Language Models — not a good idea to re-invent the wheel here.
There are a few standard toolkits available, e.g. from the CMU Sphinx team, and also HTK.
These toolkits are typically written in C (for speed, because you have to process huge corpora) and generate n-gram files in the standard ARPA output format (typically a text format).
Check the following thread, which contains more details and links:
Building openears compatible language model
Once you have generated your Language Model with one of these toolkits, you will need either a Ruby gem that makes the language model accessible in Ruby, or you need to convert the ARPA format into your own format.
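If you go the conversion route, an ARPA file is simple enough to read by hand. The sketch below assumes the common layout: `\N-grams:` section headers, one entry per line with a log10 probability, the N words, and an optional backoff weight, ending with `\end\`. The `parse_arpa` name and the sample data are invented for illustration, so check the output of your toolkit before relying on it.

```ruby
# Parse ARPA-format text into { order => { [words] => { logprob:, backoff: } } }.
def parse_arpa(text)
  model = Hash.new { |h, k| h[k] = {} }
  order = nil
  text.each_line do |line|
    line = line.strip
    next if line.empty?
    if line =~ /\A\\(\d+)-grams:\z/
      order = $1.to_i                 # entering the section for N-grams
    elsif line == "\\end\\"
      break
    elsif order                       # header lines before the first section are skipped
      fields = line.split
      backoff = fields[order + 1]     # optional trailing backoff weight
      model[order][fields[1, order]] = {
        logprob: fields[0].to_f,
        backoff: backoff && backoff.to_f
      }
    end
  end
  model
end

# Made-up sample data, only to exercise the parser.
sample = [
  "\\data\\", "ngram 1=3", "ngram 2=2", "",
  "\\1-grams:",
  "-0.4771\t<s>\t-0.3010",
  "-0.4771\tthe\t-0.3010",
  "-0.4771\t</s>", "",
  "\\2-grams:",
  "-0.3010\t<s> the",
  "-0.3010\tthe </s>", "",
  "\\end\\"
].join("\n")

p parse_arpa(sample)[2]
```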
adi92's post lists some more Ruby NLP resources.
You can also Google for "ARPA Language Model" for more info
Last but not least, check Google's online N-gram tool. They built n-grams based on the books they digitized, and these are also available in French and other languages!
The Mendicant Bug: NLP Resources for Ruby
contains lots of useful Ruby NLP links.
I had tried using the Ruby Linguistics stuff a long time ago, and remember having a lot of problems with it... I don't recommend jumping into that.
If most of your text analysis involves stuff like counting ngrams and naive Bayes, I recommend just doing it on your own. Ruby has pretty good basic libraries and awesome support for regexes, so this should not be that tricky, and it will be easier for you to adapt stuff to the idiosyncrasies of the problem you are trying to solve.
As with the Stanford parser gem, it's possible to use Java libraries that solve your problem from within Ruby, but this can be tricky, so it's probably not the best way to solve the problem.
I wrote the gem words_counted for this reason. You can see a demo on rubywordcount.com. It has a lot of the analysis features you mention, and a host more. The API is well documented and can be found in the readme on Github.

Data structures for bioinformatics [closed]

Closed 5 years ago.
What are some data structures that should be known by somebody involved in bioinformatics? I guess that anyone is supposed to know about lists, hashes, balanced trees, etc., but I expect that there are domain specific data structures. Is there any book devoted to this subject?
The most fundamental data structure used in bioinformatics is the string. There is also a whole range of different data structures representing strings, and algorithms like string matching depend on these efficient representations.
A comprehensive work on this is Dan Gusfield's Algorithms on Strings, Trees and Sequences
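As one example of such a string representation, here is a naive suffix array sketch with a binary-search lookup. The construction simply sorts all suffixes, which is fine for short strings but far slower than the algorithms real tools use; the method names are illustrative.

```ruby
# Suffix array: start positions of all suffixes, in lexicographic order of the suffixes.
def suffix_array(text)
  (0...text.length).sort_by { |i| text[i..-1] }
end

# All start positions of pattern in text, found by binary search over the suffix array.
def find_occurrences(text, sa, pattern)
  m = pattern.length
  # Index of the first suffix whose length-m prefix is >= pattern.
  lo = sa.bsearch_index { |i| text[i, m] >= pattern }
  return [] if lo.nil?
  hits = []
  sa[lo..-1].each do |i|
    break unless text[i, m] == pattern  # matching suffixes are contiguous in the array
    hits << i
  end
  hits.sort
end

dna = "ACGTACGT"
sa = suffix_array(dna)
p find_occurrences(dna, sa, "CGT")  # prints [1, 5]
```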
A lot of introductory books on bioinformatics will cover some of the basic structures you'd use. I'm not sure what the standard textbook is, but I'm sure you can find that. It might be useful to look at some of the language-specific books:
Bioinformatics Programming With Python
Beginning Perl for Bioinformatics
I chose those two as examples because they're published by O'Reilly, which, in my experience, publishes good quality books.
I just so happen to have the Python book on my hard drive, and a great deal of it talks about processing strings for bioinformatics using Python. It doesn't seem like bioinformatics uses any fancy special data structures, just existing ones.
Spatial data structures such as kd-trees, for example, are often used for nearest-neighbor queries over arbitrary feature vectors as well as for 3D protein structure analysis.
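For the curious, a kd-tree nearest-neighbor query can be sketched in a few lines. This is a simplified illustration (median split by re-sorting at each level, squared Euclidean distance, made-up names), not a tuned implementation.

```ruby
KdNode = Struct.new(:point, :axis, :left, :right)

# Build a kd-tree by splitting on the median along one axis per level.
def build_kdtree(points, depth = 0)
  return nil if points.empty?
  axis = depth % points.first.length
  sorted = points.sort_by { |p| p[axis] }
  mid = sorted.length / 2
  KdNode.new(sorted[mid], axis,
             build_kdtree(sorted[0...mid], depth + 1),
             build_kdtree(sorted[(mid + 1)..-1], depth + 1))
end

def sq_dist(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Return the stored point closest to target.
def nearest(node, target, best = nil)
  return best if node.nil?
  best = node.point if best.nil? || sq_dist(node.point, target) < sq_dist(best, target)
  diff = target[node.axis] - node.point[node.axis]
  near, far = diff < 0 ? [node.left, node.right] : [node.right, node.left]
  best = nearest(near, target, best)
  # Only search the far side if the splitting plane is closer than the best match so far.
  best = nearest(far, target, best) if diff**2 < sq_dist(best, target)
  best
end

tree = build_kdtree([[2, 3], [5, 4], [9, 6], [4, 7], [8, 1], [7, 2]])
p nearest(tree, [9, 2])  # prints [8, 1]
```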
Best book for your $$ is Understanding Bioinformatics by Zvelebil because it covers everything from sequence analysis to structure comparison.
In addition to basic familiarity with the structures you mentioned, suffix trees (and suffix arrays), de Bruijn graphs, and interval graphs are used extensively. The Handbook of Computational Molecular Biology is very well written. I've never read the whole thing, but I've used it as a reference.
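To show what a de Bruijn graph looks like in code, here is a minimal sketch that builds one from reads: each k-mer contributes an edge from its (k-1)-length prefix to its (k-1)-length suffix. The method name and the plain Hash-of-arrays representation are choices made for the example.

```ruby
# Build a de Bruijn graph: nodes are (k-1)-mers, and every k-mer in a read
# adds an edge from its prefix to its suffix (repeated k-mers add repeated edges).
def de_bruijn_graph(reads, k)
  graph = Hash.new { |h, key| h[key] = [] }
  reads.each do |read|
    (0..read.length - k).each do |i|
      kmer = read[i, k]
      graph[kmer[0, k - 1]] << kmer[1, k - 1]
    end
  end
  graph
end

p de_bruijn_graph(["ACGT", "CGTA"], 3)
```

Assemblers then look for paths through this graph. Real implementations use much more compact representations than a Ruby Hash.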
I also highly recommend this book, http://www.comp.nus.edu.sg/~ksung/algo_in_bioinfo/
And more recently, Python is much more frequently used in bioinformatics than Perl. So I really suggest you start with Python; it is widely used in my projects.
Many projects in bioinformatics involve combining information from different, semi-structured sources. RDF and ontologies are essential for much of this. See, for example, the bio2RDF project. http://bio2rdf.org/. A good understanding of identifiers is valuable.
Much bioinformatics is exploratory and rapid lightweight tools are often used. See workflow tools such as Taverna where the primary resource is often a set of web services - so HTTP/REST are common.
Whatever your mathematical or computational expertise is, you are likely to find an application in computational biology. If not, ask another question on Stack Overflow and you'll be helped :o)
As mentioned in the other answers, somewhat timeless are string comparisons and pattern discovery in 1-dimensional data since sequences are so easy to get. With a renewed interest in medical informatics though you also have two/three-dimensional image analysis that you run e.g. against genomic data. With molecular biochemistry you also have pattern searches on 3D surfaces and molecular simulations. To study drug effects you will work with gene networks and compare those across tissues. Typical challenges for big data and information integration apply. And then, you need statistical descriptions of the likelihood of a pattern or the clinical association of any features identified to be found by chance.

DSLs (Domain Specific Programming Languages) implemented using different GPLs (General Purpose Programming Languages)

I am looking for DSLs implemented using general purpose programming languages (GPLs), e.g., C#, Java, Scala and so on. The primary goal is to survey various important attributes of well-designed DSL implementations used on a daily basis in the software industry.
I would highly appreciate if you could point me out such DSL implementations (examples or repositories) and state your reasons why you consider it to be a good DSL.
Thank you,
Adil Akhter
EDIT 1:
IMHO, this post can contribute to a listing of interesting, prevailing DSLs used extensively in today’s software development (after searching, I could not find any such listing covering all the GPLs).
One of the several inherent benefits of this listing is that it can be used to create a taxonomy of the DSLs and the domains they target.
The following related links describe some interesting DSLs and tools:
DSLs( categorized by GPLs):
Ruby DSLs => Ruby DSL (Domain Specific Language) repositories, examples
Clojure DSLs => Are there any Clojure DSLs?
Scala DSLs => Interesting DSLs, Implemented in Scala?
C# DSLs => SharpDOM ( http://sharpdom.codeplex.com )
Tools:
Microsoft Visual Studio Visualization and Modeling SDK : http://code.msdn.microsoft.com/vsvmsdk
Take a look at boost.spirit2 to find a very complex DSL in a mainstream language. Otherwise, you could look at any dialect of Lisp, which makes it very easy to write DSLs, so you will find lots of them.
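To make "a DSL implemented in a general purpose language" concrete, here is a tiny internal DSL sketch in Ruby, one of the host languages listed in the question. Everything in it (`RouteMap`, `draw`, `get`, `post`) is hypothetical and only loosely modelled on routing-style DSLs; it relies on `instance_eval` so that the block's bare method calls run against the builder object.

```ruby
# A toy internal DSL: the block passed to RouteMap.draw reads like a small
# routing language, but it is ordinary Ruby evaluated against a RouteMap instance.
class RouteMap
  attr_reader :routes

  def initialize
    @routes = []
  end

  def get(path, to:)
    @routes << [:get, path, to]
  end

  def post(path, to:)
    @routes << [:post, path, to]
  end

  def self.draw(&block)
    map = new
    map.instance_eval(&block)  # run the block with self set to the new map
    map
  end
end

map = RouteMap.draw do
  get "/posts", to: "posts#index"
  post "/posts", to: "posts#create"
end

p map.routes
```

The trade-off with this style is that the DSL inherits the host language's syntax and error messages, which is part of what a survey of well-designed DSLs would want to compare.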
You might consider non-procedural techniques for implementing DSLs, such as (our) program transformation system. I think you will find them surprisingly powerful.

Resources