I'm trying to perform Natural Language Processing (NLP) analysis on source code, and especially on Ruby files. In particular, I want to extract identifiers and comments, considering the structure of the code.
My first attempt used off-the-shelf NLP libraries such as Lucene or spaCy. However, I was not able to remove all the noise coming from keywords, literals, and other syntax that is typical of source code.
My second attempt is to obtain the AST of a piece of code and then extract the relevant parts from it. There are tools and libraries for this in a number of languages, but I have not been able to find anything specific for parsing Ruby code. So far, my main option is to use ANTLR 4 and tailor a Ruby-like grammar (Corundum) so that it also handles OOP constructs.
Is there a more straightforward path to what I'm looking for?
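One possibly more direct route is Ripper, the lexer/parser interface that ships with Ruby's standard library. The following is only a minimal sketch, assuming that identifiers, constants, and comments are the parts of interest. The helper name identifiers_and_comments and the sample source are made up for illustration.

```ruby
require 'ripper'

# Hypothetical helper: collects identifier/constant names and comment text
# from Ruby source, skipping keywords, literals, and operators.
def identifiers_and_comments(source)
  identifiers = []
  comments = []

  # Ripper.lex yields entries shaped like [[line, column], type, token, state].
  Ripper.lex(source).each do |(_position, type, token, _state)|
    case type
    when :on_ident, :on_const
      identifiers << token
    when :on_comment
      # Drop the leading "#" and surrounding whitespace.
      comments << token.sub(/\A#\s*/, '').strip
    end
  end

  { identifiers: identifiers.uniq, comments: comments }
end

# Made-up sample input, for illustration only.
sample = <<~RUBY
  # Computes the total price
  class Order
    def total_price(items)
      items.sum
    end
  end
RUBY

p identifiers_and_comments(sample)
```

The resulting names and comment strings could then be split into words (for example on underscores and camel-case boundaries) before being handed to a regular NLP pipeline.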
My aim is to detect the basic elements of a sentence, such as verbs, nouns, or adjectives. Is there a Ruby gem for that? For example:
They elected him president yesterday.
Output:
["subject", "verb", "object", "predicative", "adverbial"]
These are the only natural language processing options for Ruby that I know of.
Treat
Stanford Core NLP
Open NLP
Interestingly, they are all by the same person.
EDIT
Here is one more option that I found. It's a tutorial on n-gram analysis.
Natural Language Processing with Ruby: n-grams
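To give a rough idea of what n-gram analysis involves, here is a minimal sketch in plain Ruby. It is not taken from the tutorial. The method names ngrams and ngram_counts and the simple word-splitting rule are assumptions made for illustration.

```ruby
# Hypothetical helper: returns the word n-grams of a text as strings.
# Tokenization is deliberately naive (lowercased runs of letters/apostrophes).
def ngrams(text, n)
  words = text.downcase.scan(/[[:alpha:]']+/)
  words.each_cons(n).map { |gram| gram.join(' ') }
end

# Hypothetical helper: counts how often each n-gram occurs.
def ngram_counts(text, n)
  ngrams(text, n).each_with_object(Hash.new(0)) do |gram, counts|
    counts[gram] += 1
  end
end

p ngrams('They elected him president yesterday.', 2)
p ngram_counts('to be or not to be', 2)
```

Frequency tables like the one from ngram_counts are the usual starting point for spotting common phrases in a text.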
I've used engtagger with good success in the past. It's ported from a Perl program called Lingua::EN::Tagger. It takes a bit of work to get it to do what you want, but I think it's the best tool available for this application (at least at the moment).
I am trying to develop a Sinhala (my native language) to English translator.
I am still thinking about the approach.
If I can parse a sentence in my language, I could then use that parse to generate an English sentence with the help of the Stanford Parser. Or is there another method you would recommend?
I am also thinking of writing a bottom-up parser for my language, but I have no idea yet how to implement it. Any suggestions for steps I can follow?
Thanks
Mathee
If you have a large enough bilingual corpus, the best thing you can do is train a model with an existing statistical machine translation system such as Moses. Modern phrase-based SMT systems resemble a reverse parsing mechanism in the sense that they find the most probable combination of target-language phrases for translating a specific source-language sentence. More information in this Wikipedia article.
I've worked with the Xerox toolchain so far, which is powerful, not open source, and a bit overkill for my current problem. Are there libraries that allow me to implement a phrase structure grammar? Preferably in Ruby or Lisp.
AFAIK, there's no open-source Lisp phrase structure parser available.
But since a parser is actually a black box, it's not so hard to make your application work with a parser written in any language, especially as they produce S-expressions as output. For example, with something like pfp you can just pipe your sentences as strings to it, then read and process the resulting trees. Or you can wrap a socket server around it and you'll get a distributed system :)
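As an illustration of the "read and process the resulting trees" step, here is a minimal Ruby sketch that turns a bracketed parse tree into nested arrays. The sample string is hand-written in the usual Penn Treebank bracket style and is not actual output from pfp or any other parser. The method names parse_tree and leaves are made up for this example. How the string is obtained from the external parser (a pipe, a socket, a file) is left out.

```ruby
# Hypothetical helper: parses "(S (NP (PRP They)) ...)" into nested arrays,
# e.g. ["S", ["NP", ["PRP", "They"]], ...].
def parse_tree(text)
  stack = [[]]
  text.scan(/[()]|[^\s()]+/).each do |token|
    case token
    when '('
      stack.push([])
    when ')'
      node = stack.pop
      raise ArgumentError, 'unbalanced ")"' if stack.empty?
      stack.last << node
    else
      stack.last << token
    end
  end
  raise ArgumentError, 'unbalanced "("' unless stack.size == 1
  stack.first.first
end

# Hypothetical helper: returns the words at the leaves, left to right.
def leaves(node)
  return [node] if node.is_a?(String)
  node.drop(1).flat_map { |child| leaves(child) }
end

# Hand-written sample tree, for illustration only.
sample = '(S (NP (PRP They)) (VP (VBD elected) (NP (PRP him)) ' \
         '(NP (NN president)) (NP (NN yesterday))))'

tree = parse_tree(sample)
p tree.first
p leaves(tree)
```

Once the tree is a plain nested array, walking it for particular labels such as NP or VP is ordinary recursion, regardless of which language the parser itself is written in.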
There's also cl-langutils, which may be helpful for some basic NLP tasks such as tokenization and, maybe, POS tagging. But overall, it's much less mature and feature-rich than the commonly used packages, like Stanford's or OpenNLP.
I'm looking to do some sentence analysis (mostly for twitter apps) and infer some general characteristics. Are there any good natural language processing libraries for this sort of thing in Ruby?
This is similar to "Is there a good natural language processing library", but for Ruby. I'd prefer something very general, but any leads are appreciated!
Three excellent and mature NLP packages are Stanford Core NLP, Open NLP and LingPipe. There are Ruby bindings to the Stanford Core NLP tools (GPL license) as well as the OpenNLP tools (Apache License).
On the more experimental side of things, I maintain a Text Retrieval, Extraction and Annotation Toolkit (Treat), released under the GPL, that provides a common API for almost every NLP-related gem that exists for Ruby. The following list of Treat's features can also serve as a good reference in terms of stable natural language processing gems compatible with Ruby 1.9.
Text segmenters and tokenizers (punkt-segmenter, tactful_tokenizer, srx-english, scalpel)
Natural language parsers for English, French and German and named entity extraction for English (stanford-core-nlp).
Word inflection and conjugation (linguistics), stemming (ruby-stemmer, uea-stemmer, lingua, etc.)
WordNet interface (rwordnet), POS taggers (rbtagger, engtagger, etc.)
Language (whatlanguage), date/time (chronic, kronic, nickel), keyword (lda-ruby) extraction.
Text retrieval with indexation and full-text search (ferret).
Named entity extraction (stanford-core-nlp).
Basic machine learning with decision trees (decisiontree), MLPs (ruby-fann), SVMs (rb-libsvm) and linear classification (tomz-liblinear-ruby-swig).
Text similarity metrics (levenshtein-ffi, fuzzy-string-match, tf-idf-similarity).
Not included in Treat, but relevant to NLP: hotwater (string distance algorithms), yomu (bindings to Apache Tika for reading .doc, .docx, .pages, .odt, .rtf, .pdf), graph-rank (an implementation of GraphRank).
There are some things at Ruby Linguistics and some links therefrom, though it doesn't seem anywhere close to what NLTK is for Python, yet.
You can always use JRuby and call the Java libraries.
EDIT: The ability to run Ruby natively on the JVM and easily leverage Java libraries is a big plus for Rubyists. This is a good option that should be considered in a situation like this.
Which NLP toolkit to use in JAVA?
I found an excellent article detailing some NLP algorithms in Ruby here. This includes stemmers, date time parsers and grammar parsers.
TREAT – the Text REtrieval and Annotation Toolkit – is the most comprehensive toolkit I know of for Ruby: https://github.com/louismullie/treat/wiki/
I maintain a list of Ruby Natural Language Processing resources (libraries, APIs, and presentations) on GitHub that covers the libraries listed in the other answers here as well as some additional libraries.
Also consider using SaaS APIs like MonkeyLearn. You can easily train text classifiers with machine learning and integrate via an API. There's a Ruby SDK available.
Besides creating your own classifiers, you can pick pre-created modules for sentiment analysis, topic classification, language detection and more.
We also have extractors like keyword extraction and entities, and we'll keep adding more public modules.
Other nice features:
You have a GUI to create/test algorithms.
Algorithms run really fast in our cloud computing platform.
You can integrate with Ruby or any other programming language.
Try this one
https://github.com/louismullie/stanford-core-nlp
About stanford-core-nlp gem
This gem provides high-level Ruby bindings to the Stanford Core NLP package, a set of natural language processing tools for tokenization, sentence segmentation, part-of-speech tagging, lemmatization, and parsing of English, French and German. The package also provides named entity recognition and coreference resolution for English.
http://nlp.stanford.edu/software/corenlp.shtml
demo page
http://nlp.stanford.edu:8080/corenlp/
You need to be much more specific about what these "general characteristics" are.
In NLP, "general characteristics" of a sentence can mean a million different things: sentiment analysis (i.e., the attitude of the speaker), basic part-of-speech tagging, use of personal pronouns, whether the sentence contains active or passive verbs, the tense and voice of the verbs...
I don't mind if you're vague about describing it, but if we don't know what you're asking it's highly unlikely we can be specific in helping you.
My general suggestion, especially for NLP, is you should get the tool best designed for the job instead of limiting yourself to a specific language. Limiting yourself to a specific language is fine for some tasks where the general tools are implemented everywhere, but NLP is not one of those.
The other issue in working with Twitter is that a great deal of the sentences there will be half-baked or compressed in strange and wonderful ways, which most NLP tools aren't trained for. To help there, the NUS SMS Corpus consists of "about 10,000 SMS messages collected by students". Due to the similar restrictions and usage, analysing that may be helpful in your explorations with Twitter.
If you're more specific I'll try and list some tools that will help.
I would check out Mark Watson's free book Practical Semantic Web and Linked Data Applications, Java, Scala, Clojure, and JRuby Edition. He has chapters on NLP using Java, Clojure, Ruby, and Scala. He also provides links to the resources you need.
For people looking for something more lightweight and simple to implement, this option worked well for me.
https://github.com/yohasebe/engtagger