Ruby Text Analysis

Is there any Ruby gem or other tool for text analysis? Word frequency, pattern detection and so forth (preferably with an understanding of French)?

The generalization of word frequencies is language models, e.g. uni-grams (= single-word frequencies), bi-grams (= frequencies of word pairs), tri-grams (= frequencies of word triples), ..., in general: n-grams.
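For small corpora, counting uni-grams and bi-grams takes only a few lines of plain Ruby; a minimal sketch (the French sample text is made up, and `\p{L}` is used so accented letters tokenize correctly):

```ruby
# Count uni-grams and bi-grams in a (tiny) corpus.
text = "le chat mange, le chat dort"

# \p{L} matches any Unicode letter, so accented French words stay intact.
tokens = text.downcase.scan(/[\p{L}']+/)

unigrams = tokens.tally
bigrams  = tokens.each_cons(2).map { |pair| pair.join(" ") }.tally

unigrams["le"]     # => 2
bigrams["le chat"] # => 2
```

For a serious corpus a toolkit is still the right call, as the answer says, but this is often enough for quick word-frequency questions. Note that `tally` needs Ruby 2.7+.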
You should look for an existing toolkit for Language Models — not a good idea to re-invent the wheel here.
There are a few standard toolkits available, e.g. from the CMU Sphinx team, and also HTK.
These toolkits are typically written in C (for speed, since you have to process huge corpora) and generate output in the standard ARPA n-gram file format (typically a text format).
Check the following thread, which contains more details and links:
Building openears compatible language model
Once you've generated your language model with one of these toolkits, you will either need a Ruby gem that makes the language model accessible from Ruby, or you will need to convert the ARPA format into your own format.
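Converting the ARPA format yourself is little work, since it is a simple line-oriented text format; a minimal reader sketch (the tiny inlined model is made up, and real files may contain extra header material this ignores):

```ruby
# Read an ARPA-format n-gram file into { order => { "w1 w2" => logprob } }.
arpa = <<~ARPA
  \\data\\
  ngram 1=2
  ngram 2=1

  \\1-grams:
  -0.30103 le -0.1
  -0.60206 chat

  \\2-grams:
  -0.30103 le chat

  \\end\\
ARPA

ngrams = Hash.new { |h, k| h[k] = {} }
order = nil
arpa.each_line do |line|
  line = line.strip
  case line
  when /\A\\(\d+)-grams:\z/ then order = Regexp.last_match(1).to_i
  when /\A\\/ then order = nil # the \data\ and \end\ sections
  else
    next if order.nil? || line.empty?
    fields  = line.split
    logprob = fields[0].to_f
    words   = fields[1, order].join(" ")
    # a trailing extra field, if present, would be the backoff weight (ignored here)
    ngrams[order][words] = logprob
  end
end

ngrams[2]["le chat"] # => -0.30103
```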
adi92's post lists some more Ruby NLP resources.
You can also Google for "ARPA Language Model" for more info.
Last but not least, check Google's online N-gram tool. They built n-grams based on the books they digitized; it is also available for French and other languages!

The Mendicant Bug: NLP Resources for Ruby
contains lots of useful Ruby NLP links.
I had tried using the Ruby Linguistics stuff a long time ago, and remember having a lot of problems with it... I don't recommend jumping into that.
If most of your text analysis involves stuff like counting ngrams and naive Bayes, I recommend just doing it on your own. Ruby has pretty good basic libraries and awesome support for regexes, so this should not be that tricky, and it will be easier for you to adapt stuff to the idiosyncrasies of the problem you are trying to solve.
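To make the "do it on your own" point concrete, here is a sketch of word counting plus a multinomial naive Bayes classifier in plain Ruby; the toy French training sentences, the lambda-based shape, and the add-one smoothing are my own assumptions, and class priors are omitted because the toy classes are balanced:

```ruby
# Toy multinomial naive Bayes sentiment classifier with add-one smoothing.
training = {
  positive: ["j'aime ce film", "vraiment super film"],
  negative: ["je déteste ce film", "vraiment nul"]
}

# \p{L} keeps accented letters, so French tokenizes correctly.
tokenize = ->(s) { s.downcase.scan(/[\p{L}']+/) }

counts = {}          # class => { word => count }
totals = Hash.new(0) # class => total token count
training.each do |label, docs|
  counts[label] = Hash.new(0)
  docs.each do |doc|
    tokenize.call(doc).each do |w|
      counts[label][w] += 1
      totals[label]    += 1
    end
  end
end
vocab = counts.values.flat_map(&:keys).uniq.size

# Pick the class maximizing the sum of smoothed log-likelihoods.
classify = ->(s) do
  counts.keys.max_by do |label|
    tokenize.call(s).sum do |w|
      Math.log((counts[label][w] + 1.0) / (totals[label] + vocab))
    end
  end
end

classify.call("super film") # => :positive (with this toy data)
```

Thirty-odd lines covers the whole pipeline, which is the answer's point: for counting and naive Bayes, the hand-rolled version stays easy to adapt to your problem's idiosyncrasies.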
As with the Stanford parser gem, it's possible to use Java libraries that solve your problem from within Ruby, but this can be tricky, so it's probably not the best way to solve the problem.

I wrote the gem words_counted for this reason. You can see a demo on rubywordcount.com. It has a lot of the analysis features you mention, and a host more. The API is well documented and can be found in the readme on GitHub.

Related

Analysis of Ruby code

I'm trying to perform Natural Language Processing (NLP) analysis on source code, and especially on Ruby files. In particular, I want to extract identifiers and comments, considering the structure of the code.
My first attempt was using off-the-shelf NLP libraries, such as Lucene or spaCy. However, I was not able to remove all the noise coming from keywords, literals, and the other typical elements of source code.
My second attempt is to obtain the AST of a given piece of code and then extract certain parts. There are multiple tools and libraries for a number of languages, but I am not able to find anything specific for parsing Ruby code. So far, my main option is using ANTLR 4 and tailoring a Ruby-like grammar (Corundum) to also work with OOP.
Is there a more straightforward path to what I'm looking for?
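One possibly more straightforward path for Ruby source specifically: the standard library ships its own lexer/parser, Ripper, so no external grammar is needed to pull out identifiers and comments. A minimal sketch (the sample snippet is made up):

```ruby
require 'ripper'

# Lex Ruby source and keep only identifiers and comments,
# dropping keywords, literals, and punctuation.
src = <<~'RUBY'
  # Greets the given user
  def greet(user)
    puts "hello, #{user}"
  end
RUBY

tokens   = Ripper.lex(src) # entries: [[line, col], event, token, state]
idents   = tokens.select { |t| t[1] == :on_ident }.map { |t| t[2] }.uniq
comments = tokens.select { |t| t[1] == :on_comment }.map { |t| t[2].strip }

idents   # => ["greet", "user", "puts"]
comments # => ["# Greets the given user"]
```

If you need structure beyond a flat token stream, `Ripper.sexp(src)` returns the full AST as nested arrays.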

Detecting Elements of Sentence In Ruby

My aim is detecting simple elements in any sentence such as verb, noun or adjective. Is there any gem in Ruby for achieving that? For example:
They elected him president yesterday.
Output:
["subject", "verb", "object", "predicative", "adverbial"]
These are the only natural language processing options for Ruby that I know of.
Treat
Stanford Core NLP
Open NLP
Interestingly, they are all by the same person.
EDIT
Here is one more option that I found. It's a tutorial on n-gram analysis.
Natural Language Processing with Ruby: n-grams
I've used engtagger with good success in the past. It's ported from a Perl program called Lingua::EN::Tagger. It takes a bit of work to get it to do what you want, but I think it's the best tool available for this application (at least at the moment).

How to use Stanford parser to generate the English sentence from given segmented nodes

I am trying to develop a Sinhala (My native language) to English translator.
I am still thinking about an approach.
If I parse a sentence of my language, can I then use the Stanford parser to generate the English sentence? Or is there another method you can recommend?
I am also thinking of a bottom-up parser for my language, but I still have no idea how to implement one. Any suggestions for steps I can follow?
Thanks
Mathee
If you have a large enough bilingual corpus, the best thing you can do is train a model using an existing statistical machine translation system like Moses. Modern phrase-based SMT systems resemble a reverse parsing mechanism in the sense that they find the most probable combination of target-language phrases for translating a specific source-language sentence. There is more information in this Wikipedia article.

Is there a framework for writing phrase structure rules out there that is opensource?

I've worked with the Xerox toolchain so far, which is powerful, not open source, and a bit overkill for my current problem. Are there libraries that allow me to implement a phrase structure grammar? Preferably in Ruby or Lisp.
AFAIK, there's no open-source Lisp phrase structure parser available.
But since a parser is actually a black box, it's not so hard to make your application work with a parser written in any language, especially as they produce S-expressions as output. For example, with something like pfp you can just pipe your sentences as strings to it, then read and process the resulting trees. Or you can wrap a socket server around it and you'll get a distributed system :)
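To illustrate the black-box idea: the S-expression output of such a parser is easy to read back into Ruby data structures; a sketch (the parse-tree string is a made-up example of typical bracketed output):

```ruby
# Parse an S-expression string (typical parser output) into nested arrays.
def parse_sexp(str)
  tokens = str.scan(/\(|\)|[^\s()]+/)
  read = ->(toks) do
    tok = toks.shift
    return tok unless tok == "("
    list = []
    list << read.call(toks) until toks.first == ")"
    toks.shift # consume the closing ")"
    list
  end
  read.call(tokens)
end

tree = parse_sexp("(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))")
tree[0] # => "S"
tree[1] # => ["NP", ["DT", "the"], ["NN", "cat"]]
```

In practice the string would come from piping sentences to the parser binary via `IO.popen` (or over a socket, as the answer suggests) rather than from a literal.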
There's also cl-langutils, which may be helpful for some basic NLP tasks, like tokenization and, maybe, POS tagging. But overall it's much less mature and feature-rich than the commonly used packages, like Stanford's or OpenNLP.

Natural Language Processing in Ruby [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I'm looking to do some sentence analysis (mostly for twitter apps) and infer some general characteristics. Are there any good natural language processing libraries for this sort of thing in Ruby?
Similar to Is there a good natural language processing library but for Ruby. I'd prefer something very general, but any leads are appreciated!
Three excellent and mature NLP packages are Stanford Core NLP, Open NLP and LingPipe. There are Ruby bindings to the Stanford Core NLP tools (GPL license) as well as the OpenNLP tools (Apache License).
On the more experimental side of things, I maintain a Text Retrieval, Extraction and Annotation Toolkit (Treat), released under the GPL, that provides a common API for almost every NLP-related gem that exists for Ruby. The following list of Treat's features can also serve as a good reference in terms of stable natural language processing gems compatible with Ruby 1.9.
Text segmenters and tokenizers (punkt-segmenter, tactful_tokenizer, srx-english, scalpel)
Natural language parsers for English, French and German and named entity extraction for English (stanford-core-nlp).
Word inflection and conjugation (linguistics), stemming (ruby-stemmer, uea-stemmer, lingua, etc.)
WordNet interface (rwordnet), POS taggers (rbtagger, engtagger, etc.)
Language (whatlanguage), date/time (chronic, kronic, nickel), keyword (lda-ruby) extraction.
Text retrieval with indexation and full-text search (ferret).
Named entity extraction (stanford-core-nlp).
Basic machine learning with decision trees (decisiontree), MLPs (ruby-fann), SVMs (rb-libsvm) and linear classification (tomz-liblinear-ruby-swig).
Text similarity metrics (levenshtein-ffi, fuzzy-string-match, tf-idf-similarity).
Not included in Treat, but relevant to NLP: hotwater (string distance algorithms), yomu (bindings to Apache Tika for reading .doc, .docx, .pages, .odt, .rtf, .pdf), graph-rank (an implementation of GraphRank).
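As a point of reference for the string-distance gems listed above, here is the classic dynamic-programming Levenshtein edit distance in plain Ruby (the gems exist mainly because C-backed implementations are much faster on large workloads):

```ruby
# Two-row dynamic-programming Levenshtein edit distance.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,             # deletion
               curr[j - 1] + 1,         # insertion
               prev[j - 1] + cost].min  # substitution
    end
    prev = curr
  end
  prev.last
end

levenshtein("chat", "chats")     # => 1
levenshtein("kitten", "sitting") # => 3
```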
There are some things at Ruby Linguistics and some links therefrom, though it doesn't seem anywhere close to what NLTK is for Python, yet.
You can always use jruby and use the java libraries.
EDIT: The ability to run Ruby natively on the JVM and easily leverage Java libraries is a big plus for Rubyists. This is a good option that should be considered in a situation like this.
Which NLP toolkit to use in JAVA?
I found an excellent article detailing some NLP algorithms in Ruby here. This includes stemmers, date time parsers and grammar parsers.
TREAT – the Text REtrieval and Annotation Toolkit – is the most comprehensive toolkit I know of for Ruby: https://github.com/louismullie/treat/wiki/
I maintain a list of Ruby Natural Language Processing resources (libraries, APIs, and presentations) on GitHub that covers the libraries listed in the other answers here as well as some additional libraries.
Also consider using SaaS APIs like MonkeyLearn. You can easily train text classifiers with machine learning and integrate via an API. There's a Ruby SDK available.
Besides creating your own classifiers, you can pick pre-created modules for sentiment analysis, topic classification, language detection and more.
We also have extractors like keyword extraction and entities, and we'll keep adding more public modules.
Other nice features:
You have a GUI to create/test algorithms.
Algorithms run really fast in our cloud computing platform.
You can integrate with Ruby or any other programming language.
Try this one
https://github.com/louismullie/stanford-core-nlp
About stanford-core-nlp gem
This gem provides high-level Ruby bindings to the Stanford Core NLP package, a set of natural language processing tools for tokenization, sentence segmentation, part-of-speech tagging, lemmatization, and parsing of English, French and German. The package also provides named entity recognition and coreference resolution for English.
http://nlp.stanford.edu/software/corenlp.shtml
demo page
http://nlp.stanford.edu:8080/corenlp/
You need to be much more specific about what these "general characteristics" are.
In NLP "general characteristics" of a sentence can mean a million different things - sentiment analysis (ie, the attitude of the speaker), basic part of speech tagging, use of personal pronoun, does the sentence contain active or passive verbs, what's the tense and voice of the verbs...
I don't mind if you're vague about describing it, but if we don't know what you're asking it's highly unlikely we can be specific in helping you.
My general suggestion, especially for NLP, is you should get the tool best designed for the job instead of limiting yourself to a specific language. Limiting yourself to a specific language is fine for some tasks where the general tools are implemented everywhere, but NLP is not one of those.
The other issue in working with Twitter is a great deal of the sentences there will be half baked or compressed in strange and wonderful ways - which most NLP tools aren't trained for. To help there, the NUS SMS Corpus consists of "about 10,000 SMS messages collected by students". Due to the similar restrictions and usage, analysing that may be helpful in your explorations with Twitter.
If you're more specific I'll try and list some tools that will help.
I would check out Mark Watson's free book Practical Semantic Web and Linked Data Applications, Java, Scala, Clojure, and JRuby Edition. He has chapters on NLP using Java, Clojure, Ruby, and Scala. He also provides links to the resources you need.
For people looking for something more lightweight and simple to implement this option worked well for me.
https://github.com/yohasebe/engtagger
