Machine learning algorithms in Ruby

I'm following the Stanford Machine Learning class taught by Prof. Andrew Ng, and I would like to start implementing the examples in Ruby.
Are there any frameworks/gems/libs or existing code out there that approach machine learning in Ruby? I have found some related questions and some projects, but they seem to be quite old.

The algorithms themselves are not language-specific; you can implement them in any language you want. For maximum efficiency you will want to use matrix/vector-based computing.
Ruby has a built-in Matrix class that you can use to implement these algorithms, and the implementation will be very similar to the Octave one. Everything you need to implement the algorithms yourself is included in the standard library for Ruby 1.9+.
Octave is used in the course because it provides a thorough and easy-to-use matrix library.
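For instance, the first programming exercise (linear regression) can be reproduced with nothing but the standard library. A minimal sketch using the normal equation theta = (X^T X)^-1 X^T y, with invented data:

    require 'matrix'

    # Linear regression via the normal equation: theta = (X^T X)^-1 X^T y.
    # The data below is made up for illustration; the column of ones is
    # the intercept term, just as in the Octave exercises.
    x = Matrix[[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
    y = Vector[6.0, 5.0, 7.0, 10.0]

    theta = (x.transpose * x).inverse * x.transpose * y
    puts theta  # intercept ~ 3.5, slope ~ 1.4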

Be sure to check this gist; it has plenty of information:
Resources for Machine Learning in Ruby
Moreover, the following are some noteworthy algorithm libraries (which may or may not already be listed in the gist above):
AI4R
http://www.ai4r.org/ - https://github.com/SergioFierens/ai4r
AI4R is a collection of Ruby algorithm implementations covering several artificial intelligence fields, together with simple practical examples that use them; a Ruby playground for AI researchers. It implements:
Genetic algorithms
Self-organizing maps (SOM)
Neural Networks: Multilayer perceptron with Backpropagation learning, Hopfield net.
Automatic classifiers (Machine Learning): ID3 (Decision Trees), PRISM (J. Cendrowska, 1987), Multilayer Perceptron, OneR (AKA One Attribute Rule, 1R), ZeroR, Hyperpipes, Naive Bayes, IB1 (D. Aha, D. Kibler - 1991).
Data clustering: K-means, Bisecting k-means, Single linkage, Complete linkage, Average linkage, Weighted Average linkage, Centroid linkage, Median linkage, Ward's method linkage, Diana (Divisive Analysis)
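For a taste of the API, here is a minimal sketch of training AI4R's ID3 decision tree, based on the DataSet/build/eval interface from its documentation (the toy weather data is invented):

    require 'ai4r'

    # Toy data (invented): predict 'play' from weather and temperature.
    data = Ai4r::Data::DataSet.new(
      data_labels: ['weather', 'temperature', 'play'],
      data_items: [['sunny',  'hot',  'no'],
                   ['sunny',  'mild', 'yes'],
                   ['rainy',  'mild', 'no'],
                   ['cloudy', 'hot',  'yes']]
    )

    id3 = Ai4r::Classifiers::ID3.new.build(data)
    puts id3.eval(['cloudy', 'mild'])  # expected: "yes"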
kmeans-clusterer - k-means clustering in Ruby:
https://github.com/gbuesing/kmeans-clusterer
kmeans-clustering - A simple Ruby gem for parallelized k-means clustering:
https://github.com/vaneyckt/kmeans-clustering
tlearn-rb - Recurrent Neural Network library for Ruby:
https://github.com/josephwilk/tlearn-rb
TensorFlow Ruby wrapper - As of this writing, work appears to be getting underway on a TensorFlow Ruby API:
https://github.com/tensorflow/tensorflow/issues/50#issuecomment-216200945
If JRuby is a viable alternative to Ruby for you:
weka-jruby - Machine Learning & Data Mining with JRuby based on the Weka Java library:
https://github.com/paulgoetze/weka-jruby
jruby_mahout - JRuby Mahout is a gem that unleashes the power of Apache Mahout in the world of JRuby:
https://github.com/vasinov/jruby_mahout
UPDATE: the Resources for Machine Learning in Ruby gist above is now being maintained as a repository: https://github.com/arbox/machine-learning-with-ruby

Try Rumale and Numo::NArray
https://github.com/yoshoku/rumale
Rumale (Ruby machine learning) is a machine learning library in Ruby that provides algorithms with interfaces similar to scikit-learn in Python. Rumale supports Linear / Kernel Support Vector Machine, Logistic Regression, Linear Regression, Ridge, Lasso, Factorization Machine, Naive Bayes, Decision Tree, AdaBoost, Gradient Tree Boosting, Random Forest, Extra-Trees, K-nearest neighbor classifier, K-Means, K-Medoids, Gaussian Mixture Model, DBSCAN, Power Iteration Clustering, Multidimensional Scaling, t-SNE, Principal Component Analysis, and Non-negative Matrix Factorization.
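A minimal sketch of that scikit-learn-style interface (the toy data is invented, and exact parameter names can vary between Rumale versions):

    require 'rumale'

    # Two toy clusters, labeled 0 and 1 (invented data).
    x = Numo::DFloat[[0.1, 0.2], [0.3, 0.1], [5.0, 5.2], [5.1, 4.9]]
    y = Numo::Int32[0, 0, 1, 1]

    model = Rumale::LinearModel::LogisticRegression.new(reg_param: 1.0)
    model.fit(x, y)
    puts model.predict(Numo::DFloat[[0.2, 0.2], [5.0, 5.0]]).to_a.inspect
    # expected: [0, 1]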

Related

GentleBoost n-ary classifier?

I'm looking for resources on, or implementations of, n-ary GentleBoost classifiers.
I've seen a number of AdaBoost implementations, and a GentleBoost implementation in MATLAB's ensemble tools, but they always seem to be binary.
WEKA, too, has only an AdaBoost implementation, not GentleBoost.
Does anyone have any suggestions on
- how to go about getting an n-ary GentleBoost implementation?
- approximately how long it would take to build one if it doesn't exist already?
There is an R package named ada that also implements Gentle Boosting:
http://cran.r-project.org/web/packages/ada/index.html
Edit
Indeed, it is only for binary classification:
ada is used to fit a variety of stochastic boosting models for a binary
response as described in Additive Logistic Regression: A Statistical
View of Boosting by Friedman, et al. (2000).
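If nothing turns up, a common route to an n-ary version is to do the reduction yourself: train one binary GentleBoost model per class (one-vs-rest) and predict the class whose model scores highest. A sketch in Ruby, assuming a hypothetical binary learner that exposes train(x, y) with +1/-1 labels and a real-valued score(xi):

    # One-vs-rest reduction: turns any binary boosting learner into an
    # n-ary classifier. 'learner_class' is a hypothetical binary learner;
    # its train method is assumed to return the trained model.
    class OneVsRest
      def initialize(learner_class)
        @learner_class = learner_class
      end

      def train(x, y)
        @models = y.uniq.sort.map do |klass|
          binary = y.map { |label| label == klass ? 1 : -1 }
          [klass, @learner_class.new.train(x, binary)]
        end
        self
      end

      def predict(xi)
        # Pick the class whose one-vs-rest model is most confident.
        @models.max_by { |_, model| model.score(xi) }.first
      end
    end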

Searching for a Genetic Programming framework/library

I am looking for a framework or library that enables working with genetic programming (Koza style) not only with mathematical functions, but also with loops, variable or constant assignment, object creation, or function calls. I am not sure whether such a branch of genetic algorithms exists, or what it is called.
I did my best searching for information, but the internet is sparse on this specific topic.
HeuristicLab has a powerful implementation of Genetic Programming. It includes problems such as Symbolic Regression, Symbolic Classification, Time Series, Santa Fe Ant Trail, and there is a tutorial to implement custom problems such as the Lawn Mower (which is similar to the Santa Fe Ant Trail). HeuristicLab is implemented in C# and runs on Windows. It's released under GPL and can be freely downloaded.
The implementation of GP is very flexible and extensible, but also performance optimized using online calculations to avoid array allocation and memory overheads. We do include several benchmark problem instances for symbolic regression and classification. There are also more algorithms available such as Random Forests, Neural Networks, k-NN, SVM (if you're doing regression or classification).
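As a fallback, the core of Koza-style tree GP is small enough to prototype yourself. Here is a mutation-only sketch in Ruby (crossover omitted for brevity; all names are invented) that evolves arithmetic expression trees to fit a target function:

    # Evolves expression trees over +, -, * and the variable x to fit a
    # target function. Trees are nested arrays: [op, left, right].
    OPS = {
      :+ => ->(a, b) { a + b },
      :- => ->(a, b) { a - b },
      :* => ->(a, b) { a * b }
    }

    def random_tree(depth)
      return rand < 0.5 ? :x : rand(-5..5) if depth.zero? || rand < 0.3
      [OPS.keys.sample, random_tree(depth - 1), random_tree(depth - 1)]
    end

    def evaluate(tree, x)
      return x if tree == :x
      return tree if tree.is_a?(Numeric)
      op, left, right = tree
      OPS[op].call(evaluate(left, x), evaluate(right, x))
    end

    TARGET = ->(x) { x * x + x + 1 }   # the function to rediscover
    SAMPLES = (-5..5).to_a

    def fitness(tree)                  # squared error; lower is better
      SAMPLES.sum { |x| (evaluate(tree, x) - TARGET.call(x))**2 }
    end

    def mutate(tree)
      return random_tree(2) if rand < 0.2          # replace a subtree
      return tree unless tree.is_a?(Array)
      [tree[0], mutate(tree[1]), mutate(tree[2])]
    end

    population = Array.new(100) { random_tree(4) }
    50.times do
      survivors = population.min_by(20) { |t| fitness(t) }
      population = survivors + Array.new(80) { mutate(survivors.sample) }
    end
    best = population.min_by { |t| fitness(t) }
    puts "best error: #{fitness(best)}"

Extending the tree node types beyond arithmetic (loops, assignment, function calls) is exactly the direction the question asks about; the tree-plus-interpreter structure stays the same, only the node vocabulary grows.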

What's the difference between LibSVM and LibLinear

libsvm and liblinear are both software libraries that implement Support Vector Machines. What's the difference? And how do the differences make liblinear faster than libsvm?
In practice, the complexity of the SMO algorithm (which works for both kernel and linear SVMs) as implemented in libsvm is O(n^2) or O(n^3), whereas liblinear is O(n) but does not support kernel SVMs; here n is the number of samples in the training dataset.
Hence, for medium- to large-scale datasets, forget about kernels and use liblinear (or have a look at approximate kernel SVM solvers such as LaSVM).
Edit: in practice libsvm becomes painfully slow at 10k samples.
An SVM (support vector machine) is basically a linear classifier that uses kernel transforms to turn a non-linear problem into a linear one beforehand.
From the link above, it seems that liblinear is very much the same thing, minus those kernel transforms. So, as they say, in cases where the kernel transforms are not needed (they mention document classification), it will be faster.
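To make that concrete: a linear model only ever touches the raw feature vectors and a single weight vector, while a kernel SVM trains against the n x n matrix of pairwise kernel values (the Gram matrix), which is one source of the quadratic cost mentioned above. A small illustrative sketch in plain Ruby (invented data):

    # RBF kernel between two feature vectors.
    def rbf(a, b, gamma = 0.5)
      sq_dist = a.zip(b).sum { |ai, bi| (ai - bi)**2 }
      Math.exp(-gamma * sq_dist)
    end

    samples = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
    # Kernel methods need all n^2 pairwise values:
    gram = samples.map { |a| samples.map { |b| rbf(a, b) } }
    # A linear model instead scores with one weight vector w, in O(n):
    w = [0.5, -0.25]
    puts samples.first.zip(w).sum { |xi, wi| xi * wi }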
From http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf:
It supports L2-regularized logistic regression (LR), L2-loss and L1-loss linear support vector machines (SVMs) (Boser et al., 1992). It inherits many features of the popular SVM library LIBSVM.
And you might also see some useful information here from one of the creators: http://agbs.kyb.tuebingen.mpg.de/km/bb/showthread.php?tid=710
The main idea, I would say, is that liblinear is optimized for linear classification (i.e., no kernels necessary), whereas linear classification is only one of the many capabilities of libsvm, so logically libsvm may not match liblinear's speed on those tasks. Obviously, I'm making some broad generalizations here; the exact details on the differences are probably covered in the paper linked above as well as in the corresponding user's guide to libsvm from the libsvm website.

Parallel iterative algorithms for solving linear systems of equations

Does anyone know of a library or ready-made source code for parallel implementations of fast iterative methods (BiCGSTAB, CG, etc.) for solving linear systems of equations, for example using MPI or OpenMP?
PETSc is a good example (both serial and MPI, with a large library of linear and nonlinear solvers either included or provided as interfaces to external libraries). Trilinos is another example, but it's a much broader project and not as nicely integrated as PETSc. Aztec has a number of solvers, as does Hypre, which is hybrid (MPI+OpenMP).
These are all MPI-based at least in part; I don't know of many OpenMP-enabled ones, although Google suggests Lis, which I'm not familiar with.
Chapter 7 of Parallel Programming for Multicore and Cluster Systems contains algorithms for systems of linear equations, with source code (MPI).
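For reference, the serial conjugate gradient iteration that these libraries parallelize fits in a few lines; in the MPI versions, the matrix-vector product and the dot products are what get distributed. A plain serial sketch (Ruby standard library only) for a symmetric positive-definite system:

    require 'matrix'

    # Conjugate gradient for A x = b, with A symmetric positive-definite.
    def conjugate_gradient(a, b, tol: 1e-10, max_iter: 1000)
      x = Vector.zero(b.size)
      r = b - a * x                    # residual
      d = r                            # search direction
      rs_old = r.inner_product(r)
      max_iter.times do
        ad = a * d
        alpha = rs_old / d.inner_product(ad)
        x += d * alpha
        r -= ad * alpha
        rs_new = r.inner_product(r)
        break if Math.sqrt(rs_new) < tol
        d = r + d * (rs_new / rs_old)
        rs_old = rs_new
      end
      x
    end

    a = Matrix[[4.0, 1.0], [1.0, 3.0]]
    b = Vector[1.0, 2.0]
    puts conjugate_gradient(a, b)  # ~ (0.0909, 0.6364)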

State of the art in classification algorithms

We know there are something like a thousand classifiers; recently I was told that some people consider AdaBoost to be the "off-the-shelf" one.
- Are there better algorithms (with that voting idea)?
- What is the state of the art in classifiers? Do you have an example?
First, AdaBoost is a meta-algorithm which is used in conjunction with (on top of) your favorite classifier. Second, classifiers that work well in one problem domain often don't work well in another. See the No Free Lunch Wikipedia page. So, there is not going to be AN answer to your question. Still, it might be interesting to know what people are using in practice.
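To make the "meta-algorithm" point concrete, here is a compact sketch of AdaBoost over decision stumps in plain Ruby (labels must be +1/-1; everything here is illustrative, not taken from a library):

    # Weak learner: exhaustively picks the best threshold stump under
    # the current example weights w.
    def best_stump(x, y, w)
      best, best_err = nil, Float::INFINITY
      x.first.length.times do |d|
        x.map { |xi| xi[d] }.uniq.each do |t|
          [1, -1].each do |sign|
            h = ->(xi) { xi[d] <= t ? sign : -sign }
            err = x.each_index.sum { |i| h.call(x[i]) == y[i] ? 0.0 : w[i] }
            best, best_err = h, err if err < best_err
          end
        end
      end
      best
    end

    # AdaBoost: reweights examples so each new stump focuses on the
    # mistakes of the ensemble so far.
    def adaboost(x, y, rounds: 10)
      n = x.length
      w = Array.new(n, 1.0 / n)
      ensemble = []
      rounds.times do
        stump = best_stump(x, y, w)
        err = n.times.sum { |i| stump.call(x[i]) == y[i] ? 0.0 : w[i] }
        break if err >= 0.5
        alpha = 0.5 * Math.log((1.0 - err) / (err + 1e-12))
        ensemble << [alpha, stump]
        n.times { |i| w[i] *= Math.exp(-alpha * y[i] * stump.call(x[i])) }
        z = w.sum
        w.map! { |wi| wi / z }
      end
      ensemble
    end

    def predict(ensemble, xi)
      ensemble.sum { |alpha, stump| alpha * stump.call(xi) } >= 0 ? 1 : -1
    end

The base classifier is whatever you plug in as the weak learner; stumps are just the conventional default.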
Weka and Mahout aren't algorithms... they're machine learning libraries. They include implementations of a wide range of algorithms. So, your best bet is to pick a library and try a few different algorithms to see which one works best for your particular problem (where "works best" is going to be a function of training cost, classification cost, and classification accuracy).
If it were me, I'd start with naive Bayes, k-nearest neighbors, and support vector machines. They represent well-established, well-understood methods with very different tradeoffs. Naive Bayes is cheap, but not especially accurate. K-NN is cheap during training but (can be) expensive during classification, and while it's usually very accurate it can be susceptible to overtraining. SVMs are expensive to train and have lots of meta-parameters to tweak, but they are cheap to apply and generally at least as accurate as k-NN.
If you tell us more about the problem you're trying to solve, we may be able to give more focused advice. But if you're just looking for the One True Algorithm, there isn't one -- the No Free Lunch theorem guarantees that.
Apache Mahout (open source, Java) seems to be picking up a lot of steam.
Weka is a very popular and stable machine learning library. It has been around for quite a while and is written in Java.
Hastie et al. (2013, The Elements of Statistical Learning) conclude that the gradient boosting machine is the best "off-the-shelf" method, independent of the problem you have.
Definition (see page 352):
An “off-the-shelf” method is one that can be directly applied to the data without requiring a great deal of time-consuming data preprocessing or careful tuning of the learning procedure.
And an older assessment:
In fact, Breiman (NIPS Workshop, 1996) referred to AdaBoost with trees as the “best off-the-shelf classifier in the world” (see also Breiman (1998)).
