SVD implementation in MapReduce - Hadoop

Hi, I need to perform a Singular Value Decomposition on large, dense, square matrices using MapReduce.
I have already checked the Mahout project, but what it provides is a TSQR algorithm:
http://arbenson.github.io/portfolio/Math221/AustinBenson-math221-report.pdf
The problem is that I want the full rank, and this method does not work in that case.
The distributed Lanczos SVD implementation they were using before does not suit my case either.
I found that the two-sided Jacobi scheme could be used for this purpose, but I did not manage to find any available implementation.
Does anybody know if and where I can find reference code?

If it helps, look at the Spark library (MLlib). It has an SVD implementation; you can use it directly, or study it to build your own.
https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html
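For reference, here is a minimal PySpark sketch of calling MLlib's distributed SVD; the tiny matrix is made up for illustration, and passing k equal to the full dimension requests all singular values:

    from pyspark import SparkContext
    from pyspark.mllib.linalg.distributed import RowMatrix

    sc = SparkContext(appName="svd-sketch")

    # Hypothetical small dense matrix, one row per RDD element.
    rows = sc.parallelize([
        [1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 10.0],
    ])
    mat = RowMatrix(rows)

    # k equal to the number of columns requests the full decomposition;
    # computeU=True also materializes the left singular vectors.
    svd = mat.computeSVD(3, computeU=True)
    print(svd.s)  # singular values
    print(svd.V)  # right singular vectors (a local matrix)

Note that MLlib chooses between a local and a distributed eigendecomposition of the Gramian depending on the problem size, so check the documentation above before relying on it for the full-rank case on very large dense matrices.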

Related

What is TensorFlow's row reduction algorithm?

I'm wondering what TensorFlow uses to perform row reduction. Specifically, when I call tf.linalg.inv, what algorithm runs? TensorFlow is open source, so I figured it would be easy enough to find, but I find myself a little lost in the code base. If I could just get a pointer to the implementation of the aforementioned function, that would be great. If there is a name for the Gauss-Jordan elimination implementation they used, that would be even better.
https://github.com/tensorflow/tensorflow
The op uses LU decomposition with partial pivoting to compute the inverses.
For more insight into the tf.linalg.inv algorithm, please refer to this link: https://www.tensorflow.org/api_docs/python/tf/linalg/inv
If you wish to experiment with something similar, please refer to this Stack Overflow link.
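If you want to see the connection for yourself, here is a small TensorFlow sketch; the matrix values are arbitrary, and it simply checks tf.linalg.inv against an explicit LU factorization:

    import tensorflow as tf

    # A small invertible matrix (values are arbitrary).
    a = tf.constant([[4.0, 3.0, 2.0],
                     [1.0, 3.0, 1.0],
                     [2.0, 1.0, 5.0]])

    # tf.linalg.inv computes the inverse via LU decomposition
    # with partial pivoting.
    inv = tf.linalg.inv(a)

    # The factorization is also exposed directly: tf.linalg.lu
    # returns the packed LU factors and the pivot permutation.
    lu, p = tf.linalg.lu(a)
    inv_from_lu = tf.linalg.lu_matrix_inverse(lu, p)

    # The two routes agree up to floating-point error.
    print(tf.reduce_max(tf.abs(inv - inv_from_lu)).numpy())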

ELKI cluster extraction HiSC HiCO

I'm computing the HiCO and HiSC clustering algorithms on my dataset. If I'm not mistaken, the algorithms use different approaches to define the relevant subspaces for clusters in the first step, and in the second step they apply OPTICS for clustering. I'm getting only a cluster order file after I run the algorithms.
Is there any way to extract clusters from it, for example like OPTICSXi? (I know there are three extraction methods under hierarchical clustering, but I can't see anything for HiCO or HiSC.)
Thank you in advance for any hints.
Use OPTICSXi as the algorithm, then use HiCO or HiSC "inside".
The Xi extraction can be parameterized to use a different OPTICS variant like HiCO, HiSC, and DeLi-Clu. It just defaults to using regular OPTICS.
For HiCO:
-algorithm clustering.optics.OPTICSXi
-opticsxi.algorithm de.lmu.ifi.dbs.elki.algorithm.clustering.correlation.HiCO
For HiSC:
-algorithm clustering.optics.OPTICSXi
-opticsxi.algorithm de.lmu.ifi.dbs.elki.algorithm.clustering.subspace.HiSC
We currently don't have implementations of the other extraction methods in ELKI yet, sorry.

Some confusions in machine learning

I have two points of confusion about using machine learning algorithms. First, I should say that I am just a user of these tools.
1. There are two categories, A and B. If I want to pick as many A as possible from a mixture of the two, what kind of algorithm should I use (no need to consider the number of samples)? At first I thought it should be a classification algorithm, and I used, for example, the boosted decision tree (BDT) in the TMVA package, but someone told me that BDT is actually a regression algorithm.
2. I find that when I have raw data, if I analyze it (do some combinations, etc.) before feeding it to the BDT, the result is better than feeding the raw data into the BDT directly. Since the raw data contains all the information, why do I need to analyze it myself?
If anything is unclear, please just add a comment. I hope you can give me some advice.
For 2: you have to perform some manipulation on the data and feed that in for the model to perform better, because the ability to analyze the data is not built into the algorithm; it only looks at the data and classifies it. The problem of "analysis", as you put it, is called feature selection or feature engineering, and it has to be done by hand (unless, of course, you are using some technique that learns features, e.g. deep learning). In machine learning it has been observed many times that manipulated/engineered features perform better than raw features.
For 1: BDT can be used for regression as well as for classification. This looks like a classification problem (to choose or not to choose), hence you should use a classification algorithm.
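To illustrate both points at once, here is a minimal scikit-learn sketch (the data and the engineered ratio feature are invented for this example): a gradient-boosted tree ensemble used as a classifier, trained once on raw features and once with one hand-engineered feature added:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic data: the true signal is the ratio x0/x1, which a
    # depth-limited tree ensemble approximates poorly from raw columns.
    rng = np.random.default_rng(0)
    X = rng.uniform(0.1, 1.0, size=(5000, 2))
    y = (X[:, 0] / X[:, 1] > 1.0).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # 1) A boosted decision tree used as a classifier, on raw features.
    raw = GradientBoostingClassifier(max_depth=2, random_state=0)
    raw.fit(X_tr, y_tr)

    # 2) The same model with one engineered feature: the ratio itself.
    X_tr_fe = np.column_stack([X_tr, X_tr[:, 0] / X_tr[:, 1]])
    X_te_fe = np.column_stack([X_te, X_te[:, 0] / X_te[:, 1]])
    fe = GradientBoostingClassifier(max_depth=2, random_state=0)
    fe.fit(X_tr_fe, y_tr)

    print("raw features:      ", accuracy_score(y_te, raw.predict(X_te)))
    print("engineered feature:", accuracy_score(y_te, fe.predict(X_te_fe)))

On data like this, the engineered-feature model typically scores noticeably higher, which is the effect described above.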
Are you sure ML is the right approach for your problem? In case it is, some classification algorithms would be: logistic regression, neural networks, support vector machines, and decision trees, just to name a few.

What is the algorithm used by the "Universal Recommender" on Prediction.IO?

Good afternoon,
What is the name of the algorithm used by the "Universal Recommender" (UR) on Prediction.IO? As far as I know, the algorithms for recommender systems are "collaborative filtering" and "content-based filtering".
Thanks!
It uses the Correlated Cross-Occurrence (CCO) algorithm from Apache Mahout.
Check out these:
https://actionml.com/blog/cco
https://mahout.apache.org/users/algorithms/recommender-overview.html
Prediction.io uses Apache Spark MLlib's Alternating Least Squares (ALS) matrix factorization method. It is one of the basic approaches to collaborative filtering, the main families being user-based, item-based, and matrix factorization. Documentation can be found at http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
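A minimal PySpark sketch of that ALS method, with made-up (user, item, rating) triples:

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="als-sketch")

    # Hypothetical explicit-feedback ratings.
    ratings = sc.parallelize([
        Rating(0, 0, 4.0), Rating(0, 1, 2.0),
        Rating(1, 0, 3.0), Rating(1, 2, 5.0),
        Rating(2, 1, 1.0), Rating(2, 2, 4.0),
    ])

    # Factorize the ratings matrix into rank-2 user and item factors.
    model = ALS.train(ratings, rank=2, iterations=10)

    # Predict user 2's rating for item 0.
    print(model.predict(2, 0))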
The Universal Recommender template uses this algorithm for computing "events" that appear "often" together with "buying" some "item". However, matrix factorization is not what the authors of the Universal Recommender describe in their original idea; instead, they used LLR similarity to find statistically significant "events". I personally doubt the suitability of matrix factorization and of HBase here (use a Redis cluster instead). You can read about the general idea behind the Universal Recommender at https://www.mapr.com/practical-machine-learning and http://mahout.apache.org/users/algorithms/recommender-overview.html
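For reference, the LLR similarity mentioned above scores a 2x2 co-occurrence table. Below is a small Python sketch following the formulation used in Mahout's LogLikelihood class (the function and variable names are mine):

    import math

    def x_log_x(x):
        return x * math.log(x) if x > 0 else 0.0

    def entropy(*counts):
        # Unnormalized entropy as used by Mahout: xLogX(sum) - sum(xLogX(c)).
        return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

    def llr(k11, k12, k21, k22):
        # k11: times events A and B occurred together
        # k12: times B occurred without A
        # k21: times A occurred without B
        # k22: times neither occurred
        row_entropy = entropy(k11 + k12, k21 + k22)
        col_entropy = entropy(k11 + k21, k12 + k22)
        mat_entropy = entropy(k11, k12, k21, k22)
        return 2.0 * (row_entropy + col_entropy - mat_entropy)

    # Events that co-occur far more often than chance get a high score.
    print(llr(100, 10, 10, 10000))

A high score marks the co-occurrence as statistically significant, which is how CCO decides which cross-occurrence events to keep.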

Any good nearest-neighbors algorithm for similar images?

I am looking for an algorithm that can search for similar images in a large collection.
I'm currently using a SURF implementation in OpenCL.
At first I used the KNN search algorithm to compare every image's interest points to the rest of the collection, but tests revealed that it doesn't scale well. I've also tried a Hadoop implementation of KNN-join, which takes a lot of temporary space in HDFS, far too much compared to the amount of input data. In fact, the pairwise-distance approach isn't really appropriate because of the dimensionality of my input vectors (64).
I've heard of locality-sensitive hashing and wondered whether there is any free implementation, or whether it's worth implementing it myself. Or maybe there's another algorithm I'm not aware of?
IIRC, the FLANN algorithm is a good compromise:
http://people.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
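In case it helps, here is a small OpenCV sketch of approximate nearest-neighbor matching with FLANN over 64-dimensional SURF-style descriptors; the descriptor arrays are random placeholders for your real data:

    import cv2
    import numpy as np

    # Placeholder 64-dimensional descriptors (SURF's default length),
    # one row per interest point.
    db_desc = np.random.rand(10000, 64).astype(np.float32)
    query_desc = np.random.rand(500, 64).astype(np.float32)

    # FLANN with a forest of randomized kd-trees; "checks" bounds the
    # number of leaves visited, trading accuracy for speed.
    index_params = dict(algorithm=1, trees=4)  # 1 = FLANN_INDEX_KDTREE
    search_params = dict(checks=64)
    matcher = cv2.FlannBasedMatcher(index_params, search_params)

    # Approximate 2-nearest-neighbor search, then Lowe's ratio test.
    matches = matcher.knnMatch(query_desc, db_desc, k=2)
    good = [m for m, n in matches if m.distance < 0.7 * n.distance]
    print(len(good), "confident matches")

This approximate search is what lets FLANN scale far beyond brute-force pairwise KNN on descriptors of this dimensionality.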
