Can I use Hadoop to train a neural network? - hadoop

I want to train a neural network with the help of Hadoop. We know that when training a neural network, the weights of each neuron are altered every iteration, and each iteration depends on the previous one. I'm new to Hadoop and not quite familiar with the features it provides. Can I chain the iterations using the addDependingJob() method to express these dependencies? Or are there other tricks that can be used to implement a neural network with the help of Hadoop?
Any advice will be highly appreciated.
Thanks and Best Regards.

You can write it yourself. If you know how to write back-propagation on a single core from scratch, it can be migrated to a MapReduce approach fairly easily. The HDFS cache should store the current neuron weights; each map task should compute its weight updates for a training instance, and the reduce step should then sum all those updates and write them back to the cache.
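For illustration, here is a minimal Hadoop Streaming sketch of one such iteration (plain Python, not a tested job). It assumes the current weights are shipped to every task as a cached file named weights.txt, that training lines look like "label,feat0,feat1,...", and, to keep the gradient code short, it uses a single logistic unit in place of a full multi-layer network. The file name, input format, and that simplification are all assumptions, not anything prescribed by Hadoop.

```python
#!/usr/bin/env python
import math
import sys


def load_weights(path="weights.txt"):
    # The weight file is shipped to every task, e.g. via the streaming
    # option  -files hdfs:///model/weights.txt  (one float per line).
    with open(path) as f:
        return [float(line) for line in f if line.strip()]


def mapper():
    w = load_weights()
    for line in sys.stdin:
        if not line.strip():
            continue
        parts = [float(v) for v in line.strip().split(",")]
        label, x = parts[0], parts[1:]
        # Forward pass of the simplified model: sigmoid(w . x)
        z = sum(wi * xi for wi, xi in zip(w, x))
        y = 1.0 / (1.0 + math.exp(-z))
        # Backward pass: this instance's gradient of the log-loss
        err = y - label
        for i, xi in enumerate(x):
            # key = weight index, value = partial gradient from this instance
            print("%d\t%.8f" % (i, err * xi))


def reducer():
    # Streaming delivers values sorted by key, so summing per key is a
    # simple group-by over stdin.
    current, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print("%s\t%.8f" % (current, total))
            current, total = key, 0.0
        total += float(value)
    if current is not None:
        print("%s\t%.8f" % (current, total))


if __name__ == "__main__":
    # 'backprop_step.py map' runs the map side, 'backprop_step.py reduce'
    # runs the reduce side of the streaming job.
    mapper() if sys.argv[1] == "map" else reducer()
```

A driver would then apply the summed gradients to weights.txt with a learning rate and resubmit the job for the next iteration, which is exactly where chaining the jobs (e.g. with addDependingJob()) comes in.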

Related

Distributed cross correlation matrix computation

How can I calculate the Pearson cross-correlation matrix of a large (>10 TB) data set, possibly in a distributed manner? Any efficient distributed algorithm suggestions would be appreciated.
Update:
I read the implementation of the Apache Spark MLlib correlation.
Pearson computation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
Covariance computation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
but to me it looks like all the computation is happening on one node, and it is not distributed in the real sense.
Please shed some light on this. I also tried executing it on a 3-node Spark cluster; the screenshots are below.
As you can see from the second image, the data is pulled onto one node and then the computation is done there. Am I right about this?
To start with, have a look at this to see if things are going right. You may then refer to any of these implementations: MPI/OpenMP: Agomezl or Meismyles; MapReduce: Vangjee or Seawolf42. It'd also be interesting to read this before you proceed. On a different note, James's thesis provides some pointers if you're interested in computing correlations that are robust to outliers.
Each local data set can be reduced to standard deviations and covariances.
Those statistics sum across partitions, and standard deviation together with covariance gives the correlation.
Here is a working example:
https://github.com/jeesim2/distributed-correlation
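To illustrate that decomposition, here is a minimal sketch in plain Python (no Spark or Hadoop API calls; the lists below just stand in for distributed partitions). Each partition is mapped to six additive sufficient statistics, the statistics are merged, and Pearson's r = cov(x, y) / (std(x) * std(y)) is computed from the merged totals, so no node ever needs the raw data from the other partitions.

```python
import math


def partial_stats(pairs):
    """Per-partition map step: reduce a block of (x, y) pairs to six numbers."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    return (n, sx, sy, sxx, syy, sxy)


def merge(a, b):
    """Reduce step: the sufficient statistics are additive across partitions."""
    return tuple(ai + bi for ai, bi in zip(a, b))


def pearson(stats):
    n, sx, sy, sxx, syy, sxy = stats
    cov = sxy / n - (sx / n) * (sy / n)
    std_x = math.sqrt(sxx / n - (sx / n) ** 2)
    std_y = math.sqrt(syy / n - (sy / n) ** 2)
    return cov / (std_x * std_y)


if __name__ == "__main__":
    # Two small lists standing in for distributed data blocks.
    partitions = [
        [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)],
        [(4.0, 8.1), (5.0, 9.8)],
    ]
    total = partial_stats(partitions[0])
    for block in partitions[1:]:
        total = merge(total, partial_stats(block))
    print("pearson r = %.4f" % pearson(total))
```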

Elbow method for determining number of clusters on Mahout

I'm using Mahout for clustering, and I have implemented the elbow method for determining the number of clusters, so that I don't have to specify it up front.
I have tried this on one machine, but now I'm having doubts when it comes to a cluster of computers.
I have planned to use Oozie to simulate looping (running the clustering algorithm repeatedly, incrementing the number of clusters by one each time). I read that Oozie is meant for DAGs, but I see no other way of doing this.
The question is: does this look like a sound approach? If not, are there any alternatives?
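A rough sketch of the loop being described is below. run_clustering(k) is a hypothetical placeholder for submitting the clustering job (e.g. a Mahout k-means run) for k clusters and reading back its within-cluster cost; the stopping rule used here (relative improvement below a threshold) is one common variant of the elbow criterion, not the only one.

```python
# Placeholder costs standing in for real job output (within-cluster cost per k).
FAKE_COSTS = {2: 800.0, 3: 420.0, 4: 300.0, 5: 281.0, 6: 272.0, 7: 268.0}


def run_clustering(k):
    # Hypothetical: a real driver would submit the clustering job for k
    # clusters and parse its reported within-cluster cost.
    return FAKE_COSTS[k]


def find_elbow(max_k=7, min_improvement=0.10):
    previous = None
    for k in range(2, max_k + 1):
        cost = run_clustering(k)
        if previous is not None and (previous - cost) / previous < min_improvement:
            # Adding the k-th cluster barely helped, so k-1 is the elbow.
            return k - 1
        previous = cost
    return max_k


if __name__ == "__main__":
    print("chosen k =", find_elbow())  # prints 4 with the placeholder costs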

Where does MapReduce/Hadoop come in for machine learning training?

MapReduce/Hadoop is great for gathering insights from piles of data from various sources and organizing them the way we want.
But when it comes to training, my impression is that we have to feed all the training data into the algorithm (be it SVM, logistic regression, or random forest) at once, so that it can come up with a model covering all of it. Can MapReduce/Hadoop help with the training part? If yes, how, in general?
Yes. There are several MapReduce implementations such as Hadoop Streaming, and even easier tools like Pig, that can be used for learning. In addition, there are distributed learning toolkits built on top of Map/Reduce, such as Vowpal Wabbit (https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial). The big idea behind this kind of method is to train on small portions of the data (the HDFS splits), then average the models and communicate between the nodes, so the combined model gets its updates directly from submodels built on parts of the data.
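As a toy illustration of that "train on splits, then average" idea (this is not Vowpal Wabbit's actual allreduce code, just the shape of the computation with a tiny linear model; the data and hyper-parameters are made up):

```python
def train_on_split(split, epochs=5, lr=0.1):
    """Train a tiny linear regressor on one data split with SGD (the map side)."""
    dim = len(split[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in split:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def average_models(models):
    """Element-wise average of the per-split weight vectors (the reduce side)."""
    dim = len(models[0])
    return [sum(m[i] for m in models) / len(models) for i in range(dim)]


if __name__ == "__main__":
    # Two splits standing in for HDFS blocks; the underlying target is y = 2*x1 + 1*x2.
    splits = [
        [((1.0, 0.0), 2.0), ((0.0, 1.0), 1.0), ((1.0, 1.0), 3.0)],
        [((2.0, 0.0), 4.0), ((0.0, 2.0), 2.0), ((2.0, 1.0), 5.0)],
    ]
    submodels = [train_on_split(s) for s in splits]
    print("averaged model:", average_models(submodels))
```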

MapReduce project with data mining

I am planning to do a MapReduce project involving the Hadoop libraries and testing it on big data uploaded to AWS. I have not finalized an idea yet, but I am sure it will involve some kind of data processing, MapReduce design patterns, and possibly graph algorithms, Hive, and Pig Latin. I would really appreciate it if someone could give me some ideas; I have a few of my own in mind.
In the end I have to work on some large data set, extract some information, and derive some conclusions. For this I have used Weka before for data mining (using trees).
But I am not sure if that (using Weka) is the only thing I can work with right now. Are there other ways I can work on large data and derive conclusions from a large data set?
Also, how can I involve graphs in this?
Basically I want to turn this into a research project, but I am not sure exactly what I should be working on or what it should look like. Any thoughts? Suggested links/ideas? Knowledge sharing?
I suggest you check out Apache Mahout; it is a scalable machine learning and data mining framework that integrates nicely with Hadoop.
Hive gives you an SQL-like language to query big data; essentially, it translates your high-level queries into MapReduce jobs and runs them on the data cluster.
Another suggestion is to consider implementing your data processing algorithm in R, a statistical package (similar to MATLAB). Instead of the standard R environment, I would recommend Revolution R, an environment for developing R, but with much more powerful tools for big data and clustering.
Edit: If you are a student, Revolution R has a free academic edition.
Edit: A third suggestion is to look at GridGain, another Map/Reduce implementation in Java that is relatively easy to run on a cluster.
As you are already working with MapReduce and Hadoop, you can extract some knowledge from your data using Mahout, or you can get some ideas from this very good book:
http://infolab.stanford.edu/~ullman/mmds.html
This book provides ideas for mining social-network graphs, and works with graphs in a couple of other ways too.
Hope it helps!

Hadoop Hypercube

Hey,
I am starting a Hadoop-based hypercube with a flexible number of dimensions.
Does anybody know of any existing approaches for this?
I just found PigOLAPSketch, but there is no code to use it.
Another approach is Zohmg from Last.fm, which uses HBase, but it seems to be pretty dead.
I think I will start on a Pig solution; maybe you have some advice?
This would be very cool/useful. OpenTSDB is an HBase time-series database that might be interesting to look at; they have a clever approach to secondary indexing.
You can also look at a GPU-based database, https://www.kinetica.com/,
but it is not open source, requires separate appliances, and means moving data from Hadoop into the Kinetica infrastructure.
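For what it's worth, the core roll-up behind a cube with a flexible number of dimensions can be sketched as an ordinary MapReduce job. The Hadoop Streaming-style Python below emits one key per subset of the dimension columns (with '*' marking rolled-up dimensions) and sums the measure per key; the input layout (dimension columns first, measure last) and the '*' convention are assumptions for illustration, not part of PigOLAPSketch or Zohmg.

```python
#!/usr/bin/env python
from itertools import combinations
import sys


def mapper():
    # Input lines: dim1,dim2,...,dimN,measure  (e.g. "us,mobile,2024-01-01,3.5")
    for line in sys.stdin:
        if not line.strip():
            continue
        fields = line.strip().split(",")
        dims, measure = fields[:-1], fields[-1]
        n = len(dims)
        # Emit one record per subset of dimensions: 2^n cube cells per input row.
        for r in range(n + 1):
            for kept in combinations(range(n), r):
                cell = [dims[i] if i in kept else "*" for i in range(n)]
                print("%s\t%s" % (",".join(cell), measure))


def reducer():
    # Keys arrive sorted, so summing the measure per cube cell is a group-by.
    current, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                print("%s\t%f" % (current, total))
            current, total = key, 0.0
        total += float(value)
    if current is not None:
        print("%s\t%f" % (current, total))


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

Newer Pig releases also ship CUBE and ROLLUP operators that perform essentially this expansion, which might be a good starting point for the Pig solution.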
