MapReduce project with data mining - hadoop

I am planning to do a MapReduce project using the Hadoop libraries and to test it on big data uploaded to AWS. I have not finalized an idea yet, but I am sure it will involve some kind of data processing, MapReduce design patterns, and possibly graph algorithms, Hive and Pig Latin. I would really appreciate it if someone could give me some ideas. I have a few of my own in mind.
In the end I have to work on a large data set, extract information from it and draw some conclusions. I have used Weka for data mining before (using trees).
But I am not sure whether Weka is the only thing I can work with right now. Are there other ways to work on a large data set and draw conclusions from it?
Also, how can I involve graphs in this?
Basically I want to do a research project, but I am not sure what exactly I should work on or what it should look like. Any thoughts? Suggested links or ideas? Knowledge sharing?

I suggest you check out Apache Mahout. It is a scalable machine learning and data mining framework that should integrate nicely with Hadoop.
Hive gives you a SQL-like language for querying big data. Essentially, it translates your high-level query into MapReduce jobs and runs them on the cluster.
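To make the idea of a query turning into map and reduce steps more concrete, here is a minimal local sketch in plain Python. It mimics what a query like SELECT region, SUM(amount) FROM sales GROUP BY region boils down to conceptually. The table, its columns and the function names are made up for illustration, and this is not Hive's actual execution plan, only the general shape of it.

```python
from collections import defaultdict

# Hypothetical rows of a table sales(region, amount); not from the original post.
ROWS = [("east", 10), ("west", 5), ("east", 7), ("west", 1), ("north", 4)]


def map_phase(rows):
    """Emit one (key, value) pair per row: the GROUP BY column and the summed column."""
    for region, amount in rows:
        yield region, amount


def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Apply the aggregate (here SUM) to the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def sum_by_region(rows):
    """Run the three phases in sequence on a single machine."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(sum_by_region(ROWS))
```

On a real cluster the map and reduce steps would run on many nodes in parallel and the shuffle would move data across the network; the sketch only shows how the query maps onto those phases.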
Another suggestion is to consider writing your data processing algorithm in R, a statistical software environment (similar to MATLAB). Instead of the standard R environment, I would recommend Revolution R, an environment for developing in R that has much more powerful tools for big data and clustering.
Edit: If you are a student, Revolution R has a free academic edition.
Edit: A third suggestion is to look at GridGain, which is another Map/Reduce implementation in Java that is relatively easy to run on a cluster.

As you are already working with MapReduce and Hadoop, you can extract some knowledge from your data using Mahout, or you can get some ideas from this very good book:
http://infolab.stanford.edu/~ullman/mmds.html
This book provides ideas for mining social-network graphs, and works with graphs in a couple of other ways too.
Hope it helps!

Related

Pros and cons of integrating hadoop with OBIEE

I am studying how to integrate Hadoop with OBIEE. However, I am unable to find any good article highlighting the pros and cons of integrating Hadoop with OBIEE. If anyone has this information, kindly share the link or details.
Pros: You can get your data from Hadoop
Cons: Pointless, unless your data is in Hadoop
As a question this really doesn't make much sense. You integrate OBIEE with wherever your data is, in order to analyse it.
+1 to Robin. The point of a source-agnostic tool is to analyze data wherever it lies.
Pushing data to a new storage "just because" isn't adding value. You have to have a reason, such as performance, explicit physical modelling (multidimensional cubes, for example) or the like.

Can Neo4j work with Hadoop?

Can Neo4j work with Hadoop, for social network analysis of big data? If yes, is it hard to make them work together, and what is the bottleneck in such a system?
Basically, I am looking for a solution for social network analysis of big data, where the network could have hundreds of millions of vertices. I would also like a user-friendly GUI for interactive exploration and analysis of graphs. Will Hadoop+Neo4j be good for this purpose? Or is Hadoop+Giraph or Spark+GraphX a better solution?
Any comments or suggestion will be appreciated. Thanks.
Spark + GraphX gives you faster performance. It is derived from the Pregel and GraphLab libraries. But it does not have any UI for viewing graph output directly; users have to build their own UI or extend a graph example from the D3 library.
See this link to learn more about Spark + GraphX:
https://spark.apache.org/docs/latest/graphx-programming-guide.html

cluster analysis Hadoop, Map reduce environment

We are currently trying to create some very basic personas based on our user database (a few million profiles). The goal at this stage is to find out what the characteristics of our users are, for example what they look like and what they are looking for, and to create several "typical" user profiles.
I believe the best way to achieve this would be to run a cluster analysis in order to find similarities among users.
The big roadblock however is how to get there. We are tracking our data in a Hadoop environment and I am being told that this could be potentially achieved with our tools.
I have familiarised myself with the theory of the topic and know that it can be done for example in SPSS (quite hard to use and limited to samples of large data sets).
The big question: Is it possible to perform one or several types of cluster analysis in a Hadoop environment and then visualise the results like in SPSS? It is my understanding that we would need to run several types of analysis in order to find the best way to cluster the data, including with respect to the distance measures between clusters.
I have not found any information on the internet with regards to this, so I wonder if this is possible at all, without a major programming effort (meaning literally implementing for example all the standard tools available in SPSS: Dendrograms, the different result tables and cluster graphs etc.).
Any input would be much appreciated. Thanks.

Where does map-reduce/hadoop come in in machine learning training?

Map-reduce/Hadoop is perfect for gathering insights from piles of data from various sources and organizing them the way we want.
But when it comes to training, my impression is that we have to dump all the training data into the algorithm (be it SVM, logistic regression, or random forest) all at once so that the algorithm can come up with a model that has seen it all. Can map-reduce/Hadoop help in the training part? If yes, how in general?
Yes. There are many MapReduce implementations, such as Hadoop Streaming, and even some easy tools like Pig, which can be used for learning. In addition, there are distributed learning toolsets built upon Map/Reduce, such as Vowpal Wabbit (https://github.com/JohnLangford/vowpal_wabbit/wiki/Tutorial). The big idea of this kind of method is to train on a small portion of the data at each node (as split by HDFS), then average the models, with the nodes communicating with each other. So the final model gets its updates directly from submodels built on parts of the data.
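The "train on parts, then average" idea can be sketched locally in a few lines of Python. This is only an illustration under simplifying assumptions: a one-variable linear model fitted by least squares, a single averaging round, and made-up helper names (fit_line, split, train_by_averaging). It is not how Vowpal Wabbit or any particular toolkit implements distributed training.

```python
def fit_line(points):
    """Ordinary least squares for y = w*x + b on one shard of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def split(data, n_shards):
    """Stand-in for HDFS splits: deal the records round-robin into shards."""
    return [data[i::n_shards] for i in range(n_shards)]


def train_by_averaging(data, n_shards):
    """Map: fit one submodel per shard. Reduce: average the parameters."""
    models = [fit_line(shard) for shard in split(data, n_shards)]
    w = sum(m[0] for m in models) / len(models)
    b = sum(m[1] for m in models) / len(models)
    return w, b


if __name__ == "__main__":
    # Hypothetical noiseless data drawn from y = 2x + 1.
    data = [(x, 2.0 * x + 1.0) for x in range(12)]
    print(train_by_averaging(data, 3))
```

With noisy data the averaged model is generally only an approximation of the model trained on everything at once, which is why practical systems tend to repeat the average-and-continue step over several rounds.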

practical usage of hadoop map reduce hive pig hbase

Hello,
I am learning Hadoop. Having read the material found on the net (tutorials, MapReduce concepts, Hive, Pig and so on) and developed some small applications with those tools, I would like to learn about the real-world uses of these technologies.
Which everyday software products are based on the Hadoop stack?
If you use the internet, there is a good chance that you are indirectly affected by Hadoop/MapReduce, from Google Search to Facebook to LinkedIn etc. Here are some interesting links showing how widespread Hadoop/MR usage is:
Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)
10 ways big data changes everything
One thing to note is that Hadoop/MR is not an efficient solution for every problem. Also consider other distributed programming models, such as those based on BSP.
Happy Hadooping !!!
Here are some sample MapReduce examples which will be helpful for beginners:
1. Word Count
2. SQL Aggregation using MapReduce
3. SQL Aggregation on multiple fields using MapReduce
URL - http://hadoopdeveloperguide.blogspot.in/
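For readers who want to see the first example in the list without setting up a cluster, here is a minimal word count written as a mapper and a reducer in plain Python. It is a local sketch of the pattern, not the code from the linked blog; the function names and the lowercase/whitespace tokenization are my own assumptions.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Sum all the 1s emitted for a single word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, group by key, then reduce, all on one machine."""
    groups = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            groups[word].append(one)
    return dict(reducer(word, counts) for word, counts in groups.items())


if __name__ == "__main__":
    print(word_count(["hello hadoop", "Hello world"]))
```

In Hadoop the grouping step in word_count is done by the framework's shuffle, so a real job only supplies the mapper and the reducer.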
