This is the Louvain method for finding communities in a social graph.
https://sites.google.com/site/findcommunities/
I want to run it on a big graph.
If you are not tied to Hadoop, I saw this implementation for Apache Spark:
https://github.com/Sotera/spark-distributed-louvain-modularity
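If you just want a quick baseline on Spark before wiring up that project, GraphX also ships a label propagation community detector you can call directly. The sketch below is not Louvain (it does not optimize modularity), and the HDFS paths are placeholders; it only shows what running a built-in community algorithm on GraphX looks like.

```scala
// Sketch: community detection with GraphX's built-in label propagation.
// This is NOT Louvain -- no modularity optimization -- just a quick baseline.
// All file paths are placeholders.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.graphx.lib.LabelPropagation

object CommunityBaseline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("community-baseline"))

    // Edge list: one "srcId dstId" pair per line (hypothetical path).
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/social_edges.txt")

    // Run label propagation for a fixed number of supersteps;
    // each vertex ends up labeled with a community id.
    val communities = LabelPropagation.run(graph, 10)

    communities.vertices
      .map { case (vertexId, communityId) => s"$vertexId\t$communityId" }
      .saveAsTextFile("hdfs:///output/communities")

    sc.stop()
  }
}
```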
I don't know of an implementation of this clustering method, which looks to be modularity-based. The main source of clustering algorithms in the Hadoop ecosystem is Mahout.
Take a look here: https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Perhaps one of the clustering algorithms listed would work or provide the basis for your own implementation.
Related
Can Neo4j work with Hadoop for social network analysis of big data? If so, is it hard to make them work together, and what is the bottleneck in such a system?
Basically, I am looking for a solution for social network analysis of big data, where the network could have hundreds of millions of vertices. I am also expecting a user-friendly GUI for interactive exploration and analysis of graphs. Will Hadoop+Neo4j be good for the above purpose? Or is Hadoop+Giraph or Spark+GraphX a better solution?
Any comments or suggestions will be appreciated. Thanks.
Spark + GraphX gives you better performance. GraphX is derived from the Pregel and GraphLab libraries. But it doesn't have any UI to view graph output directly; you need to build your own UI or extend one of the graph examples from the D3 library.
Check this link to learn more about Spark + GraphX:
https://spark.apache.org/docs/latest/graphx-programming-guide.html
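For a feel of the API, here is a minimal sketch that builds a small property graph in memory and runs the built-in PageRank. The vertices and edges are made-up toy data; a real job would load its graph from storage instead.

```scala
// Minimal GraphX sketch: build a tiny property graph, run PageRank,
// and join the scores back onto the vertex names. Toy data only.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

object GraphXQuickStart {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-quickstart"))

    // (id, name) vertices and directed edges with an integer attribute.
    val vertices = sc.parallelize(Seq[(VertexId, String)](
      (1L, "alice"), (2L, "bob"), (3L, "carol"), (4L, "dave")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1), Edge(3L, 4L, 1)))
    val graph = Graph(vertices, edges)

    // PageRank until the scores converge within the tolerance.
    val ranks = graph.pageRank(0.0001).vertices

    // Join the scores back to the names and print them.
    vertices.join(ranks)
      .map { case (_, (name, rank)) => s"$name\t$rank" }
      .collect()
      .foreach(println)

    sc.stop()
  }
}
```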
I have successfully implemented the shortest-path algorithm (breadth-first search) in Hadoop MapReduce. However, I have a question:
Is it possible to do a depth-first search graph traversal using Hadoop MapReduce?
Any links?
The nature of depth-first search makes it inappropriate for MapReduce jobs: you follow a single strict path to its end before backtracking into another one, which means you can't properly use the scalability Hadoop provides. I'm not aware of a solid working implementation, and I'm pretty sure you won't find one that uses the MapReduce paradigm in a good way.
If you want to implement graph algorithms on Hadoop on your own, you might want to have a look at some useful frameworks like Apache Giraph, xrime, or Pegasus. xrime also contains a shortest-path implementation that might be interesting for you.
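To see why breadth-first search fits the model while depth-first search does not: each BFS level is one map step (every discovered vertex proposes a distance to its neighbors) followed by one reduce step (keep the minimum distance per vertex). The sketch below expresses that per-level pattern with Spark RDDs on toy data purely for illustration; a Hadoop implementation runs the same map/reduce pair once per level, whereas DFS never has a whole frontier to process in parallel.

```scala
// Sketch of why BFS parallelizes: each round expands the WHOLE frontier with a
// map step and resolves distances with a reduce step -- one MapReduce pass per
// level. Toy adjacency list; expressed with Spark RDDs for illustration only.
import org.apache.spark.{SparkConf, SparkContext}

object FrontierBfs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("frontier-bfs").setMaster("local[*]"))

    // Adjacency list: vertex -> neighbors.
    val adjacency = sc.parallelize(Seq(
      1 -> Seq(2, 3), 2 -> Seq(4), 3 -> Seq(4), 4 -> Seq(5), 5 -> Seq.empty[Int])).cache()

    val source = 1
    // Best-known distances so far; start with only the source discovered.
    var distances = sc.parallelize(Seq(source -> 0))

    var changed = true
    while (changed) {
      // "Map": every discovered vertex proposes dist+1 to each of its neighbors.
      val proposals = adjacency.join(distances)
        .flatMap { case (_, (neighbors, dist)) => neighbors.map(n => n -> (dist + 1)) }
      // "Reduce": keep the minimum distance proposed for each vertex.
      val updated = distances.union(proposals)
        .reduceByKey((a, b) => math.min(a, b)).cache()
      changed = updated.count() > distances.count()   // any new vertices discovered?
      distances = updated
    }

    distances.collect().sorted.foreach { case (v, d) => println(s"vertex $v -> distance $d") }
    sc.stop()
  }
}
```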
I am planning to do a MapReduce project involving Hadoop libraries and testing it on big data uploaded to AWS. I have not finalized an idea yet, but I am sure it will involve some kind of data processing, MapReduce design patterns, and possibly graph algorithms, Hive, and Pig Latin. I would really appreciate it if someone could give me some ideas about it. I have a few of my own in mind.
In the end I have to work on some large data set, extract information, and derive conclusions. For this I have used Weka before for data mining (using trees).
But I am not sure if that is the only thing I can work with right now (using Weka). Are there any other ways by which I can work on large data and derive conclusions from it?
Also, how can I involve graphs in this?
Basically I want to do a research project, but I am not sure exactly what I should be working on or what it should look like. Any thoughts? Suggested links/ideas? Knowledge sharing?
I suggest you check Apache Mahout; it is a scalable machine learning and data mining framework that should integrate nicely with Hadoop.
Hive gives you a SQL-like language to query big data; essentially, it translates your high-level queries into MapReduce jobs and runs them on the data cluster.
Another suggestion is to consider implementing your data processing algorithm in R, a statistical software package (similar to MATLAB). Instead of the standard R environment, I would recommend using Revolution R, an environment for developing R, but with much more powerful tools for big data and clustering.
Edit: If you are a student, Revolution R has a free academic edition.
Edit: A third suggestion is to look at GridGain, which is another MapReduce implementation in Java that is relatively easy to run on a cluster.
As you are already working with MapReduce and Hadoop, you can extract some knowledge from your data using Mahout, or you can get some ideas from this very good book:
http://infolab.stanford.edu/~ullman/mmds.html
This book provides ideas for mining social-network graphs, and it works with graphs in a couple of other ways too.
Hope it helps!
I am currently implementing a parallel-for on Hadoop to iterate the mapper a number of times, as specified by the user. Can someone help me with a useful example that I can use to test my implementation? I am looking for some application in Hadoop that needs iteration of the mapper function.
Thank you
The simplest one is implementing the Apriori algorithm, which is used to find frequent itemsets.
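Apriori is naturally iterative because every pass re-scans the transactions with a new candidate set derived from the previous pass. The toy sketch below shows the first two passes with Spark RDDs purely to illustrate that per-pass structure (each pass is one map over the data plus a count-by-key, i.e. "run the mapper again"); the data and support threshold are made up.

```scala
// Illustration of Apriori's iterative structure on toy data: pass 1 counts
// single items, pass 2 re-scans the transactions and counts only the pairs
// whose items were frequent in pass 1.
import org.apache.spark.{SparkConf, SparkContext}

object AprioriTwoPasses {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("apriori-sketch").setMaster("local[*]"))
    val minSupport = 2

    val transactions = sc.parallelize(Seq(
      Set("bread", "milk"),
      Set("bread", "beer", "eggs"),
      Set("milk", "beer", "bread"),
      Set("milk", "beer"))).cache()

    // Pass 1: count individual items and keep the frequent ones.
    val frequentItems = transactions
      .flatMap(t => t.map(item => item -> 1))
      .reduceByKey(_ + _)
      .filter { case (_, count) => count >= minSupport }
      .keys.collect().toSet

    // Pass 2: re-scan the data; count only pairs built from frequent items.
    val frequentPairs = transactions
      .flatMap { t =>
        val kept = t.filter(frequentItems.contains).toSeq.sorted
        kept.combinations(2).map(pair => pair.mkString(",") -> 1)
      }
      .reduceByKey(_ + _)
      .filter { case (_, count) => count >= minSupport }

    frequentPairs.collect().foreach { case (pair, count) => println(s"{$pair} -> $count") }
    sc.stop()
  }
}
```

Each additional pass follows the same shape with larger candidate itemsets, which is why the algorithm maps cleanly onto a user-controlled number of mapper iterations.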
What exactly do you mean by "iteration of the mapper"? I have an example of starting a job recursively (each job running on the output of the last one).
Have a look here; it explains a simple graph mindist-search / graph exploration algorithm: http://codingwiththomas.blogspot.com/2011/04/graph-exploration-with-hadoop-mapreduce.html
A slightly more generic version is here:
http://codingwiththomas.blogspot.com/2011/04/controlling-hadoop-job-recursion.html
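The general pattern behind such recursive jobs is a driver loop that feeds each round's output directory back in as the next round's input and stops when a job counter reports no further updates. The skeleton below is only a generic sketch of that loop, not the code from the blog post; the counter name, directory layout, and round cap are assumptions, and the default identity mapper/reducer are used just so it compiles and runs end to end.

```scala
// Generic skeleton for "recursive" Hadoop jobs: run a round, feed its output
// back in as input, stop when a user counter says nothing changed.
// The default identity Mapper/Reducer are used only so this is runnable; a real
// job would plug in its own classes and increment the "UPDATED" counter.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

object RecursiveDriver {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    var inputDir = args(0)                  // starting input directory
    var round = 0
    var updated = 1L

    while (updated > 0 && round < 20) {     // hard cap as a safety net
      val outputDir = s"${args(1)}/round-$round"
      val job = Job.getInstance(conf, s"recursion round $round")
      job.setJarByClass(RecursiveDriver.getClass)
      // Real mapper/reducer classes would go here; the defaults are identities,
      // so with TextInputFormat they pass (offset, line) pairs through unchanged.
      job.setOutputKeyClass(classOf[LongWritable])
      job.setOutputValueClass(classOf[Text])
      FileInputFormat.addInputPath(job, new Path(inputDir))
      FileOutputFormat.setOutputPath(job, new Path(outputDir))

      if (!job.waitForCompletion(true)) sys.exit(1)

      // A real job increments this counter whenever it changes a record;
      // zero means the computation has converged and the loop can stop.
      updated = job.getCounters.findCounter("recursion", "UPDATED").getValue

      inputDir = outputDir                  // next round reads this round's output
      round += 1
    }
  }
}
```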
There are plenty of examples in data mining. You could try one of the clustering algorithms, for example.
I'm looking for a research/implementation-based project on Hadoop, and I came across the list posted on the wiki page: http://wiki.apache.org/hadoop/ProjectSuggestions. However, this page was last updated in September 2009, so I'm not sure whether some of these ideas have already been implemented. I was particularly interested in "Sort and Shuffle optimization in the MR framework", which talks about "combining the results of several maps on rack or node before the shuffle. This can reduce seek work and intermediate storage".
Has anyone tried this before? Is this implemented in the current version of Hadoop?
There is the combiner functionality (as described under the "Combine" section of http://wiki.apache.org/hadoop/HadoopMapReduce), which is more or less an in-memory shuffle. But I believe the combiner only aggregates the key-value pairs produced by a single map task, not all the pairs for a given node or rack.
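For reference, enabling the combiner is a one-line job setting. The word-count-style sketch below is my own illustration, not code from the wiki page: it reuses the reducer as the combiner, and that combiner runs only on each individual map task's local output before the shuffle, which is exactly the per-task limitation described above.

```scala
// Word-count-style sketch showing where the combiner plugs in. The combiner is
// the reducer class reused on each map task's local output before the shuffle;
// it never sees pairs from other map tasks, let alone a whole node or rack.
import java.lang.{Iterable => JIterable}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val word = new Text()
  override def map(key: LongWritable, value: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
      word.set(w)
      ctx.write(word, one)       // emit (word, 1) for every token in the line
    }
}

class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: JIterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    ctx.write(key, new IntWritable(sum))
  }
}

object CombinerDemo {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "combiner demo")
    job.setJarByClass(classOf[SumReducer])
    job.setMapperClass(classOf[TokenMapper])
    job.setCombinerClass(classOf[SumReducer])   // local, per-map-task aggregation
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```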
The project description is aimed at "optimization".
The combine feature is already present in the current Hadoop MapReduce, but with the proposed optimization it could probably run in a lot less time.
Sounds like a valuable enhancement to me.
I think it is a very challenging task. As I understand it, the idea is to build a computation tree instead of a "flat" map-reduce. A good example of this is Google's Dremel engine (now called BigQuery). I would suggest reading this paper: http://sergey.melnix.com/pub/melnik_VLDB10.pdf
If you are interested in this kind of architecture, you can also take a look at the open-source clone of this technology, Open Dremel.
http://code.google.com/p/dremel/