How to use MapReduce with Hadoop

I'm trying to implement an algorithm to find connected components in a large graph (on the scale of a social network) using MapReduce. I'm not familiar with Hadoop, though I've heard it can be used for this. I need some direction on using it.

Look at Apache Giraph. It is a Hadoop-based framework for working with graph algorithms.
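If you want to stay with plain MapReduce rather than adopt Giraph, the standard approach is iterative "hash-min" label propagation: every vertex starts labeled with its own id, each round a vertex sends its current label to all its neighbors and keeps the minimum label it sees, and the algorithm stops when no label changes, at which point vertices with equal labels are in the same component. Below is a minimal sketch of one iteration; the tab-separated line format, the class names, and the V/L message tags are illustration choices, not anything standardized:

```java
// One iteration of "hash-min" label propagation for connected components.
// Assumed line format (an illustration, not a standard):
//   vertexId<TAB>label<TAB>comma,separated,neighbors
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MinLabelPropagation {

  public static class LabelMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split("\t");
      long vertex = Long.parseLong(parts[0]);
      long label = Long.parseLong(parts[1]);
      String neighbors = parts.length > 2 ? parts[2] : "";
      // Re-emit the vertex's own record so the reducer can rebuild the adjacency list.
      ctx.write(new LongWritable(vertex), new Text("V\t" + label + "\t" + neighbors));
      // Propagate this vertex's current label to every neighbor.
      for (String n : neighbors.isEmpty() ? new String[0] : neighbors.split(",")) {
        ctx.write(new LongWritable(Long.parseLong(n)), new Text("L\t" + label));
      }
    }
  }

  public static class MinReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable vertex, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      long min = Long.MAX_VALUE, old = Long.MAX_VALUE;
      String neighbors = "";
      for (Text v : values) {
        String[] parts = v.toString().split("\t");
        if (parts[0].equals("V")) {      // the vertex's own record
          old = Long.parseLong(parts[1]);
          neighbors = parts.length > 2 ? parts[2] : "";
          min = Math.min(min, old);
        } else {                          // a label offered by a neighbor
          min = Math.min(min, Long.parseLong(parts[1]));
        }
      }
      if (min < old) {
        // Drives the outer loop: stop iterating once this counter stays at zero.
        ctx.getCounter("cc", "labelsChanged").increment(1);
      }
      ctx.write(vertex, new Text(min + "\t" + neighbors));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cc-iteration");
    job.setJarByClass(MinLabelPropagation.class);
    job.setMapperClass(LabelMapper.class);
    job.setReducerClass(MinReducer.class);
    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A small driver script would rerun this job, feeding each round's output in as the next round's input, until the labelsChanged counter stays at zero; the number of rounds needed is bounded by the graph diameter, which tends to be small for social-network-like graphs.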

Related

MapReduce integration in a grid computing environment

I would like to know: is it possible to integrate Apache Hadoop or the MapReduce framework into a grid computing environment?
Certainly it is possible; I have seen it in use.
IBM does it with its Spectrum Symphony grid middleware platform.
For details on the solution, read here: https://www.ibm.com/support/knowledgecenter/en/SSZUMP_7.1.2/sym_kc/sym_kc_managing_mapreduce_framework.html

How can I integrate Hadoop with Mahout?

How can I integrate Hadoop with Mahout?
I want to perform data analytics and need machine learning libraries.
I would start by reviewing the Mahout site and its tutorials; there are lots of useful links: http://mahout.apache.org
There are a number of books that will take you from first principles to producing data analytics; this one (http://shop.oreilly.com/product/0636920033400.do) is probably a good place to start if you know Python.
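If it helps to see what "Mahout on Hadoop" looks like in code, here is a minimal sketch using the classic Mahout 0.x KMeansDriver, which runs each k-means iteration as a MapReduce job on the cluster. The HDFS paths, k, and thresholds are placeholders, and the exact method signature varies between Mahout versions, so treat this as an outline rather than copy-paste code:

```java
// Sketch only: assumes the retired Mahout 0.x MapReduce API and that the
// input vectors have already been written to HDFS as a SequenceFile.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.clustering.kmeans.RandomSeedGenerator;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;

public class KMeansOnHadoop {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input  = new Path("hdfs:///user/me/vectors");    // hypothetical paths
    Path seeds  = new Path("hdfs:///user/me/seeds");
    Path output = new Path("hdfs:///user/me/kmeans-out");

    // Pick k random input points as the initial centroids.
    Path clusters = RandomSeedGenerator.buildRandom(
        conf, input, seeds, 10, new EuclideanDistanceMeasure());

    // Each k-means iteration runs as a MapReduce job over the input vectors.
    KMeansDriver.run(conf, input, clusters, output,
        0.01,   // convergence delta
        20,     // max iterations
        true,   // assign points to final clusters when done
        0.0,    // outlier classification threshold
        false); // false = run as MapReduce, not sequentially
  }
}
```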

Batch learning in stream learning for clustering

I have a "research" question:
Are there methods that combine batch learning (MapReduce) with stream learning for clustering?
Take a look at Apache Spark or Google Dataflow for programming models that work in both batch and stream mode.
Apache Spark has MLlib for machine learning.
There is some really interesting Spark Streaming/MLlib integration work coming out of the Freeman Lab, performing mini-batch clustering on streams by introducing a "forgetfulness" parameter:
https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
https://gist.github.com/freeman-lab/9672685
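For a concrete feel, here is a minimal sketch of that idea using Spark MLlib's StreamingKMeans. The monitored input directory, feature dimension, and parameter values are assumptions for illustration:

```java
// Minimal sketch: cluster a stream of feature vectors with StreamingKMeans.
// Each line dropped into the monitored directory is a comma-separated vector.
import org.apache.spark.SparkConf;
import org.apache.spark.mllib.clustering.StreamingKMeans;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingKMeansSketch {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("streaming-kmeans").setMaster("local[2]");
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

    // Parse each incoming line into a dense MLlib vector.
    JavaDStream<Vector> points = jssc.textFileStream("hdfs:///tmp/training")
        .map(line -> {
          String[] parts = line.split(",");
          double[] values = new double[parts.length];
          for (int i = 0; i < parts.length; i++) values[i] = Double.parseDouble(parts[i]);
          return Vectors.dense(values);
        });

    StreamingKMeans model = new StreamingKMeans()
        .setK(3)
        .setDecayFactor(0.5)              // < 1.0 gradually "forgets" old batches
        .setRandomCenters(2, 0.0, 42L);   // dimension 2, weight 0, seed 42

    model.trainOn(points);                // updates the centers on every mini-batch

    jssc.start();
    jssc.awaitTermination();
  }
}
```

The decay factor is the "forgetfulness" knob from the post above: 1.0 weights all batches equally, while smaller values discount older mini-batches so the cluster centers can drift with the stream.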

What is it exactly?

Why is Apache Hadoop defined as a Platform as a Service in Figure 1 of this link (http://www.ibm.com/developerworks/aix/library/au-cloud_apache/#figure2), but as a NoSQL wide-column-store database at http://nosql-databases.org?
I mean, when working with Hadoop, do I need a database too?
Thanks in advance.
Hadoop is basically a collection of Java software that fundamentally provides two things:
A distributed file system implementation (HDFS).
A framework for writing and running MapReduce jobs in Java.
Many things are built on top of these two pieces (like HBase, which is probably the columnar datastore you have read about).
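To make the second piece concrete, here is the canonical word-count job, a minimal sketch using the standard org.apache.hadoop.mapreduce API (input and output paths come from the command line):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(word, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);   // combine locally before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```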
A good resource for learning more about Hadoop is the Apache project page documentation. If that looks confusing, there is also a book called 'Hadoop: The Definitive Guide' which is pretty good reading.
If you want to read about how it all began, I'd recommend reading the Google MapReduce paper on which Hadoop is based.
Hope that helps.

Example application using HDFS + MapReduce

I am taking an academic course, "Middleware", which covers different aspects of distributed software systems, including an introduction to topics like distributed file systems. It also introduces HBase, Hadoop, MapReduce, HiveQL, and Pig Latin.
I want to know whether I can build a small project that integrates the above technologies. For starters, I am aware of the VM provided by Cloudera for getting a feel for Hadoop and playing around using Eclipse.
I was thinking along the lines of implementing an application that accepts a stream of events as input, analyzes it, and gives an output.
I have both Windows and Linux on my machine, with an i7 processor and 4 GB of RAM.
Please let me know how to get started; any suggestions for a simple example application are welcome.
Here is a blog post on analyzing Tweets using Hive/HDFS. And here is a blog post on performing Clickstream analytics using Pig and Hive.
Check some of the Big Data use cases here and try to solve an interesting problem.