What can I do with Hadoop and Nutch used as a search engine? I know that Nutch is used to build a web crawler, but I'm not getting the full picture. Can I use MapReduce with Nutch and run some MapReduce jobs? Any ideas are welcome. A few links would be greatly appreciated. Thanks.
If you only want to run map/reduce jobs, you don't need Nutch, only Hadoop. Hadoop gives you a cluster file system and a scheduler for map/reduce jobs on that file system.
As Nutch builds on top of Hadoop, you can create your own map/reduce jobs on Nutch data, as long as you understand the data structure and what the crawler is doing.
However, if you only want to run some map/reduce jobs, just install Hadoop and off you go.
I understand that running Nutch in deploy mode is distributed crawling based on Hadoop, but I couldn't fully understand what happens when we run it in local mode. Is Nutch independent of Hadoop in that case? And is the crawling process in local mode not based on MapReduce?
Nutch is based on MapReduce, regardless of how it runs. The Hadoop libs are dependencies of Nutch. In local mode, Nutch puts the Hadoop-related libs on the classpath and runs everything in a single JVM. In distributed mode, the 'hadoop' command is called.
See the Nutch script.
PS: if you use Nutch on a single machine, it makes sense to run it in pseudo-distributed mode so that you get the MapReduce UI to monitor the crawl, plus parallelism, etc.
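To make the local/distributed distinction concrete, here is a rough Python rendering of the kind of decision the launcher script makes. It is only a sketch under assumptions: that a Nutch .job file in the install directory signals deploy mode, that the libs live in a lib/ subdirectory, and the function name build_command is made up for illustration. The real script is a shell script and does more than this.

```python
import glob
import os


def build_command(nutch_home, nutch_class, args):
    """Return the command a launcher might run for the given Nutch class."""
    # Assumption: a *nutch*.job file in nutch_home means deploy (distributed) mode.
    job_files = sorted(glob.glob(os.path.join(nutch_home, "*nutch*.job")))
    if job_files:
        # Distributed mode: hand the job file to the 'hadoop' command.
        return ["hadoop", "jar", job_files[-1], nutch_class] + list(args)
    # Local mode: Hadoop libs go on the classpath and everything runs in one JVM.
    # Assumption: the libs are under nutch_home/lib.
    classpath = os.path.join(nutch_home, "lib", "*")
    return ["java", "-classpath", classpath, nutch_class] + list(args)


if __name__ == "__main__":
    # Hypothetical install path; prints the command instead of running it.
    print(build_command("/opt/nutch", "org.apache.nutch.crawl.Injector", ["crawl/crawldb", "urls"]))
```

Either way the same MapReduce job classes are executed; only the way they are launched differs.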
A cluster which runs MapReduce 2 doesn't have a job tracker; instead, its work is split into two separate components, the resource manager and the application master. However, these things are transparent to the user, who doesn't need to know whether the cluster is running MapReduce 1 or 2 when submitting a MapReduce job.
The thing I cannot quite understand is a YARN application. How is it different from a regular MapReduce application? What's the advantage of running a MapReduce job as a YARN application, etc.? Could someone shed some light on that for me?
MR1 has a JobTracker and TaskTrackers, which take care of MapReduce applications.
In MR2, Apache separated the management of the map/reduce process from the cluster's resource management by using YARN. YARN is a better resource manager than what we had in MR1. It also enables versatility. MR2 is built on top of YARN.
Apart from MapReduce, we can run applications like Spark, Storm, HBase, Tez, etc. on top of YARN, which we cannot do using MR1.
The following is the architecture for MR1 and MR2.
MR1: HDFS <----> MR
MR2: HDFS <----> YARN <----> MR
I am not sure about what Hadoop can and cannot do, and how easy things are.
I understand Hadoop is good at running MapReduce jobs and at providing HDFS, its distributed filesystem.
What else is Hadoop good at / easy to use for?
My problem: I would like to serve data that is the result of a MapReduce job. As I have a lot of traffic, I would need 3 front-end servers. Can Hadoop help me deploy a server on 3 of my n running nodes?
Basically, instead of running MapReduce on n machines, I would like to run a custom executable (my server) on 3 machines. And when one machine fails, Hadoop should take care of starting the job on another available machine.
Am I supposed to run that on the Hadoop cluster? Or should the Hadoop cluster be used only for MapReduce, with a separate cloud to serve the data from the Hadoop cluster?
Thanks for sharing your experience.
P.S. I am just considering Hadoop right now as a solution; I'm not tied to it.
Your question isn't actually clear but here is my shot.
You want to display the result of your Hadoop job? Usually a Hadoop job writes its result to HDFS. What you can do is create your own OutputFormat class. You might define an XMLOutputFormat, for example.
But the nice thing is that you can create your own Writable. Take a look at Database Access with Apache Hadoop. That tutorial shows how to save the output of a Hadoop job to a database system.
Your frontend can then query the database and show the result.
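To illustrate the last step (a frontend reading job results from a database), here is a small sketch. Note that it is not the custom OutputFormat/Writable approach from the tutorial: it assumes the job wrote plain tab-separated key/value lines, as the default text output does, and loads them into a database afterwards. SQLite, the results table and the function names are stand-ins chosen only to keep the example self-contained.

```python
import sqlite3


def load_results(conn, lines):
    """Load tab-separated key/value lines (e.g. from a part file) into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results (key TEXT PRIMARY KEY, value INTEGER)"
    )
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, value = line.split("\t", 1)
        rows.append((key, int(value)))
    # Re-running the load replaces existing keys instead of failing.
    conn.executemany(
        "INSERT OR REPLACE INTO results (key, value) VALUES (?, ?)", rows
    )
    conn.commit()


def lookup(conn, key):
    """What a frontend request handler might call to fetch one result."""
    row = conn.execute(
        "SELECT value FROM results WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Made-up sample lines standing in for the job output.
    load_results(conn, ["hadoop\t3\n", "nutch\t1\n"])
    print(lookup(conn, "hadoop"))
```

With several front-end servers you would point them all at a shared database server rather than an in-memory SQLite file; the query side stays the same.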
What sites cover Hadoop best practices (not books), where I can find a step-by-step process for creating new projects and small examples? I am not able to find a single site like this; please share.
There is an awesome article from Yahoo developers, Apache Hadoop: Best Practices and Anti-Patterns.
Hadoop is not a single application; instead, it is a distributed processing framework used by several applications that sit on top of it. Pig, Hive, HBase, Cassandra, etc. are a few of the many such applications, each designed for specific requirements. Underneath, all of these applications use the Hadoop framework, which mainly consists of a distributed file system (HDFS) and distributed processing (MapReduce).
Technically, when you have a bare-minimum Hadoop cluster (HDFS + MapReduce only), you can start writing MapReduce-based applications (in Java, or in other languages through Hadoop Streaming) to process some data.
What you could do is first download a pre-built, pre-configured Hadoop virtual machine image from the Cloudera or Hortonworks distribution and get it running on your machine. After that, start learning to write MapReduce jobs in Java and run them in your virtual machine.
Here is the URL to download Cloudera Hadoop Distribution VM
Here is the link to learn how to write the simplest wordcount job.
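For a feel of what a wordcount job looks like outside Java, here is a minimal sketch written in the Hadoop Streaming style, where the mapper and reducer read lines from stdin and write tab-separated key/value lines to stdout. The script layout, the map/reduce command-line argument and the function names are my own choices for illustration, not something prescribed by Hadoop.

```python
import sys
from itertools import groupby


def map_words(lines):
    """Mapper: emit (word, 1) for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_counts(pairs):
    """Reducer: sum counts per word. Input must be sorted by word,
    which is what the framework provides between map and reduce."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def main(argv):
    if len(argv) != 2 or argv[1] not in ("map", "reduce"):
        print("usage: wordcount.py map|reduce", file=sys.stderr)
        return
    if argv[1] == "map":
        for word, count in map_words(sys.stdin):
            print(f"{word}\t{count}")
    else:
        # Parse "word<TAB>count" lines produced by the map step.
        parsed = (line.rstrip("\n").split("\t", 1) for line in sys.stdin if line.strip())
        pairs = ((word, int(count)) for word, count in parsed)
        for word, total in reduce_counts(pairs):
            print(f"{word}\t{total}")


if __name__ == "__main__":
    main(sys.argv)
```

Saved as wordcount.py (a hypothetical file name), it can be tried without a cluster with something like: cat input.txt | python wordcount.py map | sort | python wordcount.py reduce. On a cluster, the same two commands would be given as the -mapper and -reducer options of the Hadoop Streaming jar.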
If I don't run any map/reduce jobs, do the JobTracker/TaskTrackers still need to be running for some internal HBase dependency?
No, you don't need either of them to run HBase alone.
Just a tip: there are always scripts that just start the HDFS, bin/start-dfs.sh for example.
As mentioned above, we don't need the JobTracker/TaskTrackers if we are dealing with just HBase. You can use bin/start-dfs.sh to start the NameNode/DataNodes. Moreover, bin/start-all.sh has been deprecated, so you should prefer bin/start-dfs.sh to start the NameNode/DataNodes and bin/start-mapred.sh to start the JobTracker/TaskTrackers. I would suggest using HBase in pseudo-distributed mode for learning and testing purposes, as standalone HBase doesn't use HDFS. You should be a bit careful while configuring it, though.
Basic case: You don't need the JobTracker and TaskTrackers when using only HDFS + HBase (in a smaller testing environment you don't even need HDFS).
When you would like to run MapReduce jobs using data stored in HBase, you'll obviously need both JobTracker and TaskTrackers.