Hadoop Multi-Node Cluster Set-up Basic Questions

Hadoop Multi-Node Cluster Set-up Basic Questions - hadoop

I'm trying to set up a multi-node Apache Hadoop cluster. I'm following their tutorial here: http://hadoop.apache.org/docs/stable/cluster_setup.html. I've set up single-node Hadoop set-ups on each individual node, as well as Sun Java installed on each. Unfortunately, the documentation is not very clear and a bit outdated. Am I supposed to (under the 'Configuring the Hadoop Daemons' sections) update those files (.conf/*-site.xml) on every single node?
Also, how am I supposed to locate the host:port/IP:port pair for each node?
Sorry, I am new to all of this. Thank you in advance for your help!

Related

Druid + Hadoop (for both uses, deep-store & indexing)

If I have Hadoop server (pseudo-distributed mode) running on a separate machine, do I still need to have these files under my Druid's conf dir ? : http://druid.io/docs/latest/configuration/hadoop.html
The way I see it:
Looks like those -site.xml files are for Hadoop server..., and Druid only acts as Hadoop client. So I don't think Druid needs the hdfs-site.xml.
Core-site.xml..., ok, I can get it. I mean, Druid nees to know the IP of the name node (hadoop).
Mapred-site.xml, partially. Druid needs to know the status of mapreduce jobs (I suppose it will delegate the indexing to Hadoop as MR job). So it needs to communicate with those job trackers to see if the indexing is finished / failed / in progress. For that, it needs the URL of Hadoop JT.
However Druid does not need this prperty "mapreduce.cluster.local.dir", because it does not participate actively in MR job.
Yarn-site.xml? Maybe it should stay, partially. At least for submitting a job (?).
What about HDFS-site.xml? I think this can be scrapped completely.
Capacity-scheduler.xml? It can go.
Please correct me If I'm wrong.
These questions / doubts arises because I'm quite new to hadoop. I have my hadoop setup running. Pseudo distributed mode. I also tested it with javascript webhdfs library to write and read file. Also have tried the sample MR jobs provided by the hadoop dist. So I guess my hadoop setup is fine. I'm just a bit unsure on the Druid site, partly because the doc is not ver clear about it.
Btw.... I have hadoop 2.7.2... While the hadoop-client libs used by Druid is still on 2.3.0.
Should I downgrade my hadoop server to 2.3.0?
http://druid.io/docs/latest/operations/other-hadoop.html
Thansk,
Raka

Please add the mapred-site.xml core-site.xml hdfs-site.xml yarn-site.xml to the classpath.
Also you don't need to downgrade druid works well with 2.7.X.
As you can see in the doc you can use multiple version of hadoop.

Hortonworks Sandboxes in a cluster

I'm new to Hadoop ecosystem and i'm trying to understand how a cluster works. Until now, I've been using Hortonworks distribution to test anything in a single-node mode. Now I'm wondering - if it's possible to connect two VM's (running on one PC physically) so that one will be NameNode and the other one DataNode (i'm not sure if they should be separated). I found a similar tutorial for Cloudera, so I guess it's possible in theory.
If it's not even a good idea to run two Hadoop VM's on one PC, - then what is the most painless way to configure and run it on two separate PC's?

May be it will be useful. This post "Setting up a Hadoop cluster"
http://gbif.blogspot.ru/2011/01/setting-up-hadoop-cluster-part-1-manual.html

How to integrate Cassandra with Hadoop

I am trying to set up clustered Hadoop and Cassandra. Many sites I've read use a lot of words and concepts I am slowly grasping but I still need some help.
I have 3 nodes. I want to set up Hadoop and Cassandra on all 3. I am familiar with Hadoop and Cassandra individually but how so they work together and how do I configure them to work together? Also, how do I set up one node dedicated to, for example, analytics?
So far I have modified my hadoop-env.sh to point to Cassandra libs. I have put this on all of my nodes. Is that correct? What more do I need to do and how do I run it - start Hadoop cluster or Cassandra first?
Last little question: do I connect directly to Cassandra or to Hadoop from within my Java client?

Rather then connecting them via your java client, you need to install Cassandra On top of Hadoop. Please follow the article for step by step assistance.
BR

Hadoop on cluster configuration /Installation

Hi i have a small doubt , I have started to use in my curiosity but now i have the following problem
My scenario is like this - i have 10 machines connected in LAN and i need to create Name Node in one system and Data Nodes in remaining 9 machines . So do i need to install Hadoop on all the 10 machines ?
For example i have ( 1.. 10 ) machines , where machine1 is Server and from machine(2..9) are slaves[Data Nodes] so do i need to install hadoop on all 10 machines ?
And i have searched a lot On Hadoop cluster network on commodity machine but i dint get any thing related to Installation [ that is configuration]. Some of them given like how to config and install Hadoop on own system but not on the clustered environment
Can any one help me ? and give me the detailed idea or article suggested links to do the above process
Thanks

Yes, you need Hadoop installed in every node and each node should have the services started as for appropriate for its role. Also the configuration files, present on each node, have to coherently describe the topology of the cluster, including location/name/port for various common used resources (eg. namenode). Doing this manually, from scratch, is error prone, specially if you never did this before and you don't know exactly what you're trying to do. Also would be good to decide on a specific distribution of Hadoop (HortonWorks, Cloudera, HDInsight, Intel, etc)
I would recommend use one of the many deployment solutions out there. My favorite is Puppet, but I'm sure Chef will do too.
A different (perhaps better?) alternative is to use Ambari, which is a Hadoop specialized deployment and administering solution. See Deploying and Managing Hadoop Clusters with AMBARI.
Some Puppet resources to get you started: Using Vagrant, Puppet, Testing & Hadoop

Please verify below tutorial
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
Hope it helps

Yes hadoop needs to be there on all the computers
For clustered Environment please go through the video

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio