Hadoop client and cluster separation

I am a newbie to Hadoop, and to Linux as well. My professor asked us to separate the Hadoop client and the cluster using port mapping or a VPN. I don't understand the purpose of such a separation. Can anybody give me a hint?
Now I get the idea of client/cluster separation. I think Hadoop also has to be installed on the client machine, and when the client submits a Hadoop job, it is submitted to the masters of the cluster.
And I have some naive ideas:
1. Create a client machine and install Hadoop on it.
2. Set fs.default.name to hdfs://master:9000.
3. Set dfs.namenode.name.dir to file://master/home/hduser/hadoop_tmp/hdfs/namenode.
Is that correct?
4. Beyond that, I don't know how to set dfs.namenode.name.dir and the other configuration properties.
5. I think the main idea is to set up the configuration files so that jobs run on the Hadoop cluster, but I don't know how to do that exactly.

First of all, this link has detailed information on how the client communicates with the namenode:
http://www.informit.com/articles/article.aspx?p=2460260&seqNum=2
To my understanding, your professor wants you to have a separate node as a client from which you can run Hadoop jobs, but that node should not be part of the Hadoop cluster itself.
Consider a scenario where you have to submit a Hadoop job from a client machine that is not part of the existing Hadoop cluster, and the job is expected to be executed on that cluster.
The namenode and datanodes form the Hadoop cluster; the client submits jobs to the namenode.
To achieve this, the client should have the same copy of the Hadoop distribution and configuration that is present on the namenode.
Only then will the client know on which node the JobTracker is running and the IP address of the namenode for accessing HDFS data.
Go through the configuration on the namenode:
core-site.xml will have this property:
<property>
<name>fs.default.name</name>
<value>192.168.0.1:9000</value>
</property>
mapred-site.xml will have this property:
<property>
<name>mapred.job.tracker</name>
<value>192.168.0.1:8021</value>
</property>
These two important properties must be copied into the client machine's Hadoop configuration.
You also need to set one additional property in the mapred-site.xml file to avoid a PrivilegedActionException:
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>/user</value>
</property>
You also need to update /etc/hosts on the client machine with the IP addresses and hostnames of the namenode and datanodes.
Now you can submit a job from the client machine with the hadoop jar command, and the job will be executed on the Hadoop cluster. Note that you shouldn't start any Hadoop services on the client machine.
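For illustration, here is a rough sketch of a client-side driver using the old MR1 API, assuming the configuration copied from the namenode (the IP addresses are the ones from the properties above; the class name, job name and HDFS paths are made up for the example):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class RemoteJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(RemoteJobDriver.class);
        // Normally these come from the core-site.xml / mapred-site.xml copied
        // from the namenode; they are set explicitly here to make the point clear.
        conf.set("fs.default.name", "hdfs://192.168.0.1:9000");
        conf.set("mapred.job.tracker", "192.168.0.1:8021");

        conf.setJobName("client-submitted-job");        // hypothetical job name
        conf.setMapperClass(IdentityMapper.class);      // pass-through mapper
        conf.setReducerClass(IdentityReducer.class);    // pass-through reducer
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path("/user/hduser/input"));    // hypothetical HDFS paths
        FileOutputFormat.setOutputPath(conf, new Path("/user/hduser/output"));

        // The job goes to the JobTracker and runs on the cluster, even though
        // no Hadoop daemon runs on the client machine.
        JobClient.runJob(conf);
    }
}

You can package this into a jar and launch it with the hadoop jar command from the client; either way it is the configuration, not where the daemons run, that decides that the job ends up on the cluster.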

Users shouldn't be able to disrupt the functionality of the cluster. That's the meaning. Imagine there is a whole bunch of data scientists that launch their jobs from one of the cluster's masters. In case someone launches a memory-intensive operation, the master processes that are running on the same machine could end up with no memory and crash. That would leave the whole cluster in a failed state.
If you separate the client node from the master/slave nodes, users can still crash the client, but the cluster stays up.

Related

Hadoop cluster and client connection

I have a Hadoop cluster. Now I want to install Pig and Hive on another machine acting as a client. The client machine will not be part of that cluster, so is this possible? If it is possible, how do I connect that client machine to the cluster?
First of all, if you have a Hadoop cluster then you must have a master node (namenode) plus slave nodes (datanodes).
The other piece is the client node.
The Hadoop cluster works like this:
The namenode and datanodes form the Hadoop cluster, and the client submits jobs to the namenode.
To achieve this, the client should have the same copy of the Hadoop distribution and configuration that is present on the namenode.
Only then will the client know on which node the JobTracker is running and the IP address of the namenode for accessing HDFS data.
See Link1 and Link2 for client configuration.
According to your question:
After completing the Hadoop cluster configuration (master + slaves + client), you need to do the following steps:
Install Hive and Pig on the master node.
Install Hive and Pig on the client node.
Now start writing Pig/Hive code on the client node.
Feel free to comment if anything is unclear!

Hadoop HA Namenode remote access

I'm configuring the Hadoop 2.2.0 stable release with an HA namenode, but I don't know how to configure remote access to the cluster.
I have the HA namenode configured with manual failover, and I defined dfs.nameservices. I can access HDFS via the nameservice from all the nodes included in the cluster, but not from outside.
I can perform operations on HDFS by contacting the active namenode directly, but I don't want that; I want to contact the cluster and be redirected to the active namenode. I think this is the normal configuration for an HA cluster.
Does anyone know how to do that?
(thanks in advance...)
You have to add more values to hdfs-site.xml:
<property>
<name>dfs.ha.namenodes.myns</name>
<value>machine-98,machine-99</value>
</property>
<property>
<name>dfs.namenode.rpc-address.myns.machine-98</name>
<value>machine-98:8100</value>
</property>
<property>
<name>dfs.namenode.rpc-address.myns.machine-99</name>
<value>machine-145:8100</value>
</property>
<property>
<name>dfs.namenode.http-address.myns.machine-98</name>
<value>machine-98:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.myns.machine-99</name>
<value>machine-145:50070</value>
</property>
You need to contact one of the namenodes (as you're currently doing); there is no separate cluster endpoint to contact.
The Hadoop client code knows the addresses of the two namenodes (from the client-side HDFS configuration) and can identify which is the active and which is the standby. There might be a way to interrogate a ZooKeeper node in the quorum to identify the active/standby (maybe, I'm not sure), but you might as well just check one of the namenodes: you have a 50/50 chance it's the active one.
I'd have to check, but you might be able to query either one if you're just reading from HDFS.
For the active namenode you can always ask ZooKeeper.
You can get the active namenode from the ZooKeeper path below:
/hadoop-ha/namenodelogicalname/ActiveStandbyElectorLock
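If automatic failover (ZKFC) is enabled, a rough sketch of reading that znode with the plain ZooKeeper Java client could look like this; the quorum addresses and the myns nameservice are placeholders, and note that the znode payload is a serialized record written by the failover controller with the active namenode's hostname embedded in it:

import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ActiveNamenodeLookup {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper quorum used by the HA failover controllers
        // (placeholder hosts). For brevity there is no watcher or connection
        // handling; a real client would wait for the connected event.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, null);
        Stat stat = new Stat();
        // "myns" stands in for your dfs.nameservices logical name.
        byte[] data = zk.getData("/hadoop-ha/myns/ActiveStandbyElectorLock", false, stat);
        // The payload is not plain text, but the active namenode's hostname
        // is readable inside it.
        System.out.println(new String(data, "UTF-8"));
        zk.close();
    }
}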
There are two ways to resolve this situation in Java code:
1. Use core-site.xml and hdfs-site.xml in your code and load the configuration via addResource.
2. Set the Hadoop configuration in your code via conf.set.
An example using conf.set follows below.
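Here is a minimal sketch of the conf.set approach, reusing the nameservice and host names from the hdfs-site.xml snippet above (adjust them to your own cluster); the failover proxy provider is the piece that lets the client find the active namenode on its own:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaHdfsClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Alternative (the addResource way): load the cluster's own files, e.g.
        // conf.addResource(new Path("/path/to/core-site.xml"));
        // conf.addResource(new Path("/path/to/hdfs-site.xml"));
        conf.set("fs.defaultFS", "hdfs://myns");
        conf.set("dfs.nameservices", "myns");
        conf.set("dfs.ha.namenodes.myns", "machine-98,machine-99");
        conf.set("dfs.namenode.rpc-address.myns.machine-98", "machine-98:8100");
        conf.set("dfs.namenode.rpc-address.myns.machine-99", "machine-145:8100");
        // Client-side proxy that tries the configured namenodes and sticks
        // with the active one.
        conf.set("dfs.client.failover.proxy.provider.myns",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        FileSystem fs = FileSystem.get(URI.create("hdfs://myns"), conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}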

HBase and ZooKeeper roles in Hadoop?

I have installed a Hadoop single-node cluster on my Ubuntu machine and am able to run the NameNode, DataNode, etc. Now I need to install HBase and ZooKeeper, but I don't really know what they are. Could anybody give me a brief description of those tools?
Thanks
First of all, I would strongly recommend that you go through the official pages of these projects. Go here for HBase and here for ZooKeeper.
HBase is a NoSQL datastore that runs on top of your existing Hadoop cluster (HDFS). It provides capabilities like random, real-time reads and writes, which HDFS, being a filesystem, lacks. Since it is a NoSQL datastore it doesn't follow SQL conventions and terminology. HBase provides a good set of APIs (including Java and Thrift), and it also integrates seamlessly with the MapReduce framework. But along with all these advantages, keep in mind that random reads and writes are quick yet always carry additional overhead, so think carefully before you make a decision.
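As a small illustration of the Java API just mentioned, here is a hedged sketch against the older HTable client of that era (the table name, column family and quorum address are made up, and the table is assumed to already exist, e.g. created in the HBase shell with create 'mytable', 'cf'):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath; the quorum value is a placeholder.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost");

        // Assumes a table "mytable" with column family "cf" already exists.
        HTable table = new HTable(conf, "mytable");
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("greeting"), Bytes.toBytes("hello"));
        table.put(put);

        Result result = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("greeting"))));
        table.close();
    }
}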
ZooKeeper is a high-performance coordination service for distributed applications(like HBase). It exposes common services like naming, configuration management, synchronization, and group services, in a simple interface so you don't have to write them from scratch. You can use it off-the-shelf to implement consensus, group management, leader election, and presence protocols. And you can build on it for your own, specific needs.
HBase relies completely on ZooKeeper. HBase gives you the option of using its built-in ZooKeeper, which gets started whenever you start HBase, but that is not a good fit for a production cluster. In such scenarios it's always better to have a dedicated ZooKeeper cluster and integrate it with your HBase cluster.
Note: you should always have an odd number of nodes in your ZooKeeper quorum.
HTH
An overview:
Zookeeper: In short, ZooKeeper is a distributed application (cluster) configuration and management tool, and it exists independently of HBase. From the docs:
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
HBase: the NoSQL datastore on top of HDFS (it can use a plain file system instead, but then it guarantees no data durability). HBase contains two primary services:
Master server - The master server (HMaster) co-ordinates the cluster and performs administrative operations, such as assigning regions and balancing the load.
Region servers - The region servers do the real work. A subset of the data of each table is handled by each region server, and clients talk to region servers to access data in HBase.
The connection between HBase and Zookeeper:
A distributed HBase relies completely on Zookeeper (for cluster configuration and management). In Apache HBase, ZooKeeper coordinates, communicates, and shares state between the Masters and RegionServers. HBase has a design policy of using ZooKeeper only for transient data (that is, for coordination and state communication). Thus if the HBase’s ZooKeeper data is removed, only the transient operations are affected — data can continue to be written and read to/from HBase.
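If you want to see that transient coordination data for yourself, here is a rough sketch with the plain ZooKeeper client that lists the children of HBase's parent znode (by default /hbase; the quorum address is a placeholder, and for a standalone HBase it would typically be localhost:2181):

import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class HBaseZnodeList {
    public static void main(String[] args) throws Exception {
        // For brevity there is no watcher or connection handling.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, null);
        // /hbase is the default value of zookeeper.znode.parent; the entries
        // underneath it are the transient coordination state mentioned above.
        List<String> children = zk.getChildren("/hbase", false);
        for (String child : children) {
            System.out.println("/hbase/" + child);
        }
        zk.close();
    }
}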
Once you have HBase started, you can verify the processes it has started using the jps command:
$ jps
The command lists all the Java processes on the machine (HBase itself is a Java application); the probable output for a simple standalone HBase setup looks like this:
62019 Jps
61098 HMaster
61233 HRegionServer
61003 HQuorumPeer
Technically speaking:
By default HBase manages ZooKeeper itself, i.e. it starts and stops the ZooKeeper quorum (the cluster of ZooKeeper nodes) when we start and stop HBase. To verify the setting, look into the file conf/hbase-env.sh (in your HBase directory); there must be a line:
export HBASE_MANAGES_ZK=true
Once that is set, all we need to do is set the following directives in conf/hbase-site.xml (from the docs):
<configuration>
...
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
<description> The port at which the clients will connect.
</description>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>rs1.example.com,rs2.example.com,rs3.example.com,rs4.example.com,rs5.example.com</value>
<description>Comma separated list of servers in the ZooKeeper Quorum.
For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
By default this is set to localhost for local and pseudo-distributed modes
of operation. For a fully-distributed setup, this should be set to a full
list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
this is the list of servers which we will start/stop ZooKeeper on.
</description>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/usr/local/zookeeper</value>
<description>Property from ZooKeeper's config zoo.cfg.
The directory where the snapshot is stored.
</description>
</property>
...
</configuration>

Coexistence of Hadoop MR1 and MR2

Is it possible to run both Hadoop MR1 and MR2 together on the same cluster (at least in theory)?
If yes, how can I do that?
In theory, you can do it like this:
run the DataNode, TaskTracker, and NodeManager on one machine
run the NameNode, SecondaryNameNode, and ResourceManager on other machines
with all processes on different ports
But this is not recommended; see the Cloudera blog:
"Make sure you are not trying to run MRv1 and YARN on the same set of nodes at the same time. This is not supported; it will degrade performance and may result in an unstable cluster deployment."
In theory, yes.
Unpack the tarball into 2 different locations, owned by different users.
In both of them, change all mapred/yarn related ports to mutually exclusive sets.
Run the datanodes from only one of the locations.
Start the mapred/yarn-related daemons in both locations.
Do post here if it works.
Also, dfs.name.dir and dfs.data.dir should be different for the MR1 and MR2 installations:
<property>
<name>dfs.name.dir</name>
<value>/home/userx/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/userx/hdfs/data</value>
</property>
It seems that for MapR this is not only theory but practice; check this link.
You don't need to run both; just run Hadoop 2.0, which provides full backward compatibility for MapReduce applications written for Hadoop 1.0.
There are a few minor changes in the API; please look at the link to check whether any of them affect your applications.

Is it possible to run Hadoop in Pseudo-Distributed operation without HDFS?

I'm exploring the options for running a hadoop application on a local system.
As with many applications the first few releases should be able to run on a single node, as long as we can use all the available CPU cores (Yes, this is related to this question). The current limitation is that on our production systems we have Java 1.5 and as such we are bound to Hadoop 0.18.3 as the latest release (See this question). So unfortunately we can't use this new feature yet.
The first option is to simply run hadoop in pseudo distributed mode. Essentially: create a complete hadoop cluster with everything on it running on exactly 1 node.
The "downside" of this form is that it also uses a full fledged HDFS. This means that in order to process the input data this must first be "uploaded" onto the DFS ... which is locally stored. So this takes additional transfer time of both the input and output data and uses additional disk space. I would like to avoid both of these while we stay on a single node configuration.
So I was thinking: Is it possible to override the "fs.hdfs.impl" setting and change it from "org.apache.hadoop.dfs.DistributedFileSystem" into (for example) "org.apache.hadoop.fs.LocalFileSystem"?
If this works the "local" hadoop cluster (which can ONLY consist of ONE node) can use existing files without any additional storage requirements and it can start quicker because there is no need to upload the files. I would expect to still have a job and task tracker and perhaps also a namenode to control the whole thing.
Has anyone tried this before?
Can it work or is this idea much too far off the intended use?
Or is there a better way of getting the same effect: Pseudo-Distributed operation without HDFS?
Thanks for your insights.
EDIT 2:
This is the config I created for hadoop 0.18.3
conf/hadoop-site.xml using the answer provided by bajafresh4life.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:33301</value>
</property>
<property>
<name>mapred.job.tracker.http.address</name>
<value>localhost:33302</value>
<description>
The job tracker http server address and port the server will listen on.
If the port is 0 then the server will start on a free port.
</description>
</property>
<property>
<name>mapred.task.tracker.http.address</name>
<value>localhost:33303</value>
<description>
The task tracker http server address and port.
If the port is 0 then the server will start on a free port.
</description>
</property>
</configuration>
Yes, this is possible, although I'm using 0.19.2. I'm not too familiar with 0.18.3, but I'm pretty sure it shouldn't make a difference.
Just make sure that fs.default.name is set to the default (which is file:///), and mapred.job.tracker is set to point to where your jobtracker is hosted. Then start up your daemons using bin/start-mapred.sh. You don't need to start up the namenode or datanodes. At this point you should be able to run your map/reduce jobs using bin/hadoop jar ...
We've used this configuration to run Hadoop over a small cluster of machines using a Netapp appliance mounted over NFS.
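As a quick sanity check of the idea, here is a small sketch against the old API: with fs.default.name set to file:/// the client resolves paths through the local filesystem instead of HDFS (the path below is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalFsCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same setting as in hadoop-site.xml above: no HDFS, just the local disk.
        conf.set("fs.default.name", "file:///");

        FileSystem fs = FileSystem.get(conf);
        // Expected to print a LocalFileSystem implementation class.
        System.out.println(fs.getClass().getName());
        // Input data can be referenced in place, without copying into a DFS.
        System.out.println(fs.exists(new Path("/tmp/input.txt")));  // hypothetical local path
    }
}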
