Hadoop HA Namenode remote access - hadoop

I'm configuring the Hadoop 2.2.0 stable release with an HA NameNode, but I don't know how to configure remote access to the cluster.
I have the HA NameNode configured with manual failover and I defined dfs.nameservices. I can access HDFS through the nameservice from all the nodes in the cluster, but not from outside.
I can perform operations on HDFS by contacting the active NameNode directly, but I don't want that; I want to contact the cluster and then be redirected to the active NameNode. I think this is the normal configuration for an HA cluster.
Does anyone know how to do that?
(thanks in advance...)

You have to add more properties to hdfs-site.xml:
<property>
<name>dfs.ha.namenodes.myns</name>
<value>machine-98,machine-99</value>
</property>
<property>
<name>dfs.namenode.rpc-address.myns.machine-98</name>
<value>machine-98:8100</value>
</property>
<property>
<name>dfs.namenode.rpc-address.myns.machine-99</name>
<value>machine-145:8100</value>
</property>
<property>
<name>dfs.namenode.http-address.myns.machine-98</name>
<value>machine-98:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.myns.machine-99</name>
<value>machine-145:50070</value>
</property>
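The client also has to know the logical nameservice and how to pick the active NameNode. A minimal sketch of the remaining client-side entries, assuming the nameservice is called myns as in the properties above; fs.defaultFS would go in core-site.xml and the other two in hdfs-site.xml:

```xml
<!-- hdfs-site.xml: declare the logical nameservice used in the property names above -->
<property>
<name>dfs.nameservices</name>
<value>myns</value>
</property>
<!-- hdfs-site.xml: class the HDFS client uses to locate the active NameNode -->
<property>
<name>dfs.client.failover.proxy.provider.myns</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<!-- core-site.xml: make the nameservice the default file system (assumed choice) -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://myns</value>
</property>
```

With these on the remote client, paths can be addressed as hdfs://myns/... rather than through a specific NameNode host.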

You need to contact one of the NameNodes (as you're currently doing); there is no separate cluster endpoint to contact.
The Hadoop client code knows the addresses of the two NameNodes (they are listed in hdfs-site.xml, and core-site.xml refers to them through the nameservice) and can identify which is the active and which is the standby. There might be a way to interrogate a ZooKeeper node in the quorum to identify the active and standby (maybe, I'm not sure), but you might as well check one of the NameNodes: you have a 50/50 chance it's the active one.
I'd have to check, but you might be able to query either if you're just reading from HDFS.

For the active NameNode you can always ask ZooKeeper.
You can get the active NameNode from the ZK path below (this lock znode exists only when automatic failover with ZKFC is enabled):
/hadoop-ha/namenodelogicalname/ActiveStandbyElectorLock

There are two ways to resolve this situation (in Java code):
1. Use core-site.xml and hdfs-site.xml in your code: load the configuration files via addResource.
2. Use conf.set in your code: set the Hadoop configuration via conf.set (the answer's example used this approach).

Related

Hadoop client and cluster separation

I am a newbie to Hadoop, and to Linux as well. My professor asked us to separate the Hadoop client and cluster using port mapping or a VPN. I don't understand the meaning of such a separation. Can anybody give me a hint?
Now I get the idea of cluster/client separation. I think Hadoop also has to be installed on the client machine. When the client submits a Hadoop job, it is submitted to the masters of the cluster.
And I have some naive ideas:
1. Create a client machine and install Hadoop.
2. Set fs.default.name to hdfs://master:9000
3. Set dfs.namenode.name.dir to file://master/home/hduser/hadoop_tmp/hdfs/namenode
Is this correct?
4. Beyond that, I don't know how to set dfs.namenode.name.dir and the other configurations.
5. I think the main idea is to set the configuration files so that the job runs on the Hadoop cluster, but I don't know exactly how to do it.
First of all, this link has detailed information on how the client communicates with the NameNode:
http://www.informit.com/articles/article.aspx?p=2460260&seqNum=2
To my understanding, your professor wants a separate client node from which you can run Hadoop jobs, but which is not itself part of the Hadoop cluster.
Consider a scenario where you have to submit a Hadoop job from a client machine that is not part of the existing Hadoop cluster, and the job is expected to be executed on the cluster.
The NameNode and DataNodes form the Hadoop cluster; the client submits the job to the cluster's master node.
To achieve this, the client should have the same copy of the Hadoop distribution and configuration that is present on the NameNode.
Only then will the client know on which node the JobTracker is running, and the IP of the NameNode for accessing HDFS data.
Go through the configuration on the NameNode.
core-site.xml will have this property:
<property>
<name>fs.default.name</name>
<value>192.168.0.1:9000</value>
</property>
mapred-site.xml will have this property-
<property>
<name>mapred.job.tracker</name>
<value>192.168.0.1:8021</value>
</property>
These two important properties must be copied to the client machine's Hadoop configuration.
You also need to set one additional property in mapred-site.xml to avoid the Privileged Action Exception.
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>/user</value>
</property>
Also, you need to update /etc/hosts on the client machine with the IP addresses and hostnames of the NameNode and DataNodes.
Now you can submit a job from the client machine with the hadoop jar command, and the job will be executed on the Hadoop cluster. Note that you shouldn't start any Hadoop service on the client machine.
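The property names above are the Hadoop 1 (JobTracker) ones. On a Hadoop 2.x cluster running YARN the client-side equivalents are named differently. A rough sketch, reusing the 192.168.0.1 address and port 9000 from above and assuming (this is not stated in the answer) that the ResourceManager runs on the same master host:

```xml
<!-- core-site.xml: default file system, the newer name for fs.default.name -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.168.0.1:9000</value>
</property>
<!-- mapred-site.xml: submit MapReduce jobs to YARN instead of a JobTracker -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!-- yarn-site.xml: where the client finds the ResourceManager (assumed host) -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>192.168.0.1</value>
</property>
```

As with the Hadoop 1 setup, the simplest way to get consistent values is to copy these files from the cluster rather than write them by hand.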
Users shouldn't be able to disrupt the functionality of the cluster. That's the meaning. Imagine there is a whole bunch of data scientists that launch their jobs from one of the cluster's masters. In case someone launches a memory-intensive operation, the master processes that are running on the same machine could end up with no memory and crash. That would leave the whole cluster in a failed state.
If you separate client node from master/slave nodes, users could still crash the client, but the cluster would stay up.

Hadoop Ha namenode java client

I am new to HDFS. I am writing a Java client that can connect and write data to a remote Hadoop cluster.
// FileSystem.get expects a java.net.URI rather than a String
String hdfsUrl = "hdfs://xxx.xxx.xxx.xxx:8020";
FileSystem fs = FileSystem.get(java.net.URI.create(hdfsUrl), conf);
This works fine. My problem is how to handle an HA-enabled Hadoop cluster, which has two NameNodes: one active and one standby. How can I identify the active NameNode from my client code at runtime?
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.1/bk_system-admin-guide/content/ch_hadoop-ha-3-1.html has following details about a java class that can be used to contact active namenodes
dfs.client.failover.proxy.provider.[$nameservice ID]:
This property specifies the Java class that HDFS clients use to contact the Active NameNode. DFS Client uses this Java class to determine which NameNode is the current Active and therefore which NameNode is currently serving client requests.
Use the ConfiguredFailoverProxyProvider implementation if you are not using a custom implementation.
For example:
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
How can I use this class in my java client or is there any other way to identify the active namenode...
Not sure if it is the same context, but given a hadoop cluster one should put the core-site.xml (taken from cluster) into application classpath or in a hadoop configuration object(org.apache.hadoop.conf.Configuration) and then access that file with URL "hdfs://mycluster/path/to/file" where mycluster is the name of the hadoop cluster. Like this I have successfully read a file from hadoop cluster in a spark application.
Your client should have hdfs-site.xml of the hadoop cluster, as that would contain the nameservice that is being used for both namenodes and information about both namenodes hostname, port to connect etc.
You have to set these confs in your client as mentioned in the answer of ( https://stackoverflow.com/a/39445389/2584384 ):
"dfs.nameservices", "hadooptest"
"dfs.client.failover.proxy.provider.hadooptest" , "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
"dfs.ha.namenodes.hadooptest", "nn1,nn2"
"dfs.namenode.rpc-address.hadooptest.nn1", "10.10.14.81:8020"
"dfs.namenode.rpc-address.hadooptest.nn2", "10.10.14.82:8020"
So your client would use class "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider" to find which namenode is active and will accordingly redirect the request to that namenode. It basically first tries to connect to first uri and if it fails then tries the second uri.
https://blog.woopi.org/wordpress/files/hadoop-2.6.0-javadoc/org/apache/hadoop/hdfs/server/namenode/ha/ConfiguredFailoverProxyProvider.html
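The conf.set pairs listed above can equally be kept in the client's hdfs-site.xml instead of being set in code. A sketch using that answer's example nameservice name and addresses (they are illustrative values, not defaults):

```xml
<!-- hdfs-site.xml on the client: same keys and values as the conf.set pairs -->
<property>
<name>dfs.nameservices</name>
<value>hadooptest</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.hadooptest</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.namenodes.hadooptest</name>
<value>nn1,nn2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.hadooptest.nn1</name>
<value>10.10.14.81:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.hadooptest.nn2</name>
<value>10.10.14.82:8020</value>
</property>
```

The client would then open paths as hdfs://hadooptest/path/to/file and leave the choice between nn1 and nn2 to the proxy provider.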

Simulating Map-reduce using Cloudera

I want to use Cloudera to simulate a Hadoop job on a single machine (with many VMs, of course). I have 2 questions:
1) Can I change the replication policy of HDFS in cloudera?
2) Can I see cpu usage of each VMs?
You can use hadoop fs -setrep to change the replication factor on any file. You can also change the default replication factor by modifying hdfs-site.xml by adding the following:
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
You'll have to log into each box and use top to see the cpu usage of each VM. There is nothing out of the box in Hadoop that lets you see this.
I found out that I can change data replication policy by changing "ReplicationTargetChooser.java".

Using s3 as fs.default.name or HDFS?

I'm setting up a Hadoop cluster on EC2 and I'm wondering how to do the DFS. All my data is currently in S3 and all map/reduce applications use S3 file paths to access the data. I've been looking at how Amazon's EMR is set up, and it appears that for each jobflow a NameNode and DataNodes are set up. Now I'm wondering if I really need to do it that way or if I could just use s3(n) as the DFS. If so, are there any drawbacks?
Thanks!
In order to use S3 instead of HDFS, fs.default.name in core-site.xml needs to point to your bucket:
<property>
<name>fs.default.name</name>
<value>s3n://your-bucket-name</value>
</property>
It's recommended that you use S3N and NOT the simple S3 implementation, because S3N is readable by any other application and by yourself :)
Also, in the same core-site.xml file you need to specify the following properties:
fs.s3n.awsAccessKeyId
fs.s3n.awsSecretAccessKey
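Written out as core-site.xml entries, the credential properties listed above would look roughly like this; the values are placeholders, not real keys:

```xml
<!-- core-site.xml: AWS credentials for the s3n:// file system (placeholder values) -->
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```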
Any intermediate data of your job goes to HDFS, so yes, you still need a namenode and datanodes
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/core-default.xml
fs.default.name is deprecated, and maybe fs.defaultFS is better.
I was able to get the s3 integration working using
<property>
<name>fs.default.name</name>
<value>s3n://your-bucket-name</value>
</property>
in core-site.xml, and could list the files using the hdfs ls command. But there should presumably still be NameNode and separate DataNode configurations, because I was still not sure how the data gets partitioned across the DataNodes.
Should we have local storage for the NameNode and DataNodes?

Is it possible to run Hadoop in Pseudo-Distributed operation without HDFS?

I'm exploring the options for running a hadoop application on a local system.
As with many applications the first few releases should be able to run on a single node, as long as we can use all the available CPU cores (Yes, this is related to this question). The current limitation is that on our production systems we have Java 1.5 and as such we are bound to Hadoop 0.18.3 as the latest release (See this question). So unfortunately we can't use this new feature yet.
The first option is to simply run hadoop in pseudo distributed mode. Essentially: create a complete hadoop cluster with everything on it running on exactly 1 node.
The "downside" of this form is that it also uses a full fledged HDFS. This means that in order to process the input data this must first be "uploaded" onto the DFS ... which is locally stored. So this takes additional transfer time of both the input and output data and uses additional disk space. I would like to avoid both of these while we stay on a single node configuration.
So I was thinking: Is it possible to override the "fs.hdfs.impl" setting and change it from "org.apache.hadoop.dfs.DistributedFileSystem" into (for example) "org.apache.hadoop.fs.LocalFileSystem"?
If this works the "local" hadoop cluster (which can ONLY consist of ONE node) can use existing files without any additional storage requirements and it can start quicker because there is no need to upload the files. I would expect to still have a job and task tracker and perhaps also a namenode to control the whole thing.
Has anyone tried this before?
Can it work or is this idea much too far off the intended use?
Or is there a better way of getting the same effect: Pseudo-Distributed operation without HDFS?
Thanks for your insights.
EDIT 2:
This is the config I created for hadoop 0.18.3
conf/hadoop-site.xml using the answer provided by bajafresh4life.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:33301</value>
</property>
<property>
<name>mapred.job.tracker.http.address</name>
<value>localhost:33302</value>
<description>
The job tracker http server address and port the server will listen on.
If the port is 0 then the server will start on a free port.
</description>
</property>
<property>
<name>mapred.task.tracker.http.address</name>
<value>localhost:33303</value>
<description>
The task tracker http server address and port.
If the port is 0 then the server will start on a free port.
</description>
</property>
</configuration>
Yes, this is possible, although I'm using 0.19.2. I'm not too familiar with 0.18.3, but I'm pretty sure it shouldn't make a difference.
Just make sure that fs.default.name is set to the default (which is file:///), and mapred.job.tracker is set to point to where your jobtracker is hosted. Then start up your daemons using bin/start-mapred.sh . You don't need to start up the namenode or datanodes. At this point you should be able to run your map/reduce jobs using bin/hadoop jar ...
We've used this configuration to run Hadoop over a small cluster of machines using a Netapp appliance mounted over NFS.

Resources