I am new to HDFS. I am writing a Java client that can connect and write data to a remote Hadoop cluster.
String hdfsUrl = "hdfs://xxx.xxx.xxx.xxx:8020";
FileSystem fs = FileSystem.get(URI.create(hdfsUrl), conf);
This works fine. My problem is how to handle an HA-enabled Hadoop cluster. An HA-enabled cluster has two NameNodes: one active and one standby. How can I identify the active NameNode from my client code at runtime?
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.1/bk_system-admin-guide/content/ch_hadoop-ha-3-1.html has the following details about a Java class that can be used to contact the active NameNode:
dfs.client.failover.proxy.provider.[$nameservice ID]:
This property specifies the Java class that HDFS clients use to contact the Active NameNode. DFS Client uses this Java class to determine which NameNode is the current Active and therefore which NameNode is currently serving client requests.
Use the ConfiguredFailoverProxyProvider implementation if you are not using a custom implementation.
For example:
<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
How can I use this class in my Java client, or is there any other way to identify the active NameNode?
Not sure if it is the same context, but given a Hadoop cluster, you should put the core-site.xml taken from the cluster on the application classpath or load it into a Hadoop configuration object (org.apache.hadoop.conf.Configuration), and then access files with a URL like "hdfs://mycluster/path/to/file", where mycluster is the nameservice name of the Hadoop cluster. This is how I have successfully read a file from a Hadoop cluster in a Spark application.
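A minimal sketch of that approach, assuming the cluster's core-site.xml and hdfs-site.xml have been copied to a local conf directory and that the nameservice is called mycluster (the class name and local paths below are placeholders, not something from the cluster itself):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Load the cluster's own config files; hdfs-site.xml defines the
        // nameservice and the RPC addresses of both NameNodes.
        conf.addResource(new Path("/path/to/conf/core-site.xml"));
        conf.addResource(new Path("/path/to/conf/hdfs-site.xml"));

        // Address the cluster by nameservice, not by a single NameNode host.
        FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
        try (InputStream in = fs.open(new Path("/path/to/file"))) {
            System.out.println("first byte: " + in.read());
        }
        fs.close();
    }
}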
Your client should have the hdfs-site.xml of the Hadoop cluster, as that contains the nameservice used for both NameNodes, along with each NameNode's hostname, the port to connect to, and so on.
You have to set these configuration values in your client, as mentioned in this answer ( https://stackoverflow.com/a/39445389/2584384 ):
"dfs.nameservices", "hadooptest"
"dfs.client.failover.proxy.provider.hadooptest" , "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
"dfs.ha.namenodes.hadooptest", "nn1,nn2"
"dfs.namenode.rpc-address.hadooptest.nn1", "10.10.14.81:8020"
"dfs.namenode.rpc-address.hadooptest.nn2", "10.10.14.82:8020"
So your client will use the class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider to find out which NameNode is active and will redirect requests to that NameNode accordingly. It basically tries to connect to the first URI and, if that fails, tries the second one.
https://blog.woopi.org/wordpress/files/hadoop-2.6.0-javadoc/org/apache/hadoop/hdfs/server/namenode/ha/ConfiguredFailoverProxyProvider.html
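A minimal sketch of a client using exactly those settings (hadooptest, nn1/nn2, and the two RPC addresses are the example values listed above; the class name is a placeholder):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "hadooptest");
        conf.set("dfs.client.failover.proxy.provider.hadooptest",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        conf.set("dfs.ha.namenodes.hadooptest", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.hadooptest.nn1", "10.10.14.81:8020");
        conf.set("dfs.namenode.rpc-address.hadooptest.nn2", "10.10.14.82:8020");

        // The failover proxy provider resolves the active NameNode at runtime,
        // so the URI uses the nameservice instead of a NameNode host.
        FileSystem fs = FileSystem.get(URI.create("hdfs://hadooptest"), conf);
        System.out.println("Home directory: " + fs.getHomeDirectory());
        fs.close();
    }
}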
Related
I have a Hadoop cluster. Now I want to install Pig and Hive on another machine as a client. The client machine will not be part of that cluster, so is this possible? If possible, how do I connect that client machine to the cluster?
First of all, if you have a Hadoop cluster then you must have a master node (NameNode) plus slave nodes (DataNodes).
The other kind of node is the client node.
The working of a Hadoop cluster is as follows:
The NameNode and DataNodes form the Hadoop cluster, and the client submits jobs to the NameNode.
To achieve this, the client should have the same copy of the Hadoop distribution and configuration that is present on the NameNode.
Only then will the client know on which node the JobTracker is running, and the IP of the NameNode to access HDFS data.
Go to Link1 and Link2 for client configuration.
As for your question: after completing the Hadoop cluster configuration (master + slaves + client), you need to do the following steps:
Install Hive and Pig on Master Node
Install Hive and Pig on Client Node
Now start writing Pig/Hive code on the client node.
Feel free to comment if you have any doubts.
As far as I know, Hadoop 1.x had a Secondary NameNode, but it was used to create a checkpoint image of the primary NameNode's metadata, which the primary uses when it fails and starts up again. But what is the use of the Secondary NameNode in Hadoop 2.x, given that we already have a hot standby present?
As far as I know, Hadoop 2.x can be set up in two ways:
1. With HA (a high-availability cluster): if you are setting up an HA cluster then you do not need a Secondary NameNode, because the standby NameNode keeps its state synchronized with the active NameNode.
The HDFS NameNode High Availability feature lets you run redundant NameNodes in the same cluster in an active/passive configuration with a hot standby. Both NameNodes require the same type of hardware configuration. In an HA Hadoop cluster, the active NameNode writes its metadata edits to separate JournalNodes, from which the standby reads to keep itself in sync.
In the event of a failover, the standby NameNode ensures that its namespace is completely up to date with the edit logs before it transitions to the active state. So there is no need for a Secondary NameNode in this cluster setup.
2. Without HA: you can have a Hadoop setup without a standby NameNode. Then the Secondary NameNode acts as you already described for Hadoop 1.x.
When you configure HA for NameNodes, the Secondary NameNode is not used. However, you can still configure HDFS without HA (with a NameNode and a Secondary NameNode). That part hasn't changed much since Hadoop 1.x.
I am a newbie to Hadoop, and to Linux as well. My professor asked us to separate the Hadoop client and the cluster using port mapping or a VPN. I don't understand the meaning of such a separation. Can anybody give me a hint?
Now I get the idea of client/cluster separation. I think it requires that Hadoop is also installed on the client machine, and when the client submits a Hadoop job, it is submitted to the masters of the cluster.
And I have some naive ideas:
1. Create a client machine and install Hadoop on it.
2. Set fs.default.name to hdfs://master:9000.
3. Set dfs.namenode.name.dir to file://master/home/hduser/hadoop_tmp/hdfs/namenode.
Is that correct?
4. Then I don't know how to set dfs.namenode.name.dir and the other configuration options.
5. I think the main idea is to set the configuration files so that the job runs on the Hadoop cluster, but I don't know how to do that exactly.
First of all, this link has detailed information on how the client communicates with the NameNode:
http://www.informit.com/articles/article.aspx?p=2460260&seqNum=2
To my understanding, your professor wants you to have a separate node as a client, from which you can run Hadoop jobs, but which is not part of the Hadoop cluster itself.
Consider a scenario where you have to submit a Hadoop job from a client machine, and the client machine is not part of the existing Hadoop cluster, yet the job is expected to be executed on the Hadoop cluster.
The NameNode and DataNodes form the Hadoop cluster, and the client submits jobs to the NameNode.
To achieve this, the client should have the same copy of the Hadoop distribution and configuration that is present on the NameNode.
Only then will the client know on which node the JobTracker is running, and the IP of the NameNode to access HDFS data.
Go through the configuration on the NameNode.
core-site.xml will have this property:
<property>
<name>fs.default.name</name>
<value>192.168.0.1:9000</value>
</property>
mapred-site.xml will have this property:
<property>
<name>mapred.job.tracker</name>
<value>192.168.0.1:8021</value>
</property>
These two important properties must be copied to the client machine's Hadoop configuration.
You also need to set one additional property in the mapred-site.xml file, to avoid a privileged-action exception:
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>/user</value>
</property>
Also, you need to update /etc/hosts on the client machine with the IP addresses and hostnames of the NameNode and DataNodes.
Now you can submit a job from the client machine with the hadoop jar command, and the job will be executed on the Hadoop cluster. Note that you shouldn't start any Hadoop services on the client machine.
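As a quick sanity check before running hadoop jar, here is a minimal sketch of a client-side Configuration mirroring the three properties above (the 192.168.0.1 addresses are the example values from this answer, with the hdfs:// scheme added; the class name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ClientConfigCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Properties copied from the cluster's core-site.xml and mapred-site.xml.
        conf.set("fs.default.name", "hdfs://192.168.0.1:9000");
        conf.set("mapred.job.tracker", "192.168.0.1:8021");
        conf.set("mapreduce.jobtracker.staging.root.dir", "/user");

        // Verify that the client can reach HDFS on the cluster before submitting jobs.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to " + fs.getUri());
        fs.close();
    }
}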
Users shouldn't be able to disrupt the functionality of the cluster. That's the point. Imagine a whole bunch of data scientists launching their jobs from one of the cluster's masters. If someone launches a memory-intensive operation, the master processes running on that same machine could run out of memory and crash, leaving the whole cluster in a failed state.
If you separate the client node from the master/slave nodes, users can still crash the client, but the cluster stays up.
I'm configuring the Hadoop 2.2.0 stable release with an HA NameNode, but I don't know how to configure remote access to the cluster.
I have the HA NameNode configured with manual failover, and I defined dfs.nameservices; I can access HDFS via the nameservice from all the nodes included in the cluster, but not from outside.
I can perform operations on HDFS by contacting the active NameNode directly, but I don't want that; I want to contact the cluster and then be redirected to the active NameNode. I think that is the normal configuration for an HA cluster.
Does anyone know how to do that?
(Thanks in advance.)
You have to add more properties to hdfs-site.xml:
<property>
<name>dfs.ha.namenodes.myns</name>
<value>machine-98,machine-99</value>
</property>
<property>
<name>dfs.namenode.rpc-address.myns.machine-98</name>
<value>machine-98:8100</value>
</property>
<property>
<name>dfs.namenode.rpc-address.myns.machine-99</name>
<value>machine-145:8100</value>
</property>
<property>
<name>dfs.namenode.http-address.myns.machine-98</name>
<value>machine-98:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.myns.machine-99</name>
<value>machine-145:50070</value>
</property>
You need to contact one of the NameNodes (as you're currently doing); there is no single cluster endpoint to contact.
The Hadoop client code knows the addresses of the two NameNodes (in hdfs-site.xml) and can identify which is the active and which is the standby. There might be a way to interrogate a ZooKeeper node in the quorum to identify the active/standby (maybe, I'm not sure), but you might as well just check one of the NameNodes: you have a 50/50 chance it's the active one.
I'd have to check, but you might be able to query either one if you're just reading from HDFS.
For the active NameNode, you can always ask ZooKeeper.
You can get the active NameNode from the ZooKeeper path below:
/hadoop-ha/namenodelogicalname/ActiveStandbyElectorLock
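A rough sketch of reading that lock znode with the plain ZooKeeper client (the zk1/zk2/zk3 connect string and the mycluster logical name are placeholders for your environment; the znode payload is a serialized protobuf, but the active NameNode's hostname is readable inside it):

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ActiveNameNodeFromZk {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to the ZooKeeper quorum used by the HA failover controllers.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 10000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();
        // "mycluster" stands in for your NameNode logical name / nameservice.
        byte[] data = zk.getData("/hadoop-ha/mycluster/ActiveStandbyElectorLock", false, null);
        // The lock znode is held by the currently active NameNode.
        System.out.println(new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}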
There are two ways to resolve this situation in Java code:
1. Use the cluster's core-site.xml and hdfs-site.xml in your code: load the configuration via addResource.
2. Set the Hadoop configuration in your code via conf.set.
An example using conf.set follows below.
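A minimal sketch of the conf.set approach, reusing the myns nameservice and the machine-98/machine-99 NameNode IDs from the hdfs-site.xml snippet above (the hostnames and ports are those example values; adjust them for your cluster, and the class name is a placeholder):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteHaClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.nameservices", "myns");
        conf.set("dfs.ha.namenodes.myns", "machine-98,machine-99");
        conf.set("dfs.namenode.rpc-address.myns.machine-98", "machine-98:8100");
        conf.set("dfs.namenode.rpc-address.myns.machine-99", "machine-145:8100");
        conf.set("dfs.client.failover.proxy.provider.myns",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // Address the cluster by nameservice; the proxy provider locates the active NameNode.
        FileSystem fs = FileSystem.get(URI.create("hdfs://myns"), conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}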
With an Oozie workflow, you have to specify the cluster's JobTracker in the properties for the workflow. This is easy when you have a single JobTracker:
jobTracker=hostname:port
When the cluster is configured for an HA (high-availability) JobTracker, I need to be able to set up my properties files so they can hit either of the JobTracker hosts, without having to update all my properties files when the JobTracker has failed over to the second node.
When accessing one JobTracker over HTTP, it redirects to the other if it isn't running, but Oozie doesn't use HTTP, so there is no redirect, and the workflow fails if the properties file specifies the JobTracker host that is not running.
How can I configure my properties file to handle a JobTracker running in HA?
I just finished setting up some Oozie workflows to use HA JobTrackers and NameNodes. The key is to use the logical name of the HA service you configured, and not any individual hostnames or ports. For example, the default HA JobTracker name is 'logicaljt'. Replace hostname:port with 'logicaljt', and everything should just work, as long as the node from which you're running Oozie has the appropriate hdfs-site and mapred-site configs properly installed (implicitly due to being part of the cluster, or explicitly due to adding a gateway role to it).
Specify the nameservice of the cluster in which HA is enabled.
For example, in the properties file:
namenode=hdfs://<nameservice>
jobTracker=<nameservice>:8032