Behavior of HDFS High Availability - hadoop

How does distributed copy (distcp) work between two clusters when the NameNode (NN) fails in a High Availability (HA) configuration?
Will the job fail because the active NameNode and the standby NameNode have different IP addresses?

Depending on how your HDFS HA is configured and whether Automatic Failover is implemented, it might work (I personally haven't tested this specific command during a failover).
Another important point is to use names for the services and to have DNS properly set up and configured for all involved nodes (you should never use direct IP addresses).

Yashwanth,
In an HA Hadoop cluster, it is not recommended to reference the active NameNode directly in distcp commands. The simple answer to your question is yes, the job can fail if you hardcode a NameNode IP or hostname in the distcp command. In an HA Hadoop cluster you should use the cluster's nameservice ID instead of an IP in the distcp command.
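A rough sketch of what that looks like (the nameservice IDs "clusterA" and "clusterB" are placeholders for whatever dfs.nameservices defines in each cluster's hdfs-site.xml):

# Copy between two HA clusters by nameservice ID rather than by NameNode host or IP
hadoop distcp hdfs://clusterA/data/source hdfs://clusterB/data/target

The client-side HA settings (dfs.nameservices, dfs.ha.namenodes.<nameservice> and dfs.client.failover.proxy.provider.<nameservice>) resolve each nameservice to whichever NameNode is currently active, so a failover does not invalidate the paths.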

Related

Remote access to HDFS on Kubernetes

I am trying to setup HDFS on minikube (for now) and later on a DEV kubernetes cluster so I can use it with Spark. I want Spark to run locally on my machine so I can run in debug mode during development so it should have access to my HDFS on K8s.
I have already set up 1 namenode deployment and a datanode statefulset (3 replicas) and those work fine when I am using HDFS from within the cluster. I am using a headless service for the datanodes and a cluster-ip service for the namenode.
The problem starts when I am trying to expose hdfs. I was thinking of using an ingress for that but that only exposes port 80 outside of the cluster and maps paths to different services inside the cluster which is not what I'm looking for. As far as I understand, my local spark jobs (or hdfs client) talk to the namenode which replies with an address for each block of data. That address though is something like 172.17.0.x:50010 and of course my local machine can't see those.
Is there any way I can make this work? Thanks in advance!
I know this question is about just getting it to run in a dev environment, but HDFS is very much a work in progress on K8s, so I wouldn't by any means run it in production (as of this writing). It's quite tricky to get it working on a container orchestration system because:
You are talking about a lot of data and a lot of nodes (namenodes/datanodes) that are not meant to start/stop in different places in your cluster.
You have the risk of having a constantly unbalanced cluster if you are not pinning your namenodes/datanodes to a K8s node (which defeats the purpose of having a container orchestration system)
If you run your namenodes in HA mode and your namenodes die and restart for any reason, you run the risk of corrupting the namenode metadata, which would make you lose all your data. It's also risky if you have a single namenode and you don't pin it to a K8s node.
You can't scale up and down easily without running in an unbalanced cluster. Running an unbalanced cluster defeats one of the main purposes of HDFS.
If you look at DC/OS they were able to make it work on their platform, so that may give you some guidance.
In K8s you basically need to create services for all your namenode ports and all your datanode ports. Your client needs to be able to find every namenode and datanode so that it can read/write from them. Also, some ports cannot go through an Ingress because they are layer 4 (TCP) ports, for example the IPC port 8020 on the namenode and 50020 on the datanodes.
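A rough sketch of that, assuming the Deployment is called namenode and the StatefulSet pods are datanode-0, datanode-1, datanode-2 (those names come from this question; everything else is an assumption, not a tested recipe):

# Expose the namenode IPC port and each datanode's data-transfer port as NodePort services
kubectl expose deployment namenode --name=namenode-ipc --type=NodePort --port=8020
kubectl expose pod datanode-0 --name=datanode-0-data --type=NodePort --port=50010
# repeat for datanode-1, datanode-2, ...

On the client side, setting dfs.client.use.datanode.hostname=true makes the HDFS client connect to datanodes by the hostname the namenode reports rather than by the internal pod IP, which you can then map onto whatever you exposed. It is still fiddly, which is part of why HDFS on K8s is hard.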
Hope it helps!

Dynamic IPs for Hadoop cluster

I need to set up a multi-node Hadoop cluster. So far, I have done installations using static IP addresses for each of the cluster nodes. However, in my latest cluster, I need to work with DHCP-assigned nodes. So I am wondering how I should get the cluster working and surviving restarts, etc.
Is it mandatory to have static IP address for the cluster nodes or can we get it working with dynamic IPs as well?
Any expert guidance please.
For standalone and pseudo-distributed modes, you can get going with a dynamic IP address, since everything runs on a single node.
For fully distributed mode, the nodes are identified by the masters and slaves files located in 'HADOOP_HOME/conf'. The entries in these files are hostnames, which must resolve, for example via '/etc/hosts'. So when the IP of any node changes and its name no longer resolves to it, Hadoop can no longer identify that machine (and even if the name resolves to some other machine, that machine has no Hadoop configured). Thus you cannot achieve a fully distributed Hadoop setup if the addresses drift away from the names.
Get your DHCP configured on a router if you can, ideally with reservations so that each node keeps its address; otherwise set up DHCP on the nodes yourselves. And get going!
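Purely as an illustration (the hostnames and addresses below are made up), the idea is that the masters/slaves files and all the *-site.xml configs refer only to stable names, and name resolution absorbs whatever addresses DHCP hands out:

# /etc/hosts, kept identical on every node; Hadoop config refers only to these names
192.168.1.10   hadoop-master
192.168.1.11   hadoop-slave1
192.168.1.12   hadoop-slave2

If the leases can change the addresses, you need either DHCP reservations so they effectively don't, or some mechanism that keeps these name-to-address mappings current on every node.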

What are the differences between ZooKeeper, JournalNode tasks and the Quorum Journal Manager in Hadoop?

After studying the material on a number of websites and videos, I am confused about the functionalities and the differences in purpose of the three Hadoop components ZooKeeper, JournalNode and the Quorum Journal Manager.
Could anyone please explain the reasons for introducing each of the above and the differences in their purposes and functionalities?
Thanks in advance.
Think of it like this: ZooKeeper is a group of people, each assigned to watch over a factory and coordinate it; a JournalNode is a place where all the factory managers can check each other's status and coordinate. QJM combines both ideas and is used in HA for better coordination in case of failover.
ZooKeeper coordinates HBase RegionServers and other Hadoop components that require ZooKeeper, such as the failover controllers in an HA NameNode setup.
JournalNodes keep the standby NameNode in sync with the active NameNode by holding the shared edit log (they do not coordinate DataNodes).
QJM is the mechanism the NameNodes use to write edits to, and read edits from, the quorum of JournalNodes.
In a core Hadoop HA setup, the JournalNodes are what is strictly necessary for sharing edits in a distributed setup; ZooKeeper is needed on top of that for automatic failover.
Firstly, quorum means that a majority is needed for decisions. So when you see the word "quorum" you should think of a clustered, multi-host configuration. You will hear this term for both ZooKeeper and JournalNodes.
A short description of their functionalities will help you distinguish their purposes.
Zookeeper: ZooKeeper is the central synchronisation service for information that applications need to check frequently. There can be many kinds of such information: naming structure, state information, configuration information (or simply configuration), etc. The most common case is application configuration. When you change a config that relates to, let's say, 80 servers, you need a synchronisation service to push that change to all nodes. The application itself may have such a feature, but imagine you add another 12 applications to your environment: you would have to take care of each application's synchronisation service one by one. This is where ZooKeeper comes in. ZooKeeper can handle the management of all this information by itself. If you set it up as a cluster (an odd number of hosts is needed, so that a majority can always be formed), you get high availability for ZooKeeper (failover cases) and you have a ZooKeeper Quorum.
Journal Node: In a high availability Hadoop cluster you have more than one NameNode, running in active/standby mode. The active NameNode informs the JournalNodes of changes, and the standby NameNode asks the JournalNodes what has changed. As in the ZooKeeper case, if you set them up as a cluster (again an odd number of hosts, so that edits can be written to a majority), you get high availability for the JournalNode role as well, accessed through the Quorum Journal Manager.
Actually, I have not heard of them being set up as a single host or node except for lab purposes (a VM on a PC).
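To make that concrete, a minimal sketch assuming a nameservice called mycluster, JournalNode hosts jn1/jn2/jn3 and ZooKeeper hosts zk1/zk2/zk3 (all hypothetical names): the NameNodes point at the JournalNode quorum via dfs.namenode.shared.edits.dir = qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster, and automatic failover points at the ZooKeeper quorum via ha.zookeeper.quorum = zk1:2181,zk2:2181,zk3:2181. On a running cluster you can check what is actually configured with:

hdfs getconf -confKey dfs.namenode.shared.edits.dir
hdfs getconf -confKey ha.zookeeper.quorum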
1. Zookeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications
Role of Zookeeper in Hadoop ecosystem:
During the Hadoop NameNode failover process, ZooKeeper is used to avoid a split-brain scenario, so that the NameNode state does not diverge during a failover.
Refer to this post for more details:
How does Hadoop Namenode failover process works?
2. JournalNode (used in the NameNode failover process)
In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs).
JournalNode machines - the machines on which you run the JournalNodes. The JournalNode daemon is relatively lightweight, so these daemons may reasonably be collocated on machines with other Hadoop daemons, for example NameNodes, the JobTracker, or the YARN ResourceManager.
Note: There must be at least 3 JournalNode daemons, since edit log modifications must be written to a majority of JNs. This will allow the system to tolerate the failure of a single machine
3. Quorum Journal Manager (QJM) allows the edit logs to be shared between the Active and Standby NameNodes.
Importantly, when using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario.
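As a small, hedged illustration (the NameNode IDs nn1 and nn2 are whatever dfs.ha.namenodes.<nameservice> defines in your cluster; yours may be named differently):

# Ask each NameNode whether it is currently active or standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

Only the NameNode reported as active is allowed to write edits to the JournalNodes.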

Hadoop uses only the master node for processing data

I've set up a Hadoop 2.5 cluster with 1 master node (namenode, secondary namenode and datanode) and 2 slave nodes (datanode). All of the machines use Linux CentOS 7 - 64 bit. When I run my MapReduce program (wordcount), I can only see that the master node is using extra CPU and RAM. The slave nodes are not doing a thing.
I've checked the logs on all of the nodes and there is nothing wrong on the slave nodes. The Resource Manager is running and all of the slave nodes can see the Resource Manager.
Datanodes are working in terms of distributed data storing but I can't see any indication of distributed data processing. Do I have to configure the xml configuration files in some other way so all of the machines will process data while I'm running my MapReduce Job?
Thank you
Make sure you are listing the IP addresses of the datanodes in the master node's networking files (e.g. /etc/hosts). Also, each node in the cluster is supposed to contain the IP addresses of the other machines.
Besides that, check whether the includes file contains the relevant datanode entries.
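A couple of checks that may help, assuming a Hadoop 2.x layout (file names and paths can differ per distribution):

# The slaves file on the master should list every datanode/nodemanager host
cat $HADOOP_HOME/etc/hadoop/slaves
# YARN should report one NodeManager per slave node
yarn node -list

If yarn node -list shows only the master, check yarn.resourcemanager.hostname in yarn-site.xml on the slaves and look at the NodeManager logs on those machines.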

Hadoop Client Node Configuration

Assume that there is a Hadoop Cluster that has 20 machines. Out of those 20 machines 18 machines are slaves and machine 19 is for NameNode and machine 20 is for JobTracker.
Now I know that the Hadoop software has to be installed on all those 20 machines.
But my question is: which machine is used to load a file xyz.txt into the Hadoop cluster? Is that client machine a separate machine? Do we need to install the Hadoop software on that client machine as well? How does the client machine identify the Hadoop cluster?
I am new to hadoop, so from what I understood:
If your data upload is not an actual service of the cluster, which should be running on an edge node of the cluster, then you can configure your own computer to work as an edge node.
An edge node doesn't need to be known by the cluster (except for security purposes) as it neither stores data nor computes jobs. This is basically what it means to be an edge node: it is connected to the Hadoop cluster but does not participate in it.
In case it can help someone, here is what I have done to connect to a cluster that I don't administer:
get an account on the cluster, say myaccount
create an account on you computer with the same name: myaccount
configure your computer to access the cluster machines (ssh without passphrase, registered IP, ...)
get the hadoop configuration files from an edge-node of the cluster
get a Hadoop distribution (e.g. from here)
uncompress it where you want, say /home/myaccount/hadoop-x.x
add the following environment variables: JAVA_HOME, HADOOP_HOME (/home/myaccount/hadoop-x.x)
(if you'd like) add hadoop bin to your path: export PATH=$HADOOP_HOME/bin:$PATH
replace your hadoop configuration files by those you got from the edge node. With hadoop 2.5.2, it is the folder $HADOOP_HOME/etc/hadoop
also, I had to change the value of a couple of $JAVA_HOME exports defined in the conf files. To find them, use: grep -r "export.*JAVA_HOME"
Then do hadoop fs -ls /, which should list the root directory of the cluster's HDFS.
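Putting those steps together, a minimal sketch of the resulting environment (every path below is an example, not a requirement):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/myaccount/hadoop-2.5.2
export PATH=$HADOOP_HOME/bin:$PATH
# Sanity check from the edge node: this should list the cluster's HDFS root
hadoop fs -ls /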
Typically, if you have a multi-tenant cluster (which most Hadoop clusters are bound to be), then ideally no one other than the administrators has access to the machines that are part of the cluster.
Developers set up their own "edge nodes". Edge nodes basically have the Hadoop libraries and the client configuration deployed to them (the various XML files, such as core-site.xml, mapred-site.xml and hdfs-site.xml, which tell the local installation where the namenode, job tracker, zookeeper etc. are). But the edge node does not have any role as such in the cluster, i.e. no persistent Hadoop services are running on this node.
Now, in the case of a small development-environment kind of setup, you can use any one of the participating nodes of the cluster to run jobs or shell commands.
So based on your requirement the definition and placement of client varies.
I recommend this article.
"Client machines have Hadoop installed with all the cluster settings, but are neither a Master or a Slave. Instead, the role of the Client machine is to load data into the cluster, submit Map Reduce jobs describing how that data should be processed, and then retrieve or view the results of the job when its finished."
