Remote access to HDFS on Kubernetes - hadoop

I am trying to set up HDFS on minikube (for now) and later on a DEV Kubernetes cluster so I can use it with Spark. I want Spark to run locally on my machine so that I can run in debug mode during development, which means it needs access to my HDFS on K8s.
I have already set up 1 namenode deployment and a datanode statefulset (3 replicas) and those work fine when I am using HDFS from within the cluster. I am using a headless service for the datanodes and a cluster-ip service for the namenode.
The problem starts when I try to expose HDFS. I was thinking of using an ingress for that, but an ingress only exposes port 80 outside the cluster and maps paths to different services inside it, which is not what I'm looking for. As far as I understand, my local Spark jobs (or HDFS client) talk to the namenode, which replies with an address for each block of data. That address, though, is something like 172.17.0.x:50010, and of course my local machine can't reach those.
Is there any way I can make this work? Thanks in advance!

I know this question is about just getting it to run in a dev environment, but HDFS is very much a work in progress on K8s, so I wouldn't by any means run it in production (as of this writing). It's quite tricky to get it working on a container orchestration system because:
You are talking about a lot of data and a lot of nodes (namenodes/datanodes) that are not meant to start/stop in different places in your cluster.
You have the risk of having a constantly unbalanced cluster if you are not pinning your namenodes/datanodes to a K8s node (which defeats the purpose of having a container orchestration system)
If you run your namenodes in HA mode and, for any reason, they die and restart, you run the risk of corrupting the namenode metadata, which would make you lose all your data. It's also risky if you have a single namenode and you don't pin it to a K8s node.
You can't scale up and down easily without running in an unbalanced cluster. Running an unbalanced cluster defeats one of the main purposes of HDFS.
If you look at DC/OS they were able to make it work on their platform, so that may give you some guidance.
In K8s you basically need to create services for all your namenode ports and all your datanode ports. Your client needs to be able to find every namenode and datanode so that it can read/write from them. Also, some ports cannot go through an Ingress because they are layer-4 (TCP) ports, for example the IPC port 8020 on the namenode and 50020 on the datanodes.
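As a rough sketch of what that could look like on plain K8s (the deployment/pod names and the use of NodePort are assumptions based on your setup, not a tested recipe):

# Expose the namenode's IPC/RPC port 8020 outside the cluster:
kubectl expose deployment hdfs-namenode --name=hdfs-namenode-external \
  --type=NodePort --port=8020 --target-port=8020
# Datanodes in a StatefulSet usually need one service per pod, so the client
# can reach each datanode's data-transfer port (50010) individually:
kubectl expose pod hdfs-datanode-0 --name=hdfs-datanode-0-external \
  --type=NodePort --port=50010 --target-port=50010

On top of that, the datanodes have to advertise addresses the client can actually resolve, otherwise the namenode keeps handing out internal pod IPs like the 172.17.0.x ones you mentioned; setting dfs.client.use.datanode.hostname=true on the client side is one of the knobs that helps there.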
Hope it helps!

Related

Behavior of HDFS High Availability

How does distributed copy (distcp) work between two clusters when the NameNode (NN) fails in a High Availability (HA) configuration?
Will the job fail due to the different IP addresses of the name node and the standby node?
Depending on the configuration of your HDFS HA and whether Automatic Failover is implemented, it might work (I personally haven't tested this specific command during a failover).
Another important part is that you use names for the services and that DNS is properly set up and configured for all involved nodes (you should never use direct IP addresses).
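To make the "names, not IPs" point concrete, here is a minimal sketch of the client-side HA settings in hdfs-site.xml (the nameservice ID "mycluster" and the hostnames are made-up examples):

dfs.nameservices = mycluster
dfs.ha.namenodes.mycluster = nn1,nn2
dfs.namenode.rpc-address.mycluster.nn1 = namenode1.example.com:8020
dfs.namenode.rpc-address.mycluster.nn2 = namenode2.example.com:8020
dfs.client.failover.proxy.provider.mycluster = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

With those in place the client addresses the cluster by its nameservice ID and failover is handled transparently, e.g.:

hdfs dfs -ls hdfs://mycluster/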
Yashwanth,
In an HA Hadoop cluster, it is not recommended to use the active name node in distcp commands. A simple answer to your question is yes, the job can fail if you hardcode a NameNode IP or hostname in the distcp command. In an HA Hadoop cluster you need to use the cluster name (nameservice ID) instead of an IP in the distcp command.
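For example, assuming both clusters are configured with nameservice IDs (here "clusterA" and "clusterB", which are just placeholders), the distcp command references those IDs rather than any particular NameNode host:

hadoop distcp hdfs://clusterA/data/src hdfs://clusterB/data/dest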

Make spark environment for cluster

I made a Spark application that analyzes file data. Since the input data could be big, it is not enough to run my application on a single standalone machine. With one more physical machine, how should I architect it?
I'm considering Mesos as the cluster manager, but I'm pretty much a newbie at HDFS. Is there any way to do it without HDFS (for sharing the file data)?
Spark supports a couple of cluster managers: YARN, Mesos and Standalone. You may start with Standalone mode, which means you work on your cluster's file system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark's built-in scripts that spin up a Spark cluster automatically.
If you are running in an on-prem environment, the way to run in Standalone mode is as follows:
-Start a standalone master
./sbin/start-master.sh
-The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) in your cluster, use that URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
-To validate that the worker was added to the cluster, open http://localhost:8080 on your master machine; the Spark UI shows more info about the cluster and its workers.
There are many more parameters to play with. For more info, please refer to this documentation
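Once the master and workers are up, submitting your application against the standalone master looks roughly like this (the class name and JAR path are placeholders; 7077 is the default master port):

./bin/spark-submit \
  --master spark://HOST:7077 \
  --class com.example.MyApp \
  /path/to/my-app.jar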
Hope I have managed to help! :)

"LOST" node in EMR Cluster

How do I troubleshoot and recover a Lost Node in my long running EMR cluster?
The node stopped reporting a few days ago. The host seems to be fine and HDFS too. I noticed the issue only from the Hadoop Applications UI.
EMR nodes are ephemeral and you cannot recover them once they are marked as LOST. You can avoid this in the first place by enabling the 'Termination Protection' feature during cluster launch.
Regarding finding the reason for the LOST node, you can check the YARN ResourceManager logs and/or the instance-controller logs of your cluster to find out more about the root cause.
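A rough sketch of where to start looking (exact log paths vary by EMR release, so treat these as examples):

# From the master node, see how YARN currently classifies the node:
yarn node -list -all
# Then dig into the ResourceManager and instance-controller logs:
less /var/log/hadoop-yarn/yarn-yarn-resourcemanager-*.log
less /emr/instance-controller/log/instance-controller.log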

Should the HBase region server and Hadoop data node be on the same machine?

Sorry, I don't have the resources to set up a cluster to test this; I'm just wondering:
Can I deploy the HBase region server on a separate machine from the Hadoop data node machines? I guess the answer is yes, but I'm not sure.
Is it good or bad to deploy the HBase region server and the Hadoop data node on different machines?
When putting some data into HBase, where is this data eventually stored, on the data node or the region server? I guess it's the data node, but then what are the StoreFile and HFile in the region server, aren't they the physical files that store our data?
Thank you!
RegionServers should always run alongside DataNodes in distributed clusters if you want decent performance.
Very bad, that will work against the data locality principle (If you want to know a little more about data locality check this: http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html)
Actual data will be stored in HDFS (DataNode); RegionServers are responsible for serving and managing regions (a quick way to see this is sketched just after this answer).
For more information about HBase architecture please check this excellent post from Lars' blog: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
BTW, as long as you have a PC with decent RAM you can set up a demo cluster with virtual machines. Don't ever try to set up a production environment without properly testing the platform first in a development environment.
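As a quick illustration of point 3, on a distributed setup you can list HBase's directory in HDFS and see the HFiles sitting under each region; the table name "mytable" is just an example and the exact layout depends on the HBase version:

hdfs dfs -ls -R /hbase/data/default/mytable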
To go into more detail on this answer:
"RegionServers should always run alongside DataNodes in distributed clusters if you want decent performance."
I'm not sure how anyone would interpret the term "alongside", so let's try to be even more precise:
What makes any physical server an "XYZ" server is that it's running a program called a daemon (think "eternally-running background event-handling" program);
What makes a "file" server is that it's running a file-serving daemon;
What makes a "web" server is that it's running a web-serving daemon;
AND
What makes a "data node" server is that it's running the HDFS data-serving daemon;
What makes a "region" server then is that it's running the HBase region-serving daemon (program);
So, in all Hadoop distributions (e.g. Cloudera, MapR, Hortonworks, and others), the general best practice is that for HBase, the "RegionServers" are "co-located" with the DataNode servers.
This means that the actual slave (datanode) servers which form the HDFS cluster are each running the HDFS data-serving daemon (program)
and they're also running the HBase region-serving daemon (program) as well!
This way we ensure locality: the concurrent processing and storing of data on all the individual nodes in an HDFS cluster, with no "movement" of gigantic loads of big data from "storage" locations to "processing" locations. Locality is vital to the success of a Hadoop cluster, so HBase region servers (data nodes that also run the HBase daemon) must do all their processing (putting/getting/scanning) on the data nodes containing the HFiles, which make up HRegions, which make up HTables, which make up the HBase (Hadoop database).
So servers (VMs or physical, running Windows, Linux, etc.) can run multiple daemons concurrently; often they run dozens of them.
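For example, running jps on a slave node of a co-located setup would typically show both daemons side by side (the PIDs are obviously just illustrative):

$ jps
2201 DataNode
2467 HRegionServer
2693 NodeManager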

HBase HDFS zookeeper

Now I am learning about HBase. I set up my HBase Cluster and Hadoop Cluster like this:
server1: NameNode, HMaster
server2: DataNode1, RegionServer1, HQuorumPeer
server3: DataNode2, RegionServer2, HQuorumPeer
server4: DataNode3, RegionServer3, HQuorumPeer
I have several questions about the HBase cluster:
1: All RegionServers must be in the Hadoop cluster so they can use HDFS to store data, even though they could store data in the local file system, right?
2: What does a RegionServer do? Does the HMaster give work to all RegionServers and let them run in parallel, like the TaskTracker on a DataNode?
3: What does ZooKeeper do? Do I need to set up ZooKeeper on all RegionServer nodes and the master node?
4: This is related to #3. I know HBase uses ZooKeeper for recovery once a RegionServer is down. How does that specifically work?
All RegionServers must be in the Hadoop cluster so they can use HDFS to store data, even though they could store data in the local file system, right?
Yes. RegionServers are the daemons responsible for storing data in an HBase cluster. You store data in HBase tables, which are spread over many regions on several RegionServers across the cluster. Although data goes into the RegionServers, it actually gets stored inside HDFS. But if you are on a standalone setup, HDFS is not used; the data gets stored directly in the local FS. It is analogous to any DB and its underlying FS, take MySQL and ext3 for example. And yes, all the HDFS data really is stored on your disks, you just can't see it directly.
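The switch between the two behaviours is essentially the hbase.rootdir setting in hbase-site.xml; a rough sketch (the hostname is a made-up example):

hbase.rootdir = hdfs://namenode.example.com:8020/hbase   # fully distributed: data ends up in HDFS
hbase.rootdir = file:///home/hbase/data                  # standalone: data ends up in the local FS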
What does a RegionServer do? Does the HMaster give work to all RegionServers and let them run in parallel, like the TaskTracker on a DataNode?
As specified in the comment above, the RegionServer is the daemon that actually stores data in an HBase cluster. I'm sorry, I didn't quite get the second part of this question: what do you mean by "like TaskTracker in DataNode"? In an HBase cluster, the HMaster is the daemon responsible for monitoring all RegionServer instances in the cluster, and it is the interface for all metadata changes. Its job is monitoring and management. RegionServers don't run jobs the way TaskTrackers do; they just store data and are responsible for things like serving and managing regions.
What does ZooKeeper do? Do I need to set up ZooKeeper on all RegionServer nodes and the master node?
ZooKeeper is the guy who coordinates everything behind the curtains. It is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. A distributed HBase setup depends on a running ZooKeeper cluster: all participating nodes and clients need to be able to access the running ZooKeeper ensemble. HBase by default manages a ZooKeeper cluster; it gets started and stopped as part of the HBase start/stop process. But you can also manage the ZooKeeper ensemble independently of HBase and just point HBase at the cluster it should use. You don't have to have ZooKeeper running on all the nodes; just decide on a number that suits your cluster. One thing to note here is that you should always use an odd number of ZooKeeper nodes.
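Whether HBase manages that ZooKeeper cluster itself is controlled in conf/hbase-env.sh; a minimal sketch (the hostnames below are your servers from the question, used as an assumption):

export HBASE_MANAGES_ZK=true    # HBase starts/stops its own ZooKeeper quorum
# To use an externally managed ensemble instead, set it to false and point
# HBase at it via hbase.zookeeper.quorum in hbase-site.xml,
# e.g. server2,server3,server4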
This is related to #3. I know HBase uses ZooKeeper for recovery once a RegionServer is down. How does that specifically work?
Each RegionServer is connected to ZooKeeper, and the master watches these connections. ZooKeeper manages a heartbeat with a timeout, so on a timeout the HMaster declares the RegionServer dead and starts the recovery process. The following things happen during recovery:
Identifying that a node is down: a node can cease to respond simply because it is overloaded or because it is dead.
Recovering the writes in progress : that’s reading the commit log and recovering the edits that were not flushed.
Reassigning the regions : the region server was previously handling a set of regions. This set must be reallocated to other region servers, depending on their respective workload.
The process is actually a bit more involved. You can find more on this here. I would also suggest you go through the book HBase: The Definitive Guide by Lars George in order to get a grip on HBase.
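If you want to peek at that bookkeeping yourself, the live RegionServers are registered as ephemeral znodes in ZooKeeper; assuming the default zookeeper.znode.parent of /hbase, you can inspect them with the ZooKeeper client bundled with HBase (a rough sketch, not production tooling):

hbase zkcli
ls /hbase/rs     # run inside the zkcli prompt; lists the currently live RegionServers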
HTH
