How do I set up servers for Hadoop? (CDH)

I am running 3 instances using AWS EC2 (m1.small -- 20GB HDD & 1.7 GB RAM).
The cluster will run Hadoop, MapReduce, and several monitoring processes.
This is how I split them:
1 master server
  NameNode
  SecondaryNameNode
  JobTracker
  Activity Monitor
  Alert Publisher
  Event Server
  Host Monitor
  Service Monitor
2 slave servers
  TaskTracker
  DataNode
Given the servers' specs, I think running those 8 processes is quite a burden for the master server. How should I divide them? Should I add another server for the monitoring processes?

Having the NameNode and SecondaryNameNode on the same server gives you no protection: if that machine fails, you lose both the metadata and its checkpoint.
With 1.7 GB of RAM per machine I don't think you can do much. You need more nodes or a larger instance type. I think 8 GB per node should be the minimum.
You can also assign some of the services to the slave nodes.

Related

Remote access to HDFS on Kubernetes

I am trying to set up HDFS on minikube (for now) and later on a DEV Kubernetes cluster so I can use it with Spark. I want Spark to run locally on my machine so I can run it in debug mode during development, so it needs access to my HDFS on K8s.
I have already set up 1 namenode deployment and a datanode statefulset (3 replicas) and those work fine when I am using HDFS from within the cluster. I am using a headless service for the datanodes and a cluster-ip service for the namenode.
The problem starts when I try to expose HDFS. I was thinking of using an ingress for that, but an ingress only exposes port 80 outside of the cluster and maps paths to different services inside the cluster, which is not what I'm looking for. As far as I understand, my local Spark jobs (or HDFS client) talk to the namenode, which replies with an address for each block of data. That address, though, is something like 172.17.0.x:50010, and of course my local machine can't see those.
Is there any way I can make this work? Thanks in advance!

I know this question is about just getting it to run in a dev environment, but HDFS is very much a work in progress on K8s, so I wouldn't by any means run it in production (as of this writing). It's quite tricky to get it working on a container orchestration system because:
  You are talking about a lot of data and a lot of nodes (namenodes/datanodes) that are not meant to start/stop in different places in your cluster.
  You risk having a constantly unbalanced cluster if you are not pinning your namenodes/datanodes to a K8s node (which defeats the purpose of having a container orchestration system).
  If you run your namenodes in HA mode and for any reason your namenodes die and restart, you run the risk of corrupting the namenode metadata, which would make you lose all your data. It's also risky if you have a single namenode and you don't pin it to a K8s node.
  You can't scale up and down easily without ending up with an unbalanced cluster. Running an unbalanced cluster defeats one of the main purposes of HDFS.
If you look at DC/OS they were able to make it work on their platform, so that may give you some guidance.
In K8s you basically need to create services for all your namenode ports and all your datanode ports. Your client needs to be able to find every namenode and datanode so that it can read from and write to them. Also, some ports cannot go through an Ingress because they are layer 4 (TCP) ports, for example the IPC port 8020 on the namenode and 50020 on the datanodes.
Hope it helps!

Differences between HDFS and ZooKeeper?

While reading ZooKeeper's documentation, it seems to me that HDFS relies on pretty much the same distribution/replication mechanisms (broadly speaking) as ZooKeeper. I hear some echoes from one to the other, but I still can't distinguish the two clearly and strictly.
I understand ZooKeeper is a cluster management / synchronization tool, while HDFS is a distributed file system, but could ZK be needed on an HDFS cluster, for example?

Yes. ZooKeeper comes in when you want high availability on a Hadoop cluster, which relies on a ZooKeeper quorum.
For example, the Hadoop NameNode failover process.
Hadoop high availability is designed around an active NameNode and a standby NameNode for failover. At no point in time should you have two masters (active NameNodes) at the same time.
ZooKeeper keeps track of which NameNode is currently active and is used to elect a new active one when it fails.

Calculating yarn.nodemanager.resource.cpu-vcores for a yarn cluster with multiple spark clients

If I have 3 spark applications all using the same yarn cluster, how should I set
yarn.nodemanager.resource.cpu-vcores
in each of the 3 yarn-site.xml?
(each Spark application is required to have its own yarn-site.xml on the classpath)
Does this value even matter in the clients' yarn-site.xml files?
If it does:
Let's say the cluster has 16 cores.
Should the value in each yarn-site.xml be 5 (for a total of 15 to leave 1 core for system processes) ? Or should I set each one to 15 ?
(Note: Cloudera indicates one core should be left for system processes here: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ however, they do not go into details of using multiple clients against the same cluster)
Assume Spark is running with yarn as the master, and running in cluster mode.

Are you talking about the server-side configuration for each YARN Node Manager? If so, it would typically be configured to be a little less than the number of CPU cores (or virtual cores if you have hyperthreading) on each node in the cluster. So if you have 4 nodes with 4 cores each, you could dedicate for example 3 per node to the YARN node manager and your cluster would have a total of 12 virtual CPUs.
Then you request the desired resources when submitting the Spark job (see http://spark.apache.org/docs/latest/submitting-applications.html for example) to the cluster and YARN will attempt to fulfill that request. If it can't be fulfilled, your Spark job (or application) will be queued up or there will eventually be a timeout.
You can configure different resource pools in YARN to guarantee a specific amount of memory/CPU resources to such a pool, but that's a little bit more advanced.
If you submit your Spark application in cluster mode, you have to consider that the Spark driver will run on a cluster node and not on your local machine (the one that submitted it). Therefore it will require at least 1 more virtual CPU.
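The arithmetic above can be sketched in a few lines of Python. This is only a rough capacity check under stated assumptions: the function names are made up for illustration, one core per node is held back for the OS and Hadoop daemons, and it compares cluster-wide totals only (it ignores memory, per-node placement, and scheduler settings, all of which YARN also takes into account).

```python
def cluster_vcores(nodes, cores_per_node, reserved_per_node=1):
    """Total vcores offered to YARN if every NodeManager advertises
    (cores_per_node - reserved_per_node) vcores."""
    if reserved_per_node >= cores_per_node:
        raise ValueError("nothing left for YARN after reserving cores")
    return nodes * (cores_per_node - reserved_per_node)


def app_vcores(num_executors, executor_cores, driver_cores=1):
    """vcores one Spark application asks for in cluster mode:
    all executors plus the driver, which also runs on the cluster."""
    return num_executors * executor_cores + driver_cores


def fits(app_requests, total_vcores):
    """True if the summed requests do not exceed the cluster total.
    Necessary, not sufficient: a container cannot span two nodes."""
    return sum(app_requests) <= total_vcores


if __name__ == "__main__":
    # The example from the answer: 4 nodes with 4 cores each, 3 given to YARN.
    total = cluster_vcores(nodes=4, cores_per_node=4)

    # Hypothetical workload: 3 applications, each with 2 single-core executors.
    requests = [app_vcores(num_executors=2, executor_cores=1) for _ in range(3)]

    print("cluster vcores:", total)
    print("requested vcores:", sum(requests))
    print("fits:", fits(requests, total))
```

If the requests do not fit, YARN queues the later applications rather than rejecting them, as described above.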
Hope that clarifies things a little for you.

How to allocate memory to datanode in hadoop configuration

We have the following requirement.
We have a total of 5 servers that will be used to build a big data Hadoop data warehouse system (we are not going to use any distribution such as Cloudera, Hortonworks, etc.).
Every server has 512 GB of RAM, 30 TB of storage, and 16 cores, and runs Ubuntu Linux 14.04 LTS Server.
We will install Hadoop on all of the servers. Server3, Server4, and Server5 will be used purely as DataNodes (slave machines), whereas Server1 will run the active NameNode and a DataNode, and Server2 will run the standby NameNode and a DataNode.
We want to configure 300 GB of RAM for the NameNode and 212 GB of RAM for the DataNode when configuring Hadoop.
Could anyone help me with how to do that? Which Hadoop configuration file needs to be changed, and which parameters do we need to set in it?
Thanks and Regards,
Suresh Pitchaipillai

You can set these properties from Cloudera Manager (if you are using CDH) or from Ambari (if you use Hortonworks).
Also, you do not need 300 GB for the NameNode, as the NameNode only stores metadata. Roughly speaking, 1 GB of NameNode heap can store the metadata of 1 million blocks (block size = 128 MB).
More details here : https://issues.apache.org/jira/browse/HADOOP-1687
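To put that rule of thumb into numbers for the cluster in the question, here is a rough back-of-the-envelope sketch. The function names are hypothetical, and the replication factor of 3 (the HDFS default) and the assumption that every block is completely full are mine, not the asker's. Many small files produce far more blocks per terabyte, so treat the result as a lower bound rather than a sizing recommendation.

```python
import math

BYTES_PER_MB = 1024 ** 2
BYTES_PER_TB = 1024 ** 4


def estimate_block_count(raw_storage_tb, replication=3, block_size_mb=128):
    """Number of distinct HDFS blocks if the raw disk space is filled
    with fully used blocks, each stored `replication` times."""
    logical_bytes = raw_storage_tb * BYTES_PER_TB / replication
    return math.ceil(logical_bytes / (block_size_mb * BYTES_PER_MB))


def estimate_namenode_heap_gb(block_count, blocks_per_gb=1_000_000):
    """Heap estimate from the rule of thumb: about 1 GB per million blocks."""
    return block_count / blocks_per_gb


if __name__ == "__main__":
    # 5 servers with 30 TB each, as described in the question.
    blocks = estimate_block_count(raw_storage_tb=5 * 30)
    heap_gb = estimate_namenode_heap_gb(blocks)
    print(f"blocks: {blocks}, estimated NameNode heap: {heap_gb:.2f} GB")
```

Even with a generous safety margin on top, the estimate stays far below the 300 GB planned in the question.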

I am assuming that you are going to use the latest Hadoop distribution with YARN.
Read this article - Reference. It explains every parameter in detail, and the explanation is excellent.
There is one more article from Hortonworks, though it is applicable to all Apache-based Hadoop distributions.
Lastly, keep this handy - Yarn-configuration. It is self-explanatory.

Hadoop uses only the master node for processing data

I've set up a Hadoop 2.5 cluster with 1 master node (namenode, secondary namenode, and datanode) and 2 slave nodes (datanode). All of the machines run 64-bit Linux CentOS 7. When I run my MapReduce program (wordcount), I can only see the master node using extra CPU and RAM. The slave nodes are not doing a thing.
I've checked the logs on all of the nodes and there is nothing wrong on the slave nodes. The Resource Manager is running and all of the slave nodes can see the Resource Manager.
The datanodes are working in terms of distributed data storage, but I can't see any indication of distributed data processing. Do I have to configure the XML configuration files in some other way so that all of the machines process data while I'm running my MapReduce job?
Thank you

Make sure you are listing the IP addresses of the datanodes in the master node's networking files. Each node in the cluster should also have the IP addresses of the other machines.
Besides that, check whether the includes file contains the relevant datanode entries.
