What is the best way to test hadoop? - hadoop

I have completed the hadoop cluster setup with 3 journal nodes for QJM, 4 datanodes, 2 namenode, 3 zookeeper but I need to confirm whether the connectivity had made successfully between them so, I am searching for a too which can perform the following task
1) Should check which namenode currently in active state
2) Is both the namenode is communicating with each other successfully
3) Should check whether all journal nodes are communicating with each other successfully
4) Should check whether all zookeeper are communicating with each other successfully
5) Which zookeeper currently playing the master role
Is there any tool or any commands available to check above task?
Can anyone please help me to solve this?

Using Ambari you can monitor complete cluster performace and working.
Also if you want to validate your hadoop jobs programatically,
Then you can use the concept of Counters in Hadoop.

Related

Hadoop use only master node for processing data

I've setup a Hadoop 2.5 cluster with 1 master node(namenode and secondary namenode and datanode) and 2 slave nodes(datanode).All of the machines use Linux CentOS 7 - 64bit. When I run my MapReduce program (wordcount), I can only see that master node is using extra CPU and RAM. Slave nodes are not doing a thing.
I've checked the logs from all of the namenode and there is nothing wrong on slave nodes. Resource Manager is running and all of the slave nodes can see the Resource Manager.
Datanodes are working in terms of distributed data storing but I can't see any indication of distributed data processing. Do I have to configure the xml configuration files in some other way so all of the machines will process data while I'm running my MapReduce Job?
Thank you
Make sure you are mentioaning the IP's Addresses of the daanodes on the Masternode networking files. Also each node in the cluster is supposed to contain IP address of the other machines.
Besides that check the includes file if it contains the relevant datanodes entry onto it or not.

How does Apache Spark handles system failure when deployed in YARN?

Preconditions
Let's assume Apache Spark is deployed on a hadoop cluster using YARN. Furthermore a spark execution is running. How does spark handle the situations listed below?
Cases & Questions
One node of the hadoop clusters fails due to a disc error. However replication is high enough and no data was lost.
What will happen to tasks that where running at that node?
One node of the hadoop clusters fails due to a disc error. Replication was not high enough and data was lost. Simply spark couldn't find a file anymore which was pre-configured as resource for the work flow.
How will it handle this situation?
During execution the primary namenode fails over.
Did spark automatically use the fail over namenode?
What happens when the secondary namenode fails as well?
For some reasons during a work flow the cluster is totally shut down.
Will spark restart with the cluster automatically?
Will it resume to the last "save" point during the work flow?
I know, some questions might sound odd. Anyway, I hope you can answer some or all.
Thanks in advance. :)
Here are the answers given by the mailing list to the questions (answers where provided by Sandy Ryza of Cloudera):
"Spark will rerun those tasks on a different node."
"After a number of failed task attempts trying to read the block, Spark would pass up whatever error HDFS is returning and fail the job."
"Spark accesses HDFS through the normal HDFS client APIs. Under an HA configuration, these will automatically fail over to the new namenode. If no namenodes are left, the Spark job will fail."
Restart is part of administration and "Spark has support for checkpointing to HDFS, so you would be able to go back to the last time checkpoint was called that HDFS was available."

monitoring hadoop cluster with ganglia

I'm new to hadoop and trying to monitor the working of a multi node cluster using ganglia, The setup of gmond is done on all nodes and ganglia monitor only on the master.However,there are hadoop metrics graphs only for the master node and just system metrics for slaves. Do these hadoop metrics on the master include the slave metrics as well?Or is there any mistake in configuration files? Any help would be appreciated.
I think you should read this in order to understand how metrics flow between master and slave.
However, I would like to brief that, in genral, hadoop based or hbase based metrics are directly emitted/ sent to the master server (By master server, I mean the server on which gmetad is installed). All other OS related metrics are first collected by gmond installed on the corresponding slave and then redirected to the gmond installed on the master server.
So, if you are not getting any OS related metrics of slave servers then there is some misconfiguration in your gmond.conf. To know more about how to configure ganglia, please read this. This has helped me and could help you for sure, if you go through carefully.
There is a mistake in your configuration files.
More precisely, in transmitting / collecting the data, whichever approach you use.

How to administer Hadoop Cluster

i have running 4 nodes hadoop cluster and i am asking about any way to administer that cluster remotely
for example
administering the cluster from my laptop for
executing MapReduce tasks
disabling or enabling data nodes
is there any way to do that remotely ?
If you're using the Cloudera distribution, the Cloudera Manager webapp would let you do that.
Other distributions may have similar control apps. That would give you per-node control.
For executing MR tasks, you would setup normally submit the job from an external node anyway, pointing to the correct JobTracker and NameNode. So I'm not sure what else you're asking for there.

cloudera cluster node roles

I need to ran simple benchmark test on my cloudera CDH4 cluster setup.
My cloudera cluster setup (CDH4) has 4 nodes, A, B, C and D
I am using cloudera manager FREE edition to manage cloudera services.
Each node is configured to perform multiple roles as stated below.
A : NameNode, JobTrackerNode, regionserver, SecondaryNameNode, DataNode, TaskTrackerNode
B : DataNode, TaskTrackerNode
C : DataNode, TaskTrackerNode
D : DataNode, TaskTrackerNode
My first question is, Can one node be NameNode and DataNode?
Is this setup all right?
My second question is, on cloudera manager UI, i can see many services running but i am not sure whether i need all this services or not?
Services running on my setup are :
hbase1
hdfs1
mapreduce1
hue1
oozie1
zookeeper1
Do i need only hdfs1 and mapreduce1 services. If yes how can i remove other services?
Cloud and hadoop concept is new to me so pardon me if some of my assumptions are illogical or wrong.
answer to your first question is yes. but you would never do that in production as NameNode needs sufficient amount of RAM. people usually run only NameNode+JobTracker on their master node. it is also better to run SecondarNameNode on a different machine.
coming to your second question, Cloudera Manager is not only Hadoop. it's a complete package that includes several Hadoop sub-projects like HBase(a NOSQL DB), Oozie(a Workflow engine) etc. and these are the processes which yo see on the UI.
If you wanna play just with Hadoop, HDFS and MapReduce are sufficient. You can stop rest of the processes easily from the UI itself. it won't do any harm to your Hadoop cluster.
HTH

Resources