How to check if Hadoop is tuned for a job

I have a job that processes 4 GB of data. I checked the CPU and memory usage, and both are under 10%.

Your job most likely doesn't need more than that. You could try stress-testing your cluster with TeraSort (included in the examples jar). If your nodes still show very low utilization under that load, the problem may be in your configuration.
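A rough back-of-the-envelope check can show why a 4 GB job may leave a cluster mostly idle. The sketch below assumes the input is splittable and that the split size equals 128 MB (the default HDFS block size in Hadoop 2 and later); the asker's actual block size and cluster size are not stated, so treat the numbers as illustrative only.

```python
import math

# Assumption: one map task per input split, split size = 128 MB.
DEFAULT_SPLIT_BYTES = 128 * 1024 * 1024

def estimated_map_tasks(input_bytes, split_bytes=DEFAULT_SPLIT_BYTES):
    """Rough map-task count for splittable input: one task per split."""
    return math.ceil(input_bytes / split_bytes)

four_gb = 4 * 1024 ** 3
print(estimated_map_tasks(four_gb))  # 32
```

If the cluster can run far more than that many tasks at once, most of its capacity stays unused no matter how it is tuned.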

Hadoop comes with benchmark utilities for verifying cluster settings.
Have a look at TestDFSIO.

Related

Why is the Hadoop job slower in the cloud (with multi-node clustering) than on a normal PC?

I am using Cloud Dataproc as a cloud service for my research. Running Hadoop and Spark jobs on this platform is somewhat slower than running the same job on a lower-capacity virtual machine. My Hadoop job took 4 min 49 s on a 3-node cluster in the cloud (each node with 7.5 GB RAM and 50 GB disk), while the same job took 3 min 20 s on a single-node virtual machine on my PC with 3 GB RAM and 27 GB disk. Why is the result slower in the cloud with multi-node clustering than on a normal PC?
First of all, this is not easy to answer without knowing the complete configuration and the type of job you're running.
Possible reasons are:
1. Misconfiguration: open the ResourceManager web app (http://HOSTNAME:8080) and compare the available vcores and memory.
2. Job type: the job adds more overhead when it runs in parallel, so it ends up slower.
3. Hardware: the selected virtual hardware is slower than the local machine, for example through low disk I/O throughput and network overhead.
I would say it is something like 1. and 2.
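The vcore and memory comparison can also be done programmatically. The sketch below reads the YARN ResourceManager REST endpoint `/ws/v1/cluster/metrics` and summarizes the relevant fields. The base URL is an assumption you must supply (the answer above uses port 8080; a stock YARN ResourceManager web app listens on 8088 by default), and the sample values are made up for illustration, not measurements from the asker's cluster.

```python
import json
from urllib.request import urlopen

def fetch_cluster_metrics(rm_base_url):
    """Fetch cluster metrics from a ResourceManager, e.g. 'http://HOSTNAME:8088'."""
    with urlopen(f"{rm_base_url}/ws/v1/cluster/metrics") as resp:
        return json.load(resp)["clusterMetrics"]

def summarize(metrics):
    """Pick out the numbers worth comparing against the physical hardware."""
    return {
        "active_nodes": metrics["activeNodes"],
        "total_memory_mb": metrics["totalMB"],
        "available_memory_mb": metrics["availableMB"],
        "total_vcores": metrics["totalVirtualCores"],
        "available_vcores": metrics["availableVirtualCores"],
    }

# Hypothetical response body, used here so the example runs without a cluster.
sample = {
    "activeNodes": 3,
    "totalMB": 18432,
    "availableMB": 12288,
    "totalVirtualCores": 6,
    "availableVirtualCores": 4,
}

if __name__ == "__main__":
    for key, value in summarize(sample).items():
        print(f"{key}: {value}")
```

If the totals reported by YARN are much smaller than what the nodes physically have, the NodeManager resource settings are a likely place to look.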
For a more detailed answer, let me know:
size and type of the job and how you run it.
hadoop configuration
cloud architecture
To be a bit more detailed, here are the facts that would help find the reason for the "slower" cloud environment:
Job type and size:
size of the data (1 MB or 1 TB?)
file format (XML, Parquet, ...)
what kind of processing (e.g. word count, format conversion, ML, ...)
and of course the options (executors and drivers) for your spark-submit or spark-shell
Hadoop configuration:
do you use a distribution (Hortonworks or Cloudera)?
Spark standalone or in YARN mode
how the NodeManagers are configured

Optimal settings for Apache Spark based on the hardware

Is there a mapping from the number of hardware systems, their CPU cores, and their associated memory to the following spark-submit tunables?
executor-memory
executor-cores
num-executors
The application certainly has a bearing on these tunables; however, I am looking for a basic rule of thumb.
Apache Spark is running on YARN with HDFS in cluster mode.
Not all the hardware systems in the Spark/Hadoop YARN cluster have the same number of CPU cores or the same amount of RAM.
There is no rule of thumb, but you can derive a suitable configuration after considering:
off-heap memory
the number of applications and other Hadoop daemons running
ResourceManager needs
HDFS I/O
etc.
Please check this url
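One way to turn the considerations above into numbers is sketched below. It is not from the answer itself; it follows a commonly cited heuristic and every constant in it is an assumption: reserve 1 core and 1 GB per node for the OS and Hadoop daemons, use 5 cores per executor (often suggested to keep HDFS I/O throughput healthy), leave one executor's worth of resources for the YARN ApplicationMaster, and subtract Spark's memory overhead (the larger of 384 MB and 10% of the executor heap). Because nodes may differ, the executor memory is sized for the tightest node so that every executor fits. The node list in the example is hypothetical.

```python
def executors_for_node(cores, mem_gb, cores_per_executor=5,
                       reserved_cores=1, reserved_mem_gb=1):
    """Return (executor count, container GB per executor) for one node."""
    usable_cores = max(cores - reserved_cores, 0)
    usable_mem_gb = max(mem_gb - reserved_mem_gb, 0)
    count = usable_cores // cores_per_executor
    container_gb = usable_mem_gb / count if count else 0.0
    return count, container_gb

def suggest_settings(nodes, cores_per_executor=5, overhead_fraction=0.10,
                     min_overhead_gb=0.375):
    """nodes: list of (cores, mem_gb) tuples, one per worker node."""
    per_node = [executors_for_node(c, m, cores_per_executor) for c, m in nodes]
    usable = [(n, gb) for n, gb in per_node if n > 0]
    if not usable:
        raise ValueError("no node can host an executor with these settings")
    total_executors = sum(n for n, _ in usable)
    # Size every executor for the tightest node so the container fits anywhere.
    container_gb = min(gb for _, gb in usable)
    # Container = heap + overhead, overhead = max(384 MB, 10% of heap).
    heap_gb = int(min(container_gb / (1 + overhead_fraction),
                      container_gb - min_overhead_gb))
    return {
        "executor-cores": cores_per_executor,
        # Leave room for the ApplicationMaster.
        "num-executors": max(total_executors - 1, 1),
        "executor-memory": f"{heap_gb}g",
    }

if __name__ == "__main__":
    # Hypothetical heterogeneous cluster: two 16-core/64 GB nodes, one 8-core/32 GB node.
    print(suggest_settings([(16, 64), (16, 64), (8, 32)]))
```

Treat the result as a starting point and adjust it after observing the application, since the right values also depend on the workload and on whatever else shares the cluster.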

Replication vs Snapshot in HBase

We have two systems. One is an offline system (performance is not critical here), where MapReduce jobs run on the HBase cluster. The other is an online system (performance is very critical here), where the API reads from the same HBase cluster. Because the MapReduce jobs run on the same cluster, there are performance issues on the online system. So we are trying to set up a separate HBase cluster for the offline system, replicating a few column families from the source cluster.
So the heavy MapReduce jobs run on the source cluster, and only the online system runs on the replicated cluster, giving it the best performance.
My question is: can't we use the snapshot feature in HBase to do the same? I would also like to know what the difference between the two is.
If you use the snapshot feature for MapReduce, it will still consume CPU, memory, and disk I/O on the live HBase cluster nodes. So if disk I/O or CPU is your bottleneck, a separate cluster for MapReduce jobs is the better solution.

Does Mesos Overwrite Hadoop Memory Settings?

My company runs Hadoop on Mesos, and I'm new to Mesos. The current bottleneck of the Hadoop application I'm in charge of is the speed of the reducer tasks, so I was hoping to experiment with the Mesos and Hadoop memory settings to speed up the reducers.
Unfortunately, I don't understand the relationship between the Hadoop memory settings and the Mesos memory configuration, and I suspect that Mesos may be overriding some of my Hadoop memory settings.
Is changing the value of mapreduce.reduce.java.opts or mapreduce.reduce.memory.mb (in /etc/hadoop/conf/mapred-site.xml) affected by mesos? Does mesos limit the amount of memory that I can allocate to the reducer?
If so, where are the config files in mesos so I can change those settings?
Thanks!
9/30/2015 Update:
The file at https://github.com/mesos/hadoop/blob/master/configuration.md lists parameters that you can put in your mapred-site.xml file.
I'm still not sure how those parameters affect the memory-associated hadoop configuration parameters in mapred-site.xml.
The configuration is described in the respective GitHub repo mesos/hadoop.

When will HDFS be unavailable?

The NameNode is the single point of failure for HDFS. Is this correct?
Then what about the JobTracker? If the JobTracker fails, is HDFS still available?
HDFS is completely independent of the Jobtracker. As long as at least the NN is up, HDFS is nominally usable, with overall degradation dependent on the number of Datanodes that are down.
As Ambar mentioned, HDFS (the file system itself) does not depend on the JobTracker. The current released version of Hadoop does not support NameNode high availability out of the box, but you can work around that (e.g. deploy the NameNode using a traditional active/passive clustering solution with shared storage).
The next release (2.0/0.23) does fix the namenode availability issue.
You can read more about it in a blog post by Aaron Myers "High Availability for the Hadoop Distributed File System (HDFS)"
If the JobTracker is not available, you cannot execute MapReduce jobs.

Resources