Hardware recommendation for ZooKeeper in a Hadoop cluster

I have a lightweight Hadoop environment:
2 NameNodes (JobTracker/HBase Master) + 3 DataNodes (TaskTracker/HBase RegionServer)
They all have roughly two quad-core CPUs, 16-24 GB of memory, and about 15 TB of storage in total.
I am wondering what the server specs for the ZooKeeper nodes should look like if I were to go for 3 ZooKeepers. Can anyone share some experience?

From HBase's perspective:
Give each ZooKeeper server around 1 GB of RAM and, if possible, its own dedicated disk (a dedicated disk is the best thing you can do to ensure a performant ZooKeeper ensemble). For very heavily loaded clusters, run ZooKeeper servers on separate machines from RegionServers (DataNodes and TaskTrackers).
- The dedicated disk should be configured to hold the transaction log and snapshots, since the transaction log keeps growing and ZooKeeper is sensitive to its write latency.
- Sufficient RAM is required so that ZooKeeper does not swap.
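A minimal zoo.cfg sketch for a 3-node ensemble along those lines (hostnames and paths are placeholders; the main point is putting the transaction log directory on the dedicated disk):

    # conf/zoo.cfg on each ZooKeeper server (illustrative values)
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/data/zookeeper                      # snapshots
    dataLogDir=/dedicated-disk/zookeeper-txlog   # transaction log on its own disk
    clientPort=2181
    server.1=zk1:2888:3888
    server.2=zk2:2888:3888
    server.3=zk3:2888:3888
    # Give the JVM about 1 GB of heap, e.g. JVMFLAGS="-Xmx1g" in conf/java.env,
    # so the process never swaps.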

Related

Why is the Hadoop job slower in the cloud (with multi-node clustering) than on a normal PC?

I am using Cloud Dataproc as a cloud service for my research. Running a Hadoop or Spark job on this platform (cloud) is a bit slower than running the same job on a lower-capacity virtual machine. I am running my Hadoop job on a 3-node cluster (each node with 7.5 GB RAM and a 50 GB disk) in the cloud, which took 4 min 49 s, while the same job took 3 min 20 s on a single-node virtual machine (my PC) with 3 GB RAM and a 27 GB disk. Why is the result slower in the cloud with multi-node clustering than on a normal PC?
First of all: it is not easy to answer without knowing the complete configuration and the type of job you are running.
Possible reasons are:
1. Misconfiguration: open the ResourceManager web UI (by default http://HOSTNAME:8088) and compare the available vCores and memory.
2. Job type: the job adds more overhead when it runs parallelized, so it can end up slower.
3. Hardware: the selected virtual hardware is slower than the local machine, e.g. through low disk I/O and network overhead.
I would say it is something like 1. and 2.
For a more detailed answer, let me know:
- the size and type of the job and how you run it
- the Hadoop configuration
- the cloud architecture
br
To be a bit more detailed, here are the numbers/facts that are interesting for finding the reason for the "slower" cloud environment:
Job type & size:
- size of the data: 1 MB or 1 TB?
- format: XML, Parquet, ...
- what kind of processing (e.g. word count, format change, ML, ...)
- and of course the options (executors and driver) for your spark-submit or spark-shell (see the sketch after this list)
Hadoop configuration:
- do you use a distribution (Hortonworks or Cloudera)?
- Spark standalone or YARN mode?
- how are the NodeManagers configured?
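As an illustration of the spark-submit options in question, a run with the executor and driver resources spelled out explicitly might look like this (all values and the jar name are placeholders, not a recommendation):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --driver-memory 2g \
      --executor-memory 4g \
      --executor-cores 2 \
      --num-executors 3 \
      your-job.jar

Comparing these numbers with the vCores and memory shown in the ResourceManager UI usually reveals whether the cloud cluster is actually allowed to use the hardware you are paying for.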

How to allocate physical resources for a big data cluster?

I have three servers and I want to deploy a Spark Standalone cluster or Spark on a YARN cluster on those servers.
Now I have some questions about how to allocate physical resources for a big data cluster. For example, I want to know whether I can deploy the Spark Master process and a Spark Worker process on the same node. Why or why not?
Server Details:
CPU Cores: 24
Memory: 128GB
I need your help. Thanks.
Of course you can, just put the host that runs the Master in the slaves file as well (see the sketch below). On my test server I have such a configuration: the master machine is also a worker node, and there is one worker-only node. Everything is OK.
However, be aware that if the worker fails and causes a major problem (e.g. a system restart), then you will have a problem, because the master will also be affected.
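A minimal sketch of what that looks like in a Spark standalone setup (hostnames are placeholders; older releases use conf/slaves, newer ones call the file conf/workers):

    # $SPARK_HOME/conf/slaves  (conf/workers on newer Spark releases)
    spark-master-host    # the master machine also runs a worker
    spark-worker-1       # worker-only node

Running sbin/start-all.sh on the master then starts the Master plus one Worker per listed host.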
Edit:
Some more info after the question edit :) If you are using YARN (as suggested), you can use Dynamic Resource Allocation. Here are some slides about it and here an article from MapR. How to configure memory properly for a given case is a very long topic; I think those resources will give you a lot of knowledge about it. A minimal configuration sketch follows.
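For reference, a minimal sketch of enabling dynamic allocation on YARN (property values are illustrative, not tuned for any particular cluster):

    # conf/spark-defaults.conf (illustrative values)
    spark.dynamicAllocation.enabled              true
    spark.shuffle.service.enabled                true   # external shuffle service is required on YARN
    spark.dynamicAllocation.minExecutors         1
    spark.dynamicAllocation.maxExecutors         10
    spark.dynamicAllocation.executorIdleTimeout  60s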
BTW, if you have already installed a Hadoop cluster, maybe try YARN mode ;) But that is outside the scope of the question.

Optimal settings for Apache Spark based on the hardware

Is there a mapping/translation from the number of hardware systems, CPU cores, and their associated memory to the spark-submit tunables:
executor-memory
executor-cores
num-executors
The application certainly has something to do with these tunables; I am, however, looking for a "basic rule of thumb".
Apache Spark is running on YARN with HDFS in cluster mode.
Not all the machines in the Spark/Hadoop YARN cluster have the same number of CPU cores or the same amount of RAM.
There is no rule of thumb, but after considering
- off-heap memory,
- the number of applications and other Hadoop daemons running,
- ResourceManager needs,
- HDFS I/O,
- etc.
you can derive a suitable configuration. Please check this URL.
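As a hedged illustration only (one common sizing heuristic, not a rule, assuming uniform nodes for simplicity and borrowing the 24-core / 128 GB machines mentioned earlier in this thread): reserve roughly one core and a few GB per node for the OS and Hadoop daemons, aim for about 5 cores per executor, and leave about 10% of executor memory for the YARN overhead. For three such nodes that works out to something like:

    # Rule-of-thumb sketch, not a tuned configuration:
    #   23 usable cores / 5 cores per executor        -> ~4 executors per node
    #   3 nodes * 4 executors - 1 (AM/driver)          -> 11 executors
    #   (128 GB - 8 GB reserved) / 4 executors ~ 30 GB; minus ~10% overhead -> ~27 GB heap
    spark-submit \
      --master yarn --deploy-mode cluster \
      --num-executors 11 \
      --executor-cores 5 \
      --executor-memory 27g \
      your-app.jar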

How to install Kafka in a Hadoop cluster

I want to install the latest release of Kafka on my Ubuntu Hadoop cluster, which contains 1 master node and 4 data nodes.
Here are my questions:
Should Kafka be installed on all the machines or only on the NameNode machine?
What about ZooKeeper? Should it be installed on all the machines or only on the NameNode machine?
Please share the required documentation to install Kafka and ZooKeeper in a 5-node Hadoop cluster.
The architecture is strictly based on your requirements and on what you have: how powerful your machines are, how much data they need to process, how many consumers the Kafka instances need to feed, and so on. In theory you can have one Kafka instance and one ZooKeeper instance, but it won't be fault-tolerant: if it fails, you lose data, and so on.
You can find more information about a ZooKeeper multi-server cluster here.
What I would do first is try to analyze:
- how much data they need to process,
- how much data they need to "ingest",
- how powerful your machines are,
- how many consumers you are going to need,
- how reliable your machines are.
These are just a few factors to consider before starting to build up an infrastructure. If you want a rough estimate based on "just" 5 machines, assuming they are all equally powerful and have a good amount of memory (e.g., 32 GB per machine): you need at least a couple of Kafka nodes and at least 3 machines for ZooKeeper (2N + 1), so that if one fails, ZooKeeper can handle the failure.
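To make the layout concrete, each Kafka broker is simply pointed at the ZooKeeper ensemble in its server.properties (hostnames and paths below are placeholders):

    # config/server.properties on each Kafka broker
    broker.id=1                                    # must be unique per broker
    log.dirs=/data/kafka-logs
    zookeeper.connect=zk1:2181,zk2:2181,zk3:2181   # the 3-node (2N + 1) ZooKeeper ensemble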

Google Cloud Click to Deploy Hadoop

Why does the Google Cloud Click to Deploy Hadoop workflow require picking a size for the local persistent disk even if you plan to use the Hadoop connector for Cloud Storage? The default size is 500 GB. I was thinking that if it does need some disk, it should be much smaller. Is there a recommended persistent disk size when using the Cloud Storage connector with Hadoop on Google Cloud?
"Deploying Apache Hadoop on Google Cloud Platform
The Apache Hadoop framework supports distributed processing of large data sets across a clusters of computers.
Hadoop will be deployed in a single cluster. The default deployment creates 1 master VM instance and 2 worker VMs, each having 4 vCPUs, 15 GB of memory, and a 500-GB disk. A temporary deployment-coordinator VM instance is created to manage cluster setup.
The Hadoop cluster uses a Cloud Storage bucket as its default file system, accessed through Google Cloud Storage Connector. Visit Cloud Storage browser to find or create a bucket that you can use in your Hadoop deployment.
The Click to Deploy form (Apache Hadoop on Google Compute Engine) asks for: zone (e.g. us-central1-a), worker node count, Cloud Storage bucket, Hadoop version (1.2.1), master node disk type and size in GB (Standard Persistent Disk), and worker node disk type and size in GB (Standard Persistent Disk)."
The three big uses of persistent disks (PDs) are:
1. Logs, both daemon and job (or container in YARN): these can get quite large with debug logging turned on and can result in many writes per second.
2. MapReduce shuffle: this can be large, but benefits more from higher IOPS and throughput.
3. HDFS (image and data).
Due to the layout of directories, persistent disks will also be used for other items like job data (JARs, auxiliary data distributed with the application, etc), but those could just as easily use the boot PD.
Bigger persistent disks are almost always better due to the way GCE scales IOPS and throughput with disk size [1]. 500 GB is probably a good starting point for profiling your applications and uses. If you don't use HDFS, find that your applications don't log much, and don't spill to disk when shuffling, then a smaller disk can probably work well.
If you find that you actually don't want or need any persistent disk, then bdutil [2] also exists as a command-line script that can create clusters with more configurability and customizability (a sample invocation is sketched below).
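For reference, a typical bdutil run looked roughly like this; the flag names are recalled from the bdutil documentation of that era and should be treated as an assumption (check ./bdutil --help for the exact options):

    # Assumption: flag names from memory of the bdutil README; verify with ./bdutil --help
    ./bdutil -b my-config-bucket -z us-central1-a -n 2 deploy   # create a 2-worker cluster
    ./bdutil -b my-config-bucket -z us-central1-a -n 2 delete   # tear it down again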
[1] https://cloud.google.com/developers/articles/compute-engine-disks-price-performance-and-persistence/
[2] https://cloud.google.com/hadoop/
