I recently built a multi-node Hadoop 2.5.0 cluster from a few ARM development boards, and I cannot decide whether the master should be the same type of board as the slaves, a faster ARM board, or even a desktop machine that manages the slave nodes.
Is there any benefit to having a master node that is faster than the slave nodes in the same cluster?
Besides the benefit of more RAM, does higher CPU performance on the master node matter?
NameNode/JobTracker hardware specifications should be sized in relation to the worker nodes. (Something like this might help.)
But I don't recall any recommendation about having more powerful master nodes. They don't need extra RAM/HDD/CPU power. In fact, you can save money by using less powerful master nodes without losing much performance. (Do not forget to keep them in proportion to the workers.)
Related
Any clustering system, including Hadoop, has both benefits and costs. The benefit is the ability to compute in parallel; the cost is the overhead of task distribution.
Suppose I don't need the benefits and am using a single node. How can I run Hadoop so that the overhead is avoided completely? Is running in single-node pseudo-distributed mode sufficient, or will it still carry some parallelization overhead?
I have been learning about Amazon EMR lately, and as far as I know an EMR cluster lets us choose among three node types.
Master, which runs the primary Hadoop daemons such as NameNode, JobTracker, and ResourceManager.
Core, which runs the DataNode and TaskTracker daemons.
Task, which runs only the TaskTracker.
My question is: why does EMR provide task nodes, when Hadoop suggests running the DataNode and TaskTracker daemons on the same node? What is Amazon's logic behind this? You can keep data in S3, stream it to HDFS on the core nodes, and do the processing there, rather than shipping data from HDFS to task nodes, which increases I/O overhead. As far as I know, in Hadoop the TaskTrackers run on the DataNodes that hold the data blocks for a given task, so why have TaskTrackers on different nodes?
According to AWS documentation [1]
The node types in Amazon EMR are as follows:
Master node: A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes for processing. The master node tracks the status of tasks and monitors the health of the cluster. Every cluster has a master node, and it's possible to create a single-node cluster with only the master node.
Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.
Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.
According to AWS documentation [2]
Task nodes are optional. You can use them to add power to perform parallel computation tasks on data, such as Hadoop MapReduce tasks and Spark executors.
Task nodes don't run the Data Node daemon, nor do they store data in HDFS.
Some use cases are:
You can use task nodes to process data streamed from S3. In this case network I/O won't increase, because the data being used isn't on HDFS.
Task nodes can be added or removed freely because no HDFS daemons run on them, so they hold no data. Core nodes do run HDFS daemons, and repeatedly adding and removing them isn't good practice.
Resources:
[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview.html#emr-overview-clusters
[2] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html#emr-plan-task
One use case is running spot instances as task nodes. If they are cheap enough, it may be worthwhile to add some compute power to your EMR cluster. This would be mostly for non-sensitive tasks.
Traditional Hadoop assumes that all of your workload requires high I/O; with EMR you can choose the instance type based on your workload. For high I/O needs, for example up to 100 Gbps, go with C-type or R-type instances, and you can use placement groups. Keep your core-node to task-node ratio at 1:5 or lower to keep I/O optimal, and if you want higher throughput, select C or R instances for both core and task nodes. (edited - explaining barely any perf loss with EMR)
The advantage of task nodes is that they can scale up and down faster and can minimize compute cost. A traditional Hadoop cluster is hard to scale either way, since the slaves are also part of HDFS.
Task nodes are optional, since core nodes can run map and reduce tasks.
Core nodes take longer to scale up or down, depending on the tasks, hence the option of task nodes for quicker auto scaling.
Reference: https://aws.amazon.com/blogs/big-data/best-practices-for-resizing-and-automatic-scaling-in-amazon-emr/
The reason Hadoop suggests running the DataNode and TaskTracker daemons on the same nodes is that it wants processing power as close to the data as possible.
But rack-level optimization also comes into play when you deal with a multi-node cluster. In my view, AWS reduces I/O overhead by providing task nodes in the same rack as the DataNodes.
And the reason for providing task nodes is that we need more processing over our data, not just storage on HDFS. We would always want more TaskTrackers than DataNodes. So AWS gives you the opportunity to add them as complete nodes that benefit from rack-level optimization.
And the way you want to get data into your cluster (using S3 and only core nodes) is a good option if you want good performance but are using only a transient cluster.
I am setting up a Spark cluster. I have HDFS data nodes and Spark master nodes on the same instances.
Current setup is
1-master (spark and hdfs)
6-spark workers and hdfs data nodes
All instances are the same: 16 GB, dual core (unfortunately).
I have 3 more machines, again with the same specs.
Now I have three options
1. Just deploy ES on these 3 machines. The cluster will look like
1-master (spark and hdfs)
6-spark workers and hdfs data nodes
3-elasticsearch nodes
2. Deploy the ES master on 1 machine, and extend Spark, HDFS and ES across all the others.
Cluster will look like
1-master (spark and hdfs)
1-master elasticsearch
8-spark workers, hdfs data nodes, es data nodes
My application uses Spark heavily for joins, ML, etc., but we are looking for search capabilities. Search definitely does not need to be real-time; a refresh interval of up to 30 minutes is fine with us.
At the same time, the Spark cluster has other long-running tasks apart from ES indexing.
The solution need not be one of the above; I am open to experimentation if someone has a suggestion. It would be handy for other devs too once concluded.
I am also trying the es-hadoop / es-spark project, but ingestion feels very slow with 3 dedicated nodes: about 0.6 million records/minute.
In my opinion, the optimal approach here mostly depends on your network bandwidth and whether it is the bottleneck in your operation.
I would first check whether the network links are saturated, for example with
iftop -i any
or a similar tool. If you see data rates close to the physical capacity of your network, you could try running HDFS + Spark on the same machines that run ES, to save the network round trip and speed things up.
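If iftop is not available, the same check can be roughed out by hand. The following is a minimal sketch, assuming a Linux host where /proc/net/dev exposes per-interface byte counters; the function names, the 5-second sampling window and the 1 Gbit/s link speed are illustrative assumptions, not something taken from the setup above.

```python
import time


def parse_proc_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes; field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def utilization(before, after, seconds, link_bits_per_s):
    """Fraction of link capacity used per interface, in the busier direction."""
    result = {}
    for name, (rx1, tx1) in after.items():
        if name not in before:
            continue
        rx0, tx0 = before[name]
        busiest_bytes = max(rx1 - rx0, tx1 - tx0)
        result[name] = busiest_bytes * 8 / seconds / link_bits_per_s
    return result


def measure(seconds=5.0, link_bits_per_s=1e9, path="/proc/net/dev"):
    """Sample the counters twice and report utilization (assumed link speed)."""
    with open(path) as f:
        before = parse_proc_net_dev(f.read())
    time.sleep(seconds)
    with open(path) as f:
        after = parse_proc_net_dev(f.read())
    return utilization(before, after, seconds, link_bits_per_s)
```

Calling measure() on an ES node while indexing is running gives one fraction per interface; values close to 1.0 would suggest the link is the bottleneck, while low values point back at ES or Spark themselves.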
If the network turns out not to be the bottleneck here, I would look next at the way Spark and HDFS are deployed.
Are you using all the available RAM? (Is the Java Xmx set high enough? Spark memory limits? YARN memory limits, if Spark is deployed via YARN?)
You should also check whether ES or Spark is the bottleneck here; in all likelihood it's ES. Maybe you could spawn additional ES instances, since 3 ES nodes feeding 6 Spark workers seems very sub-optimal.
If anything, I'd probably try to invert that ratio, fewer Spark executors and more ES capacity. ES is likely a lot slower at providing the data than HDFS is at writing it (though this really depends on the configuration of both ... just an educated guess here :)). It is highly likely that more ES nodes and fewer Spark workers will be the better approach here.
So in a nutshell:
Add more ES nodes and reduce Spark worker count
Check if your network links are saturated, if so put both on the same machines (this could be detrimental with only 2 cores, but I'd still give it a shot ... you gotta try this out)
Adding more ES nodes is the better bet of the two things you can do :)
Is it a recommended practice to run multiple Elasticsearch nodes on one physical (virtual) machine? I'm speaking about a production environment.
I currently have three virtual machines that unicast each other. Setup:
node.name:"VM1"
master:true
data:true
node.name:"VM2"
master:true
data:true
node.name:"VM3"
master:false
data:true
There's a request to add a dedicated master node on the first virtual machine (next to VM1). I'm trying to avoid that and am looking for strong arguments against doing it.
Please advise.
To me, having a dedicated master makes sense in a larger environment. I would say that if your nodes are not that busy, having a data node also act as a master would not be the end of the world. I would be more comfortable having 3 data nodes for high availability.
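One way to reason about the master question is the majority (quorum) rule used for master election. The sketch below is plain arithmetic, not an Elasticsearch API; the helper names are hypothetical, and it assumes the usual majority formula of (master-eligible nodes / 2) + 1.

```python
def quorum(master_eligible):
    """Smallest majority of the master-eligible nodes."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible):
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - quorum(master_eligible)


# The setup in the question has 2 master-eligible nodes (VM1 and VM2).
for count in (2, 3):
    print(count, "master-eligible:",
          "quorum", quorum(count),
          "tolerates", tolerated_failures(count), "failure(s)")
```

With 2 master-eligible nodes the quorum is 2, so losing either one leaves no majority; with 3 the quorum is still 2 and one node can fail. A second master-eligible node placed on the same virtual machine as VM1 would raise the count to 3, but both would go down together with that machine, so it would add little in terms of availability.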
I have a small cluster with one node that has RAID storage, and several powerful diskless compute nodes that boot over PXE. All nodes are connected by InfiniBand (and 1G Ethernet for booting).
I need to deploy Hadoop on this cluster.
Please suggest an optimal configuration.
As I understand it, the default configuration assumes that every compute node has its own small local storage, but in my situation (if I use an NFS share) it would make too many copies over the network. I have found resources about using Hadoop with Lustre, but I do not understand how to configure it.
What you describe is probably possible but - instead of making use of Hadoop features - you are trying to find a way around them.
Moving computation is cheaper than moving data - data locality is one of the cornerstones of Hadoop and that's why all the worker nodes in the cluster are also storage nodes. Hadoop attempts to do as much computation as possible on the nodes where the processed blocks are located to avoid network congestion.
https://developer.yahoo.com/hadoop/tutorial/module1.html
The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system. Since files are spread across the distributed file system as chunks, each compute process running on a node operates on a subset of the data. Which data operated on by a node is chosen based on its locality to the node: most data is read from the local disk straight into the CPU, alleviating strain on network bandwidth and preventing unnecessary network transfers. This strategy of moving computation to the data, instead of moving the data to the computation allows Hadoop to achieve high data locality which in turn results in high performance.
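To make the locality idea concrete, here is a toy sketch of locality-aware task placement. It is not Hadoop's actual scheduler; the function, the node names and the greedy policy are assumptions made for illustration only.

```python
def assign_tasks(block_locations, slots_per_node):
    """Greedily place one task per block, preferring nodes that hold a replica.

    block_locations: {block_id: [nodes holding a replica of that block]}
    slots_per_node:  {node: number of free task slots}
    Returns {block_id: (node, "local" or "remote")}.
    """
    free = dict(slots_per_node)
    assignment = {}
    for block, replicas in sorted(block_locations.items()):
        local = [n for n in replicas if free.get(n, 0) > 0]
        if local:
            # A node with the data has a free slot: no network transfer needed.
            node = max(local, key=lambda n: free[n])
            locality = "local"
        else:
            # Fall back to any free node; the block must travel over the network.
            candidates = [n for n in sorted(free) if free[n] > 0]
            if not candidates:
                raise RuntimeError("no free task slots left")
            node = max(candidates, key=lambda n: free[n])
            locality = "remote"
        free[node] -= 1
        assignment[block] = (node, locality)
    return assignment


# Diskless compute nodes: every block lives on the single storage node.
blocks = {"b1": ["storage"], "b2": ["storage"], "b3": ["storage"]}
slots = {"storage": 0, "c1": 2, "c2": 2}
print(assign_tasks(blocks, slots))
```

In the diskless layout from the question every task ends up "remote", which is the situation data locality is meant to avoid: all input has to cross the network from the one storage node.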
MapReduce tends to generate large volumes of temporary files, so 15 GB per node is simply not enough storage.