Can a Hadoop client leverage the benefit of rack awareness?

I have 10 ingestion machines which use Akka Streams for data ingestion.
I have a Hadoop cluster of 50 nodes and run pipelines using Spark Streaming. The Hadoop cluster uses the data generated by the 10 machines to produce reports.
Can I leverage rack awareness from those 10 machines without adding them as part of the Hadoop cluster?
By rack awareness I mean: if those machines are in the same rack as the Hadoop datanodes, I would want each ingestion machine to upload data to its nearest datanode rather than to a random one, so that there is less cross-rack network traffic.
Please let me know if that's possible.

If I understood your setup correctly, this should happen automagically. According to HDFS Architecture:
For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on the local machine if the writer is on a datanode, otherwise on a random datanode in the same rack as that of the writer, another replica on a node in a different (remote) rack, and the last on a different node in the same remote rack.
(The part about the writer not being on a datanode, i.e. placing the first replica on a random datanode in the same rack as the writer, is what's relevant to your case, since your ingest nodes are not cluster datanodes.)
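One practical prerequisite: this only works if the NameNode can map addresses (the datanodes and, for locality decisions, the client machines) to rack IDs. That mapping usually comes from a topology script configured through net.topology.script.file.name in core-site.xml. Below is a minimal sketch of such a script; the IP prefixes and rack names are made up, and Hadoop invokes the script with one or more IPs/hostnames, expecting one rack name per argument on stdout.

#!/usr/bin/env python3
# Sketch of a rack topology script -- the executable that
# net.topology.script.file.name in core-site.xml points to.
# Hadoop calls it with one or more IPs/hostnames as arguments and
# expects one rack name per argument on stdout.
# The subnets and rack names below are made-up placeholders.
import sys

RACKS = {
    "10.1.1.": "/rack1",   # datanodes + ingestion machines in rack 1
    "10.1.2.": "/rack2",   # datanodes + ingestion machines in rack 2
}
DEFAULT_RACK = "/default-rack"

for host in sys.argv[1:]:
    rack = next((r for prefix, r in RACKS.items() if host.startswith(prefix)),
                DEFAULT_RACK)
    print(rack)

With a mapping like this in place, and assuming your ingestion machines' IPs resolve to the same racks as the datanodes, a write from an ingestion machine should get its first replica on a rack-local datanode, as described in the placement policy quoted above.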

Related

Is it good to have the Hadoop NameNode and DataNodes in two different networks?

We are installing an HA-enabled 10-node Hadoop cluster using the Cloudera distribution.
Is it a good idea to have the NameNode and DataNodes on two different subnets separated by a hardware firewall?
As long as network requests work in both directions between the active NameNode (assuming you set up HA) and every DataNode, it should work fine, although the extra network hop will add some latency.
In big data networks, a single client interaction can generate a large number of node-to-node interactions before the expected operation or result completes (for example, a client reading more than a single block of data). Such networks suffer a performance impact from additional hop counts, which increase latency between the client, the NameNode/JobTracker and the DataNodes/TaskTrackers as data traverses rack switches.
Hadoop provides distributed processing of large data sets across clusters of computers, which means networking plays a key role in the deployment architecture and is directly tied to performance and scalability. HDFS and MapReduce have a high east-west traffic pattern.
In HDFS, when rack awareness is configured for HA, replication is a continuous activity that happens across the network based on the replication factor. The shuffle phase, which transfers data from mappers to reducers, is one of the most bandwidth-consuming activities in Hadoop, since all the servers involved transfer data to each other simultaneously; this makes the network topology critical.
Also, platform services such as HDFS, HBase, and Hive use RPC whenever a client asks the remote service to execute a function. Every RPC expects its response to reach the client as soon as possible, and if the response is delayed, the overall command takes longer to execute.
For optimal Hadoop performance, the network needs high bandwidth, low latency, and reliable connectivity between nodes, which boils down to keeping the hop count as low as possible.
In a typical deployment, a firewall placed between cluster nodes impacts performance because it has to inspect every packet. Hence, it is better to avoid firewalls between nodes within the cluster.
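If you want to quantify what the extra hop and the firewall actually cost before committing to the split-subnet design, a quick-and-dirty probe of TCP connect latency to the NameNode RPC port from a host on each subnet can be informative. A rough sketch, where the hostname and port are placeholders (8020 is a common NameNode RPC port):

# Rough probe of TCP connect latency to the NameNode RPC port.
# Hostname and port are placeholders; run it from a host on each
# subnet and compare the numbers.
import socket
import time

def connect_latency_ms(host, port, attempts=10):
    # Average TCP connect time in milliseconds over several attempts.
    total = 0.0
    for _ in range(attempts):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            pass
        total += time.perf_counter() - start
    return total / attempts * 1000

print("NameNode RPC connect latency: %.2f ms"
      % connect_latency_ms("namenode.example.com", 8020))

Comparing the number measured from inside the cluster subnet against the one measured from behind the firewall gives a feel for the per-RPC overhead discussed above.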

Ingesting data into Elasticsearch from HDFS: cluster setup and usage

I am setting up a Spark cluster. I have HDFS datanodes and Spark master nodes on the same instances.
The current setup is:
1-master (Spark and HDFS)
6-Spark workers and HDFS datanodes
All instances are the same: 16 GB RAM, dual core (unfortunately).
I have 3 more machines, again with the same specs.
Now I have two options:
1. Just deploy ES on these 3 machines. The cluster will look like:
1-master (Spark and HDFS)
6-Spark workers and HDFS datanodes
3-Elasticsearch nodes
2. Deploy an ES master on one of them and extend Spark, HDFS, and ES across all the others. The cluster will look like:
1-master (Spark and HDFS)
1-master Elasticsearch
8-Spark workers, HDFS datanodes, ES datanodes
My application heavily uses Spark for joins, ML, etc., but we are also looking for search capabilities. Search definitely does not need to be real-time, and a refresh interval of up to 30 minutes is fine for us.
At the same time, the Spark cluster has other long-running tasks apart from ES indexing.
The solution need not be one of the above; I am open to experimentation if someone has a suggestion. It would also be handy for other devs once concluded.
Also, I am trying the es-hadoop / es-spark project, but ingestion feels very slow when I use 3 dedicated nodes, around 0.6 million records/minute.
In my opinion, the optimal approach here mostly depends on your network bandwidth and whether or not it is the bottleneck of your operation.
I would first check whether the network links are saturated, via iftop -i any or something similar. If you see data rates close to the physical capacity of your network, then you could try running HDFS + Spark on the same machines that run ES to save the network round trip and speed things up.
If the network turns out not to be the bottleneck, I would look into how Spark and HDFS are deployed next.
Are you using all the RAM available (is the Java Xmx set high enough? Spark memory limits? YARN memory limits, if Spark is deployed via YARN)?
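As a concrete example of those memory knobs, here is a sketch of pinning them explicitly when creating the Spark session; the values are placeholders for 16 GB dual-core machines, not recommendations.

# Sketch only: make executor memory/core limits explicit instead of relying
# on defaults. The values are placeholders, not tuning advice.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs-to-es-ingest")
    .config("spark.executor.memory", "10g")           # executor heap
    .config("spark.executor.memoryOverhead", "1g")    # off-heap headroom on YARN
    .config("spark.executor.cores", "2")
    .getOrCreate()
)

On YARN you would typically pass the same settings as --conf flags to spark-submit; the point is just to verify in the Spark UI's Executors tab that the executors actually get the memory you think they do.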
You should also check whether ES or Spark is the bottleneck here; in all likelihood it's ES. Maybe you could spawn additional ES instances: 6 Spark workers feeding only 3 ES nodes seems very sub-optimal.
If anything, I'd probably try to invert that ratio: fewer Spark executors and more ES capacity. ES is likely a lot slower at indexing the data than HDFS and Spark are at producing it (though this really depends on the configuration of both ... just an educated guess here :)). It is highly likely that more ES nodes and fewer Spark workers will be the better approach here.
So in a nutshell:
Add more ES nodes and reduce Spark worker count
Check if your network links are saturated; if so, put both on the same machines (this could be detrimental with only 2 cores, but I'd still give it a shot ... you gotta try this out)
Adding more ES nodes is the better bet of the two things you can do :)
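On the es-spark ingestion rate mentioned in the question, the elasticsearch-hadoop connector exposes bulk-write settings that are often the first thing to tune. A minimal sketch, assuming a hypothetical index logs-demo, a hypothetical source path on HDFS, and ES reachable at es-node1:9200:

# Minimal sketch of bulk-writing a DataFrame to Elasticsearch with the
# elasticsearch-hadoop (es-spark) connector. The host, index name, source
# path, and batch sizes are placeholders -- tune them against your cluster.
# The connector jar must be on the classpath, e.g. via
# --packages org.elasticsearch:elasticsearch-spark-30_2.12:<version>.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-bulk-ingest").getOrCreate()
df = spark.read.parquet("hdfs:///data/events")        # hypothetical source

(
    df.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "es-node1")                    # comma-separated ES hosts
    .option("es.port", "9200")
    .option("es.batch.size.entries", "5000")           # docs per bulk request
    .option("es.batch.size.bytes", "5mb")              # bytes per bulk request
    .option("es.batch.write.refresh", "false")         # skip per-bulk refresh
    .mode("append")
    .save("logs-demo")                                 # target index
)

Since a refresh interval of up to 30 minutes is acceptable here, raising index.refresh_interval on the target index during the bulk load (and disabling the per-bulk refresh, as above) is another commonly suggested lever.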

Difference between Rack Awareness and NameNode

I was going through Hadoop, and I have a doubt about whether there is a difference between rack awareness and the NameNode. Will rack awareness and the NameNode remain on the same box?
As Aviral rightly said, the question is quite vague. But just to quote for your understanding:
Namenode : The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
You can read in detail about this concept here.
Rack Awareness: In simple words, rack awareness is the strategy the NameNode employs to choose the nearest DataNode based on rack information. You can read the details here
Further more, I would like to suggest this blog
(Image omitted; credit: Brad Hedlund.)
From Apache HDFS Users Guide
HDFS is the primary distributed storage used by Hadoop applications.
A HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data
Typically large Hadoop clusters are arranged in racks and network traffic between different nodes within the same rack is much more desirable than network traffic across the racks. In addition the NameNode tries to place replicas of blocks on multiple racks for improved fault tolerance.
From RackAwareness tutorial:
Hadoop components are rack-aware. For example, HDFS block placement will use rack awareness for fault tolerance by placing one block replica on a different rack. This provides data availability in the event of a network switch failure or partition within the cluster.
Let's see how Hadoop writes are implemented.
If the writer is on a datanode, the 1st replica is placed on the local machine, otherwise on a random datanode.
The 2nd replica is placed on a datanode that is on a different rack.
The 3rd replica is placed on a different node in the same rack as the second replica.
Because data blocks are replicated on three different nodes across two different racks, Hadoop read operations provide high availability of data blocks.
At least one replica is stored on a different rack, so if one rack is not accessible, Hadoop can still fetch the data block from the other rack.
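To make the three placement rules above concrete, here is a small illustrative model. It is not Hadoop's actual placement code (BlockPlacementPolicyDefault), just a toy simulation of the rules over a made-up two-rack topology:

# Illustrative model of the default 3-replica placement described above.
#   replica 1: writer's node if it is a datanode, otherwise a random node
#   replica 2: a node on a different rack
#   replica 3: a different node on the same rack as replica 2
import random

# Hypothetical topology: rack name -> datanode hostnames.
TOPOLOGY = {
    "/rack1": ["dn1", "dn2", "dn3"],
    "/rack2": ["dn4", "dn5", "dn6"],
}

def rack_of(node):
    return next(rack for rack, nodes in TOPOLOGY.items() if node in nodes)

def place_replicas(writer=None):
    all_nodes = [n for nodes in TOPOLOGY.values() for n in nodes]
    first = writer if writer in all_nodes else random.choice(all_nodes)
    remote_rack = random.choice([r for r in TOPOLOGY if r != rack_of(first)])
    second = random.choice(TOPOLOGY[remote_rack])
    third = random.choice([n for n in TOPOLOGY[remote_rack] if n != second])
    return [first, second, third]

print(place_replicas(writer="dn2"))    # e.g. ['dn2', 'dn5', 'dn4']
print(place_replicas(writer="edge1"))  # non-datanode writer -> random first node

Running it with a writer that is a datanode versus one that is not shows that the only difference is how the first replica is chosen; the second and third replicas always end up together on a different rack.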

Elasticsearch on Hadoop - Should ES nodes be Colocated with Hadoop DataNodes?

From the Elasticsearch for Hadoop documentation:
Whenever possible, elasticsearch-hadoop shares the Elasticsearch cluster information with Hadoop to facilitate data co-location. In practice, this means whenever data is read from Elasticsearch, the source nodes' IPs are passed on to Hadoop to optimize task execution. If co-location is desired/possible, hosting the Elasticsearch and Hadoop clusters within the same rack will provide significant network savings.
Does this mean to say that ideally an Elasticsearch node should be colocated with every DataNode on the Hadoop cluster, or am I misreading this?
You may find this joint presentation by Elasticsearch and Hortonworks useful in answering this question:
http://www.slideshare.net/hortonworks/hortonworks-elastic-searchfinal
You'll note that on slides 33 and 34 they show multiple architectures - one where the ES nodes are co-located on the Hadoop nodes and another where you have separate clusters. The first option clearly gives you the best co-location of data which is very important for managing Hadoop performance. The second approach allows you to tune each separately and scale them independently.
I don't know that you can say one approach is better than the other as there are clearly tradeoffs. Running on the same node clearly minimizes data access latency at the expense of a loss of isolation and ability to tune each cluster separately.

HDFS replication factor

When I'm uploading a file to HDFS, if I set the replication factor to 1, will the file's splits reside on one single machine, or will the splits be distributed to multiple machines across the network?
hadoop fs -D dfs.replication=1 -copyFromLocal file.txt /user/ablimit
According to Hadoop: The Definitive Guide:
Hadoop's default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.
This logic makes sense as it decreases the network chatter between the different nodes. But the book was published in 2009, and there have been a lot of changes in the Hadoop framework since then.
I think it depends on whether the client is itself a Hadoop node or not. If the client is a Hadoop node, then all the splits will be on that same node, which doesn't provide any better read/write throughput in spite of having multiple nodes in the cluster. If the client is not a Hadoop node, then a node is chosen at random for each split, so the splits are spread across the nodes of the cluster, which does provide better read/write throughput.
One advantage of writing to multiple nodes is that even if one of the nodes goes down, only a couple of splits might be lost, and at least some of the data can still be recovered from the remaining splits.
If you set replication to 1 and the client is itself a datanode, then the file will be present only on that client node, i.e. the node from which you are uploading the file.
If your cluster is a single node, then when you upload a file it will be split according to the block size and it will remain on that single machine.
If your cluster is multi-node, then when you upload a file it will be split according to the block size and distributed to different datanodes in your cluster via the write pipeline; the NameNode decides where in the cluster the data should be placed.
The HDFS replication factor is used to make copies of the data, i.e. if your replication factor is 2 then every block you upload to HDFS will have one extra copy.
If you set the replication factor to 1, only a single copy of each block is stored, which is typical of a single-node cluster: http://commandstech.com/replication-factor-in-hadoop/. You can still upload and use files, but there is no redundancy.
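For completeness, the same single-replica upload as the hadoop fs -D dfs.replication=1 command in the question can also be done programmatically. A sketch using PyArrow's HDFS binding, where the hostname, port, and paths are placeholders and the Hadoop native client (libhdfs) is assumed to be available:

# Sketch: upload with replication factor 1 through PyArrow's HDFS binding,
# a programmatic alternative to the -D dfs.replication=1 flag above.
# Hostname, port, and paths are placeholders.
import shutil
import pyarrow.fs as pafs

hdfs = pafs.HadoopFileSystem("namenode.example.com", port=8020, replication=1)

with open("file.txt", "rb") as local, \
        hdfs.open_output_stream("/user/ablimit/file.txt") as remote:
    shutil.copyfileobj(local, remote)

Either way, running hdfs fsck /user/ablimit/file.txt -files -blocks -locations afterwards shows which datanode each block actually landed on, which is the easiest way to settle the "one machine or many" question for your own cluster.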
