How much data can my Hadoop cluster handle?

I have a 4-node cluster configured with 1 NameNode and 3 DataNodes. I'm running a TPC-H benchmark and I would like to know how much data you think my cluster can handle without affecting query response times. My total available disk size is about 700 GB, and each node has an 8-core CPU and 16 GB of RAM.
I saw some calculations for finding the volume limit, but I didn't understand them. If someone could explain in a simple way how to calculate the data volume a cluster can handle, it would be very helpful.
Thank you

You can use 70 to 80% of the space in your cluster to store data; the rest will be used for processing and for storing intermediate results.
This way performance will not be impacted.

As you mentioned, you have already configured your 4-node cluster. You can check the NN web UI --> Configured Capacity section to find the storage details. Let me know if you run into any difficulties.
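As a quick sanity check, here is a minimal back-of-the-envelope sketch of that calculation in Python, assuming HDFS's default replication factor of 3 and the 70-80% rule above (the 700 GB figure is the one from the question; adjust the numbers for your cluster):

    # Rough estimate of how much actual (pre-replication) data the cluster
    # can hold without starving intermediate/processing output of disk space.
    raw_capacity_gb = 700      # total disk available across the datanodes
    usable_fraction = 0.75     # keep 20-30% free for temp/intermediate data
    replication = 3            # dfs.replication default

    effective_data_gb = raw_capacity_gb * usable_fraction / replication
    print(f"~{effective_data_gb:.0f} GB of raw TPC-H data")   # ~175 GB

So with the setup described in the question, somewhere on the order of 175 GB of input data (before replication) is a reasonable ceiling before query response times start to suffer.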

Related

How to calculate Hadoop Storage?

I'm not sure if I'm calculating it right. I'm using the Hadoop default settings and I want to calculate how much data I can store in my cluster. For example, I have 12 nodes and 8 TB of total disk space per node allocated to HDFS storage.
Do I just calculate 12/8 = 1.5 TB?
You're not including the replication factor or the overhead needed for processing any of that data. Plus, Hadoop won't run if all the disks are close to full.
Therefore, the 8 TB per node would first be divided by 3 (without the new Erasure Coding enabled), and then multiplied by the number of nodes.
However, you can't actually reach 100% HDFS usage, because services will start failing once usage goes above roughly 85%, so your starting number per node should really be about 7 TB.
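A short sketch of that arithmetic, assuming the default replication factor of 3 and the ~85% usage ceiling mentioned above:

    # Usable HDFS capacity for 12 nodes with 8 TB each.
    nodes = 12
    raw_per_node_tb = 8
    usage_ceiling = 0.85       # stay under ~85% so services keep running
    replication = 3            # default dfs.replication

    usable_raw_tb = nodes * raw_per_node_tb * usage_ceiling   # ~81.6 TB
    data_tb = usable_raw_tb / replication                     # ~27 TB
    print(f"~{data_tb:.0f} TB of un-replicated data")

So the cluster in the question can hold roughly 27 TB of actual data, not the 96 TB of raw disk, and even less once space for intermediate/processing output is set aside.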

Ingesting data into Elasticsearch from HDFS, cluster setup and usage

I am setting up a Spark cluster. I have HDFS data nodes and Spark master nodes on the same instances.
The current setup is:
1 master (Spark and HDFS)
6 Spark workers and HDFS data nodes
All instances are the same: 16 GB of RAM, dual core (unfortunately).
I have 3 more machines, again with the same specs.
Now I have three options:
1. Just deploy ES on these 3 machines. The cluster will look like:
1 master (Spark and HDFS)
6 Spark workers and HDFS data nodes
3 Elasticsearch nodes
2. Deploy the ES master on 1 machine, and extend Spark, HDFS and ES onto all the others. The cluster will look like:
1 master (Spark and HDFS)
1 Elasticsearch master
8 Spark workers, HDFS data nodes and ES data nodes
My application uses Spark heavily for joins, ML, etc., but we are also looking for search capabilities. We definitely do not need real-time search, and a refresh interval of up to 30 minutes is fine for us.
At the same time, the Spark cluster has other long-running tasks apart from ES indexing.
The solution does not need to be one of the above; I am open to experimentation if someone has suggestions. It would also be handy for other devs once this is concluded.
Also, I am trying the es-hadoop / es-spark project, but I felt ingestion is very slow if I use 3 dedicated nodes; it's about 0.6 million records/minute.
In my opinion, the optimal approach here mostly depends on your network bandwidth and whether or not it is the bottleneck in your operation.
I would first check whether my network links are saturated, via e.g. iftop -i any or similar, and see if that is the case. If you see data rates close to the physical capacity of your network, then you could try running HDFS + Spark on the same machines that run ES to save the network round trip and speed things up.
If the network turns out not to be the bottleneck, I would look into the way Spark and HDFS are deployed next.
Are you using all the RAM available (is the Java Xmx set high enough? Spark memory limits? YARN memory limits if Spark is deployed via YARN?)
Also, you should check whether ES or Spark is the bottleneck here; in all likelihood it's ES. Maybe you could spawn additional ES instances: 3 ES nodes for 6 Spark workers seems very sub-optimal.
If anything, I'd probably try to invert that ratio: fewer Spark executors and more ES capacity. ES is likely a lot slower at ingesting the data than HDFS is at providing it (though this really depends on the configuration of both ... just an educated guess here :)). It is highly likely that more ES nodes and fewer Spark workers will be the better approach here.
So in a nutshell:
Add more ES nodes and reduce the Spark worker count.
Check if your network links are saturated; if so, put both on the same machines (this could be detrimental with only 2 cores, but I'd still give it a shot ... you've got to try this out).
Adding more ES nodes is the better bet of the two things you can do :)
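If ES ingestion does turn out to be the limit, the es-hadoop connector's bulk-write settings are usually the first knobs to turn. A minimal PySpark sketch; the input path, node address, index name and batch sizes below are placeholders, not values from the question:

    # Sketch: bulk-writing a DataFrame from HDFS into Elasticsearch with the
    # es-hadoop Spark SQL data source. All names and sizes are examples.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-to-es").getOrCreate()
    df = spark.read.parquet("hdfs:///data/events")        # hypothetical input

    (df.write
       .format("org.elasticsearch.spark.sql")
       .option("es.nodes", "es-node-1:9200")              # placeholder ES node
       .option("es.batch.size.entries", "5000")           # larger bulk requests
       .option("es.batch.size.bytes", "5mb")
       .option("es.batch.write.refresh", "false")         # no refresh per bulk
       .mode("append")
       .save("events/_doc"))                              # index/type to write to

On the ES side, since the question says a refresh interval of up to 30 minutes is acceptable, raising index.refresh_interval on the target index during bulk loading should also help throughput.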

What's the actual ideal NameNode memory size when there are a lot of files in HDFS?

I will have 200 million files in my HDFS cluster. We know each file will occupy 150 bytes in NameNode memory, plus 3 blocks, so that is a total of 600 bytes per file in the NN.
So I set my NN memory to 250 GB to handle the 200 million files comfortably. My question is: with such a big memory size of 250 GB, will it put too much pressure on GC? Is it feasible to give the NN 250 GB of memory?
Can someone please say something? Why does nobody answer?
The ideal NameNode memory size is about the total space used by the metadata, plus the OS, plus the size of the daemons, plus 20-30% headroom for processing-related data.
You should also consider the rate at which data comes into your cluster. If you have data coming in at 1 TB/day, then you must plan for more memory, or you will soon run out.
It is always advised to keep at least 20% of memory free at any point in time. This helps the NameNode avoid going into a full garbage collection.
As Marco mentioned earlier, you may refer to NameNode Garbage Collection Configuration: Best Practices and Rationale for the GC configuration.
In your case, 256 GB looks good if you aren't going to take in a lot more data and aren't going to do lots of operations on the existing data.
Refer: How to Plan Capacity for Hadoop Cluster?
Also refer: Select the Right Hardware for Your New Hadoop Cluster
You can have 256 GB of physical memory in your NameNode. If your data grows to huge volumes, consider HDFS federation. I assume you already have multiple cores (with or without hyperthreading) in the NameNode host. The link below should address your GC concerns:
https://community.hortonworks.com/articles/14170/namenode-garbage-collection-configuration-best-pra.html
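To put the numbers from the question into the usual rule of thumb (~150 bytes of NameNode heap per file, directory or block object), a rough sketch:

    # Rough NameNode heap estimate for 200 million files with 3 blocks each,
    # using the ~150 bytes-per-object figure from the question.
    files = 200_000_000
    blocks_per_file = 3
    bytes_per_object = 150

    metadata_bytes = files * (1 + blocks_per_file) * bytes_per_object
    print(f"~{metadata_bytes / 1024**3:.0f} GB for file/block metadata")
    # ~112 GB of metadata, which is why 250-256 GB of RAM leaves headroom for
    # growth, the OS, the daemons and the 20-30% margin mentioned above.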

Is it faster to replicate your data in HDFS to all your nodes?

If I have 6 data nodes, is it faster to set replication to 6 so all the data is replicated across all my nodes, so that the cluster can split up queries (say, in Hive) without having to move data around? I believe that with a replication factor of 3, if you put a 300 GB file into HDFS it is split across only 3 of the data nodes, and then when all 6 nodes are needed for a query it has to move data to the other 3 nodes where the data doesn't exist, causing slower responses. Is that accurate?
I understand what you mean; you are talking about data locality. Generally speaking, data locality can reduce the run time, because it saves the time spent transmitting blocks over the network. But in fact, if you don't enable "HDFS Short-Circuit Local Reads" (it is off by default, please visit here), the MapTask will still read the block via the TCP protocol, meaning over the network, even if the block and the MapTask are on the same node.
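For reference, short-circuit local reads are enabled in hdfs-site.xml on the clients and DataNodes; a minimal sketch (the socket path is just a typical example, adjust it for your installation):

    <!-- hdfs-site.xml: enable HDFS short-circuit local reads so a client
         co-located with the DataNode reads blocks directly from local disk
         instead of over TCP. The socket path below is an example. -->
    <property>
      <name>dfs.client.read.shortcircuit</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.domain.socket.path</name>
      <value>/var/lib/hadoop-hdfs/dn_socket</value>
    </property>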
Recently, we optimized Hadoop and HDFS: we used SSDs instead of HDD disks, but we found the effect was not good and the time was not shorter, because the disk was not the bottleneck and the network load was not heavy. From the results we concluded that the CPU load was very heavy. If you want to understand your Hadoop cluster's situation clearly, I advise you to use Ganglia to monitor the cluster; it can help you analyze your cluster's bottleneck. Please see here.
Finally, Hadoop is a very large and complicated system; disk performance, CPU performance, network bandwidth and parameter values are only some of the many factors to consider. If you want to save time, you have much more work to do than just tuning the replication factor.

Limiting non-DFS usage per data node

I am facing a strange problem due to Hadoop's crazy data distribution and management. One or two of my data nodes are completely filled up due to non-DFS usage, whereas the others are almost empty. Is there a way I can make the non-DFS usage more uniform?
[I have already tried using dfs.datanode.du.reserved, but that doesn't help either.]
Example of the problem: I have 16 data nodes with 10 GB of space each. Initially, each node has approx. 7 GB of free space. When I start a job that processes 5 GB of data (with replication factor = 1), I expect the job to complete successfully. But alas! When I monitor the job execution, I see that one node suddenly runs out of space because its non-DFS usage is approx. 6-7 GB; the job then retries, and another node runs out of space. I don't really want to raise the number of retries, because that won't give me the performance numbers I am looking for.
Any idea how I can fix this issue?
It sounds like your input isn't being split up properly. You may want to choose a different InputFormat or write your own to better fit your data set. Also make sure that all your nodes are listed in your NameNode's slaves file.
Another problem can be serious data skew, the case when a big part of the data goes to one reducer. You may need to create your own Partitioner to solve it.
You cannot restrict non-DFS usage, as far as I know. I would suggest identifying exactly which input file (or split of it) causes the problem; then you will probably be able to find a solution.
Hadoop MR is built under the assumption that processing a single split can be done using a single node's resources, such as RAM or disk space.
