How to calculate the maximum total data size for a Minio cluster?

For a Minio cluster of 4 nodes with erasure code enabled and each node having 4 x 1TB drives (total of 16 TB disk size):
what is the maximum size of data I can upload assuming that I have 8 data and 8 parity disks?
From reading the Minio documentation, I understand that every object is replicated across all nodes and disks. That would mean each node can hold 2 TB of data (the other 2 TB being parity), and if every object is replicated on all disks, then the other nodes store the same objects too. So I could only store 2 TB of data with 16 TB of available disk. Is that correct?
Note: I know that I can increase the data size by decreasing the number of parity disks. I ask for the specific case of half data half parity disks.

If you have equal parity and data, then you have a 2x overhead, meaning your usable capacity is half your raw disk size. All objects are erasure coded across the disks in a given erasure set, with a maximum of 16 disks per erasure set (and possibly many erasure sets), so it is not replication in the way you are thinking. From this doc you can see the formula for other parity-to-data ratios. The basic formula for the overhead factor is total drives (N) / data drives (D), so usable capacity = raw capacity x D / N. For your 16 x 1 TB drives with 8 data and 8 parity, that is 16 TB x 8/16 = 8 TB of usable space.
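To make that arithmetic concrete, here is a minimal sketch of the N / D overhead formula (my own illustration, not part of the original answer; the function name and parameters are made up, not a Minio API):

    def usable_capacity_tb(raw_tb, total_drives, data_drives):
        """Usable capacity under erasure coding: raw size divided by the
        overhead factor N / D (total drives over data drives)."""
        overhead = total_drives / data_drives
        return raw_tb / overhead

    # 16 x 1 TB drives, 8 data + 8 parity -> 2x overhead -> 8 TB usable
    print(usable_capacity_tb(raw_tb=16, total_drives=16, data_drives=8))   # 8.0

    # Same drives with 12 data + 4 parity -> 1.33x overhead -> 12 TB usable
    print(usable_capacity_tb(raw_tb=16, total_drives=16, data_drives=12))  # 12.0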

Related

How to calculate Hadoop Storage?

I'm not sure if I'm calculating this right, but as an example I'm using the Hadoop default settings and want to work out how much data I can store in my cluster. Say I have 12 nodes and 8 TB of disk space per node allocated to HDFS storage.
Do I just calculate 12 / 8 = 1.5 TB?
You're not including the replication factor or the overhead needed for processing any of that data. Plus, Hadoop won't run if all the disks are close to full.
Therefore, the 8 TB per node would first be divided by 3 (without the new Erasure Coding enabled), and then multiplied by the number of nodes.
However, you can't technically hit 100% of HDFS usage, because services will start failing once you go above roughly 85% usage, so really your starting number per node should be about 7 TB.
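As a rough back-of-the-envelope check, here is a small sketch of that reasoning (my own, not from the answer): cap each node at ~85% usage, divide by the replication factor, and multiply by the node count.

    def hdfs_usable_tb(nodes, tb_per_node, replication=3, usage_cap=0.85):
        """Rough usable HDFS capacity: per-node space capped at usage_cap,
        divided by the replication factor, summed over all nodes."""
        per_node = tb_per_node * usage_cap / replication
        return per_node * nodes

    # 12 nodes x 8 TB, replication factor 3, ~85% usable per disk
    print(round(hdfs_usable_tb(nodes=12, tb_per_node=8), 1))  # 27.2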

HDFS Data Write Process for different disk size nodes

We have a 10-node HDFS cluster (Hadoop 2.6, Cloudera 5.8): 4 nodes have 10 TB of disk and 6 nodes have 3 TB. The disks on the small nodes are constantly filling up, while plenty of space remains free on the large-disk nodes.
I am trying to understand how the namenode writes data/blocks to nodes of different disk sizes: is the data divided equally, or is some percentage written to each?
You should look at dfs.datanode.fsdataset.volume.choosing.policy. By default this is set to round-robin but since you have an asymmetric disk setup you should change it to available space.
You can also fine tune disk usage with the other two choosing properties.
For more information see:
https://www.cloudera.com/documentation/enterprise/5-8-x/topics/admin_dn_storage_balancing.html
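Conceptually, the difference between the two policies looks something like the toy sketch below (my own illustration, not Hadoop code; the class and method names are made up): round-robin cycles through the volumes regardless of free space, while an available-space policy prefers the volume with the most room left.

    import itertools

    class RoundRobinChooser:
        """Toy model of the default policy: cycle through volumes in order,
        ignoring how full each one is."""
        def __init__(self, volumes):
            self._cycle = itertools.cycle(volumes)

        def choose(self, block_size_gb, free_space_gb):
            return next(self._cycle)

    class AvailableSpaceChooser:
        """Toy model of the available-space policy: prefer the volume with
        the most free space that can still fit the block."""
        def choose(self, block_size_gb, free_space_gb):
            fits = {v: free for v, free in free_space_gb.items() if free >= block_size_gb}
            return max(fits, key=fits.get)

    # free space (GB) on a DataNode with one large and one nearly full small disk
    free = {"/data/disk-10tb": 6000, "/data/disk-3tb": 120}
    print(AvailableSpaceChooser().choose(block_size_gb=0.128, free_space_gb=free))  # /data/disk-10tb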

Hadoop Data Node: why is there a magic "number" for threshold of data blocks?

Experts,
We may see our block count grow in our Hadoop cluster. "Too many" blocks have consequences such as increased heap requirements at the data node, declining execution speeds, more GC, etc. We should take notice when the number of blocks exceeds a certain "threshold".
I have seen different static numbers for this threshold, such as 200,000 or 500,000 -- "magic" numbers. Shouldn't it be a function of the node's memory (Java Heap Size of DataNode in Bytes)?
Other interesting related questions:
What does a high block count indicate?
a. Too many small files?
b. Running out of capacity?
Is it (a) or (b)? How do I differentiate between the two?
What is a small file? A file whose size is smaller than the block size (dfs.blocksize)?
Does each file take a new data block on disk, or is it the metadata associated with each new file that is the problem?
The effects are more GC, declining execution speeds, etc. How do I "quantify" the effects of a high block count?
Thanks in advance.
Thanks, everyone, for your input. I have done some research on the topic and will share my findings.
Any static number is a magic number. I propose that the block count threshold be: heap memory (in GB) x 1 million x comfort percentage (say 50%).
Why?
Rule of thumb: 1 GB of heap for 1M blocks, per Cloudera [1].
The actual amount of heap memory required by the namenode turns out to be much lower.
Heap needed = (number of blocks + number of inodes (files + folders)) x object size (150-300 bytes [1])
For 1 million small files: heap needed = (1M + 1M) x 300 B = 572 MB <== much smaller than the rule of thumb.
High block count may indicate both.
The namenode UI shows the heap capacity used.
For example,
http://namenode:50070/dfshealth.html#tab-overview
9,847,555 files and directories, 6,827,152 blocks = 16,674,707 total filesystem object(s).
Heap Memory used 5.82 GB of 15.85 GB Heap Memory. Max Heap Memory is 15.85 GB.
** Note that the heap memory used (5.82 GB) is still higher than 16,674,707 objects x 300 bytes = 4.65 GB.
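Putting the two formulas side by side, here is a small sketch (mine, not from the answer) that computes both the proposed block threshold and the estimated namenode heap for a given object count:

    def block_threshold(heap_gb, comfort=0.5):
        """Proposed threshold: heap (GB) x 1 million blocks per GB x comfort factor."""
        return int(heap_gb * 1_000_000 * comfort)

    def heap_needed_gb(num_blocks, num_inodes, bytes_per_object=300):
        """Estimated namenode heap: (blocks + inodes) x 150-300 bytes per object."""
        return (num_blocks + num_inodes) * bytes_per_object / 1024**3

    # 15.85 GB of heap with a 50% comfort margin -> threshold of ~7.9M blocks
    print(block_threshold(15.85))                          # 7925000

    # 1M small files (1M blocks + 1M inodes) -> ~0.56 GB (~572 MB)
    print(round(heap_needed_gb(1_000_000, 1_000_000), 2))  # 0.56

    # The namenode UI example: 6,827,152 blocks + 9,847,555 files/dirs -> ~4.66 GB
    print(round(heap_needed_gb(6_827_152, 9_847_555), 2))  # 4.66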
To find small files, run:
hdfs fsck / -blocks | grep "Total blocks (validated):"
It returns something like:
Total blocks (validated): 2402 (avg. block size 325594 B) <== an average block size smaller than 1 MB
Yes, a file is small if its size < dfs.blocksize.
Each file takes a new data block on disk, though the block only occupies as much space as the file itself, so a small file means a small block.
For every new file, an inode-type object is created (~150 B), which puts stress on the name node's heap memory.
Impact on name and data nodes:
Small files pose problems for both the name node and the data nodes:
Name node:
- Pulls the ceiling on the number of files down, since it must keep the metadata for every file in memory.
- Takes a long time to restart, since it must read the metadata of every file from a cache on local disk.
Data nodes:
- A large number of small files means a large amount of random disk I/O. HDFS is designed for large files and benefits from sequential reads.
[1] https://www.cloudera.com/documentation/enterprise/5-8-x/topics/admin_nn_memory_config.html
Your first assumption is wrong: the data node does not maintain the file structure in memory; it is the job of the name node to keep track of the filesystem (using INodes) in memory. So small files will actually cause your name node to run out of memory faster (since more metadata is required to represent the same amount of data), and execution speed will be affected since a mapper is created per block.
For an answer to your first question, check: Namenode file quantity limit
Execute the following command: hadoop fs -du -s -h. If you see that the first value (which represents the average file size of all files) is much smaller than the configured block size, then you are facing the small-files problem. To check whether you are running out of space: hadoop fs -df -h
Yup, it can be much smaller. Sometimes, though, if a file is too big, it requires an additional block. Once a block is assigned to a file, it cannot be used by other files.
A block does not reserve disk space beyond what it actually needs to store the data; it is the metadata on the namenode that imposes the limits.
As I said before, the issue is that more mapper tasks need to be executed for the same amount of data. Since each mapper runs in a new JVM, GC is not the problem; the overhead of starting a JVM to process a tiny amount of data is.

Replication Factor in Hadoop

I have 5 TB of data, the total size of the combined cluster is 7 TB, and I have set the replication factor to 2.
In this case, how will it replicate the data?
Due to the replication factor, the minimum storage size on the cluster (nodes) always has to be double the size of the data. Do you think this is a drawback of Hadoop?
If the storage on your cluster is not at least double the size of your data, then you will end up with under-replicated blocks. Under-replicated blocks are those replicated fewer times than the replication factor, so if your replication factor is 2, you will have blocks with a replication factor of 1.
And replicating data is not a drawback of Hadoop at all, in fact it is an integral part of what makes Hadoop effective. Not only does it provide you with a good degree of fault tolerance, but it also helps in running your map tasks close to the data to avoid putting extra load on the network (read about data locality).
Consider that one of the nodes in your cluster goes down. That node would have some data stored in it and if you do not replicate your data, then a part of your data will not be available due to the node failure. However, if your data is replicated, the data which was on the node which went down will still be accessible to you from other nodes.
If you do not feel the need to replicate your data, you can always set your replication factor = 1.
Replication of the data is not a drawback of Hadoop -- it's the factor that increases the efficiency of Hadoop (HDFS). Replication of data to a larger number of slave nodes provides high availability and good fault tolerance to the cluster. If we consider the losses incurred by the client due to downtime of nodes in the cluster (typically will be in millions of $), the cost spent for buying the extra storage facility required for replication of data is much less. So the replication of data is justified.
This is a case of under-replication. Assume you have 5 blocks and HDFS was able to create replicas for only the first 3 blocks because of the space constraint. The other two blocks are then under-replicated. When HDFS finds sufficient space, it will try to replicate those 2 blocks as well.
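To put rough numbers on the original 5 TB / 7 TB scenario, here is a small sketch (my own, not from any of the answers) that estimates how much of the data can carry a second replica, ignoring block sizes and non-HDFS overhead:

    def replication_budget(data_tb, raw_tb, replication=2):
        """Rough estimate of how much data can be fully replicated, and how much
        is left under-replicated, given raw capacity and a replication factor."""
        needed = data_tb * replication
        if raw_tb >= needed:
            return data_tb, 0.0                  # everything fully replicated
        # space left after one copy of everything, spread over the extra copies
        fully = (raw_tb - data_tb) / (replication - 1)
        return fully, data_tb - fully            # (fully replicated, under-replicated)

    # 5 TB of data, 7 TB raw, replication factor 2:
    # only ~2 TB can hold a second replica; ~3 TB stays under-replicated
    print(replication_budget(data_tb=5, raw_tb=7, replication=2))  # (2.0, 3.0)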

HDFS Replication - Data Stored

I am a relative newbie to Hadoop and want to get a better understanding of how replication works in HDFS.
Say that I have a 10-node system (1 TB each node), giving me a total capacity of 10 TB. If I have a replication factor of 3, then I have 1 original copy and 3 replicas of each file. So, in essence, only 25% of my storage is original data, and my 10 TB cluster is in effect only 2.5 TB of original (un-replicated) data.
Please let me know if my train of thought is correct.
Your thinking is a little off. A replication factor of 3 means that you have 3 total copies of your data. More specifically, there will be 3 copies of each block for your file, so if your file is made up of 10 blocks there will be 30 total blocks across your 10 nodes, or about 3 blocks per node.
You are correct in thinking that a 10 x 1 TB cluster has less than 10 TB of capacity: with a replication factor of 3, it actually has a functional capacity of about 3.3 TB, with a little less available in practice because of the space needed for processing, temporary files, and so on.
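The same overhead arithmetic as in the earlier answers applies here; a small sketch (mine, with illustrative names) of raw capacity divided by the replication factor:

    def hdfs_effective_tb(nodes, tb_per_node, replication):
        """Effective HDFS capacity: total raw space divided by the replication
        factor (ignoring temp files, processing overhead, and usage thresholds)."""
        return nodes * tb_per_node / replication

    # 10 nodes x 1 TB with replication factor 3 -> ~3.3 TB, not 2.5 TB
    print(round(hdfs_effective_tb(nodes=10, tb_per_node=1, replication=3), 1))  # 3.3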
