Adding nodes to an existing cluster - Hadoop

I would like to expand my cluster. I currently have 20 slave nodes with dual quad-core CPUs and 12 TB of disk space each, and I would like to add 10 more slave nodes. Do I have to worry about the disk space on the new nodes? Can the new nodes have any amount of hard drive space?

Related

HDFS Data Write Process for different disk size nodes

We have a 10-node HDFS cluster (Hadoop 2.6, Cloudera 5.8); 4 nodes have 10 TB disks and 6 nodes have 3 TB disks. The disks on the small-disk nodes are constantly filling up, while free space remains available on the large-disk nodes.
I am trying to understand how the namenode writes data/blocks to nodes with different disk sizes: is the data divided equally, or does each node get some percentage of the data?
You should look at dfs.datanode.fsdataset.volume.choosing.policy. By default this is set to round-robin, but since you have an asymmetric disk setup you should change it to available space.
You can also fine tune disk usage with the other two choosing properties.
For more information see:
https://www.cloudera.com/documentation/enterprise/5-8-x/topics/admin_dn_storage_balancing.html
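Concretely, the change in hdfs-site.xml might look like the sketch below. The property names are standard HDFS; the threshold and fraction values are illustrative, not recommendations — tune them for your own disks:

```xml
<!-- hdfs-site.xml: pick volumes by available space instead of round-robin -->
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
<!-- Volumes whose free space differs by less than this many bytes
     (10 GB here) are considered balanced and used round-robin -->
<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
  <value>10737418240</value>
</property>
<!-- Fraction of new block writes sent to the volumes with more free space -->
<property>
  <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
  <value>0.75</value>
</property>
```

The last two properties are the "other two choosing properties" mentioned above: the threshold decides when volumes count as imbalanced, and the fraction decides how strongly writes favor the emptier volumes.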

How to maintain the same disk usage percentage in HDFS

My HDFS cluster consists of nodes with different storage capacities, and the nodes with lower capacity always fill up and run out of storage. Is there a way to configure the cluster so that each volume stays at roughly the same percentage of usage?
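For balancing usage across DataNodes (as opposed to across volumes within one node), HDFS ships a balancer tool that moves blocks from over-utilized to under-utilized nodes. A sketch, assuming the hdfs CLI is on the path of a cluster node:

```shell
# Move blocks between DataNodes until every node's disk usage is within
# 10 percentage points of the cluster-wide average utilization
hdfs balancer -threshold 10
```

The balancer runs until the cluster is balanced to within the threshold or no more blocks can be moved; it is safe to re-run periodically.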

Hadoop multi node cluster

I am a newbie to Hadoop. Please correct me if I am asking nonsense and help me to solve this problem :).
I installed and configured a two-node Hadoop cluster (YARN).
Master node : 2TB HDD, 4GB RAM
Slave node : 500GB HDD, 4GB RAM
Datanode:
Master node only (Not keeping replicated data in Slave node)
Map/Reduce :
Master node & Slave node.
Out of 10TB of data, I uploaded 2TB to the master node (data node). I am using the slave node for Map/Reduce only (to use 100% of the slave node's CPU for running queries).
My questions:
If I add a new 2TB HDD to the master node and want to upload 2TB more to it, how can I use both HDDs (the data on the old HDD and the new HDD on the master)? Is there any way to give multiple HDD paths in hdfs-site.xml?
Do I need to add a 4TB HDD to the slave node (with all the data on the master) to use 100% of the slave's CPU? Or can the slave access data from the master and run Map/Reduce jobs?
If I add 4TB to the slave and upload data to Hadoop, will that create any replication (duplicates) on the master? Can I access all the data on the primary HDDs of both the master and the slave? Will the queries use 100% CPU of both nodes if I do this?
As a whole, if I have 10TB of data, what is the proper way to configure a two-node Hadoop cluster? What specification (for master and data node) should I use to run Hive queries fast?
I got stuck. I really need your suggestions and help.
Thanks a ton in advance.
Please find the answers below:
Provide a comma-separated list of directories in hdfs-site.xml. Source: https://www.safaribooksonline.com/library/view/hadoop-mapreduce-cookbook/9781849517287/ch02s05.html
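For example (dfs.datanode.data.dir is the standard Hadoop 2.x property; the mount paths shown are hypothetical):

```xml
<!-- hdfs-site.xml on the master: the DataNode spreads block storage
     across every directory listed, so both HDDs get used -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
</property>
```

Restart the DataNode after adding the new directory so it is picked up.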
No, you don't need to add an HDD on the slave to use 100% of its CPU. Under the current configuration, the NodeManager running on the slave will read data from the DataNode running on the master (over the network). This is not efficient in terms of data locality and adds latency due to network transfer, but it doesn't affect the processing throughput.
No. The replication factor (the number of copies to be stored) is independent of the number of data nodes. The default replication factor can be changed in hdfs-site.xml using the property dfs.replication. You can also configure it on a per-file basis.
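A sketch of both options (the property name is standard HDFS; the file path is hypothetical):

```xml
<!-- hdfs-site.xml: keep only one copy of each block cluster-wide -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```

For a single existing file, the replication can be changed afterwards with `hdfs dfs -setrep -w 3 /path/to/file`, where `-w` waits for the replication to complete.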
You will need at least 10TB of storage across your cluster (all the data nodes combined, with replication factor 1). For a production system I would recommend a replication factor of 3 (to handle node failure), that is, 10*3 = 30TB of storage across at least 3 nodes. Since 10TB is fairly small in Hadoop terms, 3 nodes each with a 2- or 4-core processor and 4 to 8 GB of memory will do. Configure this as: node1: name node + data node + node manager; node2: resource manager + data node + node manager; node3: data node + node manager.

ElasticSearch - Dynamic capacity

I have a three-node Elasticsearch cluster with 250 GB of disk space on each node, and three shards, one on each node. If I run out of disk capacity and add another (fourth) node with a 500 GB disk, will the Elasticsearch cluster move one of the shards to take advantage of the larger disk on the fourth node?

Cassandra compaction taking too much time to complete

Initially we had 12 nodes in the Cassandra cluster, and with a 500GB data load on each node, a major compaction used to complete in 20 hours.
Now we have upgraded the cluster to 24 nodes, and with the same data size, that is 500 GB on each node, a major compaction is taking 5 days. (The hardware configuration of each node is exactly the same, and we are using cassandra-0.8.2.)
So what could be the possible reason for this slowdown?
Is the increased cluster size causing this issue?
Compaction is a completely local operation, so cluster size would not affect it. Request volume would, and so would data volume.
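To see what a node is actually spending its time on, the standard Cassandra tooling can help. A sketch, assuming nodetool is on the path; exact command availability varies by Cassandra version:

```shell
# Show compactions currently in progress and the number pending on this node
nodetool compactionstats

# Compaction is also throttled by compaction_throughput_mb_per_sec in
# cassandra.yaml; raising it lets compaction use more disk bandwidth
```

Comparing pending compactions and throughput against request load can distinguish a throttling problem from genuine data-volume growth.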
