hadoop + how to rebalance the hdfs - hadoop

We have an HDP cluster, version 2.6.5, with 8 data nodes; all machines run RHEL 7.6.
The HDP cluster is managed by Ambari, version 2.6.1.
Each data node (worker machine) includes two disks, and each disk is 1.8T.
When we access the data node machines we can see differences in disk usage between them.
For example, on the first data node the usage is (from df -h):
/dev/sdb 1.8T 839G 996G 46% /grid/sdc
/dev/sda 1.8T 1014G 821G 56% /grid/sdb
On the second data node the usage is:
/dev/sdb 1.8T 1.5T 390G 79% /grid/sdc
/dev/sda 1.8T 1.5T 400G 79% /grid/sdb
On the third data node the usage is:
/dev/sdb 1.8T 1.7T 170G 91% /grid/sdc
/dev/sda 1.8T 1.7T 169G 91% /grid/sdb
And so on.
The big question is: why does HDFS not rebalance the data across the DataNode disks?
For example, we would expect all disks on all data node machines to end up at roughly the same utilization.
Why does the used space differ between datanode1, datanode2, datanode3, and so on?
Is there any advice about HDFS tuning parameters that could help us?
This is very critical, because one disk can reach 100% usage while others are only around 50% full.

This is known behaviour of the HDFS balancer in HDP 2.6; there are many possible reasons for unbalanced block distribution.
With HDFS-1312, a disk balancer option has been introduced to address this issue.
The following articles should help you tune it more efficiently:
HDFS Balancer (1): 100x Performance Improvement
HDFS Balancer (2): Configurations & CLI Options
HDFS Balancer (3): Cluster Balancing Algorithm
I would suggest upgrading to HDP 3.x, as HDP 2.x is no longer supported by Cloudera.
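For reference, here is a minimal sketch of the commands involved; the threshold, bandwidth and hostname values below are placeholders, and the intra-node disk balancer requires a build that includes HDFS-1312 plus dfs.disk.balancer.enabled=true in hdfs-site.xml:

# Cluster balancer: move blocks between DataNodes until every node is
# within 10 percentage points of the average cluster utilization
hdfs balancer -threshold 10

# Optionally raise the bandwidth the balancer may use (bytes per second)
hdfs dfsadmin -setBalancerBandwidth 104857600

# Intra-node disk balancer (HDFS-1312): plan, execute and monitor one DataNode
hdfs diskbalancer -plan datanode01.example.com
hdfs diskbalancer -execute <plan-file-printed-by-the-plan-step>.plan.json
hdfs diskbalancer -query datanode01.example.com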

Related

Hadoop multinode cluster too slow. How do I increase speed of data processing?

I have a 6-node cluster: 5 DNs and 1 NN. All have 32 GB RAM. All slaves have 8.7 TB HDDs; the NN has a 1.1 TB HDD. Here are the links to my core-site.xml, hdfs-site.xml, and yarn-site.xml.
After running an MR job, I checked my RAM usage, which is shown below:
Namenode
free -g
total used free shared buff/cache available
Mem: 31 7 15 0 8 22
Swap: 31 0 31
Datanodes:
Slave1:
free -g
total used free shared buff/cache available
Mem: 31 6 6 0 18 24
Swap: 31 3 28
Slave2:
total used free shared buff/cache available
Mem: 31 2 4 0 24 28
Swap: 31 1 30
Likewise, the other slaves show similar RAM usage. Even when only a single job is submitted, any other submitted jobs enter the ACCEPTED state and wait for the first job to finish before they start.
Here is the ps output for the JAR that I submitted to execute the MR job:
/opt/jdk1.8.0_77//bin/java -Dproc_jar -Xmx1000m
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log
-Dyarn.home.dir= -Dyarn.id.str= -Dhadoop.root.logger=INFO,console
-Dyarn.root.logger=INFO,console -Dyarn.policy.file=hadoop-policy.xml
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log
-Dyarn.home.dir=/home/hduser/hadoop -Dhadoop.home.dir=/home/hduser/hadoop
-Dhadoop.root.logger=INFO,console -Dyarn.root.logger=INFO,console
-classpath --classpath of jars
org.apache.hadoop.util.RunJar abc.jar abc.mydriver2 /raw_data /mr_output/02
Are there any settings that I can change/add to allow multiple jobs to run simultaneously and speed up the current data processing? I am using Hadoop 2.5.2. The cluster is in a PROD environment and I cannot take it down to upgrade the Hadoop version.
EDIT 1: I started a new MR job with 362 GB of data, and still the RAM usage is around 8 GB with 22 GB of RAM free. Here is my job submission command:
nohup yarn jar abc.jar def.mydriver1 /raw_data /mr_output/01 &
Here is some more information:
18/11/22 14:09:07 INFO input.FileInputFormat: Total input paths to process : 130363
18/11/22 14:09:10 INFO mapreduce.JobSubmitter: number of splits:130372
Are there additional memory parameters we can use when submitting the job to get more efficient memory usage?
I believe you can edit mapred-default.xml (in practice, site-specific overrides usually go in mapred-site.xml).
The parameters you are looking for are:
mapreduce.job.running.map.limit
mapreduce.job.running.reduce.limit
A value of 0 (probably what it is set to at the moment) means unlimited.
Looking at your memory, 32 GB per machine seems too small.
What CPUs/cores do you have? I would expect at least a quad CPU / 16 cores per machine.
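As a sketch, the same limits can also be passed per job on the submission command line (the values below are placeholders, the -D options are only honoured if the driver parses generic options via ToolRunner, and you should check that your Hadoop release supports these properties):

# Per-job override of the running map/reduce limits (placeholder values)
nohup yarn jar abc.jar def.mydriver1 \
    -Dmapreduce.job.running.map.limit=100 \
    -Dmapreduce.job.running.reduce.limit=20 \
    /raw_data /mr_output/01 &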
Based on your yarn-site.xml, your yarn.scheduler.minimum-allocation-mb setting of 10240 is too high. This effectively means you have, at best, 18 vcores available. That might be the right setting for a cluster with tons of memory, but for 32 GB nodes it is far too large. Drop it to 1 or 2 GB.
Remember, an HDFS block is what each mapper typically consumes, so 1-2 GB of memory for 128 MB of data sounds much more reasonable. The added benefit is that you could have up to 180 vcores available, which will process jobs roughly 10x faster than 18 vcores.
To give you an idea of how a 4-node cluster with 32 cores and 128 GB RAM per node is set up: for Tez, divide RAM by cores to get the max Tez container size.
So in my case: 128/32 = 4 GB.
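To make that arithmetic concrete, a rough sketch with illustrative numbers (it assumes the NodeManager may use the full 32 GB, i.e. yarn.nodemanager.resource.memory-mb is set accordingly):

# Containers that fit on one 32 GB NodeManager at different minimum allocations
echo $(( 32768 / 10240 ))   # 10 GB minimum -> 3 containers per node
echo $(( 32768 / 2048 ))    # 2 GB minimum  -> 16 containers per node

# Rule of thumb from the 128 GB / 32-core example: RAM per node / cores per node
echo $(( 131072 / 32 ))     # -> 4096 MB max Tez container size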

How to free Non DFS Used space with Hortonworks hdp SSH client?

I'm using HDP for self-study to learn big data basics. Today I faced the following: HDFS disk usage is 91%, with Non DFS Used at 31.2 GB / 41.6 GB (74.96%).
What exactly should I do to free disk space? Is it possible to do this from the sandbox HDP SSH client? I'm running HDP on VirtualBox.
From the sandbox HDP SSH client I've executed hdfs dfs -du -h /, but this obviously shows HDFS data usage:
12.2 M /app-logs
1.5 G /apps
0 /ats
860.9 K /demo
724.4 M /hdp
0 /livy2-recovery
0 /mapred
0 /mr-history
479.6 M /ranger
176.6 K /spark2-history
0 /tmp
4.0 G /user
0 /webhdfs
Just treat this like any other almost-full disk issue.
Log in to the sandbox and run du -s /*/* to see what is using up the disk space. I suspect it's probably the log files (under /var/log/*).
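A slightly more concrete sketch (the paths and the 7-day cutoff are only examples; double-check before deleting anything):

# Show the biggest consumers of local disk space
du -sh /var/log/* /var/lib/* 2>/dev/null | sort -h | tail -20

# Example cleanup: remove rotated log files older than 7 days
find /var/log -type f -name "*.log.*" -mtime +7 -delete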

Unable to load large file to HDFS on Spark cluster master node

I have fired up a Spark cluster on Amazon EC2 containing 1 master node and 2 worker nodes that have 2.7 GB of memory each.
However, when I tried to put a 3 GB file onto HDFS with the command below,
/root/ephemeral-hdfs/bin/hadoop fs -put /root/spark/2GB.bin 2GB.bin
it returned the error "/user/root/2GB.bin could only be replicated to 0 nodes, instead of 1". FYI, I am able to upload smaller files, but not once they exceed a certain size (about 2.2 GB).
If the file exceeds the capacity of a single node, wouldn't Hadoop split it across the other nodes?
Edit: Summary of my understanding of the issue you are facing:
1) Total HDFS free size is 5.32 GB
2) HDFS free size on each node is 2.6 GB
Note: you have bad blocks (4 blocks with corrupt replicas)
The following Q&A mentions similar issues:
Hadoop put command throws - could only be replicated to 0 nodes, instead of 1
In that case, running jps showed that the DataNode was down.
Those Q&As suggest ways to restart the DataNode:
What is best way to start and stop hadoop ecosystem, with command line?
Hadoop - Restart datanode and tasktracker
Please try to restart your DataNode and let us know if that solves the problem.
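For example, a hedged sketch of the checks on the DataNode host (the hadoop-daemon.sh location assumes a plain Apache Hadoop layout; adjust for your distribution):

# See which Hadoop daemons are running on this node
jps

# Restart the DataNode if it is missing from the jps output
$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode

# Confirm it re-registered with the NameNode, and inspect the corrupt replicas
hdfs dfsadmin -report
hdfs fsck / -list-corruptfileblocks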
When using HDFS you have one shared file system, i.e. all nodes share the same file system.
From your description, the current free space on HDFS is about 2.2 GB, while you are trying to put 3 GB there.
Execute the following commands to get the HDFS free space:
hdfs dfs -df -h
hdfs dfsadmin -report
or (for older versions of HDFS)
hadoop fs -df -h
hadoop dfsadmin -report

Spark on EC2, no space left on device

I'm running a Spark job consuming 50 GB+; my guess is that shuffle data written to disk is causing the space to run out.
I'm using the current Spark 1.6.0 EC2 script to build my cluster; close to finishing, I get this error:
16/03/16 22:11:16 WARN TaskSetManager: Lost task 29948.1 in stage 3.0 (TID 185427, ip-172-31-29-236.ec2.internal): java.io.FileNotFoundException: /mnt/spark/spark-86d64093-d1e0-4f51-b5bc-e7eeffa96e82/executor-b13d39ba-0d17-428d-846a-b1b1f69c0eb6/blockmgr-12c0d9df-3654-4ff8-ba16-8ed36ca68612/29/shuffle_1_29948_0.index.3065f0c8-2511-48ab-8bf0-d0f40ab524ba (No space left on device)
I've tried various EC2 instance types, but they all seem to have just 8 GB mounted for / when they start. Doing a df -h doesn't show any other storage mounted for /mnt/spark, so does that mean it's only using the little bit of space that is left?
My df -h:
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 7.8G 4.1G 3.7G 53% /
devtmpfs 30G 56K 30G 1% /dev
tmpfs 30G 0 30G 0% /dev/shm
How do you expand the disk space? I've created my own AMI for this, based on the default Amazon Spark one, because of extra packages I need.
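As a rough sketch of the usual first check (the device and mount names below are placeholders): list the block devices to see whether an instance-store volume is attached but unmounted, and point Spark's scratch space at it via spark.local.dir / SPARK_LOCAL_DIRS.

# List attached block devices; instance-store volumes often show up unmounted
lsblk

# Example only: format and mount an unused ephemeral volume (this destroys its contents)
sudo mkfs.ext4 /dev/xvdb
sudo mkdir -p /mnt/spark
sudo mount /dev/xvdb /mnt/spark

# Tell Spark to use it for shuffle/scratch space (e.g. in conf/spark-env.sh)
export SPARK_LOCAL_DIRS=/mnt/spark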

Amazon EMR: Configuring storage on data nodes

I'm using Amazon EMR and I'm able to run most jobs fine. I'm running into a problem when I start loading and generating more data within the EMR cluster: the cluster runs out of storage space.
Each data node is a c1.medium instance. According to the links here and here, each data node should come with 350 GB of instance storage. Through the ElasticMapReduce Slave security group, I've been able to verify in my AWS console that the c1.medium data nodes are running and are instance-store backed.
When I run hadoop dfsadmin -report on the NameNode, each data node has only about 10 GB of storage. This is further confirmed by running df -h:
hadoop#domU-xx-xx-xx-xx-xx:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 9.9G 2.6G 6.8G 28% /
tmpfs 859M 0 859M 0% /lib/init/rw
udev 10M 52K 10M 1% /dev
tmpfs 859M 4.0K 859M 1% /dev/shm
How can I configure my data nodes to launch with the full 350 GB of storage? Is there a way to do this using a bootstrap action?
After more research and posting on the AWS forum, I got a solution, although not a full understanding of what happened under the hood. I thought I would post this as an answer if that's okay.
It turns out there is a bug in AMI version 2.0, which of course was the version I was trying to use. (I had switched to 2.0 because I wanted Hadoop 0.20 to be the default.) The bug in AMI version 2.0 prevents mounting of instance storage on 32-bit instances, which is what c1.mediums launch as.
By specifying "latest" as the AMI version on the CLI tool, the problem was fixed and each c1.medium launched with the expected 350 GB of storage.
For example
./elastic-mapreduce --create --name "Job" --ami-version "latest" --other-options
More information about using AMIs and "latest" can be found here. Currently "latest" points to AMI 2.0.4; AMI 2.0.5 is the most recent release but looks like it is also still a little buggy.
