SPARK_HOME/work filling worker nodes disk - amazon-ec2

I'm running spark jobs on a standalone cluster (generated using spark-ec2 1.5.1) using crontab and my worker nodes are getting hammered by these app files that get created by each job.
java.io.IOException: Failed to create directory /root/spark/work/app-<app#>
I've looked at http://spark.apache.org/docs/latest/spark-standalone.html and changed my spark-env.sh (located in spark/conf on the master and worker nodes) to reflect the following:
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=3600"
Am I doing something wrong? I've added the line to the end of each spark-env.sh file on the master and both workers.
On maybe a related note, what are these mounts pointing to? I would use them, but I don't want to use them blindly.
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/xvda1 8256952 8256952 0 100% /
tmpfs 3816808 0 3816808 0% /dev/shm
/dev/xvdb 433455904 1252884 410184716 1% /mnt
/dev/xvdf 433455904 203080 411234520 1% /mnt2

Seems like a 1.5.1 issue - I'm no longer using the spark-ec2 script to spin up the cluster. Ended up creating a cron job to clear out the directory as mentioned in my comment.

Related

Hadoop errorcode -1000, No space available in any of the local directories

I'm using Windows 7 with Hadoop 2.10.1 installed as shown here: https://exitcondition.com/install-hadoop-windows/ and I get an error when running my job:
INFO mapreduce.Job:
Job job_1605374051781_0001 failed with state FAILED due to:
Application application_1605374051781_0001 failed 2 times
due to AM Container for appattempt_1605374051781_0001_000002 exited with
exitCode: -1000 Failing this attempt.Diagnostics:
[2020-11-14 18:17:54.217]No space available in any of the local directories.
The expected output is several lines of text and my disks are nowhere near full (at least 10GB free). The code is some generic mapreduce job that I cannot post here because it's the intellectual property of the university.
Any tips on how to solve the "No space available" error?
For clarification I'm using only my PC, I'm not connected to other machines.
PS: I've solved it, as said here: Hadoop map reduce example stuck on Running job by user "banu reddy" https://stackoverflow.com/users/4249076/banu-reddy the free HDD space needs to be at least 10% od the disk.
Hadoop's jobs are executed within the framework's distributed filesystem aka HDFS, which works independently from the local filesystem (even by operating in just one machine, as you clarified).
That basically means that the error you got referred to the disk space available in the HDFS and not on your hard drives in general. To check if the HDFS has enough disk space to run the job or not, you can execute the following command on the terminal:
hdfs dfs -df -h
Which can have an output like this (ignoring the warning I get on my Hadoop setup):
If the command output in your system indicates that the available disk space is low or non-existent, you can individualy delete directories from the HDFS
by firstly checking what directories and files are stored:
hadoop fs -ls
And then deleting each directory from the HDFS:
hadoop fs -rm -r name_of_the_folder
Or file from the HDFS:
hadoop fs -rm name_of_the_file
Alternatively, you can empty everything stored in the HDFS to be sure that you will not hit the disk space limit again any time soon. You can do that by stopping the YARN and HDFS daemons at first:
stop-all.sh
Then enabling only the HDFS daemon:
start-dfs.sh
Then formatting everything on the namenode (aka the HDFS in your system, not your local files of course):
hadoop namenode -format
And enabling YARN and HDFS daemons at last:
start-all.sh
Remember to re-run the hdfs dfs -df -h command after deleting stuff in the HDFS so you make sure you have free space on the HDFS.

Datanode disks are full because huge files as stdout

we have the follwing hadoop cluster versions , ( DATA-NODE machine are on Linux OS version - 7.2 )
ambari - 2.6.1
HDP - 2.6.4
we saw few scenarios that disks on datanode machine became full 100%
and that because the files as - stdout are huge size
for example
/grid/sdb/hadoop/yarn/log/application_151746342014_5807/container_e37_151003535122014_5807_03_000001/stdout
from df -h , we can see
df -h /grid/sdb
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 1.8T 1.8T 0T 100% /grid/sdb
any suggestion how to avoid this situation that stdout are huge and actually this issue cause stopping the HDFS component on the datanode,
second:
since the PATH of stdout is:
/var/log/hadoop-yarn/containers/[application id]/[container id]/stdout
is it possible to limit the file size?
or do a purging of stdout when file reached the threshold ?
Looking at the above path looks like your application (Hadoop Job) is writing a lot of data to stdout file. This generally happens when the Job writes data to stdout using System.out.println function or similar which is not required but sometimes can be used to debug code.
Please check your application code and make sure that it does not write to stdout.
Hope this helps.

Unable to load large file to HDFS on Spark cluster master node

I have fired up a Spark Cluster on Amazon EC2 containing 1 master node and 2 servant nodes that have 2.7gb of memory each
However when I tried to put a file of 3 gb on to the HDFS through the code below
/root/ephemeral-hdfs/bin/hadoop fs -put /root/spark/2GB.bin 2GB.bin
it returns the error, "/user/root/2GB.bin could only be replicated to 0 nodes, instead of 1". fyi, I am able to upload files of smaller size but not when it exceeds a certain size (about 2.2 gb).
If the file exceeds the memory size of a node, wouldn't it will be split by Hadoop to the other node?
Edit: Summary of my understanding of the issue you are facing:
1) Total HDFS free size is 5.32 GB
2) HDFS free size on each node is 2.6GB
Note: You have bad blocks (4 Blocks with corrupt replicas)
The following Q&A mentions similar issues:
Hadoop put command throws - could only be replicated to 0 nodes, instead of 1
In that case, running JPS showed that the datanode are down.
Those Q&A suggest a way to restart the data-node:
What is best way to start and stop hadoop ecosystem, with command line?
Hadoop - Restart datanode and tasktracker
Please try to restart your data-node, and let us know if it solved the problem.
When using HDFS - you have one shared file system
i.e. all nodes share the same file system
From your description - the current free space on the HDFS is about 2.2GB , while you tries to put there 3GB.
Execute the following command to get the HDFS free size:
hdfs dfs -df -h
hdfs dfsadmin -report
or (for older versions of HDFS)
hadoop fs -df -h
hadoop dfsadmin -report

Spark on EC2, no space left on device

I'm running spark job consuming 50GB+, my guess is that shuffle operations written to disk are causing space to run out.
I'm using the current Spark 1.6.0 EC2 script to build my cluster, close to finishing I get this error:
16/03/16 22:11:16 WARN TaskSetManager: Lost task 29948.1 in stage 3.0 (TID 185427, ip-172-31-29-236.ec2.internal): java.io.FileNotFoundException: /mnt/spark/spark-86d64093-d1e0-4f51-b5bc-e7eeffa96e82/executor-b13d39ba-0d17-428d-846a-b1b1f69c0eb6/blockmgr-12c0d9df-3654-4ff8-ba16-8ed36ca68612/29/shuffle_1_29948_0.index.3065f0c8-2511-48ab-8bf0-d0f40ab524ba (No space left on device)
I've tried using various EC2 types, but they all seem to just have the 8GB mounted for / when they start. Doing a df -h doesn't show any other storage mounted for /mnt/spark so does that mean it's only using the little bit of space left?
My df -h:
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 7.8G 4.1G 3.7G 53% /
devtmpfs 30G 56K 30G 1% /dev
tmpfs 30G 0 30G 0% /dev/shm
How do you expand the disk space? I've created my own AMI for this based off the Amazon default Spark one, because of extra packages I need.

Amazon EMR: Configuring storage on data nodes

I'm using Amazon EMR and I'm able to run most jobs fine. I'm running into a problem when I start loading and generating more data within the EMR cluster. The cluster runs out of storage space.
Each data node is a c1.medium instance. According to the links here and here each data node should come with 350GB of instance storage. Through the ElasticMapReduce Slave security group I've been able to verify in my AWS Console that the c1.medium data nodes are running and are instance stores.
When I run hadoop dfsadmin -report on the namenode, each data node has about ~10GB of storage. This is further verified by running df -h
hadoop#domU-xx-xx-xx-xx-xx:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 9.9G 2.6G 6.8G 28% /
tmpfs 859M 0 859M 0% /lib/init/rw
udev 10M 52K 10M 1% /dev
tmpfs 859M 4.0K 859M 1% /dev/shm
How can I configure my data nodes to launch with the full 350GB storage? Is there a way to do this using a bootstrap action?
After more research and posting on the AWS forum I got a solution although not a full understanding of what happened under the hood. Thought I would post this as an answer if that's okay.
Turns out there is a bug in the AMI Version 2.0, which of course was the version I was trying to use. (I had switched to 2.0 because I wanted hadoop 0.20 to be the default) The bug in AMI Version 2.0 prevents mounting of instance storage on 32-bit instances, which is what the c1.mediums launch as.
By specifying on the CLI tool that the AMI Version should use "latest", the problem was fixed and each c1.medium launched with the appropriate 350GB of storage.
For example
./elastic-mapreduce --create --name "Job" --ami-version "latest" --other-options
More information about using AMIs and "latest" can be found here. Currently "latest" is set to AMI 2.0.4. AMI 2.0.5 is the most recent release but looks like it is also still a little buggy.

Resources