Amazon EMR: Configuring storage on data nodes - hadoop

I'm using Amazon EMR and I'm able to run most jobs fine. I'm running into a problem when I start loading and generating more data within the EMR cluster. The cluster runs out of storage space.
Each data node is a c1.medium instance. According to the links here and here each data node should come with 350GB of instance storage. Through the ElasticMapReduce Slave security group I've been able to verify in my AWS Console that the c1.medium data nodes are running and are instance stores.
When I run hadoop dfsadmin -report on the namenode, each data node has about ~10GB of storage. This is further verified by running df -h
hadoop#domU-xx-xx-xx-xx-xx:~$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 9.9G 2.6G 6.8G 28% /
tmpfs 859M 0 859M 0% /lib/init/rw
udev 10M 52K 10M 1% /dev
tmpfs 859M 4.0K 859M 1% /dev/shm
How can I configure my data nodes to launch with the full 350GB storage? Is there a way to do this using a bootstrap action?

After more research and posting on the AWS forum I got a solution although not a full understanding of what happened under the hood. Thought I would post this as an answer if that's okay.
Turns out there is a bug in the AMI Version 2.0, which of course was the version I was trying to use. (I had switched to 2.0 because I wanted hadoop 0.20 to be the default) The bug in AMI Version 2.0 prevents mounting of instance storage on 32-bit instances, which is what the c1.mediums launch as.
By specifying on the CLI tool that the AMI Version should use "latest", the problem was fixed and each c1.medium launched with the appropriate 350GB of storage.
For example
./elastic-mapreduce --create --name "Job" --ami-version "latest" --other-options
More information about using AMIs and "latest" can be found here. Currently "latest" is set to AMI 2.0.4. AMI 2.0.5 is the most recent release but looks like it is also still a little buggy.

Related

hadoop + how to rebalnce the hdfs

we have HDP cluster version 2.6.5 with 8 data nodes , all machines are installed on rhel 7.6 version
HDP cluster is based amabri platform version - 2.6.1
each data-node ( worker machine ) include two disks and each disk size is 1.8T
when we access the data-node machines we can see differences between the size of the disks
for example on the first data-node the size is : ( by df -h )
/dev/sdb 1.8T 839G 996G 46% /grid/sdc
/dev/sda 1.8T 1014G 821G 56% /grid/sdb
on the second data-node the size is:
/dev/sdb 1.8T 1.5T 390G 79% /grid/sdc
/dev/sda 1.8T 1.5T 400G 79% /grid/sdb
on the third data-node th size is:
/dev/sdb 1.8T 1.7T 170G 91% /grid/sdc
/dev/sda 1.8T 1.7T 169G 91% /grid/sdb
and so on
the big question is why HDFS not perform the re-balance on the HDFS disks?
for example expected results on all disks should be with the same size on all datanodes machines
why is the used size differences between datanode1 to datanode2 to datanode3 etc ?
any advice about the tune parameters in HDFS that can help us?
because its very critical when one disk is reached 100% size and the other are more small as 50%
This is known behaviour of the hdfs re-balancer in HDP 2.6, There are many reasons for unbalanced block distribution. Click to check all the possible reasons.
With HDFS-1312 a disk balance option have been introduced to address this issue.
Following articles shall help you tune it more efficiently:-
HDFS Balancer (1): 100x Performance Improvement
HDFS Balancer (2): Configurations & CLI Options
HDFS Balancer (3): Cluster Balancing Algorithm
I would suggest to upgrade to HDP3.X as HDP 2.x is not supported anymore by Cloudera Support.

Insufficient number of DataNodes reporting when creating dataproc cluster

I am getting "Insufficient number of DataNodes reporting" error when creating dataproc cluster with gs:// as default FS. Below is the command i am using dataproc cluster.
gcloud dataproc clusters create cluster-538f --image-version 1.2 \
--bucket dataproc_bucket_test --subnet default --zone asia-south1-b \
--master-machine-type n1-standard-1 --master-boot-disk-size 500 \
--num-workers 2 --worker-machine-type n1-standard-1 --worker-boot-disk-size 500 \
--scopes 'https://www.googleapis.com/auth/cloud-platform' --project delcure-firebase \
--properties 'core:fs.default.name=gs://dataproc_bucket_test/'
I checked and confirmed that the bucket i am using is able to create default folder in the bucker.
As Igor suggests, Dataproc does not support GCS as a default FS. I also suggest unsetting this property. Note, that fs.default.name property can be passed to individual jobs and will work just fine.
The error arises when the file system is tried to be accessed (HdfsClientModule). So, I think it is probable that Google Cloud Storage doesn't have a specific feature that is required for Hadoop and the creation fails after some folders were created (first image).
As somebody else mentioned previously, it is better to give up the idea of using GCS as the default fs and leave HDFS work in Dataproc. Nonetheless, you can still take advantage of Cloud Storage to have data persistence, reliability, and performance because remember that data in HDFS is removed when a cluster is shut down.
1.- From a Dataproc node you can access data through the hadoop command to move data in and out, for example:
hadoop fs -ls gs://CONFIGBUCKET/dir/file
hadoop distcp hdfs://OtherNameNode/dir/ gs://CONFIGBUCKET/dir/file
2.- For accessing data from Spark or any Hadoop application just use the gs:// prefix to access your bucket.
Furthermore, if the Dataproc connector is installed on premises it can help to move HDFS data to Cloud Storage and then access it from a Dataproc cluster.

Spark on EC2, no space left on device

I'm running spark job consuming 50GB+, my guess is that shuffle operations written to disk are causing space to run out.
I'm using the current Spark 1.6.0 EC2 script to build my cluster, close to finishing I get this error:
16/03/16 22:11:16 WARN TaskSetManager: Lost task 29948.1 in stage 3.0 (TID 185427, ip-172-31-29-236.ec2.internal): java.io.FileNotFoundException: /mnt/spark/spark-86d64093-d1e0-4f51-b5bc-e7eeffa96e82/executor-b13d39ba-0d17-428d-846a-b1b1f69c0eb6/blockmgr-12c0d9df-3654-4ff8-ba16-8ed36ca68612/29/shuffle_1_29948_0.index.3065f0c8-2511-48ab-8bf0-d0f40ab524ba (No space left on device)
I've tried using various EC2 types, but they all seem to just have the 8GB mounted for / when they start. Doing a df -h doesn't show any other storage mounted for /mnt/spark so does that mean it's only using the little bit of space left?
My df -h:
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 7.8G 4.1G 3.7G 53% /
devtmpfs 30G 56K 30G 1% /dev
tmpfs 30G 0 30G 0% /dev/shm
How do you expand the disk space? I've created my own AMI for this based off the Amazon default Spark one, because of extra packages I need.

SPARK_HOME/work filling worker nodes disk

I'm running spark jobs on a standalone cluster (generated using spark-ec2 1.5.1) using crontab and my worker nodes are getting hammered by these app files that get created by each job.
java.io.IOException: Failed to create directory /root/spark/work/app-<app#>
I've looked at http://spark.apache.org/docs/latest/spark-standalone.html and changed my spark-env.sh (located in spark/conf on the master and worker nodes) to reflect the following:
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.appDataTtl=3600"
Am I doing something wrong? I've added the line to the end of each spark-env.sh file on the master and both workers.
On maybe a related note, what are these mounts pointing to? I would use them, but I don't want to use them blindly.
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/xvda1 8256952 8256952 0 100% /
tmpfs 3816808 0 3816808 0% /dev/shm
/dev/xvdb 433455904 1252884 410184716 1% /mnt
/dev/xvdf 433455904 203080 411234520 1% /mnt2
Seems like a 1.5.1 issue - I'm no longer using the spark-ec2 script to spin up the cluster. Ended up creating a cron job to clear out the directory as mentioned in my comment.

Run MapR script (Google Cloud SDK) to create MapR Hadoop cluster on GCP does not work

I downloaded scripts from https://github.com/mapr/gce for run MapR script to create MapR Hadoop cluster on GCP.
I already credential Google account with GCP. gcloud auth list OK.
Run MapR script.
./launch-admin-training-cluster.sh --project stone-cathode-10xxxx --cluster MaprBank10 --config-file 4node_yarn.lst --image centos-6 --machine-type n1-standard-2 --persistent-disks 1x256
This's messages from Cygwin command line.
CHECK: -----
project-id stone-cathode-10xxxx
cluster MaprBank10
config-file 4node_yarn.lst
image centos-6 machine n1-standard-2
zone us-central1-b
OPTIONAL: -----
node-name none
persistent-disks 1x256
----- Proceed {y/N} ? y Launch node1
Creating persistent data volumes first (1x256) seq: not found
Launch node2
Creating persistent data volumes first (1x256) seq: not found
Launch node3
Creating persistent data volumes first (1x256) seq: not found
Launch node4
Creating persistent data volumes first (1x256) seq: not found
NAME ZONE MACHINE_TYPE PREEMPTIBLE INTERNAL_IP EXTERNAL_IP STATUS
How to Investigate and solve issue. Thank you very much.
The course ADM-201 has particular requirement that you must follow.
The config file that you have chosen is 4node_yarn.lst and it supports 3 x 50 persistent-disks for each node in your 4-node MapR cluster. Since you are mentioning only one disk (1 x 256) in your command, it is not able to meet it's requirements.
Also carefully follow the "Set Up a Virtual Cluster" provided by MapR. Below is the screenshot from the guide provided by MapR.

Resources