Unable to load large file to HDFS on Spark cluster master node - hadoop

I have fired up a Spark cluster on Amazon EC2 containing 1 master node and 2 worker nodes that have 2.7 GB of memory each.
However, when I tried to put a 3 GB file onto HDFS with the command below,
/root/ephemeral-hdfs/bin/hadoop fs -put /root/spark/2GB.bin 2GB.bin
it returned the error "/user/root/2GB.bin could only be replicated to 0 nodes, instead of 1". FYI, I am able to upload smaller files, but not once they exceed a certain size (about 2.2 GB).
If a file exceeds the memory size of a node, wouldn't it be split by Hadoop across the other nodes?

Edit: Summary of my understanding of the issue you are facing:
1) Total HDFS free size is 5.32 GB
2) HDFS free size on each node is 2.6GB
Note: You have bad blocks (4 Blocks with corrupt replicas)
The following Q&A mentions similar issues:
Hadoop put command throws - could only be replicated to 0 nodes, instead of 1
In that case, running jps showed that the datanodes were down.
These Q&As suggest ways to restart the datanode:
What is best way to start and stop hadoop ecosystem, with command line?
Hadoop - Restart datanode and tasktracker
Please try to restart your datanode, and let us know whether it solved the problem.
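For example, on a spark-ec2 cluster you could do something along these lines (just a sketch, assuming the /root/ephemeral-hdfs layout from your put command; adjust the paths to your installation):
# check which HDFS daemons are running on each node (a DataNode process should appear on every worker)
jps
# from the master, restart the ephemeral HDFS daemons
/root/ephemeral-hdfs/bin/stop-dfs.sh
/root/ephemeral-hdfs/bin/start-dfs.sh
# verify that both datanodes are registered and reporting free capacity
/root/ephemeral-hdfs/bin/hadoop dfsadmin -report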
When using HDFS you have one shared file system, i.e. all nodes share the same file system.
From your description, the current free space on HDFS is about 2.2 GB, while you are trying to put 3 GB there.
Execute the following commands to get the HDFS free size:
hdfs dfs -df -h
hdfs dfsadmin -report
or (for older versions of HDFS)
hadoop fs -df -h
hadoop dfsadmin -report
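For example, to pull just the free-space figures out of the report (a rough sketch; the exact field labels can vary slightly between Hadoop versions):
hdfs dfsadmin -report | grep -E 'Name:|DFS Remaining'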

Related

Hadoop errorcode -1000, No space available in any of the local directories

I'm using Windows 7 with Hadoop 2.10.1 installed as shown here: https://exitcondition.com/install-hadoop-windows/ and I get an error when running my job:
INFO mapreduce.Job:
Job job_1605374051781_0001 failed with state FAILED due to:
Application application_1605374051781_0001 failed 2 times
due to AM Container for appattempt_1605374051781_0001_000002 exited with
exitCode: -1000 Failing this attempt.Diagnostics:
[2020-11-14 18:17:54.217]No space available in any of the local directories.
The expected output is several lines of text and my disks are nowhere near full (at least 10GB free). The code is some generic mapreduce job that I cannot post here because it's the intellectual property of the university.
Any tips on how to solve the "No space available" error?
For clarification I'm using only my PC, I'm not connected to other machines.
PS: I've solved it, as said in "Hadoop map reduce example stuck on Running job" by user "banu reddy" (https://stackoverflow.com/users/4249076/banu-reddy): the free HDD space needs to be at least 10% of the disk.
Hadoop jobs are executed within the framework's distributed filesystem, aka HDFS, which works independently of the local filesystem (even when operating on just one machine, as you clarified).
That basically means that the error you got refers to the disk space available to HDFS and not to your hard drives in general. To check whether HDFS has enough disk space to run the job, you can execute the following command in the terminal:
hdfs dfs -df -h
Its output reports the size, used space, and available space of the HDFS filesystem (you can ignore any warnings your Hadoop setup prints).
If the command output on your system indicates that the available disk space is low or non-existent, you can delete individual directories from HDFS, first checking which directories and files are stored there:
hadoop fs -ls
And then deleting each directory from the HDFS:
hadoop fs -rm -r name_of_the_folder
Or file from the HDFS:
hadoop fs -rm name_of_the_file
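Putting those steps together, a minimal cleanup session could look like this (just a sketch; big_folder, old_file.txt and the /user/your_user prefix are hypothetical names):
# see which directories use the most HDFS space
hadoop fs -du -h /
# remove a directory you no longer need
hadoop fs -rm -r /user/your_user/big_folder
# remove a single file
hadoop fs -rm /user/your_user/old_file.txt
# confirm that space was actually freed
hdfs dfs -df -h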
Alternatively, you can empty everything stored in HDFS to be sure that you will not hit the disk space limit again any time soon. You can do that by first stopping the YARN and HDFS daemons:
stop-all.sh
Then starting only the HDFS daemons:
start-dfs.sh
Then formatting the namenode, which erases everything stored in HDFS (not your local files, of course):
hadoop namenode -format
And finally starting the YARN and HDFS daemons again:
start-all.sh
Remember to re-run the hdfs dfs -df -h command after deleting data from HDFS, to make sure you actually have free space in HDFS.

Hadoop : swap DataNode & NameNode without losing any HDFS data

I have a cluster of 5 machines:
1 big NameNode
4 standard DataNodes
I want to swap my current NameNode with a DataNode without losing the data stored in HDFS, so my cluster could become:
1 standard NameNode
3 standard DataNodes
1 big DataNode
Does someone know a simple way to do that?
Thank you very much
1) Decommission the datanode where the namenode will be moved.
2) Stop the cluster.
3) Create a tar of dfs.name.dir from the current namenode.
4) Copy all Hadoop config files from the current NN to the target NN.
5) Replace the name/IP of the target namenode by modifying core-site.xml.
6) Restore the tarball of dfs.name.dir. Make sure that the full path is the same.
7) Now start the cluster by starting the new namenode and one less datanode.
8) Verify that everything is working perfectly.
9) Afterwards, add the old namenode as a datanode by configuring it as a datanode.
I would suggest uninstalling and then reinstalling Hadoop on both nodes so that the previous configuration does not cause any problems.
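A rough sketch of steps 1, 3 and 6 could look like this (all names here are assumptions: dfs.name.dir is taken to be /data/dfs/name, the target host is called new-nn, and an excludes file is assumed to be configured via dfs.hosts.exclude):
# 1) decommission the datanode that will host the new namenode
echo "new-nn" >> /etc/hadoop/conf/dfs.exclude
hdfs dfsadmin -refreshNodes
# wait until the node is reported as "Decommissioned" in hdfs dfsadmin -report
# 3) archive the namenode metadata on the current namenode
tar czf nn-metadata.tar.gz -C /data/dfs name
scp nn-metadata.tar.gz new-nn:/tmp/
# 6) restore it under the exact same path on the target namenode
ssh new-nn 'mkdir -p /data/dfs && tar xzf /tmp/nn-metadata.tar.gz -C /data/dfs'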

Datanode is in dead state as DFS used is 100 percent

I have a standalone setup of Apache Hadoop, with the Namenode and Datanode running on the same machine.
I am currently running Apache Hadoop 2.6 (I cannot upgrade it) on Ubuntu 16.04.
Although my system has more than 400 GB of hard disk space left, my Hadoop dashboard is showing DFS used as 100%.
Why is Apache Hadoop not consuming the rest of the disk space available to it? Can anybody help me figure out a solution?
There can be several reasons for it.
You can try the following steps:
Go to $HADOOP_HOME/bin and restart the datanode:
./hadoop-daemon.sh --config $HADOOP_HOME/conf start datanode
Then you can try the following things:
If any directory other than your namenode and datanode directories is taking up too much space, you can start cleaning up.
You can also run hadoop fs -du -s -h /user/hadoop (to see the usage of the directories).
Identify all the unnecessary directories and start cleaning up by running hadoop fs -rm -R /user/hadoop/raw_data (-rm deletes, -R deletes recursively; be careful while using -R).
Run hadoop fs -expunge (to clean up the trash immediately; sometimes you need to run it multiple times).
Run hadoop fs -du -s -h / (it will give you the HDFS usage of the entire file system), or run hdfs dfsadmin -report as well, to confirm whether storage has been reclaimed.
Many times it shows missing blocks (with replication 1).
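As a combined usage sketch of the cleanup above (/user/hadoop/raw_data is the same example path used in the steps; -skipTrash is an option of hadoop fs -rm that bypasses the trash, so the space is reclaimed immediately without waiting for -expunge):
# delete an unneeded directory without sending it to the trash
hadoop fs -rm -R -skipTrash /user/hadoop/raw_data
# empty whatever is already sitting in the trash
hadoop fs -expunge
# confirm that the space has been reclaimed
hdfs dfsadmin -report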

Does Hadoop create multiple copies of input files, one copy per node

If I wish to copy a file from a local directory to HDFS, do I need to physically copy the file to each Hadoop node?
Or, if I use the hadoop dfs command, will Hadoop internally create a copy of this file on each node?
Am I correct to assume that each node needs to have a copy of the file?
When you copy a file (any data), Hadoop (HDFS) will store it on some Datanode and the metadata information will be stored on the Namenode. The replication of the file (data) will be taken care of by Hadoop; you do not need to copy it multiple times.
You can use one of the commands below to copy files from local to HDFS:
hdfs dfs -put <source> <destination>
hdfs dfs -copyFromLocal <source> <destination>
The replication factor configuration is stored in the hdfs-site.xml file.
Am I correct to assume that each node needs to have a copy of the file?
This is not necessarily true. HDFS creates replicas as per the configuration found in the hdfs-site.xml file. The default replication factor is 3.
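To check the replication factor your configuration actually resolves to, you can query it directly (a quick sketch; hdfs getconf is available in Hadoop 2.x and later):
hdfs getconf -confKey dfs.replication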
Yes, the Hadoop Distributed File System replicates data across 3 datanodes by default. As an aside, the trend nowadays is towards Spark, which also runs on top of Hadoop and is advertised as being up to 100 times faster for some workloads.
spark http://spark.apache.org/downloads.html
You are not required to copy the file from the local machine to every node in the cluster.
You can use client utilities like the hadoop fs or hadoop dfs commands to do so.
It is not necessary that your file will be copied to all the nodes in the cluster; the number of replicas is controlled by the dfs.replication property in the hdfs-site.xml configuration file, whose default value is 3, meaning that 3 copies of your file will be stored across the cluster on some of the nodes.
Please refer to the details below:
1) The hadoop dfs command first contacts the Namenode with the given file's details.
2) The Namenode computes the number of blocks that the file has to be split into, according to the block size configured in hdfs-site.xml.
3) The Namenode returns a list of chosen Datanodes for every computed block of the given file. The count of Datanodes in every list is equal to the replication factor configured in hdfs-site.xml.
4) The hadoop client then starts storing each block of the file on the given Datanodes by streaming the data.
5) For each block, the client prepares a data pipeline in which all the Datanodes chosen to store the block form a queue.
6) The client copies the current block only to the first Datanode in the queue.
7) Upon completion of the copy, the first Datanode cascades the block to the second Datanode in the queue, and so on.
8) All the block details of the files, and the details of the Datanodes which hold a copy of them, are maintained in the Namenode's metadata.
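To see this block and replica placement for a file that is already stored in HDFS, fsck can print the chosen Datanodes for every block (a sketch; /user/hadoop/myfile.txt is a hypothetical path):
hdfs fsck /user/hadoop/myfile.txt -files -blocks -locations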
You do not need to copy files manually to all nodes.
Hadoop will take care of distributing data to different nodes.
You can use simple commands to upload data to HDFS:
hadoop fs -copyFromLocal </path/to/local/file> </path/to/hdfs>
OR
hadoop fs -put </path/to/local/file> </path/to/hdfs>
You can read more how data is internally written on HDFS here : http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
You can also download a file from HDFS to the local filesystem, without manually copying files from each datanode, using the command:
hadoop fs -copyToLocal </path/to/hdfs/file> </path/to/local>
OR
hadoop fs -get </path/to/hdfs/file> </path/to/local>

Hadoop fsck shows missing replicas

I am running a Hadoop 2.2.0 cluster with two datanodes and one namenode. When I try checking the file system using the hadoop fsck command on the namenode or any of the datanodes, I get the following:
Target Replicas is 3 but found 2 replica(s).
I tried changing the configuration in hdfs-site.xml (dfs.replication to 2) and restarted the cluster services. On running hadoop fsck / it still shows the same status:
Target Replicas is 3 but found 2 replica(s).
Please clarify, is this a caching issue or a bug?
Setting dfs.replication does not bring down the replication of existing files; this property is only consulted when a file is created without an explicit replication factor. To change the replication of existing files, the following Hadoop utility can be used:
hadoop fs -setrep [-R] [-w] <rep> <path/file>
or
hdfs dfs -setrep [-R] [-w] <rep> <path/file>
Here, / can also be specified as the path, to change the replication factor of the complete filesystem.
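For example, to bring everything already stored in HDFS down to a replication factor of 2 and then confirm that fsck no longer reports the missing replicas (a sketch; -w waits until the replication change completes):
# change the replication factor of the entire filesystem
hadoop fs -setrep -R -w 2 /
# re-check: the "Target Replicas is 3 but found 2 replica(s)" warnings should be gone
hadoop fsck /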
