Why can't I see the block files in the path specified by dfs.data.dir? - hadoop

I just wrote a 90 MB file into HDFS and ran the fsck command below; the output follows.
xuhang@master:~$ hadoop fsck /home/xuhang/hadoopinput/0501/baidu_hadoop.flv -files -blocks -locations
/home/xuhang/hadoopinput/0501/baidu_hadoop.flv 103737775 bytes, 2 block(s)
.......................
0. blk_-7625024667897507616_12224 len=67108864 repl=2 [node1:50010, node2:50010]
1. blk_2225876293125688018_12224 len=36628911 repl=2 [node1:50010, node2:50010]
.................
.................
FSCK ended at Sun Sep 22 11:55:51 CST 2013 in 25 milliseconds
I have configured the same property in hdfs-site.xml on both datanodes, like below.
<name>dfs.name.dir</name>
<value>/home/xuhang/hadoop-1.2.1/name1,/home/xuhang/hadoop-1.2.1/name2</value>
But I find nothing in /home/xuhang/hadoop-1.2.1/name1 or /home/xuhang/hadoop-1.2.1/name2 on either datanode. Why? I am sure I wrote the 90 MB file into HDFS successfully, because I can read it back with the hadoop command or a Java client.

I see those blocks are on hosts node1 and node2. Have you been looking on node1 and node2?
Please check the hdfs-site.xml on node1 and node2 too. Note that the property you quoted, dfs.name.dir, only controls where the namenode stores its metadata; block files are written under dfs.data.dir, and it is likely that dfs.data.dir is set to something different on those nodes. You should find the blk_ files inside a directory named current, which sits inside the directories pointed to by dfs.data.dir.
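As a minimal sketch of what to check on node1 and node2 (assuming the standard Hadoop 1.x tarball layout, with the config under /home/xuhang/hadoop-1.2.1/conf and a hypothetical dfs.data.dir of /home/xuhang/hadoop-1.2.1/data):
# Print the dfs.data.dir value actually configured on this datanode
grep -A1 dfs.data.dir /home/xuhang/hadoop-1.2.1/conf/hdfs-site.xml
# Look for the block files under that directory's "current" subdirectory
find /home/xuhang/hadoop-1.2.1/data/current -name 'blk_*' 2>/dev/null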

Related

Unable to load large file to HDFS on Spark cluster master node

I have fired up a Spark cluster on Amazon EC2 containing 1 master node and 2 worker nodes that have 2.7 GB of memory each.
However, when I tried to put a 3 GB file onto HDFS with the command below,
/root/ephemeral-hdfs/bin/hadoop fs -put /root/spark/2GB.bin 2GB.bin
it returns the error "/user/root/2GB.bin could only be replicated to 0 nodes, instead of 1". FYI, I am able to upload smaller files, but not once they exceed a certain size (about 2.2 GB).
If a file exceeds the capacity of a single node, wouldn't Hadoop split it across the other node?
Edit: Summary of my understanding of the issue you are facing:
1) Total HDFS free size is 5.32 GB
2) HDFS free size on each node is 2.6GB
Note: You have bad blocks (4 Blocks with corrupt replicas)
The following Q&A mentions similar issues:
Hadoop put command throws - could only be replicated to 0 nodes, instead of 1
In that case, running jps showed that the datanodes were down.
These Q&As suggest ways to restart the datanode:
What is best way to start and stop hadoop ecosystem, with command line?
Hadoop - Restart datanode and tasktracker
Please try to restart your datanode, and let us know whether it solves the problem.
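For reference, here is a minimal sketch of checking and restarting the datanode daemon on a worker node. It assumes the ephemeral-hdfs layout from the question (a standard Hadoop 1.x install under /root/ephemeral-hdfs); adjust the path for your setup:
# See which Hadoop daemons are currently running on this node
jps
# Stop and start the datanode daemon
/root/ephemeral-hdfs/bin/hadoop-daemon.sh stop datanode
/root/ephemeral-hdfs/bin/hadoop-daemon.sh start datanode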
When using HDFS you have one shared file system, i.e. all nodes share the same file system.
From your description, the current free space on HDFS is about 2.2 GB, while you are trying to put 3 GB there.
Execute the following commands to get the HDFS free size:
hdfs dfs -df -h
hdfs dfsadmin -report
or (for older versions of HDFS)
hadoop fs -df -h
hadoop dfsadmin -report

Does Hadoop create multiple copies of input files, one copy per node

If I wish to copy a file from a local directory to HDFS, do I need to physically copy the file onto each Hadoop node?
Or, if I use the hadoop dfs command, will Hadoop internally create a copy of this file on each node?
Am I correct to assume that each node needs to have a copy of the file?
When you copy a file (any data), Hadoop (HDFS) will store it on the datanodes, and the metadata information will be stored on the namenode. The replication of the file (data) is taken care of by Hadoop; you do not need to copy it multiple times.
You can use one of the commands below to copy files from local to HDFS:
hdfs dfs -put <source> <destination>
hdfs dfs -copyFromLocal <source> <destination>
The replication factor configuration is stored in the hdfs-site.xml file.
Am I correct to assume that each node needs to have a copy of the file?
This is not necessarily true. HDFS creates replicas as per the configuration found in the hdfs-site.xml file. The default replication factor is 3.
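As a quick check (assuming Hadoop 2.x binaries on your PATH; the file path below is just an example), you can print the configured default and the actual replication factor of a file you have already uploaded:
# Default replication factor from the effective configuration
hdfs getconf -confKey dfs.replication
# Replication factor of a specific file (%r in the -stat format string)
hdfs dfs -stat %r /user/hadoop/example.txt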
Yes, the Hadoop Distributed File System replicates data to a minimum of 3 datanodes by default. But nowadays the trend is toward Spark, which also runs on top of Hadoop and is advertised as up to 100 times faster than Hadoop MapReduce.
Spark: http://spark.apache.org/downloads.html
You are not required to copy the file from your local machine to every node in the cluster.
You can use client utilities like the hadoop fs or hadoop dfs commands to do so.
Your file will not necessarily be copied to all the nodes in the cluster; the number of replicas is controlled by the dfs.replication property in the hdfs-site.xml configuration file. Its default value is 3, which means 3 copies of your file will be stored across the cluster on some set of nodes.
In more detail:
1. The hadoop dfs command first contacts the namenode with the given file's details.
2. The namenode computes the number of blocks the file has to be split into, according to the block size configured in hdfs-site.xml.
3. The namenode returns a list of chosen datanodes for every computed block of the file. The count of datanodes in each list is equal to the replication factor configured in hdfs-site.xml.
4. The hadoop client then starts storing each block of the file on the chosen datanodes through the HDFS write pipeline.
5. For each block, the client prepares a data pipeline in which all the datanodes chosen to store that block are arranged as a queue.
6. The client copies the current block only to the first datanode in the queue.
7. Upon completion of the copy, the first datanode cascades the block to the second datanode in the queue, and so on.
8. All the block details of the file, and the details of the datanodes holding copies of them, are maintained in the namenode's metadata.
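To see the result of this process for a file you have uploaded (which blocks it was split into, how many replicas each block has, and which datanodes hold them), you can run fsck, as in this sketch; the path is just an example:
# List the blocks, replica counts, and datanode locations for a given file
hadoop fsck /user/hadoop/example.txt -files -blocks -locations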
You do not need to copy files manually to all nodes.
Hadoop will take care of distributing data to different nodes.
You can use simple commands to upload data to HDFS:
hadoop fs -copyFromLocal </path/to/local/file> </path/to/hdfs>
OR
hadoop fs -put </path/to/local/file> </path/to/hdfs>
You can read more about how data is internally written to HDFS here: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
You can also download a file from HDFS to the local filesystem, without manually copying files from each datanode, using the command:
hadoop fs -copyToLocal </path/to/hdfs/file> </path/to/local>
OR
hadoop fs -get </path/to/hdfs/file> </path/to/local>

DataNode is Not Starting in singlenode hadoop 2.6.0

I installed Hadoop 2.6.0 on my laptop running Ubuntu 14.04 LTS. I successfully started the Hadoop daemons by running start-all.sh, and I ran a WordCount example successfully. Then I tried to run a jar example that didn't work for me, so I decided to format using hadoop namenode -format and start all over again. But when I start all the daemons using start-dfs.sh && start-yarn.sh and then run jps, all the daemons come up except the datanode, as shown below:
hdferas@feras-Latitude-E4310:/usr/local/hadoop$ jps
12628 NodeManager
12110 NameNode
12533 ResourceManager
13335 Jps
12376 SecondaryNameNode
How to solve that?
I have faced this issue and it is very easy to solve. Your datanode is not starting because you formatted the namenode again after the namenode and datanode had already been running. That means you have cleared the metadata from the namenode. The blocks you stored for running the word count are still on the datanode, but the datanode no longer matches the freshly formatted namenode and does not know where to send its block reports, so it will not start.
Here are the things you need to do to fix it.
Stop all the Hadoop services (stop-all.sh) and close any active ssh connections.
cat /usr/local/hadoop/etc/hadoop/hdfs-site.xml
This step is important: see where the datanode's data is getting stored. It is the value associated with dfs.datanode.data.dir. For me it is /usr/local/hadoop/hadoop_data/hdfs/datanode. Open your terminal, navigate to that directory, and delete the directory named current which will be there under it. Make sure you are only deleting the "current" directory.
sudo rm -r /usr/local/hadoop/hadoop_data/hdfs/datanode/current
Now format the namenode, start the daemons again (start-dfs.sh && start-yarn.sh), and check whether everything is fine.
hadoop namenode -format
say yes if it asks you for anything.
jps
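Putting these steps together as one sketch (the datanode path is from my setup above; substitute your own dfs.datanode.data.dir value):
stop-all.sh                    # stop all Hadoop daemons first
sudo rm -r /usr/local/hadoop/hadoop_data/hdfs/datanode/current
hadoop namenode -format        # answer Y when prompted
start-dfs.sh && start-yarn.sh  # bring the daemons back up
jps                            # DataNode should now appear in the list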
Hope my answer solves the issue. If it doesn't let me know.
A little advice: don't format your namenode. Without the namenode there is no way to reconstruct the data. If your WordCount job is not running, that is some other problem.
I had this issue when formatting the namenode too. What I did to solve it was:
Find your dfs.name.dir location. For example, suppose your dfs.name.dir is /home/hadoop/hdfs.
(a) Now go to, /home/hadoop/hdfs/current.
(b) Search for the file VERSION. Open it using a text editor.
(c) There will be a line namespaceID=122684525 (122684525 is my ID, yours will be different). Note the ID down.
Now find your hadoop.tmp.dir location. Mine is /home/hadoop/temp.
(a) Go to /home/hadoop/temp/dfs/data/current.
(b) Search for the file VERSION and open it using a text editor.
(c) There will be a line namespaceID=. The namespaceID in this file and in the previous one must be the same.
(d) This is the main reason why my datanode would not start. I made them both the same, and now the datanode starts fine.
Note: copy the namespaceID from /home/hadoop/hdfs/current/VERSION to /home/hadoop/temp/dfs/data/current/VERSION. Don't do it in reverse.
Now run start-dfs.sh && start-yarn.sh. The datanode will start.
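A quick way to compare the two IDs before restarting (paths as in this answer; yours will differ):
# The two namespaceID values must match
grep namespaceID /home/hadoop/hdfs/current/VERSION
grep namespaceID /home/hadoop/temp/dfs/data/current/VERSION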
You just need to remove all the contents of the datanode folder and then reformat the namenode using the following command:
hadoop namenode -format
I had the same issue too; I checked the log and found the error below:
Exception - Datanode log
FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in secureMain
java.io.IOException: All directories in dfs.datanode.data.dir are invalid: "/usr/local/hadoop_store/hdfs/datanode/
Ran the below command to resolve the issue
sudo chown -R hduser:hadoop /usr/local/hadoop_store
Note: I have created the namenode and datanode directories under the path /usr/local/hadoop_store.
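To confirm that the ownership change took effect, something like the following should now show hduser:hadoop as the owner (the namenode subdirectory name is my assumption about this layout; adjust it to match yours):
# Both directories should now be owned by hduser:hadoop
ls -ld /usr/local/hadoop_store/hdfs/namenode /usr/local/hadoop_store/hdfs/datanode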
The above problem occurs when you format the namenode (hadoop namenode -format) without stopping the dfs and yarn daemons. While formatting the namenode, the question given below appears and you press the Y key for it.
Re-format filesystem in Storage Directory /tmp/hadoop-root/dfs/name ? (Y or N)
Solution:
You need to delete the files within the current directory of dfs.name.dir, as mentioned in hdfs-site.xml. In my system dfs.name.dir is /tmp/hadoop-root/dfs/name, so the current directory is /tmp/hadoop-root/dfs/name/current.
rm -r /tmp/hadoop-root/dfs/name/current
Using the above command I removed the files inside the current directory. Make sure you are only deleting the "current" directory. Format the namenode again after stopping the dfs and yarn daemons (stop-dfs.sh & stop-yarn.sh). Now the datanode will start normally!
In core-site.xml, check the absolute path of the temp directory (hadoop.tmp.dir). If it does not point to a valid, existing directory (create it with mkdir if needed), the datanode cannot be started.
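A small sketch of that check (the /usr/local/hadoop/tmp path is only a hypothetical example; use whatever your core-site.xml actually points to):
# Print the effective hadoop.tmp.dir value and make sure the directory exists
hdfs getconf -confKey hadoop.tmp.dir
mkdir -p /usr/local/hadoop/tmp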
Add the properties below to yarn-site.xml:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
This is not the right way to do it, but it surely works.
Remove the files from your datanode, namenode, and tmp folders. Any files/folders created inside these are owned by Hadoop and may hold references to the last datanode run, which may have failed or left locks behind; that is why the datanode does not start on the next attempt.
I got the same issue (DataNode & TaskTracker would not come up).
RESOLUTION:
Delete every "current" sub-directory under data, name, and namesecondary to resolve the DataNode/TaskTracker not showing up when you run start-all.sh and then jps.
(My locations are: /home/training/hadoop-temp/dfs/data/current, /home/training/hadoop-temp/dfs/name/current, and /home/training/hadoop-temp/dfs/namesecondary/current.)
Make sure you stop services: stop-all.sh
1. Go to each "current" sub-directory under data, name, namesecondary and remove/delete (example: rm -r name/current)
2. Then format: hadoop namenode -format
3. Recreate the current directory under /home/training/hadoop-temp/dfs/data (i.e. mkdir data/current)
4. Copy the contents of /home/training/hadoop-temp/dfs/name/current into the data/current directory (see the sketch after these steps)
EXAMPLE: files under:
/home/training/hadoop-temp/dfs/name/current
[training@CentOS current]$ ls -l
-rw-rw-r--. 1 training training 9901 Sep 25 01:50 edits
-rw-rw-r--. 1 training training 582 Sep 25 01:50 fsimage
-rw-rw-r--. 1 training training 8 Sep 25 01:50 fstime
-rw-rw-r--. 1 training training 101 Sep 25 01:50 VERSION
5. Change the storageType=NAME_NODE in VERSION to storageType=DATA_NODE in the data/current/VERSION that you just copied over.
BEFORE:
[training@CentOS dfs]$ cat data/current/VERSION
namespaceID=1018374124
cTime=0
storageType=NAME_NODE
layoutVersion=-32
AFTER:
[training@CentOS dfs]$ cat data/current/VERSION
namespaceID=1018374124
cTime=0
storageType=DATA_NODE
layoutVersion=-32
6. Make sure each of data, name, and namesecondary now has the same files in its current subdirectory that name/current has
[training@CentOS dfs]$ pwd
/home/training/hadoop-temp/dfs/
[training@CentOS dfs]$ ls -l
total 12
drwxr-xr-x. 5 training training 4096 Sep 25 01:29 data
drwxrwxr-x. 5 training training 4096 Sep 25 01:19 name
drwxrwxr-x. 5 training training 4096 Sep 25 01:29 namesecondary
7. Now start the services: start-all.sh
You should see all 5 services when you type: jps
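For steps 3 to 5, a minimal sketch of the copy and the VERSION edit (paths are the ones from this answer):
cd /home/training/hadoop-temp/dfs
mkdir -p data/current
cp name/current/* data/current/
# Mark the copied VERSION file as belonging to a datanode rather than the namenode
sed -i 's/storageType=NAME_NODE/storageType=DATA_NODE/' data/current/VERSION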
I am using hadoop-2.6.0. I resolved it as follows:
1. Delete all files within /usr/local/hadoop_store/hdfs
command: sudo rm -r /usr/local/hadoop_store/hdfs/*
2. Format the Hadoop namenode
command: hadoop namenode -format
3. Go to the sbin directory (cd /usr/local/hadoop/sbin) and run start-all.sh
Then check with jps:
hduser@abc-3551:/$ jps
The following services should now be running:
19088 Jps
18707 ResourceManager
19043 NodeManager
18535 SecondaryNameNode
18329 DataNode
18159 NameNode
When I had this same issue, the 'current' folder wasn't even being created in my hadoop/data/datanode folder. If this is the case for you too:
~ Copy the contents of 'current' from the namenode folder and paste it into the datanode folder.
~ Then open the datanode's VERSION file and change storageType=NAME_NODE to storageType=DATA_NODE.
~ Run jps to check that the datanode keeps running.

Hadoop fsck shows missing replicas

I am running a Hadoop 2.2.0 cluster with two datanodes and one namenode. When I check the system using the hadoop fsck command on the namenode or on any of the datanodes, I get the following:
Target Replicas is 3 but found 2 replica(s).
I tried changing the configuration in hdfs-site.xml (dfs.replication to 2) and restarted the cluster services. On running hadoop fsck / it still shows the same status:
Target Replicas is 3 but found 2 replica(s).
Please clarify, is this a caching issue or a bug?
Setting dfs.replication does not bring down the replication of existing files; this property is consulted only when a file is created without an explicit replication factor. To change the replication of existing files, the following Hadoop utility can be used:
hadoop fs -setrep [-R] [-w] <rep> <path/file>
or
hdfs dfs -setrep [-R] [-w] <rep> <path/file>
Here, / can also be specified as the path to change the replication factor of the complete filesystem.
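For instance, to match the replication factor of 2 from the question across everything under / and wait for the re-replication to finish, a minimal example would be:
# -R recurses into directories (newer releases recurse by default); -w waits for the change to complete
hadoop fs -setrep -R -w 2 /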

Hadoop: FileNotFoundException when getting a file from the DistributedCache

I have a 2-node cluster (v1.0.4), master and slave. On the master, in Tool.run(), we add two files to the DistributedCache using addCacheFile(). The files do exist in HDFS.
In the Mapper.setup() we want to retrieve those files from the cache using
FSDataInputStream fs = FileSystem.get( context.getConfiguration() ).open( path ).
The problem is that for one file a FileNotFoundException is thrown, although the file exists on the slave node:
attempt_201211211227_0020_m_000000_2: java.io.FileNotFoundException: File does not exist: /somedir/hdp.tmp.dir/mapred/local/taskTracker/distcache/-7769715304990780/master/tmp/analytics/1.csv
ls -l on the slave:
[hduser@slave ~]$ ll /somedir/hdp.tmp.dir/mapred/local/taskTracker/distcache/-7769715304990780/master/tmp/analytics/1.csv
-rwxr-xr-x 1 hduser hadoop 42701 Nov 22 10:18 /somedir/hdp.tmp.dir/mapred/local/taskTracker/distcache/-7769715304990780/master/tmp/analytics/1.csv
My questions are:
Shouldn't all files exist on all nodes?
What should be done to fix that?
Thanks.
Solved: I should have used:
FileSystem.getLocal( conf )
Thanks to Harsh J from Hadoop mailing list.
