Add an entire directory to the Hadoop file system (HDFS)

I have data stored in sub-directories and would like to put the parent directory into HDFS. The data always sits in the leaf directories, and the directory structure goes down two levels.
So the structure is [parent_dir]->[sub_directories]->[sub_directories]->data
I tried to add the entire directory by doing
hadoop fs -put parent_dir input
This takes a very long time! There are possibly 258 x 258 sub-directories, and it eventually fails with
ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(X.X.X.245:50010, storageID=DS-262356658-X.X.X.245-50010-1394905028736, infoPort=50075, ipcPort=50020):DataXceiver
org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: No space left on device
I can see the required space on the nodes. What am I doing wrong here?
Also, the way I was planning to access my files was
hadoop jar Computation.jar input/*/* output
This worked well for small data sets.

That error message is usually correct: you may not be taking into account the replication factor of the HDFS filesystem. If your replication factor is set to 3, which is the default, then you need 300 GB of storage available to store a 100 GB dataset.
There are a couple of things you can do to help get around the issue:
1) Decrease your replication factor (dfs.replication) and your maximum replication (dfs.replication.max) to 2 in your hdfs-site.xml
2) Compress your datasets. Hadoop can operate on bzip2- and gzip-compressed files (though you need to be careful with splitting: gzip files are not splittable, while bzip2 files are)
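For example, the replication settings from step 1 would look like this in hdfs-site.xml (the values shown are the ones suggested above; adjust them to your cluster):

```xml
<!-- hdfs-site.xml: lower replication to reduce total storage required -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.replication.max</name>
  <value>2</value>
</property>
```

Note that lowering dfs.replication only affects newly written files; existing files keep their replication until it is changed explicitly (e.g. with hadoop fs -setrep).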


Fix corrupt HDFS Files without losing data (files in the datanode still exist)

I am new to HDFS and have come across an HDFS question.
We have an HDFS file system, with the namenode on one server (named 0002) and datanodes on two other servers (named 0004 and 0005 respectively). The original data comes from a Flume application whose sink is HDFS; Flume writes the original data (txt files) to the datanodes on servers 0004 and 0005.
So the original data is replicated twice and saved on the two servers. The system worked well for some time, until one day there was a power outage. When the servers were restarted, the datanode servers (0004 and 0005) came up before the namenode (0002). In this case, the original data was still saved on the 0004 and 0005 servers, but the metadata on the namenode (0002) was lost and the block information became corrupt. The question is how to fix the corrupt blocks without losing the original data.
For example, when we check on the namenode
hadoop fsck /wimp/contract-snapshot/year=2020/month=6/day=10/snapshottime=1055/contract-snapshot.1591779548475.csv -files -blocks -locations
We find the file name on the datanode, but the block is corrupt. The corresponding block file name is:
blk_1090579409_16840906
When we go to the datanode (e.g. 0004) server, we can search the location of this file by
find ./ -name "*blk_1090579409*"
We have found the file corresponding to the csv file under the HDFS virtual path "/wimp/contract-snapshot/year=2020/month=6/day=10/snapshottime=1055/contract-snapshot.1591779548475.csv". The file is saved under the folder "./subdir0/subdir235/", and we can open it and see that it is in the correct format. The corresponding .meta file is binary and we cannot read it directly.
./subdir0/subdir235/blk_1090579409
The question is: given that we have found the original block file (blk_1090579409), how can we restore the corrupt HDFS system without losing these correct original files?
After some research, I found a solution which may not be efficient but works. If someone comes up with a better solution, please let me know.
The whole idea is to copy all the files off HDFS, arrange them by year/day/hour/minute into different folders, and then upload these folders back onto HDFS.
I have two datanodes (0004 and 0005) where data is stored. The total data size is on the order of 10+ terabytes. The folder structure is the same as shown in the question (one listing on Linux, the other on Windows).
The replication factor is set to 2, which means (if nothing went wrong) that each datanode holds exactly one copy of each original file. Therefore, we only need to scan the folders/files on one datanode (server 0004, about 5+ terabytes). Based on the modification date and the timestamp inside each file, the files are copied into a new folder on a backup server/drive. Luckily, timestamp information is available in the original files, e.g. 2020-03-02T09:25. I round the time to the nearest five minutes, with one parent folder per day.
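The five-minute rounding step can be sketched in plain Python (illustrative only; the actual pipeline described in the answer was written in PySpark, and the timestamp format is taken from the example above):

```python
from datetime import datetime, timedelta

def round_to_five_minutes(ts: str) -> str:
    """Round a timestamp like '2020-03-02T09:25' to the nearest five minutes."""
    dt = datetime.strptime(ts, "%Y-%m-%dT%H:%M")
    # Minutes since midnight, rounded to the nearest multiple of 5.
    total = dt.hour * 60 + dt.minute
    rounded = 5 * round(total / 5)
    # Rebuild the timestamp from midnight; timedelta handles day rollover.
    base = dt.replace(hour=0, minute=0) + timedelta(minutes=rounded)
    return base.strftime("%Y-%m-%dT%H:%M")
```

The rounded timestamp then determines which five-minute folder (under the per-day parent folder) a file is copied into.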
The code that scans and copies the files from the datanode into the new per-five-minute folders is written in PySpark, and it takes about 2 days (I leave the code running in the evening) to complete all the operations.
Then I can update the folders on HDFS for each day; on HDFS, the folders follow the year=/month=/day= layout shown in the fsck command above.
The created local folders have the same structure as on HDFS, and the naming convention is also the same (in the copy step, I rename each copied file to match the convention on HDFS).
In the final step, I wrote Java code to perform the operations on HDFS. After some testing, I am able to update the data for each day on HDFS: it deletes, e.g., the data under the folder ~/year=2019/month=1/day=2/ on HDFS and then uploads all the folders/files under the newly created folder ~/20190102/ to ~/year=2019/month=1/day=2/ on HDFS. I do this for each day. The corrupt blocks then disappear, and the right files are uploaded to the correct paths on HDFS.
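The per-day replace step can be sketched as a shell fragment (the paths are hypothetical examples, and the hdfs commands are echoed as a dry run rather than executed; the answer performed the equivalent operations through the Java FileSystem API):

```shell
# Hypothetical example paths for one day of rebuilt data.
DAY_LOCAL=./20190102
DAY_HDFS=/wimp/contract-snapshot/year=2019/month=1/day=2

# Dry run: print the commands instead of running them against a cluster.
echo "hdfs dfs -rm -r -skipTrash $DAY_HDFS"     # drop the day with corrupt blocks
echo "hdfs dfs -mkdir -p $DAY_HDFS"             # recreate the target path
echo "hdfs dfs -put $DAY_LOCAL/* $DAY_HDFS/"    # upload the rebuilt files
```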
According to my research, it is also possible to locate the corrupt blocks by using the fsimage file on Hadoop, but I was worried this might corrupt further blocks on HDFS after the power outage. Therefore, I decided to use the approach described above: delete the corrupt blocks while keeping the original files, and re-upload them to HDFS.
If anyone has a better or more efficient solution, please share!

Copy files/chunks from HDFS to local file system of slave nodes

In Hadoop, I understand that the master node (namenode) is responsible for storing the blocks of data on the slave machines (datanodes).
When we use -copyToLocal or -get from the master, files can be copied from HDFS to the master node's local storage. Is there any way the slaves can copy the blocks (data) stored on them to their own local file systems?
For example, a 128 MB file could be split among 2 slave nodes, each storing 64 MB. Is there any way for a slave to identify and load this chunk of data to its local file system? If so, how can this be done programmatically? Can the commands -copyToLocal or -get be used in this case as well? Please help.
Short Answer: No
The data/files cannot be copied directly from the datanodes. The reason is that datanodes store the data but don't have any metadata about the stored files; to them, the blocks are just bits and bytes. The metadata of the files is stored on the namenode. This metadata contains all the information about the files (name, size, etc.). Along with this, the namenode keeps track of which blocks of a file are stored on which datanodes. The datanodes are also not aware of the ordering of the blocks when a file is split into multiple blocks.
Can the commands -copyToLocal or -get be used in this case also?
Yes, you can simply run these from the slave. The slave will then contact the namenode (if you've configured it properly) and download the data to your local filesystem.
What it doesn't do is a "short-circuit" copy, in which it would just copy the raw blocks between directories. There is also no guarantee it will read the blocks from the local machine at all, as your commandline client doesn't know its location.
HDFS blocks are stored on the slaves' local file system only. You can dig down into the directory defined under the property "dfs.datanode.data.dir".
But you won't get any benefit from reading the blocks directly (without the HDFS API). Also, reading and editing block files directly can corrupt the file on HDFS.
If you want to store data on a different slave's local disk, you will have to implement your own logic for maintaining block metadata (which the namenode already maintains for you).
Can you elaborate on why you want to distribute blocks yourself when Hadoop takes care of all the challenges of distributed data?
You can copy a particular file or directory from one HDFS path (or cluster) to another by using distcp.
Usage: hadoop distcp <source-path> <destination-path>

Calculating HashCode function in HDFS

I want to move some files from one location to another, both on HDFS, and need to verify that the data has moved correctly.
To compare the data moved, I was thinking of calculating a hash code on both files and then comparing them for equality. If they are equal, I would consider the data movement correct; otherwise, the movement has not happened correctly.
But I have a couple of questions regarding this.
Do I need to use the hash-code technique at all in the first place? I am using the MapR distribution, and I read somewhere that data movement implements hashing of the data in the backend to make sure it has been transferred correctly. So is it guaranteed that when data is moved inside HDFS, it will be consistent and no anomalies will be introduced during the move?
Is there any other way to make sure that the data moved is consistent across locations?
Thanks in advance.
You are asking about data copying. Just use DistCp.
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting.
# sample example
$ hadoop distcp hdfs://nn1:8020/foo/bar \
                hdfs://nn2:8020/bar/foo
This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2.
EDIT
DistCp uses MapReduce to effect its distribution, error handling and recovery, and reporting.
After a copy, it is recommended that one generates and cross-checks a listing of the source and destination to verify that the copy was truly successful. Since DistCp employs both MapReduce and the FileSystem API, issues in or between any of the three could adversely and silently affect the copy.
EDIT
The common method I used to check the source and destination files is to compare the number of files and the size of each file. This can be done by generating a manifest at the source, then checking both the count and the sizes at the destination.
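The count-and-size manifest check can be sketched in plain Python for local directory trees (illustrative only; on a real cluster the listings would come from `hdfs dfs -ls -R` or the FileSystem API instead of os.walk):

```python
import os

def manifest(root: str) -> dict:
    """Map each file's path (relative to root) to its size in bytes."""
    result = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            result[os.path.relpath(full, root)] = os.path.getsize(full)
    return result

def copies_match(src_root: str, dst_root: str) -> bool:
    """True if both trees contain the same files with the same sizes."""
    return manifest(src_root) == manifest(dst_root)
```

Note that equal names and sizes do not strictly prove equal contents; for a stronger check, compare checksums as well.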
In HDFS, a move doesn't physically move the data (blocks) across the datanodes; it only changes the namespace in the HDFS metadata. For copying data from one HDFS location to another, we have two ways:
copy (hadoop fs -cp)
parallel copy (distcp)
A plain copy does not check the integrity of blocks. If you want data integrity while copying a file from one location to another within the same HDFS cluster, use the checksum concept by modifying the FsShell.java class or by writing your own class using the HDFS Java API.
In the case of DistCp, HDFS checks data integrity while copying data from one HDFS cluster to another.
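The shell also exposes file checksums directly via `hadoop fs -checksum`, which can be used to compare a source and destination file (shown as a dry run with hypothetical paths; note that HDFS file checksums are only comparable when both sides use the same block size and bytes-per-checksum settings):

```shell
# Hypothetical paths; the commands are echoed as a dry run, not executed.
SRC=/data/source/part-00000
DST=/data/dest/part-00000
echo "hadoop fs -checksum $SRC"
echo "hadoop fs -checksum $DST"
# If the two printed checksums match, the file contents are identical.
```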

How to shrink size of HDFS in Hadoop

I am using Hadoop to parse a large number (about 1 million) of text files, each of which contains a lot of data.
First, I uploaded all my text files into HDFS using Eclipse. But when uploading the files, my map-reduce operation produced a huge number of files in the directory C:\tmp\hadoop-admin\dfs\data.
So, is there any mechanism by which I can shrink the size of my HDFS (basically the above-mentioned drive)?
To limit the space HDFS may use, you can set a larger value (in bytes) for the following hdfs-site.xml property, which reserves that much space per volume for non-DFS use (the default of 0 reserves nothing):
dfs.datanode.du.reserved
You can also lower the amount of intermediate data generated by map outputs by enabling map output compression:
mapreduce.map.output.compress=true
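As a sketch, the two settings could look like this in the respective config files (the 10 GB value for dfs.datanode.du.reserved is only an example; the property name mapreduce.map.output.compress is the current one, with mapred.compress.map.output used on very old Hadoop versions):

```xml
<!-- hdfs-site.xml: reserve 10 GB per volume for non-DFS use (example value) -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>10737418240</value>
</property>

<!-- mapred-site.xml: compress intermediate map output -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
```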
hope that helps.

Hadoop. About file creation in HDFS

I read that whenever a client needs to create a file in HDFS (the Hadoop Distributed File System), the file must be 64 MB. Is that true? How can we load a file into HDFS that is smaller than 64 MB? Can we load a file that will serve just as a reference for processing another file, and that has to be available to all datanodes?
I read that whenever a client needs to create a file in HDFS (the Hadoop Distributed File System), the file must be 64 MB.
Could you provide a reference for that? A file of any size can be put into HDFS. The file is split into blocks of 64 MB (the default) and saved on different datanodes in the cluster.
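As a small illustration of the splitting arithmetic (assuming the older 64 MB default; newer Hadoop versions default to 128 MB, and the last block is not padded, so a small file only occupies as much disk space as it needs):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the older default block size

def num_blocks(file_size_bytes: int) -> int:
    """How many HDFS blocks a file of the given size occupies."""
    if file_size_bytes == 0:
        return 0
    return math.ceil(file_size_bytes / BLOCK_SIZE)

# A 1 MB file fits in a single (partially filled) block.
print(num_blocks(1 * 1024 * 1024))    # 1
# A 130 MB file needs three blocks (64 + 64 + 2).
print(num_blocks(130 * 1024 * 1024))  # 3
```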
Can we load a file which will be just for reference for processing other file and it has to be available to all datanodes?
It doesn't matter whether a block or file is on a particular datanode or on all the datanodes. Datanodes can fetch data from each other as long as they are part of the cluster.
Think of HDFS as a very big hard drive and write the code for reading/writing data from HDFS. Hadoop will take care of the internals like 'reading from' or 'writing to' multiple data nodes if required.
I would suggest reading some introductory material on HDFS, especially the comic that explains HDFS.
