Copy files/chunks from HDFS to local file system of slave nodes - hadoop

In Hadoop, I understand that the master node (Namenode) is responsible for placing the blocks of data on the slave machines (Datanodes).
When we use -copyToLocal or -get from the master, files can be copied from HDFS to the local storage of the master node. Is there any way the slaves can copy the blocks (data) that are stored on them to their own local file systems?
For example, a file of 128 MB could be split across 2 slave nodes, each storing 64 MB. Is there any way for a slave to identify and load its own chunk of the data into its local file system? If so, how can this be done programmatically? Can the commands -copyToLocal or -get be used in this case as well? Please help.

Short Answer: No
The data/files cannot be copied directly from the Datanodes. The reason is that Datanodes store the data but don't hold any metadata about the stored files; to them, the blocks are just bits and bytes. The metadata of the files is stored in the Namenode. This metadata contains all the information about the files (name, size, etc.). Along with this, the Namenode keeps track of which blocks of a file are stored on which Datanodes. The Datanodes are also not aware of the ordering of the blocks when a file is split into multiple blocks.
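That block-to-Datanode mapping is only visible through the Namenode. As a hedged illustration (not part of the original answer), the HDFS Java client API can ask for it; the path here is a placeholder and hadoop-client is assumed to be on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/user/test/bigfile.dat"));  // placeholder path

        // Ask the Namenode which Datanodes hold each block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}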

Can the commands -copyToLocal or -get be used in this case also?
Yes, you can simply run these from the slave. The slave will then contact the namenode (if you've configured it properly) and download the data to your local filesystem.
What it doesn't do is a "short-circuit" copy, in which it would just copy the raw block files between directories. There is also no guarantee it will read the blocks from the local machine at all, since the command-line client doesn't know which blocks are stored where.
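And for the "programmatically" part of the question: the shell's -copyToLocal/-get has a direct equivalent in the HDFS Java API. A minimal sketch, assuming hadoop-client on the classpath and placeholder paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetFromHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // fs.defaultFS must point at the Namenode
        FileSystem fs = FileSystem.get(conf);
        // Equivalent of: hdfs dfs -get /user/test/input.txt /tmp/input.txt
        fs.copyToLocalFile(new Path("/user/test/input.txt"), new Path("/tmp/input.txt"));
        fs.close();
    }
}

As with the shell commands, this goes through the Namenode, so it works from any node (or client machine) that can reach the cluster, not just the slave that happens to hold the blocks.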

HDFS blocks are stored on the slaves' local FS only. You can dig into the directory defined by the property dfs.datanode.data.dir (dfs.data.dir in older releases).
But you won't get any benefit from reading the block files directly (i.e. without the HDFS API). Also, reading and editing the block files directly can corrupt the file on HDFS.
If you want to store data on a particular slave's local disk yourself, then you will have to implement your own logic for maintaining the block metadata (which the Namenode already does for you).
Can you elaborate on why you want to distribute blocks yourself when Hadoop already takes care of all the challenges of distributed data?

You can copy a particular file or directory from one cluster (or path) to another by using distcp.
Usage: hadoop distcp <source-path> <destination-path>

Related

Fix corrupt HDFS Files without losing data (files in the datanode still exist)

I am new to the HDFS system and I have come across an HDFS question.
We have an HDFS file system, with the namenode on one server (named 0002) and datanodes on two other servers (named 0004 and 0005 respectively). The original data comes from a Flume application, with the Flume "sink" set to HDFS. Flume writes the original data (txt files) into the datanodes on servers 0004 and 0005.
So, the original data is replicated twice and saved on the two servers. The system worked well for some time until one day there was a power outage. When restarting the servers, the datanode servers (0004 and 0005) were restarted before the namenode (0002) server. In this case, the original data is still saved on the 0004 and 0005 servers, but the metadata information on the namenode (0002) is lost and the block information has become corrupt. The question is how to fix the corrupt blocks without losing the original data?
For example, when we check on the namenode
hadoop fsck /wimp/contract-snapshot/year=2020/month=6/day=10/snapshottime=1055/contract-snapshot.1591779548475.csv -files -blocks -locations
We find the file, but its block is reported corrupt. The corresponding block name is:
blk_1090579409_16840906
When we go to the datanode (e.g. 0004) server, we can search the location of this file by
find ./ -name "*blk_1090579409*"
We have found the file corresponding to the csv file under the virtual HDFS path "/wimp/contract-snapshot/year=2020/month=6/day=10/snapshottime=1055/contract-snapshot.1591779548475.csv". The file is saved under the folder "./subdir0/subdir235/" and we can open it and see it is in the correct format. The corresponding .meta file is in binary form(?) and we cannot read it directly.
./subdir0/subdir235/blk_1090579409
The question is: given that we have found the original block file (blk_1090579409), how can we restore the corrupt HDFS system using these correct original files, without losing them?
After some research, I found a solution, which may not be efficient but works. If someone comes up with a better solution, please let me know.
The whole idea is to copy all the files from HDFS, arrange these files by year/day/hour/minute into different folders, and then upload these folders back onto HDFS.
I have two datanodes (0004 and 0005) where the data is stored. The total data size is on the order of 10+ terabytes. The folder structure is the same as in the question (the original post showed one listing on Linux and one on Windows).
The replication factor is set to 2, which means (if no mistake happens) that each datanode holds one and only one copy of each original file. Therefore, we only need to scan the folders/files on one datanode (server 0004, about 5+ terabytes). Based on the modification date and the timestamp inside each file, we copy the files into new folders on a backup server/drive. Luckily, timestamp information is available in the original files, e.g. 2020-03-02T09:25. I round the time to the nearest five minutes and use one parent folder per day (the resulting folder structure was shown in the original post).
The code to scan the files on the datanode and copy them into the new five-minute folders is written in PySpark, and it takes about 2 days to run the whole operation (I leave the code running in the evening).
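The PySpark code itself is not included in the post; purely to illustrate the rounding step, here is a rough sketch in Java (the <yyyyMMdd>/<HHmm> target layout is an assumption, since the original folder listings are not reproduced here):

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class BucketByFiveMinutes {
    // Round a timestamp such as "2020-03-02T09:23" to the nearest five minutes
    // and build a per-day / per-five-minute target folder name.
    static String targetFolder(String timestamp) {
        LocalDateTime t = LocalDateTime.parse(timestamp);
        long rounded = Math.round(t.getMinute() / 5.0) * 5;        // 23 -> 25, 58 -> 60
        LocalDateTime bucket = t.withMinute(0).withSecond(0).plusMinutes(rounded);
        return bucket.format(DateTimeFormatter.ofPattern("yyyyMMdd/HHmm"));
    }

    public static void main(String[] args) {
        System.out.println(targetFolder("2020-03-02T09:23"));      // prints 20200302/0925
    }
}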
Then I can update the folders on HDFS for each day. The newly created folders have the same structure and naming convention as the corresponding folders on HDFS (in the copy step, I rename each copied file to match the convention on HDFS).
In the final step, I write Java code to perform the operations on HDFS. After some testing, I am able to update the data of each day on HDFS: the code deletes, e.g., the data under the folder ~/year=2019/month=1/day=2/ on HDFS and then uploads all the folders/files under the newly created folder ~/20190102/ to ~/year=2019/month=1/day=2/ on HDFS. I do this operation for each day. The corrupt blocks then disappear, and the right files are uploaded to the correct paths on HDFS.
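A minimal sketch of what that per-day Java step could look like (this is not the author's actual code; paths are illustrative, based on the ones mentioned above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RestoreOneDay {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path hdfsDay  = new Path("/wimp/contract-snapshot/year=2019/month=1/day=2");  // illustrative HDFS path
        Path localDay = new Path("file:///backup/20190102");  // folder rebuilt from the datanode copy

        // Remove the day's corrupt data on HDFS, then re-upload the repaired files
        fs.delete(hdfsDay, true);                 // recursive delete
        fs.copyFromLocalFile(localDay, hdfsDay);  // local directory -> HDFS
        fs.close();
    }
}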
According to my research, it may also be possible to recover the situation by using the fsimage file on Hadoop (which reflects the state before the power outage), but this risks corrupting the blocks on HDFS further. Therefore, I decided to use the described approach: delete the corrupt blocks while still keeping the original files, and upload them back onto HDFS.
If anyone has a better or more efficient solution, please share!

Hive Tables in multiple nodes - Processing

I have a conceptual doubt about Hive. I know that Hive is a data warehouse tool that runs on top of Hadoop, and that Hadoop has a distributed file system, HDFS.
Suppose I have one master and three slaves. Now I have created a table employees in HiveQL. The table is so huge that it can't be stored on one machine, so it must be spread across all four machines. How can I load such data? Should it be done manually? Or can I just type "LOAD DATA ..." on the master and it will automatically get distributed among all the machines?
Hive uses HDFS as its warehouse to store the data, so the HDFS storage concepts apply here.
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Please refer to the HDFS architecture for more detail.

In Hadoop, will files be copied to master nodes or slave nodes?

Should we copyFromLocal/put files to HDFS before running a MapReduce job? When I ran the MapReduce example, I was taught to format HDFS on the master node and copyFromLocal files to that HDFS space on the master.
Then why do some tutorials say that the master node just sends metadata to the client, and that the laptop (client) copies the file blocks to the data nodes, not to the master? E.g. http://www.youtube.com/watch?v=ziqx2hJY8Hg at 25:50. My understanding based on this tutorial is that the file (split into blocks) will be copied to the slave nodes, so we do not need to copyFromLocal/put files to the master nodes. I was so confused. Can anybody explain where the files will be copied/replicated to?
Blocks will not be copied to the master node.
The master (Namenode) sends metadata to the client, containing the datanode locations where the client should place each block.
No actual block data is transferred to the NameNode.
I found this comic to be a good hdfs explanation
The master node (Namenode) in hadoop just deals with the Metadata (Datanode<->data information). It does not deal with the actual files. The actual files are instead stored only in the datanodes.

HDFS vs LFS - How Hadoop Dist. File System is built over local file system?

From various blogs I have read, I understand that HDFS is another layer that exists over the local filesystem of a computer.
I have also installed Hadoop, but I have trouble understanding the existence of the HDFS layer over the local file system.
Here is my question:
Consider that I am installing Hadoop in pseudo-distributed mode. What happens under the hood during this installation? I added a tmp.dir parameter in the configuration files. Is it the single folder that the namenode daemon talks to when it attempts to access the datanode?
OK, let me give it a try. When you configure Hadoop, it lays down a virtual FS on top of your local FS, which is HDFS. HDFS stores data as blocks (similar to the local FS, but with much bigger blocks in comparison) in a replicated fashion. But the HDFS directory tree, i.e. the filesystem namespace, looks just like that of a local FS. When you start writing data into HDFS, it eventually gets written onto the local FS, but you can't see it there directly.
The temp directory actually serves 3 purposes:
1- The directory where the namenode stores its metadata, with default value ${hadoop.tmp.dir}/dfs/name; it can be specified explicitly by dfs.name.dir. If you specify dfs.name.dir, the namenode metadata will be stored in the directory given as the value of this property.
2- The directory where HDFS data blocks are stored, with default value ${hadoop.tmp.dir}/dfs/data; it can be specified explicitly by dfs.data.dir. If you specify dfs.data.dir, the HDFS data will be stored in the directory given as the value of this property.
3- The directory where the secondary namenode stores its checkpoints, with default value ${hadoop.tmp.dir}/dfs/namesecondary; it can be specified explicitly by fs.checkpoint.dir.
So, it's always better to use some proper dedicated location as the values for these properties for a cleaner setup.
When access to a particular block of data is required, the metadata stored in the dfs.name.dir directory is consulted and the location of that block on a particular datanode is returned to the client (the block itself lives somewhere under the dfs.data.dir directory on that datanode's local FS). The client then reads the data directly from there (the same holds for writes as well).
One important point to note here is that HDFS is not a physical FS. It is rather a virtual abstraction on top of your local FS which can't be browsed simply like the local FS. You need to use the HDFS shell or the HDFS webUI or the available APIs to do that.
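For instance, here is a minimal sketch of browsing an HDFS directory through the Java API rather than the local FS (the directory path is a placeholder and hadoop-client is assumed on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDir {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Rough equivalent of: hdfs dfs -ls /user/test
        for (FileStatus status : fs.listStatus(new Path("/user/test"))) {
            System.out.println((status.isDirectory() ? "d " : "- ")
                    + status.getLen() + "\t" + status.getPath());
        }
        fs.close();
    }
}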
HTH
When you install Hadoop in pseudo-distributed mode, all the HDFS daemons (namenode, datanode and secondary namenode) run on the same machine. The temp dir which you configure is where the datanode stores the data. So when you look at it from the HDFS point of view, your data is still stored in blocks and read in blocks, which are much bigger than (and an aggregation of) multiple file-system-level blocks.

Hadoop. About file creation in HDFS

I read that whenever a client needs to create a file in HDFS (the Hadoop Distributed File System), the client's file must be 64 MB. Is that true? How can we load a file into HDFS which is less than 64 MB? Can we load a file which will be used just as a reference for processing another file, and which has to be available to all datanodes?
I read that whenever a client needs to create a file in HDFS (the Hadoop Distributed File System), the client's file must be 64 MB.
Could you provide a reference for that? A file of any size can be put into HDFS. The file is split into 64 MB (default) blocks and saved on different data nodes in the cluster.
Can we load a file which will be just for reference for processing other file and it has to be available to all datanodes?
It doesn't matter if a block or file is on a particular data node or on all the data nodes. Data nodes can fetch data from each other as long as they are part of a cluster.
Think of HDFS as a very big hard drive and write the code for reading/writing data from HDFS. Hadoop will take care of the internals like 'reading from' or 'writing to' multiple data nodes if required.
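As a hedged sketch of that "big hard drive" view (the path is a placeholder and hadoop-client is assumed on the classpath):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/test/notes.txt");   // placeholder path

        // Write a file far smaller than 64 MB; HDFS handles block placement and replication
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Read it back; which datanode actually serves the bytes is invisible to the client
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}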
I would suggest reading the following 1 2 3 on HDFS, especially the 2nd one, which is a comic on HDFS.
