Temporary storage usage between distcp and s3distcp - hadoop

I read the documentation for Amazon's S3DistCp - it says,
"During a copy operation, S3DistCp stages a temporary copy of the
output in HDFS on the cluster. There must be sufficient free space in
HDFS to stage the data, otherwise the copy operation fails. In
addition, if S3DistCp fails, it does not clean the temporary HDFS
directory, therefore you must manually purge the temporary files. For
example, if you copy 500 GB of data from HDFS to S3, S3DistCp copies
the entire 500 GB into a temporary directory in HDFS, then uploads the
data to Amazon S3 from the temporary directory".
This is not insignificant, especially if you have a large HDFS cluster. Does anybody know whether the regular Hadoop DistCp has the same behaviour of staging the files in a temporary folder before copying them?

DistCp does not stage the data in a temporary HDFS folder. It runs a MapReduce job that copies the files, for inter/intra-cluster copies and for HDFS to S3 alike. AFAIK a DistCp failure does not undo the files that were already copied.
If 500 GB needs to be copied in total and DistCp fails after 200 GB has been copied, those 200 GB remain in S3. When you rerun the DistCp job, it skips the files that already exist at the destination.
For more information about the commands, see the DistCp guide.
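A minimal sketch of the rerun described above, assuming a hypothetical source directory and bucket name (neither comes from the question) and an S3A-style destination URI. The `run` helper is also an assumption for illustration: it only prints the command unless `DRY_RUN=0` is set, so the sketch can be read and checked without a cluster. `-update` is DistCp's flag for copying only files that are missing or different at the target; how differences are detected between HDFS and S3 can depend on the Hadoop version.

```shell
# Print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

SRC="hdfs://nn:8020/data/events"     # hypothetical source directory
DST="s3a://example-bucket/events"    # hypothetical destination bucket

# Same command for the first attempt and for the rerun after a failure:
# files already present and unchanged at the destination are not copied again.
run hadoop distcp -update "$SRC" "$DST"
```

Run it with `DRY_RUN=0` on a node that has the Hadoop client and S3 credentials configured to perform the copy.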

Related

How to copy a file from a GCS bucket in Dataproc to HDFS using google cloud?

I had uploaded the data file to the GCS bucket of my project in Dataproc. Now I want to copy that file to HDFS. How can I do that?
For a single "small" file
You can copy a single file from Google Cloud Storage (GCS) to HDFS using the hdfs copy command. Note that you need to run this from a node within the cluster:
hdfs dfs -cp gs://<bucket>/<object> <hdfs path>
This works because hdfs://<master node> is the default filesystem. You can explicitly specify the scheme and NameNode if desired:
hdfs dfs -cp gs://<bucket>/<object> hdfs://<master node>/<hdfs path>
Note that GCS objects use the gs: scheme. Paths should appear the same as they do when you use gsutil.
For a "large" file or large directory of files
When you use hdfs dfs, data is piped through your local machine. If you have a large dataset to copy, you will likely want to do this in parallel on the cluster using DistCp:
hadoop distcp gs://<bucket>/<directory> <HDFS target directory>
Consult the DistCp documentation for details.
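As a rough sketch of the parallel copy, the commands below cap the number of map tasks with DistCp's `-m` option and then print the total size on both sides for comparison. The bucket, directory names and the value 20 are placeholders, and the `run` helper is an illustrative assumption that only prints each command unless `DRY_RUN=0` is set.

```shell
# Print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

SRC="gs://example-bucket/input"   # hypothetical GCS directory
DST="/data/input"                 # hypothetical HDFS target directory

# Copy in parallel on the cluster, using at most 20 simultaneous map tasks.
run hadoop distcp -m 20 "$SRC" "$DST"

# Compare total sizes (in bytes) on both sides after the copy.
run gsutil du -s "$SRC"
run hdfs dfs -du -s "$DST"
```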
Consider leaving data on GCS
Finally, consider leaving your data on GCS. Because the GCS connector implements Hadoop's distributed filesystem interface, it can be used as a drop-in replacement for HDFS in most cases. Notable exceptions are when you rely on (most) atomic file/directory operations or want to use a latency-sensitive application like HBase. The Dataproc HDFS migration guide gives a good overview of data migration.

Why DATA is COPIED and not MOVED while loading data from local filesystem Hive hadoop

When we use the following command:
Load data local inpath "mypath"
why is the data copied from the local filesystem into HDFS rather than moved?
Since you are moving data between two different file systems (local + HDFS), this cannot be a metadata-only operation, as it is in a non-local load.
The data itself has to be copied.
Theoretically this command could also initiate a deletion command of the source file, but what for?

how to save data in HDFS with spark?

I want to use Spark Streaming to retrieve data from Kafka. Now, I want to save my data in a remote HDFS. I know that I have to use the function saveAsTextFile. However, I don't know precisely how to specify the path.
Is it correct if I write this:
myDStream.foreachRDD(frm->{
frm.saveAsTextFile("hdfs://ip_addr:9000//home/hadoop/datanode/myNewFolder");
});
where ip_addr is the ip address of my hdfs remote server.
/home/hadoop/datanode/ is the DataNode HDFS directory created when I installed hadoop (I don't know if I have to specify this directory). And,
myNewFolder is the folder where I want to save my data.
Thanks in advance.
Yassir
The path has to be a directory in HDFS.
For example, if you want to save the files inside a folder named myNewFolder under the root / path in HDFS, the path to use would be hdfs://namenode_ip:port/myNewFolder/
When the Spark job runs, this directory myNewFolder will be created.
The DataNode data directory, which is set by dfs.datanode.data.dir in hdfs-site.xml, is where the blocks of the files you store in HDFS are kept on local disk. It should not be referenced as an HDFS directory path.
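One way to build such a path is sketched below. The `hdfs_url` helper is hypothetical; it joins a NameNode URI and an HDFS path while dropping one duplicate slash at the join. When the `hdfs` CLI is available, `hdfs getconf -confKey fs.defaultFS` prints the default filesystem URI configured on that machine, which matches the remote cluster only if the command is run on a node configured for it. Otherwise the sketch falls back to the placeholder from the question.

```shell
# Join a NameNode URI and an HDFS path, removing one trailing slash from the
# URI and one leading slash from the path so the result has a single "/".
hdfs_url() {
  printf '%s/%s\n' "${1%/}" "${2#/}"
}

# Use the configured default filesystem if the hdfs CLI is installed,
# otherwise fall back to a placeholder address (assumption).
if command -v hdfs >/dev/null 2>&1; then
  NAMENODE_URI="$(hdfs getconf -confKey fs.defaultFS)"
else
  NAMENODE_URI="hdfs://namenode_ip:9000"
fi

# The printed URL is what would be passed to saveAsTextFile.
hdfs_url "$NAMENODE_URI" "/myNewFolder"
```

The local DataNode directory (/home/hadoop/datanode in the question) does not appear anywhere in the resulting URL.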

how do i backup hbase using distcp?

I would like to do a back up of hbase files using distcp. Then point hbase to the newly copied files and work with the stored tables.
I realize that there are tools out there which are recommended for this job. However, I'd like to know what I need to do after I've copied the files to get hbase to recognize the copied files.
For example, I'd like to start the HBase shell and scan the stored tables from the newly copied files.
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. So if you want to backup your clusterA to clusterB, you'll have to:
do the copy from clusterA to clusterB using distcp
start an Hbase master and some RegionServers
enjoy the command line interface on clusterB
This means having two clusters, each with HDFS and HBase.
But if you want to back up your data within the same cluster, it is simpler:
do the intra copy in a different folder: hadoop distcp hdfs://nn:8020/hbase hdfs://nn:8020/backuptest
stop all the Hbase processes and change the property hbase.rootdir from "hbase" to "backuptest"
restart all the processes
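The same-cluster steps can be sketched as a checklist script. It assumes HBase's standard `stop-hbase.sh` and `start-hbase.sh` scripts are on the PATH and reuses the nn:8020 address from the example above. The `run` helper is an illustrative assumption that only prints each command unless `DRY_RUN=0` is set. Editing hbase-site.xml stays a manual step, and a copy taken while HBase is still writing may not be consistent.

```shell
# Print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

NN="hdfs://nn:8020"   # NameNode address taken from the example above

# 1. Copy the HBase root directory to a backup folder in the same cluster.
run hadoop distcp "$NN/hbase" "$NN/backuptest"

# 2. Stop all HBase processes.
run stop-hbase.sh

# 3. Manual step: point hbase.rootdir at the copied directory.
echo "# edit hbase-site.xml: set hbase.rootdir to $NN/backuptest"

# 4. Restart HBase, then check the tables from the HBase shell.
run start-hbase.sh
```

In practice the steps are better run one at a time, since step 3 has to be completed by hand before the restart.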

Copying directories in HDFS using the JAVA API

How do I copy a directory in HDFS to another directory in HDFS?
I found the copyFromLocalFile functions that copy from the local FS to HDFS, but I want both of the source/destination to be in HDFS.
Thanks
Use the distcp command.
The canonical use case for distcp is for transferring data between two HDFS clusters.
If the clusters are running identical versions of Hadoop, the hdfs scheme is
appropriate:
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
If you want to do it through Java code, see class org.apache.hadoop.tools.DistCp and call it appropriately.
You can try FileUtil.copy
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileUtil.html

Resources