How to copy HDFS files from one cluster to another while preserving the modification time - hadoop

I have to move some HDFS files from my production cluster to my dev cluster. After moving them, I need to test some operations on the HDFS files that depend on the file modification time, so I need files with different dates in dev.
I tried DistCp, but the modification time gets updated to the current time. I also checked DistCp with many of the parameters I found in the DistCp Version 2 guide.
Is there any other way to get the files without changing the modification time? Or can I change the modification time manually after getting the files into HDFS?
Thanks in advance.

Use the -pt flag with the hadoop distcp command. This preserves the timestamp (modification time) of the files that are copied.
hadoop distcp -pt hdfs://src_cluster/file hdfs://dest_cluster/file
Tested with Hadoop-2.7.3
Refer to the latest DistCp Guide.
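If you ever need to set the modification time manually after the copy (the second part of the question), the HDFS Java API exposes FileSystem.setTimes. Below is a minimal, hypothetical sketch; the hdfs://dest_cluster URI, the /file path, and the epoch timestamp are assumptions you would replace with your own values.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RestoreMtime {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the destination cluster (hypothetical URI).
        FileSystem fs = FileSystem.get(URI.create("hdfs://dest_cluster"), conf);

        Path file = new Path("/file");       // file that was copied by distcp
        long mtime = 1451606400000L;         // desired modification time, epoch millis (assumed value)
        long atime = -1;                     // -1 leaves the access time unchanged

        // setTimes(path, mtime, atime) rewrites the file's timestamps in the NameNode metadata.
        fs.setTimes(file, mtime, atime);
        fs.close();
    }
}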

Related

Uploading file in HDFS cluster

I am learning Hadoop, and so far I have configured a 3-node cluster:
127.0.0.1 localhost
10.0.1.1 hadoop-namenode
10.0.1.2 hadoop-datanode-2
10.0.1.3 hadoop-datanode-3
My Hadoop NameNode directory looks like this:
hadoop
bin
data-> ./namenode ./datanode
etc
logs
sbin
--
--
As I have learned, when we upload a large file to the cluster, it divides the file into blocks. I want to upload a 1 GB file to my cluster and see how it is stored on the datanodes.
Can anyone help me with the commands to upload a file and see where its blocks are stored?
First, check whether the Hadoop tools are on your PATH; if not, I recommend adding them.
One possible way of uploading a file to HDFS: hadoop fs -put /path/to/localfile /path/in/hdfs
I would suggest you read the documentation and get familiar with the high-level commands first, as it will save you time.
Hadoop Documentation
Start with "dfs" command, as this one of the most often used commands

hdfs or hadoop command to sync files or folders between local and HDFS

I have local files that get added daily, so I want to sync these newly added files to HDFS.
I tried the command below, but it does a complete copy each time; I want a command that copies only the newly added files:
$ hdfs dfs -cp /home/user/files/* /data/files/*
You can use hsync.
https://github.com/alexholmes/hsync
It's Alex's custom package, perhaps useful on a dev box, but it could be hard to deploy in a production environment. I am looking for a similar solution, but for now this seems to be the closest. Another option is to write your own script that compares source/target file times and then overwrites only the newer files.
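A minimal Java sketch of that "compare timestamps yourself" idea, using the Hadoop FileSystem API; the source and destination directories are assumptions, and it only handles a flat directory of files:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SyncNewerFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem localFs = FileSystem.getLocal(conf);
        FileSystem hdfs = FileSystem.get(conf);

        Path srcDir = new Path("/home/user/files");  // hypothetical local source
        Path dstDir = new Path("/data/files");       // hypothetical HDFS destination

        for (FileStatus src : localFs.listStatus(srcDir)) {
            if (src.isDirectory()) {
                continue;  // this sketch skips subdirectories
            }
            Path dst = new Path(dstDir, src.getPath().getName());
            // Copy if the file is missing in HDFS or the local copy is newer.
            boolean needsCopy = !hdfs.exists(dst)
                    || hdfs.getFileStatus(dst).getModificationTime() < src.getModificationTime();
            if (needsCopy) {
                // delSrc=false keeps the local file, overwrite=true replaces stale copies.
                hdfs.copyFromLocalFile(false, true, src.getPath(), dst);
                System.out.println("copied " + src.getPath() + " -> " + dst);
            }
        }
    }
}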

Hortonworks VM - Hadoop batch upload?

Is there a way to batch upload files to Hadoop under a Hortonworks VM running CentOS? I see I can use the Ambari - Sandbox's HDFS Files tool, but that only allows uploading one-by-one. Apparently you could use Redgate's HDFS Explorer in the past, but it's no longer available. Hadoop is made to process big data, but it's absurd having to upload all files one-by-one...
Thank you!
Of course you can use the * wildcard with copyFromLocal, e.g.:
hdfs dfs -copyFromLocal input/* /tmp/input

How do I back up HBase using distcp?

I would like to do a backup of the HBase files using distcp, then point HBase to the newly copied files and work with the stored tables.
I realize that there are tools out there which are recommended for this job. However, I'd like to know what I need to do after I've copied the files to get HBase to recognize them.
For example, I'd like to start the HBase shell and scan the stored tables from the newly copied files.
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. So if you want to back up your clusterA to clusterB, you'll have to:
do the copy from clusterA to clusterB using distcp
start an HBase master and some RegionServers
enjoy the command line interface on clusterB
This means having 2 clusters, each with HDFS and HBase.
But if you want to back up your data within the same cluster, it is simpler:
do the intra copy in a different folder: hadoop distcp hdfs://nn:8020/hbase hdfs://nn:8020/backuptest
stop all the HBase processes and change the property hbase.rootdir from "hbase" to "backuptest"
restart all the processes

Copying directories in HDFS using the Java API

How do I copy a directory in HDFS to another directory in HDFS?
I found the copyFromLocalFile functions that copy from the local FS to HDFS, but I want both the source and the destination to be in HDFS.
Thanks
Use the distcp command.
The canonical use case for distcp is transferring data between two HDFS clusters. If the clusters are running identical versions of Hadoop, the hdfs scheme is appropriate:
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
If you want to do it through Java code, see the class org.apache.hadoop.tools.DistCp and call it appropriately.
You can try FileUtil.copy:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileUtil.html
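A minimal sketch of the FileUtil.copy approach, assuming hypothetical source and destination directories; both FileSystem arguments point at the same cluster here, but they could refer to different ones:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyHdfsDirectory {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path src = new Path("/user/alice/input");   // hypothetical source directory
        Path dst = new Path("/user/alice/backup");  // hypothetical destination directory

        // deleteSource=false keeps the original; FileUtil.copy recurses into directories.
        boolean ok = FileUtil.copy(fs, src, fs, dst, false, conf);
        System.out.println("copy " + (ok ? "succeeded" : "failed"));
        fs.close();
    }
}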
