How do I incrementally migrate HDFS data using the DistCp tool in Alibaba - alibaba-cloud

I am trying to migrate HDFS data using the DistCp tool in Alibaba E-MapReduce. I understand how to do a full data migration.
Command:
hadoop distcp -pbugpcax -m 1000 -bandwidth 30 hdfs://clusterIP:8020/user/hive/warehouse /user/hive/warehouse
What parameters do I need to add to achieve incremental synchronization in the above code?

For incremental data synchronization, add the -update and -delete flags:
hadoop distcp -pbugpcax -m 1000 -bandwidth 30 -update -delete hdfs://oldclusterip:8020/user/hive/warehouse /user/hive/warehouse
A little more info on both parameters:
-update compares the file size and checksum of each source and target file. If they differ, or the file is missing on the target, the source file is copied over the target. If data is still being written while the old and new clusters are synchronized, re-running with -update picks up the new data incrementally.
-delete removes files from the new cluster that no longer exist in the old cluster. It only takes effect together with -update or -overwrite.
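As a rough sketch, the incremental pass can be wrapped in a small script that is safe to re-run. The `SRC_NN` and `DATA_PATH` variables and the `run` and `incremental_sync` helpers are hypothetical names introduced here for illustration; the flags are the ones from the command above. By default the script only prints the command, so it can be checked before it touches either cluster.

```shell
#!/bin/sh
set -eu

# Hypothetical defaults: override SRC_NN and DATA_PATH for your clusters.
SRC_NN="${SRC_NN:-hdfs://oldclusterip:8020}"
DATA_PATH="${DATA_PATH:-/user/hive/warehouse}"

# Print the command when DRY_RUN=1 (the default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# One incremental pass: -update copies new or changed files,
# -delete removes target files that no longer exist at the source.
incremental_sync() {
  run hadoop distcp -pbugpcax -m 1000 -bandwidth 30 -update -delete \
    "${SRC_NN}${DATA_PATH}" "${DATA_PATH}"
}

incremental_sync
```

Run it with `DRY_RUN=0` from a node of the new cluster to perform the copy; repeating the run later should transfer only what changed in between.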
I hope this helps!

Related

How to copy a file from a GCS bucket in Dataproc to HDFS using google cloud?

I had uploaded the data file to the GCS bucket of my project in Dataproc. Now I want to copy that file to HDFS. How can I do that?
For a single "small" file
You can copy a single file from Google Cloud Storage (GCS) to HDFS using the hdfs copy command. Note that you need to run this from a node within the cluster:
hdfs dfs -cp gs://<bucket>/<object> <hdfs path>
This works because hdfs://<master node> is the default filesystem. You can explicitly specify the scheme and NameNode if desired:
hdfs dfs -cp gs://<bucket>/<object> hdfs://<master node>/<hdfs path>
Note that GCS objects use the gs: scheme. Paths should appear the same as they do when you use gsutil.
For a "large" file or large directory of files
When you use hdfs dfs, data is piped through your local machine. If you have a large dataset to copy, you will likely want to do this in parallel on the cluster using DistCp:
hadoop distcp gs://<bucket>/<directory> <HDFS target directory>
Consult the DistCp documentation for details.
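The two approaches above can be sketched as one helper that picks the tool by case. The function names `run` and `copy_from_gcs`, the mode names, and the bucket and path values are all hypothetical and only illustrate the shape of the commands; by default nothing is executed, the commands are just printed.

```shell
#!/bin/sh
set -eu

# Print the command when DRY_RUN=1 (the default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# mode "single": hdfs dfs -cp, data flows through the node running the command.
# mode "bulk":   DistCp, the copy runs in parallel on the cluster.
copy_from_gcs() {
  case "$1" in
    single) run hdfs dfs -cp "$2" "$3" ;;
    bulk)   run hadoop distcp "$2" "$3" ;;
    *)      echo "usage: copy_from_gcs single|bulk <gs:// source> <HDFS target>" >&2
            return 2 ;;
  esac
}

# Hypothetical bucket, object, and target paths.
copy_from_gcs single gs://example-bucket/data/file.csv /data/file.csv
copy_from_gcs bulk gs://example-bucket/data /data
```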
Consider leaving data on GCS
Finally, consider leaving your data on GCS. Because the GCS connector implements Hadoop's distributed filesystem interface, it can be used as a drop-in replacement for HDFS in most cases. Notable exceptions are when you rely on (most) atomic file/directory operations or want to use a latency-sensitive application like HBase. The Dataproc HDFS migration guide gives a good overview of data migration.

How to copy HDFS files from one cluster to another cluster by preserving the modification time

I have to move some HDFS files from my production cluster to dev cluster. I have to test some operations on HDFS files after moving to dev cluster based on the file modification time. Need files with different dates to test it in dev.
I tried DistCp, but it sets the modification time to the current time. I tried many of the parameters listed in the DistCp version 2 guide.
Is there another way to copy the files without changing the modification time? Or can I change the modification time manually after the files are in HDFS?
Thanks in advance.
Use the -pt flag with the hadoop distcp command. It preserves the timestamp (modification time) of each copied file.
hadoop distcp -pt hdfs://src_cluster/file hdfs://dest_cluster/file
Tested with Hadoop-2.7.3
Refer to the latest DistCp guide.
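To check that the timestamp really survived the copy, one option is to compare what `hdfs dfs -stat` reports on both sides. This is only a sketch: `mtime_of` and `same_mtime` are hypothetical helper names, and it assumes the node running it can reach both clusters. The `%Y` format prints the modification time in milliseconds since the epoch.

```shell
#!/bin/sh
set -eu

# Modification time in milliseconds since the epoch, as reported by HDFS.
mtime_of() {
  hdfs dfs -stat '%Y' "$1"
}

# Succeeds only if both paths report the same, non-empty modification time.
same_mtime() {
  src_mtime="$(mtime_of "$1")" || return 1
  dst_mtime="$(mtime_of "$2")" || return 1
  [ -n "$src_mtime" ] && [ "$src_mtime" = "$dst_mtime" ]
}

# Usage, with the placeholder paths from the answer above:
#   same_mtime hdfs://src_cluster/file hdfs://dest_cluster/file && echo "mtime preserved"
```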

Copy Solr HDFS Data to another Cluster

I have a solr cloud (v 4.10) installation that sits on top of Cloudera (CDH 5.4.2) HDFS with 3 solr instances each hosting a shard of each core.
I am looking for a way to incrementally copy the solr data from our production cluster to our development cluster. There are 3 cores but I am only interested in copying one of them.
I have tried to use the Solr replication - backup and restore but that doesn't seem to load anything into the dev cluster.
http://host:8983/solr/core/replication?command=backup&location=/solr_transfer&name=core-name
http://host:8983/solr/core/replication?command=restore&location=/solr_transfer&name=core-name
I also tried to snapshot the /solr dir in the prod cluster's HDFS and use hadoop distcp to copy the files, but the Solr indexer deletes some of the files, so the distcp job fails.
hadoop distcp hftp://prod:50070/solr/* hdfs://dev:8020/solr/
Can anyone help me here?
Follow the steps below to create a snapshot of the Solr HDFS folder and move it to another cluster.
1. Allow snapshots on the collection directory
sudo -u hdfs hdfs dfsadmin -allowSnapshot /user/solr/SolrCollectionName
2. Create a snapshot with a specific name
sudo -u hdfs hdfs dfs -createSnapshot /user/solr/SolrCollectionName/ SnapshotName
3. List the snapshot directory
hdfs dfs -ls /user/solr/SolrCollectionName/.snapshot
4. Copy the snapshot to the other cluster
sudo -u solr hadoop distcp hdfs://NNIP1:8020/user/solr/SolrCollectionName/.snapshot/SnapshotName hdfs://NNIP2:8020/user/solr
5. Restore the snapshot
sudo -u solr hadoop fs -cp /user/solr/SnapshotName/* /user/solr/SolrCollectionName/
After a lot of trying this is the solution we worked out.
- Initialise solr in the second environment with all the collections in the same way as the primary.
- Take a snapshot of HDFS
- Use hadoop hdfs -cp to copy the data up to the checkpoint
After the first run the copy job will be quick as you are only copying the increments.
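One possible way to copy only the increments between two snapshots is DistCp's `-diff` option, which is used together with `-update`. This is a different technique from the copy described above, not necessarily what either answer ran, and it may not be available in older Hadoop builds. The sketch assumes the old snapshot already exists under the same name on both clusters, the target directory is snapshottable, and the target has not been modified since that snapshot. `SRC`, `DST`, `run`, `sync_snapshot_diff`, and the snapshot names `s1` and `s2` are hypothetical.

```shell
#!/bin/sh
set -eu

# Hypothetical values, reusing the placeholder hosts from the steps above.
SRC="${SRC:-hdfs://NNIP1:8020/user/solr/SolrCollectionName}"
DST="${DST:-hdfs://NNIP2:8020/user/solr/SolrCollectionName}"

# Print the command when DRY_RUN=1 (the default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Copy only what changed on the source between snapshots $1 (old) and $2 (new).
sync_snapshot_diff() {
  # Freeze the current source state under the new snapshot name.
  run hdfs dfs -createSnapshot "$SRC" "$2"
  # Apply the difference between the two source snapshots to the target.
  run hadoop distcp -update -diff "$1" "$2" "$SRC" "$DST"
  # Snapshot the target so the next run can use $2 as its starting point.
  run hdfs dfs -createSnapshot "$DST" "$2"
}

sync_snapshot_diff s1 s2
```

Copying from a snapshot rather than the live directory avoids the failure described in the question, where the indexer deletes files while DistCp is reading them.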

How do I back up HBase using distcp?

I would like to back up HBase files using distcp, then point HBase to the newly copied files and work with the stored tables.
I realize that there are recommended tools for this job. However, I'd like to know what I need to do after copying the files to get HBase to recognize them.
For example, I'd like to start the HBase shell and scan the stored tables from the newly copied files.
DistCp (distributed copy) is a tool for large inter-cluster and intra-cluster copies. To back up clusterA to clusterB:
- copy the data from clusterA to clusterB using distcp
- start an HBase master and some RegionServers on clusterB
- use the HBase shell on clusterB
This requires two clusters, each with HDFS and HBase.
If you want to back up your data within the same cluster, it is simpler:
- copy into a different folder: hadoop distcp hdfs://nn:8020/hbase hdfs://nn:8020/backuptest
- stop all the HBase processes and change the hbase.rootdir property from "hbase" to "backuptest"
- restart all the processes
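The same-cluster steps might look roughly like this as a script. It is a sketch under several assumptions: `run` and `backup_in_cluster` are hypothetical helper names, `stop-hbase.sh` (from the HBase bin directory) is assumed to be on the PATH, and managed distributions may stop and start HBase through their own service manager instead. The edit to hbase-site.xml is left as a manual step, so the script stops after printing a reminder.

```shell
#!/bin/sh
set -eu

# Print the command when DRY_RUN=1 (the default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# $1 = NameNode URI, $2 = backup folder name (both taken from the steps above).
backup_in_cluster() {
  # Copy the HBase root directory to the backup folder on the same cluster.
  run hadoop distcp "$1/hbase" "$1/$2"
  # Stop HBase before repointing it at the copy.
  run stop-hbase.sh
  # The config change and restart are left to the operator.
  echo "Next: set hbase.rootdir to $1/$2 in hbase-site.xml, then run start-hbase.sh"
}

backup_in_cluster hdfs://nn:8020 backuptest
```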

Copying directories in HDFS using the JAVA API

How do I copy a directory in HDFS to another directory in HDFS?
I found the copyFromLocalFile functions that copy from the local FS to HDFS, but I want both of the source/destination to be in HDFS.
Thanks
Use the distcp command.
The canonical use case for distcp is transferring data between two HDFS clusters. If the clusters are running identical versions of Hadoop, the hdfs scheme is appropriate:
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
If you want to do it through Java code, see class org.apache.hadoop.tools.DistCp and call it appropriately.
You can try FileUtil.copy
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileUtil.html
