I am working on a project that has 1 TB of data in HBase. For backups I read about snapshots.
The HBase snapshot is on one cluster, and when I try to export it to a different cluster I get:
Caused by:
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException):
org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException:
So what other files do I need to include in my export?
And is it possible to restore the snapshot in another cluster by moving the snapshot directory from one cluster to the other via WinSCP?
A CorruptedSnapshotException means the snapshot info read from the filesystem is not valid, so please check whether your
export command was right.
Example:
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot snapshot30072017 -copy-to hdfs://127.0.0.1:9000/hbase -mappers 8 -bandwidth 100
The above command runs eight map tasks to export the named snapshot to the other cluster, with bandwidth limited to 100 MB/s.
Please read this issue tracker.
Note:
The org.apache.hadoop.hbase.snapshot.ExportSnapshot tool copies all the data related to a snapshot (HFiles, logs, and snapshot metadata) to another cluster.
Snapshot details can be found under this HDFS location:
/apps/hbase/data/.hbase-snapshot/ (Cloudera VM path). Copy those files to the other cluster
and restore in the HBase shell using restore_snapshot 'snapshot_name'.
Please read this HBase snapshot documentation.
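To tie the export and restore steps together, here is a minimal dry-run sketch. It only prints the commands so you can review them before running anything. The function names, the table name, and the default values are assumptions for illustration; the ExportSnapshot options are the ones used in the example above.

```shell
#!/bin/sh
# Dry-run sketch: prints commands instead of executing them.
# Function names and defaults are hypothetical, chosen for this example.

# Print the ExportSnapshot command to run on the SOURCE cluster.
# $1 = snapshot name, $2 = destination HBase root URI,
# $3 = number of mappers (default 8), $4 = bandwidth in MB/s (default 100)
export_snapshot_cmd() {
  printf 'hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot %s -copy-to %s -mappers %s -bandwidth %s\n' \
    "$1" "$2" "${3:-8}" "${4:-100}"
}

# Print the HBase shell statements to run on the DESTINATION cluster.
# restore_snapshot needs the table to be disabled first.
# $1 = snapshot name, $2 = table name
restore_shell_cmds() {
  printf "disable '%s'\nrestore_snapshot '%s'\nenable '%s'\n" "$2" "$1" "$2"
}
```

For example, `export_snapshot_cmd snapshot30072017 hdfs://127.0.0.1:9000/hbase` prints the export command shown earlier, and the output of `restore_shell_cmds snapshot30072017 mytable` can be pasted into `hbase shell` on the destination. If the table does not exist there yet, cloning the snapshot into a new table is probably the better fit than restoring.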
I have a scenario in which I have to pull data from a Hadoop cluster into AWS.
I understand that running distcp on the Hadoop cluster is a way to copy the data into S3, but I have a restriction here: I won't be able to run any commands on the cluster. I need to pull the files from the Hadoop cluster into AWS. The data is available in Hive.
I thought of the options below:
1) Sqoop the data from Hive? Is that possible?
2) S3DistCp (running it on AWS)? If so, what configuration would be needed?
Any suggestions?
If the hadoop cluster is visible from EC2-land, you could run a distcp command there, or, if it's a specific bit of data, some hive query which uses hdfs:// as input and writes out to s3. You'll need to deal with kerberos auth though: you cannot use distcp in an un-kerberized cluster to read data from a kerberized one, though you can go the other way.
You can also run distcp locally on one or more machines, though you are then limited by the bandwidth of those individual systems. distcp works best when it schedules the uploads on the hosts which actually have the data.
Finally, if it is incremental backup you are interested in, you can use the HDFS audit log as a source of changed files; this is what incremental backup tools tend to use.
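As a rough illustration of the "run distcp from the EC2 side" idea, here is a dry-run sketch that just assembles the command. The host, port, bucket, and paths are placeholders, and the function name is made up for this example. It assumes the S3A connector (s3a:// scheme) is available to the Hadoop client you run it from; on EMR the s3:// scheme is the usual choice instead.

```shell
#!/bin/sh
# Dry-run sketch: prints the distcp command instead of running it.
# All hosts, buckets and paths below are placeholders.

# $1 = source NameNode as host:port, $2 = absolute source path in HDFS,
# $3 = S3 bucket name,               $4 = absolute destination path in the bucket
distcp_to_s3_cmd() {
  printf 'hadoop distcp hdfs://%s%s s3a://%s%s\n' "$1" "$2" "$3" "$4"
}

# Example (hypothetical names):
#   distcp_to_s3_cmd namenode.example.com:8020 /user/hive/warehouse/mydb.db/mytable my-bucket /backup/mytable
# Review the printed command, then run it on a machine that can reach both
# the source NameNode/DataNodes and S3.
```

This does not address the Kerberos point above: if the source cluster is kerberized, the client running the command still needs valid credentials for it.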
I have a requirement to refresh the QA environment from the production HAWQ database on a daily basis.
How can I move each day's delta from Production into the QA cluster?
Appreciate your help
Thanks
Veeru
Shameless self-plug - have a look at the following open PR for using Apache Falcon to orchestrate a DR batch job and see if it fits your needs.
https://github.com/apache/incubator-hawq/pull/940
Here is the synopsis of the process:
1) Run hawqsync-extract to capture known-good HDFS file sizes (protects against HDFS / catalog inconsistency if there is a failure during sync)
2) Run ETL batch (if any)
3) Run hawqsync-falcon, which performs the following steps:
   a) Stop both HAWQ masters (source and target)
   b) Archive source MASTER_DATA_DIRECTORY (MDD) tarball to HDFS
   c) Restart source HAWQ master
   d) Enable HDFS safe mode and force source checkpoint
   e) Disable source and remote HDFS safe mode
   f) Execute Apache Falcon-based distcp sync process
   g) Enable HDFS safe mode and force remote checkpoint
There is also a JIRA with the design description:
https://issues.apache.org/jira/browse/HAWQ-1078
There isn't a built-in tool to do this so you'll have to write some code. It shouldn't be too difficult to write either because HAWQ doesn't support UPDATE or DELETE. You'll only have to append new data to QA.
1) Create writable external tables in Production, one for each table, that put data in HDFS. You'll use the PXF format to write the data.
2) Create readable external tables in QA, one for each table, that read this data.
3) Day 1, you write everything to HDFS and then read everything from HDFS.
4) Day 2+, you find the max(id) from QA and remove the table's files from HDFS. Then insert into the writable external table, filtering the query so you get only records with an id larger than the max(id) from QA. Lastly, execute an insert in QA by selecting all data from the readable external table.
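The Day 2+ step can be sketched as a small helper that generates the SQL for each stage. This is only a sketch under assumptions: each table has a monotonically increasing `id` column, and the external table names `ext_w_<table>` (writable, in Production) and `ext_r_<table>` (readable, in QA) are a hypothetical naming scheme, not something HAWQ creates for you.

```shell
#!/bin/sh
# Sketch: generate the SQL statements for a daily delta load.
# ext_w_<table> / ext_r_<table> are hypothetical external table names.

# Run in QA: find the high-water mark (0 when the table is still empty).
# $1 = table name
max_id_sql() {
  printf 'SELECT COALESCE(max(id), 0) FROM %s;\n' "$1"
}

# Run in Production: write only rows newer than the QA high-water mark
# through the writable external table.
# $1 = table name, $2 = max(id) obtained from QA
prod_delta_sql() {
  printf 'INSERT INTO ext_w_%s SELECT * FROM %s WHERE id > %s;\n' "$1" "$1" "$2"
}

# Run in QA: append everything the readable external table now exposes.
# $1 = table name
qa_load_sql() {
  printf 'INSERT INTO %s SELECT * FROM ext_r_%s;\n' "$1" "$1"
}
```

Each statement would be passed to psql against the right master, for example `psql -h <qa_master> -At -c "$(max_id_sql sales)"`, with the old HDFS files for the table removed between the first and second statement as described above.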
We have an hbase-0.94 cluster with hadoop-1.0.1. We don't want any downtime for this cluster while upgrading to hbase-0.98 with hadoop-2.5.1.
I have provisioned another hbase-0.98 cluster with hadoop-2.5.1 and want to copy the hbase-0.94 tables to hbase-0.98. HBase CopyTable does not seem to work for this purpose.
Please suggest a way to perform the above task.
These are the available options; choose whichever fits.
The first option is to use the org.apache.hadoop.hbase.mapreduce.Export tool to export tables to HDFS, and then use hadoop distcp to move the data to the other cluster. Once the data is in place on the second cluster, use the org.apache.hadoop.hbase.mapreduce.Import tool to import the tables.
Please look at http://hbase.apache.org/book.html#export.
The second option is to use the CopyTable tool; please look at:
http://hbase.apache.org/book.html#copytable
Have a look at pivotal
The third option is to enable HBase snapshots, create table snapshots, and then use the ExportSnapshot tool to move them to the second cluster. Once the snapshots are on the second cluster, you can clone tables from them. Please look at: http://hbase.apache.org/book.html#ops.snapshots
HBase snapshots allow you to take a snapshot of a table without much impact on the Region Servers. Snapshot, clone, and restore operations don't involve copying data. Also, exporting the snapshot to another cluster doesn't have an impact on the Region Servers.
I have used options 1 and 3 for moving data between clusters, and in my case option 3 was the better solution.
Also, have a look at my answer posted
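For the Export / distcp / Import route, a dry-run sketch of the three commands might look like the following. Function names, paths, and URIs are placeholders. The optional `hbase.import.version` property is what the HBase reference guide describes for importing files exported from 0.94 into 0.96 or later; treat that as an assumption to verify against your versions.

```shell
#!/bin/sh
# Dry-run sketch: prints the three commands instead of running them.
# Function names, paths and URIs are placeholders for this example.

# Step 1, on the source cluster. $1 = table name, $2 = HDFS output directory
export_cmd() {
  printf 'hbase org.apache.hadoop.hbase.mapreduce.Export %s %s\n' "$1" "$2"
}

# Step 2, copy the exported files. $1 = source URI, $2 = destination URI
distcp_cmd() {
  printf 'hadoop distcp %s %s\n' "$1" "$2"
}

# Step 3, on the destination cluster.
# $1 = table name, $2 = HDFS input directory,
# $3 = optional version of the exporting cluster (for example 0.94)
import_cmd() {
  if [ -n "${3:-}" ]; then
    printf 'hbase -Dhbase.import.version=%s org.apache.hadoop.hbase.mapreduce.Import %s %s\n' "$3" "$1" "$2"
  else
    printf 'hbase org.apache.hadoop.hbase.mapreduce.Import %s %s\n' "$1" "$2"
  fi
}
```

Note that Import expects the target table to exist already on the destination cluster, so create it with the same column families before running step 3.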
Run the command below on the source cluster, and make sure you have cross-cluster authentication enabled.
/usr/bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable -Ddfs.nameservices=nameservice1,devnameservice -Ddfs.ha.namenodes.devnameservice=devnn1,devnn2 -Ddfs.namenode.rpc-address.devnameservice.devnn1=<destination_namenode01_host>:<destination_namenode01_port> -Ddfs.namenode.rpc-address.devnameservice.devnn2=<destination_namenode02_host>:<destination_namenode02_port> -Ddfs.client.failover.proxy.provider.devnameservice=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider -Dmapred.map.tasks.speculative.execution=false --peer.adr=<destination_zookeeper host>:<port>:/hbase --versions=<n> <table_name>
I want to copy data from CDH3 to CDH4 (on a different server). My CDH4 server is set up such that it cannot see the CDH3 server, so I have to push data upstream from CDH3 to CDH4 (which means I cannot run the distcp command from CDH4 to copy the data). How can I get my data over to CDH4's HDFS by running a command on the lower-version CDH3 Hadoop, or is this not possible?
Ideally, you should be able to use distcp to copy the data from one HDFS cluster to another.
hadoop distcp -p -update "hdfs://A:8020/user/foo/bar" "hdfs://B:8020/user/foo/baz"
-p preserves file status, and -update overwrites data if a file is already present but has a different size.
In practice, depending on the exact versions of Cloudera you're using, you may run into incompatibility issues such as CRC mismatch errors. In this case, you can try to use HFTP instead of HDFS, or upgrade your cluster to the latest version of CDH4 and check the release notes to see if there is any relevant known issue and workaround.
If you still have issues using distcp, feel free to create a new stackoverflow question with the exact error message, versions of CDH3 and CDH4, and exact command.
You will have to use distcp with the following command when transferring between two different versions of HDFS (notice hftp):
hadoop distcp hftp://Source-namenode:50070/user/ hdfs://destination-namenode:8020/user/
DistCp is intra-cluster only.
The only way I know is "fs -get", "fs -put" for every subset of data that can fit local disc.
For copying between two different versions of Hadoop, one will usually use HftpFileSystem. This is a read-only FileSystem, so DistCp must be run on the destination cluster (more specifically, on TaskTrackers that can write to the destination cluster). Each source is specified as hftp://<dfs.http.address>/<path> (the default dfs.http.address is <namenode>:50070).
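To make the hftp source format concrete, here is a small dry-run sketch that builds the source URI and the distcp command. The function names and host names are placeholders; 50070 is the default HTTP port mentioned above.

```shell
#!/bin/sh
# Dry-run sketch: prints the command instead of running it.
# Function names and host names are placeholders for this example.

# Build an hftp source URI.
# $1 = source NameNode host, $2 = absolute path, $3 = HTTP port (default 50070)
hftp_uri() {
  printf 'hftp://%s:%s%s\n' "$1" "${3:-50070}" "$2"
}

# Build the distcp command. Because hftp is read-only, the printed command
# is meant to be run on the DESTINATION cluster.
# $1 = source NameNode host, $2 = source path, $3 = destination hdfs:// URI
cross_version_distcp_cmd() {
  printf 'hadoop distcp %s %s\n' "$(hftp_uri "$1" "$2")" "$3"
}
```

For example, `cross_version_distcp_cmd Source-namenode /user/ hdfs://destination-namenode:8020/user/` prints the same command as the hftp example given earlier.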
We have a Hadoop + HBase cluster on Amazon EMR with the default configuration, so both mapred.child.tmp and hbase.tmp.dir point to /tmp. Our cluster has been running for a while and /tmp is now 500 GB, compared to 70 GB for the actual /hbase data.
This difference seems excessive. Are we supposed to periodically delete some of the /tmp data?
After some investigation I found that the largest part of our /tmp data was created by failed MapReduce tasks during Amazon's automatic backup of HBase to S3. Our successful MapReduce tasks don't leave much data in /tmp.
We have decided to disable Amazon's automatic backup and implement our own backup script using the HBase tools for importing/exporting tables.
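A backup script along those lines could start from something like this dry-run sketch, which prints one Export command per table into a dated directory. The function name, the backup root, and the table names are all hypothetical; this is not the script we actually run, just the general shape.

```shell
#!/bin/sh
# Dry-run sketch: prints one HBase Export command per table.
# Function name, backup root and table names are hypothetical.

# $1 = backup root directory or URI, $2 = date label, remaining args = table names
backup_cmds() {
  bk_root="$1"
  bk_date="$2"
  shift 2
  for bk_table in "$@"; do
    printf 'hbase org.apache.hadoop.hbase.mapreduce.Export %s %s/%s/%s\n' \
      "$bk_table" "$bk_root" "$bk_date" "$bk_table"
  done
}

# Example (hypothetical names):
#   backup_cmds /backups/hbase "$(date +%Y-%m-%d)" users events
```

Printing first makes it easy to check the target paths; once they look right, the output can be piped to `sh` from a scheduled job, and a failed run can be spotted and cleaned up explicitly.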