HBase snapshot incremental backup - hadoop

In HBase, we can use ExportSnapshot to copy a snapshot's data and metadata to another HBase cluster. On the second cluster we can then run "list_snapshots" to verify the exported snapshot, and clone_snapshot "snapshot_name", "new_table_name" to restore it as a table.
So, is there any method or utility available to take an incremental backup of HBase snapshots, say over a 7-day period?
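For reference, the full-export round trip described above looks roughly like this (the cluster address, snapshot name, and table names are placeholders, not taken from the question):

    # On the source cluster: take a snapshot of the table
    echo "snapshot 'my_table', 'my_table_snap'" | hbase shell

    # Copy the snapshot's data and metadata to the second cluster's HDFS
    hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
        -snapshot my_table_snap \
        -copy-to hdfs://cluster2-nn:8020/hbase \
        -mappers 16

    # On the second cluster: verify, then materialize the snapshot as a table
    echo "list_snapshots" | hbase shell
    echo "clone_snapshot 'my_table_snap', 'new_table_name'" | hbase shell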

Related

While exporting an HBase snapshot, will the mappers use the existing HFile data for the export, or will they create duplicate data from these HFiles?

I have an HBase table with 1 TB of storage across 100 regions, and I took a snapshot of the table. When I export this snapshot to a remote HDFS cluster, will the mappers consume an extra 1 TB of space in the existing HDFS cluster before transferring to the remote cluster, or will they use only the referenced HFiles?

What (really) happens in HDFS during a Hive Update?

Here is the situation:
HDFS is known to be append-only (no update per se).
Hive writes data to its warehouse, which is located in HDFS.
Updates can be performed in Hive.
This implies that new data is written, and old data must somehow be marked as deprecated and wiped out at some point.
I searched but did not manage to find any information about this so far.
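For concreteness, the kind of update being asked about is only allowed on transactional (ACID) tables; a minimal sketch, assuming ACID is enabled on the cluster and using made-up table and column names:

    # Illustrative only: an ACID table (bucketed ORC, transactional) and an UPDATE on it
    hive -e "
        CREATE TABLE customers (id INT, email STRING)
        CLUSTERED BY (id) INTO 4 BUCKETS
        STORED AS ORC
        TBLPROPERTIES ('transactional'='true');

        UPDATE customers SET email = 'new@example.com' WHERE id = 42;
    "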

Contents of elasticsearch snapshot

We are going to be using the snapshot API for blue green deployment of our cluster. We want to snapshot the existing cluster, spin up a new cluster, restore the data from the snapshot. We also need to apply any changes to the existing cluster data to our new cluster (before we switchover and make the new cluster live).
The thinking is that we can index data from our database that changed after the timestamp at which the snapshot was created, to ensure that any writes that happened to the running live cluster get applied to the new cluster (the new cluster only has the data restored from the snapshot). My question is what timestamp to use. The snapshot API has start_time and end_time values for a given snapshot, but I am not certain that end_time in this context means "all data modified up to this point"; I feel like it is just a marker to tell you how long the operation took. I may be wrong.
Does anyone know how to find out what a snapshot contains? Can we use end_time as a marker to know that the snapshot contains all data modifications before that date?
Thanks!
According to the documentation:
Snapshotting process is executed in non-blocking fashion. All indexing
and searching operation can continue to be executed against the index
that is being snapshotted. However, a snapshot represents the
point-in-time view of the index at the moment when snapshot was
created, so no records that were added to the index after the snapshot
process was started will be present in the snapshot.
You will need to use start_time or start_time_in_millis.
Because snapshots are incremental, you can create a first full snapshot and then one more snapshot right after the first one finishes; the second one will be almost instant.
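As a rough sketch (the repository name, snapshot name, and paths are placeholders), registering a repository, taking a snapshot, and reading back its start_time could look like:

    # Register a shared-filesystem snapshot repository (location must be listed in path.repo)
    curl -XPUT 'http://localhost:9200/_snapshot/my_backup' \
        -H 'Content-Type: application/json' -d '{
        "type": "fs",
        "settings": { "location": "/mount/backups/my_backup" }
    }'

    # Take a snapshot and block until it completes
    curl -XPUT 'http://localhost:9200/_snapshot/my_backup/snap_1?wait_for_completion=true'

    # Inspect it: start_time / start_time_in_millis is the point-in-time cutoff
    # from which to re-index database changes into the new cluster
    curl -XGET 'http://localhost:9200/_snapshot/my_backup/snap_1'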
One more question: why recreate functionality already implemented in Elasticsearch? If you can run both clusters at the same time, you can merge them, let them sync, switch write queries to the new cluster, and gradually disconnect the old servers from the merged cluster, leaving only the new ones.
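If you go that route, one way to drain the old nodes before shutting them down is the shard allocation exclusion setting (the IPs below are placeholders):

    # Ask the merged cluster to move shards off the old nodes before removing them
    curl -XPUT 'http://localhost:9200/_cluster/settings' \
        -H 'Content-Type: application/json' -d '{
        "transient": { "cluster.routing.allocation.exclude._ip": "10.0.0.1,10.0.0.2" }
    }'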

how to manage modified data in Apache Hive

We are working on Cloudera CDH and trying to perform reporting on data stored in Apache Hadoop. We send daily reports to the client, so we need to import data from the operational store into Hadoop daily.
HDFS is append-only, so we cannot run Hive update/delete queries. We can perform INSERT OVERWRITE on the dimension tables and append delta rows to the fact tables, but introducing thousands of delta rows daily does not seem like a great solution.
Are there any other, more standard ways to handle modified data in Hadoop?
Thanks
HDFS might be append-only, but Hive does support updates from version 0.14 on.
see here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update
A design pattern is to take all your previous and current data and insert it into a new table every time.
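A minimal sketch of that rebuild pattern, assuming the base and daily-delta tables share a schema and an id key (all table names below are made up):

    # Rebuild: rows from today's delta win, untouched base rows are carried over,
    # and the fresh table is swapped in under the old name
    hive -e "
        CREATE TABLE customers_new STORED AS ORC AS
        SELECT * FROM (
            SELECT * FROM customers_delta
            UNION ALL
            SELECT b.* FROM customers_base b
            LEFT JOIN customers_delta d ON b.id = d.id
            WHERE d.id IS NULL
        ) merged;

        ALTER TABLE customers_base RENAME TO customers_old;
        ALTER TABLE customers_new RENAME TO customers_base;
        DROP TABLE customers_old;
    "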
Depending on your use case, have a look at Apache Impala/HBase/... or even Drill.

Backup Hadoop in order to install new cluster, best practice

I am building a new Hadoop cluster (expanding the number of nodes and extending the capacity of the current nodes) and need to back up all of the existing data. Right now I am just tar-ing everything and sending it to another server.
Is there a smarter way of doing this which will allow me to easily deploy once the new cluster is set up?
Edit: I should also point out that I don't store any data on the cluster. I bring data to the cluster, process it, and then send the processed data back to the original server. Any temporary data on the cluster is then deleted.
Use DistCp to transfer the HDFS data to another cluster, or to cloud storage, in order to keep a copy of the data.
If you want to schedule the backup process, you can drive the same copy from an Oozie DistCp action.
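For example (the namenode addresses and paths are placeholders), a plain DistCp run, with -update for repeated incremental syncs:

    # One-off copy of a directory tree to the new cluster
    hadoop distcp hdfs://old-nn:8020/data hdfs://new-nn:8020/data

    # Repeated runs copy only changed files and drop files removed at the source
    hadoop distcp -update -delete hdfs://old-nn:8020/data hdfs://new-nn:8020/data

    # The same command can be wrapped in an Oozie distcp action for scheduling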
