Difference between HBase copy and snapshot command - hadoop

I have a table in HBase which contains a huge amount of data. I want to take a backup of the table, so in this situation which is better:
1--Copy command to take the backup of the table
2--Take a snapshot of that table
Also, please explain the internal mechanism of a snapshot. Is it simply renaming the table?
Regards
Amit

Snapshot is best.
HBase Snapshots allow you to take a snapshot of a table without much impact on Region Servers. Snapshot, clone and restore operations don't involve data copying. Also, exporting the snapshot to another cluster doesn't impact the Region Servers.
Prior to version 0.94.6, the only way to back up or clone a table was to use CopyTable/ExportTable, or to copy all the HFiles in HDFS after disabling the table. The disadvantage of these methods is that you can degrade region server performance (Copy/Export Table), or you need to disable the table, which means no reads or writes; and this is usually unacceptable.
A snapshot is not just a rename. If you want to be able to restore the table to a particular point between multiple operations, this is the right tool to use:
A snapshot is a set of metadata information that allows an admin to get back to a previous state of the table. A snapshot is not a copy of the table; it’s just a list of file names and doesn’t copy the data. A full snapshot restore means that you get back to the previous “table schema” and you get back your previous data losing any changes made since the snapshot was taken.
Also, see Snapshots and Repeatable Reads for HBase Tables.
Snapshot Internals

Related

ETL + sync data between Redshift and DynamoDB

I need to aggregate data coming from DynamoDB into AWS Redshift, and it needs to be accurate and in sync. For the ETL I'm planning to use DynamoDB Streams, a Lambda transform, and Kinesis Firehose into, finally, Redshift.
What would the process be for updated data? I find all of this is fine-tuned just for ETL. What would be the best option to keep both (DynamoDB and Redshift) in sync?
These are my current options:
Trigger an "UPDATE" command direct from Lambda to Redshift (blocking).
Aggregate all update/delete records and process them on an hourly basis "somehow".
Any experience with this? Maybe Redshift is not the best solution? I need to extract aggregated data for reporting/dashboarding over 2 TB of data.
The Redshift COPY command supports using a DynamoDB table as a data source. This may or may not be a possible solution in your case, as there are some limitations to this process. Data type and table naming differences can trip you up. Also, this isn't a great option for incremental updates, but it can be done if the amount of data is small and you can design the updating SQL.
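A minimal sketch of that COPY path, assuming a hypothetical product_catalog table on both sides and a placeholder IAM role ARN:

-- Hypothetical names: product_catalog exists in both Redshift and DynamoDB; the IAM role ARN is a placeholder.
-- READRATIO caps how much of the DynamoDB table's provisioned read capacity the load may consume.
COPY product_catalog
FROM 'dynamodb://product_catalog'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-dynamodb-read'
READRATIO 50;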
Another route is to look at DynamoDB Streams. This routes data updates through Kinesis, which can be used to update Redshift at a reasonable rate. This can help keep the data synced between these databases, and will likely make the data available in Redshift as quickly as possible.
Remember that you are not going to get Redshift to match on a moment-by-moment basis. Is this what you mean by "in sync"? These are very different databases with very different use cases and architectures to support those use cases. Redshift works on big chunks of data that change more slowly than is typical in DynamoDB, so Redshift will be updated in "chunks" at a more infrequent rate than DynamoDB. I've built systems that bring this down to 5-minute intervals, but 10-15 minute update intervals is where most people end up when trying to keep a warehouse in sync.
The other option is to update Redshift infrequently (hourly?) and use federated queries to combine "recent" data with "older" data stored in Redshift. This is a more complicated solution and will likely mean changes to your data model to support it, but it is doable. So only go here if you really need to query very recent data right alongside the older and bigger data.
The best-suited answer is to use a Staging table with an UPSERT operation (or a Redshift interpretation of it).
I found this approach valid for my use case, where I needed to:
Keep Redshift as up to date as possible without causing blocking.
Work with complex DynamoDB schemas that can't be used as a source directly, so the data has to be transformed to fit the Redshift DDL.
The architecture works as follows: we constantly load from Kinesis using the same COPY mechanism, but instead of loading directly into the final table, we use a staging one. Once the batch is loaded into staging, we look for duplicates between the two tables. The duplicates in the final table are DELETED before the INSERT is performed.
After trying this, I've found that all the DELETE operations for the same batch perform better if enclosed within a single transaction. Also, a VACUUM operation is needed afterwards to re-sort the table and reclaim the space left by the deletes.
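A minimal sketch of that staging upsert, assuming hypothetical tables events_staging and events keyed on event_id:

-- Hypothetical tables: events_staging (loaded by COPY from the Kinesis batch) and events (final).
BEGIN;

-- Remove rows in the final table that are about to be replaced by the incoming batch.
DELETE FROM events
USING events_staging
WHERE events.event_id = events_staging.event_id;

-- Insert the full batch from staging.
INSERT INTO events
SELECT * FROM events_staging;

COMMIT;

-- Clear staging for the next batch (TRUNCATE implicitly commits in Redshift, so keep it outside the transaction).
TRUNCATE events_staging;

-- Re-sort the table and reclaim the space left by the deletes (cannot run inside a transaction block).
VACUUM events;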
For further detail on the UPSERT operation, I've found this source very useful.

What are the steps I need to perform to clean the data if data in the dimension/fact table was improperly loaded

Suppose there is a scenario where there is a data loading process into a fact/dimension table, and after analysis it is found that 100 million records were improperly loaded. What steps do I need to perform to clean the data properly?
Here are two practices which help in that scenario:
Take a backup or snapshot before each batch. In the case of a major error like this you can roll back to the snapshot, reload and process the correct data.
Maintain an insert-only persistent staging area in the DW, such as a data vault, with each row stamped with a batch ID and timestamp. Remove the rows in error, and rebuild your facts and dimensions (sketched below).
If this represents a real situation, your only chance is #1.
If you don't have a reliable backup, and you have updated and/or deleted rows during the ETL/ELT process, you have no record of the pre-failure state and it may be impossible to go back.
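A rough SQL sketch of the second practice above, assuming hypothetical tables stg_sales (the insert-only persistent staging area) and fact_sales, both carrying a batch_id column, and hypothetical batch IDs:

-- Hypothetical tables and batch IDs; column names are illustrative only.
-- 1. Remove the rows loaded by the bad batch from the persistent staging area.
DELETE FROM stg_sales
WHERE batch_id = 20240117;          -- the batch identified as improperly loaded

-- 2. Remove whatever that batch produced downstream.
DELETE FROM fact_sales
WHERE batch_id = 20240117;

-- 3. Reload the corrected source data into staging under a new batch ID,
--    then rebuild the affected fact rows from it.
INSERT INTO fact_sales (sale_id, customer_key, amount, batch_id, load_ts)
SELECT sale_id, customer_key, amount, batch_id, load_ts
FROM stg_sales
WHERE batch_id = 20240118;          -- the corrected re-load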

HBase - snapshots performance

I am working on a use case where we need to take several snapshots (80-100) of a table in HBase; let's call it "data". We want the ability to read from these snapshots at any given time, so we would need to clone each snapshot and use it as a new table (for example "data_v01", "data_v02", etc.). I am unable to figure out whether having multiple snapshots affects the performance of the original "data" table.
From what I understood from reading the HBase documentation, HBase doesn't copy the data when a snapshot is taken, nor when a new table is created ("cloned") from a snapshot. To me this sounds like HBase creates a base set of HFiles and then tracks changes in something similar to a WAL. If this is true, and the base snapshot is 100 days old, there would be many changes since then. Is my understanding correct? I couldn't find much reference on this other than https://hbase.apache.org/book.html#ops.snapshots
As you may already know, HBase consistency is provided by the collection of HFiles and WAL files. A snapshot is merely the list of all HFiles in the table at the time of the snapshot (whether or not the snapshot forced a WAL and memstore flush). That's why a snapshot is very fast and cheap to create - all it does is save a list of paths to files. It also means that those files must not be deleted on compaction; instead they are moved to an archive folder until no snapshot references them (very much like GC). In some cases this can lead to storage overhead.
I am unable to figure out whether having multiple snapshots affect the performance of the original "data" table.
Creating a table from a snapshot has nothing to do with the original table. The fact that both tables might share some HFiles doesn't matter, since HFiles are immutable.
...(if) the base snapshot is 100 days old, this would mean (that the data is outdated)
Yes, this is correct. The snapshot will only see the HFiles that existed when it was created.

What's the best way to perform data archiving on an Oracle database?

I'd like to figure out the best way to archive data that is no longer needed, in order to improve application performance and also to save disk space. In your experience, what is the best way to implement this, and what kind of tools can I use? Is it better to develop a specific in-house application for that purpose?
One way to manage archiving:
Partition your tables on date range, e.g. monthly partitions
As partitions become obsolete, e.g. after 36 months, those partitions can be moved to your data warehouse, which could be another database or just text files, depending upon your access needs.
After moving, the obsolete partitions can be removed from your primary database, so you always maintain just (e.g.) 36 months of current data.
All this can be automated using a mix of SQL/Shell scripts.
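A minimal sketch of that flow, assuming a hypothetical SALES table range-partitioned by month on SALE_DATE; the partition names and archive target are illustrative only:

-- Hypothetical table, partitioned by month on SALE_DATE.
CREATE TABLE sales (
  sale_id    NUMBER,
  sale_date  DATE,
  amount     NUMBER
)
PARTITION BY RANGE (sale_date) (
  PARTITION p_2021_01 VALUES LESS THAN (DATE '2021-02-01'),
  PARTITION p_2021_02 VALUES LESS THAN (DATE '2021-03-01')
  -- one partition per month, and so on
);

-- When a partition falls outside the 36-month window:
-- 1. Copy it out to an archive table (which could live in a cheaper tablespace or another database).
CREATE TABLE sales_archive_2021_01 AS
SELECT * FROM sales PARTITION (p_2021_01);

-- 2. Drop the obsolete partition from the primary table.
ALTER TABLE sales DROP PARTITION p_2021_01;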
The best way to archive old data in an Oracle database is:
Define an archive and retention policy based on date or size.
Export archivable data to an external table (tablespace) based on a defined policy.
Compress the external table and store it in a cheaper storage medium.
Delete the archived data from your active database using SQL DELETE.
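A minimal sketch of steps 2-4, reusing the T_XYZ table from the commands below; the CREATED_DATE column, the ARCHIVE_TS tablespace and the 36-month cutoff are hypothetical:

-- Copy the archivable rows into a compressed table in a cheaper tablespace.
CREATE TABLE t_xyz_archive
  TABLESPACE archive_ts
  COMPRESS
AS
SELECT *
FROM   t_xyz
WHERE  created_date < ADD_MONTHS(SYSDATE, -36);

-- Remove the archived rows from the active table.
DELETE FROM t_xyz
WHERE  created_date < ADD_MONTHS(SYSDATE, -36);

COMMIT;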
Then, to clean up the space, execute the commands below:
alter table T_XYZ enable row movement;
alter table T_XYZ shrink space;
If you still want to release some disk space back to the OS (as Oracle will have reserved the total space it was previously using), you may have to resize the datafile itself:
SQL> alter database datafile '/home/oracle/database/../XYZ.dbf' resize 1m;
For more details, please refer:
http://stage10.oaug.org/file/sroaug080229081203621527.pdf
I would export the data to a comma-delimited file so it can be imported into almost any database. That way, if you change versions of Oracle or move to something else years later, you can restore it without much concern.
Use the spool file feature of SQL*Plus to do this: http://cisnet.baruch.cuny.edu/holowczak/oracle/sqlplus/#savingoutput
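A minimal SQL*Plus sketch of that spool approach; the table, columns, file path and cutoff date are all hypothetical:

-- Run in SQL*Plus; writes a comma-delimited extract of the old rows to a flat file.
SET HEADING OFF
SET PAGESIZE 0
SET FEEDBACK OFF
SET LINESIZE 1000

SPOOL /home/oracle/archive/t_xyz_pre2020.csv

SELECT sale_id || ',' || TO_CHAR(created_date, 'YYYY-MM-DD') || ',' || amount
FROM   t_xyz
WHERE  created_date < DATE '2020-01-01';

SPOOL OFF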

overcoming 'log file sync' by design?

Advice/suggestions needed for a bit of application design.
I have an application which uses two tables: one is a staging table which many separate processes write to. Once a 'group' of processes has finished, another job comes along, aggregates the results together into a final table, and then deletes that 'group' from the staging table.
The problem I'm having is that when the staging table is being cleared out, lots of redo is generated and I'm seeing a lot of 'log file sync' waits in the database. This is a shared database with many other applications, and this is causing some issues.
When applying the aggregate, the rows are reduced to about 1 row in the final table for every 20 rows in the staging table.
I'm thinking of getting around this by creating a table for each 'group' rather than having a single staging table. Once a group is done, its table can simply be dropped, which should generate much less redo.
I only have Standard Edition, so partitioned tables aren't an option. Faster disks for the redo logs probably aren't an option in the short term either.
Is this a bad idea? Any better solutions to be offered?
Thanks.
Would it be possible to solve the problem by having your process do a logical delete (i.e. set a DELETE_FLAG column in the table to 'Y') and then having a nightly process that truncates the table (potentially writing any non-deleted rows to a separate table before the truncate and copying them back after the table is truncated)?
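A rough sketch of that idea in Oracle SQL, assuming a hypothetical STAGING_RESULTS table with GROUP_ID and DELETE_FLAG columns:

-- During the day: the aggregation job marks processed groups instead of deleting them
-- (updating a small flag column typically generates less redo than deleting whole rows).
UPDATE staging_results
SET    delete_flag = 'Y'
WHERE  group_id = :processed_group;   -- :processed_group is a placeholder bind variable

-- Nightly cleanup: set the still-live rows aside, truncate (minimal redo), then put them back.
CREATE TABLE staging_keep AS
SELECT * FROM staging_results WHERE delete_flag = 'N';

TRUNCATE TABLE staging_results;

INSERT INTO staging_results
SELECT * FROM staging_keep;
COMMIT;

DROP TABLE staging_keep;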
Are you certain that the source of the log file sync waits is that your disks can't keep up with the I/O? That's certainly possible, of course, but there are other possible causes of excessive log file sync waits including excessive commits. There is an excellent article on tuning log file sync events on the Pythian blog.
The most common cause of excessive log file syncs is too frequent commits, which are often deliberately coded in a mistaken attempt to reduce system load due to locking. You should commit only when your business transaction is complete.
Loading each group into a separate table sounds like a fine plan for reducing redo. You can truncate or drop each individual group table following its aggregation.
Another (but probably worse, I think) option is to create a new staging table containing the groups that haven't been aggregated yet, then drop the original and rename the new table to replace the staging table.
I prefer Justin's suggestion ("logical delete"), but another option to consider might be a partitioned table, if you have the EE licence. The aggregation process could drop a partition instead of deleting the rows.
