HBase - snapshots performance

I am working on a use case where we need to take several snapshots (80-100) of a table in HBase, let's call it "data". We want the capability of reading from these snapshots at any given time, so we would need to clone each snapshot and use it as a new table (for example "data_v01", "data_v02", etc.). I am unable to figure out whether having multiple snapshots affects the performance of the original "data" table.
From what I understood from reading the HBase documentation, HBase doesn't copy the data when a snapshot is taken, nor when a new table is created ("cloned") from a snapshot. To me this seems like HBase creates a base set of HFiles and then tracks changes in a form similar to the WAL. If this is true, and the base snapshot is 100 days old, this would mean the changes would be many. Is my understanding correct? I couldn't find much reference on this other than https://hbase.apache.org/book.html#ops.snapshots

As you may already know, HBase consistency is given by the collection of HFiles and WAL files. A snapshot is merely the list of all HFiles in the table at the time of the snapshot (whether or not the snapshot forced a flush of the WAL and memstores). That's why a snapshot is very fast and cheap to create: all it does is save a list of file paths. This means that the files must not be deleted during compaction; instead they are moved to the archive folder and kept until no snapshot references them (very much like GC). In some cases this might lead to storage overhead.
I am unable to figure out whether having multiple snapshots affects the performance of the original "data" table.
Creating a table from a snapshot has nothing to do with the original table. The fact that both tables might be sharing some HFiles has no meaning since HFiles are immutable.
...(if) the base snapshot is 100 days old, this would mean (that the data is outdated)
Yes this is correct. The snapshot will only see the HFiles that existed when it was created.
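To make the "list of file paths" idea concrete, here is a minimal toy model in Python. It is not HBase code: the class ToyTable and the file names are hypothetical, and it simplifies the archive so that it keeps only files some snapshot still references. It only illustrates why a snapshot is cheap, why an old snapshot keeps old files alive, and why deleting the snapshot releases them.

```python
class ToyTable:
    """Toy model of a table as a set of immutable HFile names (not real HBase)."""

    def __init__(self):
        self.live = set()       # HFiles currently serving the table
        self.archive = set()    # compacted-away HFiles still referenced by a snapshot
        self.snapshots = {}     # snapshot name -> frozenset of HFile names
        self._next_id = 0

    def flush(self):
        """Write a new immutable HFile (stands in for a memstore flush)."""
        name = f"hfile-{self._next_id}"
        self._next_id += 1
        self.live.add(name)
        return name

    def snapshot(self, snap_name):
        """Record the current file list; no data is copied."""
        self.snapshots[snap_name] = frozenset(self.live)

    def _referenced(self):
        """All HFile names that at least one snapshot still points at."""
        return set().union(*self.snapshots.values())

    def compact(self):
        """Merge all live files into one new file; keep old ones only if referenced."""
        old = set(self.live)
        self.live.clear()
        self.flush()
        self.archive |= old & self._referenced()

    def delete_snapshot(self, snap_name):
        """Drop a snapshot and release archived files nobody references any more."""
        del self.snapshots[snap_name]
        self.archive &= self._referenced()


table = ToyTable()
table.flush()                     # hfile-0
table.flush()                     # hfile-1
table.snapshot("v01")             # remembers hfile-0 and hfile-1
table.flush()                     # hfile-2, written after the snapshot
table.compact()                   # live set becomes just hfile-3
archived_while_referenced = set(table.archive)
table.delete_snapshot("v01")
archived_after_delete = set(table.archive)

print(sorted(archived_while_referenced))  # ['hfile-0', 'hfile-1']
print(sorted(archived_after_delete))      # []
```

In this model the snapshot "v01" never sees hfile-2 or hfile-3, which matches the point above that a snapshot only sees the files that existed when it was taken. The storage overhead is the archived files that stay around while the snapshot exists.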

Related

ETL + sync data between Redshift and DynamoDB

I need to aggregate data coming from DynamoDB into AWS Redshift, and I need it to be accurate and in sync. For the ETL I'm planning to use DynamoDB Streams, a Lambda transform, Kinesis Firehose and, finally, Redshift.
What would the process be for updated data? Everything I find is tuned just for ETL. What would be the best option to keep both (DynamoDB and Redshift) in sync?
These are my current options:
Trigger an "UPDATE" command direct from Lambda to Redshift (blocking).
Aggregate all update/delete records and process them on an hourly basis "somehow".
Any experience with this? Maybe Redshift is not the best solution? I need to extract aggregated data for reporting / dashboarding on 2 TB of data.
The Redshift COPY command supports using a DynamoDB table as a data source. This may or may not be a possible solution in your case, as there are some limitations to this process. Data type and table naming differences can trip you up. Also, this isn't a great option for incremental updates, but it can be done if the amount of data is small and you can design the updating SQL.
Another route to look at is DynamoDB Streams. This will route data updates through Kinesis, which can be used to update Redshift at a reasonable rate. This can help keep data synced between these databases and will likely make the data available in Redshift as quickly as possible.
Remember that you are not going to get Redshift to match on a moment-by-moment basis. Is this what you mean by "in-sync"? These are very different databases with very different use cases and architectures to support them. Redshift works on big chunks of data that change more slowly than what typically happens in DynamoDB. Redshift will be updated in "chunks", which happen at a less frequent rate than updates on DynamoDB. I've made systems that bring this down to 5-minute intervals, but 10-15 minute update intervals are where most end up when trying to keep a warehouse in sync.
The other option is to update Redshift infrequently (hourly?) and use federated queries to combine "recent" data with "older" data stored in Redshift. This is a more complicated solution and will likely require changes to your data model, but it is doable. So only go here if you really need to query very recent data right alongside older and bigger data.
The best-suited answer is to use a Staging table with an UPSERT operation (or a Redshift interpretation of it).
I found this approach valid for my use case, where the goals were to:
Keep Redshift as up to date as possible without causing blocking.
Work with complex DynamoDB schemas that can't be used as a source directly, so the data has to be transformed to fit the Redshift DDL.
This is the architecture:
So we constantly load from Kinesis using the same COPY mechanism, but instead of loading directly into the final table, we load into a staging one. Once the batch is loaded into staging, we look for duplicates between the two tables. Those duplicates in the final table are DELETED before the INSERT is performed.
After trying this I've found that all DELETE operations for the same batch perform better if enclosed within a single transaction. Also, a VACUUM operation is needed in order to re-balance the new load.
For further detail on the UPSERT operation, I've found this source very useful.
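The delete-then-insert merge described above can be sketched as follows. This is only an illustration of the pattern, run against an in-memory SQLite database as a stand-in for Redshift; the table names events and events_staging and the columns id and value are hypothetical, the sketch assumes the staging batch holds at most one row per key, and the COPY and VACUUM steps are left out because SQLite has no equivalent of them.

```python
import sqlite3


def merge_staging(conn):
    """Merge the staging batch into the final table inside one transaction."""
    with conn:  # commits if the block succeeds, rolls back on an exception
        # Remove rows in the final table that the batch is about to replace.
        conn.execute(
            "DELETE FROM events WHERE id IN (SELECT id FROM events_staging)"
        )
        # Insert the whole batch: updated rows and brand-new rows alike.
        conn.execute("INSERT INTO events SELECT id, value FROM events_staging")
        # Empty the staging table so the next batch starts clean.
        conn.execute("DELETE FROM events_staging")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("CREATE TABLE events_staging (id INTEGER, value TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "old"), (2, "keep")])
conn.executemany(
    "INSERT INTO events_staging VALUES (?, ?)", [(1, "new"), (3, "added")]
)
conn.commit()

merge_staging(conn)
rows = conn.execute("SELECT id, value FROM events ORDER BY id").fetchall()
staging_left = conn.execute("SELECT COUNT(*) FROM events_staging").fetchone()[0]
print(rows)  # [(1, 'new'), (2, 'keep'), (3, 'added')]
```

Keeping the delete and the insert in the same transaction means readers of the final table never see the batch half applied, which is the reason for wrapping them together here.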

What steps do I need to perform to clean the data if data was improperly loaded into a dimension/fact table?

Suppose there is a data loading process into a fact or dimension table, and after analysis it is found that 100 million records were improperly loaded. What steps do I need to perform to clean the data properly?
Here are two practices which help in that scenario:
1. Take a backup or snapshot before each batch. In the case of a major error like this you can roll back to the snapshot, reload and process the correct data.
2. Maintain an insert-only persistent staging area in the DW, such as a data vault, with each row stamped with a batch ID and timestamp. Remove the rows in error, and rebuild your facts and dimensions.
If this represents a real situation your only chance is #1.
If you don't have a reliable backup, and you have updated and/or deleted rows during the ETL/ELT process, you don't have any record of the pre-fail state and it may be impossible to go back.
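As a rough illustration of the second practice, the sketch below keeps insert-only staging rows stamped with a batch ID and rebuilds an aggregate while skipping the batch found to be in error. The row layout, the column names and the function rebuild_fact are all hypothetical; a real warehouse would do this in SQL over far more data.

```python
from collections import defaultdict

# Insert-only staging rows, each stamped with the batch that loaded it.
staging = [
    {"batch_id": 1, "customer": "a", "amount": 10},
    {"batch_id": 1, "customer": "b", "amount": 5},
    {"batch_id": 2, "customer": "a", "amount": 999},  # batch later found to be bad
    {"batch_id": 3, "customer": "b", "amount": 7},
]


def rebuild_fact(rows, bad_batches):
    """Recompute per-customer totals from staging, ignoring bad batches."""
    totals = defaultdict(int)
    for row in rows:
        if row["batch_id"] in bad_batches:
            continue  # rows from a bad batch never reach the rebuilt fact
        totals[row["customer"]] += row["amount"]
    return dict(totals)


fact = rebuild_fact(staging, bad_batches={2})
print(fact)  # {'a': 10, 'b': 12}
```

This only works because the staging area was never updated or deleted in place, so the pre-failure state can still be derived from it.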

Difference between the HBase copy and snapshot commands

I have a table in HBase which contains a huge amount of data. I want to take a backup of the table, so in this situation which is better:
1--Use the copy command to take a backup of the table
2--Take a snapshot of that table
Also, please explain the internal mechanism of a snapshot. Is it simply renaming the table?
Regards
Amit
Snapshot is best.
HBase snapshots allow you to take a snapshot of a table without too much impact on the Region Servers. Snapshot, clone and restore operations don't involve data copying. Also, exporting the snapshot to another cluster doesn't have an impact on the Region Servers.
Prior to version 0.94.6, the only way to back up or clone a table was to use CopyTable/ExportTable, or to copy all the HFiles in HDFS after disabling the table. The disadvantages of these methods are that you can degrade region server performance (Copy/Export Table) or that you need to disable the table, which means no reads or writes; this is usually unacceptable.
A snapshot is not just a rename. If you want to be able to restore the table to one particular point in time between multiple operations, a snapshot is the right tool:
A snapshot is a set of metadata information that allows an admin to get back to a previous state of the table. A snapshot is not a copy of the table; it’s just a list of file names and doesn’t copy the data. A full snapshot restore means that you get back to the previous “table schema” and you get back your previous data losing any changes made since the snapshot was taken.
Also, see Snapshots and Repeatable reads for HBase Tables
Snapshot Internals
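For reference, the snapshot, clone and restore operations mentioned above map to the HBase shell commands snapshot, clone_snapshot and restore_snapshot. The small Python helper below only builds the command text; the table name "data", the snapshot naming scheme and the helper functions themselves are hypothetical, and the sketch assumes the table is disabled around the restore, as the shell normally expects.

```python
def snapshot_commands(table, version):
    """Return HBase shell commands to snapshot a table and clone the snapshot."""
    snap = f"{table}_snap_v{version:02d}"   # hypothetical naming scheme
    clone = f"{table}_v{version:02d}"
    return [
        f"snapshot '{table}', '{snap}'",
        f"clone_snapshot '{snap}', '{clone}'",
    ]


def restore_commands(table, snap):
    """Return HBase shell commands to roll a table back to a snapshot."""
    return [
        f"disable '{table}'",           # assumed: restore runs on a disabled table
        f"restore_snapshot '{snap}'",
        f"enable '{table}'",
    ]


commands = snapshot_commands("data", 1) + restore_commands("data", "data_snap_v01")
script = "\n".join(commands)
print(script)
```

The printed lines could be pasted into an HBase shell session. None of them copies table data, in line with the description of snapshots above.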

Suggestions on defining the technical stack for my Hadoop infrastructure

I am planning to build a new system in Hadoop that brings in data from an external environment, applies some transformations, and builds an end product.
The external data (we can assume it comes from an Oracle/MySQL/PostgreSQL database, and there can be n database schemas) that comes into the Hadoop system should always be real time (new data should get inserted and updated data should get updated), with a delay of at most an hour (we can poll/push on an hourly basis).
We can also assume that my database schema has n tables, of which I may need only m. Each table holds GBs/TBs of data, so I can't go with a full table replace. I should always do an incremental (updates/inserts) push/pull into the Hadoop system.
Hive may help by dividing my data into date-wise partitions, which makes queries faster, but it does not support updates, so I would always have to do a full table replace, which does not scale.
My end goal is "real-time data in the Hadoop system, read query performance, update performance".
Your technical suggestions for my use case would be very useful.

Hive "add partition" concurrency

We have an external Hive table that is used for processing raw log file data. The files are hourly, and are partitioned by date and source host name.
At the moment we are importing files using simple python scripts that are triggered a few times per hour. The script creates sub folders on HDFS as needed, copies new files from the temporary local storage and adds any new partitions to Hive.
Today, new partitions are created using "ALTER TABLE ... ADD PARTITION ...". However, if another Hive query is running on the table it will be locked, which means that the add partition command will fail (if the query runs for long enough) since it requires an exclusive lock.
An alternative to this approach would be to use "MSCK REPAIR TABLE", which for some reason does not seem to acquire any locks on the table. However, I have gotten the impression that using repair table is not recommended for a production setting.
What is the best practise for adding Hive partitions programmatically in a concurrent environment?
What are the risks or disadvantages of using MSCK REPAIR TABLE?
Is there an explanation for the seemingly inconsistent locking behaviour of the two partition adding commands? I.e. do they have different effects on running queries?
Not a good answer, but we have the same issue and here are our findings:
In the Hive doc, https://cwiki.apache.org/confluence/display/Hive/Locking , locks seem pretty sensible: an "ADD PARTITION" will request an exclusive lock on the created partition and a shared lock on the whole table. A SELECT query will request a shared lock on the table. So it should be fine.
However, it does not work this way, at least in CDH 5.3. According to this thread, https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/u7aM9W3pegM this is a known behavior, probably new (I am not sure, but like the author of that thread, I also think the issue was not there in CDH 4.7).
So basically, we're still thinking about our partition strategy, but we will probably try to create all possible partitions in advance (before getting the data), as we know precisely the values of all future partitions (which might not be the case for you).
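A minimal sketch of the "create all partitions in advance" idea, in the spirit of the Python import scripts mentioned in the question: it generates one ALTER TABLE ... ADD IF NOT EXISTS PARTITION statement per day and host. The table name raw_logs, the partition columns dt and host, and the host names are assumptions, and the sketch relies on Hive's default partition directory layout since no LOCATION is given. Whether this avoids the locking problem on a given CDH version is not something the sketch can show; it only moves the partition creation to a time of your choosing.

```python
from datetime import date, timedelta


def add_partition_statements(table, start, days, hosts):
    """Build ADD PARTITION statements for every (day, host) pair."""
    statements = []
    for offset in range(days):
        day = (start + timedelta(days=offset)).isoformat()
        for host in hosts:
            statements.append(
                f"ALTER TABLE {table} ADD IF NOT EXISTS "
                f"PARTITION (dt='{day}', host='{host}');"
            )
    return statements


statements = add_partition_statements(
    "raw_logs", date(2015, 1, 30), days=3, hosts=["web01", "web02"]
)
for statement in statements:
    print(statement)
```

The IF NOT EXISTS clause makes the statements safe to re-run, so the generated script could be executed once a day during a quiet period without failing on partitions that already exist.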
