Data aggregation during data load to Snowflake using Snowpipe - ETL

I am evaluating Snowflake for a reporting use case and am considering Snowpipe for ETL. Data is ingested from S3, and the files in S3 contain information about user sessions captured at regular intervals. In Snowflake, I want to store this data in aggregated form. Per the documentation, Snowflake supports only basic transformations and does not support GROUP BY or JOIN while copying data from the S3 stage into Snowflake tables.
I am new to ETL and Snowflake. One approach I was considering is to load the raw, detailed data from the stage into a temporary table in Snowflake, then run the aggregations (GROUP BY and JOIN) on that temporary table to load the final fact tables. Is this the correct approach for implementing complex transformations?

Temporary tables in Snowflake only stick around for the session in which they were created, which means you won't be able to point a Snowpipe at one.
Instead of a temporary table, point your Snowpipe at a transient table to store the raw data, and then truncate the table after some period of time to reduce costs. Personally, I'd keep the data in the transient table for as long as possible, provided it is not too cost prohibitive. This is to account for potentially late-arriving data.
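A minimal sketch of that setup, assuming hypothetical table, stage, and pipe names and JSON session files:

    -- Transient table for the raw session data (no Fail-safe, cheaper storage).
    CREATE TRANSIENT TABLE raw_user_sessions (
        payload VARIANT
    );

    -- Snowpipe that continuously copies new files from the S3 stage into the raw table.
    CREATE PIPE raw_user_sessions_pipe
      AUTO_INGEST = TRUE
    AS
      COPY INTO raw_user_sessions
      FROM @my_s3_stage
      FILE_FORMAT = (TYPE = 'JSON');

The pipe handles the continuous load; the truncation of old raw data can then be scheduled separately once the aggregates have been built from it.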

Yes, your approach looks good to me.
Snowpipe loads your data continuously from S3 into Snowflake, and within Snowflake you can use
Views
Tables and Stored Procedures
to transform the data and load it into your final fact table.
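For example, the aggregation step can be a plain INSERT ... SELECT with GROUP BY (and any joins you need) from the raw table into the fact table, wrapped in a stored procedure or a scheduled task if it should run continuously. The table names and JSON layout below are hypothetical, assuming a raw table with a single VARIANT column named payload:

    -- Aggregate raw session rows into a daily fact table (sketch only).
    INSERT INTO fact_user_sessions (user_id, session_date, session_count, total_duration)
    SELECT payload:user_id::STRING,
           payload:event_ts::DATE,
           COUNT(*),
           SUM(payload:duration_sec::NUMBER)
    FROM raw_user_sessions
    GROUP BY 1, 2;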

Related

Data Readiness Check

Let's say there is a job A which executes a Python script to connect to Oracle, fetch the data from Table A, and load it into Snowflake once a day. Application A, which depends on Table A in Snowflake, can simply depend on the success of job A for further processing; this is easy.
But if the data movement is via replication (Change Data Capture from Oracle moves to S3 using GoldenGate, pipes push it into a stage, and a stream loads the target via a Task every few minutes), what is the best way to let Application A know that the data is ready? How do we check whether the data is ready? Is there something available in Oracle, like a table-level marker, that can be moved over to Snowflake? The tables in Oracle cannot be modified to add anything new, and marker rows cannot be added either; these are impractical. But something Oracle provides implicitly that can be moved over to Snowflake, or some SCN-like number at the table level that can be compared every few minutes, could be a solution. I'm eager to hear about any approaches.

ETL + syncing data between Redshift and DynamoDB

I need to aggregate data coming from DynamoDB into AWS Redshift, and it needs to be accurate and in sync. For the ETL I'm planning to use DynamoDB Streams, a Lambda transform, and Kinesis Firehose into, finally, Redshift.
What would the process be for updated data? I find it's all fine-tuned just for ETL. What would be the best option to keep both (DynamoDB and Redshift) in sync?
These are my current options:
Trigger an "UPDATE" command direct from Lambda to Redshift (blocking).
Aggregate all update/delete records and process them on an hourly basis "somehow".
Any experience with this? Maybe Redshift is not the best solution? I need to extract aggregated data for reporting/dashboarding on 2 TB of data.
Redshift's COPY command supports using a DynamoDB table as a data source. This may or may not be a possible solution in your case, as there are some limitations to this process: data type and table naming differences can trip you up. It also isn't a great option for incremental updates, but it can work if the amount of data is small and you can design the updating SQL.
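A minimal sketch of that COPY path (the table names, IAM role ARN, and read ratio are placeholders):

    -- Load a DynamoDB table directly into Redshift, consuming at most 50% of
    -- the DynamoDB table's provisioned read capacity.
    COPY target_table
    FROM 'dynamodb://MySourceDynamoTable'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
    READRATIO 50;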
Another route is to look at DynamoDB Streams. This routes data updates through Kinesis, which can be used to update Redshift at a reasonable rate and helps keep the data synced between the two databases. This will likely make the data available in Redshift as quickly as possible.
Remember that you are not going to get Redshift to match on a moment-by-moment basis. Is this what you mean by "in sync"? These are very different databases with very different use cases and architectures to support those use cases. Redshift works on big chunks of data that change more slowly than what typically happens in DynamoDB, so Redshift will be updated in "chunks" at a more infrequent rate than DynamoDB. I've built systems that bring this down to 5-minute intervals, but 10-15 minute update intervals is where most end up when trying to keep a warehouse in sync.
The other option is to update Redshift infrequently (hourly?) and use federated queries to combine "recent" data with the "older" data stored in Redshift. This is a more complicated solution and will likely require changes to your data model to support it, but it is doable. So only go here if you really need to query very recent data right alongside older, bigger data.
The best-suited answer is to use a staging table with an UPSERT operation (or Redshift's interpretation of one).
I found this answer valid for my use case, where I needed to:
Keep Redshift as up to date as possible without causing blocking.
Work with complex DynamoDB schemas that can't be used as a source directly, so the data has to be transformed to fit the Redshift DDL.
The architecture works like this:
We constantly load from Kinesis using the same COPY mechanism, but instead of loading directly into the final table, we load into a staging table. Once the batch is loaded into staging, we look for duplicates between the two tables; the duplicate rows in the final table are DELETEd before the INSERT is performed.
After trying this, I've found that all the DELETE operations for the same batch perform better if enclosed in a single transaction. A VACUUM operation is also needed afterwards to re-sort the table and reclaim space after the new load.
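A sketch of that load step in SQL, with hypothetical table names and assuming the staging table already holds the latest Kinesis batch:

    -- Delete the rows in the final table that are about to be replaced, then insert
    -- the new batch; keeping both steps in one transaction avoids readers seeing gaps.
    BEGIN TRANSACTION;

    DELETE FROM final_table
    USING staging_table
    WHERE final_table.id = staging_table.id;

    INSERT INTO final_table
    SELECT * FROM staging_table;

    END TRANSACTION;

    -- Clear the staging table for the next batch (TRUNCATE commits implicitly,
    -- so it stays outside the transaction), then re-sort and reclaim space.
    TRUNCATE staging_table;
    VACUUM final_table;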
For further detail on the UPSERT operation, I've found this source very useful.

Data ingested into SQL DB too large in comparison to Oracle source

Hello, I am using Azure Data Factory to ingest data from Oracle into Azure SQL DB; the data is extracted in CSV format. The problem is that at the source I am reading about 10 GB of data, but once it is loaded into Azure SQL DB the data size becomes 40 GB.
Is that normal? And is there a way to lower the destination data size?
Thank you
Try setting the table compression level. If the table has many numeric columns, SQL Server's ROW compression is similar to Oracle's default storage format for numbers and should produce a similar storage size to Oracle.
PAGE compression adds further compression strategies and may give you ~3x compression.
And if the table has many millions of rows, use a clustered columnstore table, which may provide ~10x compression.
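As a rough sketch (the table and index names are placeholders), the compression options look like this:

    -- Rebuild an existing rowstore table with PAGE (or ROW) compression.
    ALTER TABLE dbo.BigImportTable
    REBUILD WITH (DATA_COMPRESSION = PAGE);

    -- For very large tables, convert to a clustered columnstore
    -- (add WITH (DROP_EXISTING = ON) if a clustered rowstore index already exists).
    CREATE CLUSTERED COLUMNSTORE INDEX cci_BigImportTable
        ON dbo.BigImportTable;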

How do I transfer a huge amount of data (nearly 10 TB) from Oracle DB to Snowflake in hours

How do I transfer a huge amount of data (nearly 10 TB) from an Oracle DB to Snowflake in a matter of hours? I see some options like Hevo and Fivetran, which are paid. However, I need the data to be moved fast so that I don't have to keep the production system down for long.
The fastest way to get data into Snowflake is in 10 MB to 100 MB chunk files. You can then leverage a big warehouse to COPY INTO all of the chunk files at once. I can't speak to how to get the data out of Oracle DB and into S3/Azure Blob quickly, though, especially while the system is running its normal workload.
I recommend you look at this document from Snowflake for reference on the Snowflake side: https://docs.snowflake.net/manuals/user-guide/data-load-considerations-prepare.htm
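A minimal sketch of the load on the Snowflake side, assuming the exported chunk files are gzip-compressed CSVs already sitting in an external stage (all names are hypothetical):

    -- Size the warehouse up for the bulk load, then load every chunk in parallel.
    ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'X-LARGE';

    COPY INTO target_table
    FROM @oracle_export_stage
    FILE_FORMAT = (TYPE = 'CSV' FIELD_OPTIONALLY_ENCLOSED_BY = '"' COMPRESSION = 'GZIP')
    PATTERN = '.*\.csv\.gz';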
Is there a network speed issue?
In any case, the data should be compressed when transferred over the network.
There are three locations involved in the staging:
Oracle database,
the extraction client,
and the cloud storage.
You have two data transfers:
between database and client,
and between client and cloud storage.
If the Oracle version is 12cR2 or newer, the DB client can compress data when extracting it from the database. The data should then be compressed again when it is transferred to the cloud storage of your Snowflake destination.
The final step is to load the data from cloud storage into Snowflake (within the same data center)...
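If the extraction client stages the files itself, the client-side compression for the second transfer can come from Snowflake's PUT command, which gzips files by default before uploading them to an internal stage. A sketch with hypothetical file paths and stage name:

    -- Upload exported CSV chunks to an internal stage, compressing on the client.
    PUT file:///exports/sessions_*.csv @export_int_stage
        AUTO_COMPRESS = TRUE
        PARALLEL = 8;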
Ideally you shouldn't need to keep the production database down. You should be able to categorise the data into:
1 - historical data that will not change. You can extract this data at your leisure, and it should not require the database to be down.
2 - static data that is fairly stable. You can also extract this data at your leisure.
You only need to keep your database fairly stable (not down) while you are extracting the rest of the data. This will require you to build some way to track and validate all your datasets. There is no reason why you couldn't let users continue to read from the database while you are performing the extract from Oracle.

How to do real-time data ingestion from transactional tables to a flat table

We have transaction tables in Oracle, and for reporting purposes we need this data transferred in real time to another flat Oracle table in another database. The performance of the report is great with the data placed in this flat table.
Currently we are using GoldenGate for replication to the other database and a materialized view for this, but due to some problems we need to switch to some other way of populating/maintaining this flat table. What options do we have?
It is a pretty basic requirement, but the solutions I can see are all for batch processing. Please also suggest any other solutions you feel would better serve this purpose; changing the target database to something else is also an option, as there may be more such reports coming ahead.
