Kafka Streams RocksDB Partitioning - apache-kafka-streams

I have a requirement where I want to send expiring data to a long-term database. We are currently using the KeyValueStore and managing the old data ourselves. I know there are WindowStores that will periodically delete old data.
Is there a way to use the KeyValueStore but create a partitioned RocksDB database, so that we can manage which partition records go into and which partitions to delete? Or can we use the windowed store and intercept the process that deletes a partition, so we can send that data on to where it needs to go?

Related

Data Readiness Check

Let's say there is a job A which executes a Python script to connect to Oracle, fetch the data from Table A, and load the data into Snowflake once a day. Application A, which depends on Table A in Snowflake, can just depend on the success of job A for further processing; this is easy.
But if the data movement is via replication (Change Data Capture from Oracle moves to S3 using GoldenGate, pipes push it into a stage, and a stream loads the target using a Task every few minutes), what is the best way to let Application A know that the data is ready? How do you check whether the data is ready? Is there something available in Oracle, like a table-level marker, that can be moved over to Snowflake? Tables in Oracle cannot be modified to add anything new, and marker rows also cannot be added; these are impractical. But something that Oracle provides implicitly, which can be moved over to Snowflake, or some SCN-like number at the table level that can be compared every few minutes, could be a solution. I'm eager to hear any approaches.
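One hedged sketch of the SCN-style idea suggested above: assume the replication pipeline also records the latest applied SCN (or commit timestamp) per table in a hypothetical control table ETL_WATERMARKS in Snowflake, and Application A polls it with snowflake-connector-python before processing. The table, function, and connection details here are illustrative assumptions, not anything Oracle or Snowflake provides out of the box.

```python
import time

import snowflake.connector

# Hypothetical control table maintained by the replication pipeline:
#   ETL_WATERMARKS(table_name STRING, last_applied_scn NUMBER)
# holding the latest Oracle SCN (or commit timestamp) applied to each target table.
def wait_until_ready(table_name: str, required_scn: int, poll_seconds: int = 300) -> None:
    """Block until the replicated table has caught up to the required SCN."""
    conn = snowflake.connector.connect(
        account="my_account", user="etl_user", password="...",
        warehouse="ETL_WH", database="ANALYTICS", schema="CONTROL",
    )
    try:
        while True:
            cur = conn.cursor()
            cur.execute(
                "SELECT last_applied_scn FROM ETL_WATERMARKS WHERE table_name = %s",
                (table_name,),
            )
            row = cur.fetchone()
            if row is not None and row[0] >= required_scn:
                return  # data up to the required SCN has landed; Application A can proceed
            time.sleep(poll_seconds)
    finally:
        conn.close()
```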

Data aggregation during data load to Snowflake using Snowpipe

I am evaluating Snowflake for a reporting use case and am considering Snowpipe for ETL. Data is ingested from S3. The data in S3 contains information about user sessions captured at regular intervals. In Snowflake, I want to store this data in aggregated form. As per the documentation, Snowflake supports only basic transformations and doesn't support GROUP BY and JOIN while copying data from the S3 stage into tables in Snowflake.
I am new to ETL and Snowflake. One approach I was considering is to load the raw detailed data from the stage into a temporary table in Snowflake, then run aggregations (GROUP BY and JOIN) on the temporary table to load the data into the final fact tables. Is this the correct approach for implementing complex transformations?
Temporary tables in Snowflake only stick around for the session they were created in. This means that you won't be able to point a Snowpipe at one.
Instead of a temporary table, point Snowpipe at a transient table to store the raw data, and then truncate the table after some period of time. This will reduce costs. Personally, I'd keep the data in the transient table for as long as possible, provided that it is not too cost-prohibitive; this accounts for potentially late-arriving data, etc.
Yes, your approach looks good to me.
Snowpipe loads your data continuously from S3 to Snowflake, and within Snowflake you use
Views
Tables and Stored Procedures
to transform the data and load it into your final fact table.
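A minimal sketch of that second step, under assumed names (a transient table RAW_SESSIONS fed by Snowpipe and a fact table FACT_SESSION_METRICS), run from Python with snowflake-connector-python. The same INSERT ... SELECT could equally live in a view, a stored procedure, or a scheduled Task.

```python
import snowflake.connector

# Aggregate raw session rows (loaded by Snowpipe) into a daily fact table.
# RAW_SESSIONS and FACT_SESSION_METRICS are hypothetical names for illustration.
AGGREGATE_SQL = """
INSERT INTO FACT_SESSION_METRICS (user_id, session_date, session_count, total_duration_sec)
SELECT r.user_id,
       DATE_TRUNC('day', r.event_ts)  AS session_date,
       COUNT(DISTINCT r.session_id)   AS session_count,
       SUM(r.duration_sec)            AS total_duration_sec
FROM RAW_SESSIONS r
GROUP BY r.user_id, DATE_TRUNC('day', r.event_ts)
"""

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="...",
    warehouse="ETL_WH", database="ANALYTICS", schema="PUBLIC",
)
try:
    conn.cursor().execute(AGGREGATE_SQL)
finally:
    conn.close()
```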

ETL + sync data between Redshift and DynamoDB

I need to aggregate data coming from DynamoDB into AWS Redshift, and I need it to be accurate and in sync. For the ETL I'm planning to use DynamoDB Streams, a Lambda transform, and Kinesis Firehose into, finally, Redshift.
What would the process be for updated data? Everything I find is fine-tuned just for ETL. What would be the best option to keep both (DynamoDB and Redshift) in sync?
These are my current options:
Trigger an "UPDATE" command direct from Lambda to Redshift (blocking).
Aggregate all update/delete records and process them on an hourly basis "somehow".
Does anyone have experience with this? Maybe Redshift is not the best solution? I need to extract aggregated data for reporting / dashboarding on 2 TB of data.
The Redshift COPY command supports using a DynamoDB table as a data source. This may or may not be a possible solution in your case, as there are some limitations to this process: data type and table naming differences can trip you up. It also isn't a great option for incremental updates, but it can be done if the amount of data is small and you can design the updating SQL.
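For reference, a minimal sketch of that COPY path, with hypothetical names (an orders table in Redshift, an Orders table in DynamoDB, and an IAM role Redshift can assume to read it), issued from Python via psycopg2 since Redshift speaks the PostgreSQL wire protocol:

```python
import psycopg2

# COPY straight from a DynamoDB table into Redshift. The Redshift column names
# must match the DynamoDB attribute names (case-insensitively) for this to work.
COPY_SQL = """
COPY orders
FROM 'dynamodb://Orders'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftDynamoReadRole'
READRATIO 50
"""

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="...",
)
try:
    with conn, conn.cursor() as cur:  # commits on success
        cur.execute(COPY_SQL)
finally:
    conn.close()
```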
Another route to look at is DynamoDB Streams. This routes data updates through Kinesis, which can be used to update Redshift at a reasonable rate. It can help keep the data synced between these databases and will likely make the data available to Redshift as quickly as possible.
Remember that you are not going to get Redshift to match on a moment-by-moment basis. Is this what you mean by "in sync"? These are very different databases with very different use cases and architectures to support those use cases. Redshift works in big chunks of data that change more slowly than what typically happens in DynamoDB. Redshift will be updated in "chunks" that happen at a more infrequent rate than on DynamoDB. I've built systems that bring this down to 5-minute intervals, but 10-15 minute update intervals are where most end up when trying to keep a warehouse in sync.
The other option is to update Redshift infrequently (hourly?) and use federated queries to combine "recent" data with "older" data stored in Redshift. This is a more complicated solution and will likely mean changes to your data model to support it, but it is doable. So only go here if you really need to query very recent data right alongside older and bigger data.
The best-suited answer is to use a staging table with an UPSERT operation (or Redshift's interpretation of it).
I found this answer valid for my use case, where the requirements were to:
Keep Redshift as up to date as possible without causing blocking.
Be able to work with complex DynamoDB schemas, which can't be used as a source directly, so the data has to be transformed to fit the Redshift DDL.
The architecture works as follows:
We constantly load from Kinesis using the same COPY mechanism, but instead of loading directly into the final table, we use a staging one. Once the batch is loaded into staging, we look for duplicates between the two tables. Those duplicates in the final table are DELETED before an INSERT is performed.
After trying this, I've found that all DELETE operations for the same batch perform better if enclosed within a single transaction. Also, a VACUUM operation is needed afterwards to re-sort the table after the new load.
For further detail on the UPSERT operation, I've found this source very useful.
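A rough sketch of that staging upsert, assuming hypothetical tables events_staging (the COPY target) and events (the final table) with id as the business key, again via psycopg2:

```python
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="etl_user", password="...",
)
conn.autocommit = True  # transaction boundaries are controlled explicitly below
cur = conn.cursor()
try:
    # Delete the duplicates and insert the new batch inside a single transaction,
    # so readers never see the final table with the batch half-applied.
    cur.execute("""
        BEGIN;
        DELETE FROM events
        USING events_staging
        WHERE events.id = events_staging.id;
        INSERT INTO events SELECT * FROM events_staging;
        COMMIT;
    """)
    # TRUNCATE commits implicitly in Redshift and VACUUM cannot run inside a
    # transaction block, so both are issued on their own afterwards.
    cur.execute("TRUNCATE events_staging;")
    cur.execute("VACUUM events;")
finally:
    conn.close()
```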

How would Redis know whether it has to return cached data or fresh data from the DB

Say I'm fetching thousands of records from the DB using some long-running task and caching them in Redis. The next day somebody changes a few records in the DB.
The next time around, how would Redis know whether it has to return the cached data or go back and revisit all those thousands of records in the DB?
How is this synchronisation achieved?
Redis has no idea whether the data in DB has been updated.
Normally, we use Redis to cache data as follows:
Client checks if the data, e.g. key-value pair, exists in Redis.
If the key exists, client gets the corresponding value from Redis.
Otherwise, it gets the data from the DB and sets it in Redis. The client also sets an expiration, say 5 minutes, for the key-value pair in Redis.
Then any subsequent requests for the same key will be served by Redis, although the data in Redis might be out of date.
However, after 5 minutes, this key will be removed from Redis automatically.
Go to step 1.
So in order to keep the data in Redis up to date, you can set a short expiration time. However, your DB then has to serve lots of requests.
If you want to largely decrease the requests to the DB, you can set a long expiration time, so that most of the time Redis can serve the requests, with possibly stale data.
You should think carefully about the trade-off between performance and stale data.
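A minimal redis-py sketch of those steps, with fetch_records_from_db standing in for the long-running DB query (all names here are illustrative):

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL_SECONDS = 300  # the 5-minute expiration mentioned above

def fetch_records_from_db(query_key: str) -> list:
    """Placeholder for the long-running DB query from the question."""
    return []

def get_records(query_key: str) -> list:
    cached = r.get(query_key)
    if cached is not None:
        return json.loads(cached)                 # served from Redis (possibly stale)
    records = fetch_records_from_db(query_key)    # cache miss: hit the DB
    r.setex(query_key, CACHE_TTL_SECONDS, json.dumps(records))
    return records
```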
Since the source of truth resides in your database and you push data from this DB to Redis, you always have to update from the DB to Redis, or at the very least create another process to sync the data.
My suggestion is to run a first full load from the DB to Redis, and then use a sync process that, every time you notice an update/creation/deletion operation in your database, pushes the change to Redis.
I don't know which Redis structure you are using to store the database records in Redis, but I guess it could be a Hash, probably keyed by your table's primary key, so the sync operation will be immediate: if a record is created in your database you issue an HSET, on deletion an HDEL, and so on.
You could even omit the first full sync from the DB to Redis, and just clean Redis and start the sync process.
If you cannot do the above for some reason, you can create a sync daemon that constantly reads data from the database and compares it with the data stored in Redis: if they differ, you update the entry, and if an entry exists on only one side, you delete or create it in Redis accordingly.
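A sketch of that Hash-based sync, assuming the hypothetical handlers below are invoked by whatever change-detection mechanism feeds the sync process:

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)
HASH_KEY = "records"  # hypothetical Hash: one field per table row, keyed by primary key

def on_row_upserted(row_id: str, row: dict) -> None:
    """Mirror a created or updated DB row into the Hash."""
    r.hset(HASH_KEY, row_id, json.dumps(row))

def on_row_deleted(row_id: str) -> None:
    """Remove a deleted DB row from the Hash."""
    r.hdel(HASH_KEY, row_id)
```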
My solution is:
When you are updating, deleting, or adding new data in the database, you should delete the cached data in Redis. In your GET route, check whether the data exists in Redis; if not, load it from the DB and store it in Redis.
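A short sketch of that invalidate-on-write pattern (the cache key and DB helpers are hypothetical placeholders):

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)
CACHE_KEY = "all_records"  # hypothetical key holding the full cached data set

def load_all_records_from_db() -> list:
    return []  # placeholder for the real DB read

def update_record_in_db(record_id: str, fields: dict) -> None:
    pass  # placeholder for the real DB write

def update_record(record_id: str, fields: dict) -> None:
    update_record_in_db(record_id, fields)
    r.delete(CACHE_KEY)  # any write invalidates the cached data

def get_records() -> list:
    cached = r.get(CACHE_KEY)
    if cached is not None:
        return json.loads(cached)
    records = load_all_records_from_db()
    r.set(CACHE_KEY, json.dumps(records))  # repopulate on the next read
    return records
```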
You may use Spring's @CacheEvict annotation on any update/delete applied to the DB. That would clear the corresponding values from the cache, so the next query would fetch from the DB.

Is it possible to write multiple blobs in a single request?

We're planning to use Azure Blob Storage to save processing log data for later analysis. Our systems are generating roughly 2000 events per minute, and each "event" is a JSON document. Looking at the pricing for blob storage, the sheer number of write operations would cost us tons of money if we took each event and simply wrote it to a blob.
My question is: Is it possible to create multiple blobs in a single write operation, or should I instead plan to create blobs containing multiple event data items (for example, one blob for each minute's worth of data)?
It is possible, but it isn't good practice: it takes a long time for multipart files to be merged. Hence, we try to separate the upload action from the entity-persist operation by passing the entity ID and updating the doc [image] name in another controller.
It also keeps your upload functionality clean. Best wishes.
It's impossible to create multiple blobs in a single write operation.
One feasible solution is to create blobs containing multiple event data items, as you planned (which is hard to implement and query, in my opinion); another solution is to store the event data in Azure Table Storage rather than Blob storage, and leverage an Entity Group Transaction to write table entities in one batch (which is billed as one transaction).
Please note that all table entities in one batch must have the same partition key, which should be considered when you're designing your table (see the Azure Storage Table Design Guide for further information). If some of your events are large enough to exceed the size limitations of Azure Table Storage (1 MB per entity, 4 MB per batch), you can save those events' data to Blob storage and store the blob links in the table.
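For the first solution, a minimal sketch of batching a minute's worth of events into a single blob with the azure-storage-blob SDK (the "events" container, blob naming scheme, and connection-string variable are assumptions):

```python
import json
import os
from datetime import datetime, timezone

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)

def flush_minute_batch(events: list) -> None:
    """Write one minute's worth of events as a single newline-delimited JSON blob."""
    blob_name = datetime.now(timezone.utc).strftime("%Y/%m/%d/%H-%M.ndjson")
    payload = "\n".join(json.dumps(event) for event in events)
    blob = service.get_blob_client(container="events", blob=blob_name)
    blob.upload_blob(payload)  # one write operation instead of ~2000
```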

Resources