Best way to cache large data that is added to (DynamoDB) - caching

I am currently working with large amounts of data that I'm storing in DynamoDB. Once data enters the database it never changes, but new data flows in consistently. My question is how I can cache this data (using DAX if possible) to limit how much of it I have to query the database for directly.
For example, if I want the data from 10:00 AM to 11:00 AM then I can query with the parameters of:
start_time = 10:00 AM,
end_time = 11:00 AM
The response from this query will be cached in DAX for later use. My problem is that when I go to get data between 10:00 AM and 1:00 PM I have to query for data that is already in my cache (this is because the caching is based on parameters and I have new parameters).
My first thought was to cache the data in small sections and just make many queries. For example:
Request the 10:00 - 10:15 AM data and cache it, then request the 10:15 - 10:30 AM data and cache it, and so on. By making many smaller queries this way I won't have overlapping data in my cache. Is this the best approach, or should I cache the overlapping data? Any help is appreciated.

If I understood correctly:
start_time = 10:00 AM, end_time = 11:00 AM ( Cache has no data, hits DynamoDB )
start_time = 10:00 AM, end_time = 11:00 AM ( Cache has this data, doesn't hit DynamoDB )
start_time = 10:00 AM, end_time = 10:30 AM ( Difference in cache keys, hits DynamoDB )
Basically, the cache could hold a full set of the data, but unless you use the same cache keys (which is what produces a cache hit), the cache can never smartly return a "subset" of the full data.
DynamoDB DAX Item Cache
DynamoDB DAX brings along an item cache, where individual items are stored in and returned from DAX. However, the item cache is limited to GetItem and BatchGetItem operations:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DAX.concepts.html#DAX.concepts.item-cache
Fragmenting DDB Query
If DynamoDB DAX is not an option, or if Query and Scan operations are needed, then the next best, least invasive technique is to fragment / partition the DDB query into "smaller" queries so that they result in more cache hits,
e.g.
start_time = 10:00 AM, end_time = 10:15 AM
start_time = 10:15 AM, end_time = 10:30 AM
start_time = 10:30 AM, end_time = 10:45 AM
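As a rough illustration, here is a minimal Python/boto3 sketch of the idea, assuming a table named "events" with a partition key pk and a sort key event_time holding ISO-8601 timestamps, and a plain dict standing in for the cache layer (these names are illustrative, not from the question); the same keying scheme would apply if the calls went through the DAX client instead:

from datetime import datetime, timedelta
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("events")     # hypothetical table name
BLOCK = timedelta(minutes=15)        # cache granularity
cache = {}                           # stand-in for DAX / your cache layer

def floor_to_block(ts):
    # Snap a timestamp down to the start of its 15-minute block.
    return ts.replace(minute=(ts.minute // 15) * 15, second=0, microsecond=0)

def query_block(block_start):
    # One fixed-size block per cache key, so repeated requests are cache hits.
    key = block_start.isoformat()
    if key not in cache:
        block_end = block_start + BLOCK
        resp = table.query(  # single page for brevity; follow LastEvaluatedKey in real code
            KeyConditionExpression=Key("pk").eq("events")
            & Key("event_time").between(block_start.isoformat(), block_end.isoformat())
        )
        # BETWEEN is inclusive, so drop items that belong to the next block.
        cache[key] = [i for i in resp["Items"] if i["event_time"] < block_end.isoformat()]
    return cache[key]

def get_range(start, end):
    # Answer an arbitrary range by stitching fixed blocks together.
    items, cursor = [], floor_to_block(start)
    while cursor < end:
        items.extend(query_block(cursor))
        cursor += BLOCK
    return [i for i in items if start.isoformat() <= i["event_time"] < end.isoformat()]

With this, a request for 10:00 AM - 1:00 PM reuses every 15-minute block already fetched for the earlier 10:00 - 11:00 AM request and only queries DynamoDB for the missing blocks.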
There are a few good third-party libraries you can use to partition your query keys, and you can choose the granularity, from 15-minute blocks down to 1-minute blocks or even second blocks, to suit your performance needs.
This technique is not without cons: clearly, the additional number of hops / queries it now has to make needs to be taken into consideration.
Application ORM
Solving problems like this is what application ORMs are really good at, for example Hibernate in the case of Java development (though last I checked, Hibernate doesn't support DynamoDB quite yet, although it is possible to extend it and build custom strategies).
You could check whether your application ORM has support for DynamoDB:
https://www.baeldung.com/hibernate-second-level-cache

Related

How does BigQuery caching on time partitioned tables work?

In contrast with the BigQuery documentation, we see that it DOES cache the results when selecting data from a streaming, date-partitioned table (Standard SQL).
Example:
When we perform a deterministic date scan on the streaming, date-partitioned table using:
where (_PARTITIONTIME > '2017-11-12' or _PARTITIONTIME is null)
...BigQuery caches the results for 5 to 20 minutes if we fire the exact same query within that time frame.
Yet my reading of the documentation is that it SHOULD NOT cache the data:
'When any of the tables referenced by the query have recently received streaming inserts (a streaming buffer is attached to the table) even if no new rows have arrived'
Important notes:
Our test query queries heartbeat events that genuinely arrive continuously.
We actually want this caching behavior, because we do not always need the data to be current to the last second. We just want to know whether we can really depend on this behavior.
Our Questions:
What is going on here / Why does the BQ caching happen at all?
The time this data stays in the BQ cache seems 'random' (between 5 and 20 minutes). What does this mean?
Thanks for clarifying the question. I think it's an oversight that we didn't disable caching for partitioned tables with streaming data. It should be disabled, as otherwise the query might return outdated results.
We invalidate the cache when the table is changed, and streaming into the table causes the table to change. I guess that's why the cache is invalidated within 5 to 20 minutes.

AWS Kinesis Stream Aggregating Based on Time Spans

I currently have a Kinesis stream that is populated with JSON messages that are in the form of:
{"datetime": "2017-09-29T20:12:01.755z", "payload":"4"}
{"datetime": "2017-09-29T20:12:07.755z", "payload":"5"}
{"datetime": "2017-09-29T20:12:09.755z", "payload":"12"}
etc...
What I'm trying to accomplish here is to aggregate the data into time chunks. In this case, I'd like to compute averages over 10-minute spans. For example, from 12:00 to 12:10, I want to average the payload values and save the result as the 12:10 value.
For example, the above data would produce:
Datetime: 2017-09-29T20:12:10.00z
Average: 7
The method I'm thinking of is to use caching at the service level plus some way of tracking time. Whenever messages move into the next 10-minute timespan, I average the cached data, store it in the DB, and then delete that cache value.
Currently, my service sees 20,000 messages every minute, with higher volume expected in the future. I'm a little stuck on how to implement this so that I'm guaranteed to get all the values for a given 10-minute period from Kinesis. For those of you more familiar with Kinesis and AWS, is there a simple way to go about this?
The reason for doing this is to shorten query times for data from large timespans, such as a year. I wouldn't want to grab millions of values, but rather a few aggregated values.
Edit:
I have to keep track of many different averages at the same time. For example, the above JSON may pertain to just one 'set', such as the average temperature per city in 10-minute timespans. This requires me to keep track of each city's average for every timespan.
Toronto (12:01 - 12:10): average_temp
New York (12:01 - 12:10): average_temp
Toronto (12:11 - 12:20): average_temp
New York (12:11 - 12:20): average_temp
etc...
This could pertain to any city worldwide. If new temperatures arrive for, say, Toronto, and they pertain to the 12:01 - 12:10 timespan, I have to recalculate and store that average.
This is how I would do it. Thanks for the interesting question.
Kinesis Streams --> Lambda (Event Insertor) --> DynamoDB (Streams) --> Lambda (Count and Value Incrementor) --> DynamoDB (Streams) --> Lambda (Average Updater)
DynamoDB Table Structure:
{
  Timestamp: 1506794597,
  Count: 3,
  TotalValue: 21,
  Average: 7,
  Event{timestamp}-{guid}: { event }
}
timestamp -- the timestamp of the actual event
guid -- avoids collisions when two events occur at the same timestamp
Event{timestamp}-{guid} -- this attribute should be removed by the count and value incrementor
When, say, the fourth record for that timespan arrives:
Round its time to the enclosing 10-minute timespan, increment the count, and increment the total value. Never read the value and then increment it; that will produce incorrect results unless you use strongly consistent reads (which are costly). Instead, perform the increment as an atomic counter update:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.AtomicCounters
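For example, a minimal boto3 sketch of that atomic increment, assuming a table named "averages" whose partition key is the Timestamp attribute from the structure above (table and key names are assumptions):

import boto3

table = boto3.resource("dynamodb").Table("averages")   # hypothetical table name

def record_event(timespan_start, value):
    # ADD is applied atomically on the server, so there is no read-modify-write race.
    table.update_item(
        Key={"Timestamp": timespan_start},
        UpdateExpression="ADD #c :one, #t :val",
        ExpressionAttributeNames={"#c": "Count", "#t": "TotalValue"},
        ExpressionAttributeValues={":one": 1, ":val": value},
    )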
Create a DynamoDB stream on the above table and listen to it with another Lambda. Now calculate the average value and update it.
When you calculate the average, don't read from the table; the data you need is already available on the stream record. Just compute the average and update it (overwriting the previous average value).
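A sketch of what that averaging Lambda might look like, assuming it is subscribed to the table's stream and writes back to the same hypothetical "averages" table from the previous snippet (stream records carry attributes in the low-level DynamoDB JSON format):

import boto3
from decimal import Decimal

table = boto3.resource("dynamodb").Table("averages")   # same hypothetical table

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"]["NewImage"]
        count = Decimal(image["Count"]["N"])
        total = Decimal(image["TotalValue"]["N"])
        average = total / count
        # Skip the MODIFY event triggered by our own Average write, to avoid looping.
        if "Average" in image and Decimal(image["Average"]["N"]) == average:
            continue
        table.update_item(
            Key={"Timestamp": int(image["Timestamp"]["N"])},
            UpdateExpression="SET #a = :avg",
            ExpressionAttributeNames={"#a": "Average"},
            ExpressionAttributeValues={":avg": average},
        )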
This will work on any scale and with high availability.
Hope it helps.
EDIT1:
Since the OP is not familiar with AWS services, here are the relevant references.
Lambda Documentation:
https://aws.amazon.com/lambda/
DynamoDB Documentation:
https://aws.amazon.com/dynamodb/
AWS cloud services used for the solution.

Importing data incrementally from RDBMS to hive/hadoop using sqoop

I have an Oracle database and need to import data into a Hive table. The daily import size would be around 1 GB. What would be the better approach?
If I import each day's data as a partition, how can updated values be handled?
For example, suppose I imported today's data as a partition, and the next day some fields are updated with new values.
Using --lastmodified we can get the updated values, but where should they go: to a new partition or to the old (already existing) partition?
If I send them to a new partition, the data is duplicated.
If I want to send them to the already existing partition, how can that be achieved?
Your only option is to overwrite the entire existing partition with 'INSERT OVERWRITE TABLE...'.
The question is: how far back are you going to keep updating the data?
I can think of 3 approaches you can consider:
Decide on a threshold for 'fresh' data, for example '14 days back' or '1 month back'.
Then, each day you run the job, you overwrite partitions (only the ones that have updated values) going backwards, up to the decided threshold (see the sketch after this list).
With ~1 GB a day it should be feasible.
Data from before your decided threshold is not guaranteed to be 100% correct.
This scenario is relevant if you know the fields can only change within a certain time window after they were initially set.
Make your Hive table compatible with ACID transactions, thus allowing updates on the table.
Split your daily job into 2 tasks: the new data being written for the run day, and the updated data that needs to be applied backwards. Sqoop will be responsible for the new data; take care of the updated data 'manually' (some script that generates the update statements).
Don't use partitions based on time. Maybe dynamic partitioning is more suitable for your use case; it depends on the nature of the data being handled.
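A minimal sketch of the first approach, assuming a target table partitioned by a dt string column and a staging table that Sqoop loads each day (all names and the column list are illustrative); it only generates the HiveQL, which you would submit with your usual client:

from datetime import date, timedelta

THRESHOLD_DAYS = 14          # how far back values may still change
TARGET = "events"            # hypothetical table partitioned by dt
STAGING = "events_staging"   # hypothetical table Sqoop loads each day

def overwrite_statements(today=None):
    # Rebuild every partition inside the freshness window; in practice you
    # would only emit statements for partitions that actually received updates.
    today = today or date.today()
    stmts = []
    for back in range(THRESHOLD_DAYS + 1):
        dt = (today - timedelta(days=back)).isoformat()
        stmts.append(
            f"INSERT OVERWRITE TABLE {TARGET} PARTITION (dt='{dt}') "
            f"SELECT id, col1, col2 FROM {STAGING} WHERE dt='{dt}'"
        )
    return stmts

for stmt in overwrite_statements():
    print(stmt)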

How to speed up performance by avoiding querying Mongoid multiple times?

I have approx. 10 million Article objects in a MongoDB database (accessed through Mongoid). The huge number of Article objects makes the queries quite time-consuming to perform.
As exemplified below, I am recording, for each week (e.g. 700 days ago ... 7 days ago, 0 days ago), how many articles are in the database.
But with every query I make, the time consumption increases, and the MongoDB process's CPU usage quickly exceeds 100%.
articles = Article.where(published: true).asc(:datetime)
days = Date.today.mjd - articles.first.datetime.to_date.mjd
days.step(0, -7) do |n|
  current_date = Date.today - n.days
  previous_articles = articles.lt(datetime: current_date)
  previous_good_articles = previous_articles.where(good: true).size
  previous_bad_articles = previous_articles.where(good: false).size
end
Is there a way to load the Article objects into memory, so I only need to hit the database on the first line?
A MongoDB database is not built for that.
I think the best way is to run a daily script that creates your data for that day and saves it in a Redis database (http://www.redis.io).
Redis stores your data in the server's memory, so you can access it at any time of day, and it is very quick.
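A minimal sketch of that idea with redis-py, assuming a daily job has already computed the counts and using a purely illustrative key layout:

import json
import redis

r = redis.Redis()   # assumes a reachable Redis instance

def store_day_summary(day, good_count, bad_count):
    # One small JSON blob per day; later reads are in-memory key lookups.
    r.set(f"articles:{day}", json.dumps({"good": good_count, "bad": bad_count}))

def load_day_summary(day):
    raw = r.get(f"articles:{day}")
    return json.loads(raw) if raw else None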
Don't Repeat Yourself (DRY) is a best practice that applies not only to code but also to processing. Many applications have natural epochs for summarizing data (a day is a good choice in your case), and if the data is historical, it only has to be summarized once. So you reduce the processing of 10 million Article documents down to 700 day-summary documents. You need special code for merging in today's data if you want up-to-the-moment accuracy, but the savings are well worth the effort.
I politely disagree with the statement "A MongoDB database is not built for that." You can see from the above that it is all about not repeating processing. The 700 day-summary documents can be stored in any reasonable data store. Since you are already using MongoDB, simply use another MongoDB collection for the day summaries; there's no need to spin up another data store if you don't want to. The summary data will easily fit in memory, and the reduction in processing means that your working set will no longer be blown out by the historical processing.
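To make the day-summary idea concrete, here is a minimal sketch in Python with PyMongo (rather than Mongoid, just to keep it self-contained); the database name and the article_day_summaries collection are assumptions, while the article fields match the question:

from pymongo import MongoClient, UpdateOne

db = MongoClient()["app"]   # hypothetical database name

# Group published articles by calendar day and count good/bad ones.
pipeline = [
    {"$match": {"published": True}},
    {"$group": {
        "_id": {"$dateToString": {"format": "%Y-%m-%d", "date": "$datetime"}},
        "good": {"$sum": {"$cond": ["$good", 1, 0]}},
        "bad": {"$sum": {"$cond": ["$good", 0, 1]}},
    }},
]

# Upsert one summary document per day; re-running touches only the summaries,
# not the 10 million raw articles.
ops = [
    UpdateOne({"_id": doc["_id"]},
              {"$set": {"good": doc["good"], "bad": doc["bad"]}},
              upsert=True)
    for doc in db.articles.aggregate(pipeline)
]
if ops:
    db.article_day_summaries.bulk_write(ops)

The weekly report then only has to sum a few hundred summary documents (plus a merge of today's partial day) instead of scanning millions of articles at every step.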

Solution to implementing caching layer in pl/sql

I have a function with one argument (a date) which encapsulates a query like
SELECT COUNT(*)
FROM tbl
WHERE some_date_field BETWEEN param_date - INTERVAL '0 1:00:00' DAY TO SECOND
AND param_date
What I want to do is cache the result of this query somewhere with a TTL of 1 minute. The cached result should be shared across all sessions, not just the current one.
Any proposals?
PS: Yes, I know about the Oracle function result cache, but it doesn't fit the requirements.
PPS: Yes, we could create a 2nd artificial argument with a value such as the date in yyyymmddhh24mi format, so that it changes each minute and we're able to use the function result cache, but I'm hoping for a solution that lets me hide the caching dependencies inside the function.
I'd use a global application context, and a job with a refresh interval of 1 minute to set the context.
PS: INTERVAL '1' HOUR is shorter and more meaningful than INTERVAL '0 1:00:00' DAY TO SECOND
You want to cache the result of this query and share the cache across all sessions. The only way I can think of is to wrap the query in a function call and store the result in a small table. The function queries the small table to see if the count has already been stored within the last minute, and if so, returns it.
You would keep the table small by running a job periodically to delete rows in the "cache table" that are older than 1 minute - or, better still, perhaps truncate it.
However, I can only see this being of benefit if the original SELECT COUNT(*) is a relatively expensive query.
