Aggregate timeseries data over various timeframes - aws-lambda

I have a question about how to aggregate time series data that is coming into DynamoDB.
Currently energy usage data is coming into DynamoDB every 30 seconds per device. The devices are also spread across many timezones.
I want to show the aggregate energy usage over one hour, one day, one month, and one year.
I know one way that I can do it is run a Lambda on a 1 hour cron job that takes all of the readings for the previous hour and adds them all together and then records that in a different table in.
At the same time in that cron job the Lambda can check if any devices timezones just had their day end, and if so batch up the previous 24 hours for into a single day reading.
The same goes for month, and year.
But something tells me there is a another, better, way to do all this (probably using some otherAWS service which I am not thinking of)

Instead of a cron job, you can use dynamoDB streams.
In this case, when a record comes into your data collection table, it can kick off a lambda function that updates your aggregate tables. That will allow you to get more timely updates into the aggregate tables. The logic for what hour/day/month/year your record gets aggregated should be in that lambda.
Also, I’d use a cloud watch event instead of cron...

Related

How to schedule code for a specific date on a serverless architecture

I know there are ways to implement cron jobs in a serverless architecture such that a specific code is called at a periodic rate, but I want to schedule different times for different codes.
Idealy I will have a Queue with Events being added and each Event would have a date at which it will be removed from the Queue and sent to a Function. But during a Google search I couldn't find any architectures like this, just periodic, recurring, scheduling, like a "call this every 60 seconds".
Are there any widely adopted architectures like this?
You could create a CloudWatch Event rule for each of the different times. If they are just based on a specific point in time (e.g. January 31st at 1PM) and not recurring (e.g. every day at 2PM), you could have the lambda delete the rule after running.
If the times are vast, and that solution isn't very scalable, you could have the lambda run every minute (or whatever frequency is reasonable for you) and have that lambda read records from a DynamoDB table with the times in it. You can put the time value in the sort key on a GSI and query for records that are in the past and haven't been run yet. After doing your work you update the DynamoDB record so that it doesn't return in future runs.

a data structure to query number of events in different time interval

My program receives thousands of events in a second from different types. For example 100k API access in a second from users with millions of different IP addresses. I want to keep statistics and limit number of accesses in 1 minute, 1 hour, 1 day and so on. So I need event counts in last minute, hour or day for every user and I want it to be like a sliding window. In this case, type of event is the user address.
I started using a time series database, InfluxDB; but it failed to insert 100k events per second and aggregate queries to find event counts in a minute or an hour is even worse. I am sure InfluxDB is not capable of inserting 100k events per second and performing 300k aggregate queries at the same time.
I don't want events retrieved from the database because they are just a simple address. I just want to count them as fast as possible in different time intervals. I want to get the number of events of type x in a specific time interval (for example, past 1 hour).
I don't need to store statistics in the hard disk; so maybe a data structure to keep event counts in different time intervals is good for me. On the other hand, I need it to be like a sliding window.
Storing all the events in RAM in a linked-list and iterating over it to answer queries is another solution that comes to my mind but because the number of events is too high, keeping all of the events in RAM could not be a good idea.
Is there any good data structure or even a database for this purpose?
You didn't provide enough details on events input format and how events can be delivered to statistics backend: is it a stream of udp messages, http put/post requests or smth else.
One possible solution would be to use Yandex Clickhouse database.
Rough description of suggested pattern:
Load incoming raw events from your application into memory-based table Events
with Buffer storage engine
Create materialized view with per-minute aggregation in another
memory-based table EventsPerMinute with Buffer engine
Do the same for hourly aggregation of data in EventsPerHour
Optionally, use Grafana with clickhouse datasource plugin to build
dashboards
In Clickhouse DB Buffer storage engine not associated with any on-disk table will be kept entirely in memory and older data will be automatically replaced with fresh. This will give you simple housekeeping for raw data.
Tables (materialized views) EventsPerMinute and EventsPerHour can be also created with MergeTree storage engine if case you want to keep statistics on disk. Clickhouse can easily handle billions of records.
At 100K events/second you may need some kind of shaper/load balancer in front of database.
you can think of a hazelcast cluster instead of simple ram. I also think a graylog or simple elastic seach but with this kind of load you shoud test. You can think about your data structure as well. You can construct a hour map for each address and put the event into the hour bucket. And when the time passes the hour you can calculate the count and cache in this hour's bucket. When you need a minute granularity you go to hours bucket and count the events under the list of this hour.

AWS Kinesis Stream Aggregating Based on Time Spans

I currently have a Kinesis stream that is populated with JSON messages that are in the form of:
{"datetime": "2017-09-29T20:12:01.755z", "payload":"4"}
{"datetime": "2017-09-29T20:12:07.755z", "payload":"5"}
{"datetime": "2017-09-29T20:12:09.755z", "payload":"12"}
etc...
What im trying to accomplish here is to aggregate the data in terms of time chunks. In this case, i'd like to group the averages for 10 minute spans. For example, from 12:00 > 12:10, I want to average the payload value and save it as the 12:10 value.
For example, the above data would produce:
Datetime: 2017-09-29T20:12:10.00z
Average: 7
The method that i'm thinking of is to use caching at the service level and then some type of way to track the time. If the messages ever move into the next 10 minute timespan, I average the cached data, store it to the DB and then delete that cache value.
Currently, my service sees 20,000 messages every minute with higher volume to be expected in the future. I'm a little stuck on how to implement this to guarantee I get all the values for that 10 minute time period from Kinesis. Those of you that are more familiar with Kinesis and AWS, is there a simple way to go about this?
The reason for doing this is to shorten the query times for data from large timespans, such as for 1 year. I wouldn't want to grab millions of values but rather, a few aggregated values.
Edit:
I have to keep track of many different averages at the same time. For example, the above JSON may just pertain to one 'set', such as the average temperature per city in 10 minute timespans. This requires me to keep track of each cities averages for every timespan.
Toronto (12:01 - 12:10): average_temp
New York (12:01 - 12:10): average_temp
Toronto (12:11 - 12:20): average_temp
New York (12:11 - 12:20): average_temp
etc...
This could pertain to any city worldwide. If new temperatures arrive for say, Toronto and it pertains to the 12:01 - 12:10 timespan, I have to recalculate and store that average.
This is how I would do it. Thanks for the interesting question.
Kinesis Streams --> Lambda (Event Insertor) --> DynamoDB(Streams) --> Lambda(Count and Value incrementor) --> DynamoDB(streams) --> Average (Updater)
DynamoDB Table Structure:
{
Timestamp: 1506794597
Count: 3
TotalValue: 21
Average: 7
Event{timestamp}-{guid}: { event }
}
timestamp -- timestamp of the actual event
guid -- avoid any collision on a timestamp that occurred at same time
Event{timestamp}-{guid} -- This should be removed by (count and value incrementor)
If the fourth record for that timestamp arrives,
Get the time close to 10 min timespan, increment the count, increment the totalvalue. Neve read the value and increment, that will result in error unless you use strong consistency(which is very costly to read). Instead perform the increment operation with atomic increment.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/WorkingWithItems.html#WorkingWithItems.AtomicCounters
Create DynamoDB streams from the above table, Listen on another lambda, Now calculate the average value and update the value.
When you calculate the average, don't perform a read from the table. Instead the data will be available over the stream, you just need to calculate the average and update it. (overwrite previous average value).
This will work on any scale and with high availability.
Hope it helps.
EDIT1:
Since the OP is not familier with AWS Services,
Lambda Documentation:
https://aws.amazon.com/lambda/
DynamoDB Documentation:
https://aws.amazon.com/dynamodb/
AWS cloud services used for the solution.

MongoID where queries map_reduce association

I have an application who is doing a job aggregating data from different Social Network sites Back end processes done Java working great.
Its front end is developed Rails application deadline was 3 weeks for some analytics filter abd report task still few days left almost completed.
When i started implemented map reduce for different states work great over 100,000 record over my local machine work great.
Suddenly my colleague gave me current updated database which 2.7 millions record now my expectation was it would run great as i specify date range and filter before map_reduce execution. My believe was it would result set of that filter but its not a case.
Example
I have a query just show last 24 hour loaded record stats
result comes 0 record found but after 200 seconds with 2.7 million record before it comes in milliseconds..
CODE EXAMPLE BELOW
filter is hash of condition expected to check before map_reduce
map function
reduce function
SocialContent.where(filter).map_reduce(map, reduce).out(inline: true).entries
Suggestion please.. what would be ideal solution in remaining time frame as database is growing exponentially in days.
I would suggest you look at a few different things:
Does all your data still fit in memory? You have a lot more records now, which could mean that MongoDB needs to go to disk a lot more often.
M/R can not make use of indexes. You have not shown your Map and Reduce functions so it's not possible to point out mistakes. Update the question with those functions, and what they are supposed to do and I'll update the answer.
Look at using the Aggregation Framework instead, it can make use of indexes, and also run concurrently. It's also a lot easier to understand and debug. There is information about it at http://docs.mongodb.org/manual/reference/aggregation/

Designing a Calendar system like Google Calendar [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I have to create something similiar to Google Calendar, so I created an events table that contains all the events for a user.
The hard part is handling re-occurring events, the row in the events table has an event_type field that tells you what kind of event it is, since an event can be for a single date only, OR a re-occuring event every x days.
The main design challenge is handling re-occurring events.
When a user views the calendar, using the month's view, how can I display all the events for the given month? The query is going to be tricky, so I thought it would be easier to create another table and create a row for each and every event, including the re-occuring events.
What do you guys think?
I'm tackling exactly this problem, and I had completely spaced iCalendar (rfc 2445) up until reading this thread, so I have no idea how well this will or won't integrate with that. Anyway the design I've come up with so far looks something like this:
You can't possibly store all the instances of a recurring event, at least not before they occur, so I simply have one table that stores the first instance of the event as an actual date, an optional expiration, and nullable repeat_unit and repeat_increment fields to describe the repetition. For single instances the repition fields are null, otherwise the units will be 'day', 'week', 'month', 'year' and increment is simply the multiple of units to add to start date for the next occurrence.
Storing past events only seems advantageous if you need to establish relationships with other entities in your model, and even then it's not necessary to have an explicit "event instance" table in every case. If the other entities already have date/time "instance" data then a foreign key to the event (or join table for a many-to-many) would most likely be sufficient.
To do "change this instance"/"change all future instances", I was planning on just duplicating the events and expiring the stale ones. So to change a single instances, you'd expire the old one at it's last occurrence, make a copy for the new, unique occurrence with the changes and without any repetition, and another copy of the original at the following occurrence that repeats into the future. Changing all future instances is similar, you would just expire the original and make a new copy with the changes and repition details.
The two problems I see with this design so far are:
It makes MWF-type events hard to represent. It's possible, but forces the user to create three separate events that repeat weekly on M,W,F individually, and any changes they want to make will have to be done on each one separately as well. These kind of events aren't particularly useful in my app, but it does leave a wart on the model that makes it less universal than I'd like.
By copying the events to make changes, you break the association between them, which could be useful in some scenarios (or, maybe it would just be occasionally problematic.) The event table could theoretically contain a "copied_from" id field to track where an event originated, but I haven't fully thought through how useful something like that would be. For one thing, parent/child hierarchical relationships are a pain to query from SQL, so the benefits would need to be pretty heavy to outweigh the cost for querying that data. You could use a nested-set instead, I suppose.
Lastly I think it's possible to compute events for a given timespan using straight SQL, but I haven't worked out the exact details and I think the queries usually end up being too cumbersome to be worthwhile. However for the sake of argument, you can use the following expression to compute the difference in months between the given month and year an event:
(:month + (:year * 12)) - (MONTH(occursOn) + (YEAR(occursOn) * 12))
Building on the last example, you could use MOD to determine whether difference in months is the correct multiple:
MOD(:month + (:year * 12)) - (MONTH(occursOn) + (YEAR(occursOn) * 12), repeatIncrement) = 0
Anyway this isn't perfect (it doesn't ignore expired events, doesn't factor in start / end times for the event, etc), so it's only meant as a motivating example. Generally speaking though I think most queries will end up being too complicated. You're probably better off querying for events that occur during a given range, or don't expire before the range, and computing the instances themselves in code rather than SQL. If you really want the database to do the processing then a stored procedure would probably make your life a lot easier.
As previously stated, don't reinvent the wheel, just enhance it.
Checkout VCalendar, it is open source, and comes in PHP, ASP, and ASP.Net (C#)!
Also you could check out Day Pilot which offers a calendar written in Asp.Net 2.0. They offer a lite version that you could check out, and if it works for you, you could purchase a license.
Update (9/30/09):
Unless of course the wheel is broken! Also, you can put a shiny new coat of paint if you like (ie: make a better UI). But at least try to find some foundation to build off of, since the calendar system can be tricky (with Repeating events), and it's been done thousands of times.
Attempting to store each instance of every event seems like it would be really problematic and, well, impossible. If someone creates an event that occurs "every thursday, forever", you clearly cannot store all the future events.
You could try to generate the future events on demand, and populate the future events only when necessary to display them or to send notification about them. However, if you are going to build the "on-demand" generation code anyway, why not use it all the time? Instead of pulling from the event table, and then having to use on-demand event generation to fill in any new events that haven't been added to the table yet, just use the on-demand event generation exclusively. The end result will be the same. With this scheme, you only need to store the start and end dates and the event frequency.
I don't see any way that you can avoid having on-demand event generation, so I can't see the utility in the event table. If you want it for the sake of caching, then I think you're taking the wrong approach. First, it's a poor cache because you can't avoid on-demand event generation anyway. Second, you should probably be caching at a higher level anyway. If you want to cache, then cache generated pages, not events.
As far as making your polling more efficient, if you are only polling every 15 minutes, and your database and/or server can't handle the load, then you are already doomed. There's no way your database will be able to handle users if it can't handle much, much more frequent polling without breaking a sweat.
I would say start with the ical standard. If you use it as your model, then you'll be able to do everything that google calendar, outlook, mac ical (the program), and get virtually instant integration with them.
From there, time to bone up on your ajax and javascript cuz you can't have a flashy web ui with drag drop and multiple calendars without a ton of ajax and javascript.
You should have a start date, end date, and expiration date. Single day events would have the same start date and end date, and allows you to do partial day events as well. As for re-occuring events, then the start and end date would be for the same day, but have different times, then you have an enumeration or table that specifies the repeat frequency (daily, weekly, monthly, etc).
This allows you to say "this event appears every day" for daily, "this event appears on the 2nd day of every week" for weekly, "this event appears on the 5th day of every month" for monthly, "this event appears on the 215th day of every year" for yearly as long as the date is less than the expiration date.
Darren,
That is how I have designed my events table actually, but when thinking about it, say I have 100K users who have create events, this table will be hit pretty hard, especially if I am going to be sending out emails to remind people of their events (events can be for specific times of the day also!), so I could be polling this table every 15 minutes potentially!
This is why I wanted to create another table that would expand out all the events/re-occuring events, this way I can simple hit that table and get the users months view of events without doing any complicated querying and business logic, AND it will make polling much more effecient also.
The question is, should this secondary table be for the next day or month? What makes more sense? Since the maximum a user can view is a months view, I am leaning towards a table that writes out all the events for a given month.
(ofcourse I will have to maintain this secondary table for any edits the user might make to the original events table).
ChanChan,
I have designed it with the same sort of functionality actually, but I am just referring to how I will go about storing events, specifically how to handle re-occurring events.
The brute-force-ish but still reasonable way would be to create a new row in your single events table for every instance of the recurring event, all pointing not to the event preceding it in the series but to the first event in the series. This simplifies selecting and/or deleting all elements in a particular series, since you can select based on parent id. It also allows users to delete individual items from a series without affecting the rest of them.
This query gets you the series that starts on element 3:
SELECT * FROM events WHERE id = 3 OR parentid = 3
To get all items for this month, assuming you'd have a start date and an end date in your events table, all you'd have to do is:
SELECT * FROM events WHERE startdate >= '2008-08-01' AND enddate <= '2008-08-31'
Handling the creation/modification of series programmatically wouldn't be very difficult, but it really would depend on the feature set you want to provide and how you think it'll be used. If you want to differentiate between series and events, you could have a separate series table and a nullable series_id on your events, allowing you the freedom to muck about with individual events while still retaining control over your series.
From past experience I would create a new record for each occurring event and then have a column which references the previous event so you can track all events in a series.
This has two advantages:
No complicated routines to work out the next event date
Individual occurrences can be edited without effecting the rest
Hope this gives you some food for thought :)
I have to agree with #ChanChan on reading the ical spec for how to store these things. There is no easy way to handle recurrences, especially ones that have exceptions. I've built and rebuilt and rebuilt a database to handle this, and I keep coming back to ical.
It's not a bad idea to create a subordinate table, depending on use cases. The algorithm for calculating exactly when occurrences . . . um, occur . . . can indeed be quite complex. There's no getting away from running it, but it's worth considering caching the results.
#GateKiller
I hadn't thought of the case where you edit individual occurrences. It makes sense you would store the occurrences separately in that case.
When you do that, though, how far in the future do you store events? Do you pick an arbitrary date? Do you auto-generate the new occurrences the first time a user browses out into future months/years?
Also, how do you handle the case where the user wants to edit the whole series. "We've had a meeting every Tuesday morning at 10:30 but we're going to start meeting on Wednesday at 8"
I think I understand your second paragraph to mean you are considering a second events table that has a row for each occurrence of an event. I would avoid that.
Re-occurring events should have a start date and a stop date (which could be Null for events that continue every X days "forever") You'll have to decide what kinds of frequency you want to allow -- every X days, the Nth day of each month, every given weekday, etc.
I'd probably tend toward two tables - one for one time events and a second for recurring events. You'll have to query and display the two separately.
If I were going to tackle this (and I'd try as hard as I can to avoid reinventing this wheel) I'd look for open-source libraries or, at the very least, open source projects with Calendars that you can look at. Any recommendations guys?
undefined wrote:
…this table will be hit pretty
hard, especially if I am going to be
sending out emails to remind people of
their events (events can be for
specific times of the day also!), so I could be polling this table every 15 minutes potentially!
Create a table for the notifications. Poll only it.
Update the notification table when events (recurring or otherwise) are updated.
EDIT: A database View might not violate normal forms, if that's a concern. But, you'll probably want to track which notifications were sent and which have not yet been sent somewhere anyway.
Derek Park,
I would be creating each and every instance of an event in a table, and this table would be regenerated every month (so any event that was set to reoccurr 'forever' would be regenerated one month in advance using a windows service or maybe at the sql server level).
The polling won't only be done every 15 minutes, that might only be for polls related to email notifications. When someone wants to view their events for a month, I will have to fetech all their events, and re-occuring events and figure out which events to display (since a re-occuring event might have been created 6 months ago, but relates to a month the user is viewing).
Zack, i'm not too concerned with having a perfectly normalized database, the fact that I'm thinking of creating a secondary table is already breaking one of the rules hehe. My core database tables are following 'the rules', but I don't mind creating secondary tables/columns at times when it benefits things performance wise.
That is how I have designed my events table actually, but when thinking about it, say I have 100K users who have create events, this table will be hit pretty hard, especially if I am going to be sending out emails to remind people of their events (events can be for specific times of the day also!), so I could be polling this table every 15 minutes potentially!
Databases do exception jobs of handling sets of data, so i wouldn't be too worried about that. What you should do is use this as your primary table, and then as events expire then move them into another table (like an archive).
The next thing is you want to try is to query the db as little as possible, so move the information into a caching tier (like velocity) and just persist data to the database.
Then, you can partition the information across multiple databases for scaling purposes. ie users 1-10000 calendars exist on server 1, 10001 - 20000 exist on server 2, etc.
That's how i would scale a solution like this, but i still think the original solution i proposed is the way to go, it's just how you scale it that becomes the question.
The Ra-Ajax Calendar starter-kit features a sample of handling the RenderDate event which can modify the dates of specifically. Though the "recurring events" is more of an algorithmic thing and here I doubt very few calendars will help you much...
If anyone is doing Ruby there's a great library Runt that does this kind of thing. Worth checking out. http://runt.rubyforge.org/

Resources