A data structure to query the number of events in different time intervals

My program receives thousands of events per second, of different types. For example, 100k API accesses per second from users with millions of different IP addresses. I want to keep statistics and limit the number of accesses in 1 minute, 1 hour, 1 day and so on. So I need event counts for the last minute, hour or day for every user, and I want them to behave like a sliding window. In this case, the type of an event is the user's address.
I started with a time series database, InfluxDB, but it failed to insert 100k events per second, and the aggregate queries to find event counts in a minute or an hour were even worse. I am sure InfluxDB is not capable of inserting 100k events per second and performing 300k aggregate queries at the same time.
I don't need the events themselves retrieved from the database because each one is just a simple address. I just want to count them as fast as possible in different time intervals, i.e. get the number of events of type x in a specific time interval (for example, the past hour).
I don't need to store statistics on disk, so maybe a data structure that keeps event counts for different time intervals would suit me. It still needs to behave like a sliding window.
Storing all the events in RAM in a linked list and iterating over it to answer queries is another solution that comes to mind, but because the number of events is so high, keeping all of them in RAM is probably not a good idea.
Is there any good data structure or even a database for this purpose?

You didn't provide enough details on the input format of the events or on how they are delivered to the statistics backend: is it a stream of UDP messages, HTTP PUT/POST requests or something else?
One possible solution would be to use the Yandex ClickHouse database.
Rough description of the suggested pattern:
1. Load incoming raw events from your application into a memory-based table Events with the Buffer storage engine.
2. Create a materialized view with per-minute aggregation in another memory-based table EventsPerMinute with the Buffer engine.
3. Do the same for hourly aggregation of data in EventsPerHour.
4. Optionally, use Grafana with the ClickHouse datasource plugin to build dashboards.
In ClickHouse, a table with the Buffer storage engine that is not associated with any on-disk table is kept entirely in memory, and older data is automatically replaced with fresh data. This gives you simple housekeeping for raw data.
The tables (materialized views) EventsPerMinute and EventsPerHour can also be created with the MergeTree storage engine in case you want to keep statistics on disk. ClickHouse can easily handle billions of records.
At 100K events/second you may need some kind of shaper/load balancer in front of the database.

You could use a Hazelcast cluster instead of plain RAM. Graylog or plain Elasticsearch might also work, but with this kind of load you should test first. You can also design your own data structure: build an hour map for each address and put each event into its hour bucket. Once the hour has passed, compute its count and cache it in that hour's bucket. When you need minute granularity, go to the hour's bucket and count the events in that hour's list.
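The bucket idea from this answer can be sketched in Python as below. This is a minimal illustration, not a tuned implementation: the class name `SlidingWindowCounter`, the one-minute bucket size and the one-day retention are assumptions, and it keeps only a count per bucket rather than the list of events. Timestamps are assumed to arrive in non-decreasing order per key. The result is approximate at bucket granularity, because the oldest bucket in a window is counted in full.

```python
import time
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Approximate sliding-window event counts per key, using fixed time buckets."""

    def __init__(self, bucket_seconds=60, max_window_seconds=86400):
        self.bucket_seconds = bucket_seconds
        # Number of buckets retained per key (1440 for one day of minutes).
        self.max_buckets = max_window_seconds // bucket_seconds
        # key -> deque of [bucket_index, count], oldest bucket first.
        self.buckets = defaultdict(deque)

    def add(self, key, now=None):
        """Record one event for `key` at time `now` (seconds since epoch)."""
        now = time.time() if now is None else now
        idx = int(now // self.bucket_seconds)
        q = self.buckets[key]
        if q and q[-1][0] == idx:
            q[-1][1] += 1          # same bucket: just bump the counter
        else:
            q.append([idx, 1])     # first event in a new bucket
        # Drop buckets that fell out of the longest supported window.
        oldest_allowed = idx - self.max_buckets + 1
        while q and q[0][0] < oldest_allowed:
            q.popleft()

    def count(self, key, window_seconds, now=None):
        """Return the number of events for `key` in the last `window_seconds`."""
        now = time.time() if now is None else now
        idx = int(now // self.bucket_seconds)
        n_buckets = max(1, window_seconds // self.bucket_seconds)
        oldest = idx - n_buckets + 1
        q = self.buckets.get(key)
        if not q:
            return 0
        return sum(c for i, c in q if i >= oldest)
```

With millions of distinct addresses, memory is bounded by the number of non-empty buckets per key, so coarser buckets (or a separate coarse counter for the one-day window) trade precision for RAM.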

Related

How database sharding based on timestamp and pkid is better in performance?

I was reading a system design for Twitter.
Sharding based on UserID: We can try storing all the data of a user on one server. While storing, we can pass the UserID to our hash function that will map the user to a database server where we will store all of the user’s tweets, favorites, follows, etc. While querying for tweets/follows/favorites of a user, we can ask our hash function where can we find the data of a user and then read it from there. This approach has a couple of issues:
What if a user becomes hot? There could be a lot of queries on the server holding the user. This high load will affect the performance of our service.
Over time some users can end up storing a lot of tweets or having a lot of follows compared to others. Maintaining a uniform distribution of growing user data is quite difficult.
To recover from these situations either we have to repartition/redistribute our data or use consistent hashing.
Sharding based on TweetID: Our hash function will map each TweetID to a random server where we will store that Tweet. To search for tweets, we have to query all servers, and each server will return a set of tweets. A centralized server will aggregate these results to return them to the user. Let’s look at a timeline generation example; here are the steps our system has to perform to generate a user’s timeline:
Our application (app) server will find all the people the user follows.
App server will send the query to all database servers to find tweets from these people.
Each database server will find the tweets for each user, sort them by recency and return the top tweets.
App server will merge all the results and sort them again to return the top results to the user.
This approach solves the problem of hot users, but, in contrast to sharding by UserID, we have to query all database partitions to find tweets of a user, which can result in higher latencies.
We can further improve our performance by introducing cache to store hot tweets in front of the database servers.
Sharding based on Tweet creation time: Storing tweets based on creation time will give us the advantage of fetching all the top tweets quickly and we only have to query a very small set of servers. The problem here is that the traffic load will not be distributed.
What if we can combine sharding by TweetID and Tweet creation time? If we don’t store tweet creation time separately and use TweetID to reflect that, we can get benefits of both the approaches. This way it will be quite quick to find the latest Tweets. For this, we must make each TweetID universally unique in our system and each TweetID should contain a timestamp too.
We can use epoch time for this. Let’s say our TweetID will have two parts: the first part will represent epoch seconds and the second part will be an auto-incrementing sequence. So, to make a new TweetID, we can take the current epoch time and append an auto-incrementing number to it. We can figure out the shard number from this TweetID and store it there. How does this approach perform better than the approaches above?
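To make the ID layout described above concrete, here is a small Python sketch. The 17-bit sequence width, the function names and the modulo-based shard choice are assumptions for illustration only; the quoted design does not specify them.

```python
# Assumption: 17 bits for the per-second sequence (131072 IDs per second).
SEQUENCE_BITS = 17


def make_tweet_id(epoch_seconds, sequence):
    """Pack epoch seconds (high bits) and a sequence number (low bits) into one ID."""
    return (epoch_seconds << SEQUENCE_BITS) | (sequence % (1 << SEQUENCE_BITS))


def creation_time(tweet_id):
    """Recover the creation time (epoch seconds) from the ID alone."""
    return tweet_id >> SEQUENCE_BITS


def shard_for(tweet_id, num_shards):
    """Hypothetical shard choice: the low (sequence) bits spread writes across shards."""
    return tweet_id % num_shards
```

Because the timestamp sits in the high bits, IDs sort by creation time, so each shard can find its latest tweets from its primary index without a separate creation-time column. Because the shard is derived from the whole ID, tweets created in the same second still land on different servers, which avoids the write hotspot of sharding purely by creation time.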

elasticsearch bulk ingestion how to avoid updates

Within my product I use Elasticsearch for storing CDRs (call them txn logs, if you will). My transactions are asynchronous and happen at a very fast rate, i.e. around 5000 txns/sec. A transaction involves submitting a request to a network entity, and at some later point in time I receive the response.
Ingestion into ES earlier involved a two-phase operation: 1) add an entry to ES as soon as I submit to the network layer; 2) when I get the response, update the previous entry with additional status such as delivery succeeded.
I am doing this with the bulk insertion method, in which the bulk records contain both inserts and updates. As a result, ingestion is very slow, which ended up hogging / halting my application. Later, we changed the ingestion technique so that we only insert into Elasticsearch when we get the final response. Until then we store the data in a Redis store. But this has the disadvantages of possible data loss and non-realtime reports.
So, I was looking at an option like having 2 indexes for the same record. The parent index will have all the data, and the child record will have the delivery status. I don't know if this is possible. I read about nested queries and has-child, has-parent queries. What I am unsure about is whether I can insert the parent and child data at separate points in time, without having to use update. Or should I create two different records with a common txn-id without worrying about parent/child?
What is the best way?

Aggregate timeseries data over various timeframes

I have a question about how to aggregate time series data that is coming into DynamoDB.
Currently energy usage data is coming into DynamoDB every 30 seconds per device. The devices are also spread across many timezones.
I want to show the aggregate energy usage over one hour, one day, one month, and one year.
I know one way I can do it is to run a Lambda on a 1-hour cron job that takes all of the readings for the previous hour, adds them together and records the result in a different table.
In that same cron job, the Lambda can check whether any device's timezone has just had its day end, and if so batch up the previous 24 hours into a single day reading.
The same goes for month and year.
But something tells me there is another, better way to do all this (probably using some other AWS service which I am not thinking of).
Instead of a cron job, you can use DynamoDB Streams.
In this case, when a record comes into your data collection table, it can kick off a Lambda function that updates your aggregate tables. That will allow you to get more timely updates into the aggregate tables. The logic for deciding which hour/day/month/year a record gets aggregated into should be in that Lambda.
Also, I’d use a CloudWatch event instead of cron...
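The bucketing logic this answer places in the Lambda could look roughly like the Python sketch below. It deliberately leaves out any AWS calls: the `aggregates` dict stands in for the aggregate tables, and the function names, the key format and the use of a fixed UTC offset per device (which ignores daylight saving time) are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Stand-in for the aggregate tables: bucket key -> summed energy usage.
aggregates = defaultdict(float)


def bucket_keys(device_id, utc_timestamp, utc_offset_hours):
    """Return the hour/day/month/year bucket keys in the device's local time.

    `utc_timestamp` must be a timezone-aware datetime.
    """
    local = utc_timestamp.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return {
        "hour": f"{device_id}#{local:%Y-%m-%dT%H}",
        "day": f"{device_id}#{local:%Y-%m-%d}",
        "month": f"{device_id}#{local:%Y-%m}",
        "year": f"{device_id}#{local:%Y}",
    }


def handle_reading(device_id, utc_timestamp, utc_offset_hours, energy):
    """Add one reading to every aggregate bucket it belongs to."""
    for key in bucket_keys(device_id, utc_timestamp, utc_offset_hours).values():
        aggregates[key] += energy
```

Since each reading is added to its local-time buckets as it arrives, there is no need to detect when a device's day, month or year has ended.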

How does BigQuery caching on time partitioned tables work?

In contrast with the BigQuery documentation, we see that it DOES cache the results when selecting data from a streaming, date-partitioned table (Standard SQL).
Example:
When we perform a deterministic date scan on the streaming, date-partitioned table using:
where (_PARTITIONTIME > '2017-11-12' or _PARTITIONTIME is null)
...BigQuery caches the data for 5 to 20 minutes if we fire the exact same query within that time frame.
In my interpretation, however, the documentation states that it SHOULD NOT cache the data:
'When any of the tables referenced by the query have recently received streaming inserts (a streaming buffer is attached to the table) even if no new rows have arrived'
Important notes:
Our test query queries heartbeat events that really arrive at us continuously
We actually want this caching behavior, because we do not always need the data to be current to the last second. We just want to know whether we can really depend on this behavior.
Our Questions:
What is going on here / Why does the BQ caching happen at all?
The time this data stays in the BQ cache is 'random' (between 5-20 minutes). What does this mean?
Thanks for clarifying the question. I think it's an oversight that we didn't disable caching for partitioned tables with streaming data. It should be disabled, as otherwise the query might return outdated results.
We invalidate the cache when the table is changed. Streaming into the table will cause the table to be changed. I guess that's why the cache is invalidated after 5 to 20 minutes.

high volume data storage and processing

I am building a new application where I expect a high volume of geolocation data, something like a moving object sending geo coordinates every 5 seconds. This data needs to be stored in some database so that it can be used for tracking the moving object on a map at any time. I am expecting about 250 coordinates per moving object per route. Each object can run about 50 routes a day, and I have 900 such objects to track. So that comes to about 11.25 million geo coordinates (250 × 50 × 900) to store per day. I have to store at least one week of data in my database.
This data will basically be used for simple queries like finding all the geo coordinates for a particular object and a particular route. So the queries are not very complicated, and this data will not be used for any analysis purpose.
So my question is: should I just go with a normal Oracle database like 12c distributed over two VMs, or should I think about big data technologies like NoSQL or Hadoop?
One of the key requirements is high performance. Each query has to respond within 1 second.
Since you know the volume of data (roughly 11 million rows per day), you can easily simulate your whole scenario in an Oracle DB and test it thoroughly beforehand.
My suggestion is to go for day-level partitions with 2 sub-partitions, such as objects and routes. All your business SQL always has to hit the right partitions.
You might also need to clear older days' data. Alternatively, creating some sort of aggregation for past days and deleting the raw data would help.
It's well doable in 12c.
