Aerospike: get upsert time without explicitly storing it for records with a TTL - user-defined-functions

Aerospike is blazingly fast and reliable, but expensive. The cost, for us, is based on the amount of data stored.
We'd like the ability to query records based on their upsert time. Currently, when we add or update a record, we set a bin to the current epoch time and can run scan queries on this bin.
It just occurred to me that Aerospike knows when to expire a record based on when it was upserted, and since we can query the TTL value from the record metadata via a simple UDF, it might be possible to infer the upsert time for records with a TTL. We're effectively using space to store a value that's already known.
Is it possible to access record creation or expiry time, via UDF, without explicitly storing it?

At this point, Aerospike only stores the void time along with the record (the time when the record expires), so the upsert time is unfortunately not available. Stay tuned, though, as I hear there are plans for some new features that may help you. (I am part of Aerospike's OPS/Support team.)

Void time: this tracks the life of a key in the system. It is the time at which the key should expire and is used by the eviction subsystem.
So the TTL is derived from the void time.
Given the TTL of a record, we can only calculate the void time (now + ttl).
Based on what you have, I think you can derive the upsert time from the TTL only if you apply the same expiration to all your records, say CONSTANT_EXPIRATION_TIME.
In that case:
upsert_time = now - (CONSTANT_EXPIRATION_TIME - ttl)
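A minimal sketch of that calculation from the client side, assuming the Aerospike Python client (the aerospike package) and that every record was written with the same CONSTANT_EXPIRATION_TIME; the namespace, set, and key names are illustrative:

```python
import time
import aerospike  # assumes the Aerospike Python client is installed

CONSTANT_EXPIRATION_TIME = 30 * 24 * 3600  # the TTL applied to every upsert (assumption)

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

def approximate_upsert_time(namespace, set_name, user_key):
    # get() returns (key, meta, bins); meta includes the remaining TTL in seconds.
    _key, meta, _bins = client.get((namespace, set_name, user_key))
    remaining_ttl = meta["ttl"]
    # upsert_time = now - (CONSTANT_EXPIRATION_TIME - ttl)
    return int(time.time()) - (CONSTANT_EXPIRATION_TIME - remaining_ttl)
```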
HTH

Related

Redis expire a key but don't delete?

In redis, is there a way to mark a key as expired but not actually drop it from the database? A use case for this would be to let a client know that they need to pull in fresh data from a producer of that data, but if the producer is unreachable, still have the stale data available in the cache. This would keep things simpler, since one would not need a separate database or Redis cache for stale data alongside the data that is meant to expire and trigger a cache update.
There's no built-in way to do that.
In order to achieve this, you have to do the expiration work yourself. You can save the data into a hash together with the last modification time:
hset key data value time last-modification-time
When you retrieve the data, you can compare the last-modification-time with the current time to check whether the data is stale.
When you update the value, also update the last modification time with the current time.
To make this easy to use, and atomic, wrap the logic in a Lua script.
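As a rough illustration of the hash-plus-timestamp pattern, here is a sketch in Python with redis-py rather than a Lua script; the key, field, and threshold names are made up, and for full atomicity the read-and-check could be moved into a server-side Lua script (e.g. via register_script), as suggested above:

```python
import time
import redis  # assumes the redis-py client is available

r = redis.Redis(decode_responses=True)

STALE_AFTER = 15 * 60  # seconds; hypothetical staleness window

def set_value(key, value):
    # Store the payload together with its last modification time,
    # mirroring: hset key data value time last-modification-time
    r.hset(key, mapping={"data": value, "time": int(time.time())})

def get_value(key):
    # Returns (value, is_stale); the caller decides whether to refresh.
    entry = r.hgetall(key)
    if not entry:
        return None, True
    is_stale = int(time.time()) - int(entry["time"]) > STALE_AFTER
    return entry["data"], is_stale
```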

In NiFi how to store attribute retrieved from DB which doesn’t change very frequently?

I have a scheduled ExecuteSQL processor which retrieves a speed limit from a DB. This speed limit doesn't change frequently, so I set a schedule interval of 24 hours. But I noticed that the next processor, e.g. RouteOnAttribute, doesn't keep this speed limit value. With every FlowFile coming from Kafka I want to check whether the speed limit value in the FlowFile exceeds the speed limit value retrieved from the DB. But the value from the DB is only processed as a FlowFile once every 24 hours, so it's not available for comparison.
I have the following flow:
1) ExecuteSQL -> ConvertAvroToJson -> EvaluateJsonPath -> from here I pass the speed limit value on to the RouteOnAttribute processor in the flow below.
2) ConsumeKafka -> EvaluateJsonPath -> RouteOnAttribute (RouteOnAttribute gets the speed limit from the flow above, but it only gets this value once every 24 hours. How can I keep this value in memory permanently?)
Based on your description I think this how-to HCC post is very relevant:
https://community.hortonworks.com/questions/140060/nifi-how-to-load-a-value-in-memory-one-time-from-c.html
In summary it leverages the fact that UpdateAttribute has a state feature and makes sure the attribute only gets updated when data is pulled in from the reference table.
There is also an alternative solution, if it is OK for you to restart NiFi after pulling in an updated reference value: the variable registry, which simplifies things a bit:
https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.1.1/bk_administration/content/custom_properties.html

A data structure to query the number of events in different time intervals

My program receives thousands of events per second of different types. For example, 100k API accesses per second from users with millions of different IP addresses. I want to keep statistics and limit the number of accesses per 1 minute, 1 hour, 1 day and so on. So I need event counts for the last minute, hour, or day for every user, and I want it to behave like a sliding window. In this case, the event type is the user's address.
I started with a time series database, InfluxDB, but it failed to insert 100k events per second, and the aggregate queries to find event counts in a minute or an hour are even worse. I am sure InfluxDB is not capable of inserting 100k events per second while performing 300k aggregate queries at the same time.
I don't need the events themselves retrieved from the database, because each one is just a simple address. I just want to count them as fast as possible over different time intervals: the number of events of type x in a specific time interval (for example, the past hour).
I don't need to store the statistics on disk, so maybe a data structure that keeps event counts for different time intervals would work for me. On the other hand, I need it to behave like a sliding window.
Storing all the events in RAM in a linked list and iterating over it to answer queries is another solution that comes to mind, but because the number of events is so high, keeping all of them in RAM is not a good idea.
Is there any good data structure or even a database for this purpose?
You didn't provide enough details on the event input format and how events can be delivered to the statistics backend: is it a stream of UDP messages, HTTP PUT/POST requests, or something else?
One possible solution would be to use the Yandex ClickHouse database.
A rough description of the suggested pattern:
1) Load incoming raw events from your application into a memory-based table Events with the Buffer storage engine.
2) Create a materialized view with per-minute aggregation in another memory-based table EventsPerMinute with the Buffer engine.
3) Do the same for hourly aggregation of data in EventsPerHour.
4) Optionally, use Grafana with the clickhouse datasource plugin to build dashboards.
In ClickHouse, a Buffer table that is not associated with any on-disk table is kept entirely in memory, and older data is automatically replaced with fresh data. This gives you simple housekeeping for the raw data.
The tables (materialized views) EventsPerMinute and EventsPerHour can also be created with the MergeTree storage engine in case you want to keep statistics on disk. ClickHouse can easily handle billions of records.
At 100K events/second you may need some kind of shaper/load balancer in front of the database.
You could consider a Hazelcast cluster instead of plain RAM. Graylog or plain Elasticsearch might also work, but with this kind of load you should test. You can also think about your data structure: construct an hour map for each address and put each event into the corresponding hour bucket. Once the hour has passed, you can calculate the count and cache it in that hour's bucket. When you need minute granularity, you go to the hour's bucket and count the events in that hour's list; a rough sketch of this bucketing idea follows below.
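A minimal in-memory sketch of such a bucketed sliding-window counter in Python; the bucket and window sizes are arbitrary choices, and nothing but the counts is stored:

```python
import time
from collections import defaultdict, deque

BUCKET_SECONDS = 60            # one bucket per minute
WINDOW_SECONDS = 60 * 60       # keep one hour of minute buckets

# key (e.g. an IP address) -> deque of [bucket_start, count]
buckets = defaultdict(deque)

def record_event(key, now=None):
    now = int(now if now is not None else time.time())
    bucket_start = now - now % BUCKET_SECONDS
    q = buckets[key]
    if q and q[-1][0] == bucket_start:
        q[-1][1] += 1              # same minute: just bump the counter
    else:
        q.append([bucket_start, 1])
    while q and q[0][0] <= now - WINDOW_SECONDS:
        q.popleft()                # drop buckets that fell out of the window

def count_events(key, interval_seconds, now=None):
    # Approximate sliding-window count at bucket granularity.
    now = int(now if now is not None else time.time())
    return sum(c for start, c in buckets[key] if start > now - interval_seconds)
```

The same idea extends to coarser granularities (hourly and daily buckets), so that a one-day query does not have to walk through 1440 minute buckets.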

Count redis keys without fetching them in Ruby

I'm keeping a list of online users in Redis with one key per user. Keys are set to time out in 15 minutes, so to see roughly how many users have been active in the past 15 minutes, I can do:
redisCli.keys('user:*').count
The problem is as the number of keys grows, the time it takes to fetch all the keys before counting them is increasing noticeably. Is there a way to count the keys without actually having to fetch all of them first?
There is an alternative to directly indexing keys in a Set or Sorted Set, which is to use the new SCAN command. It depends on the use case, memory / speed tradeoff, and required precision of the count.
Another alternative is to use Redis HyperLogLogs; see PFADD and PFCOUNT.
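For illustration, a sketch of both ideas in Python with redis-py (the same commands exist in the Ruby client); the key names and window size are made up:

```python
import time
import redis  # assumes the redis-py client

r = redis.Redis()

# SCAN-based count: visits matching keys in batches instead of one blocking
# KEYS call, but still touches every key, so it stays proportional to the keyspace.
active = sum(1 for _ in r.scan_iter(match="user:*", count=1000))

# HyperLogLog-based count: add each active user to an HLL keyed by the current
# 15-minute window; PFCOUNT returns an approximate distinct count.
window = int(time.time()) // 900  # 900 s = 15 minutes
r.pfadd(f"active:{window}", "user:1234")
r.expire(f"active:{window}", 2 * 900)  # let old windows age out
approx_active = r.pfcount(f"active:{window}")
```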
Redis does not have an API for counting only the keys that match a specific pattern, so it is also not available in the Ruby client.
What I can suggest is to keep another data structure to read the number of users from.
For instance, you can use Redis's Sorted Set, keeping each user with the timestamp at which its TTL was last set as the score; then you can call ZCOUNT to get the current number of active users:
redisCli.zcount('active_users', 15.minutes.ago.to_i, Time.now.to_i)
From time to time you will need to clean up the old values by:
redisCli.zremrangebyscore 'active_users', 0, 15.minutes.ago.to_i

What is the most efficient way to filter a search?

I am working with node.js and mongodb.
I am going to have a database set up and use socket.io for real-time updates, which will either query the DB again or push the new update to the client.
I am trying to figure out the best way to filter the database.
Some more information in regards to what is being queried and what the real time updates are:
A document in the database will include information such as an address, city, time, number of packages, name, price.
Filters include city/price/name/time (meaning only to see addresses within the same city, or within the same time period)
Real-time info: includes adding a new document to the database which will essentially update the admin on the website with a notification of a new address added.
Method 1: Query the db with the filters being searched?
Method 2: Query the db for all searches and then filter it on the client side (Javascript)?
Method 3: Query the db for all searches then store it in localStorage then query localStorage for what the filters are?
I'm trying to figure out the fastest way for the user to filter it.
Also, if that differs from the most cost-effective way, then the most cost-effective way as well (which I am assuming means fewer DB queries)...
It's hard to say because we don't see exact conditions of the filter, but in general:
Mongo can use only one index per query condition. Thus whatever fields are covered by this index can be used for efficient filtering; otherwise it might do a full collection scan, which is slow. If you are using an index, then you are probably doing the most efficient query. (Mongo can still use another index for sorting, though.)
Sometimes you will be forced to do processing on client side because Mongo can't do what you want or it takes too many queries.
The least efficient option is to store the results somewhere else, simply because IO is slow. This would only benefit you if you use them as a cache and do not recalculate.
Also consider overhead and latency of networking. If you have to send lots of data back to the client it will be slower. In general Mongo will do better job filtering stuff than you would do on the client.
From your description, if you can filter addresses by city and time period, then an index can cut out a lot of documents. You most likely need a compound index covering multiple fields.
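As a rough sketch of method 1 with such a compound index, assuming pymongo and the field names from the question (city, time, address, price); the database, collection, and filter values are made up:

```python
from datetime import datetime, timedelta
from pymongo import MongoClient, ASCENDING

client = MongoClient()
packages = client["delivery"]["packages"]  # hypothetical database/collection names

# Compound index so a filter on city plus a time range is served by the index
# instead of a full collection scan.
packages.create_index([("city", ASCENDING), ("time", ASCENDING)])

# Method 1: let MongoDB do the filtering server-side.
recent_in_city = packages.find({
    "city": "Springfield",                                    # example filter value
    "time": {"$gte": datetime.utcnow() - timedelta(hours=1)},
})
for doc in recent_in_city:
    print(doc["address"], doc["price"])
```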
