Aerospike set expiration date for specific field - caching

I have an Aerospike cache consisting of a list of records whose values have a JSON-like structure.
Example value: {"name": "John", "count": 10}
I was wondering if it is possible to set an expiration time for only the count field and reset it after some time.

Aerospike doesn't support such functionality out of the box. You would have to code it yourself (hence your other post, I guess:
Best way to update single field in Aerospike). You can add filters so this is only done based on the record's metadata (its last update time, accessible through Expressions) or any other logic, and it should be very efficient and performant to then let a background ops query do the work.

Another approach is to add your own custom expiration timestamp to your bin data, like so:
{"name":"John", "count":10, "validTill":1672563600000000000}.
Here, I generated the timestamp as below (you can use a different future timestamp format):
$ date --date="2023-01-01 09:00:00" +%s%N
1672563600000000000
Now, when you read the record, read through an expression that returns count = 10 if your current clock is behind validTill, and 0 otherwise. This works if the count value on read is all you care about. Also, when you update the count value in a future write, you can use the same expression logic to update both count and validTill.
If this works for you, you don't have to scan and update records using background jobs.
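As a rough sketch with the Python client (namespace, set, and key are made up), doing the validTill check client-side on read; the Expression variant would push the same comparison into the server-side read:

import time
import aerospike

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()
key = ("test", "users", "john")   # hypothetical namespace/set/key

def write_count(count, valid_seconds):
    # Store the count together with its own expiry timestamp (nanoseconds, as above).
    client.put(key, {"name": "John",
                     "count": count,
                     "validTill": time.time_ns() + valid_seconds * 1_000_000_000})

def read_count():
    # Expired count reads as 0; this is the same check the expression would do server-side.
    _, _, bins = client.get(key)
    return bins["count"] if time.time_ns() < bins["validTill"] else 0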

Related

elasticsearch fill gaps with previous value

I have time series data in Elasticsearch and I want to aggregate it to create a histogram. What I want to achieve is to fill the null buckets with the value of the previous data point. I know that I can use min_doc_count: 0, but it will set the value to 0, and I couldn't find any out-of-the-box way to do this in Elastic. Maybe there is some trick that I am not aware of?
Appreciate your feedback.
I think the Date Histogram Aggregation does not provide a native way to perform what you would like.
The closest thing I can think of is using missing value. However, this will set a static value to all the dates where no values are found, which is not exactly what you want.
I also thought of using Painless with the following logic:
Get the first value in the Histogram and store it in a variable current.
If the next value is different from 0, store that value in current.
If the value is 0, set that bucket's value to current. Don't change current.
Repeat step 2 until you finish the Histogram.
Using Painless is, in my experience, really painful, but you can consider it as an alternative.
Additionally, I would recommend limiting ES to searches and aggregations. If you require additional logic on the output, consider performing it outside ES. You can use the Python ES Client, for instance.
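For instance, the histogram itself could be requested like this (hypothetical index and field names; min_doc_count: 0 keeps the empty buckets so there is something to fill in, and older ES versions use interval instead of calendar_interval):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical cluster address

results = es.search(index="metrics", body={
    "size": 0,
    "aggs": {
        "my_histogram_name": {
            "date_histogram": {
                "field": "timestamp",
                "calendar_interval": "1h",
                "min_doc_count": 0
            }
        }
    }
})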
I can think of the following script with a similar logic as the Painless scenario:
current = 0
results = es.search(...)
for i in results["aggregations"]["my_histogram_name"]["buckets"]:
    if not i["doc_count"]:  # same as: if i["doc_count"] == 0
        i["doc_count"] = current  # empty bucket: carry the previous value forward
    current = i["doc_count"]  # changed or not, keep the last value as current
After that, the histogram should look the way you want and be ready to be displayed.
Hope this is helpful! :)

How to sort by a derived value that includes a moving date in ElasticSearch?

I have a requirement to sort the results returned by ElasticSearch by a special value I define; let's call it 'X'.
Now - the problem is, 'X' is a value derived based on:
field A in the document (which is a 'term')
field B (which is a 'date')
the current date (UTC)
So, the problem is obviously the third one. The date always changes, therefore I'm not sure how to include it in the sort, since it's not part of the document.
From my initial reading it appears I can use a 'script' here, but I'm worried about performance, since I could be searching and sorting over thousands of documents.
The only other idea that came to mind is to calculate the value nightly and store it in each document. But that has a few drawbacks:
I need to have something running in the background to update this value.
It could be a lot of documents to update (60%+ every night).
I lose precision for the value depending on how long passes between script runs (if I run nightly, the value is up to 23 hours stale).
Any advice?
Thanks
This can be done by having a script run nightly that calculates the value and stores it in each document.
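A rough sketch of such a nightly job with the Python client (the index, field names, and the formula for 'X' are placeholders, not your actual derivation):

from datetime import datetime
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")   # hypothetical cluster address
INDEX = "items"                               # hypothetical index name

def x_value(source):
    # Placeholder derivation: age in days of field B (assumed 'YYYY-MM-DD'),
    # weighted by the term in field A. Swap in your real formula here.
    age_days = (datetime.utcnow() - datetime.strptime(source["field_b"], "%Y-%m-%d")).days
    return age_days * (2 if source["field_a"] == "priority" else 1)

actions = (
    {"_op_type": "update", "_index": INDEX, "_id": hit["_id"],
     "doc": {"x": x_value(hit["_source"])}}
    for hit in helpers.scan(es, index=INDEX, query={"query": {"match_all": {}}})
)
helpers.bulk(es, actions)
# Queries can then sort on the stored "x" field directly.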

CouchDb filter and sort in one view

I'm new to CouchDB.
I have to filter records by date (the date must be between two values) and sort the data by name, date, etc. (depending on the user's selection in the table).
In MySQL it looks like
SELECT * FROM table WHERE date > "2015-01-01" AND date < "2015-08-01" ORDER BY name/date/email ASC/DESC
I can't figure out if I can use one view for all these issues.
Here is my map example:
function(doc) {
  emit(
    [doc.date, doc.name, doc.email],
    {
      email: doc.email,
      name: doc.name,
      date: doc.date
    }
  );
}
I tried filtering the data using startkey and endkey, but I'm not sure how to sort the data this way:
startkey=["2015-01-01"]&endkey=["2015-08-01"]
Can I use one view? Or do I have to create several views whose key order depends on the current sort field: [doc.date, doc.name, doc.email], [doc.name, doc.date, doc.email], etc.?
Thanks for your help!
As Sebastian said you need to use a list function to do this in Couch.
If you think about it, this is what MySQL is doing: its query optimizer will pick an index into your table, scan a range from that index, load what it needs into memory, and execute the query logic.
In Couch the view is your B-tree index, and a list function can implement whatever logic you need. It can be used to spit out HTML instead of JSON, but it can also be used to filter/sort the output of your view, and still spit out JSON in the end. It might not scale very well to millions of documents, but MySQL might not either.
So your options are the ones Sebastian highlighted:
view sorts by date, query selects date range and list function loads everything into memory and sorts by email/etc.
views sort by email/etc, list function filters out everything outside the date range.
Which one you choose depends on your data and architecture.
With option 1 you may skip the list function entirely: get all the necessary data from the view in one go (with include_docs), and sort client side. This is how you'll typically use Couch.
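For example, with plain HTTP from Python (database, design document, and view names are made up):

import json
import requests

VIEW = "http://localhost:5984/mydb/_design/app/_view/by_date"  # hypothetical names

params = {
    "startkey": json.dumps(["2015-01-01"]),
    "endkey": json.dumps(["2015-08-01"]),   # use ["2015-08-01", {}] to make the end date inclusive
    "include_docs": "true",
}
rows = requests.get(VIEW, params=params).json()["rows"]
docs = [row["doc"] for row in rows]

# Sort by whatever the user picked; pass reverse=True for DESC.
docs.sort(key=lambda d: d["name"])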
If you need this done server side, you'll need your list function to load every matching document into an array, and then sort it and JSON serialize it. This obviously falls apart if there are so many matching documents that they don't fit into memory or take too long to sort.
Option 2 scans through preordered documents and only sends those matching the dates. Done right this avoids loading everything into memory. OTOH it might scan way too many documents, thrashing your disk I/O.
If the date range is "very discriminating" (few documents pass the test) option 1 works best; otherwise (most documents pass) option 2 can be better. Remember that in the time it takes to load a useless document from disk (option 2), you can sort tens of documents in memory, as long as they fit in memory (option 1). Also, the more indexes, the more disk space is used and the more writes are slowed down.
you COULD use a list function for that, in two ways:
1.) Couch-View is ordered by dates and you sort by e-mail => but please be aware that you'd have to have ALL items in memory to do this sort by e-mail (i.e. you can do this only when your result set is small)
2.) Couch-View is ordered by e-mail and a list function drops all outside the date range (you can only do that when the overall list is small - so this one is most probably bad)
possibly #1 can help you

Cassandra DB: is it favorable, or frowned upon, to index multiple criteria per row?

I've been doing a lot of reading lately on Cassandra, and specifically how to structure rows to take advantage of indexing/sorting, but there is one thing I am still unclear on: how many "index" items (or filters, if you will) should you include in a column family (CF) row?
Specifically: I am building an app and will be using Cassandra to archive log data, which I will use for analytics.
Example types of analytic searches will include (by date range):
total visits to specific site section
total visits by Country
traffic source
I plan to store the whole log object in JSON format, but to avoid having to go through each item to get basic data, or having to create multiple CFs just to get basic data, I am curious to know whether it's a good idea to include the above "filters" as columns (compound column segments).
Example:
Row Key | timeUUID:data | timeUUID:country | timeUUID:source |
======================================================
timeUUID:section | JSON Object | USA | example.com |
So as you can see from the structure, the row key would be a compound key of timeUUID (say per day) plus the site section I want to get stats for. This lets me query a date range quite easily.
Next, my dilemma, the columns. Compound column name with timeUUID lets me sort & do a time based slice, but does the concept make sense?
Is this type of structure acceptable by the current "best practice", or would it be frowned upon? Would it be advisable to create a separate "index" CF for each metric I want to query on? (even when it's as simple as this?)
I would rather get this right the first time instead of having to restructure the data and refactor my application code later.
I think the idea behind this is OK. It's a pretty common way of doing timeslicing (assuming I've understood your schema anyway - a create table snippet would be great). Some minor tweaks ...
You don't need a timeUUID as your row key. Given that you suggest partitioning by individual days (which are inherently unique) you don't need a UUID aspect. A timestamp is probably fine, or even simpler a varchar in the format YYYYMMDD (or whatever arrangement you prefer).
You will probably also want to swap your row key composition around to section:time. The reason for this is that if you need to specify an IN clause (i.e. to grab multiple days) you can only do it on the last part of the key. This means you can do WHERE section = 'foo' and time IN (....). I imagine that's a more common use case - but the decision is obviously yours.
If your common case is querying the most recent data don't forget to cluster your timeUUID columns in descending order. This keeps the hot columns at the head.
Double storing content is fine (i.e. once for the JSON payload, and denormalised again for data you need to query). Storage is cheap.
I don't think you need indexes, but it depends on the queries you intend to run. If your queries are simple then you may want to store counters by (date:parameter) instead of values and just increment them as data comes in.
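To make those tweaks concrete, a rough sketch via the Python driver (keyspace, table, and column names are mine, not from your schema):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("analytics")   # hypothetical keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS visits_by_section (
        section text,
        day     text,         -- e.g. '20130115' instead of a per-day timeUUID
        event   timeuuid,
        payload text,         -- the full JSON log object, denormalised
        country text,
        source  text,
        PRIMARY KEY ((section, day), event)
    ) WITH CLUSTERING ORDER BY (event DESC)
""")

# Several days for one section, newest events first:
rows = session.execute(
    "SELECT event, country, source FROM visits_by_section "
    "WHERE section = %s AND day IN ('20130114', '20130115')",
    ["downloads"]
)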

Redis Data Structure to Store All Clicks for All Links

I'm trying to set up a system in which ALL links posted by users and clicked by their followers are stored in redis in such a way that the following requirements are met:
Able to get (for example) the 10% most clicked links within a time-frame (can be today, this week, all time, or custom).
Able to query all users who posted the same link.
Since we already use many keys, ideally all of this would be stored in a single Redis key.
Values can be JSON-encoded if needed.
Here is what I came up with so far:
- I use a single Redis hash where each field is a single hour, so that in one day the hash will contain 24 fields.
- In each field, I store a JSON-encoded array with this format:
array("timestamp1" => array($url1, $url2, ...)
, "timestamp2" => array($url3, $url4, ...)
, ..., ...);
- The complete structure of the hash is:
[01/01/2010 00:00] => JSON(...),
[01/01/2010 01:00] => JSON(...),
....
This way, I can get all the clicks on any URL within any time-frame.
However, I can't seem to reuse this hash for getting all the users who posted the URL.
The question is: Is there any better way to do?
Update 07/30/2011: I'm currently storing minutes, hours, days, weeks, months, and years in the same hash.
So, one click is stored in many fields at once:
- in the field for the minute (format YmdHi)
- in the field for the hour (format YmdH)
- in the field for the day (format Ymd)
- in the field for the week (format YW)
- in the field for the month (format Ym)
- in the field for the year (format Y).
That way, when trying to get a specific timeframe, I only need to access the necessary fields, without looping through the hours.
For example, if I need clicks from 07/26/2011 20:00 to 07/28/2011 02:00, I only need to query 7 fields: 1 field for the full day of 07/27/2011, 4 fields for the hours from 20:00 to 23:00 on 07/26, and then 2 more fields for hours from 00:00 to 01:00 on 07/28
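In Python-ish pseudocode (simplified to per-URL counts per bucket rather than my timestamp => URL arrays; the JSON read-modify-write is not atomic):

import json
import time
import redis

r = redis.Redis()

def record_click(url, ts=None):
    t = time.gmtime(ts if ts is not None else time.time())
    fields = [
        time.strftime("%Y%m%d%H%M", t),  # minute
        time.strftime("%Y%m%d%H", t),    # hour
        time.strftime("%Y%m%d", t),      # day
        time.strftime("%Y%W", t),        # week
        time.strftime("%Y%m", t),        # month
        time.strftime("%Y", t),          # year
    ]
    for field in fields:
        raw = r.hget("clicks", field)
        counts = json.loads(raw) if raw else {}
        counts[url] = counts.get(url, 0) + 1
        r.hset("clicks", field, json.dumps(counts))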
If you drop the third requirement it becomes a lot easier. A lot of people seem to think that you should always use hashes instead of keys, but this stems from misunderstanding of a post about using hashes to improve performance in a particular limited set of circumstances.
To get the most clicked links, create a sorted set for each hour or day, with the value being the link and score being clicks set using ZINCRBY. Use ZCARD and ZREVRANGEBYSCORE to get the top 10%. It is simplest if the set holds all links in the system, though there are strategies you can use to drop less popular items from the set if necessary.
To get all users posting a link, store a set of users for each link. You could do this with JSON and a key or hash storing details for the link, but a set makes updating and querying easier.
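A minimal sketch of that with redis-py (per-day buckets and key names are assumptions; I use ZREVRANGE by rank to take the top slice, and wider windows can be built by combining the daily sets with ZUNIONSTORE):

import math
import time
import redis

r = redis.Redis()

def record_post(url, user_id):
    r.sadd("link:posters:" + url, user_id)   # who posted this link

def record_click(url):
    day = time.strftime("%Y%m%d")
    r.zincrby("clicks:" + day, 1, url)       # per-day leaderboard: member=url, score=clicks

def top_links(day, fraction=0.10):
    key = "clicks:" + day
    n = max(1, math.ceil(r.zcard(key) * fraction))
    return r.zrevrange(key, 0, n - 1, withscores=True)

def posters(url):
    return r.smembers("link:posters:" + url)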
I recommend using some bucketing strategy, like hashing keys or keeping link-to-user records per month, since you don't have control over how large the data structure may grow. There could be millions of users visiting a particular link, and returning the details of all of them at once would not be useful. I believe what you can do is maintain a counter or some metadata that acts as the current state, and keep an archival store that is not in memory, or go for an in-memory data grid like GemFire.
