How to sort by a derived value that includes a moving date in ElasticSearch? - elasticsearch

I have a requirement to sort the results returned by ElasticSearch by a special value I define, let's call it 'X'.
Now, the problem is that 'X' is a value derived from:
field A in the document (which is a 'term')
field B (which is a 'date')
the current date (UTC)
So the problem is obviously the third item. The current date always changes, therefore I'm not sure how to include it in the sort, since it's not part of the document.
From my initial reading it appears I can use a 'script' here, but I'm worried about performance, since I could be searching and sorting over thousands of documents.
The only other idea that came to mind is to calculate the value nightly, and store that in each document. But that has a few drawbacks:
I need to have something running in the background to update this value
it could be a lot of documents to update (60%+ every night)
I lose precision for the value depending on how long passes between script runs (if I run nightly, the value can be up to 23 hours 'stale')
Any advice?
Thanks

This can be done by having a script run nightly that calculates the value and stores it in each document, so that queries simply sort on the stored field.
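A minimal sketch of such a nightly job with the Python Elasticsearch client, assuming hypothetical index and field names ('my_index', 'a', 'b', 'x') and a purely illustrative formula for X:

from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Compute the reference time once so every document is scored against the same 'now'.
now_ms = int(datetime.now(timezone.utc).timestamp() * 1000)

es.update_by_query(
    index="my_index",
    body={
        "query": {"match_all": {}},  # or restrict to documents whose X can actually change
        "script": {
            "lang": "painless",
            # Illustrative derivation only: weight the term in 'a' and decay by the
            # age of the date in 'b'. Assumes 'b' is stored as an ISO-8601 string.
            "source": """
                long age = params.now - Instant.parse(ctx._source.b).toEpochMilli();
                double weight = ctx._source.a == "priority" ? 2.0 : 1.0;
                ctx._source.x = weight / (1 + age / 86400000.0);
            """,
            "params": {"now": now_ms},
        },
    },
)

If the up-to-23-hour staleness matters, the same calculation can instead run as a script-based sort at query time (the option mentioned in the question), at the cost of evaluating it per matching document on every search.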

Related

Aerospike set expiration date for specific field

I have an Aerospike cache consisting of records whose values have a JSON-like structure.
Example value: {"name": "John", "count": 10}
I was wondering if it is possible to set an expiration time for only the count field and reset it after some time.
Aerospike doesn't support such functionality out of the box. You would have to code this (hence your other post, I guess:
Best way to update single field in Aerospike). You can add filters so this only happens based on record metadata (the last update time of the record, accessible through Expressions) or any other logic, and it should be very efficient to then let a background ops query do the work.
Another approach can be adding your own custom expiration time stamp in your bin data like so:
{"name":"John", "count":10, "validTill":1672563600000000000}.
Here I am generating it as below (you can use a different future timestamp format):
$ date --date="2023-01-01 09:00:00" +%s%N
1672563600000000000
Now when you read the record, read it through an expression that returns count = 10 if your current clock is behind validTill, and 0 otherwise. This can work if the count value on read is all you care about. Also, when you update the count value in a future write, you can use the same expression logic to update both count and validTill.
If this works for you, you don't have to scan and update records using background jobs.
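A minimal sketch of that record layout with the Aerospike Python client; the namespace, set, key, and one-hour validity window are assumptions, and the validTill check is shown client-side in Python for clarity, whereas the suggestion above is to express the same comparison as a read Expression so it runs on the server:

import time
import aerospike

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()
key = ("test", "demo", "john")  # namespace, set, user key -- all assumptions

# Store the bin data together with its own expiry timestamp (nanoseconds).
valid_till = time.time_ns() + 3600 * 10**9  # count is valid for one hour from now
client.put(key, {"name": "John", "count": 10, "validTill": valid_till})

# On read, honour the custom expiry: count is only valid while now < validTill.
_, _, bins = client.get(key)
count = bins["count"] if time.time_ns() < bins["validTill"] else 0
print(count)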

elasticsearch fill gaps with previous value

I have time series data in Elasticsearch and I want to aggregate it to create a histogram. What I want to achieve is to fill the null buckets with the value of the previous data point. I know that I can use min_doc_count: 0, but that puts the value at 0, and I couldn't find any out-of-the-box way to do this via Elastic. Maybe there is some trick that I am not aware of?
Appreciate your feedback.
I think the Date Histogram Aggregation does not provide a native way to perform what you would like.
The closest thing I can think of is using the missing parameter. However, this will set a static value for all the dates where no values are found, which is not exactly what you want.
I also thought of using Painless with the following logic:
Get the first value in the histogram and store it in a variable, current.
If the next value is different from 0, store this value in current.
If the value is 0, write current into that bucket's value; don't change current.
Repeat from step 2 until you finish the histogram.
Using Painless is, in my experience, really painful, but you can consider it as an alternative.
Additionally, I would recommend limiting ES to searches and aggregations. If you require additional logic on the output, consider performing it outside ES. You can use the Python ES client, for instance.
I can think of the following script, with similar logic to the Painless scenario:
current = 0
results = es.search(...)
for bucket in results["aggregations"]["my_histogram_name"]["buckets"]:
    if not bucket["doc_count"]:  # same as: if bucket["doc_count"] == 0
        bucket["doc_count"] = current  # fill the gap with the previous value
    current = bucket["doc_count"]  # changed or not, carry the last value forward
After that, the histogram should look the way you want and be ready to be displayed.
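For reference, the elided search call could look roughly like the sketch below; the index name and date field are assumptions, and min_doc_count: 0 (mentioned in the question) keeps the empty buckets in the response so the loop has something to fill.

results = es.search(
    index="events",  # assumed index name
    body={
        "size": 0,
        "aggs": {
            "my_histogram_name": {
                "date_histogram": {
                    "field": "timestamp",        # assumed date field
                    "calendar_interval": "day",  # 'interval' on older ES versions
                    "min_doc_count": 0,
                },
            },
        },
    },
)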
Hope this is helpful! :)

Issue with choice action when running transform map

I'm trying to insert records into a table by using transform maps. I have a field in the target table which is a choice type, and I have set the choice action on the source table's field to reject if no matching value is found. But when I tried inserting a record using the transform map with a correct value, one which exists in the choice list of the target field, it still got rejected and hence the record was not inserted.
I have tried searching for possible reasons why it still got rejected even with a correct value in the source field. Here's a link that I have found: https://hi.service-now.com/kb_view.do?sysparm_article=KB0677334
It says that if a choice list value has more than 40 characters it will be truncated and might not match the choices. But the choices in the target field have only 20 characters or less.
I first tried running the transform map in the lower environments before proceeding to production. In the lower environment it works fine and the records get inserted. But when I tried it in production, it got rejected.
There is a difference between a choice and a choice list. Within a choice list the values are comma-separated sys_ids. I could imagine that you have multiple values in the import, and then the maximum number of characters is reached, or the values do not match, etc.
You could use this approach:
Instead of a direct assignment from source to target field, use a script to target. Then you gain the full script power ;)
Here you could maybe add some logic like a switch case or whatever; I guess you get the point.

Cassandra DB: is it favorable, or frowned upon, to index multiple criteria per row?

I've been doing a lot of reading lately on Cassandra, and specifically how to structure rows to take advantage of indexing/sorting, but there is one thing I am still unclear on: how many "index" items (or filters, if you will) should you include in a column family (CF) row?
Specifically: I am building an app and will be using Cassandra to archive log data, which I will use for analytics.
Example types of analytic searches will include (by date range):
total visits to specific site section
total visits by Country
traffic source
I plan to store the whole log object in JSON format, but to avoid having to go through each item to get basic data, or having to create multiple CFs just to get basic data, I am curious to know whether it's a good idea to include the above "filters" as columns (compound column segments).
Example:
Row Key | timeUUID:data | timeUUID:country | timeUUID:source |
======================================================
timeUUID:section | JSON Object | USA | example.com |
So as you can see from the structure, the row key would be a compound key of timeUUID (say per day) plus the site section I want to get stats for. This lets me query a date range quite easily.
Next, my dilemma, the columns. Compound column name with timeUUID lets me sort & do a time based slice, but does the concept make sense?
Is this type of structure acceptable by the current "best practice", or would it be frowned upon? Would it be advisable to create a separate "index" CF for each metric I want to query on? (even when it's as simple as this?)
I would rather get this right the first time instead of having to restructure the data and refactor my application code later.
I think the idea behind this is OK. It's a pretty common way of doing timeslicing (assuming I've understood your schema anyway - a create table snippet would be great). Some minor tweaks ...
You don't need a timeUUID as your row key. Given that you suggest partitioning by individual days (which are inherently unique) you don't need a UUID aspect. A timestamp is probably fine, or even simpler a varchar in the format YYYYMMDD (or whatever arrangement you prefer).
You will probably also want to swap your row key composition around to section:time. The reason for this is that if you need to specify an IN clause (i.e. to grab multiple days) you can only do it on the last part of the key. This means you can do WHERE section = 'foo' and time IN (....). I imagine that's a more common use case - but the decision is obviously yours.
If your common case is querying the most recent data don't forget to cluster your timeUUID columns in descending order. This keeps the hot columns at the head.
Double storing content is fine (i.e. once for the JSON payload, and denormalised again for data you need to query). Storage is cheap.
I don't think you need indexes, but it depends on the queries you intend to run. If your queries are simple then you may want to store counters by (date:parameter) instead of values and just increment them as data comes in.
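As a rough sketch of the suggested layout in CQL (via the Python driver); the keyspace, table, and column names are illustrative assumptions, not the asker's actual schema:

from cassandra.cluster import Cluster

# Assumes a keyspace named 'analytics' already exists.
session = Cluster(["127.0.0.1"]).connect("analytics")

session.execute("""
    CREATE TABLE IF NOT EXISTS section_hits (
        section text,
        day     text,      -- e.g. '20130427', the varchar-per-day idea above
        ts      timeuuid,
        payload text,      -- the full JSON log object
        country text,      -- denormalised copies of the fields you query on
        source  text,
        PRIMARY KEY ((section, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

# IN works on the last part of the key, and rows come back newest-first:
rows = session.execute(
    "SELECT ts, country, source FROM section_hits "
    "WHERE section = %s AND day IN (%s, %s)",
    ("checkout", "20130426", "20130427"),
)

The composite partition key gives the section:time arrangement described above, and the descending clustering order keeps the most recent events at the head of each partition.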

How do I query Sitecore items by first version created date?

It's quite easy to write an XPath or Sitecore Query/Fast query to get all items within a date range:
E.g.
/sitecore/content/*[#__created>='20130301T000000' and #__created<'20130427T000000']
However, this kind of query only looks at the latest version of an item, so it seems impossible to find the actual item's created date (not the version's created date).
I could write a bit of C# code to do the querying but that would involve first retrieving version 1 of every single item in my database before I could then do my filter on created date. This would be mind-bogglingly slow.
Is it possible to do with XPath/Query notation/Fast? If not, is there a way I can do it that will be quick?
I thought of something that might work:
Create a new field which all your items will get. Then, on the creation of an item (not a version), write the datetime into that field.
When versions get deleted, that's fine, because all versions will have that field, with the exact same value.
The only thing is, you'll have to run a script once to loop through your existing items to populate the field with the correct value for each item.
You can then use your XPath query same as now.
