Elasticsearch: rank by proximity to date

In Elasticsearch, is there a way to rank search results by proximity to a given date (or number)?

You can use script-based sorting to calculate the proximity. However, if you have a large number of results, you might need to switch to a native script to achieve good performance.
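For example, a script-based sort could look like the following (a minimal sketch, assuming a recent Elasticsearch with Painless, the Python client, and a date field named created_at; adjust the names to your mapping):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Rank documents by the absolute distance (in epoch milliseconds) between
# their created_at value and the target date; the closest dates sort first.
body = {
    "query": {"match_all": {}},
    "sort": {
        "_script": {
            "type": "number",
            "order": "asc",
            "script": {
                "lang": "painless",
                "source": "Math.abs(doc['created_at'].value.toInstant().toEpochMilli() - params.origin)",
                "params": {"origin": 1379376000000}  # 2013-09-17T00:00:00Z
            }
        }
    }
}

response = es.search(index="my_index", body=body)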

Try this example:
"DECAY_FUNCTION": {
"FIELD_NAME": {
"origin": "2013-09-17",
"scale": "10d",
"offset": "5d",
"decay" : 0.5
}
}
DECAY_FUNCTION can be "linear", "exp", or "gauss". If your field is a date field, you can set scale and offset in days, weeks, and so on.
Reference: http://www.elastic.co/guide/en/elasticsearch/reference/1.5/query-dsl-function-score-query.html
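Filling in the template, a complete request body might look like this (a sketch only; the published_at date field is a placeholder):

# Documents dated within 5 days of 2013-09-17 keep a score near 1; about
# 10 days beyond that offset, the score drops to roughly 0.5 (the decay).
body = {
    "query": {
        "function_score": {
            "query": {"match_all": {}},
            "gauss": {
                "published_at": {
                    "origin": "2013-09-17",
                    "scale": "10d",
                    "offset": "5d",
                    "decay": 0.5
                }
            }
        }
    }
}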

Related

Elasticsearch date based function scoring boosting the wrong way

I would like to boost scores of documents based on how "recent" a document is. I am trying to do this using a function_score. Here is an example of me doing this on a field called updated_at:
{
    "function_score": {
        "boost_mode": "sum",
        "functions": [
            {
                "exp": {
                    "updated_at": {
                        "origin": "now",
                        "scale": "1h",
                        "decay": 0.01,
                    },
                },
                "weight": 1,
            }
        ],
        "query": query
    },
}
I would expect documents close to now to have a score closer to 1, and documents closer to scale (an hour old) to have a score closer to decay (as described in the docs). Therefore, I'm using the boost_mode sum to keep the original document scores and add to them depending on how close to now the updated_at value is. (Also, the query score is useful, so I would rather add than multiply, which is the default.)
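That expectation can be sanity-checked against the decay formula from the function_score docs (a standalone sketch with scale 1h, decay 0.01, and offset defaulting to 0):

import math

# exp decay: score = exp(lambda * max(0, |now - updated_at| - offset)),
# where lambda = ln(decay) / scale (here: scale = 1h, decay = 0.01).
scale_seconds = 3600.0
decay = 0.01
lam = math.log(decay) / scale_seconds

def decay_score(age_seconds):
    return math.exp(lam * max(0.0, age_seconds))

print(decay_score(0))     # ~1.0  -> a just-updated document
print(decay_score(3600))  # ~0.01 -> a document updated an hour ago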
To test this scenario, I create a document (A) that returns a query score of about 2. I then duplicate it (B) and modify the new document's updated_at timestamp to be an hour in the past.
In this scenario, I would expect (A) to have a higher score and (B) to have a lower score. However, when I run this scenario, I get the exact opposite. (B) ends up with a score of 3 and (A) ends up with a score of 2.
What am I misunderstanding here to cause this to happen? And how would I modify my function score to do what I would like?
This turned out to be a timezone issue.
I ended up using the explain API to look at what was contributing to the score. When doing that, I noticed that the origin set to now was actually in a different timezone to the one I was setting in the documents.
I fixed this by manually providing a UTC timestamp in the elasticsearch query rather than using now as the value.
(If there is a better way to do this, please let me know)
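For reference, a sketch of that workaround (names are illustrative, and match_all stands in for the original query):

import datetime

# Build the origin in UTC so it matches the timezone stored in the
# documents, instead of relying on how the cluster resolves "now".
utc_now = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")

body = {
    "query": {
        "function_score": {
            "boost_mode": "sum",
            "functions": [
                {
                    "exp": {
                        "updated_at": {
                            "origin": utc_now,
                            "scale": "1h",
                            "decay": 0.01
                        }
                    },
                    "weight": 1
                }
            ],
            "query": {"match_all": {}}  # placeholder for the original query
        }
    }
}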

Elasticsearch - query based on event frequency

I have multiple indexes storing user tracking logs, one of which is index-pageview. How can I query the list of users who viewed the page 10 times between 2021-12-11 and 2021-12-13 using the iOS operating system?
Log example:
index: index-pageview
[
    {
        "user_id": 1,
        "session_id": "xxx",
        "timestamp": "2021-12-11 hh:mm:ss",
        "platform": "IOS"
    },
    {
        "user_id": 1,
        "session_id": "yyy",
        "timestamp": "2021-12-13 hh:mm:ss",
        "platform": "Android"
    }
]
You can try building a normal bool query on timestamp and platform, and then either a terms aggregation (possibly with min_doc_count: 10) or a collapse on user_id (see the sketch at the end of this answer). Both ways have some limitations, though:

- the aggregation might be slower (needs benchmarking)
- the number of aggregation buckets is limited (10k by default)
- collapse works on at most size docs at a time (capped at 10k as well), so you might need scrolling and app-side processing

The performance of either might be pretty poor, though. If you need to run queries like this very often, I would consider using another storage (SQL? Something more fancy?)
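A sketch of the aggregation variant (assuming platform is a keyword field and user_id is aggregatable; adjust names and bounds to your mapping):

# Filter to iOS page views in the date range, then bucket by user and
# keep only users with at least 10 matching documents.
body = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"platform": "IOS"}},
                {"range": {"timestamp": {"gte": "2021-12-11", "lte": "2021-12-13"}}}
            ]
        }
    },
    "aggs": {
        "users": {
            "terms": {
                "field": "user_id",
                "min_doc_count": 10,
                "size": 10000  # the default bucket cap mentioned above
            }
        }
    }
}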

Using date_histogram with fixed_interval (30d) unexpected bucket start

I have a requirement to get data aggregated per 30 days (not month) so I'm using a date_histogram with "fixed_interval": "30d" to get that data. For example, if the user wants the last 90 days aggregations, there should be 3 buckets: [90-60, 60-30, 30-0]. Taking today's date (18-Mar-2021), I would want buckets [18-Dec,17-Jan,16-Feb].
However, what I actually get is [4-Dec,3-Jan,2-Feb,4-Mar]. The first bucket starts way earlier than any data is available, which also means an additional bucket than expected is needed in the end.
I found out that you can't easily tell when your buckets are meant to start (e.g. I want my first bucket to start at today-90 days). Buckets seem to start from 1970-01-01 according to what I could find (e.g. this) and the documentation kinda says this as well (this link, though it doesn't go into depth of the impact).
With this in mind, I worked out that I could use offset with an "interesting formula" so that I get the correct buckets that I need. E.g.:
GET /my_index/_search?filter_path=aggregations
{
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {
                    "range": {
                        "#timestamp": {
                            "gte": "TODAY - 90/60/30",
                            "lt": "TODAY"
                        }
                    }
                }
            ]
        }
    },
    "aggs": {
        "discussion_interactions_chart": {
            "date_histogram": {
                "field": "#timestamp",
                "fixed_interval": "30d",
                "format": "yyyy-MM-dd",
                "offset": "(DAYS(#timestamp.gte, 1970-01-01) % 30)d"
            }
        }
    }
}
(Obviously this query doesn't work as written; I build the variables in code. For the example of 18-Mar-2021, the offset works out to 14: the gte bound 18-Dec-2020 is 18,614 days after the epoch, and 18614 % 30 = 14.)
So basically the offset is calculated as the number of days between my lower-bound date and the epoch, mod 30. This seems to work, but it's kind of hard to justify this logic in a code review. Is there a nicer solution to this?
Here's a Python implementation of the answer in your question (which you really deserve upvotes for; it's clever and helped me):
import datetime

from elasticsearch_dsl import A

fixed_interval_days = 90
# Offset needed to make the fixed_interval histogram end on today's date
# (Elasticsearch starts the intervals at 1970-01-01).
offset_days = (datetime.datetime.utcnow() - datetime.datetime(1970, 1, 1)).days % fixed_interval_days
...
A(
    "date_histogram",
    field="#timestamp",
    fixed_interval=f"{fixed_interval_days}d",
    offset=f"{offset_days}d",
)

Elasticsearch more_like_this query taking a long time to run

I have the below more_like_this query in Elasticsearch.
I run this in a loop 15 times, with a different art_title and art_tags each time. For some articles it executes very quickly, but for other articles in the loop it takes too long. Is there anything I can do to optimize this query? Any help is appreciated.
bodyquery = {
    "query": {
        "bool": {
            "should": [
                {
                    "more_like_this": {
                        "like_text": art_title,
                        "fields": ["title"],
                        "max_query_terms": 30,
                        "boost": 5,
                        "min_term_freq": 1
                    }
                },
                {
                    "more_like_this": {
                        "like_text": art_tags,
                        "fields": ["tags"],
                        "max_query_terms": 30,
                        "boost": 5,
                        "min_term_freq": 1
                    }
                }
            ]
        }
    }
}
I believe you might have solved this already by now, but depending on the content of your indexed docs and the analyzers applied to the fields you are looking at, this can take a wide range of time to complete. Think about how similarity works and how it will be calculated for your documents, and you will probably find the answer. Also, you can use the explain param to get a detailed, step-by-step Lucene response to the question (see the sketch below). Just in case, I want to add that it is virtually impossible to determine anything without more details:

- What your mappings look like
- How those fields are analyzed
- What version of ES you are using
- Your ES setup

Also, describe in English what you are trying to retrieve: "I want documents in the catalog index that have a title similar to art_title and/or a tag similar to art_tag".
There is a reference to the syntax HERE if you are using the latest version of ES.
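As a sketch, enabling explain is a single extra flag on the request body (this keeps the question's like_text syntax; newer ES versions use like instead, and the title string is just a placeholder):

body = {
    "explain": True,  # each hit comes back with a Lucene scoring breakdown
    "query": {
        "more_like_this": {
            "like_text": "some article title",
            "fields": ["title"],
            "max_query_terms": 30,
            "min_term_freq": 1
        }
    }
}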
Cheers

Calculate change rate of Time Series values

I have an application which writes Time Series data to Elasticsearch. The (simplified) data looks like the following:
{
    "timestamp": 1425369600000,
    "shares": 12271
},
{
    "timestamp": 1425370200000,
    "shares": 12575
},
{
    "timestamp": 1425370800000,
    "shares": 12725
},
...
I would now like to use an aggregation to calculate the change rate of the shares field per time "bucket". For example, the change rate of the shares values within the last 10-minute "bucket" could IMHO be calculated as
# of shares t1
--------------
# of shares t0
I tried the Date Histogram aggregation, but I guess that's not what I need to calculate the change rates, because this would only give me the doc_count, and it's not clear to me how I could calculate the change rate from these:
{
    "aggs": {
        "shares_over_time": {
            "date_histogram": {
                "field": "timestamp",
                "interval": "10m"
            }
        }
    }
}
Is there a way to achieve my goal with aggregations within Elasticsearch? I searched the docs but didn't find a matching method.
Thanks a lot for any help!
I think it is hard to achieve with the out-of-the-box aggregate functions. However, you can take a look at percentile_ranks_aggregation and add your own modifications to the script to create point-in-time rates.
Also, sorry for the off-topic remark, but I wonder: is Elasticsearch the best fit for this kind of thing? As I understand it, at any given point in time you need only the previous sample's data to calculate the correct rate for the current sample. This sounds to me like a better fit for a real-time sliding-window implementation (even on a relational DB like Postgres), where you keep a fixed number of time buckets and, inside each bucket, the counters you are interested in. Once a new sample arrives, you update (slide) the window and calculate the updated rate for the most recent time bucket.
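A minimal sketch of that sliding-window idea in plain Python (names are illustrative; it keeps the two most recent samples and emits the rate newest/oldest):

import collections

# Keep the two most recent (timestamp, shares) samples; each new sample
# yields the change rate "shares at t1 / shares at t0" for the window.
window = collections.deque(maxlen=2)

def on_sample(timestamp, shares):
    window.append((timestamp, shares))
    if len(window) == window.maxlen:
        (_, oldest), (_, newest) = window[0], window[-1]
        return newest / oldest
    return None

# With the samples from the question:
for ts, s in [(1425369600000, 12271), (1425370200000, 12575), (1425370800000, 12725)]:
    rate = on_sample(ts, s)
    if rate is not None:
        print(ts, round(rate, 4))  # 1.0248, then 1.0119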
