Elasticsearch: minimize the boost factor as time passes

I have an Elasticsearch document that looks like this:
...
{
    "title": "post 1",
    "total_likes": 100,
    "total_comments": 129,
    "updated_at": "2020-10-19"
},
...
And I use a query that boosts the likes and comments with respect to the post creation date,
so it looks like this:
total_likes^6,
total_comments^4,
updated_at
Now the issue with this approach is that if a post has a huge number of likes, it will be stuck at the top of the results forever, no matter when it was created.
How can I minimize the boost as time passes? For example, a very fresh post would get the full boost factors (6, 4), while a post created a year ago would get reduced factors such as (2, 1).

So I think what you are looking for is the function_score query in combination with a decay function [doc].
Or, if your logic is more complex, you could write it in Painless in a script_score function, or use the field_value_factor function [doc].
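A minimal sketch of that first suggestion with the Python client, assuming the fields from the question and an index name of "posts" (the index name, connection URL and tuning values are assumptions): the field_value_factor functions keep rewarding likes and comments, while a gauss decay on updated_at scales the whole score down as the post ages.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # connection URL is an assumption

query = {
    "query": {
        "function_score": {
            "query": {"match_all": {}},  # replace with your actual base query
            "functions": [
                {"field_value_factor": {"field": "total_likes", "factor": 6, "modifier": "log1p", "missing": 0}},
                {"field_value_factor": {"field": "total_comments", "factor": 4, "modifier": "log1p", "missing": 0}},
                # full weight for fresh posts, roughly a third of it after one year
                {"gauss": {"updated_at": {"origin": "now", "scale": "365d", "decay": 0.3}}}
            ],
            "score_mode": "multiply",  # the decay scales down the like/comment boosts
            "boost_mode": "multiply"
        }
    }
}

print(es.search(index="posts", body=query))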

Related

Unexpected Solr scores for documents boosted by the same boost values

I have 2 documents:
{
title: "Popular",
registrations_count: 700,
is_featured: false
}
and
{
title: "Unpopular",
registrations_count: 100,
is_featured: true
}
I'm running this Solr query (via the Ruby Sunspot gem):
fq: ["type:Event"],
sort: "score desc",
q: "*:*",
defType: "edismax",
fl: "* score",
bq: ["registrations_count_i:[700 TO *]^10", "is_featured_bs:true^10"],
start: 0, rows: 30
Or, for those who are more used to Ruby:
Challenge.search do
boost(10) do
with(:registrations_count).greater_than_or_equal_to(700)
end
boost(10) do
with(:is_featured, true)
end
order_by :score, :desc
end
One document matches the first boost query, and the other matches the other boost query. They have the same boost value.
What I would expect is that both documents get the same score. But they don't; they get something like this:
1.2011336 # score for 'unpopular' (featured)
0.6366436 # score for 'popular' (not featured)
I also checked that if I boost an attribute they both have in common, they get the exact same score, and they do. I also tried changing the 700 value to something like 7000, but it makes no difference (which makes total sense).
Can anyone explain why they get such a different score, while they both match one of the boost queries?
I'm guessing the confusion stems from "the queries being boosted by the same value" - that's not true - the boost is the score of the query itself, which is then amplified 10x by your ^10.
The bq is additive - the score from the query is added to the score of the document (while boost is multiplicative, the score is multiplied by the boost query).
If you instead want to add the same score value to the original query based on either one matching, you can use ^=10 which makes the query constant scoring (the score will be 10 for that term, regardless of the regular score of the document).
Also, if you want to apply these factors independent of each other (instead of as a single, merged score with contributions from both factors), use multiple bq entries instead.
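For example (an untested sketch reusing the field names from the question), switching both boost queries to constant scoring would look like:
bq: ["registrations_count_i:[700 TO *]^=10", "is_featured_bs:true^=10"]
With ^=10, each matching boost query contributes exactly 10 to the document's score, so a document matching only one of them gets the same additive bump as a document matching the other.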

Really huge query or optimizing an elasticsearch update

I'm working on document visualization for binary classification of a large number of documents (around 150,000). The challenge is how to present general visual information to end users so they can get an idea of the main "concepts" in each category (positive/negative). As each document has an associated set of topics, I thought about asking Elasticsearch, through aggregations, for the top-20 topics in positively classified documents, and then the same for the negatives.
I created a Python script that downloads the data from Elastic and classifies the docs, BUT the problem is that the predictions are not registered in Elasticsearch, so I cannot ask for the top-20 topics of a given category. First I thought about creating a query in Elastic to ask for the aggregations and passing a match per document id.
As I have the ids of the positive/negative documents, I can write a query to retrieve the aggregation of topics, BUT in the query I would have to provide a really large number of document IDs to select, for instance, just the positive documents. That is impossible, since there is a limit on the endpoint and I cannot pass 50,000 ids like:
"query": {
"bool": {
"should": [
{"match": {"id_str": "939490553510748161"}},
{"match": {"id_str": "939496983510742348"}}
...
],
"minimum_should_match" : 1
}
},
"aggs" : { ... }
So I tried to register the predicted categories of the classification in the Elastic index, but as the amount of documents is really huge, it takes like half an hour (compared to less than a minute for running the classification)... which is a LOT of time just for storing the predictions. Then I also need to query the index to get the right data for the visualization. To update the documents, I am using:
for id in docs_ids:
    es.update(
        index=kwargs["index"],
        doc_type=kwargs["doc_type"],
        id=id,
        body={"doc": {
            "prediction": kwargs["category"]
        }}
    )
Do you know an alternative to update the predictions faster?
You could use the bulk API, which lets you serialize your requests and hit Elasticsearch only once to execute many operations.
Try:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("myurl")
list_ids = ["1", "2", "3"]
query_list = []
for id in list_ids:
    query_dict = {
        '_op_type': 'update',
        '_index': kwargs["index"],
        '_type': kwargs["doc_type"],
        '_id': id,
        'doc': {"prediction": kwargs["category"]}
    }
    query_list.append(query_dict)
helpers.bulk(client=es, actions=query_list)
Please have a read here
Regarding querying the list of ids, to get a faster response you shouldn't match on the id_str value, as you did in the question, but use the _id field. That lets you use the multi-get (mget) query, a bulk query for the get operation (here in the Python library). Try:
my_ids_list = [<some_ids_here>]
es.mget(index=kwargs["index"],
        doc_type=kwargs["doc_type"],
        body={'ids': my_ids_list})
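And once the prediction field has been bulk-indexed, the top-20 topics per category can be fetched with a single filtered aggregation. A rough sketch, where the "topics" field name and the "positive" category value are assumptions about your mapping:
top_topics = es.search(
    index=kwargs["index"],
    body={
        "size": 0,
        "query": {"term": {"prediction": "positive"}},  # or "negative"; "prediction" should be a keyword (not analyzed) field
        "aggs": {
            "top_topics": {"terms": {"field": "topics", "size": 20}}
        }
    }
)
print(top_topics["aggregations"]["top_topics"]["buckets"])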

Calculate change rate of Time Series values

I have an application which writes Time Series data to Elasticsearch. The (simplified) data looks like the following:
{
"timestamp": 1425369600000,
"shares": 12271
},
{
"timestamp": 1425370200000,
"shares": 12575
},
{
"timestamp": 1425370800000,
"shares": 12725
},
...
I would now like to use an aggregation to calculate the change rate of the shares field by time "buckets".
For example, the change rate of the share values within the last 10-minute "bucket" could IMHO be calculated as
# of shares t1
--------------
# of shares t0
I tried the Date Histogram aggregation, but I guess that's not what I need to calculate the change rates, because this would only give me the doc_count, and it's not clear to me how I could calculate the change rate from these:
{
    "aggs": {
        "shares_over_time": {
            "date_histogram": {
                "field": "timestamp",
                "interval": "10m"
            }
        }
    }
}
Is there a way to achieve my goal with aggregations within Elasticsearch? I searched the docs, but didn't find a matching method.
Thanks a lot for any help!
I think it is hard to achieve with the out-of-the-box aggregate functions. However, you can take a look at the percentile_ranks aggregation and add your own modifications to the script to create point-in-time rates.
Also, sorry for going off-topic, but I wonder: is Elasticsearch the best fit for this kind of thing? As I understand it, at any given point in time you only need the previous sample to calculate the correct rate for the current sample. This sounds to me like a better fit for a real-time sliding-window implementation (even on some relational DB like Postgres), where you keep a fixed number of time buckets and the counters you are interested in inside each bucket. Once a new sample arrives, you update (slide) the window and calculate the updated rate for the most recent time bucket.
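If you stay in Elasticsearch, one rough client-side sketch along the same lines (the max sub-aggregation on shares, the index name, and the connection URL are assumptions) is to pull the 10-minute buckets and divide consecutive values in Python:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # connection URL is an assumption

resp = es.search(index="shares", body={  # index name is an assumption
    "size": 0,
    "aggs": {
        "shares_over_time": {
            "date_histogram": {"field": "timestamp", "interval": "10m"},
            "aggs": {"last_shares": {"max": {"field": "shares"}}}
        }
    }
})

buckets = resp["aggregations"]["shares_over_time"]["buckets"]
for prev, curr in zip(buckets, buckets[1:]):
    if prev["last_shares"]["value"]:  # skip empty buckets
        rate = curr["last_shares"]["value"] / prev["last_shares"]["value"]
        print(curr["key_as_string"], round(rate, 4))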

mongodb - Recommended tree structure for large amount of data points

I'm working on a project which records price history for items across multiple territories, and I'm planning on storing the data in a mongodb collection.
As I'm relatively new to mongodb, I'm curious about what might be a recommended document structure for quite a large amount of data. Here's the situation:
I'm recording the price history for about 90,000 items across 200 or so territories. I'm looking to record the price of each item every hour, and give a 2 week history for any given item. That comes out to around (90000*200*24*14) ~= 6 billion data points, or approximately 67200 per item. A cleanup query will be run once a day to remove records older than 14 days (more specifically, archive it to a gzipped json/text file).
In terms of data that I will be getting out of this, I'm mainly interested in two things: 1) The price history for a specific item in a specific territory, and 2) the price history for a specific item across ALL territories.
Before I actually start importing this data and running benchmarks, I'm hoping someone might be able to give some advice on how I should structure this to allow for quick access to the data through a query.
I'm considering the following structure:
{
    _id: 1234,
    data: [
        {
            territory: "A",
            price: 5678,
            time: 123456789
        },
        {
            territory: "B",
            price: 9876,
            time: 123456789
        }
    ]
}
Each item is its own document, with each territory/price point for that item in the data array. The issue I run into with this is retrieving the price history for a particular item. I believe I can accomplish this with the following query:
db.collection.aggregate(
    {$unwind: "$data"},
    {$match: {_id: 1234, "data.territory": "B"}}
)
The other alternative I was considering was just putting every single data point in its own document and putting an index on the item and territory.
// Document 1
{
    item: 1234,
    territory: "A",
    price: 5679,
    time: 123456789
}
// Document 2
{
    item: 1234,
    territory: "B",
    price: 9676,
    time: 123456789
}
I'm just unsure of whether having 6 billion documents with 3 indexes or having 90,000 documents with 67200 array objects each and using an aggregate would be better for performance.
Or perhaps there's some other tree structure or handling of this problem that you fine folks and MongoDB wizards can recommend?
I would structure the documents as "prices for a product in a given territory per fixed time interval". The time interval is fixed for the schema as a whole, but different schemas result from different choices and the best one for your application will probably need to be decided by testing. Choosing the time interval to be 1 hour gives your second schema idea, with ~6 billion documents total. You could choose the time interval to be 2 weeks (don't). In my mind, the best time interval to choose is 1 day, so the documents would look like this
{
    "_id" : ObjectId(...), // could also use a combination of prod_id, terr_id, and time so you get a free unique index to look up by those 3 values
    "prod_id" : "DEADBEEF",
    "terr_id" : "FEEDBEAD",
    "time" : ISODate("2014-10-22T00:00:00.000Z"), // start of the day this document contains the data for
    "data" : [
        {
            "price" : 1234321,
            "time" : ISODate("2014-10-22T15:00:00.000Z") // start of the hour this data point is for
        },
        ...
    ]
}
I like the time interval of 1 day because it hits a nice balance between the number of documents (mostly relevant because of index sizes), the size of documents (16MB limit, have to pipe over network), and the ease of retiring old docs (hold 15 days, wipe+archive everything from the 15th day at some point each day). If you put an index on { "prod_id" : 1, "terr_id" : 1 }, that should let you fulfill your two main queries efficiently. You can gain an additional performance boost by preallocating the document for each day so that updates are in-place.
There's a great blog post about managing time series data like this, based on experience building the MMS monitoring system. I've essentially lifted my ideas from there.
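As a rough sketch of how the two main queries might look against that daily-bucket schema with pymongo (the database and collection names are assumptions; field names follow the schema above):
from datetime import datetime, timedelta
from pymongo import MongoClient

coll = MongoClient()["pricedb"]["prices"]  # db/collection names are assumptions
since = datetime.utcnow() - timedelta(days=14)

# 1) price history for one product in one territory
history_one = coll.find(
    {"prod_id": "DEADBEEF", "terr_id": "FEEDBEAD", "time": {"$gte": since}}
).sort("time", 1)

# 2) price history for one product across ALL territories
history_all = coll.find(
    {"prod_id": "DEADBEEF", "time": {"$gte": since}}
).sort([("terr_id", 1), ("time", 1)])

# each returned document holds one day of hourly points in its "data" array
for day_doc in history_one:
    for point in day_doc["data"]:
        print(point["time"], point["price"])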

Conditional Sorting in ElasticSearch

I have some documents that I would like to sort on a date field. For documents with date equal to a specified date, example today, and all dates after that I would like to sort ascending. For dates before the specified date I would like to sort in descending order.
Is this possible in ElasticSearch? If so could you suggest any literature or an approach.
date is of type "date" and format "dateOptionalTime".
Thanks
Yes this is possible in ElasticSearch using a script, either for sorting or for scoring.
My preference would be for a scoring script because 'script based score' is going to be quicker (according to the documentation).
Using a scoring script, you could use the Unix timestamp for the date field of type int/long and an mvel sorting script in the custom_score query. You might need to re-index your documents. You would also need to be able to convert the searched for time into a Unix timestamp to pump it at ElasticSearch.
The sorting script would then deduct the requested timestamp from each document's timestamp and make an absolute value. Then the results are sorted in ascending order - the lowest 'distance' is the best.
So when looking for documents dated about a year ago, it would look something like:
"query": {
"custom_score" : {
"query" : {
....
},
"params" : {
"req_date_stamp" : 1348438345,
},
"script" : "abs(doc['timestamp'].value - req_date_timestamp)"
}
},
"sort": {
"_score": {
'order': 'asc'
}
}
(Apologies for any mistakes in my JSON - I tested this idea in pyes)
You might need to tweak this to get the rounding right - for example your question mentions matching days, so you might want to round the timestamp generator to the nearest day.
For "full" info you can check out the Custom Score Query docs and follow the link to MVEL scripting.
For this kind of specific use case, you should use a sorting script.
See the "script based sorting" section in the Sort documentation page.
My English is poor, so briefly: my solution is to use boost.
My data is:
{"terms_id": [20211011,20211012,20211013,20211014], "sort_value": 1}
{"terms_id": [20211012,20211013,20211014], "sort_value": 2}
{"terms_id": [20211013,20211014,20211015], "sort_value": 1}
My query is:
{"bool":{"must":[],"should":[{"bool":{"must":[{"terms":{"terms_id":[20211012],"boost":5}}],"must_not":[]}},{"bool":{"must_not":[{"terms":{"terms_id":[20211012]}}]}}],"minimum_should_match":1}}
My sort is:
{"_score":{"order":"desc"},"sort_value":{"order":"desc"}}
The result is:
{"terms_id": [20211012,20211013,20211014], "sort_value": 2}
{"terms_id": [20211011,20211012,20211013,20211014], "sort_value": 1}
{"terms_id": [20211013,20211014,20211015], "sort_value": 1}
