Elasticsearch: Limit filtered query to 5 items per type per day

I'm using Elasticsearch to gather data for the front page of my event portal. The current query is as follows:
{
  "query": {
    "function_score": {
      "filter": {
        "and": [
          {
            "geo_distance": {
              "distance": "50km",
              "location": {
                "lat": 50.78,
                "lon": 6.08
              },
              "_cache": true
            }
          },
          {
            "or": [
              {
                "and": [
                  {
                    "term": {
                      "type": "event"
                    }
                  },
                  {
                    "range": {
                      "datetime": {
                        "gt": "now"
                      }
                    }
                  }
                ]
              },
              {
                "not": {
                  "term": {
                    "type": "event"
                  }
                }
              }
            ]
          }
        ]
      },
      "functions": [
        ...
      ]
    }
  }
}
So basically: all events within a 50 km radius that lie in the future, plus items of any other type. Other types could be status, photo, video, soundcloud, etc. All of these items have a datetime field and a parent field referencing the account the item belongs to. After the filter there are some functions for scoring the objects based on their distance and age.
Now my question:
Is there a way to restrict the query so that it returns only the first (or, even better, the highest-scored) 5 items per type, per account, per day?
Currently I have accounts that upload 20 images at the same time, which is too much to display on the front page.
I thought about using a filter script in a post_filter, but I am not very familiar with this topic.
Any ideas?
Many thanks in advance,
DTFagus

I solved it this way:
"aggs": {
"byParent": {
"terms": {
"field": "parent_id"
},
"aggs": {
"byType": {
"terms": {
"field": "type"
},
"aggs": {
"perDay": {
"date_histogram" : {
"field" : "datetime",
"interval": "day"
},
"aggs": {
"topHits": {
"top_hits": {
"size": 5,
"_source": {
"include": ["path"]
}
}
}
}
}
}
}
}
}
}
Unfortunately there is no pagination for aggregations (or, put the other way around, the pagination of the query is not applied to them). So I will fetch the paginated query results plus the aggregation over all hits and intersect the two arrays in JS. That does not sound very efficient, but I currently have no better idea. Anyone?
The only way around this that I can see would be to index the data into two indices: one containing all data and one containing only the top 5 per day per type per account. That would be less time-consuming to query, but more time- and storage-consuming while indexing :/

You can limit the number of results returned by your query using the "size" parameter. If you set size to 5, you will get the first 5 results returned by your query.
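For illustration, a minimal request body using size (and from for pagination); the match_all query here is just a placeholder, not the original query:
{
  "from": 0,
  "size": 5,
  "query": {
    "match_all": {}
  }
}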
Check the documentation http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/pagination.html
Hope this helps!

Related

Composite and Terms Aggregations on a field with a high cardinality

I am facing a huge performance problem with ES, with response times of more than 2 minutes.
I have an index with more than 25M documents (files) which includes the following 4 fields (among others):
...
"group_write": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"user_write": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"group_read": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"user_read": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
...
I have something like 100K unique users and groups and each field is a list of users/groups that holds ~100 values. For example:
"user_read": ["user_1", "group_1", ...],
"user_write": ["user_1", "group_2", ...]
...
I am using 2 kinds of aggregation: composite and terms. The composite aggregations are used for getting only the first X results to display, and the terms aggregations are used for prefix search.
Composite aggregation:
{
  "size": 0,
  "aggs": {
    "Group_Read_Permissions": {
      "composite": {
        "sources": [
          {
            "Group Read": {
              "terms": {
                "field": "group_read.raw"
              }
            }
          }
        ],
        "size": 10
      }
    },
    "Group_Write_Permissions": {
      "composite": {
        "sources": [
          {
            "Group Write": {
              "terms": {
                "field": "group_write.raw"
              }
            }
          }
        ]
      }
    },
    "User_Write_Permissions": {
      "composite": {
        "sources": [
          {
            "User Write": {
              "terms": {
                "field": "user_write.raw"
              }
            }
          }
        ]
      }
    },
    "User_Read_Permissions": {
      "composite": {
        "sources": [
          {
            "User Read": {
              "terms": {
                "field": "user_read.raw"
              }
            }
          }
        ]
      }
    }
  }
}
Terms aggregation:
{
  "size": 0,
  "aggs": {
    "Group_Read_Permissions": {
      "terms": {
        "field": "group_read.raw",
        "include": ".*[Ss].*"
      }
    },
    "Group Write Permissions": {
      "terms": {
        "field": "group_write.raw",
        "include": ".*[Ss].*"
      }
    },
    "User Read Permissions": {
      "terms": {
        "field": "user_read.raw",
        "include": ".*[Ss].*"
      }
    },
    "User Write Permissions": {
      "terms": {
        "field": "user_write.raw",
        "include": ".*[Ss].*"
      }
    }
  }
}
Composite aggregation returns results within 1 min and the terms aggregation can take up to 5 min.
What I have tried so far:
Adding a new field user_group_permissions and adding "copy_to": "user_group_permissions" to the above 4 fields
Adding the property "eager_global_ordinals": true to the above 4 fields and to the new user_group_permissions field
Increasing the refresh_interval up to 200s
** I reindexed for the first 2 suggestions (it took something like 6 hours)
All of the above helped a little with the retrieval time, but still: the composite aggregation takes up to 20s and the terms aggregation takes up to 3 min.
(The best results were on the user_group_permissions field created in the first suggestion, with eager_global_ordinals = true and refresh_interval = 120s.)
Please, if someone has any idea how to improve the retrieval times, I will be grateful.
First of all, if you only need the first 10 results, you don't need to use the composite aggregation, which is meant to be used only when you need to paginate over all results. Simply use the terms aggregation with its default size of 10; that'll do the job.
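As a rough sketch of that suggestion (showing only the group_read.raw field; the other three fields would follow the same pattern):
{
  "size": 0,
  "aggs": {
    "Group_Read_Permissions": {
      "terms": {
        "field": "group_read.raw",
        "size": 10
      }
    }
  }
}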
Second, what you're doing with the terms aggregation is not prefix filtering but infix filtering, which is completely different in terms of performance. While it is easy to search for prefixes, searching for infixes requires the equivalent of a "full table scan", because each and every term must be visited.
A first optimization I would suggest is to move the regex of your second query into the query section (a bool/should with one regexp query per field), so as to reduce the document set on which the terms aggregations need to run. That might help a bit.
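A sketch of what that could look like, reusing the infix pattern from the question; the regexp queries narrow down the documents before the aggregations run, while the include still filters the buckets themselves (only one of the four aggregations shown for brevity):
{
  "size": 0,
  "query": {
    "bool": {
      "should": [
        { "regexp": { "group_read.raw": ".*[Ss].*" } },
        { "regexp": { "group_write.raw": ".*[Ss].*" } },
        { "regexp": { "user_read.raw": ".*[Ss].*" } },
        { "regexp": { "user_write.raw": ".*[Ss].*" } }
      ],
      "minimum_should_match": 1
    }
  },
  "aggs": {
    "Group_Read_Permissions": {
      "terms": {
        "field": "group_read.raw",
        "include": ".*[Ss].*"
      }
    }
  }
}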
A second optimization is to leverage the wildcard field type, a specialized field type built specifically for grep-like wildcard and regexp queries.
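For example, remapping one of the fields might look roughly like this (this assumes Elasticsearch 7.9 or later, where the wildcard type is available, and it would require reindexing; only one field shown):
"group_read": {
  "type": "text",
  "fields": {
    "raw": {
      "type": "wildcard"
    }
  }
}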
Another possible optimization is to lowercase all your permissions, so that you only need to search for .*s.* instead of the uppercase variant.
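One way to do that at index time is a lowercase normalizer on the keyword subfield; a minimal sketch, where the index name and normalizer name are made up and only one field is shown:
PUT permissions_index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "group_read": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword",
            "normalizer": "lowercase_normalizer"
          }
        }
      }
    }
  }
}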
Depending on your comments, I'll add more optimizations as the discussion goes on.

Is it ok to use only filter query in elastic search

I have to query Elasticsearch for some data, and all my filters are drop-down values, i.e. exact matches only. So I thought of using only filter clauses and not any must or match query. Is there any problem with this kind of approach?
In the example below I am trying to get the last 15 minutes of data where l1 is any one of ("XYZ", "CFG") and l2 is any one of ("ABC", "CDE").
My query looks like below :
{
  "size": 20,
  "sort": [
    {
      "eventTs": "desc"
    }
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "eventTs": {
              "gte": "now-15m",
              "lte": "now",
              "format": "epoch_millis",
              "boost": 1
            }
          }
        },
        {
          "terms": {
            "l1": [
              "XYZ", "CFG"
            ]
          }
        },
        {
          "terms": {
            "l2": [
              "ABC", "CDE"
            ]
          }
        }
      ]
    }
  }
}
If you don't need _score, which is used to rank documents by relevance, you can use filter context, which executes much faster (since score calculation is disabled) and is cached as well.
It is worth reading query and filter context for an in-depth understanding of these concepts.
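If you later do need relevance scoring on some free-text field, both contexts can be mixed in the same bool query; a sketch, where the description field is made up for illustration and the other clauses come from the question:
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "error timeout" } }
      ],
      "filter": [
        { "terms": { "l1": ["XYZ", "CFG"] } },
        { "range": { "eventTs": { "gte": "now-15m", "lte": "now" } } }
      ]
    }
  }
}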

Search and aggregation on two indices

I have created two indices, each with a date field.
First index mapping:
PUT /index_one
{
  "mappings": {
    "properties": {
      "date_start": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}
Second index mapping:
PUT /index_two
{
  "mappings": {
    "properties": {
      "date_end": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss.SSSZZ||yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      }
    }
  }
}
I need to find dates within a certain range and run an aggregation that averages the difference between the two dates.
I tried a request like this:
GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
  "aggs": {
    "filtered_dates": {
      "filter": {
        "bool": {
          "must": [
            {
              "exists": {
                "field": "date_start"
              }
            },
            {
              "exists": {
                "field": "date_end"
              }
            }
          ]
        }
      },
      "aggs": {
        "avg_date": {
          "avg": {
            "script": {
              "lang": "painless",
              "source": "doc['date_end'].value.toInstant().toEpochMilli() - doc['date_begin'].value.toInstant().toEpochMilli()"
            }
          }
        }
      }
    }
  }
}
I get the following response to the request:
{
  "hits": {
    "total": {
      "value": 16508
    },
    "hits": [
      {
        "_index": "index_one",
        "_type": "_doc",
        "_id": "93a34c5b-101b-45ea-9965-96a2e0446a28",
        "_score": 1.0,
        "_source": {
          "date_begin": "2021-02-26 07:26:29.732+0300"
        }
      }
    ]
  },
  "aggregations": {
    "filtered_dates": {
      "meta": {},
      "doc_count": 0,
      "avg_date": {
        "value": null
      }
    }
  }
}
Can you please tell me if it is possible to make a query with search and aggregation over two indices in Elasticsearch? If so, how?
If you stored date_start on the document which contains date_end, it'd be much easier to figure out the average — check my answer to Store time related data in ElasticSearch.
Now, the script context operates on one single document at a time and has "no clue" about the other, potentially related docs. So if you don't store both dates at the same time in at least one doc, you'd need to somehow connect the docs nonetheless.
One option would be to use their ids:
POST index_one/_doc
{ "id":1, "date_start": "2021-01-01" }
POST index_two/_doc
{ "id":1, "date_end": "2021-12-31" }
POST index_one/_doc/2
{ "id":2, "date_start": "2021-01-01" }
POST index_two/_doc/2
{ "id":2, "date_end": "2021-01-31" }
After that, it's possible to:
Target multiple indices — as you already do.
Group the docs by their IDs and select only those that include at least 2 buckets (assuming two buckets represent the start & the end).
Obtain the min & max dates — essentially cherry-picking the date_start and date_end to be used later down the line.
Use a bucket_script aggregation to calculate their difference (in milliseconds).
Leverage a top-level average bucket aggregation to run over all the difference buckets and ... average them.
In concrete terms:
GET /index_one,index_two/_search?scroll=1m&q=[2021-01-01+TO+2021-12-31]&filter_path=aggregations,hits.total.value,hits.hits
{
  "aggs": {
    "grouped_by_id": {
      "terms": {
        "field": "id",
        "min_doc_count": 2,
        "size": 10
      },
      "aggs": {
        "min_date": {
          "min": {
            "field": "date_start"
          }
        },
        "max_date": {
          "max": {
            "field": "date_end"
          }
        },
        "diff": {
          "bucket_script": {
            "buckets_path": {
              "min": "min_date",
              "max": "max_date"
            },
            "script": "params.max - params.min"
          }
        }
      }
    },
    "avg_duration_across_the_board": {
      "avg_bucket": {
        "buckets_path": "grouped_by_id>diff",
        "gap_policy": "skip"
      }
    }
  }
}
If everything goes right, you'll end up with:
...
"aggregations" : {
"grouped_by_id" : {
...
},
"avg_duration_across_the_board" : {
"value" : 1.70208E10 <-- 17,020,800,000 milliseconds ~ 4,728 hrs
}
}
⚠️ Caveat: note that the 2nd level terms aggregation has an adjustable size. You'll probably need to increase it to cover more docs. But there are theoretical and practical limits as to how far it makes sense to increase it.
📖 Shameless plug: this was inspired in part by the chapter Aggregations & Buckets in my recently published Elasticsearch Handbook — containing lots of other real-world, non-trivial examples 🙌

Elasticsearch Pipelined search?

I've been using Elasticsearch for a while at my company, and it seems to have been working well so far for our searches.
We've been seeing more complex use cases from our customers that require more "ad-hoc/advanced" query capabilities and inter-document relationships (joins, in the traditional sense).
I understand that ES isn't built for joins and that denormalisation is the recommended way. We have been denormalising the documents to support every use case so far, and that in itself has become overly complex and expensive for us, as our customers have to wait a long time for such code changes to be rolled out.
We've been criticized more and more often by our business: "Hey, your data model isn't right. It isn't suited for smarter queries." It is painfully hard for the team, every time, to make them understand why denormalisation is required.
A few examples of the problems:
"Find me all the persons having the same birthdays"
"Find me all the persons travelling to the same cities within the same time frame"
Imagine every event document is a person record with their travel details.
So is there a concept of a pipeline search where I can break the search into multiple search queries and pass the output of one as an input to another?
Or is there any other recommended way to solve these types of problems without having to boil the ocean?
The two queries above can be solved with aggregations.
I'm assuming the following sample document/schema:
{
  "firstName": "John",
  "lastName": "Doe",
  "birthDate": "1998-04-02",
  "travelDate": "2019-10-31",
  "city": "London"
}
The first one can be solved by aggregating with a terms aggregation on the birth date (day of the year) and min_doc_count: 2, e.g.:
{
  "size": 0,
  "aggs": {
    "birthdays": {
      "terms": {
        "script": "return LocalDate.parse(params._source.birthDate).format(DateTimeFormatter.ofPattern('MM/dd'))",
        "min_doc_count": 2
      },
      "aggs": {
        "persons": {
          "top_hits": {}
        }
      }
    }
  }
}
The second one can be solved by aggregating with a terms aggregation on the city field, constrained by a range query on the travelDate field for the desired time frame:
{
  "size": 0,
  "query": {
    "range": {
      "travelDate": {
        "gte": "2019-10-01",
        "lt": "2019-11-01"
      }
    }
  },
  "aggs": {
    "cities": {
      "terms": {
        "field": "city.keyword"
      },
      "aggs": {
        "persons": {
          "top_hits": {}
        }
      }
    }
  }
}
The second query can also be done with field collapsing:
{
  "_source": false,
  "query": {
    "range": {
      "travelDate": {
        "gte": "2019-10-01",
        "lt": "2019-11-01"
      }
    }
  },
  "collapse": {
    "field": "city.keyword",
    "inner_hits": {
      "name": "people"
    }
  }
}
If you need both aggregations at the same time, it is definitely possible to do so:
{
  "size": 0,
  "aggs": {
    "birthdays": {
      "terms": {
        "script": "return LocalDate.parse(params._source.birthDate).format(DateTimeFormatter.ofPattern('MM/dd'))",
        "min_doc_count": 2
      },
      "aggs": {
        "persons": {
          "top_hits": {}
        }
      }
    },
    "travels": {
      "filter": {
        "range": {
          "travelDate": {
            "gte": "2019-10-01",
            "lt": "2019-11-01"
          }
        }
      },
      "aggs": {
        "cities": {
          "terms": {
            "field": "city.keyword"
          },
          "aggs": {
            "persons": {
              "top_hits": {}
            }
          }
        }
      }
    }
  }
}

Elasticsearch: how to scope aggregations to your query and filter?

I have been playing around with Elasticsearch queries and filters for some time now, but I have never worked with aggregations before. The idea that we can scope aggregations with our query seems quite amazing to me, but I want to understand how to do it properly so that I do not make any mistakes. Currently all my search queries are designed this way:
{
  "query": {
  },
  "filter": {
  },
  "from": 0,
  "size": 60
}
Now, when I added some aggregation buckets, the structure became this:
{
  "aggs": {
    "all_colors": {
      "terms": {
        "field": "color.name"
      }
    },
    "all_brands": {
      "terms": {
        "field": "brand_slug"
      }
    },
    "all_sizes": {
      "terms": {
        "field": "sizes"
      }
    }
  },
  "query": {
  },
  "filter": {
  },
  "from": 0,
  "size": 60
}
However, the results of the aggregations are always the same, irrespective of what I provide in filter.
Now, when I changed the query structure to something like this, it started showing different results:
{
  "aggs": {
    "all_colors": {
      "terms": {
        "field": "color.name"
      }
    },
    "all_brands": {
      "terms": {
        "field": "brand_slug"
      }
    },
    "all_sizes": {
      "terms": {
        "field": "sizes"
      }
    }
  },
  "query": {
    "filtered": {
      "query": {
      },
      "filter": {
      }
    }
  },
  "from": 0,
  "size": 60
}
Does this mean I will have to change the structure of my search queries everywhere to this new filtered structure? Is there any other workaround that allows me to achieve the desired results without having to change that much code?
Also, another thing I observed: if my brand_slug field contains multiple keywords, like "peter england", then both words are returned in separate buckets, like this:
{
  "buckets": [
    {
      "key": "england",
      "doc_count": 368
    },
    {
      "key": "peter",
      "doc_count": 368
    }
  ]
}
How can I ensure that both of these end up in the same bucket, like this:
{
  "buckets": [
    {
      "key": "peter england",
      "doc_count": 368
    }
  ]
}
UPDATE: I have been able to accomplish this second part by indexing brand, color and sizes differently, like this:
"sizes": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
What you've noticed is by design. Have a look at my answer to a similar question on SO. Basically, the input to both the aggregation and filter sections is the output of the query section. A filtered query, as you've suggested, would be the best way to achieve the results you desire. There is another way, too: you can use a filter aggregation. Then you would not need to change your query and filter sections, but simply copy the filter section inside each aggregation section; in my opinion, though, that would be overkill and a violation of the DRY principle in general.
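For completeness, a sketch of that filter-aggregation alternative, assuming a hypothetical term filter on brand_slug; the same filter would have to be repeated in front of each of the three aggregations, which is where the duplication comes from (only one aggregation shown, and the top-level query and filter sections stay as they were):
{
  "aggs": {
    "filtered_colors": {
      "filter": {
        "term": { "brand_slug": "peter-england" }
      },
      "aggs": {
        "all_colors": {
          "terms": { "field": "color.name" }
        }
      }
    }
  },
  "query": {
  },
  "filter": {
  },
  "from": 0,
  "size": 60
}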
