Elasticsearch ranking aggregation with multiple terms query - elasticsearch

tl;dr: Want to rank aggregations based on whether bucket key has used either of the search terms.
I have two indices documents and recommendations with the following mappings:
Documents:
{
"id": string,
"document_text" : string,
"author" : { "name": string }
...other fields
}
Recommendations:
{
"id": string,
"recommendation_text" : string,
"author" : { "name": string }
...other fields
}
The problem I am solving is to have top authors for query terms.
This works quite well with multimatch for a single query term like this:
{
"size": 0,
"query": {
"multi_match": {
"query": "science",
"fields": [
"document_text",
"recommendation_text"
],
"type": "phrase",
}
},
"aggs": {
"search-authors": {
"terms": {
"field": "author.name.keyword",
"size": 50
},
"aggs": {
"top-docs": {
"top_hits": {
"size": 100
}
}
}
}
}
}
But when I have multiple keywords, let's say zoology, botany, I want the aggregation ranking to place the authors who have talked about both zoology and botany higher than those who have used either of them.
having multiple multi_match with bool doesn't help since this isn't exactly an and/or situation.

Related

Composite and Terms Aggregations on a field with a high cardinality

I am facing a huge performance problem with ES which results in more than 2 min response.
I have an index that has more than 25M files and composes of the next 4 fields (among others):
...
"group_write": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"user_write": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"group_read": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
},
"user_read": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
...
I have something like 100K unique users and groups and each field is a list of users/groups that holds ~100 values. For example:
"user_read": ["user_1", "group_1", ...],
"user_write": ["user_1", "group_2", ...]
...
I have 2 kinds of aggregation I am using, composite and terms. Composite aggregations for getting only first X results to display and terms aggregation for prefix search.
Composite aggregation:
{
"size": 0,
"aggs": {
"Group_Read_Permissions": {
"composite": {
"sources": [
{
"Group Read": {
"terms": {
"field": "group_read.raw"
}
}
}
],
"size": 10
}
},
"Group_Write_Permissions": {
"composite": {
"sources": [
{
"Group Write": {
"terms": {
"field": "group_write.raw"
}
}
}
]
}
},
"User_Write_Permissions": {
"composite": {
"sources": [
{
"User Write": {
"terms": {
"field": "user_write.raw"
}
}
}
]
}
},
"User_Read_Permissions": {
"composite": {
"sources": [
{
"User Read": {
"terms": {
"field": "user_read.raw"
}
}
}
]
}
}
}
}
Terms aggregation:
{
"size": 0,
"aggs": {
"Group_Read_Permissions": {
"terms": {
"field": "group_read.raw",
"include": ".*[Ss].*"
}
},
"Group Write Permissions": {
"terms": {
"field": "group_write.raw",
"include": ".*[Ss].*"
}
},
"User Read Permissions": {
"terms": {
"field": "user_read.raw",
"include": ".*[Ss].*"
}
},
"User Write Permissions": {
"terms": {
"field": "user_write.raw",
"include": ".*[Ss].*"
}
}
}
}
Composite aggregation returns results within 1 min and the terms aggregation can take up to 5 min.
What I have tried so far:
Adding new field user_group_permissions and adding to the above 4 fields "copy_to": "user_group_permissions"
Adding to the above 4 fields and to the field "user_group_permissions" the next property: "eager_global_ordinals": true
Increased the refresh_interval up to 200s
** I reindexed for the first 2 suggestions [took something like 6 hours]
All of the above did help a little with the retrieval time but still: composite aggregation takes up to 20s and terms aggregation takes up to 3 min.
[The best results were on the fields user_group_permissions which has been created in the first suggestion, with eager_global_ordinals = true and refresh_interval = 120s].
Please, if someone has any idea how to improve the retrieval times I will be grateful.
First of all, if you only need the first 10 results, you don't need to use the composite aggregation, which is meant to be used only if you need to paginate over all results. Simply use the terms aggregation with default size 10, that'll do the job.
Second, what you're doing with the terms aggregation is not a prefix filtering, but infix filtering, which is completely different in terms of performance. While it's easy to search for prefixes, searching for infixes requires the equivalent of a "full table scan" because each and every term must be visited.
A first optimization I would suggest is that in your second query you should do your regex in the query part (bool/should with one regex query per field), so as to reduce the document set on which the terms aggregations need to run. That might help a bit.
A second optimization is to leverage the wildcard field type which is a specialized field type made specially for grep-like wildcard and regexp queries.
Another possible optimization is to lowercase all your permissions, so that you only need to search for .*s.* instead of the uppercase variant.
Depending on your comments, I'll add more optimizations as the discussion goes on.

Deduplicate and perform composite aggregation on deduced result

I've an index in elastic search which contains data of daily transactions. Each doc has mainly three fields as below :
TxnId, Status, TxnType,userId
two documents can have same TxnIds.
I'm looking for a query that provides aggregation over status,TxnType for unique txnIds. Basically I'm looking for something like : select unique txnIds from user_table group by status,txnType.
I've a ES query which will dedup on TxnIds. I've another ES query which can perform composite aggregation on status and txnType. I want to do both things in Single query.
I tried collapse feature . I also tried cardinality and dedup features. But query is not giving correct output.:
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"term": {
"streamSource": 3
}
}
]
}
},
"collapse": {
"field": "txnId"
},
"aggs": {
"buckets": {
"composite": {
"size": 30,
"sources": [
{
"status": {
"terms": {
"field": "status"
}
}
},
{
"txnType": {
"terms": {
"field": "txnType"
}
}
}
]
}
}
}
}

Nested Fields, Wildcard Queries and Aggregations in Elasticsearch

I have an index that collects web redirects data for various sites. I am using a nested field to collect the data as shown in the mapping below:
"chain": {
"type": "nested",
"properties": {
"url.position": {
"type": "long"
},
"url.full": {
"type": "text"
},
"url.domain": {
"type": "keyword"
},
"url.path": {
"type": "keyword"
},
"url.query": {
"type": "text"
}
}
}
As you can imagine, each document contains an array of url chains, the size of the array being equal to number of web redirects. I want to get aggregations based on wildcard/regexp matches to url.query field. Here is a sample query:
GET push_url_chain/_search
{
"query": {
"nested": {
"path": "chain",
"query": {
"regexp": {
"chain.url.query": "aff_c.*"
}
}
}
},
"size": 0,
"aggs": {
"dataFields": {
"nested": {
"path": "chain"
},
"aggs": {
"offers": {
"terms": {
"field": "chain.url.domain",
"size": 30
}
}
}
}
}
}
The above query does produce aggregated results but not the way I want.
I want to see chain.url.domain aggregations for the urls that contain the aff_c.* phrase. Right now it is looking at all the urls in the chain and then aggregating the buckets by doc_count regardless of whether that url/domain has the particular phrase. I hope I have been able to explain this clearly. How do I get my results to show bucket aggregations that contain domains that have aff_c.* phrase match to the query field of the url.
I would also like to know how I can use = or / in my wildcard or regexp queries. It is not producing any results if I use the above symbols in my queries.
Tha
Nested query returns all documents where a nested document matches the condition, you get matched nested docs only in inner_hits.
Aggregation is applied on top of these documents, so all domains are coming in terms
You need to use nested aggregation to gets only matching terms.
{
"size": 0,
"aggs": {
"Name": {
"nested": {
"path": "chain"
},
"aggs": {
"matched_doc": {
"filter": { --> filter for url
"match_phrase_prefix": {
"chain.url.query": "abc"
}
},
"aggs": {
"domain": {
"terms": {
"field": "chain.url.domain", -- terms for matched url
"size": 10
}
}
}
}
}
}
}
}
You can use match_phrase_prefix instead of regex. It has better performance.
Standard analyzer while generating tokens removes "/","=". So if you want to use regex or wildcard and look for these , you need to use keyword field not text field.

Sort multi-bucket aggregation by source fields inside inner multi-bucket aggregation

TL;DR: Using an inner multi-bucket aggregation (top_hits with size: 1) inside an outer multi-bucket aggregation, is it possible to sort the buckets of the outer aggregation by the data in the inner buckets?
I have the following index mappings
{
"parent": {
"properties": {
"children": {
"type": "nested",
"properties": {
"child_id": { "type": "keyword" }
}
}
}
}
}
and each child (in data) has also the properties last_modified: Date and other_property: String.
I need to fetch a list of children (of all the parents but without the parents), but only the one with the latest last_modified per each child_id. Then I need to sort and paginate those results to return manageable amounts of data.
I'm able to get the data and paginate over it with a combination of nested, terms, top_hits, and bucket_sort aggregations (and also get the total count with cardinality)
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"children": {
"nested": {
"path": "children"
},
"aggs": {
"totalCount": {
"cardinality": {
"field": "children.child_id"
}
},
"oneChildPerId": {
"terms": {
"field": "children.child_id",
"order": { "_term": "asc" },
"size": 1000000
},
"aggs": {
"lastModified": {
"top_hits": {
"_source": [
"children.other_property"
],
"sort": {
"children.last_modified": {
"order": "desc"
}
},
"size": 1
}
},
"paginate": {
"bucket_sort": {
"from": 36,
"size": 3
}
}
}
}
}
}
}
}
but after more than a solid day of going through the docs and experimenting, I seem to be no closer to figuring out, how to sort the buckets of my oneChildPerId aggregation by the other_property of that single child retrieved by lastModified aggregation.
Is there a way to sort a multi-bucket aggregation by results in a nested multi-bucket aggregation?
What I've tried:
I thought I could use bucket_sort for that too, but apparently its sort can only be used with paths containing other single-bucket aggregations and ending in a metic one.
I've tried to find a way to somehow transform the 1-result multi-bucket of lastModified into a single-bucket, but haven't found any.
I'm using ElasticSearch 6.8.6 (the bucket_sort and similar tools weren't available in ES 5.x and older).
I had the same problem. I needed a terms aggregation with a nested top_hits, and want to sort by a specific field inside the nested aggregation.
Not sure how performant my solution is, but the desired behaviour can be achieved with a single-value metric aggregation on the same level as the top_hits. Then you can sort by this new aggregation in the terms aggregation with the order field.
Here an example:
POST books/_doc
{ "genre": "action", "title": "bookA", "pages": 200 }
POST books/_doc
{ "genre": "action", "title": "bookB", "pages": 35 }
POST books/_doc
{ "genre": "action", "title": "bookC", "pages": 170 }
POST books/_doc
{ "genre": "comedy", "title": "bookD", "pages": 80 }
POST books/_doc
{ "genre": "comedy", "title": "bookE", "pages": 90 }
GET books/_search
{
"size": 0,
"aggs": {
"by_genre": {
"terms": {
"field": "genre.keyword",
"order": {"max_pages": "asc"}
},
"aggs": {
"top_book": {
"top_hits": {
"size": 1,
"sort": [{"pages": {"order": "desc"}}]
}
},
"max_pages": {"max": {"field": "pages"}}
}
}
}
}
by_genre has the order field which sorts by a sub aggregation called max_pages. max_pages has only been added for this purpose. It creates a single-value metric by which the order is able to sort by.
Query above returns (I've shortened the output for clarity):
{ "genre" : "comedy", "title" : "bookE", "pages" : 90 }
{ "genre" : "action", "title" : "bookA", "pages" : 200 }
If you change "order": {"max_pages": "asc"} to "order": {"max_pages": "desc"}, the output becomes:
{ "genre" : "action", "title" : "bookA", "pages" : 200 }
{ "genre" : "comedy", "title" : "bookE", "pages" : 90 }
The type of the max_pages aggregation can be changed as needed , as long as it is a single-value metic aggregation (e.g. sum, avg, etc)

How to sort elasticsearch results based on number of collapsed items?

I'm using a a query with collapse in order to gather some documents under a certain person, yet I wish to sort the results based on the number of documents in which the search found a match.. this is my query:
GET documents/_search
{
"_source": {
"includes": [
"text"
]
},
"query": {
"query_string": {
"fields": [
"text"
],
"query": "some text"
}
},
"collapse": {
"field": "person_id",
"inner_hits": {
"name": "top_mathing_docs",
"_source": {
"includes": [
"doc_year",
"text"
]
}
}
}
}
Any suggestions?
Thanks
If I understand correctly, what you require here is to sort the documents i.e. parent documents, based on the count of inner_hits i.e. count of inner_hits based on person_id.
So that means, the _score of the parent documents in the result doesn't matter.
The only way I've found this doable is making use of the Top Hits Aggregation for Field Collapse Example and below is what your query would look like.
Aggregation Query Field Collapse Example:
POST <your_index_name>/_search
{
"size":0,
"query": {
"query_string": {
"fields": [
"text"
],
"query": "some text"
}
},
"aggs": {
"top_person_ids": {
"terms": {
"field": "person_id"
},
"aggs": {
"top_tags_hits": {
"top_hits": {
"size": 10
}
}
}
}
}
}
Note that I'm assuming person_id is of type keyword or any numeric.
Also if you look at query closely, I've mentioned "size":"0". Which means I'm only returning the result of aggregation.
Another note is that the above aggregation has nothing to do with Field Collapse in Search Request feature that you have posted in the question. It's just that using this aggregation, your result could be formatted in a similar way.
Let me know if this helps!

Resources