How to nested aggregate matched terms? - elasticsearch

I've this mapping in my index:
{
"mappings": {
"properties": {
"uuid": {
"type": "keyword"
},
"last_visit": {
"type": "date"
},
"urls": {
"type": "nested",
"properties": {
"url": {
"type": "keyword"
},
"is_visited": {
"type": "boolean"
}
}
}
}
}
}
and hundreds of documents like this:
This is my desired output when I search for *google.com and *facebook.com:
[
{
"uuid": "afa9ac03-0723-4d66-ae18-08a51e2973bd",
"urls": [
{
"is_visited": true,
"url": "https://www.google.com",
"last_visit": "2022-02-31"
},
{
"is_visited": false,
"url": "https://www.facebook.com",
"last_visit": "2022-02-03"
},
{
"is_visited": true,
"url": "https://www.twitter.com",
"last_visit": "2022-03-30"
}
]
},
{
"uuid": "4a1c695d-756b-4d9d-b3a0-cf524d955884",
"urls": [
{
"is_visited": true,
"url": "https://www.stackoverflow.com",
"last_visit": "2022-03-23"
},
{
"is_visited": false,
"url": "https://www.facebook.com",
"last_visit": "2022-02-02"
},
{
"is_visited": false,
"url": "https://drive.google.com",
"last_visit": "2022-05-01"
},
{
"is_visited": true,
"url": "https://www.google.com",
"last_visit": "2022-07-09"
}
]
}
]
and this is the query I wrote (based on an answer to another question, where I did not explain the desired output well). It focuses on *google.com, where I try to add the last_visit field to the output:
{
"query": {
"nested": {
"path": "urls",
"query": {
"bool": {
"should": [
{
"wildcard": {
"urls.url": {
"value": "*google.com"
}
}
},
{
"wildcard": {
"urls.url": {
"value": "*facebook.com"
}
}
}
]
}
}
}
},
"aggs": {
"agg_providers": {
"nested": {
"path": "urls"
},
"aggs": {
"google.com": {
"terms": {
"field": "urls.url",
"include": ".*google.com",
"size": 10
},
"aggs": {
"top_hits": {
"top_hits": {
"size": 1,
"_source": {
"includes": ["last_visit"]
}
}
}
}
},
"facebook.com": {
"terms": {
"field": "urls.url",
"include": ".*facebook.com",
"size": 10
}
}
}
}
}
}
The query above returns two separate bucket lists whose entries contain only key and doc_count, instead of all the fields (is_visited, last_visit, uuid, etc.).
Thanks.
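One thing visible in the query above is that only the google.com bucket carries a top_hits sub-aggregation. A minimal Python sketch of building the same aggregation with top_hits in every bucket follows. The helper name build_pattern_aggs and the sub-aggregation name matched_urls are hypothetical. As far as I know, top_hits inside a nested aggregation returns the nested objects (url, is_visited), so fields mapped at the root level, such as uuid and last_visit in the mapping above, would not be part of that _source.

```python
def build_pattern_aggs(domains, nested_path="urls", field="url", hits_per_url=1):
    """Build a nested aggregation with one terms sub-aggregation per domain."""
    full_field = f"{nested_path}.{field}"
    per_domain = {}
    for domain in domains:
        per_domain[domain] = {
            "terms": {
                "field": full_field,
                # "include" is a regex over the whole keyword value,
                # so the dots of the domain are escaped here.
                "include": ".*" + domain.replace(".", r"\."),
                "size": 10,
            },
            # Every bucket gets its own top_hits, not only the first one.
            # "matched_urls" is a hypothetical name.
            "aggs": {"matched_urls": {"top_hits": {"size": hits_per_url}}},
        }
    return {"agg_providers": {"nested": {"path": nested_path}, "aggs": per_domain}}


aggs = build_pattern_aggs(["google.com", "facebook.com"])
```

The resulting dict can be sent as the "aggs" part of the search body, next to the nested query from the question.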

Related

How to aggregate matched terms in a query_string search?

I wish to search for wildcard terms in a nested list of dicts and then obtain a list of the matched terms and their uuids, grouped by the wildcard they matched.
I've the following mapping in my index:
"mappings": {
"properties": {
"uuid": {
"type": "keyword"
},
"urls": {
"type": "nested",
"properties": {
"url": {
"type": "keyword"
},
"is_visited": {
"type": "boolean"
}
}
}
}
}
and a lot of data such as this:
{
"uuid":"afa9ac03-0723-4d66-ae18-08a51e2973bd",
"urls": [
{
"is_visited": true,
"url": "https://www.google.com"
},
{
"is_visited": false,
"url": "https://www.facebook.com"
},
{
"is_visited": true,
"url": "https://www.twitter.com"
},
]
},
{
"uuid":"4a1c695d-756b-4d9d-b3a0-cf524d955884",
"urls": [
{
"is_visited": true,
"url": "https://www.stackoverflow.com"
},
{
"is_visited": false,
"url": "https://www.facebook.com"
},
{
"is_visited": false,
"url": "https://drive.google.com"
},
{
"is_visited": false,
"url": "https://maps.google.com"
},
]
}
...
I wish to search via wildcard "*google.com OR *twitter.com" and obtain something like this:
"hits": [
"*google.com": [
{
"uuid": "4a1c695d-756b-4d9d-b3a0-cf524d955884",
"_source": {
"is_visited": false,
"url": "https://drive.google.com"
}
},
{
"id": "4a1c695d-756b-4d9d-b3a0-cf524d955884",
"_source": {
"is_visited": false,
"url": "https://maps.google.com"
}
},
{
"uuid":"afa9ac03-0723-4d66-ae18-08a51e2973bd",
"_source": {
"is_visited": true,
"url": "https://www.google.com"
}
}
]
"*twitter.com": [
{
"uuid":"afa9ac03-0723-4d66-ae18-08a51e2973bd",
"_source": {
"is_visited": true,
"url": "https://www.twitter.com"
},
},
]
]
This is my (python) search query:
body = {
#"_source": False,
"size": 100,
"query": {
"nested": {
"path": "urls",
"query":{
"query_string":{
"query": f"urls.url:{urlToSearch}",
}
}
,"inner_hits": {
"size":100 # returns top 100 results
}
}
}
}
but it returns a hit for each matched term instead of aggregating them into a list like the one I would like to get.
EDIT
This is my setting and mapping:
{
"settings": {
"analysis": {
"char_filter": {
"my_filter": {
"type": "mapping",
"mappings": [
"- => _",
]
},
},
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"my_filter"
],
"filter": [
"lowercase",
]
}
}
}
},
"mappings": {
"properties": {
"uuid": {
"type": "keyword"
},
"urls": {
"type": "nested",
"properties": {
"url": {
"type": "keyword"
},
"is_visited": {
"type": "boolean"
}
}
}
}
}
}
Elasticsearch will not produce the output you want with the query set up that way.
This scenario calls for an aggregation. My suggestion is to apply the nested query and aggregate on the results.
A point of attention about the wildcard query, from the documentation:
Avoid beginning patterns with * or ?. This can increase the iterations
needed to find matching terms and slow search performance.
{
"size": 0,
"query": {
"nested": {
"path": "urls",
"query": {
"bool": {
"should": [
{
"wildcard": {
"urls.url": {
"value": "*google.com"
}
}
},
{
"wildcard": {
"urls.url": {
"value": "*twitter.com"
}
}
}
]
}
}
}
},
"aggs": {
"agg_providers": {
"nested": {
"path": "urls"
},
"aggs": {
"google.com": {
"terms": {
"field": "urls.url",
"include": ".*google.com",
"size": 10
}
},
"twitter.com": {
"terms": {
"field": "urls.url",
"include": ".*twitter.com",
"size": 10
}
}
}
}
}
}
Results:
"aggregations": {
"agg_providers": {
"doc_count": 7,
"twitter.com": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "https://www.twitter.com",
"doc_count": 1
}
]
},
"google.com": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "https://drive.google.com",
"doc_count": 1
},
{
"key": "https://maps.google.com",
"doc_count": 1
},
{
"key": "https://www.google.com",
"doc_count": 1
}
]
}
}
}
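If the per-document fields matter more than the counts, one alternative is to keep the nested query with inner_hits from the question and regroup the hits on the client. This is a minimal Python sketch, assuming the response has the usual inner_hits layout keyed by the nested path ("urls") and that _source is enabled. group_inner_hits is a hypothetical helper name, and fnmatchcase only approximates the server-side wildcard matching.

```python
from fnmatch import fnmatchcase


def group_inner_hits(response, patterns, path="urls"):
    """Group matched nested objects by the wildcard pattern they satisfy."""
    grouped = {pattern: [] for pattern in patterns}
    for hit in response["hits"]["hits"]:
        uuid = hit["_source"]["uuid"]
        # Nested inner hits live under inner_hits.<path>.hits.hits
        inner = (
            hit.get("inner_hits", {})
            .get(path, {})
            .get("hits", {})
            .get("hits", [])
        )
        for nested_hit in inner:
            source = nested_hit["_source"]
            for pattern in patterns:
                # Case-sensitive glob match, similar in spirit to a wildcard query
                if fnmatchcase(source["url"], pattern):
                    grouped[pattern].append({"uuid": uuid, "_source": source})
    return grouped
```

Called as group_inner_hits(response, ["*google.com", "*twitter.com"]), it returns a dict with one list per pattern, close to the shape asked for in the question.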

Nested filter returns 0 doc_count

For this index and sample data:
PUT job_offers
{
"mappings": {
"properties": {
"location": {
"properties": {
"slug": {
"type": "keyword"
},
"name": {
"type": "keyword"
}
},
"type": "nested"
},
"experience": {
"properties": {
"slug": {
"type": "keyword"
},
"name": {
"type": "keyword"
}
},
"type": "nested"
}
}
}
}
POST job_offers/_doc
{
"title": "Junior Ruby on Rails Developer",
"location": [
{
"slug": "new-york",
"name": "New York"
},
{
"slug": "atlanta",
"name": "Atlanta"
},
{
"slug": "remote",
"name": "Remote"
}
],
"experience": [
{
"slug": "junior",
"name": "Junior"
}
]
}
POST job_offers/_doc
{
"title": "Ruby on Rails Developer",
"location": [
{
"slug": "chicago",
"name": "Chicago"
},
{
"slug": "atlanta",
"name": "Atlanta"
}
],
"experience": [
{
"slug": "senior",
"name": "Senior"
}
]
}
I try to run filter on experience.slug:
GET job_offers/_search
{
"query": {
"nested": {
"path": "location",
"query": {
"terms": {
"location.slug": [
"remote",
"new-york"
]
}
}
}
},
"aggs": {
"filtered_job_offers": {
"global": {},
"aggs": {
"filtered_location": {
"filter": {
"bool": {
"must": [
{
"terms": {
"experience.slug": [
"junior"
]
}
}
]
}
}
}
}
}
}
}
Response for this:
"aggregations" : {
"filtered_job_offers" : {
"doc_count" : 2,
"filtered_location" : {
"doc_count" : 0
}
}
}
Why do I get doc_count: 0 for filtered_location instead of 1? How can I make it work?
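You were pretty close! experience is a nested field, so a plain terms filter on experience.slug does not match anything at the root-document level. You have to wrap it in a nested query inside the aggregation's filter: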
...
"aggs": {
"filtered_job_offers": {
"global": {},
"aggs": {
"filtered_location": {
"filter": {
"bool": {
"must": [
{
"nested": { <-----
"path": "experience",
"query": {
"terms": {
"experience.slug": [
"junior"
]
}
}
}
}
]
}
}
}
}
}
}
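For anyone building this request from Python, here is a small sketch of the same idea: wrap the terms filter in a nested query before placing it in the filter aggregation. The helper name nested_terms_filter is hypothetical; the aggregation names are the ones from the question.

```python
def nested_terms_filter(path, field, values):
    """Wrap a terms query in a nested query so it can match nested objects."""
    return {
        "nested": {
            "path": path,
            "query": {"terms": {f"{path}.{field}": list(values)}},
        }
    }


aggs = {
    "filtered_job_offers": {
        "global": {},
        "aggs": {
            "filtered_location": {
                "filter": {
                    "bool": {
                        # The nested wrapper is what makes the filter match
                        "must": [nested_terms_filter("experience", "slug", ["junior"])]
                    }
                }
            }
        },
    }
}
```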

Elasticsearch Bucket sort on nested field

I have a problem with a bucket sort on nested field on Elastic 7.1.0:
My index has the following mapping:
{
"mapping": {
"dynamic": "strict",
"properties": {
"created_at_timestamp": {
"type": "date"
},
"url": {
"type": "keyword",
},
"title": {
"type": "keyword",
},
"entities": {
"type": "nested",
"properties": {
"counter": {
"type": "long"
},
"metric": {
"type": "long"
},
"id": {
"type": "long"
},
"relevance": {
"type": "float"
},
"weighted_metric": {
"type": "float"
}
}
}
}
}
}
and I need to order these documents by "weighted_metric", filtered for a specific entity id. I wrote this query:
GET my_index/_search?size=0
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "entities",
"query": {
"term": {
"entities.id": "27374"
}
}
}
}
],
"must_not": [
{
"term": {
"title": {
"value": ""
}
}
}
]
}
},
"aggs": {
"by_url_and_title": {
"composite": {
"sources": [
{
"final_url": {
"terms": {
"field": "final_url"
}
}
},
{
"title": {
"terms": {
"field": "title"
}
}
}
]
},
"aggs": {
"sum_metric": {
"nested": {
"path": "entities"
},
"aggs": {
"weightedmetric": {
"filters": {
"filters": {
"new": {
"bool": {
"should": [
{
"term": {
"entities.id": "27374"
}
}
]
}
}
}
},
"aggs": {
"wmetric": {
"sum": {
"field": "entities.weighted_metric"
}
}
}
},
"w_sort": {
"bucket_sort": {
"sort": [
{
"weightedmetric.wmetric": {
"order": "desc"
}
}
],
"size": 10
}
}
}
}
}
}
}
}
And I have this error:
{
"error": {
"root_cause": [],
"type": "search_phase_execution_exception",
"reason": "",
"phase": "fetch",
"grouped": true,
"failed_shards": [],
"caused_by": {
"type": "class_cast_exception",
"reason": "org.elasticsearch.search.aggregations.bucket.nested.InternalNested cannot be cast to org.elasticsearch.search.aggregations.InternalMultiBucketAggregation"
}
},
"status": 503
}
If I don't try to order the buckets everything works fine.
Can someone help me with this query? I need to order the buckets by weighted_metric. Thanks.

Elasticsearch how to sort with condition

On my ElasticSearch (2.x) I have documents like this:
{
"title": "A good title",
"formats": [{
"name": "pdf",
"prices": [{
"price": 11.99,
"currency": "EUR"
}, {
"price": 18.99,
"currency": "AUD"
}]
}]
}
I'd like to sort documents by formats.prices.price but only where the formats.prices.currency === 'EUR'
I tried mapping formats.prices as a nested field and then ran this query:
{
"query": {
"filtered": {
"query": {
"and": [
{
"match_all": {}
}
]
}
}
},
"sort": {
"formats.prices.price": {
"order": "desc",
"nested_path": "formats.prices",
"nested_filter": {
"term": {
"currency": "EUR"
}
}
}
}
}
But unfortunately I cannot get the right order.
UPDATE:
Relevant part of mapping:
"formats": {
"properties": {
"name": {
"type": "string"
},
"prices": {
"type": "nested",
"include_in_parent": true,
"properties": {
"currency": {
"type": "string"
},
"price": {
"type": "double"
}
}
}
}
},
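I hope this solves your problem. Compared with your query, it uses the full field path formats.prices.currency and a match instead of a term in nested_filter. Since currency is an analyzed string, a term query for "EUR" would likely not match the lowercased token.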
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "formats.prices",
"filter": {
"match": {
"formats.prices.currency": "EUR"
}
}
}
}
]
}
},
"from": 0,
"size": 50,
"sort": [
{
"formats.prices.price": {
"order": "asc",
"nested_path": "formats.prices",
"nested_filter": {
"match": {
"formats.prices.currency": "EUR"
}
}
}
}
]
}
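To make the intent of nested_filter easier to check against sample data, here is a client-side model in Python of what the sort is meant to do: only prices in the requested currency take part, and documents without such a price go last. It is a sketch of the semantics as I understand them (min for ascending, max for descending, missing values last), not Elasticsearch code, and the function names are hypothetical.

```python
def nested_sort_key(doc, currency, mode):
    """Return the price used for sorting, or None if no price matches."""
    prices = [
        price["price"]
        for fmt in doc.get("formats", [])
        for price in fmt.get("prices", [])
        if price["currency"] == currency  # plays the role of nested_filter
    ]
    if not prices:
        return None
    return min(prices) if mode == "min" else max(prices)


def sort_by_currency(docs, currency, order="asc"):
    """Sort documents by their price in one currency; non-matching docs last."""
    mode = "min" if order == "asc" else "max"
    keyed = [(nested_sort_key(doc, currency, mode), doc) for doc in docs]
    present = [pair for pair in keyed if pair[0] is not None]
    missing = [doc for key, doc in keyed if key is None]
    present.sort(key=lambda pair: pair[0], reverse=(order == "desc"))
    return [doc for _, doc in present] + missing
```

Running the real query and this model over the same handful of documents is a quick way to see whether the filter in the sort clause is actually being applied.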

ElasticSearch double nested sorting

I have documents which look like this (here is example):
{
"user": "xyz",
"state": "FINISHED",
"finishedTime": 1465566467161,
"jobCounters": {
"counterGroup": [
{
"counterGroupName": "org.apache.hadoop.mapreduce.FileSystemCounter",
"counter": [
{
"name": "FILE_BYTES_READ",
"mapCounterValue": 206509212380,
"totalCounterValue": 423273933523,
"reduceCounterValue": 216764721143
},
{
"name": "FILE_BYTES_WRITTEN",
"mapCounterValue": 442799895522,
"totalCounterValue": 659742824735,
"reduceCounterValue": 216942929213
},
{
"name": "HDFS_BYTES_READ",
"mapCounterValue": 207913352565,
"totalCounterValue": 207913352565,
"reduceCounterValue": 0
},
{
"name": "HDFS_BYTES_WRITTEN",
"mapCounterValue": 0,
"totalCounterValue": 89846725044,
"reduceCounterValue": 89846725044
}
]
},
{
"counterGroupName": "org.apache.hadoop.mapreduce.JobCounter",
"counter": [
{
"name": "TOTAL_LAUNCHED_MAPS",
"mapCounterValue": 0,
"totalCounterValue": 13394,
"reduceCounterValue": 0
},
{
"name": "TOTAL_LAUNCHED_REDUCES",
"mapCounterValue": 0,
"totalCounterValue": 720,
"reduceCounterValue": 0
}
]
}
]
}
}
Now I want to sort this data to get TOP 15 documents on the basis of totalCounterValue where counter.name is FILE_BYTES_READ. I have tried nested sorting on this but no matter which key name I write in counter.name, it is always sorting on the basis of HDFS_BYTES_READ. Can anyone please help me with my query.
{
"_source": true,
"size": 15,
"query": {
"bool": {
"must": [
{
"term": {
"state": {
"value": "FINISHED"
}
}
},
{
"range": {
"startedTime": {
"gte": "now - 4d",
"lte": "now"
}
}
}
]
}
},
"sort": [
{
"jobCounters.counterGroup.counter.totalCounterValue": {
"order": "desc",
"nested_path": "jobCounters.counterGroup",
"nested_filter": {
"nested": {
"path": "jobCounters.counterGroup.counter",
"filter": {
"term": {
"jobCounters.counterGroup.counter.name": "file_bytes_read"
}
}
}
}
}
}
]}
This is the mapping for jobCounters we have created:
"jobCounters": {
"type": "nested",
"include_in_parent": true,
"properties" : {
"counterGroup": {
"type": "nested",
"include_in_parent": true,
"properties": {
"counterGroupName": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
"counter" : {
"type": "nested",
"include_in_parent": true,
"properties": {
"reduceCounterValue": {
"type": "long"
},
"name": {
"type": "string",
"analyzer": "english",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
},
"totalCounterValue": {
"type": "long"
},
"mapCounterValue": {
"type": "long"
}
}
}
}
}
}
}
I followed nested sorting documentation of ElasticSearch and came up with this query, but I don't know why it is always sorting the totalCounterValue of HDFS_BYTES_READ irrespective of jobCounters.counterGroup.counter.name's value.
You can try something like this:
curl -XGET 'http://localhost:9200/index/jobCounters/_search' -d '
{
"size": 15,
"query": {
"nested": {
"path": "jobCounters.counterGroup.counter",
"filter": {
"term": {
"jobCounters.counterGroup.counter.name": "file_bytes_read"
}
}
}
},
"sort": [
{
"jobCounters.counterGroup.counter.totalCounterValue": {
"order": "desc",
"nested_path": "jobCounters.counterGroup",
"nested_filter": {
"nested": {
"path": "jobCounters.counterGroup.counter",
"filter": {
"term": {
"jobCounters.counterGroup.counter.name": "file_bytes_read"
}
}
}
}
}
}
]
}
'
Read the end of this document. It explains that we have to repeat the same query in nested_filter too.
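Since the same filter has to appear in both places, generating the body from a single definition keeps the two copies from drifting apart. A minimal Python sketch, using the field names and the 2.x-style syntax from the query above; build_counter_sort_body is a hypothetical helper, and the counter name is passed in lowercased form as in the answer.

```python
def build_counter_sort_body(counter_name, size=15):
    """Build a search body whose query and nested_filter share one filter."""
    counter_path = "jobCounters.counterGroup.counter"

    def counter_filter():
        # A fresh dict on each call, so the two copies are equal but independent
        return {
            "nested": {
                "path": counter_path,
                "filter": {"term": {f"{counter_path}.name": counter_name}},
            }
        }

    return {
        "size": size,
        "query": counter_filter(),
        "sort": [
            {
                f"{counter_path}.totalCounterValue": {
                    "order": "desc",
                    "nested_path": "jobCounters.counterGroup",
                    "nested_filter": counter_filter(),
                }
            }
        ],
    }


body = build_counter_sort_body("file_bytes_read")
```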
