Elasticsearch - Query to Determine All Unique IDs that are distance X away from a particular ID?

I have data in this format, generated from a random walk (to simulate people walking around): { location : { lat: someLat, lon: someLong }, id: uniqueId, date: date }. Given a user's unique ID, I am trying to write a query that finds how many other unique IDs came within distance X of that ID within a certain time range. Any hints on how to accomplish this?
My idea is to have a top-level filter aggregation with a nested geo query of some sort. I think the geo-distance query is the way to go, but I am not sure how to include it in the query below to get all unique IDs that come within distance X of the ID I am filtering on. The query below is my starting point: it filters all documents from now - 1 day to now where the document's user ID is the provided value. How would I check all other documents for their distance from the documents that match this query?
{
"aggs" : {
"range": {
"date_range": {
"field": "date",
"format": "MM-yyyy",
"ranges": [
{ "to": "now" },
{ "from": "now-1d" }
]
}
},
"locations" : {
"filter" : {
"term": { "id.keyword": "7a50ab18-886b-42a2-80ad-3d45112e3cfd" }
}
}
}
}

Your hunch is correct. All of this can be done using range & geo_distance filtering and _geo_distance sorting. You'll want to filter at the query level, though, not in the aggs:
GET walking/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"date": {
"gte": "now-1d"
}
}
}
],
"filter": [
{
"geo_distance": {
"distance": "20m",
"location": {
"lat": 48.20150179951008,
"lon": 16.39111876487732
}
}
}
]
}
},
"aggs": {
"rings_around_loc": {
"geo_distance": {
"field": "location",
"origin": {
"lat": 48.20150179951008,
"lon": 16.39111876487732
},
"unit": "m",
"keyed": true,
"ranges": [
{
"to": 10
},
{
"from": 10,
"to": 50
},
{
"from": 50
}
]
}
},
"locations": {
"value_count": {
"field": "id.keyword"
}
}
},
"sort": [
{
"_geo_distance": {
"location": {
"lat": 48.20150179951008,
"lon": 16.39111876487732
},
"order": "asc",
"unit": "m",
"mode": "min",
"distance_type": "arc",
"ignore_unmapped": true
}
}
]
}
I'm not sure what you need the date_range buckets for, so I left them out.
Full steps to replicate:
PUT walking
{
"mappings": {
"properties": {
"date": {
"type": "date"
},
"id": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"location": {
"type": "geo_point"
}
}
}
}
And then POST _bulk this random walk data
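The query above measures distance from one fixed point, while the question asks about every point on a given ID's track. A minimal sketch of one way to bridge that, assuming you first fetch the track points of the given ID yourself: build one request body per track point, excluding the ID itself, and union the IDs returned by a terms aggregation. The function names, the `max_ids` parameter and the `ids` aggregation name are hypothetical, and no Elasticsearch client call is shown.

```python
def nearby_ids_body(lat, lon, exclude_id, distance="20m", since="now-1d", max_ids=1000):
    """Build a search body listing the IDs seen near one point (sketch)."""
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    # Same time window and radius idea as in the answer above.
                    {"range": {"date": {"gte": since}}},
                    {"geo_distance": {
                        "distance": distance,
                        "location": {"lat": lat, "lon": lon},
                    }},
                ],
                # Leave out the ID whose track we are walking along.
                "must_not": [{"term": {"id.keyword": exclude_id}}],
            }
        },
        "aggs": {
            # One bucket per distinct ID found inside the radius.
            "ids": {"terms": {"field": "id.keyword", "size": max_ids}},
        },
    }


def collect_ids(responses):
    """Union the bucket keys of the 'ids' aggregation over many responses."""
    found = set()
    for resp in responses:
        for bucket in resp["aggregations"]["ids"]["buckets"]:
            found.add(bucket["key"])
    return found
```

The number of distinct IDs that came within range is then `len(collect_ids(responses))`. If only a count for a single point is needed, a cardinality aggregation on id.keyword would give an approximate distinct count without listing the IDs.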

Related

Composite aggregation query with bucket_sort does not work properly

I have an index to store financial transactions:
{
"mappings": {
"_doc": {
"properties": {
"amount": {
"type": "long"
},
"currencyCode": {
"type": "keyword"
},
"merchantId": {
"type": "keyword"
},
"merchantName": {
"type": "text"
},
"partnerId": {
"type": "keyword"
},
"transactionDate": {
"type": "date"
},
"userId": {
"type": "keyword"
}
}
}
}
}
Here's my query:
GET /transactions/_search
{
"aggs": {
"date_merchant": {
"aggs": {
"amount": {
"sum": {
"field": "amount"
}
},
"amount_sort": {
"bucket_sort": {
"sort": [
{
"amount": {
"order": "desc"
}
}
]
}
},
"top_hit": {
"top_hits": {
"_source": {
"includes": [
"merchantName",
"currencyCode"
]
},
"size": 1
}
}
},
"composite": {
"size": 1,
"sources": [
{
"date": {
"date_histogram": {
"calendar_interval": "day",
"field": "transactionDate"
}
}
},
{
"merchant": {
"terms": {
"field": "merchantId"
}
}
}
]
}
}
},
"query": {
"bool": {
"filter": [
{
"term": {
"userId": "AAA"
}
},
{
"term": {
"partnerId": "BBB"
}
},
{
"range": {
"transactionDate": {
"gte": "2022-07-01"
}
}
},
{
"term": {
"currencyCode": "EUR"
}
}
]
}
},
"size": 0
}
Please note the "size": 1 in the composite aggregation.
If I change it to 3 (based on my data)... I get different results!
That means the bucket_sort operation doesn't work on the whole list of buckets, but just on the returned ones (if it's just one, that means it's not going to be sorted at all!)
How can I sort on ALL the buckets instead?
EDIT
Based on Benjamin's answer, I changed my query to use normal aggregations instead of composite, with a large bucket size for merchant IDs (the default is 10, while for the date histogram there is no limit).
Composite agg design
The composite aggregation is designed to iterate all buckets in the most efficient way possible.
How can I sort on ALL the buckets instead?
To fully sort over ALL buckets, all buckets would have to be enumerated ahead of time, defeating the design of the composite aggregation.
So, how to actually sort over all buckets?
Then aggregate over all buckets in a single call. Set your size to the largest number of buckets available within your query.
The number of buckets will be at most the cardinality of merchantId multiplied by the number of days in the date histogram.
Another option is to paginate over all the composite buckets and then sort them client side. If you choose this path, it may be good to have each page of the composite aggregation be sorted so that sorting them client side will be faster.
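The pagination option can be sketched as follows. This is a rough illustration, not a definitive implementation: `search` is a hypothetical callable that takes a request body (a dict) and returns the parsed JSON response, standing in for whatever client you use. The loop relies on the `after_key` that a composite aggregation returns with each page and passes it back as `after`.

```python
import copy


def fetch_all_buckets(search, body, agg_name):
    """Page through a composite aggregation and return every bucket.

    `search` is a hypothetical callable: request body in, response dict out.
    """
    body = copy.deepcopy(body)  # do not mutate the caller's request
    buckets = []
    while True:
        agg = search(body)["aggregations"][agg_name]
        buckets.extend(agg["buckets"])
        after_key = agg.get("after_key")
        # An empty page or a missing after_key means there is nothing left.
        if not agg["buckets"] or after_key is None:
            break
        body["aggs"][agg_name]["composite"]["after"] = after_key
    return buckets


def sort_by_amount(buckets):
    """Sort all collected buckets by the 'amount' sum, largest first."""
    return sorted(buckets, key=lambda b: b["amount"]["value"], reverse=True)
```

With the query from the question, this would be called as `sort_by_amount(fetch_all_buckets(search, body, "date_merchant"))`. A larger composite `size` reduces the number of round trips.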

Elastic Search Geo Spatial search implementation

I am trying to understand how Elasticsearch supports geospatial search internally.
For basic search it uses the inverted index, but how is that combined with additional search criteria, such as searching for a particular text within a certain radius?
I would like to understand the internals of how the index is stored and queried to support these queries.
Text & geo queries are executed independently of one another. Let's take a concrete example:
PUT restaurants
{
"mappings": {
"properties": {
"location": {
"type": "geo_point"
},
"menu": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
POST restaurants/_doc
{
"name": "rest1",
"location": {
"lat": 40.739812,
"lon": -74.006201
},
"menu": [
"european",
"french",
"pizza"
]
}
POST restaurants/_doc
{
"name": "rest2",
"location": {
"lat": 40.7403963,
"lon": -73.9950026
},
"menu": [
"pizza",
"kebab"
]
}
You'd then match a text field and apply a geo_distance filter:
GET restaurants/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"menu": "pizza"
}
},
{
"geo_distance": {
"distance": "0.5mi",
"location": {
"lat": 40.7388,
"lon": -73.9982
}
}
},
{
"function_score": {
"query": {
"match_all": {}
},
"boost_mode": "avg",
"functions": [
{
"gauss": {
"location": {
"origin": {
"lat": 40.7388,
"lon": -73.9982
},
"scale": "0.5mi"
}
}
}
]
}
}
]
}
}
}
Since the geo_distance query only assigns a boolean value (--> score=1; it only checks whether the location is within a given radius), you may want to apply a gaussian function_score to boost the locations that are closer to a given origin.
Finally, these scores can be overridden by using a _geo_distance sort, where you'd order by proximity (while of course keeping the match query intact):
...
"query": {...},
"sort": [
{
"_geo_distance": {
"location": {
"lat": 40.7388,
"lon": -73.9982
},
"order": "asc"
}
}
]
}
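To get a feel for how the gaussian function_score mentioned above rewards proximity, here is a small sketch of the gauss decay curve as I understand it from the Elasticsearch function_score documentation: the score is 1 at the origin and falls to `decay` (0.5 by default) at a distance equal to `scale`. The function name is hypothetical, and distances are assumed to be in the same unit as `scale`.

```python
import math


def gauss_decay(distance, scale, decay=0.5, offset=0.0):
    """Gaussian decay score for a document `distance` away from the origin."""
    # sigma^2 is chosen so that the score equals `decay` at distance == scale.
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    # Distances inside `offset` are treated as if they were at the origin.
    d = max(0.0, abs(distance) - offset)
    return math.exp(-(d ** 2) / (2.0 * sigma_sq))


# Scores for a few distances in miles, with scale = 0.5mi as in the query above.
scores = {d: gauss_decay(d, scale=0.5) for d in (0.0, 0.25, 0.5, 1.0)}
```

With `boost_mode` set to `avg`, as in the query above, this decay value is averaged with the score of the wrapped query rather than multiplied into it.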

ElasticSearch aggregation query with List in documents

I have the following records of car sales of different brands in different cities.
Document -1
{
"city": "Delhi",
"cars":[{
"name":"Toyota",
"purchase":100,
"sold":80
},{
"name":"Honda",
"purchase":200,
"sold":150
}]
}
Document -2
{
"city": "Delhi",
"cars":[{
"name":"Toyota",
"purchase":50,
"sold":40
},{
"name":"Honda",
"purchase":150,
"sold":120
}]
}
I am trying to come up with a query to aggregate car statistics for a given city, but I can't get the query right.
Required result:
{
"city": "Delhi",
"cars":[{
"name":"Toyota",
"purchase":150,
"sold":120
},{
"name":"Honda",
"purchase":350,
"sold":270
}]
}
First, you need to map your array as a nested field (a script would be complicated and not performant). Nested fields are indexed, so the aggregation will be pretty fast.
Remove your index or create a new one. Please note that I use test as the type.
{
"mappings": {
"test": {
"properties": {
"city": {
"type": "keyword"
},
"cars": {
"type": "nested",
"properties": {
"name": {
"type": "keyword"
},
"purchase": {
"type": "integer"
},
"sold": {
"type": "integer"
}
}
}
}
}
}
}
Index your documents (the same way you did before).
For the aggregation:
{
"size": 0,
"aggs": {
"avg_grade": {
"terms": {
"field": "city"
},
"aggs": {
"resellers": {
"nested": {
"path": "cars"
},
"aggs": {
"agg_name": {
"terms": {
"field": "cars.name"
},
"aggs": {
"avg_pur": {
"sum": {
"field": "cars.purchase"
}
},
"avg_sold": {
"sum": {
"field": "cars.sold"
}
}
}
}
}
}
}
}
}
}
result:
"buckets": [
{
"key": "Honda",
"doc_count": 2,
"avg_pur": {
"value": 350
},
"avg_sold": {
"value": 270
}
},
{
"key": "Toyota",
"doc_count": 2,
"avg_pur": {
"value": 150
},
"avg_sold": {
"value": 120
}
}
]
If you have indexed the name / city field as text (first ask yourself whether that is necessary), use .keyword in the terms aggregation ("cars.name.keyword").
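As a cross-check of what the nested sum aggregation should return, the same roll-up can be computed client-side on the two sample documents from the question. This is only an illustration of the expected numbers, not a replacement for the aggregation; the function name is hypothetical.

```python
from collections import defaultdict


def sum_cars_by_city(docs):
    """Sum 'purchase' and 'sold' per car name within each city."""
    totals = defaultdict(lambda: defaultdict(lambda: {"purchase": 0, "sold": 0}))
    for doc in docs:
        for car in doc["cars"]:
            entry = totals[doc["city"]][car["name"]]
            entry["purchase"] += car["purchase"]
            entry["sold"] += car["sold"]
    # Convert the nested defaultdicts into plain dicts for easier inspection.
    return {city: dict(cars) for city, cars in totals.items()}


# The two sample documents from the question.
docs = [
    {"city": "Delhi", "cars": [
        {"name": "Toyota", "purchase": 100, "sold": 80},
        {"name": "Honda", "purchase": 200, "sold": 150},
    ]},
    {"city": "Delhi", "cars": [
        {"name": "Toyota", "purchase": 50, "sold": 40},
        {"name": "Honda", "purchase": 150, "sold": 120},
    ]},
]
result = sum_cars_by_city(docs)
```

For Delhi this yields Toyota with purchase 150 and sold 120, and Honda with purchase 350 and sold 270, matching the bucket values shown in the result above.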

elasticsearch facet nested aggregation

Using elasticsearch 7.0.0.
I am following this link.
I have an index test_products with the following mapping:
{
"settings": {
"number_of_shards": 1
},
"mappings": {
"dynamic_templates": [
{
"search_result_data": {
"mapping": {
"type": "keyword"
},
"path_match": "search_result_data.*"
}
}
],
"properties": {
"search_data": {
"type": "nested",
"properties": {
"full_text": {
"type": "text"
},
"string_facet": {
"type": "nested",
"properties": {
"facet-name": {
"type": "keyword"
},
"facet-value": {
"type": "keyword"
}
}
}
}
}
}
}
}
And a document inserted in the following format:
{
"search_result_data": {
"sku": "wheel-6075-90092",
"gtin": null,
"name": "Matte Black Wheel Fuel Ripper",
"preview_image": "abc.jg",
"url": "9836817354546538796",
"brand": "Fuel Off-Road"
},
"search_data":
{
"full_text": "Matte Black Wheel Fuel Ripper",
"string_facet": [
{
"facet-name": "category",
"facet-value": "Motor Vehicle Rims & Wheels"
},
{
"facet-name": "brand",
"facet-value": "Fuel Off-Road"
}
]
}
}
and one other document.
I am trying to aggregate on string_facet as mentioned in the link.
"aggregations": {
"agg_string_facet": {
"nested": {
"path": "string_facet"
},
"aggregations": {
"facet_name": {
"terms": {
"field": "string_facet.facet-name"
},
"aggregations": {
"facet_value": {
"terms": {
"field": "string_facet.facet-value"
}
}
}
}
}
}
}
But I get all (two) documents returned, with:
"aggregations": {
"agg_string_facet": {
"doc_count": 0
}
}
What am I missing here?
Also why are the docs being returned as a response?
Documents are returned as a response because they match with your query. If you'd like them to disappear, you can set the "size" field to 0. By default, it's set to 10.
{
"query": {
...
},
"size": 0
}
I read the docs, and the Facet aggregation has been removed. The recommendation is to use the Terms aggregation.
Now, for your question, you can go with two options:
If you'd like to get the unique values for each of facet-value and facet-name, you can do the following (size defaults to 10; the maximum recommended is 10,000):
"aggs":{
"unique facet-values":{
"terms":{
"field": "facet-value.keyword",
"size": 30
}
},
"unique facet-names":{
"terms":{
"field": "facet-name.keyword",
"size": 30
}
}
}
If you'd like to get the unique combinations of facet-name and facet-value, you can use the Composite aggregation. Its size defaults to 10 and the maximum recommended is 10,000; no matter what size you choose, you can paginate. If you choose this way, your aggs should look like this:
{
"aggs":{
"unique-facetvalue-and-facetname-combination":{
"composite":{
"size": 30,
"sources":[
{
"value":
{
"terms":{
"field": "facet-value.keyword"
}
}
},
{
"name":
{
"terms":{
"field": "facet-name.keyword"
}
}
}
]
}
}
}
}
The advantage of using Composite over Terms is that Composite lets you paginate your results with the after key, so a large set of buckets can be fetched page by page with less impact on your cluster's performance.
Hope this is helpful! :D
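In the mapping from the question, string_facet is a nested field inside search_data, which is itself nested. One likely reason for the doc_count of 0 is that the nested path and the field names in the aggregation are not fully qualified. The sketch below builds the aggregation from the question with full dotted paths; it assumes the mapping shown above, and the function name and `size` parameter are hypothetical.

```python
def facet_aggs(path="search_data.string_facet", size=30):
    """Build the facet-name / facet-value aggregation using full dotted paths."""
    return {
        "agg_string_facet": {
            # The nested path must be the full path from the document root.
            "nested": {"path": path},
            "aggs": {
                "facet_name": {
                    "terms": {"field": f"{path}.facet-name", "size": size},
                    "aggs": {
                        "facet_value": {
                            "terms": {"field": f"{path}.facet-value", "size": size}
                        }
                    },
                }
            },
        }
    }


# Request body that returns only the aggregation, no hits.
body = {"size": 0, "aggs": facet_aggs()}
```

Both facet fields are already mapped as keyword in that mapping, so no .keyword suffix is used here.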

ElasticSearch 2 bucket level sorting

The mapping of the database is this:
{
"users": {
"mappings": {
"user": {
"properties": {
"credentials": {
"type": "nested",
"properties": {
"achievement_id": {
"type": "string"
},
"percentage_completion": {
"type": "integer"
}
}
},
"current_location": {
"type": "geo_point"
},
"locations": {
"type": "geo_point"
}
}
}
}
}
}
In the mapping you can see there are two geo_point fields: one is current_location and the other is locations. I want to sort users by credentials.percentage_completion, which is a nested field. This works fine, for example with this query:
Example Query:
GET /users/user/_search?size=23
{
"sort": [
{
"credentials.percentage_completion": {
"order": "desc",
"missing": "_last"
}
},
"_score"
],
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"geo_distance": {
"distance": "100000000km",
"user.locations": {
"lat": 19.77,
"lon": 73
}
}
}
}
}
}
I want to change the sort order into buckets: first show all the people within a 100 km radius of user.current_location, sorted by credentials.percentage_completion, and then the rest of the users, again sorted by credentials.percentage_completion.
I tried putting a conditional in the sort and making it multilevel, but that does not work because only nested sorts can have filters, and those apply only to the nested field's children.
I thought I could use _score for sorting and give more relevance to people who are under 1000 km away, but geo-distance is a filter, and I can't find any way to assign relevance in a filter.
Is there anything I am missing here? Any help would be great.
Thanks
Finally solved it; posting it here so others can take a lead if they end up here. The way to solve this is to give a constant relevance score to a particular query, but since this was a geo distance filter I could not use it as a query directly. Then I found the Constant Score query, which allows you to wrap a filter inside a query.
This is how the query looks:
GET /users/user/_search?size=23
{
"sort": [
"_score",
{
"credentials.percentage_completion": {
"order": "desc",
"missing": "_last"
}
}
],
"explain": true,
"query": {
"filtered": {
"query": {
"bool": {
"should": [
{
"constant_score": {
"filter": {
"geo_distance": {
"distance": "100km",
"user.current_location": {
"lat": 19.77,
"lon": 73
}
}
},
"boost": 50
}
},
{
"constant_score": {
"filter": {
"geo_distance": {
"distance": "1000000km",
"user.locations": {
"lat": 19.77,
"lon": 73
}
}
},
"boost": 1
}
}
]
}
},
"filter": {
"geo_distance": {
"distance": "10000km",
"user.locations": {
"lat": 19.77,
"lon": 73
}
}
}
}
}
}
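The ordering this query aims for can be illustrated client-side. The sketch below is not how Elasticsearch computes scores (the real Lucene scores are normalised differently); it only mimics the resulting order: users inside the radius get the larger constant score and come first, and within each group users are ordered by percentage completion, with missing values last. The keys `name`, `distance_km` and `percentage_completion` and the function name are hypothetical.

```python
def bucketed_order(users, radius_km=100.0, near_boost=50.0, base_boost=1.0):
    """Order users by constant score first, then by completion percentage."""
    def sort_key(user):
        # Everyone gets the base boost; users inside the radius get more.
        score = base_boost
        if user["distance_km"] <= radius_km:
            score += near_boost
        pct = user.get("percentage_completion")
        # Higher score first, missing percentages last, higher percentage first.
        return (-score, pct is None, -(pct or 0))
    return sorted(users, key=sort_key)


users = [
    {"name": "far_high", "distance_km": 500, "percentage_completion": 90},
    {"name": "near_low", "distance_km": 10, "percentage_completion": 20},
    {"name": "near_high", "distance_km": 50, "percentage_completion": 80},
    {"name": "far_none", "distance_km": 900, "percentage_completion": None},
]
ordered_names = [u["name"] for u in bucketed_order(users)]
```

Here `ordered_names` lists the two nearby users first (by completion), followed by the distant ones, which is the two-bucket order described in the question.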
