How to fetch unique geo codes from Elasticsearch?

I'm new to Elasticsearch. I've created an index and inserted some documents using the following curl commands.
curl -XPUT 'localhost:9200/museums?pretty' -H 'Content-Type: application/json' -d'
{
"mappings": {
"doc": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
'
curl -XPOST 'localhost:9200/museums/doc/_bulk?refresh&pretty' -H 'Content-Type: application/json' -d'
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "name": "Musée d\u0027Orsay"}
{"index":{"_id":7}}
{"location": "52.374081,4.912350", "name": "NEMO7 Science Museum"}
{"index":{"_id":8}}
{"location": "52.369219,4.901618", "name": "Museum8 Het Rembrandthuis"}
{"index":{"_id":9}}
{"location": "52.371667,4.914722", "name": "Nederlands9 Scheepvaartmuseum"}
{"index":{"_id":10}}
{"location": "51.222900,4.405200", "name": "Letterenhuis10"}
{"index":{"_id":11}}
{"location": "48.861111,2.336389", "name": "Musée11 du Louvre"}
{"index":{"_id":12}}
{"location": "48.860000,2.327000", "name": "Musée12 d\u0027Orsay"}
'
As you can see in the curl commands, I've deliberately inserted some duplicate documents as well. Now I want to fetch all documents with unique geo codes and sort them in ascending order.
I found a sample curl command like the following.
curl -XPOST 'localhost:9200/museums/_search?size=0&pretty' -H 'Content-Type: application/json' -d'
{
"aggs" : {
"rings_around_amsterdam" : {
"geo_distance" : {
"field" : "location",
"origin" : "52.3760, 4.894",
"ranges" : [
{ "to" : 100000 },
{ "from" : 100000, "to" : 300000 },
{ "from" : 300000 }
]
}
}
}
}
'
But it groups the documents by distance range. I just want to fetch the unique geo codes and sort them in ascending order. I searched as well, but every approach I found for fetching unique documents works only on text or numeric fields, not on geo_point fields.
I need some help.

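Try a cardinality aggregation on the location field. It returns only an approximate count of the distinct values, not the values themselves: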
curl -XPOST 'localhost:9200/museums/_search?size=0&pretty' -H 'Content-Type: application/json' -d'
{
"size" : 0,
"aggs": {
"distinct_geo_distance" : {
"cardinality" : {
"field" : "location"
}
}
}
}
'
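For comparison, here is a minimal client-side sketch of what the question asks for: one document per distinct geo code, sorted ascending. It is plain Python over the sample documents from the bulk request, not an Elasticsearch query, and the helper name `unique_by_location` is made up for illustration. It assumes "ascending" means by latitude, then longitude.

```python
# Sample documents copied from the bulk request in the question.
docs = [
    {"location": "52.374081,4.912350", "name": "NEMO Science Museum"},
    {"location": "52.369219,4.901618", "name": "Museum Het Rembrandthuis"},
    {"location": "52.371667,4.914722", "name": "Nederlands Scheepvaartmuseum"},
    {"location": "51.222900,4.405200", "name": "Letterenhuis"},
    {"location": "48.861111,2.336389", "name": "Musée du Louvre"},
    {"location": "48.860000,2.327000", "name": "Musée d'Orsay"},
    {"location": "52.374081,4.912350", "name": "NEMO7 Science Museum"},
    {"location": "52.369219,4.901618", "name": "Museum8 Het Rembrandthuis"},
    {"location": "52.371667,4.914722", "name": "Nederlands9 Scheepvaartmuseum"},
    {"location": "51.222900,4.405200", "name": "Letterenhuis10"},
    {"location": "48.861111,2.336389", "name": "Musée11 du Louvre"},
    {"location": "48.860000,2.327000", "name": "Musée12 d'Orsay"},
]


def unique_by_location(documents):
    """Keep the first document seen for each (lat, lon) pair, sorted by lat then lon."""
    first_seen = {}
    for doc in documents:
        lat, lon = (float(part) for part in doc["location"].split(","))
        # setdefault keeps the earliest document for a repeated location.
        first_seen.setdefault((lat, lon), doc)
    return [first_seen[key] for key in sorted(first_seen)]


for doc in unique_by_location(docs):
    print(doc["location"], doc["name"])
```

Doing this on the client only makes sense for small result sets; for large indices the deduplication would need to happen inside Elasticsearch.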

Related

Group and count by array of objects' keys

Given the following index definition and query:
curl -XDELETE "localhost:9200/products"
curl -XPUT "localhost:9200/products"
curl -XPUT "localhost:9200/products/_mapping" -H 'Content-Type: application/json' -d'
{
"properties": {
"opinions": {
"type": "nested",
"properties": {
"topic": {"type": "keyword"},
"count": {"type": "long"}
},
"include_in_parent": true
}
}
}'
curl -X POST "localhost:9200/products/_bulk" -H 'Content-Type: application/json' -d'
{"index":{"_id":1}}
{"opinions":[{"topic": "room", "count": 2}, {"topic": "kitchen", "count": 1}]}
{"index":{"_id":2}}
{"opinions":[{"topic": "room", "count": 1}, {"topic": "restroom", "count": 1}]}
'
sleep 2
curl -X POST "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
"size": 0,
"aggs": {
"per_topic": {
"terms": {"field": "opinions.topic"},
"aggs": {
"counts": {
"sum": {"field": "opinions.count"}
}
}
}
}
}
'
Produces the result:
"aggregations" : {
"per_topic" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "room",
"doc_count" : 2,
"counts" : {
"value" : 5.0
}
},
{
"key" : "kitchen",
"doc_count" : 1,
"counts" : {
"value" : 3.0
}
},
{
"key" : "restroom",
"doc_count" : 1,
"counts" : {
"value" : 2.0
}
}
]
}
}
}
I'm expecting the sum for room to be 3, for kitchen to be 1, and for restroom to be 1, counting only the related nested documents. Instead, the query sums all the nested count fields in every matched document.
How can I sum only the matched aggregated nested documents?
UPDATE: solution based on comments
curl -X POST "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
"size": 0,
"aggs": {
"opinions": {
"nested": {"path": "opinions"},
"aggs": {
"per_topic": {
"terms": {"field": "opinions.topic"},
"aggs": {
"counts": {
"sum": {"field": "opinions.count"}
}
}
}
}
}
}
}
'
The main initial problem was the use of object fields instead of nested fields: only nested fields preserve the structure [{"room", 2}, {"kitchen", 1}]. With object fields the data is flattened to {["room", "kitchen"], [1,2]}, which loses the relationship between "room" and 2.
Unfortunately, at the moment it is not possible to use the SQL API to group by (some?) nested fields, but it is possible to write a native Elasticsearch query using nested aggregations.
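The difference between the two results can be reproduced without a cluster. The following is a rough Python simulation, assuming the flattened (object-style) view exposes every topic and every count of a document with no pairing between them, while the nested view treats each opinion as its own unit. The function names are illustrative and are not part of any Elasticsearch API.

```python
from collections import defaultdict

# The two documents from the bulk request above.
docs = [
    {"opinions": [{"topic": "room", "count": 2}, {"topic": "kitchen", "count": 1}]},
    {"opinions": [{"topic": "room", "count": 1}, {"topic": "restroom", "count": 1}]},
]


def sum_flattened(documents):
    """Object-style view: each topic bucket sums ALL counts of every matching document."""
    totals = defaultdict(int)
    for doc in documents:
        topics = {opinion["topic"] for opinion in doc["opinions"]}
        doc_total = sum(opinion["count"] for opinion in doc["opinions"])
        for topic in topics:
            totals[topic] += doc_total
    return dict(totals)


def sum_nested(documents):
    """Nested view: each opinion is aggregated on its own, so topic and count stay paired."""
    totals = defaultdict(int)
    for doc in documents:
        for opinion in doc["opinions"]:
            totals[opinion["topic"]] += opinion["count"]
    return dict(totals)


print(sum_flattened(docs))  # matches the unexpected sums 5, 3, 2
print(sum_nested(docs))     # matches the expected sums 3, 1, 1
```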

Elastic Search: return matching parents with matched/unmatched childs

I am using Elasticsearch 7.8.1 and have indexed the documents with a parent-child relationship. I need to search both parent and child documents, but return the response in a format where the parent document is the main document and the child document is a field within the parent document, i.e.:
1) If the child matches, I want to return parent and child in one document. I am able to achieve this using has_child and inner_hits.
2) If the parent matches the query, I want to return parent and child in one document even if the child does not match. (I am not sure how to achieve this.)
# This is the parent child relationship mapping in index
curl -X PUT "localhost:9200/my-index-000001?pretty" -H 'Content-Type: application/json' -d'
{
"mappings": {
"properties": {
"my_id": {
"type": "keyword"
},
"my_join_field": {
"type": "join",
"relations": {
"question": "answer"
}
}
}
}
}
'
Below is the query I am trying to use, but it does not return the child when the parent matches:
curl -X POST "localhost:9200/my-index-000001/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"should": [
{
"has_child": {
"type": "answer",
"query": {
"match": {
"my_id": "4"
}
},
"inner_hits": {
"size": 1
}
}
},
{
"match": {
"my_id": "1"
}
}
]
}
}
}'
# Parent docs
curl -X PUT "localhost:9200/my-index-000001/_doc/1?refresh&pretty" -H 'Content-Type: application/json' -d'
{
"my_id": "1",
"text": "This is a question",
"my_join_field": "question"
}
'
curl -X PUT "localhost:9200/my-index-000001/_doc/2?refresh&pretty" -H 'Content-Type: application/json' -d'
{
"my_id": "2",
"text": "This is another question",
"my_join_field": "question"
}
'
# Child docs
curl -X PUT "localhost:9200/my-index-000001/_doc/3?routing=1&refresh&pretty" -H 'Content-Type: application/json' -d'
{
"my_id": "3",
"text": "This is an answer",
"my_join_field": {
"name": "answer",
"parent": "1"
}
}
'
curl -X PUT "localhost:9200/my-index-000001/_doc/4?routing=1&refresh&pretty" -H 'Content-Type: application/json' -d'
{
"my_id": "4",
"text": "This is another answer",
"my_join_field": {
"name": "answer",
"parent": "1"
}
}
'
How can I search both parent and child, but return the child as a field in the parent doc? Thanks in advance.

elasticsearch distinct parent sub aggregation without nested field

In Elasticsearch 6.2 I have a parent-child relationship:
Document -> NamedEntity
I want to aggregate NamedEntity documents by their mention field and, for each mention, get the number of distinct documents that contain it.
My use case is:
doc1 contains 'NER'(_id=ner11), 'NER'(_id=ner12)
doc2 contains 'NER'(_id=ner2)
The parent/child relation is implemented with a join field. In the Document I have a field :
join: {
name: "Document"
}
And in the NamedEntity children :
join: {
name: "NamedEntity",
parent: "parent_id"
}
with _routing set to parent_id.
So I tried with terms sub-aggregation :
curl -XPOST elasticsearch:9200/datashare-testjs/_search?pretty -H 'Content-Type: application/json' -d '
{"query":{"term":{"type":"NamedEntity"}},
"aggs":{
"mentions":{
"terms":{
"field":"mention"
},
"aggs":{
"docs":{
"terms":{"field":"join"}
}
}
}
}
}'
And I have the following response :
"aggregations" : {
"mentions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NER",
"doc_count" : 3,
"docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NamedEntity",
"doc_count" : 3 <-- WRONG ! There are 2 distinct documents
}
]
}
}
]
}
I find the expected 3 occurrences in mentions.buckets.doc_count. But in the mentions.buckets.docs.buckets.doc_count field I would like to get only 2 documents (not 3), like a SELECT COUNT(DISTINCT ...).
If I aggregate with "terms":{"field":"join.parent"} I get:
...
"docs" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [ ]
}
...
I tried a cardinality aggregation on the join field and obtained a value of 1, while a cardinality aggregation on join.parent returns a value of 0.
So how do you make a distinct-count aggregation on parents without using a reverse nested aggregation?
As @AndreiStefan asked, here is the mapping. It is a simple 1-N relation between Document (content) and NamedEntity (mention) in an ES 6 mapping (fields are defined on the same level):
curl -XPUT elasticsearch:9200/datashare-testjs -H 'Content-Type: application/json' -d '
{
"mappings": {
"doc": {
"properties": {
"content": {
"type": "text",
"index_options": "offsets"
},
"type": {
"type": "keyword"
},
"join": {
"type": "join",
"relations": {
"Document": "NamedEntity"
}
},
"mention": {
"type": "keyword"
}
}
}
}}'
And the requests for a minimal dataset :
curl -XPUT elasticsearch:9200/datashare-testjs/doc/doc1 -H 'Content-Type: application/json' -d '{"type": "Document", "join": {"name": "Document"}, "content": "a NER document contains 2 NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/doc2 -H 'Content-Type: application/json' -d '{"type": "Document", "join": {"name": "Document"}, "content": "another NER document"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner11?routing=doc1 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc1"}, "mention": "NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner12?routing=doc1 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc1"}, "mention": "NER"}'
curl -XPUT elasticsearch:9200/datashare-testjs/doc/ner2?routing=doc2 -H 'Content-Type: application/json' -d '{"type": "NamedEntity", "join": {"name": "NamedEntity", "parent": "doc2"}, "mention": "NER"}'
"aggs": {
"mentions": {
"terms": {
"field": "mention"
},
"aggs": {
"docs": {
"terms": {
"field": "join"
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
}
}
OR if you just want the count:
"aggs": {
"mentions": {
"terms": {
"field": "mention"
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
If you need a custom ordering (by unique counts):
"aggs": {
"mentions": {
"terms": {
"field": "mention",
"order": {
"uniques": "desc"
}
},
"aggs": {
"uniques": {
"cardinality": {
"field": "join#Document"
}
}
}
}
}
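To make explicit what these aggregations are expected to compute, here is a small Python sketch over the minimal dataset from the question. It keeps an exact set of parent ids per mention, whereas the real cardinality aggregation is approximate, so treat it only as an illustration of the intended result. The `parent` key stands in for the value that the join#Document field holds on child documents; the helper name is made up.

```python
from collections import defaultdict

# Child documents from the minimal dataset (ner11, ner12 -> doc1; ner2 -> doc2).
children = [
    {"_id": "ner11", "mention": "NER", "parent": "doc1"},
    {"_id": "ner12", "mention": "NER", "parent": "doc1"},
    {"_id": "ner2", "mention": "NER", "parent": "doc2"},
]


def mention_stats(child_docs):
    """Per mention: number of child docs (doc_count) and number of distinct parents (uniques)."""
    doc_counts = defaultdict(int)
    parents = defaultdict(set)
    for child in child_docs:
        doc_counts[child["mention"]] += 1
        parents[child["mention"]].add(child["parent"])
    return {
        mention: {"doc_count": doc_counts[mention], "uniques": len(parents[mention])}
        for mention in doc_counts
    }


print(mention_stats(children))
```

For the sample data this yields a doc_count of 3 and 2 distinct parents for "NER", which is the "count distinct" the question is after.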
I post this workaround in case it can help someone. But if someone has a cleaner way of doing this, I'd be interested.
I added a denormalized field in the children that contains a copy of the parent id (the value already in join/parent):
curl -XPUT elasticsearch:9200/datashare-testjs -H 'Content-Type: application/json' -d '
{
"mappings": {
"doc": {
"properties": {
"content": {
"type": "text",
"index_options": "offsets"
},
"type": {
"type": "keyword"
},
"join": {
"type": "join",
"relations": {
"Document": "NamedEntity"
}
},
"document_id": {
"type": "keyword"
},
"mention": {
"type": "keyword"
}
}
}
}}'
Then the cardinality aggregation on this new field works as expected:
curl -XPOST elasticsearch:9200/datashare-testjs/_search?pretty -H 'Content-Type: application/json' -d '
{"query":{"term":{"type":"NamedEntity"}},
"aggs":{
"mentions":{
"terms":{
"field":"mention"
},
"aggs":{
"docs":{
"cardinality": {
"field" : "document_id"
}
}
}
}}}'
It responds :
...
"aggregations" : {
"mentions" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "NER",
"doc_count" : 3,
"docs" : {
"value" : 2
}
}
]
}
}
I recently ran into the same issue on Elasticsearch 7.1, and the additional field "my_join_field#my_parent" created by Elasticsearch solved it. I am glad I didn't have to add the parent_id to the child document.
https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html#_searching_with_parent_join

How to get multiple type search results from elasticsearch index?

I have an Elasticsearch index, myindex, with two types, type1 and type2. Both types have two common fields, name and description, as below:
{
"name": "",
"description": ""
}
How can I get 5 results from type1 and 5 results from type2 in a single search query when I specify the size as 10?
The query below gives me 10 results from type1 if there are more matching results in type1:
curl -XPOST 'localhost:9200/myindex/_search?pretty&pretty' -H 'Content-Type: application/json' -d'
{
"size": 10,
"query": {
"match": {
"name": "xyz"
}
}
}'
I can do this in two different queries as below, but I want to do it in one go.
curl -XPOST 'localhost:9200/myindex/type1/_search?pretty&pretty' -H 'Content-Type: application/json' -d'
{
"size": 5,
"query": {
"match": {
"name": "xyz"
}
}
}'
curl -XPOST 'localhost:9200/myindex/type2/_search?pretty&pretty' -H 'Content-Type: application/json' -d'
{
"size": 5,
"query": {
"match": {
"name": "xyz"
}
}
}'
You can use a multisearch and the results will come back in two separate arrays.
GET /_msearch --data-binary
{ "index" : "myindex" , "type" : "type1" }
{ "size" : 5, "query" : { "match" : { "name" : "xyz" } } }
{ "index" : "myindex", "type" : "type2" }
{ "size" : 5, "query" : { "match" : { "name" : "xyz" } } }
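The multi-search body is newline-delimited JSON: a header line followed by a search body line for each search, with a newline after the last line. A minimal Python sketch for building such a body is shown below. The helper `build_msearch_body` is hypothetical, and the `type` key in the header is copied from the answer above; it only applies to older Elasticsearch versions that still support mapping types.

```python
import json


def build_msearch_body(searches):
    """Serialize (header, body) pairs as newline-delimited JSON, ending with a newline."""
    lines = []
    for header, body in searches:
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))
    return "\n".join(lines) + "\n"


query = {"size": 5, "query": {"match": {"name": "xyz"}}}
payload = build_msearch_body([
    ({"index": "myindex", "type": "type1"}, query),
    ({"index": "myindex", "type": "type2"}, query),
])
print(payload, end="")
```

The resulting string can be sent as the request body to the _msearch endpoint; the response then holds one entry per search, in the same order.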

Elasticsearch: Parent-child relationship after rollover

Suppose there is a simple blog index which contains two types: blog and comment. One blog can have multiple comments. The index is created like this
curl -X PUT \
'http://localhost:9200/%3Cblog-%7Bnow%2Fd%7D-000001%3E?pretty=' \
-H 'content-type: application/json' \
-d '{
"mappings": {
"comment": {
"_parent": { "type": "blog" },
"properties": {
"name": { "type": "keyword" },
"comment": { "type": "text" }
}
},
"blog": {
"properties": {
"author": { "type": "keyword" },
"subject": { "type": "text" },
"content": { "type": "text" }
}
}
}
}'
The index %3Cblog-%7Bnow%2Fd%7D-000001%3E is equal to <blog-{now/d}-000001> (see here for more about date math).
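The percent-encoded form can be checked with Python's standard library. This is only a sketch to verify the encoding used in the URLs in this question; the helper name `encode_index_name` is made up.

```python
from urllib.parse import quote, unquote


def encode_index_name(name):
    """Percent-encode an index name for use in a URL path, including the '/' in date math."""
    # safe="" makes quote() encode "/" as %2F instead of leaving it as a path separator.
    return quote(name, safe="")


print(encode_index_name("<blog-{now/d}-000001>"))
print(unquote("blog-%2A"))
```

The first call reproduces the encoded index name used in the PUT request above, and the second decodes blog-%2A back to blog-*.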
We're going to add a 'blog-active' alias to this index. This alias is going to be used for storing data.
curl -X POST 'http://localhost:9200/_aliases?pretty=' \
-H 'content-type: application/json' \
-d '{ "actions" : [ { "add" : { "index" : "blog-*", "alias" : "blog-active" } } ] }'
Now if we do the following actions:
1. Add a blog using the blog-active alias
curl -X POST http://localhost:9200/blog-active/blog/1 \
-H 'content-type: application/json' \
-d '{
"author": "author1",
"subject": "subject1",
"content": "content1"
}'
2. Add a comment to the blog
curl -X POST \
'http://localhost:9200/blog-active/comment/1?parent=1' \
-H 'content-type: application/json' \
-d '{
"name": "commenter1",
"comment": "new comment1"
}'
3. Do a rollover with max_docs = 2
curl -X POST \
http://localhost:9200/blog-active/_rollover \
-H 'content-type: application/json' \
-d '{
"conditions": {
"max_docs": 2
},
"mappings": {
"comment": {
"_parent": { "type": "blog" },
"properties": {
"name": { "type": "keyword" },
"comment": { "type": "text" }
}
},
"blog": {
"properties": {
"author": { "type": "keyword" },
"subject": { "type": "text" },
"content": { "type": "text" }
}
}
}
}'
4. Add another comment to the blog
curl -X POST \
'http://localhost:9200/blog-active/comment/1?parent=1' \
-H 'content-type: application/json' \
-d '{
"name": "commenter2",
"comment": "new comment2"
}'
Now if we search all blog indices for all comments on 'author1' blogs with (blog-%2A is blog-*)
curl -X POST \
http://localhost:9200/blog-%2A/comment/_search \
-H 'content-type: application/json' \
-d '{
"query": {
"has_parent" : {
"query" : {
"match" : { "author" : { "query" : "author1" } }
},
"parent_type" : "blog"
}
}
}'
the result only contains the first comment.
This is because the second comment is in the second index, which does not contain the parent blog document. So it knows nothing about the author of the blog.
So, my question is how do I approach parent-child relations when rollover is used?
Is the relationship even possible in that case?
Similar question: ElasticSearch parent/child on different indexes
All documents that form part of a parent-child relationship need to live in the same index and, more precisely, on the same shard. Therefore it is not possible to keep a parent-child relationship when rollover is used, since rollover creates new indices.
One solution to the problem above is to denormalize the data by adding the fields blog_author and blog_id to the comment type. The mapping in that case looks like this (notice that the parent-child relationship has been removed):
"mappings": {
"comment": {
"properties": {
"blog_id": { "type": "keyword" },
"blog_author": { "type": "keyword" },
"name": { "type": "keyword" },
"comment": { "type": "text" }
}
},
"blog": {
"properties": {
"author": { "type": "keyword" },
"subject": { "type": "text" },
"content": { "type": "text" }
}
}
}
and the query to fetch comments by blog author is:
curl -X POST \
http://localhost:9200/blog-%2A/comment/_search \
-H 'cache-control: no-cache' \
-H 'content-type: application/json' \
-d '{
"query": {
"match": {
"blog_author": "author1"
}
}
}'
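The effect of the rollover on both approaches can be sketched in plain Python. This is a simulation under simplifying assumptions, not Elasticsearch code: each index is a dictionary, the parent lookup only sees documents inside the same index, and the index names are placeholders for the real date-based names.

```python
# State after the four steps above: the blog and the first comment are in the
# first index, the second comment landed in the rolled-over index.
indices = {
    "blog-000001": {
        "blog": [{"_id": "1", "author": "author1"}],
        "comment": [{"name": "commenter1", "parent": "1", "blog_author": "author1"}],
    },
    "blog-000002": {
        "blog": [],
        "comment": [{"name": "commenter2", "parent": "1", "blog_author": "author1"}],
    },
}


def has_parent_search(all_indices, author):
    """Join-style lookup: a comment matches only if its parent blog is in the same index."""
    hits = []
    for index in all_indices.values():
        parent_ids = {blog["_id"] for blog in index["blog"] if blog["author"] == author}
        hits += [c["name"] for c in index["comment"] if c["parent"] in parent_ids]
    return hits


def denormalized_search(all_indices, author):
    """Denormalized lookup: the author is stored on the comment, so no join is needed."""
    return [
        comment["name"]
        for index in all_indices.values()
        for comment in index["comment"]
        if comment["blog_author"] == author
    ]


print(has_parent_search(indices, "author1"))    # only the comment next to its parent
print(denormalized_search(indices, "author1"))  # both comments
```

The trade-off of denormalizing is that a change to the blog author has to be propagated to every comment that copied it.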
