Elasticsearch search query with nested fields

I am working on a resume database in Elasticsearch that has nested fields. For example, there is a "skills" section: "skills" is a nested field containing "skill" and "years". I want to run a query that matches a skill together with its number of years. For example, I want to get the resumes of people with 3 or more years of "python" experience.
The query below runs successfully, but it does not do quite what I want.
It returns all the resumes that have "python" as a skills.skill and 3 as a skills.years.
That includes results where python is associated with 2 years of experience, as long as some other skill is associated with 3 years of experience.
GET /resumes/_search
{
"query": {
"bool": {
"must": [
{ "match": { "skills.skill": "python" }},
{ "match": { "skills.years": 3 }}
]
}
}
}
Is there a better way to structure the data or the query so that the 3 is tied specifically to python?

You need to map "skills" with the nested datatype and query it with a nested query.
What you have in your current model appears to be a plain object mapping, which does not keep each skill paired with its own years value.
I've included a sample mapping, sample documents, a nested query, and the response below. This should give you what you are looking for.
Mapping
PUT resumes
{
"mappings": {
"mydocs": {
"properties": {
"skills": {
"type": "nested",
"properties": {
"skill": {
"type": "keyword"
},
"years": {
"type": "integer"
}
}
}
}
}
}
}
Sample Documents:
POST resumes/mydocs/1
{
"skills": [
{
"skill": "python",
"years": 3
},
{
"skill": "java",
"years": 3
}
]
}
POST resumes/mydocs/2
{
"skills": [
{
"skill": "python",
"years": 2
},
{
"skill": "java",
"years": 3
}
]
}
Query
POST resumes/_search
{
"query": {
"nested": {
"path": "skills",
"query": {
"bool": {
"must": [
{
"match": {
"skills.skill": "python"
}
},
{
"match": {
"skills.years": 3
}
}
]
}
}
}
}
}
Query Response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.6931472,
"hits": [
{
"_index": "resumes",
"_type": "mydocs",
"_id": "1",
"_score": 1.6931472,
"_source": {
"skills": [
{
"skill": "python",
"years": 3
},
{
"skill": "java",
"years": 3
}
]
}
}
]
}
}
Note that only the document with id 1 is returned in the above response. Also note that, for the sake of simplicity, I've mapped skills.skill as a keyword type. You can change it to text depending on your use case.
Hope it helps!
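To make the difference between the two mappings concrete, here is a minimal pure-Python sketch of the matching behaviour (it does not call Elasticsearch). The helper names `matches_flattened` and `matches_nested` are made up for illustration. The sketch assumes a "3 or more years" condition, as asked in the question; the queries above use an exact match on 3, so a range clause on skills.years would be needed for the "or more" part.

```python
from typing import Dict, List

Resume = Dict[str, List[dict]]


def matches_flattened(resume: Resume, skill: str, min_years: int) -> bool:
    """Object-mapping behaviour: the array is flattened into independent
    value lists, so the two conditions may hit different elements."""
    skills = [s["skill"] for s in resume["skills"]]
    years = [s["years"] for s in resume["skills"]]
    return skill in skills and any(y >= min_years for y in years)


def matches_nested(resume: Resume, skill: str, min_years: int) -> bool:
    """Nested-mapping behaviour: both conditions must hold on the
    same array element."""
    return any(
        s["skill"] == skill and s["years"] >= min_years
        for s in resume["skills"]
    )


# The two sample documents from the answer above.
doc1 = {"skills": [{"skill": "python", "years": 3}, {"skill": "java", "years": 3}]}
doc2 = {"skills": [{"skill": "python", "years": 2}, {"skill": "java", "years": 3}]}

for name, doc in (("doc1", doc1), ("doc2", doc2)):
    print(name, matches_flattened(doc, "python", 3), matches_nested(doc, "python", 3))
```

Document 2 passes the flattened check only because java contributes the 3, which is exactly the false positive described in the question.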

Related

Aggregate by property on parent document with Elasticsearch join field

I have an Elasticsearch index that uses a join type field to relate two types of indexed documents to each other via a parent-child relation: posts which are parents of comments.
posts have a category keyword field, and comments belong to posts. I would like to find the number of comments in each post category, like so:
// what query do I need to get this result?
{
"aggregations" : {
"comment-counts-by-post-category" : {
"buckets" : [
{
"key" : "Dogs",
"doc_count" : 2
},
{
"key" : "Cats",
"doc_count" : 1
}
]
}
}
}
Here is a complete example:
I have an index with the following mapping:
PUT posts-index/
{
"mappings": {
"properties": {
"post": {
"type": "object",
"properties": {
"category": {
"type": "keyword"
}
}
},
"text": {
"type": "keyword"
},
"post_comment_join": {
"type": "join",
"relations": {
"post": "comment"
}
}
}
}
}
I create two posts, one in the Dogs category, and one in the Cats category:
PUT posts-index/_doc/post-1
{
"text": "this is a dog post",
"post": {
"category": "Dogs"
},
"post_comment_join": {
"name": "post"
}
}
PUT posts-index/_doc/post-2
{
"text": "this is a cat post",
"post": {
"category": "Cats"
},
"post_comment_join": {
"name": "post"
}
}
Then, I create a few comments (in this case, 2 on the dog post and 1 on the cat post):
PUT posts-index/_doc/comment-1?routing=1&refresh
{
"text": "this is comment 1 for post 1",
"post_comment_join": {
"name": "comment",
"parent": "post-1"
}
}
PUT posts-index/_doc/comment-2?routing=1&refresh
{
"text": "this is comment 2 for post 1",
"post_comment_join": {
"name": "comment",
"parent": "post-1"
}
}
PUT posts-index/_doc/comment-3?routing=1&refresh
{
"text": "this is a comment 1 for post 2",
"post_comment_join": {
"name": "comment",
"parent": "post-2"
}
}
I can search for all comment documents using a has_parent query:
POST posts-index/_search
{
"query": {
"has_parent": {
"parent_type": "post",
"query": {
"match_all": {}
}
}
}
}
{
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": 1.0,
"hits": [ /* returns the 3 comments */ ]
}
}
What I can't figure out is how to find the number of comments in each category.
I've looked into Parent Aggregations, but they seem to only allow you aggregate based on the type of the parent. In this case, all parents are of type post, so that doesn't help.
I've also tried using a basic terms aggregation using the join_field#parent_field syntax:
POST posts-index/_search
{
"query": {
"has_parent": {
"parent_type": "post",
"query": {
"match_all": {}
}
}
},
"aggs": {
"agg-by-post-category": {
"terms": {
"field": "post_comment_join#post.category"
}
}
}
}
// returns { "buckets": [] } in the aggs
Unfortunately, this returns no results. It seems as though the post_comment_join#post syntax can be used to aggregate by parent doc, but not by an attribute on the parent doc. (i.e., by the _id field of a post, but not by post.category)
Can anyone help me figure out the right aggs syntax to return all comments grouped by their parent post's category?
Again, here is the result I'm looking for:
{
"aggregations" : {
"comment-counts-by-post-category" : {
"buckets" : [
{
"key" : "Dogs",
"doc_count" : 2
},
{
"key" : "Cats",
"doc_count" : 1
}
]
}
}
}
Platform details
Amazon OpenSearch Service, version 7.9
You can use either of the two queries below to find the number of comments per category.
GET posts-index/_search
{
"query": {
"has_child": {
"type": "comment",
"inner_hits": {
"_source": false,
"size": 0
},
"query": {
"match_all": {}
}
}
}
}
GET posts-index/_search
{
"aggs": {
"top-tags": {
"terms": {
"field": "post.category",
"size": 10
},
"aggs": {
"to-answers": {
"children": {
"type": "comment"
},
"aggs": {
"comments-count": {
"value_count": {
"field": "text"
}
}
}
}
}
}
}
}
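As a rough sketch of how the second query's result could be turned into the shape asked for in the question, the Python helper below reads the comment count from the children sub-aggregation of each category bucket. The function name and `sample_response` are illustrative assumptions: the response is hand-written to mirror the usual terms/children aggregation layout for the sample data, not captured from a real cluster.

```python
from typing import Dict


def comment_counts_by_category(response: dict) -> Dict[str, int]:
    """Map each post category to the number of child comments.

    The terms bucket's own doc_count counts posts; the comment count
    is the doc_count of the nested "to-answers" children aggregation.
    """
    buckets = response["aggregations"]["top-tags"]["buckets"]
    return {b["key"]: b["to-answers"]["doc_count"] for b in buckets}


# Hand-written example response (assumed shape, trimmed to the relevant part).
sample_response = {
    "aggregations": {
        "top-tags": {
            "buckets": [
                {
                    "key": "Cats",
                    "doc_count": 1,
                    "to-answers": {"doc_count": 1, "comments-count": {"value": 1}},
                },
                {
                    "key": "Dogs",
                    "doc_count": 1,
                    "to-answers": {"doc_count": 2, "comments-count": {"value": 2}},
                },
            ]
        }
    }
}

print(comment_counts_by_category(sample_response))
```

With this layout the value_count sub-aggregation is optional, since the children bucket already carries the number of matching comments.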

Is it possible to use a query result into another query in ElasticSearch?

I have two queries that I want to combine. The first one returns a document with some fields.
Now I want to use one of these fields in a second query, without having to run two separate queries.
Is there a way to combine them in order to accomplish my task?
This is the first query
{
"_source": {
"includes": [
"data.session"
]
},
"query": {
"bool": {
"must": [
{
"match": {
"field1": "9419"
}
},
{
"match": {
"field2": "5387"
}
}
],
"filter": [
{
"range": {
"timestamp": {
"time_zone": "+00:00",
"gte": "2020-10-24 10:16",
"lte": "2020-10-24 11:16"
}
}
}
]
}
},
"size" : 1
}
And this is the response returned:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 109,
"relation": "eq"
},
"max_score": 3.4183793,
"hits": [
{
"_index": "file",
"_type": "_doc",
"_id": "UBYCkgsEzLKoXh",
"_score": 3.4183793,
"_source": {
"data": {
"session": "123456789"
}
}
}
]
}
}
I want to feed that "data.session" value into another query automatically, instead of copying the value from the first result into the second query by hand.
{
"_source": {
"includes": [
"data.session"
]
},
"query": {
"bool": {
"must": [
{
"match": {
"data.session": "123456789"
}
}
]
}
},
"sort": [
{
"timestamp": {
"order": "asc"
}
}
]
}
If you mean using the result of the first query as an input to the second query, that is not possible within a single Elasticsearch request. But if you share more about your use case, we might be able to suggest a better way.
Elasticsearch does not support subqueries or inner queries.
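In practice this means chaining the two requests on the client side. The following Python sketch shows one way to do that. It assumes a `search` callable that takes a request body and returns the parsed JSON response; that callable, `first_session`, `session_events` and `fake_search` are all hypothetical names, and `fake_search` only returns canned data so the example runs without a cluster.

```python
from typing import Callable, List, Optional

SearchFn = Callable[[dict], dict]  # request body -> parsed response (assumed)


def first_session(search: SearchFn, first_query: dict) -> Optional[str]:
    """Run the first query and pull data.session out of the top hit."""
    hits = search(first_query)["hits"]["hits"]
    if not hits:
        return None
    return hits[0]["_source"]["data"]["session"]


def session_events(search: SearchFn, first_query: dict) -> List[dict]:
    """Use the session found by the first query as input to the second."""
    session = first_session(search, first_query)
    if session is None:
        return []
    second_query = {
        "_source": {"includes": ["data.session"]},
        "query": {"bool": {"must": [{"match": {"data.session": session}}]}},
        "sort": [{"timestamp": {"order": "asc"}}],
    }
    return search(second_query)["hits"]["hits"]


def fake_search(body: dict) -> dict:
    """Stand-in for a real client call; returns canned responses."""
    if body.get("size") == 1:
        return {"hits": {"hits": [{"_source": {"data": {"session": "123456789"}}}]}}
    wanted = body["query"]["bool"]["must"][0]["match"]["data.session"]
    return {"hits": {"hits": [{"_source": {"data": {"session": wanted}}}]}}


events = session_events(fake_search, {"size": 1, "query": {"match_all": {}}})
print(events)
```

Passing the search function in as a parameter keeps the chaining logic independent of whichever client library is actually used.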

Retrieve list of objects based on a key value constraint

I have an object instance index in ES 6.2, which I can query like this:
POST /_search
{
"query": {
"bool": {
"must": [
{
"match": {
"instanceId" : "I001"
}
}
]
}
}
}
and receive a particular instance query result:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 15,
"successful": 15,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 5.7745514,
"hits": [
{
"_index": "instance",
"_type": "searchinstance",
"_id": "I001",
"_score": 5.7745514,
"_source": {
"name": "someInstance",
"uuid": "18fab6a6-0fc9-428e-ad60-a13a6a43e0ea",
"id": "I001",
"createdAt": 1559140971501,
"completedAt": 1559140988024,
"modifiedAt": 1559140988028,
"description": "my description",
"instanceId": "I001",
"status": null,
"attributes": [
{
"name": "response.result",
"value": "0"
},
{
"name": "response.value",
"value": "123"
}
],
"createdBy": null
}
}
]
}
}
How do I query all such instances (i.e. just the list of instanceId values) having "attributes.name": "response.result" and "attributes.value": "0"?
I've been trying to combine the query_string, match, wildcard and nested query types, but without success. It seems that the issue is specifying the path to the attributes structure correctly. When POSTing:
{
"query": {
"nested": {
"path": "attributes",
"query": {
"bool": {
"must": [
{
"match": {
"attributes.name": "response.result"
}
},
{
"match": {
"attributes.value": "0"
}
}
]
}
}
}
}
}
I receive a failure reason
{
"type": "query_shard_exception",
"reason": "failed to create query: {...}",
"index_uuid": "8Sr_2jvsRvqGmDjK71SFsw",
"index": ".kibana",
"caused_by": {
"type": "illegal_state_exception",
"reason": "[nested] failed to find nested object under path [attributes]"
}
}
Thank you.
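Elasticsearch doesn't have a dedicated array type. In fact, any field can hold zero or more values of its type. So, assuming your attributes field is of object type, you can query it just as you would for a single object. Keep in mind that with an object mapping the two conditions can be satisfied by different elements of the array; pairing them on the same element requires a nested mapping. For example: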
{
"query": {
"bool": {
"must": [
{
"match": {
"attributes.name": "response.result"
}
},
{
"match": {
"attributes.value": "0"
}
}
]
}
}
}
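Since the question asks for just the list of instanceId values, one possible approach is to restrict `_source` to that field and collect it from the hits. The Python sketch below builds such a request body and extracts the ids. `build_query`, `instance_ids` and `sample_response` are illustrative names, and the sample response is hand-written rather than real output.

```python
from typing import List


def build_query(name: str, value: str) -> dict:
    """Request body for the object-type query, returning only instanceId."""
    return {
        "_source": ["instanceId"],
        "query": {
            "bool": {
                "must": [
                    {"match": {"attributes.name": name}},
                    {"match": {"attributes.value": value}},
                ]
            }
        },
    }


def instance_ids(response: dict) -> List[str]:
    """Collect the instanceId of every hit in a search response."""
    return [hit["_source"]["instanceId"] for hit in response["hits"]["hits"]]


# Hand-written response in the shape shown in the question (assumption).
sample_response = {
    "hits": {
        "total": 1,
        "hits": [{"_id": "I001", "_source": {"instanceId": "I001"}}],
    }
}

body = build_query("response.result", "0")
print(body["_source"], instance_ids(sample_response))
```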

Elasticsearch sort based on element in array that satisfies filter

My types have a field which is an array of times in ISO 8601 format. I want to get all the listings which have a time on a certain day, and then order them by the earliest time they occur on that specific day. The problem is that my query orders them by the earliest time across all days.
You can reproduce the problem below.
curl -XPUT 'localhost:9200/listings?pretty'
curl -XPOST 'localhost:9200/listings/listing/_bulk?pretty' -d '
{"index": { } }
{ "name": "second on 6th (3rd on the 5th)", "times": ["2018-12-05T12:00:00","2018-12-06T11:00:00"] }
{"index": { } }
{ "name": "third on 6th (1st on the 5th)", "times": ["2018-12-05T10:00:00","2018-12-06T12:00:00"] }
{"index": { } }
{ "name": "first on the 6th (2nd on the 5th)", "times": ["2018-12-05T11:00:00","2018-12-06T10:00:00"] }
'
# because ES takes time to add them to index
sleep 2
echo "Query listings on the 6th!"
curl -XPOST 'localhost:9200/listings/_search?pretty' -d '
{
"sort": {
"times": {
"order": "asc",
"nested_filter": {
"range": {
"times": {
"gte": "2018-12-06T00:00:00",
"lte": "2018-12-06T23:59:59"
}
}
}
}
},
"query": {
"bool": {
"filter": {
"range": {
"times": {
"gte": "2018-12-06T00:00:00",
"lte": "2018-12-06T23:59:59"
}
}
}
}
}
}'
curl -XDELETE 'localhost:9200/listings?pretty'
Adding the above script to a .sh file and running it reproduces the issue. You'll see the ordering is based on the 5th and not the 6th. Elasticsearch converts the times to an epoch_millis number for sorting; you can see the epoch number in the sort field of each hit, e.g. 1544007600000. When doing an asc sort, it takes the smallest number in the array (the order of the array is not important) and sorts based on that.
Somehow I need it to be ordered on the earliest time that occurs on the queried day i.e the 6th.
Currently using Elasticsearch 2.4 but even if someone can show me how it's done in the current version that would be great.
Here is their doc on nested queries and scripting if that helps.
I think the problem here is that the nested sorting is meant for nested objects, not for arrays.
If you convert the document into one that uses an array of nested objects instead of the simple array of dates, then you can construct a nested filtered sort that works.
The following is for Elasticsearch 6.0. They've changed the syntax a bit for 6.1 onwards, and I'm not sure how much of this works with 2.x:
Mappings:
PUT nested-listings
{
"mappings": {
"listing": {
"properties": {
"name": {
"type": "keyword"
},
"openTimes": {
"type": "nested",
"properties": {
"date": {
"type": "date"
}
}
}
}
}
}
}
Data:
POST nested-listings/listing/_bulk
{"index": { } }
{ "name": "second on 6th (3rd on the 5th)", "openTimes": [ { "date": "2018-12-05T12:00:00" }, { "date": "2018-12-06T11:00:00" }] }
{"index": { } }
{ "name": "third on 6th (1st on the 5th)", "openTimes": [ {"date": "2018-12-05T10:00:00"}, { "date": "2018-12-06T12:00:00" }] }
{"index": { } }
{ "name": "first on the 6th (2nd on the 5th)", "openTimes": [ {"date": "2018-12-05T11:00:00" }, { "date": "2018-12-06T10:00:00" }] }
So instead of the plain "times" array, we have an "openTimes" nested object, and each listing contains an array of openTimes.
Now the search:
POST nested-listings/_search
{
"sort": {
"openTimes.date": {
"order": "asc",
"nested_path": "openTimes",
"nested_filter": {
"range": {
"openTimes.date": {
"gte": "2018-12-06T00:00:00",
"lte": "2018-12-06T23:59:59"
}
}
}
}
},
"query": {
"nested": {
"path": "openTimes",
"query": {
"bool": {
"filter": {
"range": {
"openTimes.date": {
"gte": "2018-12-06T00:00:00",
"lte": "2018-12-06T23:59:59"
}
}
}
}
}
}
}
}
The main difference here is the slightly different query, since you need to use a "nested" query to filter on nested objects.
And this gives the following result:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": null,
"hits": [
{
"_index": "nested-listings",
"_type": "listing",
"_id": "vHH6e2cB28sphqox2Dcm",
"_score": null,
"_source": {
"name": "first on the 6th (2nd on the 5th)"
},
"sort": [
1544090400000
]
},
{
"_index": "nested-listings",
"_type": "listing",
"_id": "unH6e2cB28sphqox2Dcm",
"_score": null,
"_source": {
"name": "second on 6th (3rd on the 5th)"
},
"sort": [
1544094000000
]
},
{
"_index": "nested-listings",
"_type": "listing",
"_id": "u3H6e2cB28sphqox2Dcm",
"_score": null,
"_source": {
"name": "third on 6th (1st on the 5th)"
},
"sort": [
1544097600000
]
}
]
}
}
I don't think you can actually select a single value from an array in ES, so for sorting, you were always going to be sorting on all the results. The best you can do with a plain array is choose how you treat that array for sorting purposes (use lowest, highest, mean, etc).
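The difference between the two sort behaviours can be reproduced outside Elasticsearch. The Python sketch below uses the sample listings from the question and compares sorting by the minimum of the whole array with sorting by the minimum restricted to the queried day. It is only a model of the behaviour described above, and the helper names are made up for illustration.

```python
from datetime import date, datetime
from typing import List

listings = [
    {"name": "second on 6th (3rd on the 5th)",
     "times": ["2018-12-05T12:00:00", "2018-12-06T11:00:00"]},
    {"name": "third on 6th (1st on the 5th)",
     "times": ["2018-12-05T10:00:00", "2018-12-06T12:00:00"]},
    {"name": "first on the 6th (2nd on the 5th)",
     "times": ["2018-12-05T11:00:00", "2018-12-06T10:00:00"]},
]


def parse(times: List[str]) -> List[datetime]:
    return [datetime.fromisoformat(t) for t in times]


def earliest_overall(listing: dict) -> datetime:
    """Plain-array sort key: the smallest value in the whole array."""
    return min(parse(listing["times"]))


def earliest_on(listing: dict, day: date) -> datetime:
    """Filtered sort key: the smallest value that falls on the given day."""
    return min(t for t in parse(listing["times"]) if t.date() == day)


day = date(2018, 12, 6)
# Keep only listings with at least one time on the queried day.
candidates = [l for l in listings if any(t.date() == day for t in parse(l["times"]))]

plain_order = [l["name"] for l in sorted(candidates, key=earliest_overall)]
filtered_order = [l["name"] for l in sorted(candidates, key=lambda l: earliest_on(l, day))]

print(plain_order)
print(filtered_order)
```

The first list follows the order on the 5th, which is the behaviour reported in the question; the second follows the order on the 6th, which is what the nested filtered sort produces.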

Elasticsearch Query - Return all documents that do not have a corresponding document

I have an index that contains documents that have a status. These are initially imported by a job, and their status is set to 0.
For simplicity:
{
"_uid" : 1234,
"id" : 1,
"name" : "someName",
"status" : 0
}
Then another import job runs and extends these objects by iterating over each object with status=0. Each object that is extended gets the status 1.
{
"_uid" : 1234,
"id" : 1,
"name" : "someName",
"newProperty" : "someValue",
"status" : 1
}
(Note the unchanged _uid. It's the same object)
Now I have a third import job that takes all objects with status one, takes their ID (the ID!!! Not their _uid!) and creates a new object with the same ID, but different UID:
{
"_uid" : 5678,
"id" : 1,
"completelyDifferentProperty" : "someValue",
"status" : 2
}
So now, for each ID, I have two objects: One with status = 1, One with status = 2.
For the last job I need to make sure that it only picks objects with status =1 that DO NOT YET have a corresponding status=2 object.
So I need a query to the effect of
"Get all objects where status == 1 for which no status == 2 object with the same ID exists".
I have a feeling aggregations might help me but I haven't gotten it figured out yet.
You can do it fairly easily with a parent/child relationship. This is sort of a special-case use of the capability, but I think it could be used to solve your problem.
To test it out, I set up an index like this, with parent_doc type and a child_doc type (I only included the properties necessary to set up the capability; it doesn't hurt to add more in your documents):
PUT /test_index
{
"mappings": {
"parent_doc": {
"_id": {
"path": "id"
},
"properties": {
"id": {
"type": "long"
},
"_uid": {
"type": "long"
},
"status": {
"type": "integer"
}
}
},
"child_doc": {
"_parent": {
"type": "parent_doc"
},
"_id": {
"path": "id"
},
"properties": {
"id": {
"type": "long"
},
"_uid": {
"type": "long"
},
"status": {
"type": "long"
}
}
}
}
}
Then I added four docs: three parents and one child. There is one document that has "status": 1 and doesn't have a corresponding child document.
POST /test_index/_bulk
{"index":{"_type":"parent_doc"}}
{"_uid":1234,"id":1,"name":"someName","newProperty":"someValue","status":0}
{"index":{"_type":"parent_doc"}}
{"_uid":1234,"id":2,"name":"someName","newProperty":"someValue","status":1}
{"index":{"_type":"child_doc","_parent":2}}
{"_uid":5678,"id":2,"completelyDifferentProperty":"someValue","status":2}
{"index":{"_type":"parent_doc"}}
{"_uid":4321,"id":3,"name":"anotherName","newProperty":"anotherValue","status":1}
We can find the document we want like this; notice we are querying only the parent_doc type, and that our conditions are that status is 1 and no child (at all) exists:
POST /test_index/parent_doc/_search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"term": {
"status": 1
}
},
{
"not": {
"filter": {
"has_child": {
"type": "child_doc",
"query": {
"match_all": {}
}
}
}
}
}
]
}
}
}
}
}
This returns:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "parent_doc",
"_id": "3",
"_score": 1,
"_source": {
"_uid": 4321,
"id": 3,
"name": "anotherName",
"newProperty": "anotherValue",
"status": 1
}
}
]
}
}
Here's all the code I used to test it:
http://sense.qbox.io/gist/d1a0267087d6e744b991de5cdec1c31d947ebc13
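As a sanity check on the selection rule itself ("status 1 with no status 2 document sharing the same id"), here is a small client-side Python sketch over the four sample documents. It is not a replacement for the has_child filter above, just a reference for what the query is expected to return; `pending_ids` is a hypothetical helper name.

```python
from typing import List

# The four sample documents from the bulk request above.
docs = [
    {"_uid": 1234, "id": 1, "status": 0},
    {"_uid": 1234, "id": 2, "status": 1},
    {"_uid": 5678, "id": 2, "status": 2},
    {"_uid": 4321, "id": 3, "status": 1},
]


def pending_ids(documents: List[dict]) -> List[int]:
    """Ids of status-1 documents that have no status-2 counterpart."""
    done = {d["id"] for d in documents if d["status"] == 2}
    return sorted(
        d["id"] for d in documents if d["status"] == 1 and d["id"] not in done
    )


print(pending_ids(docs))
```

Only id 3 qualifies, which agrees with the single hit returned by the query.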

Resources