Elasticsearch sort based on element in array that satisfies filter - elasticsearch

My types have a field which is an array of times in ISO 8601 format. I want to get all the listing's which have a time on a certain day, and then order them by the earliest time they occur on that specific day. Problem is my query is ordering based on the earliest time of all days.
You can reproduce the problem below.
curl -XPUT 'localhost:9200/listings?pretty'
curl -XPOST 'localhost:9200/listings/listing/_bulk?pretty' -d '
{"index": { } }
{ "name": "second on 6th (3rd on the 5th)", "times": ["2018-12-05T12:00:00","2018-12-06T11:00:00"] }
{"index": { } }
{ "name": "third on 6th (1st on the 5th)", "times": ["2018-12-05T10:00:00","2018-12-06T12:00:00"] }
{"index": { } }
{ "name": "first on the 6th (2nd on the 5th)", "times": ["2018-12-05T11:00:00","2018-12-06T10:00:00"] }
'
# because ES takes time to add them to index
sleep 2
echo "Query listings on the 6th!"
curl -XPOST 'localhost:9200/listings/_search?pretty' -d '
{
"sort": {
"times": {
"order": "asc",
"nested_filter": {
"range": {
"times": {
"gte": "2018-12-06T00:00:00",
"lte": "2018-12-06T23:59:59"
}
}
}
}
},
"query": {
"bool": {
"filter": {
"range": {
"times": {
"gte": "2018-12-06T00:00:00",
"lte": "2018-12-06T23:59:59"
}
}
}
}
}
}'
curl -XDELETE 'localhost:9200/listings?pretty'
Adding the above script to a .sh file and running it helps reproduce the issue. You'll see the order is happening based on the 5th and not the 6th. Elasticsearch converts the times to a epoch_millis number for sorting, you can see the epoch number in the sort field in the hits object e.g 1544007600000. When doing an asc sort, in takes the smallest number in the array (order not important) and sorts based off that.
Somehow I need it to be ordered on the earliest time that occurs on the queried day i.e the 6th.
Currently using Elasticsearch 2.4 but even if someone can show me how it's done in the current version that would be great.
Here is their doc on nested queries and scripting if that helps.

I think the problem here is that the nested sorting is meant for nested objects, not for arrays.
If you convert the document into one that uses an array of nested objects instead of the simple array of dates, then you can construct a nested filtered sort that works.
The following is Elasticsearch 6.0 - they're changed the syntax a bit for 6.1 onwards, and I'm not sure how much of this works with 2.x:
Mappings:
PUT nested-listings
{
"mappings": {
"listing": {
"properties": {
"name": {
"type": "keyword"
},
"openTimes": {
"type": "nested",
"properties": {
"date": {
"type": "date"
}
}
}
}
}
}
}
Data:
POST nested-listings/listing/_bulk
{"index": { } }
{ "name": "second on 6th (3rd on the 5th)", "openTimes": [ { "date": "2018-12-05T12:00:00" }, { "date": "2018-12-06T11:00:00" }] }
{"index": { } }
{ "name": "third on 6th (1st on the 5th)", "openTimes": [ {"date": "2018-12-05T10:00:00"}, { "date": "2018-12-06T12:00:00" }] }
{"index": { } }
{ "name": "first on the 6th (2nd on the 5th)", "openTimes": [ {"date": "2018-12-05T11:00:00" }, { "date": "2018-12-06T10:00:00" }] }
So instead of the "nextNexpectionOpenTimes", we have an "openTimes" nested object, and each listing contains an array of openTimes.
Now the search:
POST nested-listings/_search
{
"sort": {
"openTimes.date": {
"order": "asc",
"nested_path": "openTimes",
"nested_filter": {
"range": {
"openTimes.date": {
"gte": "2018-12-06T00:00:00",
"lte": "2018-12-06T23:59:59"
}
}
}
}
},
"query": {
"nested": {
"path": "openTimes",
"query": {
"bool": {
"filter": {
"range": {
"openTimes.date": {
"gte": "2018-12-06T00:00:00",
"lte": "2018-12-06T23:59:59"
}
}
}
}
}
}
}
}
The main difference here is the slightly different query, since you need to use a "nested" query to filter on nested objects.
And this gives the following result:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": null,
"hits": [
{
"_index": "nested-listings",
"_type": "listing",
"_id": "vHH6e2cB28sphqox2Dcm",
"_score": null,
"_source": {
"name": "first on the 6th (2nd on the 5th)"
},
"sort": [
1544090400000
]
},
{
"_index": "nested-listings",
"_type": "listing",
"_id": "unH6e2cB28sphqox2Dcm",
"_score": null,
"_source": {
"name": "second on 6th (3rd on the 5th)"
},
"sort": [
1544094000000
]
},
{
"_index": "nested-listings",
"_type": "listing",
"_id": "u3H6e2cB28sphqox2Dcm",
"_score": null,
"_source": {
"name": "third on 6th (1st on the 5th)"
},
"sort": [
1544097600000
]
}
]
}
}
I don't think you can actually select a single value from an array in ES, so for sorting, you were always going to be sorting on all the results. The best you can do with a plain array is choose how you treat that array for sorting purposes (use lowest, highest, mean, etc).

Related

Is it possible to use a query result into another query in ElasticSearch?

I have two queries that I want to combine, the first one returns a document with some fields.
Now I want to use one of these fields into the new query without creating two separates ones.
Is there a way to combine them in order to accomplish my task?
This is the first query
{
"_source": {
"includes": [
"data.session"
]
},
"query": {
"bool": {
"must": [
{
"match": {
"field1": "9419"
}
},
{
"match": {
"field2": "5387"
}
}
],
"filter": [
{
"range": {
"timestamp": {
"time_zone": "+00:00",
"gte": "2020-10-24 10:16",
"lte": "2020-10-24 11:16"
}
}
}
]
}
},
"size" : 1
}
And this is the response returned:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 109,
"relation": "eq"
},
"max_score": 3.4183793,
"hits": [
{
"_index": "file",
"_type": "_doc",
"_id": "UBYCkgsEzLKoXh",
"_score": 3.4183793,
"_source": {
"data": {
"session": "123456789"
}
}
}
]
}
}
I want to use that "data.session" into another query, instead of rewriting the value of the field by passing the result of the first query.
{
"_source": {
"includes": [
"data.session"
]
},
"query": {
"bool": {
"must": [
{
"match": {
"data.session": "123456789"
}
}
]
}
},
"sort": [
{
"timestamp": {
"order": "asc"
}
}
]
}
If you mean to use the result of the first query as an input to the second query, then it's not possible in Elasticsearch. But if you share your query and use-case, we might suggest you better way.
ElasticSearch does not allow sub queries or inner queries.

Elasticsearch: find documents containing not more terms than in the query

If I have documents:
1: { "name": "red yellow" }
2: { "name": "green yellow" }
I'd like to query with "red brown yellow" and get document 1.
I mean the query should contain at least terms form my document, but can contain more. If document contains a token whats not in the query, there should be not hit.
How can I do this? The other way around is easy ...
First you have to declare your field as fielddata : true in order to execute script on it :
PUT test
{
"mappings": {
"properties": {
"name": {
"type": "text",
"fielddata": true
}
}
}
}
Then, you can filter your result with a script on your query:
POST test/_search
{
"query": {
"bool": {
"filter": {
"script": {
"script": {
"source": """
boolean res = true;
for (item in doc['name']) {
res = 'red brown yellow'.contains(item) && res;
}
return res;
""",
"lang": "painless"
}
}
},
"must": [
{
"match": {
"name": "red brown yellow"
}
}
]
}
}
}
Note that fielddata on a text field can cost a lot and it's better if fou can index this field as Keyword on an array as follows :
1: { "name": ["red","yellow"] }
2: { "name": ["green", "yellow"] }
The search request can be exactly the same
The match query is of type boolean. It means that the text provided is
analyzed and the analysis process constructs a boolean query from the
provided text. The minimum number of optional should clauses to match
can be set using the minimum_should_match parameter.
To know more about match query, you can refer ES documentation
Below is the mapping of name field
{
"tests": {
"mappings": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
Now when you search "red brown yellow" from the below query
POST tests/_search
{
"query": {
"match": {
"name": {
"query": "red brown yellow",
"minimum_should_match": "75%"
}
}
}
}
You get your required result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.87546873,
"hits": [
{
"_index": "tests",
"_type": "_doc",
"_id": "1",
"_score": 0.87546873,
"_source": {
"name": "red yellow"
}
}
]
}
}
The output will not include green yellow . This is because the second document, only matches 1/3 of the query terms, which is below 75%

Elasticsearch search query with nested fields

I am working on a resume database on elasticsearch. there are nested fields. For example, there is a "skills" section. "skills" is a nested field containing "skill" and "years". I want to be able to do a query that returns a skill with a certain year. For example, I want to get resumes of people with 3 or more years of "python" experience.
I have successfully run a query that does the following:
It returns all the resumes that has "python as a skills.skill and 3 as a skills.year
This returns result where python is associated with 2 years or experience as long as some other field is associated with 3 years of experience.
GET /resumes/_search
{
"query": {
"bool": {
"must": [
{ "match": { "skills.skill": "python" }},
{ "match": { "skills.years": 3 }}
]
}
}
}
Is there a better way to sort the data where that 3 is more associated with python?
You need to make use of Nested DataType and corresponding to it you would need to make use of Nested Query
What you have in current model appears to be basic object model.
I've mentioned sample mapping, sample documents, nested query and response below. This would give you what you are looking for.
Mapping
PUT resumes
{
"mappings": {
"mydocs": {
"properties": {
"skills": {
"type": "nested",
"properties": {
"skill": {
"type": "keyword"
},
"years": {
"type": "integer"
}
}
}
}
}
}
}
Sample Documents:
POST resumes/mydocs/1
{
"skills": [
{
"skill": "python",
"years": 3
},
{
"skill": "java",
"years": 3
}
]
}
POST resumes/mydocs/2
{
"skills": [
{
"skill": "python",
"years": 2
},
{
"skill": "java",
"years": 3
}
]
}
Query
POST resumes/_search
{
"query": {
"nested": {
"path": "skills",
"query": {
"bool": {
"must": [
{
"match": {
"skills.skill": "python"
}
},
{
"match": {
"skills.years": 3
}
}
]
}
}
}
}
}
Query Response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.6931472,
"hits": [
{
"_index": "resumes",
"_type": "mydocs",
"_id": "1",
"_score": 1.6931472,
"_source": {
"skills": [
{
"skill": "python",
"years": 3
},
{
"skill": "java",
"years": 3
}
]
}
}
]
}
}
Note that you only retrieve the document having id 1 in the above response. Also note that just for sake of simplicity I've made skills.skill as keyword type. You can change it to text depending on your use case.
Hope it helps!

Sort keyword field array within ElasticSearch document by relevance

I've got an ElasticSearch index that looks something like this:
{
"mappings": {
"article": {
"properties": {
"title": { "type": "string" },
"tags": {
"type": "keyword"
},
}
}
}
And data that looks something like this:
{ "title": "Something about Dogs", "tags": ["articles", "dogs"] },
{ "title": "Something about Cats", "tags": ["articles", "cats"] },
{ "title": "Something about Dog Food", "tags": ["articles", "dogs", "dogfood"] }
If I search for dog, I get the first and third documents, as I'd expect. And I can weight the search documents the way I like (in reality, I'm using a function_score query to weight on a bunch of fields irrelevant to this question).
What I'd like to do is sort the tags field so that the most relevant tags are returned first, without affecting the sort order of the documents themselves. So I'm hoping for a result like this:
{ "title": "Something about Dog Food", "tags": ["dogs", "dogfood", "articles"] }
Instead of what I get now:
{ "title": "Something about Dog Food", "tags": ["articles", "dogs", "dogfood"] }
The documentation on sort and function score don't cover my case. Any help appreciated. Thanks!
You cannot sort the _source (your array of tags) of the documents given its "matching" capability. One way of doing this is by using nested fields and inner_hits that allows you to sort the matching nested fields.
My suggestion is to transform your tags in a nested field (I chose keyword there just by simplicity, but you can also have text and the analyzer of your choice):
PUT test
{
"mappings": {
"article": {
"properties": {
"title": {
"type": "string"
},
"tags": {
"type": "nested",
"properties": {
"value": {
"type": "keyword"
}
}
}
}
}
}
}
And use this kind of query:
GET test/_search
{
"_source": {
"exclude": "tags"
},
"query": {
"bool": {
"must": [
{
"match": {
"title": "dogs"
}
},
{
"nested": {
"path": "tags",
"query": {
"bool": {
"should": [
{
"match_all": {}
},
{
"match": {
"tags.value": "dogs"
}
}
]
}
},
"inner_hits": {
"sort": {
"_score": "desc"
}
}
}
}
]
}
}
}
Where you try to match on the tags nested field value for the same text you try to match on title. Then, using inner_hits sorting, you can actually sort the nested values based on their inner scoring.
#Val's suggestion is very good, but is good as long as for your "relevant tags" you are ok with just a simple text matching as a substring (i1.indexOf(params.search)). His solution's biggest advantage is that you don't have to change the mapping.
My solution's big advantage is that you are actually using Elasticsearch true search capabilities to determine the "relevant" tags. But the drawback is that you need nested field instead of the regular simple keyword.
What you get from a search call are the source documents. The documents in the response are returned in exactly the same form as when you indexed them, which means that if you indexed ["articles", "dogs", "dogfood"], you'll always get that array in that unaltered form.
One way to get around this is to declare a script_field that applies a small script to sort your array and return the result of that sort.
What the script does is simply move the terms that contain the search term in the front of the list
{
"_source": ["title"],
"query" : {
"match_all": {}
},
"script_fields" : {
"sorted_tags" : {
"script" : {
"lang": "painless",
"source": "return params._source.tags.stream().sorted((i1, i2) -> i1.indexOf(params.search) > -1 ? -1 : 1).collect(Collectors.toList())",
"params" : {
"search": "dog"
}
}
}
}
}
This will return something like this, as you can see the sorted_tags array contains the terms as you expect.
{
"took": 18,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "tests",
"_type": "article",
"_id": "1",
"_score": 1,
"_source": {
"title": "Something about Dog Food"
},
"fields": {
"sorted_tags": [
"dogfood",
"dogs",
"articles"
]
}
}
]
}
}

Elastic search date range aggregation

I have a Json Data
"hits": [
{
"_index": "outboxprov1",
"_type": "deleted-connector",
"_id": "AHkuN5_iRGO-R5dtaOvz6w",
"_score": 1,
"_source": {
"user_id": "1a9d05586a8dc3f29b4c8147997391f9",
"deleted_date": "2014-08-02T04:55:04.509Z"
}
},
{
"_index": "outboxprov1",
"_type": "deleted-connector",
"_id": "Busk7MDFQ4emtL3x5AQyZA",
"_score": 1,
"_source": {
"user_id": "1a9d05586a8dc3f29b4c8147997391f9",
"deleted_date": "2014-08-02T04:58:31.440Z"
}
},
{
"_index": "outboxprov1",
"_type": "deleted-connector",
"_id": "4AN0zKe9SaSF1trz1IixfA",
"_score": 1,
"_source": {
"user_id": "1a9d05586a8dc3f29b4c8147997391f9",
"deleted_date": "2014-07-02T04:53:07.010Z"
}
}
]
Am trying to write aggregation query which will find records in particular "deleted_date" range.
This is my query
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"daily_team": {
"date_range": {
"field": "deleted_date",
"format": "YYYY-MM-DD",
"ranges": [
{
"from": "2014-08-02"
},
{
"to": "2014-08-02"
}
]
},
"aggs": {
"daily_team_count": {
"terms": {
"field": "user_id"
}
}
}
}
}
}
My problem is am not getting correct number of records in particular date range. When i put any date am getting some doc_count number. Am new to elastic search. Am not sure is it the way to write range aggregation query. Please help me to solve this issue.
I think problem is you are confused with "from" and "to" of date range aggregation, with range filter. Range filter includes both date (from and to ) in default. But in date_range aggregation, includes the from value and excludes the to value for each range..
In your query,
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"daily_team": {
"date_range": {
"field": "deleted_date",
"format": "YYYY-MM-DD",
"ranges": [
{
"from": "2014-08-02"
},
{
**"to": "2014-08-02"** -- > if you want to include 2014-08-02 date then do,
"to" : "2014-08-03" (increase date by one, so 08-02 is included)
}
]
},
"aggs": {
"daily_team_count": {
"terms": {
"field": "user_id"
}
}
}
}
}
}
This was also encountered by me, and I think your problem is also same.
FYI, look at the link.
What OP is looking for is InternalDateRange query. Try this instead:
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"daily_team": {
"date_range": {
"field": "deleted_date",
"format": "YYYY-MM-DD",
"ranges": [
{
"from": "2014-08-02||/d", // /d rounds off to day
// from value -> 2014-08-02T00:00:00.000Z
"to": "2014-08-03||/d" // to value -> 2014-08-03T00:00:00.000Z
}
]
},
"aggs": {
"daily_team_count": {
"terms": {
"field": "user_id"
}
}
}
}
}
}
This will return count of matching results in single bucket named daily_team.
"buckets": [
{
"key": "2014-08-02T00:00:00.000Z-2014-08-03T00:00:00.000Z",
"from": 1470096000000, //test data value
"from_as_string": "2014-08-02T00:00:00.000Z",
"to": 1470182400000, //test data value
"to_as_string": "2014-08-03T00:00:00.000Z",
"doc_count": 0
}
]
This will return single bucket containing matching doc_count.
"ranges": [
{
"from": "2014-08-02"
},
{
"to": "2014-08-02"
}
Using above ranges will return 2 buckets, one each for from and to date range.
from -> 2014-08-02-*
to -> *-2014-08-02 as shown on official documentation page.

Resources