Sort keyword field array within ElasticSearch document by relevance - elasticsearch

I've got an ElasticSearch index that looks something like this:
{
"mappings": {
"article": {
"properties": {
"title": { "type": "string" },
"tags": {
"type": "keyword"
}
}
}
}
}
And data that looks something like this:
{ "title": "Something about Dogs", "tags": ["articles", "dogs"] },
{ "title": "Something about Cats", "tags": ["articles", "cats"] },
{ "title": "Something about Dog Food", "tags": ["articles", "dogs", "dogfood"] }
If I search for dog, I get the first and third documents, as I'd expect. And I can weight the search documents the way I like (in reality, I'm using a function_score query to weight on a bunch of fields irrelevant to this question).
What I'd like to do is sort the tags field so that the most relevant tags are returned first, without affecting the sort order of the documents themselves. So I'm hoping for a result like this:
{ "title": "Something about Dog Food", "tags": ["dogs", "dogfood", "articles"] }
Instead of what I get now:
{ "title": "Something about Dog Food", "tags": ["articles", "dogs", "dogfood"] }
The documentation on sort and function_score doesn't cover my case. Any help appreciated. Thanks!

You cannot reorder the _source (your array of tags) of a document according to how well each value matches. One way to get this effect is to use a nested field together with inner_hits, which allows you to sort the matching nested objects.
My suggestion is to turn your tags into a nested field (I chose keyword there just for simplicity, but you can also use text with the analyzer of your choice):
PUT test
{
"mappings": {
"article": {
"properties": {
"title": {
"type": "string"
},
"tags": {
"type": "nested",
"properties": {
"value": {
"type": "keyword"
}
}
}
}
}
}
}
And use this kind of query:
GET test/_search
{
"_source": {
"exclude": "tags"
},
"query": {
"bool": {
"must": [
{
"match": {
"title": "dogs"
}
},
{
"nested": {
"path": "tags",
"query": {
"bool": {
"should": [
{
"match_all": {}
},
{
"match": {
"tags.value": "dogs"
}
}
]
}
},
"inner_hits": {
"sort": {
"_score": "desc"
}
}
}
}
]
}
}
}
Here you match on the nested tags.value field with the same text you match on title. Then, using inner_hits sorting, you can sort the nested values by their inner score.
@Val's suggestion is very good, as long as simple substring matching (i1.indexOf(params.search)) is enough to decide which tags are "relevant". The biggest advantage of his solution is that you don't have to change the mapping.
The big advantage of my solution is that you use Elasticsearch's actual search capabilities to determine the "relevant" tags. The drawback is that you need a nested field instead of a plain keyword field.
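As a rough sketch, the request body above can be generated for any search text. The helper name build_tag_relevance_query and its parameters are hypothetical and only mirror the query shown in this answer; nothing here calls Elasticsearch.

```python
import json


def build_tag_relevance_query(text, tags_path="tags", value_field="value"):
    """Build the request body shown above for an arbitrary search text."""
    nested_field = f"{tags_path}.{value_field}"
    return {
        "_source": {"exclude": tags_path},
        "query": {
            "bool": {
                "must": [
                    {"match": {"title": text}},
                    {
                        "nested": {
                            "path": tags_path,
                            "query": {
                                "bool": {
                                    "should": [
                                        # match_all keeps non-matching tags in inner_hits
                                        {"match_all": {}},
                                        # tags that match the text get a higher inner score
                                        {"match": {nested_field: text}},
                                    ]
                                }
                            },
                            "inner_hits": {"sort": {"_score": "desc"}},
                        }
                    },
                ]
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_tag_relevance_query("dogs"), indent=2))
```

The same text is used for the title match and the nested tag match, so the two cannot drift apart.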

What you get from a search call are the source documents. The documents in the response are returned in exactly the same form as when you indexed them, which means that if you indexed ["articles", "dogs", "dogfood"], you'll always get that array in that unaltered form.
One way to get around this is to declare a script_field that applies a small script to sort your array and return the result of that sort.
What the script does is simply move the terms that contain the search term to the front of the list:
{
"_source": ["title"],
"query" : {
"match_all": {}
},
"script_fields" : {
"sorted_tags" : {
"script" : {
"lang": "painless",
"source": "return params._source.tags.stream().sorted((i1, i2) -> i1.indexOf(params.search) > -1 ? -1 : 1).collect(Collectors.toList())",
"params" : {
"search": "dog"
}
}
}
}
}
This will return something like the following. As you can see, the sorted_tags array contains the terms in the order you expect.
{
"took": 18,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "tests",
"_type": "article",
"_id": "1",
"_score": 1,
"_source": {
"title": "Something about Dog Food"
},
"fields": {
"sorted_tags": [
"dogfood",
"dogs",
"articles"
]
}
}
]
}
}
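For comparison, here is a minimal client-side sketch of the same idea in Python, assuming the tags are reordered after the response arrives rather than in a script field. The function name sort_tags_by_match is hypothetical. Because Python's sort is stable, matching tags keep their original relative order, so the order among the matching tags can differ from the script field output above.

```python
def sort_tags_by_match(tags, search):
    """Return tags with those containing `search` first, keeping the original order otherwise."""
    # sorted() is stable and False sorts before True, so the key is "does NOT match".
    return sorted(tags, key=lambda tag: search not in tag)


print(sort_tags_by_match(["articles", "dogs", "dogfood"], "dog"))
# ['dogs', 'dogfood', 'articles']
```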

Related

Return only matching array item instead of all document values in ElasticSearch

I'm new to Elasticsearch. I'm faced with unexpected behaviour with my client's search page result and my investigation has ended up in the ES structure.
I have a document field name, which has this mapping:
"name": {
"type": "text",
"fields": {
"sort_name": {
"index": false,
"type": "keyword"
}
}
}
Usually the field holds a single value, and in that case it matches a query correctly. But sometimes a document has an array of product names, and then all of the array values end up in the search page result.
For example, if I have a product, which looks like this:
{
...,
"name": [
"Tesla",
"Model",
"XXX"
]
}
So, when I search this on name:
{
"from": 0,
"query": {
"bool": {
"must": [
{
"match": {
"name": "Tesla"
}
}
]
}
},
"size": 150,
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"stored_fields": [
"_id",
"_score"
]
}
It returns this:
"hits": {
"total": 1,
"max_score": 1.0,
"hits": [
{
"_index": "magento2_product_4_v2",
"_type": "document",
"_id": "99999",
"_score": 1.0,
"_source": {
"name": [
"Tesla",
"Model",
"XXX"
]
}
}
]
}
But I only needed Tesla.
As a result, I get two extra products (imagine that Model and XXX are products) that the user didn't search for.
I would really like to avoid structure changes if possible, since the new index is created automatically during the reindex process (I'm using Magento 2 right now). Could you help me with the query?
You have one document in your index magento2_product_4_v2. This is it:
{
...,
"name": [
"Tesla",
"Model",
"XXX"
]
}
The search returns this document because it matches your request.
Maybe you want to create three documents instead?
{
...,
"name": [
"Tesla"
]
}
{
...,
"name": [
"Model"
]
}
{
...,
"name": [
"XXX"
]
}
There is no way to do this in Elasticsearch without changing the mapping. You have a few options, such as using a nested or object field type.
You can also simply post-process the results with a regex, depending on how many Elasticsearch features you're using to get the match in the first place.
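A minimal sketch of that post-processing step, assuming the hits have already been parsed into Python dictionaries and that a case-insensitive whole-word match is good enough. The helper name keep_matching_names is hypothetical, and this does not reproduce Elasticsearch's own analysis.

```python
import re


def keep_matching_names(hit, term):
    """Return only the values of the hit's `name` field that contain `term` as a whole word."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    names = hit["_source"]["name"]
    if isinstance(names, str):
        # A single value is returned as a plain string, not a list.
        names = [names]
    return [name for name in names if pattern.search(name)]


hit = {"_id": "99999", "_source": {"name": ["Tesla", "Model", "XXX"]}}
print(keep_matching_names(hit, "Tesla"))  # ['Tesla']
```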
Here is a potential mapping that you could use, assuming you know which specific field you want to search on.
"name": {
"type": "object",
"properties": {
"make": {
"type": "keyword"
},
"model": {
"type": "keyword"
},
"trim": {
"type": "keyword"
}
}
}
Then you could write a query like this:
{
"match": {
"name.make": "Tesla"
}
}
However, depending on the mapping and the data, this may not be enough in many cases, for example with arrays of objects, because of the way Elasticsearch flattens objects at index time. The other option would be to use a nested field type, which can hurt performance on the search side. To me, this sounds like something that warrants revisiting the data modeling and the Elasticsearch mapping to get the search functionality you're looking for. Read more on the nested mapping type here.

What does a "must" clause with an array of "match" clauses really mean?

I have an elasticsearch query which looks like this...
"query": {
"bool": {
"must": [{
"match": {"attrs.name": "username"}
}, {
"match": {"attrs.value": "johndoe"}
}]
}
}
... and documents in the index that look like this:
{
"key": "value",
"attrs": [{
"name": "username",
"value": "jimihendrix"
}, {
"name": "age",
"value": 23
}, {
"name": "alias",
"value": "johndoe"
}]
}
Which of the following does this query really mean?
1. Document should contain either attrs.name = username OR attrs.value = johndoe
2. Or, document should contain both attrs.name = username AND attrs.value = johndoe, even if they may match different elements in the attrs array (this would mean that the document given above would match the query)
3. Or, document should contain both attrs.name = username AND attrs.value = johndoe, but they must match the same element in the attrs array (which would mean that the document given above would not match the query)
Further, how do I write a query to express #3 from the list above, i.e. the document should match only if a single element inside the attrs array matches both the following conditions:
attrs.name = username
attrs.value = johndoe
must stands for "AND", so a document is returned only if it satisfies all the clauses in the must array.
must will not satisfy point 1 (document should contain either attrs.name = username OR attrs.value = johndoe); for that you need a should clause, which works like "OR".
Whether must satisfies point 2 or point 3 depends on the type of the attrs field.
If the attrs field is of type object, the fields are flattened, that is, no relationship is maintained between the fields of the individual objects in the array. So the must query will return a document if any attrs.name is "username" and any attrs.value is "johndoe", even if they are not part of the same object in that array.
If you want each object in an array to act like a separate document, you need to use a nested field and a nested query to match documents:
{
"query": {
"nested": {
"path": "attrs",
"inner_hits": {},
"query": {
"bool": {
"must": [
{
"match": {
"attrs.name": "username"
}
},
{
"match": {
"attrs.value": "johndoe"
}
}
]
}
}
}
}
}
The top-level hits in the response contain the whole parent document with all of its nested objects; to get only the nested objects that matched, inner_hits has to be specified.
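The difference between the two mappings can be illustrated with a small Python simulation. This is only a sketch of the matching semantics, assuming exact string comparison instead of real analysis; the function names matches_flattened and matches_nested are hypothetical.

```python
def matches_flattened(doc, name, value):
    """Object mapping: each sub-field becomes one flat collection of values per document."""
    names = {str(attr["name"]) for attr in doc["attrs"]}
    values = {str(attr["value"]) for attr in doc["attrs"]}
    # The two conditions are checked independently of each other.
    return name in names and value in values


def matches_nested(doc, name, value):
    """Nested mapping: both conditions must hold inside the same array element."""
    return any(
        str(attr["name"]) == name and str(attr["value"]) == value
        for attr in doc["attrs"]
    )


doc = {
    "attrs": [
        {"name": "username", "value": "jimihendrix"},
        {"name": "age", "value": 23},
        {"name": "alias", "value": "johndoe"},
    ]
}
print(matches_flattened(doc, "username", "johndoe"))  # True
print(matches_nested(doc, "username", "johndoe"))     # False
```

With the object mapping the document from the question matches (case 2 in the question); with the nested mapping it does not (case 3).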
Based on your requirements, you need to define your attrs field as nested; please refer to the nested type in the Elasticsearch documentation for more information. Disclaimer: it maintains the relationship between the fields of each object, but it is more costly to query.
The answer to your other two questions also depends on which data type you are using; please refer to nested vs. object data types for more details.
Edit: solution using sample mapping, example docs and expected result
Index mapping using nested type
{
"mappings": {
"properties": {
"attrs": {
"type": "nested"
}
}
}
}
Index two sample documents: one that satisfies the criteria and one that doesn't. This one satisfies the criteria:
{
"attrs": [
{
"name": "username",
"value": "johndoe"
},
{
"name": "alias",
"value": "myname"
}
]
}
And another that does not satisfy the criteria:
{
"attrs": [
{
"name": "username",
"value": "jimihendrix"
},
{
"name": "age",
"value": 23
},
{
"name": "alias",
"value": "johndoe"
}
]
}
And search query
{
"query": {
"nested": {
"path": "attrs",
"inner_hits": {},
"query": {
"bool": {
"must": [
{
"match": {
"attrs.name": "username"
}
},
{
"match": {
"attrs.value": "johndoe"
}
}
]
}
}
}
}
}
And Search result
"hits": [
{
"_index": "nested",
"_type": "_doc",
"_id": "2",
"_score": 1.7509375,
"_source": {
"attrs": [
{
"name": "username",
"value": "johndoe"
},
{
"name": "alias",
"value": "myname"
}
]
},
"inner_hits": {
"attrs": {
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.7509375,
"hits": [
{
"_index": "nested",
"_type": "_doc",
"_id": "2",
"_nested": {
"field": "attrs",
"offset": 0
},
"_score": 1.7509375,
"_source": {
"name": "username",
"value": "johndoe"
}
}
]
}
}
}
}
]

Elasticsearch: find documents containing not more terms than in the query

If I have documents:
1: { "name": "red yellow" }
2: { "name": "green yellow" }
I'd like to query with "red brown yellow" and get document 1.
I mean the query must contain at least the terms from my document, but it can contain more. If the document contains a token that is not in the query, there should be no hit.
How can I do this? The other way around is easy ...
First, you have to enable fielddata on your field (fielddata: true) in order to run a script on it:
PUT test
{
"mappings": {
"properties": {
"name": {
"type": "text",
"fielddata": true
}
}
}
}
Then, you can filter your result with a script on your query:
POST test/_search
{
"query": {
"bool": {
"filter": {
"script": {
"script": {
"source": """
// Compare whole terms, not substrings of the query string.
def queryTerms = ['red', 'brown', 'yellow'];
for (item in doc['name']) {
if (!queryTerms.contains(item)) {
return false;
}
}
return true;
""",
"lang": "painless"
}
}
},
"must": [
{
"match": {
"name": "red brown yellow"
}
}
]
}
}
}
Note that fielddata on a text field can be expensive in memory, and it's better if you can index this field as a keyword array, as follows:
1: { "name": ["red","yellow"] }
2: { "name": ["green", "yellow"] }
The search request can be exactly the same
The match query is of type boolean. It means that the text provided is
analyzed and the analysis process constructs a boolean query from the
provided text. The minimum number of optional should clauses to match
can be set using the minimum_should_match parameter.
To learn more about the match query, refer to the Elasticsearch documentation.
Below is the mapping of the name field:
{
"tests": {
"mappings": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
Now, when you search for "red brown yellow" with the query below:
POST tests/_search
{
"query": {
"match": {
"name": {
"query": "red brown yellow",
"minimum_should_match": "75%"
}
}
}
}
You get your required result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.87546873,
"hits": [
{
"_index": "tests",
"_type": "_doc",
"_id": "1",
"_score": 0.87546873,
"_source": {
"name": "red yellow"
}
}
]
}
}
The output will not include "green yellow". With three query terms, 75% works out to 2.25, which is rounded down to two required matches; the second document matches only one of the three query terms (yellow), so it is excluded.
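The arithmetic behind minimum_should_match can be sketched in Python. This is a simplified simulation, assuming whitespace tokenization and lowercase comparison; the function names required_matches and matches are hypothetical.

```python
def required_matches(num_terms, percent):
    """Number of query terms that must match; the percentage result is rounded down."""
    return num_terms * percent // 100


def matches(doc_text, query_text, percent):
    """Return True if enough query terms occur in the document."""
    query_terms = query_text.lower().split()
    doc_terms = set(doc_text.lower().split())
    matched = sum(1 for term in query_terms if term in doc_terms)
    return matched >= required_matches(len(query_terms), percent)


query = "red brown yellow"
print(required_matches(3, 75))                 # 2
print(matches("red yellow", query, 75))        # True
print(matches("green yellow", query, 75))      # False
print(matches("red yellow green", query, 75))  # True
```

The last line shows a limitation: minimum_should_match only counts how many query terms match, so a document with an extra term that is not in the query ("green") still passes. Excluding such documents needs a check on the document's own terms, as in the script filter of the other answer.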

Elastic Search fulltext search query and filters

I wanna perform a full-text search, but I also wanna use one or many possible filters. The simplified structure of my document, when searching with /things/_search?q=*foo*:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "things",
"_type": "thing",
"_id": "63",
"_score": 1,
"fields": {
"name": [
"foo bar"
],
"description": [
"this is my description"
],
"type": [
"inanimate"
]
}
}
]
}
}
This works well enough, but how do I combine filters with a query? Let's say I wanna search for "foo" in an index with multiple documents, but I only want to get those with type == "inanimate"?
This is my attempt so far:
{
"query": {
"filtered": {
"query": {
"query_string": {
"query": "*foo*"
}
},
"filter": {
"bool": {
"must": {
"term": { "type": "inanimate" }
}
}
}
}
}
}
When I remove the filter part, it returns an accurate set of document hits. But with this filter-definition it does not return anything, even though I can manually verify that there are documents with type == "inanimate".
Since you have not defined an explicit mapping, the type field is analyzed by default, while the term query looks for an exact match against the indexed terms. You need to add "index": "not_analyzed" to the type field, and then your query will work.
A match query will give you the correct documents:
{
"query": {
"match": {
"type": "inanimate"
}
}
}
But this is not the real solution; you need to define an explicit mapping, as I said.

Elasticsearch OR filtered query does not return results

I have the following data set:
{
"_index": "myIndex",
"_type": "myType",
"_id": "220005",
"_score": 1,
"_source": {
"id": "220005",
"name": "Some Name",
"type": "myDataType",
"doc_as_upsert": true
}
}
Doing a direct match query like so:
GET typo3data/destination/_search
{
"query": {
"match": {
"name": "Some Name"
}
},
"size": 500
}
Will return the data just fine:
"hits": {
"total": 1,
"max_score": 3.442347,
"hits": [...
Doing an OR-query however (I am not sure which syntax is correct, the first syntax is taken from elasticsearch docs, the second is a working query taken from another project with the same versions):
GET typo3data/destination/_search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"or": {
"filters": [
{
"term": {
"name": "Some Name"
}
}
]
}
}
}
},
"size": 500
}
or
{
"query":
{
"match_all": {}
},
"filter":
{
"or":
[
{ "term": { "name": "Some Name"} },
{ "term": { "name": "Some Other Name"} }
]
},
"size": 1000
}
Does not return anything.
The mapping for the name field is:
"name": {
"type": "string",
"index": "not_analyzed"
}
Elasticsearch version is 1.4.4.
When "some name" is indexed into an analyzed field, it is broken into tokens as follows:
"some name" => [ "some" , "name" ]
A normal match query runs the same analysis on the query text before matching. If either "some" or "name" is present, the document qualifies as a result:
match query ("some name") => search for term "some" or "name"
The term query does not analyze or tokenize your query. This means that it looks for the exact term "some name", which is not present in the index:
term query ("some name") => search for term "some name"
Hence you won't see any results.
Things should work fine if you make the field not_analyzed, but then make sure the case also matches exactly.
You can read more about the same here.
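The behaviour described in this answer can be reproduced with a toy model in Python. The analyze function is a very rough stand-in for the standard analyzer (lowercase and split on whitespace), and all function names are hypothetical; this is a sketch of the idea, not of Elasticsearch internals.

```python
def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def index_field(value, analyzed=True):
    """Return the set of terms stored in the inverted index for one field value."""
    return set(analyze(value)) if analyzed else {value}


def term_query(indexed_terms, value):
    # term: the query value is looked up as-is, with no analysis.
    return value in indexed_terms


def match_query(indexed_terms, text):
    # match: the query text is analyzed first; by default any resulting term may match.
    return any(token in indexed_terms for token in analyze(text))


analyzed = index_field("Some Name", analyzed=True)
not_analyzed = index_field("Some Name", analyzed=False)
print(term_query(analyzed, "Some Name"))      # False
print(match_query(analyzed, "Some Name"))     # True
print(term_query(not_analyzed, "Some Name"))  # True
print(term_query(not_analyzed, "some name"))  # False
```

The last line is the case-sensitivity caveat: on a not_analyzed field, the term must match the stored value exactly.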
After extending our mapping to include every field we have:
PUT typo3data/_mapping/destination
{
"someType": {
"properties": {
"id": {
"type": "integer"
},
"name": {
"type": "string",
"index": "not_analyzed"
},
"parentId": {
"type": "integer"
},
"type": {
"type": "string"
},
"generatedUid": {
"type": "integer"
}
}
}
}
The or-filters were working. So the general answer is: If you have such a problem, check your mappings closely and rather do too much work on them than too little.
If someone has an explanation why this might be happening, I will gladly pass the answer mark on to it.
