Elasticsearch: find documents containing not more terms than in the query

If I have documents:
1: { "name": "red yellow" }
2: { "name": "green yellow" }
I'd like to query with "red brown yellow" and get document 1.
I mean the query must contain at least the terms from my document, but it can contain more. If a document contains a token that is not in the query, there should be no hit.
How can I do this? The other way around is easy ...

First, you have to set fielddata: true on the field so that a script can read its tokens:
PUT test
{
"mappings": {
"properties": {
"name": {
"type": "text",
"fielddata": true
}
}
}
}
Then, you can filter your result with a script on your query:
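POST test/_search
{
"query": {
"bool": {
"filter": {
"script": {
"script": {
"source": """
// Reject the document as soon as one of its tokens is not a query term.
// Whole tokens are compared, so a token such as "row" does not slip
// through just because it is a substring of "brown".
for (item in doc['name']) {
if (!params.terms.contains(item)) {
return false;
}
}
return true;
""",
"lang": "painless",
"params": {
"terms": ["red", "brown", "yellow"]
}
}
}
},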
"must": [
{
"match": {
"name": "red brown yellow"
}
}
]
}
}
}
Note that fielddata on a text field can use a lot of heap memory, so it is better if you can index this field as a keyword array, as follows:
1: { "name": ["red","yellow"] }
2: { "name": ["green", "yellow"] }
The search request can be exactly the same
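To make the intent of the script filter concrete, here is a minimal Python sketch of the same check done outside Elasticsearch. It assumes that lowercasing plus whitespace splitting is a good enough stand-in for the field's analyzer, and the function name doc_terms_within_query is hypothetical.

```python
def doc_terms_within_query(doc_text: str, query_text: str) -> bool:
    """Return True when every token of the document also occurs in the query."""
    # Assumption: lowercase + whitespace split approximates the field's analyzer.
    doc_tokens = set(doc_text.lower().split())
    query_tokens = set(query_text.lower().split())
    # Subset test: the document may not contain a token the query lacks.
    return doc_tokens <= query_tokens


docs = {1: "red yellow", 2: "green yellow"}
hits = [doc_id for doc_id, name in docs.items()
        if doc_terms_within_query(name, "red brown yellow")]
print(hits)
```

The comparison is done on whole tokens, so a document token that is only a substring of a query term does not count as contained.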

The match query is of type boolean. It means that the text provided is
analyzed and the analysis process constructs a boolean query from the
provided text. The minimum number of optional should clauses to match
can be set using the minimum_should_match parameter.
To learn more about the match query, refer to the Elasticsearch documentation.
Below is the mapping of the name field:
{
"tests": {
"mappings": {
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
Now, when you search for "red brown yellow" with the query below:
POST tests/_search
{
"query": {
"match": {
"name": {
"query": "red brown yellow",
"minimum_should_match": "75%"
}
}
}
}
You get your required result:
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.87546873,
"hits": [
{
"_index": "tests",
"_type": "_doc",
"_id": "1",
"_score": 0.87546873,
"_source": {
"name": "red yellow"
}
}
]
}
}
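The output will not include "green yellow". With three query terms, 75% is rounded down to 2 required matches. The first document matches 2 terms (red, yellow), while the second document matches only 1 (yellow), which is below the threshold.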
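The arithmetic behind minimum_should_match can be sketched in Python as follows. This is only an illustration under the assumption that terms are split on whitespace; the helper names required_matches and matches are hypothetical and not part of any Elasticsearch client.

```python
def required_matches(optional_clauses: int, percentage: int) -> int:
    """Number of clauses that must match; a percentage is rounded down."""
    return optional_clauses * percentage // 100


query_terms = "red brown yellow".split()
needed = required_matches(len(query_terms), 75)


def matches(doc_text: str) -> bool:
    """True when the document contains at least `needed` of the query terms."""
    doc_tokens = set(doc_text.split())
    matched = sum(1 for term in query_terms if term in doc_tokens)
    return matched >= needed


print(needed, matches("red yellow"), matches("green yellow"))
```

Note that under this model a document such as "red yellow green" would pass as well, because minimum_should_match counts matched query terms, not document tokens.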

Related

Elasticsearch search query with nested fields

I am working on a resume database in Elasticsearch. There are nested fields. For example, there is a "skills" section: "skills" is a nested field containing "skill" and "years". I want to be able to run a query that returns a skill with a certain number of years. For example, I want to get resumes of people with 3 or more years of "python" experience.
I have successfully run a query that does the following:
It returns all the resumes that have "python" as a skills.skill and 3 as a skills.years.
However, it also returns results where python is associated with 2 years of experience, as long as some other skill is associated with 3 years of experience.
GET /resumes/_search
{
"query": {
"bool": {
"must": [
{ "match": { "skills.skill": "python" }},
{ "match": { "skills.years": 3 }}
]
}
}
}
Is there a better way to sort the data where that 3 is more associated with python?
You need to use the nested datatype and, correspondingly, a nested query.
What you have in your current model appears to be the basic object model, in which the skill and years values of different array entries are not kept together.
Below are a sample mapping, sample documents, a nested query, and its response. This should give you what you are looking for.
Mapping
PUT resumes
{
"mappings": {
"mydocs": {
"properties": {
"skills": {
"type": "nested",
"properties": {
"skill": {
"type": "keyword"
},
"years": {
"type": "integer"
}
}
}
}
}
}
}
Sample Documents:
POST resumes/mydocs/1
{
"skills": [
{
"skill": "python",
"years": 3
},
{
"skill": "java",
"years": 3
}
]
}
POST resumes/mydocs/2
{
"skills": [
{
"skill": "python",
"years": 2
},
{
"skill": "java",
"years": 3
}
]
}
Query
POST resumes/_search
{
"query": {
"nested": {
"path": "skills",
"query": {
"bool": {
"must": [
{
"match": {
"skills.skill": "python"
}
},
{
"match": {
"skills.years": 3
}
}
]
}
}
}
}
}
Query Response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.6931472,
"hits": [
{
"_index": "resumes",
"_type": "mydocs",
"_id": "1",
"_score": 1.6931472,
"_source": {
"skills": [
{
"skill": "python",
"years": 3
},
{
"skill": "java",
"years": 3
}
]
}
}
]
}
}
Note that only the document with id 1 is retrieved in the above response. Also note that, just for the sake of simplicity, I've made skills.skill a keyword type. You can change it to text depending on your use case.
Hope it helps!
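The difference between the object model and the nested model can be illustrated with a small Python sketch. It is a simplified model of the matching behaviour, not of Elasticsearch internals, and the function names are hypothetical.

```python
# Document 2 from the sample data: python has 2 years, java has 3.
resume = {
    "skills": [
        {"skill": "python", "years": 2},
        {"skill": "java", "years": 3},
    ]
}


def matches_flattened(doc: dict, skill: str, years: int) -> bool:
    """Object model: each sub-field becomes an independent list of values."""
    skills = [entry["skill"] for entry in doc["skills"]]
    all_years = [entry["years"] for entry in doc["skills"]]
    # The pairing between skill and years is lost, so cross-matches occur.
    return skill in skills and years in all_years


def matches_nested(doc: dict, skill: str, years: int) -> bool:
    """Nested model: both conditions must hold within the same entry."""
    return any(entry["skill"] == skill and entry["years"] == years
               for entry in doc["skills"])


print(matches_flattened(resume, "python", 3), matches_nested(resume, "python", 3))
```

The flattened check accepts the resume because "python" and 3 both occur somewhere in the document, while the nested check rejects it because no single skills entry has both.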

Sort keyword field array within ElasticSearch document by relevance

I've got an ElasticSearch index that looks something like this:
{
"mappings": {
"article": {
"properties": {
"title": { "type": "string" },
"tags": {
"type": "keyword"
}
}
}
}
}
And data that looks something like this:
{ "title": "Something about Dogs", "tags": ["articles", "dogs"] },
{ "title": "Something about Cats", "tags": ["articles", "cats"] },
{ "title": "Something about Dog Food", "tags": ["articles", "dogs", "dogfood"] }
If I search for dog, I get the first and third documents, as I'd expect. And I can weight the search documents the way I like (in reality, I'm using a function_score query to weight on a bunch of fields irrelevant to this question).
What I'd like to do is sort the tags field so that the most relevant tags are returned first, without affecting the sort order of the documents themselves. So I'm hoping for a result like this:
{ "title": "Something about Dog Food", "tags": ["dogs", "dogfood", "articles"] }
Instead of what I get now:
{ "title": "Something about Dog Food", "tags": ["articles", "dogs", "dogfood"] }
The documentation on sort and function score don't cover my case. Any help appreciated. Thanks!
You cannot sort the _source of a document (your array of tags) according to how well each value matches. One way of doing this is to use nested fields and inner_hits, which allow you to sort the matching nested fields.
My suggestion is to transform your tags into a nested field (I chose keyword there just for simplicity, but you can also use text and the analyzer of your choice):
PUT test
{
"mappings": {
"article": {
"properties": {
"title": {
"type": "string"
},
"tags": {
"type": "nested",
"properties": {
"value": {
"type": "keyword"
}
}
}
}
}
}
}
And use this kind of query:
GET test/_search
{
"_source": {
"exclude": "tags"
},
"query": {
"bool": {
"must": [
{
"match": {
"title": "dogs"
}
},
{
"nested": {
"path": "tags",
"query": {
"bool": {
"should": [
{
"match_all": {}
},
{
"match": {
"tags.value": "dogs"
}
}
]
}
},
"inner_hits": {
"sort": {
"_score": "desc"
}
}
}
}
]
}
}
}
Here you try to match the tags nested field value against the same text you match on title. Then, using inner_hits sorting, you can sort the nested values based on their inner scoring.
@Val's suggestion is very good, but only as long as you are OK with determining the "relevant tags" by simple substring matching (i1.indexOf(params.search)). The biggest advantage of his solution is that you don't have to change the mapping.
The big advantage of my solution is that you are using Elasticsearch's real search capabilities to determine the "relevant" tags. The drawback is that you need a nested field instead of a regular simple keyword.
What you get from a search call are the source documents. The documents in the response are returned in exactly the same form as when you indexed them, which means that if you indexed ["articles", "dogs", "dogfood"], you'll always get that array in that unaltered form.
One way to get around this is to declare a script_field that applies a small script to sort your array and return the result of that sort.
The script simply moves the terms that contain the search term to the front of the list:
{
"_source": ["title"],
"query" : {
"match_all": {}
},
"script_fields" : {
"sorted_tags" : {
"script" : {
"lang": "painless",
"source": "return params._source.tags.stream().sorted((i1, i2) -> i1.indexOf(params.search) > -1 ? -1 : 1).collect(Collectors.toList())",
"params" : {
"search": "dog"
}
}
}
}
}
This will return something like the following. As you can see, the sorted_tags array contains the terms as you expect.
{
"took": 18,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "tests",
"_type": "article",
"_id": "1",
"_score": 1,
"_source": {
"title": "Something about Dog Food"
},
"fields": {
"sorted_tags": [
"dogfood",
"dogs",
"articles"
]
}
}
]
}
}
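The reordering idea behind the script field can also be sketched in Python, for instance if you prefer to reorder the tags on the client after the search returns. This is an illustrative sketch rather than a translation of the Painless script, and the function name sort_tags is hypothetical.

```python
def sort_tags(tags: list, search: str) -> list:
    """Move tags containing the search term to the front of the list."""
    # The key is False (sorts first) for tags that contain the search term.
    # sorted() is stable, so the original order is kept within each group.
    return sorted(tags, key=lambda tag: search not in tag)


print(sort_tags(["articles", "dogs", "dogfood"], "dog"))
```

Because the sort is stable and uses a key instead of a comparator, the relative order of the matching tags and of the non-matching tags is preserved.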

How to perform an exact match query on an analyzed field in Elasticsearch?

This is probably a very commonly asked question; however, the answers I've got so far aren't satisfactory.
Problem:
I have an ES index that is composed of nearly 100 fields. Most of the fields are of string type and set as analyzed. However, a query can be either partial (match) or exact (more like term). So, if my index contains a string field with the value super duper cool pizza, a partial query like duper super should match the document, but an exact query like cool pizza should not. On the other hand, Super Duper COOL PIzza should again match this document.
So far, the partial match part is easy: I used the AND operator in a match query. However, I can't get the other type done.
I have looked into other posts related to this problem and this post contains the closest solution:
Elasticsearch exact matches on analyzed fields
Out of the three solutions, the first one feels very complex, as I have a lot of fields and I do not use the REST API; I am creating queries dynamically using QueryBuilders with NativeSearchQueryBuilder from their Java API. Also, it generates a lot of possible patterns, which I think will cause performance issues.
The second one is a much easier solution, but again, I have to maintain a lot more (almost) redundant data, and I don't think using term queries is ever going to solve my problem.
The last one has a problem, I think: it will not prevent super duper from matching super duper cool pizza, which is not the output I want.
So is there any other way I can achieve the goal? I can post some sample mapping if required for clarifying the question further. I am already keeping the source as well (in case that can be used). Please feel free to suggest any improvements as well.
Thanks in advance.
[UPDATE]
Finally, I used multi_field, keeping a raw field for exact queries. When I insert I use some custom modification on data, and during searching, I used the same modification routines on input text. This part is not handled by Elasticsearch. If you want to do that, you have to design appropriate analyzers as well.
Index settings and mapping queries:
PUT test_index
POST test_index/_close
PUT test_index/_settings
{
"index": {
"analysis": {
"analyzer": {
"standard_uppercase": {
"type": "custom",
"char_filter": ["html_strip"],
"tokenizer": "keyword",
"filter": ["uppercase"]
}
}
}
}
}
PUT test_index/doc/_mapping
{
"doc": {
"properties": {
"text_field": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"analyzer": "standard_uppercase"
}
}
}
}
}
}
POST test_index/_open
Inserting some sample data:
POST test_index/doc/_bulk
{"index":{"_id":1}}
{"text_field":"super duper cool pizza"}
{"index":{"_id":2}}
{"text_field":"some other text"}
{"index":{"_id":3}}
{"text_field":"pizza"}
Exact query:
GET test_index/doc/_search
{
"query": {
"bool": {
"must": {
"bool": {
"should": {
"term": {
"text_field.raw": "PIZZA"
}
}
}
}
}
}
}
Response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1.4054651,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1.4054651,
"_source": {
"text_field": "pizza"
}
}
]
}
}
Partial query:
GET test_index/doc/_search
{
"query": {
"bool": {
"must": {
"bool": {
"should": {
"match": {
"text_field": {
"query": "pizza",
"operator": "AND",
"type": "boolean"
}
}
}
}
}
}
}
}
Response:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"_source": {
"text_field": "pizza"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 0.5,
"_source": {
"text_field": "super duper cool pizza"
}
}
]
}
}
PS: These are generated queries, that's why there are some redundant blocks, as there would be many other fields concatenated into the queries.
Sad part is, now I need to rewrite the whole mapping again :(
I think this will do what you want (or at least come as close as is possible), using the keyword tokenizer and lowercase token filter:
PUT /test_index
{
"settings": {
"analysis": {
"analyzer": {
"lowercase_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": ["lowercase_token_filter"]
}
},
"filter": {
"lowercase_token_filter": {
"type": "lowercase"
}
}
}
},
"mappings": {
"doc": {
"properties": {
"text_field": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"lowercase": {
"type": "string",
"analyzer": "lowercase_analyzer"
}
}
}
}
}
}
}
I added a couple of docs for testing:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"text_field":"super duper cool pizza"}
{"index":{"_id":2}}
{"text_field":"some other text"}
{"index":{"_id":3}}
{"text_field":"pizza"}
Notice we have the outer text_field set to be analyzed by the standard analyzer, then a sub-field raw that's not_analyzed (you may not want this one, I just added it for comparison), and another sub-field lowercase that creates tokens exactly the same as the input text, except that they have been lowercased (but not split on whitespace). So this match query returns what you expected:
POST /test_index/_search
{
"query": {
"match": {
"text_field.lowercase": "Super Duper COOL PIzza"
}
}
}
...
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.30685282,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 0.30685282,
"_source": {
"text_field": "super duper cool pizza"
}
}
]
}
}
Remember that the match query will use the field's analyzer against the search phrase as well, so in this case searching for "super duper cool pizza" would have exactly the same effect as searching for "Super Duper COOL PIzza" (you could still use a term query if you want an exact match).
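To see why the lowercase sub-field gives a case-insensitive exact match while the outer field gives partial matches, the two analysis chains can be approximated in Python. The tokenizers below are rough stand-ins (a regex word split for the standard analyzer, the whole string for the keyword tokenizer), and all function names are hypothetical.

```python
import re


def standard_like(text: str) -> list:
    """Rough stand-in for the standard analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def keyword_lowercase(text: str) -> list:
    """Keyword tokenizer + lowercase filter: one token, the whole input."""
    return [text.lower()]


indexed = "super duper cool pizza"


def exact_match(query: str) -> bool:
    """Models a match query on text_field.lowercase."""
    return keyword_lowercase(query) == keyword_lowercase(indexed)


def partial_match(query: str) -> bool:
    """Models a match query with the AND operator on text_field."""
    return set(standard_like(query)) <= set(standard_like(indexed))


print(exact_match("Super Duper COOL PIzza"), exact_match("cool pizza"),
      partial_match("duper super"))
```

The exact check compares a single token made of the entire lowercased string, so "cool pizza" fails even though both of its words occur in the indexed value.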
It's useful to take a look at the terms generated in each field by the three documents, since this is what your search queries will be working against (in this case raw and lowercase have the same tokens, but that's only because all the inputs were lower-case already):
POST /test_index/_search
{
"size": 0,
"aggs": {
"text_field_standard": {
"terms": {
"field": "text_field"
}
},
"text_field_raw": {
"terms": {
"field": "text_field.raw"
}
},
"text_field_lowercase": {
"terms": {
"field": "text_field.lowercase"
}
}
}
}
...
{
"took": 26,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0,
"hits": []
},
"aggregations": {
"text_field_raw": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "pizza",
"doc_count": 1
},
{
"key": "some other text",
"doc_count": 1
},
{
"key": "super duper cool pizza",
"doc_count": 1
}
]
},
"text_field_lowercase": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "pizza",
"doc_count": 1
},
{
"key": "some other text",
"doc_count": 1
},
{
"key": "super duper cool pizza",
"doc_count": 1
}
]
},
"text_field_standard": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "pizza",
"doc_count": 2
},
{
"key": "cool",
"doc_count": 1
},
{
"key": "duper",
"doc_count": 1
},
{
"key": "other",
"doc_count": 1
},
{
"key": "some",
"doc_count": 1
},
{
"key": "super",
"doc_count": 1
},
{
"key": "text",
"doc_count": 1
}
]
}
}
}
Here's the code I used to test this out:
http://sense.qbox.io/gist/cc7564464cec88dd7f9e6d9d7cfccca2f564fde1
If you also want to do partial word matching, I would encourage you to take a look at ngrams. I wrote up an introduction for Qbox here:
https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch

Elastic Search- Fetch Distinct Tags

I have document of following format:
{
_id :"1",
tags:["guava","apple","mango", "banana", "gulmohar"]
}
{
_id:"2",
tags: ["orange","guava", "mango shakes", "apple pie", "grammar"]
}
{
_id:"3",
tags: ["apple","grapes", "water", "gulmohar","water-melon", "green"]
}
Now, I want to fetch the unique tag values from the 'tags' field across all documents that start with the prefix g*, so that these unique tags can be displayed by a tag suggester (the Stack Overflow site is an example).
For example, whenever the user types 'g':
"guava", "gulmohar", "grammar", "grapes" and "green" should be returned as a result.
I.e. the query should return distinct tags with the prefix g*.
I looked everywhere, browsed the whole documentation, and searched the ES forum, but I didn't find any clue, much to my dismay.
I tried aggregations, but aggregations return the distinct count for every word/token in the tags field. They do not return the unique list of tags starting with 'g'.
"query": {
"filtered": {
"query": {
"bool": {
"should": [
{
"query_string": {
"allow_leading_wildcard": false,
"fields": [
"tags"
],
"query": "g*",
"fuzziness":0
}
}
]
}
},
"filter": {
//some condition on other field...
}
}
},
"aggs": {
"distinct_tags": {
"terms": {
"field": "tags",
"size": 10
}
}
},
result of above: guava(w), apple(q), mango(1),...
Can someone please suggest the correct way to fetch all the distinct tags with the prefix input_prefix*?
It's a bit of a hack, but this seems to accomplish what you want.
I created an index and added your docs:
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
}
}
POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"tags":["guava","apple","mango", "banana", "gulmohar"]}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"tags": ["orange","guava", "mango shakes", "apple pie", "grammar"]}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"tags": ["guava","apple","grapes", "water", "grammar","gulmohar","water-melon", "green"]}
Then I used a combination of prefix query and highlighting as follows:
POST /test_index/_search
{
"query": {
"prefix": {
"tags": {
"value": "g"
}
}
},
"fields": [ ],
"highlight": {
"pre_tags": [""],
"post_tags": [""],
"fields": {
"tags": {}
}
}
}
...
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"highlight": {
"tags": [
"guava",
"gulmohar"
]
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 1,
"highlight": {
"tags": [
"guava",
"grammar"
]
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "3",
"_score": 1,
"highlight": {
"tags": [
"guava",
"grapes",
"grammar",
"gulmohar",
"green"
]
}
}
]
}
}
Here is the code I used:
http://sense.qbox.io/gist/c14675ee8bd3934389a6cb0c85ff57621a17bf11
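The highlights come back per document, so the same tag can appear several times. A minimal Python sketch of collecting the distinct tags on the client side is shown below. It assumes the response has already been parsed into a dict with the shape shown above (trimmed here to the relevant keys); the function name distinct_highlighted_tags is hypothetical.

```python
# Trimmed copy of the response above, keeping only the keys that are used.
response = {
    "hits": {
        "hits": [
            {"_id": "1", "highlight": {"tags": ["guava", "gulmohar"]}},
            {"_id": "2", "highlight": {"tags": ["guava", "grammar"]}},
            {"_id": "3", "highlight": {"tags": ["guava", "grapes", "grammar",
                                                "gulmohar", "green"]}},
        ]
    }
}


def distinct_highlighted_tags(resp: dict, field: str = "tags") -> list:
    """Collect highlighted values once each, in order of first appearance."""
    seen = {}
    for hit in resp["hits"]["hits"]:
        for tag in hit.get("highlight", {}).get(field, []):
            seen.setdefault(tag, None)  # dicts keep insertion order
    return list(seen)


print(distinct_highlighted_tags(response))
```

This only deduplicates what the search returned, so the result is limited to the tags of the hits included in the response page.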
What you're trying to do amounts to autocomplete, of course, and there are perhaps better ways of going about that than what I posted above (though they are a bit more involved). Here are a couple of blog posts we did about ways to set up autocomplete:
http://blog.qbox.io/quick-and-dirty-autocomplete-with-elasticsearch-completion-suggest
http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams
As per @Sloan Ahrens's advice, I did the following:
Updated the mapping:
"tags": {
"type": "completion",
"context": {
"filter_color": {
"type": "category",
"default": "",
"path": "fruits.color"
},
"filter_type": {
"type": "category",
"default": "",
"path": "fruits.type"
}
}
}
Reference: ES API Guide
Inserted these documents:
{
_id: "1",
tags: {"input": ["guava","apple","mango", "banana", "gulmohar"]},
fruits: {color: 'bar', type: 'alice'}
}
{
_id: "2",
tags: {"input": ["orange","guava", "mango shakes", "apple pie", "grammar"]},
fruits: {color: 'foo', type: 'bob'}
}
{
_id: "3",
tags: {"input": ["apple","grapes", "water", "gulmohar","water-melon", "green"]},
fruits: {color: 'foo', type: 'alice'}
}
I didn't need to modify my original documents much; I just added input before the tags array.
POST rescu1/_suggest?pretty
{
"suggest": {
"text": "g",
"completion": {
"field": "tags",
"size": 10,
"context": {
"filter_color": "bar",
"filter_type": "alice"
}
}
}
}
gave me the desired output.
I accepted @Sloan Ahrens's answer, as his suggestions worked like a charm for me and he showed me the right direction.

Term filter for boolean types does not return any results

I have some data in an index with the following mapping (this is just the relevant piece):
{
"content": {
"mappings" : {
"content": {
"properties": {
"published" : {
"type": "boolean"
}
}
}
}
}
}
When I query for everything using
GET content/content/_search
{}
I get back plenty of documents with published: true, but when I query using a term filter:
GET content/content/_search
{
"filter": {
"term": {
"published": true
}
}
}
I don't get any results. What's wrong with my term filter?
Weird, it works for me on ES 1.0:
I indexed a doc like this:
PUT /twitter/tweet/1
{
"bool":true
}
Here is my mapping:
GET /twitter/tweet/_mapping
{
"twitter": {
"mappings": {
"tweet": {
"properties": {
"bool": {
"type": "boolean"
}
}
}
}
}
}
I can search like this:
GET twitter/tweet/_search
{
"filter": {
"term": {
"bool": true
}
}
}
I got these results:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "twitter",
"_type": "tweet",
"_id": "1",
"_score": 1,
"_source": {
"bool": true
}
}
]
}
}
The problem was unrelated to querying... seems like my custom river was importing data incorrectly.

Resources