How to perform a search query on two different data types? - elasticsearch

My query is very simple. To make it even simpler, let's say I only search on two fields, name (text) and age (long):
GET person_db/person/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase_prefix": {
            "name": "hank"
          }
        },
        {
          "match_phrase_prefix": {
            "age": "hank"
          }
        }
      ],
      "minimum_should_match": 1,
      "boost": 1.0
    }
  }
}
If I search for "23" there is no problem: Elasticsearch knows how to convert it to a number and the query doesn't fail. But if the search input is "john", I get error 400 with "reason": "failed to create query: {\n \"bool\....".
What should I do in this case?
I thought of converting the numeric values to strings before inserting them into ES, but I'm trying to avoid that; I think ES should have a way to support this.
Appreciate it.
This query works (thanks to @jmlw):
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "alt",
            "type": "phrase_prefix",
            "fields": [
              "name",
              "taxid",
              "providers.providerAddress.street"
            ],
            "lenient": true
          }
        }
      ],
      "minimum_should_match": 1,
      "boost": 1.0
    }
  }
}

Without details of your documents, or your mappings, my first guess is that the age field is interpreted as a numeric field by Elasticsearch. Passing in anything other than a 'number' type, or something that can be converted into a number will cause the query to fail, with some exception reporting a failure to convert your string into a number.
With that said, you may try adding lenient: true to your match_phrase_prefix search term, which will allow Elasticsearch to ignore failures to convert to a numeric type and remove that term from the search.
Another approach is to only allow users to query on multiple fields of the same type, or to have them specify which data they'd like to query in which field. For example, a user who wants to find people aged 23 and named John would enter the age and the name separately instead of typing 23 John or similar.
Otherwise, you may need to pre-process the query string, split it into search terms, and pass them into search clauses individually with lenient: true, to attempt searching multiple terms in multiple fields with different data types.
You could also try using a different search type, like a multi_match, query_string, or simple_query_string as these will likely have more flexibility for what you are wanting to do.
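As a rough illustration of the pre-processing idea above, here is a Python sketch that splits the input into tokens and only adds an age clause for tokens that parse as integers, so Elasticsearch is never asked to convert "hank" into a number. Note this is a variant of the suggestion: it routes tokens on the client side instead of relying on lenient: true. The function name build_person_query is hypothetical, and the name and age fields are taken from the question.

```python
def build_person_query(user_input):
    """Build a bool query body; numeric tokens also search the numeric age field."""
    should = []
    for token in user_input.split():
        # Every token is tried as a prefix on the text field.
        should.append({"match_phrase_prefix": {"name": token}})
        try:
            # Only tokens that parse as integers are sent to the long field.
            should.append({"term": {"age": int(token)}})
        except ValueError:
            pass  # Non-numeric token: no age clause is added for it.
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


body = build_person_query("23 john")
```

The resulting body can be sent to the _search endpoint as-is; an empty input produces an empty should list, which you would probably want to handle before sending.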

Related

Elasticsearch: how to write bool query that will contain multiple conditions on the same token?

I have a field with a tokenizer that splits on dots.
On search, the value aaa.bbb will be split into two terms, aaa and bbb.
My question is how to write a bool query that contains multiple conditions on the same term.
For example, I want to get all docs where the field contains a term that matches a fuzzy search for gmail, but that same term must not contain gamil.
Here are some examples of what I want to achieve:
bmail // MATCH: it matches the fuzzy search and is not gamil
gamil.bmail // MATCH: the term bmail matches the fuzzy search and is not gamil
gamil // NO MATCH: it matches the fuzzy search but equals gamil
NOTE: the following query does NOT appear to work, because it looks as if a document is considered a hit when one term matches one condition and a second term matches the other.
{
  ...
  "body": {
    "query": {
      "bool": {
        "must": [
          {
            "fuzzy": {
              "my_field": {
                "value": "gmail",
                "fuzziness": 1,
                "max_expansions": 2100000000
              }
            }
          },
          {
            "bool": {
              "must_not": [
                {
                  "query_string": {
                    "default_field": "my_field",
                    "query": "*gamil*",
                    "analyzer": "keyword"
                  }
                }
              ]
            }
          }
        ]
      }
    }
  },
}
I ended up using highlighting: I execute the fuzzy (or any other) query and then programmatically filter the results by the returned highlight object.
Span queries might also be a good option if you don't need regular expressions, or if you can make sure you don't exceed the boolean query limit.
(see more details in the provided link)
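The client-side filtering step could look roughly like the Python sketch below. It assumes the search request asked for highlighting on my_field with the default <em> and </em> tags, so that each highlighted span is a term that matched the fuzzy query. The helper names keep_hit and filter_hits are made up for this example.

```python
import re

# Default Elasticsearch highlight tags; adjust if pre_tags/post_tags are customized.
HIGHLIGHT_RE = re.compile(r"<em>(.*?)</em>")


def keep_hit(hit, field="my_field", banned="gamil"):
    """Keep a hit if at least one highlighted term does not contain `banned`."""
    fragments = hit.get("highlight", {}).get(field, [])
    # Collect every highlighted term from every fragment of the field.
    terms = [t for frag in fragments for t in HIGHLIGHT_RE.findall(frag)]
    return any(banned not in term.lower() for term in terms)


def filter_hits(response):
    """Return only the hits whose highlights pass keep_hit."""
    return [h for h in response["hits"]["hits"] if keep_hit(h)]
```

A hit without any highlight for the field is dropped, because there is no matched term to inspect.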

ElasticSearch Ignoring words having one single letter

I'm a beginner in Elasticsearch. I have an application that uses Elasticsearch to look for ingredients in a given food or fruit.
I'm facing a problem with scoring when the user types, for example, "Vitamine d".
Elasticsearch gives the phrase "vitamine" the best score even though the phrase "Vitamine D" exists and should normally have the highest score.
It seems that if the second word ("d" in my case) is just one letter, Elasticsearch ignores it.
I tried another example, "vitamine b12", and got the correct score.
Here is the query that the application sends to the server:
{
  "from": 0,
  "size": 5,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "constNomFr": {
              "query": "vitamine d"
            }
          }
        }
      ],
      "should": [
        {
          "prefix": {
            "constNomFr": {
              "value": "vitamine d",
              "boost": 2
            }
          }
        }
      ]
    }
  },
  "_source": {
    "excludes": [
      "alimentDtos"
    ]
  }
}
What could I modify to make it work?
Thank you so much.
If you can identify your ingredients, I recommend indexing them in a separate "ingredients" field with its type set to keyword. This way you can use a term filter, and you can even run aggregations.
You may already have your documents indexed that way; in that case, if you are using the default mapping, just run your query against your_field_name.keyword.
If you don't have your ingredients indexed as an array, then you should take a look at the Elasticsearch analyzers to choose or build the right one.
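To make the .keyword suggestion concrete, here is a small Python sketch that builds a request body similar to the one in the question but boosts an exact match on the keyword sub-field. It assumes the index uses the default dynamic mapping, so that constNomFr.keyword exists; the function name build_ingredient_query is hypothetical. Keep in mind that a term query on a keyword field is exact and case-sensitive unless a normalizer is configured.

```python
def build_ingredient_query(text, field="constNomFr", size=5):
    """Match on the analyzed field and boost exact hits on its keyword sub-field."""
    # Assumption: the default dynamic mapping created a ".keyword" sub-field.
    keyword_field = field + ".keyword"
    return {
        "from": 0,
        "size": size,
        "query": {
            "bool": {
                "must": [{"match": {field: {"query": text}}}],
                "should": [
                    # Exact, untokenized comparison against the whole value.
                    {"term": {keyword_field: {"value": text, "boost": 2}}}
                ],
            }
        },
    }


body = build_ingredient_query("Vitamine D")
```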

Calculate counts of hits of several subqueries inside one query to Elasticsearch

I have 3 fields in a document that I need to match. I'd like to identify which of those 3 fields have any matches.
More specifically, I'd like to find out whether the given wildcard query matches only one field across the document set or matches several fields. If the wildcard query matches only, say, field1, then I can conclude that it is applicable to field1 only. If it matches two or three fields, then I cannot draw that conclusion and I'll wait for the user to enter more characters to narrow the search.
I've written the following query that matches all 3 fields:
{
  "query": {
    "bool": {
      "should": [
        {"wildcard": { "field1": "*R*" }},
        {"wildcard": { "field2": "*R*" }},
        {"wildcard": { "field3": "*R*" }}
      ]
    }
  },
  "size": 0
}
It returns the total count of all documents that have matches on any of those fields. Now I'd like to know if it's possible to receive 3 separate counts for each subquery. This can be achieved by sending 3 separate requests but I'd like to minimize the number of requests to elasticsearch.
I've tried bool and dis_max queries but could not find a solution.
UPDATE
Using named queries I've built the following query:
{
  "query": {
    "bool": {
      "should": [
        {"wildcard": { "field1": { "value": "*R*", "_name": "query1" }}},
        {"wildcard": { "field2": { "value": "*R*", "_name": "query2" }}},
        {"wildcard": { "field3": { "value": "*R*", "_name": "query3" }}}
      ]
    }
  },
  "size": 1
}
This query returns a single result with the best score. By default, the score is higher when more fields are matched in the same document. So if the found document was matched by two or three fields, it already answers my initial question. However, if the found document was matched by a single field, say, field1, it does not guarantee there are no other documents that are matched by field2 or field3, so it's still not a solution.
Do I have to send 3 requests to run searches over each field separately to solve my problem?
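One possible way to get the three counts in a single request, which is not taken from the thread, is a filters aggregation with one named filter per field. The Python sketch below builds such a request and reads the per-field doc_count values from the response. The aggregation name per_field and both function names are made up for this example, and the sample response shape should be checked against your Elasticsearch version.

```python
def build_count_request(pattern, fields=("field1", "field2", "field3")):
    """Build one request that counts wildcard matches separately for each field."""
    return {
        "size": 0,  # Only the aggregation result is needed, not the documents.
        "aggs": {
            "per_field": {
                "filters": {
                    # One named filter (bucket) per field.
                    "filters": {f: {"wildcard": {f: pattern}} for f in fields}
                }
            }
        },
    }


def fields_with_matches(response):
    """Return the sorted names of the fields whose bucket has at least one document."""
    buckets = response["aggregations"]["per_field"]["buckets"]
    return sorted(name for name, b in buckets.items() if b["doc_count"] > 0)
```

If fields_with_matches returns exactly one name, the wildcard applies to that field only; otherwise you would wait for more characters, as described above.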

Erratic search results from Elastic when sorting on a field

We just upgraded to Elasticsearch 2.3.1 (from 1.7) and we're getting strange search behavior that I can't explain. What seems to happen is that a search request containing a bool query and a sort clause is returning:
Documents that don't seem to match the given search terms in any way.
Wildly different estimates of the total number of matching documents on each request.
A minimal example of a request with this behavior:
POST pim_search_1/_search
{
  "explain": false,
  "track_scores": false,
  "sort": [
    {
      "product_id": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "publication": [
              "public"
            ]
          }
        },
        {
          "query_string": {
            "query": "iphone",
            "default_operator": "and"
          }
        }
      ]
    }
  }
}
So in this case, a query string for "iphone" returns no iPhones at all. Setting explain to true yields this for the documents that appear to have no matching terms at all:
"_explanation": {
"value": 0,
"description": "Failure to meet condition(s) of required/prohibited clause(s)",
"details": [
{
"value": 0,
"description": "no match on required clause (#ConstantScore(publication:public) #_all:iphone)",
So the document has no matching clauses, but it's still returned?
We've found two workarounds for this behavior:
Sort on _score or leave out the sort clause entirely. Sorting on anything else, like the field above or on _doc gives the wonky behavior.
Include track_scores : true on the request.
So it appears to have something to do with scoring and relevancy. But since we're sorting on a field of our own, we're not interested in relevancy or score. Without the workarounds, the max_score on the response is null and so is the _score of every document.
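For reference, the second workaround can be applied mechanically on the client side. The Python sketch below is only an illustration of that workaround for the affected versions: it copies a request body and enables track_scores whenever the body sorts on something other than _score. The helper name with_track_scores is hypothetical, and it assumes sort is given as a list, as in the request above.

```python
import copy


def with_track_scores(body):
    """Return a copy of a search body with track_scores enabled when it sorts on a field."""
    patched = copy.deepcopy(body)  # Leave the caller's request untouched.
    sort = patched.get("sort", [])

    def is_score(clause):
        # A sort clause is either a bare field name or a {field: options} dict.
        return clause == "_score" or (isinstance(clause, dict) and "_score" in clause)

    if sort and not all(is_score(clause) for clause in sort):
        patched["track_scores"] = True
    return patched
```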
Is this behavior something that can be explained in any way, or should we be looking at cluster health/configuration/corruption? According to the cluster, its health is green and all shards for this index appear healthy. It's currently a small index with 3 shards (1 replica per shard) over 3 nodes.
Update
I've further investigated the issue and it seems cache related. Specifically, the fielddata cache for the _all field (I'm not very familiar with the internals of Elasticsearch, so please correct me if that's not a thing).
Steps to reproduce
I have a data set that reproduces the problem; leave a comment and I can send it to you.
Use the following query:
POST pim_search_1/_search
{
  "fields": [
    "_all"
  ],
  "explain": true,
  "size": 100,
  "sort": [
    {
      "product_id": {
        "order": "desc"
      }
    }
  ],
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "default_field": "_all",
            "query": "surface",
            "default_operator": "and"
          }
        }
      ],
      "filter": [
        {
          "terms": {
            "publication": [
              "public"
            ]
          }
        }
      ]
    }
  }
}
Execute the query. You're searching for "surface" in the query string here and this should result in 22 hits total. This is correct. Execute this query a bunch of times (this seems to matter for step 2).
Change the query string to "iphone". This will result in 22 hits still, even though the dataset contains only one item that should match. The _explanation also mentions that the found documents don't actually match, like my example above.
Execute this: POST pim_search_1/_cache/clear
Execute the query again for "iphone". It should now only return 1 hit, which is correct. Also execute this one a bunch of times.
Execute the query again for "surface", this will now return only 1 hit and again the _explanation states that it didn't get a match on the resulting document.
Remove the sort clause from the query and everything appears normal. The same is true for including "track_scores" : true.
Instead of _cache/clear it also works to just restart the cluster.
I say it's related to the _all field because changing the default_field of the query_string to the primitive_name field (an analyzed field) results in the correct behavior. For this example, I've made _all a stored field (it isn't normally with us) and it's returned in the search results so you can inspect it (doesn't appear to contain anything weird).
The above was done on a single node cluster (my local PC) on Elasticsearch 2.3.5.
This GitHub question seems to be about the same issue as mine, but it could not be reproduced at the time and was closed.
This has been fixed in Elasticsearch 2.4:
https://github.com/elastic/elasticsearch/pull/20196

query_string vs group match in elasticsearch

What is the difference between a query like this:
"query": {
  "bool": {
    ...
    "should": [
      {
        "match": {
          "description": {
            "query": "test"
          }
        }
      },
      {
        "match": {
          "address": {
            "query": "test"
          }
        }
      },
      {
        "match": {
          "country": {
            "query": "test"
          }
        }
      },
      {
        "match": {
          "city": {
            "query": "test"
          }
        }
      }
    ]
}}
and this one:
"query": {
  "bool": {
    ...
    "should": [
      {
        "query_string": {
          "query": "test",
          "fields": [
            "description",
            "address",
            "country",
            "city"
          ]
        }
      }
    ]
}}
Performance, relevance?
Thanks in advance!
The query is analyzed depending on the field analyzer (unless you specify the analyzer in the query itself), thus querying multiple fields with a single query doesn't necessarily mean analyzing the query only once.
Keep in mind that query_string supports the Lucene query syntax: AND and OR operators, querying on specific fields, wildcards, phrase queries, etc. It therefore needs to be parsed, which I don't think makes much difference here in terms of performance, but it is error prone: input that is not valid query syntax can make the whole query fail. If you don't need all that power, stick to the match query; if you want to perform the same query on multiple fields, have a look at the multi_match query, which does what you did with your query_string but translates internally to multiple match queries.
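As a minimal sketch of that suggestion, the Python function below builds the multi_match counterpart of the query_string example from the question, using the same four fields. The function name build_multi_match is made up, and the surrounding bool/should wrapper simply mirrors the question.

```python
def build_multi_match(text, fields):
    """Build a bool/should body with a single multi_match clause over `fields`."""
    return {
        "query": {
            "bool": {
                "should": [
                    # The text is not parsed as Lucene syntax, unlike query_string.
                    {"multi_match": {"query": text, "fields": list(fields)}}
                ]
            }
        }
    }


body = build_multi_match("test", ["description", "address", "country", "city"])
```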
Also, the scores returned might be quite different if you compare the output of multiple match queries with that of your query_string. Using a bool query you effectively build a Lucene boolean query, while query_string uses "use_dis_max":"true" by default, which means it internally uses a dis_max query. The same happens with the multi_match query. If you set use_dis_max to false, a bool query is used internally instead.
In terms of performance, I would say that the second query will have performance benefits because the first query requires the query string to be analyzed for each of the four match sections, while in the second there is only one query string that needs to be analyzed.
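To see why the scores can differ, here is a simplified Python model of how the two combinations work for one document. It assumes the per-field scores are already known, and it ignores details such as the coordination factor that older Lucene versions applied to boolean queries. The function names and sample scores are invented for the illustration.

```python
def bool_should_score(scores):
    """A bool query with should clauses adds up the scores of the matching clauses."""
    return sum(scores)


def dis_max_score(scores, tie_breaker=0.0):
    """dis_max keeps the best clause score and adds tie_breaker times the others."""
    if not scores:
        return 0.0
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)


# Hypothetical per-field scores for a single document.
field_scores = [2.0, 1.0, 0.5]
print(bool_should_score(field_scores))   # 3.5
print(dis_max_score(field_scores))       # 2.0
print(dis_max_score(field_scores, 0.5))  # 2.75
```

So a document matching in many fields is rewarded more by the bool form, while dis_max mostly reflects its single best field.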
Apart from that, there are some comparisons done over here that you can look at.
I am not quite sure about the relevancy differences, but you can always fire these two queries and see if there is any difference in relevance in the results fetched.

Resources