Elasticsearch case-insensitive query_string query with wildcards - elasticsearch

In my ES mapping I have an 'uri' field which is currently set to not_analysed and I'm not allowed to change the mapping.I wanted to search for uri parts with a query_string query like this (this ES query is autogenerated, that is why it is a bit complicated but let's just focus on the query_string part)
{
"sort": [{"updated": {"order": "desc"}}],
"query": {
"bool": {
"must":[{
"query_string": {
"query":"*w3\\.org\\/2014\\/01\\/a*",
"lowercase_expanded_terms": true,
"default_field": "uri"
}
}],
"minimum_number_should_match": 1
}
}, "size": 50}
Now it is usually working, but I've the following url stored (fictional url): http://w3.org/2014/01/Abc.html and this query does not bring it back because of the A-a difference. Setting the expanded terms to false also not solves this. What should I do for this query to be case insensitive?
Thanks for the help in advance.

From the docs, it seems like you need a new analyzer that first transforms to lowercase and then can run the search. Have you tried that?
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/sorting-collations.html
As I read it, your pattern, lowercase_expanded_terms, only applies to expansions, not to regular words
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
lowercase_expanded_terms
Whether terms of wildcard, prefix, fuzzy, and range queries are to be automatically lower-cased or not (since they are not analyzed). Default it true

Try to use match query instead of query string.
{
"sort": [
{
"updated": {
"order": "desc"
}
}
],
"query": {
"bool": {
"must": [
{
"match": {
"uri": "*w3\\.org\\/2014\\/01\\/a*"
}
}
]
}
},
"size": 50
}
Query string queries are not analyzed and but match queries are analyzed.

Related

Problem searching domain with elastic search

I have registered the following document
"ownDomainValue":"catalogonuevo1.com"
When I perform the following query the document is found, value is "catalogonuevo1"
[
{
"query": {
"bool": {
"filter": [
{
"term": {
"valor_dominio_propio": "catalogonuevo1"
}
}
]
}
},
"from": 0,
"size": 1
}
]
However, when the search value is "catalogonuevo1.com"
[
{
"query": {
"bool": {
"filter": [
{
"term": {
"valor_dominio_propio": "catalogonuevo1.com"
}
}
]
}
},
"from": 0,
"size": 1
}
]
it does not return any value, using MatchQueries the opposite happens, it always finds a wrong document, such as one with the value "catalogonuevo2.com" which is not what I am looking for since I need the search to be exact
It sounds like the problem is that the "term" query in Elasticsearch is not matching the exact value "catalogonuevo1.com" when it is included in the query.
This is likely because the "term" query is tokenizing the input string at the "." character, so it is matching on the token "catalogonuevo1" rather than the entire string "catalogonuevo1.com".
You can resolve this issue by using the "match_phrase" query instead of "term" query, as "match_phrase" query matches on the exact phrase rather than individual tokens.
Additionally, you can use keyword fields to store the domain values; this way, the values are not tokenized and the match phrase will work as expected.

difference between simple query string and multi match query

Hi I am using two search query which is giving similar result. what is difference between these two query simple query string and multi match?
1- simple_query_string
{
"size": 50,
"query": {
"bool": {
"should": [
{
"simple_query_string": {
"query": "text search",
"fields": [
"Field1^2",
"Field2^4",
"Field3^6",
"Field4^8",
"Field5^10",
"Field6^12",
"Field7^14",
"Field8^16",
"*^.1"
]
}
}
]
}
},
"sort": [
"_score",
{
"Field6.keyword": {
"order": "desc"
}
}
]
}
2- Multimatch query
GET index/_search
{
"size": 50,
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "text search",
"fields": [
"Field1^2",
"Field2^4",
"Field3^6",
"Field4^8",
"Field5^10",
"Field6^12",
"Field7^14",
"Field8^16",
"*^.1"
],
"type": "most_fields"
}
}
]
Both query gives same result in same order. Is there any advantage of any query ?
Both queries are the same as they will be converted to same query string. If you use query string your query will be slightly faster as you Elastic doesn't need to rewrite your query.
All queries in Lucene undergo a "rewriting" process. A query (and its sub-queries) may be rewritten one or more times, and the process continues until the query stops changing. This process allows Lucene to perform optimizations, such as removing redundant clauses, replacing one query for a more efficient execution path, etc. For example a Boolean → Boolean → TermQuery can be rewritten to a TermQuery, because all the Booleans are unnecessary in this case. The rewriting process is complex and difficult to display, since queries can change drastically. Rather than showing the intermediate results, the total rewrite time is simply displayed as a value (in nanoseconds). This value is cumulative and contains the total time for all queries being rewritten.
You can check your query performance and rewrite time by setting "profile": "true" in your query, for more information check official documentation of Elastic search here.

Elasticsearch Boolean query with Constant score wrapper

When using elasticsearch-7 I'm confused by es compound queries syntax.
Though reading es documents repeatedly but i just find standard syntax of Boolean or Constant score seperately.
As it illuminate,i understand what is 'query context' and what is 'filter context'.But when combining these two query type in a single query i don't know what it mean.
Let's see a example:
GET /classes_test/_search
{
"size": "21",
"query": {
"constant_score": {
"filter": {
"bool": {
"must": [
{
"match": {
"class_name": "29386556"
}
}
],
"should": [
{
"term": {
"master": "7033560"
}
},
{
"term": {
"assistant": "7033560"
}
},
{
"term": {
"students": "7033560"
}
}
],
"minimum_should_match": 1,
"must_not": [
{
"term": {
"class_id": 0
}
}
],
"filter": [
{
"term": {
"class_status": "1"
}
}
]
}
}
}
}
}
This query can be executed and response well.Each item in response content has a '_score' value with 1.0.
So,is it mean that the sub bool query as a entirety is in a filter context though it has a 'must' and 'should'?
Also i found boolean query can have a constant score sub query.
Why es allow these syntax but has no more words to explain?
If you use a constant_score query, you'll never get scores different than 1.0, unless you specify boost parameters in which case the score will match those.
If you need scoring you obviously need to ditch constant_score.
In your case, your match query on class_name cannot yield any other score than 1 or 0 since this is basically a yes/no filter, not a matching based on full-text search.
To sum up, all your query executes in a filter context (hence score 0 or 1) since you don't rely on full-text search. So you get scoring whenever you use full-text search, not because you use a match query. In your case, you can merge all must constraints into filter, it won't make any difference since you only have filters (yes/no matches) and no full-text search.

ElasticSearch must-terms does not return data

My ElasticSearch must-terms does not work, the data has clientId value "08d71bc7-c4ab-6e1d-f858-cf3448242e8b" but the result is empty. I am using elasticsearch:6.7.1. Do you know the problem here?
{
"from": 0,
"size": 20,
"query": {
"bool": {
"must": [
{ "terms": { "clientId": ["08d71bc7-c4ab-6e1d-f858-cf3448242e8b", "08d71bc7-c4ab-6e1d-f858-cf3448242e8c"] } },
{
"query_string": {
"query": "*d*",
"fields": ["name", "description", "title"]
}
},
{ "query_string": { "query": "1", "fields": ["type"] } }
]
}
}
}
I share sample data
I haven't worked enough with "query_string"... But if you don't put them and run your query, I'm sure it should at least give you some results. If so, your "query_string"s are the ones that are giving you this bad time
I first recommend you to use "filter" instead of "must".
Consider using the Regexp query your first "query_string". I found here how to query multiple fields with Regexp.
For the second, it would be enough to use "term" instead of "query_string".
Hope this is helpful! :D
The search results depends on the analysis type of clientId . If clientId is a 'keyword' your query should work as expected, but if the type of clientId is 'text' then the value might get tokenized to smaller parts (break at the dash).
You can check the clientId fields type in the index mappings, and also run the analyze API to check the tokenization: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-analyze.html

query_string vs group match in elasticsearch

What is the difference between such query:
"query": {
"bool": {
...
"should": [
{
"match": {
"description": {
"query": "test"
}
}
},
{
"match": {
"address": {
"query": "test",
}
}
},
{
"match": {
"country": {
"query": "test"
}
}
},
{
"match": {
"city": {
"query": "test"
}
}
}
]
}}
and that one:
"query": {
"bool": {
...
"should": [
{
"query_string": {
"query": "test",
"fields": [
"description",
"address",
"country",
"city"
]
}
}
]
}}
Performance, relevance?
Thanks in advance!
The query is analyzed depending on the field analyzer (unless you specify the analyzer in the query itself), thus querying multiple fields with a single query doesn't necessarily mean analyzing the query only once.
Keep in mind that the query_string supports the lucene query syntax: AND and OR operators, querying on specific fields, wildcard, phrase queries etc. therefore it needs to be parsed, which I don't think makes a lot of difference here in terms of performance, but it is error prone and might lead to errors. If you don't need all that power, stick to the match query, and if you want to perform the same query on multiple fields, have a look at the multi_match query, which does what you did with your query_string but translates internally to multiple match queries.
Also, the scores returned if you compare the output of multiple match queries and your query_string might be quite different. Using a bool query you effectively build a lucene boolean query, while the query_string uses by default "use_dis_max":"true", which means it uses internally a dis_max query by default. Same happens using the multi_match query. If you set use_dis_max to false a bool query is going to be used internally instead.
I terms of performance, I would say that the second query will have performance benefits because, the first query requires the query string to be analyzed for all the four match sections, while in the second there is only one query string that needs to be analyzed.
Apart from that, there are some comparisons done over here that you can look at.
I am not quite sure about the relevancy differences, but that you can always fire these two queries and see if there is any difference in relevance from the results fetched.

Resources