Elasticsearch collapse not working with search_after with single sort field and PIT - elasticsearch

I have an Elastic query that initially returns results. When I attempt the query again using search_after for paging, I am getting the error: Cannot use [collapse] in conjunction with [search_after] unless the search is sorted on the same field. Multiple sort fields are not allowed. So far as I can tell, I am sorting and collapsing using just a single field per_id. Is my query structured incorrectly or is there something else I need to do to get this query to run?
GET /_search
{
"query": {
"bool": {
"must": [{
"term": {
"pform": "iphone"
}
}]
}
},
"collapse": {
"field": "per_id"
},
"pit": {
"id": "g-ABCDDEFG12345678ABCDDEFG12345678==",
"keep_alive": "5m"
},
"sort": [
{"per_id": "asc"}
],
"search_after" : [
"ABCDDEFG12345678",
123456
]
}

I needed to exclude the tiebreaker from my search_after. This shouldn't cause duplicates because I am using a PIT and sorting on the collapse field, meaning duplicates shouldn't exist in my result set.
"search_after" : [
"ABCDDEFG12345678"
]
So I needed to remove the tiebreaker from the sort values returned by the previous result before passing them into the next request.
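As a rough client-side sketch of that step (the helper name `next_search_after` and the sample hit are hypothetical, not part of any Elasticsearch client), the sort values of the last hit can be trimmed to the explicitly requested sort fields before being reused:

```python
def next_search_after(last_hit, sort_field_count=1):
    """Return the search_after values for the next page.

    Keeps only the values for the explicitly requested sort fields and
    drops the trailing tiebreaker value returned alongside them.
    """
    sort_values = last_hit["sort"]
    return sort_values[:sort_field_count]


# Hypothetical last hit of the previous page, using the values from the question.
last_hit = {"_id": "1", "sort": ["ABCDDEFG12345678", 123456]}
print(next_search_after(last_hit))  # ['ABCDDEFG12345678']
```

The returned list can then be sent as the search_after value of the next request, which sorts and collapses on per_id only.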

Related

difference between simple query string and multi match query

Hi, I am using two search queries that give similar results. What is the difference between these two queries, simple_query_string and multi_match?
1- simple_query_string
{
"size": 50,
"query": {
"bool": {
"should": [
{
"simple_query_string": {
"query": "text search",
"fields": [
"Field1^2",
"Field2^4",
"Field3^6",
"Field4^8",
"Field5^10",
"Field6^12",
"Field7^14",
"Field8^16",
"*^.1"
]
}
}
]
}
},
"sort": [
"_score",
{
"Field6.keyword": {
"order": "desc"
}
}
]
}
2- Multimatch query
GET index/_search
{
"size": 50,
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "text search",
"fields": [
"Field1^2",
"Field2^4",
"Field3^6",
"Field4^8",
"Field5^10",
"Field6^12",
"Field7^14",
"Field8^16",
"*^.1"
],
"type": "most_fields"
}
}
]
}
}
}
Both queries give the same results in the same order. Is there any advantage to either query?
Both queries are the same, as they will be converted to the same query string. If you use the query string, your query will be slightly faster, as Elasticsearch doesn't need to rewrite your query.
All queries in Lucene undergo a "rewriting" process. A query (and its sub-queries) may be rewritten one or more times, and the process continues until the query stops changing. This process allows Lucene to perform optimizations, such as removing redundant clauses, replacing one query for a more efficient execution path, etc. For example a Boolean → Boolean → TermQuery can be rewritten to a TermQuery, because all the Booleans are unnecessary in this case. The rewriting process is complex and difficult to display, since queries can change drastically. Rather than showing the intermediate results, the total rewrite time is simply displayed as a value (in nanoseconds). This value is cumulative and contains the total time for all queries being rewritten.
You can check your query performance and rewrite time by setting "profile": true in your query. For more information, see the Profile API section of the official Elasticsearch documentation.
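To compare the two queries, the rewrite time can be read from the profiled response. The sketch below assumes the usual shape of a profiled search response (`profile.shards[].searches[].rewrite_time`); the helper name and the sample numbers are made up for illustration.

```python
def total_rewrite_time_ns(response):
    """Sum rewrite_time (nanoseconds) over all shards and searches of a profiled response."""
    total = 0
    for shard in response.get("profile", {}).get("shards", []):
        for search in shard.get("searches", []):
            total += search.get("rewrite_time", 0)
    return total


# Request body with profiling enabled (fields shortened from the question).
profiled_request = {
    "profile": True,
    "query": {
        "multi_match": {
            "query": "text search",
            "fields": ["Field1^2", "Field2^4"],
            "type": "most_fields",
        }
    },
}

# Made-up response fragment, only to show where the value lives.
sample_response = {
    "profile": {
        "shards": [
            {"id": "shard-0", "searches": [{"rewrite_time": 51443}]},
            {"id": "shard-1", "searches": [{"rewrite_time": 48000}]},
        ]
    }
}
print(total_rewrite_time_ns(sample_response))  # 99443
```

Running both query variants with profiling enabled and comparing these totals gives a direct measurement instead of a guess.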

How to rank ElasticSearch documents based on scores

I have an Elasticsearch index that contains thousands of documents; each document represents a user.
Each document has a set of fields (is_verified: boolean, country: string, is_creator: boolean). I also have another service that calls ES search to look up documents. How can I rank the retrieved documents based on those fields? For example, a verified user that matches should come before an unverified one.
Is there some kind of document scoring while indexing the documents? If yes, can I modify it based on my criteria?
What should I read to understand how to rank in Elasticsearch?
Thanks
I guess the sorting function mentioned by Mikael is pretty straightforward and should cover your use cases. Check the Elasticsearch documentation for more information on that.
But in case you want to do really fancy sorting, you could use a bool query with different boost values to set the desired relevancy for each matched field. I tried to come up with a real-life example, but honestly didn't find one. For the sake of completeness, the following snippet should give you an idea of how to achieve similar results as with the sort API (but still, I would prefer using sort).
GET /yourindexname/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"name": "Monica"
}
}
],
"should": [
{
"term": {
"is_verified": {
"value": true,
"boost": 2
}
}
},
{
"term": {
"is_creator": {
"value": true,
"boost": 2
}
}
}
]
}
}
}
Is there some kind of document scoring while indexing the documents? If yes, can I modify it based on my criteria?
I wouldn't assign a fixed score to a document while indexing, as the score should depend on the query. However, if you insist on having a predefined relevancy for each document, you could in theory add a field relevancy holding that value and use it later in the query for ordering:
GET /yourindexname/_search
{
"query" : {
"match" : {
"name": "Monica"
}
},
"sort" : [
{
"relevancy": {
"order": "desc"
}
},
"_score"
]
}
You can consider using the sort API inside your search queries. In the example below, we search on the field country and sort the results by the Boolean field is_verified. You can also add the other Boolean field inside the sort brackets.
GET /yourindexname/_search
{
"query" : {
"match" : {
"country": "Iceland"
}
},
"sort" : [
{
"is_verified": {
"order": "desc"
}
}
]
}
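A small sketch of how such a request body could be assembled on the client side, with verified users first, then creators, then relevance. The helper name `build_ranked_search` is hypothetical; the field names come from the question.

```python
def build_ranked_search(match_field, match_value, priority_fields):
    """Build a search body that sorts by the given fields (descending), then by _score."""
    # Descending order on a boolean field puts true before false.
    sort = [{field: {"order": "desc"}} for field in priority_fields]
    sort.append("_score")  # relevance breaks ties among equally ranked users
    return {"query": {"match": {match_field: match_value}}, "sort": sort}


body = build_ranked_search("country", "Iceland", ["is_verified", "is_creator"])
print(body["sort"])
```

The order of `priority_fields` decides which criterion wins: a field listed earlier is always compared first.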

How to perform search query on two different data types?

My query is very simple. To make it even simpler, let's say I only search on two fields, name (text) and age (long):
GET person_db/person/_search
{
"query": {
"bool": {
"should": [
{
"match_phrase_prefix": {
"name": "hank"
}
},
{
"match_phrase_prefix": {
"age": "hank"
}
}
],
"minimum_should_match": 1,
"boost": 1.0
}
}
}
If I search for "23", there is no problem: Elasticsearch knows how to convert it to a number and it won't fail. But if the search input is "john", I get error 400 with "reason": "failed to create query: {\n \"bool\....".
What should I do in this case?
I thought of converting the numeric values to strings before inserting them into ES, but I am trying to avoid that; I think ES should have a way to support this.
Appreciate it
This query works: (thanks to #jmlw)
{
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "alt",
"type": "phrase_prefix",
"fields": [
"name",
"taxid",
"providers.providerAddress.street"
],
"lenient": true
}
}
],
"minimum_should_match": 1,
"boost": 1.0
}
}
}
Without details of your documents, or your mappings, my first guess is that the age field is interpreted as a numeric field by Elasticsearch. Passing in anything other than a 'number' type, or something that can be converted into a number will cause the query to fail, with some exception reporting a failure to convert your string into a number.
With that said, you may try adding lenient: true to your match_phrase_prefix search term, which will allow Elasticsearch to ignore failures to convert to a numeric type, and remove that term from the search.
Another approach is to only allow users to query on multiple fields of the same type, or to have them specify which data they'd like to query in which field. For example: I'm a user, and I want to search for people where age is 23 and the name is John, instead of typing in 23 John or similar.
Otherwise, you may need to pre-process the query string, and split search terms and pass them into search clauses individually with lenient: true to attempt searching multiple terms in multiple fields with different data types.
You could also try using a different search type, like a multi_match, query_string, or simple_query_string as these will likely have more flexibility for what you are wanting to do.
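The pre-processing idea can be sketched as follows. This is only an illustration under assumptions: the helper name `build_person_query` is hypothetical, the field names name and age come from the question, and numeric terms are sent to age through a term clause rather than a prefix query.

```python
def build_person_query(user_input):
    """Split the input into terms and only query the numeric field with numeric terms."""
    should = []
    for term in user_input.split():
        # Every term is a candidate for the text field.
        should.append({"match_phrase_prefix": {"name": term}})
        try:
            age = int(term)
        except ValueError:
            continue  # not a number, so never send it to the long field
        should.append({"term": {"age": age}})
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


print(build_person_query("23 john"))
```

With this split, "john" never reaches the age field, so the conversion error described in the question cannot be triggered by that clause.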

Erratic search results from Elastic when sorting on a field

We just upgraded to Elasticsearch 2.3.1 (from 1.7) and we're getting strange search behavior that I can't explain. What seems to happen is that a search request containing a bool query and a sort clause is returning:
Documents that don't seem to match the given search terms in any way.
Wildly different estimates of the total number of matching documents on each request
A minimal example of a request with this behavior:
post pim_search_1/_search
{
"explain": false,
"track_scores": false,
"sort": [
{
"product_id": {
"order": "desc"
}
}
],
"query": {
"bool": {
"filter": [
{
"terms": {
"publication": [
"public"
]
}
},
{
"query_string": {
"query": "iphone",
"default_operator": "and"
}
}
]
}
}
}
So in this case, a query string for "iphone" returns no iPhones at all. Setting explain to true yields this for the documents that appear to have no matching terms at all:
"_explanation": {
"value": 0,
"description": "Failure to meet condition(s) of required/prohibited clause(s)",
"details": [
{
"value": 0,
"description": "no match on required clause (#ConstantScore(publication:public) #_all:iphone)",
So the document has no matching clauses, but it's still returned?
We've found two workarounds for this behavior:
Sort on _score or leave out the sort clause entirely. Sorting on anything else, like the field above or on _doc gives the wonky behavior.
Include track_scores : true on the request.
So it appears to have something to do with scoring and relevancy. But since we're sorting on a field of our own, we're not interested in relevancy or score. Without the workarounds, the max_score on the response is null and so is the _score of every document.
Is this behavior something that can be explained in any way, or should we be looking at cluster health/configuration/corruption? According to the cluster, its health is green and all shards for this index appear healthy. It's currently a small index with 3 shards (1 replica per shard) over 3 nodes.
Update
I've further investigated the issue and it seems cache related. Specifically, the fielddata cache for the _all field (I'm not very familiar with the internals of Elasticsearch, so please correct me if that's not a thing).
Steps to reproduce
I have a data set that reproduces the problem; leave a comment and I can send it to you.
Use the following query:
post pim_search_1/_search
{
"fields": [
"_all"
],
"explain": true,
"size": 100,
"sort": [
{
"product_id": {
"order": "desc"
}
}
],
"query": {
"bool": {
"must": [
{
"query_string": {
"default_field": "_all",
"query": "surface",
"default_operator": "and"
}
}
],
"filter": [
{
"terms": {
"publication": [
"public"
]
}
}
]
}
}
}
Execute the query. You're searching for "surface" in the query string here and this should result in 22 hits total. This is correct. Execute this query a bunch of times (this seems to matter for step 2).
Change the query string to "iphone". This will result in 22 hits still, even though the dataset contains only one item that should match. The _explanation also mentions that the found documents don't actually match, like my example above.
Execute this: post pim_search_1/_cache/clear
Execute the query again for "iphone". It should now only return 1 hit, which is correct. Also execute this one a bunch of times.
Execute the query again for "surface", this will now return only 1 hit and again the _explanation states that it didn't get a match on the resulting document.
Remove the sort clause from the query and everything appears normal. The same is true for including "track_scores" : true.
Instead of _cache/clear it also works to just restart the cluster.
I say it's related to the _all field because changing the default_field of the query_string to the primitive_name field (an analyzed field) results in the correct behavior. For this example, I've made _all a stored field (it isn't normally with us) and it's returned in the search results so you can inspect it (doesn't appear to contain anything weird).
The above was done on a single node cluster (my local PC) on Elasticsearch 2.3.5.
This Github question seems to be about the same issue as mine, but could not be reproduced at the time and was closed.
This has been fixed in Elasticsearch 2.4:
https://github.com/elastic/elasticsearch/pull/20196

How to sort fields in _source for a search?

I have a requirement where we are creating CSV from the search result of a particular query. The problem statement is:
There are certain questions with values stored in Elasticsearch. For example, A with a value of 1 or 0, B with a value of 1 or 0, and so on. When I retrieve the response, the questions (fields) in the _search response are not in order; for example, they come back as A,C,D,F,B. What we want is something added to the query so that the field list in _source is sorted, such as A,B,C,D, so that it can be mapped directly into CSV. Does Elasticsearch provide any API of this sort?
Naively, if you try to put
{
...
"query": {...},
"sort": [
{
"field_A": {
"order": "asc"
}
}
],
...
}
You will get an exception:
No mapping found for [field_A] in order to sort on
Luckily, Elasticsearch provides a mechanism to ignore fields that have no mapping and not sort by them:
{
...
"query": {...},
"sort": [
{
"field_A": {
"order": "asc",
"unmapped_type" : "long" // new here
}
}
],
...
}
The value of this parameter is used to determine what sort values to emit.
Read more here Ignoring Unmapped Fields | Elasticsearch Reference
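If the goal is only a stable column order in the CSV, one option is to order the keys on the client side after the search returns. This is a sketch under that assumption, not something the answer above prescribes; the helper name `hits_to_csv` and the sample hits are hypothetical.

```python
import csv
import io


def hits_to_csv(hits):
    """Write the _source of each hit as a CSV row with alphabetically sorted columns."""
    # Collect every field name seen in any hit, then sort for a stable column order.
    fieldnames = sorted({key for hit in hits for key in hit["_source"]})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    for hit in hits:
        writer.writerow(hit["_source"])  # missing fields are written as empty cells
    return buffer.getvalue()


hits = [
    {"_source": {"A": 1, "C": 0, "D": 1, "F": 0, "B": 1}},
    {"_source": {"B": 0, "A": 0, "C": 1}},
]
print(hits_to_csv(hits))
```

Because the column order is fixed by `fieldnames`, the order in which keys appear inside each _source no longer matters.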

Resources