Can inclusion of specific fields change the Elasticsearch result set?

I have an ES query that returns 414 documents if I exclude a specific field from results.
If I include this field, the document count drops to 328.
The documents that get dropped are consistent and this happens whether I scroll results or query directly.
The field map for the field that reduces the result set looks like this:
"completion": {
"type": "object",
"enabled": false
}
Nothing special about it, and I have other "enabled": false object-type fields that return just fine in this query.
I tested against multiple indexes with the same data to rule out corruption (I hope).
This 'completion' object is a nested and ignored object that has 4 or 5 levels of nesting but once again, I have other similarly nested objects that return just fine for this query.
The query is a simple terms match for 414 terms (yes, this is terrible, we are rethinking our strategy on this):
const allRecords = [];
var { _scroll_id, hits } = await elastic.search({
  index: index,
  type: type,
  body: shaQuery,
  scroll: '10s',
  _source_exclude: 'account,layout,surveydata,verificationdata,accounts,scores'
});
while (hits && hits.hits.length) {
  // Append all new hits
  allRecords.push(...hits.hits);
  var { _scroll_id, hits } = await elastic.scroll({
    scrollId: _scroll_id,
    scroll: '10s'
  });
}
The query is:
"query": {
"terms": {
"_id": [
"....",
"....",
"...."
}
}
}
In this example, I will only get back 328 results. If I add 'completion' to the _source_exclude then I get the full set back.
So, my question is: what are the scenarios where including a field in the result could limit the search, when that field is totally unrelated to the search?
The numbers are specific to this example but consistent across queries; I just include them for context on the overall problem.
Also important: this completion field has the same data and format across both included and excluded records. I can't see anything about it that would cause a problem.

The problem was found and it was obscure. What we saw was that it was always failing at the same point and when it was examined a little more closely, the same error was coming out:
{ took: 158,
  timed_out: false,
  _shards:
   { total: 5,
     successful: 4,
     skipped: 0,
     failed: 1,
     failures:
      [ { shard: 0,
          index: 'theindexname',
          node: '4X2vwouIRriYbQTQcHQ_sw',
          reason:
           { type: 'illegal_argument_exception',
             reason:
              'cannot write xcontent for unknown value of type class java.math.BigInteger' } } ] } }
OK, well, that's strange: we are not using BigIntegers at all. But thanks to the power of Google, this issue in the Elasticsearch issue tracker was revealed:
https://github.com/elastic/elasticsearch/pull/32888
"XContentBuilder to handle BigInteger and BigDecimal" which is a bug in 6.3 where fields that used BigInteger and BigDecimal would fail to serialize and thus break when source filtering was applied. We were running 6.3.
It is unclear why our systems are triggering this issue but upgrading to 6.5 solved it entirely.
Obscure obscure obscure but solved thanks to Javier's persistence.
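One lesson from this hunt: a partial shard failure does not reject the search promise, so the response resolves with whatever the healthy shards returned and the failure is easy to miss. A small guard like the sketch below (a hypothetical helper, not part of the client) surfaces it early, using the `_shards` block shaped like the output above:

```javascript
// Returns the failures array when any shard failed, else an empty array.
function getShardFailures(response) {
  const shards = response._shards || {};
  // failures is an array of { shard, index, node, reason } objects
  return shards.failed > 0 ? (shards.failures || []) : [];
}

// Example with a response shaped like the one shown above:
const sampleResponse = {
  took: 158,
  _shards: {
    total: 5, successful: 4, skipped: 0, failed: 1,
    failures: [{ shard: 0, reason: { type: 'illegal_argument_exception' } }]
  }
};
const failures = getShardFailures(sampleResponse);
if (failures.length) {
  console.log(`search partially failed: ${failures[0].reason.type}`);
}
```

Running a check like this after every search (or scroll page) would have flagged the missing 86 documents immediately instead of leaving a silently short result set.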

Related

Elasticsearch cannot index array of integers

Given mapping:
{
"mappings": {
"properties": {
"user_ids": {"type": "integer"}
}
}
}
Observations:
Index {"user_ids": 1}, and data will show up correctly
Index {"user_ids": "x"}, and error is thrown failed to parse field [user_ids] of type [integer] in document, indicating that mapping is working correctly
However, indexing {"user_ids": [1]} just clears the field, without throwing error.
Question:
Why does this happen and how can I index arrays of integers?
Extra:
I removed all settings config; it doesn't change anything.
I tried the keyword type; it doesn't change anything.
If relevant, I use the latest opensearch-py.
It's not clear what you mean by "clears the field". Indexing an array of integers works perfectly fine, as shown in the example below; I hope you are following the same requests.
PUT <index-name>/_doc/1
{
  "user_ids": [
    1, 2, 3
  ]
}
And get API returns, all the integers in the array.
GET <index-name>/_doc/1
"_source": {
  "user_ids": [
    1,
    2,
    3
  ]
}
Turns out it was my own error: I was using elasticsearch-head for quick checking of values, and it doesn't support displaying array values :/ Once I double-checked with queries, they came back correct.

Why is my elasticsearch update_by_query timing out?

I am using the JavaScript library and trying to perform a rather large update on documents. I have changed the timeout value on the query itself as well as on the client object, and I am still getting a timeout that seems to be 2 minutes. Below is the query.
const arrayOfStrings = [...]; // This array could be between 10-12k elements in size
await elasticClient.updateByQuery({
  index: "main-index",
  refresh: true,
  conflicts: "proceed",
  script: {
    lang: "painless",
    source: "ctx._source.status = \"1\"; ctx._source.entities = []; for (term in params.array) ctx._source.entities.add(term);",
    params: {
      "array": arrayOfStrings
    }
  },
  query: {
    "term": {
      "parent_id": 'dgd39gd3-db3dg23879-d893gdg38-e23ed' // There could be upwards of 5000 documents that match this criteria
    }
  },
  timeout: "5m"
});
Also, on the elastic client I have requestTimeout set to '5m' as well. I can understand why this particular query may be timing out, as it's applying a 12k-element array to a field on 5k documents. However, I don't understand why the query times out after only 2 minutes when I have longer timeout values set.
What you need to do here is to run the update by query asynchronously:
await elasticClient.updateByQuery({
  index: "main-index",
  refresh: true,
  conflicts: "proceed",
  waitForCompletion: false // <---- add this setting
  // ... script, query, etc. as before
});
And then you can follow the progress of the task running asynchronously.
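To make the polling concrete, here is a rough sketch. It assumes the client returns `{ task }` from updateByQuery when waitForCompletion is false and exposes the Tasks API as `tasks.get` (both true of the official JS client, though newer versions may wrap the result in a `body` property; adjust for your version). The helper name and parameters are my own:

```javascript
// Kicks off an update_by_query in the background, then polls the Tasks API
// until the task reports completion. Returns the final task status.
async function runAsyncUpdate(client, body, pollMs = 5000) {
  // With waitForCompletion: false the call returns a task id immediately
  // instead of blocking until every document is updated.
  const { task } = await client.updateByQuery({
    ...body,
    waitForCompletion: false
  });
  for (;;) {
    const status = await client.tasks.get({ taskId: task });
    if (status.completed) return status;
    // Task still running: wait a bit before asking again.
    await new Promise(resolve => setTimeout(resolve, pollMs));
  }
}
```

Because the work now runs server-side as a task, the client's 2-minute request timeout no longer matters; each individual call (the kickoff and each poll) returns quickly.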

Elasticsearch 7 number_format_exception for input value as a String

I have a field in the index with this mapping:
"sequence_number": {
  "type": "long",
  "copy_to": [
    "_custom_all"
  ]
}
and am using this search query:
POST /my_index/_search
{
  "query": {
    "term": {
      "sequence_number": {
        "value": "we"
      }
    }
  }
}
I am getting the error message:
,"index_uuid":"FTAW8qoYTPeTj-cbC5iTRw","index":"my_index","caused_by":{"type":"number_format_exception","reason":"For input string: \"we\""}}}]},"status":400}
at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:260) ~[elasticsearch-rest-client-7.1.1.jar:7.1.1]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:238) ~[elasticsearch-rest-client-7.1.1.jar:7.1.1]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:212) ~[elasticsearch-rest-client-7.1.1.jar:7.1.1]
at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1433) ~[elasticsearch-rest-high-level-client-7.1.1.jar:7.1.1]
at
How can I ignore number_format_exception errors, so that the query either returns nothing or ignores this filter in particular? Either is acceptable.
Thanks in advance.
What you are looking for is not possible. Ideally, you should have coerce enabled on your numeric fields so that your index doesn't contain dirty data.
The best solution is for the application that generates the Elasticsearch query to check for a NumberFormatException when searching on numeric fields, and reject the query before sending it, since your index doesn't contain dirty data in the first place.
Edit: Another interesting approach is to validate the data before inserting into ES using the Validate API, as suggested by @prakash. The only downside is that it adds another network call, but if your application is not latency-sensitive, it can be used as a workaround.
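The application-side guard could be as simple as the sketch below: reject any term value that cannot be parsed as an Elasticsearch long before building the query, so ES never sees it. The function names are mine, not part of any client:

```javascript
// True when the value is an integer string within the range of an ES
// "long" field (-2^63 .. 2^63 - 1); everything else is rejected.
function isValidLong(value) {
  const s = String(value).trim();
  if (!/^-?\d+$/.test(s)) return false; // not an integer at all
  const n = BigInt(s);
  return n >= -(2n ** 63n) && n <= 2n ** 63n - 1n;
}

// Returns the query body, or null when the input would have caused a
// number_format_exception (the caller can then skip the filter or
// return an empty result set).
function buildSequenceNumberQuery(value) {
  if (!isValidLong(value)) return null;
  return { query: { term: { sequence_number: { value } } } };
}
```

With this in place, an input like "we" yields null and the application can short-circuit instead of sending a query that returns a 400.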

How to use Elasticsearch nested queries by object key instead of object property

Following the Elasticsearch example in this article for a nested query, I noticed that it assumes the nested objects are inside an ARRAY and that queries are based on some object PROPERTY:
{
  nested_objects: [ <== array
    { name: "x", value: 123 },
    { name: "y", value: 456 } <== "name" property searchable
  ]
}
But what if I want nested objects to be arranged in key-value structure that gets updated with new objects, and I want to search by the KEY? example:
{
  nested_objects: { <== key-value, not array
    "x": { value: 123 },
    "y": { value: 456 }, <== how can I search by "x" and "y" keys?
    "..." <=== more arbitrary keys are added now and then
  }
}
Thank you!
You can try to do this using the query_string query, like this:
GET my_index/_search
{
  "query": {
    "query_string": {
      "query": "nested_objects.\\*.value:123"
    }
  }
}
It will try to match the value field of any sub-field of nested_objects.
Ok, so my final solution after some ES insights is as follows:
1. The fact that my object keys "x", "y", ... are arbitrary causes a mess in my index mapping. So generally speaking, it's not a good ES practice to plan this kind of structure... So for the sake of mappings, I resort to the structure described in the "Weighted tags" article:
{ "name":"x", "value":123 },
{ "name":"y", "value":456 },
...
This means that, when it's time to update the value of the sub-object named "x", I have a harder (and slower) time finding it: I first need to fetch the entire top-level object, traverse the sub-objects until I find the one named "x", and then update its value. Then I write the entire sub-object array back into ES.
The above approach also causes concurrency issues in case I have multiple processes updating the same index. ES has optimistic locking I can use to retry when needed, or I can queue updates and handle them serially.
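The read-modify-write cycle with optimistic locking might look like the sketch below. It leans on the `_seq_no` / `_primary_term` fields that Elasticsearch returns with every GET and the matching `if_seq_no` / `if_primary_term` write parameters; the helper name, retry policy, and document shape are my own assumptions:

```javascript
// Finds the sub-object with the given name, updates its value, and writes
// the whole array back. Retries when another writer won the race (HTTP 409).
async function updateNamedEntry(client, index, id, name, newValue, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const doc = await client.get({ index, id });
    const entry = doc._source.nested_objects.find(o => o.name === name);
    if (entry) entry.value = newValue;
    try {
      // if_seq_no / if_primary_term make the write fail with a version
      // conflict if the document changed since we read it.
      return await client.index({
        index, id,
        if_seq_no: doc._seq_no,
        if_primary_term: doc._primary_term,
        body: doc._source
      });
    } catch (err) {
      if (err.statusCode !== 409) throw err; // only retry version conflicts
    }
  }
  throw new Error(`update of ${id} still conflicting after ${maxRetries} retries`);
}
```

On a conflict the loop simply re-reads the document, so the update is applied to the latest version rather than clobbering a concurrent writer's change.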

How can perform an Elasticsearch Multisearch, with only suggesters?

I need to return suggestions from 4 separate suggesters, across two separate indices.
I am currently doing this by sending two separate requests to Elasticsearch (one for each index) and combining the results in my application. Obviously this does not seem ideal when the Multisearch API is available.
From playing with the Multisearch API I am able to combine these suggestion requests into one and it correctly retrieves results from all 4 completion suggesters from both indexes.
However, it also automatically performs a match_all query on the chosen indices. I can of course minimize the impact of this by setting searchType to count but the results are worse than the two separate curl requests.
It seems that no matter what I try I cannot prevent the Multisearch API from performing some sort of query over each index.
e.g.
{
  index: 'users',
  type: 'user'
},
{
  suggest: {
    users_suggest: {
      text: term,
      completion: {
        size: 5,
        field: 'users_suggest'
      }
    }
  }
},
{
  index: 'photos',
  type: 'photo'
},
{
  suggest: {
    photos_suggest: {
      text: term,
      completion: {
        size: 5,
        field: 'photos_suggest'
      }
    }
  }
}
A request like the above, which clearly omits the {query: {}} part of this multisearch request, still performs a match_all query and returns everything in the index.
Is there any way to prevent the query taking place so that I can simply get the combined completion suggesters results? Or is there another way to search multiple suggesters on multiple indices in one query?
Thanks in advance
Set "size": 0 so that no hits are returned, only suggestions:
{
  "size": 0,
  "suggest": {}
}
for every request.
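Putting that together with the request from the question, the msearch body might be built like this (index, suggester, and field names taken from the question; the helper is a sketch of mine, not client API):

```javascript
// Builds one header/body pair for the Multisearch API: "size": 0
// suppresses the implicit match_all hits, leaving only suggestions.
function suggestOnlySearch(index, suggesterName, field, text) {
  return [
    { index },
    {
      size: 0,
      suggest: { [suggesterName]: { text, completion: { size: 5, field } } }
    }
  ];
}

const term = 'jo'; // whatever the user typed
const body = [
  ...suggestOnlySearch('users', 'users_suggest', 'users_suggest', term),
  ...suggestOnlySearch('photos', 'photos_suggest', 'photos_suggest', term)
];
// then: await elastic.msearch({ body })
```

Each response in `responses` will then carry an empty hits list and the corresponding suggester's results.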