Why is my elasticsearch update_by_query timing out? - elasticsearch

I am using the JavaScript library and trying to perform a rather large update on documents. I have changed the timeout value on the query itself as well as on the client object, and I am still getting a timeout that seems to be 2 minutes. Below is the query.
const arrayOfStrings = [...]; // This array could be between 10-12k elements in size
await elasticClient.updateByQuery({
index: "main-index",
refresh: true,
conflicts:"proceed",
script: {
lang: "painless",
source: "ctx._source.status = \"1\"; ctx._source.entities = []; for(term in params.array) ctx._source.entities.add(term);",
params: {
"array": arrayOfStrings
}
},
query: {
"term": {
"parent_id": 'dgd39gd3-db3dg23879-d893gdg38-e23ed' // There could be upwards of 5000 documents that match this criteria
}
},
timeout: "5m"
});
Also, on the elastic client I have requestTimeout set to '5m' as well. I can understand why this particular query may be timing out, as it's applying a 12k-element array to a field on 5k documents. However, I don't understand why the query is timing out after only 2 minutes when the timeout values I have set are longer.

What you need to do here is run the update by query asynchronously:
await elasticClient.updateByQuery({
index: "main-index",
refresh: true,
conflicts:"proceed",
waitForCompletion: false, // <---- add this setting
// ...script, query and timeout as in the question
});
And then you can follow the progress of the task running asynchronously.
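Following the task could look roughly like the sketch below. It assumes the tasks API (GET _tasks/<task_id>) reports a boolean completed flag and, once finished, the update-by-query summary under response. The getTask callback, the function name and the polling defaults are hypothetical, introduced only for illustration, so that the helper does not depend on a particular client version.

```javascript
// Minimal sketch: poll a task until Elasticsearch reports it as completed.
// `getTask` is a hypothetical callback that wraps GET _tasks/<task_id> for
// your client version and resolves to the response body.
async function waitForTask(getTask, taskId, intervalMs = 5000, maxAttempts = 120) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await getTask(taskId);
    if (status.completed) {
      // For update-by-query, `response` holds the usual summary
      // (updated, version_conflicts, failures, ...).
      return status.response;
    }
    // Not finished yet: wait before asking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Task ${taskId} did not complete after ${maxAttempts} checks`);
}
```

With waitForCompletion: false the update-by-query call returns a task id instead of the summary. With a 7.x client, getTask might be something like (id) => elasticClient.tasks.get({ task_id: id }).then((r) => r.body), but check the response shape against the client version you actually run.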

Related

elasticsearch updateByQuery is not updating but returns 200

I am using the elasticsearch js client and I want to find all fields with attrs.tags == XXX and delete the value. The update returns 200 and 1 event updated, which is correct. But when I list all events I can still see attrs.tags with the old value. Why is it not working? Even if I wait 5 minutes to give Elasticsearch time to update, I still get the same result.
async function search() {
var tag = req.body.tags;
var client = connectToES(res);
const response = await client.updateByQuery({
index: "*",
type: '_doc',
body: {
"query": {
"match": {
"attrs.tags": tag
}
},
"script": { "inline": "ctx._source.attrs.tag = ''" }
}
});
client.close();
}
And here is elastic response:
{"took":34,"timed_out":false,"total":1,"updated":1,"deleted":0,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1,"throttled_until_millis":0,"failures":[]}
Your ES response looks OK: it says that exactly 1 document matched the criteria and that this document was updated. Can you cross-check your script and query to see whether they match all the documents you expect? If not, then you need to rewrite them so that they update all the documents matching your criteria.
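One thing worth cross-checking, judging only from the snippet in the question: the query matches on attrs.tags while the script assigns to attrs.tag (singular). That would set a different field and still be counted as an update. A minimal sketch that derives both the query and the script from the same field path follows. The function name is hypothetical, and fieldPath is interpolated into the script, so it must come from trusted code, not user input.

```javascript
// Build an update-by-query body whose query and script refer to the same field.
// `fieldPath` is interpolated into the Painless source, so only pass trusted values.
function buildClearFieldBody(fieldPath, value) {
  return {
    query: { match: { [fieldPath]: value } },
    script: {
      lang: "painless",
      // `source` is the newer name for what the question calls `inline`.
      source: `ctx._source.${fieldPath} = ''`,
    },
  };
}
```

The result could be passed as the body of the updateByQuery call in the question, for example buildClearFieldBody("attrs.tags", tag).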

Sorting ElasticSearch query by multiple fields

I have some data that I'm trying to sort in a very specific order.
I've looked over a few questions here on SO and Elasticsearch sort on multiple queries was pretty helpful. From what I can tell I'm getting the data back in the correct order but it's not always the same data and appears to be very random as to what is returned from the query.
My question is, how do I get my data sorted correctly and get the expected data each time?
Example Data
[
{
id: 00,
...
current_outage: {
device_id: 00,
....
},
forecasted_outages: [
{
device_id: 00
}
]
},
{
id: 01,
...
current_outage: {
device_id: 01,
....
},
forecasted_outages: []
},
{
id: 02,
...
current_outage: null,
forecasted_outages: [
{
device_id: 02
}
]
},
{
id: 03,
...
current_outage: null,
forecasted_outages: []
},
]
Current Query
bool: {
should: [
{
constant_score: {
boost: 6,
filter: {
nested: {
path: 'current_outage',
query: {
exists: {
field: 'current_outage'
}
}
}
}
}
},
{
nested: {
path: 'forecasted_outages',
query: {
exists: {
field: 'forecasted_outages'
}
}
}
}
]
}
Just to reiterate, the above query returns the data in the format/sorted method I expect but it does NOT return the data that I expect each time. The returned data is very random as far as I can tell.
Sort Criteria:
First: Data with both current_outage and one or more forecasted_outages
Second: Data with only current_outage
Third: Data with only forecasted_outages
Edit
The data returning can be anything from zero to thousands of results depending on a user. The user has an option to paginate the data or return all of their relevant data.
Edit 2
The data returned will be anywhere from zero to 1,000 hits.
If the search has more than 10 hits (the default result size) and all documents have the same score (which is likely in your case, since you are using constant_score), then the data returned can differ from run to run, which gives the impression of randomness.
The reason is that the search results are merged from the different shards until the hit count reaches 10, and the rest of the results are ignored. So each run can return different results depending on how the shards were merged.
Increasing the result size so that it includes all of the search results can give the same data on every run.
UPDATE
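Using a single shard might also help. Note that number_of_shards is fixed when an index is created and cannot be changed through _settings afterwards, even on a closed index, so this means creating a new index with one shard and reindexing into it:
PUT /twitter
{
"settings": {
"index": {
"number_of_shards": 1
}
}
}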
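Another common way to make tied scores come back in a repeatable order is to add a tiebreaker to the sort. This is a sketch, not part of the original answer: it assumes the documents carry a unique, sortable field (the id from the example data is used here as an assumption), and the function name is hypothetical.

```javascript
// Wrap an existing query with an explicit size and a deterministic sort:
// highest score first, then a unique field to break ties between equal scores.
function withStableSort(query, uniqueField, size) {
  return {
    size,
    query,
    sort: [
      { _score: { order: "desc" } },
      { [uniqueField]: { order: "asc" } },
    ],
  };
}
```

For the query in the question this would be called as withStableSort({ bool: { should: [ /* the two clauses */ ] } }, "id", 1000), with 1000 chosen because the question expects at most 1,000 hits.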

Can inclusion of specific fields change the elasticsearch result set?

I have an ES query that returns 414 documents if I exclude a specific field from results.
If I include this field, the document count drops to 328.
The documents that get dropped are consistent and this happens whether I scroll results or query directly.
The field map for the field that reduces the result set looks like this:
"completion": {
"type": "object",
"enabled": false
}
Nothing special to it and I have other "enabled": false object type fields that return just fine in this query.
I tested against multiple indexes with the same data to rule out corruption (I hope).
This 'completion' object is a nested and ignored object that has 4 or 5 levels of nesting but once again, I have other similarly nested objects that return just fine for this query.
The query is a simple terms match for 414 terms (yes, this is terrible, we are rethinking our strategy on this):
var { _scroll_id, hits } = await elastic.search({
index: index,
type: type,
body: shaQuery,
scroll: '10s',
_source_exclude: 'account,layout,surveydata,verificationdata,accounts,scores'
});
while (hits && hits.hits.length) {
// Append all new hits
allRecords.push(...hits.hits)
var { _scroll_id, hits } = await elastic.scroll({
scrollId: _scroll_id,
scroll: '10s'
})
}
The query is:
"query": {
"terms": {
"_id": [
"....",
"....",
"...."
]
}
}
In this example, I will only get back 328 results. If I add 'completion' to the _source_exclude then I get the full set back.
So, my question is: What are the scenarios where including a field in the result could limit the search when that field is totally unrelated to the search.
The #'s are specific to this example but consistent across queries. I just include them for context on the overall problem.
Also important is that this completion field has the same data and format across both included and excluded records, I can't see anything that would cause a problem.
The problem was found and it was obscure. What we saw was that it was always failing at the same point and when it was examined a little more closely, the same error was coming out:
{ took: 158,
timed_out: false,
_shards:
{ total: 5,
successful: 4,
skipped: 0,
failed: 1,
failures: [ [Object] ] },
[ { shard: 0,
index: 'theindexname',
node: '4X2vwouIRriYbQTQcHQ_sw',
reason:
{ type: 'illegal_argument_exception',
reason:
'cannot write xcontent for unknown value of type class java.math.BigInteger' } } ]
OK, well, that's strange, we are not using BigIntegers at all. But, thanks to the power of Google, this issue in the elasticsearch issue tracker was revealed:
https://github.com/elastic/elasticsearch/pull/32888
"XContentBuilder to handle BigInteger and BigDecimal", which addresses a bug in 6.3 where fields that used BigInteger and BigDecimal would fail to serialize and thus break when source filtering was applied. We were running 6.3.
It is unclear why our systems are triggering this issue but upgrading to 6.5 solved it entirely.
Obscure obscure obscure but solved thanks to Javier's persistence.
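What made this hard to spot is that a search with a failed shard can still come back as a normal response, only with fewer hits. A small guard like the sketch below makes such partial results fail loudly. The function name is hypothetical, and the response shape is the one shown in the error output above.

```javascript
// Throw if any shard failed, instead of silently accepting partial results.
function assertNoShardFailures(response) {
  const shards = response._shards || {};
  if (shards.failed > 0) {
    // Collect the human-readable reason from each failure entry, if present.
    const reasons = (shards.failures || []).map((f) => f.reason && f.reason.reason);
    throw new Error(
      `${shards.failed} of ${shards.total} shards failed: ${reasons.join("; ")}`
    );
  }
  return response;
}
```

In the scroll loop from the question, each search and scroll response could be passed through this check before its hits are appended to allRecords.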

Increasing 'view' counter of a document in an index everytime it gets queried explicitly using _id via _search endpoint

Say, I have an index called blog which has 10 documents called article. The article is a JSON with one of the property being views which is initialized to 0.
I was wondering if there's a good way of updating the views counter every time the document gets explicitly called via the _search endpoint using the document id, so that I can sort by views in my other queries.
Or would that be something that has to be taken care of at the application layer?
My feeble attempt at the query DSL so far:
let options = {
index: 'blog',
body: {
query: {
function_score: {
query: {
match: { _id: req.params.articleID }
},
"weight" : 2
,
score_mode: "sum"
,
script_score : {
script : {
inline: "(2 + doc['view'].value)"
}
}
}
},
}
};
I have been trying an inline script, but that would require me to send two separate requests: first search, and then update if found. I was wondering if I could do it in a single query, i.e. have the views counter increase by one automatically every time I query via _search.
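For reference, the second of those two requests can be a scripted update. Since _search only reads documents, the increment has to be sent separately by the application. The sketch below only builds the parameters for such an update call. It assumes the counter field is called views (the script in the attempt above reads view), uses the body style of the other snippets here, and the function name is hypothetical.

```javascript
// Build parameters for a scripted update that increments a counter in place.
// Assumes the document already has a numeric `views` field.
function buildViewIncrement(index, id) {
  return {
    index,
    id,
    // Retry a few times if another increment hits the same document concurrently.
    retry_on_conflict: 3,
    body: {
      script: {
        lang: "painless",
        source: "ctx._source.views += params.count",
        params: { count: 1 },
      },
    },
  };
}
```

After a successful lookup of req.params.articleID, the application could pass buildViewIncrement('blog', req.params.articleID) to the client's update method.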

How can I perform an Elasticsearch Multisearch with only suggesters?

I need to return suggestions from 4 separate suggesters, across two separate indices.
I am currently doing this by sending two separate requests to Elasticsearch (one for each index) and combining the results in my application. Obviously this does not seem ideal when the Multisearch API is available.
From playing with the Multisearch API I am able to combine these suggestion requests into one and it correctly retrieves results from all 4 completion suggesters from both indexes.
However, it also automatically performs a match_all query on the chosen indices. I can of course minimize the impact of this by setting searchType to count but the results are worse than the two separate curl requests.
It seems that no matter what I try I cannot prevent the Multisearch API from performing some sort of query over each index.
e.g.
{
index: 'users',
type: 'user'
},
{
suggest: {
users_suggest: {
text: term,
completion: {
size : 5,
field: 'users_suggest'
}
}
}
},
{
index: 'photos',
type: 'photo'
},
{
suggest: {
photos_suggest: {
text: term,
completion: {
size : 5,
field: 'photos_suggest'
}
}
}
}
A request like the above which clearly omits the {query:{} part of this multisearch request, still performs a match_all query and returns everything in the index.
Is there any way to prevent the query taking place so that I can simply get the combined completion suggesters results? Or is there another way to search multiple suggesters on multiple indices in one query?
Thanks in advance
Set size to 0 in the body of every request, so that no hits are returned, only suggestions:
{
"size": 0,
"suggest":{}
}
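Applied to the request from the question, the multisearch body could be assembled as in the sketch below. The function name and the targets structure are hypothetical. The index and field names are taken from the question, and each suggester is named after its field as it is there.

```javascript
// Build an msearch body: one header line and one body line per suggester.
// Every body sets size: 0 so that only suggestions are returned, not hits.
function buildSuggestMsearchBody(term, targets) {
  const body = [];
  for (const { index, field } of targets) {
    body.push({ index });
    body.push({
      size: 0,
      suggest: {
        [field]: { text: term, completion: { field, size: 5 } },
      },
    });
  }
  return body;
}
```

For the two indices in the question this would be called as buildSuggestMsearchBody(term, [{ index: 'users', field: 'users_suggest' }, { index: 'photos', field: 'photos_suggest' }]) and the result passed as the body of the msearch call.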
