Timeout on deleting a snapshot repository - elasticsearch

I'm running elasticsearch 1.7.5 w/ 19 nodes (12 data nodes).
I'm attempting to set up snapshots for backup and recovery, but am getting a 503 on creation and deletion of a snapshot repository.
curl -XDELETE 'localhost:9200/_snapshot/backups?pretty'
returns:
{
  "error" : "RemoteTransportException[[masternodename][inet[/10.0.0.20:9300]][cluster:admin/repository/delete]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (delete_repository [backups]) within 30s]; ",
  "status" : 503
}
I tried adjusting the request with master_timeout=10m and am still getting a timeout. Is there a way to debug why this request is failing?

The slowness of this call seems to be related to pending cluster-state tasks with a higher priority; the delete_repository event waits behind them until it times out.
https://discuss.elastic.co/t/timeout-on-deleting-a-snapshot-repository/69936/4
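To see what the master is busy with, you can inspect the pending cluster-state task queue; a minimal sketch, assuming the same localhost:9200 endpoint used above:

curl -XGET 'localhost:9200/_cluster/pending_tasks?pretty'
curl -XGET 'localhost:9200/_cat/pending_tasks?v'

Both calls list queued cluster-state updates with their priority and time in queue; long-running URGENT or HIGH tasks at the head of the queue are what keeps the delete_repository event from being processed within the timeout.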

Related

Elasticsearch Snapshot Failing in AWS, preventing upgrade

My incremental snapshots in Elasticsearch are now failing. I didn't touch anything and nothing seems to have changed, so I can't figure out what is wrong.
I checked my Snapshots by doing: GET _cat/snapshots/cs-automated?v&s=id and finding the details of a failed one:
GET _snapshot/cs-automated/adssd....
Which showed this stacktrace:
java.nio.file.NoSuchFileException: Blob object [YI-....] not found: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 21...; S3 Extended Request ID: zh1C6C0eRy....)
at org.elasticsearch.repositories.s3.S3RetryingInputStream.openStream(S3RetryingInputStream.java:92)
at org.elasticsearch.repositories.s3.S3RetryingInputStream.<init>(S3RetryingInputStream.java:72)
at org.elasticsearch.repositories.s3.S3BlobContainer.readBlob(S3BlobContainer.java:100)
at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.readBlob(ChecksumBlobStoreFormat.java:147)
at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:133)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.buildBlobStoreIndexShardSnapshots(BlobStoreRepository.java:2381)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1851)
at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:505)
at org.elasticsearch.snapshots.SnapshotShardsService.access$600(SnapshotShardsService.java:114)
at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:386)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractPrioritizedRunnable.doRun(ThreadContext.java:763)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
I don't know how to resolve this and I can no longer upgrade my index. I checked this page: Resolve snapshot error in .. but am still struggling. I've tried deleting a whole bunch of indices. I may try restoring an old snapshot. I also deleted some .opendis.. indices used for tracking ILM and a .lock index as well, but nothing is helping. Very annoying.
as requested in comments:
GET /_cat/repositories?v
id type
cs-automated s3
GET /_cat/snapshots/cs-automated produces heaps of Snapshots all of which are PARTIAL in their status:
2020-09-08t01-12-44.ea93d140-7dba-4dcc-98b5-180e7b9efbfa PARTIAL 1599527564 01:12:44 1599527577 01:12:57 13.4s 84 177 52 229
2021-02-04t08-55-22.8691e3aa-4127-483d-8400-ce89bbbc7ea4 PARTIAL 1612428922 08:55:22 1612428957 08:55:57 35s 208 793 31 824
2021-02-04t09-55-16.53444082-a47b-4739-8ff9-f51ec038cda9 PARTIAL 1612432516 09:55:16 1612432552 09:55:52 35.6s 208 793 31 824
2021-02-04t10-55-30.6bf0472f-5a6c-4ecf-94ba-a1cf345ee5b9 PARTIAL 1612436130 10:55:30 1612436167 10:56:07 37.6s 208 793 31 824
2021-02-04t11-......
The reason the snapshot ends in a PARTIAL state is that, due to some issue in the S3 repository, the YI-.... file is missing. This is a clear case of repository corruption.
java.nio.file.NoSuchFileException: Blob object [YI-....] not found:
The specified key does not exist. (Service: Amazon S3; Status Code:
404; Error Code: NoSuchKey; Request ID: 21...; S3 Extended Request ID:
zh1C6C0eRy....)
This kind of repository corruption is observed when the cluster is heavily loaded (JVM heap > 80% or CPU utilization > 80%) and a few nodes drop out of the cluster.
One way to fix the issue is to delete all the snapshots that refer to the index identified by "YI-....". This cleans up the S3 snapshot files for that index, and when you take a new snapshot everything starts afresh.
To be on the safer side, I would recommend contacting AWS support to fix this type of repository corruption.
Elasticsearch fixed a similar issue in version 7.8 and above: https://github.com/elastic/elasticsearch/issues/57198
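If you go the snapshot-deletion route, you can list every snapshot in the repository along with the indices it contains and then delete the ones that reference the affected index; a sketch with a placeholder name (the snapshot IDs from the listing above would go in place of <snapshot_name>):

curl -XGET 'localhost:9200/_snapshot/cs-automated/_all?pretty'
curl -XDELETE 'localhost:9200/_snapshot/cs-automated/<snapshot_name>'

Note that cs-automated is an AWS-managed repository, so deleting snapshots from it may be restricted; in that case the AWS support route above is the safer option.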

Elasticsearch indexing fails after successful Nutch crawl

I'm not sure why, but Nutch 1.13 is failing to index the data to ES (v2.3.3). It is crawling fine, but when it comes time to index to ES it's giving me this error message:
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
Right before that it has this:
elastic.bulk.close.timeout : elastic timeout for the last bulk in seconds. (default 600)
I'm not sure if the timeout has anything to do with the job failing.
I've run Nutch v1.10 many times with no problems but decided to upgrade now; I never had this error before upgrading.
EDIT:
After closer inspection of the error message:
Error running:
/home/david/tutorials/nutch/nutch-1.13/runtime/local/bin/nutch index -Delastic.server.url=http://localhost:9300/search-index/ searchcrawl//crawldb -linkdb searchcrawl//linkdb searchcrawl//segments/20170519125546
It seems to be failing there, on that particular segment. What does that mean? I only know the basics of how to use Nutch; I'm by no means an expert. Is it failing on a link?
Until Nutch 1.14 is out, you need to apply this patch https://github.com/apache/nutch/pull/156 and rebuild:
cd apache-nutch-1.13
wget https://raw.githubusercontent.com/apache/nutch/e040ace189aa0379b998c8852a09c1a1a2308d82/src/java/org/apache/nutch/indexer/CleaningJob.java
mv CleaningJob.java src/java/org/apache/nutch/indexer/.
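The patch only takes effect after rebuilding Nutch; a minimal sketch of that step, assuming the standard Ant-based source build used by the Nutch tutorial:

ant runtime

This regenerates runtime/local, which contains the bin/nutch script shown in the failing command above.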

kibana unable to discover - (Shard Failures) / Error: indexPattern.fields is undefined

Kibana is unable to initialize when starting; it shows the misleading exception "Shard Failures" without any details.
But when digging into the browser console, the following logs have been written:
"INFO: 2016-11-25T13:41:59Z
Adding connection to https://monitoring.corp.com/elk-kibana/elasticsearch
" kibana.bundle.js:63741:6
config initcommons.bundle.js:62929
complete in 459.08ms commons.bundle.js:62925:12
loading default index patterncommons.bundle.js:62929
Index Patterns: index pattern set to logstash-* commons.bundle.js:8926:17
complete in 125.70ms commons.bundle.js:62925:12
Error: indexPattern.fields is undefined
isSortable#https://monitoring.corp.com/elk-kibana/bundles/kibana.bundle.js?v=9732:85441:8
getSort#https://monitoring.corp.com/elk-kibana/bundles/kibana.bundle.js?v=9732:85448:47
__WEBPACK_AMD_DEFINE_RESULT__</getSort.array#https://monitoring.corp.com/elk-kibana/bundles/kibana.bundle.js?v=9732:85463:15
getStateDefaults#https://monitoring.corp.com/elk-kibana/bundles/kibana.bundle.js?v=9732:85015:16
__WEBPACK_AMD_DEFINE_RESULT__</<#https://monitoring.corp.com/elk-kibana/bundles/kibana.bundle.js?v=9732:85009:47
invoke#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:31569:15
$ControllerProvider/this.$get</</instantiate<#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:36227:25
nodeLinkFn#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:35339:37
compositeLinkFn#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:34771:14
publicLinkFn#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:34646:31
ngViewFillContentFactory/<.link#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:57515:8
invokeLinkFn#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:35880:10
nodeLinkFn#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:35380:12
compositeLinkFn#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:34771:14
publicLinkFn#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:34646:31
createBoundTranscludeFn/boundTranscludeFn#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:34790:17
controllersBoundTransclude#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:35407:19
update#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:57465:26
$RootScopeProvider/this.$get</Scope.prototype.$broadcast#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:43402:16
commitRoute/<#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:57149:16
processQueue#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:41836:29
scheduleProcessQueue/<#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:41852:28
$RootScopeProvider/this.$get</Scope.prototype.$eval#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:43080:17
$RootScopeProvider/this.$get</Scope.prototype.$digest#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:42891:16
$RootScopeProvider/this.$get</Scope.prototype.$apply#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:43188:14
done#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:37637:37
completeRequest#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:37835:8
requestLoaded#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:37776:10
<div class="application ng-scope" ng-class="'tab-' + chrome.getActiveTabId('-none-') + ' ' + chrome.getApplicationClasses()" ng-view="" ng-controller="chrome.$$rootControllerConstruct as kibana"> commons.bundle.js:39568:19
Error: Request to Elasticsearch failed: "Bad Request"
KbnError#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:62016:21
RequestFailure#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:62049:6
__WEBPACK_AMD_DEFINE_RESULT__</</</<#https://monitoring.corp.com/elk-kibana/bundles/kibana.bundle.js?v=9732:88628:16
processQueue#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:41836:29
scheduleProcessQueue/<#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:41852:28
$RootScopeProvider/this.$get</Scope.prototype.$eval#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:43080:17
$RootScopeProvider/this.$get</Scope.prototype.$digest#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:42891:16
$RootScopeProvider/this.$get</Scope.prototype.$apply#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:43188:14
done#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:37637:37
completeRequest#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:37835:8
requestLoaded#https://monitoring.corp.com/elk-kibana/bundles/commons.bundle.js?v=9732:37776:10
commons.bundle.js:39568:19
I'm aware of the https://github.com/elastic/kibana/issues/6460 issue, but we don't have any signs of an entity which is too large.
I also already recreated the index pattern (deleting and creating it), without luck.
However, when going into "Settings" > "Index pattern", where the fields are shown, and then going back to Discover, Kibana seems to work again (until the next browser refresh). Any ideas how to fix Kibana?
Kibana version: 4.4.2
Elasticsearch version: 2.2.0
Increasing the server.maxPayloadBytes property in the kibana.yml file to an appropriate size solved the issue.
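For reference, server.maxPayloadBytes defaults to 1048576 bytes (1 MB) in recent Kibana versions; a minimal sketch of the change in kibana.yml, with the new value picked arbitrarily large enough to hold a big field list:

# kibana.yml
server.maxPayloadBytes: 4194304

Restart Kibana after editing the file so the new limit takes effect.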

Elasticsearch: get current running snapshot operation

Assume I need to automate restoring two or more snapshots to an Elasticsearch cluster.
It is necessary to detect that the restore operation has completed before the next API call: _snapshot/<repository>/<snapshot>/_restore.
If I make the call while a snapshot is still restoring, the cluster responds with a 503.
I tried to use the thread pool API to look for a running snapshot operation:
curl -XGET 'http://127.0.0.1:9200/_cat/thread_pool?h=snapshot.active'
But it returns 0 anyway.
What is the proper way to get info about the currently running restore operation?
UPDATE:
An example of how I managed to get it to work with Ansible:
- name: shell | restore latest snapshot
  uri:
    url: "http://127.0.0.1:9200/_snapshot/{{ es_snapshot_repository }}/snapshot_name/_restore"
    method: "POST"
    body: '{"index_settings":{"index.number_of_replicas": 0}}'
    body_format: json

- name: shell | get state of active recovering operations | log indices
  uri:
    url: "http://127.0.0.1:9200/_recovery?active_only"
    method: "GET"
  register: response
  until: "response.json == {}"
  retries: 6
  delay: 10
You can monitor the status of indices being restored using the Indices Recovery API.
The easiest way of doing this is to look at the stage property:
init: Recovery has not started
index: Reading index meta-data and copying bytes from source to destination
start: Starting the engine; opening the index for use
translog: Replaying transaction log
finalize: Cleanup
done: Complete
The active_only parameter returns info only about shards that are not yet in the done state:
http://127.0.0.1:9200/_recovery?active_only
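Outside of Ansible, the same wait can be expressed as a small shell loop; a minimal sketch assuming the default endpoint and an arbitrary 10-second polling interval:

# Poll until no shard recoveries are active; _recovery?active_only returns {} once everything is done.
while [ "$(curl -s 'http://127.0.0.1:9200/_recovery?active_only=true')" != "{}" ]; do
  sleep 10
done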

Logstash error message when using ElasticSearch output=>"Failed to flush outgoing items"

I'm using ES 1.4.4, LS 1.5 and Kibana 4 on Debian.
I start Logstash and it works fine for a couple of minutes, then I get a fatal error.
In order to shut down Logstash I have to delete the recent data stored in ES; that's the only way I found.
One more relevant fact is that Elasticsearch looks OK: I can see old data in Kibana and the head plugin works fine.
My output config: output { elasticsearch { port => 9200 protocol => http host => "127.0.0.1" } }
Any help will be appreciated :)
Here is the full error message:
Got error to send bulk of actions to elasticsearch server at 127.0.0.1 : Read timed out {:level=>:error}
Failed to flush outgoing items {:outgoing_count=>1362, :exception=>#, :backtrace=>[
"/opt/logstash/vendor/bundle/jruby/1.9/gems/manticore-0.3.5-java/lib/manticore/response.rb:35:in `initialize'",
"org/jruby/RubyProc.java:271:in `call'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/manticore-0.3.5-java/lib/manticore/response.rb:61:in `call'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/manticore-0.3.5-java/lib/manticore/response.rb:224:in `call_once'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/manticore-0.3.5-java/lib/manticore/response.rb:127:in `code'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/elasticsearch-transport-1.0.7/lib/elasticsearch/transport/transport/http/manticore.rb:50:in `perform_request'",
"org/jruby/RubyProc.java:271:in `call'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/elasticsearch-transport-1.0.7/lib/elasticsearch/transport/transport/base.rb:187:in `perform_request'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/elasticsearch-transport-1.0.7/lib/elasticsearch/transport/transport/http/manticore.rb:33:in `perform_request'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/elasticsearch-transport-1.0.7/lib/elasticsearch/transport/client.rb:115:in `perform_request'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/elasticsearch-api-1.0.7/lib/elasticsearch/api/actions/bulk.rb:80:in `bulk'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.1.18-java/lib/logstash/outputs/elasticsearch/protocol.rb:82:in `bulk'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.1.18-java/lib/logstash/outputs/elasticsearch.rb:413:in `submit'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.1.18-java/lib/logstash/outputs/elasticsearch.rb:412:in `submit'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.1.18-java/lib/logstash/outputs/elasticsearch.rb:438:in `flush'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.1.18-java/lib/logstash/outputs/elasticsearch.rb:436:in `flush'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:219:in `buffer_flush'",
"org/jruby/RubyHash.java:1341:in `each'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:216:in `buffer_flush'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:193:in `buffer_flush'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:159:in `buffer_receive'",
"/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.1.18-java/lib/logstash/outputs/elasticsearch.rb:402:in `receive'",
"/opt/logstash/lib/logstash/outputs/base.rb:88:in `handle'",
"(eval):1070:in `initialize'",
"org/jruby/RubyArray.java:1613:in `each'",
"org/jruby/RubyEnumerable.java:805:in `flat_map'",
"(eval):1067:in `initialize'",
"org/jruby/RubyProc.java:271:in `call'",
"/opt/logstash/lib/logstash/pipeline.rb:279:in `output'",
"/opt/logstash/lib/logstash/pipeline.rb:235:in `outputworker'",
"/opt/logstash/lib/logstash/pipeline.rb:163:in `start_outputs'"], :level=>:warn}
Your Elasticsearch has run out of storage and is unable to write new documents coming from Logstash. Try deleting old indices and then:
PUT your_index/_settings
{
  "index": {
    "blocks.read_only": false
  }
}
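The same setting can also be applied with curl, using your_index as a placeholder for the affected index (or _all to target every index):

curl -XPUT 'localhost:9200/your_index/_settings' -d '{"index": {"blocks.read_only": false}}'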
I hope this will work for you. Thanks!
