Elasticsearch: get current running snapshot operation

Assume I need to automate restoring two or more snapshots to an Elasticsearch cluster.
I need to detect that the current snapshot operation has completed before making the next API call: _snapshot/<repository>/<snapshot>/_restore.
If I call it while a snapshot is still restoring, the cluster responds with a 503.
I tried to use the thread pool API to see the running snapshot operation:
curl -XGET 'http://127.0.0.1:9200/_cat/thread_pool?h=snapshot.active'
But it returns 0 anyway.
What is the proper way to get info about the currently running restore operation?
UPDATE:
An example of how I managed to get this working with Ansible:
- name: shell | restore latest snapshot
  uri:
    url: "http://127.0.0.1:9200/_snapshot/{{ es_snapshot_repository }}/snapshot_name/_restore"
    method: "POST"
    body: '{"index_settings":{"index.number_of_replicas": 0}}'
    body_format: json

- name: shell | get state of active recovering operations | log indices
  uri:
    url: "http://127.0.0.1:9200/_recovery?active_only"
    method: "GET"
  register: response
  until: "response.json == {}"
  retries: 6
  delay: 10

You can monitor the status of indices being restored using the Indices Recovery API.
The easiest way to do this is to look at the stage property:
init: Recovery has not started
index: Reading index meta-data and copying bytes from source to destination
start: Starting the engine; opening the index for use
translog: Replaying the transaction log
finalize: Cleanup
done: Complete
The active_only parameter returns info only about shards that are not in the done state:
http://127.0.0.1:9200/_recovery?active_only
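For reference, here is a minimal shell polling sketch along the same lines as the Ansible example above (assuming curl is available, the cluster listens on 127.0.0.1:9200, and the endpoint returns a bare {} once no shard recoveries are active; adjust the host and interval as needed):

# Sketch: loop until no shard recoveries are active.
until [ "$(curl -s 'http://127.0.0.1:9200/_recovery?active_only=true')" = "{}" ]; do
  sleep 10
done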

Related

Unknown processors type "resourcedetection" for "resourcedetection"

I'm running the OpenTelemetry Collector with the image ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector:0.58.0.
In config.yaml I have:
processors:
  batch:
  resourcedetection:
    detectors: [ env ]
    timeout: 2s
    override: false
The collector is deployed as a sidecar but it keeps failing with
collector server run finished with error: failed to get config: cannot unmarshal the configuration: unknown processors type "resourcedetection" for "resourcedetection" (valid values: [resource span probabilistic_sampler filter batch memory_limiter attributes])
Any idea what is causing this? I haven't found any relevant documentation or questions.
The Resource Detection Processor is part of the otelcol-contrib distro upstream, so you would need to use otel/opentelemetry-collector-contrib:0.58.0 (or the equivalent on your container registry of choice) for this processor to be available in your collector.
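As a quick local sanity check, here is a sketch that runs the contrib image named above against the same config.yaml; the mount path and --config flag below are assumptions, not taken from the question:

# Sketch: validate config.yaml against the contrib distribution locally.
docker run --rm \
  -v "$(pwd)/config.yaml:/etc/otel/config.yaml" \
  otel/opentelemetry-collector-contrib:0.58.0 \
  --config /etc/otel/config.yaml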

How to update loggingService of container.v1.cluster with deployment-manager

I want to set the loggingService field of an existing container.v1.cluster through deployment-manager.
I have the following config
resources:
- name: px-cluster-1
  type: container.v1.cluster
  properties:
    zone: europe-west1-b
    cluster:
      description: "dev cluster"
      initialClusterVersion: "1.13"
      nodePools:
      - name: cluster-pool
        config:
          machineType: "n1-standard-1"
          oauthScopes:
          - https://www.googleapis.com/auth/compute
          - https://www.googleapis.com/auth/devstorage.read_only
          - https://www.googleapis.com/auth/logging.write
          - https://www.googleapis.com/auth/monitoring
        management:
          autoUpgrade: true
          autoRepair: true
        initialNodeCount: 1
        autoscaling:
          enabled: true
          minNodeCount: 3
          maxNodeCount: 10
      ipAllocationPolicy:
        useIpAliases: true
      loggingService: "logging.googleapis.com/kubernetes"
      masterAuthorizedNetworksConfig:
        enabled: false
      locations:
      - "europe-west1-b"
      - "europe-west1-c"
When I try to run gcloud deployment-manager deployments update ..., I get the following error
ERROR: (gcloud.deployment-manager.deployments.update) Error in Operation [operation-1582040492957-59edb819a5f3c-7155f798-5ba37285]: errors:
- code: NO_METHOD_TO_UPDATE_FIELD
message: No method found to update field 'cluster' on resource 'px-cluster-1' of
type 'container.v1.cluster'. The resource may need to be recreated with the new
field.
The same succeeds if I remove loggingService.
Is there a way to update loggingService using deployment-manager without deleting the cluster?
The error NO_METHOD_TO_UPDATE_FIELD is due to updating "initialClusterVersion" when you issued the update call to GKE. This field is only used at cluster creation, and the type definition doesn't currently allow it to be updated later. So it should remain static at the original value (it has no effect on the deployment moving forward), or you can delete/comment out that line.
Even with that fixed, there is also no method to update the logging service; Deployment Manager doesn't have many update methods. Instead, try using the gcloud command to update the cluster directly. Keep in mind that you have to set the monitoring service together with the logging service, so the command would look like:
gcloud container clusters update px-cluster-1 --logging-service=logging.googleapis.com/kubernetes --monitoring-service=monitoring.googleapis.com/kubernetes --zone=europe-west1-b
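To confirm the change took effect, a describe call can print just the field in question (a sketch, reusing the zone and cluster name from the question):

# Sketch: print only the loggingService field of the cluster.
gcloud container clusters describe px-cluster-1 \
  --zone=europe-west1-b \
  --format='value(loggingService)'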

Can you change the interim message of Ansible's until retries loop?

In one of my playbooks I start a service and poll for its status before moving onto the next task like the following:
- name: Poll for service status
  uri:
    url: http://some-service/status
  register: response
  until: response.status == 200
  retries: 12
  delay: 10
This logs a message each time it queries the URL, which looks like:
FAILED - RETRYING: TASK: Poll for service status
Is there a way to customise this message, specifically to remove the word FAILED?
After grepping around, I found the "FAILED - RETRYING" message in the default output callback (read about callbacks here). That means you can change the callback in ansible.cfg to something that suits your needs, or make your own. You can even search the code for v2_runner_retry to see the various outputs.
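For a one-off run, the stdout callback can also be selected via an environment variable rather than editing ansible.cfg (a minimal sketch, reusing the same playbook invocation shown in the output below):

# Sketch: pick the 'oneline' stdout callback for a single playbook run.
ANSIBLE_STDOUT_CALLBACK=oneline ansible-playbook -i localhost, p.yml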
For instance, here's what stdout_callback=oneline returns. There are no "retrying" messages even at -vv. It still says "FAILED" but that's because it actually failed.
ansible-playbook -vvi localhost, p.yml
ansible-playbook 2.4.1.0
config file = /opt/app/ansible.cfg
configured module search path = ['/opt/app/library']
ansible python module location = /usr/local/lib/python3.5/dist-packages/ansible
executable location = /usr/local/bin/ansible-playbook
python version = 3.5.2 (default, Sep 14 2017, 22:51:06) [GCC 5.4.0 20160609]
Using /opt/app/ansible.cfg as config file
1 plays in p.yml
META: ran handlers
localhost | FAILED! => {"attempts": 3,"changed": false,"content": "","failed": true,"msg": "Status code was not [200]: Request failed: <urlopen error [Errno -2] Name or service not known>","redirected": false,"status": -1,"url": "http://some-service/status"}
Aside from setting it in ansible.cfg, the documentation implies it can be done in a role, though I have no idea how.
As @techraf said, the Ansible folks are pretty good at reviewing pull requests.

ElasticSearch: Curator concurrent snapshots

We're getting this message:
[2017-08-11T04:00:02,908][WARN ][r.suppressed ] path: /_snapshot/s3_currently/curator-20170811040002, params: {repository=s3_currently, wait_for_completion=true, snapshot=curator-20170811040002}
org.elasticsearch.snapshots.ConcurrentSnapshotExecutionException: [s3_currently:curator-20170811040002]a snapshot is already running
We've configured x-pack curator with two actions:
/home/curator/actions/currently.yml
---
actions:
  1:
    action: snapshot
    description: Create snapshot every 30 minutes.
    options:
      repository: s3_currently
      wait_for_completion: true
    filters:
    - filtertype: alias
      aliases: living
  2:
    action: delete_snapshots
    description: Remove recently snapshots
    options:
      repository: s3_currently
      retry_interval: 120
      retry_count: 3
    filters:
    - filtertype: count
      count: 48
And /home/curator/actions/currently-dev.yml:
---
actions:
  1:
    action: snapshot
    description: Create snapshot every hour for development.
    options:
      repository: s3_currently_dev
      wait_for_completion: true
    filters:
    - filtertype: alias
      aliases: living
  2:
    action: delete_snapshots
    description: Remove recently snapshots
    options:
      repository: s3_currently_dev
      retry_interval: 120
      retry_count: 3
    filters:
    - filtertype: count
      count: 24
We've added two cron jobs:
0 * * * * -> currently_dev
0,30 * * * * -> currently
Any ideas? It seems that Elasticsearch doesn't allow executing two concurrent snapshots, does it?
Elasticsearch does not allow for more than one snapshot to run at a time. The reason for this is that it is compelled to freeze the Lucene segments for the selected indices for the duration of the snapshot. It would be extremely taxing to the cluster to do this for multiple concurrent snapshots, not in terms of processing, but in terms of how it has to track all segments at all times. It must allow for new data to be indexed into new segments while others are locked/frozen for snapshotting. This could create a situation where there are too many open segments, which could deprive one or more nodes of needed memory resources. As a result, it's safer for Elasticsearch to only permit a single snapshot at a time.
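If you want to check from a script whether a snapshot is already in flight before kicking off the next Curator run, the snapshot status API lists currently running snapshots (a sketch; the host and how you act on the output are assumptions):

# Sketch: list snapshots currently in progress across all repositories.
curl -s 'http://127.0.0.1:9200/_snapshot/_status?pretty'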

Timeout on deleting a snapshot repository

I'm running Elasticsearch 1.7.5 with 19 nodes (12 data nodes).
I'm attempting to set up snapshots for backup and recovery, but I'm getting a 503 on creation and deletion of a snapshot repository.
curl -XDELETE 'localhost:9200/_snapshot/backups?pretty'
returns:
{
"error" : "RemoteTransportException[[masternodename][inet[/10.0.0.20:9300]][cluster:admin/repository/delete]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (delete_repository [backups]) within 30s]; ",
"status" : 503
}
I was able to adjust the query with master_timeout=10m, but I'm still getting a timeout. Is there a way to debug the cause of this request failing?
Performance on this call seems to be related to pending tasks with a higher priority.
https://discuss.elastic.co/t/timeout-on-deleting-a-snapshot-repository/69936/4
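To see whether higher-priority cluster-state updates are queued ahead of the repository deletion, the pending tasks API can be checked while the call is hanging (a sketch, using the same localhost address as in the question):

# Sketch: inspect queued cluster-state tasks that may be delaying the repository delete.
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'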