Kubernetes rolling update for Elasticsearch

I am performing a simple rolling update of the Elasticsearch image. The command I use is
kubectl set image deployment master-deployment elasticsearch={private registry}/elasticsearch:{tag}
However, Elasticsearch always gets an IOException after the rolling update.
Caused by: java.io.IOException: failed to read [id:60, legacy:false, file:/var/lib/elasticsearch/nodes/0/_state/global-60.st]
I have checked the directory /var/lib/elasticsearch/nodes/0/_state/. It has a global-10.st file present, but not global-60.st.
How can I make sure the image itself stays in sync with the files already present?

I think you should go with a StatefulSet and external storage (i.e. a PVC); don't store the data inside the pod.
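A minimal sketch of that approach; the StatefulSet name, labels, storage size and the headless Service are assumptions, and the image placeholders are the same as in the question:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
spec:
  serviceName: elasticsearch          # assumes a matching headless Service exists
  replicas: 1
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
      - name: elasticsearch
        image: {private registry}/elasticsearch:{tag}
        volumeMounts:
        - name: data
          mountPath: /var/lib/elasticsearch   # node state lives on the PVC, not in the container
  volumeClaimTemplates:                        # one PVC per replica, re-attached across image updates
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi

With this, a rolling image update replaces the pod but re-attaches the same PVC, so the _state files under /var/lib/elasticsearch survive the update.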

Related

Elasticsearch snapshot restore from S3 failed with RepositoryMissingException

I was able to create the repository successfully and list the snapshots, which suggested the repository couldn't actually be missing.
Yet the restore request failed with a RepositoryMissingException, with the following details:
shard has failed to be restored from the snapshot [lb-es-snapshots:snapshot-2022-01-09/gnA_ObsiRmOA-ydXfZWfbA] because of [failed shard on node [RsQQ6-L6R_6qTIJigizMXQ]: failed recovery, failure RecoveryFailedException[[api][0]: Recovery failed on {my-release-elasticsearch-data-0}{RsQQ6-L6R_6qTIJigizMXQ}{w3E20XKZTHyAvpI7XEogjQ}{my-release-elasticsearch-data-0.my-release-elasticsearch-data-hl.default.svc.cluster.local}{10.244.1.24:9300}{d}{xpack.installed=true, transform.node=false}]; nested: RepositoryMissingException[[lb-es-snapshots] missing]; ]
- manually close or delete the index [api] in order to retry to restore the snapshot again or use the reroute API to force the allocation of an empty primary shard
Is there a way to make sense of the error? The logs on the nodes show the same exception.
Answering my own question:
Before registering the repository, you need to manually add the S3 access key and secret key to the keystore on BOTH the first master node and the first data node (this is not mentioned in the ES documentation), and then reload the secure settings.
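A sketch of those steps, assuming the repository uses the default S3 client and the official image layout; the master pod name is hypothetical, so adjust names and paths to your deployment:

# add the S3 credentials to the keystore on both the first master and the first data node
kubectl exec -it my-release-elasticsearch-master-0 -- bin/elasticsearch-keystore add s3.client.default.access_key
kubectl exec -it my-release-elasticsearch-master-0 -- bin/elasticsearch-keystore add s3.client.default.secret_key
kubectl exec -it my-release-elasticsearch-data-0 -- bin/elasticsearch-keystore add s3.client.default.access_key
kubectl exec -it my-release-elasticsearch-data-0 -- bin/elasticsearch-keystore add s3.client.default.secret_key

# reload the secure settings on all nodes without a restart
curl -XPOST "http://localhost:9200/_nodes/reload_secure_settings"

# then delete (or close) the partially restored index and retry the restore
curl -XDELETE "http://localhost:9200/api"
curl -XPOST "http://localhost:9200/_snapshot/lb-es-snapshots/snapshot-2022-01-09/_restore"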

Azure Databricks - java.lang.IllegalStateException: dbfs:/mnt/delta_checkpoints/offsets/5 exists before writing

I have built a streaming pipeline with Spark Auto Loader.
The source folder is an Azure Blob Storage container.
We encountered a rare issue (we could not replicate it). Below is the exception message:
java.lang.IllegalStateException: dbfs:/mnt/delta_checkpoints/offsets/5 exists before writing
Please help with a resolution, as this looks like a known platform issue.
Please let me know if I need to attach the entire stack trace.

oc debug on a stateful set results in PVC errors

I'm running a stateful set in OpenShift 4.3 which does not start properly. I suspect permissions issues, but that's not directly relevant to the question. I'm having problems getting a debug container to start.
I run the command to create the stateful set and other relevant objects. The pod created for the stateful set (I'm only running one replica at the moment) crashes (which I expect). Then I issue the command oc debug statefulset/[ss-name] and I get an error saying that the primary container is invalid because * spec.containers[0].volumeMounts[0].name: Not found: "volume"
The volume does exist, though - it's called 'volume' and it creates successfully when I start up the stateful set.
I'm sure I'm just missing something when it comes to the creation of the debug pod, but I'm not sure what - I can't find anything on Google that suggests that I would need to create a separate PVC for the debug pod or anything. What am I missing?
Okay, I figured out the issue here. When you start up a debug pod, it's a standalone pod of its own and is NOT part of the stateful set. That's why it couldn't find the volume: the volume was created as part of the stateful set, and creating a debug pod creates nothing except the pod, with none of the other StatefulSet trappings.
I was able to start the debug pod by removing the section where it attempted to mount the volume, instead having that folder use the ephemeral storage that's local to the pod (since I didn't care what happened to the data on it anyway).
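For illustration, a sketch of the change described, as it would look in the debug pod's spec; the container name and mount path are hypothetical, only the volume handling matters:

spec:
  containers:
  - name: app                    # hypothetical container name
    volumeMounts:
    - name: scratch
      mountPath: /data           # hypothetical path that was previously backed by the volume named 'volume'
  volumes:
  - name: scratch
    emptyDir: {}                 # ephemeral storage local to the debug pod, replacing the StatefulSet-provisioned volume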

[Elasticsearch]: Unable to Recover Primary Shard

I'm using Elasticsearch version 2.3.5. I have to recover the complete data from the backup disks. Everything got recovered except 2 shards. While checking the logs, I found the following error.
ERROR:
Caused by: java.nio.file.NoSuchFileException: /data/<cluster_name>/nodes/0/indices/index_name/shard_no/index/_c4_49.liv
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
at java.nio.channels.FileChannel.open(FileChannel.java:287)
at java.nio.channels.FileChannel.open(FileChannel.java:335)
at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:81)
at org.apache.lucene.store.FileSwitchDirectory.openInput(FileSwitchDirectory.java:186)
at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:89)
at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:89)
at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:109)
at org.apache.lucene.codecs.lucene50.Lucene50LiveDocsFormat.readLiveDocs(Lucene50LiveDocsFormat.java:83)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:73)
at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:197)
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:99)
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:435)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:100)
at org.elasticsearch.index.engine.InternalEngine.createSearcherManager(InternalEngine.java:283)
... 12 more
Can anyone suggest why this is happening, or how I can skip this particular file?
Thanks in advance.
Unfortunately restoring Elasticsearch from a filesystem backup is not a reliable way to recover your data, and is expected to fail like this sometimes. You should always use snapshot and restore instead. Your version is rather old, but more recent versions include this warning in the docs (which also applies to your version):
WARNING: You cannot back up an Elasticsearch cluster by simply copying the data directories of all of its nodes. Elasticsearch may be making changes to the contents of its data directories while it is running; copying its data directories cannot be expected to capture a consistent picture of their contents. If you try to restore a cluster from such a backup, it may fail and report corruption and/or missing files. Alternatively, it may appear to have succeeded though it silently lost some of its data. The only reliable way to back up a cluster is by using the snapshot and restore functionality.
It is possible that the restore has silently lost data in other shards too, there's no way to tell. Assuming you don't also have a snapshot of the data held in the lost shards, the only way to recover it is to reindex it from its source.
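For reference, a minimal snapshot-and-restore sketch against the cluster's REST API; the repository name, location and snapshot name are placeholders, and the location must be listed under path.repo in elasticsearch.yml on every node:

# register a shared-filesystem snapshot repository
curl -XPUT "http://localhost:9200/_snapshot/my_backup" -d '{
  "type": "fs",
  "settings": { "location": "/mnt/es_backups" }
}'

# take a snapshot and wait for it to finish
curl -XPUT "http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"

# later, restore it (into an empty cluster, or after closing/deleting the affected indices)
curl -XPOST "http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore"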

Forbidden: field can not be less than previous value

I ran into a problem on Azure where Nexus didn't have enough disk space. Nexus failed to start due to this problem, so I extended the default PVC jenkins-x-nexus from 8GB to 20GB. The extension was successful and everything is running fine.
But if I now want to upgrade my jx platform (jx upgrade platform) I'm getting the following error:
The PersistentVolumeClaim "jenkins-x-nexus" is invalid: spec.resources.requests.storage: Forbidden: field can not be less than previous value'
How can this be resolved?
When you're doing a jx upgrade platform, the Helm values for the Nexus chart are populated from the default Nexus chart Helm values: https://github.com/jenkins-x-charts/nexus/blob/master/nexus/values.yaml
If you want to override them (and I guess you do, since you need to specify the correct size of the PVC, i.e. more than 8Gb), put your custom values in a myvalues.yaml file in the same directory where you're executing the upgrade, and then run jx upgrade platform again.
Please use https://jenkins-x.io/docs/managing-jx/old/config/#nexus as a reference.
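As an illustration only, here is what such a myvalues.yaml could look like; the exact key names are an assumption and should be verified against the linked default values.yaml and the jenkins-x docs above:

# myvalues.yaml - placed in the directory where 'jx upgrade platform' is run
# key names are illustrative; verify them against the chart's values.yaml
nexus:
  persistence:
    size: 20Gi   # must not be smaller than the already-expanded PVC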
