Elasticsearch Snapshot Failing in AWS, preventing upgrade

My incremental snapshots in Elasticsearch are now failing. I didn't touch anything and nothing seems to have changed, so I can't figure out what is wrong.
I checked my snapshots by running GET _cat/snapshots/cs-automated?v&s=id and then looked up the details of a failed one:
GET _snapshot/cs-automated/adssd....
This showed the following stack trace:
java.nio.file.NoSuchFileException: Blob object [YI-....] not found: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 21...; S3 Extended Request ID: zh1C6C0eRy....)
at org.elasticsearch.repositories.s3.S3RetryingInputStream.openStream(S3RetryingInputStream.java:92)
at org.elasticsearch.repositories.s3.S3RetryingInputStream.<init>(S3RetryingInputStream.java:72)
at org.elasticsearch.repositories.s3.S3BlobContainer.readBlob(S3BlobContainer.java:100)
at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.readBlob(ChecksumBlobStoreFormat.java:147)
at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:133)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.buildBlobStoreIndexShardSnapshots(BlobStoreRepository.java:2381)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1851)
at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:505)
at org.elasticsearch.snapshots.SnapshotShardsService.access$600(SnapshotShardsService.java:114)
at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:386)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractPrioritizedRunnable.doRun(ThreadContext.java:763)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
I don't know how to resolve this, and I can no longer upgrade my index. I checked this page: Resolve snapshot error in .. but I'm still struggling. I've tried deleting a whole bunch of indices. I may try restoring an old snapshot. I also deleted some .opendis.. indices used for tracking ILM, and a .lock index as well, but nothing is helping. Very annoying.
As requested in the comments:
GET /_cat/repositories?v
id type
cs-automated s3
GET /_cat/snapshots/cs-automated produces heaps of snapshots, all of which have PARTIAL status:
2020-09-08t01-12-44.ea93d140-7dba-4dcc-98b5-180e7b9efbfa PARTIAL 1599527564 01:12:44 1599527577 01:12:57 13.4s 84 177 52 229
2021-02-04t08-55-22.8691e3aa-4127-483d-8400-ce89bbbc7ea4 PARTIAL 1612428922 08:55:22 1612428957 08:55:57 35s 208 793 31 824
2021-02-04t09-55-16.53444082-a47b-4739-8ff9-f51ec038cda9 PARTIAL 1612432516 09:55:16 1612432552 09:55:52 35.6s 208 793 31 824
2021-02-04t10-55-30.6bf0472f-5a6c-4ecf-94ba-a1cf345ee5b9 PARTIAL 1612436130 10:55:30 1612436167 10:56:07 37.6s 208 793 31 824
2021-02-04t11-......

The reason the snapshot ends in the PARTIAL state is that, due to some issue in the S3 repository, the blob YI-.... is missing. This is a clear case of repository corruption.
java.nio.file.NoSuchFileException: Blob object [YI-....] not found:
The specified key does not exist. (Service: Amazon S3; Status Code:
404; Error Code: NoSuchKey; Request ID: 21...; S3 Extended Request ID:
zh1C6C0eRy....)
This kind of repository corruption is observed when the cluster is heavily loaded (JVM heap usage > 80% or CPU utilization > 80%) and a few nodes drop out of the cluster.
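To check whether your nodes are running that hot, you can query the cat nodes API (a quick sketch; on an AWS-managed domain, replace localhost:9200 with your domain endpoint):

# Show per-node heap and CPU usage to spot overloaded nodes
curl 'localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu'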
One way to fix the issue is to delete all the snapshots that refer to the index backed by "YI-....". This cleans up the S3 snapshot files for that index, and the next snapshot you take starts afresh.
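For illustration, a minimal sketch of that cleanup, assuming a repository where snapshot deletion is permitted (on the AWS-managed cs-automated repository it may not be, which is another reason to involve support). Here my-index is a placeholder for the index whose shard data lives under YI-....:

# List every snapshot in the repository, keep those that contain the
# affected index, and delete them one by one
curl -s 'localhost:9200/_snapshot/cs-automated/_all' |
  jq -r '.snapshots[] | select(.indices | index("my-index")) | .snapshot' |
  while read snap; do
    curl -XDELETE "localhost:9200/_snapshot/cs-automated/$snap"
  done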
To be on the safe side, I would recommend contacting AWS support to fix this type of repository corruption.
For reference, Elasticsearch fixed a similar issue in version 7.8 and above: https://github.com/elastic/elasticsearch/issues/57198

Related

InvalidMagicIdException: Not able to add cache in infinispan

I am working on Infinispan. While adding a cache entry over REST with
POST /rest/v2/caches/{cacheName}/{cacheKey} (and also via NiFi), I get the following error:
12:17:55,695 ERROR [org.infinispan.server.hotrod.BaseRequestProcessor] (HotRod-ServerIO-5-1) ISPN005003: Exception reported: org.infinispan.server.hotrod.InvalidMagicIdException: Error reading magic byte or message id: 10
at org.infinispan.server.hotrod:ispn-10.0#10.0.0.Beta3//org.infinispan.server.hotrod.HotRodDecoder.switch0(HotRodDecoder.java:208)
at org.infinispan.server.hotrod:ispn-10.0#10.0.0.Beta3//org.infinispan.server.hotrod.HotRodDecoder.switch1_0(HotRodDecoder.java:153)
at org.infinispan.server.hotrod:ispn-10.0#10.0.0.Beta3//org.infinispan.server.hotrod.HotRodDecoder.decode(HotRodDecoder.java:143)
at io.netty:ispn-10.0#4.1.30.Final//io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502)
at io.netty:ispn-10.0#4.1.30.Final//io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441)
at io.netty:ispn-10.0#4.1.30.Final//io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278)
I am using the jboss/infinispan-server Docker image (10.0.0.beta and 10.0.0.CR1-3 tags).
I am not getting any clue on how to trace the issue.
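Not a full answer, but the failure comes from the Hot Rod decoder, which suggests the HTTP request is reaching the Hot Rod connector rather than the REST endpoint. For comparison, a minimal sketch of a REST v2 write; the cache name, value, and port are placeholders to adjust for your container:

# Create the cache from a built-in template, then write an entry to it
curl -X POST 'http://localhost:11222/rest/v2/caches/mycache?template=org.infinispan.DIST_SYNC'
curl -X POST -H 'Content-Type: text/plain' --data 'some-value' \
  'http://localhost:11222/rest/v2/caches/mycache/mykey'

If the write succeeds against the REST port but NiFi still triggers the magic-byte error, the NiFi connection is most likely pointed at the Hot Rod port.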

Issue installing openwhisk with incubator-openwhisk-devtools

I have a blocking issue installing OpenWhisk with Docker.
I typed make quick-start right after a git pull of the incubator-openwhisk-devtools project. My OS is Fedora 29, Docker version 18.09.0, docker-compose version 1.22.0, Oracle JDK 8.
I get the following error:
[...]
adding the function to whisk ...
ok: created action hello
invoking the function ...
error: Unable to invoke action 'hello': The server is currently unavailable (because it is overloaded or down for maintenance). (code ciOZDS8VySDyVuETF14n8QqB9wifUboT)
[...]
[ERROR] [#tid_sid_unknown] [Invoker] failed to ping the controller: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for health-0: 30069 ms has passed since batch creation plus linger time
[ERROR] [#tid_sid_unknown] [KafkaProducerConnector] sending message on topic 'health' failed: Expiring 1 record(s) for health-0: 30009 ms has passed since batch creation plus linger time
Please note that controller-local-logs.log is never created.
If I touch controller-local-logs.log in the right directory, the log file is still empty after I run make quick-start again.
http://localhost:8888/ping gives me the right answer: pong.
http://localhost:9222 is not reachable.
Where am I wrong?
Thank you in advance
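A first debugging step worth trying, since the controller's log file never appears on disk: inspect the containers directly. A sketch, assuming the default container names from the incubator-openwhisk-devtools docker-compose setup (the exact names may differ on your machine):

# See which OpenWhisk containers are actually up, then read the
# controller's stdout/stderr for the Kafka timeout details
docker ps --format '{{.Names}}\t{{.Status}}'
docker logs controller 2>&1 | tail -n 50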

org.apache.hive.com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 21

I have a YARN cluster with Spark (1.6.1), HDFS, and Hive (2.1). My workflows worked fine for a few months until today (without any changes in the code or the environments). Then I started to get errors like this:
org.apache.hive.com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 21
Serialization trace:
outputFileFormatClass (org.apache.hadoop.hive.ql.plan.PartitionDesc)
aliasToPartnInfo (org.apache.hadoop.hive.ql.plan.MapWork)
invertedWorkGraph (org.apache.hadoop.hive.ql.plan.SparkWork)
at org.apache.hive.com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:119)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:656)
at org.apache.hive.com.esotericsoftware.kryo.serializers.DefaultSerializers$ClassSerializer.read(DefaultSerializers.java:238)
at org.apache.hive.com.esotericsoftware.kryo.serializers.DefaultSerializers$ClassSerializer.read(DefaultSerializers.java:226)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:745)
at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:113)
at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:139)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:776)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:131)
at org.apache.hive.com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:694)
at org.apache.hive.com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at org.apache.hive.com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:507)
at org.apache.hive.com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:672)
at org.apache.hadoop.hive.ql.exec.spark.KryoSerializer.deserialize(KryoSerializer.java:49)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:318)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:366)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:335)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Using Hive I can run simple selects, but every operation that needs Spark ends with Error: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask (state=08S01,code=3) in the console, and the error above in the YARN logs.
Now every Hive database I have is paralyzed (I have a few). I spent the whole day trying to solve this problem, but nothing helped (restarting Hive, restarting the YARN nodes, changing the YARN master).
What do you think causes the problem and how can it be solved?
I figured it out.
After restarting hive-server2, for a short period I got the error org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: org.apache.hadoop.hive.ql.io.RCFileOutputFormat instead of org.apache.hive.com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 26. In that second form it was obvious that Spark running on the nodes was missing some jars on its classpath. I don't know why Spark was suddenly unable to load these jars, but after copying them manually into its lib folder on every node and restarting the nodes, everything went back to normal.
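For reference, a rough sketch of that manual fix; all paths, hostnames, and the jar version are assumptions to adapt (the missing class org.apache.hadoop.hive.ql.io.RCFileOutputFormat ships in the hive-exec jar):

# Copy the Hive exec jar into Spark's lib directory on every node,
# then restart each node
for node in node1 node2 node3; do
  scp /usr/lib/hive/lib/hive-exec-2.1.0.jar "$node":/usr/lib/spark/lib/
done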

Timeout on deleting a snapshot repository

I'm running Elasticsearch 1.7.5 with 19 nodes (12 data nodes).
I'm attempting to set up snapshots for backup and recovery, but am getting a 503 on creation and deletion of a snapshot repository.
curl -XDELETE 'localhost:9200/_snapshot/backups?pretty'
returns:
{
"error" : "RemoteTransportException[[masternodename][inet[/10.0.0.20:9300]][cluster:admin/repository/delete]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (delete_repository [backups]) within 30s]; ",
"status" : 503
}
I was able to adjust the query with master_timeout=10m, but I'm still getting a timeout. Is there a way to debug the cause of this failing request?
Performance on this call seems to be related to pending tasks with a higher priority.
https://discuss.elastic.co/t/timeout-on-deleting-a-snapshot-repository/69936/4
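You can see those competing tasks directly with the pending tasks API, which also exists in 1.7:

# List queued cluster-state tasks with their priorities and wait times
curl 'localhost:9200/_cluster/pending_tasks?pretty'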

mvn deploy to artifactory (large size artifact) gives broken pipe error

I am trying to deploy a large zip file (305 MB) to Artifactory using the mvn deploy command, but I am getting broken pipe errors. I tried uploading the same file via the browser's "Deploy" button, and it worked fine.
Is there a setting I can use to increase the socket timeout, or to force Maven/Artifactory to wait until the package has been uploaded?
Any help on how to resolve this is appreciated. Here are the log entries and stack trace:
May 09, 2016 11:16:32 AM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: Retrying request to {}->http://mvnrepo.dev.xyz.myorg.net:8080
May 09, 2016 11:17:02 AM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: I/O exception (java.net.SocketException) caught when processing request to {}->http://mvnrepo.dev.xyz.myorg.net:8080: Broken pipe
May 09, 2016 11:17:02 AM org.apache.maven.wagon.providers.http.httpclient.impl.execchain.RetryExec execute
INFO: Retrying request to {}->http://mvnrepo.dev.xyz.myorg.net:8080
Caused by: org.eclipse.aether.transfer.ArtifactTransferException: Could not transfer artifact com.myorg.xyz.myapp:my-app-package:zip:0.1-svc20160509.151532-7 from/to xyz-service-snapshots (http://mvnrepo.dev.xyz.myorg.net:8080/artifactory/xyz-service-local): Broken pipe
at org.eclipse.aether.connector.basic.ArtifactTransportListener.transferFailed(ArtifactTransportListener.java:43)
at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run(BasicRepositoryConnector.java:355)
at org.eclipse.aether.connector.basic.BasicRepositoryConnector.put(BasicRepositoryConnector.java:274)
at org.eclipse.aether.internal.impl.DefaultDeployer.deploy(DefaultDeployer.java:311)
... 26 more
Caused by: org.apache.maven.wagon.TransferFailedException: Broken pipe
at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:662)
at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:557)
at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:539)
at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:533)
at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon.put(AbstractHttpClientWagon.java:513)
at org.eclipse.aether.transport.wagon.WagonTransporter$PutTaskRunner.run(WagonTransporter.java:644)
at org.eclipse.aether.transport.wagon.WagonTransporter.execute(WagonTransporter.java:427)
at org.eclipse.aether.transport.wagon.WagonTransporter.put(WagonTransporter.java:410)
at org.eclipse.aether.connector.basic.BasicRepositoryConnector$PutTaskRunner.runTask(BasicRepositoryConnector.java:510)
at org.eclipse.aether.connector.basic.BasicRepositoryConnector$TaskRunner.run(BasicRepositoryConnector.java:350)
... 28 more
Caused by: java.net.SocketException: Broken pipe
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
at org.apache.maven.wagon.providers.http.httpclient.impl.io.SessionOutputBufferImpl.streamWrite(SessionOutputBufferImpl.java:123)
at org.apache.maven.wagon.providers.http.httpclient.impl.io.SessionOutputBufferImpl.flushBuffer(SessionOutputBufferImpl.java:135)
at org.apache.maven.wagon.providers.http.httpclient.impl.io.SessionOutputBufferImpl.write(SessionOutputBufferImpl.java:164)
at org.apache.maven.wagon.providers.http.httpclient.impl.io.ContentLengthOutputStream.write(ContentLengthOutputStream.java:115)
at org.apache.maven.wagon.providers.http.AbstractHttpClientWagon$RequestEntityImplementation.writeTo(AbstractHttpClientWagon.java:204)
at org.apache.maven.wagon.providers.http.httpclient.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:155)
at org.apache.maven.wagon.providers.http.httpclient.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:149)
at org.apache.maven.wagon.providers.http.httpclient.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:236)
I managed to fix the issue by downgrading Maven from 3.5.2 to 3.0.4.
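If downgrading is not an option, Maven's HTTP wagon also exposes a read-timeout property that may help with slow uploads (this is my understanding of the maven.wagon.rto property; verify it against your Maven/Wagon version):

# Raise the wagon read timeout to 30 minutes (value in milliseconds)
mvn deploy -Dmaven.wagon.rto=1800000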
Artifactory - open source 7.12.6
General
A Broken pipe error typically indicates a network issue. Deploying an artifact to Artifactory may fail with this error either if you have network issues (lost packets during deployment) or if your Artifactory repository has run out of space (physical storage on the repository). Check Case 1 and Case 2 for troubleshooting.
Tip: Run mvn deploy -X. The -X option logs further information related to your error (it sets the log level to debug).
Case 1: Network Issue
Solution: Just retry it :)
Issue Details
Maven's source code seems to use Apache Commons' Validate method (which throws an IllegalArgumentException on validation errors) to compare and validate the size of the deployed artifact against the artifact that was built locally. Note: on deploy, Maven builds the artifact locally and then uploads it to the configured repository (Artifactory, Maven Central, etc.), which in your case is Artifactory.
The error should look like this:
[DEBUG] Failed to dispatch transfer event 'PUT PROGRESSED [ARTIFACT_SERVER_URL]/artifact.jar <> [LOCAL_PATH]/artifact.jar
java.lang.IllegalArgumentException: progressed file size cannot be greater than size: 117964800 > 116927216
at org.apache.commons.lang3.Validate.isTrue (Validate.java:158)
at org.apache.maven.cli.transfer.AbstractMavenTransferListener$FileSizeFormat.formatProgress (AbstractMavenTransferListener.java:195)
at org.apache.maven.cli.transfer.ConsoleMavenTransferListener.getStatus (ConsoleMavenTransferListener.java:117)
at org.apache.maven.cli.transfer.ConsoleMavenTransferListener.transferProgressed (ConsoleMavenTransferListener.java:90)
...
Case 2: No space left on Artifactory repository
Solution: Free up space on Artifactory repository
Issue Details
For any errors related to deployment to Artifactory, you can always check Artifactory's log from the web application.
Just navigate (from the sidebar) to Admin/Advanced/System Logs and check for any errors or warnings in the log file.
An error related to the available space could look like the following:
2021-08-18 11:20:43,339 [http-nio-8081-exec-19] [WARN ] (o.a.w.s.RepoFilter :252) - Sending HTTP error code 404: No space left on device
2021-08-18 11:20:43,346 [http-nio-8081-exec-16] [WARN ] (o.a.r.s.RepositoryServiceImpl:1739) - Datastore disk is too high: Warning limit: 80%, Used: 94%, Total: 688.01 GB, Used: 652.97 GB, Available: 35.04 GB
2021-08-18 11:20:43,346 [http-nio-8081-exec-16] [INFO ] (o.a.e.UploadServiceImpl:386) - Deploy to 'libs-snapshot-local:artifact.jar' Content-Length: 170019165
2021-08-18 11:20:44,086 [http-nio-8081-exec-16] [ERROR] (o.a.w.s.RepoFilter :251) - Upload request of libs-snapshot-local:artifact.jar failed due to {}
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
...
If you do identify such errors, try removing some old artifacts to free up space, and retry your deployment later.
Tip: It's good practice to set purging policies for old artifacts/snapshots in your Artifactory.
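If you prefer to script the cleanup, artifacts can also be removed over Artifactory's REST API. A sketch with placeholder host, path, and credentials:

# Delete an old snapshot artifact to free up space
curl -u admin:password -X DELETE \
  'http://your-artifactory-host:8080/artifactory/libs-snapshot-local/com/myorg/old-artifact-1.0.zip'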
