Corrupted/unassigned Elasticsearch index

I have been running the Elasticsearch service for quite a long time, but suddenly encountered the following:
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog from source [d:\elasticsearch-7.1.0\data\nodes\0\indices\A2CcAAE-R3KkQh6jSoaEUA\2\translog\translog-1.tlog] is corrupted, expected shard UUID [.......] but got: [...........] this translog file belongs to a different translog.
I executed GET /_cat/shards?v and most of the indices are in UNASSIGNED state.
Please help!
I went through the log files and saw the error message "Failed to update shard information for ClusterInfoUpdateJob within 15s timeout". Could this error cause most of the shards to become UNASSIGNED?
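For reference, a minimal diagnostic sketch (not from the original post) that asks Elasticsearch directly why a shard is unassigned; it assumes the node is reachable on localhost:9200:

import json
import requests

# With no request body, the allocation explain API reports on the first unassigned shard
# it finds, including the reason (e.g. a failed allocation caused by a corrupted translog).
resp = requests.get("http://localhost:9200/_cluster/allocation/explain")
print(json.dumps(resp.json(), indent=2))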

You can try to recover using the elasticsearch-translog tool as explained in the documentation.
Elasticsearch must be stopped while running this tool.
If you don't have a replica from which the data can be recovered, you may lose some data by using this tool.
The documentation mentions that the usual cause is a drive error or user error.
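As a hedged follow-up sketch (my addition, not part of the original answer): once the tool has run and the node has been restarted, previously failed allocations can be retried and the shard table re-checked, assuming Elasticsearch is back on localhost:9200:

import requests

# Ask the master to retry allocations that previously failed
# (for example because of the corrupted translog).
print(requests.post("http://localhost:9200/_cluster/reroute?retry_failed=true").status_code)

# Re-check shard states; anything still UNASSIGNED needs further investigation.
print(requests.get("http://localhost:9200/_cat/shards?v").text)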

Related

Elasticsearch connector task stuck in failed and unknown status

In my Elasticsearch connector, tasks.max was set to 10, which I reduced to 7. Now the connector is running with 7 tasks, while the other three tasks are stuck in "unknown" and "failed" status. I have restarted the connector, but the stale tasks still did not get removed.
How do I remove these tasks that are unassigned/failed?
This is a known issue in Kafka Connect. I'm unsure if there is a JIRA for it, or whether it has been fixed in versions later than the last one I used (approximately Kafka 2.6).
When you start with a higher tasks.max, that information is stored in the internal status topic of the Connect API, and this is what the status HTTP endpoint returns. When you reduce the value, the metadata for those previous tasks is still present, but it is no longer being updated.
The only real fix I've found is to delete the connector and re-create it, potentially with a new name, since the status topic is keyed by the connector name.
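A minimal sketch of that delete/re-create approach via the Connect REST API; the worker address, connector names, topic and connection URL below are placeholders, and the connector class is assumed to be the Confluent Elasticsearch sink:

import requests

CONNECT = "http://localhost:8083"   # placeholder Connect worker address
OLD_NAME = "es-sink"                # placeholder: the existing connector
NEW_NAME = "es-sink-v2"             # new name, so stale status entries keyed by the old name are left behind

# Remove the old connector and its task entries.
requests.delete(f"{CONNECT}/connectors/{OLD_NAME}")

# Re-create it under the new name with the reduced task count.
config = {
    "name": NEW_NAME,
    "config": {
        "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
        "tasks.max": "7",
        "topics": "my-topic",                       # placeholder
        "connection.url": "http://localhost:9200",  # placeholder
    },
}
resp = requests.post(f"{CONNECT}/connectors", json=config)
print(resp.status_code, resp.json())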

Azure Databricks stream fails with StorageException: Could not verify copy source

We have a Databricks job that has suddenly started to consistently fail. Sometimes it runs for an hour, other times it fails after a few minutes.
The inner exception is
ERROR MicroBatchExecution: Query [id = xyz, runId = abc] terminated with error
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: Could not verify copy source.
The job targets a notebook that consumes from an Event Hub with PySpark Structured Streaming, calculates some values based on the data, and streams the results back to another Event Hub topic.
The cluster is a pool with 2 workers and 1 driver running on standard Databricks 9.1 ML.
We've tried restarting the job many times, also with clean input data and a clean checkpoint location.
We struggle to determine what is causing this error.
We cannot see any 403 Forbidden errors in the logs, which are sometimes mentioned on forums as a cause.
Any assistance is greatly appreciated.
The issue was resolved by moving the checkpoint location (used internally by Spark) from standard storage to premium storage. I don't know why it suddenly started failing after months of running without a hiccup.
Premium storage might be a better place for checkpointing anyway since I/O is cheaper.
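A hedged sketch of what the fix amounts to in the streaming write; the source, sink, storage account and paths below are placeholders standing in for the real job, and the relevant part is pointing checkpointLocation at the premium account:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 'events' stands in for the streaming DataFrame the notebook already builds from Event Hubs.
events = spark.readStream.format("rate").load()  # placeholder source for the sketch

(events.writeStream
    .format("parquet")  # placeholder sink; the real job writes back to Event Hubs
    .option("checkpointLocation",
            "abfss://checkpoints@mypremiumaccount.dfs.core.windows.net/my-job/")  # premium storage (placeholder names)
    .start("/mnt/placeholder-output"))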

Elasticsearch - missing data

I have been planning to use ELK for our production environment and seem to be running into a weird problem:
While loading a sample of the production log file, I realized there is a huge mismatch between the number of events being published by Filebeat and what we see in Kibana. My first suspicion was Filebeat, but I could verify that all the events were successfully received by Logstash.
I also checked Logstash (by enabling debug mode) and could see that all the events were received and processed successfully (I am using the date and json filters).
But when I search in Kibana, I only see a fraction of the logs actually published (e.g. only 16,000 out of 350K). There are no exceptions or errors in either the Logstash or Elasticsearch logs.
I have tried zapping the entire data set by doing the following so far:
Stopped all processes for Elasticsearch, Logstash and Kibana.
Deleted all the index files, cleared the cache, deleted the mappings.
Stopped Filebeat and deleted its registry file (since it's running on Windows).
Restarted Elasticsearch, Logstash and Filebeat (in that order).
But I get the same result: only 2 out of 8 records (in the shortened file) and even fewer when I use the full file.
I tried increasing the time window in Kibana to 10 years (:)) to see if the documents were being indexed under the wrong year, but got nothing.
I have read almost all the threads related to missing data, but nothing seems to work.
Any pointers would help!
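For reference, a hedged diagnostic sketch (not from the original post) that bypasses Kibana and asks Elasticsearch directly how many documents actually made it in; it assumes localhost:9200 and a logstash-* index pattern, so adjust the names to your setup:

import requests

# List all indices with their document counts - documents landing in an unexpected index
# (e.g. a failed date parse putting events into a different daily index) show up here.
print(requests.get("http://localhost:9200/_cat/indices?v").text)

# Total count across the Logstash indices, ignoring any Kibana time filter.
print(requests.get("http://localhost:9200/logstash-*/_count").json())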

Couchbase Replication Error to Elasticsearch

I have an existing replication from Couchbase -> Elasticsearch. I found out that there are now errors in the replication:
I tried to Create Replication again, but it also gave the same error:
I already checked the Elasticsearch head plugin, and I can see data in there and query it with results. I also restarted my Elasticsearch batch file, but the error is still persistent.
Can anyone help me with what else I need to check to further investigate the issue? Thank you in advance.
You may have a connectivity problem, which can happen due to networking issues such as an IP address change since you initially set up the replication.
You might try the troubleshooting steps outlined here if you haven't already:
http://developer.couchbase.com/documentation/server/4.1/connectors/elasticsearch-2.1/trouble-intro.html
You should also check the goxdcr logs, which you can find here depending on the OS you're using:
http://developer.couchbase.com/documentation/server/4.0/troubleshooting/troubleshooting-logs.html
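As a small connectivity sanity check (my addition, run from the Couchbase node), you could verify that the node can actually reach Elasticsearch; <es-host> is a placeholder for the address the replication was configured with:

import requests

ES = "http://<es-host>:9200"  # placeholder address

print(requests.get(ES).json())                       # basic reachability / version banner
print(requests.get(f"{ES}/_cluster/health").json())  # cluster health as seen from this node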

After we upgraded to Elasticsearch 1.5.2 we are getting java.io.EOFException

After upgrading Elasticsearch to 1.5.2 we are repeatedly getting:
java.io.EOFException: read past EOF:
MMapIndexInput(path="/iqs/ESData/elasticsearch/nodes/0/indices/ids_1/1/index/segments_7")
Even if we restart the cluster, the same exception keeps occurring. The one option we have left is to delete the corrupted segment, but that is not an acceptable solution for our busy cluster. Can anyone suggest anything, please?
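A hedged diagnostic sketch (not an answer from the thread): before touching any segment files, it may be worth checking whether the affected shard still has a healthy replica copy that could be used for recovery. This assumes the cluster is reachable on localhost:9200 and that ids_1 / shard 1 is the path shown in the exception:

import requests

# Per-shard state for the ids_1 index: look for a STARTED replica of the affected shard.
print(requests.get("http://localhost:9200/_cat/shards/ids_1?v").text)

# Ongoing recovery activity, to see whether the shard keeps failing and re-initializing.
print(requests.get("http://localhost:9200/_cat/recovery?v").text)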
