ElasticSearch No Node Available Exception

I was indexing around 70,000 records. Normally a bulk index takes around 1 to 2 minutes, but this time it took more than 10 minutes and the C# NEST client threw an exception.
Has anybody experienced the issue below? Note also that I have set the index refresh interval to 30s.
Below are the exception details from the C# NEST client:
Elasticsearch.Net.Exceptions.MaxRetryException occurred
HResult=-2146233088
Message=Unable to perform request: 'POST #indexName/_bulk' on any
of the nodes after retrying 0 times.
Source=Elasticsearch.Net
StackTrace:
at Elasticsearch.Net.Connection.Transport.RetryRequestAsync[T]
(TransportRequestState`1 requestState, Uri baseUri, Int32 retried,
Exception e) in
c:\Projects\NEST\src\Elasticsearch.Net\Connection\Transport.cs:line 344
InnerException:

I am not sure if this will help you, but I had a similar issue in a 2-node cluster. I was adding synonyms and had set up the synonym file on the master machine only; I completely forgot to copy it over to the 2nd node. That caused the error above for me when creating a new index that depended on that synonym file, which totally makes sense.
After I added the synonym file and restarted the 2nd node, everything went back to normal.
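For reference, the 30s refresh interval mentioned in the question is a per-index setting; here is a sketch of setting it, using the Python client for brevity (the index name and client timeout are placeholders):
from elasticsearch import Elasticsearch

es = Elasticsearch(timeout=120)  # generous client timeout for bulk-heavy work

# Lengthen the refresh interval so Elasticsearch spends less effort
# making documents searchable while the bulk load is running.
es.indices.put_settings(index="my_index",
                        body={"index": {"refresh_interval": "30s"}})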

Related

CoGroupByKey always failed on big data (PythonSDK)

I have about 4,000 input files (avg ~7MB each).
My pipeline always fails at the CoGroupByKey step when the data size reaches about 4GB.
When I limit the input to only 300 files, it runs just fine.
On failure, the logs on GCP Dataflow only show:
Workflow failed. Causes: S24:CoGroup Geo data/GroupByKey/Read+CoGroup Geo data/GroupByKey/GroupByWindow+CoGroup Geo data/Map(_merge_tagged_vals_under_key) failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:
store-migration-10212040-aoi4-harness-m7j7
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.,
store-migration-xxxxx
Root cause: The worker lost contact with the service.
I dug through all the logs in Logs Explorer. Nothing else indicates an error other than the above, not even my logging.info and try...except code.
I think this is related to the memory of the instances, but I didn't dig in that direction, because that is the kind of thing I don't want to have to worry about when using GCP services.
Thanks.
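For context, the failing step is Beam's standard CoGroupByKey join; a minimal sketch of that pattern (all names and values here are hypothetical) shows what is being grouped:
import apache_beam as beam

with beam.Pipeline() as p:
    # Two keyed PCollections to be joined on a common key.
    geo = p | "Geo" >> beam.Create([("store_1", {"lat": 1.0, "lng": 2.0})])
    sales = p | "Sales" >> beam.Create([("store_1", {"amount": 9.99})])

    # CoGroupByKey collects every value for a key, across both inputs,
    # into memory on a single worker; very large per-key groups are a
    # common cause of the "worker lost contact" failure above.
    joined = ({"geo": geo, "sales": sales}
              | "CoGroup Geo data" >> beam.CoGroupByKey())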

Error working with "ScrollElasticsearchHttp" processor in NiFi

I am trying to retrieve data from an index in Elasticsearch. I configured the "QueryElasticsearchHttp" processor and it works just fine. However, when I try to use the ScrollElasticsearchHttp processor with the same URL, query, and index properties, and set 'scroll' to the default 1 minute, it doesn't work.
I get a 404 error response: "Elasticsearch returned code 404 with message Not found".
I am also tailing the log on the ES cluster, and I see this error:
[DEBUG][o.e.a.s.TransportSearchScrollAction] [2] Failed to execute query phase
org.elasticsearch.transport.RemoteTransportException:[127.0.0.1:9300][indices:data/read/search[phase/query+fetch/scroll]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [2]
at org.elasticsearch.search.SearchService.getExecutor(SearchService.java:457) ~[elasticsearch-7.5.2.jar:7.5.2]
I am on Apache NiFi 1.10.0
I should see a total of 441 hits, and with page size 20 I should see 23 queries being made to ES.
But I don't get a single result back. I have tried higher values for "scroll" and also played around with "page size" to no avail.
I also noticed that even though the ScrollElasticsearchHttp processor is set to run every 1 minute, I don't see any error log repeated every minute on the ES side.
Update:
When I cleared the state via the UI ("View state" -> "Clear state"), I was able to make a single call that returned a page full of hits in one flowfile.
However, there are more pages to be retrieved. How do I make the processor fetch the next page?
My understanding was that a single invocation of ScrollElasticsearchHttp would page through all the result sets and bring in each page as one flowfile. Is this not correct?
Decrease the scheduling interval to around 10-20 seconds. Then, every 10-20 seconds, the processor will fetch the next set of records based on your page size.
You can check the state value while the fetch is in progress, i.e. you will find a scroll id in it. Once the fetch is complete, the state value changes to "finishedQuery" : true.
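For background, the processor drives the Elasticsearch scroll API, and the stored scroll id is what links one fetch to the next; a sketch of the equivalent paging loop with the Python client (index and query are placeholders) may clarify:
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Open a scroll context that is kept alive for 1 minute between calls.
resp = es.search(index="my_index", scroll="1m", size=20,
                 body={"query": {"match_all": {}}})

while resp["hits"]["hits"]:
    for hit in resp["hits"]["hits"]:
        print(hit["_id"])
    # Every follow-up call must present the scroll id and renew the
    # keep-alive; a 404 like the one above means the context expired
    # before the next call arrived.
    resp = es.scroll(scroll_id=resp["_scroll_id"], scroll="1m")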

ElasticSearch randomly fails when running tests

I have a test Elasticsearch box (2.3.0), and my tests that use ES fail in random order, which is really frustrating (they fail with an "All shards failed" exception).
Looking at the elastic_search.log file, it only showed me this:
[2017-05-04 04:19:15,990][DEBUG][action.search.type ] [es-testing-1] All shards failed for phase: [query]
RemoteTransportException[[es-testing-1][127.0.0.1:9300][indices:data/read/search[phase/query]]]; nested: IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]];
Caused by: [derp_test][[derp_test][3]] IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]]
at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:993)
at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:814)
at org.elasticsearch.search.SearchService.createContext(SearchService.java:641)
at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:618)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:369)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:350)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
Any idea what's going on? So far my research only told me this is most likely due to a corrupt translog, but I don't think deleting the translog will help, because the test drops the test index for every namespace.
The ES test box has 3.5GB RAM and is using a 2.5GB heap; CPU usage is quite normal during the test (it peaked at 15%).
To clarify: by failing test, I mean an error with the weird exception mentioned above (not a test failing due to an incorrect value). I do a manual refresh after every insert/update operation, so the values are correct.
After investigating the Elasticsearch log file (at DEBUG level) and the source code, it turns out that what actually happened was this: after the index is created, the shards enter the RECOVERING state, and sometimes my test tried to perform a query against Elasticsearch while the shards were not yet active, hence the exception.
The fix is simple: after creating an index, just wait until the shards are active using the setWaitForActiveShards function, and to be more paranoid I also added setWaitForYellowStatus.
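setWaitForActiveShards and setWaitForYellowStatus are Java client calls; as a sketch, the same wait can be expressed with the Python client (the index name is taken from the log above):
from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(index="derp_test")

# Block until the shards have left the RECOVERING state and the index
# is at least yellow (primaries active) before running any queries.
es.cluster.health(index="derp_test", wait_for_status="yellow")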
It's recommended to use ESIntegTestCase for integration tests.
ESIntegTestCase has helper methods, like ensureGreen and refresh, to ensure Elasticsearch is ready to continue testing, and you can configure node settings for the test.
If you use Elasticsearch directly as a test box, it may cause various problems:
like your exception, where it seems the shards for index derp_test are still recovering;
or a search immediately after indexing your data failing, because the cluster still needs to flush or refresh;
...
Most of these problems can be "fixed" by using Thread.sleep to wait some time :), but that's a bad way to do it.
Try manually refreshing your indices after inserting the data and before performing a query to ensure the data is searchable.
Either:
As part of the index request - https://www.elastic.co/guide/en/elasticsearch/reference/2.3/docs-index_.html#index-refresh
Or separately - https://www.elastic.co/guide/en/elasticsearch/reference/2.3/indices-refresh.html
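With the Python client, for instance, those two options look like this (a sketch; the index, type, and document are placeholders):
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Option 1: refresh as part of the index request, so this document is
# searchable as soon as the call returns.
es.index(index="derp_test", doc_type="doc", id=1,
         body={"name": "foo"}, refresh=True)

# Option 2: refresh the whole index separately, e.g. once per batch.
es.indices.refresh(index="derp_test")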
There could be another reason. I had the same problem with my Elasticsearch unit tests. At first I thought the root cause was somewhere in .NET Core, NEST, or elsewhere outside my code, because the tests would run successfully in Debug mode (when debugging tests) but randomly fail in Release mode (when running tests).
After lots of investigation and much trial and error, I found that the root cause (in my case) was concurrency, or in other words, a race condition.
Since the tests run concurrently, and I used to recreate and seed my index (initializing and preparing it) in the test class constructor, meaning it executed at the beginning of every test, a race condition was likely to happen and make my tests fail.
Here is my initialization code that caused the tests to fail randomly when running them (in Release mode):
public BaseElasticDataTest(RootFixture fixture)
    : base(fixture)
{
    ElasticHelper = fixture.Builder.Build<ElasticDataProvider<FakePersonIndex>>();
    deleteFakeIndex();
    createFakeIndex();
    fillFakeIndexData();
}
The code above used to run for every test, concurrently. I fixed my problem by executing the initialization code only once per test class (once for all the test cases inside the test class), and the problem went away.
Here is my fixed test class constructor code:
static bool initialized = false;

public BaseElasticDataTest(RootFixture fixture)
    : base(fixture)
{
    ElasticHelper = fixture.Builder.Build<ElasticDataProvider<FakePersonIndex>>();
    if (!initialized)
    {
        deleteFakeIndex();
        createFakeIndex();
        fillFakeIndexData();
        // for concurrency
        System.Threading.Thread.Sleep(100);
        initialized = true;
    }
}
Hope it helps

How to Fix Read timed out in Elasticsearch

I used Elasticsearch-1.1.0 to index tweets.
The indexing process is okay.
Then I upgraded the version. Now I use Elasticsearch-1.3.2, and I get this message randomly:
Exception happened: Error raised when there was an exception while talking to ES.
ConnectionError(HTTPConnectionPool(host='127.0.0.1', port=8001): Read timed out. (read timeout=10)) caused by: ReadTimeoutError(HTTPConnectionPool(host='127.0.0.1', port=8001): Read timed out. (read timeout=10)).
Snapshot of the randomness:
Happened --33s-- Happened --27s-- Happened --22s-- Happened --10s-- Happened --39s-- Happened --25s-- Happened --36s-- Happened --38s-- Happened --19s-- Happened --09s-- Happened --33s-- Happened --16s-- Happened
--XXs-- = after XX seconds
Can someone point out on how to fix the Read timed out problem?
Thank you very much.
It's hard to give a direct answer, since the error you're seeing might be associated with the client you are using. However, a solution might be one of the following:
1. Increase the default timeout globally when you create the ES client by passing the timeout parameter. Example in Python:
es = Elasticsearch(timeout=30)
2. Set the timeout per request made by the client. Taken from the Elasticsearch Python docs below.
# only wait for 1 second, regardless of the client's default
es.cluster.health(wait_for_status='yellow', request_timeout=1)
The above will give the cluster some extra time to respond.
Try this:
es = Elasticsearch(timeout=30, max_retries=10, retry_on_timeout=True)
It might not fully avoid ReadTimeoutError, but it will minimize them.
Read timeouts can also happen when the query size is large. For example, in my case, with a pretty large ES index (> 3M documents), a search for a query with 30 words took around 2 seconds, while a search for a query with 400 words took over 18 seconds. So for a sufficiently large query, even timeout=30 won't save you. An easy solution is to crop the query to a size that can be answered below the timeout.
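A minimal sketch of that cropping idea (the index name, query text, and 30-word budget are all placeholders):
from elasticsearch import Elasticsearch

es = Elasticsearch(timeout=30)
long_query_text = " ".join(["term"] * 400)  # stand-in for a 400-word query

# Keep only the first 30 words so the search finishes within the
# client timeout; the budget is illustrative, not a rule.
MAX_QUERY_WORDS = 30
cropped = " ".join(long_query_text.split()[:MAX_QUERY_WORDS])
resp = es.search(index="tweets",
                 body={"query": {"match": {"text": cropped}}})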
For what it's worth, I found that this seems to be related to a broken index state.
It's very difficult to reliably recreate this issue, but I've seen it several times; operations run as normal except certain ones that periodically seem to hang ES (specifically, refreshing an index).
Deleting an index (curl -XDELETE http://localhost:9200/foo) and reindexing from scratch fixed this for me.
I recommend periodically clearing and reindexing if you see this behaviour.
Increasing various timeout options may immediately resolve issues, but does not address the root cause.
Provided the Elasticsearch service is available and the indexes are healthy, try increasing the Java minimum and maximum heap sizes: see https://www.elastic.co/guide/en/elasticsearch/reference/current/jvm-options.html .
TL;DR: edit /etc/elasticsearch/jvm.options and set -Xms1g and -Xmx1g.
You should also check whether all is fine with Elasticsearch itself. Some shards can be unavailable; here is a nice doc about possible reasons for unassigned shards: https://www.datadoghq.com/blog/elasticsearch-unassigned-shards/

Indexing many with NEST and Elasticsearch - unable to perform POST on any of the nodes

I'm trying to index many documents using NEST into Elasticsearch. Things run fine when there's a limited number of documents, but when I ramp up the number, say from 1,000 to 50,000, it throws an error. I'm not convinced it's due to the number of documents; it could be bad data.
I'm trying to safeguard against bad data, though: I'm only indexing documents that have an id. The id is generated from one of my fields (upc), so I'm positive there's an id for every document. I'm also making sure the class object it's serialized to/from has all nullable properties.
Still, there's nothing informational that I can see in this error that helps me.
The error I get is:
Unable to perform request: 'POST' on any of the nodes after retrying 0 times
And here's the stack trace when it throws the error:
at Elasticsearch.Net.Connection.Transport.RetryRequest[T](TransportRequestState`1 requestState, Uri baseUri, Int32 retried, Exception e) in c:\Projects\NEST\src\Elasticsearch.Net\Connection\Transport.cs:line 241
at Elasticsearch.Net.Connection.Transport.DoRequest[T](TransportRequestState`1 requestState, Int32 retried) in c:\Projects\NEST\src\Elasticsearch.Net\Connection\Transport.cs:line 215
at Elasticsearch.Net.Connection.Transport.DoRequest[T](String method, String path, Object data, IRequestParameters requestParameters) in c:\Projects\NEST\src\Elasticsearch.Net\Connection\Transport.cs:line 163
at Elasticsearch.Net.ElasticsearchClient.DoRequest[T](String method, String path, Object data, BaseRequestParameters requestParameters) in c:\Projects\NEST\src\Elasticsearch.Net\ElasticsearchClient.cs:line 75
at Elasticsearch.Net.ElasticsearchClient.Bulk[T](Object body, Func`2 requestParameters) in c:\Projects\NEST\src\Elasticsearch.Net\ElasticsearchClient.Generated.cs:line 45
at Nest.RawDispatch.BulkDispatch[T](ElasticsearchPathInfo`1 pathInfo, Object body) in c:\Projects\NEST\src\Nest\RawDispatch.generated.cs:line 34
at Nest.ElasticClient.<Bulk>b__d6(ElasticsearchPathInfo`1 p, BulkDescriptor d) in c:\Projects\NEST\src\Nest\ElasticClient-Bulk.cs:line 20
at Nest.ElasticClient.Dispatch[D,Q,R](D descriptor, Func`3 dispatch, Boolean allow404) in c:\Projects\NEST\src\Nest\ElasticClient.cs:line 86
at Nest.ElasticClient.Dispatch[D,Q,R](Func`2 selector, Func`3 dispatch, Boolean allow404) in c:\Projects\NEST\src\Nest\ElasticClient.cs:line 72
at Nest.ElasticClient.Bulk(Func`2 bulkSelector) in c:\Projects\NEST\src\Nest\ElasticClient-Bulk.cs:line 15
at Nest.ElasticClient.IndexMany[T](IEnumerable`1 objects, String index, String type) in c:\Projects\NEST\src\Nest\ElasticClient-Index.cs:line 44
at ElasticsearchLoad.Program.BuildBulkApi() in c:\Projects\ElasticsearchLoad\ElasticsearchLoad\Program.cs:line 258
Any help would be appreciated!
You are going to be limited in the effective bulk size you can send to Elasticsearch by a combination of your documents and your Elasticsearch configuration. There is no single best answer for this, but with some testing and configuration changes you should be able to achieve a suitable bulk indexing performance threshold. Here are some resources to assist you:
elasticsearch bulk indexing gets slower over time with constant number of indexes and documents
Write heavy elasticsearch
Scaling Elasticsearch Part 1: Overview
And for the overall sizing of Elasticsearch, I would highly recommend reading Sizing Elasticsearch - Scaling up and out.
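The question uses the C# NEST client, but the principle of capping the bulk request size is the same in any client; as a sketch with the Python bulk helpers (the index, type, sample data, and chunk size are all assumptions to tune):
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(timeout=60)
my_docs = [{"upc": "012345678905", "name": "widget"}]  # stand-in records

def actions(docs):
    # One bulk action per document, using the upc-derived id as in
    # the question; docs is a placeholder for your source records.
    for doc in docs:
        yield {"_index": "products", "_type": "product",
               "_id": doc["upc"], "_source": doc}

# Send the documents in modest chunks instead of one huge bulk body;
# 500 docs per request is a starting point to tune, not a rule.
helpers.bulk(es, actions(my_docs), chunk_size=500)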
If you are running a multi-node cluster, make sure your setup is the same on all nodes.
I am not sure if this will help you, but I had a similar issue in a 2-node cluster. I was adding synonyms and had set up the synonym file on the master machine only; I completely forgot to copy it over to the 2nd node. That caused the error above for me when creating a new index that depended on that synonym file.
After I added the synonym file and restarted the 2nd node, everything went back to normal.
