I have an AWS ElasticSearch t2.medium instance with 2 nodes running, and hardly any load on it. Still it is crashing all the time.
I see the following graph for the metric JVMMemoryPressure:
When I go to Kibana, I see the following error message:
Questions:
Do I interpret correctly that the machines only have 64 MB of memory available, instead of the 4 GB that should be associated with this instance type? Is there another place to verify the absolute amount of heap memory, instead of on Kibana only when it is going wrong?
If so, how can I change this behavior?
If this is normal, where can I look for possible causes of Elasticsearch crashing whenever the memory footprint reaches 100%? I have only a very small load on the instance.
In the instance's logs I see a lot of warnings such as the ones below. They give no clue about where to start debugging the issue.
[2018-08-15T07:36:37,021][WARN ][r.suppressed ] path: __PATH__ params:
{}
org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [__PATH__ master];
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:165) ~[elasticsearch-6.0.1.jar:6.0.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:387) [elasticsearch-6.0.1.jar:6.0.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:273) [elasticsearch-6.0.1.jar:6.0.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.0.1.jar:6.0.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$2.onTimeout(TransportBulkAction.java:421) [elasticsearch-6.0.1.jar:6.0.1]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:317) [elasticsearch-6.0.1.jar:6.0.1]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:244) [elasticsearch-6.0.1.jar:6.0.1]
at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:578) [elasticsearch-6.0.1.jar:6.0.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-6.0.1.jar:6.0.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
or
[2018-08-15T07:36:37,691][WARN ][o.e.d.z.ZenDiscovery ] [U1DMgyE] not enough master nodes discovered during pinging (found [[Candidate{node={U1DMgyE}{U1DMgyE1Rn2gId2aRgRDtw}{F-tqTFGDRZaovQF8ILC44w}{__IP__}{__IP__}{__AMAZON_INTERNAL__, __AMAZON_INTERNAL__}, clusterStateVersion=207939}]], but needed [2]), pinging again
or
[2018-08-15T07:36:42,303][WARN ][o.e.t.n.Netty4Transport ] [U1DMgyE] write and flush on the network layer failed (channel: [id: 0x385d3b63, __PATH__ ! __PATH__])
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.writev0(Native Method) ~[?:1.8.0_172]
at sun.nio.ch.SocketDispatcher.writev(SocketDispatcher.java:51) ~[?:1.8.0_172]
at sun.nio.ch.IOUtil.write(IOUtil.java:148) ~[?:1.8.0_172]
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:504) ~[?:1.8.0_172]
at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:432) ~[netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:856) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:368) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:638) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) [netty-transport-4.1.13.Final.jar:4.1.13.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.13.Final.jar:4.1.13.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
I have learned that the number shown in Kibana is incorrect; I don't know where it comes from. To get the correct memory usage, run the following query:
GET "<es_url>:9200/_nodes/stats"
If you're only looking for memory usage, use GET "<es_url>:9200/_cat/nodes?h=heap*", which gives a more readable response like the one below.
{
"payload": [
{
"heap.current": "4.1gb",
"heap.max": "15.9gb",
"heap.percent": "25"
},
{
"heap.current": "3.9gb",
"heap.max": "15.9gb",
"heap.percent": "24"
},
...
]
}
_nodes/stats is more verbose, though, and includes all the other node details as well.
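As a rough sketch of how the heap check above could be scripted, the following Python uses only the standard library to call _cat/nodes with format=json and print one line per node. The base URL passed to fetch_heap_rows is a placeholder for your own endpoint, and both function names are made up for this illustration.

```python
import json
import urllib.request


def fetch_heap_rows(base_url, timeout=10):
    """Fetch per-node heap columns from the _cat/nodes API as a list of dicts.

    base_url is a placeholder such as "http://<es_url>:9200" (assumption).
    """
    url = (
        f"{base_url}/_cat/nodes"
        "?h=name,heap.current,heap.max,heap.percent&format=json"
    )
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def format_heap_rows(rows):
    """Turn the rows returned by _cat/nodes into readable one-line summaries."""
    lines = []
    for row in rows:
        name = row.get("name", "?")  # node name may be absent if not requested
        lines.append(
            f"{name}: {row['heap.current']} of {row['heap.max']} "
            f"({row['heap.percent']}%)"
        )
    return lines
```

Calling format_heap_rows(fetch_heap_rows("http://<es_url>:9200")) against a reachable cluster should show the real heap.max per node, which is the figure to compare against the instance's RAM.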
I accidentally deleted all the files under /home/plog/elk/data/elasticsearch-data, after which the cluster health status turned red.
I restarted the Elasticsearch daemons, and the log reported that it could not find the file /home/plog/elk/data/elasticsearch-data/nodes/0/node.lock.
I then started Elasticsearch on each server separately, which regenerated node.lock automatically and resolved the missing-file error.
However, the health status is still red. The log shows "failed to write index state", caused by "Underlying file changed by an external force".
How can I solve this problem?
The error log is as follows:
[2019-03-25T10:23:32,610][WARN ][o.e.g.MetaStateService ] [es] [[test_2019.03.10/L3uPPm-vSSW_aG6Qvzih5A]]: failed to write index state
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2019-03-25T02:03:45.489478Z, (lock=NativeFSLock(path=/home/plog/elk/data/elasticsearch-data/nodes/0/node.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2019-02-27T08:51:35.409994Z))
at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191) ~[lucene-core-7.5.0.jar:7.5.0 b5bf70b7e32d7ddd9742cc821d471c5fabd4e3df - jimczi - 2018-09-18 13:01:13]
at org.elasticsearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:999) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.env.NodeEnvironment.indexPaths(NodeEnvironment.java:798) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.gateway.MetaStateService.writeIndex(MetaStateService.java:124) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.gateway.GatewayMetaState.applyClusterState(GatewayMetaState.java:173) ~[elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:481) ~[elasticsearch-6.5.4.jar:6.5.4]
at java.lang.Iterable.forEach(Iterable.java:75) [?:1.8.0_202]
at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:478) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:465) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:416) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:160) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.5.4.jar:6.5.4]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.5.4.jar:6.5.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
Solved. I deleted all the indexes, then deleted the nodes folder and started again.
Check out this discussion thread: https://discuss.elastic.co/t/distress-elasticsearch-does-not-start/152288
I am using Sqoop to import data from Oracle to HDFS. When the job starts, it gets stuck at 5% progress for about an hour and outputs this message:
INFO mapreduce.Job: Task Id : attempt_1535519556038_0015_m_000037_0, Status : FAILED
Container launch failed for container_1535519556038_0015_01_000043 : org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1536133107764 found 1536133094775
Note: System times on machines may be out of sync. Check system time and time zones.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168)
at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:155)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:375)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
After that it continues until the job terminates successfully and all the data is imported. So my questions are: why does the job hang at 5% progress? Why does it correct itself? Is this normal? If not, could it be related to the message above? How can I fix it?
The error message states the cause directly: "Unauthorized request to start container. This token is expired".
One option is to increase the lifespan of the container by setting yarn.resourcemanager.rm.container-allocation.expiry-interval-ms, which defaults to 10 minutes.
Note: The jobs will work if you increase yarn.resourcemanager.rm.container-allocation.expiry-interval-ms in the yarn-site.xml config file:
<property>
<name>yarn.resourcemanager.rm.container-allocation.expiry-interval-ms</name>
<value>1000000</value>
</property>
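To put numbers on this, here is a small arithmetic sketch in Python using the two epoch-millisecond values from the log message in the question. It shows how far past expiry the token was when the container launch was attempted, and converts the default and the suggested interval to minutes. Since the log itself warns that system times may be out of sync, a gap of only a few seconds may also be worth checking against clock drift between the nodes; that reading is an assumption, not something the log confirms.

```python
# Values copied from the log message in the question (epoch milliseconds).
current_time_ms = 1536133107764  # "current time is ..."
token_expiry_ms = 1536133094775  # "found ..."

# How long after its expiry the token was presented.
overdue_ms = current_time_ms - token_expiry_ms


def ms_to_minutes(ms):
    """Convert milliseconds to minutes."""
    return ms / 60_000


print(f"token was {overdue_ms} ms ({overdue_ms / 1000:.1f} s) past expiry")
print(f"default interval:   {ms_to_minutes(600_000):.1f} min")
print(f"suggested interval: {ms_to_minutes(1_000_000):.1f} min")
```

The token was presented 12989 ms after it expired, and the value 1000000 in the property above corresponds to roughly 16.7 minutes instead of the default 10.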
I'm running a 5-node Elasticsearch cluster (2 data nodes, 2 master nodes, 1 Kibana).
I get the following error when I run the command
curl -X GET "192.168.107.75:9200/_cat/master?v"
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}
],"type":"master_not_discovered_exception","reason":null},"status":503}
I'm using the following command to start Elasticsearch:
sudo systemctl start elasticsearch.service
This is the message I see in the logs:
[2018-05-28T21:02:22,074][WARN ][o.e.d.z.ZenDiscovery ] [node-master-1] not enough master nodes discovered during pinging (found [[Candidate{node={node-master-1}{kJKYkpdbTKmdIeq-RVnCAQ}{JGbXMxOXR0SyjCu746Zlwg}{192.168.107.75}{192.168.107.75:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2018-05-28T21:02:25,076][WARN ][o.e.d.z.ZenDiscovery ] [node-master-1] not enough master nodes discovered during pinging (found [[Candidate{node={node-master-1}{kJKYkpdbTKmdIeq-RVnCAQ}{JGbXMxOXR0SyjCu746Zlwg}{192.168.107.75}{192.168.107.75:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2018-05-28T21:02:28,077][WARN ][o.e.d.z.ZenDiscovery ] [node-master-1] not enough master nodes discovered during pinging (found [[Candidate{node={node-master-1}{kJKYkpdbTKmdIeq-RVnCAQ}{JGbXMxOXR0SyjCu746Zlwg}{192.168.107.75}{192.168.107.75:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2018-05-28T21:02:31,079][WARN ][o.e.d.z.ZenDiscovery ] [node-master-1] not enough master nodes discovered during pinging (found [[Candidate{node={node-master-1}{kJKYkpdbTKmdIeq-RVnCAQ}{JGbXMxOXR0SyjCu746Zlwg}{192.168.107.75}{192.168.107.75:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2018-05-28T21:02:34,081][WARN ][o.e.d.z.ZenDiscovery ] [node-master-1] not enough master nodes discovered during pinging (found [[Candidate{node={node-master-1}{kJKYkpdbTKmdIeq-RVnCAQ}{JGbXMxOXR0SyjCu746Zlwg}{192.168.107.75}{192.168.107.75:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2018-05-28T21:02:37,084][WARN ][o.e.d.z.ZenDiscovery ] [node-master-1] not enough master nodes discovered during pinging (found [[Candidate{node={node-master-1}{kJKYkpdbTKmdIeq-RVnCAQ}{JGbXMxOXR0SyjCu746Zlwg}{192.168.107.75}{192.168.107.75:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2018-05-28T21:02:40,090][WARN ][o.e.d.z.ZenDiscovery ] [node-master-1] failed to connect to master [{node-master-2}{_M4BTrFbQguT3PbY5d2_JA}{1rzJcDPSQ5OH2OZ_CnhR-g}{192.168.107.76}{192.168.107.76:9300}], retrying...
org.elasticsearch.transport.ConnectTransportException: [node-master-2][192.168.107.76:9300] connect_exception
at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:165) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:616) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:513) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:331) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:318) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.discovery.zen.ZenDiscovery.joinElectedMaster(ZenDiscovery.java:515) [elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.discovery.zen.ZenDiscovery.innerJoinCluster(ZenDiscovery.java:483) [elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.discovery.zen.ZenDiscovery.access$2500(ZenDiscovery.java:90) [elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.discovery.zen.ZenDiscovery$JoinThreadControl$1.run(ZenDiscovery.java:1253) [elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:573) [elasticsearch-6.2.4.jar:6.2.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_172]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_172]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: 192.168.107.76/192.168.107.76:9300
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
... 1 more
Caused by: java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323) ~[?:?]
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) ~[?:?]
... 1 more
In the elasticsearch.yml file, apart from the settings that assign different roles to the nodes, I'm using the following configuration:
cluster.name: test_cluster
network.host: 192.168.107.71
discovery.zen.ping.unicast.hosts: ["192.168.107.73", "192.168.107.74", "192.168.107.75", "192.168.107.76"]
#the above two configuration IPs change as per the node
discovery.zen.minimum_master_nodes: 2
The hosts are pingable and have access to each other.
Any help would be much appreciated.
I think the problem is quite clear: either [node-master-2][192.168.107.76] is not accessible from this host, or the Elasticsearch process on [node-master-2] is down.
You can check whether curl -XGET "192.168.107.76:9200" from this host returns a valid answer.
The Elasticsearch documentation also explicitly says:
It is recommended to avoid having only two master eligible nodes,
since a quorum of two is two. Therefore, a loss of either master
eligible node will result in an inoperable cluster.
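The arithmetic behind that recommendation can be sketched in a few lines of Python. The quorum for discovery.zen.minimum_master_nodes is (master-eligible nodes / 2) + 1 using integer division; the function names below are illustrative only.

```python
def minimum_master_nodes(master_eligible):
    """Quorum for zen discovery: (master_eligible / 2) + 1, integer division."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible):
    """How many master-eligible nodes can be lost while keeping a quorum."""
    return master_eligible - minimum_master_nodes(master_eligible)


for n in (2, 3):
    print(
        f"{n} master-eligible nodes: quorum={minimum_master_nodes(n)}, "
        f"tolerated failures={tolerated_failures(n)}"
    )
```

With the two master nodes described in the question the quorum is 2, so losing either one (or losing connectivity to it) leaves the cluster without a master. A third master-eligible node keeps the quorum at 2 while tolerating one failure.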
This Elasticsearch install guide provides guidance on how to fix master_not_discovered_exception errors. Basically, you can get this error for several reasons:
A firewall rule is blocking communication
Master / data host names cannot be resolved (this won't be your case, as you are using IP addresses)
Incorrect elasticsearch.yml configuration (e.g. the master node is not configured as a master node, or is running on a different port / IP address).
The first and second items can easily be checked with telnet (telnet from the master to the data node, and the other way around).
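If telnet is not installed, the same reachability check can be approximated with Python's standard socket module. This is only a sketch: the function name port_is_reachable is made up for this example, and the host and port you pass in are whatever applies to your cluster (the logs in the question show the transport port 9300, and the curl commands use 9200).

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    Covers refused connections, timeouts, "no route to host" and
    name-resolution failures, all of which are subclasses of OSError.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, running port_is_reachable("192.168.107.76", 9300) on node-master-1 and the reverse check on node-master-2 would show whether the transport port is open in both directions. A False result points at a firewall rule, routing, or a stopped process rather than at the Elasticsearch configuration itself.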
I'm working on a topology that takes data from Kafka and persists it into Elasticsearch. First, I used the basic KafkaSpout from the Storm dependency to listen for data coming from a specific Kafka topic, and I re-implemented the Elasticsearch bolt from the elasticsearch-hadoop project: https://github.com/elastic/elasticsearch-hadoop/blob/master/storm/src/main/java/org/elasticsearch/storm/EsBolt.java. The goal was to write to several indices in Elasticsearch.
When I process the messages coming from Kafka, I get exceptions as the amount of data in the Kafka queue grows. This is part of the stack trace in the worker logs:
2016-04-13T22:24:44.641+0000 b.s.m.n.Client [ERROR] failed to send 580 messages to Netty-Client-ip-[internal-ip].ec2.internal/[internal-ip]:6700:
java.nio.channels.ClosedChannelException
2016-04-13T22:24:44.641+0000 b.s.m.n.Client [ERROR] failed to send 575 messages to Netty-Client-ip-[internal-ip].ec2.internal/[internal-ip]:6700:
java.nio.channels.ClosedChannelException
2016-04-13T22:25:05.970+0000 b.s.m.n.Client [WARN] Re-connection to ip-[internal-ip].ec2.internal/[internal-ip]:6701 was successful but 52890 messages
has been lost so far
2016-04-13T22:36:33.571+0000 b.s.m.n.StormClientHandler [INFO] Connection failed Netty-Client-ip-ip-[internal-ip].ec2.internal/[internal-ip]:6701
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.8.0_77]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.8.0_77]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.8.0_77]
at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.8.0_77]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[na:1.8.0_77]
at org.apache.storm.netty.channel.socket.nio.NioWorker.read(NioWorker.java:64) [storm-core-0.9.6.jar:0.9.6]
at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [storm-core-0.9.6.jar:0.9.6]
at org.apache.storm.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [storm-core-0.9.6.jar:0.9.6]
at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) [storm-core-0.9.6.jar:0.9.6]
at org.apache.storm.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [storm-core-0.9.6.jar:0.9.6]
at org.apache.storm.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [storm-core-0.9.6.jar:0.9.6]
at org.apache.storm.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [storm-core-0.9.6.jar:0.9.6]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_77]
I'm using a Storm cluster of 3 nodes (1 nimbus+UI+Zookeeper and 2 supervisors), running Storm version 0.9.6. Each of these machines has 4 GB of RAM, and this is the content of my storm.yml config file:
storm.zookeeper.servers:
- "nimbus-ip"
storm.local.dir: "/mnt/storm"
nimbus.seeds: ["nimbus-ip"]
storm.zookeeper.port: 2181
ui.port: 8080
nimbus.host: "nimbus-ip"
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
storm.messaging.netty.max_wait_ms: 10000
Can anyone help me understand why the workers can't communicate, apparently because of Netty-Client hostname resolution? I already saw one report of this issue for version 0.9.4 of Storm: https://issues.apache.org/jira/browse/STORM-908. Is it possible that version 0.9.6 does not fix this issue?
Many thanks!!
I got here from Google while looking for answers to a similar problem. In my case, the error was:
o.a.s.m.n.Client [ERROR] connection attempt 104 to Netty-Client-ip-XXX-XXX-XXX-XXX.ec2.internal/XXX.XXX.XXX.XXX:6703 failed: java.net.ConnectException: Connection refused: ip-XXX-XXX-XXX-XXX.ec2.internal/XXX.XXX.XXX.XXX:6703
This was appearing on a 2-node storm cluster (v1.0.1).
At first, I thought this was a networking issue with AWS (which is where I was deploying the nodes). I started to look at security group rules, /etc/hosts files and so on, none of which helped.
After some searching I discovered this: https://issues.apache.org/jira/browse/STORM-1382 and figured that maybe the issue wasn't the network at all, but that something on the other end wasn't running.
So I ssh-ed into a worker node and took a look at the supervisor log, which showed many lines like this:
o.a.s.d.supervisor [INFO] 30236e62-d2e1-4d5c-b75c-f54ef07653a4 still hasn't started
When I looked at the worker.log itself, I discovered there was a problem with the default java version. That was my problem, but other people's problems may be related to other reasons that a worker may fail.
Anyway, once I set the correct default java version it all kicked into life.
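As a quick way to spot this symptom, the Python sketch below counts "still hasn't started" lines per worker ID in supervisor log text. The regular expression is modelled on the single log line quoted above and may need adjusting for other Storm versions; the function name is made up for this example.

```python
import re
from collections import Counter

# Pattern modelled on the supervisor log line quoted above (assumption:
# other Storm versions may format this message differently).
STALLED = re.compile(r"\[INFO\]\s+(\S+) still hasn't started")


def count_stalled_workers(log_lines):
    """Count "still hasn't started" messages per worker ID."""
    counts = Counter()
    for line in log_lines:
        match = STALLED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A worker ID that keeps accumulating these messages suggests that the worker process never came up, in which case the worker.log for that worker is the next place to look.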
My Elasticsearch cluster (version 2.0) is started and the node client is built successfully, but for some reason I get the following error when running queries with the node client.
20:15:15.479 [Pool:entitytaskscheduler: Thread#1] DEBUG c.b.o.e.t.c.DataCollectorStatusUpdateTask - collectors updated due to agent reconnected:{}
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];]
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:154)
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:144)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.<init>(TransportSearchTypeAction.java:116)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.<init>(TransportSearchQueryThenFetchAction.java:73)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.<init>(TransportSearchQueryThenFetchAction.java:67)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:64)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:53)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:70)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:99)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:44)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:70)
at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:58)
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:347)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:85)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59)
at com.hidden.ppp.management.dc.DataCollectorPollStatusDAOESImpl.findDCIdsUpdatedInTime(DataCollectorPollStatusDAOESImpl.java:151)
at com.hidden.ppp.engine.taskexecutor.cptaskexecs.DataCollectorStatusUpdateTask.execute(DataCollectorStatusUpdateTask.java:199)
at com.hidden.ppp.engine.taskexecutor.cptaskexecs.DataCollectorStatusUpdateTaskRunner.run(DataCollectorStatusUpdateTaskRunner.java:27)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
20:15:15.558 [Pool:entitytaskscheduler: Thread#1] WARN c.b.o.m.d.DataCollectorPollStatusDAOESImpl - blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];
20:15:15.558 [Pool:entitytaskscheduler: Thread#1] DEBUG c.b.o.e.t.c.DataCollectorStatusUpdateTask - collectors for which polls updated after epoc time:1453128243336 - dcids: []
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];]
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:154)
at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:144)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.<init>(TransportSearchTypeAction.java:116)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.<init>(TransportSearchQueryThenFetchAction.java:73)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.<init>(TransportSearchQueryThenFetchAction.java:67)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:64)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:53)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:70)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:99)
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:44)
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:70)
at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:58)
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:347)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:85)
at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:59)
at com.hidden.ppp.management.dc.DataCollectorPollStatusDAOESImpl.findDCIdsNotUpdatedInTime(DataCollectorPollStatusDAOESImpl.java:182)
at com.hidden.ppp.engine.taskexecutor.cptaskexecs.DataCollectorStatusUpdateTask.execute(DataCollectorStatusUpdateTask.java:204)
at com.hidden.ppp.engine.taskexecutor.cptaskexecs.DataCollectorStatusUpdateTaskRunner.run(DataCollectorStatusUpdateTaskRunner.java:27)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
I've even disabled multicast as per this post, still no luck. Surprisingly, I can access Elasticsearch from Sense. Any clues about what is going wrong?
I faced the same error message and could not understand the problem at first. I was developing a node client Java application on my laptop, using an Elasticsearch data node on a remote server. For production use, I needed to deploy the Java application on this remote server.
I configured the Java application to talk to the local host only (being on the same host now):
elasticsearch.discovery.zen.ping.unicast.hosts=127.0.0.1
And got the same exception
ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];]
Looking at the logs I also found this entry:
[WARN] [TP-Processor2] DiscoveryService.waitForInitialState -> [cerbera] waited for 30s and no initial state was set by the discovery
So basically, the question was: Why doesn't it find the Elasticsearch data node? I changed port ranges and also played with the multicast setting - without success.
Finally, I checked elasticsearch.yml and found the data node not listening to localhost (127.0.0.1), but instead on the ethernet interface 192.168.1.2.
network.host: 192.168.1.2
http.port: 9200
The final change was simple: I just needed to reconfigure the node client to talk to the correct interface:
elasticsearch.discovery.zen.ping.unicast.hosts=192.168.1.2
Now my node client is talking to elasticsearch via the correct interface. Job done.
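The mismatch described above can be caught with a simple comparison. The Python sketch below reads network.host from the text of elasticsearch.yml and checks whether it appears in the client's unicast host list. It is a deliberate simplification: it only handles flat "key: value" lines and a single literal address, not special values or lists, and both function names are made up for this example.

```python
def read_setting(yml_text, key):
    """Return the value of a flat `key: value` line, or None if absent."""
    for raw in yml_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or ":" not in line:
            continue  # skip comments and lines without a key/value pair
        name, _, value = line.partition(":")
        if name.strip() == key:
            return value.strip().strip("\"'")
    return None


def client_targets_bound_host(yml_text, unicast_hosts):
    """True if the node's network.host is among the client's unicast hosts."""
    bound = read_setting(yml_text, "network.host")
    return bound is not None and bound in unicast_hosts


server_yml = "network.host: 192.168.1.2\nhttp.port: 9200\n"
print(client_targets_bound_host(server_yml, ["127.0.0.1"]))
print(client_targets_bound_host(server_yml, ["192.168.1.2"]))
```

With the settings from this answer, the first call reflects the broken configuration (client pointing at 127.0.0.1) and the second the working one.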
I had the same problem (using k8s). I finally replaced my Elasticsearch image and the issue was solved.
I moved from 6.5.4-debian-9-r41 to 6.8.16-debian-10-r5 (using Bitnami images).
I know it is not the best answer, but I tried the suggested answers and nothing worked for me, so my recommendation is to update to a newer version. (Docker makes that easy :) )