What is causing Elasticsearch to shut down shortly after starting up?

I'm having an issue with Elasticsearch on EC2 where I'm starting up several new instances from the same AMI, and very occasionally (like < 1% of the time), the Elasticsearch service will stop shortly after starting. I've looked at the log file, but it's not really clear to me why the service is stopping. Are there any clues in this that I'm missing, or is there anywhere else I should look for logs when this happens?
[2020-07-28T18:17:44,251][INFO ][o.e.c.c.ClusterBootstrapService] [ip-10-0-0-68] no discovery configuration found, will perform best-effort cluster bootstrapping after [3s] unless existing master is discovered
[2020-07-28T18:17:44,375][INFO ][o.e.c.s.MasterService ] [ip-10-0-0-68] elected-as-master ([1] nodes joined)[{ip-10-0-0-68}{C1lEYCg6RUWry4avn4isxw}{IjXE3KNOQO2UeZyrX2o3FA}{127.0.0.1}{127.0.0.1:9300}{dilm}{ml.machine_memory=32601837568, xpack.installed=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 4, version: 26, delta: master node changed {previous [], current [{ip-10-0-0-68}{C1lEYCg6RUWry4avn4isxw}{IjXE3KNOQO2UeZyrX2o3FA}{127.0.0.1}{127.0.0.1:9300}{dilm}{ml.machine_memory=32601837568, xpack.installed=true, ml.max_open_jobs=20}]}
[2020-07-28T18:17:44,416][INFO ][o.e.c.s.ClusterApplierService] [ip-10-0-0-68] master node changed {previous [], current [{ip-10-0-0-68}{C1lEYCg6RUWry4avn4isxw}{IjXE3KNOQO2UeZyrX2o3FA}{127.0.0.1}{127.0.0.1:9300}{dilm}{ml.machine_memory=32601837568, xpack.installed=true, ml.max_open_jobs=20}]}, term: 4, version: 26, reason: Publication{term=4, version=26}
[2020-07-28T18:17:44,446][INFO ][o.e.h.AbstractHttpServerTransport] [ip-10-0-0-68] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2020-07-28T18:17:44,447][INFO ][o.e.n.Node ] [ip-10-0-0-68] started
[2020-07-28T18:17:44,595][INFO ][o.e.l.LicenseService ] [ip-10-0-0-68] license [a9a29e21-5167-497e-9e49-ccc785ea2d47] mode [basic] - valid
[2020-07-28T18:17:44,596][INFO ][o.e.x.s.s.SecurityStatusChangeListener] [ip-10-0-0-68] Active license is now [BASIC]; Security is disabled
[2020-07-28T18:17:44,602][INFO ][o.e.g.GatewayService ] [ip-10-0-0-68] recovered [0] indices into cluster_state
[2020-07-28T18:18:29,947][INFO ][o.e.n.Node ] [ip-10-0-0-68] stopping ...
[2020-07-28T18:18:29,962][INFO ][o.e.x.w.WatcherService ] [ip-10-0-0-68] stopping watch service, reason [shutdown initiated]
[2020-07-28T18:18:29,963][INFO ][o.e.x.w.WatcherLifeCycleService] [ip-10-0-0-68] watcher has stopped and shutdown
[2020-07-28T18:18:30,014][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [ip-10-0-0-68] [controller/2184] [Main.cc#150] Ml controller exiting
[2020-07-28T18:18:30,015][INFO ][o.e.x.m.p.NativeController] [ip-10-0-0-68] Native controller process has stopped - no new native processes can be started
[2020-07-28T18:18:30,024][INFO ][o.e.n.Node ] [ip-10-0-0-68] stopped
[2020-07-28T18:18:30,024][INFO ][o.e.n.Node ] [ip-10-0-0-68] closing ...
[2020-07-28T18:18:30,032][INFO ][o.e.n.Node ] [ip-10-0-0-68] closed

[2020-07-28T18:18:29,947][INFO ][o.e.n.Node ] [ip-10-0-0-68] stopping ...
This log line means Elasticsearch shut down gracefully after receiving a shutdown signal (typically SIGTERM) from an external source. It's not possible to say from this log what that external source is; it depends on your system. It could, for instance, be systemd, if that's how you're starting Elasticsearch. If so, hopefully its logs tell you why it's sending that shutdown signal.
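One way to hunt for the signal's origin is the system logs. A minimal sketch, assuming the service runs under systemd as a unit named `elasticsearch` (adjust the unit name and log paths for your AMI):

```shell
# Hedged diagnostics: find out who sent the shutdown signal.
if command -v journalctl >/dev/null 2>&1; then
  # systemd's own log for the unit: look for "Stopping ..." and what triggered it
  journalctl -u elasticsearch --no-pager -n 50 || true
  # kernel log: rule out the OOM killer terminating the JVM
  journalctl -k --no-pager 2>/dev/null | grep -iE 'oom|killed process' | tail -n 5 || true
else
  # non-systemd hosts: fall back to the classic syslog files
  grep -iE 'elasticsearch|oom' /var/log/syslog /var/log/messages 2>/dev/null | tail -n 20 || true
fi
```

On EC2 it is also worth checking whether anything (an instance reboot, a config-management agent, a health-check script) restarts or stops the box around the timestamp in question.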

Related

elastic search server not running

I have downloaded Elasticsearch on my laptop, but whenever I go to its bin folder and run elasticsearch.bat on Windows, some logs appear but the server doesn't start or show up in the browser.
Logs are pasted below:
warning: ignoring JAVA_HOME=C:\Program Files\Java\jdk1.8.0_151; using bundled JDK
[2022-09-20T21:53:00,089][INFO ][o.e.n.Node ] [LAPTOP-8VG1D5TB] version[8.4.1], pid[14672], build[zip/2bd229c8e56650b42e40992322a76e7914258f0c/2022-08-26T12:11:43.232597118Z], OS[Windows 10/10.0/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/18.0.2/18.0.2+9-61]
[2022-09-20T21:53:00,099][INFO ][o.e.n.Node ] [LAPTOP-8VG1D5TB] JVM home [G:\elastic stack\elasticsearch-8.4.1\jdk], using bundled JDK [true]
[2022-09-20T21:53:00,100][INFO ][o.e.n.Node ] [LAPTOP-8VG1D5TB] JVM arguments [-Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -Djava.security.manager=allow, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Dlog4j2.formatMsgNoLookups=true, -Djava.locale.providers=SPI,COMPAT, --add-opens=java.base/java.io=ALL-UNNAMED, -XX:+UseG1GC, -Djava.io.tmpdir=C:\Users\HP\AppData\Local\Temp\elasticsearch, -XX:+HeapDumpOnOutOfMemoryError, -XX:+ExitOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Xms4053m, -Xmx4053m, -XX:MaxDirectMemorySize=2125463552, -XX:G1HeapRegionSize=4m, -XX:InitiatingHeapOccupancyPercent=30, -XX:G1ReservePercent=15, -Des.distribution.type=zip, --module-path=G:\elastic stack\elasticsearch-8.4.1\lib, --add-modules=jdk.net, -Djdk.module.main=org.elasticsearch.server]
[2022-09-20T21:53:13,055][INFO ][c.a.c.i.j.JacksonVersion ] [LAPTOP-8VG1D5TB] Package versions: jackson-annotations=2.13.2, jackson-core=2.13.2, jackson-databind=2.13.2.2, jackson-dataformat-xml=2.13.2, jackson-datatype-jsr310=2.13.2, azure-core=1.27.0, Troubleshooting version conflicts: https://aka.ms/azsdk/java/dependency/troubleshoot
[2022-09-20T21:53:18,911][INFO ][o.e.p.PluginsService ] [LAPTOP-8VG1D5TB] loaded module [x-pack-voting-only-node]
[2022-09-20T21:53:18,912][INFO ][o.e.p.PluginsService ] [LAPTOP-8VG1D5TB] loaded module [x-pack-watcher]
[2022-09-20T21:53:18,913][INFO ][o.e.p.PluginsService ] [LAPTOP-8VG1D5TB] no plugins loaded
[2022-09-20T21:53:29,454][INFO ][o.e.e.NodeEnvironment ] [LAPTOP-8VG1D5TB] using [1] data paths, mounts [[New Volume (G:)]], net usable_space [246.3gb], net total_space [258.4gb], types [NTFS]
[2022-09-20T21:53:29,455][INFO ][o.e.e.NodeEnvironment ] [LAPTOP-8VG1D5TB] heap size [3.9gb], compressed ordinary object pointers [true]
[2022-09-20T21:53:29,737][INFO ][o.e.n.Node ] [LAPTOP-8VG1D5TB] node name [LAPTOP-8VG1D5TB], node ID [cWMr2jqXSdyI_w8NwYQdjw], cluster name [elasticsearch], roles [ingest, data_cold, data, remote_cluster_client, master, data_warm, data_content, transform, data_hot, ml, data_frozen]
[2022-09-20T21:53:41,627][INFO ][o.e.x.s.Security ] [LAPTOP-8VG1D5TB] Security is enabled
[2022-09-20T21:53:42,089][INFO ][o.e.x.s.a.s.FileRolesStore] [LAPTOP-8VG1D5TB] parsed [0] roles from file [G:\elastic stack\elasticsearch-8.4.1\config\roles.yml]
[2022-09-20T21:53:43,195][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [LAPTOP-8VG1D5TB] [controller/744] [Main.cc#123] controller (64 bit): Version 8.4.1 (Build c0373714f3bc4b) Copyright (c) 2022 Elasticsearch BV
[2022-09-20T21:53:44,488][INFO ][o.e.t.n.NettyAllocator ] [LAPTOP-8VG1D5TB] creating NettyAllocator with the following configs: [name=elasticsearch_configured, chunk_size=1mb, suggested_max_allocation_size=1mb, factors={es.unsafe.use_netty_default_chunk_and_page_size=false, g1gc_enabled=true, g1gc_region_size=4mb}]
[2022-09-20T21:53:44,545][INFO ][o.e.i.r.RecoverySettings ] [LAPTOP-8VG1D5TB] using rate limit [40mb] with [default=40mb, read=0b, write=0b, max=0b]
[2022-09-20T21:53:44,668][INFO ][o.e.d.DiscoveryModule ] [LAPTOP-8VG1D5TB] using discovery type [multi-node] and seed hosts providers [settings]
[2022-09-20T21:53:48,249][INFO ][o.e.n.Node ] [LAPTOP-8VG1D5TB] initialized
[2022-09-20T21:53:48,251][INFO ][o.e.n.Node ] [LAPTOP-8VG1D5TB] starting ...
[2022-09-20T21:53:48,313][INFO ][o.e.x.s.c.f.PersistentCache] [LAPTOP-8VG1D5TB] persistent cache index loaded
[2022-09-20T21:53:48,315][INFO ][o.e.x.d.l.DeprecationIndexingComponent] [LAPTOP-8VG1D5TB] deprecation component started
[2022-09-20T21:53:48,698][INFO ][o.e.t.TransportService ] [LAPTOP-8VG1D5TB] publish_address {127.0.0.1:9300}, bound_addresses {127.0.0.1:9300}, {[::1]:9300}
[2022-09-20T21:53:50,024][WARN ][o.e.c.c.ClusterBootstrapService] [LAPTOP-8VG1D5TB] this node is locked into cluster UUID [jxCXal6sRFuAT73DX5e-0w] but [cluster.initial_master_nodes] is set to [LAPTOP-8VG1D5TB]; remove this setting to avoid possible data loss caused by subsequent cluster bootstrap attempts
[2022-09-20T21:53:50,370][INFO ][o.e.c.s.MasterService ] [LAPTOP-8VG1D5TB] elected-as-master ([1] nodes joined)[_FINISH_ELECTION_, {LAPTOP-8VG1D5TB}{cWMr2jqXSdyI_w8NwYQdjw}{S-HMyyEWTgW7OjvE4XtKJg}{LAPTOP-8VG1D5TB}{127.0.0.1}{127.0.0.1:9300}{cdfhilmrstw} completing election], term: 2, version: 30, delta: master node changed {previous [], current [{LAPTOP-8VG1D5TB}{cWMr2jqXSdyI_w8NwYQdjw}{S-HMyyEWTgW7OjvE4XtKJg}{LAPTOP-8VG1D5TB}{127.0.0.1}{127.0.0.1:9300}{cdfhilmrstw}]}
[2022-09-20T21:53:50,574][INFO ][o.e.c.s.ClusterApplierService] [LAPTOP-8VG1D5TB] master node changed {previous [], current [{LAPTOP-8VG1D5TB}{cWMr2jqXSdyI_w8NwYQdjw}{S-HMyyEWTgW7OjvE4XtKJg}{LAPTOP-8VG1D5TB}{127.0.0.1}{127.0.0.1:9300}{cdfhilmrstw}]}, term: 2, version: 30, reason: Publication{term=2, version=30}
[2022-09-20T21:53:50,667][INFO ][o.e.r.s.FileSettingsService] [LAPTOP-8VG1D5TB] starting file settings watcher ...
[2022-09-20T21:53:50,740][INFO ][o.e.r.s.FileSettingsService] [LAPTOP-8VG1D5TB] file settings service up and running [tid=55]
[2022-09-20T21:53:50,854][INFO ][o.e.h.AbstractHttpServerTransport] [LAPTOP-8VG1D5TB] publish_address {192.168.1.6:9200}, bound_addresses {[::]:9200}
[2022-09-20T21:53:50,857][INFO ][o.e.n.Node ] [LAPTOP-8VG1D5TB] started {LAPTOP-8VG1D5TB}{cWMr2jqXSdyI_w8NwYQdjw}{S-HMyyEWTgW7OjvE4XtKJg}{LAPTOP-8VG1D5TB}{127.0.0.1}{127.0.0.1:9300}{cdfhilmrstw}{xpack.installed=true, ml.allocated_processors=4, ml.max_jvm_size=4253024256, ml.machine_memory=8500776960}
[2022-09-20T21:53:51,059][INFO ][o.e.l.LicenseService ] [LAPTOP-8VG1D5TB] license [b3387b5e-8844-40c0-a4fe-8bb3b74b43d6] mode [basic] - valid
[2022-09-20T21:53:51,062][INFO ][o.e.x.s.a.Realms ] [LAPTOP-8VG1D5TB] license mode is [basic], currently licensed security realms are [reserved/reserved,file/default_file,native/default_native]
[2022-09-20T21:53:51,071][INFO ][o.e.g.GatewayService ] [LAPTOP-8VG1D5TB] recovered [2] indices into cluster_state
[2022-09-20T21:53:51,440][ERROR][o.e.i.g.GeoIpDownloader ] [LAPTOP-8VG1D5TB] exception during geoip databases update
org.elasticsearch.ElasticsearchException: not all primary shards of [.geoip_databases] index are active
at org.elasticsearch.ingest.geoip#8.4.1/org.elasticsearch.ingest.geoip.GeoIpDownloader.updateDatabases(GeoIpDownloader.java:134)
at org.elasticsearch.ingest.geoip#8.4.1/org.elasticsearch.ingest.geoip.GeoIpDownloader.runDownloader(GeoIpDownloader.java:274)
at org.elasticsearch.ingest.geoip#8.4.1/org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor.nodeOperation(GeoIpDownloaderTaskExecutor.java:102)
at org.elasticsearch.ingest.geoip#8.4.1/org.elasticsearch.ingest.geoip.GeoIpDownloaderTaskExecutor.nodeOperation(GeoIpDownloaderTaskExecutor.java:48)
at org.elasticsearch.server#8.4.1/org.elasticsearch.persistent.NodePersistentTasksExecutor$1.doRun(NodePersistentTasksExecutor.java:42)
See logs for more details.
[2022-09-20T21:53:52,678][INFO ][o.e.c.r.a.AllocationService] [LAPTOP-8VG1D5TB] current.health="GREEN" message="Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[.security-7][0]]])." previous.health="RED" reason="shards started [[.security-7][0]]"
[2022-09-20T21:53:53,356][INFO ][o.e.i.g.DatabaseNodeService] [LAPTOP-8VG1D5TB] successfully loaded geoip database file [GeoLite2-Country.mmdb]
[2022-09-20T21:53:53,686][INFO ][o.e.i.g.DatabaseNodeService] [LAPTOP-8VG1D5TB] successfully loaded geoip database file [GeoLite2-ASN.mmdb]
[2022-09-20T21:53:59,679][INFO ][o.e.i.g.DatabaseNodeService] [LAPTOP-8VG1D5TB] successfully loaded geoip database file [GeoLite2-City.mmdb]
[2022-09-20T21:56:09,025][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [LAPTOP-8VG1D5TB] received plaintext traffic on an encrypted channel, closing connection Netty4TcpChannel{localAddress=/127.0.0.1:9300, remoteAddress=/127.0.0.1:63342, profile=default}
Can someone tell me where the problem lies and how to resolve it? I tried all the addresses given in the log, but every time I got no response in the browser.
Two pictures are attached.
Find the elasticsearch.yml configuration file and disable security authentication so the node can be accessed without credentials: change both of the security settings from true to false.
Then add a new line to the file: ingest.geoip.downloader.enabled: false.
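For reference, the settings this answer refers to look roughly like this in elasticsearch.yml (8.x setting names; the exact keys present in your auto-generated config may differ, so treat those as authoritative):

```yaml
# elasticsearch.yml - disable security for local testing only
xpack.security.enabled: false
xpack.security.enrollment.enabled: false

# stop the GeoIP downloader that logs errors at startup
ingest.geoip.downloader.enabled: false
```

Note that the `received plaintext traffic on an encrypted channel` warning in the log is consistent with security/TLS being enabled while the browser speaks plain HTTP; disabling security makes plain `http://localhost:9200` work.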

URL to access cluster environment for ElasticSearch2.4.3

We have an Elasticsearch 2.4.3 cluster of two nodes. What URL should I provide to access the environment so that it works with high availability?
We have two master-eligible nodes, Node1 and Node2. The hostname for Node1 is node1.elastic.com and for Node2 it is node2.elastic.com. Both nodes are master-eligible according to the formula (n/2 + 1).
We have enabled clustering by modifying the elasticsearch.yml file, adding discovery.zen.ping.unicast.hosts for the two nodes.
From our Java application, we connect to node1.elastic.com. It works fine as long as both nodes are up: data is populated on both ES servers and everything is good. But as soon as Node1 goes down, the whole Elasticsearch cluster becomes unreachable to the application; it doesn't automatically switch to Node2 for processing requests.
I feel the URL I am giving is not right, and it has to be something else to provide an automatic switch.
Logs from Node1
[2020-02-10 12:15:45,639][INFO ][node ] [Wildpride] initialized
[2020-02-10 12:15:45,639][INFO ][node ] [Wildpride] starting ...
[2020-02-10 12:15:45,769][WARN ][common.network ] [Wildpride] _non_loopback_ is deprecated as it picks an arbitrary interface. specify explicit scope(s), interface(s), address(es), or hostname(s) instead
[2020-02-10 12:15:45,783][WARN ][common.network ] [Wildpride] _non_loopback_ is deprecated as it picks an arbitrary interface. specify explicit scope(s), interface(s), address(es), or hostname(s) instead
[2020-02-10 12:15:45,784][INFO ][transport ] [Wildpride] publish_address {000.00.00.204:9300}, bound_addresses {[fe80::9af2:b3ff:fee9:90ca]:9300}, {000.00.00.204:9300}
[2020-02-10 12:15:45,788][INFO ][discovery ] [Wildpride] XXXX/Hg_5eGZIS0e249KUTQqPPg
[2020-02-10 12:16:15,790][WARN ][discovery ] [Wildpride] waited for 30s and no initial state was set by the discovery
[2020-02-10 12:16:15,799][WARN ][common.network ] [Wildpride] _non_loopback_ is deprecated as it picks an arbitrary interface. specify explicit scope(s), interface(s), address(es), or hostname(s) instead
[2020-02-10 12:16:15,802][WARN ][common.network ] [Wildpride] _non_loopback_ is deprecated as it picks an arbitrary interface. specify explicit scope(s), interface(s), address(es), or hostname(s) instead
[2020-02-10 12:16:15,803][INFO ][http ] [Wildpride] publish_address {000.00.00.204:9200}, bound_addresses {[fe80::9af2:b3ff:fee9:90ca]:9200}, {000.00.00.204:9200}
[2020-02-10 12:16:15,803][INFO ][node ] [Wildpride] started
[2020-02-10 12:16:35,552][INFO ][node ] [Wildpride] stopping ...
[2020-02-10 12:16:35,619][WARN ][discovery.zen.ping.unicast] [Wildpride] [17] failed send ping to {#zen_unicast_1#}{000.00.00.206}{000.00.00.206:9300}
java.lang.IllegalStateException: can't add nodes to a stopped transport
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:916)
at org.elasticsearch.transport.netty.NettyTransport.connectToNodeLight(NettyTransport.java:906)
at org.elasticsearch.transport.TransportService.connectToNodeLight(TransportService.java:267)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:395)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2020-02-10 12:16:35,620][WARN ][discovery.zen.ping.unicast] [Wildpride] failed to send ping to [{Wildpride}{Hg_5eGZIS0e249KUTQqPPg}{000.00.00.204}{000.00.00.204:9300}]
SendRequestTransportException[[Wildpride][000.00.00.204:9300][internal:discovery/zen/unicast]]; nested: TransportException[TransportService is closed stopped can't send request];
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:340)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.sendPingRequestToNode(UnicastZenPing.java:440)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.sendPings(UnicastZenPing.java:426)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$2.doRun(UnicastZenPing.java:249)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: TransportException[TransportService is closed stopped can't send request]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:320)
... 7 more
[2020-02-10 12:16:35,623][INFO ][node ] [Wildpride] stopped
[2020-02-10 12:16:35,623][INFO ][node ] [Wildpride] closing ...
[2020-02-10 12:16:35,642][INFO ][node ] [Wildpride] closed
TL;DR: There is no automatic switch in Elasticsearch, and you'll need some kind of load balancer in front of the cluster.
For an HA setup you need at least 3 master-eligible nodes. In front of the cluster there has to be a load balancer (itself HA) to distribute requests across the nodes. Alternatively, the client needs to be cluster-aware and, in a failure scenario, fail over to whichever nodes are left.
If you go with 2 master-eligible nodes, the cluster can get into a "split brain" state. If the network becomes fragmented and the nodes lose sight of each other, each will think it is the last one running and keep serving read/write requests independently. The two halves drift apart, and once the fragmentation is gone it is nearly impossible to reconcile them; at best it is a lot of trouble to fix. With 3 nodes, in a fragmentation scenario the cluster will only continue to serve requests on a side where at least 2 nodes can still see each other.
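As a sketch of the load-balancer option, using the hostnames from the question and nginx as one possible (assumed) choice of proxy, an HTTP front for the two nodes could look like this fragment:

```nginx
# Minimal sketch: round-robin HTTP proxy in front of the cluster.
# Failed nodes are taken out of rotation automatically.
upstream elasticsearch {
    server node1.elastic.com:9200 max_fails=3 fail_timeout=30s;
    server node2.elastic.com:9200 max_fails=3 fail_timeout=30s;
}
server {
    listen 9200;
    location / {
        proxy_pass http://elasticsearch;
    }
}
```

The application would then point at the load balancer's address instead of node1.elastic.com directly.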

elasticsearch: How to reinitialize a node?

elasticsearch 1.7.2 on CentOS
We have a 3 node cluster that has been running fine. A networking problem caused the "B" node to lose network access. (It then turned out that the C node had minimum_master_nodes set to 1, not 2.)
So we are now poking along with just the A node.
We fixed the issues on the B and C nodes, but they refuse to come up and join the cluster. On B and C:
# curl -XGET http://localhost:9200/_cluster/health?pretty=true
{
"error" : "MasterNotDiscoveredException[waited for [30s]]",
"status" : 503
}
The elasticsearch.yml is as follows. (The node names on the "b" and "c" nodes are set accordingly, and the IP addresses listed on each node point to the other two nodes. However, on the "c" node, index.number_of_replicas was mistakenly set to 1.)
cluster.name: elasticsearch-prod
node.name: "PROD-node-3a"
node.master: true
index.number_of_replicas: 2
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["192.168.3.100", "192.168.3.101"]
We have no idea why they won't join. They have network visibility to A, and A can see them. Each node correctly has the other two defined in discovery.zen.ping.unicast.hosts.
On B and C, the log is very sparse, and tells us nothing:
# cat elasticsearch.log
[2015-09-24 20:07:46,686][INFO ][node ] [The Profile] version[1.7.2], pid[866], build[e43676b/2015-09-14T09:49:53Z]
[2015-09-24 20:07:46,688][INFO ][node ] [The Profile] initializing ...
[2015-09-24 20:07:46,931][INFO ][plugins ] [The Profile] loaded [], sites []
[2015-09-24 20:07:47,054][INFO ][env ] [The Profile] using [1] data paths, mounts [[/ (rootfs)]], net usable_space [148.7gb], net total_space [157.3gb], types [rootfs]
[2015-09-24 20:07:50,696][INFO ][node ] [The Profile] initialized
[2015-09-24 20:07:50,697][INFO ][node ] [The Profile] starting ...
[2015-09-24 20:07:50,942][INFO ][transport ] [The Profile] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.181.3.138:9300]}
[2015-09-24 20:07:50,983][INFO ][discovery ] [The Profile] elasticsearch/PojoIp-ZTXufX_Lxlwvdew
[2015-09-24 20:07:54,772][INFO ][cluster.service ] [The Profile] new_master [The Profile][PojoIp-ZTXufX_Lxlwvdew][elastic-search-3c-prod-centos-case-48307][inet[/10.181.3.138:9300]], reason: zen-disco-join (elected_as_master)
[2015-09-24 20:07:54,801][INFO ][http ] [The Profile] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.181.3.138:9200]}
[2015-09-24 20:07:54,802][INFO ][node ] [The Profile] started
[2015-09-24 20:07:54,880][INFO ][gateway ] [The Profile] recovered [0] indices into cluster_state
[2015-09-24 20:42:45,691][INFO ][node ] [The Profile] stopping ...
[2015-09-24 20:42:45,727][INFO ][node ] [The Profile] stopped
[2015-09-24 20:42:45,727][INFO ][node ] [The Profile] closing ...
[2015-09-24 20:42:45,735][INFO ][node ] [The Profile] closed
How do we bring the whole beast to life?
Rebooting B and C makes no difference at all
I am hesitant to cycle A, as that is what our app is hitting...
Well, we do not know what brought it to life, but it kind of magically came back up.
We believe the shard reroute (shown here: elasticsearch: Did I lose data when two of my three nodes went down?) caused the nodes to rejoin the cluster. Our theory is that node A, the only surviving node, was not a "healthy" master, because it knew that one shard (the "p" cut of shard 1, as spelled out in that same question) was not allocated.
Since the master knew it was not intact, the other nodes declined to join the cluster, throwing the MasterNotDiscoveredException.
Once we got all the "p" shards assigned to the surviving A node, the other nodes joined up and did the whole replicating dance.
HOWEVER: data was lost by allocating the shards like that. We ultimately set up a new cluster and are rebuilding the index (which takes several days).
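On 1.x, the reroute described here is a POST to the `_cluster/reroute` API. A sketch of the request body, with a placeholder index name; `allow_primary: true` is exactly the dangerous part, since it can bring up an empty primary and discard whatever the lost copy held:

```json
{
  "commands": [
    {
      "allocate": {
        "index": "my_index",
        "shard": 1,
        "node": "PROD-node-3a",
        "allow_primary": true
      }
    }
  ]
}
```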

Elasticsearch: Node registered to cluster but MasterNotDiscoveredException

I'm having a variation of the usual connection problem between Elasticsearch nodes, but here it does not seem to be network-related, as the client apparently registers with the master without any problem. My setup is the following:
One Master node (node.master=true, node.data=true, cluster.name=stokker)
One Client node (Spring Boot 1.3.0.M5) with these settings:
spring.data.elasticsearch.properties.http.enabled=true
spring.data.elasticsearch.cluster-name=stokker
spring.data.elasticsearch.properties.node.local=false
spring.data.elasticsearch.properties.node.data=false
spring.data.elasticsearch.properties.node.client=true
First I start the master node, then the client, and I can see that the client registers OK:
[Kilmer] recovered [0] indices into cluster_state
[Kilmer] watch service has started
[Kilmer] bound_address {inet[/0:0:0:0:0:0:0:0:9201]}, publish_address {inet[/159.107.28.230:9201]}
[Kilmer] started
[Kilmer] added {[Thunderclap][VVF_5QnLREac-Du-dZK1IQ][ES00052260][inet[/159.107.28.230:9301]]{client=true, data=false, local=false},}, reason: zen-disco-receive(join from node[[Thunderclap][VVF_5QnLREac-Du-dZK1IQ] [ES00052260][inet[/159.107.28.230:9301]]{client
Client's console output:
org.elasticsearch.node : [Thunderclap] version[1.7.0], pid[12084], build[929b973/2015-07-16T14:31:07Z]
org.elasticsearch.node : [Thunderclap] initializing ...
org.elasticsearch.plugins : [Thunderclap] loaded [], sites []
org.elasticsearch.bootstrap : JNA not found. native methods will be disabled.
org.elasticsearch.node : [Thunderclap] initialized
org.elasticsearch.node : [Thunderclap] starting ...
org.elasticsearch.transport : [Thunderclap] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/159.107.28.230:9301]}
org.elasticsearch.discovery : [Thunderclap] stokker/VVF_5QnLREac-Du-dZK1IQ
org.elasticsearch.discovery : [Thunderclap] waited for 30s and no initial state was set by the discovery
org.elasticsearch.http : [Thunderclap] bound_address {inet[/0:0:0:0:0:0:0:0:9202]}, publish_address {inet[/159.107.28.230:9202]}
org.elasticsearch.node : [Thunderclap] started
However, when I try to perform some indexing, I get the following exception:
org.elasticsearch.discovery.MasterNotDiscoveredException: waited for [30s]
Any ideas on what I am missing here?
Thanks
I solved the issue by manually adding this property indicating where the master node is:
spring.data.elasticsearch.cluster-nodes=192.168.1.18:9300
If somebody finds a better solution, please let me know; I'm not fully confident in this one.
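For completeness, the relevant client settings side by side (host from the answer, cluster name from the question); note that 9300 is the transport port the node client needs, not the 9200 HTTP port:

```properties
spring.data.elasticsearch.cluster-name=stokker
spring.data.elasticsearch.cluster-nodes=192.168.1.18:9300
```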

Elasticsearch fails to start

I'm trying to set up a two-node ES cluster on Amazon EC2 instances. After everything is set up and I try to start ES, it fails to start. Below are the config files:
/etc/elasticsearch/elasticsearch.yml - http://pastebin.com/3Q1qNqmZ
/etc/init.d/elasticsearch - http://pastebin.com/f3aJyurR
Below are the /var/log/elasticsearch/es-cluster.log content -
[2014-06-08 07:06:01,761][WARN ][common.jna ] Unknown mlockall error 0
[2014-06-08 07:06:02,095][INFO ][node ] [logstash] version[0.90.13], pid[29666], build[249c9c5/2014-03-25T15:27:12Z]
[2014-06-08 07:06:02,095][INFO ][node ] [logstash] initializing ...
[2014-06-08 07:06:02,108][INFO ][plugins ] [logstash] loaded [], sites []
[2014-06-08 07:06:07,504][INFO ][node ] [logstash] initialized
[2014-06-08 07:06:07,510][INFO ][node ] [logstash] starting ...
[2014-06-08 07:06:07,646][INFO ][transport ] [logstash] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.164.27.207:9300]}
[2014-06-08 07:06:12,177][INFO ][cluster.service ] [logstash] new_master [logstash][vCS_3LzESEKSN-thhGWeGA][inet[/<an_ip_is_here>:9300]], reason: zen-disco-join (elected_as_master)
[2014-06-08 07:06:12,208][INFO ][discovery ] [logstash] es-cluster/vCS_3LzESEKSN-thhGWeGA
[2014-06-08 07:06:12,334][INFO ][http ] [logstash] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/<an_ip_is_here>:9200]}
[2014-06-08 07:06:12,335][INFO ][node ] [logstash] started
[2014-06-08 07:06:12,379][INFO ][gateway ] [logstash] recovered [0] indices into cluster_state
I see several things that you should correct in your configuration files.
1) You need different node names. You are using the same config file for both nodes, which you do not want to do while setting the node name explicitly (node.name: "logstash"). Either create separate configuration files with different node.name entries, or comment the setting out and let ES auto-assign the node name.
2) The mlockall setting is throwing an error (the "Unknown mlockall error 0" line in your log). I would not set bootstrap.mlockall: true until you've first gotten ES to run without it and then spent a little time configuring Linux to support it. It can cause problems at startup:
Warning
mlockall might cause the JVM or shell session to exit if it tries to
allocate more memory than is available!
I'd check out the documentation on the configuration variables and be careful about making too many adjustments right out of the gate.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-service.html
If you do want to make memory adjustments to ES this previous stackoverflow article should be helpful:
How to change Elasticsearch max memory size
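Putting the two fixes together, a per-node sketch of elasticsearch.yml (0.90.x-era setting names; the node name shown is an example and must differ on the second node):

```yaml
cluster.name: es-cluster
node.name: "logstash-node-1"   # use a different name on the second node
bootstrap.mlockall: false      # turn on later, once OS memory-lock limits are configured
```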
