Elasticsearch 2.3.4 Node Startup Quiet Failure - elasticsearch

We are running a 5-node cluster hosted on Google Cloud (Ubuntu 16.04 LTS), and we noticed that one node's disk usage was at 90%+, so we shut down that node with:
sudo service elasticsearch stop
then stopped the instance in the GCP console.
After increasing the node's disk size, we tried starting Elasticsearch again with:
sudo service elasticsearch start
This command fails silently: the SSH session freezes momentarily and then terminates. Nothing shows up in the node's Elasticsearch logs, and nothing shows up in the current master's Elasticsearch logs either. The only hint that something is going wrong is in the node's syslog:
Jan 25 15:48:29 elasticsearch-1-vm systemd[1]: Started Cleanup of Temporary Directories.
Jan 25 15:48:29 elasticsearch-1-vm systemd[1]: Starting Elasticsearch...
Jan 25 15:48:29 elasticsearch-1-vm systemd[1]: Started Elasticsearch.
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.597729] kernel tried to execute NX-protected page - exploit attempt? (uid: 113)
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.605545] BUG: unable to handle kernel paging request at 00007f896d5467c0
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.612621] IP: 0x7f896d5467c0
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.615779] PGD 80000003050ee067
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.615780] P4D 80000003050ee067
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.619199] PUD 30508d067
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.622626] PMD 305162067
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.625438] PTE 80000003df15b867
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.628245]
Jan 25 15:48:30 elasticsearch-1-vm kernel: [ 919.633174] Oops: 0011 [#1] SMP PTI
Cluster health with the remaining 4 nodes is green, and we can't figure out why this is happening.
Any ideas would be very helpful.
Here is our config located in /etc/default/elasticsearch:
https://gist.github.com/deppi/58826c38ea8414d301eb034e9a29cd54
Also here is our /etc/elasticsearch/elasticsearch.yml
https://gist.github.com/deppi/17b1f28e649ee528b0fe2ca93a2ff19c
The only thing I can think of that might be causing this issue is discovery.zen.minimum_master_nodes: 2
when maybe it should be configured as
discovery.zen.minimum_master_nodes: 3
But we are not certain this is the issue and don't want to risk breaking our Elasticsearch cluster further.
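For reference, the usual zen discovery rule is a quorum of master-eligible nodes, i.e. (master_eligible_nodes / 2) + 1; assuming all 5 nodes are master-eligible, that indeed gives 3:
# elasticsearch.yml (assuming all 5 nodes are master-eligible)
# quorum = floor(5 / 2) + 1 = 3
discovery.zen.minimum_master_nodes: 3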

From experience, I know that shutting down the cluster with just the elasticsearch service command was not the best approach; we had issues with nodes not being entirely down and trying to take over the master role. That may be why you see the other nodes, but your node is not part of the cluster anymore.
What you should do is shut down the Elasticsearch process on each node, unless you are still indexing on the other nodes. In that case, shut your cluster down properly:
Stop ingestion first every time you need to stop Elasticsearch (so Logstash, if you are using the stack).
Then stop Elasticsearch itself: https://www.elastic.co/guide/en/elasticsearch/reference/master/stopping-elasticsearch.html
Start your first node and let the election protocol take place.
Start Elasticsearch on the other nodes and see if all the nodes join.
If not, your config might be the problem; I would use 1 dedicated master node and 3 data nodes, and use another data path. When you need to shut down your cluster: stop ingestion, stop the queuing, then stop the storage (Elasticsearch), node by node, as sketched below.
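As a sketch only (assuming default ports and the 2.x allocation setting; not part of the original answer), a per-node stop/start could look like this; disabling allocation avoids needless shard shuffling while the node is down:
curl -XPUT 'localhost:9200/_cluster/settings' -d '{"transient":{"cluster.routing.allocation.enable":"none"}}'
sudo service elasticsearch stop
# ... do maintenance (resize disk, reboot, etc.) ...
sudo service elasticsearch start
curl -XPUT 'localhost:9200/_cluster/settings' -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'
curl 'localhost:9200/_cat/health?v'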

This seems to be an issue with a new kernel that has been deployed on GCP for the Ubuntu 16.04 LTS OS.
Problem Kernel:
uname -a
Linux elasticsearch-1-vm 4.13.0-1007-gcp #10-Ubuntu SMP Fri Jan 12 13:56:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Proper Kernel:
uname -a
Linux elasticsearch-1-vm 4.13.0-1006-gcp #9-Ubuntu SMP Mon Jan 8 21:13:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
To fix the issue with the GCP instances, I ran:
sudo apt remove linux-image-4.13.0-1007-gcp
sudo apt install linux-image-4.13.0-1006-gcp
exit
Then, in the Google Cloud console, restart the instance, SSH back in, and run:
sudo service elasticsearch start
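One optional precaution (package name assumed as above): hold the working kernel so a later apt upgrade doesn't reinstall the broken one:
sudo apt-mark hold linux-image-4.13.0-1006-gcp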

Related

Analysis of Redis error logs "LOADING Redis is loading the dataset in memory" and more

I am frequently seeing these messages in the Redis logs:
1#
602854:M 23 Dec 2022 09:48:54.028 * 10 changes in 300 seconds. Saving...
602854:M 23 Dec 2022 09:48:54.035 * Background saving started by pid 3266364
3266364:C 23 Dec 2022 09:48:55.844 * DB saved on disk
3266364:C 23 Dec 2022 09:48:55.852 * RDB: 12 MB of memory used by copy-on-write
602854:M 23 Dec 2022 09:48:55.938 * Background saving terminated with success
2#
LOADING Redis is loading the dataset in memory
3#
7678:signal-handler (1671738516) Received SIGTERM scheduling shutdown...
7678:M 22 Dec 2022 23:48:36.300 # User requested shutdown...
7678:M 22 Dec 2022 23:48:36.300 # systemd supervision requested, but NOTIFY_SOCKET not found
7678:M 22 Dec 2022 23:48:36.300 * Saving the final RDB snapshot before exiting.
7678:M 22 Dec 2022 23:48:36.300 # systemd supervision requested, but NOTIFY_SOCKET not found
7678:M 22 Dec 2022 23:48:36.720 * DB saved on disk
7678:M 22 Dec 2022 23:48:36.720 * Removing the pid file.
7678:M 22 Dec 2022 23:48:36.720 # Redis is now ready to exit, bye bye...
7901:C 22 Dec 2022 23:48:37.071 # WARNING supervised by systemd - you MUST set appropriate values for TimeoutStartSec and TimeoutStopSec in your service unit.
7901:C 22 Dec 2022 23:48:37.071 # systemd supervision requested, but NOTIFY_SOCKET not found
7914:C 22 Dec 2022 23:48:37.071 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
7914:C 22 Dec 2022 23:48:37.071 # Redis version=6.0.9, bits=64, commit=00000000, modified=0, pid=7914, just started
7914:C 22 Dec 2022 23:48:37.071 # Configuration loaded
Are these messages concerning?
Let me know if there's any optimization to be carried out in terms of settings.
The first set of informational messages is related to Redis persistence: your Redis node appears to be configured to snapshot the database to disk once 300 seconds have elapsed with at least 10 write operations against it. You can change that according to your needs through the Redis configuration file.
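For reference, the snapshot policy is controlled by the save directives in redis.conf; the stock defaults look like this:
# redis.conf - snapshot after <seconds> if at least <changes> keys changed
save 900 1
save 300 10
save 60 10000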
The message LOADING Redis is loading the dataset in memory, on the other hand, is an error returned to clients that connect while a Redis instance is loading its dataset into memory: that occurs during startup for standalone servers and master nodes, or when replicas reconnect and fully resynchronize with the master. If you are seeing this error often, and not right after a system restart, I would suggest checking your system log files to learn why your Redis instance is restarting or resynchronizing (depending on your topology).
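A quick way to check whether an instance is still loading its dataset (using the standard INFO persistence fields):
redis-cli INFO persistence | grep -E 'loading|rdb_last_bgsave_status'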

can't start minio server in ubuntu with systemctl start minio

I configured a MinIO server instance on Ubuntu 18.04 following the guide at https://www.digitalocean.com/community/tutorials/how-to-set-up-an-object-storage-server-using-minio-on-ubuntu-18-04.
After the installation, the server failed to start with the command sudo systemctl start minio; the error says:
root@iZbp1icuzly3aac0dmjz9aZ:~# sudo systemctl status minio
● minio.service - MinIO
Loaded: loaded (/etc/systemd/system/minio.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2021-12-23 17:11:56 CST; 4s ago
Docs: https://docs.min.io
Process: 9085 ExecStart=/usr/local/bin/minio server $MINIO_OPTS $MINIO_VOLUMES (code=exited, status=1/FAILURE)
Process: 9084 ExecStartPre=/bin/bash -c if [ -z "${MINIO_VOLUMES}" ]; then echo "Variable MINIO_VOLUMES not set in /etc/default/minio"; exit 1; fi (code=exited, status=0/SUCCESS)
Main PID: 9085 (code=exited, status=1/FAILURE)
Dec 23 17:11:56 iZbp1icuzly3aac0dmjz9aZ systemd[1]: minio.service: Main process exited, code=exited, status=1/FAILURE
Dec 23 17:11:56 iZbp1icuzly3aac0dmjz9aZ systemd[1]: minio.service: Failed with result 'exit-code'.
Dec 23 17:11:56 iZbp1icuzly3aac0dmjz9aZ systemd[1]: minio.service: Service hold-off time over, scheduling restart.
Dec 23 17:11:56 iZbp1icuzly3aac0dmjz9aZ systemd[1]: minio.service: Scheduled restart job, restart counter is at 5.
Dec 23 17:11:56 iZbp1icuzly3aac0dmjz9aZ systemd[1]: Stopped MinIO.
Dec 23 17:11:56 iZbp1icuzly3aac0dmjz9aZ systemd[1]: minio.service: Start request repeated too quickly.
Dec 23 17:11:56 iZbp1icuzly3aac0dmjz9aZ systemd[1]: minio.service: Failed with result 'exit-code'.
Dec 23 17:11:56 iZbp1icuzly3aac0dmjz9aZ systemd[1]: Failed to start MinIO.
It looks like the reason is Variable MINIO_VOLUMES not set in /etc/default/minio.
However, I double-checked the file /etc/default/minio:
MINIO_ACCESS_KEY="minioadmin"
MINIO_VOLUMES="/usr/local/share/minio/"
MINIO_OPTS="-C /etc/minio --address localhost:9001"
MINIO_SECRET_KEY="minioadmin"
I have set the value of MINIO_VOLUMES.
Starting manually with minio server --address :9001 /usr/local/share/minio/ works.
Now I don't know what goes wrong when starting the MinIO server with systemctl start minio.
I'd recommend sticking to the official documentation wherever possible. It's intended for distributed deployments but the only real change is that your MINIO_VOLUMES will be for a single node/drive.
I would recommend trying a combination of things here (a short sketch follows this list):
Review minio.service and ensure the user/group it runs under exists
Review file path permissions on the MINIO_VOLUMES value
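A minimal sketch of both checks, assuming the minio-user account and the /usr/local/share/minio path from the DigitalOcean guide:
# create the service account if it doesn't already exist
sudo groupadd -r minio-user
sudo useradd -M -r -g minio-user minio-user
# give it ownership of the MINIO_VOLUMES path
sudo chown -R minio-user:minio-user /usr/local/share/minio
ls -al /usr/local/share/minio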
Now for the why:
My guess without seeing further logs (journalctl -u minio would have been helpful here) is that this is a combination of two things:
the minio.service user/group doesn't have rwx permissions on the /usr/local/share/minio path,
you are missing an environment variable we recently introduced to prevent users from pointing at their root drive (this was intended as a safety measure, but somewhat complicates these kinds of smaller setups).
Take a look at these lines in the minio.service file - I'm assuming that is what you are using based on the instructions in the DO guide.
If you run ls -al /usr/local/share/minio, I would venture it is owned by root for both user and group, with limited write access if any.
Hope this helps - for further troubleshooting having at least 10-20 lines from journalctl is invaluable, as it would show the actual error and not just the final quit message.
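For example:
sudo journalctl -u minio -n 20 --no-pager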

Install of elastic 7.5 on RHEL 7.8 makes memory violation sig=6 due to JNA

I am installing a brand new Elasticsearch 7.5 on OS: Red Hat Enterprise Linux Server release 7.8 (Maipo).
At startup of the service, I get a hard failure. Here is what the service status provides:
● elasticsearch.service - Elasticsearch
Loaded: loaded (/usr/lib/systemd/system/elasticsearch.service; disabled; vendor preset: disabled)
Active: failed (Result: signal) since Tue 2020-08-25 11:34:39 CEST; 7min ago
Docs: http://www.elastic.co
Process: 102777 ExecStart=/usr/share/elasticsearch/bin/elasticsearch -p ${PID_DIR}/elasticsearch.pid --quiet (code=killed, signal=ABRT)
Main PID: 102777 (code=killed, signal=ABRT)
CGroup: /system.slice/elasticsearch.service
Aug 25 11:34:34 sv-1348lvd44.esante.local systemd[1]: Starting Elasticsearch...
Aug 25 11:34:35 sv-1348lvd44.esante.local elasticsearch[102777]: OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated...lease.
Aug 25 11:34:39 sv-1348lvd44.esante.local systemd[1]: elasticsearch.service: main process exited, code=killed, status=6/ABRT
Aug 25 11:34:39 sv-1348lvd44.esante.local systemd[1]: Failed to start Elasticsearch.
Aug 25 11:34:39 sv-1348lvd44.esante.local systemd[1]: Unit elasticsearch.service entered failed state.
Aug 25 11:34:39 sv-1348lvd44.esante.local systemd[1]: elasticsearch.service failed.
When using journalctl -xe:
Aug 25 11:34:38 sv-1348lvd44.esante.local audispd[824]: node=sv-1348lvd44.esante.local type=ANOM_ABEND msg=audit(1598348078.836:208066): auid=429496 uid=995 gid=991 ses=4294967295 subj=system_u:system_r:unconfined_service_t:s0 pid=102777 comm="java" reason="memory violation" sig=6
Aug 25 11:34:39 sv-1348lvd44.esante.local systemd[1]: elasticsearch.service: main process exited, code=killed, status=6/ABRT
Aug 25 11:34:39 sv-1348lvd44.esante.local systemd[1]: Failed to start Elasticsearch.
When looking into the dump hs_err_pidXXXX, I have:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f4818939b85, pid=52870, tid=52933
#
# JRE version: OpenJDK Runtime Environment (13.0.1+9) (build 13.0.1+9)
# Java VM: OpenJDK 64-Bit Server VM (13.0.1+9, mixed mode, sharing, tiered, compressed oops, concurrent mark sweep gc, linux-amd64)
# Problematic frame:
# C [jna515356041985641679.tmp+0x12b85] ffi_prep_closure_loc+0x15
OS: Red Hat Enterprise Linux Server release 7.8 (Maipo)
uname:Linux 3.10.0-1127.10.1.el7.x86_64 #1 SMP Tue May 26 15:05:43 EDT 2020 x86_64
libc:glibc 2.17 NPTL 2.17
rlimit: STACK 8192k, CORE 0k, NPROC 4096, NOFILE 65535, AS infinity, DATA infinity, FSIZE infinity
load average:0.08 0.03 0.05
.../...
It works like a charm on CentOS without doing anything.
For RHEL, I already fixed the JNA temp-directory issue by adding ES_TMPDIR=/var/es-temp to /etc/sysconfig/elasticsearch.
Memory seems fine; this is a brand new VM (there are no application logs in /var/log).
This version is supposed to be supported.
I tested with -Xms2g -Xmx2g, -Xms1g -Xmx1g, and -Xms512m -Xmx512m, but I get the same error.
I don't get what is going wrong. My next step is to test with another version 7 of Elasticsearch.
After 1 day of struggling, I found the solution at https://discuss.elastic.co/t/elasticsearch-v7-6-2-failed-to-start-killed-by-sigabrt-on-rhel-7-7-urgent/231039/11 from Ivan_A_Carrazana_C
I put a copy of the steps to perform here:
Hi
If you are applying security compliance hardening to your RHEL installation, you must change the path of the tmp directory that Elasticsearch and Java will use.
Uncomment at /etc/elasticsearch/jvm.options
-Djava.io.tmpdir=${ES_TMPDIR}
Add in /etc/sysconfig/elasticsearch:
ES_TMPDIR=/usr/share/elasticsearch/tmp
Create the /usr/share/elasticsearch/tmp directory and make sure that the owner and group are elasticsearch and the permissions are 0755.
Lastly, make sure that /dev/shm doesn't have the noexec attribute, with the command:
mount | grep tmpfs | grep '/dev/shm'
Expected result:
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)
If you get output like this:
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,seclabel)
Add or modify in /etc/fstab the following line:
tmpfs /dev/shm tmpfs defaults,nodev,nosuid 0 0
I had the same problem and this worked for me. Hope I can help you.
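Putting the quoted steps together as shell commands (a sketch, assuming the elasticsearch user and group created by the RPM package):
# 1. in /etc/elasticsearch/jvm.options, uncomment: -Djava.io.tmpdir=${ES_TMPDIR}
echo 'ES_TMPDIR=/usr/share/elasticsearch/tmp' | sudo tee -a /etc/sysconfig/elasticsearch
sudo mkdir -p /usr/share/elasticsearch/tmp
sudo chown elasticsearch:elasticsearch /usr/share/elasticsearch/tmp
sudo chmod 0755 /usr/share/elasticsearch/tmp
mount | grep tmpfs | grep '/dev/shm'   # must not show noexec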
Seems to be known by Elastic but not documented correctly. I don't understand why tmpfs should be mounted noexec; it would be good to have feedback from a JNA expert about it.
For some reason, adding a TMPDIR variable to /etc/sysconfig/elasticsearch (on 7.7.1), pointing to the same location as -Djava.io.tmpdir, also worked.
i.e.
TMPDIR="/usr/share/elasticsearch/tmp"
(In my case I actually used /var/lib/elasticsearch/tmp, with 0755 permissions on it.)
I can't say why, and it doesn't change the command line shown by ps -aef, but just having -Djava.io.tmpdir wasn't enough.
This allowed me to get it to work without removing noexec on /tmp and /dev/shm.

Marathon exited with status 1

I am installing Mesosphere on Ubuntu 16.04 Xenial. ZooKeeper, mesos-master, and mesos-slave are running fine; while starting Marathon I am getting this issue:
Required option 'master' not found. I have created the folder /etc/marathon/conf. These are the steps I am following for Marathon:
sudo mkdir -p /etc/marathon/conf
sudo cp /etc/mesos-master/hostname /etc/marathon/conf
sudo cp /etc/mesos/zk /etc/marathon/conf/master
sudo cp /etc/marathon/conf/master /etc/marathon/conf/zk
sudo nano /etc/marathon/conf/zk, and change mesos to marathon at the end.
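Assuming the stock single-node ZooKeeper entry in /etc/mesos/zk, the two resulting files should look like:
# /etc/marathon/conf/master
zk://127.0.0.1:2181/mesos
# /etc/marathon/conf/zk
zk://127.0.0.1:2181/marathon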
I am attaching the whole log here:
Jan 25 14:18:01 master01 cron[859]: (*system*) INSECURE MODE (group/other writable) (/etc/crontab)
Jan 25 14:18:01 master01 cron[859]: (*system*popularity-contest) INSECURE MODE (group/other writable) (/etc/cron.d/popularity-contest)
Jan 25 14:18:01 master01 cron[859]: (*system*php) INSECURE MODE (group/other writable) (/etc/cron.d/php)
Jan 25 14:18:01 master01 cron[859]: (*system*anacron) INSECURE MODE (group/other writable) (/etc/cron.d/anacron)
Jan 25 14:18:29 master01 systemd[1]: marathon.service: Service hold-off time over, scheduling restart.
Jan 25 14:18:29 master01 systemd[1]: Stopped Scheduler for Apache Mesos.
Jan 25 14:18:29 master01 systemd[1]: Starting Scheduler for Apache Mesos...
Jan 25 14:18:29 master01 systemd[1]: Started Scheduler for Apache Mesos.
Jan 25 14:18:29 master01 marathon[29366]: No start hook file found ($HOOK_MARATHON_START). Proceeding with the start script.
Jan 25 14:18:30 master01 marathon[29366]: [scallop] Error: **Required option 'master' not found**
Jan 25 14:18:30 master01 systemd[1]: marathon.service: Main process exited, code=exited, status=1/FAILURE
Jan 25 14:18:30 master01 systemd[1]: marathon.service: Unit entered failed state.
Jan 25 14:18:30 master01 systemd[1]: marathon.service: Failed with result 'exit-code'.
Breaking Changes / Packaging standardized
We now publish more normalized packages that attempt to follow Linux Standard Base Guidelines and use sbt-native-packager to achieve this. As a result of this and the many historic ways of passing options into marathon, we will only read /etc/default/marathon when starting up. This file, like /etc/sysconfig/marathon, has all marathon command line options as "MARATHON_XXX=YYY" which will translate to --xxx=yyy. We no longer support /etc/marathon/conf, which was a set of files that would get translated into command line arguments. In addition, we no longer assume that if there is no zk/master argument passed in, then both are running on localhost.
Try to keep config in the environment.
cat << EOF > /etc/default/marathon
MARATHON_MASTER=zk://127.0.0.1:2181/mesos
MARATHON_ZK=zk://127.0.0.1:2181/marathon
EOF
Remember to replace 127.0.0.1:2181 with your actual ZooKeeper location.
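After writing the file, restart Marathon and check that the option was picked up (unit name as shown in the logs above):
sudo systemctl restart marathon
journalctl -u marathon -n 20 --no-pager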
I am using Ubuntu 14.04; in my case janisz's solution did not work, as I needed to add export:
cat << EOF > /etc/default/marathon
export MARATHON_MASTER=zk://127.0.0.1:2181/mesos
export MARATHON_ZK=zk://127.0.0.1:2181/marathon
EOF

One node in hadoop cluster failure

I recently configured a 10-node HDP Hadoop cluster; each node runs SLES 11.
On the master node I have configured all master services and clients, as well as the Ambari server. The remaining nodes run the other slave services and their clients.
NTP sync is on, and the other prerequisites are also fine.
I am experiencing weird behavior on the Hadoop cluster: after starting all the services, within a few hours one of the nodes goes down.
When I experienced this the first time, I restarted that particular node and added it back to the cluster.
Now my master node is showing the same issue, due to which the whole cluster is down. I have checked the logs, but there are no indications related to the failure.
I am clueless: what is the root cause of the node failures in the Hadoop cluster?
Below are the logs.
The system which went down, /var/log/messages:
notice)=0', processed='source(src)=6830'
Apr 23 05:22:43 lnx1863 SuSEfirewall2: SuSEfirewall2 not active
Apr 23 05:23:49 lnx1863 SuSEfirewall2: SuSEfirewall2 not active
Apr 23 05:24:17 lnx1863 sudo: root : TTY=pts/0 ; PWD=/ ; USER=root ; COMMAND=/usr/bin/du -h /
Apr 23 05:24:55 lnx1863 SuSEfirewall2: SuSEfirewall2 not active
Apr 23 05:25:22 lnx1863 kernel: [248531.127254] megasas: Found FW in FAULT state, will reset adapter.
Apr 23 05:25:22 lnx1863 kernel: [248531.127260] megaraid_sas: resetting fusion adapter.
Apr 23 05:25:22 lnx1863 kernel: [248531.127427] megaraid_sas: Reset not supported, killing adapter.
namenode logs:
INFO 2015-04-23 05:27:43,665 Heartbeat.py:78 - Building Heartbeat: {responseId = 7607, timestamp = 1429781263665, commandsInProgress = False, componentsMapped = True}
INFO 2015-04-23 05:28:44,053 security.py:135 - Encountered communication error. Details: SSLError('The read operation timed out',)
ERROR 2015-04-23 05:28:44,053 Controller.py:278 - Connection to http://localhost was lost (details=Request to https://localhost:8441/agent/v1/heartbeat/localhostip failed due to Error occured during connecting to the server: The read operation timed out)
INFO 2015-04-23 05:29:16,061 NetUtil.py:48 - Connecting to https://localhost:8440/connection_info
INFO 2015-04-23 05:29:16,118 security.py:93 - SSL Connect being called.. connecting to the server
