ELK data pod keeps going into CrashLoopBackOff - elasticsearch

I have implemented ELK on k8s; however, my data pod keeps going into CrashLoopBackOff status.
[]$ kubectl get pod -n es
NAME READY STATUS RESTARTS AGE
logging-es-opendistro-es-client-76ff944d-rwjvz 1/1 Running 0 134m
logging-es-opendistro-es-data-0 0/1 CrashLoopBackOff 3 18m
logging-es-opendistro-es-data-1 0/1 Init:1/2 0 5m24s
logging-es-opendistro-es-kibana-5cfbd8dc49-g5rdl 1/1 Running 0 39m
logging-es-opendistro-es-master-0 1/1 Running 0 134m
logging-es-opendistro-es-master-1 1/1 Running 0 127m
Logs from the data pod:
[2021-02-15T08:02:46,683][INFO ][o.e.p.PluginsService ] [logging-es-opendistro-es-data-1] loaded plugin [opendistro_security]
[2021-02-15T08:02:46,683][INFO ][o.e.p.PluginsService ] [logging-es-opendistro-es-data-1] loaded plugin [opendistro_sql]
[2021-02-15T08:02:47,027][INFO ][o.e.e.NodeEnvironment ] [logging-es-opendistro-es-data-1] using [1] data paths, mounts [[/usr/share/elasticsearch/data (//fbeff48040de841b395c700.file.core.windows.net/kubernetes-dynamic-pvc-0f5e7e46-5138-0dfbe0aadff3)]], net usable_space [37.1gb], net total_space [50gb], types [cifs]
[2021-02-15T08:02:47,027][INFO ][o.e.e.NodeEnvironment ] [logging-es-opendistro-es-data-1] heap size [4gb], compressed ordinary object pointers [true]

Failed to pull image "/infrastructure/busybox:1.27.2" - your image name looks wrong; an image reference is not supposed to start with a /.
You can check this doc if you need to fetch an image from a specific registry.
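As a quick check, here is a minimal sketch of inspecting the failing pull and what a corrected reference might look like (the registry host below is just a placeholder):
# Inspect the events for the crashing pod to see the exact image reference
kubectl -n es describe pod logging-es-opendistro-es-data-1 | grep -i 'failed to pull'
# A valid reference either names the registry host or, for Docker Hub, has no leading slash, e.g.:
#   myregistry.example.com/infrastructure/busybox:1.27.2
#   busybox:1.27.2
# not:
#   /infrastructure/busybox:1.27.2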

Related

Openwhisk out of function memory, not enforcing function timeout

I'm running Apache OpenWhisk on k3s, installed using Helm.
Below are the invoker logs, taken several hours after a fresh install, with several functions set to run periodically. This message appears every few seconds after the problem starts.
[2020-03-17T13:27:12.691Z] [ERROR] [#tid_sid_invokerHealth] [ContainerPool]
Rescheduling Run message, too many message in the pool, freePoolSize: 0 containers and 0 MB,
busyPoolSize: 8 containers and 4096 MB, maxContainersMemory 4096 MB, userNamespace: whisk.system,
action: ExecutableWhiskAction/whisk.system/invokerHealthTestAction0#0.0.1, needed memory: 128 MB,
waiting messages: 24
Here are the running pods. Notice all the function pods have an age of 11+ hours.
NAME READY STATUS RESTARTS AGE
openwhisk-gen-certs-n965b 0/1 Completed 0 14h
openwhisk-init-couchdb-4s9rh 0/1 Completed 0 14h
openwhisk-install-packages-pnvmq 0/1 Completed 0 14h
openwhisk-apigateway-78c64dd7c9-2gsw6 1/1 Running 2 14h
openwhisk-couchdb-844c6df68f-qrxq6 1/1 Running 2 14h
openwhisk-wskadmin 1/1 Running 2 14h
openwhisk-redis-77494b8d44-gkmlt 1/1 Running 2 14h
openwhisk-zookeeper-0 1/1 Running 2 14h
openwhisk-kafka-0 1/1 Running 2 14h
openwhisk-controller-0 1/1 Running 2 14h
openwhisk-nginx-5f795dd747-c228s 1/1 Running 4 14h
openwhisk-cloudantprovider-69fd94b6f6-x88f4 1/1 Running 2 14h
openwhisk-kafkaprovider-544fbfdcc7-kn29p 1/1 Running 2 14h
openwhisk-alarmprovider-58c5454cc8-q4wbw 1/1 Running 2 14h
openwhisk-invoker-0 1/1 Running 2 14h
wskopenwhisk-invoker-00-1-prewarm-nodejs10 1/1 Running 0 14h
wskopenwhisk-invoker-00-6-prewarm-nodejs10 1/1 Running 0 13h
wskopenwhisk-invoker-00-15-whisksystem-checkuserload 1/1 Running 0 13h
wskopenwhisk-invoker-00-31-whisksystem-guacscaleup 1/1 Running 0 12h
wskopenwhisk-invoker-00-30-whisksystem-guacscaledown 1/1 Running 0 12h
wskopenwhisk-invoker-00-37-whisksystem-functionelastalertcheckd 1/1 Running 0 11h
wskopenwhisk-invoker-00-39-whisksystem-checkuserload 1/1 Running 0 11h
wskopenwhisk-invoker-00-40-whisksystem-functionelastalertcheckd 1/1 Running 0 11h
wskopenwhisk-invoker-00-42-whisksystem-guacscaleup 1/1 Running 0 11h
wskopenwhisk-invoker-00-43-whisksystem-functionelastalertcheckd 1/1 Running 0 11h
Shouldn't OpenWhisk be killing these pods after they reach the timeout? The functions all have a timeout of either 3 or 5 minutes, but OpenWhisk doesn't seem to enforce this.
One other thing I noticed was "timeout" being set to "false" on the activations.
$ wsk activation get ...
{
    "annotations": [
        ...
        {
            "key": "timeout",
            "value": false
        },
        ...
    ]
}
The timeout annotation is specific to a particular activation. If the value is true, it means that particular activation of the corresponding function exceeded its configured maximum duration, which by default can range from 100 ms to 5 minutes (per the docs) unless changed for the system deployment as a whole.
The pods are used to execute the functions - they will stick around for some duration while idle to facilitate future warm starts. The OpenWhisk invoker will terminate these warm pods eventually, after an idle timeout or when resources are required to run other pods.
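As a quick sanity check on which limit actually applies, you can inspect an action's configured limits (the timeout is reported in milliseconds; "myFunction" is a placeholder name):
# Show the action's limits block, which includes the per-activation timeout in ms
wsk action get myFunction | grep -A5 '"limits"'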
OK, I fixed this by changing the invoker container factory implementation to docker. I'm not sure why the Kubernetes implementation fails to kill pods (and release memory), but we are using Docker as the container runtime for k3s.
To set this, change invoker.containerFactory.impl to docker in the helm chart values:
https://github.com/apache/openwhisk-deploy-kube/blob/master/helm/openwhisk/values.yaml#L261
I also increased the invoker memory (invoker.jvmHeapMB) to 1024:
https://github.com/apache/openwhisk-deploy-kube/blob/master/helm/openwhisk/values.yaml#L257
Here is a link that explains the container factory setting:
https://github.com/apache/openwhisk-deploy-kube/blob/master/docs/configurationChoices.md#invoker-container-factory
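For reference, a minimal sketch of applying both settings with helm upgrade (the release name, chart path, and namespace below are placeholders; the value keys come from the chart linked above):
# Switch the invoker container factory to docker and raise the invoker JVM heap
helm upgrade owdev ./helm/openwhisk -n openwhisk \
  --set invoker.containerFactory.impl=docker \
  --set invoker.jvmHeapMB=1024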

How to deal with "failed to connect to Kubernetes master IP"

I started learning Kubernetes, so I am following this guide: https://kubernetes.io/docs/setup/turnkey/gce/
I got an error after running "cluster/kube-up.sh":
Waiting up to 300 seconds for cluster initialization.
This will continually check to see if the API for kubernetes is reachable.
This may time out if there was some uncaught error during start up.
.........................................................................................................................................Cluster failed to initialize within 300 seconds.
Last output from querying API server follows:
% Total % Received % Xferd Average Speed Time Time Time current
             Dload Upload Total Spent Left speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to connect to "[kubernetes-master (external IP)]"
So I searched for this error and found one solution. It said to go to "cluster/gce/config-default.sh", find the line
secret: $(dd if=/dev/urandom iflag=fullblock bs=32 count=1 2>/dev/null | base64 | tr -d '\r\n')
and change 'dd' to 'gdd'. I modified config-default.sh accordingly, but it doesn't work.
I don't know how to fix it. Is there any solution to this error? Also, there are no Kubernetes logs in /var/logs. Where are the logs saved?
My Mac version: macOS Mojave 10.14.3
It looks like gdd is the GNU version of dd, which on macOS typically comes from Homebrew's coreutils package (the stock BSD dd does not support flags like iflag=fullblock). Regarding the Kubernetes logs, you can have a look at this other article. If you had a Google Kubernetes Engine cluster, it would be easier to check the logs by looking at Stackdriver.
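If gdd isn't present on your machine, a minimal sketch of getting it via Homebrew and using it in that line (assuming Homebrew is installed):
# GNU coreutils provides GNU dd as 'gdd' on macOS
brew install coreutils
# then in cluster/gce/config-default.sh use gdd in place of dd:
secret: $(gdd if=/dev/urandom iflag=fullblock bs=32 count=1 2>/dev/null | base64 | tr -d '\r\n')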

H2O cluster startup frequently timing out

Trying to start an H2O cluster on (MapR) Hadoop via Python:
# startup hadoop h2o cluster
import os
import subprocess
import sys
import h2o
import shlex
import re
from Queue import Queue, Empty
from threading import Thread

def enqueue_output(out, queue):
    """
    Function for communicating streaming text lines from a separate thread.
    see https://stackoverflow.com/questions/375427/non-blocking-read-on-a-subprocess-pipe-in-python
    """
    for line in iter(out.readline, b''):
        queue.put(line)
    out.close()

# clear legacy temp. dir.
hdfs_legacy_dir = '/mapr/clustername/user/mapr/hdfsOutputDir'
if os.path.isdir(hdfs_legacy_dir):
    print subprocess.check_output(shlex.split('rm -r %s' % hdfs_legacy_dir))

# start h2o service in background thread
local_h2o_start_path = '/home/mapr/h2o-3.18.0.2-mapr5.2/'
startup_p = subprocess.Popen(shlex.split('/bin/hadoop jar {}h2odriver.jar -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir'.format(local_h2o_start_path)),
                             shell=False,
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# setup message passing queue
q = Queue()
t = Thread(target=enqueue_output, args=(startup_p.stdout, q))
t.daemon = True  # thread dies with the program
t.start()

# read lines without blocking until the connection URL (or an error) appears
h2o_url_out = ''
while True:
    try:
        line = q.get_nowait()  # or q.get(timeout=.1)
    except Empty:
        continue
    else:  # got line
        print line
        # check for first instance of connection url output
        if re.search('Open H2O Flow in your web browser', line) is not None:
            h2o_url_out = line
            break
        if re.search('Error', line) is not None:
            print 'Error generated: %s' % line
            sys.exit()

print 'Connection url output line: %s' % h2o_url_out
h2o_cnxn_ip = re.search('(?<=Open H2O Flow in your web browser: http:\/\/)(.*?)(?=:)', h2o_url_out).group(1)
print 'H2O connection ip: %s' % h2o_cnxn_ip
This frequently throws a timeout error:
Waiting for H2O cluster to come up...
H2O node 172.18.4.66:54321 requested flatfile
H2O node 172.18.4.65:54321 requested flatfile
H2O node 172.18.4.67:54321 requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Error generated: ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Shutting down h2o cluster
Looking at the docs (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/faq/general-troubleshooting.html) (and just searching for the word "timeout"), I was unable to find anything that helped (e.g. extending the timeout via hadoop jar h2odriver.jar -timeout <some time> did nothing but extend the time until the timeout error popped up).
I have noticed that this often happens when another instance of an h2o cluster is already up and running (which I don't understand, since I would think that YARN could support multiple instances), yet it also sometimes happens when no other cluster is initialized.
Does anyone know anything else that can be tried to solve this problem, or how to get more debugging info beyond the error message being thrown by h2o?
UPDATE:
Trying to recreate the problem from the command line, I get:
[me#mnode01 project]$ /bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir
Determining driver host interface for mapper->driver callback...
[Possible callback IP address: 172.18.4.62]
[Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.62:29388
(You can override these with -driverif and -driverport/-driverportrange.)
Memory Settings:
mapreduce.map.java.opts: -Xms6g -Xmx6g -XX:PermSize=256m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
Extra memory percent: 10
mapreduce.map.memory.mb: 6758
18/08/15 09:18:46 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/172.18.4.64:8032
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: number of splits:4
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1523404089784_7404
18/08/15 09:18:48 INFO security.ExternalTokenManagerFactory: Initialized external token manager class - com.mapr.hadoop.yarn.security.MapRTicketManager
18/08/15 09:18:48 INFO impl.YarnClientImpl: Submitted application application_1523404089784_7404
18/08/15 09:18:48 INFO mapreduce.Job: The url to track the job: https://mnode03.cluster.local:8090/proxy/application_1523404089784_7404/
Job name 'H2O_66888' submitted
JobTracker job ID is 'job_1523404089784_7404'
For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'
Waiting for H2O cluster to come up...
H2O node 172.18.4.65:54321 requested flatfile
H2O node 172.18.4.67:54321 requested flatfile
H2O node 172.18.4.66:54321 requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
Killed.
18/08/15 09:23:54 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/172.18.4.64:8032
----- YARN cluster metrics -----
Number of YARN worker nodes: 6
----- Nodes -----
Node: http://mnode03.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 7.0 GB used, 0 / 2 vcores used
Node: http://mnode05.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode06.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode01.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 5.0 GB used, 0 / 2 vcores used
Node: http://mnode04.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used, 7.0 / 10.4 GB used, 1 / 2 vcores used
Node: http://mnode02.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used, 2.0 / 8.7 GB used, 1 / 2 vcores used
----- Queues -----
Queue name: root.default
Queue state: RUNNING
Current capacity: 0.00
Capacity: 0.00
Maximum capacity: -1.00
Application count: 0
Queue 'root.default' approximate utilization: 0.0 / 0.0 GB used, 0 / 0 vcores used
----------------------------------------------------------------------
WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0)
ERROR: Only 3 out of the requested 4 worker containers were started due to YARN cluster resource limitations
----------------------------------------------------------------------
For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'
and I notice the later output lines:
WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0)
ERROR: Only 3 out of the requested 4 worker containers were started due to YARN cluster
I am confused by the reported 0.0 GB of memory and 0 vcores, because there are no other applications running on the cluster, and looking at the cluster details in the YARN RM web UI shows available resources (shown in an image in the original post, since I could not find a unified place in the log files for this information; why the memory availability is so uneven despite there being no other running applications, I do not know). At this point I should mention that I don't have much experience tinkering with or examining YARN configs, so it's difficult for me to find the relevant information.
Could it be that I am starting the h2o cluster with -mapperXmx 6g, but (as shown in the image) one of the nodes only has 5 GB of memory available, so if this node is randomly selected to contribute to the initialized h2o application, it does not have enough memory to support the requested mapper memory? Changing the startup command to /bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar -nodes 4 -mapperXmx 5g -timeout 300 -output hdfsOutputDir and starting/stopping multiple times without error seems to support this theory (though I need to check further to determine whether I'm interpreting things correctly).
This is most likely because your Hadoop cluster is busy, and there just isn't space to start new YARN containers.
If you ask for N nodes, then you either get all N nodes, or the launch process times out like you are seeing. You can optionally use the -timeout command-line flag to increase the timeout.
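For example, reusing the driver command from the update above with a longer startup timeout and a smaller per-mapper heap (the values here are illustrative):
# Give the cluster more time to come up and request less memory per mapper
/bin/hadoop jar /home/me/h2o-3.20.0.5-mapr5.2/h2odriver.jar \
    -nodes 4 -mapperXmx 5g -timeout 600 -output hdfsOutputDir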

HA - Pacemaker - Is there a way to automatically clean failed actions after X sec/min/hours?

I'm using Pacemaker + Corosync on CentOS 7.
When one of my resources fails/stops, I'm getting a failed-action message:
Master/Slave Set: myoptClone01 [myopt_data01]
Masters: [ pcmk01-cr ]
Slaves: [ pcmk02-cr ]
myopt_fs01 (ocf::heartbeat:Filesystem): Started pcmk01-cr
myopt_VIP01 (ocf::heartbeat:IPaddr2): Started pcmk01-cr
ServicesResource (ocf::heartbeat:RADviewServices): Started pcmk01-cr
Failed Actions:
* ServicesResource_monitor_120000 on pcmk02-cr 'unknown error' (1): call=141, status=complete, exitreason='none',
last-rc-change='Mon Jan 30 10:19:36 2017', queued=0ms, exec=142ms
Is there a way to automatically clean the failed actions after X seconds/minutes/hours?
Look into the 'failure-timeout' resource option. This will automatically clean up the failed action if no further failures for the particular resource have occurred within the value of failure-timeout.
I believe the failure-timeout is evaluated during the cluster-recheck-interval, which means that even if you have failure-timeout configured to 1 minute, it may still take up to 15 minutes and 59 seconds to clear the failed action with Pacemaker's default 15-minute cluster-recheck-interval.
More information:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-options.html
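For reference, a minimal sketch of setting both options with pcs (the resource name is taken from the status output above; the timeout values are placeholders):
# Automatically expire the failed action for this resource after 10 minutes without new failures
pcs resource meta ServicesResource failure-timeout=10min
# Optionally shorten how often Pacemaker re-evaluates the cluster (default is 15 minutes)
pcs property set cluster-recheck-interval=5min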

logstash elasticsearch unassigned shards

I'm having trouble figuring out why this is happening and how to fix it:
Note that the first logstash index, which has all shards assigned, is only in that state because I manually assigned them.
Everything had been working as expected for several months until just today, when I noticed that shards have been lying around unassigned on all my logstash indexes.
Using Elasticsearch v1.4.0
My Elasticsearch logs appear as follows up until July 29, 2015:
[2015-07-29 00:01:53,352][DEBUG][discovery.zen.publish ] [Thor] received cluster state version 4827
Also, there is a script running on the server which trims the number of logstash indexes down to 30. I think that is where this type of log line comes from:
[2015-07-29 17:43:12,800][DEBUG][gateway.local.state.meta ] [Thor] [logstash-2015.01.25] deleting index that is no longer part of the metadata (indices: [[kibana-int, logstash-2015.07.11, logstash-2015.07.04, logstash-2015.07.03, logstash-2015.07.12, logstash-2015.07.21, logstash-2015.07.17, users_1416418645233, logstash-2015.07.20, logstash-2015.07.25, logstash-2015.07.06, logstash-2015.07.28, logstash-2015.01.24, logstash-2015.07.18, logstash-2015.07.26, logstash-2015.07.08, logstash-2015.07.19, logstash-2015.07.09, logstash-2015.07.22, logstash-2015.07.07, logstash-2015.07.29, logstash-2015.07.10, logstash-2015.07.05, logstash-2015.07.01, logstash-2015.07.16, logstash-2015.07.24, logstash-2015.07.02, logstash-2015.07.27, logstash-2015.07.14, logstash-2015.07.13, logstash-2015.07.23, logstash-2015.06.30, logstash-2015.07.15]])
On July 29, there are a few new entries which I haven't seen before:
[2015-07-29 00:01:38,024][DEBUG][action.bulk ] [Thor] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-07-29 17:12:58,658][INFO ][cluster.routing.allocation.decider] [Thor] updating [cluster.routing.allocation.enable] from [ALL] to [NONE]
[2015-07-29 17:12:58,658][INFO ][cluster.routing.allocation.decider] [Thor] updating [cluster.routing.allocation.disable_allocation] from [false] to [true]
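The last two lines show shard allocation being switched from ALL to NONE. Assuming that setting is what is keeping the new shards unassigned, a minimal sketch of checking it and re-enabling allocation via the cluster settings API (host/port are placeholders):
# Check the current cluster-level allocation settings
curl -s 'localhost:9200/_cluster/settings?pretty'
# Re-enable shard allocation if it has been set to "none"
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'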
