Gunicorn kills workers while using oracledb + sqlalchemy - oracle

My FastAPI application had been running for a long time, but I recently added SQLAlchemy to connect to an Oracle database. When I run the app locally everything works fine, but as soon as I run it on Gunicorn using Docker + K8s I get issues with the Gunicorn workers. Some of the workers are killed all the time, while others keep running and serving requests. Sometimes it gets to the point where the whole container is killed and restarted. The only changes between the working version and the problematic one are the addition of SQLAlchemy and the oracledb library. The Gunicorn config stayed the same, and so did the Dockerfile.
Here is my dockerfile:
FROM python:3.10
EXPOSE 8000
WORKDIR /app
RUN mkdir -p /app
COPY . /app
RUN pip install poetry
RUN poetry config virtualenvs.create false
RUN poetry install
COPY certs/ /usr/local/share/ca-certificates
RUN update-ca-certificates
ENV REQUESTS_CA_BUNDLE="/etc/ssl/certs/ca-certificates.crt"
CMD ["gunicorn", "wsgi:app", "--bind", "0.0.0.0:8000", "-w", "4", "--access-logfile", "-", "--log-level", "debug", "-k", "uvicorn.workers.UvicornWorker", "-t", "90"]
This is how I create the engine:
from sqlalchemy import Engine, create_engine
import config
def get_engine() -> Engine:
    # URL form: dialect+driver://user:password@host:port/?service_name=...
    connection_url = f"oracle+oracledb://{config.DB_USERNAME}:{config.DB_PASSWORD}@{config.DB_HOST}:{config.DB_PORT}/?service_name={config.DB_SERVICE_NAME}"
    engine = create_engine(connection_url)
    return engine
So even if there were a problem with SQLAlchemy or the DB connection, I shouldn't see any issues before I actually use this engine, and the engine is used only in endpoints through a repository...
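A side note on the connection URL: when credentials are interpolated into the string directly, a password containing characters such as `@` or `/` can be misread as a URL delimiter. A minimal sketch, assuming the same values as in `config`, that percent-escapes the credentials with the standard library before the string is passed to `create_engine` (the helper name `build_oracle_url` and the sample values are hypothetical):

```python
from urllib.parse import quote_plus


def build_oracle_url(username: str, password: str, host: str, port: int, service_name: str) -> str:
    """Build an oracle+oracledb SQLAlchemy URL with escaped credentials."""
    # quote_plus escapes characters such as '@', ':' and '/' that would
    # otherwise be read as URL delimiters.
    user = quote_plus(username)
    pwd = quote_plus(password)
    service = quote_plus(service_name)
    return f"oracle+oracledb://{user}:{pwd}@{host}:{port}/?service_name={service}"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(build_oracle_url("scott", "p@ss/word", "db.example.com", 1521, "ORCLPDB1"))
```

This only changes how the URL is assembled. It is not presented as the cause of the worker kills described here.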
My guess is that the oracledb library is causing the issues, but I cannot find any related information, and I'm not getting any meaningful messages in the Gunicorn logs:
[2023-02-13 00:02:27 +0000] [1] [WARNING] Worker with pid 203 was terminated due to signal 9
[2023-02-13 00:02:27 +0000] [207] [INFO] Booting worker with pid: 207
[2023-02-13 00:02:35 +0000] [205] [INFO] Started server process [205]
[2023-02-13 00:02:35 +0000] [205] [INFO] Waiting for application startup.
[2023-02-13 00:02:35 +0000] [205] [INFO] Application startup complete.
some-ip:some-port - "GET /status HTTP/1.1" 200
some-ip:some-port - "GET /status HTTP/1.1" 200
[2023-02-13 00:02:38 +0000] [1] [WARNING] Worker with pid 197 was terminated due to signal 9
[2023-02-13 00:02:38 +0000] [209] [INFO] Booting worker with pid: 209
[2023-02-13 00:02:45 +0000] [207] [INFO] Started server process [207]
[2023-02-13 00:02:45 +0000] [207] [INFO] Waiting for application startup.
[2023-02-13 00:02:45 +0000] [207] [INFO] Application startup complete.
some-ip:some-port - "GET /status HTTP/1.1" 200
some-ip:some-port - "GET /status HTTP/1.1" 200
[2023-02-13 00:02:47 +0000] [1] [WARNING] Worker with pid 205 was terminated due to signal 9
[2023-02-13 00:02:47 +0000] [212] [INFO] Booting worker with pid: 212
[2023-02-13 00:02:55 +0000] [209] [INFO] Started server process [209]
[2023-02-13 00:02:55 +0000] [209] [INFO] Waiting for application startup.
[2023-02-13 00:02:55 +0000] [209] [INFO] Application startup complete.
some-ip:some-port - "GET /status HTTP/1.1" 200
some-ip:some-port - "GET /status HTTP/1.1" 200
[2023-02-13 00:02:57 +0000] [1] [WARNING] Worker with pid 207 was terminated due to signal 9
[2023-02-13 00:02:57 +0000] [214] [INFO] Booting worker with pid: 214
[2023-02-13 00:03:04 +0000] [212] [INFO] Started server process [212]
[2023-02-13 00:03:04 +0000] [212] [INFO] Waiting for application startup.
[2023-02-13 00:03:04 +0000] [212] [INFO] Application startup complete.
some-ip:some-port - "GET /status HTTP/1.1" 200
[2023-02-13 00:03:06 +0000] [1] [WARNING] Worker with pid 209 was terminated due to signal 9
[2023-02-13 00:03:06 +0000] [217] [INFO] Booting worker with pid: 217
[2023-02-13 00:03:14 +0000] [214] [INFO] Started server process [214]
[2023-02-13 00:03:14 +0000] [214] [INFO] Waiting for application startup.
[2023-02-13 00:03:14 +0000] [214] [INFO] Application startup complete.
some-ip:some-port - "GET /status HTTP/1.1" 200
some-ip:some-port - "GET /status HTTP/1.1" 200
[2023-02-13 00:03:17 +0000] [1] [WARNING] Worker with pid 212 was terminated due to signal 9
[2023-02-13 00:03:17 +0000] [219] [INFO] Booting worker with pid: 219
[2023-02-13 00:03:25 +0000] [217] [INFO] Started server process [217]
[2023-02-13 00:03:25 +0000] [217] [INFO] Waiting for application startup.
[2023-02-13 00:03:25 +0000] [217] [INFO] Application startup complete.
some-ip:some-port - "GET /status HTTP/1.1" 200
some-ip:some-port - "GET /status HTTP/1.1" 200
Previously my container was killed after ~1 min because it was still in an unready state, so I changed the Gunicorn timeout from the default 30 seconds to 90 seconds. It helped a bit: some workers now stay up while others are still killed, but at least the application is able to handle requests. I've also increased the log level to debug, but there isn't any helpful information there.
Any ideas?

For some reason 4 workers were too many after introducing those libraries. After changing the config to 2 workers and 4 threads, it started working correctly.
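The settings from this fix can also live in a Gunicorn config file rather than in the `CMD` flags. A minimal sketch, assuming a file named `gunicorn.conf.py` next to the app (Gunicorn config files are plain Python). The bind address, worker class, timeout and logging values are copied from the Dockerfile in the question, and only `workers` and `threads` reflect the change. The root cause is not confirmed in this thread, although a worker terminated by signal 9 is often what an out-of-memory kill looks like inside a container, which would fit fewer worker processes helping.

```python
# gunicorn.conf.py (hypothetical file name and location)
# Values mirror the flags from the Dockerfile, with the worker count reduced.

bind = "0.0.0.0:8000"
worker_class = "uvicorn.workers.UvicornWorker"
workers = 2          # was 4; each worker is a separate process with its own memory
threads = 4          # as stated in the answer
timeout = 90         # seconds before a silent worker is killed and restarted
accesslog = "-"      # write the access log to stdout
loglevel = "debug"
```

It would be started with `gunicorn -c gunicorn.conf.py wsgi:app`. As far as I know the `threads` setting only affects the gthread worker type, so with `UvicornWorker` the lower worker count is probably the part that mattered.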

Related

Liveness probe failed request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

OpenShift 3.11: while upgrading from Spring Boot 1.4.5 to 2.6.1 we are observing intermittent timeouts for the liveness probe, with the warning below:
Liveness probe failed: Get http://172.40.23.99:8090/monitoring/health: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Traffic is very low, and memory/CPU/thread usage is well below the limits.
The issue reproduces on different cluster compute nodes.
The deployment configuration, hardware and resources weren't changed as part of the upgrade.
Deployment configuration for the liveness probe:
Liveness: http-get http://:8090/monitoring/health delay=90s timeout=3s period=50s #success=1 #failure=5
Docker base image:
"name": "redhat-openjdk-18/openjdk18-openshift","version": "1.12"
According to the access logs, the health checks complete within milliseconds, while the defined liveness timeout is 3 seconds:
10.131.4.1 - - [11/Sep/2022:14:22:07 +0000] "GET /monitoring/health HTTP/1.1" 200 907 13
10.131.4.1 - - [11/Sep/2022:14:22:57 +0000] "GET /monitoring/health HTTP/1.1" 200 907 21
10.131.4.1 - - [11/Sep/2022:14:23:47 +0000] "GET /monitoring/health HTTP/1.1" 200 907 9
10.131.4.1 - - [11/Sep/2022:14:24:37 +0000] "GET /monitoring/health HTTP/1.1" 200 907 19
10.131.4.1 - - [11/Sep/2022:14:25:27 +0000] "GET /monitoring/health HTTP/1.1" 200 907 8
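A quick way to check a larger log for slow health checks is to parse these lines and compare the last field with the probe timeout. A minimal sketch, assuming the trailing number is the response time in milliseconds (the question implies this but does not show the log pattern); the names `LINE_RE` and `slow_requests` are made up for illustration:

```python
import re

# Pattern for the access-log lines shown above; the trailing field is
# assumed to be the response time in milliseconds.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+) (?P<millis>\d+)$'
)

PROBE_TIMEOUT_MS = 3000  # timeout=3s from the liveness probe definition


def slow_requests(lines, limit_ms=PROBE_TIMEOUT_MS):
    """Return (timestamp, millis) for requests that took at least limit_ms."""
    slow = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines in a different format
        millis = int(match.group("millis"))
        if millis >= limit_ms:
            slow.append((match.group("ts"), millis))
    return slow


if __name__ == "__main__":
    sample = [
        '10.131.4.1 - - [11/Sep/2022:14:22:07 +0000] "GET /monitoring/health HTTP/1.1" 200 907 13',
        '10.131.4.1 - - [11/Sep/2022:14:22:57 +0000] "GET /monitoring/health HTTP/1.1" 200 907 21',
    ]
    print(slow_requests(sample))
```

None of the lines quoted above comes close to the limit. Keep in mind that an access log usually records only the time spent once the server has picked up the request, so a probe that times out while waiting for a connection may not show up here at all.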
I tried disabling all the components checked as part of the actuator health check (db, redis, diskspace, ping, refresh...), with the same behavior.
One important observation: when scaling up (adding more instances), the warning disappears. It also stops when all incoming traffic is blocked.
The issue seems to be resource related, with something being choked periodically, but all available metrics look fine.
Any suggestions?
Tomcat was reaching its maximum number of connections, which caused this behavior.
For some unclear reason server.tomcat.max-connections was set to 1000 in the environment, whereas the default is 10000.
The issue was not reproducible with the old Spring Boot (1.4.5) because the server.tomcat.max-connections property was introduced in version 1.5.0 and had no effect on 1.4.5, which was running with the default of 10k.

Heroku: This site can’t be reached

My Heroku app cannot be accessed after a build. The logs show that the web server node and the worker node are both listening.
It's a Flask app run by Gunicorn, and it has 2 add-ons: New Relic and RedisToGo.
Error:
This site can’t be reached.
appname.herokuapp.com’s server IP address could not be found.
DNS_PROBE_FINISHED_NXDOMAIN
Logs:
```
2020-02-05T16:24:31.556201+00:00 heroku[worker.1]: Starting process with command `python worker.py`
2020-02-05T16:24:32.278863+00:00 heroku[worker.1]: State changed from starting to up
2020-02-05T16:24:33.363132+00:00 app[worker.1]: 16:24:33 RQ worker started, version 0.5.1
2020-02-05T16:24:33.364484+00:00 app[worker.1]: 16:24:33
2020-02-05T16:24:33.364574+00:00 app[worker.1]: 16:24:33 *** Listening on high, default, low...
2020-02-05T16:24:35.295791+00:00 heroku[web.1]: Starting process with command `newrelic-admin run-program gunicorn app:server`
2020-02-05T16:24:41.159117+00:00 heroku[web.1]: State changed from starting to up
2020-02-05T16:24:40.959907+00:00 app[web.1]: [2020-02-05 16:24:40 +0000] [4] [INFO] Starting gunicorn 20.0.4
2020-02-05T16:24:40.961836+00:00 app[web.1]: [2020-02-05 16:24:40 +0000] [4] [INFO] Listening at: http://0.0.0.0:21126 (4)
2020-02-05T16:24:40.962097+00:00 app[web.1]: [2020-02-05 16:24:40 +0000] [4] [INFO] Using worker: sync
2020-02-05T16:24:40.971809+00:00 app[web.1]: [2020-02-05 16:24:40 +0000] [12] [INFO] Booting worker with pid: 12
2020-02-05T16:24:41.143051+00:00 app[web.1]: [2020-02-05 16:24:41 +0000] [20] [INFO] Booting worker with pid: 20
```
EDIT:
Procfile:
web: newrelic-admin run-program gunicorn app:server
worker: python worker.py
Worker:
import os
import redis
from rq import Worker, Queue, Connection
listen = ['high', 'default', 'low']
redis_url = os.getenv('REDISTOGO_URL', 'redis://localhost:6379')
conn = redis.from_url(redis_url)
if __name__ == '__main__':
    with Connection(conn):
        worker = Worker(list(map(Queue, listen)))
        worker.work()
The app is accessible now. It seems to have been downtime on Heroku's side, although they haven't reported it on their status page.
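`DNS_PROBE_FINISHED_NXDOMAIN` means the browser could not resolve the hostname, which points at name resolution rather than at the Gunicorn process, and the logs in the question show the app itself starting normally. A minimal sketch of a resolution check using only the standard library (the helper name `resolves` is hypothetical):

```python
import socket


def resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        # Raised when the name cannot be resolved (e.g. NXDOMAIN).
        return False


if __name__ == "__main__":
    # Replace "localhost" with the app's hostname, e.g. the
    # appname.herokuapp.com placeholder from the question.
    print(resolves("localhost"))
```

If the name does not resolve from several networks while the dyno logs look healthy, the problem is likely outside the application, which matches what was observed here.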

Installing CDH using Cloudera Manager: No such file or directory

I am installing CDH using CM, and all parcels were downloaded and distributed successfully.
However, none of the agents decompress the parcels once distribution is 100% finished. The log says:
[21/Nov/2018 09:53:04 +0000] 30292 MainThread parcel INFO Executing command ['chown', 'root:yarn', u'/opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop-yarn/bin/container-executor']
[21/Nov/2018 09:53:04 +0000] 30292 MainThread parcel INFO chmod: /opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop-yarn/bin/container-executor 6050
[21/Nov/2018 09:53:04 +0000] 30292 MainThread parcel INFO Executing command ['chmod', '6050', u'/opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop-yarn/bin/container-executor']
[21/Nov/2018 09:53:04 +0000] 30292 MainThread parcel ERROR Error while attempting to modify permissions of file '/opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop-0.20-mapreduce/sbin/Linux/task-controller'.
Traceback (most recent call last):
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/parcel.py", line 520, in ensure_permissions
    file = cmf.util.validate_and_open_fd(path, self.get_parcel_home(parcel))
  File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/util/__init__.py", line 358, in validate_and_open_fd
    fd = os.open(path, flags)
OSError: [Errno 2] No such file or directory: '/opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop-0.20-mapreduce/sbin/Linux/task-controller'
[21/Nov/2018 09:54:04 +0000] 30292 MainThread heartbeat_tracker INFO HB stats (seconds): num:40 LIFE_MIN:0.01 min:0.01 mean:0.01 max:0.01 LIFE_MAX:0.05
[21/Nov/2018 10:04:04 +0000] 30292 MainThread heartbeat_tracker INFO HB stats (seconds): num:40 LIFE_MIN:0.01 min:0.01 mean:0.01 max:0.01 LIFE_MAX:0.05
Why is the path '/opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop-0.20-mapreduce/sbin/Linux/task-controller' missing, and how can I address this issue?
Any help is appreciated.
I had exactly the same trouble and could not solve it even after spending a lot of time on it. I finally ended up installing via the "package" method instead of the "parcel" method.

Hue HBase API Error: None

When I use the web UI for HBase in Hue, I just get the error message "API Error: None", and the log says:
[30/Jun/2015 21:16:30 +0000] access INFO 114.112.124.241 admin - "GET /hbase/ HTTP/1.0"
[30/Jun/2015 21:16:31 +0000] access INFO 114.112.124.241 admin - "POST /hbase/api/getClusters HTTP/1.0"
[30/Jun/2015 21:16:31 +0000] access INFO 114.112.124.241 admin - "GET /debug/check_config_ajax HTTP/1.0"
[30/Jun/2015 21:16:31 +0000] access INFO 114.112.124.241 admin - "POST /hbase/api/getTableList/HBase HTTP/1.0"
[30/Jun/2015 21:16:31 +0000] thrift_util INFO Thrift exception; retrying: None
[30/Jun/2015 21:16:31 +0000] thrift_util INFO Thrift exception; retrying: None
[30/Jun/2015 21:16:31 +0000] thrift_util WARNING Out of retries for thrift call: getTableNames
[30/Jun/2015 21:16:31 +0000] thrift_util INFO Thrift saw a transport exception: None
[30/Jun/2015 21:16:31 +0000] middleware INFO Processing exception: Api Error: None: Traceback (most recent call last):
  File "/opt/cloudera/parcels/CDH-4.7.1-1.cdh4.7.1.p0.47/share/hue/build/env/lib/python2.6/site-packages/Django-1.2.3-py2.6.egg/django/core/handlers/base.py", line 100, in get_response
    response = callback(request, *callback_args, **callback_kwargs)
  File "/opt/cloudera/parcels/CDH-4.7.1-1.cdh4.7.1.p0.47/share/hue/apps/hbase/src/hbase/views.py", line 65, in api_router
    return api_dump(HbaseApi().query(*url_params))
  File "/opt/cloudera/parcels/CDH-4.7.1-1.cdh4.7.1.p0.47/share/hue/apps/hbase/src/hbase/api.py", line 48, in query
    raise PopupException(_("Api Error: %s") % e.message)
PopupException: Api Error: None
The config in hue.ini is:
[hbase]
hbase_clusters=(Cluster|ip-172-31-13-29.cn-north-1.compute.internal:9290)
The Thrift port is 9290 (hbase.regionserver.thrift.port), and "Enable HBase Thrift Server Framed Transport" is false (hbase.regionserver.thrift.framed).
Are you using Thrift Server v1 (and not v2)?
Did you make sure that 'framed' was also selected in the hue.ini?
[hbase]
thrift_transport=framed
How to setup the HBase Browser and use it with Security.
HBASE VERSION 1.4.6
HUE VERSION 4.2.0
I am running clustered HBASE on AWS EMR
To fix the issue:
1) Start thrift1 on the master node of HBase (the default port is 9090), or make sure it is already running
./bin/hbase-daemon.sh start thrift -p PORT_NUMBER
2) Change the hue.ini or pseudo-distributed.ini configuration settings
[hbase]
hbase_clusters=(Cluster|MASTER_IP_OR_STANDALONE_IP:9090)
# In the case of distributed HBase, copy these files from where HBase is installed:
# hbase-site.xml, hbase-policy.xml and the regionservers file
hbase_conf_dir=PATH_OF_HBASE_CONFIG_FILES
# 'buffered' used to be the default of the HBase Thrift Server.
thrift_transport=buffered
3) Restart the hue server
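Before editing hue.ini it can help to confirm that something is actually listening on the Thrift host and port that `hbase_clusters` points to. A minimal sketch using only the standard library (the helper name `port_open` is hypothetical, and the host and port in the demo are placeholders):

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


if __name__ == "__main__":
    # Replace with the Thrift server host and port from hbase_clusters.
    print(port_open("localhost", 9090))
```

A successful connection only shows that the port is reachable. It does not verify that the transport configured in Hue (`framed` or `buffered`) matches the one the Thrift server uses.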

HTTPS with gunicorn?

I am running Gunicorn 19.0 on a Debian server to serve a Django 1.8 site. I am also running nginx to serve the site's static assets.
My DNS is managed by Gandi and I have CloudFlare in front of the server. The site is running happily on HTTP. Now I would like to serve it over HTTPS. My question is about how to go about this.
I have generated a certificate by following Gandi's instructions. Now I have a server.csr and a myserver.key file on my server.
I have a script to run Gunicorn and I have amended it to point at these certificate files:
exec gunicorn ${DJANGO_WSGI_MODULE}:application \
--certfile=/home/me/server.csr
--keyfile=/home/me/myserver.key
--name $NAME \
--workers $NUM_WORKERS \
--user=$USER --group=$GROUP \
--bind=unix:$SOCKFILE \
--log-level=debug \
--log-file=-
The script seems to run cleanly as usual, but now if I go to https://example.com or http://example.com there is nothing there (521 and 404 respectively).
Is there an additional step I need to carry out?
The Gunicorn logs show the following:
Starting myapp as hello
[2015-06-25 10:28:18 +0000] [11331] [INFO] Starting gunicorn 19.3.0
[2015-06-25 10:28:18 +0000] [11331] [ERROR] Connection in use: ('127.0.0.1', 8000)
[2015-06-25 10:28:18 +0000] [11331] [ERROR] Retrying in 1 second.
[2015-06-25 10:28:19 +0000] [11331] [ERROR] Connection in use: ('127.0.0.1', 8000)
[2015-06-25 10:28:19 +0000] [11331] [ERROR] Retrying in 1 second.
[2015-06-25 10:28:20 +0000] [11331] [ERROR] Connection in use: ('127.0.0.1', 8000)
[2015-06-25 10:28:20 +0000] [11331] [ERROR] Retrying in 1 second.
[2015-06-25 10:28:21 +0000] [11331] [ERROR] Connection in use: ('127.0.0.1', 8000)
[2015-06-25 10:28:21 +0000] [11331] [ERROR] Retrying in 1 second.
[2015-06-25 10:28:22 +0000] [11331] [ERROR] Connection in use: ('127.0.0.1', 8000)
[2015-06-25 10:28:22 +0000] [11331] [ERROR] Retrying in 1 second.
[2015-06-25 10:28:23 +0000] [11331] [ERROR] Can't connect to ('127.0.0.1', 8000)
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US)
I'm also confused because most of the gunicorn examples talk about .crt files, but I only have a .csr file.
As @SteveKlein says above in the comments:
SSL should be set up in your NGINX config, not your Gunicorn one. When you set up NGINX, you'll need to decide whether you want to serve both plain text and SSL or redirect everything to SSL.
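On the `.csr` versus `.crt` confusion in the question: a `.csr` file is a certificate signing request, which is what gets submitted to the certificate authority. The signed certificate that comes back (commonly a `.crt` file) is what the TLS-terminating server needs alongside the private key. A minimal sketch, assuming PEM-encoded files, that tells the two apart by their header line (the function name `pem_kind` is hypothetical):

```python
def pem_kind(pem_text: str) -> str:
    """Classify PEM content by its first BEGIN marker."""
    prefix, suffix = "-----BEGIN ", "-----"
    for line in pem_text.splitlines():
        line = line.strip()
        if line.startswith(prefix) and line.endswith(suffix):
            label = line[len(prefix):-len(suffix)]
            if label in ("CERTIFICATE REQUEST", "NEW CERTIFICATE REQUEST"):
                return "csr"
            if label == "CERTIFICATE":
                return "certificate"
            if label.endswith("PRIVATE KEY"):
                return "private key"
            return "other: " + label
    return "not PEM"


if __name__ == "__main__":
    # Shortened, fake PEM content for illustration only.
    demo = "-----BEGIN CERTIFICATE REQUEST-----\nMIIB...\n-----END CERTIFICATE REQUEST-----\n"
    print(pem_kind(demo))
```

If a file reports `csr`, it still has to be submitted to the certificate authority before any server can use the result.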

Resources