Installing CDH using Cloudera Manager: No such file or directory

I am installing CDH using Cloudera Manager (CM) and have successfully downloaded and distributed all parcels.
However, none of the agents decompress the parcels once distribution is 100% finished. Checking the agent log, it says:
[21/Nov/2018 09:53:04 +0000] 30292 MainThread parcel INFO Executing command ['chown', 'root:yarn', u'/opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop-yarn/bin/container-executor']
[21/Nov/2018 09:53:04 +0000] 30292 MainThread parcel INFO chmod: /opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop-yarn/bin/container-executor 6050
[21/Nov/2018 09:53:04 +0000] 30292 MainThread parcel INFO Executing command ['chmod', '6050', u'/opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop-yarn/bin/container-executor']
[21/Nov/2018 09:53:04 +0000] 30292 MainThread parcel ERROR Error while attempting to modify permissions of file '/opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop-0.20-mapreduce/sbin/Linux/task-controller'.
Traceback (most recent call last):
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/parcel.py", line 520, in ensure_permissions
file = cmf.util.validate_and_open_fd(path, self.get_parcel_home(parcel))
File "/opt/cloudera/cm-agent/lib/python2.7/site-packages/cmf/util/__init__.py", line 358, in validate_and_open_fd
fd = os.open(path, flags)
OSError: [Errno 2] No such file or directory: '/opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop-0.20-mapreduce/sbin/Linux/task-controller'
[21/Nov/2018 09:54:04 +0000] 30292 MainThread heartbeat_tracker INFO HB stats (seconds): num:40 LIFE_MIN:0.01 min:0.01 mean:0.01 max:0.01 LIFE_MAX:0.05
[21/Nov/2018 10:04:04 +0000] 30292 MainThread heartbeat_tracker INFO HB stats (seconds): num:40 LIFE_MIN:0.01 min:0.01 mean:0.01 max:0.01 LIFE_MAX:0.05
Why is the path '/opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/hadoop-0.20-mapreduce/sbin/Linux/task-controller' missing, and how can I address this issue?
Any help is appreciated.

I had exactly the same trouble and could not solve it despite wasting a lot of time on it. I finally ended up installing via the "package" method instead of the "parcel" method.

Related

Gunicorn kills workers while using oracledb + sqlalchemy

I had my FastAPI application running for a long time, but recently I added SQLAlchemy to connect to an Oracle database. While running the app locally everything works fine, but as soon as I try to run it with Gunicorn using Docker + K8s I get issues with the Gunicorn workers. Some of the workers seem to be getting killed all the time, while others keep serving requests. Sometimes it gets to the point where the whole container is killed and restarted. The only changes between the working version and this problematic version of the application are the addition of SQLAlchemy and the oracledb library. The Gunicorn config stayed the same, and the Dockerfile is the same as well.
Here is my Dockerfile:
FROM python:3.10
EXPOSE 8000
WORKDIR /app
RUN mkdir -p /app
COPY . /app
RUN pip install poetry
RUN poetry config virtualenvs.create false
RUN poetry install
COPY certs/ /usr/local/share/ca-certificates
RUN update-ca-certificates
ENV REQUESTS_CA_BUNDLE "/etc/ssl/certs/ca-certificates.crt"
CMD [ "gunicorn", "wsgi:app", "--bind", "0.0.0.0:8000", "-w 4", "--access-logfile", "-", "--log-level", "debug", "-k uvicorn.workers.UvicornWorker", "-t 90"]
This is how I create the engine:
from sqlalchemy import Engine, create_engine
import config
def get_engine() -> Engine:
    # Standard SQLAlchemy URL form: dialect+driver://user:password@host:port/?service_name=...
    connection_url = f"oracle+oracledb://{config.DB_USERNAME}:{config.DB_PASSWORD}@{config.DB_HOST}:{config.DB_PORT}/?service_name={config.DB_SERVICE_NAME}"
    engine = create_engine(connection_url)
    return engine
So it seems that even if there were a problem with SQLAlchemy or the DB connection, I shouldn't see any issues before I actually try to use the engine, and the engine is used only in endpoints, through a repository...
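For what it's worth, that reasoning is sound: SQLAlchemy's create_engine() is lazy and does not open a connection until the engine is first used. A hypothetical endpoint wired roughly the way described (the names app, /items and the query are illustrative, not taken from the question) would only touch Oracle inside a request:
from fastapi import FastAPI
from sqlalchemy import text

app = FastAPI()
engine = get_engine()  # defined above; no Oracle connection is opened at import time

@app.get("/items")
def list_items():
    # The first real connection to the database is opened here, inside a request.
    with engine.connect() as conn:
        rows = conn.execute(text("SELECT id, name FROM items"))
        return [dict(row._mapping) for row in rows]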
My thought is that maybe the oracledb library is causing issues, but I cannot find any related information, and I'm not getting any meaningful messages from the Gunicorn logs:
[2023-02-13 00:02:27 +0000] [1] [WARNING] Worker with pid 203 was terminated due to signal 9
[2023-02-13 00:02:27 +0000] [207] [INFO] Booting worker with pid: 207
[2023-02-13 00:02:35 +0000] [205] [INFO] Started server process [205]
[2023-02-13 00:02:35 +0000] [205] [INFO] Waiting for application startup.
[2023-02-13 00:02:35 +0000] [205] [INFO] Application startup complete.
some-ip:some-port - "GET /status HTTP/1.1" 200
some-ip:some-port - "GET /status HTTP/1.1" 200
[2023-02-13 00:02:38 +0000] [1] [WARNING] Worker with pid 197 was terminated due to signal 9
[2023-02-13 00:02:38 +0000] [209] [INFO] Booting worker with pid: 209
[2023-02-13 00:02:45 +0000] [207] [INFO] Started server process [207]
[2023-02-13 00:02:45 +0000] [207] [INFO] Waiting for application startup.
[2023-02-13 00:02:45 +0000] [207] [INFO] Application startup complete.
some-ip:some-port - "GET /status HTTP/1.1" 200
some-ip:some-port - "GET /status HTTP/1.1" 200
[2023-02-13 00:02:47 +0000] [1] [WARNING] Worker with pid 205 was terminated due to signal 9
[2023-02-13 00:02:47 +0000] [212] [INFO] Booting worker with pid: 212
[2023-02-13 00:02:55 +0000] [209] [INFO] Started server process [209]
[2023-02-13 00:02:55 +0000] [209] [INFO] Waiting for application startup.
[2023-02-13 00:02:55 +0000] [209] [INFO] Application startup complete.
some-ip:some-port - "GET /status HTTP/1.1" 200
some-ip:some-port - "GET /status HTTP/1.1" 200
[2023-02-13 00:02:57 +0000] [1] [WARNING] Worker with pid 207 was terminated due to signal 9
[2023-02-13 00:02:57 +0000] [214] [INFO] Booting worker with pid: 214
[2023-02-13 00:03:04 +0000] [212] [INFO] Started server process [212]
[2023-02-13 00:03:04 +0000] [212] [INFO] Waiting for application startup.
[2023-02-13 00:03:04 +0000] [212] [INFO] Application startup complete.
some-ip:some-port - "GET /status HTTP/1.1" 200
[2023-02-13 00:03:06 +0000] [1] [WARNING] Worker with pid 209 was terminated due to signal 9
[2023-02-13 00:03:06 +0000] [217] [INFO] Booting worker with pid: 217
[2023-02-13 00:03:14 +0000] [214] [INFO] Started server process [214]
[2023-02-13 00:03:14 +0000] [214] [INFO] Waiting for application startup.
[2023-02-13 00:03:14 +0000] [214] [INFO] Application startup complete.
some-ip:some-port- "GET /status HTTP/1.1" 200
some-ip:some-port - "GET /status HTTP/1.1" 200
[2023-02-13 00:03:17 +0000] [1] [WARNING] Worker with pid 212 was terminated due to signal 9
[2023-02-13 00:03:17 +0000] [219] [INFO] Booting worker with pid: 219
[2023-02-13 00:03:25 +0000] [217] [INFO] Started server process [217]
[2023-02-13 00:03:25 +0000] [217] [INFO] Waiting for application startup.
[2023-02-13 00:03:25 +0000] [217] [INFO] Application startup complete.
some-ip:some-port - "GET /status HTTP/1.1" 200
some-ip:some-port - "GET /status HTTP/1.1" 200
Previously the Docker image was getting killed after ~1 minute because it was still in an unready state, so I changed the Gunicorn timeout from the default 30 seconds to 90 seconds. It helped a bit: now some workers are running while others are being killed, and the application is at least able to handle some requests. I've also increased the logging level to debug, but there isn't any helpful information there.
Any ideas?
It seems that for some reason 4 workers were too much after introducing those libraries; after modifying the config to 2 workers and 4 threads it started working correctly.
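For context, workers terminated with signal 9 are typically being OOM-killed by the kernel or by the pod's memory limit, and each Gunicorn worker is a separate process carrying its own copy of the application plus the Oracle client libraries, so halving the worker count roughly halves the footprint. A minimal sketch of the adjusted settings as a gunicorn.conf.py (the file name and exact values are illustrative, not taken from the question):
# gunicorn.conf.py -- illustrative sketch, not the asker's actual config
bind = "0.0.0.0:8000"
workers = 2   # was 4; each worker is a full process with its own copy of the app
threads = 4   # note: Gunicorn's `threads` setting only affects the gthread worker class,
              # so with UvicornWorker the reduction in worker processes is what matters
worker_class = "uvicorn.workers.UvicornWorker"
timeout = 90
accesslog = "-"
loglevel = "debug"
Recent Gunicorn versions (20+) look for ./gunicorn.conf.py automatically, so the Dockerfile's CMD could then shrink to just gunicorn wsgi:app.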

Error sending data to APM even after successful connectivity

I am able to establish the APM connection:
2021-03-09 17:45:05,741 [Attach Listener] INFO co.elastic.apm.agent.configuration.StartupInfo - VM Arguments: [-XX:TieredStopAtLevel=1, -Xmx6g, -Dfile.encoding=UTF-8, -Duser.country=IN, -Duser.language=en, -Duser.variant]
2021-03-09 17:45:08,192 [Attach Listener] INFO co.elastic.apm.agent.impl.ElasticApmTracer - Tracer switched to RUNNING state
2021-03-09 17:45:08,734 [elastic-apm-server-healthcheck] INFO co.elastic.apm.agent.report.ApmServerHealthChecker - Elastic APM server is available: { "build_date": "2021-02-15T12:37:48Z", "build_sha": "e77061bb3aaedae5ae8dd0ca193eb662513aedde", "version": "7.11.0"}
But after the connection is established, it still throws the error below. What could be wrong here? I'd appreciate any input on this.
2021-03-09 17:45:53,484 [elastic-apm-server-reporter] INFO co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Backing off for 0 seconds (+/-10%)
2021-03-09 17:45:53,489 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Error sending data to APM server: Read timed out, response code is -1
2021-03-09 17:45:53,489 [elastic-apm-server-reporter] WARN co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - null
2021-03-09 17:46:08,890 [elastic-apm-server-reporter] INFO co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Backing off for 1 seconds (+/-10%)
2021-03-09 17:46:09,922 [elastic-apm-server-reporter] ERROR co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - Error sending data to APM server: Read timed out, response code is -1
2021-03-09 17:46:09,922 [elastic-apm-server-reporter] WARN co.elastic.apm.agent.report.IntakeV2ReportingEventHandler - null
Check the URLs in your kibana.yml file. When I set up APM on my machine, some of my URLs (elasticsearch.hosts, xpack.fleet.outputs) defaulted to my current IP address (instead of localhost), and that address changed after a reboot.
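For illustration, the kind of kibana.yml entry the answer is referring to might look like the following on a single-host setup; the exact keys present depend on your stack version, so treat this as an assumption to check against your own config:
# kibana.yml -- illustrative sketch; adjust hosts/ports to your environment
# Prefer localhost or a stable hostname over an IP address that can change after a reboot.
elasticsearch.hosts: ["http://localhost:9200"]
# If Fleet outputs are configured (xpack.fleet.outputs), their hosts should also use a stable address.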

Hue HBase API Error: None

When I use the Web UI for HBase in Hue, I just get an error message, API Error: None, and the log says:
[30/Jun/2015 21:16:30 +0000] access INFO 114.112.124.241 admin - "GET /hbase/ HTTP/1.0"
[30/Jun/2015 21:16:31 +0000] access INFO 114.112.124.241 admin - "POST /hbase/api/getClusters HTTP/1.0"
[30/Jun/2015 21:16:31 +0000] access INFO 114.112.124.241 admin - "GET /debug/check_config_ajax HTTP/1.0"
[30/Jun/2015 21:16:31 +0000] access INFO 114.112.124.241 admin - "POST /hbase/api/getTableList/HBase HTTP/1.0"
[30/Jun/2015 21:16:31 +0000] thrift_util INFO Thrift exception; retrying: None
[30/Jun/2015 21:16:31 +0000] thrift_util INFO Thrift exception; retrying: None
[30/Jun/2015 21:16:31 +0000] thrift_util WARNING Out of retries for thrift call: getTableNames
[30/Jun/2015 21:16:31 +0000] thrift_util INFO Thrift saw a transport exception: None
[30/Jun/2015 21:16:31 +0000] middleware INFO Processing exception: Api Error: None: Traceback (most recent call last):
File "/opt/cloudera/parcels/CDH-4.7.1-1.cdh4.7.1.p0.47/share/hue/build/env/lib/python2.6/site-packages/Django-1.2.3-py2.6.egg/django/core/handlers/base.py", line 100, in get_response
response = callback(request, *callback_args, **callback_kwargs)
File "/opt/cloudera/parcels/CDH-4.7.1-1.cdh4.7.1.p0.47/share/hue/apps/hbase/src/hbase/views.py", line 65, in api_router
return api_dump(HbaseApi().query(*url_params))
File "/opt/cloudera/parcels/CDH-4.7.1-1.cdh4.7.1.p0.47/share/hue/apps/hbase/src/hbase/api.py", line 48, in query
raise PopupException(_("Api Error: %s") % e.message)
PopupException: Api Error: None
and the config in hue.ini is:
[hbase]
hbase_clusters=(Cluster|ip-172-31-13-29.cn-north-1.compute.internal:9290)
and the Thrift port is 9290 (hbase.regionserver.thrift.port), and "Enable HBase Thrift Server Framed Transport" (hbase.regionserver.thrift.framed) is false.
Are you using Thrift Server v1 (and not v2)?
Did you make sure that 'framed' was also selected in the hue.ini?
[hbase]
thrift_transport=framed
How to set up the HBase Browser and use it with security:
HBase version: 1.4.6
Hue version: 4.2.0
I am running clustered HBase on AWS EMR.
To fix the issue:
1) Start thrift1 on the HBase master node (the default port is 9090), or make sure it is already running:
./bin/hbase-daemon.sh start thrift -p PORT_NUMBER
2) Change the hue.ini or pseudo-distributed.ini configuration settings:
[hbase]
hbase_clusters=(Cluster|MASTER_IP_OR_STANDALONE_IP:9090)
# Copy these files from where HBase is installed, in the case of distributed HBase:
# hbase-site.xml, hbase-policy.xml and the regionservers file.
hbase_conf_dir=PATH_OF_HBASE_CONFIG_FILES
# 'buffered' used to be the default of the HBase Thrift Server.
thrift_transport=buffered
3) Restart the hue server

HTTPS with gunicorn?

I am running Gunicorn 19.0 on a Debian server to serve a Django 1.8 site. I am also running nginx to serve the site's static assets.
My DNS is managed by Gandi and I have CloudFlare in front of the server. The site is running happily on HTTP. Now I would like to serve it over HTTPS. My question is about how to go about this.
I have generated a certificate by following Gandi's instructions. Now I have a server.csr and a myserver.key file on my server.
I have a script to run Gunicorn and I have amended it to point at these certificate files:
exec gunicorn ${DJANGO_WSGI_MODULE}:application \
--certfile=/home/me/server.csr \
--keyfile=/home/me/myserver.key \
--name $NAME \
--workers $NUM_WORKERS \
--user=$USER --group=$GROUP \
--bind=unix:$SOCKFILE \
--log-level=debug \
--log-file=-
The script seems to run cleanly as usual, but now if I go to https://example.com or http://example.com there is nothing there (521 and 404 respectively).
Is there an additional step I need to carry out?
The Gunicorn logs show the following:
Starting myapp as hello
[2015-06-25 10:28:18 +0000] [11331] [INFO] Starting gunicorn 19.3.0
[2015-06-25 10:28:18 +0000] [11331] [ERROR] Connection in use: ('127.0.0.1', 8000)
[2015-06-25 10:28:18 +0000] [11331] [ERROR] Retrying in 1 second.
[2015-06-25 10:28:19 +0000] [11331] [ERROR] Connection in use: ('127.0.0.1', 8000)
[2015-06-25 10:28:19 +0000] [11331] [ERROR] Retrying in 1 second.
[2015-06-25 10:28:20 +0000] [11331] [ERROR] Connection in use: ('127.0.0.1', 8000)
[2015-06-25 10:28:20 +0000] [11331] [ERROR] Retrying in 1 second.
[2015-06-25 10:28:21 +0000] [11331] [ERROR] Connection in use: ('127.0.0.1', 8000)
[2015-06-25 10:28:21 +0000] [11331] [ERROR] Retrying in 1 second.
[2015-06-25 10:28:22 +0000] [11331] [ERROR] Connection in use: ('127.0.0.1', 8000)
[2015-06-25 10:28:22 +0000] [11331] [ERROR] Retrying in 1 second.
[2015-06-25 10:28:23 +0000] [11331] [ERROR] Can't connect to ('127.0.0.1', 8000)
/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US)
I'm also confused because most of the gunicorn examples talk about .crt files, but I only have a .csr file.
As @SteveKlein says above in the comments:
SSL should be set up in your NGINX config, not your Gunicorn one. When you set up NGINX, you'll need to decide if you want to serve both plain text and SSL, or redirect everything to SSL.
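In other words, terminate TLS in nginx and keep Gunicorn speaking plain HTTP on its unix socket (so the --certfile/--keyfile flags can be dropped from the Gunicorn script). Note also that a .csr is only a certificate signing request; the .crt is what the CA (Gandi) returns after you submit the CSR. A rough sketch of the nginx side, with illustrative paths and names that are assumptions rather than taken from the question:
# Illustrative nginx sketch -- socket path, server_name and certificate paths are assumptions.
upstream app_server {
    # Gunicorn keeps binding to the same unix socket as in the script above
    server unix:/home/me/run/gunicorn.sock fail_timeout=0;
}
server {
    listen 443 ssl;
    server_name example.com;
    ssl_certificate     /etc/ssl/certs/example.com.crt;   # the issued certificate, not the .csr
    ssl_certificate_key /etc/ssl/private/myserver.key;
    location /static/ {
        alias /home/me/static/;   # nginx continues to serve static assets
    }
    location / {
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass http://app_server;   # plain HTTP to Gunicorn
    }
}
server {
    # optional: redirect plain HTTP to HTTPS
    listen 80;
    server_name example.com;
    return 301 https://$host$request_uri;
}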

Storm worker not starting

I am trying to run a Storm topology, but the Storm worker refuses to start. When I run the Java command which invokes the worker process, I get the following error:
Exception: java.lang.StackOverflowError thrown from the UncaughtExceptionHandler in thread "main"
I am not able to find out what is causing this. Has anyone faced a similar issue?
Edit:
When I run the worker process with the -V flag, I get the following error:
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:java.library.path=/usr/local/lib:/opt/local/lib:/usr/lib
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:java.io.tmpdir=/tmp
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:java.compiler=<NA>
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:os.name=Linux
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:os.arch=amd64
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:os.version=3.5.0-23-generic
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:user.name=storm
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:user.home=/home/storm
588 [main] INFO org.apache.zookeeper.server.ZooKeeperServer - Server environment:user.dir=/home/storm/storm-0.9.0.1
797 [main] ERROR org.apache.zookeeper.server.NIOServerCnxn - Thread Thread[main,5,main] died
PS: When I run the same topology in a local cluster it works fine; it's only when I deploy it in cluster mode that it doesn't start.
Just found out the issue. The JAR I created to upload to the Storm cluster was kept in the Storm base directory itself. This was somehow creating a conflict that was not shown in the log file; in fact, the log file never got created.
Make sure no external JARs are present in the base Storm folder from which you start Storm. It's a really tricky error, with no indication of why it happens until you work around it.
I hope the Storm developers add this to the logs so that users facing this issue can pinpoint exactly why it is happening.
