Hypercorn runs with duplicated processes - multiprocessing

I am not sure whether this is really a Hypercorn issue, but I cannot imagine what else it could be. I have searched the net but have not found any topic close to this, so please bear with me.
I am running a server with Hypercorn on Ubuntu 20.04, with Python 3.8.10.
The problem is that it runs with duplicated processes in the background.
root 2278497 0.8 0.1 41872 33568 pts/7 S 10:03 0:00 /usr/bin/python3 /usr/local/bin/hypercorn -c config.toml main:app --reload
root 2278499 0.0 0.0 17304 11332 pts/7 S 10:03 0:00 /usr/bin/python3 -c from multiprocessing.resource_tracker import main;main(4)
root 2278500 0.7 0.1 41648 34148 pts/7 S 10:03 0:00 /usr/bin/python3 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=5, pipe_handle=7) --multiprocessing-fork
The main process is 2278497, but there are additional processes 2278499 and 2278500. I do not know why these are started.
This causes unwanted effects, because the same tasks end up being executed twice.
How can I avoid that?
EDIT:
A minimal example:
# test_main.py
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
async def root():
    return {"message": "Hello World"}

print("main module loaded.")
I then type:
sudo hypercorn test_main:app
and the stdout is:
main module loaded.
main module loaded.
[2022-11-02 15:08:45 +0100] [2364437] [INFO] Running on http://127.0.0.1:8000 (CTRL + C to quit)

I get the impression you're using Hypercorn in a way it isn't designed for: it isn't meant for you to run your own code in the same process.
That said, what you're seeing in your MWE is a master process and a worker process. You can distinguish between these by checking whether the current process is a "daemon" process (i.e. Unix-style background task) or not via:
import multiprocessing
print(multiprocessing.current_process().daemon)
This will output False for the master process and True for all worker processes; e.g. the count would increase to 5 Trues when executed as hypercorn -w5 test_main:app.
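If the goal is to make sure a one-off side effect only runs once per server start, here is a minimal sketch building on that check (start_background_task is a hypothetical placeholder for whatever you only want to run once):
import multiprocessing

def start_background_task():
    # hypothetical one-time setup that should not run in every worker
    print("starting the background task exactly once")

# only the non-daemon (master) process runs it, so the workers skip it
if not multiprocessing.current_process().daemon:
    start_background_task()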
Still, I'd suggest not using this hack in production; use another system (e.g. systemd or supervisord) to make sure that any background tasks are kept running. This would give you more control over them. You could still keep the code in the same file, just behind the normal if __name__ == '__main__': guard.
Update with more complete example:
from fastapi import FastAPI
from multiprocessing import current_process

app = FastAPI()

# see https://asgi.readthedocs.io/en/latest/specs/lifespan.html
@app.on_event('startup')
async def on_startup():
    print("asgi lifecycle startup event")

@app.on_event('shutdown')
async def on_shutdown():
    print("asgi lifecycle shutdown event")

@app.get("/")
async def root():
    return {"message": "Hello World"}

def main():
    print("running as main module")

# see https://docs.python.org/3/library/__main__.html
if __name__ == "__main__":
    import sys
    sys.exit(main())

# warning, these will also execute if this module is imported
if not current_process().daemon:
    print("main module loaded into master process")
else:
    print("main module loaded into worker process")
This can be run as:
$ hypercorn -w2 test_main:app
main module loaded into master process
main module loaded into worker process
asgi lifecycle startup event
[2022-11-04 11:39:31 +0000] [24243] [INFO] Running on http://127.0.0.1:8000 (CTRL + C to quit)
main module loaded into worker process
asgi lifecycle startup event
[2022-11-04 11:39:31 +0000] [24244] [INFO] Running on http://127.0.0.1:8000 (CTRL + C to quit)
^C
asgi lifecycle shutdown event
asgi lifecycle shutdown event
$ python -m test_hyper
running as main module
Note that the first line mentions "master process". This is Hypercorn's supervisor process, which is responsible for looking after worker processes (e.g. clean shutdown / restarting). I also show that this code can recognise that it's being run as the main module, and could do different things there. This is because Hypercorn imports this module into each process (i.e. whether it's a master or a worker).
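As for the earlier systemd suggestion, a minimal unit file sketch is below; the unit name, the paths and the idea of invoking the module's __main__ branch directly are all assumptions to adapt:
# /etc/systemd/system/background-task.service (hypothetical name and paths)
[Unit]
Description=Background task kept out of the Hypercorn worker processes
After=network.target

[Service]
# runs the module's __main__ branch instead of piggybacking on worker imports
ExecStart=/usr/bin/python3 /path/to/test_main.py
Restart=always

[Install]
WantedBy=multi-user.target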

Related

My Spider crashes whenever I schedule to production on Heroku

My spider works both on my local machine and on the Heroku platform; however, I wish to schedule the spider so that it runs according to the time given through a cron expression.
This is my periodic_requests.py file:
from pytz import timezone
import requests, pytz
from apscheduler.schedulers.twisted import TwistedScheduler
from twisted.internet import reactor

def send_request():
    requests.post(
        url="https://datacrawlapp.herokuapp.com/schedule.json",
        data={"project": "datahubsitecrawler",
              "spider": "datahubcrawlapp",
              "data_url": "https://drive.google.com/file/d/1zBfsWZdoFzoAq6ZELfw-tvvBFcVmCMI7/view?usp=sharing"
              }
    )

if __name__ == '__main__':
    scheduler = TwistedScheduler(timezone=pytz.utc),
    scheduler.add_job(send_request, 'cron', day_of_week='mon-sun', hour='19', minute='47')
    scheduler.start()
    try:
        reactor.run()  # the script will block here until the crawling is finished
    except (KeyboardInterrupt, SystemExit):
        pass
Procfile:
web: scrapyd
clock: python periodic_requests
I get this error:
File "/app/.heroku/python/lib/python3.8/site-packages/scrapyd/website.py", line 212,
in prep_tab_finished
2022-02-07T08:58:06.697647+00:00 app[web.1]: return '\n'.join(
2022-02-07T08:58:06.697647+00:00 app[web.1]: builtins.TypeError: 'NoneType' object is not iterable:
and then it crashes as shown below:
2022-02-07T08:48:32.411005+00:00 heroku[web.1]: State changed from starting to up
2022-02-07T08:48:32.611649+00:00 heroku[clock.1]: Process exited with status 0
2022-02-07T08:48:32.721748+00:00 heroku[clock.1]: State changed from up to crashed
2022-02-07T08:48:32.742567+00:00 heroku[clock.1]: State changed from crashed to starting
2022-02-07T08:48:41.000000+00:00 app[api]: Build succeeded
2022-02-07T08:48:42.473990+00:00 heroku[clock.1]: Starting process with command `python periodic_requests.py`
2022-02-07T08:48:43.195961+00:00 heroku[clock.1]: State changed from starting to up
2022-02-07T08:48:44.920927+00:00 heroku[clock.1]: Process exited with status 0
2022-02-07T08:48:44.984024+00:00 heroku[clock.1]: State changed from up to crashed
Will appreciate your help!

Issue Deploying Heroku App - RQ Worker: wrong number of arguments for 'HSET' command

Modifying an app based on this example, I am having a problem when deploying my updated Heroku app during initialization of the web.1 dyno.
Everything is working fine in my development environment; when I start the 'app' and 'worker' executables directly, I can access the working app from 127.0.0.1:5000.
I already have the postgresql and redistogo addons created as part of the app...
heroku config output:
=== app Config Vars
APP_SETTINGS: config.StagingConfig
DATABASE_URL: postgres://[points-to-AWS-EC2].compute-1.amazonaws.com:[port]/[identifier]
REDISTOGO_URL: redis://redistogo:[identifier]@pike.redistogo.com:[port]/
Procfile
web: bash heroku.sh
heroku.sh
#!/bin/bash
gunicorn app:app --daemon
python worker.py
app.py
from collections import Counter
from datetime import datetime
import json
import re
import operator
import os
from flask import Flask, render_template, request, jsonify
from flask_sqlalchemy import SQLAlchemy
from requests import get, exceptions as rx
import nltk
from rq import Queue
from rq.job import Job
from validators import url as vurl
from worker import conn
app = Flask(__name__, template_folder='app-templates/')
app.config.from_object(os.environ['APP_SETTINGS'])
db = SQLAlchemy(app)
q = Queue(connection=conn)
def worker_task(url):
    '''
    takes website url and ...
worker.py
import os
import redis
from rq import Worker, Queue, Connection

listen = ['default']
redis_url = os.getenv('REDISTOGO_URL', 'redis://localhost:6379')
conn = redis.from_url(redis_url)

if __name__ == '__main__':
    # for tracking worker job in a terminal
    # assumptions: redis-server is running
    try:
        with Connection(conn):
            worker = Worker(list(map(Queue, listen)))
            --->>> worker.work() <<<---
    except Exception as e:
        print(f"{str(e)}")
During the dyno startup, I get the following errors:
heroku[web.1]: State changed from crashed to starting
heroku[web.1]: Starting process with command bash heroku.sh
app[web.1]: Command # 2 (HSET rq:worker:2bd1d3d3a7f04d2fb7a09b34786689bf
birth 2020-05-23T11:28:04.485474Z
last_heartbeat 2020-05-23T11:28:04.485474Z
queues default pid 11
hostname b428750b-5d7f-4837-b9c2-b2195e234978
version 1.4.1 python_version 3.7.4 (default, Sep 12 2019, 01:19:52)
app[web.1]: [GCC 7.4.0]) of pipeline caused error: wrong number of arguments for 'hset' command
heroku[web.1]: Process exited with status 0
heroku[web.1]: State changed from starting to crashed
I'm fairly confident that the error is in the highlighted worker.work() line from the worker.py file, because I get the same error when I run heroku run python worker.py (or when I copy those lines directly into the heroku run python REPL).
I had a similar problem, and I fixed it by downgrading my version of rq from 1.4.1 to 1.3.0.
SOLVED: it was an issue with my package versions...
the original example I was using had the following package version dependencies:
redis = "==3.4.1"
rq = "==1.2.2"
but I was just using the most recently updated packages ("*" for both). When I originally deployed the app to Heroku, it installed RQ version 1.4.0, but then (apparently) when my new app was deployed, the Heroku installer used the latest 1.4.1 package, which broke during the worker.work() call...
Worker RQ v 1.4.0 output:
{
'connection': Redis<ConnectionPool<Connection<host=pike.redistogo.com,port=10038,db=0>>>,
'hostname': 'e868b6d3-70b4-4e17-98c2-b96aae04d9a8',
'pid': 39,
'job_class': <class 'rq.job.Job'>,
'queue_class': <class 'rq.queue.Queue'>, 'version': '1.4.0',
**'python_version': '3.7.4 (default, Sep 12 2019, 01:19:52) \n[GCC 7.4.0]',**
**'serializer': <module 'pickle' from '/app/.heroku/python/lib/python3.7/pickle.py'>,**
'name': '05c7af34c21f44f0b4374b40525779af',
'queues': [Queue('default')],
'_exc_handlers': [], ... ...
}
Worker RQ v1.4.1 output:
{
'connection': Redis<ConnectionPool<Connection<host=pike.redistogo.com,port=11022,db=0>>>,
'hostname': '19a476fa-b4e6-4e63-b40e-ac8779ae0f9e',
'pid': 39,
'job_class': <class 'rq.job.Job'>,
'queue_class': <class 'rq.queue.Queue'>,
'version': '1.4.1',
**'python_version': '3.7.4 (default, Sep 12 2019, 01:19:52) \n[GCC 7.4.0]',**
**'serializer': <class 'rq.serializers.DefaultSerializer'>,**
'name': 'b63389b3499f4d73be889a33f3777b46',
'queues': [Queue('default')],
'_exc_handlers': [], ... ...
}
You can see a difference in the 'serializer' key between the two, possibly causing the HSET exception that was being thrown...
Reverting to RQ v1.2.2 solved the issue for me; I probably could also have used v1.4.0...
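A sketch of how the pinning might look (the file name and exact versions are assumptions; Heroku's Python buildpack reads requirements.txt or a Pipfile, so pin whichever your app uses to the versions you have actually tested):
# requirements.txt (hypothetical pins; adjust to the versions you have verified)
redis==3.4.1
rq==1.2.2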

Configuring Celery on Windows

I have installed Celery 3.1.5, RabbitMQ server 3.2.1 and Python 2.7.5 on a Windows 7 64-bit machine. Here is my code, which I copied from first-steps-with-celery.
from celery import Celery

app = Celery('tasks', backend='amqp', broker='amqp://guest@localhost//')

@app.task
def add(x, y):
    return x + y
When I execute the task from the Python shell I get a "The operation timed out" exception, and state and ready() always return PENDING and False.
>>> from tasks import *
>>> result = add.delay(4, 4)
>>> result.ready()
False
>>> result.state
'PENDING'
>>> result.get(timeout=20)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\celery\result.py", line 136, in get
    interval=interval)
  File "C:\Python27\lib\site-packages\celery\backends\amqp.py", line 154, in wait_for
    raise TimeoutError('The operation timed out.')
celery.exceptions.TimeoutError: The operation timed out.
>>>
I verified that the RabbitMQ server is running; however, I have no clue why Celery is throwing this exception.
You can try to start the worker with the command:
celery -A proj worker -l info --pool==solo
Though there are lots of things that can cause the result.get() call to fail -- because there are a lot of steps in the chain between sending the message via the .delay() command, to Celery, to the broker (RabbitMQ), and back to a Celery worker, which does the work and posts the result back, etc. -- I had this problem, and the solution was the one that @Deja_vu suggested of "--pool=solo" (note one equals sign, not two).
The default "pool" option is "prefork" (see http://docs.celeryproject.org/en/latest/reference/celery.bin.worker.html#module-celery.bin.worker ). So this may be a Celery bug in its "prefork" system under Windows: see https://github.com/celery/celery/issues/2146
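For the question's setup, where the module is tasks.py, the corrected command would look something like this (a sketch; the -A argument has to match your own module or app name):
celery -A tasks worker --loglevel=info --pool=solo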
Related StackOverflow questions:
Celery 'Getting Started' not able to retrieve results; always pending
Trouble getting result from Celery queue

Celerybeat shuts down immediately after start

I have a Django app that is using celeryd and celerybeat. Both are set up to run as daemons.
The celerybeat tasks won't get executed because celerybeat does not start correctly. According to the logs it shuts down immediately:
[2012-05-04 13:02:49,055: WARNING/MainProcess] celerybeat v2.5.1 is starting.
[2012-05-04 13:02:49,122: INFO/MainProcess] process shutting down
[2012-05-04 13:02:49,122: DEBUG/MainProcess] running all "atexit" finalizers with priority >= 0
[2012-05-04 13:02:49,134: DEBUG/MainProcess] running the remaining "atexit" finalizers
I'm starting it with /etc/init.d/celerybeat start
This is the /etc/default/celerybeat config:
# Where the Django project is.
CELERYBEAT_CHDIR="/var/www/path_to_app/cms/"
# Python interpreter from environment.
ENV_PYTHON="$CELERYBEAT_CHDIR/bin/python"
# Name of the projects settings module.
export DJANGO_SETTINGS_MODULE="cms.settings"
# Path to celerybeat
CELERYBEAT="$ENV_PYTHON $CELERYBEAT_CHDIR/cms/manage.py celerybeat"
# Extra arguments to celerybeat
CELERYBEAT_LOG_LEVEL="DEBUG"
CELERYBEAT_USER="www-data"
CELERYBEAT_GROUP="www-data"
The task schedule is set in settings.py:
CELERYBEAT_SCHEDULE = {
    # Executes every morning at 7:00 A.M.
    "every-morning": {
        "task": "cms.tasks.get_recent_posts_for_all_pages",
        "schedule": crontab(hour=7, minute=00)
    },
}
When I run celerybeat from the shell with ./manage.py celerybeat it seems to run fine.
There is also a celerybeat section in the celeryd config but I assume that one is ignored.
Regards
Simon
Maybe you're missing a broker like RabbitMQ:
https://web.archive.org/web/20180703074815/http://celery.readthedocs.io/en/latest/getting-started/brokers/rabbitmq.html
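If that is the case, a minimal sketch of the broker configuration for a Django project on Celery 2.5 (the host, port and guest credentials are assumptions for a default local RabbitMQ install):
# settings.py
# assumes RabbitMQ is running locally with its default guest account
BROKER_URL = "amqp://guest:guest@localhost:5672//"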

EventMachine UNIX socket connections: how do I use them with Thin running as a service?

How can I use EventMachine.connect_unix_domain while running Thin as a service (using the init script excerpt and configuration below)? The code directly below is the problem (I get an eventmachine not initialized: evma_connect_to_unix_server error). The second code example works, but doesn't allow me to daemonize Thin (I don't think). Doesn't Thin already have a running instance of EventMachine?
UPDATE: Oddly enough, stopping the server (with service thin stop) seems to get into the config.ru file and run the app (so it works, until the stop command times out and kills the process). What happens when Thin stops that could be causing this behavior?
Problematic Code
class Server < Sinatra::Base
  # Webserver code removed
end

module Handler
  def receive_data data
    $received_data_changed = 1
    $received_data = data
  end
end

$sock = EventMachine.connect_unix_domain("/tmp/mysock.sock", Handler)
Working Code
EventMachine.run do
  class Server < Sinatra::Base
    # Webserver code removed
  end

  module Handler
    def receive_data data
      $received_data_changed = 1
      $received_data = data
    end
  end

  $sock = EventMachine.connect_unix_domain("/tmp/mysock.sock", Handler)
  Server.run!(:port => 4567)
end
Init Script excerpt
DAEMON=/usr/local/bin/thin
SCRIPT_NAME=/etc/init.d/thin
CONFIG_PATH=/etc/thin

# Exit if the package is not installed
[ -x "$DAEMON" ] || exit 0

case "$1" in
  start)
    $DAEMON start --all $CONFIG_PATH
    ;;
Thin Config
---
chdir: /var/www
environment: development
timeout: 30
log: log/thin.log
pid: tmp/pids/thin.pid
max_conns: 1024
max_persistent_conns: 512
require: []
wait: 30
servers: 1
socket: /tmp/thin.server.sock
daemonize: true
Thin is built on top of EventMachine, so I think you should use EventMachine for serving your app. Try to debug further why Thin won't daemonize (what version are you using?). You can also run Thin on another port such as 4000 and then pass that as the upstream server to your proxy-forwarding server, if that is what you want to achieve.
What I ended up doing was removing the EventMachine.run do ... end and simply enclosing the socket connection in an EM.next_tick{ $sock = EventMachine.connect_unix_domain("/tmp/mysock.sock", Handler) }.
Could swear I tried this once before... but it works now.
EDIT: Idea for next_tick came from here.
