I have a project with lots of Celery tasks, and one of the tasks must be executed only one at a time (it is a request to a 3rd party API which disallows several concurrent connections).
I can achieve this by starting a separate celery process with a separate queue and a concurrency of 1.
Regular celery process:
celery -A sourcery worker -Q default -c 4
A separate single-worker process:
celery -A sourcery worker -Q separate_queue -c 1
But I am on Heroku, and I will be billed doubly for spinning up two processes instead of one. So, is there a way to achieve it with a single Celery process?
There is currently no way to do this.
I had a similar issue where I had a 3rd party API which only allowed one concurrent connection. I ended up running two separate dynos on Heroku.
Somebody did request this feature, but it has not been implemented: https://github.com/celery/celery/issues/1599
Edit:
There is something called celery-multi that could be worth looking into:
Celery - run different workers on one server
I'm not sure it's Heroku compatible though, let me know if you try!
Related
I am currently using a Hobby dyno to run my web process -- frontend website as well as an API. To execute jobs, I need to run a background process that processes these jobs. Do I need to spin up another dyno that specifically runs background workers? I ask this because I see that Standard 1X/2x dynos include "unlimited background workers" which makes me think that multiple process types can run on a single dyno. It looks like I can run 2 hobby dynos -- one for web, one for workers or upgrade to one of the standard dynos...is this correct?
I believe that yes, if you want to have a dedicated worker app, that needs to be a separate dyno. You can run the processes in memory though and that saves you doing it with a separate binary/dyno
I'm not familiar with the internals of Sidekiq and am wondering if it's okay to launch several Sidekiq instances with the same configuration (processing the same queues).
Is it possible that 2 or more Sidekiq instances will process the same message from a queue?
UPDATE:
I need to know if there is a possible conflict, when running Sidekiq on more than 1 machine
Yes, sidekiq can absolutely run many processes against the same queue. Redis will just give the message to a random process.
Nope, I've ran Sidekiqs in different machines with no issues.
Each of the Sidekiqs read from the same redis server, and redis is very robust in multi-threaded, and distributed scenarios.
In addition, if you look at the web interface for Sidekiq, it will show all the workers across all machines because all the workers are logged in the same redis server.
So no, no issues.
I'm hoping the community can clarify something for me, and that others can benefit.
My understanding is that gunicorn worker processes are essentially virtual replicas of Heroku web dynos. In other words, Gunicorn's worker processes should not be confused with Heroku's worker processes (e.g. Django Celery Tasks).
This is because Gunicorn worker processes are focused on handling web requests (basically throttling up the performance of the Heroku Web Dyno) while Heroku Worker Dynos specialize in Remote API calls, etc that are long-running background tasks.
I have a simple Django app that makes decent use of Remote APIs and I want to optimize the resource balance. I am also querying a PostgreSQL database on most requests.
I know that this is very much an oversimplification, but am I thinking about things correctly?
Some relevant info:
https://devcenter.heroku.com/articles/process-model
https://devcenter.heroku.com/articles/background-jobs-queueing
https://devcenter.heroku.com/articles/django#running-a-worker
http://gunicorn.org/configure.html#workers
http://v3.mike.tig.as/blog/2012/02/13/deploying-django-on-heroku/
https://docs.djangoproject.com/en/dev/howto/deployment/wsgi/gunicorn/
Other Quasi-Related Helpful SO Questions for those researching this topic:
Troubleshooting Site Slowness on a Nginx + Gunicorn + Django Stack
Performance degradation for Django with Gunicorn deployed into Heroku
Configuring gunicorn for Django on Heroku
Troubleshooting Site Slowness on a Nginx + Gunicorn + Django Stack
To provide an answer and prevent people from having to search through the comments, a dyno is like an entire computer. Using the Procfile, you give each of your dynos one command to run, and it cranks away on that command, re-running it periodically to refresh it and re-running it when it crashes. As you can imagine, it's rather wasteful to waste an entire computer running a single-threaded webserver, and that's where Gunicorn comes in.
The Gunicorn master thread does nothing but act as a proxy server, spawning a given number of copies of your application (workers), distributing HTTP requests amongst them. It takes advantage of the fact that each dyno actually has multiple cores. As someone mentioned, the number of workers you should choose depends on how much memory your app takes to run.
Contrary to what Bob Spryn said in the last comment, there are other ways of exploiting this opportunity for parallelism to run separate servers on the same dyno. The easiest way is to make a separate sub-procfile and run the all-Python Foreman equivalent, Honcho, from your main Procfile, following these directions. Essentially, in this case your single dyno command is a program that manages multiple single commands. It's kind of like being granted one wish from a genie, and making that wish be for 4 more wishes.
The advantage of this is you get to take full advantage of your dynos' capacity. The disadvantage of this approach is that you lose the ability scale individual parts of your app independently when they're sharing a dyno. When you scale the dyno, it will scale everything you've multiplexed onto it, which may not be desired. You will probably have to use diagnostics to decide when a service should be put on its own dedicated dyno.
I'd like to know how to communicate between processes on a Heroku worker dyno.
We want a Resque worker to read off a queue and send the data to another process running on the same dyno. The "other process" is an off-the-shelf piece of software that usually uses TCP sockets (port xyz) to listen for commands. It is set up to run as a background process before the Resque worker starts.
However, when we try to connect locally to that TCP socket, we get nowhere.
Our Rake task for setting up the queue does this:
task "resque:setup" do
# First launch our listener process in the background
`./some_process_that_listens_on_port_12345 &`
# Now get our queue worker ready, set up Redis backing store
port = 12345
ENV['QUEUE'] = '*'
ENV['PORT'] = port.to_s
Resque.redis = ENV['REDISTOGO_URL']
# Start working from the queue
WorkerClass.enqueue
end
And that works -- our listener process runs, and Resque tries to process queued tasks. However, the Resque jobs fail because they can't connect to localhost:12345 (specifically, Errno::ECONNREFUSED).
Possibly, Heroku is blocking TCP socket communication on the same dyno. Is there a way around this?
I tried to take the "code" out of the situation and just executed on the command line (after the server process claims that it is properly bound to 12345):
nc localhost 12345 -w 1 </dev/null
But this does not connect either.
We are currently investigating changing the client/server code to use UNIXSocket on both sides as opposed to TCPSocket, but as it's an off-the-shelf piece of software, we'd rather avoid our own fork if possible.
Use message queue Heroku add-ons ...,
like IronMQ for exsample
Have you tried Fifo?
http://www.gnu.org/software/libc/manual/html_node/FIFO-Special-Files.html#FIFO-Special-Files
Reading your question, you've answered your own question, you cannot connect to localhost 12345.
This way of setting up your processes is a strange one as your running two processes within one Heroku dyno which removes a lot of the benefits of Heroku, i.e independant process scaling, isolation and clean depenedency declaration and isolation.
I would strongly recommend running this as two seperate processes that interact via a third party backing service.
Heroku only lets you listen in a given port ($PORT) per dyno, I think.
I see two solutions here:
Use Redis as a communication middleware, so the worker would write on Redis again and the listener process, instead of listening in a port would be querying redis for new jobs.
Get another heroku dyno (or better, a complete different application) and launch there the listening process (on $PORT) and communicate both applications
#makdad, is the "3rd party software" written in Ruby? If so, I would run it with a monkey patch which fakes out TCPSocket or whatever class it is using to access the TCP socket. Put the monkey patch in a file of its own, which will only be required by the Ruby process which is running the 3rd party software. The monkey patch could even read data directly from the queue, and make TCPSocket behave as if that data had been received.
Yes, it's not very elegant, and I'm sure there may be a better way to do it, but when are you trying to get a job done (not spend days doing research), sometimes you just have to bite the bullet and do something which is ugly, but works. Whatever solution you choose, make sure to document it for those who work on the project later.
My server process is basically an API that responds to REST requests.
Some of these requests are for starting long running tasks.
Is it a bad idea to do something like this?
get "/crawl_the_web" do
Thread.new do
Crawler.new # this will take many many days to complete
end
end
get "/status" do
"going well" # this can be run while there are active Crawler threads
end
The server won't be handling more than 1000 requests a day.
Not the best idea....
Use a background job runner to run jobs.
POST /crawl_the_web should simply add a job to the job queue. The background job runner will periodically check for new jobs on the queue and execute them in order.
You can use, for example, delayed_job for this, setting up a single separate process to poll for and run the jobs. If you are on Heroku, you can use the delayed_job feature to run the jobs in a separate background worker/dyno.
If you do this, how are you planning to stop/restart your sinatra app? When you finally deploy your app, your application is probably going to be served by unicorn, passenger/mod_rails, etc. Unicorn will manage the lifecycle of its child processes and it would have no knowledge of these long-running threads that you might have launched and that's a problem.
As someone suggested above, use delayed_job, resque or any other queue-based system to run background jobs. You get persistence of the jobs, you get horizontal scalability (just launch more workers on more nodes), etc.
Starting threads during request processing is a bad idea.
Besides that you cannot control your worker threads (start/stop them in a controlled way), you'll quickly get into troubles if you start a thread inside request processing. Think about what happens - the request ends and the process gets prepared to serve the next request, while your worker thread still runs and accesses process-global resources like the database connection, open files, same class variables and global variables and so on. Sooner or later, your worker thread (or any library used from it) will affect the main thread somehow and break other requests and it will be almost impossible to debug.
You're really better off using separate worker processes. delayed_job for example is a really small dependency and easy to use.