Sidekiq documentation says:
Don't set the concurrency higher than 50. I've seen stability issues
with concurrency of 100, for example
Well, my low memory consumption lets me run a concurrency of 350 threads on a single 512MB X1 Heroku dyno, and I would like to use ~300 because all jobs are IO-bound (HTTP requests).
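For reference, the concurrency is just the standard Sidekiq setting; a minimal sketch of the sidekiq.yml I mean (the value itself is the one in question, not a recommendation):
# config/sidekiq.yml (sketch)
:concurrency: 300
:queues:
  - default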
I wonder what issues I might run into.
I monitored the logs under heavy load with a concurrency of 80 and saw no issues.
What issues should I expect with a concurrency of 300 threads? Do I risk jobs being terminated without being moved to the "dead" queue, or just worker terminations that I will be able to observe?
Is it safe to set a concurrency of 300 or 100?
The owner of Sidekiq doesn't know the answer either, and here is the issue I opened.
UPDATE:
Under high load, when I increased concurrency from 80 to 100 I started getting 'can't create Thread: Resource temporarily unavailable' errors here and there; in extreme cases, at 180 threads, it would sometimes terminate the entire Sidekiq process.
Memory consumption was always between 140MB and 240MB according to Heroku metrics.
I used the TTIN signal as described here.
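Roughly, this is how I dumped the backtraces (a sketch; how you get a shell next to the worker process depends on your setup):
# Sidekiq prints a backtrace for every live thread to its log (at WARN level)
# when it receives TTIN
kill -TTIN $(pgrep -f sidekiq)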
I found that most threads were waiting on these lines of code:
app[worker.1]: 3 TID-ow5z46exw WARN: /app/vendor/ruby-2.3.0/lib/ruby/2.3.0/monitor.rb:187:in `lock'
app[worker.1]: 3 TID-os9ulw8ps WARN: /app/vendor/ruby-2.3.0/lib/ruby/2.3.0/net/http.rb:880:in `initialize'
app[worker.1]: 3 TID-os9ulw8ps WARN: /app/vendor/ruby-2.3.0/lib/ruby/2.3.0/timeout.rb:95:in `join'
app[worker.1]: 3 TID-osjnd6zac WARN: /app/vendor/ruby-2.3.0/lib/ruby/2.3.0/net/protocol.rb:158:in `wait_readable'
Everything is documented in the GitHub issue.
The owner of Sidekiq says the traces look fine, so no luck spotting the root cause of the stability issue, but at least there is data on how many threads trigger it and what the symptoms are.
The DB connection pool size has to be increased based on the concurrency:
concurrency (threads) + 2 = DB connection pool size
(300 + 2) = 302 DB connections
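In a Rails app that usually means bumping the ActiveRecord pool in config/database.yml, for example (a sketch; the ENV variable name is just a placeholder):
# config/database.yml (sketch): pool sized to Sidekiq's concurrency plus a
# small margin, per the rule of thumb above
production:
  adapter: postgresql
  pool: <%= ENV.fetch("SIDEKIQ_CONCURRENCY", 300).to_i + 2 %>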
The real concurrency for a single Sidekiq process depends on the number of processor cores and other parameters, so using more concurrency means most of the time is spent on thread context switching instead of doing real IO/computation.
512MB X1 heroku dyno
A normal Rails app needs at least 200 MB of memory at startup, and if each thread takes (approximately) 1 MB of memory, then total memory consumption will be
200 + (300 * 1) = 500 MB
If some threads need more memory during computation (e.g., fetching more ActiveRecord objects, reading a large file, ...), then the whole Sidekiq process will crash.
When I ran Sidekiq at the machine's full potential, garbage collection did not happen immediately, which caused memory to grow and crashed Sidekiq frequently.
Also, with more threads, each job takes longer than usual to complete. Analyze this case in your environment.
Well, the Sidekiq stability issues at high concurrency are as follows.
When you set a concurrency higher than 80 (or 50), you may encounter this error: "can't create Thread: Resource temporarily unavailable"
Some jobs will be returned to the queue; sometimes the entire process will be terminated and jobs will be lost, unless you use the Sidekiq Pro reliability feature.
It seems that we are hitting Heroku's 256-threads-per-dyno limit even though Sidekiq is configured to use 80 threads. Running multiple Sidekiq processes inside a single Heroku dyno doesn't help; when I did that, I still ran into this limit.
It seems like a thread leak, and this is the next thing to investigate.
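One cheap way to watch for such a leak (a sketch, not something from the issue) is to log the process-wide thread count from inside the worker and see whether it creeps toward the 256 limit:
# sketch: periodic thread-count logging inside the Sidekiq process
Thread.new do
  loop do
    Sidekiq.logger.info("thread count: #{Thread.list.size}")
    sleep 60
  end
end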
The above can happen even when memory consumption stays low (< 240MB in my case).
Everything is updated in the GitHub issue.
Following are the properties I have set -
spring.task.execution.pool.core-size=50
spring.task.execution.pool.max-size=200
spring.task.execution.pool.queue-capacity=100
spring.task.execution.shutdown.await-termination=true
spring.task.execution.shutdown.await-termination-period=10s
spring.task.execution.thread-name-prefix=async-task-exec-
I still see thread names as - "async-task-exec-7200"
Does it mean it is creating 7200 threads?
Also, another issue I observed is that @Async would wait for more than 10 minutes to get a thread and free up the parent thread.
You specified a core size of 50 and a max size of 200. So your pool will normally run with 50 threads, and when there is extra work it will spawn additional threads; you'll see "async-task-exec-51", "async-task-exec-52" created, and so on. Later, if there is not enough work for all the threads, the pool will kill some threads to get back to just 50. So it may kill thread "async-task-exec-52". The next time it has too much work for 50 threads, it will create a new thread, "async-task-exec-53".
So the fact that you see "async-task-exec-7200" means that over the lifetime of the thread pool it has created 7200 threads, but it will still never have more than the max of 200 running at the same time.
If the @Async method is waiting 10 minutes for a thread, it means you have put so much work into the pool that it has already spawned all 200 threads, they are all busy, and you have filled up the queue capacity of 100, so now the parent thread has to block (wait) until there is at least one free spot in the queue for the task.
If you need to consistently handle more tasks, you will need a powerful enough machine and enough max threads in the pool. But if your workload is just very spiky, and you don't want to spend on a bigger machine and you are OK with tasks sometimes waiting longer, you might be able to get away with just raising your queue-capacity, so the work will queue up and eventually your threads might catch up (if task creation slows down).
Keep trying combinations of these settings to see what will be right for your workload.
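For reference, the spring.task.execution.* properties in the question map roughly onto a ThreadPoolTaskExecutor configured in code; a sketch mirroring those values (class and bean names are made up):
// Sketch: programmatic equivalent of the properties in the question.
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync
public class AsyncConfig {

    @Bean
    public ThreadPoolTaskExecutor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(50);                       // pool.core-size
        executor.setMaxPoolSize(200);                       // pool.max-size
        executor.setQueueCapacity(100);                     // pool.queue-capacity
        executor.setThreadNamePrefix("async-task-exec-");   // thread-name-prefix
        executor.setWaitForTasksToCompleteOnShutdown(true); // shutdown.await-termination
        executor.setAwaitTerminationSeconds(10);            // shutdown.await-termination-period
        return executor;
    }
}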
Context:
My Spring-Boot app runs as expected on Cloud Run when I deploy it with max-instances set to 1: It receives a constant stream of pubsub messages via push, and makes anywhere from 0 to 5 writes to an associated CloudSQL instance, depending on the message payload. Typically it handles between 20 and 40 messages per second. Latency/response-time varies between 50ms and 60sec, probably due to some resource contention.
In order to increase throughput/ decrease resource contention, I'm looking to experiment with the connection pool size per app-instance, as well as the concurrency and max-instances parameters for my cloud run app.
I understand that due to Spring-Boot, my app has a relatively high cold-start time of about 30-40 seconds. This is acceptable for how this service is used.
Problem:
I'm experiencing problems when deploying a Spring Boot app to Cloud Run with max-instances set to a value greater than 1:
Instances start, handle a single request successfully, and then produce no more logs.
This happens a few times per minute, leading me to believe that instances get started (cold-start), handle a single request, die, and then get started again. They are not being reused as described in the docs, and as is happening when I set max-instances to 1. Official docs on concurrency
Instead, I expect 3 container instances to be started, each of which then handles requests according to the concurrency setting.
Billable container time at max-instances=3:
As shown in the graph, the number of instances is fluctuating wildly, once the new revision with max-instances=3 is deployed.
The graphs for CPU- and memory-usage also look like this.
There are no error logs. As before at max-instances=1, there are warnings indicating that there are not enough instances available to handle requests (HTTP 429).
Connection Limit of CloudSQL instance has not been exceeded
Requests are handled at less than 10/s
Finally, this is the command used to deploy:
gcloud beta run deploy my-service --project=[...] --image=[...] --add-cloudsql-instances=[...] --region=[...] --platform=managed --memory=1Gi --max-instances=3 --concurrency=3 --no-allow-unauthenticated
What could cause this behavior?
Some months ago, in the private Alpha, I performed tests and observed the same behavior. After discussion with the Google team, I understood that instances are over-provisioned "just in case": an instance crashes, an instance is preempted, the traffic suddenly increases, ...
The trade-off is that you will have more cold starts than your max-instances value. Worse, you will be charged for these over-provisioned cold starts -> this is not really an issue because Cloud Run has a huge free tier that covers this kind of glitch.
Going deeper into the logs (you can do this by creating a sink of Cloud Run logs into BigQuery and then querying them), even if there are more instances up than your max-instances, only max-instances of them are active at the same time. I'm not sure I'm being clear: with your parameters, that means if you have 5 instances up at the same time, only 3 serve traffic at any given point in time.
This part is not documented because it evolves constantly to find the best balance between over-provisioning and a lack of resources (and 429 errors).
@Steren @AhmetB can you confirm or correct me?
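For the log-sink approach mentioned above, the setup is roughly this (a sketch; the project and dataset names are placeholders, and the sink's service account still needs write access to the dataset):
# sketch: export Cloud Run request logs to BigQuery for deeper analysis
gcloud logging sinks create cloud-run-to-bq \
  bigquery.googleapis.com/projects/[PROJECT_ID]/datasets/cloud_run_logs \
  --log-filter='resource.type="cloud_run_revision"'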
When Cloud Run receives and processes requests rapidly, it predicts how many instances it needs and will try to scale to that amount. If a sudden burst of requests occurs, Cloud Run will instantiate a larger number of instances in response. This is done to adapt to a possible higher number of requests beyond what it is currently serving, while taking into consideration how long it takes the existing instances to finish loading the request. Per the documentation, it is possible for the number of container instances to go above the max-instances value during a spike.
You mentioned with max-instances set to 1 it was running fine, but later you mentioned it was in fact producing 429s with it set to 1 as well. Seeing behavior of 429s as well as the instances spiking could indicate that the amount of traffic is not being handled fluidly.
It is also worth noting that, because of the cold-start time you mention, while an instance is serving its first request(s) the number of concurrent requests is, by design, hard-set to 1. Only once things are fully ready is the concurrency setting you have chosen applied.
Was there some specific reason you chose 3 and 3 for max instances and concurrency? Also, how was the concurrency set when you had max instances set to 1? Perhaps you could try raising the concurrency (max 80) and/or max instances (the high limit is 1000) and see if that removes the 429s.
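For example, raising both knobs on the existing service could look like this (a sketch based on the deploy command in the question; the values are only illustrative):
# sketch: bump concurrency and max-instances, then re-check for 429s
gcloud run services update my-service \
  --project=[...] --region=[...] --platform=managed \
  --concurrency=40 --max-instances=10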
We are using Infinispan Hot Rod in our application.
Sometimes retrieval from the cache takes more time. This does not happen consistently. Most of the time it takes 6 msec, but at times it takes much longer (200 msec).
The size of the object retrieved from cache is around 200 bytes.
We tested this on both Infinispan 5.2.1 and JDG 6.3.2.
Did anybody face this issue ?
Thanks
Lives
Remember that you're running Java, which means the garbage collector can fire at any time; that will cost you 200 ms if you're very lucky, several seconds if you're not, and up to minutes if you have a large heap and poorly tuned GC settings.
As retrieval from a distributed cache requires an RPC to another node, and that RPC has to be handled there, thread scheduling also plays a vital role, and in a busy system it's no surprise to have threads waiting.
From Infinispan's perspective, there's nothing the retrieval should wait for. The request gets translated into an RPC to the remote node, where it's handled by the same thread that received the message. The request does not wait for any locks.
In JGroups, there may be some delay involved. The message can get lost on the network or discarded by the receiver if it cannot handle the load, and then it is resent. Also, the UFC flow-control protocol throttles the sender to a speed the receiver can keep up with.
Anything built on top of non-realtime Java works with best effort, and sometimes sh!t happens. 200 ms is still a good response time.
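If you want to rule GC in or out, one option is to enable GC logging and correlate pause timestamps with the slow gets; a sketch of the flags for the pre-Java-9 JVMs those Infinispan/JDG versions run on (paths and jar name are placeholders):
# sketch: GC logging so pauses can be matched against slow cache reads
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -Xloggc:/var/log/app/gc.log \
     -jar my-app.jar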
We are seeing inconsistent performance on Heroku that is unrelated to the recent unicorn/intelligent routing issue.
This is an example of a request which normally takes ~150ms (and 19 out of 20 times that is how long it takes). You can see that on this request it took about 4 seconds, or between 1 and 2 orders of magnitude longer.
Some things to note:
the database was not the bottleneck, and it spent only 25ms doing db queries
we have more than sufficient dynos, so I don't think this was the bottleneck (20 double dynos running unicorn with 5 workers each; we get only 1000 requests per minute with an avg response time of 150ms, which means we should be able to serve (60 / 0.150) * 20 * 5 = 40,000 requests per minute). In other words, we had 40x the needed dyno capacity when this measurement was taken.
So I'm wondering what could cause these occasional slow requests. As I mentioned, anecdotally it seems to happen in about 1 in 20 requests. The only thing I can think of is there is a noisy neighbor problem on the boxes, or the routing layer has inconsistent performance. If anyone has additional info or ideas I would be curious. Thank you.
I have been chasing a similar problem myself, with not much luck so far.
I suppose the first order of business would be to recommend NewRelic. It may have some more info for you on these cases.
Second, I suggest you look at queue times: how long your request was queued. Look at NewRelic for this, or do it yourself with the "start time" HTTP header that Heroku adds to your incoming request (just print now() minus "start time" as your queue time).
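A minimal sketch of the do-it-yourself version as Rack middleware, assuming Heroku's X-Request-Start header (on current routers it is the epoch time in milliseconds at which the router accepted the request; older routers used a "t=" microseconds format, so adjust if needed):
# sketch: log how long each request sat in Heroku's queue before the dyno saw it
require "logger"

class QueueTimeLogger
  def initialize(app, logger = Logger.new($stdout))
    @app = app
    @logger = logger
  end

  def call(env)
    if (raw = env["HTTP_X_REQUEST_START"])
      router_ms = raw.gsub(/\D/, "").to_i              # strips any "t=" prefix
      queued_ms = (Time.now.to_f * 1000).round - router_ms
      @logger.info("queue_time_ms=#{queued_ms}")
    end
    @app.call(env)
  end
end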
When those failed me in my case, I tried coming up with things that could go wrong, and here's an (unorthodox? weird?) list:
1) DNS -- are you making any DNS calls in your view? These can take a while. Even DNS requests for resolving DB host names, Redis host names, external service providers, etc.
2) Log performance -- Heroku collects all your stdout using their "Logplex", which it then drains to your own defined logdrains, services such as Papertrail, etc. There is no documentation on the performance of this, and writes to stdout from your process could block, theoretically, for periods while Heroku is flushing any buffers it might have there.
3) Getting a DB connection -- not sure which framework you are using, but maybe you have a connection pool that you are getting DB connections from, and that took time? It won't show up as query time, it'll be blocking time for your process.
4) Dyno performance -- Heroku has an add-on feature that will print, every few seconds, some server metrics (load avg, memory) to stdout. I used Graphite to graph those and look for correlation between the metrics and times where I saw increased instances of "sporadic slow requests". It didn't help me, but might help you :)
Do let us know what you come up with.
I have an application with the following pattern:
2 long-running processes that go into hibernate after some idle time, and their memory consumption goes down as expected
N (0 < N < 100) worker processes that do some work and hibernate when idle for more than 10 seconds, or terminate if idle for more than two hours
During the night, when there is no activity, the process memory goes back to almost the same value that it had at application start, which is expected as all the workers have died.
The issue is that "system" section keeps growing (around 1GB/week).
My question is: how can I debug what is stored there, or who is allocating memory in that area and not freeing it?
I've already tested lists:keysearch/3 and it doesn't seem to leak memory, as that is the only native thing I'm using (no ports, no drivers, no NIFs, no BIFs, nothing). Erlang version is R15B03.
Here is the current erlang:memory() output (slight traffic, app started on Feb 03):
[{total,378865650},
{processes,100727351},
{processes_used,100489511},
{system,278138299},
{atom,1123505},
{atom_used,1106100},
{binary,4493504},
{code,7960564},
{ets,489944},
{maximum,402598426}]
This is a 64-bit system. As you can see, "system" section has ~270MB and "processes" is at around 100MB (that drops down to ~16MB during the night).
It seems that I've found the issue.
I have a "process_killer" gen_server where processes can subscribe for periodic GC or kill. Its subscribe functions are called on each message received by some processes to postpone the GC/kill (something like re-arm).
This process calls erlang:monitor, if the process is not already monitored, to catch a dead process and remove it from the watch list. If I comment out the re-subscription line on each handled message, the "system" area seems to behave normally. That means it is a bug in my process_killer that leaks monitor refs (remember that you can call erlang:monitor multiple times and each call creates a new reference).
I was led to this idea because I tested a simple module that called erlang:monitor in a loop, and I saw the "system" area grow by ~13 bytes on each call.
The workers themselves were OK because they would die anyway, taking their monitors with them. But there is one long-running process (starts with the app, stops with the app) that dispatches all the messages to the workers, and it was calling the GC re-arm on each received message, so we're talking about tens of thousands of monitors created per hour and never released.
I'm writing this answer here for future reference.
TL;DR: make sure you are not leaking monitor refs on a long-running process.
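A minimal sketch of the fix pattern (hypothetical function and state shape, not the actual process_killer code): monitor a pid only once and remember the ref, so repeated subscribe/re-arm calls don't pile up monitor references.
%% Sketch: keep one monitor ref per pid (a dict, which is R15-friendly);
%% re-arming an already-watched pid must not create another monitor.
subscribe(Pid, Monitors) ->
    case dict:is_key(Pid, Monitors) of
        true ->
            Monitors;                           % already watched: just re-arm, no new monitor
        false ->
            Ref = erlang:monitor(process, Pid), % one monitor per pid
            dict:store(Pid, Ref, Monitors)
    end.
When the {'DOWN', Ref, process, Pid, _} message arrives, dict:erase(Pid, Monitors) drops the entry; the monitor itself is already gone at that point.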