After a timeout occurs on Thin, the process keeps using high CPU. The only way out is to restart it (I let it run for more than a day).
This is the output of strace:
ruby@localhost:~$ strace -p 17830
Process 17830 attached
brk(0x7cf38000) = 0x7cf38000
brk(0x7d3ac000) = 0x7d3ac000
brk(0x7d655000) = 0x7d655000
brk(0x7de8c000) = 0x7de8c000
brk(0x7e616000) = 0x7e616000
brk(0x7a0c9000) = 0x7a0c9000
... and one similar line every 3-4 seconds.
Why does this happen? I've also seen this with Mongrel. If the HTTP request has already ended, why does the process keep this up?
Related
Using Ruby (tested with versions 2.6.9, 2.7.5, 3.0.3, 3.1.1) and forking processes to handle socket communication, there seems to be a huge difference between macOS and Debian Linux.
While running on Debian, the forked processes get called in a balanced manner; that means: with 10 TCP server forks and 100 client calls, each fork gets 10 calls. The order of the PID call stack is also always the same, even though it is not ordered by PID (caused by load when the forks are instantiated).
Doing the same on macOS (Catalina), the forked processes do not get called in a balanced manner; that means: "pid A" might get called 23 or however many times, while e.g. "pid G" is never used.
Sample code (originally from: https://relaxdiego.com/2017/02/load-balancing-sockets.html)
#!/usr/bin/env ruby
# server.rb
require 'socket'

# Open a socket
socket = TCPServer.open('0.0.0.0', 9999)
puts "Server started ..."

# For keeping track of children pids
wpids = []

# Forward any relevant signals to the child processes.
[:INT, :QUIT].each do |signal|
  Signal.trap(signal) {
    wpids.each { |wpid| Process.kill(:KILL, wpid) }
  }
end

5.times {
  wpids << fork do
    loop {
      connection = socket.accept
      connection.puts "Hello from #{ Process.pid }"
      connection.close
    }
  end
}

Process.waitall
Run some netcat calls against the server from a second terminal:
for i in {1..20}; do nc -d localhost 9999; done
As said: when running on Linux, each forked process gets 4 calls; doing the same on macOS gives random usage per forked process.
Is there any solution or correction to make it work in a balanced manner on macOS as well?
The problem is that the default socket backlog size is 5 on macOS and 128 on Linux. You can change the backlog size by passing it to TCPServer#listen:
socket.listen(128)
Or you can use the backlog size from the environment variable SOMAXCONN:
socket.listen(ENV.fetch('SOMAXCONN', 128).to_i)
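Applied to the sample server above, the call goes right after the socket is opened and before the children start accepting; a minimal sketch (the '128' fallback is my choice, mirroring the Linux default):
# server.rb (excerpt)
socket = TCPServer.open('0.0.0.0', 9999)
socket.listen(ENV.fetch('SOMAXCONN', '128').to_i) # enlarge the accept backlog before forking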
I'm working on a production app that has multiple Rails servers behind an nginx load balancer. We are monitoring Sidekiq processes with Monit, and it works just fine: when a Sidekiq process dies, Monit starts it right back up.
However, we recently encountered a situation where one of these processes was running and visible to Monit, but for some reason not visible to Sidekiq. That resulted in many failed jobs, and it took us some time to notice that we were missing one process in the Sidekiq Web UI, since Monit was telling us everything was fine and all processes were running. A simple restart fixed the problem.
And that brings me to my question: how do you monitor your Sidekiq processes? I know I can use something like Rollbar to notify me when jobs fail, but I'd like to know if there is a way to monitor the process count and preferably send mail when one dies. Any suggestions?
Something that would ping sidekiq/stats and verify the response.
My super simple solution to a similar problem looks like this:
# sidekiq_check.rb
namespace :sidekiq_check do
  task rerun: :environment do
    if Sidekiq::ProcessSet.new.size == 0
      exec 'bundle exec sidekiq -d -L log/sidekiq.log -C config/sidekiq.yml -e production'
    end
  end
end
and then using cron/whenever
# schedule.rb
every 5.minutes do
  rake 'sidekiq_check:rerun'
end
We ran into this problem where our Sidekiq processes had stopped working off jobs overnight and we had no idea. It took us about 30 minutes to integrate http://deadmanssnitch.com by following these instructions.
It's not the prettiest or cheapest option, but it gets the job done (it integrates nicely with PagerDuty) and has saved our butts twice in the last few months.
One of our complaints with the service is that the shortest grace interval we can set is 15 minutes, which is too long for us. So we're evaluating similar services like Healthchecks, etc.
My approach is the following:
create a background job that does something
call the job regularly
check that the thing is being done!
So, using a cron script (or something like whenever), every 5 minutes I run:
CheckinJob.perform_later
It's now up to Sidekiq (or delayed_job, or whatever Active Job backend you're using) to actually run the job.
The job just has to do something which you can check.
I used to have the job update a record in my Status table (essentially a list of key/value records). Then I'd have a /status page which returns a 500 status code if the record hasn't been updated in the last 6 minutes.
(Obviously your timing may vary.)
Then I use a monitoring service to monitor the status page! (something like StatusCake)
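A minimal sketch of that status-page setup, assuming a Rails app with a key/value Status model (all class, key, and route names here are my own illustration, not the original code):
# app/jobs/checkin_job.rb
class CheckinJob < ApplicationJob
  queue_as :default

  def perform
    # Record the time of the latest successful check-in.
    status = Status.find_or_initialize_by(key: 'checkin')
    status.update!(value: Time.current.iso8601)
  end
end

# app/controllers/status_controller.rb
class StatusController < ApplicationController
  def show
    checkin = Status.find_by(key: 'checkin')
    stale = checkin.nil? || Time.zone.parse(checkin.value) < 6.minutes.ago
    head(stale ? :internal_server_error : :ok)
  end
end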
Nowadays I have a simpler approach; I just get the background job to check in with a cron monitoring service like:
IsItWorking
Dead Man's Snitch
Healthchecks
The monitoring service expects your task to check in every X minutes; if your task doesn't check in, the monitoring service will let you know.
Integration is dead simple for all the services. For IsItWorking it would be:
IsItWorkingInfo::Checkin.ping(key: "CHECKIN_IDENTIFIER")
Full disclosure: I wrote IsItWorking!
I use the god gem to monitor my Sidekiq processes. God makes sure that your process is always running and can also notify you of the process status on various channels.
ROOT = File.dirname(File.dirname(__FILE__))

God.pid_file_directory = File.join(ROOT, "tmp/pids")

God.watch do |w|
  w.env = {'RAILS_ENV' => ENV['RAILS_ENV'] || 'development'}
  w.name = 'sidekiq'
  w.start = "bundle exec sidekiq -d -L log/sidekiq.log -C config/sidekiq.yml -e #{ENV['RAILS_ENV']}"
  w.log = "#{ROOT}/log/sidekiq_god.log"
  w.behavior(:clean_pid_file)
  w.dir = ROOT
  w.keepalive

  w.restart_if do |restart|
    restart.condition(:memory_usage) do |c|
      c.interval = 120.seconds
      c.above = 100.megabytes
      c.times = [3, 5] # 3 out of 5 intervals
    end

    restart.condition(:cpu_usage) do |c|
      c.interval = 120.seconds
      c.above = 80.percent
      c.times = 5
    end
  end

  w.lifecycle do |on|
    on.condition(:flapping) do |c|
      c.to_state = [:start, :restart]
      c.times = 5
      c.within = 5.minutes
      c.transition = :unmonitored
      c.retry_in = 10.minutes
      c.retry_times = 5
      c.retry_within = 1.hour
    end
  end
end
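If the config above is saved as, say, config/sidekiq.god (the filename is my assumption, not prescribed by god), it can be loaded with:
god -c config/sidekiq.god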
Running a small Jetty application on a Raspberry Pi, I noticed that after the first access the application keeps burning around 3% CPU. A quick inspection showed that the same is true, with a smaller percentage, on my laptop. Checking with strace, I find a never-ending sequence of
...
12:58:01.999717 clock_gettime(CLOCK_MONOTONIC, {2923, 200177551}) = 0
12:58:01.999864 futex(0x693a0f44, FUTEX_WAIT_BITSET_PRIVATE, 1, {2923, 250177551}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
12:58:02.050090 futex(0x693a0f28, FUTEX_WAKE_PRIVATE, 1) = 0
12:58:02.050236 gettimeofday({1436093882, 50296}, NULL) = 0
12:58:02.050403 gettimeofday({1436093882, 50444}, NULL) = 0
12:58:02.050767 clock_gettime(CLOCK_MONOTONIC, {2923, 251228114}) = 0
...
(This is Java 7 on Ubuntu 14.04 with Jetty 9.3.* using an H2 DB, just in case this rings any bells for someone.)
I learned that it suffices to capture strace -f -tt -p <pid> -o out.txt, grep for clock_gettime, extract the TID from each line, and sort | uniq -c to find the thread calling clock_gettime most often. Plotting the delta times nicely shows a line at 50 milliseconds. Furthermore, that thread ID can be found in a thread dump taken with jvisualvm as the nid in hex, and it turns out to be the 'VM Periodic Task Thread'. But why so often? This does not seem to be standard behaviour of the JVM.
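For what it's worth, the grep/sort/uniq step can also be done with a few lines of Ruby over the same out.txt (the TID-prefix format is what strace -f -tt produces):
# count_gettime.rb -- tally clock_gettime calls per thread in the strace log
counts = Hash.new(0)
File.foreach('out.txt') do |line|
  next unless line.include?('clock_gettime')
  tid = line[/\A\d+/] # with -f, each line starts with the TID of the calling thread
  counts[tid] += 1 if tid
end
counts.sort_by { |_tid, n| -n }.each { |tid, n| puts "#{n} calls from TID #{tid}" }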
I am running an executable in Condor that basically processes an input image and saves a binary image to a given folder. I use this code on 213 images.
My Condor configuration file contents are as follows:
universe = vanilla
executable = /datasets/me/output_cpen_database/source_codes_techniques/test/vole
arguments = cmfd -I /datasets/me/cpen_database/scale1/$(Process)/$(Process).png -O /datasets/me/output_cpen_database/scale1/dct/$(Process)/ --numThreads 10 --chan GRAY --featvec DCT --blockSize 16 --minDistEuclidian 50 --kdsort --fastsats --minSameShift 1000 --markRegions --useOrig --writePost --writeMatrix
initialdir = /datasets/me/output_cpen_database/source_codes_techniques/test
requirements = (OpSysAndVer == "Ubuntu12")
request_cpus = 5
request_memory = 20000
output = logs/output-$(Process).log
error = logs/error-$(Process).log
log = logs/log-$(Process).log
Notification = Complete
Notify_User = mymail@gmail.com
Queue 214
Some images are processed OK, but in some cases I receive the following error in my mailbox:
Condor job 1273.47
/datasets/me/output_cpen_database/source_codes_techniques/test/vole cmfd -I /datasets/me/cpen_database/scale1/47/47.png -O /datasets/me/output_cpen_database/scale1/dct/47/ --numThreads 10 --chan GRAY --featvec DCT --blockSize 16 --minDistEuclidian 50 --kdsort --fastsats --minSameShift 1000 --markRegions --useOrig --writePost --writeMatrix
died on signal 9 (Killed)
I was thinking this might happen because of a lack of memory, but this image (named 47) is no larger than 20 MB (it is actually 16.7 MB).
As I said before, Condor runs this executable fine for some other images.
Should I increase request_memory in my configuration file? What is happening here?
Usually, a job dying on signal 9 means problems with some of the shared libraries required by your executable. What I would check is whether or not all jobs die on a particular host. If that's the case, you could run the code manually and see if you get a missing shared library error.
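One way to run that check directly (my suggestion, a standard ldd invocation using the executable path from the submit file) is to list unresolved libraries on the suspect host:
ldd /datasets/me/output_cpen_database/source_codes_techniques/test/vole | grep 'not found'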
I'm load testing IIS 7.5 (WinR2/SP1) from my Windows 7/SP1 client. I have a script that makes three ab calls like:
start /B cmd /c ab.exe -k -n 500 -c 50 http://rhvwr2vsu410/HelloWebAPI/Home/SyncProducts > SyncProducts.txt
When the concurrency is > 5, I soon get the error message
apr_poll: The timeout specified has expired (70007)
And ab stops making requests. I don't even get to "Completed 100 requests".
This happens within 30 seconds of starting my script. The ab documentation page doesn't provide much. There are related questions on Stack Overflow and Server Fault.
You must have the Apache 2.4 version of ab and use the -s timeout option.
Edit:
https://www.wampserver.com/ includes Apache 2.4.x for Win32 and Win64.
The build below is deprecated: it was still available for a while (I don't know until when), but now it is just not available:
You can use my win32-x86 binary (compiled under Visual Studio 2008 from trunk, 8 Feb 2013):
http://mars.iti.pk.edu.pl/~nkg/ab-standalone.exe (no longer available)
http://mars.iti.pk.edu.pl/~nkg/ab-standalone-src.zip (no longer available)
I made it using http://code.google.com/p/apachebench-standalone/wiki/HowToBuild and
http://ftp.ps.pl/pub/apache//apr/binaries/win32/apr-1.3.6-iconv-1.2.1-util-1.3.8-win32-x86-msvcrt60.zip (no longer available).
ab --help
-s timeout Seconds to max. wait for each response
Default is 30 seconds
Add the option -s 120 to the ab command, where 120 is the new timeout in seconds. If that is not enough, set it even higher...
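Applied to the script from the question, that would be, for example:
start /B cmd /c ab.exe -k -n 500 -c 50 -s 120 http://rhvwr2vsu410/HelloWebAPI/Home/SyncProducts > SyncProducts.txt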
ab --help
-s timeout Seconds to max. wait for each response
Default is 30 seconds
-k Use HTTP KeepAlive feature
It works for me.
Sounds like an ab bug.
I had a similar problem on OS X (now that you mention it happens on Windows, I feel more confident that ab is the culprit). I went around profiling and tracing my web application, but couldn't find anything. I then tested static pages off of nginx, and it still gave me the error. So I then went and found a replacement... JMeter. It works great, but I would still like to know what the ab problem is.