I'm developing a load testing tool with gevent.
I created a testing script like the following:
while True:
    # send http request
    response = client.sendAndRecv()
    gevent.sleep(0.001)
The send/receive action completes very quickly, in about 0.1 ms.
So the expected rate should be close to 1000 requests per second.
But I actually get about 500 per second, on both Ubuntu and Windows.
Most likely gevent's sleep is not accurate.
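To check that, here is a quick sketch that measures what gevent.sleep(0.001) actually costs (nothing is assumed beyond gevent.sleep itself; if the average printed is well above 1.0 ms, the sleep itself is the bottleneck):

import time
import gevent

N = 1000
start = time.perf_counter()
for _ in range(N):
    gevent.sleep(0.001)
elapsed = time.perf_counter() - start

# With an accurate sleep this prints roughly 1.0; a value near 2.0 would
# explain seeing ~500 requests/second instead of ~1000.
print("average sleep: %.3f ms" % (elapsed / N * 1000))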
Gevent uses libuv or libev for its internal loop. I found the following description of how libuv handles the poll timeout here:
If the loop was run with the UV_RUN_NOWAIT flag, the timeout is 0.
If the loop is going to be stopped (uv_stop() was called), the timeout is 0.
If there are no active handles or requests, the timeout is 0.
If there are any idle handles active, the timeout is 0.
If there are any handles pending to be closed, the timeout is 0.
If none of the above cases matches, the timeout of the closest timer is taken, or if there are no active timers, infinity.
It seems that when we call gevent.sleep, it actually sets up a timer, and the libuv loop uses the timeout of the closest timer.
I strongly suspect that is the root cause: the OS poll/select timeout is not precise!
I noticed the libuv loop can run in UV_RUN_NOWAIT mode, which makes the loop timeout 0, i.e. no sleeping when there is no I/O event.
It may drive the load of one CPU core to 100%, but that is acceptable to me.
So I modified the run function in gevent's hub.py as follows:
loop.run(nowait=True)
But when I ran the tool, I got the complaint 'This operation would block forever':
gevent.sleep(0.001)
File "C:\Python37\lib\site-packages\gevent\hub.py", line 159, in sleep
hub.wait(t)
File "src\gevent\_hub_primitives.py", line 46, in gevent.__hub_primitives.WaitOperationsGreenlet.wait
File "src\gevent\_hub_primitives.py", line 55, in gevent.__hub_primitives.WaitOperationsGreenlet.wait
File "src\gevent\_waiter.py", line 151, in gevent.__waiter.Waiter.get
File "src\gevent\_greenlet_primitives.py", line 60, in gevent.__greenlet_primitives.SwitchOutGreenletWithLoop.switch
File "src\gevent\_greenlet_primitives.py", line 60, in gevent.__greenlet_primitives.SwitchOutGreenletWithLoop.switch
File "src\gevent\_greenlet_primitives.py", line 64, in gevent.__greenlet_primitives.SwitchOutGreenletWithLoop.switch
File "src\gevent\__greenlet_primitives.pxd", line 35, in gevent.__greenlet_primitives._greenlet_switch
gevent.exceptions.LoopExit: This operation would block forever
So what should I do?
Yes, I finally found the trick.
If the libuv loop run mode is not UV_RUN_DEFAULT, gevent does some extra checking, and if the libuv loop is in 'nowait' mode, it reports "This operation would block forever".
That's weird; actually it will not block forever.
Anyway, I just modified line 473 of the file libuv/loop.py as follows:
if mode == libuv.UV_RUN_DEFAULT:
    while self._ptr and self._ptr.data:
        self._run_callbacks()
        self._prepare_ran_callbacks = False
        # here, change from UV_RUN_ONCE to UV_RUN_NOWAIT
        ran_status = libuv.uv_run(self._ptr, libuv.UV_RUN_NOWAIT)
After that, I ran the load tool and... wow, exactly what I expected: the TPS is very close to what I set, but one core's load is 100%.
That is totally acceptable, because it is a load testing tool.
If we had a real-time OS kernel, we wouldn't need to bother with this.
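For what it's worth, a less invasive alternative (no patching of gevent at all) is to pace the requests with a busy-wait that only yields via gevent.sleep(0), which never arms a timer and therefore never depends on the poll timeout. A rough sketch, using the same hypothetical client object as above; it burns one core just like the patched loop does:

import time
import gevent

interval = 0.001
next_deadline = time.perf_counter()

while True:
    response = client.sendAndRecv()   # same hypothetical client as above
    next_deadline += interval
    # gevent.sleep(0) is a pure cooperative yield: it switches to the hub
    # without creating a timer, so the libuv poll timeout stays at zero.
    while time.perf_counter() < next_deadline:
        gevent.sleep(0)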
What would make a write call to a TCPSocket hang indefinitely?
lotsOfBytes = # a really large number of bytes, like 1 or 2 MB of data
socket = TCPSocket.new # some config
socket.write(lotsOfBytes) # this line hangs
I am trying to debug an issue where a get_multi operation sent to memcached with a large number of keys hangs indefinitely, and it does so on a line that resembles that code snippet. I'm trying to better understand how the low-level sockets on which this library is built are expected to work.
What are the values of the following attributes on your TCPSocket:
Is keep-alive activated, and with what value?
Is a timeout set, and with what value?
A Wireshark dump would be even better, to see what happens right before the connection hangs.
tcpdump: are there any attempts to send anything?
netstat: to see the output queue.
Does it work with a small number of bytes in your environment?
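The snippet in the question looks like Ruby, but the underlying behaviour is the same for any blocking socket: if the peer stops draining data, the local send buffer fills up and the write blocks until space appears. A small Python sketch of the idea (host and port are made up); giving the socket a timeout turns the silent hang into a visible error:

import socket

lots_of_bytes = b"x" * (2 * 1024 * 1024)   # ~2 MB, like the original snippet

sock = socket.create_connection(("memcached.example", 11211))  # hypothetical server
sock.settimeout(10)           # without this, sendall() can block indefinitely
try:
    sock.sendall(lots_of_bytes)
except socket.timeout:
    # The kernel send buffer filled up and the peer did not read for 10 s:
    # this is what an "indefinitely hanging" write looks like, with a deadline.
    print("write stalled: peer is not reading")
finally:
    sock.close()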
I am using wx.ProgressDialog to show the time between switching ports and the time between taking measurements. I am running this test for over 24 hours (repeating the same thing over and over while recording the data). The error that appears during hour 7 is:
Traceback (most recent call last):
File "C:\Users\localuser\Desktop\Thermal\Cheyenne_Antenna_Cal_PDA_Thermal_Test.py", line 2117, in take_measurements_at_interval
self.take_measurement(self)
File "C:\Users\localuser\Desktop\Thermal\Cheyenne_Antenna_Cal_PDA_Thermal_Test.py", line 2185, in take_measurement
self.Measure_Plot(self)
File "C:\Users\localuser\Desktop\Thermal\Cheyenne_Antenna_Cal_PDA_Thermal_Test.py", line 2231, in Measure_Plot
style=wx.PD_AUTO_HIDE | wx.PD_ELAPSED_TIME | wx.PD_REMAINING_TIME)
File "C:\Python27\lib\site-packages\wx-2.8-msw-unicode\wx_windows.py", line 2951, in init
windows.ProgressDialog_swiginit(self,windows.new_ProgressDialog(*args, **kwargs))
wx._core.PyAssertionError: C++ assertion "wxAssertFailure" failed at ....\src\msw\control.cpp(159) in wxControl::MSWCreateControl(): CreateWindowEx("STATIC", flags=52000100, ex=00000000) failed
Here is the code that is being used to 'delay time'
#Giving Time for switch to toggle next port
progressMax = 5
dialog = wx.ProgressDialog("A progress box", "Time to switch", progressMax,
                           style=wx.PD_AUTO_HIDE | wx.PD_ELAPSED_TIME | wx.PD_REMAINING_TIME)
keepGoing = True
count = 0
while keepGoing and count < progressMax:
    count = count + 1
    wx.Sleep(1)
    keepGoing = dialog.Update(count)
dialog.Destroy()
The code pauses 5 seconds to allow the switch hardware and PNA to settle before data is recorded. All of this is happening in a 'For' loop for a period of time. If anyone needs more information I will be happy to provide it.
If the window creation fails after running for a long time, chances are you simply run out of windows, which are still a very limited resource under Microsoft Windows (the exact limit depends on the Windows version, but can be as low as 16,384).
This could happen if you never return to the main event loop during all this time, because top-level windows are only really destroyed (and not just hidden) once you get back to it.
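Based on that, one possible fix is to create the ProgressDialog once and reuse it for every pause, instead of creating and destroying a new native window on each iteration. A rough sketch, assuming the measurement loop is driven from your own code (number_of_cycles and run_measurement_cycle are hypothetical stand-ins for the existing For loop and measurement code):

import wx

progressMax = 5
# Created ONCE, outside the long-running loop, so only one native window
# is ever allocated for the countdown.
dialog = wx.ProgressDialog("A progress box", "Time to switch", progressMax,
                           style=wx.PD_ELAPSED_TIME | wx.PD_REMAINING_TIME)

for cycle in range(number_of_cycles):        # hypothetical outer loop
    for count in range(1, progressMax + 1):  # the 5-second settle time
        wx.Sleep(1)
        dialog.Update(count)
    dialog.Update(0)            # reset the gauge for the next cycle
    # Give wx a chance to really destroy any pending top-level windows.
    wx.GetApp().Yield()
    run_measurement_cycle()     # hypothetical: your measurement code

dialog.Destroy()                # destroy it once, at the very end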
I am developing a Wikipedia bot to analyze editing contributions. Unfortunately, it takes hours to complete a single run and during that time Wikipedia's database replication delay—at some point during the run—is sure to exceed 5 seconds (the default maxlag value). The recommendation in the API's maxlag parameter is to detect the lag error, pause for X seconds and retry.
But all I am doing is reading contributions with:
usrpg = pywikibot.Page(site, 'User:' + username)
usr = pywikibot.User(usrpg)
for contrib in usr.contributions(total=max_per_user_contribs):
    # (analyzes contrib here)
How to detect the error and resume it? This is the error:
WARNING: API error maxlag: Waiting for 10.64.32.21: 7.1454429626465 seconds lagged
Traceback (most recent call last):
File ".../bot/core/pwb.py", line 256, in <module>
if not main():
File ".../bot/core/pwb.py", line 250, in main
run_python_file(filename, [filename] + args, argvu, file_package)
File ".../bot/core/pwb.py", line 121, in run_python_file
main_mod.__dict__)
File "analyze_activity.py", line 230, in <module>
attrs = usr.getprops()
File ".../bot/core/pywikibot/page.py", line 2913, in getprops
self._userprops = list(self.site.users([self.username, ]))[0]
File ".../bot/core/pywikibot/data/api.py", line 2739, in __iter__
self.data = self.request.submit()
File ".../bot/core/pywikibot/data/api.py", line 2183, in submit
raise APIError(**result['error'])
pywikibot.data.api.APIError: maxlag: Waiting for 10.64.32.21:
7.1454 seconds lagged [help:See https://en.wikipedia.org/w/api.php for API usage]
<class 'pywikibot.data.api.APIError'>
CRITICAL: Closing network session.
It occurs to me to catch the exception thrown in that line of code:
raise APIError(**result['error'])
But then restarting the contributions for the user seems terribly inefficient. Some users have 400,000 edits, so rerunning that from the beginning is a lot of backsliding.
I have googled for examples of doing this (detecting the error and retrying) but I found nothing useful.
Converting the previous conversation in comments into an answer.
One possible method to resolve this is to try/catch the error and redo the piece of code which caused the error.
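For reference, a minimal sketch of that manual approach (the helper name and pause length are made up; note that it redoes the whole contributions query, which is exactly the inefficiency mentioned in the question):

import time
from pywikibot.data import api

def iter_contribs_with_retry(usr, limit, pause=30):
    # Hypothetical helper: restart the query whenever the API reports maxlag.
    while True:
        try:
            for contrib in usr.contributions(total=limit):
                yield contrib
            return
        except api.APIError as err:
            if err.code != 'maxlag':
                raise
            time.sleep(pause)   # wait for the replication lag to drop, then redo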
But pywikibot already does this internally for us! By default, pywikibot retries every failed API call 2 times if you're using the default user-config.py it generates. I found that increasing the following configs does the trick in my case:
maxlag = 20
retry_wait = 20
max_retries = 8
maxlag is the parameter recommended to increase according to the documentation of the maxlag parameter, especially if you're doing a large number of writes in a short span of time. The retry_wait and max_retries configs, however, are useful when someone else is writing a lot (as in my case: my scripts only read from the wiki).
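If editing user-config.py is inconvenient, the same values can, as far as I know, also be set from the script itself through the config module before any API calls are made:

import pywikibot

# Equivalent to the user-config.py entries above.
pywikibot.config.maxlag = 20
pywikibot.config.retry_wait = 20
pywikibot.config.max_retries = 8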
I have an agent-based model, where several agents are started by a central process and communicate via another central process. Every agent and the communication process communicate via zmq. However, when I start more than 100 agents, standard out prints:
Invalid argument (src/stream_engine.cpp:143) Too many open files
(src/ipc_listener.cpp:292)
and Mac OS shows a problem report:
Python quit unexpectedly while using the libzmq.5.dylib plug-in.
The problem appears to be that too many contexts are opened. But how can I avoid this with multiprocessing?
I attach part of the code below:
class Agent(Database, Logger, Trade, Messaging, multiprocessing.Process):
    def __init__(self, idn, group, _addresses, trade_logging):
        multiprocessing.Process.__init__(self)
        ....

    def run(self):
        self.context = zmq.Context()
        self.commands = self.context.socket(zmq.SUB)
        self.commands.connect(self._addresses['command_addresse'])
        self.commands.setsockopt(zmq.SUBSCRIBE, "all")
        self.commands.setsockopt(zmq.SUBSCRIBE, self.name)
        self.commands.setsockopt(zmq.SUBSCRIBE, group_address(self.group))

        self.out = self.context.socket(zmq.PUSH)
        self.out.connect(self._addresses['frontend'])
        time.sleep(0.1)
        self.database_connection = self.context.socket(zmq.PUSH)
        self.database_connection.connect(self._addresses['database'])
        time.sleep(0.1)
        self.logger_connection = self.context.socket(zmq.PUSH)
        self.logger_connection.connect(self._addresses['logger'])

        self.messages_in = self.context.socket(zmq.DEALER)
        self.messages_in.setsockopt(zmq.IDENTITY, self.name)
        self.messages_in.connect(self._addresses['backend'])

        self.shout = self.context.socket(zmq.SUB)
        self.shout.connect(self._addresses['group_backend'])
        self.shout.setsockopt(zmq.SUBSCRIBE, "all")
        self.shout.setsockopt(zmq.SUBSCRIBE, self.name)
        self.shout.setsockopt(zmq.SUBSCRIBE, group_address(self.group))

        self.out.send_multipart(['!', '!', 'register_agent', self.name])

        while True:
            try:
                self.commands.recv()  # catches the group adress.
            except KeyboardInterrupt:
                print('KeyboardInterrupt: %s,self.commands.recv() to catch own adress ~1888' % (self.name))
                break
            command = self.commands.recv()
            if command == "!":
                subcommand = self.commands.recv()
                if subcommand == 'die':
                    self.__signal_finished()
                    break
            try:
                self._methods[command]()
            except KeyError:
                if command not in self._methods:
                    raise SystemExit('The method - ' + command + ' - called in the agent_list is not declared (' + self.name)
                else:
                    raise
            except KeyboardInterrupt:
                print('KeyboardInterrupt: %s, Current command: %s ~1984' % (self.name, command))
                break
            if command[0] != '_':
                self.__reject_polled_but_not_accepted_offers()
        self.__signal_finished()
        #self.context.destroy()
The whole code is at http://www.github.com/DavoudTaghawiNejad/abce
Odds are it's not too many contexts, it's too many sockets. Looking through your repo, I see you're (correctly) using IPC as your transport; IPC uses a file descriptor as the "address" to pass data back and forth between different processes. If I'm reading correctly, you're opening up to 7 sockets per process, so that'll add up quickly. I'm betting that if you do some debugging in the middle of your code, you'll see that it doesn't fail when the last context is created, but when the last socket pushes the open file limit over the edge.
My understanding is that the typical user limit for open FDs is around 1000, so at around 100 agents you're pushing 700 open FDs just for your sockets. The remainder is probably just typical usage. There should be no problem increasing your limit up to 10,000, or higher depending on your situation. Otherwise you'll have to rewrite to use fewer sockets per process to allow a higher process count.
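One way to confirm that diagnosis is to log how many descriptors the process has open shortly before it dies; on macOS and Linux a debugging sketch like this works (it is not part of the model itself):

import os

def open_fd_count():
    # /dev/fd lists the calling process's currently open file descriptors
    # on both macOS and Linux.
    return len(os.listdir('/dev/fd'))

print("open file descriptors: %d" % open_fd_count())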
This has nothing to do with zeromq or Python. It's the underlying operating system, which allows only up to a certain number of concurrently open files. This limit includes normal files, but also socket connections.
You can see your current limit using ulimit -n; it will probably default to 1024. Machines running servers, or with other reasons like your multiprocessing, often need this limit set higher or simply to unlimited. – More info about ulimit.
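If raising the limit from inside the Python process is more convenient than a shell ulimit (for example because the agents are started from a launcher script), here is a sketch using the standard resource module (Unix/macOS only):

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("current limit: soft=%d hard=%d" % (soft, hard))

# Raise the soft limit as far as the hard limit allows; raising the hard
# limit itself usually requires root.
resource.setrlimit(resource.RLIMIT_NOFILE, (min(10000, hard), hard))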
Additionally, there's another global limit, but it's nothing I have had to adjust yet.
In general, you should ask yourself whether you really need that many agents. Usually X to 2X worker processes should be enough, where X is your CPU count.
You should increase the number of allowed open files for the process, as in this question:
Python: Which command increases the number of open files on Windows?
The default per process is 512:
import win32file
print win32file._getmaxstdio() #512
win32file._setmaxstdio(1024)
print win32file._getmaxstdio() #1024
I am using a VBScript (.vbs) file in the Windows scheduler.
Sample code:
objWinHttp.Open "POST", "http://bla.com/blabla.asp", False
objWinHttp.Send
CallHTTP= objWinHttp.ResponseText
strRESP= CallHTTP(strURL)
WScript.Echo "after doInstallNewSite: " & strRESP
Problem: blabla.asp handles a task that needs around 1-2 minutes to complete.
It should return 'success' when the task completes.
But it returns an empty result to the calling .vbs, and does so sooner than the task normally takes to complete. When I then check whether the task actually completed, the answer is yes.
I found this happens when the task needs a longer time to complete.
Is this a weakness of VBScript?
Help!!!
You can specify timeouts for the winhttp component:
objWinHttp.SetTimeouts 5000, 10000, 10000, 10000
It takes 4 parameters: ResolveTimeout, ConnectTimeout, SendTimeout, and ReceiveTimeout. All 4 are required and are expressed in milliseconds (1000 = 1 second). The defaults are:
ResolveTimeout: zero (no time out)
ConnectTimeout: 60,000 (one minute)
SendTimeout: 30,000 (30 secs.)
ReceiveTimeout: 30,000 (30 secs.)
So I suggest increasing the ReceiveTimeout.
What is objHTTP specifically? Looking at the target server's log, was the request received?
I can't find the request in the server log. objWinHTTP is a standard component used to send a request and wait for the response. I did try using PHP and curl to do the whole process, but failed. Reason: PHP runs as a component of the Windows server, and when it comes to global privileges and moving files and folders, it is controlled by the server, so I gave up and used .vbs instead. objWinHTTP acts something like curl does in PHP.
Sounds to me like the request is taking too long to complete and the server is timing out. I believe the default timeout for ASP scripts is 90 seconds, so you may need to adjust this value in IIS or in your script so that the server will wait longer before timing out.
From http://msdn.microsoft.com/en-us/library/ms525225.aspx:
The AspScriptTimeout property specifies (in seconds) the default length of time that ASP pages allow a script to run before terminating the script and writing an event to the Windows Event Log. ASP script can override this value by using the ScriptTimeout property of the ASP built-in Session object. The ScriptTimeout property allows your ASP application to set a higher script timeout value. For example, you can use this setting to adjust the timeout once a particular user establishes a valid session by logging in or ordering a product.