BaseEventLoop.run_in_executor + slow_callback_duration - python-asyncio

In the Python asyncio library, I am using the BaseEventLoop.run_in_executor function to schedule a long running blocking function on a separate thread:
yield from loop.run_in_executor(None, long_running_blocking_function)
When I have the PYTHONASYNCIODEBUG env var set to enable asyncio debug logging, I see the following warning printed regularly to the console (line wraps added):
WARNING:asyncio:Executing <Task pending coro=<long_running_blocking_function()
running at C:/Projects/Blah/blah.py:52>
wait_for=<Future pending cb=[wrap_future.<locals>._check_cancel_other() at
C:\Python34-64\lib\asyncio\futures.py:401, Task._wakeup()] created at
C:\Python34-64\lib\asyncio\futures.py:399> created at
C:/Projects/Blah/blah.py:59> took 0.999 seconds
I'm surprised by this, as I thought the run_in_executor function was specifically meant to hand blocking functions off to another thread. Can anyone shed some light on this? Thanks
Edit: As mentioned in a comment to Nihal below, the problem seems to lie in the integration of some library code with asyncio using the executor. Here's some example code that helps to describe the problem:
def on_data(*args, **kwargs):
    logger.info('Received data %s', args[1])

def blocking_function(t):
    logger.info('Going to sleep for %s', t)
    time.sleep(t)

executor = ThreadPoolExecutor(2)

@asyncio.coroutine
def update_session():
    while True:
        # session.Update causes on_data to be called when data is available
        yield from loop.run_in_executor(executor, session.Update, -1)

@asyncio.coroutine
def sleep_short():
    while True:
        yield from loop.run_in_executor(executor, blocking_function, .01)

asyncio.Task(update_session())
asyncio.Task(sleep_short())

loop = asyncio.get_event_loop()
loop.run_forever()
When I comment out the Task that drives the update_session function, I see my sleep_short function called as expected every 0.01 seconds.
2015-07-27 17:57:09,570 [MainThread] Using selector: SelectSelector
2015-07-27 17:57:09,577 [Thread-1] Going to sleep for 0.01
2015-07-27 17:57:09,587 [Thread-1] Going to sleep for 0.01
2015-07-27 17:57:09,597 [Thread-2] Going to sleep for 0.01
However including that task seems to hijack both threads, so the sleep_short task only runs every second or so:
2015-07-27 17:58:21,618 [MainThread] Using selector: SelectSelector
2015-07-27 17:58:21,624 [Thread-1] calling session update
2015-07-27 17:58:21,625 [Thread-2] Going to sleep for 0.01
2015-07-27 17:58:21,625 [Thread-1] calling session update
2015-07-27 17:58:21,633 [Thread-1] Received data
2015-07-27 17:58:21,633 [Thread-1] calling session update
2015-07-27 17:58:22,603 [Thread-2] Going to sleep for 0.01
2015-07-27 17:58:22,603 [Thread-1] calling session update
I'm quite confused... am I running into the GIL perhaps?
Edit 2:
The delays are definitely caused by the GIL. I was expecting the library I am calling to be blocking on IO, but apparently this is not the case.
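The GIL contention described above can be reproduced without asyncio or any library at all. The following is a small illustrative sketch (all names are made up for the example) that times a batch of short sleeps on their own, and then again while a pure-Python CPU-bound thread is competing for the GIL:

```python
import threading
import time

def cpu_bound(duration_s):
    # Pure-Python busy loop: it holds the GIL except when the interpreter
    # periodically forces a switch (sys.getswitchinterval(), ~5 ms by default).
    end = time.perf_counter() + duration_s
    x = 0
    while time.perf_counter() < end:
        x += 1

def timed_short_sleeps(n):
    # Measure how long n short sleeps actually take in total.
    start = time.perf_counter()
    for _ in range(n):
        time.sleep(0.001)
    return time.perf_counter() - start

# Baseline: short sleeps with no GIL contention.
baseline = timed_short_sleeps(50)

# Same sleeps while a CPU-bound thread holds the GIL between forced switches.
t = threading.Thread(target=cpu_bound, args=(1.0,))
t.start()
contended = timed_short_sleeps(50)
t.join()

print(f"baseline:  {baseline:.3f}s")
print(f"contended: {contended:.3f}s")
```

Each sleeping thread has to reacquire the GIL after its sleep expires, so the contended run is typically several times slower. This is why `run_in_executor` with a `ThreadPoolExecutor` does not help for CPU-bound work; a `ProcessPoolExecutor` runs the work in a separate interpreter process with its own GIL and is the usual fix.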

I will try to give some explanation; I am also quite new to asyncio.
One scenario where this can happen: you have initiated some connection, and the connection has not been closed before the event loop is closed. In such a case the loop should wait for the connection to close and then close itself. This used to occur frequently in asyncio's redis module; the bug reported here can serve as a reference: https://github.com/jonathanslenders/asyncio-redis/issues/56
I am not sure this answers the situation properly.
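To make the connection-teardown point concrete, here is a minimal sketch. The `Connection` class is hypothetical, standing in for something like an asyncio-redis client; the point is that its async close must run inside the loop, before the loop itself is closed:

```python
import asyncio

class Connection:
    """Hypothetical connection object with an async teardown."""
    def __init__(self):
        self.closed = False

    async def close(self):
        # Pretend to flush buffers / say goodbye to the server.
        await asyncio.sleep(0)
        self.closed = True

conn = Connection()

async def main():
    try:
        pass  # ... use the connection here ...
    finally:
        # Close the connection while the loop is still running, so the
        # teardown coroutine actually gets a chance to execute.
        await conn.close()

loop = asyncio.new_event_loop()
try:
    loop.run_until_complete(main())
finally:
    loop.close()
print("closed cleanly:", conn.closed)
```

If the loop is closed first, the teardown coroutine can never be scheduled, which is the failure mode the asyncio-redis issue above describes.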

Related

SQS task going in DLQ despite being successful in Lambda + also when deleted manually

I have built my own application around AWS Lambda and Salesforce.
I have around 10 users on my internal app, so we're not talking about big usage.
Daily, I have around 500-1000 SQS tasks processed on a normal day, with each task taking around 1-60 seconds depending on its complexity.
This is working perfectly.
Timeout for my lambda is 900.
BatchSize = 1
Using Python 3.8
I've created a decorator which allows me to process through SQS some of my functions which need to be processed ASYNC with FIFO logic.
Everything is working well.
My Lambda function doesn't return anything at the end, but it completes with success (standard scenario). However, I have noticed that some tasks were going into my DLQ (I only allow processing once; if a message is re-presented it goes into the DLQ immediately).
The thing I don't get is why this happens.
The Lambda ends with success, so normally the task should be deleted from the initial SQS queue.
So I've added a manual deletion of the processed task at the very end of the function. I've logged the result of boto3.client.delete_message and I get a 200 status, so everything looks OK... However, once in a while (about 1 out of 100, so roughly 10 times per day in my case) I can still see the task going into the DLQ...
Reprocessing the same task through my standard queue without changing anything, it gets processed successfully (again) and deleted (as expected initially).
What puzzles me most is that the message can still end up in the DLQ even after being deleted. What could be the problem?
Example of my async processor
def process_data(event, context):
    """
    By convention, we need to store in the table AsyncTaskQueueName a dict with the following parameters:
    - python_module: used to determine the location of the method to call asynchronously
    - python_function: used to determine the location of the method to call asynchronously
    - uuid: uuid to get the params stored in dynamodb
    """
    print('Start Processing Async')
    client = boto3.client('sqs')
    queue_url = client.get_queue_url(QueueName=settings.AsyncTaskQueueName)['QueueUrl']
    # batch size = 1 so only 1 record to process
    for record in event['Records']:
        try:
            kwargs = json.loads(record['body'])
            print(f'Start Processing Async Data Record:\n{kwargs}')
            python_module = kwargs['python_module']
            python_function = kwargs['python_function']
            # CALLING THE FUNCTION WE WANTED ASYNC, AND DOING ITS STUFF... (WORKING OK)
            getattr(sys.modules[python_module], python_function)(uuid=kwargs['uuid'], is_in_async_processing=True)
            print('End Processing Async Data Record')
            res = client.delete_message(QueueUrl=queue_url, ReceiptHandle=record['receiptHandle'])
            # When the problem I'm monitoring occurs, it goes up to this line, with
            # res status = 200!! That's where I'm losing my mind. I can confirm the
            # uuid in the DLQ is the same as in the queue, so we are definitely talking
            # about the same message which has been moved to the DLQ.
            print(f'End Deleting Async Data Record with status: {res}')
        except Exception:
            # set visibility to 0 so that the task goes into the DLQ
            client.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=record['receiptHandle'],
                VisibilityTimeout=0
            )
            utils.raise_exception(f'There was a problem during async processing. Event:\n'
                                  f'{json.dumps(event, indent=4, default=utils.jsonize_datetime)}')
Example of today's bug with logs from CloudWatch:
Initial event:
{'Records': [{'messageId': '75587372-256a-47d4-905b-62e1b42e2dad', 'receiptHandle': 'YYYYYY", "python_module": "quote.processing", "python_function": "compute_price_data"}', 'attributes': {'ApproximateReceiveCount': '1', 'SentTimestamp': '1621432888344', 'SequenceNumber': '18861830893125615872', 'MessageGroupId': 'compute_price_data', 'SenderId': 'XXXXX:main-app-production-main', 'MessageDeduplicationId': 'b4de6096-b8aa-11eb-9d50-5330640b1ec1', 'ApproximateFirstReceiveTimestamp': '1621432888344'}, 'messageAttributes': {}, 'md5OfBody': '5a67d0ed88898b7b71643ebba975e708', 'eventSource': 'aws:sqs', 'eventSourceARN': 'arn:aws:sqs:eu-west-3:XXXXX:async_task-production.fifo', 'awsRegion': 'eu-west-3'}]}
Res (after calling delete_message):
End Deleting Async Data Record with status: {'ResponseMetadata': {'RequestId': '7738ffe7-0adb-5812-8701-a6f8161cf411', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '7738ffe7-0adb-5812-8701-a6f8161cf411', 'date': 'Wed, 19 May 2021 14:02:47 GMT', 'content-type': 'text/xml', 'content-length': '215'}, 'RetryAttempts': 0}}
BUT... 75587372-256a-47d4-905b-62e1b42e2dad is in the DLQ after this delete_message. I'm going crazy.
OK, the problem was due to my serverless.yml timeout being set to 900, but not in AWS. I may have changed it manually to 1 min there, so my long tasks were being released after 1 min and then going immediately to the DLQ.
Hence the deletion wasn't doing anything, since the task was already in the DLQ by the time the deletion was made.
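In other words, the messages were being re-presented because the queue's visibility timeout no longer covered the function's real run time. AWS's guidance for Lambda event source mappings is to set the source queue's visibility timeout to at least six times the function timeout. A trivial sanity-check helper (hypothetical, not part of boto3) makes the mismatch obvious:

```python
def sqs_visibility_ok(visibility_timeout_s: int, lambda_timeout_s: int) -> bool:
    """AWS recommends the source queue's visibility timeout be at least
    six times the function timeout when SQS triggers Lambda."""
    return visibility_timeout_s >= 6 * lambda_timeout_s

# The bug above: tasks taking up to ~900 s against a ~60 s visibility timeout.
print(sqs_visibility_ok(60, 900))    # visibility far too short, message reappears mid-run
print(sqs_visibility_ok(5400, 900))  # safe configuration
```

With too short a visibility timeout, the in-flight message reappears while the function is still running; with maxReceiveCount = 1 it is sent straight to the DLQ, and the later delete_message still returns 200 because deleting by receipt handle does not fail for a message that has already moved.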

Spring Batch: Terminating the current running job

I am having an issue in terminating the current running spring batch. I wrote
Set<Long> executions = jobOperator.getRunningExecutions("Job-Builder");
jobOperator.stop(executions.iterator().next());
in my code after going through the spring documentation.
The problem I am facing is that sometimes the termination of the job happens as expected, and other times it does not. Every time I call stop on the JobOperator it updates the BATCH_JOB_EXECUTION table. When the termination succeeds, the status of the job is updated to STOPPED and the jobExecution in my batch process is killed. When it fails, the batch completes the rest of its flows and updates the status to FAILED in the BATCH_JOB_EXECUTION table.
But every time I call stop in the job operator I see a message in my console
2020-09-30 18:14:29.780 [http-nio-8081-exec-5] INFO o.s.b.c.l.s.SimpleJobOperator:428 - Aborting job execution: JobExecution: id=33058, version=2, startTime=2020-09-30 18:14:25.79, endTime=null, lastUpdated=2020-09-30 18:14:28.9, status=STOPPING, exitStatus=exitCode=UNKNOWN;exitDescription=, job=[JobInstance: id=32922, version=0, Job=[Job-Builder]], jobParameters=[{date=1601504064263, time=1601504064262, extractType=false, JobId=1601504064262}]
My project has a series of flows and steps with in it.
Over all my batch process looks like this:
The JobBuilderFactory has 3 flows.
Each flow has a stepbuilder and two tasklets.
Each stepbuilder has a partitioner and a chunk-based (chunk size 100) itemReader, itemProcessor and itemWriter.
I am calling the stop method while executing the very first flow in my jobBuilderFactory. The overall process takes about 30 mins to complete, so there are around 20-25 mins left from the time I call the stop method; the chunk size is 100 within every flow and I am dealing with more than 500k records.
So, my question is: why does the jobExecution stop sometimes when the stop method is called (which is what I want), and why is it unable to stop the jobExecution the remaining times?
Thanks in advance
So, my question is: why does the jobExecution stop sometimes when the stop method is called (which is what I want), and why is it unable to stop the jobExecution the remaining times?
It's not easy to figure out the reason for that from what you shared, but I can give you a couple of notes about stopping jobs:
jobOperator.stop does not guarantee that the job stops; it only sends a stop signal to the job execution. From what you shared, you are not checking the returned boolean that indicates whether the signal was correctly sent, so you should do that first.
You did not share your tasklet code, but you need to use StoppableTasklet instead of Tasklet to make sure the stop signal is correctly sent to your steps.

How to make gevent sleep precise?

I'm developing a load testing tool with gevent.
I create a testing script like the following
while True:
    # send http request
    response = client.sendAndRecv()
    gevent.sleep(0.001)
The send/receive action completes very quickly, in about 0.1 ms.
So the expected rate should be close to 1000 per second.
But I actually get about 500 per second on both the Ubuntu and Windows platforms.
Most likely the gevent sleep is not accurate.
Gevent uses libuv or libev for its internal loop, and I got the following description of how libuv handles the poll timeout from here:
If the loop was run with the UV_RUN_NOWAIT flag, the timeout is 0.
If the loop is going to be stopped (uv_stop() was called), the timeout is 0.
If there are no active handles or requests, the timeout is 0.
If there are any idle handles active, the timeout is 0.
If there are any handles pending to be closed, the timeout is 0.
If none of the above cases matches, the timeout of the closest timer is taken, or if there are no active timers, infinity.
It seems that when we call gevent sleep, it actually sets up a timer, and the libuv loop uses the timeout of the closest timer.
I really suspect that is the root cause: the OS select timeout is not precise!
I noticed the libuv loop can run in UV_RUN_NOWAIT mode, which makes the loop timeout 0, i.e. no sleeping even if there are no I/O events.
It may drive the load of one CPU core to 100%, but that is acceptable to me.
So I modified the run function of gevent's hub.py as follows:
loop.run(nowait=True)
But when I run the tool, I get the complaint 'This operation would block forever', like the following:
gevent.sleep(0.001)
File "C:\Python37\lib\site-packages\gevent\hub.py", line 159, in sleep
hub.wait(t)
File "src\gevent\_hub_primitives.py", line 46, in gevent.__hub_primitives.WaitOperationsGreenlet.wait
File "src\gevent\_hub_primitives.py", line 55, in gevent.__hub_primitives.WaitOperationsGreenlet.wait
File "src\gevent\_waiter.py", line 151, in gevent.__waiter.Waiter.get
File "src\gevent\_greenlet_primitives.py", line 60, in gevent.__greenlet_primitives.SwitchOutGreenletWithLoop.switch
File "src\gevent\_greenlet_primitives.py", line 60, in gevent.__greenlet_primitives.SwitchOutGreenletWithLoop.switch
File "src\gevent\_greenlet_primitives.py", line 64, in gevent.__greenlet_primitives.SwitchOutGreenletWithLoop.switch
File "src\gevent\__greenlet_primitives.pxd", line 35, in gevent.__greenlet_primitives._greenlet_switch
gevent.exceptions.LoopExit: This operation would block forever
So what should I do?
Yes, I finally found the trick.
If the libuv loop run mode is not UV_RUN_DEFAULT, gevent does some checking, and if the libuv loop is in 'nowait' mode it will say "This operation would block forever".
That's weird; actually it will not block forever.
Anyway, I just modified line 473 of the file libuv/loop.py as follows:
if mode == libuv.UV_RUN_DEFAULT:
    while self._ptr and self._ptr.data:
        self._run_callbacks()
        self._prepare_ran_callbacks = False
        # here, change from UV_RUN_ONCE to UV_RUN_NOWAIT
        ran_status = libuv.uv_run(self._ptr, libuv.UV_RUN_NOWAIT)
After that, running the load tool: wow... exactly what I expected. TPS is very close to what I set, but one core's load is 100%.
That's totally acceptable, because it is a load testing tool.
If we had a real-time OS kernel, we wouldn't need to bother with this.
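Incidentally, the timer imprecision suspected in the question is easy to measure with plain time.sleep, independent of gevent and libuv, since both ultimately rely on the same OS timer facilities (a quick measurement sketch):

```python
import time

def average_sleep(requested_s, samples=100):
    # Time a batch of short sleeps and return the mean actual duration.
    start = time.perf_counter()
    for _ in range(samples):
        time.sleep(requested_s)
    return (time.perf_counter() - start) / samples

actual = average_sleep(0.001)
print(f"requested 1.000 ms, measured {actual * 1000:.3f} ms on average")
```

On a typical desktop OS the measured value lands noticeably above 1 ms; that overshoot alone is enough to roughly halve the request rate of the loop in the question.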

Does the main thread 'always' run in a ruby web server, like Sinatra?

When you launch a thread from within a web request handler - does the thread continue to run as long as the server is running?
Something similar to Thread.join would block the main thread, but could you not skip join and have all the threads complete on their own schedule, likely well after the web request handler has returned an HTTP response to the browser?
The following code works fine for me. Tested on my local OS X machine, I was able to get 1500+ real threads running with thin and ruby 1.9.2. On the Heroku cedar stack, I can get about 230 threads running before I get an error when creating a thread.
In both cases all the threads seem to finish when they are supposed to: 2 minutes after launching them. '/' is rendered in about 60 ms on Heroku, and the 20 threads then run for 2 minutes each.
If you refresh / a few times, then wait a few minutes, you can see the threads finishing. The reason I tested with 2 minutes is that Heroku has a 30-second limit on responses, cutting you off if you take more than that amount of time. But this does not seem to affect background threads.
$threadsLaunched = 0
$threadsDone = 0

get '/' do
  puts "#{Thread.list.size} threads"
  for i in 1..20 do
    $threadsLaunched = $threadsLaunched + 1
    puts "Creating thread #{i}"
    Thread.new(i) do |j|
      sleep 120
      puts "Thread #{j} done"
      $threadsDone = $threadsDone + 1
    end
  end
  puts "#{Thread.list.size} threads"
  erb :home
end
(home.erb)
<div id="content">
<h1> Threads launched <%= $threadsLaunched.to_s %> </h1>
<h1> Threads running <%= Thread.list.count.to_s %> </h1>
<h1> Threads done <%= $threadsDone.to_s %> </h1>
</div> <!-- id="content" -->
Once your main thread exits, all the other ones are forcefully destroyed too and the process exits.
Thread.new do
  # this thread never exits on its own
  while true do
    puts "."
    sleep 1
  end
end

sleep 5
Following this example, once the main thread ends, the printing thread will end too without "completing" its work. You have to explicitly join all background threads to wait for their completion before exiting the main thread.
On the other hand, as long as the main thread runs, other threads can run as long as they want. There is no arbitrary restriction to the request/response cycle. Note, however, that in the "original" Rubies, threads are not really concurrent but are subject to the GIL. If you want true concurrency (as in using multiple cores on your computer with different threads), you should have a look at either JRuby or Rubinius (2.0-preview), both of which offer truly concurrent threads.
If you just want to take things out of the request cycle to handle later, the green threads in 1.8 and OS-native-but-GILed threads in 1.9 are just fine though. If you want more scalability, you should have a look at technologies like delayed_job or Resque which introduce persistent workers for background jobs.
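For comparison, Ruby's kill-threads-at-exit behavior corresponds to what Python calls daemon threads; the join-before-exit pattern the answer recommends looks like this in Python (a cross-language sketch, not Sinatra-specific):

```python
import threading
import time

results = []

def background_work(i):
    # Stand-in for work that outlives the "request handler".
    time.sleep(0.05)
    results.append(i)

threads = [threading.Thread(target=background_work, args=(i,)) for i in range(5)]
for t in threads:
    t.start()

# Without join, a runtime that kills threads at main-thread exit (as Ruby
# does, or Python with daemon=True) would silently drop this work.
# Joining guarantees every background task completes before exit.
for t in threads:
    t.join()

print(sorted(results))  # -> [0, 1, 2, 3, 4]
```

The same trade-off applies in any language: either wait for background threads explicitly, or accept that they may be cut off when the process ends.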

Check if a Win32 thread is running or in a suspended state

How do I check to see if a Win32 thread is running or in suspended state?
I can't find any Win32 API which gives the state of a thread. So how do I get the thread state?
I think, originally, this information was not provided because any API that provided it would be misleading and useless.
Consider two possible cases. In the first, the current thread has suspended the thread of interest. Code in the current thread knows about the suspended state and should be able to share it, so there's no need for the kernel team to add an API.
In the second case, some other, third thread in the system has suspended the thread of interest (and there's no way to track which thread that was). Now you have a race condition: that other thread could, at any time, unsuspend the thread of interest, and the information gleaned from the API would be useless. You would have a value indicating the thread is suspended when it is, in fact, not.
Moral of the story: if you want to know that a thread is suspended, suspend it. The return value from SuspendThread is the previous suspend count of the thread. And now you DO know something useful: either the thread WAS AND STILL IS suspended, or it WASN'T (but now is) suspended. Either way, the thread's state is now deterministically known, so you can in theory make some intelligent choices based on it: whether to ResumeThread, or keep it suspended.
You can get this information by calling NtQuerySystemInformation() with the value for SystemProcessesAndThreadsInformation (integer value 5).
If you want an example of what you can do with this information take a look at Thread Status Monitor.
WMI's Win32_Thread class has a ThreadState property, where 5 means "Suspended Blocked" and 6 means "Suspended Ready".
You will need the thread's id to get the right instance directly (the WMI object's Handle property is the thread id).
EDIT: Given this PowerShell query:
gwmi win32_thread | group ThreadState
gives
Count Name Group
----- ---- -----
6 2 {, , , ...}
966 5 {, , , ...}
WMI has a different definition of "Suspended" to Win32.
In Windows 7, you can use QueryUmsThreadInformation. (UMS stands for User mode scheduling).
See here for UmsThreadIsSuspended.
You could get the thread suspend count with code like this:
DWORD GetThreadSuspendCount(HANDLE hThread) {
    DWORD dwSuspendCount = SuspendThread(hThread);
    ResumeThread(hThread);
    return dwSuspendCount;
}
but, as already said - it is not accurate.
Moreover, suspending a thread is evil.
YES: it IS possible to get the thread state and determine if it is suspended.
And NO: You don't need Windows 7 to do that.
I published my working class here on Stackoverflow: How to get thread state (e.g. suspended), memory + CPU usage, start time, priority, etc
This class requires Windows 2000 or higher.
I think the state here refers to whether the thread is in its thread proc doing some processing, or waiting for an event.
This can be handled with a variable that records whether the thread is actually running or waiting for an event to happen.
These scenarios appear when considering thread pools: given some n threads, tasks can be assigned to idle threads based on each thread's running status.
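The "variable that records whether the thread is running" idea can be sketched with a tiny worker pool where each thread flips a busy flag around the task it runs (all names here are illustrative, and this is a cross-language sketch rather than Win32 code):

```python
import queue
import threading
import time

tasks = queue.Queue()
busy = {}  # worker name -> currently running a task?

def worker():
    name = threading.current_thread().name
    busy[name] = False
    while True:
        task = tasks.get()
        if task is None:          # poison pill: shut down
            break
        busy[name] = True         # in the "thread proc", doing work
        task()
        busy[name] = False        # back to waiting for an event
        tasks.task_done()

workers = [threading.Thread(target=worker, name=f"worker-{i}") for i in range(2)]
for w in workers:
    w.start()

tasks.put(lambda: time.sleep(0.2))
time.sleep(0.05)                  # give a worker time to pick the task up
print("busy flags while running:", dict(busy))

tasks.join()                      # wait for the task to finish
for _ in workers:
    tasks.put(None)
for w in workers:
    w.join()
print("busy flags when idle:", dict(busy))
```

A pool dispatcher can then pick an idle worker by scanning these flags, which is exactly the "assign tasks to idle threads" scenario described above; the flags are maintained cooperatively by the threads themselves, avoiding the race inherent in querying another thread's state from outside.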
