Consuming long running tasks using pika BlockingConnection on EC2

I have a queue system where I use pika's BlockingConnection to consume a RabbitMQ queue hosted on https://www.cloudamqp.com/.
Here's how the connection is set up:
params = pika.URLParameters(self.queue_url)
self.connection = pika.BlockingConnection(params)
self.channel = self.connection.channel()
self.channel.queue_declare(queue=self.queue_name, durable=True)
self.channel.basic_qos(prefetch_count=1)
# set up subscription on the queue
self.channel.basic_consume(self._on_new_task,
                           queue=self.queue_name,
                           no_ack=False)
self.channel.start_consuming()
def _on_new_task(self, ch, method, properties, body):
    logging.info("Processing new scan")
    """
    Do long running task here
    """
    self.channel.basic_ack(delivery_tag=method.delivery_tag)
    # Program crashes here
Processing a task from the queue can take up to 15 minutes. Most of the time, however, processing the task only takes 2-3 minutes, and then everything works fine. When it takes > 5 minutes I get the following errors from pika:
ERROR - pika.adapters.base_connection - Read empty data, calling disconnect
WARNING - pika.adapters.base_connection - Socket closed when connection was open
WARNING - pika.connection - Disconnected from RabbitMQ at squirrel.rmq.cloudamqp.com:5672 (0): Not specified
CRITICAL - pika.adapters.blocking_connection - Connection close detected
Apparently the basic_ack is not handled successfully, because the task remains on the queue, leading to an infinite loop where the task is never released from the queue.
I've tried adjusting the heartbeat (to 0, 30, 600, etc.), but it looks like heartbeats do not work with a blocking connection, since the long-running task blocks the connection from sending heartbeats (?)
The program runs inside a Docker container, and I get the problem when hosting the container on AWS EC2 but not when running the container on my home PC. It therefore seems to me that EC2 is tearing down idle connections (?).
This assumption is backed up by the fact that when running
for i in range(1, 10):
    self.connection.sleep(60)
    self.connection.process_data_events()
instead of the long running task, everything works out fine, probably because the heartbeats are sent and keep the connection alive.
Not sure how to solve this. Is there any point in replacing pika's BlockingConnection with SelectConnection? Or is there any configuration in EC2 that does not tear down idle connections?
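A common workaround for this situation (a sketch, not from the original thread, written against the pika 1.x API where basic_consume takes on_message_callback instead of a positional callback and no_ack): run the long task on a worker thread so that start_consuming() keeps servicing heartbeats, and hand the ack back to the connection's thread with add_callback_threadsafe:
import functools
import logging
import threading

import pika

queue_url = "amqp://localhost"  # placeholder; use the CloudAMQP URL from the question
queue_name = "tasks"            # placeholder

def do_work(connection, channel, delivery_tag, body):
    # Runs on a worker thread, so the connection's I/O loop stays responsive.
    logging.info("Processing new scan")
    # ... long running task here ...
    # basic_ack must run on the connection's thread, so schedule it there.
    ack = functools.partial(channel.basic_ack, delivery_tag=delivery_tag)
    connection.add_callback_threadsafe(ack)

def on_new_task(channel, method, properties, body, connection):
    worker = threading.Thread(
        target=do_work,
        args=(connection, channel, method.delivery_tag, body),
        daemon=True,
    )
    worker.start()

params = pika.URLParameters(queue_url)
connection = pika.BlockingConnection(params)
channel = connection.channel()
channel.queue_declare(queue=queue_name, durable=True)
channel.basic_qos(prefetch_count=1)
channel.basic_consume(
    queue=queue_name,
    on_message_callback=functools.partial(on_new_task, connection=connection),
)
# The main thread stays inside start_consuming(), so heartbeats keep being
# answered while the worker thread runs the 15-minute job.
channel.start_consuming()
With prefetch_count=1 only one message is unacked at a time, so the worker thread never juggles more than one task.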

Related

SQS task going in DLQ despite being successful in Lambda + also when deleted manually

I have built my own application around AWS Lambda and Salesforce.
I have around 10 users using my internal app, so we're not talking about big usage.
I have around 500-1000 SQS tasks processed on a normal day, with each task taking around 1-60 seconds depending on its complexity.
This is working perfectly.
Timeout for my lambda is 900.
BatchSize = 1
Using Python 3.8
I've created a decorator which allows me to process through SQS some of my functions that need to be processed async with FIFO logic.
Everything is working well.
My Lambda function doesn't return anything at the end, but it completes successfully (standard scenario). However, I have noted that some tasks were going into my DLQ (I only allow processing once; if a message is presented again it goes into the DLQ immediately).
The thing I don't get is why this is happening.
Lambda ends with success --> normally the task should be deleted from the initial SQS queue.
So I've added a manual deletion of the processed task at the very end of the function. I've logged the result returned when I call boto3.client.delete_message, and I get a 200 status, so everything is OK... However, once in a while (1 out of 100, so about 10 times per day in my case) I can see the task going into the DLQ...
Reprocessing the same task in my standard queue without changing anything, it gets processed successfully (again) and deleted (as expected initially).
What is most problematic to me is that the message sometimes still ends up in the DLQ even though deleting it succeeded. What could be the problem?
Example of my async processor
def process_data(event, context):
    """
    By convention, we need to store in the table AsyncTaskQueueName a dict with the following parameters:
    - python_module: used to determine the location of the method to call asynchronously
    - python_function: used to determine the location of the method to call asynchronously
    - uuid: uuid to get the params stored in dynamodb
    """
    print('Start Processing Async')
    client = boto3.client('sqs')
    queue_url = client.get_queue_url(QueueName=settings.AsyncTaskQueueName)['QueueUrl']
    # batch size = 1 so only one record to process
    for record in event['Records']:
        try:
            kwargs = json.loads(record['body'])
            print(f'Start Processing Async Data Record:\n{kwargs}')
            python_module = kwargs['python_module']
            python_function = kwargs['python_function']
            # CALLING THE FUNCTION WE WANTED ASYNC, AND DOING ITS STUFF... (WORKING OK)
            getattr(sys.modules[python_module], python_function)(uuid=kwargs['uuid'], is_in_async_processing=True)
            print('End Processing Async Data Record')
            res = client.delete_message(QueueUrl=queue_url, ReceiptHandle=record['receiptHandle'])
            # When the problem I'm monitoring occurs, execution reaches this line and
            # res has a 200 status. That's where I'm losing my mind: the uuid in the DLQ
            # is the same as in the queue, so it is definitely the same message that has
            # been moved to the DLQ.
            print(f'End Deleting Async Data Record with status: {res}')
        except Exception:
            # set expire to 0 so that the task goes into the DLQ
            client.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=record['receiptHandle'],
                VisibilityTimeout=0
            )
            utils.raise_exception(f'There was a problem during async processing. Event:\n'
                                  f'{json.dumps(event, indent=4, default=utils.jsonize_datetime)}')
Example of today's bug with logs from CloudWatch:
Initial event:
{'Records': [{'messageId': '75587372-256a-47d4-905b-62e1b42e2dad', 'receiptHandle': 'YYYYYY", "python_module": "quote.processing", "python_function": "compute_price_data"}', 'attributes': {'ApproximateReceiveCount': '1', 'SentTimestamp': '1621432888344', 'SequenceNumber': '18861830893125615872', 'MessageGroupId': 'compute_price_data', 'SenderId': 'XXXXX:main-app-production-main', 'MessageDeduplicationId': 'b4de6096-b8aa-11eb-9d50-5330640b1ec1', 'ApproximateFirstReceiveTimestamp': '1621432888344'}, 'messageAttributes': {}, 'md5OfBody': '5a67d0ed88898b7b71643ebba975e708', 'eventSource': 'aws:sqs', 'eventSourceARN': 'arn:aws:sqs:eu-west-3:XXXXX:async_task-production.fifo', 'awsRegion': 'eu-west-3'}]}
Res (after calling delete_message):
End Deleting Async Data Record with status: {'ResponseMetadata': {'RequestId': '7738ffe7-0adb-5812-8701-a6f8161cf411', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '7738ffe7-0adb-5812-8701-a6f8161cf411', 'date': 'Wed, 19 May 2021 14:02:47 GMT', 'content-type': 'text/xml', 'content-length': '215'}, 'RetryAttempts': 0}}
BUT... 75587372-256a-47d4-905b-62e1b42e2dad is in the DLQ after this delete_message. I'm becoming crazy
OK, the problem was due to my serverless.yml timeout setting of 900, which was not what was actually applied in AWS. I may have changed it manually to 1 min, so my long tasks were released after 1 min and then went immediately to the DLQ.
Hence the deletion not doing anything, since the task was already in the DLQ by the time the deletion was made.
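For anyone debugging the same mismatch, a quick sketch for comparing what is actually configured in AWS (the queue and function names below are taken from the logs above and are effectively placeholders):
import boto3

sqs = boto3.client('sqs')
lam = boto3.client('lambda')

queue_url = sqs.get_queue_url(QueueName='async_task-production.fifo')['QueueUrl']
attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=['VisibilityTimeout', 'RedrivePolicy'],
)['Attributes']
fn = lam.get_function_configuration(FunctionName='main-app-production-main')

print('Queue visibility timeout (s):', attrs['VisibilityTimeout'])
print('Redrive policy:', attrs.get('RedrivePolicy'))
print('Lambda timeout (s):', fn['Timeout'])

# AWS recommends a visibility timeout of at least 6x the function timeout for
# Lambda event sources. If it is shorter than the longest task, the message
# reappears while the function is still running and, with maxReceiveCount=1,
# goes straight to the DLQ; the later delete_message still returns 200 because
# SQS accepts a stale receipt handle without reporting an error.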

How do Kafka retries work with request.timeout.ms?

I have configured my producer with request.timeout.ms=70,000 ms and retries=5. I am unsure how this actually works:
after request.timeout.ms=70,000 expires, does it retry 5 times, or does it retry 5 times within the 70,000 ms window, spaced by the retry.backoff.ms value?
There are 3 important configs to be aware of:
"request.timeout.ms" - time to retry a single request
"delivery.timeout.ms" - time to complete the entire send operation
"retries" - how many times to retry when the broker responds with retriable errors.
The Apache Kafka recommendation is to set "delivery.timeout.ms" and leave the other two configurations at their default values. The idea is that the main thing you, as a user, should worry about is how long you are willing to wait for Kafka to figure things out before giving up on it. It doesn't really matter what is taking Kafka so long (the connection, getting metadata, long queues, etc.); the only thing that matters is how long you are willing to wait.
Now to your question: request.timeout.ms applies to each retry. So the producer will send the record batch to Kafka, and if there is no response after 70,000 ms it will consider this attempt a failure and retry. Note that most errors (say, NOT_LEADER_FOR_PARTITION) come back from the broker much faster (which is why retry backoffs are needed).
Reasoning about delivery times with retries + request.timeout.ms turned out to be near impossible even for those who wrote the producer. Hence the introduction of delivery.timeout.ms with a very clear contract.
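To make the contract concrete, a minimal producer sketch (assumptions: the confluent-kafka Python client, since the question does not name a client, and placeholder broker/topic names; the property names follow the standard producer configs):
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'localhost:9092',  # placeholder
    # Each individual attempt may wait this long for a broker response.
    'request.timeout.ms': 70000,
    # Pause between attempts after a retriable error.
    'retry.backoff.ms': 100,
    # How many times a failed attempt may be retried.
    'retries': 5,
    # Upper bound on the whole send (batching + all attempts + backoffs),
    # here roughly 6 attempts x 70s; this is the single knob the Kafka docs
    # recommend tuning.
    'delivery.timeout.ms': 420000,
})

def on_delivery(err, msg):
    # err is set if the record could not be delivered within delivery.timeout.ms.
    print('failed: %s' % err if err else 'delivered to %s' % msg.topic())

producer.produce('some-topic', b'payload', callback=on_delivery)
producer.flush()
So with request.timeout.ms=70,000 each retry gets its own 70-second window, and delivery.timeout.ms caps how long all of them together may take.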

Socket Exception while running load test with Self Provisioned test rig

I am getting a Socket Exception while running a load test on a self-provisioned test rig.
I trigger those load tests on the agent machine (self-provisioned test rig) from my local machine.
Note: for the first 2 to 3 minutes the test iterations pass; after that we get the Socket Exception.
Below is the error message:
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
Below are the stack trace details:
at System.Net.Sockets.Socket.EndConnect(IAsyncResult asyncResult)
at System.Net.ServicePoint.ConnectSocketInternal(Boolean connectFailure, Socket s4, Socket s6, Socket& socket, IPAddress& address, ConnectSocketState state, IAsyncResult asyncResult, Exception& exception)
Run time - 20 min
Sample rate - 10 sec
Warm-up duration - 10 sec
Number of agents used - 2
Load pattern:
Initial load - 10 users
Max user count - 300
Step duration - 10 sec
Step user count - 10
Although I have changed the above values, I still get the exception in the same way.
I am using Visual Studio 2015 Enterprise.
The question states: start with 10 users and add 10 users every 10 seconds up to a maximum of 300. Thus after 29 increments there will be 300 users, which takes 29*10 seconds, i.e. 4m50s. The test will thus (attempt to) run with the maximum load of 300 users for the remaining 15m10s.
Given that all tests pass for the first 2 or 3 minutes, plus the error message, this suggests that you are overloading some part of the network. It might be the agents, it might be the servers, or it might be the connections between them. Some network components have a maximum number of active connections, and 300 users might be too many.
Increasing the load so rapidly means you do not clearly know what the limiting value is. The sampling rate (every 10 seconds) seems high. At each sampling interval a lot of data is transferred (i.e. the sample data), and that can swamp parts of the network. You should look at the network counters for the agents and controller, and also for the servers if available.
I recommend changing the load test steps to add 10 users every 30 seconds, so it takes about 15 minutes to reach 300 users. It may also be worth reducing the sample rate to every 20 seconds.
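For reference, a quick sketch of the ramp-up arithmetic behind those numbers (values are the ones discussed above):
initial_users, max_users, step_users = 10, 300, 10
increments = (max_users - initial_users) // step_users  # 29 steps
for step_duration in (10, 30):
    ramp_up = increments * step_duration
    print("%ds steps: %dm%02ds to reach %d users"
          % (step_duration, ramp_up // 60, ramp_up % 60, max_users))
# 10s steps: 4m50s; 30s steps: 14m30s (roughly the 15 minutes suggested)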

MongoDB-Java performance with rebuilt Sync driver vs Async

I have been testing MongoDB 2.6.7 for the last couple of months using YCSB 0.1.4. I have captured good data comparing SSD to HDD and am producing engineering reports.
After my testing was completed, I wanted to explore the allanbank async driver. When I got it up and running (I am not a developer, so it was a challenge for me), I first wanted to try the rebuilt sync driver. I found performance improvements of 30-100%, depending on the workload, and was very happy with it.
Next, I tried the async driver. I was not able to see much difference between it and my results with the native driver.
The command I'm running is:
./bin/ycsb run mongodb -s -P workloads/workloadb -p mongodb.url=mongodb://192.168.0.13:27017/ycsb -p mongodb.writeConcern=strict -threads 96
Over the course of my testing (mostly with the native driver), I have experimented with more and fewer threads than 96; turned on "noatime"; tried both xfs and ext4; disabled hyperthreading; disabled half my 12 cores; put the journal on a different drive; changed sync from 60 seconds to 1 second; and checked the network bandwidth between the client and server to ensure it's not oversubscribed (10GbE).
Any feedback or suggestions welcome.
The async move exceeded my expectations. My experience is with the Python sync driver (pymongo) and async driver (motor), and the async driver achieved greater than 10x the throughput. Further, motor still uses pymongo under the hood but adds the async ability; that could easily be the case with your allanbank driver.
Often the dramatic changes come from threading policies and OS configurations.
Async needn't and shouldn't use any more threads than cores on the VM or machine. For example, if your server code is spawning a new thread per incoming connection, then all bets are off. Start by looking at the way the driver is being utilized. A 4-core machine should use <= 4 incoming threads.
On the OS level, you may have to fine-tune parameters like net.core.somaxconn, net.core.netdev_max_backlog, sys.fs.file_max and nofile in /etc/security/limits.conf. The best place to start is nginx-related performance guides, including this one; nginx is the server that spearheaded, or at least caught the attention of, many Linux sysadmin enthusiasts. Contrary to popular lore, you should reduce your keepalive timeout rather than lengthen it. The default keep-alive timeout is some absurd number of seconds (4 hours); you might want to cut the cord at 1 minute. Basically, think of a short, sweet relationship with your clients' connections.
Bear in mind that Mongo is not async, so you can use a Mongo driver pool. Nevertheless, don't let the driver get stalled on slow queries; cut it off at 5 to 10 seconds using the Java equivalents of the following settings. I'm just cutting and pasting here with no recommendations.
# Specifies a time limit for a query operation. If the specified time is exceeded, the operation will be aborted and ExecutionTimeout is raised. If max_time_ms is None no limit is applied.
# Raises TypeError if max_time_ms is not an integer or None. Raises InvalidOperation if this Cursor has already been used.
CONN_MAX_TIME_MS = None
# socketTimeoutMS: (integer) How long (in milliseconds) a send or receive on a socket can take before timing out. Defaults to None (no timeout).
CLIENT_SOCKET_TIMEOUT_MS=None
# connectTimeoutMS: (integer) How long (in milliseconds) a connection can take to be opened before timing out. Defaults to 20000.
CLIENT_CONNECT_TIMEOUT_MS=20000
# waitQueueTimeoutMS: (integer) How long (in milliseconds) a thread will wait for a socket from the pool if the pool has no free sockets. Defaults to None (no timeout).
CLIENT_WAIT_QUEUE_TIMEOUT_MS=None
# waitQueueMultiple: (integer) Multiplied by max_pool_size to give the number of threads allowed to wait for a socket at one time. Defaults to None (no waiters).
CLIENT_WAIT_QUEUE_MULTIPLY=None
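For what it's worth, a sketch of how those settings map onto an actual pymongo client (assumptions: the host from the YCSB command above, 5-10 second cut-offs as suggested, and the standard YCSB ycsb.usertable namespace):
from pymongo import MongoClient

client = MongoClient(
    'mongodb://192.168.0.13:27017/',
    socketTimeoutMS=10000,     # cap a single send/receive at 10s
    connectTimeoutMS=5000,     # give up opening a connection after 5s
    waitQueueTimeoutMS=5000,   # don't wait forever for a pooled socket
    maxPoolSize=100,           # driver pool size; tune to the workload
)

# Per-query cap, the equivalent of CONN_MAX_TIME_MS above; a query slower
# than 10s raises pymongo.errors.ExecutionTimeout.
for doc in client.ycsb.usertable.find().max_time_ms(10000):
    pass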
Hopefully you will have the same success. I was ready to bail on Python prior to async.

Redis sync fails. Redis copy keys and values works

I have two Redis instances, both running on the same machine on win64. The version is the one from https://github.com/MSOpenTech/redis with no amendments, and the binaries are running as downloaded from GitHub (i.e. version 2.6.12).
I would like to create a slave and sync it to the master. I am doing this on the same machine to ensure it works before creating a slave on a WAN located machine which will take around an hour to transfer the data that exists in the primary.
However, I get the following error:
[4100] 15 May 18:54:04.620 * Connecting to MASTER...
[4100] 15 May 18:54:04.620 * MASTER <-> SLAVE sync started
[4100] 15 May 18:54:04.620 * Non blocking connect for SYNC fired the event.
[4100] 15 May 18:54:04.620 * Master replied to PING, replication can continue...
[4100] 15 May 18:54:28.364 * MASTER <-> SLAVE sync: receiving 2147483647 bytes from master
[4100] 15 May 18:55:05.772 * MASTER <-> SLAVE sync: Loading DB in memory
[4100] 15 May 18:55:14.508 # Short read or OOM loading DB. Unrecoverable error, aborting now.
The only way I can sync up is via a mini script, something along the lines of:
import orm.model

if __name__ == "__main__":
    src = orm.model.caching.Redis(**{"host": "source_host", "port": 6379})
    dest = orm.model.caching.Redis(**{"host": "source_host", "port": 7777})
    ks = src.handle.keys()
    for i, k in enumerate(ks):
        if i % 1000 == 0:
            print i, "%2.1f %%" % ((i * 100.0) / len(ks))
        dest.handle.set(k, src.handle.get(k))
where orm.model.caching.* are my middleware cache implementation bits (which for redis is just creating a self.handle instance variable).
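For comparison, a sketch of the same copy idea with plain redis-py (an assumption, since the orm.model.caching wrapper isn't shown), using DUMP/RESTORE (available from Redis 2.6) so non-string types and TTLs survive the copy, unlike GET/SET:
import redis

src = redis.Redis(host="source_host", port=6379)
dest = redis.Redis(host="source_host", port=7777)

keys = src.keys()
for i, key in enumerate(keys):
    if i % 1000 == 0:
        print("%d  %2.1f %%" % (i, (i * 100.0) / len(keys)))
    payload = src.dump(key)
    if payload is None:  # key expired between KEYS and DUMP
        continue
    ttl = src.pttl(key)  # PTTL returns -1 when no expiry is set
    dest.restore(key, ttl if ttl > 0 else 0, payload)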
Firstly, I am very suspicious of the number in the receiving bytes, as that is 2^31-1: a very strange coincidence. Secondly, OOM can mean out of memory, yet I can fire up a second process and sync it via the script, while doing this via redis --slaveof fails with what appears to be out of memory. Surely this can't be right?
redis-check-dump does not run as this is the windows implementation.
Unfortunately there is sensitive data in the keys I am syncing so I can't offer it to anybody to investigate. Sorry about that.
I am definitely running the 64 bit version as it states this upon startup in the header.
I don't mind syncing via my mini script and then just enabling slave mode, but I don't think that is possible, as the moment slaveof is executed it drops all known data and resyncs from scratch (and then fails).
Any ideas?
I have also seen this error earlier, but the latest bits from 2.8.4 seem to have resolved it: https://github.com/MSOpenTech/redis/tree/2.8.4_msopen
