Avoid polling and long running lambda task - aws-lambda

I have a CloudWatch log group that monitors the number of times my application is started. I wrote a Python Lambda function that should run once every 24 hours and retrieve these logs using boto3. To do this, I start a query and then poll the get_query_results method until it has finished. However, this leaves me with a pretty bad implementation that uses a lot of resources.
Is there a better way to perhaps use some kind of callback to do this?
These are the two functions I'm using
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#CloudWatchLogs.Client.start_query
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#CloudWatchLogs.Client.get_query_results
import time

query_id = cloudwatch_connector.start_query('fields @timestamp, @message | filter @message like /APPSTART/').get('queryId')
# Wait until it is completed
running = True
while running:
    response = cloudwatch_connector.get_query_results(query_id)
    status = response.get('status')
    if status == 'Complete':
        print("Done gathering logs")
        print(response)
        running = False
    if status == 'Failed' or status == 'Cancelled':
        raise Exception('Request either failed or was cancelled')
    time.sleep(1)  # time.sleep takes seconds, not milliseconds
What I really want to do is run an Insights query and save the result to a text file every night at 00:00.
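For reference, a minimal sketch of a bounded version of the polling loop, assuming a plain boto3 `logs` client is passed in. The helper name `wait_for_query`, its parameters, and the 10-second backoff cap are hypothetical; only `get_query_results(queryId=...)` and its `status` field come from the boto3 docs linked above. It does not remove polling, it only caps how long the Lambda waits and spaces out the calls:

```python
import time


def wait_for_query(logs_client, query_id, poll_seconds=1.0, timeout_seconds=60.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_query_results until the query reaches a terminal status.

    logs_client is assumed to behave like boto3.client("logs").
    sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_seconds
    delay = poll_seconds
    while True:
        response = logs_client.get_query_results(queryId=query_id)
        status = response.get("status")
        if status == "Complete":
            return response
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"Query {query_id} ended with status {status}")
        if clock() + delay > deadline:
            raise TimeoutError(f"Query {query_id} still {status} after {timeout_seconds}s")
        sleep(delay)
        delay = min(delay * 2, 10.0)  # exponential backoff, capped at 10 seconds (arbitrary)
```

In a Lambda handler this could be called as `wait_for_query(boto3.client("logs"), query_id)`, with the function itself triggered by a schedule for the nightly run.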

Related

SQS task going in DLQ despite being successful in Lambda + also when deleted manually

I have built my own application around AWS Lambda and Salesforce.
I have around 10 users using my internal app, so we are not talking about heavy usage.
On a normal day, around 500-1000 SQS tasks are processed, and a single task can take around 1-60 seconds depending on its complexity.
This is working perfectly.
Timeout for my lambda is 900.
BatchSize = 1
Using Python 3.8
I've created a decorator that lets me route through SQS those functions that need to be processed asynchronously with FIFO logic.
Everything is working well.
My Lambda function doesn't return anything at the end, but it completes successfully (standard scenario). However, I have noticed that some tasks were going into my DLQ (I only allow processing once; if a message is presented again, it goes into the DLQ immediately).
What I don't get is why this is happening.
Lambda ends with success --> normally the task should be deleted from the initial SQS queue.
So I've added a manual deletion of the processed task at the very end of the function. I've logged the result returned by boto3.client.delete_message and I get a 200 status, so everything is OK... However, once in a while (1 out of 100, so about 10 times per day in my case) I can see the task going into the DLQ...
Reprocessing the same task in my standard queue without changing anything... it gets processed successfully (again) and deleted (as initially expected).
What troubles me most is that even after deleting the message, it still sometimes ends up in the DLQ. What could be the problem?
Example of my async processor
import json
import sys

import boto3

# settings and utils are the application's own modules


def process_data(event, context):
    """
    By convention, we need to store in the table AsyncTaskQueueName a dict with the following parameters:
    - python_module: use to determine the location of the method to call asynchronously
    - python_function: use to determine the location of the method to call asynchronously
    - uuid: uuid to get the params stored in dynamodb
    """
    print('Start Processing Async')
    client = boto3.client('sqs')
    queue_url = client.get_queue_url(QueueName=settings.AsyncTaskQueueName)['QueueUrl']
    # batch size = 1 so only record 1 to process
    for record in event['Records']:
        try:
            kwargs = json.loads(record['body'])
            print(f'Start Processing Async Data Record:\n{kwargs}')
            python_module = kwargs['python_module']
            python_function = kwargs['python_function']
            # CALLING THE FUNCTION WE WANTED ASYNC, AND DOING ITS STUFF... (WORKING OK)
            getattr(sys.modules[python_module], python_function)(uuid=kwargs['uuid'], is_in_async_processing=True)
            print('End Processing Async Data Record')
            res = client.delete_message(QueueUrl=queue_url, ReceiptHandle=record['receiptHandle'])
            # When the problem I'm monitoring occurs, it goes up to the next line, with res status = 200 !!
            # That's where I'm losing my mind. I can confirm the uuid in the DLQ being the same as in the queue,
            # so we are definitely talking of the same message which has been moved to the DLQ.
            print(f'End Deleting Async Data Record with status: {res}')
        except Exception:
            # set expire to 0 so that the task goes into DLQ
            client.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=record['receiptHandle'],
                VisibilityTimeout=0
            )
            utils.raise_exception(f'There was a problem during async processing. Event:\n'
                                  f'{json.dumps(event, indent=4, default=utils.jsonize_datetime)}')
Example of today's bug with logs from CloudWatch:
Initial event:
{'Records': [{'messageId': '75587372-256a-47d4-905b-62e1b42e2dad', 'receiptHandle': 'YYYYYY", "python_module": "quote.processing", "python_function": "compute_price_data"}', 'attributes': {'ApproximateReceiveCount': '1', 'SentTimestamp': '1621432888344', 'SequenceNumber': '18861830893125615872', 'MessageGroupId': 'compute_price_data', 'SenderId': 'XXXXX:main-app-production-main', 'MessageDeduplicationId': 'b4de6096-b8aa-11eb-9d50-5330640b1ec1', 'ApproximateFirstReceiveTimestamp': '1621432888344'}, 'messageAttributes': {}, 'md5OfBody': '5a67d0ed88898b7b71643ebba975e708', 'eventSource': 'aws:sqs', 'eventSourceARN': 'arn:aws:sqs:eu-west-3:XXXXX:async_task-production.fifo', 'awsRegion': 'eu-west-3'}]}
Res (after calling delete_message):
End Deleting Async Data Record with status: {'ResponseMetadata': {'RequestId': '7738ffe7-0adb-5812-8701-a6f8161cf411', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '7738ffe7-0adb-5812-8701-a6f8161cf411', 'date': 'Wed, 19 May 2021 14:02:47 GMT', 'content-type': 'text/xml', 'content-length': '215'}, 'RetryAttempts': 0}}
BUT... 75587372-256a-47d4-905b-62e1b42e2dad is in the DLQ after this delete_message. I'm going crazy.
OK, the problem was that the timeout was set to 900 in my serverless.yml, but not in AWS. I may have changed it manually to 1 min there, so my long tasks were released after 1 min and then went immediately to the DLQ.
Hence the deletion not doing anything: the task was already in the DLQ by the time the deletion was made.
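To catch this kind of drift earlier, here is a rough sketch of a check that compares the deployed settings against what you expect. The function name `check_timeouts` and its parameters are made up for illustration; it assumes boto3 `lambda` and `sqs` clients are passed in and relies only on `get_function_configuration` and `get_queue_attributes`:

```python
def check_timeouts(lambda_client, sqs_client, function_name, queue_url, expected_timeout):
    """Return a list of problems; an empty list means the deployed settings look consistent.

    lambda_client / sqs_client are assumed to behave like boto3.client("lambda")
    and boto3.client("sqs"). expected_timeout is the value you believe is deployed
    (e.g. the 900 from serverless.yml).
    """
    deployed = lambda_client.get_function_configuration(FunctionName=function_name)["Timeout"]
    attrs = sqs_client.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["VisibilityTimeout"]
    )["Attributes"]
    visibility = int(attrs["VisibilityTimeout"])  # SQS returns attribute values as strings

    problems = []
    if deployed != expected_timeout:
        problems.append(
            f"Lambda timeout is {deployed}s in AWS but {expected_timeout}s was expected"
        )
    if visibility < deployed:
        # The message can become visible again while the function is still working on it.
        problems.append(
            f"Queue visibility timeout ({visibility}s) is shorter than the Lambda timeout ({deployed}s)"
        )
    return problems
```

Running something like this after each deploy would have flagged the 1 min vs 900 mismatch described above.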

How to performance test workflow execution?

I have 2 APIs:
1. Create a workflow (HTTP POST request)
2. Check workflow status (HTTP GET request)
I want to performance test how much time a workflow takes to complete.
Tried two ways:
Option 1: Created a Java test that triggers the workflow create API and then polls the status API to check whether the status turns to CREATED. I measure the time taken by this process, which gives me the performance results.
Option 2: Used Gatling to do the same
val createWorkflow = http("create").post("").body(ElFileBody("src/main/resources/weather.json")).asJson.check(status.is(200))
  .check(jsonPath("$.id").saveAs("id"))
val statusWorkflow = http("status").get("/${id}")
  .check(jsonPath("$.status").saveAs("status")).asJson.check(status.is(200))
val scn = scenario("CREATING")
  .exec(createWorkflow)
  .repeat(20){exec(statusWorkflow)}
The Gatling one didn't really work (or I am doing it the wrong way). Is there a way in Gatling to merge multiple requests and do something similar to Option 1?
Is there some other tool that can help me performance test such scenarios?
I think something like the below should work when using Gatling's tryMax:
.tryMax(100) {
  pause(1)
    .exec(http("status").get("/${id}")
      // tryMax only retries when a check fails, so fail the check until the status is CREATED
      .check(jsonPath("$.status").is("CREATED")).asJson.check(status.is(200))
    )
}
Note: I didn't try this out locally. More information about tryMax:
https://medium.com/@vcomposieux/load-testing-gatling-tips-tricks-47e829e5d449 (Polling: waiting for an asynchronous task)
https://gatling.io/docs/current/advanced_tutorial/#step-05-check-and-failure-management
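If a plain script is enough, the Option 1 approach (create, poll, measure) can be sketched outside Gatling. This is only an illustration in Python: `time_workflow`, `create`, and `get_status` are hypothetical names, and the two callables stand in for whatever HTTP client calls hit the POST and GET endpoints:

```python
import time


def time_workflow(create, get_status, done_status="CREATED", poll_seconds=1.0,
                  max_polls=100, sleep=time.sleep, clock=time.perf_counter):
    """Create a workflow, poll its status, and return the elapsed seconds.

    create() is assumed to return the workflow id; get_status(workflow_id) is
    assumed to return the status string from the status API.
    """
    start = clock()
    workflow_id = create()
    for _ in range(max_polls):
        if get_status(workflow_id) == done_status:
            return clock() - start
        sleep(poll_seconds)
    raise TimeoutError(f"Workflow {workflow_id} did not reach {done_status} in {max_polls} polls")
```

Note that the measured time is only accurate to within one poll interval, so `poll_seconds` should be small compared to the expected workflow duration.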

Get status of a task Elasticsearch for a long running update query

Assume I have a long-running update query where I am updating ~200k to 500k documents, perhaps even more. Why I need to update so many documents is beyond the scope of the question.
Since the client times out (I use the official ES python client), I would like to have a way to check what the status of the bulk update request is, without having to use enormous timeout values.
For a short request, the response of the request can be used. Is there a way I can get the response of a long-running request as well, or can I assign a name or id to a request so as to reference it later?
For a request which is running, I can use the tasks API to get the information.
But for other statuses (completed / failed), how do I get it?
If I try to access a task which has already completed, I get resource not found.
P.S. I am using update_by_query for the update
With the task id you can look up the task directly:
GET /_tasks/taskId:1
The advantage of this API is that it integrates with
wait_for_completion=false to transparently return the status of
completed tasks. If the task is completed and
wait_for_completion=false was set on it then it’ll come back with a
results or an error field. The cost of this feature is the document
that wait_for_completion=false creates at .tasks/task/${taskId}. It is
up to you to delete that document.
From here https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update-by-query.html#docs-update-by-query-task-api
My use case went like this, I needed to do an update_by_query and I used painless as the script language. At first I did a reindex (when testing). Then I tried using the update_by_query functionality (they resemble each other a lot). I did a request to the task api (the operation hasn't finished of course) and I saw the task being executed. When it finished I did a query and the data of the fields that I was manipulating had disappeared. The script worked since I used the same script for the reindex api and everything went as it should have. I didn't investigate further because of lack of time, but... yeah, test thoroughly...
I find GET /_tasks/taskId:1 confusing. It should be
GET http://localhost:9200/_tasks/taskId
A taskId looks something like this NCvmGYS-RsW2X8JxEYumgA:1204320.
Here is my simple explanation of this topic.
To check a task, you need to know its taskId.
A task id is a string that consists of a node_id, a colon, and a task_sequence_number. An example is taskId = NCvmGYS-RsW2X8JxEYumgA:1204320, where node_id = NCvmGYS-RsW2X8JxEYumgA and task_sequence_number = 1204320. Some people, including myself, thought taskId = 1204320, but that's not how the Elasticsearch developers define it at the moment.
A taskId can be found in two ways:
1. wait_for_completion=false. When you send a request to ES with this parameter, the response will be {"task" : "NCvmGYS-RsW2X8JxEYumgA:1204320"}. Then you can check the status of that task like this: GET http://localhost:9200/_tasks/NCvmGYS-RsW2X8JxEYumgA:1204320
2. GET http://localhost:9200/_tasks?detailed=false&actions=*/delete/byquery. This example returns the status of all tasks with action = delete_by_query. If you know there is only one such task running on ES, you can find your taskId in the response listing the running tasks.
After you know the taskId, you can get the status of a task with this.
GET /_tasks/taskId
Note that you can only check the status of a task while it is running, or if it was started with wait_for_completion=false.
In more detail: wait_for_completion is true by default. Based on my understanding, tasks with wait_for_completion=true are "in-memory" only. You can still check the status of such a task while it's running, but it's completely gone after it is completed/canceled, meaning that checking the status will return a 'resource_not_found_exception'. Tasks with wait_for_completion=false are stored in the ES system index .tasks. You can still check the status after the task finishes. However, you might want to delete the task document from the .tasks index after you are done with it to save some space. The deletion request looks like this
DELETE http://localhost:9200/.tasks/task/NCvmGYS-RsW2X8JxEYumgA:1204320
You will receive a resource_not_found_exception if a taskId is not present (for example, if you delete the same task twice, or you try to delete an in-memory task, i.e. one started with wait_for_completion=true).
About this confusing taskId thing, I made a pull request https://github.com/elastic/elasticsearch/pull/31122 to help clarify the Elasticsearch documentation. Unfortunately, they rejected it. Ugh.
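To make the node_id:task_sequence_number format concrete, here is a small sketch that splits and validates a task id before building the status path. `TaskId`, `parse_task_id`, and `task_status_path` are names invented for this example and are not part of any Elasticsearch client:

```python
from typing import NamedTuple


class TaskId(NamedTuple):
    node_id: str
    sequence_number: int


def parse_task_id(task_id: str) -> TaskId:
    """Split 'node_id:sequence_number' and reject a bare sequence number."""
    node_id, sep, seq = task_id.rpartition(":")
    if not sep or not node_id or not seq.isdigit():
        raise ValueError(f"Expected 'node_id:sequence_number', got {task_id!r}")
    return TaskId(node_id, int(seq))


def task_status_path(task_id: str) -> str:
    """Return the request path for GET /_tasks/<taskId> after validating the id."""
    parse_task_id(task_id)
    return f"/_tasks/{task_id}"
```

Passing only `1204320` raises a ValueError here, which is exactly the mistake described above.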

Dynamically loading new jobs in a SidekiqStatus container to monitor completion

I built a small web crawler implemented as two Sidekiq workers: Crawler and Parsing. The Crawler worker looks for links while the Parsing worker reads the page body.
I want to trigger an alert when the crawling/parsing of all pages is complete. Monitoring only the Crawler job is not the best solution since it may have finished but there might be several Parser jobs running.
Having a look at the sidekiq-status gem, it seems that I cannot dynamically add new jobs to the container for monitoring. E.g. it would be nice to have an "add" method in the following context:
@container = SidekiqStatus::Container.new
# ... for each page url found:
jid = ParserWorker.perform_async(page_url)
@container.add(jid)
The closest to this is to use "SidekiqStatus::Container.load" or "SidekiqStatus::Container.load_multi" however, it is not possible to add new jobs in the container a posteriori.
One solution would be to create as many SidekiqStatus::Container instances as the number of ParserJobs and check if all of them have status == "finished", but I wonder if a more elegant solution exists using these tools.
Any help is appreciated.
You are describing Sidekiq Pro's Batches feature exactly. You can spend a lot of time or some money to solve your problem.
https://github.com/mperham/sidekiq/wiki/Batches
OK, here's a simple solution. Using the sidekiq-status gem, the Crawler worker keeps track of the job IDs of the Parser jobs and waits as long as any Parser job is still busy (using the SidekiqStatus::Container instances to check job status).
def perform()
  # for each page....
  @jids << ParserWorker.perform_async(page_url)
  # end
  # crawler finished, parsers may still be running
  while parsers_busy?
    sleep 5 # wait 5 secs between each check
  end
  # all parsers complete, trigger notification...
end

def parsers_busy?
  status_containers = SidekiqStatus::Container.load_multi(@jids)
  for container in status_containers
    if container.status == 'waiting' || container.status == 'working'
      return true
    end
  end
  return false
end

Job with multiple tasks on different servers

I need to have a Job with multiple tasks, being run on different machines, one after another (not simultaneously), and while the current job is running, another same job can arrive to the queue, but should not be started until the previous one has finished. So I came up with this 'solution' which might not be the best but it gets the job done :). I just have one problem.
I figured out I would need a JobQueue (either MongoDb or Redis) with the following structure:
{
  hostname: 'host where to execute the task',
  running: FALSE,
  task: 'current task number',
  tasks: [
    {task_id: 1, commands: 'run these commands', hostname: 'aaa'},
    {task_id: 2, commands: 'another command', hostname: 'bbb'}
  ]
}
Hosts:
search for the jobs with same hostname, and running==FALSE
execute the task that is set in that job
upon finish, host sets running=FALSE, checks if there are any other tasks to perform and increases task number + sets the hostname to the next machine from the next task
Because jobs can accumulate, imagine a situation where jobs are queued for one host like this: A, B, A.
Since I have to run all the jobs for the specified machine, how do I avoid starting the 3rd job (A again) while the first A is still running?
{
  _id: ObjectId("xxxx"), // unique, generated by MongoDB, indexed, sortable
  hostname: 'host where to execute the task',
  running: FALSE,
  task: 'current task number',
  tasks: [
    {task_id: 1, commands: 'run these commands', hostname: 'aaa'},
    {task_id: 2, commands: 'another command', hostname: 'bbb'}
  ]
}
The question is how would the next available "worker" know whether it's safe for it to start the next job on a particular host.
You probably need some sort of sortable (indexed) field to indicate the arrival order of the jobs. If you are using MongoDB, you can let it generate _id, which will already be unique, indexed and in time order, since its first four bytes are a timestamp.
You can now query to see if there is a job to run for a particular host like so:
// pseudo code - shell syntax, not actual code
var jobToRun = db.queue.findOne({hostname:<myHostName>},{},{sort:{_id:1}});
if (jobToRun.running == FALSE) {
    myJob = db.queue.findAndModify({query:{_id:jobToRun._id, running:FALSE},update:{$set:{running:TRUE}}});
    if (myJob == null) print("Someone else already grabbed it");
    else {
        /* now we know that we updated this and we can run it */
    }
} else { /* sleep and try again */ }
What this does is check for the oldest/earliest job for a specific host. It then looks to see if that job is running. If yes, do nothing (sleep and try again?); otherwise try to "lock" it by doing findAndModify on _id and running FALSE, setting running to TRUE. If that document is returned, it means this process succeeded with the update and can now start the work. Since two threads can both be trying to do this at the same time, getting back null means that the document was already changed to running by another thread, so we wait and start again.
I would advise using a timestamp somewhere to indicate when a job started "running" so that if a worker dies without completing a task it can be "found" - otherwise it will be "blocking" all the jobs behind it for the same host.
What I described works for a queue where you would remove the job when it was finished rather than setting running back to FALSE - if you set running to FALSE so that other "tasks" can be done, then you will probably also be updating the tasks array to indicate what's been done.
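The same claim logic, including the "started running" timestamp suggested above, could look roughly like this in Python. This is a sketch, not tested against a real server: `queue` is assumed to be a pymongo collection (only `find_one` and `find_one_and_update` are used), and the function name `claim_next_job` and the field name `started_at` are made up for the example:

```python
import datetime


def claim_next_job(queue, hostname, now=None):
    """Try to lock the oldest job for hostname.

    Returns the job document as it was before the update, or None if there is
    nothing to start (no job queued, the oldest job is still running, or
    another worker grabbed it first).
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    oldest = queue.find_one({"hostname": hostname}, sort=[("_id", 1)])
    if oldest is None or oldest["running"]:
        return None
    # Atomic compare-and-set: matches only if nobody set running=True in the meantime.
    return queue.find_one_and_update(
        {"_id": oldest["_id"], "running": False},
        {"$set": {"running": True, "started_at": now}},  # started_at helps find dead workers
    )
```

A separate cleanup job could then look for documents with running set and an old `started_at` to detect workers that died mid-task.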
