AWS - Prisma - Python Boto3 Enhanced auto-remediation - aws-lambda

I use Prisma Enhanced Remediation (https://github.com/PaloAltoNetworks/Prisma-Enhanced-Remediation#getting-started) to auto-remediate alerts detected in Prisma.
Some alerts cannot be remediated, whether because of missing permissions, errors, a deficiency in the runbook, or something else, and these alerts keep triggering their associated runbooks in Lambda.
This happens when the alert cannot be fixed the first time it is triggered, either because of missing permissions or because the runbook runs without error but does not actually fix the issue. The Lambda function (runbook) is then triggered every 30 minutes (which seems to be related to the SQS visibility timeout) for some period of time (which seems to be related to the SQS message retention period), regardless of whether the issue has since been fixed manually or by an improved runbook.
When an alert comes in for the first time and is fixed immediately, the repeated triggering described above does not occur.
I suspect that in the second scenario the runbook returns something that allows the alert to be removed from the queue. How can I handle the first scenario?
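For context, the pattern described above matches how an SQS event source usually behaves: a message whose Lambda invocation fails becomes visible again after the visibility timeout and is redelivered until the retention period expires. One common way to bound the retries is a dead-letter queue with a maxReceiveCount. The following is a minimal sketch under that assumption, not something taken from the Prisma project; the queue URL, the dead-letter queue ARN and the helper names are hypothetical placeholders.

```python
import json


def build_redrive_attributes(dead_letter_queue_arn, max_receive_count=3):
    """Build the SQS queue attributes that cap redelivery attempts.

    After max_receive_count failed receives, SQS moves the message to the
    dead-letter queue instead of making it visible again.
    """
    redrive_policy = {
        "deadLetterTargetArn": dead_letter_queue_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the redrive policy as a JSON-encoded string.
    return {"RedrivePolicy": json.dumps(redrive_policy)}


def apply_redrive_policy(queue_url, dead_letter_queue_arn, max_receive_count=3):
    """Attach the redrive policy to the alert queue (requires AWS credentials)."""
    import boto3  # imported lazily so the helper above works without AWS access

    sqs = boto3.client("sqs")
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes=build_redrive_attributes(dead_letter_queue_arn, max_receive_count),
    )
```

With such a policy in place, an alert that can never be remediated would end up in the dead-letter queue after a few attempts, where it can be inspected, rather than re-triggering the runbook until the retention period runs out.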

Related

Operation times out with observable subscription

We have a subscription to an RxJS Observable that's obtained from the Sanity JavaScript client's listen method.
This works fine except that every now and then we get an error "The operation timed out". I haven't been able to pinpoint exactly when and where this arises but I suspect it happens after a certain timeout without the subscription receiving any message. This does not, however, indicate any issue in our case.
I'm not well versed in observables; is there something basic I'm missing, or has anyone had a similar issue?
Listeners are currently automatically closed after 5 minutes. This might be what you're encountering.
It's actually a regression that we discovered recently; listeners are supposed to time out only after 30 minutes. We are expecting a fix for it this week. Edit: The fix has now been released.
It's important for a client to be resilient against any kind of error, though. On the Internet, network timeouts and other glitches are of course very common and must be handled appropriately. Eventually the listener will close itself, as this is the intended behaviour.
(I'm a developer at Sanity.)

Custom resource not running properly on deployment

For over two days, I've been trying to deploy a CloudFormation stack using the Serverless Framework. As part of the stack, I have an RDS cluster as well as a custom resource which relies on a Lambda function (written in Python) for initializing some database tables.
The details of this custom resource in the serverless.yml file are the following:
rdsMigration:
  Type: Custom::DatabaseMigration
  DependsOn: rdsCluster
  Properties:
    ServiceToken: !GetAtt MigrateDatabaseLambdaFunction.Arn
    Version: 1.0
When deploying using sls deploy, the cluster and the lambda functions are created correctly, but the process is stuck on creating the rdsMigration resource.
In the Lambda code, I've been careful to generate the response in all possible scenarios, including exceptions. However, that does not seem to be the problem.
Apparently, the function is not being invoked... kind of, because even the charts look weird:
You can see how there are no invocations, but there is a red dot in "Error count and success rate" at about 5:15 PM, which is the time at which the resource creation started. Also, there are no green dots, and you can see the warning down in the legend, which claims that "One or more data-points have been dropped due to non-numeric values (NaN, -Infinite, +Infinite)". How is this possible? I assume this is not standard behavior, since other Lambda functions (which are called through an API Gateway endpoint) do not show this strange chart.
Also, there are no log streams in CloudWatch. It is completely empty, as if the function was never invoked (which seems the case, except for the strange "red dot" at the moment of resource creation).
Finally, if I run a test case using the "AWS CloudFormation Create Request" template, the function runs properly: it creates the initial tables I expected for the DB (not always, but that is a different matter) and returns the response.
Do you have any idea of what is going on here? The worst part is that I need to wait two hours between tests, since the CFN stack gets stuck during the creation and destruction steps until the timeout occurs.
Thanks!
The issue is with your Lambda function. You have to send a SUCCESS or FAILED signal back to CloudFormation. Since your Lambda function is not sending any signal, CloudFormation waits until the timeout (2 hours) and then the stack operation fails.
1. The custom resource provider processes the AWS CloudFormation request and returns a response of SUCCESS or FAILED to the pre-signed URL. AWS CloudFormation waits and listens for a response in the pre-signed URL location.
2. After getting a SUCCESS response, AWS CloudFormation proceeds with the stack operation. If a FAILED response or no response is returned, the operation fails.
Please use the cfnresponse module in your Lambda function to send the SUCCESS/FAILED signal back to CloudFormation.
For more details:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-lambda-function-code-cfnresponsemodule.html
I finally managed to find a solution to the issue, although it does not explain the strange behavior with the charts that I described in the question.
My problem was similar to what Abhinaya suggested in her response. The Lambda function was not sending the signal properly because of a programming error. Essentially, I took the code from the documentation (the one for Python 3, the second fragment from the end) and apparently I mistakenly removed the line for retrieving the ResponseURL. Of course, that was failing.
A side comment about this: be careful when using Python's cfnresponse library or even the code snippet from the documentation I linked. It relies on botocore.vendored, which was deprecated and no longer exists in the latest botocore releases. Therefore, it will fail if your code relies on new versions of this library (as in my case). A simple solution is to replace botocore.vendored.requests with the requests library.
Still, there is some strange behavior that I cannot understand. On creation, the Lambda function is not recording anything to CloudWatch, and there is the strange behavior in the charts that I described in my question. However, this only happens on creation. If the function is manually invoked, or is invoked as part of the delete process (when removing the CFN stack), then it does write to CloudWatch. Therefore, the problem apparently only occurs on the first invocation.
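As an illustration of sending the signal without botocore.vendored, here is a minimal sketch that uses only the Python standard library (urllib.request) instead of requests. The response fields follow the custom resource response format; run_migration is a hypothetical placeholder for the table-creation logic, and the helper names are my own, not part of cfnresponse.

```python
import json
import urllib.request


def build_response(event, log_stream_name, status, data=None, reason=None):
    """Build the JSON body CloudFormation expects from a custom resource."""
    return {
        "Status": status,  # must be "SUCCESS" or "FAILED"
        "Reason": reason or f"See CloudWatch log stream: {log_stream_name}",
        "PhysicalResourceId": event.get("PhysicalResourceId", log_stream_name),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }


def send_response(event, body):
    """PUT the response body to the pre-signed URL supplied in the event."""
    payload = json.dumps(body).encode("utf-8")
    request = urllib.request.Request(
        event["ResponseURL"],  # the line that must not be dropped
        data=payload,
        method="PUT",
        headers={"Content-Type": ""},  # empty, as in the cfnresponse module
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def run_migration(event):
    """Hypothetical placeholder for the database initialization."""
    return {}


def handler(event, context):
    status, data, reason = "SUCCESS", {}, None
    try:
        if event["RequestType"] in ("Create", "Update"):
            data = run_migration(event)
    except Exception as exc:  # report every failure instead of timing out
        status, reason = "FAILED", str(exc)
    body = build_response(event, context.log_stream_name, status, data, reason)
    return send_response(event, body)
```

Keeping the body construction separate from the HTTP call makes it possible to check the response format locally, without waiting for a stack operation to time out.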
Best.

Stackdriver monitoring - Metric Absent

I am trying to monitor my instance and get an alert when it shuts down. For this I have configured an alert policy in Stackdriver as below:
Metric Absence Condition
Violates when: CPU Usage (GCE Monitoring) is absent for greater than 5 minutes
It worked only the first time and then never created an incident for any of the stopped instances.
What am I missing here?
Expected Behavior
@Jyotsna, I investigated this issue a bit more and was able to confirm that this is currently expected behavior: alert conditions aren't triggered by inactive instances, which explains why you won't get any alerts when the VM instance is not reporting any CPU metrics. However, there is a feature request in progress to change this behavior.
Known Issue
There's also a known issue that seems to prevent logs from being sent to Stackdriver on subsequent violations of the policy, even after the offending VM is back online. This explains why:
It worked only the first time and then never created an incident for any of the stopped instances.
Hence alerting on stopped instances won't work properly until the issue is fixed. Unfortunately, there's no ETA for this, but it will eventually be addressed.

Dynamics CRM Field Service: Unable to approve time off requests

I'm looking at the Field Service module of Dynamics 365. I'm trying to block out an employee's time on the schedule board by creating a time off request.
I can create the time off request, but as soon as the object is saved, the system automatically deactivates it.
The system reports success when I approve a time off request, but I can't see any changes in the data or any records created in the audit summary. If I try to activate a time off request, the process fails with a Business Process Error:
Microsoft.Xrm.Sdk.InvalidPluginExecutionException: Time off request records can't be reactivated.
To the best of my knowledge, there aren't any process changes to time off requests (but I'm unsure how to confirm this). From everything that I've read, this should be a fairly straightforward process, so I'm not sure where to look next.
This page from the documentation is a good example of what I'm trying to do. It's failing on step 3 of "Approve a time-off request".
I've tried creating time off requests:
in the past
for tomorrow and for more than 2 weeks in the future
with durations from 2 hours to 2 weeks
for various user accounts
The time off requests are not conflicting with booked resources.
Any advice on what I could look into to determine if someone modified any processes / workflows associated with time off requests? Or is there something that I'm not doing that I should be?
I've learned that Microsoft's documentation is not complete and there was a bug.
Additional info on how Time Off Requests are used
There are two views of Time Off Requests (TOR) available to managers: Active and Inactive.
Active TORs: Lists TORs that a manager needs to approve
Inactive TORs: Lists TORs that have already been approved (i.e., the request itself has been dealt with)
Bookable Resources have a Time Off Approval Required property. When true, TORs created for the user are Active; when false, TORs created for the users are automatically moved to Inactive.
All Inactive TORs should appear as grayed-out boxes on the Schedule Board. If you attempt to Activate an Inactive TOR, the following error will correctly be reported:
Microsoft.Xrm.Sdk.InvalidPluginExecutionException: Time off request records can't be reactivated.
Field Service Bug
Additionally, we experienced a bug that prevented Inactive TORs from being grayed out on the Schedule Board. I'm not sure if this was a process error or a client-side style issue.
We observed the bug in Field Service 6.1.0.1462. Upgrading to 6.2.1.38 resolved the issue and allowed Inactive TORs to show up on the Schedule Board.

What exactly happens when I change number of Azure role instances?

I observe the following weird behavior. I have an Azure web role deployed to the live Azure cloud. I click "Configure" in the Azure Management Portal and change the number of instances, and the portal shows some "activity". I then open the browser, navigate to the URL assigned to my deployment, and start refreshing the page about once every two seconds. The page reloads fine many times, then for some time it stops reloading and the requests are rejected; after about half a minute, requests are handled normally again.
What is happening? Is the web server temporarily stopped? How do I change number of instances so that HTTP requests to the role are handled at all times?
When you change the configuration file, your current instance might be restarted. This might be why your website didn't respond for about 30 seconds.
Please have a look at http://msdn.microsoft.com/en-us/library/microsoft.windowsazure.serviceruntime.roleenvironment.changing.aspx and check whether it's because of the role restarting.
What you are doing is manual. Have you looked at the SDK for autoscaling Azure?
http://channel9.msdn.com/posts/Autoscaling-Windows-Azure-applications
Check out the demo at the 18-minute mark. It doesn't answer your question directly, but it's a much more configurable/dynamic way of scaling Azure.
Azure updates your roles one update domain at a time, so in theory you should see no downtime when updating the config (provided you have at least two instances). However, if you refresh the browser every couple of seconds, it's possible that your requests always go to the same instance due to keep-alive.
It would be interesting to know what the behavior is if you disable keep-alives for your web role. Note that this will have a performance impact, so you'll probably want to re-enable keep-alives after the exercise.
