run a shell command if emr activity fails in aws data pipelines - amazon-data-pipeline

In AWS Data Pipeline, how can one run a shell command ONLY if a certain activity, such as an EMR activity, fails? I can see the "onFail" option, but that only runs an Amazon action as defined here: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-actions.html
Help would be greatly appreciated. Thank you!

Running an activity on failure of another activity is not supported. You could, however, issue an SNS notification on failure, and invoke a Lambda function from that SNS notification. See here: http://docs.aws.amazon.com/sns/latest/dg/sns-lambda.html
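As a rough illustration of that wiring (topic, function names, and ARNs below are placeholders, not anything defined in your pipeline), the AWS CLI can create the SNS topic referenced by the pipeline's onFail SnsAlarm action and hook a Lambda function up to it:

# 1. Create the SNS topic that the pipeline's SnsAlarm (onFail) action will publish to.
aws sns create-topic --name pipeline-failure-topic

# 2. Allow SNS to invoke the Lambda function that reacts to the failure.
aws lambda add-permission \
  --function-name on-pipeline-failure \
  --statement-id sns-invoke \
  --action lambda:InvokeFunction \
  --principal sns.amazonaws.com \
  --source-arn arn:aws:sns:us-east-1:123456789012:pipeline-failure-topic

# 3. Subscribe the Lambda function to the topic.
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:pipeline-failure-topic \
  --protocol lambda \
  --notification-endpoint arn:aws:lambda:us-east-1:123456789012:function:on-pipeline-failure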

Related

Consistently deploying Cloudfunction with PubSub and Google Scheduler

I am trying to automate the deployment of three modules: a Cloud Function which is invoked via a Pub/Sub subscription from Cloud Scheduler. Currently I have the following script, which uses the gcloud command:
gcloud beta pubsub topics create $SCHEDULE_NAME || echo "Topic $SCHEDULE_NAME already created."
gcloud beta functions deploy $SCHEDULE_NAME \
  --region $CLOUD_REGION \
  --memory 128MB \
  --runtime nodejs10 \
  --entry-point $ENTRY_POINT \
  --trigger-topic $SCHEDULE_NAME \
  --vpc-connector cloud-function-connector
# gcloud scheduler jobs delete $JOB_NAME # does not work as it needs YES non-interactively
gcloud scheduler jobs create pubsub $SCHEDULE_NAME --message-body='RUN' --topic=$SCHEDULE_NAME --schedule='27 2 * * *' --time-zone='Europe/London' || true
This works, however I am not sure whether this is the most correct way to do it. For instance, there is no way to simply update the job if it already exists. I was considering Terraform, but I am not sure it is worthwhile just for deploying these three small modules. I also discovered the Serverless tool, but it seems it can only deploy the Cloud Function, not the scheduler jobs and Pub/Sub topics.
I think your approach is straightforward and fine.
Does Terraform provide the job update capability? If so, you'll likely find that it simply deletes and then (re)creates the job. I think this approach (delete-then-recreate) to updating jobs is fine too and seems to provide more control; you can check whether the schedule is about to fire before or after updating it.
Google provides Deployment Manager as a Google-Cloud-specific deployment tool. In my experience, its primary benefit is that it's server-side but, ultimately, you're just automating the same APIs that you're using with gcloud.
If you want to learn a tool to manage your infrastructure as code, I'd recommend Terraform over Deployment Manager.
Update
The Scheduler API supports 'patching' jobs:
https://cloud.google.com/scheduler/docs/reference/rest/v1beta1/projects.locations.jobs/patch
And this mechanism is supported by gcloud:
gcloud alpha scheduler jobs update
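For example, a rough update-or-create pattern using the same variables as your script (a sketch only; the exact release track and flags of the update subcommand may vary with your gcloud version):

gcloud alpha scheduler jobs update pubsub $SCHEDULE_NAME \
  --message-body='RUN' \
  --schedule='27 2 * * *' \
  --time-zone='Europe/London' \
  || gcloud scheduler jobs create pubsub $SCHEDULE_NAME \
       --message-body='RUN' \
       --topic=$SCHEDULE_NAME \
       --schedule='27 2 * * *' \
       --time-zone='Europe/London'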

Amazon EMR spam applications by user dr.who?

I am working on Spark processes using Python (PySpark). I created an Amazon EMR cluster to run my Spark scripts, but as soon as the cluster is created, a lot of applications are launched by themselves (by user dr.who), which I can see in the cluster UI:
So, when I try to launch my own script, it enters an endless queue, sometimes ACCEPTED but never getting into the RUNNING state.
I couldn't find any info about this issue, even in the Amazon forums, so I'd be glad of any advice.
Thanks in advance.
You need to check the security group of the master node and review its inbound traffic rules. You may have a rule open to anywhere (0.0.0.0/0); remove it (or try removing it and check whether things start working). Leaving it open is a vulnerability.
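As a hedged sketch of tightening that rule with the AWS CLI (the security group ID is a placeholder; port 8088 is the YARN ResourceManager web UI, which is a common entry point for these rogue dr.who applications):

# Placeholder group ID; replace with the master node's security group.
aws ec2 revoke-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 8088 \
  --cidr 0.0.0.0/0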

Cloud Services to run Batch script when file is uploaded?

I am looking to run a batch script on files that are uploaded from my website (one at a time), and return the resulting file produced by that batch script. The website is hosted on a shared Linux environment, so I cannot run the batch file on the server.
It sounds like something I could accomplish with Amazon S3 and Amazon Lambda, but I was wondering if there were any other services out there that would allow me to accomplish the same task.
I would recommend that you look into S3 Events and Lambda.
Using S3 events, you can trigger a Lambda function on puts and deletes in an S3 bucket, and depending on your "batch file" task, you may be able to achieve your goal purely in Lambda.
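As an example of that hookup (bucket name and function ARN are placeholders; the Lambda also needs a resource policy allowing s3.amazonaws.com to invoke it), a sketch of configuring the S3 event notification from the CLI:

aws s3api put-bucket-notification-configuration \
  --bucket my-upload-bucket \
  --notification-configuration '{
    "LambdaFunctionConfigurations": [{
      "Id": "process-upload-on-put",
      "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-upload",
      "Events": ["s3:ObjectCreated:*"]
    }]
  }'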
If you cannot use Lambda to replace the functionality of your batch file you can try the following:
If you need to have the batch process run on a specific instance, take a look at Amazon SQS. You can have the S3 event triggered Lambda create a work item in SQS and your instance can regularly poll SQS for work to process.
If you need something a bit more real-time, you could use Amazon SNS for a push rather than a pull approach to the above.
If you don't need the file to be processed by a specific instance but you do have to run a batch file against it, you can have your S3-event-triggered Lambda launch an instance whose UserData script preps the server as needed, downloads the S3 file, runs the batch process against it, and then finally self-terminates by looking up its own instance ID via the EC2 metadata service and calling the TerminateInstances API.
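As a rough illustration of that last approach, a UserData-style sketch (bucket, key, and script names are placeholders) that downloads the uploaded file, runs the batch processing, uploads the result, and self-terminates:

#!/bin/bash
# Placeholders: my-bucket, uploads/input.dat, and process.sh are illustrative only.
aws s3 cp s3://my-bucket/uploads/input.dat /tmp/input.dat
./process.sh /tmp/input.dat /tmp/output.dat
aws s3 cp /tmp/output.dat s3://my-bucket/results/output.dat

# Look up this instance's own ID via the metadata service and terminate it.
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"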
Here is some related reading to assist with the above approaches:
Amazon SQS
https://aws.amazon.com/documentation/sqs/
Amazon SNS
https://aws.amazon.com/documentation/sns/
Amazon Lambda
https://aws.amazon.com/documentation/lambda/
Amazon S3 Event Notifications
http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
EC2 UserData
http://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ec2-instance-metadata.html#instancedata-add-user-data
EC2 Metadata Service
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html#instancedata-data-retrieval
AWS Tools for Powershell Cmdlet Reference
http://docs.aws.amazon.com/powershell/latest/reference/Index.html

How to poll AWS CLI in shell script?

As part of my CD pipeline in snap-ci.com, I want to start the instances in my AWS OpsWorks stack before deploying the application.
As starting hosts takes a certain amount of time (after the command has already returned), I need to poll for the instances to be running (using the describe-instances command in the AWS CLI). This command returns a full JSON response, one field of which contains the status of the instance (e.g. "running").
I am new to shell scripting and the AWS CLI and would appreciate some pointers. I am aware that I could also use the AWS SDKs to program it in Java, but that would require deploying that program to the snap-ci hosts first, which sounds complex as well.
The AWS CLI has support for wait commands; these block until the condition you specify is met, such as an instance being ready.
The Advanced Usage of the AWS CLI talk from re:Invent 2014 shows how to use waiters (18:55), queries, profiles, and other tips for using the CLI.
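For plain EC2 instances, a waiter such as aws ec2 wait instance-running covers this directly. For OpsWorks instance status, a simple polling loop works; here is a hedged sketch (the instance ID is a placeholder, and "online" is the OpsWorks status this sketch assumes you are waiting for):

# Placeholder instance ID; poll until the OpsWorks instance reports "online".
INSTANCE_ID="your-opsworks-instance-id"
while true; do
  STATUS=$(aws opsworks describe-instances \
    --instance-ids "$INSTANCE_ID" \
    --query 'Instances[0].Status' \
    --output text)
  echo "Current status: $STATUS"
  [ "$STATUS" = "online" ] && break
  sleep 15
done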

Post hook for Elastic MapReduce

I wonder if there is an example of a post-processing hook for EMR (Elastic MapReduce)? What I am trying to achieve is to send an email to a group of people right after Amazon's Hadoop has finished the job.
You'll want to configure the job end notification URL.
jobEnd.notificationUrl
AWS will hit this URL, presumably with query variables that indicate which job has completed (the job ID).
You could then have this URL on your server process your email notifications, assuming you had already stored a relationship between emails and job ids.
https://issues.apache.org/jira/browse/HADOOP-1111
An easier way is to use Amazon CloudWatch (monitoring system) and Amazon Simple Notification Service (SNS) to monitor and notify you and others on the status of your EMR jobs.
For example, you can set an alarm on your cluster's IsIdle metric. It is set to 1 once the job is done (or has failed), and you can then get an SNS notification as an email (or even an SMS). You can set similar alarms on the JobsFailed count and other metrics.
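As a hedged CLI sketch (the cluster ID and SNS topic ARN are placeholders), an alarm that fires, and so sends the SNS email, when the cluster's IsIdle metric reaches 1:

aws cloudwatch put-metric-alarm \
  --alarm-name emr-job-finished \
  --namespace AWS/ElasticMapReduce \
  --metric-name IsIdle \
  --dimensions Name=JobFlowId,Value=j-XXXXXXXXXXXXX \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:email-me-topic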
For the complete list of EMR-related metrics, see the EMR documentation: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_ViewingMetrics.html
