Post hook for Elastic MapReduce - hadoop

I wonder if there is an example of a post-process hook for EMR (Elastic MapReduce)? What I am trying to achieve is to send an email to a group of people right after Amazon's Hadoop cluster finishes the job.

You'll want to configure the job end notification URL.
job.end.notification.url (mapreduce.job.end-notification.url in newer Hadoop versions)
Hadoop will hit this URL when the job finishes, with query variables that indicate which job has completed (the job ID) and its final status.
You could then have this URL on your server process your email notifications, assuming you had already stored a mapping between email addresses and job IDs.
https://issues.apache.org/jira/browse/HADOOP-1111
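For illustration, here is a minimal sketch of such an endpoint, assuming the notification URL is configured with the $jobId/$jobStatus placeholders (e.g. http://myserver/job-done?jobId=$jobId&status=$jobStatus); the Flask route, the job-to-email mapping, and the SMTP settings are all hypothetical.

```python
# Sketch of a notification endpoint that sends an email when Hadoop calls back.
# Route name, job->email mapping, and SMTP relay are placeholders, not a real setup.
import smtplib
from email.message import EmailMessage

from flask import Flask, request

app = Flask(__name__)

# Hypothetical mapping stored when the job was submitted.
EMAILS_BY_JOB = {"job_201801010000_0001": ["team@example.com"]}

@app.route("/job-done")
def job_done():
    job_id = request.args.get("jobId", "unknown")
    status = request.args.get("status", "unknown")
    msg = EmailMessage()
    msg["Subject"] = f"EMR job {job_id} finished with status {status}"
    msg["From"] = "noreply@example.com"
    msg["To"] = ", ".join(EMAILS_BY_JOB.get(job_id, ["ops@example.com"]))
    msg.set_content(f"Job {job_id} reported status {status}.")
    with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay
        smtp.send_message(msg)
    return "ok"
```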

An easier way is to use Amazon CloudWatch (the monitoring system) and Amazon Simple Notification Service (SNS) to monitor and notify you and others about the status of your EMR jobs.
For example, you can set an alarm on your cluster's IsIdle metric. It is set to 1 once the job is done (or has failed), and you can then get an SNS notification as an email (or even an SMS). You can set similar alarms on the JobsFailed count and other metrics.
For the complete list of EMR-related metrics, see the EMR documentation: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_ViewingMetrics.html
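As a rough boto3 sketch (the cluster ID, SNS topic ARN, and thresholds below are placeholders), the IsIdle alarm could be created like this:

```python
# Sketch: create a CloudWatch alarm on the EMR IsIdle metric that notifies an SNS topic.
# Cluster ID and topic ARN are placeholders; tune Period/EvaluationPeriods to taste.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="emr-cluster-is-idle",
    Namespace="AWS/ElasticMapReduce",
    MetricName="IsIdle",
    Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],  # your cluster ID
    Statistic="Average",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1.0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:emr-notifications"],  # your SNS topic
)
```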

Related

How to check where a job came from in HashiCorp Nomad?

I wonder if there is any way to find out how Nomad received a specific job. From what I found, the logs contain only the job submit time; the IP the job arrived from and the submit method (API, GUI) are not specified. Is there any way to find this information?
Although I haven't tried it myself yet, what you refer to falls under Nomad's audit logging feature, whose payload is somewhat similar to Vault audit logs. Audit logging can be set up as part of the Nomad server configuration; however, at the moment this is available only in the Enterprise version of Nomad.
Anyhow, looking at the docs, I guess the fields you would be interested in are:
.payload.auth.stage
.payload.auth.accessor_id
.payload.auth.name
.payload.request.operation
.payload.request.endpoint
.payload.request.request_meta.remote_address
.payload.request.request_meta.user_agent
.payload.response.status_code
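Assuming audit logging is enabled and writes JSON lines to a file (the path below is hypothetical, and the key layout follows the field list above, so adjust it if your Nomad version structures the payload differently), a quick sketch to pull those fields out of each event could look like this:

```python
# Sketch: extract the audit-log fields listed above from a JSON-lines audit log.
import json

with open("/var/log/nomad/audit.log") as f:  # hypothetical log path
    for line in f:
        payload = json.loads(line).get("payload", {})
        auth = payload.get("auth", {})
        req = payload.get("request", {})
        meta = req.get("request_meta", {})
        print(
            auth.get("stage"),
            auth.get("accessor_id"),
            auth.get("name"),
            req.get("operation"),
            req.get("endpoint"),
            meta.get("remote_address"),   # the IP the job was submitted from
            meta.get("user_agent"),       # hints at API vs. GUI submission
            payload.get("response", {}).get("status_code"),
        )
```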

Amazon EMR spam applications by user dr.who?

I am working on Spark processes using Python (pyspark). I created an Amazon EMR cluster to run my Spark scripts, but as soon as the cluster is created, a lot of applications are launched on their own (by user dr.who), which I can see in the cluster UI.
So, when I try to launch my own script, it enters an endless queue, sometimes getting ACCEPTED but never reaching the RUNNING state.
I couldn't find any info about this issue, even in the Amazon forums, so I'd be glad for any advice.
Thanks in advance.
You need to check the security group of the master node and look at its inbound traffic rules. You may have a rule open to Anywhere (0.0.0.0/0); remove it and check whether things work. Leaving it open is a vulnerability, since an exposed YARN ResourceManager lets outsiders submit the dr.who applications you are seeing.
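If you prefer to do that check from code rather than the console, a rough boto3 sketch (the security group ID is a placeholder) could look like this:

```python
# Sketch: find and revoke inbound rules open to 0.0.0.0/0 on the master node's security group.
# The group ID is a placeholder; review the printed rules before actually revoking anything.
import boto3

ec2 = boto3.client("ec2")
SG_ID = "sg-0123456789abcdef0"  # the EMR master node's security group

group = ec2.describe_security_groups(GroupIds=[SG_ID])["SecurityGroups"][0]
for perm in group["IpPermissions"]:
    open_ranges = [r for r in perm.get("IpRanges", []) if r.get("CidrIp") == "0.0.0.0/0"]
    if not open_ranges:
        continue
    revoke = {"IpProtocol": perm["IpProtocol"], "IpRanges": open_ranges}
    if "FromPort" in perm:  # absent for protocol "-1" (all traffic)
        revoke["FromPort"] = perm["FromPort"]
        revoke["ToPort"] = perm["ToPort"]
    print("Revoking", revoke)
    ec2.revoke_security_group_ingress(GroupId=SG_ID, IpPermissions=[revoke])
```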

What's the right way to provide Hadoop/Spark IAM role based access for S3?

We have a Hadoop cluster running on EC2, and the EC2 instances are attached to a role which has access to an S3 bucket, for example "stackoverflow-example".
Several users are placing Spark jobs on the cluster. We used keys in the past but do not want to continue with that and want to migrate to the role, so any jobs placed on the Hadoop cluster will use the role associated with the EC2 instances. I did a lot of searching and found 10+ tickets; some of them are still open, some are fixed, and some do not have any comments.
I want to know whether it's still possible to use an IAM role for jobs (Spark, Hive, HDFS, Oozie, etc.) placed on the Hadoop cluster. Most of the tutorials discuss passing keys (fs.s3a.access.key, fs.s3a.secret.key), which is not good enough and not secure either. We also faced issues with the credential provider with Ambari.
Some references:
https://issues.apache.org/jira/browse/HADOOP-13277
https://issues.apache.org/jira/browse/HADOOP-9384
https://issues.apache.org/jira/browse/SPARK-16363
The first one you link to, HADOOP-13277, says "can we have IAM?", to which the JIRA was closed with "you have this in s3a". The second, HADOOP-9384, was "add IAM to S3n", closed as "switch to s3a". And SPARK-16363? An incomplete bug report.
If you use S3a, and do not set any secrets, then the s3a client will fall back to looking at the special EC2 instance metadata HTTP server, and try to get the secrets from there.
That's it: it should just work.
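As a quick PySpark sanity check (the bucket prefix is just an example, and the optional provider class name differs between Hadoop/SDK versions), you can read from s3a:// without configuring any keys at all:

```python
# Sketch: read from S3 via s3a with no access/secret keys configured, so the client
# falls back to the EC2 instance metadata service and uses the attached IAM role.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-iam-role-check")
    # Optional: pin the credentials provider explicitly; the class name varies by
    # Hadoop/SDK version, e.g. com.amazonaws.auth.InstanceProfileCredentialsProvider
    # on older Hadoop 2.x builds.
    # .config("spark.hadoop.fs.s3a.aws.credentials.provider",
    #         "com.amazonaws.auth.InstanceProfileCredentialsProvider")
    .getOrCreate()
)

# Any prefix the instance role can read; the path below is just an example.
df = spark.read.text("s3a://stackoverflow-example/some/prefix/")
df.show(5, truncate=False)
```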

How to expose Hadoop job and workflow metadata using Hive

What I would like to do is make workflow and job metadata such as start date, end date and status available in a Hive table, to be consumed by a BI tool for visualization purposes. I would like to be able to monitor, for example, whether a certain workflow fails at certain hours, its success rate, and so on.
For this purpose I need access to the same data Hue is able to show in the job browser and Oozie dashboard. What I am looking for specifically for workflows for example is the name, submitter, status, start and end time. The reason that I want this is that in my opinion this tool lacks a general overview and good search.
The idea is that once I locate this data, I will load it into Hive, either directly or through some processing steps.
Questions that I would like to see answered:
Is this data stored in HDFS, or is it scattered across the local data nodes?
If it is stored in HDFS, where can I find it? If it is stored on the local data nodes, how does Hue find and show it?
Assuming I can access the data, in what format should I expect it? Is it stored in general log files, or can I expect somewhat structured data?
I am using CDH 5.8
If jobs are submitted through means other than Oozie, my approach won't be helpful.
We have collected all the logs from the Oozie server through the Oozie Java API and iterated over the coordinator information to get the required info.
You need to decide what kind of information you want to retrieve.
If all your jobs are submitted through a bundle, then go from the bundle to the coordinator and then to the workflow to find the info.
If you just want all the coordinator info, simply call the API with the number of coordinators to fetch and pull the required fields.
We then loaded the fetched results into a Hive table, where one can filter results for failed or timed-out coordinators and various other parameters.
You can start by looking at the example on the Oozie site:
https://oozie.apache.org/docs/3.2.0-incubating/DG_Examples.html#Java_API_Example
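The answer above uses the Oozie Java API; if Python is easier to glue into your pipeline, a rough sketch of the same idea against the Oozie web services (REST) API could look like the following. The Oozie URL is a placeholder, and the query parameters and response field names may differ slightly across Oozie versions, so check them against your installation's docs.

```python
# Sketch: list coordinator jobs via the Oozie web services API and dump the fields
# (name, user, status, start/end time) to a CSV that can later be loaded into Hive.
import csv
import requests

OOZIE_URL = "http://oozie-server:11000/oozie"  # placeholder

resp = requests.get(f"{OOZIE_URL}/v1/jobs", params={"jobtype": "coordinator", "len": 500})
resp.raise_for_status()
coordinators = resp.json().get("coordinatorjobs", [])

with open("coordinators.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["id", "app_name", "user", "status", "start_time", "end_time"])
    for c in coordinators:
        writer.writerow([
            c.get("coordJobId"),
            c.get("coordJobName"),
            c.get("user"),
            c.get("status"),
            c.get("startTime"),
            c.get("endTime"),
        ])
# Put the CSV on HDFS and expose it as an external Hive table.
```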
If you want to track the status of your jobs scheduled in Oozie, you should use the Oozie RESTful API or Java API. I haven't worked with the Hue interface for operating Oozie, but I guess it still uses the REST API behind the scenes. It provides you with all the necessary information, and you can create a service which consumes this data and pushes it into a Hive table.
Another option is to access the Oozie database directly. As you probably know, Oozie keeps all the data about scheduled jobs in an RDBMS such as MySQL or Postgres. You can consume this information through a JDBC connector. An interesting approach would be to link this information directly into Hive as a set of external tables through the JdbcStorageHandler. Not sure if it works, but it's worth a try.
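For the database route, here is a hedged sketch against a MySQL-backed Oozie. The connection settings are placeholders, and the table and column names (WF_JOBS, app_name, user_name, status, start_time, end_time) are what I would expect from the Oozie schema, so double-check them against your actual database before relying on this.

```python
# Sketch: pull workflow metadata straight from a MySQL-backed Oozie database.
# Connection settings are placeholders; verify table/column names against your schema.
import pymysql

conn = pymysql.connect(host="oozie-db", user="oozie", password="...", database="oozie")
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, app_name, user_name, status, start_time, end_time FROM WF_JOBS"
        )
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```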

Make an execution timeline on Amazon EMR

I am interested in using the job_history_summary.py script to create a task timeline of my EMR cluster, similar to this one (the picture is from the Smith College Hadoop Tutorial 1.1, but apparently originates from the Yahoo report on the TeraSort experiment).
It seems that the Hadoop logs are stored on each node, rather than on the central server. Do I need to manually combine the logs? It also seems that the script doesn't actually produce the graph.
You can enable logging and provide an S3 bucket when you create the cluster. The logs will be zipped and stored in the S3 bucket you provided.
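For example, with boto3 this is just the LogUri parameter when the cluster is created; the bucket name, release label, and instance settings below are placeholders, not a complete production configuration.

```python
# Sketch: create an EMR cluster with logging enabled; the per-node logs (including
# job history) end up zipped under the given S3 prefix.
import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="timeline-demo",
    LogUri="s3://my-log-bucket/emr-logs/",  # logs are pushed here
    ReleaseLabel="emr-5.30.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```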
