Re-use Amazon Elastic MapReduce instance - amazon-ec2

I have tried a simple Map/Reduce task using Amazon Elastic MapReduce and it took just 3 minutes to complete. Is it possible to re-use the same instance to run another task?
Even though I used the instance for only 3 minutes, Amazon will charge for a full hour, so I want to use the remaining 57 minutes to run several other tasks.

The answer is yes.
Here's how you do it using the command line client:
When you create an instance, pass the --alive flag; this tells EMR to keep the cluster around after your job has run.
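For example (a minimal sketch; apart from --create and --alive, the name and sizing flags below are assumptions about the old Ruby CLI and may differ between versions):
# Start a persistent (reusable) job flow; --create and --alive are the key flags
elastic-mapreduce --create --alive \
  --name "reusable-cluster" \
  --num-instances 3 \
  --instance-type m1.small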
Then you can submit more tasks to the cluster:
elastic-mapreduce --jobflow <job-id> --stream --input <s3dir> --output <s3dir> --mapper <script1> --reducer <script2>
To terminate the cluster later, simply run:
elastic-mapreduce --jobflow <job-id> --terminate
Try running elastic-mapreduce --help to see all the commands you can run.
If you don't have the command line client, you can get it here.

Using:
elastic-mapreduce --jobflow <job-id> \
--jar s3n://some-path/x.jar \
--step-name "New step name" \
--args ...
you can also add non-streaming steps to your cluster. (Just so you don't have to try it yourself ;-) )

http://aws.amazon.com/elasticmapreduce/faqs/#dev-6
Q: Can I run a persistent job flow? Yes. Amazon Elastic MapReduce job flows that are started with the --alive flag will continue until explicitly terminated. This allows customers to add steps to a job flow on demand. You may want to use this to debug your job flow logic without having to repeatedly wait for job flow startup. You may also use a persistent job flow to run a long-running data warehouse cluster. This can be combined with data warehouse and analytics packages that run on top of Hadoop, such as Hive and Pig.
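To find the <job-id> to pass to --jobflow in the commands above, the same client can list job flows; a minimal sketch, assuming its --list and --active flags:
# List job flows that are still running or waiting, to find the j-XXXXXXXXXXXX id
elastic-mapreduce --list --active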

Related

In Oozie, how would I be able to use script output

I have to create a cron-like coordinator job and collect some logs.
/mydir/sample.sh >> /mydir/cron.log 2>&1
Can I use a simple Oozie workflow, the same one I use for any shell command?
I'm asking because I've seen that there are specific workflows for executing .sh scripts.
Sure, you can execute a Shell action (on any node in the YARN cluster) or use the Ssh action if you'd like to target specific hosts. Keep in mind that the "/mydir/cron.log" file will be created on the host the action is executed on, and the generated file might not be available to other Oozie actions.
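If that matters, a minimal sketch of a wrapper the Shell action could run, which pushes the log to HDFS at the end so later actions can read it from any node (the HDFS path and file naming are assumptions):
#!/bin/bash
# Hypothetical wrapper: run the original command, keep its exit code,
# then copy the local log onto HDFS so it is reachable from other hosts.
/mydir/sample.sh >> /mydir/cron.log 2>&1
STATUS=$?
hdfs dfs -mkdir -p /logs/cron
hdfs dfs -put -f /mydir/cron.log /logs/cron/cron-$(date +%Y%m%d).log
exit $STATUS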

How to get exceptions, errors, and logs for HIVE-SQOOP based batch jobs?

I have a Hadoop cluster with 6 datanodes and 1 namenode. I have a few (4) Hive jobs which run every day and push some data from log files to our OLTP database using Sqoop. I do not have Oozie installed in the environment. The jobs are all written as Hive script files (.sql files) and I run them from Unix shell scripts (.sh files). Those shell scripts are attached to different OS cron jobs so that they run at different times.
Now the requirement is this:
Generate a log/status for each job separately on a daily basis, so that at the end of the day, by looking at those logs, we can identify which jobs ran successfully and how long they took, and which jobs failed, along with the dump/stack trace for the failed jobs. (The future plan is that we will have a mail server, and the shell script of every failed or successful job will send mail to the respective stakeholders with those log/status files as attachments.)
Now my problem is: how can I capture errors/exceptions when I run those batch jobs / shell scripts, and how can I also generate a success log with the execution time?
I tried to get the output of each query run in Hive into a text file by redirecting the output, but that is not working.
For example:
Select * from staging_table;>>output.txt
Is there any way to do this by configuring Hive logging for each and every Hive job on a day-to-day basis?
Please let me know if anyone has faced this issue and how it can be resolved.
Select * from staging_table;>>output.txt
This is redirecting output; if that is the option you are looking for, then below is the way to do it from the console:
hive -e 'Select * from staging_table' > /home/user/output.txt
This will simply redirect the output. It won't display job-specific log information.
However, I am assuming that you are running on YARN; if you are expecting to see application (job) specific logs, please see this:
Resulting log file locations:
During run time you will see all the container logs in ${yarn.nodemanager.log-dirs}.
Using the UI you can see the logs at the job level and the task level.
The other way is to dump application/job specific logs from the command line:
yarn logs -applicationId your_application_id
Please note that using the yarn logs -applicationId <application_id> method is preferred but it does require log aggregation to be enabled first.
Also see a much better explanation here.
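To produce the per-job status, execution time, and error output the question asks for, one straightforward option is a small wrapper around each Hive invocation in the existing .sh scripts; a minimal sketch, assuming the jobs are plain .sql files (all paths and names below are made up):
#!/bin/bash
# Hypothetical daily wrapper: runs one Hive script, records its duration,
# exit status, stdout, and Hive's stderr (which carries the MapReduce job info).
JOB_NAME=job1
LOG_DIR=/var/log/hive-jobs/$(date +%Y-%m-%d)
mkdir -p "$LOG_DIR"

START=$(date +%s)
hive -f /jobs/${JOB_NAME}.sql \
  > "$LOG_DIR/${JOB_NAME}.out" \
  2> "$LOG_DIR/${JOB_NAME}.err"
STATUS=$?
END=$(date +%s)

# One summary line per run; a mail step could later attach the .err file on failure
echo "$(date '+%F %T') ${JOB_NAME} status=${STATUS} duration=$((END - START))s" >> "$LOG_DIR/summary.log"
exit $STATUS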

sun grid engine qsub to all nodes

I have a master and two nodes. They are installed with SGE, and I have a shell script ready on all the nodes as well. Now I want to use qsub to submit the job on all my nodes.
I used:
qsub -V -b n -cwd /root/remotescript.sh
but it seems that only one node is doing the job. I am wondering how I can submit jobs to all nodes. What would the command be?
SGE is meant to dispatch jobs to worker nodes. In your example, you create one job, so one node will run it. If you want to run a job on each of your nodes, you need to submit more than one job. If you want to target specific nodes, you probably should use something closer to:
qsub -V -b n -cwd -l hostname=node001 /root/remotescript.sh
qsub -V -b n -cwd -l hostname=node002 /root/remotescript.sh
The "-l hostname=*" parameter will require a specific host to run the job.
What are you trying to do? The general use case of using a grid engine is to let the scheduler dispatch the jobs so you don't have to use the "-l hostname=*" parameter. So technically you should just submit a bunch of jobs to SGE and let it dispatch it with the nodes availability.
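As a concrete sketch of that last point, submitting a handful of independent jobs and letting SGE place them could look like this (the count of 4 is arbitrary):
# Submit several copies of the job; SGE decides which node runs each one
for i in $(seq 1 4); do
  qsub -V -b n -cwd /root/remotescript.sh
done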
Finch_Powers' answer is good for describing how SGE allocates resources, so I'll elaborate below on the specifics of your question, which may be why you are not getting the desired outcome.
You mention launching remote script via:
qsub -V -b n -cwd /root/remotescript.sh
Also, you mention again that these scripts are located on the nodes:
"And I have a shell script ready on all the nodes as well"
This is not how SGE is designed to work, although it can do this. Typical usage is to have the same single script (or multiple scripts) accessible to all nodes via network-mounted storage, and to let SGE decide which nodes run the script.
To run remote code, you may be better served using plain SSH.
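A minimal sketch of that SSH route, assuming the script already exists at the same path on every node (the hostnames are placeholders):
# Run the locally installed script on each node directly over SSH
for host in node001 node002; do
  ssh root@$host /root/remotescript.sh
done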

Submitting jobs to different fair scheduler pools while using jar option

I am relatively new to Hadoop and was trying to have different jobs of the same user submitted to different pools of the fair scheduler at run time while using the hadoop jar option.
Based on the solution in http://osdir.com/ml/hive-user-hadoop-apache/2009-03/msg00162.html, I used the -D option while running the job.
Specifically, I ran the command: bin/hadoop jar hadoop-examples-1.0.4.jar grep input output 'dfs[a-z.]+' -D pool.name=sample_pool
I can see the pool in the job tracker scheduler page, but the job is still submitted to the user's pool. I found that the -D option is not supported by the jar option: http://hadoop.apache.org/docs/r1.0.4/commands_manual.html#job.
How can I specify this at run time?
A couple of suggestions:
Have you restarted the job tracker since you made the changes suggested in the first link?
You previously needed to set all -D properties before the other arguments (I'm not sure whether this has changed in more recent versions). Try:
bin/hadoop jar hadoop-examples-1.0.4.jar -Dpool.name=sample_pool grep input output 'dfs[a-z.]+'
It probably doesn't matter, but I always bunch up the -Dkey=value options (no space between the -D and the key=value); I find it makes it more obvious that this is not part of the variable args list.
One way to verify this has been picked up correctly is to check the job's job.xml in the job tracker: does it have the pool.name property listed, and does it have the value you configured?
EDIT: Just reading up on how the examples are bundled, you'll need to add the -D after the program name and before the other arguments:
bin/hadoop jar hadoop-examples-1.0.4.jar grep -Dpool.name=sample_pool input output 'dfs[a-z.]+'
I think you can specify the parameter mapred.fairscheduler.pool or mapred.fairscheduler.poolnameproperty.
For instance, you can run the command:
bin/hadoop jar hadoop-examples-1.0.4.jar -Dmapred.fairscheduler.pool=sample_pool grep input output 'dfs[a-z.]+'
mapred.fairscheduler.pool:
Specify the pool that a job belongs in. If this is specified then mapred.fairscheduler.poolnameproperty is ignored.
mapred.fairscheduler.poolnameproperty:
Specify which jobconf property is used to determine the pool that a job belongs in. String, default: user.name (i.e. one pool for each user). Another useful value is mapred.job.queue.name to use MapReduce's "queue" system for access control lists (see below). mapred.fairscheduler.poolnameproperty is used only for jobs in which mapred.fairscheduler.pool is not explicitly set.
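For completeness, if the scheduler is configured with mapred.fairscheduler.poolnameproperty set to mapred.job.queue.name, the same example job could be routed through that property instead (a sketch; the pool/queue name is illustrative):
bin/hadoop jar hadoop-examples-1.0.4.jar grep -Dmapred.job.queue.name=sample_pool input output 'dfs[a-z.]+'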
References:
hadoop fair scheduler

How to clear hadoop fifo queue?

I have set up a pseudo-distributed mode cluster. The FIFO scheduler somehow got stuck in between, and therefore a lot of jobs that I had scheduled through cron piled up. Now, when I restart the YARN resourcemanager it gets stuck after a while and the jobs keep piling up.
Is there a way I can clear the whole queue? Or is my understanding of Hadoop scheduling flawed somewhere? Please help.
If you're trying to kill all the jobs in your queue, you can use this shell script:
$HADOOP_HOME/bin/hadoop job -list | awk ' { system("$HADOOP_HOME/bin/hadoop job -kill " $1) } '
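Since the question mentions YARN, roughly the same thing can be done with the yarn CLI instead of the old hadoop job command; a sketch that kills everything still queued or running, assuming the standard yarn application subcommands of Hadoop 2.x:
# Kill every YARN application that is still ACCEPTED (queued) or RUNNING
yarn application -list -appStates ACCEPTED,RUNNING 2>/dev/null \
  | awk '/^application_/ { system("yarn application -kill " $1) }'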
