I wrote a Talend ETL job and packaged it into an executable JAR file.
I want to schedule this JAR file to run once a week.
I am using Amazon AWS. Can it do what I want? If yes, what steps should I follow? If not, what are the alternatives?
If you are using Linux, I recommend the traditional cron service provided by Unix systems.
Steps
Edit /etc/crontab.
Add a schedule entry and a shell script that kicks off the Talend job.
Be careful with the timezone setting: cron evaluates the schedule in the system's timezone.
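The steps above can be sketched as a small wrapper that cron calls. This is only an illustration under assumptions: the JAR path, the `etl` user and the Monday 03:00 schedule are hypothetical, and the wrapper is written in Python rather than shell so its exit code handling is easy to see.

```python
"""Wrapper that cron can call to launch a packaged Talend job.

Hypothetical /etc/crontab entry (note the user field, which /etc/crontab
requires), running every Monday at 03:00 in the system timezone:

    0 3 * * 1  etl  /usr/bin/python3 /opt/etl/run_job.py
"""
import subprocess

# Assumed location of the exported job; replace with the real path.
JAR_PATH = "/opt/etl/my_talend_job.jar"


def build_command(jar_path, java="java"):
    """Return the argv list that runs an executable JAR."""
    return [java, "-jar", jar_path]


def run_job(jar_path=JAR_PATH, runner=subprocess.run):
    """Run the job once and return its exit code."""
    result = runner(build_command(jar_path))
    return result.returncode

# When saved as /opt/etl/run_job.py, end the file with:
#     raise SystemExit(run_job())
# so that a failing job gives cron a non-zero exit status.
```

Passing the runner in as a parameter is just a convenience for trying the wrapper without Java installed; in normal use the default `subprocess.run` is what executes the JAR.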
Alternatives
Wikipedia - List of job scheduler software
I would personally recommend putting your JAR in a Docker container and running it on an EC2 instance on AWS. A cron job will work, but only if the machine it lives on is running when the task is scheduled to execute (cron itself does not need you to be signed in). If you put the JAR on an always-on instance in the cloud, it will execute on schedule.
Related
Basically, I need to run a set of custom shell scripts on EC2 instances to provision some software. Is there a workflow manager like Oozie or Airflow, with API access, that can schedule this? I am asking for alternatives to Oozie and Airflow because those are schedulers for Hadoop environments and my environment is not one. I can ensure SSH access from the source machine that runs the workflow manager to the EC2 instances where I want to install the software. Are there any such open-source workflow schedulers?
I would recommend using Cadence Workflow for your use case. There are multiple provisioning solutions built on top of it, for example the Banzai Cloud Pipeline Platform.
See the presentation that goes over the Cadence programming model.
I have a Go application deployed with Docker. Besides running the main program, I want to run a periodic job that updates my data.
Which is better?
Run the periodic job using concurrency (channels) inside the main program.
Use crontab to register the periodic job on the system. But I don't know how to do this inside Docker.
In a Dockerfile or in Docker, what is the best way to run a separate cron job?
Please help me. Thanks!
If you are developing the application and all you need is basic periodic execution of one "job", I would implement it in your app. If things get more complicated, I would build on an image such as https://github.com/phusion/baseimage-docker, which brings support for managing multiple container processes (including cron).
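To show what "implement it in your app" can look like, here is a minimal sketch of an in-process periodic job. It is written in Python purely as an illustration; the asker's application is in Go, where the analogous pattern would be a goroutine reading from a `time.Ticker` channel. The function name `start_periodic` is made up for this example.

```python
import threading


def start_periodic(interval_seconds, job):
    """Run job() every interval_seconds on a background thread.

    Returns an Event; setting it stops the loop.
    """
    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout and True once stop is set,
        # so the loop sleeps between runs and exits promptly on stop.
        while not stop.wait(interval_seconds):
            job()

    # daemon=True so the worker does not keep the process alive on exit.
    threading.Thread(target=loop, daemon=True).start()
    return stop
```

The main program would call something like `stop = start_periodic(3600, update_data)` at startup and `stop.set()` during shutdown. Nothing extra has to be installed in the container, which is the main attraction of this option.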
I am new to Hadoop. I have written a few jobs and exported them as JAR files. I am able to run them using the hadoop jar command, and I want to run these jobs every hour. How do I do this? Thanks in advance.
Hadoop itself has no built-in way to schedule jobs the way you describe. So you have two main choices: Java's timer and scheduling facilities, or running the jobs from the operating system. I would personally use cron for this: it's simple, very flexible, and installed by default on most servers. There are also lots of tutorials.
Cron example that runs at minute 0 of every hour:
0 * * * * /bin/hadoop jar myJar.jar
If you want to keep it inside Java itself, I would suggest checking out this question, which has details and code: How to schedule task for start of every hour.
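The core of any in-process "start of every hour" scheduler is working out how long to wait until the next hour boundary. The linked question is about Java; the sketch below shows the same arithmetic in Python as an illustration only, and the names `seconds_until_next_hour` and `run_hourly` are made up for it.

```python
import time
from datetime import datetime, timedelta


def seconds_until_next_hour(now):
    """Seconds from `now` until the next HH:00:00."""
    next_hour = now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    return (next_hour - now).total_seconds()


def run_hourly(job, iterations, now=datetime.now, sleep=time.sleep):
    """Call job() at the start of each of the next `iterations` hours."""
    for _ in range(iterations):
        # Recompute the delay every time so the schedule does not drift
        # when the job itself takes a while to finish.
        sleep(seconds_until_next_hour(now()))
        job()
```

Unlike cron, this only works while the process stays alive, so it suits a long-running service better than a one-off batch JAR.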
You could probably achieve that with a cron job or some script. But the better way, in my view, would be to use a scheduler like Oozie.
In addition to the already mentioned Oozie, you might want to take a look at Falcon.
From my own experience, however, a much easier approach is to use your CI system, for example Jenkins, so that you avoid adding new systems to your stack.
Another option besides cron and Oozie: Quartz Scheduler.
I have a script that I need to run once a day that requires a lot of memory. I would like to run it on a dedicated amazon box.
Is there some automated way to build a box, download all the required software (like Ruby), and then run my script? After the script has run, I would like to shut down the box.
The two options I can think of are:
I am thinking about hacking EMR to do this. (My script is a mapper against an empty directory)
Chef - This seemed like too much for one simple script.
You can accomplish setting up a new EC2 instance on startup using the official Ubuntu AMIs, the official Amazon Linux AMIs, and any other AMI that supports the concept of a user-data script.
Create a script (bash, Perl, Python, whatever) that starts with #!
Pass this script as the user-data when running the EC2 instance.
The script will automatically be run as root on the first boot.
Here's the article where I introduced the concept of a user-data script:
Automate EC2 Instance Setup with user-data Scripts
http://alestic.com/2009/06/ec2-user-data-scripts
Your user-data script can install the required software, configure it, install your work script, and set up a cron job that runs the work script once a day.
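As a rough sketch of such a user-data script, the helper below assembles one as text: it starts with #!, installs packages, fetches the work script, and registers a daily cron entry. Everything specific is an assumption for illustration: the helper name, the example URL, the file paths, the 02:00 schedule, an apt-based AMI such as Ubuntu, and curl being available on it.

```python
import shlex


def build_user_data(packages, work_script_url, cron_schedule="0 2 * * *"):
    """Return a bash user-data script (as a string) for first boot."""
    # /etc/cron.d entries need a user field, like /etc/crontab.
    cron_line = cron_schedule + " root /usr/local/bin/work.sh"
    lines = [
        "#!/bin/bash",  # user-data must start with #! to be executed
        "set -e",       # stop at the first failing step
        "apt-get update",
        "apt-get install -y " + " ".join(shlex.quote(p) for p in packages),
        "curl -fsSL -o /usr/local/bin/work.sh " + shlex.quote(work_script_url),
        "chmod +x /usr/local/bin/work.sh",
        # Quoting keeps the shell from expanding the * fields.
        "echo " + shlex.quote(cron_line) + " > /etc/cron.d/daily-work",
    ]
    return "\n".join(lines) + "\n"
```

For example, `build_user_data(["ruby"], "https://example.com/work.sh")` produces the text you would pass as user-data when launching the instance. Writing the bash script by hand works just as well; generating it only helps when the package list or schedule varies.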
ENHANCEMENT:
If the installation script doesn't take a long time to run (e.g., under an hour or a few hours), then you don't even have to run a single dedicated instance 24 hours a day. You can instead use an approach that lets AWS start an instance for you on a regular schedule.
Here's an article I wrote that provides details on this approach with sample commands:
Running EC2 Instances on a Recurring Schedule with Auto Scaling
http://alestic.com/2011/11/ec2-schedule-instance
The general approach is to use Auto Scaling to start an instance with your user-data script on a regular schedule. Your job will terminate the instance when it has completed. The key is to suspend Auto Scaling's normal desire to restart instances that terminate, so that you don't pay for a running instance until the next time your job starts.
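The "job terminates the instance when it has completed" step can be sketched as a wrapper that always issues a shutdown after the work script, even if the script fails. This is an illustration, not the article's code: the function name is made up, and it assumes the process runs as root and that the instance is configured so that an OS-level shutdown terminates it rather than merely stopping it.

```python
import subprocess


def run_then_shutdown(job_cmd,
                      shutdown_cmd=("shutdown", "-h", "now"),
                      runner=subprocess.run):
    """Run the job, then always issue the shutdown command.

    Returns the job's exit code (mostly useful when trying it out with
    a harmless shutdown_cmd).
    """
    try:
        return runner(list(job_cmd)).returncode
    finally:
        # Runs whether the job succeeded, failed, or could not start,
        # so a broken job does not leave a paid instance running.
        runner(list(shutdown_cmd))
```

While testing on a machine you do not want to power off, pass something harmless such as `shutdown_cmd=("echo", "would shut down")`.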
I am a beginner to Hadoop.
As I understand it, the Hadoop framework runs jobs in FIFO order (the default scheduling).
Is there any way to tell the framework to run a job at a particular time?
That is, can I configure a job to run daily at 3 PM?
Any input on this is greatly appreciated.
Thanks, R
What about calling the job from an external Java scheduling framework, like Quartz? Then you can run the job whenever you want.
You might consider using Oozie (http://yahoo.github.com/oozie/). It allows (among other things):
Frequency execution: Oozie workflow specification supports both data
and time triggers. Users can specify execution frequency and can wait
for data arrival to trigger an action in the workflow.
It is independent of any other Hadoop schedulers and should work with any of them, so probably nothing in your Hadoop configuration will change.
How about having a script that executes your Hadoop job and then using the at command to run it at a specified time? If you want the job to run regularly, you could set up a cron job to execute your script.
I'd use a commercial scheduling app and/or a custom workflow solution if cron does not cut it. We use a solution called JAMS, but keep in mind it's .NET-oriented.