Why does airflow have an SQS integration if it - etl

I'm trying to do ETL for messages on an SQS queue and airflow has an SQS integration (sensor) which makes me think that would be capable of constantantly polling SQS to run DAGs. However, that does not appear to be the case. The fastest it can run dags and poll for messages is once every few seconds, which does not work for the large quantity of data that I'm trying to consume.
I'm wondering why airflow has the SQS integration, or if it is even the right tool for the job.

Airflow is predominantly used for batch, or recurring processing on a relatively fixed schedule and isn't designed to do "live" processing. If your use case can be done in batches of, say 30 minutes, that may be more suited to Airflow.

Related

ActiveMQ- Can I Replace The Scheduler Plugin With A Delayed Message Queue?

I worked a little with the ActiveMQ scheduler plugin. This simplifies scheduling messages for delivery with a delay at low volume, but as I get into the 100ks of messages the system breaks down in two key ways.
It's very slow (compared to queues) to enqueue messages in the scheduler.
Attempting to view the schedules in the dashboard crashes the ActiveMQ instance.
The existing scheduler feels a little bolted on and does not perform as expected. So, rethinking the problem I would like to have a jobs and jobs-scheduled queue. Messages sent to the jobs-scheduled queue will have a ttl header with the unix timestamp for when it should be delivered. A process will run on a cron job which will take messages from the jobs-scheduled queue and send it to the jobs queue using a selector to just pick out the messages with an elapsed ttl convert_string_expressions:ttl < %(now)s.
My two questions are:
Will this strategy work for delaying messages at scale or will I find scaling pains around the selector? These messages will be persisted if that makes a difference.
Is there an existing feature in ActiveMQ that will allow me to send messages from one queue to another with a selector query?
ActiveMQ is a message broker not a job scheduler so what you are trying to do is really outside the scope of the what the broker is intended to do. Yes ActiveMQ does have a scheduled message feature but this is not intended for large scale job queue type work, it is a simple feature to provide some minimal delayed delivery.
What you are looking for sounds more like Quartz or some other batch job scheduling library. You could develop your own Job scheduler implementation for ActiveMQ or do something in a plugin but you are really trying to run against the grain of what a broker is meant to do which is deliver messages as quickly as possible in a decoupled manner.
Side note-- potentially off-topic.
I've had to solve a similar situation in the past where it made a lot of sense to load up the queues with messages ahead of time to cut down on the total transfer time.
I solved it by using Camel routes and a side-channel activation. Camel allows you to programmatically start and stop routes, so you can load up a queue with no consumers for the data for a given time period. Then using a dedicated queue for control you send the 'start' message. The control route receives the 'start' message, and then activates the main data processing route. You then need to configure some sort of 'stop' message semantic to be ready for the next time periods run.
Effectively, you get the delayed behavior pattern with much more control over scheduling and cut down on the data-to-queue loading time problem. You can also solve the scaling problem by loading the data across more than one queue.

Scheduling task at some specific time in Java

I have some code execution which will scheduled many jobs at different date-time. So overall I will have lot of jobs to run at specific date-time. I know that there is Spring Scheduler which will execute a job at some time period, but it does not schedule a job dynamically. I can use ActiveMQ with timed delivery or Quartz for my purpose but looking for a little suggestion. Shall I use Quartz or ActiveMQ timed/delayed delivery or something else.
There is another alternative as well in Executor service with timed execution, but if application restarts then the job will be gone I believe. Any help will be appreciated.
While you can schedule message delivery in ActiveMQ it wasn't designed to be used as a job scheduler whereas that's exactly what Quartz was designed for.
In one of your comments you talked about wanting a "scalable solution" and ActiveMQ won't scale well with a huge number of scheduled jobs because the more messages which accumulate in the queues the worse it will perform since it will ultimately have to page those messages to disk rather than keeping them in memory. ActiveMQ, like most message brokers, was meant to hold messages for a relatively short amount of time before they are consumed. It's much different than a database which is better suited for this use-case. Quartz should scale better than ActiveMQ for a large number of jobs for this reason.
Also, the complexity of the jobs you can configure in Quartz is greater. If you go with ActiveMQ and you eventually need more functionality than it supports then that complexity will be pushed down into your application code. However, there's a fair chance could simply do what you want with Quartz since it was designed as a job scheduler.
Lastly, a database is more straight-forward to maintain than a message broker in my opinion and a database is also easy to provision in most cloud providers. I'd recommend you go with Quartz.
You can start by using a cron-expression in order to cover the case when your application will restart. The cron-expression can be stored in the properties file. Also, when your application will be scheduled, you can restart or reschedule your job programatically by creating a new job instance with another cron-expression for example.

Concurrent batch jobs writing logs to database

My production system has nearly 170 Ctrl-M jobs(essentially cron jobs) running every day. These jobs are weaved together(by creating dependencies) to perform ETL operations. eg: Ctrl-M(scheduler like CRON) almost always start with a shell script, which then executes a bunch of python, hive scripts or map-reduce jobs in a specific order.
I am trying to implement logging into each of these processes to be able to better monitor the tasks and the pipelines in whole. The logs would be used to build a monitoring dashboard.
Currently I have implemented logging using a central wrapper which would be called by each of the processes to log information. This wrapper in turn opens up a teradata connection EACH time and calls a teradata stored procedure to write into a teradata table.
This works fine for now. But in my case, multiple concurrent processes (spawning even more parallel child processes) run at the same time and I have started experiencing dropped connections while doing some load testing. Below is an approach I have been thinking about:
Make processes write to some kind of message queues(eg: AWS sqs). A listener would pick data from these message queues asynchronously and then batch write to teradata.
Using files or some structure to perform batch writing to teradata db.
I would definitely like to hear your thoughts on that or any other better approaches. Eventually the end point of logging will be shifted to redshift and hence thinking in the lines of AWS SQS queues.
Thanks in advance.
I think Kinesis firehose is the perfect solution for this. Setting up a the firehose stream is incredibly quick and easy to configure, very inexpensive and will stream your data to s3 bucket of your choice and optionally stream your logs directly to redshift.
If redshift is your end goal (or even just s3), kinesis firehose couldn't make it easier.
https://aws.amazon.com/kinesis/firehose/
Amazon Kinesis Firehose is the easiest way to load streaming data into
AWS. It can capture and automatically load streaming data into Amazon
S3 and Amazon Redshift, enabling near real-time analytics with
existing business intelligence tools and dashboards you’re already
using today. It is a fully managed service that automatically scales
to match the throughput of your data and requires no ongoing
administration. It can also batch, compress, and encrypt the data
before loading it, minimizing the amount of storage used at the
destination and increasing security. You can easily create a Firehose
delivery stream from the AWS Management Console, configure it with a
few clicks, and start sending data to the stream from hundreds of
thousands of data sources to be loaded continuously to AWS – all in
just a few minutes.

scheduling jobs using spring batch or just Quartz scheduler

I am looking for best solution to create a java web application to generate reports in excel/PDf format. some thing similar to Google Adwords, where user can create schedule reports and download it when the report is generated at a later time.
I am thinking to develop and java application where User logs, selects a pre defined report and provides the input parameters (like report date etc), This request will be queued up or saved as Quarts Job(prefer persistent Queue). A Job will be monitoring the queue/job and execute the job, generate the report(output excel /pdf) and stored in disk.
When the user refresh the screen or logs back at a later time, the report should be available for down load.
Using Spring batch and Quartz scheduler can I do this ? I also expecting like Spring admin , where I can see number of request in Queue(jobs queued up), and stop the queue processing etc.
You would use spring-batch if you wanted to process all report requests at the same time, perhaps at night when your servers are not otherwise occupied processing real-time user requests (or even during the day during slow periods).
You would use a quartz job if you wanted to check for new jobs every few seconds/minutes/hours/etc, and process one/many of them at that specified time interval.
So, quartz is a scheduler and batch is a process. You could use quartz to schedule batch jobs to run at specific times. They aren't competing technologies, they are complimentary.
About your question:
Given that you talk about queues and their persistence however it sounds a lot like your problem would fit into a simple jms model. You would need some messaging software. If you want to make it easy on yourself I'd recommend using spring-jms as a wrapper around the basic Java EE JMS api -- the spring wrappers are simply simpler than basic jms. For a messaging service I'd look at RabbitMQ, because again it's pretty simple.
With the jms architecture you'd post user requests to the queue, which you'd configured to be persistent. You'd have a custom listener on the queue, passing requests to a report generator whenever it runs. You can assign one or more threads to the listener, meaning that you should find it easy to tune the performance of the report generator.
There is a pretty useful DZone article about using rabbitmq via spring-integration (a set of prebuilt pattern implementations that help with connecting things to each other).

Message queue that delivers delayed messages and checks for uniqness

I have an online service that receives incoming events (few every second). Service needs to process a job when there were no events for 30 seconds or more. Service is distributed across several PCs and uses Amazon webservices (SQS and SimpleDB) as a backbone.
I understand how can I schedule a job when there IS an incoming event (just put a message into message queue and you are done), but how can I schedule a job when the condition is "NO EVENTS FOR X SECONDS" ?
Ideally I would want a message queue that does not allow duplicate messages, allows scheduling for the future and allows adjusting "delivery date" on each message.
Is there such a message queue implementation?
Is this problem can be solved at all without persisting some data in database?
Thank you
Both BizTalk or SQL Server Service Broker fit your requirements. If they are too heavyweight, you could write a simple service that peeks the queue every couple of seconds and times out if it does not see anything in 30 seconds. That would be more difficult to scale horizontally across machines, however.

Resources