Concurrent batch jobs writing logs to database - shell

My production system has nearly 170 Ctrl-M jobs (essentially cron jobs) running every day. These jobs are woven together (by creating dependencies) to perform ETL operations. For example, a Ctrl-M job (Ctrl-M being a scheduler, like cron) almost always starts with a shell script, which then executes a bunch of Python scripts, Hive scripts, or map-reduce jobs in a specific order.
I am trying to add logging to each of these processes to be able to better monitor the tasks and the pipelines as a whole. The logs would be used to build a monitoring dashboard.
Currently I have implemented logging using a central wrapper which is called by each of the processes to log information. This wrapper in turn opens a Teradata connection EACH time and calls a Teradata stored procedure to write into a Teradata table.
This works fine for now. But in my case, multiple concurrent processes (spawning even more parallel child processes) run at the same time, and I have started experiencing dropped connections while doing some load testing. Below are the approaches I have been thinking about:
Make the processes write to some kind of message queue (e.g. AWS SQS). A listener would pick data from the queue asynchronously and then batch-write it to Teradata (a rough sketch of this follows the list).
Use files or some intermediate structure to perform batch writes to the Teradata DB.
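Here is roughly what I have in mind for option 1 (a sketch only, assuming boto3; the queue name and message fields are placeholders, and the batched write to Teradata is not shown):

```python
# Sketch only: queue name and message fields are placeholders.
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = sqs.get_queue_url(QueueName="etl-log-events")["QueueUrl"]  # hypothetical queue

def log_event(job_name, step, status, message):
    """Called by each process instead of opening a Teradata connection directly."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(
            {"job": job_name, "step": step, "status": status, "message": message}
        ),
    )
```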
I would definitely like to hear your thoughts on these or any better approaches. Eventually the logging endpoint will be shifted to Redshift, hence I am thinking along the lines of AWS SQS queues.
Thanks in advance.

I think Kinesis Firehose is the perfect solution for this. Setting up a Firehose delivery stream is incredibly quick and easy to configure, it is very inexpensive, and it will stream your data to the S3 bucket of your choice and optionally load your logs directly into Redshift.
If Redshift is your end goal (or even just S3), Kinesis Firehose couldn't make it easier.
https://aws.amazon.com/kinesis/firehose/
Amazon Kinesis Firehose is the easiest way to load streaming data into AWS. It can capture and automatically load streaming data into Amazon S3 and Amazon Redshift, enabling near real-time analytics with existing business intelligence tools and dashboards you're already using today. It is a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. It can also batch, compress, and encrypt the data before loading it, minimizing the amount of storage used at the destination and increasing security. You can easily create a Firehose delivery stream from the AWS Management Console, configure it with a few clicks, and start sending data to the stream from hundreds of thousands of data sources to be loaded continuously to AWS – all in just a few minutes.
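For illustration, a minimal sketch of pushing a log record to a Firehose delivery stream with boto3 (the stream name and record fields are assumptions; the S3/Redshift destination is configured on the delivery stream itself):

```python
# Sketch: "etl-logs" is a hypothetical delivery stream whose S3/Redshift
# destination is configured separately in the Firehose console.
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

def log_event(job_name, status, message):
    record = json.dumps({"job": job_name, "status": status, "message": message}) + "\n"
    firehose.put_record(
        DeliveryStreamName="etl-logs",
        Record={"Data": record.encode("utf-8")},
    )
```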

Related

Pause/Resume Flink job during migrations

I'm using Apache Flink to propagate updates from a given set of Kafka topics into an Elasticsearch cluster.
The problem I'm facing is that sometimes the Elasticsearch cluster evolves and I have to (1) modify the mappings and (2) copy over the data... and by the time I can point the Flink jobs to the new alias/index, plenty of updates have already made it to the old index.
So I wonder what's the best way to approach this. I can have downtime, but I would like to avoid it if possible. I was trying to make the Flink jobs slow down or pause the (Kafka) input sources until the migration finishes, but I didn't find any endpoint for this.
The Flink jobs run in application mode.
If anyone can shed some light on how to accomplish this (pause/resume the jobs via an API or something similar), I would really appreciate the input. The only constraint I have is around stopping the applications (as in stopping/killing pods): it's possible, but too troublesome due to access constraints on the Kubernetes clusters.
I'd probably look into stopping the job with a savepoint using the Flink REST API: https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/rest_api/#jobs-jobid-stop
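A hedged sketch of driving that endpoint from Python (the JobManager address, job id and savepoint directory are placeholders):

```python
# Sketch only: JobManager URL, job id and savepoint path are placeholders.
import time
import requests

FLINK = "http://jobmanager:8081"   # Flink JobManager REST endpoint
JOB_ID = "<job-id>"                # e.g. from GET /jobs

# Ask Flink to stop the job gracefully, taking a savepoint first.
resp = requests.post(
    f"{FLINK}/jobs/{JOB_ID}/stop",
    json={"targetDirectory": "s3://my-bucket/savepoints", "drain": False},
)
trigger_id = resp.json()["request-id"]

# Poll until the savepoint completes; afterwards run the migration and
# resubmit the job from the savepoint location.
while True:
    status = requests.get(f"{FLINK}/jobs/{JOB_ID}/savepoints/{trigger_id}").json()
    if status["status"]["id"] == "COMPLETED":
        print(status.get("operation"))
        break
    time.sleep(2)
```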
If that Flink app is pretty big and has lots of state, you can also try to simply stop sending data to the input Kafka topic instead of stopping the job (assuming it can write properly with the new mappings and indices after you've made the required ES cluster changes, without any change to the Flink job). It adds a bit of overhead, but you could have different topics for your producers and for the Flink sources, and have another simple Flink job mirror data from one topic (where the producers write) to the other (where Flink consumes from). When you want to stop writing to ES, just stop that mirroring job using the REST API. To avoid writing a new Flink job you could use MirrorMaker or similar, but to stop it you may have to kill its pod.
Or another option is architecting the Elasticsearch indices so they can support your cluster evolution without having to stop the Flink app. It is hard to know what exactly you'd need to change, but by writing into aliases and playing with the write-index flag you may be able to achieve what you want. I've done this in the past, but it is true that if your mappings change a lot it may be hard to do.
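For the alias idea, a small sketch of atomically flipping the write index behind an alias while the Flink sink keeps writing to the alias (index and alias names are made up):

```python
# Sketch: index and alias names are hypothetical.
import requests

ES = "http://elasticsearch:9200"

# Atomically point the write alias at the new index; the old index stays
# readable behind the same alias while data is reindexed/copied over.
requests.post(f"{ES}/_aliases", json={
    "actions": [
        {"add": {"index": "events-v1", "alias": "events", "is_write_index": False}},
        {"add": {"index": "events-v2", "alias": "events", "is_write_index": True}},
    ],
})
```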

Process a stream of sessions on aws

Is there a way to implement something like Flink's session window on AWS with Lambda and some way of managing messages?
We have a stream of small events with a session id. We cannot guarantee the order of the arriving events and we don't always have a session-finished event. We know that session ids are unique. We also know that when a session is finished it won't be restarted. We also know that when the session is active we will receive a message every minute or so. We need to process the entire session as a whole.
We want to wait for a silent time of X minutes, and if no messages arrive we will process the entire session as a whole.
This is exactly what Flink's session window does. Is there a way to do the same thing purely using AWS Lambda and its triggers?
There can be tens of millions of sessions active at the same time.
It's not possible with an AWS Lambda.
Lambdas are stateless: they are able to process messages one by one, but they cannot offer any processing over a sequence of messages, which would be required for the kind of windowing logic you describe.
Maybe an option for you would be Kinesis Data Analytics? Under the hood it is actually Flink, although it's provided as a managed service by AWS, so maybe you'll get the "Lambda-like" experience you're looking for there?

Microservice failure Scenario

I am working on a microservice architecture. One of my services is exposed to a source system, which is used to post the data. This microservice publishes the data to Redis; I am using Redis pub/sub. The data is then consumed by a couple of other microservices.
Now if one of the other microservices is down and not able to process the data from Redis pub/sub, then I have to retry with the published data when that microservice comes back up. The source cannot push the data again. Since the source cannot re-push the data and manual intervention is not possible, I thought of three approaches:
Additionally use Redis itself (its data structures) for storing and retrieving the published data.
Use a database for storing the data before publishing. I have many source and target microservices which use Redis pub/sub, so with this approach I would have to insert the request into the DB first and then its response status, every time. I would also have to use a shared database; this approach adds a couple more exception-handling cases and doesn't look very efficient to me.
Use Kafka in place of Redis pub/sub. As traffic is low I went with Redis pub/sub, and it is not feasible to change now.
In both of the above cases, I have to use a scheduler, and there is a limited duration within which I have to retry, else subsequent requests will fail.
Is there any other way to handle the above cases?
For point 2:
- Store the data in a DB.
- Create a daemon process which will process the data from the table.
- This daemon process can be configured as per your needs.
- The daemon process will poll the DB and publish the data, if any. It will also delete the data once published.
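A minimal sketch of such a daemon, assuming redis-py and a relational table of pending messages (the table name, channel and connection details are made up):

```python
# Sketch only: table name, channel and connection details are assumptions.
import time
import sqlite3   # stands in for whatever shared DB you use
import redis

db = sqlite3.connect("pending.db")
r = redis.Redis(host="localhost", port=6379)

def poll_and_publish():
    rows = db.execute("SELECT id, payload FROM pending_messages ORDER BY id").fetchall()
    for row_id, payload in rows:
        r.publish("events", payload)   # re-publish to the pub/sub channel
        db.execute("DELETE FROM pending_messages WHERE id = ?", (row_id,))
    db.commit()

while True:   # the daemon loop; the interval is configurable
    poll_and_publish()
    time.sleep(30)
```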
Not in a microservice architecture, but I have seen this approach working efficiently while communicating with 3rd-party services.
At the very outset, as you mentioned, we do indeed seem to have only three possibilities.
This is one of those situations where you want to get a handshake from the service after pushing and after processing. In order to accomplish that, using a middleware queuing system would be the right approach.
Although a bit more complex to accomplish, what you can do is use Kafka for streaming this. Configuring producer and consumer groups properly can help you do the job smoothly.
Using a DB for storage would be overkill, considering that in your situation the data is to be processed rather than persisted long-term.
But, alternatively, storing the data in Redis and reading it in a cron job/scheduled job would make your job much simpler. Once the job has run successfully, you can remove the data from the cache and thus save Redis memory.
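A small sketch of that alternative, buffering events in a Redis list and draining it from a scheduled job (the key name and the processing step are placeholders):

```python
# Sketch: the list key and the processing step are placeholders.
import redis

r = redis.Redis(host="localhost", port=6379)

def buffer_event(payload: str):
    """Called by the publishing service alongside (or instead of) pub/sub."""
    r.lpush("pending:events", payload)

def process(payload):
    # Placeholder: whatever the consuming microservice should do with the event.
    print("processing", payload)

def drain():
    """Run from a cron/scheduled job; pops items until the list is empty."""
    while True:
        payload = r.rpop("pending:events")
        if payload is None:
            break
        process(payload)
```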
If you can comment further more on the architecture and the implementation, I can go ahead and update my answer accordingly. :)

streaming data from oracle with kafka

I'm starting with Kafka and I need to capture the inserts in a specific Oracle table and send the new records through Kafka as they happen. I have no control over the database, so, in principle, Debezium is excluded. How can I do this without using triggers?
I've made a producer that reads data from Oracle with a Java program in Eclipse, but that would make constant requests to the database. I use Java to simulate an ETL with a consumer.
PS: I work with Windows but that's secondary.
If I understand your problem correctly, you are trying to route inserts from Kafka to an Oracle database. There could be a few possibilities:
You implement a Kafka consumer, and as soon as your Kafka cluster gets a message the consumer makes an insert. You could reuse your Java code here; just remove the polling part. Please visit here
If you have Kafka deployed in a cloud environment and are using it as a service (AWS MSK), you would have the option of handling the events there. Again, you can use a Java program or write a Python script to make the inserts. Please visit here
I would like to understand your throughput requirements: whether you really need Kafka as a distributed messaging system, or whether a simple AWS SQS queue would work just fine. If you can use SQS, things would be straightforward for you. You create a queue and you write a listener in Python or Java (a rough sketch follows); boto3 is an excellent Python library for working with SQS.
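A minimal boto3 listener sketch, with a hypothetical queue URL and the actual insert left as a stub:

```python
# Sketch: the queue URL and the insert logic are placeholders.
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/records"  # hypothetical

def insert_record(body: str):
    # Placeholder: replace with your existing insert logic (e.g. cx_Oracle / JDBC).
    print("would insert:", body)

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,   # long polling
    )
    for msg in resp.get("Messages", []):
        insert_record(msg["Body"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```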

scheduling jobs using spring batch or just Quartz scheduler

I am looking for the best solution for creating a Java web application that generates reports in Excel/PDF format, something similar to Google AdWords, where a user can schedule reports and download them when the report has been generated at a later time.
I am thinking of developing a Java application where the user logs in, selects a predefined report and provides the input parameters (like report date etc.). This request will be queued up or saved as a Quartz job (I prefer a persistent queue). A job will monitor the queue, execute the request, generate the report (output Excel/PDF) and store it on disk.
When the user refreshes the screen or logs back in at a later time, the report should be available for download.
Can I do this using Spring Batch and the Quartz scheduler? I am also expecting something like Spring Batch Admin, where I can see the number of requests in the queue (jobs queued up), stop the queue processing, etc.
You would use spring-batch if you wanted to process all report requests at the same time, perhaps at night when your servers are not otherwise occupied processing real-time user requests (or even during the day during slow periods).
You would use a quartz job if you wanted to check for new jobs every few seconds/minutes/hours/etc, and process one/many of them at that specified time interval.
So, Quartz is a scheduler and Spring Batch is a processing framework. You could use Quartz to schedule batch jobs to run at specific times. They aren't competing technologies; they are complementary.
About your question:
Given that you talk about queues and their persistence, however, it sounds a lot like your problem would fit a simple JMS model. You would need some messaging software. If you want to make it easy on yourself, I'd recommend using spring-jms as a wrapper around the basic Java EE JMS API; the Spring wrappers are simply simpler than basic JMS. For a messaging service I'd look at RabbitMQ, because again it's pretty simple.
With the JMS architecture you'd post user requests to the queue, which you'd configure to be persistent. You'd have a custom listener on the queue, passing requests to a report generator as they arrive. You can assign one or more threads to the listener, meaning you should find it easy to tune the performance of the report generator.
There is a pretty useful DZone article about using rabbitmq via spring-integration (a set of prebuilt pattern implementations that help with connecting things to each other).
