Spring batch, where/how to save the metadata about jobs - spring

How do I set any external database (mysql, postgres I'm not concerned with which one at this point) for usage with metadata?
At the moment I have spring batch writing the results of jobs to Mongodb and that works fine but I'm not keeping track of job status so the jobs are being run from the start every time even if interrupted halfway though.
There are plenty examples of how to avoid doing this, but can't seem to find a clear answer on what I need to configure to send the metadata somewhere real rather than in-memory.
I attempted adding a properties file but that had no effect
# for Postgres:
batch.jdbc.driver=org.postgresql.Driver
batch.jdbc.url=jdbc:postgresql://localhost/postgres
batch.jdbc.user=postgres
batch.jdbc.password=mysecretpassword
batch.database.incrementer.class=org.springframework.jdbc.support.incrementer.PostgreSQLSequenceMaxValueIncrementer
batch.schema.script=classpath:/org/springframework/batch/core/schema-postgresql.sql
batch.drop.script=classpath:/org/springframework/batch/core/schema-drop-postgresql.sql
batch.jdbc.testWhileIdle=false
batch.jdbc.validationQuery=

There are plenty examples of how to avoid doing this, but can't seem to find a clear answer on what I need to configure to send the metadata somewhere real rather than in-memory.
You need to configure a bean of type DataSource in your batch application context (or extend the DefaultBatchConfigurer and set the data source you want to use to store meta-data).
There are many samples here: https://github.com/spring-projects/spring-batch/tree/master/spring-batch-samples
You can find the data source configuration here: https://github.com/spring-projects/spring-batch/blob/master/spring-batch-samples/src/main/resources/data-source-context.xml

Related

Correct scope for multi threaded batch jobs in spring

I believe I've got a scoping issue here.
Project explanation:
The goal is to process any incoming file (on disk), including meta data (which is stored in an SQL database). For this I have two tasklets (FileReservation and FileProcessorTask) which are the steps in the overarching "worker" jobs. They wait for an event to start their work. There are several threads dealing with jobs for concurrency. The FileReservation tasklet sends the fileId to FileProcessorTask using the job context.
A separate job (which runs indefinitely) checks for new file meta data records in the database and upon discovering new records "wakes up" the FileReservationTask tasklets using a published event.
With the current configuration the second step in a job can receive a null message when the FileReservation tasklets are awoken.
If you uncomment the code in BatchConfiguration you'll see that it works when we have separate instances of the beans.
Any pointers are greatly appreciated.
Thanks!
Polling a folder for new files is not suitable for a batch job. So using a Spring Batch job (filePollingJob) is not a good idea IMO.
Any pointers are greatly appreciated.
Polling a folder for new files and running a job for each incoming file is a common use case, which can be implemented using a java.nio.file.WatchService or a FileInboundChannelAdapter from Spring integration. See How do I kickoff a batch job when input file arrives? for more details.

How to reload metadata in teiid

I need to reload metadata in spring-boot-teiid. How to i can get it?
Need two methods
Reload in application by cron #Sheduled
Reload from call endpoint in actuator
And another question, can i get only metadata delta (updated metadata)?
With the current model, the metadata load only happens during the deployment of the VDB, which in Teiid using spring boot is always at the start of the application. You can figure out ways to redeploy the VDB to update the metadata or bootstrap other ways to update metadata of a single source and then update the VDB (redeploy) with that to take the changes into effect.
we have not invested in any utilities to give delta of metadata changes. Teiid does have a lot of visitor pattern implementations, I suppose one could traverse the metadata tree that generates a report like that with some work.

Logging for Talend job running within spring-boot

We have talend-jobs triggered within Spring-boot application. Is there any way to configure the output of talend-jobs to the application log files?
One workaround we find is to write logs directly to an external file (filePath passed as context-param). But wanted to find if there is a better way to configure this seamlessly.
Not sure if I understood the question correctly, but I guess your concerns might be on what might have happened to the triggered Jobs.
Logging
With Respect to Logging for Talend, You could configure using Log4j,
https://help.talend.com/reader/5DC~TBhDsBie5JTXyVLW4g/QSGCZJKXo~uhKvZDq1DxUg
Monitoring
Regarding the Status of the Job Executed, you could get the execution details retrieved using REST Call(Talend Metaservlet API).
getTaskExecutionStatus
https://help.talend.com/reader/oYf9gKhmYrkWCiSua4qLeg/SLiAyHyDTjuznLR_F~MiQQ
By Modifying the Existing Talend Job,You could also design a like a feedback loop, ie Trigger a REST Call back to your application. With the details of Execution from Talend Job.

Neo4j in memory db

I saw Neo4j can run as Impermanent DB for unit testing porpouses, I'm not sure if this fits my needs. I have my data stored in neo4j the usual way (persistent) but, starts from my data, I want to let each user start an "experimental session": the users add/delete nodes and relationships, but NOT in permanent way, just experimenting with the data (after that session the edits should be lost). The edits shouldn't be saved and obiouvsly they shouldn't be visibile to the others. What's the best way to accomplish that?
Using impermanent database should work. You would
need to import the data to each new database
spring-data-neo4j is not able to connect to multiple databases (in current release), you would need to start multiple instances of your application, e.g. in a tomcat container
when your application stops (or crashes) you would obviously lose data
Or you could potentially use only 1 database with the base data being public (= visible to everyone) and then for all new nodes/relationships you can add owner property.
When querying the data you would check the property is either public or the current user.
At the end of the session you would just delete all nodes and relationships with given owner.
If you also want to edit existing data then it gets more complicated, you could create a copy of the node/relationship and somehow handle that, or if it's not too large copy whole dataset.
You can build a docker image from the neo4j base image (or build your own) and copy your graph.db into it.
Then you can have every user start a docker container from said image.
If that doesn't answer your question, more info is needed.

Spring Batch: Horizontal scaling of Job Repository

I read a lot about how to enable parallel processing and chunking of an individual job, using Master/Slave paradigm. Consider an already implemented Spring Batch solution that was intended to run on a standalone server. With minimal refactoring I would like to enable this to horizontally scale and be more resilient in production operation. Speed and efficiency is not a goal.
http://www.mkyong.com/spring-batch/spring-batch-hello-world-example/
In the following example a Job Repository is used that connects to an initializes a database schema for the Job Repository. Job initiation requests are fed to a message queue, that a single server, with a single Java process is listening on via Spring JMS. When encountering this it executes a new Java process that is the Spring Batch job. If the job has not been started according to the Job Repository it will begin. If the job had failed it will pick up where the job left off. If the job is in process it will ignore.
The single point of failure is the single server and single listening process for job initiation. I would like to increase resiliency by horizontally scaling identical server instances all competing for who can first grab the job initiation message when it first appears in the queue. That server instance will now attempt to run the job.
I was conceiving that all instances of the JobRepository would share the same schema, so they can all query for when the status is currently in process and decide what they will do. I am unsure though if this schema or JobRepository implementation is meant to be utilized by multiple instances.
Is there a risk in pursuing this that this approach could result in deadlocking the database? There are other constraints to where the Partition features of Spring Batch will not work for my application.
I decided to build a prototype to test if the condition that the Spring Batch Job Repository schema and SimpleJobRepository can be used in a load balanced way with multiple Spring Batch Java processes running concurrently. I was afraid that deadlock scenarios might have occurred at the database to where all running job processes get stuck.
My Test
I started with the mkyong Spring Batch HelloWorld example and made some changes to it where it could be packaged into a Jar that can be executed from the command line. I also removed the initialize database step defined in the database.config file and manually established a local MySQL server with the proper schema elements. I added a Job parameter for time to be the current time in millis so that each job instance would be unique.
Next, I wrote a separate Java main class that used Apache Commons Exec framework to create 50 sub processes with no wait between them. Each of these processes have a Thread.sleep for 1 second within their Processor objects as well so that a number of processes will all kick off at the same time and all attempt to access the database at the same time.
Results
After running this test a number of times in a row I see that all 50 Spring batch processes consistently complete successfully and update the same database schema correctly. I don't see any indication that if there were multiple Spring Batch job processes running on multiple servers connecting to the same database that they would interfere with each other on the schema nor do I see any indication that a deadlock could happen at this time.
So it sounds as if load balancing of Spring Batch jobs without the use of advanced Master/Slave and Step Partitioning approaches is a valid use case.
If anybody would like to comment on my test or suggest ways to improve it I would appreciate it.
Here is excerpt from
Spring Batch docs on how Spring Batch handles database updates for its repository:
Spring Batch employs an optimistic locking strategy when dealing with updates to the database. This means that each time a record is 'touched' (updated) the value in the version column is incremented by one. When the repository goes back to save the value, if the version number has changed it throws an OptimisticLockingFailureException, indicating there has been an error with concurrent access. This check is necessary, since, even though different batch jobs may be running in different machines, they all use the same database tables.

Resources