How to pass parameters from web requests to a Spring Boot YARN application

I'm using spring-boot and spring-boot-yarn to submit yarn applications to a cluster.
My use-case is close to the one described in this tutorial https://github.com/spring-guides/gs-yarn-basic.
The only difference is that my 'client' is supposed to be a web application that submits the yarn jobs when web requests are made.
The problem I have is that web requests to the 'client' web-application provide parameters I need to pass down to the yarn job.
In the above tutorial, parameters are passed as command line arguments to the appmaster/container, specified in application.yml. In my case this approach does not work since I have a different set of parameters for each yarn job.
Is there a way to pass dynamic parameters to yarn jobs without hard-coding them in application.yml?

The original idea was to prevent "rogue" users or applications from passing properties that would automatically end up as command-line options, potentially doing harm within a Hadoop cluster.
It's worth checking my answer to Spring Boot Yarn - Passing Command line arguments to see if this is what you want.
Having said that, you are not the first person to ask this or to "complain" that it is too difficult or unclear how to do it. We're going to make this much easier in future releases, mostly because it just seems to be what users want to do.
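In the meantime, here is a minimal sketch of a web endpoint that submits a YARN application per request, shipping the request parameter to the containers as an environment variable rather than a hard-coded command-line option. The endpoint path, parameter and variable names are hypothetical, and it assumes the autowired YarnClient is an AbstractYarnClient subclass (the default CommandYarnClient is), so that setEnvironment(...) is available; verify against your Spring YARN version:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.yarn.client.AbstractYarnClient;
import org.springframework.yarn.client.YarnClient;

@RestController
public class JobSubmitController {

    @Autowired
    private YarnClient yarnClient;

    @RequestMapping("/jobs/submit")
    public String submit(@RequestParam("input") String input) {
        // Ship the per-request parameter as an environment variable; the
        // container can read it back via System.getenv("JOB_INPUT").
        Map<String, String> env = new HashMap<String, String>();
        env.put("JOB_INPUT", input);
        // Note: mutating a shared client bean is not thread-safe; guard it or
        // build a fresh client per request in real code.
        ((AbstractYarnClient) yarnClient).setEnvironment(env);
        ApplicationId applicationId = yarnClient.submitApplication();
        return applicationId.toString();
    }
}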

Related

Activate Batch on only one Server instance

I have an nginx load balancer in front of two Tomcat instances, each containing a Spring Boot application. Each Spring Boot application executes a batch that writes data to a database.
The batch executes every day at 1am.
The problem is that both instances execute the batch simultaneously, which I don't want.
Is there a way to keep the batches deployed on both instances and tell Tomcat or nginx to start the batch on the master server only (so the slave server doesn't run it)?
If one of the servers stops, the second server could start the batch on its behalf.
Is there a tool in nginx or Tomcat (or some other technology) to do that?
Thank you in advance.
Here is a simplistic design approach.
Since you have two scheduled methods in the two VMs triggered at the same time, add a random delay to both. This answer lists many options for delaying the trigger by a random duration: Spring @Scheduled annotation random delay
Inside the method, run the job only if it is NOT already started (by the other VM). This can be tracked with a new database table.
Here is the pseudo code for this design:
@Scheduled(cron = "schedule expression")
public void batchUpdateMethod() {
    // Check the database for signs of the job running now.
    if (jobIsNotRunning()) { // e.g. query the tracking table
        // Update the database table to indicate the job is running.
        // Run the batch job.
        // Update the database table to indicate the job is finished.
    }
}
The database, or some common file location, should be used as a lock to sync between the two runs, since the two VMs are independent of each other.
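As a hedged sketch of that idea (assuming a hypothetical batch_lock table with job_name and status columns): the single atomic UPDATE below replaces the separate check-then-update steps of the pseudocode, which would otherwise leave a small race window between the two VMs.

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class NightlyBatchRunner {

    private final JdbcTemplate jdbcTemplate;

    public NightlyBatchRunner(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Both instances fire at 1am; requires @EnableScheduling on a config class.
    @Scheduled(cron = "0 0 1 * * *")
    public void batchUpdateMethod() {
        // Atomically claim the lock; returns 0 if the other VM got there first.
        int claimed = jdbcTemplate.update(
                "UPDATE batch_lock SET status = 'RUNNING' "
                        + "WHERE job_name = ? AND status = 'IDLE'",
                "nightlyBatch");
        if (claimed == 1) {
            try {
                runBatch(); // the existing batch logic
            } finally {
                jdbcTemplate.update(
                        "UPDATE batch_lock SET status = 'IDLE' WHERE job_name = ?",
                        "nightlyBatch");
            }
        }
    }

    private void runBatch() {
        // ... write the data to the database ...
    }
}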
For a more robust design, consider Spring Batch
Spring Batch uses a database for its JobRepository. By default an in-memory datasource is used to keep track of running jobs and their status, so in your setup the two instances are (most likely) each using their own in-memory database.
Multiple instances of Spring Batch can coordinate with each other as a cluster, with one running jobs while the other acts as a backup, if the JobRepository database is shared.
For this you need to configure the two instances to use a common datasource.
Here are some docs:
https://docs.spring.io/spring-batch/docs/current/reference/html/index-single.html#jobrepository
https://docs.spring.io/spring-batch/docs/current/reference/html/job.html#configuringJobRepository
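For example, pointing both instances at one shared database could look like the application.properties sketch below (the URL and credentials are placeholders, and exact property names vary by Spring Boot version):

# Shared JobRepository datasource (placeholder URL/credentials)
spring.datasource.url=jdbc:postgresql://db-host:5432/batchdb
spring.datasource.username=batch_user
spring.datasource.password=change-me
# Create the Spring Batch metadata tables on startup if missing
spring.batch.initialize-schema=always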
If you design two app server instances to run the same job at the same time, then by design one will succeed in creating the job instance and the other will fail (and this failure can be ignored). See the Javadoc of JobRepository. This is one of the roles of the job repository: to act as a safeguard against duplicate job executions in a clustered environment.
"If one of the servers stops, the second server could start the batch on its behalf. Is there a tool in nginx or Tomcat (or some other technology) to do that?"
I believe there is no need for such tool or technology. If one of the servers is down at the time of the schedule, the other will be able to take things over and succeed in launching the job.
I implemented a simple BCM server functionality, where all servers register (create a Server-table entry) with their unique IP. The servers need to re-register within a defined time (e.g. 10 seconds). If a server does not register in time (last update timestamp > 10 seconds), it gets de-registered (its Server-table entry deleted) by the servers that do register.
In the end I have a table with ordered server entries and can assign each task uniquely to the registered servers.
The implementation is very simple and works perfectly.
I also had the Spring Batch job-sharing functionality in mind beforehand, but I wanted a more lightweight and more flexible solution.
I currently use it in all my projects where I need batch processing.

Logging for Talend job running within spring-boot

We have Talend jobs triggered within a Spring Boot application. Is there any way to configure the output of the Talend jobs to go to the application log files?
One workaround we found is to write logs directly to an external file (with the filePath passed as a context parameter). But we wanted to find out if there is a better way to configure this seamlessly.
Not sure if I understood the question correctly, but I guess your concern might be about what happened to the triggered jobs.
Logging
With respect to logging for Talend, you could configure it using Log4j:
https://help.talend.com/reader/5DC~TBhDsBie5JTXyVLW4g/QSGCZJKXo~uhKvZDq1DxUg
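As an illustration, a Log4j 1.x configuration along the following lines (the log file path is a placeholder) would route the Talend job's Log4j output into the same file your application logs to:

# log4j.properties shipped with the Talend job (placeholder file path)
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/var/log/myapp/application.log
log4j.appender.file.MaxFileSize=10MB
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{ISO8601} [%t] %-5p %c - %m%n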
Monitoring
Regarding the status of the executed job, you can retrieve the execution details with a REST call (the Talend MetaServlet API):
getTaskExecutionStatus
https://help.talend.com/reader/oYf9gKhmYrkWCiSua4qLeg/SLiAyHyDTjuznLR_F~MiQQ
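A hedged sketch of such a call: the MetaServlet takes a base64-encoded JSON payload appended to the URL. The host, credentials and taskId below are placeholders, so check the payload keys against your TAC version:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class TalendStatusClient {

    public static void main(String[] args) throws Exception {
        // Request payload for the getTaskExecutionStatus action (placeholders).
        String json = "{\"actionName\":\"getTaskExecutionStatus\","
                + "\"authUser\":\"admin@company.com\","
                + "\"authPass\":\"secret\","
                + "\"taskId\":42}";
        String encoded = Base64.getEncoder()
                .encodeToString(json.getBytes(StandardCharsets.UTF_8));
        URL url = new URL("http://tac-host:8080/org.talend.administrator/metaServlet?" + encoded);

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // JSON response with the execution status
            }
        }
    }
}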
By modifying the existing Talend job, you could also design a kind of feedback loop, i.e. trigger a REST call back to your application with the execution details from the Talend job.

Laravel Scheduler (withoutOverlapping)

I have two apps running on the same server.
Now it seems like when adding withoutOverlapping() to the scheduler jobs and managing the base cron entry via cron itself, these two apps are blocking each other's execution.
Could that be?
Yes, withoutOverlapping only works per application.
Laravel creates a file in the storage folder with a hash of the job. This way, if the file exists, Laravel knows the job is still running. One application cannot know whether the other one is currently running a job, because it does not have access to the other application's storage folder.
If your code looks like the following:
$schedule->command('process:queue 0')->everyMinute()->withoutOverlapping();
$schedule->command('process:queue 1')->everyMinute()->withoutOverlapping();
It is because the same command with different parameters may be considered overlapping, i.e. the hash of the job considers only the command signature.

How does one run Spring XD in distributed mode?

I'm looking to start Spring XD in distributed mode (more specifically, deploying it with BOSH). How does the admin component communicate with the module containers?
If it's via TCP/HTTP, surely I'll have to tell the admin component where all the containers are? If it's via Redis, I would've thought that I'll need to tell the containers where the Redis instance is?
Update
I've tried running xd-admin and Redis on one box, and xd-container on another with redis.properties updated to point to the admin box. The container starts without reporting any exceptions.
Running the example stream submission curl -d "time | log" http://{admin IP}:8080/streams/ticktock yields no output on either console, and no output in the logs.
If you are using the xd-container script, then redis.properties is expected to be under XD_HOME/config, where XD_HOME points to the base directory containing the bin, config, lib & modules directories of XD.
Communication between the Admin and Container runtime components is via the messaging bus, which by default is Redis.
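For reference, a minimal redis.properties pointing a node at the shared Redis instance might look like this (the key names assume the defaults from early Spring XD releases; hostname and port are placeholders):

# XD_HOME/config/redis.properties on every admin and container node
redis.hostname=redis-host
redis.port=6379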
Make sure the environment variable XD_HOME is set as per the documentation; if it is not you will see a logging message that suggests the properties file has been loaded correctly when it has not:
13/06/24 09:20:35 INFO support.PropertySourcesPlaceholderConfigurer: Loading properties file from URL [file:../config/redis.properties]

Spring Batch Admin: Schedule new jobs through web GUI

A newbie question on Spring Batch Admin.
My requirement is that the user should be able to schedule new jobs (passing some parameters for the job functionality) through a web UI. These jobs should be persistent, will be repetitive and could be cancelled or deleted. Also, a report could be generated for last run jobs and to list all the existing jobs with their next run dates.
Perhaps my most important requirement is that this should be possible "on the fly", not requiring redeploying the web-application or a server re-start.
Can this be done using Spring Batch Admin? (I see that the guide talks about uploading an XML file to add a job, but that seems tedious; if there is an API, why shouldn't we be able to create a job on the fly through the Batch Admin web UI?) Or does the JDK Timer or Quartz support it?
Once a job has been created, it can't be deleted, but it can be stopped. Allowing deletion from the DB is a risky operation: Spring Batch might have already started the job execution while the DB has not been updated yet, and if one removes the job at that moment, you have an inconsistency.
Scheduling a new job is described in Launch Job. It is not possible to create new types of jobs, as jobs can generally have complicated configuration which is parsed only once, when the Spring context is loaded.
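For completeness, launching one of the already-deployed jobs with per-run parameters is plain Spring Batch; a minimal sketch (the bean and parameter names are hypothetical):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class JobLaunchService {

    @Autowired
    private JobLauncher jobLauncher;

    @Autowired
    private Job reportJob; // an existing, already-deployed job definition

    public JobExecution launch(String inputFile) throws Exception {
        // Per-launch parameters; the timestamp makes each launch a new JobInstance.
        JobParameters params = new JobParametersBuilder()
                .addString("inputFile", inputFile)
                .addLong("launchTime", System.currentTimeMillis())
                .toJobParameters();
        return jobLauncher.run(reportJob, params);
    }
}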
Dynamic deployment (on the fly) of jobs and configurations, without requiring a server restart, is a feature we implemented in the Trooper Batch Profile. It is not exactly Spring Batch Admin, but it builds on it. You continue to write your jobs using Spring Batch; just the container changes, as in Trooper you would use its Batch profile runtime. Screenshots and features are here: https://github.com/regunathb/Trooper/wiki/Writing-Batch-jobs-in-Trooper
I think we can deploy each Spring Batch job via SBA: each batch job is compiled as a WAR file, and they are deployed together on the server. This way, we have the following URLs to monitor each job:
http://batchjobserver/job1
http://batchjobserver/job2
http://batchjobserver/job3
http://batchjobserver/job4
But the downside is that each WAR file necessarily contains its own lib files, which makes each WAR around 10MB in size.
At the same time, I tried manually adding new-job.xml to war-file\WEB-INF\classes\META-INF\spring\batch\jobs, and new-job.jar to war-file\WEB-INF\lib, without stopping JBoss. It works: the new job shows up in the SBA UI and is runnable.
But obviously this would cause a lot of maintenance and troubleshooting trouble, so it is not really practical.
