I am using Apache Ignite to cache my static/reference data from Oracle tables.
I have to refresh the data every day at 11:30 PM.
Approach 1
Using Apache Ignite's built-in features, I did not find a way to refresh the data. Is there any way to do this?
Approach 2
I used the Quartz API to schedule the job, which is working fine. I am using the steps below to reload/refresh the data:
Stop Ignite - Ignition.stopAll(true);
Start Ignite - Ignition.start(cfg)
Load the fresh data - in this step I am getting the exception below: java.lang.IllegalStateException: Grid is in invalid state to perform this operation. It either not started yet or has already being or have stopped [igniteInstanceName=null, state=STOPPED]
Could you please help me fix this, or suggest the best approach to adopt?
It sounds like you can just call the loadCache method every day at 11:30. Quartz can be used to schedule that call.
You may also check out the built-in scheduler, IgniteScheduler.
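For example, a minimal sketch using IgniteScheduler (the config path, cache name and value type are placeholders, and I believe the cron-style scheduling requires the optional ignite-schedule module):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;

    public class NightlyRefresh {
        public static void main(String[] args) {
            // Start (or obtain) the node once; there is no need to stop/start it per refresh.
            Ignite ignite = Ignition.start("ignite-config.xml");              // placeholder config path
            IgniteCache<Long, Object> cache = ignite.cache("referenceData");  // placeholder cache name

            // "30 23 * * *" is UNIX cron syntax for 23:30 every day.
            ignite.scheduler().scheduleLocal(() -> {
                cache.loadCache(null); // re-runs the cache store's loadCache against Oracle
            }, "30 23 * * *");
        }
    }

Because the node is never stopped, the IllegalStateException you get from reusing a stopped Ignition instance should not occur.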
We have Talend jobs triggered within a Spring Boot application. Is there any way to route the output of the Talend jobs to the application log files?
One workaround we found is to write logs directly to an external file (filePath passed as a context parameter), but we wanted to find out whether there is a better way to configure this seamlessly.
I am not sure if I understood the question correctly, but I guess your concern might be about what happened to the triggered jobs.
Logging
With respect to logging for Talend, you could configure it using Log4j:
https://help.talend.com/reader/5DC~TBhDsBie5JTXyVLW4g/QSGCZJKXo~uhKvZDq1DxUg
Monitoring
Regarding the status of the executed job, you can retrieve the execution details using a REST call (Talend MetaServlet API):
getTaskExecutionStatus
https://help.talend.com/reader/oYf9gKhmYrkWCiSua4qLeg/SLiAyHyDTjuznLR_F~MiQQ
By modifying the existing Talend job, you could also design a feedback loop, i.e. trigger a REST call back to your application with the execution details from the Talend job.
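For the monitoring part, here is a rough sketch of calling getTaskExecutionStatus from Java. The TAC host, port, credentials and taskId are placeholders, and you should verify the exact MetaServlet URL format and parameter names against your TAC version's documentation:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class TaskStatusCheck {
        public static void main(String[] args) throws Exception {
            // The MetaServlet expects a base64-encoded JSON command appended to its URL.
            String json = "{\"actionName\":\"getTaskExecutionStatus\","
                    + "\"authUser\":\"admin@company.com\","
                    + "\"authPass\":\"secret\","
                    + "\"taskId\":1}";
            String encoded = Base64.getEncoder()
                    .encodeToString(json.getBytes(StandardCharsets.UTF_8));

            URL url = new URL("http://tac-host:8080/org.talend.administrator/metaServlet?" + encoded);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // JSON response containing the execution status
                }
            }
        }
    }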
I am working on big data project.
The basic flow of the project is as follows:
- Data comes from the mainframe and is stored in Cornerstone 3.0.
- The data is then ingested into Hive using a scheduler.
- It is then stored in MapR-DB as key-value pairs by a MapReduce job (which runs Hive queries to get specific aggregated attributes), to be exposed to the application through a REST API.
I want to test this application from Hive through to the REST API, assuming the data in Hive is loaded correctly.
What is the best approach to test this application (objectives to be tested: Hive data, Hive queries, MapR-DB performance, MapR-DB data, REST API)? What are the best tools and technologies to use?
Thank you in advance.
What can be tested? This is explained by the requirements/question itself:
Data is coming from the mainframe and getting stored into Cornerstone 3.0 - validate that the data is stored as expected (based on requirements) from the mainframe to Cornerstone.
After that the data is getting ingested into Hive using a scheduler - verify that the Hive tables have the expected data, HDFS file locations, etc. (as per requirements; if any transformation happens during the Hive table load, you will be validating that too). See the Hive JDBC sketch below for one way to automate such a check.
Then it is getting stored into MapR-DB using a MapReduce job (running Hive queries to get specific aggregated attributes) as key-value pairs to reflect into the application using a REST API - here you are basically testing the MapReduce job that loads/transforms data into MapR-DB. You should run the job first -> verify the job runs end to end with no errors/warnings (note the execution time to verify the performance of the job) -> validate MapR-DB -> then test the REST API application and verify the expected results based on requirements.
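To illustrate the Hive-side validation, here is a minimal sketch using the Hive JDBC driver; the HiveServer2 host, table name and expected count are placeholders for whatever your requirements define:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveRowCountCheck {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                        "jdbc:hive2://hive-host:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM ingested_table")) {
                rs.next();
                long actual = rs.getLong(1);
                long expected = 1_000_000L; // e.g. the record count reported by the upstream feed
                if (actual != expected) {
                    throw new AssertionError("Row count mismatch: expected " + expected + ", got " + actual);
                }
                System.out.println("Hive row count check passed: " + actual);
            }
        }
    }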
What are the best tools and technologies to use?
For Hive/HDFS/data validation - I would create a shell script (consisting of Hive, HDFS file location and log file validation, running the MapReduce job, validating the MapReduce job, etc.) that tests/verifies each step described above. One should start with manual CLI commands first to get going with testing.
For testing the REST API - there are many tools available, e.g. ReadyAPI and Postman. I would include this step in the shell script too (using curl); a plain Java check also works, as sketched below.
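A rough sketch of such a REST check in plain Java (Java 11+ HttpClient; the endpoint URL and expected field are placeholders for your aggregated-attribute API):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RestSmokeTest {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://api-host:8080/attributes/customer-123")) // placeholder endpoint
                    .GET()
                    .build();

            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

            // Compare the API output with the value computed independently from Hive/MapR-DB.
            if (response.statusCode() != 200 || !response.body().contains("\"totalSpend\"")) {
                throw new AssertionError("Unexpected response: " + response.statusCode() + " " + response.body());
            }
            System.out.println("REST check passed: " + response.body());
        }
    }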
I have read a lot about how to enable parallel processing and chunking of an individual job using the master/slave paradigm. Consider an already implemented Spring Batch solution that was intended to run on a standalone server. With minimal refactoring I would like to enable it to horizontally scale and be more resilient in production operation. Speed and efficiency are not a goal.
http://www.mkyong.com/spring-batch/spring-batch-hello-world-example/
In the following example a Job Repository is used that connects to and initializes a database schema for the Job Repository. Job initiation requests are fed to a message queue that a single server, with a single Java process, is listening on via Spring JMS. When it receives a request it launches a new Java process that runs the Spring Batch job. If the job has not been started according to the Job Repository, it will begin. If the job had previously failed, it will pick up where it left off. If the job is already in process, the request is ignored.
The single point of failure is the single server and single listening process for job initiation. I would like to increase resiliency by horizontally scaling identical server instances that all compete to be the first to grab the job initiation message when it appears in the queue. Whichever instance grabs it will then attempt to run the job.
I was conceiving that all instances of the JobRepository would share the same schema, so they can all query for when the status is currently in process and decide what they will do. I am unsure though if this schema or JobRepository implementation is meant to be utilized by multiple instances.
Is there a risk that this approach could result in deadlocking the database? There are other constraints that prevent the partitioning features of Spring Batch from working for my application.
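For concreteness, this is roughly the JobRepository wiring I have in mind, with every server instance pointing at the same MySQL schema (a sketch only; the DataSource settings are placeholders):

    import org.springframework.batch.core.repository.JobRepository;
    import org.springframework.batch.core.repository.support.JobRepositoryFactoryBean;
    import org.springframework.jdbc.datasource.DataSourceTransactionManager;
    import org.springframework.jdbc.datasource.DriverManagerDataSource;

    public class SharedRepositoryConfig {
        public static JobRepository sharedJobRepository() throws Exception {
            // Every instance points at the same MySQL schema holding the standard BATCH_* tables,
            // so they all see the same JobInstance/JobExecution state.
            DriverManagerDataSource dataSource = new DriverManagerDataSource();
            dataSource.setDriverClassName("com.mysql.jdbc.Driver");
            dataSource.setUrl("jdbc:mysql://db-host:3306/batch_repo"); // placeholder host/schema
            dataSource.setUsername("batch");
            dataSource.setPassword("secret");

            JobRepositoryFactoryBean factory = new JobRepositoryFactoryBean();
            factory.setDataSource(dataSource);
            factory.setTransactionManager(new DataSourceTransactionManager(dataSource));
            factory.setDatabaseType("mysql");
            // Default setting; it serializes concurrent attempts to create the same JobExecution.
            factory.setIsolationLevelForCreate("ISOLATION_SERIALIZABLE");
            factory.afterPropertiesSet();
            return (JobRepository) factory.getObject();
        }
    }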
I decided to build a prototype to test whether the Spring Batch Job Repository schema and SimpleJobRepository can be used in a load-balanced way, with multiple Spring Batch Java processes running concurrently. I was afraid that deadlock scenarios might occur at the database and leave all running job processes stuck.
My Test
I started with the mkyong Spring Batch HelloWorld example and made some changes so that it could be packaged into a jar that can be executed from the command line. I also removed the database initialization step defined in the database.config file and manually set up a local MySQL server with the proper schema elements. I added a job parameter for time, set to the current time in millis, so that each job instance would be unique.
Next, I wrote a separate Java main class that used the Apache Commons Exec framework to create 50 sub-processes with no wait between them. Each of these processes also has a Thread.sleep of 1 second within its Processor object, so that a number of processes kick off at the same time and all attempt to access the database at the same time.
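The launcher was essentially along these lines (simplified; the jar name is a placeholder):

    import org.apache.commons.exec.CommandLine;
    import org.apache.commons.exec.DefaultExecuteResultHandler;
    import org.apache.commons.exec.DefaultExecutor;

    public class ConcurrentJobLauncher {
        public static void main(String[] args) throws Exception {
            for (int i = 0; i < 50; i++) {
                // Each sub-process gets a unique 'time' job parameter so Spring Batch
                // treats it as a new JobInstance.
                CommandLine cmd = new CommandLine("java");
                cmd.addArgument("-jar");
                cmd.addArgument("spring-batch-helloworld.jar"); // placeholder jar name
                cmd.addArgument("time=" + System.currentTimeMillis() + i);

                // Asynchronous execution: do not wait, so all 50 processes overlap.
                DefaultExecutor executor = new DefaultExecutor();
                executor.execute(cmd, new DefaultExecuteResultHandler());
            }
        }
    }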
Results
After running this test a number of times in a row, I see that all 50 Spring Batch processes consistently complete successfully and update the same database schema correctly. I see no indication that multiple Spring Batch job processes running on multiple servers and connecting to the same database would interfere with each other on the schema, nor any indication that a deadlock could happen.
So it sounds as if load balancing of Spring Batch jobs without the use of advanced Master/Slave and Step Partitioning approaches is a valid use case.
If anybody would like to comment on my test or suggest ways to improve it I would appreciate it.
Here is an excerpt from the Spring Batch docs on how Spring Batch handles database updates for its repository:
Spring Batch employs an optimistic locking strategy when dealing with updates to the database. This means that each time a record is 'touched' (updated) the value in the version column is incremented by one. When the repository goes back to save the value, if the version number has changed it throws an OptimisticLockingFailureException, indicating there has been an error with concurrent access. This check is necessary, since, even though different batch jobs may be running in different machines, they all use the same database tables.
I have a problem that affects the response time of the queries from CassandraInput. I use Datastax Enterprise 3.2.4 - Cassandra 1.2.13.2.
If I try to run the same query (any query) directly from the Cassandra client, the response is considerably faster than the same query executed from the CassandraInput node in Pentaho Data Integration.
What can cause this?
And above all, is there a way to improve the response time of the CassandraInput node in Pentaho?
I hope that some of you might have some suggestions.
Thank you
Federica
Generally this should not happen.
Try the following and check whether performance improves.
Open the spoon.bat or spoon.sh file, according to the OS you are using, and change the setting below.
The value has to be adjusted according to the amount of RAM on your machine.
PENTAHO_DI_JAVA_OPTIONS="-Xmx2g"
I am involved in a project which requires me to create a job scheduler using Quartz Scheduler to schedule various jobs which in turn trigger Pentaho Kettle transformation(s). Kettle transformations are essentially ETL scripts performing some mundane activities in our case. I am facing a critical issue while running the scheduler:
We have around 10 jobs scheduled using the job scheduler. For some 3 to 4 specific jobs it throws the following exception:
Unable to load the job from XML file [/home /transformations/jobs/TestJob.kjb] Unable to read file [file:///home /transformations/jobs/ TestJob.kjb] Could not read from "file:///home /transformations/jobs/TestJob.kjb" because it is a not a file.
org.pentaho.di.job.JobMeta.(JobMeta.java:715)
org.pentaho.di.job.JobMeta.(JobMeta.java:679)
com. XYZ.transformation.jobs.impl.JobBootstrapImpl.executeJob(JobBootstrapImpl.java:115)
com. XYZ.transformation.jobs.impl.JobBootstrapImpl.startJobsExecution(JobBootstrapImpl.java:100)
com. XYZ.transformation.jobs.impl.QuartzJobsScheduler.executeInternal(QuartzJobsScheduler.java:25)
org.springframework.scheduling.quartz.QuartzJobBean.execute(QuartzJobBean.java:86)
org.quartz.core.JobRunShell.run(JobRunShell.java:223)
org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:549)
The weird thing is that, upon verifying the specified path, i.e. “/home /transformations/jobs/TestJob.kjb”, the file is present and I am able to read it. Moreover, the job runs successfully and does everything it is supposed to, yet throws the exception detailed above.
After observing closely, I strongly feel that Quartz is internally caching jobs and/or their parameters. We do load certain parameters required for the job to execute after it is triggered. Would it be possible to delete/purge the cache used by Quartz? I also tried killing all the Java processes running on the box (thinking that it might kill Quartz itself, as Quartz runs within a Java process) and restarting Quartz and its jobs afresh, but I couldn't make it work as expected. It still stores the old parameters somewhere, perhaps in some cache.
Versions used –
Spring Framework (spring-core & spring-beans) - 3.0.6.RELEASE
Quartz Scheduler - 1.8.6
Platform – Redhat Linux - 2.6.18-308.el5
Pentaho Kettle – Spoon Stable Release – 4.3.0
I would do it this way:
Ensure that the Pentaho job can first run standalone via a shell script, Java service wrapper, or similar
Then, in the Quartz job, use Quartz's NativeJob to call that same standalone script (see the sketch below)
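A minimal sketch against the Quartz 1.8 API (the script path and cron expression are placeholders):

    import org.quartz.CronTrigger;
    import org.quartz.JobDetail;
    import org.quartz.Scheduler;
    import org.quartz.impl.StdSchedulerFactory;
    import org.quartz.jobs.NativeJob;

    public class KettleJobScheduler {
        public static void main(String[] args) throws Exception {
            // NativeJob just executes an external command, so the Kettle job runs in its
            // own process and nothing about it is cached inside the scheduler's JVM.
            JobDetail job = new JobDetail("runTestJob", Scheduler.DEFAULT_GROUP, NativeJob.class);
            job.getJobDataMap().put(NativeJob.PROP_COMMAND, "/home/scripts/run_testjob.sh"); // placeholder script
            job.getJobDataMap().put(NativeJob.PROP_WAIT_FOR_PROCESS, true);

            CronTrigger trigger = new CronTrigger("runTestJobTrigger", Scheduler.DEFAULT_GROUP,
                    "0 30 23 * * ?"); // placeholder schedule: 23:30 every day

            Scheduler scheduler = new StdSchedulerFactory().getScheduler();
            scheduler.start();
            scheduler.scheduleJob(job, trigger);
        }
    }

Since each trigger spawns a fresh process, no job parameters linger in the scheduler's JVM between runs.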
Just my two cents
Looks to me like you have an extra space in the path.
/home /transformations/jobs/TestJob.kjb
Between the e of home and the /
Remove that space; I can't possibly believe you actually have a home directory called "home "!!