Quartz creating two triggers for one job - spring-boot

I am using Quartz 2.3.0 with my Spring Boot project, and I have one job that runs in a clustered environment every 45 minutes. Below is the content of my quartz.properties file.
org.quartz.scheduler.instanceName = SSDIClusteredScheduler
org.quartz.scheduler.instanceId = AUTO
# thread-pool
org.quartz.threadPool.class=org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount=1
org.quartz.threadPool.threadsInheritContextClassLoaderOfInitializingThread=true
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000
# Enable these properties for a JDBCJobStore using JobStoreTX
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.dataSource=quartzDataSource
# Enable this property for JobStoreCMT
#org.quartz.jobStore.nonManagedTXDataSource=quartzDataSource
#============================================================================
# Configure Datasources
#============================================================================
org.quartz.dataSource.quartzDataSource.driver=oracle.jdbc.driver.OracleDriver
org.quartz.dataSource.quartzDataSource.URL=${quartz_datasource_url}
org.quartz.dataSource.quartzDataSource.user=${quartz_datasource_username}
org.quartz.dataSource.quartzDataSource.maxConnections = 5
org.quartz.dataSource.quartzDataSource.validationQuery=select 0 from dual
And below is my code to create a trigger:
@Bean
public Trigger someTrigger(@Qualifier("someJob") JobDetail jobDetail) {
    Trigger trigger = TriggerBuilder.newTrigger()
            .withIdentity(jobDetail.getKey().getName())
            .forJob(jobDetail)
            .withSchedule(CronScheduleBuilder.cronSchedule(someCronExpression))
            .build();
    return trigger;
}
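For reference, here is a minimal sketch of the JobDetail bean this trigger is qualified against. The question does not show it, so the job class SomeJob and the bean name are placeholders:
@Bean
public JobDetail someJob() {
    // SomeJob stands in for the actual Job implementation, which is not shown in the question.
    return JobBuilder.newJob(SomeJob.class)
            .withIdentity("someJob")
            .storeDurably() // typically needed when the JobDetail bean is registered separately from its trigger
            .build();
}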
But when the job runs, two triggers get created for this single job: one under the scheduler name quartzScheduler and another under my instance name, SSDIClusteredScheduler. One trigger says it is clustered and the other says it is non-clustered.
I cannot understand how this is happening; I have gone through a lot of documentation but have not been able to find the cause.
Content of Triggers table
Content of Fired_Triggers table

Related

How to set the starting point when using the Redis scan command in spring boot

I want to migrate 70 million records from Redis (sentinel mode) to Redis (cluster mode).
ScanOptions options = ScanOptions.scanOptions().build();
Cursor<byte[]> c = sentinelTemplate.getConnectionFactory().getConnection().scan(options);
while (c.hasNext()) {
    count++;
    String key = new String(c.next());
    key = key.trim();
    String value = (String) sentinelTemplate.opsForHash().get(key, "tc");
    //Thread.sleep(1);
    clusterTemplate.opsForHash().put(key, "tc", value);
}
I want to scan again from a certain point because the Redis connection got disconnected partway through.
How do I set the starting point when using the Redis SCAN command in Spring Boot?
Moreover, whenever the program runs with the code above, the connection breaks after almost 20 million records have been moved.
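As far as I know, the Spring Data Redis scan(ScanOptions) call always starts a fresh SCAN at cursor 0, so resuming from a known position generally means working with the raw SCAN cursor at the driver level. Below is a minimal sketch using the Jedis client directly (an assumption, since the question uses RedisTemplate); the host, the saved cursor value, and the "tc" hash field are placeholders mirroring the question, and Jedis 3.x is assumed:
import redis.clients.jedis.Jedis;
import redis.clients.jedis.ScanParams;
import redis.clients.jedis.ScanResult;

public class ResumableCopy {

    // Pass null to start from the beginning, or a cursor value saved from an earlier run.
    public static void copy(String startCursor) {
        try (Jedis source = new Jedis("sentinel-master-host", 6379)) {
            String cursor = (startCursor != null) ? startCursor : ScanParams.SCAN_POINTER_START; // "0"
            ScanParams params = new ScanParams().count(1000); // roughly 1000 keys per round trip
            do {
                ScanResult<String> page = source.scan(cursor, params);
                for (String key : page.getResult()) {
                    String value = source.hget(key.trim(), "tc");
                    // write to the target cluster here, e.g. clusterTemplate.opsForHash().put(key, "tc", value)
                }
                cursor = page.getCursor(); // persist this value so the copy can resume after a disconnect
            } while (!ScanParams.SCAN_POINTER_START.equals(cursor));
        }
    }
}
Note that SCAN only guarantees each key present for the whole scan is returned at least once, so a resumed run may revisit some keys; since the hash put is idempotent, that is harmless here.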

Lagom Jdbc Read-Side support: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'slick.profile'

I'm trying to set up a JDBC read-side processor in a Lagom service:
class ProjectEventsProcessor(readSide: JdbcReadSide)(implicit ec: ExecutionContext) extends ReadSideProcessor[ProjectEvent] {

  def buildHandler = {
    readSide.builder[ProjectEvent]("projectEventOffset")
      .setEventHandler[ProjectCreated]((conn: Connection, e: EventStreamElement[ProjectCreated]) => insertProject(e.event))
      .build
  }

  private def insertProject(e: ProjectCreated) = {
    Logger.info(s"Got event $e")
  }

  override def aggregateTags: Set[AggregateEventTag[ProjectEvent]] = ProjectEvent.Tag.allTags
}
The service connects to the database fine on startup:
15:40:32.575 [info] play.api.db.DefaultDBApi [] - Database [default] connected at jdbc:postgresql://localhost/postgres?user=postgres
But right after this I'm getting an exception:
com.typesafe.config.ConfigException$Missing: No configuration setting
found for key 'slick.profile'
First of all, why is Slick involved here at all?
I'm using JdbcReadSide, not SlickReadSide.
OK, let's say JdbcReadSide internally uses Slick somehow.
I've added slick.profile to the application.conf of my service:
db.default.driver="org.postgresql.Driver"
db.default.url="jdbc:postgresql://localhost/postgres?user=postgres"
// Tried this way
slick.profile="slick.jdbc.PostgresProfile$"
// Also this way (copied from the Play documentation).
slick.dbs.default.profile="slick.jdbc.PostgresProfile$"
slick.dbs.default.db.dataSourceClass = "slick.jdbc.DatabaseUrlDataSource"
slick.dbs.default.db.properties.driver = "org.postgresql.Driver"
But I'm still getting this exception.
What is going on? How do I solve this issue?
According to the docs, Lagom uses akka-persistence-jdbc, which under the hood:
uses Slick to map tables and manage asynchronous execution of JDBC calls.
A full configuration to put in the application.conf file, also using the default connection pool (HikariCP), may look like the following (mostly copied from the docs):
# Defaults to use for each Akka persistence plugin
jdbc-defaults.slick {
  # The Slick profile to use
  # set to one of: slick.jdbc.PostgresProfile$, slick.jdbc.MySQLProfile$, slick.jdbc.OracleProfile$ or slick.jdbc.H2Profile$
  profile = "slick.jdbc.PostgresProfile$"

  # The JNDI name for the Slick pre-configured DB
  # By default, this value will be used by all akka-persistence-jdbc plugin components (journal, read-journal and snapshot).
  # you may configure each plugin component to use different DB settings
  jndiDbName=DefaultDB
}

db.default {
  driver = "org.postgresql.Driver"
  url = "jdbc:postgresql://localhost/postgres?user=postgres"

  # The JNDI name for this DataSource
  # Play, and therefore Lagom, will automatically register this DataSource as a JNDI resource using this name.
  # This DataSource will be used to build a pre-configured Slick DB
  jndiName=DefaultDS

  # Lagom will configure a Slick Database, using the async-executor settings below
  # and register it as a JNDI resource using this name.
  # By default, all akka-persistence-jdbc plugin components will use this JNDI name
  # to look up this pre-configured Slick DB
  jndiDbName=DefaultDB

  async-executor {
    # number of objects that can be queued by the async executor
    queueSize = 10000
    # 5 * number of cores
    numThreads = 20
    # same as number of threads
    minConnections = 20
    # same as number of threads
    maxConnections = 20
    # if true, an MBean for AsyncExecutor will be registered
    registerMbeans = false
  }

  # Hikari is the default connection pool and it's fine-tuned to use the same
  # values for minimum and maximum connections as defined for the async-executor above
  hikaricp {
    minimumIdle = ${db.default.async-executor.minConnections}
    maximumPoolSize = ${db.default.async-executor.maxConnections}
  }
}

lagom.persistence.jdbc {
  # Configuration for creating tables
  create-tables {
    # Whether tables should be created automatically as needed
    auto = true
    # How long to wait for tables to be created, before failing
    timeout = 20s
    # The cluster role to create tables from
    run-on-role = ""
    # Exponential backoff for failures configuration for creating tables
    failure-exponential-backoff {
      # minimum (initial) duration until processor is started again
      # after failure
      min = 3s
      # the exponential back-off is capped to this duration
      max = 30s
      # additional random delay is based on this factor
      random-factor = 0.2
    }
  }
}

Dataproc conflict in hadoop temporary tables

I have a flow that executes Spark jobs on Dataproc clusters in parallel for different zones. For each zone it creates a cluster, executes the Spark job, and deletes the cluster after it finishes.
The Spark job uses the org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset method, passing the BigQuery configuration to save data to a BigQuery table. The job saves data to more than one table, calling saveAsNewAPIHadoopDataset more than once per job.
The problem is that sometimes I'm getting an error caused by a conflict in the temporary Hadoop BigQuery dataset that the connector internally creates to run the jobs:
Exception in thread "main" com.google.api.client.googleapis.json.GoogleJsonResponseException: 409 Conflict
{
"code" : 409,
"errors" : [ {
"domain" : "global",
"message" : "Already Exists: Dataset <my-gcp-project>:<MY-DATASET>_hadoop_temporary_job_201802250620_0013",
"reason" : "duplicate"
} ],
"message" : "Already Exists: Dataset <my-gcp-project>:<MY-DATASET>_hadoop_temporary_job_201802250620_0013"
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1056)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputCommitter.setupJob(BigQueryOutputCommitter.java:107)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1150)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1078)
at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopDataset(JavaPairRDD.scala:819)
...
The timestamp 201802250620_0013 in the exception above has the suffix _0013, and I'm unsure whether it represents time.
My thought is that sometimes jobs run at the same time and try to create a dataset with the same timestamp in the name, either in a parallel job or within the same job on another saveAsNewAPIHadoopDataset call.
How can we avoid this error without adding a delay to the job execution?
The dependency that I'm using is:
<dependency>
<groupId>com.google.cloud.bigdataoss</groupId>
<artifactId>bigquery-connector</artifactId>
<version>0.10.2-hadoop2</version>
<scope>provided</scope>
</dependency>
The Dataproc image version is 1.1
Edit 1:
I tried using IndirectBigQueryOutputFormat, but now I'm getting an error saying that the GCS output path already exists, even though I pass a different GCS path on each saveAsNewAPIHadoopDataset call.
Here is my code:
SparkConf sc = new SparkConf().setAppName("MyApp");

try (JavaSparkContext jsc = new JavaSparkContext(sc)) {
    JavaPairRDD<String, String> filesJson = jsc.wholeTextFiles(jsonFolder, parts);
    JavaPairRDD<String, String> jsons = filesJson.flatMapToPair(new FileSplitter()).repartition(parts);
    JavaPairRDD<Object, JsonObject> objsJson = jsons.flatMapToPair(new JsonParser()).filter(t -> t._2() != null).cache();

    objsJson
        .filter(new FilterType(MSG_TYPE1))
        .saveAsNewAPIHadoopDataset(createConf("my-project:MY_DATASET.MY_TABLE1", "gs://my-bucket/tmp1"));

    objsJson
        .filter(new FilterType(MSG_TYPE2))
        .saveAsNewAPIHadoopDataset(createConf("my-project:MY_DATASET.MY_TABLE2", "gs://my-bucket/tmp2"));

    objsJson
        .filter(new FilterType(MSG_TYPE3))
        .saveAsNewAPIHadoopDataset(createConf("my-project:MY_DATASET.MY_TABLE3", "gs://my-bucket/tmp3"));

    // here goes another ingestion process: same code as above but with different params, parsers, etc.
}

Configuration createConf(String table, String outGCS) {
    Configuration conf = new Configuration();
    BigQueryOutputConfiguration.configure(conf, table, null, outGCS, BigQueryFileFormat.NEWLINE_DELIMITED_JSON, TextOutputFormat.class);
    conf.set("mapreduce.job.outputformat.class", IndirectBigQueryOutputFormat.class.getName());
    return conf;
}
I believe what may be happening is that each mapper tries to create its own dataset. This is rather inefficient (and burns your daily quota in proportion to the number of mappers).
An alternative is to use IndirectBigQueryOutputFormat as the output class:
IndirectBigQueryOutputFormat works by first buffering all the data into a Cloud Storage temporary table, and then, on commitJob, copies all data from Cloud Storage into BigQuery in one operation. Its use is recommended for large jobs since it only requires one BigQuery "load" job per Hadoop/Spark job, as compared to BigQueryOutputFormat, which performs one BigQuery job for each Hadoop/Spark task.
See the example here: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example
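For what it's worth, one way to sidestep the "output path already exists" error from Edit 1 is to give every saveAsNewAPIHadoopDataset call its own temporary GCS path. The sketch below just adds a unique suffix to the question's own createConf method; the import paths are how I recall them for the bigquery-connector and should be checked against the version in use:
import java.util.UUID;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import com.google.cloud.hadoop.io.bigquery.BigQueryFileFormat;
import com.google.cloud.hadoop.io.bigquery.output.BigQueryOutputConfiguration;
import com.google.cloud.hadoop.io.bigquery.output.IndirectBigQueryOutputFormat;

Configuration createConf(String table, String gcsTmpBase) {
    Configuration conf = new Configuration();
    // A unique suffix per call keeps parallel jobs (and repeated calls within one job)
    // from pointing at the same temporary output directory.
    String outGCS = gcsTmpBase + "/" + UUID.randomUUID();
    BigQueryOutputConfiguration.configure(conf, table, null, outGCS,
            BigQueryFileFormat.NEWLINE_DELIMITED_JSON, TextOutputFormat.class);
    conf.set("mapreduce.job.outputformat.class", IndirectBigQueryOutputFormat.class.getName());
    return conf;
}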

How to stabilize spark streaming application with a handful of super big sessions?

I am running a Spark Streaming application based on the mapWithState DStream function. The application transforms input records into sessions based on a session ID field inside the records.
A session is simply all of the records with the same ID. Then I perform some analytics at the session level to find an anomaly score.
I couldn't stabilize my application because a handful of sessions keep getting bigger at each batch interval for an extended period (more than 1 h). My understanding is that a single session (key-value pair) is always processed by a single core in Spark. I want to know if I am mistaken, and whether there is a solution to mitigate this issue and make the streaming application stable.
I am using Hadoop 2.7.2 and Spark 1.6.1 on YARN. Changing the batch interval, block interval, number of partitions, number of executors, and executor resources didn't solve the issue, as one single task always makes the application choke. However, filtering out those super long sessions solved the issue.
Below is the updateState function I am using:
val updateState = (batchTime: Time, key: String, value: Option[scala.collection.Map[String, Any]], state: State[Seq[scala.collection.Map[String, Any]]]) => {
  val session = Seq(value.getOrElse(scala.collection.Map[String, Any]())) ++ state.getOption.getOrElse(Seq[scala.collection.Map[String, Any]]())
  if (state.isTimingOut()) {
    Option(null)
  } else {
    state.update(session)
    Some((key, value, session))
  }
}
and the mapWithState call:
def updateStreamingState(inputDstream: DStream[scala.collection.Map[String, Any]]): DStream[(String, Option[scala.collection.Map[String, Any]], Seq[scala.collection.Map[String, Any]])] = {
  val spec = StateSpec.function(updateState)
  spec.timeout(Duration(sessionTimeout))
  spec.numPartitions(192)
  inputDstream.map(ds => (ds(sessionizationFieldName).toString, ds)).mapWithState(spec)
}
Finally, I am computing session features for each session in the resulting DStream, as defined below:
def computeSessionFeatures(sessionId: String, sessionRecords: Seq[scala.collection.Map[String, Any]]): Session = {
  val features = Functions.getSessionFeatures(sessionizationFeatures, recordFeatures, sessionRecords)
  val resultSession = new Session(sessionId, sessionizationFieldName, sessionRecords)
  resultSession.features = features
  return resultSession
}

Using Spring #Scheduled and #Async together

Here is my use case.
A legacy system updates a database queue table QUEUE.
I want a scheduled recurring job that
- checks the contents of QUEUE
- if there are rows in the table it locks the row and does some work
- deletes the row in QUEUE
If the previous job is still running, then a new thread will be created to do the work. I want to configure the maximum number of concurrent threads.
I am using Spring 3 and my current solution is to do the following (using a fixedRate of 1 millisecond to get the threads to run basically continuously)
@Scheduled(fixedRate = 1)
@Async
public void doSchedule() throws InterruptedException {
    log.debug("Start schedule");
    publishWorker.start();
    log.debug("End schedule");
}
<task:executor id="workerExecutor" pool-size="4" />
This created 4 threads straight off, and the threads correctly shared the workload from the queue. However, I seem to be getting a memory leak when the threads take a long time to complete.
java.util.concurrent.ThreadPoolExecutor # 0xe097b8f0 | 80 | 373,410,496 | 89.74%
|- java.util.concurrent.LinkedBlockingQueue # 0xe097b940 | 48 | 373,410,136 | 89.74%
| |- java.util.concurrent.LinkedBlockingQueue$Node # 0xe25c9d68
So:
1: Should I be using @Async and @Scheduled together?
2: If not, how else can I use Spring to achieve my requirements?
3: How can I create the new threads only when the other threads are busy?
Thanks all!
EDIT: I think the queue of jobs was getting infinitely long... Now using
<task:executor id="workerExecutor"
pool-size="1-4"
queue-capacity="10" rejection-policy="DISCARD" />
Will report back with results
You can try:
1. Run a scheduler with a one-second delay, which will lock and fetch all QUEUE records that weren't locked so far.
2. For each record, call an @Async method, which will process that record and delete it.
The executor's rejection policy should be ABORT, so that the scheduler can unlock the QUEUE rows that weren't handed out for processing yet. That way the scheduler can try processing those rows again in the next run.
Of course, you'll have to handle the scenario where the scheduler has locked a QUEUE row but the handler didn't finish processing it for whatever reason.
Pseudo code:
public class QueueScheduler {

    @Autowired
    private QueueHandler queueHandler;

    @Scheduled(fixedDelay = 1000)
    public void doSchedule() throws InterruptedException {
        log.debug("Start schedule");
        List<Long> queueIds = lockAndFetchAllUnlockedQueues();
        for (long id : queueIds)
            queueHandler.process(id);
        log.debug("End schedule");
    }
}

public class QueueHandler {

    @Async
    public void process(long queueId) {
        // process the QUEUE & delete it from DB
    }
}

<task:executor id="workerExecutor" pool-size="1-4" queue-capacity="10"
    rejection-policy="ABORT"/>
//using a fixedRate of 1 millisecond to get the threads to run basically continuously
@Scheduled(fixedRate = 1)
When you use @Scheduled, a new thread will be created and will invoke the doSchedule method at the specified fixedRate of 1 millisecond. When you run your app, you can already see 4 threads competing for the QUEUE table and possibly a deadlock.
Investigate if there is a deadlock by taking thread dump.
http://helpx.adobe.com/cq/kb/TakeThreadDump.html
The @Async annotation will not be of any use here.
A better way to implement this is to create your class as a task by implementing Runnable and passing it to a TaskExecutor with the required number of threads.
Using Spring threading and TaskExecutor, how do I know when a thread is finished?
Also check your design; it doesn't seem to be handling the synchronization properly. If a previous job is running and holding a lock on a row, the next job you create will still see that row and will wait to acquire a lock on that particular row.
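To illustrate the Runnable-plus-TaskExecutor approach suggested above, here is a minimal sketch. QueueRepository and its lockAndFetchUnlocked/processAndDelete/unlock methods are hypothetical stand-ins for the QUEUE table access, and workerExecutor is the <task:executor pool-size="1-4"> bean with an ABORT rejection policy:
import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.core.task.TaskExecutor;
import org.springframework.core.task.TaskRejectedException;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// One Runnable per locked QUEUE row; the executor decides when it actually runs.
class QueueRowWorker implements Runnable {

    private final long queueId;
    private final QueueRepository repository;

    QueueRowWorker(long queueId, QueueRepository repository) {
        this.queueId = queueId;
        this.repository = repository;
    }

    @Override
    public void run() {
        repository.processAndDelete(queueId); // do the work, then remove the row
    }
}

@Component
public class QueuePoller {

    @Autowired
    private TaskExecutor workerExecutor; // the <task:executor> bean from the question

    @Autowired
    private QueueRepository repository;  // hypothetical DAO over the QUEUE table

    @Scheduled(fixedDelay = 1000)
    public void poll() {
        List<Long> ids = repository.lockAndFetchUnlocked();
        for (Long id : ids) {
            try {
                workerExecutor.execute(new QueueRowWorker(id, repository));
            } catch (TaskRejectedException e) {
                repository.unlock(id); // hand the row back; the next poll will retry it
            }
        }
    }
}
With a bounded queue and ABORT, the scheduler never builds up an unbounded backlog of pending tasks, which was the suspected cause of the memory growth described in the question.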
