How to set the starting point when using the Redis scan command in Spring Boot

I want to migrate 70 million entries from Redis (sentinel mode) to Redis (cluster mode):
ScanOptions options = ScanOptions.scanOptions().build();
Cursor<byte[]> c = sentinelTemplate.getConnectionFactory().getConnection().scan(options);
while (c.hasNext()) {
    count++;
    String key = new String(c.next());
    key = key.trim();
    String value = (String) sentinelTemplate.opsForHash().get(key, "tc");
    //Thread.sleep(1);
    clusterTemplate.opsForHash().put(key, "tc", value);
}
The Redis connection was dropped partway through, so I want to resume the scan from a specific point instead of starting over.
How can I set the starting point when using the Redis SCAN command in Spring Boot?
Moreover, every time I run the code above, the connection breaks after roughly 20 million entries have been moved.
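The Redis SCAN command takes its cursor as an explicit argument, starting at 0 and finishing when the server returns 0 again, and the server keeps no per-iteration state. In principle, then, a migration can record the last cursor it finished and pass it back in after a disconnect. The sketch below shows only that checkpointing pattern in plain Java. PageScanner, Page, copyKey and saveCheckpoint are hypothetical names invented for illustration, not Spring Data Redis APIs, and which client call exposes the raw cursor is left open here.

```java
import java.util.List;
import java.util.function.Consumer;

public class ResumableScan {

    /** One page of SCAN output: the keys returned plus the cursor for the next call. */
    public static final class Page {
        public final String nextCursor;
        public final List<String> keys;

        public Page(String nextCursor, List<String> keys) {
            this.nextCursor = nextCursor;
            this.keys = keys;
        }
    }

    /** Hypothetical stand-in for a client call that issues SCAN with an explicit cursor. */
    public interface PageScanner {
        Page scan(String cursor);
    }

    /**
     * Copies keys page by page, starting at startCursor ("0" for a fresh run).
     * The next cursor is checkpointed only after the current page has been copied,
     * so a crash re-copies at most one page on restart. A saved checkpoint of "0"
     * means the scan completed; the caller should not restart from it.
     */
    public static void migrate(PageScanner scanner, String startCursor,
                               Consumer<String> copyKey, Consumer<String> saveCheckpoint) {
        String cursor = startCursor;
        do {
            Page page = scanner.scan(cursor);
            for (String key : page.keys) {
                copyKey.accept(key);
            }
            cursor = page.nextCursor;
            saveCheckpoint.accept(cursor);
        } while (!"0".equals(cursor));
    }
}
```

Because a page may be copied twice after a restart, the copy step should be idempotent. Writing the same hash field again, as the loop in the question does, satisfies that.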

Related

couchbase upsert/insert silently failing with ttl

I am trying to upsert 10 documents using Spring Boot. A few of the documents fail to upsert when a TTL is set. There is no error or exception. If I do not provide a TTL, it works as expected.
In addition, if I increase the TTL to a different value, all the documents are created.
On the other hand, if I reduce the TTL, a few more documents fail to insert.
When I insert one of the failed documents (a single document out of the 10) from another proof of concept with the same TTL, the document is created.
public Flux<JsonDocument> upsertAll(final List<JsonDocument> jsonDocuments) {
    return Flux
        .from(keys())
        .flatMap(key -> Flux
            .fromIterable(jsonDocuments)
            .parallel()
            .runOn(Schedulers.parallel())
            .flatMap(jsonDocument -> {
                final String arg = String.format("upsertAll-%s", jsonDocument);
                return Mono
                    .just(asyncBucket
                        .upsert(jsonDocument, 1000, TimeUnit.MILLISECONDS)
                        .doOnError(error -> log.error(jsonDocument.content(), error, "failed to upsert")))
                    .map(obs -> Tuples.of(obs, jsonDocument.content()))
                    .map(tuple2 -> log.observableHandler(tuple2))
                    .map(observable1 -> Tuples.of(observable1, jsonDocument.content()))
                    .flatMap(tuple2 -> log.monoHandler(tuple2))
                    ;
            })
            .sequential())
        ;
}
List<JsonDocument> jsonDocuments = new LinkedList<>();
dbService.upsertAll(jsonDocuments)
    .subscribe();
Could someone please suggest how to resolve this issue?
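Due to an oddity in the Couchbase server API, TTL values less than 30 days are treated differently than values greater than 30 days: a value below the threshold is interpreted as a relative offset in seconds, while a larger value is interpreted as an absolute Unix timestamp.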
In order to get consistent behavior with Couchbase Java SDK 2.x, you'll need to adjust the TTL value before passing it to the SDK:
// adjust TTL for Couchbase Java SDK 2.x
public static int adjustTtl(int ttlSeconds) {
    return ttlSeconds < TimeUnit.DAYS.toSeconds(30)
        ? ttlSeconds
        : (int) (ttlSeconds + (System.currentTimeMillis() / 1000));
}
In Couchbase Java SDK 3.0.6 this is no longer required; just pass a Duration and the SDK will adjust the value behind the scenes if necessary.
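To make the 30-day rule concrete, here is a small worked example that applies the same rule as the adjustTtl method shown earlier, but with the current time passed in as a parameter so the numbers are reproducible. The class name TtlAdjust and the fixed timestamp are illustrative assumptions, not part of any Couchbase SDK.

```java
import java.util.concurrent.TimeUnit;

public class TtlAdjust {
    static final long THIRTY_DAYS_SECONDS = TimeUnit.DAYS.toSeconds(30);

    /** Same rule as adjustTtl, with the clock supplied by the caller (an assumption for the demo). */
    public static long adjustTtl(long ttlSeconds, long nowEpochSeconds) {
        return ttlSeconds < THIRTY_DAYS_SECONDS
                ? ttlSeconds                     // passed through as a relative offset
                : nowEpochSeconds + ttlSeconds;  // converted to an absolute Unix timestamp
    }

    public static void main(String[] args) {
        long now = 1_600_000_000L; // fixed, arbitrary "current time" for the demo

        // One hour is below the threshold, so it is passed through: prints 3600
        System.out.println(adjustTtl(TimeUnit.HOURS.toSeconds(1), now));

        // 45 days is above the threshold: 1_600_000_000 + 3_888_000, prints 1603888000
        System.out.println(adjustTtl(TimeUnit.DAYS.toSeconds(45), now));
    }
}
```

The sketch uses long rather than int only to keep the arithmetic free of casts; the rule itself is unchanged.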

Spring Batch Best Architecture to Read XML

What is the best-performing architecture for reading XML in Spring Batch? Each XML file is approximately 300 KB, and we are processing 1 million of them.
Our current approach:
30 partitions and 30 grids; each slave gets 166 XML files
Commit chunk size of 100
Application start memory is 8 GB
Using JAXB in the reader, default bean scope
@StepScope
@Qualifier("xmlItemReader")
public IteratorItemReader<BaseDTO> xmlItemReader(
        @Value("#{stepExecutionContext['fileName']}") List<String> fileNameList) throws Exception {
    String readingFile = "File Not Found";
    logger.info("----StaxEventItemReader----fileName--->" + fileNameList.toString());
    List<BaseDTO> fileList = new ArrayList<BaseDTO>();
    for (String filePath : fileNameList) {
        try {
            readingFile = filePath.trim();
            Invoice bill = (Invoice) getUnMarshaller().unmarshal(new File(filePath));
            UnifiedInvoiceDTO unifiedDTO = new UnifiedInvoiceDTO(bill, environment);
            unifiedDTO.setFileName(filePath);
            BaseDTO baseDTO = new BaseDTO();
            baseDTO.setUnifiedDTO(unifiedDTO);
            fileList.add(baseDTO);
        } catch (Exception e) {
            // Record the failure against the file name so it can be reported downstream.
            UnifiedInvoiceDTO unifiedDTO = new UnifiedInvoiceDTO();
            unifiedDTO.setFileName(readingFile);
            unifiedDTO.setErrorMessage(e);
            BaseDTO baseDTO = new BaseDTO();
            baseDTO.setUnifiedDTO(unifiedDTO);
            fileList.add(baseDTO);
        }
    }
    return new IteratorItemReader<>(fileList);
}
Our questions:
Is this architecture correct?
Is there any performance or architectural advantage to using StaxEventItemReader and XStreamMarshaller over JAXB?
How should memory be handled properly to avoid slowdowns?
I would create a job per xml file by using the file name as a job parameter. This approach has many benefits:
Restartability: If a job fails, you only restart the failed file (from where it left off)
Scalability: This approach allows you to run multiple jobs in parallel. If a single machine is not enough, you can distribute the load on multiple machines
Logging: Logs are separate by design, you don't need to use an MDC or any other technique to separate logs
We are receiving XML filepath in a *.txt file
You can create a script that iterates over these lines and launches a job per line (that is, per file). GNU Parallel (or a similar tool) is a good option for launching jobs in parallel.
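As an alternative to a shell script, the same "one job per line, several at a time" idea can be sketched in plain Java with a fixed thread pool. This is only an illustration of the pattern: JobPerFileLauncher and the runJob callback are hypothetical names, and in a real setup runJob would launch the Spring Batch job with the file name as its job parameter. Here it is assumed to return true on success.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class JobPerFileLauncher {

    /** Runs one job per non-blank line, at most `parallelism` at a time; returns success per file. */
    public static Map<String, Boolean> launchAll(List<String> lines, int parallelism,
                                                 Function<String, Boolean> runJob) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            Map<String, Future<Boolean>> pending = new LinkedHashMap<>();
            for (String line : lines) {
                String path = line.trim();
                if (!path.isEmpty()) {
                    pending.put(path, pool.submit(() -> runJob.apply(path)));
                }
            }
            Map<String, Boolean> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Boolean>> entry : pending.entrySet()) {
                results.put(entry.getKey(), await(entry.getValue()));
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    /** A job that throws is reported as failed instead of aborting the other files. */
    private static boolean await(Future<Boolean> future) {
        try {
            return future.get();
        } catch (ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Because each file is an independent unit of work, a failed entry in the result map can simply be resubmitted on its own, which mirrors the restartability benefit described above.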

Lagom Jdbc Read-Side support: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'slick.profile'

I'm trying to set up a JDBC read-side processor in a Lagom service:
class ProjectEventsProcessor(readSide: JdbcReadSide)(implicit ec: ExecutionContext) extends ReadSideProcessor[ProjectEvent] {
  def buildHandler = {
    readSide.builder[ProjectEvent]("projectEventOffset")
      .setEventHandler[ProjectCreated]((conn: Connection, e: EventStreamElement[ProjectCreated]) => insertProject(e.event))
      .build
  }
  private def insertProject(e: ProjectCreated) = {
    Logger.info(s"Got event $e")
  }
  override def aggregateTags: Set[AggregateEventTag[ProjectEvent]] = ProjectEvent.Tag.allTags
}
The service connects to the database fine on startup:
15:40:32.575 [info] play.api.db.DefaultDBApi [] - Database [default] connected at jdbc:postgresql://localhost/postgres?user=postgres
But right after this I get an exception:
com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'slick.profile'
First of all, why is Slick involved here at all?
I'm using JdbcReadSide, not SlickReadSide.
OK, let's say JdbcReadSide uses Slick internally somehow.
I've added slick.profile to the application.config of my service:
db.default.driver="org.postgresql.Driver"
db.default.url="jdbc:postgresql://localhost/postgres?user=postgres"
// Tried this way
slick.profile="slick.jdbc.PostgresProfile$"
// Also this way (copied from the Play documentation).
slick.dbs.default.profile="slick.jdbc.PostgresProfile$"
slick.dbs.default.db.dataSourceClass = "slick.jdbc.DatabaseUrlDataSource"
slick.dbs.default.db.properties.driver = "org.postgresql.Driver"
But still getting this exception.
What is going on? How to solve this issue?
According to the docs, Lagom uses akka-persistence-jdbc, which under the hood:
uses Slick to map tables and manage asynchronous execution of JDBC calls.
A full configuration, which also uses the default connection pool (HikariCP), can be set in the application.conf file as follows (mostly copied from the docs):
# Defaults to use for each Akka persistence plugin
jdbc-defaults.slick {
# The Slick profile to use
# set to one of: slick.jdbc.PostgresProfile$, slick.jdbc.MySQLProfile$, slick.jdbc.OracleProfile$ or slick.jdbc.H2Profile$
profile = "slick.jdbc.PostgresProfile$"
# The JNDI name for the Slick pre-configured DB
# By default, this value will be used by all akka-persistence-jdbc plugin components (journal, read-journal and snapshot).
# you may configure each plugin component to use different DB settings
jndiDbName=DefaultDB
}
db.default {
driver = "org.postgresql.Driver"
url = "jdbc:postgresql://localhost/postgres?user=postgres"
# The JNDI name for this DataSource
# Play, and therefore Lagom, will automatically register this DataSource as a JNDI resource using this name.
# This DataSource will be used to build a pre-configured Slick DB
jndiName=DefaultDS
# Lagom will configure a Slick Database, using the async-executor settings below
# and register it as a JNDI resource using this name.
# By default, all akka-persistence-jdbc plugin components will use this JNDI name
# to look up this pre-configured Slick DB
jndiDbName=DefaultDB
async-executor {
# number of objects that can be queued by the async executor
queueSize = 10000
# 5 * number of cores
numThreads = 20
# same as number of threads
minConnections = 20
# same as number of threads
maxConnections = 20
# if true, a Mbean for AsyncExecutor will be registered
registerMbeans = false
}
# Hikari is the default connection pool and it's fine-tuned to use the same
# values for minimum and maximum connections as defined for the async-executor above
hikaricp {
minimumIdle = ${db.default.async-executor.minConnections}
maximumPoolSize = ${db.default.async-executor.maxConnections}
}
}
lagom.persistence.jdbc {
# Configuration for creating tables
create-tables {
# Whether tables should be created automatically as needed
auto = true
# How long to wait for tables to be created, before failing
timeout = 20s
# The cluster role to create tables from
run-on-role = ""
# Exponential backoff for failures configuration for creating tables
failure-exponential-backoff {
# minimum (initial) duration until processor is started again
# after failure
min = 3s
# the exponential back-off is capped to this duration
max = 30s
# additional random delay is based on this factor
random-factor = 0.2
}
}
}

How to stabilize spark streaming application with a handful of super big sessions?

I am running a Spark Streaming application based on the mapWithState DStream function. The application transforms input records into sessions based on a session ID field inside the records.
A session is simply all of the records with the same ID. I then perform some analytics at the session level to compute an anomaly score.
I couldn't stabilize my application because a handful of sessions keep growing with each batch for an extended period (more than 1 hour). My understanding is that a single session (key-value pair) is always processed by a single core in Spark. I want to know whether I am mistaken, and whether there is a way to mitigate this issue and make the streaming application stable.
I am using Hadoop 2.7.2 and Spark 1.6.1 on YARN. Changing the batch time, block interval, number of partitions, number of executors, and executor resources didn't solve the issue, as one single task always makes the application choke. However, filtering out those very long sessions solved the issue.
Below is the updateState function I am using:
val updateState = (batchTime: Time, key: String, value: Option[scala.collection.Map[String,Any]], state: State[Seq[scala.collection.Map[String,Any]]]) => {
  val session = Seq(value.getOrElse(scala.collection.Map[String,Any]())) ++ state.getOption.getOrElse(Seq[scala.collection.Map[String,Any]]())
  if (state.isTimingOut()) {
    Option(null)
  } else {
    state.update(session)
    Some((key,value,session))
  }
}
and the mapWithState call:
def updateStreamingState(inputDstream:DStream[scala.collection.Map[String,Any]]): DStream[(String,Option[scala.collection.Map[String,Any]], Seq[scala.collection.Map[String,Any]])] ={//MapWithStateDStream[(String,Option[scala.collection.Map[String,Any]], Seq[scala.collection.Map[String,Any]])] = {
  val spec = StateSpec.function(updateState)
  spec.timeout(Duration(sessionTimeout))
  spec.numPartitions(192)
  inputDstream.map(ds => (ds(sessionizationFieldName).toString, ds)).mapWithState(spec)
}
Finally, I compute features for each session in the resulting DStream, as defined below:
def computeSessionFeatures(sessionId:String,sessionRecords: Seq[scala.collection.Map[String,Any]]): Session = {
  val features = Functions.getSessionFeatures(sessionizationFeatures,recordFeatures,sessionRecords)
  val resultSession = new Session(sessionId,sessionizationFieldName,sessionRecords)
  resultSession.features = features
  return resultSession
}

Why does h2 ignore slf4j messages on the first connection when LOG is set?

See the sample code and output below (with SLF4J/Logback on stdout). I can't find any bug reports on this. I'm using H2 version 1.3.176 (last stable) in in-memory mode. It doesn't seem to matter which value LOG is set to (0, 1, or 2); it just has to be set.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
public class H2TraceTest {
    public static void main(String[] args) throws SQLException {
        System.out.println("Query connection 1");
        Connection myConn = DriverManager.getConnection("jdbc:h2:mem:tracetest;TRACE_LEVEL_FILE=4;LOG=2");
        myConn.createStatement().execute("SELECT 1");
        System.out.println("Query connection 2");
        DriverManager.getConnection("jdbc:h2:mem:tracetest").createStatement().execute("SELECT 1");
        System.out.println("Query connection 1 again");
        myConn.createStatement().execute("SELECT 1");
        System.out.println("End");
    }
}
Output:
Query connection 1
Query connection 2
16:17:02.955 INFO h2database - jdbc[3]
/**/Connection conn2 = DriverManager.getConnection("jdbc:h2:mem:tracetest", "", "");
16:17:02.958 DEBUG h2database - jdbc[3]
/**/Statement stat2 = conn2.createStatement();
16:17:02.959 DEBUG h2database - jdbc[3]
/**/stat2.execute("SELECT 1");
16:17:02.959 INFO h2database - jdbc[3]
/*SQL #:1*/SELECT 1;
Query connection 1 again
End
I know that the H2 documentation says TRACE_LEVEL_FILE affects all connections. But that's not (fully) correct:
Every connection keeps a lazy reference to the logging system. If you change it with the special marker TRACE_LEVEL_FILE=4, the reference isn't changed for all existing connections, only for those that do their first logging after the change.
So if you use the connection string "jdbc:h2:mem:tracetest;TRACE_LEVEL_FILE=4", everything works as expected, because your session writes no log message before the logging system is changed. Unfortunately, the LOG=2 in jdbc:h2:mem:tracetest;TRACE_LEVEL_FILE=4;LOG=2 is evaluated first, because both parameters are written into and read from an unordered Map. And because LOG=2 generates a log statement, the reference to the log adapter (=4) is never applied to the current session, only to the next one.
What you can do:
Use only "jdbc:h2:mem:tracetest;TRACE_LEVEL_FILE=4"; LOG=2 is the default anyway. If you need another log mode, you can use connection.createStatement().executeUpdate("SET LOG 1").
Add some default parameters to the connection string until TRACE_LEVEL_FILE is the first parameter in the map (not really reliable, as the order may depend on the VM).
Discard the first connection immediately.
File a bug report and wait for the fix (or fix it yourself), as I think this is a bug.
I know this is an old question, but here is a reliable way to do it (i.e. you can ensure that TRACE_LEVEL_FILE is set to 4 first):
String url = "jdbc:h2:mem:tracetest;INIT=SET TRACE_LEVEL_FILE=4\\;SET DB_CLOSE_DELAY=-1/* for example, i.e. do other stuff */";
