Spark Streaming with multiple Kafka streams - spark-streaming

I create kafka stream with the following codes:
val streams = (1 to 5) map {i =>
KafkaUtils.createStream[....](
streamingContext,
Map( .... ),
Map(topic -> numOfPartitions),
StorageLevel.MEMORY_AND_DISK_SER
).filter(...)
.mapPartitions(...)
.reduceByKey(....)
val unifiedStream = streamingContext.union(streams)
unifiedStream.foreachRDD(...)
streamingContext.start()
I give each stream different group id. When I run the application, only part of kafka messages are received and the executor is pending at foreachRDD call. If I only create one stream, everything works well. There aren't any exceptions from logging info.
I don't know why the application is stuck there. Does it mean no enough resources?

You want to try set the parameter
SparkConf().set("spark.streaming.concurrentJobs", "5")

Related

KTable & LogAndContinueExceptionHandler

I have a very simple consumer from which I create a materialized view. I have enabled validation on my value object (throwing Constraintviolationexception for invalid json data). When I receive a value on which the validation fails, I exepct the value to logged & consumer should read the next offset as I have LogAndContinueExceptionHandler enabled.
However LogAndContinueExceptionHandler is never invoked and consumePojo State transition from PENDING_ERROR to ERROR
Code
#Bean
public Consumer<KTable<String, Pojo>> consume() {
return values->
values
.filter((key, value) -> Objects.nonNull(key))
.mapValues(value-> value, Materialized.<String, Pojo>as(Stores.inMemoryKeyValueStore("POJO_STORE_NAME"))
.withKeySerde(Serdes.String())
.withValueSerde(SerdeUtil.pojoSerde())
.withLoggingDisabled())
.toStream()
.peek((key, value) -> log.debug("Receiving Pojo from topic with key: {}, and UUID: {}", key, value == null ? 0 : value.getUuid()));
}
Why is it that LogAndContinueExceptionHandler is not invoked in case of KTable?
Note: If code is changed to KStreams then I see logging and records being skipped but with KTable not !!
In order to handle exceptions not handled by Kafka Streams use the KafkaStreams.setUncaughtExceptionHandler method and StreamsUncaughtExceptionHandler implementation, this needs to return one of 3 available enumerations:
REPLACE_THREAD
SHUTDOWN_CLIENT
SHUTDOWN_APPLICATION
and in your case REPLACE_THREAD is the best option, as you can see in KIP-671:
REPLACE_THREAD:
The current thread is shutdown and transits to state DEAD.
A new thread is started if the Kafka Streams client is in state RUNNING or REBALANCING.
For the Global thread this option will log an error and revert to shutting down the client until the option had been added
In Spring Kafka you can replace default StreamsUncaughtExceptionHandler by StreamsBuilderFactoryBean:
#Autowired
void setMyStreamsUncaughtExceptionHandler(StreamsBuilderFactoryBean streamsBuilderFactoryBean) {
streamsBuilderFactoryBean.setStreamsUncaughtExceptionHandler(exception -> StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.REPLACE_THREAD);
}
I was able to solve the problem after looking at the logs carefully, I found that valueSerde for the Pojo, was showing useNativeDecoding (default being JsonSerde) due to this DeserializationExceptionHandler wasn't invoked and thread terminated.
Problem went away when I fixed the valueSerde in application.properties

How to fix error 'Component: [x] subscribes from non-existent component [y]' in Apache Storm Trident topology

I just implemented a trident DRPC function to process incoming messages and I am trying to persist the count of processed tuples at the final stage of the topology as a Trident state. Here is my topology:
topology.newDRPCStream("portfolio")
.map(parseMapFunction,
new Fields("portfolioUrn", "portfolioSourceSystem", "portfolioRegion",
"portfolioTimestamp", "portfolioPayload"))
.filter(new FilterNull())
.flatMap(splitMapFunction,
new Fields("portfolioUrn", "portfolioSourceSystem", "portfolioRegion",
"portfolioTimestamp", "strategyCode"))
.parallelismHint(1)
.shuffle()
.each(new Fields("strategyCode"), findMongoTradesFunction,
new Fields("uitid", "id", "sourceSystem", "sourceTransactionTime", "publicationTime",
"tradeVersion", "urn", "riskViewFrom", "riskViewTo", "authorized"))
.parallelismHint(10)
.shuffle()
.filter(tradeFilterFunction)
.parallelismHint(150)
.groupBy(new Fields("uitid"))
.aggregate(
new Fields("portfolioUrn", "portfolioTimestamp", "strategyCode", "id", "sourceSystem",
"sourceTransactionTime", "publicationTime", "tradeVersion", "urn", "riskViewFrom",
"riskViewTo", "uitid"), reduceAggregateFunction,
new Fields("portfolioUrn", "portfolioTimestamp", "strategyCode", "id", "sourceSystem",
"sourceTransactionTime", "publicationTime", "tradeVersion", "urn", "riskViewFrom",
"riskViewTo"))
.parallelismHint(200)
.groupBy(new Fields("portfolioUrn"))
.persistentAggregate(stateFactory, new Count(), new Fields("count"));
While I am trying to submit this topology to Storm I am having this error:
Exception in thread "main" java.lang.RuntimeException: InvalidTopologyException(msg:Component: [b-4] subscribes from non-existent component [$mastercoord-bg0])
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:273)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:387)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:159)
at com.citi.tm.portfolio.tps.PortfolioLauncher.main(PortfolioLauncher.java:34)
Caused by: InvalidTopologyException(msg:Component: [b-4] subscribes from non-existent component [$mastercoord-bg0])
at org.apache.storm.generated.Nimbus$submitTopology_result$submitTopology_resultStandardScheme.read(Nimbus.java:8070)
at org.apache.storm.generated.Nimbus$submitTopology_result$submitTopology_resultStandardScheme.read(Nimbus.java:8047)
at org.apache.storm.generated.Nimbus$submitTopology_result.read(Nimbus.java:7981)
at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
at org.apache.storm.generated.Nimbus$Client.recv_submitTopology(Nimbus.java:306)
at org.apache.storm.generated.Nimbus$Client.submitTopology(Nimbus.java:290)
at org.apache.storm.StormSubmitter.submitTopologyInDistributeMode(StormSubmitter.java:326)
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:260)
... 3 more
I can successfully submit the topology if I remove the last 2 functions from that topology , that are:
.groupBy(new Fields("portfolioUrn"))
.persistentAggregate(stateFactory, new Count(), new Fields("count"));
After I run my aggregation function (aggregate()) I'd like to group tuples with 'portfolioUrn' field and persist the count into mongoDB. I don't understand why the last groupBy().persistentAggregate() section causes this error. Could you please help to find the reason?
After some research, I found this page which seems a similar case to me. Nathan Marz is stating that DRPC topologies do not support partition persist (as of 2013) and I believe it is the same for my case as well. I think, (not fully verified) Storm 1.2.1 DRPC topologies might not support state persistence at all.

Spring integration Scatter-Gather pattern with JMS transport

I need to implement the following architecture:
I have data that must be sent to systems (Some external application ) using JMS.
Depending on the data you need to send only to the necessary systems (For example, if the number of systems is 4, then you can send from 1 to 4 )
It is necessary to wait for a response from the systems to which the messages were sent, after receiving all the answers, it is required to process the received data (or to process at least one timeout)
The correlation id is contained in the header of both outgoing and incoming JMS messages
Each new such process can be started asynchronously and in parallel
Now I have it implemented only with the help of Spring JMS. I synchronize the threads manually, also manually I manage the thread pools.
The correlation ids and information about the systems in which messages were sent are stored as a state and update it after receiving new messages, etc.
But I want to simplify the logic and use Spring-integration Java DSL, Scatter gather pattern (Which is just my case) and other useful Spring features.
Can you help me show an example of how such an architecture can be implemented with the help of Spring-integration/IntregrationFlow?
Here is some sample from our test-cases:
#Bean
public IntegrationFlow scatterGatherFlow() {
return f -> f
.scatterGather(scatterer -> scatterer
.applySequence(true)
.recipientFlow(m -> true, sf -> sf.handle((p, h) -> Math.random() * 10))
.recipientFlow(m -> true, sf -> sf.handle((p, h) -> Math.random() * 10))
.recipientFlow(m -> true, sf -> sf.handle((p, h) -> Math.random() * 10)),
gatherer -> gatherer
.releaseStrategy(group ->
group.size() == 3 ||
group.getMessages()
.stream()
.anyMatch(m -> (Double) m.getPayload() > 5)),
scatterGather -> scatterGather
.gatherTimeout(10_000));
}
So, there is the parts:
scatterer - to send messages to recipients. In your case all those JMS services. That can be a scatterChannel though. Typically PublishSubscribeChannel, so Scatter-Gather might not know subscrbibers in adavance.
gatherer - well, it is just an aggregator with all its possible options.
scatterGather - is just for convenience for the direct properties of the ScatterGatherHandler and common endpoint options.

How to stabilize spark streaming application with a handful of super big sessions?

I am running a Spark Streaming application based on mapWithState DStream function . The application transforms input records into sessions based on a session ID field inside the records.
A session is simply all of the records with the same ID . Then I perform some analytics on a session level to find an anomaly score.
I couldn't stabilize my application because a handful of sessions are getting bigger at each batch time for extended period ( more than 1h) . My understanding is a single session (key - value pair) is always processed by a single core in spark . I want to know if I am mistaken , and if there is a solution to mitigate this issue and make the streaming application stable.
I am using Hadoop 2.7.2 and Spark 1.6.1 on Yarn . Changing batch time, blocking interval , partitions number, executor number and executor resources didn't solve the issue as one single task makes the application always choke. However, filtering those super long sessions solved the issue.
Below is a code updateState function I am using :
val updateState = (batchTime: Time, key: String, value: Option[scala.collection.Map[String,Any]], state: State[Seq[scala.collection.Map[String,Any]]]) => {
val session = Seq(value.getOrElse(scala.collection.Map[String,Any]())) ++ state.getOption.getOrElse(Seq[scala.collection.Map[String,Any]]())
if (state.isTimingOut()) {
Option(null)
} else {
state.update(session)
Some((key,value,session))
}
}
and the mapWithStae call :
def updateStreamingState(inputDstream:DStream[scala.collection.Map[String,Any]]): DStream[(String,Option[scala.collection.Map[String,Any]], Seq[scala.collection.Map[String,Any]])] ={//MapWithStateDStream[(String,Option[scala.collection.Map[String,Any]], Seq[scala.collection.Map[String,Any]])] = {
val spec = StateSpec.function(updateState)
spec.timeout(Duration(sessionTimeout))
spec.numPartitions(192)
inputDstream.map(ds => (ds(sessionizationFieldName).toString, ds)).mapWithState(spec)
}
Finally I am applying a feature computing session foreach DStream , as defined below :
def computeSessionFeatures(sessionId:String,sessionRecords: Seq[scala.collection.Map[String,Any]]): Session = {
val features = Functions.getSessionFeatures(sessionizationFeatures,recordFeatures,sessionRecords)
val resultSession = new Session(sessionId,sessionizationFieldName,sessionRecords)
resultSession.features = features
return resultSession
}

How to safely write to one file from many verticle instances in vert.x 3.2?

Instead of using a logger or database server I'd like to append information to one file from possibly many verticle instances.
There are versions of methods for writing asynchronously to a file.
Can I assume that vertx handles the synchronisation between the writes so that these dont interfere when using those versions of methods marked as ¨async¨ ?
There seems to be a rule that one can rely on vertx providing all isolation between concurrent processing out of the box. But is that true in case of writing file access?
Could you please include a code snippet into the answer that shows how to open and write to one file from many verticle instances with finest possible granularity, e.g. for logging requests.
I wouldn't recommend writing to a single file with many different "writers". Regarding concurrent logging I would stick to the Single Writer principle.
Create a Verticle which subscribes to the Event Bus and listens for messages to be logged. Lets call this Verticle Logger which listens to system.logger.
EventBus eb = vertx.eventBus();
eb.consumer("system.logger", message -> {
// write to file
});
Verticles which like to log something need to send a message to the Logger Verticle:
eventBus.send("system.logger", "foobar");
Appending to a existing file work something like this (didn't test):
vertx.fileSystem().open("file.log", new OpenOptions(), result -> {
if (result.succeeded()) {
Buffer buff = Buffer.buffer(message); // message from consume
AsyncFile file = result.result();
file.write(buff, buff.length() * i, ar -> {
if (ar.succeeded()) {
System.out.println("done");
} else {
System.err.println("write failed: " + ar.cause());
}
});
} else {
System.err.println("open file failed " + result.cause());
}
});

Resources