How to fix error 'Component: [x] subscribes from non-existent component [y]' in Apache Storm Trident topology - apache-storm

I just implemented a trident DRPC function to process incoming messages and I am trying to persist the count of processed tuples at the final stage of the topology as a Trident state. Here is my topology:
topology.newDRPCStream("portfolio")
.map(parseMapFunction,
new Fields("portfolioUrn", "portfolioSourceSystem", "portfolioRegion",
"portfolioTimestamp", "portfolioPayload"))
.filter(new FilterNull())
.flatMap(splitMapFunction,
new Fields("portfolioUrn", "portfolioSourceSystem", "portfolioRegion",
"portfolioTimestamp", "strategyCode"))
.parallelismHint(1)
.shuffle()
.each(new Fields("strategyCode"), findMongoTradesFunction,
new Fields("uitid", "id", "sourceSystem", "sourceTransactionTime", "publicationTime",
"tradeVersion", "urn", "riskViewFrom", "riskViewTo", "authorized"))
.parallelismHint(10)
.shuffle()
.filter(tradeFilterFunction)
.parallelismHint(150)
.groupBy(new Fields("uitid"))
.aggregate(
new Fields("portfolioUrn", "portfolioTimestamp", "strategyCode", "id", "sourceSystem",
"sourceTransactionTime", "publicationTime", "tradeVersion", "urn", "riskViewFrom",
"riskViewTo", "uitid"), reduceAggregateFunction,
new Fields("portfolioUrn", "portfolioTimestamp", "strategyCode", "id", "sourceSystem",
"sourceTransactionTime", "publicationTime", "tradeVersion", "urn", "riskViewFrom",
"riskViewTo"))
.parallelismHint(200)
.groupBy(new Fields("portfolioUrn"))
.persistentAggregate(stateFactory, new Count(), new Fields("count"));
While I am trying to submit this topology to Storm I am having this error:
Exception in thread "main" java.lang.RuntimeException: InvalidTopologyException(msg:Component: [b-4] subscribes from non-existent component [$mastercoord-bg0])
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:273)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:387)
at org.apache.storm.StormSubmitter.submitTopology(StormSubmitter.java:159)
at com.citi.tm.portfolio.tps.PortfolioLauncher.main(PortfolioLauncher.java:34)
Caused by: InvalidTopologyException(msg:Component: [b-4] subscribes from non-existent component [$mastercoord-bg0])
at org.apache.storm.generated.Nimbus$submitTopology_result$submitTopology_resultStandardScheme.read(Nimbus.java:8070)
at org.apache.storm.generated.Nimbus$submitTopology_result$submitTopology_resultStandardScheme.read(Nimbus.java:8047)
at org.apache.storm.generated.Nimbus$submitTopology_result.read(Nimbus.java:7981)
at org.apache.storm.thrift.TServiceClient.receiveBase(TServiceClient.java:86)
at org.apache.storm.generated.Nimbus$Client.recv_submitTopology(Nimbus.java:306)
at org.apache.storm.generated.Nimbus$Client.submitTopology(Nimbus.java:290)
at org.apache.storm.StormSubmitter.submitTopologyInDistributeMode(StormSubmitter.java:326)
at org.apache.storm.StormSubmitter.submitTopologyAs(StormSubmitter.java:260)
... 3 more
I can successfully submit the topology if I remove the last 2 functions from that topology , that are:
.groupBy(new Fields("portfolioUrn"))
.persistentAggregate(stateFactory, new Count(), new Fields("count"));
After I run my aggregation function (aggregate()) I'd like to group tuples with 'portfolioUrn' field and persist the count into mongoDB. I don't understand why the last groupBy().persistentAggregate() section causes this error. Could you please help to find the reason?

After some research, I found this page which seems a similar case to me. Nathan Marz is stating that DRPC topologies do not support partition persist (as of 2013) and I believe it is the same for my case as well. I think, (not fully verified) Storm 1.2.1 DRPC topologies might not support state persistence at all.

Related

KTable & LogAndContinueExceptionHandler

I have a very simple consumer from which I create a materialized view. I have enabled validation on my value object (throwing Constraintviolationexception for invalid json data). When I receive a value on which the validation fails, I exepct the value to logged & consumer should read the next offset as I have LogAndContinueExceptionHandler enabled.
However LogAndContinueExceptionHandler is never invoked and consumePojo State transition from PENDING_ERROR to ERROR
Code
#Bean
public Consumer<KTable<String, Pojo>> consume() {
return values->
values
.filter((key, value) -> Objects.nonNull(key))
.mapValues(value-> value, Materialized.<String, Pojo>as(Stores.inMemoryKeyValueStore("POJO_STORE_NAME"))
.withKeySerde(Serdes.String())
.withValueSerde(SerdeUtil.pojoSerde())
.withLoggingDisabled())
.toStream()
.peek((key, value) -> log.debug("Receiving Pojo from topic with key: {}, and UUID: {}", key, value == null ? 0 : value.getUuid()));
}
Why is it that LogAndContinueExceptionHandler is not invoked in case of KTable?
Note: If code is changed to KStreams then I see logging and records being skipped but with KTable not !!
In order to handle exceptions not handled by Kafka Streams use the KafkaStreams.setUncaughtExceptionHandler method and StreamsUncaughtExceptionHandler implementation, this needs to return one of 3 available enumerations:
REPLACE_THREAD
SHUTDOWN_CLIENT
SHUTDOWN_APPLICATION
and in your case REPLACE_THREAD is the best option, as you can see in KIP-671:
REPLACE_THREAD:
The current thread is shutdown and transits to state DEAD.
A new thread is started if the Kafka Streams client is in state RUNNING or REBALANCING.
For the Global thread this option will log an error and revert to shutting down the client until the option had been added
In Spring Kafka you can replace default StreamsUncaughtExceptionHandler by StreamsBuilderFactoryBean:
#Autowired
void setMyStreamsUncaughtExceptionHandler(StreamsBuilderFactoryBean streamsBuilderFactoryBean) {
streamsBuilderFactoryBean.setStreamsUncaughtExceptionHandler(exception -> StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.REPLACE_THREAD);
}
I was able to solve the problem after looking at the logs carefully, I found that valueSerde for the Pojo, was showing useNativeDecoding (default being JsonSerde) due to this DeserializationExceptionHandler wasn't invoked and thread terminated.
Problem went away when I fixed the valueSerde in application.properties

Why are my storm topology not acking when i send tuple to ElasticSearch

I'm new to using Storm, I've just started a Data Architect training course and it's in this context that I'm facing the problem that brings me to you today.
I'm receiving messages from kakfa via a KafkaSpout named CurrentPriceSpout. So far, everything is working. Then, in my CurrentPriceBolt, I re-issued a tuple so that my data is written in ElasticSearch using the EsCurrentPriceBolt . The problem is here. I can't write my data into ElasticSearch directly, it is only written when I delete my topology.
Is there a Storm parameter that can force the writing of tuples by retrieving acknowledgments?
I tried by adding the option ".addConfiguration(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 5)", the tuples are well written in ElasticSearch but not acknowledged. So Storm rewrites them indefinitely.
Thanks for your help
Thierry
I managed to find the answer to my problem.
The main problem was that ES is not designed to ingest as little data as is generated in a study project. ES writes, by default, data in batches of 1000 entries. With this project, I generate one data every 30 seconds, or a batch of 1000 every 500 minutes (or 8h20).
so I reviewed in detail the configuration of my topology and played with the following options:
es.batch.size.entries: 1
es.storm.bolt.flush.entries.size: 1
topology.producer.batch.size: 1
topology.transfer.batch.size: 1
And now it goes like this:
...
...
public class App
{
...
...
public static void main( String[] args ) throws AlreadyAliveException, InvalidTopologyException, AuthorizationException
{
...
...
StormTopology topology = topologyBuilder.createTopology(); // je crée ma topologie Storm
String topologyName = properties.getProperty("storm.topology.name"); // je nomme ma topologie
StormSubmitter.submitTopology(topologyName, getTopologyConfig(properties), topology); // je démarre ma topologie sur mon cluster storm
System.out.println( "Topology on remote cluster : Started!" );
}
private static Config getTopologyConfig(Properties properties)
{
Config stormConfig = new Config();
stormConfig.put("topology.workers", Integer.parseInt(properties.getProperty("topology.workers")));
stormConfig.put("topology.enable.message.timeouts", Boolean.parseBoolean(properties.getProperty("topology.enable.message.timeouts")));
stormConfig.put("topology.message.timeout.secs", Integer.parseInt(properties.getProperty("topology.message.timeout.secs")));
stormConfig.put("topology.transfer.batch.size", Integer.parseInt(properties.getProperty("topology.transfer.batch.size")));
stormConfig.put("topology.producer.batch.size", Integer.parseInt(properties.getProperty("topology.producer.batch.size")));
return stormConfig;
}
...
...
...
}
And now it works!!!

Why would Kafka Stream app crash with suppress() enabled?

I'm writing a kafka stream 2.3.0 application to count the number of events in a session window and hopefully to print out only the final record when a session times out.
Serde<String> stringSerde = Serdes.serdeFrom(new StringSerializer(), new StringDeserializer());
Serde<MuseObject> museObjectSerde = Serdes.serdeFrom(new MuseObjectSerializer(), new MuseObjectDeserializer());
StreamsBuilder builder = new StreamsBuilder();
builder
.stream(INPUT_TOPIC, Consumed.with(stringSerde, museObjectSerde))
.map((key, value) -> {
return KeyValue.pair(value.getSourceValue("vid"), value.toString());
})
.groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
.windowedBy(SessionWindows.with(Duration.ofSeconds(INACTIVITY_GAP)).grace(Duration.ZERO))
.count(Materialized.with(Serdes.String(), Serdes.Long()))
.suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
.toStream()
.print(Printed.toSysOut());
However the application crashes when a session times out:
12:35:03.859 [kafka-producer-network-thread | kafka-streams-test-kgu-4c3f2398-8f67-429d-82ce-6062c86af466-StreamThread-1-producer] ERROR o.a.k.s.p.i.RecordCollectorImpl - task [1_0] Error sending record to topic kafka-streams-test-kgu-KTABLE-SUPPRESS-STATE-STORE-0000000008-changelog due to The server experienced an unexpected error when processing the request.; No more records will be sent and no more offsets will be recorded for this task. Enable TRACE logging to view failed record key and value.
org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request.
12:35:03.862 [kafka-streams-test-kgu-4c3f2398-8f67-429d-82ce-6062c86af466-StreamThread-1] ERROR o.a.k.s.p.i.AssignedStreamsTasks - stream-thread [kafka-streams-test-kgu-4c3f2398-8f67-429d-82ce-6062c86af466-StreamThread-1] Failed to commit stream task 1_0 due to the following error:
org.apache.kafka.streams.errors.StreamsException: task [1_0] Abort sending since an error caught with a previous record (key user01\x00\x00\x01m!\xCE\x99u\x00\x00\x01m!\xCE\x80\xD1 value null timestamp null) to topic kafka-streams-test-kgu-KTABLE-SUPPRESS-STATE-STORE-0000000008-changelog due to org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request.
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:138)
I've tried to comment out ".suppress..." line. It works fine without suppress() and prints out something like this
[KSTREAM-FILTER-0000000011]: [user01#1568230244561/1568230250869], MuseSession{vid='user01', es='txnSuccess', count=6, start=2019-06-26 17:11:02.937, end=2019-06-26 18:07:10.685, sessionType='open'}".
What did I miss in using suppress()? Is there another way to filter out only the session records that have been timed out?
Any help is appreciated. Thanks in advance.
suppress() requires at least broker version 0.11.0 and message format 0.11.

Dataproc conflict in hadoop temporary tables

I have a flow that executes spark jobs on Dataproc clusters in parallel for different zones. For each zone it creates a cluster, execute the spark job and delete the cluster after it finishes.
The spark job uses the org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset method passing the BigQuery Configuration to save data on BigQuery table. The job saves data in more than one table, calling the saveAsNewAPIHadoopDataset method more than one time per job.
The problem is that sometimes I'm getting a error caused by a conflict in the Hadoop temporary BigQuery Dataset that it internally creates to run the jobs:
Exception in thread "main" com.google.api.client.googleapis.json.GoogleJsonResponseException: 409 Conflict
{
"code" : 409,
"errors" : [ {
"domain" : "global",
"message" : "Already Exists: Dataset <my-gcp-project>:<MY-DATASET>_hadoop_temporary_job_201802250620_0013",
"reason" : "duplicate"
} ],
"message" : "Already Exists: Dataset <my-gcp-project>:<MY-DATASET>_hadoop_temporary_job_201802250620_0013"
}
at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:145)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1056)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
at com.google.cloud.hadoop.io.bigquery.BigQueryOutputCommitter.setupJob(BigQueryOutputCommitter.java:107)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1150)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1078)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1078)
at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopDataset(JavaPairRDD.scala:819)
...
The timestamp 201802250620_0013 on the exception above has the _0013 sufix that I'm unsure if it represents time.
My thoughts is that sometimes the jobs runs at the same time and try to create a dataset with same timestamp in the name. Either in a parallel job or inside the same job on another saveAsNewAPIHadoopDataset call.
How can we avoid this error without putting a delay on the job execution?
The dependency that I'm using is:
<dependency>
<groupId>com.google.cloud.bigdataoss</groupId>
<artifactId>bigquery-connector</artifactId>
<version>0.10.2-hadoop2</version>
<scope>provided</scope>
</dependency>
The Dataproc image version is 1.1
Edit 1:
I tried using IndirectBigQueryOutputFormat but now I'm getting an error saying that the gcs output path already exists even passing it different time on each saveAsNewAPIHadoopDataset call.
Here is my code:
SparkConf sc = new SparkConf().setAppName("MyApp");
try (JavaSparkContext jsc = new JavaSparkContext(sc)) {
JavaPairRDD<String, String> filesJson = jsc.wholeTextFiles(jsonFolder, parts);
JavaPairRDD<String, String> jsons = filesJson.flatMapToPair(new FileSplitter()).repartition(parts);
JavaPairRDD<Object, JsonObject> objsJson = jsons.flatMapToPair(new JsonParser()).filter(t -> t._2() != null).cache();
objsJson
.filter(new FilterType(MSG_TYPE1))
.saveAsNewAPIHadoopDataset(createConf("my-project:MY_DATASET.MY_TABLE1", "gs://my-bucket/tmp1"));
objsJson
.filter(new FilterType(MSG_TYPE2))
.saveAsNewAPIHadoopDataset(createConf("my-project:MY_DATASET.MY_TABLE2", "gs://my-bucket/tmp2"));
objsJson
.filter(new FilterType(MSG_TYPE3))
.saveAsNewAPIHadoopDataset(createConf("my-project:MY_DATASET.MY_TABLE3", "gs://my-bucket/tmp3"));
// here goes another ingestion process. same code as above but diferrent params, parsers, etc.
}
Configuration createConf(String table, String outGCS) {
Configuration conf = new Configuration();
BigQueryOutputConfiguration.configure(conf, table, null, outGCS, BigQueryFileFormat.NEWLINE_DELIMITED_JSON, TextOutputFormat.class);
conf.set("mapreduce.job.outputformat.class", IndirectBigQueryOutputFormat.class.getName());
return conf;
}
I believe what may be happening is that each mapper tries to create its own dataset. This is rather inefficient (and burns your daily quota proportional to the number of mappers).
An alternative is to use IndirectBigQueryOutputFormat for output class:
IndirectBigQueryOutputFormat works by first buffering all the data into a Cloud Storage temporary table, and then, on commitJob, copies all data from Cloud Storage into BigQuery in one operation. Its use is recommended for large jobs since it only requires one BigQuery "load" job per Hadoop/Spark job, as compared to BigQueryOutputFormat, which performs one BigQuery job for each Hadoop/Spark task.
See the example here: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example

Spark Streaming with multiple Kafka streams

I create kafka stream with the following codes:
val streams = (1 to 5) map {i =>
KafkaUtils.createStream[....](
streamingContext,
Map( .... ),
Map(topic -> numOfPartitions),
StorageLevel.MEMORY_AND_DISK_SER
).filter(...)
.mapPartitions(...)
.reduceByKey(....)
val unifiedStream = streamingContext.union(streams)
unifiedStream.foreachRDD(...)
streamingContext.start()
I give each stream different group id. When I run the application, only part of kafka messages are received and the executor is pending at foreachRDD call. If I only create one stream, everything works well. There aren't any exceptions from logging info.
I don't know why the application is stuck there. Does it mean no enough resources?
You want to try set the parameter
SparkConf().set("spark.streaming.concurrentJobs", "5")

Resources