Exception in Storm Topology having WindowedBolt - apache-storm

I am trying to run a topology that has a windowed bolt, but I am getting the following exception:
Exception in thread "main" java.lang.NullPointerException
at org.apache.storm.topology.WindowedBoltExecutor.declareOutputFields(WindowedBoltExecutor.java:309)
at org.apache.storm.topology.TopologyBuilder.getComponentCommon(TopologyBuilder.java:432)
at org.apache.storm.topology.TopologyBuilder.createTopology(TopologyBuilder.java:120)
at Main.main(Main.java:23)
I have created a custom windowed bolt by extending BaseWindowedBolt.
Topology code:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("integer", new RandomIntegerSpout(), 1);
builder.setBolt("tumblingsum", new CustomTumblingSumWindow().withTumblingWindow(new Duration(10, TimeUnit.SECONDS)),1).shuffleGrouping("integer");
builder.setBolt("final", new ResultBolt(),1).shuffleGrouping("tumblingsum");
Config config = new Config();
config.put(Config.TOPOLOGY_WORKERS, 1);
StormSubmitter.submitTopology("Test-Windowing-Topology", config, builder.createTopology());
The Storm version is 1.2.2.
If I run the above topology without the windowed bolt, it works.
Am I missing anything?
Thanks

The line you're getting an exception from is https://github.com/apache/storm/blob/v1.2.2/storm-core/src/jvm/org/apache/storm/topology/WindowedBoltExecutor.java#L309.
My guess would be that your bolt is returning null from getComponentConfiguration. This looks like a bug, but you can work around it by returning an empty map from getComponentConfiguration.
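A minimal sketch of that workaround, assuming a custom bolt like the CustomTumblingSumWindow from the question (the override and the output field shown here are illustrative):
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseWindowedBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.windowing.TupleWindow;

public class CustomTumblingSumWindow extends BaseWindowedBolt {

    @Override
    public Map<String, Object> getComponentConfiguration() {
        // Guard against handing null back to WindowedBoltExecutor.declareOutputFields.
        Map<String, Object> conf = super.getComponentConfiguration();
        return conf != null ? conf : new HashMap<>();
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sum")); // illustrative output field
    }

    @Override
    public void execute(TupleWindow inputWindow) {
        // the original summing/emitting logic of the bolt goes here
    }
}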
Raised https://issues.apache.org/jira/browse/STORM-3211 to fix it.

Related

How to submit apache beam dataflow job to GCP through java application

I have a Dataflow job written in Apache Beam with Java. I am able to run the Dataflow job in GCP through these steps:
Create a Dataflow template from my code, then upload the template to Cloud Storage.
Create the job directly from the template option available in GCP -> Dataflow -> Jobs.
This flow is working fine.
I want to do the same steps through a Java app. That is, I have an API, and when someone sends a request to that API, I want to start this Dataflow job from the template I have already stored in Cloud Storage.
I could see that a REST API is available to implement this approach, as below:
POST /v1b3/projects/project_id/locations/loc/templates:launch?gcsPath=template-location
But I didn't find any reference or samples for this. I tried the approach below.
In my Spring Boot project I added this dependency:
<!-- https://mvnrepository.com/artifact/com.google.apis/google-api-services-dataflow -->
<dependency>
    <groupId>com.google.apis</groupId>
    <artifactId>google-api-services-dataflow</artifactId>
    <version>v1b3-rev20210825-1.32.1</version>
</dependency>
and added the code below in my controller:
public static void createJob() throws IOException {
    GoogleCredential credential = GoogleCredential.fromStream(new FileInputStream("myCertKey.json"))
            .createScoped(java.util.Arrays.asList("https://www.googleapis.com/auth/cloud-platform"));
    try {
        Dataflow dataflow = new Dataflow.Builder(new LowLevelHttpRequest(), new JacksonFactory(),
                credential).setApplicationName("my-job").build(); // <-- this gives the error

        // RuntimeEnvironment
        RuntimeEnvironment env = new RuntimeEnvironment();
        env.setBypassTempDirValidation(false);
        // all my env configs added

        // parameters
        HashMap<String, String> params = new HashMap<>();
        params.put("bigtableEmulatorPort", "-1");
        params.put("gcsPath", "gs://bucket//my.json");
        // all other params

        LaunchTemplateParameters content = new LaunchTemplateParameters();
        content.setJobName("Test-job");
        content.setEnvironment(env);
        content.setParameters(params);

        dataflow.projects().locations().templates().launch("project-id", "location", content);
    } catch (Exception e) {
        log.info("error occured", e);
    }
}
This gives the error {"id":null,"message":"'boolean com.google.api.client.http.HttpTransport.isMtls()'"} on this line:
Dataflow dataflow = new Dataflow.Builder(new LowLevelHttpRequest(), new JacksonFactory(),
credential).setApplicationName("my-job").build();
This is because the Dataflow builder expects an HttpTransport as its first argument, but I passed a LowLevelHttpRequest.
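For reference, a minimal sketch of constructing the client with an actual HttpTransport, assuming GoogleNetHttpTransport from google-http-client (the class name DataflowClientFactory is purely illustrative):
import java.io.FileInputStream;
import java.util.Collections;

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.http.HttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.dataflow.Dataflow;

public class DataflowClientFactory {

    public static Dataflow createClient() throws Exception {
        GoogleCredential credential = GoogleCredential
                .fromStream(new FileInputStream("myCertKey.json"))
                .createScoped(Collections.singletonList("https://www.googleapis.com/auth/cloud-platform"));

        // Pass an HttpTransport (not a LowLevelHttpRequest) as the first builder argument.
        HttpTransport transport = GoogleNetHttpTransport.newTrustedTransport();

        return new Dataflow.Builder(transport, JacksonFactory.getDefaultInstance(), credential)
                .setApplicationName("my-job")
                .build();
    }
}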
I am not sure whether this is the correct way to implement this. Can anyone suggest how to implement it, or point me to any examples or references?
Thanks a lot :)

Using spring-session-hazelcast on Kubernetes with service-dns causing SplitBrainMergeValidationOp ERROR

We are deploying a spring-boot application using spring-session-hazelcast + hazelcast-kubernetes on an OpenShift/Kubernetes cluster.
Due to the nature of our platform, we can only use the service-dns configuration. We expose a service on port 5701 for multicasting and set the service-dns property to the multicast service name.
Below is a snippet for creation of the Hazelcast instance.
@Bean
public HazelcastInstance hazelcastInstance() {
    var config = new Config();
    config.setClusterName("spring-session-cluster");

    var join = config.getNetworkConfig().getJoin();
    join.getTcpIpConfig().setEnabled(false);
    join.getMulticastConfig().setEnabled(false);
    join.getKubernetesConfig().setEnabled(true)
            .setProperty("service-dns", "<multicast-service-name>");

    var attribute = new AttributeConfig()
            .setName(Hazelcast4IndexedSessionRepository.PRINCIPAL_NAME_ATTRIBUTE)
            .setExtractorClassName(Hazelcast4PrincipalNameExtractor.class.getName());
    config.getMapConfig(Hazelcast4IndexedSessionRepository.DEFAULT_SESSION_MAP_NAME)
            .addAttributeConfig(attribute)
            .addIndexConfig(new IndexConfig(IndexType.HASH,
                    Hazelcast4IndexedSessionRepository.PRINCIPAL_NAME_ATTRIBUTE));

    var serializer = new SerializerConfig();
    serializer.setImplementation(new HazelcastSessionSerializer())
            .setTypeClass(MapSession.class);
    config.getSerializationConfig().addSerializerConfig(serializer);

    return Hazelcast.newHazelcastInstance(config);
}
When we run 2 pods for this application, we see the below ERROR log:
com.hazelcast.internal.cluster.impl.operations.SplitBrainMergeValidationOp
Message: [<private-ip>]:5701 [spring-session-cluster] [4.2] Target is this node! -> [<private-ip>]:5701
Can someone please explain how to fix this error while still using the "service-dns" configuration?
You need to enable headless mode for your service in OpenShift, i.e. a Service with clusterIP set to None (see the sketch below).
https://github.com/hazelcast/hazelcast-kubernetes#dns-lookup
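A hedged example of such a headless Service; the metadata name and selector labels are placeholders to adapt to your deployment:
apiVersion: v1
kind: Service
metadata:
  name: <multicast-service-name>   # the name referenced by the service-dns property
spec:
  clusterIP: None                  # headless: DNS resolves to the individual pod IPs
  selector:
    app: my-spring-session-app     # placeholder: match your deployment's labels
  ports:
    - name: hazelcast
      port: 5701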
Just add a configuration for split-brain protection:
SplitBrainProtectionConfig splitBrainProtectionConfig = new SplitBrainProtectionConfig();
splitBrainProtectionConfig.setName("splitBrainProtectionRuleWithFourMembers")
.setEnabled(true)
.setMinimumClusterSize(4);
MapConfig mapConfig = new MapConfig();
mapConfig.setSplitBrainProtectionName("splitBrainProtectionRuleWithFourMembers");
Config config = new Config();
config.addSplitBrainProtectionConfig(splitBrainProtectionConfig);
config.addMapConfig(mapConfig);
You can read more about this in the Hazelcast documentation:
https://docs.hazelcast.com/imdg/4.2/network-partitioning/split-brain-protection.html

FlattenFields is not working as expected

We have been working on a POC with iText 7 and are getting an error when we try to call FlattenFields. All we are trying to do is load a PDF template and inject values. The template we are using was working fine with iText 5.
Here is the exception message:
An exception of type 'iText.Kernel.PdfException' occurred in itext.kernel.dll but was not handled in user code
Additional information: unbalanced.begin.end.marked.content.operators
using (PdfDocument pdfDoc = new PdfDocument(new PdfReader(fileName), new PdfWriter(outputStream)))
{
    PdfAcroForm stamper = PdfAcroForm.GetAcroForm(pdfDoc, true);
    stamper.FlattenFields();
    stamper.SetGenerateAppearance(true);
}
Regards
Shreenidhi B.R
I followed up with the iText support folks and they said that this issue has been fixed in iText 7.0.1. I haven't had a chance to test it myself, though.

Is it possible to recover a broadcast value from a Spark Streaming checkpoint

I used hbase-spark to record pv/uv in my Spark Streaming project. Then, when I killed the app and restarted it, I got the following exception during checkpoint recovery:
16/03/02 10:17:21 ERROR HBaseContext: Unable to getConfig from broadcast
java.lang.ClassCastException: [B cannot be cast to org.apache.spark.SerializableWritable
at com.paitao.xmlife.contrib.hbase.HBaseContext.getConf(HBaseContext.scala:645)
at com.paitao.xmlife.contrib.hbase.HBaseContext.com$paitao$xmlife$contrib$hbase$HBaseContext$$hbaseForeachPartition(HBaseContext.scala:627)
at com.paitao.xmlife.contrib.hbase.HBaseContext$$anonfun$com$paitao$xmlife$contrib$hbase$HBaseContext$$bulkMutation$1.apply(HBaseContext.scala:457)
at com.paitao.xmlife.contrib.hbase.HBaseContext$$anonfun$com$paitao$xmlife$contrib$hbase$HBaseContext$$bulkMutation$1.apply(HBaseContext.scala:457)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:898)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1839)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I checked the code of HBaseContext; it uses a broadcast to store the HBase configuration.
class HBaseContext(@transient sc: SparkContext,
                   @transient config: Configuration,
                   val tmpHdfsConfgFile: String = null) extends Serializable with Logging {

  @transient var credentials = SparkHadoopUtil.get.getCurrentUserCredentials()
  @transient var tmpHdfsConfiguration: Configuration = config
  @transient var appliedCredentials = false
  @transient val job = Job.getInstance(config)
  TableMapReduceUtil.initCredentials(job)

  // <-- broadcast for HBaseConfiguration here !!!
  var broadcastedConf = sc.broadcast(new SerializableWritable(config))
  var credentialsConf = sc.broadcast(new SerializableWritable(job.getCredentials()))
  ...
During checkpoint recovery, it tries to access this broadcast value in its getConf function:
if (tmpHdfsConfiguration == null) {
  try {
    tmpHdfsConfiguration = configBroadcast.value.value
  } catch {
    case ex: Exception => logError("Unable to getConfig from broadcast", ex)
  }
}
Then the exception is raised. My question is: is it possible to recover the broadcast value from a checkpoint in a Spark application? Or do we have some other solution to re-broadcast the value after recovering?
Thanks for any feedback!
Currently, this is a known bug in Spark. Contributors have been investigating the issue but have made no progress.
Here's my workaround: instead of loading the data into a broadcast variable and broadcasting it to all executors, I let each executor load the data itself into a singleton object, roughly as sketched below.
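A rough Java sketch of that pattern; the class name is made up, and HBaseConfiguration.create() stands in for whatever data you would otherwise broadcast:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Each executor JVM lazily builds its own copy instead of relying on a broadcast
// value that cannot be recovered from the checkpoint.
public final class ExecutorSideHBaseConfig {

    private static Configuration conf;

    private ExecutorSideHBaseConfig() {
    }

    public static synchronized Configuration get() {
        if (conf == null) {
            conf = HBaseConfiguration.create(); // rebuilt locally on first use per executor
        }
        return conf;
    }
}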
By the way, follow this issue for changes: https://issues.apache.org/jira/browse/SPARK-5206
Follow the approach below (sketched in code afterwards):
1. Create the Spark context.
2. Initialize the broadcast variable.
3. Create the streaming context with a checkpoint directory, using the above Spark context and passing in the initialized broadcast variable.
When the streaming job starts with no data in the checkpoint directory, it will initialize the broadcast variable. When streaming restarts, it will recover the broadcast variable from the checkpoint directory.
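A rough Java sketch of that sequence; the app name, checkpoint path, batch interval, and broadcast payload are placeholders:
import java.util.Collections;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingWithBroadcast {

    public static void main(String[] args) throws Exception {
        // 1. Create the Spark context.
        SparkConf sparkConf = new SparkConf().setAppName("pv-uv-recovery");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // 2. Initialize the broadcast variable before the streaming context exists.
        Broadcast<Map<String, String>> settings =
                sc.broadcast(Collections.singletonMap("hbase.zookeeper.quorum", "zk-host"));

        // 3. Create the streaming context with a checkpoint directory, or recover it on restart.
        String checkpointDir = "hdfs:///checkpoints/pv-uv";
        JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(checkpointDir, () -> {
            JavaStreamingContext created = new JavaStreamingContext(sc, Durations.seconds(10));
            created.checkpoint(checkpointDir);
            // define the DStream pipeline here, referencing the `settings` broadcast
            return created;
        });

        jssc.start();
        jssc.awaitTermination();
    }
}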

parallelism configuration in trident topology (storm)

After reading this and this I'm having difficulties understanding how to configure my trident topology.
Basically, my Storm application reads from Kafka, does some data manipulation, and finally writes to Cassandra.
Here is how I'm currently building my topology:
private static StormTopology buildTopology() {
    // connection to kafka
    ZkHosts zkHosts = new ZkHosts(broker_zk, broker_path);
    TridentKafkaConfig kafkaConfig = new TridentKafkaConfig(zkHosts, topic);
    kafkaConfig.scheme = new RawMultiScheme();
    StateFactoryFields[] cassandraStateFactories = createStateFactories();
    TransactionalTridentKafkaSpout spout = new TransactionalTridentKafkaSpout(kafkaConfig);

    TridentTopology topology = new TridentTopology();
    Stream kafkaSpout = topology.newStream("kafkaspout", spout).parallelismHint(1).shuffle();
    Stream filterValidatStream = kafkaSpout
            .each(new Fields("bytes"), new SplitKafkaInput(), EventData.getEventDataFields())
            .parallelismHint(1);

    for (StateFactoryFields stateFactoryFields : cassandraStateFactories) {
        filterValidatStream.groupBy(stateFactoryFields.groupingFields)
                .persistentAggregate(stateFactoryFields.cassandraStateFactor, new Count(), new Fields("count"))
                .parallelismHint(2);
    }
    logger.info("Building topology");
    return topology.build();
}
So I have a spout and a few operations (filter, groupBy) with a parallelismHint.
I don't understand how to determine the optimal parallelismHint. Moreover, if I'm setting this value in my code, how does it work in conjunction with standard Storm topology configurations (set in code as sketched after this list) such as
topology.max.task.parallelism
topology.workers
topology.acker.executors
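For reference, those settings map onto the topology Config that is passed to StormSubmitter, alongside the per-component parallelismHint calls; the values below are purely illustrative, and buildTopology() is the method shown above:
Config conf = new Config();
conf.setNumWorkers(2);           // topology.workers
conf.setMaxTaskParallelism(16);  // topology.max.task.parallelism
conf.setNumAckers(1);            // topology.acker.executors
StormSubmitter.submitTopology("trident-topology", conf, buildTopology());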
Thanks in advance
There is an excellent gist by mrflip here that attempts to outline how to tune a storm/trident topology. This should guide you in selecting your parameters (both the ones you have suggested in your question and others you may not have thought of yet).
