New KafkaSpout Issue in Apache Storm

I have a basic topology that includes a Kafka spout and Kafka bolts.
When I submit my topology, I get this error in the Storm UI:
Unable to get offset lags for kafka. Reason: org.apache.kafka.shaded.common.errors.InvalidTopicException: Topic '[enrich-topic]' is invalid
I checked, and enrich-topic exists; there is no problem there.
TopologyBuilder streamTopologyBuilder = new TopologyBuilder();

KafkaSpoutRetryService kafkaSpoutRetryService = new KafkaSpoutRetryExponentialBackoff(
        KafkaSpoutRetryExponentialBackoff.TimeInterval.microSeconds(500),
        KafkaSpoutRetryExponentialBackoff.TimeInterval.milliSeconds(2),
        Integer.MAX_VALUE,
        KafkaSpoutRetryExponentialBackoff.TimeInterval.seconds(10));

KafkaSpoutConfig<String, String> spoutConf = KafkaSpoutConfig.builder(configProvider.getBootstrapServers(), configProvider.getSpoutTopic())
        .setGroupId("consumerGroupId")
        .setOffsetCommitPeriodMs(10_000)
        .setFirstPollOffsetStrategy(UNCOMMITTED_LATEST)
        .setMaxUncommittedOffsets(1000000)
        .setRetry(kafkaSpoutRetryService)
        .build();
KafkaSpout<String, String> kafkaSpout = new KafkaSpout<>(spoutConf);
streamTopologyBuilder.setSpout("kafkaSpout", kafkaSpout, 1);

KafkaWriterBolt2 kafkaWriterBolt2 = null;
try {
    kafkaWriterBolt2 = new KafkaWriterBolt2(configProvider.getBootstrapServers(), configProvider.getStreamKafkaWriterTopicName());
} catch (IOException e) {
    e.printStackTrace();
}
streamTopologyBuilder.setBolt("kafkaWriterBolt2", kafkaWriterBolt2, 1).setNumTasks(1)
        .shuffleGrouping("kafkaSpout");
KafkaWriterBolt2 is my own class, which extends BaseRichBolt.
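Since the error reports the topic as invalid rather than missing, it may help to confirm what the broker actually returns. Below is a minimal sketch (not from the original post) that uses Kafka's AdminClient to list topics and check for an exact name match; the bootstrap address is an assumed placeholder:

import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class TopicCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed address; substitute configProvider.getBootstrapServers()
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // List every topic the broker knows and check for an exact match;
            // stray characters (such as brackets) in the configured name would surface here
            Set<String> topics = admin.listTopics().names().get();
            System.out.println("enrich-topic exists: " + topics.contains("enrich-topic"));
        }
    }
}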

Related

IBM MQ transactions and .NET

I have used .NET C# (IBM MQ version 9.1.5) to pull messages from the queue, so I have no issues connecting to the queue and getting messages.
I have read that there is the concept of Distributed Transactions.
I tried the following:
var getMessageOptions = new MQGetMessageOptions();
getMessageOptions.Options |= MQC.MQGMO_WAIT | MQC.MQGMO_SYNCPOINT;
getMessageOptions.WaitInterval = 20000; // 20 seconds wait
Transaction oldAmbient = Transaction.Current;
using (var tx = new CommittableTransaction())
{
    try
    {
        int i = queue.CurrentDepth;
        Log.Information($"Current queue depth is {i} message(s)");
        var message = new MQMessage();
        queue.Get(message, getMessageOptions);
        string messageStr = message.ReadString(message.DataLength);
        Log.Information(messageStr);
        tx.Commit();
    }
    catch (MQException e) when (e.Reason == 2033)
    {
        // MQRC 2033 (MQRC_NO_MSG_AVAILABLE): no message arrived within the wait interval
        Log.Information("No messages in the queue");
        tx.Rollback();
    }
    catch (Exception ex)
    {
        Log.Error($"Exception when trying to capture a message from the queue: {ex.Message}");
        tx.Rollback();
    }
}
I am getting an error code of 2035 (MQRC_NOT_AUTHORIZED).
Looking at the documentation on Recovering Transactions: where does the "SYSTEM.DOTNET.XARECOVERY.QUEUE" live? Is it on the queue manager?
Do I need to get permissions enabled on this?
Also, I see that the Microsoft Distributed Transaction Coordinator is mentioned; is this something we need to have running on the local host in order for distributed transactions to work?
If the MQ distributed transactions feature is being used, then the user running the application should have authority to SYSTEM.DOTNET.XARECOVERY.QUEUE. If a transaction is incomplete, the SYSTEM.DOTNET.XARECOVERY.QUEUE queue holds the information about the incomplete transaction as a message, which can later be used to resolve the transaction.
Based on the scenario you described in the comments ("we want to just save the message to a file. My thinking is if there is a problem with that, I could roll back the transaction"): if MQ is the only resource manager, then you don't have to use distributed transactions. Getting a message under syncpoint can be used instead. Distributed transactions are useful when more than one resource manager is involved.
To get a message under syncpoint, the following sample code can be used, after updating the hostname, channel, port, queue, and queue manager name:
var getMessageOptions = new MQGetMessageOptions();
getMessageOptions.Options |= MQC.MQGMO_WAIT | MQC.MQGMO_SYNCPOINT;
getMessageOptions.WaitInterval = 20000; // 20 seconds wait

Hashtable props = new Hashtable();
props.Add(MQC.HOST_NAME_PROPERTY, "localhost");
props.Add(MQC.CHANNEL_PROPERTY, "DOTNET.SVRCONN");
props.Add(MQC.PORT_PROPERTY, 3636);

MQQueueManager qm = new MQQueueManager("QM", props);
MQQueue queue = qm.AccessQueue("Q1", MQC.MQOO_INPUT_AS_Q_DEF);
try
{
    var message = new MQMessage();
    queue.Get(message, getMessageOptions);
    string messageStr = message.ReadString(message.DataLength);
    // Commit the get; until this point the message can still be backed out
    qm.Commit();
}
catch (MQException e) when (e.Reason == 2033)
{
    // MQRC 2033 (MQRC_NO_MSG_AVAILABLE): no message arrived within the wait interval
    Log.Information("No messages in the queue");
}
catch (Exception ex)
{
    // Back out the uncommitted get so the message returns to the queue
    qm.Backout();
    Log.Error($"Exception when trying to capture a message from the queue: {ex.Message}");
}

Setting bolts to read from specific streams of other bolts in trident topology

I am trying to write a TridentTopology that has multiple bolts. I want to make one bolt subscribe to another bolt's specific stream, as shown below.
TridentTopologyBuilder tridentTopologyBuilder = new TridentTopologyBuilder();
FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 2,
        new Values("the cow jumped over the moon"),
        new Values("the man went to the store and bought some candy"),
        new Values("four score and seven years ago"),
        new Values("how many apples can you eat"));
tridentTopologyBuilder.setSpout("tridentSpout", "spoutStream", "spoutId", spout, 2, "spoutBatch");

Map<String, String> batchGroups = new HashMap<>();
batchGroups.put("boltStream", "boltBatch");
tridentTopologyBuilder.setBolt("tridentBolt", new TridentTestBolt(), 1, Sets.newHashSet("spoutBatch"), batchGroups)
        .shuffleGrouping("tridentSpout", "spoutStream");
tridentTopologyBuilder.setBolt("tridentBolt2", new TridentTestBolt2(), 1, new HashSet<>(), batchGroups)
        .shuffleGrouping("tridentBolt", "boltStream");

LocalCluster cluster = new LocalCluster();
Config config = new Config();
config.setDebug(true);
cluster.submitTopology("TridentTopology", config, tridentTopologyBuilder.buildTopology(new HashMap<>()));
I am getting the following exception:
Error: InvalidTopologyException(msg:Component: [tridentBolt2] subscribes from non-existent stream: [$coord-boltBatch] of component [tridentBolt])
I also declared the stream using the declareStream method of OutputFieldsDeclarer:
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declareStream("boltStream", new Fields("sentence"));
}
Subscribing to another bolt's specific stream works in a normal topology; the issue only appears with the Trident topology. Also, what are we supposed to pass for batchGroups?

How to get rid of NullPointerException in Flume Interceptor?

I have an interceptor written for Flume; the code is below:
public Event intercept(Event event) {
    byte[] xmlstr = event.getBody();
    InputStream instr = new ByteArrayInputStream(xmlstr);
    //TransformerFactory factory = TransformerFactory.newInstance(TRANSFORMER_FACTORY_CLASS, TRANSFORMER_FACTORY_CLASS.getClass().getClassLoader());
    TransformerFactory factory = TransformerFactory.newInstance();
    Source xslt = new StreamSource(new File("removeNs.xslt"));
    Transformer transformer = null;
    try {
        transformer = factory.newTransformer(xslt);
    } catch (TransformerConfigurationException e1) {
        e1.printStackTrace();
    }
    Source text = new StreamSource(instr);
    OutputStream ostr = new ByteArrayOutputStream();
    try {
        transformer.transform(text, new StreamResult(ostr));
    } catch (TransformerException e) {
        e.printStackTrace();
    }
    event.setBody(ostr.toString().getBytes());
    return event;
}
I'm removing the namespace from my source XML with the removeNs.xslt file, so that I can store the data in HDFS and later load it into Hive. When my interceptor runs, it throws the error below:
ERROR org.apache.flume.source.jms.JMSSource: Unexpected error processing events
java.lang.NullPointerException
at test.intercepter.App.intercept(App.java:59)
at test.intercepter.App.intercept(App.java:82)
at org.apache.flume.interceptor.InterceptorChain.intercept(InterceptorChain.java:62)
at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:146)
at org.apache.flume.source.jms.JMSSource.doProcess(JMSSource.java:258)
at org.apache.flume.source.AbstractPollableSource.process(AbstractPollableSource.java:54)
at org.apache.flume.source.PollableSourceRunner$PollingRunner.run(PollableSourceRunner.java:139)
at java.lang.Thread.run(Thread.java:745)
Can you suggest what the problem is and where?
I found the solution. The problem was nothing other than new File("removeNs.xslt"): the interceptor could not find the file, since I was not sure where to keep it. I later found the Flume agent's directory, but whenever I restart the Flume agent it deletes the files I keep there. So I changed the code and embedded the XSLT content directly in my Java code.
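An alternative to embedding the stylesheet in the Java source is to package it inside the interceptor's JAR and load it from the classpath, so it survives agent restarts. A minimal sketch, assuming removeNs.xslt is placed under src/main/resources (the class name here is hypothetical):

import java.io.InputStream;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

public class XsltLoader {
    // Load the stylesheet from the classpath instead of the working directory,
    // so its location no longer depends on where the Flume agent runs
    public static Transformer buildTransformer() throws Exception {
        InputStream xsltStream = XsltLoader.class.getClassLoader()
                .getResourceAsStream("removeNs.xslt");
        if (xsltStream == null) {
            throw new IllegalStateException("removeNs.xslt not found on the classpath");
        }
        Source xslt = new StreamSource(xsltStream);
        return TransformerFactory.newInstance().newTransformer(xslt);
    }
}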

parallelism configuration in trident topology (storm)

After reading this and this I'm having difficulties understanding how to configure my trident topology.
Basically, my Storm application reads from Kafka, does some data manipulation, and finally writes to Cassandra.
Here is how I'm currently building my topology:
private static StormTopology buildTopology() {
    // connection to kafka
    ZkHosts zkHosts = new ZkHosts(broker_zk, broker_path);
    TridentKafkaConfig kafkaConfig = new TridentKafkaConfig(zkHosts, topic);
    kafkaConfig.scheme = new RawMultiScheme();
    StateFactoryFields[] cassandraStateFactories = createStateFactories();
    TransactionalTridentKafkaSpout spout = new TransactionalTridentKafkaSpout(kafkaConfig);

    TridentTopology topology = new TridentTopology();
    Stream kafkaSpout = topology.newStream("kafkaspout", spout).parallelismHint(1).shuffle();
    Stream filterValidatStream = kafkaSpout.each(new Fields("bytes"), new SplitKafkaInput(), EventData.getEventDataFields()).parallelismHint(1);

    for (StateFactoryFields stateFactoryFields : cassandraStateFactories) {
        filterValidatStream.groupBy(stateFactoryFields.groupingFields)
                .persistentAggregate(stateFactoryFields.cassandraStateFactor, new Count(), new Fields("count"))
                .parallelismHint(2);
    }
    logger.info("Building topology");
    return topology.build();
}
So I have a spout and a few operations (filter, groupBy) with parallelismHint.
I don't understand how to determine the optimal parallelismHint. Moreover, if I'm setting this value in my code, how does it work in conjunction with standard Storm topology configuration such as
topology.max.task.parallelism
topology.workers
topology.acker.executors
Thanks in advance.
There is an excellent gist by mrflip here that attempts to outline how to tune a Storm/Trident topology. It should guide you in selecting your parameters (both the ones you suggested in your question and others you may not have thought of yet).
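For reference, the cluster-level settings listed in the question map to setters on Storm's Config (backtype.storm.Config in the release used here, org.apache.storm.Config in newer ones), while parallelismHint sets per-component executor counts. A short sketch; the numbers are placeholders, not recommendations:

Config conf = new Config();
// topology.max.task.parallelism: upper bound on the parallelism of any single component
conf.setMaxTaskParallelism(16);
// topology.workers: number of worker JVM processes the topology runs in
conf.setNumWorkers(2);
// topology.acker.executors: number of executors tracking tuple acknowledgements
conf.setNumAckers(2);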

Storm-kafka spout consuming slowly

I was just trying out the kafka-storm spout mentioned here https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka, and the configuration I used is below.
BrokerHosts brokerHosts = KafkaConfig.StaticHosts.fromHostString(
        ImmutableList.of("localhost"), 1);
SpoutConfig spoutConfig = new SpoutConfig(brokerHosts, // list of Kafka brokers
        "test",        // topic to read from
        "/kafkastorm", // the root path in Zookeeper for the spout to store consumer offsets
        "discovery");  // an id for this consumer, used when storing offsets in Zookeeper
spoutConfig.scheme = new StringScheme();
spoutConfig.stateUpdateIntervalMs = 1000;
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);

TridentTopology topology = new TridentTopology();
InetSocketAddress inetSocketAddress = new InetSocketAddress("localhost", 6379);
TridentState wordsCount = topology
        .newStream(SPOUT_FIRST, kafkaSpout)
        .parallelismHint(1)
        .each(new Fields("str"), new TestSplit(), new Fields("words"))
        .groupBy(new Fields("words"))
        .persistentAggregate(
                RedisState.transactional(inetSocketAddress),
                new Count(), new Fields("counts"))
        .parallelismHint(100);

Config conf = new Config();
conf.setMaxTaskParallelism(200);
// conf.setDebug(true);
// conf.setMaxSpoutPending(20);
// This topology can only be run locally because it is a toy example
LocalDRPC drpc = new LocalDRPC();
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("symbolCounter", conf, topology.build());
But the speed at which the above spout fetches messages from the Kafka topic is around 7,000 per second, whereas I am expecting a load of around 50,000 messages per second. I have tried various options for increasing the fetch buffer size in SpoutConfig, with no visible results.
Has anyone faced a similar issue, where the Kafka topic cannot be consumed via Storm at the speed at which the producer produces messages?
I updated the "topology.spout.max.batch.size" value in the config to about 64*1024, and then the Storm processing became fast.
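For illustration, here is a sketch of how that value can be applied before submitting the topology; the key is set as a plain config entry, and 64 * 1024 mirrors the value mentioned above:

Config conf = new Config();
// Trident: maximum number of tuples the spout emits per batch
conf.put("topology.spout.max.batch.size", 64 * 1024);
cluster.submitTopology("symbolCounter", conf, topology.build());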
