Parallelism configuration in Trident topology (Storm)

After reading this and this I'm having difficulties understanding how to configure my trident topology.
Basically my storm application is reading from kafka, doing some data manipulations and finally writing to Cassandra.
Here is how I'm currently building my topology:
private static StormTopology buildTopology() {
    // connection to kafka
    ZkHosts zkHosts = new ZkHosts(broker_zk, broker_path);
    TridentKafkaConfig kafkaConfig = new TridentKafkaConfig(zkHosts, topic);
    kafkaConfig.scheme = new RawMultiScheme();
    StateFactoryFields[] cassandraStateFactories = createStateFactories();
    TransactionalTridentKafkaSpout spout = new TransactionalTridentKafkaSpout(kafkaConfig);
    TridentTopology topology = new TridentTopology();
    Stream kafkaSpout = topology.newStream("kafkaspout", spout).parallelismHint(1).shuffle();
    Stream filterValidatStream = kafkaSpout
            .each(new Fields("bytes"), new SplitKafkaInput(), EventData.getEventDataFields())
            .parallelismHint(1);
    for (StateFactoryFields stateFactoryFields : cassandraStateFactories) {
        filterValidatStream.groupBy(stateFactoryFields.groupingFields)
                .persistentAggregate(stateFactoryFields.cassandraStateFactor, new Count(), new Fields("count"))
                .parallelismHint(2);
    }
    logger.info("Building topology");
    return topology.build();
}
So I have a spout and a few operations (filter, groupBy) with parallelismHint.
I don't understand how to determine the optimal parallelismHint. Moreover, if I'm setting this value in my code, how does it work in conjunction with standard Storm topology configurations such as
topology.max.task.parallelism
topology.workers
topology.acker.executors
Thanks in advance

There is an excellent gist by mrflip here that attempts to outline how to tune a storm/trident topology. This should guide you in selecting your parameters (both the ones you have suggested in your question and others you may not have thought of yet).
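As a rough orientation while you tune: parallelismHint sets the number of executors for a component, topology.workers (Config.setNumWorkers) controls how many worker JVMs those executors are spread across, topology.max.task.parallelism (Config.setMaxTaskParallelism) caps the parallelism of any single component, and topology.acker.executors (Config.setNumAckers) sets the number of ackers. A minimal sketch of submitting the topology above with those knobs set (the numbers are placeholders, not tuned recommendations):

// Sketch: topology-level settings submitted alongside the parallelismHints above.
// The numeric values are placeholders, not recommendations.
Config conf = new Config();
conf.setNumWorkers(2);             // topology.workers: number of worker JVM processes
conf.setMaxTaskParallelism(16);    // topology.max.task.parallelism: upper bound per component
conf.setNumAckers(1);              // topology.acker.executors: number of acker executors
StormSubmitter.submitTopology("my-trident-topology", conf, buildTopology());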

Related

Spring Boot Service consume Kafka messages on demand

I have a requirement for a Spring Boot REST service that a client application will call every 30 minutes and that should return either:
the latest messages, based on the number specified in a query param, e.g. http://messages.com/getNewMessages?number=10 should return 10 messages, or
messages based on the number and offset specified in query params, e.g. http://messages.com/getSpecificMessages?number=5&start=123 should return 5 messages starting at offset 123.
I have a simple standalone application and it works fine. Here is what I tested; I would like some direction on incorporating it into the service.
public static void main(String[] args) {
    // create kafka consumer
    Properties properties = new Properties();
    properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    properties.put(ConsumerConfig.GROUP_ID_CONFIG, "my-first-consumer-group");
    properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    properties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
    properties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, args[0]);
    Consumer<String, String> consumer = new KafkaConsumer<>(properties);
    // subscribe to topic and poll once so partitions get assigned
    consumer.subscribe(Collections.singleton("test"));
    consumer.poll(0);
    // seek to the specified offset, then fetch the specified number of messages
    for (TopicPartition partition : consumer.assignment())
        consumer.seek(partition, Long.parseLong(args[1]));
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(5000));
    System.out.println("Total Record Count ******* : " + records.count());
    for (ConsumerRecord<String, String> record : records) {
        System.out.println("Message: " + record.value());
        System.out.println("Message offset: " + record.offset());
        System.out.println("Message timestamp: " + record.timestamp());
        Date date = new Date(record.timestamp());
        Format format = new SimpleDateFormat("yyyy MM dd HH:mm:ss.SSS");
        System.out.println("Message date: " + format.format(date));
    }
    consumer.commitSync();
}
As my consumer will be on-demand, I'm wondering how I can achieve this in a Spring Boot service. Where do I specify the properties? If I put them in application.properties they get injected at startup time, but how do I control MAX_POLL_RECORDS_CONFIG at runtime? Any help appreciated.
MAX_POLL_RECORDS_CONFIG only affects how many records the kafka-client returns to your Spring service per poll; it never reduces the bytes that the consumer fetches from the Kafka server.
No matter whether your start offset is 150 or 190, the Kafka server returns the whole fetched range (offset = 110 to offset = 190 in the answer's example). The server doesn't even know how many records it is returning to the consumer; it only knows the byte size (220 - 110).
So I think you have to control the record count yourself; currently it is controlled by the kafka-client jar, and either way the fetched records occupy your JVM's local memory.
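To make that distinction concrete, here is a small sketch of the relevant consumer settings; fetch.max.bytes and max.partition.fetch.bytes are the standard Kafka consumer knobs for fetch size, not something from the question, and the byte values are illustrative, not recommendations:

// Sketch: max.poll.records limits the records handed back per poll() call to your code;
// the bytes actually fetched from the broker are bounded by the fetch settings below.
Properties props = new Properties();
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 10);                 // records handed back per poll()
props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 1_048_576);           // total bytes per fetch request (illustrative)
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 262_144);   // bytes per partition per fetch (illustrative)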
The answer to your question is here, and the answer with a code example is this answer.
Both were written by the excellent Gary Russell, the main person (or one of the main people) behind Spring Kafka.
TL;DR:
If you want to arbitrarily rewind the partitions at runtime, have your
listener implement ConsumerSeekAware and grab a reference to the
ConsumerSeekCallback.
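A minimal sketch of that pattern, assuming a Spring Kafka listener (the topic name, partition, and offset values are placeholders, not values from the question):

// Sketch of a listener that can be rewound at runtime via ConsumerSeekAware.
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.TopicPartition;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.listener.ConsumerSeekAware;

public class RewindableListener implements ConsumerSeekAware {

    // Spring calls registerSeekCallback once per consumer thread.
    private final ThreadLocal<ConsumerSeekCallback> seekCallback = new ThreadLocal<>();

    @Override
    public void registerSeekCallback(ConsumerSeekCallback callback) {
        this.seekCallback.set(callback);
    }

    @Override
    public void onPartitionsAssigned(Map<TopicPartition, Long> assignments, ConsumerSeekCallback callback) {
        // no-op: we only rewind on demand
    }

    @Override
    public void onIdleContainer(Map<TopicPartition, Long> assignments, ConsumerSeekCallback callback) {
        // no-op
    }

    @KafkaListener(topics = "test")
    public void listen(ConsumerRecord<String, String> record) {
        System.out.println("Received offset " + record.offset() + ": " + record.value());
    }

    // Rewind one partition; with the ThreadLocal above this must run on the consumer thread.
    public void rewind(String topic, int partition, long offset) {
        seekCallback.get().seek(topic, partition, offset);
    }
}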

Setting bolts to read from specific streams of other bolts in trident topology

I am trying to write a TridentTopology which has multiple bolts. Now I want to make one bolt subscribe to another bolt's specific stream, as shown below.
TridentTopologyBuilder tridentTopologyBuilder = new TridentTopologyBuilder();
FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 2,
        new Values("the cow jumped over the moon"),
        new Values("the man went to the store and bought some candy"),
        new Values("four score and seven years ago"),
        new Values("how many apples can you eat"));
tridentTopologyBuilder.setSpout("tridentSpout", "spoutStream", "spoutId", spout, 2, "spoutBatch");
Map<String, String> batchGroups = new HashMap<>();
batchGroups.put("boltStream", "boltBatch");
tridentTopologyBuilder.setBolt("tridentBolt", new TridentTestBolt(), 1, Sets.newHashSet("spoutBatch"), batchGroups)
        .shuffleGrouping("tridentSpout", "spoutStream");
tridentTopologyBuilder.setBolt("tridentBolt2", new TridentTestBolt2(), 1, new HashSet<>(), batchGroups)
        .shuffleGrouping("tridentBolt", "boltStream");
LocalCluster cluster = new LocalCluster();
Config config = new Config();
config.setDebug(true);
cluster.submitTopology("TridentTopology", config, tridentTopologyBuilder.buildTopology(new HashMap<>()));
I am getting the following exception:
Error: InvalidTopologyException(msg:Component: [tridentBolt2] subscribes from non-existent stream: [$coord-boltBatch] of component [tridentBolt])
I also declared the stream using the declareStream method of OutputFieldsDeclarer:
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declareStream("boltStream", new Fields("sentence"));
}
Subscribing to another bolt's specific stream works in a normal topology; the issue only appears with the Trident topology. Also, what are we supposed to pass for batchGroups?

New KafkaSpout Issue in Apache Storm

I have a basic topology that includes a Kafka spout and Kafka bolts.
When I submit my topology I get this error in the Storm UI:
Unable to get offset lags for kafka. Reason: org.apache.kafka.shaded.common.errors.InvalidTopicException: Topic '[enrich-topic]' is invalid
I checked that enrich-topic exists; there is no problem with that.
TopologyBuilder streamTopologyBuilder = new TopologyBuilder();
KafkaSpoutRetryService kafkaSpoutRetryService = new KafkaSpoutRetryExponentialBackoff(
        KafkaSpoutRetryExponentialBackoff.TimeInterval.microSeconds(500),
        KafkaSpoutRetryExponentialBackoff.TimeInterval.milliSeconds(2),
        Integer.MAX_VALUE,
        KafkaSpoutRetryExponentialBackoff.TimeInterval.seconds(10));
KafkaSpoutConfig spoutConf = KafkaSpoutConfig.builder(configProvider.getBootstrapServers(), configProvider.getSpoutTopic())
        .setGroupId("consumerGroupId")
        .setOffsetCommitPeriodMs(10_000)
        .setFirstPollOffsetStrategy(UNCOMMITTED_LATEST)
        .setMaxUncommittedOffsets(1000000)
        .setRetry(kafkaSpoutRetryService)
        .build();
KafkaSpout kafkaSpout = new KafkaSpout(spoutConf);
streamTopologyBuilder.setSpout("kafkaSpout", kafkaSpout, 1);
KafkaWriterBolt2 kafkaWriterBolt2 = null;
try {
    kafkaWriterBolt2 = new KafkaWriterBolt2(configProvider.getBootstrapServers(), configProvider.getStreamKafkaWriterTopicName());
} catch (IOException e) {
    e.printStackTrace();
}
streamTopologyBuilder.setBolt("kafkaWriterBolt2", kafkaWriterBolt2, 1).setNumTasks(1)
        .shuffleGrouping("kafkaSpout");
KafkaWriterBolt2 is my own class extending BaseRichBolt.

Storm-kafka spout consuming slowly

I was trying out the kafka-storm spout mentioned here https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka, and the configuration I used is shown below.
BrokerHosts brokerHosts = KafkaConfig.StaticHosts.fromHostString(
        ImmutableList.of("localhost"), 1);
SpoutConfig spoutConfig = new SpoutConfig(brokerHosts, // list of Kafka brokers
        "test",        // topic to read from
        "/kafkastorm", // the root path in Zookeeper for the spout to store the consumer offsets
        "discovery");  // an id for this consumer for storing the consumer offsets in Zookeeper
spoutConfig.scheme = new StringScheme();
spoutConfig.stateUpdateIntervalMs = 1000;
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
TridentTopology topology = new TridentTopology();
InetSocketAddress inetSocketAddress = new InetSocketAddress(
        "localhost", 6379);
TridentState wordsCount = topology
        .newStream(SPOUT_FIRST, kafkaSpout)
        .parallelismHint(1)
        .each(new Fields("str"), new TestSplit(), new Fields("words"))
        .groupBy(new Fields("words"))
        .persistentAggregate(
                RedisState.transactional(inetSocketAddress),
                new Count(), new Fields("counts")).parallelismHint(100);
Config conf = new Config();
conf.setMaxTaskParallelism(200);
// conf.setDebug(true);
// conf.setMaxSpoutPending(20);
// This topology can only be run as local because it is a toy example
LocalDRPC drpc = new LocalDRPC();
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("symbolCounter", conf, topology.build());
But the speed at which the above spout fetches messages from the Kafka topic is around 7,000 per second, whereas I expected a load of around 50,000 messages per second. I have tried various options for increasing the fetch buffer size in spoutConfig with no visible results.
Has anyone faced a similar issue where Storm cannot consume from the Kafka topic at the speed at which the producer produces messages?
I updated the "topology.spout.max.batch.size" value in the config to about 64*1024, and then Storm processing became fast.
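For reference, a minimal sketch of applying that change to the topology config shown above, assuming the config key is exactly as written in this answer:

// Sketch: raise the Trident spout batch size as described in the answer above.
// The key and value follow the answer; tune the value for your own load.
Config conf = new Config();
conf.setMaxTaskParallelism(200);
conf.put("topology.spout.max.batch.size", 64 * 1024);
cluster.submitTopology("symbolCounter", conf, topology.build());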

Programmatically bridge a QueueChannel to a MessageChannel in Spring

I'm attempting to wire a queue to the front of a MessageChannel, and I need to do so programmatically so it can be done at run time in response to an osgi:listener being triggered. So far I've got:
public void addService(MessageChannel mc, Map<String,Object> properties)
{
    //Create the queue and the QueueChannel
    BlockingQueue<Message<?>> q = new LinkedBlockingQueue<Message<?>>();
    QueueChannel qc = new QueueChannel(q);
    //Create the Bridge and set the output to the input parameter channel
    BridgeHandler b = new BridgeHandler();
    b.setOutputChannel(mc);
    //Presumably, I need something here to poll the QueueChannel
    //and drop it onto the bridge. This is where I get lost
}
Looking through the various relevant classes, I came up with:
PollerMetadata pm = new PollerMetadata();
pm.setTrigger(new IntervalTrigger(10));
PollingConsumer pc = new PollingConsumer(qc, b);
but I'm not able to put it all together. What am I missing?
So, the solution that ended up working for me was:
public void addEngineService(MessageChannel mc, Map<String,Object> properties)
{
    //Create the queue and the QueueChannel
    BlockingQueue<Message<?>> q = new LinkedBlockingQueue<Message<?>>();
    QueueChannel qc = new QueueChannel(q);
    //Create the Bridge and set the output to the input parameter channel
    BridgeHandler b = new BridgeHandler();
    b.setOutputChannel(mc);
    //Setup a Polling Consumer to poll the queue channel and
    //retrieve 1 thing at a time
    PollingConsumer pc = new PollingConsumer(qc, b);
    pc.setMaxMessagesPerPoll(1);
    //Now use an interval trigger to poll every 10 ms and attach it
    IntervalTrigger trig = new IntervalTrigger(10, TimeUnit.MILLISECONDS);
    trig.setInitialDelay(0);
    trig.setFixedRate(true);
    pc.setTrigger(trig);
    //Now set a task scheduler and start it
    pc.setTaskScheduler(taskSched);
    pc.setAutoStartup(true);
    pc.start();
}
I'm not terribly clear on whether all of the above is explicitly needed, but neither the trigger nor the task scheduler alone worked; I did appear to need both. I should also note that the taskSched used was the default taskScheduler dependency injected from Spring via
<property name="taskSched" ref="taskScheduler"/>
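If injecting the scheduler from XML is not an option, a programmatic alternative is to create one yourself; a minimal sketch using Spring's ThreadPoolTaskScheduler (the pool size and thread-name prefix are placeholders):

// Sketch: create the task scheduler in code instead of injecting it from the XML context.
ThreadPoolTaskScheduler taskSched = new ThreadPoolTaskScheduler();
taskSched.setPoolSize(1);                       // one polling thread is enough for this bridge
taskSched.setThreadNamePrefix("queue-bridge-");
taskSched.initialize();
pc.setTaskScheduler(taskSched);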
