I was just trying out the kafka-storm spout mentioned here https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka, and the configuration I used is shown below.
BrokerHosts brokerHosts = KafkaConfig.StaticHosts.fromHostString(
        ImmutableList.of("localhost"), 1);
SpoutConfig spoutConfig = new SpoutConfig(brokerHosts, // list of Kafka brokers
        "test",         // topic to read from
        "/kafkastorm",  // the root path in Zookeeper for the spout to store the consumer offsets
        "discovery");   // an id for this consumer for storing the
                        // consumer offsets in Zookeeper
spoutConfig.scheme = new StringScheme();
spoutConfig.stateUpdateIntervalMs = 1000;
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
TridentTopology topology = new TridentTopology();
InetSocketAddress inetSocketAddress = new InetSocketAddress(
"localhost", 6379);
TridentState wordsCount = topology
        .newStream(SPOUT_FIRST, kafkaSpout)
        .parallelismHint(1)
        .each(new Fields("str"), new TestSplit(), new Fields("words"))
        .groupBy(new Fields("words"))
        .persistentAggregate(
                RedisState.transactional(inetSocketAddress),
                new Count(), new Fields("counts"))
        .parallelismHint(100);
Config conf = new Config();
conf.setMaxTaskParallelism(200);
// conf.setDebug( true );
// conf.setMaxSpoutPending(20);
// This topology can only be run as local because it is a toy example
LocalDRPC drpc = new LocalDRPC();
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("symbolCounter", conf, topology.build());
But the speed at which the above spout fetches messages from the Kafka topic is around 7,000 messages/second, whereas I am expecting a load of around 50,000 messages/second. I have tried various options for increasing the fetch buffer size in spoutConfig, with no visible results.
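For reference, the fetch-buffer settings I mean are plain fields on SpoutConfig (inherited from KafkaConfig); roughly like the following, with illustrative values only:
// Illustrative values; these are the KafkaConfig fields that control fetch/buffer sizes
spoutConfig.fetchSizeBytes = 2 * 1024 * 1024;
spoutConfig.bufferSizeBytes = 2 * 1024 * 1024;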
Has anyone faced a similar issue, where they were not able to consume a Kafka topic via Storm at the speed at which the producer produces messages?
I updated the "topology.spout.max.batch.size" value in the config to about 64*1024, and after that the Storm processing became fast.
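In code, the change boils down to one entry on the Config object from the snippet above (64*1024 is the value that worked for me; tune it for your own load):
// Raise the maximum number of tuples a Trident spout batch may contain
conf.put("topology.spout.max.batch.size", 64 * 1024);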
My data pipeline is the following: Kafka => perform some calculations => load the resulting pairs into an Ignite cache => print them out
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("MainApplication");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaStreamingContext streamingContext = new JavaStreamingContext(sc, Durations.seconds(10));
JavaIgniteContext<String, Float> igniteContext = new JavaIgniteContext<>(sc, PATH, false);
JavaDStream<Message> dStream = KafkaUtils.createDirectStream(
streamingContext,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.<String, Message>
Subscribe(Collections.singletonList(TOPIC), kafkaParams)
)
.map(ConsumerRecord::value);
JavaPairDStream<String, Message> pairDStream =
dStream.mapToPair(message -> new Tuple2<>(message.getName(), message));
JavaPairDStream<String, Float> pairs = pairDStream
.combineByKey(new CreateCombiner(), new MergeValue(), new MergeCombiners(), new HashPartitioner(10))
.mapToPair(new ToPairTransformer());
JavaIgniteRDD<String, Float> myCache = igniteContext.fromCache(new CacheConfiguration<>());
// I know that we put something here:
pairDStream.foreachRDD((VoidFunction<JavaPairRDD<String, Float>>) myCache::savePairs);
// But I can't see anything here:
myCache.foreach(tuple2 -> System.out.println("In cache: " + tuple2._1() + " = " + tuple2._2()));
streamingContext.start();
streamingContext.awaitTermination();
streamingContext.stop();
sc.stop();
But this code prints nothing. Why?
Why is the Ignite cache empty even after savePairs?
What can be wrong here?
Thanks in advance!
To me, it looks like pairDStream.foreachRDD(...) is a lazy operation and has no effect, at least until you start the streaming context with streamingContext.start().
On the other hand, myCache.foreach(...) is an eager operation, and you are performing it on a cache that is still empty at that point.
So, try to move myCache.foreach(...) to after the streaming context is started, or even to after termination.
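A rough sketch of that reordering, reusing the objects from the question (exactly when the cache fills up still depends on your batch interval):
streamingContext.start();
// ... the streaming job is now running, so foreachRDD can actually populate the cache ...
streamingContext.awaitTermination();
// Read the cache only after the context has run (or terminated); before start() it is guaranteed to be empty
myCache.foreach(tuple2 -> System.out.println("In cache: " + tuple2._1() + " = " + tuple2._2()));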
I am trying to write a TridentTopology that has multiple bolts. Now I want to make one bolt subscribe to another bolt's specific stream, as shown below.
TridentTopologyBuilder tridentTopologyBuilder = new TridentTopologyBuilder();
FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 2,
new Values("the cow jumped over the moon"),
new Values("the man went to the store and bought some candy"),
new Values("four score and seven years ago"),
new Values("how many apples can you eat"));
tridentTopologyBuilder.setSpout("tridentSpout", "spoutStream", "spoutId", spout, 2, "spoutBatch");
Map<String, String> batchGroups = new HashMap<>();
batchGroups.put("boltStream", "boltBatch");
tridentTopologyBuilder.setBolt("tridentBolt", new TridentTestBolt(), 1, Sets.newHashSet("spoutBatch"), batchGroups).shuffleGrouping("tridentSpout", "spoutStream");
tridentTopologyBuilder.setBolt("tridentBolt2", new TridentTestBolt2(), 1, new HashSet<>(), batchGroups).shuffleGrouping("tridentBolt", "boltStream");
LocalCluster cluster = new LocalCluster();
Config config = new Config();
config.setDebug(true);
cluster.submitTopology("TridentTopology", config, tridentTopologyBuilder.buildTopology(new HashMap<>()));
I am getting the following exception:
Error: InvalidTopologyException(msg:Component: [tridentBolt2] subscribes from non-existent stream: [$coord-boltBatch] of component [tridentBolt])
I also declared the stream using the declareStream method of OutputFieldsDeclarer:
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declareStream("boltStream", new Fields("sentence"));
}
Also, subscribing to another bolt's specific stream works in a normal topology; the issue only appears with the Trident topology. Also, what are we supposed to pass for batchGroups?
I have a basic topology that includes a Kafka spout and Kafka bolts.
When I submit my topology I get this error in the Storm UI:
Unable to get offset lags for kafka. Reason: org.apache.kafka.shaded.common.errors.InvalidTopicException: Topic '[enrich-topic]' is invalid
I checked that enrich-topic exists; there is no problem with that.
TopologyBuilder streamTopologyBuilder = new TopologyBuilder();
KafkaSpoutRetryService kafkaSpoutRetryService = new KafkaSpoutRetryExponentialBackoff(KafkaSpoutRetryExponentialBackoff.TimeInterval.microSeconds(500), KafkaSpoutRetryExponentialBackoff.TimeInterval.milliSeconds(2), Integer.MAX_VALUE, KafkaSpoutRetryExponentialBackoff.TimeInterval.seconds(10));
KafkaSpoutConfig spoutConf = KafkaSpoutConfig.builder(configProvider.getBootstrapServers(), configProvider.getSpoutTopic())
.setGroupId("consumerGroupId")
.setOffsetCommitPeriodMs(10_000)
.setFirstPollOffsetStrategy(UNCOMMITTED_LATEST)
.setMaxUncommittedOffsets(1000000)
.setRetry(kafkaSpoutRetryService)
.build();
KafkaSpout kafkaSpout = new KafkaSpout(spoutConf);
streamTopologyBuilder.setSpout("kafkaSpout", kafkaSpout, 1);
KafkaWriterBolt2 kafkaWriterBolt2 = null;
try {
    kafkaWriterBolt2 = new KafkaWriterBolt2(configProvider.getBootstrapServers(),
            configProvider.getStreamKafkaWriterTopicName());
} catch (IOException e) {
    e.printStackTrace();
}
streamTopologyBuilder.setBolt("kafkaWriterBolt2", kafkaWriterBolt2, 1).setNumTasks(1)
.shuffleGrouping("kafkaSpout");
KafkaWriterBolt2 is my own class, which extends BaseRichBolt.
After reading this and this I'm having difficulties understanding how to configure my trident topology.
Basically, my Storm application reads from Kafka, does some data manipulation, and finally writes to Cassandra.
Here is how I'm currently building my topology:
private static StormTopology buildTopology() {
// connection to kafka
ZkHosts zkHosts = new ZkHosts(broker_zk, broker_path);
TridentKafkaConfig kafkaConfig = new TridentKafkaConfig(zkHosts, topic);
kafkaConfig.scheme = new RawMultiScheme();
StateFactoryFields[] cassandraStateFactories = createStateFactories();
TransactionalTridentKafkaSpout spout = new TransactionalTridentKafkaSpout(kafkaConfig);
TridentTopology topology = new TridentTopology();
Stream kafkaSpout = topology.newStream("kafkaspout", spout).parallelismHint(1).shuffle();
Stream filterValidatStream = kafkaSpout.each(new Fields("bytes"), new SplitKafkaInput(), EventData.getEventDataFields()).parallelismHint(1);
for (StateFactoryFields stateFactoryFields : cassandraStateFactories) {
filterValidatStream.groupBy(stateFactoryFields.groupingFields)
.persistentAggregate(stateFactoryFields.cassandraStateFactor, new Count(), new Fields("count")).parallelismHint(2);
}
logger.info("Building topology");
return topology.build();
}
So I have a spout and a few operations (filter, groupBy) with parallelismHint set.
I don't understand how to determine the optimal parallelismHint. Moreover, if I set this value in my code, how does it work in conjunction with standard Storm topology configuration such as
topology.max.task.parallelism
topology.workers
topology.acker.executors
Thanks in advance
There is an excellent gist by mrflip here that attempts to outline how to tune a storm/trident topology. This should guide you in selecting your parameters (both the ones you have suggested in your question and others you may not have thought of yet).
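As a rough illustration of how the hints relate to those settings (the numbers here are arbitrary, not recommendations): parallelismHint sets the executors for an individual part of the Trident stream, while the topology-level options are set on the Config you submit with and cap or complement the hints:
Config conf = new Config();
conf.setNumWorkers(4);           // topology.workers: worker JVMs the executors are spread across
conf.setMaxTaskParallelism(64);  // topology.max.task.parallelism: upper bound on any component's parallelism
conf.setNumAckers(2);            // topology.acker.executors: executors dedicated to tracking acks
// conf is then passed to StormSubmitter.submitTopology(...) along with buildTopology()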
I'm attempting to wire a queue to the front of a MessageChannel, and I need to do so programmatically so it can be done at run time in response to an osgi:listener being triggered. So far I've got:
public void addService(MessageChannel mc, Map<String,Object> properties)
{
//Create the queue and the QueueChannel
BlockingQueue<Message<?>> q = new LinkedBlockingQueue<Message<?>>();
QueueChannel qc = new QueueChannel(q);
//Create the Bridge and set the output to the input parameter channel
BridgeHandler b = new BridgeHandler();
b.setOutputChannel(mc);
//Presumably, I need something here to poll the QueueChannel
//and drop it onto the bridge. This is where I get lost
}
Looking through the various relevant classes, I came up with:
PollerMetadata pm = new PollerMetadata();
pm.setTrigger(new IntervalTrigger(10));
PollingConsumer pc = new PollingConsumer(qc, b);
but I'm not able to put it all together. What am I missing?
So, the solution that ended up working for me was:
public void addEngineService(MessageChannel mc, Map<String,Object> properties)
{
//Create the queue and the QueueChannel
BlockingQueue<Message<?>> q = new LinkedBlockingQueue<Message<?>>();
QueueChannel qc = new QueueChannel(q);
//Create the Bridge and set the output to the input parameter channel
BridgeHandler b = new BridgeHandler();
b.setOutputChannel(mc);
//Setup a Polling Consumer to poll the queue channel and
//retrieve 1 thing at a time
PollingConsumer pc = new PollingConsumer(qc, b);
pc.setMaxMessagesPerPoll(1);
//Now use an interval trigger to poll every 10 ms and attach it
IntervalTrigger trig = new IntervalTrigger(10, TimeUnit.MILLISECONDS);
trig.setInitialDelay(0);
trig.setFixedRate(true);
pc.setTrigger(trig);
//Now set a task scheduler and start it
pc.setTaskScheduler(taskSched);
pc.setAutoStartup(true);
pc.start();
}
I'm not entirely clear whether all of the above is strictly needed, but neither the trigger nor the task scheduler alone worked; I did appear to need both. I should also note that the taskSched used was the default taskScheduler dependency, injected from Spring via
<property name="taskSched" ref="taskScheduler"/>
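If there is no Spring-managed scheduler available to inject, one can also be created programmatically; a minimal sketch (the pool size is an arbitrary choice here):
// Hypothetical stand-in for the injected taskScheduler bean
ThreadPoolTaskScheduler taskSched = new ThreadPoolTaskScheduler();
taskSched.setPoolSize(1);   // one thread is enough for a single polling consumer
taskSched.initialize();     // must be initialized before it is handed to the PollingConsumer
pc.setTaskScheduler(taskSched);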