Storm - topology to topology - apache-storm

Is it possible (or advisable) to emit tuples from one topology to another topology?
Let's say that in one topology, a specific bolt stores tuples in a database. In another topology I don't want to duplicate or re-create the same bolt for storing tuples. So can this second topology emit to the first topology's bolt?
-Hariprasad

While you cannot directly pass tuples from one topology to another, you can use a queuing system such as Apache Kafka to accomplish what you describe. Storm ships a Kafka spout in its recent releases.

The setup requires two Storm topologies (A and B) and one Kafka topic; let's call it "transfers".
Within topology A, where you want to send data to topology B, use a Kafka producer:
[The Kafka initialization code is taken directly from the docs (https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+Producer+Example) and obviously needs to be customized for your Kafka installation.]
public void execute(Tuple input) {
    ...
    Properties props = new Properties();
    props.put("metadata.broker.list", "broker1:9092,broker2:9092");
    props.put("serializer.class", "kafka.serializer.StringEncoder");
    props.put("partitioner.class", "example.producer.SimplePartitioner");
    props.put("request.required.acks", "1");
    ProducerConfig config = new ProducerConfig(props);
    Producer<String, String> producer = new Producer<String, String>(config);

    String msg = ...
    // "ip" is the message key, carried over from the Kafka docs example
    KeyedMessage<String, String> data = new KeyedMessage<String, String>("transfers", ip, msg);
    producer.send(data);
    producer.close();
}
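Creating a new producer inside execute() works for illustration, but in practice you would probably create it once per bolt instance, for example in prepare(), and reuse it for every tuple. A rough sketch under that assumption, with a hypothetical bolt class name and the same broker settings as above:
public class TransferBolt extends BaseRichBolt {
    private Producer<String, String> producer; // created once, reused for every tuple

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("request.required.acks", "1");
        producer = new Producer<String, String>(new ProducerConfig(props));
    }

    @Override
    public void execute(Tuple input) {
        // forward the tuple's payload to the "transfers" topic
        producer.send(new KeyedMessage<String, String>("transfers", input.getString(0)));
    }

    @Override
    public void cleanup() {
        producer.close();
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // nothing emitted downstream in this sketch
    }
}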
From topology B, create a Kafka spout when you initialize your topology:
BrokerHosts hosts = new ZkHosts(zkConnString);
SpoutConfig spoutConfig = new SpoutConfig(hosts, topicName, "/" + topicName,
        UUID.randomUUID().toString());
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
// Now it's just like any other spout
topologyBuilder.setSpout("kafka-spout", kafkaSpout);
That requires a running Kafka cluster, of course (see https://kafka.apache.org/08/quickstart.html).
[Edit: Reading your question again, it sounds like you have a reusable component (saving tuples) that you want to call from two different topologies, and you are trying to call one from the other. Another approach is to offload this task to a third topology devoted to saving tuples, and simply publish Kafka messages for the items that need to be persisted from within your other topologies. That way, ALL save-tuple events are handled the same way.]

This is currently not supported; you cannot pass tuples from one topology to another.
Based on your use case, why not use another bolt (within the same topology) subscribed to the db bolt, instead of running a separate topology?
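A minimal sketch of that idea, reusing one storage bolt class inside a single topology (all class and component names here are hypothetical placeholders):
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("events", new EventSpout());
builder.setBolt("process", new ProcessingBolt()).shuffleGrouping("events");
// The same DbWriterBolt class is reused; it simply subscribes to whichever
// stream produces tuples that need to be persisted.
builder.setBolt("db-writer", new DbWriterBolt()).shuffleGrouping("process");
StormSubmitter.submitTopology("my-topology", new Config(), builder.createTopology());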

Related

Spring AMQP Get Existing Queue Names and Argument

We have a queue in production where the message TTL was set by the application. We now want to change the message TTL and attach a policy via rabbitmqctl rather than setting it in the application. Snippet:
Map<String, Object> args = new HashMap<>();
args.put("x-message-ttl", 86400000);
for (String queueName : queueNames) {
Queue queue = new Queue(queueName, true, false, false, args);
admin.declareQueue(queue);
...
}
To achieve this in the running application, we need a way to check whether a queue already exists: if it does, do nothing; otherwise create a new queue without the args. We cannot rely on a local cache, because multiple publisher/subscriber nodes can restart in an unplanned outage scenario. With the above we would be able to change the TTL during a RabbitMQ upgrade/migration.
Is there an API to fetch all existing queues and their argument properties?
Note: overriding x-message-ttl with a different value throws an error.
RabbitMQ has an HTTP (REST) API, and there is a Java client for it.
You can use it to get information about existing elements such as queues.
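As a rough sketch of what that looks like with plain HTTP (no extra client library), assuming the default management port and guest credentials; the response is a JSON array in which each queue entry includes its "arguments" map (for example x-message-ttl):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class ListQueues {
    public static void main(String[] args) throws Exception {
        // Host, port, and credentials are placeholders; adjust for your broker
        String auth = Base64.getEncoder().encodeToString("guest:guest".getBytes());
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:15672/api/queues"))
                .header("Authorization", "Basic " + auth)
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // Each element of the returned JSON array describes one queue,
        // including its "arguments" map
        System.out.println(response.body());
    }
}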

Spring-Kafka Concurrency Property

I am writing my first Kafka consumer using Spring-Kafka. I have looked at the different options the framework provides and have a few doubts. Can someone who has already worked with it please clarify the points below?
Question 1: As per the Spring-Kafka documentation, there are two ways to implement a Kafka consumer: "You can receive messages by configuring a MessageListenerContainer and providing a message listener or by using the @KafkaListener annotation". When should I choose one option over the other?
Question 2: I have chosen the @KafkaListener approach for my application. For this I need to initialize a container factory instance, and the container factory has an option to control concurrency. I just want to double-check that my understanding of concurrency is correct.
Suppose I have a topic named MyTopic with 4 partitions, and to consume messages from it I start 2 instances of my application, each configured with concurrency 2. Ideally, per the Kafka assignment strategy, 2 partitions should go to consumer1 and the other 2 to consumer2. Since concurrency is set to 2, will each consumer start 2 threads and consume data from the topic in parallel? Also, is there anything we should consider when consuming in parallel?
Question 3: I have chosen manual ack mode and am not managing offsets externally (not persisting them to any database/filesystem). Do I need to write custom code to handle rebalances, or will the framework manage it automatically? I think not, since I acknowledge only after processing all the records.
Question 4: Also, with manual ack mode, which listener gives better performance: the batch message listener or the normal message listener? I guess that with the normal message listener the offsets are committed after processing each message.
The code is pasted below for reference.
Batch acknowledgement consumer:
public void onMessage(List<ConsumerRecord<String, String>> records, Acknowledgment acknowledgment,
Consumer<?, ?> consumer) {
for (ConsumerRecord<String, String> record : records) {
System.out.println("Record : " + record.value());
// Process the message here..
listener.addOffset(record.topic(), record.partition(), record.offset());
}
acknowledgment.acknowledge();
}
Initialising container factory:
@Bean
public ConsumerFactory<String, String> consumerFactory() {
return new DefaultKafkaConsumerFactory<String, String>(consumerConfigs());
}
@Bean
public Map<String, Object> consumerConfigs() {
Map<String, Object> configs = new HashMap<String, Object>();
configs.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootStrapServer);
configs.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
configs.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, enablAutoCommit);
configs.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, maxPolInterval);
configs.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, autoOffsetReset);
configs.put(ConsumerConfig.CLIENT_ID_CONFIG, clientId);
configs.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
configs.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
return configs;
}
@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
ConcurrentKafkaListenerContainerFactory<String, String> factory = new ConcurrentKafkaListenerContainerFactory<String, String>();
// Not sure about the impact of this property, so going with 1
factory.setConcurrency(2);
factory.setBatchListener(true);
factory.getContainerProperties().setAckMode(AckMode.MANUAL);
factory.getContainerProperties().setConsumerRebalanceListener(RebalanceListener.getInstance());
factory.setConsumerFactory(consumerFactory());
factory.getContainerProperties().setMessageListener(new BatchAckConsumer());
return factory;
}
@KafkaListener gives you a message-driven "POJO"; it adds things such as payload conversion, argument matching, etc. If you implement MessageListener, you can only get the raw ConsumerRecord from Kafka. See the @KafkaListener Annotation documentation.
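For contrast, a minimal @KafkaListener sketch (the topic name is taken from the question; the group id is a placeholder):
@KafkaListener(topics = "MyTopic", groupId = "my-group")
public void listen(String message) {
    // The payload is already converted to the method's parameter type
    System.out.println("Received: " + message);
}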
Yes, the concurrency represents the number of threads; each thread creates a Consumer; they run in parallel; in your example, each would get 2 partitions.
Also should we consider anything if we are consuming in parallel.
Your listener must be thread-safe (no shared state, or any such state must be protected by locks).
It's not clear what you mean by "handle rebalance events". When a rebalance occurs, the framework will commit any pending offsets.
It doesn't make a difference; message listener vs. batch listener is just a preference. Even with a message listener, with MANUAL ack mode, the offsets are committed when all the results from the poll have been processed. With MANUAL_IMMEDIATE mode, the offsets are committed one-by-one.
Q1:
From the documentation,
The @KafkaListener annotation is used to designate a bean method as a listener for a listener container. The bean is wrapped in a MessagingMessageListenerAdapter configured with various features, such as converters to convert the data, if necessary, to match the method parameters.
You can configure most attributes on the annotation with SpEL by using #{…} or property placeholders (${…}). See the Javadoc for more information.
This approach can be useful for simple POJO listeners, and you do not need to implement any interfaces. You can also listen on any topics and partitions in a declarative way via the annotation. In addition, you can return the value you received, whereas with MessageListener you are bound by the signature of the interface.
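As an illustration of the declarative topic/partition assignment mentioned above, a hedged sketch (the partition numbers are arbitrary placeholders):
@KafkaListener(topicPartitions = @TopicPartition(topic = "MyTopic", partitions = { "0", "1" }))
public void listenToPartitions(ConsumerRecord<String, String> record) {
    // This listener is pinned to partitions 0 and 1 only
    System.out.println("Partition " + record.partition() + ": " + record.value());
}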
Q2:
Ideally, yes. If you have multiple topics to consume from, it gets more complicated, though. By default Kafka uses the RangeAssignor, which has its own behaviour (you can change the partition assignment strategy if needed).
Q3:
If your consumer dies, there will be a rebalance. If you acknowledge manually and your consumer dies before committing offsets, you do not need to do anything; Kafka handles that. But you could end up with some duplicate messages (at-least-once semantics).
Q4:
It depends on what you mean by "performance". If you mean latency, then consuming each record as fast as possible is the way to go. If you want to achieve high throughput, then batch consumption is more efficient.
I have written some samples using Spring Kafka and various listeners; check out this repo.

Kafka Streams API: KStream to KTable

I have a Kafka topic to which I send location events (key=user_id, value=user_location). I am able to read and process it as a KStream:
KStreamBuilder builder = new KStreamBuilder();
KStream<String, Location> locations = builder
.stream("location_topic")
.map((k, v) -> {
// some processing here, omitted for clarity
Location location = new Location(lat, lon);
return new KeyValue<>(k, location);
});
That works well, but I'd like to have a KTable with the last known position of each user. How could I do it?
I am able to do it writing to and reading from an intermediate topic:
// write to intermediate topic
locations.to(Serdes.String(), new LocationSerde(), "location_topic_aux");
// build KTable from intermediate topic
KTable<String, Location> table = builder.table("location_topic_aux", "store");
Is there a simple way to obtain a KTable from a KStream? This is my first app using Kafka Streams, so I'm probably missing something obvious.
Update:
In Kafka 2.5, a new method KStream#toTable() will be added that provides a convenient way to transform a KStream into a KTable. For details see KIP-523: https://cwiki.apache.org/confluence/display/KAFKA/KIP-523%3A+Add+KStream%23toTable+to+the+Streams+DSL
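With that API the conversion becomes a one-liner; a minimal sketch using the names from the question (the store name is a placeholder):
// Kafka Streams 2.5+ only
KTable<String, Location> table = locations.toTable(
        Materialized.<String, Location, KeyValueStore<Bytes, byte[]>>as("latest-location-store")
                .withKeySerde(Serdes.String())
                .withValueSerde(new LocationSerde()));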
Original Answer:
There is no straightforward way to do this at the moment. Your approach is absolutely valid, as discussed in the Confluent FAQ: http://docs.confluent.io/current/streams/faq.html#how-can-i-convert-a-kstream-to-a-ktable-without-an-aggregation-step
This is the simplest approach with regard to the code. However, it has the disadvantages that (a) you need to manage an additional topic and that (b) it results in additional network traffic because data is written to and re-read from Kafka.
There is one alternative, using a "dummy-reduce":
KStreamBuilder builder = new KStreamBuilder();
KStream<String, Long> stream = ...; // some computation that creates the derived KStream
KTable<String, Long> table = stream.groupByKey().reduce(
new Reducer<Long>() {
@Override
public Long apply(Long aggValue, Long newValue) {
return newValue;
}
},
"dummy-aggregation-store");
This approach is somewhat more complex with regard to the code compared to option 1 but has the advantage that (a) no manual topic management is required and (b) re-reading the data from Kafka is not necessary.
Overall, you need to decide for yourself which approach you like better:
In option 2, Kafka Streams will create an internal changelog topic to back up the KTable for fault tolerance. Thus, both approaches require some additional storage in Kafka and result in additional network traffic. Overall, it’s a trade-off between slightly more complex code in option 2 versus manual topic management in option 1.

Storm Trident - Continuous emits from aggregator even when there is no data in kafka

I have a topology which sources tuples from Kafka.
The topology looks roughly like this:
topology.newStream("kafkaSpout", spout)
.shuffle()
.each(new Fields("str"), new Filter())
.parallelismHint(5)
.each(new Fields("str"), new Function(),
new Fields("some fields"))
.parallelismHint(5)
.partitionBy(new Fields("some field"))
.partitionAggregate(new Fields("some fields"),
new SomeAggregator(),
new Fields(""some fields""))
.parallelismHint(5)
.partitionPersist(new StateFactory(),
new Fields("some filed"),
new StateUpdater());
From the documentation, what I understand is that the aggregator's aggregate method is called for every tuple and the complete method is called when all tuples are processed.
I have put debug sysouts in the init and aggregate methods.
When I start the topology I can see continuous sysouts from the init and aggregate methods.
Is this behavior expected? (Note that there is no data in Kafka.)

Access to ZooKeeper from within Storm Bolt

Context: I would like to share configuration information between bolts and, rather than passing it via configuration files, I would like to load it into ZooKeeper. When a bolt comes up it can read the configuration from there.
My preference is to use the same ZooKeeper instance as Storm, so the question is: how does one access Storm's ZooKeeper from within a bolt?
I have looked at the Javadocs, and I'm afraid the path does not seem obvious.
Here is how I am using ZooKeeper in Storm through the Curator API:
// Prefer the transactional ZooKeeper settings if they are set;
// otherwise fall back to the ZooKeeper cluster that Storm itself uses.
List<String> servers = (List<String>) conf.get(Config.TRANSACTIONAL_ZOOKEEPER_SERVERS);
Long port = (Long) conf.get(Config.TRANSACTIONAL_ZOOKEEPER_PORT);
if (servers == null || port == null) {
    servers = (List<String>) conf.get(Config.STORM_ZOOKEEPER_SERVERS);
    port = (Long) conf.get(Config.STORM_ZOOKEEPER_PORT);
}
String connectionString = servers.get(0) + ":" + port.toString();
curatorFramework = CuratorFrameworkFactory.builder()
        .connectString(connectionString)
        .namespace(config.getNamespace())
        .retryPolicy(new ExponentialBackoffRetry(1000, 3))
        .build();
conf is the configuration object/map passed to each spout and bolt in the open or prepare method. The namespace is a string identifying the path you will read and write to; it is an attempt to keep these interactions with ZooKeeper separate from what Storm itself is doing.
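To actually read shared configuration with the framework built above, you would start it and fetch the data of a znode; a rough sketch (the path is a placeholder under the chosen namespace):
curatorFramework.start();
// Read a configuration value previously written under the namespace
byte[] raw = curatorFramework.getData().forPath("/shared-config");
String value = new String(raw, java.nio.charset.StandardCharsets.UTF_8);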
