Integrating a word count topology of Storm with Kafka - apache-storm

I am trying to integrate the word count program of Storm with Kafka. My producer is working fine, i.e. it reads a text file and sends each line as a message, and I can see those messages in the simple consumer console.
Now, to integrate it with Storm, i.e. to send those messages/lines to a consumer spout, I have simply replaced the previous Storm spout of the word count program with the KafkaSpout from the storm-kafka integration dependency; the rest of the program is unchanged. I am trying to run it in Eclipse, but it does not execute. I don't know what the problem is, or even whether I am doing it the right way. Here is my main class:
package com.spnotes.storm;

import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;
import com.spnotes.storm.bolts.WordCounterBolt;
import com.spnotes.storm.bolts.WordSpitterBolt;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Config config = new Config();
        config.setDebug(true);
        config.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);

        BrokerHosts hosts = new ZkHosts("localhost:9092");
        SpoutConfig spoutConfig = new SpoutConfig(hosts, "test", "localhost:2181", "id1");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
        KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("line-reader-spout", kafkaSpout);
        builder.setBolt("word-spitter", new WordSpitterBolt()).shuffleGrouping("line-reader-spout");
        builder.setBolt("word-counter", new WordCounterBolt()).shuffleGrouping("word-spitter");

        LocalCluster cluster = new LocalCluster();
        System.out.println("submit topology");
        Thread.sleep(10000);
        //StormSubmitter.submitTopology("HelloStorm5", config, builder.createTopology());
        cluster.submitTopology("HelloStorm5", config, builder.createTopology());
        cluster.shutdown();
    }
}
There are two bolts, WordSpitterBolt and WordCounterBolt. WordSpitterBolt breaks each line/message into tokens/words and WordCounterBolt counts each word. Can anybody tell me if I am doing anything the wrong way? Do I need to create my own spout instead of using the predefined KafkaSpout? And is my main class correct?
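For context, a minimal sketch of what such bolts might look like with the backtype.storm API (the actual implementations are not part of the question, so the whitespace splitting, the "word" field name and the in-memory counting below are assumptions; each class would live in its own file):

import java.util.HashMap;
import java.util.Map;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Splits each line received from the KafkaSpout into words.
public class WordSpitterBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String line = input.getString(0);
        for (String word : line.split("\\s+")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}

// Keeps an in-memory count per word and prints each update.
public class WordCounterBolt extends BaseBasicBolt {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String word = input.getString(0);
        Integer count = counts.get(word);
        counts.put(word, count == null ? 1 : count + 1);
        System.out.println(word + " -> " + counts.get(word));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt, emits nothing
    }
}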

Change this line in your code:
BrokerHosts hosts = new ZkHosts(zkConnect);
zkConnect is the ZooKeeper host and port, not the Kafka broker address, so it should be localhost:2181 rather than localhost:9092.
As discussed in chat, the rest of the issue was with the Maven dependencies: all the required dependencies need to be included in the pom.xml.
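Applied to the main class above, the spout setup would then look roughly like this (the third SpoutConfig argument is the ZooKeeper root path used to store consumer offsets; "/kafkastorm" is just an example value, not something prescribed by the answer):

// ZkHosts must point at ZooKeeper, not at the Kafka broker (localhost:9092)
BrokerHosts hosts = new ZkHosts("localhost:2181");
SpoutConfig spoutConfig = new SpoutConfig(hosts, "test", "/kafkastorm", "id1");
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);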

Related

How to distribute workload to many compute nodes and do scatter-gather scenarios with Kafka Streams?

I am new to Kafka Streams and Alpakka Kafka.
Problem: I have been using a Java ExecutorService to run parallel jobs and, when ALL of them are done, marking the entire process done. The issues are fault tolerance, high availability and not utilizing all compute nodes to do the work; it uses just ONE host JVM.
We have Apache Kafka as infrastructure, so I was wondering how I can use Kafka Streams to implement a scatter-gather (or child-task execution) use case: distribute the workload, then gather the results or at least get an indication that all tasks are done.
Any pointer to sample work on scatter-gather or fork/join with Kafka Streams or Alpakka Kafka would be great.
Here is a sample:
import org.springframework.http.MediaType;
import org.springframework.web.reactive.function.client.WebClient;

import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Main {

    private static final ExecutorService executorService = Executors.newFixedThreadPool(15);

    public static void main(String[] args) throws Exception {
        final WebClient webClient = WebClient.builder().build();
        List<CompletableFuture<String>> allTasks = new LinkedList<>();
        String[] urls = {"http://test1", "http://test2", "http://test3"};

        // Distribute the work (WebClient can do async, but this is just an example).
        for (final String url : urls) {
            CompletableFuture<String> task = CompletableFuture.supplyAsync(() -> {
                // Some task; just as an example a GET call, it could be anything.
                String response = webClient.get().uri(url)
                        .accept(MediaType.APPLICATION_JSON)
                        .retrieve()
                        .bodyToMono(String.class)
                        .block();
                return response;
            }, executorService);
            allTasks.add(task);
        }

        // Wait for all to be done (join).
        CompletableFuture.allOf(allTasks.toArray(new CompletableFuture[]{})).join();

        for (CompletableFuture<String> task : allTasks) {
            processResponse(task.get());
        }
        // Shut down the pool so the JVM can exit.
        executorService.shutdown();
    }

    public static void processResponse(String response) {
        System.out.println(response);
    }
}

Spring Boot IBM Queue - Discover all Destinations

I am writing a small Spring Boot application that is supposed to monitor queues on an external IBM MQ installation.
I am able to connect via MQXAQueueConnectionFactory, but I have not found a way to discover all remote queues/destinations on that host programmatically. I don't want to hard-code them in my application.
How can I get a list of all existing queues in order to add listeners? I should mention that access via the REST API is not possible, because that feature has been disabled by the administrators.
You can use the IBM MQ Programmable Command Formats (PCF). If you installed the IBM MQ samples, tools/pcf/samples/PCF_DisplayActiveLocalQueues.java gives you an idea for your use case.
Here is how I use it in my unit tests to find all the queues with messages:
import java.io.IOException;

import com.ibm.mq.MQException;
import com.ibm.mq.MQGetMessageOptions;
import com.ibm.mq.MQMessage;
import com.ibm.mq.MQQueue;
import com.ibm.mq.MQQueueManager;
import com.ibm.mq.constants.CMQC;
import com.ibm.mq.constants.CMQCFC;
import com.ibm.mq.constants.MQConstants;
import com.ibm.mq.headers.MQDataException;
import com.ibm.mq.headers.pcf.PCFMessage;
import com.ibm.mq.headers.pcf.PCFMessageAgent;

public class MqUtils {

    public static void queuesWithMessages(MQQueueManager qmgr) {
        try {
            PCFMessageAgent agent = new PCFMessageAgent(qmgr);
            try {
                PCFMessage request = new PCFMessage(CMQCFC.MQCMD_INQUIRE_Q);
                // NOTE: You can not use a queue name pattern like "FOO.*" together with
                // the "addFilterParameter" method. This is a limitation of PCF messages.
                // If you want to filter on queue names, you would have to do it in the
                // for loop after sending the PCF message.
                request.addParameter(CMQC.MQCA_Q_NAME, "*");
                request.addParameter(CMQC.MQIA_Q_TYPE, MQConstants.MQQT_LOCAL);
                request.addFilterParameter(CMQC.MQIA_CURRENT_Q_DEPTH, CMQCFC.MQCFOP_GREATER, 0);
                for (PCFMessage response : agent.send(request)) {
                    String queueName = (String) response.getParameterValue(CMQC.MQCA_Q_NAME);
                    if (queueName == null
                            || queueName.startsWith("SYSTEM")
                            || queueName.startsWith("AMQ")) {
                        continue;
                    }
                    Integer queueDepth = (Integer) response.getParameterValue(CMQC.MQIA_CURRENT_Q_DEPTH);
                    // Do something with this queue that has messages
                }
            } catch (MQException | IOException e) {
                throw new RuntimeException(e);
            } finally {
                agent.disconnect();
            }
        } catch (MQDataException e) {
            throw new RuntimeException(e);
        }
    }
}
And this should give you an idea of how to configure the MQQueueManager (see also the IBM docs):
import com.ibm.mq.MQEnvironment;
import com.ibm.mq.MQException;
import com.ibm.mq.MQQueueManager;

@Configuration
static class MQConfig {

    @Bean(destroyMethod = "disconnect")
    public MQQueueManager mqQueueManager() throws MQException {
        MQEnvironment.hostname = "the.host.com";
        MQEnvironment.port = 1415;
        MQEnvironment.channel = "xxx.CL.FIX";
        return new MQQueueManager("xxx");
    }
}
The chapter Using with IBM MQ classes for JMS explains how you can use PCF messages in pure JMS.

Create queue at runtime in Grails with RabbitMQ plugin

I have a system where external systems can subscribe to events generated by my system. The system is written in Grails 2, using the RabbitMQ plugin for internal messaging. The events to external systems are communicated via HTTP.
I would like to create a queue for each subscriber to prevent a slow subscriber endpoint from slowing down messages to other subscribers. Subscriptions can occur at runtime, which is why defining the queues in the application config is not desirable.
How can I create a queue with a topic binding at runtime with the Grails RabbitMQ plugin?
As reading messages from RabbitMQ queues is directly coupled to services, a side problem of creating queues at runtime could be having multiple instances of that Grails service. Any ideas?
I don't have a ready solution for you, but if you follow the code in the RabbitmqGrailsPlugin descriptor, especially the doWithSpring section, you should be able to recreate the steps necessary to initialize a new queue and its associated listener dynamically at runtime.
It then all comes down to passing the needed parameters, registering the necessary Spring beans and starting the listeners.
To answer your second question: I think you can come up with some naming convention and create a new queue handler for each queue. An example of how to create Spring beans dynamically can be found here: dynamically declare beans
Just a short example of how I would quickly register a queue; it requires much more wiring etc...
def createQ(queueName) {
    def queuesConfig = {
        "${queueName}"(durable: true, autoDelete: false)
    }
    def queueBuilder = new RabbitQueueBuilder()
    queuesConfig.delegate = queueBuilder
    queuesConfig.resolveStrategy = Closure.DELEGATE_FIRST
    queuesConfig()

    queueBuilder.queues?.each { queue ->
        if (log.debugEnabled) {
            log.debug "Registering queue '${queue.name}'"
        }
        BeanDefinitionBuilder builder = BeanDefinitionBuilder.rootBeanDefinition(Queue.class)
        builder.addConstructorArgValue(queue.name)
        builder.addConstructorArgValue(Boolean.valueOf(queue.durable))
        builder.addConstructorArgValue(Boolean.valueOf(queue.exclusive))
        builder.addConstructorArgValue(Boolean.valueOf(queue.autoDelete))
        builder.addConstructorArgValue(queue.arguments)

        DefaultListableBeanFactory factory = (DefaultListableBeanFactory) grailsApplication.mainContext.getBeanFactory()
        factory.registerBeanDefinition("grails.rabbit.queue.${queue.name}", builder.getBeanDefinition())
    }
}
I ended up using Spring AMQP, which is used under the hood by the Grails RabbitMQ plugin. I removed some methods/arguments as they are not relevant to the sample:
class MyUpdater {
    void handleMessage(Object message) {
        String content = new String(message)
        // do whatever you need with the message
    }
}
import org.springframework.amqp.core.BindingBuilder
import org.springframework.amqp.core.Queue
import org.springframework.amqp.core.TopicExchange
import org.springframework.amqp.rabbit.core.RabbitAdmin
import org.springframework.amqp.rabbit.listener.SimpleMessageListenerContainer
import org.springframework.amqp.rabbit.listener.adapter.MessageListenerAdapter
import org.springframework.amqp.support.converter.SimpleMessageConverter
import org.springframework.amqp.rabbit.connection.ConnectionFactory

class ListenerInitiator {
    // autowired
    ConnectionFactory rabbitMQConnectionFactory

    protected void initiateListener() {
        RabbitAdmin admin = new RabbitAdmin(rabbitMQConnectionFactory)

        // normally passed to this method, moved to local vars for simplicity
        String queueName = "myQueueName"
        String routingKey = "#"
        String exchangeName = "myExchange"

        Queue queue = new Queue(queueName)
        admin.declareQueue(queue)
        TopicExchange exchange = new TopicExchange(exchangeName)
        admin.declareExchange(exchange)
        admin.declareBinding(BindingBuilder.bind(queue).to(exchange).with(routingKey))

        // normally passed to this method, moved to local var for simplicity
        MyUpdater listener = new MyUpdater()

        SimpleMessageListenerContainer container =
                new SimpleMessageListenerContainer(rabbitMQConnectionFactory)
        MessageListenerAdapter adapter = new MessageListenerAdapter(listener)
        adapter.setMessageConverter(new SimpleMessageConverter())
        container.setMessageListener(adapter)
        container.setQueueNames(queueName)
        container.start()
    }
}

DRPC server error in Storm

I am trying to execute the code below and getting an error. Not sure if I am missing something here. Also, where would I see the output?
Error
java.lang.RuntimeException: No DRPC servers configured for topology
at backtype.storm.drpc.DRPCSpout.open(DRPCSpout.java:79)
at storm.trident.spout.RichSpoutBatchTriggerer.open(RichSpoutBatchTriggerer.java:58)
at backtype.storm.daemon.executor$fn__5802$fn__5817.invoke(executor.clj:519)
at backtype.storm.util$async_loop$fn__442.invoke(util.clj:434)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:744)
Code:
package com.**.trident.storm;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import storm.kafka.*;
import storm.trident.*;
import backtype.storm.*;

public class EventTridentDrpcTopology {

    private static final String KAFKA_SPOUT_ID = "kafkaSpout";
    private static final Logger log = LoggerFactory.getLogger(EventTridentDrpcTopology.class);

    public static StormTopology buildTopology(OpaqueTridentKafkaSpout spout) throws Exception {
        TridentTopology tridentTopology = new TridentTopology();
        TridentState ts = tridentTopology.newStream("event_spout", spout)
                .name(KAFKA_SPOUT_ID)
                .each(new Fields("mac_address"), new SplitMac(), new Fields("mac"))
                .groupBy(new Fields("mac"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("maccount"))
                .parallelismHint(4);

        tridentTopology
                .newDRPCStream("mac_count")
                .each(new Fields("args"), new SplitMac(), new Fields("mac"))
                .stateQuery(ts, new Fields("mac"), new MapGet(), new Fields("maccount"))
                .each(new Fields("maccount"), new FilterNull())
                .aggregate(new Fields("maccount"), new Sum(), new Fields("sum"));

        return tridentTopology.build();
    }

    public static void main(String[] str) throws Exception {
        Config conf = new Config();

        BrokerHosts hosts = new ZkHosts("xxxx:2181,xxxx:2181,xxxx:2181");
        String topic = "event";
        //String zkRoot = topologyConfig.getProperty("kafka.zkRoot");
        String consumerGroupId = "StormSpout";

        DRPCClient drpc = new DRPCClient("xxxx", 3772);

        TridentKafkaConfig tridentKafkaConfig = new TridentKafkaConfig(hosts, topic, consumerGroupId);
        tridentKafkaConfig.scheme = new SchemeAsMultiScheme(new XScheme());

        OpaqueTridentKafkaSpout opaqueTridentKafkaSpout = new OpaqueTridentKafkaSpout(tridentKafkaConfig);

        StormSubmitter.submitTopology("event_trident", conf, buildTopology(opaqueTridentKafkaSpout));
    }
}
You have to configure the locations of the DRPC servers and launch them.
See Remote mode DRPC on http://storm.apache.org/releases/0.10.0/Distributed-RPC.html
Launch DRPC server(s)
Configure the locations of the DRPC servers
Submit DRPC topologies to Storm cluster
Launching a DRPC server can be done with the storm script and is just like launching Nimbus or the UI:
bin/storm drpc
Next, you need to configure your Storm cluster to know the locations of the DRPC server(s). This is how DRPCSpout knows from where to read function invocations. This can be done through the storm.yaml file or the topology configurations. Configuring this through the storm.yaml looks something like this:
drpc.servers:
- "drpc1.foo.com"
- "drpc2.foo.com"

Route lines from file to persistent JMS queue: How to improve performance?

I need some help with performance tuning of a use case. In this use case the Camel route tails status lines in a log file and sends each line as a message to a JMS queue. I have implemented the use case like this:
package tests;

import java.io.File;
import java.net.URI;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.broker.BrokerFactory;
import org.apache.activemq.broker.BrokerService;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.component.sjms.SjmsComponent;
import org.apache.camel.main.Main;

public class LinesToQueue {
    public static void main(String[] args) throws Exception {
        final File file = new File("data/log.txt");
        final String uri = "tcp://127.0.0.1:61616";

        final BrokerService jmsService = BrokerFactory.createBroker(new URI("broker:" + uri));
        jmsService.start();

        final SjmsComponent jmsComponent = new SjmsComponent();
        jmsComponent.setConnectionFactory(new ActiveMQConnectionFactory(uri));

        final Main main = new Main();
        main.bind("jms", jmsComponent);
        main.addRouteBuilder(new RouteBuilder() {
            @Override
            public void configure() throws Exception {
                fromF("stream:file?fileName=%s&scanStream=true&scanStreamDelay=0", file.getAbsolutePath())
                        .routeId("LinesToQueue")
                        .to("jms:LogLines?synchronous=false");
            }
        });
        main.enableHangupSupport();
        main.run();
    }
}
When I run this use case against a file already filled with 1,000,000 lines, the overall throughput I get in the route is about 313 lines/second, which means it takes about 55 minutes to process the file.
As a reference, I have also created another use case where the Camel route tails status lines in a log file and sends each line as a document to an Elasticsearch index. I have implemented that use case like this:
package tests;

import java.io.File;

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.main.Main;

public class LinesToIndex {
    public static void main(String[] args) throws Exception {
        final File file = new File("data/log.txt");
        final String uri = "local";

        final Main main = new Main();
        main.addRouteBuilder(new RouteBuilder() {
            @Override
            public void configure() throws Exception {
                fromF("stream:file?fileName=%s&scanStream=true&scanStreamDelay=0", file.getAbsolutePath())
                        .routeId("LinesToIndex")
                        .bean(new LineConverter())
                        .toF("elasticsearch://%s?operation=INDEX&indexName=log&indexType=line", uri);
            }
        });
        main.enableHangupSupport();
        main.run();
    }
}
When I run this use case against a file already filled with 1,000,000 lines, the overall throughput I get in the route is about 8333 lines/second, which means it takes about 2 minutes to process the file.
I understand that there is a huge difference between a JMS queue and an Elasticsearch index, but how can I make the JMS use case above perform better?
Update #1:
It seems the persistence in the JMS service is the bottleneck in my first use case above. If I disable persistence in the JMS service, the throughput in the route is about 11,111 lines/second. Which persistence store for the JMS service will give me better performance?
A couple of things to consider...
ActiveMQ producer connections are expensive, so make sure you use a pooled connection factory (see the sketch below).
Consider using the VM transport for an in-process ActiveMQ instance.
Consider using an external ActiveMQ broker over TCP (so it doesn't compete for resources with your test).
Set up/tune KahaDB or LevelDB to optimize persistent storage for your use case.
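As a rough sketch of the pooled-connection-factory suggestion applied to the LinesToQueue example above (assuming activemq-pool is on the classpath; the pool size of 8 is an arbitrary example, not a recommendation):

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.pool.PooledConnectionFactory;
import org.apache.camel.component.sjms.SjmsComponent;

// ...
final String uri = "tcp://127.0.0.1:61616";
// alternatively, "vm://localhost" keeps the broker in-process without TCP overhead

// Reuse connections and sessions instead of opening new ones per exchange.
final PooledConnectionFactory pooledFactory =
        new PooledConnectionFactory(new ActiveMQConnectionFactory(uri));
pooledFactory.setMaxConnections(8);

final SjmsComponent jmsComponent = new SjmsComponent();
jmsComponent.setConnectionFactory(pooledFactory);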
