I am working with Apache Kafka, Spring, and the Java API, and I am facing a really annoying issue. I use the Kafka topic-pattern approach to listen to events for multiple clients.
Below is the Kafka consumer code, in which the topic name comes from a config file and the suffix is a hard-coded value.
${${service}.topic} value - test-env.demo.*.v1
suffix value is - .cqrs.customer
@KafkaListener(
topicPattern = "${${service}.topic}" + Constants.suffix,
groupId = "test",
id = "test")
So the final topic name resolved for customer abc is
test-env.demo.abc.v1.cqrs.customer
and for customer xyz it is
test-env.demo.xyz.v1.cqrs.customer
But when the producer emits an event on either of these topics, the consumer does not receive anything.
Could someone help me with this?
Thanks
You can grep for the phrase "partitions assigned" in the application log; that line shows which partitions are assigned to your application. If the assignment succeeded, it should look something like:
partitions assigned: [test-env.demo.abc.v1.cqrs.customer-0],
If the list of assigned partitions is empty
partitions assigned: []
there may already be some other consumer instances in the "test" group and messages are being processed by them. In this case, try changing the name of the group.
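It can also help to confirm, independently of Spring, that the resolved pattern really matches the topic names. The following is only a minimal sketch using java.util.regex, assuming the property resolves to exactly the value shown in the question; the class name TopicPatternCheck and the ESCAPED variant are made up for illustration and are not part of the original code:

```java
import java.util.regex.Pattern;

public class TopicPatternCheck {
    // Assumed resolved value of ${${service}.topic} plus the hard-coded suffix.
    static final String RESOLVED = "test-env.demo.*.v1" + ".cqrs.customer";

    // Stricter variant (an assumption, not from the question): literal dots
    // are escaped so that only real dots match between the name segments.
    static final String ESCAPED = "test-env\\.demo\\..*\\.v1\\.cqrs\\.customer";

    // Returns true when the whole topic name matches the regular expression.
    static boolean matches(String regex, String topic) {
        return Pattern.matches(regex, topic);
    }

    public static void main(String[] args) {
        System.out.println(matches(RESOLVED, "test-env.demo.abc.v1.cqrs.customer"));
        System.out.println(matches(RESOLVED, "test-env.demo.xyz.v1.cqrs.customer"));
        System.out.println(matches(ESCAPED, "test-env.demo.abc.v1.cqrs.customer"));
    }
}
```

If these checks pass, the pattern itself is probably fine and the cause is more likely on the group or assignment side, as described above.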
Kafka Connect provides a dead letter queue (DLQ) feature that can be configured (errors.deadletterqueue.topic.name) to store failing records. I tried configuring it on a MirrorMaker2 setup, but it doesn't seem to work as expected. My expectation is that messages that fail to replicate to the target cluster are stored in the dead letter queue topic.
To test this, I simulated failures by bringing down the target cluster and expected MirrorMaker2 to create a DLQ topic on the source cluster containing the failed messages, but the dead letter queue topic was never created. The Kafka documentation is not very clear on whether this configuration option works for MirrorMaker2.
Below is the configuration I used:
clusters = sourceKafkaCluster,targetKafkaCluster
sourceKafkaCluster.bootstrap.servers = xxx
targetKafkaCluster.bootstrap.servers = yyy
sourceKafkaCluster->targetKafkaCluster.enabled = true
targetKafkaCluster->sourceKafkaCluster.enabled = false
#Not sure which one of the below ones are correct.
sourceKafkaCluster->targetKafkaCluster.errors.deadletterqueue.topic.name=dlq_topic_1
sourceKafkaCluster->targetKafkaCluster.errors.deadletterqueue.topic.replication.factor=1
errors.deadletterqueue.topic.name=dlq_topic_1
errors.deadletterqueue.topic.replication.factor=1
Does the deadletterqueue configuration option work with MirrorMaker2?
I'm using spring-kafka to run Kafka Streams in a Spring Boot application using StreamsBuilderFactoryBean. I changed the number of partitions in some of the topics from 100 to 20 by deleting and recreating them, but now when I run the application I get the following error:
Existing internal topic MyAppId-KSTREAM-AGGREGATE-STATE-STORE-0000000092-changelog has invalid partitions: expected: 20; actual: 100. Use 'kafka.tools.StreamsResetter' tool to clean up invalid topics before processing.
I couldn't access the class kafka.tools.StreamsResetter, and I tried calling StreamsBuilderFactoryBean.getKafkaStreams().cleanUp(), but it threw a NullPointerException. How do I do the said cleanup?
The relevant documentation is here.
Step 1: Local Cleanup
For Spring Boot with StreamsBuilderFactoryBean, the first step can be done by simply passing a CleanupConfig to the constructor:
// Before
new StreamsBuilderFactoryBean(new KafkaStreamsConfiguration(config));
// After
new StreamsBuilderFactoryBean(new KafkaStreamsConfiguration(config), new CleanupConfig(true, true));
This makes the factory bean call KafkaStreams.cleanUp() both before start() and after stop().
Step 2: Global Cleanup
For step two, with all instances of the application stopped, simply use the tool as explained in the documentation:
# In kafka directory
bin/kafka-streams-application-reset.sh --application-id "MyAppId" --bootstrap-servers 1.2.3.4:9092 --input-topics x --intermediate-topics first_x,second_x,third_x --zookeeper 1.2.3.4:2181
What this does:
For any specified input topics: Reset the application’s committed consumer offsets to "beginning of the topic" for all partitions (for consumer group application.id).
For any specified intermediate topics: Skip to the end of the topic, i.e. set the application’s committed consumer offsets for all partitions to each partition’s logSize (for consumer group application.id).
For any internal topics: Delete the internal topic (this will also delete the corresponding committed offsets).
I have ZooKeeper and a single Kafka broker running, and I want to collect metrics with MetricBeat, index them with ElasticSearch, and display them with Kibana.
However, MetricBeat only gets data from the partition metricset; nothing comes from the consumergroup metricset.
Since the kafka module is defined as periodic in metricbeat.yml, shouldn't it send some data on its own rather than wait for user interaction (for example, a write to a topic)?
To make sure, I created a consumer group and wrote to and consumed from a topic, but still no data was collected by the consumergroup metricset.
consumergroup is defined in both metricbeat.template.json and metricbeat.template-es2x.json.
While metricbeat.full.yml is completely commented out, this is my metricbeat.yml kafka module definition:
- module: kafka
metricsets: ["partition", "consumergroup"]
enabled: true
period: 10s
hosts: ["localhost:9092"]
client_id: metricbeat1
retries: 3
backoff: 250ms
topics: []
In the /logs directory of MetricBeat, lines like these show up:
INFO Non-zero metrics in the last 30s:
libbeat.es.published_and_acked_events=109
libbeat.es.publish.write_bytes=88050
libbeat.publisher.messages_in_worker_queues=109
libbeat.es.call_count.PublishEvents=5
fetches.kafka-partition.events=106
fetches.kafka-consumergroup.success=2
libbeat.publisher.published_events=109
libbeat.es.publish.read_bytes=2701
fetches.kafka-partition.success=2
fetches.zookeeper-mntr.events=3
fetches.zookeeper-mntr.success=3
With ZooKeeper's mntr and Kafka's partition, I can see events= and success= values, but for consumergroup there is only success. It looks like no events are fired.
partition and mntr data are properly visible in Kibana, while consumergroup is missing.
The data stored in ElasticSearch is not human-readable; internal strings are used for directory names, and the logs do not contain any useful information.
Can anybody help me understand what is going on and fix it (probably on the MetricBeat side) so that the data is sent to ElasticSearch? Thanks :)
You need an active consumer consuming from the topics for the consumergroup metricset to generate events.
Can the following design be accomplished in Storm?
Let's take the word count example that is available at
https://github.com/nathanmarz/storm-starter/blob/master/src/jvm/storm/starter/WordCountTopology.java
I am changing the word generator spout to a file reader spout
The design for this Word Count Topology is
1. Spout to read file and create sentences line by line
2. Bolt to split sentences to words
3. Bolt to add unique words and give a word and its corresponding count
So, in a way, the topology describes the flow a file needs to go through in order to count the unique words it contains.
If I have two files, file 1 and file 2, I should be able to call the same topology and create two instances of it, each running the same word count.
To track whether a word count has actually finished, each instance of the word count topology should report a completed status once its file has been processed.
In the current design of Storm, I find that the topology is the actual running instance, so it behaves like a task.
One needs to make two different calls with different Topology names like
for file 1
StormSubmitter.submitTopology("WordCountTopology1", conf,builder.createTopology());
for file 2
StormSubmitter.submitTopology("WordCountTopology2", conf,builder.createTopology());
not to mention uploading the same jar twice using the storm client:
storm jar stormwordcount-1.0.0-jar-with-dependencies.jar com.company.WordCount1Main.App "server" "filepath1"
storm jar stormwordcount-1.0.0-jar-with-dependencies.jar com.company.WordCount2Main.App "server" "filepath2"
The other issue is that the topologies don't complete once the file is processed. They stay alive until we issue a kill on the topology:
storm kill "WordCountTopology"
I understand that in a streaming world, where messages come from a message queue like Kafka, there is no end of messages, but how is that relevant in the file world, where the set of entities/messages is fixed?
Is there an API that does the following?
// creates the topology; this is done once, using the storm client to upload the respective jars
StormSubmitter.submitTopology("WordCountTopology", conf,builder.createTopology());
Once uploaded, the application code just instantiates the topology with the arguments:
// creates an instance of the topology and returns a status tracker
JobTracker tracker = StormSubmitter.runTopology("WordCountTopology", conf, args);
// queries Storm to find out whether the current job is complete
JobStatus status = StormSubmitter.getTopologyStatus(conf, tracker);
To reuse the same topology twice, you have two possibilities:
1) Use a constructor parameter for your file spout and instantiate the same topology twice with different parameters:
private static StormTopology createMyTopology(String filename) {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("File Spout", new FileSpout(filename));
    // add further spouts and bolts etc.
    return builder.createTopology();
}

public static void main(String[] args) throws Exception {
    String file1 = "/path/to/file1";
    String file2 = "/path/to/file2";
    Config c = new Config();
    // useFile1 is a placeholder for however you choose the input file
    if (useFile1) {
        StormSubmitter.submitTopology("T1", c, createMyTopology(file1));
    } else {
        StormSubmitter.submitTopology("T1", c, createMyTopology(file2));
    }
}
2) As an alternative, you could configure your file spout in its open() method:
public class FileSpout extends BaseRichSpout {
    @Override
    public void open(Map conf, ...) {
        // read the file name from the topology configuration
        String filename = (String) conf.get("FILENAME");
        // ...
    }
    // other methods omitted
}

public static void main(String[] args) throws Exception {
    String file1 = "/path/to/file1";
    String file2 = "/path/to/file2";
    Config c = new Config();
    // useFile1 is a placeholder for however you choose the input file
    if (useFile1) {
        c.put("FILENAME", file1);
    } else {
        c.put("FILENAME", file2);
    }
    // assemble the topology...
    StormSubmitter.submitTopology("T", c, builder.createTopology());
}
For your second question: there is no API in Storm that terminates a topology automatically. You could use TopologyInfo and monitor the number of tuples emitted by the spout. If it does not change for some time, you can assume that the whole file has been read and then kill the topology.
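Config cfg = new Config();
// set NIMBUS_HOST and NIMBUS_THRIFT_PORT in cfg
Nimbus.Client client = NimbusClient.getConfiguredClient(cfg).getClient();
// note: getTopologyInfo() expects the topology id assigned by Nimbus, not the submitted name
TopologyInfo info = client.getTopologyInfo("topologyId");
// get emitted tuples from info; once the count stops changing:
client.killTopology("topologyName");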
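The "does not change for some time" check can be sketched without any Storm classes. This is only an illustration under stated assumptions: IdleDetector and pollAndCheckIdle are hypothetical names, and the LongSupplier stands in for whatever code reads the spout's emitted-tuple count.

```java
import java.util.function.LongSupplier;

public class IdleDetector {
    // Hypothetical source of the spout's emitted-tuple count.
    private final LongSupplier emittedCount;
    // Number of consecutive unchanged polls after which we call the spout idle.
    private final int maxUnchangedPolls;
    private long lastSeen = -1;
    private int unchangedPolls = 0;

    public IdleDetector(LongSupplier emittedCount, int maxUnchangedPolls) {
        this.emittedCount = emittedCount;
        this.maxUnchangedPolls = maxUnchangedPolls;
    }

    // Polls the count once. Returns true when the count has stayed the same
    // for maxUnchangedPolls consecutive polls after it was last seen to change.
    public boolean pollAndCheckIdle() {
        long current = emittedCount.getAsLong();
        if (current == lastSeen) {
            unchangedPolls++;
        } else {
            lastSeen = current;
            unchangedPolls = 0;
        }
        return unchangedPolls >= maxUnchangedPolls;
    }
}
```

In a real monitor the supplier would read the emitted count from the TopologyInfo returned by Nimbus, the poll would run on a timer, and killTopology would be called once pollAndCheckIdle() returns true; that wiring is left out here.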
The word count topology mentioned in the post doesn't do justice to the power of Storm. Storm is a stream processor, so it requires a stream; a file, by definition, is static. I sympathize with the Storm developers: a simple hello-world example was needed to showcase the topology concepts, and a non-streaming source like a file was chosen for it. For newcomers learning Storm, as I was at the time, that made it difficult to understand how to build an application from the example. The example only shows how the Storm concepts work; it is not a real-world illustration of how files would arrive or need to be processed.
Here is my take on what one solution could look like.
Since topologies run all the time, they can compute the word count for as long as one wants, i.e., within a file or across all files, over any period of time.
To allow different files to come in, we need a streaming spout, so naturally you need a Kafka message broker or similar to receive files as a stream. Depending on the size of the file and the limits the message broker imposes (Kafka's default message size limit is about 1 MB), we can either send the file itself as the payload or send a reference to the file, in which case you need shared storage such as Hadoop DFS or a NAS to hold the file.
We then read these files using a Kafka Spout as opposed to FileSpout.
We now have the following issues to address:
1. Word Count Across Files
2. Word Count per File
3. Running Status on the word count till it is processed
4. When do we know if a file is processed or complete
Word Count Across Files
This is the use case the provided example targets: if we keep streaming files and, for each file, read the lines, split them into words, and send the words to other bolts, the bolts count the words regardless of which file they came from.
File1 A quick brown fox jumped ...
File2 Once upon a time a fox ...
Field Grouping
quick
brown
fox
...
Once
upon
fox (not needed as it came in file 1)
...
Word Count Per File
To do this, the fields grouping key must be the word combined with the fileId, so the example needs to change to include a fileId with each word it splits.
So, given
File1 A quick brown fox jumped ...
File2 Once upon a time a fox ...
the fields grouping keys would be (ignoring the noise words)
File1_quick
File1_brown
File1_fox
File2_once
File2_upon
File2_fox
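The composite key idea can be sketched in plain Java, leaving Storm out entirely. This is only an illustration: PerFileWordCount and its methods are hypothetical names, and the in-memory map stands in for the state a counting bolt would hold.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class PerFileWordCount {
    // Counts keyed by "<fileId>_<word>", e.g. "File1_fox".
    private final Map<String, Integer> counts = new HashMap<>();

    // Builds the composite key; words are lower-cased so "Once" and "once" match.
    static String key(String fileId, String word) {
        return fileId + "_" + word.toLowerCase(Locale.ROOT);
    }

    // Splits a line on whitespace and counts every word under the given file.
    public void addLine(String fileId, String line) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(key(fileId, word), 1, Integer::sum);
            }
        }
    }

    // Returns how often the word has been seen in the given file so far.
    public int count(String fileId, String word) {
        return counts.getOrDefault(key(fileId, word), 0);
    }
}
```

With this key, "fox" from File1 and "fox" from File2 land in separate counters, whereas grouping on the word alone would merge them.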
Running Status on the word count till it is processed
Since all these counts live in the bolt's memory and we don't know where the end of a file is, there is no way to get the status unless someone peeks into the bolt or we periodically send the counts to another data store where we can query them. That is exactly what we need to do: at periodic intervals, persist the in-memory bolt counts to a data store such as HBase, Elasticsearch, or MongoDB.
When do we know if a file is processed or complete
This is perhaps the toughest question to answer in the streaming world. The stream processor doesn't know that the stream has finished: from its perspective, files keep coming in, and it needs to split each one into words and count them in the corresponding bolts. The bolts don't know what happened before or after a tuple reached them.
All of this has to be handled by the application developer.
One way to do this is to count the total number of words when each file is read and send that as a message:
File 1 : Total Words : 1000
File 2 : Total Words : 2000
Now, when we do the word count and collect the different words per file (File1_*), the sum of the individual word counts must match the total before we can say a file is complete. All of this is custom logic we would need to write.
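A minimal sketch of that completion check, assuming the announced totals and the per-file running counts end up in one place (in a real topology they would be spread over several bolts and would have to be aggregated first). FileCompletionTracker and its methods are hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;

public class FileCompletionTracker {
    // Total number of words announced per file, e.g. File1 -> 1000.
    private final Map<String, Long> expectedTotals = new HashMap<>();
    // Number of words counted so far per file.
    private final Map<String, Long> countedSoFar = new HashMap<>();

    // Records the "File N : Total Words : X" message for a file.
    public void announceTotal(String fileId, long totalWords) {
        expectedTotals.put(fileId, totalWords);
    }

    // Called once for every word that has been counted for the file.
    public void recordWord(String fileId) {
        countedSoFar.merge(fileId, 1L, Long::sum);
    }

    // A file is complete only when its total is known and the counted words match it.
    public boolean isComplete(String fileId) {
        Long expected = expectedTotals.get(fileId);
        return expected != null && expected.equals(countedSoFar.getOrDefault(fileId, 0L));
    }
}
```

A file whose total has not been announced yet is never reported as complete, which covers the case where the words arrive before the total message.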
In essence, Storm provides a framework for doing stream processing in a variety of ways. It is the application developer's job to work with the design it offers and implement their own logic depending on the use case. It doesn't provide application use cases out of the box or a good reference implementation; I think we need to build those ourselves, since it is not a commercial product and depends on the community to champion it.
I am currently using:
https://github.com/wurstmeister/storm-kafka-0.8-plus/commits/master
which has been moved to:
https://github.com/apache/storm/tree/master/external/storm-kafka
I want to specify the Kafka Consumer Group Name. Looking at the storm-kafka code, I traced the id setting and found that it is never used when building a consumer configuration, but it is used to create the zookeeper path at which offset information is stored. This link has an example of why I would want to do this: https://labs.spotify.com/2015/01/05/how-spotify-scales-apache-storm/
Am I correct in saying that the Consumer Group Name cannot be set using the https://github.com/apache/storm/tree/master/external/storm-kafka code?
So far, the storm-kafka integration is implemented using Kafka's SimpleConsumer API, and it stores consumer offsets in zookeeper in its own format (JSON).
If you write the spout config like below,
SpoutConfig spoutConfig = new SpoutConfig(zkBrokerHosts,
"topic name",
"/kafka/consumers(just an example, path to store consumer offset)",
"yourTopic");
it will write the consumer offsets in subdirectories of /kafka/consumers/yourTopic.
Note that by default storm-kafka uses the same zookeeper that your Storm cluster uses.