To keep the scenario simple:
number of consumers == number of partitions == number of Kafka brokers
If we deploy the consumers on the same machines as the brokers, how can we make each consumer consume only the messages stored locally? The purpose is to cut all the network overhead.
I think we can do this if each consumer knows the partition IDs hosted on its machine, but I don't know how to find that out. Or is there another direction to solve this problem?
Thanks.
bin/kafka-topics.sh --zookeeper [zk address] --describe --topic [topic_name] tells you which broker hosts the leader for each partition. Then you can use manual partition assignment for each consumer to make sure it consumes from a local partition.
Probably not worth the effort because partition leadership can change and then you would have to rebalance all your consumers to be local again. You can save the same amount of network bandwidth with less effort by just reducing the replication factor from 3 to 2.
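A rough sketch of that manual-assignment approach, assuming a plain Java consumer and that the broker's advertised host name matches what InetAddress reports on the machine (the topic name, bootstrap address, and class name are placeholders):

import java.net.InetAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class LocalPartitionConsumer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        String localHost = InetAddress.getLocalHost().getHostName();

        // Keep only the partitions whose leader lives on this machine
        List<TopicPartition> localPartitions = new ArrayList<>();
        for (PartitionInfo p : consumer.partitionsFor("my_topic")) {
            if (p.leader() != null && localHost.equals(p.leader().host())) {
                localPartitions.add(new TopicPartition(p.topic(), p.partition()));
            }
        }

        // Manual assignment bypasses the group coordinator, so no automatic rebalancing happens
        consumer.assign(localPartitions);
    }
}

Note that because assign() bypasses the group coordinator, a leadership change will not trigger a rebalance; you would have to re-run the lookup yourself, which is exactly the caveat above.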
Maybe you could use the Admin Client API.
First you can use the describeTopics() method to get information about topics in the cluster. From the DescribeTopicsResult you can access the TopicPartitionInfo with information about the partitions for each topic. From there you can get the leader Node through leader(). The Node contains the host(), which you can check against the host your consumer is running on, or the id(), which the consumer can compare with the broker id running on the same machine (in general this is information you can define upfront). More info on the Admin Client API in the following JavaDoc:
https://kafka.apache.org/0110/javadoc/index.html?org/apache/kafka/clients/admin/AdminClient.html
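A minimal sketch of that Admin Client route (topic name and bootstrap address are placeholders, and it assumes the broker's advertised host matches the local host name):

import java.net.InetAddress;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class LocalPartitionLookup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        String localHost = InetAddress.getLocalHost().getHostName();

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin.describeTopics(Collections.singleton("my_topic"))
                    .all().get().get("my_topic");

            // Collect the ids of the partitions whose leader is on this machine
            List<Integer> localPartitionIds = new ArrayList<>();
            for (TopicPartitionInfo p : description.partitions()) {
                if (p.leader() != null && localHost.equals(p.leader().host())) {
                    localPartitionIds.add(p.partition());
                }
            }
            System.out.println("Partitions led by this host: " + localPartitionIds);
        }
    }
}

The resulting partition ids can then be fed into consumer.assign(...) as in the earlier sketch.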
For an ActiveMQ Artemis HA solution, I'm thinking that instead of using a primary/secondary configuration with a ZooKeeper monitoring node, perhaps a master/slave/slave configuration would be better. The reason is that, as pointed out in this answer, having a single ZooKeeper node is an architectural weakness. The master/slave/slave approach would most likely have at least one node running at any given time.
Also, I'm wondering if it might be better to ask these questions on the mailing list or the Slack channel? Any preference there?
Using a primary/backup/backup triplet is possible, but it provides no mitigation against split brain since only primary nodes participate in quorum voting. Therefore, the architectural weakness in this configuration is significantly worse than just having a single ZooKeeper node, because with a single ZooKeeper node you at least have a chance of mitigating split brain, whereas otherwise you don't.
Also, fail-back only works for a primary/backup pair. It doesn't work for a primary/backup/backup triplet, because a backup server is owned by only one primary server. This means that when the 3 broker instances are started there will be one primary/backup pair and a "left-over" backup which will sit in a kind of idle state waiting to attach to a primary broker that has no backup. If the primary broker then fails, its backup will take over and become primary, and the other backup will become the backup of the server which just became primary. Once the broker instance which failed is restarted, it will attempt to register itself as a backup of the now-primary broker and initiate fail-back. However, since the now-primary broker already has a backup, it will reject the registration message from the original primary, and therefore fail-back will not occur.
We have producers that are sending the following to Kafka:
topic=syslog, ~25,000 events per day
topic=nginx, ~5,000 events per day
topic=zeek.xxx.log, ~100,000 events per day (total). In this last case there are 20 distinct zeek topics, such as zeek.conn.log and zeek.http.log
kafka-connect-elasticsearch instances function as consumers to ship data from Kafka to Elasticsearch. The hello-world Sink configuration for kafka-connect-elasticsearch might look like this:
# elasticsearch.properties
name=elasticsearch-sink
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=24
topics=syslog,nginx,zeek.broker.log,zeek.capture_loss.log,zeek.conn.log,zeek.dhcp.log,zeek.dns.log,zeek.files.log,zeek.http.log,zeek.known_services.log,zeek.loaded_scripts.log,zeek.notice.log,zeek.ntp.log,zeek.packet_filtering.log,zeek.software.log,zeek.ssh.log,zeek.ssl.log,zeek.status.log,zeek.stderr.log,zeek.stdout.log,zeek.weird.log,zeek.x509.log
topic.creation.enable=true
key.ignore=true
schema.ignore=true
...
This can be invoked with bin/connect-standalone.sh. I realized that running, or attempting to run, tasks.max=24 when work is performed in a single process is not ideal. I know that using distributed mode would be a better alternative, but am unclear on the performance-optimal way to submit connectors to distributed mode. Namely,
In distributed mode, would I still want to submit just a single elasticsearch.properties through a single API call? Or would it be best to break it up into multiple .properties configs + connectors (e.g. one for syslog, one for nginx, one for zeek.**) and submit them separately?
I understand that tasks should be equal to the number of topics x number of partitions, but what dictates the number of workers?
Is there anywhere in the documentation that walks through best practices for a situation such as this where there is a noticeable imbalance of throughput for different topics?
In distributed mode, would I still want to submit just a single elasticsearch.properties through a single API call?
It'd be a JSON file, but yes.
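For illustration, the standalone properties above map onto a JSON payload roughly like this when submitted to a distributed worker's REST API (the worker URL is a placeholder and the topic list is abbreviated here):

curl -X POST http://connect-worker:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "elasticsearch-sink",
    "config": {
      "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
      "tasks.max": "24",
      "topics": "syslog,nginx,zeek.conn.log,...",
      "key.ignore": "true",
      "schema.ignore": "true"
    }
  }'

Splitting this into several smaller connectors (one per topic family) just means repeating the POST with a different name and topics value for each.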
what dictates the number of workers?
Up to you. JVM usage is one factor that you can monitor and scale on.
Not really any documentation that I am aware of.
We have one topic with one partition due to message-ordering requirements. We have two consumers running on different servers with the same set of configurations, i.e. the same groupId, consumerId, and consumerGroup. That is:
1 Topic -> 1 Partition -> 2 Consumers
When we deploy the consumers, the same code is deployed on both servers. We noticed that when a message comes in, both consumers consume it rather than only one processing it. The reason for having consumers running on two separate servers is that if one server crashes, at least the other can continue processing messages. But it looks like when both are up, both consume the messages. Reading the Kafka docs, it says that if we have more consumers than partitions then some stay idle, but we don't see that happening. Is there anything we are missing on the configuration side apart from consumerId & groupId? Thanks
As Gary Russell said, as long as the two consumer instances each have their own consumer group, they will both consume every event that is written to the topic. Just put them into the same consumer group. You can provide a consumer group id in consumer.properties.
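A minimal sketch of a consumer that sets this programmatically (group id, topic, and bootstrap address are placeholders); deploy the same group.id on both servers and only one instance will be assigned the single partition at a time:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SharedGroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-shared-group"); // identical on both servers
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my_topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());
                }
            }
        }
    }
}

With one partition and a shared group, the second instance sits idle and takes over automatically if the active one crashes.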
According to the documentation here: https://github.com/OpenHFT/Chronicle-Engine one is able to do pub/sub using maps. This allows one to create a construct similar to topics that are available in middleware such as Tibco, 29West, and Kafka, and use that as a way of sending events across processes. Is this a recommended usage of chronicle map? What kind of latency can I expect if both publisher and subscriber stay in the same machine?
My second question is, how can this be extended to send messages across machines? How does this work with enterprise TCP replication?
My requirement is to create thousands of topics and use them to communicate across processes running in different machines (in a LAN). Each of these topics would be written by a single source and read by multiple readers running in same or different machines. If the source of a particular topic dies, that source's replica would start writing to the topic and listeners will continue to receive messages. These messages need not be stored for replay.
Is this a recommended usage of chronicle map?
Yes, you can use Engine to support event notification across machines. However, if you want the lowest latencies you might need to send a notification via a Queue and keep the latest value in a Map.
What kind of latency can I expect if both publisher and subscriber stay in the same machine?
It depends on your use case, especially the size of the data (in the Map's case, the number of entries as well). The latency for a Map in Engine is around 30 - 100 µs, however the latency for a Queue is around 2 - 5 µs.
My second question is, how can this be extended to send messages across machines?
For this you need our licensed product but the code is the same.
Each of these topics would be written by a single source and read by multiple readers running in same or different machines. If the source of a particular topic dies, that source's replica would start writing to the topic and listeners will continue to receive messages.
Most likely, the simplest solution is to have a Map where each topic is a different key. This will send the latest value for that topic to the consumers.
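A rough sketch of that map-per-topic idea using Chronicle Map on its own (not Engine, so no change notifications); the file path, sample key/value, and entry count are assumptions, and the exact builder methods may differ between versions:

import java.io.File;
import net.openhft.chronicle.map.ChronicleMap;

public class TopicMapExample {
    public static void main(String[] args) throws Exception {
        // Shared, persisted map: each topic name is a key, the value is the latest message
        ChronicleMap<String, String> latestByTopic = ChronicleMap
                .of(String.class, String.class)
                .name("latest-by-topic")
                .averageKey("topic-1")                 // sample key used for sizing
                .averageValue("{\"msg\":\"example\"}") // sample value used for sizing
                .entries(10_000)
                .createPersistedTo(new File("latest-by-topic.dat"));

        // Writer side: overwrite the latest value for a topic
        latestByTopic.put("topic-1", "{\"msg\":\"latest event\"}");

        // Reader side: fetch the most recent value for that topic
        System.out.println(latestByTopic.get("topic-1"));

        latestByTopic.close();
    }
}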
If you need to record every event, a Queue is likely to be a better choice. If you don't need to retain the data for long, you can use a very short file rotation.
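And a rough sketch of the Queue alternative, assuming Chronicle Queue's single builder (the directory and roll cycle are assumptions; a short roll cycle is what lets old files be removed quickly):

import net.openhft.chronicle.queue.ChronicleQueue;
import net.openhft.chronicle.queue.ExcerptAppender;
import net.openhft.chronicle.queue.ExcerptTailer;
import net.openhft.chronicle.queue.RollCycles;

public class TopicQueueExample {
    public static void main(String[] args) {
        try (ChronicleQueue queue = ChronicleQueue.singleBuilder("topic-1-queue")
                .rollCycle(RollCycles.MINUTELY) // short rotation so old files can be deleted quickly
                .build()) {

            // Writer side: append every event
            ExcerptAppender appender = queue.acquireAppender();
            appender.writeText("event 1");

            // Reader side: tail the queue and read events in order
            ExcerptTailer tailer = queue.createTailer();
            System.out.println(tailer.readText());
        }
    }
}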
I am using the kafkaSpout that came with the storm-0.9.2 distribution for my project. I want to monitor the throughput of this spout. I tried using KafkaOffsetMonitor, but it does not show any consumers reading from my topic.
I suspect this is because I have specified the root path in ZooKeeper for the spout to store the consumer offsets. How will KafkaOffsetMonitor know where to look for data about my kafkaSpout instance?
Can someone explain exactly where ZooKeeper stores data about Kafka topics and consumers? ZooKeeper is laid out like a filesystem, so how does it arrange the data for different topics and their partitions? What is the consumer groupid and how is it interpreted by ZooKeeper while storing consumer offsets?
If anyone has ever used KafkaOffsetMonitor to monitor the throughput of a kafkaSpout, please tell me how I can get the tool to find my spout.
Thanks a lot,
Palak Shah
Kafka-Spout maintains its offset in its own znode rather than under the znode where Kafka stores the offsets for regular consumers. We had a similar need where we had to monitor the offsets of both the kafka-spout consumers and regular Kafka consumers, so we ended up writing our own tool. You can get the tool from here:
https://github.com/Symantec/kafka-monitoring-tool
I have never used KafkaOffsetMonitor, but I can answer the other part.
zookeeper.connect is the property where you can specify the znode for Kafka; by default it keeps all data at '/'.
You can access the zookeeper filesystem using zkCli.sh, the zookeeper command line.
You should look at /consumers and /brokers; the following would give you the offset:
get /consumers/my_test_group/offsets/my_topic/0
You can poll this offset continuously to know the rate of consumption at the spout.
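A small sketch of polling that offset programmatically with the ZooKeeper Java client (the connect string, group, topic, and partition mirror the example path above and are placeholders):

import org.apache.zookeeper.ZooKeeper;

public class SpoutOffsetPoller {
    public static void main(String[] args) throws Exception {
        // Connect string is a placeholder; point it at your ZooKeeper ensemble
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        String path = "/consumers/my_test_group/offsets/my_topic/0";
        long previous = -1;
        while (true) {
            // The offset for a regular consumer is stored as a plain string in the znode's data
            long current = Long.parseLong(new String(zk.getData(path, false, null)));
            if (previous >= 0) {
                System.out.println("Messages consumed in the last 10s: " + (current - previous));
            }
            previous = current;
            Thread.sleep(10_000);
        }
    }
}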