What may cause huge load on Kafka's `__consumer_offsets` topic?

I have a simple observation in my Kafka cluster (Kafka 0.11.0.0).
According to JMX information, the __consumer_offsets topic is constantly loaded with 10 times more messages than the sum of all messages in all other topics.
I also connected a console consumer to this topic and measured similar values.
What could be the reason?
How can I check what the Kafka broker is doing to generate such load on its own?

To read the __consumer_offsets topic:
bin/kafka-console-consumer.sh --topic __consumer_offsets --bootstrap-server brokers --formatter "kafka.coordinator.GroupMetadataManager\$OffsetsMessageFormatter" --new-consumer --consumer.config consumer.conf
For Kafka 0.11, use the formatter "kafka.coordinator.group.GroupMetadataManager\$OffsetsMessageFormatter".
Here consumer.conf contains a single line:
exclude.internal.topics=false
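If you want to see which consumer group is generating the volume, one rough approach (my own sketch, not part of the original answer; broker1:9092 is a placeholder and GNU timeout is assumed) is to sample the commit stream for a minute and count commits per group:
# sample __consumer_offsets for 60s and count offset commits per group
# (uses the consumer.conf shown above; formatter path is the 0.11+ one)
timeout 60 bin/kafka-console-consumer.sh \
  --bootstrap-server broker1:9092 \
  --topic __consumer_offsets \
  --consumer.config consumer.conf \
  --formatter "kafka.coordinator.group.GroupMetadataManager\$OffsetsMessageFormatter" |
  sed 's/^\[//; s/,.*//' | sort | uniq -c | sort -rn | head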

For Kafka 2.2, here is a demo (run from the kafka_2.12-2.2.1/bin directory):
./kafka-console-consumer.sh --bootstrap-server your-broker --topic __consumer_offsets --formatter "kafka.coordinator.group.GroupMetadataManager\$OffsetsMessageFormatter"
Output will be:
[MyGroup-7cc5f948df-f9tqz,__MY_TOPIC,2]::OffsetAndMetadata(offset=30, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1568121107017, expireTimestamp=None)

Related

Kafka Streams keeps logging 'Discovered transaction coordinator' after a node crash (with config StreamsConfig.EXACTLY_ONCE_V2)

I have a Kafka (kafka_2.13-2.8.0) cluster with a topic that has 3 partitions and a replication factor of 3, distributed across 3 nodes.
A producer cluster is sending messages to the topic.
I also have a consumer cluster that uses Kafka Streams to consume messages from the topic.
To test fault tolerance, I killed a node. All consumers then got stuck and kept printing the info below:
[read-1-producer] o.a.k.c.p.internals.TransactionManager : [Producer clientId=streams-app-3-0451a24c-7e5c-498c-98d4-d30a6f5ecfdb-StreamThread-1-producer, transactionalId=streams-app-3-0451a24c-7e5c-498c-98d4-d30a6f5ecfdb-1] Discovered transaction coordinator myhost:9092 (id: 3 rack: null)
What I have found so far is that it is related to the StreamsConfig.EXACTLY_ONCE_V2 configuration, because if I change it to StreamsConfig.AT_LEAST_ONCE the consumer works as expected.
To keep exactly-once (EOS) consumption, did I miss any configuration on the producer, cluster, or consumer side?
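For reference, a sketch I'm adding (not from the question; the application id, bootstrap servers, and file name are placeholders) of the same switch expressed as Streams configuration properties, which map to StreamsConfig.EXACTLY_ONCE_V2 and StreamsConfig.AT_LEAST_ONCE:
# sketch only: the properties you would load into the Streams application
cat > streams.properties <<'EOF'
application.id=streams-app-3
bootstrap.servers=myhost:9092
# exactly_once_v2 enables transactions (EOS); at_least_once is the default
processing.guarantee=exactly_once_v2
EOF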

Restart Kafka Connect sink and source connectors to read from the beginning

I have searched quite a lot on this, but there doesn't seem to be a good guide around it.
From what I have found, there are a few things to consider:
Resetting the sink connector's internal topics (status, config and offset).
Source connector offsets handling is implementation specific.
Question: is there even a need to reset these topics?
Deleting the consumer group.
Restarting the connector with a different name (this is also an option), but it doesn't seem to be the right thing to do.
Resetting the consumer group offsets with --reset-offsets --to-earliest.
Using the REST API (does it provide the functionality to reset and read from the beginning?).
What would be the best way to restart both a sink and a source connector to read from the beginning?
Source Connector:
Standalone mode: remove the offset file (/tmp/connect.offsets) or change the connector name.
Distributed mode: change the name of the connector.
Sink Connector (both modes), one of the following methods:
Change the name.
Reset the offsets for the consumer group. The name of the group is the same as the connector name.
To reset the offsets you first have to delete the connector, reset the offsets (./bin/kafka-consumer-groups.sh --bootstrap-server :9092 --group connectorName --reset-offsets --to-earliest --execute --topic topicName), and then add the same configuration once more.
You can check the following question: Reset the JDBC Kafka Connector to start pulling rows from the beginning of time?
The source connector in distributed mode has another option: producing a new message to the offset topic.
For example, I use the JDBC source connector.
Looking at the offset topic, I see the following:
./kafka-console-consumer.sh --zookeeper localhost:2181/kafka11-staging --topic kc-staging--offsets --from-beginning --property print.key=true
["referrer-family-jdbc-source",{"query":"query"}] {"incrementing":100}
Now, in order to reset this, I just produce another message with incrementing:0.
For example, producing from the shell with a key:
./kafka-console-producer.sh \
--broker-list `hostname`:9092 \
--topic kc-staging--offsets \
--property "parse.key=true" \
--property "key.separator=|"
["referrer-family-jdbc-source",{"query":"query"}]|{"incrementing":0}
Please note that you need to do the following (a rough end-to-end sketch follows this list):
Delete the connector.
Produce a message with the relevant offset as I described above.
Create the connector again.
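Here is a sketch of those three steps against the Connect REST API (my addition; it assumes a distributed worker on localhost:8083 and reuses the connector name, offsets topic, and broker shown above, all of which are placeholders for your setup):
CONNECT_URL=http://localhost:8083
NAME=referrer-family-jdbc-source
# save the current config, then delete the connector
curl -s "$CONNECT_URL/connectors/$NAME/config" > "$NAME.json"
curl -s -X DELETE "$CONNECT_URL/connectors/$NAME"
# produce the reset message to the Connect offsets topic (key|value)
echo '["referrer-family-jdbc-source",{"query":"query"}]|{"incrementing":0}' |
  ./kafka-console-producer.sh --broker-list localhost:9092 \
    --topic kc-staging--offsets \
    --property parse.key=true --property key.separator='|'
# recreate the connector from the saved config
curl -s -X PUT -H "Content-Type: application/json" \
  --data @"$NAME.json" "$CONNECT_URL/connectors/$NAME/config"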
A bit late, but I found another way: just set offset.storage.file.filename in standalone mode to /dev/null:
#worker.properties
offset.storage.file.filename=/dev/null
#cmdline
connect-standalone /data/config/worker.properties /data/config/connector.properties

MetricBeat - Kafka's consumergroup metricset doesn't send any data?

I have ZooKeeper and a single Kafka broker running, and I want to collect metrics with MetricBeat, index them with Elasticsearch, and display them with Kibana.
However, MetricBeat only gets data from the partition metricset; nothing comes from the consumergroup metricset.
Since the kafka module is defined as periodic in metricbeat.yml, it should send some data on its own rather than only waiting for user interaction (e.g. writing to a topic), right?
To make sure, I tried to create a consumer group and to write to and consume from a topic, but still no data was collected by the consumergroup metricset.
consumergroup is defined in both metricbeat.template.json and metricbeat.template-es2x.json.
metricbeat.full.yml is completely commented out; this is my kafka module definition in metricbeat.yml:
- module: kafka
  metricsets: ["partition", "consumergroup"]
  enabled: true
  period: 10s
  hosts: ["localhost:9092"]
  client_id: metricbeat1
  retries: 3
  backoff: 250ms
  topics: []
In the /logs directory of MetricBeat, lines like these show up:
INFO Non-zero metrics in the last 30s:
libbeat.es.published_and_acked_events=109
libbeat.es.publish.write_bytes=88050
libbeat.publisher.messages_in_worker_queues=109
libbeat.es.call_count.PublishEvents=5
fetches.kafka-partition.events=106
fetches.kafka-consumergroup.success=2
libbeat.publisher.published_events=109
libbeat.es.publish.read_bytes=2701
fetches.kafka-partition.success=2
fetches.zookeeper-mntr.events=3
fetches.zookeeper-mntr.success=3
For ZooKeeper's mntr and Kafka's partition I can see both events= and success= values, but for consumergroup there is only success. It looks like no events are fired.
The partition and mntr data are properly visible in Kibana, while consumergroup is missing.
The data stored in Elasticsearch is not human-readable; there are some internal strings used for directory names, and the logs do not contain any useful information.
Can anybody help me understand what is going on and fix it (probably MetricBeat) so it sends the data to Elasticsearch? Thanks :)
You need to have an active consumer consuming from the topics in order for the consumergroup metricset to generate events.
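For example (my own illustration; the topic and group names are made up), keeping a console consumer attached with an explicit group id gives the consumergroup metricset something to report during MetricBeat's collection period:
# run this in a separate shell while MetricBeat is collecting
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic test-topic --from-beginning \
  --consumer-property group.id=metricbeat-check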

Kafka console producer losing messages

I am using the Kafka console producer command below to pass the contents of a file to the producer:
sh ~/KAFKA_HOME/bin/kafka-console-producer.sh --broker-list xxx:9092,yyy:9092,zzz:9092 --topic HistLoad --new-producer < data.csv
The data.csv file has about 700,000 records, but I receive only about 699,800 messages at the consumer output.
I checked the offset counter for the consumer, and based on the offset values there are only about 699,800 messages in the queue.
Could you help me find out what is causing these lost messages? What do I need to check to find the root cause?
This is because the console producer uses acks=0 by default. Set --request-required-acks to 1 and it should be fine.
For your reference: https://issues.apache.org/jira/plugins/servlet/mobile#issue/KAFKA-3129
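For illustration (same brokers, topic, and file as in the question; not a guaranteed fix for every loss scenario), the command becomes:
sh ~/KAFKA_HOME/bin/kafka-console-producer.sh \
  --broker-list xxx:9092,yyy:9092,zzz:9092 \
  --topic HistLoad --new-producer \
  --request-required-acks 1 < data.csv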

Using Kafka to import data to Hadoop

First I was deciding what to use to get events into Hadoop, where they will be stored and periodically analyzed (possibly using Oozie to schedule the periodic analysis): Kafka or Flume. I decided that Kafka is probably the better solution, since we also have a component that does event processing, so both the batch and the event-processing components get data in the same way.
But now I'm looking for concrete suggestions on how to get data out of the broker and into Hadoop.
I found here that Flume can be used in combination with Kafka:
Flume - contains a Kafka Source (consumer) and Sink (producer)
And I also found, on the same page and in the Kafka documentation, that there is something called Camus:
Camus - LinkedIn's Kafka=>HDFS pipeline. This one is used for all data at LinkedIn, and works great.
I'm interested in which would be the better (and easier, better documented) solution to do that. Also, are there any examples or tutorials on how to do it?
When should I use these variants over the simpler high-level consumer?
I'm open to suggestions if there is another/better solution than these two.
Thanks
You can use Flume to dump data from Kafka to HDFS. Flume has a Kafka source and sink; it's a matter of changing a properties file. An example is given below.
Steps:
Create a Kafka topic
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic testkafka
Write to the topic created above using the Kafka console producer
kafka-console-producer --broker-list localhost:9092 --topic testkafka
Configure a Flume agent with the following properties:
flume1.sources = kafka-source-1
flume1.channels = hdfs-channel-1
flume1.sinks = hdfs-sink-1
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.zookeeperConnect = localhost:2181
flume1.sources.kafka-source-1.topic =testkafka
flume1.sources.kafka-source-1.batchSize = 100
flume1.sources.kafka-source-1.channels = hdfs-channel-1
flume1.channels.hdfs-channel-1.type = memory
flume1.sinks.hdfs-sink-1.channel = hdfs-channel-1
flume1.sinks.hdfs-sink-1.type = hdfs
flume1.sinks.hdfs-sink-1.hdfs.writeFormat = Text
flume1.sinks.hdfs-sink-1.hdfs.fileType = DataStream
flume1.sinks.hdfs-sink-1.hdfs.filePrefix = test-events
flume1.sinks.hdfs-sink-1.hdfs.useLocalTimeStamp = true
flume1.sinks.hdfs-sink-1.hdfs.path = /tmp/kafka/%{topic}/%y-%m-%d
flume1.sinks.hdfs-sink-1.hdfs.rollCount=100
flume1.sinks.hdfs-sink-1.hdfs.rollSize=0
flume1.channels.hdfs-channel-1.capacity = 10000
flume1.channels.hdfs-channel-1.transactionCapacity = 1000
Save the above config file as example.conf
Run the Flume agent:
flume-ng agent -n flume1 -c conf -f example.conf -Dflume.root.logger=INFO,console
Data will now be dumped to an HDFS location under the following path:
/tmp/kafka/%{topic}/%y-%m-%d
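As an optional sanity check (my addition; the path assumes the sink configuration above with topic testkafka and the test-events file prefix):
# list the dated directories and peek at the dumped events
hdfs dfs -ls /tmp/kafka/testkafka/
hdfs dfs -cat '/tmp/kafka/testkafka/*/test-events.*' | head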
Most of the time, I see people using Camus with Azkaban.
You can look at Mate1's GitHub repo for their implementation of Camus. It's not a tutorial, but I think it could help you:
https://github.com/mate1/camus
