Kafka console producer losing messages - hadoop

I am using the below Kafka console producer command to pass the contents of a file into Kafka:
sh ~/KAFKA_HOME/bin/kafka-console-producer.sh --broker-list xxx:9092,yyy:9092,zzz:9092 --topic HistLoad --new-producer < data.csv
The data.csv file has about 700,000 records, but I receive only about 699,800 messages at the consumer output.
I checked the offset counter for the consumer, and based on the offset values there are only about 699,800 messages in the queue.
Could you help me find out what is causing these messages to be lost? What do I need to check to get to the root cause?

This is because the console producer has acks=0 by default. Set --request-required-acks to 1 and it should be fine.
For your reference: https://issues.apache.org/jira/browse/KAFKA-3129
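A minimal sketch of the same invocation with acks enabled, assuming the flag is available in your Kafka version (broker list, topic and file copied from the question):
sh ~/KAFKA_HOME/bin/kafka-console-producer.sh --broker-list xxx:9092,yyy:9092,zzz:9092 --topic HistLoad --new-producer --request-required-acks 1 < data.csv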

Related

Spring Kafka - increase in concurrency produces more duplicates

Topic A was created with 12 partitions, and a Spring Kafka consumer was started with a concurrency of 10 as a Spring Boot application. When putting 1,000 messages into the topic there are no duplicate issues; all 1,000 messages get consumed. But when pushing more load, 10K messages at 100 TPS, the consumer end receives the 10K messages plus about 8.5K duplicates with a concurrency of 10. With a concurrency of 1 it works fine (no duplicates found).
enable.auto.commit is false and we do a manual ack after processing each message.
Processing time for one message is 300 milliseconds.
Consumer rebalances are occurring, and because of them the duplicates are produced.
How do we overcome this situation when handling a higher message volume in Kafka?
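Not part of the original post, but one way to confirm that rebalances are actually happening under load is to watch the group with the stock consumer-group tool while the test runs; the group id and broker address below are placeholders:
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-spring-consumer-group
If the member list keeps changing between runs of this command, the concurrency-10 container is being rebalanced, which matches the duplicate counts described above.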

Reset Spring Boot Kafka Stream Application on modifying topics

I'm using spring-kafka to run Kafka Streams in a Spring Boot application using StreamsBuilderFactoryBean. I changed the number of partitions in some of the topics from 100 to 20 by deleting and recreating them, but now on running the application, I get the following error:
Existing internal topic MyAppId-KSTREAM-AGGREGATE-STATE-STORE-0000000092-changelog has invalid partitions: expected: 20; actual: 100. Use 'kafka.tools.StreamsResetter' tool to clean up invalid topics before processing.
I couldn't access the class kafka.tools.StreamsResetter, and tried calling StreamsBuilderFactoryBean.getKafkaStreams().cleanUp(), but it gave a NullPointerException. How do I do the said cleanup?
The relevant documentation is here.
Step 1: Local Cleanup
For Spring Boot with StreamsBuilderFactoryBean, the first step can be done by simply adding a CleanupConfig to the constructor:
// Before
new StreamsBuilderFactoryBean(new KafkaStreamsConfiguration(config));
// After
new StreamsBuilderFactoryBean(new KafkaStreamsConfiguration(config), new CleanupConfig(true, true));
This enables calling the KafkaStreams.cleanUp() method both before start() and after stop().
Step 2: Global Cleanup
For step two, with all instances of the application stopped, simply use the tool as explained in the documentation:
# In kafka directory
bin/kafka-streams-application-reset.sh --application-id "MyAppId" --bootstrap-servers 1.2.3.4:9092 --input-topics x --intermediate-topics first_x,second_x,third_x --zookeeper 1.2.3.4:2181
What this does:
For any specified input topics: Reset the application’s committed consumer offsets to "beginning of the topic" for all partitions (for consumer group application.id).
For any specified intermediate topics: Skip to the end of the topic, i.e. set the application’s committed consumer offsets for all partitions to each partition’s logSize (for consumer group application.id).
For any internal topics: Delete the internal topic (this will also delete the corresponding committed offsets).
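Not from the original answer, but after running the reset tool you can confirm the offsets were actually reset by describing the application's consumer group (the application.id doubles as the group id; the broker address is the same placeholder as above):
bin/kafka-consumer-groups.sh --bootstrap-server 1.2.3.4:9092 --describe --group MyAppId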

Restart kafka connect sink and source connectors to read from beginning

I have searched quite a lot on this, but there doesn't seem to be a good guide around it.
From what I have searched there are a few things to consider:
Resetting Sink Connector internal topics (status, config and offset).
How source connectors store their offsets is implementation specific.
Question: Is there even a need to reset these topics?
Deleting the consumer group.
Restarting the connector with a different name (this is also an option), but it doesn't seem to be the right thing to do.
Resetting the consumer group with --reset-offsets --to-earliest.
Using the REST API (does it provide the functionality to reset and read from the beginning?).
What would be the best way to restart both a sink and a source connector to read from the beginning?
Source Connector:
Standalone mode: remove the offset file (/tmp/connect.offsets) or change the connector name.
Distributed mode: change the name of the connector.
Sink connector (both modes), one of the following methods:
Change the name.
Reset the offsets for the consumer group. The name of the group is the same as the connector name.
To reset the offsets you have to first delete the connector, reset the offsets of its consumer group, and then add the same configuration one more time:
./bin/kafka-consumer-groups.sh --bootstrap-server :9092 --group connectorName --reset-offsets --to-earliest --execute --topic topicName
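As an aside (not in the original answer), the same tool can preview the reset before applying it by replacing --execute with --dry-run:
./bin/kafka-consumer-groups.sh --bootstrap-server :9092 --group connectorName --reset-offsets --to-earliest --dry-run --topic topicName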
You can check the following question: Reset the JDBC Kafka Connector to start pulling rows from the beginning of time?
A source connector in distributed mode has another option, which is producing a new message to the offset topic.
For example, I use the JDBC source connector:
When looking at the offset topic I see the following:
./kafka-console-consumer.sh --zookeeper localhost:2181/kafka11-staging --topic kc-staging--offsets --from-beginning --property print.key=true
["referrer-family-jdbc-source",{"query":"query"}] {"incrementing":100}
Now, in order to reset this, I just produce another message with incrementing set to 0.
For example (how to produce from the shell with a key is described here):
./kafka-console-producer.sh \
--broker-list `hostname`:9092 \
--topic kc-staging--offsets \
--property "parse.key=true" \
--property "key.separator=|"
["referrer-family-jdbc-source",{"query":"query"}]|{"incrementing":0}
Please note that you need to do the following:
Delete the connector.
Produce a message with the relevant offset as I described above.
Create the connector again.
A bit late, but I found another way. Just set offset.storage.file.filename in standalone mode to /dev/null:
#worker.properties
offset.storage.file.filename=/dev/null
#cmdline
connect-standalone /data/config/worker.properties /data/config/connector.properties

UNKNOWN_PRODUCER_ID when using Apache Kafka Streams (Scala)

I am running 3 instances of a service that I wrote using:
Scala 2.11.12
kafkaStreams 1.1.0
kafkaStreamsScala 0.2.1 (by Lightbend)
The service uses Kafka streams with the following topology (high level):
InputTopic
Parse to a known type
Drop messages that failed to parse
Split every single message into 6 new messages
On each message run: map.groupByKey.reduce (with a local store).toStream.to
Everything works as expected, but I can't get rid of a WARN message that keeps showing up:
15:46:00.065 [kafka-producer-network-thread | my_service_name-1ca232ff-5a9c-407c-a3a0-9f198c6d1fa4-StreamThread-1-0_0-producer] [WARN ] [o.a.k.c.p.i.Sender] - [Producer clientId=my_service_name-1ca232ff-5a9c-407c-a3a0-9f198c6d1fa4-StreamThread-1-0_0-producer, transactionalId=my_service_name-0_0] Got error produce response with correlation id 28 on topic-partition my_service_name-state_store_1-repartition-1, retrying (2 attempts left). Error: UNKNOWN_PRODUCER_ID
As you can see, I get those errors from the internal topics that Kafka Streams manages. It seems like some kind of retention period on the producer metadata in the internal topics, or some kind of producer id reset.
Couldn't find anything regarding this issue, only a description of the error itself from here:
UNKNOWN_PRODUCER_ID (error code 59, not retriable): This exception is raised by the broker if it could not locate the producer metadata associated with the producerId in question. This could happen if, for instance, the producer's records were deleted because their retention time had elapsed. Once the last records of the producer id are removed, the producer's metadata is removed from the broker, and future appends by the producer will return this exception.
Hope you can help,
Thanks
Edit:
It seems that the WARN message does not pop up on version 1.0.1 of Kafka Streams.
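Not part of the original post, but since the error description above points at retention of the producer's records, the retention overrides that Streams sets on the repartition topic can be inspected with kafka-configs.sh (the ZooKeeper address is a placeholder, the topic name is taken from the WARN line; newer tool versions use --bootstrap-server instead):
bin/kafka-configs.sh --zookeeper localhost:2181 --entity-type topics --entity-name my_service_name-state_store_1-repartition --describe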

How to set consumer to start from a specific offset in Golang Kafka 10

My need is to make the consumer start from the last message it processed before it crashed. Fortunately I am in the case of having only one topic, with one partition and one consumer.
To do so I tried https://github.com/Shopify/sarama, but this doesn't seem to be available there yet.
I am now using https://godoc.org/github.com/bsm/sarama-cluster, which allows me to commit every message offset.
I cannot retrieve the last committed offset.
I cannot figure out how to make a sarama consumer start from said offset. The only parameter I've found so far is Config.Consumer.Offsets.Initial.
How to retrieve the last committed offset?
How to make the consumer start from the last message whose offset has been committed? OffsetNewest will make it start from the last message produced, not the last one processed by the consumer.
Is it possible to do so using only Shopify/sarama and not bsm/sarama-cluster?
Thanks in advance.
P.S. I am using Kafka 0.10, so the offsets are stored in Kafka and not in ZooKeeper.
EDIT1:
Partial solution: fetch all the messages since sarama.OffsetOldest and skip them until we find one that has not been processed yet.
If an offset was already saved for a partition, sarama-cluster will resume consumption from that offset. The Config.Consumer.Offsets.Initial option is used only if no saved offset is present (first run for a consumer group).
You can verify this by adding the following line at the beginning of your main() function:
sarama.Logger = log.New(os.Stdout, "sarama: ", log.LstdFlags)
Then you'll see something like the following in the output:
cluster/consumer CID-17db1be4-a162-411c-a106-4d198191176a consume sample/0 from 12
The 12 in that line is the offset Sarama is going to start from for that partition (sample/0).
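Not from the original answer, but the committed offset for the group can also be checked from the broker side with the stock tooling (group name and broker address are placeholders; on 0.10 the --new-consumer flag selects the Kafka-stored offsets):
bin/kafka-consumer-groups.sh --new-consumer --bootstrap-server localhost:9092 --describe --group my-consumer-group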
