Kafka Connect JDBC OOM - Large Amount of Data

I am trying to implement something similar to this tutorial. It worked there because the data set is very small. How would I do this for a larger table? I keep getting an out of memory error. My logs are:
ka.connect.runtime.rest.RestServer:60)
[2018-04-04 17:16:17,937] INFO [Worker clientId=connect-1, groupId=connect-cluster] Marking the coordinator ip-172-31-14-140.ec2.internal:9092 (id: 2147483647 rack: null) dead (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:341)
[2018-04-04 17:16:17,938] ERROR Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder:218)
java.lang.OutOfMemoryError: Java heap space
[2018-04-04 17:16:17,939] ERROR Uncaught exception in thread 'kafka-coordinator-heartbeat-thread | connect-sink-redshift': (org.apache.kafka.clients.consumer.internals.AbstractCoordinator$HeartbeatThread:51)
java.lang.OutOfMemoryError: Java heap space
[2018-04-04 17:16:17,940] INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect:65)
[2018-04-04 17:16:17,940] INFO Stopping REST server (org.apache.kafka.connect.runtime.rest.RestServer:154)
[2018-04-04 17:16:17,940] ERROR WorkerSinkTask{id=sink-redshift-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:172)
java.lang.OutOfMemoryError: Java heap space
[2018-04-04 17:16:17,940] ERROR WorkerSinkTask{id=sink-redshift-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:173)
[2018-04-04 17:16:17,940] INFO Stopping task (io.confluent.connect.jdbc.sink.JdbcSinkTask:96)
[2018-04-04 17:16:17,941] INFO WorkerSourceTask{id=production-db-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask:306)
[2018-04-04 17:16:17,940] ERROR Unexpected exception in Thread[KafkaBasedLog Work Thread - connect-statuses,5,main] (org.apache.kafka.connect.util.KafkaBasedLog:334)
java.lang.OutOfMemoryError: Java heap space
[2018-04-04 17:16:17,946] INFO WorkerSourceTask{id=production-db-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:323)
[2018-04-04 17:16:17,954] ERROR WorkerSourceTask{id=production-db-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:172)
java.lang.OutOfMemoryError: Java heap space
[2018-04-04 17:16:17,960] ERROR WorkerSourceTask{id=production-db-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:173)
[2018-04-04 17:16:17,960] INFO [Producer clientId=producer-4] Closing the Kafka producer with timeoutMillis = 30000 ms. (org.apache.kafka.clients.producer.KafkaProducer:341)
[2018-04-04 17:16:17,960] INFO Stopped ServerConnector@64f4bfe4{HTTP/1.1}{0.0.0.0:8083} (org.eclipse.jetty.server.ServerConnector:306)
[2018-04-04 17:16:17,967] INFO Stopped o.e.j.s.ServletContextHandler@2f06a90b{/,null,UNAVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler:865)
I have also tried increasing the memory with the suggestion here, but I am unable to load the entire table into memory. Is there a way to limit the amount of data produced?

For the JDBC connector, the most important property to set is probably this one, which seems to be what you are asking for:
batch.max.rows
Maximum number of rows to include in a single batch when polling for new data. This setting can be used to limit the amount of data
buffered internally in the connector.
There is no need to "buffer the entire table into memory". With smaller batches and more frequent polls and commits, you can ensure that almost all rows will be scanned. You also avoid the risk of a large batch failing, the connector stopping for a period of time, and then restarting and missing a few rows on the next poll.
Otherwise, make sure you aren't using bulk table mode, as it will try to scan the entire table again and again.
Also, the query option can be used to do a column projection on the table.
You can find more configuration options in the documentation. Still, any OOM error needs to be examined carefully on a case-by-case basis: enable JMX monitoring and export the metrics into an aggregate system such as Prometheus, where you can monitor them more closely. Otherwise you only see the OOM error without knowing whether changing a particular parameter is really helping.
Another option would be to use CDC-based connectors, as another blog post shows.
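As a rough illustration of the advice above, here is a minimal sketch of a JDBC source connector config that avoids bulk mode and caps the batch size. The connector name is taken from the task id in the logs; the connection URL, table name, column name, topic prefix, and all numeric values are hypothetical placeholders, not tested recommendations.

```properties
# Sketch only: URL, table, column, and values below are assumptions.
name=production-db
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:postgresql://db.example.internal:5432/mydb
table.whitelist=orders
topic.prefix=production-db-

# Read only new rows instead of rescanning the whole table (mode=bulk).
mode=incrementing
incrementing.column.name=id

# Limit how many rows are buffered per batch, and poll frequently.
batch.max.rows=500
poll.interval.ms=5000
```

If a column projection is needed, the query option can replace table.whitelist; the two are not meant to be combined.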

Related

messages duplicated during rebalancing after service recovery from Kafka SSLHandshakeException

Current setup - Our Spring Boot application consumes messages from a Kafka topic. We process one message at a time (we are not using streams). Below are the config properties and versions being used.
ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG - 30000
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG - earliest
ContainerProperties.AckMode - RECORD
Spring Boot version - 2.5.7
spring-kafka version - 2.7.8
kafka-clients version - 2.8.1
number of partitions - 6
consumer groups - 1
consumers - 2
Issue - When the Spring Boot application stays idle for a long time (idle time varying from 4 hrs to 3 days), we see org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
Exception error message - org.apache.kafka.common.errors.SslAuthenticationException: SSL handshake failed
Caused by: java.security.cert.CertificateException: No subject alternative DNS name matching kafka-2.broker.emh-dev.service.dev found.
2022-04-07 06:58:42.437 ERROR 24180 --- [ntainer#0-0-C-1] o.s.k.l.KafkaMessageListenerContainer : Authentication/Authorization Exception, retrying in 10000 ms
After the service recovers, we see message duplication with the same partition and offsets, which is inconsistent.
Below are the exceptions:
[Consumer clientId=XXXXXX, groupId=XXXXXX] Offset commit failed on partition XXXXXX at offset 354: The coordinator is not aware of this member
Seek to current after exception; nested exception is org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records

KAFKA SINK CONNECT: WARN Bulk request 167 failed. Retrying request

I have a data pipeline with an input topic, a Kafka Streams application, and an output topic connected to a sink connector for Elasticsearch.
At the beginning of this operation, the data ingestion is done satisfactorily, but when the process has been running for a longer time, Elasticsearch ingestion from connector starts to fail.
I have been checking all the Workers logs and I get the following message which I suspect may be the reason:
[2021-10-21 11:22:14,246] WARN Bulk request 168 failed. Retrying request. (io.confluent.connect.elasticsearch.ElasticsearchClient:335)
java.net.SocketTimeoutException: 3,000 milliseconds timeout on connection http-outgoing-643 [ACTIVE]
at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492)
at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213)
at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
at java.base/java.lang.Thread.run(Thread.java:829)
[2021-10-21 11:27:23,858] INFO [Consumer clientId=connector-consumer-ElasticsearchSinkConnector-topic01-0, groupId=connect-ElasticsearchSinkConnector-topic01] Member connector-consumer-ElasticsearchSinkConnector-topic01-0-41b68d34-0f00-4887-b54e-79561fffb5e5 sending LeaveGroup request to coordinator kafka1:9092 (id: 2147483646 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records. (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:1042)
I have tried changing the connector configuration, but I don't understand the root cause of this problem well enough to fix it.
Connector Configuration:
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
connection.password=xxxxx
topics=output_topic
value.converter.schemas.enable=false
connection.username=user-x
name=ElasticsearchSinkConnector-output_topic
connection.url=xxxxxxx
value.converter=org.apache.kafka.connect.json.JsonConverter
key.ignore=true
key.converter=org.apache.kafka.connect.storage.StringConverter
schema.ignore=true
Is it possible that the bulk request warning causes a loss of data?
Any help would be appreciated.
You can try adding:
"flush.timeout.ms": 30000
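One observation about the log above: the SocketTimeoutException reports a 3,000 millisecond timeout, which appears to match the connector's documented default for read.timeout.ms. A minimal sketch of the timeout and retry settings, written in the same properties form as the connector configuration in the question, might look like this. All values are illustrative assumptions, not tested recommendations.

```properties
# Sketch only: values are assumptions chosen for illustration.
# How long to wait for Elasticsearch to respond to a request.
read.timeout.ms=15000
# How long to wait when establishing a connection.
connection.timeout.ms=5000
# How long a flush may take before the task fails.
flush.timeout.ms=30000
# Retries for failed bulk requests, and the backoff between them.
max.retries=10
retry.backoff.ms=1000
```

Whether longer timeouts are the right fix depends on why Elasticsearch responds slowly, so it is worth checking the cluster's indexing load alongside any config change.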

UNKNOWN_PRODUCER_ID When using apache kafka streams (scala)

I am running 3 instances of a service that I wrote using:
Scala 2.11.12
kafkaStreams 1.1.0
kafkaStreamsScala 0.2.1 (by lightbend)
The service uses Kafka streams with the following topology (high level):
InputTopic
Parse to known Type
Clear messages that the parsing failed on
split every single message to 6 new messages
on each message run: map.groupByKey.reduce(with local store).toStream.to
Everything works as expected, but I can't get rid of a WARN message that keeps showing up:
15:46:00.065 [kafka-producer-network-thread | my_service_name-1ca232ff-5a9c-407c-a3a0-9f198c6d1fa4-StreamThread-1-0_0-producer] [WARN ] [o.a.k.c.p.i.Sender] - [Producer clientId=my_service_name-1ca232ff-5a9c-407c-a3a0-9f198c6d1fa4-StreamThread-1-0_0-producer, transactionalId=my_service_name-0_0] Got error produce response with correlation id 28 on topic-partition my_service_name-state_store_1-repartition-1, retrying (2 attempts left). Error: UNKNOWN_PRODUCER_ID
As you can see, I get those errors from the INTERNAL topics that Kafka Streams manages. It looks like some kind of retention period on the producer metadata in the internal topics, or some kind of producer id reset.
I couldn't find anything regarding this issue, only a description of the error itself from here:
ERROR: UNKNOWN_PRODUCER_ID
CODE: 59
RETRIABLE: False
DESCRIPTION: This exception is raised by the broker if it could not locate the producer metadata associated with the producerId in question. This could happen if, for instance, the producer's records were deleted because their retention time had elapsed. Once the last records of the producer id are removed, the producer's metadata is removed from the broker, and future appends by the producer will return this exception.
Hope you can help,
Thanks
Edit:
It seems that the WARN message does not pop up on version 1.0.1 of kafka streams.

Kafka elasticsearch connector - 'Flush timeout expired with unflushed records:'

I have a strange problem with the Kafka -> Elasticsearch connector. The first time I started it, everything worked: I received new data in Elasticsearch and checked it through the Kibana dashboard. But when I produced new data into Kafka using the same producer application and tried to start the connector one more time, I didn't get any new data in Elasticsearch.
Now I'm getting such errors:
[2018-02-04 21:38:04,987] ERROR WorkerSinkTask{id=log-platform-elastic-0} Commit of offsets threw an unexpected exception for sequence number 14: null (org.apache.kafka.connect.runtime.WorkerSinkTask:233)
org.apache.kafka.connect.errors.ConnectException: Flush timeout expired with unflushed records: 15805
I'm using the following command to run the connector:
/usr/bin/connect-standalone /etc/schema-registry/connect-avro-standalone.properties log-platform-elastic.properties
connect-avro-standalone.properties:
bootstrap.servers=kafka-0.kafka-hs:9093,kafka-1.kafka-hs:9093,kafka-2.kafka-hs:9093
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
# producer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor
# consumer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor
#rest.host.name=
rest.port=8084
#rest.advertised.host.name=
#rest.advertised.port=
plugin.path=/usr/share/java
and log-platform-elastic.properties:
name=log-platform-elastic
key.converter=org.apache.kafka.connect.storage.StringConverter
connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
tasks.max=1
topics=member_sync_log, order_history_sync_log # ... and many others
key.ignore=true
connection.url=http://elasticsearch:9200
type.name=log
I checked the connections to the Kafka brokers, Elasticsearch, and schema-registry (schema-registry and the connector are on the same host at the moment) and all is fine. The Kafka brokers are running on port 9093, and I'm able to read data from the topics using kafka-avro-console-consumer.
I'll be grateful for any help on this!
Just set flush.timeout.ms to a value larger than 10000 (10 seconds, which is the default).
According to the documentation:
flush.timeout.ms
The timeout in milliseconds to use for periodic
flushing, and when waiting for buffer space to be made available by
completed requests as records are added. If this timeout is exceeded
the task will fail.
Type: long Default: 10000 Importance: low
See documentation
We can tune the Elasticsearch connector configuration to solve this issue. Please refer to the link below for the configuration parameters:
https://docs.confluent.io/current/connect/kafka-connect-elasticsearch/configuration_options.html
Below are the key parameters that control the message flow rate and can eventually help solve the issue:
flush.timeout.ms: Increasing it may give the flush more breathing room
The timeout in milliseconds to use for periodic flushing, and when
waiting for buffer space to be made available by completed requests as
records are added. If this timeout is exceeded the task will fail.
max.buffered.records: Try reducing the buffered record limit
The maximum number of records each task will buffer before blocking
acceptance of more records. This config can be used to limit the
memory usage for each task
batch.size: Try reducing the batch size
The number of records to process as a batch when writing to
Elasticsearch
tasks.max: The number of parallel tasks (consumer instances); reduce or increase it. This controls the load on Elasticsearch: if it cannot keep up with the incoming traffic, reducing the number of tasks may help.
Tuning the above parameters resolved my issue.
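Put together, the parameters above could be added to a connector properties file such as log-platform-elastic.properties roughly as follows. This is a minimal sketch: the values are illustrative assumptions to start experimenting from, not settings confirmed by the answer.

```properties
# Sketch only: values are assumptions, tune them against your own cluster.
# Allow more time for a flush before the task fails.
flush.timeout.ms=60000
# Buffer fewer records per task to lower memory use and flush size.
max.buffered.records=5000
# Send smaller bulk requests to Elasticsearch.
batch.size=500
# Keep parallelism low if Elasticsearch cannot keep up.
tasks.max=1
```

Changing one parameter at a time makes it easier to see which setting actually affects the flush timeout.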

[HDFS connector + Kafka] How to write multiple topics in standalone mode?

I am using Confluent's HDFS Connector to write streamed data to HDFS. I followed the user manual and quick start and set up my connector.
It works properly when I consume only one topic.
My property file looks like this:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=test_topic1
hdfs.url=hdfs://localhost:9000
flush.size=30
When I add more than one topic, I see it continuously committing offsets, but I do not see it writing the committed messages.
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=2
topics=test_topic1,test_topic2
hdfs.url=hdfs://localhost:9000
flush.size=30
I tried tasks.max set to both 1 and 2.
I continuously get "Committing offsets" logged, as below:
[2016-10-26 15:21:30,990] INFO Started recovery for topic partition test_topic1-0 (io.confluent.connect.hdfs.TopicPartitionWriter:193)
[2016-10-26 15:21:31,222] INFO Finished recovery for topic partition test_topic1-0 (io.confluent.connect.hdfs.TopicPartitionWriter:208)
[2016-10-26 15:21:31,230] INFO Started recovery for topic partition test_topic2-0 (io.confluent.connect.hdfs.TopicPartitionWriter:193)
[2016-10-26 15:21:31,236] INFO Finished recovery for topic partition test_topic2-0 (io.confluent.connect.hdfs.TopicPartitionWriter:208)
[2016-10-26 15:21:35,155] INFO Reflections took 6962 ms to scan 249 urls, producing 11712 keys and 77746 values (org.reflections.Reflections:229)
[2016-10-26 15:22:29,226] INFO WorkerSinkTask{id=hdfs-sink-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSinkTask:261)
[2016-10-26 15:23:29,227] INFO WorkerSinkTask{id=hdfs-sink-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSinkTask:261)
[2016-10-26 15:24:29,225] INFO WorkerSinkTask{id=hdfs-sink-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSinkTask:261)
[2016-10-26 15:25:29,224] INFO WorkerSinkTask{id=hdfs-sink-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSinkTask:261)
When I gracefully stop the service (Ctrl+C), I see it removing the tmp files.
What am I doing wrong? What is the proper way to do it?
I'd appreciate any suggestions on this.
I've kept stumbling over the same problem you've mentioned here for the past month or so and I couldn't get to the bottom of it, until today when I've upgraded to confluent 3.1.1 and stuff started working as expected...
This is how I roll
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=5
topics=accounts,contacts,users
hdfs.url=hdfs://localhost:9000
flush.size=1
hive.metastore.uris=thrift://localhost:9083
hive.integration=true
schema.compatibility=BACKWARD
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
partitioner.class=io.confluent.connect.hdfs.partitioner.HourlyPartitioner
locale=en-us
timezone=UTC

Resources