Using the Elasticsearch sink connector to feed data into Elasticsearch: constant timeouts, and eventually the tasks need a manual restart

We're busy with a PoC in which we produce messages to a Kafka topic (currently about 2 million messages; in the end it should be around 130 million) that we want to query via Elasticsearch. So a small PoC has been made that feeds the data into ES via the Confluent Elasticsearch Sink Connector (latest, version 6.0.0). However, we ran into a lot of timeout issues, and eventually the tasks fail with a message saying the task needs to be restarted:
ERROR WorkerSinkTask{id=transactions-elasticsearch-connector-3} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted. Error: java.net.SocketTimeoutException: Read timed out (org.apache.kafka.connect.runtime.WorkerSinkTask)
My configuration for the sink connector is the following:
{
  "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
  "connection.url": "http://elasticsearch:9200",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://schema-registry:8081",
  "topics": "transactions,trades",
  "type.name": "transactions",
  "tasks.max": "4",
  "batch.size": "50",
  "max.buffered.events": "500",
  "max.buffered.records": "500",
  "flush.timeout.ms": "100000",
  "linger.ms": "50",
  "max.retries": "10",
  "connection.timeout.ms": "2000",
  "name": "transactions-elasticsearch-connector",
  "key.ignore": "true",
  "schema.ignore": "false",
  "transforms": "ExtractTimestamp",
  "transforms.ExtractTimestamp.type": "org.apache.kafka.connect.transforms.InsertField$Value",
  "transforms.ExtractTimestamp.timestamp.field": "MSG_TS"
}
Unfortunately, even when no messages are being produced and the Elasticsearch sink connector is started manually, the tasks die and need to be restarted again. I've fiddled around with various batch sizes, retries, etc., but to no avail. Note that we only have one Kafka broker, one Elasticsearch connector, and one Elasticsearch instance, all running in Docker containers.
We also see a lot of these timeout messages:
[2020-12-08 13:23:34,107] WARN Failed to execute batch 100534 of 50 records with attempt 1/11, will attempt retry after 43 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:34,116] WARN Failed to execute batch 100536 of 50 records with attempt 1/11, will attempt retry after 18 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:34,132] WARN Failed to execute batch 100537 of 50 records with attempt 1/11, will attempt retry after 24 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:36,746] WARN Failed to execute batch 100539 of 50 records with attempt 1/11, will attempt retry after 0 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:37,139] WARN Failed to execute batch 100536 of 50 records with attempt 2/11, will attempt retry after 184 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:37,155] WARN Failed to execute batch 100534 of 50 records with attempt 2/11, will attempt retry after 70 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:37,160] WARN Failed to execute batch 100537 of 50 records with attempt 2/11, will attempt retry after 157 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:39,681] WARN Failed to execute batch 100540 of 50 records with attempt 1/11, will attempt retry after 12 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:39,750] WARN Failed to execute batch 100539 of 50 records with attempt 2/11, will attempt retry after 90 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:40,231] WARN Failed to execute batch 100534 of 50 records with attempt 3/11, will attempt retry after 204 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:40,322] WARN Failed to execute batch 100537 of 50 records with attempt 3/11, will attempt retry after 58 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
Any idea what we can improve to make the whole chain reliable? For our purposes it does not need to be blazingly fast, as long as all messages reliably end up in Elasticsearch without us having to restart the connector's tasks every time.
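One direction that might help, sketched below: the repeated "Read timed out" warnings point at the connector's HTTP read timeout (read.timeout.ms, which defaults to a few seconds) expiring while the single Elasticsearch node is busy with bulk requests, so raising the timeouts and retry backoff, and lowering parallelism against a one-node cluster, seems worth trying. The overrides below are illustrative assumptions, not tested settings; verify the keys against your connector version:

{
  "tasks.max": "1",
  "batch.size": "50",
  "connection.timeout.ms": "10000",
  "read.timeout.ms": "60000",
  "max.retries": "10",
  "retry.backoff.ms": "1000"
}

With four tasks bulk-indexing into one Elasticsearch instance, slow bulk responses are expected under load; fewer tasks plus a generous read timeout trades throughput for the reliability asked for here.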

Related

ClusterManager: Error managing cluster: Failed to obtain DB connection from data source 'springNonTxDataSource.xxxxxx'

I am trying to find a solution. Maybe some of the queries are taking too long to execute?
Thanks
Logs:
Oct 21 16:05:48 wm-prod-02 web.stdout.log 2022-10-21 14:05:47.424 ERROR 26376 --- [_ClusterManager] o.s.s.quartz.LocalDataSourceJobStore : ClusterManager: Error managing cluster: Failed to obtain DB connection from data source 'springNonTxDataSource.WmSmartwatcherScheduler': java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
Oct 21 16:05:48 wm-prod-02 web.stdout.log org.quartz.JobPersistenceException: Failed to obtain DB connection from data source 'springNonTxDataSource.WmSmartwatcherScheduler': java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
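For context, this Hikari error means every connection in the pool stayed busy for the full 30 s connectionTimeout, which usually comes from slow queries or leaked connections rather than from Quartz itself. A hedged sketch of pool tuning using the standard Spring Boot Hikari property keys (the values are illustrative assumptions, not taken from this question):

# application.properties -- illustrative values
spring.datasource.hikari.maximum-pool-size=20
spring.datasource.hikari.connection-timeout=60000
# log a warning for any connection held longer than 60 s, to surface leaks
spring.datasource.hikari.leak-detection-threshold=60000

Leak detection is often the more useful knob here: a bigger pool only hides a leak for longer.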

Cannot access the NiFi UI?

I could always access the UI until today!
The logs are shown below. Why would errors from another server make the NiFi UI inaccessible?!
Error 1:
2020-04-25 14:59:53,546 ERROR [Timer-Driven Process Thread-41] o.apache.nifi.processors.standard.PutSQL PutSQL[id=c1451287-2ee5-397e-9842-5e477212e954] Failed to update database due to a failed batch update, java.sql.BatchUpdateException: ORA-02291: integrity constraint (ISA.PHY_ASSET_CLASS_X_PHY_ASSET_MAPPING_FK1) violated - parent key not found
. There were a total of 1 FlowFiles that failed, 0 that succeeded, and 0 that were not execute and will be routed to retry; : java.sql.BatchUpdateException: ORA-02291: integrity constraint (ISA.PHY_ASSET_CLASS_X_PHY_ASSET_MAPPING_FK1) violated - parent key not found
java.sql.BatchUpdateException: ORA-02291: integrity constraint (ISA.PHY_ASSET_CLASS_X_PHY_ASSET_MAPPING_FK1) violated - parent key not found
Error 2:
IOError: java.nio.channels.ClosedByInterruptException
org.python.core.PyException: null
Error 3:
2020-04-25 14:59:58,537 ERROR [Timer-Driven Process Thread-94] o.apache.nifi.processors.standard.PutSQL PutSQL[id=d9733768-de43-17b7-0000-00007453720b] Failed to process session due to org.apache.nifi.processor.exception.ProcessException: java.sql.SQLException: Cannot get a connection, pool error Timeout waiting for idle object: org.apache.nifi.processor.exception.ProcessException: java.sql.SQLException: Cannot get a connection, pool error Timeout waiting for idle object
org.apache.nifi.processor.exception.ProcessException: java.sql.SQLException: Cannot get a connection, pool error Timeout waiting for idle object
Error 4:
ERROR [MasterListener-mymaster-[10.148.xxx.xx:26379]] redis.clients.jedis.JedisSentinelPool Lost connection to Sentinel at 10.xxx.xxx.xx:xxxx. Sleeping 5000ms and retrying.
redis.clients.jedis.exceptions.JedisConnectionException: java.net.ConnectException: Connection refused (Connection refused)
Maybe there are many other errors...
Any help is appreciated!

Wrong Atomikos state ABORTING after timeout in transaction

I am testing how Atomikos handles a timeout in a transaction. I have a 30 s timeout on an infinite read from the DB. After 30 s I get this exception:
14:08:40.329 [Atomikos:4] WARN c.a.icatch.imp.ActiveStateHandler - Transaction rob-app-b1c7b95a3b0efb82dfb516b04620a213154159609015500001 has timed out - rolling back...
2018-11-07 14:08:40.354 [pool-3-thread-1] WARN c.a.jdbc.JdbcConnectionProxyHelper - Error enlisting in transaction - connection might be broken? Please check the logs for more information...
java.lang.IllegalStateException: wrong state: ABORTING
at com.atomikos.icatch.imp.CoordinatorImp.registerSynchronization(CoordinatorImp.java:420)
at com.atomikos.icatch.imp.TransactionStateHandler.registerSynchronization(TransactionStateHandler.java:129)
at com.atomikos.icatch.imp.CompositeTransactionImp.registerSynchronization(CompositeTransactionImp.java:177)
at com.atomikos.jdbc.AtomikosConnectionProxy.enlist(AtomikosConnectionProxy.java:211)
at com.atomikos.jdbc.AtomikosConnectionProxy.invoke(AtomikosConnectionProxy.java:122)
at com.sun.proxy.$Proxy133.prepareStatement(Unknown Source)
Why do I get this error, and not the Atomikos exception from the enlist method in AtomikosConnectionProxy?
AtomikosSQLException.throwAtomikosSQLException("The transaction has timed out - try increasing the timeout if needed");
It seems the transaction timed out because your application held it for too long. Try increasing the transaction timeout.
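A hedged sketch of what increasing the timeout could look like, using the standard Atomikos jta.properties keys (the values are illustrative assumptions, not taken from this question); note that a per-transaction timeout can never exceed max_timeout, so both may need raising:

# jta.properties -- illustrative values
# default timeout applied to transactions that don't set one, in ms (here 5 minutes)
com.atomikos.icatch.default_jta_timeout=300000
# hard upper bound on any requested transaction timeout, in ms
com.atomikos.icatch.max_timeout=600000

Programmatically, the standard JTA call UserTransaction.setTransactionTimeout(seconds) before begin() achieves the same for a single transaction.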

Error while connecting Oozie server: connection timed out

I am trying to run a Pig program via Oozie from the command prompt, but I am getting errors like:
Connection exception has occurred [ java.net.ConnectException Connection timed out ]. Trying after 1 sec. Retry count = 1
Connection exception has occurred [ java.net.ConnectException Connection timed out ]. Trying after 2 sec. Retry count = 2
Connection exception has occurred [ java.net.ConnectException Connection timed out ]. Trying after 4 sec. Retry count = 3
Connection exception has occurred [ java.net.ConnectException Connection timed out ]. Trying after 8 sec. Retry count = 4
Error: IO_ERROR : java.io.IOException: Error while connecting Oozie server. No of retries = 4. Exception = Connection timed out
and I am running this command:
oozie job -oozie http://localhost:11000/oozie -config job.properties -run
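A java.net.ConnectException on localhost:11000 usually means nothing is listening there at all (the Oozie server is down, or bound to a different host/port), rather than a slow server. As a first check, the standard Oozie CLI status command and REST endpoint (assumed here, not taken from the question) should respond if the server is up:

# verify the Oozie server is up and reachable
oozie admin -oozie http://localhost:11000/oozie -status
# or query the REST endpoint directly
curl http://localhost:11000/oozie/v1/admin/status

If these time out as well, check oozie-site.xml for the configured base URL and confirm the server process is actually running on this machine.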

HBase heavy write exception

This exception was raised in HBase when there is heavy writing to the cluster:
WARN org.apache.hadoop.ipc.HBaseServer: IPC Server listener on 60020: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
at sun.nio.ch.IOUtil.read(IOUtil.java:171)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:245)
at org.apache.hadoop.hbase.ipc.HBaseServer.channelRead(HBaseServer.java:1676)
at org.apache.hadoop.hbase.ipc.HBaseServer$Connection.readAndProcess(HBaseServer.java:1120)
at org.apache.hadoop.hbase.ipc.HBaseServer$Listener.doRead(HBaseServer.java:703)
at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.doRunLoop(HBaseServer.java:495)
at org.apache.hadoop.hbase.ipc.HBaseServer$Listener$Reader.run(HBaseServer.java:470)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
And these warnings were raised:
WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":761893,"call":"multi(org.apache.hadoop.hbase.client.MultiAction#5bf92021), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"172.16.0.121:55803","starttimems":1378784998180,"queuetimems":0,"class":"HRegionServer","responsesize":0,"method":"multi"}
WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction#5bf92021), rpc version=1, client version=29, methodsFingerPrint=54742778 from 172.16.0.121:55803: output error
WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 39 on 60020 caught: java.nio.channels.ClosedChannelException
at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:135)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:326)
at org.apache.hadoop.hbase.ipc.HBaseServer.channelIO(HBaseServer.java:1710)
at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1653)
at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:924)
at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:1003)
at org.apache.hadoop.hbase.ipc.HBaseServer$Call.sendResponseIfReady(HBaseServer.java:409)
at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1346)
This is caused not by a heavy write but by a big write. "processingtimems":761893 means the write operation did not finish within 761 seconds, and the client times out before the action completes. Try reducing the number of items per multi operation.
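A minimal sketch of one way to cap the multi size on the client, using the old (0.9x-era) HBase client API that matches the stack trace above; ChunkedWriter, writeChunked, and CHUNK_SIZE are hypothetical names and the chunk size is an illustrative assumption:

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;

public class ChunkedWriter {
    // Illustrative chunk size; tune to your row sizes and region server capacity.
    private static final int CHUNK_SIZE = 500;

    public static void writeChunked(HTableInterface table, List<Put> puts)
            throws IOException {
        for (int i = 0; i < puts.size(); i += CHUNK_SIZE) {
            int end = Math.min(i + CHUNK_SIZE, puts.size());
            // Each put(...) becomes its own, smaller multi RPC, so no single
            // server-side call runs for hundreds of seconds.
            table.put(puts.subList(i, end));
        }
        table.flushCommits();
    }
}

Smaller batches keep each RPC well under the client timeout at the cost of more round trips, which fits a reliability-over-throughput goal.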
