Horizon: context deadline exceeded - stellar

I ran stellar-horizon with captive-core, following the configuration from the official documentation. After catching up with the history and applying all the checkpoints, when ingesting live data Horizon tries to get info from http://localhost:11626/info and the request times out. (A small probe of that endpoint is sketched after the logs below.)
Versions
horizon: 2.18.0
stellar-core: 19.1.0
go: 1.17.9
Horizon logs output
INFO[2022-07-05T15:17:16.644+01:00] Ledger: Got consensus: [seq=41625941, prev=a0d052, txs=430, ops=953, sv: [ SIGNED#lobstr_2_europe txH: f81623, ct: 1657030634, upgrades: [ ] ]] pid=306990 service=ingest subservice=stellar-core
INFO[2022-07-05T15:17:16.644+01:00] Tx: applying ledger 41625941 (txs:430, ops:953, base_fee:100) pid=306990 service=ingest subservice=stellar-core
INFO[2022-07-05T15:17:16.664+01:00] waiting for ingestion system catchup pid=306990 service=ingest status="{false false 0 41457386 41457386}"
INFO[2022-07-05T15:17:18.664+01:00] waiting for ingestion system catchup pid=306990 service=ingest status="{false false 0 41457386 41457386}"
INFO[2022-07-05T15:17:20.664+01:00] waiting for ingestion system catchup pid=306990 service=ingest status="{false false 0 41457386 41457386}"
ERRO[2022-07-05T15:17:21.645+01:00] failed to load the stellar-core info err="http request errored: Get \"http://localhost:11626/info\": context deadline exceeded" pid=306990 stack="[main.go:43 client.go:67 app.go:230 app.go:442 asm_amd64.s:1581]"
WARN[2022-07-05T15:17:21.645+01:00] could not load stellar-core info: http request errored: Get "http://localhost:11626/info": context deadline exceeded pid=306990
WARN[2022-07-05T15:17:21.646+01:00] error ticking app: context deadline exceeded pid=306990
INFO[2022-07-05T15:17:22.663+01:00] waiting for ingestion system catchup pid=306990 service=ingest status="{false false 0 41457386 41457386}"
INFO[2022-07-05T15:17:23.340+01:00] Processing ledger entry changes pid=306990 processed_entries=600000 progress="2.20%" sequence=41625919 service=ingest source=historyArchive
INFO[2022-07-05T15:17:24.663+01:00] waiting for ingestion system catchup pid=306990 service=ingest status="{false false 0 41457386 41457386}"
INFO[2022-07-05T15:17:25.174+01:00] Processing ledger entry changes pid=306990 processed_entries=650000 progress="2.42%" sequence=41625919 service=ingest source=historyArchive
ERRO[2022-07-05T15:17:26.646+01:00] failed to load the stellar-core info err="http request errored: Get \"http://localhost:11626/info\": context deadline exceeded" pid=306990 stack="[main.go:43 client.go:67 app.go:230 app.go:442 asm_amd64.s:1581]"
WARN[2022-07-05T15:17:26.646+01:00] could not load stellar-core info: http request errored: Get "http://localhost:11626/info": context deadline exceeded pid=306990
WARN[2022-07-05T15:17:26.647+01:00] error ticking app: context deadline exceeded pid=306990
INFO[2022-07-05T15:17:26.663+01:00] waiting for ingestion system catchup pid=306990 service=ingest status="{false false 0 41457386 41457386}"
INFO[2022-07-05T15:17:28.664+01:00] waiting for ingestion system catchup pid=306990 service=ingest status="{false false 0 41457386 41457386}"
INFO[2022-07-05T15:17:30.663+01:00] waiting for ingestion system catchup pid=306990 service=ingest status="{false false 0 41457386 41457386}"
INFO[2022-07-05T15:17:31.543+01:00] Processing ledger entry changes pid=306990 processed_entries=700000 progress="2.65%" sequence=41625919 service=ingest source=historyArchive
WARN[2022-07-05T15:17:31.647+01:00] could not load stellar-core info: http request errored: Get "http://localhost:11626/info": context deadline exceeded pid=306990
ERRO[2022-07-05T15:17:31.647+01:00] failed to load the stellar-core info err="http request errored: Get \"http://localhost:11626/info\": context deadline exceeded" pid=306990 stack="[main.go:43 client.go:67 app.go:230 app.go:442 asm_amd64.s:1581]"
WARN[2022-07-05T15:17:31.648+01:00] error ticking app: context deadline exceeded pid=306990
INFO[2022-07-05T15:17:32.663+01:00] waiting for ingestion system catchup pid=306990 service=ingest status="{false false 0 41457386 41457386}"
INFO[2022-07-05T15:17:33.367+01:00] Processing ledger entry changes pid=306990 processed_entries=750000 progress="2.87%" sequence=41625919 service=ingest source=historyArchive
INFO[2022-07-05T15:17:34.663+01:00] waiting for ingestion system catchup pid=306990 service=ingest status="{false false 0 41457386 41457386}"
WARN[2022-07-05T15:17:36.648+01:00] could not load stellar-core info: http request errored: Get "http://localhost:11626/info": context deadline exceeded pid=306990
ERRO[2022-07-05T15:17:36.648+01:00] failed to load the stellar-core info err="http request errored: Get \"http://localhost:11626/info\": context deadline exceeded" pid=306990 stack="[main.go:43 client.go:67 app.go:230 app.go:442 asm_amd64.s:1581]"
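The repeated "context deadline exceeded" errors are Horizon's periodic poll of the captive-core admin endpoint giving up before stellar-core answers. A quick way to check whether that endpoint responds at all within a short deadline is to probe it directly; the sketch below is in Java, the URL is the one from the logs, and the 5-second budget is only an assumed value chosen to mimic a short client-side deadline.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

// Minimal probe of the captive-core admin endpoint that Horizon polls.
public class CoreInfoProbe {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11626/info"))
                .timeout(Duration.ofSeconds(5))   // give up if core does not answer in time
                .GET()
                .build();
        long start = System.nanoTime();
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.printf("HTTP %d after %d ms%n",
                    response.statusCode(), (System.nanoTime() - start) / 1_000_000);
        } catch (HttpTimeoutException e) {
            // Same symptom that Horizon reports as "context deadline exceeded".
            System.out.printf("timed out after %d ms%n", (System.nanoTime() - start) / 1_000_000);
        }
    }
}

If this probe also stalls, the captive-core side (or the machine as a whole, which is still processing the history-archive pass visible above) is too busy to answer its admin port within the deadline, which suggests a load problem rather than a Horizon configuration error.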

Related

Using the Elasticsearch sink connector to feed data into Elasticsearch: timeouts all the time and eventually a manual restart is needed

We're busy with a PoC in which we produce messages to a Kafka topic (about 2 million so far, eventually around 130 million) that we want to query via Elasticsearch. A small PoC feeds the data into Elasticsearch using the Confluent Elasticsearch Sink Connector (latest, connector version 6.0.0). However, we ran into a lot of timeout issues, and eventually the tasks fail with a message saying the task needs to be restarted:
ERROR WorkerSinkTask{id=transactions-elasticsearch-connector-3} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted. Error: java.net.SocketTimeoutException: Read timed out (org.apache.kafka.connect.runtime.WorkerSinkTask)
My configuration for the sink connector is the following:
{
  "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
  "connection.url": "http://elasticsearch:9200",
  "key.converter": "org.apache.kafka.connect.storage.StringConverter",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://schema-registry:8081",
  "topics": "transactions,trades",
  "type.name": "transactions",
  "tasks.max": "4",
  "batch.size": "50",
  "max.buffered.events": "500",
  "max.buffered.records": "500",
  "flush.timeout.ms": "100000",
  "linger.ms": "50",
  "max.retries": "10",
  "connection.timeout.ms": "2000",
  "name": "transactions-elasticsearch-connector",
  "key.ignore": "true",
  "schema.ignore": "false",
  "transforms": "ExtractTimestamp",
  "transforms.ExtractTimestamp.type": "org.apache.kafka.connect.transforms.InsertField$Value",
  "transforms.ExtractTimestamp.timestamp.field": "MSG_TS"
}
Unfortunately, even when no messages are being produced and the Elasticsearch sink connector is started manually, the tasks still die and need to be restarted. I've fiddled with various batch sizes, retries, etc., but to no avail. Note that we only have one Kafka broker, one Elasticsearch sink connector, and one Elasticsearch instance, all running in Docker containers.
We also see a lot of these timeout messages:
[2020-12-08 13:23:34,107] WARN Failed to execute batch 100534 of 50 records with attempt 1/11, will attempt retry after 43 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:34,116] WARN Failed to execute batch 100536 of 50 records with attempt 1/11, will attempt retry after 18 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:34,132] WARN Failed to execute batch 100537 of 50 records with attempt 1/11, will attempt retry after 24 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:36,746] WARN Failed to execute batch 100539 of 50 records with attempt 1/11, will attempt retry after 0 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:37,139] WARN Failed to execute batch 100536 of 50 records with attempt 2/11, will attempt retry after 184 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:37,155] WARN Failed to execute batch 100534 of 50 records with attempt 2/11, will attempt retry after 70 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:37,160] WARN Failed to execute batch 100537 of 50 records with attempt 2/11, will attempt retry after 157 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:39,681] WARN Failed to execute batch 100540 of 50 records with attempt 1/11, will attempt retry after 12 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:39,750] WARN Failed to execute batch 100539 of 50 records with attempt 2/11, will attempt retry after 90 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:40,231] WARN Failed to execute batch 100534 of 50 records with attempt 3/11, will attempt retry after 204 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
[2020-12-08 13:23:40,322] WARN Failed to execute batch 100537 of 50 records with attempt 3/11, will attempt retry after 58 ms. Failure reason: Read timed out (io.confluent.connect.elasticsearch.bulk.BulkProcessor)
Any idea what we can improve to make the whole chain reliable? For our purposes it does not need to be blazingly fast, as long as all messages reliably end up in Elasticsearch without us having to restart the connector tasks every time.
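If Elasticsearch is simply slower to answer bulk requests than the client is willing to wait, the connector's timeout settings are the first knob to turn. The fragment below is only a hedged starting point in the same format as the config above, not a tuned recipe: the values are illustrative, and the option names should be double-checked against the documentation for the connector version in use. Raising read.timeout.ms gives each bulk request more time before "Read timed out", while small batches and a single in-flight request keep the load spikes on a single-node Elasticsearch lower.

{
  "connection.timeout.ms": "5000",
  "read.timeout.ms": "60000",
  "batch.size": "50",
  "max.in.flight.requests": "1",
  "max.buffered.records": "500",
  "max.retries": "10",
  "retry.backoff.ms": "1000",
  "flush.timeout.ms": "180000"
}

If the timeouts persist even with generous values, it is worth checking the Elasticsearch container itself (heap size, GC pauses, disk I/O), since a sink that only retries cannot compensate for a cluster that never answers.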

The Oozie job does not run, showing the message [AM container is launched, waiting for AM container to Register with RM]

I ran a shell job from the Oozie examples.
However, the YARN application never executes.
Detailed information (YARN UI & logs):
https://docs.google.com/document/d/1N8LBXZGttY3rhRTwv8cUEfK3WkWtvWJ-YV1q_fh_kks/edit
The YARN application status is:
Application Priority: 0 (Higher Integer value indicates higher priority)
YarnApplicationState: ACCEPTED: waiting for AM container to be allocated, launched and register with RM.
Queue: default
FinalStatus Reported by AM: Application has not completed yet.
Finished: N/A
Elapsed: 20mins, 30sec
Tracking URL: ApplicationMaster
Log Aggregation Status: DISABLED
Application Timeout (Remaining Time): Unlimited
Diagnostics: AM container is launched, waiting for AM container to Register with RM
The application attempt status is:
Application Attempt State: FAILED
Elapsed: 13mins, 19sec
AM Container: container_1607273090037_0001_02_000001
Node: N/A
Tracking URL: History
Diagnostics Info: ApplicationMaster for attempt appattempt_1607273090037_0001_000002 timed out
Num Node Local Containers (satisfied by): Node Local Request = 0
Num Rack Local Containers (satisfied by): Node Local Request = 0, Rack Local Request = 0
Num Off Switch Containers (satisfied by): Node Local Request = 0, Rack Local Request = 0, Off Switch Request = 1
NodeManager log:
2020-12-07 01:45:16,237 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler: Starting container [container_1607273090037_0001_01_000001]
2020-12-07 01:45:16,267 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1607273090037_0001_01_000001 transitioned from SCHEDULED to RUNNING
2020-12-07 01:45:16,267 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1607273090037_0001_01_000001
2020-12-07 01:45:16,272 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /tmp/hadoop-oozie/nm-local-dir/usercache/oozie/appcache/application_1607273090037_0001/container_1607273090037_0001_01_000001/default_container_executor.sh]
2020-12-07 01:45:17,301 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: container_1607273090037_0001_01_000001's ip = 127.0.0.1, and hostname = localhost.localdomain
2020-12-07 01:45:17,345 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Skipping monitoring container container_1607273090037_0001_01_000001 since CPU usage is not yet available.
2020-12-07 01:45:48,274 INFO logs: Aliases are enabled
2020-12-07 01:54:50,242 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Cache Size Before Clean: 496756, Total Deleted: 0, Public Deleted: 0, Private Deleted: 0
2020-12-07 01:58:10,071 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1607273090037_0001_000001 (auth:SIMPLE)
2020-12-07 01:58:10,078 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1607273090037_0001_01_000001
What is the problem?

AMQP 1.0 Qpid JMS and an Issue with Failover/Reconnect

I'm using the Qpid JMS 0.8.0 library to implement a standalone Java AMQP client. Because the underlying transport connection tends to break every couple of hours, I have configured reconnection with the following failover URI:
failover:(amqps://someurl:5671)?failover.reconnectDelay=2000&failover.warnAfterReconnectAttempts=1
In accordance with the Qpid client configuration page, I expect my client to keep trying to reconnect, increasing the delay between attempts by a factor of 2 (starting at 2 seconds). Instead, according to the log file, only two reconnect attempts were performed after the connection failure was detected, and in the end the whole client application terminated, which is exactly what I want to avoid. Here is the log file:
2016-03-22 14:29:40 INFO AmqpProvider:1190 - IdleTimeoutCheck closed the transport due to the peer exceeding our requested idle-timeout.
2016-03-22 14:29:40 DEBUG FailoverProvider:761 - Failover: the provider reports failure: Transport closed due to the peer exceeding our requested idle-timeout
2016-03-22 14:29:40 DEBUG FailoverProvider:519 - handling Provider failure: Transport closed due to the peer exceeding our requested idle-timeout
2016-03-22 14:29:40 DEBUG FailoverProvider:653 - Connection attempt:[1] to: amqps://publish.preops.nm.eurocontrol.int:5671 in-progress
2016-03-22 14:29:40 INFO FailoverProvider:659 - Connection attempt:[1] to: amqps://publish.preops.nm.eurocontrol.int:5671 failed
2016-03-22 14:29:40 WARN FailoverProvider:686 - Failed to connect after: 1 attempt(s) continuing to retry.
2016-03-22 14:29:42 DEBUG FailoverProvider:653 - Connection attempt:[2] to: amqps://publish.preops.nm.eurocontrol.int:5671 in-progress
2016-03-22 14:29:42 INFO FailoverProvider:659 - Connection attempt:[2] to: amqps://publish.preops.nm.eurocontrol.int:5671 failed
2016-03-22 14:29:42 WARN FailoverProvider:686 - Failed to connect after: 2 attempt(s) continuing to retry.
2016-03-22 14:29:43 DEBUG ThreadPoolUtils:156 - Shutdown of ExecutorService: java.util.concurrent.ThreadPoolExecutor#778970af[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] is shutdown: true and terminated: true took: 0.000 seconds.
2016-03-22 14:29:45 DEBUG ThreadPoolUtils:192 - Waited 2.004 seconds for ExecutorService: java.util.concurrent.ScheduledThreadPoolExecutor#877a470[Shutting down, pool size = 1, active threads = 0, queued tasks = 1, completed tasks = 3] to terminate...
2016-03-22 14:29:46 DEBUG ThreadPoolUtils:156 - Shutdown of ExecutorService: java.util.concurrent.ScheduledThreadPoolExecutor#877a470[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 4] is shutdown: true and terminated: true took: 2.889 seconds.
Any idea what I'm doing wrong here? Basically, what I'm trying to achieve is a client that detects a transport connection failure and keeps trying to reconnect every 5-10 seconds. A rough sketch of the connection setup follows.
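A minimal sketch of the connection setup, assuming the Qpid JMS 0.8.0 API. The extra failover options shown (failover.maxReconnectDelay, failover.useReconnectBackOff, failover.maxReconnectAttempts) are taken from the Qpid JMS configuration page and should be verified against the 0.8.0 docs; the values only illustrate the "retry forever, capped around 10 seconds" intent, and the credentials are placeholders.

import javax.jms.Connection;
import javax.jms.ExceptionListener;
import javax.jms.JMSException;
import javax.jms.Session;

import org.apache.qpid.jms.JmsConnectionFactory;

public class FailoverClient {
    public static void main(String[] args) throws Exception {
        // Failover URI: retry indefinitely, starting at 2 s and backing off up to a 10 s cap.
        String uri = "failover:(amqps://someurl:5671)"
                + "?failover.reconnectDelay=2000"
                + "&failover.maxReconnectDelay=10000"
                + "&failover.useReconnectBackOff=true"
                + "&failover.maxReconnectAttempts=-1"
                + "&failover.warnAfterReconnectAttempts=1";

        JmsConnectionFactory factory = new JmsConnectionFactory(uri);
        Connection connection = factory.createConnection("user", "password"); // placeholder credentials

        // Log asynchronous failures instead of letting them pass silently;
        // with failover enabled, reconnection itself is handled by the provider.
        connection.setExceptionListener(new ExceptionListener() {
            @Override
            public void onException(JMSException e) {
                System.err.println("Connection exception: " + e.getMessage());
            }
        });

        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        // ... create consumers/producers and keep the application thread alive ...
    }
}

One thing worth ruling out separately is whether anything in the application (for example an ExceptionListener handler or the end of the main thread) closes the connection once the failure is reported, since the executor shutdown in the log above looks like an orderly close of the provider rather than the failover mechanism giving up.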
Many thanks!

Hadoop HiveServer2 worker threads - how to get info about them

I was getting
HiveServer2 Process - Thrift server getting "Connection refused"
and hiveserver2.log was flooded with messages similar to:
2016-01-06 12:19:57,617 WARN [Thread-8]: server.TThreadPoolServer (TThreadPoolServer.java:serve(184)) - Task has been rejected by ExecutorService 9 times till timedout, reason: java.util.concurrent.RejectedExecutionException: Task org.apache.thrift.server.TThreadPoolServer$WorkerProcess#753dd4d2 rejected from java.util.concurrent.ThreadPoolExecutor#4408d67c[Running, pool size = 500, active threads = 500, queued tasks = 0, completed tasks = 12772]
Those 500 active threads correspond to the default maximum, hive.server2.thrift.http.max.worker.threads.
This was sorted by restarting HS2, but how can I gather more info about
which YARN job a worker thread maps to,
or for how long it has been running? (One way to inspect the threads is sketched below.)
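For the "how long has it been running / what is it doing" part, one option is to inspect the HiveServer2 JVM's threads directly. Below is a hedged sketch that assumes JMX remote access has been enabled on the HS2 process; the host, port, and the "HiveServer2-Handler-Pool" thread-name prefix are assumptions to adjust. A plain jstack of the HS2 PID gives the same information as a one-off snapshot.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class Hs2ThreadInspector {
    public static void main(String[] args) throws Exception {
        // Hypothetical JMX endpoint of the HiveServer2 JVM; requires the
        // com.sun.management.jmxremote options on the HS2 process.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://hs2-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    mbsc, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

            for (long id : threads.getAllThreadIds()) {
                ThreadInfo info = threads.getThreadInfo(id, 5); // include a short stack
                if (info == null) {
                    continue;
                }
                // Assumed name prefix of the Thrift worker pool; adjust after
                // checking a jstack dump of the actual HS2 process.
                if (info.getThreadName().startsWith("HiveServer2-Handler-Pool")) {
                    long cpuMs = threads.isThreadCpuTimeSupported()
                            ? threads.getThreadCpuTime(id) / 1_000_000 : -1;
                    System.out.printf("%s state=%s cpu=%dms%n",
                            info.getThreadName(), info.getThreadState(), cpuMs);
                    for (StackTraceElement frame : info.getStackTrace()) {
                        System.out.println("    at " + frame);
                    }
                }
            }
        }
    }
}

As far as I can tell, the thread name alone does not identify a YARN application; correlating a busy handler with a specific query (and from there with its YARN job) has to go through the HiveServer2 query/operation logs.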

java.io.IOException: Lease timeout of 0 seconds expired

I'm getting the following warning while running my MapReduce jobs under CDH4.
java.io.IOException: Lease timeout of 0 seconds expired.
at org.apache.hadoop.hdfs.DFSOutputStream.abort(DFSOutputStream.java:1700)
at org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:652)
at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:604)
at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:411)
at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:436)
at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:70)
at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:297)
at java.lang.Thread.run(Thread.java:662)
Any idea what this means?

Resources