How to apply config to Elasticsearch deployed on K8s?

I am having trouble changing the config of an Elasticsearch cluster that is deployed on K8s.
I want to apply this config to my Elasticsearch nodes:
# Force all memory to be locked, forcing the JVM to never swap
bootstrap.mlockall: true
## Threadpool Settings ##
# Search pool
threadpool.search.type: fixed
threadpool.search.size: 20
threadpool.search.queue_size: 100
# Bulk pool
threadpool.bulk.type: fixed
threadpool.bulk.size: 60
threadpool.bulk.queue_size: 300
# Index pool
threadpool.index.type: fixed
threadpool.index.size: 20
threadpool.index.queue_size: 100
# Indices settings
indices.memory.index_buffer_size: 30%
indices.memory.min_shard_index_buffer_size: 12mb
indices.memory.min_index_buffer_size: 96mb
# Cache Sizes
indices.fielddata.cache.size: 15%
indices.fielddata.cache.expire: 6h
indices.cache.filter.size: 15%
indices.cache.filter.expire: 6h
# Indexing Settings for Writes
index.refresh_interval: 30s
index.translog.flush_threshold_ops: 50000
To deploy Elasticsearch on K8s, I follow two steps:
Step 1: Create a .yml file like this:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: esc-lamtest
  namespace: lamtest
spec:
  version: 7.9.0
  nodeSets:
  - name: basic-1
    count: 3
    config:
      node.master: true
      node.data: true
      node.ingest: true
      # Search pool
      thread_pool.search.queue_size: 50
      thread_pool.search.size: 20
      thread_pool.search.min_queue_size: 10
      thread_pool.search.max_queue_size: 100
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 150Gi
        storageClassName: vnptit-nfs
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          env:
          - name: ES_JAVA_OPTS
            value: -Xms4g -Xmx4g
          resources:
            requests:
              memory: 0Gi
              cpu: 0Gi
            limits:
              memory: 0Gi
              cpu: 0Gi
  http:
    tls:
      selfSignedCertificate:
        disabled: true
Step 2: Run this command:
kubectl apply -f lamtest.yaml
But I can only apply the config for "Search pool".
When I apply the config for # Bulk pool, # Index pool, # Indices settings, # Cache Sizes and # Indexing Settings for Writes, Elasticsearch fails to start. Here are my logs:
"Suppressed: java.lang.IllegalArgumentException: unknown setting [thread_pool.bulk.queue_size] did you mean any of [thread_pool.get.queue_size, thread_pool.write.queue_size, thread_pool.analyze.queue_size, thread_pool.search.queue_size, thread_pool.listener.queue_size]?",
"\tat org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:544) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:489) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:460) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:431) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.common.settings.SettingsModule.<init>(SettingsModule.java:149) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.node.Node.<init>(Node.java:385) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.node.Node.<init>(Node.java:277) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:227) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:227) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:393) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127) ~[elasticsearch-cli-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.9.0.jar:7.9.0]",
"Suppressed: java.lang.IllegalArgumentException: unknown setting [thread_pool.bulk.size] did you mean any of [thread_pool.get.size, thread_pool.write.size, thread_pool.analyze.size, thread_pool.search.size]?",
"\tat org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:544) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:489) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:460) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:431) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.common.settings.SettingsModule.<init>(SettingsModule.java:149) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.node.Node.<init>(Node.java:385) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.node.Node.<init>(Node.java:277) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:227) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:227) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:393) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127) ~[elasticsearch-cli-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126) ~[elasticsearch-7.9.0.jar:7.9.0]",
"\tat org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.9.0.jar:7.9.0]"] }
uncaught exception in thread [main]
java.lang.IllegalArgumentException: unknown setting [thread_pool.bulk.type] please check that any required plugins are installed, or check the breaking changes documentation for removed settings
at org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:544)
at org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:489)
at org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:460)
at org.elasticsearch.common.settings.AbstractScopedSettings.validate(AbstractScopedSettings.java:431)
at org.elasticsearch.common.settings.SettingsModule.<init>(SettingsModule.java:149)
at org.elasticsearch.node.Node.<init>(Node.java:385)
at org.elasticsearch.node.Node.<init>(Node.java:277)
at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:227)
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:227)
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:393)
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:170)
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:161)
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:127)
at org.elasticsearch.cli.Command.main(Command.java:90)
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:126)
at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92)
For complete error details, refer to the log at /usr/share/elasticsearch/logs/esc-lamtest.log

The error message is very clear: you have not used the correct setting names for the other thread pools. Read the first line of your error message carefully:
"Suppressed: java.lang.IllegalArgumentException: unknown setting
[thread_pool.bulk.queue_size] did you mean any of
[thread_pool.get.queue_size, thread_pool.write.queue_size,
thread_pool.analyze.queue_size, thread_pool.search.queue_size,
thread_pool.listener.queue_size]?",
You have defined the bulk queue size as thread_pool.bulk.queue_size, which is not correct. As the Elasticsearch thread pool documentation explains, there is no thread pool for bulk; bulk requests use the write thread pool instead. Changing this setting to thread_pool.write.queue_size should therefore work. From the same documentation:
write For single-document index/delete/update and bulk requests.
Thread pool type is fixed with a size of # of allocated processors,
queue_size of 10000. The maximum size for this pool is 1 + # of
allocated processors.
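The error message already names the valid alternatives, so they can be pulled out of the log mechanically. A minimal Python sketch, assuming the message keeps the "unknown setting [...] did you mean any of [...]" wording shown above; the function name suggested_settings is hypothetical and not part of any Elasticsearch tooling:

```python
import re

# Matches the startup error quoted above. The "did you mean" part is optional,
# because some rejected settings (e.g. thread_pool.bulk.type) get no suggestions.
UNKNOWN_SETTING = re.compile(
    r"unknown setting\s+\[(?P<bad>[^\]]+)\]"
    r"(?:\s+did you mean any of\s+\[(?P<hints>[^\]]+)\])?"
)


def suggested_settings(log_text):
    """Map each rejected setting name to the alternatives Elasticsearch proposed."""
    result = {}
    for match in UNKNOWN_SETTING.finditer(log_text):
        bad = match.group("bad")
        hints = match.group("hints")
        names = [h.strip() for h in hints.split(",")] if hints else []
        # Keep the first non-empty suggestion list seen for a setting.
        if not result.get(bad):
            result[bad] = names
    return result


if __name__ == "__main__":
    sample = (
        "unknown setting [thread_pool.bulk.queue_size] did you mean any of "
        "[thread_pool.get.queue_size, thread_pool.write.queue_size, "
        "thread_pool.analyze.queue_size, thread_pool.search.queue_size, "
        "thread_pool.listener.queue_size]?"
    )
    for bad, hints in suggested_settings(sample).items():
        print(bad, "->", hints)
```

The suggestions are only name-similarity hints from Elasticsearch; which one is semantically right (here, write for bulk requests) still has to be checked against the documentation.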
To fix this for the bulk, index, and other settings, verify that you are using the correct name for each setting. The thread pool names can be obtained from any running ES instance, and from them you can construct the corresponding config keys.
http://localhost:9200/_nodes/stats?pretty lists all the thread pools; some of them are shown below.
"thread_pool": {
  "analyze": {
    "threads": 1,
    "queue": 0,
    "active": 0,
    "rejected": 0,
    "largest": 1,
    "completed": 1
  },
  "ccr": {
    "threads": 0,
    "queue": 0,
    "active": 0,
    "rejected": 0,
    "largest": 0,
    "completed": 0
  },
}
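Collecting the pool names from that response can be scripted. A minimal Python sketch, assuming the usual _nodes/stats layout in which each entry under "nodes" carries a "thread_pool" object; the helper names and the sample body are hypothetical, and the size/queue_size suffixes are an assumption that does not hold for every pool type:

```python
import json


def thread_pool_names(nodes_stats):
    """Collect the thread pool names reported by every node in a _nodes/stats response."""
    names = set()
    for node in nodes_stats.get("nodes", {}).values():
        names.update(node.get("thread_pool", {}).keys())
    return sorted(names)


def candidate_settings(pool, suffixes=("size", "queue_size")):
    """Build thread_pool.<pool>.<suffix> keys; the suffixes are an assumption."""
    return [f"thread_pool.{pool}.{suffix}" for suffix in suffixes]


if __name__ == "__main__":
    # Trimmed, made-up response body with the same shape as the excerpt above.
    body = json.loads(
        '{"nodes": {"node-a": {"thread_pool": '
        '{"analyze": {"threads": 1}, "write": {"threads": 4}}}}}'
    )
    for pool in thread_pool_names(body):
        print(pool, candidate_settings(pool))
```

Treat the generated keys as candidates only and confirm each one in the thread pool documentation for your Elasticsearch version before putting it into the nodeSet config.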

Related

ElasticSearch MasterNotDiscoveredException

I've set up an Elasticsearch cluster in Kubernetes, but I'm getting the error "MasterNotDiscoveredException". I'm not even sure where to begin debugging this error, as there does not appear to be anything really useful in the logs of any of the nodes:
│ elasticsearch {"type": "server", "timestamp": "2022-06-15T00:44:17,226Z", "level": "WARN", "component": "r.suppressed", "cluster.name": "logging-ek", "node.name": "logging-ek-es-master-0", "message": "path: /_bulk, params: {}", │
│ elasticsearch "stacktrace": ["org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized, SERVICE_UNAVAILABLE/2/no master];", │
│ elasticsearch "at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:179) ~[elasticsearch-7.17.1.jar:7.17.1]", │
│ elasticsearch "at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.handleBlockExceptions(TransportBulkAction.java:635) [elasticsearch-7.17.1.jar:7.17.1]", │
│ elasticsearch "at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation.doRun(TransportBulkAction.java:481) [elasticsearch-7.17.1.jar:7.17.1]", │
│ elasticsearch "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.1.jar:7.17.1]", │
│ elasticsearch "at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$2.onTimeout(TransportBulkAction.java:669) [elasticsearch-7.17.1.jar:7.17.1]", │
│ elasticsearch "at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.17.1.jar:7.17.1]", │
│ elasticsearch "at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.17.1.jar:7.17.1]", │
│ elasticsearch "at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.17.1.jar:7.17.1]", │
│ elasticsearch "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.1.jar:7.17.1]", │
│ elasticsearch "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]", │
│ elasticsearch "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]", │
│ elasticsearch "at java.lang.Thread.run(Thread.java:833) [?:?]", │
│ elasticsearch "Suppressed: org.elasticsearch.discovery.MasterNotDiscoveredException", │
│ elasticsearch "\tat org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:297) ~[elasticsearch-7.17.1.jar:7.17.1]", │
│ elasticsearch "\tat org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.17.1.jar:7.17.1]", │
│ elasticsearch "\tat org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.17.1.jar:7.17.1]", │
│ elasticsearch "\tat org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.17.1.jar:7.17.1]", │
│ elasticsearch "\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.1.jar:7.17.1]", │
│ elasticsearch "\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]", │
│ elasticsearch "\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]", │
│ elasticsearch "\tat java.lang.Thread.run(Thread.java:833) [?:?]"] }
This is pretty much the only log output I've ever seen.
It does appear that the cluster sees all of my master nodes:
elasticsearch {"type": "server", "timestamp": "2022-06-15T00:45:41,915Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "logging-ek", "node.name": "logging-ek-es-master-0", "message": "master not discovered yet, this node has not previously joine │
│ d a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node: have discovered [{logging-ek-es-master-0}{fHLQrvLsTJ6UvR_clSaxfg}{iLoGrnWSTpiZxq59z7I5zA}{10.42.64.4}{10.42.64.4:9300}{mr}, {logging-ek-es-master-1}{EwF8WLIgSF6Q1Q46_51VlA}{wz5rg74iThicJdtzXZg29g}{ │
│ 10.42.240.8}{10.42.240.8:9300}{mr}, {logging-ek-es-master-2}{jtrThk_USA2jUcJYoIHQdg}{HMvZ_dUfTM-Ar4ROeIOJlw}{10.42.0.5}{10.42.0.5:9300}{mr}]; discovery will continue using [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304, 127.0.0.1:9305, [::1]:9300, [::1]: │
│ 9301, [::1]:9302, [::1]:9303, [::1]:9304, [::1]:9305, 10.42.0.5:9300, 10.42.240.8:9300] from hosts providers and [{logging-ek-es-master-0}{fHLQrvLsTJ6UvR_clSaxfg}{iLoGrnWSTpiZxq59z7I5zA}{10.42.64.4}{10.42.64.4:9300}{mr}] from last-known cluster state; node term 0, last-accepted version │
│ 0 in term 0" }
and I've verified that they can in fact reach each other over the network. Is there anything else I should check, or anywhere else I should look for errors? I installed Elasticsearch via the Elastic operator.
Start by running curl against port 9300 between the nodes, and make sure you get a valid response in both directions.
Also make sure your MTU is set correctly, as a wrong MTU can cause this too. Most of the time this is a network problem.
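The reachability check can also be scripted. A minimal Python sketch, assuming Python 3 is available wherever you run it (the Elasticsearch container image may not ship it); the function name can_connect is hypothetical. It only confirms that a TCP connection opens, so it will not reveal MTU problems, which typically show up only with larger payloads:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Pod IPs taken from the log in the question; run this from each pod in turn
    # so that every direction is covered.
    peers = ["10.42.64.4", "10.42.240.8", "10.42.0.5"]
    for peer in peers:
        print(peer, "reachable on 9300:", can_connect(peer, 9300))
```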

Kafka/questDB JDBC sink connector: crashing if multiple topics

I have a Kafka broker with 2 topics and a JDBC sink connector to a questDB database.
Everything is built with docker containers.
If I configure the JDBC connector with 1 topic (whichever one), it works just fine and all events are transferred to questDB.
curl -X POST -H "Accept:application/json" -H "Content-Type:application/json" --data @jdbc_sink.json http://connect:8083/connectors
{
  "name": "jdbc_sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "topic_1",
    "table.name.format": "${topic}",
    "connection.url": "jdbc:postgresql://questdb:8812/qdb?useSSL=false",
    "connection.user": "admin",
    "connection.password": "quest",
    "auto.create": "true",
    "insert.mode": "insert",
    "dialect.name": "PostgreSqlDatabaseDialect"
  }
}
As soon as I include 2 topics in the JDBC config, the questDB and connect containers crash.
"topics": "topic_1,topic_2",
Connect Dockerfile
FROM confluentinc/cp-kafka-connect-base:7.1.0
RUN confluent-hub install --no-prompt confluentinc/kafka-connect-jdbc:10.3.3
connect:
  build:
    context: ./connect
    dockerfile: Dockerfile
  hostname: connect
  container_name: connect
  depends_on:
    - broker1
    - zookeeper
  ports:
    - "8083:8083"
  environment:
    CONNECT_BOOTSTRAP_SERVERS: "broker1:29092"
    CONNECT_REST_ADVERTISED_HOST_NAME: connect
    CONNECT_REST_PORT: 8083
    CONNECT_GROUP_ID: compose-connect-group
    CONNECT_CONFIG_STORAGE_TOPIC: docker-connect-configs
    CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
    CONNECT_OFFSET_FLUSH_INTERVAL_MS: 10000
    CONNECT_OFFSET_STORAGE_TOPIC: docker-connect-offsets
    CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
    CONNECT_STATUS_STORAGE_TOPIC: docker-connect-status
    CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
    CONNECT_KEY_CONVERTER: org.apache.kafka.connect.storage.StringConverter
    CONNECT_KEY_CONVERTER_SCHEMA_REGISTRY_URL: "http://schema-registry:8081"
    CONNECT_VALUE_CONVERTER: io.confluent.connect.avro.AvroConverter
    CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL: "http://schema-registry:8081"
  networks:
    - broker-kafka
questdb:
  image: questdb/questdb:6.2.2b
  hostname: questdb
  container_name: questdb
  ports:
    - "9000:9000" # REST API and Web Console
    - "9009:9009" # InfluxDB line protocol
    - "8812:8812" # Postgres wire protocol
    - "9003:9003" # Min health server
  volumes:
    - ./questdb:/root/.questdb
  networks:
    - broker-kafka
What is the problem here? Does anyone have an idea?
Below you can see extracts from the docker logs of the connect and questDB containers. I cannot post the complete logs (too many characters) so I tried to extract relevant messages, but frankly speaking I couldn't figure out the determining error message. Please let me know if you need more logs.
questDB log (extract)
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f7dc99486a8, pid=1, tid=38
#
# JRE version: OpenJDK Runtime Environment (17.0.1+12) (build 17.0.1+12-39)
# Java VM: OpenJDK 64-Bit Server VM (17.0.1+12-39, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# J 2153 c1 io.questdb.cairo.vm.api.MemoryCR.getStr(JLio/questdb/cairo/vm/api/MemoryCR$CharSequenceView;)Ljava/lang/CharSequence; io.questdb#6.2.2-SNAPSHOT (101 bytes) # 0x00007f7dc99486a8 [0x00007f7dc99484a0+0x0000000000000208]
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /root/.questdb/hs_err_pid1.log
Compiled method (c1) 386972 2153 3 io.questdb.cairo.vm.api.MemoryCR::getStr (101 bytes)
total in heap [0x00007f7dc9948210,0x00007f7dc9949438] = 4648
relocation [0x00007f7dc9948370,0x00007f7dc9948498] = 296
main code [0x00007f7dc99484a0,0x00007f7dc9948f80] = 2784
stub code [0x00007f7dc9948f80,0x00007f7dc9949060] = 224
oops [0x00007f7dc9949060,0x00007f7dc9949070] = 16
metadata [0x00007f7dc9949070,0x00007f7dc99490b8] = 72
scopes data [0x00007f7dc99490b8,0x00007f7dc9949220] = 360
scopes pcs [0x00007f7dc9949220,0x00007f7dc99493e0] = 448
dependencies [0x00007f7dc99493e0,0x00007f7dc99493e8] = 8
nul chk table [0x00007f7dc99493e8,0x00007f7dc9949438] = 80
#
# If you would like to submit a bug report, please visit:
# https://bugreport.java.com/bugreport/crash.jsp
#
[error occurred during error reporting (), id 0xb, SIGSEGV (0xb) at pc=0x00007f7ddf75b529]
[error occurred during error reporting (), id 0xb, SIGSEGV (0xb) at pc=0x00007f7ddf75b529]
connect log (extract)
[2022-04-06 07:01:50,706] WARN Unable to query database for maximum table name length; the connector may fail to write to tables with long names (io.confluent.connect.jdbc.dialect.PostgreSqlDatabaseDialect)
org.postgresql.util.PSQLException: ERROR: unknown function name: repeat(STRING,INT)
Position: 15
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2675)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2365)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:355)
at org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:490)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:408)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:329)
at org.postgresql.jdbc.PgStatement.executeCachedSql(PgStatement.java:315)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:291)
at org.postgresql.jdbc.PgStatement.executeQuery(PgStatement.java:243)
at io.confluent.connect.jdbc.dialect.PostgreSqlDatabaseDialect.computeMaxIdentifierLength(PostgreSqlDatabaseDialect.java:119)
at io.confluent.connect.jdbc.dialect.PostgreSqlDatabaseDialect.getConnection(PostgreSqlDatabaseDialect.java:106)
at io.confluent.connect.jdbc.util.CachedConnectionProvider.newConnection(CachedConnectionProvider.java:80)
at io.confluent.connect.jdbc.util.CachedConnectionProvider.getConnection(CachedConnectionProvider.java:52)
at io.confluent.connect.jdbc.sink.JdbcDbWriter.write(JdbcDbWriter.java:64)
at io.confluent.connect.jdbc.sink.JdbcSinkTask.put(JdbcSinkTask.java:84)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:581)
at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:333)
at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:234)
at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:203)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:188)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:243)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
[2022-04-06 07:01:50,709] INFO JdbcDbWriter Connected (io.confluent.connect.jdbc.sink.JdbcDbWriter)
[2022-04-06 07:01:50,745] INFO Checking PostgreSql dialect for existence of TABLE "topic_1" (io.confluent.connect.jdbc.dialect.GenericDatabaseDialect)
[2022-04-06 07:01:50,998] INFO Using PostgreSql dialect TABLE "topic_1" absent (io.confluent.connect.jdbc.dialect.GenericDatabaseDialect)
[2022-04-06 07:01:51,005] INFO Creating table with sql: CREATE TABLE "topic_1" (
"id" REAL NOT NULL,
"price" REAL NOT NULL,
"size" REAL NOT NULL,
"side" TEXT NOT NULL,
"liquidation" BOOLEAN NOT NULL,
"time" TEXT NOT NULL) (io.confluent.connect.jdbc.sink.DbStructure)
[2022-04-06 07:01:51,184] INFO Checking PostgreSql dialect for existence of TABLE "topic_1" (io.confluent.connect.jdbc.dialect.GenericDatabaseDialect)
[2022-04-06 07:01:51,193] INFO Using PostgreSql dialect TABLE "topic_1" present (io.confluent.connect.jdbc.dialect.GenericDatabaseDialect)
[2022-04-06 07:01:52,136] INFO Checking PostgreSql dialect for type of TABLE "topic_1" (io.confluent.connect.jdbc.dialect.GenericDatabaseDialect)
[2022-04-06 07:01:52,145] INFO Setting metadata for table "topic_1" to Table{name='"topic_1"', type=TABLE columns=[Column{'liquidation', isPrimaryKey=false, allowsNull=true, sqlType=bool}, Column{'size', isPrimaryKey=false, allowsNull=true, sqlType=float4}, Column{'time', isPrimaryKey=false, allowsNull=true, sqlType=varchar}, Column{'price', isPrimaryKey=false, allowsNull=true, sqlType=float4}, Column{'id', isPrimaryKey=false, allowsNull=true, sqlType=float4}, Column{'side', isPrimaryKey=false, allowsNull=true, sqlType=varchar}]} (io.confluent.connect.jdbc.util.TableDefinitions)
[2022-04-06 07:01:52,154] INFO Checking PostgreSql dialect for existence of TABLE "topic_2" (io.confluent.connect.jdbc.dialect.GenericDatabaseDialect)
[2022-04-06 07:01:52,339] INFO Using PostgreSql dialect TABLE "topic_2" absent (io.confluent.connect.jdbc.dialect.GenericDatabaseDialect)
[2022-04-06 07:01:52,339] INFO Creating table with sql: CREATE TABLE "topic_2" (
"id" REAL NOT NULL,
"price" REAL NOT NULL,
"size" REAL NOT NULL,
"side" TEXT NOT NULL,
"liquidation" BOOLEAN NOT NULL,
"time" TEXT NOT NULL) (io.confluent.connect.jdbc.sink.DbStructure)
[2022-04-06 07:01:52,661] INFO Checking PostgreSql dialect for existence of TABLE "topic_2" (io.confluent.connect.jdbc.dialect.GenericDatabaseDialect)
[2022-04-06 07:01:52,669] INFO Using PostgreSql dialect TABLE "topic_2" present (io.confluent.connect.jdbc.dialect.GenericDatabaseDialect)

Hive on Tez error: java.lang.OutOfMemoryError

I am facing this error when partitioning by date a Hive table that has more than 70 columns:
ERROR : Status: Failed
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1612203694878_0265_4_00, diagnostics=[Task failed, taskId=task_1612203694878_0265_4_00_000058, diagnostics=[TaskAttempt 0 failed, info=[Container container_e16_1612203694878_0265_01_000167 finished with diagnostics set to [Container failed, exitCode=-104. [2021-02-02 11:00:58.498]Container [pid=1577,containerID=container_e16_1612203694878_0265_01_000167] is running 3022848B beyond the 'PHYSICAL' memory limit. Current usage: 1.0 GB of 1 GB physical memory used; 2.7 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_e16_1612203694878_0265_01_000167 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 1577 1567 1577 1577 (bash) 0 0 116011008 301 /bin/bash -c /usr/jdk64/jdk1.8.0_112/bin/java -Xmx819m -server -Djava.net.preferIPv4Stack=true -Dhdp.version=3.1.4.0-315 -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseG1GC -XX:+ResizeTLAB -server -Djava.net.preferIPv4Stack=true -XX:NewRatio=8 -XX:+UseNUMA -XX:+UseG1GC -XX:+ResizeTLAB -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/usr/hadoop/yarn/log/application_1612203694878_0265/container_e16_1612203694878_0265_01_000167 -Dtez.root.logger=INFO,CLA -Djava.io.tmpdir=/usr/hadoop/yarn/local/usercache/hive/appcache/application_1612203694878_0265/container_e16_1612203694878_0265_01_000167/tmp org.apache.tez.runtime.task.TezChild slave-06-n.fawryhq.corp 43250 container_e16_1612203694878_0265_01_000167 application_1612203694878_0265 1 1>/usr/hadoop/yarn/log/application_1612203694878_0265/container_e16_1612203694878_0265_01_000167/stdout 2>/usr/hadoop/yarn/log/application_1612203694878_0265/container_e16_1612203694878_0265_01_000167/stderr
|- 1658 1577 1577 1577 (java) 1414 128 2788896768 262581 /usr/jdk64/jdk1.8.0_112/bin/java -Xmx819m -server -Djava.net.preferIPv4Stack=true -Dhdp.version=3.1.4.0-315 -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseG1GC -XX:+ResizeTLAB -server -Djava.net.preferIPv4Stack=true -XX:NewRatio=8 -XX:+UseNUMA -XX:+UseG1GC -XX:+ResizeTLAB -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -Dlog4j.configuratorClass=org.apache.tez.common.TezLog4jConfigurator -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/usr/hadoop/yarn/log/application_1612203694878_0265/container_e16_1612203694878_0265_01_000167 -Dtez.root.logger=INFO,CLA -Djava.io.tmpdir=/usr/hadoop/yarn/local/usercache/hive/appcache/application_1612203694878_0265/container_e16_1612203694878_0265_01_000167/tmp org.apache.tez.runtime.task.TezChild slave-06-n.fawryhq.corp 43250 container_e16_1612203694878_0265_01_000167 application_1612203694878_0265 1
[2021-02-02 11:00:58.512]Container killed on request. Exit code is 143
[2021-02-02 11:00:58.521]Container exited with a non-zero exit code 143.
]], TaskAttempt 1 failed, info=[Error: Error while running task ( failure ) : java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.allocateSpace(PipelinedSorter.java:256)
at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.<init>(PipelinedSorter.java:205)
at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.start(OrderedPartitionedKVOutput.java:146)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:193)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:266)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
, errorMessage=Cannot recover from this error:java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.allocateSpace(PipelinedSorter.java:256)
at org.apache.tez.runtime.library.common.sort.impl.PipelinedSorter.<init>(PipelinedSorter.java:205)
at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.start(OrderedPartitionedKVOutput.java:146)
at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.init(MapRecordProcessor.java:193)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:266)
at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:250)
at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:17, Vertex vertex_1612203694878_0265_4_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE]
ERROR : Vertex killed, vertexName=Reducer 2, vertexId=vertex_1612203694878_0265_4_01, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:2, Vertex vertex_1612203694878_0265_4_01 [Reducer 2] killed/failed due to:OTHER_VERTEX_FAILURE]
ERROR : DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:1
Try the following, in this order:
Step 1: Increase mapper parallelism. The goal is to get more, smaller mappers. Check how many mappers the query currently starts and adjust the figures accordingly. This will not help if the input consists of very large files in a non-splittable format such as gzip; in that case, proceed with the next two steps.
-- This is an example; check your current settings and adjust them to get 2x or more mappers
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set tez.grouping.max-size=32000000; -- bigger files will be split
set tez.grouping.min-size=32000; -- smaller files will be combined on a single mapper
Step 2: Disable map-side aggregation (map-side aggregation often leads to OOM).
set hive.map.aggr=false;
Step 3: Increase mapper memory if the two steps above do not help (try to find the minimum working container size).
set hive.tez.container.size=9216; -- adjust the figures and choose the minimum working size
set hive.tez.java.opts=-Xmx6144m;
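To get a feel for how tez.grouping.max-size affects parallelism, the following is a rough sketch only. It assumes the input is fully splittable and that no split group exceeds tez.grouping.max-size, so it gives a lower bound rather than the number Tez will actually choose (which also depends on cluster capacity and tez.grouping.min-size). The function name min_mappers and the 10 GiB input size are hypothetical.

```python
def min_mappers(total_input_bytes: int, grouping_max_size: int) -> int:
    """Lower bound on the mapper count if no split group exceeds grouping_max_size bytes."""
    if total_input_bytes <= 0:
        return 0
    if grouping_max_size <= 0:
        raise ValueError("grouping_max_size must be positive")
    # Integer ceiling division: every started chunk of max-size bytes needs a mapper.
    return (total_input_bytes + grouping_max_size - 1) // grouping_max_size


# Hypothetical table size, for illustration only.
total_input = 10 * 1024**3  # 10 GiB

for max_size in (64_000_000, 32_000_000):
    print(f"tez.grouping.max-size={max_size}: at least "
          f"{min_mappers(total_input, max_size)} mappers")
```

Under these assumptions, halving tez.grouping.max-size doubles the lower bound, which is the "2x or more mappers" effect the first step aims for.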
When you work with Hive on Tez, you have to define at least all of these parameters, for example:
set hive.tez.container.size=8192;
set tez.am.resource.memory=8192;
set tez.runtime.io.sort.mb=2048;
set hive.tez.java.opts=-Xmx6144m;
set tez.am.launch.cmd-opts=-Xmx4096m;
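These values have to fit inside one another. The stack trace in the question shows the OutOfMemoryError being raised while PipelinedSorter allocated a heap ByteBuffer, which suggests that the sort buffer (tez.runtime.io.sort.mb) is taken from the task heap. A minimal sanity check, assuming the rule "sort buffer < heap (-Xmx) < container size" for tasks and "heap < container size" for the application master; the helper names parse_settings, xmx_mb, and check are illustrative and not part of Hive or Tez:

```python
import re

SETTINGS = """
set hive.tez.container.size=8192;
set tez.am.resource.memory=8192;
set tez.runtime.io.sort.mb=2048;
set hive.tez.java.opts=-Xmx6144m;
set tez.am.launch.cmd-opts=-Xmx4096m;
"""


def parse_settings(text: str) -> dict:
    """Return {key: value} for every 'set key=value;' statement in text."""
    return dict(re.findall(r"set\s+([\w.\-]+)=([^;]+);", text))


def xmx_mb(java_opts: str) -> int:
    """Extract the -Xmx value in MB from a JVM options string such as '-Xmx6144m'."""
    match = re.search(r"-Xmx(\d+)m", java_opts)
    if match is None:
        raise ValueError(f"no -Xmx<N>m option found in {java_opts!r}")
    return int(match.group(1))


def check(settings: dict) -> list:
    """Return a list of human-readable problems; an empty list means the sizes nest correctly."""
    problems = []
    container = int(settings["hive.tez.container.size"])
    heap = xmx_mb(settings["hive.tez.java.opts"])
    sort_mb = int(settings["tez.runtime.io.sort.mb"])
    am_container = int(settings["tez.am.resource.memory"])
    am_heap = xmx_mb(settings["tez.am.launch.cmd-opts"])

    if heap >= container:
        problems.append(f"task heap {heap} MB does not fit in container {container} MB")
    if sort_mb >= heap:
        problems.append(f"sort buffer {sort_mb} MB does not fit in task heap {heap} MB")
    if am_heap >= am_container:
        problems.append(f"AM heap {am_heap} MB does not fit in AM container {am_container} MB")
    return problems


if __name__ == "__main__":
    issues = check(parse_settings(SETTINGS))
    print("OK" if not issues else "\n".join(issues))
```

The check only verifies ordering; how much headroom to leave between the heap and the container is a tuning decision it does not make.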

Set Kubernetes Readiness for elasticsearch client

I upgraded the Elasticsearch chart in Kubernetes from version 6.6 to 7.10.2. The data and master pods are running and ready, but the client pods are not ready (the cluster has 2 client, 2 data, and 3 master pods).
This is their status:
NAME READY STATUS RESTARTS AGE
elasticsearch-client-685c875bb5-5mcxg 0/1 Running 0 2m23s
elasticsearch-client-685c875bb5-cs9lq 0/1 Running 0 24m
When I run describe I see this warning:
Warning Unhealthy 10s (x10 over 100s) kubelet, Readiness probe failed: Get http://_cluster/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
and in kubectl logs, I get this
{"type": "server", "timestamp": "2021-01-24T13:43:41,318Z", "level": "WARN", "component": "r.suppressed", "cluster.name": "elasticsearch", "node.name": "elasticsearch-client-685c875bb5-5mcxg", "message": "path: /_cluster/health, params: {}",
"stacktrace": ["org.elasticsearch.discovery.MasterNotDiscoveredException: null",
"at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:230) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:335) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:601) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) [elasticsearch-7.10.2.jar:7.10.2]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]",
"at java.lang.Thread.run(Thread.java:832) [?:?]"] }
{"type": "server", "timestamp": "2021-01-24T13:43:51,319Z", "level": "WARN", "component": "r.suppressed", "cluster.name": "elasticsearch", "node.name": "elasticsearch-client-685c875bb5-5mcxg", "message": "path: /_cluster/health, params: {}",
"stacktrace": ["org.elasticsearch.discovery.MasterNotDiscoveredException: null",
"at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:230) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:335) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:601) [elasticsearch-7.10.2.jar:7.10.2]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684) [elasticsearch-7.10.2.jar:7.10.2]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]",
"at java.lang.Thread.run(Thread.java:832) [?:?]"] }
I set initialDelaySeconds to 90 for both the readiness and liveness probes.
What might be the problem here?
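Since you have upgraded from 6.x to 7.x, make sure that cluster.initial_master_nodes is set, either through the environment or in the elasticsearch.yml config file.
You should also run an odd number of master-eligible nodes (1, 3, 5, and so on); 3 masters is usually optimal. Electing a master requires a majority (quorum) of the master-eligible nodes, and without a quorum no master is elected, which is consistent with the MasterNotDiscoveredException in your logs.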
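The preference for an odd number of masters follows from simple majority arithmetic. A small sketch, assuming a plain majority quorum of floor(n/2) + 1; the function names are illustrative and this is not Elasticsearch code:

```python
def quorum(master_eligible: int) -> int:
    """Smallest number of master-eligible nodes that forms a strict majority."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - quorum(master_eligible)


for n in range(1, 6):
    print(f"{n} masters: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

With 3 masters one failure is tolerated; adding a fourth raises the quorum to 3 but still tolerates only one failure, so the even count buys no extra resilience.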

Flink in YARN + Checkpointing in HDFS - recurring error org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException

Flink YARN Cluster High Availability:
high-availability: zookeeper
high-availability.storageDir: hdfs://hann/user/flink/recovery
high-availability.zookeeper.quorum: XXX:2181
high-availability.zookeeper.path.root: /flink
state.backend: rocksdb
state.checkpoints.dir: hdfs://hann/user/flink/checkpoints
state.checkpoints.num-retained: 5
+ Streaming job (Kafka source -> Flink -> some sinks)
StreamExecutionEnvironment:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(<interval>);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(<interval>);
env.getCheckpointConfig().setCheckpointTimeout(<interval>);
env.setRestartStrategy(<restartStrategies>);
The job works well without checkpointing, but with checkpointing enabled it crashes periodically:
2018-06-29 07:15:56,429 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 444 @ 1530245743320 for job cf58d818c629f8297c6331b4130db1f9.
2018-06-29 07:16:16,638 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 444 of job cf58d818c629f8297c6331b4130db1f9 expired before completing.
2018-06-29 07:16:16,796 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 445 @ 1530245776638 for job cf58d818c629f8297c6331b4130db1f9.
2018-06-29 07:16:24,596 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Kafka (5/6) (5d1bb37e21bd68a04a752e62323c6d88) switched from RUNNING to FAILED.
AsynchronousException{java.lang.Exception: Could not materialize checkpoint 444 for operator Source: Kafka (5/6).}
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointExceptionHandler.tryHandleCheckpointException(StreamTask.java:1154)
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:948)
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:885)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.Exception: Could not materialize checkpoint 444 for operator Source: Kafka (5/6).
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.handleExecutionException(StreamTask.java:943)
... 6 more
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Could not flush and close the file system output stream to hdfs://hann/user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/chk-444/8ec33328-eb51-4c74-8b1b-dfc0ef185bfd in order to obtain the stream state handle
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:53)
at org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.<init>(OperatorSnapshotFinalizer.java:53)
at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:854)
... 5 more
Caused by: java.io.IOException: Could not flush and close the file system output stream to hdfs://hann/user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/chk-444/8ec33328-eb51-4c74-8b1b-dfc0ef185bfd in order to obtain the stream state handle
at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:325)
at org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:447)
at org.apache.flink.runtime.state.DefaultOperatorStateBackend$1.performOperation(DefaultOperatorStateBackend.java:352)
at org.apache.flink.runtime.io.async.AbstractAsyncCallableWithResources.call(AbstractAsyncCallableWithResources.java:75)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:50)
... 7 more
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/chk-444/8ec33328-eb51-4c74-8b1b-dfc0ef185bfd (inode 97646080): File does not exist. Holder DFSClient_NONMAPREDUCE_-2015925738_1 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3752)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:3839)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:3809)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:748)
at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.complete(AuthorizationProviderProxyClientProtocol.java:248)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:551)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2226)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2222)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2220)
at org.apache.hadoop.ipc.Client.call(Client.java:1470)
at org.apache.hadoop.ipc.Client.call(Client.java:1401)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy9.complete(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:443)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy10.complete(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:2251)
at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:2233)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
at org.apache.flink.runtime.fs.hdfs.HadoopDataOutputStream.close(HadoopDataOutputStream.java:52)
at org.apache.flink.core.fs.ClosingFSDataOutputStream.close(ClosingFSDataOutputStream.java:64)
at org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:311)
... 12 more
At the same time, the checkpoints directory contains:
~ # hdfs dfs -ls /user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/
Found 6 items
drwxr-xr-x - flink flink 0 2018-06-29 07:15 /user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/chk-441
drwxr-xr-x - flink flink 0 2018-06-29 07:15 /user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/chk-442
drwxr-xr-x - flink flink 0 2018-06-29 07:15 /user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/chk-443
drwxr-xr-x - flink flink 0 2018-06-29 07:16 /user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/chk-445
drwxr-xr-x - flink flink 0 2018-06-29 02:48 /user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/shared
drwxr-xr-x - flink flink 0 2018-06-29 02:48 /user/flink/checkpoints/cf58d818c629f8297c6331b4130db1f9/taskowned
There is no chk-444 folder in the checkpoints directory.
I'm stuck =(
I tried both FsStateBackend and RocksDBStateBackend and it makes no difference: I get this error every 5-6 hours.
P.S.
Flink 1.5.0
Hadoop 2.6.0
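The timing in the log excerpt can be read off directly. A minimal sketch that uses only the three timestamps from the log lines above (the variable and function names are illustrative): it shows that checkpoint 444 was declared expired about 20 seconds after the "Triggering checkpoint 444" line, and that the Kafka source task failed about 8 seconds after that, while still trying to materialize checkpoint 444. One possible reading, which the log alone does not confirm, is that the files of the expired checkpoint were already discarded when the task tried to close its output stream; that would be consistent with chk-444 missing from the directory listing.

```python
from datetime import datetime

# Timestamp layout used in the log excerpt, e.g. "2018-06-29 07:15:56,429".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_time(stamp: str) -> datetime:
    """Parse a log timestamp with comma-separated milliseconds."""
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


# Timestamps copied from the log lines in the question.
triggered = parse_log_time("2018-06-29 07:15:56,429")    # "Triggering checkpoint 444"
expired = parse_log_time("2018-06-29 07:16:16,638")      # "Checkpoint 444 ... expired"
task_failed = parse_log_time("2018-06-29 07:16:24,596")  # "switched from RUNNING to FAILED"

trigger_to_expiry = (expired - triggered).total_seconds()
expiry_to_failure = (task_failed - expired).total_seconds()

print(f"trigger -> expiry:       {trigger_to_expiry:.3f} s")
print(f"expiry  -> task failure: {expiry_to_failure:.3f} s")
```

If this reading applies, comparing the measured gap with the value passed to setCheckpointTimeout would show whether checkpoints are simply taking longer than the configured timeout.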

Resources