I have a simple Groovy script that leverages the GPars library's withPool functionality to launch HTTP GET requests to two internal API endpoints in parallel.
The script runs fine locally, both directly and as a Docker container.
When I deploy it as a Kubernetes Job (to our internal EKS cluster, v1.20), it runs there as well, but the moment it hits the first withPool call I see a giant thread dump; execution continues regardless, and completes successfully.
NOTE: Containers in our cluster run with the following pod security context:
securityContext:
  fsGroup: 2000
  runAsNonRoot: true
  runAsUser: 1000
Environment
# From the k8s job container
groovy@app-271df1d7-15848624-mzhhj:/app$ groovy --version
WARNING: Using incubator modules: jdk.incubator.foreign, jdk.incubator.vector
Groovy Version: 4.0.0 JVM: 17.0.2 Vendor: Eclipse Adoptium OS: Linux
groovy@app-271df1d7-15848624-mzhhj:/app$ ps -ef
UID PID PPID C STIME TTY TIME CMD
groovy 1 0 0 21:04 ? 00:00:00 /bin/bash bin/run-script.sh
groovy 12 1 42 21:04 ? 00:00:17 /opt/java/openjdk/bin/java -Xms3g -Xmx3g --add-modules=ALL-SYSTEM -classpath /opt/groovy/lib/groovy-4.0.0.jar -Dscript.name=/usr/bin/groovy -Dprogram.name=groovy -Dgroovy.starter.conf=/opt/groovy/conf/groovy-starter.conf -Dgroovy.home=/opt/groovy -Dtools.jar=/opt/java/openjdk/lib/tools.jar org.codehaus.groovy.tools.GroovyStarter --main groovy.ui.GroovyMain --conf /opt/groovy/conf/groovy-starter.conf --classpath . /tmp/script.groovy
groovy 116 0 0 21:05 pts/0 00:00:00 bash
groovy 160 116 0 21:05 pts/0 00:00:00 ps -ef
Script (relevant parts)
@Grab('org.codehaus.gpars:gpars:1.2.1')
import static groovyx.gpars.GParsPool.withPool
import groovy.json.JsonSlurper
final def jsl = new JsonSlurper()
//...
while (!(nextBatch = getBatch(batchSize)).isEmpty()) {
    def devThread = Thread.start {
        withPool(poolSize) {
            nextBatch.eachParallel { kw ->
                String url = dev + "&" + "query=$kw"
                try {
                    def response = jsl.parseText(url.toURL().getText(connectTimeout: 10.seconds, readTimeout: 10.seconds,
                            useCaches: true, allowUserInteraction: false))
                    devResponses[kw] = response
                } catch (e) {
                    println("\tFailed to fetch: $url | error: $e")
                }
            }
        }
    }
    def stgThread = Thread.start {
        withPool(poolSize) {
            nextBatch.eachParallel { kw ->
                String url = stg + "&" + "query=$kw"
                try {
                    def response = jsl.parseText(url.toURL().getText(connectTimeout: 10.seconds, readTimeout: 10.seconds,
                            useCaches: true, allowUserInteraction: false))
                    stgResponses[kw] = response
                } catch (e) {
                    println("\tFailed to fetch: $url | error: $e")
                }
            }
        }
    }
    devThread.join()
    stgThread.join()
}
Dockerfile
FROM groovy:4.0.0-jdk17 as builder
USER root
RUN apt-get update && apt-get install -yq bash curl wget jq
WORKDIR /app
COPY bin /app/bin
RUN chmod +x /app/bin/*
USER groovy
ENTRYPOINT ["/bin/bash"]
CMD ["bin/run-script.sh"]
The bin/run-script.sh script simply downloads the above Groovy script at runtime and executes it:
wget "$GROOVY_SCRIPT" -O "$LOCAL_FILE"
...
groovy "$LOCAL_FILE"
As soon as the execution hits the first call to withPool(poolSize), there's a giant thread dump, but execution continues.
I'm trying to figure out what could be causing this behavior. Any ideas 🤷🏽♂️?
Thread dump
For posterity, answering my own question here.
The issue turned out to be the log4j2 JVM hot-patch we're currently leveraging to fix the recent log4j2 vulnerability. This agent (running as a DaemonSet) patches all running JVMs in all our k8s clusters.
This, somehow, causes my OpenJDK 17-based app to thread dump. I found the same issue with an Elasticsearch 8.1.0 deployment as well (it also uses a pre-packaged OpenJDK 17). That one is a service, so I could see a thread dump happening pretty much every half hour! Interestingly, there are other JVM services (and some Solr 8 deployments) that don't have this issue 🤷🏽♂️.
Anyway, I worked with our devops team to temporarily exclude the patching DaemonSet from the node that deployment was running on, and lo and behold, the thread dumps disappeared!
Balance in the universe has been restored 🧘🏻♂️.
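For anyone who needs to do something similar: one way to keep a DaemonSet off a specific node is to label the node and give the DaemonSet's pod template a node-affinity rule that avoids that label. This is a sketch of the general mechanism only — the label key here is hypothetical, and our devops team handled the actual details:

```
# Hypothetical: label the affected node first, e.g.
#   kubectl label nodes <node-name> jvm-hotpatch=exclude
# then add an affinity rule to the patching DaemonSet's pod template:
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: jvm-hotpatch
                    operator: NotIn
                    values: ["exclude"]
```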
I have an e2e Protractor-Jasmine test in my Angular application. I've just run the exact same test several times in a row, and it stops at exactly the same step every time, giving the same error:
'Failed: script timeout: result was not received in 20 seconds'
What I have tried:
1. I attempted to make this an async function: ... header', async () => { ...
2. I have tried await(ing) the element: await element(by.css("[ng-click='siteDocLibCtrl.managePermissionsDialog($event)']")).click();
3. I have tried browser.sleep(3000)
Jasmine version: 2.8.0
npm version:
npm: '6.4.1',
ares: '1.15.0',
cldr: '33.1',
http_parser: '2.8.0',
icu: '62.1',
modules: '64',
napi: '3',
nghttp2: '1.34.0',
node: '10.15.0',
openssl: '1.1.0j',
tz: '2018e',
unicode: '11.0',
uv: '1.23.2',
v8: '6.8.275.32-node.45',
zlib: '1.2.11'
element.all(by.repeater("file in siteDocLibCtrl.files | filter:global.search | orderBy:orderByField:reverseSort")).get(0).click(); //selects 1st element
element(by.css("[ng-click='siteDocLibCtrl.managePermissionsDialog($event)']")).click();
The output error I am getting is below:
Failed: script timeout: result was not received in 20 seconds
(Session info: chrome=73.0.3683.103)
(Driver info: chromedriver=2.46.628402 (536cd7adbad73a3783fdc2cab92ab2ba7ec361e1),platform=Windows NT 10.0.17134 x86_64)
Stack:
ScriptTimeoutError: script timeout: result was not received in 20 seconds
(Session info: chrome=73.0.3683.103)
(Driver info: chromedriver=2.46.628402 (536cd7adbad73a3783fdc2cab92ab2ba7ec361e1),platform=Windows NT 10.0.17134 x86_64)
at Object.checkLegacyResponse (C:\Users\Jagdeep\eclipse-workspace\ProtractorTutorial\node_modules\selenium-webdriver\lib\error.js:546:15)
at parseHttpResponse (C:\Users\Jagdeep\eclipse-workspace\ProtractorTutorial\node_modules\selenium-webdriver\lib\http.js:509:13)
at doSend.then.response (C:\Users\Jagdeep\eclipse-workspace\ProtractorTutorial\node_modules\selenium-webdriver\lib\http.js:441:30)
at process._tickCallback (internal/process/next_tick.js:68:7)
Try adding this line before the test:
browser.ignoreSynchronization = true;
I'm receiving multiple Erlang errors in my CouchDB 2.1.1 cluster (3 nodes/Windows); see the errors and node configuration below.
3 nodes (10.0.7.4 - 10.0.7.6); an Azure application gateway is used as the load balancer.
Why do these errors appear? The system resources of the nodes are far from overloaded.
I would be thankful for any help - thanks in advance.
Errors:
rexi_server: from: couchdb@10.0.7.4(<0.14976.568>) mfa: fabric_rpc:changes/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream_last,2,[{file,"src/rexi.erl"},{line,224}]},{fabric_rpc,changes,4,[{file,"src/fabric_rpc.erl"},{line,86}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
rexi_server: from: couchdb@10.0.7.6(<13540.24597.655>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,204}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,308}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,642}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
rexi_server: from: couchdb@10.0.7.6(<13540.5991.623>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,204}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,308}]},{couch_mrview,map_fold,3,[{file,"src/couch_mrview.erl"},{line,511}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,848}]},{couch_btree,fold,4,[{file,"src/couch_btree.erl"},{line,222}]},{couch_db,enum_docs,5,[{file,"src/couch_db.erl"},{line,1450}]},{couch_mrview,all_docs_fold,4,[{file,"src/couch_mrview.erl"},{line,425}]}]
req_err(3206982071) unknown_error : normal [<<"mochiweb_request:recv/3 L180">>,<<"mochiweb_request:stream_unchunked_body/4 L540">>,<<"mochiweb_request:recv_body/2 L214">>,<<"chttpd:body/1 L636">>,<<"chttpd:json_body/1 L649">>,<<"chttpd:json_body_obj/1 L657">>,<<"chttpd_db:db_req/2 L386">>,<<"chttpd:process_request/1 L295">>]
** System running to use fully qualified hostnames **
** Hostname localhost is illegal **
COMPACTION-ERRORS
Supervisor couch_secondary_services had child compaction_daemon started with couch_compaction_daemon:start_link() at <0.18509.478> exit with reason {compaction_loop_died,{timeout,{gen_server,call,[couch_server,get_server]}}} in context child_terminated
CRASH REPORT Process couch_compaction_daemon (<0.18509.478>) with 0 neighbors exited with reason: {compaction_loop_died,{timeout,{gen_server,call,[couch_server,get_server]}}} at gen_server:terminate/7(line:826) <= proc_lib:init_p_do_apply/3(line:240); initial_call: {couch_compaction_daemon,init,['Argument__1']}, ancestors: [couch_secondary_services,couch_sup,<0.200.0>], messages: [], links: [<0.12665.492>], dictionary: [], trap_exit: true, status: running, heap_size: 987, stack_size: 27, reductions: 3173
gen_server couch_compaction_daemon terminated with reason: {compaction_loop_died,{timeout,{gen_server,call,[couch_server,get_server]}}} last msg: {'EXIT',<0.23195.476>,{timeout,{gen_server,call,[couch_server,get_server]}}} state: {state,<0.23195.476>,[]}
Error in process <0.16890.22> on node 'couchdb@10.0.7.4' with exit value: {{rexi_DOWN,{'couchdb@10.0.7.5',noproc}},[{mem3_rpc,rexi_call,2,[{file,"src/mem3_rpc.erl"},{line,269}]},{mem3_rep,calculate_start_seq,1,[{file,"src/mem3_rep.erl"},{line,194}]},{mem3_rep,repl,2,[{file,"src/mem3_rep.erl"},{line,175}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,81}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,208}]}]}
vm.args
-name couchdb@10.0.7.4
-setcookie monster
-kernel error_logger silent
-sasl sasl_error_logger false
+K true
+A 16
+Bd -noinput
+Q 134217727
local.ini
[fabric]
request_timeout = infinity
[couchdb]
max_dbs_open = 10000
os_process_timeout = 20000
uuid =
[chttpd]
port = 5984
bind_address = 0.0.0.0
[httpd]
socket_options = [{recbuf, 262144}, {sndbuf, 262144}, {nodelay, true}]
enable_cors = true
[couch_httpd_auth]
secret =
[daemons]
compaction_daemon={couch_compaction_daemon, start_link, []}
[compactions]
_default = [{db_fragmentation, "50%"}, {view_fragmentation, "50%"}, {from, "23:00"}, {to, "04:00"}]
[compaction_daemon]
check_interval = 300
min_file_size = 100000
[vendor]
name = COUCHCLUSTERNODE0X
[admins]
adminuser =
[cors]
methods = GET, PUT, POST, HEAD, DELETE
headers = accept, authorization, content-type, origin, referer
origins = *
credentials = true
[query_server_config]
os_process_limit = 2000
os_process_soft_limit = 1000
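One observation on this config (an assumption worth checking, not a confirmed diagnosis): `[fabric] request_timeout = infinity` means the coordinating node never gives up on internal requests locally, while an Azure application gateway will drop connections after its own idle timeout — a combination consistent with the `fabric_rpc:changes` / `fabric_rpc:all_docs` `exit:timeout` entries above. A finite, mutually consistent set of timeouts may behave better; the value below is illustrative only:

```
[fabric]
; illustrative finite value (milliseconds) instead of infinity
request_timeout = 60000
```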
I am here to ask for some help because I can't reach a solution and I have spent so much time on this.
The problem is a weird behavior in karma + jasmine tests. Initially I thought the problem was in the AngularJS code but, stripping things down bit by bit, I reached the point where there is nothing else to remove, and the problem is definitely not in Angular.
The actual code that I am using is this:
test.js:
'use strict';
describe('Unit tests suite', function () {
    it('test', function () {
        expect('base').toEqual('');
    });
});
karma.conf.js:
module.exports = function (config) {
    config.set({
        basePath: '',
        frameworks: ['jasmine'],
        files: ['*.js'],
        exclude: [],
        preprocessors: {},
        reporters: ['progress'],
        port: 9876,
        colors: true,
        logLevel: config.LOG_INFO,
        autoWatch: true,
        browsers: ['PhantomJS'],
        singleRun: false,
    })
}
Absolutely nothing else. The result of that test is:
13 02 2016 04:32:39.559:WARN [karma]: No captured browser, open http://localhost:9876/
13 02 2016 04:32:39.571:INFO [karma]: Karma v0.13.15 server started at http://localhost:9876/
13 02 2016 04:32:39.578:INFO [launcher]: Starting browser PhantomJS
13 02 2016 04:32:41.248:INFO [PhantomJS 2.1.1 (Mac OS X 0.0.0)]: Connected on socket HiC4WW_4235Nlf0rAAAA with id 54292207
PhantomJS 2.1.1 (Mac OS X 0.0.0) Unit tests suite test FAILED
Expected '/Users/Gianmarco/Desktop/test' to equal ''.
/Users/Gianmarco/Desktop/test/test.js:5:31
PhantomJS 2.1.1 (Mac OS X 0.0.0): Executed 1 of 1 (1 FAILED) ERROR (0.003 secs / 0.003 secs)
As you can see, it seems that the word "base" is being replaced with the path of the folder. This is driving me nuts; I can't figure out why it is doing so.
I tried both on Mac OS X and Ubuntu 14.04, and the result is the same.
To prepare the folder I did this:
mkdir test
cd test
npm install jasmine-core karma-cli karma-jasmine karma-phantomjs-launcher phantomjs-prebuilt --save
karma init
karma start
and of course karma-cli was installed globally (npm install karma-cli -g) some time ago.
The versions are:
jasmine-core@2.4.1
karma@0.13.21
karma-cli@0.1.2
karma-jasmine@0.3.7
karma-phantomjs-launcher@1.0.0
phantomjs-prebuilt@2.1.4
The same behaviour occurs with the word absolute, which is also replaced with an empty string.
I believe it's a problem with the default reporter in karma (progress): it appears URL_REGEXP matches both base and absolute, since all the rest of the regex is optional.
var URL_REGEXP = new RegExp('(?:https?:\\/\\/[^\\/]*)?\\/?' +
'(base|absolute)' + // prefix
'((?:[A-z]\\:)?[^\\?\\s\\:]*)' + // path
'(\\?\\w*)?' + // sha
'(\\:(\\d+))?' + // line
'(\\:(\\d+))?' + // column
'', 'g')
https://github.com/karma-runner/karma/blob/684ab1838c6ad7127df2f1785c1f56520298cd6b/lib/reporter.js#L25
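To see this without karma in the loop, the pattern from lib/reporter.js can be exercised directly in node (the regex below is copied verbatim from the link above):

```javascript
// Every group after the (base|absolute) prefix is optional, so the bare
// word "base" in an assertion message is a complete match on its own and
// gets rewritten by the reporter into a file path.
var URL_REGEXP = new RegExp('(?:https?:\\/\\/[^\\/]*)?\\/?' +
  '(base|absolute)' + // prefix
  '((?:[A-z]\\:)?[^\\?\\s\\:]*)' + // path
  '(\\?\\w*)?' + // sha
  '(\\:(\\d+))?' + // line
  '(\\:(\\d+))?' + // column
  '', 'g')

console.log(URL_REGEXP.test('base'))     // true: matches with an empty path
URL_REGEXP.lastIndex = 0                 // reset the /g cursor between calls
console.log(URL_REGEXP.test('absolute')) // true: same story for "absolute"
```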