Storm Topology does not start with parallelism hint of 1200 - apache-storm

Version Info:
"org.apache.storm" % "storm-core" % "1.2.1"
"org.apache.storm" % "storm-kafka-client" % "1.2.1"
I have a Storm topology with 3 bolts (A, B, C), where the middle bolt takes around 450 ms mean time and the other two bolts take less than 1 ms.
I am able to run the topology with the following parallelism hint values:
A: 4
B: 700
C: 10
But when I increase the parallelism hint of B to 1200, the topology does not start.
In the topology logs, I see the executors for B being loaded many times, like this:
2018-05-18 18:56:37.462 o.a.s.d.executor main [INFO] Loading executor B:[111 111]
2018-05-18 18:56:37.463 o.a.s.d.executor main [INFO] Loaded executor tasks B:[111 111]
2018-05-18 18:56:37.465 o.a.s.d.executor main [INFO] Finished loading executor B:[111 111]
2018-05-18 18:56:37.528 o.a.s.d.executor main [INFO] Loading executor B:[355 355]
2018-05-18 18:56:37.529 o.a.s.d.executor main [INFO] Loaded executor tasks B:[355 355]
2018-05-18 18:56:37.530 o.a.s.d.executor main [INFO] Finished loading executor B:[355 355]
2018-05-18 18:56:37.666 o.a.s.d.executor main [INFO] Loading executor B:[993 993]
2018-05-18 18:56:37.667 o.a.s.d.executor main [INFO] Loaded executor tasks B:[993 993]
2018-05-18 18:56:37.669 o.a.s.d.executor main [INFO] Finished loading executor B:[993 993]
2018-05-18 18:56:37.713 o.a.s.d.executor main [INFO] Loading executor B:[765 765]
2018-05-18 18:56:37.714 o.a.s.d.executor main [INFO] Loaded executor tasks B:[765 765]
But partway through, the worker process gets restarted. I don't see any error in the topology logs or the Storm logs. These are the Storm logs from when the worker gets restarted:
2018-05-18 18:51:46.755 o.a.s.d.s.Container SLOT_6700 [INFO] Killing eaf4d8ce-e758-4912-a15d-6dab8cda96d0:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.204 o.a.s.d.s.BasicContainer Thread-7 [INFO] Worker Process 766258fe-a604-4385-8eeb-e85cad38b674 exited with code: 143
2018-05-18 18:51:47.766 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE RUNNING msInState: 109081 topo:myTopology-1-1526649581 worker:766258fe-a604-4385-8eeb-e85cad38b674 -> KILL msInState: 0 topo:myTopology-1-1526649581 worker:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.766 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.774 o.a.s.d.s.Slot SLOT_6700 [WARN] SLOT 6700 all processes are dead...
2018-05-18 18:51:47.775 o.a.s.d.s.Container SLOT_6700 [INFO] Cleaning up eaf4d8ce-e758-4912-a15d-6dab8cda96d0:766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.775 o.a.s.d.s.Container SLOT_6700 [INFO] GET worker-user for 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.775 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/pids/27798
2018-05-18 18:51:47.775 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/heartbeats
2018-05-18 18:51:47.780 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/pids
2018-05-18 18:51:47.780 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674/tmp
2018-05-18 18:51:47.781 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers/766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.782 o.a.s.d.s.Container SLOT_6700 [INFO] REMOVE worker-user 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.782 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/workers-users/766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.783 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Removed Worker ID 766258fe-a604-4385-8eeb-e85cad38b674
2018-05-18 18:51:47.783 o.a.s.l.AsyncLocalizer SLOT_6700 [INFO] Released blob reference myTopology-1-1526649581 6700 Cleaning up BLOB references...
2018-05-18 18:51:47.784 o.a.s.l.AsyncLocalizer SLOT_6700 [INFO] Released blob reference myTopology-1-1526649581 6700 Cleaning up basic files...
2018-05-18 18:51:47.785 o.a.s.d.s.AdvancedFSOps SLOT_6700 [INFO] Deleting path /home/saurabh/storm-run/supervisor/stormdist/myTopology-1-1526649581
2018-05-18 18:51:47.808 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE KILL msInState: 42 topo:myTopology-1-1526649581 worker:null -> EMPTY msInState: 0
This keeps happening and the topology never starts. It used to start perfectly when the parallelism hint for bolt B was 700, and there is no other change.
One interesting log line is the following, though I am not yet sure what it means:
Worker Process 766258fe-a604-4385-8eeb-e85cad38b674 exited with code: 143
Any suggestions?
Edit:
Config:
topology.worker.childopts: -Xms1g -Xmx16g
topology.worker.logwriter.childopts: -Xmx1024m
topology.worker.max.heap.size.mb: 3072.0
worker.childopts: -Xms1g -Xmx16g -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=1%ID% -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:+UseG1GC -XX:+AggressiveOpts -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/home/saurabh.mimani/apache-storm-1.2.1/logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -Dorg.newsclub.net.unix.library.path=/usr/share/specter/uds-lib/
worker.gc.childopts:
worker.heap.memory.mb: 8192
supervisor.childopts: -Xms1g -Xmx16g
Edit:
Logs for strace -fp PID -e trace=read,write,network,signal,ipc are in the gist.
I am not yet able to understand them fully; some relevant-looking lines from it:
[pid 3362] open("/usr/lib/locale/UTF-8/LC_CTYPE", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 3362] kill(1487, SIGTERM) = 0
[pid 3362] close(1)

A quick Google search suggests that 143 is the exit code the JVM returns when it receives a SIGTERM (see e.g. the question Always app Java end with "Exit 143" Ubuntu). You might be running out of memory, or the OS may be killing the process for some other reason. Remember that setting the parallelism hint to 1200 means you will get 1200 tasks (copies) of bolt B, where you only had 700 before.
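To see why 143 points at SIGTERM: by shell convention, an exit status above 128 means the process was terminated by signal number (status - 128), and 143 - 128 = 15 is SIGTERM. This also matches the kill(1487, SIGTERM) call in the strace output. A minimal Python sketch of that decoding (the function name describe_exit_code is only illustrative):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status.

    By convention, a status above 128 means the process was terminated
    by signal number (status - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            return f"terminated by {signal.Signals(signum).name}"
        except ValueError:
            # Signal number not known on this platform.
            return f"terminated by unknown signal {signum}"
    return f"exited normally with status {code}"


print(describe_exit_code(143))
```

So the worker did not crash on its own; something sent it a SIGTERM, which is consistent with the supervisor's "Killing ..." log line.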

I was able to get this running by tweaking the following configurations. It seems it was timing out due to nimbus.task.launch.secs, which was set to 120: the worker was restarted if it had not started within 120 seconds.
Updated values of these configs:
drpc.request.timeout.secs: 1600
supervisor.worker.start.timeout.secs: 1200
nimbus.supervisor.timeout.secs: 1200
nimbus.task.launch.secs: 1200
About nimbus.task.launch.secs:
A special timeout used when a task is initially launched. During launch, this is the timeout used until the first heartbeat, overriding nimbus.task.timeout.secs.
A separate timeout exists for launch because there can be quite a bit of overhead to launching new JVM's and configuring them.
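One way to check a timeout theory like this against the supervisor log is to read how long the slot stayed in a state before it was killed. The sketch below pulls the state name and msInState value out of the Slot line from the question. The helper name state_duration is hypothetical, and treating the roughly 109 s in RUNNING as related to the 120 s launch timeout is an assumption, not something the log states.

```python
import re

# Supervisor log line copied from the question.
LINE = (
    "2018-05-18 18:51:47.766 o.a.s.d.s.Slot SLOT_6700 [INFO] STATE RUNNING "
    "msInState: 109081 topo:myTopology-1-1526649581 "
    "worker:766258fe-a604-4385-8eeb-e85cad38b674 -> KILL msInState: 0 "
    "topo:myTopology-1-1526649581 worker:766258fe-a604-4385-8eeb-e85cad38b674"
)

STATE_RE = re.compile(r"STATE (\w+) msInState: (\d+)")


def state_duration(line):
    """Return (state, milliseconds in that state) for the first state on the line, or None."""
    match = STATE_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


state, millis = state_duration(LINE)
print(f"{state} for {millis / 1000:.1f} s before the transition")
```

If the time in state grows once the timeouts are raised and the worker then stays up, that supports the explanation above.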

Related

Gitlab job is successful despite assertion failure

I am running SoapUI assertions using a Maven image in GitLab. Even though an assertion fails, the build is successful in GitLab. I have tried mvn integration-test with -ff as well as with -fae, but no luck. I also used allow_failure: false, which did not work either. Please advise how to fail the GitLab pipeline job if there is a failure in the assertions.
Here is my yml file
T001-0011:
  extends: .ETE -stage
  image: adoptopenjdk/maven-openjdk11
  variables:
    MAVEN_CLI_OPTS: "--fail-fast"
  script:
    - 'mvn -f ./TV001/pom.xml $MAVEN_CLI_OPTS integration-test'
  allow_failure: false
  when: always
Here is the gitlab log
1 error
09:53:48,937 ERROR [SoapUITestCaseRunner] JDBC_Request failed, exporting to [/builds/gitlab/data/test-team-automation-scripts/./SV321/Warnings/target/surefire-reports/TestSuite_1-AC1-JDBC_Request-0-FAILED.txt]
09:53:48,938 INFO [SoapUITestCaseRunner] Finished running SoapUI testcase [AC1], time taken: 904ms, status: FAILED
09:53:48,953 INFO [SoapUITestCaseRunner] Running SoapUI testcase [AC2]
09:53:48,963 INFO [SoapUITestCaseRunner] running step [IDN220001-Request2]
09:53:48,966 DEBUG [HttpClientSupport$SoapUIHttpClient] Stale connection check
09:53:48,968 DEBUG [HttpClientSupport$SoapUIHttpClient] Attempt 1 to execute request
09:53:48,968 DEBUG [SoapUIMultiThreadedHttpConnectionManager$SoapUIDefaultClientConnection] Sending request: GET /apikey/v1/warnings/waning/IDN22000 HTTP/1.1
09:53:48,974 DEBUG [SoapUIMultiThreadedHttpConnectionManager$SoapUIDefaultClientConnection] Receiving response: HTTP/1.1 404 Not Found
09:53:48,975 DEBUG [HttpClientSupport$SoapUIHttpClient] Connection can be kept alive indefinitely
09:53:49,018 INFO [log] HTTP status code: 404
09:53:49,019 INFO [SoapUITestCaseRunner] Assertion [Valid HTTP Status Codes] has status UNKNOWN
09:53:49,019 INFO [SoapUITestCaseRunner] Assertion [Script Assertion] has status VALID
09:53:49,019 INFO [SoapUITestCaseRunner] Finished running SoapUI testcase [AC2], time taken: 8ms, status: FINISHED
09:53:49,021 INFO [SoapUITestCaseRunner] Project [DPD-3396] finished with status [FAILED] in 2591ms
SoapUI 5.3.0 TestCaseRunner Summary
-----------------------------
Time Taken: 2599ms
Total TestSuites: 1
Total TestCases: 2 (1 failed)
Total TestSteps: 3
Total Request Assertions: 5
Total Failed Assertions: 1
Total Exported Results: 3
[WARNING] JAR will be empty - no content was marked for inclusion!
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:03 min
[INFO] Finished at: 2022-06-15T09:53:54+10:00
[INFO] ------------------------------------------------------------------------
Cleaning up file based variables
00:01
Job succeeded
The tests fail, but the exit code that the mvn command itself returns is probably zero.
In effect, Maven is saying: I managed to run the tests.
That doesn't help you, since you want the job to reflect the result of the tests.
To decide whether a job failed, GitLab checks the exit code of the commands it runs. You could force mvn to return a non-zero exit code when the tests fail.
You could add the following flag:
-Dmaven.test.failure.ignore=false
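If the flag does not change the exit code in your setup, a possible workaround (a sketch only, not a feature of SoapUI or Maven) is to let a small script read the TestCaseRunner summary and turn it into an exit status. It assumes the summary contains a "Total Failed Assertions: N" line like the one in the log above. The function name exit_status and the sample text are made up for illustration.

```python
import re


def exit_status(log_text: str) -> int:
    """Return 0 only if the SoapUI summary reports zero failed assertions.

    A missing summary line is treated as a failure, so a run that never
    reached the summary does not pass silently.
    """
    match = re.search(r"Total Failed Assertions:\s*(\d+)", log_text)
    if match is None:
        return 1
    return 0 if int(match.group(1)) == 0 else 1


# Sample text shaped like the summary in the question.
SAMPLE = "Total TestCases: 2 (1 failed)\nTotal Failed Assertions: 1\n"
print(exit_status(SAMPLE))
```

In the job, this could run after Maven: write the Maven output to a file (mvn.log would be a hypothetical name) and end the script with sys.exit(exit_status(open("mvn.log").read())), so that GitLab sees a non-zero exit code.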

SonarQube Run scanner

I have a success log:
INFO: ------------------------------------------------------------------------
INFO: EXECUTION SUCCESS
INFO: ------------------------------------------------------------------------
INFO: Total time: 29.282s
INFO: Final Memory: 15M/64M
INFO: ------------------------------------------------------------------------
But there are no statistics in the dashboard, only this error message:
The main branch has no lines of code.
My properties:
sonar.projectKey=....
sonar.sources=base/
My project:
root_dir
---------- sonar-project.properties
---------- base (this directory contains the files to scan)
and I have this in the log:
INFO: 0 source files to be analyzed
INFO: 0/0 source files have been analyzed

Accepted socket connection from /hostname:55306 (org.apache.zookeeper.server.NIOServerCnxnFactory)

I have configured a Kafka cluster, a Storm cluster and a Hadoop cluster. Everything works fine when there are no jobs.
When I submit the Storm jar (which gets data from Kafka, processes it, then stores it into HDFS) in standalone mode, it works fine.
After configuring the same code with the server properties and running it on the server, it gives the following error:
[2018-07-03 12:54:00,370] INFO Accepted socket connection from /192.168.3.222:55306 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2018-07-03 12:54:00,381] INFO Client attempting to establish new session at /192.168.3.222:55306 (org.apache.zookeeper.server.ZooKeeperServer)
[2018-07-03 12:54:00,383] INFO Established session 0x3645ed69ca40031 with negotiated timeout 20000 for client /192.168.3.222:55306 (org.apache.zookeeper.server.ZooKeeperServer)
[2018-07-03 12:54:02,429] WARN caught end of stream exception (org.apache.zookeeper.server.NIOServerCnxn)
EndOfStreamException: Unable to read additional data from client sessionid 0x3645ed69ca40031, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.lang.Thread.run(Thread.java:748)
[2018-07-03 12:54:02,433] INFO Closed socket connection for client /192.168.3.222:55306 which had sessionid 0x3645ed69ca40031 (org.apache.zookeeper.server.NIOServerCnxn)
[2018-07-03 12:54:06,000] INFO Expiring session 0x1645ed69c8c0041, timeout of 20000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
[2018-07-03 12:54:06,000] INFO Processed session termination for sessionid: 0x1645ed69c8c0041 (org.apache.zookeeper.server.PrepRequestProcessor)
Respective versions I am using:
apache-storm-1.0.6
kafka_2.11-1.0.1
zookeeper-3.4.12
hadoop-2.9.1
nimbus log
2018-07-04 12:28:54.455 o.a.s.d.nimbus timer [INFO] Setting new assignment for topology id test-topology-1-1530686803: #org.apache.storm.daemon.common.Assignment{:master-code-dir "/usr/local/apache-services/data/storm", :node->host {"7c98bf5a-38d5-4a13-95ad-966be3a51c49" "datanode2.sakha.com"}, :executor->node+port {[2 2] ["7c98bf5a-38d5-4a13-95ad-966be3a51c49" 6700], [1 1] ["7c98bf5a-38d5-4a13-95ad-966be3a51c49" 6700], [3 3] ["7c98bf5a-38d5-4a13-95ad-966be3a51c49" 6700]}, :executor->start-time-secs {[1 1] 1530687534, [2 2] 1530687534, [3 3] 1530687534}, :worker->resources {["7c98bf5a-38d5-4a13-95ad-966be3a51c49" 6700] [0.0 0.0 0.0]}, :owner "hduser"}
2018-07-04 12:28:54.520 o.a.s.d.nimbus pool-14-thread-7 [INFO] Created download session for test-topology-1-1530686803-stormjar.jar with id a9762861-224e-4f40-824b-ae0efa687452
supervisor log
2018-07-04 12:30:46.461 o.a.s.d.s.Container SLOT_6700 [INFO] Creating symlinks for worker-id: b9c3daa0-4f4d-42d7-9963-e93b6e6179a3 storm-id: test-topology-1-1530686803 for files(0): []
2018-07-04 12:30:46.461 o.a.s.d.s.Container SLOT_6700 [INFO] Topology jar for worker-id: b9c3daa0-4f4d-42d7-9963-e93b6e6179a3 storm-id: test-topology-1-1530686803 does not contain resources directory /usr/local/apache-services/data/storm/supervisor/stormdist/test-topology-1-1530686803/resources.
2018-07-04 12:30:46.461 o.a.s.d.s.BasicContainer SLOT_6700 [INFO] Launching worker with assignment LocalAssignment(topology_id:test-topology-1-1530686803, executors:[ExecutorInfo(task_start:2, task_end:2), ExecutorInfo(task_start:1, task_end:1), ExecutorInfo(task_start:3, task_end:3)], resources:WorkerResources(mem_on_heap:0.0, mem_off_heap:0.0, cpu:0.0), owner:hduser) for this supervisor 7c98bf5a-38d5-4a13-95ad-966be3a51c49 on port 6700 with id b9c3daa0-4f4d-42d7-9963-e93b6e6179a3
There is something wrong with your dependency tree. You posted that you got java.lang.NoSuchMethodError: org.apache.hadoop.security.authentication.util.KerberosUtil.hasKerberosTicket in your worker log. This points to you having the wrong Hadoop jar versions on your classpath when you submit the jar, or maybe you're missing the jars entirely.
Here's the pom for storm-hdfs: https://github.com/apache/storm/blob/v1.0.6/external/storm-hdfs/pom.xml. By default, it compiles against Hadoop 2.6.1. If you want to use another Hadoop version, you need to ensure that you replace the listed Hadoop dependencies with newer ones in your pom (i.e. you need to manually list, for example, hadoop-client in version 2.9.1 in your pom).
A good way to debug this is to run mvn dependency:tree in your project. That will show you which versions of which jars you are including in your build.
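The tree can get long, so it may help to filter it for Hadoop artifacts and group them by version. The sketch below does that for text shaped like mvn dependency:tree output, where each dependency is printed as group:artifact:type:version:scope. The function name hadoop_versions and the sample tree are made up for illustration; they are not output from the project in the question.

```python
from collections import defaultdict


def hadoop_versions(tree_output: str) -> dict:
    """Group org.apache.hadoop artifacts by version.

    The version is taken as the second-to-last field of each coordinate,
    so coordinates that also carry a classifier are still handled.
    """
    versions = defaultdict(set)
    for line in tree_output.splitlines():
        for token in line.split():
            if token.startswith("org.apache.hadoop:"):
                parts = token.split(":")
                if len(parts) >= 5:
                    versions[parts[-2]].add(parts[1])
    return dict(versions)


# Made-up sample in the shape of `mvn dependency:tree` output.
SAMPLE_TREE = r"""
[INFO] com.example:my-topology:jar:1.0
[INFO] +- org.apache.storm:storm-hdfs:jar:1.0.6:compile
[INFO] |  +- org.apache.hadoop:hadoop-client:jar:2.6.1:compile
[INFO] |  \- org.apache.hadoop:hadoop-hdfs:jar:2.6.1:compile
[INFO] \- org.apache.hadoop:hadoop-auth:jar:2.9.1:compile
"""

found = hadoop_versions(SAMPLE_TREE)
for version, artifacts in sorted(found.items()):
    print(version, sorted(artifacts))
if len(found) > 1:
    print("Mixed Hadoop versions on the classpath")
```

More than one version in the result is a hint that the kind of mismatch described above may be present.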

Mage Resque - Job Class not found error

I'm trying to implement asynchronous functionality in Magento using Mage-Resque. I have followed the instructions in https://github.com/ajbonner/mage-resque and installed all the components except ext-pcntl.
Now I'm able to queue a job to redis-server. I have tested it using the default Mns_Resque_Model_Job_Logmessage class, but I'm getting the following error.
[info] [11:09:42 2016-03-27] Checking default for jobs
[info] [11:09:42 2016-03-27] Found job on default
[notice] [11:09:42 2016-03-27] Starting work on (Job{default} | ID: 6fe2a430c10ff2920c3f66ec7d52e957 | Mns_Resque_Model_Job_Logmessage | [{"message":"Resque Test 1459057136"}])
[info] [11:09:42 2016-03-27] Forked 3759 at 2016-03-27 11:09:42
[info] [11:09:42 2016-03-27] Processing default since 2016-03-27 11:09:42
[critical] [11:09:42 2016-03-27] (Job{default} | ID: 6fe2a430c10ff2920c3f66ec7d52e957 | Mns_Resque_Model_Job_Logmessage | [{"message":"Resque Test 1459057136"}]) has failed Could not find job class Mns_Resque_Model_Job_Logmessage.
It's reporting that it cannot find the class Mns_Resque_Model_Job_Logmessage. What could be wrong? Have I missed something? Any help would be appreciated.

Error in nimbus log

When I tried to run a topology from my Storm client, I got an error that points to a failed connection with Nimbus.
I checked my Nimbus log and here's what it shows:
2014-04-25 11:05:03 nimbus [INFO] Uploading file from client to storm-local/nimbus/inbox/stormjar-7106a3e1-fae8-4afe-8028-5c561eeb365e.jar
2014-04-25 11:05:03 nimbus [INFO] Finished uploading file from client: storm-local/nimbus/inbox/stormjar-7106a3e1-fae8-4afe-8028-5c561eeb365e.jar
2014-04-25 11:05:03 nimbus [INFO] Received topology submission for beat with conf {"topology.max.task.parallelism" nil, "topology.acker.executors" 1, "topology.kryo.register" nil, "topology.kryo.decorators" (), "topology.nam$
2014-04-25 11:05:03 nimbus [INFO] Activating beat: beat-2-1398416703
2014-04-25 11:05:03 EvenScheduler [INFO] Available slots: (["c3a1bab3-ed50-4efc-b424-050d34d7d4bd" 6702] ["c3a1bab3-ed50-4efc-b424-050d34d7d4bd" 6703] ["8f506a92-4a1b-4cc6-8f80-ed53ea810256" 6701] ["8f506a92-4a1b-4cc6-8f80-e$
2014-04-25 11:05:03 nimbus [INFO] Setting new assignment for topology id beat-2-1398416703: #backtype.storm.daemon.common.Assignment{:master-code-dir "storm-local/nimbus/stormdist/beat-2-1398416703", :node->host {"c3a1bab3-e$
2014-04-25 12:08:03 nimbus [INFO] Cleaning inbox ... deleted: stormjar-7106a3e1-fae8-4afe-8028-5c561eeb365e.jar
**2014-04-25 13:59:47 TNonblockingServer [ERROR] Read an invalid frame size of -720899. Are you using TFramedTransport on the client side?
2014-04-25 14:00:16 TNonblockingServer [ERROR] Read an invalid frame size of -720899. Are you using TFramedTransport on the client side?**
Any clarification?
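One thing that can be checked directly is what bytes produce that frame size. With a framed Thrift transport, each message starts with a 4-byte big-endian signed length, so the reported value is just the first four bytes the server received, read as an integer. A small Python sketch to recover them:

```python
import struct

# Value reported by TNonblockingServer in the log above.
frame_size = -720899

# A framed Thrift message begins with a 4-byte big-endian signed length,
# so packing the value the same way recovers the bytes the server read.
raw = struct.pack(">i", frame_size)
print(raw.hex(" "))

# Telnet command bytes, listed here only for comparison.
TELNET = {0xFF: "IAC", 0xF4: "IP", 0xFD: "DO"}
print([TELNET.get(byte, hex(byte)) for byte in raw])
```

The bytes are ff f4 ff fd, which coincide with the Telnet commands IAC IP, IAC DO, the sequence a telnet client typically sends on Ctrl+C. Concluding that something other than a Storm client (for example a telnet session or a port probe) connected to the Nimbus Thrift port is an assumption. The log does not prove it, and these two errors occur hours after the topology submission shown above.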
