How to clean up logs of dead jobs in Storm? - apache-storm

I am trying to clean up the logs of dead Storm jobs, which are stored in storm_log_path/workers-artifacts/.
My current approach is to use a cron job or logrotate to clean up the directory, but this has a problem: it deletes logs even while the job is still running.
What I would like to do instead is use Storm's own configuration for this task. According to the Log Cleanup section of the Storm documentation, these options should clean up the logs and never delete the logs of running jobs, but it didn't work.
I am using Storm 1.2.3, and this is my storm.yaml:
logviewer.childopts: "-Xmx128m"
logviewer.cleanup.age.mins: 30
logviewer.max.sum.worker.logs.size.mb: 4096
logviewer.max.per.worker.logs.size.mb: 2048
I set the cleanup age to 30 minutes to test, but it never worked.
The log directory has one folder per job run; the names follow the pattern jobID-countingNumber-timestamp:
5faaac990788a706cb972861-1-1607352884
5faaac990788a706cb972861-1-1607358710
5faaac990788a706cb972861-1-1607528615
5faaac990788a706cb972861-1-1607587744
5faaac990788a706cb972861-2-1607353512
5faaac990788a706cb972861-2-1607507502
5faaac990788a706cb972861-3-1607354786
How can I get the logviewer option to work, or is there another approach?

TL;DR
In your storm.yaml, you need to add logviewer.cleanup.interval.secs: <value> for the logviewer cleaner service to work. Restart the logviewer service afterwards.
Your question made me curious, so I did some digging: first through the Storm docs, then through our cluster's logs, then through the Storm source code.
It turns out the logviewer cleanup interval has no default value configured and is initialized with null. This is not mentioned in the docs; however, while examining our own logviewer logs, this line caught my eye:
2020-12-10 13:34:42.129 o.a.s.d.l.u.LogCleaner main [WARN] The interval for log cleanup is not set. Skip starting log cleanup thread.
Looking through the default config file and the Storm sources made it clear that there is no default value and the interval is initialized with null (this file, line 97), in which case the cleanup service is not started at all. It seems to me they forgot to mention this in their docs; had they done so, admins configuring the service would set it as a matter of course.
After setting the value and restarting the logviewer, it immediately started cleaning the files, as I could see in the logs. So thanks for raising this question, it would have slipped my attention otherwise!
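As a rough illustration of the fix (in Python, purely as a sketch), the snippet below scans the text of a storm.yaml for logviewer.cleanup.interval.secs and reports whether an interval is set. The function name and the value 600 are assumptions chosen for the example, not Storm defaults, and the parsing only handles flat `key: value` lines like the ones in the question.

```python
CLEANUP_KEY = "logviewer.cleanup.interval.secs"


def cleanup_interval(storm_yaml_text):
    """Return the configured cleanup interval in seconds, or None if unset."""
    for raw in storm_yaml_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        if key.strip() == CLEANUP_KEY:
            value = value.strip().strip("\"'")
            if value in ("", "null", "~"):
                return None
            return int(value)
    return None


# The settings from the question: no interval is configured.
question_config = """
logviewer.childopts: "-Xmx128m"
logviewer.cleanup.age.mins: 30
logviewer.max.sum.worker.logs.size.mb: 4096
logviewer.max.per.worker.logs.size.mb: 2048
"""

# Same settings plus the missing key (600 is an arbitrary example value).
fixed_config = question_config + "logviewer.cleanup.interval.secs: 600\n"

print(cleanup_interval(question_config))  # None
print(cleanup_interval(fixed_config))     # 600
```

If the first call returns None for your real storm.yaml, the WARN line quoted above is what you should expect to find in the logviewer log.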

Related

How can I start a server in one YAML job and run tests in another while the server job is still running

So I have 2 yml pipelines currently... one starts running the server and after server is up and running I start the other pipeline that runs tests in one job and once that's completed starts a job that shuts down the server from first pipeline.
I'm kinda new to yml and wondering if there is a way to run all this in a single pipeline...
The problem I came across is that if I put the server in the first job, I do not know how to condition the second job to kick off once the server is running. The first job never reaches a succeeded or failed state because it is still in progress: the server has to keep running for the tests to run.
I tried adding a variable that I set to true once the server is running, but the pipeline still never moves on to the next job.
I looked into templates too, but those are not very clear to me, so any suggestion, documentation, or tutorial on how to achieve this in one pipeline would be very helpful.
I already googled a bunch and will keep googling but figured someone here might have an answer already.
Each agent can run only one job at a time. To run multiple jobs in parallel you must configure multiple agents. You also need sufficient parallel jobs.
You can specify the conditions under which each job runs. By default, a job runs if it does not depend on any other job, or if all of the jobs that it depends on have completed and succeeded. You can customize this behavior by forcing a job to run even if a previous job fails or by specifying a custom condition.
Since you have already added a variable that you set to true once the server is running, try a custom condition so that the job runs only when that variable has the expected value.
For more details, see the official docs:
Specify jobs in your pipeline
Specify conditions

Queued jobs are somehow being cached with Laravel Horizon using Supervisor

I have a really strange thing happening with my application that I am really struggling to debug and was wondering if anyone had any ideas or similar experiences.
I have an application running on Laravel v5.8 which uses Horizon to run the queued jobs on an Ubuntu 16.04 server. I have a feature that archives an account, and that work is passed off to the queue.
I noticed that it didn't seem to be working, despite working locally and having had the tests passing for the feature.
My last attempt to debug was commenting out the entire handle method and adding Log::info('wtf?!'); to see if even that would work. It didn't; in fact, the worker was still trying to run the commented-out code. I decided to restart Supervisor and tried again. At last, I managed to get 'wtf?!' written to my logs.
I have since been unable to deploy my code without having to restart supervisor in order for it to recognise the 'new' code.
Does Horizon cache the jobs in any way? I can't see anything in the documentation.
Has anyone experienced anything like this?
Any ideas on how I can stop having to restart supervisor every time?
Thanks
As stated in the documentation:
Remember, queue workers are long-lived processes and store the booted application state in memory. As a result, they will not notice changes in your code base after they have been started. So, during your deployment process, be sure to restart your queue workers.
Alternatively, you may run the queue:listen command. When using the queue:listen command, you don't have to manually restart the worker after your code is changed; however, this command is not as efficient as queue:work.
And as stated in the Horizon documentation:
If you are deploying Horizon to a live server, you should configure a process monitor to monitor the php artisan horizon command and restart it if it quits unexpectedly. When deploying fresh code to your server, you will need to instruct the master Horizon process to terminate so it can be restarted by your process monitor and receive your code changes.
When you restart Supervisor, you are restarting that command and loading the new code, so the behaviour you see is exactly what is expected. To avoid restarting Supervisor itself, run php artisan horizon:terminate as part of your deployment; your process monitor then starts a fresh Horizon process that picks up the new code.

Spark Launcher jobs not starting because token can't be found in cache after 24 hours

I have a Java application which runs continuously and checks a database table for new records. When a new record is added to the table, the Java application unzips a file and puts it into an HDFS location, and then a Spark job is triggered (I am programmatically triggering the Spark job using the SparkLauncher class inside the Java application), which processes the newly added file in the HDFS location.
I have scheduled the Java application on the cluster using an Oozie Java action.
The cluster is a kerberized HDP cluster.
The job works perfectly fine for 24 hours: all the unzips happen and the Spark jobs run.
But after 24 hours, the unzip still happens in the Java application, while the Spark job is no longer triggered in the Resource Manager.
Exception : Exception encountered while connecting to the server :INFO: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (owner=****, renewer=oozie mr token, realUser=oozie, issueDate=1498798762481, maxDate=1499403562481, sequenceNumber=36550, masterKeyId=619) can't be found in cache
As per my understanding, after 24 hours Oozie renews the token, and that token is not updated for the Spark launcher job. The Spark launcher is still looking for the older token, which is no longer available in the cache.
Please help me: how can I make the Spark launcher look for the new token?
As per my understanding, after 24 hours oozie is renewing the token
Why? Can you point to any documentation, source code, blog?
Remember that Oozie is a scheduler for batch jobs, and its canonical use case (at Yahoo!) is for triggering hourly jobs.
Only a pathological batch job would run for more than 24h, therefore renewal of the Hadoop delegation token is not really useful in Oozie.
But your Java thing acts as a service, running continuously, and needing automatic restart if it ever crashes. So you should consider...
either Slider, if you really want to run it inside YARN (although there are many, many drawbacks -- how do you inspect the logs of a running YARN job? how can you make sure that the app starts on time and is not delayed by a lack of resources? how can you make sure that your app will not be killed because YARN needs resources for a high-priority job?) but it is probably overkill for simply running your toy app
or a plain Linux service running on some Edge Node -- it's a Do-It-Yourself task, but not extremely complicated, and there are tutorials on the web
If you insist on using Oozie, in spite of all the limitations of both YARN and Oozie, then you have to change the way your app runs -- for instance, schedule the Coordinator to launch a job every 12h and pass the "nominal time" as Workflow property, edit the Workflow to pass that time to the Java app, edit the Java code so that the app exits at (arg + 11:58) and clears the way for the next exec.
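The "exit at (arg + 11:58)" idea could look roughly like the sketch below. It is written in Python for brevity even though the app in question is Java; the function names, the polling callback, and the timestamp format (an ISO string without a zone suffix) are all assumptions for illustration, not anything Oozie prescribes.

```python
from datetime import datetime, timedelta
import time

RUN_WINDOW = timedelta(hours=11, minutes=58)  # slightly under the 12h schedule


def deadline_for(nominal_time_iso):
    """Compute when this run must exit, given the nominal time passed in as an argument."""
    return datetime.fromisoformat(nominal_time_iso) + RUN_WINDOW


def run_until(deadline, poll_once, now=datetime.now, sleep=time.sleep, pause_secs=30):
    """Call poll_once() repeatedly; return the number of calls once the deadline has passed."""
    polls = 0
    while now() < deadline:
        poll_once()  # e.g. check the table, unzip, launch the Spark job
        polls += 1
        sleep(pause_secs)
    return polls


print(deadline_for("2017-06-30T00:00"))  # 2017-06-30 11:58:00
```

The clock and sleep functions are passed in as parameters only so the loop can be exercised without waiting; a real service would use the defaults and simply return when the window closes, leaving the next scheduled run to start fresh.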

How to debug a Flink application for memory and garbage collection?

I'm using Flink 1.1.4 and have added to flink-conf.yaml the configuration parameters for memory debugging, as stated in Memory and Performance Debugging:
taskmanager.debug.memory.startLogThread: true
taskmanager.debug.memory.logIntervalMs: 1000
After restarting Flink, I'm seeing the new parameters added to the Job Manager interface, but I'm unable to see any new logs.
Any idea about what I may be missing?
It seems this was resolved on the mailing list.
Key extracts, including one confirming that the exact settings were tested successfully:
That is exactly the right way to do it. Logging has to be at least INFO and the parameter "taskmanager.debug.memory.startLogThread" set to true. The log output should be under "org.apache.flink.runtime.taskmanager.TaskManager". Do you see other outputs for that class in the log?
Make sure you restarted the TaskManager processes after you changed the config file.
Someone else just used the memory logging with the exact described settings - it worked. There is probably some mixup, you may be looking into the wrong log file, or may be setting the value in a different config...
How do you start the flink cluster? If it's a standalone cluster and you don't use a shared directory, then you'll find the log of the taskmanager on the machine on which the taskmanager runs. If you use YARN then you can activate log aggregation to retrieve the log easily after the job has finished.

SparkException: Master removed our application

I know there are other very similar questions on Stack Overflow, but those either didn't get answered or didn't help me out. In contrast to those questions, I have put much more stack trace and log file information into this one. I hope that helps, although it made the question sorta long and ugly. I'm sorry.
Setup
I'm running a 9 node cluster on Amazon EC2 using m3.xlarge instances with DSE (DataStax Enterprise) version 4.6 installed. For each workload (Cassandra, Search and Analytics) 3 nodes are used. DSE 4.6 bundles Spark 1.1 and Cassandra 2.0.
Issue
The application (Spark/Shark-Shell) gets removed after ~3 minutes even if I do not run any query. Queries on small datasets run successfully as long as they finish within ~3 minutes.
I would like to analyze much larger datasets. Therefore I need the application (shell) not to get removed after ~3 minutes.
Error description
On the Spark or Shark shell, after idling ~3 minutes or while executing (long-running) queries, Spark will eventually abort and give the following stack trace:
15/08/25 14:58:09 ERROR cluster.SparkDeploySchedulerBackend: Application has been killed. Reason: Master removed our application: FAILED
org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
FAILED: Execution Error, return code -101 from shark.execution.SparkTask
This is not very helpful (to me), that's why I'm going to show you more log file information.
Error Details / Log Files
Master
From the master.log, I think the interesting parts are
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://sparkWorker@172.31.46.48:46715 got disassociated, removing it.
INFO 2015-08-25 09:19:59 org.apache.spark.deploy.master.DseSparkMaster: akka.tcp://sparkWorker@172.31.33.35:42136 got disassociated, removing it.
and
ERROR 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Application Shark::ip-172-31-46-49 with ID app-20150825091745-0007 failed 10 times, removing it
INFO 2015-08-25 09:21:01 org.apache.spark.deploy.master.DseSparkMaster: Removing app app-20150825091745-0007
Why do the worker nodes get disassociated?
In case you need to see it, I attached the master's executor (ID 1) stdout as well. The executor's stderr is empty. However, I think it shows nothing useful for tackling the issue.
On the Spark Master UI I verified to see all worker nodes to be ALIVE. The second screenshot shows the application details.
There is one executor spawned on the master instance while executors on the two worker nodes get respawned until the whole application is removed. Is that okay or does it indicate some issue? I think it might be related to the "(it) failed 10 times" error message from above.
Worker logs
Furthermore I can show you logs of the two Spark worker nodes. I removed most of the class path arguments to shorten the logs. Let me know if you need to see it. As each worker node spawns multiple executors I attached links to some (not all) executor stdout and stderr dumps. Dumps of the remaining executors look basically the same.
Worker I
worker.log
Executor (ID 10) stdout
Executor (ID 10) stderr
Worker II
worker.log
Executor (ID 3) stdout
Executor (ID 3) stderr
The executor dumps seem to indicate some issue with permission and/or timeout. But from the dumps I can't figure out any details.
Attempts
As mentioned above, there are some similar questions, but none of them were answered or helped me solve the issue. Anyway, the things I tried and verified are:
Opened port 2552. Nothing changes.
Increased spark.akka.askTimeout, which makes the Spark/Shark app live longer, but eventually it still gets removed.
Ran the Spark shell locally with spark.master=local[4]. On the one hand this allowed me to run queries longer than ~3 minutes successfully, on the other hand it obviously doesn't take advantage of the distributed environment.
Summary
To sum up, the timeouts and the fact that long-running queries execute successfully in local mode both point to some misconfiguration. However, I cannot be sure, and I don't know how to fix it.
Any help would be very much appreciated.
Edit: Two of the Analytics and two of the Solr nodes were added after the initial setup of the cluster. Just in case that matters.
Edit (2): I was able to work around the issue described above by replacing the Analytics nodes with three freshly installed Analytics nodes. I can now run queries on much larger datasets without the shell being removed. I intend not to put this as an answer to the question as it is still unclear what is wrong with the three original Analytics nodes. However, as it is a cluster for testing purposes, it was okay to simply replace the nodes (after replacing the nodes I performed a nodetool rebuild -- Cassandra on each of the new nodes to recover their data from the Cassandra datacenter).
As mentioned in the attempts, the root cause is a timeout between the master node and one or more workers.
Another thing to try: verify that all workers are reachable by hostname from the master, either via DNS or an entry in the /etc/hosts file.
In my case, the problem was that the cluster was running in an AWS subnet without DNS. The cluster grew over time by spinning up a node, then adding the node to the cluster. When the master was built, only a subset of the addresses in the cluster was known, and only that subset was added to the /etc/hosts file.
When dse spark was run from a "new" node, communication from the master using the worker's hostname failed and the master killed the job.
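One way to run that hostname check from the master is a small script like the sketch below. The function name is hypothetical, and you would supply whatever hostnames your workers actually report; the resolver is a parameter only so the logic can be tried without touching real DNS.

```python
import socket


def unresolvable_hosts(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that cannot be resolved (via DNS or /etc/hosts)."""
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(name)
    return failed


# Example call with hypothetical worker hostnames:
#   unresolvable_hosts(["ip-172-31-46-48", "ip-172-31-33-35"])
```

Any name that comes back in the list is a candidate for a missing /etc/hosts entry on the master.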
