How to automatically restart Jelastic containers killed by the OOM killer

I have a Docker image that leaks memory. It's a known issue with this specific tool, and the authors recommend restarting the nodes from time to time as a workaround.
However, a daily restart is not always enough, and some processes get killed by the Jelastic OOM killer. I don't want them killed; I want the container completely restarted. If I had plain Docker running on a machine, I could instruct it to restart the container after an OOM kill or something similar, but in Jelastic I don't have that option.
A simple solution would be to add supervisord or something similar to my setup and let it take care of this, but I'm wondering whether Jelastic offers some out-of-the-box solution for it.
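In the absence of a built-in option, a minimal cron-based watchdog is one way to approximate the supervisord approach. This is only a sketch: the process name my_tool, the start script /path/to/start.sh, and the 3 GB threshold are all placeholders, not anything Jelastic-specific.

  #!/bin/bash
  # watchdog.sh - restart the leaking process before the OOM killer gets to it.
  # "my_tool" and /path/to/start.sh are placeholders; adjust for your image.
  LIMIT_KB=$((3 * 1024 * 1024))   # restart once resident memory exceeds ~3 GB (arbitrary threshold)

  PID=$(pgrep -o my_tool)
  if [ -z "$PID" ]; then
      # process is already gone (possibly OOM-killed) - bring it back up
      /path/to/start.sh &
      exit 0
  fi

  RSS_KB=$(awk '/VmRSS/ {print $2}' /proc/"$PID"/status)
  if [ "$RSS_KB" -gt "$LIMIT_KB" ]; then
      kill "$PID"
      sleep 5
      /path/to/start.sh &
  fi

Run it from cron every few minutes, e.g. */5 * * * * /path/to/watchdog.sh, so the process is recycled proactively instead of being OOM-killed.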

Related

Can SLURM jobs keep running after a computer reboot?

I was running some jobs with SLURM on my PC, and the computer rebooted.
Once the computer was back on, I saw in squeue that the jobs that were running before the reboot were no longer running, due to a drain state. It seemed they had been automatically requeued after the reboot.
I couldn't submit more jobs because the node was drained, so I ran scancel on the jobs that had been automatically requeued.
The problem is that I cannot free the node. I tried a few things:
Restarting slurmctld and slurmd
"undraining" the nodes as explained in another question, but no success. The commands ran without any output (I assume this is good), but the state of the node did not change.
I then tried manually rebooting the system to see if anything would change
Running scontrol show node neuropc gives
[...]
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
[...]
Reason=Low RealMemory [slurm#2023-02-05T22:06:33]
Weirdly, the System Monitor shows that all 8 cores keep having activity between 5% and 15%, whereas the Processes tab shows only one app (TeamViewer) using less than 4% of the processor.
So I suspect the jobs I was running somehow kept running after the reboot, or are still being held by SLURM.
I use Ubuntu 20.04 and Slurm 19.05.5.
To strictly answer the question: no, they cannot. They might or might not be requeued depending on the Slurm configuration, and restarted either from scratch or from the latest checkpoint if the job is able to do checkpoint/restart. But there is no way a running process can survive a server reboot.
This answer solved my problem. Copying it here:
It could be that RealMemory=541008 in slurm.conf is too high for your system. Try lowering the value. Let's suppose you do indeed have 541 GB of RAM installed: change it to RealMemory=500000, do a scontrol reconfigure and then a scontrol update nodename=transgen-4 state=resume.
If that works, you can try raising the value a bit.
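For concreteness, the steps from that answer boil down to roughly the following (the node name transgen-4 and the 500000 figure come from the quoted answer; the slurm.conf path varies by install):

  # in slurm.conf, lower the advertised memory for the node, e.g.
  #   NodeName=transgen-4 ... RealMemory=500000
  sudo scontrol reconfigure                                   # make slurmctld re-read slurm.conf
  sudo scontrol update nodename=transgen-4 state=resume       # clear the DRAIN state
  scontrol show node transgen-4 | grep -E 'State|RealMemory'  # verify the node is back to IDLE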

Server running out of memory with Docker and Windows Nano

I've got a VM running Windows Nano and Docker containers. The Docker containers are all running ASP.NET Core 5 apps. I'm coming across this really weird bug where the VM is running out of memory, and Task Manager and Process Explorer do not show what is taking up all this memory.
What I'm observing is that when I kill one of the containers, a lot of memory gets returned, so it's definitely caused by a Docker container. The question is: how do I go about diagnosing this problem?
I've tried taking a memory dump of the app in the container I killed, but the dump is no larger than 1 GB.
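One place to start (a hedged sketch; <container> is a placeholder for your container name or ID) is comparing what the Docker daemon reports per container with what the host sees:

  docker stats --no-stream                                # per-container memory usage (private working set on Windows)
  docker inspect -f '{{.HostConfig.Memory}}' <container>  # 0 means no memory limit is set on the container
  docker top <container>                                  # processes running inside the container

If docker stats accounts for far less memory than the host is missing, the leak may be outside the app process itself (e.g. in the container host layer) rather than in the dumped process.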

Master Node becomes very slow

Recently the master node of my ICP 2.1.0.1 environment (built on an OpenStack VM) became very slow. Actually it is the Linux VM itself that is slow, not just the ICP product; even simple Linux commands (ls, pwd, cd, etc.) are slow.
The thing is, I wasn't even using this environment: there was no workload and the system was completely idle.
I used top to monitor CPU usage but didn't find any long-running processes taking much time.
How can that happen?
Note that this same issue has occurred at least twice; I just set up the environment and left it there.
Although the root cause has not yet been identified, simply restarting the cluster solved the issue:
https://www.ibm.com/support/knowledgecenter/SSBS6K_2.1.0/manage_cluster/restart_cluster.html
Restarting Docker takes several hours to finish, but afterwards the node becomes fast again.
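For reference, the per-node part of that restart amounts to something like the following (a generic sketch, not the exact procedure from the linked KC page):

  sudo systemctl restart docker                   # restart the Docker daemon; the ICP containers come back with it
  docker ps --format '{{.Names}}\t{{.Status}}'    # watch until all ICP components report as Up again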
I'm not sure whether the same issue will happen again in the future; I'm monitoring it.
Note that I didn't reboot the Linux VM, because the last time I rebooted Linux some Docker containers could not start up successfully, which led me to check the following page:
https://www.ibm.com/support/knowledgecenter/SSBS6K_2.1.0/troubleshoot/restart_master_console.html
Since the ICP 2.1.0 release some new features have been added, so you need to ensure that your VM hardware configuration matches the hardware requirements listed at the link below. Thanks.
https://www.ibm.com/support/knowledgecenter/en/SSBS6K_2.1.0/supported_system_config/hardware_reqs.html

How to improve Jenkins server performance?

Our Jenkins server (a Linux machine) slows down over time and becomes unresponsive. All jobs take an unexpectedly long time (even though they run on slaves, which are separate machines from the server). One thing I have observed is an increase in the number of open files; the number keeps growing, as shown in the image below. Does anyone have a solution to keep this in check without restarting the server? Also, are there any configurations/tweaks that could improve the performance of the Jenkins server?
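One way to keep an eye on the open-file count without restarting anything is to watch the Jenkins process directly. A small sketch, assuming Jenkins runs as the jenkins user from jenkins.war (adjust for your setup):

  JPID=$(pgrep -f -u jenkins jenkins.war | head -n1)
  ls /proc/"$JPID"/fd | wc -l                     # current number of open file descriptors
  grep 'open files' /proc/"$JPID"/limits          # soft/hard limit for the process
  lsof -p "$JPID" | awk '{print $NF}' | sort | uniq -c | sort -rn | head   # what is actually being held open

If the count keeps approaching the limit, either raise the limit for the jenkins user (e.g. in /etc/security/limits.conf) or track down which plug-in or job is leaking the descriptors.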
We have been using Jenkins for four years and we have tried to keep it up to date (Jenkins + plug-ins).
Like you, we experienced some inconvenience, depending on new versions of Jenkins or plug-ins...
So we decided to stop this "continuous" upgrading.
Here are some humble tips:
Avoid technical debt. Update Jenkins as much as you can, but use only "Long Term Support" versions (the latest is 2.138.2)
Back up your entire jenkins_home before any upgrade!
Restart Jenkins every night
Add RAM to your server. Jenkins uses the file system a lot, and this will improve caching
Define the JVM min/max memory parameters with the same value to avoid dynamic reallocation, for example -Xms4G -Xmx4G (see the sketch after these lists)
Add slaves and execute jobs only on the slaves
In addition to the above, you can also try:
Discarding old builds
Distributing the builds across multiple slaves, if possible.
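For the fixed min/max heap tip above, a minimal sketch of where those flags typically live on a Linux package install (the file names and variable names vary by distribution, and newer packages use systemd drop-ins instead, so treat these paths as assumptions):

  # Debian/Ubuntu: /etc/default/jenkins (variable JAVA_ARGS)
  # RHEL/CentOS:   /etc/sysconfig/jenkins (variable JENKINS_JAVA_OPTIONS)
  JAVA_ARGS="-Xms4G -Xmx4G"       # identical min/max heap to avoid dynamic reallocation
  # restart the service so the new heap settings take effect
  sudo systemctl restart jenkins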

Does TeamCity clean-up really require you to shut down the server?

TeamCity seems to completely shut down during the clean-up process, including stopping all active builds. The only option for scheduling seems to be nightly. I have builds that take up to several days to run. Do I have any options other than disabling the scheduled process?
Even for my play server the build process took almost 30 minutes to execute. I'm also a bit worried about what a production server would look like, especially running daily!
You need heavy infrastructure in place for CI if you are planning to make it robust. Check whether the build artifacts are cleared and configured properly in TeamCity. Instead of doing a clean-up in one sitting, you can automate it and remove relatively old build history periodically.
And the answer to your question is no: clean-up does not require you to shut down the server unless it crashes.
Edit: The TeamCity shutdown may be due to any number of reasons I can only speculate on. It might have been shut down because of a crash. But according to the policy given in the JetBrains wiki, there is no mention of the server shutting down, and it is simply not logical to shut down the entire server for a clean-up. Source.
