How to find Jenkins build jobs consuming high CPU and memory

Our Jenkins service has been hanging every day; when it hangs we cannot access the web page, and I have to restart the Jenkins service.
In Jenkins I have configured only 4 executors. Sometimes, when 4 build jobs are running, the CPU load is almost 90%, which I suspect is the reason Jenkins dies.
So how can I find which build job consumes high CPU, and how can I find the root cause of why Jenkins dies? I checked the system log but did not find any useful information.
I'm running Jenkins 2.150.1 on Ubuntu 16.04.
Thanks.

We were facing the same issues. We set up Prometheus + Grafana monitoring for Jenkins.
This video is helpful - https://www.youtube.com/watch?v=3H9eNIf9KZs
With this, we were able to pinpoint the times when CPU usage hit its peak and then find the job causing the consumption. We have since switched to running jobs on ECS (AWS), so CPU and memory are allocated separately to each job.
We also added a Jenkins observer script on the machine where Jenkins is hosted: a cron job checks whether Jenkins is down (i.e. returning 502) and restarts it if so.
These things helped us solve our problems with Jenkins downtime.
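For what it's worth, such an observer script can be quite small. The sketch below is an assumption of how theirs might look, not their actual script; the URL and the restart command are placeholders you would adapt to your installation:

```shell
# Hypothetical Jenkins watchdog, suitable for running from cron.
# Probes a URL; if Jenkins answers 502 or not at all, runs a restart command.
check_and_restart() {
    url="$1"         # Jenkins URL to probe
    restart_cmd="$2" # command to run when Jenkins looks dead
    # curl prints only the HTTP status code; a failed connection yields 000
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url" 2>/dev/null) || status=000
    if [ "$status" = "502" ] || [ "$status" = "000" ]; then
        $restart_cmd
    fi
}

# Demo invocation; in real use replace the echo with something like
# "sudo systemctl restart jenkins"
check_and_restart "http://localhost:8080/login" "echo jenkins-restart-needed"
```

A crontab entry such as `*/5 * * * * /usr/local/bin/jenkins-watchdog.sh` (path is illustrative) would run the check every five minutes.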

Related

IPython Notebook with Spark on EC2 : Initial job has not accepted any resources

I am trying to run a simple WordCount job in an IPython notebook with Spark connected to an AWS EC2 cluster. The program works perfectly when I use Spark in local standalone mode, but runs into a problem when I try to connect it to the EC2 cluster.
I have taken the following steps:
I have followed the instructions given in this Supergloo blog post.
No errors occur until the last line, where I try to write the output to a file. [Spark's lazy evaluation means this is when the program really starts to execute.]
This is where I get the error
[Stage 0:> (0 + 0) / 2]16/08/05 15:18:03 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Actually there is no error; we get this warning and the program goes into an indefinite wait state. Nothing happens until I kill the IPython notebook.
I have seen this Stack Overflow post and have reduced the number of cores to 1 and the memory to 512 MB by using these options after the main command:
--total-executor-cores 1 --executor-memory 512m
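As an aside, when driving pyspark from an IPython notebook these options are typically passed through the `PYSPARK_SUBMIT_ARGS` environment variable before launching the notebook; the master URL below is a placeholder, and the resource caps are the ones from the post:

```shell
# Placeholder master URL; set it to your EC2 master before starting the notebook.
export PYSPARK_SUBMIT_ARGS="--master spark://ec2-master:7077 --total-executor-cores 1 --executor-memory 512m pyspark-shell"
# ...then launch the notebook as usual, e.g.: ipython notebook
```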
The screen capture from the Spark UI is as follows:
[screenshot: Spark UI]
This clearly shows that the cores and memory are not being fully utilized.
Finally, I see from this Stack Overflow post that
The spark-ec2 script configures the Spark cluster in EC2 as standalone,
which means it cannot work with remote submits. I struggled with this
same error you describe for days before figuring out that it's not
supported. The error message is unfortunately misleading.
So you have to copy your stuff and log into the master to execute your
spark task.
If this is indeed the case, then there is nothing more to be done, but since that statement was made in 2014, I am hoping that in the last two years the script has been fixed or a workaround has emerged. If there is any workaround, I would be grateful if someone could point it out to me.
Thank you for reading this far, and for any suggestions offered.
You cannot submit jobs except from the master, as you have seen, unless you set up a REST-based Spark job server.

Marathon kills container when requests increase

I deploy Docker containers on Mesos (0.21) and Marathon (0.7.6) on Google Compute Engine.
I use JMeter to test a REST service that runs on Marathon. With fewer than 10 concurrent requests it works normally, but with over 50 concurrent requests the container is killed and Mesos starts another one. I have increased RAM and CPU, but it still happens.
This is the log in /var/log/mesos/:
E0116 09:33:31.554816 19298 slave.cpp:2344] Failed to update resources for container 10e47946-4c54-4d64-9276-0ce94af31d44 of executor dev_service.2e25332d-964f-11e4-9004-42010af05efe running task dev_service.2e25332d-964f-11e4-9004-42010af05efe on status update for terminal task, destroying container: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/612/cgroup: Failed to open file '/proc/612/cgroup': No such file or directory
The error message you're seeing is actually another symptom, not the root cause of the issue. There's a good explanation/discussion in this Apache Jira bug report:
https://issues.apache.org/jira/browse/MESOS-1837
Basically, your container is crashing for one reason or another and the /proc/pid#/ directory is getting cleared without Mesos being aware, so it throws the error message you found when it goes to check that /proc directory.
Try setting your allocated CPU higher in the JSON file describing your task.
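For illustration, a Marathon app definition with a higher `cpus` value might look like the fragment below; the app id, image, and numbers are all made up for the example, not taken from the question:

```json
{
  "id": "/dev_service",
  "cpus": 1.0,
  "mem": 1024,
  "instances": 2,
  "container": {
    "type": "DOCKER",
    "docker": { "image": "my-org/dev-service:latest" }
  }
}
```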

Why a Jenkins job takes longer time to run between farms?

I am using a Jenkins configuration where the same job is executed in different locations: one in farm1 and another in an overseas farm2.
The Jenkins master server is located in farm1.
I am seeing the job on farm2 take much longer to finish, sometimes twice the elapsed time.
Do you have an idea what could be the reason for that?
Is there continuous master-slave communication during the build that could cause such a delay?
The job is a Maven JUnit test plus Selenium UI tests using a VNC server on the slave.
Thanks in advance,
Roy
I assume your server farms have identical hardware specs?
Network differences while checking out code, downloading dependencies, etc.; the workspaces of the master and the slave are on different servers.
If you are archiving artifacts, they are usually archived back on the master, even when the job is run on a slave.
Install the Timestamper plugin, enable it, and then review the logs of both the master and slave runs to see where there is a big time difference (you can configure Timestamper to show time as increments from the start of the job, which would be helpful here).

Build schedule in Jenkins

I am working on a POC, currently using Jenkins as the CI server. I have set up jobs based on certain lifecycle stages, such as test suite and QA, and have configured these jobs as scheduled builds based on cron expressions.
I need to find out what the next scheduled build will be in Jenkins for the jobs I have created. I know the last successful build and the last failed build, but I don't know the next proposed build. Any clues? Or is there a view plugin for this? Sorry if this is a strange request, but I need to find out.
I also need to discover whether there is an issue when more than one job is running concurrently, and what happens in that case. My understanding is that it is not an issue. I do not have any slaves set up; I only have the master.
Jenkins version: 1.441
I found the answer to the first issue!
https://wiki.jenkins-ci.org/display/JENKINS/Next+Executions
So can you help me with the second question, please? Is there any issue with more than one job building concurrently?
Thanks,
Shane.
For the next execution date, take a look at the Next Execution Plugin here.
For your second question:
The number of builds you can run concurrently is configurable in the Jenkins server parameters (http:///configure : executors param).
If the number of executors is reached, each newly triggered job will be added to Jenkins's build queue and will run when one of the running jobs ends.

Cassandra Snapshot and Restart

Being a level 1 novice in Linux (Ubuntu 9), shell, and cron, I've had some difficulty figuring this out. Each night, I'd like to take a snapshot of our Cassandra nodes and restart the process.
Why? Because our team is hunting down a memory leak that requires a process restart every 3 weeks or so. The root cause has been difficult to track down; in the meantime, I'd like to put these cron jobs in place to reduce service interruptions.
Thanks in advance to anyone who has some of these already figured out!
The general procedure is:
Run nodetool drain (http://www.riptano.com/docs/0.6/utilities/nodetool#nodetool-drain) on the node
Run nodetool snapshot
Kill the cassandra process
Start the cassandra process
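The steps above can be sketched as a small script suitable for cron. The commands are passed in as parameters because the right way to stop and start Cassandra varies by install; the values shown are assumptions to adapt:

```shell
# Hypothetical nightly snapshot-and-restart sketch following the steps above.
snapshot_and_restart() {
    nodetool_cmd="$1" # e.g. "nodetool -h localhost"
    stop_cmd="$2"     # e.g. "sudo service cassandra stop"
    start_cmd="$3"    # e.g. "sudo service cassandra start"
    $nodetool_cmd drain &&    # flush memtables, stop accepting writes
    $nodetool_cmd snapshot && # hard-link current SSTables into a snapshot dir
    $stop_cmd &&
    $start_cmd
}

# Dry run with echo so the sequence is visible without touching Cassandra:
snapshot_and_restart "echo nodetool" "echo stop-cassandra" "echo start-cassandra"
```

A crontab entry such as `0 2 * * * /usr/local/bin/cassandra-nightly.sh` (path is illustrative) would run it at 2am; pass the real commands instead of the echo placeholders.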
When running nodetool snapshot, it is very important that you have JNA set up and working. This includes:
Having jna.jar in Cassandra's lib directory and either:
Running Cassandra as root, or
Increasing the memory locking limit using 'ulimit -l' or something like /etc/security/limits.conf
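If you take the limits.conf route rather than running as root, the entries look something like this (assuming Cassandra runs as a `cassandra` user, which may differ on your system):

```
# /etc/security/limits.conf — allow the cassandra user to lock memory for JNA
cassandra soft memlock unlimited
cassandra hard memlock unlimited
```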
If this is all correct, you should see a message about "mlockall" succeeding in the logs on startup.
The other thing to keep an eye on is your disk space usage; this will grow as compactions occur and the old SSTables are replaced (but their snapshots remain).
