Our Jenkins server (a Linux machine) slows down over time and eventually becomes unresponsive. All jobs take unexpectedly long, even though they run on slaves that are separate machines from the server. One thing I have observed is a steady increase in the number of open files, as shown in the image below. Does anyone have a solution to keep this in check without restarting the server? Also, are there any configurations/tweaks that could improve the performance of the Jenkins server?
We have been using Jenkins for four years and have tried to keep it up to date (Jenkins + plug-ins).
Like you, we experienced some inconvenience, depending on new versions of Jenkins or plug-ins...
So we decided to stop this "continuous" upgrading.
Here are a few humble tips:
Avoid technical debt. Update Jenkins as much as you can, but use only "Long Term Support" versions (latest is 2.138.2)
Backup your entire jenkins_home before any upgrade!
Restart Jenkins every night
Add RAM to your server. Jenkins uses the file system a lot, and the extra memory will improve caching
Define the JVM min/max memory parameters with the same value to avoid dynamic reallocation, for example: -Xms4G -Xmx4G (a sketch of this and the nightly restart follows this list)
Add slaves and execute jobs only on slaves
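A minimal sketch of the restart and memory tips, assuming a systemd-based Linux host and the Debian package layout (file locations vary by distribution):
# /etc/default/jenkins -- fixed-size heap, as suggested above
JAVA_ARGS="-Xms4G -Xmx4G"
# root crontab -- nightly restart (the 03:00 time is arbitrary)
0 3 * * * /bin/systemctl restart jenkins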
In addition to the above, you can also try:
Discarding old builds
Distributing the builds across multiple slaves, if possible.
I am currently using JDK Flight Recorder with JDK 11 and ran into some trouble on our CI/CD platform. Unfortunately, there is not much documentation on the new Flight Recorder, but rather on the older one from earlier Java versions.
When I try to start tests directly from the IDE, everything works fine and I get my recording files.
When I try to do the same thing automatically on the CI/CD platform, it times out and produces a lot of different indefinite failures, among them: trouble creating the file, the file not being written at all, etc.
The JVM commands I used are the following (I put extra spaces for better readability):
-XX:+FlightRecorder
-XX:StartFlightRecording= name="UiTestServer", settings="profile", dumponexit=true, filename=""+System.getenv("CI_PROJECT_DIR") + "flightRecording/javaFlightRecorder.jfr"
The commands are the same ones the IDE uses automatically when starting a flight recording via right-click on the specified test.
Does anybody know whether Flight Recorder has problems with such systems, or with specific services that might run in parallel with it? I have heard of some profiling tools that are unable to run on CI platforms.
If you need more detail, just ask me, though it may be that I cannot share anything specific to the project.
A bit late as an answer, but JFR can definitely run in CI/CD environments. I have successfully attached JFR to our JMH microbenchmarks and published the results as artifacts in Atlassian Bamboo. Our Bamboo agents run on AWS, so JFR itself should be fine for most cloud environments.
JFR has been built to work in production systems, but if you want guarantees of low overhead (<1%), you should use the default settings, not 'profile'.
'profile' is meant for shorter periods of time, e.g. 10 minutes, where additional overhead may be acceptable in exchange for more insight.
This is what I would recommend, for JDK 11 and later:
$ java -XX:StartFlightRecording=filename=/path
There is no need to set dumponexit=true if a filename has been specified.
-XX:+FlightRecorder is only needed before JDK 8u40.
You can set a name if you like, but it's typically not needed. If you want to use jcmd and dump a recording, the name can be omitted.
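Applied to the question above, a minimal sketch for a shell-based CI step (app.jar is a placeholder, and CI_PROJECT_DIR is assumed to be set by the CI runner; creating the directory up front gives the dump somewhere to go):
mkdir -p "$CI_PROJECT_DIR/flightRecording"
java -XX:StartFlightRecording=filename="$CI_PROJECT_DIR/flightRecording/javaFlightRecorder.jfr" -jar app.jar
# for a short, deeper look, use the 'profile' settings with a bounded duration
java -XX:StartFlightRecording=settings=profile,duration=10m,filename="$CI_PROJECT_DIR/flightRecording/profile.jfr" -jar app.jar
The key point is to let the shell expand the environment variable: the Java-style string concatenation in the original flag is passed to the JVM verbatim, which cannot produce a valid file name.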
We are running TeamCity with 20-ish .NET build configurations. All of them use MSBuild 4.0. We recently moved one of our build agents from a machine running Windows Server 2008 to a new physical server running Windows Server 2012, and since this migration all the builds take almost twice as long to finish as on the old server! The new server is more powerful than the old one in terms of CPU, RAM, and disk. We have run benchmarks on both servers, all of them confirming that the new server should be more capable than the old one.
According to the build logs, all the stages of the build are slower, so it's not just one of the build steps.
The first thing we did was check the CPU utilization during builds, and it is suspiciously low: only 3-6% usage on some of the cores. RAM usage was also very low. Could some build agent configuration we have overlooked be slowing down the builds?
Note: the new server is running as a virtual machine. At first we thought that was the reason, but then it should have shown up in the benchmarks. This is the only virtual machine on this physical server, and it has almost all hardware resources dedicated to it. It would be interesting to hear whether any of you have had similarly bad experiences running a build server on a virtual machine. We have also tried booting "natively" from the VHD image, without any difference in the build times.
I know this could be a very tricky one for external people to debug, but I was hoping someone could suggest where to look for the issue, as we are kind of stuck right now.
Edit: We tried activating the Performance Monitor tool in TeamCity, and it shows that CPU, RAM, and disk usage all stay comfortably below 10% (disk access peaked at 35% a couple of times during the build).
TeamCity seems to shut down completely during the clean-up process, including stopping all active builds. The only scheduling option seems to be nightly. Do I have any options other than disabling the scheduled process?
Even on my toy server, the build process took almost 30 minutes to run. I'm a bit worried about what a production server would look like, especially with this running daily!
You need solid infrastructure in place for CI if you plan to make it robust. Check whether the build artifacts are cleared and configured properly in TeamCity. Instead of doing the cleanup in one go, you can automate it and periodically remove the older build history.
And the answer to your question is no: cleanup does not require you to shut down the server unless it crashes.
Edit: The TeamCity shutdown may be due to any number of reasons I can only speculate about. It might have shut down because of a crash. But according to the policy described in the JetBrains wiki, there is no mention of a server shutdown, and it makes no sense to shut down the entire server for a cleanup. Source.
I have a requirement to run a script on all available slave machines, primarily so they get relevant Windows hotfixes and new third-party tools before building.
The script I have can be run multiple times without undesirable side effects and is quite lightweight, so I'm happy for this to be brute force if necessary.
Can anybody give suggestions as to how to ensure that a slave is 'up-to-date' before it works on a job?
I'm happy with solutions that are driven by a job on the master, or ones which can inject the task (automatically) before normal slave job processing.
My shop does this as part of the slave launch process. We have the slaves configured to launch via execution of a command on the master; this command runs a shell script that rsyncs the latest tool files to the slave and then launches the slave process. When there is a tool update, all we need to do is restart the slaves or the master.
However, we use Linux, whereas it looks like you are on Windows, so I'm not sure what the equivalent solution would be for you.
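For what it's worth, here is a rough sketch of the Linux side (host name and paths are made up):
#!/bin/sh
# Launch script run by the master's "launch slave via execution of a command" option.
# Sync the tools first, then start the slave JVM; its stdin/stdout become the slave channel.
rsync -az /opt/buildtools/ build@slave01:/opt/buildtools/
exec ssh build@slave01 java -jar /var/jenkins/slave.jar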
To your title: either use the Parameter Plugin or use a matrix configuration and list your nodes in it.
To your question about ensuring a slave is reliable: we mark it with a 'testbox' label and try out a variety of jobs on it. You could also have a job that is deployed to all of them and have the job take the machine offline if it fails, I imagine (a rough sketch follows below).
Using Windows for slaves is very obnoxious for us too :(
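The take-it-offline idea could be a post-build shell step along these lines (the server URL is made up; offline-node is a standard Jenkins CLI command, and NODE_NAME is set by Jenkins for each build):
java -jar jenkins-cli.jar -s http://jenkins.example.com/ offline-node "$NODE_NAME" -m "failed canary job"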
We have a build machine running in our development department, which we've set up to build continuously throughout the working day.
What this does is:
Deletes the source code previously checked out (5 minutes)
Does a clean checkout from subversion (15 minutes)
Builds a whole bunch of C++ and .NET code (35 minutes)
Builds installers and runs unit tests (5 minutes)
Given the above, what sort of impact would adding different hardware have on improving the time it takes to do the above?
For example, I was thinking about using an SSD for the hard disk, as compiling involves a lot of random disk access.
The Subversion server is currently a virtual machine - would switching it to a physical machine help with the slow checkout?
What impact would upgrading from a Core 2 Duo processor to an i7 make?
Any other suggestions on speeding up the above?
One trick that might speed up the SVN checkout process could be to keep a working copy on the build machine, update the working copy, and do an svn export from the working copy to the build directory. This should reduce the load on the SVN server and reduce network traffic.
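A sketch of that update-then-export idea, with hypothetical paths:
svn update /ci/workingcopy
svn export --force /ci/workingcopy /ci/build/src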
Another trick, to cut the first 5 minutes of cleaning, could be to move the old build dir to a temp folder on the same disk and then have a background task delete it when the main build completes (this could be a nightly cleanup task).
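For example (again with hypothetical paths; a rename within the same disk is near-instant):
mv /ci/build/src /ci/trash/src.$(date +%s)
# later, from the nightly cleanup task:
rm -rf /ci/trash/*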
I think you've made good suggestions yourself. Definitely add a faster hard-drive (SSD or otherwise) and upgrade the CPU as well. I think your code repository (Subversion) should definitely be on a physical machine, ideally separate from your build machine. I think you'll notice a big difference after upgrading the hardware. Also, make sure the machine doesn't have any other large tasks running at the same time as the build tasks (e.g. virus scanning) so that the build tasks aren't slowed down.
How is your build machine setup to execute its tasks? Are you using continuous integration software? Is the machine itself a server or just a regular desktop machine?
Another way to speed up SVN is to use the binary svn:// protocol (svnserve) instead of HTTP.
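For example (server and repository names are made up):
svn checkout svn://svn.example.com/repo/trunk build-src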
It looks like the build itself is the most time-consuming part, which makes it the best candidate for optimization. What about a parallel build spread over other machines in the office? Products like Incredibuild can significantly improve compilation times.