For the last few days our GitLab (CE) instance has been running slowly. We have a hook for CI with Jenkins, and we installed GitLab with OmniAuth. I don't have any more ideas about this, because we didn't change anything on our instances.
We are new to the GitLab environment. We have been working with GitLab since December 2016 and have never faced this kind of issue before. I hope I can fix this problem with your help.
See the image below for our GitLab details.
How can I overcome this issue?
These are just some as-is suggestions offered without warranty, but they may help guide you to solving the problem.
Occam's Razor
You mentioned that these issues appear to have started only recently. This means that the VERY FIRST place to look is what may have changed around the time these issues started. If you have change control for your infrastructure, start there. Make absolutely sure nobody has changed anything around the time these issues started happening. Check your logs for any warnings that may have started showing up. If your OS has a security log or logs configuration changes, check those. If you don't have good visibility/audit-ability into your environment, this may be hard, but if you can identify something that changed around the same time as these issues started occurring, that is most often going to be your problem.
Specificity
It may be helpful for you to describe what you mean by it getting slow. Is it a specific operation that is slow? Or is it all activity? If it's something specific, like triggering a Jenkins job, then you can start to isolate your search there.
It can also help to run top on your server to get a picture of what might be causing the issues. There might be a specific process running on the machine that is dominating everything else and eating all of the resources.
Hardware
The first thing I would check is that your hardware configuration matches the 'Hardware requirements' guidelines on GitLab's website:
https://docs.gitlab.com/ce/install/requirements.html#requirements
Based on what you've posted, the CPU and memory on your system seem adequate for several thousand users, so I'm going to assume this isn't a problem, but in case you do have thousands of users, I will add some brief information on this. Your disk configuration (other than size) is not presented in the information above, so we don't know if that is sufficient or not.
I would recommend running vmstat on the server (since it's GitLab, I am assuming this is running on Linux, as they do not recommend Windows installations) to get some basic information about what is going on. The vmstat command will give you several columns of information. At the very left there should be a column 'r'. This is the 'run queue', or the number of processes that are waiting to be run on a CPU. If the value in that column is large compared to the number of cores the system has, you probably have a CPU bottleneck. The next column, 'b', is processes that are blocked (usually waiting on I/O). If this is large, you probably don't have a CPU bottleneck. To the right there are CPU columns: us, sy, id, and on recent versions wa, or something along these lines. These columns are a breakdown of where the CPU is spending its time: in application code (us), in OS code (sy), sitting idle (id), or waiting on I/O (wa). High percentage numbers in us generally indicate that you either are running healthily or have a CPU bottleneck. High percentage numbers in sy usually indicate some kind of contention, possibly a configuration issue like having too many worker threads configured for the number of CPUs you have. A high percentage number in id or wa usually indicates that the system either isn't doing much, or can't do much because it's waiting on something like disk or an external database.
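For reference, here is roughly what vmstat output looks like on a typical Linux box (the column layout varies a little between versions, and these numbers are purely illustrative, not from your system):

```
$ vmstat 5 3
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812344  96520 2345678    0    0     5    12  150  300 20  5 74  1  0
 6  0      0 810112  96520 2345912    0    0     2     8  900 2100 95  4  1  0  0
 0  4      0 809876  96520 2346100    0    0  4200   350  400  800  5  3 20 72  0
```

In this made-up sample, the second data row would suggest a CPU-bound interval (high r, high us), while the third suggests an I/O-bound one (processes blocked in b, high wa, heavy bi).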
So if the 'b', 'wa', and/or 'id' columns in your vmstat output have high numbers, we may want to consider the possibility of an I/O bottleneck. Here are a couple of introductory articles on evaluating Linux I/O for bottlenecks that might help you determine if this is the case:
https://bartsjerps.wordpress.com/2011/03/04/io-bottleneck-linux/
http://www.linux-mag.com/id/2001/
These articles should get you pointed in the right direction to help you decide if your disks aren't fast enough.
One thing to note: if you're seeing what appears to be a CPU bottleneck (high r values, high us values), make sure that situation makes sense for the number of users you have. The CPU bottleneck may be caused by a virtualization issue, or by some OS issue causing the CPU to perform poorly, not just by the CPU hardware itself being insufficient.
Topology
One thing mentioned in the GitLab requirements I linked to above is that it is not recommended to run GitLab Runner on the same box as GitLab itself. I would say this is true for any CI software working with GitLab. If you're running GitLab Runner or Jenkins on the same box as GitLab itself, you should consider moving those to their own hardware.
If you have thousands of users, you may wish to get in contact with GitLab themselves and have consulting on how to get an enterprise-grade cluster stood up and what that looks like. There are people who are experts in the specific hardware configurations that make sense for a very large GitLab installation, and I am not one of them. However, if you don't have a large number of users, the hardware you have is probably not the issue.
Software
If you're running things like vmstat and iostat and you're not finding any specific hardware bottleneck, there may be a configuration issue. Make sure you have an appropriate number of Unicorn workers configured for your CPU count, so that the box can properly utilize your hardware.
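As a rough sketch of where that lives on an Omnibus-packaged install of that era (the key names and values here are illustrative; check your own /etc/gitlab/gitlab.rb and the GitLab docs for your version before copying anything):

```ruby
# /etc/gitlab/gitlab.rb -- illustrative values, not a recommendation
# A common rule of thumb is roughly (CPU cores + 1) Unicorn workers.
unicorn['worker_processes'] = 5    # e.g. for a 4-core box
unicorn['worker_timeout']   = 60   # seconds before a stuck worker is recycled

# Then apply the change with: sudo gitlab-ctl reconfigure
```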
External bottlenecks
Make sure things like network speed on the server are sufficient for its needs. Make sure users trying to reach the server aren't being bottlenecked by a misconfigured network. If you're using OmniAuth, make sure the provider is performing correctly. For example, if you're using some external authentication, and that isn't scaling/performing well, you'll get bad performance in GitLab as well. These are especially important to look at if you're not seeing much hardware utilization using the methods above.
Two aspects of the latest version, GitLab 12.10 (April 2020), can help accelerate GitLab:
the application server, which switches back to Puma
the caching of Git info/refs
On the last point:
When fetching changes from a Git repository, the server advertises a list of all the branches and tags in the repository, known as refs.
In some instances, we have observed up to 75% of all requests to the GitLab web server are requests for the refs.
In the best case, when all the refs are packed, this is an inexpensive operation.
However, when there are unpacked refs, Git must iterate over the unpacked refs. This causes additional disk I/O, which is slow when using high latency storage like NFS.
In GitLab 12.10, info/refs are cached to improve the performance of ref advertisement and decrease the pressure on Gitaly in situations where refs are fetched very frequently.
In testing this feature on GitLab.com, we observed read operations outnumber write operations 10 to 1, and saw median latency decrease by 70%.
For GitLab instances using NFS for Git storage, we expect even greater improvements.
See Documentation and Issue.
I'm using the excellent Fog gem to access just the Rackspace Cloud Files service. My challenge is that I'm trying to keep the service that is accessing Cloud Files lightweight, and it seems that Fog through its flexibility has a lot of dependencies and code I'll never need.
Has anybody tried to build a slimmed-down copy of Fog, just to include a subset of providers, and therefore limit the dependencies? For example, for the Rackspace Cloud Files API exclusively, I'd expect to be able to handle everything without the net-ssh, net-scp, and nokogiri gems, and without all the unused code for Amazon and the 20-odd other providers I'm not using. I'm hoping to avoid upgrading the gem every time one of those unused providers notices a bug, while keeping my memory footprint down.
I'd appreciate any experience anybody may have in doing this, or advice from anybody familiar with building Fog in what I can and can't rip out.
If I'm just using the wrong gem, then that's equally fine. I'll move to something more focused.
I'm the maintainer of fog so I'll chime in to help fill in some of the explanation/gaps. The short answer is, yeah it's a lot of stuff, but mostly it shouldn't impact you negatively.
First off, fog grew pretty organically over time, so it did get bigger than intended. One of the ways we contend with this is that we rather aggressively avoid requiring/loading files until they are really needed. So although installing fog downloads lots of provider files you won't use, they shouldn't actually end up in memory. That was the simplest thing we could do to keep things "just working" while also reducing memory usage (and load time).
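As a small illustration of that lazy loading (a minimal sketch using fog's Rackspace storage provider, as in the question; the credentials are placeholders):

```ruby
require 'fog'   # the gem ships every provider, but little is loaded yet

# Only at this point are the Rackspace Cloud Files pieces required.
storage = Fog::Storage.new(
  :provider           => 'Rackspace',
  :rackspace_username => 'my_username',  # placeholder
  :rackspace_api_key  => 'my_api_key'    # placeholder
)

dir = storage.directories.create(:key => 'backups')
dir.files.create(:key => 'hello.txt', :body => 'hello world')
```

The other providers' files stay on disk unloaded, so the memory cost is much smaller than the install size suggests.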
The release schedule doesn't tend to be too crazy (on average about once a month) and tends to include a mix of stuff across most of the providers. So I'd expect you won't have too much churn here (outside of emergency/security type fixes which might warrant shortening the normal cycle).
So, that hopefully provides some insight into the state of the art. We have also discussed starting to split things out more in the longer term. I think if/when that happens we would end up with something like fog-rackspace for all the Rackspace-related things, and then they could share things through fog-core or similar. We have a rough outline, but it is a pretty big undertaking without a huge upside, so it isn't something we have really actively begun.
Hope that helps, certainly happy to discuss further if you have further questions or concerns.
I work for Rackspace on, among other things, our Ruby SDKs. You're using the right gem. Fog is our official Ruby API.
This is possibly something that could be done by introducing another gemspec into the project that builds from only fog core and the Rackspace-specific files. This would be unconventional, though, and would make #geemus' (the gem maintainer) release process more complicated, especially if other providers started to do the same. Longer term, it would serve to divert the fog community away from acting as a unified API.
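For concreteness, the kind of extra gemspec being floated might look roughly like this (entirely hypothetical: the gem name, file layout, and dependency list are invented for illustration and were not part of the project at the time):

```ruby
# fog-rackspace.gemspec (hypothetical sketch)
Gem::Specification.new do |s|
  s.name     = 'fog-rackspace'
  s.version  = '0.0.1'
  s.summary  = 'fog core plus only the Rackspace providers'
  s.authors  = ['someone']

  # Ship only the shared core code and the Rackspace provider files.
  s.files    = Dir['lib/fog/core/**/*'] +
               Dir['lib/fog/rackspace/**/*'] +
               ['lib/fog/core.rb', 'lib/fog/rackspace.rb']

  # Trimmed dependencies: HTTP/JSON only, no net-ssh/net-scp,
  # which matter mainly to compute providers that bootstrap over SSH.
  s.add_dependency 'excon'
  s.add_dependency 'multi_json'
end
```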
I have a Drupal-based website (Drupal version 6.19). It is a very content-heavy website (about 400K articles in it).
Following rule one of using Drupal, I didn't make any changes to core, but I have a lot of enabled modules and some of them were customized.
Now I am suffering performance problems and I need to improve things. I have never used Pressflow before, but I have read some articles saying that Pressflow is better than Drupal. Is it safe to upgrade from Drupal to Pressflow? And if so, how do I do it?
Thanks for your help
Pressflow adds the following features to Drupal.
Support for database replication
Support for Squid and Varnish reverse proxy caching
Optimization for MySQL
Optimization for PHP 5
Pressflow is a 100% API-compliant replacement for your standard Drupal core. There are no database schema changes. So long as you are running a normal Drupal core and meet the other system requirements (PHP 5.x, MySQL 5.x), Pressflow is a "drop-in" replacement.
Short answer: probably not. Especially since you state that you "have a lot of enabled modules and some of them were customized."
Longer answer: Pressflow's changes are relatively small and hardly break the APIs. However, there are some incompatibilities, mostly in the areas of database access and caching. Modules that knowingly or unknowingly don't play by Drupal's coding guidelines are especially likely to break. My suggestion: just try it; if a module breaks, fix it (and file a patch).
But the real question is: are you going to benefit from Pressflow? It is not simply "better". It allows database replication, such as load balancing or master-slave setups. Do you intend to use that?
It introduces better support for caching proxies. Are you planning to run a squid or some other caching proxy?
It has some small changes in, for example, the area of caching, that may (but may not) help you; depending on your current usage.
My suggestion: first see how to improve performance without Pressflow. Then, once you come across an area where Drupal is of little help, but which is "fixed" in Pressflow, consider changing.
Few modules have issues with Pressflow, and if they do, someone else has probably already found them. Try searching to see whether any of your modules are incompatible.
It actually slowed our websites down. This is due to too many modules being set up and no caching of our blocks. I am working through things now, trying to set up caching and memcache. One issue I have, though, is that our editors want to see changes immediately, so some of this might be training. The other issue is that we have the fimage module set up and it does not work with the minimum cache lifetime setting, so we do not get that benefit at all. In theory it should speed up your site, but just be aware it might do the opposite.
The question says it all. If you have a bug that multiple users report, but there is no record of the bug occurring in the log, nor can the bug be repeated, no matter how hard you try, how do you fix it? Or even can you?
I am sure this has happened to many of you out there. What did you do in this situation, and what was the final outcome?
Edit:
I am more interested in what was done about an unfindable bug, not an unresolvable bug. Unresolvable bugs are such that you at least know that there is a problem and have a starting point, in most cases, for searching for it. In the case of an unfindable one, what do you do? Can you even do anything at all?
Language
Different programming languages will have their own flavour of bugs.
C
Adding debug statements can make the problem impossible to duplicate, because the debug statement itself shifts the memory layout far enough to avoid the SEGFAULT; these are also known as Heisenbugs. Pointer issues are arduous to track and replicate, but debuggers can help (such as GDB and DDD).
Java
An application that has multiple threads might only show its bugs with a very specific timing or sequence of events. Improper concurrency implementations can cause deadlocks in situations that are difficult to replicate.
JavaScript
Some web browsers are notorious for memory leaks. JavaScript code that runs fine in one browser might cause incorrect behaviour in another browser. Using third-party libraries that have been rigorously tested by thousands of users can be advantageous to avoid certain obscure bugs.
Environment
Depending on the complexity of the environment in which the application (that has the bug) is running, the only recourse might be to simplify the environment. Does the application run:
on a server?
on a desktop?
in a web browser?
In what environment does the application produce the problem?
development?
test?
production?
Exit extraneous applications, kill background tasks, stop all scheduled events (cron jobs), eliminate plug-ins, and uninstall browser add-ons.
Networking
As networking is essential to so many applications:
Ensure stable network connections, including wireless signals.
Does the software reconnect after network failures robustly?
Do all connections get closed properly so as to release file descriptors?
Are people using the machine who shouldn't be?
Are rogue devices interacting with the machine's network?
Are there factories or radio towers nearby that can cause interference?
Do packet sizes and frequency fall within nominal ranges?
Are packets being monitored for loss?
Are all network devices adequate for heavy bandwidth usage?
Consistency
Eliminate as many unknowns as possible:
Isolate architectural components.
Remove non-essential, or possibly problematic (conflicting), elements.
Deactivate different application modules.
Remove all differences between production, test, and development. Use the same hardware. Follow the exact same steps, perfectly, to setup the computers. Consistency is key.
Logging
Use liberal amounts of logging to correlate the time events happened. Examine logs for any obvious errors, timing issues, etc.
Hardware
If the software seems okay, consider hardware faults:
Are the physical network connections solid?
Are there any loose cables?
Are chips seated properly?
Do all cables have clean connections?
Is the working environment clean and free of dust?
Have any hidden devices or cables been damaged by rodents or insects?
Are there bad blocks on drives?
Are the CPU fans working?
Can the motherboard power all components? (CPU, network card, video card, drives, etc.)
Could electromagnetic interference be the culprit?
And mostly for embedded:
Insufficient supply bypassing?
Board contamination?
Bad solder joints / bad reflow?
CPU not reset when supply voltages are out of tolerance?
Bad resets because supply rails are back-powered from I/O ports and don't fully discharge?
Latch-up?
Floating input pins?
Insufficient (sometimes negative) noise margins on logic levels?
Insufficient (sometimes negative) timing margins?
Tin whiskers?
ESD damage?
ESD upsets?
Chip errata?
Interface misuse (e.g. I2C off-board or in the presence of high-power signals)?
Race conditions?
Counterfeit components?
Network vs. Local
What happens when you run the application locally (i.e., not across the network)? Are other servers experiencing the same issues? Is the database remote? Can you use a local database?
Firmware
In between hardware and software is firmware.
Is the computer BIOS up-to-date?
Is the BIOS battery working?
Are the BIOS clock and system clock synchronized?
Time and Statistics
Timing issues are difficult to track:
When does the problem happen?
How frequently?
What other systems are running at that time?
Is the application time-sensitive (e.g., will leap days or leap seconds cause issues)?
Gather hard numerical data on the problem. A problem that might, at first, appear random, might actually have a pattern.
Change Management
Sometimes problems appear after a system upgrade.
When did the problem first start?
What changed in the environment (hardware and software)?
What happens after rolling back to a previous version?
What differences exist between the problematic version and good version?
Library Management
Different operating systems have different ways of distributing conflicting libraries:
Windows has DLL Hell.
Unix can have numerous broken symbolic links.
Java library files can be equally nightmarish to resolve.
Perform a fresh install of the operating system, and include only the supporting software required for your application.
Java
Make sure every library is used only once. Sometimes application containers have a different version of a library than the application itself. This might not be possible to replicate in the development environment.
Use a library management tool such as Maven or Ivy.
Debugging
Code a detection method that triggers a notification (e.g., log, e-mail, pop-up, pager beep) when the bug happens. Use automated testing to submit data into the application. Use random data. Use data that covers known and possible edge cases. Eventually the bug should reappear.
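As a small sketch of that approach (the checked invariant, the notifier, and process_order are all stand-ins for whatever fits your application):

```ruby
require 'logger'

LOG = Logger.new('bug_hunt.log')

# Stand-in for the real code under suspicion.
def process_order(items)
  { items: items, total: items.sum }
end

# Stand-in for whatever condition characterises the reported bug,
# e.g. "the stored total no longer matches the sum of the line items".
def bug_detected?(order)
  order[:total] != order[:items].sum
end

def notify(order)
  LOG.error("Possible occurrence of the mystery bug: #{order.inspect}")
  # could also send an e-mail, page someone, etc.
end

# Hammer the code path with edge cases and random data until the bug reappears.
edge_cases = [[], [0], [-1, 1], [2**31 - 1]]
10_000.times do
  items = edge_cases.sample + Array.new(rand(0..5)) { rand(-1_000..1_000) }
  order = process_order(items)
  notify(order) if bug_detected?(order)
end
```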
Sleep
It is worth reiterating what others have mentioned: sleep on it. Spend time away from the problem, finish other tasks (like documentation). Be physically distant from computers and get some exercise.
Code Review
Walk through the code, line-by-line, and describe what every line does to yourself, a co-worker, or a rubber duck. This may lead to insights on how to reproduce the bug.
Cosmic Radiation
Cosmic rays can flip bits. This is not as big a problem as it was in the past, due to modern error checking of memory. Software for hardware that leaves Earth's protection is subject to issues that simply cannot be replicated due to the randomness of cosmic radiation.
Tools
Sometimes, albeit infrequently, the compiler will introduce a bug, especially for niche tools (e.g. a C micro-controller compiler suffering from a symbol table overflow). Is it possible to use a different compiler? Could any other tool in the tool-chain be introducing issues?
If it's a GUI app, it's invaluable to watch the customer generate the error (or try to). They'll no doubt be doing something you'd never have guessed they were doing (not wrongly, just differently).
Otherwise, concentrate your logging in that area. Log almost everything (you can pull it out later) and get your app to dump its environment as well, e.g. machine type, VM type, encoding used.
Does your app report a version number, a build number, etc.? You need this to determine precisely which version you're debugging (or not!).
If you can instrument your app (e.g. by using JMX if you're in the Java world), then instrument the area in question. Store stats, e.g. requests plus parameters, timestamps, etc. Make use of buffers to store the last 'n' requests/responses/object versions/whatever, and dump them out when the user reports an issue.
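The "keep the last n requests around and dump them when a user reports an issue" idea is easy to sketch in any language; here is a minimal Ruby version (the class and method names are invented for illustration, not any particular framework's API):

```ruby
require 'time'

# Bounded buffer of recent requests, dumped on demand.
class RecentRequests
  def initialize(capacity = 50)
    @capacity = capacity
    @entries  = []
  end

  def record(request, params)
    @entries << { at: Time.now, request: request, params: params }
    @entries.shift if @entries.size > @capacity   # drop the oldest entry
  end

  def dump(io = $stderr)
    @entries.each do |e|
      io.puts "#{e[:at].iso8601} #{e[:request]} #{e[:params].inspect}"
    end
  end
end

RECENT = RecentRequests.new

# In the request-handling path:
#   RECENT.record('GET /search', q: 'foo')
# When a user reports the issue:
#   RECENT.dump
```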
If you can't replicate it, you may fix it, but can't know that you've fixed it.
I've made my best explanation about how the bug was triggered (even if I didn't know how that situation could come about), fixed that, and made sure that if the bug surfaced again, our notification mechanisms would let a future developer know the things that I wish I had known. In practice, this meant adding log events when the paths which could trigger the bug were crossed, and metrics for related resources were recorded. And, of course, making sure that the tests exercised the code well in general.
Deciding what notifications to add is a feasibility and triage question. So is deciding how much developer time to spend on the bug in the first place. It can't be answered without knowing how important the bug is.
I've had good outcomes (didn't show up again, and the code was better for it), and bad (spent too much time not fixing the problem, whether the bug ended up fixed or not). That's what estimates and issue priorities are for.
Sometimes I just have to sit and study the code until I find the bug. Try to prove that the bug is impossible, and in the process you may figure out where you might be mistaken. If you actually succeed in convincing yourself it's impossible, assume you messed up somewhere.
It may help to add a bunch of error checking and assertions to confirm or deny your beliefs/assumptions. Something may fail that you'd never expect to.
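For example, a tiny assertion helper (a sketch; Ruby has no built-in assert, so this one logs and raises when an assumption turns out to be false):

```ruby
class AssertionFailed < StandardError; end

def assert(condition, message)
  return if condition
  warn "ASSERTION FAILED: #{message}"   # visible in logs even if the raise is rescued
  raise AssertionFailed, message
end

# Sprinkle checks over the code you believe "cannot" misbehave:
def apply_discount(price, percent)
  assert(price >= 0,               "price should never be negative (got #{price})")
  assert((0..100).cover?(percent), "discount out of range (got #{percent})")
  price * (100 - percent) / 100.0
end
```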
It can be difficult, and sometimes near impossible. But my experience is that you will sooner or later be able to reproduce and fix the bug if you spend enough time on it (whether that time is worth spending is another matter).
General suggestions that might help in this situation.
Add more logging, if possible, so that you have more data the next time the bug appears.
Ask the users if they can replicate the bug. If yes, you can have them replicate it while you watch over their shoulder, and hopefully find out what triggers the bug.
Make random changes until something works :-)
Assuming you have already added all the logging that you think would help and it didn't... two things spring to mind:
Work backwards from the reported symptom. Think to yourself: "If I wanted to produce the symptom that was reported, what bit of code would I need to be executing, and how would I get to it, and how would I get to that?" D leads to C leads to B leads to A. Accept that if a bug is not reproducible, then normal methods won't help. I've had to stare at code for many hours with these kinds of thought processes going on to find some bugs. Usually it turns out to be something really stupid.
Remember Bob's first law of debugging: if you can't find something, it's because you're looking in the wrong place :-)
Think. Hard. Lock yourself away, admit no interruptions.
I once had a bug where the evidence was a hex dump of a corrupt database. The chains of pointers were systematically screwed up. All the user's programs, and our database software, worked faultlessly in testing. I stared at it for a week (it was an important customer), and after eliminating dozens of possible ideas, I realised that the data was spread across two physical files and the corruption occurred where the chains crossed file boundaries. I realized that if a backup/restore operation failed at a critical point, the two files could end up "out of sync", restored to different time points. If you then ran one of the customer's programs on the already-corrupt data, it would produce exactly the knotted chains of pointers I was seeing. I then demonstrated a sequence of events that reproduced the corruption exactly.
Modify the code where you think the problem is happening, so that extra debug info is recorded somewhere. When it happens next time, you will have what you need to solve the problem.
There are two types of bugs you can't replicate. The kind you discovered, and the kind someone else discovered.
If you discovered the bug, you should be able to replicate it. If you can't replicate it, then you simply haven't considered all of the contributing factors leading towards the bug. This is why whenever you have a bug, you should document it. Save the log, get a screenshot, etc. If you don't, then how can you even prove the bug really exists? Maybe it's just a false memory?
If someone else discovered a bug, and you can't replicate it, obviously ask them to replicate it. If they can't replicate it, then you try to replicate it. If you can't replicate it quickly, ignore it.
I know that sounds bad, but I think it is justified. The amount of time it will take you to replicate a bug that someone else discovered is very large. If the bug is real, it will happen again naturally. Someone, maybe even you, will stumble across it again. If it is difficult to replicate, then it is also rare, and probably won't cause too much damage if it happens a few more times.
You can be a lot more productive if you spend your time actually working, fixing other bugs and writing new code, than you will be trying to replicate a mystery bug that you can't even guarantee actually exists. Just wait for it to appear again naturally, then you will be able to spend all your time fixing it, rather than wasting your time trying to reveal it.
Discuss the problem, read code, often quite a lot of it. Often we do it in pairs, because you can usually eliminate the possibilities analytically quite quickly.
Start by looking at what tools you have available to you. For example, crashes on a Windows platform go to WinQual, so if this is your case you now have crash dump information. Do you have static analysis tools that spot potential bugs, runtime analysis tools, or profiling tools?
Then look at the input and output. Anything similar about the inputs in situations when users report the error, or anything out of place in the output? Compile a list of reports and look for patterns.
Finally, as David stated, stare at the code.
Ask the user to give you remote access to their computer so you can see everything yourself, or ask them to make a small video of how they reproduce the bug and send it to you.
Sure, neither is always possible, but when it is, it may clarify some things. The common ways of finding bugs are still the same: isolating the parts that may cause the bug, trying to understand what's happening, and narrowing down the code that could be causing it.
There are tools like gotomeeting.com which you can use to share a screen with your user and observe the behaviour. There could be many potential problems, such as the number of programs installed on their machine or some utility conflicting with your program. GoToMeeting is not the only such solution, and screen sharing has its own issues, like timeouts or a slow internet connection.
Most of the time, I would say, software does not report correct error messages to you. For example, in the case of Java and C#, track every exception: don't catch them all blindly, but keep a point where you can catch and log. UI bugs are difficult to solve unless you use remote desktop tools. And much of the time it could even be a bug in third-party software.
If you work on a real significant sized application, you probably have a queue of 1,000 bugs, most of which are definitely reproducible.
Therefore, I'm afraid I'd probably close the bug as WORKSFORME (Bugzilla) and then get on with fixing some more tangible bugs. Or do whatever the project manager decides to do.
Certainly making random changes is a bad idea, even if they're localised, because you risk introducing new bugs.