Phabricator extremely slow - performance

I am using Phabricator for code reviews, and after some tinkering I have it set up more or less exactly as I want.
I just have one problem that I can't really find a solution to.
Navigating the Phabricator app is smooth and has no delays. But when I write a comment (or choose any other action) in the Leap Into Action panel and press Clowncopterize, it takes forever to complete. The gears (busy indicator) in the lower right corner keep spinning for up to 60 seconds.
I can't figure out what is causing this. I have run top and don't see anything severe:
top - 11:40:36 up 9 min, 1 user, load average: 0.04, 0.10, 0.07
Tasks: 112 total, 1 running, 111 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni, 99.8 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
KiB Mem: 2044148 total, 526580 used, 1517568 free, 36384 buffers
KiB Swap: 2093052 total, 0 used, 2093052 free, 257344 cached
There are no spikes when I press Clowncopterize either. I have made sure DNS is set up correctly (it wasn't to begin with, but it is now), and even a reboot didn't fix the performance problems.

The trouble was that sendmail was set up incorrectly, so Phabricator was waiting for outgoing mail to time out before completing each action.
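Not part of the original fix, but a quick way to test for this failure mode from the outside is to time the local MTA's greeting; a slow or missing banner points at mail as the bottleneck. A minimal Go sketch (host, port, and timeout are illustrative assumptions):

    // probe_smtp.go: check whether the local MTA answers promptly.
    // A healthy server sends its "220 ..." greeting almost immediately.
    package main

    import (
        "bufio"
        "fmt"
        "net"
        "time"
    )

    func main() {
        start := time.Now()
        // Dial the local SMTP port with a short timeout.
        conn, err := net.DialTimeout("tcp", "localhost:25", 5*time.Second)
        if err != nil {
            fmt.Println("connect failed:", err)
            return
        }
        defer conn.Close()

        // Wait (briefly) for the greeting banner.
        conn.SetReadDeadline(time.Now().Add(5 * time.Second))
        banner, err := bufio.NewReader(conn).ReadString('\n')
        if err != nil {
            fmt.Println("no banner within 5s:", err)
            return
        }
        fmt.Printf("got banner %q after %v\n", banner, time.Since(start))
    }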

Related

pprof usage and interpretation

We believe our Go app has a memory leak.
To find out what's going on, we are trying pprof.
We are having a hard time, though, understanding the readings.
When connecting with go tool pprof http://localhost:6060/debug/pprof/heap?debug=1, a sample of the output is:
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) text
Showing nodes accounting for 17608.45kB, 100% of 17608.45kB total
Showing top 10 nodes out of 67
flat flat% sum% cum cum%
12292.12kB 69.81% 69.81% 12292.12kB 69.81% github.com/acct/repo/vendor/github.com/.../funcA /../github.com/acct/repo/vendor/github.com/../fileA.go
1543.14kB 8.76% 78.57% 1543.14kB 8.76% github.com/acct/repo/../funcB /../github.com/acct/repo/fileB.go
1064.52kB 6.05% 84.62% 1064.52kB 6.05% github.com/acct/repo/vendor/github.com/../funcC /../github.com/acct/repo/vendor/github.com/fileC.go
858.34kB 4.87% 89.49% 858.34kB 4.87% github.com/acct/repo/vendor/golang.org/x/tools/imports.init /../github.com/acct/repo/vendor/golang.org/x/tools/imports/zstdlib.go
809.97kB 4.60% 94.09% 809.97kB 4.60% bytes.makeSlice /usr/lib/go/src/bytes/buffer.go
528.17kB 3.00% 97.09% 528.17kB 3.00% regexp.(*bitState).reset /usr/lib/go/src/regexp/backtrack.go
(Please forgive the clumsy obfuscation)
We interpret funcA as consuming nearly 70% of the memory in use, but that only comes to around 12 MB.
However, top shows:
top - 18:09:44 up 2:02, 1 user, load average: 0,75, 0,56, 0,38
Tasks: 166 total, 1 running, 165 sleeping, 0 stopped, 0 zombie
%Cpu(s): 3,7 us, 1,6 sy, 0,0 ni, 94,3 id, 0,0 wa, 0,0 hi, 0,3 si, 0,0 st
KiB Mem : 16318684 total, 14116728 free, 1004804 used, 1197152 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 14451260 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4902 me 20 0 1,371g 0,096g 0,016g S 12,9 0,6 1:58.14 mybin
which suggests 1.371 GB of memory used... where has it gone?
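Part of the answer is that top's VIRT column counts all mapped address space, including memory the Go runtime has reserved but is not actually using, while the heap profile only counts live heap objects. The runtime's own accounting sits between the two; a small sketch, not from the original post, that prints the relevant numbers:

    // memstats.go: print the Go runtime's own memory accounting.
    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)

        // HeapAlloc is roughly what the heap profile shows as in-use space.
        fmt.Printf("HeapAlloc = %d KiB (live heap objects)\n", m.HeapAlloc/1024)
        // HeapIdle is memory the runtime holds but isn't currently using.
        fmt.Printf("HeapIdle  = %d KiB (held by the runtime, unused)\n", m.HeapIdle/1024)
        // Sys is everything obtained from the OS: heap, stacks, GC metadata.
        fmt.Printf("Sys       = %d KiB (total obtained from the OS)\n", m.Sys/1024)
    }

A large VIRT with a much smaller RES is normal for Go binaries, since the runtime reserves large virtual arenas up front without backing them with physical memory.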
Also, the pprof docs are quite sparse, and we are having difficulty even understanding how the tool should be used. Our binary is a daemon. For example:
If we start a reading with go tool pprof http://localhost:6060/debug/pprof/heap, is this a one-time snapshot taken at that particular moment, or an aggregate over time?
Sometimes running text again later in interactive mode seems to report the same values. Are we actually looking at the same values? Do we need to restart go tool pprof... to get fresh values?
Is it a reading of the complete heap, of some specific goroutine, or of a specific point in the stack?
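For what it's worth, here is a minimal sketch of the usual way to expose pprof from a daemon (the port matches the question; the blank import does the registering):

    // daemon.go: expose pprof endpoints from a long-running process.
    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
    )

    func main() {
        go func() {
            // Serve the profiling endpoints on a private port only.
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()

        // ... the daemon's real work runs here ...
        select {}
    }

On the freshness question: go tool pprof downloads the profile once when it starts, and interactive commands such as text keep operating on that single download, so you do need to restart the tool (or fetch the URL again) to get new values. The heap profile covers the whole process, not one goroutine: it is a snapshot of live (sampled) allocations at collection time, while the alloc_space view accumulates allocations since program start.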
Finally, is this interpretation correct (from http://localhost:6060/debug/pprof/):
/debug/pprof/
profiles:
0 block
64 goroutine
45 heap
0 mutex
13 threadcreate
The binary has 64 goroutines running and a total of 45 MB of heap memory?
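As far as I understand it, not quite: the numbers on the /debug/pprof/ index page are record counts, not sizes, so 64 is the number of goroutines but 45 is the number of heap profile records (distinct allocation sites), not megabytes. The goroutine figure is easy to cross-check from inside the process:

    // countcheck.go: cross-check the goroutine count from /debug/pprof/.
    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        fmt.Println("goroutines:", runtime.NumGoroutine())
    }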

What causes this strange drop in performance with a *medium* number of items?

I have just read an article by Rico Mariani that concerns the performance of memory access under different locality, architecture, alignment, and density.
The author built an array of varying size containing a doubly linked list with an int payload, shuffled to a certain percentage. He experimented with this list and found some consistent results on his machine.
Quoting one of the result tables:
Pointer implementation with no changes
sizeof(int*)=4 sizeof(T)=12
shuffle 0% 1% 10% 25% 50% 100%
1000 1.99 1.99 1.99 1.99 1.99 1.99
2000 1.99 1.85 1.99 1.99 1.99 1.99
4000 1.99 2.28 2.77 2.92 3.06 3.34
8000 1.96 2.03 2.49 3.27 4.05 4.59
16000 1.97 2.04 2.67 3.57 4.57 5.16
32000 1.97 2.18 3.74 5.93 8.76 10.64
64000 1.99 2.24 3.99 5.99 6.78 7.35
128000 2.01 2.13 3.64 4.44 4.72 4.80
256000 1.98 2.27 3.14 3.35 3.30 3.31
512000 2.06 2.21 2.93 2.74 2.90 2.99
1024000 2.27 3.02 2.92 2.97 2.95 3.02
2048000 2.45 2.91 3.00 3.10 3.09 3.10
4096000 2.56 2.84 2.83 2.83 2.84 2.85
8192000 2.54 2.68 2.69 2.69 2.69 2.68
16384000 2.55 2.62 2.63 2.61 2.62 2.62
32768000 2.54 2.58 2.58 2.58 2.59 2.60
65536000 2.55 2.56 2.58 2.57 2.56 2.56
The author explains:
This is the baseline measurement. You can see the structure is a nice round 12 bytes and it will align well on x86. Looking at the first column, with no shuffling, as expected things get worse and worse as the array gets bigger until finally the cache isn't helping much and you have about the worst you're going to get, which is about 2.55ns on average per item.
But something quite strange can be seen around 32k items:
The results for shuffling are not exactly what I expected. At small sizes, it makes no difference. I expected this because basically the entire table is staying hot in the cache and so locality isn't mattering. Then as the table grows you see that shuffling has a big impact at about 32000 elements. That's 384k of data. Likely because we've blown past a 256k limit.
Now the bizarre thing is this: after this the cost of shuffling actually goes down, to the point that later on it hardly matters at all. Now I can understand that at some point shuffled or not shuffled really should make no difference because the array is so huge that runtime is largely gated by memory bandwidth regardless of order. However... there are points in the middle where the cost of non-locality is actually much worse than it will be at the endgame.
What I expected to see was that shuffling caused us to reach maximum badness sooner and stay there. What actually happens is that at middle sizes non-locality seems to cause things to go very very bad... And I do not know why :)
So the question is: What might have caused this unexpected behavior?
I have thought about this for some time but found no good explanation. The test code looks fine to me. I don't think CPU branch prediction is the culprit here, as its effects should be observable far earlier than 32k items and should produce a far smaller spike.
I have confirmed this behavior on my box; it looks pretty much exactly the same.
I figured it might be caused by forwarding of CPU state, so I changed the order of row and/or column generation: almost no difference in the output. To make sure, I generated data for a larger continuous sample and put it into Excel for ease of viewing; another independent run for good measure showed a negligible difference.
I put my best theory here: http://blogs.msdn.com/b/ricom/archive/2014/09/28/performance-quiz-14-memory-locality-alignment-and-density-suggestions.aspx#10561107 but it's just a guess, I haven't confirmed it.
Mystery solved! From my blog:
Ryan (Mon, Sep 29 2014, 9:35 AM):
Wait - are you concluding that completely randomized access is the same speed as sequential for very large cases? That would be very surprising!!
What's the range of rand()? If it's 32k that would mean you're just shuffling the first 32k items and doing basically sequential reads for most items in the large case, and the per-item avg would become very close to the sequential case. This matches your data very well.
Reply (Mon, Sep 29 2014, 10:57 AM):
That's exactly it! The documentation for rand confirms it:
The rand function returns a pseudorandom integer in the range 0 to RAND_MAX (32767). Use the srand function to seed the pseudorandom-number generator before calling rand.
I need a different random number generator!
I'll redo it!
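Rico's test program is C++, but the failure mode is easy to demonstrate in a few lines. Here is a small Go sketch (assuming, as in the diagnosis above, that the shuffle draws both swap indices from a generator capped at 32767) showing that a 15-bit rand() never touches the tail of a large array:

    // shuffle15.go: show that a shuffle driven by a 15-bit generator
    // (like C's rand() with RAND_MAX = 32767) only moves the first 32k slots.
    package main

    import (
        "fmt"
        "math/rand"
    )

    func main() {
        const n = 1 << 20     // 1,048,576 slots, far beyond RAND_MAX
        const randMax = 32768 // rand() yields 0..32767 on that platform

        r := rand.New(rand.NewSource(1))
        a := make([]int, n)
        for i := range a {
            a[i] = i
        }

        // A "100% shuffle": n swaps, but both indices come from the capped
        // generator, so slots at or beyond 32768 are never selected.
        for s := 0; s < n; s++ {
            i, j := r.Intn(randMax), r.Intn(randMax)
            a[i], a[j] = a[j], a[i]
        }

        moved := 0
        for i := randMax; i < n; i++ {
            if a[i] != i {
                moved++
            }
        }
        fmt.Printf("slots >= %d that moved: %d of %d\n", randMax, moved, n-randMax)
        // Prints: slots >= 32768 that moved: 0 of 1015808
    }

So in the large cases the traversal was effectively sequential for almost the whole array, which is why the shuffled and unshuffled timings converge.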

Elasticsearch High CPU When Idle

I'm fairly new to Elasticsearch and I've run into an issue that I'm having difficulty even troubleshooting. My Elasticsearch (1.1.1) is currently spiking the CPU even though no searching or indexing is going on. CPU usage isn't always at 100%, but it jumps up there quite a bit and the load is very high.
Previously, the indices on this node ran perfectly fine for months without any issue. This just started today and I have no idea what's causing it.
The problem persists even after restarting ES, and in pure desperation I restarted the server as well, with no effect on the issue.
Here are some stats to help frame the issue. I imagine more information is needed; I'm just not sure what to provide.
Elasticsearch 1.1.1
Gentoo Linux 3.12.13
java version "1.6.0_27"
OpenJDK Runtime Environment (IcedTea6 1.12.7) (Gentoo build 1.6.0_27-b27)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)
One node, 5 shards, 0 replicas
32GB RAM on system, 16GB Dedicated to Elasticsearch
RAM does not appear to be the issue here.
Any tips on troubleshooting the issue would be appreciated.
Edit: Info from top if it's helpful at all.
top - 19:56:56 up 3:22, 2 users, load average: 10.62, 11.15, 9.37
Tasks: 123 total, 1 running, 122 sleeping, 0 stopped, 0 zombie
%Cpu(s): 98.5 us, 0.6 sy, 0.0 ni, 0.7 id, 0.2 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 32881532 total, 31714120 used, 1167412 free, 187744 buffers
KiB Swap: 4194300 total, 0 used, 4194300 free, 12615280 cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2531 elastic+ 20 0 0.385t 0.020t 3.388g S 791.9 64.9 706:00.21 java
As Andy Pryor mentioned, background merging might have been what was causing the issue. Our index rollover had been paused and two of our current indices were over 200 GB. Rolling them over appears to have resolved the issue, and we've been humming along just fine since.
Edit:
The high load while seemingly idle was caused by merges on several very large indices that were no longer being rolled over; the internal process that should have rolled them over weekly had failed. After addressing this oversight, merge times were short and the high load subsided.
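For anyone troubleshooting similar symptoms, the hot threads API is a quick way to see what an apparently idle node is busy with; merge threads show up by name. A minimal sketch, assuming a node on the default localhost:9200:

    // hotthreads.go: dump Elasticsearch's hot threads to stdout.
    package main

    import (
        "fmt"
        "io"
        "net/http"
        "os"
    )

    func main() {
        // Adjust the host/port to match your node.
        resp, err := http.Get("http://localhost:9200/_nodes/hot_threads")
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        defer resp.Body.Close()
        // If merging is the culprit, [merge] threads dominate the output.
        io.Copy(os.Stdout, resp.Body)
    }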

Can't compile using GCC on EC2

I am trying to compile a program using GCC on an AWS EC2 instance (c1.medium). The cc1plus processes start correctly, but after a while they stop using any CPU, and the whole compilation slows down and never finishes.
In top I can see that the "wa" stat increases drastically at the same time as the compilation slows down.
Initially:
%Cpu(s): 88.1 us, 5.4 sy, 0.0 ni, 0.0 id, 0.5 wa, 0.0 hi, 0.0 si, 6.0 st
When the compilation processes slow down:
%Cpu(s): 0.2 us, 0.3 sy, 0.0 ni, 50.2 id, 49.3 wa, 0.0 hi, 0.0 si, 0.0 st
I have tried a lot of different instance types, all with the same result.
As I understand it, a high wa/iowait value means a slow disk. I have therefore also tried compiling the application on different mounts in the EC2 instance, but this did not result in any improvement.
Does anyone have any experience in compiling c/c++ applications on EC2 and know how to solve this problem?
UPDATE 2013-03-06 08:00
As requested in the comments:
$ gcc --version
gcc (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2
The solution was to use a machine with more than 8 GB of RAM. Apparently GCC needed a lot of RAM to compile this specific program, and the resulting memory pressure is presumably what showed up as the high wa value.
Glad to see you found the solution yourself.
I've also noticed you can get this sort of hang-up behavior on a micro instance when doing processor-heavy operations such as compiling code. Always do this kind of work on at least a small instance, and then, if necessary, convert back to a micro when you are done.

Different results in ApacheBench with and without concurrent requests

I am trying to get some statistics on response times from my production server.
When calling ab -n100 -c1 "http://example.com/search?q=something" I get the following results:
Connection Times (ms)
min mean[+/-sd] median max
Connect: 24 25 0.7 24 29
Processing: 526 874 116.1 868 1263
Waiting: 313 608 105.1 596 1032
Total: 552 898 116.1 892 1288
But when I call ab -n100 -c3 "http://example.com/search?q=something" the results are much worse:
Connection Times (ms)
min mean[+/-sd] median max
Connect: 24 25 0.8 25 30
Processing: 898 1872 1065.6 1689 8114
Waiting: 654 1410 765.5 1299 7821
Total: 923 1897 1065.5 1714 8138
Taking into account that the site is in production, so there are requests besides mine, I can't explain why calls with no concurrency are so much faster than calls with even a small amount of concurrency.
Any suggestions?
If you have a concurrency of 1, you are telling AB to hit this URL as fast as it can, using one thread. The value -c3 tells AB to do the same thing but with 3 threads, which will probably result in a greater volume of calls; in your case, that appears to have caused things to slow down. (Note that AB is single-threaded, so it doesn't actually use multiple OS threads, but the analogy still holds.)
It's a bit like having more lanes at a tollbooth: one lane can only process cars so fast, but with three lanes you'll get more throughput. No matter how many lanes you have, though, the width of the tunnel the cars must pass through after the tollbooth also limits throughput, and that is probably what you are seeing.
As a general note, a better approach to load testing is to decide what level of traffic your app needs to support, then design a test that generates that level of throughput and no more. Running threads as fast as they can, like AB does, tends to make any kind of controlled testing hard; JMeter is better suited to this.
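A sketch of that fixed-throughput approach, with a made-up target of 10 requests per second against the URL from the question:

    // fixedrate.go: generate a steady, known request rate, as opposed to
    // AB's as-fast-as-possible hammering. Target and URL are examples.
    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    func main() {
        const rate = 10 // target requests per second
        url := "http://example.com/search?q=something"

        tick := time.NewTicker(time.Second / rate)
        defer tick.Stop()

        for i := 0; i < 100; i++ {
            <-tick.C // pace requests instead of firing them back-to-back
            go func(n int) {
                start := time.Now()
                resp, err := http.Get(url)
                if err != nil {
                    fmt.Printf("#%d error: %v\n", n, err)
                    return
                }
                resp.Body.Close()
                fmt.Printf("#%d %s in %v\n", n, resp.Status, time.Since(start))
            }(i)
        }
        time.Sleep(5 * time.Second) // crude wait for in-flight requests
    }

Pacing with a ticker keeps the offered load constant, so the measured response times reflect the server under a known load rather than whatever AB manages to squeeze through.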
Also, you might want to think about setting up a test server for this sort of thing; it's less risky...
