Insert statement with high buffer gets and high index contention - performance

We have a table with over 300,000,000 rows and two single-column indexes. Every now and then the application comes to a halt. At the same time there is high index contention on the insert statement for this table, and I also see a very large number of buffer gets. Can someone help me remedy this problem?
Here are statistics for the statement when the index contention is high and we are having performance issues.
                     Total         Per Execution   Per Row
Executions           51,857        1               1.00
Elapsed Time (sec)   3,270.67      0.06            0.06
CPU Time (sec)       1,554.41      0.03            0.03
Buffer Gets          140,844,228   2,716.01        2,716.01
Disk Reads           1,160         0.02            0.02
Direct Writes        0             0.00            0.00
Rows                 51,857        1.00            1
Fetches              0             0.00            0.00
For comparison, here is the same statement over the same time range with a similar workload:
                     Total         Per Execution   Per Row
Executions           94,424        1               1.00
Elapsed Time (sec)   30.41         <0.01           <0.01
CPU Time (sec)       12.90         <0.01           <0.01
Buffer Gets          1,130,297     11.97           11.97
Disk Reads           469           <0.01           <0.01
Direct Writes        0             0.00            0.00
Rows                 94,424        1.00            1
Fetches              0             0.00            0.00

There are two ways to look at a primary index:
- as a way to do fast lookups for the most common queries
- as a way to speed up insertions (and possibly deletions)
Most people think of the primary index in the first sense, but there can be only one primary key, since it determines the actual disk order of the rows.
By using a sequence (or a timestamp) as the primary key, you are effectively placing new records very close together (in the same block), so you can get contention, because all concurrent inserts try to go to the same place.
If you instead use the primary key to distribute the data, you will have fewer insert collisions. It can pay to make the primary key the most variable attribute (the one closest to a good distribution), even if that attribute is rarely queried; in fact, adding an extra column with a random value can be used for this.
There is not enough information here about how you use the data, but it might pay to trade a bit of query time to avoid these collisions.
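A minimal sketch of the effect (plain Python, not Oracle-specific; the block size and key values are made-up illustration numbers): keys that arrive in increasing order all land in the same index leaf block, while well-distributed keys spread across many blocks.
import random

KEYS_PER_LEAF = 500  # hypothetical number of index entries per leaf block

def leaf_block(key):
    # In a B-tree on an integer key, the leaf a key lands in follows its sort position.
    return key // KEYS_PER_LEAF

sequential = list(range(1_000_000, 1_000_050))    # 50 concurrent sequence values
scattered = random.sample(range(1_000_000), 50)   # 50 well-distributed key values

print(len({leaf_block(k) for k in sequential}))   # 1 block   -> every insert contends
print(len({leaf_block(k) for k in scattered}))    # ~50 blocks -> little contention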

Related

Why doesn't scikit-learn's LogisticRegression classifier use column-major for coefficients even when it is much faster?

I am using LogisticRegression for a classification problem with a large number of sparse features (tfidf vectors for documents to be specific) as well as a large number of classes. I noticed recently that performance seems to have dramatically worsened when upgrading to newer versions of scikit-learn. While it's hard to trace the exact origin of the performance problem, I did notice when profiling that ravel is called twice, which is taking up a large amount of the time at inference. What's interesting though, is that if I change the coef_ matrix to column-major order with np.asfortranarray, I recover the performance I am expecting. I also noticed that the problem only occurs when the input is sparse, as it is in my case.
Is there a way to make inference fastest with row-major ordering? I suspect you couldn't do this without transposing the input matrix to predict_proba, which would be worse, since then the time spent raveling would be unbounded. Or is there some flag to tell scikit-learn to use column-major ordering so that these calls are avoided during inference?
Example code below:
import scipy.sparse
import numpy as np
from sklearn.linear_model import LogisticRegression
X = np.random.rand(10_000, 10_000)
y = np.random.randint(0, 500, size=10_000)
clf = LogisticRegression(max_iter=10).fit(X, y)
%timeit clf.predict_proba(scipy.sparse.rand(1, 10_000))
# 21.9 ms ± 973 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%prun clf.predict_proba(scipy.sparse.rand(1, 10_000))
# ncalls tottime percall cumtime percall filename:lineno(function)
# 2 0.019 0.010 0.019 0.010 {method 'ravel' of 'numpy.ndarray' objects}
# 1 0.003 0.003 0.022 0.022 _compressed.py:493(_mul_multivector)
clf.coef_ = np.asfortranarray(clf.coef_)
%timeit clf.predict_proba(scipy.sparse.rand(1, 10_000))
# 467 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%prun clf.predict_proba(scipy.sparse.rand(1, 10_000))
# ncalls tottime percall cumtime percall filename:lineno(function)
# 1 0.000 0.000 0.000 0.000 {built-in method scipy.sparse._sparsetools.csr_matvecs}
# 1 0.000 0.000 0.000 0.000 {method 'choice' of 'numpy.random.mtrand.RandomState' objects}
As you can see, converting the matrix to column-major order reduced the runtime of the ravel calls by a large margin.
Sparse matmul is handled by scipy as a.dot(b), and it needs b to be in row-major (C-contiguous) order. In this case, when you call clf.predict_proba() you're calculating p @ clf.coef_.T, and clf.coef_.T is produced by switching between row-major and column-major order (because doing it that way doesn't require a copy).
If clf.coef_ is in row-major order (which it will be after the model is fit), clf.coef_.T is column-major, and calling clf.predict_proba() requires it to be fully copied in memory (in this case, by .ravel()) to get it back into row-major order.
When you turn clf.coef_ into column-major order with clf.coef_ = np.asfortranarray(clf.coef_), clf.coef_.T becomes row-major, and .ravel() is basically a no-op: it returns a view of the existing C-ordered array, so nothing has to be copied.
You have already found the most efficient workaround for this, so I don't know that there's anything else to be done. You could also just make p dense with p.A; the scipy.sparse matmul isn't terribly efficient and doesn't handle edge cases well. This isn't a new thing, and I don't know why you wouldn't have seen it with older scikit-learn.
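To see the copy-versus-view behaviour outside of scikit-learn, you can check what np.ravel does to the transpose of a C-ordered versus Fortran-ordered array; a minimal sketch (the array shape is arbitrary):
import numpy as np

a = np.ones((500, 2000))                  # C-contiguous (row-major), like coef_ right after fit()
print(a.T.flags['F_CONTIGUOUS'])          # True: .T only swaps strides, no copy yet
print(np.shares_memory(np.ravel(a.T), a)) # False: ravel of a non-C-ordered array has to copy

f = np.asfortranarray(a)                  # column-major copy of the same data
print(f.T.flags['C_CONTIGUOUS'])          # True: now the transpose is already C-ordered
print(np.shares_memory(np.ravel(f.T), f)) # True: ravel returns a view, no copy needed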

What causes this strange drop in performance with a *medium* number of items?

I have just read an article by Rico Mariani that deals with the performance of memory access under different locality, architecture, alignment and density conditions.
The author built an array of varying size containing a doubly linked list with an int payload, which was shuffled to a certain percentage. He experimented with this list and found some consistent results on his machine.
Quoting one of the result table:
Pointer implementation with no changes
sizeof(int*)=4 sizeof(T)=12
shuffle 0% 1% 10% 25% 50% 100%
1000 1.99 1.99 1.99 1.99 1.99 1.99
2000 1.99 1.85 1.99 1.99 1.99 1.99
4000 1.99 2.28 2.77 2.92 3.06 3.34
8000 1.96 2.03 2.49 3.27 4.05 4.59
16000 1.97 2.04 2.67 3.57 4.57 5.16
32000 1.97 2.18 3.74 5.93 8.76 10.64
64000 1.99 2.24 3.99 5.99 6.78 7.35
128000 2.01 2.13 3.64 4.44 4.72 4.80
256000 1.98 2.27 3.14 3.35 3.30 3.31
512000 2.06 2.21 2.93 2.74 2.90 2.99
1024000 2.27 3.02 2.92 2.97 2.95 3.02
2048000 2.45 2.91 3.00 3.10 3.09 3.10
4096000 2.56 2.84 2.83 2.83 2.84 2.85
8192000 2.54 2.68 2.69 2.69 2.69 2.68
16384000 2.55 2.62 2.63 2.61 2.62 2.62
32768000 2.54 2.58 2.58 2.58 2.59 2.60
65536000 2.55 2.56 2.58 2.57 2.56 2.56
The author explains:
This is the baseline measurement. You can see the structure is a nice round 12 bytes and it will align well on x86. Looking at the first column, with no shuffling, as expected things get worse and worse as the array gets bigger until finally the cache isn't helping much and you have about the worst you're going to get, which is about 2.55ns on average per item.
But something quite strange can be seen around 32k items:
The results for shuffling are not exactly what I expected. At small sizes, it makes no difference. I expected this because basically the entire table is staying hot in the cache and so locality isn't mattering. Then as the table grows you see that shuffling has a big impact at about 32000 elements. That's 384k of data. Likely because we've blown past a 256k limit.
Now the bizarre thing is this: after this the cost of shuffling actually goes down, to the point that later on it hardly matters at all. Now I can understand that at some point shuffled or not shuffled really should make no difference because the array is so huge that runtime is largely gated by memory bandwidth regardless of order. However... there are points in the middle where the cost of non-locality is actually much worse than it will be at the endgame.
What I expected to see was that shuffling caused us to reach maximum badness sooner and stay there. What actually happens is that at middle sizes non-locality seems to cause things to go very very bad... And I do not know why :)
So the question is: What might have caused this unexpected behavior?
I have thought about this for some time but found no good explanation. The test code looks fine to me. I don't think CPU branch prediction is the culprit here: its effects should be observable far earlier than 32k items and produce a much smaller spike.
I have confirmed this behavior on my box, it looks pretty much exactly the same.
I figured it might be caused by CPU state carried over between runs, so I changed the order of row and/or column generation - almost no difference in the output. To make sure, I generated data for a larger continuous sample. For ease of viewing, I put it into Excel.
Another independent run for good measure showed a negligible difference.
I put my best theory here: http://blogs.msdn.com/b/ricom/archive/2014/09/28/performance-quiz-14-memory-locality-alignment-and-density-suggestions.aspx#10561107 but it's just a guess, I haven't confirmed it.
Mystery solved! From my blog:
Ryan, Mon, Sep 29 2014 9:35 AM:
Wait - are you concluding that completely randomized access is the same speed as sequential for very large cases? That would be very surprising!!
What's the range of rand()? If it's 32k that would mean you're just shuffling the first 32k items and doing basically sequential reads for most items in the large case, and the per-item avg would become very close to the sequential case. This matches your data very well.
Mon, Sep 29 2014 10:57 AM:
That's exactly it!
The rand function returns a pseudorandom integer in the range 0 to RAND_MAX (32767). Use the srand function to seed the pseudorandom-number generator before calling rand.
I need a different random number generator!
I'll redo it!
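A minimal sketch of the failure mode the commenter describes (in Python, with a made-up swap-based shuffle, since the article's exact shuffle code isn't reproduced here): when rand() tops out at 32767, a shuffle driven by rand() % n can only ever touch the first 32768 positions of a large array.
import random

RAND_MAX = 32767  # upper bound of MSVC's rand()
n = 1_000_000

def rand():
    return random.randint(0, RAND_MAX)    # stand-in for the C runtime's rand()

order = list(range(n))
for _ in range(n):                         # nominally a "100% shuffle"
    a, b = rand() % n, rand() % n          # both indices are capped at 32767...
    order[a], order[b] = order[b], order[a]

moved = sum(1 for i, v in enumerate(order) if i != v)
print(moved, "of", n, "items moved")       # ...so only ~32k of the million items ever move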

Average access time with cache misses

The memory access time is 1 nanosecond for a read operation with a hit in cache, 5 nanoseconds for a read operation with a miss in cache, 2 nanoseconds for a write operation with a hit in cache and 10 nanoseconds for a write operation with a miss in cache. Execution of a sequence of instructions involves 100 instruction fetch operations, 60 memory operand read operations and 40 memory operand write operations. The cache hit ratio is 0.9. What is the average memory access time?
The question asks for
(time for 100 fetch operations + 60 operand read operations + 40 operand write operations) / (total number of memory operations).
Total number of memory operations = 100 + 60 + 40 = 200
Time taken for 100 fetch operations (a fetch is a read)
= 100 * ((0.9 * 1) + (0.1 * 5))   // 1 ns is the read time on a cache hit, 0.9 is the hit ratio
= 140 ns
Time taken for 60 operand read operations
= 60 * ((0.9 * 1) + (0.1 * 5))
= 84 ns
Time taken for 40 operand write operations
= 40 * ((0.9 * 2) + (0.1 * 10))   // 2 ns and 10 ns are the write times on a cache hit and a cache miss respectively
= 112 ns
Total time taken for the 200 operations = 140 + 84 + 112 = 336 ns
Average time per operation = 336 / 200 = 1.68 ns
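The same calculation as a short Python check (all numbers taken directly from the question):
hit_ratio = 0.9
read_hit, read_miss = 1, 5        # ns
write_hit, write_miss = 2, 10     # ns

avg_read = hit_ratio * read_hit + (1 - hit_ratio) * read_miss      # 1.4 ns
avg_write = hit_ratio * write_hit + (1 - hit_ratio) * write_miss   # 2.8 ns

fetches, reads, writes = 100, 60, 40
total = (fetches + reads) * avg_read + writes * avg_write          # 140 + 84 + 112 = 336 ns
print(total / (fetches + reads + writes))                          # 1.68 ns average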

ruby method caching performance

I'm using the cache_method gem, and while profiling some critical process in my app I found the following result
6.11 0.01 0.00 6.10 413/413 ActiveSupport::Cache::Strategy::LocalCache#write_entry 364
4.70 0.01 0.00 4.69 388/388 ActiveSupport::Cache::Strategy::LocalCache#delete_entry
Is it possible that 413 cache writes and 388 cache deletes take 10 seconds? That sounds like way too much. Is there any way to improve this with some configuration options?
It's perfectly possible for these operations to take that long. The first thing to suspect is indexing: while updating your cache you are most likely also updating its indexes, and that is the heaviest task in a caching mechanism.
Take a look at your index configuration; depending on its implementation, you may be able to use lazy index refreshes to avoid latency on delete and update operations.
Cheers

Measuring effective bandwidth on CUDA

So I want to know how to calculate the total effective memory bandwidth for:
cublasSdot(handle, M, devPtrA, 1, devPtrB, 1, &curesult);
where that function belongs to cublas_v2.h.
That function runs in 0.46 ms, and each of the two vectors is 10000 * sizeof(float) bytes.
Do I get ((10000 * 4) / 10^9) / 0.00046 = 0.086 GB/s?
I'm wondering because I don't know what goes on inside the cublasSdot function, so I don't know whether this is the right way to count the data it moves.
In your case, the size of the input data is 10000 * 4 * 2 bytes, since you have two input vectors, and the size of the output data is 4 bytes, so the effective bandwidth should be about 0.172 GB/s.
Basically cublasSdot() does little more than the computation itself.
The profiler shows that cublasSdot() invokes two kernels to compute the result. An extra 4-byte device-to-host memory transfer is also issued if the pointer mode is CUBLAS_POINTER_MODE_HOST, which is the default mode for the cuBLAS library.
If the kernel time is in ms, then a multiplication factor of 1000 is necessary.
That results in 86 GB/s.
As an example, refer to the Matrix Transpose sample provided by NVIDIA
at http://docs.nvidia.com/cuda/samples/6_Advanced/transpose/doc/MatrixTranspose.pdf
The entire code is on the last page; there the effective bandwidth is computed as 2.*1000*mem_size/(1024*1024*1024)/(time in ms).
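For reference, here is the 0.172 GB/s figure from the first answer worked out in a short Python snippet (assuming only the two input vectors are read and the single 4-byte result is written, over the 0.46 ms quoted in the question):
n = 10_000
bytes_read = 2 * n * 4        # two input float vectors, 4 bytes per float
bytes_written = 4             # one float result
time_s = 0.46e-3              # 0.46 ms kernel time

effective_bw_gb_s = (bytes_read + bytes_written) / 1e9 / time_s
print(effective_bw_gb_s)      # ~0.17 GB/s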
