Rocksdb concurrent Put performance - performance

I would like to show some experimental results about Rocksdb Put performance. The fact that single-threaded put throughput is slower than two-threaded put throughput. It is wired because it uses the default skiplist as memtable, and this data structure supports concurrent writes.
Here is my testing code.
uint64_t nthread = 2;
uint64_t nkeys = 16000000;
std::thread threads[nthread];
std::atomic<uint64_t> idx(1000000);
for (int t = 0; t < nthread; t++) {
threads[t] = std::thread([db, &idx, nthread, nkeys, &write_option_disable] {
WriteBatch batch;
for (int i = 0; i < nkeys / nthread; i++) {
std::string key = "WVERIFY" + std::to_string(idx.fetch_add(1));
std::string value = "MOCK";
auto ikey = rocksdb::Slice(key);
auto ivalue = rocksdb::Slice(value);
db->Put(write_option_disable, ikey, ivalue);
}
return 0;
});
}
for (auto& t : threads) {
t.join();
}
Besides, here are the results I got.
// Single thread
Uptime(secs): 8.4 total, 8.3 interval
Flush(GB): cumulative 1.170, interval 1.170
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 1.17 GB write, 143.35 MB/s write, 0.00 GB read, 0.00 MB/s read, 8.1 seconds
Interval compaction: 1.17 GB write, 144.11 MB/s write, 0.00 GB read, 0.00 MB/s read, 8.1 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Block cache LRUCache#0x564742515ea0#7011 capacity: 8.00 MB collections: 1 last_copies: 0 last_secs: 2e-05 secs_since: 8
Block cache entry stats(count,size,portion): Misc(1,0.00 KB,0%)
** File Read Latency Histogram By Level [default] **
** DB Stats **
Uptime(secs): 8.4 total, 8.3 interval
Cumulative writes: 16M writes, 16M keys, 16M commit groups, 1.0 writes per commit group, ingest: 1.63 GB, 199.80 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 16M writes, 16M keys, 16M commit groups, 1.0 writes per commit group, ingest: 1669.88 MB, 200.85 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
// 2 threads
Uptime(secs): 31.4 total, 31.4 interval
Flush(GB): cumulative 0.183, interval 0.183
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.67 GB write, 21.84 MB/s write, 0.97 GB read, 31.68 MB/s read, 10.2 seconds
Interval compaction: 0.67 GB write, 21.87 MB/s write, 0.97 GB read, 31.72 MB/s read, 10.2 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Block cache LRUCache#0x5619fb7bbea0#6183 capacity: 8.00 MB collections: 1 last_copies: 0 last_secs: 1.9e-05 secs_since: 31
Block cache entry stats(count,size,portion): Misc(1,0.00 KB,0%)
** File Read Latency Histogram By Level [default] **
** DB Stats **
Uptime(secs): 31.4 total, 31.4 interval
Cumulative writes: 16M writes, 16M keys, 11M commit groups, 1.4 writes per commit group, ingest: 0.45 GB, 14.67 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 16M writes, 16M keys, 11M commit groups, 1.4 writes per commit group, ingest: 460.94 MB, 14.69 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
===========================update==========================
This is my Rocksdb's setting.
DB* db;
Options options;
BlockBasedTableOptions table_options;
rocksdb::WriteOptions write_option_disable;
write_option_disable.disableWAL = true;
// Optimize RocksDB. This is the easiest way to get RocksDB to perform well
options.IncreaseParallelism();
options.OptimizeLevelStyleCompaction();
// create the DB if it's not already present
options.create_if_missing = true;

The atomic idx shared between two threads can introduced non-trivial overhead. Try inserting random values from each thread, and maybe increase the number of threads.

Related

Why does the same OpenCL code have different outputs from Intel Xeon CPU and NVIDIA GTX 1080 Ti GPU?

I am trying to parallelize Monte Carlo simulation by using OpenCL. I use the MWC64X as a uniform random number generator. The code runs well on different Intel CPUs, since the output of parallel computation is very close to the sequential one.
Using OpenCL device: Intel(R) Xeon(R) CPU E5-2630L v3 # 1.80GHz
Literal influence running time: 0.029048 seconds r1 seqInfl= 0.4771
Literal influence running time: 0.029762 seconds r2 seqInfl= 0.4771
Literal influence running time: 0.029742 seconds r3 seqInfl= 0.4771
Literal influence running time: 0.02971 seconds ra seqInfl= 0.4771
Literal influence running time: 0.029225 seconds trust1-57 seqInfl= 0.6001
Literal influence running time: 0.04992 seconds trust110-1 seqInfl= 0
Literal influence running time: 0.034636 seconds trust4-57 seqInfl= 0
Literal influence running time: 0.049079 seconds trust57-110 seqInfl= 0
Literal influence running time: 0.024442 seconds trust57-4 seqInfl= 0.8026
Literal influence running time: 0.04946 seconds trust33-1 seqInfl= 0
Literal influence running time: 0.049071 seconds trust57-33 seqInfl= 0
Literal influence running time: 0.053117 seconds trust4-1 seqInfl= 0.1208
Literal influence running time: 0.051642 seconds trust57-1 seqInfl= 0
Literal influence running time: 0.052052 seconds trust57-64 seqInfl= 0
Literal influence running time: 0.052118 seconds trust64-1 seqInfl= 0
Literal influence running time: 0.051998 seconds trust57-7 seqInfl= 0
Literal influence running time: 0.052069 seconds trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.71728 seconds
Sequential maxInfluence Literal: trust57-4 0.8026
index1= 17 size= 51 dim1_size= 6
sum0:4781 influence0:0.478100 sum2:4781 influence2:0.478100 sum6:0 influence6:0.000000 sum10:0 sum12:0 influence12:0.000000 sum7:0 influence7:0.000000 influence10:0.000000 sum4:5962 influence4:0.596200 sum8:7971 influence8:0.797100 sum1:4781 influence1:0.478100 sum3:4781 influence3:0.478100 sum13:0 influence13:0.000000 sum11:1261 influence11:0.126100 sum9:0 influence9:0.000000 sum14:0 influence14:0.000000 sum5:0 influence5:0.000000 sum15:0 influence15:0.000000 sum16:0 influence16:0.000000
Parallel influence running time: 0.054391 seconds
Parallel maxInfluence Literal: trust57-4 Infl=0.7971
However, when I run the code on GeForce GTX 1080 Ti, with NVIDIA-SMI 430.40 and CUDA 10.1 and OpenCL 1.2 CUDA installed, the output is as below:
Using OpenCL device: GeForce GTX 1080 Ti
Influence:
Literal influence running time: 0.011119 seconds r1 seqInfl= 0.4771
Literal influence running time: 0.011238 seconds r2 seqInfl= 0.4771
Literal influence running time: 0.011408 seconds r3 seqInfl= 0.4771
Literal influence running time: 0.01109 seconds ra seqInfl= 0.4771
Literal influence running time: 0.011132 seconds trust1-57 seqInfl= 0.6001
Literal influence running time: 0.018978 seconds trust110-1 seqInfl= 0
Literal influence running time: 0.013093 seconds trust4-57 seqInfl= 0
Literal influence running time: 0.018968 seconds trust57-110 seqInfl= 0
Literal influence running time: 0.009105 seconds trust57-4 seqInfl= 0.8026
Literal influence running time: 0.018753 seconds trust33-1 seqInfl= 0
Literal influence running time: 0.018583 seconds trust57-33 seqInfl= 0
Literal influence running time: 0.02005 seconds trust4-1 seqInfl= 0.1208
Literal influence running time: 0.01957 seconds trust57-1 seqInfl= 0
Literal influence running time: 0.019686 seconds trust57-64 seqInfl= 0
Literal influence running time: 0.019632 seconds trust64-1 seqInfl= 0
Literal influence running time: 0.019687 seconds trust57-7 seqInfl= 0
Literal influence running time: 0.019859 seconds trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.272032 seconds
Sequential maxInfluence Literal: trust57-4 0.8026
index1= 17 size= 51 dim1_size= 6
sum0:10000 sum1:10000 sum2:10000 sum3:10000 sum4:10000 sum5:0 sum6:0 sum7:0 sum8:10000 sum9:0 sum10:0 sum11:0 sum12:0 sum13:0 sum14:0 sum15:0 sum16:0
Parallel influence running time: 0.193581 seconds
The "Influence" value equals sum*1.0/10000, thus the parallel influence only composes of 1 and 0, which is incorrect (in GPU runs) and doesn't happen when parallelizing on a Intel CPU.
When I check the output of the random number generator if(flag==0) printf("randint=%u",randint);, it seems the outputs are all zero on GPU. Below is the clinfo and the .cl code:
Device Name GeForce GTX 1080 Ti
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Driver Version 430.40
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Topology (NV) PCI-E, 68:00.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 28
Max clock frequency 1721MHz
Compute Capability (NV) 6.1
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Preferred work group size multiple 32
Warp size (NV) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 11720130560 (10.92GiB)
Error Correction support No
Max memory allocation 2930032640 (2.729GiB)
Unified memory for Host and Device No
Integrated memory (NV) No
Minimum alignment for any data type 128 bytes
Alignment of base address 4096 bits (512 bytes)
Global Memory cache type Read/Write
Global Memory cache size 458752 (448KiB)
Global Memory cache line size 128 bytes
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x32768 pixels
Max 3D image size 16384x16384x16384 pixels
Max number of read image args 256
Max number of write image args 16
Local memory type Local
Local memory size 49152 (48KiB)
Registers per block (NV) 65536
Max number of constant args 9
Max constant buffer size 65536 (64KiB)
Max size of kernel argument 4352 (4.25KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Kernel execution timeout (NV) Yes
Concurrent copy and kernel execution (NV) Yes
Number of async copy engines 2
printf() buffer size 1048576 (1024KiB)
#define N 70 // N > index, which is the total number of literals
#define BASE 4294967296UL
//! Represents the state of a particular generator
typedef struct{ uint x; uint c; } mwc64x_state_t;
enum{ MWC64X_A = 4294883355U };
enum{ MWC64X_M = 18446383549859758079UL };
void MWC64X_Step(mwc64x_state_t *s)
{
uint X=s->x, C=s->c;
uint Xn=MWC64X_A*X+C;
uint carry=(uint)(Xn<C); // The (Xn<C) will be zero or one for scalar
uint Cn=mad_hi(MWC64X_A,X,carry);
s->x=Xn;
s->c=Cn;
}
//! Return a 32-bit integer in the range [0..2^32)
uint MWC64X_NextUint(mwc64x_state_t *s)
{
uint res=s->x ^ s->c;
MWC64X_Step(s);
return res;
}
__kernel void setInfluence(const int literals, const int size, const int dim1_size, __global int* lambdas, __global float* lambdap, __global int* dim2_size, __global float* influence){
int flag=get_global_id(0);
int sum=0;
int count=10000;
int assignment[N];
//or try to get newlambda like original version does
if(flag < literals){
mwc64x_state_t rng;
for(int i=0; i<count; i++){
for(int j=0; j<size; j++){
uint randint=MWC64X_NextUint(&rng);
float rand=randint*1.0/BASE;
//if(flag==0)
// printf("randint=%u",randint);
if(lambdap[j]<rand)
assignment[lambdas[j]]=0;
else
assignment[lambdas[j]]=1;
}
//the true case
assignment[flag]=1;
int valuet=0;
int index=0;
for(int m=0; m<dim1_size; m++){
int valueMono=1;
for(int n=0; n<dim2_size[m]; n++){
if(assignment[lambdas[index+n]]==0){
valueMono=0;
index+=dim2_size[m];
break;
}
}
if(valueMono==1){
valuet=1;
break;
}
}
//the false case
assignment[flag]=0;
int valuef=0;
index=0;
for(int m=0; m<dim1_size; m++){
int valueMono=1;
for(int n=0; n<dim2_size[m]; n++){
if(assignment[lambdas[index+n]]==0){
valueMono=0;
index+=dim2_size[m];
break;
}
}
if(valueMono==1){
valuef=1;
break;
}
}
sum += valuet-valuef;
}
influence[flag] = 1.0*sum/count;
printf("sum%d:%d\t", flag, sum);
}
}
What might be the problem when running the code on GPU? Is it MWC64X? According to its author, it can perform well on NVIDIA GPUs. If so, how can I fix it; if not, what might be the problem?
(This started out as a comment, it turns out this was the source of the problem so I'm turning it into an answer.)
You're not initialising your mwc64x_state_t rng; variable before reading from it, so any results will be undefined:
mwc64x_state_t rng;
for(int i=0; i<count; i++){
for(int j=0; j<size; j++){
uint randint=MWC64X_NextUint(&rng);
Where MWC64X_NextUint() immediately reads from the rng state before updating it:
uint MWC64X_NextUint(mwc64x_state_t *s)
{
uint res=s->x ^ s->c;
Note that you will probably want to seed your RNG differently for each work-item, otherwise you will get nasty correlation artifacts in your results.
All use-cases of a pseudo-random number are a next-level challenge in true-[PARALLEL] computing platforms (not languages, platforms).
Either, there is some source-of-randomness, which gets us into a trouble once massively-parallel requests are to get fair-handled in a truly [PARALLEL] fashion (here, hardware resources may help, yet at a cost of not being able to reproduce the same behaviour "outside" of this very same platform ( and moment-in-time, if such a source is not software-operated with some seed-injection feature, that may setup the "just"-pseudo-random algorithm that creates a pure-[SERIAL] sequence-of-produced "just"-pseudo-random numbers ) )
Or,there is some "shared"-generator of pseudo-random numbers, which enjoys of a higher level of system-wide level-of-entropy (which is good for the resulting "quality" of pseudo-randomness) but at a cost of pure-serial dependence (no parallel execution possible,serial sequence gets served one after another in a sequential manner) and having close to zero chance for repeatable runs (a must for reproducible science) providing repeatably same sequences, needed for testing and for method-validation cases.
RESUME :
The code may employ a work-item-"private" pseudo-random generating function(s) ( privacy is a must for the sake of both the parallel code-execution and the mutual independence (non-intervening processes) of generating these pseudo-random numbers ) , yet each of instances must be a) independently initialised, so as to provide the expected level of randomness achievable in parallelised code-runs and b) any such initialisation ought be performed in a repeatably reproducible manner, for the sake of running the test on different times, often using different OpenCL target computing-platforms.
For __kernel-s, that do not rely on hardware-specific sources-of-randomness, meeting the conditions a && b will suffice for receiving repeatably reproducible (same) results for testing "in vitro" and thus providing a reasonably random method for generating results during generic production-level use-case code-runs "in vivo".
The comparison of net-run-times (benchmarked above) seems to show that Amdahl's law add-on overhead costs plus a tail-end effect of the atomicity-of-work have finally decided the net-run-time was ~ 3.6x faster on XEON compared to GPU:
index1 = 17
size = 51
dim1_size = 6
sum0: 4781 influence0: 0.478100
sum2: 4781 influence2: 0.478100
sum6: 0 influence6: 0.000000
sum10: 0 influence10: 0.000000
sum12: 0 influence12: 0.000000
sum7: 0 influence7: 0.000000
sum4: 5962 influence4: 0.596200
sum8: 7971 influence8: 0.797100
sum1: 4781 influence1: 0.478100
sum3: 4781 influence3: 0.478100
sum13: 0 influence13: 0.000000
sum11: 1261 influence11: 0.126100
sum9: 0 influence9: 0.000000
sum14: 0 influence14: 0.000000
sum5: 0 influence5: 0.000000
sum15: 0 influence15: 0.000000
sum16: 0 influence16: 0.000000
Parallel influence running time: 0.054391 seconds on XEON E5-2630L v3 # 1.80GHz using OpenCL
|....
index1 = 17 |....
size = 51 |....
dim1_size = 6 |....
sum0: 10000 |....
sum1: 10000 |....
sum2: 10000 |....
sum3: 10000 |....
sum4: 10000 |....
sum5: 0 |....
sum6: 0 |....
sum7: 0 |....
sum8: 10000 |....
sum9: 0 |....
sum10: 0 |....
sum11: 0 |....
sum12: 0 |....
sum13: 0 |....
sum14: 0 |....
sum15: 0 |....
sum16: 0 |....
Parallel influence running time: 0.193581 seconds on GeForce GTX 1080 Ti using OpenCL

TOP command overall CPU

I have below top command results in my RHEL 6. It's running PostgreSQL on my server.
I see 35.8% idle in CPU(s) while all the CPU usages below show 100%.
So how should I read below output?
top - 03:06:30 up 97 days, 20:15, 3 users, load average: 10.85, 10.51, 10.13
Tasks: 738 total, 14 running, 724 sleeping, 0 stopped, 0 zombie
**Cpu(s): 53.3%us, 9.6%sy, 0.0%ni, 35.8%id, 0.6%wa, 0.0%hi, 0.7%si, 0.0%st**
Mem: 32077620k total, 24335372k used, 7742248k free, 19084k buffers
Swap: 81919992k total, 407968k used, 81512024k free, 18686780k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19171 enterpri 20 0 8590m 966m 951m R 100.0 3.1 6:24.51 edb-postgres
19588 enterpri 20 0 8590m 956m 941m R 100.0 3.1 1:20.51 edb-postgres
18494 enterpri 20 0 8590m 959m 944m R 99.8 3.1 18:18.75 edb-postgres
18683 enterpri 20 0 8588m 984m 975m R 99.8 3.1 6:22.80 edb-postgres
19158 enterpri 20 0 8592m 1.0g 1.0g R 99.8 3.3 5:40.16 edb-postgres
19167 enterpri 20 0 8589m 959m 945m R 99.8 3.1 7:48.53 edb-postgres
19590 enterpri 20 0 8586m 945m 933m R 99.8 3.0 2:51.32 edb-postgres
19591 enterpri 20 0 8588m 950m 936m R 99.8 3.0 3:07.77 edb-postgres
19592 enterpri 20 0 8589m 948m 935m R 99.8 3.0 2:52.66 edb-postgres
You have a lot of CPUs (how many?) on your system. Some of them are very busy running postgres, and some of them are not.
In your version of top, %CPU represents the percent of a single CPU, not the percent of the total system CPU. If you had a threaded application, one entry could show more than 100%, but PostgreSQL is not threaded within a single process.

Counting swaps for sorting statistics - what with swaps with only two assignments instead of three

While helping out a student with his classes, I implemented the dual pivot quicksort algorithm to prepare a session and got intriged. After running some statistics, then solving the worst case situation, then running stats again, and again solving the next worst case situation, and repeating this process several times, the resulting code is no more then 80 lines of simple straightforward Python code (a bit less then Vladimir's code). The novel part is how the 3 partitions are constructed in combination with some very simple yet effective post processing of them. Now I need some help on how to test and make statistics properly.
Especially about how to count the swaps: most of the swaps only perform two assignements instead of three. So must I count them as full swaps or, is it fair to count them only as a '2/3' swap?
Counting every swap as 1, the Cn in Cn * N * log2(N) is around 0.48 on short lists (<100 elements) and around 0.55 on longer lists of several million elements. That is just the theoretical minimum as calculated by Vladimir Yaroslavskiy.
Counting the lighter swaps as 2/3 instead, the number of needed swaps is almost equal for any list size and is around 0.36 (stdev around 0.015).
The Cn for the number of comparisons is on average around 1.3 for lists of 2 million records, which is less then the theoretical 1.38 (from 2*N*ln(N)), and lower for shorter lists, i.e. for 1024 elements, it's around 1.21
That is for lists with 100% unique numbers and randomly ordered with Python's random.shuffle().
So my question is:
Is it ok to count the lighter swaps as such, and is the result indeed promising or not?
Also interesting is:
the more equal elements in the list, the faster is sorts. Cn is 0.03 and 0.1 for swaps and comparisons respectively for a 2 million list of all equal elements.
Cn for sorted and reversed sorted lists are almost the same for all sizes: 0.3 and 1 for the swaps (counted with 2/3) and comparisons respectively.
I will post a list with more statistics shortly which includes maximum stack depth, number of recursive calls besides the swaps and comparisons. Are there other things I should count?
Also, are there some 'standard' test suites with files of all kinds of situations (with equals, partially sorted etc.) one can use to test a sorting algorithm, and to make the results comparable with other sorting algorithms.
Added May 5:
I improved the algorithm especially for sorted lists.
Here are the resutls for 20 runs for each.
Are this good results?
New statistics:
Random.shuffle(), unique number
Length Swaps/Nlog2(N) Comparisons/Nlog2(N) Maximum Stack/log2(N)
16 0.367 0.922 0.250
64 0.360 1.072 0.500
256 0.342 1.122 0.625
1024 0.358 1.156 0.800
4096 0.359 1.199 0.917
16384 0.359 1.244 1.071
65536 0.360 1.244 1.125
262144 0.360 1.269 1.167
1048576 0.362 1.275 1.200
Sorted, unique numbers
Length Swaps/Nlog2(N) Comparisons/Nlog2(N) Maximum Stack/log2(N)
16 0.172 0.531 0.250
64 0.117 0.586 0.333
256 0.087 0.609 0.375
1024 0.075 0.740 0.500
4096 0.060 0.732 0.500
16384 0.051 0.726 0.500
65536 0.044 0.722 0.500
262144 0.041 0.781 0.556
1048576 0.036 0.774 0.550
2097152 0.035 0.780 0.571
Reversed order, unique numbers
Length Swaps/Nlog2(N) Comparisons/Nlog2(N) Maximum Stack/log2(N)
16 0.344 0.828 0.250
64 0.279 0.812 0.333
256 0.234 0.788 0.375
1024 0.210 0.858 0.500
4096 0.190 0.865 0.500
16384 0.172 0.855 0.500
65536 0.158 0.846 0.500
262144 0.153 0.900 0.556
1048576 0.143 0.892 0.550
2097152 0.140 0.895 0.571
I have chosen to count the assignments executed on the elements to be sorted, instead of 'swaps'. Assignements and comparisons of indexes are not counted.
I converted the code Vladimir Yaroslavskiy included in his document (Last updated: September 22, 2009) to Python and added the counters the same way as I did in my own implementation. The code is included at the end.
Any comments are welcome.
Here are the results, the averages of 10 runs.
The columns labeled VY are the results for the implementation by Vladimir, the columns labeled by JB are these of my own implementation.
Length F Function call Assignements Comparisons Maximum Stack
of list per N per N.log2(N) per N.log2(N) per log2(N)
Random.shuffle(), unique number
Version VY JB VY JB VY JB VY JB
64 1 0.170 0.266 1.489 1.029 1.041 1.028 0.417 0.633
256 1 0.171 0.270 1.463 1.016 1.066 1.138 0.575 0.812
1024 1 0.167 0.275 1.451 1.046 1.089 1.165 0.690 1.010
4096 1 0.164 0.273 1.436 1.069 1.119 1.189 0.800 1.075
16384 1 0.166 0.273 1.444 1.077 1.117 1.270 0.843 1.221
65536 1 0.166 0.273 1.440 1.108 1.126 1.258 0.919 1.281
262144 1 0.166 0.273 1.423 1.102 1.134 1.278 0.950 1.306
1048576 1 0.166 0.273 1.426 1.085 1.131 1.273 0.990 1.290
Sorted, unique numbers
Version VY JB VY JB VY JB VY JB
64 1 0.203 0.203 1.036 0.349 0.643 0.586 0.333 0.333
256 1 0.156 0.156 0.904 0.262 0.643 0.609 0.375 0.375
1024 1 0.118 0.355 0.823 0.223 0.642 0.740 0.400 0.500
4096 1 0.131 0.267 0.840 0.181 0.679 0.732 0.500 0.500
16384 1 0.200 0.200 0.926 0.152 0.751 0.726 0.500 0.500
65536 1 0.150 0.150 0.866 0.131 0.737 0.722 0.500 0.500
262144 1 0.113 0.338 0.829 0.124 0.728 0.781 0.500 0.556
1048576 1 0.147 0.253 0.853 0.108 0.750 0.774 0.550 0.550
Reversed order, unique numbers
Version VY JB VY JB VY JB VY JB
64 1 0.203 0.203 1.320 0.836 0.841 0.802 0.333 0.333
256 1 0.156 0.156 1.118 0.703 0.795 0.783 0.375 0.375
1024 1 0.118 0.312 1.002 0.631 0.768 0.852 0.400 0.500
4096 1 0.125 0.267 0.977 0.569 0.776 0.861 0.500 0.500
16384 1 0.200 0.200 1.046 0.516 0.834 0.852 0.500 0.500
65536 1 0.150 0.150 0.974 0.475 0.813 0.844 0.500 0.500
262144 1 0.113 0.338 0.925 0.459 0.795 0.896 0.500 0.556
1048576 1 0.145 0.253 0.938 0.430 0.811 0.890 0.550 0.550
Random, with increasing frequency of the numbers.
The last row is a list of the same number
Version VY JB VY JB VY JB VY JB
65536 1 0.166 0.273 1.429 1.051 1.113 1.251 0.881 1.156
65536 2 0.167 0.270 1.404 1.075 1.112 1.238 0.894 1.194
65536 4 0.168 0.273 1.373 1.039 1.096 1.213 0.906 1.238
65536 8 0.151 0.245 1.302 1.029 1.069 1.199 0.900 1.262
65536 16 0.132 0.127 1.264 0.970 1.020 1.150 0.912 1.188
65536 32 0.090 0.064 1.127 0.920 0.950 1.099 0.856 1.119
65536 64 0.051 0.032 1.000 0.845 0.879 0.993 0.819 1.019
65536 128 0.026 0.016 0.884 0.792 0.797 0.923 0.725 0.931
65536 256 0.013 0.008 0.805 0.704 0.728 0.840 0.675 0.856
65536 512 0.006 0.004 0.690 0.615 0.652 0.728 0.588 0.669
65536 1024 0.003 0.002 0.635 0.557 0.579 0.654 0.519 0.625
65536 2048 0.002 0.001 0.541 0.487 0.509 0.582 0.438 0.463
65536 4096 0.001 0.000 0.459 0.417 0.434 0.471 0.369 0.394
65536 8192 0.000 0.000 0.351 0.359 0.357 0.405 0.294 0.300
65536 16384 0.000 0.000 0.247 0.297 0.253 0.314 0.206 0.194
65536 32768 0.000 0.000 0.231 0.188 0.209 0.212 0.125 0.081
65536 65536 0.000 0.000 0.063 0.125 0.063 0.125 0.062 0.000
Here is the code of Vladimirs sort in Python:
DIST_SIZE = 13
TINY_SIZE = 17
def dualPivotQuicksort(a, left, right, nesting=0):
global assignements, comparisons, oproepen, maxnesting
oproepen += 1
maxnesting = max(maxnesting, nesting)
length = right - left
if length < TINY_SIZE: # insertion sort on tiny array
# note by JB: rewritten to minimize the assignements
for i in xrange(left+1, right+1):
key = a[i]
assignements += 1
while i > left:
comparisons += 1
if key < a[i - 1]:
assignements += 1
a[i] = a[i-1]
i -= 1
else:
break
assignements += 1
a[i] = key
return
# median indexes
sixth = length / 6
m1 = left + sixth
m2 = m1 + sixth
m3 = m2 + sixth
m4 = m3 + sixth
m5 = m4 + sixth
assignements += 9*3
comparisons += 9
## 5-element sorting network
if a[m1] > a[m2]: a[m1],a[m2] = a[m2],a[m1]
if a[m4] > a[m5]: a[m4],a[m5] = a[m5],a[m4]
if a[m1] > a[m3]: a[m1],a[m3] = a[m3],a[m1]
if a[m2] > a[m3]: a[m2],a[m3] = a[m3],a[m2]
if a[m1] > a[m4]: a[m1],a[m4] = a[m4],a[m1]
if a[m3] > a[m4]: a[m3],a[m4] = a[m4],a[m3]
if a[m2] > a[m5]: a[m2],a[m5] = a[m5],a[m2]
if a[m2] > a[m3]: a[m2],a[m3] = a[m3],a[m2]
if a[m4] > a[m5]: a[m4],a[m5] = a[m5],a[m4]
# pivots: [ < pivot1 | pivot1 <= && <= pivot2 | > pivot2 ]
assignements += 2
pivot1 = a[m2]
pivot2 = a[m4]
comparisons += 1
diffPivots = pivot1 != pivot2
assignements += 2
a[m2] = a[left]
a[m4] = a[right]
# center part pointers
less = left + 1
great = right - 1
# sorting
if (diffPivots):
k = less
while k <= great:
assignements += 1
x = a[k]
comparisons += 2
if (x < pivot1):
comparisons -= 1
assignements += 2
a[k] = a[less]
a[less] = x
less += 1
elif (x > pivot2):
while k < great:
comparisons += 1
if a[great] > pivot2:
great -= 1
else:
break
assignements += 3
a[k] = a[great]
a[great] = x
great -= 1
x = a[k]
comparisons += 1
if (x < pivot1):
assignements += 2
a[k] = a[less]
a[less] = x
less += 1
k += 1
else:
k = less
while k <= great:
assignements += 1
x = a[k]
comparisons += 1
if (x == pivot1):
k += 1
continue
comparisons += 1
if (x < pivot1):
assignements += 2
a[k] = a[less]
a[less] = x
less += 1
else:
while k < great:
comparisons += 1
if a[great] > pivot2:
great -= 1
else:
break
assignements += 3
a[k] = a[great]
a[great] = x
great -= 1
x = a[k]
comparisons += 1
if (x < pivot1):
assignements += 2
a[k] = a[less]
a[less] = x
less += 1
k += 1
# swap
assignements += 2
a[left] = a[less - 1]
a[less - 1] = pivot1
assignements += 2
a[right] = a[great + 1]
a[great + 1] = pivot2
# left and right parts
dualPivotQuicksort(a, left, less - 2, nesting+1)
dualPivotQuicksort(a, great + 2, right, nesting+1)
# equal elements
if (great - less > length - DIST_SIZE and diffPivots):
k = less
while k <= great:
assignements += 1
x = a[k]
comparisons += 2
if (x == pivot1):
comparisons -= 1
assignements += 2
a[k] = a[less]
a[less] = x
less += 1
elif (x == pivot2):
assignements += 3
a[k] = a[great]
a[great] = x
great -= 1
x = a[k]
comparisons += 1
if (x == pivot1):
assignements += 2
a[k] = a[less]
a[less] = x
less += 1
k += 1
# center part
if (diffPivots):
dualPivotQuicksort(a, less, great, nesting+1)
This code is about 190 lines, my current implementation written with the same formatting is about 110 lines.
So any remarks are welcome.

Go memory consumption management

I am new to Go and trying to figure out how it manages memory consumption.
I have trouble with memory in one of my test projects. I don't understand why Go uses more and more memory (never freeing it) when my program runs for a long time.
I am running the test case provided below. After the first allocation, program uses nearly 350 MB of memory (according to ActivityMonitor). Then I try to free it and ActivityMonitor shows that memory consumption doubles. Why?
I am running this code on OS X using Go 1.0.3.
What is wrong with this code? And what is the right way to manage large variables in Go programs?
I had another memory-management-related problem when implementing an algorithm that uses a lot of time and memory; after running it for some time it throws an "out of memory" exception.
package main
import ("fmt"
"time"
)
func main() {
fmt.Println("getting memory")
tmp := make([]uint32, 100000000)
for kk, _ := range tmp {
tmp[kk] = 0
}
time.Sleep(5 * time.Second)
fmt.Println("returning memory")
tmp = make([]uint32, 1)
tmp = nil
time.Sleep(5 * time.Second)
fmt.Println("getting memory")
tmp = make([]uint32, 100000000)
for kk, _ := range tmp {
tmp[kk] = 0
}
time.Sleep(5 * time.Second)
fmt.Println("returning memory")
tmp = make([]uint32, 1)
tmp = nil
time.Sleep(5 * time.Second)
return
}
Currently, go uses a mark-and-sweep garbage collector, which in general does not define when the object is thrown away.
However, if you look closely, there is a go routine called sysmon which essentially runs as long as your program does and calls the GC periodically:
// forcegcperiod is the maximum time in nanoseconds between garbage
// collections. If we go this long without a garbage collection, one
// is forced to run.
//
// This is a variable for testing purposes. It normally doesn't change.
var forcegcperiod int64 = 2 * 60 * 1e9
(...)
// If a heap span goes unused for 5 minutes after a garbage collection,
// we hand it back to the operating system.
scavengelimit := int64(5 * 60 * 1e9)
forcegcperiod determines the period after which the GC is called by force. scavengelimit determines when spans are returned to the operating system. Spans are a number of memory pages which can hold several objects. They're kept for scavengelimit time and are freed if no object is on them and scavengelimit is exceeded.
Further down in the code you can see that there is a trace option. You can use this to see, whenever the
scavenger thinks he needs to clean up:
$ GOGCTRACE=1 go run gc.go
gc1(1): 0+0+0 ms 0 -> 0 MB 423 -> 350 (424-74) objects 0 handoff
gc2(1): 0+0+0 ms 1 -> 0 MB 2664 -> 1437 (2880-1443) objects 0 handoff
gc3(1): 0+0+0 ms 1 -> 0 MB 4117 -> 2213 (5712-3499) objects 0 handoff
gc4(1): 0+0+0 ms 2 -> 1 MB 3128 -> 2257 (6761-4504) objects 0 handoff
gc5(1): 0+0+0 ms 2 -> 0 MB 8892 -> 2531 (13734-11203) objects 0 handoff
gc6(1): 0+0+0 ms 1 -> 1 MB 8715 -> 2689 (20173-17484) objects 0 handoff
gc7(1): 0+0+0 ms 2 -> 1 MB 5231 -> 2406 (22878-20472) objects 0 handoff
gc1(1): 0+0+0 ms 0 -> 0 MB 172 -> 137 (173-36) objects 0 handoff
getting memory
gc2(1): 0+0+0 ms 381 -> 381 MB 203 -> 202 (248-46) objects 0 handoff
returning memory
getting memory
returning memory
As you can see, no gc invoke is done between getting and returning. However, if you change
the delay from 5 seconds to 3 minutes (more than the 2 minutes from forcegcperiod),
the objects are removed by the gc:
returning memory
scvg0: inuse: 1, idle: 1, sys: 3, released: 0, consumed: 3 (MB)
scvg0: inuse: 381, idle: 0, sys: 382, released: 0, consumed: 382 (MB)
scvg1: inuse: 1, idle: 1, sys: 3, released: 0, consumed: 3 (MB)
scvg1: inuse: 381, idle: 0, sys: 382, released: 0, consumed: 382 (MB)
gc9(1): 1+0+0 ms 1 -> 1 MB 4485 -> 2562 (26531-23969) objects 0 handoff
gc10(1): 1+0+0 ms 1 -> 1 MB 2563 -> 2561 (26532-23971) objects 0 handoff
scvg2: GC forced // forcegc (2 minutes) exceeded
scvg2: inuse: 1, idle: 1, sys: 3, released: 0, consumed: 3 (MB)
gc3(1): 0+0+0 ms 381 -> 381 MB 206 -> 206 (252-46) objects 0 handoff
scvg2: GC forced
scvg2: inuse: 381, idle: 0, sys: 382, released: 0, consumed: 382 (MB)
getting memory
The memory is still not freed, but the GC marked the memory region as unused. Freeing will begin when
the used span is unused and older than limit. From scavenger code:
if(s->unusedsince != 0 && (now - s->unusedsince) > limit) {
// ...
runtime·SysUnused((void*)(s->start << PageShift), s->npages << PageShift);
}
This behavior may of course change over time, but I hope you now get a bit of a feel when objects
are thrown away by force and when not.
As pointed out by zupa, releasing objects may not return the memory to the operating system, so on
certain systems you may not see a change in memory usage. This seems to be the case for Plan 9
and Windows according to this thread on golang-nuts.
To eventually (force) collect unused memory you must call runtime.GC().
variable = nil may make things unreachable and thus eligible for collection, but it per se doesn't free anything.

How can I measure the performance and TCP RTT of my server code?

I created a basic TCP server that reads incoming binary data in protocol buffer format, and writes a binary msg as response. I would like to benchmark the the roundtrip time.
I tried iperf, but could not make it send the same input file multiple times. Is there another benchmark tool than can send a binary input file repeatedly?
If you have access to a linux or unix machine1, you should use tcptrace. All you need to do is loop through your binary traffic test while capturing with wireshark or tcpdump file.
After you have that .pcap file2, analyze with tcptrace -xtraffic <pcap_filename>3. This will generate two text files, and the average RTT stats for all connections in that pcap are shown at the bottom of the one called traffic_stats.dat.
[mpenning#Bucksnort tcpperf]$ tcptrace -xtraffic willers.pcap
mod_traffic: characterizing traffic
1 arg remaining, starting with 'willers.pcap'
Ostermann's tcptrace -- version 6.6.1 -- Wed Nov 19, 2003
16522 packets seen, 16522 TCP packets traced
elapsed wallclock time: 0:00:00.200709, 82318 pkts/sec analyzed
trace file elapsed time: 0:03:21.754962
Dumping port statistics into file traffic_byport.dat
Dumping overall statistics into file traffic_stats.dat
Plotting performed at 15.000 second intervals
[mpenning#Bucksnort tcpperf]$
[mpenning#Bucksnort tcpperf]$ cat traffic_stats.dat
Overall Statistics over 201 seconds (0:03:21.754962):
4135308 ttl bytes sent, 20573.672 bytes/second
4135308 ttl non-rexmit bytes sent, 20573.672 bytes/second
0 ttl rexmit bytes sent, 0.000 bytes/second
16522 packets sent, 82.199 packets/second
200 connections opened, 0.995 conns/second
11 dupacks sent, 0.055 dupacks/second
0 rexmits sent, 0.000 rexmits/second
average RTT: 67.511 msecs <------------------
[mpenning#Bucksnort tcpperf]$
The .pcap file used in this example was a capture I generated when I looped through an expect script that pulled data from one of my servers. This was how I generated the loop...
#!/usr/bin/python
from subprocess import Popen, PIPE
import time
for ii in xrange(0,200):
# willers.exp is an expect script
Popen(['./willers.exp'], stdin=PIPE, stdout=PIPE, stderr=PIPE)
time.sleep(1)
You can adjust the sleep time between loops based on your server's accept() performance and the duration of your tests.
END NOTES:
A Knoppix Live-CD will do
Filtered to only capture test traffic
tcptrace is capable of very detailed per-socket stats if you use other options...
================================
[mpenning#Bucksnort tcpperf]$ tcptrace -lr willers.pcap
1 arg remaining, starting with 'willers.pcap'
Ostermann's tcptrace -- version 6.6.1 -- Wed Nov 19, 2003
16522 packets seen, 16522 TCP packets traced
elapsed wallclock time: 0:00:00.080496, 205252 pkts/sec analyzed
trace file elapsed time: 0:03:21.754962
TCP connection info:
200 TCP connections traced:
TCP connection 1:
host c: myhost.local:44781
host d: willers.local:22
complete conn: RESET (SYNs: 2) (FINs: 1)
first packet: Tue May 31 22:52:24.154801 2011
last packet: Tue May 31 22:52:25.668430 2011
elapsed time: 0:00:01.513628
total packets: 73
filename: willers.pcap
c->d: d->c:
total packets: 34 total packets: 39
resets sent: 4 resets sent: 0
ack pkts sent: 29 ack pkts sent: 39
pure acks sent: 11 pure acks sent: 2
sack pkts sent: 0 sack pkts sent: 0
dsack pkts sent: 0 dsack pkts sent: 0
max sack blks/ack: 0 max sack blks/ack: 0
unique bytes sent: 2512 unique bytes sent: 14336
actual data pkts: 17 actual data pkts: 36
actual data bytes: 2512 actual data bytes: 14336
rexmt data pkts: 0 rexmt data pkts: 0
rexmt data bytes: 0 rexmt data bytes: 0
zwnd probe pkts: 0 zwnd probe pkts: 0
zwnd probe bytes: 0 zwnd probe bytes: 0
outoforder pkts: 0 outoforder pkts: 0
pushed data pkts: 17 pushed data pkts: 33
SYN/FIN pkts sent: 1/1 SYN/FIN pkts sent: 1/0
req 1323 ws/ts: Y/Y req 1323 ws/ts: Y/Y
adv wind scale: 6 adv wind scale: 1
req sack: Y req sack: Y
sacks sent: 0 sacks sent: 0
urgent data pkts: 0 pkts urgent data pkts: 0 pkts
urgent data bytes: 0 bytes urgent data bytes: 0 bytes
mss requested: 1460 bytes mss requested: 1460 bytes
max segm size: 792 bytes max segm size: 1448 bytes
min segm size: 16 bytes min segm size: 32 bytes
avg segm size: 147 bytes avg segm size: 398 bytes
max win adv: 40832 bytes max win adv: 66608 bytes
min win adv: 5888 bytes min win adv: 66608 bytes
zero win adv: 0 times zero win adv: 0 times
avg win adv: 14035 bytes avg win adv: 66608 bytes
initial window: 32 bytes initial window: 40 bytes
initial window: 1 pkts initial window: 1 pkts
ttl stream length: 2512 bytes ttl stream length: NA
missed data: 0 bytes missed data: NA
truncated data: 0 bytes truncated data: 0 bytes
truncated packets: 0 pkts truncated packets: 0 pkts
data xmit time: 1.181 secs data xmit time: 1.236 secs
idletime max: 196.9 ms idletime max: 196.9 ms
throughput: 1660 Bps throughput: 9471 Bps
RTT samples: 18 RTT samples: 24
RTT min: 43.8 ms RTT min: 0.0 ms
RTT max: 142.5 ms RTT max: 7.2 ms
RTT avg: 68.5 ms RTT avg: 0.7 ms
RTT stdev: 35.8 ms RTT stdev: 1.6 ms
RTT from 3WHS: 80.8 ms RTT from 3WHS: 0.0 ms
RTT full_sz smpls: 1 RTT full_sz smpls: 3
RTT full_sz min: 142.5 ms RTT full_sz min: 0.0 ms
RTT full_sz max: 142.5 ms RTT full_sz max: 0.0 ms
RTT full_sz avg: 142.5 ms RTT full_sz avg: 0.0 ms
RTT full_sz stdev: 0.0 ms RTT full_sz stdev: 0.0 ms
post-loss acks: 0 post-loss acks: 0
segs cum acked: 0 segs cum acked: 9
duplicate acks: 0 duplicate acks: 1
triple dupacks: 0 triple dupacks: 0
max # retrans: 0 max # retrans: 0
min retr time: 0.0 ms min retr time: 0.0 ms
max retr time: 0.0 ms max retr time: 0.0 ms
avg retr time: 0.0 ms avg retr time: 0.0 ms
sdv retr time: 0.0 ms sdv retr time: 0.0 ms
================================
You can always stick a shell loop around a program like iperf. Also, assuming iperf can read from a file (thus stdin) or programs like ttcp, could allow a shell loop catting a file N times into iperf/ttcp.
If you want a program which sends a file, waits for your binary response, and then sends another copy of the file, you probably are going to need to code that yourself.
You will need to measure the time in the client application for a roundtrip time, or monitor the network traffic going from, and coming to, the client to get the complete time interval. Measuring the time at the server will exclude any kernel level delays in the server and all the network transmission times.
Note that TCP performance will go down as the load goes up. If you're going to test under heavy load, you need professional tools that can scale to thousands (or even millions in some cases) of new connection/second or concurrent established TCP connections.
I wrote an article about this on my blog (feel free to remove if this is considered advertisement, but I think it's relevant to this thread): http://synsynack.wordpress.com/2012/04/09/realistic-latency-measurement-in-the-application-layers
As a very simple highlevel tool netcat comes to mind ... so something like time (nc hostname 1234 < input.binary | head -c 100) assuming the response is 100 bytes long.

Resources