I would like to share some experimental results on RocksDB Put performance: with two threads, Put throughput is much lower than with a single thread (the same 16M keys take 31.4 s instead of 8.4 s). This is weird, because I use the default skiplist memtable, and that data structure supports concurrent writes.
Here is my testing code.
const uint64_t nthread = 2;
const uint64_t nkeys = 16000000;
std::vector<std::thread> threads;  // avoids the non-standard variable-length array
std::atomic<uint64_t> idx(1000000);
for (uint64_t t = 0; t < nthread; t++) {
    threads.emplace_back([db, &idx, nthread, nkeys, &write_option_disable] {
        for (uint64_t i = 0; i < nkeys / nthread; i++) {
            // Every thread draws its key number from the shared counter
            std::string key = "WVERIFY" + std::to_string(idx.fetch_add(1));
            std::string value = "MOCK";
            auto ikey = rocksdb::Slice(key);
            auto ivalue = rocksdb::Slice(value);
            db->Put(write_option_disable, ikey, ivalue);
        }
        return 0;
    });
}
for (auto& t : threads) {
    t.join();
}
Here are the results I got.
// Single thread
Uptime(secs): 8.4 total, 8.3 interval
Flush(GB): cumulative 1.170, interval 1.170
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 1.17 GB write, 143.35 MB/s write, 0.00 GB read, 0.00 MB/s read, 8.1 seconds
Interval compaction: 1.17 GB write, 144.11 MB/s write, 0.00 GB read, 0.00 MB/s read, 8.1 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Block cache LRUCache#0x564742515ea0#7011 capacity: 8.00 MB collections: 1 last_copies: 0 last_secs: 2e-05 secs_since: 8
Block cache entry stats(count,size,portion): Misc(1,0.00 KB,0%)
** File Read Latency Histogram By Level [default] **
** DB Stats **
Uptime(secs): 8.4 total, 8.3 interval
Cumulative writes: 16M writes, 16M keys, 16M commit groups, 1.0 writes per commit group, ingest: 1.63 GB, 199.80 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 16M writes, 16M keys, 16M commit groups, 1.0 writes per commit group, ingest: 1669.88 MB, 200.85 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
// 2 threads
Uptime(secs): 31.4 total, 31.4 interval
Flush(GB): cumulative 0.183, interval 0.183
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.67 GB write, 21.84 MB/s write, 0.97 GB read, 31.68 MB/s read, 10.2 seconds
Interval compaction: 0.67 GB write, 21.87 MB/s write, 0.97 GB read, 31.72 MB/s read, 10.2 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Block cache LRUCache#0x5619fb7bbea0#6183 capacity: 8.00 MB collections: 1 last_copies: 0 last_secs: 1.9e-05 secs_since: 31
Block cache entry stats(count,size,portion): Misc(1,0.00 KB,0%)
** File Read Latency Histogram By Level [default] **
** DB Stats **
Uptime(secs): 31.4 total, 31.4 interval
Cumulative writes: 16M writes, 16M keys, 11M commit groups, 1.4 writes per commit group, ingest: 0.45 GB, 14.67 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 16M writes, 16M keys, 11M commit groups, 1.4 writes per commit group, ingest: 460.94 MB, 14.69 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
===========================update==========================
These are my RocksDB settings.
DB* db;
Options options;
BlockBasedTableOptions table_options;
rocksdb::WriteOptions write_option_disable;
write_option_disable.disableWAL = true;
// Optimize RocksDB. This is the easiest way to get RocksDB to perform well
options.IncreaseParallelism();
options.OptimizeLevelStyleCompaction();
// create the DB if it's not already present
options.create_if_missing = true;
The atomic idx shared between the two threads can introduce non-trivial overhead. Try inserting random values from each thread, and maybe increase the number of threads.
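A minimal sketch of that suggestion, assuming each thread owns a disjoint key range and its own random generator, so that no counter is shared between threads. The ThreadKeyGen type is hypothetical and not part of RocksDB; the Put call from the question is only indicated in a comment so that the sketch compiles without the library.

```cpp
#include <cstdint>
#include <random>
#include <string>

// Hypothetical helper: one instance per thread, no state shared across threads.
// Assumes nkeys >= nthread so that every thread gets a non-empty range.
struct ThreadKeyGen {
    std::mt19937_64 rng;  // thread-local generator
    uint64_t base;        // first key number owned by this thread
    uint64_t span;        // how many key numbers this thread owns

    ThreadKeyGen(uint64_t t, uint64_t nthread, uint64_t nkeys)
        : rng(t + 1),
          base(1000000 + t * (nkeys / nthread)),
          span(nkeys / nthread) {}

    // Returns a random key inside this thread's own range.
    std::string next() {
        std::uniform_int_distribution<uint64_t> dist(0, span - 1);
        return "WVERIFY" + std::to_string(base + dist(rng));
    }
};

// Inside the thread body of the benchmark one would then write something like:
//   ThreadKeyGen gen(t, nthread, nkeys);
//   for (uint64_t i = 0; i < nkeys / nthread; i++) {
//       std::string key = gen.next();
//       db->Put(write_option_disable, rocksdb::Slice(key), rocksdb::Slice(value));
//   }
```

Random draws can repeat a key, so fewer distinct keys are written than with the sequential counter. If the key count must stay the same, base + i gives each thread unique sequential keys without a shared atomic.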
I have a system that sends GET and POST requests to a page hosted in Italy, and I need to reduce my latency as much as possible, because it directly affects the performance of what I need to register. The service in Italy is not mine; I'm just automating a routine, using Python for this. I have tested several cloud providers (AWS EC2, Google Cloud, Italian clouds), and my latency does not fall below 20ms. When I run traceroute from any of these services, I see that my requests leave Europe, go to the US (IPs 107.162.X.X), and then return to Europe. Is there any way to keep them from leaving Europe?
-- Germany start
1 static.129.213.201.195.clients.your-server.de [195.201.213.129] 0ms 4ms 1ms
2 core11.nbg1.hetzner.com [213.239.229.153] 0ms 3ms 0.75ms
3 juniper4.dc2.nbg1.hetzner.com [213.239.203.138] 0ms 0ms 0ms
4 nug-b1-link.telia.net [213.248.70.0] 0ms 0ms 0ms
5 ffm-bb3-link.telia.net [62.115.113.146] 3ms 4ms 3.25ms
6 ffm-b1-link.telia.net [62.115.121.1] 4ms 7ms 5ms
7 f5networks-ic-341210-ffm-b1.c.telia.net [62.115.169.109] 3ms 3ms 3ms
8 107.162.79.3 3ms 5ms 3.5ms
9 107.162.67.110 26ms 107ms 52.75ms
10 85.116.228.2 25ms 25ms 25ms
-- Milan IT start
IT 192.165.67.0 Loss% Snt Last Avg Best Wrst StDev AS Name PTR
IT 192.165.67.1 0.0% 20 1.1 1.5 0.7 10.1 2.0 34971
IT 217.171.38.133 0.0% 20 0.2 1.0 0.2 9.6 2.2 20836
EU 213.248.84.128 5.0% 20 0.6 1.5 0.4 10.9 2.9 1299
EU 62.115.142.140 0.0% 20 9.4 9.7 9.3 14.8 1.2 1299
EU 62.115.116.160 0.0% 20 9.2 12.5 9.2 56.1 10.6 1299
EU 62.115.169.109 0.0% 20 9.4 10.2 9.4 17.0 2.0 1299
US 107.162.79.3 0.0% 20 9.4 10.5 9.4 18.5 2.4 55002
US 107.162.67.114 0.0% 20 41.2 35.8 29.7 69.4 11.6 55002
IT 85.116.228.2 0.0% 20 28.6 28.9 28.5 34.4 1.3 34699
-- Paris FR start
FR 51.15.179.0 Loss% Snt Last Avg Best Wrst StDev AS Name PTR
1 FR 51.15.179.1 0.0% 20 0.5 0.5 0.4 0.9 0.0 12876
2 FR 51.158.8.56 0.0% 20 0.5 0.5 0.3 1.2 0.0 12876
3 FR 195.154.2.168 0.0% 20 1.2 1.3 1.1 1.5 0.0 12876
4 NL 212.3.235.201 0.0% 20 1.9 2.0 1.9 2.3 0.0 3356
5 ??? 100.0 20 0.0 0.0 0.0 0.0 0.0 -
6 EU 80.231.153.65 0.0% 20 2.2 2.3 2.1 3.9 0.3 6453
7 EU 195.219.87.9 0.0% 20 29.1 30.0 29.0 42.6 2.9 6453
8 EU 195.219.87.88 0.0% 20 20.1 20.5 20.0 25.4 1.2 6453
9 CH 195.219.61.48 0.0% 20 12.3 11.7 11.5 12.5 0.0 6453
10 EU 195.219.148.186 0.0% 20 17.0 17.1 16.9 18.5 0.0 6453
11 US 107.162.79.3 0.0% 20 11.6 12.2 11.6 14.4 0.8 55002
12 US 107.162.67.110 0.0% 20 41.3 36.8 32.0 64.7 7.7 55002
13 IT 85.116.228.2 0.0% 20 31.3 31.3 31.2 31.7 0.0 34699
I wish I could reduce latency as much as possible. Could you suggest something?
I can see a very high percentage of steal time on an EC2 web server (t2.micro) without any load (one current user), together with a high page load time. Is there a correlation between the high load time and the high steal time? I see the same symptoms on another server of class t2.medium.
Do you have an explanation?
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 79824 7428 479172 0 0 0 0 52 49 18 0 0 0 82
1 0 0 79792 7436 479172 0 0 0 6 54 49 18 0 0 0 82
1 0 0 79824 7444 479172 0 0 0 5 54 51 18 0 0 0 82
In my code I am using an external C library, and the library calls madvise with the MADV_SEQUENTIAL option, which takes too long to finish. In my opinion, a single madvise call with MADV_SEQUENTIAL should be enough for our job. My first question: why are multiple madvise system calls made? Is there any logic in calling madvise with different options one after another? My second question: do you have any idea why madvise with MADV_SEQUENTIAL takes so long, sometimes about 1-2 minutes?
[root@mymachine ~]# strace -ttT my_compiled_code
...
13:11:35.358982 open("/some/big/file", O_RDONLY) = 8 <0.000010>
13:11:35.359060 fstat64(8, {st_mode=S_IFREG|0644, st_size=953360384, ...}) = 0 <0.000006>
13:11:35.359155 mmap2(NULL, 1073741824, PROT_READ, MAP_SHARED, 8, 0) = 0x7755e000 <0.000007>
13:11:35.359223 madvise(0x7755e000, 1073741824, MADV_NORMAL) = 0 <0.000006>
13:11:35.359266 madvise(0x7755e000, 1073741824, MADV_RANDOM) = 0 <0.000006>
13:11:35.359886 madvise(0x7755e000, 1073741824, MADV_SEQUENTIAL) = 0 <0.000006>
13:11:53.730549 madvise(0x7755e000, 1073741824, MADV_RANDOM) = 0 <0.000013>
...
I am using 32-bit linux kernel: 3.4.52-9
[root@mymachine ~]# free -lk
total used free shared buffers cached
Mem: 4034412 3419344 615068 0 55712 767824
Low: 853572 495436 358136
High: 3180840 2923908 256932
-/+ buffers/cache: 2595808 1438604
Swap: 4192960 218624 3974336
[root@mymachine ~]# cat /proc/buddyinfo
Node 0, zone DMA 89 23 9 4 5 4 4 1 0 2 0
Node 0, zone Normal 9615 7099 3997 1723 931 397 78 0 0 1 1
Node 0, zone HighMem 7313 8089 2187 420 206 92 41 15 8 3 6
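One way to check whether the time is really spent inside madvise, or in whatever the library does after it, is to time a single MADV_SEQUENTIAL call in isolation. The following is a minimal sketch assuming a Linux/POSIX system; timed_madvise_sequential is a hypothetical helper, not part of the external library.

```cpp
#include <chrono>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Maps `path` read-only, applies MADV_SEQUENTIAL once, and returns the
// seconds spent inside the madvise call itself (negative on any error).
double timed_madvise_sequential(const char* path) {
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return -1.0;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return -1.0;
    }
    const size_t len = static_cast<size_t>(st.st_size);

    void* addr = mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return -1.0;

    const auto start = std::chrono::steady_clock::now();
    const int rc = madvise(addr, len, MADV_SEQUENTIAL);
    const auto stop = std::chrono::steady_clock::now();

    munmap(addr, len);
    if (rc != 0) return -1.0;
    return std::chrono::duration<double>(stop - start).count();
}
```

With strace -T, the value in angle brackets is the time spent inside each system call. In the trace above, madvise with MADV_SEQUENTIAL reports <0.000006>, while about 18 seconds pass before the next logged call, which suggests comparing the two figures before blaming madvise itself.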
I am going to demonstrate the problem using the following example program
{-# LANGUAGE BangPatterns #-}
data Point = Point !Double !Double
fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b
         | otherwise = if a < b then a
                       else let q = a / b
                            in b * (q - fromIntegral (floor q :: Int))
standardMap :: Double -> Point -> Point
standardMap k (Point q p) =
    Point (fmod (q + p) (2 * pi)) (fmod (p + k * sin(q)) (2 * pi))
iterate' gen !p = p : (iterate' gen $ gen p)
main = putStrLn
     . show
     . (\(Point a b) -> a + b)
     . head . drop 100000000
     . iterate' (standardMap k) $ (Point 0.15 0.25)
  where k = (cos (pi/3)) - (sin (pi/3))
Here standardMap k is the parametrized function and k = (cos (pi/3)) - (sin (pi/3)) is its parameter. If I compile this program with ghc -O3 -fllvm, the execution time on my machine is approximately 42s. However, if I write k in the form 0.5 - (sin (pi/3)), the execution time is 21s, and if I write k = 0.5 - 0.5 * (sqrt 3), it takes only 12s.
My conclusion is that k is reevaluated on each call of standardMap k.
Why is this not optimized?
P.S. The compiler is GHC 7.6.3 on Arch Linux.
EDIT
For those who are concerned with the weird properties of standardMap here is a simpler and more intuitive example, which exhibits the same problem
{-# LANGUAGE BangPatterns #-}
data Point = Point !Double !Double
rotate :: Double -> Point -> Point
rotate k (Point q p) =
    Point ((cos k) * q - (sin k) * p) ((sin k) * q + (cos k) * p)
iterate' gen !p = p : (iterate' gen $ gen p)
main = putStrLn
     . show
     . (\(Point a b) -> a + b)
     . head . drop 100000000
     . iterate' (rotate k) $ (Point 0.15 0.25)
  where --k = (cos (pi/3)) - (sin (pi/3))
        k = 0.5 - 0.5 * (sqrt 3)
EDIT
Before I asked the question I had tried to make k strict, the same way Don suggested, but with ghc -O3 I didn't see a difference. The solution with strictness works if the program is compiled with ghc -O2. I missed that because I didn't try all possible combinations of flags with all possible versions of the program.
So what is the difference between -O3 and -O2 that affects such cases?
Should I prefer -O2 in general?
EDIT
As observed by Mike Hartl and others, if rotate k is changed into rotate $ k or standardMap k into standardMap $ k, the performance is improved, though it is not the best possible (Don's solution). Why?
As always, check the core.
With ghc -O2, k is inlined into the loop body, which is floated out as a top level function:
Main.main7 :: Main.Point -> Main.Point
Main.main7 =
\ (ds_dAa :: Main.Point) ->
case ds_dAa of _ { Main.Point q_alG p_alH ->
case q_alG of _ { GHC.Types.D# x_s1bt ->
case p_alH of _ { GHC.Types.D# y_s1bw ->
case Main.$wfmod (GHC.Prim.+## x_s1bt y_s1bw) 6.283185307179586
of ww_s1bi { __DEFAULT ->
case Main.$wfmod
(GHC.Prim.+##
y_s1bw
(GHC.Prim.*##
(GHC.Prim.-##
(GHC.Prim.cosDouble# 1.0471975511965976)
(GHC.Prim.sinDouble# 1.0471975511965976))
(GHC.Prim.sinDouble# x_s1bt)))
6.283185307179586
of ww1_X1bZ { __DEFAULT ->
Main.Point (GHC.Types.D# ww_s1bi) (GHC.Types.D# ww1_X1bZ)
This indicates that the sin and cos calls aren't evaluated at compile time.
The result is that a bit more math is going to occur:
$ time ./A
3.1430515093368085
real 0m15.590s
If you make it strict, it is at least not recalculated each time:
main = putStrLn
     . show
     . (\(Point a b) -> a + b)
     . head . drop 100000000
     . iterate' (standardMap k) $ (Point 0.15 0.25)
  where
    k :: Double
    !k = (cos (pi/3)) - (sin (pi/3))
Resulting in:
ipv_sEq =
GHC.Prim.-##
(GHC.Prim.cosDouble# 1.0471975511965976)
(GHC.Prim.sinDouble# 1.0471975511965976) } in
And a running time of:
$ time ./A
6.283185307179588
real 0m7.859s
Which I think is good enough for now. I'd also add UNPACK pragmas to the Point type.
If you want to reason about numeric performance under different code arrangements, you must inspect the Core.
Your revised example suffers from the same issue: k is inlined into rotate. GHC thinks it is really cheap to compute, when in this benchmark it is more expensive.
Naively, ghc-7.2.3 -O2
$ time ./A
0.1470480616244365
real 0m22.897s
And k is evaluated each time rotate is called.
Make k strict, which is one way to stop it from being inlined and recomputed on every call:
$ time ./A
0.14704806100839019
real 0m2.360s
Using UNPACK pragmas on the Point constructor:
$ time ./A
0.14704806100839019
real 0m1.860s
I don't think it is repeated evaluation.
First, I switched to "do" notation and used a "let" on the definition of "k" which I figured should help. No - still slow.
Then I added a trace call: k is evaluated just once. I even checked that the fast variant was in fact producing a Double.
Then I printed out both variations. There is a small difference in the starting values.
Tweaking the value of the "slow" variant makes it the same speed. I've no idea what your algorithm is for - would it be very sensitive to starting values?
import Debug.Trace (trace)
...
main = do
    -- is -0.3660254037844386
    let k0 = (0.5 - 0.5 * (sqrt 3))::Double
    -- was -0.3660254037844385
    let k1 = (cos (pi/3)) - (trace "x" (sin (pi/3))) + 0.0000000000000001
    putStrLn (show k0)
    putStrLn (show k1)
    putStrLn
        . show
        . (\(Point a b) -> a + b)
        . head . drop 100000000
        . iterate' (standardMap k1) $ (Point 0.15 0.25)
EDIT: this is the version with numeric literals. It's displaying runtimes of 23sec vs 7sec for me. I compiled two separate versions of the code to make sure I wasn't doing something stupid like not recompiling.
main = do
    -- -0.3660254037844386
    -- -0.3660254037844385
    let k2 = -0.3660254037844385
    putStrLn
        . show
        . (\(Point a b) -> a + b)
        . head . drop 100000000
        . iterate' (standardMap k2) $ (Point 0.15 0.25)
EDIT2: I don't know how to get the opcodes from ghc, but comparing the hexdumps for the two .o files shows they differ by a single byte - presumably the literal. So it can't be the runtime.
EDIT3: I tried turning profiling on, and that has just puzzled me even more. Unless I'm missing something, the only difference is a small discrepancy in the number of calls to fmod (fmod.q, to be precise).
The "5" profile is for the constant ending "5", same with "6".
Fri Sep 6 12:37 2013 Time and Allocation Profiling Report (Final)
constant-timings-5 +RTS -p -RTS
total time = 38.34 secs (38343 ticks @ 1000 us, 1 processor)
total alloc = 12,000,105,184 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
standardMap Main 71.0 0.0
iterate' Main 21.2 93.3
fmod Main 6.3 6.7
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 50 0 0.0 0.0 100.0 100.0
main Main 101 0 0.0 0.0 0.0 0.0
CAF:main1 Main 98 0 0.0 0.0 0.0 0.0
main Main 100 1 0.0 0.0 0.0 0.0
CAF:main2 Main 97 0 0.0 0.0 1.0 0.0
main Main 102 0 1.0 0.0 1.0 0.0
main.\ Main 110 1 0.0 0.0 0.0 0.0
CAF:main3 Main 96 0 0.0 0.0 99.0 100.0
main Main 103 0 0.0 0.0 99.0 100.0
iterate' Main 104 100000001 21.2 93.3 99.0 100.0
standardMap Main 105 100000000 71.0 0.0 77.9 6.7
fmod Main 106 200000001 6.3 6.7 6.9 6.7
fmod.q Main 109 49999750 0.6 0.0 0.6 0.0
CAF:main_k Main 95 0 0.0 0.0 0.0 0.0
main Main 107 0 0.0 0.0 0.0 0.0
main.k2 Main 108 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 93 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 90 0 0.0 0.0 0.0 0.0
CAF GHC.Float 89 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 82 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 66 0 0.0 0.0 0.0 0.0
Fri Sep 6 12:38 2013 Time and Allocation Profiling Report (Final)
constant-timings-6 +RTS -p -RTS
total time = 22.17 secs (22167 ticks @ 1000 us, 1 processor)
total alloc = 11,999,947,752 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
standardMap Main 48.4 0.0
iterate' Main 38.2 93.3
fmod Main 10.9 6.7
main Main 1.4 0.0
fmod.q Main 1.0 0.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 50 0 0.0 0.0 100.0 100.0
main Main 101 0 0.0 0.0 0.0 0.0
CAF:main1 Main 98 0 0.0 0.0 0.0 0.0
main Main 100 1 0.0 0.0 0.0 0.0
CAF:main2 Main 97 0 0.0 0.0 1.4 0.0
main Main 102 0 1.4 0.0 1.4 0.0
main.\ Main 110 1 0.0 0.0 0.0 0.0
CAF:main3 Main 96 0 0.0 0.0 98.6 100.0
main Main 103 0 0.0 0.0 98.6 100.0
iterate' Main 104 100000001 38.2 93.3 98.6 100.0
standardMap Main 105 100000000 48.4 0.0 60.4 6.7
fmod Main 106 200000001 10.9 6.7 12.0 6.7
fmod.q Main 109 49989901 1.0 0.0 1.0 0.0
CAF:main_k Main 95 0 0.0 0.0 0.0 0.0
main Main 107 0 0.0 0.0 0.0 0.0
main.k2 Main 108 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 93 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 90 0 0.0 0.0 0.0 0.0
CAF GHC.Float 89 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 82 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 66 0 0.0 0.0 0.0 0.0
EDIT4: The link below is to the two opcode dumps (thanks to @Tom Ellis). Although I can't read them, they seem to have the same "shape". Presumably the long random-char strings are internal identifiers. I've just recompiled both with -O2 -fforce-recomp and the time differences are real.
https://gist.github.com/anonymous/6462797