I have the following two tables:
Historical_Data_Tbl:
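| DATE | Cloud% | Wind_KM | Solar_Utiliz | Price |
|---|---|---|---|---|
| 01-Jan | 0.85 | 0 | 0.1 | 4.5 |
| 02-Jan | 0.85 | 0 | 0.1 | 4.5 |
| 03-Jan | 0.95 | 15 | 0 | 10 |
| 04-Jan | 0.95 | 15 | 0 | 8 |
| 05-Jan | 0.6 | 25 | 0.35 | 6 |
| 06-Jan | 0.6 | 25 | 0.35 | 6 |
| 07-Jan | 0.2 | 55 | 0.8 | 6 |
| 08-Jan | 0.2 | 55 | 0.8 | 7 |
| 09-Jan | 0.55 | 10 | 0.5 | 5.5 |
| 10-Jan | 0.55 | 10 | 0.5 | 5.5 |
| 11-Jan | 0.28 | 12 | 0.6 | 2 |
| 12-Jan | 0.28 | 12 | 0.6 | 2 |
| 13-Jan | 0.1 | 40 | 0.9 | 3 |
| 14-Jan | 0.1 | 40 | 0.9 | 3 |
| 15-Jan | 0.33 | 17 | 0.7 | 8 |
| 16-Jan | 0.01 | 17 | 0.95 | 1 |
| 17-Jan | 0.01 | 17 | 0.95 | 1 |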
Forecast_Tbl:
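| Date | Fcst_Cloud | Fcst_Wind | Fcst_Solar | Max_Cloud | Min_Cloud | Max_Wind | Min_Wind | Max_Solar | Min_Solar |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.5 | 12 | 0.5 | 0.7 | 0.3 | 27 | -3 | 0.75 | 0.25 |
| 2 | 0.8 | 10 | 0.1 | 1 | 0.6 | 25 | -5 | 0.35 | -0.15 |
| 3 | 0.15 | 15 | 0.8 | 0.35 | -0.05 | 30 | 0 | 1.05 | 0.55 |
| 4 | 0.75 | 10 | 0.2 | 0.95 | 0.55 | 25 | -5 | 0.45 | -0.05 |
| 5 | 0.1 | 99 | 0.99 | 0.3 | -0.1 | 114 | 84 | 1.24 | 0.74 |
| 6 | 0.11 | 35 | 0.8 | 0.31 | -0.09 | 50 | 20 | 1.05 | 0.55 |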
CODE BELOW:
let
//Read in Historical table and set data types
Source = Excel.CurrentWorkbook(){[Name="Historical"]}[Content],
Historical = Table.Buffer(Table.TransformColumnTypes(Source,{
{"DATE", type date}, {"Cloud%", type number}, {"Wind_KM", Int64.Type},
{"Solar_Utiliz", type number}, {"Price", type number}})),
//Read in Forecast table and set data types
Source1 = Excel.CurrentWorkbook(){[Name="Forecast"]}[Content],
Forecast = Table.Buffer(Table.TransformColumnTypes(Source1,{
{"Date", Int64.Type}, {"Fcst_Cloud", type number}, {"Fcst_Wind", Int64.Type},
{"Fcst_Solar", type number}, {"Max_Cloud", type number},
{"Min_Cloud", type number}, {"Max_Wind", Int64.Type}, {"Min_Wind", Int64.Type},
{"Max_Solar", type number}, {"Min_Solar", type number}})),
//Generate list of filtered Historical Table for each row in Forecast Table with aggregations
//Merge aggregations with the associated Forecast row
#"Filtered Historical" = List.Generate(
()=>[t=Table.SelectRows(Historical, (h)=>
h[#"Cloud%"] <= Forecast[Max_Cloud]{0} and h[#"Cloud%"]>= Forecast[Min_Cloud]{0}
and h[Wind_KM] <= Forecast[Max_Wind]{0} and h[Wind_KM] >= Forecast[Min_Wind]{0}
and h[Solar_Utiliz] <= Forecast[Max_Solar]{0} and h[Solar_Utiliz] >= Forecast[Min_Solar]{0}),
idx=0],
each [idx] < Table.RowCount(Forecast),
each [t=Table.SelectRows(Historical, (h)=>
h[#"Cloud%"] <= Forecast[Max_Cloud]{[idx]+1} and h[#"Cloud%"]>= Forecast[Min_Cloud]{[idx]+1}
and h[Wind_KM] <= Forecast[Max_Wind]{[idx]+1} and h[Wind_KM] >= Forecast[Min_Wind]{[idx]+1}
and h[Solar_Utiliz] <= Forecast[Max_Solar]{[idx]+1} and h[Solar_Utiliz] >= Forecast[Min_Solar]{[idx]+1}),
idx=[idx]+1],
each Forecast{[idx]} & Record.FromList(
{List.Count([t][Price]),List.Min([t][Price]), List.Max([t][Price]),
List.Modes([t][Price]){0}, List.Median([t][Price]), List.Average([t][Price])},
{"Count","Min","Max","Mode","Median","Average"})),
#"Converted to Table" = Table.FromList(#"Filtered Historical", Splitter.SplitByNothing(), null, null, ExtraValues.Error),
#"Expanded Column1" = Table.ExpandRecordColumn(#"Converted to Table", "Column1",
{"Date", "Fcst_Cloud", "Fcst_Wind", "Fcst_Solar", "Max_Cloud", "Min_Cloud", "Max_Wind", "Min_Wind", "Max_Solar", "Min_Solar",
"Count", "Min", "Max", "Mode", "Median", "Average"}),
#"Changed Type" = Table.TransformColumnTypes(#"Expanded Column1",{
{"Date", Int64.Type}, {"Fcst_Cloud", Percentage.Type}, {"Fcst_Wind", Int64.Type}, {"Fcst_Solar", type number},
{"Max_Cloud", type number}, {"Min_Cloud", type number}, {"Max_Wind", Int64.Type}, {"Min_Wind", Int64.Type},
{"Max_Solar", type number}, {"Min_Solar", type number}, {"Count", Int64.Type},
{"Min", Currency.Type}, {"Max", Currency.Type}, {"Mode", Currency.Type}, {"Median", Currency.Type}, {"Average", Currency.Type}})
in
#"Changed Type"
And this is the resulting output:
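| Date | Fcst_Cloud | Fcst_Wind | Fcst_Solar | Max_Cloud | Min_Cloud | Max_Wind | Min_Wind | Max_Solar | Min_Solar | Count | Min | Max | Mode | Median | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.5 | 12 | 0.5 | 0.7 | 0.3 | 27 | 0 | 0.75 | 0.25 | 5 | 5.5 | 8 | 6 | 6 | 6.2 |
| 2 | 0.8 | 10 | 0.1 | 1 | 0.6 | 25 | -5 | 0.35 | -0.15 | 6 | 4.5 | 10 | 4.5 | 6 | 6.5 |
| 3 | 0.15 | 15 | 0.8 | 0.35 | -0.05 | 30 | 0 | 1.05 | 0.55 | 5 | 1 | 8 | 2 | 2 | 2.8 |
| 4 | 0.75 | 10 | 0.2 | 0.95 | 0.55 | 25 | -5 | 0.45 | -0.05 | 6 | 4.5 | 10 | 4.5 | 6 | 6.5 |
| 6 | 0.11 | 35 | 0.8 | 0.31 | -0.09 | 50 | 20 | 1.05 | 0.55 | 2 | 3 | 3 | 3 | 3 | 3 |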
The issue is that when a forecast row (for example, where Date "5" should appear in the output table) has no data points within the filtered range of the Historical Data table, the entire row comes back blank.
What I would like instead is for that row to return the original data from Forecast_Tbl in the first 10 columns, show "0" in the "Count" column (when no filter criteria are met), and use the previous row's "Average" value (in this case 6.5). Below is the output I would like the table to return:
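| Date | Fcst_Cloud | Fcst_Wind | Fcst_Solar | Max_Cloud | Min_Cloud | Max_Wind | Min_Wind | Max_Solar | Min_Solar | Count | Min | Max | Mode | Median | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.5 | 12 | 0.5 | 0.7 | 0.3 | 27 | 0 | 0.75 | 0.25 | 5 | 5.5 | 8 | 6 | 6 | 6.2 |
| 2 | 0.8 | 10 | 0.1 | 1 | 0.6 | 25 | -5 | 0.35 | -0.15 | 6 | 4.5 | 10 | 4.5 | 6 | 6.5 |
| 3 | 0.15 | 15 | 0.8 | 0.35 | -0.05 | 30 | 0 | 1.05 | 0.55 | 5 | 1 | 8 | 2 | 2 | 2.8 |
| 4 | 0.75 | 10 | 0.2 | 0.95 | 0.55 | 25 | -5 | 0.45 | -0.05 | 6 | 4.5 | 10 | 4.5 | 6 | 6.5 |
| 5 | 0.1 | 99 | 0.99 | 0.3 | -0.1 | 114 | 84 | 1.24 | 0.74 | 0 | | | | | 6.5 |
| 6 | 0.11 | 35 | 0.8 | 0.31 | -0.09 | 50 | 20 | 1.05 | 0.55 | 2 | 3 | 3 | 3 | 3 | 3 |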
I have tried using conditional if expressions, but without success.
How about
....
{"Count","Min","Max","Mode","Median","Average"})),
#"Converted to Table" = Table.FromList(#"Filtered Historical", Splitter.SplitByNothing(), null, null, ExtraValues.Error),
#"Added Index" = Table.AddIndexColumn(#"Converted to Table", "Index", 0, 1, Int64.Type),
#"Added Custom" = Table.AddColumn(#"Added Index", "Column2", each try if Value.Is([Column1], type record ) then [Column1] else null otherwise Record.Combine({Forecast{[Index]}, [Count = 0, Average = #"Added Index"{[Index]-1}[Column1][Average]]})),
#"Expanded Column1" = Table.ExpandRecordColumn(Table.SelectColumns(#"Added Custom",{"Column2"}), "Column2",
{"Date", "Fcst_Cloud", "Fcst_Wind", "Fcst_Solar", "Max_Cloud", "Min_Cloud", "Max_Wind", "Min_Wind", "Max_Solar", "Min_Solar",
"Count", "Min", "Max", "Mode", "Median", "Average"}),
....
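For readers who want to check the intended fallback rule outside Power Query, here is a minimal Python sketch of the same logic. It is not a translation of the M code above. The function name summarize is hypothetical, the rows are assumed to be plain dictionaries keyed by the column names from the question, and leaving Min, Max, Mode and Median empty for an unmatched row is taken from the desired output table.

```python
from statistics import mean, median, multimode


def summarize(forecast_rows, historical_rows):
    """Aggregate historical prices per forecast row, with a fallback for empty matches."""
    results = []
    previous_average = None  # stays None if the very first row has no matches
    for f in forecast_rows:
        # Keep the prices of historical rows that fall inside all three ranges.
        prices = [
            h["Price"]
            for h in historical_rows
            if f["Min_Cloud"] <= h["Cloud%"] <= f["Max_Cloud"]
            and f["Min_Wind"] <= h["Wind_KM"] <= f["Max_Wind"]
            and f["Min_Solar"] <= h["Solar_Utiliz"] <= f["Max_Solar"]
        ]
        if prices:
            stats = {
                "Count": len(prices),
                "Min": min(prices),
                "Max": max(prices),
                "Mode": multimode(prices)[0],  # first mode, like List.Modes(...){0}
                "Median": median(prices),
                "Average": mean(prices),
            }
        else:
            # No match: Count is 0 and Average is carried over from the previous row.
            stats = {
                "Count": 0,
                "Min": None,
                "Max": None,
                "Mode": None,
                "Median": None,
                "Average": previous_average,
            }
        previous_average = stats["Average"]
        results.append({**f, **stats})
    return results
```

Because the carried-over value is itself stored as the row's Average, several consecutive unmatched rows would all inherit the last computed average. Whether that is the desired behaviour is an assumption; the question only shows a single unmatched row.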
I would like to show some experimental results about RocksDB Put performance: two-threaded Put throughput is lower than single-threaded Put throughput. This is weird, because RocksDB uses the default skiplist as its memtable, and this data structure supports concurrent writes.
Here is my testing code.
uint64_t nthread = 2;
uint64_t nkeys = 16000000;
std::thread threads[nthread];
std::atomic<uint64_t> idx(1000000);
for (int t = 0; t < nthread; t++) {
    threads[t] = std::thread([db, &idx, nthread, nkeys, &write_option_disable] {
        WriteBatch batch;
        for (int i = 0; i < nkeys / nthread; i++) {
            std::string key = "WVERIFY" + std::to_string(idx.fetch_add(1));
            std::string value = "MOCK";
            auto ikey = rocksdb::Slice(key);
            auto ivalue = rocksdb::Slice(value);
            db->Put(write_option_disable, ikey, ivalue);
        }
        return 0;
    });
}
for (auto& t : threads) {
    t.join();
}
Besides, here are the results I got.
// Single thread
Uptime(secs): 8.4 total, 8.3 interval
Flush(GB): cumulative 1.170, interval 1.170
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 1.17 GB write, 143.35 MB/s write, 0.00 GB read, 0.00 MB/s read, 8.1 seconds
Interval compaction: 1.17 GB write, 144.11 MB/s write, 0.00 GB read, 0.00 MB/s read, 8.1 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Block cache LRUCache@0x564742515ea0#7011 capacity: 8.00 MB collections: 1 last_copies: 0 last_secs: 2e-05 secs_since: 8
Block cache entry stats(count,size,portion): Misc(1,0.00 KB,0%)
** File Read Latency Histogram By Level [default] **
** DB Stats **
Uptime(secs): 8.4 total, 8.3 interval
Cumulative writes: 16M writes, 16M keys, 16M commit groups, 1.0 writes per commit group, ingest: 1.63 GB, 199.80 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 16M writes, 16M keys, 16M commit groups, 1.0 writes per commit group, ingest: 1669.88 MB, 200.85 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
// 2 threads
Uptime(secs): 31.4 total, 31.4 interval
Flush(GB): cumulative 0.183, interval 0.183
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.67 GB write, 21.84 MB/s write, 0.97 GB read, 31.68 MB/s read, 10.2 seconds
Interval compaction: 0.67 GB write, 21.87 MB/s write, 0.97 GB read, 31.72 MB/s read, 10.2 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Block cache LRUCache@0x5619fb7bbea0#6183 capacity: 8.00 MB collections: 1 last_copies: 0 last_secs: 1.9e-05 secs_since: 31
Block cache entry stats(count,size,portion): Misc(1,0.00 KB,0%)
** File Read Latency Histogram By Level [default] **
** DB Stats **
Uptime(secs): 31.4 total, 31.4 interval
Cumulative writes: 16M writes, 16M keys, 11M commit groups, 1.4 writes per commit group, ingest: 0.45 GB, 14.67 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 16M writes, 16M keys, 11M commit groups, 1.4 writes per commit group, ingest: 460.94 MB, 14.69 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
===========================update==========================
These are my RocksDB settings.
DB* db;
Options options;
BlockBasedTableOptions table_options;
rocksdb::WriteOptions write_option_disable;
write_option_disable.disableWAL = true;
// Optimize RocksDB. This is the easiest way to get RocksDB to perform well
options.IncreaseParallelism();
options.OptimizeLevelStyleCompaction();
// create the DB if it's not already present
options.create_if_missing = true;
The atomic idx shared between the two threads can introduce non-trivial overhead. Try inserting random values from each thread, and maybe increase the number of threads.
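One way to drop the shared counter is to give each thread its own disjoint key range up front. This is a variation on the suggestion above, not the same thing as inserting random values, and whether it removes the slowdown has not been measured here. The sketch below only shows the partitioning arithmetic, in Python for brevity; key_ranges is a hypothetical helper, and the numbers are the ones from the question's test code.

```python
def key_ranges(first_key, nkeys, nthread):
    """Split [first_key, first_key + nkeys) into nthread contiguous, disjoint ranges."""
    base, extra = divmod(nkeys, nthread)
    ranges = []
    start = first_key
    for t in range(nthread):
        # The first `extra` threads take one additional key when nkeys is not divisible.
        size = base + (1 if t < extra else 0)
        ranges.append((start, start + size))  # half-open range [start, end)
        start += size
    return ranges


# Values from the question: keys start at 1000000, 16000000 keys, 2 threads.
for thread_id, (lo, hi) in enumerate(key_ranges(1000000, 16000000, 2)):
    print(thread_id, lo, hi)
```

Each thread would then loop over its own range with a plain local counter instead of calling idx.fetch_add(1).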
Below is the output of the top command on my RHEL 6 server, which is running PostgreSQL.
The Cpu(s) line shows 35.8% idle, while every process listed below shows about 100% CPU usage.
How should I read this output?
top - 03:06:30 up 97 days, 20:15, 3 users, load average: 10.85, 10.51, 10.13
Tasks: 738 total, 14 running, 724 sleeping, 0 stopped, 0 zombie
Cpu(s): 53.3%us, 9.6%sy, 0.0%ni, 35.8%id, 0.6%wa, 0.0%hi, 0.7%si, 0.0%st
Mem: 32077620k total, 24335372k used, 7742248k free, 19084k buffers
Swap: 81919992k total, 407968k used, 81512024k free, 18686780k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19171 enterpri 20 0 8590m 966m 951m R 100.0 3.1 6:24.51 edb-postgres
19588 enterpri 20 0 8590m 956m 941m R 100.0 3.1 1:20.51 edb-postgres
18494 enterpri 20 0 8590m 959m 944m R 99.8 3.1 18:18.75 edb-postgres
18683 enterpri 20 0 8588m 984m 975m R 99.8 3.1 6:22.80 edb-postgres
19158 enterpri 20 0 8592m 1.0g 1.0g R 99.8 3.3 5:40.16 edb-postgres
19167 enterpri 20 0 8589m 959m 945m R 99.8 3.1 7:48.53 edb-postgres
19590 enterpri 20 0 8586m 945m 933m R 99.8 3.0 2:51.32 edb-postgres
19591 enterpri 20 0 8588m 950m 936m R 99.8 3.0 3:07.77 edb-postgres
19592 enterpri 20 0 8589m 948m 935m R 99.8 3.0 2:52.66 edb-postgres
You have a lot of CPUs (how many?) on your system. Some of them are very busy running postgres, and some of them are not.
In your version of top, %CPU represents the percent of a single CPU, not the percent of the total system CPU. If you had a threaded application, one entry could show more than 100%, but PostgreSQL is not threaded within a single process.
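The relation between the two views can be sketched with a little arithmetic. The question does not say how many CPUs the server has, so the value 16 below is purely hypothetical, and system_cpu_percent is an illustrative helper rather than anything top provides. The per-process figures are the nine rows shown in the question.

```python
def system_cpu_percent(per_process_percent, cpu_count):
    """Convert per-process %CPU (percent of one CPU each) into a share of the whole machine."""
    return sum(per_process_percent) / cpu_count


# %CPU column of the nine edb-postgres processes listed in the question.
listed = [100.0, 100.0, 99.8, 99.8, 99.8, 99.8, 99.8, 99.8, 99.8]

# Hypothetical CPU count; replace it with the real number of logical CPUs.
assumed_cpus = 16

print(round(system_cpu_percent(listed, assumed_cpus), 1))
```

With any CPU count larger than the number of busy processes, the machine-wide busy share stays below 100% even though every listed process is saturating one CPU, which is consistent with a non-zero idle figure on the Cpu(s) line.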
I am going to demonstrate the problem using the following example program
{-# LANGUAGE BangPatterns #-}

data Point = Point !Double !Double

fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b
         | otherwise = if a < b then a
                       else let q = a / b
                            in b * (q - fromIntegral (floor q :: Int))

standardMap :: Double -> Point -> Point
standardMap k (Point q p) =
    Point (fmod (q + p) (2 * pi)) (fmod (p + k * sin(q)) (2 * pi))

iterate' gen !p = p : (iterate' gen $ gen p)

main = putStrLn
     . show
     . (\(Point a b) -> a + b)
     . head . drop 100000000
     . iterate' (standardMap k) $ (Point 0.15 0.25)
    where k = (cos (pi/3)) - (sin (pi/3))
Here standardMap k is the parametrized function and k=(cos (pi/3))-(sin (pi/3)) is a parameter. If I compile this program with ghc -O3 -fllvm, the execution time on my machine is approximately 42s. However, if I write k in the form 0.5 - (sin (pi/3)), the execution time is 21s, and if I write k = 0.5 - 0.5 * (sqrt 3), it takes only 12s.
The conclusion is that k is reevaluated on each call of standardMap k.
Why is this not optimized?
P.S. compiler ghc 7.6.3 on archlinux
EDIT
For those who are concerned with the weird properties of standardMap here is a simpler and more intuitive example, which exhibits the same problem
{-# LANGUAGE BangPatterns #-}

data Point = Point !Double !Double

rotate :: Double -> Point -> Point
rotate k (Point q p) =
    Point ((cos k) * q - (sin k) * p) ((sin k) * q + (cos k) * p)

iterate' gen !p = p : (iterate' gen $ gen p)

main = putStrLn
     . show
     . (\(Point a b) -> a + b)
     . head . drop 100000000
     . iterate' (rotate k) $ (Point 0.15 0.25)
    where --k = (cos (pi/3)) - (sin (pi/3))
          k = 0.5 - 0.5 * (sqrt 3)
EDIT
Before I asked the question I had tried to make k strict, the same way Don suggested, but with ghc -O3 I didn't see a difference. The solution with strictness works if the program is compiled with ghc -O2. I missed that because I didn't try all possible combinations of flags with all possible versions of the program.
So what is the difference between -O3 and -O2 that affects such cases?
Should I prefer -O2 in general?
EDIT
As observed by Mike Hartl and others, if rotate k is changed into rotate $ k or standardMap k into standardMap $ k, the performance is improved, though it is not the best possible (Don's solution). Why?
As always, check the core.
With ghc -O2, k is inlined into the loop body, which is floated out as a top level function:
Main.main7 :: Main.Point -> Main.Point
Main.main7 =
\ (ds_dAa :: Main.Point) ->
case ds_dAa of _ { Main.Point q_alG p_alH ->
case q_alG of _ { GHC.Types.D# x_s1bt ->
case p_alH of _ { GHC.Types.D# y_s1bw ->
case Main.$wfmod (GHC.Prim.+## x_s1bt y_s1bw) 6.283185307179586
of ww_s1bi { __DEFAULT ->
case Main.$wfmod
(GHC.Prim.+##
y_s1bw
(GHC.Prim.*##
(GHC.Prim.-##
(GHC.Prim.cosDouble# 1.0471975511965976)
(GHC.Prim.sinDouble# 1.0471975511965976))
(GHC.Prim.sinDouble# x_s1bt)))
6.283185307179586
of ww1_X1bZ { __DEFAULT ->
Main.Point (GHC.Types.D# ww_s1bi) (GHC.Types.D# ww1_X1bZ)
Indicating that the sin and cos calls aren't evaluated at compile time.
The result is that a bit more math is going to occur:
$ time ./A
3.1430515093368085
real 0m15.590s
If you make it strict, it is at least not recalculated each time:
main = putStrLn
     . show
     . (\(Point a b) -> a + b)
     . head . drop 100000000
     . iterate' (standardMap k) $ (Point 0.15 0.25)
    where
      k :: Double
      !k = (cos (pi/3)) - (sin (pi/3))
Resulting in:
ipv_sEq =
GHC.Prim.-##
(GHC.Prim.cosDouble# 1.0471975511965976)
(GHC.Prim.sinDouble# 1.0471975511965976) } in
And a running time of:
$ time ./A
6.283185307179588
real 0m7.859s
Which I think is good enough for now. I'd also add unpack pragmas to the Point type.
If you want to reason about numeric performance under different code arrangements, you must inspect the Core.
Your revised example suffers from the same issue: k is inlined into rotate. GHC thinks it is really cheap, when in this benchmark it is more expensive.
Naively, ghc-7.2.3 -O2
$ time ./A
0.1470480616244365
real 0m22.897s
And k is evaluated each time rotate is called.
Make k strict: one way to force it to be computed once instead of being recomputed on every call.
$ time ./A
0.14704806100839019
real 0m2.360s
Using UNPACK pragmas on the Point constructor:
$ time ./A
0.14704806100839019
real 0m1.860s
I don't think it is repeated evaluation.
First, I switched to "do" notation and used a "let" on the definition of "k" which I figured should help. No - still slow.
Then I added a trace call - just being evaluated once. Even checked that the fast variant was in fact producing a Double.
Then I printed out both variations. There is a small difference in the starting values.
Tweaking the value of the "slow" variant makes it the same speed. I've no idea what your algorithm is for - would it be very sensitive to starting values?
import Debug.Trace (trace)
...
main = do
    -- is -0.3660254037844386
    let k0 = (0.5 - 0.5 * (sqrt 3))::Double
    -- was -0.3660254037844385
    let k1 = (cos (pi/3)) - (trace "x" (sin (pi/3))) + 0.0000000000000001;
    putStrLn (show k0)
    putStrLn (show k1)
    putStrLn
      . show
      . (\(Point a b) -> a + b)
      . head . drop 100000000
      . iterate' (standardMap k1) $ (Point 0.15 0.25)
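The observation that the two spellings of k are not the same Double can be reproduced outside Haskell. The sketch below uses Python only because its floats are the same IEEE 754 doubles; the names k_trig and k_sqrt are made up for this illustration, and it shows nothing about GHC's optimizer.

```python
import math

# The "slow" spelling from the question.
k_trig = math.cos(math.pi / 3) - math.sin(math.pi / 3)

# The "fast" spelling from the question.
k_sqrt = 0.5 - 0.5 * math.sqrt(3)

# repr prints enough digits to round-trip a double, so a last-digit difference is visible.
print(repr(k_trig))
print(repr(k_sqrt))
print(k_trig == k_sqrt)
```

If the two printed values differ in the last digit, as the comments in the Haskell snippet above report, then the two programs iterate the map with slightly different parameters, so their timings are not comparing identical computations.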
EDIT: this is the version with numeric literals. It's displaying runtimes of 23sec vs 7sec for me. I compiled two separate versions of the code to make sure I wasn't doing something stupid like not recompiling.
main = do
    -- -0.3660254037844386
    -- -0.3660254037844385
    let k2 = -0.3660254037844385
    putStrLn
      . show
      . (\(Point a b) -> a + b)
      . head . drop 100000000
      . iterate' (standardMap k2) $ (Point 0.15 0.25)
EDIT2: I don't know how to get the opcodes from ghc, but comparing the hexdumps for the two .o files shows they differ by a single byte - presumably the literal. So it can't be the runtime.
EDIT3: Tried turning profiling on, and that has just puzzled me even more. Unless I'm missing something, the only difference is a small discrepancy in the number of calls to fmod (fmod.q to be precise).
The "5" profile is for the constant ending "5", same with "6".
Fri Sep 6 12:37 2013 Time and Allocation Profiling Report (Final)
constant-timings-5 +RTS -p -RTS
total time = 38.34 secs (38343 ticks @ 1000 us, 1 processor)
total alloc = 12,000,105,184 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
standardMap Main 71.0 0.0
iterate' Main 21.2 93.3
fmod Main 6.3 6.7
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 50 0 0.0 0.0 100.0 100.0
main Main 101 0 0.0 0.0 0.0 0.0
CAF:main1 Main 98 0 0.0 0.0 0.0 0.0
main Main 100 1 0.0 0.0 0.0 0.0
CAF:main2 Main 97 0 0.0 0.0 1.0 0.0
main Main 102 0 1.0 0.0 1.0 0.0
main.\ Main 110 1 0.0 0.0 0.0 0.0
CAF:main3 Main 96 0 0.0 0.0 99.0 100.0
main Main 103 0 0.0 0.0 99.0 100.0
iterate' Main 104 100000001 21.2 93.3 99.0 100.0
standardMap Main 105 100000000 71.0 0.0 77.9 6.7
fmod Main 106 200000001 6.3 6.7 6.9 6.7
fmod.q Main 109 49999750 0.6 0.0 0.6 0.0
CAF:main_k Main 95 0 0.0 0.0 0.0 0.0
main Main 107 0 0.0 0.0 0.0 0.0
main.k2 Main 108 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 93 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 90 0 0.0 0.0 0.0 0.0
CAF GHC.Float 89 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 82 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 66 0 0.0 0.0 0.0 0.0
Fri Sep 6 12:38 2013 Time and Allocation Profiling Report (Final)
constant-timings-6 +RTS -p -RTS
total time = 22.17 secs (22167 ticks @ 1000 us, 1 processor)
total alloc = 11,999,947,752 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
standardMap Main 48.4 0.0
iterate' Main 38.2 93.3
fmod Main 10.9 6.7
main Main 1.4 0.0
fmod.q Main 1.0 0.0
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 50 0 0.0 0.0 100.0 100.0
main Main 101 0 0.0 0.0 0.0 0.0
CAF:main1 Main 98 0 0.0 0.0 0.0 0.0
main Main 100 1 0.0 0.0 0.0 0.0
CAF:main2 Main 97 0 0.0 0.0 1.4 0.0
main Main 102 0 1.4 0.0 1.4 0.0
main.\ Main 110 1 0.0 0.0 0.0 0.0
CAF:main3 Main 96 0 0.0 0.0 98.6 100.0
main Main 103 0 0.0 0.0 98.6 100.0
iterate' Main 104 100000001 38.2 93.3 98.6 100.0
standardMap Main 105 100000000 48.4 0.0 60.4 6.7
fmod Main 106 200000001 10.9 6.7 12.0 6.7
fmod.q Main 109 49989901 1.0 0.0 1.0 0.0
CAF:main_k Main 95 0 0.0 0.0 0.0 0.0
main Main 107 0 0.0 0.0 0.0 0.0
main.k2 Main 108 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 93 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 90 0 0.0 0.0 0.0 0.0
CAF GHC.Float 89 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding 82 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 66 0 0.0 0.0 0.0 0.0
EDIT4: Link below is to the two opcode dumps (thanks to @Tom Ellis). Although I can't read them, they seem to have the same "shape". Presumably the long random-char strings are internal identifiers. I've just recompiled both with -O2 -fforce-recomp and the time differences are real.
https://gist.github.com/anonymous/6462797
I created a basic TCP server that reads incoming binary data in protocol buffer format and writes a binary message as the response. I would like to benchmark the round-trip time.
I tried iperf, but could not make it send the same input file multiple times. Is there another benchmark tool that can send a binary input file repeatedly?
If you have access to a Linux or Unix machine [1], you should use tcptrace. All you need to do is loop through your binary traffic test while capturing with wireshark or tcpdump.
After you have that .pcap file [2], analyze it with tcptrace -xtraffic <pcap_filename> [3]. This will generate two text files, and the average RTT stats for all connections in that pcap are shown at the bottom of the one called traffic_stats.dat.
[mpenning@Bucksnort tcpperf]$ tcptrace -xtraffic willers.pcap
mod_traffic: characterizing traffic
1 arg remaining, starting with 'willers.pcap'
Ostermann's tcptrace -- version 6.6.1 -- Wed Nov 19, 2003
16522 packets seen, 16522 TCP packets traced
elapsed wallclock time: 0:00:00.200709, 82318 pkts/sec analyzed
trace file elapsed time: 0:03:21.754962
Dumping port statistics into file traffic_byport.dat
Dumping overall statistics into file traffic_stats.dat
Plotting performed at 15.000 second intervals
[mpenning@Bucksnort tcpperf]$
[mpenning@Bucksnort tcpperf]$ cat traffic_stats.dat
Overall Statistics over 201 seconds (0:03:21.754962):
4135308 ttl bytes sent, 20573.672 bytes/second
4135308 ttl non-rexmit bytes sent, 20573.672 bytes/second
0 ttl rexmit bytes sent, 0.000 bytes/second
16522 packets sent, 82.199 packets/second
200 connections opened, 0.995 conns/second
11 dupacks sent, 0.055 dupacks/second
0 rexmits sent, 0.000 rexmits/second
average RTT: 67.511 msecs <------------------
[mpenning@Bucksnort tcpperf]$
The .pcap file used in this example was a capture I generated when I looped through an expect script that pulled data from one of my servers. This was how I generated the loop...
#!/usr/bin/env python3
from subprocess import Popen, PIPE
import time

# Launch the expect script 200 times, one second apart
for ii in range(0, 200):
    # willers.exp is an expect script
    Popen(['./willers.exp'], stdin=PIPE, stdout=PIPE, stderr=PIPE)
    time.sleep(1)
You can adjust the sleep time between loops based on your server's accept() performance and the duration of your tests.
END NOTES:
[1] A Knoppix Live-CD will do.
[2] Filtered to only capture test traffic.
[3] tcptrace is capable of very detailed per-socket stats if you use other options...
================================
[mpenning@Bucksnort tcpperf]$ tcptrace -lr willers.pcap
1 arg remaining, starting with 'willers.pcap'
Ostermann's tcptrace -- version 6.6.1 -- Wed Nov 19, 2003
16522 packets seen, 16522 TCP packets traced
elapsed wallclock time: 0:00:00.080496, 205252 pkts/sec analyzed
trace file elapsed time: 0:03:21.754962
TCP connection info:
200 TCP connections traced:
TCP connection 1:
host c: myhost.local:44781
host d: willers.local:22
complete conn: RESET (SYNs: 2) (FINs: 1)
first packet: Tue May 31 22:52:24.154801 2011
last packet: Tue May 31 22:52:25.668430 2011
elapsed time: 0:00:01.513628
total packets: 73
filename: willers.pcap
c->d: d->c:
total packets: 34 total packets: 39
resets sent: 4 resets sent: 0
ack pkts sent: 29 ack pkts sent: 39
pure acks sent: 11 pure acks sent: 2
sack pkts sent: 0 sack pkts sent: 0
dsack pkts sent: 0 dsack pkts sent: 0
max sack blks/ack: 0 max sack blks/ack: 0
unique bytes sent: 2512 unique bytes sent: 14336
actual data pkts: 17 actual data pkts: 36
actual data bytes: 2512 actual data bytes: 14336
rexmt data pkts: 0 rexmt data pkts: 0
rexmt data bytes: 0 rexmt data bytes: 0
zwnd probe pkts: 0 zwnd probe pkts: 0
zwnd probe bytes: 0 zwnd probe bytes: 0
outoforder pkts: 0 outoforder pkts: 0
pushed data pkts: 17 pushed data pkts: 33
SYN/FIN pkts sent: 1/1 SYN/FIN pkts sent: 1/0
req 1323 ws/ts: Y/Y req 1323 ws/ts: Y/Y
adv wind scale: 6 adv wind scale: 1
req sack: Y req sack: Y
sacks sent: 0 sacks sent: 0
urgent data pkts: 0 pkts urgent data pkts: 0 pkts
urgent data bytes: 0 bytes urgent data bytes: 0 bytes
mss requested: 1460 bytes mss requested: 1460 bytes
max segm size: 792 bytes max segm size: 1448 bytes
min segm size: 16 bytes min segm size: 32 bytes
avg segm size: 147 bytes avg segm size: 398 bytes
max win adv: 40832 bytes max win adv: 66608 bytes
min win adv: 5888 bytes min win adv: 66608 bytes
zero win adv: 0 times zero win adv: 0 times
avg win adv: 14035 bytes avg win adv: 66608 bytes
initial window: 32 bytes initial window: 40 bytes
initial window: 1 pkts initial window: 1 pkts
ttl stream length: 2512 bytes ttl stream length: NA
missed data: 0 bytes missed data: NA
truncated data: 0 bytes truncated data: 0 bytes
truncated packets: 0 pkts truncated packets: 0 pkts
data xmit time: 1.181 secs data xmit time: 1.236 secs
idletime max: 196.9 ms idletime max: 196.9 ms
throughput: 1660 Bps throughput: 9471 Bps
RTT samples: 18 RTT samples: 24
RTT min: 43.8 ms RTT min: 0.0 ms
RTT max: 142.5 ms RTT max: 7.2 ms
RTT avg: 68.5 ms RTT avg: 0.7 ms
RTT stdev: 35.8 ms RTT stdev: 1.6 ms
RTT from 3WHS: 80.8 ms RTT from 3WHS: 0.0 ms
RTT full_sz smpls: 1 RTT full_sz smpls: 3
RTT full_sz min: 142.5 ms RTT full_sz min: 0.0 ms
RTT full_sz max: 142.5 ms RTT full_sz max: 0.0 ms
RTT full_sz avg: 142.5 ms RTT full_sz avg: 0.0 ms
RTT full_sz stdev: 0.0 ms RTT full_sz stdev: 0.0 ms
post-loss acks: 0 post-loss acks: 0
segs cum acked: 0 segs cum acked: 9
duplicate acks: 0 duplicate acks: 1
triple dupacks: 0 triple dupacks: 0
max # retrans: 0 max # retrans: 0
min retr time: 0.0 ms min retr time: 0.0 ms
max retr time: 0.0 ms max retr time: 0.0 ms
avg retr time: 0.0 ms avg retr time: 0.0 ms
sdv retr time: 0.0 ms sdv retr time: 0.0 ms
================================
You can always put a shell loop around a program like iperf. Also, if iperf or a program like ttcp can read from a file (and thus from stdin), a shell loop could cat the file N times into iperf/ttcp.
If you want a program which sends a file, waits for your binary response, and then sends another copy of the file, you probably are going to need to code that yourself.
You will need to measure the time in the client application for a roundtrip time, or monitor the network traffic going from, and coming to, the client to get the complete time interval. Measuring the time at the server will exclude any kernel level delays in the server and all the network transmission times.
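A client of that kind does not need much code. The following is a minimal sketch under several assumptions: measure_round_trips is a hypothetical name, the response length is taken to be known in advance (a real protocol buffer exchange would need its own framing), one connection is opened per request, and the timer starts after the connection is established, so connection setup is not included in the measurement.

```python
import socket
import time


def measure_round_trips(host, port, payload, response_size, repeats=10):
    """Send payload, wait for response_size bytes, and return each round-trip time in seconds."""
    timings = []
    for _ in range(repeats):
        with socket.create_connection((host, port)) as sock:
            start = time.perf_counter()
            sock.sendall(payload)
            received = 0
            # recv may return fewer bytes than requested, so loop until done.
            while received < response_size:
                chunk = sock.recv(response_size - received)
                if not chunk:
                    raise ConnectionError("server closed the connection early")
                received += len(chunk)
            timings.append(time.perf_counter() - start)
    return timings
```

A typical call would read the input file once, for example with open("input.binary", "rb"), pass its bytes as payload, and then average the returned list. The file name, host and port are placeholders.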
Note that TCP performance will go down as the load goes up. If you're going to test under heavy load, you need professional tools that can scale to thousands (or even millions in some cases) of new connection/second or concurrent established TCP connections.
I wrote an article about this on my blog (feel free to remove if this is considered advertisement, but I think it's relevant to this thread): http://synsynack.wordpress.com/2012/04/09/realistic-latency-measurement-in-the-application-layers
As a very simple high-level tool, netcat comes to mind: something like time (nc hostname 1234 < input.binary | head -c 100), assuming the response is 100 bytes long.