This question is similar to IPC performance: Named Pipe vs Socket, but focuses on anonymous instead of named pipes: what is the performance difference between an anonymous pipe and a TCP connection on different operating systems and with different transfer sizes?
I tried to benchmark it using BenchmarkDotNet with the code attached at the end of this post. When the program starts, it initializes BenchmarkDotNet, which in turn invokes the GlobalSetup() method once and the two benchmarked methods (Pipe() and Tcp()) many times.
In GlobalSetup(), two child processes are started: one for pipe communication and one for TCP communication. Once the child processes are ready, they wait for a trigger signal and the number of values N to be transferred (both provided via stdin) and then start sending data.
When the benchmarked methods (Pipe() and Tcp()) are invoked, they send the trigger signal and the number of values N and wait for the incoming data.
It turned out that it is important to set TcpClient.NoDelay = true to disable the Nagle algorithm, which collects small messages until a certain threshold or a certain timeout is reached. Interestingly, this affects only the Linux tests with N = 10000: with NoDelay = false (the default), the average time for that test jumps from ~40 µs to ~40 ms.
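For reference, here is a minimal sketch of disabling Nagle on both ends of the connection, mirroring what the benchmark code at the end of this post does (the two halves live in different processes, so the blocking accept/connect order is not an issue):

using System.Net;
using System.Net.Sockets;

// client side (child process): disable Nagle before small writes
var client = new TcpClient() { NoDelay = true };
client.Connect("localhost", 55555);

// server side (host process): disable it on the accepted connection, too
var listener = new TcpListener(IPAddress.Loopback, 55555);
listener.Start();
var accepted = listener.AcceptTcpClient();
accepted.NoDelay = true;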
Here are the results:
Legends
N : Number of int32 values to be transmitted
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
Median : Value separating the higher half of all measurements (50th percentile)
Ratio : Mean of the ratio distribution ([Current]/[Baseline])
RatioSD : Standard deviation of the ratio distribution ([Current]/[Baseline])
1 us : 1 Microsecond (0.000001 sec)
Virtual Machine (Ubuntu 20.04)
BenchmarkDotNet=v0.13.0, OS=ubuntu 20.04
AMD Opteron(tm) Processor 4334, 4 CPU, 4 logical and 4 physical cores
.NET SDK=5.0.102
[Host] : .NET 5.0.2 (5.0.220.61120), X64 RyuJIT
DefaultJob : .NET 5.0.2 (5.0.220.61120), X64 RyuJIT
| Method |        N |         Mean |       Error |      StdDev |       Median | Ratio | RatioSD |
|------- |---------:|-------------:|------------:|------------:|-------------:|------:|--------:|
| Pipe   |        1 |     27.33 μs |    1.660 μs |    4.895 μs |     30.75 μs |  1.00 |    0.00 |
| Tcp    |        1 |     31.42 μs |    0.620 μs |    0.713 μs |     31.24 μs |  1.39 |    0.21 |
| Pipe   |      100 |     26.72 μs |    1.990 μs |    5.867 μs |     26.63 μs |  1.00 |    0.00 |
| Tcp    |      100 |     38.95 μs |    2.146 μs |    6.327 μs |     43.34 μs |  1.53 |    0.43 |
| Pipe   |    10000 |     42.45 μs |    2.804 μs |    8.268 μs |     47.09 μs |  1.00 |    0.00 |
| Tcp    |    10000 |     46.97 μs |    3.057 μs |    9.013 μs |     53.93 μs |  1.16 |    0.34 |
| Pipe   |  1000000 |  1,621.87 μs |  116.924 μs |  344.752 μs |  1,893.49 μs |  1.00 |    0.00 |
| Tcp    |  1000000 |  1,707.25 μs |    8.066 μs |    7.545 μs |  1,707.24 μs |  0.94 |    0.13 |
| Pipe   | 10000000 | 21,013.86 μs |  166.250 μs |  129.797 μs | 21,007.89 μs |  1.00 |    0.00 |
| Tcp    | 10000000 | 20,548.03 μs |  407.779 μs |  814.379 μs | 20,713.44 μs |  0.96 |    0.03 |
Notebook (Ubuntu 20.04 on Windows 10 + WSL2):
BenchmarkDotNet=v0.13.0, OS=ubuntu 20.04
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET SDK=5.0.301
[Host] : .NET 5.0.7 (5.0.721.25508), X64 RyuJIT
DefaultJob : .NET 5.0.7 (5.0.721.25508), X64 RyuJIT
| Method |        N |         Mean |        Error |        StdDev |       Median | Ratio | RatioSD |
|------- |---------:|-------------:|-------------:|--------------:|-------------:|------:|--------:|
| Pipe   |        1 |     44.66 μs |     0.882 μs |      1.051 μs |     44.45 μs |  1.00 |    0.00 |
| Tcp    |        1 |     54.42 μs |     0.411 μs |      0.364 μs |     54.34 μs |  1.21 |    0.03 |
| Pipe   |      100 |     45.07 μs |     0.895 μs |      1.496 μs |     44.63 μs |  1.00 |    0.00 |
| Tcp    |      100 |     55.27 μs |     0.735 μs |      0.614 μs |     55.17 μs |  1.21 |    0.05 |
| Pipe   |    10000 |     52.30 μs |     1.018 μs |      1.131 μs |     52.32 μs |  1.00 |    0.00 |
| Tcp    |    10000 |     55.47 μs |     0.590 μs |      0.523 μs |     55.32 μs |  1.06 |    0.03 |
| Pipe   |  1000000 |  4,034.01 μs |    77.978 μs |     65.115 μs |  4,035.58 μs |  1.00 |    0.00 |
| Tcp    |  1000000 |  1,398.62 μs |    24.230 μs |     21.479 μs |  1,395.20 μs |  0.35 |    0.01 |
| Pipe   | 10000000 | 69,767.35 μs | 4,993.492 μs | 14,723.423 μs | 64,169.46 μs |  1.00 |    0.00 |
| Tcp    | 10000000 | 24,660.43 μs | 1,746.809 μs |  4,955.406 μs | 23,947.15 μs |  0.38 |    0.14 |
Notebook (Windows 10):
BenchmarkDotNet=v0.13.0, OS=Windows 10.0.19043.1083 (21H1/May2021Update)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET SDK=5.0.203
[Host] : .NET 5.0.6 (5.0.621.22011), X64 RyuJIT
DefaultJob : .NET 5.0.6 (5.0.621.22011), X64 RyuJIT
| Method |        N |         Mean |      Error |     StdDev |       Median | Ratio | RatioSD |
|------- |---------:|-------------:|-----------:|-----------:|-------------:|------:|--------:|
| Pipe   |        1 |     22.60 μs |   0.441 μs |   1.013 μs |     22.21 μs |  1.00 |    0.00 |
| Tcp    |        1 |     27.42 μs |   0.535 μs |   1.019 μs |     27.51 μs |  1.21 |    0.08 |
| Pipe   |      100 |     21.93 μs |   0.146 μs |   0.122 μs |     21.94 μs |  1.00 |    0.00 |
| Tcp    |      100 |     26.06 μs |   0.506 μs |   0.474 μs |     25.99 μs |  1.19 |    0.02 |
| Pipe   |    10000 |     29.59 μs |   0.126 μs |   0.099 μs |     29.58 μs |  1.00 |    0.00 |
| Tcp    |    10000 |     33.25 μs |   0.655 μs |   0.919 μs |     33.01 μs |  1.14 |    0.04 |
| Pipe   |  1000000 |  1,675.35 μs |  32.862 μs |  43.870 μs |  1,685.37 μs |  1.00 |    0.00 |
| Tcp    |  1000000 |  2,553.07 μs |  58.100 μs | 167.631 μs |  2,505.34 μs |  1.63 |    0.10 |
| Pipe   | 10000000 | 23,421.61 μs | 141.337 μs | 132.207 μs | 23,380.19 μs |  1.00 |    0.00 |
| Tcp    | 10000000 | 28,182.91 μs | 375.644 μs | 313.679 μs | 28,114.22 μs |  1.20 |    0.01 |
Benchmark code:
Benchmark.csproj
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net5.0</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="BenchmarkDotNet" Version="0.13.0" />
  </ItemGroup>

</Project>
Program.cs
using BenchmarkDotNet.Running;
using System;
using System.IO;
using System.Linq;
using System.Net.Sockets;
using System.Runtime.InteropServices;

namespace Benchmark
{
    public class Program
    {
        public const int MIN_LENGTH = 1;
        public const int MAX_LENGTH = 10_000_000;

        static void Main(string[] args)
        {
            if (!args.Any())
            {
                var summary = BenchmarkRunner.Run<PipeVsTcp>();
            }
            else
            {
                var data = MemoryMarshal
                    .AsBytes<int>(
                        Enumerable
                            .Range(0, MAX_LENGTH)
                            .ToArray())
                    .ToArray();

                using var readStream = Console.OpenStandardInput();

                if (args[0] == "pipe")
                {
                    using var pipeStream = Console.OpenStandardOutput();
                    RunChildProcess(readStream, pipeStream, data);
                }
                else if (args[0] == "tcp")
                {
                    var tcpClient = new TcpClient()
                    {
                        NoDelay = true
                    };

                    tcpClient.Connect("localhost", 55555);
                    var tcpStream = tcpClient.GetStream();
                    RunChildProcess(readStream, tcpStream, data);
                }
                else
                {
                    throw new Exception("Invalid argument (args[0]).");
                }
            }
        }

        static void RunChildProcess(Stream readStream, Stream writeStream, byte[] data)
        {
            Span<byte> buffer = stackalloc byte[4];

            while (true)
            {
                // wait for the start signal (the number of values N)
                var length = readStream.Read(buffer);

                if (length == 0)
                    throw new Exception("The host process terminated early.");

                var N = BitConverter.ToInt32(buffer);

                // write N int32 values back to the host
                writeStream.Write(data, 0, N * sizeof(int));
            }
        }
    }
}
PipeVsTcp.cs
using BenchmarkDotNet.Attributes;
using System;
using System.Buffers;
using System.Diagnostics;
using System.IO;
using System.Net;
using System.Net.Sockets;
using System.Reflection;
using System.Runtime.InteropServices;

namespace Benchmark
{
    [MemoryDiagnoser]
    public class PipeVsTcp
    {
        private Process _pipeProcess;
        private Process _tcpProcess;
        private TcpClient _tcpClient;

        [GlobalSetup]
        public void GlobalSetup()
        {
            // assembly path; under Linux the Location property is an empty
            // string (why?), so I replaced it with a hard-coded string
            var assemblyPath = Assembly.GetExecutingAssembly().Location;

            // run pipe process
            var pipePsi = new ProcessStartInfo("dotnet")
            {
                Arguments = $"{assemblyPath} pipe",
                UseShellExecute = false,
                RedirectStandardInput = true,
                RedirectStandardOutput = true,
                RedirectStandardError = true
            };

            _pipeProcess = new Process() { StartInfo = pipePsi };
            _pipeProcess.Start();

            // run tcp process
            var tcpPsi = new ProcessStartInfo("dotnet")
            {
                Arguments = $"{assemblyPath} tcp",
                UseShellExecute = false,
                RedirectStandardInput = true,
                RedirectStandardOutput = true,
                RedirectStandardError = true
            };

            _tcpProcess = new Process() { StartInfo = tcpPsi };
            _tcpProcess.Start();

            var tcpListener = new TcpListener(IPAddress.Parse("127.0.0.1"), 55555);
            tcpListener.Start();

            _tcpClient = tcpListener.AcceptTcpClient();
            _tcpClient.NoDelay = true;
        }

        [GlobalCleanup]
        public void GlobalCleanup()
        {
            _pipeProcess?.Kill();
            _tcpProcess?.Kill();
        }

        [Params(Program.MIN_LENGTH, 100, 10_000, 1_000_000, Program.MAX_LENGTH)]
        public int N;

        [Benchmark(Baseline = true)]
        public Memory<byte> Pipe()
        {
            var pipeReadStream = _pipeProcess.StandardOutput.BaseStream;
            var pipeWriteStream = _pipeProcess.StandardInput.BaseStream;

            using var owner = MemoryPool<byte>.Shared.Rent(N * sizeof(int));
            return ReadFromStream(pipeReadStream, pipeWriteStream, owner.Memory);
        }

        [Benchmark()]
        public Memory<byte> Tcp()
        {
            var tcpReadStream = _tcpClient.GetStream();
            var tcpWriteStream = _tcpProcess.StandardInput.BaseStream; // the trigger goes via the child's stdin

            using var owner = MemoryPool<byte>.Shared.Rent(N * sizeof(int));
            return ReadFromStream(tcpReadStream, tcpWriteStream, owner.Memory);
        }

        private Memory<byte> ReadFromStream(Stream readStream, Stream writeStream, Memory<byte> buffer)
        {
            // trigger: send the number of values N
            var Nbuffer = BitConverter.GetBytes(N);
            writeStream.Write(Nbuffer);
            writeStream.Flush();

            // receive data
            var remaining = N * sizeof(int);
            var offset = 0;

            while (remaining > 0)
            {
                var span = buffer.Slice(offset, remaining).Span;
                var readBytes = readStream.Read(span);

                if (readBytes == 0)
                    throw new Exception("The child process terminated early.");

                remaining -= readBytes;
                offset += readBytes;
            }

            // validate the first 3 values
            var intBuffer = MemoryMarshal.Cast<byte, int>(buffer.Span);

            for (int i = 0; i < Math.Min(N, 3); i++)
            {
                if (intBuffer[i] != i)
                    throw new Exception($"Invalid data received. Data is {intBuffer[i]}, index = {i}.");
            }

            return buffer;
        }
    }
}
Related
I would like to show some experimental results about RocksDB Put performance: two-threaded put throughput is slower than single-threaded put throughput. This is weird because RocksDB uses the default skiplist as the memtable, and that data structure supports concurrent writes.
Here is my testing code.
const uint64_t nthread = 2;
const uint64_t nkeys = 16000000;
std::thread threads[nthread];
std::atomic<uint64_t> idx(1000000);
for (uint64_t t = 0; t < nthread; t++) {
    threads[t] = std::thread([db, &idx, nthread, nkeys, &write_option_disable] {
        WriteBatch batch;
        for (uint64_t i = 0; i < nkeys / nthread; i++) {
            std::string key = "WVERIFY" + std::to_string(idx.fetch_add(1));
            std::string value = "MOCK";
            auto ikey = rocksdb::Slice(key);
            auto ivalue = rocksdb::Slice(value);
            db->Put(write_option_disable, ikey, ivalue);
        }
        return 0;
    });
}
for (auto& t : threads) {
    t.join();
}
Here are the results I got.
// Single thread
Uptime(secs): 8.4 total, 8.3 interval
Flush(GB): cumulative 1.170, interval 1.170
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 1.17 GB write, 143.35 MB/s write, 0.00 GB read, 0.00 MB/s read, 8.1 seconds
Interval compaction: 1.17 GB write, 144.11 MB/s write, 0.00 GB read, 0.00 MB/s read, 8.1 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Block cache LRUCache@0x564742515ea0#7011 capacity: 8.00 MB collections: 1 last_copies: 0 last_secs: 2e-05 secs_since: 8
Block cache entry stats(count,size,portion): Misc(1,0.00 KB,0%)
** File Read Latency Histogram By Level [default] **
** DB Stats **
Uptime(secs): 8.4 total, 8.3 interval
Cumulative writes: 16M writes, 16M keys, 16M commit groups, 1.0 writes per commit group, ingest: 1.63 GB, 199.80 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 16M writes, 16M keys, 16M commit groups, 1.0 writes per commit group, ingest: 1669.88 MB, 200.85 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
// 2 threads
Uptime(secs): 31.4 total, 31.4 interval
Flush(GB): cumulative 0.183, interval 0.183
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.67 GB write, 21.84 MB/s write, 0.97 GB read, 31.68 MB/s read, 10.2 seconds
Interval compaction: 0.67 GB write, 21.87 MB/s write, 0.97 GB read, 31.72 MB/s read, 10.2 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count
Block cache LRUCache@0x5619fb7bbea0#6183 capacity: 8.00 MB collections: 1 last_copies: 0 last_secs: 1.9e-05 secs_since: 31
Block cache entry stats(count,size,portion): Misc(1,0.00 KB,0%)
** File Read Latency Histogram By Level [default] **
** DB Stats **
Uptime(secs): 31.4 total, 31.4 interval
Cumulative writes: 16M writes, 16M keys, 11M commit groups, 1.4 writes per commit group, ingest: 0.45 GB, 14.67 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 16M writes, 16M keys, 11M commit groups, 1.4 writes per commit group, ingest: 460.94 MB, 14.69 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent
Update
These are my RocksDB settings.
DB* db;
Options options;
BlockBasedTableOptions table_options;
rocksdb::WriteOptions write_option_disable;
write_option_disable.disableWAL = true;
// Optimize RocksDB. This is the easiest way to get RocksDB to perform well
options.IncreaseParallelism();
options.OptimizeLevelStyleCompaction();
// create the DB if it's not already present
options.create_if_missing = true;
The atomic idx shared between the two threads can introduce non-trivial overhead. Try inserting random values from each thread, and maybe increase the number of threads.
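A minimal sketch of that idea (my illustration, not tested against the setup above): give each thread its own disjoint, precomputed key range so that no atomic counter is touched on the hot path:

#include <string>
#include <thread>
#include <vector>

const uint64_t nthread = 2;
const uint64_t nkeys = 16000000;
std::vector<std::thread> threads;
for (uint64_t t = 0; t < nthread; t++) {
    threads.emplace_back([db, t, nthread, nkeys, &write_option_disable] {
        // each thread owns [begin, end): no shared idx.fetch_add()
        const uint64_t begin = 1000000 + t * (nkeys / nthread);
        const uint64_t end = begin + nkeys / nthread;
        for (uint64_t i = begin; i < end; i++) {
            std::string key = "WVERIFY" + std::to_string(i);
            db->Put(write_option_disable, rocksdb::Slice(key), rocksdb::Slice("MOCK"));
        }
    });
}
for (auto& t : threads) t.join();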
I am trying to make a function that computes a diffusion kernel as fast as possible by using views and fused broadcast operators. Is it possible to get the second function as fast as the first? Currently, diff takes 59.6 ms, whereas diff_view takes 384.3 ms.
using BenchmarkTools

function diff(
    at::Array{Float64, 3}, a::Array{Float64, 3},
    visc::Float64, dxidxi::Float64, dyidyi::Float64, dzidzi::Float64,
    itot::Int64, jtot::Int64, ktot::Int64)

    for k in 2:ktot-1
        for j in 2:jtot-1
            @simd for i in 2:itot-1
                @inbounds at[i, j, k] += visc * (
                    (a[i-1, j  , k  ] - 2. * a[i, j, k] + a[i+1, j  , k  ]) * dxidxi +
                    (a[i  , j-1, k  ] - 2. * a[i, j, k] + a[i  , j+1, k  ]) * dyidyi +
                    (a[i  , j  , k-1] - 2. * a[i, j, k] + a[i  , j  , k+1]) * dzidzi )
            end
        end
    end
end
function diff_view(
    at::Array{Float64, 3}, a::Array{Float64, 3},
    visc::Float64, dxidxi::Float64, dyidyi::Float64, dzidzi::Float64,
    itot::Int64, jtot::Int64, ktot::Int64)

    at_c = view(at, 2:itot-1, 2:jtot-1, 2:ktot-1)
    a_c  = view(a,  2:itot-1, 2:jtot-1, 2:ktot-1)
    a_w  = view(a,  1:itot-2, 2:jtot-1, 2:ktot-1)
    a_e  = view(a,  3:itot  , 2:jtot-1, 2:ktot-1)
    a_s  = view(a,  2:itot-1, 1:jtot-2, 2:ktot-1)
    a_n  = view(a,  2:itot-1, 3:jtot  , 2:ktot-1)
    a_b  = view(a,  2:itot-1, 2:jtot-1, 1:ktot-2)
    a_t  = view(a,  2:itot-1, 2:jtot-1, 3:ktot  )

    at_c .+= visc .* ( (a_w .- 2. .* a_c .+ a_e) .* dxidxi .+
                       (a_s .- 2. .* a_c .+ a_n) .* dyidyi .+
                       (a_b .- 2. .* a_c .+ a_t) .* dzidzi )  # a_t, not a_n, in the z term
end
itot = 384
jtot = 384
ktot = 384

a = rand(Float64, (itot, jtot, ktot))
at = zeros(Float64, (itot, jtot, ktot))

visc = 0.1
dxidxi = 0.1
dyidyi = 0.1
dzidzi = 0.1

@btime diff(
    at, a,
    visc, dxidxi, dyidyi, dzidzi,
    itot, jtot, ktot)

@btime diff_view(
    at, a,
    visc, dxidxi, dyidyi, dzidzi,
    itot, jtot, ktot)
You can accomplish this using LoopVectorization.jl's @turbo macro, which will make sure that the broadcast compiles to efficient SIMD instructions wherever possible.
using LoopVectorization

function diff_view_lv!(
    at::Array{Float64, 3}, a::Array{Float64, 3},
    visc::Float64, dxidxi::Float64, dyidyi::Float64, dzidzi::Float64,
    itot::Int64, jtot::Int64, ktot::Int64)

    at_c = view(at, 2:itot-1, 2:jtot-1, 2:ktot-1)
    a_c  = view(a,  2:itot-1, 2:jtot-1, 2:ktot-1)
    a_w  = view(a,  1:itot-2, 2:jtot-1, 2:ktot-1)
    a_e  = view(a,  3:itot  , 2:jtot-1, 2:ktot-1)
    a_s  = view(a,  2:itot-1, 1:jtot-2, 2:ktot-1)
    a_n  = view(a,  2:itot-1, 3:jtot  , 2:ktot-1)
    a_b  = view(a,  2:itot-1, 2:jtot-1, 1:ktot-2)
    a_t  = view(a,  2:itot-1, 2:jtot-1, 3:ktot  )

    @turbo at_c .+= visc .* ( (a_w .- 2. .* a_c .+ a_e) .* dxidxi .+
                              (a_s .- 2. .* a_c .+ a_n) .* dyidyi .+
                              (a_b .- 2. .* a_c .+ a_t) .* dzidzi )  # a_t, not a_n, in the z term
    # Could also use @turbo @. to apply the broadcast to every operator, so you don't have to type `.` before each one.
end
As a stylistic aside, since all these functions mutate at, they should have names that end with ! to denote that they mutate their argument.
And, as the comments noted, we want to be sure to interpolate any global variables into the benchmark with $. But other than that, using the same setup as in your question above (on what seems to be a slightly slower CPU):
julia> @benchmark diff!(
           $at, $a,
           $visc, $dxidxi, $dyidyi, $dzidzi,
           $itot, $jtot, $ktot)
BenchmarkTools.Trial: 50 samples with 1 evaluation.
Range (min … max): 100.575 ms … 101.855 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 100.783 ms ┊ GC (median): 0.00%
Time (mean ± σ): 100.798 ms ± 173.505 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▆▁▁█▄
▄▄▄▄▄▆▇█████▇▆▄▆▁▁▁▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
101 ms Histogram: frequency by time 102 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark diff_view!(
           $at, $a,
           $visc, $dxidxi, $dyidyi, $dzidzi,
           $itot, $jtot, $ktot)
BenchmarkTools.Trial: 13 samples with 1 evaluation.
Range (min … max): 397.203 ms … 397.800 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 397.427 ms ┊ GC (median): 0.00%
Time (mean ± σ): 397.436 ms ± 173.079 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁ ▁ ▁ ▁ ▁ ▁ ▁ █ ▁ ▁ ▁
█▁█▁▁▁▁█▁▁▁▁▁█▁▁█▁▁█▁▁█▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
397 ms Histogram: frequency by time 398 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark diff_view_lv!(
           $at, $a,
           $visc, $dxidxi, $dyidyi, $dzidzi,
           $itot, $jtot, $ktot)
BenchmarkTools.Trial: 61 samples with 1 evaluation.
Range (min … max): 82.226 ms … 83.015 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 82.364 ms ┊ GC (median): 0.00%
Time (mean ± σ): 82.395 ms ± 115.205 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▄ ▁▁▁▁▄▄▄█▁ ▄ ▁█ ▁▁ ▁ ▁ ▁
▆▁▁▆▁▁▁██▆▁█████████▆▆█▆▆██▁██▁█▆█▁▁▁▁▁▁█▆▆▆▁▁▁▆▁▁▁▁▁▁▁▁▁▁▁▆ ▁
82.2 ms Histogram: frequency by time 82.7 ms <
Memory estimate: 1008 bytes, allocs estimate: 41.
With this, the broadcasted version is now faster than the original looped version! However, as the comments have noted, the simple looping approach is arguably cleaner and more readable, and (as you might guess from the name) you can apply LoopVectorization to the looped version just as well:
using LoopVectorization

function diff_lv!(
    at::Array{Float64, 3}, a::Array{Float64, 3},
    visc::Float64, dxidxi::Float64, dyidyi::Float64, dzidzi::Float64,
    itot::Int64, jtot::Int64, ktot::Int64)

    @turbo for k in 2:ktot-1
        for j in 2:jtot-1
            for i in 2:itot-1
                at[i, j, k] += visc * (
                    (a[i-1, j  , k  ] - 2. * a[i, j, k] + a[i+1, j  , k  ]) * dxidxi +
                    (a[i  , j-1, k  ] - 2. * a[i, j, k] + a[i  , j+1, k  ]) * dyidyi +
                    (a[i  , j  , k-1] - 2. * a[i, j, k] + a[i  , j  , k+1]) * dzidzi )
            end
        end
    end
end
julia> @benchmark diff_lv!(
           $at, $a,
           $visc, $dxidxi, $dyidyi, $dzidzi,
           $itot, $jtot, $ktot)
BenchmarkTools.Trial: 56 samples with 1 evaluation.
Range (min … max): 89.489 ms … 90.166 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 89.657 ms ┊ GC (median): 0.00%
Time (mean ± σ): 89.660 ms ± 103.127 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁ ▁ █▃ ▆ ▁
▄▁▁▄▁▁▄█▄▁▁▄█▄█▄▁▄▇▇▇▇██▄▄▇█▇█▁▁▁▄▄▁▁▁▁▄▁▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ ▁
89.5 ms Histogram: frequency by time 89.9 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Finally, if you want to multithread, you can just add another t to the name of the macro (@tturbo instead of @turbo).
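For completeness, a sketch of that variant (identical to diff_lv! above except for the macro name):

function diff_lvt!(
    at::Array{Float64, 3}, a::Array{Float64, 3},
    visc::Float64, dxidxi::Float64, dyidyi::Float64, dzidzi::Float64,
    itot::Int64, jtot::Int64, ktot::Int64)

    @tturbo for k in 2:ktot-1  # threaded version of @turbo
        for j in 2:jtot-1
            for i in 2:itot-1
                at[i, j, k] += visc * (
                    (a[i-1, j  , k  ] - 2. * a[i, j, k] + a[i+1, j  , k  ]) * dxidxi +
                    (a[i  , j-1, k  ] - 2. * a[i, j, k] + a[i  , j+1, k  ]) * dyidyi +
                    (a[i  , j  , k-1] - 2. * a[i, j, k] + a[i  , j  , k+1]) * dzidzi )
            end
        end
    end
end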
julia> @benchmark diff_lvt!(
           $at, $a,
           $visc, $dxidxi, $dyidyi, $dzidzi,
           $itot, $jtot, $ktot)
BenchmarkTools.Trial: 106 samples with 1 evaluation.
Range (min … max): 47.225 ms … 47.560 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 47.434 ms ┊ GC (median): 0.00%
Time (mean ± σ): 47.432 ms ± 67.185 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▁▁ ▄▂ █ ▂
▃▁▁▃▁▁▅▁▁▅▁▁▁▁▁▁▁▁▃▃▃▃▃▃▁▅▅▃▅▅▃█▃██▃██▃▃▆▃▃▅▆█▆▅██▆▆▅▅▃▃▁▃▆ ▃
47.2 ms Histogram: frequency by time 47.5 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
which should provide some additional speedup as long as you have started Julia with multiple threads.
I have something like this (simple example):
using BenchmarkTools

function assign()
    e = zeros(100, 90000)
    e2 = ones(100) * 0.16
    e[:, 100:end] .= e2[:]
end

@benchmark assign()
and I need to do this for thousands of time steps. This gives:
BenchmarkTools.Trial:
memory estimate: 68.67 MiB
allocs estimate: 6
--------------
minimum time: 16.080 ms (0.00% GC)
median time: 27.811 ms (0.00% GC)
mean time: 31.822 ms (12.31% GC)
maximum time: 43.439 ms (27.66% GC)
--------------
samples: 158
evals/sample: 1
Is there a faster way of doing this?
First of all I will assume that you meant
function assign1()
    e = zeros(100, 90000)
    e2 = ones(100) * 0.16
    e[:, 100:end] .= e2[:]
    return e # <- important!
end
Since otherwise you will not return the first 99 columns of e(!):
julia> size(assign())
(100, 89901)
Secondly, don't do this:
e[:, 100:end] .= e2[:]
e2[:] makes a copy of e2 and assigns that, but why? Just assign e2 directly:
e[:, 100:end] .= e2
Ok, but let's try a few different versions. Notice that there is no need to make e2 a vector, just assign a scalar:
function assign2()
    e = zeros(100, 90000)
    e[:, 100:end] .= 0.16 # just broadcast a scalar!
    return e
end

function assign3()
    e = fill(0.16, 100, 90000) # use fill instead of writing all those zeros that you will throw away
    e[:, 1:99] .= 0
    return e
end

function assign4()
    # only write exactly the values you need!
    e = Matrix{Float64}(undef, 100, 90000)
    e[:, 1:99] .= 0
    e[:, 100:end] .= 0.16
    return e
end
Time to benchmark:

julia> @btime assign1();
  14.550 ms (5 allocations: 68.67 MiB)

julia> @btime assign2();
  14.481 ms (2 allocations: 68.66 MiB)

julia> @btime assign3();
  9.636 ms (2 allocations: 68.66 MiB)

julia> @btime assign4();
  10.062 ms (2 allocations: 68.66 MiB)
Versions 1 and 2 are equally fast; you'll notice that there are 2 allocations instead of 5, but, of course, the big allocation dominates.
Versions 3 and 4 are faster, though not dramatically so, because they avoid some duplicate work, such as writing values into the matrix twice. Version 3 is the fastest, but not by much, and this changes if the assignment is a bit more balanced, in which case version 4 is faster:
function assign3_()
    e = fill(0.16, 100, 90000)
    e[:, 1:44999] .= 0
    return e
end

function assign4_()
    e = Matrix{Float64}(undef, 100, 90000)
    e[:, 1:44999] .= 0
    e[:, 45000:end] .= 0.16
    return e
end

julia> @btime assign3_();
  11.576 ms (2 allocations: 68.66 MiB)

julia> @btime assign4_();
  8.658 ms (2 allocations: 68.66 MiB)
The lesson is to avoid doing unnecessary work.
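Since the question mentions running this for thousands of time steps, one more sketch (my addition, assuming the matrix can be reused between steps): allocate e once outside the time loop and only refill it each step, so the 68 MiB allocation happens once instead of thousands of times:

function assign!(e)
    fill!(view(e, :, 1:99), 0.0)
    fill!(view(e, :, 100:size(e, 2)), 0.16)
    return e
end

e = Matrix{Float64}(undef, 100, 90000) # allocated once
assign!(e) # called every time step, 0 allocations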
What my code does
The goal was to build a function that checks whether all brackets open and close correctly in a given string, using Julia. So,
"{abc()([[def]])()}"
should return true, while something like
"{(bracket order mixed up here!})[and this bracket doesn't close!"
should return false.
Question
I have two versions of the function. Why is version I faster by about 10%?
Version I
function matching_brackets_old(s::AbstractString)
    close_open_map = Dict('}' => '{', ')' => '(', ']' => '[')
    order_arr = []
    for char in s
        if char in values(close_open_map)
            push!(order_arr, char)
        elseif (char in keys(close_open_map)) &&
               (isempty(order_arr) || (close_open_map[char] != pop!(order_arr)))
            return false
        end
    end
    return isempty(order_arr)
end
Version II
Here I replace the for loop with a do block:
function matching_brackets(s::AbstractString)
    close_open_map = Dict('}' => '{', ')' => '(', ']' => '[')
    order_arr = []
    all_correct = all(s) do char
        if char in values(close_open_map)
            push!(order_arr, char)
        elseif (char in keys(close_open_map)) &&
               (isempty(order_arr) || (close_open_map[char] != pop!(order_arr)))
            return false
        end
        return true
    end
    return all_correct && isempty(order_arr)
end
Timings
Using BenchmarkTools' @benchmark for the strings "{()()[()]()}" and "{()()[())]()}", I get a slowdown of about 10% for both strings when comparing minimum execution times.
Additional Info
Version Info:
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin18.6.0)
CPU: Intel(R) Core(TM) i5-4260U CPU @ 1.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, haswell)
Timing Code:
using BenchmarkTools

benchmark_strings = ["{()()[()]()}", "{()()[())]()}"]
for s in benchmark_strings
    b_old = @benchmark matching_brackets_old("$s") samples=100000 seconds=30
    b_new = @benchmark matching_brackets("$s") samples=100000 seconds=30
    println("For String=", s)
    println(b_old)
    println(b_new)
    println(judge(minimum(b_new), minimum(b_old)))
    println("Result: ", matching_brackets(s))
end
With Result:
For String={()()[()]()}
Trial(8.177 μs)
Trial(9.197 μs)
TrialJudgement(+12.48% => regression)
Result: true
For String={()()[())]()}
Trial(8.197 μs)
Trial(9.202 μs)
TrialJudgement(+12.27% => regression)
Result: false
Edit
I mixed up the order on the TrialJudgement, so Version I is faster, as François Févotte suggests. My question remains: why?
Now that the mistake with judge is resolved, the answer is probably the usual caveat: function calls, as in this case resulting from the closure passed to all, are quite optimized, but not free.
To get a real improvement, I suggest, other than making the stack type-stable (which isn't that big a deal here), getting rid of the iterations you implicitly do by calling in on values and keys. It suffices to do that only once, without a dictionary:
const MATCHING_PAIRS = ('{' => '}', '(' => ')', '[' => ']')

function matching_brackets(s::AbstractString)
    stack = Vector{eltype(s)}()
    for c in s
        for (open, close) in MATCHING_PAIRS
            if c == open
                push!(stack, c)
            elseif c == close
                if isempty(stack) || (pop!(stack) != open)
                    return false
                end
            end
        end
    end
    return isempty(stack)
end
Even a bit more time can be squeezed out by unrolling the inner loop over the tuple:
function matching_brackets_unrolled(s::AbstractString)
    stack = Vector{eltype(s)}()
    for c in s
        if (c == '(') || (c == '[') || (c == '{')
            push!(stack, c)
        elseif (c == ')')
            if isempty(stack) || (pop!(stack) != '(')
                return false
            end
        elseif (c == ']')
            if isempty(stack) || (pop!(stack) != '[')
                return false
            end
        elseif (c == '}')
            if isempty(stack) || (pop!(stack) != '{')
                return false
            end
        end
    end
    return isempty(stack)
end
This is somewhat ugly and certainly not nicely extendable, though. My benchmarks (matching_brackets_new is your second version, matching_brackets my first one):
julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, nehalem)
# NOT MATCHING
julia> @benchmark matching_brackets_new("{()()[())]()}")
BenchmarkTools.Trial:
memory estimate: 784 bytes
allocs estimate: 16
--------------
minimum time: 674.844 ns (0.00% GC)
median time: 736.200 ns (0.00% GC)
mean time: 800.935 ns (6.54% GC)
maximum time: 23.831 μs (96.16% GC)
--------------
samples: 10000
evals/sample: 160
julia> @benchmark matching_brackets_old("{()()[())]()}")
BenchmarkTools.Trial:
memory estimate: 752 bytes
allocs estimate: 15
--------------
minimum time: 630.743 ns (0.00% GC)
median time: 681.725 ns (0.00% GC)
mean time: 753.937 ns (6.41% GC)
maximum time: 23.056 μs (94.19% GC)
--------------
samples: 10000
evals/sample: 171
julia> @benchmark matching_brackets("{()()[())]()}")
BenchmarkTools.Trial:
memory estimate: 112 bytes
allocs estimate: 2
--------------
minimum time: 164.883 ns (0.00% GC)
median time: 172.900 ns (0.00% GC)
mean time: 186.523 ns (4.33% GC)
maximum time: 5.428 μs (96.54% GC)
--------------
samples: 10000
evals/sample: 759
julia> @benchmark matching_brackets_unrolled("{()()[())]()}")
BenchmarkTools.Trial:
memory estimate: 112 bytes
allocs estimate: 2
--------------
minimum time: 134.459 ns (0.00% GC)
median time: 140.292 ns (0.00% GC)
mean time: 150.067 ns (5.84% GC)
maximum time: 5.095 μs (96.56% GC)
--------------
samples: 10000
evals/sample: 878
# MATCHING
julia> @benchmark matching_brackets_old("{()()[()]()}")
BenchmarkTools.Trial:
memory estimate: 800 bytes
allocs estimate: 18
--------------
minimum time: 786.358 ns (0.00% GC)
median time: 833.873 ns (0.00% GC)
mean time: 904.437 ns (5.43% GC)
maximum time: 29.355 μs (96.88% GC)
--------------
samples: 10000
evals/sample: 106
julia> @benchmark matching_brackets_new("{()()[()]()}")
BenchmarkTools.Trial:
memory estimate: 832 bytes
allocs estimate: 19
--------------
minimum time: 823.597 ns (0.00% GC)
median time: 892.506 ns (0.00% GC)
mean time: 981.381 ns (5.98% GC)
maximum time: 47.308 μs (97.84% GC)
--------------
samples: 10000
evals/sample: 77
julia> @benchmark matching_brackets("{()()[()]()}")
BenchmarkTools.Trial:
memory estimate: 112 bytes
allocs estimate: 2
--------------
minimum time: 206.062 ns (0.00% GC)
median time: 214.481 ns (0.00% GC)
mean time: 227.385 ns (3.38% GC)
maximum time: 6.890 μs (96.22% GC)
--------------
samples: 10000
evals/sample: 535
julia> @benchmark matching_brackets_unrolled("{()()[()]()}")
BenchmarkTools.Trial:
memory estimate: 112 bytes
allocs estimate: 2
--------------
minimum time: 160.186 ns (0.00% GC)
median time: 164.752 ns (0.00% GC)
mean time: 180.794 ns (4.95% GC)
maximum time: 5.751 μs (97.03% GC)
--------------
samples: 10000
evals/sample: 800
Update: if you insert breaks in the first version, to really avoid unnecessary looping, the timings are almost indistinguishable, with nice code:
function matching_brackets(s::AbstractString)
    stack = Vector{eltype(s)}()
    for c in s
        for (open, close) in MATCHING_PAIRS
            if c == open
                push!(stack, c)
                break
            elseif c == close
                if isempty(stack) || (pop!(stack) != open)
                    return false
                end
                break
            end
        end
    end
    return isempty(stack)
end
with
julia> @benchmark matching_brackets_unrolled("{()()[())]()}")
BenchmarkTools.Trial:
memory estimate: 112 bytes
allocs estimate: 2
--------------
minimum time: 137.574 ns (0.00% GC)
median time: 144.978 ns (0.00% GC)
mean time: 165.365 ns (10.44% GC)
maximum time: 9.344 μs (98.02% GC)
--------------
samples: 10000
evals/sample: 867
julia> @benchmark matching_brackets("{()()[())]()}") # with breaks
BenchmarkTools.Trial:
memory estimate: 112 bytes
allocs estimate: 2
--------------
minimum time: 148.255 ns (0.00% GC)
median time: 155.231 ns (0.00% GC)
mean time: 175.245 ns (9.62% GC)
maximum time: 9.602 μs (98.31% GC)
--------------
samples: 10000
evals/sample: 839
I don't observe the same on my machine: in my tests, version I is faster for both strings:
julia> versioninfo()
Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
JULIA_PROJECT = #.
julia> @btime matching_brackets_old("{()()[()]()}")
  716.443 ns (18 allocations: 800 bytes)
true

julia> @btime matching_brackets("{()()[()]()}")
  761.434 ns (19 allocations: 832 bytes)
true

julia> @btime matching_brackets_old("{()()[())]()}")
  574.847 ns (15 allocations: 752 bytes)
false

julia> @btime matching_brackets("{()()[())]()}")
  612.793 ns (16 allocations: 784 bytes)
false
I would think (but this is a wild guess) that the difference between for loops and higher-order functions gets less and less significant when the string size increases.
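One way to test that guess (my sketch, not part of the original measurements) is to benchmark both versions on a much longer input, e.g. a repeated balanced string:

using BenchmarkTools

long_s = repeat("{()()[()]()}", 1000) # still balanced, 12000 characters

@btime matching_brackets_old($long_s)
@btime matching_brackets($long_s)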
However, I would encourage you to look more closely at the order_arr variable: as it is currently written, it is of type Vector{Any}, which - like any container of abstractly typed values - hurts performance. The following version performs better by concretely typing the elements of order_arr:
function matching_brackets_new(s::AbstractString)
    close_open_map = Dict('}' => '{', ')' => '(', ']' => '[')
    # make sure the compiler knows about the type of elements in order_arr
    order_arr = eltype(s)[] # or order_arr = Char[]
    for char in s
        if char in values(close_open_map)
            push!(order_arr, char)
        elseif (char in keys(close_open_map)) &&
               (isempty(order_arr) || (close_open_map[char] != pop!(order_arr)))
            return false
        end
    end
    return isempty(order_arr)
end
yielding:
julia> @btime matching_brackets_new("{()()[()]()}")
  570.641 ns (18 allocations: 784 bytes)
true

julia> @btime matching_brackets_new("{()()[())]()}")
  447.758 ns (15 allocations: 736 bytes)
false
I am currently testing Julia (I've worked with Matlab before).
In Matlab, computing N^3 is slower than N*N*N. This doesn't happen with N^2 and N*N. Matlab uses a different algorithm for higher-order exponents because it prefers accuracy over speed.
I think Julia does the same thing.
I wanted to ask if there is a way to force Julia to compute the power of N using multiplication instead of the default algorithm, at least for cube exponents.
Some time ago I did a few tests of this in Matlab, and I made a translation of that code to Julia.
Links to code:
http://pastebin.com/bbeukhTc
(I can't upload all the links here :( )
Results of the scripts on Matlab 2014:

Exponente1
Elapsed time is 68.293793 seconds. (17.7x the fastest)
Exponente2
Elapsed time is 24.236218 seconds. (6.3x the fastest)
Exponente3
Elapsed time is 3.853348 seconds.

Results of the scripts on Julia 0.4.6:

Exponente1
18.423204 seconds (8.22 k allocations: 372.563 KB) (51.6x the fastest)
Exponente2
13.746904 seconds (9.02 k allocations: 407.332 KB) (38.5x the fastest)
Exponente3
0.356875 seconds (10.01 k allocations: 450.441 KB)
In my tests, Julia is faster than Matlab, but I am using a relatively old version and can't test other versions.
Checking Julia's source code:
julia/base/math.jl:
^(x::Float64, y::Integer) =
    box(Float64, powi_llvm(unbox(Float64,x), unbox(Int32,Int32(y))))
^(x::Float32, y::Integer) =
    box(Float32, powi_llvm(unbox(Float32,x), unbox(Int32,Int32(y))))
julia/base/fastmath.jl:
pow_fast{T<:FloatTypes}(x::T, y::Integer) = pow_fast(x, Int32(y))
pow_fast{T<:FloatTypes}(x::T, y::Int32) =
    box(T, Base.powi_llvm(unbox(T,x), unbox(Int32,y)))
We can see that Julia uses powi_llvm.
Checking LLVM's source code:

define double @powi(double %F, i32 %power) {
; CHECK: powi:
; CHECK: bl __powidf2
  %result = call double @llvm.powi.f64(double %F, i32 %power)
  ret double %result
}
Now, __powidf2 is the interesting function here:
COMPILER_RT_ABI double
__powidf2(double a, si_int b)
{
    const int recip = b < 0;
    double r = 1;
    while (1)
    {
        if (b & 1)
            r *= a;
        b /= 2;
        if (b == 0)
            break;
        a *= a;
    }
    return recip ? 1/r : r;
}
Example 1: given a = 2; b = 7:
- r = 1
- iteration 1: r = 1 * 2 = 2; b = (int)(7/2) = 3; a = 2 * 2 = 4
- iteration 2: r = 2 * 4 = 8; b = (int)(3/2) = 1; a = 4 * 4 = 16
- iteration 3: r = 8 * 16 = 128;
Example 2: given a = 2; b = 8:
- r = 1
- iteration 1: r = 1; b = (int)(8/2) = 4; a = 2 * 2 = 4
- iteration 2: r = 1; b = (int)(4/2) = 2; a = 4 * 4 = 16
- iteration 3: r = 1; b = (int)(2/2) = 1; a = 16 * 16 = 256
- iteration 4: r = 1 * 256 = 256; b = (int)(1/2) = 0;
Integer power is always implemented as a sequence of multiplications, and as the examples show, the exponent 3 needs one multiplication more than the exponent 2. That's why N^3 is slower than N^2.
jl_powi_llvm (called in fastmath.jl; the "jl_" prefix is concatenated by macro expansion), on the other hand, casts the exponent to floating point and calls pow(). C source code:
JL_DLLEXPORT jl_value_t *jl_powi_llvm(jl_value_t *a, jl_value_t *b)
{
    jl_value_t *ty = jl_typeof(a);
    if (!jl_is_bitstype(ty))
        jl_error("powi_llvm: a is not a bitstype");
    if (!jl_is_bitstype(jl_typeof(b)) || jl_datatype_size(jl_typeof(b)) != 4)
        jl_error("powi_llvm: b is not a 32-bit bitstype");
    jl_value_t *newv = newstruct((jl_datatype_t*)ty);
    void *pa = jl_data_ptr(a), *pr = jl_data_ptr(newv);
    int sz = jl_datatype_size(ty);
    switch (sz) {
    /* choose the right size c-type operation */
    case 4:
        *(float*)pr = powf(*(float*)pa, (float)jl_unbox_int32(b));
        break;
    case 8:
        *(double*)pr = pow(*(double*)pa, (double)jl_unbox_int32(b));
        break;
    default:
        jl_error("powi_llvm: runtime floating point intrinsics are not implemented for bit sizes other than 32 and 64");
    }
    return newv;
}
Lior's answer is excellent. Here is a solution to the problem you posed: yes, there is a way to force the use of multiplication, at the cost of accuracy: the @fastmath macro:
julia> @benchmark 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 999
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 16.00 bytes
allocs estimate: 1
minimum time: 13.00 ns (0.00% GC)
median time: 14.00 ns (0.00% GC)
mean time: 15.74 ns (6.14% GC)
maximum time: 1.85 μs (98.16% GC)
julia> @benchmark @fastmath 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 1000
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 0.00 bytes
allocs estimate: 0
minimum time: 2.00 ns (0.00% GC)
median time: 3.00 ns (0.00% GC)
mean time: 2.59 ns (0.00% GC)
maximum time: 20.00 ns (0.00% GC)
Note that with @fastmath, performance is much better.
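If you don't want @fastmath semantics everywhere, a hypothetical alternative (my suggestion, not from the answer above) is to simply spell out the multiplications for the cube case:

cube(x) = x * x * x # plain multiplications, no pow() involved

cube(1.1) # same result as 1.1 * 1.1 * 1.1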