Compiler predicate optimizations - performance

Consider the following example conditions/predicates:
1. x > 10 and x > 20
2. (x > 10 or x == 10) and (x < 10 or x == 10), i.e. x >= 10 and x <= 10
Predicate 1 can be simplified to x > 20, and predicate 2 to x == 10. Would a compiler optimize these kinds of (or more complex) predicates, and if so, what algorithms are used to do so?
What are some common optimization techniques for predicates?

It depends on the compiler, but clang and gcc do perform this optimisation:
#include <stdio.h>

void foo(int x) {
    if (x > 10 && x > 20)
        puts("foo");
}

void foo2(int x) {
    if ((x > 10 || x == 10) && (x < 10 || x == 10))
        puts("foo2");
}
You can see the assembly here -- both functions compile down to a single comparison.
Clang (which uses LLVM) does this in the instruction combine pass ('instcombine'). You can see some of the transformations in the InstructionSimplify.cpp source code.
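As an illustration of the general idea (not LLVM's actual implementation), such simplifications can be modelled as interval arithmetic over the value ranges a predicate admits: intersecting the ranges of the conjuncts yields the simplified condition. A minimal sketch in Python, with made-up helper names:

# Minimal sketch: simplify conjunctions of integer comparisons by
# intersecting the value intervals they admit (illustrative only --
# this is not how LLVM's instcombine is written).
INF = float('inf')

def interval_and(a, b):
    """Intersect two closed integer intervals (lo, hi); None means 'always false'."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

# Predicate 1: x > 10 and x > 20 -> intersect (11, inf) with (21, inf)
print(interval_and((11, INF), (21, INF)))    # (21, inf), i.e. x > 20

# Predicate 2: x >= 10 and x <= 10 -> intersect (10, inf) with (-inf, 10)
print(interval_and((10, INF), (-INF, 10)))   # (10, 10), i.e. x == 10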

Looking at the IL that the C# compiler emits for the following method, at least in this case the compiler does not seem smart enough. It is not clear, though, what happens when the IL gets compiled to native code, or even later in the processor pipeline, where further optimizations kick in:
private static bool Compare(int x)
{
    return (x > 10 || x == 10) && (x < 10 || x == 10);
}
Corresponding IL:
IL_0000: ldarg.0 // x
IL_0001: ldc.i4.s 10 // 0x0a
IL_0003: bgt.s IL_000a
IL_0005: ldarg.0 // x
IL_0006: ldc.i4.s 10 // 0x0a
IL_0008: bne.un.s IL_0017
IL_000a: ldarg.0 // x
IL_000b: ldc.i4.s 10 // 0x0a
IL_000d: blt.s IL_0015
IL_000f: ldarg.0 // x
IL_0010: ldc.i4.s 10 // 0x0a
IL_0012: ceq
IL_0014: ret
IL_0015: ldc.i4.1
IL_0016: ret
IL_0017: ldc.i4.0
IL_0018: ret
Here's the second (optimized) version:
private static bool Compare(int x)
{
    return x >= 10 && x <= 10;
}
And, again, the corresponding IL code:
IL_0000: ldarg.0 // x
IL_0001: ldc.i4.s 10 // 0x0a
IL_0003: blt.s IL_000e
IL_0005: ldarg.0 // x
IL_0006: ldc.i4.s 10 // 0x0a
IL_0008: cgt
IL_000a: ldc.i4.0
IL_000b: ceq
IL_000d: ret
IL_000e: ldc.i4.0
IL_000f: ret
Since the second version is clearly shorter, it has a greater chance of getting inlined at runtime, so we should expect it to run a bit faster.
Finally, the third one, let's call it "the best" (x == 10):
private static bool Compare(int x)
{
    return x == 10;
}
And its IL:
IL_0000: ldarg.0 // x
IL_0001: ldc.i4.s 10 // 0x0a
IL_0003: ceq
IL_0005: ret
Nice and concise.
Running a benchmark with BenchmarkDotNet and [MethodImpl(MethodImplOptions.NoInlining)] reveals that the runtime behaviour of the implementations still differs substantially:
Case 1: test candidates that are not 10 (negative case).
Method | Jit | Platform | Mean
----------- |---------- |--------- |----------
TestBest | LegacyJit | X64 | 2.329 ms
TestOpt | LegacyJit | X64 | 2.704 ms
TestNonOpt | LegacyJit | X64 | 3.324 ms
TestBest | LegacyJit | X86 | 1.956 ms
TestOpt | LegacyJit | X86 | 2.178 ms
TestNonOpt | LegacyJit | X86 | 2.796 ms
TestBest | RyuJit | X64 | 2.480 ms
TestOpt | RyuJit | X64 | 2.489 ms
TestNonOpt | RyuJit | X64 | 3.101 ms
TestBest | RyuJit | X86 | 1.865 ms
TestOpt | RyuJit | X86 | 2.170 ms
TestNonOpt | RyuJit | X86 | 2.853 ms
Case 2: test using 10 (positive case).
Method | Jit | Platform | Mean
----------- |---------- |--------- |---------
TestBest | LegacyJit | X64 | 2.396 ms
TestOpt | LegacyJit | X64 | 2.780 ms
TestNonOpt | LegacyJit | X64 | 3.370 ms
TestBest | LegacyJit | X86 | 2.044 ms
TestOpt | LegacyJit | X86 | 2.199 ms
TestNonOpt | LegacyJit | X86 | 2.533 ms
TestBest | RyuJit | X64 | 2.470 ms
TestOpt | RyuJit | X64 | 2.532 ms
TestNonOpt | RyuJit | X64 | 2.552 ms
TestBest | RyuJit | X86 | 1.911 ms
TestOpt | RyuJit | X86 | 2.210 ms
TestNonOpt | RyuJit | X86 | 2.753 ms
Interestingly, in both cases the new JIT runs the opt and non-opt X64 versions in about the same time.
The question remains: why does the compiler not optimize these kinds of patterns? My guess is that it's because of things like operator overloading, which make it impossible for the compiler to draw certain logical conclusions, but I might be way off... For the built-in value types, at least, it should be possible. Oh well...
Lastly, here's a good article on optimizations for boolean expressions:
https://hbfs.wordpress.com/2008/08/26/optimizing-boolean-expressions-for-speed/

Related

Tensorflow: Multi-GPU training cannot make all GPUs run at the same time

I have a machine with 3x GTX 1080 GPUs. Below is the training code:
dynamic_learning_rate = tf.placeholder(tf.float32, shape=[])
model_version = tf.constant(1, tf.int32)

with tf.device('/cpu:0'):
    with tf.name_scope('Input'):
        # Input images and labels.
        batch_images,\
        batch_input_vectors,\
        batch_one_hot_labels,\
        batch_file_paths,\
        batch_labels = self.get_batch()

    grads = []
    pred = []
    cost = []

    # Define optimizer
    optimizer = tf.train.MomentumOptimizer(learning_rate=dynamic_learning_rate / self.batch_size,
                                           momentum=0.9,
                                           use_nesterov=True)

    split_input_image = tf.split(batch_images, self.num_gpus)
    split_input_vector = tf.split(batch_input_vectors, self.num_gpus)
    split_input_one_hot_label = tf.split(batch_one_hot_labels, self.num_gpus)

    for i in range(self.num_gpus):
        with tf.device(tf.DeviceSpec(device_type="GPU", device_index=i)):
            with tf.variable_scope(tf.get_variable_scope(), reuse=i > 0):
                with tf.name_scope('Model'):
                    # Construct model
                    with tf.variable_scope("inference"):
                        tower_pred = self.model(split_input_image[i], split_input_vector[i], is_training=True)
                    pred.append(tower_pred)

                with tf.name_scope('Loss'):
                    # Define loss and optimizer
                    softmax_cross_entropy_cost = tf.reduce_mean(
                        tf.nn.softmax_cross_entropy_with_logits(logits=tower_pred, labels=split_input_one_hot_label[i]))
                    cost.append(softmax_cross_entropy_cost)

    # Concat variables
    pred = tf.concat(pred, 0)
    cost = tf.reduce_mean(cost)

    # L2 regularization
    trainable_vars = tf.trainable_variables()
    l2_regularization = tf.add_n(
        [tf.nn.l2_loss(v) for v in trainable_vars if any(x in v.name for x in ['weights', 'biases'])])
    for v in trainable_vars:
        if any(x in v.name for x in ['weights', 'biases']):
            print(v.name + ' - included for L2 regularization!')
        else:
            print(v.name)

    cost = cost + self.l2_regularization_strength*l2_regularization

    with tf.name_scope('Accuracy'):
        # Evaluate model
        correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(batch_one_hot_labels, 1))
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
        prediction = tf.nn.softmax(pred, name='softmax')

    # Creates a variable to hold the global_step.
    global_step = tf.Variable(0, trainable=False, name='global_step')

    # Minimization
    update = optimizer.minimize(cost, global_step=global_step, colocate_gradients_with_ops=True)
After I run the training:
Fri Nov 10 12:28:00 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:03:00.0 Off | N/A |
| 42% 65C P2 62W / 198W | 7993MiB / 8114MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 00000000:04:00.0 Off | N/A |
| 33% 53C P2 150W / 198W | 7886MiB / 8114MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 1080 Off | 00000000:05:00.0 On | N/A |
| 26% 54C P2 170W / 198W | 7883MiB / 8108MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 23228 C python 7982MiB |
| 1 23228 C python 7875MiB |
| 2 4793 G /usr/lib/xorg/Xorg 40MiB |
| 2 23228 C python 7831MiB |
+-----------------------------------------------------------------------------+
Fri Nov 10 12:28:36 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:03:00.0 Off | N/A |
| 42% 59C P2 54W / 198W | 7993MiB / 8114MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 00000000:04:00.0 Off | N/A |
| 33% 57C P2 154W / 198W | 7886MiB / 8114MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 1080 Off | 00000000:05:00.0 On | N/A |
| 27% 55C P2 155W / 198W | 7883MiB / 8108MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 23228 C python 7982MiB |
| 1 23228 C python 7875MiB |
| 2 4793 G /usr/lib/xorg/Xorg 40MiB |
| 2 23228 C python 7831MiB |
+-----------------------------------------------------------------------------+
You can see that whenever the first GPU is running, the other two GPUs are idle, and vice versa. The alternation period is about 0.5 seconds.
For a single GPU, the training speed is around 650 images/second; with all three GPUs I get only 1050 images/second.
Any idea what the problem is?
You need to make sure that all the trainable variables are on the controller device (usually the CPU) and that all the other worker devices (usually the GPUs) use the variables from the CPU in parallel.
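A minimal sketch of one common way to do this in TF 1.x graph code (illustrative only, not the asker's exact setup; the helper name assign_to_device and the op list are assumptions): pass a device function to tf.device so that variable-creating ops land on the CPU while the rest of each tower stays on its GPU.

import tensorflow as tf

# Op types that create variables; these should live on the controller (CPU).
PS_OPS = ('Variable', 'VariableV2', 'VarHandleOp')

def assign_to_device(worker_device, ps_device='/cpu:0'):
    """Device function: variable ops go to ps_device, everything else to worker_device."""
    def _assign(op):
        node_def = op if isinstance(op, tf.NodeDef) else op.node_def
        return ps_device if node_def.op in PS_OPS else worker_device
    return _assign

# Usage inside the tower loop (sketch):
# for i in range(self.num_gpus):
#     with tf.device(assign_to_device('/gpu:%d' % i, ps_device='/cpu:0')):
#         with tf.variable_scope(tf.get_variable_scope(), reuse=i > 0):
#             tower_pred = self.model(split_input_image[i], split_input_vector[i], is_training=True)

This way every tower reads the same CPU-resident variables instead of them ending up on a single GPU that the other towers then have to wait on, which can cause the alternating utilization pattern seen above.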

Why is function composition in F# so much slower, by 60%, than piping?

Admittedly, I am unsure whether I am comparing apples with apples or apples with pears here. But I am particularly surprised by the size of the difference, where a slighter difference, if any, would be expected.
Piping can often be expressed as function composition and vice versa, and I would assume the compiler knows that too, so I tried a little experiment:
open System.Text

// simplified example of some SB helpers:
let inline bcreate() = new StringBuilder(64)
let inline bget (sb: StringBuilder) = sb.ToString()
let inline appendf fmt (sb: StringBuilder) = Printf.kbprintf (fun () -> sb) sb fmt
let inline appends (s: string) (sb: StringBuilder) = sb.Append s
let inline appendi (i: int) (sb: StringBuilder) = sb.Append i
let inline appendb (b: bool) (sb: StringBuilder) = sb.Append b

// test function for composition, putting some garbage data in SB
let compose a =
    (appends "START"
     >> appendb true
     >> appendi 10
     >> appendi a
     >> appends "0x"
     >> appendi 65535
     >> appendi 10
     >> appends "test"
     >> appends "END") (bcreate())

// test function for piping, putting the same garbage data in SB
let pipe a =
    bcreate()
    |> appends "START"
    |> appendb true
    |> appendi 10
    |> appendi a
    |> appends "0x"
    |> appendi 65535
    |> appendi 10
    |> appends "test"
    |> appends "END"
Testing this in FSI (64 bit enabled, --optimize flag on) gives:
> for i in 1 .. 500000 do compose 123 |> ignore;;
Real: 00:00:00.390, CPU: 00:00:00.390, GC gen0: 62, gen1: 1, gen2: 0
val it : unit = ()
> for i in 1 .. 500000 do pipe 123 |> ignore;;
Real: 00:00:00.249, CPU: 00:00:00.249, GC gen0: 27, gen1: 0, gen2: 0
val it : unit = ()
A small difference would be understandable, but this is a factor of 1.6 (60%) performance degradation.
I would actually expect the bulk of the work to happen in the StringBuilder, but apparently the overhead of composition has quite a bit of influence.
I realize that in most practical situations this difference will be negligible, but if you are writing large formatted text files (like log files) as in this case, it has an impact.
I am using the latest version of F#.
I tried out your example with FSI and found no appreciable difference:
> #time
for i in 1 .. 500000 do compose 123 |> ignore
--> Timing now on
Real: 00:00:00.229, CPU: 00:00:00.234, GC gen0: 32, gen1: 32, gen2: 0
val it : unit = ()
> #time;;
--> Timing now off
> #time
for i in 1 .. 500000 do pipe 123 |> ignore;;;;
--> Timing now on
Real: 00:00:00.214, CPU: 00:00:00.218, GC gen0: 30, gen1: 30, gen2: 0
val it : unit = ()
Measuring it with BenchmarkDotNet (the first table is a single compose/pipe run, the second table does it 500,000 times), I found something similar:
Method | Platform | Jit | Median | StdDev | Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
-------- |--------- |---------- |------------ |----------- |--------- |------ |------ |------------------- |
compose | X64 | RyuJit | 319.7963 ns | 5.0299 ns | 2,848.50 | - | - | 182.54 |
pipe | X64 | RyuJit | 308.5887 ns | 11.3793 ns | 2,453.82 | - | - | 155.88 |
compose | X86 | LegacyJit | 428.0141 ns | 3.6112 ns | 1,970.00 | - | - | 126.85 |
pipe | X86 | LegacyJit | 416.3469 ns | 8.0869 ns | 1,886.00 | - | - | 121.86 |
Method | Platform | Jit | Median | StdDev | Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
-------- |--------- |---------- |------------ |---------- |--------- |------ |------ |------------------- |
compose | X64 | RyuJit | 160.8059 ms | 4.6699 ms | 3,514.75 | - | - | 56,224,980.75 |
pipe | X64 | RyuJit | 163.1026 ms | 4.9829 ms | 3,120.00 | - | - | 50,025,686.21 |
compose | X86 | LegacyJit | 215.8562 ms | 4.2769 ms | 2,292.00 | - | - | 36,820,936.68 |
pipe | X86 | LegacyJit | 209.9219 ms | 2.5605 ms | 2,220.00 | - | - | 35,554,575.32 |
It may be that the differences you are measuring are related to GC. Try forcing a GC collection before/after your timings.
That said, looking at the source code for the pipe operator:
let inline (|>) x f = f x
and comparing against the composition operator:
let inline (>>) f g x = g(f x)
seems to make it clear that the composition operator creates lambda functions, which should result in more allocations. This can also be seen in the BenchmarkDotNet runs, and it might well be the cause of the performance difference you are seeing.
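To make the allocation point concrete, here is a rough analogue in Python (an illustration only, not F# semantics): composition builds a new closure object per >> before any argument is supplied, whereas piping just applies each function in turn.

# Rough Python analogue of >> vs |> (illustrative only):
def compose(f, g):
    # Every composition allocates a fresh closure object up front.
    return lambda x: g(f(x))

def pipe(x, *fns):
    # Piping simply applies each function; no intermediate closures.
    for fn in fns:
        x = fn(x)
    return x

inc = lambda n: n + 1
dbl = lambda n: n * 2

composed = compose(compose(inc, dbl), inc)  # two closure allocations before any call
print(composed(10))                         # 23
print(pipe(10, inc, dbl, inc))              # 23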
Without any deep knowledge about F# internals, what I can tell from the generated IL is that compose will yield lambdas (and lots of them if optimizations are turned off), whereas in pipe all the calls to append* will be inlined.
Generated IL for pipe function:
Main.pipe:
IL_0000: nop
IL_0001: ldc.i4.s 40
IL_0003: newobj System.Text.StringBuilder..ctor
IL_0008: ldstr "START"
IL_000D: callvirt System.Text.StringBuilder.Append
IL_0012: ldc.i4.1
IL_0013: callvirt System.Text.StringBuilder.Append
IL_0018: ldc.i4.s 0A
IL_001A: callvirt System.Text.StringBuilder.Append
IL_001F: ldarg.0
IL_0020: callvirt System.Text.StringBuilder.Append
IL_0025: ldstr "0x"
IL_002A: callvirt System.Text.StringBuilder.Append
IL_002F: ldc.i4 FF FF 00 00
IL_0034: callvirt System.Text.StringBuilder.Append
IL_0039: ldc.i4.s 0A
IL_003B: callvirt System.Text.StringBuilder.Append
IL_0040: ldstr "test"
IL_0045: callvirt System.Text.StringBuilder.Append
IL_004A: ldstr "END"
IL_004F: callvirt System.Text.StringBuilder.Append
IL_0054: ret
Generated IL for compose function:
Main.compose:
IL_0000: nop
IL_0001: ldarg.0
IL_0002: newobj Main+compose#10..ctor
IL_0007: stloc.1
IL_0008: ldloc.1
IL_0009: newobj Main+compose#10-1..ctor
IL_000E: stloc.0
IL_000F: ldc.i4.s 40
IL_0011: newobj System.Text.StringBuilder..ctor
IL_0016: stloc.2
IL_0017: ldloc.0
IL_0018: ldloc.2
IL_0019: callvirt Microsoft.FSharp.Core.FSharpFunc<System.Text.StringBuilder,System.Text.StringBuilder>.Invoke
IL_001E: ldstr "END"
IL_0023: callvirt System.Text.StringBuilder.Append
IL_0028: ret
compose#10.Invoke:
IL_0000: nop
IL_0001: ldarg.0
IL_0002: ldfld Main+compose#10.a
IL_0007: ldarg.1
IL_0008: call Main.f#1
IL_000D: ldc.i4.s 0A
IL_000F: callvirt System.Text.StringBuilder.Append
IL_0014: ret
compose#10..ctor:
IL_0000: ldarg.0
IL_0001: call Microsoft.FSharp.Core.FSharpFunc<System.Text.StringBuilder,System.Text.StringBuilder>..ctor
IL_0006: ldarg.0
IL_0007: ldarg.1
IL_0008: stfld Main+compose#10.a
IL_000D: ret
compose#10-1.Invoke:
IL_0000: nop
IL_0001: ldarg.0
IL_0002: ldfld Main+compose#10-1.f
IL_0007: ldarg.1
IL_0008: callvirt Microsoft.FSharp.Core.FSharpFunc<System.Text.StringBuilder,System.Text.StringBuilder>.Invoke
IL_000D: ldstr "test"
IL_0012: callvirt System.Text.StringBuilder.Append
IL_0017: ret
compose#10-1..ctor:
IL_0000: ldarg.0
IL_0001: call Microsoft.FSharp.Core.FSharpFunc<System.Text.StringBuilder,System.Text.StringBuilder>..ctor
IL_0006: ldarg.0
IL_0007: ldarg.1
IL_0008: stfld Main+compose#10-1.f
IL_000D: ret

Several int3 in a row

I'm using x64dbg to inspect the disassembly of a .DLL.
At several points in the assembly I see several Int3 instructions in a row.
00007FFA24BF1638 | CC | int3 |
00007FFA24BF1639 | CC | int3 |
00007FFA24BF163A | CC | int3 |
00007FFA24BF163B | CC | int3 |
00007FFA24BF163C | CC | int3 |
00007FFA24BF163D | CC | int3 |
00007FFA24BF163E | CC | int3 |
00007FFA24BF163F | CC | int3 |
This instruction is used for debugging / breakpoints, right? So why are there so many in a row, and why are there any at all, considering this DLL was compiled with a release-configuration VC++ build?
It's probably just padding; these bytes won't ever be executed. The next function presumably begins at 00007FFA24BF1640, which is 16-byte aligned, and the preceding function ends just before these instructions. Compilers commonly fill such inter-function padding with int3 (0xCC) rather than nop so that a stray jump into the padding traps immediately instead of silently executing garbage.

Filter Values which are greater than its AVG value in Pig/Hive

This is my sample data:
+---------------------------+------+-----------+--------------+------------+--------+--------------+-------+------------------+
| Car | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model | Origin |
+---------------------------+------+-----------+--------------+------------+--------+--------------+-------+------------------+
| Chevrolet Chevelle Malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | US Buick |
| Skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | US Plymouth |
| Satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | US AMC Rebel |
| SST | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | US Ford |
| Torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | US Ford Galaxie |
| 500 | 15.0 | 8 | 429.0 | 198.0 | 4341 | 10.0 | 70 | US Chevrolet |
| Impala | 14.0 | 8 | 454.0 | 220.0 | 4354 | 9.0 | 70 | US Plymouth Fury |
| iii | 14.0 | 8 | 440.0 | 215.0 | 4312 | 8.5 | 70 | US |
+---------------------------+------+-----------+--------------+------------+--------+--------------+-------+------------------+
I want to find, for each car, the MPG and Horsepower values that are greater than their average value, i.e. mpg > AVG(mpg) and HorsePower > AVG(HorsePower).
What I did:
r = load '/user/CarData/cars.csv' using PigStorage(',') as (car:chararray,mpg:float,cyl:INT,disp:DOUBLE,hp:DOUBLE,weight:INT,acc:DOUBLE,model:INT,org:chararray);
r1 = group r by car;
r2 = foreach r1 generate group,AVG(r.mpg) as avg_mpg,AVG(r.hp) as avg_hp,r.mpg,r.hp;
This gives me the car name, the averages and a bag of mpg values; I am now having trouble filtering from r2.
I am trying something like:
FILTER r2 by r.mpg > AVG(mpg) and r.hp > AVG(hp)
Please help me. Thanks
In Hive, it would be something like
Select Car, MPG, Cylinders, Displacement, Horsepower, Weight, Acceleration, Model, Origin
FROM cars_table
JOIN (Select AVG(mpg) as a_m, AVG(hp) as a_h FROM cars_table) averages ON (1 = 1)
WHERE Horsepower > a_h AND MPG > a_m;
Input :
Chevrolet Chevelle Malibu,18.0,8,307.0,130.0,3504,12.0,70,US Buick
Skylark 320,15.0,8,350.0,165.0,3693,11.5,70,US Plymouth
Satellite,18.0,8,318.0,150.0,3436,11.0,70,US AMC Rebel
SST,16.0,8,304.0,150.0,3433,12.0,70,US Ford
Torino,17.0,8,302.0,140.0,3449,10.5,70,US Ford Galaxie
500,15.0,8,429.0,2.0,4341,10.0,70,US Chevrolet
500,45.0,8,429.0,198.0,4341,10.0,70,US Chevrolet
500,10.0,8,429.0,40.0,4341,10.0,70,US Chevrolet
Impala,14.0,8,454.0,220.0,4354,9.0,70,US Plymouth Fury
iii,14.0,8,440.0,215.0,4312,8.5,70,US
Code :
r = load 'test.data' using PigStorage(',') as (car:chararray,mpg:float,cyl:INT,disp:DOUBLE,hp:DOUBLE,weight:INT,acc:DOUBLE,model:INT,org:chararray);
r1 = group r by car;
r2 = foreach r1 generate FLATTEN(group) as (car_grp:chararray),(float)AVG(r.mpg) as (avg_mpg:float),
(DOUBLE)AVG(r.hp) as (avg_hp:DOUBLE);
j = JOIN r2 BY car_grp, r BY car;
r3 = foreach j generate r2::car_grp as (car:chararray),r::mpg as (mpg:float),r::cyl as (cyl:INT),r::disp as (disp:DOUBLE),r::hp as (hp:DOUBLE),r::weight as (weight:INT),r::acc as (acc:DOUBLE),r::model as (model:INT),r::org as (org:chararray),r2::avg_mpg as (avg_mpg:float),r2::avg_hp as (avg_hp:DOUBLE);
r4 = FILTER r3 BY mpg > avg_mpg AND hp > avg_hp;
Output :
(500,45.0,8,429.0,198.0,4341,10.0,70,US Chevrolet,23.333334,80.0)
You don't need to join tables as in the above answers; I feel this will be a more optimized version.
Input data:
Chevrolet Chevelle Malibu,18.0,8,307.0,130.0,3504,12.0,70,US Buick
Skylark 320,15.0,8,350.0,165.0,3693,11.5,70,US Plymouth
Satellite,18.0,8,318.0,150.0,3436,11.0,70,US AMC Rebel
SST,16.0,8,304.0,150.0,3433,12.0,70,US Ford
Torino,17.0,8,302.0,140.0,3449,10.5,70,US Ford Galaxie
500,15.0,8,429.0,198.0,4341,10.0,70,US Chevrolet
Impala1,14.0,8,454.0,220.0,4354,9.0,70,US Plymouth Fury
Impala2,25.0,8,454.0,270.0,4354,9.0,70,US Plymouth Fury
Impala3,30.0,8,454.0,290.0,4354,9.0,70,US Plymouth Fury
iii,14.0,8,440.0,215.0,4312,8.5,70,US
Pig Script:
input_data = LOAD '/pigsamples/carinfo' USING PigStorage (',')
AS (car:CHARARRAY, mpg:FLOAT, cyl:INT, disp:DOUBLE, hp:DOUBLE, weight:INT, acc:DOUBLE, model:INT, org:CHARARRAY);
group_data = GROUP input_data ALL;
average_values = FOREACH group_data GENERATE AVG(input_data.mpg) AS avg_mpg, AVG(input_data.hp) AS avg_hp;
filter_data = FILTER input_data BY mpg > average_values.avg_mpg AND hp > average_values.avg_hp;
Output:
(Impala2,25.0,8,454.0,270.0,4354,9.0,70,US Plymouth Fury)
(Impala3,30.0,8,454.0,290.0,4354,9.0,70,US Plymouth Fury)
I changed mpg for Impala and iii to 19.0 so that the query returns something. You want to avoid self-joins here; this can be accomplished efficiently with Hive windowing functions.
Hive:
select car, mpg, avg_mpg, horsepower, avg_hrspwr
from (
select car, mpg, horsepower
, avg( mpg ) over () as avg_mpg
, avg( horsepower ) over () as avg_hrspwr
from db.table ) x
where horsepower > avg_hrspwr and mpg > avg_mpg
Output:
Impala 19.0 17.125 220.0 171.0
iii 19.0 17.125 215.0 171.0
As far as Pig is concerned, I think @Sai Kiran Neelakantam's solution is pretty solid.
r = load 'A' using PigStorage(',') as (car:chararray,mpg:float,cyl:INT,disp:DOUBLE,hp:DOUBLE,weight:INT,acc:DOUBLE,model:INT,org:chararray);
r1 = group r by car;
r2 = foreach r1 generate group,FLATTEN(AVG(r.mpg)) as avg_mpg,AVG(r.hp) as avg_hp,FLATTEN(r.mpg) as mpg ,FLATTEN(r.hp) as hp;
r3 = FILTER r2 by mpg > avg_mpg and hp > avg_hp;
r3 = distinct r3;
dump r3;

Approximate cost to access various caches and main memory?

Can anyone give me the approximate time (in nanoseconds) to access L1, L2 and L3 caches, as well as main memory on Intel i7 processors?
While this isn't specifically a programming question, knowing these kinds of speed details is necessary for some low-latency programming challenges.
Numbers everyone should know
0.5 ns - CPU L1 dCACHE reference
1 ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance
5 ns - CPU L1 iCACHE Branch mispredict
7 ns - CPU L2 CACHE reference
71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
100 ns - MUTEX lock/unlock
100 ns - own DDR MEMORY reference
135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
202 ns - CPU cross-QPI/NUMA worst case on XEON E7-*
325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
10,000 ns - Compress 1K bytes with Zippy PROCESS
20,000 ns - Send 2K bytes over 1 Gbps NETWORK
250,000 ns - Read 1 MB sequentially from MEMORY
500,000 ns - Round trip within a same DataCenter
10,000,000 ns - DISK seek
10,000,000 ns - Read 1 MB sequentially from NETWORK
30,000,000 ns - Read 1 MB sequentially from DISK
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
| | | |
| | | ns|
| | us|
| ms|
From (originally by Peter Norvig):
- http://norvig.com/21-days.html#answers
- http://surana.wordpress.com/2009/01/01/numbers-everyone-should-know/
- http://sites.google.com/site/io/building-scalable-web-applications-with-google-app-engine
Here is a Performance Analysis Guide for the i7 and Xeon range of processors. I should stress that this has what you need and more (for example, check page 22 for some timings and cycles).
Additionally, this page has some details on clock cycles etc. The second link served the following numbers:
Core i7 Xeon 5500 Series Data Source Latency (approximate) [Pg. 22]
local L1 CACHE hit, ~4 cycles ( 2.1 - 1.2 ns )
local L2 CACHE hit, ~10 cycles ( 5.3 - 3.0 ns )
local L3 CACHE hit, line unshared ~40 cycles ( 21.4 - 12.0 ns )
local L3 CACHE hit, shared line in another core ~65 cycles ( 34.8 - 19.5 ns )
local L3 CACHE hit, modified in another core ~75 cycles ( 40.2 - 22.5 ns )
remote L3 CACHE (Ref: Fig.1 [Pg. 5]) ~100-300 cycles ( 160.7 - 30.0 ns )
local DRAM ~60 ns
remote DRAM ~100 ns
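The nanosecond ranges in that table are simply the cycle counts divided by the core clock; a quick sketch of the conversion (the ~1.9 to ~3.3 GHz clock span is an assumption chosen to roughly reproduce the table's ranges):

# Convert cache-hit cycle counts to nanoseconds for a given core clock.
def cycles_to_ns(cycles, clock_ghz):
    return cycles / clock_ghz

# Roughly reproduces the table above for parts clocked between ~1.9 and ~3.3 GHz:
for name, cycles in [('L1 hit', 4), ('L2 hit', 10), ('L3 hit, unshared', 40)]:
    print('%s: %.1f - %.1f ns' % (name, cycles_to_ns(cycles, 3.3), cycles_to_ns(cycles, 1.9)))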
EDIT2:
Most important is the note under the cited table, which says:
"NOTE: THESE VALUES ARE ROUGH APPROXIMATIONS. THEY DEPEND ON
CORE AND UNCORE FREQUENCIES, MEMORY SPEEDS, BIOS SETTINGS,
NUMBERS OF DIMMS, ETC,ETC..YOUR MILEAGE MAY VARY."
EDIT: I should highlight that, as well as timing/cycle information, the above Intel document covers much more (extremely useful) detail about the i7 and Xeon range of processors (from a performance point of view).
Cost to access various memories in a pretty page
See this page presenting the memory latency decrease from 1990 to 2020.
Summary
Values have decreased but have stabilized since 2005
1 ns L1 cache
3 ns Branch mispredict
4 ns L2 cache
17 ns Mutex lock/unlock
100 ns Main memory (RAM)
2 000 ns (2µs) 1KB Zippy-compress
Still some improvements, prediction for 2020
16 000 ns (16µs) SSD random read (olibre's note: should be less)
500 000 ns (½ms) Round trip in datacenter
2 000 000 ns (2ms) HDD random read (seek)
See also other sources
What every programmer should know about memory from Ulrich Drepper (2007)
Old but still an excellent deep explanation about memory hardware and software interaction.
Full PDF (114 pages)
Comments on LWN about PDF version
Also published as seven posts on LWN, with comments:
Part 1 - Introduction
Part 2 - Cache
Part 3 - Virtual Memory
Part 4 - NUMA support
Part 5 - What programmers can do
Part 6 - More things programmers can do
Part 7 - Memory performance tools
The post The Infinite Space Between Words on codinghorror.com, based on the book Systems Performance: Enterprise and the Cloud
Click on each processor listed on http://www.7-cpu.com/ to see its L1/L2/L3/RAM/... latencies (e.g. the Haswell i7-4770 has L1 = 1 ns, L2 = 3 ns, L3 = 10 ns, RAM = 67 ns, branch misprediction = 4 ns)
http://idarkside.org/posts/numbers-you-should-know/
See also
For further understanding, I recommend the excellent presentation on modern cache architectures (June 2014) by Gerhard Wellein, Hannes Hofmann and Dietmar Fey at University Erlangen-Nürnberg.
French-speaking readers may appreciate an article by SpaceFox comparing a processor to a developer, both waiting for the information they need before they can continue working.
Just for the sake of a 2020 review of the predictions made for 2025:
Over roughly the last 44 years of integrated-circuit technology, classical (non-quantum) processors have evolved, literally and physically, "Per Aspera ad Astra". The last decade has shown that the classical approach has come close to hurdles that have no achievable physical path forward.
The number of logical cores can and may grow, yet not beyond O(n^2~3)
Frequency [MHz] has already hit a physics-based ceiling that is hard, if not impossible, to circumvent
Transistor count can and may grow, yet less than O(n^2~3) ( power, noise, "clock" )
Power [W] can grow, yet problems with power distribution & heat dissipation will increase
Single-thread performance may grow, with direct benefits from larger cache footprints and faster, wider memory I/O, plus indirect benefits from less frequent forced context switching, since we can have more cores to split other threads/processes among
( Credits go to Leonardo Suriano & Karl Rupp )
2022: Still some improvements, prediction for 2025+
--------------------------------------------------------------------------------
0.001 ns light transfer in Gemmatimonas phototrophica bacteriae
| | | | |
| | | | ps|
| | | ns|
| | us| reminding us what Richard FEYNMAN told us:
| ms| "There's a plenty of space
s| down there"
-----s.-ms.-us.-ns|----------------------------------------------------------
0.1 ns - NOP
0.3 ns - XOR, ADD, SUB
0.5 ns - CPU L1 dCACHE reference (1st introduced in late 80-ies )
0.9 ns - JMP SHORT
1 ns - speed-of-light (a photon) travel a 1 ft (30.5cm) distance -- will stay, throughout any foreseeable future :o)
?~~~~~~~~~~~ 1 ns - MUL ( i**2 = MUL i, i )~~~~~~~~~ doing this 1,000 x is 1 [us]; 1,000,000 x is 1 [ms]; 1,000,000,000 x is 1 [s] ~~~~~~~~~~~~~~~~~~~~~~~~~
3~4 ns - CPU L2 CACHE reference (2020/Q1)
5 ns - CPU L1 iCACHE Branch mispredict
7 ns - CPU L2 CACHE reference
10 ns - DIV
19 ns - CPU L3 CACHE reference (2020/Q1 considered slow on 28c Skylake)
71 ns - CPU cross-QPI/NUMA best case on XEON E5-46*
100 ns - MUTEX lock/unlock
100 ns - own DDR MEMORY reference
135 ns - CPU cross-QPI/NUMA best case on XEON E7-*
202 ns - CPU cross-QPI/NUMA worst case on XEON E7-*
325 ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
|Q>~~~~~ 5,000 ns - QPU on-chip QUBO ( quantum annealer minimiser 1 Qop )
10,000 ns - Compress 1K bytes with a Zippy PROCESS
20,000 ns - Send 2K bytes over 1 Gbps NETWORK
250,000 ns - Read 1 MB sequentially from MEMORY
500,000 ns - Round trip within a same DataCenter
?~~~ 2,500,000 ns - Read 10 MB sequentially from MEMORY~~(about an empty python process to copy on spawn)~~~~ x ( 1 + nProcesses ) on spawned process instantiation(s), yet an empty python interpreter is indeed not a real-world, production-grade use-case, is it?
10,000,000 ns - DISK seek
10,000,000 ns - Read 1 MB sequentially from NETWORK
?~~ 25,000,000 ns - Read 100 MB sequentially from MEMORY~~(somewhat light python process to copy on spawn)~~~~ x ( 1 + nProcesses ) on spawned process instantiation(s)
30,000,000 ns - Read 1 MB sequentially from a DISK
?~~ 36,000,000 ns - Pickle.dump() SER a 10 MB object for IPC-transfer and remote DES in spawned process~~~~~~~~ x ( 2 ) for a single 10MB parameter-payload SER/DES + add an IPC-transport costs thereof or NETWORK-grade transport costs, if going into [distributed-computing] model Cluster ecosystem
150,000,000 ns - Send a NETWORK packet CA -> Netherlands
1s: | | |
. | | ns|
. | us|
. ms|
Just for the sake of a 2015 review of the predictions for 2020:
Still some improvements, prediction for 2020 (Ref. olibre's answer below)
16 000 ns ( 16 µs) SSD random read (olibre's note: should be less)
500 000 ns ( ½ ms) Round trip in datacenter
2 000 000 ns ( 2 ms) HDD random read (seek)
1s: | | |
. | | ns|
. | us|
. ms|
In 2015 there are currently available:
======================================
820 ns ( 0.8µs) random read from a SSD-DataPlane
1 200 ns ( 1.2µs) Round trip in datacenter
1 200 ns ( 1.2µs) random read from a HDD-DataPlane
1s: | | |
. | | ns|
. | us|
. ms|
Just for the sake of a CPU and GPU latency-landscape comparison:
It is not an easy task to compare even the simplest CPU / cache / DRAM lineups (even in a uniform memory access model). DRAM speed is one factor in determining latency, and loaded latency (on a saturated system) is what rules, being what enterprise applications will experience far more often than an idle, fully unloaded system.
+----------------------------------- 5,6,7,8,9,..12,15,16
| +--- 1066,1333,..2800..3300
v v
First word = ( ( CAS latency * 2 ) + ( 1 - 1 ) ) / Data Rate
Fourth word = ( ( CAS latency * 2 ) + ( 4 - 1 ) ) / Data Rate
Eighth word = ( ( CAS latency * 2 ) + ( 8 - 1 ) ) / Data Rate
^----------------------- 7x .. difference
********************************
So:
===
resulting DDR3-side latencies are between _____________
3.03 ns ^
|
36.58 ns ___v_ based on DDR3 HW facts
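Plugging the extremes into the formula above reproduces that range (a quick worked example; the specific CAS-latency/data-rate pairings are assumptions chosen to hit the endpoints):

# First/Nth-word latency in ns: (CAS_latency * 2 + (word_index - 1)) / (data_rate_MT_s / 1000)
def word_latency_ns(cas_latency, data_rate_mt_s, word_index=1):
    return (cas_latency * 2 + (word_index - 1)) / (data_rate_mt_s / 1000.0)

print(word_latency_ns(5, 3300))                 # fast part, 1st word  -> ~3.03 ns
print(word_latency_ns(16, 1066, word_index=8))  # slow part, 8th word  -> ~36.59 ns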
GPU engines have received a lot of technical marketing, while the deep internal dependencies are key to understanding both the real strengths and the real weaknesses these architectures experience in practice (typically much different from the aggressively marketed, whistled-up expectations).
1 ns _________ LETS SETUP A TIME/DISTANCE SCALE FIRST:
° ^
|\ |a 1 ft-distance a foton travels in vacuum ( less in dark-fibre )
| \ |
| \ |
__|___\__v____________________________________________________
| |
|<-->| a 1 ns TimeDOMAIN "distance", before a foton arrived
| |
^ v
DATA | |DATA
RQST'd| |RECV'd ( DATA XFER/FETCH latency )
25 ns # 1147 MHz FERMI: GPU Streaming Multiprocessor REGISTER access
35 ns # 1147 MHz FERMI: GPU Streaming Multiprocessor L1-onHit-[--8kB]CACHE
70 ns # 1147 MHz FERMI: GPU Streaming Multiprocessor SHARED-MEM access
230 ns # 1147 MHz FERMI: GPU Streaming Multiprocessor texL1-onHit-[--5kB]CACHE
320 ns # 1147 MHz FERMI: GPU Streaming Multiprocessor texL2-onHit-[256kB]CACHE
350 ns
700 ns # 1147 MHz FERMI: GPU Streaming Multiprocessor GLOBAL-MEM access
- - - - -
Understanding the internals is thus much more important here than in other fields, where architectures are published and numerous benchmarks are freely available. Many thanks go to the GPU micro-testers who have spent their time and creativity to unleash the truth about the real schemes of work inside the black-box-tested GPU devices.
+====================| + 11-12 [usec] XFER-LATENCY-up HostToDevice ~~~ same as Intel X48 / nForce 790i
| |||||||||||||||||| + 10-11 [usec] XFER-LATENCY-down DeviceToHost
| |||||||||||||||||| ~ 5.5 GB/sec XFER-BW-up ~~~ same as DDR2/DDR3 throughput
| |||||||||||||||||| ~ 5.2 GB/sec XFER-BW-down #8192 KB TEST-LOAD ( immune to attempts to OverClock PCIe_BUS_CLK 100-105-110-115 [MHz] ) [D:4.9.3]
|
| Host-side
| cudaHostRegister( void *ptr, size_t size, unsigned int flags )
| | +-------------- cudaHostRegisterPortable -- marks memory as PINNED MEMORY for all CUDA Contexts, not just the one, current, when the allocation was performed
| ___HostAllocWriteCombined_MEM / cudaHostFree() +---------------- cudaHostRegisterMapped -- maps memory allocation into the CUDA address space ( the Device pointer can be obtained by a call to cudaHostGetDevicePointer( void **pDevice, void *pHost, unsigned int flags=0 ); )
| ___HostRegisterPORTABLE___MEM / cudaHostUnregister( void *ptr )
| ||||||||||||||||||
| ||||||||||||||||||
| | PCIe-2.0 ( 4x) | ~ 4 GB/s over 4-Lanes ( PORT #2 )
| | PCIe-2.0 ( 8x) | ~16 GB/s over 8-Lanes
| | PCIe-2.0 (16x) | ~32 GB/s over 16-Lanes ( mode 16x )
|
| + PCIe-3.0 25-port 97-lanes non-blocking SwitchFabric ... +over copper/fiber
| ~~~ The latest PCIe specification, Gen 3, runs at 8Gbps per serial lane, enabling a 48-lane switch to handle a whopping 96 GBytes/sec. of full duplex peer to peer traffic. [I:]
|
| ~810 [ns] + InRam-"Network" / many-to-many parallel CPU/Memory "message" passing with less than 810 ns latency any-to-any
|
| ||||||||||||||||||
| ||||||||||||||||||
+====================|
|.pci............HOST|
My apologies for the "bigger picture", but latency masking also has cardinal limits imposed by on-chip smREG/L1/L2 capacities and hit/miss rates.
|.pci............GPU.|
| | FERMI [GPU-CLK] ~ 0.9 [ns] but THE I/O LATENCIES PAR -- ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| <800> warps ~~ 24000 + 3200 threads ~~ 27200 threads [!!]
| ^^^^^^^^|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ [!!]
| smREGs________________________________________ penalty +400 ~ +800 [GPU_CLKs] latency ( maskable by 400~800 WARPs ) on <Compile-time>-designed spillover(s) to locMEM__
| +350 ~ +700 [ns] #1147 MHz FERMI ^^^^^^^^
| | ^^^^^^^^
| +5 [ns] # 200 MHz FPGA. . . . . . Xilinx/Zync Z7020/FPGA massive-parallel streamline-computing mode ev. PicoBlazer softCPU
| | ^^^^^^^^
| ~ +20 [ns] #1147 MHz FERMI ^^^^^^^^
| SM-REGISTERs/thread: max 63 for CC-2.x -with only about +22 [GPU_CLKs] latency ( maskable by 22-WARPs ) to hide on [REGISTER DEPENDENCY] when arithmetic result is to be served from previous [INSTR] [G]:10.4, Page-46
| max 63 for CC-3.0 - about +11 [GPU_CLKs] latency ( maskable by 44-WARPs ) [B]:5.2.3, Page-73
| max 128 for CC-1.x PAR -- ||||||||~~~|
| max 255 for CC-3.5 PAR -- ||||||||||||||||||~~~~~~|
|
| smREGs___BW ANALYZE REAL USE-PATTERNs IN PTX-creation PHASE << -Xptxas -v || nvcc -maxrregcount ( w|w/o spillover(s) )
| with about 8.0 TB/s BW [C:Pg.46]
| 1.3 TB/s BW shaMEM___ 4B * 32banks * 15 SMs * half 1.4GHz = 1.3 TB/s only on FERMI
| 0.1 TB/s BW gloMEM___
| ________________________________________________________________________________________________________________________________________________________________________________________________________________________
+========| DEVICE:3 PERSISTENT gloMEM___
| _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+======| DEVICE:2 PERSISTENT gloMEM___
| _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+====| DEVICE:1 PERSISTENT gloMEM___
| _|______________________________________________________________________________________________________________________________________________________________________________________________________________________
+==| DEVICE:0 PERSISTENT gloMEM_____________________________________________________________________+440 [GPU_CLKs]_________________________________________________________________________|_GB|
! | |\ + |
o | texMEM___|_\___________________________________texMEM______________________+_______________________________________________________________________________________|_MB|
| |\ \ |\ + |\ |
| texL2cache_| \ \ .| \_ _ _ _ _ _ _ _texL2cache +370 [GPU_CLKs] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ | \ 256_KB|
| | \ \ | \ + |\ ^ \ |
| | \ \ | \ + | \ ^ \ |
| | \ \ | \ + | \ ^ \ |
| texL1cache_| \ \ .| \_ _ _ _ _ _texL1cache +260 [GPU_CLKs] _ _ _ _ _ _ _ _ _ | \_ _ _ _ _^ \ 5_KB|
| | \ \ | \ + ^\ ^ \ ^\ \ |
| shaMEM + conL3cache_| \ \ | \ _ _ _ _ conL3cache +220 [GPU_CLKs] ^ \ ^ \ ^ \ \ 32_KB|
| | \ \ | \ ^\ + ^ \ ^ \ ^ \ \ |
| | \ \ | \ ^ \ + ^ \ ^ \ ^ \ \ |
| ______________________|__________\_\_______________________|__________\_____^__\________+__________________________________________\_________\_____\________________________________|
| +220 [GPU-CLKs]_| |_ _ _ ___|\ \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ _+220 [GPU_CLKs] on re-use at some +50 GPU_CLKs _IF_ a FETCH from yet-in-shaL2cache
| L2-on-re-use-only +80 [GPU-CLKs]_| 64 KB L2_|_ _ _ __|\\ \ \_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ \ _ _ _ _\_ _ _ + 80 [GPU_CLKs] on re-use from L1-cached (HIT) _IF_ a FETCH from yet-in-shaL1cache
| L1-on-re-use-only +40 [GPU-CLKs]_| 8 KB L1_|_ _ _ _|\\\ \_\__________________________________\________\_____+ 40 [GPU_CLKs]_____________________________________________________________________________|
| L1-on-re-use-only + 8 [GPU-CLKs]_| 2 KB L1_|__________|\\\\__________\_\__________________________________\________\____+ 8 [GPU_CLKs]_________________________________________________________conL1cache 2_KB|
| on-chip|smREG +22 [GPU-CLKs]_| |t[0_______^:~~~~~~~~~~~~~~~~\:________]
|CC- MAX |_|_|_|_|_|_|_|_|_|_|_| |t[1_______^ :________]
|2.x 63 |_|_|_|_|_|_|_|_|_|_|_| |t[2_______^ :________]
|1.x 128 |_|_|_|_|_|_|_|_|_|_|_| |t[3_______^ :________]
|3.5 255 REGISTERs|_|_|_|_|_|_|_|_| |t[4_______^ :________]
| per|_|_|_|_|_|_|_|_|_|_|_| |t[5_______^ :________]
| Thread_|_|_|_|_|_|_|_|_|_| |t[6_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[7_______^ 1stHalf-WARP :________]______________
| |_|_|_|_|_|_|_|_|_|_|_| |t[ 8_______^:~~~~~~~~~~~~~~~~~:________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ 9_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ A_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ B_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ C_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ D_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| |t[ E_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| W0..|t[ F_______^____________WARP__:________]_____________
| |_|_|_|_|_|_|_|_|_|_|_| ..............
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[0_______^:~~~~~~~~~~~~~~~\:________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[1_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[2_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[3_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[4_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[5_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[6_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[7_______^ 1stHalf-WARP :________]______________
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ 8_______^:~~~~~~~~~~~~~~~~:________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ 9_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ A_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ B_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ C_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ D_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ............|t[ E_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| W1..............|t[ F_______^___________WARP__:________]_____________
| |_|_|_|_|_|_|_|_|_|_|_| ....................................................
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[0_______^:~~~~~~~~~~~~~~~\:________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[1_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[2_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[3_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[4_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[5_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[6_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[7_______^ 1stHalf-WARP :________]______________
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ 8_______^:~~~~~~~~~~~~~~~~:________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ 9_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ A_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ B_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ C_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ D_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_| ...................................................|t[ E_______^ :________]
| |_|_|_|_|_|_|_|_|_|_|_|tBlock Wn....................................................|t[ F_______^___________WARP__:________]_____________
|
| ________________ °°°°°°°°°°°°°°°°°°°°°°°°°°~~~~~~~~~~°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°
| / \ CC-2.0|||||||||||||||||||||||||| ~masked ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| / \ 1.hW ^|^|^|^|^|^|^|^|^|^|^|^|^| <wait>-s ^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|
| / \ 2.hW |^|^|^|^|^|^|^|^|^|^|^|^|^ |^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^|^
|_______________/ \______I|I|I|I|I|I|I|I|I|I|I|I|I|~~~~~~~~~~I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|I|
|~~~~~~~~~~~~~~/ SM:0.warpScheduler /~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~~~~~~~~~~~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I~I
| \ | //
| \ RR-mode //
| \ GREEDY-mode //
| \________________//
| \______________/SM:0__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:1__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:2__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:3__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:4__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:5__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:6__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:7__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:8__________________________________________________________________________________
| | |t[ F_______^___________WARP__:________]_______
| ..|SM:9__________________________________________________________________________________
| ..|SM:A |t[ F_______^___________WARP__:________]_______
| ..|SM:B |t[ F_______^___________WARP__:________]_______
| ..|SM:C |t[ F_______^___________WARP__:________]_______
| ..|SM:D |t[ F_______^___________WARP__:________]_______
| |_______________________________________________________________________________________
The bottom line?
Any low-latency-motivated design has to reverse-engineer the "I/O hydraulics" (as 0/1 transfers are incompressible by nature), and the resulting latencies rule the performance envelope of any GPGPU solution, whether it is computationally intensive (read: where processing costs forgive poor-latency transfers a bit more) or not (read: where, perhaps to someone's surprise, CPUs are faster in end-to-end processing than GPU fabrics [citations available]).
Look at this "staircase" plot, which perfectly illustrates the different access times (in clock ticks). Notice that the red CPU has an additional "step", probably because it has an L4 cache (while the others don't).
Taken from this Extremetech article.
In computer science this is called "I/O complexity".

Resources