Missing bounds-check elimination in the String constructor? - performance

Looking into UTF-8 decoding performance, I noticed that protobuf's UnsafeProcessor::decodeUtf8 performs better than String(byte[] bytes, int offset, int length, Charset charset) for the following non-ASCII string: "Quizdeltagerne spiste jordbær med flØde, mens cirkusklovnen".
I tried to figure out why, so I copied the relevant code in String and replaced the array accesses with unsafe array accesses, same as UnsafeProcessor::decodeUtf8.
Here are the JMH benchmark results:
Benchmark                       Mode  Cnt    Score   Error  Units
StringBenchmark.safeDecoding    avgt   10  127.107 ± 3.642  ns/op
StringBenchmark.unsafeDecoding  avgt   10  100.915 ± 4.090  ns/op
I assume the difference comes from a missing bounds-check elimination that I expected to kick in, especially since there is an explicit bounds check in the form of a call to checkBoundsOffCount(offset, length, bytes.length) at the beginning of String(byte[] bytes, int offset, int length, Charset charset).
Is the issue really a missing bounds check elimination?
Here's the code I benchmarked, using OpenJDK 17 & JMH. Note that this is only part of the String(byte[] bytes, int offset, int length, Charset charset) constructor code, and it works correctly only for this specific Danish string.
The static methods were copied from String.
Look for the // the unsafe version: comments that indicate where I replaced the safe access with unsafe.
private static byte[] safeDecode(byte[] bytes, int offset, int length) {
    checkBoundsOffCount(offset, length, bytes.length);
    int sl = offset + length;
    int dp = 0;
    byte[] dst = new byte[length];
    while (offset < sl) {
        int b1 = bytes[offset];
        // the unsafe version:
        // int b1 = UnsafeUtil.getByte(bytes, offset);
        if (b1 >= 0) {
            dst[dp++] = (byte)b1;
            offset++;
            continue;
        }
        if ((b1 == (byte)0xc2 || b1 == (byte)0xc3) &&
                offset + 1 < sl) {
            // the unsafe version:
            // int b2 = UnsafeUtil.getByte(bytes, offset + 1);
            int b2 = bytes[offset + 1];
            if (!isNotContinuation(b2)) {
                dst[dp++] = (byte)decode2(b1, b2);
                offset += 2;
                continue;
            }
        }
        // anything not a latin1, including the repl
        // we have to go with the utf16
        break;
    }
    if (offset == sl) {
        if (dp != dst.length) {
            dst = Arrays.copyOf(dst, dp);
        }
        return dst;
    }
    return dst;
}
Followup
Apparently, if I change the while loop condition from offset < sl to 0 <= offset && offset < sl, I get similar performance in both versions:
Benchmark                       Mode  Cnt    Score    Error  Units
StringBenchmark.safeDecoding    avgt   10  100.802 ± 13.147  ns/op
StringBenchmark.unsafeDecoding  avgt   10  102.774 ±  3.893  ns/op
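As a minimal, hypothetical reduction of that change (class and method names are mine, not from java.lang.String), the idea can be sketched like this: the extra 0 <= offset clause is always true, but it hands the JIT an explicit lower bound on the index, which in the benchmark above let C2 drop the per-access bounds check:

```java
public class BoundsCheckHint {
    // Counts ASCII bytes in bytes[offset, offset + length), mirroring the
    // shape of the decoding loop above (illustrative reduction only).
    static int countAscii(byte[] bytes, int offset, int length) {
        int sl = offset + length;
        int count = 0;
        // "0 <= offset" is redundant for correctness, but it states the
        // lower bound of the index explicitly, which is what helped C2
        // eliminate the bounds check in the followup benchmark.
        while (0 <= offset && offset < sl) {
            if (bytes[offset] >= 0) {
                count++;
            }
            offset++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = "abcæø".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(countAscii(data, 0, data.length)); // prints 3
    }
}
```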
Conclusion
This question was picked up by HotSpot developers as https://bugs.openjdk.java.net/browse/JDK-8278518.
Optimizing this code ended up giving a 2.5x boost to decoding the Latin1 string above.
This C2 optimization closes the remarkable (more than 7x) gap between commonBranchFirst and commonBranchSecond in the benchmark below, and will land in Java 19.
Benchmark                         Mode  Cnt     Score    Error  Units
LoopBenchmark.commonBranchFirst   avgt   25  1737.111 ± 56.526  ns/op
LoopBenchmark.commonBranchSecond  avgt   25   232.798 ± 12.676  ns/op
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class LoopBenchmark {
    private final boolean[] mostlyTrue = new boolean[1000];

    @Setup
    public void setup() {
        for (int i = 0; i < mostlyTrue.length; i++) {
            mostlyTrue[i] = i % 100 > 0;
        }
    }

    @Benchmark
    public int commonBranchFirst() {
        int i = 0;
        while (i < mostlyTrue.length) {
            if (mostlyTrue[i]) {
                i++;
            } else {
                i += 2;
            }
        }
        return i;
    }

    @Benchmark
    public int commonBranchSecond() {
        int i = 0;
        while (i < mostlyTrue.length) {
            if (!mostlyTrue[i]) {
                i += 2;
            } else {
                i++;
            }
        }
        return i;
    }
}

To measure the branch you are interested in, and particularly the scenario where the while loop becomes hot, I've used the following benchmark:
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class StringConstructorBenchmark {
    private byte[] array;

    @Setup
    public void setup() {
        String str = "Quizdeltagerne spiste jordbær med fløde, mens cirkusklovnen. Я";
        array = str.getBytes(StandardCharsets.UTF_8);
    }

    @Benchmark
    public String newString() {
        return new String(array, 0, array.length, StandardCharsets.UTF_8);
    }
}
And indeed, with the modified constructor it does give a significant improvement:
//baseline
Benchmark                             Mode  Cnt    Score   Error  Units
StringConstructorBenchmark.newString  avgt   50  173.092 ± 3.048  ns/op

//patched
Benchmark                             Mode  Cnt    Score   Error  Units
StringConstructorBenchmark.newString  avgt   50  126.908 ± 2.355  ns/op
This is likely to be a HotSpot issue: the optimizing compiler for some reason failed to eliminate the array bounds check within the while loop. I guess the reason is that offset is modified within the loop:
while (offset < sl) {
    int b1 = bytes[offset];
    if (b1 >= 0) {
        dst[dp++] = (byte)b1;
        offset++; // <---
        continue;
    }
    if ((b1 == (byte)0xc2 || b1 == (byte)0xc3) &&
            offset + 1 < sl) {
        int b2 = bytes[offset + 1];
        if (!isNotContinuation(b2)) {
            dst[dp++] = (byte)decode2(b1, b2);
            offset += 2;
            continue;
        }
    }
    // anything not a latin1, including the repl
    // we have to go with the utf16
    break;
}
Also, I've looked into the generated code via LinuxPerfAsmProfiler. Here's the link for the baseline: https://gist.github.com/stsypanov/d2524f98477d633fb1d4a2510fedeea6, and this one is for the patched constructor: https://gist.github.com/stsypanov/16c787e4f9fa3dd122522f16331b68b7
What should one pay attention to? Let's find the code corresponding to int b1 = bytes[offset]; (line 538). In the baseline we have this:
3.62% ││ │ 0x00007fed70eb4c1c: mov %ebx,%ecx
2.29% ││ │ 0x00007fed70eb4c1e: mov %edx,%r9d
2.22% ││ │ 0x00007fed70eb4c21: mov (%rsp),%r8 ;*iload_2 {reexecute=0 rethrow=0 return_oop=0}
││ │ ; - java.lang.String::<init>@107 (line 537)
2.32% ↘│ │ 0x00007fed70eb4c25: cmp %r13d,%ecx
│ │ 0x00007fed70eb4c28: jge 0x00007fed70eb5388 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - java.lang.String::<init>@110 (line 537)
3.05% │ │ 0x00007fed70eb4c2e: cmp 0x8(%rsp),%ecx
│ │ 0x00007fed70eb4c32: jae 0x00007fed70eb5319
2.38% │ │ 0x00007fed70eb4c38: mov %r8,(%rsp)
2.64% │ │ 0x00007fed70eb4c3c: movslq %ecx,%r8
2.46% │ │ 0x00007fed70eb4c3f: mov %rax,%rbx
3.44% │ │ 0x00007fed70eb4c42: sub %r8,%rbx
2.62% │ │ 0x00007fed70eb4c45: add $0x1,%rbx
2.64% │ │ 0x00007fed70eb4c49: and $0xfffffffffffffffe,%rbx
2.30% │ │ 0x00007fed70eb4c4d: mov %ebx,%r8d
3.08% │ │ 0x00007fed70eb4c50: add %ecx,%r8d
2.55% │ │ 0x00007fed70eb4c53: movslq %r8d,%r8
2.45% │ │ 0x00007fed70eb4c56: add $0xfffffffffffffffe,%r8
2.13% │ │ 0x00007fed70eb4c5a: cmp (%rsp),%r8
│ │ 0x00007fed70eb4c5e: jae 0x00007fed70eb5319
3.36% │ │ 0x00007fed70eb4c64: mov %ecx,%edi ;*aload_1 {reexecute=0 rethrow=0 return_oop=0}
│ │ ; - java.lang.String::<init>@113 (line 538)
2.86% │ ↗│ 0x00007fed70eb4c66: movsbl 0x10(%r14,%rdi,1),%r8d ;*baload {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - java.lang.String::<init>@115 (line 538)
2.48% │ ││ 0x00007fed70eb4c6c: mov %r9d,%edx
2.26% │ ││ 0x00007fed70eb4c6f: inc %edx ;*iinc {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - java.lang.String::<init>@127 (line 540)
3.28% │ ││ 0x00007fed70eb4c71: mov %edi,%ebx
2.44% │ ││ 0x00007fed70eb4c73: inc %ebx ;*iinc {reexecute=0 rethrow=0 return_oop=0}
│ ││ ; - java.lang.String::<init>@134 (line 541)
2.35% │ ││ 0x00007fed70eb4c75: test %r8d,%r8d
╰ ││ 0x00007fed70eb4c78: jge 0x00007fed70eb4c04 ;*iflt {reexecute=0 rethrow=0 return_oop=0}
││ ; - java.lang.String::<init>@120 (line 539)
and in the patched code the corresponding part is:
17.28% ││ 0x00007f6b88eb6061: mov %edx,%r10d ;*iload_2 {reexecute=0 rethrow=0 return_oop=0}
││ ; - java.lang.String::<init>@107 (line 537)
0.11% ↘│ 0x00007f6b88eb6064: test %r10d,%r10d
│ 0x00007f6b88eb6067: jl 0x00007f6b88eb669c ;*iflt {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::<init>@108 (line 537)
0.39% │ 0x00007f6b88eb606d: cmp %r13d,%r10d
│ 0x00007f6b88eb6070: jge 0x00007f6b88eb66d0 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::<init>@114 (line 537)
0.66% │ 0x00007f6b88eb6076: mov %ebx,%r9d
13.70% │ 0x00007f6b88eb6079: cmp 0x8(%rsp),%r10d
0.01% │ 0x00007f6b88eb607e: jae 0x00007f6b88eb6671
0.14% │ 0x00007f6b88eb6084: movsbl 0x10(%r14,%r10,1),%edi ;*baload {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::<init>@119 (line 538)
0.37% │ 0x00007f6b88eb608a: mov %r9d,%ebx
0.99% │ 0x00007f6b88eb608d: inc %ebx ;*iinc {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::<init>@131 (line 540)
12.88% │ 0x00007f6b88eb608f: movslq %r9d,%rsi ;*bastore {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::<init>@196 (line 548)
0.17% │ 0x00007f6b88eb6092: mov %r10d,%edx
0.39% │ 0x00007f6b88eb6095: inc %edx ;*iinc {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::<init>@138 (line 541)
0.96% │ 0x00007f6b88eb6097: test %edi,%edi
0.02% │ 0x00007f6b88eb6099: jl 0x00007f6b88eb60dc ;*iflt {reexecute=0 rethrow=0 return_oop=0}
│ ; - java.lang.String::<init>@124 (line 539)
In the baseline, between the if_icmpge and aload_1 bytecode instructions we have a bounds check, but there is none in the patched code.
So your original assumption is correct: it is about missing bounds-check elimination.
UPD: I must correct my answer: it turned out that the bounds check is still there:
13.70% │ 0x00007f6b88eb6079: cmp 0x8(%rsp),%r10d
0.01% │ 0x00007f6b88eb607e: jae 0x00007f6b88eb6671
and the code I pointed out is something the compiler introduces, but it does nothing. The issue is still about the bounds check, as declaring it explicitly works around the problem ad hoc.

Related

How to check missing values in ClickHouse

I have a table that is filled with data every 15 minutes. I need to check that there is data for every day of the entire period. There is a time column in which the data is in the format yyyy-mm-dd hh:mm:ss.
I've found the start date and the last date of the period.
I found out that you can generate an array of dates from this interval (start and end dates), against which each row can be compared; where there is no match, that is a missing date.
i've tried this:
WITH dates_range AS (SELECT toDate(min(time)) AS start_date,
toDate(max(time)) AS end_date
FROM table)
SELECT dates
FROM (
SELECT arrayFlatten(arrayMap(x -> start_date + x, range(0, toUInt64(end_date - start_date)))) AS dates
FROM dates_range
)
LEFT JOIN (
SELECT toDate(time) AS date
FROM table
GROUP BY toDate(time)
) USING date
WHERE date IS NULL;
but it returns Code: 10. DB::Exception: Not found column date in block. There are only columns: dates. (NOT_FOUND_COLUMN_IN_BLOCK), and I can't figure out how to fix it.
You can also use the WITH FILL modifier: https://clickhouse.com/docs/en/sql-reference/statements/select/order-by/#order-by-expr-with-fill-modifier
create table T ( time DateTime) engine=Memory
as SELECT toDateTime('2020-01-01') + (((number * 60) * 24) * if((number % 33) = 0, 3, 1))
FROM numbers(550);
SELECT *
FROM
(
SELECT
toDate(time) AS t,
count() AS c
FROM T
GROUP BY t
ORDER BY t ASC WITH FILL
)
WHERE c = 0
┌──────────t─┬─c─┐
│ 2020-01-11 │ 0 │
│ 2020-01-13 │ 0 │
│ 2020-01-16 │ 0 │
│ 2020-01-18 │ 0 │
│ 2020-01-21 │ 0 │
│ 2020-01-23 │ 0 │
│ 2020-01-26 │ 0 │
└────────────┴───┘
create table T ( time DateTime) engine=Memory
as SELECT toDateTime('2020-01-01') + (((number * 60) * 24) * if((number % 33) = 0, 3, 1))
FROM numbers(550);
WITH (SELECT (toDate(min(time)), toDate(max(time))) FROM T) as x
select date, sumIf(cnt, type=1) c1, sumIf(cnt, type=2) c2 from
( SELECT arrayJoin(arrayFlatten(arrayMap(x -> x.1 + x, range(0, toUInt64(x.2 - x.1+1))))) AS date, 2 type, 1 cnt
union all SELECT toDate(time) AS date, 1 type, count() cnt FROM T GROUP BY toDate(time) )
group by date
having c1 = 0 or c2 = 0;
┌───────date─┬─c1─┬─c2─┐
│ 2020-01-11 │ 0 │ 1 │
│ 2020-01-13 │ 0 │ 1 │
│ 2020-01-16 │ 0 │ 1 │
│ 2020-01-18 │ 0 │ 1 │
│ 2020-01-21 │ 0 │ 1 │
│ 2020-01-23 │ 0 │ 1 │
│ 2020-01-26 │ 0 │ 1 │
└────────────┴────┴────┘
create table T ( time DateTime) engine=Memory
as SELECT toDateTime('2020-01-01') + (((number * 60) * 24) * if((number % 33) = 0, 3, 1))
FROM numbers(550);
WITH (SELECT (toDate(min(time)), toDate(max(time))) FROM T) as x
SELECT l.*, r.*
FROM ( SELECT arrayJoin(arrayFlatten(arrayMap(x -> x.1 + x, range(0, toUInt64(x.2 - x.1+1))))) AS date) l
LEFT JOIN ( SELECT toDate(time) AS date FROM T GROUP BY toDate(time)
) r USING date
WHERE r.date IS NULL settings join_use_nulls = 1;
┌───────date─┬─r.date─┐
│ 2020-01-11 │ ᴺᵁᴸᴸ │
│ 2020-01-13 │ ᴺᵁᴸᴸ │
│ 2020-01-16 │ ᴺᵁᴸᴸ │
│ 2020-01-18 │ ᴺᵁᴸᴸ │
│ 2020-01-21 │ ᴺᵁᴸᴸ │
│ 2020-01-23 │ ᴺᵁᴸᴸ │
│ 2020-01-26 │ ᴺᵁᴸᴸ │
└────────────┴────────┘
create table T ( time DateTime) engine=Memory
as SELECT toDateTime('2020-01-01') + (((number * 60) * 24) * if((number % 33) = 0, 3, 1))
FROM numbers(550);
select b from (
SELECT
b,
((b - any(b) OVER (ORDER BY b ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING))) AS lag
FROM
(
SELECT toDate(time) AS b
FROM T
GROUP BY b
ORDER BY b ASC
)) where lag > 1 and lag < 10000
┌──────────b─┐
│ 2020-01-12 │
│ 2020-01-14 │
│ 2020-01-17 │
│ 2020-01-19 │
│ 2020-01-22 │
│ 2020-01-24 │
│ 2020-01-27 │
└────────────┘

How to rewrite this deprecated expression using do and "by", with "groupby" (Julia)

The goal is to generate fake data.
We generate a set of parameters,
## Simulated data
df_3 = DataFrame(y = [0,1], size = [250,250], x1 =[2.,0.], x2 =[-1.,-2.])
Now, I want to generate the fake data per se,
df_knn = by(df_3, :y) do df
DataFrame(x_1 = rand(Normal(df[1,:x1],1), df[1,:size]),
x_2 = rand(Normal(df[1,:x2],1), df[1,:size]))
end
How can I replace by with groupby here?
SOURCE: This excerpt is from the book Data Science with Julia (2019).
I think this is what you mean here:
julia> combine(groupby(df_3, :y)) do df
DataFrame(x_1 = rand(Normal(df[1,:x1],1), df[1,:size]),
x_2 = rand(Normal(df[1,:x2],1), df[1,:size]))
end
500×3 DataFrame
Row │ y x_1 x_2
│ Int64 Float64 Float64
─────┼─────────────────────────────
1 │ 0 1.88483 0.890807
2 │ 0 2.50124 -0.280708
3 │ 0 1.1857 0.823002
⋮ │ ⋮ ⋮ ⋮
498 │ 1 -0.611168 -0.856527
499 │ 1 0.491412 -3.09562
500 │ 1 0.242016 -1.42652
494 rows omitted

What does head.next actually mean in linked lists?

I have created a linked list with a Node class having Node next and int data.
What does head.next mean? I'm confused.
Consider this list: 1 > 2 > 3 > 4, and 1 is the head.
Now if I execute head=head.next, head is now pointing to 2. But when I execute head.next=null, 1 is pointing to null. Can someone explain why this happens?
It may help to visualise things.
Consider this list: 1 > 2 > 3 > 4, and 1 is the head.
head
 ↓
┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐
│ value: 1  │    │ value: 2  │    │ value: 3  │    │ value: 4  │
│ next: ────────→│ next: ────────→│ next: ────────→│ next: null│
└───────────┘    └───────────┘    └───────────┘    └───────────┘
if I execute head=head.next, head is now pointing to 2
                 head
                  ↓
┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐
│ value: 1  │    │ value: 2  │    │ value: 3  │    │ value: 4  │
│ next: ────────→│ next: ────────→│ next: ────────→│ next: null│
└───────────┘    └───────────┘    └───────────┘    └───────────┘
but when I execute head.next=null, 1 is pointing to null.
head
 ↓
┌───────────┐    ┌───────────┐    ┌───────────┐    ┌───────────┐
│ value: 1  │    │ value: 2  │    │ value: 3  │    │ value: 4  │
│ next: null│    │ next: ────────→│ next: ────────→│ next: null│
└───────────┘    └───────────┘    └───────────┘    └───────────┘
In conclusion:
When you assign a value to head, you change what head refers to. This does not affect any next reference: the links between the nodes remain unchanged. You just might have lost access to the first node, if head was the only reference to it. In that case you actually make the list one node shorter*, as it now starts at 2.
*(When a node becomes unreachable, it no longer has any practical use. The garbage collector may free the memory it occupied.)
When you don't assign a value to head, but do head.next = ..., then you assign a value to a property of the node that head references, and that does affect the list. This is called mutation. On the other hand, head will still refer to node 1, since you didn't assign a value to the head variable.
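Since the question uses Java-style head.next, the two operations can be sketched side by side in Java (a minimal, hypothetical Node class, not from any library):

```java
public class LinkedListDemo {
    static class Node {
        int data;
        Node next;
        Node(int data) { this.data = data; }
    }

    public static void main(String[] args) {
        // Build the list 1 > 2 > 3 > 4.
        Node head = new Node(1);
        head.next = new Node(2);
        head.next.next = new Node(3);
        head.next.next.next = new Node(4);

        Node first = head;                    // second reference to node 1

        head = head.next;                     // reassign the variable only
        System.out.println(head.data);        // prints 2
        System.out.println(first.next.data);  // prints 2 -- links untouched

        first.next = null;                    // mutate node 1's next field
        System.out.println(head.data);        // prints 2 -- head unaffected
    }
}
```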
Assuming that you are referring to the linked list example below
#include <bits/stdc++.h>
using namespace std;

class Node {
public:
    int data;
    Node* next;
};

int main()
{
    Node* head = NULL;
    Node* second = NULL;
    Node* third = NULL;

    head = new Node();
    second = new Node();
    third = new Node();

    head->data = 1;
    head->next = second;
    second->data = 2;
    second->next = third;
    third->data = 3;
    third->next = NULL;

    return 0;
}
Here Node is a class in which a pointer next to another Node is declared, along with an integer field data used to store each node's value:
class Node {
public:
    int data;
    Node* next;
};
head here is an object of class Node. Using the next pointer declared inside it, we point to the next Node object, second, with head->next = second. Now the pointer of our head object holds the address of the second object we created; this is how the head object is linked with the second object. The data variable stores the value given to every object, or in this case every node.
head = new Node();
second = new Node();
third = new Node();
head->data = 1;
head->next = second;
second->data = 2;
second->next = third;
third->data = 3;
third->next = NULL;
Coming to your second point, I don't get why you would point head.next to head (itself), and head->next = NULL gives you nothing, as you have to point the pointer at a second object to create a linked list.

How to avoid memory allocations in custom Julia iterators?

Consider the following Julia "compound" iterator: it merges two iterators, a and b,
each of which are assumed to be sorted according to order, to a single ordered
sequence:
struct MergeSorted{T,A,B,O}
    a::A
    b::B
    order::O
    MergeSorted(a::A, b::B, order::O=Base.Order.Forward) where {A,B,O} =
        new{promote_type(eltype(A),eltype(B)),A,B,O}(a, b, order)
end

Base.eltype(::Type{MergeSorted{T,A,B,O}}) where {T,A,B,O} = T

@inline function Base.iterate(self::MergeSorted{T},
                              state=(iterate(self.a), iterate(self.b))) where T
    a_result, b_result = state
    if b_result === nothing
        a_result === nothing && return nothing
        a_curr, a_state = a_result
        return T(a_curr), (iterate(self.a, a_state), b_result)
    end
    b_curr, b_state = b_result
    if a_result !== nothing
        a_curr, a_state = a_result
        Base.Order.lt(self.order, a_curr, b_curr) &&
            return T(a_curr), (iterate(self.a, a_state), b_result)
    end
    return T(b_curr), (a_result, iterate(self.b, b_state))
end
This code works, but is type-unstable, since the Julia iteration facilities inherently are. In most cases the compiler can work this out automatically; however, here it does not: the following test code illustrates that temporaries are created:
julia> x = MergeSorted([1,4,5,9,32,44], [0,7,9,24,134]);
julia> sum(x);
julia> @time sum(x);
0.000013 seconds (61 allocations: 2.312 KiB)
Note the allocation count.
Is there any way to efficiently debug such situations, other than playing around with the code and hoping that the compiler will be able to optimize out the type ambiguities? Does anyone know a solution in this specific case that does not create temporaries?
How to diagnose the problem?
Answer: use @code_warntype
Run:
julia> @code_warntype iterate(x, iterate(x)[2])
Variables
#self#::Core.Const(iterate)
self::MergeSorted{Int64, Vector{Int64}, Vector{Int64}, Base.Order.ForwardOrdering}
state::Tuple{Tuple{Int64, Int64}, Tuple{Int64, Int64}}
#_4::Int64
#_5::Int64
#_6::Union{}
#_7::Int64
b_state::Int64
b_curr::Int64
a_state::Int64
a_curr::Int64
b_result::Tuple{Int64, Int64}
a_result::Tuple{Int64, Int64}
Body::Tuple{Int64, Any}
1 ─ nothing
│ Core.NewvarNode(:(#_4))
│ Core.NewvarNode(:(#_5))
│ Core.NewvarNode(:(#_6))
│ Core.NewvarNode(:(b_state))
│ Core.NewvarNode(:(b_curr))
│ Core.NewvarNode(:(a_state))
│ Core.NewvarNode(:(a_curr))
│ %9 = Base.indexed_iterate(state, 1)::Core.PartialStruct(Tuple{Tuple{Int64, Int64}, Int64}, Any[Tuple{Int64, Int64}, Core.Const(2)])
│ (a_result = Core.getfield(%9, 1))
│ (#_7 = Core.getfield(%9, 2))
│ %12 = Base.indexed_iterate(state, 2, #_7::Core.Const(2))::Core.PartialStruct(Tuple{Tuple{Int64, Int64}, Int64}, Any[Tuple{Int64, Int64}, Core.Const(3)])
│ (b_result = Core.getfield(%12, 1))
│ %14 = (b_result === Main.nothing)::Core.Const(false)
└── goto #3 if not %14
2 ─ Core.Const(:(a_result === Main.nothing))
│ Core.Const(:(%16))
│ Core.Const(:(return Main.nothing))
│ Core.Const(:(Base.indexed_iterate(a_result, 1)))
│ Core.Const(:(a_curr = Core.getfield(%19, 1)))
│ Core.Const(:(#_6 = Core.getfield(%19, 2)))
│ Core.Const(:(Base.indexed_iterate(a_result, 2, #_6)))
│ Core.Const(:(a_state = Core.getfield(%22, 1)))
│ Core.Const(:(($(Expr(:static_parameter, 1)))(a_curr)))
│ Core.Const(:(Base.getproperty(self, :a)))
│ Core.Const(:(Main.iterate(%25, a_state)))
│ Core.Const(:(Core.tuple(%26, b_result)))
│ Core.Const(:(Core.tuple(%24, %27)))
└── Core.Const(:(return %28))
3 ┄ %30 = Base.indexed_iterate(b_result, 1)::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(2)])
│ (b_curr = Core.getfield(%30, 1))
│ (#_5 = Core.getfield(%30, 2))
│ %33 = Base.indexed_iterate(b_result, 2, #_5::Core.Const(2))::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(3)])
│ (b_state = Core.getfield(%33, 1))
│ %35 = (a_result !== Main.nothing)::Core.Const(true)
└── goto #6 if not %35
4 ─ %37 = Base.indexed_iterate(a_result, 1)::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(2)])
│ (a_curr = Core.getfield(%37, 1))
│ (#_4 = Core.getfield(%37, 2))
│ %40 = Base.indexed_iterate(a_result, 2, #_4::Core.Const(2))::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(3)])
│ (a_state = Core.getfield(%40, 1))
│ %42 = Base.Order::Core.Const(Base.Order)
│ %43 = Base.getproperty(%42, :lt)::Core.Const(Base.Order.lt)
│ %44 = Base.getproperty(self, :order)::Core.Const(Base.Order.ForwardOrdering())
│ %45 = a_curr::Int64
│ %46 = (%43)(%44, %45, b_curr)::Bool
└── goto #6 if not %46
5 ─ %48 = ($(Expr(:static_parameter, 1)))(a_curr)::Int64
│ %49 = Base.getproperty(self, :a)::Vector{Int64}
│ %50 = Main.iterate(%49, a_state)::Union{Nothing, Tuple{Int64, Int64}}
│ %51 = Core.tuple(%50, b_result)::Tuple{Union{Nothing, Tuple{Int64, Int64}}, Tuple{Int64, Int64}}
│ %52 = Core.tuple(%48, %51)::Tuple{Int64, Tuple{Union{Nothing, Tuple{Int64, Int64}}, Tuple{Int64, Int64}}}
└── return %52
6 ┄ %54 = ($(Expr(:static_parameter, 1)))(b_curr)::Int64
│ %55 = a_result::Tuple{Int64, Int64}
│ %56 = Base.getproperty(self, :b)::Vector{Int64}
│ %57 = Main.iterate(%56, b_state)::Union{Nothing, Tuple{Int64, Int64}}
│ %58 = Core.tuple(%55, %57)::Tuple{Tuple{Int64, Int64}, Union{Nothing, Tuple{Int64, Int64}}}
│ %59 = Core.tuple(%54, %58)::Tuple{Int64, Tuple{Tuple{Int64, Int64}, Union{Nothing, Tuple{Int64, Int64}}}}
└── return %59
and you see that there are too many types of return value, so Julia gives up specializing them (and just assumes the second element of return type is Any).
How to fix the problem?
Answer: reduce the number of return type options of iterate.
Here is a quick write-up. I do not claim it is the most terse, and I have not tested it extensively, so there might be some bug, but it was simple enough to write quickly using your code to show how one could approach your problem. Note that I use special branches for when one of the collections is empty, as then it should be faster to just iterate the other collection:
struct MergeSorted{T,A,B,O,F1,F2}
    a::A
    b::B
    order::O
    fa::F1
    fb::F2
    function MergeSorted(a::A, b::B, order::O=Base.Order.Forward) where {A,B,O}
        fa, fb = iterate(a), iterate(b)
        F1 = typeof(fa)
        F2 = typeof(fb)
        new{promote_type(eltype(A),eltype(B)),A,B,O,F1,F2}(a, b, order, fa, fb)
    end
end

Base.eltype(::Type{<:MergeSorted{T}}) where {T} = T

struct State{Ta, Tb}
    a::Union{Nothing, Ta}
    b::Union{Nothing, Tb}
end

function Base.iterate(self::MergeSorted{T,A,B,O,Nothing,Nothing}) where {T,A,B,O}
    return nothing
end

function Base.iterate(self::MergeSorted{T,A,B,O,F1,Nothing}) where {T,A,B,O,F1}
    return self.fa
end

function Base.iterate(self::MergeSorted{T,A,B,O,F1,Nothing}, state) where {T,A,B,O,F1}
    return iterate(self.a, state)
end

function Base.iterate(self::MergeSorted{T,A,B,O,Nothing,F2}) where {T,A,B,O,F2}
    return self.fb
end

function Base.iterate(self::MergeSorted{T,A,B,O,Nothing,F2}, state) where {T,A,B,O,F2}
    return iterate(self.b, state)
end

@inline function Base.iterate(self::MergeSorted{T,A,B,O,F1,F2}) where {T,A,B,O,F1,F2}
    a_result, b_result = self.fa, self.fb
    return iterate(self, State{F1,F2}(a_result, b_result))
end

@inline function Base.iterate(self::MergeSorted{T,A,B,O,F1,F2},
                              state::State{F1,F2}) where {T,A,B,O,F1,F2}
    a_result, b_result = state.a, state.b
    if b_result === nothing
        a_result === nothing && return nothing
        a_curr, a_state = a_result
        return T(a_curr), State{F1,F2}(iterate(self.a, a_state), b_result)
    end
    b_curr, b_state = b_result
    if a_result !== nothing
        a_curr, a_state = a_result
        Base.Order.lt(self.order, a_curr, b_curr) &&
            return T(a_curr), State{F1,F2}(iterate(self.a, a_state), b_result)
    end
    return T(b_curr), State{F1,F2}(a_result, iterate(self.b, b_state))
end
And now you have:
julia> x = MergeSorted([1,4,5,9,32,44], [0,7,9,24,134]);
julia> sum(x)
269
julia> @allocated sum(x)
0
julia> @code_warntype iterate(x, iterate(x)[2])
Variables
#self#::Core.Const(iterate)
self::MergeSorted{Int64, Vector{Int64}, Vector{Int64}, Base.Order.ForwardOrdering, Tuple{Int64, Int64}, Tuple{Int64, Int64}}
state::State{Tuple{Int64, Int64}, Tuple{Int64, Int64}}
#_4::Int64
#_5::Int64
#_6::Int64
b_state::Int64
b_curr::Int64
a_state::Int64
a_curr::Int64
b_result::Union{Nothing, Tuple{Int64, Int64}}
a_result::Union{Nothing, Tuple{Int64, Int64}}
Body::Union{Nothing, Tuple{Int64, State{Tuple{Int64, Int64}, Tuple{Int64, Int64}}}}
1 ─ nothing
│ Core.NewvarNode(:(#_4))
│ Core.NewvarNode(:(#_5))
│ Core.NewvarNode(:(#_6))
│ Core.NewvarNode(:(b_state))
│ Core.NewvarNode(:(b_curr))
│ Core.NewvarNode(:(a_state))
│ Core.NewvarNode(:(a_curr))
│ %9 = Base.getproperty(state, :a)::Union{Nothing, Tuple{Int64, Int64}}
│ %10 = Base.getproperty(state, :b)::Union{Nothing, Tuple{Int64, Int64}}
│ (a_result = %9)
│ (b_result = %10)
│ %13 = (b_result === Main.nothing)::Bool
└── goto #5 if not %13
2 ─ %15 = (a_result === Main.nothing)::Bool
└── goto #4 if not %15
3 ─ return Main.nothing
4 ─ %18 = Base.indexed_iterate(a_result::Tuple{Int64, Int64}, 1)::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(2)])
│ (a_curr = Core.getfield(%18, 1))
│ (#_6 = Core.getfield(%18, 2))
│ %21 = Base.indexed_iterate(a_result::Tuple{Int64, Int64}, 2, #_6::Core.Const(2))::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(3)])
│ (a_state = Core.getfield(%21, 1))
│ %23 = ($(Expr(:static_parameter, 1)))(a_curr)::Int64
│ %24 = Core.apply_type(Main.State, $(Expr(:static_parameter, 5)), $(Expr(:static_parameter, 6)))::Core.Const(State{Tuple{Int64, Int64}, Tuple{Int64, Int64}})
│ %25 = Base.getproperty(self, :a)::Vector{Int64}
│ %26 = Main.iterate(%25, a_state)::Union{Nothing, Tuple{Int64, Int64}}
│ %27 = (%24)(%26, b_result::Core.Const(nothing))::State{Tuple{Int64, Int64}, Tuple{Int64, Int64}}
│ %28 = Core.tuple(%23, %27)::Tuple{Int64, State{Tuple{Int64, Int64}, Tuple{Int64, Int64}}}
└── return %28
5 ─ %30 = Base.indexed_iterate(b_result::Tuple{Int64, Int64}, 1)::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(2)])
│ (b_curr = Core.getfield(%30, 1))
│ (#_5 = Core.getfield(%30, 2))
│ %33 = Base.indexed_iterate(b_result::Tuple{Int64, Int64}, 2, #_5::Core.Const(2))::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(3)])
│ (b_state = Core.getfield(%33, 1))
│ %35 = (a_result !== Main.nothing)::Bool
└── goto #8 if not %35
6 ─ %37 = Base.indexed_iterate(a_result::Tuple{Int64, Int64}, 1)::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(2)])
│ (a_curr = Core.getfield(%37, 1))
│ (#_4 = Core.getfield(%37, 2))
│ %40 = Base.indexed_iterate(a_result::Tuple{Int64, Int64}, 2, #_4::Core.Const(2))::Core.PartialStruct(Tuple{Int64, Int64}, Any[Int64, Core.Const(3)])
│ (a_state = Core.getfield(%40, 1))
│ %42 = Base.Order::Core.Const(Base.Order)
│ %43 = Base.getproperty(%42, :lt)::Core.Const(Base.Order.lt)
│ %44 = Base.getproperty(self, :order)::Core.Const(Base.Order.ForwardOrdering())
│ %45 = a_curr::Int64
│ %46 = (%43)(%44, %45, b_curr)::Bool
└── goto #8 if not %46
7 ─ %48 = ($(Expr(:static_parameter, 1)))(a_curr)::Int64
│ %49 = Core.apply_type(Main.State, $(Expr(:static_parameter, 5)), $(Expr(:static_parameter, 6)))::Core.Const(State{Tuple{Int64, Int64}, Tuple{Int64, Int64}})
│ %50 = Base.getproperty(self, :a)::Vector{Int64}
│ %51 = Main.iterate(%50, a_state)::Union{Nothing, Tuple{Int64, Int64}}
│ %52 = (%49)(%51, b_result::Tuple{Int64, Int64})::State{Tuple{Int64, Int64}, Tuple{Int64, Int64}}
│ %53 = Core.tuple(%48, %52)::Tuple{Int64, State{Tuple{Int64, Int64}, Tuple{Int64, Int64}}}
└── return %53
8 ┄ %55 = ($(Expr(:static_parameter, 1)))(b_curr)::Int64
│ %56 = Core.apply_type(Main.State, $(Expr(:static_parameter, 5)), $(Expr(:static_parameter, 6)))::Core.Const(State{Tuple{Int64, Int64}, Tuple{Int64, Int64}})
│ %57 = a_result::Union{Nothing, Tuple{Int64, Int64}}
│ %58 = Base.getproperty(self, :b)::Vector{Int64}
│ %59 = Main.iterate(%58, b_state)::Union{Nothing, Tuple{Int64, Int64}}
│ %60 = (%56)(%57, %59)::State{Tuple{Int64, Int64}, Tuple{Int64, Int64}}
│ %61 = Core.tuple(%55, %60)::Tuple{Int64, State{Tuple{Int64, Int64}, Tuple{Int64, Int64}}}
└── return %61
EDIT: I have now realized that my implementation is not fully correct, as it assumes that the return value of iterate, when it is not nothing, is type-stable (which it does not have to be). But if it is not type-stable, then the compiler must allocate. So a fully correct solution would first check whether iterate is type-stable: if it is, use my solution, and if it is not, use e.g. your solution.

Matching sets of integer literals

I am searching for a fast way to check if an int is included in a constant, sparse set.
Consider a unicode whitespace function:
let white_space x = x = 0x0009 or x = 0x000A or x = 0x000B or x = 0x000C or x = 0x000D or x = 0x0020 or x = 0x0085 or x = 0x00A0 or x = 0x1680
or x = 0x2000 or x = 0x2001 or x = 0x2002 or x = 0x2003 or x = 0x2004 or x = 0x2005 or x = 0x2006 or x = 0x2007 or x = 0x2008
or x = 0x2009 or x = 0x200A or x = 0x2028 or x = 0x2029 or x = 0x202F or x = 0x205F or x = 0x3000
What ocamlopt generates looks like this:
.L162:
cmpq $19, %rax
jne .L161
movq $3, %rax
ret
.align 4
.L161:
cmpq $21, %rax
jne .L160
movq $3, %rax
ret
.align 4
.L160:
cmpq $23, %rax
jne .L159
movq $3, %rax
ret
.align 4
...
I microbenchmarked this code using the following benchmark:
let white_space x = x = 0x0009 || x = 0x000A || x = 0x000B || x = 0x000C || x = 0x000D || x = 0x0020 || x = 0x0085 || x = 0x00A0 || x = 0x1680
|| x = 0x2000 || x = 0x2001 || x = 0x2002 || x = 0x2003 || x = 0x2004 || x = 0x2005 || x = 0x2006 || x = 0x2007 || x = 0x2008
|| x = 0x2009 || x = 0x200A || x = 0x2028 || x = 0x2029 || x = 0x202F || x = 0x205F || x = 0x3000
open Core.Std
open Core_bench.Std
let ws = [| 0x0009 ;0x000A ;0x000B ;0x000C ;0x000D ;0x0020 ;0x0085 ;0x00A0 ;0x1680
;0x2000 ;0x2001 ;0x2002 ;0x2003 ;0x2004 ;0x2005 ;0x2006 ;0x2007 ;0x2008
;0x2009 ;0x200A ;0x2028 ;0x2029 ;0x202F ;0x205F ;0x3000 |]
let rec range a b =
if a >= b then []
else a :: range (a + 1) b
let bench_space n =
Bench.Test.create (fun() -> ignore ( white_space ws.(n) ) ) ~name:(Printf.sprintf "checking whitespace (%x)" (n))
let tests : Bench.Test.t list =
List.map (range 0 (Array.length ws)) bench_space
let () =
tests
|> Bench.make_command
|> Command.run
The benchmark yields:
Estimated testing time 2.5s (25 benchmarks x 100ms). Change using -quota SECS.
┌──────────────────────────┬──────────┬────────────┐
│ Name │ Time/Run │ Percentage │
├──────────────────────────┼──────────┼────────────┤
│ checking whitespace (0) │ 4.05ns │ 18.79% │
│ checking whitespace (1) │ 4.32ns │ 20.06% │
│ checking whitespace (2) │ 5.40ns │ 25.07% │
│ checking whitespace (3) │ 6.63ns │ 30.81% │
│ checking whitespace (4) │ 6.83ns │ 31.71% │
│ checking whitespace (5) │ 8.13ns │ 37.77% │
│ checking whitespace (6) │ 8.28ns │ 38.46% │
│ checking whitespace (7) │ 8.98ns │ 41.72% │
│ checking whitespace (8) │ 10.08ns │ 46.81% │
│ checking whitespace (9) │ 10.43ns │ 48.44% │
│ checking whitespace (a) │ 11.49ns │ 53.38% │
│ checking whitespace (b) │ 12.71ns │ 59.04% │
│ checking whitespace (c) │ 12.94ns │ 60.08% │
│ checking whitespace (d) │ 14.03ns │ 65.16% │
│ checking whitespace (e) │ 14.38ns │ 66.77% │
│ checking whitespace (f) │ 15.09ns │ 70.06% │
│ checking whitespace (10) │ 16.15ns │ 75.00% │
│ checking whitespace (11) │ 16.67ns │ 77.43% │
│ checking whitespace (12) │ 17.59ns │ 81.69% │
│ checking whitespace (13) │ 18.66ns │ 86.68% │
│ checking whitespace (14) │ 19.02ns │ 88.35% │
│ checking whitespace (15) │ 20.10ns │ 93.36% │
│ checking whitespace (16) │ 20.49ns │ 95.16% │
│ checking whitespace (17) │ 21.42ns │ 99.50% │
│ checking whitespace (18) │ 21.53ns │ 100.00% │
└──────────────────────────┴──────────┴────────────┘
So I am basically limited to around 100 MB/s, which is not too bad, but still around an order of magnitude slower than the lexers of e.g. gcc. Since OCaml is a "you get what you ask for" language, I guess I cannot expect the compiler to optimize this, but is there a general technique that allows one to improve this?
This is shorter and seems closer to constant time:
let white_space2 = function
| 0x0009 | 0x000A | 0x000B | 0x000C | 0x000D | 0x0020 | 0x0085 | 0x00A0 | 0x1680
| 0x2000 | 0x2001 | 0x2002 | 0x2003 | 0x2004 | 0x2005 | 0x2006 | 0x2007 | 0x2008
| 0x2009 | 0x200A | 0x2028 | 0x2029 | 0x202F | 0x205F | 0x3000 -> true
| _ -> false
Gives:
┌──────────────────────────┬──────────┬────────────┐
│ Name │ Time/Run │ Percentage │
├──────────────────────────┼──────────┼────────────┤
│ checking whitespace (0) │ 5.98ns │ 99.76% │
│ checking whitespace (1) │ 5.98ns │ 99.76% │
│ checking whitespace (2) │ 5.98ns │ 99.77% │
│ checking whitespace (3) │ 5.98ns │ 99.78% │
│ checking whitespace (4) │ 6.00ns │ 100.00% │
│ checking whitespace (5) │ 5.44ns │ 90.69% │
│ checking whitespace (6) │ 4.89ns │ 81.62% │
│ checking whitespace (7) │ 4.89ns │ 81.62% │
│ checking whitespace (8) │ 4.90ns │ 81.63% │
│ checking whitespace (9) │ 5.44ns │ 90.68% │
│ checking whitespace (a) │ 5.44ns │ 90.70% │
│ checking whitespace (b) │ 5.44ns │ 90.67% │
│ checking whitespace (c) │ 5.44ns │ 90.67% │
│ checking whitespace (d) │ 5.44ns │ 90.69% │
│ checking whitespace (e) │ 5.44ns │ 90.69% │
│ checking whitespace (f) │ 5.44ns │ 90.69% │
│ checking whitespace (10) │ 5.44ns │ 90.73% │
│ checking whitespace (11) │ 5.44ns │ 90.69% │
│ checking whitespace (12) │ 5.44ns │ 90.71% │
│ checking whitespace (13) │ 5.44ns │ 90.69% │
│ checking whitespace (14) │ 4.90ns │ 81.67% │
│ checking whitespace (15) │ 4.89ns │ 81.61% │
│ checking whitespace (16) │ 4.62ns │ 77.08% │
│ checking whitespace (17) │ 5.17ns │ 86.14% │
│ checking whitespace (18) │ 4.62ns │ 77.09% │
└──────────────────────────┴──────────┴────────────┘
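Another general technique, independent of any compiler smarts, is to replace the comparison chain with a precomputed membership table: one bounded lookup runs in constant time regardless of which value you probe. Sketched below in Java for illustration (class and method names are mine; the idea ports directly to OCaml with a bool array or Bytes.t table):

```java
import java.util.BitSet;

public class WhiteSpaceLookup {
    // Precomputed membership table: one bit per code point up to the
    // largest whitespace value (0x3000). Built once at class load.
    private static final BitSet WS = new BitSet(0x3001);
    static {
        int[] ws = {0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020, 0x0085,
                    0x00A0, 0x1680, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004,
                    0x2005, 0x2006, 0x2007, 0x2008, 0x2009, 0x200A, 0x2028,
                    0x2029, 0x202F, 0x205F, 0x3000};
        for (int c : ws) WS.set(c);
    }

    // One range check plus a shift-and-mask inside BitSet.get,
    // instead of up to 25 sequential comparisons.
    static boolean whiteSpace(int x) {
        return x >= 0 && x <= 0x3000 && WS.get(x);
    }

    public static void main(String[] args) {
        System.out.println(whiteSpace(' '));  // prints true
        System.out.println(whiteSpace('A'));  // prints false
    }
}
```

The table costs about 1.5 KiB of bits here; shrinking it (e.g. a 256-entry table for the low range plus a short comparison for the sparse high values) trades a little code for better cache behavior.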
