Do the Adams and Mullapudi autoschedulers support specializations? - halide

The Adams and Mullapudi autoschedulers often generate schedules that include the vectorize and split primitives with constant factors. These schedules do not work for all array sizes passed to the compiled Halide library, as the example below shows.
In the example below, try an array size of 3 (fails with an out-of-bounds error) and 4 (passes).
Then, in the Generator class, comment out the vectorize schedule and uncomment the split one, and try array sizes 7 (fails with an out-of-bounds error) and 8 (passes).
Notice that if the array size is not compatible with the split/vectorize factor, the access can go out of bounds.
If the Mullapudi and Adams autoschedulers added specializations to the generated schedule to filter out incompatible sizes, this problem would not happen. Parameterizing the split/vectorize factor might also work, but that may not be a good option.
Do the Mullapudi or Adams autoschedulers support specialization for cases like this, or is there a plan to support it?
SchBugGen.cpp file:
#include "Halide.h"
#include <stdio.h>
using namespace Halide;
class SchBugGen : public Halide::Generator <SchBugGen> {
public:
Input<Buffer<double>> aIn1{"aIn1", 1};
Output<Buffer<double>> aOut1{"aOut1", 1};
void generate() {
aOut1(d1) = aIn1(d1) * 2;
}
void schedule() {
Var d2("d2");
// Default schedule
aOut1.vectorize(d1, 4);
// aOut1.split(d1, d1, d2, 8);
}
private:
Var d1{"d1"};
};
HALIDE_REGISTER_GENERATOR(SchBugGen, SchBugGenerator)
bugRepro.cpp file:
#include <stdio.h>
#include <stdlib.h>
#include "schBugFun.h"
#include "HalideBuffer.h"
void printOut(double aOut1[], int aLen) {
printf("Out = {");
for (int i = 0; i < aLen; i++) {
printf("%0.0lf ", aOut1[i]);
}
printf("}\n");
}
void initArrs(double aIn1[], int aIn1Size) {
for (int i = 0; i < aIn1Size; i++) {
aIn1[i] = 10;
}
}
int main() {
// For vectorization of size 4 try fl = 3 and 4. The former asserts, the latter does not.
// For split of size 8 try fl = 7 and 8. The former asserts, the latter does not.
const int fl = 3;
double in1[fl];
double out1[fl] = {};
initArrs(in1, fl);
Halide::Runtime::Buffer<const double> inHBuff(in1, fl);
Halide::Runtime::Buffer<double> outHBuff(out1, fl);
schBugFun(inHBuff, outHBuff);
printOut(out1, fl);
return 0;
}
// Use these commands to compile the code above:
Do this only once:
set PATH=<HALIDE_BIN_PATH>:$PATH
set LD_LIBRARY_PATH=<HALIDE_BIN_PATH>
Compile Halide generator class:
g++ -std=c++17 -g -I <HALIDE_INCLUDE_PATH> -L <HALIDE_BIN_PATH> -lHalide -lpthread -ldl -rdynamic -fno-rtti -Wl,-rpath,<HALIDE_BIN_PATH> SchBugGen.cpp <HALIDE_INCLUDE_PATH>/GenGen.cpp -o schBugLibGen
Create Halide library by running compiled generator without schedule:
./schBugLibGen -f schBugFun -g SchBugGenerator -e static_library,h,assembly,bitcode,cpp,html,cpp_stub,stmt,o,schedule target=host auto_schedule=false -o .
Compile test harness:
g++ -std=c++17 schBugFun.o -I <HALIDE_INCLUDE_PATH> -L <HALIDE_BIN_PATH> -lHalide -lpthread -ldl -rdynamic -fno-rtti -Wl,-rpath,<HALIDE_BIN_PATH> -O3 -g bugRepro.cpp -o out
Run the program:
./out
Thanks,
Ivan

This issue was also captured here: https://github.com/halide/Halide/issues/3104
And is expected to be addressed here: https://github.com/halide/Halide/issues/6847
Note in issue 6847 these two points:
• There must be a way to ensure that schedules are resilient to varying bounds; it's currently common to get a scheduler that will work for the "estimated" size, but will OOB on smaller/etc sizes. This is unacceptable for production work. (Adams2019 autoscheduler can produce schedules that aren't bounds-resilient #5070, Autoscheduled code doesn't work on buffers smaller than estimates #3953, Adams2019 autoscheduler generates incorrect code #4512)
• Consider whether/how to add support for specialize() to the autoscheduler. (Specializing the auto-schedule #3104)
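To make the failure mode concrete, here is a plain C++ sketch of a loop split by a constant factor. It is not Halide code and not what either autoscheduler emits: it is a simplified model that assumes the outer trip count is rounded up, and the names count_out_of_bounds, times_two_guarded and the guard flag are hypothetical. Without a guard, an extent of 3 with a factor of 4 touches index 3; with a per-element guard the tail is skipped.

```cpp
#include <cassert>
#include <vector>

// Simplified model (an assumption, not Halide's generated code): a loop over
// [0, extent) split by a constant factor, with the outer trip count rounded up.
// Returns how many indices >= extent the loop nest would touch.
int count_out_of_bounds(int extent, int factor, bool guard) {
    int outer_trips = (extent + factor - 1) / factor;
    int oob = 0;
    for (int outer = 0; outer < outer_trips; ++outer) {
        for (int inner = 0; inner < factor; ++inner) {
            int i = outer * factor + inner;
            if (i >= extent) {
                if (guard) continue;  // guarded: skip the tail element
                ++oob;                // unguarded: this access is out of bounds
            }
        }
    }
    return oob;
}

// The question's computation (out = in * 2) with the guarded split applied.
std::vector<double> times_two_guarded(const std::vector<double>& in, int factor) {
    int extent = static_cast<int>(in.size());
    std::vector<double> out(in.size());
    int outer_trips = (extent + factor - 1) / factor;
    for (int outer = 0; outer < outer_trips; ++outer) {
        for (int inner = 0; inner < factor; ++inner) {
            int i = outer * factor + inner;
            if (i < extent) out[i] = in[i] * 2;  // guard keeps the access in range
        }
    }
    return out;
}
```

In Halide itself, split and vectorize accept an optional TailStrategy argument (for example TailStrategy::GuardWithIf), and Func::specialize takes a condition under which a separate schedule applies. Whether either removes the assertion for this particular generator is not something verified here.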

Related

Weird C library linkage issues on Mac - Segmentation Fault

I have a strange segmentation fault that doesn't exist when everything is in one .c file, but does exist when I put part of the code in a dynamically linked library and link it to a test file. The complete code for the working single-file version is at the bottom; the failing version, with two .c files and one .h file, comes first.
Here is the error system:
example.h:
#include <stdio.h>
#include <stdlib.h>
typedef struct MYARRAY {
int len;
void* items[];
} MYARRAY;
MYARRAY *collection;
void
mypush(void* p);
example.c:
#include "example.h"
void
mypush(void* p) {
printf("Here %lu\n", sizeof collection);
puts("FOO");
int len = collection->len++;
puts("BAR");
collection->items[len] = p;
}
example2.c:
This is essentially a test file:
#include "example.h"
void
test_print() {
puts("Here1");
mypush("foo");
puts("Here2");
}
int
main() {
collection = malloc(sizeof *collection + (sizeof collection->items[0] * 1000));
collection->len = 0;
puts("Start");
test_print();
puts("Done");
return 0;
}
Makefile:
I link example to example2 here, and run:
example:
#clang -I . -dynamiclib \
-undefined dynamic_lookup \
-o example.dylib example.c
#clang example2.c example.dylib -o example2.o
#./example2.o
.PHONY: example
The output is:
$ make example
Start
Here1
Here 8
FOO
make: *** [example] Segmentation fault: 11
But it should show the full output of:
$ make example
Start
Here1
Here 8
FOO
BAR
Here2
Done
The weird thing is everything works if it is this system:
example.c:
#include <stdio.h>
#include <stdlib.h>
typedef struct MYARRAY {
int len;
void* items[];
} MYARRAY;
MYARRAY *collection;
void
mypush(void* p) {
printf("Here %lu\n", sizeof collection);
puts("FOO");
int len = collection->len++;
puts("BAR");
collection->items[len] = p;
}
void
test_print() {
puts("Here1");
mypush("foo");
puts("Here");
}
int
main() {
collection = malloc(sizeof *collection + (sizeof collection->items[0] * 1000));
collection->len = 0;
puts("ASF");
test_print();
return 0;
}
Makefile:
example:
#clang -o example example.c
#./example
.PHONY: example
Wondering why it's creating a segmentation fault when it is linked like this, and what I am doing wrong.
I have checked otool and with DYLD_PRINT_LIBRARIES=YES and it shows it is importing the dynamically linked libraries, but for some reason it's segmentation faulting when linked but works fine when it isn't linked.
Your problem is this, in example.h:
MYARRAY *collection;
Since both example2.c (the file containing main()) and example.c include this file, you end up defining collection twice, which results in undefined behavior. You need to make sure you define each object only once. The details are relatively unimportant since anything can happen with undefined behavior, but what's probably happening is that example2.c is allocating memory for one object, while the one example.c is using is still NULL. As mentioned in the comments, since you define collection in example2.c, your linker is able to build the executable without needing to look for that symbol in the dynamic library, so you don't get a link-time warning about it being defined there too, and obviously there'd be no cause for a warning at the time you compile the library.
It works for you when you put everything in one file because then you're no longer defining anything twice. The error itself has nothing to do with the fact that you're using a dynamic library, although that may have made it harder to detect.
It would be better to define this in example.c and provide a constructor function; there's no need for main() to be able to access it directly. But if you must do this, then define it in example.c and just declare an extern identifier in the header file to tell example2.c that the object is defined somewhere else.
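The extern approach can be sketched in a single file, with comments marking which part would live in which file. This is an illustration rather than a drop-in replacement for the question's code: it is written as C++, the fixed-size items array replaces the question's flexible array member to keep the sketch standard, and the push_one_and_count helper is hypothetical.

```cpp
#include <cassert>

// ---- would live in example.h: declarations only, no storage is created ----
typedef struct MYARRAY {
    int len;
    void* items[1000];  // assumption: fixed capacity instead of items[]
} MYARRAY;

extern MYARRAY* collection;  // declaration: "defined somewhere else"
void mypush(void* p);

// ---- would live in example.c: the one and only definition ----
MYARRAY* collection = nullptr;

void mypush(void* p) {
    int len = collection->len++;
    collection->items[len] = p;
}

// ---- would live in example2.c: a client that only sees the header ----
// Hypothetical helper: allocates the collection on first use, pushes one
// item, and returns the resulting length.
int push_one_and_count() {
    if (collection == nullptr) {
        collection = new MYARRAY();  // value-initialized, so len starts at 0
    }
    static char item[] = "foo";
    mypush(item);
    return collection->len;
}
```

Because the header only declares collection, every file that includes it refers to the same object, which is the one defined in example.c.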

Strange behavior with gcc and inline

I want to define an inline function in a header file (.h) which can be included by numerous source files (.c). Here is a minimal example with 1 header and 2 source files:
Header file foo.h
int ifunc(int i);
extern inline
int
ifunc(int i)
{
return i + 1;
}
Source code file: foo.c
#include <stdio.h>
#include "foo.h"
int foo2(int i);
int main()
{
printf("%d\n", foo2(1));
return 0;
}
Source code file foo2.c
#include "foo.h"
int foo2(int i)
{
return ifunc(i);
}
The problem
When I compile with optimization,
gcc -g -Wall -O2 -o foo foo.c foo2.c
$ ./foo
2
everything works fine. However when I turn off optimization, I get this error:
gcc -g -Wall -o foo foo.c foo2.c
/tmp/cc3OrhO9.o: In function `foo2':
foo2.c:5: undefined reference to `ifunc'
Can someone please explain how to fix so that I can run the code with and without -O2? I am using gcc 4.8.5.
If you replace the contents of foo.h with
static inline int ifunc(int i)
{
return i + 1;
}
both builds will work.
Declaring it extern inline tells the compiler that an out-of-line definition exists somewhere else, which in your original example is not the case. The optimized build does not flag an error because the call has already been inlined, but the non-optimized build emits a real call and finds no definition in any of the .o files (since they were all compiled with ifunc declared extern inline in foo.h).
Declaring it static inline makes it local to each file (the downside being that if the compiler does not inline it, each .o that needs it ends up with its own local copy, so don't overdo it).

Why does ThreadSanitizer report a race with this lock-free example?

I've boiled this down to a simple self-contained example. The main thread enqueues 1000 items, and a worker thread tries to dequeue concurrently. ThreadSanitizer complains that there's a race between the read and the write of one of the elements, even though there is an acquire-release memory barrier sequence protecting them.
#include <atomic>
#include <thread>
#include <cassert>
struct FakeQueue
{
int items[1000];
std::atomic<int> m_enqueueIndex;
int m_dequeueIndex;
FakeQueue() : m_enqueueIndex(0), m_dequeueIndex(0) { }
void enqueue(int x)
{
auto tail = m_enqueueIndex.load(std::memory_order_relaxed);
items[tail] = x; // <- element written
m_enqueueIndex.store(tail + 1, std::memory_order_release);
}
bool try_dequeue(int& x)
{
auto tail = m_enqueueIndex.load(std::memory_order_acquire);
assert(tail >= m_dequeueIndex);
if (tail == m_dequeueIndex)
return false;
x = items[m_dequeueIndex]; // <- element read -- tsan says race!
++m_dequeueIndex;
return true;
}
};
FakeQueue q;
int main()
{
std::thread th([&]() {
int x;
for (int i = 0; i != 1000; ++i)
q.try_dequeue(x);
});
for (int i = 0; i != 1000; ++i)
q.enqueue(i);
th.join();
}
ThreadSanitizer output:
==================
WARNING: ThreadSanitizer: data race (pid=17220)
Read of size 4 at 0x0000006051c0 by thread T1:
#0 FakeQueue::try_dequeue(int&) /home/cameron/projects/concurrentqueue/tests/tsan/issue49.cpp:26 (issue49+0x000000402bcd)
#1 main::{lambda()#1}::operator()() const <null> (issue49+0x000000401132)
#2 _M_invoke<> /usr/include/c++/5.3.1/functional:1531 (issue49+0x0000004025e3)
#3 operator() /usr/include/c++/5.3.1/functional:1520 (issue49+0x0000004024ed)
#4 _M_run /usr/include/c++/5.3.1/thread:115 (issue49+0x00000040244d)
#5 <null> <null> (libstdc++.so.6+0x0000000b8f2f)
Previous write of size 4 at 0x0000006051c0 by main thread:
#0 FakeQueue::enqueue(int) /home/cameron/projects/concurrentqueue/tests/tsan/issue49.cpp:16 (issue49+0x000000402a90)
#1 main /home/cameron/projects/concurrentqueue/tests/tsan/issue49.cpp:44 (issue49+0x000000401187)
Location is global 'q' of size 4008 at 0x0000006051c0 (issue49+0x0000006051c0)
Thread T1 (tid=17222, running) created by main thread at:
#0 pthread_create <null> (libtsan.so.0+0x000000027a67)
#1 std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) <null> (libstdc++.so.6+0x0000000b9072)
#2 main /home/cameron/projects/concurrentqueue/tests/tsan/issue49.cpp:41 (issue49+0x000000401168)
SUMMARY: ThreadSanitizer: data race /home/cameron/projects/concurrentqueue/tests/tsan/issue49.cpp:26 FakeQueue::try_dequeue(int&)
==================
ThreadSanitizer: reported 1 warnings
Command line:
g++ -std=c++11 -O0 -g -fsanitize=thread issue49.cpp -o issue49 -pthread
g++ version: 5.3.1
Can anybody shed some light onto why tsan thinks this is a data race?
UPDATE
It seems like this is a false positive. To appease ThreadSanitizer, I've added annotations (see here for the supported ones and here for an example). Note that detecting whether tsan is enabled in GCC via a macro has only recently been added, so I had to manually pass -D__SANITIZE_THREAD__ to g++ for now.
#if defined(__SANITIZE_THREAD__)
#define TSAN_ENABLED
#elif defined(__has_feature)
#if __has_feature(thread_sanitizer)
#define TSAN_ENABLED
#endif
#endif
#ifdef TSAN_ENABLED
#define TSAN_ANNOTATE_HAPPENS_BEFORE(addr) \
AnnotateHappensBefore(__FILE__, __LINE__, (void*)(addr))
#define TSAN_ANNOTATE_HAPPENS_AFTER(addr) \
AnnotateHappensAfter(__FILE__, __LINE__, (void*)(addr))
extern "C" void AnnotateHappensBefore(const char* f, int l, void* addr);
extern "C" void AnnotateHappensAfter(const char* f, int l, void* addr);
#else
#define TSAN_ANNOTATE_HAPPENS_BEFORE(addr)
#define TSAN_ANNOTATE_HAPPENS_AFTER(addr)
#endif
struct FakeQueue
{
int items[1000];
std::atomic<int> m_enqueueIndex;
int m_dequeueIndex;
FakeQueue() : m_enqueueIndex(0), m_dequeueIndex(0) { }
void enqueue(int x)
{
auto tail = m_enqueueIndex.load(std::memory_order_relaxed);
items[tail] = x;
TSAN_ANNOTATE_HAPPENS_BEFORE(&items[tail]);
m_enqueueIndex.store(tail + 1, std::memory_order_release);
}
bool try_dequeue(int& x)
{
auto tail = m_enqueueIndex.load(std::memory_order_acquire);
assert(tail >= m_dequeueIndex);
if (tail == m_dequeueIndex)
return false;
TSAN_ANNOTATE_HAPPENS_AFTER(&items[m_dequeueIndex]);
x = items[m_dequeueIndex];
++m_dequeueIndex;
return true;
}
};
// main() is as before
Now ThreadSanitizer is happy at runtime.
This looks like https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78158. Disassembling the binary produced by GCC shows that it doesn't instrument the atomic operations on O0.
As a workaround, you can either build your code with GCC with -O1/-O2, or get yourself a fresh Clang build and use it to run ThreadSanitizer (this is the recommended way, as TSan is being developed as part of Clang and only backported to GCC).
The comments above are invalid: TSan can easily comprehend the happens-before relation between the atomics in your code (one can check that by running the above reproducer under TSan in Clang).
I also wouldn't recommend using the AnnotateHappensBefore()/AnnotateHappensAfter() for two reasons:
• you shouldn't need them in most cases; they denote that the code is doing something really complex (in which case you may want to double-check you're doing it right);
• if you make an error in your lock-free code, spraying it with annotations may mask that error, so that TSan won't notice it.
The ThreadSanitizer is not good at counting, it cannot understand that writes to the items always happen before the reads.
The ThreadSanitizer can find that the stores of m_enqueueIndex happen before the loads, but it does not understand that the store to items[m_dequeueIndex] must happen before the load when tail > m_dequeueIndex.

Not able to use srand48() after changing to c++ 11

Why am I not able to compile my code to c++ 11 and use the srand48 function?
I have a program where I play around with some matrices.
The problem appears when I compile the code with the -std=c++0x flag.
I want to use some C++11-only features, and this is my approach to doing so.
It compiles without any problems if I do not specify the c++ version. Like this:
g++ -O2 -Wall test.cpp -o test -g
Please correct me if I have misunderstood what the mentioned flag does.
I run my code on a Windows 7 64-bit machine and compile through cygwin. I use g++ version 4.5.3 (GCC). Please comment if more information is required.
For some reason (unknown even to myself), all my code is written in one compilation unit.
If the error is caused by a structural error then you should also feel free to point it out. :)
I receive the following errors:
g++ -std=c++0x -O2 -Wall test.cpp -o test -g
test.cpp: In function ‘void gen_mat(T*, size_t)’:
test.cpp:28:16: error: there are no arguments to ‘srand48’ that depend on a template parameter, so a declaration of ‘srand48’ must be available
test.cpp:28:16: note: (if you use ‘-fpermissive’, G++ will accept your code, but allowing the use of an undeclared name is deprecated)
test.cpp:33:28: error: there are no arguments to ‘drand48’ that depend on a template parameter, so a declaration of ‘drand48’ must be available
Here is a subset of my code that generates the errors shown above.
#include <iostream>
#include <cstdlib>
#include <cassert>
#include <cstring>
#include <limits.h>
#include <math.h>
#define RANGE(S) (S)
// Precision for checking identity.
#define PRECISION 1e-10
using namespace std;
template <typename T>
void gen_mat(T *a, size_t dim)
{
srand48(dim);
for(size_t i = 0; i < dim; ++i)
{
for(size_t j = 0; j < dim; ++j)
{
T z = (drand48() - 0.5)*RANGE(dim);
a[i*dim+j] = (z < 10*PRECISION && z > -10*PRECISION) ? 0.0 : z;
}
}
}
int main(int argc, char *argv[])
{
}
Regards Kim.
This is the solution that solved the problem for me:
First, n.m. explained that srand48() cannot be used when compiling with -std=c++0x.
The correct flag to use is -std=gnu++11; however, it requires g++ version 4.7+.
Therefore, the solution for me was to compile my code with -std=gnu++0x.
The compile command = g++ -O2 -Wall test.cpp -o test -g -std=gnu++0x
If you explicitly set -std=c++03 you will get the same error. This is because drand48 and friends are not actually a part of any C++ standard. gcc includes these functions as an extension, and disables them if standard behaviour is requested.
The default standard mode of g++ is actually -std=gnu++03. You may want to use -std=gnu++11 instead of -std=c++0x, or pass -U__STRICT_ANSI__ to the compiler.
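If relying on the extension is undesirable, one standard-only alternative (not proposed in the answers above, and named here as a swapped-in technique) is the C++11 <random> header. The sketch below keeps the shape of the question's gen_mat but replaces srand48/drand48 with std::mt19937 and std::uniform_real_distribution. It assumes a compiler and standard library with C++11 <random> support, and it will not reproduce the same number sequence as drand48.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Precision for checking identity, as in the question.
const double PRECISION = 1e-10;

template <typename T>
void gen_mat(T* a, std::size_t dim)
{
    // Seed from dim, mirroring srand48(dim) in the original code.
    std::mt19937 gen(static_cast<unsigned>(dim));
    // Uniform doubles in [0, 1), the same range drand48() documents.
    std::uniform_real_distribution<double> dist(0.0, 1.0);

    for (std::size_t i = 0; i < dim; ++i)
    {
        for (std::size_t j = 0; j < dim; ++j)
        {
            T z = static_cast<T>((dist(gen) - 0.5) * dim);
            // Flush values very close to zero to exactly zero.
            a[i * dim + j] = (z < 10 * PRECISION && z > -10 * PRECISION) ? T(0) : z;
        }
    }
}
```

Since this uses only standard names, it should compile in strict modes such as -std=c++11 without -U__STRICT_ANSI__.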

How to find the address & length of a C++ function at runtime (MinGW)

As this is my first post to stackoverflow I want to thank you all for your valuable posts that helped me a lot in the past.
I use MinGW (gcc 4.4.0) on Windows-7(64) - more specifically I use Nokia Qt + MinGW but Qt is not involved in my Question.
I need to find the address and -more important- the length of specific functions of my application at runtime, in order to encode/decode these functions and implement a software protection system.
I already found a solution on how to compute the length of a function, by assuming that static functions placed one after each other in a source-file, it is logical to be also sequentially placed in the compiled object file and subsequently in memory.
Unfortunately this is true only if the whole CPP file is compiled with option: "g++ -O0" (optimization level = 0).
If I compile it with "g++ -O2" (which is the default for my project) the compiler seems to relocate some of the functions and as a result the computed function length seems to be both incorrect and negative(!).
This is happening even if I put a "#pragma GCC optimize 0" line in the source file,
which is supposed to be the equivalent of a "g++ -O0" command line option.
I suppose that "g++ -O2" instructs the compiler to perform some global file-level optimization (some function relocation?) which is not avoided by using the #pragma directive.
Do you have any idea how to prevent this, without having to compile the whole file with -O0 option?
OR: Do you know of any other method to find the length of a function at runtime?
I prepared a small example for you, with the results for different compilation options, to highlight the case.
The Source:
// ===================================================================
// test.cpp
//
// Intention: To find the addr and length of a function at runtime
// Problem: The application output is correct when compiled with: "g++ -O0"
// but it's erroneous when compiled with "g++ -O2"
// (although a directive "#pragma GCC optimize 0" is present)
// ===================================================================
#include <stdio.h>
#include <math.h>
#pragma GCC optimize 0
static int test_01(int p1)
{
putchar('a');
putchar('\n');
return 1;
}
static int test_02(int p1)
{
putchar('b');
putchar('b');
putchar('\n');
return 2;
}
static int test_03(int p1)
{
putchar('c');
putchar('\n');
return 3;
}
static int test_04(int p1)
{
putchar('d');
putchar('\n');
return 4;
}
// Print a HexDump of a specific address and length
void HexDump(void *startAddr, long len)
{
unsigned char *buf = (unsigned char *)startAddr;
printf("addr:%ld, len:%ld\n", (long )startAddr, len);
len = (long )fabs(len);
while (len)
{
printf("%02x.", *buf);
buf++;
len--;
}
printf("\n");
}
int main(int argc, char *argv[])
{
printf("======================\n");
long fun_len = (long )test_02 - (long )test_01;
HexDump((void *)test_01, fun_len);
printf("======================\n");
fun_len = (long )test_03 - (long )test_02;
HexDump((void *)test_02, fun_len);
printf("======================\n");
fun_len = (long )test_04 - (long )test_03;
HexDump((void *)test_03, fun_len);
printf("Test End\n");
getchar();
// Just a trick to block optimizer from eliminating test_xx() functions as unused
if (argc > 1)
{
test_01(1);
test_02(2);
test_03(3);
test_04(4);
}
}
The (correct) Output when compiled with "g++ -O0":
[note the 'c3' byte (= assembly 'ret') at the end of all functions]
======================
addr:4199344, len:37
55.89.e5.83.ec.18.c7.04.24.61.00.00.00.e8.4e.62.00.00.c7.04.24.0a.00.00.00.e8.42
.62.00.00.b8.01.00.00.00.c9.c3.
======================
addr:4199381, len:49
55.89.e5.83.ec.18.c7.04.24.62.00.00.00.e8.29.62.00.00.c7.04.24.62.00.00.00.e8.1d
.62.00.00.c7.04.24.0a.00.00.00.e8.11.62.00.00.b8.02.00.00.00.c9.c3.
======================
addr:4199430, len:37
55.89.e5.83.ec.18.c7.04.24.63.00.00.00.e8.f8.61.00.00.c7.04.24.0a.00.00.00.e8.ec
.61.00.00.b8.03.00.00.00.c9.c3.
Test End
The erroneous Output when compiled with "g++ -O2":
(a) function test_01 addr & len seem correct
(b) functions test_02, test_03 have negative lengths,
and fun. test_02 length is also incorrect.
======================
addr:4199416, len:36
83.ec.1c.c7.04.24.61.00.00.00.e8.c5.61.00.00.c7.04.24.0a.00.00.00.e8.b9.61.00.00
.b8.01.00.00.00.83.c4.1c.c3.
======================
addr:4199452, len:-72
83.ec.1c.c7.04.24.62.00.00.00.e8.a1.61.00.00.c7.04.24.62.00.00.00.e8.95.61.00.00
.c7.04.24.0a.00.00.00.e8.89.61.00.00.b8.02.00.00.00.83.c4.1c.c3.57.56.53.83.ec.2
0.8b.5c.24.34.8b.7c.24.30.89.5c.24.08.89.7c.24.04.c7.04.
======================
addr:4199380, len:-36
83.ec.1c.c7.04.24.63.00.00.00.e8.e9.61.00.00.c7.04.24.0a.00.00.00.e8.dd.61.00.00
.b8.03.00.00.00.83.c4.1c.c3.
Test End
You say that this happens even with a "#pragma GCC optimize 0" line in the source file, which is supposed to be the equivalent of a "g++ -O0" command line option.
I don't believe this is true: it is supposed to be the equivalent of attaching __attribute__((optimize(0))) to subsequently defined functions, which causes those functions to be compiled with a different optimisation level. But this does not affect what goes on at the top level, whereas the command line option does.
If you really must do horrible things that rely on top level ordering, try the -fno-toplevel-reorder option. And I suspect that it would be a good idea to add __attribute__((noinline)) to the functions in question as well.
