AddressSanitizer Crash on GCC 4.8

I've just tried out GCC 4.8's exciting new feature, AddressSanitizer.
The program
#include <iostream>

int main(int argc, const char * argv[], const char * envp[]) {
    int *x = nullptr;
    int y = *x; // null pointer dereference: raises SIGSEGV, which ASan intercepts
    std::cout << y << std::endl;
    return 0;
}
compiles fine using
g++-4.8 -std=gnu++0x -g -fsanitize=address -fno-omit-frame-pointer -Wall ~/h.cpp -o h
but when I run the program I get
ASAN:SIGSEGV
=================================================================
==7531== ERROR: AddressSanitizer crashed on unknown address 0x000000000000 (pc 0x000000400aac sp 0x7fff11ce0fd0 bp 0x7fff11ce1000 T0)
AddressSanitizer can not provide additional info.
#0 0x400aab (/home/per/h+0x400aab)
#1 0x7fc432e1b76c (/lib/x86_64-linux-gnu/libc-2.15.so+0x2176c)
Stats: 0M malloced (0M for red zones) by 0 calls
Stats: 0M realloced by 0 calls
Stats: 0M freed by 0 calls
Stats: 0M really freed by 0 calls
Stats: 0M (0 full pages) mmaped in 0 calls
mmaps by size class:
mallocs by size class:
frees by size class:
rfrees by size class:
Stats: malloc large: 0 small slow: 0
This seems like an incorrect way to report a memory error. Have I missed some compilation or link flags?

This is the intended way to report a NULL dereference.
You can run the program output through asan_symbolize.py (should be present in your GCC tree) to get symbol names and line numbers in the source file.
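For example (paths are hypothetical; the script lives under the sanitizer sources in the GCC/LLVM tree):
./h 2>&1 | python asan_symbolize.py | c++filt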

I cannot find any asan_symbolize.py in GCC 4.8 or 4.9.
I added a workaround at https://code.google.com/p/address-sanitizer/issues/detail?id=223

Related

CUDA code runs when compiled with sm_35, but fails with sm_30

The GPU device that I have is a GeForce GT 750M, which I found is compute capability 3.0. I downloaded the CUDA code found at https://github.com/fengChenHPC/word2vec_cbow. Its makefile had the flag -arch=sm_35.
Since my device is compute capability 3.0, I changed the flag to -arch=sm_30. It compiled fine, but when I run the code, I get the following error:
word2vec.cu 449 : unspecified launch failure
word2vec.cu 449 : unspecified launch failure
It shows it multiple times, because there are multiple CPU threads launching the CUDA kernel. Please note that the threads do not use different streams to launch the kernel, so the kernel launches are all in order.
Now, when I let the flag be, i.e. -arch=sm_35, then the code runs fine. Can someone please explain why the code won't run when I set the flag to match my device?
Unfortunately your conclusion that the code works when compiled for sm_35 and run on an sm_30 GPU is incorrect. The culprit is this:
void cbow_cuda(long window, long negative, float alpha, long sentence_length,
               int *sen, long layer1_size, float *syn0, long hs, float *syn1,
               float *expTable, int *vocab_codelen, char *vocab_code,
               int *vocab_point, int *table, long table_size,
               long vocab_size, float *syn1neg){
    int blockSize = 256;
    int gridSize = (sentence_length)/(blockSize/32);
    size_t smsize = (blockSize/32)*(2*layer1_size+3)*sizeof(float);
    //printf("sm size is %d\n", smsize);
    //fflush(stdout);

    cbow_kernel<1><<<gridSize, blockSize, smsize>>>
        (window, negative, alpha, sentence_length, sen,
         layer1_size, syn0, syn1, expTable, vocab_codelen,
         vocab_code, vocab_point, table, table_size,
         vocab_size, syn1neg);
}
This code performs no API error checking, so it fails silently when the kernel launch fails. And the kernel launch does fail if you build for sm_35 and run on sm_30. If you change the code of that function to this (adding kernel launch error checking):
void cbow_cuda(long window, long negative, float alpha, long sentence_length,
               int *sen, long layer1_size, float *syn0, long hs, float *syn1,
               float *expTable, int *vocab_codelen, char *vocab_code,
               int *vocab_point, int *table, long table_size,
               long vocab_size, float *syn1neg){
    int blockSize = 256;
    int gridSize = (sentence_length)/(blockSize/32);
    size_t smsize = (blockSize/32)*(2*layer1_size+3)*sizeof(float);
    //printf("sm size is %d\n", smsize);
    //fflush(stdout);

    cbow_kernel<1><<<gridSize, blockSize, smsize>>>
        (window, negative, alpha, sentence_length, sen,
         layer1_size, syn0, syn1, expTable, vocab_codelen,
         vocab_code, vocab_point, table, table_size,
         vocab_size, syn1neg);

    checkCUDAError( cudaPeekAtLastError() );
}
and compile and run it for sm_35, you should get this on an sm_30 device:
~/cbow/word2vec_cbow$ make
nvcc word2vec.cu -o word2vec -O3 -Xcompiler -march=native -w -Xptxas="-v" -arch=sm_35 -lineinfo
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_' for 'sm_35'
ptxas info : Function properties for _Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 34 registers, 448 bytes cmem[0], 8 bytes cmem[2]
~/cbow/word2vec_cbow$ ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 7 -negative 1 -hs 1 -sample 1e-3 -threads 1 -binary 1 -save-vocab voc #> out 2>&1
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
vocab size = 71290
cbow.cu 114 : invalid device function
i.e. the kernel launch failed because no appropriate device code was found in the CUDA cubin payload embedded in your application. This also answers your earlier question about why the output of this code is incorrect: the analysis kernel simply never runs on your hardware when built with the default options.
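(checkCUDAError is not shown above; it is whatever error-reporting helper the project defines. A minimal sketch of such a helper, my assumption matching the single-argument call site, would be:

#include <cstdio>
#include <cstdlib>

// Report and abort on any pending CUDA runtime error.
void checkCUDAError(cudaError_t err)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}
)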
If I build this code for sm_30 and run it on a GTX 670 with 2gb of memory (compute capability 3.0), I get this:
~/cbow/word2vec_cbow$ make
nvcc word2vec.cu -o word2vec -O3 -Xcompiler -march=native -w -Xptxas="-v" -arch=sm_30 -lineinfo
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_' for 'sm_30'
ptxas info : Function properties for _Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 34 registers, 448 bytes cmem[0], 12 bytes cmem[2]
~/cbow/word2vec_cbow$ ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 7 -negative 1 -hs 1 -sample 1e-3 -threads 1 -binary 1 -save-vocab voc #> out 2>&1
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
vocab size = 71290
Alpha: 0.000009 Progress: 100.00% Words/thread/sec: 1217.45k
i.e. the code runs correctly to completion without any errors. I can't tell you why you are not able to get the code to run on your hardware, because I cannot reproduce your error on my hardware. You will need to do some debugging on your own to find the root cause.
As this link shows, there is no GeForce GTX 750M.
Yours is either:
GeForce GTX 750 Ti
GeForce GTX 750
or
GeForce GT 750M
If yours is one of the first two, then your GPU is Maxwell-based and has compute capability 5.0.
Otherwise, your GPU is Kepler-based and has compute capability 3.0.
If you're not sure which GPU you have, first figure it out by running deviceQuery from the NVIDIA CUDA samples.
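If you don't have the samples handy, a small program using the standard runtime API (a sketch I'm adding, not part of the answer) reports the same information:

#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // prop.major and prop.minor form the compute capability, e.g. 3.0
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}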

Memory leak in gcc 4.8.1 when using thread_local?

Valgrind is reporting leaked blocks, apparently one per thread, in the following code:
#include <iostream>
#include <thread>
#include <mutex>
#include <list>
#include <chrono>

std::mutex cout_mutex;

struct Foo
{
    Foo()
    {
        std::lock_guard<std::mutex> lock( cout_mutex );
        std::cout << __PRETTY_FUNCTION__ << '\n';
    }
    ~Foo()
    {
        std::lock_guard<std::mutex> lock( cout_mutex );
        std::cout << __PRETTY_FUNCTION__ << '\n';
    }
    void
    hello_world()
    {
        std::lock_guard<std::mutex> lock( cout_mutex );
        std::cout << __PRETTY_FUNCTION__ << '\n';
    }
};

void
hello_world_thread()
{
    thread_local Foo foo;
    // must access, or the thread local variable may not be instantiated
    foo.hello_world();
    // keep the thread around momentarily
    std::this_thread::sleep_for( std::chrono::milliseconds( 100 ) );
}

int main()
{
    for ( int i = 0; i < 100; ++i )
    {
        std::list<std::thread> threads;
        for ( int j = 0; j < 10; ++j )
        {
            std::thread thread( hello_world_thread );
            threads.push_back( std::move( thread ) );
        }
        while ( ! threads.empty() )
        {
            threads.front().join();
            threads.pop_front();
        }
    }
}
Compiler version:
$ g++ --version
g++ (GCC) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
GCC build options:
--enable-shared
--enable-threads=posix
--enable-__cxa_atexit
--enable-clocale=gnu
--enable-cxx-flags='-fno-omit-frame-pointer -g3'
--enable-languages=c,c++
--enable-libstdcxx-time=rt
--enable-checking=release
--enable-build-with-cxx
--disable-werror
--disable-multilib
--disable-bootstrap
--with-system-zlib
Program compilation options:
g++ -std=gnu++11 -Og -g3 -Wall -Wextra -fno-omit-frame-pointer thread_local.cc
valgrind version:
$ valgrind --version
valgrind-3.8.1
Valgrind options:
valgrind --leak-check=full --verbose ./a.out > /dev/null
Tail-end of valgrind output:
==1786== HEAP SUMMARY:
==1786== in use at exit: 24,000 bytes in 1,000 blocks
==1786== total heap usage: 3,604 allocs, 2,604 frees, 287,616 bytes allocated
==1786==
==1786== Searching for pointers to 1,000 not-freed blocks
==1786== Checked 215,720 bytes
==1786==
==1786== 24,000 bytes in 1,000 blocks are definitely lost in loss record 1 of 1
==1786== at 0x4C29969: operator new(unsigned long, std::nothrow_t const&) (vg_replace_malloc.c:329)
==1786== by 0x4E8E53E: __cxa_thread_atexit (atexit_thread.cc:119)
==1786== by 0x401036: hello_world_thread() (thread_local.cc:34)
==1786== by 0x401416: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (functional:1732)
==1786== by 0x4EE4830: execute_native_thread_routine (thread.cc:84)
==1786== by 0x5A10E99: start_thread (pthread_create.c:308)
==1786== by 0x573DCCC: clone (clone.S:112)
==1786==
==1786== LEAK SUMMARY:
==1786== definitely lost: 24,000 bytes in 1,000 blocks
==1786== indirectly lost: 0 bytes in 0 blocks
==1786== possibly lost: 0 bytes in 0 blocks
==1786== still reachable: 0 bytes in 0 blocks
==1786== suppressed: 0 bytes in 0 blocks
==1786==
==1786== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 2 from 2)
--1786--
--1786-- used_suppression: 2 dl-hack3-cond-1
==1786==
==1786== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 2 from 2)
Constructors and destructors were run once for each thread:
$ ./a.out | grep 'Foo::Foo' | wc -l
1000
$ ./a.out | grep hello_world | wc -l
1000
$ ./a.out | grep 'Foo::~Foo' | wc -l
1000
Notes:
If you change the number of threads created, the number of leaked blocks matches the number of threads.
The code is structured in such a way that might permit resource reuse (i.e. the leaked block) if GCC were so implemented.
From the valgrind stacktrace, thread_local.cc:34 is the line: thread_local Foo foo;
Due to the sleep_for() call, a program run takes about 10 seconds or so.
Any idea if this memory leak is in GCC, a result of my config options, or is some bug in my program?
It seems that the leak comes from dynamic initialization.
Here is an example with an int:
thread_local int num = 4; //static initialization
This example does not leak. I tried it with 2 threads and saw no leak at all.
But now:
int func()
{
    return 4;
}
thread_local int num2 = func(); //dynamic initialization
This one leaks! With 2 threads it gives total heap usage: 8 allocs, 6 frees, 428 bytes allocated...
I would suggest using a workaround like:
thread_local Foo *foo = new Foo; //dynamic initialization
Don't forget, at the end of the thread's execution, to do:
delete foo;
But this workaround has one problem: what if the thread exits with an error before your delete? Leak again...
It seems that there is no great solution. Maybe we should report this to the g++ developers?
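If the per-thread object is only needed inside the thread function, a smart pointer on the thread's stack makes the delete exception-safe (a sketch I'm adding, not part of the original answer; note it gives up the shared-across-functions thread_local semantics):

#include <memory>

void hello_world_thread()
{
    // Destroyed when the function returns or unwinds, even on an early exit.
    std::unique_ptr<Foo> foo( new Foo );
    foo->hello_world();
    std::this_thread::sleep_for( std::chrono::milliseconds( 100 ) );
}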
Try removing thread_local and using the following code:
void
hello_world_thread()
{
    Foo foo;
    // a plain local: each thread constructs its own copy on its own stack
    foo.hello_world();
    // keep the thread around momentarily
    std::this_thread::sleep_for( std::chrono::milliseconds( 100 ) );
}
foo within hello_world_thread now lives on each thread's own stack, so every thread maintains its own copy of foo with no need to explicitly mark it thread_local. thread_local is for cases where you have something like a static or namespace-scope variable but want each thread to maintain its own copy, as in the sketch below.
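To illustrate that last point, here is a small sketch (my addition) of a namespace-scope thread_local, where each thread, including main, gets an independent copy:

#include <iostream>
#include <thread>

thread_local int counter = 0; // one independent copy per thread

void bump()
{
    for ( int i = 0; i < 3; ++i )
        ++counter;
    std::cout << counter << '\n'; // each worker thread prints 3
}

int main()
{
    std::thread a( bump );
    std::thread b( bump );
    a.join();
    b.join();
    std::cout << counter << '\n'; // main's copy is untouched: prints 0
}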

CUDA 5.0 "Generate Relocatable Device Code" leads to invalid device symbol error

I am trying to do separate compilation using CUDA 5. For this reason I set the "Generate Relocatable Device Code" to "Yes (-rdc=true)" in Visual Studio 2010. The program compiles without errors, however,
I get an invalid device symbol error when I try to initialize device constants using cudaMemcpyToSymbol.
i.e. I have the following constant
__constant__ float gdDomainOrigin[2];
and try to initialize it with
cudaMemcpyToSymbol(gdDomainOrigin, mDomainOrigin, 2*sizeof(float));
which leads to the error. The error does not occur, when I compile everything as a whole, without the aforementioned option set. Could anybody please help me with that?
I can't reproduce this. If I build an application from two .cu files, one containing a __constant__ symbol and a simple kernel, and the other containing the runtime API incantations to populate that constant memory and call the kernel, it works only when relocatable device code is enabled, viz:
__constant__ float gdDomainOrigin[2];

__global__
void kernel(float *inout)
{
    inout[0] = gdDomainOrigin[0];
    inout[1] = gdDomainOrigin[1];
}
and
#include <cstdio>
#include <cstdlib>

extern __constant__ float gdDomainOrigin;
extern __global__ void kernel(float *);

inline
void gpuAssert(cudaError_t code, const char * file, int line, bool Abort=true)
{
    if (code != 0) {
        fprintf(stderr, "GPUassert: %s %s %d\n",
                cudaGetErrorString(code), file, line);
        if (Abort) exit(code);
    }
}
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

int main(void)
{
    const float mDomainOrigin[2] = { 1.234f, 5.6789f };
    const size_t sz = sizeof(float) * size_t(2);
    float * dbuf, * hbuf;

    gpuErrchk( cudaFree(0) );
    gpuErrchk( cudaMemcpyToSymbol(gdDomainOrigin, mDomainOrigin, sz) );
    gpuErrchk( cudaMalloc((void **)&dbuf, sz) );

    kernel<<<1,1>>>(dbuf);
    gpuErrchk( cudaPeekAtLastError() );

    hbuf = new float[2];
    gpuErrchk( cudaMemcpy(hbuf, dbuf, sz, cudaMemcpyDeviceToHost) );
    fprintf(stdout, "%f %f\n", hbuf[0], hbuf[1]);

    return 0;
}
Compiling and running these in CUDA 5 on a 64 bit linux system with a Kepler GPU produces the following:
$ nvcc -arch=sm_30 -o shared shared.cu shared_dev.cu
$ ./shared
GPUassert: invalid device symbol shared.cu 23
$ nvcc -arch=sm_30 -rdc=true -o shared shared.cu shared_dev.cu
$ ./shared
1.234000 5.678900
You can see that in the first compilation, without relocatable GPU code generation, the symbol isn't found. In the second case, with relocatable GPU code generation, it is found, and the ELF symbol table in the object file looks just as you would expect:
$ nvcc -arch=sm_30 -rdc=true -c shared_dev.cu
$ cuobjdump -symbols shared_dev.o
Fatbin elf code:
================
arch = sm_30
code version = [1,6]
producer = cuda
host = linux
compile_size = 64bit
identifier = shared_dev.cu
symbols:
STT_SECTION STB_LOCAL .text._Z6kernelPf
STT_SECTION STB_LOCAL .nv.constant3
STT_SECTION STB_LOCAL .nv.constant0._Z6kernelPf
STT_CUDA_OBJECT STB_LOCAL _param
STT_SECTION STB_LOCAL .nv.callgraph
STT_FUNC STB_GLOBAL _Z6kernelPf
STT_CUDA_OBJECT STB_GLOBAL gdDomainOrigin
Fatbin ptx code:
================
arch = sm_30
code version = [3,1]
producer = cuda
host = linux
compile_size = 64bit
compressed
identifier = shared_dev.cu
ptxasOptions = --compile-only
Perhaps you could try my code and compilation/diagnostic steps and see what happens with your Windows toolchain.

"Integer operation result out of range" in cuda source code

I'm trying to compile a code written using CUDA 3.2 on RHEL 5.6. The relevant portions are
extern "C"{
#include <stdio.h>
#include <inttypes.h>
static uint64_t size = 0;
...
size = 5000 * 1024 * 1024;
printf("sizeof(size) = %d size = %lu\n", sizeof(size), size);
}
The code is in a .cu file and compiled using nvcc. For the line size = 5000 * 1024 * 1024, I get the compilation warning "integer operation result is out of range". The output I got is
sizeof(size) = 8 size = 947912704
I don't understand why the variable "size" can't represent the value 5242880000 if it's 8-bytes large.
Thank you.
As #Damien commented, the multiplication is done in int. The following code gives the expected result:
size = 5000L * 1024 * 1024;
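The observed value also checks out: 5000 * 1024 * 1024 = 5,242,880,000, which exceeds INT_MAX (2,147,483,647); wrapped modulo 2^32 it becomes 5,242,880,000 - 4,294,967,296 = 947,912,704, exactly the number printed above. A minimal sketch of the difference (signed overflow is formally undefined behavior; the wrapped value is simply what typical two's-complement hardware produces):

#include <cstdio>
#include <cstdint>

int main()
{
    // All operands are int, so the product overflows before the widening assignment.
    uint64_t bad  = 5000 * 1024 * 1024;
    // Making the first operand 64-bit promotes the whole expression.
    uint64_t good = 5000ULL * 1024 * 1024;
    printf("bad = %llu, good = %llu\n",
           (unsigned long long)bad, (unsigned long long)good);
    return 0;
}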
This is not related to CUDA or the nvcc compiler, which invokes a general-purpose host compiler for the 'non-CUDA' phases. See The CUDA Compiler Driver NVCC documentation for more details.

Determine program segments (HEADER, TEXT, CONST, etc...) at run time

So I realize I can open a binary up in IDA Pro and determine where the segments start/stop. Is it possible to determine this at run time in Cocoa?
I'm assuming there are some C-level library functions that enable this; I poked around in the Mach-O headers but couldn't find much :/
Thanks in advance!
Cocoa doesn’t include classes for handling Mach-O files. You need to use the Mach-O functions provided by the system. You were right to read the Mach-O headers.
I’ve coded a small program that accepts as input a Mach-O file name and dumps information about its segments. Note that this program deals with thin files (i.e., not fat/universal) for the x86_64 architecture only.
Note that I’m also not checking every operation, nor whether the file is a correctly formed Mach-O file. Doing the appropriate checks is left as an exercise to the reader.
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <mach-o/loader.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char *argv[]) {
    int fd;
    struct stat stat_buf;
    size_t size;
    char *base = NULL, *addr = NULL;
    struct mach_header_64 *mh;
    struct load_command *lc;
    struct segment_command_64 *sc;

    // Open the file and get its size
    fd = open(argv[1], O_RDONLY);
    fstat(fd, &stat_buf);
    size = stat_buf.st_size;

    // Map the file to memory; keep the base pointer for munmap() later
    base = addr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_FILE | MAP_PRIVATE, fd, 0);

    // The first bytes of a Mach-O file comprise its header
    mh = (struct mach_header_64 *)addr;

    // Load commands follow the header
    addr += sizeof(struct mach_header_64);

    printf("There are %u load commands\n", mh->ncmds);
    for (uint32_t i = 0; i < mh->ncmds; i++) {
        lc = (struct load_command *)addr;
        if (lc->cmdsize == 0) continue;
        // If the load command is a (64-bit) segment,
        // print information about the segment
        if (lc->cmd == LC_SEGMENT_64) {
            sc = (struct segment_command_64 *)addr;
            printf("Segment %s\n\t"
                   "vmaddr 0x%llx\n\t"
                   "vmsize 0x%llx\n\t"
                   "fileoff %llu\n\t"
                   "filesize %llu\n",
                   sc->segname,
                   sc->vmaddr,
                   sc->vmsize,
                   sc->fileoff,
                   sc->filesize);
        }
        // Advance to the next load command
        addr += lc->cmdsize;
    }

    printf("\nDone.\n");

    // Unmap using the original base address, not the advanced cursor
    munmap(base, size);
    close(fd);

    return 0;
}
You need to compile this program for x86_64 only and run it against an x86_64 Mach-O binary. For instance, assuming you’ve saved this program as test.c:
$ clang test.c -arch x86_64 -o test
$ ./test ./test
There are 11 load commands
Segment __PAGEZERO
vmaddr 0x0
vmsize 0x100000000
fileoff 0
filesize 0
Segment __TEXT
vmaddr 0x100000000
vmsize 0x1000
fileoff 0
filesize 4096
Segment __DATA
vmaddr 0x100001000
vmsize 0x1000
fileoff 4096
filesize 4096
Segment __LINKEDIT
vmaddr 0x100002000
vmsize 0x1000
fileoff 8192
filesize 624
Done.
If you want more examples of how to read Mach-O files, cctools on Apple’s Open Source Web site is probably your best bet. You’ll also want to read the Mac OS X ABI Mach-O File Format Reference.
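Since the question asked about run time, here is a minimal sketch (my addition, assuming a 64-bit process and omitting error checking) that walks the segments of the running executable itself through the dyld APIs instead of parsing a file on disk:

#include <stdio.h>
#include <stdint.h>
#include <mach-o/dyld.h>
#include <mach-o/loader.h>

int main(void) {
    // Image 0 is normally the main executable itself.
    const struct mach_header_64 *mh =
        (const struct mach_header_64 *)_dyld_get_image_header(0);
    intptr_t slide = _dyld_get_image_vmaddr_slide(0);
    const char *addr = (const char *)mh + sizeof(struct mach_header_64);

    for (uint32_t i = 0; i < mh->ncmds; i++) {
        const struct load_command *lc = (const struct load_command *)addr;
        if (lc->cmd == LC_SEGMENT_64) {
            const struct segment_command_64 *sc =
                (const struct segment_command_64 *)addr;
            // The segment's actual runtime address is vmaddr plus the ASLR slide.
            printf("Segment %s at 0x%llx (slide 0x%lx)\n",
                   sc->segname,
                   (unsigned long long)(sc->vmaddr + slide),
                   (long)slide);
        }
        addr += lc->cmdsize;
    }
    return 0;
}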
