I am curious if any major projects have used Boehm GC? I'm particularly interested in if any gaming projects have used this garbage collector. If not, is Boehm GC bad for gaming projects?
I am impressed that Boehm GC can handle even simple code such as this:
#include <stdio.h>
#include <stdlib.h>
#include <gc.h>

int main(void)
{
    int i;

    GC_INIT(); /* recommended before the first allocation */
    for (i = 0; i < 10000000; ++i)
    {
        int *p = GC_MALLOC(sizeof(int *));
        //int *q = malloc(sizeof(int *));
        /* GC_get_heap_size() returns a size_t, so print it with %zu */
        printf("Heap size = %zu\n", GC_get_heap_size());
    }
    return 0;
}
Are there any caveats to making a game using Boehm GC?

The Open Modelica Compiler (OMC) makes use of the Boehm GC.
It is a very large application with over 300 000 lines of code and is used both in industry and in research. The garbage collector collects garbage during simulations.
It defines the internal API for the garbage collector and might be a good reference and an interesting read.
However, if you are going to make a game in C++, I would recommend using smart pointers instead. If you really want the comfort of a garbage collector and you need to use C or C++ for some reason, Boehm is a good option. Otherwise, if performance is not critical for the game you are programming, it might be wise to look at other languages with efficient garbage collectors, such as Java or C#.


How do OpenMP macros work behind the scenes, in collaboration with the preprocessor/compiler and the library itself?

I'm trying to implement a similar functionality to one of my projects and I was wondering how it works.
For example, I was wondering how #pragma omp parallel default(shared) private(iam, np) works in the following example from the compiler's/preprocessor's perspective? I'm referencing the compiler since I have read that #pragma macros exist to give side information to the compiler.
If I take into account that all the macros are handled by the preprocessor it gets really confusing to me.
How is the macro expanded, and how does the OpenMP library get access to the information in those macros? Is there a specific compiler extension that OpenMP uses to fetch that information for every compiler that it supports, or is it just simple macro invocation?
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int iam = 0, np = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);

    /* Each thread gets its own copy of iam and np; everything else is shared. */
    #pragma omp parallel default(shared) private(iam, np)
    {
        np = omp_get_num_threads();
        iam = omp_get_thread_num();
        printf("Hybrid: Hello from thread %d out of %d from process %d out of %d on %s\n",
               iam, np, rank, numprocs, processor_name);
    }

    MPI_Finalize(); /* every MPI program must finalize before exiting */
    return 0;
}
I got this example from here.
For example, I was wondering how #pragma omp parallel default(shared) private(iam, np) works in the following example from the compiler's/preprocessor's perspective?
This depends strongly on the compiler implementation. In practice, for Clang and GCC (and probably ICC), the pragma annotation gives information to the compiler's front-end, enabling it to transform the code in a front-end pass. Put simply, the front-end of a compiler does the preprocessing, tokenization, syntactic analysis and semantic analysis, while the back-end does optimization and code generation.
For most steps, mainstream compilers let you inspect the intermediate output. For example, Clang and GCC have the -E flag to stop after preprocessing and -S to emit the generated assembly. Low-level intermediate representations (IR) depend more on the compiler implementation, so the flags are not the same (nor are the optimizations or the intermediate language). GCC uses the GENERIC and GIMPLE languages for its high-level IR, while Clang uses LLVM IR. AFAIK, the GIMPLE code can be dumped using the -fdump-* flags. For Clang, -emit-llvm can be used to dump the IR code.
In Clang, the transformation is done after AST generation but before the first IR generation. Note that some other compilers perform an AST transformation, while others do it in later steps. When OpenMP is enabled (with -fopenmp), Clang replaces the pragma region with a call to __kmpc_fork_call and generates a function for the region, which is passed to that KMP function. KMP is the prefix of the IOMP runtime shared by both Clang and ICC. GCC has its own runtime called GOMP. There are many other runtimes, but the mainstream ones are GOMP and IOMP. GCC uses a similar strategy: it calls GOMP_parallel, passing it a compiler-generated function. The IOMP/GOMP runtimes take care of initializing the region and the ICVs (internal control variables) before calling the compiler-generated function.
Note that the processor is not aware of the use of OpenMP (at least not in any OpenMP implementation I am aware of).
How is the macro expanded, and how does the OpenMP library get access to the information in those macros?
Note that pragma annotations are not macros; they are more powerful than that: they provide information to the compiler, which can then perform non-trivial changes during any compilation step. For example, a pragma can change the way code generation is performed, which is impossible with preprocessor macros (e.g. #pragma GCC unroll n for loop unrolling in GCC, and #pragma ivdep for telling ICC that there are no loop-carried dependencies, which enables auto-vectorization).
The information is passed as arguments to the runtime's main fork function (i.e. __kmpc_fork_call or GOMP_parallel), in the same way as the compiler-generated user function.
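To make the outlining idea concrete, here is a minimal sketch in plain C of what such a transformation amounts to. It is not what Clang or GCC actually emit: toy_fork_call, outlined_region, region_args and run_region are hypothetical names, the toy runtime runs the body serially instead of on real threads, and the real __kmpc_fork_call and GOMP_parallel have different signatures.

```c
/* Variables captured for the region. Shared variables are reached through
   this struct; private ones become locals of the outlined function.
   (Hypothetical layout, for illustration only.) */
struct region_args {
    int rank;   /* shared, only read inside the region */
    int *hits;  /* shared array, one slot per thread */
};

/* Outlined body of the parallel region (hypothetical name). */
static void outlined_region(int thread_id, int num_threads, void *raw)
{
    struct region_args *args = raw;
    int iam = thread_id;   /* plays the role of private(iam) */
    int np = num_threads;  /* plays the role of private(np) */

    args->hits[iam] = args->rank * 100 + np * 10 + iam;
}

/* Toy stand-in for the runtime entry point. A real runtime would start or
   wake worker threads; this one calls the body once per thread id, serially. */
static void toy_fork_call(void (*body)(int, int, void *),
                          int num_threads, void *args)
{
    for (int t = 0; t < num_threads; ++t)
        body(t, num_threads, args);
}

/* What the rewritten caller might look like once the pragma region has been
   replaced by a call into the runtime. */
static void run_region(int rank, int num_threads, int *hits)
{
    struct region_args args = { rank, hits };
    toy_fork_call(outlined_region, num_threads, &args);
}
```

Calling run_region(2, 4, hits) with a four-element array fills one slot per simulated thread. The point is only the shape: the region body becomes a function, and the clause information travels to the runtime as ordinary arguments.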
Is there a specific compiler extension that OpenMP uses to fetch that information for every compiler that it supports, or is it just simple macro invocation?
It is not just simple macro invocation, and AFAIK there is no external module for GCC and Clang. OpenMP support is integrated directly into the compiler (though it may be modular, especially in Clang). This is important because compilers need to analyse the pragma annotations at compile time. The pragmas are not just a way to automatically generate runtime calls and abstract them behind a standard language/interface; they also affect the compilation steps. For example, #pragma omp simd should affect the compiler's auto-vectorization steps (back-end steps).
AFAIK, there are some (research) OpenMP implementations based on source-to-source compilation, so as to be compiler independent, but I am not sure they support all OpenMP features (especially the SIMD ones).

Is batching same functions with SIMD instruction possible?

I have a scenario in which many instances of the exact same function (for simplicity, let's just consider C/C++ and Python here) will be executed at the same time on my machine. Intuitively, I would just use multi-threading and treat each instance of the function as a thread to exploit the parallelism. The instances do not contend for the same resources, but they perform many branch operations (e.g., for loops). However, since they are actually the same function, I'm thinking about batching them using SIMD instructions, e.g., AVX-512. Of course, it should be automatic so that users do not have to modify their code.
The reason? Every thread/process/container/VM occupies resources, but AVX needs only one instruction. So I can serve more users with the same hardware.
Most articles I find online focus on using AVX instructions inside the function, for example, to accelerate the stream data processing, or deal with some large calculation. None of them mentions batching different instances of same function.
I know there are some challenges, such as different execution path caused by different input, and it is not easy to turn a normal function into a batched version automatically, but I think it is indeed possible technically.
Here are my questions
1. Is it hard (or possible) to automatically change a normal function into a batched version?
2. If 1 is no, what restrictions should I put on the function to make it possible? For example, what if the function only has one path regardless of the data?
3. Are there other technologies that solve the problem better? I don't think a GPU is a good option for me because a GPU cannot support IO or branch instructions, although its SIMT model fits my goal perfectly.
SSE/AVX is basically a vector unit: it allows simple operations (like + - * /, and, or, XOR, etc.) on arrays of multiple elements at once. AVX1 and AVX2 have 256-bit registers, so you can process e.g. eight 32-bit singles or four doubles at once. AVX-512 is coming but quite rare atm.
So if your functions are all operations on arrays of basic types, it is a natural fit. Rewriting your function using AVX intrinsics is doable if the operations are very simple. Complex things (like not matching vector widths) or even doing it in assembler is a challenge though.
If your function is not operating on vectors, then it becomes difficult, and the possibilities are mostly theoretical. Autovectorizing compilers can sometimes do this, but it's rare and limited, and extremely complex.
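As a rough illustration of what batching a branchy function involves, here is a hand-written sketch in plain C. The function clamp_double and its batched counterpart are hypothetical examples, not taken from the question. The batched version replaces the branch with a mask-and-select, so every element follows the same instruction stream; that is the kind of form a vectorizer can map onto SIMD lanes, although whether a given compiler actually emits AVX code for it is an assumption.

```c
#include <stddef.h>

/* Scalar version: the path taken depends on the input. */
static int clamp_double(int x)
{
    if (x < 0)
        return 0;
    return 2 * x;
}

/* Batched version: every element computes the "taken" result and a mask
   selects between it and zero, so there is no data-dependent branch.
   Assumes inputs small enough that 2 * in[i] does not overflow. */
static void clamp_double_batch(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        int keep = -(in[i] >= 0);   /* all bits set if in[i] >= 0, else 0 */
        out[i] = (2 * in[i]) & keep;
    }
}
```

The cost of this approach is that both outcomes are evaluated for every element. That is cheap here, but it grows quickly when the two paths differ a lot, or when one of them does I/O, which is why automatic batching of arbitrary functions is hard.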
There are two ways to approach this: vectorization (SIMD) and parallelization (threads).
GCC can already do the SIMD vectorization you want provided that the function is inlined, and the types and operations are compatible (and it will automatically inline smallish functions without you asking it to).
inline void func (int i) {
    somearray[i] = someotherarray[i] * athirdarray[i];
}

for (int i = 0; i < ABIGNUMBER; i++)
    func (i);
Vectorization and inlining are enabled at -O3.
If the functions are too complex, and/or GCC doesn't vectorize it yet, then you can use OpenMP or OpenACC to parallelize it.
OpenMP uses special markup to tell the compiler where to spawn threads.
#pragma omp parallel
#pragma omp for
for (int i = 0; i < ABIGNUMBER; i++)
    func (i);
And yes, you can do that on a GPU too! You do have to do a bit more typing to get the data copied in and out correctly. Only the marked up areas run on the GPU. Everything else runs on the CPU, so I/O etc. is not a problem.
#pragma omp target map(somearray,someotherarray,athirdarray)
#pragma omp parallel
#pragma omp for
for (int i = 0; i < ABIGNUMBER; i++)
    func (i);
OpenACC is a similar idea, but more specialized towards GPUs.
You can find OpenMP and OpenACC compilers in many places. Both GCC and LLVM support NVidia GPUs. LLVM has some support for AMD GPUs, and there are unofficial GCC builds available too (with official support coming soon).

Rust manual memory management

When I began learning C, I implemented common data structures such as lists, maps and trees. I used malloc, calloc, realloc and free to manage the memory manually when requested. I did the same thing with C++, using new and delete.
Now comes Rust. It seems like Rust doesn't offer any functions or operators which correspond to the ones of C or C++, at least in the stable release.
Are the Heap structure and the ptr module (marked with experimental) the ones to look at for this kind of thing?
I know that these data structures are already in the language. It's for the sake of learning.
Although it's really not recommended to do this ever, you can use malloc and free like you are used to from C. It's not very useful, but here's how it looks:
extern crate libc; // 0.2.65

use std::mem;

fn main() {
    unsafe {
        let my_num: *mut i32 = libc::malloc(mem::size_of::<i32>() as libc::size_t) as *mut i32;
        if my_num.is_null() {
            panic!("failed to allocate memory");
        }
        libc::free(my_num as *mut libc::c_void);
    }
}
A better approach is to use Rust's standard library:
use std::alloc::{alloc, dealloc, Layout};

fn main() {
    unsafe {
        let layout = Layout::new::<u16>();
        let ptr = alloc(layout);

        *(ptr as *mut u16) = 42;
        assert_eq!(*(ptr as *mut u16), 42);

        dealloc(ptr, layout);
    }
}
It's very unusual to directly access the memory allocator in Rust. You generally want to use the smart pointer constructors (Box::new, Rc::new, Arc::new) for single objects and just use Vec or Box<[T]> if you want a heap-based array.
If you really want to allocate memory and get a raw pointer to it, you can look at the implementation of Rc. (Not Box. Box is magical.) To get its backing memory, it actually creates a Box and then uses its into_raw_non_null function to get the raw pointer out. For destroying, it uses the allocator API, but could alternatively use Box::from_raw and then drop that.
Are the Heap structure and the ptr module (marked with experimental) the ones to look at for this kind of thing?
No, as a beginner you absolutely shouldn't start there. When you started learning C, malloc was all there was, and it's still a hugely error-prone part of the language - but you can't write any non-trivial program without it. It's very important for C programmers to learn about malloc and how to avoid all the pitfalls (memory leaks, use-after-free, and so on).
In modern C++, people are taught to use smart pointers to manage memory, instead of using delete by hand, but you still need to call new to allocate the memory for your smart pointer to manage. It's a lot better, but there's still some risk there. And still, as a C++ programmer, you need to learn how new and delete work, in order to use the smart pointers correctly.
Rust aims to be much safer than C or C++. Its smart pointers encapsulate all the details of how memory is handled at low-level. You only need to know how to allocate and deallocate raw memory if you're implementing a smart pointer yourself. Because of the way ownership is managed, you actually need to know a lot more details of the language to be able to write correct code. It can't be lesson one or two like it is in C or C++: it's a very advanced topic, and one many Rust programmers never need to learn about.
If you want to learn how to allocate memory on the heap, the Box type is the place to start. In the Rust book, the chapter about smart pointers is the chapter about memory allocation.

Obtaining source code from the binary

Shouldn't it be possible to obtain the source code from its binary?
Since compilation is the process of converting high level language (source code) into low level language (machine code), can't we just reverse the process in order to obtain the source code? If not, why?
Suppose I give you the number 3, and tell you that I obtained it by summing two numbers. Can you tell me which two numbers I summed? You can't, because addition discards information: many different pairs of arguments produce the same output, so the arguments cannot be recovered from the result. I could have obtained it from -55 and 58, even if for you 1 and 2 works out to the same answer.
Compilation is similar. There's an infinite number of C++ programs that will generate any particular machine code output (more or less).
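A small sketch of the same point in code, using C rather than C++ for brevity. The two functions below are hypothetical examples written quite differently, yet they return the same value for every input small enough not to overflow. An optimizing compiler is free to emit similar or even identical machine code for both (whether a particular compiler does so is an assumption), and nothing in that machine code records which version the author wrote.

```c
/* Version A: the sum 1 + 2 + ... + n written as an explicit loop. */
static unsigned sum_to_loop(unsigned n)
{
    unsigned total = 0;
    for (unsigned i = 1; i <= n; ++i)
        total += i;
    return total;
}

/* Version B: the same sum written as a closed-form expression. */
static unsigned sum_to_formula(unsigned n)
{
    return n * (n + 1) / 2;
}
```

A decompiler that sees the compiled form of either function can at best pick one plausible source. It has no way to tell which of the many equivalent programs was the original.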
You can certainly reverse the compilation process and produce C or C++ code that would result, at least, in machine code with same semantics (meaning), but likely not byte-for-byte identical. Such tools exist to a varying degree.
So yes, it is possible, but because a lot of information from the source code necessarily has to be lost, the code you'll get back will not yield much insight into the design of the original source code. For any significantly-sized project, what you'd get back would be code that does the same thing, but is pretty much unreadable for a human. It'd be some very, very obfuscated C/C++.
Why is the information lost? Because the whole big deal with high-level languages is that they should be efficient for humans to deal with. The more high-level and human-comprehensible the source code is, the more it differs from the machine code that results when the compiler is done. As a software designer, your primary objective is to leverage the compiler and other code generating tools to transform high-level ideas and concepts into machine code. The larger the gap between the two, the more information about high-level design is lost.
Remember that the only thing that a compiler has to preserve is the semantics (meaning) of your code. As long as it appears that the code is doing what you meant it to do, everything is fine. Modern compilers can, for example, pre-execute parts of your code and only store the results of the operations, when it makes "sense" to do so according to some metric. Say that your entire program reads as follows:
#include <iostream>
#include <cmath>

int main() {
    std::cout << std::sin(1) + std::cos(2) << std::endl;
    return 0;
}
Assuming a Unix system, the compiler is perfectly within its right to produce machine code that executes two syscalls: a write to stdout, with a constant buffer, and an exit. Upon decompiling such code, you could get:
#include <unistd.h>
#include <cstdlib>

int main() {
    write(1, "0.425324\n", 9);  // file descriptor 1 is stdout
    exit(0);
}

Portability and Optimization of OpenCL between Radeon Graphic Cards

I'm planning on diving into OpenCL and have been reading (only surface knowledge) on what OpenCL can do, but have a few questions.
Let's say I have an AMD Radeon 7750 and I have another computer that has an AMD Radeon 5870 and no plans on using a computer with an Nvidia card. I heard that optimizing the code for a particular device brings performance benefits. What exactly does optimizing mean? From what I've read and a little bit of guessing, it sounds like it means writing the code in a way that a GPU likes (in general without concern that it's an AMD or Nvidia card) as well as in a way that matches how the graphics card handles memory (I'm guessing this is compute device specific? Or is this only brand specific?).
So if I write code and optimized it for the Radeon 7750, would I be able to bring that code to the other computer with the Radeon 5870 and, without changing any part of the code, still retain a reasonable amount of performance benefits from the optimization? In the event that the code doesn't work, would changing parts of the code be a minor issue or would it involve rewriting enough code that it would have been a better idea to have written an optimized code for the Radeon 5870 in the first place.
Without more information about the algorithms and applications you intend to write, the question is a little vague. But I think I can give you some high-level strategies to keep in mind as you develop your code for these two different platforms.
The Radeon 7750 is based on the newer Graphics Core Next (GCN) architecture, while your HD5870 is based on the older VLIW5 (Evergreen) architecture.
For your code to perform well on the HD5870 hardware, you must make as heavy use of the packed primitive datatypes as possible, especially the int4 and float4 types. This is because the OpenCL compiler has a difficult time automatically discovering parallelism and packing data into the wide vectors for you. If you structure your code so that this is already taken into account, you will be able to fill more of the VLIW5 slots and thus use more of your stream processors.
GCN is more like NVidia's Fermi architecture, where the code's path to the functional units (ALUs, etc.) of the stream processors does not go through explicitly scheduled VLIW instructions. So more parallelism can be automatically detected at runtime and keep your functional units busy doing useful work without you having to think as hard about how to make that happen.
Here's an over-simplified example to illustrate my point:
// multiply four factors
// A[0] = B[0] * C[0]
// ...
// A[3] = B[3] * C[3];
float *A, *B, *C;
for (int i = 0; i < 4; i++) {
    A[i] = B[i] * C[i];
}
That code will probably run ok on a GCN architecture (except for suboptimal memory access performance--an advanced topic). But on your HD5870 it would be a disaster, because those four multiplies would take up 4 VLIW5 instructions instead of 1! So you would write that above code using the float4 type:
float4 A, B, C;
A = B * C;
And it would run really well on both of your cards. Plus it would kick ass on a CPU OpenCL context and make great use of the wide SSE registers, a bonus. It's also a much better use of the memory system.
In a tiny nutshell, using the packed primitives is the one thing I can recommend keeping in mind as you start deploying code on these two systems at the same time.
Here's one more example that more clearly illustrates what you need to be careful doing on your HD5870. Say we had implemented the previous example using separate work-units:
// multiply four factors
// as separate work units
// A = B * C
float A, B, C;
A = B * C;
And we had four separate work units instead of one. That would be an absolute disaster on the VLIW device and would show tremendously better performance on the GCN device. That is something you will want to also look for when you are writing your code--can you use float4 types to reduce the number of work units doing the same work? If so, then you will see good performance on both platforms.
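To show the difference in work count, here is a plain-C emulation of the two layouts. It is not OpenCL code: vec4f is a hypothetical stand-in for float4, the mul_*_item functions stand in for kernels, and the launch_* loops stand in for the runtime that would normally run one kernel instance per work-item (what the text above calls a work unit).

```c
#include <stddef.h>

/* Plain-C stand-in for OpenCL's float4 (hypothetical type name). */
typedef struct { float x, y, z, w; } vec4f;

/* Scalar layout: each work-item multiplies a single element. */
static void mul_scalar_item(size_t gid, float *a, const float *b, const float *c)
{
    a[gid] = b[gid] * c[gid];
}

/* Packed layout: each work-item multiplies four elements at once. */
static void mul_packed_item(size_t gid, vec4f *a, const vec4f *b, const vec4f *c)
{
    a[gid].x = b[gid].x * c[gid].x;
    a[gid].y = b[gid].y * c[gid].y;
    a[gid].z = b[gid].z * c[gid].z;
    a[gid].w = b[gid].w * c[gid].w;
}

/* Serial stand-ins for a kernel launch over n floats.
   Each returns the number of work-items it had to run. */
static size_t launch_scalar(size_t n, float *a, const float *b, const float *c)
{
    size_t items = 0;
    for (size_t gid = 0; gid < n; ++gid, ++items)
        mul_scalar_item(gid, a, b, c);
    return items;
}

static size_t launch_packed(size_t n, vec4f *a, const vec4f *b, const vec4f *c)
{
    size_t items = 0;
    for (size_t gid = 0; gid < n / 4; ++gid, ++items)
        mul_packed_item(gid, a, b, c);
    return items;
}
```

For the same n floats, the packed launch runs a quarter as many work-items, each carrying four independent multiplies. That is the shape that lets a VLIW5 device fill its instruction slots.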
