Metal Shader pointer or local copy performance

I have a Metal kernel function that reads, processes, and writes elements of a huge array of data stored in device memory:
device Element *elements [[ buffer(0) ]],
I'm wondering which is better in terms of performance:
Making a copy of the array element in thread-local memory:
Element element = elements[thread_id];
Or using a pointer to that element:
device Element *element = &elements[thread_id];

Related

Dynamically Change eBPF map size

In the kernel, eBPF maps can be defined as:
struct bpf_map_def SEC("maps") my_map = {
    .type = BPF_MAP_TYPE_HASH,
    .key_size = sizeof(uint32_t),
    .value_size = sizeof(struct task_prov_struct),
    .max_entries = 4096,
};
If I do not know ahead of time the maximum possible size of my_map (I also don't want to waste memory), is there a way to, say, pre-allocate a small size and dynamically increase the size as needed?
I am aware of the bpf_map__resize function, but it seems to be a user-space function that can only be called before a map is loaded.
I'd appreciate any sample code snippet or reference.
No, at the moment you cannot “resize” an eBPF map after it has been created.
However, the size of the map in the kernel may vary over time.
Some maps are pre-allocated, either because their type requires it (e.g. arrays) or because the user asked for it at map creation time by passing the relevant flag. These maps are allocated as soon as they are created, and occupy a space equal to (key_size + value_size) * max_entries.
Other maps are not pre-allocated and will grow over time. This is the case for hash maps, for example: they take more space in kernel memory as new entries are added. However, they will only grow up to the maximum number of entries provided at creation time, and it is NOT possible to update this maximum after that.
Regarding the bpf_map__resize() function from libbpf, it is a user space function that can be used to update the number of entries for a map, before this map is created in the kernel:
int bpf_map__set_max_entries(struct bpf_map *map, __u32 max_entries)
{
    if (map->fd >= 0)
        return -EBUSY;
    map->def.max_entries = max_entries;
    return 0;
}

int bpf_map__resize(struct bpf_map *map, __u32 max_entries)
{
    if (!map || !max_entries)
        return -EINVAL;
    return bpf_map__set_max_entries(map, max_entries);
}
If the map has already been created (i.e. we already have a file descriptor for it), the operation fails with -EBUSY.
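For completeness, here is a minimal user-space sketch of how this is typically used with libbpf (the object file name "prog.bpf.o" and the new entry count are illustrative assumptions): the maximum number of entries can only be changed between opening and loading the object, i.e. before the map exists in the kernel.
#include <bpf/libbpf.h>

int main(void)
{
    /* Open, but do not load yet: no map exists in the kernel at this point. */
    struct bpf_object *obj = bpf_object__open("prog.bpf.o");
    if (!obj)
        return 1;

    struct bpf_map *map = bpf_object__find_map_by_name(obj, "my_map");
    if (!map)
        return 1;

    /* Allowed here, because the map has no file descriptor yet. */
    bpf_map__set_max_entries(map, 65536);

    /* Creates the map in the kernel; after this, the call above would fail with -EBUSY. */
    if (bpf_object__load(obj))
        return 1;

    return 0;
}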

What is the difference between storing a Vec vs a Slice?

Rust provides a few ways to store a collection of elements inside a user-defined struct. The struct can be given a lifetime parameter and hold a reference to a slice:
struct Foo<'a> {
    elements: &'a [i32]
}

impl<'a> Foo<'a> {
    fn new(elements: &'a [i32]) -> Foo<'a> {
        Foo { elements: elements }
    }
}
Or it can be given a Vec object:
struct Bar {
    elements: Vec<i32>
}

impl Bar {
    fn new(elements: Vec<i32>) -> Bar {
        Bar { elements: elements }
    }
}
What are the major differences between these two approaches?
Will using a Vec force the language to copy memory whenever I call Bar::new(vec![1, 2, 3, 4, 5])?
Will the contents of Vec be implicitly destroyed when the owner Bar goes out of scope?
Are there any dangers associated with passing a slice in by reference if it's used outside of the struct that it's being passed to?
A Vec is composed of three parts:
A pointer to a chunk of memory
A count of how much memory is allocated (the capacity)
A count of how many items are stored (the size)
A slice is composed of two parts:
A pointer to a chunk of memory
A count of how many items are stored (the size)
Whenever you move either of these, those fields are all that will be copied. As you might guess, that's pretty lightweight. The actual chunk of memory on the heap will not be copied or moved.
A Vec indicates ownership of the memory, and a slice indicates a borrow of memory. A Vec needs to deallocate all the items and the chunk of memory when it is itself deallocated (dropped in Rust-speak). This happens when it goes out of scope. The slice does nothing when it is dropped.
There are no dangers in using slices, as that is exactly what Rust lifetimes handle. They make sure that you never use a reference after it would be invalidated.
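To make the layout point concrete, here is a small sketch (the exact byte counts assume a typical 64-bit target and the current standard-library layout, which is not formally guaranteed):
use std::mem::size_of;

fn main() {
    // The Vec header is three machine words (pointer, capacity, length);
    // a slice reference is two (pointer, length).
    println!("Vec<i32> header: {} bytes", size_of::<Vec<i32>>()); // typically 24
    println!("&[i32] slice:    {} bytes", size_of::<&[i32]>());   // typically 16

    let v: Vec<i32> = vec![1, 2, 3, 4, 5];
    let ptr_before = v.as_ptr();

    // Moving the Vec copies only the small header; the heap buffer stays where it is.
    let moved = v;
    assert_eq!(moved.as_ptr(), ptr_before);

    // Borrowing a slice copies nothing but a pointer and a length,
    // and dropping the slice does nothing.
    let s: &[i32] = &moved;
    assert_eq!(s.len(), 5);
} // `moved` is dropped here and frees the heap buffer.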
A Vec is a collection that can grow or shrink in size. It owns a buffer on the heap that is allocated, reallocated, and deallocated dynamically at runtime. A Vec can store any number of elements, and it is typically used when the number of elements is not known at compile time or may change during execution of the program.
A slice is a reference to a contiguous sequence of elements in a Vec, an array, or another collection, written &[T], where T is the type of the elements. A slice does not store any elements itself; it only references elements stored elsewhere. It is typically used when you need a view of a subset of the elements in a collection.
One of the main differences between a Vec and a slice is that a Vec can add and remove elements, while a slice cannot change its length: a shared slice (&[T]) gives read-only access, and even a mutable slice (&mut [T]) can only modify existing elements, not grow or shrink. Another difference is that a Vec owns its heap allocation, while a slice is a borrowed view of fixed length. This means that a slice cannot be used to store new elements, but it can be used to reference a subset of the elements held by a Vec or other collection.
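A short sketch of that distinction, with hypothetical values: the Vec owns storage it can grow, while the slice is only a borrowed view into part of it.
fn main() {
    let mut v: Vec<i32> = vec![1, 2, 3, 4, 5];
    v.push(6);                      // a Vec can grow

    let view: &[i32] = &v[1..4];    // a slice borrowing the elements at indices 1..4
    println!("{:?}", view);         // prints [2, 3, 4]
    // view.push(7);                // does not compile: a slice cannot change its length

    // While `view` is in use, the borrow checker refuses to let us mutate `v`,
    // which is exactly what keeps the slice from dangling.
}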

Keeping UINT64 values in V8

I'm looking to integrate a scripting engine in my C/C++ program. Currently, I am looking at Google V8.
How do I efficiently handle 64-bit values in V8? My C/C++ program uses 64-bit values extensively for keeping handlers/pointers. I don't want them separately allocated on the heap. There appears to be a V8::External value type. Can I assign it to a JavaScript variable and use it as a value type?
function foo() {
    var a = MyNativeFunctionReturningAnUnsigned64BitValue();
    var b = a; // Hopefully, b is a stack-allocated value capable of
               // keeping a 64-bit pointer or some other uint64 structure.
    MyNativeFunctionThatAcceptsAnUnsigned64BitValue(b);
}
If it is not possible in V8, how about SpiderMonkey? I know that Duktape (a JavaScript engine) has a non-standard, stack-allocated 64-bit value type for hosting pointers, but I would assume that other engines also want to keep track of external pointers from within their objects.
No, it's not possible, and I'm afraid Duktape could be violating the spec unless it took great pains to ensure it's not observable.
You can store pointers in objects, so to store 64-bit ints directly on an object you need pointers to have the same size:
Local<FunctionTemplate> function_template = FunctionTemplate::New(isolate);
// Instances of this function have room for 1 internal field
function_template->InstanceTemplate()->SetInternalFieldCount(1);
Local<Object> object = function_template->GetFunction()->NewInstance();
static_assert(sizeof(void*) == sizeof(uint64_t));
uint64_t integer = 1;
object->SetAlignedPointerInInternalField(0, reinterpret_cast<void*>(integer));
uint64_t result = reinterpret_cast<uint64_t>(object->GetAlignedPointerInInternalField(0));
This is of course far from being efficient.

What's better for performance, cell arrays of objects or heterogeneous arrays?

Suppose I have some classes foo < handle, and bar < foo, baz < foo, and maybe qux < foo. There are a couple ways I can store an array of these objects:
As a cell array: A = {foo bar baz qux} % A(1) would be a cell, A{1} gives me a foo object
Starting with R2011a, I can make foo < matlab.mixin.Heterogeneous, and then build an array directly: A = [foo bar baz qux] % A(1) directly gives me a foo object
The way I see it, from a maintenance perspective the second method is better than the first, since it removes the ambiguity about how to index A (with a cell array, A(1) is a cell, while A{1} is the foo object that lives inside it).
But is there any kind of memory or performance penalty (or benefit) to using one syntax vs the other?
I did a small experiment (source) on the memory and running time of the cell array, containers.Map and a Heterogeneous array.
In my method I preallocated each array with N=65535 elements (the max array size for Map and Heterogeneous array), then began assigning each element a uint32, and measured the time and memory.
My Heterogeneous Class was a simple class with a single public property, and a constructor which assigned that property.
The containers.Map had uint32 key/value pairs.
Maps took 9.17917e-01 seconds.
Cells took 5.81220e-02 seconds.
Heterogeneous array took 4.95336e+00 seconds.
Name       Size       Bytes      Class
map        65535x1    112        containers.Map
cellArr    65535x1    7602060    cell
hArr       1x65535    262244     SomeHeterogeneousClass
Immediately note that the reported size of the map is not accurate. It is hidden behind the containers.Map class implementation; most likely the 112 bytes reported is the memory assigned to the map object itself, excluding the data. I approximate the true size to be at minimum (112 + 65535*(sizeof(uint32)*2)) = 524392 bytes. This value is almost exactly double the hArr size, which makes me think it is quite accurate, since the map must store twice as much data (key AND value) as the hArr.
The results are straightforward:
Time: cell Array < Map < Heterogeneous Array
Memory: Heterogeneous Array < Map < cell Array
I repeated the experiment with N=30 to test for small arrays, the results were similar.
God only knows why cells take up so much memory and Heterogeneous arrays are so slow.

OOP much slower than structural programming: why, and how can it be fixed?

As I mentioned in the subject of this post, I found out the hard way that OOP is slower than structural programming (spaghetti code).
I wrote a simulated annealing program with OOP, then removed one class and rewrote it structurally in the main form. It suddenly got much faster. I had been calling the removed class in every iteration of the OOP program.
I also checked it with Tabu Search. Same result.
Can anyone tell me why this happens and how I can fix it in my other OOP programs? Are there any tricks, for example caching my classes or something like that?
(The programs were written in C#.)
If you have a high-frequency loop, and inside that loop you create new objects and don't call many other functions, then yes, you will find that if you can avoid those news, say by re-using one copy of the object, you can save a large fraction of the total time.
Between new, constructors, destructors, and garbage collection, very little code can waste a whole lot of time.
Use them sparingly.
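As a minimal sketch of what "re-using one copy" means in practice (written in Rust only because the next answer also uses Rust for its examples; the values and sizes here are made up, and the same idea applies to C# objects and the GC):
fn main() {
    let n: u64 = 1_000_000;

    // Version 1: a fresh buffer is allocated on every iteration.
    let mut total_a = 0usize;
    for i in 0..n {
        let scratch: Vec<u64> = (0..16u64).map(|x| x + i).collect();
        total_a += scratch.len();
        // `scratch` is dropped (its allocation freed) at the end of every iteration
    }

    // Version 2: one buffer is allocated up front and re-used;
    // clear() keeps the capacity, so the loop body allocates nothing.
    let mut scratch: Vec<u64> = Vec::with_capacity(16);
    let mut total_b = 0usize;
    for i in 0..n {
        scratch.clear();
        scratch.extend((0..16u64).map(|x| x + i));
        total_b += scratch.len();
    }

    println!("{} {}", total_a, total_b);
}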
Memory access is often overlooked. The way OO tends to lay out data in memory is not conducive to efficient memory access in loops in practice. Consider the following pseudocode:
adult_clients = 0
for client in list_of_all_clients:
    if client.age >= AGE_OF_MAJORITY:
        adult_clients++
It so happens that this access pattern is quite inefficient on modern architectures, because they like reading large contiguous runs of memory; but all we care about is client.age for each of the clients we have, and those fields will not be laid out in contiguous memory.
Focusing on objects that have fields results in data being laid out in memory such that fields holding the same kind of information do not sit in consecutive memory. Performance-heavy code tends to involve loops that repeatedly look at data with the same conceptual meaning, and it is conducive to performance for such data to be laid out in contiguous memory.
Consider these two examples in Rust:
// struct that contains an id, and an optional value of whether the id is divisible by three
struct Foo {
    id: u32,
    divbythree: Option<bool>,
}

fn main() {
    // create a pretty big vector of these structs, with increasing ids and divbythree initialized to None
    let mut vec_of_foos: Vec<Foo> = (0..100000000).map(|i| Foo { id: i, divbythree: None }).collect();

    // loop over the vector, determine whether each id is divisible by three,
    // and set divbythree accordingly
    let mut divbythrees = 0;
    for foo in vec_of_foos.iter_mut() {
        if foo.id % 3 == 0 {
            foo.divbythree = Some(true);
            divbythrees += 1;
        } else {
            foo.divbythree = Some(false);
        }
    }

    // print the number of times it was divisible by three
    println!("{}", divbythrees);
}
On my system, the real time with rustc -O is 0m0.436s; now let us consider this example:
fn main() {
    // this time we create two vectors rather than a vector of structs
    let vec_of_ids: Vec<u32> = (0..100000000).collect();
    let mut vec_of_divbythrees: Vec<Option<bool>> = vec![None; vec_of_ids.len()];

    // but we basically do the same thing
    let mut divbythrees = 0;
    for i in 0..vec_of_ids.len() {
        if vec_of_ids[i] % 3 == 0 {
            vec_of_divbythrees[i] = Some(true);
            divbythrees += 1;
        } else {
            vec_of_divbythrees[i] = Some(false);
        }
    }

    println!("{}", divbythrees);
}
This runs in 0m0.254s at the same optimization level, close to half the time.
Despite having to allocate two vectors instead of one, storing similar values in contiguous memory almost halved the execution time. Though obviously the OO approach gives much nicer and more maintainable code.
P.S.: it occurs to me that I should probably explain why this matters so much, given that the code in both cases still indexes memory one field at a time rather than, say, pulling a large swath onto the stack. The reason is CPU caches: when the program asks for the memory at a certain address, it actually fetches, and caches, a sizeable chunk of memory around that address, and if memory next to it is requested again soon, it can be served from the cache rather than from actual physical working memory. As a consequence, the compiler can also vectorize the second version more effectively.
