In the kernel, eBPF maps can be defined as:
struct bpf_map_def SEC("maps") my_map = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(uint32_t),
.value_size = sizeof(struct task_prov_struct),
.max_entries = 4096,
};
If I do not know ahead of time the maximum possible size of my_map (I also don't want to waste memory), is there a way to, say, pre-allocate a small size and dynamically increase the size as needed?
I am aware of bpf_map__resize function but it seems to be a user space function and can only be called before a map is loaded.
I'd appreciate any sample code snippet or reference.
No, at the moment you cannot “resize” an eBPF map after it has been created.
However, the size of the map in the kernel may vary over time.
Some maps are pre-allocated, because their type requires so (e.g. arrays) or because this was required by the user at map creation time, by providing the relevant flag. These maps are allocated as soon as they are created, and occupy a space equal to (key_size + value_size) * max_entries.
Some other maps are not pre-allocated, and will grow over time. This is the case for hash maps for example: They will take more space in kernel space as new entries are added. However, they will only grow up to the maximum number of entries provided during their creation, and it is NOT possible to update this maximum number of entries after that.
Regarding the bpf_map__resize() function from libbpf, it is a user space function that can be used to update the number of entries for a map, before this map is created in the kernel:
int bpf_map__set_max_entries(struct bpf_map *map, __u32 max_entries)
{
if (map->fd >= 0)
return -EBUSY;
map->def.max_entries = max_entries;
return 0;
}
int bpf_map__resize(struct bpf_map *map, __u32 max_entries)
{
if (!map || !max_entries)
return -EINVAL;
return bpf_map__set_max_entries(map, max_entries);
}
If we already created the map (if we have a file descriptor to that map), the operation fails.
Related
I am trying to find a good solution for this question -
Implement two functions that assign/release unique id's from a pool. Memory usage should be minimized and the assign/release should be fast, even under high contention.
alloc() returns available ID
release(id) releases previously assigned ID
The first thought was to maintain a map of IDs and availability(in boolean). Something like this
Map<Integer, Boolean> availabilityMap = new HashMap();
public Integer alloc() {
for (EntrySet es : availabilityMap.entrySet()) {
if (es.value() == false) {
Integer key = es.key();
availabilityMap.put(key, true);
return key;
}
}
}
public void release(Integer id) {
availabilityMap.put(id, false);
}
However this is not ideal for multiple threads and "Memory usage should be minimized and the assign/release should be fast, even under high contention."
What would be a good way to optimize both memory usage and speed?
For memory usage, I think map should be replaced with some other data structure but I am not sure what it is. Something like bit map or bit set? How can I maintain id and availability in this case?
For concurrency I will have to use locks but I am not sure how I can effectively handle contention. Maybe put availabile ids in separate chunks so that each of them can be accessed independently? Any good suggestions?
First of all, you do not want to run over entire map in order to find available ID.
So you can maintain two sets of IDs, the first one for available IDs, and the second one is for allocated IDs.
Then it will make allocation/release pretty easy and fast.
Also you can use ConcurrentMap for both containers (sets), it will reduce the contention.
Edit: Changed bottom sentinel, fixed a bug
First, don't iterate the entire map to find an available ID. You should only need constant time to do it.
What you could do to make it fast is to do this:
Create an int index = 1; for your counter. This is technically the number of IDs generated + 1, and is always > 0.
Create a ArrayDeque<Integer> free = new ArrayDeque<>(); to house the free IDs. Guaranteed constant access.
When you allocate an ID, if the free ID queue is empty, you can just return the counter and increment it (i.e. return index++;). Otherwise, grab its head and return that.
When you release an ID, push the previously used ID to the free deque.
Remember to synchronize your methods.
This guarantees O(1) allocation and release, and it also keeps allocation quite low (literally once per free). Although it's synchronized, it's fast enough that it shouldn't be a problem.
An implementation might look like this:
import java.util.ArrayDeque;
public class IDPool {
int index = 1;
ArrayDeque<Integer> free = new ArrayDeque<>();
public synchronized int acquire() {
if (free.isEmpty()) return index++;
else return free.pop();
}
public synchronized void release(id) {
free.push(id);
}
}
Additionally, if you want to ensure the free ID list is unique (as you should for anything important) as well as persistent, you can do the following:
Use an HashMap<Integer id, Integer prev> to hold all generated IDs. Remember it doesn't need to be ordered or even iterated.
This is technically going to be a stack encoded inside a hash map.
Highly efficient implementations of this exist.
In reality, any unordered int -> int map will do here.
Track the top ID for the free ID set. Remember that 1 can represent nothing and zero used, so you don't have to box it. (IDs are always positive.) Initially, this would just be int top = 1;
When allocating an ID, if there are free IDs (i.e. top >= 2), do the following:
Set the new top to the old head's value in the free map.
Set the old top's value in the map to 0, marking it used.
Return the old top.
When releasing an old ID, do this instead:
If the old ID is already in the pool, return early, so we don't corrupt it.
Set the ID's value in the map to the old top.
Set the new top to the ID, since it's always the last one to use.
The optimized implementation would end up looking like this:
import java.util.HashMap;
public class IDPool {
int index = 2;
int top = 1;
HashMap<Integer, Integer> pool = new HashMap<>();
public synchronized int acquire() {
int id = top;
if (id == 1) return index++;
top = pool.replace(id, 0);
return id;
}
public synchronized void release(id) {
if (pool.getOrDefault(id, 1) == 0) return;
pool.put(id, top);
top = id;
}
}
If need be, you could use a growable integer array instead of the hash map (it's always contiguous), and realize significant performance gains. Matter of fact, that is how I'd likely implement it. It'd just require a minor amount of bit twiddling to do so, because I'd maintain the array's size to be rounded up to the next power of 2.
Yeah...I had to actually write a similar pool in JavaScript because I actually needed moderately fast IDs in Node.js for potentially high-frequency, long-lived IPC communication.
The good thing about this is that it generally avoids allocations (worst case being once per acquired ID when none are released), and it's very amenable to later optimization where necessary.
I'm looking to integrate a scripting engine in my C/C++ program. Currently, I am looking at Google V8.
How do I efficiently handle 64 bit values in V8? My C/C++ program uses 64 bit values extensivly for keeping handlers/pointers. I don't want them separatelly allocated on the heap. There appears to be a V8::External value type. Can I assign it to a Javascript variable and use it as a value type?
function foo() {
var a = MyNativeFunctionReturningAnUnsigned64BitValue();
var b = a; // Hopefully, b is a stack allocated value capable of
// keeping a 64 bit pointer or some other uint64 structure.
MyNativeFunctionThatAcceptsAnUnsigned64BitValue(b);
}
If it is not possible in V8, how about SpiderMonkey? I know that Duktape (Javascript engine) has a non Ecmascript standard 64 bit value type (stack allocated) to host pointers, but I would assume that other engines also wants to keep track of external pointers from within their objects.
No it's not possible and I'm afraid duktape could be violating the spec unless it took some great pains to ensure it's not observable.
You can store pointers in objects so to store 64-bit ints directly on an object you need pointers to have the same size:
Local<FunctionTemplate> function_template = FunctionTemplate::New(isolate);
// Instances of this function have room for 1 internal field
function_template->InstanceTemplate()->SetInternalFieldCount(1);
Local<Object> object = function_template->GetFunction()->NewInstance();
static_assert(sizeof(void*) == sizeof(uint64_t));
uint64_t integer = 1;
object->SetAlignedPointerInInternalField(0, reinterpret_cast<void*>(integer));
uint64_t result = reinterpret_cast<uint64_t>(object->GetAlignedPointerInInternalField(0));
This is of course far from being efficient.
I have found some code in PyCXX that may be buggy.
Is it indeed a bug, and if so, what is the right way to fix it?
Here is the problem:
struct PythonClassInstance
{
PyObject_HEAD
ExtObjBase* m_pycxx_object;
}
:
{
:
table->tp_new = extension_object_new; // PyTypeObject
:
}
:
static PyObject* extension_object_new(
PyTypeObject* subtype, PyObject* args, PyObject* kwds )
{
PythonClassInstance* o = reinterpret_cast<PythonClassInstance *>
( subtype->tp_alloc(subtype,0) );
if( ! o )
return nullptr;
o->m_pycxx_object = nullptr;
PyObject* self = reinterpret_cast<PyObject* >( o );
return self;
}
Now PyObject_HEAD expands to "PyObject ob_base;", so clearly PythonClassInstance trivially extends PyObject to contain an extra pointer (which will point to PyCXX's representation for this PyObject)
tp_alloc allocates memory for storing a PyObject
The code then typecasts this pointer to a PythonClassInstance, laying claim to an extra 4(or 8?) bytes that it does not own!
And then it sets this extra memory to 0.
This looks very dangerous, and I'm surprised the bug has gone unnoticed. The risk is that some future object will get placed in this location (that is meant to be storing the ExtObjBase*).
How to fix it?
PythonClassInstance foo{};
PyObject* tmp = subtype->tp_alloc(subtype,0);
// !!! memcpy sizeof(PyObject) bytes starting from location tmp into location (void*)foo
But I think now maybe I need to release tmp, and I don't think I should be playing with memory directly like this. I feel like it could be jeopardising Python's memory management/garbage collection inbuilt machinery.
The other option is maybe I can persuade tp_alloc to allocate 4 extra bytes (or is it 8 now; enough for a pointer) bypassing in 1 instead of 0.
Documentation says this second parameter is "Py_ssize_t nitems" and:
If the type’s tp_itemsize is non-zero, the object’s ob_size field
should be initialized to nitems and the length of the allocated memory
block should be tp_basicsize + nitemstp_itemsize, rounded up to a
multiple of sizeof(void); otherwise, nitems is not used and the
length of the block should be tp_basicsize.
So it looks like I should be setting:
table->tp_itemsize = sizeof(void*);
:
PyObject* tmp = subtype->tp_alloc(subtype,1);
EDIT: just tried this and it causes a crash
But then the documentation goes on to say:
Do not use this function to do any other instance initialization, not
even to allocate additional memory; that should be done by tp_new.
Now I'm not sure whether this code belongs in tp_new or tp_init.
Related:
Passing arguments to tp_new and tp_init from subtypes in Python C API
Python C-API Object Allocation
The code is correct.
As long as the PyTypeObject for the extension object is properly initialized it should work.
The base class tp_alloc receives subtype so it should know how much memory to allocate by checking the tp_basicsize member.
This is a common Python C/API pattern as demonstrated int the tutorial.
Actually this is a (minor/harmless) bug in PyCXX
SO would like to convert this answer to a comment, which makes no sense I can't awarded the green tick of completion so I comment. So I have to ramble in order to qualify it. blerh.
as i mentioned on subject of this post i found out OOP is slower than Structural Programming(spaghetti code) in the hard way.
i writed a simulated annealing program with OOP then remove one class and write it structural in main form. suddenly it got much faster . i was calling my removed class in every iteration in OOP program.
also checked it with Tabu Search. Same result .
can anyone tell me why this is happening and how can i fix it on other OOP programs?
are there any tricks ? for example cache my classes or something like that?
(Programs has been written in C#)
If you have a high-frequency loop, and inside that loop you create new objects and don't call other functions very much, then, yes, you will see that if you can avoid those news, say by re-using one copy of the object, you can save a large fraction of total time.
Between new, constructors, destructors, and garbage collection, a very little code can waste a whole lot of time.
Use them sparingly.
Memory access is often overlooked. The way o.o. tends to lay out data in memory is not conducive to efficient memory access in practice in loops. Consider the following pseudocode:
adult_clients = 0
for client in list_of_all_clients:
if client.age >= AGE_OF_MAJORITY:
adult_clients++
It so happens that the way this is accessed from memory is quite inefficient on modern architectures because they like accessing large contiguous rows of memory, but we only care for client.age, and of all clients we have; those will not be laid out in contiguous memory.
Focusing on objects that have fields results into data being laid out in memory in such a way that fields that hold the same type of information will not be laid out in consecutive memory. Performance-heavy code tends to involve loops that often look at data with the same conceptual meaning. It is conducive to performance that such data be laid out in contiguous memory.
Consider these two examples in Rust:
// struct that contains an id, and an optiona value of whether the id is divisible by three
struct Foo {
id : u32,
divbythree : Option<bool>,
}
fn main () {
// create a pretty big vector of these structs with increasing ids, and divbythree initialized as None
let mut vec_of_foos : Vec<Foo> = (0..100000000).map(|i| Foo{ id : i, divbythree : None }).collect();
// loop over all hese vectors, determine if the id is divisible by three
// and set divbythree accordingly
let mut divbythrees = 0;
for foo in vec_of_foos.iter_mut() {
if foo.id % 3 == 0 {
foo.divbythree = Some(true);
divbythrees += 1;
} else {
foo.divbythree = Some(false);
}
}
// print the number of times it was divisible by three
println!("{}", divbythrees);
}
On my system, the real time with rustc -O is 0m0.436s; now let us consider this example:
fn main () {
// this time we create two vectors rather than a vector of structs
let vec_of_ids : Vec<u32> = (0..100000000).collect();
let mut vec_of_divbythrees : Vec<Option<bool>> = vec![None; vec_of_ids.len()];
// but we basically do the same thing
let mut divbythrees = 0;
for i in 0..vec_of_ids.len(){
if vec_of_ids[i] % 3 == 0 {
vec_of_divbythrees[i] = Some(true);
divbythrees += 1;
} else {
vec_of_divbythrees[i] = Some(false);
}
}
println!("{}", divbythrees);
}
This runs in 0m0.254s on the same optimization level, — close to half the time needed.
Despite having to allocate two vectors instead of of one, storing similar values in contiguous memory has almost halved the execution time. Though obviously the o.o. approach provides for much nicer and more maintainable code.
P.s.: it occurs to me that I should probably explain why this matters so much given that the code itself in both cases still indexes memory one field at a time, rather than, say, putting a large swath on the stack. The reason is c.p.u. caches: when the program asks for the memory at a certain address, it actually obtains, and caches, a significant chunk of memory around that address, and if memory next to it be asked quickly again, then it can serve it from the cache, rather than from actual physical working memory. Of course, compilers will also vectorize the bottom code more efficiently as a consequence.
The documentation for the new Vista API says that the PowerEnumerate function can be used to Enumerate power schemes, scheme settings, and a wealth of information, The last two parameters are Buffer and BufferSize, what is unclear is what structures or data layout/format is used for the data that is returned in the buffer by the API?
The output buffer is a GUID. You can use this guid to make additional calls to the Power* functions (i.e. recursively walk the tree, query setting names and values, etc).
For example the following code enumerates some setting names from the disk power settings in the current power scheme:
GUID *scheme;
if(ERROR_SUCCESS == PowerGetActiveScheme(NULL, &scheme))
{
GUID buffer;
DWORD bufferSize = sizeof(buffer);
for(int index = 0; ; index++)
{
if(ERROR_SUCCESS == PowerEnumerate(
NULL,
scheme,
&GUID_DISK_SUBGROUP,
ACCESS_INDIVIDUAL_SETTING,
index,
(UCHAR*)&buffer,
&bufferSize))
{
UCHAR displayBuffer[1024];
DWORD displayBufferSize = sizeof(displayBuffer);
if(ERROR_SUCCESS == PowerReadFriendlyName(
NULL,
scheme,
&GUID_DISK_SUBGROUP,
&buffer,
displayBuffer,
&displayBufferSize))
{
wprintf(L"%s\n", (wchar_t*)displayBuffer);
}
}
}
}
As you can see the steps are:
get the current power scheme
enumerate the disk settings in the current scheme
print the friendly name of each enumerated setting
On my machine the output:
Turn off hard disk after
Hard disk burst ignore time
Hopefully this helps get you pointed in the right direction.
This is not production quality code that favors small size and optimistic buffer sizes over robustness.