Is there a linux header for hashtable with spinlock-protected buckets?

I am writing code that rarely creates or removes objects (up to several thousand) but modifies them very frequently in softirq context. These objects are also rarely read (and will probably also be rarely modified) from task context (via procfs: one file per object). Currently my code contains global per-CPU data blocks, each guarded by a spinlock. Each block contains a fixed-size hashtable for object storage.
Obviously the current design is not optimal, especially under very high object update loads: reading objects from procfs will cause data loss in the updating softirqs. I need to rewrite the synchronisation scheme to get rid of the global locks. The most obvious choice is to have a spinlock for each hashtable bucket, which should scale well. The problem is that I'll probably need to use my own hashtable implementation, or at least reimplement several top-level macros (I didn't find any for spinlock-protected buckets in linux/hashtable.h). Should I also look at an RCU-enabled hashtable (although I have no solid understanding of this synchronisation approach yet)?

Buckets with lock protection are declared in the header linux/list_bl.h. They use the lowest bit of the head pointer as a lock bit.
RCU-protected access to buckets is defined alongside the other hash table functions in the header linux/hashtable.h (they have the _rcu suffix).
Choosing between locks and RCU is up to you. Note that RCU itself cannot resolve modify-modify conflicts, and it helps mostly with frequently read data, which does not seem to be your case.
Only one locking function, hlist_bl_lock, is declared for struct hlist_bl_head, and it is unaware of IRQs, so additional steps are needed when the hash table can be used in IRQ or bottom-half context:
Equivalent of spin_lock_irqsave:
    local_irq_save(flags);
    hlist_bl_lock(...);
Equivalent of spin_unlock_irqrestore:
    hlist_bl_unlock(...);
    local_irq_restore(flags);
Equivalent of spin_lock_bh:
    local_bh_disable();
    hlist_bl_lock(...);
Equivalent of spin_unlock_bh:
    hlist_bl_unlock(...);
    local_bh_enable();
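To make the "lowest bit of the head pointer as a lock bit" idea concrete, here is a minimal userspace sketch in C11. It is not the code from linux/list_bl.h: the names (struct bucket, bucket_lock, table_add, table_update, table_read) are hypothetical, the lock is a plain spin on an atomic word, and nothing here deals with IRQs or bottom halves.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 16u

struct node {
    struct node *next;
    unsigned long key;
    long value;
};

/* Bit 0 of the head word is the lock; the other bits are the first-node pointer. */
struct bucket {
    _Atomic uintptr_t head;
};

static struct bucket table[NBUCKETS];

static struct bucket *bucket_for(unsigned long key)
{
    return &table[key % NBUCKETS];
}

static void bucket_lock(struct bucket *b)
{
    /* Spin until this caller is the one that flips bit 0 from 0 to 1. */
    while (atomic_fetch_or_explicit(&b->head, (uintptr_t)1, memory_order_acquire) & 1u)
        ;
}

static void bucket_unlock(struct bucket *b)
{
    atomic_fetch_and_explicit(&b->head, ~(uintptr_t)1, memory_order_release);
}

/* Both helpers below must be called with the bucket lock held. */
static struct node *bucket_first(struct bucket *b)
{
    uintptr_t v = atomic_load_explicit(&b->head, memory_order_relaxed);
    return (struct node *)(v & ~(uintptr_t)1);
}

static void bucket_set_first(struct bucket *b, struct node *n)
{
    assert(((uintptr_t)n & 1u) == 0);   /* nodes must leave bit 0 free */
    atomic_store_explicit(&b->head, (uintptr_t)n | 1u, memory_order_relaxed);
}

/* Insert a caller-allocated node at the front of its bucket. */
static void table_add(struct node *n)
{
    struct bucket *b = bucket_for(n->key);
    bucket_lock(b);
    n->next = bucket_first(b);
    bucket_set_first(b, n);
    bucket_unlock(b);
}

/* Add delta to the value stored under key; 0 on success, -1 if absent. */
static int table_update(unsigned long key, long delta)
{
    struct bucket *b = bucket_for(key);
    int ret = -1;
    bucket_lock(b);
    for (struct node *n = bucket_first(b); n; n = n->next) {
        if (n->key == key) {
            n->value += delta;
            ret = 0;
            break;
        }
    }
    bucket_unlock(b);
    return ret;
}

/* Copy the value stored under key into *out; 0 on success, -1 if absent. */
static int table_read(unsigned long key, long *out)
{
    struct bucket *b = bucket_for(key);
    int ret = -1;
    bucket_lock(b);
    for (struct node *n = bucket_first(b); n; n = n->next) {
        if (n->key == key) {
            *out = n->value;
            ret = 0;
            break;
        }
    }
    bucket_unlock(b);
    return ret;
}
```

Only threads that hash to the same bucket contend with each other. In kernel code the lock and unlock calls would be hlist_bl_lock and hlist_bl_unlock, wrapped in the IRQ or bottom-half disabling shown above.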

Related

data structure for page-aligned entries

I need to store many page-aligned entries, each of which is page-sized; basically I need to collect/bind together memory pages. The only requirement is that, when adding, I must be able to check whether an entry already exists by matching a machine-word-sized key. It is not possible to overwrite an entry; if the same key is used, the existing entry must be found.
The function to add/replace the entry receives some machine-word-sized key (32/64 bits), checks if there is a page-aligned entry which contains the same key. If there is no entry, it is created via mmap, and is added with the required key. The C declaration of the entry looks like this:
struct entry {
uintptr_t key; /* machine-word-sized key */
unsigned char meta[]; /* this space may be used for storing in data structure */
};
The caller receives the key and takes the decision whether to use the existing entry or to allocate a new one. That is, the key must be looked up, and pages must be just collected to allow removal in a loop.
All I need is to add an entry and to remove all entries (no specific order imposed); since I use each entry as epoll.data.ptr, I don't even need fast lookup after adding the entry. Given that each entry has some space for metadata, I'm OK with dedicating some of this space to the payload required to store the entry in the data structure.
I thought about using a hash table. I have no math or crypto background, so generating a good hash is a problem. I tried looking at several well-known hashes, but it seems that they are quite generic, i.e. intended to work with any data. However, my case seems to be very specific: there is no way user would use the table directly and conditions (page-aligned and page-sized entries plus word-sized key) are unlikely to change.
The questions are:
Am I right that a hash table is OK for this case? If yes, what kind of hash would you suggest? If the hash table is linked-list based, it had better be intrusive (i.e. all required metadata should live inside the entry, not outside, like the Linux kernel's struct list_head).
I was also looking at page tables, as described at https://wiki.osdev.org/Paging. However, that page concentrates mostly on how the MMU does its job, and I'm not sure whether I can adapt it to a purely software implementation or how to apply these concepts. Since a machine-word-sized key must be used for inserting the entry, the concepts from this link only show how to organize pages effectively for page-to-page mappings.
I currently need to care only about 4096-byte pages, but the generic case is better (i.e. some algorithm which operates on PAGE_SIZE, be it 4K, 8K or whatever). It would also be nice if the data structure did not assume page-sized entries (though page alignment is a strict requirement, since all memory is obtained via mmap).

What does __rcu stand for in Linux?

I am new to the Linux kernel. My question is about the task_struct.
I know that each task_struct has a reference to its parent process via a pointer to the parent's task_struct.
After looking at the task_struct definition in sched.h, I noticed the following:
struct task_struct __rcu *real_parent; /* real parent process */
I found that it refers to compiler.h. I guess that "__rcu" stands for "read copy update".
Can someone clarify the syntax?

Read-copy-update is an algorithm that enables concurrent access to readers of a data structure without having to lock the structure. It can be read about here.
If the kernel is built with the CONFIG_SPARSE_RCU_POINTER config option, __rcu is defined in include/linux/compiler.h as
# define __rcu __attribute__((noderef, address_space(4)))
This is an annotation for the Sparse code analysis tool, which can warn about certain things the programmer may have overlooked. How this is relevant to RCU is explained in Documentation/RCU/checklist.txt:
__rcu sparse checks: tag the pointer to the RCU-protected data
structure with __rcu, and sparse will warn you if you
access that pointer without the services of one of the
variants of rcu_dereference().
rcu_dereference() returns a pointer that can be safely dereferenced by the code and documents the programmer's intention to protect the pointer with the RCU mechanism, enabling tools like Sparse to check for programming errors and omissions.

RCU stands for "read, copy, update". It is an algorithm that allows multiple readers to access data which can be updated or even deleted at the same time by writers.
Under RCU, writers still have to ensure mutual exclusion with regard to one another, but readers do not acquire a lock. Care has to be taken that the shared data structure is updated in ways that do not violate read integrity. If something has to be removed or deleted, the unlinking of that item from the data structure can be done in parallel with the readers but the actual deletion of the memory has to wait until the last reader has finished.
Rather than making the readers acquire a lock, the whereabouts of the readers are inferred in other ways. Threads can announce their intent to browse the data structure by joining a "read side critical section" which is not really a lock but a kind of global phase.
For instance, suppose that some threads entered the RCU read-side critical section in phase 0. An updater has performed a deletion and wants to free a piece of memory. It simply has to wait for all threads in the system to vacate phase 0. In the meantime, other readers are already looking at the data structure, but when they declare their intent to RCU, they do so by entering the RCU read-side critical section under phase 1. Only the phase 0 threads can possibly still have a pointer to the object that was deleted, and so when the last thread leaves phase 0, the object can safely be deleted. Newly arriving threads in phase 1 do not see the object, because the object has been removed from the data structure, so they have no way to find it.
RCU takes advantage of the idea that we do not need lock objects that are "owned" in order to know information like "no thread can be accessing this object any more".
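The phase description above can be turned into a toy model in C11. This is only a sketch of the two-phase idea, assuming a single updater at a time; it is not how kernel RCU is implemented, and every name in it (current_phase, readers_in, toy_read_lock and so on) is hypothetical.

```c
#include <stdatomic.h>

static atomic_int current_phase;   /* 0 or 1 */
static atomic_int readers_in[2];   /* number of readers inside each phase */

/* Join the current phase; returns the phase joined (pass it to toy_read_unlock). */
static int toy_read_lock(void)
{
    for (;;) {
        int p = atomic_load(&current_phase);
        atomic_fetch_add(&readers_in[p], 1);
        /* If the phase flipped between the load and the increment, back out and retry. */
        if (atomic_load(&current_phase) == p)
            return p;
        atomic_fetch_sub(&readers_in[p], 1);
    }
}

static void toy_read_unlock(int p)
{
    atomic_fetch_sub(&readers_in[p], 1);
}

/* Updater side: flip the phase so new readers join the other one; returns the old phase. */
static int toy_start_new_phase(void)
{
    return atomic_fetch_xor(&current_phase, 1);
}

static int toy_phase_drained(int p)
{
    return atomic_load(&readers_in[p]) == 0;
}

/*
 * Updater side: call after unlinking an object and before freeing it.
 * Returns once every reader that could still hold a pointer to the
 * unlinked object has left its critical section.
 */
static void toy_synchronize(void)
{
    int old = toy_start_new_phase();
    while (!toy_phase_drained(old))
        ;   /* wait for the old phase to empty */
}
```

A reader brackets its accesses with toy_read_lock and toy_read_unlock and takes no lock on the data itself. An updater unlinks the object, calls toy_synchronize, and only then frees the memory.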

Thread-safe (Goroutine-safe) cache in Go

Question 1
I am building/searching for an in-memory cache layer for my server. It is a simple LRU cache that needs to handle concurrent requests (both Gets and Sets).
I have found https://github.com/pmylund/go-cache, which claims to be thread safe.
This is true as far as getting the stored interface goes. But if multiple goroutines request the same data, they all retrieve a pointer (stored in the interface) to the same block of memory. If any goroutine changes the data, this is no longer very safe.
Are there any cache packages out there that tackle this problem?
Question 1.1
If the answer to Question 1 is No, then what would be the suggested solution?
I see two options:
Alternative 1
Solution: Storing the values in a wrapping struct with a sync.Mutex so that each goroutine needs to lock the data before reading/writing to it.
type cacheElement struct { value interface{}; lock sync.Mutex }
Drawbacks: The cache becomes unaware of changes made to data or might even have dropped it out of the cache. One goroutine might also lock others.
Alternative 2
Solution: Make a copy of the data (assuming the data in itself doesn't contain pointers)
Drawbacks: Memory allocation every time a cache Get is performed, more garbage collection.
Sorry for the multipart question. But you don't have to answer all of them. If you have a good answer to Question 1, that would be sufficient for me!

Alternative 2 sounds good to me, but please note that you do not have to copy the data for each cache.Get(). As long as your data can be considered immutable, multiple readers can access it at once.
You only have to create a copy if you intend to modify it. This idiom is called COW (copy on write) and is quite common in concurrent software design. It's especially well suited for scenarios with a high read/write ratio (just like a cache).
So, whenever you want to modify a cached entry, you basically have to:
create a copy of the old cached data, if any.
modify the data (after this step, the data should be considered immutable and must not be changed anymore)
add / replace the existing element in the cache. You could either use the go-cache library you have pointed out earlier (which is based on locks) for that, or write your own lock-free library that simply swaps the pointers to the data element atomically.
At this point any goroutine that performs a cache.Get operation will get the new data. Existing goroutines however, might still be reading the old data. So, your program might operate on many different versions of the same data at once. But don't worry, as soon as all goroutines have finished accessing the old data, the GC will collect it automatically.

tux21b gave a good answer. I'll just point out that you don't have to return pointers to data. You can store non-pointer values in your cache, and Go will pass them by value, which makes a copy. Then your Get and Set methods will be safe, since nothing can actually modify the cache contents.

What is the design rationale behind HandleScope?

V8 requires a HandleScope to be declared in order to clean up any Local handles that were created within scope. I understand that HandleScope will dereference these handles for garbage collection, but I'm interested in why each Local class doesn't do the dereferencing themselves like most internal ref_ptr type helpers.
My thought is that HandleScope can do it more efficiently by dumping a large number of handles all at once rather than one by one as they would in a ref_ptr type scoped class.

Here is how I understand the documentation and the handles-inl.h source code. I, too, might be completely wrong since I'm not a V8 developer and documentation is scarce.
The garbage collector will, at times, move stuff from one memory location to another and, during one such sweep, also check which objects are still reachable and which are not. In contrast to reference-counting types like std::shared_ptr, this is able to detect and collect cyclic data structures. For all of this to work, V8 has to have a good idea about what objects are reachable.
On the other hand, objects are created and deleted quite a lot during the internals of some computation. You don't want too much overhead for each such operation. The way to achieve this is by creating a stack of handles. Each object listed in that stack is available from some handle in some C++ computation. In addition to this, there are persistent handles, which presumably take more work to set up and which can survive beyond C++ computations.
Having a stack of references requires that you use it in a stack-like way. There is no “invalid” mark in that stack. All the objects from bottom to top of the stack are valid object references. The way to ensure this is the HandleScope. It keeps things hierarchical. With reference-counted pointers you can do something like this:
#include <memory>
using std::shared_ptr;

struct Object { explicit Object(int id) : id(id) {} int id; };

shared_ptr<Object>* f() {
    shared_ptr<Object> a(new Object(1));                            // object 1 dies when f returns
    shared_ptr<Object>* b = new shared_ptr<Object>(new Object(2)); // object 2 outlives f
    return b;
}
void g() {
    shared_ptr<Object>* p = f();
    shared_ptr<Object> c = *p;  // share ownership of object 2
    delete p;                   // drop the heap-allocated shared_ptr so object 2 can be freed
}                               // c goes out of scope here: object 2 is destroyed
Here object 1 is created first, then object 2 is created, then the function returns and object 1 is destroyed, then object 2 is destroyed. The key point is that there is a point in time when object 1 is invalid but object 2 is still valid. That's what HandleScope aims to avoid.
Some other GC implementations examine the C stack and look for pointers they find there. This has a good chance of false positives, since stuff which is in fact data could be misinterpreted as a pointer. For reachability this might seem rather harmless, but when rewriting pointers since you're moving objects, this can be fatal. It has a number of other drawbacks, and relies a lot on how the low level implementation of the language actually works. V8 avoids that by keeping the handle stack separate from the function call stack, while at the same time ensuring that they are sufficiently aligned to guarantee the mentioned hierarchy requirements.
To offer yet another comparison: an object referenced by just one shared_ptr becomes collectible (and actually will be collected) once its C++ block scope ends. An object referenced by a v8::Handle will become collectible when leaving the nearest enclosing scope which contained a HandleScope object. So programmers have more control over the granularity of stack operations. In a tight loop where performance is important, it might be useful to maintain just a single HandleScope for the whole computation, so that you won't have to access the handle stack data structure so often. On the other hand, doing so will keep all the objects around for the whole duration of the computation, which would be very bad indeed if this were a loop iterating over many values, since all of them would be kept around till the end. But the programmer has full control, and can arrange things in the most appropriate way.
Personally, I'd make sure to construct a HandleScope:
At the beginning of every function which might be called from outside your code. This ensures that your code will clean up after itself.
In the body of every loop which might see more than three or so iterations, so that you only keep variables from the current iteration.
Around every block of code which is followed by some callback invocation, since this ensures that your stuff can get cleaned if the callback requires more memory.
Whenever I feel that something might produce considerable amounts of intermediate data which should get cleaned (or at least become collectible) as soon as possible.
In general I'd not create a HandleScope for every internal function if I can be sure that every other function calling this will already have set up a HandleScope. But that's probably a matter of taste.

Disclaimer: This may not be an official answer, more of a conjecture on my part; but the V8 documentation is hardly useful on this topic. So I may be proven wrong.
From my understanding, gained from developing various V8-based backend applications, it's a means of handling the difference between the C++ and JavaScript environments.
Imagine the following sequence, in which a self-dereferencing handle would break the system:
JavaScript calls a C++ wrapped V8 function: let's say helloWorld()
C++ function creates a v8::handle of value "hello world =x"
C++ returns the value to the v8 virtual machine
C++ function does its usual cleaning up of resources, including dereferencing of handles
Another C++ function / process, overwrites the freed memory space
V8 reads the handle : and the data is no longer the same "hell!#(#..."
And that's just the surface of the complicated inconsistency between the two. Hence, to tackle the various issues of connecting the JavaScript VM (virtual machine) to the C++ interfacing code, I believe the development team decided to simplify the issue as follows:
All variable handles are stored in "buckets", aka HandleScopes, to be built / compiled / run / destroyed by their respective C++ code when needed.
Additionally, all function handles refer only to C++ static functions (I know this is irritating), which ensures the "existence" of the function call regardless of constructors / destructors.
Think of it from a development point of view: it marks a very strong distinction between the JavaScript VM development team and the C++ integration team (Chrome dev team?), allowing both sides to work without interfering with one another.
Lastly, it could also be for the sake of simplicity when emulating multiple VMs, as V8 was originally meant for Google Chrome. A simple HandleScope creation and destruction whenever we open / close a tab makes for much easier GC management, especially in cases where you have many VMs running (one per tab in Chrome).

Is NSObject's retain method atomic?

Is NSObject's retain method atomic?
For example, when retaining the same object from two different threads, is it promised that the retain count has gone up twice, or is it possible for the retain count to be incremented just once?
Thanks.

NSObject as well as object allocation and retain count functions are thread-safe — see Appendix A: Thread Safety Summary in the Thread Programming Guide.
Edit: I’ve decided to take a look at the open source part of Core Foundation. In CFRuntime.c, __CFDoExternRefOperation() is the function responsible for updating the retain counters. It tests whether the process has more than one thread and, if so, acquires a spin lock before updating the retain count, which makes the operation thread safe.
Interestingly enough, the retain count is not an attribute (or instance variable) of an object in the struct (class) sense. The runtime keeps a separate structure with retain counters. In fact, if I understand it correctly, this structure is an array of hash tables and there’s a spin lock for each hash table. This means that a lock refers to multiple objects that have been placed in the same hash table, i.e., the lock is neither global (for all instances) nor per instance.
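The layout described above (an array of tables with one spin lock per table) can be sketched as follows. This is a toy illustration in C11, not Core Foundation's code: the names (side_table, toy_retain, toy_release, toy_retain_count), the table sizes and the pointer hash are all assumptions, and unlike the behaviour described above it always takes the lock, even in a single-threaded process.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NSTRIPES 8u    /* number of tables, each with its own lock */
#define SLOTS    64u   /* capacity of each table */

struct slot {
    const void *obj;   /* NULL means the slot is free */
    long count;
};

struct stripe {
    atomic_int lock;   /* 0 = unlocked, 1 = locked */
    struct slot slots[SLOTS];
};

static struct stripe side_table[NSTRIPES];

static void stripe_lock(struct stripe *s)
{
    while (atomic_exchange_explicit(&s->lock, 1, memory_order_acquire))
        ;   /* spin */
}

static void stripe_unlock(struct stripe *s)
{
    atomic_store_explicit(&s->lock, 0, memory_order_release);
}

/* Linear probing; claims a free slot if obj is not present. Caller holds the lock. */
static struct slot *find_slot(struct stripe *s, const void *obj, uintptr_t h)
{
    for (unsigned i = 0; i < SLOTS; i++) {
        struct slot *sl = &s->slots[(h + i) % SLOTS];
        if (sl->obj == obj)
            return sl;
        if (sl->obj == NULL) {
            sl->obj = obj;
            return sl;
        }
    }
    return NULL;   /* table full */
}

/* Change obj's count by delta; returns the new count, or -1 if the table is full. */
static long adjust(const void *obj, long delta)
{
    uintptr_t h = (uintptr_t)obj >> 4;   /* drop low bits that are usually zero */
    struct stripe *s = &side_table[h % NSTRIPES];
    long result = -1;

    stripe_lock(s);
    struct slot *sl = find_slot(s, obj, h / NSTRIPES);
    if (sl) {
        sl->count += delta;
        result = sl->count;
    }
    stripe_unlock(s);
    return result;
}

static long toy_retain(const void *obj)       { return adjust(obj, 1); }
static long toy_release(const void *obj)      { return adjust(obj, -1); }
static long toy_retain_count(const void *obj) { return adjust(obj, 0); }
```

Two threads retaining the same object serialize on that object's stripe lock, so both increments are counted. Threads working on objects that hash to different stripes do not contend.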
