Is ipset thread-safe?

I'd like to know if adding/removing entries with ipset is thread-safe. For instance, if I have two concurrent processes performing these operations:
ipset -A myset 1.1.1.1 # process 1's operation
ipset -A myset 2.2.2.2 # process 2's operation
do I need to add a synchronization mechanism that makes the second process wait for the first one to complete, to avoid somehow corrupting my ipset (e.g., ending up with 1.2.1.2 in my set), or is this already handled by ipset?
Thanks!

No, you do not need to add any locking mechanism in user space for this. The kernel module already has a lock around each set, which is write-locked during add and delete operations.
Here is the relevant code from the ipset kernel module; notice the write lock and unlock.
static int
call_ad(struct sock *ctnl, struct sk_buff *skb, struct ip_set *set,
        struct nlattr *tb[], enum ipset_adt adt,
        u32 flags, bool use_lineno)
{
    int ret;
    u32 lineno = 0;
    bool eexist = flags & IPSET_FLAG_EXIST, retried = false;

    do {
        write_lock_bh(&set->lock);
        ret = set->variant->uadt(set, tb, adt, &lineno, flags, retried);
        write_unlock_bh(&set->lock);
        retried = true;
    ...


cgroup skb egress ebpf hook - skb->cb[0-4] updated values are reset in tc egress ebpf hook

I am just updating skb->cb with constant values (0xfeed) and trying to read the same packet data at the tc egress eBPF hook, but it is all zero. Am I missing something here?
Is there any way to send metadata between hooks with the skb itself? (An LRU_HASH map may help, but having in-band metadata would be faster, I think.)
SEC("cgroup/skb")
int cgrp_dump_pkt(struct __sk_buff *skb) {
void *data_end = (void *) (long) skb->data_end;
void *data = (void *) (long) skb->data;
if(data < data_end)
{
skb->cb[0] = 0xfeed;
skb->cb[1] = 0xfeed;
skb->cb[2] = 0xfeed;
skb->cb[3] = 0xfeed;
skb->cb[4] = 0xfeed;
bpf_printk("hash:%x mark:%x pri:%x", skb->hash, skb->mark, skb->priority);
bpf_printk("cb0:%x cb1:%x cb2:%x", skb->cb[0], skb->cb[1], skb->cb[2]);
bpf_printk("cb3:%x cb4:%x", skb->cb[3], skb->cb[4]);
}
return 1;
}
The above hook is attached to a Docker instance's cgroup, and I am trying to see the same packets by attaching a hook at tc egress (which just prints the packet info, including cb[0-4]). The cb[0-4] values are shown as 0. Is this expected behavior? How do I pass metadata between eBPF hooks, from cgroup skb to tc egress or lower layers?
I've only ever used the cb fields to pass data across BPF tail calls. To pass data from one hook point to another in the stack, you could use skb->mark (32-bit value).
Do make sure that no other software is using the same bits you want to use though. See https://github.com/fwmark/registry for example.
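A rough sketch of that idea (untested; the program and section names are made up, and it assumes your kernel lets cgroup/skb programs write skb->mark — if the verifier rejects the write, set the mark from another hook instead):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("cgroup/skb")
int cgrp_tag_pkt(struct __sk_buff *skb)
{
    skb->mark = 0xfeed;  /* the mark travels with the skb through the stack */
    return 1;            /* 1 = allow the packet in cgroup/skb programs */
}

SEC("tc")
int tc_read_tag(struct __sk_buff *skb)
{
    bpf_printk("mark:%x", skb->mark);  /* prints feed for tagged packets */
    return 0;                          /* TC_ACT_OK */
}

char LICENSE[] SEC("license") = "GPL";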

for_each_possible_cpu macro in vmalloc_init(): does the code run on only one CPU, or on every CPU?

This is not strictly about programming, but I'll ask it here.
In the Linux start_kernel() function, mm_init() calls the vmalloc_init() function.
Inside that function I see code like this:
void __init vmalloc_init(void)
{
    struct vmap_area *va;
    struct vm_struct *tmp;
    int i;

    /*
     * Create the cache for vmap_area objects.
     */
    vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);

    for_each_possible_cpu(i) {
        struct vmap_block_queue *vbq;
        struct vfree_deferred *p;

        vbq = &per_cpu(vmap_block_queue, i);
        spin_lock_init(&vbq->lock);
        INIT_LIST_HEAD(&vbq->free);
        p = &per_cpu(vfree_deferred, i);
        init_llist_head(&p->list);
        INIT_WORK(&p->wq, free_work);
    }

    /* Import existing vmlist entries. */
    for (tmp = vmlist; tmp; tmp = tmp->next) {
        va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
        if (WARN_ON_ONCE(!va))
            continue;
        va->va_start = (unsigned long)tmp->addr;
        va->va_end = va->va_start + tmp->size;
        va->vm = tmp;
        insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
    }

    /*
     * Now we can initialize a free vmap space.
     */
    vmap_init_free_space();
    vmap_initialized = true;
}
I'm not sure whether this code runs on every CPU (core) or just on the first CPU.
If this code runs on every SMP core, how is the code inside the for_each_possible_cpu loop executed?
The SMP setup seems to be done before this function.
start_kernel() calls mm_init() which calls vmalloc_init(). Only the first (boot) CPU is active at that point. Later, start_kernel() calls arch_call_rest_init() which calls rest_init().
rest_init() creates a kernel thread for the init task with entry point kernel_init(). kernel_init() calls kernel_init_freeable(). kernel_init_freeable() eventually calls smp_init() to activate the remaining CPUs.
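As a sketch of that call sequence:

start_kernel()
    mm_init()
        vmalloc_init()          <- runs on the boot CPU only
    arch_call_rest_init()
        rest_init()
            kernel_init() thread
                kernel_init_freeable()
                    smp_init()  <- remaining CPUs activated here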
Every macro in the for_each_cpu family is just a wrapper around a for() loop whose iterator is a CPU index.
E.g., the core macro of the family is defined as
#define for_each_cpu(cpu, mask)                  \
    for ((cpu) = -1;                             \
        (cpu) = cpumask_next((cpu), (mask)),     \
        (cpu) < nr_cpu_ids;)
Each macro in the for_each_cpu family uses its own CPU mask, which is just a set of bits corresponding to CPU indices. E.g., the mask used by the for_each_possible_cpu macro has a bit set for every CPU that could ever be enabled in the current machine session.
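For example, for_each_possible_cpu is defined in terms of the core macro and the mask of possible CPUs (the exact spelling varies across kernel versions):

#define for_each_possible_cpu(cpu) for_each_cpu((cpu), cpu_possible_mask)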

How to replace a procfs read function that returned both EOF and the read byte count?

I am working on updating our kernel drivers to work with Linux kernel 4.4.0 on Ubuntu 16.04. The drivers last worked with Linux kernel 3.9.2.
In one of the modules, we have procfs entries created to read/write the on-board fan monitoring values. Fan monitoring is used to read/write the CPU or GPU temperature/modulation, etc. values.
The module uses the following API to create procfs entries:
struct proc_dir_entry *create_proc_entry(const char *name, umode_t mode,
                                         struct proc_dir_entry *parent);
Something like:
struct proc_dir_entry *proc_entry =
    create_proc_entry("fmon_gpu_temp", 0644, proc_dir);
proc_entry->read_proc = read_proc;
proc_entry->write_proc = write_proc;
Now, read_proc is implemented something like this:
static int read_value(char *buf, char **start, off_t offset, int count,
                      int *eof, void *data)
{
    int len = 0;
    int idx = (int)data;

    if (idx == TEMP_FANCTL)
        len = sprintf(buf, "%d.%02d\n", fmon_readings[idx] / TEMP_SAMPLES,
                      fmon_readings[idx] % TEMP_SAMPLES * 100 / TEMP_SAMPLES);
    else if (idx == TEMP_CPU) {
        int i;
        len = sprintf(buf, "%d", fmon_readings[idx]);
        for (i = 0; i < FCTL_MAX_CPUS && fmon_cpu_temps[i]; i++) {
            len += sprintf(buf + len, " CPU%d=%d", i, fmon_cpu_temps[i]);
        }
        len += sprintf(buf + len, "\n");
    }
    else if (idx >= 0 && idx < READINGS_MAX)
        len = sprintf(buf, "%d\n", fmon_readings[idx]);

    *eof = 1;
    return len;
}
This read function definitely assumes that the user has provided enough buffer space to store the temperature value; this is handled correctly in the user-space program. Also, every call to this function returns the value in its entirety, so there is no support for (or need of) subsequent reads of the same temperature value.
Plus, if I use the 'cat' program on this procfs entry from the shell, it correctly displays the value. This works, I think, because eof is set to true and the count of bytes read is returned.
Newer Linux kernels no longer support this API.
My question is:
How can I port this to the new procfs API while keeping the functionality the same: every read should return the whole value, and the 'cat' program should still work fine rather than going into an infinite loop?
The primary user-space interface for reading files on Linux is read(2). Its counterpart in kernel space is the .read function in struct file_operations.
Every other mechanism for reading a file in kernel space (read_proc, seq_file, etc.) is actually a (parametrized) implementation of that .read function.
The only way for the kernel to signal EOF to user space is to return 0 as the number of bytes read.
Even the read_proc implementation you have for the 3.9 kernel actually implements the eof flag by returning 0 on the next invocation, and cat really does perform a second read() to find out that the file has ended.
(Moreover, cat performs more than two invocations of read: the first with a count of 1, the second with a count of the page size minus 1, and the last with the remaining count.)
The simplest way to implement a "one-shot" read is to use seq_file in single_open() mode.
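Here is a rough sketch of the port (untested, written against the 4.4-era API; fmon_show, fmon_open, and fmon_fops are made-up names, and the data pointer reuses the per-entry index scheme from your code):

#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>

static int fmon_show(struct seq_file *m, void *v)
{
    int idx = (long)m->private;  /* per-entry index, see proc_create_data() */

    seq_printf(m, "%d\n", fmon_readings[idx]);
    return 0;  /* seq_file takes care of EOF and short reads for you */
}

static int fmon_open(struct inode *inode, struct file *file)
{
    return single_open(file, fmon_show, PDE_DATA(inode));
}

static const struct file_operations fmon_fops = {
    .owner   = THIS_MODULE,
    .open    = fmon_open,
    .read    = seq_read,
    .llseek  = seq_lseek,
    .release = single_release,
};

/* In the module init code, replacing create_proc_entry(): */
proc_create_data("fmon_gpu_temp", 0644, proc_dir, &fmon_fops,
                 (void *)(long)TEMP_FANCTL);

With single_open(), the show callback emits the whole value once; the second read() that cat issues then gets back 0 bytes, which is exactly the EOF behavior you were emulating with *eof = 1.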

kernel module's reference count

I have a module to maintain, and it appears to have an issue with reference counts kept in the kernel: I'm unable to rmmod my module after I kill the daemon that opens three raw sockets to it. Interestingly, after starting the daemon, 'lsmod' displays 6 references to the module; I'd expect only three.
This is occurring on an ARM-based embedded system with Linux 2.6.31, and rmmod doesn't have a 'force' mode to try to unload the module (which is not a good idea anyway).
I've analyzed the code and here is what I have:
1) The module creates a new socket address family AF_HSL and registers it with the kernel:
static struct proto_ops SOCKOPS_WRAPPED (hsl_ops) = {
    family:     AF_HSL,
    owner:      THIS_MODULE,
    release:    hsl_sock_release,
    bind:       _hsl_sock_bind,
    connect:    sock_no_connect,
    socketpair: sock_no_socketpair,
    accept:     sock_no_accept,
    getname:    _hsl_sock_getname,
    poll:       datagram_poll,
    ioctl:      sock_no_ioctl,
    listen:     sock_no_listen,
    shutdown:   sock_no_shutdown,
    setsockopt: sock_no_setsockopt,
    getsockopt: sock_no_getsockopt,
    sendmsg:    _hsl_sock_sendmsg,
    recvmsg:    _hsl_sock_recvmsg,
    mmap:       sock_no_mmap,
    sendpage:   sock_no_sendpage,
};

static struct net_proto_family hsl_family_ops = {
    family: AF_HSL,
    create: _hsl_sock_create,
    owner:  THIS_MODULE
};
...
static int
_hsl_sock_create (struct net *net, struct socket *sock, int protocol)
{
    struct sock *sk = NULL;

    sock->state = SS_UNCONNECTED;
    sk = sk_alloc (current->nsproxy->net_ns, AF_HSL, GFP_KERNEL, &_prot);
    if (sk == NULL)
        goto ERR;
    sock->ops = &SOCKOPS_WRAPPED(hsl_ops);
    sock_init_data (sock, sk);
    sock_hold (sk); /* XXX */
    ...
}

static void
_hsl_sock_destruct (struct sock *sk)
{
    struct hsl_sock *hsk, *phsk;

    if (!sk)
        return;
    ...
    sock_orphan (sk);
    skb_queue_purge (&sk->sk_receive_queue);
    sock_put (sk);
}

int
hsl_sock_release (struct socket *sock)
{
    struct sock *sk = sock->sk;

    /* Here goes logic to destroy net_devices */
    ...
    _hsl_sock_destruct (sk);
    sock->sk = NULL;
    return 0;
}
2) The daemon creates sockets this way:
fd = socket(AF_HSL, SOCK_RAW, 0);
bind();
getsockname();
However, I don't think _hsl_sock_create() should be calling sock_hold(): that bumps the socket's reference count, but it is already set to 1 by sock_init_data(), so at socket-delete time sock_put() would decrement it by only 1 and the socket would never be freed and completely removed from the system.
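If that reading is right, the refcount lifecycle would be (illustrative only):

/* sk_alloc()                  allocates the sock
 * sock_init_data(sock, sk) -> sk->sk_refcnt == 1
 * sock_hold(sk)            -> sk->sk_refcnt == 2   (the suspect call)
 *   ... socket in use ...
 * sock_put(sk)             -> sk->sk_refcnt == 1   (never reaches 0,
 *                             so the sock is never freed)
 */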
So I experimented and removed sock_hold(). Now killing the daemon results in all references being removed and 'rmmod' succeeds; however, the number of references after starting the daemon is still 3.
I have also checked the code of sock_create(), which calls the internal function __sock_create(), which in turn calls try_module_get() and takes a module reference. This appears to be the only place I've found that explicitly increments the module's refcnt.
I am still confused. Would anyone try to help understand what is happening?
Looking forward to hearing from you.
Mark

inter-process condition variables in Windows

I know that I can use a condition variable to synchronize work between threads, but is there anything like this (a condition variable) to synchronize work between processes? Thanks in advance.
Use a pair of named Semaphore objects, one to signal and one as a lock. Named sync objects on Windows are automatically inter-process, which takes care of that part of the job for you.
A class like this would do the trick.
#include <string>
#include <limits>
#include <stdexcept>
#include <windows.h>

class InterprocessCondVar {
private:
    HANDLE mSem;     // Used to signal waiters
    HANDLE mLock;    // Semaphore used as inter-process lock
    int    mWaiters; // # current waiters
protected:
public:
    InterprocessCondVar(std::string name)
        : mSem(NULL), mLock(NULL), mWaiters(0)
    {
        // NOTE: You'll need a real "security attributes" pointer
        // for child processes to see the semaphore!
        // "CreateSemaphore" will do nothing but give you the handle if
        // the semaphore already exists.
        mSem = CreateSemaphore(NULL, 0, std::numeric_limits<LONG>::max(),
                               name.c_str());
        std::string lockName = name + "_Lock";
        // Initial count 1 so the first Wait/Signal/Broadcast can take the lock
        mLock = CreateSemaphore(NULL, 1, 1, lockName.c_str());
        if (!mSem || !mLock) {
            throw std::runtime_error("Semaphore create failed");
        }
    }
    virtual ~InterprocessCondVar() {
        CloseHandle(mSem);
        CloseHandle(mLock);
    }
    bool Signal();
    bool Broadcast();
    bool Wait(unsigned int waitTimeMs = INFINITE);
};
A genuine condition variable offers 3 calls:
1) "Signal()": Wake up ONE waiting thread
bool InterprocessCondVar::Signal() {
    WaitForSingleObject(mLock, INFINITE);           // Lock
    mWaiters--;                                     // Lower wait count
    bool result = ReleaseSemaphore(mSem, 1, NULL);  // Signal 1 waiter
    ReleaseSemaphore(mLock, 1, NULL);               // Unlock
    return result;
}
2) "Broadcast()": Wake up ALL threads
bool InterprocessCondVar::Broadcast() {
    WaitForSingleObject(mLock, INFINITE);                  // Lock
    bool result = ReleaseSemaphore(mSem, mWaiters, NULL);  // Signal all waiters
    mWaiters = 0;                                          // All waiters cleared
    ReleaseSemaphore(mLock, 1, NULL);                      // Unlock
    return result;
}
3) "Wait()": Wait for the signal
bool InterprocessCondVar::Wait(unsigned int waitTimeMs) {
    WaitForSingleObject(mLock, INFINITE);  // Lock
    mWaiters++;                            // Add to wait count
    ReleaseSemaphore(mLock, 1, NULL);      // Unlock
    // This wait must happen outside the lock
    return (WaitForSingleObject(mSem, waitTimeMs) == WAIT_OBJECT_0);
}
This should ensure that Broadcast() ONLY wakes up threads and processes that are already waiting, not all future ones too. It is also a VERY heavyweight object. For condition variables that don't need to exist across processes, I would create a different class with the same API and use unnamed objects.
You could use a named semaphore or a named mutex. You can also share data between processes via shared memory.
For a project I'm working on, I needed a condition variable and mutex implementation that can handle dead processes and won't cause other processes to end up in a deadlock in such a case. I implemented the mutex with the native named mutexes provided by the Win32 API because they can indicate whether a dead process owns the lock by returning WAIT_ABANDONED. The next issue was that I also needed a condition variable I could use across processes together with these mutexes. I started off with the suggestion from user3726672 but soon discovered several issues in which the state of the counter variable and the state of the semaphore end up being invalid.
After doing some research, I found a paper by Microsoft Research which explains exactly this scenario: Implementing Condition Variables with Semaphores. It uses a separate semaphore for every single thread to solve the mentioned issues.
My final implementation uses a portion of shared memory in which I store a ring buffer of thread IDs (the IDs of the waiting threads). The processes then create their own handle for every named semaphore/thread ID they have not encountered yet, and cache it. The signal/broadcast/wait functions are then quite straightforward and follow the idea of the proposed solution in the paper. Just remember to remove your thread ID from the ring buffer if your wait operation fails or results in a timeout.
For the Win32 implementation I recommend reading the following documents:
Semaphore Objects and Using Mutex Objects, as those describe the functions you'll need for the implementation.
Alternatives: boost::interprocess has some robust mutex emulation support, but it is based on spin locks and caused a very high CPU load on our embedded system, which was the final reason we looked into our own implementation.
#user3726672: Could you update your post to point to this post or to the referenced paper?
Best Regards,
Michael
Update:
I also had a look at an implementation for Linux/POSIX. It turns out pthreads already provides everything you'll need: just put a pthread_cond_t and a pthread_mutex_t in some shared memory to share them with the other process, and initialize both with PTHREAD_PROCESS_SHARED. Also set PTHREAD_MUTEX_ROBUST on the mutex.
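A minimal sketch of that setup (error handling omitted; the shared-memory name "/my_condvar" and the function name are made up):

#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

struct shared_sync {
    pthread_mutex_t mtx;
    pthread_cond_t  cond;
};

struct shared_sync *create_shared_sync(void)
{
    /* Back the structure with POSIX shared memory */
    int fd = shm_open("/my_condvar", O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(struct shared_sync));
    struct shared_sync *s = mmap(NULL, sizeof(*s), PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);

    /* Process-shared, robust mutex */
    pthread_mutexattr_t ma;
    pthread_mutexattr_init(&ma);
    pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&ma, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(&s->mtx, &ma);

    /* Process-shared condition variable */
    pthread_condattr_t ca;
    pthread_condattr_init(&ca);
    pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
    pthread_cond_init(&s->cond, &ca);
    return s;
}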
Yes. You can use a (named) Mutex for that. Use CreateMutex to create one. You then wait for it (with functions like WaitForSingleObject), and release it when you're done with ReleaseMutex.
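A minimal sketch (the name "Global\\MyAppLock" and the function are made up for illustration):

#include <windows.h>

void with_named_mutex(void)
{
    /* The same name in two processes yields the same kernel object */
    HANDLE h = CreateMutexA(NULL, FALSE, "Global\\MyAppLock");
    if (h == NULL)
        return;

    if (WaitForSingleObject(h, INFINITE) == WAIT_OBJECT_0) {
        /* ... critical section shared across processes ... */
        ReleaseMutex(h);
    }
    CloseHandle(h);
}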
For reference, Boost.Interprocess (documentation for version 1.59) has condition variables and much more. Please note, however, that as of this writing, "Win32 synchronization is too basic".
