On the s390 architecture, the virtual kernel and user address spaces are never present at the same time, so how does copy_to_user work?
copy_to_user for s390 is implemented here: linux/arch/s390/include/asm/uaccess.h.
uaccess is the pointer to the actual copy_[to/from]_user implementation. It is set up here (grep uaccess): arch/s390/kernel/setup.c. There are 4 implementations of uaccess, depending on the mode:
uaccess_mvcos_switch, uaccess_pt, uaccess_mvcos and uaccess_std
For example uaccess_std is here: http://lxr.linux.no/#linux+v3.2.1/arch/s390/lib/uaccess_std.c
 * Standard user space access functions based on mvcp/mvcs and doing
 * interesting things in the secondary space mode.
...
size_t copy_to_user_std(size_t size, void __user *ptr, const void *x)
{
    unsigned long tmp1, tmp2;

    tmp1 = -256UL;
    asm volatile(
        "0: mvcs 0(%0,%1),0(%2),%3\n"
The mvcp/mvcs mechanism is used:
http://publib.boulder.ibm.com/infocenter/zos/v1r11/topic/com.ibm.zos.r11.ieaa500/iea2a57031.htm
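The two ideas above (a table of function pointers chosen at setup time, and a copy loop that moves a bounded chunk per step) can be sketched in plain userspace C. This is only an illustration: every name ending in _sketch is hypothetical, memcpy stands in for the cross-address-space mvcs move, and the 256-byte step is an assumption read off the tmp1 = -256UL line in the excerpt.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the kernel's table of user-access
 * routines; the real structure has more members. */
struct uaccess_ops_sketch {
    size_t (*copy_to_user)(size_t size, void *to, const void *from);
};

/* Stand-in for the "std" variant: move at most 256 bytes per step.
 * memcpy replaces the mvcs instruction, so no address-space switch
 * happens here. Returns the number of bytes NOT copied (always 0 in
 * this sketch, because memcpy cannot fault). */
size_t copy_to_user_std_sketch(size_t size, void *to, const void *from)
{
    unsigned char *dst = to;
    const unsigned char *src = from;

    while (size > 0) {
        size_t chunk = size > 256 ? 256 : size;
        memcpy(dst, src, chunk);
        dst += chunk;
        src += chunk;
        size -= chunk;
    }
    return size;
}

/* Selected once at start-up; callers only ever go through the pointer. */
static struct uaccess_ops_sketch uaccess_sketch = {
    .copy_to_user = copy_to_user_std_sketch,
};

/* Generic entry point that dispatches to whichever variant was selected. */
size_t copy_to_user_sketch(void *to, const void *from, size_t n)
{
    return uaccess_sketch.copy_to_user(n, to, from);
}
```

Swapping in another variant would only mean assigning a different function to uaccess_sketch.copy_to_user; the callers stay unchanged.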
I've been trying to map a page that is both writable AND executable.
mov x0, 0 // start address
mov x1, 4096 // length
mov x2, 7 // rwx
mov x3, 0x1001 // flags
mov x4, -1 // file descriptor
mov x5, 0 // offset
movl x16, 0x200005c // mmap
svc 0
This gives me a 0xD error code (EACCES, which the documentation unhelpfully blames on an invalid file descriptor, although the same documentation says to use '-1'). I think the code is correct; it returns a valid mapping if I just pass 'r--' for permissions.
I know the same code works on Catalina and on the x64 architecture. I tested that the same error happens when SIP is disabled.
For more context, I'm trying to port a FORTH implementation to macOS/ARM64, and this FORTH, like many others, heavily uses self-modifying code and assembles code at runtime. The code that does the assembling/compiling resides in the middle of the newly created code (in fact, part of the compiler will be generated in machine language as part of running FORTH), so it's very hard, if not infeasible, to separate the FORTH JIT compiler (if you call it that) from the generated code.
Now, I really don't want to end up with the answer: "Apple thinks they know better than you, no FORTH for you!", but that is what it looks like so far. Thanks for any help!
You need to toggle the current thread's access to the mapped memory between writable and executable; it cannot be both at the same time. I think it is actually possible to do both with the same memory using 2 different threads, but I haven't tried.
Before you write to the memory you mmap, call this:
pthread_jit_write_protect_np(0);
sys_icache_invalidate(addr, size);
Then when you are done writing to it you can switch back again like this:
pthread_jit_write_protect_np(1);
sys_icache_invalidate(addr, size);
This is the full code I am using right now
#include <stdio.h>
#include <sys/mman.h>
#include <pthread.h>
#include <libkern/OSCacheControl.h>
#include <stdlib.h>
#include <stdint.h>
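/* Map a region for JIT use and leave it writable for the calling thread. */
uint32_t* c_get_memory(uint32_t size) {
    int prot = PROT_READ | PROT_WRITE | PROT_EXEC;
    int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_JIT;
    int fd = -1;
    int offset = 0;
    uint32_t* addr = 0;

    addr = (uint32_t*)mmap(0, size, prot, flags, fd, offset);
    if (addr == MAP_FAILED){
        printf("failure detected\n");
        exit(-1);
    }
    /* 0 = this thread may write to the JIT region (and not execute it) */
    pthread_jit_write_protect_np(0);
    sys_icache_invalidate(addr, size);
    return addr;
}

/* Switch the region back to executable and jump into the generated code. */
void c_jit(uint32_t* addr, uint32_t size) {
    /* 1 = this thread may execute the JIT region (and not write to it) */
    pthread_jit_write_protect_np(1);
    sys_icache_invalidate(addr, size);
    void (*foo)(void) = (void (*)(void))addr;
    foo();
}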
Why does host_statistics64() in OS X 10.6.8 (I don't know if other versions have this problem) return counts for free, active, inactive, and wired memory that don't add up to the total amount of RAM? And why does the number of missing pages vary from sample to sample?
The following output represents the number of pages not classified as free, active, inactive, or wired over ten seconds (sampled roughly once per second).
458
243
153
199
357
140
304
93
181
224
The code that produces the numbers above is:
#include <stdio.h>
#include <mach/mach.h>
#include <mach/vm_statistics.h>
#include <sys/types.h>
#include <sys/sysctl.h>
#include <unistd.h>
#include <string.h>
int main(int argc, char** argv) {
    struct vm_statistics64 stats;
    mach_port_t host = mach_host_self();
    natural_t count = HOST_VM_INFO64_COUNT;
    natural_t missing = 0;
    int debug = argc == 2 ? !strcmp(argv[1], "-v") : 0;
    kern_return_t ret;
    int mib[2];
    long ram;
    natural_t pages;
    size_t length;
    int i;

    mib[0] = CTL_HW;
    mib[1] = HW_MEMSIZE;
    length = sizeof(long);
    sysctl(mib, 2, &ram, &length, NULL, 0);
    pages = ram / getpagesize();

    for (i = 0; i < 10; i++) {
        if ((ret = host_statistics64(host, HOST_VM_INFO64, (host_info64_t)&stats, &count)) != KERN_SUCCESS) {
            printf("oops\n");
            return 1;
        }

        /* updated for 10.9 */
        missing = pages - (
            stats.free_count +
            stats.active_count +
            stats.inactive_count +
            stats.wire_count +
            stats.compressor_page_count
        );

        if (debug) {
            printf(
                "%11d pages (# of pages)\n"
                "%11d free_count (# of pages free) \n"
                "%11d active_count (# of pages active) \n"
                "%11d inactive_count (# of pages inactive) \n"
                "%11d wire_count (# of pages wired down) \n"
                "%11lld zero_fill_count (# of zero fill pages) \n"
                "%11lld reactivations (# of pages reactivated) \n"
                "%11lld pageins (# of pageins) \n"
                "%11lld pageouts (# of pageouts) \n"
                "%11lld faults (# of faults) \n"
                "%11lld cow_faults (# of copy-on-writes) \n"
                "%11lld lookups (object cache lookups) \n"
                "%11lld hits (object cache hits) \n"
                "%11lld purges (# of pages purged) \n"
                "%11d purgeable_count (# of pages purgeable) \n"
                "%11d speculative_count (# of pages speculative (also counted in free_count)) \n"
                "%11lld decompressions (# of pages decompressed) \n"
                "%11lld compressions (# of pages compressed) \n"
                "%11lld swapins (# of pages swapped in (via compression segments)) \n"
                "%11lld swapouts (# of pages swapped out (via compression segments)) \n"
                "%11d compressor_page_count (# of pages used by the compressed pager to hold all the compressed data) \n"
                "%11d throttled_count (# of pages throttled) \n"
                "%11d external_page_count (# of pages that are file-backed (non-swap)) \n"
                "%11d internal_page_count (# of pages that are anonymous) \n"
                "%11lld total_uncompressed_pages_in_compressor (# of pages (uncompressed) held within the compressor.) \n",
                pages, stats.free_count, stats.active_count, stats.inactive_count,
                stats.wire_count, stats.zero_fill_count, stats.reactivations,
                stats.pageins, stats.pageouts, stats.faults, stats.cow_faults,
                stats.lookups, stats.hits, stats.purges, stats.purgeable_count,
                stats.speculative_count, stats.decompressions, stats.compressions,
                stats.swapins, stats.swapouts, stats.compressor_page_count,
                stats.throttled_count, stats.external_page_count,
                stats.internal_page_count, stats.total_uncompressed_pages_in_compressor
            );
        }

        printf("%i\n", missing);
        sleep(1);
    }
    return 0;
}
TL;DR:
host_statistics64() gets its information from different sources, which takes time and can produce inconsistent results.
host_statistics64() gets some information from variables with names like vm_page_foo_count. But not all of these variables are taken into account, e.g. vm_page_stolen_count is not.
The well-known /usr/bin/top adds stolen pages to the number of wired pages. This indicates that these pages should be taken into account when counting pages.
Notes
I'm working on macOS 10.12 with Darwin Kernel Version 16.5.0 xnu-3789.51.2~3/RELEASE_X86_64 x86_64, but all behaviour is completely reproducible.
I'm going to link to a lot of source code from the XNU version I use on my machine. It can be found here: xnu-3789.51.2.
The program you have written is basically the same as /usr/bin/vm_stat, which is just a wrapper for host_statistics64() (and host_statistics()). The corresponding source code can be found here: system_cmds-496/vm_stat.tproj/vm_stat.c.
How does host_statistics64() fit into XNU and how does it work?
As is widely known, the OS X kernel is called XNU (X is Not Unix) and "is a hybrid kernel combining the Mach kernel developed at Carnegie Mellon University with components from FreeBSD and C++ API for writing drivers called IOKit." (https://github.com/opensource-apple/xnu/blob/10.12/README.md)
Virtual memory management (VM) is part of Mach, therefore host_statistics64() is located there. Let's have a closer look at its implementation, which is contained in xnu-3789.51.2/osfmk/kern/host.c.
The function signature is
kern_return_t
host_statistics64(host_t host, host_flavor_t flavor, host_info64_t info, mach_msg_type_number_t * count);
The first relevant lines are
[...]
processor_t processor;
vm_statistics64_t stat;
vm_statistics64_data_t host_vm_stat;
mach_msg_type_number_t original_count;
unsigned int local_q_internal_count;
unsigned int local_q_external_count;
[...]
processor = processor_list;
stat = &PROCESSOR_DATA(processor, vm_stat);
host_vm_stat = *stat;
if (processor_count > 1) {
    simple_lock(&processor_list_lock);
    while ((processor = processor->processor_list) != NULL) {
        stat = &PROCESSOR_DATA(processor, vm_stat);
        host_vm_stat.zero_fill_count += stat->zero_fill_count;
        host_vm_stat.reactivations += stat->reactivations;
        host_vm_stat.pageins += stat->pageins;
        host_vm_stat.pageouts += stat->pageouts;
        host_vm_stat.faults += stat->faults;
        host_vm_stat.cow_faults += stat->cow_faults;
        host_vm_stat.lookups += stat->lookups;
        host_vm_stat.hits += stat->hits;
        host_vm_stat.compressions += stat->compressions;
        host_vm_stat.decompressions += stat->decompressions;
        host_vm_stat.swapins += stat->swapins;
        host_vm_stat.swapouts += stat->swapouts;
    }
    simple_unlock(&processor_list_lock);
}
[...]
We get host_vm_stat, which is of type vm_statistics64_data_t. This is just a typedef of struct vm_statistics64, as you can see in xnu-3789.51.2/osfmk/mach/vm_statistics.h. And we get processor information from the macro PROCESSOR_DATA() defined in xnu-3789.51.2/osfmk/kern/processor_data.h. We fill host_vm_stat while looping through all of our processors by simply adding up the relevant numbers.
As you can see, we find some well-known stats like zero_fill_count or compressions, but not all of the stats covered by host_statistics64().
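The summing loop above boils down to adding per-processor event counters into one host-wide total. A minimal userspace sketch of that pattern follows; the struct and function names are hypothetical and only two of the counters are kept.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU event counters (a tiny subset of the real fields). */
struct cpu_vm_stat_sketch {
    uint64_t faults;
    uint64_t pageins;
};

/* Add up the counters of all CPUs into one host-wide total. */
struct cpu_vm_stat_sketch sum_cpu_stats_sketch(const struct cpu_vm_stat_sketch *cpus,
                                               size_t ncpus)
{
    struct cpu_vm_stat_sketch total = {0, 0};

    for (size_t i = 0; i < ncpus; i++) {
        total.faults += cpus[i].faults;
        total.pageins += cpus[i].pageins;
    }
    return total;
}
```

Each CPU keeps updating its own counters while the loop runs, so the total is a sum of values read at slightly different moments rather than one consistent snapshot.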
The next relevant lines are:
stat = (vm_statistics64_t)info;
stat->free_count = vm_page_free_count + vm_page_speculative_count;
stat->active_count = vm_page_active_count;
[...]
stat->inactive_count = vm_page_inactive_count;
stat->wire_count = vm_page_wire_count + vm_page_throttled_count + vm_lopage_free_count;
stat->zero_fill_count = host_vm_stat.zero_fill_count;
stat->reactivations = host_vm_stat.reactivations;
stat->pageins = host_vm_stat.pageins;
stat->pageouts = host_vm_stat.pageouts;
stat->faults = host_vm_stat.faults;
stat->cow_faults = host_vm_stat.cow_faults;
stat->lookups = host_vm_stat.lookups;
stat->hits = host_vm_stat.hits;
stat->purgeable_count = vm_page_purgeable_count;
stat->purges = vm_page_purged_count;
stat->speculative_count = vm_page_speculative_count;
We reuse stat and make it our output struct. We then fill free_count with the sum of two unsigned int variables called vm_page_free_count and vm_page_speculative_count. We collect the remaining data in the same manner (by using variables named vm_page_foo_count) or by taking the stats from host_vm_stat, which we filled above.
1. Conclusion: We collect data from different sources, either from per-processor information or from variables called vm_page_foo_count. This takes time and may lead to some inconsistency, given that VM activity is very fast and continuous.
Let's take a closer look at the already mentioned variables vm_page_foo_count. They are defined in xnu-3789.51.2/osfmk/vm/vm_page.h as follows:
extern
unsigned int vm_page_free_count; /* How many pages are free? (sum of all colors) */
extern
unsigned int vm_page_active_count; /* How many pages are active? */
extern
unsigned int vm_page_inactive_count; /* How many pages are inactive? */
#if CONFIG_SECLUDED_MEMORY
extern
unsigned int vm_page_secluded_count; /* How many pages are secluded? */
extern
unsigned int vm_page_secluded_count_free;
extern
unsigned int vm_page_secluded_count_inuse;
#endif /* CONFIG_SECLUDED_MEMORY */
extern
unsigned int vm_page_cleaned_count; /* How many pages are in the clean queue? */
extern
unsigned int vm_page_throttled_count;/* How many inactives are throttled */
extern
unsigned int vm_page_speculative_count; /* How many speculative pages are unclaimed? */
extern unsigned int vm_page_pageable_internal_count;
extern unsigned int vm_page_pageable_external_count;
extern
unsigned int vm_page_xpmapped_external_count; /* How many pages are mapped executable? */
extern
unsigned int vm_page_external_count; /* How many pages are file-backed? */
extern
unsigned int vm_page_internal_count; /* How many pages are anonymous? */
extern
unsigned int vm_page_wire_count; /* How many pages are wired? */
extern
unsigned int vm_page_wire_count_initial; /* How many pages wired at startup */
extern
unsigned int vm_page_free_target; /* How many do we want free? */
extern
unsigned int vm_page_free_min; /* When to wakeup pageout */
extern
unsigned int vm_page_throttle_limit; /* When to throttle new page creation */
extern
uint32_t vm_page_creation_throttle; /* When to throttle new page creation */
extern
unsigned int vm_page_inactive_target;/* How many do we want inactive? */
#if CONFIG_SECLUDED_MEMORY
extern
unsigned int vm_page_secluded_target;/* How many do we want secluded? */
#endif /* CONFIG_SECLUDED_MEMORY */
extern
unsigned int vm_page_anonymous_min; /* When it's ok to pre-clean */
extern
unsigned int vm_page_inactive_min; /* When to wakeup pageout */
extern
unsigned int vm_page_free_reserved; /* How many pages reserved to do pageout */
extern
unsigned int vm_page_throttle_count; /* Count of page allocations throttled */
extern
unsigned int vm_page_gobble_count;
extern
unsigned int vm_page_stolen_count; /* Count of stolen pages not acccounted in zones */
[...]
extern
unsigned int vm_page_purgeable_count;/* How many pages are purgeable now ? */
extern
unsigned int vm_page_purgeable_wired_count;/* How many purgeable pages are wired now ? */
extern
uint64_t vm_page_purged_count; /* How many pages got purged so far ? */
That's a lot of statistics, considering that we only get access to a very limited number of them through host_statistics64(). Most of these stats are updated in xnu-3789.51.2/osfmk/vm/vm_resident.c. For example, this function releases pages to the list of free pages:
/*
* vm_page_release:
*
* Return a page to the free list.
*/
void
vm_page_release(
vm_page_t mem,
boolean_t page_queues_locked)
{
[...]
vm_page_free_count++;
[...]
}
Very interesting is extern unsigned int vm_page_stolen_count; /* Count of stolen pages not acccounted in zones */. What are stolen pages? It seems like there are mechanisms to take a page out of some lists even though it wouldn't usually be paged out. One of these mechanisms is the age of a page in the speculative page list. xnu-3789.51.2/osfmk/vm/vm_page.h tells us
* VM_PAGE_MAX_SPECULATIVE_AGE_Q * VM_PAGE_SPECULATIVE_Q_AGE_MS
* defines the amount of time a speculative page is normally
* allowed to live in the 'protected' state (i.e. not available
* to be stolen if vm_pageout_scan is running and looking for
* pages)... however, if the total number of speculative pages
* in the protected state exceeds our limit (defined in vm_pageout.c)
* and there are none available in VM_PAGE_SPECULATIVE_AGED_Q, then
* vm_pageout_scan is allowed to steal pages from the protected
* bucket even if they are underage.
*
* vm_pageout_scan is also allowed to pull pages from a protected
* bin if the bin has reached the "age of consent" we've set
It is indeed void vm_pageout_scan(void) that increments vm_page_stolen_count. You find the corresponding source code in xnu-3789.51.2/osfmk/vm/vm_pageout.c.
I think stolen pages are not taken into account when VM stats are calculated the way host_statistics64() does it.
Evidence that I'm right
The best way to prove this would be to compile XNU by hand with a customized version of host_statistics64(). I had no opportunity to do this but will try soon.
Fortunately, we are not the only ones interested in correct VM statistics. Therefore we should have a look at the implementation of the well-known /usr/bin/top (not contained in XNU), which is completely available here: top-108 (I just picked the macOS 10.12.4 release).
Let's have a look at top-108/libtop.c where we find the following:
static int
libtop_tsamp_update_vm_stats(libtop_tsamp_t* tsamp) {
    kern_return_t kr;
    tsamp->p_vm_stat = tsamp->vm_stat;
    mach_msg_type_number_t count = sizeof(tsamp->vm_stat) / sizeof(natural_t);
    kr = host_statistics64(libtop_port, HOST_VM_INFO64, (host_info64_t)&tsamp->vm_stat, &count);
    if (kr != KERN_SUCCESS) {
        return kr;
    }
    if (tsamp->pages_stolen > 0) {
        tsamp->vm_stat.wire_count += tsamp->pages_stolen;
    }
    [...]
    return kr;
}
tsamp is of type libtop_tsamp_t which is a struct defined in top-108/libtop.h. It contains amongst other things vm_statistics64_data_t vm_stat and uint64_t pages_stolen.
As you can see, static int libtop_tsamp_update_vm_stats(libtop_tsamp_t* tsamp) gets tsamp->vm_stat filled by host_statistics64() as we know it. Afterwards it checks whether tsamp->pages_stolen > 0 and, if so, adds it to the wire_count field of tsamp->vm_stat.
2. Conclusion: We won't get the number of these stolen pages if we just use host_statistics64() as in /usr/bin/vm_stat or your example code!
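To make the bookkeeping concrete, here is a small sketch of the arithmetic: the "missing" count from the question, with the stolen pages folded into the wired count the way top does it. The struct, the function name, and the idea of passing the stolen count in as a parameter are all hypothetical; how to actually obtain that number is not shown here.

```c
#include <stdint.h>

/* Hypothetical snapshot of the page counts used in the question's sum. */
struct page_counts_sketch {
    uint64_t free;
    uint64_t active;
    uint64_t inactive;
    uint64_t wired;
    uint64_t compressor;
};

/* Pages of physical memory not covered by any of the counted classes.
 * `stolen` is added to the wired count first, mirroring what top does. */
int64_t unaccounted_pages_sketch(uint64_t total_pages,
                                 const struct page_counts_sketch *c,
                                 uint64_t stolen)
{
    uint64_t wired = c->wired + stolen;
    uint64_t counted = c->free + c->active + c->inactive
                     + wired + c->compressor;

    /* Signed result: the counters are sampled at different moments, so
     * the sum can momentarily exceed the total. */
    return (int64_t)total_pages - (int64_t)counted;
}
```

With a stolen count of zero this reproduces the question's calculation; a non-zero stolen count shrinks the gap by exactly that many pages.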
Why is host_statistics64() implemented as it is?
Honestly, I don't know. Paging is a complex process, and observing it in real time is therefore a challenging task. Note that there seems to be no bug in its implementation. I think we wouldn't get a 100% accurate number of pages even if we could get access to vm_page_stolen_count. The implementation of /usr/bin/top doesn't count stolen pages if their number is not very big.
An additional interesting thing is a comment above the function static void update_pages_stolen(libtop_tsamp_t *tsamp) which is /* This is for <rdar://problem/6410098>. */. Open Radar is a bug reporting site for Apple software and usually classifies bugs in the format given in the comment. I was unable to find the related bug; maybe it was about missing pages.
I hope this information helps you a bit. If I manage to compile the latest (and customized) version of XNU on my machine, I will let you know. Maybe this brings interesting insights.
Just noticed that if you add compressor_page_count into the mix you get much closer to the actual amount of RAM in the machine.
This is an observation, not an explanation, and links to where this was properly documented would be nice to have!
I have written the following user-level code snippet to test two sub functions, atomic inc and xchg (refer to Linux code).
What I need is just try to perform operations on 32-bit integer, and that's why I explicitly use int32_t.
I assume global_counter will be raced by different threads, while tmp_counter is fine.
#include <stdio.h>
#include <stdint.h>
int32_t global_counter = 10;
/* Increment the value pointed by ptr */
void atomic_inc(int32_t *ptr)
{
__asm__("incl %0;\n"
: "+m"(*ptr));
}
/*
* Atomically exchange the val with *ptr.
* Return the value previously stored in *ptr before the exchange
*/
int32_t atomic_xchg(uint32_t *ptr, uint32_t val)
{
uint32_t tmp = val;
__asm__(
"xchgl %0, %1;\n"
: "=r"(tmp), "+m"(*ptr)
: "0"(tmp)
:"memory");
return tmp;
}
int main()
{
int32_t tmp_counter = 0;
printf("Init global=%d, tmp=%d\n", global_counter, tmp_counter);
atomic_inc(&tmp_counter);
atomic_inc(&global_counter);
printf("After inc, global=%d, tmp=%d\n", global_counter, tmp_counter);
tmp_counter = atomic_xchg(&global_counter, tmp_counter);
printf("After xchg, global=%d, tmp=%d\n", global_counter, tmp_counter);
return 0;
}
My 2 questions are:
Are these two subfunctions written properly?
Will this behave the same when I compile it on a 32-bit or 64-bit platform? For example, could the pointer address have a different length, or could incl and xchgl conflict with the operand?
My understanding of this question is below, please correct me if I'm wrong.
All read-modify-write instructions (e.g. incl, add, xchg) need a lock prefix to be atomic with respect to other CPUs. The lock prefix keeps other CPUs away from the accessed memory by asserting the LOCK# signal on the memory bus.
The __xchg function in the Linux kernel uses no "lock" prefix because xchg with a memory operand always implies lock anyway. http://lxr.linux.no/linux+v2.6.38/arch/x86/include/asm/cmpxchg_64.h#L15
However, the incl used in atomic_inc has no such implicit lock, so a lock prefix is needed.
http://lxr.linux.no/linux+v2.6.38/arch/x86/include/asm/atomic.h#L105
btw, I think you need to copy the *ptr to a volatile variable to avoid gcc optimization.
William
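A sketch of how the two helpers could look with the points above applied: an explicit lock prefix on incl, and plain xchgl, which locks implicitly when one operand is in memory. The names atomic_inc32 and atomic_xchg32 are hypothetical, and the non-x86 fallback through the GCC/Clang __atomic builtins is an addition for portability, not something taken from the kernel sources linked above.

```c
#include <stdint.h>

/* Atomically increment *ptr. */
void atomic_inc32(int32_t *ptr)
{
#if defined(__i386__) || defined(__x86_64__)
    /* incl is a read-modify-write; the lock prefix makes it atomic
     * across CPUs. "cc" is listed because incl changes the flags. */
    __asm__ __volatile__("lock; incl %0"
                         : "+m"(*ptr)
                         :
                         : "memory", "cc");
#else
    /* Fallback (assumption): compiler builtin on other architectures. */
    __atomic_fetch_add(ptr, 1, __ATOMIC_SEQ_CST);
#endif
}

/* Atomically store val in *ptr and return the previous value of *ptr. */
int32_t atomic_xchg32(int32_t *ptr, int32_t val)
{
#if defined(__i386__) || defined(__x86_64__)
    /* xchg with a memory operand is always locked; no prefix needed. */
    __asm__ __volatile__("xchgl %0, %1"
                         : "+r"(val), "+m"(*ptr)
                         :
                         : "memory");
    return val;
#else
    return __atomic_exchange_n(ptr, val, __ATOMIC_SEQ_CST);
#endif
}
```

Both helpers take int32_t *, so the operand size is 32 bits on 32-bit and 64-bit builds alike; only the width of the pointer itself differs, and the "m" constraint lets the compiler pick the right addressing mode.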
One misc device on my system has mode 600; I need it to have mode 666 (rw for all).
chmod works; however, I am wondering how to set the mode at registration time using misc_register(). Is chmod the only way?
Please help, thanks!
Use miscdevice mode with S_IRUGO | S_IWUGO.
struct miscdevice {
    int minor;
    const char *name;
    const struct file_operations *fops;
    struct list_head list;
    struct device *parent;
    struct device *this_device;
    const char *nodename;
    umode_t mode;
};
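S_IRUGO and S_IWUGO are kernel-only helper macros, so the following is a userspace sketch that rebuilds them from the standard sys/stat.h bits to show the value that ends up in the mode field. The SKETCH_ prefix and the function name are hypothetical; in a driver the expression S_IRUGO | S_IWUGO would be assigned to the .mode member of the struct miscdevice passed to misc_register().

```c
#include <sys/stat.h>

/* Userspace re-creation of the kernel-only helper macros; the SKETCH_
 * prefix marks them as illustrative. */
#define SKETCH_S_IRUGO (S_IRUSR | S_IRGRP | S_IROTH)   /* r--r--r-- */
#define SKETCH_S_IWUGO (S_IWUSR | S_IWGRP | S_IWOTH)   /* -w--w--w- */

/* Mode value for a device node readable and writable by everyone. */
unsigned int misc_mode_rw_all(void)
{
    return SKETCH_S_IRUGO | SKETCH_S_IWUGO;
}
```

The result is octal 0666, the mode the question asks for.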
Your module is not supposed to set the access level itself. The mantra is 'Policy belongs in user space, not in the kernel.'
You want to let udev (or whatever alternative for it you use) decide this.
For udev look at man 7 udev.
I'm playing with some runtime function patching, but I have a problem with endianness when writing memory address values. So what I have:
char buf[] = "\xE9\xDE\xAD\xBE\xEF";
At runtime I have to fixup the 0xDEADBEEF to point to the actual address - here is my function to do this:
void FixJMPAddress(BYTE *jump, BYTE *newRoutine) {
    DWORD address;
    DWORD *dwPtr;
    address = (DWORD)newRoutine;
    dwPtr = (DWORD *)&(jump[1]);
    *dwPtr = address;
}
It is invoked like that:
FixJMPAddress(buf, &Something);
Unfortunately when disassembling the end result I get:
E9 60 DA 47 93
instead of
E9 93 47 DA 60
So this is due to the fact that x86 is little-endian, but is there a way in which I can cope with this automatically, without having to write a function which essentially reverses the byte order of the input?
This has nothing to do with endianness. Your code assumes that the operand is stored in the same byte order as the architecture it runs on. That should be fine as long as your code runs on x86.
The real problem is that jmp uses a relative offset, not an absolute one.
To calculate the jmp destination:
dest = address_of_jmp + operand + sizeof(jmp_instruction)
Assuming BYTE* jump points to the actual instruction that will be executed, it should be:
LONG delta = address - (DWORD)jump - 5;
*(LONG*)(jump+1) = delta;
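The same calculation, written as a portable sketch that does not depend on the Windows typedefs. The function names are hypothetical. The address the jmp will execute at is passed separately from the buffer being patched, and the operand bytes are stored explicitly in little-endian order, which is what the E9 rel32 encoding expects.

```c
#include <stdint.h>

/* An x86 near jmp is 1 opcode byte (0xE9) followed by a 4-byte operand. */
enum { JMP_REL32_SIZE = 5 };

/* Operand of the jmp: target minus the address of the instruction that
 * follows the jmp, reduced modulo 2^32 (two's complement for backward
 * jumps). */
uint32_t jmp_rel32(uintptr_t jmp_addr, uintptr_t target)
{
    return (uint32_t)(target - (jmp_addr + JMP_REL32_SIZE));
}

/* Write "jmp target" into the 5 bytes at `jump`, assuming the
 * instruction will execute at address `jmp_addr`. */
void patch_jmp_rel32(uint8_t *jump, uintptr_t jmp_addr, uintptr_t target)
{
    uint32_t rel = jmp_rel32(jmp_addr, target);

    jump[0] = 0xE9;
    jump[1] = (uint8_t)(rel & 0xFF);          /* least significant byte first */
    jump[2] = (uint8_t)((rel >> 8) & 0xFF);
    jump[3] = (uint8_t)((rel >> 16) & 0xFF);
    jump[4] = (uint8_t)((rel >> 24) & 0xFF);
}
```

For a jmp located at 0x1000 that should reach 0x2000, the operand is 0x2000 - 0x1005 = 0xFFB, so the patched bytes read E9 FB 0F 00 00.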