I was trying to debug some issues and I want to conjure up a scenario in which a physical memory page is swapped out. Is there any trick to do this?
Linux kernel: 3.10.x
Platform: arm
Thanks a lot.
If you really mean "in the Linux kernel" then yes. There are functions that cause a page to be swapped that you could invoke directly. See pageout() as a starting point. I suspect it would be non-trivial to get this all set up just right.
If you mean "is there a way to do it from user-space", the answer is no. Well, not directly (AFAIK anyway). Your best bet would be to not touch the page in question further, meanwhile allocate lots of other memory (this could be done in a separate process) and touch all those other pages so that the one you care about becomes least recently used and hence a candidate for paging.
Not sure how -- from user space -- you would detect that it actually had been paged though. The point of virtual memory is to hide that from you. I suppose you could have a high likelihood of knowing it had been paged after the fact by timing how long it takes to access the memory once you finally do so.
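For what it's worth, here is a rough sketch of that approach. The PRESSURE_MB value is an arbitrary number picked for illustration; whether the victim page actually gets swapped depends entirely on how much RAM and swap the box has and what else is running:
/* Sketch: create memory pressure so an untouched page becomes a paging
 * candidate, then time the first access to guess whether it was paged out.
 * PRESSURE_MB is an arbitrary number; tune it to the machine's RAM. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define PRESSURE_MB 2048

static long long ns_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char *victim = malloc(page);
    memset(victim, 0xAA, page);            /* fault it in, then leave it alone */

    size_t sz = (size_t)PRESSURE_MB * 1024 * 1024;
    char *hog = malloc(sz);
    if (hog)                               /* dirty lots of other pages */
        for (size_t i = 0; i < sz; i += (size_t)page)
            hog[i] = (char)i;

    long long t0 = ns_now();
    volatile char c = victim[0];           /* first touch after the pressure */
    long long t1 = ns_now();
    (void)c;

    printf("first access took %lld ns (a large number suggests a major fault)\n",
           t1 - t0);
    return 0;
}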
Is there a way to implement dynamically adapting caches in userspace? I would like my programs to allocate caches that use some fair share of the available physical memory. If the system is running out of physical memory, caches should be dropped as chosen by the program, and in no case should they be swapped out. It is preferable that no special privileges be needed, so it is not necessary to actually lock the memory. The program should just get to know that pages are swapped out, so it will not use them. All in all, it should work something like the caches and buffers implemented in the kernel. Can you point out general ideas and APIs for how that can be done? The platforms I am interested in are Linux and Windows.
Why do you think there is any reasonable way to define "fair share"? It's not really a great UX when the application tries to know too much: far better would be to find a sensible, minimal default, and offer the user a config option to adjust it. Even better is to provide the user with stats to show how well the current-sized cache is doing - bigger isn't always better.
There is no "cooperative memory management" API in Linux - no way for the kernel to tell user-space to use less memory. The closest I can think of is that the (relatively new) memory cgroup controller can provide a "notifier" when a memory limit is reached (rather than OOM-killing the allocating process.) That's not exactly nice to use, but then again, any such interface is going to flirt with being race/deadlock-prone. Polling with mincore might work in somewhat contrived/constrained situations, but given that the app has no way to understand the changing system-wide demand for memory, it's not going to work well.
There is a set of memory management algorithms used in operating system construction, like paging, segmentation, segmented paging, paged segmentation, and others.
Do you know if they are used outside that area, in software that is not so low-level? Are they used in business applications?
These algorithms are for translating program memory addresses into physical memory addresses. You will very rarely have to think about them in an application. In some extreme cases of applications working on very large datasets you may have to create a driver-like module to tune memory translation, but all the rest is still up to the operating system.
You might never write an OS yourself, but if you ever find yourself having to write a device driver, it will be imperative that you understand these issues. So it is still quite useful to understand how these algorithms work.
Now you might be in school thinking, "Yuck, I'll just avoid that stuff." But you really have no idea where a 40-year career in the industry might take you.
I'm reviewing some code and feel suspicious of the technique being used.
In a Linux environment, there are two processes that attach multiple shared memory segments. The first process periodically loads a new set of files to be shared, and writes the shared memory id (shmid) into a location in the "master" shared memory segment. The second process continually reads this "master" location and uses the shmid to attach the other shared segments.
On a multi-CPU host, it seems to me it might be implementation dependent what happens if one process tries to read the memory while it's being written by the other. But perhaps hardware-level bus locking prevents mangled bits on the wire? It wouldn't matter if the reading process got a very-soon-to-be-changed value; it would only matter if the read was corrupted to something that was neither the old value nor the new value. This is an edge case: only 32 bits are being written and read.
Googling for shmat hasn't led me to anything definitive in this area.
I strongly suspect it's not safe or sane, and what I'd really like is some pointers to articles that describe the problems in detail.
It is legal -- as in the OS won't stop you from doing it.
But is it smart? No, you should have some type of synchronization.
There wouldn't be "mangled bits on the wire". They will come out either as ones or zeros. But there's nothing to say that all your bits will be written out before another process tries to read them. And there are NO guarantees on how fast they'll be written vs how fast they'll be read.
You should always assume there is absolutely NO relationship between the actions of 2 processes (or threads for that matter).
Hardware-level bus locking does not happen unless you arrange for it correctly. It can be harder than expected to make your compiler / library / OS / CPU get it right. Synchronization primitives are written to make sure it happens right.
Locking will make it safe, and it's not that hard to do. So just do it.
#unknown - The question has changed somewhat since my answer was posted. However, the behavior you describe is definitely platform (hardware, OS, library and compiler) dependent.
Without giving the compiler specific instructions, you are actually not guaranteed to have 32 bits written out in one shot. Imagine a situation where the 32-bit word is not aligned on a word boundary. This unaligned access is acceptable on x86, and in that case the access is turned into a series of aligned accesses by the CPU.
An interrupt can occur between those operations. If a context switch happens in the middle, some of the bits are written and some aren't. Bang, you're dead.
Also, let's think about 16-bit CPUs or 64-bit CPUs. Both are still popular and don't necessarily work the way you think.
So you actually can have a situation where "some other CPU core picks up a half-written word-sized value". Write your code as if this is expected to happen whenever you are not using synchronization.
Now, there are ways to perform your writes to make sure that you get a whole word written out. Those methods fall under the category of synchronization, and creating synchronization primitives is the type of thing that's best left to the library, compiler, OS, and hardware designers. Especially if you are interested in portability (which you should be, even if you never port your code).
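To make that concrete, here is roughly what asking the compiler/library for help looks like with C11 atomics; the struct layout and function names are invented for the example, not taken from the poster's code:
/* Sketch: publishing the 32-bit shmid through the master segment with
 * C11 atomics rather than a plain int. The struct layout and function
 * names are invented for the example. */
#include <stdatomic.h>
#include <stdint.h>

struct master_seg {
    _Atomic int32_t current_shmid;    /* naturally aligned, atomic by type */
};

/* Writer: release store, so writes to the new data segment made before
 * this call are visible to anyone who sees the new id. */
void publish_shmid(struct master_seg *m, int32_t shmid)
{
    atomic_store_explicit(&m->current_shmid, shmid, memory_order_release);
}

/* Reader: acquire load, pairing with the release store above. */
int32_t read_shmid(struct master_seg *m)
{
    return atomic_load_explicit(&m->current_shmid, memory_order_acquire);
}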
The problem's actually worse than some of the people have discussed. Zifre is right that on current x86 CPUs memory writes are atomic, but that is rapidly ceasing to be the case - memory writes are only atomic for a single core - other cores may not see the writes in the same order.
In other words if you do
a = 1;
b = 2;
on CPU 2 you might see location b modified before location 'a' is. Also, if you're writing a value that's larger than the native word size (32 bits on a 32-bit x86 processor) the writes are not atomic - so the high 32 bits of a 64-bit write will hit the bus at a different time from the low 32 bits of the write. This can complicate things immensely.
Use a memory barrier and you'll be ok.
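If it helps, such a barrier for the a/b example above could look something like this (GCC builtin shown; in the real scenario a and b would live in the shared segment):
/* Sketch of a full memory barrier with GCC's builtin; in the real
 * scenario a and b would live in the shared memory segment. */
#include <stdint.h>

volatile int32_t a, b;

void writer(void)
{
    a = 1;
    __sync_synchronize();   /* the store to a is globally visible before... */
    b = 2;                  /* ...any observer can see b == 2 */
}

void reader(void)
{
    while (b != 2)
        ;                   /* spin until the writer's second store appears */
    __sync_synchronize();   /* pair the barrier on the reading side */
    /* at this point a is guaranteed to read as 1 */
}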
You need locking somewhere. If not at the code level, then at the hardware memory cache and bus.
You are probably OK on a post-PentiumPro Intel CPU. From what I just read, Intel made their later CPUs essentially ignore the LOCK prefix on machine code. Instead the cache coherency protocols make sure that the data is consistent between all CPUs. So if the code writes data that doesn't cross a cache-line boundary, it will work. The order of memory writes that cross cache-lines isn't guaranteed, so multi-word writes are risky.
If you are using anything other than x86 or x86_64 then you are not OK. Many non-Intel CPUs (and perhaps Intel Itanium) gain performance by using explicit cache coherency machine commands, and if you do not use them (via custom ASM code, compiler intrinsics, or libraries) then writes to memory via cache are not guaranteed to ever become visible to another CPU or to occur in any particular order.
So just because something works on your Core2 system doesn't mean that your code is correct. If you want to check portability, try your code also on other SMP architectures like PPC (an older PowerPC Mac or a Cell blade) or an Itanium or an IBM Power or ARM. The Alpha was a great CPU for revealing bad SMP code, but I doubt you can find one.
Two processes, two threads, two cpus, two cores all require special attention when sharing data through memory.
This IBM article provides an excellent overview of your options.
Anatomy of Linux synchronization methods
Kernel atomics, spinlocks, and mutexes
by M. Tim Jones (mtj#mtjones.com), Consultant Engineer, Emulex
http://www.ibm.com/developerworks/linux/library/l-linux-synchronization.html
I actually believe this should be completely safe (but it depends on the exact implementation). Assuming the "master" segment is basically an array, as long as the shmid can be written atomically (if it's 32 bits then probably okay), and the second process is just reading, you should be okay. Locking is only needed when both processes are writing, or the values being written cannot be written atomically. You will never get a corrupted (half-written) value. Of course, there may be some strange architectures that can't handle this, but on x86/x64 it should be okay (and probably also ARM, PowerPC, and other common architectures).
Read Memory Ordering in Modern Microprocessors, Part I and Part II
They give the background to why this is theoretically unsafe.
Here's a potential race:
Process A (on CPU core A) writes to a new shared memory region
Process A puts that shared memory ID into a shared 32-bit variable (that is 32-bit aligned - any compiler will try to align like this if you let it).
Process B (on CPU core B) reads the variable. Assuming 32-bit size and 32-bit alignment, it shouldn't get garbage in practice.
Process B tries to read from the shared memory region. Now, there is no guarantee that it'll see the data A wrote, because you missed out the memory barrier. (In practice, there probably happened to be memory barriers on CPU B in the library code that maps the shared memory segment; the problem is that process A didn't use a memory barrier.)
Also, it's not clear how you can safely free the shared memory region with this design.
With the latest kernel and libc, you can put a pthreads mutex into a shared memory region. (This does need a recent version with NPTL - I'm using Debian 5.0 "lenny" and it works fine). A simple lock around the shared variable would mean you don't have to worry about arcane memory barrier issues.
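A rough sketch of that setup, with an invented segment layout (build with -pthread):
/* Sketch: a process-shared pthread mutex living in the master segment
 * (layout invented for the example). */
#include <pthread.h>
#include <stdint.h>

struct master_seg {
    pthread_mutex_t lock;
    int32_t current_shmid;
};

/* Run once by whichever process creates the master segment. */
int master_init(struct master_seg *m)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    int rc = pthread_mutex_init(&m->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

void set_shmid(struct master_seg *m, int32_t id)
{
    pthread_mutex_lock(&m->lock);
    m->current_shmid = id;
    pthread_mutex_unlock(&m->lock);
}

int32_t get_shmid(struct master_seg *m)
{
    pthread_mutex_lock(&m->lock);
    int32_t id = m->current_shmid;
    pthread_mutex_unlock(&m->lock);
    return id;
}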
I can't believe you're asking this. No, it's not necessarily safe. At the very least, this will depend on whether the compiler produces code that will atomically set the shared memory location when you set the shmid.
Now, I don't know Linux, but I suspect that a shmid is 16 to 64 bits. That means it's at least possible that all platforms would have some instruction that could write this value atomically. But you can't depend on the compiler doing this without being asked somehow.
Details of memory implementation are among the most platform-specific things there are!
BTW, it may not matter in your case, but in general, you have to worry about locking, even on a single CPU system. In general, some device could write to the shared memory.
I agree that it might work - so it might be safe, but not sane.
The main question is whether this low-level sharing is really needed - I am not an expert on Linux, but I would consider using, for instance, a FIFO queue for the master shared memory segment, so that the OS does the locking work for you. Producers and consumers usually need queues for synchronization anyway.
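For instance, something like this sketch (the FIFO path is made up; a pipe write of a single int is well under PIPE_BUF, so the kernel delivers it atomically):
/* Sketch: hand each new shmid to the reader through a named pipe so the
 * kernel serialises access. The FIFO path is made up. */
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#define FIFO_PATH "/tmp/shmid_fifo"

/* Producer: announce a freshly loaded segment. */
int announce_shmid(int shmid)
{
    mkfifo(FIFO_PATH, 0600);                       /* harmless if it exists */
    int fd = open(FIFO_PATH, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, &shmid, sizeof shmid);   /* < PIPE_BUF, so atomic */
    close(fd);
    return n == (ssize_t)sizeof shmid ? 0 : -1;
}

/* Consumer: block until the next id arrives. */
int wait_for_shmid(int *shmid)
{
    int fd = open(FIFO_PATH, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, shmid, sizeof *shmid);
    close(fd);
    return n == (ssize_t)sizeof *shmid ? 0 : -1;
}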
Legal? I suppose. Depends on your "jurisdiction". Safe and sane? Almost certainly not.
Edit: I'll update this with more information.
You might want to take a look at this Wikipedia page, particularly the section on "Coordinating access to resources". In particular, the Wikipedia discussion essentially describes a check-then-act failure: unlocked access to shared resources can, even when each individual access is atomic, invalidate the conclusion you drew from an earlier check. Essentially, in the time between checking whether you CAN modify the resource and actually modifying it, the resource may get externally modified, and therefore the confidence inherent in the conditional check is busted.
I don't believe anybody here has discussed how much of an impact lock contention can have over the bus, especially on bus-bandwidth-constrained systems.
Here is an article about this issue in some depth; it discusses some alternative scheduling algorithms that reduce the overall demand for exclusive access to the bus, which in some cases increases total throughput by over 60% compared with a naive scheduler (when considering the cost of an explicit lock prefix instruction or an implicit xchg/cmpxchg). The paper is not the most recent work and is light on real code (dang academics), but it's worth the read and worth considering for this problem.
More recent CPU instruction sets provide alternative operations beyond a simple lock.
Jeffr, from FreeBSD (author of many internal kernel components), discusses monitor and mwait, two instructions added with SSE3, which in a simple test case showed an improvement of 20%. He later postulates:
So this is now the first stage in the adaptive algorithm, we spin a while, then sleep at a high power state, and then sleep at a low power state depending on load.
...
In most cases we're still idling in hlt as well, so there should be no negative effect on power. In fact, it wastes a lot of time and energy to enter and exit the idle states so it might improve power under load by reducing the total cpu time required.
I wonder what would be the effect of using pause instead of hlt.
From Intel's TBB:
ALIGN 8
PUBLIC __TBB_machine_pause
__TBB_machine_pause:
L1:
dw 090f3H; pause
add ecx,-1
jne L1
ret
end
Art of Assembly also covers synchronization without the use of the lock prefix or xchg. I haven't read that book in a while and won't speak directly to its applicability in a user-land protected-mode SMP context, but it's worth a look.
Good luck!
If the shmid has some type other than volatile sig_atomic_t then you can be pretty sure that separate threads will get in trouble even on the very same CPU. If the type is volatile sig_atomic_t then you can't be quite as sure, but you still might get lucky because multithreading can do more interleaving than signals can do.
If the shmid crosses cache lines (partly in one cache line and partly in another) then, while the writing CPU is writing, you may well find a reading CPU reading part of the new value and part of the old value.
This is exactly why instructions like "compare and swap" were invented.
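For example, with the GCC builtin (a sketch; the retry policy is up to you):
/* Sketch: update the shared id only if nobody else changed it since we
 * last read it. */
#include <stdint.h>

int try_update_shmid(volatile int32_t *slot, int32_t expected, int32_t desired)
{
    /* Nonzero if the swap happened; zero means another writer got there first. */
    return __sync_bool_compare_and_swap(slot, expected, desired);
}
A writer that loses the race just re-reads the slot and decides whether its update still makes sense.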
Sounds like you need a Reader-Writer Lock : http://en.wikipedia.org/wiki/Readers-writer_lock.
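A sketch of setting one up so it works across processes (the lock itself has to live in the shared segment):
/* Sketch: initialising a reader-writer lock that works across processes. */
#include <pthread.h>

int rwlock_init_shared(pthread_rwlock_t *rw)   /* rw must be in shared memory */
{
    pthread_rwlockattr_t attr;
    pthread_rwlockattr_init(&attr);
    pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    int rc = pthread_rwlock_init(rw, &attr);
    pthread_rwlockattr_destroy(&attr);
    return rc;
}
/* Readers wrap the shmid read in pthread_rwlock_rdlock()/unlock();
 * the single writer uses pthread_rwlock_wrlock()/unlock(). */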
The answer is - it's absolutely safe to do reads and writes simultaneously.
It is clear that the shm mechanism provides bare-bones tools for the user. All access control must be taken care of by the programmer. Locking and synchronization is being kindly provided by the kernel, this means the user have less worries about race conditions. Note that this model provides only a symmetric way of sharing data between processes. If a process wishes to notify another process that new data has been inserted to the shared memory, it will have to use signals, message queues, pipes, sockets, or other types of IPC.
From the Shared Memory in Linux article.
The latest Linux shm implementation just uses copy_to_user and copy_from_user calls, which are synchronised with the memory bus internally.
I've got a proof-of-concept program which is doing some interprocess communication simply by writing and reading from the HD. Yes, I know this is really slow; but it was the easiest way to get things up and running. I had always planned on coming back and swapping out that part of the code with a mechanism that does all the IPC (interprocess communication) in RAM.
With the arrival of solid-state disks, do you think that bottleneck is likely to become negligible?
Notes: It's server software written in C# calling some bare metal number-crunching libraries written in FORTRAN.
The short answer is probably no. A famous researcher named Jim Gray gave a talk about storage and performance which included this great analogy. Assume your brain is the processor: accessing a register takes 1 clock tick, which is roughly equivalent to the information being in your brain. Accessing memory takes about 100 clock ticks, roughly equivalent to getting data from somewhere in the city you live in. Accessing a standard disk takes roughly 10^6 ticks, which is the equivalent of the data being on Pluto. Where does solid state fit in? Current SSD technology is somewhere between 10^4 and 10^5 ticks, depending on who you ask. While they can be an order of magnitude faster than disk, there is still a tremendous gap between reading from memory and reading from an SSD. This is why the answer to your question is likely no: as fast as SSDs become, they will still be significantly slower than memory (at least in the foreseeable future).
I think that you will find the bottlenecks are just moved. As we come to expect higher throughput, we write programs with higher demands.
This pushes bottlenecks to buses, caches and parts other than the read/write mechanism (which is last in the chain anyway).
With a process not bound by disk I/O, I think you would find it becomes bound by the scheduler, which limits the number of read/write instructions (as with all process instructions).
To take full advantage of limitless I/O speed you would require real-time response and very aggressive management of caches and so on.
When disks get faster, so do RAM, processors and the demands placed on them. The bottleneck is the same; the workload just gets bigger.
I don't believe that it will change the way I/O bound applications are written the tiniest bit. Having faster processors did not make people pick bubblesort as a sorting algorithm either.
The external memory hierarchies are an inherent problem of computing.
Joel on Software has an article about his experience upgrading to solid state. Not exactly the same issue you have, but my takeaway was:
Solid state drives can significantly speed up I/O bound operations, but many things (like compiling) are still cpu-bound.
I have a solid-state drive, and no, this won't eliminate I/O as a bottleneck. The SSD is nice, but it's not that nice.
It's actually not hard to master your system's IPC primitives or to build something on top of TCP. But if you want to stick with your disk stuff and make it faster, ramdisk or tmpfs might do the trick.
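On most Linux systems the tmpfs variant is literally just a path change; for example (file name made up):
/* Sketch: the tmpfs variant of the existing file-based IPC; the only
 * change is the directory. /dev/shm is a tmpfs mount on most Linux
 * systems, so the file never touches the disk. The file name is made up. */
#include <fcntl.h>

int open_scratch(void)
{
    return open("/dev/shm/ipc-scratch.dat", O_RDWR | O_CREAT, 0600);
}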
No. Current SSDs are designed as disk replacements. Every layer, from the SATA controller to the filesystem driver, treats them as storage.
This is not a problem of the underlying technology, NAND flash. When NAND flash is directly mapped into memory and uses a rotating log storage system instead of a file system based on named files, it can be quite fast. The fundamental problem is that NAND flash only performs well on block updates. File metadata updates cause expensive read-modify-write operations. Also, NAND blocks are much bigger than typical disk blocks, which doesn't help performance either.
For these reasons, the future of SSDs will be better cached SSDs. DRAM will hide the overhead of poor mapping and a small supercap backup will allow the SSD to commit writes faster.
Solid state drives do make one important improvement to I/O throughput: on solid state disks, block locality is less of an issue than on rotating media. This means that high-performance I/O-bound applications can shift their focus from structures that arrange data to be accessed in order to structures that optimize I/O in other ways, such as keeping data in a single block by means of compression. That said, even solid state drives benefit from linear access patterns, because they can prefetch subsequent blocks into a read cache before the application requests them.
A noticeable regression on solid state disks is that writes take longer than reads, although both are still generally faster than rotating drives, and the difference is narrowing with newer, high end solid state disks.
No, sadly not. They do make it more interesting though: SSD drives have very fast reads and no seek time, but their writes are almost as slow as normal hard drives. This means that you will want to read most of the time. However, when you do write to the drive you should write as much as possible in the same spot, since SSD drives can only write entire blocks at a time.
How about using a RAM drive instead of the disk? You would not have to rewrite anything; just point it to a different file system. Windows and Linux both have them. Make sure you have lots of memory on the machine and create a virtual disk with enough space for your processing. I did this for a system that listened to multiple protocols on a network tap. I never knew what packet I was going to get, and there was too much data to keep it in memory. I would write it to the RAM drive and, when something was completed, I would move it and let another process get it off the RAM drive and onto a physical disk. I was able to keep up with really busy server-class network cards this way. Good luck!
Something to keep in mind here:
If the communication involves frequent messages and is on the same system you'll get very good performance because Windows won't actually write the data out in the first place.
I had to resort to it once and discovered this: the drive light did NOT come on at all so long as the data kept getting written.
but it was the easiest way to get things up and running.
I usually find that it's much cheaper to think well once with your own head than to make the CPU think millions of times in vain.