In which systems is the segmentation-with-paging (segmented paging) memory management technique used? Give some examples

I’m a student studying operating systems, and I cannot find a document that covers the history of memory management, such as where segmented paging was applied and why it is no longer used (I saw somebody say that modern OSes prefer plain paging over segmentation). Can anyone help me?

In which systems is segmentation with paging (segmented paging) memory management technique used?
The only systems that were capable of using both segmentation and paging at the same time for memory management that I'm aware of are:
a) OS/2 ( https://en.wikipedia.org/wiki/OS/2 ), where it was optional: each process could choose "segmented with paging" or "flat paging", and most processes chose "flat paging".
b) Old versions of 32-bit Windows when running even older programs written for 16-bit Windows (which expected "segmentation without any paging").
c) A few (maybe just one) hobbyist and/or niche operating systems that are not widely known.
Note 1: This excludes systems that use segmentation for something other than memory management (specifically, recycling an "otherwise unused segment register" as a pointer to thread-local storage to avoid wasting a more useful general-purpose register; see the sketch after these notes).
Note 2: I can't guarantee that there aren't other systems I'm not aware of.
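To make Note 1 concrete, here is a minimal sketch (my illustration, not code from any of the systems above) of how compilers on x86-64 Linux typically recycle the FS segment register as a base pointer for thread-local storage; compiling it with -S and looking for %fs-relative addressing shows the segment register at work:

```cpp
// Minimal sketch: each thread's copy of a thread_local variable is
// typically reached through an %fs-relative address, i.e. the otherwise
// unused FS segment register acts as a base pointer to the thread's TLS
// block, not as a memory-management segment.
#include <cstdio>
#include <thread>

thread_local int counter = 0;   // one copy per thread, addressed via %fs

void worker(int id) {
    ++counter;                  // increments this thread's private copy
    std::printf("thread %d sees counter = %d\n", id, counter);
}

int main() {
    std::thread t1(worker, 1);
    std::thread t2(worker, 2);
    t1.join();
    t2.join();                  // both threads print counter = 1
}
```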

Related

Do low-end embedded systems have process isolation?

I am studying memory management. In particular, I am studying the MMU and the mapping between a process's logical-space pages and RAM frames.
My question is: what about low-end embedded systems? If I'm correct, an MMU can't be used in these systems due to their smaller memory. So how can computers with less available memory avoid the problem of memory being shared between processes?
For embedded systems, the kind of MMU you speak of is only present in high-end parts like PowerPC or Cortex-A.
Low-end to mid-range microcontrollers often do have some simpler form of MMU, though. Not one advanced enough to create virtual memory sections, but a simpler kind that allows remapping of RAM, flash, registers and so on. Similarly, they often have various mechanisms for protecting certain parts of memory from accidental writes. They may or may not be smart enough to notice, MMU-like, that code is executing from data memory or that a data access is happening in code memory. Harvard versus von Neumann architecture also matters here.
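As a hedged illustration of that simpler protection hardware, here is a sketch of configuring one region of the ARMv7-M MPU (Cortex-M3/M4) so a flash region becomes read-only. The register addresses are the architectural ARMv7-M ones, but the flash base and region size are invented, and the attribute encodings are simplified and should be checked against the ARMv7-M reference manual:

```cpp
// Sketch: mark a 1 MiB flash region read-only with the ARMv7-M MPU.
// Unlike a full MMU there is no address translation here, only access
// attributes per region. A real driver would also issue DSB/ISB
// barriers and consult the vendor's memory map.
#include <cstdint>

#define MPU_CTRL (*(volatile std::uint32_t *)0xE000ED94u)
#define MPU_RNR  (*(volatile std::uint32_t *)0xE000ED98u)
#define MPU_RBAR (*(volatile std::uint32_t *)0xE000ED9Cu)
#define MPU_RASR (*(volatile std::uint32_t *)0xE000EDA0u)

void protect_flash() {
    MPU_RNR  = 0;                 // select region 0
    MPU_RBAR = 0x08000000u;       // flash base (vendor-specific, assumed here)
    MPU_RASR = (7u  << 24)        // AP = 0b111: read-only at any privilege
             | (19u << 1)         // SIZE = 19: 2^(19+1) bytes = 1 MiB
             | 1u;                // enable this region
    MPU_CTRL = (1u << 2) | 1u;    // PRIVDEFENA + enable the MPU
}
```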
As for multiple processes in an RTOS, they can't be compared with multiple processes on a desktop computer. Each process in an RTOS typically gets its own stack, but that's about it; no MMU is involved, and it is handled by the RTOS. Code in embedded systems is typically executed directly from flash, so it doesn't make sense to assign chunks of RAM to executable code as on a PC. Several processes will simply execute code from flash, and whether it is the same code or different code per process depends simply on whether they share common code.
Similarly, heap allocation rarely makes sense in embedded systems (see Why should I not use dynamic memory allocation in embedded systems?), so we don't need to create a RAM image for that purpose either. The only things left as unique per process are the stack and separate parts of .data/.bss.
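As an illustration of this model, here is a hedged FreeRTOS sketch: two tasks sharing the same code in flash, each owning nothing but a statically allocated stack. The task names and stack sizes are invented, and configSUPPORT_STATIC_ALLOCATION is assumed to be enabled:

```cpp
// Two RTOS "processes" that share code but own distinct stacks; there is
// no MMU, no per-process address space, and no heap involved.
#include "FreeRTOS.h"
#include "task.h"

#define STACK_WORDS 128

static StackType_t  stackA[STACK_WORDS], stackB[STACK_WORDS];
static StaticTask_t tcbA, tcbB;

static void blink_task(void *arg)     // shared code, executed from flash
{
    (void)arg;
    for (;;) {
        // toggle an LED here; the details are board-specific
        vTaskDelay(pdMS_TO_TICKS(500));
    }
}

void start_tasks(void)
{
    xTaskCreateStatic(blink_task, "A", STACK_WORDS, NULL, 1, stackA, &tcbA);
    xTaskCreateStatic(blink_task, "B", STACK_WORDS, NULL, 1, stackB, &tcbB);
    vTaskStartScheduler();            // never returns on success
}
```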

Why do we have a slow `malloc`?

As far as I know, custom memory managers are used in several medium and large-scale projects. This recent answer on security.se discusses the fact that a custom memory allocator in OpenSSL was included for performance reasons and ultimately made the Heartbleed exploit worse. This old thread here discusses memory allocators, and in particular one of the answers links to an academic paper showing that while people write custom memory allocators for performance reasons because malloc is slow, a general-purpose state-of-the-art allocator easily beats them and causes fewer problems than developers reinventing the wheel in every project.
As someone who does not program professionally, I am curious about how we ended up in this state and why we seem to be stuck there (assuming my view is correct, which is not necessarily true). I imagine there must be subtler issues such as thread safety. Apologies if I am presenting the situation wrongly.
Why is the system malloc not developed and optimized to match the performance of these "general-purpose state-of-the-art allocators"? It seems to me that it should be quite an important feature for OS and standard library writers to focus on. I have heard a lot of talk about the scheduler implementation in the Linux kernel in the past, for instance, and naively I would expect to see more or less the same amount of interest in memory allocators. How come the standard malloc is so bad that so many people feel the need to roll out a custom allocator? If there are alternative implementations that work so much better, why haven't system programmers included them in Linux and Windows, either as the default or as a link-time option?
There are two problems:
No single allocation scheme fits all application needs.
The C library was poorly designed (or not designed). Some non-Unix operating systems have configurable memory managers that allow the application to choose the allocation scheme. In Unix-land, the solution is to link your own malloc/free implementation into your application.
There is no real standard malloc implementation (GNU libc's is probably the closest to standard). The malloc implementations that come with the OS tend to work fine for most applications.
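As a concrete example of "link in your own malloc/free", here is a minimal sketch of the kind of custom scheme projects reach for: a fixed-size free-list pool. It is illustrative only: single-threaded, with invented object and pool sizes, falling back to the system malloc when exhausted:

```cpp
// Fixed-size free-list pool: every allocation is one slot of kObjSize
// bytes; freed slots are chained through their own first bytes.
#include <cstddef>
#include <cstdlib>

constexpr std::size_t kObjSize  = 64;    // every allocation is this size
constexpr std::size_t kPoolObjs = 1024;

alignas(void *) static unsigned char pool[kObjSize * kPoolObjs];
static void *free_list = nullptr;
static std::size_t next_fresh = 0;

void *pool_alloc() {
    if (free_list) {                          // reuse a previously freed slot
        void *p = free_list;
        free_list = *static_cast<void **>(p); // next pointer lives in the slot
        return p;
    }
    if (next_fresh < kPoolObjs)               // carve out a fresh slot
        return &pool[kObjSize * next_fresh++];
    return std::malloc(kObjSize);             // pool exhausted: fall back
}

void pool_free(void *p) {
    if (p >= static_cast<void *>(pool) &&
        p <  static_cast<void *>(pool + sizeof pool)) {
        *static_cast<void **>(p) = free_list; // push onto the free list
        free_list = p;
    } else {
        std::free(p);                         // came from the fallback path
    }
}
```

For a workload that churns through many same-sized objects, an allocation here is a pointer pop instead of a general-purpose search, which is exactly the trade-off that tempts projects into custom allocators.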

Which interpreters manage the memory of their threads inside their own process?

I'm just wondering: which interpreters manage the memory of their threads inside their own process?
With VMware Virtual Machines, when memory is tight, VMware will inflate a "balloon" inside the VM. This allows the guest OS to "intelligently choose" what memory to swap to disk, allowing memory from that OS to be used by other VMs.
One issue is Java, where the OS cannot "see/understand" the memory inside the JRE; when the balloon inflates, the guest OS will effectively swap memory to disk at random and could easily swap out critical JRE functions rather than intelligently choosing which bits of memory to swap out.
My question is, what other interpreters behave in the same way as Java around memory management?
Is Microsoft .NET similar in this regard? Any other interpreters?
I am not sure you can selectively swap out memory used by certain threads, seeing that the main difference between a thread and a process is that threads share a common memory space; the reason to use threads over processes is that they are deemed more lightweight, precisely because you give up the process isolation managed by the OS.
So what you're really asking is, whether any interpreter implements its own algorithm for swapping data to disk. As far as I know, no interpreter designed in recent times does this - given the price of RAM nowadays, it's not a good use of engineering resources. (There is a school of thought, to which I subscribe, that says we should now consider throwing out operating system level disk swapping as well.)
Relational database systems do their own disk swapping of course, partly because they were designed when RAM was more expensive, and partly because they still sometimes deal with unusually large volumes of data.
And, my memory is rusty on this one, but I could almost swear at least one of the old MUD systems also implemented its own swapping code; a big MUD could run to tens of megabytes, and in those days might have had to run in only a few megabytes of memory.

Where are memory management algorithms used?

There is a set of memory management algorithms used in operating system construction, like paging, segmentation, paged segmentation, segmented paging, and others.
Do you know whether they are used outside that area, in not-so-low-level software? Are they used in business applications?
These algorithms are for translating program memory addresses into physical memory addresses. You will very rarely have to think about this in an application. In some extreme cases of applications working on very large datasets, you may have to create a driver-like module to tune memory translation, but all the rest is still up to the operating system.
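To make "translating program memory addresses into physical memory addresses" concrete, here is a toy model of a two-level page-table walk with 4 KiB pages, shaped like the classic 32-bit x86 non-PAE layout. Real MMUs do this in hardware; the structures here are simplified for illustration:

```cpp
// Toy two-level page-table walk: virtual address -> physical address.
#include <cstdint>

constexpr std::uint32_t kPageMask = 0xFFFu;  // low 12 bits: offset in page

struct PageDir {
    std::uint32_t *tables[1024];  // second-level tables, indexed by top 10 bits
};

// Returns true and fills *paddr if the page is mapped and present.
bool translate(const PageDir &dir, std::uint32_t vaddr, std::uint32_t *paddr) {
    std::uint32_t dir_idx   = vaddr >> 22;            // top 10 bits
    std::uint32_t table_idx = (vaddr >> 12) & 0x3FFu; // middle 10 bits

    const std::uint32_t *table = dir.tables[dir_idx];
    if (!table)
        return false;                 // no second-level table: page fault

    std::uint32_t entry = table[table_idx];
    if (!(entry & 1u))
        return false;                 // present bit clear: page fault

    *paddr = (entry & ~kPageMask) | (vaddr & kPageMask);
    return true;
}
```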
You might never write an OS yourself, but if you ever find yourself having to write a device driver, it will be imperative that you understand these issues. So it is still quite useful to understand how these algorithms work.
Now you might be in school thinking, "Yuck, I'll just avoid that stuff". But you really have no idea where a 40-year career in the industry might take you.

Profiling to analyze the low-level memory accesses of my program

I have to analyze the memory accesses of several programs. What I am looking for is a profiler that allows me to see which of my programs is more memory-intensive rather than compute-intensive. I am very interested in the number of accesses to the L1 data cache, to L2, and to main memory.
It needs to be for Linux and, if possible, usable from the command line only. The programming language is C++. If there is any problem with my question, such as you do not understand what I mean or you need more data, please comment below.
Update with the solution
I have accepted Crashworks's answer because it is the only one that provided some of what I was looking for. But the question is still open; if you know a better solution, please answer.
It is not possible to determine all accesses to memory, and it wouldn't make much sense anyway. An access to memory could be the fetch of the next instruction (the program resides in memory), or your program reading or writing a variable, so your program is accessing memory almost all the time.
What could be more interesting for you is to follow the memory usage of your program (both heap and stack). In this case you can use the standard top command.
You could also monitor system calls (e.g. writing to disk, or attaching/allocating a shared memory segment). In that case you should use the strace command.
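For example (myprog stands in for your binary):

```
top -p $(pidof myprog)            # follow CPU and memory usage of a running program
strace -e trace=memory ./myprog   # trace memory-related system calls (mmap, brk, ...)
```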
A more complete way to control everything would be to debug your program with the gdb debugger. It lets you control your program in detail, for instance by setting a watchpoint on a variable so the program is interrupted whenever the variable is read or written (maybe this is what you were looking for). On the other hand, GDB can be tricky to learn, so DDD, a GTK graphical frontend, will help you start out with it.
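A short session showing the watchpoint idea (counter stands in for a variable in your program):

```
$ gdb ./myprog
(gdb) awatch counter    # stop whenever counter is read or written
(gdb) run
```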
Update: What you are looking for is really low-level memory access that is not available at user level (that is the task of the operating system kernel). I am not sure whether even L1 cache management is handled transparently by the CPU and hidden from the kernel.
What is clear is that you need to go down to kernel level, using tools such as KDB or KDBG.
Update 2: It seems that the Linux kernel does handle the CPU cache, but only the L1 cache. The book Understanding the Linux Virtual Memory Manager explains how the Linux kernel's memory management works, including some of the guts of L1 cache handling.
If you are running Intel hardware, then VTune for Linux is probably the best and most full-featured tool available to you.
Otherwise, you may be obliged to read the performance-counter MSRs directly, using the perfctr library. I haven't any experience with this on Linux myself, but I found a couple of papers that may help you (assuming you are on x86 -- if you're running PPC, please reply and I can provide more detailed answers):
http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel5/11169/35961/01704008.pdf?temp=x
http://www.cise.ufl.edu/~sb3/files/pmc.pdf
In general these tools can't tell you exactly which lines your cache misses occur on, because they work by polling a counter. What you will need to do is poll the "L1 cache miss" counter at the beginning and end of each function you're interested in to see how many misses occur inside that function, and of course you can do so hierarchically. This can be simplified, e.g., by inventing a class that records the counter value on entering scope and computes the delta on leaving scope.
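Here is a sketch of that scope-based approach in C++. read_l1_miss_counter() is a hypothetical stand-in for however you read the hardware counter (the perfctr library, a raw RDPMC wrapper, ...), stubbed out so the example compiles:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical counter read; replace the stub with a real performance-
// counter query (e.g. via the perfctr library).
static std::uint64_t read_l1_miss_counter() { return 0; }

class ScopedMissCounter {
public:
    explicit ScopedMissCounter(const char *label)
        : label_(label), start_(read_l1_miss_counter()) {}
    ~ScopedMissCounter() {   // runs on leaving scope: report the delta
        std::printf("%s: %llu L1 misses\n", label_,
                    (unsigned long long)(read_l1_miss_counter() - start_));
    }
private:
    const char   *label_;
    std::uint64_t start_;
};

void interesting_function() {
    ScopedMissCounter guard("interesting_function");
    // ... the work being measured ...
}
```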
VTune's instrumented mode does this for you automatically across the whole program. The equivalent AMD tool is CodeAnalyst. Valgrind claims to be an open-source cache profiler, but I've never used it myself.
Cachegrind (part of the valgrind suite) may be suitable.
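If you try it, usage looks roughly like this (myprog stands in for your binary; the output file name includes the process id):

```
valgrind --tool=cachegrind ./myprog   # simulates the caches and counts misses
cg_annotate cachegrind.out.<pid>      # per-function breakdown of the counts
```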
Do you need something more than what the Unix command top will provide? It shows CPU usage and memory usage of Linux programs in an easy-to-read format.
If you need something more specific, such as a profiler, the programming language (Java/C++/etc.) will help determine which profiler is best for your situation.
