I am trying to hack an old Unix kernel. I just want to implement the MMU and TLB in software. Can someone tell me what the best data structures and algorithms are for building one? I saw lots of people using splay trees because it's easy to implement LRU with them. Is there a better data structure? What is the most efficient way of translating virtual to physical addresses in software? Assume it's the x86 architecture and the translation is a basic page table translation.
You mention efficiency. Is that the goal you're engineering towards? If you're not constrained to any particular goal, just try to get it working. I'd do a single level page table if you can, either direct or fully associative. It sounds like you're past this though.
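If you do end up with a single-level table, here is a minimal sketch of what software translation can look like in C, assuming 4 KiB pages, a 32-bit virtual address space, and a flat array indexed by virtual page number (all names here are illustrative, not from any real kernel):

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                      /* 4 KiB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_PAGES  (1u << 20)              /* 32-bit VA space / 4 KiB */

typedef struct {
    uint32_t frame;                        /* physical frame number */
    bool     present;
} pte_t;

static pte_t page_table[NUM_PAGES];        /* one flat, single-level table */

/* Translate a 32-bit virtual address; returns false on a "page fault". */
bool translate(uint32_t vaddr, uint32_t *paddr)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & (PAGE_SIZE - 1);

    if (!page_table[vpn].present)
        return false;                      /* caller handles the fault */

    *paddr = (page_table[vpn].frame << PAGE_SHIFT) | offset;
    return true;
}
```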
Most efficient is going to depend on size-speed tradeoffs and what kind of locality you expect. Do you have any critical apps profiled or is this just messing around to try out some implementations? Inverted page tables are used on some newer architectures. I would take that as an indication that someone spending a lot of time working on this thinks it's a good way to go.
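For flavor, an inverted page table keeps one entry per physical frame and hashes (process, virtual page) to find it. A rough sketch with chained collision handling, again with made-up names and sizes:

```c
#include <stdint.h>

#define NUM_FRAMES 4096

typedef struct {
    uint32_t pid;       /* owning process */
    uint32_t vpn;       /* virtual page number mapped into this frame */
    int      next;      /* next frame index in the hash chain, -1 = end */
    int      valid;
} ipt_entry_t;

static ipt_entry_t ipt[NUM_FRAMES];        /* one entry per physical frame */
static int hash_anchor[NUM_FRAMES];        /* hash bucket -> first frame in chain */

void ipt_init(void)
{
    for (int i = 0; i < NUM_FRAMES; i++) {
        hash_anchor[i] = -1;               /* empty buckets */
        ipt[i].valid = 0;
    }
}

/* Returns the frame number holding (pid, vpn), or -1 if not resident. */
int ipt_lookup(uint32_t pid, uint32_t vpn)
{
    unsigned h = (pid * 31 + vpn) % NUM_FRAMES;
    for (int f = hash_anchor[h]; f != -1; f = ipt[f].next)
        if (ipt[f].valid && ipt[f].pid == pid && ipt[f].vpn == vpn)
            return f;
    return -1;    /* miss: fall back to the per-process backing structure */
}
```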
This is a subject that I have never found a suitable answer to, and so I was wondering if the helpful people of Stack Overflow may be able to answer this.
First of all: I'm not asking for a tutorial or anything, merely a discussion because I have not seen much information online about this.
Basically what I'd like to know is how one designs a new type of partition format, and then how it can be interfaced with the operating system for use.
And better yet, what makes one partition format better than another? Is it performance, security, filename or file-size limits? Or is there more to it?
It's just something I've always wondered about. I'd love to dabble in creating one just for education purposes someday.
OK, although the question is broad, I'll try to dig into it:
Assume that we are talking about a 'filesystem' as opposed to certain 'raw' partition formats such as swap formats etc. A filesystem should be able to map low-level OS, BIOS, network or custom calls into a coherent file-and-folder abstraction with names that can be used by user applications. So, in your case, a 'partition format' should be something that presents low-level disk sectors and cylinders and their contents as a file-and-folder abstraction.
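To make that abstraction concrete, here is a toy sketch of the kind of on-disk metadata involved; the field names are invented for illustration, and real filesystems such as ext2 define far more:

```c
#include <stdint.h>

/* Toy superblock: describes how the raw partition is carved up. */
struct toy_superblock {
    uint32_t magic;              /* identifies the filesystem type */
    uint32_t block_size;         /* e.g. 4096 bytes */
    uint32_t total_blocks;
    uint32_t inode_table_block;  /* where the inode table starts on disk */
    uint32_t root_inode;         /* inode number of "/" */
};

/* Toy inode: maps a file or directory onto raw disk blocks. */
struct toy_inode {
    uint32_t mode;               /* file type and permissions */
    uint32_t size_bytes;
    uint32_t direct_block[12];   /* block numbers holding the file's data */
    uint32_t indirect_block;     /* a block full of further block numbers */
};
```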
Along the way, if you can provide features such as reduced fragmentation, redundant node indexes, journalling to prevent data loss, survival across loss of power, working around bad sectors, redundant data, hardware mirroring, etc., then it can be considered better than another format that does not provide such features. If you can optimise how file sizes map onto disk sectors and clusters while accommodating very small and very large files, that would be a plus.
Thorough bullet-proof security and testing would be considered essential for any non-experimental use.
To start hacking on your own, work with one of the slightly older filesystems like ext2. You would need considerable build/compile/kernel skills to get going, but nothing monumental.
Does it make sense to implement your own version of data structures and algorithms in your language of choice even if they are already supported, knowing that care has been taken in tuning them for the best possible performance?
Sometimes - yes. You might need to optimise the data structure for your specific case, or give it some specific extra functionality.
A Java example is Apache Lucene (a mature, widely used information retrieval library). Although the Map<S,T> interface and implementations already exist, for performance reasons their use was not good enough, since they box each int into an Integer; a more optimized IntToIntMap was developed for this purpose instead of using a Map<Integer,Integer>.
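The same idea applies outside Java: a map specialized for fixed-size integer keys can keep everything in flat arrays with no per-entry allocation. A rough linear-probing sketch in C, purely to illustrate the principle (this is not Lucene's code, and it skips resizing and deletion):

```c
#include <stdint.h>

#define MAP_CAP   1024            /* power of two, kept small for the sketch */
#define EMPTY_KEY INT32_MIN       /* sentinel meaning "slot unused" */

typedef struct {
    int32_t keys[MAP_CAP];        /* flat arrays: no per-entry boxes or nodes */
    int32_t vals[MAP_CAP];
} int_map_t;

void int_map_init(int_map_t *m)
{
    for (int i = 0; i < MAP_CAP; i++)
        m->keys[i] = EMPTY_KEY;
}

/* Linear probing insert; assumes the table never fills (no resize here). */
void int_map_put(int_map_t *m, int32_t key, int32_t val)
{
    unsigned i = (unsigned)key & (MAP_CAP - 1);
    while (m->keys[i] != EMPTY_KEY && m->keys[i] != key)
        i = (i + 1) & (MAP_CAP - 1);
    m->keys[i] = key;
    m->vals[i] = val;
}

/* Returns 1 and fills *out on a hit, 0 on a miss. */
int int_map_get(const int_map_t *m, int32_t key, int32_t *out)
{
    unsigned i = (unsigned)key & (MAP_CAP - 1);
    while (m->keys[i] != EMPTY_KEY) {
        if (m->keys[i] == key) { *out = m->vals[i]; return 1; }
        i = (i + 1) & (MAP_CAP - 1);
    }
    return 0;
}
```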
The question contains a false assumption, that there's such a thing as "best possible performance".
If the already-existing code was tuned for best possible performance with your particular usage patterns, then it would be impossible for you to improve on it in respect of performance, and attempting to do so would be futile.
However, it wasn't tuned for best possible performance with your particular usage. Assuming it was tuned at all, it was designed to have good all-around performance on average, taken across a lot of possible usage patterns, some of which are irrelevant to you.
So, it is possible in principle that by implementing the code yourself, you can apply some tweak that helps you and (if the implementers considered that tweak at all) presumably hinders some other user somewhere else. But that's OK, they don't have to use your code. Maybe you like cuckoo hashing and they like linear probing.
Reasons that the implementers might not have considered the tweak include: they're less smart than you (rare, but it happens); the tweak hadn't been invented when they wrote the code and they aren't following the state of the art for that structure / algorithm; they have better things to do with their time and you don't. In those cases perhaps they'd accept a patch from you once you're finished.
There are also reasons other than performance that you might want a data structure very similar to one that your language supports, but with some particular behavior added or removed. If you can't implement that on top of the existing structure then you might well do it from scratch. Obviously it's a significant cost to do so, up front and in future support, but if it's worth it then you do it.
It may make sense when you are using a compiled language (like C or assembly).
When using an interpreted language you will probably take a performance loss, because the native structures are already compiled code, while your replacement has to be interpreted.
You will probably do it only when the native structure or algorithm lacks something you need.
This is actually a 2 part question:
For those who want to squeeze out every clock cycle, there is talk of pipelines, cache locality, etc.
I have seen these low-level performance techniques mentioned here and there, but I have not seen a good introduction to the subject from start to finish. Any resource recommendations? (Google gave me definitions and papers, whereas I'd really appreciate worked examples, tutorials, and real-life hands-on material.)
How does one actually measure this kind of thing? As in, with a profiler of some sort? I know we can always change the code, see the improvement, and theorize in retrospect; I am just wondering if there are established tools for the job.
(I know algorithm optimization is where the orders of magnitude are. I am interested in the metal here.)
The chorus of replies is, "Don't optimize prematurely." As you mention, you will get a lot more performance out of a better design than a better loop, and your maintainers will appreciate it, as well.
That said, to answer your question:
Learn assembly. Lots and lots of assembly. Don't MUL by a power of two when you can shift. Learn the weird uses of xor to swap and clear registers. For specific references, see:
http://www.mark.masmcode.com/ and http://www.agner.org/optimize/
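To make those remarks concrete, here is a small C sketch with comments about what an x86 compiler will typically emit; the exact instructions are not guaranteed, so check your own compiler's output:

```c
#include <stdint.h>

/* Multiply by a power of two: compilers typically emit a shift (shl) or an
 * lea rather than a mul, so write clear code and inspect the assembly. */
uint32_t times_eight(uint32_t x)
{
    return x << 3;               /* same result as x * 8 */
}

/* Zeroing a register: on x86 this usually compiles to "xor eax, eax",
 * which is shorter than "mov eax, 0" and breaks dependency chains. */
uint32_t zero(void)
{
    return 0;
}

/* The xor-swap trick from old assembly lore (a temporary variable is
 * usually at least as fast on modern hardware). */
void xor_swap(uint32_t *a, uint32_t *b)
{
    if (a != b) {                /* xor-swapping a value with itself would zero it */
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
    }
}
```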
Yes, you need to time your code. On *nix, it can be as easy as time { commands ; } but you'll probably want to use a full-featured profiler. GNU gprof is open source: http://www.cs.utah.edu/dept/old/texinfo/as/gprof.html
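As a minimal hand-rolled timing example before reaching for gprof, something like the following works on Linux/POSIX; treat it as a sketch, since real measurements need repeated runs on a quiet machine:

```c
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;
    volatile unsigned long sum = 0;      /* volatile so the loop isn't optimized away */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < 100000000UL; i++)
        sum += i;                        /* the code under test goes here */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("sum=%lu took %.3f s\n", sum, secs);
    return 0;
}
```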
If this really is your thing, go for it, have fun, and remember, lots and lots of bit-level math. And your maintainers will hate you ;)
EDIT/REWRITE:
If it is books you need, Michael Abrash did a good job in this area: Zen of Assembly Language, a number of magazine articles, the Graphics Programming Black Book, etc. Much of what he was tuning for is no longer a problem; the problems have changed. What you will get out of this is a sense of the kinds of things that can cause bottlenecks and the kinds of ways to solve them. Most important is to time everything, and to understand how your timing measurements work so that you are not fooling yourself by measuring incorrectly. Time the different solutions and try crazy, weird ones; you may find an optimization that you were not aware of and didn't realize until you exposed it.
I have only just started reading it, but See MIPS Run (early/first edition) looks good so far (note that ARM has overtaken MIPS as the leader in the processor market, so the MIPS and RISC hype is a bit dated). There are a number of textbooks, old and new, to be had about MIPS, which was designed for performance (at the cost of the software engineer in some ways).
The bottlenecks today fall into the categories of the processor itself, the I/O around it, and what is connected to that I/O. The insides of the processor chips themselves (for higher-end systems) run much faster than the I/O can handle, so you can only tune so far before you have to go off chip and wait forever. Getting from the train to your destination half a minute faster, when the train ride itself was 3 hours, is not necessarily a worthwhile optimization.
It is all about learning the hardware; you can probably stay within the ones-and-zeros world and not have to get into the actual electronics. But without really knowing the interfaces and internals you cannot do much performance tuning. You might re-arrange or change a few instructions and get a little boost, but to make something several hundred times faster you need more than that. Learning a lot of different instruction sets (assembly languages) helps you get inside the processors. I would recommend simulating HDL designs, for example the processors at OpenCores, to get a feel for how some folks do their designs and to get a solid handle on how to really squeeze clocks out of a task. Processor knowledge is big; memory interfaces are a huge deal and need to be learned; media (flash, hard disks, etc.), displays and graphics, networking, and all the types of interfaces between those things matter as well. Understanding at the clock level, or as close to it as you can get, is what it takes.
Intel and AMD provide optimization manuals for x86 and x86-64.
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html/
http://developer.amd.com/documentation/guides/pages/default.aspx
Another excellent resource is Agner Fog's site:
http://www.agner.org/optimize/
Some of the key points (in no particular order):
Alignment; memory, loop/function labels/addresses
Cache; non-temporal hints, page and cache misses
Branches; branch prediction and avoiding branching with compare-and-move op-codes (see the sketch after this list)
Vectorization; using SSE and AVX instructions
Op-codes; avoiding slow running op-codes, taking advantage of op-code fusion
Throughput / pipeline; re-ordering or interleaving op-codes that perform separate tasks, avoiding partial stalls and saturating the processor's ALUs and FPUs
Loop unrolling; performing multiple iterations for a single "loop comparison, branch"
Synchronization; using atomic op-codes (or the LOCK prefix) to avoid high-level synchronization constructs
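As a small sketch of the branch and loop-unrolling points above (illustrative only; always verify the generated assembly and measure):

```c
#include <stdint.h>

/* Branch avoidance: a data-dependent branch in a hot loop can be
 * mispredicted; written this way, compilers will usually emit a cmov
 * (compare & move) instead of a jump. */
int32_t clamp_min_zero(int32_t x)
{
    return x < 0 ? 0 : x;            /* typically: test + cmov, no branch */
}

/* Loop unrolling: four additions per "compare and branch".
 * For brevity this sketch assumes n is a multiple of 4. */
int64_t sum4(const int32_t *a, int n)
{
    int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];              /* independent accumulators also keep the pipeline busy */
    }
    return s0 + s1 + s2 + s3;
}
```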
Yes, measure, and yes, know all those techniques.
Experienced people will tell you "don't optimize prematurely", which I interpret as simply "don't guess".
They will also say "use a profiler to find the bottleneck", but I have a problem with that. I hear lots of stories of people using profilers and either liking them a lot or being confused by their output.
SO is full of them.
What I don't hear a lot of is success stories, with speedup factors achieved.
The method I use is very simple, and I've tried to give lots of examples, including this case.
I'd suggest Optimizing subroutines in assembly language: An optimization guide for x86 platforms.
It's quite heavy stuff though ;)
Before you read further, just to let you know: this question is vague and does not need one precise answer. On the contrary, the more answers I get, the better it will be for me.
The question is: how do you represent data in an efficient way?
I am not talking about representing data in a database or any particular language.
I am talking about when a program, a report, or a page needs to be shown to a user (static, like a report, or dynamic, like a web page): how should one represent the data so that the user picks up as much information as possible at, almost, first glance? Are there any best practices, pitfalls to avoid, and so on?
Edit: Any book/link that can help or that treat about this subject are welcome.
"how one should represent the data in order to the user to catch as many information as
possible from - almost - the first look."
To me, this screams that you need to be speaking to your end-users more. My suggestion would be to mock up the initial layout using something like Balsamiq Mockups (This can be done even if it's a public facing site). Using the mockups will help you visualise the design of the overall page.
"First-look" type views indicates a dashboard which provide overall, high level results.
Now, just to be clear, this is about the design and layout of the page; don't confuse it with web UI tools, e.g. jQuery UI, that bring fancy effects to the page.
In terms of links, my suggestion would be to thoroughly read through Designing User Interfaces For Business Web Applications from Smashing Magazine (incl. the related links). The one that is probably most relevant is 12 Standard Screen Patterns.
It is a brilliant read and should be, IMO, added to your saved bookmarks.
Effectiveness always matters more than efficiency. Before I express my opinions, I will assume that your question is already based on a solution that is effective from the user's perspective.
First, data retrieval is about the computer system's storage. If your data can reside entirely in the fastest storage (like main memory), keeping it there is the best strategy. But performance problems are mostly due to not having enough main memory, so data has to be retrieved from secondary storage (the slower kind), replacing other data in main memory, to produce what you want. So you have to deal with multi-level storage systems.
Second, when you are dealing with multi-level storage systems (as most computer systems are), efficiency depends on how much you reduce accesses to secondary storage. It's not only about the gain from loading data from slower storage into faster storage; there is also the cost of the data that gets kicked out.
In XML, DOM and SAX are the two extremes of dealing with multi-level storage. In database systems, fully cached indexes are a good performance solution (when the indexes are small enough). In operating systems, the file cache is always one of the most challenging things in computer science.
You can pre-calculate some data before it is required. You can use more efficient data structures to speed up data retrieval. You can crudely allocate more main memory to your application. You can... well, buy more memory modules or an SSD. Whatever solution you choose, it's definitely an art of fusion in computer science.
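As a tiny illustration of the pre-calculation/caching idea, here is a memoized lookup in C; it is a toy with invented names, and a real system would bound the cache properly and pick an eviction policy:

```c
#include <stdint.h>

#define CACHE_SLOTS 256

/* Stand-in for work that hits slow storage or is expensive to compute. */
static double expensive_compute(uint32_t key)
{
    double x = key;
    for (int i = 0; i < 1000000; i++)    /* pretend this is the slow part */
        x = x * 1.0000001 + 1.0;
    return x;
}

static struct { uint32_t key; double value; int filled; } cache[CACHE_SLOTS];

/* Keep recent results in fast memory so repeated requests skip the slow path. */
double cached_compute(uint32_t key)
{
    unsigned slot = key % CACHE_SLOTS;    /* direct-mapped, like a tiny CPU cache */
    if (cache[slot].filled && cache[slot].key == key)
        return cache[slot].value;         /* fast path: main memory only */

    double v = expensive_compute(key);    /* slow path, then remember the result */
    cache[slot].key = key;
    cache[slot].value = v;
    cache[slot].filled = 1;
    return v;
}
```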
Algorithms, data structures, database systems, operating systems, even compiler theory: these hard metals can help you build a sword that kicks the dragon's ass.
I'm trying to learn this basic thing about processors that should be taught in every CS department of every university. Yet I can't find it on the net (Google doesn't help) and I can't find it in my class materials either.
Do you know any good resource on how addressing modes work on a physical level? I'm particularly interested in Intel processors.
You might want to take a look at the book "Modern Operating Systems" by Tanenbaum.
If you are interested in the x86 architecture, the Intel manuals might help (but they go really deep):
http://www.intel.com/products/processor/manuals/
Start on the Wikipedia Virtual Memory page for a bit of background, then follow up with specific pages such as the one on the MMU, etc., to satisfy your curiosity.
You will normally go over all of the above concepts in detail (and some more, such as pipelined and superscalar architectures, caches, etc.) in any decent Computer Architecture course, typically taught by the Faculty of (Electrical or Computer) Engineering.
This page might help. I did a search for HC12 addressing modes since that's what we learnt with, and it is MUCH better to learn on a simple processor rather than jumping into the deep end with something like an Intel processor. The basic concepts should be similar for any processor though.
http://spx.arizona.edu/ECE372/Supporting%20Documents/lecture/HCS12%20Addressing%20Modes%20and%20Subroutines.pdf
I wouldn't imagine you'd need to know any of the more complicated ones in an introductory course. We only really used the basic ones, then had to explain a few of the others in our exam.
You should be able to see what's going on at a physical level from that, provided you understand the assembly code examples. The inherent-addressing instruction inca, for example, is going to use a set of logic gates within the processor (http://en.wikipedia.org/wiki/Adder_%28electronics%29) in order to increment register A by one. That's all well and good, but trying to understand the physical layer of anything more complicated than that is just going to give you headaches. You really don't need to know it, which is the whole point of using a microprocessor in the first place.
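If it helps to see the idea without the electronics, here is a toy C model of a few addressing modes (immediate, direct, indexed), loosely in the spirit of the HC12 examples above; it is only an illustration, not how any real processor is implemented:

```c
#include <stdint.h>
#include <stdio.h>

static uint8_t mem[256];    /* toy memory */
static uint8_t reg_a;       /* toy accumulator, like the HC12's A */
static uint8_t reg_x;       /* toy index register */

/* Immediate: the operand is the value itself. */
static uint8_t load_immediate(uint8_t value) { return value; }

/* Direct: the operand is an address; the value is fetched from memory. */
static uint8_t load_direct(uint8_t addr)     { return mem[addr]; }

/* Indexed: effective address = index register + offset. */
static uint8_t load_indexed(uint8_t offset)  { return mem[(uint8_t)(reg_x + offset)]; }

int main(void)
{
    mem[0x10] = 42;
    reg_x = 0x0E;

    reg_a = load_immediate(7);      /* like  LDAA #7                  */
    printf("immediate: %u\n", reg_a);

    reg_a = load_direct(0x10);      /* like  LDAA $10                 */
    printf("direct:    %u\n", reg_a);

    reg_a = load_indexed(2);        /* like  LDAA 2,X  -> mem[0x10]   */
    printf("indexed:   %u\n", reg_a);

    reg_a = (uint8_t)(reg_a + 1);   /* like  INCA (inherent: no operand at all) */
    printf("after inca-style increment: %u\n", reg_a);
    return 0;
}
```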