This is a subject I have never found a satisfactory answer to, so I was wondering whether the helpful people of Stack Overflow might be able to answer it.
First of all: I'm not asking for a tutorial or anything, merely a discussion because I have not seen much information online about this.
Basically, what I'd like to know is how one designs a new type of partition format, and then how it can be interfaced with the operating system for use?
And better yet, what makes one partition format better than another? Is it performance, security, filename or file-size limits? Or is there more to it?
It's just something I've always wondered about. I'd love to dabble in creating one just for education purposes someday.
OK, although the question is broad, I'll take a stab at it:
Assume that we are talking about a 'filesystem' as opposed to certain 'raw' partition formats such as swap formats.

A filesystem should be able to map low-level OS, BIOS, network, or custom calls onto a coherent file-and-folder abstraction that user applications can use. So, in your case, a 'partition format' should be something that presents low-level disk sectors and cylinders and their contents as a file-and-folder abstraction.
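To make that concrete, here is a minimal sketch (in Java, with a placeholder image path) of the very first thing such a format needs: recognising its own on-disk metadata. An ext2 volume keeps its superblock at byte offset 1024, with the 0xEF53 magic number 56 bytes into it; a driver probes this before claiming the partition.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Reads the ext2 superblock magic from a disk image to decide whether
// this partition is "ours" -- the first thing a filesystem driver does.
public class Ext2Probe {
    private static final long SUPERBLOCK_OFFSET = 1024; // superblock starts 1 KB in
    private static final int MAGIC_OFFSET = 56;         // s_magic within the superblock
    private static final int EXT2_MAGIC = 0xEF53;

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile img = new RandomAccessFile("disk.img", "r")) { // placeholder path
            img.seek(SUPERBLOCK_OFFSET + MAGIC_OFFSET);
            int lo = img.read();
            int hi = img.read();
            int magic = (hi << 8) | lo; // on-disk fields are little-endian
            System.out.println(magic == EXT2_MAGIC
                ? "Looks like an ext2/3/4 filesystem"
                : "Not ext2 (magic=0x" + Integer.toHexString(magic) + ")");
        }
    }
}
```

A real driver would go on to parse block-group descriptors and inodes from there, but the pattern is the same: fixed on-disk offsets translated into an in-memory abstraction.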
Along the way, if you can provide features such as less fragmentation, redundant node indexes, journalling to prevent data loss, survival in the event of power loss, working around bad sectors, redundant data, hardware mirroring, etc., then it can be considered better than another format that does not provide such features. If you can optimise file sizes to match the usage of disk sectors and clusters while accommodating both very small and very large files, that would be a plus.
Thorough bullet-proof security and testing would be considered essential for any non-experimental use.
To start hacking on your own, work with one of the slightly older filesystems like ext2. You would need considerable build/compile/kernel skills to get going, but nothing monumental.
Does it make sense to implement your own version of data structures and algorithms in your language of choice, even if they are already supported, knowing that care has been taken in tuning them for the best possible performance?
Sometimes - yes. You might need to optimise the data structure for your specific case, or give it some specific extra functionality.
A Java example is Apache Lucene (a mature, widely used information retrieval library). Although the Map<S,T> interface and its implementations already exist, for performance reasons their usage is not good enough, since they box each int into an Integer; so a more optimized IntToIntMap was developed for this purpose instead of using a Map<Integer,Integer>.
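Lucene's actual class is more elaborate, but a minimal sketch of the idea looks like this; the hashing constant and sizing policy below are illustrative choices, not Lucene's:

```java
import java.util.Arrays;

// Minimal open-addressing (linear probing) int->int map.
// Avoids boxing each key/value into Integer objects.
public class IntToIntMap {
    private static final int EMPTY = Integer.MIN_VALUE; // sentinel: not usable as a key
    private final int[] keys;
    private final int[] values;

    public IntToIntMap(int capacity) {
        // Power-of-two capacity keeps the probe mask cheap; no resizing for brevity,
        // so the map must never be filled past the requested capacity.
        int cap = Integer.highestOneBit(Math.max(capacity, 16) * 2);
        keys = new int[cap];
        values = new int[cap];
        Arrays.fill(keys, EMPTY);
    }

    public void put(int key, int value) {
        int mask = keys.length - 1;
        int i = (key * 0x9E3779B9) & mask;       // cheap integer hash
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask;                  // linear probing
        }
        keys[i] = key;
        values[i] = value;
    }

    public int get(int key, int defaultValue) {
        int mask = keys.length - 1;
        int i = (key * 0x9E3779B9) & mask;
        while (keys[i] != EMPTY) {
            if (keys[i] == key) return values[i];
            i = (i + 1) & mask;
        }
        return defaultValue;
    }
}
```

Both arrays hold primitives, so there is no per-entry object allocation at all, which is the whole point of specialising the structure.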
The question contains a false assumption, that there's such a thing as "best possible performance".
If the already-existing code was tuned for best possible performance with your particular usage patterns, then it would be impossible for you to improve on it in respect of performance, and attempting to do so would be futile.
However, it wasn't tuned for best possible performance with your particular usage. Assuming it was tuned at all, it was designed to have good all-around performance on average, taken across a lot of possible usage patterns, some of which are irrelevant to you.
So, it is possible in principle that by implementing the code yourself, you can apply some tweak that helps you and (if the implementers considered that tweak at all) presumably hinders some other user somewhere else. But that's OK, they don't have to use your code. Maybe you like cuckoo hashing and they like linear probing.
Reasons that the implementers might not have considered the tweak include: they're less smart than you (rare, but it happens); the tweak hadn't been invented when they wrote the code and they aren't following the state of the art for that structure / algorithm; they have better things to do with their time and you don't. In those cases perhaps they'd accept a patch from you once you're finished.
There are also reasons other than performance that you might want a data structure very similar to one that your language supports, but with some particular behavior added or removed. If you can't implement that on top of the existing structure then you might well do it from scratch. Obviously it's a significant cost to do so, up front and in future support, but if it's worth it then you do it.
It may make sense when you are using a compiled language (like C or assembly).
When using an interpreted language, you will probably take a performance hit, because the native structures are already compiled, while your new structure would have to be interpreted.
You will probably do it only when the native structure or algorithm lacks something you need.
Before you read further, just to let you know: this question is vague and does not need one precise answer. On the contrary, the more answers I get, the better it will be for me.

The question is: how does one represent data in an efficient way?

I am not talking about representing data in a database or in any particular language.

I am talking about how, when a program, a report, or a page needs to be shown to a user (static, like a report, or dynamic, like a web page), one should represent the data so that the user can take in as much information as possible at (almost) first glance. Are there any best practices, pitfalls to avoid, and so on?
Edit: any books or links that can help or that treat this subject are welcome.
"how one should represent the data in order to the user to catch as many information as
possible from - almost - the first look."
To me, this screams that you need to be speaking to your end users more. My suggestion would be to mock up the initial layout using something like Balsamiq Mockups (this can be done even if it's a public-facing site). Using the mockups will help you visualise the design of the overall page.
"First-look" type views indicates a dashboard which provide overall, high level results.
Now, just to be clear, this is about the design and layout of the page; don't confuse it with web UI tools, e.g. jQuery UI, that bring fancy effects to the page.
In terms of links, my suggestion would be to read thoroughly through Designing User Interfaces For Business Web Applications from Smashing Magazine (including the related links). The one that is probably most relevant is 12 Standard Screen Patterns.
It is a brilliant read and should be, IMO, added to your saved bookmarks.
Effectiveness always matters more than efficiency. Before I express my opinions, I will suppose that your question is already based on an effective solution from the user's perspective.
First, data retrieval is about the storage hierarchy of a computer system. If your data can reside entirely in the fastest storage (like main memory), keeping it there is the best strategy. But performance problems mostly arise when there is not enough main memory, so data has to be retrieved from secondary storage (the slower kind), replacing other data in main memory, to produce what you want. So you have to deal with multi-level storage systems.
Second, when you are dealing with multi-level storage systems (as in most computer systems), efficiency depends on how much you can reduce accesses to secondary storage. It's not only about the gain from loading data from slower storage into faster storage; there is also the cost of the data that gets evicted.
In XML, DOM and SAX are two extremes of dealing with multi-level storage systems. In database systems, fully cached indexes are a good solution for performance (when the indexes are small enough). In operating systems, the file cache has always been one of the most challenging problems in computer science.
You can pre-calculate some data before it is required. You can use more efficient data structures to improve data retrieval. You can crudely allocate more main memory to your application. You can... well, buy more memory modules or an SSD. Whatever solution you choose, it's definitely an art of fusion in computer science.
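As a small illustration of the caching theme above (a sketch, not a prescription): in Java, an access-ordered LinkedHashMap becomes an LRU cache just by overriding removeEldestEntry, which is one cheap way to keep a hot subset of slow-storage data in main memory.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A simple LRU cache: access-ordered LinkedHashMap that evicts the
// least recently used entry once it grows past maxEntries.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCache(int maxEntries) {
        super(16, 0.75f, true); // true = access order, not insertion order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict when over budget
    }
}
```

For example, new LruCache<Long, byte[]>(1024) would keep at most 1024 recently used disk blocks in memory and silently drop the coldest one on each overflow.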
Algorithms, data structures, database systems, operating systems, even the theory of compilers: these hard metals can help you build a sword that kicks the dragon's ass.
I am trying to hack an old Unix kernel. I just want to implement the MMU and TLB in software. Can someone tell me what the best data structures and algorithms are for building one? I saw lots of people using splay trees because it's easy to implement LRU with them. Is there any better data structure? What is the most efficient way of translating virtual to physical addresses in software? Assume it's x86 architecture and the translation is basic page-table translation.
You mention efficiency. Is that the goal you're engineering towards? If you're not constrained to any particular goal, just try to get it working. I'd do a single-level page table if you can, either direct or fully associative. It sounds like you're past this, though.
Most efficient is going to depend on size/speed tradeoffs and what kind of locality you expect. Do you have any critical apps profiled, or is this just messing around to try out some implementations? Inverted page tables are used on some newer architectures; I would take that as an indication that people who spend a lot of time working on this think it's a good way to go.
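To make the baseline concrete, here is a hedged sketch of the classic two-level x86-style walk with a naive software TLB in front of it; the 10/10/12 bit split matches 32-bit x86 with 4 KB pages, and the HashMap-backed TLB is just a stand-in for whatever structure (splay tree, etc.) you eventually choose.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Software translation for a 32-bit x86-style two-level page table:
// 10-bit directory index | 10-bit table index | 12-bit page offset.
public class SoftMmu {
    private static final int PAGE_SHIFT = 12;
    private static final long PAGE_MASK = (1L << PAGE_SHIFT) - 1;

    private final long[][] pageTables = new long[1024][]; // page directory; null = not present
    private final Map<Long, Long> tlb = new HashMap<>();  // software TLB: virtual page -> frame

    public void map(long vpn, long frame) {
        int dirIdx = (int) (vpn >>> 10) & 0x3FF;
        int tblIdx = (int) vpn & 0x3FF;
        if (pageTables[dirIdx] == null) {
            pageTables[dirIdx] = new long[1024];
            Arrays.fill(pageTables[dirIdx], -1L); // -1 marks "not present"
        }
        pageTables[dirIdx][tblIdx] = frame;
    }

    public long translate(long vaddr) {
        long vpn = vaddr >>> PAGE_SHIFT;          // virtual page number
        Long frame = tlb.get(vpn);                // 1. TLB lookup first
        if (frame == null) {                      // 2. TLB miss: walk the table
            int dirIdx = (int) (vpn >>> 10) & 0x3FF;
            int tblIdx = (int) vpn & 0x3FF;
            long[] table = pageTables[dirIdx];
            if (table == null || table[tblIdx] == -1L) {
                throw new IllegalStateException("page fault at 0x" + Long.toHexString(vaddr));
            }
            frame = table[tblIdx];
            tlb.put(vpn, frame);                  // 3. fill the TLB (no eviction policy here)
        }
        return (frame << PAGE_SHIFT) | (vaddr & PAGE_MASK);
    }
}
```

The interesting engineering is what replaces that unbounded HashMap: a fixed-size set-associative array or an LRU structure is where the splay-tree discussion comes in.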
I'm a mathematician and occasionally do some statistics/machine learning analysis consulting projects on the side. The data I have access to are usually on the smaller side, at most a couple hundred megabytes (and almost always far less), but I want to learn more about handling and analyzing data on the gigabyte/terabyte scale. What do I need to know, and what are some good resources to learn from?
Hadoop/MapReduce is one obvious start.
Is there a particular programming language I should pick up? (I primarily work now in Python, Ruby, R, and occasionally Java, but it seems like C and Clojure are often used for large-scale data analysis?)
I'm not really familiar with the whole NoSQL movement, except that it's associated with big data. What's a good place to learn about it, and is there a particular implementation (Cassandra, CouchDB, etc.) I should get familiar with?
Where can I learn about applying machine learning algorithms to huge amounts of data? My math background is mostly on the theory side, definitely not on the numerical or approximation side, and I'm guessing most of the standard ML algorithms don't really scale.
Any other suggestions on things to learn would be great!
Apache Hadoop is indeed a good start, because it's free, has a large community and is easy to set up.
Hadoop is built in Java, so Java can be the language of choice. But it is possible to use other languages with Hadoop as well ("pipes" and "streaming"). I know that Python, for example, is often used.
You can avoid keeping your data in databases if you like. Originally, Hadoop works with data on the (distributed) file system, but as you already seem to know, there are distributed databases available for Hadoop.
Have you ever had a look at Mahout? I think that would be a hit for you ;-) Much of the work you need may already have been done!
Read the Quick Start, set up your own (pseudo-distributed?) cluster, and run the word-count example; its core mapper looks roughly like the sketch below.
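For reference, the heart of that word-count example is a single Mapper; this is a minimal version along the lines of the one in the Hadoop tutorial (the class name is mine):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every whitespace-separated token in the input line.
// A reducer then sums the 1s per word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);
        }
    }
}
```

Once that runs on your pseudo-distributed cluster, scaling to more nodes is mostly configuration, which is the appeal of the model.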
Let me know if you have any questions :-) A comment will remind me of this question.
I've done some large scale machine learning (3-5GB datasets), so here are some insights:
First, there are logistics issues at large scales. Can you load all your data into memory? With Java and a 64-bit JVM you can access as much RAM as you have: for example, the command-line parameter -Xmx8192M will give you access to 8 GB (if you have that much). Matlab, being a Java application, can also benefit from this and work with fairly large datasets.
More important are the algorithms that you run on your data. Chances are that standard implementations will expect all of the data in memory. You might have to implement a working-set approach yourself, where you swap data in and out of disk and only work on a portion of the data at a time. These are sometimes referred to as chunking, batch, or incremental algorithms, depending on the context.
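A minimal sketch of that chunked style, assuming one numeric value per line in a made-up input file: stream the data in fixed-size batches and keep only a running aggregate, so memory use stays constant no matter how large the dataset grows.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Incremental (chunked) processing: never hold more than BATCH lines in memory.
public class ChunkedMean {
    private static final int BATCH = 100_000;

    public static void main(String[] args) throws IOException {
        long count = 0;
        double mean = 0.0; // running mean, updated one batch at a time

        // "data/values.txt" is a hypothetical input file, one double per line.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("data/values.txt"))) {
            List<Double> chunk = new ArrayList<>(BATCH);
            String line;
            while ((line = in.readLine()) != null) {
                chunk.add(Double.parseDouble(line));
                if (chunk.size() == BATCH) {
                    for (double x : chunk) mean += (x - mean) / ++count; // incremental mean update
                    chunk.clear(); // drop the batch; only the aggregate survives
                }
            }
            for (double x : chunk) mean += (x - mean) / ++count; // leftover partial batch
        }
        System.out.println("n=" + count + " mean=" + mean);
    }
}
```

Many estimators (means, variances, SGD-style model fits) admit exactly this kind of one-pass update, which is what makes the working-set approach practical.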
You are right to suspect that a lot of algorithms do not practically scale, so you might have to go for an approximate solution. The good news is that for almost any algorithm you can find research papers that deal with approximation and/or discuss large scale solutions. The bad news is that you'll most likely have to implement those approaches yourself.
Hadoop is great, but can be a pain in the ass to set up. This is by far the best article I've read on Hadoop setup. I strongly recommend it:
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
Clojure is built on top of Java so it's unlikely that it's going to be any faster than Java. However, it is one of the few languages that does shared memory well, which may or may not be helpful. I'm not a math guy but it seems most math calculations are very parallelizable, with little need of threads sharing memory. Either way, you might want to check out Incanter, which is Clojure's statistical computing library, and clojure-hadoop, which makes writing Hadoop jobs a lot less painful.
In terms of languages, I find that the differences in performance end up being constant factors. It's far better to just find a language you enjoy and focus on improving your algorithms. However, according to a shootout cited by Peter Norvig (scroll down to the colorful table), you may want to shy away from Python and Perl due to their crappiness with arrays.
In a nutshell, NoSQL is great for unstructured/arbitrarily structured data, while SQL/RDBMS is great (or at least tolerable) for structured data. Changing/adding fields is expensive in an RDBMS, so if that's going to happen a lot, you might want to shy away from them.
However, in your case, it seems like you're going to be batch processing a ton of data and then getting back an answer as opposed to having data around that you will periodically ask questions about? You could probably just process CSVs/text files in Hadoop. Unless you need a performant way of accessing arbitrary information about your data on the fly, I'm not sure either SQL or NoSQL would be useful.
How does one choose among, and justify, design tradeoffs in terms of optimised code, clarity of implementation, efficiency, and portability?
A relevant example for the purpose of this question could be large file handling, where a "large file" is "quite a few GB" for a problem that would be simplified using random-access methods.
Approaches for reading and modifying this file could be:
Use streams anyway, and seek to the desired place. This is portable but potentially slow, and it is not clear; it will work on practically all OSes.
Map the relevant portion of the file as a large block, e.g. mmap a 50 MB chunk of the file for each chunk processed (see the sketch after this list). This would work on many OSes, depending on the subtleties of that system's mmap implementation.
Just mmap the entire file. This requires a 64-bit OS and is the most efficient and clearest way to implement it, but it does not work on 32-bit OSes.
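For what it's worth, a hedged Java analogue of the chunked approach (the second option above) uses FileChannel.map instead of raw mmap; the 50 MB window and file name are illustrative only:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Process a multi-GB file by memory-mapping it one 50 MB window at a time,
// so the address-space demand stays bounded even on 32-bit systems.
public class ChunkedMmap {
    private static final long CHUNK = 50L * 1024 * 1024; // 50 MB per mapping

    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get("big.dat"), // hypothetical file
                                               StandardOpenOption.READ)) {
            long size = ch.size();
            long checksum = 0;
            for (long pos = 0; pos < size; pos += CHUNK) {
                long len = Math.min(CHUNK, size - pos);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (buf.hasRemaining()) {       // random access within the window
                    checksum += buf.get() & 0xFF;  // stand-in for real processing
                }
                // The mapping is released when the buffer is garbage collected.
            }
            System.out.println("checksum=" + checksum);
        }
    }
}
```

The same three tradeoffs from the list apply: a larger window means fewer remaps but more address space consumed per mapping.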
Not sure what you're asking, but part of the design process is to analyze requirements for portability and performance (amongst other factors).
If you know you'll never need to port the code, and you need absolutely the best performance, then you adjust your implementation accordingly. There's no point being portable just for its own sake.
Note also that if you want both performance and portability, there's nothing stopping you from providing an implementation for each platform. Of course this will increase your cost, so really, it's up to you to prioritize your needs.
Without constraints, this question cannot rationally be answered.
You're asking "what is the best color" without telling us whether you're painting a house or a car or a picture.
Constraints would include at least
Language of choice
Target platforms (multi CPU industrial-grade server or iPhone?)
Optimizing for speed vs. memory
Cost (who's funding this and is there a delivery constraint?)
No piece of software could have "ultimate" portability.
An example of this sort of problem being handled using a variety of methods, but with a tight constraint on both the specific input/output required and the measurement of "best", would be the WideFinder project.
Basically, you need to think first before coding. Every project is unique, and an analysis of the needs can help you decide what is essential for it. What will make the best solution for any project depends on a few things...
First of all, will this project need to be, or eventually become, multi-platform? Depending on your answer, choosing the right programming language should be easier. Then again, you could also use more than one language in your project, and this is completely normal. Portability does not necessarily mean less performance; all it implies is harder work to achieve your goals, because you will need quality code.

Also, every programming language has its own philosophy; learn what they are. One thing is for sure: certain problems come back over and over. This is why knowing the different design patterns can make a difference sometimes, and some languages have their own idioms, which can be very relevant when choosing a language.

Another thing that needs some thought is the different approaches you can take for your project. Multithreading, sockets, client/server systems, and many other technologies are all there for you to use. Choosing the right technology can help make a project better.
Knowing the needs, and the different solutions available today, is what will help you decide when the time comes to choose among the different tradeoffs.
It really depends on the drivers for the project. If you are doing in-house enterprise dev, then do the simplest thing that could work on your target hardware. Mod for performance reqs as needed.
If you know you need to support different hardware platforms on day 1, then you'll clearly need to choose a portable implementation, or use multiple approaches.
Portability for portability's sake has been a marketing spiel for Java since its inception, and is a fact of life for C by convention, and I believe most people who abide by it "grew up" with Java or C and will say that.
However, true, absolute portability will only hold for trivial up to, at most, medium-complexity applications; anything highly complex will need specialized tweaks.