A Linux Kernel Module for Self-Optimizing Hard Drives: Advice? - linux-kernel

I am a computer engineering student studying Linux kernel development. My 4-man team was tasked to propose a kernel development project (to be implemented in 6 weeks), and we came up with a tentative "Self-Optimizing Hard Disk Drive Linux Kernel Module". I'm not sure if that title makes sense to the pros.
We based the proposal on this project.
The goal of the project is to minimize hard disk access times. The plan is to create a special partition where the "most commonly used" files are to be placed. An LKM will profile, analyze, plan, and redirect I/O operations to the hard disk. This LKM should primarily be able to predict and redirect all file access (on files with sizes of < 10 MB) with minimal overhead, and lessen average read/write access times to the hard disk. I believe Apple's HFS has this feature.
Can anybody suggest a starting point? I recently found a way to redirect I/O operations by intercepting system calls (by hijacking all the read/write ones). However, I'm not convinced that this is the best way to go. Is there a way to write a driver that redirects these read/write operations? Can we perhaps tap into the read/write cache to achieve the same effect?
Any feedback at all is appreciated.

You may want to take a look at Unionfs. You don't even need an LKM, just a user-space daemon that subscribes to inotify events, keeps statistics, and migrates files between partitions. Unionfs will combine both partitions into a single logical filesystem.

There are many ways in which such optimizations might be useful:
accessing file A implies file B access is imminent. Example: opening an icon file for a media file by a media player
accessing any file in some group G of files means that other files in the group will be accessed shortly. Example: mysql receives a use somedb command which implies all the file tables, indexes, etc. will be accessed.
a program which stops reading a sequential file suggests the program has stalled or exited, so predictions of future accesses associated with that file should be abandoned.
having multiple (yet transparent) copies of some frequently referenced files strategically sprinkled about can use the copy nearest the disk heads. Example: uncached directories or small, frequently accessed settings files.
There are so many possibilities that I think at least 50% of an efficient solution would be a sensible, limited specification of which features you will attempt to implement and which you won't. It might be valuable to study how Microsoft Vista's aggressive file caching mechanism disappointed.
Another problem you might encounter with a modern Linux distribution is how well the system already does much of what you plan to improve. In fact, measuring the improvement might be a big challenge. I suggest writing a benchmark program which opens and reads a series of files and precisely times the complete sequence. Run it several times with your improvements enabled and disabled. But you'll have to reboot in between for valid timing....
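A minimal sketch of the kind of benchmark suggested above, written in Python for brevity. The helper names (`time_read_sequence`, `make_sample_files`) and the sample file sizes are assumptions made for illustration; in a real test you would pass the list of files your module is supposed to speed up.

```python
import os
import tempfile
import time


def time_read_sequence(paths, chunk_size=1 << 20):
    """Open and fully read each file in order.

    Returns (elapsed_seconds, total_bytes_read) for the whole sequence.
    """
    total = 0
    start = time.perf_counter()
    for path in paths:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    return time.perf_counter() - start, total


def make_sample_files(directory, count=5, size=4096):
    """Create throwaway files (hypothetical test data) so the sketch runs anywhere."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"sample_{i}.bin")
        with open(path, "wb") as f:
            f.write(os.urandom(size))
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        files = make_sample_files(d)
        elapsed, nbytes = time_read_sequence(files)
        print(f"read {nbytes} bytes from {len(files)} files in {elapsed:.6f} s")
```

Note that files which were just written are almost certainly still in the page cache, so this demo run measures cached reads. Timing the disk itself needs a cold cache, which is why the answer recommends rebooting between runs.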

What's the fastest way to copy a folder over the network to multiple servers (Python)?

As the title says: given a package (its size usually varies between 500 MB and 1 GB), I would like to copy it to around 40 servers concurrently. So far I've been using a script that runs one copy at a time, so I'm considering these possibilities:
1- Use the multiprocessing library and create a separate process for each copy so that they run concurrently;
- although I think I might end up with an I/O bottleneck, and processes cannot share the same data.
2- I'm not using a single internet connection, but a huge corporate WAN.
Can anyone tell me whether there is a more effective (faster) way to achieve the same thing, or some other way to solve it? (I can run this task from a 2-core workstation.)
1) I have no experience with this, but it looks like a fit for your use case:
http://code.google.com/p/pysendfile/
sendfile(2) is a system call which provides a "zero-copy" way of copying data from one file descriptor to another (a socket). The phrase "zero-copy" refers to the fact that all of the copying of data between the two descriptors is done entirely by the kernel, with no copying of data into userspace buffers. This is particularly useful when sending a file over a socket (e.g. FTP).
and
When do you want to use it?
Basically any application sending files over the network can take advantage of sendfile(2).
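As a rough illustration of the idea quoted above, here is a sketch that uses the standard library's `socket.sendfile()` rather than the pysendfile package. `socket.sendfile()` uses `os.sendfile()` where the platform supports it and falls back to plain `send()` otherwise. The function names and the socket-pair demo are assumptions made so the example runs on one machine; a real deployment would connect to the 40 servers instead.

```python
import os
import socket
import tempfile


def send_file_over_socket(sock, path):
    """Send a whole file over a connected stream socket; returns bytes sent."""
    with open(path, "rb") as f:
        # Uses the sendfile(2) system call when available, so the data
        # does not have to pass through a user-space buffer.
        return sock.sendfile(f)


def demo():
    """Send a small temporary file through a local socket pair."""
    payload = b"x" * 4096
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    sender, receiver = socket.socketpair()
    try:
        sent = send_file_over_socket(sender, path)
        sender.close()  # signals end-of-stream to the receiver
        received = b""
        while True:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            received += chunk
    finally:
        sender.close()
        receiver.close()
        os.unlink(path)
    return sent, received == payload


if __name__ == "__main__":
    print(demo())
```

The payload is kept small on purpose so it fits in the socket buffer without a concurrent reader; for package-sized files the receiving end has to be draining the socket while the sender runs.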
2) Another option would be to use a torrent library. I recently learned (skip to 31:00 for the torrent stuff) that Facebook distributes its daily software updates via torrent (and updates 1000s of servers with 1.5GB binaries within 15min or so).
Assume your machines have 1Gbit connections. You'll get 800Mbit/s if you're lucky/work at it, so it'll take ~10s to copy each 1GByte and 6-7 minutes to update those machines one after another. If that's good enough, the only thing you need to do is work on using the 1Gbit efficiently to hit that target (what are you seeing from your current scripts? OK, 1Gbit may be ambitious on a WAN, but you can do a similar analysis). Multiprocessing might or might not help here... but it's not going to magically get you more bandwidth.
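The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch of the back-of-the-envelope calculation; the function names are made up for illustration, and the figures (1 GByte package, 800 Mbit/s effective throughput, 40 servers) are the ones assumed in the answer.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Ideal time to push size_bytes through a link of the given speed."""
    return size_bytes * 8 / link_bits_per_second


def sequential_update_minutes(size_bytes, link_bits_per_second, servers):
    """Total time when the sender's link is the bottleneck for every copy."""
    return servers * transfer_seconds(size_bytes, link_bits_per_second) / 60


if __name__ == "__main__":
    per_copy = transfer_seconds(10**9, 800 * 10**6)
    total = sequential_update_minutes(10**9, 800 * 10**6, 40)
    print(f"{per_copy:.1f} s per copy, {total:.1f} min for 40 servers")
    # prints "10.0 s per copy, 6.7 min for 40 servers"
```

Running the copies in parallel does not change the total if they all share the sender's uplink, which is the point about multiprocessing not creating bandwidth.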
If it's not good enough, I'd consider either of these:
Go P2P (see miku's answer), so that as soon as one machine has a bit of the data it can share it with other machines using its own bandwidth. How much this helps depends to some extent on your network topology (existence of other bottleneck points).
Look into multicast, if the network is enough under your control that you can get the stuff routed appropriately (this seems pretty unlikely for a WAN, but maybe one day in an IPv6 wonderland...). Instead of copying the same data 40 times (assuming it is the same each time), you just broadcast it once and all the receivers pick it up simultaneously. Multicast UDP isn't reliable (intended more for IPTV, I think), but there have been attempts to build reliable file transfer tools using multicast tech, e.g. OpenPGM and MS's own implementation.

Why do operating systems limit file descriptors?

I ask this question after trying my best to research the best way to implement a message queue server. Why do operating systems put limits on the number of open file descriptors a process and the global system can have?
My current server implementation uses zeromq, and opens a subscriber socket for each connected websocket client. Obviously that single process is only going to be able to handle clients to the limit of the fds.
When I research the topic I find lots of info on how to raise system limits to levels as high as 64k fds but it never mentions how it affects system performance and why it is 1k and lower to start with?
My current approach is to try and dispatch messaging to all clients using a coroutine in its own loop, and a map of all clients and their subscription channels. But I would just love to hear a solid answer about file descriptor limitations and how they affect applications that try to use them on a per client level with persistent connections?
It may be because a file descriptor value is an index into a file descriptor table. Therefore, the number of possible file descriptors determines the size of the table. Average users would not want half of their RAM used up by a file descriptor table that can handle millions of file descriptors they will never need.
There are certain operations which slow down when you have lots of potential file descriptors. One example is the operation "close all file descriptors except stdin, stdout, and stderr" -- the only portable* way to do this is to attempt to close every possible file descriptor except those three, which can become a slow operation if you could potentially have millions of file descriptors open.
*: If you're willing to be non-portable, you can look in /proc/self/fd -- but that's beside the point.
This isn't a particularly good reason, but it is a reason. Another reason is simply to keep a buggy program (i.e., one that "leaks" file descriptors) from consuming too many system resources.
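To see the limits being discussed on your own machine, here is a small sketch using Python's Unix-only `resource` module together with the non-portable /proc/self/fd listing mentioned in the footnote. The function names are assumptions made for illustration, and `open_fds()` only works where /proc/self/fd exists (Linux).

```python
import os
import resource  # Unix-only module


def fd_limits():
    """Return the (soft, hard) per-process limit on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fds():
    """List this process's open descriptors via /proc/self/fd (Linux-specific).

    The listing itself briefly opens a descriptor for the directory, so the
    result may include one number that is already closed when we return.
    """
    return sorted(int(name) for name in os.listdir("/proc/self/fd"))


if __name__ == "__main__":
    soft, hard = fd_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
    if os.path.isdir("/proc/self/fd"):
        print(f"currently open: {open_fds()}")
```

Reading the directory touches only the descriptors that actually exist, which is why it is cheaper than trying to close every possible descriptor number up to the limit.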
For performance purposes, the open file table needs to be statically allocated, so its size needs to be fixed. File descriptors are just offsets into this table, so all the entries need to be contiguous. You can resize the table, but this requires halting all threads in the process and allocating a new block of memory for the file table, then copying all entries from the old table to the new one. It's not something you want to do dynamically, especially when the reason you're doing it is because the old table is full!
On Unix systems, the process creation fork() and fork()/exec() idiom requires iterating over all potential process file descriptors and attempting to close each one, typically leaving only a few file descriptors such as stdin, stdout, and stderr untouched or redirected somewhere else.
Since this is the unix api for launching a process, it has to be done anytime a new process is created, including executing each and every non built-in command invoked within shell scripts.
Other factors to consider are that while some software may use sysconf(_SC_OPEN_MAX) to dynamically determine the number of files that a process may have open, a lot of software still uses the C library's default FD_SETSIZE, which is typically 1024 descriptors, and as such can never have more than that many files open regardless of any administratively defined higher limit.
Unix has a legacy asynchronous I/O mechanism based on file descriptor sets, which use bit offsets to represent files to wait on and files that are ready or in an exception condition. It doesn't scale well to thousands of files, as these descriptor sets need to be set up and cleared each time around the run loop. Newer non-standard APIs have appeared on the major Unix variants, including kqueue() on *BSD and epoll() on Linux, to address performance shortcomings when dealing with a large number of descriptors.
It is important to note that select()/poll() is still used by A LOT of software, as for a long time it has been the POSIX API for asynchronous I/O. The modern POSIX asynchronous I/O approach is the aio_* API, but it is likely not competitive with the kqueue() or epoll() APIs. I haven't used aio in anger, and it certainly wouldn't have the performance and semantics offered by the native approaches, which can aggregate multiple events for higher performance. kqueue() on *BSD has really good edge-triggered semantics for event notification, allowing it to replace select()/poll() without forcing large structural changes to your application. Linux epoll() follows the lead of *BSD kqueue() and improves upon it; kqueue() in turn followed the lead of Sun/Solaris evports.
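For comparison, here is a sketch of how an application can wait on descriptors without hard-coding select(), using Python's `selectors` module. `selectors.DefaultSelector` picks the most efficient mechanism the platform offers (epoll on Linux, kqueue on the BSDs, falling back to poll or select). The function names and the socket-pair demo are assumptions made for illustration.

```python
import selectors
import socket


def wait_for_readable(sockets, timeout=1.0):
    """Return the sockets that have data waiting, or [] on timeout."""
    with selectors.DefaultSelector() as sel:
        for s in sockets:
            sel.register(s, selectors.EVENT_READ)
        # select() returns a list of (key, event_mask) pairs for ready objects.
        return [key.fileobj for key, _ in sel.select(timeout)]


def demo():
    """Two idle connections; data is sent on only one of them."""
    a, b = socket.socketpair()
    c, d = socket.socketpair()
    try:
        a.send(b"ping")  # makes b readable; d stays idle
        ready = wait_for_readable([b, d])
        return ready == [b]
    finally:
        for s in (a, b, c, d):
            s.close()


if __name__ == "__main__":
    print(demo())
```

A long-running server would create the selector once and register each client as it connects, instead of rebuilding the registration on every call as this short sketch does. That persistent registration is the main thing epoll() and kqueue() offer over rebuilding descriptor sets each time around the loop.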
The upshot is that increasing the number of allowed open files across the system adds both time and space overhead for every process in the system, even if they can't make use of those descriptors because of the API they are using. There are also aggregate system limits on the number of open files allowed. This older but interesting tuning summary for 100k-200k simultaneous connections using nginx on FreeBSD provides some insight into the overheads of maintaining open connections, as does another one that covers a wider range of systems but treats "only" 10K connections as the Mt Everest.
Probably the best reference for Unix system programming is W. Richard Stevens' Advanced Programming in the UNIX Environment.

Why do you use the keyword delete?

I understand that delete returns memory that was allocated from the heap back to the heap, but what is the point? Computers have plenty of memory, don't they? And all of the memory is returned as soon as you "X" out of the program.
Example:
Consider a server that allocates an object Packet for each packet it receives (this is bad design for the sake of the example).
A server, by nature, is intended to never shut down. If you never delete the thousands of Packet your server handles per second, your system is going to swamp and crash in a few minutes.
Another example:
Consider a video game that allocates particles for a special effect every time a new explosion is created (and never deletes them). In a game like Starcraft (or other recent ones), after a few minutes of hilarity and destruction (and hundreds of thousands of particles), lag will be so huge that your game will turn into a PowerPoint slideshow, effectively making your player unhappy.
Not all programs exit quickly.
Some applications may run for hours, days or longer. Daemons may be designed to run without cease. Programs can easily consume more memory over their lifetime than available on the machine.
In addition, not all programs run in isolation. Most need to share resources with other applications.
There are a lot of reasons why you should manage your memory usage, as well as any other computer resources you use:
What might start off as a lightweight program could soon become more complex. Depending on your design, areas of memory consumption may grow exponentially.
Remember you are sharing memory resources with other programs. Being a good neighbour allows other processes to use the memory you free up, and helps to keep the entire system stable.
You don't know how long your program might run for. Some people hibernate their session (or never shut their computer down) and might keep your program running for years.
There are many other reasons, I suggest researching on memory allocation for more details on the do's and don'ts.
I see your point that computers have lots of memory, but you are wrong. As an engineer you have to create programs that use computer resources properly.
Imagine you wrote a program that runs the whole time the computer is on. From time to time it creates objects/variables with "new". After a while you don't need them anymore, but you don't delete them. This happens again and again, and you gradually run the machine out of RAM. Eventually the user has to terminate your program and launch it again. That is not so bad, but it is not comfortable either; what is more, your program may take a while to load. The user ends up unhappy because of your silly decision.
Another thing: when you use "new" to create an object you call its constructor, and "delete" calls its destructor. Let's say the object opens a file and the destructor closes it, making it accessible to other processes. In that case you would steal not only memory but also files from other processes.
If you don't want to call "delete" yourself, you can use shared pointers (they are reference counted and delete the object automatically when the last pointer to it goes away, which is not the same as a full garbage collector).
It can be found in the standard library as std::shared_ptr. It has one disadvantage: WIN XP SP 2 and older do not support this. So if you want to create something for the public you should use Boost, which also has boost::shared_ptr. To use Boost you need to download it from here and configure your development environment to use it.

Read files by device/inode order?

I'm interested in an efficient way to read a large number of files on the disk. I want to know whether sorting the files by device and then by inode will give me some speed improvement over reading them in natural order.
There are vast speed improvements to be had from reading files in physical order from rotating storage. Operating system I/O scheduling mechanisms only do any real work if there are several processes or threads contending for I/O, because they have no information about what files you plan to read in the future. Hence, other than simple read-ahead, they usually don't help you at all.
Furthermore, Linux worsens your access patterns during directory scans by returning directory entries to user space in hash table order rather than physical order. Luckily, Linux also provides system calls to determine the physical location of a file, and whether or not a file is stored on a rotational device, so you can recover some of the losses. See for example this patch I submitted to dpkg a few years ago:
http://lists.debian.org/debian-dpkg/2009/11/msg00002.html
This patch does not incorporate a test for rotational devices, because this feature was not added to Linux until 2012:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ef00f59c95fe6e002e7c6e3663cdea65e253f4cc
I also used to run a patched version of mutt that would scan Maildirs in physical order, usually giving a 5x-10x speed improvement.
Note that inodes are small, heavily prefetched and cached, so opening files to get their physical location before reading is well worth the cost. It's true that common tools like tar, rsync, cp and PostgreSQL do not use these techniques, and the simple truth is that this makes them unnecessarily slow.
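The ordering asked about in the question can be sketched in a few lines of Python. Note that this sorts by device and inode number only, which is at best a proxy for where the data sits on disk; it is not the technique used in the dpkg patch linked above. Sorting by true physical location would need the file's block mapping (on Linux, for example, through the FIEMAP ioctl), which is not shown here. The function name is an assumption made for illustration.

```python
import os


def sort_by_device_and_inode(paths):
    """Return paths ordered by (device number, inode number)."""
    keyed = []
    for path in paths:
        st = os.stat(path)  # one stat call per file
        keyed.append((st.st_dev, st.st_ino, path))
    keyed.sort()
    return [path for _, _, path in keyed]


if __name__ == "__main__":
    # Order the entries of the current directory before reading them.
    entries = [e.path for e in os.scandir(".") if e.is_file()]
    for path in sort_by_device_and_inode(entries):
        print(path)
```

Whether this helps depends on the filesystem: where inode numbers correlate with on-disk placement it can reduce seeking, and on an SSD it should make little difference.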
Back in the 1970s I proposed to our computer center that reading/writing from/to disk would be faster overall if they organized the queue of disk reads and/or writes so as to minimize seek time. The computer center told me that, according to their experiments and information from IBM, many studies had been made of several techniques, and the overall throughput of JOBS (not just a single job) was best when disk reads/writes were done in first-come, first-served order. This was an IBM batch system.
In general, optimisation techniques for file access are too tied to the architecture of your storage subsystem for them to be something as simple as a sorting algorithm.
1) You can effectively multiply the read data rate if your files are spread into multiple physical drives (not just partitions) and you read two or more files in parallel from different drives. This one is probably the only method that is easy to implement.
2) Sorting the files by name or inode number does not really change anything in the general case. What you'd want is to sort the files by the physical location of their blocks on the disk, so that they can be read with minimal seeking. There are quite a few obstacles however:
Most filesystems do not provide such information to userspace applications, unless it's for debugging reasons.
The blocks themselves of each file can be spread all over the disk, especially on a mostly full filesystem. There is no way to read multiple files sequentially without seeking back and forth.
You are assuming that your process is the only one accessing the storage subsystem. Once there is at least someone else doing the same, every optimisation you come up with goes out of the window.
You are trying to be smarter than the operating system and its own caching and I/O scheduling mechanisms. It's very likely that by trying to second-guess the kernel, i.e. the only one that really knows your system and your usage patterns, you will make things worse.
Don't you think e.g. PostgreSQL or Oracle would have used a similar technique if they could? When the DB is installed on a proper filesystem they let the kernel do its thing and don't try to second-guess its decisions. Only when the DB is on a raw device do the specialised optimisation algorithms that take physical blocks into account come into play.
You should also take the specific properties of your storage devices into account. Modern SSDs, for example, make traditional seek-time optimisations obsolete.

Performance issues using CopyFile() to copy files from different computers

Using VC++ VisualStudio 2003.
I'm trying to copy several image files (30kb or so per file) from another computer's shared folder to a local folder.
The problem is that there can be more than 2000 or so files in one transfer, and it seems to take its toll, taking substantially more time to complete.
Is there any alternate method of copying files from another computer that could possibly speed up the copy?
Thanks in advance.
EDIT*
Due to a client request, it is not possible to change the code base dramatically.
I hate to have to deviate from best practice because of non-technical issues,
but is there a more subtle approach, such as another function call?
I know I'm asking for some magical voodoo; I'm asking just in case somebody knows of such.
A few things to try:
is copying files using the OS any faster?
if no, then there may be some inherent limitations to your network or the way it's setup (maybe authentication troubles, or the distant server has some hardware issues, or it's too busy, or the network card loses too many packets because of collisions, faulty switch, bad wiring...)
make some tests transferring files of various sizes.
Small files are always slower to transfer because there is a lot of overhead to fetch their details, then transfer the data, then create directory entries etc.
if large files are fast, then your network is OK and you probably won't be able to improve the system much (the bottleneck is elsewhere).
Eventually, from code, you could try to open, read the files into a large buffer in one go then save them on the local drive. This may be faster as you'll be bypassing a lot of checks that the OS does internally.
You could even do this over a few threads to open, load, write files concurrently to speed things up a bit.
A couple of references you can check for multi-threaded file copy:
MTCopy: A Multi-threaded Single/Multi file copying tool on CodeProject
Good parallel/multi-thread file copy util? discussion thread on Channel 9.
McTool a command line tool for parallel file copy.
If implementing this yourself in code is too much trouble, you could always simply execute a utility like McTool in the background of your application and let it do the work for you.
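The multi-threaded approach suggested above can be sketched as follows. This is written in Python to show the idea, not as drop-in code for the VC++ code base in the question; the function name and the default of four worker threads are assumptions made for illustration.

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor


def copy_files_concurrently(sources, dest_dir, workers=4):
    """Copy each source file into dest_dir using a small pool of threads.

    Returns the list of destination paths, in the same order as sources.
    """
    os.makedirs(dest_dir, exist_ok=True)

    def copy_one(src):
        dst = os.path.join(dest_dir, os.path.basename(src))
        return shutil.copy2(src, dst)  # copies data and metadata

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order and re-raises any copy error here.
        return list(pool.map(copy_one, sources))
```

Threads help here because each small copy spends most of its time waiting on network round trips rather than using the CPU, so several copies can overlap. The right number of workers is something to measure, not guess.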
Well, for a start, 2000 is not several. If it's taking most of the time because you're sending lots of small files, then you come up with a solution that packages them at the source into a single file and unpackages them at the destination. This will require some code running at the source - you'll have to design your solution to allow that since I assume at the moment you're just copying from a network share.
If it's the network speed (unlikely), you compress them as well.
My own beliefs are that it will be the number of files, basically all the repeated startup costs of a copy. That's because 2000 30K files is only 60MB, and on a 10Mb link, theoretical minimum time would be about a minute.
If your times are substantially above that, then I'd say I'm right.
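The reasoning above can be checked with a short calculation. The file count, file size and link speed are the ones from the answer; the per-file overhead is a purely hypothetical number, chosen only to show how fixed per-file costs can outweigh the raw transfer time. The function names are also made up for illustration.

```python
def ideal_transfer_seconds(file_count, bytes_per_file, link_bits_per_second):
    """Lower bound: total payload divided by link speed, ignoring all overhead."""
    return file_count * bytes_per_file * 8 / link_bits_per_second


def with_per_file_overhead(ideal_seconds, file_count, overhead_seconds_per_file):
    """Add a fixed startup cost for every file copied."""
    return ideal_seconds + file_count * overhead_seconds_per_file


if __name__ == "__main__":
    ideal = ideal_transfer_seconds(2000, 30_000, 10_000_000)
    print(f"theoretical minimum: {ideal:.0f} s")  # prints "theoretical minimum: 48 s"
    # ASSUMPTION: 50 ms of overhead per file is an invented figure, not a measurement.
    estimate = with_per_file_overhead(ideal, 2000, 0.05)
    print(f"with hypothetical overhead: {estimate:.0f} s")
```

If measured times are far above the 48-second bound, the difference is per-file cost, and that is what packaging the files into one archive removes.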
A solution that uses 7zip or similar to compress them all to a single 7z file, transmit them, then unzip them at the other end sounds like what you're looking for.
But measure, don't guess! Test it out to see if it improves performance. Then make a decision.
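A minimal sketch of the package-then-transfer idea, using Python's standard `zipfile` module as a stand-in for 7zip (a 7z archive would need an external tool or library). The function names are assumptions made for illustration. Compression is off by default, since already-compressed image formats gain little and the main saving is paying the per-file cost once instead of 2000 times.

```python
import os
import zipfile


def pack_files(paths, archive_path, compress=False):
    """Bundle many small files into a single archive for one transfer."""
    method = zipfile.ZIP_DEFLATED if compress else zipfile.ZIP_STORED
    with zipfile.ZipFile(archive_path, "w", compression=method) as archive:
        for path in paths:
            # Store only the file name, not the full source directory path.
            archive.write(path, arcname=os.path.basename(path))
    return archive_path


def unpack_files(archive_path, dest_dir):
    """Extract the archive at the destination; returns the member names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

As the answer notes, the packing step has to run on the machine that holds the files, so this only helps if you can run code at the source rather than just reading from its network share.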
