Write request flow in Linux from user space to the device?

I'm confused as to what happens when I issue a write from user space in Linux. What is the full flow, down to the storage device? Suppose I use CFQ and a kernel that still uses pdflush.
CFQ is said to place requests into sync and async queues. Writes are async, for example, so they go into an async queue based on the I/O priority of the requesting process. The queue then gets its turn at the device according to CFQ time slices, expiration settings, and so on. Fine.
At the same time, writes dirty pages. Writing dirty pages back is triggered by VM settings and done in the context of a pdflush thread, which runs background_writeout(), which in turn calls writeback_inodes(). When this happens, the writes pdflush issues are surely sync.
Does that mean we potentially have two competing writers for the same or similar write requests: one controlled by CFQ's queuing mechanisms and another triggered by the VM subsystem?
Does it mean the VM subsystem effectively breaks CFQ's rules as soon as we hit dirty_background_ratio, since pdflush doesn't carry the same I/O priority as the requesting process?
And does it mean that as soon as we hit dirty_ratio, the CFQ settings become more or less irrelevant, since all writes become sync?
I've read a lot of specific information on the two subsystems, but I have yet to understand the whole write request flow. The interactive Linux kernel map does not include the I/O scheduler.
Any pointers to the whole picture would be appreciated.
Thanks,
Alex

I've found the answer since then. Basically, yes, pdflush and the write cache in general do break I/O priorities for writes. Reads can still benefit from CFQ, since they are sync and sync requests get priority, but for writes to be subject to I/O priorities they have to be direct (O_DIRECT).
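To illustrate that last point, here is a minimal sketch of a "direct" write that keeps the caller's I/O priority. The file path and the chosen priority are hypothetical, and the ioprio constants are spelled out by hand because older glibc versions don't ship a wrapper for the syscall:

    #define _GNU_SOURCE            // for O_DIRECT
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>

    /* Values from the kernel's ioprio interface. */
    #define IOPRIO_CLASS_SHIFT 13
    #define IOPRIO_CLASS_BE    2   /* best-effort class, the one CFQ priorities apply to */
    #define IOPRIO_WHO_PROCESS 1

    int main(void)
    {
        /* Ask CFQ for best-effort priority 7 (the lowest) for this process. */
        long prio = (IOPRIO_CLASS_BE << IOPRIO_CLASS_SHIFT) | 7;
        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, prio) == -1)
            perror("ioprio_set");

        /* O_DIRECT bypasses the page cache, so the write is submitted to the
           I/O scheduler in this process's context and the priority applies. */
        int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT needs sector-aligned buffers and sizes; 4096 is a safe bet. */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
        memset(buf, 'x', 4096);

        if (write(fd, buf, 4096) == -1)
            perror("write");

        close(fd);
        free(buf);
        return 0;
    }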

Related

Can many (similar) processes use a common RAM cache?

As I understand process creation, every process has its own space in RAM for its heap, data, etc., which is allocated when it is created. Many processes can share their data and storage space in some ways. But since terminating a process erases its allocated memory (and thus also its caches), I was wondering whether it is possible for many (similar) processes to share a cache in memory that is not allocated to any specific process, so that it can still be used after those processes are terminated and other ones are created.
This is a theoretical question from a student's perspective, so I am merely interested in what a general-purpose operating system can do, without adding extra functionality to it to achieve this.
For example, think of a web server that uses only single-threaded processes (perhaps due to a lack of multi-threading support), so that most of the processes it creates do similar jobs, such as retrieving a certain page.
There are at least four ways in which what you describe can occur.
First, the system address space is shared by all processes. The operating system can save data there that survives the death of a process.
Second, processes can map logical pages to the same physical page frame. The termination of one process does not cause the page frame to be deallocated to the other processes.
Third, some operating systems have support for writable shared libraries.
Fourth, memory mapped files.
There are probably others as well.
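As a concrete illustration of the second and fourth points, here is a minimal sketch using POSIX shared memory (shm_open plus mmap), assuming a POSIX system; the object name "/demo_cache" is hypothetical. The object persists in the kernel after the creating process exits, until someone calls shm_unlink():

    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdio>

    int main()
    {
        // Create (or attach to) a named shared-memory object. It lives in the
        // kernel until shm_unlink() is called, so it outlives any one process.
        int fd = shm_open("/demo_cache", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (ftruncate(fd, 4096) == -1) { perror("ftruncate"); return 1; }

        char *cache = static_cast<char*>(
            mmap(nullptr, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
        if (cache == MAP_FAILED) { perror("mmap"); return 1; }

        if (cache[0] == '\0')                    // first process to run populates it
            std::strcpy(cache, "populated by an earlier process");
        std::printf("cache says: %s\n", cache);  // later processes see the same data

        munmap(cache, 4096);
        close(fd);
        // Deliberately no shm_unlink(): the object survives this process's exit.
        return 0;
    }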
I think so: when a process is terminated, its RAM is cleared. However, you're right that things such as web pages will be stored in the cache for when they're recalled. For example:
You open Google, then go to another tab and close the open Google page; when you next go to Google it loads faster.
However, what I think you're saying is: if the entire program, e.g. Google Chrome or Safari, is closed, does the web page you just had open stay in the cache? No, when the program is closed all of its related data is also discarded in order to fully close the program.
I guess this page has some info on it -
https://www.wikipedia.org/wiki/Shared_memory

How safe is it to store sessions with Redis?

I'm currently using MySQL to store my sessions. It works great, but it is a bit slow.
I've been asked to use Redis, but I'm wondering if it is a good idea, because I've heard that Redis delays write operations. I'm a bit wary because sessions need to be real-time.
Has anyone experienced such problems?
Redis is perfect for storing sessions. All operations are performed in memory, and so reads and writes will be fast.
The second aspect is persistence of session state. Redis gives you a lot of flexibility in how you want to persist session state to your hard-disk. You can go through http://redis.io/topics/persistence to learn more, but at a high level, here are your options -
If you cannot afford losing any sessions, set appendfsync always in your configuration file. With this, Redis guarantees that any write operations are saved to the disk. The disadvantage is that write operations will be slower.
If you are okay with losing about 1s worth of data, use appendfsync everysec. This will give great performance with reasonable data guarantees.
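For reference, these are just directives in redis.conf; the excerpt below shows the two alternatives described above (enable the append-only file, then pick one fsync policy):

    appendonly yes
    appendfsync always      # never lose an acknowledged write; slowest option
    # appendfsync everysec  # or: at most ~1 second of data at risk; much faster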
This question is really about real-time sessions, and seems to have arisen partly from a misunderstanding of the phrase 'delayed write operations'. While the details were eventually teased out in the comments, I just wanted to make it super-duper clear...
You will have no problems implementing real-time sessions.
Redis is an in-memory key-value store with optional persistence to disk. 'Delayed write operations' refers to writes to disk, not to the database in general, which lives in memory. If you SET a key/value pair, you can GET it immediately (i.e. in real time). The policy you select with regard to persistence (how much you delay the writes) only determines the upper bound on how much data could be lost in a crash.
Basically there are two main types available: async snapshots and fsync(). They're called RDB and AOF respectively. More on persistence modes on the official page.
The signal handling of the daemonized process syncs to disk when it receives a SIGTERM for instance, so the data will still be there after a reboot. I think the daemon or the OS has to crash before you'll see an integrity corruption, even with the default settings (RDB snapshots).
The AOF setting uses an Append Only File that logs the commands the server receives, and recreates the DB from scratch on cold start, from the saved file. The default disk-sync policy is to flush once every second (IIRC) but can be set to lock and write on every command.
Using both the snapshots and the incremental log seems to offer both a long-term, don't-mind-if-I-miss-a-few-seconds-of-data approach and a more secure, but costlier, incremental log. Redis supports replication out of the box, so that can be used as well, it seems.
I'm using the default RDB setting myself and saving the snapshots to remote FTP. I haven't seen a failure that caused data loss yet. Acute hardware failure or a power outage would most likely cause one, but I'm hosted on a VPS. Slim chance of that happening :)

Performance tuning in Linux for a TCP based server application

I have written an application that communicates over TCP using a proprietary protocol. I have a test client that starts a thousand threads, each of which makes requests, and I've noticed I can only get about 100 requests/second, even with fairly simple operations. These requests all come from the same client, so that might be relevant.
I'm trying to understand how I can make things faster. I've read a bit about performance tuning in this area, but I'm trying to work out what I need to understand in order to tune network applications like this. Where should I start? What Linux settings do I need to override, and how do I do it?
Help is greatly appreciated, thanks!
Have you considered using asynchronous methods for the test instead of trying to spawn lots of threads? Each time one thread stops and another starts on the same CPU core (i.e. a context switch), there can be very significant overhead. If you want a quick example of networking using asynchronous methods, check out networkComms.net and look at how the NetworkComms.ConnectionListenModeUseSync property is used here. Obviously, if you're running on Linux you would have to use Mono to run networkComms.net.
Play around with the sysctls and socket options of the TCP stack: see the tcp(7) man page. For example, you can change the TCP send and receive buffer sizes or switch TCP_NODELAY on. To really tune the TCP stack itself you should know how TCP works: things like slow start, congestion control, the congestion window, and so on. But this relates to transmit/receive performance and buffering; the problem may instead lie in how your process handles requests.
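To make the socket-option part concrete, here is a minimal sketch. The 256 KB buffer size is an arbitrary illustration, and the kernel may clamp it to the net.core.wmem_max / net.core.rmem_max limits:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <unistd.h>
    #include <cstdio>

    int main()
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        // Disable Nagle's algorithm so small request/response messages are not
        // delayed while the stack waits to coalesce them.
        int one = 1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one) == -1)
            perror("TCP_NODELAY");

        // Ask for larger send/receive buffers; the kernel may clamp these.
        int bufsize = 256 * 1024;
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof bufsize);
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof bufsize);

        // ... connect()/bind() and use the socket as usual ...
        close(fd);
        return 0;
    }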
You need to understand the following Linux utility commands:
uptime - tells how long the system has been running. It gives a one-line display of the current time, how long the system has been running, how many users are currently logged on, and the system load averages for the past 1, 5, and 15 minutes.
top - provides an ongoing look at processor activity in real time. It displays a listing of the most CPU-intensive tasks on the system and can provide an interactive interface for manipulating processes.
mpstat - writes to standard output the activities of each available processor, processor 0 being the first one. Global average activities across all processors are also reported. It can be used on both SMP and UP machines, but on the latter only the global averages are printed. If no activity has been selected, the default report is CPU utilization.
iostat - monitors system input/output device loading by observing the time the devices are active in relation to their average transfer rates.
vmstat - reports information about processes, memory, paging, block I/O, traps, and CPU activity. The first report produced gives averages since the last reboot.
free - displays information about free and used memory on the system.
ping - uses the ICMP protocol's mandatory ECHO_REQUEST datagram to elicit an ICMP ECHO_RESPONSE from a host or gateway.
dstat - lets you view all of your system resources at once; you can, for example, compare disk usage with interrupts from your IDE controller, or compare the network bandwidth numbers directly with the disk throughput over the same interval.
Then you need to know how the TCP protocol works and learn how to identify where the latency is: is the problem in the connection setup (SYN, SYN-ACK, ACK), in the data transfer, or in the connection release?

aio on osx: Is it implemented in the kernel or with user threads? Other Options?

I am working on my small C++ framework and have a file class that should also support async reading and writing. The only solution I have found, other than using synchronous file I/O inside some worker threads, is AIO. Anyway, I was looking around and read somewhere that on Linux, AIO is not even implemented in the kernel but rather with user threads. Is the same true for OS X? Another concern is AIO's callback mechanism, which has to spawn an extra thread for each callback, since you can't assign a particular thread or thread pool to take care of them (signals are not an option for me). So here are the questions resulting from that:
Is AIO implemented in the OS X kernel, and thus most likely better than my own threaded implementation?
Can the callback system (spawning a thread for each callback) become a bottleneck in practice?
If AIO is not worth using on OS X, are there any other alternatives on Unix? In Cocoa? In Carbon?
Or should I simply emulate async I/O with my own thread pool?
What is your experience on the subject?
You can see exactly how AIO is implemented on OS X right here.
The implementation uses kernel threads and a single priority queue of jobs, ordered by each request's priority; each thread pops a job and executes it in a blocking fashion (at least that's what it looks like at first glance).
You can configure the number of threads and the size of the queue with sysctl. To see these options and their default values, run sysctl -a | grep aio:
kern.aiomax = 90
kern.aioprocmax = 16
kern.aiothreads = 4
In my experience, in order for it to make any sense to use AIO, these limits need to be a lot higher.
As for the callbacks in threads, I don't believe Mac OS X supports that. It only does completion notifications through signals (see source).
You could probably do just as good a job with your own thread pool. One thing you could do better than the current Darwin implementation is to sort your read jobs by physical location on the disk (see fcntl() and F_LOG2PHYS), which might even give you an edge.
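As a rough sketch of what that lookup could look like (Darwin only, since F_LOG2PHYS is not available elsewhere; the helper name is just for illustration):

    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    // Returns the physical (device) offset backing a given logical offset of an
    // open file, or -1 on failure. Read jobs could then be sorted by this value.
    off_t physical_offset(int fd, off_t logical)
    {
        if (lseek(fd, logical, SEEK_SET) == (off_t)-1)
            return -1;
        struct log2phys l2p;                    // declared in <sys/fcntl.h> on Darwin
        if (fcntl(fd, F_LOG2PHYS, &l2p) == -1)  // maps the current file offset
            return -1;
        return l2p.l2p_devoffset;               // physical offset on the device
    }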
@Moka: Sorry to say you're wrong about the Linux implementation; as of kernel 2.6 there is a kernel implementation of AIO, which comes with libaio (libaio.h).
The implementation that doesn't use kernel threads but instead uses user threads is POSIX.1 AIO, and it does it that way to be more portable, as not all Unix-based OSes support completion events at the kernel level.
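For reference, the portable POSIX.1 AIO interface mentioned above looks roughly like this. A minimal read sketch follows; the file path is hypothetical, and on Linux you may need to link with -lrt:

    #include <aio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cerrno>
    #include <cstring>
    #include <cstdio>

    int main()
    {
        int fd = open("example.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        struct aiocb cb;
        std::memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;

        if (aio_read(&cb) == -1) { perror("aio_read"); return 1; }

        // Do other work here; this loop just polls for completion.
        while (aio_error(&cb) == EINPROGRESS)
            usleep(1000);

        ssize_t n = aio_return(&cb);
        std::printf("read %zd bytes asynchronously\n", n);
        close(fd);
        return 0;
    }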

In an ISAPI filter, what is a good approach for a common logfile for multiple processes?

I have an ISAPI filter that runs on IIS6 or 7. When there are multiple worker processes ("Web garden"), the filter will be loaded and run in each w3wp.exe.
How can I efficiently allow the filter to log its activities in a single consolidated logfile?
- Log messages from the different (concurrent) processes must not interfere with each other. In other words, a single log message emitted from any of the w3wp.exe must be realized as a single contiguous line in the log file.
- There should be minimal contention for the logfile. The websites may serve 100's of requests per second.
- Strict time ordering is preferred. In other words, if w3wp.exe process #1 emits a message at t1, then process #2 emits a message at t2, then process #1 emits a message at t3, the messages should appear in proper time order in the log file.
The current approach I have is that each process owns a separate logfile. This has obvious drawbacks.
Some ideas:
- Nominate one of the w3wp.exe processes to be the "logfile owner" and send all log messages through that special process. This has problems in the case of worker process recycling.
- Use an OS mutex to protect access to the logfile. Is this high-perf enough? In this case each w3wp.exe would have a FILE* open on the same filesystem file. Must I fflush the logfile after each write? Will this work?
any suggestions?
At first I was going to say that I like your current approach best, because each process shares nothing, and then I realized, that, well, they are probably all sharing the same hard drive underneath. So, there's still a bottleneck where contention occurs. Or maybe the OS and hard drive controllers are really smart about handling that?
I think what you want to do is have the writing of the log not slow down the threads that are doing the real work.
So, run another process on the same machine (lower priority?) which actually writes the log messages to disk. Communicate with that process not via UDP as suggested, but rather via memory that the processes share, also known (confusingly) as a memory-mapped file. More about memory mapped files. At my company, we have found memory-mapped files to be much faster than loopback TCP/IP for communication on the same box, so I'm assuming they would be faster than UDP too.
What you actually have in your shared memory could be, for starters, a std::queue whose pushes and pops are protected by a mutex. Your ISAPI threads would grab the mutex to put things into the queue. The logging process would grab the mutex to pull things off the queue, release the mutex, and then write the entries to disk. The mutex only protects the updating of the shared memory, not the updating of the file, so it seems in theory that the mutex would be held for a shorter time, creating less of a bottleneck.
The logging process could even re-arrange the order of what it's writing to get the timestamps in order.
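A minimal sketch of that idea follows, with one caveat: a std::queue allocates from the owning process's heap, so it can't safely live in cross-process shared memory; the sketch uses a fixed-size slot array in a named file mapping instead, guarded by a named mutex. The object names and sizes are hypothetical, there is no full-buffer handling, and in a real filter you would create and map the handles once at startup rather than per message:

    #include <windows.h>
    #include <cstring>

    const int SLOTS    = 1024;
    const int SLOT_LEN = 512;

    struct LogBuffer {
        LONG head;                      // next slot to write
        LONG tail;                      // next slot the collector should read
        char slots[SLOTS][SLOT_LEN];
    };

    void log_message(const char *msg)
    {
        HANDLE hMap = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                                         0, sizeof(LogBuffer), "Local\\IsapiLogBuffer");
        HANDLE hMutex = CreateMutexA(NULL, FALSE, "Local\\IsapiLogMutex");
        if (hMap && hMutex) {
            LogBuffer *buf = static_cast<LogBuffer*>(
                MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, sizeof(LogBuffer)));
            if (buf) {
                WaitForSingleObject(hMutex, INFINITE);   // protects only the shared queue
                char *slot = buf->slots[buf->head % SLOTS];
                strncpy(slot, msg, SLOT_LEN - 1);
                slot[SLOT_LEN - 1] = '\0';
                buf->head++;                             // collector drains tail..head to disk
                ReleaseMutex(hMutex);
                UnmapViewOfFile(buf);
            }
        }
        if (hMutex) CloseHandle(hMutex);
        if (hMap)   CloseHandle(hMap);
    }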
Here's another variation: continue to have a separate log for each process, but have a logger thread within each process, so that the main time-critical threads don't have to wait for the logging to occur before proceeding with their work.
The problem with everything I've written here is that the whole system (hardware, OS, the way multicore CPU L1/L2 caches work, your software) is too complex to be easily predictable just by thinking it through. Code up some simple proof-of-concept apps, instrument them with some timings, and try them out on the real hardware.
Would logging to a database make sense here?
I've used a UDP-based logging system in the past and I was happy with this kind of solution.
The logs are sent via UDP to a log-collector process which is in charge of saving them to a file on a regular basis.
I don't know if it can work in your high-perf context, but I was satisfied with that solution in a less heavily loaded application.
I hope it helps.
Rather than an OS mutex to control access to the file, you could just use the Win32 file locking mechanisms, LockFile() and UnlockFile().
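A minimal sketch of the idea is below. It uses LockFileEx rather than plain LockFile so that a writer waits for a conflicting lock instead of failing immediately; the lock covers a single fixed byte that acts as a cross-process "log lock", and the path is hypothetical:

    #include <windows.h>

    void append_line(const char *line, DWORD len)
    {
        HANDLE h = CreateFileA("C:\\logs\\site.log", GENERIC_WRITE,
                               FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                               OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) return;

        OVERLAPPED ov = {};            // offset 0: lock the first byte of the file
        if (LockFileEx(h, LOCKFILE_EXCLUSIVE_LOCK, 0, 1, 0, &ov)) {
            SetFilePointer(h, 0, NULL, FILE_END);   // seek to end while holding the lock
            DWORD written = 0;
            WriteFile(h, line, len, &written, NULL);
            FlushFileBuffers(h);                    // make the line durable
            UnlockFileEx(h, 0, 1, 0, &ov);
        }
        CloseHandle(h);
    }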
My suggestion is to send messages asynchronously (over UDP) to a process that takes charge of recording the log.
That process would have:
- one receiver thread that puts incoming messages in a queue;
- one thread responsible for removing messages from the queue and placing them in a time-ordered list;
- one thread that monitors the list and writes out only messages older than some minimum age (to prevent a delayed message from being written out of order).
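On the sending side, each w3wp.exe only needs a fire-and-forget UDP datagram per log line. A minimal sketch, assuming a collector listening on 127.0.0.1:5140 (the port is arbitrary); in a real filter you would call WSAStartup and create the socket once, not per message:

    #include <winsock2.h>
    #include <ws2tcpip.h>
    #include <cstring>
    #pragma comment(lib, "ws2_32.lib")

    void send_log_line(const char *line)
    {
        WSADATA wsa;
        if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0) return;

        SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
        if (s != INVALID_SOCKET) {
            sockaddr_in dest = {};
            dest.sin_family = AF_INET;
            dest.sin_port   = htons(5140);               // hypothetical collector port
            inet_pton(AF_INET, "127.0.0.1", &dest.sin_addr);
            sendto(s, line, (int)strlen(line), 0,
                   (sockaddr*)&dest, sizeof dest);       // no ACK: logging never blocks
            closesocket(s);
        }
        WSACleanup();
    }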
You could continue logging to separate files and find/write a tool to merge them later (perhaps automated, or you could just run it at the point you want to use the files.)
Event Tracing for Windows, included in Windows Vista and later, provides a nice capability for this.
Excerpt:
Event Tracing for Windows (ETW) is an efficient kernel-level tracing facility that lets you log kernel or application-defined events to a log file. You can consume the events in real time or from a log file and use them to debug an application or to determine where performance issues are occurring in the application.
