I have written an application that communicates over TCP using a proprietary protocol. I have a test client that starts a thousand threads, each of which makes requests, and I noticed I can only get about 100 requests/second, even with fairly simple operations. These requests are all coming from the same client, so that might be relevant.
I'm trying to understand how I can make things faster. I've read a bit about performance tuning in this area, but I'm trying to understand what I need to know to performance-tune network applications like this. Where should I start? What Linux settings do I need to override, and how do I do it?
Help is greatly appreciated, thanks!
Have you considered using asynchronous methods to do the test instead of trying to spawn lots of threads? Each time one thread stops and another starts on the same CPU core (a context switch), there can be very significant overhead. If you want a quick example of networking using asynchronous methods, check out networkComms.net and look at how the NetworkComms.ConnectionListenModeUseSync property is used here. Obviously, if you're running on Linux you would have to use Mono to run networkComms.net.
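Since the question mentions Linux, here is a minimal C++ sketch of the same idea (event-driven I/O instead of a thread per request) using epoll rather than networkComms.net. The port number is an arbitrary example and error handling is omitted; it is just the underlying pattern, not a drop-in replacement:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main() {
        // One listening socket, one event loop, no thread per connection.
        int listener = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = INADDR_ANY;
        addr.sin_port = htons(9000);                       // example port
        bind(listener, (sockaddr*)&addr, sizeof(addr));
        listen(listener, SOMAXCONN);

        int ep = epoll_create1(0);
        epoll_event ev{};
        ev.events = EPOLLIN;
        ev.data.fd = listener;
        epoll_ctl(ep, EPOLL_CTL_ADD, listener, &ev);

        epoll_event events[64];
        char buf[4096];
        for (;;) {
            int ready = epoll_wait(ep, events, 64, -1);
            for (int i = 0; i < ready; ++i) {
                int fd = events[i].data.fd;
                if (fd == listener) {                      // new connection
                    int client = accept4(listener, nullptr, nullptr, SOCK_NONBLOCK);
                    epoll_event cev{};
                    cev.events = EPOLLIN;
                    cev.data.fd = client;
                    epoll_ctl(ep, EPOLL_CTL_ADD, client, &cev);
                } else {                                   // request data
                    ssize_t n = read(fd, buf, sizeof(buf));
                    if (n <= 0) { close(fd); continue; }
                    write(fd, buf, n);                     // echo the "response"
                }
            }
        }
    }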
Play around with the sysctls and socket options of the TCP stack: see man 7 tcp. For example, you can change TCP's send and receive buffer sizes or enable TCP_NODELAY. To tune the TCP stack itself you should understand how TCP works: slow start, congestion control, the congestion window, and so on. But that mostly governs transmit/receive performance and buffering; the bottleneck may just as well be in how your process handles the requests.
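For the per-socket options, a minimal C++ sketch of what that looks like (the 256 KB buffer size is an arbitrary example, not a recommendation; measure before committing to a value):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    int tune_socket(int fd) {
        int one = 1;
        // Disable Nagle's algorithm so small request/response messages are
        // sent immediately instead of being coalesced.
        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
            return -1;

        // Enlarge the per-socket send/receive buffers (the kernel may clamp
        // these to the net.core.wmem_max / net.core.rmem_max sysctl limits).
        int bufsize = 256 * 1024;
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));
        return 0;
    }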
You need to understand the following Linux utility commands:
uptime - Tell how long the system has been running.
uptime gives a one line display of the following information. The current time, how long the system has been running, how many users are currently logged on, and the system load averages for the past 1, 5, and 15 minutes.
top provides an ongoing look at processor activity in real time. It displays a listing of the most CPU-intensive tasks on the system, and can provide an interactive interface for manipulating processes
The mpstat command writes to standard output activities for each available processor, processor 0 being the first one. Global average activities among all processors are also reported. The mpstat command can be used both on SMP and UP machines, but in the latter, only global average activities will be printed. If no activity has been selected, then the default report is the CPU utilization report.
iostat - The iostat command is used for monitoring system input/output device loading by observing the time the devices are active in relation to their average transfer rates.
vmstat reports information about processes, memory, paging, block IO, traps, and cpu activity. The first report produced gives averages since the last reboot
free - display information about free and used memory on the system
ping : ping uses the ICMP protocol's mandatory ECHO_REQUEST datagram to elicit an ICMP ECHO_RESPONSE from a host or gateway
Dstat allows you to view all of your system resources instantly; you can, for example, compare disk usage in combination with interrupts from your IDE controller, or compare the network bandwidth numbers directly with the disk throughput (in the same interval).
Then you need to know how the TCP protocol works. Learn how to identify network latency and where the problem is: is it in the SYN, SYN-ACK, ACK handshake, the data transfer, or the connection release?
I have a script server that runs arbitrary JavaScript code on our servers. At any given time multiple scripts can be running, and I would like to prevent one misbehaving script from eating up all the RAM on the machine. I could do this by having each script run in its own process and have an off-the-shelf monitoring tool watch the RAM usage of each process, killing and restarting the ones that get out of hand. I don't want to do this because I would like to avoid the cost of restarting the binary every time one of these scripts goes crazy. Is there a way in V8 to set a per-context/isolate memory limit that I can use to sandbox the running scripts?
It should be easy to do now:
context.EstimatedSize() to get the estimated size of the context
isolate.TerminateExecution() when the context goes beyond acceptable memory/CPU usage/whatever
To regain control if there is an infinite loop (or something else blocking, like a heavy CPU-bound calculation), I think you could use isolate.RequestInterrupt()
A single process can run multiple isolates; if you have a 1-isolate-to-1-context ratio you can easily:
restrict memory usage per isolate
get heap stats
See some examples in this commit:
https://github.com/discourse/mini_racer/commit/f7ec907547e9a6ea888b2587e4edee3766752dd3
In particular you have:
v8::HeapStatistics stats;
isolate->GetHeapStatistics(&stats);
There are also fancy features like memory allocation callbacks you can use.
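Putting those pieces together, here is a rough C++ sketch of a heap watchdog. It is not taken from the mini_racer commit, the 64 MB budget is an arbitrary example, and V8 API details vary between releases:

    #include <v8.h>

    // Returns true if the isolate's heap is over an arbitrary example budget.
    bool OverBudget(v8::Isolate* isolate) {
        const size_t kMaxHeapBytes = 64 * 1024 * 1024;
        v8::HeapStatistics stats;
        isolate->GetHeapStatistics(&stats);
        return stats.used_heap_size() > kMaxHeapBytes;
    }

    // Called from a watchdog thread: request an interrupt so the check runs
    // on the isolate's own thread even if the script is stuck in a loop,
    // then terminate the running script if it is over budget.
    void Watchdog(v8::Isolate* isolate) {
        isolate->RequestInterrupt(
            [](v8::Isolate* iso, void*) {
                if (OverBudget(iso))
                    iso->TerminateExecution();
            },
            nullptr);
    }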
This is not reliably possible.
All JavaScript contexts used by this process share the same object heap.
WebKit/Chromium tries some tricks to disable contexts after a context OOMs.
http://code.google.com/searchframe#OAMlx_jo-ck/src/third_party/WebKit/Source/WebCore/bindings/v8/V8Proxy.cpp&exact_package=chromium&q=V8Proxy&type=cs&l=361
Sources:
http://code.google.com/p/v8/source/browse/trunk/src/heap.h?r=11125&spec=svn11125#280
http://code.google.com/p/chromium/issues/detail?id=40521
http://code.google.com/p/chromium/issues/detail?id=81227
Has anyone tried to create a log file of interprocess communications? Could someone give me a little advice on the best way to achieve this?
The question is not quite clear, and comments make it less clear, but anyway...
The two things to try first are ipcs and strace -e trace=ipc.
If you want to log all IPC (which seems very intensive), you should consider instrumentation.
There are a lot of good tools for this; check out Pin in particular, and this section of the manual:
In this example, we show how to do more selective instrumentation by examining the instructions. This tool generates a trace of all memory addresses referenced by a program. This is also useful for debugging and for simulating a data cache in a processor.
If you're doing some heavyweight tuning and analysis, check out TAU (Tuning and Analysis Utilities).
Communication to a kernel driver can take many forms. There is usually a special device file for communication, or there can be a special socket type, like NETLINK. If you are lucky, there's a character device to which read() and write() are the sole means of interaction - if that's the case then those calls are easy to intercept with a variety of methods. If you are unlucky, many things are done with ioctls or something even more difficult.
However, running 'strace' on the program using the kernel driver to communicate can reveal just about all it does - though 'ltrace' might be more readable if there happens to be libraries the program uses for communication. By tuning the arguments to 'strace', you can probably get a dump which contains just the information you need:
First, just eyeball the calls and try to figure out the means of kernel communication
Then, add filters to strace call to log only the kernel communication calls
Finally, make sure strace logs the full strings of all calls, so you don't have to deal with truncated data
The answers which point to IPC debugging are probably not relevant, as communicating with the kernel almost never has anything to do with IPC (at least not with the various UNIX IPC facilities).
I'm confused as to what happens when I issue a write from user space in Linux. What is the full flow, down to the storage device? Supposing I use CFQ and a kernel that still uses pdflush.
CFQ is said to place requests into sync and async queues. Writes are async, for example, so they go into an async queue based on the I/O priority. The queue will get CPU according to CFQ time slices, expiration settings, etc. Fine.
At the same time, writes dirty pages. Writing out dirty pages is triggered by VM settings and done in the context of the pdflush thread, which runs a copy of background_writeout(), which calls writeback_inodes(). When this happens, the writes pdflush does are surely sync.
Does it mean that we have two competing writes going on potentially for the same or similar write requests - one controlled by CFQ queuing mechanisms and another triggered by VM subsystem?
Does it mean the VM subsystem effectively breaks CFQ rules as soon as we hit dirty_background_ratio, since pdflush doesn't carry the same I/O priority as the requesting process?
And as soon as we hit dirty_ratio, do the CFQ settings become more or less irrelevant, since all writes become sync?
I've read a lot of specific info on the two subsystems, but I have yet to understand the whole write request flow. The interactive Linux kernel map does not include the I/O scheduler.
Any pointers to the whole picture would be appreciated.
Thanks,
Alex
I've found the answer since then. Basically, yes, pdflush and the write cache in general do break I/O priorities for writes. Reads can still benefit from CFQ, since they are sync and sync requests get priority, but for writes to be subject to I/O priorities they have to be direct.
I have an ISAPI filter that runs on IIS6 or 7. When there are multiple worker processes ("Web garden"), the filter will be loaded and run in each w3wp.exe.
How can I efficiently allow the filter to log its activities in a single consolidated logfile?
log messages from the different (concurrent) processes must not interfere with each other. In other words, a single log message emitted from any of the w3wp.exe must be realized as a single contiguous line in the log file.
there should be minimal contention for the logfile. The websites may serve hundreds of requests per second.
strict time ordering is preferred. In other words if w3wp.exe process #1 emits a message at t1, then process #2 emits a message at t2, then process #1 emits a message at t3, the messages should appear in proper time order in the log file.
The current approach I have is that each process owns a separate logfile. This has obvious drawbacks.
Some ideas:
nominate one of the w3wp.exe to be "logfile owner" and send all log messages through that special process. This has problems in case of worker process recycling.
use an OS mutex to protect access to the logfile. Is this high-performance enough? In this case each w3wp.exe would have a FILE handle open on the same filesystem file. Must I fflush the logfile after each write? Will this work?
any suggestions?
At first I was going to say that I like your current approach best, because each process shares nothing, and then I realized, that, well, they are probably all sharing the same hard drive underneath. So, there's still a bottleneck where contention occurs. Or maybe the OS and hard drive controllers are really smart about handling that?
I think what you want to do is have the writing of the log not slow down the threads that are doing the real work.
So, run another process on the same machine (lower priority?) which actually writes the log messages to disk. Communicate with that process not via UDP as suggested, but rather via memory that the processes share, also known (confusingly) as a memory-mapped file. More about memory mapped files. At my company, we have found memory-mapped files to be much faster than loopback TCP/IP for communication on the same box, so I'm assuming it would be faster than UDP too.
What you actually have in your shared memory could be, for starters, an std::queue where the pushes and pops are protected using a mutex. Your ISAPI threads would grab the mutex to put things into the queue. The logging process would grab the mutex to pull things off of the queue, release the mutex, and then write the entries to disk. The mutex only protects the updating of shared memory, not the updating of the file, so it seems in theory that the mutex would be held for a briefer time, creating less of a bottleneck.
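As a rough C++ sketch of that layout (note that a plain std::queue cannot live in shared memory without a custom allocator, so this uses a fixed-size ring buffer instead; the object names, slot count, and message size are placeholders, and the consumer side that drains via tail is omitted):

    #include <windows.h>
    #include <cstring>

    // A fixed-size ring buffer in a named shared-memory section, guarded by
    // a named mutex. The logging process drains it via `tail` (not shown).
    struct LogRing {
        enum { kSlots = 1024, kMsgSize = 512 };
        LONG head;                                // next slot to write
        LONG tail;                                // next slot to read
        char messages[kSlots][kMsgSize];
    };

    struct SharedLog {
        HANDLE   mapping;
        HANDLE   mutex;
        LogRing* ring;
    };

    SharedLog OpenSharedLog() {
        SharedLog s{};
        s.mapping = CreateFileMappingW(INVALID_HANDLE_VALUE, nullptr,
                                       PAGE_READWRITE, 0, sizeof(LogRing),
                                       L"Local\\IsapiLogRing");
        s.ring  = (LogRing*)MapViewOfFile(s.mapping, FILE_MAP_ALL_ACCESS, 0, 0, 0);
        s.mutex = CreateMutexW(nullptr, FALSE, L"Local\\IsapiLogMutex");
        return s;
    }

    // Producer side (ISAPI worker thread): hold the mutex only long enough
    // to copy the message into the ring, never while touching the disk.
    void Push(SharedLog& s, const char* msg) {
        WaitForSingleObject(s.mutex, INFINITE);
        LogRing* r = s.ring;
        if (r->head - r->tail < LogRing::kSlots) {          // drop if full
            strncpy_s(r->messages[r->head % LogRing::kSlots],
                      LogRing::kMsgSize, msg, _TRUNCATE);
            r->head++;
        }
        ReleaseMutex(s.mutex);
    }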
The logging process could even re-arrange the order of what it's writing to get the timestamps in order.
Here's another variation: continue to have a separate log for each process, but have a logger thread within each process so that the main time-critical thread doesn't have to wait for the logging to occur in order to proceed with its work.
The problem with everything I've written here is that the whole system - hardware, OS, the way multicore CPU L1/L2 caches work, your software - is too complex to be easily predictable by just thinking it through. Code up some simple proof-of-concept apps, instrument them with some timings, and try them out on the real hardware.
Would logging to a database make sense here?
I've used a UDP-based logging system in the past and I was happy with this kind of solution.
The logs are sent via UDP to a log-collector process which is in charge of saving them to a file on a regular basis.
I don't know if it can work in your high-perf context, but I was satisfied with that solution in a less stressed application.
I hope it helps.
Rather than an OS Mutex to control access to the file, you could just use Win32 file locking mechanisms with LockFile() and UnlockFile().
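A rough sketch of that approach in C++, using the Ex variants (the path is a placeholder; every writer opens the file for append and contends on a one-byte lock, which the OS releases automatically if a process dies with the handle open):

    #include <windows.h>
    #include <string>

    void AppendLogLine(const std::string& line) {
        HANDLE h = CreateFileA("C:\\logs\\site.log", FILE_APPEND_DATA,
                               FILE_SHARE_READ | FILE_SHARE_WRITE, nullptr,
                               OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (h == INVALID_HANDLE_VALUE) return;

        // Lock a single "guard" byte at offset 0; all writers contend on it.
        OVERLAPPED ov = {};
        LockFileEx(h, LOCKFILE_EXCLUSIVE_LOCK, 0, 1, 0, &ov);

        DWORD written = 0;
        WriteFile(h, line.data(), (DWORD)line.size(), &written, nullptr);
        FlushFileBuffers(h);       // make the line durable before releasing

        UnlockFileEx(h, 0, 1, 0, &ov);
        CloseHandle(h);
    }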
My suggestion is to send messages asynchronously (over UDP) to a process that will take charge of recording the log.
That process would have:
- one receiver thread that puts the messages in a queue;
- one thread responsible for removing messages from the queue and putting them into a time-ordered list;
- one thread that monitors the messages in the list and writes to the file only those older than a minimum age (to prevent a delayed message from being written out of order).
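A rough C++ sketch of just the reorder-and-flush part of that design (the grace period and file name are arbitrary examples; the UDP receiver is omitted):

    #include <chrono>
    #include <fstream>
    #include <map>
    #include <mutex>
    #include <string>

    using Clock = std::chrono::steady_clock;

    // Buffers incoming messages keyed by sender timestamp and flushes only
    // those older than a grace period, so a late datagram can still be
    // written in the correct order.
    class ReorderingWriter {
    public:
        void Add(Clock::time_point senderTime, const std::string& line) {
            std::lock_guard<std::mutex> lock(mutex_);
            pending_.emplace(senderTime, line);
        }

        // Called periodically by the monitor thread.
        void FlushOlderThan(Clock::duration gracePeriod) {
            std::lock_guard<std::mutex> lock(mutex_);
            const Clock::time_point cutoff = Clock::now() - gracePeriod;
            auto it = pending_.begin();
            while (it != pending_.end() && it->first < cutoff) {
                file_ << it->second << '\n';
                it = pending_.erase(it);
            }
            file_.flush();
        }

    private:
        std::mutex mutex_;
        std::multimap<Clock::time_point, std::string> pending_;  // time-ordered
        std::ofstream file_{"consolidated.log", std::ios::app};
    };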
You could continue logging to separate files and find/write a tool to merge them later (perhaps automated, or you could just run it at the point you want to use the files.)
Event Tracing for Windows, included in Windows Vista and later, provides a nice capability for this.
Excerpt:
Event Tracing for Windows (ETW) is an efficient kernel-level tracing facility that lets you log kernel or application-defined events to a log file. You can consume the events in real time or from a log file and use them to debug an application or to determine where performance issues are occurring in the application.
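For application-defined events, here is a rough C++ sketch using the TraceLogging flavor of ETW (it needs a newer Windows SDK to build; the provider name and GUID are placeholders, so generate your own):

    #include <windows.h>
    #include <TraceLoggingProvider.h>

    // Placeholder provider; replace the name and GUID with your own.
    TRACELOGGING_DEFINE_PROVIDER(
        g_provider, "Example.IsapiFilter",
        (0x3f5b6d02, 0x1c2d, 0x4a8e, 0x9a, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77));

    void LogRequest(const wchar_t* url, int status) {
        // Each event is timestamped and buffered per-CPU by ETW, so multiple
        // w3wp.exe processes can emit events without locking a shared file.
        TraceLoggingWrite(g_provider, "RequestHandled",
                          TraceLoggingValue(url, "Url"),
                          TraceLoggingValue(status, "Status"));
    }

    int main() {
        TraceLoggingRegister(g_provider);
        LogRequest(L"/default.aspx", 200);
        TraceLoggingUnregister(g_provider);
    }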
Is there any way to set a system-wide memory limit a process can use in Windows XP? I have a couple of unstable apps which work OK most of the time but can hit a bug which results in them eating all the memory in a matter of seconds (or at least I suppose that's what happens). This results in a hard reset as Windows becomes totally unresponsive and I lose my work.
I would like to be able to do something like the /etc/limits on Linux - setting M90, for instance (to set 90% max memory for a single user to allocate). So the system gets the remaining 10% no matter what.
Use Windows Job Objects. Jobs are like process groups and can limit memory usage and process priority.
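A rough C++ sketch of launching a child process under a memory-limited job (the executable path and the 512 MB cap are placeholders; error handling is omitted):

    #include <windows.h>

    int main() {
        HANDLE job = CreateJobObjectW(nullptr, nullptr);

        JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits = {};
        limits.BasicLimitInformation.LimitFlags =
            JOB_OBJECT_LIMIT_PROCESS_MEMORY | JOB_OBJECT_LIMIT_JOB_MEMORY;
        limits.ProcessMemoryLimit = 512 * 1024 * 1024;   // per process
        limits.JobMemoryLimit     = 512 * 1024 * 1024;   // whole job
        SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                                &limits, sizeof(limits));

        STARTUPINFOW si = { sizeof(si) };
        PROCESS_INFORMATION pi = {};
        // Start suspended so the process cannot allocate before it is in the job.
        CreateProcessW(L"C:\\apps\\unstable.exe", nullptr, nullptr, nullptr,
                       FALSE, CREATE_SUSPENDED, nullptr, nullptr, &si, &pi);
        AssignProcessToJobObject(job, pi.hProcess);
        ResumeThread(pi.hThread);

        WaitForSingleObject(pi.hProcess, INFINITE);
        return 0;
    }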
Use the Application Verifier (AppVerifier) tool from Microsoft.
In my case I needed to simulate memory no longer being available, so I did the following in the tool:
Added my application
Unchecked Basic
Checked Low Resource Simulation
Changed TimeOut to 120000 - my application will run normally for 2 minutes before anything goes into effect.
Changed HeapAlloc to 100 - 100% chance of heap allocation error
Set Stacks to true - the stack will not be able to grow any larger
Save
Start my application
After 2 minutes my program could no longer allocate new memory and I was able to see how everything was handled.
Depending on your applications, it might be easier to limit the memory the language interpreter uses. For example with Java you can set the amount of RAM the JVM will be allocated.
Otherwise it is possible to set it once for each process with the Windows API:
SetProcessWorkingSetSize Function
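A rough C++ sketch of calling it from the process itself; note that this caps the working set (pages resident in RAM), not total committed memory, and the maximum is only a soft limit unless the Ex variant is used with a hard-limit flag. The values are arbitrary examples:

    #include <windows.h>

    int main() {
        SIZE_T minWorkingSet = 10 * 1024 * 1024;    // 10 MB, example value
        SIZE_T maxWorkingSet = 200 * 1024 * 1024;   // 200 MB, example value
        if (!SetProcessWorkingSetSize(GetCurrentProcess(),
                                      minWorkingSet, maxWorkingSet)) {
            // Inspect GetLastError() here.
        }
        return 0;
    }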
No way to do this that I know of, although I'm very curious to read if anyone has a good answer. I have been thinking about adding something like this to one of the apps my company builds, but have found no good way to do it.
The one thing I can think of (although not directly on point) is that I believe you can limit the total memory usage for a COM+ application in Windows. It would require the app to be written to run in COM+, of course, but it's the closest way I know of.
The working set stuff is good (Job Objects also control working sets), but that's not total memory usage, only real memory usage (paged in) at any one time. It may work for what you want, but afaik it doesn't limit total allocated memory.
Per process limits
From an end-user perspective, there are some helpful answers (and comments) at the superuser question “Is it possible to limit the memory usage of a particular process on Windows”, including discussions of how to set recursive quota limits on any or all of:
CPU assignment (quantity, affinity, NUMA groups),
CPU usage,
RAM usage (both ‘committed’ and ‘working set’), and
network usage,
… mostly via the built-in Windows ‘Job Objects’ system (as mentioned in @Adam Mitz’s answer and @Stephen Martin’s comment above), using:
the registry (for persistence, when desired) or
free tools, such as the open-source Process Governor.
(Note: nested Job Objects ~may~ not have been available under all earlier versions of Windows, but the un-nested version appears to date back to Windows XP)
Per-user limits
As far as overall per-user quotas:
??
It is possible that each user session is automatically assigned to a job group itself; if true, per-user limits should be able to be applied to that job group. Update: nope; Job Objects can only be nested at the time they are created or associated with a specific process, and in some cases a child Job Object is allowed to ‘break free’ from its parent and become independent, so they can’t facilitate ‘per-user’ resource limits.
(NTFS does support per-user file system ~storage~ quotas, though)
Per-system limits
Besides simple BIOS or ‘energy profile’ restrictions:
VM hypervisor or Kubernetes-style container resource limit controls may be the most straightforward (in terms of end-user understandability, at least) option.
Footnotes, regarding per-process and other resource quotas / QoS for non-Windows systems:
‘Classic’ Mac OS (including ‘classic’ applications running on 2000s-era versions of Mac OS X): per-application memory limits can easily be set within the ‘Memory’ section of the Finder ‘Get Info’ window for the target program; since the system used a cooperative multitasking concurrency model, per-process CPU limits were impossible.
BSD: ? (probably has some overlap with linux and non-proprietary macOS methods?)
macOS (aka ‘Mac OS X’): no user-facing interface; system support includes, depending on version, the ‘Multiprocessing Services API’, Grand Central Dispatch, POSIX threads / pthread, ‘operation objects’, and possibly others.
Linux: ‘Resource Manager’/limits.conf, control groups/‘cgroups’, process priority/‘niceness’/renice, others?
IBM z/OS and other mainframe-style systems: resource controls / allocation was built-in from nearly the beginning