How to avoid the "Too many open files" issue in a load balancer, and what are the side effects of setting ulimit -n to a higher value? - amazon-ec2

When load balancing a production system under a heavy request load, or while load testing it, we see a 'Too many open files' error such as the one below.
[2011-06-09 20:48:31,852] WARN - HttpCoreNIOListener System may be unstable: IOReactor encountered a checked exception : Too many open files
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:152)
at org.apache.http.impl.nio.reactor.DefaultListeningIOReactor.processEvent(DefaultListeningIOReactor.java:129)
at org.apache.http.impl.nio.reactor.DefaultListeningIOReactor.processEvents(DefaultListeningIOReactor.java:113)
at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:315)
at org.apache.synapse.transport.nhttp.HttpCoreNIOListener$2.run(HttpCoreNIOListener.java:253)
at java.lang.Thread.run(Thread.java:662)
This exception occurs in the load balancer if the maximum number of files allowed to be open (ulimit) is set too low.
We can fix the above exception by raising the ulimit to a higher value (655350 here):
ulimit -n 655350
However, setting ulimit -n to a higher number may affect the overall performance of the load balancer, and hence the response time of our website, in unpredictable ways. Are there any known side effects of setting ulimit -n to a higher value?

The side effects are exactly that: under normal circumstances you want to limit the number of open file handles for a process so that it doesn't exhaust the available file descriptors or do other bad things as a result of programming errors or DoS attacks. If you're confident that your system won't do that (or simply can't work within the normal limits!), then there are no other side effects.
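If you do raise the limit, it is worth watching descriptor usage from inside the JVM so a leak is noticed before the new ceiling is reached. A minimal sketch, assuming a HotSpot/OpenJDK JVM on a Unix-like system where the com.sun.management.UnixOperatingSystemMXBean interface is available:

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdMonitor {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // The Unix-specific subinterface exposes file descriptor counts;
        // it is JVM- and platform-specific, hence the instanceof check.
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unixOs = (UnixOperatingSystemMXBean) os;
            long open = unixOs.getOpenFileDescriptorCount();
            long max = unixOs.getMaxFileDescriptorCount();
            System.out.println("Open file descriptors: " + open + " of " + max);
        } else {
            System.out.println("Descriptor counts are not exposed on this platform/JVM.");
        }
    }
}

Logging these two numbers periodically (or exposing them via JMX) makes it easier to tell whether you are dealing with a genuine traffic peak or a descriptor leak.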

Related

Matlab load function timeout

I would like to increase the Matlab load command timeout, as I'm accessing a resource over the network (on Windows). The default is 10s, as mentioned here.
How can this be achieved?
related: Matlab read from fifo with fopen timeout
I think that increasing the timeout will be hard to accomplish, but there is a way to work around the problem: copy the file to your local drive first, then load it. The speed will probably be about the same, but you won't have to deal with the timeout.
The problem is probably caused by the fact that the MAT files you are trying to load were saved using compression, which prevents Matlab from exploiting the full bandwidth of your connection.

JMeter - Thread users gone after a few minutes when loop forever and a schedule time are set

I have a test plan with five thread groups, each of which has 10 thread users. I want to run this test plan for 1 hour with 50 concurrent users. To my understanding, the number of thread users should stay at 50, but somehow the thread users keep dropping after several minutes, and the run ends with 0 thread users. Below are my configurations; can someone help take a look? Thanks in advance.
Also, on the JMeter command line, I got the error below:
The above error shows you got an OOM, i.e. an Out Of Memory error.
You ran out of the heap space allocated to JMeter.
To resolve this issue you can try the following:
Use the tips provided here (use non-GUI mode, avoid heavy listeners, etc.):
http://blazemeter.com/blog/jmeter-performance-and-tuning-tips
Increase JMeter's heap space if your load test really requires a large amount of memory.
I would suggest following these suggestions in order.
You can also try a distributed setup for load testing; this way you can spread the load across several machines.
The quick fixes
1. Use JMeter in non-GUI mode
Find where you installed JMeter using the command line or terminal.
Set the 'bin' directory as your current directory.
Run 'jmeter -n -t <your test plan>.jmx'
That's it; when your test is complete you can do your analysis. You may want to see how your test is performing during execution in the command line/terminal (rather than just staring at a black window). To do this, uncomment the following lines in the JMeter properties file (remove the '#'):
summariser.name=summary
summariser.interval=180
summariser.out=true
2. Remove, or at least disable, listeners you don't need. Listeners such as 'View Results Tree' typically consume large amounts of memory, purely because of how much information they gather.
Simply take a second look at your script and remove the listeners you don’t need.
3. Increase the JMeter heap memory
In your file explorer, find where you installed JMeter.
Open up the bin directory.
Find the 'JMeter.bat' file and open it with a text editor.
Find the line that starts with 'set HEAP'.
You can set this HEAP value to whatever you like. In this example I've allocated 2 GB of memory from the start and kept it for the whole test run: 'set HEAP=-Xms2048m -Xmx2048m'
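If you want to confirm how much heap the JMeter JVM actually received after the edit, a quick check like the one below can help. This is just an illustrative sketch using standard JVM calls; it can be run as a standalone class, or the same calls can be dropped into a JSR223 sampler:

public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mb = 1024L * 1024L;
        // maxMemory() roughly corresponds to -Xmx, totalMemory() to what
        // the JVM has currently reserved, freeMemory() to the unused part.
        System.out.println("Max heap (MB):   " + rt.maxMemory() / mb);
        System.out.println("Total heap (MB): " + rt.totalMemory() / mb);
        System.out.println("Free heap (MB):  " + rt.freeMemory() / mb);
    }
}

If the reported max heap does not reflect your edit, the test is probably being launched through a different script or wrapper than the one you changed.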

Ruby file handle management (too many open files)

I am performing very rapid file access in ruby (2.0.0 p39474), and keep getting the exception Too many open files
Having looked at this thread, here, and various other sources, I'm well aware of the OS limits (set to 1024 on my system).
The part of my code that performs this file access is mutexed, and takes the form:
File.open( filename, 'w'){|f| Marshal.dump(value, f) }
where filename is subject to rapid change, depending on the thread calling the section. It's my understanding that this form relinquishes its file handle after the block.
I can verify the number of File objects that are open using ObjectSpace.each_object(File). This reports that there are up to 100 resident in memory, but only one is ever open, as expected.
Further, the exception itself is thrown at a time when there are only 10-40 File objects reported by ObjectSpace. Manually garbage collecting fails to improve any of these counts, as does slowing down my script by inserting sleep calls.
My question is, therefore:
Am I fundamentally misunderstanding the nature of the OS limit? Does it cover the whole lifetime of a process?
If so, how do web servers avoid crashing out after accessing over ulimit -n files?
Is ruby retaining its file handles outside of its object system, or is the kernel simply very slow at counting 'concurrent' access?
Edit 20130417:
strace indicates that Ruby doesn't write all of its data to the file before returning and releasing the mutex. As a result, the file handles stack up until the OS limit is reached.
In an attempt to fix this, I have used syswrite/sysread, synchronous mode, and called flush before close. None of these methods worked.
My question is thus revised to:
Why is ruby failing to close its file handles, and how can I force it to do so?
Use dtrace or strace or whatever equivalent is on your system, and find out exactly what files are being opened.
Note that these could be sockets.
I agree that the code you have pasted does not seem to be capable of causing this problem, at least, not without a rather strange concurrency bug as well.

Hadoop fair scheduler open file error?

I am testing the fair scheduler mode for job assignment; however, I get the following error:
java.io.IOException: Cannot run program "bash": java.io.IOException: error=24, Too many open files
After a Google search, most results say to check how many files are currently open in the system (with the Unix command lsof) and how that number relates to your system limit (checked with the bash command ulimit -n). Increasing the maximum number of files that can be open at a time is a short-term solution, in my opinion.
Is there anyway to avoid this?
Given that your system is reaching the limit for #(max open files), you might have to check:
How many other operations are running on the system?
Are they heavily opening many files?
Is your Hadoop job itself heavily opening many files? (see the sketch below)
Is the current limit for #(max open files) too small on your system? (You can Google the typical values.) If it is too small, consider increasing it.
I think increasing the #(max open files) limit will work out, but in the long term you might end up with the same problem again if #1, #2 and #3 are not addressed.
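If point #3 turns out to be the culprit, making sure every stream the job opens is closed promptly usually matters more than raising the limit. A minimal sketch of the pattern in plain Java (the file name is just a placeholder for illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class CountLines {
    // Opens one file at a time and guarantees the descriptor is released,
    // even if an exception is thrown while reading.
    static long countLines(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            long lines = 0;
            while (reader.readLine() != null) {
                lines++;
            }
            return lines;
        } // reader, and the underlying descriptor, are closed here automatically
    }

    public static void main(String[] args) throws IOException {
        // "input.txt" is a hypothetical path used only for this example.
        System.out.println(countLines("input.txt"));
    }
}

The same try-with-resources idiom applies to Hadoop's own stream types, since they implement Closeable as well.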

What can lead to failures in appending data to a file?

I maintain a program that is responsible for collecting data from a data acquisition system and appending that data to a very large (size > 4GB) binary file. Before appending data, the program must validate the header of this file in order to ensure that the meta-data in the file matches that which has been collected. In order to do this, I open the file as follows:
data_file = fopen(file_name, "rb+");
I then seek to the beginning of the file in order to validate the header. When this is done, I seek to the end of the file as follows:
_fseeki64(data_file, _filelengthi64(_fileno(data_file)), SEEK_SET);
At this point, I write the data that has been collected using fwrite(). I am careful to check the return values from all I/O functions.
One of the computers (Windows 7, 64-bit) on which we have been testing this program intermittently shows a condition where the data appears to have been written to the file, yet neither the file's last-changed time nor its size changes. If any of the calls to fopen(), fseek(), or fwrite() fail, my program throws an exception, which results in aborting the data collection process and logging the error. On this machine, none of these failures seem to be occurring. Something that makes the matter even more mysterious is that, if a restore point is set on the host file system, the problem goes away, only to re-appear intermittently at some future time.
We have tried to reproduce this problem on other machines (a Vista 32-bit operating system) but have had no success in replicating the issue (this doesn't necessarily mean anything, since the problem is so intermittent in the first place).
Has anyone else encountered anything similar to this? Is there a potential remedy?
Further Information
I have now found that the failure occurs when fflush() is called on the file, and that the Win32 error returned by GetLastError() is 665 (ERROR_FILE_SYSTEM_LIMITATION). Searching Google for this error leads to a number of reports related to "extents" for SQL Server files. I suspect that the file system is reporting exhaustion of some internal resource because we are growing a large file by repeatedly opening it, appending a chunk of data, and closing it. I am now looking for a better understanding of this particular error, in the hope of coming up with a valid remedy.
The file append is failing because of a file system fragmentation limit. The question was answered in What factors can lead to Win32 error 665 (file system limitation)?
