I maintain a program that is responsible for collecting data from a data acquisition system and appending that data to a very large (size > 4GB) binary file. Before appending data, the program must validate the header of this file in order to ensure that the meta-data in the file matches that which has been collected. In order to do this, I open the file as follows:
data_file = fopen(file_name, "rb+");
I then seek to the beginning of the file in order to validate the header. When this is done, I seek to the end of the file as follows:
_fseeki64(data_file, _filelengthi64(data_file), SEEK_SET);
At this point, I write the data that has been collected using fwrite(). I am careful to check the return values from all I/O functions.
One of the computers (windows 7 64 bit) on which we have been testing this program intermittently shows a condition where the data appears to have been written to the file yet neither the file's last changed time nor its size changes. If any of the calls to fopen(), fseek(), or fwrite() fail, my program will throw an exception which will result in aborting the data collection process and logging the error. On this machine, none of these failures seem to be occurring. Something that makes the matter even more mysterious is that, if a restore point is set on the host file system, the problem goes away only to re-appear intermittently appear at some future time.
We have tried to reproduce this problem on other machines (a vista 32 bit operating system) but have had no success in replicating the issue (this doesn't necessarily mean anything since the problem is so intermittent in the first place.
Has anyone else encountered anything similar to this? Is there a potential remedy?
Further Information
I have now found that the failure occurs when fflush() is called on the file and that the win32 error that is being returned by GetLastError() is 665 (ERROR_FILE_SYSTEM_LIMITATION). Searching google for this error leads to a bunch of reports related to "extents" for SQL server files. I suspect that there is some sort of journaling resource that the file system is reporting and this because we are growing a large file by opening it, appending a chunk of data, and closing it. I am now looking for understanding regarding this particular error with the hope for coming up with a valid remedy.
The file append is failing because of a file system fragmentation limit. The question was answered in What factors can lead to Win32 error 665 (file system limitation)?
Related
I have a Win32 program that keeps a file open and writes data to it over a period of several hours. I'd like for the file size, as shown in an Explorer window, to be updated every so often.
As an example, when a browser is downloading a large file, you can see the file size change over time, even though the file is still downloading.
With my current naive implementation, the file size remains zero until I close the file.
How do I do this in Win32? Currently the file is open using std::ofstream. Is this a proper application of std::ostream::flush() ? Or do I need to close and reopen the file with some regularity?
std::ostream::flush() makes sure you have your data safe on disk. Flushing the buffer is a valid use case in situations where the automatic flushes ain't good enough for you (e.g. there's too little data written over too long periods, the data is written constantly but needs to be accessible constantly too, you need to be sure the data gets logged in case of crash or power down etc.); yet, on some OS/filesystem combinations (see Why is the file size reported incorrectly for files that are still being written to?), that still won't update the file size accordingly. On Win32, you usually won't see size updates before actually closing/reopening the handle; sometimes re-reading the dir etc. will help, and sometimes it simply won't.
As such, you can use e.g. ReOpenFile to force that update, or simply use close/open instead of flushing. The exact solution depends whether you need the updated filesize so direly and the reduced output rate is not a real problem (in which case reopening is the best option), or if you can live with a wrong size reported (in which case flushes are your best option IMO).
If I run the example code below, I get an invalid file identifier error in Matlab:
for i = 1:99999
fid = fopen('test.txt','w');
fprintf(fid,'%s', 'Hello World!\r\n');
fclose(fid);
delete('test.txt');
end;
??? Error using ==> fprintf
Invalid file identifier. Use fopen to generate a valid file identifier.
The interesting thing is, that if I decrease the number of loops, I don't get the error. I researched the problem, and it seems that none of the usual issues that cause the error (Wrong File Path, Corrupt File, File doesn't exist, File already in use) are the culprits, because it works if I change the loops to 10 instead of 99999.
Upon further research, Matlab Forum Post, it seems the problem might be quota related (I think quotas have to do with the OS where the OS, Windows 10 in my case doesn't allow a program to write files after a certain amount of them have been written by the same program?).
How would one increase the quota? Is there a work around? I use Matlab 2010a on Windows 10.
I have also attempted running Matlab in administrator mode with no success.
I'm assuming permissions are correct and disk space is not a problem, but you should check fopen's output nevertheless to get more info or some try-catch which calls ferror(fid) for additional data (note the absence of the semicolon, obviously).
[fid,msg]=fopen('test.txt','w')
If it IS quota related you should be able to disable it in your hard drive's properties, as shown in the image below (it's in spanish, but you should get the idea). Just right click in the unit and access Properties->Disk Quota->Show Configuration and disable it if it isn't already.
GUI location of the disk quota
I have a relatively simple bash script that reads from a set of static input files, stores the input in bash variables and then does a bunch of processing over said input by calling out to external scripts (e.g. written in Python, Go, other bash scripts etc.) and using the intermediate results.
Lately I have been experiencing an intermittent problem where a single character seems to be getting altered somewhere during the processing which then causes subsequent errors. Specifically, a lot of the processing I'm doing involves slicing up a list of comma-separated records, and one of the values on each line is a unix timestamp, e.g. 1354245000.
What seems to be happening is that occasionally one of these values will get altered slightly, so I end up with a timestamp like 13542458=2 or 13542458>2 or 13542458;2 coming out of one of the intermediate scripts. This then subsequently gets fed into another script, which throws an exception when it tries to parse the value to an integer.
In the title of this question, I've suggested that this might be a potential CPU/RAM error. I know the general folly in thinking errors are caused by low level things like hardware/compilers etcetera, but the nature of this particular error makes me think it may be possible, for the following reasons:
The input files are the same on each invocation of the script, and the script only fails on some invocations.
I cannot think of any sources of randomness in the source code prior to where the script is breaking. It's basically just slicing and dicing csv input.
I cannot think of any sources of concurrency in the source code -- even the Go scripts aren't actually written to run anything concurrently.
This problem has only arisen in the last week or so. Prior to this time, this error would never occur.
While I haven't documented every erroneous character, they seem to often be quite close in the ASCII table to numeric values (=, >, ; etc). That said, I guess the Hamming distance between two characters quite far apart can be small also with changes to a high order bit.
The script often breaks at a different stage on different runs. i.e. I have a number of separate Python scripts, and sometimes it'll make it past one script and then the error will be induced in another. Other times it'll be induced on an earlier script.
What I'd like to know is, is there any methodical way to either confirm or rule out a hardware error for this problem? Or if it is a hardware problem, is it possibly undetectable by the operating system?
A bit of further info on the machine:
Linux 64-bit, Ubuntu 12.04
Intel i7 processor
16GB DDR3 RAM
I'm hoping someone can either point me to a reliable way to verify whether the hardware is to blame or otherwise a sound reason as to what else might be the cause.
Try booting into Memtest to check your memory.
While it is highly unlikely that it will be hardware, if you have exhausted you standard software debug as suggested by #OliCharlesworth, here is an outline of hardware error investigation:
(1) check your log area for any `MCE` logs (machine check exceptions).
If you find any in either your log area (syslog) or sometimes in
the present working dir or /dir -- you have a hardware failure.
(2) check your log area for disk errors. e.g:
smartd[3963]: Device: /dev/sda [SAT], 34 Currently unreadable (pending) sectors
(3) check your drive integrity, e.g.: (as root) # `smartctl -a /dev/sda` if any abnormality, run:
smartctl -t short /dev/sda (change drive as required)
(4) download/install/boot to [memtest86](http://www.memtest86.com/download.htm)
(run the complete test)
If your cpu/motherboard has thrown no mce's, you have no disk error, your drive tests OK with smartctl and you have no memory errors with memtest86, then recheck the software debugging. While additional hardware errors can still be present (bad capacitors, etc..) the likelihood at this point is software. Good luck.
I am performing very rapid file access in ruby (2.0.0 p39474), and keep getting the exception Too many open files
Having looked at this thread, here, and various other sources, I'm well aware of the OS limits (set to 1024 on my system).
The part of my code that performs this file access is mutexed, and takes the form:
File.open( filename, 'w'){|f| Marshal.dump(value, f) }
where filename is subject to rapid change, depending on the thread calling the section. It's my understanding that this form relinquishes its file handle after the block.
I can verify the number of File objects that are open using ObjectSpace.each_object(File). This reports that there are up to 100 resident in memory, but only one is ever open, as expected.
Further, the exception itself is thrown at a time when there are only 10-40 File objects reported by ObjectSpace. Further, manually garbage collecting fails to improve any of these counts, as does slowing down my script by inserting sleep calls.
My question is, therefore:
Am I fundamentally misunderstanding the nature of the OS limit---does it cover the whole lifetime of a process?
If so, how do web servers avoid crashing out after accessing over ulimit -n files?
Is ruby retaining its file handles outside of its object system, or is the kernel simply very slow at counting 'concurrent' access?
Edit 20130417:
strace indicates that ruby doesn't write all of its data to the file, returning and releasing the mutex before doing so. As such, the file handles stack up until the OS limit.
In an attempt to fix this, I have used syswrite/sysread, synchronous mode, and called flush before close. None of these methods worked.
My question is thus revised to:
Why is ruby failing to close its file handles, and how can I force it to do so?
Use dtrace or strace or whatever equivalent is on your system, and find out exactly what files are being opened.
Note that these could be sockets.
I agree that the code you have pasted does not seem to be capable of causing this problem, at least, not without a rather strange concurrency bug as well.
I'm developing a program that needs to write a large amout of data to disk then read back much smaller amount of data back later on. It needs to "bin" related data together then once it figures out what to do with it, then it can process the data further. It's basically acting like a database, but with temp files on disk. Portions of the temp files get reused fairly frequently as I don't care about the data on disk after I read it back out, so that portion of the file can be recycled. I'm using I/O completion ports to implement this because sequential I/O is simply too slow.
The problem is that sometimes when I read the data, I don't get all of it back. For example, I will zero out my read buffer, do a read operation of, say, 20 bytes, and when the corresponding completion event triggers, some or even none of my read buffer will match what should be on disk, but all of it won't be zeroed out. Occasionally, I can detect this and try sleeping 5 seconds and reading the same portion again, and it matches what I read in the first try. This is taking place on a top of the line SSD, so 5 seconds should be plenty to flush to disk. However, when I stop my application and look at the contents of the file, it's correct on disk. It's as if the previous write hasn't flushed to disk and it tried reading old data.
To test that theory, I tried writing 0xFF on entire sections as I read them. When this error happened again, my read buffer did not contain 0xFFs as I would have expected. So presumably, I'm not reading old data.
I also checked to make sure that the number of bytes returned from the completion event matched the number of bytes that I passed to ReadFile, and they do match. There is no error returned by the completion event or by ReadFile (other than ERROR_IO_PENDING). I am creating my temp files with FILE_ATTRIBUTE_NORMAL, FILE_FLAG_OVERLAPPED, and FILE_FLAG_RANDOM_ACCESS.
I also tried waiting for all pending writes for a given portion of the file to complete before trying to read, but to no avail. I would hope that Windows would do that for me, but it isn't covered in any documentation that I've read.
I'm really at a loss as to why I'm getting what look to be partial or corrupted reads. I'm really just looking for some ideas that might cause this behavior because I'm all out.
From the sound of things you're firing off writes and reads to the same portions of the same file and sometimes the data that the read returns isn't what you think you have previously written.
I assume you are waiting for the write completion for a piece of data before issuing a read request for the same area of the file? If not the read could be occurring before the write completes? When lots of data is being written to the same disk the write completions may begin to slow down and writes may spend more time pending (watch out for the resources that this consumes!)
Personally I'd include my own memory cache layer which knows about the data block until the write completion occurs - you can then satisfy reads for this part of the file from your cache if the write has not yet completed.