How to handle readLine to terminate the read if line doesn't have terminated characcter? - bufferedreader

I am facing one strange problem where BufferedReader tries to read the line one by one using readLine() method but the data in the file is corrupted somehow and any particular line doesn't have terminated characters('\n' or '\r') so when it starts reading ,doesn't find the terminated character and after some time throws Out Of Memory error and stop whole application.
I just like to handle this scenario and when BufferedReader finds this scenario, it should terminate reading and complain about corrupted data and app should run thereafter. Does BufferedReader having some way to limit the length of the record so if it crosses the limit should terminate the reading.
Please advice me some good way to handling this with proper code example.

Related

readString vs readLine

I am writing an application to read from a list of files, line by line and do some processing. I want to use as little RAM as I can.
I came across this question https://stackoverflow.com/a/41741702/3531263
Where the poster is saying readString uses more RAM than readLine and they have posted some code.
What I don't understand is how one uses more RAM? Because ultimately, the way their code is written, they are still writing an entire line to their buffer. So would that not mean if they had just used readString, it would have been the same thing?
the way their code is written, they are still writing an entire line to their buffer
Their code, yes. Your code might not need the whole line to be in memory at the same time. For example, your program is filtering a log file by request id, which is in the beginning of the line. It doesn't need to read the whole line which may be a few megabytes or more, only to reject it due to wrong request id. But with ReadString you don't have the luxury of choice.
I 'gree with Sergio. Also, have a look at the current implementation in the standard library. ReadLine calls ReadSlice('\n') once, then runs through a few branches to make sure the appropriate sentinel values or errors are returned with the converted data. On the other hand, ReadBytes and ReadString both loop over repeated calls to ReadSlice(delim), so it follows that they would necessarily be copying at least as much data into memory as ReadLine, and potentially much more if the delimiter wasn't found in the first call.

worse case scenario: launched two copies of a program which appends lines to a file

I have a Python program which performs a simple operation on a file:
with open(self.cache_filename_url, "a", encoding="utf8") as f:
w = csv.writer(f, delimiter=',', quotechar='"', lineterminator='\n')
w.writerow([cache_url, rpd_products])
As you can see it just opens the file and appends a CSV line to it. It does this a lot, in a loop.
I accidentally ran two copies of this program simultaneously, so I think they would have been appending to the file simultaneously. I am trying to determine the worst-case-scenario for file corruption.
Do you think the writes would at least be atomic operations in this case? For example this wouldn't be a problem for me:
old line
old line
new line written by instance 1
new line written by instance 2
new line written by one
This would be a problem for me:
old line
old line
[half of new line written by instance 1] [half of new line by instance 2]
etc
To put it another way, is it possible for the two append operations to "interfere" with each other?
EDIT: I am using Windows 7
Opening the same file multiple times in shared write mode can definitely be problematic. And, if they don't open in shared mode, you'll get one of them throwing exceptions that it cannot open the file.
If SHARED mode:
Both instances will have their own internal pointer. In most cases, they will probably write independently. You could get:
Process A opens file, sets pointer to end (byte 1024)
Process B opens file, sets pointer to end (byte 1024)
Process B writes at byte 1024 and closes file
Process A writes at byte 1024 and closes file.
Both processes will have written to the file at the same location. You've basically lost the record from Process B, and depending on how the close works (if it truncates), if the lines it writes are different lengths, you could get part of Process B if the line was longer.
If it is in EXCLUSIVE mode, one process will fail to open the file, and whatever exception handling you have will kick in.
Which mode you are in can be system dependent, as Python doesn't seem to provide any mechanisms for controlling the share mode.
Update: I ran a check on my file, and I did indeed have corrupted partial lines (the case under "This would be a problem for me" in my question)
It's unfortunate, especially since it implies you could have problems even when you intend to share a file between two processes.
I am still interested in any pointers on how to avoid this outcome. I will hold off on marking an answer as accepted for now. (The other answer is good, but doesn't provide enough details on these modes or how to determine which will be used.)

How can I exit reader.ReadString from waiting for user input?

I am making it so that it stops asking for input upon CTRL-C.
What I have currently is that a separate go-routine, upon receiving a CTRL-C, changes the value of a variable so it won't ask for another line. However, I can't seem to find a way around the current line.
i.e. I still have to press enter once, to get out of the current iteration of reading for \n.
Is there perhaps a way to push a "\n" into stdin for the reader.ReadString to read. Or a way to stop its execution altogether.
The only decent mechanism that Go gives you to proceed when either of two things happens is select, and select only selects on channel reads, so your only option is to change your signal-handler goroutine to write to a channel, and add another goroutine that handles stdin and passes lines of input to a channel, then select on the two channels.
However, that still leaves your question half-unanswered: your main program can stop waiting for input on a Ctrl-C, but the goroutine that's reading input will still be waiting for input. In some cases that might be okay... if you will never need stdin again, or if you will go right back to processing lines in the same exact way. But if you want to do something other than ReadString from that reader, you're stuck... literally. The only solution I see would be to write your own state machine around Read or ReadByte that is capable of changing its behavior in response to external conditions, but that can easily get horribly complicated.
Basically, this looks like a case where Go simplifies things compared to the underlying system (not exposing anything like EINTR, not allowing select on filehandles), but ends up providing less power to the programmer.

Maximum size of pipe used by CreateProcess

I'm currently using this example as a guide to redirect standard error of a child process launched by CreateProcess.
However unlike the example currently I'm waiting until the process finishes (checking GetExitCodeProcess), closing the pipe and then reading the error if a non-zero return code comes back.
However I've since read if the pipe fills up the client process will block until the pipe is cleared. The reason I'm not currently reading from the pipe during execution is that the ReadFile call blocks during execution (standard error is only output at the end) so I can't pump the message queue to avoid the GUI from "ghosting" and being marked not responding.
I can't find any reference to how big the pipe is by default (although I can set a size myself), is this something I need to worry about given I'm buffering the output into a string variable for later use anyway? (ie. it would need to fit into the available memory for the process so it has a hard limit there, it's not going to a file like most of the examples have)

I/O completion port silently fails to read completely

I'm developing a program that needs to write a large amout of data to disk then read back much smaller amount of data back later on. It needs to "bin" related data together then once it figures out what to do with it, then it can process the data further. It's basically acting like a database, but with temp files on disk. Portions of the temp files get reused fairly frequently as I don't care about the data on disk after I read it back out, so that portion of the file can be recycled. I'm using I/O completion ports to implement this because sequential I/O is simply too slow.
The problem is that sometimes when I read the data, I don't get all of it back. For example, I will zero out my read buffer, do a read operation of, say, 20 bytes, and when the corresponding completion event triggers, some or even none of my read buffer will match what should be on disk, but all of it won't be zeroed out. Occasionally, I can detect this and try sleeping 5 seconds and reading the same portion again, and it matches what I read in the first try. This is taking place on a top of the line SSD, so 5 seconds should be plenty to flush to disk. However, when I stop my application and look at the contents of the file, it's correct on disk. It's as if the previous write hasn't flushed to disk and it tried reading old data.
To test that theory, I tried writing 0xFF on entire sections as I read them. When this error happened again, my read buffer did not contain 0xFFs as I would have expected. So presumably, I'm not reading old data.
I also checked to make sure that the number of bytes returned from the completion event matched the number of bytes that I passed to ReadFile, and they do match. There is no error returned by the completion event or by ReadFile (other than ERROR_IO_PENDING). I am creating my temp files with FILE_ATTRIBUTE_NORMAL, FILE_FLAG_OVERLAPPED, and FILE_FLAG_RANDOM_ACCESS.
I also tried waiting for all pending writes for a given portion of the file to complete before trying to read, but to no avail. I would hope that Windows would do that for me, but it isn't covered in any documentation that I've read.
I'm really at a loss as to why I'm getting what look to be partial or corrupted reads. I'm really just looking for some ideas that might cause this behavior because I'm all out.
From the sound of things you're firing off writes and reads to the same portions of the same file and sometimes the data that the read returns isn't what you think you have previously written.
I assume you are waiting for the write completion for a piece of data before issuing a read request for the same area of the file? If not the read could be occurring before the write completes? When lots of data is being written to the same disk the write completions may begin to slow down and writes may spend more time pending (watch out for the resources that this consumes!)
Personally I'd include my own memory cache layer which knows about the data block until the write completion occurs - you can then satisfy reads for this part of the file from your cache if the write has not yet completed.

Resources