What is a File IO stream buffer? - ruby

I've checked out a few of the forum posts here and can't find quite what I'm looking for. Suppose you are reading in a text document via Ruby. I understand the stream is essentially the characters coming in byte by byte. What is the purpose/best practice of buffering in this case? My book shows plenty of examples of the buffer being utilized, but no real description of what the buffer is or why it even exists. What should I be considering when setting the buffer? For example, the book describes the following method as:
read(n, buffer=nil) reads in n bytes, until the bytes are ready
I don't understand what the statement "until the bytes are ready" means. Does the buffer play a role in this? Please feel free to point me to another place where this is explained; I couldn't for the life of me find it on my own.

An IO need not be a file; it can also be a network socket, and with networks you regularly have a situation where you are ready to process more data but the remote side has paused sending.
(You usually see a progress bar or a spinner element in your browser in these cases.)
So, if you are using regular files, the bytes are always 'ready'.

The Pickaxe book's entry for IO#read says:
Reads at most int bytes from the I/O stream or to the end of file if int is omitted. Returns nil if called at end of file. If buffer (a String) is provided, it is resized accordingly, and input is read directly into it.
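To make the buffer parameter concrete: passing one in lets every call reuse the same chunk of memory instead of allocating a fresh string per read. Here is a minimal sketch of that pattern in Go (the filename is made up); Ruby's read(n, buffer) form buys you the same thing:

package main

import (
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	f, err := os.Open("input.txt") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// One 4 KiB buffer, reused for every read: no per-chunk allocation.
	buf := make([]byte, 4096)
	total := 0
	for {
		n, err := f.Read(buf)
		total += n
		// process buf[:n] here
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
	}
	fmt.Println("read", total, "bytes")
}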

Related

readString vs readLine

I am writing an application to read from a list of files, line by line, and do some processing. I want to use as little RAM as I can.
I came across this question https://stackoverflow.com/a/41741702/3531263 where the poster says ReadString uses more RAM than ReadLine, and they have posted some code.
What I don't understand is how one uses more RAM. Because ultimately, the way their code is written, they are still writing an entire line to their buffer. So wouldn't that mean that if they had just used ReadString, it would have been the same thing?
the way their code is written, they are still writing an entire line to their buffer
Their code, yes. Your code might not need the whole line to be in memory at the same time. For example, say your program is filtering a log file by request id, which is at the beginning of the line (sketched below). It doesn't need to read the whole line, which may be a few megabytes or more, only to reject it because of a wrong request id. But with ReadString you don't have the luxury of choice.
I agree with Sergio. Also, have a look at the current implementation in the standard library. ReadLine calls ReadSlice('\n') once, then runs through a few branches to make sure the appropriate sentinel values or errors are returned with the converted data. On the other hand, ReadBytes and ReadString both loop over repeated calls to ReadSlice(delim), so it follows that they would necessarily be copying at least as much data into memory as ReadLine, and potentially much more if the delimiter wasn't found in the first call.
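To make the difference concrete, here is a rough Go sketch of the filtering idea Sergio describes (the request-id prefix and buffer size are invented). ReadSlice returns a view into bufio's internal buffer, so a long line can be consumed one buffer-full at a time instead of being assembled in memory the way ReadString and ReadBytes assemble it:

package main

import (
	"bufio"
	"bytes"
	"io"
	"os"
	"strings"
)

// filterByRequestID copies to out only the lines that start with id.
// It never accumulates a whole line: a line longer than the bufio
// buffer is consumed (and, if it matches, written) chunk by chunk.
func filterByRequestID(r io.Reader, id []byte, out io.Writer) error {
	br := bufio.NewReaderSize(r, 4096)
	for {
		chunk, err := br.ReadSlice('\n')
		match := bytes.HasPrefix(chunk, id)
		if match {
			out.Write(chunk)
		}
		// A long line shows up as ErrBufferFull: drain its remaining
		// chunks while never holding more than one buffer of it.
		for err == bufio.ErrBufferFull {
			chunk, err = br.ReadSlice('\n')
			if match {
				out.Write(chunk)
			}
		}
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	logData := "req-42 GET /index\nreq-7 GET /other\nreq-42 POST /save\n"
	filterByRequestID(strings.NewReader(logData), []byte("req-42"), os.Stdout)
}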

Golang io.Reader usage with net.Pipe

The problem I'm trying to solve is using io.Reader and io.Writer in a net application without using bufio and strings, as per the examples I've been able to find online. For efficiency I'm trying to avoid the memory copies those imply.
I've created a test application using net.Pipe on the play area (https://play.golang.org/p/-7YDs1uEc5). There is a data source and sink which talk through a net.Pipe pair of connections (to model a network connection) and a loopback on the far end to reflect the data right back at us.
The program gets as far as the loopback agent reading the sent data, but as far as I can see the write back to the connection blocks; it certainly never completes. Additionally, the receiver in the Sink never receives any data whatsoever.
I can't figure out why the write cannot proceed, as it's wholly symmetrical with the path that does work. I've written other test systems that use bi-directional network connections, but as soon as I stop using bufio and ReadString I encounter this problem. I've looked at the code of those and can't see what I've missed.
Thanks in advance for any help.
The issue is on line 68:
data_received := make([]byte, 0, count)
This line creates a slice with length 0 and capacity count. The call to Read does not read data because the length is 0. The call to Write blocks because the data is never read.
Fix the issue by changing the line to:
data_received := make([]byte, count)
playground example
Note that "Finished Writing" may not be printed because the program can exit before dataSrc finishes executing.

IBM FileNet P8 concurrently reading document content

I want to read document content from FileNet P8 in parallel to reduce my reading time. The complication is that I write into an OutputStream. Is there any way, or any API, by which I can parallelize my reads into an OutputStream? I am asking this because I am sure IBM would have provided some way to do it.
Also, if my file is, say, 1 GB, then sequential reads are going to be a performance hit.
I think from a Document instance there's only one API to retrieve the content, accessContentStream, which gives you an InputStream. However, for reading huge files there's a newer utility class called ExtendedInputStream which you might be interested in (a conceptual sketch follows below).
An ExtendedInputStream is an input stream that can retrieve content at arbitrary positions within the stream. The ExtendedInputStream class includes methods that can read a certain number of bytes from the stream or read an unspecified number of bytes. The stream keeps track of the last byte position that was read. You can specify a position in the input stream to get to a later or earlier position within the stream.
More details at :
https://www.ibm.com/support/knowledgecenter/SSGLW6_5.2.1/com.ibm.p8.ce.dev.java.doc/com/filenet/api/util/ExtendedInputStream.html
Edit: ExtendedInputStream was introduced in v5.2.1 and is not available if you are using an older version of P8.
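I can't vouch for the exact ExtendedInputStream method signatures, but what positioned reads enable is splitting the content into ranges, fetching them concurrently, and then writing them to the sequential output in order. Here is a sketch of that pattern in Go, with os.File's ReadAt standing in for the positioned read (the chunk size and filename are assumptions):

package main

import (
	"io"
	"log"
	"os"
	"sync"
)

const chunkSize = 1 << 20 // 1 MiB per ranged read (arbitrary)

// parallelCopy reads f in fixed-size ranges concurrently, then writes
// the chunks to w in order, since the destination is a sequential stream.
func parallelCopy(f *os.File, size int64, w io.Writer) error {
	nChunks := int((size + chunkSize - 1) / chunkSize)
	bufs := make([][]byte, nChunks)
	errs := make([]error, nChunks)

	var wg sync.WaitGroup
	for i := 0; i < nChunks; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			off := int64(i) * chunkSize
			n := int64(chunkSize)
			if off+n > size {
				n = size - off
			}
			bufs[i] = make([]byte, n)
			_, errs[i] = f.ReadAt(bufs[i], off) // positioned read, safe to call concurrently
		}(i)
	}
	wg.Wait()

	for i := 0; i < nChunks; i++ {
		if errs[i] != nil && errs[i] != io.EOF {
			return errs[i]
		}
		if _, err := w.Write(bufs[i]); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	f, err := os.Open("document.bin") // hypothetical exported content
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	st, err := f.Stat()
	if err != nil {
		log.Fatal(err)
	}
	if err := parallelCopy(f, st.Size(), os.Stdout); err != nil {
		log.Fatal(err)
	}
}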

Why doesn't IO#seek work for TCPSocket?

I wrote some simple code to learn the structure of a TCPSocket. I thought it was like an IO stream, so I tried to use seek to move the "reading position" back a few bytes:
socket.gets #=> hello world
socket.seek(-5, IO::SEEK_CUR)
socket.gets #=> hello world # this should return world
but it gives me an error:
server.rb:11:in `seek': Illegal seek (Errno::ESPIPE)
Does anyone have an idea why this doesn't work?
If seeking were supported, the socket would need to keep all of the data around in case someone decided to seek backwards (and how would a forward seek work? Block waiting for more data?). You could fairly easily write a wrapper class around a socket that keeps track of a position and also buffers all the data, or blocks as needed, etc.; see the sketch below.
But maybe you could try to use IO#bytes or IO#chars in combination with Enumerator#peek?
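Here is a sketch of that wrapper idea, written in Go for concreteness (the mechanics translate directly to a Ruby class): remember everything read so far, track a position, and serve backward seeks from that buffer.

package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// rewindReader wraps any io.Reader (e.g. a socket) and keeps everything
// it has read so far, so backward seeks can be served from memory.
type rewindReader struct {
	src io.Reader
	buf []byte // all bytes read from src so far
	pos int    // current position within buf
}

func (r *rewindReader) Read(p []byte) (int, error) {
	if r.pos < len(r.buf) { // rewound: serve from the buffer first
		n := copy(p, r.buf[r.pos:])
		r.pos += n
		return n, nil
	}
	n, err := r.src.Read(p)
	r.buf = append(r.buf, p[:n]...)
	r.pos = len(r.buf)
	return n, err
}

// Seek moves only within already-read data; a forward seek past the
// buffer would have to block for more data, as discussed above.
func (r *rewindReader) Seek(offset int64, whence int) (int64, error) {
	var abs int64
	switch whence {
	case io.SeekStart:
		abs = offset
	case io.SeekCurrent:
		abs = int64(r.pos) + offset
	default:
		return 0, errors.New("rewindReader: unsupported whence")
	}
	if abs < 0 || abs > int64(len(r.buf)) {
		return 0, errors.New("rewindReader: seek outside buffered data")
	}
	r.pos = int(abs)
	return abs, nil
}

func main() {
	r := &rewindReader{src: strings.NewReader("hello world")}
	line := make([]byte, 11)
	io.ReadFull(r, line)
	r.Seek(-5, io.SeekCurrent) // the seek that raises ESPIPE on a raw socket
	rest := make([]byte, 5)
	io.ReadFull(r, rest)
	fmt.Println(string(rest)) // world
}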
TCP/IP is more like having a series of files on disk that you can only read forward, one file at a time. The files have to be read sequentially, and you can't jump ahead or back. It's not capable of random I/O like you can do on a disk; it's more like a serial connection you can only read as things appear.
In order to do what you want you have to build a buffer, where you append each block (i.e., file), reconstructing the entire message. If you want to look backwards at any point, you have to look in your buffer. If you want to look forward you have to wait for that block to be received and read and appended.
That's a simple explanation. It's possible to request that blocks be resent in TCP, but really, at the level we normally work at, we're only reading forward.

I/O completion port silently fails to read completely

I'm developing a program that needs to write a large amount of data to disk and then read a much smaller amount of it back later on. It needs to "bin" related data together; once it figures out what to do with the data, it can process it further. It's basically acting like a database, but with temp files on disk. Portions of the temp files get reused fairly frequently, and since I don't care about the data on disk after I read it back out, those portions of the file can be recycled. I'm using I/O completion ports to implement this because sequential I/O is simply too slow.
The problem is that sometimes when I read the data, I don't get all of it back. For example, I will zero out my read buffer, do a read operation of, say, 20 bytes, and when the corresponding completion event triggers, some or even none of my read buffer will match what should be on disk, yet it won't all be zeroed out either. Occasionally I can detect this, sleep 5 seconds, and read the same portion again, and this time it matches what should be on disk. This is taking place on a top-of-the-line SSD, so 5 seconds should be plenty of time to flush to disk. However, when I stop my application and look at the contents of the file, it's correct on disk. It's as if the previous write hadn't been flushed to disk and the read returned old data.
To test that theory, I tried writing 0xFF on entire sections as I read them. When this error happened again, my read buffer did not contain 0xFFs as I would have expected. So presumably, I'm not reading old data.
I also checked to make sure that the number of bytes returned from the completion event matched the number of bytes that I passed to ReadFile, and they do match. There is no error returned by the completion event or by ReadFile (other than ERROR_IO_PENDING). I am creating my temp files with FILE_ATTRIBUTE_NORMAL, FILE_FLAG_OVERLAPPED, and FILE_FLAG_RANDOM_ACCESS.
I also tried waiting for all pending writes for a given portion of the file to complete before trying to read, but to no avail. I would hope that Windows would do that for me, but it isn't covered in any documentation that I've read.
I'm really at a loss as to why I'm getting what look to be partial or corrupted reads. I'm really just looking for some ideas that might cause this behavior because I'm all out.
From the sound of things you're firing off writes and reads to the same portions of the same file and sometimes the data that the read returns isn't what you think you have previously written.
I assume you are waiting for the write completion for a piece of data before issuing a read request for the same area of the file? If not, the read could be occurring before the write completes. When lots of data is being written to the same disk, write completions may begin to slow down and writes may spend more time pending (watch out for the resources that this consumes!).
Personally, I'd include my own memory cache layer which knows about the data block until the write completion occurs; you can then satisfy reads for that part of the file from your cache if the write has not yet completed.
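A sketch of that cache layer, in Go for brevity (the names are invented, and writes are assumed to be keyed by their file offset): a block stays in the map from just before its write is issued until the completion fires, and reads check the map first.

package main

import (
	"fmt"
	"sync"
)

// pendingWrites remembers blocks whose overlapped writes have not yet
// completed, so reads of those regions are served from memory instead
// of racing the disk.
type pendingWrites struct {
	mu     sync.Mutex
	blocks map[int64][]byte // file offset -> in-flight data
}

func newPendingWrites() *pendingWrites {
	return &pendingWrites{blocks: make(map[int64][]byte)}
}

// beginWrite is called just before the asynchronous write is issued.
func (c *pendingWrites) beginWrite(off int64, data []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.blocks[off] = append([]byte(nil), data...) // keep a private copy
}

// completeWrite is called from the completion handler for that write.
func (c *pendingWrites) completeWrite(off int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.blocks, off)
}

// tryRead serves the block from cache while its write is still pending;
// if it returns false, the caller issues a normal read against the file.
func (c *pendingWrites) tryRead(off int64) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	b, ok := c.blocks[off]
	return b, ok
}

func main() {
	c := newPendingWrites()
	c.beginWrite(4096, []byte("payload"))
	if b, ok := c.tryRead(4096); ok {
		fmt.Printf("served %d bytes from cache\n", len(b)) // write still in flight
	}
	c.completeWrite(4096)
	_, ok := c.tryRead(4096)
	fmt.Println("still cached after completion:", ok) // false: read from disk now
}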
