I wrote some simple code to learn the structure of a TCPSocket. I thought it's like an IO stream so I tried to use seek to move the "reading position" back a byte:
socket.gets #=> hello world
socket.seek(-5, IO::SEEK_CUR)
socket.gets #=> hello world # this should return world
but, it gives me an error:
server.rb:11:in `seek': Illegal seek (Errno::ESPIPE)
Does anyone have an idea why this doesn't work?
If this was the case then the socket needs to keep all data around if someone would decides to seek backwards (and how would forward seek work, block for more data?). You could probably quite easy write a wrapper class around a socket that keeps track of a position and also buffers all data or blocks if needed etc.
But maybe you could try to use IO#bytes or IO#chars in combination with Enumerator#peek?
TCP/IP would be more like having a series of files on disk, where you can only read forward a file at a time. The files have to be read sequentially, and you can't jump ahead or back. It's not capable of random I/O, like you can do on a disk, it's more like a serial connection you can only read as things appear.
In order to do what you want you have to build a buffer, where you append each block (i.e., file), reconstructing the entire message. If you want to look backwards at any point, you have to look in your buffer. If you want to look forward you have to wait for that block to be received and read and appended.
That's a simple explanation. It's possible to request blocks be resent in IP but really, at the level we normally work at, we're only reading forward.
Related
Is there a generally accepted "correct" way for writing out and reading back in marshaled protocol buffer messages from a file?
I've been working on a smaller project that simulates a full network locally with gRPC and am trying to add writing to/ reading from files s.t. I can save state and start from there when its launched again. It seems I was naive in assuming these would remain on a single line:
Sees chain of length 3
from debugging messages I've written; but,
$ wc test.dat
7 8 2483 test.dat
So, I suppose there are an extra 4 newline's... Is there a method of delimiting these that I can use? or do I need to come up with one on my own? I realize this is straightforward, but in my mind, I can only probabilistically guarantee that <<<<DELIMIT>>>> or whatever will never show up and put me back at square 1.
Use proto.Marshal/Unmarshal:
That way you simulate (closest) to receiving the message while avoiding side effects from other Marshal methods.
Alternative: Dump it as []byte and reread it.
I am writing an application to read from a list of files, line by line and do some processing. I want to use as little RAM as I can.
I came across this question https://stackoverflow.com/a/41741702/3531263
Where the poster is saying readString uses more RAM than readLine and they have posted some code.
What I don't understand is how one uses more RAM? Because ultimately, the way their code is written, they are still writing an entire line to their buffer. So would that not mean if they had just used readString, it would have been the same thing?
the way their code is written, they are still writing an entire line to their buffer
Their code, yes. Your code might not need the whole line to be in memory at the same time. For example, your program is filtering a log file by request id, which is in the beginning of the line. It doesn't need to read the whole line which may be a few megabytes or more, only to reject it due to wrong request id. But with ReadString you don't have the luxury of choice.
I 'gree with Sergio. Also, have a look at the current implementation in the standard library. ReadLine calls ReadSlice('\n') once, then runs through a few branches to make sure the appropriate sentinel values or errors are returned with the converted data. On the other hand, ReadBytes and ReadString both loop over repeated calls to ReadSlice(delim), so it follows that they would necessarily be copying at least as much data into memory as ReadLine, and potentially much more if the delimiter wasn't found in the first call.
I would like to do the following :
I want to imple,ment the concept of FIFO in normal files using GUILE.
Two processes should communicate via a normal text file, that a third process , if needed, can access.
The subordinate of the original two processes should write in the file, line after line, that is append. So far so good. (implemented in c++)
The master proces however, should treat this file as a FIFO, it should read the first line, and do somethong corresponding to it, and delete the first line leaving the rest intact.
The problems are :
While the Master is accessing the file, the subordinate may come to a point where it must write there, leading to a conflict.
Popping the first line may need reading the whole ile out, in a string, poping the first thereof, and then saving it, which is memory intensive, and the second saving action may again conflict with the child trying to write there,
I wanted to implement this in GUILE, because since it is the official OS extension language, there might be better ways which addresses the above two issues.
But in the web I do not find much to orient myself. Please help, sorry for the lewss than concrete question, then I dont have a code snippet to show.
I'm working on an application that needs to store a large 2GB+ XML file for processing, and I'm facing two problems:
How do I process the file? Loading the whole file into Nokogiri at once won't work. It quickly eats up memory and, as far as I can tell, the process gets nuked from orbit. Are there Heroku-compatible ways to quickly/easily read a large XML file located on a non-Heroku server in smaller chunks?
How do I store the file? The site is set up to use S3, but the data provider needs FTP access to upload the XML file nightly. S3 via FTP is apparently a no-go, and storing the file on Heroku won't work either, as it'll only be seen by the dyno that owns it and is susceptible to being randomly purged. Has anyone encountered this type of constraint before, and if so, how'd you work around it?
Most of the time we prefer parsing the entire file that has been pulled into memory because it's easier to jump back and forth, extracting this and that as our code needs. Because it's in memory, we can do random access easily, if we want.
For your need, you'll want to start at the top of the file, and read each line, looking for the tags of interest, until you get to the end of the file. For that, you want to use Nokogiri::XML::SAX and Nokogiri::XML::SAX::Parser, along with the events in Nokogiri::XML::SAX::Document. Here's a summary of what it does, from Nokogiri's site:
The basic way a SAX style parser works is by creating a parser, telling the parser about the events we’re interested in, then giving the parser some XML to process. The parser will notify you when it encounters events your said you would like to know about.
SAX is a different beast than dealing with the DOM, but it can be very fast, and is a lot easier on memory.
If you wanted to load the file in smaller chunks, you could process the XML inside an OpenURI.open or Net::HTTP block, so you'd be getting it in TCP packet-size chunks. The problem then is that your lines could be split, because TCP doesn't guarantee reading by lines, but by blocks, which is what you'll see inside the read loop. Your code would have to peel off partial lines at the end of the buffer, and then prepend them to the read buffer so the next block read finishes the line.
You'll need a streaming parser. Have a look at https://github.com/craigambrose/sax_stream
You could run your own FTP server on EC2? Or use a hosted provider such as https://hostedftp.com/
I've checked out a few of the forum posts here and can't find quite what I'm looking for. Suppose you are reading in a text document via Ruby. I understand the stream is essentially the characters coming in byte by byte. What is the purpose/best practice of buffering in this case? My book shows plenty examples of the buffer being utilized, but no real description of what the buffer is or why it even exists. What should I be considering when setting the buffer? For example, the book illustrates the following method as:
read(n, buffer=nil) reads in n bytes, until the bytes are ready
I don't understand what the statement "until the bytes are ready" means. Does the buffer play a role in this? Please feel free to point me to another place where this is explained, I couldn't for the life of me find it on my own.
IO can be not only file, but a network socket. and in networks you regularly have a situation where you are ready to process more data, but the remote side have a pause in data sending.
(You usually see a progress bar or a spinner element in your browser in these cases)
So, if you are using regular files, the bytes are always 'ready'.
The Picaxe book for IO#read says:
Reads at most int bytes from the I/O stream or to the end of file if int is omitted. Returns nil if called at end of file. If buffer (a String) is provided, it is resized accordingly, and input is read directly into it.