Is there a generally accepted "correct" way for writing out and reading back in marshaled protocol buffer messages from a file?
I've been working on a small project that simulates a full network locally with gRPC, and I am trying to add writing to and reading from files so that I can save state and resume from it when the program is launched again. It seems I was naive in assuming these messages would remain on a single line:
Sees chain of length 3
from debugging messages I've written; but,
$ wc test.dat
7 8 2483 test.dat
So, I suppose there are an extra 4 newlines... Is there a method of delimiting these that I can use, or do I need to come up with one on my own? I realize this is straightforward, but in my mind I can only probabilistically guarantee that <<<<DELIMIT>>>> or whatever will never show up in the data and put me back at square one.
Use proto.Marshal/Unmarshal:
That way you come as close as possible to simulating receipt of the message, while avoiding side effects from other Marshal methods.
Alternative: Dump it as []byte and reread it.
I am writing an application to read from a list of files, line by line and do some processing. I want to use as little RAM as I can.
I came across this question: https://stackoverflow.com/a/41741702/3531263
The poster there says that ReadString uses more RAM than ReadLine, and they have posted some code.
What I don't understand is how one uses more RAM than the other. Ultimately, the way their code is written, they are still writing an entire line to their buffer. So would that not mean that if they had just used ReadString, it would have been the same thing?
the way their code is written, they are still writing an entire line to their buffer
Their code, yes. Your code might not need the whole line to be in memory at the same time. For example, suppose your program filters a log file by request id, which is at the beginning of each line. It doesn't need to read the whole line, which may be a few megabytes or more, only to reject it because the request id is wrong. But with ReadString you don't have the luxury of choice.
I agree with Sergio. Also, have a look at the current implementation in the standard library. ReadLine calls ReadSlice('\n') once, then runs through a few branches to make sure the appropriate sentinel values or errors are returned with the converted data. On the other hand, ReadBytes and ReadString both loop over repeated calls to ReadSlice(delim), so it follows that they would necessarily be copying at least as much data into memory as ReadLine, and potentially much more if the delimiter wasn't found in the first call.
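The log-filtering idea above can be sketched as follows. This is a Python analogue rather than Go's bufio API, and the names matching_lines and chunk_size are made up for the illustration. It assumes a binary stream and a chunk_size at least as large as the prefix. Rejected lines are consumed in bounded pieces and never assembled in memory; only matching lines are buffered whole.

```python
def matching_lines(stream, prefix, chunk_size=4096):
    """Yield the lines that start with prefix, reading at most chunk_size bytes at a time.

    Hypothetical helper: assumes a binary stream and chunk_size >= len(prefix).
    """
    while True:
        head = stream.readline(chunk_size)  # at most chunk_size bytes of the next line
        if not head:
            return  # end of input
        keep = head.startswith(prefix)
        parts = [head] if keep else None
        fragment = head
        # Consume the rest of the line; buffer it only if the line matched.
        while not fragment.endswith(b"\n"):
            fragment = stream.readline(chunk_size)
            if not fragment:
                break  # the last line had no trailing newline
            if keep:
                parts.append(fragment)
        if keep:
            yield b"".join(parts)
```

In Go, the comparable hook is the isPrefix value returned by bufio.Reader.ReadLine, which signals that only the first part of an over-long line was returned.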
I have a situation where I need to concurrently read/write from/to the file, but the scope of operations is limited:
append only, no random offset writes
read from a random position, where I know for sure the content has been written before (via append; internal access is serialized through a Go channel to ensure a random read happens only after the content has been appended)
there is only one process running
This is a heavily loaded application, and I would like to avoid locking the file for each read/write I do.
I was going to open 2 files: one for reading, another for append only.
Would doing so create any potential issues or bugs?
What is the recommended practice if I would like to avoid file locking for each read/write I do?
p.s. golang, linux, ext4
I'll assume by "random read" you actually mean "arbitrary read".
If I understand your use case correctly, you don't need to seek or lock or do anything manual. UNIX has this covered via O_APPEND. Here is what you can do:
Open the file with os.O_APPEND. This way every write, regardless of any preceding operations, will go to the end of the file.
When reading, use File.ReadAt. This lets you specify arbitrary offsets for your reads.
Using this scheme you can avoid any sort of locking: the OS will do it for you. Because of the buffer cache this scheme is not even inefficient: appends and reads are pretty much independent.
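A minimal sketch of this scheme, written in Python only to show the underlying UNIX calls; in Go the counterparts are os.OpenFile with os.O_APPEND and File.ReadAt. The AppendLog class and its method names are hypothetical, and the sketch assumes a single writer that tracks the file size itself, as in the question.

```python
import os

class AppendLog:
    """Single-writer, append-only file with positional reads (hypothetical helper)."""

    def __init__(self, path):
        # O_APPEND makes the kernel place every write at the current end of file.
        self._wfd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        self._rfd = os.open(path, os.O_RDONLY)
        self._size = os.fstat(self._wfd).st_size  # tracked by the single writer

    def append(self, payload):
        """Append payload and return the (offset, length) where it landed."""
        offset = self._size
        view = memoryview(payload)
        while view:  # os.write may write fewer bytes than requested
            written = os.write(self._wfd, view)
            view = view[written:]
        self._size += len(payload)
        return offset, len(payload)

    def read_at(self, offset, length):
        """Read length bytes at offset; pread does not move any file position."""
        chunks = []
        while length > 0:
            chunk = os.pread(self._rfd, length, offset)
            if not chunk:
                break  # reached end of file early
            chunks.append(chunk)
            offset += len(chunk)
            length -= len(chunk)
        return b"".join(chunks)

    def close(self):
        os.close(self._wfd)
        os.close(self._rfd)
```

Handing the (offset, length) pair to readers only after append returns mirrors the channel-based serialization described in the question, so a reader never asks for bytes that have not been written yet.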
I would like to do the following:
I want to implement the concept of a FIFO in normal files using GUILE.
Two processes should communicate via a normal text file that a third process, if needed, can access.
The subordinate of the original two processes should write to the file, line after line, that is, append. So far so good. (implemented in C++)
The master process, however, should treat this file as a FIFO: it should read the first line, do something corresponding to it, and delete the first line, leaving the rest intact.
The problems are:
While the master is accessing the file, the subordinate may come to a point where it must write there, leading to a conflict.
Popping the first line may require reading the whole file into a string, popping the first line from it, and then saving it again, which is memory intensive, and the second saving action may again conflict with the child trying to write there.
I wanted to implement this in GUILE because, since it is the official OS extension language, there might be better ways that address the above two issues.
But on the web I do not find much to orient myself. Please help, and sorry for the less than concrete question; I don't have a code snippet to show.
I wrote some simple code to learn the structure of a TCPSocket. I thought it was like an IO stream, so I tried to use seek to move the "reading position" back a few bytes:
socket.gets #=> hello world
socket.seek(-5, IO::SEEK_CUR)
socket.gets #=> hello world # this should return world
but it gives me an error:
server.rb:11:in `seek': Illegal seek (Errno::ESPIPE)
Does anyone have an idea why this doesn't work?
If that were the case, the socket would need to keep all data around in case someone decided to seek backwards (and how would a forward seek work: block for more data?). You could probably quite easily write a wrapper class around a socket that keeps track of a position and also buffers all data, blocks if needed, etc.
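But maybe you could try IO#each_byte or IO#each_char (called without a block, they return an Enumerator) in combination with Enumerator#peek? The older IO#bytes and IO#chars served the same purpose but were removed in Ruby 3.0.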
TCP/IP would be more like having a series of files on disk, where you can only read forward a file at a time. The files have to be read sequentially, and you can't jump ahead or back. It's not capable of random I/O the way a disk is; it's more like a serial connection that you can only read as things appear.
In order to do what you want you have to build a buffer, where you append each block (i.e., file), reconstructing the entire message. If you want to look backwards at any point, you have to look in your buffer. If you want to look forward you have to wait for that block to be received and read and appended.
That's a simple explanation. TCP does retransmit lost segments under the hood, but at the level we normally work at, we're only reading forward.
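The buffering approach described in these answers can be sketched as a small wrapper. It is shown in Python rather than Ruby, and the class name ReplayableStream and its methods are invented for the illustration. It keeps every byte received so far, so memory use grows with the length of the connection.

```python
import socket

class ReplayableStream:
    """Buffers everything received so that reads can be rewound (hypothetical helper)."""

    def __init__(self, sock):
        self._sock = sock
        self._buffer = bytearray()  # every byte received so far
        self._pos = 0               # current read position within the buffer

    def read(self, n):
        """Return n bytes from the current position (fewer only if the peer closed)."""
        while len(self._buffer) - self._pos < n:
            chunk = self._sock.recv(4096)  # blocks until more data arrives
            if not chunk:
                break  # peer closed the connection
            self._buffer.extend(chunk)
        data = bytes(self._buffer[self._pos:self._pos + n])
        self._pos += len(data)
        return data

    def rewind(self, n):
        """Move the read position back by n bytes, never before the start."""
        self._pos = max(0, self._pos - n)
```

Looking backwards is served from the buffer, while looking forwards past what has been received simply blocks in recv until the data arrives.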
Using Google protobuf, I am saving my serialized message data to a file; each file contains several messages. We have both C++ and Python versions of the code, so I need to use protobuf functions that are available in both languages. I have experimented with SerializeToArray and SerializeAsString, and there seem to be the following unfortunate conditions:
SerializeToArray: As suggested in one answer, the best way to use this is to prefix each message with its data size. This would work great for C++, but in Python it doesn't look like this is possible. Am I wrong?
SerializeAsString: This generates a serialized string equivalent to its binary counterpart, which I can save to a file. But what happens if one of the characters in the serialization result is \n? How do we find line endings, or the endings of messages for that matter?
Update:
Please allow me to rephrase slightly. As I understand it, I cannot write binary data in C++ because then our Python application cannot read the data, since it can only parse string serialized messages. Should I then instead use SerializeAsString in both C++ and Python? If yes, then is it best practice to store such data in a text file rather than a binary file? My gut feeling is binary, but as you can see this doesn't look like an option.
We have had great success base64-encoding the messages and using a simple \n to separate them. This will of course depend a lot on your use; we need to store the messages in "log" files. Encoding and decoding naturally has overhead, but this has not even remotely been an issue for us.
The advantage of keeping these messages as line-separated text has so far been invaluable for maintenance and debugging. Figure out how many messages are in a file? wc -l. Find the Nth message? head ... | tail. Figure out what's wrong with a record on a remote system you need to access through 2 VPNs and a Citrix solution? Copy and paste the message and mail it to the programmer.
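A minimal sketch of the base64-per-line approach, treating each serialized message as opaque bytes (for example, the result of SerializeToString() in Python or SerializeAsString() in C++). The function names write_records and read_records are hypothetical.

```python
import base64

def write_records(path, payloads):
    """Append each payload to the file as one base64-encoded line."""
    with open(path, "ab") as f:
        for payload in payloads:
            # Standard base64 output never contains a newline, so "\n" is a safe separator.
            f.write(base64.b64encode(payload) + b"\n")

def read_records(path):
    """Yield the decoded payload of every line, including empty payloads."""
    with open(path, "rb") as f:
        for line in f:
            yield base64.b64decode(line.rstrip(b"\r\n"))
```

Blank lines are decoded rather than skipped, because a message whose fields are all at their default values serializes to zero bytes and would otherwise be lost.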
The best practice for concatenating messages in this way is to prepend each message with its size. That way you read in the size (try a 32-bit int or something), then read that number of bytes into a buffer and deserialize it. Then read the next size, and so on.
The same goes for writing: you first write out the size of the message, then the message itself.
See Streaming Multiple Messages in the protobuf documentation for more information.
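A minimal sketch of size-prefixed framing in Python, which also shows that the approach is not limited to C++. The 4-byte big-endian prefix is an assumed convention that the C++ writer would have to match, and write_frame and read_frames are hypothetical names. Payloads are opaque bytes, such as the output of SerializeToString().

```python
import struct

# Assumed convention: 4-byte big-endian unsigned length before each message.
_HEADER = struct.Struct(">I")

def write_frame(stream, payload):
    """Write the payload length, then the payload itself."""
    stream.write(_HEADER.pack(len(payload)))
    stream.write(payload)

def read_frames(stream):
    """Yield each payload in order until the stream is exhausted."""
    while True:
        header = stream.read(_HEADER.size)
        if not header:
            return  # clean end of file
        if len(header) < _HEADER.size:
            raise EOFError("truncated length prefix")
        (length,) = _HEADER.unpack(header)
        payload = stream.read(length)
        if len(payload) < length:
            raise EOFError("truncated message body")
        yield payload
```

Each yielded payload can then be passed to the message's ParseFromString method. The file must be opened in binary mode ("wb" / "rb") so that bytes such as \n inside a message are left untouched.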
Protobuf is a binary format, so reading and writing should be done as binary, not text.
If you don't want binary format, you should consider using something other than protobuf (there are lots of textual data formats, such as XML, JSON, CSV); just using text abstractions is not enough.