Go read text with goroutine

I want to read a text file with goroutines. The order of the text that gets read from the file does not matter. How do I read a file with concurrency?
scanner := bufio.NewScanner(file)
for scanner.Scan() {
    lines = append(lines, scanner.Text())
}
For example, if the text file contains "I like Go", I want to read it without concern for word order. The result could be []string{"Go", "like", "I"}.

First of all, if you're reading from an io.Reader, consider it as reading from a stream. It's a single input source, which you can't 'read in parallel' because of its nature: under the hood, you're getting a byte, waiting for the next one, getting one more, and so on. Tokenizing it into words comes later, in the buffer.
Second, I hope you're not trying to use goroutines as a 'silver bullet' in a 'let's add goroutines and everything will just speed up' manner. Just because Go gives you such an easy way to use concurrency doesn't mean you should use it everywhere.
And finally, if you really need to split a huge file into words in parallel and you think the splitting part will be the bottleneck (I don't know your case, but I really doubt it), then you have to invent your own algorithm and use the os package to Seek()/Read() parts of the file, each processed by its own goroutine, while tracking somehow which parts have already been processed.

Related

Protobuffers and Golang --- writing out marshal'ed structs and reading back in

Is there a generally accepted "correct" way for writing out and reading back in marshaled protocol buffer messages from a file?
I've been working on a smaller project that simulates a full network locally with gRPC, and am trying to add writing to / reading from files so that I can save state and start from there when it's launched again. It seems I was naive in assuming these would remain on a single line:
Sees chain of length 3
from debugging messages I've written; but,
$ wc test.dat
7 8 2483 test.dat
So I suppose there are an extra 4 newlines... Is there a method of delimiting these that I can use, or do I need to come up with one on my own? I realize this is straightforward, but in my mind I can only probabilistically guarantee that <<<<DELIMIT>>>> or whatever will never show up and put me back at square one.
Use proto.Marshal/Unmarshal:
That way you simulate (closest to) receiving the message while avoiding side effects from other Marshal methods.
Alternative: dump it as []byte and reread it.

readString vs readLine

I am writing an application to read from a list of files, line by line and do some processing. I want to use as little RAM as I can.
I came across this question https://stackoverflow.com/a/41741702/3531263
Where the poster is saying readString uses more RAM than readLine and they have posted some code.
What I don't understand is how one uses more RAM? Because ultimately, the way their code is written, they are still writing an entire line to their buffer. So would that not mean if they had just used readString, it would have been the same thing?
the way their code is written, they are still writing an entire line to their buffer
Their code, yes. Your code might not need the whole line to be in memory at the same time. For example, say your program is filtering a log file by request id, which is at the beginning of the line. It doesn't need to read the whole line, which may be a few megabytes or more, only to reject it due to a wrong request id. But with ReadString you don't have the luxury of choice.
I agree with Sergio. Also, have a look at the current implementation in the standard library. ReadLine calls ReadSlice('\n') once, then runs through a few branches to make sure the appropriate sentinel values or errors are returned with the converted data. On the other hand, ReadBytes and ReadString both loop over repeated calls to ReadSlice(delim), so it follows that they would necessarily be copying at least as much data into memory as ReadLine, and potentially much more if the delimiter wasn't found in the first call.

How do I print the contents of a files in a directory but ignore files which are opened in write mode?

I have a goroutine which periodically checks for new files in a directory and then prints the contents of the files. However there is another goroutine which creates a file, writes contents into it and then saves the file.
How do I ignore the files which are open in WRITE mode in a directory?
Sample Code:
for {
    fileList, err := ioutil.ReadDir("/uploadFiles")
    if err != nil {
        log.Println(err) // log.Fatal would exit here, making continue unreachable
        continue
    }
    for _, f := range fileList {
        log.Println("File : ", f.Name())
        go printContents(f.Name())
    }
    time.Sleep(time.Second * 5)
}
In the printContents goroutine I want to ignore the files which are open in WRITE mode.
That is not how it's done.
Off the top of my head I can think of these options:
If both goroutines are working in the same program,
there is little problem: make the "producer" goroutine register
the names of the files it has completed modifying into some
registry, and make the "consumer" goroutine read (and delete)
from that registry.
In the simplest case that could be a buffered channel.
If the producer works much faster than the consumer,
and you don't want to block the former for some reason
then a slice protected by a mutex would fit the bill.
If the goroutines work in different processes on the same
machine but you control both programs, make the producer
process communicate the same data to the consumer process
via any suitable sort of IPC.
What method to do IPC is better depends on how the
processes start up, interact etc.
There is a wide variety of options.
If you control both processes but do not want to mess with
IPC between them (there are reasons, too), then make the producer
follow best practices on how to write a file
(more on this in a moment), and make the consumer use any
filesystem-monitoring facility to report which files get created ("appear") once produced by the producer.
You may start with github.com/fsnotify/fsnotify.
To properly write a file, the producer has to write its
data to a temporary file—that is, a file located in the same
directory but having a filename which is well understood to
indicate that the file is not done with yet—for instance,
".foobar.data.part" or "foobar.data.276gd14054.tmp" is OK for writing "foobar.data".
(Other approaches exist but this one is good enough to
start with.)
Once the file is ready, the producer has to rename the
file from its temporary name to its "proper", final name.
This operation is atomic on all sensible OSes/filesystems,
and makes the file atomically "spring into existence" from the PoV
of the consumer. For instance, inotify on Linux generates
an event of type "moved to" for such an appearance.
If you don't feel like doing the proper thing yourself, github.com/dchest/safefile is a good cross-platform start.
As you can see, with this approach you know
the file is done just from the fact it was reported
to have appeared.
If you do not control the producer, you may need to resort to
guessing.
The simplest is to, again, monitor the filesystem for
events—but this time for "file updated" events, not "file created"
events. For each file reported as updated you have to remember
the timestamp of that event, and when a certain amount of time passes, you may declare that the file is done by the producer.
IMO this approach is the worst of all, but if you have no
better options it's at least something.

How can I exit reader.ReadString from waiting for user input?

I am making it so that it stops asking for input upon CTRL-C.
What I have currently is that a separate goroutine, upon receiving a CTRL-C, changes the value of a variable so it won't ask for another line. However, I can't seem to find a way around the current line.
i.e. I still have to press enter once, to get out of the current iteration of reading for \n.
Is there perhaps a way to push a "\n" into stdin for the reader.ReadString to read? Or a way to stop its execution altogether?
The only decent mechanism that Go gives you to proceed when either of two things happens is select, and select only selects on channel reads, so your only option is to change your signal-handler goroutine to write to a channel, and add another goroutine that handles stdin and passes lines of input to a channel, then select on the two channels.
However, that still leaves your question half-unanswered: your main program can stop waiting for input on a Ctrl-C, but the goroutine that's reading input will still be waiting for input. In some cases that might be okay... if you will never need stdin again, or if you will go right back to processing lines in the same exact way. But if you want to do something other than ReadString from that reader, you're stuck... literally. The only solution I see would be to write your own state machine around Read or ReadByte that is capable of changing its behavior in response to external conditions, but that can easily get horribly complicated.
Basically, this looks like a case where Go simplifies things compared to the underlying system (not exposing anything like EINTR, not allowing select on filehandles), but ends up providing less power to the programmer.

Incrementally reading logs

Looked around with numerous search strings but can't find anything quite like this:
I'm writing a custom log parser (à la analog or webalizer, except not for a webserver) and I want to be able to skip the hard work for the lines that have already been parsed. I have thought about using a history file like webalizer but have no idea how it actually works internally, and my C is pretty poor.
I've considered hashing each line and writing the hashes out, then parsing the history file for their presence but I think this will perform poorly.
The only other method I can think of is storing the line number of the last parse and skipping until that number is reached the next time round. What happens when the log is rotated I am not sure.
Any other ideas would be appreciated. I will be writing the parser in ruby but tips in a similar language will help as well.
The solutions I can think of right now are bound to be brittle.
Even if you store the line number and later realize it would be past the length of the current file, what happens if old lines have been trimmed? You would start reading (well) after the last position.
If, on the other hand, you are sure your log files won't be tampered with and they will only be rotated, I only see two ways of doing what you want, and I'm not sure the second is applicable to you.
Anyway, here goes.
First solution
You store the last line you parsed along with a timestamp. At the next run, you consider all the rotated log files sorting them by their last modified date, figure out which one you read last time, and start reading from there.
I didn't think this through, there might be funny corner cases you will need to handle.
Second solution
You create a background script that continuously watches the log file. A quick search on Google turned up this gem, but I'm not sure if that's even an option for you. Even then, you might want to integrate this solution with the previous one, just in case your daemon gets interrupted (because that's clearly bound to happen at some point).
As you read the file and parse the lines keep track of the byte count. Save that. On next read, try to seek to that byte offset in the file. If the file is smaller than the byte count, it's a new file so start at the beginning.