Passing a pointer to bufio.Scanner() - go
Lest I provide an XY problem, my goal is to share a memory-mapped file between multiple goroutines as recommended. Each goroutine needs to iterate over the file line by line so I had hoped to store the complete contents in memory first to speed things up.
The method I tried is passing a pointer to a bufio.Scanner, but that is not working. I thought it might be related to needing to reset the seek position to the beginning of the file, but it is not even working the very first time, and I can find no such parameter in the documentation. My attempt was to create this function and then pass the result by reference to the function I intend to run in a goroutine (for right now, I am not using goroutines, just to make sure this works outright, which it does not).
Here is a MWE:
// ... package declaration; imports; yada yada

func main() {
    // ... validate path to file stored in filePath variable
    filePath := "/path/to/file.txt"

    // get word list scanner to be shared between goroutines
    scanner := getScannerPtr(&filePath)

    // pass to function (no goroutine for now, I try to solve one problem at a time)
    myfunc(scanner)
}

func getScannerPtr(filePath *string) *bufio.Scanner {
    f, err := os.Open(*filePath)
    if err != nil {
        fmt.Fprint(os.Stderr, "Error opening file\n")
        panic(err)
    }
    defer f.Close()

    scanner := bufio.NewScanner(f)
    scanner.Split(bufio.ScanLines)
    return scanner
}

func myfunc(scanner *bufio.Scanner) {
    for scanner.Scan() {
        line := strings.TrimSpace(scanner.Text())
        // ... do something with line
    }
}
I'm not receiving any errors; it's just not iterating over the file when I call Scan(), so it never makes it inside that block to do anything with each line of the file. Keep in mind I am not even using concurrency yet; that is just my eventual goal, which I want to point out in case it impacts the approach I need to take.
Why is Scan() not working?
Is this a viable approach if I intend to call go myfunc(scanner) in the future?
You're closing the file before you ever use the Scanner:
func getScannerPtr(filePath *string) *bufio.Scanner {
    f, err := os.Open(*filePath)
    if err != nil {
        fmt.Fprint(os.Stderr, "Error opening file\n")
        panic(err)
    }
    defer f.Close() // <--- Here

    scanner := bufio.NewScanner(f)
    scanner.Split(bufio.ScanLines)
    return scanner // <-- File gets closed, then the Scanner that tries to read it is returned for further use, which won't work
}
Because Scanner does not expose Close, you'll need to work around this; the quickest is probably to make a simple custom type with a couple of embedded fields:
type FileScanner struct {
    io.Closer
    *bufio.Scanner
}

func getScannerPtr(filePath *string) *FileScanner {
    f, err := os.Open(*filePath)
    if err != nil {
        fmt.Fprint(os.Stderr, "Error opening file\n")
        panic(err)
    }
    scanner := bufio.NewScanner(f)
    return &FileScanner{f, scanner}
}

func myfunc(scanner *FileScanner) {
    defer scanner.Close()
    for scanner.Scan() {
        line := strings.TrimSpace(scanner.Text())
        // ... do something with line
    }
}
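For the eventual go myfunc(scanner) goal, here is a minimal sketch of how main might look (my own addition, not part of the answer above; imports elided as in the question's MWE). It assumes a single goroutine owns the FileScanner, since a bufio.Scanner is not safe to share between goroutines without synchronization, and it uses a sync.WaitGroup so main waits for the work to finish:

func main() {
    filePath := "/path/to/file.txt"
    scanner := getScannerPtr(&filePath)

    var wg sync.WaitGroup
    wg.Add(1)
    go func() {
        defer wg.Done()
        myfunc(scanner) // myfunc closes the FileScanner when it finishes
    }()
    wg.Wait()
}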
Related
why *(*string)(unsafe.Pointer(&b)) doesn't work with bufio.Reader
I have a file containing some IPs:

1.1.1.0/24
1.1.2.0/24
2.2.1.0/24
2.2.2.0/24

I read this file into a slice and used *(*string)(unsafe.Pointer(&b)) to convert []byte to string, but it doesn't work:

func TestInitIpRangeFromFile(t *testing.T) {
    filepath := "/tmp/test"
    file, err := os.Open(filepath)
    if err != nil {
        t.Errorf("failed to open ip range file:%s, err:%s", filepath, err)
    }
    reader := bufio.NewReader(file)
    ranges := make([]string, 0)
    for {
        ip, _, err := reader.ReadLine()
        if err != nil {
            if err == io.EOF {
                break
            }
            logger.Fatalf("failed to read ip range file, err:%s", err)
        }
        t.Logf("ip:%s", *(*string)(unsafe.Pointer(&ip)))
        ranges = append(ranges, *(*string)(unsafe.Pointer(&ip)))
    }
    t.Logf("%v", ranges)
}

Result:

task_test.go:71: ip:1.1.1.0/24
task_test.go:71: ip:1.1.2.0/24
task_test.go:71: ip:2.2.1.0/24
task_test.go:71: ip:2.2.2.0/24
task_test.go:75: [2.2.2.0/24 1.1.2.0/24 2.2.1.0/24 2.2.2.0/24]

Why did 1.1.1.0/24 change to 2.2.2.0/24? If I change *(*string)(unsafe.Pointer(&ip)) to string(ip), it works.
So, while reinterpreting a slice-header as a string-header the way you did is absolutely bonkers and has no guarantee whatsoever of working correctly, it's only indirectly the cause of your problem.

The real problem is that you're retaining a pointer to the return value of bufio.Reader.ReadLine(), but the docs for that method say "The returned buffer is only valid until the next call to ReadLine." Which means that the reader is free to reuse that memory later on, and that's what's happening.

When you do the cast in the proper way, string(ip), Go copies the contents of the buffer into the newly-created string, which remains valid in the future. But when you type-pun the slice into a string, you keep the exact same pointer, which stops working as soon as the reader refills its buffer.

If you decided to do the pointer trickery as a performance hack to avoid copying and allocation... too bad. The reader interface is going to force you to copy the data out anyway, and since it does, you should just use string().
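To make the buffer reuse concrete, here is a minimal sketch (my own illustration, not part of the original answer). It uses a deliberately tiny reader buffer so the second ReadLine call refills and reuses the buffer that the type-punned string still points into, while the string(...) copy is unaffected:

package main

import (
    "bufio"
    "fmt"
    "strings"
    "unsafe"
)

func main() {
    // A deliberately small 16-byte buffer forces the reader to refill (and
    // therefore reuse) its buffer between the two ReadLine calls.
    r := bufio.NewReaderSize(strings.NewReader("1.1.1.0/24\n2.2.2.0/24\n"), 16)

    line, _, _ := r.ReadLine()
    aliased := *(*string)(unsafe.Pointer(&line)) // reinterprets the slice header: still points into r's buffer
    copied := string(line)                       // copies the bytes out of the buffer

    r.ReadLine() // refilling the buffer overwrites the bytes that `aliased` points at

    fmt.Println(copied)  // always prints "1.1.1.0/24"
    fmt.Println(aliased) // may now print "2.2.2.0/24" instead of the first line
}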
Golang reading from serial
I'm trying to read from a serial port (a GPS device on a Raspberry Pi). Following the instructions from http://www.modmypi.com/blog/raspberry-pi-gps-hat-and-python I can read from shell using

stty -F /dev/ttyAMA0 raw 9600 cs8 clocal -cstopb
cat /dev/ttyAMA0

I get well formatted output

$GNGLL,5133.35213,N,00108.27278,W,160345.00,A,A*65
$GNRMC,160346.00,A,5153.35209,N,00108.27286,W,0.237,,290418,,,A*75
$GNVTG,,T,,M,0.237,N,0.439,K,A*35
$GNGGA,160346.00,5153.35209,N,00108.27286,W,1,12,0.67,81.5,M,46.9,M,,*6C
$GNGSA,A,3,29,25,31,20,26,23,21,16,05,27,,,1.11,0.67,0.89*10
$GNGSA,A,3,68,73,83,74,84,75,85,67,,,,,1.11,0.67,0.89*1D
$GPGSV,4,1,15,04,,,34,05,14,040,21,09,07,330,,16,45,298,34*40
$GPGSV,4,2,15,20,14,127,18,21,59,154,30,23,07,295,26,25,13,123,22*74
$GPGSV,4,3,15,26,76,281,40,27,15,255,20,29,40,068,19,31,34,199,33*7C
$GPGSV,4,4,15,33,29,198,,36,23,141,,49,30,172,*4C
$GLGSV,3,1,11,66,00,325,,67,13,011,20,68,09,062,16,73,12,156,21*60
$GLGSV,3,2,11,74,62,177,20,75,53,312,36,76,08,328,,83,17,046,25*69
$GLGSV,3,3,11,84,75,032,22,85,44,233,32,,,,35*62
$GNGLL,5153.35209,N,00108.27286,W,160346.00,A,A*6C
$GNRMC,160347.00,A,5153.35205,N,00108.27292,W,0.216,,290418,,,A*7E
$GNVTG,,T,,M,0.216,N,0.401,K,A*3D
$GNGGA,160347.00,5153.35205,N,00108.27292,W,1,12,0.67,81.7,M,46.9,M,,*66
$GNGSA,A,3,29,25,31,20,26,23,21,16,05,27,,,1.11,0.67,0.89*10
$GNGSA,A,3,68,73,83,74,84,75,85,67,,,,,1.11,0.67,0.89*1D
$GPGSV,4,1,15,04,,,34,05,14,040,21,09,07,330,,16,45,298,34*40

(I've put some random data in)

I'm trying to read this in Go. Currently, I have

package main

import "fmt"
import "log"
import "github.com/tarm/serial"

func main() {
    config := &serial.Config{
        Name:        "/dev/ttyAMA0",
        Baud:        9600,
        ReadTimeout: 1,
        Size:        8,
    }

    stream, err := serial.OpenPort(config)
    if err != nil {
        log.Fatal(err)
    }

    buf := make([]byte, 1024)
    for {
        n, err := stream.Read(buf)
        if err != nil {
            log.Fatal(err)
        }
        s := string(buf[:n])
        fmt.Println(s)
    }
}

But this prints malformed data. I suspect that this is due to the buffer size or the value of Size in the config struct being wrong, but I'm not sure how to get those values from the stty settings.

Looking back, I think the issue is that I'm getting a stream and I want to be able to iterate over lines of the stty, rather than chunks. This is how the stream is outputted:

$GLGSV,3
,1,09,69
,10,017,
,70,43,0
69,,71,3
2,135,27
,76,23,2
32,22*6F
$GLGSV
,3,2,09,
77,35,30
0,21,78,
11,347,,
85,31,08
1,30,86,
72,355,3
6*6C
$G
LGSV,3,3
,09,87,2
4,285,30
*59
$GN
GLL,5153
.34919,N
,00108.2
7603,W,1
92901.00
,A,A*6A
The struct you get back from serial.OpenPort() contains a pointer to an open os.File corresponding to the opened serial port connection. When you Read() from this, the library calls Read() on the underlying os.File.

The documentation for this function call is: "Read reads up to len(b) bytes from the File. It returns the number of bytes read and any error encountered. At end of file, Read returns 0, io.EOF."

This means you have to keep track of how much data was read. You also have to keep track of whether there were newlines, if this is important to you.

Unfortunately, the underlying *os.File is not exported, so you'll find it difficult to use tricks like bufio.ReadLine(). It may be worth modifying the library and sending a pull request.

As Matthew Rankin noted in a comment, Port implements io.ReadWriter, so you can simply use bufio to read by lines:

stream, err := serial.OpenPort(config)
if err != nil {
    log.Fatal(err)
}

scanner := bufio.NewScanner(stream)
for scanner.Scan() {
    fmt.Println(scanner.Text()) // Println will add back the final '\n'
}
if err := scanner.Err(); err != nil {
    log.Fatal(err)
}
Change fmt.Println(s) to fmt.Print(s) and you will probably get what you want. Or did I misunderstand the question?
Two additions to Michael Hampton's answer which can be useful:

Line endings: You might receive data that is not newline-separated text. bufio.Scanner uses ScanLines by default to split the received data into lines, but you can also write your own line splitter based on the default function's signature and set it for the scanner:

scanner := bufio.NewScanner(stream)
scanner.Split(ownLineSplitter) // set custom line splitter function

Reader shutdown: You might not receive a constant stream but only some packets of bytes from time to time. If no bytes arrive at the port, the scanner will block and you can't just kill it. You'll have to close the stream to do so, effectively raising an error. To not block any outer loops and handle errors appropriately, you can wrap the scanner in a goroutine that takes a context. If the context was cancelled, ignore the error; otherwise forward the error. In principle, this can look like

var errChan = make(chan error)
var dataChan = make(chan []byte)

ctx, cancelPortScanner := context.WithCancel(context.Background())
go func(ctx context.Context) {
    scanner := bufio.NewScanner(stream)
    for scanner.Scan() { // will terminate if connection is closed
        dataChan <- scanner.Bytes()
    }
    // if execution reaches this point, something went wrong or stream was closed
    select {
    case <-ctx.Done():
        return // ctx was cancelled, just return without error
    default:
        errChan <- scanner.Err() // ctx wasn't cancelled, forward error
    }
}(ctx)
// handle data from dataChan, error from errChan

To stop the scanner, you would cancel the context and close the connection:

cancelPortScanner()
stream.Close()
Collision between garbage collector and deferred functions?
Consider the following code snippet:

func a(fd int) {
    file := os.NewFile(uintptr(fd), "")
    defer func() {
        if err := file.Close(); err != nil {
            fmt.Printf("%v", err)
        }
    }()
}

This piece of code is legit and will work OK: files will be closed upon returning from a(). However, the following will not work correctly:

func a(fd int) {
    file := os.NewFile(uintptr(fd), "")
    defer func() {
        if err := syscall.Close(int(file.Fd())); err != nil {
            fmt.Printf("%v", err)
        }
    }()
}

The error that will occasionally be received is "bad file descriptor", due to the fact that NewFile sets a finalizer which, during garbage collection, will close the file itself.

What's unclear to me is that the deferred function still has a reference to the file, so theoretically it shouldn't be garbage collected yet. So why does the Go runtime behave this way?
The problem with the code is that after file.Fd() returns, file is unreachable, so file may be closed by the finalizer (garbage collected).

According to runtime.SetFinalizer: "For example, if p points to a struct that contains a file descriptor d, and p has a finalizer that closes that file descriptor, and if the last use of p in a function is a call to syscall.Write(p.d, buf, size), then p may be unreachable as soon as the program enters syscall.Write. The finalizer may run at that moment, closing p.d, causing syscall.Write to fail because it is writing to a closed file descriptor (or, worse, to an entirely different file descriptor opened by a different goroutine). To avoid this problem, call runtime.KeepAlive(p) after the call to syscall.Write."

runtime.KeepAlive usage: "KeepAlive marks its argument as currently reachable. This ensures that the object is not freed, and its finalizer is not run, before the point in the program where KeepAlive is called."

func a(fd int) {
    file := os.NewFile(uintptr(fd), "")
    defer func() {
        if err := syscall.Close(int(file.Fd())); err != nil {
            fmt.Printf("%v", err)
        }
        runtime.KeepAlive(file)
    }()
}
golang zlib reader output not being copied over to stdout
I've modified the official documentation example for the zlib package to use an opened file rather than a set of hardcoded bytes (code below). The code reads in the contents of a source text file and compresses it with the zlib package. I then try to read back the compressed file and print its decompressed contents into stdout. The code doesn't error, but it also doesn't do what I expect it to do, which is to display the decompressed file contents into stdout.

Also: is there another way of displaying this information, rather than using io.Copy?

package main

import (
    "compress/zlib"
    "io"
    "log"
    "os"
)

func main() {
    var err error
    // This defends against an error preventing `defer` from being called
    // As log.Fatal otherwise calls `os.Exit`
    defer func() {
        if err != nil {
            log.Fatalln("\nDeferred log: \n", err)
        }
    }()

    src, err := os.Open("source.txt")
    if err != nil {
        return
    }
    defer src.Close()

    dest, err := os.Create("new.txt")
    if err != nil {
        return
    }
    defer dest.Close()

    zdest := zlib.NewWriter(dest)
    defer zdest.Close()

    if _, err := io.Copy(zdest, src); err != nil {
        return
    }

    n, err := os.Open("new.txt")
    if err != nil {
        return
    }

    r, err := zlib.NewReader(n)
    if err != nil {
        return
    }
    defer r.Close()
    io.Copy(os.Stdout, r)

    err = os.Remove("new.txt")
    if err != nil {
        return
    }
}
Your defer func doesn't do anything, because you're shadowing the err variable on every new assignment. If you want a defer to run, return from a separate function, and call log.Fatal after the return statement.

As for why you're not seeing any output, it's because you're deferring all the Close calls. The zlib.Writer isn't flushed until after the function exits, and neither is the destination file. Call Close() explicitly where you need it:

zdest := zlib.NewWriter(dest)
if _, err := io.Copy(zdest, src); err != nil {
    log.Fatal(err)
}
zdest.Close()
dest.Close()
I think you messed up the code logic with all this defer stuff and your "trick" err checking.

Files are definitively written when flushed or closed. You just copy into new.txt without closing it before opening it to read it. Deferring the close of the file is neat inside a function which has multiple exits: it makes sure the file is closed once the function is left. But your main requires new.txt to be closed after the copy, before re-opening it. So don't defer the close here.

BTW: Your defense against log.Fatal terminating the code without calling your defers is, well, at least strange. The files are all put into some proper state by the OS; there is absolutely no need to complicate the stuff like this.
Check the error from the second Copy:

2015/12/22 19:00:33 Deferred log: unexpected EOF
exit status 1

The thing is, you need to close zdest immediately after you've done writing. Close it after the first Copy and it works.
I would have suggested using io.MultiWriter. In this way you read only once from src. Not much gain for small files, but it is faster for bigger files.

w := io.MultiWriter(dest, os.Stdout)
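A fuller sketch of that idea (my own illustration, not part of the original answer): here the MultiWriter feeds the zlib writer rather than the raw destination file, so the compressed new.txt is produced while the original text is echoed to stdout in a single pass over source.txt, which is assumed to exist:

package main

import (
    "compress/zlib"
    "io"
    "log"
    "os"
)

func main() {
    src, err := os.Open("source.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer src.Close()

    dest, err := os.Create("new.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer dest.Close()

    zdest := zlib.NewWriter(dest)
    defer zdest.Close() // runs before dest.Close(), so the zlib stream is flushed first

    // One read of src: compressed bytes go to new.txt, the original
    // text is echoed to stdout at the same time.
    w := io.MultiWriter(zdest, os.Stdout)
    if _, err := io.Copy(w, src); err != nil {
        log.Fatal(err)
    }
}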
Parallel zip compression in Go
I am trying to build a zip archive from a large number of small-to-medium sized files. I want to be able to do this concurrently, since compression is CPU intensive, and I'm running on a multi-core server. Also, I don't want to have the whole archive in memory, since it might turn out to be large.

My question is: do I have to compress every file and then manually combine everything together with the zip header, checksum etc.?

Any help would be greatly appreciated.
I don't think you can combine the zip headers.

What you could do is run the zip.Writer sequentially, in a separate goroutine, and then spawn a new goroutine for each file that you want to read, and pipe those to the goroutine that is zipping them. This should reduce the IO overhead that you get by reading the files sequentially, although it probably won't leverage multiple cores for the archiving itself.

Here's a working example. Note that, to keep things simple, it does not handle errors nicely, just panics if something goes wrong, and it does not use the defer statement too much, to demonstrate the order in which things should happen. Since defer is LIFO, it can sometimes be confusing when you stack a lot of them together.

package main

import (
    "archive/zip"
    "io"
    "os"
    "sync"
)

func ZipWriter(files chan *os.File) *sync.WaitGroup {
    f, err := os.Create("out.zip")
    if err != nil {
        panic(err)
    }
    var wg sync.WaitGroup
    wg.Add(1)
    zw := zip.NewWriter(f)
    go func() {
        // Note the order (LIFO):
        defer wg.Done() // 2. signal that we're done
        defer f.Close() // 1. close the file
        var err error
        var fw io.Writer
        for f := range files {
            // Loop until channel is closed.
            if fw, err = zw.Create(f.Name()); err != nil {
                panic(err)
            }
            io.Copy(fw, f)
            if err = f.Close(); err != nil {
                panic(err)
            }
        }
        // The zip writer must be closed *before* f.Close() is called!
        if err = zw.Close(); err != nil {
            panic(err)
        }
    }()
    return &wg
}

func main() {
    files := make(chan *os.File)
    wait := ZipWriter(files)

    // Send all files to the zip writer.
    var wg sync.WaitGroup
    wg.Add(len(os.Args) - 1)
    for i, name := range os.Args {
        if i == 0 {
            continue
        }
        // Read each file in parallel:
        go func(name string) {
            defer wg.Done()
            f, err := os.Open(name)
            if err != nil {
                panic(err)
            }
            files <- f
        }(name)
    }

    wg.Wait()
    // Once we're done sending the files, we can close the channel.
    close(files)
    // This will cause ZipWriter to break out of the loop, close the file,
    // and unblock the next mutex:
    wait.Wait()
}

Usage: go run example.go /path/to/*.log

This is the order in which things should be happening:

1. Open output file for writing.
2. Create a zip.Writer with that file.
3. Kick off a goroutine listening for files on a channel.
4. Go through each file; this can be done in one goroutine per file.
5. Send each file to the goroutine created in step 3.
6. After processing each file in said goroutine, close the file to free up resources.
7. Once each file has been sent to said goroutine, close the channel.
8. Wait until the zipping has been done (which is done sequentially).
9. Once zipping is done (channel exhausted), the zip writer should be closed.
10. Only when the zip writer is closed should the output file be closed.
11. Finally everything is closed, so close the sync.WaitGroup to tell the calling function that we're good to go. (A channel could also be used here, but sync.WaitGroup seems more elegant.)
12. When you get the signal from the zip writer that everything is properly closed, you can exit from main and terminate nicely.

This might not answer your question, but I've been using similar code to generate zip archives on-the-fly for a web service some time ago. It performed quite well, even though the actual zipping was done in a single goroutine. Overcoming the IO bottleneck can already be an improvement.
From the look of it, you won't be able to parallelise the compression using the standard library archive/zip package because:

- Compression is performed by the io.Writer returned by zip.Writer.Create or CreateHeader.
- Calling Create/CreateHeader implicitly closes the writer returned by the previous call.

So passing the writers returned by Create to multiple goroutines and writing to them in parallel will not work.

If you wanted to write your own parallel zip writer, you'd probably want to structure it something like this:

1. Have multiple goroutines compress files using the compress/flate module, and keep track of the CRC32 value and length of the uncompressed data. The output should be directed to temporary files. Note the compressed size of the data.
2. Once everything has been compressed, start writing the Zip file starting with the header.
3. Write out the file header followed by the contents of the corresponding temporary file for each compressed file.
4. Write out the central directory record and end record at the end of the file. All the required information should be available at this point.

For added parallelism, step 1 could be performed in parallel with the remaining steps by using a channel to indicate when compression of each file completes.

Due to the file format, you won't be able to perform parallel compression without either storing compressed data in memory or in temporary files.
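As a rough sketch of step 1 only (my own illustration, not from the answer; the names and structure are assumptions, and error handling plus the later zip-assembly steps are omitted), each worker could deflate one file to a temporary file with compress/flate while recording the CRC32 and the uncompressed and compressed sizes the later steps would need:

package main

import (
    "compress/flate"
    "hash/crc32"
    "io"
    "os"
    "sync"
)

// result holds what the later zip-assembly steps would need for each file.
type result struct {
    name             string
    tmpPath          string // temporary file holding the deflated data
    crc              uint32 // CRC32 of the uncompressed data
    uncompressedSize int64
    compressedSize   int64
}

func compressOne(name string) (result, error) {
    src, err := os.Open(name)
    if err != nil {
        return result{}, err
    }
    defer src.Close()

    tmp, err := os.CreateTemp("", "zippart-*")
    if err != nil {
        return result{}, err
    }
    defer tmp.Close()

    crc := crc32.NewIEEE()
    fw, err := flate.NewWriter(tmp, flate.DefaultCompression)
    if err != nil {
        return result{}, err
    }
    // Tee the uncompressed bytes through the CRC while deflating into tmp.
    n, err := io.Copy(fw, io.TeeReader(src, crc))
    if err != nil {
        return result{}, err
    }
    if err := fw.Close(); err != nil {
        return result{}, err
    }
    csize, err := tmp.Seek(0, io.SeekCurrent) // bytes written so far = compressed size
    if err != nil {
        return result{}, err
    }
    return result{name, tmp.Name(), crc.Sum32(), n, csize}, nil
}

func main() {
    names := os.Args[1:]
    results := make([]result, len(names))
    var wg sync.WaitGroup
    for i, name := range names {
        wg.Add(1)
        go func(i int, name string) { // one goroutine per file, as the answer suggests
            defer wg.Done()
            r, err := compressOne(name)
            if err != nil {
                panic(err)
            }
            results[i] = r
        }(i, name)
    }
    wg.Wait()
    // Steps 2-4 (local file headers, copying the temp files, and the
    // central directory) would consume `results` and are not shown.
}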
With Go 1.17, parallel compression and merging of zip files are possible using the archive/zip package.

An example is below. In the example, I create zip workers to create individual zip files and an entry provider worker which provides entries to be added to a zip file via a channel to zip workers. Actual files can be provided to the zip workers but I skipped that part.

package main

import (
    "archive/zip"
    "context"
    "fmt"
    "io"
    "log"
    "os"
    "strings"

    "golang.org/x/sync/errgroup"
)

const numOfZipWorkers = 10

type entry struct {
    name string
    rc   io.ReadCloser
}

func main() {
    log.SetFlags(log.LstdFlags | log.Lshortfile)

    entCh := make(chan entry, numOfZipWorkers)
    zpathCh := make(chan string, numOfZipWorkers)

    group, ctx := errgroup.WithContext(context.Background())

    for i := 0; i < numOfZipWorkers; i++ {
        group.Go(func() error {
            return zipWorker(ctx, entCh, zpathCh)
        })
    }

    group.Go(func() error {
        defer close(entCh) // Signal workers to stop.
        return entryProvider(ctx, entCh)
    })

    err := group.Wait()
    if err != nil {
        log.Fatal(err)
    }

    f, err := os.OpenFile("output.zip", os.O_CREATE|os.O_TRUNC|os.O_WRONLY, 0644)
    if err != nil {
        log.Fatal(err)
    }

    zw := zip.NewWriter(f)

    close(zpathCh)
    for path := range zpathCh {
        zrd, err := zip.OpenReader(path)
        if err != nil {
            log.Fatal(err)
        }
        for _, zf := range zrd.File {
            err := zw.Copy(zf)
            if err != nil {
                log.Fatal(err)
            }
        }
        _ = zrd.Close()
        _ = os.Remove(path)
    }

    err = zw.Close()
    if err != nil {
        log.Fatal(err)
    }
    err = f.Close()
    if err != nil {
        log.Fatal(err)
    }
}

func entryProvider(ctx context.Context, entCh chan<- entry) error {
    for i := 0; i < 2*numOfZipWorkers; i++ {
        select {
        case <-ctx.Done():
            return ctx.Err()
        case entCh <- entry{
            name: fmt.Sprintf("file_%d", i+1),
            rc:   io.NopCloser(strings.NewReader(fmt.Sprintf("content %d", i+1))),
        }:
        }
    }
    return nil
}

func zipWorker(ctx context.Context, entCh <-chan entry, zpathch chan<- string) error {
    f, err := os.CreateTemp(".", "tmp-part-*")
    if err != nil {
        return err
    }

    zw := zip.NewWriter(f)

Loop:
    for {
        var (
            ent entry
            ok  bool
        )
        select {
        case <-ctx.Done():
            err = ctx.Err()
            break Loop
        case ent, ok = <-entCh:
            if !ok {
                break Loop
            }
        }

        hdr := &zip.FileHeader{
            Name:   ent.name,
            Method: zip.Deflate, // zip.Store can also be used.
        }
        hdr.SetMode(0644)

        w, e := zw.CreateHeader(hdr)
        if e != nil {
            _ = ent.rc.Close()
            err = e
            break
        }

        _, e = io.Copy(w, ent.rc)
        _ = ent.rc.Close()
        if e != nil {
            err = e
            break
        }
    }

    if e := zw.Close(); e != nil && err == nil {
        err = e
    }
    if e := f.Close(); e != nil && err == nil {
        err = e
    }

    if err == nil {
        select {
        case <-ctx.Done():
            err = ctx.Err()
        case zpathch <- f.Name():
        }
    }

    return err
}