How to edit a reader in Go

I'm trying to work out what the best practice is to change some data in a stream without ioutil.ReadAll.
I need to remove lines beginning with a certain character and strip all instances of another.
package main
import (
"bufio"
"bytes"
"fmt"
"os"
"gopkg.in/pg.v3"
)
func main() {
fieldSep := "\x01"
badChar := "\x02"
comment := "#"
dbName := "foo"
db := pg.Connect(&pg.Options{})
file, err := os.Open("/path/to/file")
if err != nil {
fmt.Fprintf(os.Stderr, "ERROR: %s\n", err)
}
defer file.Close()
// I need to iterate my file Reader here
// all lines that begin with comment and remove them
scanner := bufio.NewScanner(file)
for scanner.Scan() {
file := bytes.TrimRight(file, comment)
}
// all instances of badChar should be dropped
file := bytes.Trim(file, badChar)
_, err = db.CopyFrom(file, fmt.Sprintf("COPY %s FROM STDIN WITH DELIMITER e'%s'", dbName, fieldSep))
if err != nil {
fmt.Fprintf(os.Stderr, "ERROR: %s\n", err)
}
err = db.Close()
if err != nil {
fmt.Fprintf(os.Stderr, "ERROR: %s\n", err)
}
fmt.Println("Import Done")
}
Context:
I'm importing a large amount (>10GB) of data into a database; it's spread across several files.
My database interface accepts a reader to load the data.
The data has non-standard line endings and I need to strip comments (because PG's COPY FROM is no fun).
I know the code I've got to edit the stream is woeful; I just can't find a good reference - thanks!

If I were in your position, I'd make my own Reader and insert it between the source and the destination. That's what consistent interfaces are for. Your reader would work easily on the small chunks of data as they flow past.
Source (io.Reader)   ==>   Your filter (io.Reader)   ==>   Destination (expects an io.Reader)
(provides the data)        (does the transformations)      (rock'n'rolls)
A library example of such a reader, made to be inserted between a reader and its client, is bufio.Reader: it speeds up many kinds of readers by making larger buffered calls to the source while letting the client consume the data in small bits if it likes. You can check out its source: http://golang.org/src/bufio/bufio.go
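Here is a minimal sketch of such a filter for your case, assuming the comment prefix and bad character from your snippet (imports: bufio, bytes, io). Rather than hand-rolling a Read method, it uses io.Pipe and a goroutine around bufio.Scanner, which gives you back an io.Reader you can hand straight to CopyFrom:

    // filterReader returns an io.Reader that drops lines starting with comment
    // and strips every occurrence of badChar from the remaining lines.
    func filterReader(src io.Reader, comment, badChar []byte) io.Reader {
        pr, pw := io.Pipe()
        go func() {
            scanner := bufio.NewScanner(src)
            for scanner.Scan() {
                line := scanner.Bytes()
                if bytes.HasPrefix(line, comment) {
                    continue // skip comment lines entirely
                }
                line = bytes.ReplaceAll(line, badChar, nil)
                if _, err := pw.Write(append(line, '\n')); err != nil {
                    return // the reading side was closed
                }
            }
            pw.CloseWithError(scanner.Err()) // a nil error closes the pipe cleanly
        }()
        return pr
    }

You would then call, for example, db.CopyFrom(filterReader(file, []byte(comment), []byte(badChar)), ...). Note that bufio.Scanner has a default maximum line length of 64KB; raise it with scanner.Buffer if your data can exceed that.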

Related

Getting `panic: os: invalid use of WriteAt on file opened with O_APPEND`

I am a newbie to Go. I was starting to write my first program, in which I have to download a bunch of CSVs from AWS. I don't understand why it gives me the error in the title when I use O_APPEND mode. If I remove os.O_APPEND, I only get the last file's data, which is not the objective.
The objective is to download all CSV files into one file locally. I'd like to understand what I'm doing incorrectly.
package main
import (
"fmt"
"os"
"path/filepath"
"github.com/aws/aws-sdk-go/aws"
"github.com/aws/aws-sdk-go/aws/credentials"
"github.com/aws/aws-sdk-go/aws/session"
"github.com/aws/aws-sdk-go/service/s3"
"github.com/aws/aws-sdk-go/service/s3/s3manager"
)
const (
AccessKeyId = "xxxxxxxxx"
SecretAccessKey = "xxxxxxxxxxxxxxxxxxxx"
Region = "eu-central-1"
Bucket = "dexter-reports"
bucketKey = "Jenkins/pluginVersions/"
)
func main() {
// Load the Shared AWS Configuration
os.Setenv("AWS_ACCESS_KEY_ID", AccessKeyId)
os.Setenv("AWS_SECRET_ACCESS_KEY", SecretAccessKey)
filename := "JenkinsPluginDetais.txt"
cred := credentials.NewStaticCredentials(AccessKeyId, SecretAccessKey, "")
config := aws.Config{Credentials: cred, Region: aws.String(Region), Endpoint: aws.String("s3.amazonaws.com")}
file, err := os.OpenFile(filename, os.O_APPEND|os.O_WRONLY|os.O_CREATE, 0666)
if err != nil {
panic(err)
}
defer file.Close()
sess, err := session.NewSession(&config)
if err != nil {
fmt.Println(err)
}
//list Buckets
ObjectList := listBucketObjects(sess)
//loop over the obectlist. First initialize the s3 downloader via s3manager
downloader := s3manager.NewDownloader(sess)
for _, item := range ObjectList.Contents {
csvFile := filepath.Base(*item.Key)
if csvFile != "pluginVersions" {
downloadBucketObjects(downloader, file, csvFile)
}
}
}
func listBucketObjects(sess *session.Session) *s3.ListObjectsV2Output {
//create a new s3 client
svc := s3.New(sess)
resp, err := svc.ListObjectsV2(&s3.ListObjectsV2Input{
Bucket: aws.String(Bucket),
Prefix: aws.String(bucketKey),
})
if err != nil {
panic(err)
}
return resp
}
func downloadBucketObjects(downloader *s3manager.Downloader, file *os.File, keyobj string) {
fileToDownload := bucketKey + keyobj
numBytes, err := downloader.Download(file,
&s3.GetObjectInput{
Bucket: aws.String(Bucket),
Key: aws.String(fileToDownload),
})
if err != nil {
panic(err)
}
fmt.Println("Downloaded", file.Name(), numBytes, "bytes")
}
Firstly, I don't see why you even need the os.O_APPEND flag in the first place. As far as I can tell, you can omit os.O_APPEND.
Now, let's come to the actual problem of why it's happening:
Doc for O_APPEND (Ref: https://man7.org/linux/man-pages/man2/open.2.html):
O_APPEND
The file is opened in append mode. Before each write(2),
the file offset is positioned at the end of the file, as
if with lseek(2). The modification of the file offset and
the write operation are performed as a single atomic step.
So for every call to write the file offset is positioned at the end of the file.
But (*s3manager.Downloader).Download appears to use the WriteAt method, i.e.,
Doc for WriteAt:
$ go doc os WriteAt
package os // import "os"
func (f *File) WriteAt(b []byte, off int64) (n int, err error)
WriteAt writes len(b) bytes to the File starting at byte offset off. It
returns the number of bytes written and an error, if any. WriteAt returns a
non-nil error when n != len(b).
If file was opened with the O_APPEND flag, WriteAt returns an error.
Notice the last line: if the file is opened with the O_APPEND flag, WriteAt returns an error. That is reasonable, because WriteAt's second argument is an explicit offset, and mixing O_APPEND's behaviour with offset-based writes would produce unexpected results, so the call simply errors out.
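You can see the restriction in isolation with a few lines, independent of the S3 code (demo.txt is just a throwaway name):

    f, err := os.OpenFile("demo.txt", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    if _, err := f.WriteAt([]byte("x"), 0); err != nil {
        // should print: os: invalid use of WriteAt on file opened with O_APPEND
        fmt.Println(err)
    }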
Consider the definition of s3manager.Downloader:
func (d Downloader) Download(w io.WriterAt, input *s3.GetObjectInput, options ...func(*Downloader)) (n int64, err error)
The first argument is an io.WriterAt; this interface is:
type WriterAt interface {
WriteAt(p []byte, off int64) (n int, err error)
}
This means that the Download function is going to call the WriteAt method on the File you are passing it. As per the documentation for File.WriteAt:
If file was opened with the O_APPEND flag, WriteAt returns an error.
So this explains why you are getting the error but raises the question "why is Download using WriteAt and not accepting an io.Writer (and calling Write)?"; the answer can be found in the documentation:
The w io.WriterAt can be satisfied by an os.File to do multipart concurrent downloads, or in memory []byte wrapper using aws.WriteAtBuffer
So, to increase performance, Downloader might make multiple simultaneous requests for parts of the file and then write these out as they are received (meaning it may not write the data in order). This also explains why calling the function multiple times with the same File results in overwritten data (when Downloader retrieves each chunk of the file, it writes it out at the appropriate position in the output file; this overwrites any data already there).
The above quote from the documentation also points to a possible solution; use an aws.WriteAtBuffer and, once the download is finished, write the data to your file (which could then be opened with O_APPEND) - something like this:
buf := aws.NewWriteAtBuffer([]byte{})
numBytes, err := downloader.Download(buf,
&s3.GetObjectInput{
Bucket: aws.String(Bucket),
Key: aws.String(fileToDownload),
})
if err != nil {
panic(err)
}
_, err = file.Write(buf.Bytes())
if err != nil {
panic(err)
}
An alternative would be to download into a temporary file and then append that to your output file (you may need to do this if the files are large).
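A rough sketch of that temp-file variant, under the same assumptions as the question's code (Bucket, fileToDownload, downloader and the already-open output file); it needs the io import, and os.CreateTemp requires Go 1.16+ (use ioutil.TempFile on older versions):

    tmp, err := os.CreateTemp("", "s3chunk-*.csv")
    if err != nil {
        panic(err)
    }
    defer os.Remove(tmp.Name())
    defer tmp.Close()

    // download one object into the temp file (WriteAt is fine here)
    if _, err := downloader.Download(tmp, &s3.GetObjectInput{
        Bucket: aws.String(Bucket),
        Key:    aws.String(fileToDownload),
    }); err != nil {
        panic(err)
    }

    // rewind and append the temp file's contents to the output file
    if _, err := tmp.Seek(0, io.SeekStart); err != nil {
        panic(err)
    }
    if _, err := io.Copy(file, tmp); err != nil {
        panic(err)
    }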

image.Decode() unknown format

I have an image that is stored in the filesystem. This file should be decoded to an image and then resized. I know how to resize it, but I can't decode the image. Whatever image path/image I use in the program, the result is: image: unknown format.
I've already read every page I could find about this problem, but none of them helped me. This code represents my simplified program logic (I'd like to understand why this error occurs). Thanks in advance for your attention!
package main
import (
"bufio"
"fmt"
"image"
"image/png"
_ "image/jpeg"
_ "image/png"
"log"
"os"
)
func main() {
file, err := os.Open(`D:\photos\img.png`)
if err != nil {
log.Fatal(err)
}
defer file.Close()
config, format, err := image.DecodeConfig(bufio.NewReader(file))
if err != nil {
log.Fatal(err)
}
fmt.Println(format, config.Height, config.Width, config.ColorModel)
decodedImg, format, err := image.Decode(bufio.NewReader(file)) // ERROR HERE
if err != nil {
log.Fatal(err)
}
fmt.Println(format,"decode")
outputFile, err := os.Create(`D:\photos\image.png`)
if err != nil {
log.Fatal(err)
}
defer outputFile.Close()
png.Encode(outputFile, decodedImg)
}
Output:
png 512 512 &{0x4ae340}
2020/07/11 09:37:10 image: unknown format
Both image.Decode and image.DecodeConfig consume the bytes from the passed-in io.Reader.
This means that after DecodeConfig is done, the position in the file is after the bytes already read. image.Decode then comes along with the same underlying file, expects to find the image header, but doesn't.
bufio.NewReader does not reset the position to the beginning of the file (because it can't, it only knows the underlying object is an io.Reader).
You have a few solutions (in order of personal preference):
seek back to the beginning of the file before calling image.Decode, e.g. newOffset, err := file.Seek(0, 0) (see the sketch below)
don't use image.DecodeConfig (this might not be an option)
read the file into a []byte and use a bytes.Buffer
open the file again (not particularly efficient)
As a side note, you don't need to wrap the os.File object in a bufio.Reader, it already implements the io.Reader interface.
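A minimal sketch of the first option, reusing the variables from the question (and, per the side note above, passing the *os.File directly); it needs io added to the imports, or use file.Seek(0, 0), which is equivalent:

    config, format, err := image.DecodeConfig(file)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(format, config.Height, config.Width, config.ColorModel)

    // rewind so image.Decode sees the image header again
    if _, err := file.Seek(0, io.SeekStart); err != nil {
        log.Fatal(err)
    }

    decodedImg, _, err := image.Decode(file)
    if err != nil {
        log.Fatal(err)
    }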

Golang reading from serial

I'm trying to read from a serial port (a GPS device on a Raspberry Pi).
Following the instructions from http://www.modmypi.com/blog/raspberry-pi-gps-hat-and-python
I can read from shell using
stty -F /dev/ttyAMA0 raw 9600 cs8 clocal -cstopb
cat /dev/ttyAMA0
I get well formatted output
$GNGLL,5133.35213,N,00108.27278,W,160345.00,A,A*65
$GNRMC,160346.00,A,5153.35209,N,00108.27286,W,0.237,,290418,,,A*75
$GNVTG,,T,,M,0.237,N,0.439,K,A*35
$GNGGA,160346.00,5153.35209,N,00108.27286,W,1,12,0.67,81.5,M,46.9,M,,*6C
$GNGSA,A,3,29,25,31,20,26,23,21,16,05,27,,,1.11,0.67,0.89*10
$GNGSA,A,3,68,73,83,74,84,75,85,67,,,,,1.11,0.67,0.89*1D
$GPGSV,4,1,15,04,,,34,05,14,040,21,09,07,330,,16,45,298,34*40
$GPGSV,4,2,15,20,14,127,18,21,59,154,30,23,07,295,26,25,13,123,22*74
$GPGSV,4,3,15,26,76,281,40,27,15,255,20,29,40,068,19,31,34,199,33*7C
$GPGSV,4,4,15,33,29,198,,36,23,141,,49,30,172,*4C
$GLGSV,3,1,11,66,00,325,,67,13,011,20,68,09,062,16,73,12,156,21*60
$GLGSV,3,2,11,74,62,177,20,75,53,312,36,76,08,328,,83,17,046,25*69
$GLGSV,3,3,11,84,75,032,22,85,44,233,32,,,,35*62
$GNGLL,5153.35209,N,00108.27286,W,160346.00,A,A*6C
$GNRMC,160347.00,A,5153.35205,N,00108.27292,W,0.216,,290418,,,A*7E
$GNVTG,,T,,M,0.216,N,0.401,K,A*3D
$GNGGA,160347.00,5153.35205,N,00108.27292,W,1,12,0.67,81.7,M,46.9,M,,*66
$GNGSA,A,3,29,25,31,20,26,23,21,16,05,27,,,1.11,0.67,0.89*10
$GNGSA,A,3,68,73,83,74,84,75,85,67,,,,,1.11,0.67,0.89*1D
$GPGSV,4,1,15,04,,,34,05,14,040,21,09,07,330,,16,45,298,34*40
(I've put some random data in)
I'm trying to read this in Go. Currently, I have
package main
import "fmt"
import "log"
import "github.com/tarm/serial"
func main() {
config := &serial.Config{
Name: "/dev/ttyAMA0",
Baud: 9600,
ReadTimeout: 1,
Size: 8,
}
stream, err := serial.OpenPort(config)
if err != nil {
log.Fatal(err)
}
buf := make([]byte, 1024)
for {
n, err := stream.Read(buf)
if err != nil {
log.Fatal(err)
}
s := string(buf[:n])
fmt.Println(s)
}
}
But this prints malformed data. I suspect that this is due to the buffer size or the value of Size in the config struct being wrong, but I'm not sure how to get those values from the stty settings.
Looking back, I think the issue is that I'm getting a stream and I want to be able to iterate over lines of the output, rather than over chunks. This is how the stream comes out:
$GLGSV,3
,1,09,69
,10,017,
,70,43,0
69,,71,3
2,135,27
,76,23,2
32,22*6F
$GLGSV
,3,2,09,
77,35,30
0,21,78,
11,347,,
85,31,08
1,30,86,
72,355,3
6*6C
$G
LGSV,3,3
,09,87,2
4,285,30
*59
$GN
GLL,5153
.34919,N
,00108.2
7603,W,1
92901.00
,A,A*6A
The struct you get back from serial.OpenPort() contains a pointer to an open os.File corresponding to the opened serial port connection. When you Read() from this, the library calls Read() on the underlying os.File.
The documentation for this function call is:
Read reads up to len(b) bytes from the File. It returns the number of bytes read and any error encountered. At end of file, Read returns 0, io.EOF.
This means you have to keep track of how much data was read. You also have to keep track of whether there were newlines, if this is important to you. Unfortunately, the underlying *os.File is not exported, so you'll find it difficult to use tricks like bufio.ReadLine(). It may be worth modifying the library and sending a pull request.
As Matthew Rankin noted in a comment, Port implements io.ReadWriter so you can simply use bufio to read by lines.
stream, err := serial.OpenPort(config)
if err != nil {
log.Fatal(err)
}
scanner := bufio.NewScanner(stream)
for scanner.Scan() {
fmt.Println(scanner.Text()) // Println will add back the final '\n'
}
if err := scanner.Err(); err != nil {
log.Fatal(err)
}
Change
fmt.Println(s)
to
fmt.Print(s)
and you will probably get what you want.
Or did I misunderstand the question?
Two additions to Michael Hampton's answer which can be useful:
line endings
You might receive data that is not newline-separated text. bufio.Scanner uses ScanLines by default to split the received data into lines - but you can also write your own line splitter based on the default function's signature and set it for the scanner:
scanner := bufio.NewScanner(stream)
scanner.Split(ownLineSplitter) // set custom line splitter function
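For illustration, ownLineSplitter could look something like the sketch below; the name and the choice of '\r' as terminator are just assumptions for NMEA-style data, but the signature is exactly bufio.SplitFunc (it also needs the bytes import):

    // ownLineSplitter treats '\r' as the line terminator.
    func ownLineSplitter(data []byte, atEOF bool) (advance int, token []byte, err error) {
        if atEOF && len(data) == 0 {
            return 0, nil, nil
        }
        if i := bytes.IndexByte(data, '\r'); i >= 0 {
            // a full terminated line: consume it plus the terminator, return the line
            return i + 1, data[:i], nil
        }
        if atEOF {
            // final, unterminated line
            return len(data), data, nil
        }
        // request more data
        return 0, nil, nil
    }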
reader shutdown
You might not receive a constant stream but only some packets of bytes from time to time. If no bytes arrive at the port, the scanner will block and you can't just kill it. You'll have to close the stream to do so, effectively raising an error. To not block any outer loops and handle errors appropriately, you can wrap the scanner in a goroutine that takes a context. If the context was cancelled, ignore the error, otherwise forward the error. In principle, this can look like
var errChan = make(chan error)
var dataChan = make(chan []byte)
ctx, cancelPortScanner := context.WithCancel(context.Background())
go func(ctx context.Context) {
scanner := bufio.NewScanner(stream)
for scanner.Scan() { // will terminate if connection is closed
dataChan <- scanner.Bytes()
}
// if execution reaches this point, something went wrong or stream was closed
select {
case <-ctx.Done():
return // ctx was cancelled, just return without error
default:
errChan <- scanner.Err() // ctx wasn't cancelled, forward error
}
}(ctx)
// handle data from dataChan, error from errChan
To stop the scanner, you would cancel the context and close the connection:
cancelPortScanner()
stream.Close()

Is there a faster alternative to ioutil.ReadFile?

I am trying to make a program for checking file duplicates based on md5 checksum.
Not really sure whether I am missing something or not, but this function, when reading the Xcode installer app (which is about 8GB), uses 16GB of RAM.
func search() {
unique := make(map[string]string)
files, err := ioutil.ReadDir(".")
if err != nil {
log.Println(err)
}
for _, file := range files {
fileName := file.Name()
fmt.Println("CHECKING:", fileName)
fi, err := os.Stat(fileName)
if err != nil {
fmt.Println(err)
continue
}
if fi.Mode().IsRegular() {
data, err := ioutil.ReadFile(fileName)
if err != nil {
fmt.Println(err)
continue
}
sum := md5.Sum(data)
hexDigest := hex.EncodeToString(sum[:])
if _, ok := unique[hexDigest]; ok == false {
unique[hexDigest] = fileName
} else {
fmt.Println("DUPLICATE:", fileName)
}
}
}
}
As per my debugging, the issue is with the file reading.
Is there a better approach to do that?
thanks
There is an example in the Golang documentation, which covers your case.
package main
import (
"crypto/md5"
"fmt"
"io"
"log"
"os"
)
func main() {
f, err := os.Open("file.txt")
if err != nil {
log.Fatal(err)
}
defer f.Close()
h := md5.New()
if _, err := io.Copy(h, f); err != nil {
log.Fatal(err)
}
fmt.Printf("%x", h.Sum(nil))
}
For your case, just make sure to close the files in the loop and not defer them. Or put the logic into a function.
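As a sketch of the "put the logic into a function" suggestion, a helper like the hypothetical md5OfFile below keeps the deferred Close scoped to one file per call (add io to your imports):

    func md5OfFile(name string) (string, error) {
        f, err := os.Open(name)
        if err != nil {
            return "", err
        }
        defer f.Close()

        h := md5.New()
        // stream the file through the hash instead of loading it all into memory
        if _, err := io.Copy(h, f); err != nil {
            return "", err
        }
        return hex.EncodeToString(h.Sum(nil)), nil
    }

In the search loop you would then replace the ioutil.ReadFile/md5.Sum pair with hexDigest, err := md5OfFile(fileName).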
Sounds like the 16GB RAM is your problem, not speed per se.
Don't read the entire file into a variable with ReadFile; io.Copy from the Reader that Open gives you to the Writer that hash/md5 provides (md5.New returns a hash.Hash, which embeds an io.Writer). That only copies a little bit at a time instead of pulling all of the file into RAM.
This is a trick useful in a lot of places in Go; packages like text/template, compress/gzip, net/http, etc. work in terms of Readers and Writers. With them, you don't usually need to create huge []bytes or strings; you can hook I/O interfaces up to each other and let them pass around pieces of content for you. In a garbage collected language, saving memory tends to save you CPU work as well.
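As a small illustration of that style, here is a sketch that streams a file through compress/gzip into another file without ever holding the whole thing in memory (the file names are placeholders):

    src, err := os.Open("input.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer src.Close()

    dst, err := os.Create("input.txt.gz")
    if err != nil {
        log.Fatal(err)
    }
    defer dst.Close()

    zw := gzip.NewWriter(dst)
    if _, err := io.Copy(zw, src); err != nil { // copies in small chunks
        log.Fatal(err)
    }
    if err := zw.Close(); err != nil { // flush and write the gzip footer
        log.Fatal(err)
    }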

Getting IP addresses from big nfcapd binary files

I need to get information about source IPs and destination IPs from an nfcapd binary file. The problem is the file's size. I know that it is not desirable to open and read very large (more than 1 GB) files with the io or os package.
Here is my hacky draft start:
package main
import (
"fmt"
"time"
"os"
"github.com/tehmaze/netflow/netflow5"
"log"
"io"
"bytes"
)
type Message interface {}
func main() {
startTime := time.Now()
getFile := os.Args[1]
processFile(getFile)
endTime := time.Since(startTime)
log.Printf("Program executes in %s", endTime)
}
func processFile(fileName string) {
file, err := os.Open(fileName)
// Check if file is not empty. If it is, then exit from program
if err != nil {
fmt.Println(err)
os.Exit(1)
}
// Useful to close file after getting information about it
defer file.Close()
Read(file)
}
func Read(r io.Reader) (Message, error) {
data := [2]byte{}
if _, err := r.Read(data[:]); err != nil {
return nil, err
}
buffer := bytes.NewBuffer(data[:])
mr := io.MultiReader(buffer, r)
return netflow5.Read(mr)
}
I want to split the file into chunks of 24 flows and process them concurrently after reading them with the netflow package. But I can't see how to do that without losing any data when splitting.
Please correct me if I missed something in the code or the description. I have spent a lot of time searching for a solution on the web and thinking about other possible implementations.
Any help and/or advice will be highly appreciated.
File has the following properties (command file -I <file_name> in terminal):
file_name: application/octet-stream; charset=binary
The output of file after command nfdump -r <file_name> has this structure:
Date first seen Duration Proto Src IP Addr:Port Dst IP Addr:Port Packets Bytes Flows
Each property is in its own column.
UPDATE 1:
Unfortunately, it is impossible to parse the file with the netflow package due to differences in the binary file structure after it is saved to disk via nfcapd. This answer was given by one of the nfdump contributors.
The only way for now is to run nfdump from the terminal inside the Go program, as pynfdump does.
Another possible solution in the future is to use gopacket.
IO is almost always going to be the limiting factor when parsing a file, and unless there is heavy computation involved, reading a single file serially is going to be the fastest way to process it.
Wrap the file in a bufio.Reader and give it to the Read function:
file, err := os.Open(fileName)
if err != nil {
log.Fatal(err)
}
defer file.Close()
packet, err := netflow5.Read(bufio.NewReader(file))
Once it's parsed, you can then split up the records if you need to handle the chunks separately.
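If the per-record work turns out to be heavy, you can still fan the parsed records out to a few goroutines afterwards. A rough sketch (needs the sync import); it deliberately leaves the record type abstract because the exact field names depend on the netflow package:

    jobs := make(chan interface{})
    var wg sync.WaitGroup
    for w := 0; w < 4; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for rec := range jobs {
                _ = rec // extract source/destination IPs from the record here
            }
        }()
    }
    for _, rec := range records { // records stands in for whatever netflow5.Read returned
        jobs <- rec
    }
    close(jobs)
    wg.Wait()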
