I am trying to write a program that finds duplicate files based on their MD5 checksums.
I am not sure whether I am missing something, but when this function reads the Xcode installer app (about 8 GB), it uses 16 GB of RAM.
func search() {
	unique := make(map[string]string)
	files, err := ioutil.ReadDir(".")
	if err != nil {
		log.Println(err)
	}
	for _, file := range files {
		fileName := file.Name()
		fmt.Println("CHECKING:", fileName)
		fi, err := os.Stat(fileName)
		if err != nil {
			fmt.Println(err)
			continue
		}
		if fi.Mode().IsRegular() {
			// Reads the whole file into memory at once.
			data, err := ioutil.ReadFile(fileName)
			if err != nil {
				fmt.Println(err)
				continue
			}
			sum := md5.Sum(data)
			hexDigest := hex.EncodeToString(sum[:])
			if _, ok := unique[hexDigest]; !ok {
				unique[hexDigest] = fileName
			} else {
				fmt.Println("DUPLICATE:", fileName)
			}
		}
	}
}
As per my debugging the issue is with the file reading
Is there a better approach to do that?
thanks
There is an example in the Go documentation for crypto/md5 that covers your case.
package main
import (
	"crypto/md5"
	"fmt"
	"io"
	"log"
	"os"
)
func main() {
	f, err := os.Open("file.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	h := md5.New()
	if _, err := io.Copy(h, f); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%x", h.Sum(nil))
}
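For your case, close each file explicitly inside the loop rather than deferring the Close, because deferred calls only run when the surrounding function returns. Alternatively, move the per-file logic into its own function.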
Sounds like the 16GB RAM is your problem, not speed per se.
Don't read the entire file into a variable with ReadFile; instead, io.Copy from the Reader that os.Open gives you to the Writer that crypto/md5 provides (md5.New returns a hash.Hash, which embeds an io.Writer). That copies only a small chunk at a time instead of pulling the whole file into RAM.
This is a trick useful in a lot of places in Go; packages like text/template, compress/gzip, net/http, etc. work in terms of Readers and Writers. With them, you don't usually need to create huge []bytes or strings; you can hook I/O interfaces up to each other and let them pass around pieces of content for you. In a garbage collected language, saving memory tends to save you CPU work as well.
I'm confused about the reassignment of the err variable for errors in Go.
For example, I tend to be doing this:
err1 := Something()
checkErr(err1)
str, err2 := SomethingElse()
checkErr(err2)
err3 := SomethingAgain()
checkErr(err3)
But I'm always losing track of this and have millions of useless err variables floating around that I don't need, and it makes the code messy and confusing.
But then if I do this:
err := Something()
checkErr(err)
str, err := SomethingElse()
checkErr(err)
err := SomethingAgain()
checkErr(err)
...it gets angry and says err is already assigned.
But if I do this:
var err error
err = Something()
checkErr(err)
str, err = SomethingElse()
checkErr(err)
err = SomethingAgain()
checkErr(err)
...it doesn't work because str needs to be assigned with :=
Am I missing something?
You're almost there. On the left side of := there must be at least one newly declared variable. In your second snippet, err := SomethingAgain() fails because err already exists in that scope and nothing new is declared; that line needs a plain =. When at least one name on the left is new (such as str), := declares the new name and simply assigns to the existing err. So this works, for example:
package main
import "fmt"
func foo() (string, error) {
	return "Bar", nil
}
func main() {
	var err error
	s1, err := foo()
	s2, err := foo()
	fmt.Println(s1, s2, err)
}
or in your case:
//we declare it first
var err error
//this is a normal assignment
err = Something()
checkErr(err)
// here, the compiler knows that only str is a newly declared variable
str, err := SomethingElse()
checkErr(err)
// and again...
err = SomethingAgain()
checkErr(err)
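For completeness, here is a runnable sketch of the same pattern with a single err reused throughout. The three lowercase functions are stand-ins I made up for Something, SomethingElse and SomethingAgain; explicit if checks replace checkErr so the example is self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for Something, SomethingElse and SomethingAgain.
func something() error               { return nil }
func somethingElse() (string, error) { return "value", nil }
func somethingAgain() error          { return errors.New("third call failed") }

func run() (string, error) {
	err := something() // err is new in this scope, so := is fine
	if err != nil {
		return "", err
	}
	str, err := somethingElse() // str is new; err is reused, not redeclared
	if err != nil {
		return "", err
	}
	err = somethingAgain() // nothing new on the left, so plain = is required
	return str, err
}

func main() {
	str, err := run()
	fmt.Println(str, err)
}
```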
Consider the following Go code fragment:
cmd := exec.Command(program, arg0)
stdin, err := cmd.StdinPipe()
// produces error when b is too large
n, err := stdin.Write(b.Bytes())
Whenever b is too large, Write() returns an error. Having experimented with different sizes of b, it seems this occurs whenever b is longer than the Linux pipe buffer size. Is there a way around this? Essentially, I need to feed large log files via stdin to an external script.
I wrote this program to test your code:
package main
import (
	"fmt"
	"os/exec"
)
func main() {
	cmd := exec.Command("/bin/cat")
	in, _ := cmd.StdinPipe()
	cmd.Start()
	for i := 1024 * 1024; ; i += 1024 * 1024 {
		b := make([]byte, i)
		n, err := in.Write(b)
		fmt.Printf("%d: %v\n", n, err)
		if err != nil {
			cmd.Process.Kill()
			return
		}
	}
}
The only way this program gives an error is if the called process closes stdin. Does the program you call close stdin? This might be a bug in the Go runtime.
I am new to Go, and I wrote a program to test the io package:
func main() {
	readers := []io.Reader{
		strings.NewReader("from string reader"),
		bytes.NewBufferString("from bytes reader"),
	}
	reader := io.MultiReader(readers...)
	data := make([]byte, 1024)
	var err error
	//var n int
	for err != io.EOF {
		n, err := reader.Read(data)
		fmt.Printf("%s\n", data[:n])
	}
	os.Exit(0)
}
The compile error is "err declared and not used". But I think I have used err in the for statement. Why does the compiler output this error?
The err declared inside the for loop shadows the err declared outside it, and the inner one is never used. This happens because the short variable declaration (the := operator) declares new n and err variables scoped to the loop body. The loop condition still refers to the outer err, which is never assigned.
Is there a way to clean up this (IMO) horrific-looking code?
aJson, err1 := json.Marshal(a)
bJson, err2 := json.Marshal(b)
cJson, err3 := json.Marshal(c)
dJson, err4 := json.Marshal(d)
eJson, err5 := json.Marshal(e)
fJson, err6 := json.Marshal(f)
gJson, err7 := json.Marshal(g)
if err1 != nil {
	return err1
} else if err2 != nil {
	return err2
} else if err3 != nil {
	return err3
} else if err4 != nil {
	return err4
} else if err5 != nil {
	return err5
} else if err6 != nil {
	return err6
} else if err7 != nil {
	return err7
}
Specifically, I'm talking about the error handling. It would be nice to be able to handle all the errors in one go.
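var err error
var aJson, bJson, cJson, dJson, eJson, fJson, gJson []byte
// marshal stores the encoding of src in *dest and reports whether it succeeded.
// (It is not named f, because f is already one of the values to encode.)
marshal := func(dest *[]byte, src interface{}) bool {
	*dest, err = json.Marshal(src)
	return err == nil
}
// && short-circuits, so marshaling stops at the first failure.
_ = marshal(&aJson, a) &&
	marshal(&bJson, b) &&
	marshal(&cJson, c) &&
	marshal(&dJson, d) &&
	marshal(&eJson, e) &&
	marshal(&fJson, f) &&
	marshal(&gJson, g)
return err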
Put the results in a slice instead of separate variables, put the initial values in another slice to iterate over, and return during the iteration if there's an error.
var result [][]byte
for _, item := range []interface{}{a, b, c, d, e, f, g} {
	res, err := json.Marshal(item)
	if err != nil {
		return err
	}
	result = append(result, res)
}
You could even reuse an array instead of having two slices.
var values, err = [...]interface{}{a, b, c, d, e, f, g}, error(nil)
for i, item := range values {
	if values[i], err = json.Marshal(item); err != nil {
		return err
	}
}
Of course, this'll require a type assertion to use the results.
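To show what that type assertion looks like in practice, here is a small sketch. It uses a slice rather than an array so the logic can live in a helper (marshalInPlace, a name invented here), and the sample values are placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// marshalInPlace (hypothetical helper) replaces each element of values with
// its JSON encoding. Afterwards every element holds a []byte, but the static
// type of the elements is still interface{}.
func marshalInPlace(values []interface{}) error {
	for i, item := range values {
		b, err := json.Marshal(item)
		if err != nil {
			return err
		}
		values[i] = b
	}
	return nil
}

func main() {
	values := []interface{}{42, "go", []int{1, 2}}
	if err := marshalInPlace(values); err != nil {
		log.Fatal(err)
	}
	for _, v := range values {
		b, ok := v.([]byte) // the type assertion recovers the concrete []byte
		if !ok {
			log.Fatalf("unexpected type %T", v)
		}
		fmt.Println(string(b))
	}
}
```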
Define a function.
func marshalMany(vals ...interface{}) ([][]byte, error) {
	out := make([][]byte, 0, len(vals))
	for i := range vals {
		b, err := json.Marshal(vals[i])
		if err != nil {
			return nil, err
		}
		out = append(out, b)
	}
	return out, nil
}
You didn't say anything about how you'd like your error handling to work. Fail one, fail all? First to fail? Collect successes or toss them?
I believe the other answers here are correct for your specific problem, but more generally, panic can be used to shorten error handling while still being a well-behaved library (i.e., not panicking across package boundaries).
Consider:
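func mustMarshal(v interface{}) []byte {
	bs, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return bs
}
func encodeAll() (err error) {
	defer func() {
		if r := recover(); r != nil {
			var ok bool
			// Turn a panic that carries an error into the return value.
			if err, ok = r.(error); ok {
				return
			}
			// Anything else is not ours to handle; re-panic.
			panic(r)
		}
	}()
	ea := mustMarshal(a)
	eb := mustMarshal(b)
	ec := mustMarshal(c)
	_, _, _ = ea, eb, ec // use the encoded values here
	return nil
}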
This code uses mustMarshal to panic whenever there is a problem marshaling a value. But the encodeAll function will recover from the panic and return it as a normal error value. The client in this case is never exposed to the panic.
But this comes with a warning: using this approach everywhere is not idiomatic. It can also be worse since it doesn't lend itself well to handling each individual error specially, but more or less treating each error the same. But it has its uses when there are tons of errors to handle. As an example, I use this kind of approach in a web application, where a top-level handler can catch different kinds of errors and display them appropriately to the user (or a log file) depending on the kind of error.
It makes for terser code when there is a lot of error handling, but at the cost of idiomatic Go and of handling each error specially. Another downside is that it could prevent something that should panic from actually panicking. (But this can be trivially solved by using your own error type.)
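A minimal sketch of the "own error type" idea follows. The wrapper type marshalPanic and the variadic signature of encodeAll are assumptions made for this example; the point is only that recover swallows panics of that one private type and re-panics on everything else.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// marshalPanic is a private wrapper (hypothetical), so the recover below can
// tell our own panics apart from unrelated ones such as a nil dereference.
type marshalPanic struct{ err error }

func mustMarshal(v interface{}) []byte {
	bs, err := json.Marshal(v)
	if err != nil {
		panic(marshalPanic{err})
	}
	return bs
}

// encodeAll returns the first marshaling failure as a normal error value.
func encodeAll(vals ...interface{}) (out [][]byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			mp, ok := r.(marshalPanic)
			if !ok {
				panic(r) // not ours: let it keep propagating
			}
			out, err = nil, mp.err
		}
	}()
	for _, v := range vals {
		out = append(out, mustMarshal(v))
	}
	return out, nil
}

func main() {
	out, err := encodeAll(1, "two", []int{3})
	fmt.Println(len(out), err)

	_, err = encodeAll(1, make(chan int)) // channels cannot be marshaled
	fmt.Println(err)
}
```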
You can use go-multierror by Hashicorp.
var merr error
if err := step1(); err != nil {
	merr = multierror.Append(merr, err)
}
if err := step2(); err != nil {
	merr = multierror.Append(merr, err)
}
return merr
You can create a reusable function to handle multiple errors. This implementation returns only the first non-nil error, but you could return every error message combined by modifying the following code:
func hasError(errs ...error) error {
	for _, err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}
aJson, err := json.Marshal(a)
bJson, err1 := json.Marshal(b)
cJson, err2 := json.Marshal(c)
// firstErr avoids shadowing the built-in error type.
if firstErr := hasError(err, err1, err2); firstErr != nil {
	return firstErr
}
Another perspective on this is, instead of asking "how" to handle the abhorrent verbosity, whether we actually "should". This advice is heavily dependent on context, so be careful.
In order to decide whether handling the json.Marshal error is worth it, we can inspect its implementation to see when errors are returned. To return errors to the caller while keeping the code terse, json.Marshal uses panic and recover internally in a manner akin to exceptions. It defines an internal helper method which, when called, panics with the given error value. By looking at each call of this method, we learn that json.Marshal returns an error in the following scenarios:
calling MarshalJSON or MarshalText on a value/field of a type which implements json.Marshaler or encoding.TextMarshaler returns an error—in other words, a custom marshaling method fails;
the input is/contains a cyclic (self-referencing) structure;
the input is/contains a value of an unsupported type (complex, chan, func);
the input is/contains a floating-point number which is NaN or Infinity (these are not allowed by the spec, see section 2.4);
the input is/contains a json.Number string that is an incorrect number representation (for example, "foo" instead of "123").
Now, a usual scenario for marshaling data is creating an API response. In that case, you will almost certainly have data types and values that satisfy all of the marshaler's constraints, given that the server itself generates them. When user-provided input is used, the data should be validated beforehand anyway, so it should still not cause issues with the marshaler. Furthermore, apart from the custom marshaler errors, all the other errors occur at runtime only because Go's type system cannot enforce the required conditions by itself. With all these points given, here comes the question: given our control over the data types and values, do we need to handle json.Marshal's error at all?
Probably no. For a type like
type Person struct {
	Name string
	Age  int
}
it is now obvious that json.Marshal cannot fail. It is trickier when the type looks like
type Foo struct {
	Data any
}
(any is a new Go 1.18 alias for interface{}) because there is no compile-time guarantee that Foo.Data will hold a value of a valid type—but I'd still argue that if Foo is meant to be serialized as a response, Foo.Data will also be serializable. Infinity or NaN floats remain an issue, but, given the JSON standard limitation, if you want to serialize these two special values you cannot use JSON numbers anyway, so you'll have to look for another solution, which means that you'll end up avoiding the error anyway.
To conclude, my point is that you can probably do:
aJson, _ := json.Marshal(a)
bJson, _ := json.Marshal(b)
cJson, _ := json.Marshal(c)
dJson, _ := json.Marshal(d)
eJson, _ := json.Marshal(e)
fJson, _ := json.Marshal(f)
gJson, _ := json.Marshal(g)
and live fine with it. If you want to be pedantic, you can use a helper such as:
func must[T any](v T, err error) T {
	if err != nil {
		panic(err)
	}
	return v
}
(note the Go 1.18 generics usage) and do
aJson := must(json.Marshal(a))
bJson := must(json.Marshal(b))
cJson := must(json.Marshal(c))
dJson := must(json.Marshal(d))
eJson := must(json.Marshal(e))
fJson := must(json.Marshal(f))
gJson := must(json.Marshal(g))
This will work nicely when you have something like an HTTP server, where each request is wrapped in a middleware that recovers from panics and responds to the client with status 500. It's also where you would care about these unexpected errors: when you don't want the program/service to crash at all. For one-time scripts you'll probably want to have the operation halted and a stack trace dumped.
If you're unsure how your types will change in the future, you don't trust your tests, the data may not be fully under your control, the codebase is too big to trace the data, or anything else makes you uncertain about the correctness of your data, it is better to handle the error. Pay attention to the context you're in!
P.S.: Pragmatically ignoring errors is worth considering in general. For example, the Write* methods on bytes.Buffer and strings.Builder never return a non-nil error; fmt.Fprintf, with a valid format string and a writer that doesn't return errors, also returns no errors; and neither does bufio.Writer, as long as the underlying writer doesn't return any. You will find that some types implement interfaces whose methods return errors, yet never actually return one. In these cases, if you know the concrete type, handling errors is unnecessarily verbose and redundant. What do you prefer,
var sb strings.Builder
if _, err := sb.WriteString("hello "); err != nil {
	return err
}
if _, err := sb.WriteString("world!"); err != nil {
	return err
}
or
var sb strings.Builder
sb.WriteString("hello ")
sb.WriteString("world!")
(of course, ignoring that it could be a single WriteString call)?
The given examples write to an in-memory buffer, which cannot fail unless the machine runs out of memory, a condition you cannot handle in Go anyway. Other such situations will surface in your code, and blindly handling errors there adds little to no value. Caution is key: if an implementation changes and starts returning errors, you may be in trouble. The standard library and well-established packages are good candidates for eliding error checks where possible.