My goal is to convert a PDF to CSV. In this case, I want to convert journal entries from a PDF file into a CSV file.
What I've tried:
Use pdftotext on Linux.
The Installation:
$ sudo apt-get install poppler-utils
The code:
package main
import (
"fmt"
"os/exec"
)
func main() {
body, err := exec.Command("pdftotext", "-layout", "-q", "-nopgbrk", "-enc", "UTF-8", "-eol", "unix", "/Volumes/T7Touch/Learn/e-statement-to-t-account/5725299769Jul2022.pdf", "-").Output()
if err != nil {
panic(err)
}
fmt.Println(string(body))
}
Does it work?
The -layout option makes the output workable. fmt.Println(string(body)) prints one journal entry per line.
row1column1 row1column2 row1column3
row2column1 row2column2 row2column3
Without the -layout option, the output is not readable.
row1column1
row2column1
row1column2
row2column2
The problem is that this solution requires installing poppler-utils to get pdftotext. Without it, the program throws an error.
$ GOOS=darwin GOARCH=arm64 go build
$ ./main // will throw errors -> panic: exec: "pdftotext": executable file not found in $PATH
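If shelling out to pdftotext remains acceptable, one way to make this failure friendlier is to look the binary up with exec.LookPath before running it and report a clear message instead of panicking. This is only a minimal sketch; the helper name findTool and the hint text are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// findTool is a hypothetical helper: it resolves an external binary on $PATH
// and wraps the lookup error with a hint about what to install.
func findTool(name, hint string) (string, error) {
	path, err := exec.LookPath(name)
	if err != nil {
		return "", fmt.Errorf("%s not found in $PATH (%s): %w", name, hint, err)
	}
	return path, nil
}

func main() {
	path, err := findTool("pdftotext", "install poppler-utils")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("using", path)
}
```

This does not remove the dependency; it only turns the panic into an actionable error.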
Use https://github.com/ledongthuc/pdf
The code:
package main
import (
"fmt"
"os"
"github.com/ledongthuc/pdf"
)
func main() {
content, err := readPdf(os.Args[1]) // Read local pdf file
if err != nil {
panic(err)
}
fmt.Println(content)
return
}
func readPdf(path string) (string, error) {
f, r, err := pdf.Open(path)
if err != nil {
return "", err
}
// Defer the close only after Open has succeeded, so f is never nil here.
defer f.Close()
totalPage := r.NumPage()
for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {
p := r.Page(pageIndex)
if p.V.IsNull() {
continue
}
rows, err := p.GetTextByRow()
if err != nil {
return "", err
}
for _, row := range rows {
println(">>>> row: ", row.Position)
for _, word := range row.Content {
fmt.Println(word.S)
}
}
}
return "", nil
}
I intend to create a PDF decoder that behaves exactly like $ pdftotext -layout without needing $ sudo apt-get install poppler-utils. I believe I can solve this if someone can point me to the pdftotext source repository or the specification documents for PDF.
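For the CSV half of the goal, here is a minimal sketch of turning layout-preserving text into CSV rows. It assumes that columns in the extracted text are separated by runs of two or more whitespace characters and that single spaces only occur inside a cell; real statements may need a fixed-width or header-based split instead. The function name layoutToCSV and the sample rows are made up for illustration.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"regexp"
	"strings"
)

// columnGap matches the run of whitespace assumed to separate two columns.
var columnGap = regexp.MustCompile(`\s{2,}`)

// layoutToCSV converts layout-preserving text (one record per line) to CSV.
// Blank lines are skipped.
func layoutToCSV(text string) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		if err := w.Write(columnGap.Split(line, -1)); err != nil {
			return "", err
		}
	}
	w.Flush()
	return sb.String(), w.Error()
}

func main() {
	// Hypothetical sample resembling pdftotext -layout output.
	sample := "2022-07-01   Opening balance   100.00\n" +
		"2022-07-02   Coffee shop        -4.50\n"
	out, err := layoutToCSV(sample)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

The same function can be fed the string returned by either extraction approach above, as long as the text keeps one record per line.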
I am trying to write a simple Go program which connects to an FTP server, list the files in a specified directory and pulls them.
The code is this:
package main
import (
"bytes"
"fmt"
"github.com/secsy/goftp"
"io/ioutil"
"log"
"os"
"path"
"time"
)
func main() {
config := goftp.Config{
User: "anonymous",
Password: "root#local.me",
ConnectionsPerHost: 21,
Timeout: 10 * time.Second,
Logger: os.Stderr,
}
// Connecting to the server
client, dailErr := goftp.DialConfig(config, "ftp.example.com")
if dailErr != nil {
log.Fatal(dailErr)
panic(dailErr)
}
// setting the search directory
dir := "/downloads/"
files, err := client.ReadDir(dir)
if err != nil {
for _, file := range files {
if file.IsDir() {
path.Join(dir, file.Name())
} else {
fmt.Println("the file is %s", file.Name())
}
}
}
// this section works , I am setting a file name and I can pull it
// if I mark the search part
ret_file := "example.PDF"
fmt.Println("Retrieving file: ", ret_file)
buf := new(bytes.Buffer)
fullPathFile := dir + ret_file
rferr := client.Retrieve(fullPathFile, buf)
if rferr != nil {
panic(rferr)
}
fmt.Println("writing data to file", ret_file)
fmt.Println("Opening file", ret_file, "for writing")
w, _ := ioutil.ReadAll(buf)
ferr := ioutil.WriteFile(ret_file, w, 0644)
if ferr != nil {
log.Fatal(ferr)
panic(ferr)
} else {
fmt.Println("Writing", ret_file, " completed")
}
}
For some reason I am getting an error on the ReadDir function.
I need to grab the files names so I can download them.
You're attempting to loop through files when ReadDir() returns an error. That will never work, as any time an error is returned files is nil.
This is pretty standard behavior and can be confirmed by reading the implementation of ReadDir().
I'm guessing you may have used the example from the project that demonstrates ReadDir() as a starting point. Within the example, the error handling is involved because it's deciding whether or not to continue walking the directory tree. However, note that when ReadDir() returns an error that doesn't result in stopping the program, the subsequent for loop is a no-op, since files is nil.
Here's a small program that demonstrates successfully using the results of ReadDir() in a straightforward manner:
package main
import (
"fmt"
"github.com/secsy/goftp"
)
const (
ftpServerURL = "ftp.us.debian.org"
ftpServerPath = "/debian/"
)
func main() {
client, err := goftp.Dial(ftpServerURL)
if err != nil {
panic(err)
}
files, err := client.ReadDir(ftpServerPath)
if err != nil {
panic(err)
}
for _, file := range files {
fmt.Println(file.Name())
}
}
It outputs (which matches the current listing at http://ftp.us.debian.org/debian/):
$ go run goftp-test.go
README
README.CD-manufacture
README.html
README.mirrors.html
README.mirrors.txt
dists
doc
extrafiles
indices
ls-lR.gz
pool
project
tools
zzz-dists
Does anyone have an example on using "github.com/gohugoio/hugo/resources/images/exif" to extract metadata from a local image using Go?
I looked through the docs and since I'm new to Go I'm not 100% sure if I'm doing things right. I do read the image, but I'm not sure what the next step would be.
fname := "image.jpg"
f, err := os.Open(fname)
if err != nil {
log.Fatal("Error: ", err)
}
(Edit 1)
Actually I think I found a solution:
d, err := exif.NewDecoder(exif.IncludeFields("File Type"))
x, err := d.Decode(f)
if err != nil {
log.Fatal("Error: ", err)
}
fmt.Println(x)
However, the question is: how do I know which fields are available? File Type, for example, returns <nil>.
Looks like Hugo uses github.com/rwcarlsen/goexif.
The documentation of the package on go.dev shows Exif.Walk can walk the name and tag for every non-nil EXIF field.
For example, a small program:
package main
import (
"fmt"
"log"
"os"
"github.com/rwcarlsen/goexif/exif"
"github.com/rwcarlsen/goexif/tiff"
)
type Printer struct{}
func (p Printer) Walk(name exif.FieldName, tag *tiff.Tag) error {
fmt.Printf("%40s: %s\n", name, tag)
return nil
}
func main() {
if len(os.Args) < 2 {
log.Fatal("please give filename as argument")
}
fname := os.Args[1]
f, err := os.Open(fname)
if err != nil {
log.Fatal(err)
}
x, err := exif.Decode(f)
if err != nil {
log.Fatal(err)
}
var p Printer
x.Walk(p)
}
Example:
$ go run main.go IMG_123.JPG
ResolutionUnit: 2
YCbCrPositioning: 2
Make: "Canon"
Model: "Canon IXUS 255 HS"
ThumbJPEGInterchangeFormat: 5620
PixelYDimension: 3000
FocalPlaneResolutionUnit: 2
GPSVersionID: [2,3,0,0]
ExifVersion: "0230"
WhiteBalance: 1
DateTime: "2016:10:04 17:27:56"
CompressedBitsPerPixel: "5/1"
... etc ...
Orientation: 1
MeteringMode: 5
FocalLength: "4300/1000"
PixelXDimension: 4000
InteroperabilityIFDPointer: 4982
FocalPlaneXResolution: "4000000/244"
XResolution: "180/1"
ComponentsConfiguration: ""
ShutterSpeedValue: "96/32"
ApertureValue: "101/32"
ExposureBiasValue: "-1/3"
FocalPlaneYResolution: "3000000/183"
SceneCaptureType: 0
Is there an easy way to check the size of a Golang project? It's not an executable, it's a package that I'm importing in my own project.
You can see how big the library binaries are by looking in the $GOPATH/pkg directory (if $GOPATH is not exported, Go defaults to $HOME/go).
For example, to check the size of some of the gorilla HTTP packages, install them first:
$ go get -u github.com/gorilla/mux
$ go get -u github.com/gorilla/securecookie
$ go get -u github.com/gorilla/sessions
The binary sizes in KB on my 64-bit macOS (darwin_amd64):
$ cd $GOPATH/pkg/darwin_amd64/github.com/gorilla/
$ du -k *
284 mux.a
128 securecookie.a
128 sessions.a
EDIT:
Library (package) size is one thing, but how much space that takes up in your executable after the link stage can vary wildly. This is because packages have their own dependencies and with that comes extra baggage, but that baggage may be shared by other packages you import.
An example demonstrates this best:
empty.go:
package main
func main() {}
http.go:
package main
import "net/http"
var _ = http.Serve
func main() {}
mux.go:
package main
import "github.com/gorilla/mux"
var _ = mux.NewRouter
func main() {}
All 3 programs are functionally identical - executing zero user code - but their dependencies differ. The resulting binary sizes in KB:
$ du -k *
1028 empty
5812 http
5832 mux
What does this tell us? The core go pkg net/http adds significant size to our executable. The mux pkg is not large by itself, but it has an import dependency on net/http pkg - hence the significant file size for it too. Yet the delta between mux and http is only 20KB, whereas the listed file size of the mux.a library is 284KB. So we can't simply add the library pkg sizes to determine their true footprint.
Conclusion:
The Go linker will strip out a lot of baggage from individual libraries during the build process, but to get a true sense of how much extra weight importing a package adds, one has to look at all of the package's sub-dependencies as well.
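To see those sub-dependencies, one option is to ask the go tool for the transitive import list. The sketch below shells out to go list -deps and counts the packages it reports; it assumes the go command is on $PATH and that the import path can be resolved from the current directory. The helper name parseDeps is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// parseDeps splits the output of `go list -deps` into one import path per
// entry, skipping blank lines.
func parseDeps(out string) []string {
	var deps []string
	for _, line := range strings.Split(out, "\n") {
		if line = strings.TrimSpace(line); line != "" {
			deps = append(deps, line)
		}
	}
	return deps
}

func main() {
	pkg := "net/http" // default used for illustration; pass another path as an argument
	if len(os.Args) > 1 {
		pkg = os.Args[1]
	}
	out, err := exec.Command("go", "list", "-deps", pkg).Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "go list failed:", err)
		os.Exit(1)
	}
	deps := parseDeps(string(out))
	// The listing includes the named package itself.
	fmt.Printf("%s pulls in %d packages (including itself)\n", pkg, len(deps))
}
```

The count is only a rough indicator of weight: it says how many packages get linked in, not how many bytes survive dead-code elimination.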
Here is another solution that makes use of https://pkg.go.dev/golang.org/x/tools/go/packages
I took the example provided by the author, and slightly updated it with the demonstration binary available here.
package main
import (
"flag"
"fmt"
"log"
"os"
"sort"
"golang.org/x/tools/go/packages"
)
func main() {
flag.Parse()
// Many tools pass their command-line arguments (after any flags)
// uninterpreted to packages.Load so that it can interpret them
// according to the conventions of the underlying build system.
cfg := &packages.Config{Mode: packages.NeedFiles |
packages.NeedSyntax |
packages.NeedImports,
}
pkgs, err := packages.Load(cfg, flag.Args()...)
if err != nil {
fmt.Fprintf(os.Stderr, "load: %v\n", err)
os.Exit(1)
}
if packages.PrintErrors(pkgs) > 0 {
os.Exit(1)
}
// Sum the sizes of the source files
// for each package listed on the command line.
var size int64
for _, pkg := range pkgs {
for _, file := range pkg.GoFiles {
s, err := os.Stat(file)
if err != nil {
log.Println(err)
continue
}
size += s.Size()
}
}
fmt.Printf("size of %v is %v b\n", pkgs[0].ID, size)
size = 0
for _, pkg := range allPkgs(pkgs) {
for _, file := range pkg.GoFiles {
s, err := os.Stat(file)
if err != nil {
log.Println(err)
continue
}
size += s.Size()
}
}
fmt.Printf("size of %v and deps is %v b\n", pkgs[0].ID, size)
}
func allPkgs(lpkgs []*packages.Package) []*packages.Package {
var all []*packages.Package // postorder
seen := make(map[*packages.Package]bool)
var visit func(*packages.Package)
visit = func(lpkg *packages.Package) {
if !seen[lpkg] {
seen[lpkg] = true
// visit imports
var importPaths []string
for path := range lpkg.Imports {
importPaths = append(importPaths, path)
}
sort.Strings(importPaths) // for determinism
for _, path := range importPaths {
visit(lpkg.Imports[path])
}
all = append(all, lpkg)
}
}
for _, lpkg := range lpkgs {
visit(lpkg)
}
return all
}
You can download all the imported modules with go mod vendor, then count the lines of all the .go files that aren't test files:
package main
import (
"bytes"
"fmt"
"io/fs"
"os"
"os/exec"
"path/filepath"
"strings"
)
func count(mod string) int {
imp := fmt.Sprintf("package main\nimport _ %q", mod)
os.WriteFile("size.go", []byte(imp), os.ModePerm)
exec.Command("go", "mod", "init", "size").Run()
exec.Command("go", "mod", "vendor").Run()
var count int
filepath.WalkDir("vendor", func(s string, d fs.DirEntry, err error) error {
if strings.HasSuffix(s, ".go") && !strings.HasSuffix(s, "_test.go") {
data, err := os.ReadFile(s)
if err != nil {
return err
}
count += bytes.Count(data, []byte{'\n'})
}
return nil
})
return count
}
func main() {
println(count("github.com/klauspost/compress/zstd"))
}
I'm new to Golang, starting out with some examples. Currently, what I'm trying to do is read a file line by line and replace a line with another string if it meets a certain condition.
The file used for testing purposes contains four lines:
one
two
three
four
The code working on that file looks like this:
func main() {
file, err := os.OpenFile("test.txt", os.O_RDWR, 0666)
if err != nil {
panic(err)
}
reader := bufio.NewReader(file)
for {
fmt.Print("Try to read ...\n")
pos,_ := file.Seek(0, 1)
log.Printf("Position in file is: %d", pos)
bytes, _, _ := reader.ReadLine()
if (len(bytes) == 0) {
break
}
lineString := string(bytes)
if(lineString == "two") {
file.Seek(int64(-(len(lineString))), 1)
file.WriteString("This is a test.")
}
fmt.Printf(lineString + "\n")
}
file.Close()
}
As you can see in the code snippet, I want to replace the string "two" with "This is a test" as soon as this string is read from the file.
In order to get the current position within the file I use Go's Seek method.
However, what happens is that the last line always gets replaced by This is a test, making the file look like this:
one
two
three
This is a test
Examining the output of the print statement which writes the current file position to the terminal, I get that kind of output after the first line has been read:
2016/12/28 21:10:31 Try to read ...
2016/12/28 21:10:31 Position in file is: 19
So after the first read, the position cursor already points to the end of my file, which explains why the new string gets appended to the end. Does anyone understand what is happening here or rather what is causing that behavior?
The reader is not controlled by file.Seek. You declared the reader as reader := bufio.NewReader(file) and then read one line at a time with bytes, _, _ := reader.ReadLine(). However, file.Seek does not change the position the reader is reading from: bufio.Reader reads ahead into its own buffer, so the file offset has already advanced past the line you just got back.
I suggest reading about ReadSeeker in the docs and switching over to that. There is also an example using SectionReader.
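The read-ahead can be observed without touching a real file. In this sketch a strings.Reader stands in for the *os.File (an assumption made so the example is self-contained), and the function name positions is hypothetical. After one line is read through the bufio.Reader, the underlying source is already at offset 19, the same number the question's log shows, while the reader's logical position is only 4.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// positions reads one line through a bufio.Reader and reports the line,
// the offset of the underlying source, and the reader's logical offset
// (underlying offset minus the bytes still sitting in the buffer).
func positions(content string) (line string, underlying, logical int64, err error) {
	src := strings.NewReader(content) // stands in for the *os.File
	br := bufio.NewReader(src)
	line, err = br.ReadString('\n')
	if err != nil {
		return "", 0, 0, err
	}
	underlying, err = src.Seek(0, io.SeekCurrent)
	if err != nil {
		return "", 0, 0, err
	}
	return line, underlying, underlying - int64(br.Buffered()), nil
}

func main() {
	line, underlying, logical, err := positions("one\ntwo\nthree\nfour\n")
	if err != nil {
		panic(err)
	}
	fmt.Printf("line=%q underlying=%d logical=%d\n", line, underlying, logical)
}
```

Because the whole 19-byte input fits in the default buffer, the first read drains the source completely, which is why a later write through the file handle lands at the end.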
Aside from the incorrect seek usage, the difficulty is that the line you're replacing isn't the same length as the replacement. The standard approach is to create a new (temporary) file with the modifications. Assuming that is successful, replace the original file with the new one.
package main
import (
"bufio"
"io"
"io/ioutil"
"log"
"os"
)
func main() {
// file we're modifying
name := "test.txt"
// open original file
f, err := os.Open(name)
if err != nil {
log.Fatal(err)
}
defer f.Close()
// create temp file
tmp, err := ioutil.TempFile("", "replace-*")
if err != nil {
log.Fatal(err)
}
defer tmp.Close()
// replace while copying from f to tmp
if err := replace(f, tmp); err != nil {
log.Fatal(err)
}
// make sure the tmp file was successfully written to
if err := tmp.Close(); err != nil {
log.Fatal(err)
}
// close the file we're reading from
if err := f.Close(); err != nil {
log.Fatal(err)
}
// overwrite the original file with the temp file
if err := os.Rename(tmp.Name(), name); err != nil {
log.Fatal(err)
}
}
func replace(r io.Reader, w io.Writer) error {
// use scanner to read line by line
sc := bufio.NewScanner(r)
for sc.Scan() {
line := sc.Text()
if line == "two" {
line = "This is a test."
}
if _, err := io.WriteString(w, line+"\n"); err != nil {
return err
}
}
return sc.Err()
}
For more complex replacements, I've implemented a package which can replace regular expression matches. https://github.com/icholy/replace
import (
"io"
"regexp"
"github.com/icholy/replace"
"golang.org/x/text/transform"
)
func replace2(r io.Reader, w io.Writer) error {
// compile multi-line regular expression
re := regexp.MustCompile(`(?m)^two$`)
// create replace transformer
tr := replace.RegexpString(re, "This is a test.")
// copy while transforming
_, err := io.Copy(w, transform.NewReader(r, tr))
return err
}
The os package has an Expand function, which I believe can be used to solve a similar problem.
Explanation:
file.txt
one
two
${num}
four
main.go
package main
import (
"fmt"
"os"
)
var FILENAME = "file.txt"
func main() {
file, err := os.ReadFile(FILENAME)
if err != nil {
panic(err)
}
mapper := func(placeholderName string) string {
switch placeholderName {
case "num":
return "three"
}
return ""
}
fmt.Println(os.Expand(string(file), mapper))
}
output
one
two
three
four
Additionally, you may create a config (YAML or JSON) and load it into a map that serves as a lookup table from placeholders to their replacement strings, then modify the mapper to look up the placeholders found in the input file in this table.
For example, the map and mapper will look like this:
table := map[string]string{
"num": "three",
}
mapper := func(placeholderName string) string {
if val, ok := table[placeholderName]; ok {
return val
}
return ""
}
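Putting the table and the mapper together gives a small runnable sketch. The helper name expandWith is hypothetical, and the table is hard-coded here rather than loaded from a config file. Placeholders that are missing from the table expand to the empty string, which matches the mapper above.

```go
package main

import (
	"fmt"
	"os"
)

// expandWith replaces ${name} and $name placeholders in s using table.
// Placeholders missing from the table expand to the empty string.
func expandWith(table map[string]string, s string) string {
	return os.Expand(s, func(name string) string {
		return table[name]
	})
}

func main() {
	table := map[string]string{
		"num": "three",
	}
	fmt.Println(expandWith(table, "one\ntwo\n${num}\nfour"))
}
```

Note that os.Expand treats every $ in the input as the start of a placeholder, so input files containing literal dollar signs would need different handling.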
References:
os.Expand documentation: https://pkg.go.dev/os#Expand
Playground
Looking at the latest release (1.2) zip package - how can I unzip a file that was password protected (using 7zip, AES-256 encoding)? I don't see where/how to add in that information. A simple example would be great!
The archive/zip package seems to only provide basic zip functionality.
I would use 7zip to unzip password protected zip files using the os/exec package.
Online 7-zip user guide
The best guide for understanding 7zip is 7-zip.chm, which is in the zip file for the windows command line.
The following code isn't optimal but it shows you how to get the job done.
Code for extracting a password protected zip using 7zip
func extractZipWithPassword() {
fmt.Printf("Unzipping `%s` to directory `%s`\n", zip_path, extract_path)
commandString := fmt.Sprintf(`7za e %s -o%s -p"%s" -aoa`, zip_path, extract_path, zip_password)
commandSlice := strings.Fields(commandString)
fmt.Println(commandString)
c := exec.Command(commandSlice[0], commandSlice[1:]...)
e := c.Run()
checkError(e)
}
Example Program
// Shows how to extract a password-encrypted zip file using 7zip.
// By Larry Battle <https://github.com/LarryBattle>
// Answer to http://stackoverflow.com/questions/20330210/golang-1-2-unzip-password-protected-zip-file
// 7-zip.chm - http://sevenzip.sourceforge.jp/chm/cmdline/switches/index.htm
// Effective Golang - http://golang.org/doc/effective_go.html
package main
import (
"fmt"
"os"
"os/exec"
"path/filepath"
"strings"
)
var (
txt_content = "Sample file created."
txt_filename = "name.txt"
zip_filename = "sample.zip"
zip_password = "42"
zip_encryptType = "AES256"
base_path = "./"
test_path = filepath.Join(base_path, "test")
src_path = filepath.Join(test_path, "src")
extract_path = filepath.Join(test_path, "extracted")
extracted_txt_path = filepath.Join(extract_path, txt_filename)
txt_path = filepath.Join(src_path, txt_filename)
zip_path = filepath.Join(src_path, zip_filename)
)
var txt_fileSize int64
func checkError(e error) {
if e != nil {
panic(e)
}
}
func setupTestDir() {
fmt.Printf("Removing `%s`\n", test_path)
var e error
os.RemoveAll(test_path) // RemoveAll also removes a non-empty directory left by a previous run
fmt.Printf("Creating `%s`,`%s`\n", extract_path, src_path)
e = os.MkdirAll(src_path, os.ModeDir|os.ModePerm)
checkError(e)
e = os.MkdirAll(extract_path, os.ModeDir|os.ModePerm)
checkError(e)
}
func createSampleFile() {
fmt.Println("Creating", txt_path)
file, e := os.Create(txt_path)
checkError(e)
defer file.Close()
_, e = file.WriteString(txt_content)
checkError(e)
fi, e := file.Stat()
checkError(e)
txt_fileSize = fi.Size()
}
func createZipWithPassword() {
fmt.Println("Creating", zip_path)
commandString := fmt.Sprintf(`7za a %s %s -p"%s" -mem=%s`, zip_path, txt_path, zip_password, zip_encryptType)
commandSlice := strings.Fields(commandString)
fmt.Println(commandString)
c := exec.Command(commandSlice[0], commandSlice[1:]...)
e := c.Run()
checkError(e)
}
func extractZipWithPassword() {
fmt.Printf("Unzipping `%s` to directory `%s`\n", zip_path, extract_path)
commandString := fmt.Sprintf(`7za e %s -o%s -p"%s" -aoa`, zip_path, extract_path, zip_password)
commandSlice := strings.Fields(commandString)
fmt.Println(commandString)
c := exec.Command(commandSlice[0], commandSlice[1:]...)
e := c.Run()
checkError(e)
}
func checkFor7Zip() {
_, e := exec.LookPath("7za")
if e != nil {
fmt.Println("Make sure 7zip is installed and included in your PATH.")
}
checkError(e)
}
func checkExtractedFile() {
fmt.Println("Reading", extracted_txt_path)
file, e := os.Open(extracted_txt_path)
checkError(e)
defer file.Close()
buf := make([]byte, txt_fileSize)
n, e := file.Read(buf)
checkError(e)
if !strings.Contains(string(buf[:n]), strings.Fields(txt_content)[0]) {
panic(fmt.Sprintf("File`%s` is corrupted.\n", extracted_txt_path))
}
}
func main() {
fmt.Println("# Setup")
checkFor7Zip()
setupTestDir()
createSampleFile()
createZipWithPassword()
fmt.Println("# Answer to question...")
extractZipWithPassword()
checkExtractedFile()
fmt.Println("Done.")
}
Output
# Setup
Removing `test`
Creating `test/extracted`,`test/src`
Creating test/src/name.txt
Creating test/src/sample.zip
7za a test/src/sample.zip test/src/name.txt -p"42" -mem=AES256
# Answer to question...
Unzipping `test/src/sample.zip` to directory `test/extracted`
7za e test/src/sample.zip -otest/extracted -p"42" -aoa
Reading test/extracted/name.txt
Done.
https://github.com/yeka/zip provides functionality to extract password protected zip file (AES & Zip Standard Encryption aka ZipCrypto).
Below is an example of how to use it:
package main
import (
"os"
"io"
"github.com/yeka/zip"
)
func main() {
file := "file.zip"
password := "password"
r, err := zip.OpenReader(file)
if nil != err {
panic(err)
}
defer r.Close()
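for _, f := range r.File {
f.SetPassword(password)
// Open the entry to get a reader over its decrypted contents.
rc, err := f.Open()
if nil != err {
panic(err)
}
w, err := os.Create(f.Name)
if nil != err {
panic(err)
}
io.Copy(w, rc)
w.Close()
rc.Close()
}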
}
The work is a fork of https://github.com/alexmullins/zip, which adds support for AES only.
If anyone else runs into the extraction failing with a password error, try removing the quotes around the password. In my case they were being passed through literally by Go, which caused the extraction to fail.
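The reason is that exec.Command does not go through a shell, so quote characters in an argument reach 7za as part of the password. Here is a minimal sketch of building the arguments as a slice instead of formatting one string and splitting it; the helper name extractArgs is hypothetical, and the flags are the ones used in the answer above.

```go
package main

import "fmt"

// extractArgs builds the argument list for `7za e` one element at a time,
// so nothing needs quoting and paths containing spaces stay in one argument.
func extractArgs(zipPath, outDir, password string) []string {
	return []string{
		"e", zipPath,
		"-o" + outDir,
		"-p" + password, // no quotes: exec.Command does not invoke a shell
		"-aoa",
	}
}

func main() {
	args := extractArgs("test/src/sample.zip", "test/extracted", "42")
	fmt.Printf("%q\n", args)
	// To run it: exec.Command("7za", args...).Run()
}
```

An archive created with the quoted form has the quotes as part of its password, so create and extract need to use the same convention.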