Golang: reading a large number of integers using bufio.NewScanner

I tried to read the input (many numbers separated by spaces) and convert it to a slice.
The number of numbers is up to 300,000.
I got an error and googled it; there seems to be a problem with the buffer size.
So I wrote the code below:
func ChangeToInt(input string) []int {
    var nums []int
    for _, word := range strings.Fields(input) {
        num, _ := strconv.Atoi(word)
        nums = append(nums, num)
    }
    return nums
}
scanner := bufio.NewScanner(os.Stdin)
maxCapacity := 4 * 300000
buf := make([]byte, maxCapacity)
scanner.Buffer(buf, maxCapacity)
scanner.Scan()
input := scanner.Text()
nums := ChangeToInt(input)
But it's still not working. What's the problem?

You are using bufio.Scanner to read your input. By default bufio.Scanner reads lines, using an internal buffer to store each line, and a line may have a max length of bufio.MaxScanTokenSize, which is 64 KB. If your lines are longer than this, you'll get an error.
The internal buffer size may be changed / increased using the Scanner.Buffer() method, but if your input is a space-separated list of numbers, I'd advise changing the split function of the Scanner instead.
As mentioned earlier, by default the scanner splits the input into lines. Instead, change it to split the input into words. The bufio package has a ready-made split function for that: bufio.ScanWords. Use it like this:
scanner := bufio.NewScanner(os.Stdin)
scanner.Split(bufio.ScanWords)
Now scanner.Text() will return the words (numbers in your case) instead of complete lines, so the default 64 KB limit now applies to individual words, not lines. Your numbers will certainly be shorter than 64 KB.
Also check whether scanning succeeded by calling scanner.Err().
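For reference, a minimal sketch of the whole read loop using this approach (the 300,000 cap is taken from the question and used only to preallocate the slice):
package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
)

func main() {
    scanner := bufio.NewScanner(os.Stdin)
    scanner.Split(bufio.ScanWords) // each Scan() now yields one space-separated token

    nums := make([]int, 0, 300000) // preallocate for the expected input size
    for scanner.Scan() {
        num, err := strconv.Atoi(scanner.Text())
        if err != nil {
            continue // skip tokens that are not valid integers
        }
        nums = append(nums, num)
    }
    if err := scanner.Err(); err != nil {
        fmt.Fprintln(os.Stderr, "scan error:", err)
    }
    fmt.Println("read", len(nums), "numbers")
}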

You can also use bufio.NewReader and its ReadString('\n') method. It works well for large input data.
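A sketch of that approach, assuming the numbers arrive on a single newline-terminated line (ChangeToInt is the function from the question):
package main

import (
    "bufio"
    "fmt"
    "io"
    "os"
)

func main() {
    reader := bufio.NewReader(os.Stdin)
    // ReadString grows its buffer as needed, so there is no 64 KB token limit.
    line, err := reader.ReadString('\n')
    if err != nil && err != io.EOF {
        panic(err) // io.EOF alone just means the input had no trailing newline
    }
    nums := ChangeToInt(line) // strings.Fields ignores the trailing newline
    fmt.Println("read", len(nums), "numbers")
}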

Related

Is There a Scanner Function in Go That Separates on Length (or is Newline Agnostic)?

I have two types of files in Go, which can be represented by the following strings:
const nonewline = "hello"        // content but no newline
const newline = "hello\nworld"   // content with newline
My goal is just to read all the content from both files (it's coming in via a stream, so I cannot use something built-in like ReadAll; I'm using stdioPipe) and include newlines where they appear.
I'm using Scanner, but it APPEARS that there's no way to tell whether a line terminates with a newline, and if I use Scanner.Text() it auto-splits (making it impossible to tell whether a line ends in a newline or the line just terminated at the end of the file).
I've also looked at writing a custom split function, but isn't that overkill? I just need to split on some fixed length (I assume the default buffer size, 4096), or whatever is left in the file, whichever is shorter.
I've also looked at Scanner.Split(bufio.ScanBytes), but is there a speed-up from chunking the read?
Anyhow, this seems like something that should be really straightforward.
Use this loop to read a stream in fixed-size chunks:
chunk := make([]byte, size) // size is the chunk size
for {
    n, err := io.ReadFull(stream, chunk)
    if n > 0 {
        // Do something with the chunk of data.
        process(chunk[:n])
    }
    if err != nil {
        break
    }
}
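Here's that loop run against an in-memory stream, as a runnable sketch; note that io.ReadFull reports a short final chunk with io.ErrUnexpectedEOF, which is why the data is processed before the error check:
package main

import (
    "fmt"
    "io"
    "strings"
)

func main() {
    stream := strings.NewReader("hello\nworld") // stand-in for the real stream
    chunk := make([]byte, 4)                    // tiny chunk size, just for the demo
    for {
        n, err := io.ReadFull(stream, chunk)
        if n > 0 {
            fmt.Printf("%q\n", chunk[:n]) // newlines pass through untouched
        }
        if err != nil {
            break // io.EOF, or io.ErrUnexpectedEOF for the short last chunk
        }
    }
}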

How is this code generating memory aligned slices?

I'm trying to do direct i/o on linux, so I need to create memory aligned buffers. I copied some code to do it, but I don't understand how it works:
package main

import (
    "fmt"
    "golang.org/x/sys/unix"
    "unsafe"
)

const (
    AlignSize = 4096
    BlockSize = 4096
)

// Looks like dark magic
func Alignment(block []byte, AlignSize int) int {
    return int(uintptr(unsafe.Pointer(&block[0])) & uintptr(AlignSize-1))
}

func main() {
    path := "/path/to/file.txt"
    fd, err := unix.Open(path, unix.O_RDONLY|unix.O_DIRECT, 0666)
    if err != nil {
        panic(err)
    }
    defer unix.Close(fd) // deferred after the error check, so we never close an invalid fd

    file := make([]byte, 4096*2)
    a := Alignment(file, AlignSize)
    offset := 0
    if a != 0 {
        offset = AlignSize - a
    }
    file = file[offset : offset+BlockSize]
    _, readErr := unix.Pread(fd, file, 0)
    if readErr != nil {
        panic(readErr)
    }
    fmt.Println(a, offset, offset+BlockSize, len(file))
    fmt.Println("Content is: ", string(file))
}
I understand that I'm allocating a slice twice as big as what I need, and then extracting a memory-aligned block from it, but the Alignment function doesn't make sense to me.
How does the Alignment function work?
If I try to fmt.Println the intermediate steps of that function I get different results. Why? I guess because observing it changes its memory alignment (like in quantum physics :D).
Edit:
Example with fmt.Println, where I don't need any more alignment:
package main

import (
    "fmt"
    "golang.org/x/sys/unix"
)

func main() {
    path := "/path/to/file.txt"
    fd, err := unix.Open(path, unix.O_RDONLY|unix.O_DIRECT, 0666)
    if err != nil {
        panic(err)
    }
    defer unix.Close(fd)

    file := make([]byte, 4096)
    fmt.Println("Pointer: ", &file[0])
    n, readErr := unix.Pread(fd, file, 0)
    fmt.Println("Return is: ", n)
    if readErr != nil {
        panic(readErr)
    }
    fmt.Println("Content is: ", string(file))
}
Your AlignSize has a value of a power of 2. In binary representation it contains a single 1 bit followed by zeros:
fmt.Printf("%b", AlignSize) // 1000000000000
A slice allocated by make() may have a memory address that is more or less random, consisting of ones and zeros following randomly in binary; or more precisely, the starting address of its backing array is.
Since you allocate twice the required size, it is guaranteed that the backing array covers an address space that contains an address somewhere in the middle which ends with as many zeros as AlignSize's binary representation, and which has BlockSize room in the array starting at it. We want to find this address.
This is what the Alignment() function does. It gets the starting address of the backing array with &block[0]. In Go there's no pointer arithmetic, so in order to do something like that, we have to convert the pointer to an integer (there is integer arithmetic, of course). In order to do that, we have to convert the pointer to unsafe.Pointer: all pointers are convertible to this type, and unsafe.Pointer can be converted to uintptr (which is an unsigned integer large enough to store the uninterpreted bits of a pointer value), on which, being an integer, we can perform integer arithmetic.
We use bitwise AND with the value uintptr(AlignSize-1). Since AlignSize is a power of 2 (a single 1 bit followed by zeros), AlignSize-1 has a binary representation full of ones: as many as AlignSize has trailing zeros. See this example:
x := 0b1010101110101010101
fmt.Printf("AlignSize : %22b\n", AlignSize)
fmt.Printf("AlignSize-1 : %22b\n", AlignSize-1)
fmt.Printf("x : %22b\n", x)
fmt.Printf("result of & : %22b\n", x&(AlignSize-1))
Output:
AlignSize : 1000000000000
AlignSize-1 : 111111111111
x : 1010101110101010101
result of & : 110101010101
So the result of & is how far the start address is past the previous multiple of AlignSize; if you subtract it from AlignSize, you get the offset to the next address that has as many trailing zeros as AlignSize itself: that address is "aligned" to a multiple of AlignSize.
So we will use the part of the file slice starting at offset, and we only need BlockSize:
file = file[offset : offset+BlockSize]
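The same technique as a self-contained sketch, without the direct-I/O part (the printed misalignment and offset will vary from run to run):
package main

import (
    "fmt"
    "unsafe"
)

const (
    AlignSize = 4096
    BlockSize = 4096
)

// Alignment reports how far block's backing array starts past
// the previous multiple of AlignSize.
func Alignment(block []byte, AlignSize int) int {
    return int(uintptr(unsafe.Pointer(&block[0])) & uintptr(AlignSize-1))
}

func main() {
    buf := make([]byte, BlockSize*2) // twice the size guarantees an aligned block inside
    a := Alignment(buf, AlignSize)
    offset := 0
    if a != 0 {
        offset = AlignSize - a
    }
    aligned := buf[offset : offset+BlockSize]
    fmt.Printf("misalignment=%d offset=%d addr=%p\n", a, offset, &aligned[0])
}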
Edit:
Looking at your modified code that tries to print the steps, I get output like this:
Pointer: 0xc0000b6000
Unsafe pointer: 0xc0000b6000
Unsafe pointer, uintptr: 824634466304
Unpersand: 0
Cast to int: 0
Return is: 0
Content is:
Note that nothing changed here. It's simply that the fmt package prints pointer values using hexadecimal representation, prefixed by 0x, while uintptr values are printed as integers, using decimal representation. Those values are equal:
fmt.Println(0xc0000b6000, 824634466304) // output: 824634466304 824634466304
Also note the rest is 0 because in my case 0xc0000b6000 is already a multiple of 4096; in binary it is 1100000000000000000010110110000000000000.
Edit #2:
When you use fmt.Println() to debug parts of the calculation, that may change the escape analysis and thus the allocation of the slice (from stack to heap). This depends on the Go version used, too. Do not rely on your slice being allocated at an address that is (already) aligned to AlignSize.
See related questions for more details:
Mix print and fmt.Println and stack growing
why struct arrays comparing has different result
Addresses of slices of empty structs

Is it a good idea to use a list of *bufio.Scanner for files to be read later in golang?

I have a list of delimited files to be read after I obtain their paths. Instead of saving the paths as strings, I'm wondering whether I can simply store a list of *bufio.Scanner values, so those will be much easier to read later (and the code will be cleaner, too). Here is a quick example:
func main() {
    scannerList := read(filenameList)
    dowork(scannerList)
}
func read(filenameList []string) (scannerList []*bufio.Scanner) {
    for _, filename := range filenameList {
        op, _ := os.Open(filename)
        defer op.Close() // note: these defers run when read() returns, before dowork() scans anything
        scanner := bufio.NewScanner(op)
        scannerList = append(scannerList, scanner)
    }
    return
}
func dowork(scannerList []*bufio.Scanner) {
    for _, scanner := range scannerList {
        for scanner.Scan() {
            //read stuff
        }
        //do stuff
    }
}
My code, similar to the example above, compiles, but I don't know whether this is recommended (or even works). Any comments? Thanks!
A Scanner is a complicated structure, and one that embeds a buffer. The buffer can grow dynamically (depending on what the split function requests) up to 64 KB (MaxScanTokenSize).
So in general it is not a good idea to keep redundant Scanners around, as their buffers cannot be released until the Scanners themselves are discarded. But perhaps a few extra kilobytes of memory don't matter much in your case.
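An alternative that avoids holding the buffers (and the open files) is to keep just the paths and create each Scanner on demand; a hedged sketch, with the imports as in the question:
func doworkLazily(filenameList []string) error {
    for _, filename := range filenameList {
        op, err := os.Open(filename)
        if err != nil {
            return err
        }
        scanner := bufio.NewScanner(op)
        for scanner.Scan() {
            // read stuff
        }
        op.Close() // close each file as soon as it has been scanned
        if err := scanner.Err(); err != nil {
            return err
        }
    }
    return nil
}
This also sidesteps the deferred Close() calls in read() above, which run when read() returns and would close every file before dowork() gets a chance to scan it.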

Golang: high CPU usage on a simple webserver, unable to understand why

So I have a simple net/http webserver. All it does is deliver 100 MB of random bytes, which I intend to use for network speed testing. My handler for the 100 MB endpoint is really simple (pasted below). The code works fine and I get my random byte file; the problem is that when I run this and someone downloads those 100 megabytes, the CPU usage of this program shoots up to 150% and stays there until the handler finishes running. Am I doing something very wrong here? What could I do to improve this handler's performance?
func downloadHandler(w http.ResponseWriter, r *http.Request) {
    str := RandStringBytes(8192) // generates 8192 bytes of randomness
    sz := 1000 * 1000 * 100      // 100 megabytes
    iter := sz/len(str) + 1
    w.Header().Set("Content-Type", "application/octet-stream")
    w.Header().Set("Content-Length", strconv.Itoa(sz))
    for i := 0; i < iter; i++ {
        fmt.Fprintf(w, str)
    }
}
The problem is that fmt.Fprintf() expects a format string:
func Fprintf(w io.Writer, format string, a ...interface{}) (n int, err error)
And you pass it a big, 8 KB format string. The fmt package has to analyze the format string; it is not something that gets copied to the output as-is. Most definitely this is what is eating your CPU.
If the random string contains the special % sign, that makes your case even worse, as then fmt.Fprintf() will expect further arguments which you don't "deliver", so the fmt package also has to (and will) include error messages in the output, such as:
fmt.Fprintf(os.Stdout, "aaa%bbb%d")
Output:
aaa%!b(MISSING)bb%!d(MISSING)
Use fmt.Fprint() instead which does not expect a format string:
fmt.Fprint(w, str)
Or even better, convert your random string to a byte slice once, and just keep writing that:
data := []byte(str)
for i := 0; i < iter; i++ {
    if _, err := w.Write(data); err != nil {
        // Handle error, e.g. return
    }
}
When delivering a large amount of data, you won't get a faster solution than writing a prepared byte slice in a loop (a different slice size might gain you a little). If your solution is still "slow", that might be due to your RandStringBytes() function, which we don't know anything about, or your output might be compressed (gzipped) if you use other handlers or some framework (compression does use relatively high CPU). Also, if the client that receives the response is on your computer too (e.g. a browser), it, or a firewall / antivirus software, may check or analyze the response for malicious code, which may also be resource intensive.
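Putting it together, a sketch of the handler with a prepared byte slice (RandStringBytes is assumed from the question); the last chunk is trimmed so the body matches the declared Content-Length exactly:
func downloadHandler(w http.ResponseWriter, r *http.Request) {
    data := []byte(RandStringBytes(8192)) // convert once, outside the loop
    sz := 1000 * 1000 * 100               // 100 MB

    w.Header().Set("Content-Type", "application/octet-stream")
    w.Header().Set("Content-Length", strconv.Itoa(sz))
    for sent := 0; sent < sz; sent += len(data) {
        if rem := sz - sent; rem < len(data) {
            data = data[:rem] // final partial chunk: don't overrun Content-Length
        }
        if _, err := w.Write(data); err != nil {
            return // e.g. the client went away
        }
    }
}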

Removing NUL characters from bytes

To teach myself Go I'm building a simple server that takes some input, does some processing, and sends output back to the client (that includes the original input).
The input can vary in length from around 5 to 13 characters, plus endlines and whatever other guff the client sends.
The input is read into a byte array and then converted to a string for some processing. Another string is appended to this string and the whole thing is converted back into a byte array to get sent back to the client.
The problem is that the input is padded with a bunch of NUL characters, and I'm not sure how to get rid of them.
So I could loop through the array and, when I come to a NUL character, note the length (n), create a new byte array of that length, and copy the first n characters over to the new byte array and use that. Is that the best way, or is there something to make this easier for me?
Some stripped down code:
data := make([]byte, 16)
c.Read(data)
s := strings.Replace(string(data[:]), "an", "", -1)
s = strings.Replace(s, "\r", "", -1)
s += "some other string"
response := []byte(s)
c.Write(response)
c.Close()
Also if I'm doing anything else obviously stupid here it would be nice to know.
In package "bytes", func Trim(s []byte, cutset string) []byte is your friend:
Trim returns a subslice of s by slicing off all leading and trailing UTF-8-encoded Unicode code points contained in cutset.
// Remove any NULL characters from 'b'
b = bytes.Trim(b, "\x00")
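A quick runnable check of that (the padded buffer stands in for the result of a short Read):
package main

import (
    "bytes"
    "fmt"
)

func main() {
    b := []byte("hello\x00\x00\x00") // NUL-padded, as after a short Read into a fixed buffer
    b = bytes.Trim(b, "\x00")
    fmt.Printf("%q len=%d\n", b, len(b)) // "hello" len=5
}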
Your approach sounds basically right. Some remarks:
When you have found the index of the first nul byte in data, you don't need to copy, just truncate the slice: data[:idx].
bytes.Index should be able to find that index for you.
There is also bytes.Replace so you don't need to convert to string.
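A sketch of those remarks combined, using bytes.IndexByte (a convenience sibling of bytes.Index for a single byte) and assuming the payload contains no interior NUL bytes:
package main

import (
    "bytes"
    "fmt"
)

func main() {
    data := make([]byte, 16)
    copy(data, "an example\r\n") // stand-in for c.Read(data)

    if idx := bytes.IndexByte(data, 0); idx >= 0 {
        data = data[:idx] // truncate at the first NUL; no copy needed
    }
    data = bytes.Replace(data, []byte("\r"), nil, -1) // no string conversion
    fmt.Printf("%q\n", data) // "an example\n"
}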
The io.Reader documentation says:
Read reads up to len(p) bytes into p. It returns the number of bytes read (0 <= n <= len(p)) and any error encountered.
If the call to Read in the application does not read 16 bytes, then data will have trailing zero bytes. Use the number of bytes read to trim the zero bytes from the buffer.
data := make([]byte, 16)
n, err := c.Read(data)
if err != nil {
    // handle error
}
data = data[:n]
There's another issue. There's no guarantee that Read slurps up all of the "message" sent by the peer. The application may need to call Read more than once to get the complete message.
You mention endlines in the question. If the message from the client is terminated by a newline, then use bufio.Scanner to read lines from the connection:
s := bufio.NewScanner(c)
if s.Scan() {
    data = s.Bytes() // data is the next line, not including end lines, etc.
}
if s.Err() != nil {
    // handle error
}
You could utilize the return value of Read:
package main

import "strings"

func main() {
    r, b := strings.NewReader("north east south west"), make([]byte, 16)
    n, e := r.Read(b)
    if e != nil {
        panic(e)
    }
    b = b[:n]
    println(string(b) == "north east south")
}
https://golang.org/pkg/io#Reader
