How to remove Unicode characters from a byte buffer in Go?

I have a bytes.Buffer type variable which I filled with Unicode characters:
var mbuff bytes.Buffer
unicodeSource := "کیا حال ھے؟"
for _, r := range unicodeSource {
    mbuff.WriteRune(r)
}
Note: I iterate over a Unicode string literal here, but really the source is an endless stream of user input characters.
Now, I want to remove a Unicode character from any position in the buffer mbuff. The problem is that characters may be of variable byte sizes. So I cannot just pick out the ith byte from mbuff.String() as it might be the beginning, middle, or end of a character. This is my trivial (and horrendous) solution:
// removing Unicode character at position n
var tempString string
currChar := 0
for _, ch := range mbuff.String() { // iterate over Unicode chars
    if currChar != n { // skip concatenating the nth char
        tempString += string(ch)
    }
    currChar++
}
mbuff.Reset()                 // empty buffer
mbuff.WriteString(tempString) // write new string
This is bad in many ways. For one, I convert the buffer to a string, remove the ith element, and write a new string back into the buffer. Too many operations. Second, I use the += operator in the loop to concatenate Unicode characters into a new string. I am using buffers in the first place exactly to avoid concatenation with +=, which is slow, as this answer points out.
What is an efficient method to remove the ith Unicode character in a bytes.Buffer?
Also what is an efficient way to insert a Unicode character after i-1 Unicode characters (i.e. in the ith place)?

To remove the ith rune from a slice of bytes, loop through the slice counting runes. When the ith rune is found, copy the bytes following the rune down to the position of the ith rune:
func removeAtBytes(p []byte, i int) []byte {
    j := 0 // count of runes
    k := 0 // index in p of current rune
    for k < len(p) {
        _, n := utf8.DecodeRune(p[k:])
        if i == j {
            p = p[:k+copy(p[k:], p[k+n:])]
        }
        j++
        k += n
    }
    return p
}
This function modifies the backing array of the argument slice, but it does not allocate memory.
Use this function to remove a rune from a bytes.Buffer.
p := removeAtBytes(mbuff.Bytes(), i)
mbuff.Truncate(len(p)) // backing bytes were updated, adjust length
playground example
To remove the ith rune from a string, loop through the string counting runes. When the ith rune is found, create a string by concatenating the segment of the string before the rune with the segment of the string after the rune.
func removeAt(s string, i int) string {
    j := 0 // count of runes
    k := 0 // index in string of current rune
    for k < len(s) {
        _, n := utf8.DecodeRuneInString(s[k:])
        if i == j {
            return s[:k] + s[k+n:]
        }
        j++
        k += n
    }
    return s
}
This function allocates a single string, the result. DecodeRuneInString is a function in the standard library unicode/utf8 package.
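The question also asks how to insert a rune after the first i runes. The same counting technique applies: find the byte offset of the ith rune and splice the new rune in at that point. A sketch (insertAt is a name introduced here, not taken from any library):

func insertAt(s string, r rune, i int) string {
    j := 0 // count of runes
    k := 0 // index in string of current rune
    for k < len(s) && j < i {
        _, n := utf8.DecodeRuneInString(s[k:])
        j++
        k += n
    }
    // If i is greater than the number of runes in s, r is simply appended.
    return s[:k] + string(r) + s[k:]
}

As with removal, the result can be written back to the buffer with mbuff.Reset() and mbuff.WriteString(...).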

Taking a step back, Go often works on Readers and Writers, so an alternative solution is to use the text/transform package. You create a Transformer, attach it to a Reader, and use the new Reader to produce the transformed string. For example, here's a skipper:
func main() {
    src := strings.NewReader("کیا حال ھے؟")
    skipped := transform.NewReader(src, NewSkipper(5))
    var buf bytes.Buffer
    io.Copy(&buf, skipped)
    fmt.Println("RESULT:", buf.String())
}
And here's the implementation:
package main

import (
    "bytes"
    "fmt"
    "io"
    "strings"
    "unicode/utf8"

    "golang.org/x/text/transform"
)

type skipper struct {
    pos int
    cnt int
}

// NewSkipper creates a text transformer which will remove the rune at pos
func NewSkipper(pos int) transform.Transformer {
    return &skipper{pos: pos}
}

func (s *skipper) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
    for utf8.FullRune(src) {
        _, sz := utf8.DecodeRune(src)
        // not enough space in the dst
        if len(dst) < sz {
            return nDst, nSrc, transform.ErrShortDst
        }
        if s.pos != s.cnt {
            copy(dst[:sz], src[:sz])
            // track that we stored in dst
            dst = dst[sz:]
            nDst += sz
        }
        // track that we read from src
        src = src[sz:]
        nSrc += sz
        // on to the next rune
        s.cnt++
    }
    if len(src) > 0 && !atEOF {
        return nDst, nSrc, transform.ErrShortSrc
    }
    return nDst, nSrc, nil
}

func (s *skipper) Reset() {
    s.cnt = 0
}
There may be bugs with this code, but hopefully you can see the idea.
The benefit of this approach is that it can work on a potentially unbounded amount of data without having to hold all of it in memory. For example, you could transform a file this way.
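As a sketch of that idea, the skipper above can be attached to a file without reading the whole file first (the file names here are placeholders):

package main

import (
    "io"
    "log"
    "os"

    "golang.org/x/text/transform"
)

func main() {
    in, err := os.Open("input.txt") // placeholder input file
    if err != nil {
        log.Fatal(err)
    }
    defer in.Close()

    out, err := os.Create("output.txt") // placeholder output file
    if err != nil {
        log.Fatal(err)
    }
    defer out.Close()

    // The data is streamed through the transformer chunk by chunk;
    // only small buffers are held in memory at any time.
    if _, err := io.Copy(out, transform.NewReader(in, NewSkipper(5))); err != nil {
        log.Fatal(err)
    }
}

(NewSkipper is the transformer defined above.)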

Edit:
Remove the ith rune in the buffer:
A: Shift all runes one location to the left (A is faster than B; see the benchmark below), try it on The Go Playground:
func removeRuneAt(s string, runePosition int) string {
    if runePosition < 0 {
        return s
    }
    r := []rune(s)
    if runePosition >= len(r) {
        return s
    }
    copy(r[runePosition:], r[runePosition+1:])
    return string(r[:len(r)-1])
}
B: Copy to a new buffer, try it on The Go Playground:
func removeRuneAt(s string, runePosition int) string {
    if runePosition < 0 {
        return s // avoid allocation
    }
    r := []rune(s)
    if runePosition >= len(r) {
        return s // avoid allocation
    }
    t := make([]rune, len(r)-1) // Apply replacements to buffer.
    w := copy(t, r[:runePosition])
    w += copy(t[w:], r[runePosition+1:])
    return string(t[:w])
}
C: Try it on The Go Playground:
package main

import (
    "bytes"
    "fmt"
)

func main() {
    str := "hello"
    fmt.Println(str)
    fmt.Println(removeRuneAt(str, 1))
    buf := bytes.NewBuffer([]byte(str))
    fmt.Println(buf.Bytes())
    buf = bytes.NewBuffer([]byte(removeRuneAt(buf.String(), 1)))
    fmt.Println(buf.Bytes())
}

func removeRuneAt(s string, runePosition int) string {
    if runePosition < 0 {
        return s // avoid allocation
    }
    r := []rune(s)
    if runePosition >= len(r) {
        return s // avoid allocation
    }
    t := make([]rune, len(r)-1) // Apply replacements to buffer.
    w := copy(t, r[0:runePosition])
    w += copy(t[w:], r[runePosition+1:])
    return string(t[0:w])
}
D: Benchmark (2,000,000 iterations):
A: 745.0426ms
B: 1.0160581s
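For reference, the comparison can be reproduced with the standard testing package. A sketch, assuming the two variants are saved in a _test.go file under the names removeRuneAtA and removeRuneAtB (names introduced here):

func BenchmarkRemoveRuneAtA(b *testing.B) {
    for i := 0; i < b.N; i++ {
        removeRuneAtA("کیا حال ھے؟", 3)
    }
}

func BenchmarkRemoveRuneAtB(b *testing.B) {
    for i := 0; i < b.N; i++ {
        removeRuneAtB("کیا حال ھے؟", 3)
    }
}

Run with go test -bench=. to get per-operation timings instead of a hand-rolled loop.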
1- Short answer: to replace all instances of a character (or even a substring), pass n = -1 to strings.Replace:
n := -1
newR := ""
old := "µ"
buf = bytes.NewBuffer([]byte(strings.Replace(buf.String(), old, newR, n)))
2- To replace (or remove) the ith instance of a character or substring in the buffer, you may use:
buf = bytes.NewBuffer([]byte(Replace(buf.String(), oldString, newOrEmptyString, ith)))
See:
// Replace returns a copy of the string s with the ith
// non-overlapping instance of old replaced by new.
func Replace(s, old, new string, ith int) string {
    if len(old) == 0 || old == new || ith < 0 {
        return s // avoid allocation
    }
    i, j := 0, 0
    for ; ith >= 0; ith-- {
        j = strings.Index(s[i:], old)
        if j < 0 {
            return s // avoid allocation
        }
        j += i
        i = j + len(old)
    }
    t := make([]byte, len(s)+(len(new)-len(old))) // Apply replacements to buffer.
    w := copy(t, s[0:j])
    w += copy(t[w:], new)
    w += copy(t[w:], s[j+len(old):])
    return string(t[0:w])
}
Try it on The Go Playground:
package main

import (
    "bytes"
    "fmt"
    "strings"
)

func main() {
    str := `How are you?µ`
    fmt.Println(str)
    fmt.Println(Replace(str, "µ", "", 0))
    buf := bytes.NewBuffer([]byte(str))
    fmt.Println(buf.Bytes())
    buf = bytes.NewBuffer([]byte(Replace(buf.String(), "µ", "", 0)))
    fmt.Println(buf.Bytes())
}

func Replace(s, old, new string, ith int) string {
    if len(old) == 0 || old == new || ith < 0 {
        return s // avoid allocation
    }
    i, j := 0, 0
    for ; ith >= 0; ith-- {
        j = strings.Index(s[i:], old)
        if j < 0 {
            return s // avoid allocation
        }
        j += i
        i = j + len(old)
    }
    t := make([]byte, len(s)+(len(new)-len(old))) // Apply replacements to buffer.
    w := copy(t, s[0:j])
    w += copy(t[w:], new)
    w += copy(t[w:], s[j+len(old):])
    return string(t[0:w])
}
3- To remove all instances of a Unicode character (the old string) from any position in the string, you may use:
strings.Replace(str, old, "", -1)
4- This also works for removing from a bytes.Buffer:
strings.Replace(buf.String(), old, newR, -1)
Like so:
buf = bytes.NewBuffer([]byte(strings.Replace(buf.String(), old, newR, -1)))
Here is the complete working code (try it on The Go Playground):
package main

import (
    "bytes"
    "fmt"
    "strings"
)

func main() {
    str := `کیا حال ھے؟` // How are you?
    old := `ک`
    newR := ""
    fmt.Println(strings.Replace(str, old, newR, -1))
    buf := bytes.NewBuffer([]byte(str))
    // for _, r := range str {
    //     buf.WriteRune(r)
    // }
    fmt.Println(buf.Bytes())
    bs := []byte(strings.Replace(buf.String(), old, newR, -1))
    buf = bytes.NewBuffer(bs)
    fmt.Println(" ", buf.Bytes())
}
output:
یا حال ھے؟
[218 169 219 140 216 167 32 216 173 216 167 217 132 32 218 190 219 146 216 159]
[219 140 216 167 32 216 173 216 167 217 132 32 218 190 219 146 216 159]
5- strings.Replace is quite efficient; here is its implementation from the standard library:
// Replace returns a copy of the string s with the first n
// non-overlapping instances of old replaced by new.
// If old is empty, it matches at the beginning of the string
// and after each UTF-8 sequence, yielding up to k+1 replacements
// for a k-rune string.
// If n < 0, there is no limit on the number of replacements.
func Replace(s, old, new string, n int) string {
    if old == new || n == 0 {
        return s // avoid allocation
    }

    // Compute number of replacements.
    if m := Count(s, old); m == 0 {
        return s // avoid allocation
    } else if n < 0 || m < n {
        n = m
    }

    // Apply replacements to buffer.
    t := make([]byte, len(s)+n*(len(new)-len(old)))
    w := 0
    start := 0
    for i := 0; i < n; i++ {
        j := start
        if len(old) == 0 {
            if i > 0 {
                _, wid := utf8.DecodeRuneInString(s[start:])
                j += wid
            }
        } else {
            j += Index(s[start:], old)
        }
        w += copy(t[w:], s[start:j])
        w += copy(t[w:], new)
        start = j + len(old)
    }
    w += copy(t[w:], s[start:])
    return string(t[0:w])
}

Related

Generate combinations from a given range

I'm trying to create a program capable of generating combinations from a given range.
I started from the code below, which generates combinations:
package main
import "fmt"
func nextPassword(n int, c string) func() string {
r := []rune(c)
p := make([]rune, n)
x := make([]int, len(p))
return func() string {
p := p[:len(x)]
for i, xi := range x {
p[i] = r[xi]
}
for i := len(x) - 1; i >= 0; i-- {
x[i]++
if x[i] < len(r) {
break
}
x[i] = 0
if i <= 0 {
x = x[0:0]
break
}
}
return string(p)
}
}
func main() {
np := nextPassword(2, "ABCDE")
for {
pwd := np()
if len(pwd) == 0 {
break
}
fmt.Println(pwd)
}
}
This is the Output of the code:
AA
AB
AC
AD
AE
BA
BB
BC
BD
BE
CA
CB
CC
CD
CE
DA
DB
DC
DD
DE
EA
EB
EC
ED
EE
And this is the code I edited:
package main
import "fmt"
const (
Min = 5
Max = 10
)
func nextPassword(n int, c string) func() string {
r := []rune(c)
p := make([]rune, n)
x := make([]int, len(p))
return func() string {
p := p[:len(x)]
for i, xi := range x {
p[i] = r[xi]
}
for i := len(x) - 1; i >= 0; i-- {
x[i]++
if x[i] < len(r) {
break
}
x[i] = 0
if i <= 0 {
x = x[0:0]
break
}
}
return string(p)
}
}
func main() {
cont := 0
np := nextPassword(2, "ABCDE")
for {
pwd := np()
if len(pwd) == 0 {
break
}
if cont >= Min && cont <= Max{
fmt.Println(pwd)
} else if cont > Max{
break
}
cont += 1
}
}
Output:
BA
BB
BC
BD
BE
CA
My code works, but if I increase the length of the combination and my range starts in the middle, the program will still generate all the combinations before the ones I want (and of course that takes a lot of time).
How can I solve this problem?
I really didn't like how nextPassword was written, so I made a variation. Rather than starting at 0 and repeatedly returning the next value, this one takes an integer and converts it to the corresponding "password." E.g. toPassword(0, 2, []rune("ABCDE")) is AA, and toPassword(5, ...) is BA.
From there, it's easy to loop over whatever range you want. But I also wrote a nextPassword wrapper around it that behaves similarly to the one in the original code. This one uses toPassword under the cover and takes a starting n.
Runnable version here: https://play.golang.org/p/fBo6mx4Mji
Code below:
package main
import (
"fmt"
)
func toPassword(n, length int, alphabet []rune) string {
base := len(alphabet)
// This will be our output
result := make([]rune, length)
// Start filling from the right
i := length - 1
// This is essentially a conversion to base-b, where b is
// the number of possible letters (5 in the case of "ABCDE")
for n > 0 {
// Filling from the right, put the right digit mod b
result[i] = alphabet[n%base]
// Divide the number by the base so we're ready for
// the next digit
n /= base
// Move to the left
i -= 1
}
// Fill anything that's left with "zeros" (first letter of
// the alphabet)
for i >= 0 {
result[i] = alphabet[0]
i -= 1
}
return string(result)
}
// Convenience function that just returns successive values from
// toPassword starting at start
func nextPassword(start, length int, alphabet []rune) func() string {
n := start
return func() string {
result := toPassword(n, length, alphabet)
n += 1
return result
}
}
func main() {
for i := 5; i < 11; i++ {
fmt.Println(toPassword(i, 2, []rune("ABCDE")))
} // BA, BB, BC, BD, BE, CA
// Now do the same thing using nextPassword
np := nextPassword(5, 2, []rune("ABCDE"))
for i := 0; i < 6; i++ {
fmt.Println(np())
} // BA, BB, BC, BD, BE, CA
}
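Because toPassword maps an index directly to its combination, a window in the middle of a much larger space can be produced without generating anything before it. A sketch (the start index and length are arbitrary; it assumes toPassword from above is in scope):

alphabet := []rune("ABCDE")
// Print five length-8 combinations starting at index 1000000,
// without generating the million combinations that precede them.
for i := 1000000; i < 1000005; i++ {
    fmt.Println(toPassword(i, 8, alphabet))
}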

Why does the following golang program throw a runtime out of memory error?

This program is supposed to read a file consisting of pairs of ints (one pair per line) and remove duplicate pairs. While it works on small files, it throws a runtime error on huge files (say a file of 1.5 GB). Initially, I thought that it is the map data structure which is causing this, but even after commenting it out, it still runs out of memory. Any ideas why this is happening? How to rectify it? Here's a data file on which it runs out of memory: http://snap.stanford.edu/data/com-Orkut.html
package main
import (
"fmt"
"bufio"
"os"
"strings"
"strconv"
)
func main() {
file, err := os.Open(os.Args[1])
if err != nil {
panic(err.Error())
}
defer file.Close()
type Edge struct {
u, v int
}
//seen := make(map[Edge]bool)
edges := []Edge{}
scanner := bufio.NewScanner(file)
for i, _ := strconv.Atoi(os.Args[2]); i > 0; i-- {
scanner.Scan()
}
for scanner.Scan() {
str := scanner.Text()
edge := strings.Split(str, ",")
u, _ := strconv.Atoi(edge[0])
v, _ := strconv.Atoi(edge[1])
var key Edge
if u < v {
key = Edge{u,v}
} else {
key = Edge{v,u}
}
//if seen[key] {
// continue
//}
//seen[key] = true
edges = append(edges, key)
}
for _, e := range edges {
s := strconv.Itoa(e.u) + "," + strconv.Itoa(e.v)
fmt.Println(s)
}
}
A sample input is given below. The program can be run as follows (where the last input says how many lines to skip).
go run undup.go a.txt 1
# 3072441,117185083
1,2
1,3
1,4
1,5
1,6
1,7
1,8
I looked at this file: com-orkut.ungraph.txt and it contains 117,185,082 lines. The way your data is structured, that's at least 16 bytes per line (Edge is two 64-bit ints). That alone is 1.7 GB. I have had this problem in the past, and it can be a tricky one. Are you trying to solve this for a specific use case (the file in question) or the general case?
In the specific case there are a few things about the data you could leverage: (1) the keys are sorted, (2) it looks like it stores every connection twice, and (3) the numbers don't seem huge. Here are a couple of ideas:
If you use a smaller type for the key you will use less memory. Try a uint32.
You could stream the keys to another file (without using a map) by simply checking whether the 2nd column is greater than the first:
if u < v {
// write the key to another file
} else {
// skip it because v will eventually show v -> u
}
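A fleshed-out sketch of that streaming idea (file names are placeholders; it assumes the comma-separated format from the question, so the separator would need adjusting for the tab-separated SNAP file):

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

func main() {
    in, err := os.Open("a.txt") // placeholder input file
    if err != nil {
        panic(err.Error())
    }
    defer in.Close()

    out, err := os.Create("unique.txt") // placeholder output file
    if err != nil {
        panic(err.Error())
    }
    defer out.Close()

    w := bufio.NewWriter(out)
    defer w.Flush()

    scanner := bufio.NewScanner(in)
    for scanner.Scan() {
        line := scanner.Text()
        if strings.HasPrefix(line, "#") {
            continue // skip header/comment lines
        }
        cols := strings.Split(line, ",")
        if len(cols) != 2 {
            continue
        }
        u, _ := strconv.Atoi(cols[0])
        v, _ := strconv.Atoi(cols[1])
        // Keep a pair only when u < v; the mirrored v -> u line covers
        // the other direction, so each undirected edge is written once
        // and no in-memory map is needed.
        if u < v {
            fmt.Fprintf(w, "%d,%d\n", u, v)
        }
    }
    if err := scanner.Err(); err != nil {
        panic(err.Error())
    }
}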
For the general case there are a couple strategies you could use:
If the order of the resulting list doesn't matter: use an on-disk hash table to store the map. There are a bunch of these: LevelDB, SQLite, Tokyo Tyrant, ... A really nice one for Go is Bolt.
In your for loop you would just check to see if a bucket contains the given key. (You can convert the ints into byte slices using encoding/binary) If it does, just skip it and continue. You will need to move the second for loop processing step into the first for loop so that you don't have to store all the keys.
If the order of the resulting list does matter (and you can't guarantee the input is in order): You can also use an on-disk hash table, but it needs to be sorted. Bolt is sorted so that will work. Add all the keys to it, then traverse it in the second loop.
Here is an example: (this program will take a while to run with 100 million records)
package main
import (
"bufio"
"encoding/binary"
"fmt"
"github.com/boltdb/bolt"
"os"
"strconv"
"strings"
)
type Edge struct {
u, v int
}
func FromKey(bs []byte) Edge {
return Edge{int(binary.BigEndian.Uint64(bs[:8])), int(binary.BigEndian.Uint64(bs[8:]))}
}
func (e Edge) Key() [16]byte {
var k [16]byte
binary.BigEndian.PutUint64(k[:8], uint64(e.u))
binary.BigEndian.PutUint64(k[8:], uint64(e.v))
return k
}
func main() {
file, err := os.Open(os.Args[1])
if err != nil {
panic(err.Error())
}
defer file.Close()
scanner := bufio.NewScanner(file)
for i, _ := strconv.Atoi(os.Args[2]); i > 0; i-- {
scanner.Scan()
}
db, _ := bolt.Open("ex.db", 0777, nil)
defer db.Close()
bucketName := []byte("edges")
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucketIfNotExists(bucketName)
return nil
})
batchSize := 10000
total := 0
batch := make([]Edge, 0, batchSize)
writeBatch := func() {
total += len(batch)
fmt.Println("write batch. total:", total)
db.Update(func(tx *bolt.Tx) error {
bucket := tx.Bucket(bucketName)
for _, edge := range batch {
key := edge.Key()
bucket.Put(key[:], nil)
}
return nil
})
}
for scanner.Scan() {
str := scanner.Text()
edge := strings.Split(str, "\t")
u, _ := strconv.Atoi(edge[0])
v, _ := strconv.Atoi(edge[1])
var key Edge
if u < v {
key = Edge{u, v}
} else {
key = Edge{v, u}
}
batch = append(batch, key)
if len(batch) == batchSize {
writeBatch()
// reset the batch length to 0
batch = batch[:0]
}
}
// write anything leftover
writeBatch()
db.View(func(tx *bolt.Tx) error {
tx.Bucket(bucketName).ForEach(func(k, v []byte) error {
edge := FromKey(k)
fmt.Println(edge)
return nil
})
return nil
})
}
You are squandering memory. Here's how to rectify it.
You give the sample input a.txt, 48 bytes.
# 3072441,117185083
1,2
1,3
1,4
1,5
On http://snap.stanford.edu/data/com-Orkut.html, I found http://snap.stanford.edu/data/bigdata/communities/com-orkut.ungraph.txt.gz, 1.8 GB uncompressed, 117,185,083 edges.
# Undirected graph: ../../data/output/orkut.txt
# Orkut
# Nodes: 3072441 Edges: 117185083
# FromNodeId ToNodeId
1 2
1 3
1 4
1 5
On http://socialnetworks.mpi-sws.org/data-imc2007.html, I found http://socialnetworks.mpi-sws.mpg.de/data/orkut-links.txt.gz, 3.4 GB uncompressed, 223,534,301 edges.
1 2
1 3
1 4
1 5
Since they are similar, one program can handle all formats.
Your Edge type is
type Edge struct {
u, v int
}
which is 16 bytes on a 64-bit architecture.
Use
type Edge struct {
U, V uint32
}
which is 8 bytes and is adequate.
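A quick way to confirm the two sizes (a sketch; the type names Edge64 and Edge32 are introduced here only for the comparison):

package main

import (
    "fmt"
    "unsafe"
)

type Edge64 struct{ U, V int }    // as in the question, on a 64-bit architecture
type Edge32 struct{ U, V uint32 } // the smaller key type

func main() {
    fmt.Println(unsafe.Sizeof(Edge64{}), unsafe.Sizeof(Edge32{})) // prints: 16 8
}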
If the capacity of a slice is not large enough to fit the additional values, append allocates a new, sufficiently large underlying array that fits both the existing slice elements and the additional values. Otherwise, append re-uses the underlying array. For a large slice, the new array is 1.25 times the size of the old array. While the old array is being copied to the new array, 1 + 1.25 = 2.25 times the memory for the old array is required. Therefore, allocate the underlying array so that all values fit.
make(T, n) initializes a map of type T with initial space for n elements; providing n limits the cost of reorganization and fragmentation as elements are added. Hashing functions are often imperfect, which leads to wasted space. Here the map is unnecessary: to eliminate duplicates, sort the slice in place and move the unique elements down.
A string is immutable, so a new string is allocated by scanner.Text() to convert from the byte slice buffer. To parse numbers we use strconv. To minimize temporary allocations, use scanner.Bytes() and adapt strconv.ParseUint to accept a byte slice argument (bytconv).
For example,
orkut.go
package main
import (
"bufio"
"bytes"
"errors"
"fmt"
"os"
"runtime"
"sort"
"strconv"
)
type Edge struct {
U, V uint32
}
func (e Edge) String() string {
return fmt.Sprintf("%d,%d", e.U, e.V)
}
type ByKey []Edge
func (a ByKey) Len() int { return len(a) }
func (a ByKey) Swap(i, j int) { a[i], a[j] = a[j], a[i] }
func (a ByKey) Less(i, j int) bool {
if a[i].U < a[j].U {
return true
}
if a[i].U == a[j].U && a[i].V < a[j].V {
return true
}
return false
}
func countEdges(scanner *bufio.Scanner) int {
var nNodes, nEdges int
for scanner.Scan() {
line := scanner.Bytes()
if !(len(line) > 0 && line[0] == '#') {
nEdges++
continue
}
n, err := fmt.Sscanf(string(line), "# Nodes: %d Edges: %d", &nNodes, &nEdges)
if err != nil || n != 2 {
n, err = fmt.Sscanf(string(line), "# %d,%d", &nNodes, &nEdges)
if err != nil || n != 2 {
continue
}
}
fmt.Println(string(line))
break
}
if err := scanner.Err(); err != nil {
panic(err.Error())
}
fmt.Println(nEdges)
return nEdges
}
func loadEdges(filename string) []Edge {
file, err := os.Open(filename)
if err != nil {
panic(err.Error())
}
defer file.Close()
scanner := bufio.NewScanner(file)
nEdges := countEdges(scanner)
edges := make([]Edge, 0, nEdges)
offset, err := file.Seek(0, os.SEEK_SET)
if err != nil || offset != 0 {
panic(err.Error())
}
var sep byte = '\t'
scanner = bufio.NewScanner(file)
for scanner.Scan() {
line := scanner.Bytes()
if len(line) > 0 && line[0] == '#' {
continue
}
i := bytes.IndexByte(line, sep)
if i < 0 || i+1 >= len(line) {
sep = ','
i = bytes.IndexByte(line, sep)
if i < 0 || i+1 >= len(line) {
err := errors.New("Invalid line format: " + string(line))
panic(err.Error())
}
}
u, err := ParseUint(line[:i], 10, 32)
if err != nil {
panic(err.Error())
}
v, err := ParseUint(line[i+1:], 10, 32)
if err != nil {
panic(err.Error())
}
if u > v {
u, v = v, u
}
edges = append(edges, Edge{uint32(u), uint32(v)})
}
if err := scanner.Err(); err != nil {
panic(err.Error())
}
if len(edges) <= 1 {
return edges
}
sort.Sort(ByKey(edges))
j := 0
i := j + 1
for ; i < len(edges); i, j = i+1, j+1 {
if edges[i] == edges[j] {
break
}
}
for ; i < len(edges); i++ {
if edges[i] != edges[j] {
j++
edges[j] = edges[i]
}
}
edges = edges[:j+1]
return edges
}
func main() {
if len(os.Args) <= 1 {
err := errors.New("Missing file name")
panic(err.Error())
}
filename := os.Args[1]
fmt.Println(filename)
edges := loadEdges(filename)
var ms runtime.MemStats
runtime.ReadMemStats(&ms)
fmt.Println(ms.Alloc, ms.TotalAlloc, ms.Sys, ms.Mallocs, ms.Frees)
fmt.Println(len(edges), cap(edges))
for i, e := range edges {
fmt.Println(e)
if i >= 10 {
break
}
}
}
// bytconv from strconv
// Return the first number n such that n*base >= 1<<64.
func cutoff64(base int) uint64 {
if base < 2 {
return 0
}
return (1<<64-1)/uint64(base) + 1
}
// ParseUint is like ParseInt but for unsigned numbers.
func ParseUint(s []byte, base int, bitSize int) (n uint64, err error) {
var cutoff, maxVal uint64
if bitSize == 0 {
bitSize = int(strconv.IntSize)
}
s0 := s
switch {
case len(s) < 1:
err = strconv.ErrSyntax
goto Error
case 2 <= base && base <= 36:
// valid base; nothing to do
case base == 0:
// Look for octal, hex prefix.
switch {
case s[0] == '0' && len(s) > 1 && (s[1] == 'x' || s[1] == 'X'):
base = 16
s = s[2:]
if len(s) < 1 {
err = strconv.ErrSyntax
goto Error
}
case s[0] == '0':
base = 8
default:
base = 10
}
default:
err = errors.New("invalid base " + strconv.Itoa(base))
goto Error
}
n = 0
cutoff = cutoff64(base)
maxVal = 1<<uint(bitSize) - 1
for i := 0; i < len(s); i++ {
var v byte
d := s[i]
switch {
case '0' <= d && d <= '9':
v = d - '0'
case 'a' <= d && d <= 'z':
v = d - 'a' + 10
case 'A' <= d && d <= 'Z':
v = d - 'A' + 10
default:
n = 0
err = strconv.ErrSyntax
goto Error
}
if int(v) >= base {
n = 0
err = strconv.ErrSyntax
goto Error
}
if n >= cutoff {
// n*base overflows
n = 1<<64 - 1
err = strconv.ErrRange
goto Error
}
n *= uint64(base)
n1 := n + uint64(v)
if n1 < n || n1 > maxVal {
// n+v overflows
n = 1<<64 - 1
err = strconv.ErrRange
goto Error
}
n = n1
}
return n, nil
Error:
return n, &strconv.NumError{"ParseUint", string(s0), err}
}
Output:
$ go build orkut.go
$ time ./orkut ~/release-orkut-links.txt
/home/peter/release-orkut-links.txt
223534301
1788305680 1788327856 1904683256 135 50
117185083 223534301
1,2
1,3
1,4
1,5
1,6
1,7
1,8
1,9
1,10
1,11
1,12
real 2m53.203s
user 2m51.584s
sys 0m1.628s
$
The orkut.go program with the release-orkut-links.txt file (3,372,855,860 (3.4 GB) bytes with 223,534,301 edges) uses about 1.8 GiB of memory. After eliminating duplicates, 117,185,083 unique edges remain. This matches the 117,185,083 unique edge com-orkut.ungraph.txt file.
With 8 GB of memory on your machine, you can load much larger files.

Golang: find first character in a String that doesn't repeat

I'm trying to write a function that finds the first character in a string that doesn't repeat; so far I have this:
package main
import (
"fmt"
"strings"
)
func check(s string) string {
ss := strings.Split(s, "")
smap := map[string]int{}
for i := 0; i < len(ss); i++ {
(smap[ss[i]])++
}
for k, v := range smap {
if v == 1 {
return k
}
}
return ""
}
func main() {
fmt.Println(check("nebuchadnezzer"))
}
Unfortunately in Go when you iterate a map there's no guarantee of the order so every time I run the code I get a different value, any pointers?
Using a map and 2 loops:
play
func check(s string) string {
m := make(map[rune]uint, len(s)) //preallocate the map size
for _, r := range s {
m[r]++
}
for _, r := range s {
if m[r] == 1 {
return string(r)
}
}
return ""
}
The benefit of this is using just 2 loops, versus the multiple loops you get with strings.ContainsRune or strings.IndexRune (each of those functions has an inner loop of its own).
Efficient (in time and memory) algorithms for grabbing all or the first unique byte http://play.golang.org/p/ZGFepvEXFT:
func FirstUniqueByte(s string) (b byte, ok bool) {
occur := [256]byte{}
order := make([]byte, 0, 256)
for i := 0; i < len(s); i++ {
b = s[i]
switch occur[b] {
case 0:
occur[b] = 1
order = append(order, b)
case 1:
occur[b] = 2
}
}
for _, b = range order {
if occur[b] == 1 {
return b, true
}
}
return 0, false
}
As a bonus, the above function should never generate any garbage. Note that I changed your function signature to be a more idiomatic way to express what you're describing. If you need a func(string) string signature anyway, then the point is moot.
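If the func(string) string shape is needed anyway, a thin wrapper is enough. A sketch (only meaningful for ASCII input, since the underlying function works on bytes):

func check(s string) string {
    if b, ok := FirstUniqueByte(s); ok {
        return string(rune(b)) // convert the byte back to a one-character string
    }
    return ""
}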
That can certainly be optimized, but one solution (which doesn't use a map) would be:
(playground example)
func check(s string) string {
unique := ""
for pos, c := range s {
if strings.ContainsRune(unique, c) {
unique = strings.Replace(unique, string(c), "", -1)
} else if strings.IndexRune(s, c) == pos {
unique = unique + string(c)
}
}
fmt.Println("All unique characters found: ", unique)
if len(unique) > 0 {
_, size := utf8.DecodeRuneInString(unique)
return unique[:size]
}
return ""
}
This is after the question "Find the first un-repeated character in a string"
krait suggested below that the function should:
return a string containing the first full rune, not just the first byte of the utf8 encoding of the first rune.

Convert rune to int?

In the following code, I iterate over a string rune by rune, but I'll actually need an int to perform some checksum calculation. Do I really need to encode the rune into a []byte, then convert it to a string and then use Atoi to get an int out of the rune? Is this the idiomatic way to do it?
// The string `s` only contains digits.
var factor int
for i, c := range s[:12] {
if i % 2 == 0 {
factor = 1
} else {
factor = 3
}
buf := make([]byte, 1)
_ = utf8.EncodeRune(buf, c)
value, _ := strconv.Atoi(string(buf))
sum += value * factor
}
On the playground: http://play.golang.org/p/noWDYjn5rJ
The problem is simpler than it looks. You convert a rune value to an int value with int(r). But your code implies you want the integer value out of the ASCII (or UTF-8) representation of the digit, which you can trivially get with r - '0' as a rune, or int(r - '0') as an int. Be aware that out-of-range runes will corrupt that logic.
For example, sum += (int(c) - '0') * factor,
package main
import (
"fmt"
"strconv"
"unicode/utf8"
)
func main() {
s := "9780486653556"
var factor, sum1, sum2 int
for i, c := range s[:12] {
if i%2 == 0 {
factor = 1
} else {
factor = 3
}
buf := make([]byte, 1)
_ = utf8.EncodeRune(buf, c)
value, _ := strconv.Atoi(string(buf))
sum1 += value * factor
sum2 += (int(c) - '0') * factor
}
fmt.Println(sum1, sum2)
}
Output:
124 124
Why don't you just convert the rune with string(rune)?
s := "12345678910"
var factor, sum int
for i, x := range s {
    if i%2 == 0 {
        factor = 1
    } else {
        factor = 3
    }
    xstr := string(x) // x is a rune converted to string
    xint, _ := strconv.Atoi(xstr)
    sum += xint * factor
}
fmt.Println(sum)
val, _ := strconv.Atoi(string(v))
Where v is a rune
More concise but same idea as above

bytewise compare varint encoded int64's

I'm using levigo, the leveldb bindings for Go. My keys are int64's and need to be kept sorted. By default, leveldb uses a bytewise comparator so I'm trying to use varint encoding.
func i2b(x int64) []byte {
    b := make([]byte, binary.MaxVarintLen64)
    n := binary.PutVarint(b, x)
    return b[:n]
}
My keys are not being sorted correctly. I wrote the following as a test.
var prev int64 = 0
for i := int64(1); i < 1e5; i++ {
if bytes.Compare(i2b(i), i2b(prev)) <= 0 {
log.Fatalf("bytewise: %d > %d", b2i(prev), i)
}
prev = i
}
output: bytewise: 127 > 128
playground
I'm not sure where the problem is. Am I doing the encoding wrong? Is varint not the right encoding to use?
EDIT:
BigEndian fixed width encoding is bytewise comparable
func i2b(x int64) []byte {
b := make([]byte, 8)
binary.BigEndian.PutUint64(b, uint64(x))
return b
}
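One caveat to add to this edit: with a plain uint64(x) conversion, the big-endian encoding is bytewise comparable only for non-negative keys; a negative int64 becomes a huge uint64 and sorts after every positive key. If negative keys are possible, flipping the sign bit first restores numeric order. A sketch:

func i2b(x int64) []byte {
    b := make([]byte, 8)
    // XOR-ing the sign bit maps int64 values onto uint64 monotonically,
    // so bytewise comparison of the big-endian bytes matches numeric order
    // for negative values as well.
    binary.BigEndian.PutUint64(b, uint64(x)^(1<<63))
    return b
}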
The varint encoding is not bytewise comparable* with respect to the order of the values it carries. One way to write the ordering/collating function (cmp below) is, for example:
package main
import (
"encoding/binary"
"log"
)
func i2b(x int64) []byte {
var b [binary.MaxVarintLen64]byte
return b[:binary.PutVarint(b[:], x)]
}
func cmp(a, b []byte) int64 {
x, n := binary.Varint(a)
if n < 0 {
log.Fatal(n)
}
y, n := binary.Varint(b)
if n < 0 {
log.Fatal(n)
}
return x - y
}
func main() {
var prev int64 = 0
for i := int64(1); i < 1e5; i++ {
if cmp(i2b(i), i2b(prev)) <= 0 {
log.Fatal("fail")
}
prev = i
}
}
Playground
(*) The reason is the bit layout of the encoding: PutVarint zig-zag encodes the signed value and then stores it in 7-bit groups, least significant group first, so the byte order of the result does not follow the numeric order of the values.
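To see that concretely, a small sketch printing the encodings of the pair from the question's output:

package main

import (
    "bytes"
    "encoding/binary"
    "fmt"
)

func main() {
    a := make([]byte, binary.MaxVarintLen64)
    b := make([]byte, binary.MaxVarintLen64)
    a = a[:binary.PutVarint(a, 127)]
    b = b[:binary.PutVarint(b, 128)]
    // 127 zig-zag encodes to 254, which takes two 7-bit groups: fe 01.
    // 128 zig-zag encodes to 256, which encodes as:             80 02.
    // Byte for byte, 0xfe > 0x80, so 127 compares greater than 128.
    fmt.Printf("% x\n% x\n%d\n", a, b, bytes.Compare(a, b))
}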
