How to read large file by blocks with n length - go

I want to read a large text file (nearly 3 GB) and split it into blocks of n characters each. I tried reading the file and splitting on runes, but it takes a lot of memory.
func SplitSubN(s string, n int) []string {
	sub := ""
	subs := []string{}
	runes := bytes.Runes([]byte(s))
	l := len(runes)
	for i, r := range runes {
		sub = sub + string(r)
		if (i+1)%n == 0 {
			subs = append(subs, sub)
			sub = ""
		} else if (i + 1) == l {
			subs = append(subs, sub)
		}
	}
	return subs
}
I suppose it can be done in a smarter way, like a phased reading of blocks of a certain length from the file, but I don't know how to do it correctly.

Scan for rune start bytes and split based on that. This eliminates all allocations within the function except for the allocation of the result slice.
func SplitSubN(s string, n int) []string {
	if len(s) == 0 {
		return nil
	}
	m := 0
	i := 0
	j := 1
	var result []string
	for ; j < len(s); j++ {
		if utf8.RuneStart(s[j]) {
			if (m+1)%n == 0 {
				result = append(result, s[i:j])
				i = j
			}
			m++
		}
	}
	if j > i {
		result = append(result, s[i:j])
	}
	return result
}
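For example, splitting a string with multi-byte runes (an illustrative check, assuming unicode/utf8 is imported):
fmt.Printf("%q\n", SplitSubN("日本語abc", 2))
// ["日本" "語a" "bc"]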
The API specified in the question requires that the application allocate memory when converting the []byte read from the file to a string. This allocation can be avoided by changing the function to work on bytes:
func SplitSubN(s []byte, n int) [][]byte {
	if len(s) == 0 {
		return nil
	}
	m := 0
	i := 0
	j := 1
	var result [][]byte
	for ; j < len(s); j++ {
		if utf8.RuneStart(s[j]) {
			if (m+1)%n == 0 {
				result = append(result, s[i:j])
				i = j
			}
			m++
		}
	}
	if j > i {
		result = append(result, s[i:j])
	}
	return result
}
Both of these functions require that the application slurp the entire file into memory. I assume that's OK because the function in the question does as well. If you only need to process one chunk at a time, the above code can be adapted to scan as the file is read incrementally.
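As an illustration of that incremental variant, here is a minimal sketch of my own (not from the answer above): a bufio.Reader decodes runes as the file streams through, so only the current block of n runes is held in memory at a time.
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"os"
	"strings"
)

// ReadBlocks calls handle once per block of n runes; the final block may be shorter.
func ReadBlocks(path string, n int, handle func(string)) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	r := bufio.NewReader(f)
	var b strings.Builder
	count := 0
	for {
		ch, _, err := r.ReadRune()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		b.WriteRune(ch)
		if count++; count == n {
			handle(b.String())
			b.Reset()
			count = 0
		}
	}
	if b.Len() > 0 {
		handle(b.String()) // trailing partial block
	}
	return nil
}

func main() {
	// "big.txt" and the block size are placeholders.
	err := ReadBlocks("big.txt", 1000, func(block string) {
		fmt.Println(len(block))
	})
	if err != nil {
		log.Fatal(err)
	}
}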

Actually, the most interesting part is not parsing the chunks themselves but handling characters that overlap chunk boundaries.
For example, if you read the file in chunks of N bytes, the last multi-byte character may be read only partially, with its remainder arriving in the next iteration.
Here is a solution that reads a text file in chunks of a given size and handles the overlapping characters, delivering results asynchronously:
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"unicode/utf8"
)

func main() {
	data, err := ReadInChunks("testfile", 1024*16)
	completed := false
	for !completed {
		select {
		case next := <-data:
			if next == nil {
				completed = true
				break
			}
			fmt.Print(string(next))
		case e := <-err:
			if e != nil {
				log.Fatalf("error: %s", e)
			}
		}
	}
}

func ReadInChunks(path string, readChunkSize int) (data chan []rune, err chan error) {
	var readChannel = make(chan []rune)
	var errorChannel = make(chan error)

	onDone := func() {
		close(readChannel)
		close(errorChannel)
	}

	onError := func(err error) {
		errorChannel <- err
		onDone()
	}

	go func() {
		if _, err := os.Stat(path); os.IsNotExist(err) {
			onError(fmt.Errorf("file [%s] does not exist", path))
			return
		}

		f, err := os.Open(path)
		if err != nil {
			onError(err)
			return
		}
		defer f.Close()

		readBuf := make([]byte, readChunkSize)
		remainder := 0

		for {
			read, err := f.Read(readBuf[remainder:])
			if err == io.EOF {
				onDone()
				return
			}
			if err != nil {
				onError(err)
				return // without this, the goroutine would write to closed channels
			}

			valid := remainder + read
			runes, parsed := runes(readBuf[:valid])
			// Move any unparsed tail (a partial rune) to the front of the buffer.
			if remainder = valid - parsed; remainder > 0 {
				copy(readBuf[:remainder], readBuf[parsed:valid])
			}
			if len(runes) > 0 {
				readChannel <- runes
			}
		}
	}()

	return readChannel, errorChannel
}

func runes(nextBuffer []byte) ([]rune, int) {
	t := make([]rune, utf8.RuneCount(nextBuffer))
	i := 0
	var size = len(nextBuffer)
	var read = 0
	for len(nextBuffer) > 0 {
		r, l := utf8.DecodeRune(nextBuffer)
		runeLen := utf8.RuneLen(r)
		if read+runeLen > size {
			// Stop when the remaining bytes cannot hold the decoded rune;
			// the caller carries them over to the next read.
			break
		}
		read += runeLen
		t[i] = r
		i++
		nextBuffer = nextBuffer[l:]
	}
	return t[:i], read
}
It can be greatly simplified if the file is ASCII.
Alternatively, if you need to support Unicode, you can work with UTF-32 (which has a fixed length), or with UTF-16 if you don't need characters beyond two bytes, since you can then treat it as fixed-size as well.
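To make the ASCII remark concrete, here is a sketch of that simplification (mine, not the answerer's): with one byte per character, chunks can be cut anywhere and io.ReadFull does all the work. Assumes io and os are imported.
// ReadASCIIChunks reads the file n bytes at a time; only the final chunk
// can be shorter. The buffer is reused, so handle must copy data it keeps.
func ReadASCIIChunks(path string, n int, handle func([]byte)) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	buf := make([]byte, n)
	for {
		read, err := io.ReadFull(f, buf)
		if read > 0 {
			handle(buf[:read])
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return nil // end of file; the last chunk may have been partial
		}
		if err != nil {
			return err
		}
	}
}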

Related

How to read specific number of lines of a file?

I need to read a specific number of lines from a file at a time (for example, 10 lines per call), then on the next call continue from the line after the last read position (line 11), read another 10 lines, and so on.
There's no library function available to read a specific number of lines, but you can implement something like this to do the same:
func readLines(n int, r io.Reader) ([]string, error) {
	rd := bufio.NewReader(r)
	var (
		lines = make([]string, 0, n)
		bs    []byte
		done  bool
	)
	for {
		if done || len(lines) == n {
			break
		}
		bss, isPrefix, err := rd.ReadLine()
		if err != nil {
			if err != io.EOF {
				return nil, err
			}
			done = true
		}
		bs = append(bs, bss...)
		if isPrefix {
			continue
		}
		if done && len(bs) == 0 {
			break // EOF with no pending partial line; avoid appending an empty line
		}
		lines = append(lines, string(bs))
		bs = make([]byte, 0)
	}
	return lines, nil
}
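One caveat when calling this repeatedly on the same file: readLines wraps r in a new bufio.Reader each time, and with a bare *os.File any bytes that reader buffered beyond the returned lines are lost. Passing a *bufio.Reader avoids that, because bufio.NewReader returns an argument that is already a large-enough buffered reader unchanged. A hypothetical usage sketch (the file name is a placeholder):
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("input.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r := bufio.NewReader(f) // one buffered reader, reused across calls
	for {
		lines, err := readLines(10, r)
		if err != nil {
			log.Fatal(err)
		}
		if len(lines) == 0 {
			break
		}
		fmt.Println(lines)
	}
}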
This is the function I wrote, and it seems to work:
func ReadLine(inputFile io.ReadSeeker, startPos int64, lineNum int) (slice []string, lastPos int64, err error) {
	r := bufio.NewReader(inputFile)
	var line string
	inputFile.Seek(startPos, io.SeekStart)
	lastPos = startPos
	for i := 0; i < lineNum; i++ {
		line, err = r.ReadString('\n')
		if err != nil {
			break
		}
		lastPos += int64(len(line))
		slice = append(slice, line)
	}
	return
}
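A hypothetical usage sketch for this one, feeding the returned lastPos back in as the next call's startPos (the file name is a placeholder):
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	f, err := os.Open("input.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	var pos int64
	for {
		lines, next, err := ReadLine(f, pos, 10)
		if len(lines) > 0 {
			fmt.Print(lines)
		}
		if err != nil {
			break // io.EOF once the file is exhausted
		}
		pos = next
	}
}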

fastest way search ip in large ip/subnet list on golang?

Please help me solve the following task in the fastest possible way.
I have a large list of IPs/subnets like:
35.132.199.128/27
8.44.144.248/32
87.117.185.193
45.23.45.45
etc
and I need to look up an IP in that list as fast as possible in Go.
When I tried a slice of strings and range, it was very slow on a large list.
A map like map[string]string looks usable, but only for exact IP checks, not for subnet checks.
Can anyone help me with this task? Thanks.
My code:
func (app *application) validateIP(ip string) bool {
	for _, item := range app.IPList {
		itemIsIP := net.ParseIP(item)
		if itemIsIP != nil {
			if ip == itemIsIP.String() {
				return true
			}
			continue
		}
		_, itemNet, err := net.ParseCIDR(item)
		if err != nil {
			log.Printf("[ERROR] %+v", err)
		}
		checkedIP := net.ParseIP(ip)
		if itemNet.Contains(checkedIP) {
			return true
		}
	}
	return false
}
A trie is an ideal data structure for fast containment search for addresses and CIDR address blocks.
The following code shows how to use the ipaddress-go library
address trie implementation to do fast containment searches. Disclaimer: I am the project manager of the IPAddress library.
package main

import (
	"fmt"

	"github.com/seancfoley/ipaddress-go/ipaddr"
)

func main() {
	addrStrs := []string{
		"35.132.199.128/27", "8.44.144.248/32", "87.117.185.193", "45.23.45.45",
	}
	trie := ipaddr.AddressTrie{}
	for _, addrStr := range addrStrs {
		addr := ipaddr.NewIPAddressString(addrStr).GetAddress().ToAddressBase()
		trie.Add(addr)
	}
	fmt.Println("The trie is", trie)
	addrSearchStrs := []string{
		"35.132.199.143", "8.44.144.248", "45.23.45.45", "127.0.0.1",
	}
	for _, addrStr := range addrSearchStrs {
		addr := ipaddr.NewIPAddressString(addrStr).GetAddress().ToAddressBase()
		triePath := trie.ElementsContaining(addr)
		if triePath.Count() > 0 {
			fmt.Println("The blocks and addresses containing", addr, "are",
				triePath)
		} else {
			fmt.Println("No blocks nor addresses contain", addr)
		}
	}
}
Output:
The trie is
○ 0.0.0.0/0 (4)
└─○ 0.0.0.0/1 (4)
├─○ 0.0.0.0/2 (3)
│ ├─● 8.44.144.248 (1)
│ └─○ 32.0.0.0/4 (2)
│ ├─● 35.132.199.128/27 (1)
│ └─● 45.23.45.45 (1)
└─● 87.117.185.193 (1)
The blocks and addresses containing 35.132.199.143 are
● 35.132.199.128/27 (1)
The blocks and addresses containing 8.44.144.248 are
● 8.44.144.248 (1)
The blocks and addresses containing 45.23.45.45 are
● 45.23.45.45 (1)
No blocks nor addresses contain 127.0.0.1
We solved this problem in our project (note the To4() call, since isIPV4inCIDRList expects the 4-byte form of the address):
isIPV4inCIDRList(net.ParseIP("35.132.199.128").To4(), []string{"35.132.199.128/27"})
func isIPV4inCIDRList(ip []byte, list []string) bool {
	for i := 0; i <= 32; i++ { // include /32, or exact-address entries never match
		sm := strconv.Itoa(i)
		m, _ := mask(sm, 4)
		inv := andIP(ip, []byte(m))
		if len(inv) == 0 {
			continue
		}
		for _, cidr := range list {
			if cidr == inv.String()+"/"+sm {
				return true
			}
		}
	}
	return false
}
func andIP(ip, mask []byte) net.IP {
	inv := net.IP{}
	for i, v := range ip {
		inv = append(inv, mask[i]&v)
	}
	return inv
}
// Bigger than we need, not too big to worry about overflow
const big = 0xFFFFFF

// Decimal to integer.
// Returns number, characters consumed, success.
func dtoi(s string) (n int, i int, ok bool) {
	n = 0
	for i = 0; i < len(s) && '0' <= s[i] && s[i] <= '9'; i++ {
		n = n*10 + int(s[i]-'0')
		if n >= big {
			return big, i, false
		}
	}
	if i == 0 {
		return 0, 0, false
	}
	return n, i, true
}

func mask(m string, iplen int) (net.IPMask, error) {
	n, i, ok := dtoi(m)
	if !ok || i != len(m) || n < 0 || n > 8*iplen {
		return nil, &net.ParseError{Type: "CIDR address"}
	}
	return net.CIDRMask(n, 8*iplen), nil
}
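If pulling in a library is not an option, note that much of the cost in the question's code comes from re-parsing every list entry on every lookup. A minimal sketch using only the standard library, parsing the list once up front (the type and function names are illustrative):
type ipMatcher struct {
	ips  map[string]struct{} // exact addresses, in canonical form
	nets []*net.IPNet        // CIDR blocks
}

func newIPMatcher(items []string) (*ipMatcher, error) {
	m := &ipMatcher{ips: make(map[string]struct{})}
	for _, item := range items {
		if ip := net.ParseIP(item); ip != nil {
			m.ips[ip.String()] = struct{}{}
			continue
		}
		_, n, err := net.ParseCIDR(item)
		if err != nil {
			return nil, err
		}
		m.nets = append(m.nets, n)
	}
	return m, nil
}

func (m *ipMatcher) Contains(s string) bool {
	ip := net.ParseIP(s)
	if ip == nil {
		return false
	}
	if _, ok := m.ips[ip.String()]; ok {
		return true
	}
	for _, n := range m.nets {
		if n.Contains(ip) {
			return true
		}
	}
	return false
}
Exact addresses become a single map lookup; the CIDR check stays linear, which is the gap the trie answer above closes.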

check for equality on slices without order

I am trying to find a solution to check for equality of 2 slices. Unfortunately, the answers I have found require values in the slice to be in the same order. For example, http://play.golang.org/p/yV0q1_u3xR evaluates equality to false.
I want a solution that lets []string{"a","b","c"} == []string{"b","a","c"} evaluate to true.
MORE EXAMPLES
[]string{"a","a","c"} == []string{"c","a","c"} >>> false
[]string{"z","z","x"} == []string{"x","z","z"} >>> true
Here is an alternate solution, though perhaps a bit verbose:
func sameStringSlice(x, y []string) bool {
	if len(x) != len(y) {
		return false
	}
	// create a map of string -> int
	diff := make(map[string]int, len(x))
	for _, _x := range x {
		// 0 value for int is 0, so just increment a counter for the string
		diff[_x]++
	}
	for _, _y := range y {
		// If the string _y is not in diff bail out early
		if _, ok := diff[_y]; !ok {
			return false
		}
		diff[_y] -= 1
		if diff[_y] == 0 {
			delete(diff, _y)
		}
	}
	return len(diff) == 0
}
Try it on the Go Playground
You can use cmp.Diff together with cmpopts.SortSlices:
less := func(a, b string) bool { return a < b }
equalIgnoreOrder := cmp.Diff(x, y, cmpopts.SortSlices(less)) == ""
Here is a full example that runs on the Go Playground:
package main

import (
	"fmt"

	"github.com/google/go-cmp/cmp"
	"github.com/google/go-cmp/cmp/cmpopts"
)

func main() {
	x := []string{"a", "b", "c"}
	y := []string{"a", "c", "b"}

	less := func(a, b string) bool { return a < b }
	equalIgnoreOrder := cmp.Diff(x, y, cmpopts.SortSlices(less)) == ""

	fmt.Println(equalIgnoreOrder) // prints "true"
}
The other answers have better time complexity, O(N) versus the O(N log N) of my answer, and my solution will also take up more memory if elements in the slices are repeated frequently, but I wanted to add it because I think it is the most straightforward way to do it:
package main

import (
	"fmt"
	"reflect"
	"sort"
)

func array_sorted_equal(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}

	a_copy := make([]string, len(a))
	b_copy := make([]string, len(b))

	copy(a_copy, a)
	copy(b_copy, b)

	sort.Strings(a_copy)
	sort.Strings(b_copy)

	return reflect.DeepEqual(a_copy, b_copy)
}

func main() {
	a := []string{"a", "a", "c"}
	b := []string{"c", "a", "c"}
	c := []string{"z", "z", "x"}
	d := []string{"x", "z", "z"}

	fmt.Println(array_sorted_equal(a, b))
	fmt.Println(array_sorted_equal(c, d))
}
Result:
false
true
I would think the easiest way would be to map the elements in each array/slice to their number of occurrences, then compare the maps:
func main() {
	x := []string{"a", "b", "c"}
	y := []string{"c", "b", "a"}
	fmt.Println(sameElements(x, y)) // true
}

func sameElements(x, y []string) bool {
	xMap := make(map[string]int)
	yMap := make(map[string]int)

	for _, xElem := range x {
		xMap[xElem]++
	}
	for _, yElem := range y {
		yMap[yElem]++
	}

	for xMapKey, xMapVal := range xMap {
		if yMap[xMapKey] != xMapVal {
			return false
		}
	}
	return true
}
You'll need to add some additional due diligence, like short-circuiting if your arrays/slices contain elements of different types or are of different lengths.
Generalizing the code of testify's ElementsMatch, a solution to compare any kind of objects (in the example, []map[string]string):
https://play.golang.org/p/xUS2ngrUWUl
Like adrianlzt wrote in his answer, an implementation of assert.ElementsMatch from testify can be used to achieve that. But how about reusing the actual testify module instead of copying that code, when all you need is the bool result of the comparison? The implementation in testify is intended for test code and usually takes a testing.T argument.
It turns out that ElementsMatch can quite easily be used outside of testing code. All it takes is a dummy implementation of an interface with an Errorf method:
type dummyt struct{}

func (t dummyt) Errorf(string, ...interface{}) {}

func elementsMatch(listA, listB interface{}) bool {
	return assert.ElementsMatch(dummyt{}, listA, listB)
}
Or test it on the Go Playground, which I've adapted from adrianlzt's example.
Since I haven't got enough reputation to comment, I have to post yet another answer with a bit better code readability:
func AssertSameStringSlice(x, y []string) bool {
	if len(x) != len(y) {
		return false
	}
	itemAppearsTimes := make(map[string]int, len(x))
	for _, i := range x {
		itemAppearsTimes[i]++
	}
	for _, i := range y {
		if _, ok := itemAppearsTimes[i]; !ok {
			return false
		}
		itemAppearsTimes[i]--
		if itemAppearsTimes[i] == 0 {
			delete(itemAppearsTimes, i)
		}
	}
	return len(itemAppearsTimes) == 0
}
The logic is the same as in this answer
I know it's been answered, but I would still like to add my answer. Following the code in stretchr/testify, we can have something like:
func Elementsmatch(listA, listB []string) (string, bool) {
	aLen := len(listA)
	bLen := len(listB)

	if aLen != bLen {
		return fmt.Sprintf("Len of the lists don't match, len listA %v, len listB %v", aLen, bLen), false
	}

	visited := make([]bool, bLen)

	for i := 0; i < aLen; i++ {
		found := false
		element := listA[i]
		for j := 0; j < bLen; j++ {
			if visited[j] {
				continue
			}
			if element == listB[j] {
				visited[j] = true
				found = true
				break
			}
		}
		if !found {
			return fmt.Sprintf("element %s appears more times in %s than in %s", element, listA, listB), false
		}
	}
	return "", true
}
Now let's talk about the performance of this solution compared to the map-based ones. It really depends on the size of the lists you are comparing: if the lists are large (I would say more than 20 elements), the map approach is better; otherwise this is sufficient.
The Go Playground always shows 0s, but run this on a local system and you can see the difference in time taken as the list size increases.
So the solution I propose is adding the map-based comparison from the solution above:
func Elementsmatch(listA, listB []string) (string, bool) {
	aLen := len(listA)
	bLen := len(listB)

	if aLen != bLen {
		return fmt.Sprintf("Len of the lists don't match, len listA %v, len listB %v", aLen, bLen), false
	}

	if aLen > 20 {
		return elementsMatchByMap(listA, listB)
	}
	return elementsMatchByLoop(listA, listB)
}

func elementsMatchByLoop(listA, listB []string) (string, bool) {
	aLen := len(listA)
	bLen := len(listB)

	visited := make([]bool, bLen)

	for i := 0; i < aLen; i++ {
		found := false
		element := listA[i]
		for j := 0; j < bLen; j++ {
			if visited[j] {
				continue
			}
			if element == listB[j] {
				visited[j] = true
				found = true
				break
			}
		}
		if !found {
			return fmt.Sprintf("element %s appears more times in %s than in %s", element, listA, listB), false
		}
	}
	return "", true
}

func elementsMatchByMap(x, y []string) (string, bool) {
	// create a map of string -> int
	diff := make(map[string]int, len(x))
	for _, _x := range x {
		// 0 value for int is 0, so just increment a counter for the string
		diff[_x]++
	}
	for _, _y := range y {
		// If the string _y is not in diff, bail out early
		if _, ok := diff[_y]; !ok {
			return fmt.Sprintf("%v is not present in list b", _y), false
		}
		diff[_y] -= 1
		if diff[_y] == 0 {
			delete(diff, _y)
		}
	}
	if len(diff) == 0 {
		return "", true
	}
	return "", false
}
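The 20-element cutoff above is a guess; to pick it for real data you could benchmark both helpers with the standard testing package. A minimal sketch, assuming the helpers live in the same package (the data generator and sizes are illustrative):
package main

import (
	"strconv"
	"testing"
)

// makeList builds a list of n distinct strings.
func makeList(n int) []string {
	s := make([]string, n)
	for i := range s {
		s[i] = strconv.Itoa(i)
	}
	return s
}

// Package-level sinks keep the compiler from discarding the results.
var sinkMsg string
var sinkOK bool

func BenchmarkElementsMatchByLoop(b *testing.B) {
	x, y := makeList(20), makeList(20)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sinkMsg, sinkOK = elementsMatchByLoop(x, y)
	}
}

func BenchmarkElementsMatchByMap(b *testing.B) {
	x, y := makeList(20), makeList(20)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sinkMsg, sinkOK = elementsMatchByMap(x, y)
	}
}
Run it with go test -bench=. and compare ns/op at different list sizes.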

Why does the following golang program throw a runtime out of memory error?

This program is supposed to read a file consisting of pairs of ints (one pair per line) and remove duplicate pairs. While it works on small files, it throws a runtime error on huge files (say a file of 1.5 GB). Initially, I thought that it is the map data structure which is causing this, but even after commenting it out, it still runs out of memory. Any ideas why this is happening? How to rectify it? Here's a data file on which it runs out of memory: http://snap.stanford.edu/data/com-Orkut.html
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	file, err := os.Open(os.Args[1])
	if err != nil {
		panic(err.Error())
	}
	defer file.Close()

	type Edge struct {
		u, v int
	}
	//seen := make(map[Edge]bool)
	edges := []Edge{}
	scanner := bufio.NewScanner(file)

	for i, _ := strconv.Atoi(os.Args[2]); i > 0; i-- {
		scanner.Scan()
	}

	for scanner.Scan() {
		str := scanner.Text()
		edge := strings.Split(str, ",")
		u, _ := strconv.Atoi(edge[0])
		v, _ := strconv.Atoi(edge[1])
		var key Edge
		if u < v {
			key = Edge{u, v}
		} else {
			key = Edge{v, u}
		}
		//if seen[key] {
		//	continue
		//}
		//seen[key] = true
		edges = append(edges, key)
	}
	for _, e := range edges {
		s := strconv.Itoa(e.u) + "," + strconv.Itoa(e.v)
		fmt.Println(s)
	}
}
A sample input is given below. The program can be run as follows (where the last input says how many lines to skip).
go run undup.go a.txt 1
# 3072441,117185083
1,2
1,3
1,4
1,5
1,6
1,7
1,8
I looked at this file: com-orkut.ungraph.txt, and it contains 117,185,082 lines. The way your data is structured, that's at least 16 bytes per line (Edge is two 64-bit ints), which alone is 1.7 GB. I have had this problem in the past, and it can be a tricky one. Are you trying to solve this for a specific use case (the file in question) or for the general case?
In the specific case there are a few things about the data you could leverage: (1) the keys are sorted, (2) it looks like it stores every connection twice, and (3) the numbers don't seem huge. Here are a couple of ideas:
If you use a smaller type for the key you will use less memory. Try a uint32.
You could stream (without using a map) the keys to another file by simply seeing if the 2nd column is greater than the first:
if u < v {
	// write the key to another file
} else {
	// skip it because v will eventually show v -> u
}
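A fuller sketch of that streaming idea (mine, not the answerer's; it assumes the tab-separated format shown above, that every undirected edge appears in both orientations, and that bufio, fmt, io, strconv, and strings are imported):
func streamUnique(in io.Reader, out io.Writer) error {
	scanner := bufio.NewScanner(in)
	w := bufio.NewWriter(out)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "#") {
			continue // skip header comments
		}
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		u, err1 := strconv.Atoi(fields[0])
		v, err2 := strconv.Atoi(fields[1])
		if err1 != nil || err2 != nil {
			continue
		}
		// Keep one orientation only; the mirrored v -> u line is the duplicate.
		if u < v {
			fmt.Fprintf(w, "%d,%d\n", u, v)
		}
	}
	if err := scanner.Err(); err != nil {
		return err
	}
	return w.Flush()
}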
For the general case there are a couple of strategies you could use:
If the order of the resulting list doesn't matter: use an on-disk hash table to store the map. There are a bunch of these: LevelDB, SQLite, Tokyo Tyrant, ... A really nice one for Go is Bolt.
In your for loop you would just check to see if a bucket contains the given key. (You can convert the ints into byte slices using encoding/binary) If it does, just skip it and continue. You will need to move the second for loop processing step into the first for loop so that you don't have to store all the keys.
If the order of the resulting list does matter (and you can't guarantee the input is in order): You can also use an on-disk hash table, but it needs to be sorted. Bolt is sorted so that will work. Add all the keys to it, then traverse it in the second loop.
Here is an example: (this program will take a while to run with 100 million records)
package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"os"
	"strconv"
	"strings"

	"github.com/boltdb/bolt"
)

type Edge struct {
	u, v int
}

func FromKey(bs []byte) Edge {
	return Edge{int(binary.BigEndian.Uint64(bs[:8])), int(binary.BigEndian.Uint64(bs[8:]))}
}

func (e Edge) Key() [16]byte {
	var k [16]byte
	binary.BigEndian.PutUint64(k[:8], uint64(e.u))
	binary.BigEndian.PutUint64(k[8:], uint64(e.v))
	return k
}

func main() {
	file, err := os.Open(os.Args[1])
	if err != nil {
		panic(err.Error())
	}
	defer file.Close()

	scanner := bufio.NewScanner(file)
	for i, _ := strconv.Atoi(os.Args[2]); i > 0; i-- {
		scanner.Scan()
	}

	db, _ := bolt.Open("ex.db", 0777, nil)
	defer db.Close()

	bucketName := []byte("edges")
	db.Update(func(tx *bolt.Tx) error {
		tx.CreateBucketIfNotExists(bucketName)
		return nil
	})

	batchSize := 10000
	total := 0
	batch := make([]Edge, 0, batchSize)
	writeBatch := func() {
		total += len(batch)
		fmt.Println("write batch. total:", total)
		db.Update(func(tx *bolt.Tx) error {
			bucket := tx.Bucket(bucketName)
			for _, edge := range batch {
				key := edge.Key()
				bucket.Put(key[:], nil)
			}
			return nil
		})
	}

	for scanner.Scan() {
		str := scanner.Text()
		edge := strings.Split(str, "\t")
		u, _ := strconv.Atoi(edge[0])
		v, _ := strconv.Atoi(edge[1])
		var key Edge
		if u < v {
			key = Edge{u, v}
		} else {
			key = Edge{v, u}
		}
		batch = append(batch, key)
		if len(batch) == batchSize {
			writeBatch()
			// reset the batch length to 0
			batch = batch[:0]
		}
	}
	// write anything leftover
	writeBatch()

	db.View(func(tx *bolt.Tx) error {
		tx.Bucket(bucketName).ForEach(func(k, v []byte) error {
			edge := FromKey(k)
			fmt.Println(edge)
			return nil
		})
		return nil
	})
}
You are squandering memory. Here's how to rectify it.
You give the sample input a.txt, 48 bytes.
# 3072441,117185083
1,2
1,3
1,4
1,5
On http://snap.stanford.edu/data/com-Orkut.html, I found http://snap.stanford.edu/data/bigdata/communities/com-orkut.ungraph.txt.gz, 1.8 GB uncompressed, 117,185,083 edges.
# Undirected graph: ../../data/output/orkut.txt
# Orkut
# Nodes: 3072441 Edges: 117185083
# FromNodeId ToNodeId
1 2
1 3
1 4
1 5
On http://socialnetworks.mpi-sws.org/data-imc2007.html, I found http://socialnetworks.mpi-sws.mpg.de/data/orkut-links.txt.gz, 3.4 GB uncompressed, 223,534,301 edges.
1 2
1 3
1 4
1 5
Since they are similar, one program can handle all formats.
Your Edge type is

type Edge struct {
	u, v int
}

which is 16 bytes on a 64-bit architecture. Use

type Edge struct {
	U, V uint32
}

which is 8 bytes; that is adequate.
If the capacity of a slice is not large enough to fit the additional values, append allocates a new, sufficiently large underlying array that fits both the existing slice elements and the additional values. Otherwise, append re-uses the underlying array. For a large slice, the new array is 1.25 times the size of the old array. While the old array is being copied to the new array, 1 + 1.25 = 2.25 times the memory for the old array is required. Therefore, allocate the underlying array so that all values fit.
make(T, n) initializes a map of type T with initial space for n elements. Provide a value for n to limit the cost of reorganization and fragmentation as elements are added. Hashing functions are often imperfect, which leads to wasted space. Eliminate the map, as it's unnecessary: to eliminate duplicates, sort the slice in place and move the unique elements down, as sketched below.
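Isolated from the full program below, that sort-and-compact step looks like this (a sketch; ByKey is the ordering defined in orkut.go):
// Sort so duplicates become adjacent, then move unique elements down in place.
sort.Sort(ByKey(edges))
if len(edges) > 1 {
	j := 0
	for i := 1; i < len(edges); i++ {
		if edges[i] != edges[j] {
			j++
			edges[j] = edges[i]
		}
	}
	edges = edges[:j+1]
}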
A string is immutable, so a new string must be allocated when scanner.Text() converts from the byte slice buffer. To parse numbers we use strconv. To minimize temporary allocations, use scanner.Bytes() and adapt strconv.ParseUint to accept a byte slice argument (bytconv).
For example,
orkut.go
package main

import (
	"bufio"
	"bytes"
	"errors"
	"fmt"
	"io"
	"os"
	"runtime"
	"sort"
	"strconv"
)

type Edge struct {
	U, V uint32
}

func (e Edge) String() string {
	return fmt.Sprintf("%d,%d", e.U, e.V)
}

type ByKey []Edge

func (a ByKey) Len() int      { return len(a) }
func (a ByKey) Swap(i, j int) { a[i], a[j] = a[j], a[i] }
func (a ByKey) Less(i, j int) bool {
	if a[i].U < a[j].U {
		return true
	}
	if a[i].U == a[j].U && a[i].V < a[j].V {
		return true
	}
	return false
}

func countEdges(scanner *bufio.Scanner) int {
	var nNodes, nEdges int
	for scanner.Scan() {
		line := scanner.Bytes()
		if !(len(line) > 0 && line[0] == '#') {
			nEdges++
			continue
		}
		n, err := fmt.Sscanf(string(line), "# Nodes: %d Edges: %d", &nNodes, &nEdges)
		if err != nil || n != 2 {
			n, err = fmt.Sscanf(string(line), "# %d,%d", &nNodes, &nEdges)
			if err != nil || n != 2 {
				continue
			}
		}
		fmt.Println(string(line))
		break
	}
	if err := scanner.Err(); err != nil {
		panic(err.Error())
	}
	fmt.Println(nEdges)
	return nEdges
}

func loadEdges(filename string) []Edge {
	file, err := os.Open(filename)
	if err != nil {
		panic(err.Error())
	}
	defer file.Close()
	scanner := bufio.NewScanner(file)
	nEdges := countEdges(scanner)
	edges := make([]Edge, 0, nEdges)

	offset, err := file.Seek(0, io.SeekStart)
	if err != nil || offset != 0 {
		panic(err.Error())
	}
	var sep byte = '\t'
	scanner = bufio.NewScanner(file)
	for scanner.Scan() {
		line := scanner.Bytes()
		if len(line) > 0 && line[0] == '#' {
			continue
		}
		i := bytes.IndexByte(line, sep)
		if i < 0 || i+1 >= len(line) {
			sep = ','
			i = bytes.IndexByte(line, sep)
			if i < 0 || i+1 >= len(line) {
				err := errors.New("Invalid line format: " + string(line))
				panic(err.Error())
			}
		}
		u, err := ParseUint(line[:i], 10, 32)
		if err != nil {
			panic(err.Error())
		}
		v, err := ParseUint(line[i+1:], 10, 32)
		if err != nil {
			panic(err.Error())
		}
		if u > v {
			u, v = v, u
		}
		edges = append(edges, Edge{uint32(u), uint32(v)})
	}
	if err := scanner.Err(); err != nil {
		panic(err.Error())
	}

	if len(edges) <= 1 {
		return edges
	}
	sort.Sort(ByKey(edges))
	j := 0
	i := j + 1
	for ; i < len(edges); i, j = i+1, j+1 {
		if edges[i] == edges[j] {
			break
		}
	}
	for ; i < len(edges); i++ {
		if edges[i] != edges[j] {
			j++
			edges[j] = edges[i]
		}
	}
	edges = edges[:j+1]
	return edges
}

func main() {
	if len(os.Args) <= 1 {
		err := errors.New("Missing file name")
		panic(err.Error())
	}
	filename := os.Args[1]
	fmt.Println(filename)
	edges := loadEdges(filename)

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Println(ms.Alloc, ms.TotalAlloc, ms.Sys, ms.Mallocs, ms.Frees)
	fmt.Println(len(edges), cap(edges))

	for i, e := range edges {
		fmt.Println(e)
		if i >= 10 {
			break
		}
	}
}

// bytconv from strconv

// Return the first number n such that n*base >= 1<<64.
func cutoff64(base int) uint64 {
	if base < 2 {
		return 0
	}
	return (1<<64-1)/uint64(base) + 1
}

// ParseUint is like ParseInt but for unsigned numbers.
func ParseUint(s []byte, base int, bitSize int) (n uint64, err error) {
	var cutoff, maxVal uint64

	if bitSize == 0 {
		bitSize = int(strconv.IntSize)
	}

	s0 := s
	switch {
	case len(s) < 1:
		err = strconv.ErrSyntax
		goto Error

	case 2 <= base && base <= 36:
		// valid base; nothing to do

	case base == 0:
		// Look for octal, hex prefix.
		switch {
		case s[0] == '0' && len(s) > 1 && (s[1] == 'x' || s[1] == 'X'):
			base = 16
			s = s[2:]
			if len(s) < 1 {
				err = strconv.ErrSyntax
				goto Error
			}
		case s[0] == '0':
			base = 8
		default:
			base = 10
		}

	default:
		err = errors.New("invalid base " + strconv.Itoa(base))
		goto Error
	}

	n = 0
	cutoff = cutoff64(base)
	maxVal = 1<<uint(bitSize) - 1

	for i := 0; i < len(s); i++ {
		var v byte
		d := s[i]
		switch {
		case '0' <= d && d <= '9':
			v = d - '0'
		case 'a' <= d && d <= 'z':
			v = d - 'a' + 10
		case 'A' <= d && d <= 'Z':
			v = d - 'A' + 10
		default:
			n = 0
			err = strconv.ErrSyntax
			goto Error
		}
		if int(v) >= base {
			n = 0
			err = strconv.ErrSyntax
			goto Error
		}

		if n >= cutoff {
			// n*base overflows
			n = 1<<64 - 1
			err = strconv.ErrRange
			goto Error
		}
		n *= uint64(base)

		n1 := n + uint64(v)
		if n1 < n || n1 > maxVal {
			// n+v overflows
			n = 1<<64 - 1
			err = strconv.ErrRange
			goto Error
		}
		n = n1
	}

	return n, nil

Error:
	return n, &strconv.NumError{"ParseUint", string(s0), err}
}
Output:
$ go build orkut.go
$ time ./orkut ~/release-orkut-links.txt
/home/peter/release-orkut-links.txt
223534301
1788305680 1788327856 1904683256 135 50
117185083 223534301
1,2
1,3
1,4
1,5
1,6
1,7
1,8
1,9
1,10
1,11
1,12
real 2m53.203s
user 2m51.584s
sys 0m1.628s
$
The orkut.go program with the release-orkut-links.txt file (3,372,855,860 (3.4 GB) bytes with 223,534,301 edges) uses about 1.8 GiB of memory. After eliminating duplicates, 117,185,083 unique edges remain. This matches the 117,185,083 unique edge com-orkut.ungraph.txt file.
With 8 GB of memory on your machine, you can load much larger files.

Golang: find first character in a String that doesn't repeat

I'm trying to write a function that finds the first character in a string that doesn't repeat. So far I have this:
package main
import (
"fmt"
"strings"
)
package main

import (
	"fmt"
	"strings"
)

func check(s string) string {
	ss := strings.Split(s, "")
	smap := map[string]int{}
	for i := 0; i < len(ss); i++ {
		smap[ss[i]]++
	}
	for k, v := range smap {
		if v == 1 {
			return k
		}
	}
	return ""
}

func main() {
	fmt.Println(check("nebuchadnezzer"))
}
Unfortunately, in Go there's no guarantee of order when you iterate over a map, so every time I run the code I get a different value. Any pointers?
Using a map and 2 loops (playground):
func check(s string) string {
	m := make(map[rune]uint, len(s)) // preallocate the map size
	for _, r := range s {
		m[r]++
	}
	for _, r := range s {
		if m[r] == 1 {
			return string(r)
		}
	}
	return ""
}
The benefit of this is using just 2 loops, versus the multiple loops you get with strings.ContainsRune or strings.IndexRune (each of those functions has an inner loop of its own).
Efficient (in time and memory) algorithms for grabbing all or the first unique byte http://play.golang.org/p/ZGFepvEXFT:
func FirstUniqueByte(s string) (b byte, ok bool) {
	occur := [256]byte{}
	order := make([]byte, 0, 256)
	for i := 0; i < len(s); i++ {
		b = s[i]
		switch occur[b] {
		case 0:
			occur[b] = 1
			order = append(order, b)
		case 1:
			occur[b] = 2
		}
	}
	for _, b = range order {
		if occur[b] == 1 {
			return b, true
		}
	}
	return 0, false
}
As a bonus, the above function should never generate any garbage. Note that I changed your function signature to be a more idiomatic way to express what you're describing. If you need a func(string) string signature anyway, then the point is moot.
That can certainly be optimized, but one solution (which isn't using map) would be:
(playground example)
func check(s string) string {
	unique := ""
	for pos, c := range s {
		if strings.ContainsRune(unique, c) {
			unique = strings.Replace(unique, string(c), "", -1)
		} else if strings.IndexRune(s, c) == pos {
			unique = unique + string(c)
		}
	}
	fmt.Println("All unique characters found: ", unique)
	if len(unique) > 0 {
		_, size := utf8.DecodeRuneInString(unique)
		return unique[:size]
	}
	return ""
}
This is after the question "Find the first un-repeated character in a string"
krait suggested below that the function should return a string containing the first full rune, not just the first byte of the UTF-8 encoding of the first rune.
