Tour of Go Exercise #23: my word counter doesn't work - go

I'm trying to solve the puzzle from Go Tour exercise #23 and I don't understand why my word counter doesn't work. print seems to show the expected value, but the test only ever sees 1, regardless of the actual count.
package main

import (
    "strings"
    "unicode/utf8"

    "golang.org/x/tour/wc"
)

func WordCount(s string) map[string]int {
    // explode the string into a slice without whitespace
    ws := strings.Fields(s)
    // make a new map
    c := make(map[string]int)
    // iterate over each word
    for _, v := range ws {
        c[v] = utf8.RuneCountInString(v)
    }
    print(c["am"])
    return c
}

func main() {
    wc.Test(WordCount)
}
The playground is available here

You're solving the wrong problem. It doesn't ask you for the length of each word, but for the number of times each word occurs. Change
c[v] = utf8.RuneCountInString(v)
to
c[v] += 1 // or c[v]++

The problem is c[v] = utf8.RuneCountInString(v). It has two issues:
You're resetting the counter for each word every time you re-encounter it. You should increment, not set.
You're setting the counter to the number of runes in the word, but the puzzle asks "how many times does each word appear in the text?", so just do something like c[v] = c[v] + 1 (if the entry doesn't exist yet, it defaults to 0).
Also, I'd normalize the text - strip punctuation marks and lowercase everything.
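Putting the counting fix together, a minimal corrected WordCount might look like this (the normalization suggested above is left as a comment, because changing the map keys would no longer match the tour's expected output):
package main

import (
    "strings"

    "golang.org/x/tour/wc"
)

func WordCount(s string) map[string]int {
    c := make(map[string]int)
    for _, w := range strings.Fields(s) {
        // For real text you might normalize here, e.g.
        // w = strings.ToLower(strings.Trim(w, ".,;:!?\"'"))
        // but that changes the map keys the tour test expects.
        c[w]++ // count occurrences, not rune length
    }
    return c
}

func main() {
    wc.Test(WordCount)
}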

Related

is there a way to word wrap/pad one sentence to multiple sentences in go

Right now I have a huge string of 250-300 characters which I'm writing to a file using
file, err := ioutil.TempFile("/Downloads", "*.txt")
if err != nil {
    log.Fatal(err)
}
file.WriteString(mystring)
This writes everything on one line; is there a way to wrap the text so that it automatically moves to a new line after 76 characters?
I found a solution which does exactly the above requirement.
I made it a generic solution that splits after every n characters and inserts whatever delimiter is required.
you can try it in the playground if you wish (https://play.golang.org/p/5ZHCC_Z5uqc)
// insertNth inserts a '\n' after every n characters of s; note that i is a
// byte index, so this assumes single-byte (ASCII) input.
func insertNth(s string, n int) string {
    var buffer bytes.Buffer
    var n_1 = n - 1
    var l_1 = len(s) - 1
    for i, r := range s {
        buffer.WriteRune(r)
        // insert the delimiter after every n-th position, except at the very end
        if i%n == n_1 && i != l_1 {
            buffer.WriteRune('\n')
        }
    }
    return buffer.String()
}
I did some digging and actually found it not that difficult; I've posted my solution above.
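For the original requirement, usage would then look something like this (it reuses insertNth from above and mystring plus the temp-file setup from the question; 76 is the wrap width asked for there):
wrapped := insertNth(mystring, 76)

file, err := ioutil.TempFile("/Downloads", "*.txt")
if err != nil {
    log.Fatal(err)
}
defer file.Close()

file.WriteString(wrapped)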

How do I count emojis in a string in go? [duplicate]

How can I get the number of characters of a string in Go?
For example, if I have a string "hello" the method should return 5. I saw that len(str) returns the number of bytes and not the number of characters so len("£") returns 2 instead of 1 because £ is encoded with two bytes in UTF-8.
You can try RuneCountInString from the utf8 package, which "returns the number of runes in p".
As illustrated in this script, the length of "World" written in Chinese ("世界") is 6 bytes, but its rune count is 2:
package main

import "fmt"
import "unicode/utf8"

func main() {
    fmt.Println("Hello, 世界", len("世界"), utf8.RuneCountInString("世界"))
}
Phrozen adds in the comments:
Actually you can do len() over runes by just type casting.
len([]rune("世界")) will print 2. At least in Go 1.3.
And with CL 108985 (May 2018, for Go 1.11), len([]rune(string)) is now optimized. (Fixes issue 24923)
The compiler detects the len([]rune(string)) pattern automatically and replaces it with what amounts to a for r := range s loop.
Adds a new runtime function to count runes in a string.
Modifies the compiler to detect the pattern len([]rune(string))
and replaces it with the new rune counting runtime function.
name                                 old          new          delta
RuneCount/lenruneslice/ASCII         27.8ns ± 2%  14.5ns ± 3%  -47.70%
RuneCount/lenruneslice/Japanese       126ns ± 2%    60ns ± 2%  -52.03%
RuneCount/lenruneslice/MixedLength    104ns ± 2%    50ns ± 1%  -51.71%
Stefan Steiger points to the blog post "Text normalization in Go"
What is a character?
As was mentioned in the strings blog post, characters can span multiple runes.
For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character.
The definition of a character may vary depending on the application.
For normalization we will define it as:
a sequence of runes that starts with a starter,
a rune that does not modify or combine backwards with any other rune,
followed by a possibly empty sequence of non-starters, that is, runes that do (typically accents).
The normalization algorithm processes one character at a time.
Using that package and its Iter type, the actual number of "characters" would be:
package main

import "fmt"
import "golang.org/x/text/unicode/norm"

func main() {
    var ia norm.Iter
    ia.InitString(norm.NFKD, "école")
    nc := 0
    for !ia.Done() {
        nc = nc + 1
        ia.Next()
    }
    fmt.Printf("Number of chars: %d\n", nc)
}
Here, this uses the Unicode normalization form NFKD ("Compatibility Decomposition").
Oliver's answer points to UNICODE TEXT SEGMENTATION as the only way to reliably determine default boundaries between certain significant text elements: user-perceived characters, words, and sentences.
For that, you need an external library like rivo/uniseg, which does Unicode Text Segmentation.
That will actually count "grapheme clusters", where multiple code points may be combined into one user-perceived character.
package main

import (
    "fmt"

    "github.com/rivo/uniseg"
)

func main() {
    gr := uniseg.NewGraphemes("👍🏼!")
    for gr.Next() {
        fmt.Printf("%x ", gr.Runes())
    }
    // Output: [1f44d 1f3fc] [21]
}
Two graphemes, even though there are three runes (Unicode code points).
You can see other examples in "How to manipulate strings in GO to reverse them?"
👩🏾‍🦰 alone is one grapheme but, according to a Unicode code-point converter, four runes:
👩: woman (1f469)
🏾: dark skin tone (1f3fe)
ZERO WIDTH JOINER (200d)
🦰: red hair (1f9b0)
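A quick way to verify this (a small sketch using the same rivo/uniseg package mentioned above):
package main

import (
    "fmt"

    "github.com/rivo/uniseg"
)

func main() {
    s := "👩🏾‍🦰"
    fmt.Println("runes:", len([]rune(s)))                     // 4 code points
    fmt.Println("graphemes:", uniseg.GraphemeClusterCount(s)) // 1 user-perceived character
}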
There is a way to get the count of runes without any extra packages, by converting the string to []rune and taking len([]rune(YOUR_STRING)):
package main

import "fmt"

func main() {
    russian := "Спутник и погром"
    english := "Sputnik & pogrom"

    fmt.Println("count of bytes:",
        len(russian),
        len(english))

    fmt.Println("count of runes:",
        len([]rune(russian)),
        len([]rune(english)))
}
count of bytes: 30 16
count of runes: 16 16
I should point out that none of the answers provided so far give you the number of characters as you would expect, especially when you're dealing with emojis (but also some languages like Thai, Korean, or Arabic). VonC's suggestions will output the following:
fmt.Println(utf8.RuneCountInString("🏳️‍🌈🇩🇪")) // Outputs "6".
fmt.Println(len([]rune("🏳️‍🌈🇩🇪"))) // Outputs "6".
That's because these methods only count Unicode code points. There are many characters which can be composed of multiple code points.
Same for using the Normalization package:
var ia norm.Iter
ia.InitString(norm.NFKD, "🏳️‍🌈🇩🇪")
nc := 0
for !ia.Done() {
    nc = nc + 1
    ia.Next()
}
fmt.Println(nc) // Outputs "6".
Normalization is not really the same as counting characters and many characters cannot be normalized into a one-code-point equivalent.
masakielastic's answer comes close but only handles modifiers (the rainbow flag contains a modifier which is thus not counted as its own code point):
fmt.Println(GraphemeCountInString("🏳️‍🌈🇩🇪")) // Outputs "5".
fmt.Println(GraphemeCountInString2("🏳️‍🌈🇩🇪")) // Outputs "5".
The correct way to split Unicode strings into (user-perceived) characters, i.e. grapheme clusters, is defined in the Unicode Standard Annex #29. The rules can be found in Section 3.1.1. The github.com/rivo/uniseg package implements these rules so you can determine the correct number of characters in a string:
fmt.Println(uniseg.GraphemeClusterCount("🏳️‍🌈🇩🇪")) // Outputs "2".
If you need to take grapheme clusters into account, use the regexp or unicode package. Counting the number of code points (runes) or bytes is also needed for validation, since the length of a grapheme cluster is unbounded. If you want to eliminate extremely long sequences, check whether the sequences conform to the stream-safe text format.
package main

import (
    "regexp"
    "strings"
    "unicode"
)

func main() {
    str := "\u0308" + "a\u0308" + "o\u0308" + "u\u0308"
    str2 := "a" + strings.Repeat("\u0308", 1000)

    println(4 == GraphemeCountInString(str))
    println(4 == GraphemeCountInString2(str))

    println(1 == GraphemeCountInString(str2))
    println(1 == GraphemeCountInString2(str2))

    println(true == IsStreamSafeString(str))
    println(false == IsStreamSafeString(str2))
}

// GraphemeCountInString counts a non-mark rune followed by any number of
// combining marks as a single grapheme.
func GraphemeCountInString(str string) int {
    re := regexp.MustCompile("\\PM\\pM*|.")
    return len(re.FindAllString(str, -1))
}

// GraphemeCountInString2 does the same without regexp: every non-mark rune
// starts a new grapheme; marks that appear before the first non-mark rune
// are each counted on their own.
func GraphemeCountInString2(str string) int {
    length := 0
    checked := false
    for _, c := range str {
        if !unicode.Is(unicode.M, c) {
            length++
            if !checked {
                checked = true
            }
        } else if !checked {
            length++
        }
    }
    return length
}

// IsStreamSafeString approximates the stream-safe text format check: it
// reports false if any non-mark rune is followed by 30 or more combining marks.
func IsStreamSafeString(str string) bool {
    re := regexp.MustCompile("\\PM\\pM{30,}")
    return !re.MatchString(str)
}
There are several ways to get a string length:
package main

import (
    "bytes"
    "fmt"
    "strings"
    "unicode/utf8"
)

func main() {
    b := "这是个测试"

    len1 := len([]rune(b))
    len2 := bytes.Count([]byte(b), nil) - 1
    len3 := strings.Count(b, "") - 1
    len4 := utf8.RuneCountInString(b)

    fmt.Println(len1)
    fmt.Println(len2)
    fmt.Println(len3)
    fmt.Println(len4)
}
Depends a lot on your definition of what a "character" is. If "rune equals a character" is OK for your task (generally it isn't), then the answer by VonC is perfect for you. Otherwise, it should probably be noted that there are few situations in which the number of runes in a Unicode string is an interesting value. And even in those situations it's better, if possible, to infer the count while "traversing" the string as the runes are processed, to avoid doubling the UTF-8 decoding effort.
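As a sketch of what "counting while traversing" means in practice (processRune here is just a placeholder for whatever per-rune work you already do):
package main

import "fmt"

// countWhileProcessing decodes the string exactly once: each iteration of
// range already yields the next rune, so the count comes for free.
func countWhileProcessing(s string, processRune func(rune)) int {
    n := 0
    for _, r := range s {
        processRune(r)
        n++
    }
    return n
}

func main() {
    n := countWhileProcessing("Hello, 世界", func(r rune) { /* per-rune work goes here */ })
    fmt.Println(n) // 9
}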
I tried to make the normalization a bit faster:
en, _ = glyphSmart(data)

func glyphSmart(text string) (int, int) {
    gc := 0 // number of range iterations, i.e. runes
    dummy := 0
    for ind, _ := range text {
        gc++
        dummy = ind
    }
    dummy = 0
    return gc, dummy
}

List of Strings - golang

I'm trying to make a list of strings in Go. I've looked at the container/list package, but I don't know how to put a string into it. I've tried several times with no result.
Should I use something else instead of lists?
Thanks in advance.
edit: I don't know why you are rating this question with negative votes...
Modifying the exact example you linked, and changing the ints to strings works for me:
package main

import (
    "container/list"
    "fmt"
)

func main() {
    // Create a new list and put some strings in it.
    l := list.New()
    e4 := l.PushBack("4")
    e1 := l.PushFront("1")
    l.InsertBefore("3", e4)
    l.InsertAfter("2", e1)

    // Iterate through the list and print its contents.
    for e := l.Front(); e != nil; e = e.Next() {
        fmt.Println(e.Value)
    }
}
If you take a look at the source code to the package you linked, it seems that the List type holds a list of Elements. Looking at Element you'll see that it has one exported field called Value which is an interface{} type, meaning it could be literally anything: string, int, float64, io.Reader, etc.
To answer your second question, you'll see that List has a method called Remove(e *Element). You can use it like this:
fmt.Println(l.Len()) // prints: 4

// Iterate through the list and print its contents.
for e := l.Front(); e != nil; e = e.Next() {
    if e.Value == "4" {
        l.Remove(e) // remove "4"
    } else {
        fmt.Println(e.Value)
    }
}

fmt.Println(l.Len()) // prints: 3
By and large, Golang documentation is usually pretty solid, so you should always check there first.
https://golang.org/pkg/container/list/#Element
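As for the question of whether to use something other than container/list: in most Go code a plain []string slice is the idiomatic choice, unless you specifically need cheap insertion and removal in the middle of the sequence. A minimal sketch:
package main

import "fmt"

func main() {
    words := []string{"1"}
    words = append(words, "2", "3", "4") // add strings to the slice
    for _, w := range words {            // iterate and print
        fmt.Println(w)
    }
}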

Slice unicode/ascii strings in golang?

I need to slice a string in Go. Possible values can contain Latin and/or Arabic/Chinese characters. In the following example, the slice expression [:1] on the Arabic string returns an unexpected value/character.
package main

import "fmt"

func main() {
    a := "a"
    fmt.Println(a[:1]) // works

    b := "ذ"
    fmt.Println(b[:1]) // does not work
    fmt.Println(b[:2]) // works

    fmt.Println(len(a) == len(b)) // false
}
http://play.golang.org/p/R-JxaxbfNL
First of all, you should really read about strings, bytes and runes in Go.
And here is how you can achieve what you want: Go playground (I was not able to properly paste Arabic characters, but if Chinese works, Arabic should work too).
s := "abcdefghijklmnop"
fmt.Println(s[2:9])
s = "维基百科:关于中文维基百科"
fmt.Println(string([]rune(s)[2:9]))
The output is:
cdefghi
百科:关于中文
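Note that s[2:9] slices bytes, while string([]rune(s)[2:9]) slices code points, at the cost of decoding and copying the whole string. If you need this in more than one place, a small helper keeps it readable (runeSlice is just an illustrative name, not part of any package):
package main

import "fmt"

// runeSlice returns the substring of s covering runes [start, end),
// by decoding the whole string into a []rune first.
func runeSlice(s string, start, end int) string {
    r := []rune(s)
    return string(r[start:end])
}

func main() {
    s := "维基百科:关于中文维基百科"
    fmt.Println(runeSlice(s, 2, 9)) // 百科:关于中文
}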
You can use the utf8string package:
package main

import "golang.org/x/exp/utf8string"

func main() {
    a := utf8string.NewString("🎈🎄🎀🎢👓")

    // example 1
    r := a.At(1)

    // example 2
    s := a.Slice(1, 3)

    // example 3
    n := a.RuneCount()

    // print
    println(r == '🎄', s == "🎄🎀", n == 5)
}
https://pkg.go.dev/golang.org/x/exp/utf8string

How to efficiently concatenate string array and string in golang

I have a bunch of strings and []strings in Go which I need to concatenate. But for some reason I am getting a lot of whitespace along the way which I need to get rid of.
Here's the code
tests := strings.TrimSpace(s[0])
dep_string := make ([]string, len(tests) + len(sfinal))
dep_string = append (dep_string, tests)
for _, v := range sfinal {
    dep_string = append(dep_string, v)
}
fmt.Println("dep_String is ", dep_string)
Input:
s[0] = "filename"
sfinal = [test test1]
expected output
[filename test test1]
actual output
[ filename test test1]
It's really weird; even after using TrimSpace I am not able to get rid of excess space. Is there any efficient way to concatenate them?
The whitespace is due to all of the empty elements in dep_string. When you use the make function, it creates a slice with the specified length and capacity, filled with empty strings. Then, when you use append, it sees that the slice is already at its capacity, grows it, and adds your elements after all of the empty strings. The solution is to make a slice with the capacity to hold all of your elements, but with an initial length of zero:
dep_string := make ([]string, 0, len(tests) + len(sfinal))
strings.TrimSpace is unnecessary. You can read more at http://blog.golang.org/slices
Bill DeRose and Saposhiente are correct about how slices work.
As for a simpler way of solving your problem, you could also do (play):
fmt.Println("join is",strings.Join(append(s[:1],sfinal...)," "))
When you do the assignment dep_string := make ([]string, len(tests) + len(sfinal)), Go zeros out the allocated memory so dep_string then has len(tests) + len(sfinal) empty strings at the front of it. As it's written now you append to the end of the slice, after all those zeroed out strings.
Run this to see where those blanks are showing up in your code. You can fix it by making a slice of length 0 and capacity len(tests) + len(sfinal) instead. You can then concatenate them by using strings.Join.
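A short sketch of that suggestion, reusing the variable names from the question (note that len(tests) is the byte length of a string, so the element capacity you actually want is len(sfinal)+1):
package main

import (
    "fmt"
    "strings"
)

func main() {
    tests := "filename" // stands in for strings.TrimSpace(s[0]) from the question
    sfinal := []string{"test", "test1"}

    // length 0, capacity for tests plus sfinal
    dep_string := make([]string, 0, len(sfinal)+1)
    dep_string = append(dep_string, tests)
    dep_string = append(dep_string, sfinal...)

    fmt.Println("dep_String is ", dep_string)  // [filename test test1]
    fmt.Println(strings.Join(dep_string, " ")) // filename test test1
}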
Here's a simple and efficient solution to your problem.
package main

import "fmt"

func main() {
    s := make([]string, 1, 4)
    s[0] = "filename"
    sfinal := []string{"test", "test1"}

    dep_string := append(s[:1], sfinal...)

    fmt.Println("dep_String is ", dep_string)
}
Output:
dep_String is [filename test test1]
When you do the assignment dep_string := make ([]string, len(tests) + len(sfinal)), it allocates len(tests) + len(sfinal) empty strings, which is 10 empty strings in your case. So when you call fmt.Println("dep_String is ", dep_string), those 10 empty strings are printed as well: fmt.Println on a slice of strings puts a blank between every two elements, so it prints 9 extra blanks, which is why you see [ filename test test1] after you append. The whitespace is just the separators between those 10 empty strings.
