Go: Removing accents from strings

I'm new to Go and I'm trying to implement a function to convert accented characters into their non-accented equivalents. I'm attempting to follow the example given in this blog (see the heading 'Performing magic').
What I've managed to gather from it is this:
package main

import (
    "bytes"
    "fmt"
    "unicode"

    "code.google.com/p/go.text/transform"
    "code.google.com/p/go.text/unicode/norm"
)

func isMn(r rune) bool {
    return unicode.Is(unicode.Mn, r) // Mn: nonspacing marks
}

func main() {
    r := bytes.NewBufferString("Your Śtring")
    t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
    r = transform.NewReader(r, t)
    fmt.Println(r)
}
It does not work in the slightest and I quite honestly don't know what it means anyway. Any ideas?

Note that Go 1.5 (August 2015) or Go 1.6 (Q1 2016) could introduce a new runes package, with transform operations.
That includes (in runes/example_test.go) a runes.Remove function, which will help transform résumé into resume:
func ExampleRemove() {
    t := transform.Chain(norm.NFD, runes.Remove(runes.In(unicode.Mn)), norm.NFC)
    s, _, _ := transform.String(t, "résumé")
    fmt.Println(s)
    // Output:
    // resume
}
This is still being reviewed though (April 2015).

r should be of type io.Reader, and you can't print r like that. First, you need to read the content into a byte slice:
var (
    s = "Your Śtring"
    b = make([]byte, len(s))
    r io.Reader = strings.NewReader(s)
)
t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
r = transform.NewReader(r, t)
r.Read(b)
fmt.Println(string(b))
This works, but for some reason it returns "Your Stri", two bytes short.
Here is a version that actually does what you need, though I'm still not sure why the example from the blog behaves so strangely.
s := "Yoùr Śtring"
b := make([]byte, len(s))
t := transform.Chain(norm.NFD, transform.RemoveFunc(isMn), norm.NFC)
_, _, e := t.Transform(b, []byte(s), true)
if e != nil {
    panic(e)
}
fmt.Println(string(b))

Related

Using regular expressions in Go to identify a common pattern

I'm trying to parse this string goats=1\r\nalligators=false\r\ntext=works.
contents := "goats=1\r\nalligators=false\r\ntext=works"
compile, err := regexp.Compile(`([^#\s=]+)=([a-zA-Z0-9.]+)`)
if err != nil {
    return
}
matchString := compile.FindAllStringSubmatch(contents, -1)
My output looks like [[goats=1 goats 1] [alligators=false alligators false] [text=works text works]].
What am I doing wrong in my expression that causes goats=1 to be included too? I only want [[goats 1] ...]
For another approach, you can use the strings package instead:
package main

import (
    "fmt"
    "strings"
)

func parse(s string) map[string]string {
    m := make(map[string]string)
    for _, kv := range strings.Split(s, "\r\n") {
        a := strings.Split(kv, "=")
        m[a[0]] = a[1]
    }
    return m
}

func main() {
    m := parse("goats=1\r\nalligators=false\r\ntext=works")
    fmt.Println(m) // map[alligators:false goats:1 text:works]
}
https://golang.org/pkg/strings
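On Go 1.18 and newer, strings.Cut is a handy alternative to indexing the result of strings.Split, and it makes it easy to skip malformed lines (an a[1] access panics on a line without "="). A variant under that assumption:

```go
package main

import (
	"fmt"
	"strings"
)

func parse(s string) map[string]string {
	m := make(map[string]string)
	for _, kv := range strings.Split(s, "\r\n") {
		// Cut splits around the first "=" and reports whether it was
		// found, so malformed lines are skipped instead of panicking.
		if k, v, ok := strings.Cut(kv, "="); ok {
			m[k] = v
		}
	}
	return m
}

func main() {
	fmt.Println(parse("goats=1\r\nalligators=false\r\ntext=works"))
	// map[alligators:false goats:1 text:works]
}
```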

Why am I getting an EOF out of my io.PipeReader?

I’m using something similar in a project and I'm a bit perplexed: why isn't anything being printed?
package main

import (
    "encoding/json"
    "fmt"
    "io"
)

func main() {
    m := make(map[string]string)
    m["foo"] = "bar"
    pr, pw := io.Pipe()
    go func() { pw.CloseWithError(json.NewEncoder(pw).Encode(&m)) }()
    fmt.Fscan(pr)
}
https://play.golang.org/p/OJT1ZRAnut
Is this a race condition of some sort? I tried removing pw.CloseWithError but it changes nothing.
fmt.Fscan takes a reader to read from, plus zero or more pointers to objects to populate. Its result is (n int, err error), where n is the number of items scanned and err is the reason why n is less than the number of (variadic) data objects you passed as its remaining arguments.
In this case, the slice of data objects is length zero, so Fscan fills zero objects and reads no data. It dutifully reports that it scanned 0 objects, and since that number is not less than the number of objects you passed into it, it reports nil error.
Try the following:
func main() {
    m := make(map[string]string)
    m["foo"] = "bar"
    pr, pw := io.Pipe()
    go func() { pw.CloseWithError(json.NewEncoder(pw).Encode(&m)) }()
    var s string
    n, err := fmt.Fscan(pr, &s)
    fmt.Println(n, err) // should be 1 <nil>
    fmt.Println(s)      // should be {"foo":"bar"}
}

What is wrong with my solution to the 23rd task of the Go tour?

I've tried to solve https://tour.golang.org/methods/23 like this:
func (old_reader rot13Reader) Read(b []byte) (int, error) {
    const LEN int = 1024
    tmp_bytes := make([]byte, LEN)
    old_len, err := old_reader.r.Read(tmp_bytes)
    if err == nil {
        tmp_bytes = tmp_bytes[:old_len]
        rot13(tmp_bytes)
        return len(tmp_bytes), nil
    } else {
        return 0, err
    }
}

func main() {
    s := strings.NewReader("Lbh penpxrq gur pbqr!")
    r := rot13Reader{s}
    io.Copy(os.Stdout, &r)
}
rot13 itself is correct, and debug output right before the return shows the correct string. So why is there no output to the console?
The Read method for an io.Reader needs to operate on the byte slice provided to it. You're reading into a new slice, and never modifying the original.
Just use b throughout the Read method:
func (old_reader rot13Reader) Read(b []byte) (int, error) {
    n, err := old_reader.r.Read(b)
    rot13(b[:n])
    return n, err
}
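The tour leaves rot13 itself to the reader; for completeness, here is one possible full program built on this Read method (the rot13 body is my own sketch, not part of the original answers):

```go
package main

import (
	"io"
	"os"
	"strings"
)

type rot13Reader struct {
	r io.Reader
}

// rot13 rotates ASCII letters by 13 places in place,
// leaving all other bytes untouched.
func rot13(b []byte) {
	for i, c := range b {
		switch {
		case c >= 'A' && c <= 'Z':
			b[i] = 'A' + (c-'A'+13)%26
		case c >= 'a' && c <= 'z':
			b[i] = 'a' + (c-'a'+13)%26
		}
	}
}

func (r rot13Reader) Read(b []byte) (int, error) {
	n, err := r.r.Read(b)
	rot13(b[:n])
	return n, err
}

func main() {
	s := strings.NewReader("Lbh penpxrq gur pbqr!")
	io.Copy(os.Stdout, rot13Reader{s})
	// prints "You cracked the code!"
}
```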
You're never modifying b in your reader. The semantic of io.Reader's Read function is that you put the data into b's underlying array directly.
Assuming the rot13() function also modifies in place, this will work (edit: I've tried to keep this code close to your version so you can see what's changed more easily. JimB's solution is a more idiomatic solution to this problem):
func (old_reader rot13Reader) Read(b []byte) (int, error) {
    tmp_bytes := make([]byte, len(b))
    old_len, err := old_reader.r.Read(tmp_bytes)
    tmp_bytes = tmp_bytes[:old_len]
    rot13(tmp_bytes)
    for i := range tmp_bytes {
        b[i] = tmp_bytes[i]
    }
    return old_len, err
}
Example (with stubbed rot13()): https://play.golang.org/p/vlbra-46zk
On a side note, from an idiomatic perspective, old_reader isn't a proper receiver name (nor is old_len a proper variable name). Go prefers short receiver names (like r or rdr in this case), and also prefers camelCase to underscores (underscores will actually fire a golint warning).
Edit2: A more idiomatic version of your code. Kept the same mechanism of action, just cleaned it up a bit.
func (rdr rot13Reader) Read(b []byte) (int, error) {
    tmp := make([]byte, len(b))
    n, err := rdr.r.Read(tmp)
    tmp = tmp[:n]
    rot13(tmp)
    for i := range tmp {
        b[i] = tmp[i]
    }
    return n, err
}
From this, removing the tmp byte slice and using the destination b directly results in JimB's idiomatic solution to the problem.
Edit3: Updated to fix the issue Paul pointed out in comments.

String to UCS-2

I want to translate my Python program, which converts a Unicode string to a UCS-2 hex string, into Go.
In python, it's quite simple:
u"Bien joué".encode('utf-16-be').encode('hex')
-> 004200690065006e0020006a006f007500e9
I am a beginner in Go and the simplest way I found is:
package main

import (
    "fmt"
    "strings"
)

func main() {
    str := "Bien joué"
    fmt.Printf("str: %s\n", str)
    ucs2HexArray := []rune(str)
    s := fmt.Sprintf("%U", ucs2HexArray)
    a := strings.Replace(s, "U+", "", -1)
    b := strings.Replace(a, "[", "", -1)
    c := strings.Replace(b, "]", "", -1)
    d := strings.Replace(c, " ", "", -1)
    fmt.Printf("->: %s", d)
}
str: Bien joué
->: 004200690065006E0020006A006F007500E9
Program exited.
I really don't think this is efficient. How can I improve it?
Thank you
Make this conversion a function then you can easily improve the conversion algorithm in the future. For example,
package main

import (
    "fmt"
    "strings"
    "unicode/utf16"
)

func hexUTF16FromString(s string) string {
    hex := fmt.Sprintf("%04x", utf16.Encode([]rune(s)))
    return strings.Replace(hex[1:len(hex)-1], " ", "", -1)
}

func main() {
    str := "Bien joué"
    fmt.Println(str)
    hex := hexUTF16FromString(str)
    fmt.Println(hex)
}
Output:
Bien joué
004200690065006e0020006a006f007500e9
NOTE:
You say "convert a Unicode string to a UCS-2 string", but your Python example uses UTF-16:
u"Bien joué".encode('utf-16-be').encode('hex')
The Unicode Consortium
UTF-16 FAQ
Q: What is the difference between UCS-2 and UTF-16?
A: UCS-2 is obsolete terminology which refers to a Unicode
implementation up to Unicode 1.1, before surrogate code points and
UTF-16 were added to Version 2.0 of the standard. This term should now
be avoided.
UCS-2 does not describe a data format distinct from UTF-16, because
both use exactly the same 16-bit code unit representations. However,
UCS-2 does not interpret surrogate code points, and thus cannot be
used to conformantly represent supplementary characters.
Sometimes in the past an implementation has been labeled "UCS-2" to
indicate that it does not support supplementary characters and doesn't
interpret pairs of surrogate code points as characters. Such an
implementation would not handle processing of character properties,
code point boundaries, collation, etc. for supplementary characters.
For anything other than trivially short input (and possibly even then), I'd use the golang.org/x/text/encoding/unicode package to convert to UTF-16 (as @peterSo and @JimB point out, slightly different from the obsolete UCS-2).
The advantage (over unicode/utf16) of using this (and the golang.org/x/text/transform package) is that you get BOM support, big or little endian, and that you can encode/decode short strings or bytes, but you can also apply this as a filter to an io.Reader or to an io.Writer to transform your data as you process it instead of all up front (i.e. for a large stream of data you don't need to have it all in memory at once).
E.g.:
package main

import (
    "bytes"
    "fmt"
    "io"
    "io/ioutil"
    "log"
    "strings"

    "golang.org/x/text/encoding/unicode"
    "golang.org/x/text/transform"
)

const input = "Bien joué"

func main() {
    // Get a `transform.Transformer` for encoding.
    e := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
    t := e.NewEncoder()

    // For decoding, allow a Byte Order Mark at the start to
    // switch to the corresponding Unicode decoding (UTF-8, UTF-16BE,
    // or UTF-16LE); otherwise we use `e` (UTF-16BE without BOM):
    t2 := unicode.BOMOverride(e.NewDecoder())
    _ = t2 // we don't show/use this

    // If you have a string:
    str := input
    outstr, n, err := transform.String(t, str)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("string: n=%d, bytes=%02x\n", n, []byte(outstr))

    // If you have a []byte:
    b := []byte(input)
    outbytes, n, err := transform.Bytes(t, b)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("bytes: n=%d, bytes=%02x\n", n, outbytes)

    // If you have an io.Reader for the input:
    ir := strings.NewReader(input)
    r := transform.NewReader(ir, t)
    // Now just read from r as you normally would and the encoding
    // happens as you read, which is good for large sources since it
    // avoids pre-encoding everything. Here we'll just read it all in
    // one go, though, which negates that benefit (normally avoid
    // ioutil.ReadAll).
    outbytes, err = ioutil.ReadAll(r)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("reader: len=%d, bytes=%02x\n", len(outbytes), outbytes)

    // If you have an io.Writer for the output:
    var buf bytes.Buffer
    w := transform.NewWriter(&buf, t)
    _, err = fmt.Fprint(w, input) // or io.Copy from an io.Reader, or whatever
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("writer: len=%d, bytes=%02x\n", buf.Len(), buf.Bytes())
}

// Whichever of these you need you could of
// course put in a single simple function. E.g.:

// NewUTF16BEWriter returns a new writer that wraps w
// by transforming the bytes written into UTF-16-BE.
func NewUTF16BEWriter(w io.Writer) io.Writer {
    e := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
    return transform.NewWriter(w, e.NewEncoder())
}

// ToUTF16BE converts UTF-8 `b` into UTF-16-BE.
func ToUTF16BE(b []byte) ([]byte, error) {
    e := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
    out, _, err := transform.Bytes(e.NewEncoder(), b)
    return out, err
}
Gives:
string: n=10, bytes=004200690065006e0020006a006f007500e9
bytes: n=10, bytes=004200690065006e0020006a006f007500e9
reader: len=18, bytes=004200690065006e0020006a006f007500e9
writer: len=18, bytes=004200690065006e0020006a006f007500e9
The standard library has the built-in utf16.Encode() (https://golang.org/pkg/unicode/utf16/#Encode) function for this purpose.
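For example, a small helper built on utf16.Encode that formats each code unit directly as four hex digits, avoiding the string surgery on Sprintf's slice output shown earlier (hexUTF16BE is an illustrative name of my own):

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf16"
)

// hexUTF16BE writes each UTF-16 code unit of s as four
// lowercase hex digits (big-endian), with no separators.
func hexUTF16BE(s string) string {
	var b strings.Builder
	for _, u := range utf16.Encode([]rune(s)) {
		fmt.Fprintf(&b, "%04x", u)
	}
	return b.String()
}

func main() {
	fmt.Println(hexUTF16BE("Bien joué"))
	// 004200690065006e0020006a006f007500e9
}
```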

How to convert ansi text to utf8

How do I convert ANSI text to UTF-8 in Go?
I am trying to convert an ANSI string to a UTF-8 string.
Go only has UTF-8 strings. You can convert a []byte to a UTF-8 string using the conversion described here:
http://golang.org/doc/go_spec.html#Conversions
Here is a newer method.
package main

import (
    "bytes"
    "fmt"
    "io/ioutil"

    "golang.org/x/text/encoding/traditionalchinese"
    "golang.org/x/text/transform"
)

func Decode(s []byte) ([]byte, error) {
    I := bytes.NewReader(s)
    O := transform.NewReader(I, traditionalchinese.Big5.NewDecoder())
    d, e := ioutil.ReadAll(O)
    if e != nil {
        return nil, e
    }
    return d, nil
}

func main() {
    s := []byte{0xB0, 0xAA}
    b, err := Decode(s)
    fmt.Println(string(b))
    fmt.Println(err)
}
I used iconv-go to do this conversion. You must know your ANSI code page; in my case it is 'big5'.
package main

import (
    "fmt"
    "log"

    //iconv "github.com/djimenez/iconv-go"
    iconv "github.com/andelf/iconv-go"
)

func main() {
    ibuf := []byte{170, 76, 80, 67}
    var obuf [256]byte

    // Method 1: use Convert directly
    nR, nW, err := iconv.Convert(ibuf, obuf[:], "big5", "utf-8")
    if err != nil {
        log.Fatalln(err)
    }
    log.Println(nR, ibuf)
    log.Println(obuf[:nW])
    fmt.Println(string(obuf[:nW]))

    // Method 2: build a converter first
    cv, err := iconv.NewConverter("big5", "utf-8")
    if err != nil {
        log.Fatalln(err)
    }
    nR, nW, err = cv.Convert(ibuf, obuf[:])
    if err != nil {
        log.Fatalln(err)
    }
    log.Println(string(obuf[:nW]))
}
I've written a function that was useful for me; maybe someone else can use it. It converts from Windows-1252 to UTF-8. It handles the code points that Windows-1252 treats as printable characters but Unicode considers control characters (http://en.wikipedia.org/wiki/Windows-1252):
func fromWindows1252(str string) string {
    var arr = []byte(str)
    var buf bytes.Buffer
    var r rune
    for _, b := range arr {
        switch b {
        case 0x80: r = 0x20AC
        case 0x82: r = 0x201A
        case 0x83: r = 0x0192
        case 0x84: r = 0x201E
        case 0x85: r = 0x2026
        case 0x86: r = 0x2020
        case 0x87: r = 0x2021
        case 0x88: r = 0x02C6
        case 0x89: r = 0x2030
        case 0x8A: r = 0x0160
        case 0x8B: r = 0x2039
        case 0x8C: r = 0x0152
        case 0x8E: r = 0x017D
        case 0x91: r = 0x2018
        case 0x92: r = 0x2019
        case 0x93: r = 0x201C
        case 0x94: r = 0x201D
        case 0x95: r = 0x2022
        case 0x96: r = 0x2013
        case 0x97: r = 0x2014
        case 0x98: r = 0x02DC
        case 0x99: r = 0x2122
        case 0x9A: r = 0x0161
        case 0x9B: r = 0x203A
        case 0x9C: r = 0x0153
        case 0x9E: r = 0x017E
        case 0x9F: r = 0x0178
        default: r = rune(b)
        }
        buf.WriteRune(r)
    }
    return buf.String()
}
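The same translation table can also be written as a map, which keeps the function compact; the behavior matches the switch-based version (this condensed variant is my own sketch):

```go
package main

import (
	"fmt"
	"strings"
)

// win1252Special maps the bytes where Windows-1252 differs from
// the corresponding Unicode code points (the 0x80-0x9F range).
var win1252Special = map[byte]rune{
	0x80: 0x20AC, 0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E,
	0x85: 0x2026, 0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6,
	0x89: 0x2030, 0x8A: 0x0160, 0x8B: 0x2039, 0x8C: 0x0152,
	0x8E: 0x017D, 0x91: 0x2018, 0x92: 0x2019, 0x93: 0x201C,
	0x94: 0x201D, 0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014,
	0x98: 0x02DC, 0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A,
	0x9C: 0x0153, 0x9E: 0x017E, 0x9F: 0x0178,
}

func fromWindows1252(str string) string {
	var buf strings.Builder
	for _, b := range []byte(str) {
		if r, ok := win1252Special[b]; ok {
			buf.WriteRune(r)
		} else {
			// All other bytes match their Unicode code point.
			buf.WriteRune(rune(b))
		}
	}
	return buf.String()
}

func main() {
	fmt.Println(fromWindows1252(string([]byte{0x80, 0x20, 0x93, 0x61, 0x94})))
	// € “a”
}
```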
There is no way to do it without writing the conversion yourself or using a third-party package. You could try using this: http://code.google.com/p/go-charset
The golang.org/x/text/encoding/charmap package has functions exactly for this problem:
import "golang.org/x/text/encoding/charmap"

func DecodeWindows1250(enc []byte) string {
    dec := charmap.Windows1250.NewDecoder()
    out, _ := dec.Bytes(enc)
    return string(out)
}

func EncodeWindows1250(inp string) []byte {
    enc := charmap.Windows1250.NewEncoder()
    out, _ := enc.String(inp)
    return []byte(out)
}
Edit: fixed the "undefined: ba" compile error by replacing ba with enc.
