Normalizing text input to ASCII - go

I am building a small tool that parses a user's input, finds common writing pitfalls, and flags them so the user can improve their text. So far everything works well except for text that uses curly quotes instead of plain ASCII straight quotes. For now I have a hack that does a string replacement for each opening and closing single curly quote and each opening and closing double curly quote, like so:
cleanedData := bytes.Replace([]byte(data), []byte("’"), []byte("'"), -1)
I feel like there must be a better way to handle this in the stdlib so I can also convert other non-ascii characters to an ascii equivalent. Any help would be greatly appreciated.

The strings.Map function looks to me like what you want.
I don't know of a generic 'ToAscii' type function, but Map has a nice approach for mapping runes to other runes.
Example (updated):
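package main

import (
	"fmt"
	"strings"
)

func main() {
	data := "Hello “Frank” or ‹François› as you like to be ‘called’"
	fmt.Printf("Original: %s\n", data)
	// strings.Map calls normalize once per rune and builds a new string.
	cleanedData := strings.Map(normalize, data)
	fmt.Printf("Cleaned: %s\n", cleanedData)
}

// normalize maps curly and angle quotation marks to ASCII quotes and
// returns every other rune unchanged.
func normalize(in rune) rune {
	switch in {
	case '“', '‹', '”', '›':
		return '"'
	case '‘', '’':
		return '\''
	}
	return in
}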
Output:
Original: Hello “Frank” or ‹François› as you like to be ‘called’
Cleaned: Hello "Frank" or "François" as you like to be 'called'
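One limitation of strings.Map is that it maps one rune to exactly one rune, so it cannot expand a character such as the ellipsis into three dots. A minimal sketch of a table-driven alternative using strings.NewReplacer follows. The replacement table and the name toASCII are assumptions for illustration, not an exhaustive or standard mapping, and accented letters such as ç are deliberately left untouched.

```go
package main

import (
	"fmt"
	"strings"
)

// asciiReplacer holds an illustrative (not exhaustive) table of typographic
// characters and their ASCII stand-ins. Pairs are old, new.
var asciiReplacer = strings.NewReplacer(
	"\u201c", `"`, // left double quotation mark
	"\u201d", `"`, // right double quotation mark
	"\u2018", "'", // left single quotation mark
	"\u2019", "'", // right single quotation mark
	"\u2013", "-", // en dash
	"\u2014", "-", // em dash
	"\u2026", "...", // horizontal ellipsis, expands to three bytes
)

// toASCII is a hypothetical helper that applies the table above.
// Characters that are not in the table are returned unchanged.
func toASCII(s string) string {
	return asciiReplacer.Replace(s)
}

func main() {
	fmt.Println(toASCII("\u201cWait\u2026\u201d she said, \u2018fine\u2019"))
}
```

A Replacer is built once and can be reused for every input, which keeps the table in one place as more characters are added.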

Related

more than one character in rune literal

I have a string as just MyString and I want to append in this data something like this:
MYString ("1", "a"), ("1", "b") //END result
My code is something like this:
query := "MyString";
array := []string{"a", "b"}
for i , v := range array{
id := "1"
fmt.Println(v,i)
query += '("{}", "{}"), '.format(id, v)
}
but I am getting two errors:
./prog.go:15:23: more than one character in rune literal
./prog.go:15:39: '\u0000'.format undefined (type rune has no field or method format)
You can't use single quotes for strings in Go. You can only use double quotes or backticks.
Single quotes are used for single characters, called runes.
Change your line to:
query += "(\"{}\", \"{}\"), ".format(id, v)
or
query += `("{}", "{}"), `.format(id, v)
However, Go is not Python. Go strings don't have a format method like that, but the fmt package has fmt.Sprintf.
So to really fix it, use:
query = fmt.Sprintf(`%s("%s", "%s"), `, query, id, v)
The issue here is the single quotes: the Go compiler expects exactly one character between single quotes. Use double quotes with backslash escapes instead, as shown in the example above.
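Putting the pieces together, here is a minimal sketch of the whole loop. The helper name buildQuery and its parameters are hypothetical. It collects the tuples in a slice and joins them with strings.Join, which avoids the trailing ", " that appending inside the loop would leave behind.

```go
package main

import (
	"fmt"
	"strings"
)

// buildQuery is a hypothetical helper: it appends one ("id", "value")
// tuple per value to prefix, separated by ", ".
func buildQuery(prefix, id string, values []string) string {
	tuples := make([]string, 0, len(values))
	for _, v := range values {
		// A raw string literal lets the double quotes appear unescaped.
		tuples = append(tuples, fmt.Sprintf(`("%s", "%s")`, id, v))
	}
	return prefix + " " + strings.Join(tuples, ", ")
}

func main() {
	fmt.Println(buildQuery("MyString", "1", []string{"a", "b"}))
	// prints: MyString ("1", "a"), ("1", "b")
}
```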

How to escape a string with single quotes

I am trying to unquote a string that uses single quotes in Go (the syntax is same as Go string literal syntax but using single quotes not double quotes):
'\'"Hello,\nworld!\r\n\U0001F60ANice to meet you!\nFirst Name\tJohn\nLast Name\tDoe\n'
should become
'"Hello,
world!
😊Nice to meet you!
First Name John
Last Name Doe
How do I accomplish this?
strconv.Unquote doesn't work on \n newlines (https://github.com/golang/go/issues/15893 and https://golang.org/pkg/strconv/#Unquote), and simply calling strings.ReplaceAll would be a pain, because it would have to support every Unicode code point and the other backslash escapes such as \n, \r, and \t.
I may be asking for too much, but it would be nice if it automatically validates the Unicode like how strconv.Unquote might be able to do/is doing (it knows that x Unicode code points may become one character), since I can do the same with unicode/utf8.ValidString.
@CeriseLimón came up with this answer, and I just put it into a function with some extra work to support \n escapes. First, it swaps ' and " and changes each \n escape into an actual newline. Then it calls strconv.Unquote on each line, since strconv.Unquote cannot handle literal newlines, swaps ' and " back, and pieces the lines together.
func unquote(s string) (string, error) {
	// Swap ' and " so each line can be parsed as a double-quoted Go
	// literal, and turn \n escapes into real newlines.
	replaced := strings.NewReplacer(
		`'`, `"`,
		`"`, `'`,
		`\n`, "\n",
	).Replace(s[1 : len(s)-1])
	unquoted := ""
	for _, line := range strings.Split(replaced, "\n") {
		// strconv.Unquote rejects literal newlines, so unquote line by line.
		tmp, err := strconv.Unquote(`"` + line + `"`)
		if err != nil {
			return "", err
		}
		unquoted += tmp + "\n"
	}
	// Drop the trailing newline and swap the quotes back.
	return strings.NewReplacer(
		`"`, `'`,
		`'`, `"`,
	).Replace(unquoted[:len(unquoted)-1]), nil
}
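A shorter variant may be possible. strconv.Unquote rejects literal newline characters in its input, but it does decode backslash escapes such as \n, \t, and \U0001F60A inside a double-quoted literal. Assuming the input contains only escapes and no raw newlines, the line-by-line step can be skipped. The names swapQuotes and unquoteSingle below are hypothetical, and this is a sketch rather than a drop-in replacement.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// swapQuotes exchanges every ' with " and every " with '.
var swapQuotes = strings.NewReplacer(`'`, `"`, `"`, `'`)

// unquoteSingle is a hypothetical helper that interprets s as a Go string
// literal delimited by single quotes instead of double quotes.
func unquoteSingle(s string) (string, error) {
	if len(s) < 2 || s[0] != '\'' || s[len(s)-1] != '\'' {
		return "", fmt.Errorf("not a single-quoted literal: %q", s)
	}
	// After the swap the text is an ordinary double-quoted Go literal,
	// so strconv.Unquote can decode all of its escapes in one call.
	out, err := strconv.Unquote(swapQuotes.Replace(s))
	if err != nil {
		return "", err
	}
	// Undo the swap on the decoded text.
	return swapQuotes.Replace(out), nil
}

func main() {
	out, err := unquoteSingle(`'\'"Hello,\nworld!\r\n\U0001F60ANice to meet you!\tJohn'`)
	fmt.Printf("%q %v\n", out, err)
}
```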

Is there a more efficient way to handle string escaping in this function?

I'm migrating some existing code from another language. In the following function it's more or less a 1-1 migration, but given the newness of the language to me I'd like to know if there's better / more efficient ways to handle how the escaped string gets built:
func influxEscape(str string) string {
	var chars = map[string]bool{
		"\\": true,
		"\"": true,
		",":  true,
		"=":  true,
		" ":  true,
	}
	var escapeStr = ""
	for i := 0; i < len(str); i++ {
		var char = string(str[i])
		if chars[char] == true {
			escapeStr += "\\" + char
		} else {
			escapeStr += char
		}
	}
	return escapeStr
}
This code performs escaping to make string values compatible with the InfluxDB line protocol.
This should be a comment, but it needs too much room for that.
One more thing to consider—which I mentioned in a comment on Burak Serdar's answer—is what happens when your input string is not valid UTF-8.
Remember that a Go string is a byte sequence. It need not be valid Unicode. It may be intended to represent valid Unicode, or it may not. For instance, it could be ISO-Latin-1 or something else that might not play well with UTF-8.
If it is not valid UTF-8, a range loop over it will translate each invalid byte to the invalid rune, utf8.RuneError (U+FFFD). (See the linked Go blog post.) If it is intended to be valid UTF-8, this may be a plus, and of course, you can check for the resulting RuneError.
Your original loop leaves characters above ASCII DEL (127 or 0x7f) alone. If the bytes in the string are something like ISO-Latin-1, this may be the correct behavior. If not, you may be passing invalid, un-sanitized input to this other program. If you are deliberately sanitizing input, you must find out what kind of input it expects, and do a complete job of sanitizing input.
(I still have scars from being forced to cope with a really poor XML encoder coupled to an old database from some number of jobs ago, so I tend to be extra-cautious here.)
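To make the range-loop behavior concrete, here is a small sketch. The helper name countInvalid is hypothetical. It counts the bytes that a range loop reports as utf8.RuneError because they are not part of a valid UTF-8 sequence, and it distinguishes them from a genuine U+FFFD in the input by checking the decoded width.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// countInvalid reports how many bytes of s are not part of a valid
// UTF-8 sequence.
func countInvalid(s string) int {
	n := 0
	for i, r := range s {
		if r != utf8.RuneError {
			continue
		}
		// A real U+FFFD in the input is 3 bytes wide; a width of 1
		// means the decoder hit an invalid byte.
		if _, size := utf8.DecodeRuneInString(s[i:]); size == 1 {
			n++
		}
	}
	return n
}

func main() {
	latin1 := "Fran\xe7ois" // ISO-Latin-1 bytes, not valid UTF-8
	fmt.Println(utf8.ValidString(latin1), countInvalid(latin1))
	// prints: false 1
}
```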
This should be somewhat equivalent to your code:
func influxEscape(str string) string {
	out := bytes.Buffer{}
	for _, x := range str {
		// Prefix each special character with a backslash.
		if strings.IndexRune(`\",= `, x) != -1 {
			out.WriteRune('\\')
		}
		out.WriteRune(x)
	}
	return out.String()
}
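Another option worth considering is strings.NewReplacer, sketched below. The variable name influxEscaper is an assumption, and the set of escaped characters is simply the one from the question. Because a Replacer works on the string's bytes, any bytes that are not in the table, including invalid UTF-8, pass through unchanged, which matches the behavior of the original byte loop.

```go
package main

import (
	"fmt"
	"strings"
)

// influxEscaper is built once and reused; a *strings.Replacer is safe
// for concurrent use. Pairs are old, new.
var influxEscaper = strings.NewReplacer(
	`\`, `\\`,
	`"`, `\"`,
	`,`, `\,`,
	`=`, `\=`,
	` `, `\ `,
)

// influxEscape prefixes each special character with a backslash.
func influxEscape(s string) string {
	return influxEscaper.Replace(s)
}

func main() {
	fmt.Println(influxEscape(`cpu load,host=a "b"\c`))
	// prints: cpu\ load\,host\=a\ \"b\"\\c
}
```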

How to Print ascii text in go like python does

How can I print ASCII art text in Go the way Python does?
[Two screenshots omitted: "Using python" and "Using Golang".]
The problem is that your text contains a backtick (`), which happens to be the delimiter for Go's raw string literals. The situation is comparable to your Python text containing three consecutive double quotes, which is the delimiter used in your Python code.
I don't see any quick way out of this situation without modifying your ASCII text, because Go has no alternative raw string delimiter the way Python does. You may want to store your ASCII text in a text file and read it from there:
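import (
	// ....
	// ....
	"os"
)

// banner prints the ASCII art stored in ascii.txt.
func banner() {
	// os.ReadFile replaces the deprecated ioutil.ReadFile (Go 1.16+).
	b, err := os.ReadFile("ascii.txt")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}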
If you're OK with a slight modification to the ASCII text source, you can temporarily use another character that isn't used anywhere else in the ASCII text to represent the backtick, and then do a string replacement to put the actual backtick in place. Or you can use fmt.Sprintf to supply the problematic backtick:
ascii := fmt.Sprintf(`....%c88b...`, '`')
fmt.Println(ascii)
// output:
// ....`88b...
Yes, but you have to split out the lines that contain a backtick and put them in standard double-quoted strings:
... +
"888 6(, ` ' " +
...
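The splitting idea can be sketched as follows: keep the bulk of the art in raw string literals and concatenate an interpreted string that holds only the backtick. The constant name art and the fragment of text are placeholders.

```go
package main

import "fmt"

// art splices a backtick between two raw string literals by
// concatenating the interpreted string "`".
const art = `....` + "`" + `88b...`

func main() {
	fmt.Println(art)
}
```

Since all three pieces are constants, the concatenation is evaluated at compile time.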

How to strings.Split on newline?

I'm trying to do the rather simple task of splitting a string by newlines.
This does not work:
temp := strings.Split(result,`\n`)
I also tried ' instead of ` but no luck.
Any ideas?
You have to use "\n".
Splitting on `\n` searches for an actual \ followed by n in the text, not the newline byte.
For those of us who at times use Windows, it can help to remember to replace \r\n with \n before splitting:
strings.Split(strings.ReplaceAll(windows, "\r\n", "\n"), "\n")
It does not work because you're using backticks:
Raw string literals are character sequences between back quotes ``. Within the quotes, any character is legal except back quote. The value of a raw string literal is the string composed of the uninterpreted (implicitly UTF-8-encoded) characters between the quotes; in particular, backslashes have no special meaning and the string may contain newlines.
Reference: http://golang.org/ref/spec#String_literals
So, when you're doing
strings.Split(result,`\n`)
you're actually splitting on the two consecutive characters "\" and "n", not on the newline character "\n". To do what you want, simply use "\n" instead of backticks.
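The difference is easy to check. In this small sketch the helper name countParts is hypothetical. It shows that the raw literal is two bytes long and never matches a real newline, while the interpreted literal is a single newline byte.

```go
package main

import (
	"fmt"
	"strings"
)

// countParts returns how many pieces strings.Split produces.
func countParts(s, sep string) int {
	return len(strings.Split(s, sep))
}

func main() {
	result := "first\nsecond"
	fmt.Println(len(`\n`), len("\n"))     // 2 1
	fmt.Println(countParts(result, `\n`)) // 1: no backslash-n pair in the text
	fmt.Println(countParts(result, "\n")) // 2: split at the newline
}
```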
Your code doesn't work because you're using backticks instead of double quotes. However, you should be using a bufio.Scanner if you want to support Windows.
import (
	"bufio"
	"strings"
)

func SplitLines(s string) []string {
	var lines []string
	sc := bufio.NewScanner(strings.NewReader(s))
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	return lines
}
Alternatively, you can use strings.FieldsFunc (this approach skips blank lines)
strings.FieldsFunc(s, func(c rune) bool { return c == '\n' || c == '\r' })
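To illustrate the blank-line caveat, the sketch below compares the FieldsFunc approach with replacing \r\n and then calling strings.Split. The function names keepBlank and skipBlank are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// keepBlank normalizes Windows line endings and splits on "\n",
// so empty lines survive as empty strings.
func keepBlank(s string) []string {
	return strings.Split(strings.ReplaceAll(s, "\r\n", "\n"), "\n")
}

// skipBlank treats every run of '\r' and '\n' as one separator,
// so empty lines disappear.
func skipBlank(s string) []string {
	return strings.FieldsFunc(s, func(c rune) bool { return c == '\n' || c == '\r' })
}

func main() {
	s := "a\r\n\r\nb"
	fmt.Printf("%q\n", keepBlank(s)) // ["a" "" "b"]
	fmt.Printf("%q\n", skipBlank(s)) // ["a" "b"]
}
```

Which one is appropriate depends on whether blank lines carry meaning in the input.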
import "regexp"
var lines []string = regexp.MustCompile("\r?\n").Split(inputString, -1)
MustCompile() creates a regular expression that matches both \r\n and \n.
Split() performs the split; the second argument sets the maximum number of parts, with -1 meaning unlimited.
Single quotes don't work because '\n' is a rune, not a string.
temp := strings.Split(result,'\n')
go compiler: cannot use '\u000a' (type rune) as type string in argument to strings.Split
definition: Split(s, sep string) []string

Resources