GO encoding/decoding - go

I'm working with Python, but now I need to fix a Go bug.
I have a string like:
<!-- \\xd0\\xbf\\xd0\\xbb\\xd0\\xb0\\xd1\\x82\\xd0\\xb5\\xd0\\xb6\\xd0\\xb5\\xd0\\xb9-->\\n \\n \\n <guarantees>\\n
How do I make it correct and readable?
If it were Python, I would use decode('unicode-escape'). But what should I use in Go?
Update
I've edited the description. There are double backslashes.
Update 1
I followed the advice from the answer https://stackoverflow.com/a/67172057/11029221, and repaired the part of the code that is doing the encoding in such a wrong way.
But I found out that in Go you can fix such text like this:
package main

import (
    "fmt"
    "strconv"
    "strings"
    "unicode/utf8"
)

func main() {
    a := `\\xd0\\xb5\\xd0\\xb6\\xd0\\xb5\\xd0\\xb9-->\\n\\n\\n<guarantees>\\n`
    a = strconv.Quote(a)                   // wrap in quotes, escaping every backslash again
    a = strings.ReplaceAll(a, `\\\\`, `\`) // collapse the doubled-up backslashes
    unquoted, err := strconv.Unquote(a)    // now \xd0 etc. are interpreted as raw bytes
    if err != nil {
        println(err.Error())
    }
    // rebuild the output rune by rune from the end of the UTF-8 bytes
    str := []byte(unquoted)
    out := ""
    for len(str) > 0 {
        r, size := utf8.DecodeLastRune(str)
        out = string(r) + out
        str = str[:len(str)-size]
    }
    fmt.Printf("%s", out)
}

I'm not sure what #melpomene's criteria for "knowing what they are doing" are, but the following solution has worked previously, for example for decoding broken Hebrew text:
("\\u00c3\\u00a4"
.encode('latin-1')
.decode('unicode_escape')
.encode('latin-1')
.decode('utf-8')
)
outputs
'ä'
This works as follows:
The string that contains only the ASCII characters '\', 'u', '0', '0', 'c', etc. is converted to bytes using some not-too-crazy 8-bit encoding (it doesn't really matter which one, as long as it treats ASCII characters properly)
Use a decoder that interprets the '\u00c3' escapes as unicode code point U+00C3 (LATIN CAPITAL LETTER A WITH TILDE, 'Ã'). From the point of view of your code, it's nonsense, but this unicode code point has the right byte representation when again encoded with ISO-8859-1/'latin-1', so...
encode it again with 'latin-1'
Decode it "properly" this time, as UTF-8
Again, same remark as in the linked post: before investing too much energy trying to repair the broken text, you might want to try to repair the part of the code that is doing the encoding in such a strange way.
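For the Go-minded reader, here is a rough equivalent of the recipe above (a sketch of the same idea, not code from the linked answer; it assumes every intermediate code point fits in one byte, which is what the 'latin-1' step relies on):
package main

import (
    "fmt"
    "log"
    "strconv"
)

func main() {
    // Step 2: interpret the \u escapes, giving the mojibake "Ã¤".
    s, err := strconv.Unquote(`"\u00c3\u00a4"`)
    if err != nil {
        log.Fatal(err)
    }
    // Step 3: "encode as latin-1" by hand: each rune below 256 becomes one byte.
    b := make([]byte, 0, len(s))
    for _, r := range s {
        b = append(b, byte(r))
    }
    // Step 4: the recovered bytes are already valid UTF-8.
    fmt.Println(string(b)) // ä
}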

Related

Get string with base-16 (hex) rendering of the bytes of an ASCII string

E.g.
input := "Office"
want := "4f6666696365" // Note: this is a string!!
I know that string literals are stored in UTF-8 already.
What is the easiest way to convert this to a string in UTF-8 representation?
Calling EncodeRune on each character seems too cumbersome.
What you're looking for is a string that contains the hex representation of your input string. That is not UTF-8. (Any string that's valid ASCII is also valid UTF-8.)
In any case, this is how to do what you want:
want := fmt.Sprintf("%x", []byte(input))
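For completeness, encoding/hex gives the same result and may read more clearly (a small sketch reusing the question's variable names):
package main

import (
    "encoding/hex"
    "fmt"
)

func main() {
    input := "Office"
    want := hex.EncodeToString([]byte(input)) // same output as fmt.Sprintf("%x", []byte(input))
    fmt.Println(want)                         // 4f6666696365
}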

How to convert byte array to string in Go [duplicate]

This question already has answers here:
How do I convert [Size]byte to string in Go?
(8 answers)
Closed 2 years ago.
[]byte to string raises an error.
string([]byte[:n]) raises an error too.
By the way, the concrete case is converting a sha1 value to a string for a filename.
Does it need UTF-8 or any other encoding set explicitly?
Thanks!
The easiest method I use to convert bytes to a string is:
myString := string(myBytes[:])
The easiest way to convert []byte to string in Go:
myString := string(myBytes)
Note: to convert a "sha1 value to string" like you're asking, it needs to be encoded first, since a hash is binary. The traditional encoding for SHA hashes is hex (import "encoding/hex"):
myString := hex.EncodeToString(sha1bytes)
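Putting the two together for the filename use case (a sketch; the input bytes are made up, and sha1bytes above corresponds to sum[:] here):
package main

import (
    "crypto/sha1"
    "encoding/hex"
    "fmt"
)

func main() {
    sum := sha1.Sum([]byte("some file contents")) // sum is a [20]byte of raw binary
    name := hex.EncodeToString(sum[:])            // 40 hex characters, safe for a filename
    fmt.Println(name)
}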
In Go you convert a byte slice (UTF-8) to a string by doing string(bytes), so in your example it should be string(byte[:n]), assuming byte is a slice of bytes.
I am not sure that I understand the question correctly, but maybe:
var ab20 [20]byte = sha1.Sum([]byte("filename.txt"))
var sx16 string = fmt.Sprintf("%x", ab20)
fmt.Print(sx16)
https://play.golang.org/p/haChjjsH0-
ToBe := [6]byte{65, 66, 67, 226, 130, 172}
s := ToBe[:3]
// this will work
fmt.Printf("%s", string(s))
// this will not compile
fmt.Printf("%s", string(ToBe))
Difference : ToBe is an array whereas s is a slice.
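As a follow-up sketch (not from the original answer): slicing the array first makes the conversion legal.
package main

import "fmt"

func main() {
    ToBe := [6]byte{65, 66, 67, 226, 130, 172}
    fmt.Printf("%s\n", string(ToBe[:])) // ABC€ (226 130 172 is the UTF-8 encoding of €)
}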
First, you're getting all these negative reviews because you didn't provide any code.
Second, without a good example, this is what I'd do:
var buf bytes.Buffer
buf.Write(myBytes) // myBytes is the []byte you want to convert
myString := buf.String()
buf.Reset() // reset the buffer to reuse it later
or better yet
myString := string(someByteArray[:n])
see here; also see #JimB's comment
That being said, if you want help that targets your program, please provide an example of what you've tried, the expected results, and the error.
We can only guess what is wrong with your code because no meaningful example is provided. But first, what I see is that string([]byte[:n]) is not valid at all. []byte[:n] is not a valid expression, because no memory is allocated for the array. Since a byte slice can be converted to a string directly, I assume that you just have a syntax error.
The shortest valid example is fmt.Println(string([]byte{'g', 'o'}))
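To illustrate the pattern the question seems to be circling (a sketch; the strings.NewReader is a stand-in for whatever io.Reader actually produces the bytes):
package main

import (
    "fmt"
    "strings"
)

func main() {
    r := strings.NewReader("hello") // stand-in byte source
    buf := make([]byte, 64)
    n, err := r.Read(buf)
    if err != nil {
        fmt.Println(err)
        return
    }
    s := string(buf[:n]) // convert only the n bytes actually read
    fmt.Println(s)       // hello
}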

Visual Works smalltalk, how to convert Ascii values to characters

Using VisualWorks, in Smalltalk, I'm receiving a string like '31323334' from a network connection.
I need a string that reads '1234', so I need a way of extracting two characters at a time, converting them to what they represent in ASCII, and then building a string of them...
Is there a way to do so?
EDIT (7/24): for some reason many of you are assuming I will only be working with numbers and could just truncate 3s or read every other char. This is not the case; the strings read could include any keys on the US standard keyboard (a-z, A-Z, 0-9, punctuation such as {}*&^%$...).
Following along the lines of what Max started to suggest:
x := '31323334'.
in := ReadStream on: x.
out := WriteStream on: String new.
[ in atEnd ] whileFalse: [ out nextPut: (in next digitValue * 16 + (in next digitValue)) asCharacter ].
newX := out contents.
newX will have the result '1234'. Or, if you start with:
x := '454647'
You will get a result of 'EFG'.
Note that digitValue might only recognize upper case hex digits, so an asUppercase may be needed on the string before processing.
There is usually a #fold: or #reduce: method that will let you do that. In Pharo there are also the messages #allPairsDo: and #groupsOf:atATimeCollect:. Using one of these methods you could do:
| collectionOfBytes |
collectionOfBytes := '9798'
    groupsOf: 2
    atATimeCollect: [ :group |
        (group first digitValue * 10) + (group second digitValue) ].
collectionOfBytes asByteArray asString "--> 'ab'"
The #digitValue message in Pharo simply returns the value of the digit for numerical characters.
If you're receiving the data on a stream you could replace #groupsOf:atATime: with a loop (result may be any collection that you then convert to a string like above):
...
[ stream atEnd ] whileFalse: [
    result add: (stream next digitValue * 10) + (stream next digitValue) ]
...
In Smalltalk/X, there is a method called "fromHexBytes:" which the ByteArray class understands. I am not sure, but I think something similar exists in other ST dialects.
If present, you can solve this with:
(ByteArray fromHexString:'68656C6C6F31323334') asString
and the reverse would be:
'hello1234' asByteArray hexPrintString
Another possible solution is to read the string as a hex number,
fetch the digitBytes (which should give you a byte array) and then convert that to a string.
I.e.
(Integer readFrom:'68656C6C6F31323334' radix:16)
digitBytes asString
One problem with that is that I am not sure which byte order you will get the digitBytes in (LSB or MSB), and whether that is defined to be the same across architectures or converted at image-loading time to use the native order. So it may be required to reverse the string at the end (to be portable, it may even be required to reverse it conditionally, depending on the endianness of the system).
I cannot test this on VisualWorks, but I assume it should work fine there, too.

Random byte being added when using string()

I am attempting to XOR two values. If I do, I can get the right result; however, using string() on it results in a random byte being added to it!
Can anyone explain this?
Here's a playground: http://play.golang.org/p/tIOOjqo_Fe
So, you have:
z := 175 // 0xaf
That is the unicode code point for the character: ¯
The following line of code will then take the value, treat it as a Unicode code point (rune), and turn it into a UTF-8 encoded string:
out := string(z)
In UTF-8 encoding, that character is represented by two bytes: []byte{0xc2, 0xaf}
So, the bytes you see are the Go string's UTF-8 encoding.
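To make the distinction concrete, a small sketch (not the playground code from the question):
package main

import "fmt"

func main() {
    z := 175
    fmt.Printf("% x\n", string(rune(z))) // c2 af: the UTF-8 encoding of U+00AF
    fmt.Printf("% x\n", []byte{byte(z)}) // af: the single byte value, if that is what you actually want
}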

How to remove these kind of symbols (junk) from string?

Imagine I have a string in C#: "I Don’t see ya.."
I want to remove (replace with nothing, etc.) these "’" symbols.
How do I do this?
That 'junk' looks a lot like someone interpreted UTF-8 data as ISO 8859-1 or Windows-1252, probably repeatedly.
’ is the sequence C3 A2, E2 82 AC, E2 84 A2.
UTF-8 C3 A2 = U+00E2 = â
UTF-8 E2 82 AC = U+20AC = €
UTF-8 E2 84 A2 = U+2122 = ™
We then do it again: in Windows 1252 this sequence is E2 80 99, so the character should have been U+2019, RIGHT SINGLE QUOTATION MARK (’)
You could make multiple passes with byte arrays, Encoding.UTF8 and Encoding.GetEncoding(1252) to correctly turn the junk back into what was originally entered. You will need to check your processing to find the two places that UTF-8 data was incorrectly interpreted as Windows-1252.
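The thread is C#, but the repair itself is language-agnostic. Here is a hedged sketch of one "undo" pass in Go (using golang.org/x/text/encoding/charmap; how many passes you need depends on how many times the text was mangled):
package main

import (
    "fmt"
    "log"
    "unicode/utf8"

    "golang.org/x/text/encoding/charmap"
)

// undoOnce re-encodes mis-decoded text back to Windows-1252 bytes and
// reinterprets those bytes as UTF-8 (Go strings are just bytes).
func undoOnce(s string) (string, error) {
    raw, err := charmap.Windows1252.NewEncoder().String(s)
    if err != nil {
        return "", err
    }
    if !utf8.ValidString(raw) {
        return "", fmt.Errorf("not valid UTF-8 after this pass")
    }
    return raw, nil
}

func main() {
    s := "’"
    for i := 0; i < 2; i++ { // mangled twice, so undo twice
        var err error
        if s, err = undoOnce(s); err != nil {
            log.Fatal(err)
        }
    }
    fmt.Println(s) // ’
}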
"I Don’t see ya..".Replace( "’", string.Empty);
How did that junk get in there the first place? That's the real question.
By removing any non-Latin character you'll be intentionally breaking some internationalization support.
Don't forget the poor guy whose name has an "â" in it.
This looks disturbingly similar to a character-encoding issue dealing with the Windows character set being stored in a database using the standard character encoding. I see someone voted Will down, but he has a point. You may be solving the immediate issue, but the combinations of characters are limitless if this is the underlying problem.
If you really have to do this, regular expressions are probably the best solution.
I would strongly recommend that you think about why you have to do this, though. At least some of the characters you're listing as undesirable are perfectly valid and useful in other languages, and just filtering them out will most likely annoy at least some of your international users. As a Swede, I can't emphasize enough how much I hate systems that can't handle our å, ä and ö characters correctly.
Consider Regex.Replace(your_string, regex, "") - that's what I use.
Test each character in turn to see if it is a valid alphabetic or numeric character, and if not, remove it from the string. The character test is very simple; just use...
char.IsLetterOrDigit;
Note that there are various others, such as...
char.IsSymbol;
char.IsControl;
Regex.Replace("The string", "[^a-zA-Z ]","");
That's how you'd do it in C#, although that regular expression ([^a-zA-Z ]) should work in most languages.
[Edited: forgot the space in the regex]
The ASCII / integer codes for these characters would be out of the normal alphabetic ranges. Seek and replace with empty characters. String has a Replace method, I believe.
Either use a blacklist of stuff you do not want, or preferably a whitelist (set). With a whitelist you iterate over the string and only copy the letters that are in your whitelist to the result string. You said remove, and the way you do that is by having two pointers, one you read from (R) and one you write to (W):
I Donââ‚
W R
If the comma is in your whitelist, then in this case you would read the comma and write it where à is, then advance both pointers. UTF-8 is a multi-byte encoding, so advancing the pointer may not just be adding 1 to the address.
With C, an easy way to get a whitelist is by using one of the predefined functions (or macros): isalnum, isalpha, isascii, isblank, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, isxdigit. In this case you end up with a whitelist function instead of a set, of course.
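A sketch of the same whitelist idea in Go (the thread is C#, but the loop is the same; adjust the predicate to the characters you actually want to keep):
package main

import (
    "fmt"
    "strings"
    "unicode"
)

// keepAllowed copies only whitelisted runes into the output.
func keepAllowed(s string) string {
    var b strings.Builder
    for _, r := range s {
        if r < 128 && (unicode.IsLetter(r) || unicode.IsDigit(r) || unicode.IsSpace(r) || unicode.IsPunct(r)) {
            b.WriteRune(r)
        }
    }
    return b.String()
}

func main() {
    fmt.Println(keepAllowed("I Don’t see ya..")) // I Dont see ya..
}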
Usually when I see data like you have, I look for memory corruption, or evidence to suggest that the encoding I expect is different from the one the data was entered with.
/Allan
I had the same problem with extraneous junk thrown in by Adobe in an EXIF dump. I spent an hour looking for a straight answer and trying numerous half-baked suggestions which did not work here.
This thread more than most I have read was replete with deep, probing questions like 'how did it get there?', 'what if somebody has this character in their name?', 'are you sure you want to break internationalization?'.
There were some impressive displays of erudition positing how this junk could have gotten here and explaining the evolution of the various character encoding schemes. The person wanted to know how to remove it, not how it came to be or what the standards orgs are up to, interesting as this trivia may be.
I wrote a tiny program which gave me the right answer. Instead of paraphrasing the main concept, here is the entire, self-contained, working (at least on my system) program and the output I used to nuke the junk:
#!/usr/local/bin/perl -w
# This runs in a dos window and shows the char, integer and hex values
# for the weird chars. Install the HEX values in the REGEXP below until
# the final test line looks normal.
$str = 's: “Brian'; # Nuke the 3 weird chars in front of Brian.
@str = split(//, $str);
printf("len str '$str' = %d, scalar \@str = %d\n",
    length $str, scalar @str);
$ii = -1;
foreach $c (@str) {
    $ii++;
    printf("$ii) char '$c', ord=%03d, hex='%s'\n",
        ord($c), unpack("H*", $c));
}
# Take the hex characters shown above, plug them into the below regexp
# until the junk disappears!
($s2 = $str) =~ s/[\xE2\x80\x9C]//g; # << Insert HEX values HERE
print("S2=>$s2<\n"); # Final test
Result:
M:\new\6s-2014.1031-nef.halloween>nuke_junk.pl
len str 's: GÇ£Brian' = 11, scalar @str = 11
0) char 's', ord=115, hex='73'
1) char ':', ord=058, hex='3a'
2) char ' ', ord=032, hex='20'
3) char 'G', ord=226, hex='e2'
4) char 'Ç', ord=128, hex='80'
5) char '£', ord=156, hex='9c'
6) char 'B', ord=066, hex='42'
7) char 'r', ord=114, hex='72'
8) char 'i', ord=105, hex='69'
9) char 'a', ord=097, hex='61'
10) char 'n', ord=110, hex='6e'
S2=>s: Brian<
It's NORMAL!!!
One other actionable, working suggestion I ran across:
iconv -c -t ASCII < 6s-2014.1031-238246.halloween.exf.dif > exf.ascii.dif
If a string contains junk characters like these, this is a good way to remove them:
string InputString = "This is grate kingdom¢Ã‚¬â";
string replace = "’";
string OutputString = Regex.Replace(InputString, replace, "");
// OutputString now holds the cleaned result
It works well for me; thanks for taking a look at this.
