Print a UTF-8-encoded smiley

I am writing a ReactionRoles Discord bot in Python (discord.py).
This bot saves the ReactionRoles smileys UTF-8-encoded.
The encoded value is of type bytes, but it is converted to str before it is saved.
The string looks something like "b'\\xf0\\x9f\\x98\\x82'".
I am using EMOJI_ENCODED = str(EMOJI.encode('utf8')) to encode it, but bytes(EMOJI_ENCODED).decode('utf8') isn't working.
Do you know how to decode it or how to save it in a better way?

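The output of str() is a Unicode string, and EMOJI is already a Unicode string. str(EMOJI.encode('utf8')) does not give you the bytes; it gives you the printable representation of the bytes object (the b'...' literal) as a mangled Unicode string.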
The purpose of encoding is to make a byte string that can be saved to a file/database/socket. Simply do b = EMOJI.encode() (default is UTF-8) to get a byte string and s = b.decode() to get the Unicode string back.
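A minimal sketch of the round trip described above. The last step, which uses ast.literal_eval to recover values that were already saved as str(bytes), is an addition not taken from the answer; the variable names are illustrative.

```python
import ast

EMOJI = "\U0001F602"  # the emoji whose UTF-8 bytes appear in the question

# Round trip: str -> bytes -> str (UTF-8 is the default for both calls).
encoded = EMOJI.encode()
decoded = encoded.decode()

# What the question stored: the repr of the bytes object, as text.
mangled = str(encoded)

# Recovering an already-saved value: parse the b'...' literal back into
# a bytes object, then decode it. (Assumption: the stored text is a
# well-formed bytes literal produced by str().)
recovered = ast.literal_eval(mangled).decode()
```

For new data it is simpler to store either the emoji string itself or the raw bytes, so that no parsing step is needed.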

Related

How to decode/encode ASCII_8BIT in Scala?

I have an external system written in Ruby that sends data over the wire encoded as ASCII_8BIT. How should I decode and encode it in Scala?
I couldn't find a library for decoding and encoding ASCII_8BIT strings in Scala.
As I understand it, ASCII_8BIT is something similar to Base64. However, there is more than one Base64 encoding. Which one should I use to be sure I cover all corner cases?
What is ASCII-8BIT?
ASCII-8BIT is Ruby's binary encoding (the name "BINARY" is accepted as an alias for "ASCII-8BIT" when specifying the name of an encoding). It is used both for binary data and for text whose real encoding you don't know.
Any sequence of bytes is a valid string in the ASCII-8BIT encoding, but unlike other 8-bit encodings, only the bytes in the ASCII range are considered printable characters (and of course only those that are printable in ASCII). The bytes in the 128-255 range are considered special characters that don't have a representation in other encodings. So trying to convert an ASCII-8BIT string to any other encoding will fail (or replace the non-ASCII characters with question marks, depending on the options you give to encode) unless it contains only ASCII characters.
What's its equivalent in the Scala/JVM world?
There is no strict equivalent. If you're dealing with binary data, you should be using binary streams that don't have an encoding and aren't treated as containing text.
If you're dealing with text, you'll either need to know (or somehow figure out) its encoding or just arbitrarily pick an 8-bit ASCII-superset encoding. That way non-ASCII characters may come out as the wrong character (if the text was actually encoded with a different encoding), but you won't get any errors because any byte is a valid character. You can then replace the non-ASCII characters with question marks if you want.
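The "pick an 8-bit ASCII superset" approach can be sketched as follows. The sketch is in Python rather than Scala, uses ISO-8859-1 (Latin-1) as the arbitrarily chosen encoding, and the sample bytes are made up for illustration.

```python
# Hypothetical wire data: "Test" followed by two non-ASCII bytes.
raw = bytes([0x54, 0x65, 0x73, 0x74, 0xE9, 0xFF])

# Latin-1 maps every byte 0-255 to a code point, so decoding never fails,
# although the non-ASCII characters may not be the ones the sender meant.
text = raw.decode("latin-1")

# Optionally replace everything outside the ASCII range with a question mark.
ascii_only = "".join(ch if ord(ch) < 128 else "?" for ch in text)

# Because the mapping is one-to-one, the original bytes can be restored.
restored = text.encode("latin-1")
```

On the JVM the same idea applies with the ISO-8859-1 charset.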
What does this have to do with Base64?
Nothing. Base64 is a way to represent binary data as ASCII text. It is not itself a character encoding. Knowing that a string has the character encoding ASCII, ASCII-8BIT or any other encoding doesn't tell you whether it contains Base64 data.
But do note that a Base64 string will consist entirely of ASCII characters (and not just any ASCII characters, but only letters, numbers, +, / and =). So if your string contains any non-ASCII character or any character except the aforementioned, it's not Base64.
Therefore any Base64 string can be represented as ASCII. So if you have an ASCII-8BIT string containing Base64 data in Ruby, you should be able to convert it to ASCII without any problems. If you can't, it's not Base64.

How to properly convert byte array of UTF-16LE chars to utf-8 string in Ruby

I have a Base64 encoded binary of a packet capture.
I want to extract a substring at a certain position of the capture.
I'm doing this in Ruby:
payload_decoded = Base64.decode64(payload)
file_size = payload_decoded[114..115].unpack('S*')[0]
file_fullpath = payload_decoded[124, file_size]
p file_fullpath
This works to some extent. file_size gets an integer with the length I want to extract. I then can extract the correct slice of the byte array. And if I just test this in my Mac's terminal, it displays the string perfectly.
But when this code runs in the application itself, on CentOS 7, every character is displayed followed by a 00 byte (e.g. T displays as T\x00). I guess I could just strip those bytes out of the string, but I would like to avoid that. What would be the most correct way to handle this?
TIA
This seems to get the desired result:
file_fullpath = file_fullpath.force_encoding('UTF-16LE').encode!('UTF-8')
Seems like I first need to "convince" Ruby that the string is UTF-16LE, and only then convert to UTF-8.
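For comparison, the same "read the length, slice, then decode as UTF-16LE" idea can be sketched in Python. The payload layout below (a 2-byte length followed directly by the path) and the sample path are hypothetical; the offsets from the question are not reproduced.

```python
import struct

path = "T:\\file.txt"  # hypothetical file path
encoded_path = path.encode("utf-16-le")

# Hypothetical layout: 2-byte little-endian length, then the UTF-16LE path.
payload = struct.pack("<H", len(encoded_path)) + encoded_path

# Read the length field, slice out the raw bytes, then decode them.
(size,) = struct.unpack("<H", payload[0:2])
raw_path = payload[2:2 + size]
decoded_path = raw_path.decode("utf-16-le")
```

The raw slice starts with T\x00, which is the "suffixed with 00" effect from the question; decoding as UTF-16LE is what turns each two-byte unit into one character.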

Decoded value of base64 string when re-encoded(to base64) gives different output

I'm really confused about something here. When I decode the following base64 string
RgRaIY6KP0EIASwW7bCCVSJXCGJsYWhibGFoWAQAAAAASB1UcmFuc2FjdGlvbmFsOiBQYXNzd29yZCBSZXNldEIKAASKCT9YVegBjVITZXhhbXBsZTAxQGdtYWlsLmNvbQlRBAAAAABE6mh0dHBzOi8vd3d3LmV4YW1wbGUxLmNvbS9wYXNzd29yZC1yZXNldD9pZD02MkZZUFcmaGFzaD0yYmI0OTIwODBmMjJjNTZnNmNhYzAzNDlmNDNmNmRiYiZ1dG1fY2FtcGFpZ249VHJhbnNhY3Rpb25hbCUzQSUyMFBhc3N3b3JkJTIwUmVzZXQmdXRtX3NvdXJjZT1TcGFya1Bvc3QmdXRtX21lZGl1bT1lbWFpbCZ1dG1fdGVybT1UcmFuc2FjdGlvbmFsJTNBJTIwUGFzc3dvcmQlMjBSZXNldCZyZWZlcmVyPTYyRllQV0dReyJlbWFpbF91dWlkIjoiMTQ4MDUyNDU3ODQxMS4yODgyMDc0MDY3Nzc4OTUyODk5NTUzOTkiLCJiaW5kaW5nIjoibm90aWZpY2F0aW9ucyJ9
The output is:
FZ!?A,�U"WwwwwwwwwXHTransactional: Aaaaaaaa Gggggg
?XU恍Rexample01#gmail.com QD駴tps://www.abcdefgh.com/endpoint--name?id=62FYPW&valu=2ffg92080f22c50c6grsd349frtyuio9&utm_campaign=Transactional%3A%20Password%20Reset&utm_source=SparkPost&utm_medium=email&utm_term=Transactional%3A%20Pppp%20Mainn&referer=62FYPWGQ{"email_user":"14785243699632.288207406700815289974144","binding":"notifications"}
But when I try to encode the same output value back to base64 it gives me:
RgRaIY4/QQgBLBb9VSJXCHd3d3d3d3d3WARIHVRyYW5zYWN0aW9uYWw6IEFhYWFhYWFhIEdnZ2dnZwoEij9YVU1SE2V4YW1wbGUwMUBnbWFpbC5jb20JUQRE9HRwczovL3d3dy5hYmNkZWZnaC5jb20vZW5kcG9pbnQtLW5hbWU/aWQ9NjJGWVBXJnZhbHU9MmZmZzkyMDgwZjIyYzUwYzZncnNkMzQ5ZnJ0eXVpbzkmdXRtX2NhbXBhaWduPVRyYW5zYWN0aW9uYWwlM0ElMjBQYXNzd29yZCUyMFJlc2V0JnV0bV9zb3VyY2U9U3BhcmtQb3N0JnV0bV9tZWRpdW09ZW1haWwmdXRtX3Rlcm09VHJhbnNhY3Rpb25hbCUzQSUyMFBwcHAlMjBNYWlubiZyZWZlcmVyPTYyRllQV0dReyJlbWFpbF91c2VyIjoiMTQ3ODUyNDM2OTk2MzIuMjg4MjA3NDA2NzAwODE1Mjg5OTc0MTQ0IiwiYmluZGluZyI6Im5vdGlmaWNhdGlvbnMifQ==
What's happening here?
If decode(base64_A) is giving me x, and encode(x) gives me base64_B, how can I generate base64_A from x or base64_B?
The decoded text you have shown contains non-ASCII characters in it. Base64 encodes binary data only, so you have to take character encodings into account when encoding/decoding text.
When encoding base64, a character string has to be charset-encoded to bytes first, such as with UTF-8, and then those bytes can be encoded with base64.
When decoding base64, the result is bytes, which can then be charset-decoded to a character string.
If you use the wrong charset, you will get incorrect results.
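A small sketch of why the round trip in the question changes the output. It uses only the first 12 characters of the Base64 string from the question (a multiple of 4, so they decode on their own); the choice of UTF-8 with replacement characters as the "wrong" text decoding is an assumption for illustration.

```python
import base64

b64_a = "RgRaIY6KP0EI"  # first 12 characters of the Base64 string in the question
raw = base64.b64decode(b64_a)

# Re-encoding the untouched bytes reproduces the input exactly.
same = base64.b64encode(raw).decode("ascii")

# Treating the binary data as text is lossy: bytes that are not valid
# UTF-8 are replaced, so the re-encoded Base64 differs from the original.
as_text = raw.decode("utf-8", errors="replace")
b64_b = base64.b64encode(as_text.encode("utf-8")).decode("ascii")

# Genuine text survives as long as the same charset is used both ways.
word = "caf\u00e9"
word_b64 = base64.b64encode(word.encode("utf-8"))
word_back = base64.b64decode(word_b64).decode("utf-8")
```

Once the decoded bytes have been displayed or copied as text, base64_A generally cannot be regenerated; keep the decoded value as bytes if it needs to be re-encoded.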

How do I convert a string of utf-8 codes into the actual string

I am writing a little tool that emulates a user on a webpage, and I want to process the information I get. The tool is written in Ruby (version > 2.0).
Unfortunately I don't receive the string itself but a UTF-8 representation of the actual string. Something like this:
"\u0000\t\u0000\t\u0000N\u0000\t\u0000N\u0000\t\u0000\n"
How do I convert this string back into a "normal" string?
Thank you

Convert a string (representing UTF-8 hex) to string

I have a string in UTF-8 hex like this:
s = "0059006F007500720020006300720065006400690074002000680061007300200067006F006E0065002000620065006C006F00770020003500200064006F006C006C006100720073002E00200049006600200079006F00750020006800610076006500200061006E0020004100640064002D004F006E0020006F007200200042006F006E0075007300200079006F007500720020007200650073006F00750072006300650073002000770069006C006C00200077006F0072006B00200075006E00740069006C0020006500780068006100750073007400650064002E00200054006F00200074006F00700020007500700020006E006F007700200076006900730069007400200076006F006400610066006F006E0065002E0063006F002E006E007A002F0074006F007000750070"
I want to convert this into actual UTF-8 string. It should read:
Your credit has gone below 5 dollars. If you have an Add-On or Bonus your resources will work until exhausted. To top up now visit vodafone.co.nz/topup
This works:
s.scan(/.{4}/).map { |a| [a.hex].pack('U') }.join
but I'm wondering if there's a better way to do this: whether I should be using Encoding#convert.
The extra 00s suggest that the string is actually the hex representation of a UTF-16 string rather than UTF-8. Assuming that is the case, there are three steps to get a UTF-8 string: first, convert the string into the actual bytes the hex digits represent (Array#pack can be used for this); second, mark it as being in the appropriate encoding with force_encoding (it looks like UTF-16BE); finally, use encode to convert it to UTF-8:
[s].pack('H*').force_encoding('utf-16be').encode('utf-8')
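For readers working in Python rather than Ruby, the same three steps (hex digits to bytes, interpret as UTF-16BE, re-encode as UTF-8) can be sketched like this. Only the first 16 hex digits of the string from the question are used.

```python
s = "0059006F00750072"  # first 16 hex digits of the string in the question

# Step 1: turn the hex digits into the bytes they represent.
raw = bytes.fromhex(s)

# Step 2: interpret those bytes as big-endian UTF-16.
text = raw.decode("utf-16-be")

# Step 3: encode the resulting text as UTF-8.
utf8_bytes = text.encode("utf-8")
```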
I think there are extra null characters all along the string (it's valid, but wasteful), but you can try:
[s].pack('H*').force_encoding('utf-8')
Even so, the result appears to read "Your credit has gone below 5 dollars"...
The string prints with puts, but I can't read all the Unicode characters on the terminal when the string is dumped.
If you are intending to use this on other oddly encoded strings, you could unpad the leading bytes:
[s.gsub(/..(..)/,'\1')].pack('H*')
Or use them:
s.gsub(/..../){|p|p.hex.chr}
If you want to use Encoding::Converter
ec = Encoding::Converter.new('UTF-16BE','UTF-8') # save converter for reuse
ec.convert( [s].pack('H*') ) # or: ec.convert [s].pack'H*'
