How do I remove non-UTF-8 characters from a String? - ruby

I need to remove non-UTF-8 characters from a string. Here is a snapshot of the text.
This is what it looks like when I open the string in NPP and then set the encoding to UTF-8:
I think the ACK and FF are non-UTF-8 characters.
I tried str.scrub as well as str.encode. Neither of them seems to work: scrub returns the same result, and encode raises an error.

We have a few problems.
The biggest is that a Ruby String stores arbitrary bytes along with a supposed encoding, with no guarantee that the bytes are valid in that encoding and with no obvious reason for that encoding to have been chosen. (I might be biased as a heavy user of Python 3. We would never speak of "changing a string from one encoding to another".)
Fortunately, the editor did not eat your post, but it's hard to see that. I'm guessing that you decoded the string as Windows-1252 in order to display it, which only obscures the issue.
Here's your string of bytes as I see it:
>> s = "\x06-~$A\xA7ruG\xF9\"\x9A\f\xB6/K".b
=> "\x06-~$A\xA7ruG\xF9\"\x9A\f\xB6/K"
>> s.bytes
=> [6, 45, 126, 36, 65, 167, 114, 117, 71, 249, 34, 154, 12, 182, 47, 75]
And it does contain bytes that are not valid UTF-8.
>> s.encoding
=> #<Encoding:ASCII-8BIT>
>> String::new(s).force_encoding(Encoding::UTF_8).valid_encoding?
=> false
We can ask to decode this as UTF-8 and insert � where we encounter bytes that are not valid UTF-8:
>> s.encode('utf-8', 'binary', :undef => :replace)
=> "\u0006-~$A�ruG�\"�\f�/K"

Related

Bytes vs codepoints in ruby

What is the difference between the Ruby string methods codepoints and bytes?
'abcd'.bytes
=> [97, 98, 99, 100]
'abcd'.codepoints
=> [97, 98, 99, 100]
bytes returns individual bytes, regardless of character size, whereas codepoints returns Unicode code points.
s = '日本語'
s.bytes # => [230, 151, 165, 230, 156, 172, 232, 170, 158]
s.codepoints # => [26085, 26412, 35486]
s.chars # => ["日", "本", "語"]
I see where your confusion comes from. Ruby uses UTF-8 encoding by default now, and UTF-8 was specifically designed so that its first code points (0-127) are exactly the same as in the ASCII encoding. ASCII is an encoding with one-byte characters, so in the examples in your question the methods bytes and codepoints return the same values, coincidentally.
So, if you need to break a string into characters, use either chars or codepoints (whichever is appropriate for your use case). Use bytes only when you treat the string as an opaque binary blob, not as text.
Actually, chars (suggested above) might not be accurate enough, since Unicode has the notion of combining characters and modifier letters. If you care about this, you need to use so-called "grapheme clusters". Here's an example (taken from another answer):
s = "a\u0308\u0303\u0323\u032d"
s.bytes # => [97, 204, 136, 204, 131, 204, 163, 204, 173]
s.codepoints # => [97, 776, 771, 803, 813]
s.chars # => ["a", "̈", "̃", "̣", "̭"]
s.grapheme_clusters # => ["ạ̭̈̃"] # rendering of this glyph is kinda broken, which illustrates the point that unicode is hard
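If grapheme_clusters isn't available (it appeared in Ruby 2.5), the same grouping can be had from the regexp engine, whose \X escape matches one grapheme cluster. A small sketch under that assumption:
s = "a\u0308\u0303\u0323\u032d"
s.scan(/\X/)                  # => ["ạ̭̈̃"]   same grouping as grapheme_clusters
s.each_grapheme_cluster.to_a  # => ["ạ̭̈̃"]   Ruby >= 2.5, enumerator form of the same thing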

Convert UTF-8 multibyte characters into multiple ascii characters

Can someone help me with converting some (potentially) bogus UTF-8 multibyte characters into ASCII as follows?
\u6162 → ["\x61", "\x62"] → ["a", "b"] → "ab"
My use case is just for fun. I know I'm not compressing anything by representing two ASCII characters in a multibyte character.
I've played around with various versions of unpack, but it never seems to work correctly:
"\u6162".unpack('H*')
# => ["e685a2"]
Force encoding seems to return the same:
"\u6162".force_encoding('US-ASCII')
# => "\xE6\x85\xA2"
"\u6162" is not equivalent to "\x61" + "\x62". \u indicates a Unicode code point which does not translate directly to a hex value. Unicode code point 6162 is 慢.
Because it is a string, and because Ruby uses UTF-8 by default, when you unpack it you get the UTF-8 value of U+6162 which is three bytes: E6 85 A2.
2.2.1 :023 > "\u6162".encoding
=> #<Encoding:UTF-8>
2.2.1 :024 > "\u6162".unpack("A*")
=> ["\xE6\x85\xA2"]
To get what you want, you need its UTF-16 representation, 61 62. But if you just encode as UTF-16 you'll get a byte order mark, FE FF 61 62. So use UTF-16BE (big-endian) to avoid this.
2.2.1 :052 > "\u6162".encode("UTF-16BE").unpack("A*")
=> ["ab"]
"\u6162".codepoints.first.divmod(16 ** 2).map(&:chr).join
# => "ab"

Converting String Object to Bytes and vice-versa using packer/unpacker Ruby

I want to convert a string object to bytes and vice versa.
Is it possible using Ruby's pack/unpack?
I am unable to find the format specifier to use:
pack_object = "Test".pack('x'), where x is the format specifier
unpacked_object = pack_object.unpack('x'), which should result in the "Test" string
String has a bytes method that returns an array of integers:
'Type'.bytes
#=> [84, 121, 112, 101]
The equivalent unpack directive is C* (as already noted by cremno):
'Type'.unpack('C*')
#=> [84, 121, 112, 101]
Or the other way round:
[84, 121, 112, 101].pack('C*')
#=> "Type"
Note that pack returns a string in binary encoding.
Regarding your comment:
The output which I need is the same string which I packed
pack and unpack are counterparts, so you can use all kinds of directives:
'Type'.unpack('b*')
#=> ["00101010100111100000111010100110"]
['00101010100111100000111010100110'].pack('b*')
#=> 'Type'
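Since pack('C*') hands back a binary (ASCII-8BIT) string, a round trip through bytes needs a force_encoding at the end if the original text contained multibyte UTF-8 characters. A small sketch, with a made-up example string:
original = 'Tëst'                           # UTF-8, one multibyte character
bytes    = original.bytes                   # => [84, 195, 171, 115, 116]
packed   = bytes.pack('C*')                 # => "T\xC3\xABst"  (ASCII-8BIT)
packed.force_encoding('UTF-8') == original  # => true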

Explain what those escaped numbers mean in unicode encoding in ruby 1.8.7

0186 is the Unicode "code". Where do 198 and 134 come from? How can I go the other way around, from these byte codes to Unicode strings?
>> c = JSON '["\\u0186"]'
[
[0] "Ɔ"
]
>> c[0][0]
198
>> c[0][1]
134
>> c[0][2]
nil
Another confusing thing is unpack: another seemingly arbitrary number. Where does that come from? Is it even correct? From the 1.8.7 String#unpack documentation:
U | Integer | UTF-8 characters as unsigned integers
>> c[0].unpack('U')
[
[0] 390
]
You can find your answers in the reference entry for Unicode Character 'LATIN CAPITAL LETTER OPEN O' (U+0186):
Note that 0x186 (hexadecimal) is 390 (decimal).
C/C++/Java source code : "\u0186"
UTF-32 (decimal) : 390
UTF-8 (hex) : 0xC6 0x86 (i.e. 198 134)
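To go the other way around, from those numbers back to a string, pack the code point with the U directive (which emits UTF-8) or pack the raw bytes with C*. A sketch that sticks to pack/unpack, so it should behave the same on 1.8.7 and on modern Rubies (except that 1.9+ tags the C* result as ASCII-8BIT, so a force_encoding('UTF-8') is needed there):
[0x0186].pack('U')     # => "Ɔ"           code point -> UTF-8 string
[198, 134].pack('C*')  # => "\xC6\x86"    the same two UTF-8 bytes
"Ɔ".unpack('U*')       # => [390]         UTF-8 string -> code points
"Ɔ".unpack('C*')       # => [198, 134]    UTF-8 string -> raw byte values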
You can read more about UTF-8 encoding on Wikipedia's article on UTF-8.
UTF-8 (UCS Transformation Format — 8-bit[1]) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.

Create binary data using Ruby?

I was playing with Ruby sockets, so I ended up trying to put an IP packet together; I took an IP packet and tried to make a new one just like it.
Now my problem is: the packet is 45 00 00 54 00 00 40 00 40 01 06 e0 7f 00 00 01 7f 00 00 01, and this is obviously hexadecimal, so I converted it into a decimal number, then into binary data using the .pack method, and passed it up to the send method. Then Wireshark shows me something very strange and different from what I created. I am doing something wrong, I know that, but I can't figure it out:
packet = 0x4500005400004000400106e07f0000017f000001 # I converted each 32 bits together, not like I wrote above
data = ""
data << packet.to_s
socket.send(data.unpack(c*).to_s, address)
And is there another way to solve the whole thing? Can I, for example, write the data I want to send directly to the socket buffer?
Thanks in advance.
Starting with a hex Bignum is a novel idea, though I can't immediately think of a good way to exploit it.
Anyway, trouble starts with the .to_s on the Bignum, which has the effect of creating a string with the decimal representation of your number, taking you further from the bits rather than closer. Somehow your c* seems to have lost its quotes, too.
But putting them back, you then unpack the string, which gets you an array of integers: the ASCII values of the digits in the decimal representation of the numeric value of the original hex string. Then you .to_s that (which IO would have done anyway, so no blame there at least), and this results in a string with the printable representation of the ASCII numbers of the unpacked string, so you are now light-years from the original intention.
>> t = 0x4500005400004000400106e07f0000017f000001
=> 393920391770565046624940774228241397739864195073
>> t.to_s
=> "393920391770565046624940774228241397739864195073"
>> t.to_s.unpack('c*')
=> [51, 57, 51, 57, 50, 48, 51, 57, 49, 55, 55, 48, 53, 54, 53, 48, 52, 54, 54, 50, 52, 57, 52, 48, 55, 55, 52, 50, 50, 56, 50, 52, 49, 51, 57, 55, 55, 51, 57, 56, 54, 52, 49, 57, 53, 48, 55, 51]
>> t.to_s.unpack('c*').to_s
=> "515751575048515749555548535453485254545052575248555552505056505249515755555157565452495753485551"
It's kind of interesting in a way. All the information is still there, sort of.
Anyway, you need to make a binary string. Either just << numbers into it:
>> s = ''; s << 1 << 2
=> "\001\002"
Or use Array#pack:
>> [1,2].pack 'c*'
=> "\001\002"
First check your host byte order, because what you see in Wireshark is in network byte order (big-endian). Then, in Wireshark, you will be seeing protocol headers (depending on whether it is a TCP socket or a UDP one) followed by data. You cannot directly send IP packets, so you will see this particular data in that particular packet's data section, i.e. the data section of the TCP/UDP packet.

Resources