How to remove non-printable/invisible characters in ruby? - ruby

Sometimes I have evil non-printable characters in the middle of a string. These strings are user input, so my program has to handle them gracefully rather than try to change the source of the problem.
For example, a string can contain a zero width no-break space in the middle. While parsing a .po file, one problematic part was the string "he is a man of god" in the middle of the file. Although everything seems correct at first glance, inspecting it with irb shows:
"he is a man of god".codepoints
=> [104, 101, 32, 105, 115, 32, 97, 32, 65279, 109, 97, 110, 32, 111, 102, 32, 103, 111, 100]
I believe I know what a BOM is, and I even handle it nicely. However, I sometimes have such characters in the middle of the file, so it is not a BOM.
My current approach is to remove all characters that I found evil in a really smelly fashion:
text = (text.codepoints - CODEPOINTS_BLACKLIST).pack("U*")
The closest I got was following this post, which led me to the [[:print:]] option on regexps. However, it was no good for me:
"m".scan(/[[:print:]]/).join.codepoints
=> [65279, 109]
so the question is: How can I remove all non-printable characters from a string in ruby?

try this:
>> "aaa\f\d\x00abcd".gsub(/[^[:print:]]/, '.')
=> "aaa.d.abcd"

Ruby can help you convert from one multi-byte character set to another. Search around for transcoding options, and read up on Ruby String's encode method.
Also, Ruby's Iconv is your friend.
Finally, James Gray wrote a series of articles covering this in good detail.
One of the things you can do using those tools is to tell them to transcode to a visually similar character, or ignore them completely.
Dealing with alternate character sets is one of the most... irritating things I've ever had to do, because files can contain anything, but be marked as text. You might not expect it and then your code dies or starts throwing errors, because people are so ingenious when coming up with ways to insert alternate characters into content.
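As a rough sketch of the transcoding idea (String#encode with its replacement options; the target encoding and the input string here are just for illustration):
# Transcode to ASCII, replacing anything invalid or unmappable with nothing.
# Deliberately aggressive: legitimate accented letters are dropped too.
dirty = "he is a \u{FEFF}man of god"
clean = dirty.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
clean  # => "he is a man of god"
If you need to keep non-ASCII text, transcoding this way is too blunt, and filtering specific codepoints (as in the other answers) is safer.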

Codepoint 65279 is a zero-width no-break space.
It is commonly used as a byte-order mark (BOM).
You can remove it from a string with:
my_new_string = my_old_string.gsub("\xEF\xBB\xBF".force_encoding("UTF-8"), '')
(Use gsub rather than gsub! here: gsub! returns nil when no substitution is made, which would leave my_new_string as nil.)
A quick way to check whether you have any invisible characters is to compare the string's length with the number of characters you can see in IRB; if the length is higher, you have some.
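For example, sketched with a hypothetical string so the length check is visible:
s = "\u{FEFF}man"
s.length                     # => 4, although only "man" is visible
s.delete("\u{FEFF}")         # => "man" (non-destructive, never returns nil)
s.delete("\u{FEFF}").length  # => 3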

Related

string length displays one character extra - ruby

I am processing a CSV file uploaded by users; the CSV only has one column, with the header row "API".
When I process the CSV, for one of the files I see that
"API".downcase.length displays 4
Could it be an encoding issue? When I do header[0].downcase.bytes for the string I see
[239, 187, 191, 97, 112, 105]
when i do "api".bytes i see
[97, 112, 105]
Any help in understanding why "API".downcase.length in the above example displays 4 would be really great.
I parse the file like this:
CSV.foreach(@file_path, headers: true) do |row|
Thanks.
It looks like in this case the extra character is coming from a BOM (Byte Order Mark). These are hidden characters that are sometimes used to indicate the encoding type of the file.
One way to handle BOM characters is to specify the bom|utf-* encoding when reading the file:
CSV.open(@file_path, "r:bom|utf-8", headers: true)
When bom|utf-* is used, Ruby will check for a Unicode BOM in the input document to help determine the encoding, and if a BOM is found it is stripped out - Ruby's IO docs cover this in more detail.
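For illustration, here is a sketch that checks for a BOM by hand and then reads the file with the bom|utf-8 mode (@file_path and the "API" header are taken from the question):
require "csv"

# A UTF-8 BOM is the byte sequence EF BB BF; peek at the first three bytes.
bom = [0xEF, 0xBB, 0xBF].pack("C*")
has_bom = File.binread(@file_path, 3) == bom

# With "r:bom|utf-8" Ruby strips the BOM, so the first header is "API",
# not "\u{FEFF}API", and row["API"] works as expected.
CSV.open(@file_path, "r:bom|utf-8", headers: true) do |csv|
  csv.each do |row|
    puts row["API"]
  end
end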

ruby 1.8.6 - build a zero-length space char from unicode values

In a ruby 1.8.6 app (please don't tell me to upgrade it), I have a problem character in some text which is being imported, and I want to strip it out.
If I look at the byte values of the character I get this:
(0..3).collect{|n| char[n]}
=> [226, 128, 139, nil]
So, in order to filter it out, I could break every character down into its byte values and see if it matches the above, but that seems quite cumbersome. It would be nicer if I could just do a gsub on the text, but I'm struggling to create the first argument to gsub, the string to replace.
I thought I would be able to build it from the above values, like so, but I get the wrong character:
>> zero_space = [226, 128, 139].pack('U*')
=> "​"
How can I actually define this string/char in variable so that I can use it in a gsub?
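One sketch that should work on 1.8.6: the three numbers are the raw UTF-8 bytes of U+200B ZERO WIDTH SPACE, so packing them as bytes ('C*') rather than as codepoints ('U*') builds the string to strip (text here stands in for the imported text):
# 226, 128, 139 are the UTF-8 *bytes* of U+200B, so pack them as bytes.
zero_space = [226, 128, 139].pack('C*')  # "\342\200\213"
text = text.gsub(zero_space, '')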

What is the difference between these two character encodings: "å" (195, 165) and "å" (97, 204, 138)

Both of these byte sequences seem to render correctly in Chrome and my text editor, but the latter is causing some layout problems in a PDF document.
Here are the byte sequences (in decimal):
å: 195, 165
å: 97, 204, 138
I can see that 195, 165 is the expected sequence for UTF-8: https://en.wikipedia.org/wiki/%C3%85#On_computers
Is 97, 204, 138 also a valid way to encode the character for a UTF-8 string? Or is this a different encoding that just happens to work in some contexts?
I am using the Ruby programming language. Is there any way that I could detect when a user submits this kind of character using the 97, 204, 138 encoding, and safely convert these characters into the 195, 165 encoding?
I have discovered that the first å is a single character called "latin small letter a with ring above".
The second å character is a plain letter "a" followed by the "combining ring above" character, so it's actually two separate characters that are merged together.
I used this service to inspect the characters: https://apps.timwhitlock.info/unicode/inspect
To answer the second part of the question, Ruby has a #unicode_normalize method that will convert the two-character sequence (bytes 97, 204, 138) into the single composed character (bytes 195, 165).
There are multiple ways to normalize Unicode (NFD, NFC, NFKD and NFKC); this article goes into much more detail: Unicode Normalization in Ruby
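A short sketch of that normalization step (String#unicode_normalize exists from Ruby 2.2 onwards, and :nfc is its default form):
decomposed = "a\u030A"            # "a" + COMBINING RING ABOVE
decomposed.bytes                  # => [97, 204, 138]

composed = decomposed.unicode_normalize(:nfc)
composed.bytes                    # => [195, 165]
composed == "\u00E5"              # => true, the single-character å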

Ruby start_with? inconsistency

Please tell me how the first use of start_with? returned false.
Thanks!
Your string may contain a hidden unicode character.
If so, the string starts with that character, not with #, which is why you're getting false.
To see it in Ruby, take the string you're running start_with? on and instead run .unpack('C*'). This will return an array of numbers between 0 and 255, representing the integer values of every byte in the string. Normal printable ASCII characters only go up to 126. Any number higher than that will be a clue that there is a non-printing character hiding in your string.
UPDATE
In this particular case, it turned out that using this diagnostic method showed that there were indeed extra bytes at the beginning of the string. They appeared at the beginning of the array as [239, 187, 191, ...], the string equivalent of which is "\xEF\xBB\xBF", the UTF-8 encoding of the codepoint U+FEFF ZERO WIDTH NO-BREAK SPACE, which is inserted as a byte-order mark at the beginning of a file by some text editors.
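Putting the diagnosis and the fix together in a sketch (the line content is made up; only the leading bytes matter):
line = "\u{FEFF}# heading"                    # a BOM hiding in front of the "#"
line.start_with?("#")                         # => false
line.unpack("C*").first(4)                    # => [239, 187, 191, 35]

line.sub(/\A\u{FEFF}/, "").start_with?("#")   # => true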

Invalid Unicode characters in XCode

I am trying to put Unicode characters (using a custom font) into a string which I then display using Quartz, but XCode doesn't like the escape codes for some reason, and I'm really stuck.
CGContextShowTextAtPoint (context, 15, 15, "\u0066", 1);
It doesn't like this (Latin lowercase f) and says it is an "invalid universal character".
CGContextShowTextAtPoint (context, 15, 15, "\ue118", 1);
It doesn't complain about this but displays nothing. When I open the font in FontForge, it shows the glyph as there and valid. Also Font Book validated the font just fine. If I use the font in TextEdit and put in the Unicode character with the character viewer Unicode table, it appears just fine. Just Quartz won't display it.
Any ideas why this isn't working?
The "invalid universal character" error is due to the definition in C99: Essentially \uNNNN escapes are supposed to allow one programmer to call a variable føø and another programmer (who might not be able to type ø) to refer to it as f\u00F8\u00F8. To make parsing easier for everyone, you can't use a \u escape for a control character or a character that is in the "basic character set" (perhaps a lesson learned from Java's unicode escapes which can do crazy things like ending comments).
The second error is probably because "\ue118" is getting compiled to the UTF-8 sequence "\xee\x8e\x98" (three chars). CGContextShowTextAtPoint() assumes that one char (byte) is one glyph, and CGContextSelectFont() only supports the encodings kCGEncodingMacRoman (which decodes the bytes to "Óéò") and kCGEncodingFontSpecific (what happens is anyone's guess). The docs say not to use CGContextSetFont() (which does not specify the char-to-glyph mapping) in conjunction with CGContextShowText() or CGContextShowTextAtPoint().
If you know the glyph number, you can use CGContextShowGlyphs(), CGContextShowGlyphsAtPoint(), or CGContextShowGlyphsAtPositions().
I just changed the font to use standard alphanumeric characters in the end. Much simpler.
