Remove non-ASCII characters in string from file - ascii

What is the idiomatic way to remove non-ASCII characters from file contents in D?
I tried:
auto s = (cast(string) std.file.read(myPath)).filter!( a => a < 128 ).array;
which gave me:
std.utf.UTFException@C:\D\dmd2\windows\bin\..\..\src\phobos\std\utf.d(1109): Invalid UTF-8 sequence (at index 1)
and s is dstring ; and:
auto s = (cast(string) std.file.read(myPath)).tr("\0-~", "", "cd");
which gives me:
core.exception.UnicodeException@src\rt\util\utf.d(290): invalid UTF-8 sequence
at runtime.
I am trying to parse (with the almost deprecated std.xml module) XML files in an unsupported encoding, but I am OK with removing the offending characters.

If you do anything to consider it a string, D tries to treat it as UTF-8. Instead, treat it as a series of bytes, so replace your cast(string) with cast(ubyte[]) and do the filter.
After reading and filtering it, you can /then/ cast it back into a string. So this should do what you need:
auto s = cast(string) (cast(ubyte[]) std.file.read(myPath)).filter!( a => a < 128 ).array;

Related

Ruby Cyphering Leads to non Alphanumeric Characters [duplicate]

This question already has answers here:
Rotating letters in a string so that each letter is shifted to another letter by n places
(4 answers)
Closed 5 years ago.
I'm trying to make a basic cipher.
def caesar_crypto_encode(text, shift)
(text.nil? or text.strip.empty? ) ? "" : text.gsub(/[a-zA-Z]/){ |cstr|
((cstr.ord)+shift).chr }
end
but when the shift is too high I get these kinds of characters:
Test.assert_equals(caesar_crypto_encode("Hello world!", 127), "eBIIL TLOIA!")
Expected: "eBIIL TLOIA!", instead got: "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
What is this format?
The reason you get the verbose output is because Ruby is running with UTF-8 encoding, and your conversion has just produced gibberish characters (an invalid character sequence under UTF-8 encoding).
ASCII characters A-Z are represented by decimal numbers (ordinals) 65-90, and a-z is 97-122. When you add 127 you push all the characters into 8-bit space, which makes them unrecognizable for proper UTF-8 encoding.
That's why Ruby inspect outputs the encoded strings in quoted form, which shows each character as its hexadecimal number "\xC7...".
If you want to get some semblance of characters out of this, you could re-encode the gibberish into ISO8859-1, which supports 8-bit characters.
Here's what you get if you do that:
s = "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
>> s.encoding
=> #<Encoding:UTF-8>
# Relabel the bytes as ISO8859-1 (force_encoding changes only the encoding tag, not the bytes).
# Your terminal (and Ruby) is using UTF-8, so Ruby will refuse to print these yet.
>> s.force_encoding('iso8859-1')
=> "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3!"
# In order to be able to print ISO8859-1 on an UTF-8 terminal, you have to
# convert them back to UTF-8 by re-encoding. This way your terminal (and Ruby)
# can display the ISO8859-1 8-bit characters using UTF-8 encoding:
>> s.encode('UTF-8')
=> "Çäëëî öîñëã!"
# Another way is just to repack the bytes into UTF-8:
>> s.bytes.pack('U*')
=> "Çäëëî öîñëã!"
Of course, the proper way to do this is not to let the numbers overflow into 8-bit space under any circumstances. Your encryption algorithm has a bug, and you need to ensure that the output stays in the 7-bit ASCII range.
A better solution
Like @tadman suggested, you could use tr instead:
AZ_SEQUENCE = [*'A'..'Z', *'a'..'z']
"Hello world!".tr(AZ_SEQUENCE.join, AZ_SEQUENCE.rotate(127).join)
=> "eBIIL TLOIA!"
I'm still curious about that format though...
Those are the raw bytes you get after taking the ordinal (ord) of each letter, adding 127 to it and converting the result back with chr (i.e. ((cstr.ord)+shift).chr).
Why? Check Integer#chr, from the docs:
Returns a string containing the character represented by the int's
value according to encoding.
So, for example, take your first letter "H":
char_ord = "H".ord
#=> 72
new_char_ord = char_ord + 127
#=> 199
new_char_ord.chr
#=> "\xC7"
So, 199 corresponds to "\xC7". Keep changing all characters in "Hello world" and you will get "\xC7\xE4\xEB\xEB\xEE \xF6\xEE\xF1\xEB\xE3".
To avoid this, wrap the shifted value around so that it stays within the ord values that represent a letter (see the answers in the linked duplicate).
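A minimal sketch of the wrap-around described above, assuming the 52-letter A-Z, a-z sequence from the tr example is the intended alphabet (that assumption is what makes the expected "eBIIL TLOIA!" come out; a classic per-case Caesar shift would use modulo 26 instead). The ALPHABET constant is an illustrative name, not part of the original code:

```ruby
# Assumption: letters rotate through one 52-character sequence, A-Z then a-z.
ALPHABET = [*'A'..'Z', *'a'..'z'].freeze

def caesar_crypto_encode(text, shift)
  return "" if text.nil? || text.strip.empty?

  # Only letters are shifted; the modulo keeps every index inside ALPHABET,
  # so no shift value can produce a byte outside the 7-bit ASCII range.
  text.gsub(/[a-zA-Z]/) do |letter|
    ALPHABET[(ALPHABET.index(letter) + shift) % ALPHABET.size]
  end
end

puts caesar_crypto_encode("Hello world!", 127) # prints eBIIL TLOIA!
```

Because Ruby's % returns a non-negative result for a positive modulus, the same method should also handle negative shifts for decoding.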

Handling encoding in ruby

I have a good string and a bad string.
To handle a bad string I do
bad.encode("iso-8859-1").force_encoding("utf-8")
which makes it readable.
If I do
good.encode("iso-8859-1").force_encoding("utf-8")
I get Encoding::UndefinedConversionError: U+05E2 from UTF-8 to ISO-8859-1
Both the good and the bad strings are in UTF-8 to begin with, but the good strings are readable and the bad ones are, well, bad.
I don't know how to detect whether a string is good or not, and I am trying to find a way to process a string so that it becomes readable in the correct encoding,
something like this:
if needs_fixin?(str)
  str.encode("iso-8859-1").force_encoding("utf-8")
else
  str
end
The only thing I can think of is to catch the exception and skip the encoding-fixing part, but I don't want the code to raise exceptions intentionally,
something like str.try(:encode, "iso-8859-1").force_encoding("utf-8") rescue str
A bad string looks something like
×¢×××× ×¢×¥ ×'××¤×¡× ×פת×ר ×× ××רק××
I suspect your problem is double-encoded strings. This is very bad for various reasons, but the tl;dr here is it's not fully fixable, and you should instead fix the root problem of strings being double-encoded if at all possible.
This produces a double-encoded string with UTF-8 characters:
> str = "汉语 / 漢語"
=> "汉语 / 漢語"
> str.force_encoding("iso-8859-1")
=> "\xE6\xB1\x89\xE8\xAF\xAD / \xE6\xBC\xA2\xE8\xAA\x9E"
> bad = str.force_encoding("iso-8859-1").encode("utf-8")
=> "æ±\u0089语 / æ¼¢èª\u009E"
You can then fix it by reinterpreting the double-encoded UTF-8 as ISO-8859-1 and then declaring the encoding to actually be UTF-8
> bad.encode("iso-8859-1").force_encoding("utf-8")
=> "汉语 / 漢語"
But you can't convert the actual UTF-8 string into ISO-8859-1, since there are codepoints in UTF-8 which ISO-8859-1 doesn't have any unambiguous means of encoding
> str.encode("iso-8859-1")
Encoding::UndefinedConversionError: "\xE6\xB1\x89" from UTF-8 to ISO-8859-1
Now, you can't actually detect and fix this all the time because "there's no way to tell whether the result is from incorrectly double-encoding one character, or correctly single-encoding 2 characters."
So, the best you're left with is a heuristic. Borshuno's suggestion won't work here because it will actually destroy unconvertible bytes:
> str.encode( "iso-8859-1", fallback: lambda{|c| c.force_encoding("utf-8")} )
=> " / "
The best course of action, if at all possible, is to fix your double-encoding issue so that it doesn't happen at all. The next best course of action is to add BOM bytes to your UTF-8 strings if you suspect they may get double-encoded, since you could then check for those bytes and determine whether your string has been re-encoded or not.
> str_bom = "\xEF\xBB\xBF" + str
=> "汉语 / 漢語"
> str_bom.start_with?("\xEF\xBB\xBF")
=> true
> str_bom.force_encoding("iso-8859-1").encode("utf-8").start_with?("\xEF\xBB\xBF")
=> false
If you can presume that the BOM is in your "proper" string, then you can check for double-encoding by checking if the BOM is present. If it's not (i.e., it's been re-encoded) then you can perform your decoding routine:
> str_bom.force_encoding("iso-8859-1").encode("utf-8").encode("iso-8859-1").force_encoding("utf-8").start_with?("\xEF\xBB\xBF")
=> true
If you can't be assured of the BOM, then you could use a heuristic to guess whether a string is "bad" or not, by counting unprintable characters, or characters which fall outside of your normal expected result set (your string looks like it's dealing with Hebrew; you could say that any string which consists of >50% non-Hebrew letters is double-encoded, for example), so you could then attempt to decode it.
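A rough sketch of such a heuristic, assuming the input is valid UTF-8 and that "good" strings are mostly Hebrew. The helper names hebrew_ratio and fix_double_encoding and the 0.5 threshold are illustrative assumptions, not an established recipe:

```ruby
# Hypothetical helpers; the 0.5 threshold is an arbitrary assumption.
def hebrew_ratio(str)
  letters = str.scan(/\p{L}/)
  return 0.0 if letters.empty?

  letters.count { |c| c.match?(/\p{Hebrew}/) } / letters.size.to_f
end

def fix_double_encoding(str)
  return str if hebrew_ratio(str) > 0.5 # already looks like Hebrew

  candidate = str.encode("iso-8859-1").force_encoding("utf-8")
  # Accept the repair only if it is valid UTF-8 and now looks like Hebrew.
  candidate.valid_encoding? && hebrew_ratio(candidate) > 0.5 ? candidate : str
rescue Encoding::UndefinedConversionError
  str # contains characters ISO-8859-1 cannot represent; leave it alone
end
```

Strings that are neither Hebrew nor repairable come back unchanged, so the worst case of a wrong guess here is a missed fix rather than new damage.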
Finally, you would have to fall back to exception handling and hope that you know which encoding the string was purportedly declared as when it was double-encoded:
str = "汉语 / 漢語"
begin
  str.encode("iso-8859-1").force_encoding("utf-8")
rescue Encoding::UndefinedConversionError
  str
end
However, even if you know that a string is double-encoded, if you don't know the encoding that it was improperly declared as when it was converted to UTF-8, you can't do the reverse operation:
> bad_str = str.force_encoding("windows-1252").encode("utf-8")
=> "汉语 / 漢語"
> bad_str.encode("iso-8859-1").force_encoding("utf-8")
Encoding::UndefinedConversionError: "\xE2\x80\xB0" from UTF-8 to ISO-8859-1
Since the string itself doesn't carry any information about the encoding it was incorrectly encoded from, you don't have enough information to reliably solve it, and are left with iterating through a list of most-likely encodings and heuristically checking the result of each successful re-encode with your Hebrew heuristic.
To echo the post I linked: character encodings are hard.

memcached client throws java.lang.IllegalArgumentException: Key contains invalid characters

It seems the memcached client doesn't support UTF-8 strings as keys, but I have to support i18n. Is there any way to fix it?
java.lang.IllegalArgumentException: Key contains invalid characters: ``HK:00:A Kung Wan''
at net.spy.memcached.MemcachedClient.validateKey(MemcachedClient.java:232)
at net.spy.memcached.MemcachedClient.addOp(MemcachedClient.java:254)
The issue here isn't UTF-8 encoding. It's the fact that your key contains a space. Keys cannot have spaces, new lines, carriage returns, or null characters.
The line of code that produces the exception is below
if (b == ' ' || b == '\n' || b == '\r' || b == 0) {
    throw new IllegalArgumentException
        ("Key contains invalid characters: ``" + key + "''");
}
A general solution for memcached keys containing special characters, control characters, new lines, spaces, Unicode characters, etc. is to Base64-encode the key just before you pass it to the set() and get() methods of the memcached client.
// sketch for set, using java.util.Base64 (Java 8+); exp is the expiry in seconds
String safeKey = Base64.getEncoder().encodeToString(key.getBytes(StandardCharsets.UTF_8));
memcachedClient.set(safeKey, exp, value);
// sketch for get
memcachedClient.get(safeKey);
This converts the key into characters memcached is guaranteed to accept.
In addition, Base64 encoding has a negligible performance cost (unless you are chasing nanosecond-level optimizations), is reliable, and adds only about 33% to the key length.
Works like a charm!

when we import csv data, how eliminate "invalid byte sequence in UTF-8"

we allow users to import data via csv (using ruby 1.9.2, hence it's fastercsv).
being user data, of course, it might not be properly sanitized.
When we try to display the data in an /index method we sometimes get the error "invalid byte sequence in UTF-8" pointing to our erb where we display one of the fields widget.name
When we do the import we'd like to FORCE the incoming data to be valid... is there a ruby operator that will map a string to a valid utf8 string, eg, something like
goodstring = badstring.no_more_invalid_bytes
One example of 'bad' data is a character that looks like a hyphen but is not a regular ASCII hyphen. We'd prefer to map the non-UTF-8 characters to a reasonable ASCII equivalent (u-umlaut going to u, for example), BUT we're okay with simply stripping the character too.
since this is when importing lots of data, it needs to be a fast built-in operator, hopefully...
Note: here is an example of the data. The file comes from Windows and is 8-bit ASCII. When we import it and display widget.name.inspect (instead of widget.name) in our erb, we get:
"Chains \x96 Accessories"
so one example of the data is a "hyphen" that's actually the 8-bit code 0x96.
--- when we changed our csv parse to assign fldval = d.encode('UTF-8')
it throws this error:
Encoding::UndefinedConversionError in StoresController#importfinderitems
"\x96" from ASCII-8BIT to UTF-8
what we're looking for is a simple way to just force it to be valid utf8 regardless of origin type, even if we simply strip non-ascii.
while not as 'nice' as forcing the encoding, this works at a slight expense to our import time:
d.to_s.strip.gsub(/\P{ASCII}/, '')
Thank you, Mladen!
The Ruby 1.9 CSV library has a new parser that works with m17n (multilingualization). The parser works in the Encoding of the IO or String object being read. The methods ::foreach, ::open, ::read, and ::readlines take an optional :encoding option with which you can specify the Encoding.
For example:
CSV.read('/path/to/file', :encoding => 'windows-1251:utf-8')
Would convert all strings to UTF-8.
Also you can use the more standard encoding name 'ISO-8859-1'
CSV.read('/..', :headers => true, :col_sep => ';', :encoding => 'ISO-8859-1')
CSV.parse(File.read('/path/to/csv').scrub)
I answered a similar question that deals with reading external files in 1.9.2 with non-UTF-8 encodings. I think that answer will help you a lot: Character Encoding issue in Rails v3/Ruby 1.9.2
Note that you need to know the source encoding in order to convert it to anything reliably. There are libraries like the one I linked to in my other answer that can help you determine this.
Also, if you aren't loading the data from a file, you can convert the encoding of a string in 1.9.2 quite easily:
'string'.encode('UTF-8')
However, it's rare that you're building a string in another encoding, and it's best to convert it at the time it's read into your environment if possible.
Ruby 1.9 can change string encoding with invalid detection and replacement:
str = str.encode('UTF-8', :invalid => :replace)
For unusual strings such as strings loaded from a file of unknown encoding, it's wise to use #encode instead of a regex, #gsub, or #delete, because these all need the string to be parsed-- but if the string is broken, it can't be parsed, so those methods fail.
If you get a message like this:
error ** from ASCII-8BIT to UTF-8
Then you're probably trying to convert a binary string that's already in UTF-8, and you can force UTF-8:
str.force_encoding('UTF-8')
If you know the original string is not in binary UTF-8, or if the output string has illegal characters, then read up on Ruby encoding transliterations.
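A hedged sketch of the two options, using the "Chains \x96 Accessories" value from the question. It assumes the file really is Windows-1252 (where byte 0x96 is an en dash); the variable names are illustrative. Bytes above 0x7F in an ASCII-8BIT string count as undefined conversions rather than invalid sequences, so :undef is passed alongside :invalid when stripping:

```ruby
raw = "Chains \x96 Accessories".b # .b gives an ASCII-8BIT (binary) copy

# Option 1: the source encoding is known, so relabel the bytes and transcode.
converted = raw.dup.force_encoding("Windows-1252").encode("UTF-8")

# Option 2: the source encoding is unknown, so drop whatever is not ASCII.
stripped = raw.encode("UTF-8", invalid: :replace, undef: :replace, replace: "")

puts converted # Chains – Accessories
puts stripped  # Chains  Accessories
```

Option 1 keeps the dash as a real Unicode character, while option 2 loses it and leaves two adjacent spaces, which is the trade-off the question already accepts.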
If you are using Rails, you can try to fix it with the following
'Your string with strange stuff ##~'.mb_chars.tidy_bytes
It removes the invalid UTF-8 characters and replaces them with valid ones.
More info: https://apidock.com/rails/String/mb_chars
Upload the CSV file to Google Docs Spreadsheet and re-download it as a CSV file. Import and voila! (Worked in my case)
Presumably Google converts it to the wanted format..
Source: Excel to CSV with UTF-8 Encoding
As mentioned by someone else, scrub works well to clean this up in Ruby 2.1+. Note that reading the file this way loads all of it into memory, which you may not want for a large file:
data = IO::read(file_path).scrub("")
CSV.parse(data, :col_sep => ',', :headers => true) do |row|
  puts row
end
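For large files, one sketch that avoids loading everything at once is to let CSV.foreach stream the rows and transcode as it reads. This assumes the source encoding is known (Windows-1252 here, an assumption based on the question's sample byte); the temporary file and the names variable only make the example self-contained:

```ruby
require "csv"
require "tempfile"

# Illustrative input: a header row and one Windows-1252 row containing byte 0x96.
file = Tempfile.new(["import", ".csv"])
file.binmode
file.write("name\nChains \x96 Accessories\n".b)
file.close

names = []
# "windows-1252:utf-8" means: read as Windows-1252, hand rows over as UTF-8.
CSV.foreach(file.path, headers: true, encoding: "windows-1252:utf-8") do |row|
  names << row["name"]
end
file.unlink
```

Each row arrives as valid UTF-8, so nothing needs to be scrubbed afterwards as long as the declared source encoding is right.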
I am using a Mac and I was having the same error:
rescue in parse:Invalid byte sequence in UTF-8 in line 1 (CSV::MalformedCSVError)
I added :encoding => 'ISO-8859-1' that resolved my error and csv file could be read.
results = CSV.read("query_result.csv", :headers => true, :encoding => 'ISO-8859-1')
:headers => true : If set to :first_row or true, the initial row of the CSV file will be treated as a row of headers. If set to an Array, the contents will be used as the headers. If set to a String, the String is run through a call of ::parse_line with the same :col_sep, :row_sep, and :quote_char as this instance to produce an Array of headers. This setting causes #shift to return rows as CSV::Row objects instead of Arrays and #read to return CSV::Table objects instead of an Array of Arrays.
irb(main):024:0> rows = CSV.new(StringIO.new("a,b,c\n1,2,3"), headers: true)
=> <#CSV io_type:StringIO encoding:UTF-8 lineno:0 col_sep:"," row_sep:"\n" quote_char:"\"" headers:true>
irb(main):025:0> rows = CSV.new(StringIO.new("a,b,c\n1,2,3"), headers: true).to_a
=> [#<CSV::Row "a":"1" "b":"2" "c":"3">]
irb(main):026:0> rows.first['a']
=> "1"
In the above example you can clearly see that this also lets us access the row data like a hash.
The only thing you need to be careful about when using headers: true is duplicate headers, since keys are unique in hashes.
Only do this
anyobject.to_csv(:encoding => 'utf-8')

how to remove whitespace but not utf-8 character in ruby

I want to prevent users from writing an empty comment (whitespace, &nbsp;, etc.), so I apply the following:
var.gsub(/^\s+|\s+\z|\s*&nbsp;\s*/, '')
However, a smart user then found a hole by using the \302 or \240 bytes, so I filtered out these characters too.
Then I ran into a problem when I introduced support for several languages: a word like Déjà vu became an error, because part of the à character contains \240. Is there any way to remove the whitespace but leave the Latin characters untouched?
A way around this is to use iconv to discard the invalid UTF-8 byte sequences (such as \240 on its own) before using your regexp to remove the whitespace:
require 'iconv'
var1 = "Déjà vu"
var2 = "\240"
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid1 = ic.iconv(var1) # => "D\303\251j\303\240 vu"
valid2 = ic.iconv(var2) # => ""
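Iconv was removed from Ruby's standard library in 2.0, so on Ruby 2.1+ a comparable sketch can use String#scrub to drop invalid bytes and the Unicode-aware [[:space:]] class to catch the non-breaking space (U+00A0, bytes \302\240 in UTF-8) without touching letters such as à. blank_comment? is a hypothetical helper name, and treating all input as UTF-8 is an assumption:

```ruby
def blank_comment?(text)
  # Treat the input as UTF-8 and drop any bytes that are not valid UTF-8.
  cleaned = text.to_s.dup.force_encoding("UTF-8").scrub("")
  # [[:space:]] also matches U+00A0, which \s does not.
  cleaned.gsub(/[[:space:]]/, "").empty?
end

puts blank_comment?(" \u00A0\t") # true
puts blank_comment?("Déjà vu")   # false
puts blank_comment?("\xA0".b)    # true
```

Working on whole characters rather than single bytes is what keeps the \240 inside à from being mistaken for a stray non-breaking space.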
