I'm importing content from an outside database that is infected with a variety of odd characters, e.g.
> str
=> "Nature’s Variety, Best Friends Animal Society team up"
From context it seems that ’ represents a right single-quote. In cp1252 encoding:
> str.encode('cp1252')
=> "Nature\xE2\x80\x99s Variety, Best Friends Animal Society team up"
So how do I convert it to the correct UTF-8 character? Here's what I've tried:
> str.encode('UTF-8')
=> "Nature’s Variety, Best Friends Animal Society team up"
> str.encode('cp1252').encode('UTF-8')
=> "Nature’s Variety, Best Friends Animal Society team up"
> str.encode('UTF-8', invalid: :replace, replace: '?', undef: :replace)
=> "Nature’s Variety, Best Friends Animal Society team up"
> str.encode('cp1252').encode('UTF-8', invalid: :replace, replace: '?', undef: :replace)
=> "Nature’s Variety, Best Friends Animal Society team up"
I'd rather find a way to do a generic re-encoding so that it will handle all such mis-encoded characters. If I have to, I'll do individual search-and-replace instead, but I'm not able to make that work either:
> str.encode('cp1252').gsub('\xE2/x80/x99', "'")
=> "Nature\xE2\x80\x99s Variety, Best Friends Animal Society team up"
> str.encode('cp1252').gsub(%r{\xE2\x80\x99}, "'")
SyntaxError: unexpected tIDENTIFIER, expecting $end
> str.encode('cp1252').gsub(Regexp.escape('\xE2\x80\x99'), "'")
=> "Nature\xE2\x80\x99s Variety, Best Friends Animal Society team up"
I'd like to do this, but I can't even paste these characters into my REPL:
> str.gsub('’', "'")
When I try I get:
> str.gsub('C"b,b,b
* "', ",")
=> "Nature’s Variety, Best Friends Animal Society team up"
Frustrating. Any suggestions on how to encode this properly into UTF-8?
Edit: as requested, here are the actual bytes in the string:
> str.bytes.to_a.join(' ')
=> "78 97 116 117 114 101 195 162 226 130 172 226 132 162 115 32 86 97 114 105 101 116 121 44 32 66 101 115 116 32 70 114 105 101 110 100 115 32 65 110 105 109 97 108 32 83 111 99 105 101 116 121 32 116 101 97 109 32 117 112"
I had this problem with Fixing Incorrect String Encoding From MySQL. You need to set the proper encoding and then force it back.
fallback = {
  "\u0081" => "\x81".force_encoding("CP1252"),
  "\u008D" => "\x8D".force_encoding("CP1252"),
  "\u008F" => "\x8F".force_encoding("CP1252"),
  "\u0090" => "\x90".force_encoding("CP1252"),
  "\u009D" => "\x9D".force_encoding("CP1252")
}
str.encode('CP1252', fallback: fallback).force_encoding('UTF-8')
The fallback may not be necessary depending on your data, but it ensures that it won't raise an error by handling the five bytes which are undefined in CP1252.
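For example, here is a small sketch of why the fallback matters: without it, encoding a string that contains one of those five characters raises, and with it the byte passes straight through.

begin
  "\u0081".encode('CP1252')
rescue Encoding::UndefinedConversionError
  # U+0081 is one of the five code points with no CP1252 mapping
end

"\u0081".encode('CP1252', fallback: fallback)  # => "\x81" via the fallback above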
Once Ruby has got the encoding wrong, the characters stay incorrect, consistent with the original mistake; conversions simply carry the now-wrong characters over into the new encoding.
To correct Ruby's mistake on input, you need the force_encoding method, which does not do a conversion at all; it just corrects Ruby's note of what encoding a String has.
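A minimal sketch of the difference, assuming a UTF-8 default encoding:

s = "’"                               # U+2019 in a UTF-8 string
s.encode('CP1252').bytes              # => [146]            converts: new bytes
s.dup.force_encoding('CP1252').bytes  # => [226, 128, 153]  relabels: same bytes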
In your case the fault occurred before you read the values from the DB. If you pick out the problem bytes, bytes = %w(195 162 226 130 172 226 132 162).map(&:to_i), they look to be UTF-8 that was already double-encoded in the database. You can probably assume a problem with whatever wrote these values into the DB (and if it is a live process, this is a bug that needs sorting out, or you will continue to receive bad values).
What has happened is that your DB (or the code that writes to it) received some UTF-8 bytes representing the correct character, but assumed they were CP1252 and converted them to UTF-8. It then wrote valid UTF-8, but the wrong characters, into the DB.
If I do the following in Ruby console using UTF-8 encoding in my terminal and as the default Ruby encoding, I can replicate your problem:
str = "Nature’s Variety, Best Friends Animal Society team up"
=> "Nature’s Variety, Best Friends Animal Society team up"
str = str.force_encoding('CP1252').encode('UTF-8')
=> "Nature’s Variety, Best Friends Animal Society team up"
The fault is reversible, as shown here:
str = str.encode('CP1252').force_encoding('UTF-8')
=> "Nature’s Variety, Best Friends Animal Society team up"
The encode('CP1252') undoes the original mistaken conversion.
The force_encoding('UTF-8') sets the encoding back to what the system most likely received in the first place.
You will want to find where in your system an assumption of CP1252 input is being made, and instead assume UTF-8 (it may get more complicated than that if you have multiple sources in different encodings).
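If you also want the generic re-encoding pass asked about in the question, here is a hedged sketch for data that is already in the database. fix_double_encoded is a hypothetical helper name, it assumes every bad value is exactly this CP1252-to-UTF-8 double encoding, and the optional fallback argument can take the hash from the earlier answer.

def fix_double_encoded(str, fallback = {})
  repaired = str.encode('CP1252', fallback: fallback).force_encoding('UTF-8')
  repaired.valid_encoding? ? repaired : str   # keep the original if the result is not valid UTF-8
rescue EncodingError
  str  # the value cannot round-trip through CP1252, so it was probably fine already
end

fix_double_encoded("Nature’s Variety, Best Friends Animal Society team up")
# => "Nature’s Variety, Best Friends Animal Society team up"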
Related
I'm currently using software called LineView. It generates downtime reason codes for our factory lines. An operator scans the barcodes with an RS232 scanner and it goes into our XL board system.
The software itself generates the barcodes within an internet browser, but I am trying to make it so our own labeling machine can also print out the barcodes. However, the barcodes that are produced by the labeler (and the many online barcode generators I've tried) look longer and do not work.
The data for the example Code 128 barcode that I am trying to replicate is [SOH]1[STX]65;1067[ETX].
According to the manual:
[SOH] - The Start of Header character (ASCII 0x01) starts the XL Command packet.
1 - The Serial Address of the XL device (the default is 1).
[STX] - The Start of Text character (ASCII 0x02) marks the start of the actual command.
65; - The ID of the Production State > Set Reason Code command.
1067 - The Reason Code ID (which can range from 1 to 999 for system reasons or 1000 to 1999 for user-defined reasons). In my case it is 1067.
[ETX] - The End of Text character (ASCII 0x03) ends the XL Command packet.
I have attached pictures of what LineView produces (which is what I want it to look like) and what it is currently printing on our labeller.
When I scan them they both come up with the [SOH]1[STX]65;1067[ETX] code despite them looking different.
Any help with this would be very much appreciated.
Your intended barcode is constructed internally using the following series of Code 128 codewords which correctly represent the ASCII control characters:
103 Start-in-Mode-A (Upper-case and control characters)
65 [SOH] (ASCII 1)
17 1
66 [STX] (ASCII 2)
22 6
21 5
27 ;
99 Switch-to-Mode-C (Double-density numeric)
10 10
67 67
101 Switch-to-Mode-A
67 [ETX] (ASCII 3)
67 Check-digit
106 Stop
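As a cross-check, a small Ruby sketch (not part of the scan data, just an illustration of the Code 128 rule) reproduces that check digit: the start codeword plus each later codeword multiplied by its position, modulo 103.

codewords = [103, 65, 17, 66, 22, 21, 27, 99, 10, 67, 101, 67]  # everything above except check digit and stop
check = codewords.each_with_index.sum { |cw, i| cw * [i, 1].max } % 103
puts check  # => 67, the check digit shown above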
Your label printer is printing a barcode representing the literal string [SOH]1[STX]65;1067[ETX] with no ASCII control characters (i.e. left-bracket, S, O, H, right-bracket, ...) using the following internal codewords:
104 Start-in-Mode-B (Mixed-case)
59 [
51 S
47 O
40 H
61 ]
17 1
59 [
51 S
52 T
56 X
61 ]
22 6
21 5
27 ;
99 Switch-to-Mode-C (Double-density numeric)
10 10
67 67
100 Switch-to-Mode-B
59 [
37 E
52 T
56 X
61 ]
57 Check-digit
106 Stop
So you need to work out how to correctly specify ASCII control characters in the input to your labelling machine.
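If you can feed the labelling machine raw data from a script, a minimal Ruby sketch of the packet with real control characters looks like this. How control characters are actually entered into your labeller's own software is vendor-specific, so treat this only as a reference for the exact bytes required.

SOH = "\x01"  # Start of Header
STX = "\x02"  # Start of Text
ETX = "\x03"  # End of Text

packet = "#{SOH}1#{STX}65;1067#{ETX}"
p packet.bytes  # => [1, 49, 2, 54, 53, 59, 49, 48, 54, 55, 3]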
I'm working on making two secure systems talk via a common encryption scheme. I picked AES as it seems a secure standard, but I'm not married to it, so long as I have two-way encryption.
Here is the Go source and Ruby source, simplified down to a really clear example to run from the command line and see the differences. I'm outputting the raw byte values for easier literal comparison.
I'm using 128-bit CFB in both, and neither of them appears to use padding. Any help is greatly appreciated!
You passed the wrong key size in the Ruby code. It should be 192, because key.size is 24 bytes == 192 bits.
cipher = OpenSSL::Cipher::AES.new(192, :CFB)
cipher.encrypt
cipher.key = key
cipher.iv = iv
encrypted = cipher.update(input) + cipher.final()
puts "Output: [" + encrypted.bytes.join(" ") + "]"
output:
Output: [155 79 127 80 31 163 142 111 13 211 221 163 219 248]
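For reference, here is a self-contained sketch that derives the AES variant from the key length, so a 24-byte key automatically selects AES-192. The key, IV and input values are made up for illustration and are not taken from the linked sources.

require 'openssl'

key   = "0123456789abcdef01234567"  # 24 bytes => AES-192
iv    = "abcdef9876543210"          # 16-byte IV, as AES-CFB requires
input = "Hello from Ruby"

cipher = OpenSSL::Cipher::AES.new(key.bytesize * 8, :CFB)
cipher.encrypt
cipher.key = key
cipher.iv  = iv
encrypted = cipher.update(input) + cipher.final

puts "Output: [#{encrypted.bytes.join(' ')}]"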
I have a variable address which for now is a long string containing some unnecessary info, e.g.: "Aboriginal Relations 11th Floor Commerce Place 10155 102 Street Edmonton AB T5J 4G8 Phone 780 427-9658 Fax 780 644-4939 Email gerry.kushlyk#gov.ab.ca"
Aboriginal Relations is in a variable called title, and I'm trying to call address.gsub!(title, ''), but it's returning the original string.
I've also tried address.gsub!(/#{title}/,'') and address.gsub!("#{title}",'') but those won't work either. Any ideas?
Sorry, the typo occurred when I typed it into Stack Overflow. Here's the code and the output, copied and pasted:
(this is within a loop, so there will be multiple outputs)
p title
address.gsub!(title,'')
p address
output
"Aboriginal Relations "
"Aboriginal Relations 11th Floor Commerce Place 10155 102 Street Edmonton AB T5J 4G8 Phone 780 427-9658 Fax 780 644-4939 Email gerry.kushlyk#gov.ab.ca"
"Aboriginal Tourism Advisory Council "
"Aboriginal Tourism Advisory Council 5th Floor Terrace Building 9515 107 Street Edmonton AB T5K 2C3 Phone 780 427-9687 Fax 780 422-7235 Email foip.fintprccs#gov.ab.ca"
"Acadia Foundation "
"Acadia Foundation PO Box 96 Oyen AB T0J 2J0 Phone 403 664-3384 Fax 403 664-3316 Email acadiafoundation#telus.net"
"Access Advisory Council "
"Access Advisory Council 12th Floor Centre West Building 10035 108 Street Edmonton AB T5J 3E1 Phone 780 427-2805 Fax 780 422-3204 Email barb.joyner#gov.ab.ca"
"ACCM Benevolent Association "
"ACCM Benevolent Association Suite 100 9403 95 Avenue Edmonton AB T6C 4M7 Phone 780 468-4648 Fax 780 468-4648 Email accmmanor#shaw.ca"
"Acme Municipal Library "
"Acme Municipal Library PO Box 326 Acme AB T0M 0A0 Phone 403 546-3845 Fax 403 546-2248 Email aamlibrary#marigold.ab.ca"
Likewise, if I try address.match(/#{title}/) I get nil.
I'm assuming you're using ruby 1.9 or higher.
It's possible that the trailing whitespace is a non-breaking space:
p "Relations\u00a0" # looks like a trailing space, but strip won't remove it
to get rid of it:
"Relations\u00a0".gsub!(/^\u00a0|\u00a0$/, '') # => "Relations"
A more generic solution for all unicode whitespace:
"Relations\u00a0".gsub!(/^[[:space:]]|[[:space:]]$/, '') # => "Relations"
To see what the character is in your case:
title[-1].ord # => 160 (example only)
'%x' % title[-1].ord # => "a0" (hex equivalent; example only)
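Applied back to the original title/address problem, a hedged sketch (clean_title is just an illustrative name; it assumes title and address are set as in the question):

clean_title = title.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')   # strip Unicode whitespace, NBSP included
address.sub!(/#{Regexp.escape(clean_title)}[[:space:]]*/, '')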
title = title[0..-2] seemed to solve it. For some reason strip and chomp wouldn't work.
0186 is the Unicode code point. Where do 198 and 134 come from? How can I go the other way around, from these byte values to Unicode strings?
>> c = JSON '["\\u0186"]'
[
[0] "Ɔ"
]
>> c[0][0]
198
>> c[0][1]
134
>> c[0][2]
nil
Another confusing thing is unpack, which gives yet another seemingly arbitrary number. Where does that come from? Is it even correct? From the 1.8.7 String#unpack documentation:
U | Integer | UTF-8 characters as unsigned integers
>> c[0].unpack('U')
[
[0] 390
]
You can find your answers here Unicode Character 'LATIN CAPITAL LETTER OPEN O' (U+0186):
Note that 0x0186 (hexadecimal) is 390 (decimal).
C/C++/Java source code : "\u0186"
UTF-32 (decimal) : 390
UTF-8 (hex) : 0xC6 0x86 (i.e. 198 134)
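In Ruby 1.9+ you can see all of these numbers at once; a small sketch:

s = "\u0186"      # LATIN CAPITAL LETTER OPEN O
s.ord             # => 390          the Unicode code point (0x0186)
s.bytes           # => [198, 134]   its UTF-8 byte sequence (0xC6 0x86)
s.unpack('U')     # => [390]        unpack('U') also yields the code point
[390].pack('U')   # => "Ɔ"          and pack('U') goes back to the character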
You can read more about UTF-8 encoding on Wikipedia's article on UTF-8.
UTF-8 (UCS Transformation Format, 8-bit) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32.
I maintain a Java EE web application against an eight-bit-charset Oracle database.
The application will be used from abroad, and I want to be able to check strings (for example with Unicode regexps, both from Java and from JavaScript) to see if they fit into the database charset.
One function in the GDK (Globalization Development Kit) gives the equivalent Java name of the Oracle charset (I think it was ISO-8859-15), but I'm not certain the correspondence will be exact.
What I want is to display the whole charset (NOT the ISO one, but the ORACLE one) character by character, to use from both Java and JavaScript, and even to display the Unicode code points and tell the control characters apart from the printable ones.
Is there a function in Oracle's GDK to that end?
Thank you.
I think I've found it! (Eureka!)
A little Java JDBC program produced exactly the characters in ISO-8859-15 that are distinct from ISO-8859-1 (by the way, I've learned that ISO-8859-1 maps 0x00 to 0xFF directly onto the first 256 Unicode code points).
Program output:
CHR: 164 UNICODE: 8364 euro sign
CHR: 166 UNICODE: 352
CHR: 168 UNICODE: 353
CHR: 180 UNICODE: 381
CHR: 184 UNICODE: 382
CHR: 188 UNICODE: 338
CHR: 189 UNICODE: 339
CHR: 190 UNICODE: 376
Program code (not using GDK at all):
NOTE: the statement "SELECT CHR(i using nchar_cs) FROM DUAL" just gave back the same numbers... WHY?
for (int i = 0; i < 256; i++)
{
    Statement select = con.createStatement();
    ResultSet result = select.executeQuery("select CHR(" + i + ") from DUAL");
    while (result.next())
    {
        int unicodePoint = result.getString(1).codePointBefore(1);
        //int unicodePoint = result.getString(1).codePointAt(0);
        if (unicodePoint != i)
            System.out.println("CHR: " + i + "\tUNICODE: " + unicodePoint);
    }
    result.close();
    result = null;
    select.close();
    select = null;
}