British Pound Sign £ causing PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0xa3

When collecting information containing the British Pound Sign '£' from external sources such as my bank, via csv file, and posting to postgres using ActiveRecord, I get the error:
PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0xa3
0xa3 is the single-byte (ISO-8859-1/Latin-1) code for the £ sign. The received wisdom is to force the string to UTF-8 while replacing invalid byte sequences:
string.encode('UTF-8', {:invalid => :replace, :undef => :replace, :replace => '?'})
This stops the error, but it is a lossy fix: the '£' is converted into a '?'.
UTF-8 is able to handle the '£' sign, so what can be done to fix the invalid byte sequence and persist the '£' sign?

I'm answering my own question thanks to Michael Fuhr, who explained that the UTF-8 byte sequence for the pound sign is 0xc2 0xa3. So all you have to do is find each occurrence of 0xa3 (163) and place 0xc2 (194) in front of it:
array_bytes = string.bytes
new_pound_ptr = 0
# Look for a £ sign (0xa3 = 163)
pound_ptr = array_bytes.index(163)
while !pound_ptr.nil?
  pound_ptr += new_pound_ptr # new_pound_ptr is set at the end of the block
  # The following statement finds an incorrectly sequenced £ sign...
  if (pound_ptr == 0) || (array_bytes[pound_ptr - 1] != 194)
    array_bytes.insert(pound_ptr, 194) # prepend 0xc2
    pound_ptr += 1
  end
  new_pound_ptr = pound_ptr + 1
  # Search the remainder of the array for the next pound sign
  pound_ptr = array_bytes[new_pound_ptr..-1].index(163)
end
# Convert the bytes back to a string and tag it as UTF-8
string = array_bytes.pack('C*').force_encoding('UTF-8') unless new_pound_ptr == 0
# Can now write string to model without out-of-sequence error..
hash["description"] = string
Model.create!(hash)
I've had so much help on this stackoverflow forum, I hope I have helped somebody else.
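For reference, the same repair can be written more compactly with a byte-level gsub. This is only a sketch, under the same assumption as above: the only invalid bytes in the data are bare 0xA3s (the sample string is made up):

```ruby
# Prepend 0xC2 to any 0xA3 that doesn't already have one; a well-formed
# "\xC2\xA3" pair is replaced with itself, so the repair is idempotent.
raw = "Price: \xA3 9.99".b  # binary string with a stray Latin-1 pound sign
fixed = raw.gsub(/\xC2?\xA3/n, "\xC2\xA3".b).force_encoding('UTF-8')
fixed.valid_encoding?  # => true
fixed                  # => "Price: £ 9.99"
```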

Related

Convert an emoji to HTML UTF-8 in Ruby

I have a rails server running, where I have a bunch of clocks emojis, and I want to render them to the HTML. The emojis are in ASCII format:
ch = "\xF0\x9F\x95\x8f" ; 12.times.map { ch.next!.dup }.rotate(-1)
# => ["🕛", "🕐", "🕑", "🕒", "🕓", "🕔", "🕕", "🕖", "🕗", "🕘", "🕙", "🕚"]
What I want is this:
> String.define_method(:to_html_utf8) { chars.map! { |x| "&#x#{x.dump[3..-2].delete('{}')};" }.join }
> ch = "\xF0\x9F\x95\x8f" ; 12.times.map { ch.next!.to_html_utf8 }.rotate(-1)
# => ["&#x1f55b;", "&#x1f550;", "&#x1f551;", "&#x1f552;", "&#x1f553;", "&#x1f554;", "&#x1f555;", "&#x1f556;", "&#x1f557;", "&#x1f558;", "&#x1f559;", "&#x1f55a;"]
> ?🖄.to_html_utf8
# => "&#x1f584;"
> "🐭🐹".to_html_utf8
#=> "&#x1f42d;&#x1f439;"
As you can see, to_html_utf8 uses a somewhat brute-force approach to get the job done.
Is there a better way to convert the emojis in aforementioned html compatible UTF-8?
Please note that it would be better to avoid any Rails helpers or Rails-specific code in general; it should run on Ruby 2.7+ using only the standard library.
The emojis are in ASCII format:
ch = "\xF0\x9F\x95\x8f"
0xf0 0x9f 0x95 0x8f is the character's UTF-8 byte sequence. Don't use that unless you absolutely have to. It's much easier to enter the emojis directly, e.g.:
ch = '🕐'
#=> "🕐"
or to use the character's codepoint, e.g.:
ch = "\u{1f550}"
#=> "🕐"
ch = 0x1f550.chr('UTF-8')
#=> "🕐"
You can usually just render that character into your HTML page if the "charset" is UTF-8.
If you want to turn the string's characters into their numeric character reference counterparts yourself, you could use:
ch.codepoints.map { |cp| format('&#x%x;', cp) }.join
#=> "&#x1f550;"
Note that the conversion is trivial – 1f550 is simply the character's (hex) codepoint.
The easiest way is to simply use UTF-8 natively and not escape anything.
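If you only want to escape the non-ASCII characters and leave plain text readable (a common variation), the same format call can be restricted with a regexp. A minimal sketch; to_ncr is a hypothetical name:

```ruby
# Escape only non-ASCII characters as numeric character references,
# leaving plain ASCII untouched.
def to_ncr(str)
  str.gsub(/[^\x00-\x7F]/) { |c| format('&#x%x;', c.ord) }
end

to_ncr('mouse 🐭!')  # => "mouse &#x1f42d;!"
```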

Ruby string escape for supplementary plane Unicode characters

I know that I can escape a basic Unicode character in Ruby with the \uNNNN escape sequence. For example, for a smiling face U+263A (☺) I can use the string literal "\u263A".
How do I escape Unicode characters greater than U+FFFF that fall outside the basic multilingual plane, like a winking face: U+1F609 (😉)?
Using the surrogate pair form like in Java doesn't work; it results in an invalid string that contains the individual surrogate code points:
s = "\uD83D\uDE09" # => "\xED\xA0\xBD\xED\xB8\x89"
s.valid_encoding? # => false
You can use the escape sequence \u{XXXXXX}, where XXXXXX is between 1 and 6 hex digits:
s = "\u{1F609}" # => "😉"
The braces can also contain multiple runs separated by single spaces or tabs to encode multiple characters:
s = "\u{41f 440 438 432 435 442 2c 20 43c 438 440}!" # => "Привет, мир!"
You could also use byte escapes to write a literal that contains the UTF-8 encoding of the character, though that's not very convenient, and doesn't necessarily result in a UTF-8-encoded string, if the file encoding differs:
# encoding: utf-8
s = "\xF0\x9F\x98\x89" # => "😉"
s.length # => 1
# encoding: iso-8859-1
s = "\xF0\x9F\x98\x89" # => "\xF0\x9F\x98\x89"
s.length # => 4
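For comparison, a few equivalent ways to build the same character from its codepoint, all standard Ruby:

```ruby
# Three ways to produce U+1F609 from its codepoint.
a = "\u{1F609}"                   # escape sequence in a literal
b = 0x1F609.chr(Encoding::UTF_8)  # Integer#chr with an explicit encoding
c = [0x1F609].pack('U')           # Array#pack, 'U' packs a UTF-8 character
a == b && b == c  # => true
a.length          # => 1
```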

£ considered invalid character

I need to search for a "£" sign in my text but it keeps coming up with the error:
invalid character property name {`£`}: /\p{`\u00A3`}/ (SyntaxError)
I have # encoding: utf-8 at the top. The context in which I'm using it is:
original_contents << line.gsub(/[abc]/, '*')
.gsub(/\p{£}/, '')
When I try .gsub(/£/, '') instead, I get
C:/Users...Epub run through.rb:12:in `gsub': incompatible encoding regexp match (UTF-8 regexp with CP850 string) (Encoding::CompatibilityError)
from C:/Users...Epub run through.rb:12:in `block in <top (required)>'
from C:/Users...Epub run through.rb:9:in `each_line'
from C:/Users...Epub run through.rb:9:in `<top (required)>'
from -e:1:in `load'
from -e:1:in `<main>'
In a regexp, \p is for matching a "character property", which is basically a set of characters that are related in some way (e.g. digit characters, ASCII characters, etc.). The Ruby regexp documentation lists the valid character properties. According to that list (and your error message), £ is not a valid character property name, so \p{£} is invalid.
You should just use /£/. The error you're getting in that case is because your string is not UTF-8 encoded. Regexp literals use the source encoding by default (UTF-8 here, thanks to the magic comment), and the regexp's encoding must be compatible with that of the string it matches. The easiest way to fix that is to change the string's encoding:
original_contents.encode! Encoding::UTF_8
original_contents.gsub(/£/, '')
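To illustrate the mismatch and the fix, here is a sketch with a made-up CP850 line (0x9C is the £ sign in CP850, the code page named in the error message):

```ruby
# A line read with CP850 encoding, then transcoded to UTF-8.
line = "Price \x9C 5".force_encoding('CP850')  # 0x9C is £ in CP850
utf8 = line.encode('UTF-8')  # converts the bytes, not just the label
utf8                # => "Price £ 5"
utf8.gsub(/£/, '')  # => "Price  5"
```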
The error message says "invalid character property name", and that is correct. The valid Unicode character property names are:
Alpha
Blank
Cntrl
Digit
Graph
Lower
Print
Punct
Space
Upper
XDigit
Word
Alnum
ASCII
Any
Assigned
C
Cc
Cf
Cn
Co
Cs
L
LC
Ll
Lm
Lo
Lt
Lu
M
Mc
Me
Mn
N
Nd
Nl
No
P
Pc
Pd
Pe
Pf
Pi
Po
Ps
S
Sc
Sk
Sm
So
Z
Zl
Zp
Zs
Arabic
Armenian
Avestan
Balinese
Bamum
Bassa_Vah
Batak
Bengali
Bopomofo
Brahmi
Braille
Buginese
Buhid
Canadian_Aboriginal
Carian
Caucasian_Albanian
Chakma
Cham
Cherokee
Common
Coptic
Cuneiform
Cypriot
Cyrillic
Deseret
Devanagari
Duployan
Egyptian_Hieroglyphs
Elbasan
Ethiopic
Georgian
Glagolitic
Gothic
Grantha
Greek
Gujarati
Gurmukhi
Han
Hangul
Hanunoo
Hebrew
Hiragana
Imperial_Aramaic
Inherited
Inscriptional_Pahlavi
Inscriptional_Parthian
Javanese
Kaithi
Kannada
Katakana
Kayah_Li
Kharoshthi
Khmer
Khojki
Khudawadi
Lao
Latin
Lepcha
Limbu
Linear_A
Linear_B
Lisu
Lycian
Lydian
Mahajani
Malayalam
Mandaic
Manichaean
Meetei_Mayek
Mende_Kikakui
Meroitic_Cursive
Meroitic_Hieroglyphs
Miao
Modi
Mongolian
Mro
Myanmar
Nabataean
New_Tai_Lue
Nko
Ogham
Ol_Chiki
Old_Italic
Old_North_Arabian
Old_Permic
Old_Persian
Old_South_Arabian
Old_Turkic
Oriya
Osmanya
Pahawh_Hmong
Palmyrene
Pau_Cin_Hau
Phags_Pa
Phoenician
Psalter_Pahlavi
Rejang
Runic
Samaritan
Saurashtra
Sharada
Shavian
Siddham
Sinhala
Sora_Sompeng
Sundanese
Syloti_Nagri
Syriac
Tagalog
Tagbanwa
Tai_Le
Tai_Tham
Tai_Viet
Takri
Tamil
Telugu
Thaana
Thai
Tibetan
Tifinagh
Tirhuta
Ugaritic
Unknown
Vai
Warang_Citi
Yi
Alphabetic
Case_Ignorable
Cased
Changes_When_Casefolded
Changes_When_Casemapped
Changes_When_Lowercased
Changes_When_Titlecased
Changes_When_Uppercased
Default_Ignorable_Code_Point
Grapheme_Base
Grapheme_Extend
Grapheme_Link
ID_Continue
ID_Start
Lowercase
Math
Uppercase
XID_Continue
XID_Start
ASCII_Hex_Digit
Bidi_Control
Dash
Deprecated
Diacritic
Extender
Hex_Digit
Hyphen
IDS_Binary_Operator
IDS_Trinary_Operator
Ideographic
Join_Control
Logical_Order_Exception
Noncharacter_Code_Point
Other_Alphabetic
Other_Default_Ignorable_Code_Point
Other_Grapheme_Extend
Other_ID_Continue
Other_ID_Start
Other_Lowercase
Other_Math
Other_Uppercase
Pattern_Syntax
Pattern_White_Space
Quotation_Mark
Radical
STerm
Soft_Dotted
Terminal_Punctuation
Unified_Ideograph
Variation_Selector
White_Space
AHex
Bidi_C
CI
CWCF
CWCM
CWL
CWT
CWU
DI
Dep
Dia
Ext
Gr_Base
Gr_Ext
Gr_Link
Hex
IDC
IDS
IDSB
IDST
Ideo
Join_C
LOE
NChar
OAlpha
ODI
OGr_Ext
OIDC
OIDS
OLower
OMath
OUpper
Pat_Syn
Pat_WS
QMark
SD
Term
UIdeo
VS
WSpace
XIDC
XIDS
Other
Control
Format
Unassigned
Private_Use
Surrogate
Letter
Cased_Letter
Lowercase_Letter
Modifier_Letter
Other_Letter
Titlecase_Letter
Uppercase_Letter
Mark
Combining_Mark
Spacing_Mark
Enclosing_Mark
Nonspacing_Mark
Number
Decimal_Number
Letter_Number
Other_Number
Punctuation
Connector_Punctuation
Dash_Punctuation
Close_Punctuation
Final_Punctuation
Initial_Punctuation
Other_Punctuation
Open_Punctuation
Symbol
Currency_Symbol
Modifier_Symbol
Math_Symbol
Other_Symbol
Separator
Line_Separator
Paragraph_Separator
Space_Separator
Aghb
Arab
Armi
Armn
Avst
Bali
Bamu
Bass
Batk
Beng
Bopo
Brah
Brai
Bugi
Buhd
Cakm
Cans
Cari
Cher
Copt
Qaac
Cprt
Cyrl
Deva
Dsrt
Dupl
Egyp
Elba
Ethi
Geor
Glag
Goth
Gran
Grek
Gujr
Guru
Hang
Hani
Hano
Hebr
Hira
Hmng
Ital
Java
Kali
Kana
Khar
Khmr
Khoj
Knda
Kthi
Lana
Laoo
Latn
Lepc
Limb
Lina
Linb
Lyci
Lydi
Mahj
Mand
Mani
Mend
Merc
Mero
Mlym
Mong
Mroo
Mtei
Mymr
Narb
Nbat
Nkoo
Ogam
Olck
Orkh
Orya
Osma
Palm
Pauc
Perm
Phag
Phli
Phlp
Phnx
Plrd
Prti
Rjng
Runr
Samr
Sarb
Saur
Shaw
Shrd
Sidd
Sind
Sinh
Sora
Sund
Sylo
Syrc
Tagb
Takr
Tale
Talu
Taml
Tavt
Telu
Tfng
Tglg
Thaa
Tibt
Tirh
Ugar
Vaii
Wara
Xpeo
Xsux
Yiii
Zinh
Qaai
Zyyy
Zzzz
Age=1.1
Age=2.0
Age=2.1
Age=3.0
Age=3.1
Age=3.2
Age=4.0
Age=4.1
Age=5.0
Age=5.1
Age=5.2
Age=6.0
Age=6.1
Age=6.2
Age=6.3
Age=7.0
In_Basic_Latin
In_Latin_1_Supplement
In_Latin_Extended_A
In_Latin_Extended_B
In_IPA_Extensions
In_Spacing_Modifier_Letters
In_Combining_Diacritical_Marks
In_Greek_and_Coptic
In_Cyrillic
In_Cyrillic_Supplement
In_Armenian
In_Hebrew
In_Arabic
In_Syriac
In_Arabic_Supplement
In_Thaana
In_NKo
In_Samaritan
In_Mandaic
In_Arabic_Extended_A
In_Devanagari
In_Bengali
In_Gurmukhi
In_Gujarati
In_Oriya
In_Tamil
In_Telugu
In_Kannada
In_Malayalam
In_Sinhala
In_Thai
In_Lao
In_Tibetan
In_Myanmar
In_Georgian
In_Hangul_Jamo
In_Ethiopic
In_Ethiopic_Supplement
In_Cherokee
In_Unified_Canadian_Aboriginal_Syllabics
In_Ogham
In_Runic
In_Tagalog
In_Hanunoo
In_Buhid
In_Tagbanwa
In_Khmer
In_Mongolian
In_Unified_Canadian_Aboriginal_Syllabics_Extended
In_Limbu
In_Tai_Le
In_New_Tai_Lue
In_Khmer_Symbols
In_Buginese
In_Tai_Tham
In_Combining_Diacritical_Marks_Extended
In_Balinese
In_Sundanese
In_Batak
In_Lepcha
In_Ol_Chiki
In_Sundanese_Supplement
In_Vedic_Extensions
In_Phonetic_Extensions
In_Phonetic_Extensions_Supplement
In_Combining_Diacritical_Marks_Supplement
In_Latin_Extended_Additional
In_Greek_Extended
In_General_Punctuation
In_Superscripts_and_Subscripts
In_Currency_Symbols
In_Combining_Diacritical_Marks_for_Symbols
In_Letterlike_Symbols
In_Number_Forms
In_Arrows
In_Mathematical_Operators
In_Miscellaneous_Technical
In_Control_Pictures
In_Optical_Character_Recognition
In_Enclosed_Alphanumerics
In_Box_Drawing
In_Block_Elements
In_Geometric_Shapes
In_Miscellaneous_Symbols
In_Dingbats
In_Miscellaneous_Mathematical_Symbols_A
In_Supplemental_Arrows_A
In_Braille_Patterns
In_Supplemental_Arrows_B
In_Miscellaneous_Mathematical_Symbols_B
In_Supplemental_Mathematical_Operators
In_Miscellaneous_Symbols_and_Arrows
In_Glagolitic
In_Latin_Extended_C
In_Coptic
In_Georgian_Supplement
In_Tifinagh
In_Ethiopic_Extended
In_Cyrillic_Extended_A
In_Supplemental_Punctuation
In_CJK_Radicals_Supplement
In_Kangxi_Radicals
In_Ideographic_Description_Characters
In_CJK_Symbols_and_Punctuation
In_Hiragana
In_Katakana
In_Bopomofo
In_Hangul_Compatibility_Jamo
In_Kanbun
In_Bopomofo_Extended
In_CJK_Strokes
In_Katakana_Phonetic_Extensions
In_Enclosed_CJK_Letters_and_Months
In_CJK_Compatibility
In_CJK_Unified_Ideographs_Extension_A
In_Yijing_Hexagram_Symbols
In_CJK_Unified_Ideographs
In_Yi_Syllables
In_Yi_Radicals
In_Lisu
In_Vai
In_Cyrillic_Extended_B
In_Bamum
In_Modifier_Tone_Letters
In_Latin_Extended_D
In_Syloti_Nagri
In_Common_Indic_Number_Forms
In_Phags_pa
In_Saurashtra
In_Devanagari_Extended
In_Kayah_Li
In_Rejang
In_Hangul_Jamo_Extended_A
In_Javanese
In_Myanmar_Extended_B
In_Cham
In_Myanmar_Extended_A
In_Tai_Viet
In_Meetei_Mayek_Extensions
In_Ethiopic_Extended_A
In_Latin_Extended_E
In_Meetei_Mayek
In_Hangul_Syllables
In_Hangul_Jamo_Extended_B
In_High_Surrogates
In_High_Private_Use_Surrogates
In_Low_Surrogates
In_Private_Use_Area
In_CJK_Compatibility_Ideographs
In_Alphabetic_Presentation_Forms
In_Arabic_Presentation_Forms_A
In_Variation_Selectors
In_Vertical_Forms
In_Combining_Half_Marks
In_CJK_Compatibility_Forms
In_Small_Form_Variants
In_Arabic_Presentation_Forms_B
In_Halfwidth_and_Fullwidth_Forms
In_Specials
In_Linear_B_Syllabary
In_Linear_B_Ideograms
In_Aegean_Numbers
In_Ancient_Greek_Numbers
In_Ancient_Symbols
In_Phaistos_Disc
In_Lycian
In_Carian
In_Coptic_Epact_Numbers
In_Old_Italic
In_Gothic
In_Old_Permic
In_Ugaritic
In_Old_Persian
In_Deseret
In_Shavian
In_Osmanya
In_Elbasan
In_Caucasian_Albanian
In_Linear_A
In_Cypriot_Syllabary
In_Imperial_Aramaic
In_Palmyrene
In_Nabataean
In_Phoenician
In_Lydian
In_Meroitic_Hieroglyphs
In_Meroitic_Cursive
In_Kharoshthi
In_Old_South_Arabian
In_Old_North_Arabian
In_Manichaean
In_Avestan
In_Inscriptional_Parthian
In_Inscriptional_Pahlavi
In_Psalter_Pahlavi
In_Old_Turkic
In_Rumi_Numeral_Symbols
In_Brahmi
In_Kaithi
In_Sora_Sompeng
In_Chakma
In_Mahajani
In_Sharada
In_Sinhala_Archaic_Numbers
In_Khojki
In_Khudawadi
In_Grantha
In_Tirhuta
In_Siddham
In_Modi
In_Takri
In_Warang_Citi
In_Pau_Cin_Hau
In_Cuneiform
In_Cuneiform_Numbers_and_Punctuation
In_Egyptian_Hieroglyphs
In_Bamum_Supplement
In_Mro
In_Bassa_Vah
In_Pahawh_Hmong
In_Miao
In_Kana_Supplement
In_Duployan
In_Shorthand_Format_Controls
In_Byzantine_Musical_Symbols
In_Musical_Symbols
In_Ancient_Greek_Musical_Notation
In_Tai_Xuan_Jing_Symbols
In_Counting_Rod_Numerals
In_Mathematical_Alphanumeric_Symbols
In_Mende_Kikakui
In_Arabic_Mathematical_Alphabetic_Symbols
In_Mahjong_Tiles
In_Domino_Tiles
In_Playing_Cards
In_Enclosed_Alphanumeric_Supplement
In_Enclosed_Ideographic_Supplement
In_Miscellaneous_Symbols_and_Pictographs
In_Emoticons
In_Ornamental_Dingbats
In_Transport_and_Map_Symbols
In_Alchemical_Symbols
In_Geometric_Shapes_Extended
In_Supplemental_Arrows_C
In_CJK_Unified_Ideographs_Extension_B
In_CJK_Unified_Ideographs_Extension_C
In_CJK_Unified_Ideographs_Extension_D
In_CJK_Compatibility_Ideographs_Supplement
In_Tags
In_Variation_Selectors_Supplement
In_Supplementary_Private_Use_Area_A
In_Supplementary_Private_Use_Area_B
In_No_Block
As you can see, "£" is not a valid Unicode property name.

Net::Telnet - puts or print string in UTF-8

I'm using an API in which I have to send client information as a JSON object over a telnet connection (very strange, I know^^).
I'm German, so the client information very often contains umlauts or the ß.
My procedure:
I generate a Hash with all the command information.
I convert the Hash to a Json-object.
I convert the Json-object to a string (with .to_s).
I send the string with the Net::Telnet.puts command.
My puts command looks like: (cmd is the Json-object)
host.puts(cmd.to_s.force_encoding('UTF-8'))
In the log files I see that the JSON object doesn't contain the umlauts but, for example, this: Ã¼ instead of ü.
I verified that the string is UTF-8 (with or without the force_encoding call), so I think the puts command doesn't send the strings as UTF-8.
Is it possible to send the command in UTF-8? How can I do this?
The whole methods:
host = Net::Telnet::new(
  'Host'       => host_string,
  'Port'       => port_integer,
  'Output_log' => 'log/' + Time.now.strftime('%Y-%m-%d') + '.log',
  'Timeout'    => false,
  'Telnetmode' => false,
  'Prompt'     => /\z/n
)
def send_cmd_container(host, cmd, params=nil)
  cmd = JSON.generate({'*C'=>'se', 'Q'=>[get_cmd(cmd, params)]})
  host.puts(cmd.to_s.force_encoding('UTF-8'))
  add_request_to_logfile(cmd)
end

def get_cmd(cmd, params=nil)
  if params == nil
    return {'*C'=>'sq', 'CMD'=>cmd}
  else
    return {'*C'=>'sq', 'CMD'=>cmd, 'PARAMS'=>params}
  end
end
Addition:
I also log my sent requests with this method:
def add_request_to_logfile(request_string)
  directory = 'log/'
  File.open(File.join(directory, Time.now.strftime('%Y-%m-%d') + '.log'), 'a+') do |f|
    f.puts ''
    f.puts '> ' + request_string
  end
end
In the logfile my requests also don't contain UTF-8 umlauts but, for example, this: Ã¼
TL;DR
Set 'Binmode' => true and use Encoding::BINARY.
The above should work for you. If you're interested in why, read on.
Telnet doesn't really have a concept of "encoding." Telnet just has two modes: Normal mode assumes you're sending 7-bit ASCII characters, and binary mode assumes you're sending 8-bit bytes. You can't tell Telnet "this is UTF-8" because Telnet doesn't know what that means. You can tell it "this is ASCII-7" or "this is a sequence of 8-bit bytes," and that's it.
This might seem like bad news, but it's actually great news, because it just so happens that UTF-8 encodes text as sequences of 8-bit bytes. früh, for example, is five bytes: 66 72 c3 bc 68. This is easy to confirm in Ruby:
puts str = "\x66\x72\xC3\xBC\x68"
# => früh
puts str.bytes.size
# => 5
In Net::Telnet we can turn on binary mode by passing the 'Binmode' => true option to Net::Telnet::new. But there's one more thing we have to do: Tell Ruby to treat the string like binary data, i.e. a sequence of 8-bit bytes.
You already tried to use String#force_encoding, but what you might not have realized is that String#force_encoding doesn't actually convert a string from one encoding to another. Its purpose isn't to change the data's encoding—its purpose is to tell Ruby what encoding the data is already in:
str = "früh" # => "früh"
p str.encoding # => #<Encoding:UTF-8>
p str[2] # => "ü"
p str.bytes # => [102, 114, 195, 188, 104]
            # (the decimal representation of the hex bytes
            #  we saw before, `66 72 c3 bc 68`)
str.force_encoding(Encoding::BINARY) # => "fr\xC3\xBCh"
p str[2] # => "\xC3"
p str.bytes # => [ 102, 114, 195, 188, 104 ] # Same bytes!
Now I'll let you in on a little secret: Encoding::BINARY is just an alias for Encoding::ASCII_8BIT. Since ASCII-8BIT doesn't have multi-byte characters, Ruby shows ü as two separate bytes, \xC3\xBC. Those bytes aren't printable characters in ASCII-8BIT, so Ruby shows the \x## escape codes instead, but the data hasn't changed—only the way Ruby prints it has changed.
So here's the thing: Even though Ruby is now calling the string BINARY or ASCII-8BIT instead of UTF-8, it's still the same bytes, which means it's still UTF-8. Changing the encoding it's "tagged" as, however, means when Net::Telnet does (the equivalent of) data[n] it will always get one byte (instead of potentially getting multi-byte characters as in UTF-8), which is just what we want.
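A quick round trip shows that the re-tag is lossless (assuming a UTF-8 source file; the dup guards against frozen string literals):

```ruby
# Re-tagging a UTF-8 string as BINARY and back never changes the bytes.
s = "für".dup
bytes_before = s.bytes
s.force_encoding(Encoding::BINARY)
s.bytes == bytes_before  # => true (same bytes, different label)
s.force_encoding(Encoding::UTF_8)
s                        # => "für"
```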
And so...
host = Net::Telnet::new(
  # ...all of your other options...
  'Binmode' => true
)

def send_cmd_container(host, cmd, params=nil)
  cmd = JSON.generate('*C' => 'se', 'Q' => [get_cmd(cmd, params)])
  cmd.force_encoding(Encoding::BINARY)
  host.puts(cmd)
  # ...
end
(Note: JSON.generate always returns a UTF-8 string, so you never have to do e.g. cmd.to_s.)
Useful diagnostics
A quick way to check what data Net::Telnet is actually sending (and receiving) is to set the 'Dump_log' option (in the same way you set the 'Output_log' option). It will write both sent and received data to a log file in hexdump format, which will allow you to see if the bytes being sent are correct. For example, I started a test server (nc -l 5555) and sent the string früh (host.puts "früh".force_encoding(Encoding::BINARY)), and this is what was logged:
> 0x00000: 66 72 c3 bc 68 0a fr..h.
You can see that it sent six bytes: the first two are f and r, the next two make up ü, and the last two are h and a newline. On the right, bytes that aren't printable characters are shown as ., ergo fr..h.. (By the same token, I sent the string I❤NY and saw I...NY. in the right column, because ❤ is three bytes in UTF-8: e2 9d a4).
So, if you set 'Dump_log' and send a ü, you should see c3 bc in the output. If you do, congratulations—you're sending UTF-8!
P.S. Read Yehuda Katz' article Ruby 1.9 Encodings: A Primer and the Solution for Rails. In fact, read it yearly. It's really, really useful.

Encoding issue with Sqlite3 in Ruby

I have a list of SQL queries beautifully encoded in UTF-8. I read them from files, perform the inserts, and then do a select.
# encoding: utf-8
def exec_sql_lines(file_name)
  puts "----> #{file_name} <----"
  File.open(file_name, 'r') do |f|
    # sometimes a query doesn't fit on one line
    previous_line = ""
    i = 0
    while line = f.gets do
      puts i += 1
      if line[-2] != ')'
        previous_line += line[0..-2]
        next
      end
      puts(previous_line + line) # <---- (1)
      $db.execute(previous_line + line)
      previous_line = ""
    end
    a = $db.execute("select * from Table where _id=6")
    puts a # <---- (2)
  end
end

$db = SQLite3::Database.new($DBNAME)
exec_sql_lines("creates.txt")
exec_sql_lines("inserts.txt")
$db.close
The text at (1) is different from the text at (2): Polish letters get broken. If I open the database in IRB and call $db.encoding, it says UTF-8.
Why do the Polish letters come out broken, and how do I fix it?
I need this database properly encoded in UTF-8 for my Android app, so I'm not interested in massaging the database output; I need to fix its content.
EDIT
Significant lines from the output:
6
INSERT INTO 'Leki' VALUES (NULL, '6', 'Acenocoumarolum', 'Acenocumarol WZF', 'tabl. ', '4 mg', '60 tabl.', '5909990055715', '2012-01-01', '2 lata', '21.0, Leki przeciwzakrzepowe z grupy antagonistów witaminy K', '8.32', '12.07', '12.07', 'We wszystkich zarejestrowanych wskazaniach na dzień wydania decyzji', '', 'ryczałt', '5.12')
out:
6
6
Acenocoumarolum
Acenocumarol WZF
tabl.
4 mg
60 tabl.
5909990055715
2012-01-01
2 lata
21.0, Leki przeciwzakrzepowe z grupy antagonistĂł[<--HERE]w witaminy K
8.32
12.07
12.07
We wszystkich zarejestrowanych wskazaniach na dzieĹ[<--HERE] wydania decyzji
ryczaĹ[<--HERE]t
5.12
There are three default encodings.
In your code you set only the source encoding.
Perhaps the problem is with the external and internal encodings?
A quick test in windows:
#encoding: utf-8
File.open(__FILE__, 'r') { |f|
  p f.external_encoding
  p f.internal_encoding
  p f.read.encoding
}
Result:
#<Encoding:CP850>
nil
#<Encoding:CP850>
Even though the source encoding is UTF-8, the data is read as CP850.
In your case:
Does File.open(filename,'r:utf-8') help?
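A minimal sketch of that suggestion, using a temporary file so it is self-contained (the file name and contents are made up):

```ruby
require 'tempfile'

# Opening the SQL file with an explicit external encoding tags the data
# as UTF-8 regardless of the platform default (e.g. CP850 on Windows).
Tempfile.create(['inserts', '.sql']) do |t|
  t.write("INSERT INTO Leki VALUES ('antagonistów');\n")
  t.flush
  File.open(t.path, 'r:utf-8') do |f|
    line = f.gets
    p f.external_encoding   # => #<Encoding:UTF-8>
    p line.valid_encoding?  # => true
  end
end
```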
