£ considered invalid character - ruby

I need to search for a "£" sign in my text but it keeps coming up with the error:
invalid character property name {`£`}: /\p{`\u00A3`}/ (SyntaxError)
I have # encoding: utf-8 at the top. The context in which I'm using it is:
original_contents << line.gsub(/[abc]/, '*')
.gsub(/\p{£}/, '')
When I try .gsub(/£/, '') instead, I get
C:/Users...Epub run through.rb:12:in `gsub': incompatible encoding regexp match (UTF-8 regexp with CP850 string) (Encoding::CompatibilityError)
from C:/Users...Epub run through.rb:12:in `block in <top (required)>'
from C:/Users...Epub run through.rb:9:in `each_line'
from C:/Users...Epub run through.rb:9:in `<top (required)>'
from -e:1:in `load'
from -e:1:in `<main>'

In a regexp, \p is for matching a "character property", which is basically a set of characters that are related in some way (e.g. digit characters, ASCII characters, etc.). The documentation lists the character properties here. According to that list (and your error message), £ is not a valid character property name.
You should just use /£/. The error you're getting in that case is because your string is not UTF-8 encoded. Regexps use source encoding by default (which is UTF-8) and should match the encoding of the string. The easiest way to fix that is to change the string's encoding.
original_contents.encode! Encoding::UTF_8
original_contents.gsub(/£/, '')
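As a minimal sketch of the fix above, assuming the input really is CP850 (as the error message suggests): the sample line and variable names below are made up for illustration, and the pound sign is written as the escape \u00A3 so the example does not depend on how the file itself is saved.

```ruby
# Hypothetical input line: "cost: £5" as CP850 bytes, where £ is the single byte 0x9C.
line = "cost: \x9C5".b.force_encoding('CP850')

# Matching a UTF-8 regexp against the CP850 string reproduces the error from the question.
error_class =
  begin
    line.gsub(/\u00A3/, '')
    nil
  rescue Encoding::CompatibilityError => e
    e.class
  end

# Transcode to UTF-8 first, then the same regexp (equivalent to /£/) works.
utf8_line = line.encode('UTF-8')
without_pound = utf8_line.gsub(/\u00A3/, '')

puts without_pound
```

Note that encode returns a new string, while encode! changes the receiver in place, as in the answer.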

The error message says "invalid character property name", and that is correct. The valid Unicode character property names are:
Alpha
Blank
Cntrl
Digit
Graph
Lower
Print
Punct
Space
Upper
XDigit
Word
Alnum
ASCII
Any
Assigned
C
Cc
Cf
Cn
Co
Cs
L
LC
Ll
Lm
Lo
Lt
Lu
M
Mc
Me
Mn
N
Nd
Nl
No
P
Pc
Pd
Pe
Pf
Pi
Po
Ps
S
Sc
Sk
Sm
So
Z
Zl
Zp
Zs
Arabic
Armenian
Avestan
Balinese
Bamum
Bassa_Vah
Batak
Bengali
Bopomofo
Brahmi
Braille
Buginese
Buhid
Canadian_Aboriginal
Carian
Caucasian_Albanian
Chakma
Cham
Cherokee
Common
Coptic
Cuneiform
Cypriot
Cyrillic
Deseret
Devanagari
Duployan
Egyptian_Hieroglyphs
Elbasan
Ethiopic
Georgian
Glagolitic
Gothic
Grantha
Greek
Gujarati
Gurmukhi
Han
Hangul
Hanunoo
Hebrew
Hiragana
Imperial_Aramaic
Inherited
Inscriptional_Pahlavi
Inscriptional_Parthian
Javanese
Kaithi
Kannada
Katakana
Kayah_Li
Kharoshthi
Khmer
Khojki
Khudawadi
Lao
Latin
Lepcha
Limbu
Linear_A
Linear_B
Lisu
Lycian
Lydian
Mahajani
Malayalam
Mandaic
Manichaean
Meetei_Mayek
Mende_Kikakui
Meroitic_Cursive
Meroitic_Hieroglyphs
Miao
Modi
Mongolian
Mro
Myanmar
Nabataean
New_Tai_Lue
Nko
Ogham
Ol_Chiki
Old_Italic
Old_North_Arabian
Old_Permic
Old_Persian
Old_South_Arabian
Old_Turkic
Oriya
Osmanya
Pahawh_Hmong
Palmyrene
Pau_Cin_Hau
Phags_Pa
Phoenician
Psalter_Pahlavi
Rejang
Runic
Samaritan
Saurashtra
Sharada
Shavian
Siddham
Sinhala
Sora_Sompeng
Sundanese
Syloti_Nagri
Syriac
Tagalog
Tagbanwa
Tai_Le
Tai_Tham
Tai_Viet
Takri
Tamil
Telugu
Thaana
Thai
Tibetan
Tifinagh
Tirhuta
Ugaritic
Unknown
Vai
Warang_Citi
Yi
Alphabetic
Case_Ignorable
Cased
Changes_When_Casefolded
Changes_When_Casemapped
Changes_When_Lowercased
Changes_When_Titlecased
Changes_When_Uppercased
Default_Ignorable_Code_Point
Grapheme_Base
Grapheme_Extend
Grapheme_Link
ID_Continue
ID_Start
Lowercase
Math
Uppercase
XID_Continue
XID_Start
ASCII_Hex_Digit
Bidi_Control
Dash
Deprecated
Diacritic
Extender
Hex_Digit
Hyphen
IDS_Binary_Operator
IDS_Trinary_Operator
Ideographic
Join_Control
Logical_Order_Exception
Noncharacter_Code_Point
Other_Alphabetic
Other_Default_Ignorable_Code_Point
Other_Grapheme_Extend
Other_ID_Continue
Other_ID_Start
Other_Lowercase
Other_Math
Other_Uppercase
Pattern_Syntax
Pattern_White_Space
Quotation_Mark
Radical
STerm
Soft_Dotted
Terminal_Punctuation
Unified_Ideograph
Variation_Selector
White_Space
AHex
Bidi_C
CI
CWCF
CWCM
CWL
CWT
CWU
DI
Dep
Dia
Ext
Gr_Base
Gr_Ext
Gr_Link
Hex
IDC
IDS
IDSB
IDST
Ideo
Join_C
LOE
NChar
OAlpha
ODI
OGr_Ext
OIDC
OIDS
OLower
OMath
OUpper
Pat_Syn
Pat_WS
QMark
SD
Term
UIdeo
VS
WSpace
XIDC
XIDS
Other
Control
Format
Unassigned
Private_Use
Surrogate
Letter
Cased_Letter
Lowercase_Letter
Modifier_Letter
Other_Letter
Titlecase_Letter
Uppercase_Letter
Mark
Combining_Mark
Spacing_Mark
Enclosing_Mark
Nonspacing_Mark
Number
Decimal_Number
Letter_Number
Other_Number
Punctuation
Connector_Punctuation
Dash_Punctuation
Close_Punctuation
Final_Punctuation
Initial_Punctuation
Other_Punctuation
Open_Punctuation
Symbol
Currency_Symbol
Modifier_Symbol
Math_Symbol
Other_Symbol
Separator
Line_Separator
Paragraph_Separator
Space_Separator
Aghb
Arab
Armi
Armn
Avst
Bali
Bamu
Bass
Batk
Beng
Bopo
Brah
Brai
Bugi
Buhd
Cakm
Cans
Cari
Cher
Copt
Qaac
Cprt
Cyrl
Deva
Dsrt
Dupl
Egyp
Elba
Ethi
Geor
Glag
Goth
Gran
Grek
Gujr
Guru
Hang
Hani
Hano
Hebr
Hira
Hmng
Ital
Java
Kali
Kana
Khar
Khmr
Khoj
Knda
Kthi
Lana
Laoo
Latn
Lepc
Limb
Lina
Linb
Lyci
Lydi
Mahj
Mand
Mani
Mend
Merc
Mero
Mlym
Mong
Mroo
Mtei
Mymr
Narb
Nbat
Nkoo
Ogam
Olck
Orkh
Orya
Osma
Palm
Pauc
Perm
Phag
Phli
Phlp
Phnx
Plrd
Prti
Rjng
Runr
Samr
Sarb
Saur
Shaw
Shrd
Sidd
Sind
Sinh
Sora
Sund
Sylo
Syrc
Tagb
Takr
Tale
Talu
Taml
Tavt
Telu
Tfng
Tglg
Thaa
Tibt
Tirh
Ugar
Vaii
Wara
Xpeo
Xsux
Yiii
Zinh
Qaai
Zyyy
Zzzz
Age=1.1
Age=2.0
Age=2.1
Age=3.0
Age=3.1
Age=3.2
Age=4.0
Age=4.1
Age=5.0
Age=5.1
Age=5.2
Age=6.0
Age=6.1
Age=6.2
Age=6.3
Age=7.0
In_Basic_Latin
In_Latin_1_Supplement
In_Latin_Extended_A
In_Latin_Extended_B
In_IPA_Extensions
In_Spacing_Modifier_Letters
In_Combining_Diacritical_Marks
In_Greek_and_Coptic
In_Cyrillic
In_Cyrillic_Supplement
In_Armenian
In_Hebrew
In_Arabic
In_Syriac
In_Arabic_Supplement
In_Thaana
In_NKo
In_Samaritan
In_Mandaic
In_Arabic_Extended_A
In_Devanagari
In_Bengali
In_Gurmukhi
In_Gujarati
In_Oriya
In_Tamil
In_Telugu
In_Kannada
In_Malayalam
In_Sinhala
In_Thai
In_Lao
In_Tibetan
In_Myanmar
In_Georgian
In_Hangul_Jamo
In_Ethiopic
In_Ethiopic_Supplement
In_Cherokee
In_Unified_Canadian_Aboriginal_Syllabics
In_Ogham
In_Runic
In_Tagalog
In_Hanunoo
In_Buhid
In_Tagbanwa
In_Khmer
In_Mongolian
In_Unified_Canadian_Aboriginal_Syllabics_Extended
In_Limbu
In_Tai_Le
In_New_Tai_Lue
In_Khmer_Symbols
In_Buginese
In_Tai_Tham
In_Combining_Diacritical_Marks_Extended
In_Balinese
In_Sundanese
In_Batak
In_Lepcha
In_Ol_Chiki
In_Sundanese_Supplement
In_Vedic_Extensions
In_Phonetic_Extensions
In_Phonetic_Extensions_Supplement
In_Combining_Diacritical_Marks_Supplement
In_Latin_Extended_Additional
In_Greek_Extended
In_General_Punctuation
In_Superscripts_and_Subscripts
In_Currency_Symbols
In_Combining_Diacritical_Marks_for_Symbols
In_Letterlike_Symbols
In_Number_Forms
In_Arrows
In_Mathematical_Operators
In_Miscellaneous_Technical
In_Control_Pictures
In_Optical_Character_Recognition
In_Enclosed_Alphanumerics
In_Box_Drawing
In_Block_Elements
In_Geometric_Shapes
In_Miscellaneous_Symbols
In_Dingbats
In_Miscellaneous_Mathematical_Symbols_A
In_Supplemental_Arrows_A
In_Braille_Patterns
In_Supplemental_Arrows_B
In_Miscellaneous_Mathematical_Symbols_B
In_Supplemental_Mathematical_Operators
In_Miscellaneous_Symbols_and_Arrows
In_Glagolitic
In_Latin_Extended_C
In_Coptic
In_Georgian_Supplement
In_Tifinagh
In_Ethiopic_Extended
In_Cyrillic_Extended_A
In_Supplemental_Punctuation
In_CJK_Radicals_Supplement
In_Kangxi_Radicals
In_Ideographic_Description_Characters
In_CJK_Symbols_and_Punctuation
In_Hiragana
In_Katakana
In_Bopomofo
In_Hangul_Compatibility_Jamo
In_Kanbun
In_Bopomofo_Extended
In_CJK_Strokes
In_Katakana_Phonetic_Extensions
In_Enclosed_CJK_Letters_and_Months
In_CJK_Compatibility
In_CJK_Unified_Ideographs_Extension_A
In_Yijing_Hexagram_Symbols
In_CJK_Unified_Ideographs
In_Yi_Syllables
In_Yi_Radicals
In_Lisu
In_Vai
In_Cyrillic_Extended_B
In_Bamum
In_Modifier_Tone_Letters
In_Latin_Extended_D
In_Syloti_Nagri
In_Common_Indic_Number_Forms
In_Phags_pa
In_Saurashtra
In_Devanagari_Extended
In_Kayah_Li
In_Rejang
In_Hangul_Jamo_Extended_A
In_Javanese
In_Myanmar_Extended_B
In_Cham
In_Myanmar_Extended_A
In_Tai_Viet
In_Meetei_Mayek_Extensions
In_Ethiopic_Extended_A
In_Latin_Extended_E
In_Meetei_Mayek
In_Hangul_Syllables
In_Hangul_Jamo_Extended_B
In_High_Surrogates
In_High_Private_Use_Surrogates
In_Low_Surrogates
In_Private_Use_Area
In_CJK_Compatibility_Ideographs
In_Alphabetic_Presentation_Forms
In_Arabic_Presentation_Forms_A
In_Variation_Selectors
In_Vertical_Forms
In_Combining_Half_Marks
In_CJK_Compatibility_Forms
In_Small_Form_Variants
In_Arabic_Presentation_Forms_B
In_Halfwidth_and_Fullwidth_Forms
In_Specials
In_Linear_B_Syllabary
In_Linear_B_Ideograms
In_Aegean_Numbers
In_Ancient_Greek_Numbers
In_Ancient_Symbols
In_Phaistos_Disc
In_Lycian
In_Carian
In_Coptic_Epact_Numbers
In_Old_Italic
In_Gothic
In_Old_Permic
In_Ugaritic
In_Old_Persian
In_Deseret
In_Shavian
In_Osmanya
In_Elbasan
In_Caucasian_Albanian
In_Linear_A
In_Cypriot_Syllabary
In_Imperial_Aramaic
In_Palmyrene
In_Nabataean
In_Phoenician
In_Lydian
In_Meroitic_Hieroglyphs
In_Meroitic_Cursive
In_Kharoshthi
In_Old_South_Arabian
In_Old_North_Arabian
In_Manichaean
In_Avestan
In_Inscriptional_Parthian
In_Inscriptional_Pahlavi
In_Psalter_Pahlavi
In_Old_Turkic
In_Rumi_Numeral_Symbols
In_Brahmi
In_Kaithi
In_Sora_Sompeng
In_Chakma
In_Mahajani
In_Sharada
In_Sinhala_Archaic_Numbers
In_Khojki
In_Khudawadi
In_Grantha
In_Tirhuta
In_Siddham
In_Modi
In_Takri
In_Warang_Citi
In_Pau_Cin_Hau
In_Cuneiform
In_Cuneiform_Numbers_and_Punctuation
In_Egyptian_Hieroglyphs
In_Bamum_Supplement
In_Mro
In_Bassa_Vah
In_Pahawh_Hmong
In_Miao
In_Kana_Supplement
In_Duployan
In_Shorthand_Format_Controls
In_Byzantine_Musical_Symbols
In_Musical_Symbols
In_Ancient_Greek_Musical_Notation
In_Tai_Xuan_Jing_Symbols
In_Counting_Rod_Numerals
In_Mathematical_Alphanumeric_Symbols
In_Mende_Kikakui
In_Arabic_Mathematical_Alphabetic_Symbols
In_Mahjong_Tiles
In_Domino_Tiles
In_Playing_Cards
In_Enclosed_Alphanumeric_Supplement
In_Enclosed_Ideographic_Supplement
In_Miscellaneous_Symbols_and_Pictographs
In_Emoticons
In_Ornamental_Dingbats
In_Transport_and_Map_Symbols
In_Alchemical_Symbols
In_Geometric_Shapes_Extended
In_Supplemental_Arrows_C
In_CJK_Unified_Ideographs_Extension_B
In_CJK_Unified_Ideographs_Extension_C
In_CJK_Unified_Ideographs_Extension_D
In_CJK_Compatibility_Ideographs_Supplement
In_Tags
In_Variation_Selectors_Supplement
In_Supplementary_Private_Use_Area_A
In_Supplementary_Private_Use_Area_B
In_No_Block
As you can see, "£" is not a valid Unicode property name.
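If the goal is to match currency signs in general rather than £ specifically, a property from the list can be used instead. The following is only a sketch; the variable names are illustrative, and the pound sign is written as \u00A3 so the file encoding does not matter.

```ruby
pound = "\u00A3" # the £ sign

# Sc is the short name of Currency_Symbol; both names appear in the list above.
matches_short = pound.match?(/\p{Sc}/)
matches_long  = pound.match?(/\p{Currency_Symbol}/)

# Building the invalid pattern at run time raises RegexpError
# (a regexp literal reports the same problem as a SyntaxError instead).
invalid_property_error =
  begin
    Regexp.new("\\p{#{pound}}")
    nil
  rescue RegexpError => e
    e.message
  end

puts matches_short, matches_long, invalid_property_error
```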

Related

Printing list with Polish letters

I am writing a simple program for Windows using Python 2.7. It opens an email, takes some words from it and puts them in a form on the web. The problem starts when the email contains Polish letters like Ó, Ź, Ł etc. Whenever I try to print it I get something like: ['\xc4\x84', '\xc5\xbb', '\xc3\x93', '\xc4\x86', '\xc5\xb9'].
I already know it is because of encoding and that Python 3 has no such problem. Here is what I tried already:
mail = " Ą Ż Ó Ć Ź"
mail = mail.split()
mail = mail.decode("UTF-8")
print mail
or
mail = " Ą Ż Ó Ć Ź"
mail = mail.split()
[x.encode('UTF8') for x in mail]
print mail
Can anyone please show me how to make the list print properly?
Python 2.x assumes ASCII as the default source encoding. To declare that the file is UTF-8, add this line to the top of your program:
# -*- coding: utf-8 -*-
Also, you should prefix string literals with u so that they are Unicode strings, e.g.:
polishLetters = u'Ą Ż Ó Ć Ź'

British Pound Sign £ causing PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0xa3

When collecting information containing the British Pound Sign '£' from external sources such as my bank, via csv file, and posting to postgres using ActiveRecord, I get the error:
PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0xa3
The 0xa3 is the hex code for a £ sign in single-byte encodings such as ISO-8859-1. The received wisdom is to clearly specify UTF-8 on the string whilst replacing invalid byte sequences:
string.encode('UTF-8', {:invalid => :replace, :undef => :replace, :replace => '?'})
This stops the error, but is a lossy fix as the '£' is converted into a '?'.
UTF-8 is able to handle the '£' sign, so what can be done to fix the invalid byte sequence and persist the '£' sign?
I'm answering my own question thanks to Michael Fuhr who explained the UTF-8 byte sequence for the pound sign is 0xc2 0xa3. So, all you have to do is find each occurrence of 0xa3 (163) and place 0xc2 (194) in front of it...
array_bytes = string.bytes
new_pound_ptr = 0 # offset at which the next search starts
# Look for £ sign
pound_ptr = array_bytes.index(163)
while !pound_ptr.nil?
  pound_ptr += new_pound_ptr # convert the slice-relative index to an absolute index
  # The following statement finds incorrectly sequenced £ sign...
  if (pound_ptr == 0) || (array_bytes[pound_ptr-1] != 194)
    array_bytes.insert(pound_ptr, 194)
    pound_ptr += 1 # the £ byte has moved one place to the right
  end
  new_pound_ptr = pound_ptr + 1
  # Search remainder of array for pound sign
  pound_ptr = array_bytes[new_pound_ptr..-1].index(163)
end
# Convert bytes to 8-bit unsigned char, and UTF-8
string = array_bytes.pack('C*').force_encoding('UTF-8') unless new_pound_ptr == 0
# Can now write string to model without out-of-sequence error..
hash["description"] = string
Model.create!(hash)
I've had so much help on this stackoverflow forum, I hope I have helped somebody else.
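An alternative worth considering, sketched here under the assumption that the CSV file is actually ISO-8859-1 (or a similar single-byte encoding) rather than damaged UTF-8: tell Ruby the real source encoding and let it transcode, instead of patching bytes by hand. The sample data and variable names are hypothetical.

```ruby
# Hypothetical CSV field as raw ISO-8859-1 bytes: "£100" with £ as the single byte 0xA3.
raw = "\xA3100".b

# Label the bytes with their real encoding, then transcode to UTF-8.
# The £ sign becomes the two-byte sequence 0xC2 0xA3 and nothing is lost.
description = raw.force_encoding('ISO-8859-1').encode('UTF-8')

puts description
puts description.bytes.first(2).inspect
```

When the data comes from a file, passing an option such as encoding: 'ISO-8859-1:UTF-8' to File.read should achieve the same transcoding at read time, again assuming that is the file's real encoding.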

Ruby string escape for supplementary plane Unicode characters

I know that I can escape a basic Unicode character in Ruby with the \uNNNN escape sequence. For example, for a smiling face U+263A (☺) I can use the string literal "\u263A".
How do I escape Unicode characters greater than U+FFFF that fall outside the basic multilingual plane, like a winking face: U+1F609 (😉)?
Using the surrogate pair form like in Java doesn't work; it results in an invalid string that contains the individual surrogate code points:
s = "\uD83D\uDE09" # => "\xED\xA0\xBD\xED\xB8\x89"
s.valid_encoding? # => false
You can use the escape sequence \u{XXXXXX}, where XXXXXX is between 1 and 6 hex digits:
s = "\u{1F609}" # => "😉"
The braces can also contain multiple runs separated by single spaces or tabs to encode multiple characters:
s = "\u{41f 440 438 432 435 442 2c 20 43c 438 440}!" # => "Привет, мир!"
You could also use byte escapes to write a literal that contains the UTF-8 encoding of the character, though that's not very convenient, and doesn't necessarily result in a UTF-8-encoded string, if the file encoding differs:
# encoding: utf-8
s = "\xF0\x9F\x98\x89" # => "😉"
s.length # => 1
# encoding: iso-8859-1
s = "\xF0\x9F\x98\x89" # => "\xF0\x9F\x98\x89"
s.length # => 4
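When the code point is only known at run time, or arrives as a UTF-16 surrogate pair from another system, a literal escape is not an option. The sketch below shows two standard ways to build the character from an integer, plus the usual surrogate-pair arithmetic; the variable names are illustrative.

```ruby
wink = "\u{1F609}"

# Build the same character from an integer code point.
from_integer = 0x1F609.chr(Encoding::UTF_8)
from_pack    = [0x1F609].pack('U')

# Combine a UTF-16 surrogate pair into a single code point.
high = 0xD83D
low  = 0xDE09
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
from_surrogates = code_point.chr(Encoding::UTF_8)

puts format('U+%X', code_point)
```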

Ruby: Remove invisible characters after converting string to UTF-8

I am working with text coming from this website with windows-1252 charset. Converting the text to UTF-8 was done using force_encoding, but the text still contains whitespace that I can't get rid of. The whitespace cannot be removed using text.gsub!(/\s/, ' ') or a similar technique.
The iconv gem doesn't do the trick either - as explained here. It is clear that the whitespace is a remnant of the original text and the windows-1252 charset, as I get an invalid multibyte char (US-ASCII) warning if I don't specify the encoding as UTF-8.
I'm not an expert of text encoding so I may be overlooking something trivial.
Update: This is the script that I currently use.
#!/bin/env ruby
# encoding: utf-8
require 'rubygems'
require 'nokogiri'
require 'open-uri'
URL = 'http://www.eximsystems.com/LaVerdad/Antiguo/Gn/Genesis.htm'
html = Nokogiri.HTML(open(URL))
# Extract Paragraphs
text = ''
html.css('p').each do |p|
text += p.text
end
# Clean Up Text
text.gsub!(/\s+/, ' ')
puts text
This is a sample of the text that contains invisible characters that I try to remove. The space before the number 16 is what I am referring to.
cobraron aliento para conversar con él.   16 Al punto corrió la voz, y
se divulgó generalmente esta noticia en el palacio del rey: Han
Without seeing your code, it's hard to know exactly what's going on for you. I'll point out, however, that String#force_encoding doesn't transcode the String; it's a way of saying, "No, really, this is UTF-8", for example. To transcode from one encoding to another, use String#encode.
This seems to work for me:
require 'net/http'
s = Net::HTTP.get('www.eximsystems.com', '/LaVerdad/Antiguo/Gn/Genesis.htm')
s.force_encoding('windows-1252')
s.encode!('utf-8')
In general, /[[:space:]]/ should capture more kinds of whitespace than /\s/ (which is equivalent to /[ \t\r\n\f\v]/), but it doesn't appear to be necessary in this case. I can't find any abnormal whitespace in s at this point. If you're still having problems, you'll need to post your code and a more precise description of the issue.
Update: Thanks for updating your question with your code and an example of the problem. It looks like the issue is non-breaking spaces. I think it's simplest to get rid of them at the source:
require 'nokogiri'
require 'open-uri'
URL = 'http://www.eximsystems.com/LaVerdad/Antiguo/Gn/Genesis.htm'
s = open(URL).read # Separate these three lines to convert &nbsp;
s.gsub!('&nbsp;', ' ') # to normal ' ' in source rather than after
html = Nokogiri.HTML(s) # conversion to unicode non-breaking space
# Extract Paragraphs
text = ''
html.css('p').each do |p|
text += p.text
end
# Clean Up Text
text.gsub!(/\s+/, ' ')
puts text
There's now just a single, normal space between the period at the end of 15 and the number 16:
15) Besó también José a todos sus hermanos, orando sobre cada uno de ellos; después de cuyas demostraciones cobraron aliento para conversar con él. 16 Al punto corrió la voz, y se divulgó generalmente esta noticia en el palacio del rey: Han venido los hermanos de José; y holgóse de ello Faraón y toda su corte.
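The two points made in this answer can be checked without fetching the page. The following is a small sketch with made-up sample text: it shows that force_encoding only relabels bytes while encode transcodes them, and that /\s/ leaves a non-breaking space (U+00A0) alone while /[[:space:]]/ matches it.

```ruby
# 0xA0 is the non-breaking space in windows-1252.
raw = "\xA0".b

# Relabeling the byte as UTF-8 does not convert it; the result is invalid UTF-8.
relabeled_valid = raw.dup.force_encoding('UTF-8').valid_encoding?

# Labeling it with its real encoding and transcoding yields U+00A0.
transcoded = raw.dup.force_encoding('Windows-1252').encode('UTF-8')

# Hypothetical sample: two non-breaking spaces and one normal space before "16".
text = "fin.\u00A0\u00A0 16 Al punto"

with_backslash_s = text.gsub(/\s+/, ' ')          # non-breaking spaces survive
with_posix_class = text.gsub(/[[:space:]]+/, ' ') # all three collapse to one space

puts with_posix_class
```

This gives an alternative to editing the HTML source: replacing /\s+/ with /[[:space:]]+/ in the clean-up step should remove the non-breaking spaces after parsing.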
You can try to use text.strip for removing the whitespaces.

Python 3 argument (semi)not UTF-8 when passed from Windows batch.cmd

When I invoke a Python 3 script from a Windows batch.cmd,
a UTF-8 arg is not passed as "UTF-8", but as a series of bytes,
each of which is interpreted by Python as an individual character.
How can I convert the Python 3 arg string to its intended UTF-8 state?
The calling .cmd and the called .py are shown below.
PS. As I mention in a comment below, calling u00FF.py "ÿ" directly from the Windows console commandline works fine. It is only a problem when I invoke u00FF.cmd via the .cmd, and I am looking for a Python 3 way to convert the double-encoded UTF-8 arg back to a "normally" encoded UTF-8 form.
I've now included the full (and latest) test code here. It's a bit long, but I hope it explains the issue clearly enough.
Update: I've seen why the file read of "ÿ" was "double-encoding"... I was reading the UTF-8 file in binary/byte mode... I should have used codecs.open('u00FF.arg', 'r', 'utf-8') instead of just plain open('u00FF.arg','r')... I've updated the offending code, and the output. The codepage issue seems to be the only problem now...
Because the Python issue has been largely resolved, and the codepage issue is quite independent of Python, I have posted another codepage specific question at
Codepage 850 works, 65001 fails! There is NO response to “call foo.cmd”. internal commands work fine.
::::::::::::::::::: BEGIN .cmd BATCH FILE ::::::::::::::::::::
:: Windows Batch file (UTF-8 encoded, no BOM): "u00FF.cmd"
@echo ÿ>u00FF.arg
@u00FF.py "ÿ"
@goto :eof
::::::::::::::::::: END OF .cmd BATCH FILE ::::::::::::::::::::
################### BEGIN .py SCRIPT #####################################
# -*- coding: utf-8 -*-
import sys
print ("""
Unicode
=======
CodePoint U+00FF
Character ÿ __Unicode Character 'LATIN SMALL LETTER Y WITH DIAERESIS'
UTF-8 bytes
===========
Hex: \\xC3 \\xBF
Dec: 195 191
Char: Ã ¿ __Unicode Character 'INVERTED QUESTION MARK'
\_______Unicode Character 'LATIN CAPITAL LETTER A WITH TILDE'
""")
print("## ====================================================")
print("## ÿ via hard-coding in this .py script itself ========")
print("##")
hard1s = "ÿ"
hard1b = hard1s.encode('utf_8')
print("hard1s: len", len(hard1s), " '" + hard1s + "'")
print("hard1b: len", len(hard1b), hard1b)
for i in range(0,len(hard1s)):
    print("CodePoint[", i, "]", hard1s[i], "U+"+"{0:x}".upper().format(ord(hard1s[i])).zfill(4) )
print(''' This is a single CodePoint for "ÿ" (as expected).''')
print()
print("## ====================================================")
print("## ÿ read into this .py script from a UTF-8 file ======")
print("##")
import codecs
file1 = codecs.open( 'u00FF.arg', 'r', 'utf-8' )
file1s = file1.readline()
file1s = file1s[:1] # remove \r
file1b = file1s.encode('utf_8')
print("file1s: len", len(file1s), " '" + file1s + "'")
print("file1b: len", len(file1b), file1b)
for i in range(0,len(file1s)):
    print("CodePoint[", i, "]", file1s[i], "U+"+"{0:x}".upper().format(ord(file1s[i])).zfill(4) )
print(''' This is a single CodePoint for "ÿ" (as expected).''')
print()
print("## ====================================================")
print("## ÿ via sys.argv from a call to .py from a .cmd) ===")
print("##")
argv1s = sys.argv[1]
argv1b = argv1s.encode('utf_8')
print("argv1s: len", len(argv1s), " '" + argv1s + "'")
print("argv1b: len", len(argv1b), argv1b)
for i in range(0,len(argv1s)):
    print("CodePoint[", i, "]", argv1s[i], "U+"+"{0:x}".upper().format(ord(argv1s[i])).zfill(4) )
print(''' These 2 CodePoints are way off-beam,
even allowing for the "double-encoding" seen above.
The CodePoints are from an entirely different Unicode-Block.
This must be a Codepage issue.''')
print()
################### END OF .py SCRIPT #####################################
Here is the output from the above code.
========================== BEGIN OUTPUT ================================
C:\>u00FF.cmd
Unicode
=======
CodePoint U+00FF
Character ÿ __Unicode Character 'LATIN SMALL LETTER Y WITH DIAERESIS'
UTF-8 bytes
===========
Hex: \xC3 \xBF
Dec: 195 191
Char: Ã ¿ __Unicode Character 'INVERTED QUESTION MARK'
\_______Unicode Character 'LATIN CAPITAL LETTER A WITH TILDE'
## ====================================================
## ÿ via hard-coding in this .py script itself ========
##
hard1s: len 1 'ÿ'
hard1b: len 2 b'\xc3\xbf'
CodePoint[ 0 ] ÿ U+00FF
This is a single CodePoint for "ÿ" (as expected).
## ====================================================
## ÿ read into this .py script from a UTF-8 file ======
##
file1s: len 1 'ÿ'
file1b: len 2 b'\xc3\xbf'
CodePoint[ 0 ] ÿ U+00FF
This is a single CodePoint for "ÿ" (as expected).
## ====================================================
## ÿ via sys.argv from a call to .py from a .cmd) ===
##
argv1s: len 2 '├┐'
argv1b: len 6 b'\xe2\x94\x9c\xe2\x94\x90'
CodePoint[ 0 ] ├ U+251C
CodePoint[ 1 ] ┐ U+2510
These 2 CodePoints are way off-beam,
even allowing for the "double-encoding" seen above.
The CodePoints are from an entirely different Unicode-Block.
This must be a Codepage issue.
========================== END OF OUTPUT ================================
Batch files and encodings are a finicky issue. First of all: Batch files have no direct way of specifying the encoding they're in and cmd does not really support Unicode batch files. You can easily see that if you save a batch file with a Unicode BOM or as UTF-16 – they will throw an error.
What you see when you put the ÿ directly into the command line is that when running a command Windows will initially use the command line as Unicode (it may have been converted from some legacy encoding beforehand, but in the end what Windows uses is Unicode). So Python will (hopefully) always grab the Unicode content of the arguments.
However, since cmd has its own opinions about the codepage (and you never told it to use UTF-8) the UTF-8 string you put in the batch file won't be interpreted as UTF-8 but instead in the default cmd codepage (850 or 437, in your case).
You can force UTF-8 with chcp:
chcp 65001 > nul
You can save the following file as UTF-8 and try it out:
@echo off
chcp 850 >nul
echo ÿ
chcp 65001 >nul
echo ÿ
Keep in mind, though, that the chcp setting will persist in the shell if you run the batch from there which may make things weird.
Windows shell uses a specific code page (see CHCP command output). You need to convert from Windows code page to utf-8. See iconv module or decode() / encode()
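The mechanism behind the '├┐' output can be reproduced in a few lines. The sketch below is written in Ruby (the language of the main question above) rather than Python, and assumes the console code page is 850: it decodes the UTF-8 bytes of ÿ as CP850 to get the same two box-drawing characters, then reverses the mistake.

```ruby
# The UTF-8 encoding of ÿ (U+00FF) is the two bytes 0xC3 0xBF.
utf8_bytes = "\u00FF".b

# Misreading those bytes as CP850 produces the two box-drawing characters.
mojibake = utf8_bytes.dup.force_encoding('CP850').encode('UTF-8')

# Undo it: encode back to CP850 to recover the original bytes,
# then label them as the UTF-8 they always were.
repaired = mojibake.encode('CP850').force_encoding('UTF-8')

puts mojibake.codepoints.map { |cp| format('U+%04X', cp) }.join(' ')
puts repaired
```

The same round trip (encode to the console code page, then decode as UTF-8) is what the decode() / encode() suggestion above amounts to.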

Resources