TreeView.insert throws UnicodeDecodeError - windows

I'm trying to populate a TreeView with data from os.listdir(path).
Everything works until I read a directory name containing a non-ASCII character, in my case the byte 0xf6, which is not valid UTF-8.
As I'm running on Windows, the names returned by os.listdir() are encoded in Windows-1252 (the "ANSI" code page).
How can I solve this so that the names display correctly in the TreeView?
Here is some of my code:
def fill_tree(treeview, node):
    if treeview.set(node, "type") != 'directory':
        return
    path = treeview.set(node, "fullpath")
    # Delete the possibly 'dummy' node present.
    treeview.delete(*treeview.get_children(node))
    parent = treeview.parent(node)
    for p in os.listdir(path):
        ptype = None
        p = os.path.join(path, p)
        if os.path.isdir(p):
            ptype = 'directory'
        fname = os.path.split(p)[1].decode('cp1252').encode('utf8')
        if ptype == 'directory':
            oid = treeview.insert(node, 'end', text=fname, values=[p, ptype])
            treeview.insert(oid, 0, text='dummy')
Regards
Göran

The UnicodeDecodeError comes from passing byte strings where the function expects Unicode strings: Python 2 then tries to decode the byte strings to Unicode implicitly, and that fails. Use Unicode strings explicitly instead. os.listdir() returns Unicode strings when it is given a Unicode path, for example os.listdir(u'.').
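The answer targets Python 2. As a rough sketch of the same rule on Python 3 (an assumption, since the question does not use it), os.listdir returns names of the same type as the path it receives: str for a str path, bytes for a bytes path. The helper name listing_types and the throwaway directory are only for illustration.

```python
import os
import tempfile

def listing_types(path):
    """Return the set of type names of the entries os.listdir yields for path."""
    return {type(name).__name__ for name in os.listdir(path)}

# Build a throwaway directory with one entry so the listing is never empty.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "demo.txt"), "w").close()

text_result = listing_types(demo_dir)                # str path in, str names out
bytes_result = listing_types(os.fsencode(demo_dir))  # bytes path in, bytes names out
print(text_result, bytes_result)
```

Keeping the path as text from the start means no manual decode or encode step is needed before handing the names to the widget.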

Related

QDataStream readQString() How to read utf8 String

I am trying to decode UDP packet data from an application that encoded the data using Qt's QDataStream methods, but I am having trouble decoding the string fields. The docs say the data was encoded in utf8. The Python QDataStream module only has a readQString() method. Numbers seem to decode fine, but the stream pointer gets messed up as soon as the first string decodes improperly.
How can I decode these UTF-8 strings?
I am using some documentation from the source project to interpret the encoding:
wsjtx-2.2.2.tgz
NetworkMessage.hpp Description in the header file
Header:
32-bit unsigned integer magic number 0xadbccbda
32-bit unsigned integer schema number
There is a status message for example with comments like this:
Heartbeat Out/In 0 quint32
Id (unique key) utf8
Maximum schema number quint32
version utf8
revision utf8
example data from the socket when a status message is received:
b'\xad\xbc\xcb\xda\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x06WSJT-X\x00\x00\x00\x03\x00\x00\x00\x052.1.0\x00\x00\x00\x0624fcd1'
def jt_decode_heart_beat(i):
    """
    Heartbeat Out/In 0 quint32
    Id (unique key) utf8
    Maximum schema number quint32
    version utf8
    revision utf8
    :param i: QDataStream
    :return: JT_HB_ID,JT_HB_SCHEMA,JT_HB_VERSION,JT_HB_REVISION
    """
    JT_HB_ID = i.readQString()
    JT_HB_SCHEMA = i.readInt32()
    JT_HB_VERSION = i.readQString()
    JT_HB_REVISION = i.readQString()
    print(f"HB:ID={JT_HB_ID} JT_HB_SCHEMA={JT_HB_SCHEMA} JT_HB_VERSION={JT_HB_VERSION} JT_HB_REVISION={JT_HB_REVISION}")
    return (JT_HB_ID, JT_HB_SCHEMA, JT_HB_VERSION, JT_HB_REVISION)

while 1:
    data, addr = s.recvfrom(1024)
    b = QByteArray(data)
    i = QDataStream(b)
    JT_QT_MAGIC_NUMBER = i.readInt32()
    JT_QT_SCHEMA_NUMBER = i.readInt32()
    JT_TYPE = i.readInt32()
    if JT_TYPE == 0:
        # Heart Beat
        jt_decode_heart_beat(i)
    elif JT_TYPE == 1:
        jt_decode_status(i)
Long story short: the WSJT-X UDP protocol I was reading does not encode its strings as QString, so it was wrong to expect i.readQString() to work.
Instead, each string is encoded as a 32-bit integer giving its length, followed by that many UTF-8 bytes, as for a QByteArray.
I encapsulated this in a function:
def jt_decode_utf8_str(i):
    """
    strings are encoded with an int 32 indicating size
    and then an array of bytes in utf-8 of length size
    :param i:
    :return: decoded string
    """
    sz = i.readInt32()
    b = i.readRawData(sz)
    return b.decode("utf-8")
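For comparison, the same layout can be read without Qt at all. The following is only a sketch using the standard struct module, applied to the sample heartbeat packet from the question. It assumes big-endian integers (QDataStream's default byte order) and treats a length of 0xFFFFFFFF as a null string; the names read_u32, read_utf8 and decode_heartbeat are made up for this example.

```python
import struct

def read_u32(buf, pos):
    """Read a big-endian unsigned 32-bit integer; return (value, next position)."""
    (value,) = struct.unpack_from(">I", buf, pos)
    return value, pos + 4

def read_utf8(buf, pos):
    """Read a length-prefixed UTF-8 field; return (text or None, next position)."""
    size, pos = read_u32(buf, pos)
    if size == 0xFFFFFFFF:  # assumed marker for a null byte array
        return None, pos
    return buf[pos:pos + size].decode("utf-8"), pos + size

def decode_heartbeat(packet):
    """Decode header plus heartbeat fields in the order listed in the question."""
    pos = 0
    magic, pos = read_u32(packet, pos)
    schema, pos = read_u32(packet, pos)
    msg_type, pos = read_u32(packet, pos)
    ident, pos = read_utf8(packet, pos)
    max_schema, pos = read_u32(packet, pos)
    version, pos = read_utf8(packet, pos)
    revision, pos = read_utf8(packet, pos)
    return {
        "magic": magic, "schema": schema, "type": msg_type, "id": ident,
        "max_schema": max_schema, "version": version, "revision": revision,
    }

# Sample datagram copied from the question.
sample = (b'\xad\xbc\xcb\xda\x00\x00\x00\x02\x00\x00\x00\x00'
          b'\x00\x00\x00\x06WSJT-X\x00\x00\x00\x03'
          b'\x00\x00\x00\x052.1.0\x00\x00\x00\x0624fcd1')
heartbeat = decode_heartbeat(sample)
print(heartbeat)
```

Tracking the position explicitly makes it easy to see where a field goes wrong, which is the failure mode the question describes.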

How to decode IFC using Ruby

In Ruby, I'm reading an .ifc file to get some information, but I can't decode it. For example, the file content:
"'S\X2\00E9\X0\jour/Cuisine'"
should be:
"'Séjour/Cuisine'"
I'm trying to encode it with:
puts ifcFileLine.encode("Windows-1252")
puts ifcFileLine.encode("ISO-8859-1")
puts ifcFileLine.encode("ISO-8859-5")
puts ifcFileLine.encode("iso-8859-1").force_encoding("utf-8")
But nothing gives me what I need.
I don't know anything about IFC, but based solely on the page Denis linked to and your example input, this works:
ESCAPE_SEQUENCE_EXPR = /\\X2\\(.*?)\\X0\\/

def decode_ifc(str)
  str.gsub(ESCAPE_SEQUENCE_EXPR) do
    $1.gsub(/..../) { $&.to_i(16).chr(Encoding::UTF_8) }
  end
end

str = 'S\X2\00E9\X0\jour/Cuisine'
puts "Input:", str
puts "Output:", decode_ifc(str)
All this code does is replace every sequence of four characters (/..../) between the delimiters, which will each be a Unicode code point in hexadecimal, with the corresponding Unicode character.
Note that this code handles only this specific encoding. A quick glance at the implementation guide shows other encodings, including an \X4 directive for Unicode characters outside the Basic Multilingual Plane. This ought to get you started, though.
See it on eval.in: https://eval.in/776980
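As a hedged sketch of how the \X4\ directive mentioned above could be handled alongside \X2\, here is a Python version (not Ruby, and not part of the original answer). It assumes that \X4\ carries eight hexadecimal digits per character and that both runs are closed by \X0\; the function name decode_ifc_unicode is hypothetical.

```python
import re

# Assumed layout: \X2\ holds groups of 4 hex digits, \X4\ holds groups of
# 8 hex digits, and both runs end with \X0\.
X2_RUN = re.compile(r"\\X2\\((?:[0-9A-F]{4})+)\\X0\\")
X4_RUN = re.compile(r"\\X4\\((?:[0-9A-F]{8})+)\\X0\\")

def _decode_run(hex_digits, width):
    """Turn a run of fixed-width hex groups into the characters they name."""
    return "".join(
        chr(int(hex_digits[i:i + width], 16))
        for i in range(0, len(hex_digits), width)
    )

def decode_ifc_unicode(text):
    """Replace \\X2\\ and \\X4\\ escape runs with the characters they encode."""
    text = X2_RUN.sub(lambda m: _decode_run(m.group(1), 4), text)
    return X4_RUN.sub(lambda m: _decode_run(m.group(1), 8), text)

print(decode_ifc_unicode(r"S\X2\00E9\X0\jour/Cuisine"))  # Séjour/Cuisine
```

Raw strings keep the backslashes literal, so the patterns can be written directly.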
If anyone is interested, here is Python code that decodes three of the IFC encodings: \X\, \X2\ and \S\
import re

def decodeIfc(txt):
    # In regex "\" is hard to manage in Python... I use this workaround
    txt = txt.replace('\\', 'µµµ')
    txt = re.sub('µµµX2µµµ((?:[0-9A-F]{4})+)µµµX0µµµ', decodeIfcX2, txt)
    txt = re.sub('µµµSµµµ(.)', decodeIfcS, txt)
    txt = re.sub('µµµXµµµ([0-9A-F]{2})', decodeIfcX, txt)
    txt = txt.replace('µµµ', '\\')
    return txt

def decodeIfcX2(match):
    # X2 encodes characters as groups of 4 hexadecimal digits.
    return ''.join(chr(int(x, 16)) for x in re.findall('[0-9A-F]{4}', match.group(1)))

def decodeIfcS(match):
    return chr(ord(match.group(1)) + 128)

def decodeIfcX(match):
    # Sometimes IFC files were made on an old Mac, which used the MacRoman encoding.
    num = int(match.group(1), 16)
    if num <= 127 or num >= 160:
        return chr(num)
    else:
        return bytes.fromhex(match.group(1)).decode("macroman")

Incorrect representation of the string in csv file

I'm on Windows 7 with Python 2.7.
I have this string.
Original view:
A. P. Møller Mærsk
UTF-8:
s = 'A. P. M\xc3\xb8ller M\xc3\xa6rsk'
I need to write it to a CSV file.
I tried this:
with open('14.09 Anbefalte aksjer.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow([s])
Got this:
A. P. Møller Mærsk
Then I tried UnicodeWriter:
class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

s = 'A. P. M\xc3\xb8ller M\xc3\xa6rsk'.decode('utf8')
with open('14.09 Anbefalte aksjer.csv', 'w') as csvfile:
    writer = UnicodeWriter(csvfile)
    writer.writerow([s])
And again I got:
A. P. Møller Mærsk
Then I tried unicodecsv.
Again:
A. P. Møller Mærsk
What's wrong? How can I write it correctly?
What you see is a mojibake: bytes that represent a Unicode text encoded in one character encoding are shown in another (incompatible) character encoding.
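To make the effect concrete, here is a small Python 3 sketch that produces mojibake on purpose and then reverses it. The choice of cp1252 as the wrong encoding is an assumption for illustration; the question does not say which encoding the viewing program actually used.

```python
original = "A. P. Møller Mærsk"

# Correct bytes on disk: the text encoded as UTF-8.
utf8_bytes = original.encode("utf-8")

# A viewer that wrongly assumes cp1252 turns each 2-byte sequence
# into two unrelated characters.
garbled = utf8_bytes.decode("cp1252")

# The damage is reversible as long as no bytes were lost on the way.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)   # A. P. MÃ¸ller MÃ¦rsk
print(repaired)  # A. P. Møller Mærsk
```

The bytes in the file are the same in both cases; only the encoding used to read them differs.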
If ''.decode('utf8') doesn't raise AttributeError, then you are not on Python 3 (despite what your question says). On Python 2, csv doesn't support Unicode directly; you have to encode manually:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import csv

text = "A. P. Møller Mærsk"
with open('14.09 Anbefalte aksjer.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow([text.encode('utf-8')])
Both UnicodeWriter and unicodecsv module should work as well if text contains uncorrupted data.
Windows tools such as Notepad or Excel assume the encoding of the default Windows locale, so for UTF-8 a byte order mark (BOM, U+FEFF) must be encoded at the start of the file. Python provides an encoding for this, utf-8-sig. Note also that by using #coding:utf8 and saving your source file in UTF-8, you can declare your string directly as a Unicode string. Finally, files for use with the csv module should be opened in 'wb' mode on Python 2.7, or you will see problems writing newlines on Windows.
#coding:utf8
import csv
from StringIO import StringIO
import codecs

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    # Use utf-8-sig encoding here.
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        # Redirect output to a queue
        self.queue = StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

s = u'A. P. Møller Mærsk' # declare as Unicode string.
with open('14.09 Anbefalte aksjer.csv', 'wb') as csvfile:
    writer = UnicodeWriter(csvfile)
    writer.writerow([s])
Output:
A. P. Møller Mærsk
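The answers above are written for Python 2.7. On Python 3 (an assumption, since the question does not use it) the wrapper class is usually unnecessary, because open() accepts an encoding and csv works with text directly. A minimal sketch, with a hypothetical helper name and a temporary file instead of the question's file name:

```python
import csv
import os
import tempfile

def write_csv_with_bom(path, rows):
    """Write rows as UTF-8 CSV with a leading BOM so Excel detects the encoding."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8-sig") as csvfile:
        csv.writer(csvfile).writerows(rows)

demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
write_csv_with_bom(demo_path, [["A. P. Møller Mærsk"]])

# Read the raw bytes back to confirm the BOM is in front of the data.
with open(demo_path, "rb") as f:
    raw = f.read()
print(raw[:3])
```

Passing newline="" plays the same role here as the 'wb' mode does on Python 2.7.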

UTF-8 and Chinese Characters

I have a function that calls up google API:
def get_lat_long(place):
    place = re.sub('\s','+', str(place), flags=re.UNICODE)
    url = 'https://maps.googleapis.com/maps/api/geocode/json?address=' + place
    content = urllib2.urlopen(url).read()
    obj = json.loads(content)
    results = obj['results']
    lat = long = None
    if len(results) > 0:
        loc = results[0]['geometry']['location']
        lat = float(loc['lat'])
        long = float(loc['lng'])
    return [lat, long]
However, when I enter 師大附中 as a parameter, I get the error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
I tried doing str(place).encode('utf-8'), but I don't think that's the problem. I think it's because the function cannot read Chinese characters, so it needs to first convert Chinese characters to a unicode string before it reads it? That's just a guess though.
Assuming that place is of unicode type, you need to do something like this:
import urllib

def get_lat_long(place):
    place = urllib.quote_plus(place.encode('utf-8'))
    url = 'https://maps.googleapis.com/maps/api/geocode/json?address=' + place
    # ... the rest of the function stays as in the question
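For readers on Python 3 (an assumption, since the code above is Python 2), quote_plus lives in urllib.parse and encodes text as UTF-8 by default. The sketch below only builds the URL and makes no network request; build_geocode_url is a hypothetical helper name.

```python
from urllib.parse import quote_plus, unquote_plus

GEOCODE_URL = 'https://maps.googleapis.com/maps/api/geocode/json?address='

def build_geocode_url(place):
    """Percent-encode place as UTF-8 so the resulting URL is pure ASCII."""
    return GEOCODE_URL + quote_plus(place)

url = build_geocode_url('師大附中')
print(url)
# Spaces become '+', so no separate regex substitution is needed.
print(build_geocode_url('Taipei 101'))
```

Because the query value is percent-encoded before the request is made, no implicit ASCII encoding of the Chinese characters ever takes place.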

How to convert UTF8 byte arrays to string in lua

I have a table like this:
table = {57,55,0,15,-25,139,130,-23,173,148,-24,136,158}
It is a UTF-8 encoded byte array produced by PHP's unpack function:
unpack('C*',$str);
How can I convert it to a UTF-8 string I can read in Lua?
Lua doesn't provide a direct function for turning a table of utf-8 bytes in numeric form into a utf-8 string literal. But it's easy enough to write something for this with the help of string.char:
function utf8_from(t)
  local bytearr = {}
  for _, v in ipairs(t) do
    local utf8byte = v < 0 and (0xff + v + 1) or v
    table.insert(bytearr, string.char(utf8byte))
  end
  return table.concat(bytearr)
end
Note that none of Lua's standard functions or string facilities are UTF-8 aware. If you try to print the UTF-8 encoded string returned by the above function, you may just see some funky symbols. If you need more extensive UTF-8 support, check out some of the libraries mentioned on the Lua wiki.
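For comparison only, the same normalisation can be sketched in Python rather than Lua: negative values are wrapped into the 0..255 range and the resulting bytes are decoded as UTF-8. The function name utf8_from_signed is made up for this example, and the input is the table from the question.

```python
def utf8_from_signed(values):
    """Map signed or unsigned byte values to 0..255 and decode them as UTF-8."""
    # v & 0xFF leaves 0..255 untouched and wraps -128..-1 to 128..255,
    # which matches the 0xff + v + 1 step in the Lua function.
    return bytes(v & 0xFF for v in values).decode("utf-8")

table = [57, 55, 0, 15, -25, 139, 130, -23, 173, 148, -24, 136, 158]
text = utf8_from_signed(table)
print(repr(text))
```

The first four values are plain ASCII and control bytes; the remaining nine form three 3-byte sequences.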
Here's a comprehensive solution that works for the UTF-8 character set restricted by RFC 3629:
do
  local bytemarkers = { {0x7FF,192}, {0xFFFF,224}, {0x1FFFFF,240} }
  function utf8(decimal)
    if decimal<128 then return string.char(decimal) end
    local charbytes = {}
    for bytes,vals in ipairs(bytemarkers) do
      if decimal<=vals[1] then
        for b=bytes+1,2,-1 do
          local mod = decimal%64
          decimal = (decimal-mod)/64
          charbytes[b] = string.char(128+mod)
        end
        charbytes[1] = string.char(vals[2]+decimal)
        break
      end
    end
    return table.concat(charbytes)
  end
end

function utf8frompoints(...)
  local chars,arg={},{...}
  for i,n in ipairs(arg) do chars[i]=utf8(arg[i]) end
  return table.concat(chars)
end

print(utf8frompoints(72, 233, 108, 108, 246, 32, 8364, 8212))
--> Héllö €—
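The bit arithmetic in the Lua function can be cross-checked with a Python sketch of the same idea. This is an illustration under the RFC 3629 limit of 0x10FFFF, not a translation of the answer; utf8_encode is a hypothetical name, and Python's built-in str.encode serves as the reference in the final check.

```python
def utf8_encode(code_point):
    """Encode one code point (0..0x10FFFF) as UTF-8 bytes by hand."""
    if code_point < 0x80:
        return bytes([code_point])
    # (upper limit, lead-byte marker, number of continuation bytes)
    for limit, marker, tail in ((0x7FF, 0xC0, 1), (0xFFFF, 0xE0, 2), (0x10FFFF, 0xF0, 3)):
        if code_point <= limit:
            out = bytearray(tail + 1)
            # Fill continuation bytes from the end, 6 payload bits each.
            for i in range(tail, 0, -1):
                out[i] = 0x80 | (code_point & 0x3F)
                code_point >>= 6
            # Whatever bits remain go into the lead byte.
            out[0] = marker | code_point
            return bytes(out)
    raise ValueError("code point out of range")

points = (72, 233, 108, 108, 246, 32, 8364, 8212)
encoded = b"".join(utf8_encode(p) for p in points)
# Compare against the built-in encoder for the same code points.
print(encoded == "".join(map(chr, points)).encode("utf-8"))
```

The shift by 6 and mask with 0x3F correspond to the division and modulo by 64 in the Lua loop.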

Resources