Python 3 - GeoPy and encoding - Windows

I'm using DictWriter to write a dictionary to a CSV after some geolocation work.
location = geolocator.reverse(coords)
row["address"] = location.address
writer.writerow(row)
This generates the following error:
File "C:\bin64\python\3.4.3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u200e' in
position 118: character maps to <undefined>

My problem was in how I was opening the file. I suppose I should have posted that in the question. I needed to set the encoding upon opening the file.
with open('results.csv', mode='w', encoding='utf-8', newline='') as file:
...
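For context, a minimal sketch of the whole pattern with the UTF-8 file handle (the Nominatim setup, field names, and sample row here are illustrative assumptions, not the original code):
import csv
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="example-app")                        # assumed geocoder setup
rows = [{"name": "Eiffel Tower", "coords": (48.8588443, 2.2943506)}]    # illustrative input

with open('results.csv', mode='w', encoding='utf-8', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=["name", "address"])
    writer.writeheader()
    for row in rows:
        location = geolocator.reverse(row.pop("coords"))
        row["address"] = location.address   # may contain characters cp1252 cannot encode
        writer.writerow(row)                # safe: the file handle encodes everything as UTF-8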

Related

Cannot debug Rust code using LLDB in VS Code: UnicodeEncodeError: 'ascii' codec can't encode characters

I want to debug Rust with VS Code. I have installed LLDB 6.0.0 and Rust 1.27.1 but I can't debug Rust code with LLDB:
Display settings: variable format=auto, show disassembly=auto, numeric pointer values=off, container summaries=on.
Internal debugger error:
Traceback (most recent call last):
File "/home/kwebi/.vscode/extensions/vadimcn.vscode-lldb-0.8.9/adapter/debugsession.py", line 1365, in handle_message
result = handler(args)
File "/home/kwebi/.vscode/extensions/vadimcn.vscode-lldb-0.8.9/adapter/debugsession.py", line 385, in DEBUG_setBreakpoints
file_id = os.path.normcase(from_lldb_str(source.get('path')))
File "/home/kwebi/.vscode/extensions/vadimcn.vscode-lldb-0.8.9/adapter/__init__.py", line 8, in <lambda>
from_lldb_str = lambda s: s.decode('utf8', 'replace')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 12-13: ordinal not in range(128)
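The confusing detail is that a decode call raises a UnicodeEncodeError; a minimal Python 2 illustration of why that happens (the path string below is made up):
# -*- coding: utf-8 -*-
# Python 2: calling .decode() on a value that is already unicode first encodes it
# with the default ASCII codec, so non-ASCII characters raise UnicodeEncodeError
# even though the call looks like a decode.
s = u'/home/kwebi/проект/main.rs'    # already unicode, contains non-ASCII characters
s.decode('utf8', 'replace')          # raises UnicodeEncodeError: 'ascii' codec can't encode ...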

Unicode Decode Error: invalid start byte

The purpose of the code is to create a graph for the decision tree model.
The code is given below.
dot_data=StringIO()
tree.export_graphviz(clf,out_file=dot_data)
graph=py.graph_from_dot_data(dot_data.getvalue())
print(graph)
Image.open(graph.create_png(),mode='r')
On execution, it gives the following error:
Traceback (most recent call last):
File "C:/Ankur/Python36/Python Files/Decision_Tree.py", line 58, in <module>
Image.open(graph.create_png(),mode='r')
File "C:\Ankur\Python36\lib\site-packages\PIL\Image.py", line 2477, in open
fp = builtins.open(filename, "rb")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0:
invalid start byte
I am having a hard time resolving this error as I don't understand it.
create_png() returns a bytes object while Image.open (from PIL) expects a filename or file object.
Try
import io
Image.open(io.BytesIO(graph.create_png()))
and it should work.
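If you would rather have the image on disk, here is a sketch of an alternative, continuing from the question's graph object (the file name is made up):
# write the raw PNG bytes returned by create_png() to a file, then open that file with PIL
with open("decision_tree.png", "wb") as f:
    f.write(graph.create_png())

from PIL import Image
img = Image.open("decision_tree.png")   # now Image.open gets a filename, as it expects
img.show()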

How to extract characters from images where their positions are known?

I have a set of PNG images at 300 dpi. Each image is full of printed (not handwritten) text and digits.
I want to extract each character and save it in a separate image.
For each character in the image I have its position stored in a CSV file.
For instance, in image1.png, for a given character "k" I have its position:
"k" = [left=656, right=736, top=144, down=286]
Is there any Python library that allows me to do that? As input I have the images (PNG format) and a CSV file that contains the position of each character of each image.
After executing the code, I get stuck at this line:
img_charac=img[int(coords[2]):int(coords[3]),int(coords[0]):int(coords[1])]
I got the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'NoneType' object has no attribute '__getitem__'
So if I understood correctly, this has nothing to do with image processing, just file opening, image cropping and saving.
With a CSV file, an input image, and output character crops like the ones shown in the original post, here is the code:
import cv2
import numpy as np
import csv

path_csv = ""    # path to your csv
# stock the coordinates of the characters from your csv in a numpy array
npa = np.genfromtxt(path_csv + "cs.csv", delimiter=',', skip_header=1, usecols=(1, 2, 3, 4))
nb_charac = len(npa[:, 0])  # number of characters

# stock the actual letters of your csv in a list
characs = []
cpt = 0
# take characters
f = open(path_csv + "cs.csv", 'rt')
reader = csv.reader(f)
for row in reader:
    if cpt >= 1:  # skip header
        characs.append(str(row[0]))
    cpt += 1

# open your image
path_image = ""  # path to your image
img = cv2.imread(path_image + "yourimagename.png")
path_save = ""   # path you want to save to

# for every line of your csv,
for i in range(nb_charac):
    # get coordinates
    coords = npa[i, :]
    charac = characs[i]
    # actual cropping of the image (easy with numpy)
    img_charac = img[int(coords[2]):int(coords[3]), int(coords[0]):int(coords[1])]
    # saving the image
    cv2.imwrite(path_save + "carac" + str(i) + "_" + str(charac) + ".png", img_charac)
This is sort of quick and dirty; the CSV handling is a bit messy, for example (you could get all the information in a single pass over the file, as sketched below), and it should be adapted to your CSV file anyway.
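For example, a tidier single-pass version of the CSV handling could look like this (a sketch only, reusing the path variables from above and assuming the same column layout: the character in column 0 and left, right, top, down as integers in columns 1-4):
import csv
import cv2

boxes = []   # one (character, left, right, top, down) tuple per CSV row
with open(path_csv + "cs.csv", "rt") as f:
    reader = csv.reader(f)
    next(reader)   # skip header
    for row in reader:
        boxes.append((row[0], int(row[1]), int(row[2]), int(row[3]), int(row[4])))

img = cv2.imread(path_image + "yourimagename.png")
for i, (charac, left, right, top, down) in enumerate(boxes):
    img_charac = img[top:down, left:right]   # numpy slicing: rows first (y), then columns (x)
    cv2.imwrite(path_save + "carac" + str(i) + "_" + charac + ".png", img_charac)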

Using ruby SAX parsers for GB2312 encoded xml

Good day,
I have a lot of big XML files that I need to parse, but the problem is they have 'gb2312' encoding. I would normally use a SAX parser for this.
So here is an example of the XML:
<?xml version="1.0" encoding="gb2312"?>
<Root>
<ValueList Count="112290" FieldCount="11">
<Item1 Value1="23743" Value2="Дипломатия � Пустой кувшин" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
<Item2 Value1="6611" Value2="ДЛ � 018 омела � золотой кинжал" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
<Item3 Value1="6608" Value2="Наука (ДЛ)�круг фей 021�тяпка" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
<Item4 Value1="6612" Value2="Знаки ДЛ � 003руны � разрушение" Value3="1" Value4="" Value5="6" Value6="0" Value7="0" Value8="0" Value9="0" Value10="0" Value11="0"/>
....
</Root>
I'm trying to use the Nokogiri SAX parser (I also tried libxml-ruby with the same result):
require 'nokogiri'

class SchemaParser < Nokogiri::XML::SAX::Document
  def initialize
    @cnt = 0
  end

  def start_element name, attrs = []
    if name == "Item1"
      @cnt += 1
      puts @cnt
    end
  end
end

parser = Nokogiri::XML::SAX::Parser.new(SchemaParser.new)
parser.parse_io(File.open('2_4_EQUIPMENT_ESSENCE.xml'), 'gb2312')
But this gives the error "`check_encoding': 'GB2312' is not a valid encoding (ArgumentError)". If I remove the encoding declaration and let Nokogiri detect the encoding itself, I receive this error:
encoding error : input conversion failed due to input error, bytes 0xA8 0x43 0x20 0xA7
encoding error : input conversion failed due to input error, bytes 0xA8 0x43 0x20 0xA7
I/O error : encoder error
I also tried to open the file with the proper encoding, but that didn't help the SAX parser:
[3] pry(main)> f = File.open('2_4_EQUIPMENT_ESSENCE.xml', "r:gb2312")
=> #<File:2_4_EQUIPMENT_ESSENCE.xml>
[4] pry(main)> f.external_encoding.name
=> "GB2312"
Has anyone used the 'gb2312' encoding with SAX parsers in Ruby? Any recommendations on how to proceed?
It seems the issue is that Libxml2 does not support the GB2312 encoding (see here for a list of supported encodings).
I'm not sure if you have tried this, but I think you can work around it by removing the encoding declaration from the XML files (so Libxml2 does not try to transcode the data) and setting the external encoding of the File object to GB2312; Ruby will then transcode the file to UTF-8 as it is read, and from then on everything remains UTF-8.
So, here is my workaround.
Problems:
Some of the characters present in the XML are not valid 'gb2312'; I found that 'GB18030' is a better choice, with full Chinese character coverage.
I converted all the XML files to UTF-8 so I could use the SAX parser.
I ended up with this rake task:
desc "convert chinese xml files to utf-8"
task :convert do
rm_rf 'data/utf8'
mkdir 'data/utf8'
Dir.foreach('data') {|f|
if f.end_with?('.xml')
puts "converted:: data/utf8/#{f}" if system("iconv -f GB18030 -t UTF-8 data/#{f} > data/utf8/#{f}")
end
}
#replace encodings for xml files
system("bundle exec ruby -pi -e \"gsub(/gb2312/, 'UTF-8')\" data/utf8/*.xml")
end

Cannot read unicode .csv into R

I have a .csv file, which contains the following data:
"Ա","Բ"
1,10
2,20
I cannot read it into R so that the column names are displayed like they are in the file.
d <- read.csv("./Data/1.csv", fileEncoding="UTF-8")
head(d)
Produces the following:
> d <- read.csv("./Data/1.csv", fileEncoding="UTF-8")
Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
invalid input found on input connection './Data/1.csv'
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on './Data/1.csv'
> head(d)
[1] X.
<0 rows> (or 0-length row.names)
Meanwhile, doing the same without specifying the fileEncoding produces this:
> d <- read.csv("./Data/1.csv")
> head(d)
Ô. Ô²
1 1 10
2 2 20
When I run the "file" utility to find out the encoding of the file, it says it is UTF-8:
Data\1.csv: UTF-8 Unicode text, with CRLF line terminators
I am using RStudio, Windows 7, R version 2.15.2, 32-bit.
Thanks in advance.
I wrote a longer answer on the same issue here: R on Windows: character encoding hell.
Quick answer: using the parameter encoding instead of fileEncoding should fix your first issue. You will possibly not be able to read it in either the console or the table view in RStudio, but you will be able to use it in formulas.
d <- read.csv("./Data/1.csv", encoding="UTF-8")
head(d)
Having saved your table into a UTF-8 file:
> test2 <- read.csv("test2.csv", header = FALSE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", encoding = "UTF-8")
Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on 'test2.csv'
This is how it looks in the console and in the RStudio view:
> test2
V1 V2
1 <U+0531> <U+0532>
2 1 10
3 2 20
Importantly, however, you are able to manipulate this within R. Thus in my case it is possible to see that the script window input Ա has UTF-8 encoding, and a grep correctly finds this character in your table.
> Encoding("Ա")
[1] "UTF-8"
> grep("Ա", as.character(test2[1,1]))
[1] 1
You may need to find suitable encoding variants that work with your settings, or possibly change the settings; unfortunately I am not sure where that is done.
You might not be able to make it pretty at all stages, but it is definitely possible to get it working in a Windows 7 environment as well.
I tried two ways to replicate your problem.
I copied the characters above into RStudio and saved them to a CSV with this code:
write.csv(c("Ա","Բ",
1,10,
2,20), "test.csv")
df <- read.csv("test.csv")
This worked fine.
Then I thought, well, maybe R is cheating when I save it to CSV with R? So I just pasted the characters into a text file and saved it as a CSV. This approach doesn't have problems either.
Here's my session info:
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8
[4] LC_COLLATE=en_CA.UTF-8 LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=C LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] party_1.0-9 modeltools_0.2-21 strucchange_1.4-7 sandwich_2.2-10 zoo_1.7-10
[6] GGally_0.4.4 reshape_0.8.4 plyr_1.8 ggplot2_0.9.3.1
loaded via a namespace (and not attached):
[1] coin_1.0-23 colorspace_1.2-2 dichromat_2.0-0 digest_0.6.3
[5] gtable_0.1.2 labeling_0.2 lattice_0.20-23 MASS_7.3-29
[9] munsell_0.4.2 mvtnorm_0.9-9995 proto_0.3-10 RColorBrewer_1.0-5
[13] reshape2_1.2.2 scales_0.2.3 splines_3.0.1 stringr_0.6.2
I had the same problem and found out that the file was corrupted.
I opened the file with OpenOffice and saved it back using the "UTF8" character set (you need to tick the Edit filter settings box), then imported it with read.csv() (no encoding or fileEncoding option) and it worked fine.
