Decoding a gb-2312 file in Colab - 'utf-8' codec error

I am trying to open a file in Colab that uses gb-2312 encoding. Here is the code I successfully ran in my IDE to read and decode:
file = open(r'file.txt')  # no encoding given, so the platform default is used
opened = file.read()
decoded = opened.encode('latin1').decode('gb2312')  # round-trip the bytes, then decode as gb2312
print(decoded)
When I run this code in colab, I get the following error:
'utf-8' codec can't decode byte 0xc6 in position 67: invalid continuation byte
But if I try to decode without calling read() (or list()) first, I get this error instead:
'_io.TextIOWrapper' object has no attribute 'encode'
This seems like a catch-22. Is this a bug with Colab or is there some better way to approach the problem?

The default mode when opening a file is 'rt' (read, text), which uses the OS-specific default encoding returned by locale.getpreferredencoding(False). On Colab that default appears to be UTF-8, which is why read() chokes on the GB2312 bytes. Use the encoding parameter to override it:
with open('file.txt', encoding='gb2312') as file:
    data = file.read()
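To confirm what the default actually is on a given system, you can check it directly (a quick sanity check; the values in the comment are typical, not guaranteed):
import locale
print(locale.getpreferredencoding(False))  # e.g. 'UTF-8' on Colab, 'cp1252' on many Windows setups

# Equivalent fix: read the raw bytes and decode explicitly,
# sidestepping the text-mode default entirely.
with open('file.txt', 'rb') as f:
    decoded = f.read().decode('gb2312')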


Saving decoded Protobuf content

I'm trying to set up a .py plugin that will save decoded Protobuf responses to a file, but whatever I do, the result is always a file in raw byte form (not decoded). I have also tried to do the same with "w" in mitmproxy: on screen I saw decoded data, but in the file it was encoded again.
Any thoughts on how to do this correctly?
Sample code for now:
import mitmproxy

def response(flow):
    if flow.request.pretty_url.endswith("some-url.com/endpoint"):
        f = open("test.log", "ab")
        with decoded(flow.response):  # 'decoded' context manager from older mitmproxy versions
            f.write(flow.request.content)
            f.write(flow.response.content)
Eh, I'm not sure this helps, but what happens if you don't open the file in binary mode?
f = open("test.log", "a")
Hi,
here are some basic things that I found.
Try replacing
f.write(flow.request.content)
with
f.write(flow.request.text)
I read about it on this page:
https://discourse.mitmproxy.org/t/modifying-https-response-body-not-working/645/3
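In context, the replacement looks roughly like this (a sketch assuming the question's placeholder endpoint; .text decodes the body using the message's declared charset, while .content is the raw bytes):
def response(flow):
    if flow.request.pretty_url.endswith("some-url.com/endpoint"):
        with open("test.log", "a") as f:  # text mode, since .text is a str
            f.write(flow.request.text)
            f.write(flow.response.text)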
Please also read and try this to get the requests and responses assembled:
MITM Proxy, getting entire request and response string
Best of luck with your project.
I was able to find a way to do it. It seems mitmdump/mitmproxy can't save raw decoded Protobuf on its own, so I used:
mitmdump -s decode_script.py
with the following script to save the decoded data to a file:
import mitmproxy
import subprocess
import time

def response(flow):
    if flow.request.pretty_url.endswith("HERE/IS/SOME/API/PATH"):
        protobuffedResponse = flow.response.content
        # pipe the raw body through protoc for schema-less decoding
        (out, err) = subprocess.Popen(['protoc', '--decode_raw'],
                                      stdin=subprocess.PIPE,
                                      stdout=subprocess.PIPE,
                                      stderr=subprocess.PIPE).communicate(protobuffedResponse)
        outStr = str(out, 'utf-8')
        outStr = outStr.replace('\\"', '"')
        timestr = time.strftime("%Y%m%d-%H%M%S")
        with open("decoded_messages/" + timestr + ".decode_raw.log", "w") as f:
            f.write(outStr)
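For context, protoc --decode_raw parses serialized Protobuf from stdin without needing the .proto schema, printing field numbers instead of names. A standalone equivalent of the pipeline above, for sanity-checking a saved body (the file name is hypothetical):
import subprocess

with open("body.bin", "rb") as f:  # hypothetical saved response body
    result = subprocess.run(['protoc', '--decode_raw'],
                            input=f.read(), capture_output=True)
print(result.stdout.decode('utf-8'))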

File encoding issue when downloading file from AWS S3

I have a CSV file in AWS S3 that I'm trying to open in a local temp file. This is the code:
s3 = Aws::S3::Resource.new
bucket = s3.bucket({bucket name})
obj = bucket.object({object key})
temp = Tempfile.new('temp.csv')
obj.get(response_target: temp)
It pulls the file from AWS and loads it in a new temp file called 'temp.csv'. For some files, the obj.get(..) line throws the following error:
WARN: Encoding::UndefinedConversionError: "\xEF" from ASCII-8BIT to UTF-8
WARN: /Users/.rbenv/versions/2.5.0/lib/ruby/2.5.0/delegate.rb:349:in `write'
/Users/.rbenv/versions/2.5.0/lib/ruby/2.5.0/delegate.rb:349:in `block in delegating_block'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/http/response.rb:62:in `signal_data'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/net_http/handler.rb:83:in `block (3 levels) in transmit'
...
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-s3-1.13.0/lib/aws-sdk-s3/client.rb:2666:in `get_object'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-s3-1.13.0/lib/aws-sdk-s3/object.rb:657:in `get'
Stacktrace shows the error initially gets thrown by the .get from the AWS SDK for Ruby.
Things I've tried:
When uploading the file (object) to AWS S3, you can specify content_encoding, so I tried setting that to UTF-8:
obj.upload_file({file path}, content_encoding: 'utf-8')
Also when you call .get you can set response_content_encoding:
obj.get(response_target: temp, response_content_encoding: 'utf-8')
Neither of those works; they result in the same error as above. I would really have expected that to do the trick. In the AWS S3 dashboard I can see that the content encoding is indeed set correctly via the code, but it doesn't appear to make a difference.
It does work when I do the following, in the first code snippet above:
temp = Tempfile.new('temp.csv', encoding: 'ascii-8bit')
But I'd prefer to upload and/or download the file from AWS S3 with the proper encoding. Can someone explain why specifying the encoding on the tempfile works? Or how to make it work through the AWS S3 upload/download?
Important to note: the problematic character in the error message appears to be just a random symbol added at the beginning of this auto-generated file I'm working with. I'm not worried about reading the character correctly; it gets ignored when I parse the file anyway.
I don't have a full answer to all of your questions, but I think I have a generalized solution: always put the temp file into binary mode. That way the AWS gem will simply dump the data from the bucket into the file, without any re-encoding:
Step 1 (put the Tempfile into binmode):
temp = Tempfile.new('temp.csv')
temp.binmode
You will, however, have another problem: there is now a 3-byte BOM header in your UTF-8 file (the "\xEF" in the error message is the first byte of the UTF-8 BOM, EF BB BF).
I don't know where this BOM came from. Was it there when the file was uploaded? If so, it might be a good idea to strip the 3-byte BOM before uploading.
However, if you set up your system as below, it will not matter, because Ruby supports transparent reading of UTF-8 with or without BOM, and will return the string correctly regardless of if the BOM header is in the file or not:
Step 2 (process the file using bom|utf-8):
File.read(temp.path, encoding: "bom|utf-8")
# or...
CSV.read(temp.path, encoding: "bom|utf-8")
This should cover all your bases I think. Whether you receive files encoded as BOM + UTF-8 or plain UTF-8, you will process them correctly this way, without any extra header characters appearing in the final string, and without errors when saving them with AWS.
Another option (from OP)
Use obj.get.body instead, which will bypass the whole issue with response_target and Tempfile.
Useful references:
Is there a way to remove the BOM from a UTF-8 encoded file?
How to avoid tripping over UTF-8 BOM when reading files
What's the difference between UTF-8 and UTF-8 without BOM?
How to write BOM marker to a file in Ruby
I fixed this encoding issue by additionally wrapping the target in File.open(file, "wb"). Here is how it looks:
s3_object = Aws::S3::Resource.new.bucket("bucket-name").object("resource-key")
Tempfile.new.tap do |file|
  s3_object.get(response_target: File.open(file, "wb"))
end
The Ruby SDK docs have an example of downloading an S3 item to the filesystem in https://docs.aws.amazon.com/sdk-for-ruby/v3/developer-guide/s3-example-get-bucket-item.html. I just ran it and it works fine.

Pandas to Oracle via SQL Alchemy: UnicodeEncodeError: 'ascii' codec can't encode character

Using pandas 0.18.1...
I'm trying to iterate through a folder of CSVs to read each CSV and send it to an Oracle database table. There is a non-ascii character lurking in one of my many CSVs (more like reveling in my anguish). I keep getting this error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xab' in position 8:
ordinal not in range(128)
Here's the code:
import os
import pandas as pd
import pandas.io.sql as psql
from sqlalchemy import create_engine
import cx_Oracle as cx

engine = create_engine('oracle+cx_oracle://schema:' + pwd + '@server:port/service_name',
                       encoding='latin1')
name = 'table'
path = r'path_to_folder'
filelist = os.listdir(path)
for file in filelist:
    df = pd.read_csv(path + '\\' + file, encoding='latin1', index_col=0)
    df = df.astype('unicode')
    df['date'] = pd.to_datetime(df['date'])
    df['date'] = pd.to_datetime(df['Contract_EffDt'], format='%Y-%m-%d')  # overwrites the line above
    df.to_sql(name, engine, if_exists='append')
I've tried the following:
encoding='utf-8' in the engine (if I do that in read_csv instead, it throws an error)
Adding ?encoding=utf8 after "service_name" in the engine
Using df=df.astype('unicode') (and not)
What I want to do:
Replace the unreadable character with something else and, most importantly, proceed with sending data to Oracle.
Note:
The data file I'm using are from the cms.gov site. Here's a zip file with an example. I'm using the "contracts_info" file.
Thanks in advance!
You need to set the NLS_LANG environment variable like this:
os.environ['NLS_LANG']= 'AMERICAN_AMERICA.AL32UTF8'
Then the error won't occur.
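One caveat worth adding (my note, not part of the original answer): the Oracle client reads NLS_LANG when it initializes, so set it before creating the engine or opening any connection:
import os
os.environ['NLS_LANG'] = 'AMERICAN_AMERICA.AL32UTF8'  # must be set before the Oracle client initializes

from sqlalchemy import create_engine
# hypothetical connection string, mirroring the question's
engine = create_engine('oracle+cx_oracle://schema:password@server:1521/service_name', encoding='latin1')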
I encoded the string fields as UTF-8 individually, and this may have helped (a new error occurred, but I assume it is unrelated):
dfc['Organization Type'] = dfc['Organization Type'].str.encode('utf-8')
New error:
DatabaseError: (cx_Oracle.DatabaseError) ORA-00904: "Contract_ID": invalid identifier
This was because "Contract_ID" was not set as the index. Once I did that, all went well (except for being slower than molasses, which begins my next adventure).

How to convert a PIL image file into a string in Python 3.4?

I have been trying to read a JPEG file using PIL in Python 3.4. I need to save this file in string format. Some options are suggested on this site, but the ones I have tried are not working. Here is my code snippet, which I found on this site:
from io import StringIO
from PIL import Image

fp = Image.open("images/login.jpg")
output = StringIO()
fp.save(output, format="JPEG")
contents = output.getvalue()
output.close()
But I am facing the following error:
TypeError: string argument expected, got 'bytes'
Could you please suggest what I have done wrong and how to get this working?
In Python 3 you should use BytesIO, because, as the Python docs put it, StringIO is a native in-memory unicode container, whereas JPEG data is binary.
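A minimal version of the corrected snippet (same file path as in the question):
from io import BytesIO
from PIL import Image

fp = Image.open("images/login.jpg")
output = BytesIO()            # a bytes buffer, since JPEG data is binary
fp.save(output, format="JPEG")
contents = output.getvalue()  # a bytes object
output.close()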
Thanks a lot for the hint. I actually found a different way of reading the image file and storing it in a string object in Python 2.x. Here is the code. Please let me know if there is any disadvantage to using this.
imgText = open("images/login.jpg", 'rb')
imgTextStr = imgText.read()
imgText.close()
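One difference worth noting (my observation, not from the original thread): this reads the file's bytes verbatim, whereas saving through PIL re-encodes the image, so the two results generally won't be byte-identical. Also, in Python 3 a file opened in 'rb' mode returns bytes from read(), not str.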

Rails: parsing an uploaded file raises "\xDE" from ASCII-8BIT to UTF-8

I'm trying to parse an uploaded *.txt file and extract some information for DB import. But before saving it I try to get the string in UTF-8 format. When I do that I get this error:
"\xDE" from ASCII-8BIT to UTF-8
The first characters of the file:
Import data \xDE\xE4\xE5
The parsing code:
# encoding: utf-8
require "iconv"

class HandlerController < ApplicationController
  def add_report
    utf8_format = "UTF-8"
    file_data = params[:import_file].tempfile.read.encode(utf8_format)
  end
end
P.S. I also tried this with iconv, but it didn't help.
You need to start from a known encoding with valid content (and compatible characters for input and output) before you will be able to successfully convert a string.
ASCII-8BIT doesn't assign Unicode-compatible characters to values 128..255 - it cannot be converted to Unicode.
The chances are that the input - as you say it is text - is in some other encoding to start with. You could start by assuming ISO-8859-1 ("Latin-1") which is quite a common encoding, although you may have some other clue, or know what characters to expect in the file, in which case you should try others.
I suggest you try something like this:
file_data = params[:import_file].tempfile.read.force_encoding('ISO-8859-1')
utf8_file_data = file_data.encode(utf8_format)
This probably will not give you an error, but if my guess at 'ISO-8859-1' is wrong, it will give you gibberish unfortunately.
