What is the encoding of the subprocess module output in Python 2.7 on Windows?

I'm trying to retrieve the content of a zipped archive with Python 2.7 on 64-bit Windows Vista. I tried making a system call to 7-Zip (my favourite archive manager) using the subprocess module:
# -*- coding: utf-8 -*-
import sys, os, subprocess

Extractor = r'C:\Program Files\7-Zip\7z.exe'
ArchiveName = r'C:\temp\bla.zip'
# List the archive contents with full details (-slt) and capture stdout:
output = subprocess.Popen([Extractor, 'l', '-slt', ArchiveName],
                          stdout=subprocess.PIPE).stdout.read()
This works fine as long as the archive contains only ASCII filenames, but when I try it with non-ASCII names I get an encoded output string in which ä, ë, ö, ü have been replaced by \x84, \x89, \x94, \x81 (and so on). I've tried all kinds of decode/encode calls, but I'm just too inexperienced with Python to reproduce the original characters with umlauts (which I need if I want to follow up this step with, e.g., an extraction subprocess call to 7z).
Simply put, my question is: how do I get this to work for archives with non-ASCII content as well?
...or, to put it in a more convoluted way: is the output of subprocess always in a fixed encoding or not?
In the former case -> which encoding is it?
In the latter case -> how can I control or uncover the encoding of the subprocess output? Inspired by similar questions on this site, I've tried adding
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
and I've also tried
my_env = os.environ.copy()
my_env['PYTHONIOENCODING'] = 'utf-8'
output = subprocess.Popen([Extractor, 'l', '-slt', ArchiveName],
                          stdout=subprocess.PIPE, env=my_env).stdout.read()
but neither seems to alter the encoding of the output variable (or to reproduce the umlauts).

You can try using the -sccUTF-8 switch of 7-Zip to force its console output to UTF-8.
Here is the reference page: http://en.helpdoc-online.com/7-zip_9.20/source/cmdline/switches/scc.htm
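A minimal Python 2.7 sketch of that approach (assuming your 7-Zip build supports -scc; the cp850 fallback reflects the OEM console code page, which is where \x84 for ä comes from in the first place):
# -*- coding: utf-8 -*-
# Sketch: ask 7-Zip for UTF-8 console output, then decode the pipe bytes.
import subprocess

Extractor = r'C:\Program Files\7-Zip\7z.exe'
ArchiveName = r'C:\temp\bla.zip'

raw = subprocess.Popen([Extractor, 'l', '-slt', '-sccUTF-8', ArchiveName],
                       stdout=subprocess.PIPE).communicate()[0]
listing = raw.decode('utf-8')    # a unicode object with real umlauts

# Fallback if your 7z build lacks -scc: decode the OEM code page instead.
# listing = raw.decode('cp850')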

Related

ASCII incompatible encoding with normal run, not in debug mode

I'm really confused by this one, and maybe it's a bug in Ruby 2.6.2. I have files that were written as UTF-8 with a BOM, so I'm using the following:
filelist = Dir.entries(@input_dirname).join(' ')
filelist = filelist.split(' ').grep(/xml/)
filelist.each do |indfile|
  filecontents_tmp = File.read("#{@input_dirname}/#{indfile}", :encoding => 'bom|utf-8')
  puts filecontents_tmp
end
If I put a debug breakpoint at the puts line, my file is read in properly. If I just run the simple script, I get the following error:
in `read': ASCII incompatible encoding needs binmode (ArgumentError)
I'm confused as to why this would work in debug, but not when run normally. Ideas?
Have you tried printing the default encoding when you run the file, as opposed to when you debug it? There are three ways to set or change the encoding in Ruby (that I'm aware of), so I wonder if it differs between running and debugging. You should be able to tell by printing the default encoding: puts Encoding.default_external.
As for actually fixing the issue, I ran into a similar problem and found an answer which said to pass binmode as an option to the File.open call, and that worked for me.
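As a rough sketch of that fix, assuming the same @input_dirname loop as in the question, the mode string "rb:bom|utf-8" opens the file in binary mode while still stripping the BOM:
filelist.each do |indfile|
  # Binary mode plus the BOM-tolerant encoding in one mode string, which
  # satisfies the "ASCII incompatible encoding needs binmode" check:
  filecontents_tmp = File.read("#{@input_dirname}/#{indfile}", mode: "rb:bom|utf-8")
  puts filecontents_tmp
end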

File encoding issue when downloading file from AWS S3

I have a CSV file in AWS S3 that I'm trying to open in a local temp file. This is the code:
s3 = Aws::S3::Resource.new
bucket = s3.bucket({bucket name})
obj = bucket.object({object key})
temp = Tempfile.new('temp.csv')
obj.get(response_target: temp)
It pulls the file from AWS and loads it into a new temp file called 'temp.csv'. For some files, the obj.get(...) line throws the following error:
WARN: Encoding::UndefinedConversionError: "\xEF" from ASCII-8BIT to UTF-8
WARN: /Users/.rbenv/versions/2.5.0/lib/ruby/2.5.0/delegate.rb:349:in `write'
/Users/.rbenv/versions/2.5.0/lib/ruby/2.5.0/delegate.rb:349:in `block in delegating_block'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/http/response.rb:62:in `signal_data'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/net_http/handler.rb:83:in `block (3 levels) in transmit'
...
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-s3-1.13.0/lib/aws-sdk-s3/client.rb:2666:in `get_object'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-s3-1.13.0/lib/aws-sdk-s3/object.rb:657:in `get'
The stack trace shows the error is initially thrown by the .get call in the AWS SDK for Ruby.
Things I've tried:
When uploading the file (object) to AWS S3, you can specify content_encoding, so I tried setting that to UTF-8:
obj.upload_file({file path}, content_encoding: 'utf-8')
Also when you call .get you can set response_content_encoding:
obj.get(response_target: temp, response_content_encoding: 'utf-8')
Neither of those works; they result in the same error as above. I would really expect that to do the trick. In the AWS S3 dashboard I can see that the content encoding is indeed set correctly via the code, but it doesn't appear to make a difference.
It does work when I do the following in the first code snippet above:
temp = Tempfile.new('temp.csv', encoding: 'ascii-8bit')
But I'd prefer to upload and/or download the file from AWS S3 with the proper encoding. Can someone explain why specifying the encoding on the tempfile works, or how to make it work through the AWS S3 upload/download?
Important to note: the problematic character in the error message appears to be just a random symbol added at the beginning of this auto-generated file I'm working with. I'm not worried about reading the character correctly; it gets ignored when I parse the file anyway.
I don't have a full answer to all your questions, but I think I have a generalized solution, and that is to always put the temp file into binary mode. That way the AWS gem will simply dump the data from the bucket into the file, without any further re/encoding:
Step 1 (put the Tempfile into binmode):
temp = Tempfile.new('temp.csv')
temp.binmode
You will, however, have one remaining problem: there is now a 3-byte BOM header in your UTF-8 file.
I don't know where this BOM came from. Was it there when the file was uploaded? If so, it might be a good idea to strip the 3-byte BOM before uploading.
However, if you set things up as below, it will not matter, because Ruby supports transparent reading of UTF-8 with or without a BOM, and will return the string correctly regardless of whether the BOM header is in the file:
Step 2 (process the file using bom|utf-8):
File.read(temp.path, encoding: "bom|utf-8")
# or...
CSV.read(temp.path, encoding: "bom|utf-8")
This should cover all your bases, I think. Whether you receive files encoded as BOM + UTF-8 or plain UTF-8, you will process them correctly this way, without any extra header characters appearing in the final string and without errors when saving them with AWS.
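Putting the two steps together, a minimal end-to-end sketch (bucket and key names are placeholders):
require 'aws-sdk-s3'
require 'csv'
require 'tempfile'

obj = Aws::S3::Resource.new.bucket('my-bucket').object('my-key')
temp = Tempfile.new('temp.csv')
temp.binmode                                       # step 1: raw byte dump, no re-encoding
obj.get(response_target: temp)
rows = CSV.read(temp.path, encoding: 'bom|utf-8')  # step 2: BOM-tolerant read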
Another option (from the OP)
Use obj.get.body instead, which bypasses the whole issue with response_target and Tempfile.
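A sketch of that option (bucket and key names are placeholders); the body comes back as raw bytes, so it still needs to be tagged:
require 'aws-sdk-s3'

obj = Aws::S3::Resource.new.bucket('my-bucket').object('my-key')
body = obj.get.body.read                                 # raw ASCII-8BIT bytes
text = body.force_encoding('UTF-8').sub(/\A\uFEFF/, '')  # tag as UTF-8, drop a leading BOM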
Useful references:
Is there a way to remove the BOM from a UTF-8 encoded file?
How to avoid tripping over UTF-8 BOM when reading files
What's the difference between UTF-8 and UTF-8 without BOM?
How to write BOM marker to a file in Ruby
I fixed this encoding issue by additionally using File.open(tmp, 'wb'). Here is how it looks:
s3_object = Aws::S3::Resource.new.bucket("bucket-name").object("resource-key")
Tempfile.new.tap do |file|
  s3_object.get(response_target: File.open(file, "wb"))
end
The Ruby SDK docs have an example of downloading an S3 item to the filesystem at https://docs.aws.amazon.com/sdk-for-ruby/v3/developer-guide/s3-example-get-bucket-item.html. I just ran it and it works fine.
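That docs example boils down to something like this (region, bucket, and key are placeholders); handing get_object a path string lets the SDK write the file in binary mode, which sidesteps the Tempfile encoding problem:
require 'aws-sdk-s3'

client = Aws::S3::Client.new(region: 'us-east-1')
client.get_object(
  response_target: './local-copy.csv',  # a plain path, written in binary mode by the SDK
  bucket: 'my-bucket',
  key: 'my-key'
)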

Encoding problems with Ruby while reading command line arguments with optparse

I'm writing a small program in Ruby which essentially changes some files within a zip file. The zip file is specified as a parameter on the command line and interpreted via OptionParser.
The problem is that when specifying a file which contains non-ASCII characters, the file cannot be opened; the error says that it could not be found. This problem occurs using cmd.exe under Windows.
Here is a minimal example:
# example.rb
require "zip"
require "optparse"

zip_file_name = String.new

# Read and interpret the command line arguments:
OptionParser.new do |opts|
  opts.on("-f", "--file FILE", String, "The zip-file, which will be modified") do |f|
    zip_file_name = f
  end
end.parse!

# Open the zip file:
Zip::File.open(zip_file_name) do |zipfile|
end
If you create a zip file test.zip and run example.rb -f test.zip, everything is fine (it finishes without errors). Doing the same with a zip file täst.zip gives me an error. I tried zip_file_name.encode!(Encoding::UTF_8), but this didn't solve the problem.
It seems to be an encoding problem (the encoding of zip_file_name is CP850), but the transcoding does not seem to work correctly.
So my question would be: How can I change my program to also allow non-ascii characters for specifying files on the command line?
Adding zip_file_name.force_encoding(Encoding::Windows_1252) before opening the file solves the issue (on Western European Windows).
Apparently, CP850 is a wrong assumption by Ruby for the command-line encoding. On my Windows system, filenames seem to be encoded in Windows-1252 (a Windows variant of Latin-1 / ISO 8859-1).
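A minimal sketch of that workaround (the entry listing is just for illustration):
# force_encoding relabels the string without changing its bytes, which is
# what we want here, since the bytes coming from cmd.exe really are
# Windows-1252:
zip_file_name.force_encoding(Encoding::Windows_1252)
Zip::File.open(zip_file_name) do |zipfile|
  zipfile.each { |entry| puts entry.name }
end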

Rails: parsing an uploaded file fails with "\xDE" from ASCII-8BIT to UTF-8

I'm trying to parse an uploaded *.txt file and extract some information to import into the DB. But before saving it I try to get the string in UTF-8 format. When I do that I get the error:
"\xDE" from ASCII-8BIT to UTF-8
The first characters of the file:
Import data \xDE\xE4\xE5
The parsing code:
# encoding: utf-8
require "iconv"

class HandlerController < ApplicationController
  def add_report
    utf8_format = "UTF-8"
    file_data = params[:import_file].tempfile.read.encode(utf8_format)
  end
end
P.S. I also tried doing that with Iconv, but it didn't help.
You need to start from a known encoding with valid content (and compatible characters for input and output) before you will be able to successfully convert a string.
ASCII-8BIT doesn't assign Unicode-compatible characters to values 128..255, so it cannot be converted to Unicode.
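A quick irb illustration, using the same byte as the error above:
"\xDE".encode('UTF-8')
# => Encoding::UndefinedConversionError: "\xDE" from ASCII-8BIT to UTF-8
"\xDE".force_encoding('ISO-8859-1').encode('UTF-8')
# => "Þ" (once a source encoding is declared, the conversion succeeds)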
The chances are that the input, as you say it is text, is in some other encoding to start with. You could start by assuming ISO-8859-1 ("Latin-1"), which is quite a common encoding, although you may have some other clue, or know what characters to expect in the file, in which case you should try others.
I suggest you try something like this:
file_data = params[:import_file].tempfile.read.force_encoding('ISO-8859-1')
utf8_file_data = file_data.encode(utf8_format)
This probably will not give you an error, but if my guess of ISO-8859-1 is wrong, the result will unfortunately be gibberish.

Running a Scala script saved as UTF-8 gives an error

I am new to Scala and tried some small programs from the book "Programming in Scala". When the script is saved as ANSI it works well, but when I save it as UTF-8 an error is thrown: "error: illegal character ?import". I run this small example program on Windows. The example program is:
import scala.io.Source

if (args.isEmpty) {
} else {
  Source.fromFile(args(0)).getLines.toList.zipWithIndex.foreach {
    case (line, i) => println(i + " " + line)
  }
}
What's going on there?
I guess you saved your file with a BOM.
If you save your source code without a BOM (how to do that depends on which text editor you are using), it will work fine.
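If you want to verify this, a small sketch that checks a file for the UTF-8 BOM (the path is taken from the command line):
import java.nio.file.{Files, Paths}

object BomCheck extends App {
  // The UTF-8 BOM is the three bytes 0xEF 0xBB 0xBF at the start of the file.
  val bytes = Files.readAllBytes(Paths.get(args(0)))
  val hasBom = bytes.length >= 3 &&
    (bytes(0) & 0xFF) == 0xEF &&
    (bytes(1) & 0xFF) == 0xBB &&
    (bytes(2) & 0xFF) == 0xBF
  println(if (hasBom) "BOM present" else "no BOM")
}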
