Encode a movie with Unicode filename in Windows using Popen - ruby

I want to encode a movie through IO.popen with Ruby (1.9.3) on Windows 7.
If the file name contains only ASCII characters, encoding proceeds normally.
But with a Unicode filename, the script returns a "No such file or directory" error.
Here is the code:
#-*- encoding: utf-8 -*-
command = "ffmpeg -i ü.rm"
IO.popen(command){|pipe|
  pipe.each{|line|
    p line
  }
}
I couldn't find out whether the problem is caused by ffmpeg or by Ruby.
How can I fix this problem?

Windows doesn't use UTF-8 as its default encoding. Ruby sends the byte sequence of the Unicode filename to the file system directly, and of course the file system won't recognize the UTF-8 sequence. Newer versions of Ruby may have fixed this issue. (I'm not sure; I'm using 1.9.2p290 and it's still there.)
You need to convert the UTF-8 filename to the encoding your Windows uses.
# coding: utf-8
code_page = "cp#{`chcp`.chomp[/\d+$/]}" # detect the code page automatically
command = "ffmpeg -i ü.rm".encode(code_page)
IO.popen(command) do |pipe|
  pipe.each do |line|
    p line
  end
end
Another way is to save your script in the same encoding Windows uses, and don't forget to update the encoding declaration accordingly. For example, I'm using Simplified Chinese Windows, which uses GBK (CP936) as its default encoding:
# coding: GBK
# save this file in GBK
command = "ffmpeg -i ü.rm"
IO.popen(command) do |pipe|
  pipe.each do |line|
    p line
  end
end
BTW, by convention, do...end is preferred over {...} for multi-line blocks, except in special cases.
UPDATE:
The underlying filesystem, NTFS, uses UTF-16 for filename encoding, so 가 is a valid filename character. However, GBK cannot encode 가, and neither can CP932 on your Japanese Windows. So you cannot send that specific filename through your cmd.exe, and it isn't likely you can process that file with IO.popen. For CP932-compatible filenames, the encoding approach above works fine. For filenames that are not compatible with CP932, it might be better to rename the file to a compatible name first.
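As a quick way to check in advance which filenames will survive the trip through the console code page, here's a small sketch. Windows-1252 is used only as a stand-in code page for illustration; substitute whatever `chcp` reports on your system (e.g. CP932 or CP936):

```ruby
# Sketch: can a filename be converted to the given console code page?
# "Windows-1252" below is a stand-in; use the code page `chcp` reports.
def representable?(name, code_page)
  name.encode(code_page)
  true
rescue Encoding::UndefinedConversionError
  false
end
```

With this helper, `representable?("ü.rm", "Windows-1252")` is true, while `representable?("가.rm", "Windows-1252")` is false, so the latter file cannot be named on a cmd.exe command line using that code page.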

Related

Concatenating files in Windows Command Prompt and the string "‹¯¨"

I am concatenating files on Windows. I have used the TYPE and the COPY commands and I get the same artifact: at the place where my original files are joined in the new file, the character string "‹¯¨" (decimal: 139 175 168, hex: 8B AF A8) is inserted.
How can I troubleshoot this? Is there an easy explanation for how to avoid it, and why does it happen?
A very good explanation of why this happens is in Mark Tolonen's answer, so I will not repeat it.
Instead of the obsolete TYPE and COPY, you can use PowerShell now:
powershell -Command "& { Get-Content a*.txt | Out-File output.txt -Encoding utf8 }"
This command gets the content of all files matching a*.txt in the current folder and concatenates them into the file output.txt using UTF-8.
PowerShell is part of Windows 7 and later.
The extra bytes are a UTF-8 encoding signature. The Unicode byte order mark U+FEFF is encoded in UTF-8 and written to the beginning of the file to indicate that the file is encoded in UTF-8. It's not required, but Windows assumes a text file is encoded in the local ANSI encoding (commonly Windows-1252) unless a BOM appears.
Many file tools don't know about this (DOS copy being one of them), so concatenating files can be troublesome.
Being ignorant of encodings often causes trouble. You can't simply concatenate two text files of unknown encoding; they may be different.
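The BOM mechanics described above can be verified in a couple of lines of Ruby (used here since other questions on this page are Ruby-based). This sketch shows the classic case of the BOM's UTF-8 bytes being misread as Windows-1252; the exact mojibake you see depends on how many encoding round trips the bytes went through:

```ruby
# The UTF-8 encoding of U+FEFF (the BOM) is the three bytes EF BB BF.
bom_bytes = "\uFEFF".encode("UTF-8").bytes
# => [0xEF, 0xBB, 0xBF]

# Misread those same bytes as Windows-1252 and they render as "ï»¿".
mojibake = "\xEF\xBB\xBF".b.force_encoding("Windows-1252").encode("UTF-8")
# => "ï»¿"
```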
If you know the encoding, use a tool that understands the encoding. Here's a very basic concatenate script written in Python that will convert encodings as well.
# cat.py
import sys

if len(sys.argv) < 5:
    print('usage: cat <in_encoding> <out_encoding> <outfile> <infile> [infile...]')
else:
    with open(sys.argv[3], 'w', encoding=sys.argv[2]) as fout:
        for file in sys.argv[4:]:
            with open(file, 'r', encoding=sys.argv[1]) as fin:
                fout.write(fin.read())
Given two files with UTF-8 w/ BOM encoding, this command will output UTF-8 (no BOM):
cat.py utf-8-sig utf-8 out.txt test1.txt test2.txt
Side note about Python: utf-8-sig encoding reads files and removes the BOM from the data if present, so it can be used to read any UTF-8 file with or without a BOM. utf-8-sig encoding writes a BOM at the start of a file, but utf-8 does not.

Reading a file with ISO-8859 encoding

According to Mac OSX, I have a file with ISO-8859 encoding:
$ file filename.txt
filename.txt: ISO-8859 text, with CRLF line terminators
I try to read it with that encoding:
> filename = "/Users/myuser/Downloads/filename.txt"
> content = File.read(filename, encoding: "ISO-8859")
> content.encoding
=> #<Encoding:UTF-8>
It doesn't work. And consequently:
> content.split("\n")
ArgumentError: invalid byte sequence in UTF-8
Why doesn't it read the file as ISO-8859?
With your code, Ruby emits the following warning when reading the file:
warning: Unsupported encoding ISO-8859 ignored
This is because there is not just one ISO 8859 encoding but actually quite a bunch of variants. You need to specify the correct one explicitly, e.g.
content = File.read(filename, encoding: "ISO-8859-1")
# or equivalently
content = File.read(filename, encoding: Encoding::ISO_8859_1)
When dealing with text files produced on Windows machines (which is hinted at by the CRLF line endings), you might want to use Encoding::Windows_1252 (resp. "Windows-1252") instead. This is a superset of ISO 8859-1 and used to be the default encoding of many Windows programs and the system itself.
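A self-contained sketch of the working call, using a temp file whose contents are made up for illustration (0xE9 is "é" in ISO-8859-1):

```ruby
require "tempfile"

content = nil
Tempfile.create("latin1") do |tmp|
  tmp.binmode
  tmp.write("caf\xE9".b)   # "café" encoded as ISO-8859-1
  tmp.flush
  # Reading with an explicit, fully-named encoding actually takes effect.
  content = File.read(tmp.path, encoding: Encoding::ISO_8859_1)
end

content.encoding        # => #<Encoding:ISO-8859-1>
content.encode("UTF-8") # transcodes cleanly to "café"
```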
Try to use Encoding::ISO_8859_1 instead.

MBCS to UTF-8: How to encode in Python

I am trying to create a duplicate file finder for Windows. My program works well on Linux, but it writes NUL characters to the log file on Windows. This is due to Windows' default MBCS file system encoding, while the file system encoding on Linux is UTF-8. How can I convert MBCS to UTF-8 to avoid this error?
Tell Python to use UTF-8 on the log file. In Python 3 you do this by:
open(..., encoding='utf-8')
If you have an MBCS-encoded byte string and want UTF-8, decode and re-encode:
filename.decode('mbcs').encode('utf-8')
Use filename.encode(sys.getdefaultencoding())... to make the code work on Linux, as well.
Just change the encoding to 'latin-1' (encoding='latin-1'):
Using pure Python:
open(..., encoding = 'latin-1')
Using Pandas:
pd.read_csv(..., encoding='latin-1')

File encoding using ruby in windows

I have two files in a Windows folder. Using the technique described here, I found out that one file's encoding is ANSI and the other's is UTF-8.
However, if I open cmd or PowerShell and try to get the encoding in IRB with the following code, I always get "CP850":
File.open(file_name).read.encoding.name # => CP850
or
File.open(file_name).external_encoding.name # => CP850
Notepad++ also tells me that one file is ANSI and the other is UTF-8.
How can I get the proper encoding using Ruby on Windows?
It is impossible to tell for certain what encoding a file is in, but it's possible to make an educated guess.
When you open a file, Ruby simply assumes it's encoded in the default 8-bit encoding (in your case, CP850).
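A small sketch illustrating this: the encoding Ruby reports simply mirrors Encoding.default_external, regardless of what bytes are in the file (a temp file is used here for illustration):

```ruby
require "tempfile"

reported = nil
Tempfile.create("sample") do |tmp|
  tmp.write("hello")
  tmp.flush
  File.open(tmp.path) do |f|
    # Ruby never inspected the bytes; it just applied the process default.
    reported = f.external_encoding
  end
end
# `reported` equals Encoding.default_external, whatever the file contains.
```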
See Detect encoding
and What is ANSI format? about ANSI

How can I convert a string from windows-1252 to utf-8 in Ruby?

I'm migrating some data from MS Access 2003 to MySQL 5.0 using Ruby 1.8.6 on Windows XP (writing a Rake task to do this).
Turns out the Windows string data is encoded as windows-1252, and Rails and MySQL both assume utf-8 input, so some of the characters, such as apostrophes, are getting mangled. They wind up as "a"s with accents over them and the like.
Does anyone know of a tool, library, system, methodology, ritual, spell, or incantation to convert a windows-1252 string to utf-8?
For Ruby 1.8.6, it appears you can use Ruby Iconv, part of the standard library:
Iconv documentation
According to this helpful article, it appears you can at least purge unwanted win-1252 characters from your string like so:
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
One might then attempt to do a full conversion like so:
ic = Iconv.new('UTF-8', 'WINDOWS-1252')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]
If you're on Ruby 1.9...
string_in_windows_1252 = database.get(...)
# => "Fåbulous"
string_in_windows_1252.encoding
# => #<Encoding:Windows-1252>
string_in_utf_8 = string_in_windows_1252.encode('UTF-8')
# => "Fåbulous"
string_in_utf_8.encoding
# => #<Encoding:UTF-8>
Hi,
I had the exact same problem. These tips helped me get going:
Always check for the proper encoding name in order to feed your conversion tools correctly.
If in doubt, you can get a list of supported encodings for iconv or recode using:
$ recode -l
or
$ iconv -l
Always start from your original file and encode a sample to work with:
$ recode windows-1252..u8 < original.txt > sample_utf8.txt
or
$ iconv -f windows-1252 -t utf8 original.txt -o sample_utf8.txt
Install Ruby 1.9, because it helps you a LOT when it comes to encodings. Even if you don't use it in your program, you can always start an irb1.9 session and poke at the strings to see what the output is.
File.open has a new 'mode' parameter in Ruby 1.9. Use it!
This article helped a lot: http://blog.nuclearsquid.com/writings/ruby-1-9-encodings
File.open('original.txt', 'r:windows-1252:utf-8')
# This opens a file specifying all encoding options. r:windows-1252 means read it as windows-1252. :utf-8 means treat it as utf-8 internally.
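A self-contained sketch of that mode string in action, using a temp file whose contents are made up for illustration:

```ruby
require "tempfile"

utf8_text = nil
Tempfile.create("w1252") do |tmp|
  tmp.binmode
  tmp.write("caf\xE9".b)   # "café" encoded as Windows-1252 (0xE9 = é)
  tmp.flush
  # r:windows-1252:utf-8 — read the bytes as Windows-1252,
  # hand them to the program transcoded to UTF-8.
  utf8_text = File.open(tmp.path, "r:windows-1252:utf-8", &:read)
end
# utf8_text is "café" with encoding UTF-8
```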
Have fun and swear a lot!
If you want to convert a file named win1252file, on a unix OS, run:
$ iconv -f windows-1252 -t utf-8 win1252_file > utf8_file
You should probably be able to do the same on Windows with cygwin.
If you're NOT on Ruby 1.9, and assuming yhager's command works, you could try:
File.open('/tmp/w1252', 'w') do |file|
  my_windows_1252_string.each_byte do |byte|
    file << byte
  end
end
`iconv -f windows-1252 -t utf-8 /tmp/w1252 > /tmp/utf8`
my_utf_8_string = File.read('/tmp/utf8')
['/tmp/w1252', '/tmp/utf8'].each do |path|
  FileUtils.rm path
end
