How to copy files with Unicode characters in file names in Ruby? - ruby

I cannot copy files that have Unicode characters in their names from Ruby 1.9.2p290 on Windows 7.
For example, I have two files in a dir:
file
ハリー・ポッターと秘密の部屋
(The second name contains Japanese characters if you can not see it)
Here is the code:
> entries = Dir.entries(path) - %w{ . .. }
> entries[0]
=> "file"
> entries[1]
=> "???????????????" # <--- what?
> File.file? entries[0]
=> true
> File.file? entries[1]
=> false # <--- !!! Ruby can not see it and will not copy
> entries[1].encoding.name
=> "Windows-1251"
> Encoding.find('filesystem').name
=> "Windows-1251"
As you can see, my Ruby filesystem encoding is "Windows-1251", which is an 8-bit encoding and cannot represent Japanese characters. Setting the default_external and default_internal encodings to 'utf-8' does not help.
How can I copy those files from Ruby?
Update
I found a solution. It works if I use Dir.glob or Dir[] instead of Dir.entries. File names are now returned in utf-8 encoding and can be copied.
Update #2
My Dir.glob solution appears to be quite limited. It only works with the "*" pattern:
Dir.glob("*") # <--- Shows Unicode names correctly
Dir.glob("c:/test/*") # <--- Does not work for Unicode names

Not so much a real solution, but as a workaround, given:
Dir.glob("*") # <--- Shows Unicode names correctly
Dir.glob("c:/test/*") # <--- Does not work for Unicode names
is there any reason you can't do this:
Dir.chdir("c:/test/")
Dir.glob("*")
?
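A minimal sketch of that workaround, assuming the goal is to copy every regular file out of one directory. The helper name copy_via_chdir and the throwaway directories are made up for illustration; whether the relative glob returns usable names on a given Ruby/Windows combination is the behaviour reported in the question, not something this sketch guarantees.

```ruby
require 'fileutils'
require 'tmpdir'

# Copy every regular file in src_dir to dest_dir by globbing relative
# names from inside the directory (hypothetical helper).
def copy_via_chdir(src_dir, dest_dir)
  dest = File.expand_path(dest_dir) # resolve before the working directory changes
  FileUtils.mkdir_p(dest)
  Dir.chdir(src_dir) do             # block form restores the previous directory
    names = Dir.glob('*').select { |name| File.file?(name) }
    names.each { |name| FileUtils.cp(name, File.join(dest, name)) }
    names
  end
end

# Usage with temporary directories
Dir.mktmpdir do |root|
  src = File.join(root, 'src')
  FileUtils.mkdir_p(src)
  File.write(File.join(src, 'ハリー・ポッターと秘密の部屋'), 'data')
  copy_via_chdir(src, File.join(root, 'dest'))
  puts Dir.entries(File.join(root, 'dest')) - %w[. ..]
end
```

The block form of Dir.chdir is used so the working directory is restored even if a copy fails.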

It's been a while, but I was looking into the same problem and it was anything but obvious how to do it.
Turns out that you can specify an encoding when you call Dir.entries in Ruby >= 2.1.
Dir.entries(path, encoding: Encoding::UTF_8)
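As a rough sketch, the encoding option can be combined with FileUtils.cp to do the copy the question asks about. The helper name copy_with_utf8_names is hypothetical, and the example uses a temporary directory rather than a real path.

```ruby
require 'fileutils'
require 'tmpdir'

# List src_dir with UTF-8 names and copy the regular files to dest_dir
# (hypothetical helper).
def copy_with_utf8_names(src_dir, dest_dir)
  FileUtils.mkdir_p(dest_dir)
  names = Dir.entries(src_dir, encoding: Encoding::UTF_8) - %w[. ..]
  names.each do |name|
    source = File.join(src_dir, name) # join with the directory, not the cwd
    FileUtils.cp(source, File.join(dest_dir, name)) if File.file?(source)
  end
  names
end

Dir.mktmpdir do |root|
  src = File.join(root, 'src')
  FileUtils.mkdir_p(src)
  File.write(File.join(src, 'ハリー・ポッターと秘密の部屋'), 'data')
  names = copy_with_utf8_names(src, File.join(root, 'dest'))
  puts names.map { |n| "#{n} (#{n.encoding})" }
end
```

Joining each name with the source directory matters: File.file? on a bare entry name is resolved against the current working directory, not against the directory that was listed.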

Related

Why is Ruby failing to convert CP-1252 to UTF-8?

I have CSV files saved from Excel in CP-1252/Windows-1252. I tried the following, but the output still comes out corrupted. Why?
require 'csv'
csv_text = File.read(arg[:file], encoding: 'cp1252').encode('utf-8')
# csv_text = File.read(arg[:file], encoding: 'cp1252')
csv = CSV.parse csv_text, :headers => true
csv.each do |row|
  # create model
  p model
end
The result
>rake import:csv["../file.csv"] | grep Brien
... name: "Oâ?TBrien ...
However it works in the console
> "O\x92Brien".force_encoding("cp1252").encode("utf-8")
=> "O'Brien"
I can open the CSV file in Notepad++, Encoding > Character Sets > Western European > Windows-1252, see the correct characters, then Encoding > Convert to UTF-8. However, there are many files and I want Ruby to handle this.
Similar: How to change the encoding during CSV parsing in Rails. But this doesn't explain why this is failing.
Ruby 2.4, Reference: https://ruby-doc.org/core-2.4.3/IO.html#method-c-read
Wow, it was caused by the shitty grep in DevKit.
>rake import:csv["../file.csv"]
... name: "O'Brien ...
>where grep
C:\DevKit2\bin\grep.exe
I also did not need the .encode('utf-8').
Let that be a lesson kids. Never take anything for granted. Trust no one!
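One way to take the terminal and any pipe out of the picture is to check the string inside Ruby itself. The sketch below assumes a small CP-1252 file written to a temporary directory; read_cp1252_as_utf8 is a hypothetical helper name. It prints the code points instead of the text, so a tool that mangles non-ASCII output cannot hide the result.

```ruby
require 'tmpdir'

# Read a Windows-1252 file and return its contents as UTF-8
# (hypothetical helper).
def read_cp1252_as_utf8(path)
  File.read(path, encoding: 'cp1252').encode('utf-8')
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'names.csv')
  # 0x92 is the Windows-1252 byte for the right single quotation mark (U+2019).
  File.binwrite(path, "name\nO\x92Brien\n".b)

  text = read_cp1252_as_utf8(path)
  puts text.encoding
  puts text.valid_encoding?
  # ASCII-only output, safe to pipe through any tool
  puts text.codepoints.map { |c| format('U+%04X', c) }.join(' ')
end
```

If U+2019 shows up in the code points, the conversion worked and any remaining corruption comes from whatever displays the output.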

File.exist? not working when directory name has special characters

File.exist? is not working with directory names that contain special characters. For something like the path given below
path = "/home/cis/Desktop/'El%20POP%20que%20llevas%20dentro%20Vol.%202'/*.mp3"
it works fine, but if the name has letters like ñ it returns false.
Please help with this.
Try the following:
Make sure you're running 1.9.2 or greater and put # encoding: UTF-8 at the top of your file (which must be in UTF-8 and your editor must support it).
If you're running MRI (i.e. not JRuby or another implementation), you can set the environment variable RUBYOPT=-Ku instead of adding # encoding: UTF-8 to the top of each file.
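A small sketch of the first suggestion, under two assumptions that are not stated in the thread: the source file is saved as UTF-8, and the *.mp3 wildcard in the question was meant to match files. File.exist? takes a literal path and does not expand wildcards, so the sketch checks the directory with File.exist? and expands the pattern with Dir.glob. The helper name mp3s_in and the directory names are made up.

```ruby
# encoding: UTF-8
require 'fileutils'
require 'tmpdir'

# Return the .mp3 files directly inside dir (hypothetical helper).
def mp3s_in(dir)
  return [] unless File.exist?(dir)  # literal path, no wildcards
  Dir.glob(File.join(dir, '*.mp3'))  # wildcards are expanded here
end

Dir.mktmpdir do |root|
  dir = File.join(root, 'Señor Vol. 2')
  FileUtils.mkdir_p(dir)
  File.write(File.join(dir, 'canción.mp3'), '')

  puts File.exist?(dir)
  puts mp3s_in(dir).map { |f| File.basename(f) }
end
```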

ruby 1.9 wrong file encoding on windows

I have a ruby file with these contents:
# encoding: iso-8859-1
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'}
puts File.read('foo.txt').encoding
When I run it from windows command prompt ruby 1.9.3 I get: IBM437
When I run it from cygwin ruby 1.9.3 I get: UTF-8
What I expect to get is: iso-8859-1
Can someone explain what's happening here?
UPDATE
Here's a better description of what I'm looking for:
I understand now, thanks to Darshan, that by default Ruby will load files in Encoding.default_external, but shouldn't the # encoding: iso-8859-1 line override that?
Should Ruby be able to auto-detect a file's encoding? Is there any filesystem where the encoding is an attribute?
What is my best option to 'remember' the encoding I saved the file in?
You're not specifying the encoding when you read the file. You're being very careful to specify it everywhere except there, but then you're reading it with the default encoding.
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'.force_encoding('iso-8859-1')}
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding }
# => ISO-8859-1
Also note that you probably mean 'fòo'.encode('iso-8859-1') rather than 'fòo'.force_encoding('iso-8859-1'). The latter leaves the bytes unchanged, while the former transcodes the string.
Update: I'll elaborate a bit since I wasn't as clear or thorough as I could have been.
If you don't specify an encoding with File.read(), the file will be read with Encoding.default_external. Since you're not setting that yourself, Ruby is using a value depending on the environment it's run in. In your Windows environment, it's IBM437; in your Cygwin environment, it's UTF-8. So my point above was that of course that's what the encoding is; it has to be, and it has nothing to do with what bytes are contained in the file. Ruby doesn't auto-detect encodings for you.
force_encoding() doesn't change the bytes in a string, it only changes the Encoding attached to those bytes. If you tell Ruby "pretend this string is ISO-8859-1", then it won't transcode them when you tell it "please write this string as ISO-8859-1". encode() transcodes for you, as does writing to the file if you don't trick it into not doing so.
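The difference is easy to see by looking at the bytes. This is only an illustration; the string is written with a \u escape so that it is UTF-8 regardless of how the source file itself is saved.

```ruby
utf8 = "f\u00F2o" # 'fòo' as a UTF-8 string

transcoded = utf8.encode('ISO-8859-1')             # same characters, new bytes
relabeled  = utf8.dup.force_encoding('ISO-8859-1') # same bytes, new label

p utf8.bytes        # => [102, 195, 178, 111]
p transcoded.bytes  # => [102, 242, 111]
p relabeled.bytes   # => [102, 195, 178, 111]
p relabeled.length  # => 4
```

The relabeled string has four characters because each of the two UTF-8 bytes of ò is now read as a separate ISO-8859-1 character.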
Putting those together, if you have a source file in ISO-8859-1:
# encoding: iso-8859-1
# Write in ISO-8859-1 regardless of default_external
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'}
# Read in ISO-8859-1 regardless of default_external,
# transcoding if necessary to default_internal, if set
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding } # => ISO-8859-1
puts File.read('foo.txt').encoding # -> Whatever is specified by default_external
If you have a source file in UTF-8:
# encoding: utf-8
# Write in ISO-8859-1 regardless of default_external, transcoding from UTF-8
File.open('foo.txt', "w:iso-8859-1") {|f| f << 'fòo'}
# Read in ISO-8859-1 regardless of default_external,
# transcoding if necessary to default_internal, if set
File.open('foo.txt', "r:iso-8859-1") {|f| puts f.read().encoding } # => ISO-8859-1
puts File.read('foo.txt').encoding # -> Whatever is specified by default_external
Update 2, to answer your new questions:
No, the # encoding: iso-8859-1 line does not change Encoding.default_external, it only tells Ruby that the source file itself is encoded in ISO-8859-1. Simply add
Encoding.default_external = "iso-8859-1"
if you expect all files that you read to be stored in that encoding.
No, I don't personally think Ruby should auto-detect encodings, but reasonable people can disagree on that one, and a discussion of "should it be so" seems off-topic here.
Personally, I use UTF-8 for everything, and in the rare circumstances that I can't control encoding, I manually set the encoding when I read the file, as demonstrated above. My source files are always in UTF-8. If you're dealing with files that you can't control and don't know the encoding of, the charguess gem or similar would be useful.
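To make the round trip concrete, here is a sketch that writes a file as ISO-8859-1, checks the bytes on disk, and compares an explicit read with a default read. The helper name read_encodings is hypothetical, and the sketch assumes Encoding.default_internal is not set.

```ruby
require 'tmpdir'

# Return the encodings of an explicit ISO-8859-1 read and a default read
# (hypothetical helper).
def read_encodings(path)
  explicit = File.open(path, 'r:iso-8859-1') { |f| f.read }
  implicit = File.read(path) # tagged with Encoding.default_external
  [explicit.encoding, implicit.encoding]
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'foo.txt')
  # The UTF-8 string is transcoded to ISO-8859-1 as it is written.
  File.open(path, 'w:iso-8859-1') { |f| f << "f\u00F2o" }

  p File.binread(path).bytes # => [102, 242, 111]
  p read_encodings(path)
end
```

The file itself carries no record of its encoding, so the caller has to supply it on every read, either per call as here or globally through Encoding.default_external.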

Automatically open a file as binary with Ruby

I'm using Ruby 1.9 to open several files and copy them into an archive. Now there are some binary files, but some are not. Since Ruby 1.9 does not open binary files automatically as binaries, is there a way to open them automatically anyway? (So ".class" would be binary, ".txt" not)
Actually, the previous answer by Alex D is incomplete. While it's true that there is no "text" mode in Unix file systems, Ruby does make a difference between opening files in binary and non-binary mode:
s = File.open('/tmp/test.jpg', 'r') { |io| io.read }
s.encoding
=> #<Encoding:UTF-8>
is different from (note the "rb")
s = File.open('/tmp/test.jpg', 'rb') { |io| io.read }
s.encoding
=> #<Encoding:ASCII-8BIT>
The latter, as the docs say, sets the external encoding to ASCII-8BIT, which tells Ruby not to attempt to interpret the result as UTF-8. You can achieve the same thing by setting the encoding explicitly with s.force_encoding('ASCII-8BIT'). This is key if you want to read binary data into a string and move it around (e.g. saving it to a database).
Since Ruby 1.9.1 there is a separate method for binary reading (IO.binread) and since 1.9.3 there is one for writing (IO.binwrite) as well:
For reading:
content = IO.binread(file)
For writing:
IO.binwrite(file, content)
Since IO is the parent class of File, you could also do the following which is probably more expressive:
content = File.binread(file)
File.binwrite(file, content)
On Unix-like platforms, there is no difference between opening files in "binary" and "text" modes. On Windows, "text" mode converts line breaks to DOS style, and "binary" mode does not.
Unless you need linebreak conversion on Windows platforms, just open all the files in "binary" mode. There is no harm in reading a text file in "binary" mode.
If you really want to distinguish, you will have to match File.extname(filename) against a list of known extensions like ".txt" and ".class".
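A sketch of the extension-based approach, assuming a short list of text extensions. The list and the helper name read_for_archive are made up for illustration; anything not on the list is read as binary, which is the safe default described above.

```ruby
require 'tmpdir'

# Read path as text if its extension is in text_extensions, otherwise as
# binary (hypothetical helper; the default list is only an example).
def read_for_archive(path, text_extensions = %w[.txt .csv .md])
  if text_extensions.include?(File.extname(path).downcase)
    File.read(path)    # text mode, tagged with default_external
  else
    File.binread(path) # binary mode, tagged ASCII-8BIT
  end
end

Dir.mktmpdir do |dir|
  klass = File.join(dir, 'Demo.class')
  notes = File.join(dir, 'notes.txt')
  File.binwrite(klass, "\xCA\xFE\xBA\xBE".b)
  File.write(notes, "hello\n")

  puts read_for_archive(klass).encoding
  puts read_for_archive(notes).encoding
end
```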

Recursive directory listing using Ruby with Chinese characters in file names

I would like to generate a list of files within a directory. Some of the filenames contain Chinese characters.
eg: [试验].Test.txt
I am using the following code:
require 'find'
dirs = ["TestDir"]
dirs.each do |dir|
  Find.find(dir) do |path|
    # Print files only; skip directories
    p path unless FileTest.directory?(path)
  end
end
Running the script produces a list of files but the Chinese characters are escaped (replaced with backslashes followed by numbers). Using the example filename above would produce:
"TestDir/[\312\324\321\351]Test.txt" instead of "TestDir/[试验].Test.txt".
How can the script be altered to output the Chinese characters?
Ruby needs to know that you are dealing with Unicode in your code. On Ruby 1.8, set the appropriate character encoding using $KCODE, as below ($KCODE has no effect in Ruby 1.9 and later):
$KCODE = 'utf-8'
I think UTF-8 is good enough for Chinese characters.
The following code is more elegant and doesn't require 'find.' It produces a list of files (but not directories) in whatever the working directory is (or whatever directory you put in).
Dir.entries(Dir.pwd).each do |x|
p x.encode('UTF-8') unless FileTest.directory?(x)
end
And to get a recursive digging down one level use:
Dir.glob('*/*').each do |x|
p x.encode('UTF-8') unless FileTest.directory?(x)
end
To go all the way down, use Dir.glob('**/*'). It recurses through every subdirectory below the working directory (not the whole file system), so run it from the directory you want to list.
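A minimal sketch of a fully recursive listing, assuming Ruby 1.9 or later. The `**` pattern only descends from the directory it is anchored at, so joining it onto a starting directory keeps the walk inside that tree. The helper name files_below and the sample file names are hypothetical.

```ruby
require 'fileutils'
require 'tmpdir'

# Return every regular file at any depth below dir (hypothetical helper).
def files_below(dir)
  Dir.glob(File.join(dir, '**', '*')).select { |path| File.file?(path) }
end

Dir.mktmpdir do |root|
  nested = File.join(root, 'TestDir', 'sub')
  FileUtils.mkdir_p(nested)
  File.write(File.join(root, 'TestDir', '[试验].Test.txt'), '')
  File.write(File.join(nested, 'deeper.txt'), '')

  # puts prints the names as they are; p would show the inspected form
  puts files_below(File.join(root, 'TestDir'))
end
```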

Resources