Ruby character encoding for shreelipi Indian devenagari font - ruby

I am new to Ruby and I'm trying to write a script (Ruby 1.9.3, on Windows XP) which will automate text extraction from an InDesign document using the WIN32OLE library.
The InDesign document has text in ShreeLipi font (an Indian Devanagari script). My Ruby script is:
require 'win32ole'
app = WIN32OLE.new('InDesign.Application')
doc = app.activeDocument
text_frame = doc.textFrames(1)
text = text_frame.contents #=> "emhy ‘hmamOm§Mo ñ‘maH$ emhy {‘b‘ܶo C^mam"
puts text.encoding.name #=> "IBM437"
file = File.open('D:/try.txt','w')
file.puts text
file.close
When I open this same file to view the text using Notepad it shows:
"emhy `hmamOmMo ¤`maH$ emhy {`b`šo C^mam"
I can't understand why this is happening. Please help me correct it.
I tried to resolve it using Windows-1252 encoding and also with ISO-8859-1 but could not find a solution.

You aren't preserving your bytes when you write the file. Instead of doing:
file = File.open('D:/try.txt','w')
use:
file = File.open('D:/try.txt','wb')
The b means to write as binary, which, in plain English, means to do no line-end conversions.

Related

Ruby writing zip file works on Mac but not on windows / How to recieve zip file in Net::HTTP

actually i'm writing a ruby script which accesses an API based on HTTP-POST calls.
The API returns a zip file containing textdocuments when i call it with specific POST-Parameters. At the moment i'm doing that with the Net::HTTP Package.
Now my problem:
It seems to return the zip-file as a string as far as i know. I can see "PK" (which i suppose is part of the PK-Header of zip-files) and the text from the documents.
And the Content-Type Header is telling me "application/x-zip-compressed; name="somename.zip"".
When i save the zip file like so:
result = comodo.get_cert("<somenumber>")
puts result['Content-Type']
puts result.inspect
puts result.body
File.open("test.zip", "w") do |file|
file.write result.body
end
I can unzip it on my macbook without further problems. But when i run the same code on my Win10 PC it tells me that the file is corrupt or not a ZIP-file.
Has it something to do with the encoding? Can i change it, so it's working on both?
Or is it a complete wrong approach on how to recieve a zip-file from a POST-request?
PS:
My ruby-version on Mac:
ruby 2.2.3p173
My ruby-version on Windows:
ruby 2.2.4p230
Many thanks in advance!
The problem is due to the way Windows handles line endings (\r\n for Windows, whereas OS X and other Unix based operating systems use just \n). When using File.open, using the mode of just w makes the file subject to line ending changes, so any occurrences of byte 0x0A (or \n) are converted into bytes 0x0D 0x0A (or \r\n), which effectively breaks the zip.
When opening the file for write, use the mode wb instead, as this will suppress any line ending changes.
http://ruby-doc.org/core-2.2.0/IO.html#method-c-new-label-IO+Open+Mode
Many thanks! Just as you posted the solution i found it out myself..
So much trouble because of one missing 'b' :/
Thank you very much!
The solution (see Ben Y's answer):
result = comodo.get_cert("<somenumber>")
puts result['Content-Type']
puts result.inspect
puts result.body
File.open("test.zip", "wb") do |file|
file.write result.body
end

Why does a file written out encoded as UTF-8 end up being ISO-8859-1 instead?

I am reading an ISO-8859-1 encoded text file, transcoding it to UTF-8, and writing out a different file as UTF-8. However, when I inspect the output file, it is still encoded as ISO-8859-1! What am I doing wrong?
Here is my ruby class:
module EF
class Transcoder
# app_path ......... Path to the java console application (InferEncoding.jar) that infers the character encoding.
# target_encoding .. Transcodes the text loaded from the file into this encoding.
attr_accessor :app_path, :target_encoding
def initialize(consoleAppPath)
#app_path = consoleAppPath
#target_encoding = "UTF-8"
end
def detect_encoding(filename)
encoding = `java -jar #{#app_path} \"#{filename}\"`
encoding = encoding.strip
end
def transcode(filename)
original_encoding = detect_encoding(filename)
content = File.open(filename, "r:#{original_encoding}", &:read)
content = content.force_encoding(original_encoding)
content.encode!(#target_encoding, :invalid => :replace)
end
def transcode_file(input_filename, output_filename)
content = transcode(input_filename)
File.open(output_filename, "w:#{#target_encoding}") do |f|
f.write content
end
end
end
end
By way of explanation, #app_path is the path to a Java jar file. This console application will read a text file and tell me what its current encoding is (printing it to stdout). It uses the ubiquitous ICU library. (I tried using the ruby gem charlock-holmes, but I cannot get it to compile on Windows for MINGW. The Java bindings to ICU are good, so I wrote a Java application instead.)
To call the above class, I do this in irb:
require_relative 'transcoder'
tc = EF::Transcoder.new("C:/Users/paul.chernoch/Documents/java/InferEncoding.jar")
tc.detect_encoding "C:/temp/infer-encoding-test/ISO-8859-1.txt"
tc.transcode_file("C:/temp/infer-encoding-test/ISO-8859-1.txt", "C:/temp/infer-encoding-test/output-utf8.txt")
tc.detect_encoding "C:/temp/infer-encoding-test/output-utf8.txt"
The file ISO-8859-1.txt is encoded like it sounds. I used Notepad++ to write the file using that encoding.
I used my Java application to test the file. It concurs that it is in ISO-8859-1 format.
I also created a file in Notepad++ and saved it as UTF-8. I then verified using my java app that it was in UTF-8.
After I perform the above in irb, I used my java app to test the output file and it says the format is still ISO-8859-1.
What am I doing wrong? If you hard-code the method detect_encoding to return "ISO-8859-1", you do not need my java application to replicate the part that reads the file.
Any solution must NOT use charlock-holmes.

Problems with Ruby encoding in Windows

I wrote a simple code that reads an email from MS-Outlook, using 'win32ole', and then save its subjects to an CSV file. Everything goes well except the encoding system. When I open my CSV file the words such as "André" are printed as "Andr\x82". I want my output format to be equal to my input.
# encoding: 'CP850'
require 'win32ole'
require 'CSV'
Encoding.default_external = 'CP850'
ol = WIN32OLE.new('Outlook.Application')
inbox = ol.GetNamespace("MAPI").GetDefaultFolder(6)
email_subjecs = []
inbox.Items.each do |m|
email_subjects << m.Subject
end
CSV.open('MyFile.csv',"w") do |csv|
csv << email_subjects
end
O.S: Windows 7 64bit
Encoding.default_external -> CP850
Languadge -> PT
ruby -v -> 1.9.2p290 (2011-07-09) [i386-mingw32]
It seems a simple problem related to external windows encoding and I tryied many solution posted here but I realy can't solve this.
1) Your file name is missing a closing quote.
2) The default open mode for CSV.open() is 'rb', so you can't possibly write to a file with the code you posted.
3) You didn't post the encoding of the text you are trying to write to the file.
4) You didn't post the encoding that you want the the data to be written in.
5)
When I open my CSV file the words such as "é" are printed as "\x82"
Tell your viewing device not to do that.
The magic comment only sets the encoding the current (.rb) file should be read as. It does not set default_external. Try set RUBYOPT=-E utf-8, open your file with CSV.open('MyFile.csv', encoding: 'UTF-8'), or set Encoding.default_external at the top of your file (discouraged).

Automatically open a file as binary with Ruby

I'm using Ruby 1.9 to open several files and copy them into an archive. Now there are some binary files, but some are not. Since Ruby 1.9 does not open binary files automatically as binaries, is there a way to open them automatically anyway? (So ".class" would be binary, ".txt" not)
Actually, the previous answer by Alex D is incomplete. While it's true that there is no "text" mode in Unix file systems, Ruby does make a difference between opening files in binary and non-binary mode:
s = File.open('/tmp/test.jpg', 'r') { |io| io.read }
s.encoding
=> #<Encoding:UTF-8>
is different from (note the "rb")
s = File.open('/tmp/test.jpg', 'rb') { |io| io.read }
s.encoding
=> #<Encoding:ASCII-8BIT>
The latter, as the docs say, set the external encoding to ASCII-8BIT which tells Ruby to not attempt to interpret the result at UTF-8. You can achieve the same thing by setting the encoding explicitly with s.force_encoding('ASCII-8BIT'). This is key if you want to read binary into a string and move them around (e.g. saving them to a database, etc.).
Since Ruby 1.9.1 there is a separate method for binary reading (IO.binread) and since 1.9.3 there is one for writing (IO.binwrite) as well:
For reading:
content = IO.binread(file)
For writing:
IO.binwrite(file, content)
Since IO is the parent class of File, you could also do the following which is probably more expressive:
content = File.binread(file)
File.binwrite(file, content)
On Unix-like platforms, there is no difference between opening files in "binary" and "text" modes. On Windows, "text" mode converts line breaks to DOS style, and "binary" mode does not.
Unless you need linebreak conversion on Windows platforms, just open all the files in "binary" mode. There is no harm in reading a text file in "binary" mode.
If you really want to distinguish, you will have to match File.extname(filename) against a list of known extensions like ".txt" and ".class".

Does Ruby auto-detect a file's codepage?

If a save a text file with the following character б U+0431, but save it as an ANSI code page file.
Ruby returns ord = 63. Saving the file with UTF-8 as the codepage returns ord = 208, 177
Should I be specifically telling Ruby to handle the input encoded with a certain code page? If so, how do you do this?
Is that in ruby source code or in a file which is read with File.open? If it's in the ruby source code, you can (in ruby 1.9) add this to the top of the file:
# encoding: utf-8
Or you could specify most other encodings (like iso-8859-1).
If you are reading a file with File.open, you could do something like this:
File.open("file.txt", "r:utf-8") {|f| ... }
As with the encoding comment, you can pass in different types of encodings here too.

Resources