Reading contents from UTF-16 encoded file in Ruby - ruby

I want to read the contents of a file and save it into a variable. Normally I would do something like:
text = File.read(filepath)
Unfortunately there's a file I'm working with that is encoded with UTF-16LE. I've been doing some research and it looks like I need to use File.open instead and define the encoding. I read a suggestion somewhere that said to open the file and read in the data line by line:
text = File.open(filepath,"rb:UTF-16LE") { |file| file.lines }
However if I run:
puts text
I get:
#<Enumerator:0x23f76a8>
How can I read in the content of the UTF-16LE file into a variable?
Note: I am using Ruby 1.9.3 and a Windows OS

The lines method returns an Enumerator, which is what you are seeing, and it is deprecated. If you expect text to be an array of lines, use readlines:
text = File.open(filepath,"rb:UTF-16LE"){ |file| file.readlines }
As the Tin Man says, it's better practice to process each line separately, if possible:
File.open("test.csv", "rb:UTF-16LE") do |file|
  file.each do |line|
    p line
  end
end

First, don't make it a practice to read a file directly into a variable unless you absolutely have to. That's called "slurping", and is not scalable. Instead, read it line by line.
Ruby's IO class, which File inherits from, supports a parameter they call open_args, which is a hash, on the majority of "read" type calls. For example, here are some method signatures:
read(name, [length [, offset]], open_args)
readlines(name, sep=$/ [, open_args])
The documentation says this about open_args:
If the last argument is a hash, it specifies option for internal open(). The
key would be the following. open_args: is exclusive to others.
encoding:
string or encoding
specifies encoding of the read string. encoding will be ignored if length
is specified.
mode:
string
specifies mode argument for open(). It should start with "r" otherwise it
will cause an error.
open_args:
array of strings
specifies arguments for open() as an array.
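As a minimal sketch of how these options could be applied to the UTF-16LE case from the question: the trailing hash is handed to the internal open(), so mode and encoding can be passed straight to File.read or File.foreach. The temporary sample file and the variable names are assumptions made only to keep the example self-contained. The "UTF-16LE:UTF-8" form asks Ruby to transcode to UTF-8 while reading.

```ruby
require "tempfile"

# Hypothetical sample file, written here only so the sketch can run on its own.
sample = Tempfile.new(["utf16le_sample", ".txt"])
sample.binmode
sample.write("first line\nsecond line\n".encode("UTF-16LE"))
sample.close

# Slurp the whole file; the trailing hash is passed through as open_args.
text = File.read(sample.path, mode: "rb", encoding: "UTF-16LE:UTF-8")

# Line-by-line alternative that avoids holding the whole file in memory.
lines = []
File.foreach(sample.path, mode: "rb", encoding: "UTF-16LE:UTF-8") do |line|
  lines << line.chomp
end
```

Both calls return UTF-8 strings, so later string and regex operations behave as they would for an ordinary text file.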

Related

How to make my script use a CSV file that was given in the terminal as a parameter

I tried to google this, but can't really find good search terms to get to my solution. So maybe someone here can help me out.
I have a script (let's call it script.rb) that uses File.read to read a CSV file called somefile.csv, and I have another CSV file called somefileV2.csv.
Script.rb
csv_text = File.read('/home/XXX/XXX/XXX/somefile.csv')
Right now it uses somefile.csv as the default, but I would like to know if it is possible to make my script use a CSV file that was given in the terminal as a parameter, like:
Terminal
home$ script.rb somefileV2
So instead of reading the file that is named in the script, it reads the other CSV file (somefileV2.csv) that is in the directory. It is kind of annoying to change the file manually every time in the script itself.
You can access the parameters (arguments) using the ARGV array.
So your program could be like:
default = "/home/XXX/XXX/XXX/somefile.csv"
csv_text = File.read(ARGV[0] || default)
which gives you the possibility to supply a filename or, if not supplied, use the default value.
ARGV[0] refers to the first, ARGV[1] to the second argument and so on.
ruby myscript.rb foo bar baz would result in ARGV being ["foo", "bar", "baz"]. Note that the elements will always be strings. So if you want anything else (Numbers, Date, ...) you need to process it accordingly in your program.
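A rough sketch of that kind of processing, assuming a hypothetical script that takes a file path, an optional row limit and an optional start date. The parse_args helper, the option names and the argument order are illustrative assumptions, not part of the original script.

```ruby
require "date"

# Default path from the question, elided exactly as in the original.
DEFAULT_CSV = "/home/XXX/XXX/XXX/somefile.csv"

# Hypothetical helper: turns the raw string arguments into typed values.
def parse_args(argv)
  {
    path:  argv[0] || DEFAULT_CSV,                 # stays a String
    limit: argv[1] ? Integer(argv[1]) : nil,       # raises ArgumentError on non-numeric input
    since: argv[2] ? Date.parse(argv[2]) : nil     # raises on an unparseable date
  }
end

# In a real script this would be called as: options = parse_args(ARGV)
example = parse_args(["somefileV2.csv", "3", "2014-03-13"])
fallback = parse_args([])
```

Using Integer() instead of String#to_i is a deliberate choice here: to_i silently turns "abc" into 0, whereas Integer() fails loudly on bad input.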

Ruby. NUL chars after reading simple file

I'm reading simple text files in Ruby for further regex processing, and suddenly I see a NUL after each printable character. I'm totally lost as to where it comes from. I tested by typing simple text in Notepad and saving it as a txt file, and I still get them. I'm on a Windows machine and didn't have this before.
How can I process them, probably replace them? I'm not sure how to refer to them.
My regexes don't work with them. I tried several ways, using SciTE to run the script.
E.g. use is presented as uNULsNULeNUL and is not equal to use.
puts File.read(file_name)
puts '____________________'
File.open(file_name, "r") do |f|
  f.each_line do |line|
    puts 'Line.....' + line
  end
end
This file is probably in UTF-16 format. You'll need to read it in that way:
File.open(file_name, "r:UTF-16LE") do |f|
  # ...
end
Windows tools often write this format; Notepad, for example, uses it when you save with the "Unicode" encoding.
You can also fix this by re-saving the file as UTF-8.
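To illustrate where the NUL characters come from, here is a small sketch that writes the word from the question as UTF-16LE and reads it back twice: once as raw bytes and once with the encoding declared. The temporary file and variable names are assumptions for the demonstration only.

```ruby
require "tempfile"

# Hypothetical sample file containing "use" encoded as UTF-16LE.
sample = Tempfile.new(["nul_demo", ".txt"])
sample.binmode
sample.write("use\n".encode("UTF-16LE"))
sample.close

# Raw bytes: every ASCII character is followed by a zero byte.
raw = File.binread(sample.path)
nul_count = raw.count("\0")

# Declaring the encoding and transcoding to UTF-8 removes the problem.
decoded = File.read(sample.path, mode: "rb", encoding: "UTF-16LE:UTF-8")
regex_hit = decoded =~ /use/
```

In UTF-16LE each of these characters occupies two bytes, and the second byte is zero for plain ASCII text, which is why a byte-oriented view shows uNULsNULeNUL.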

Decode base64 string and write to file

I'm trying to read a file which contains a base64-encoded string and write the decoded output into another file. My Input.txt contains a base64 string, something like:
PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz48cmV2aWV3LWNhc2UgY3JlYXRl\r\nZGF0ZT0iMTMvTWFyLzIwMTQgMDk6MDQ6NTEiIHN5c3RlbT0iVHJhZmlndXJhX1RlbXBsYXRlX01h\r\nbmFnZW1lbnRfdjUuMSIgYmF0Y2hpZD0iMCIgdHJhbnNhY3Rpb25ubz0iMSIgYmF0Y2huYW1lPSJH\r\nVUlEKGY1NWRmYjgwODQ4ZDQ3YzliZmVhYTg3YzMyZDQyNDQyKS1HTE9CQUxfSU5WT0lDRS1FTkdM\r\nSVNIIiB2ZXJzaW9uPSI1LjEuMi44ICBidWlsZCA1MjUzOSI+PHRyYW5zYWN0aW9uPjxvYmplY3Rz\r\nPjxvYmplY3QgY2xhc3M9IlRoXzE5NTQwMDk3OTRfNl9tb2RlbCIgbmFtZT0ibW9kZWwiPjxwcm9w\r\nZXJ0eSBuYW1lPSJUaXRsZSIgdmFsdWU9IlByb3Zpc2lvbmFsIEludm9pY2UiLz48cHJvcGVydHkg\r\nbmFtZT0iR3JvdXBDb21wYW55Ij48b2JqZWN0IGNsYXNzPSJUaF8xOTU0MDA5Nzk0XzZfR3JvdXBD\r\nb21wYW55IiBuYW1lPSJHcm91cENvbXBhbnkiPjxwcm9wZXJ0eSBuYW1lPSJOYW1lIiB2YWx1ZT0i\r\nVHJhZmlndXJhIEJlaGVlciBCLlYuIEFNU1RFUkRBTSwgQlJBTkNIIE9GRklDRSBMVUNFUk5FIi8+\r\nPHByb3BlcnR5IG5hbWU9IkFkZHJlc3MiIHZhbHVlPSJaPz9yaWNoc3RyYXNzZSAzMSIgaW5kZXg9\r\nIjAiLz48cHJvcGVydHkgbmFtZT0iQWRkcmVzcyIgdmFsdWU9Ikx1Y2VybmUiIGluZGV4PSIxIi8+\r\nPHByb3BlcnR5IG5hbWU9IkFkZHJlc3MiIHZhbHVlPSI2MDAyIiBpbmRleD0iMiIvPjxwcm9wZXJ0\r\neSBuYW1lPSJBZGRyZXNzIiB2YWx1ZT0iU3dpdHplcmxhbmQiIGluZGV4PSIzIi8+PHByb3BlcnR5\r\nIG5hbWU9IlBob25lTnVtYmVyIiB2YWx1
This string is created on the server side with the Java Apache codec.binary.Base64 library. It was captured with Fiddler while two different web services communicated with each other. Sometimes I have no access to the other web service, which is why I sniff the messages between the services. In addition, I use Ruby to automate some routine tasks, so I decided to use Ruby again this time. To decode the captured base64 string I use the following snippet:
require "base64"
content = File.read('Input.txt')
decode_base64_content = Base64.decode64(content)
File.open("Output.txt", "wb") do |f|
  f.write(decode_base64_content)
end
But output looks malformed, like <?xml version="1.0" encoding="UTF-8"?><review-case create®vFFSТ#2фЦ"у#B“ЈCЈS"7—7FVУТ%G&f–wW&хFVЧЖFUфЦзnagement_v5.1" ba and so on. Can you please advise on what I'm doing wrong? I use Ruby 1.9.3 on Windows 7 and Ubuntu 12.04.
I do not know how you managed to do this, but the line endings \r\n in your string seem to be stored as literal 4-character sequences (backslash, r, backslash, n), not as the 2-byte CRLF they stand for. If I copy your file into a Ruby string with single quotes:
unescaped='PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz48cmV2aWV3LWNhc2UgY3JlYXRl\r\nZGF0ZT0iMTMvTWFyLzIwMTQgMDk6MDQ6NTEiIHN5c3RlbT0iVHJhZmlndXJhX1RlbXBsYXRlX01h\r\nbmFnZW1lbnRfdjUuMSIgYmF0Y2hpZD0iMCIgdHJhbnNhY3Rpb25ubz0iMSIgYmF0Y2huYW1lPSJH'
Base64.decode64(unescaped)
#=> garbled text for every second line
If I do the same with double quotes (which interpret the escape sequences):
escaped="PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz48cmV2aWV3LWNhc2UgY3JlYXRl\r\nZGF0ZT0iMTMvTWFyLzIwMTQgMDk6MDQ6NTEiIHN5c3RlbT0iVHJhZmlndXJhX1RlbXBsYXRlX01h\r\nbmFnZW1lbnRfdjUuMSIgYmF0Y2hpZD0iMCIgdHJhbnNhY3Rpb25ubz0iMSIgYmF0Y2huYW1lPSJH"
Base64.decode64(escaped)
#=> all is well that ends well
Therefore the problem seems to occur when you write the file. It can be amended in Ruby though:
unescaped='PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz48cmV2aWV3LWNhc2UgY3JlYXRl\r\nZGF0ZT0iMTMvTWFyLzIwMTQgMDk6MDQ6NTEiIHN5c3RlbT0iVHJhZmlndXJhX1RlbXBsYXRlX01h\r\nbmFnZW1lbnRfdjUuMSIgYmF0Y2hpZD0iMCIgdHJhbnNhY3Rpb25ubz0iMSIgYmF0Y2huYW1lPSJH'
Base64.decode64(unescaped)
escaped=unescaped.gsub('\\r', "\r").gsub('\\n', "\n")
Base64.decode64(escaped)
#=> now you should be fine again
but of course the correct solution would be to store the file correctly.
Given your current file the following should work:
require "base64"
content = File.read('Input.txt')
content.gsub!('\\r', "\r")
content.gsub!('\\n', "\n")
decode_base64_content = Base64.decode64(content)
File.open("Output.txt", "wb") do |f|
  f.write(decode_base64_content)
end
Please do post some output if it does not.
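The effect can be reproduced on a small scale. The sketch below uses a short made-up XML snippet (an assumption, not the captured message) and wraps its base64 form once with real CRLF bytes and once with the literal four characters. Base64.decode64 skips characters outside the base64 alphabet, but the letters r and n are themselves valid base64 characters, so they end up in the decoded stream and shift everything after them.

```ruby
require "base64"

# Hypothetical stand-in for the captured payload.
original = '<?xml version="1.0" encoding="UTF-8"?><review-case/>'

chunks = Base64.strict_encode64(original).scan(/.{1,16}/)

with_crlf    = chunks.join("\r\n")  # real CR and LF bytes
with_literal = chunks.join('\r\n')  # backslash, r, backslash, n

ok       = Base64.decode64(with_crlf)     # CR and LF are ignored by the decoder
broken   = Base64.decode64(with_literal)  # the stray "r" and "n" corrupt the data

# Same repair as in the answer: turn the literal sequences back into CRLF.
repaired = Base64.decode64(with_literal.gsub('\\r', "\r").gsub('\\n', "\n"))
```

This also matches the symptom in the question, where the output starts out readable and turns to garbage after the first wrapped line.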

Undefined method "each" in Ruby 2.0

The new Mac OS update moved the system Ruby up to 2.0, which is great, but now I'm seeing errors in a lot of my scripts that I don't know how to fix. Specifically, I had code that called for files using mdfind and then read them, like this:
files = %x{mdfind -onlyin /Users/Username/Dropbox/Tasks 'kMDItemContentModificationDate >= "$time.today(-1)"'}
files.each do |file|
Now I'm getting an error that says
undefined method `each' for #<String:0x007f83521865c8> (NoMethodError)
It seems as if each now needs a qualifier. I tried each_line but that yielded additional errors down the line. Is there a simple replacement for this that I'm overlooking?
Ruby 1.8 used to have String#each which was doing implicit splitting.
each(separator=$/) {|substr| block } => str
Splits str using the supplied parameter as the record separator ($/ by default), passing each substring in turn to the supplied block. If a zero-length record separator is supplied, the string is split into paragraphs delimited by multiple successive newlines.
Explicit splitting should work in modern rubies, I believe.
files.split($/).each do |file|
Here $/ is the input record separator, a newline by default. You can use an explicit character, since your script is not portable anyway.
files.split("\n").each do |file|
Update
Or you can just use each_line, the surviving alias of the now-removed String#each:
files.each_line do |file|
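One difference between the two approaches may explain the follow-up errors mentioned in the question: each_line yields every line with its trailing newline still attached, while split drops the separator. A minimal sketch, using a made-up string in place of the real mdfind output:

```ruby
# Hypothetical stand-in for the output of the mdfind command.
files = "/tmp/a.txt\n/tmp/b.txt\n"

via_split     = files.split("\n")               # separators removed
via_each_line = files.each_line.to_a            # separators kept
via_chomp     = files.each_line.map(&:chomp)    # separators stripped per line
```

A path that still ends in "\n" will typically fail when passed to File.read or File.open, so chomping each line is usually needed when switching to each_line.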

Is it possible to specify newline type while reading a file in ruby

I frequently deal with UTF-16LE files created on Windows, which have \r\n line endings. There is no problem converting the file to UTF-8 by using:
File.new(filepath, 'r:utf-16le:utf-8')
But this of course does not get rid of the \r. The way I currently get rid of them is with
str.gsub("\r", "")
But it would be nice to take care of it while reading the file in. String#encode has :cr_newline, :crlf_newline, and :universal_newline options which convert all newlines to a desired kind of newline. Is there a way to apply these or similar options while reading in a file?
The method IO#gets takes an optional argument that allows you to pass a string to define how to separate the lines:
file = File.new(filepath, 'r:utf-16le:utf-8')
while (line = file.gets("\r\n"))
  ...
end
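Note that gets returns each line with the separator still attached, so a chomp is still needed. The sketch below shows that, plus an alternative that applies the universal_newline option from String#encode to the whole text after reading. The temporary sample file and variable names are assumptions used only to make the example runnable.

```ruby
require "tempfile"

# Hypothetical UTF-16LE sample file with Windows line endings.
sample = Tempfile.new(["crlf_demo", ".txt"])
sample.binmode
sample.write("alpha\r\nbeta\r\n".encode("UTF-16LE"))
sample.close

# Option 1: read by "\r\n" and strip the separator from each line.
lines = []
File.open(sample.path, "rb:utf-16le:utf-8") do |file|
  while (line = file.gets("\r\n"))
    lines << line.chomp("\r\n")
  end
end

# Option 2: read everything, then normalize CRLF to LF in one pass.
whole      = File.read(sample.path, mode: "rb", encoding: "utf-16le:utf-8")
normalized = whole.encode("UTF-8", universal_newline: true)
```

Option 1 keeps memory use low for large files; option 2 is shorter when the file comfortably fits in memory.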
