Decode base64 string and write to file - ruby

I'm trying to read a file that contains a base64-encoded string and write the decoded output to another file. My Input.txt contains a base64 string, something like:
PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz48cmV2aWV3LWNhc2UgY3JlYXRl\r\nZGF0ZT0iMTMvTWFyLzIwMTQgMDk6MDQ6NTEiIHN5c3RlbT0iVHJhZmlndXJhX1RlbXBsYXRlX01h\r\nbmFnZW1lbnRfdjUuMSIgYmF0Y2hpZD0iMCIgdHJhbnNhY3Rpb25ubz0iMSIgYmF0Y2huYW1lPSJH\r\nVUlEKGY1NWRmYjgwODQ4ZDQ3YzliZmVhYTg3YzMyZDQyNDQyKS1HTE9CQUxfSU5WT0lDRS1FTkdM\r\nSVNIIiB2ZXJzaW9uPSI1LjEuMi44ICBidWlsZCA1MjUzOSI+PHRyYW5zYWN0aW9uPjxvYmplY3Rz\r\nPjxvYmplY3QgY2xhc3M9IlRoXzE5NTQwMDk3OTRfNl9tb2RlbCIgbmFtZT0ibW9kZWwiPjxwcm9w\r\nZXJ0eSBuYW1lPSJUaXRsZSIgdmFsdWU9IlByb3Zpc2lvbmFsIEludm9pY2UiLz48cHJvcGVydHkg\r\nbmFtZT0iR3JvdXBDb21wYW55Ij48b2JqZWN0IGNsYXNzPSJUaF8xOTU0MDA5Nzk0XzZfR3JvdXBD\r\nb21wYW55IiBuYW1lPSJHcm91cENvbXBhbnkiPjxwcm9wZXJ0eSBuYW1lPSJOYW1lIiB2YWx1ZT0i\r\nVHJhZmlndXJhIEJlaGVlciBCLlYuIEFNU1RFUkRBTSwgQlJBTkNIIE9GRklDRSBMVUNFUk5FIi8+\r\nPHByb3BlcnR5IG5hbWU9IkFkZHJlc3MiIHZhbHVlPSJaPz9yaWNoc3RyYXNzZSAzMSIgaW5kZXg9\r\nIjAiLz48cHJvcGVydHkgbmFtZT0iQWRkcmVzcyIgdmFsdWU9Ikx1Y2VybmUiIGluZGV4PSIxIi8+\r\nPHByb3BlcnR5IG5hbWU9IkFkZHJlc3MiIHZhbHVlPSI2MDAyIiBpbmRleD0iMiIvPjxwcm9wZXJ0\r\neSBuYW1lPSJBZGRyZXNzIiB2YWx1ZT0iU3dpdHplcmxhbmQiIGluZGV4PSIzIi8+PHByb3BlcnR5\r\nIG5hbWU9IlBob25lTnVtYmVyIiB2YWx1
This string is created on the server side with the Java Apache codec.binary.Base64 library. It is captured with Fiddler when two different web services communicate with each other. Sometimes I have no access to the other web service, which is why I sniff the messages between the services. I also use Ruby to automate some routine tasks, so I decided to use Ruby this time as well. To decode the captured base64 string I use the following snippet:
require "base64"
content = File.read('Input.txt')
decode_base64_content = Base64.decode64(content)
File.open("Output.txt", "wb") do |f|
  f.write(decode_base64_content)
end
But the output looks malformed, like <?xml version="1.0" encoding="UTF-8"?><review-case create®vFFSТ#2фЦ"у#B“ЈCЈS"7—7FVУТ%G&f–wW&хFVЧЖFUфЦзnagement_v5.1" ba and so on. Can you please advise what I'm doing wrong? I use Ruby 1.9.3 on Windows 7 and Ubuntu 12.04.

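I do not know how this happened, but the line endings \r\n in your string seem to be stored as four literal characters (backslash, r, backslash, n), not as the two-character CRLF sequence. Base64.decode64 skips the backslashes, but r and n are valid base64 characters, so they are decoded as data and shift everything that follows. If I copy your file into a Ruby string with single quotes: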
unescaped='PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz48cmV2aWV3LWNhc2UgY3JlYXRl\r\nZGF0ZT0iMTMvTWFyLzIwMTQgMDk6MDQ6NTEiIHN5c3RlbT0iVHJhZmlndXJhX1RlbXBsYXRlX01h\r\nbmFnZW1lbnRfdjUuMSIgYmF0Y2hpZD0iMCIgdHJhbnNhY3Rpb25ubz0iMSIgYmF0Y2huYW1lPSJH'
Base64.decode64(unescaped)
#=> garbled text for every second line
If I do the same with double quotes (which interpret the escape sequences):
escaped="PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz48cmV2aWV3LWNhc2UgY3JlYXRl\r\nZGF0ZT0iMTMvTWFyLzIwMTQgMDk6MDQ6NTEiIHN5c3RlbT0iVHJhZmlndXJhX1RlbXBsYXRlX01h\r\nbmFnZW1lbnRfdjUuMSIgYmF0Y2hpZD0iMCIgdHJhbnNhY3Rpb25ubz0iMSIgYmF0Y2huYW1lPSJH"
Base64.decode64(escaped)
#=> all is well that ends well
So the problem seems to occur when the file is written. It can be repaired in Ruby, though:
unescaped='PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz48cmV2aWV3LWNhc2UgY3JlYXRl\r\nZGF0ZT0iMTMvTWFyLzIwMTQgMDk6MDQ6NTEiIHN5c3RlbT0iVHJhZmlndXJhX1RlbXBsYXRlX01h\r\nbmFnZW1lbnRfdjUuMSIgYmF0Y2hpZD0iMCIgdHJhbnNhY3Rpb25ubz0iMSIgYmF0Y2huYW1lPSJH'
Base64.decode64(unescaped)
escaped=unescaped.gsub('\\r', "\r").gsub('\\n', "\n")
Base64.decode64(escaped)
#=> now you should be fine again
But of course the proper solution would be to store the file correctly in the first place.
Given your current file, the following should work:
require "base64"
content = File.read('Input.txt')
content.gsub!('\\r', "\r")
content.gsub!('\\n', "\n")
decode_base64_content = Base64.decode64(content)
File.open("Output.txt", "wb") do |f|
  f.write(decode_base64_content)
end
Please do post some output if it does not.
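For illustration, here is a self-contained sketch of why the literal sequences break the decode and how unescaping repairs it. The sample XML and the helper name unescape_and_decode are made up for this example; Base64.encode64 and Base64.decode64 are the standard library calls. Each literal \r\n contributes two stray base64 characters (r and n), so the 4-character alignment is lost after one line break and restored after the next, which would explain why every second line comes out garbled.

```ruby
require "base64"

# Hypothetical helper: turn the literal two-character sequences \r and \n
# back into real CR and LF, then decode.
def unescape_and_decode(text)
  Base64.decode64(text.gsub('\r', "\r").gsub('\n', "\n"))
end

# Made-up sample payload, long enough to span several base64 lines.
original = '<review-case createdate="13/Mar/2014"/>' * 4

# Base64.encode64 ends every 60-character line with a real newline.
# Swapping it for the four literal characters \r\n imitates the captured file.
captured = Base64.encode64(original).gsub("\n") { '\r\n' }

naive    = Base64.decode64(captured)     # stray "r" and "n" are decoded as data
repaired = unescape_and_decode(captured)

puts naive == original     # false
puts repaired == original  # true
```

The block form of gsub is used when building the sample so that the replacement is inserted verbatim, without any backslash interpretation.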

Related

Ruby. NUL chars after reading simple file

I'm reading simple text files with Ruby for further regex processing, and suddenly I see a NUL after each printable character in the string. I'm totally lost as to where it comes from. I tested by typing simple text in Notepad and saving it as a txt file, and I still get them. I'm on a Windows machine and didn't have this problem before.
How can I process these characters, perhaps replace them? I'm not sure how to refer to them.
My regex doesn't work with them. I tried several ways, running the script from SciTE.
E.g. use is shown as uNULsNULeNUL and is not equal to use.
puts File.read(file_name)
puts '____________________'
File.open(file_name, "r") do |f|
  f.each_line do |line|
    puts 'Line.....' + line
  end
end
[Screenshot: content of the file and the resulting output]
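This file is probably in UTF-16 format. You'll need to read it that way:
File.open(file_name, "r:UTF-16LE:UTF-8") do |f|
  # The :UTF-8 part transcodes while reading; a bare "r:UTF-16LE" is
  # rejected with "ASCII incompatible encoding needs binmode".
  # ...
end
Notepad writes this format when you save with its "Unicode" encoding option, so it is common on Windows.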
You can always fix this by re-saving the file as UTF-8.
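To see where the NUL bytes come from, here is a small sketch. The scratch file name is hypothetical; it only exists so the example can run on its own. In UTF-16LE every ASCII character is stored as its byte followed by a zero byte, which is what shows up as NUL when the file is read without the right encoding.

```ruby
require "tmpdir"

# Hypothetical scratch file holding the word "use" encoded as UTF-16LE.
path = File.join(Dir.tmpdir, "utf16le_sample_#{Process.pid}.txt")
File.binwrite(path, "use".encode("UTF-16LE"))

raw  = File.binread(path)                          # bytes exactly as stored
text = File.read(path, mode: "rb:UTF-16LE:UTF-8")  # decoded and transcoded to UTF-8

p raw                # "u\x00s\x00e\x00"
p text               # "use"
puts text == "use"   # true

File.delete(path)
```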

Encoding::UndefinedConversionError when using open-uri

When I do this:
require 'open-uri'
response = open('some-html-page-url-here')
response.read
For a certain URL I get the following error (due to a wrong encoding in the returned page?):
Encoding::UndefinedConversionError: U+00A0 from UTF-8 to US-ASCII
Any way around this to still get the html content?
In the introduction to the open-uri module, the docs say this,
It is possible to open an http, https or ftp URL as though it were a file
And if you know anything about reading files, then you know that you need the encoding of the file you are trying to read, so that you can tell Ruby how to read it (i.e. how many bytes each character occupies).
In the first code example in the docs, there is this:
open("http://www.ruby-lang.org/en") {|f|
  f.each_line {|line| p line}
  p f.base_uri         # <URI::HTTP:0x40e6ef2 URL:http://www.ruby-lang.org/en/>
  p f.content_type     # "text/html"
  p f.charset          # "iso-8859-1"
  p f.content_encoding # []
  p f.last_modified    # Thu Dec 05 02:45:02 UTC 2002
}
So if you don't know the encoding of the "file" you are trying to read, you can get it with f.charset. If that encoding differs from your default external encoding, you will most likely get an error. Your default external encoding is the encoding Ruby uses to read from external sources. You can check what it is set to like this:
The default external Encoding is pulled from your environment... Have a look:
$ echo $LC_CTYPE
en_US.UTF-8
or
$ ruby -e 'puts Encoding.default_external.name'
UTF-8
http://graysoftinc.com/character-encodings/ruby-19s-three-default-encodings
On Mac OS X, I actually have to do the following to see the default external encoding:
$ echo $LANG
You can set your default external encoding with the method Encoding.default_external=(), so you might want to try something like this:
open('some_url_here') do |f|
  Encoding.default_external = f.charset
  html = f.read
end
Setting an IO object to binmode, like you have done, tells ruby that the encoding of the file is BINARY (or ruby's confusing synonym ASCII-8BIT), which means you are telling ruby that each character in the file takes up one byte. In your case, you are telling ruby to read the character U+00A0, whose UTF-8 representation takes up two bytes 0xC2 0xA0, as two characters instead of just one character, so you have eliminated your error, but you have produced two junk characters instead of the original character.
Doing a response.binmode before the response.read stops the error from happening.
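The byte-level effect described above can be checked without any network access. This is only a sketch on a one-character string, not on a real open-uri response; the variable names are made up.

```ruby
nbsp = "\u00A0"   # NO-BREAK SPACE, the character named in the error message

p nbsp.length                                  # 1
p nbsp.bytes.map { |b| format("0x%02X", b) }   # ["0xC2", "0xA0"]

# Relabeling the same bytes as BINARY raises no error,
# but the string now counts as two one-byte characters.
binary = nbsp.dup.force_encoding(Encoding::BINARY)
p binary.length                                # 2

# Converting to US-ASCII fails, because ASCII has no such character.
begin
  nbsp.encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  error_message = e.message
end
puts error_message   # U+00A0 from UTF-8 to US-ASCII
```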
Had the same issue, will add my solution here:
After reading the open-uri documentation further, it turns out you could set the encoding of the io before reading using the set_encoding method, like this:
result = open('some-page-uri') do |io|
  io.set_encoding(Encoding.default_external)
  io.read
end
Hope it helps!

Reading contents from UTF-16 encoded file in Ruby

I want to read the contents of a file and save it into a variable. Normally I would do something like:
text = File.read(filepath)
Unfortunately there's a file I'm working with that is encoded in UTF-16LE. I've been doing some research and it looks like I need to use File.open instead and define the encoding. I read a suggestion somewhere that said to open the file and read in the data line by line:
text = File.open(filepath,"rb:UTF-16LE") { |file| file.lines }
However if I run:
puts text
I get:
#<Enumerator:0x23f76a8>
How can I read in the content of the UTF-16LE file into a variable?
Note: I am using Ruby 1.9.3 and a Windows OS
The lines method is deprecated. If you expect text to be an array with lines, then use readlines.
text = File.open(filepath,"rb:UTF-16LE"){ |file| file.readlines }
As the Tin Man says, it's better practice to process each line separately, if possible:
File.open("test.csv", "rb:UTF-16LE") do |file|
  file.each do |line|
    p line
  end
end
First, don't make it a practice to read a file directly into a variable unless you absolutely have to. That's called "slurping", and is not scalable. Instead, read it line by line.
Ruby's IO class, which File inherits from, supports a parameter they call open_args, which is a hash, on the majority of "read" type calls. For example, here are some method signatures:
read(name, [length [, offset]], open_args)
readlines(name, sep=$/ [, open_args])
The documentation says this about open_args:
If the last argument is a hash, it specifies option for internal open(). The key would be the following. open_args: is exclusive to others.

encoding:
  string or encoding
  specifies encoding of the read string. encoding will be ignored if length is specified.
mode:
  string
  specifies mode argument for open(). It should start with "r" otherwise it will cause an error.
open_args:
  array of strings
  specifies arguments for open() as an array.
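As a rough sketch of how those hash keys can be used, the following passes mode: and encoding: to File.read and File.readlines. The scratch file and the opts variable are assumptions made for this example; the "UTF-16LE:UTF-8" value asks Ruby to decode UTF-16LE and hand back UTF-8 strings.

```ruby
require "tmpdir"

# Hypothetical scratch file with two lines of UTF-16LE text.
path = File.join(Dir.tmpdir, "open_args_sample_#{Process.pid}.txt")
File.binwrite(path, "alpha\nbeta\n".encode("UTF-16LE"))

# The trailing hash is forwarded to the internal open().
opts = { mode: "rb", encoding: "UTF-16LE:UTF-8" }

text  = File.read(path, **opts)       # whole file as one string
lines = File.readlines(path, **opts)  # array with one string per line

p text    # "alpha\nbeta\n"
p lines   # ["alpha\n", "beta\n"]

File.delete(path)
```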

CSV writes Ñ into its code form instead of its actual form

I have a CSV file. I checked its encoding using this:
File.open('C:\path\to\file\myfile.txt').read.encoding
and it returned the encoding as:
=> #<Encoding:IBM437>
I'm reading this CSV per row -- stripping spaces and doing other stuff. After "cleansing" it, I push it to a new file. I'm doing it like this:
CSV.foreach(file_read, encoding: "IBM437:UTF-8") do |r|
  # some code
  CSV.open(file_appended, "a", col_sep: "|") do |csv|
    csv << r
  end
end
Now my problem is, inside the CSV I'm reading, there's a word with an accented character -- Ñ to be exact. This character is being appended to the new file as
\u2564
It's a problem considering that the accented character is a vital part of that word, and I want that character to appear in the new file as-is.
Am I missing something? I tried the following source:destination encodings, but to no avail:
ISO-8859-1:UTF8 (and vice versa)
ISO-8859-1:Windows-1252 (and vice versa)
Here is my ruby version, just if you'd need to know:
ruby 1.9.3p392 (2013-02-22) [i386-mingw32]
Thanks in advance!
The line below solved my problem:
Encoding.default_external = "iso-8859-1"
It makes ISO-8859-1 the default encoding Ruby assumes for the files it reads, so the Ñ character is interpreted correctly.
Credit goes to Darshan Computing's answer here. Just look for Update #2.
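A possible explanation for the \u2564, sketched under the assumption that the file really stores Ñ as the single byte 0xD1 (as ISO-8859-1 and Windows-1252 do): the same byte means a box-drawing character in IBM437, so declaring the source as IBM437 converts it to U+2564 instead of Ñ. The variable names below are made up.

```ruby
byte = "\xD1".b   # one raw byte, 0xD1 (assumed to be what the file contains)

# Interpret the same byte under each encoding, then convert to UTF-8.
as_ibm437 = byte.dup.force_encoding("IBM437").encode("UTF-8")
as_latin1 = byte.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts as_ibm437 == "\u2564"   # true, the box-drawing character from the question
puts as_latin1 == "\u00D1"   # true, this is Ñ
```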

Is it possible to specify newline type while reading a file in ruby

I frequently deal with UTF-16LE files created on Windows, which have \r\n line endings. There is no problem converting the file to UTF-8 by using:
File.new(filepath, 'r:utf-16le:utf-8')
But this of course does not get rid of the \r. The way I currently get rid of them is with
str.gsub("\r", "")
But it would be nice to take care of it while reading the file in. String#encode has :cr_newline, :crlf_newline, and :universal_newline options which convert all newlines to a desired kind of newline. Is there a way to apply these or similar options while reading in a file?
The method IO#gets takes an optional argument that allows you to pass a string to define how to separate the lines:
file = File.new(filepath, 'r:utf-16le:utf-8')
while (line = file.gets("\r\n"))
  ...
end
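Note that gets returns each line with the separator still attached, so a chomp is still needed to drop the \r\n. A minimal sketch, using a hypothetical scratch file and opening it in binary mode ("rb") so that no platform newline conversion happens before gets sees the separator:

```ruby
require "tmpdir"

# Hypothetical scratch file with CRLF line endings, encoded as UTF-16LE.
path = File.join(Dir.tmpdir, "crlf_sample_#{Process.pid}.txt")
File.binwrite(path, "first\r\nsecond\r\n".encode("UTF-16LE"))

lines = []
File.open(path, "rb:UTF-16LE:UTF-8") do |file|
  while (line = file.gets("\r\n"))
    lines << line.chomp("\r\n")   # strip the separator that gets leaves in place
  end
end

p lines   # ["first", "second"]

File.delete(path)
```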
