ruby unicode escapes as command line arguments - ruby

It looks like this question has been asked by a python dev (Allowing input of Unicode escapes as command line arguments), which I think partially relates, but it doesn't fully give me a solution for my immediate problem in Ruby. I'm curious if there is a way to take escaped unicode sequences as command line arguments, assign to a variable, then have the escaped unicode be processed and displayed as normal unicode after the script runs. Basically, I want to be able to choose a unicode number, then have Ruby stick that in a filename and have the actual unicode character displayed.
Here are a few things I've noticed that cause problems:
unicode = ARGV[0] #command line argument is \u263a
puts unicode
puts unicode.inspect
=> u263a
=> "u263a"
The backslash needed for the string to be treated as a Unicode escape sequence gets stripped by the shell.
Then, if we try adding another "\" to escape it,
unicode = ARGV[0] #command line argument is \\u263a
puts unicode
puts unicode.inspect
=> \u263a
=> "\\u263a"
but it still won't be processed properly.
Here's some more relevant code where I'm actually trying to make this happen:
unicode = ARGV[0]
filetype = ARGV[1]
path = unicode + "." + filetype
File.new(path, "w")
It seems like this should be pretty simple, but I've searched and searched and cannot find a solution. I should add, I do know that supplying the hard-coded escaped unicode in a string works just fine, like File.new("\u263a.#{filetype}", "w"), but getting it from an argument/variable is what I'm having an issue with. I'm using Ruby 1.9.2.

To unescape the Unicode-escaped command line argument and create a new file with the user-supplied Unicode string in the filename, I used @mu is too short's method of using pack and unpack, like so:
filetype = ARGV[1]
unicode = ARGV[0].gsub(/\\u([\da-fA-F]{4})/) {|m| [$1].pack("H*").unpack("n*").pack("U*")}
path = unicode + "." + filetype
File.new(path, "w")
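A slightly shorter variant of the same idea, offered only as a sketch: convert the four hex digits to an integer and pack that single code point as UTF-8. The helper name unescape_unicode is made up for illustration, and like the regex above it only handles four-digit \uXXXX escapes.

```ruby
# Hypothetical helper: turn literal "\uXXXX" text (as received from ARGV)
# into the actual Unicode character.
def unescape_unicode(text)
  text.gsub(/\\u([\da-fA-F]{4})/) do
    # $1 holds the four hex digits; pack('U') encodes the code point as UTF-8.
    [$1.hex].pack('U')
  end
end

# Single quotes keep the backslash literal, which is what the script
# receives when the shell is given \\u263a.
puts unescape_unicode('\u263a') + '.txt'
```

Text that contains no \uXXXX sequence passes through unchanged, so the helper can be applied to every argument.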

Related

How to obtain basename in ruby from the given file path in unix or windows format?

I need to parse a basename in Ruby from a file path that I get as input. Unix format works fine on Linux.
File.basename("/tmp/text.txt")
returns "text.txt".
However, when I get input in windows format:
File.basename("C:\Users\john\note.txt")
or
File.basename("C:\\Users\\john\\note.txt")
"C:Usersjohn\note.txt" is the output (note that \n is a new line there), but I didn't get "note.txt".
Is there some nice solution in ruby/rails?
Solution:
"C:\\test\\note.txt".split(/\\|\//).last
=> "note.txt"
"/tmp/test/note.txt".split(/\\|\//).last
=> "note.txt"
If the Linux file name doesn't contain \, it will work.
Try pathname:
require 'pathname'
Pathname.new('C:\Users\john\note.txt').basename
# => #<Pathname:note.txt>
Pathname docs
Ref How to get filename without extension from file path in Ruby
I'm not convinced that you have a problem with your code. I think you have a problem with your test.
Ruby also uses the backslash character for escape sequences in strings, so when you type the String literal "C:\Users\john\note.txt", Ruby sees the first two backslashes as invalid escape sequences, and so ignores the escape character. \n refers to a newline. So, to Ruby, this literal is the same as "C:Usersjohn\note.txt". There aren't any file separators in that sequence, since \n is a newline, not a backslash followed by the letter n, so File.basename just returns it as it receives it.
If you ask for user input in either a graphical user interface (GUI) or command line interface (CLI), the user entering input needn't worry about Ruby String escape sequences; those only matter for String literals directly in the code. Try it! Type gets into IRB or Pry, and type or copy a file path, and press Enter, and see how Ruby displays it as a String literal.
On Windows, Ruby accepts paths given using both "/" (File::SEPARATOR) and "\\" (File::ALT_SEPARATOR), so you don't need to worry about conversion unless you are displaying it to the user.
Backslashes, while how Windows expresses things, are just a giant nuisance. Within a double-quoted string they have special meaning so you either need to do:
File.basename("C:\\Users\\john\\note.txt")
Or use single quotes that avoid the issue:
File.basename('C:\Users\john\note.txt')
Or use regular slashes which aren't impacted:
File.basename("C:/Users/john/note.txt")
Where Ruby does the mapping for you to the platform-specific path separator.
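Putting the answers above together, one possible sketch is to normalize backslashes to forward slashes before calling File.basename, so the same code handles both styles on any platform. The helper name portable_basename is hypothetical, and the approach assumes, as noted earlier, that Unix file names in the input never contain a backslash.

```ruby
# Hypothetical helper: basename for Unix- or Windows-style paths.
def portable_basename(path)
  # '\\' in single quotes is one literal backslash; replace each with '/'.
  File.basename(path.gsub('\\', '/'))
end

# Single-quoted literals stand in for real user input here, because they
# keep the backslashes intact.
puts portable_basename('C:\Users\john\note.txt')
puts portable_basename('/tmp/text.txt')
```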

Why does File.dirname return a period when I expect a path?

I am trying to get the directory of a file on a Windows box using File.dirname. I get the file ("file1" below) from the Windows box and return it to the Mac OS X box that the script is run on.
The literal assignment below is a simplification for this example; in practice I get this value from the remote box and return it to my development box:
file1 = "C:\Administrator\proj1\testFile.txt"
path = "#{File.dirname(file1)}"
puts "#{path}"
>> .
I am confused about why it would return '.'. I saw on ruby-doc.org that File.dirname says the following:
"Returns all components of the filename given in file_name except the last one. The filename can be formed using both File::SEPARATOR and File::ALT_SEPARATOR as the separator when File::ALT_SEPARATOR is not nil."
I did a puts on File::SEPARATOR and File::ALT_SEPARATOR and got the following:
File::SEPARATOR >> /
File::ALT_SEPARATOR >>
I assumed it was because "\" wasn't a valid file separator. So I set File::ALT_SEPARATOR to "\". However, even after that, I still got the same value when I puts path.
I tried using File.realdirpath and this was the result:
file1 = "C:\Administrator\proj1\testFile.txt"
path = "#{File.realdirpath(file1)}"
puts "#{path}"
>> /Users/me/myProject/C:\Administrator\proj1\testFile.txt
It seemed to add the path from where I called the Ruby script and appended the full path (including the file name). Seems to be odd behavior.
Any ideas, comments or suggestions would be great.
The problem is that when you declare file1, those backslashes define escape characters. Notice the return:
file1 = "C:\Administrator\proj1\testFile.txt"
=> "C:Administratorproj1\testFile.txt"
If you want to store a filepath in a string, you either need to use forward slashes or double backslashes (to escape the escape character):
file1 = "C:\\Administrator\\proj1\\testFile.txt"
file1 = "C:/Administrator/proj1/testFile.txt"
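To see the difference concretely, the following sketch compares the double-quoted literal from the question with a single-quoted one. It only inspects the strings; the variable names are illustrative.

```ruby
# In double quotes, \A and \p are not recognized escapes, so the backslash
# is dropped, and \t becomes a tab character.
double = "C:\Administrator\proj1\testFile.txt"

# In single quotes, the backslashes are kept as written.
single = 'C:\Administrator\proj1\testFile.txt'

p double                  # inspect output shows what Ruby actually stored
p single
p double.include?("\t")   # the \t in "\testFile" turned into a tab
p single.include?('\\')   # the single-quoted version still has backslashes
```

A value read from a file, a socket or ARGV behaves like the single-quoted string, since escape processing applies only to literals in source code.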
Okay, I was able to duplicate this problem as well.
As @fbonetti pointed out, you have to enclose your path in single quotes to keep Ruby from interpreting the backslashes as escapes, so start with that...
>> file1='C:\Administrator\proj1\testFile.txt'
=> "C:\\Administrator\\proj1\\testFile.txt"
Then, passing file1 through gsub to 'normalize' the slashes, gives you the results you're expecting.
>> File.dirname(file1.gsub('\\', '/'))
=> "C:/Administrator/proj1"
Of course, you could always reverse the gsub if you needed them to be backslashes again.
>> File.dirname(file1.gsub('\\', '/')).gsub('/', '\\')
=> "C:\\Administrator\\proj1"
I figured it out. It was an issue with the version of Ruby I was using. I was using ruby 1.9.3 and then I switched to jruby 1.7.3 and it works correctly now.
Ruby's IO documentation is of great help when dealing with different OS path separators. From the documentation:
Ruby will convert pathnames between different operating system conventions if possible. For instance, on a Windows system the filename "/gumby/ruby/test.rb" will be opened as "\gumby\ruby\test.rb". When specifying a Windows-style filename in a Ruby string, remember to escape the backslashes:
"c:\\gumby\\ruby\\test.rb"
Our examples here will use the Unix-style forward slashes; File::ALT_SEPARATOR can be used to get the platform-specific separator character.
So, in other words, you don't need to hassle with backslashes, and whether you need to use single or double-quotes. Keep it simple, and use forward-slashes and let Ruby worry about it. That way your code is portable across *nix/Mac OS and Windows.
Beyond that, it looks like there's a real need to learn how character escaping works in double-quoted strings vs. single-quoted strings. This is from "Programming Ruby":
Ruby provides a number of mechanisms for creating literal strings. Each generates objects of type String. The different mechanisms vary in terms of how a string is delimited and how much substitution is done on the literal's content.
Single-quoted string literals (' stuff ' and %q/stuff/) undergo the least substitution. Both convert the sequence \\ into a single backslash, and the form with single quotes converts \' into a single quote.
'hello' » hello
'a backslash \'\\\'' » a backslash '\'
%q/simple string/ » simple string
%q(nesting (really) works) » nesting (really) works
%q no_blanks_here ; » no_blanks_here
Double-quoted strings ("stuff", %Q/stuff/, and %/stuff/) undergo additional substitutions, shown in Table 18.2 on page 203.
Substitutions in double-quoted strings
\a    Bell/alert (0x07)      \nnn      Octal nnn
\b    Backspace (0x08)       \xnn      Hex nn
\e    Escape (0x1b)          \cx       Control-x
\f    Formfeed (0x0c)        \C-x      Control-x
\n    Newline (0x0a)         \M-x      Meta-x
\r    Return (0x0d)          \M-\C-x   Meta-control-x
\s    Space (0x20)           \x        x
\t    Tab (0x09)             #{expr}   Value of expr
\v    Vertical tab (0x0b)
a = 123
"\123mile" » Smile
"Say \"Hello\"" » Say "Hello"
%Q!"I said 'nuts'," I said! » "I said 'nuts'," I said
%Q{Try #{a + 1}, not #{a - 1}} » Try 124, not 122
%<Try #{a + 1}, not #{a - 1}> » Try 124, not 122
"Try #{a + 1}, not #{a - 1}" » Try 124, not 122

Converting gsub() pattern from ruby 1.8 to 2.0

I have a Ruby program that I'm trying to upgrade from Ruby 1.8 to Ruby 2.0.0-p247.
This works just fine in 1.8.7:
begin
ARGF.each do |line|
# a collection of peculiarities, appended as they appear in data
line.gsub!("\x92", "'")
line.gsub!("\x96", "-")
puts line
end
rescue => e
$stderr << "exception on line #{$.}:\n"
$stderr << "#{e.message}:\n"
$stderr << #line
end
But under Ruby 2.0, this raises an exception when it encounters a 0x96 or 0x92 byte in a data file that otherwise appears to be ASCII:
invalid byte sequence in UTF-8
I have tried all manner of things: double backslashes, using a regex object instead of the string, force_encoding(), etc. and am stumped.
Can anybody fill in the missing puzzle piece for me?
Thanks.
=============== additions: 2013-09-25 ============
Changing \x92 to \u2019 did not fix the problem.
The program does not error until it actually hits a 92 or 96 in the input file, so I'm confused as to how the character pattern in the string is the problem when there are hundreds of thousands of lines of input data that are matched against the patterns without incident.
It's not the regex that's throwing the exception, it's the Ruby compiler. \x92 and \x96 are how you would represent ’ and – in the windows-1252 encoding, but Ruby expects the string to be UTF-8 encoded. You need to get out of the habit of putting raw byte values like \x92 in your string literals. Non-ASCII characters should be specified by Unicode escape sequences (in this case, \u2019 and \u2013).
It's a Unicode world now, stop thinking of text in terms of bytes and think in terms of characters instead.
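One way to act on this advice, sketched under the assumption that the input file really is Windows-1252 (the question only says it "appears to be ASCII" apart from the 0x92 and 0x96 bytes): label each line with its actual encoding, transcode it to UTF-8, and then match on the Unicode characters. The helper name normalize_cp1252 is hypothetical.

```ruby
# Hypothetical helper: treat the raw bytes as Windows-1252, convert to UTF-8,
# then replace the typographic characters with plain ASCII ones.
def normalize_cp1252(line)
  utf8 = line.dup.force_encoding('Windows-1252').encode('UTF-8')
  utf8.gsub("\u2019", "'")   # 0x92 in Windows-1252: right single quotation mark
      .gsub("\u2013", "-")   # 0x96 in Windows-1252: en dash
end

# String#b returns a binary copy, standing in for bytes read from the file.
raw = "it\x92s a\x96b".b
puts normalize_cp1252(raw)
```

If the whole file is known to be Windows-1252, opening it with an external encoding (for example an "r:Windows-1252:UTF-8" mode string) should make the per-line conversion unnecessary.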

Desperately trying to remove this diabolical excel generated special character from csv in ruby

My computer has no idea what this character is. It came from Excel.
In excel it was a weird space, now it is literally represented by several symbols viz. my computer has no idea what it is.
This character is represented by a Ê in Excel (in csv, as xls it is a space of some kind), OS X's TextEdit treats it as a big space this long "            ", which is, I think, what it is. Ruby's CSV parser blows up when it tries to parse it using normal utf-8, and I have to add :encoding => "windows-1251:utf-8" to parse it, in which case Ruby turns it into an "K". This K appears in groups of 9, 12, 15 and 18 (KKKKKKKKK, etc) in my CSV, and cannot be removed via gsub(/K/) (groups of K, /KKKKKKKKK/, etc, cannot be removed either)! I've also used the opensource tool CSVfix, but its "removing leading and trailing spaces" command did not have an effect on the Ks.
I've tried using sed as suggested in Remove non-ascii characters from csv, but got errors like
sed: 1: "output.csv": invalid command code o
when running something like sed -i 's/[\d128-\d255]//' input.csv on Mac.
Parse your csv with the following to remove your "evil" character
.encode!("ISO-8859-1", :invalid => :replace)
**self-answers (different account, same person)
1st solution attempt:
evil_string_from_csv_cell = "KKKKKKKKK"
encoding_opts = {
:invalid => :replace, :undef => :replace,
:replace => '', :universal_newline => true }
evil_string_from_csv_cell.encode Encoding.find('ASCII'), encoding_opts
#=> ""
2nd solution attempt:
Don't use 'windows-1251:utf-8' for encoding, use 'iso-8859-1' instead, which will turn those (cyrillic) K's into '\xCA', which can then be removed with
string.gsub!(/\xCA/, '')
** I have not solved this problem yet.
3rd solution attempt:
trying to match array of K's as if they were actual K's is foolish. Copy and paste in the actual cyrillic K and see how that works-- here is the character, notice the little curl on the end
К
ruby treats it by making it a little bit bolder than normal K's
4th solution/strategy attempt (success):
use regular expressions to capture the characters, so long as you can encode the weird spaces (or whatever they are) into something, you can then ignore them using regular expressions
also try to take advantage of any spatial (matrix-like) patterns amongst the document types.
The answer to this problem is
A.) this is a very difficult problem. no one so far knows how to "physically" remove the cyrillic Ks.
but
B.) csv files are just strings separated by unescaped commas, so matching strings using regular expressions works just fine so long as the encoding doesn't break the program.
So to read the file
f = File.open(File.join(Rails.root, 'lib', 'assets', 'repo', name), :encoding => "windows-1251:utf-8")
parsed = CSV.parse(f)
then find specific rows via regular expression literal string matching (it will overlook the cyrillic K's)
parsed.each do |p| #here, p[0] is the metatag column
#specific_metatag_row = parsed.index if p[0] =~ /MetatagA/
end
I couldn't get sed working but finally had luck doing this in Vim:
vim myhorriblefile.csv
# Once vim is open:
:%s/Ê/ /g
:wq
# Done!
As a generalized function for reuse, this can be:
clean_weird_character () {
vim "$1" -c ":%s/Ê/ /g" -c "wq"
}
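For a Ruby-only route, the second attempt above can be made to work by transcoding before matching. This is a sketch that assumes the stray byte is 0xCA, as that attempt reports: the raw bytes are labelled ISO-8859-1, converted to UTF-8, and the resulting Ê (U+00CA) is deleted. The helper name strip_stray_byte is hypothetical, and the approach would also remove any legitimate Ê in the data.

```ruby
# Hypothetical helper: remove the 0xCA filler byte from raw CSV text.
def strip_stray_byte(raw)
  # ISO-8859-1 maps every byte to a character, so this conversion cannot fail.
  text = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
  # After transcoding, byte 0xCA is the character U+00CA and can be matched
  # like any other character.
  text.delete("\u00CA")
end

# Binary string standing in for a line read from the exported CSV.
raw = "name\xCA\xCA\xCA,value".b
puts strip_stray_byte(raw)
```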

Convert Hex STDIN / ARGV / gets to ASCII in ruby

My question is how to convert input from the command line (ARGV or gets) from hex to ASCII.
I know that if I assign a hex-escaped string to a variable, it is converted when I print it, e.g.:
hex_var = "\x41\x41\x41\x41"
puts hex_var
The result will be
AAAA
but I need to get the value from command line by (ARGV or gets)
Say I have these lines:
s = ARGV
puts s
# another idea
puts s[0].gsub('x' , '\x')
then I ran
ruby gett.rb \x41\x41\x41\x41
I got
\x41\x41\x41\x41
is there a way to get it work ?
There are a couple of problems you're dealing with here. The first you've already tried to address, but I don't think your solution is really ideal. The backslashes you're passing in with the command line argument are being evaluated by the shell, and are never making it to the Ruby script. If you're going to simply do a gsub in the script, there's no reason to even pass them in. And doing it your way means any 'x' in the arguments will get swapped out, even those that aren't being used to indicate a hex value. It would be better to double escape the \ in the argument if possible. Without context of where the values are coming from, it's hard to say which way would actually be better.
ruby gett.rb \\x41\\x41
That way ARGV will actually get '\x41\x41', which is closer to what you want.
It's still not exactly what you want, though, because ARGV arguments are created without expression substitution (as though they are in single quotes). So Ruby is escaping that \ even though you don't want it to. Essentially you need to take that and re-evaluate it as though it were in double quotes.
eval('"%s"' % s)
where s is the string.
So to put it all together, you could end up with either of these:
# ruby gett.rb \x41\x41
ARGV.each do |s|
s = s.gsub('x' , '\x')
p eval('"%s"' % s)
end
# => "AA"
# ruby gett.rb \\x41\\x41
ARGV.each do |s|
p eval('"%s"' % s)
end
# => "AA"
Backslashes entered in the console will be interpreted by the shell and will not make it into your Ruby script, unless you enter two backslashes in a row, in which case your script will get a literal backslash and no automatic conversion of the hexadecimal character codes following those backslashes.
You can convert these escaped codes to characters manually if you replace the last line of your script with this:
puts s.gsub(/\\x([[:xdigit:]]{1,2})/) { $1.hex.chr }
Then run it with double-backslashed input:
$ ruby gett.rb \\x41\\x42\\x43
ABC
When fetching user input through gets or similar, only a single backslash needs to be entered by the user for each character escape, since that will indeed be passed to your script as a literal backslash and thus handled correctly by the above gsub call.
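The gsub call above can be wrapped in a small method so the same conversion serves both ARGV and gets, without passing user input to eval. This is only a sketch: the name unescape_hex is hypothetical, and it is meant for codes below 0x80, since larger values produce single bytes that are not valid characters on their own.

```ruby
# Hypothetical helper: convert literal "\xNN" sequences into characters.
def unescape_hex(text)
  text.gsub(/\\x([[:xdigit:]]{1,2})/) { $1.hex.chr }
end

# Single-quoted literals mimic what the script receives when the shell is
# given \\x41\\x42\\x43, or what a user types at a gets prompt.
puts unescape_hex('\x41\x42\x43')

# A bare "x" without a preceding backslash is left alone, unlike the
# gsub('x', '\x') approach.
puts unescape_hex('x41 fox')
```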
An alternative way when parsing command line arguments would be to let the shell interpret the character escapes for you. How to do this will depend on what shell you are using. If using bash, it can be done like this:
$ echo $'\x41\x42\x43'
ABC
$ ruby -e 'puts ARGV' $'\x41\x42\x43'
ABC

Resources