I'm trying to clean up some auto-generated code where input URL fragments:
may include spaces, which need to be %-escaped (as %20, not +)
may include other URL-invalid characters, which also need to be %-escaped
may include path separators, which need to be left alone (/)
may include already-escaped components, which must not be double-escaped
The existing code uses libcurl (via Typhoeus and Ethon), which like command-line curl seems to happily accept spaces in URLs.
The existing code is all string-based and has a number of shenanigans involving removing extra slashes, adding missing slashes, etc. I'm trying to replace this with URI.join(), but this fails with bad URI(is not URI?) on the fragments with spaces.
The obvious solution is to use the (deprecated) URI.escape, which escapes spaces, but leaves slashes alone:
URI.escape('http://example.org/ spaces /<"punc^tu`ation">/non-ascïï 𝖈𝖍𝖆𝖗𝖘/＆ｃ．')
# => "http://example.org/%20spaces%20/%3C%22punc%5Etu%60ation%22%3E/non-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98/%EF%BC%86%EF%BD%83%EF%BC%8E"
This mostly works, except for the last case above: previously escaped components get double-escaped.
s1 = URI.escape(s)
# => "http://example.org/%20spaces%20/%3C%22punc%5Etu%60ation%22%3E/non-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98/%EF%BC%86%EF%BD%83%EF%BC%8E"
URI.escape(s1)
# => "http://example.org/%2520spaces%2520/%253C%2522punc%255Etu%2560ation%2522%253E/non-asc%25C3%25AF%25C3%25AF%2520%25F0%259D%2596%2588%25F0%259D%2596%258D%25F0%259D%2596%2586%25F0%259D%2596%2597%25F0%259D%2596%2598/%25EF%25BC%2586%25EF%25BD%2583%25EF%25BC%258E"
The recommended alternatives to URI.escape, e.g. CGI.escape and ERB::Util.url_encode, are not suitable as they mangle the slashes (among other problems):
CGI.escape(s)
# => "http%3A%2F%2Fexample.org%2F+spaces+%2F%3C%22punc%5Etu%60ation%22%3E%2Fnon-asc%C3%AF%C3%AF+%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98%2F%EF%BC%86%EF%BD%83%EF%BC%8E"
ERB::Util.url_encode(s)
# => "http%3A%2F%2Fexample.org%2F%20spaces%20%2F%3C%22punc%5Etu%60ation%22%3E%2Fnon-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98%2F%EF%BC%86%EF%BD%83%EF%BC%8E"
Is there a clean, out-of-the-box way to preserve existing slashes, escapes, etc. and escape only invalid characters in a URI string?
So far the best I've been able to come up with is something like:
require 'uri'
include URI::RFC2396_Parser::PATTERN
# Anything that is not "%", a reserved character, or an unreserved character
INVALID = Regexp.new("[^%#{RESERVED}#{UNRESERVED}]")
def escape_invalid(str)
  parser = URI::RFC2396_Parser.new
  parser.escape(str, INVALID)
end
This seems to work:
s2 = escape_invalid(s)
# => "http://example.org/%20spaces%20/%3C%22punc%5Etu%60ation%22%3E/non-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98/%EF%BC%86%EF%BD%83%EF%BC%8E"
s2 == escape_invalid(s2)
# => true
but I'm not confident in the regex concatenation (even if it is the way URI::RFC2396_Parser works internally) and I know it doesn't handle all cases (e.g., a % that isn't part of a valid hex escape should probably be escaped). I'd much rather find a library standard solution.
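One way to cover the stray-% case mentioned above is a hand-rolled substitution. The following is only a minimal sketch, not a library-standard solution: the UNSAFE pattern and this version of escape_invalid are assumptions of mine, and the allowed character set is taken from the reserved and unreserved sets of RFC 3986 rather than from URI::RFC2396_Parser.

```ruby
# Sketch only: UNSAFE is a hand-written, hypothetical pattern (not part of URI).
# It matches a "%" that does not start a valid %XX escape, or any character
# outside the RFC 3986 reserved/unreserved sets.
UNSAFE = %r{%(?![0-9A-Fa-f]{2})|[^%A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]}

def escape_invalid(str)
  # Percent-encode every byte of each offending character (UTF-8 aware).
  str.gsub(UNSAFE) { |char| char.bytes.map { |b| format('%%%02X', b) }.join }
end

once  = escape_invalid('http://example.org/ spaces /100%/%20done/ï')
twice = escape_invalid(once)
puts once           # http://example.org/%20spaces%20/100%25/%20done/%C3%AF
puts once == twice  # true
```

Because an existing %XX sequence is never matched, running the helper twice leaves the string unchanged.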
My input is a string that can be either a plain path (e.g. /home/user/1.txt) or a glob pattern (e.g. /home/user/*.txt).
Next, I want an array of matches if the string is a glob pattern; if the string is just a plain path, I want an array with a single element, that path.
So I need to check whether the string contains unescaped glob symbols: if it does, call Pathname.glob() to get the matches; otherwise, just return an array containing the string.
How can I check whether a string is a glob pattern?
UPDATE
I ran into this question while implementing glob pattern support for the zap stanza in Homebrew Cask.
The solution I used was a small refactoring that avoids the need to check whether a string is a glob pattern.
Next, I want an array of matches if the string is a glob pattern; if the string is just a plain path, I want an array with a single element, that path.
They're both valid glob patterns. One contains a wildcard, one does not. Run them both through Pathname.glob() and you'll always get an array back. Bonus, it'll check if it matches anything.
$ irb
2.3.3 :001 > require "pathname"
=> true
2.3.3 :002 > Pathname.glob("test.data")
=> [#<Pathname:test.data>]
2.3.3 :003 > Pathname.glob("test.*")
=> [#<Pathname:test.asm>, #<Pathname:test.c>, #<Pathname:test.cpp>, #<Pathname:test.csv>, #<Pathname:test.data>, #<Pathname:test.dSYM>, #<Pathname:test.html>, #<Pathname:test.out>, #<Pathname:test.php>, #<Pathname:test.pl>, #<Pathname:test.py>, #<Pathname:test.rb>, #<Pathname:test.s>, #<Pathname:test.sh>]
2.3.3 :004 > Pathname.glob("doesnotexist")
=> []
This is a great way to normalize and validate your data early, so the rest of the program doesn't have to.
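As a small sketch of that normalization step, every input can go through the same call whether or not it contains a wildcard. The helper name expand_paths and the temporary files are made up for the demonstration.

```ruby
require 'pathname'
require 'tmpdir'

# Hypothetical helper: treat every input as a glob pattern.
def expand_paths(pattern)
  Pathname.glob(pattern).sort
end

Dir.mktmpdir do |dir|
  # Illustrative files, created only for this demo.
  %w[test.c test.rb].each { |name| File.write(File.join(dir, name), '') }

  p expand_paths(File.join(dir, 'test.rb')).map { |path| path.basename.to_s }
  # ["test.rb"]
  p expand_paths(File.join(dir, 'test.*')).map { |path| path.basename.to_s }
  # ["test.c", "test.rb"]
  p expand_paths(File.join(dir, 'doesnotexist'))
  # []
end
```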
If you really want to figure out if something is a literal path or a glob, you could try scanning for any special glob characters, but that rapidly gets complicated and error prone. It requires knowing how glob works in detail and remembering to check for quoting and escaping. foo* has a glob pattern. foo\* does not. foo[123] does. foo\[123] does not. And I'm not sure what foo[123\] is doing, I think it counts as a non-terminated set.
In general, you want to avoid writing code that has to reproduce the inner workings of another piece of code. If there was a Pathname.has_glob_chars you could use that, but there isn't such a thing.
Pathname.glob uses File.fnmatch to do the globbing and you can use that without touching the filesystem. You might be able to come up with something using that, but I can't make it work. I thought maybe only a literal path will match itself, but foo* defeats that.
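To illustrate why the "a literal path matches itself" idea falls short, here is a quick check with File.fnmatch, which only compares strings and never touches the filesystem. The sample patterns are the ones discussed above.

```ruby
# File.fnmatch(pattern, path) compares strings only; no files are needed.
puts File.fnmatch('foo', 'foo')            # true: a literal matches itself
puts File.fnmatch('foo[123]', 'foo[123]')  # false: the set matches one of 1, 2, 3
puts File.fnmatch('foo[123]', 'foo2')      # true
puts File.fnmatch('foo\*', 'foo*')         # true: the escaped * is literal
puts File.fnmatch('foo*', 'foo*')          # true: so "matches itself" proves nothing
```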
Instead, check if it exists.
Pathname.new(path).exist?
If it exists, it was a real path to a real file. If it didn't exist, it might have been a real path, or it might be a glob. That's probably good enough.
You can also check by looking to see if Pathname.glob(path) returned a single element that matches the original path. Note that when matching paths it's important to normalize both sides with cleanpath.
paths = Pathname.glob(path)
if paths.size == 1 && paths[0].cleanpath == Pathname.new(path).cleanpath
  puts "#{path} is a literal path"
elsif paths.empty?
  puts "#{path} matched nothing"
else
  puts "#{path} was a glob"
end
I'm attempting to do this:
Dir["c:\temp\*.*"]
but that is failing. I understand why, but I seem to lack the Ruby prowess to work around it.
I am given the path in a variable and otherwise have no control over it. Nor do I know the contents ahead of time.
Is there a way to make Dir function with double quoted strings that are poorly escaped? Alternatively, how does one take a variable with the apparent contents
"c:\temp\*.*"
and convert it into
'c:/temp/*.*'
This problem at the core seems to be how to potentially escape a string that should have been escaped but now is not.
The end result is that I am not able to use the given string for something as conceptually simple as puts() or Dir[].
If given 'c:\temp\*.*' then I have no problem. I can fix that:
foo = 'c:\temp\*.*'.gsub('\\', '/')
If given "c:\\temp\\*.*" then I have no problem. I can fix that:
foo = "c:\\temp\\*.*".gsub("\\", "/")
However, I am passed neither of those, but rather "c:\temp\*.*". This string contains a TAB and a second, undefined escape. It is this that I can't fix in a general way.
Even if I knew the contents ahead of time I am stumped on how to properly escape and transform this. I should add that I am not a ruby programmer at the moment so maybe there is some simple method to deal with this that I am not aware of.
I tried a bunch of stuff like:
"c:\temp\*.*".gsub("\t", "/t")
which gets me part of the way, but since the actual contents of the string are not known to me ahead of time this is a little wonky. Further, if the escape sequence is not valid, as in \*, then I am also in a jam. So this also fails:
"c:\temp\*.*".gsub("\t", "/t").gsub("\*", "/*")
Is there a way to make Dir function with double quoted strings that are poorly escaped?
No.
Garbage in, garbage out. There is no Rumpelstiltskin routine that returns gold when given trash.
Ruby auto-converts forward-slashes in filenames/paths to reverse-slashes when running on Windows. Simply make it a habit of using forward, *nix-style, slashes and you'll be fine.
From the IO documentation:
Ruby will convert pathnames between different operating system conventions if possible. For instance, on a Windows system the filename "/gumby/ruby/test.rb" will be opened as "\gumby\ruby\test.rb". When specifying a Windows-style filename in a Ruby string, remember to escape the backslashes:
"c:\\gumby\\ruby\\test.rb"
I don't have "c:\\temp", I have "c:\temp" as input
In a properly defined Windows path you should see:
'c:' + '\temp' + '\*.*' # => "c:\\temp\\*.*"
Note that the single-quotes are treating "\t" as an escaped-escape + "t". Your source for the variable is creating the string improperly by using double-quotes:
'c:' + "\temp" + "\*.*" # => "c:\temp*.*"
If you have "\t", you have a TAB character. It's possible to change it to an escaped-T using:
"c:\temp" # => "c:\temp"
"c:\temp"[2] # => "\t"
"c:\temp"[2].ord # => 9
'\t' # => "\\t"
"c:\temp".sub("\t", '\t') # => "c:\\temp"
The next problem is what to do when you have a String containing "*" to convert it to "\*". There's no way to search for "\*" because that's the same as "*" as seen above:
"\*.*" # => "*.*"
But, since "*.*" is a fairly specific "anything" wildcard, maybe simply searching for and replacing that pattern would work:
"c:\temp\*.*".gsub('*.*', '\\*.*') # => "c:\temp\\*.*"
or:
"c:\temp\*.*".gsub('*.*', '/*.*') # => "c:\temp/*.*"
Back to dealing with "\t" and putting it all together... I'd start with:
"c:\temp\*.*".gsub("\t", '\t').gsub('*.*', '/*.*') # => "c:\\temp/*.*"
"c:\temp\*.*".gsub("\t", '/t').gsub('*.*', '/*.*') # => "c:/temp/*.*"
You'll have to figure out what to do if you have something like:
c:/dir/file*.*
where they mean they want all files starting with file. Since you're seeing ambiguous inputs it seems the input routine needs to be more rigorous to not allow reversed-slashes.
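If the caller cannot be fixed, the gsub approach above can be stretched a little by mapping each control character back to the letter that produced it. This is a best-effort sketch under loud assumptions: the names CONTROL_TO_LETTER and restore_separators are hypothetical, it assumes control characters were never intended in the path, and it cannot recover a backslash that Ruby dropped silently (as in \*).

```ruby
# Hypothetical lookup: control character => the letter of its escape sequence.
CONTROL_TO_LETTER = {
  "\a" => 'a', "\b" => 'b', "\t" => 't', "\n" => 'n',
  "\v" => 'v', "\f" => 'f', "\r" => 'r', "\e" => 'e'
}.freeze

def restore_separators(mangled)
  # Turn each swallowed "\x" back into "/x"; surviving backslashes become "/" too.
  mangled.gsub(Regexp.union(CONTROL_TO_LETTER.keys)) { |c| "/#{CONTROL_TO_LETTER[c]}" }
         .gsub('\\', '/')
end

puts restore_separators("c:\temp\new\file.txt")  # c:/temp/new/file.txt
puts restore_separators("c:\temp\*.*")           # c:/temp*.* (the \ before * is lost)
```

The second call shows the limit of the idea: the separator before *.* is gone before the string ever reaches the helper.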
How do I make the parameter file of the method sound become the file name of the .fifo extension using single quotes? I've searched up and down, and tried many different approaches, but I think I need a new set of eyes on this one.
def sound(file)
  @cli.stream_audio('audio\file.fifo')
end
Alright so I finally got it working, might not be the correct way but this seemed to do the trick. First thing, there may have been some white space interfering with my file parameter. Then I used the File.join option that I saw posted here by a few different people.
I used a bit of each of the answers really, and this is how it came out:
def sound(file)
  file = file.strip
  file = File.join('audio/', "#{file}.fifo")
  @cli.stream_audio(file) if File.exist? file
end
Works like a charm! :D
Ruby interpolation requires that you use double quotes.
Is there a reason you need to use single quotes?
def sound(file)
  @cli.stream_audio("audio/#{file}.fifo")
end
As Charles Caldwell stated in his comment, the best way to get cross-platform file paths to work correctly would be to use File.join. Using that, your method would look like this:
def sound(file)
  @cli.stream_audio(File.join("audio", "#{file}.fifo"))
end
Your problem is with your usage of file path separators. You are using a \. Whereas this may not seem like a big deal, it actually is when used in Ruby strings.
When you use \ in a single quoted string, nothing happens. It is evaluated as-is:
puts 'Hello\tWorld' #=> Hello\tWorld
Notice what happens when we use double quotes:
puts "Hello\tWorld" #=> Hello    World
The \t got interpreted as a tab. That's because, much like how Ruby will interpolate #{} code in a double quote, it will also interpret \n or \t into a new line or tab. So when it sees "audio\file.fifo" it is actually seeing "audio" with a \f and "ile.fifo". It then determines that \f means 'form feed' and adds it to your string. Here is a list of escape sequences. It is for C++ but it works across most languages.
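A quick way to see this effect is to compare the two quoting styles directly. The file name is only an example.

```ruby
single = 'audio\file.fifo'  # backslash kept as-is
double = "audio\file.fifo"  # \f becomes a single form feed character

puts single.length           # 15
puts double.length           # 14
puts single.include?('\\')   # true: the backslash is still there
puts double.include?("\f")   # true: it was replaced by a form feed
```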
As @sawa pointed out, if your escape sequence does not exist (for instance \y) then it will just remove the \ and leave the 'y'.
"audio\yourfile.fifo" #=> audioyourfile.fifo
There are three possible solutions:
Use a forward slash:
"audio/#{file}.fifo"
The forward slash will be interpreted as a file path separator when passed to the system. I do most of my work on Windows, which uses \, but using / in my code is perfectly fine.
Use \\:
"audio\\#{file}.fifo"
Using a double \\ escapes the \ and causes it to be read as you intended it.
Use File.join:
File.join("audio", "#{file}.fifo")
This will join the parameters using whatever file separator is set in the File::SEPARATOR constant.
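Side by side, the three options look like this. The value of file is a made-up example; since File::SEPARATOR is "/", the first and third forms produce the same string.

```ruby
file = 'alarm' # hypothetical file name

forward   = "audio/#{file}.fifo"
escaped   = "audio\\#{file}.fifo"
with_join = File.join('audio', "#{file}.fifo")

puts forward    # audio/alarm.fifo
puts escaped    # audio\alarm.fifo
puts with_join  # audio/alarm.fifo
```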
My computer has no idea what this character is. It came from Excel.
In Excel it was a weird space; now it is literally represented by several symbols, which is to say my computer has no idea what it is.
This character is represented by a Ê in Excel (in csv, as xls it is a space of some kind), OS X's TextEdit treats it as a big space this long " ", which is, I think, what it is. Ruby's CSV parser blows up when it tries to parse it using normal utf-8, and I have to add :encoding => "windows-1251:utf-8" to parse it, in which case Ruby turns it into an "K". This K appears in groups of 9, 12, 15 and 18 (KKKKKKKKK, etc) in my CSV, and cannot be removed via gsub(/K/) (groups of K, /KKKKKKKKK/, etc, cannot be removed either)! I've also used the opensource tool CSVfix, but its "removing leading and trailing spaces" command did not have an effect on the Ks.
I've tried using sed as suggested in Remove non-ascii characters from csv, but got errors like
sed: 1: "output.csv": invalid command code o
when running something like sed -i 's/[\d128-\d255]//' input.csv on Mac.
Parse your csv with the following to remove your "evil" character
.encode!("ISO-8859-1", :invalid => :replace)
**self-answers (different account, same person)
1st solution attempt:
evil_string_from_csv_cell = "KKKKKKKKK"
encoding_opts = {
:invalid => :replace, :undef => :replace,
:replace => '', :universal_newline => true }
evil_string_from_csv_cell.encode Encoding.find('ASCII'), encoding_opts
#=> ""
2nd solution attempt:
Don't use 'windows-1251:utf-8' for encoding, use 'iso-8859-1' instead, which will turn those (cyrillic) K's into '\xCA', which can then be removed with
string.gsub!(/\xCA/, '')
** I have not solved this problem yet.
3rd solution attempt:
trying to match array of K's as if they were actual K's is foolish. Copy and paste in the actual cyrillic K and see how that works-- here is the character, notice the little curl on the end
К
ruby treats it by making it a little bit bolder than normal K's
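The reason gsub(/K/) finds nothing may be easier to see by decoding the same byte under different encodings. This is a sketch with assumptions: the sample bytes are invented, and the guess that the file was saved by Excel on a Mac (Mac OS Roman uses byte 0xCA for a non-breaking space) is mine, not something the data confirms.

```ruby
raw = "MetatagA,\xCA\xCA\xCAvalue".b # hypothetical bytes from one CSV line

as_1251   = raw.dup.force_encoding('Windows-1251').encode('UTF-8')
as_latin1 = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')
as_mac    = raw.dup.force_encoding('macRoman').encode('UTF-8')

puts as_1251.include?("\u041A")   # true: Cyrillic К, not the Latin letter K
puts as_1251.include?('K')        # false: which is why gsub(/K/) matches nothing
puts as_latin1.include?("\u00CA") # true: Ê
puts as_mac.include?("\u00A0")    # true: non-breaking space

cleaned = as_1251.delete("\u041A")
puts cleaned                      # MetatagA,value
```

Matching on the actual code point (here "\u041A") instead of a look-alike letter is what makes the removal work.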
4th solution/strategy attempt (success):
use regular expressions to capture the characters, so long as you can encode the weird spaces (or whatever they are) into something, you can then ignore them using regular expressions
also try to take advantage of any spatial (matrix-like) patterns amongst the document types.
The answer to this problem is
A.) this is a very difficult problem. no one so far knows how to "physically" remove the cyrillic Ks.
but
B.) csv files are just strings separated by unescaped commas, so matching strings using regular expressions works just fine so long as the encoding doesn't break the program.
So to read the file
f = File.open(File.join(Rails.root, 'lib', 'assets', 'repo', name), :encoding => "windows-1251:utf-8")
parsed = CSV.parse(f)
then find specific rows via regular expression literal string matching (it will overlook the cyrillic K's)
parsed.each do |p| # here, p[0] is the metatag column
  @specific_metatag_row = parsed.index(p) if p[0] =~ /MetatagA/
end
I couldn't get sed working but finally had luck doing this in Vim:
vim myhorriblefile.csv
# Once vim is open:
:%s/Ê/ /g
:wq
# Done!
As a generalized function for reuse, this can be:
clean_weird_character () {
vim "$1" -c ":%s/Ê/ /g" -c "wq"
}
I'm new to Ruby and Rails, so forgive me if this is an easy question. I'm trying to check, when a user passes in an IMG URL in my form, that it is a valid URL. Here is my code:
if params[:url].include? 'http://' && (params[:url].include? '.jpg' || params[:url].include? '.png')
This returns an error. Is this even the best way to go about it? What should I do differently? Thanks.
if my_str =~ %r{\Ahttps?://.+\.(?:jpe?g|png)\z}i
Regex explained:
%r{...} — regex literal similar to /.../, but allows / to be used inside without escaping
\A — the start of the string (^ is just the start of the line)
http — the literal text
s? — optionally followed by an "s" (to allow https://)
:// — the literal text (to prevent something like http-whee.jpg)
.+ — one or more characters (that aren't a newline)
\. — a literal period (make sure this is an extension we're looking at)
(?:aaa|bbb) — allow either aaa or bbb here, but don't capture the result
jpe?g — either "jpg" or "jpeg"
png — the literal text
\z — the end of the string ($ is just the end of the line)
i — make the match case-insensitive (allow for .JPG as well as .jpg)
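Wrapped in a small predicate, the pattern can be tried against a few inputs. The constant and method names below are hypothetical, chosen only for this sketch.

```ruby
# Hypothetical helper wrapping the pattern explained above.
IMAGE_URL = %r{\Ahttps?://.+\.(?:jpe?g|png)\z}i

def image_url?(str)
  IMAGE_URL.match?(str.to_s)
end

puts image_url?('http://example.com/cat.jpg')            # true
puts image_url?('https://example.com/CAT.JPEG')          # true
puts image_url?('http-whee.jpg')                         # false
puts image_url?("http://example.com/cat.jpg\n<script>")  # false
```

The last case is where \A and \z matter: with ^ and $ the first line alone would have matched.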
However, you might be able to get away with just this (more readable) version:
allowed_extensions = %w[.jpg .jpeg .png]
if my_str.start_with?('http://') &&
allowed_extensions.any?{ |ext| my_str.end_with?(ext) }
@Phrogz's answer is better; I just tried this with some Ruby libs.
require 'uri'
extensions = %w( .jpg .jpeg .png )
schemes = %w( http https )
string = params[:url]
if schemes.include?(URI.parse(string).scheme) && extensions.include?(File.extname(string))
  # the URL looks like an http(s) link to a JPEG or PNG image
end
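A slightly more defensive variant of the same idea is sketched below. The helper name image_uri? and the constants are hypothetical; the sketch assumes you want false rather than an exception for unparsable input, and that the extension should be read from the URL's path so that a query string does not get in the way.

```ruby
require 'uri'

ALLOWED_SCHEMES    = %w[http https].freeze
ALLOWED_EXTENSIONS = %w[.jpg .jpeg .png].freeze

# Hypothetical helper name; returns false instead of raising on unparsable input.
def image_uri?(string)
  uri = URI.parse(string.to_s)
  ALLOWED_SCHEMES.include?(uri.scheme) &&
    ALLOWED_EXTENSIONS.include?(File.extname(uri.path.to_s).downcase)
rescue URI::InvalidURIError
  false
end

puts image_uri?('https://example.com/pic.PNG?size=large') # true
puts image_uri?('ftp://example.com/pic.png')              # false
puts image_uri?('http://example.com/not a url.jpg')       # false
```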
While regex will shorten the code, I prefer to not do such a check all in one pattern. It's a self-documentation/maintenance thing. A single regex is faster, but if the needed protocols or image types grow, the pattern will become more and more unwieldy.
Here's what I'd do:
str[%r{^http://}i] && str[/\.(?:jpe?g|png)$/]