Check if a string contains 'http://' and '.jpg' - ruby

I'm new to Ruby and Rails, so forgive me if this is an easy question. I'm trying to check, when a user passes in an image URL in my form, that it is a valid URL. Here is my code:
if params[:url].include? 'http://' && (params[:url].include? '.jpg' || params[:url].include? '.png')
This returns an error. Is this even the best way to go about it? What should I do differently? Thanks.

if my_str =~ %r{\Ahttps?://.+\.(?:jpe?g|png)\z}i
Regex explained:
%r{...} — regex literal similar to /.../, but allows / to be used inside without escaping
\A — the start of the string (^ is just the start of the line)
http — the literal text
s? — optionally followed by an "s" (to allow https://)
:// — the literal text (to prevent something like http-whee.jpg)
.+ — one or more characters (that aren't a newline)
\. — a literal period (make sure this is an extension we're looking at)
(?:aaa|bbb) — allow either aaa or bbb here, but don't capture the result
jpe?g — either "jpg" or "jpeg"
png — the literal text
\z — the end of the string ($ is just the end of the line)
i — make the match case-insensitive (allow for .JPG as well as .jpg)
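To illustrate, using Regexp#match? (Ruby 2.4+) with made-up URLs:
pattern = %r{\Ahttps?://.+\.(?:jpe?g|png)\z}i

pattern.match?("http://example.com/cat.JPG")    # => true  (https optional, case-insensitive)
pattern.match?("https://example.com/dog.jpeg")  # => true
pattern.match?("http-whee.jpg")                 # => false (no "://")
pattern.match?("http://example.com/cat.gif")    # => false (extension not in the list)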
However, you might be able to get away with just this (more readable) version:
allowed_extensions = %w[.jpg .jpeg .png]
if my_str.start_with?('http://') &&
   allowed_extensions.any? { |ext| my_str.end_with?(ext) }
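Since start_with? and end_with? both accept multiple arguments, a minimal variant is possible (adding https:// here is my assumption, not part of the original answer):
# end_with? returns true if the string ends with any of the given suffixes.
if my_str.start_with?('http://', 'https://') &&
   my_str.end_with?('.jpg', '.jpeg', '.png')
  # looks like an image URL
end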

@Phrogz's answer is better; I just tried this with some Ruby standard libraries.
require 'uri'
extensions = %w( .jpg .jpeg .png )
schemes = %w( http https )
string = params[:url]
if schemes.include?(URI.parse(string).scheme) && extensions.include?(File.extname(string))
end
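For a quick sanity check of what the two predicates see (illustrative URL, not from the question):
require 'uri'

URI.parse("https://example.com/images/cat.jpg").scheme  # => "https"
File.extname("https://example.com/images/cat.jpg")      # => ".jpg"
Note that File.extname keeps everything after the last dot, so a trailing query string such as ?v=2 would end up in the "extension".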

While regex will shorten the code, I prefer to not do such a check all in one pattern. It's a self-documentation/maintenance thing. A single regex is faster, but if the needed protocols or image types grow, the pattern will become more and more unwieldy.
Here's what I'd do:
str[%r{^http://}i] && str[/\.(?:jpe?g|png)$/]
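As a hedged sketch of how that could stay readable as the lists of protocols and image types grow (the constant and method names are illustrative, not from the answer):
# Each concern gets its own named pattern, so adding a protocol or extension
# is a one-line change rather than surgery on one combined regex.
PROTOCOL   = %r{\Ahttps?://}i
IMAGE_TYPE = /\.(?:jpe?g|png)\z/i

def image_url?(str)
  str.match?(PROTOCOL) && str.match?(IMAGE_TYPE)
end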

Related

Avoid double encoding when escaping URL fragments in Ruby

I'm trying to clean up some auto-generated code where input URL fragments:
may include spaces, which need to be %-escaped (as %20, not +)
may include other URL-invalid characters, which also need to be %-escaped
may include path separators, which need to be left alone (/)
may include already-escaped components, which need not to be doubly-escaped
The existing code uses libcurl (via Typhoeus and Ethon), which like command-line curl seems to happily accept spaces in URLs.
The existing code is all string-based and has a number of shenanigans involving removing extra slashes, adding missing slashes, etc. I'm trying to replace this with URI.join(), but this fails with bad URI(is not URI?) on the fragments with spaces.
The obvious solution is to use the (deprecated) URI.escape, which escapes spaces, but leaves slashes alone:
URI.escape('http://example.org/ spaces /<"punc^tu`ation">/non-ascïï 𝖈𝖍𝖆𝖗𝖘/&c.')
# => "http://example.org/%20spaces%20/%3C%22punc%5Etu%60ation%22%3E/non-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98/%EF%BC%86%EF%BD%83%EF%BC%8E"
This mostly works, except for the last case above: previously escaped components get double-escaped.
s1 = URI.escape(s)
# => "http://example.org/%20spaces%20/%3C%22punc%5Etu%60ation%22%3E/non-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98/%EF%BC%86%EF%BD%83%EF%BC%8E"
URI.escape(s1)
# => "http://example.org/%2520spaces%2520/%253C%2522punc%255Etu%2560ation%2522%253E/non-asc%25C3%25AF%25C3%25AF%2520%25F0%259D%2596%2588%25F0%259D%2596%258D%25F0%259D%2596%2586%25F0%259D%2596%2597%25F0%259D%2596%2598/%25EF%25BC%2586%25EF%25BD%2583%25EF%25BC%258E"
The recommended alternatives to URI.escape, e.g. CGI.escape and ERB::Util.url_encode, are not suitable as they mangle the slashes (among other problems):
CGI.escape(s)
# => "http%3A%2F%2Fexample.org%2F+spaces+%2F%3C%22punc%5Etu%60ation%22%3E%2Fnon-asc%C3%AF%C3%AF+%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98%2F%EF%BC%86%EF%BD%83%EF%BC%8E"
ERB::Util.url_encode(s)
# => "http%3A%2F%2Fexample.org%2F%20spaces%20%2F%3C%22punc%5Etu%60ation%22%3E%2Fnon-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98%2F%EF%BC%86%EF%BD%83%EF%BC%8E"
Is there a clean, out-of-the-box way to preserve existing slashes, escapes, etc. and escape only invalid characters in a URI string?
So far the best I've been able to come up with is something like:
require 'uri'
include URI::RFC2396_Parser::PATTERN

INVALID = Regexp.new("[^%#{RESERVED}#{UNRESERVED}]")

def escape_invalid(str)
  parser = URI::RFC2396_Parser.new
  parser.escape(str, INVALID)
end
This seems to work:
s2 = escape_invalid(s)
# => "http://example.org/%20spaces%20/%3C%22punc%5Etu%60ation%22%3E/non-asc%C3%AF%C3%AF%20%F0%9D%96%88%F0%9D%96%8D%F0%9D%96%86%F0%9D%96%97%F0%9D%96%98/%EF%BC%86%EF%BD%83%EF%BC%8E"
s2 == escape_invalid(s2)
# => true
but I'm not confident in the regex concatenation (even if it is the way URI::RFC2396_Parser works internally) and I know it doesn't handle all cases (e.g., a % that isn't part of a valid hex escape should probably be escaped). I'd much rather find a standard-library solution.
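As a hedged sketch of that last point (the helper name and the stray-% handling are my assumptions, not from the post): percent-encode any % that isn't followed by two hex digits first, then escape the remaining invalid characters the same way as above.
require 'uri'

def escape_invalid_strict(str)
  parser = URI::RFC2396_Parser.new
  # A '%' not followed by two hex digits can't be part of a valid escape, so encode it as %25 first.
  str = str.gsub(/%(?![0-9A-Fa-f]{2})/, '%25')
  parser.escape(str, /[^%#{URI::RFC2396_REGEXP::PATTERN::RESERVED}#{URI::RFC2396_REGEXP::PATTERN::UNRESERVED}]/)
end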

Check if string is a glob pattern

As input I have a string that can be either a plain path (e.g. /home/user/1.txt) or a glob pattern (e.g. /home/user/*.txt).
Next, I want to get an array of matches if the string is a glob pattern, and, in the case where the string is just a plain path, an array with a single element: that path.
So somehow I should check whether the string contains unescaped glob symbols; if it does, call Pathname.glob() to get the matches, otherwise just return an array containing the string.
How can I check if string is a glob pattern?
UPDATE
I had this question while implementing glob pattern support for the zap stanza in Homebrew Cask.
The solution I ended up using was a small refactoring that avoids the need to check whether the string is a glob pattern.
Next, I want to get an array of matches if the string is a glob pattern, and, in the case where the string is just a plain path, an array with a single element: that path.
They're both valid glob patterns. One contains a wildcard, one does not. Run them both through Pathname.glob() and you'll always get an array back. Bonus, it'll check if it matches anything.
$ irb
2.3.3 :001 > require "pathname"
=> true
2.3.3 :002 > Pathname.glob("test.data")
=> [#<Pathname:test.data>]
2.3.3 :003 > Pathname.glob("test.*")
=> [#<Pathname:test.asm>, #<Pathname:test.c>, #<Pathname:test.cpp>, #<Pathname:test.csv>, #<Pathname:test.data>, #<Pathname:test.dSYM>, #<Pathname:test.html>, #<Pathname:test.out>, #<Pathname:test.php>, #<Pathname:test.pl>, #<Pathname:test.py>, #<Pathname:test.rb>, #<Pathname:test.s>, #<Pathname:test.sh>]
2.3.3 :004 > Pathname.glob("doesnotexist")
=> []
This is a great way to normalize and validate your data early, so the rest of the program doesn't have to.
If you really want to figure out if something is a literal path or a glob, you could try scanning for any special glob characters, but that rapidly gets complicated and error prone. It requires knowing how glob works in detail and remembering to check for quoting and escaping. foo* has a glob pattern. foo\* does not. foo[123] does. foo\[123] does not. And I'm not sure what foo[123\] is doing, I think it counts as a non-terminated set.
In general, you want to avoid writing code that has to reproduce the inner workings of another piece of code. If there were a Pathname.has_glob_chars you could use that, but no such thing exists.
Pathname.glob uses File.fnmatch to do the globbing and you can use that without touching the filesystem. You might be able to come up with something using that, but I can't make it work. I thought maybe only a literal path will match itself, but foo* defeats that.
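For illustration (made-up file names), File.fnmatch does the matching in memory, but it can't tell you whether a string is itself a glob:
File.fnmatch("*.txt", "notes.txt")  # => true
File.fnmatch("*.txt", "notes.md")   # => false

# The "does it match itself?" heuristic fails, because a glob like "foo*" matches itself too:
File.fnmatch("foo*", "foo*")        # => true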
Instead, check if it exists.
Pathname.new(path).exist?
If it exists, it was a real path to a real file. If it didn't exist, it might have been a real path, or it might be a glob. That's probably good enough.
You can also check by looking to see if Pathname.glob(path) returned a single element that matches the original path. Note that when matching paths it's important to normalize both sides with cleanpath.
paths = Pathname.glob(path)
if paths.size == 1 && paths[0].cleanpath == Pathname.new(path).cleanpath
  puts "#{path} is a literal path"
elsif paths.size == 0
  puts "#{path} matched nothing"
else
  puts "#{path} was a glob"
end

Single quote string interpolation to access a file in linux

How do I make the parameter file of the method sound become the file name with the .fifo extension, using single quotes? I've searched up and down, and tried many different approaches, but I think I need a new set of eyes on this one.
def sound(file)
  @cli.stream_audio('audio\file.fifo')
end
Alright, so I finally got it working; it might not be the correct way, but this seemed to do the trick. First, there may have been some whitespace interfering with my file parameter. Then I used the File.join option that I saw posted here by a few different people.
I used a bit of each of the answers really, and this is how it came out:
def sound(file)
  file = file.strip
  file = File.join('audio/', "#{file}.fifo")
  @cli.stream_audio(file) if File.exist? file
end
Works like a charm! :D
Ruby interpolation requires that you use double quotes.
Is there a reason you need to use single quotes?
def sound(file)
  @cli.stream_audio("audio/#{file}.fifo")
end
As Charles Caldwell stated in his comment, the best way to get cross-platform file paths to work correctly would be to use File.join. Using that, your method would look like this:
def sound(file)
  @cli.stream_audio(File.join("audio", "#{file}.fifo"))
end
Your problem is with your usage of file path separators. You are using a \. While this may not seem like a big deal, it actually matters when used in Ruby strings.
When you use \ in a single quoted string, nothing happens. It is evaluated as-is:
puts 'Hello\tWorld' #=> Hello\tWorld
Notice what happens when we use double quotes:
puts "Hello\tWorld" #=> "Hello World"
The \t got interpreted as a tab. That's because, much like how Ruby will interpolate #{} code in a double quote, it will also interpret \n or \t into a new line or tab. So when it sees "audio\file.fifo" it is actually seeing "audio" with a \f and "ile.fifo". It then determines that \f means 'form feed' and adds it to your string. Here is a list of escape sequences. It is for C++ but it works across most languages.
As @sawa pointed out, if your escape sequence does not exist (for instance \y) then it will just remove the \ and leave the 'y'.
"audio\yourfile.fifo" #=> audioyourfile.fifo
There are three possible solutions:
Use a forward slash:
"audio/#{file}.fifo"
The forward slash will be interpreted as a file path separator when passed to the system. I do most of my work on Windows, which uses \, but using / in my code is perfectly fine.
Use \\:
"audio\\#{file}.fifo"
Using a double \\ escapes the \ and causes it to be read as you intended it.
Use File.join:
File.join("audio", "#{file}.fifo")
This will join the parameters with whatever file separator is set in the File::SEPARATOR constant.
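For example, assuming file is "beep" (an illustrative value):
File.join("audio", "beep.fifo")  # => "audio/beep.fifo"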

Desperately trying to remove this diabolical excel generated special character from csv in ruby

My computer has no idea what this character is. It came from Excel.
In Excel it was a weird space; now it is literally represented by several symbols, i.e. my computer has no idea what it is.
This character is represented by a Ê in the CSV (in the .xls it is a space of some kind), and OS X's TextEdit treats it as a big space about this long "            ", which is, I think, what it is. Ruby's CSV parser blows up when it tries to parse it as normal UTF-8, and I have to add :encoding => "windows-1251:utf-8" to parse it, in which case Ruby turns it into a "K". These K's appear in groups of 9, 12, 15 and 18 (KKKKKKKKK, etc.) in my CSV and cannot be removed via gsub(/K/) (groups of K's, /KKKKKKKKK/, etc., cannot be removed either!). I've also used the open-source tool CSVfix, but its "remove leading and trailing spaces" command had no effect on the K's.
I've tried using sed as suggested in Remove non-ascii characters from csv, but got errors like
sed: 1: "output.csv": invalid command code o
when running something like sed -i 's/[\d128-\d255]//' input.csv on Mac.
Parse your csv with the following to remove your "evil" character
.encode!("ISO-8859-1", :invalid => :replace)
**self-answers (different account, same person)
1st solution attempt:
evil_string_from_csv_cell = "KKKKKKKKK"
encoding_opts = {
  :invalid => :replace, :undef => :replace,
  :replace => '', :universal_newline => true }
evil_string_from_csv_cell.encode Encoding.find('ASCII'), encoding_opts
#=> ""
2nd solution attempt:
Don't use 'windows-1251:utf-8' for encoding, use 'iso-8859-1' instead, which will turn those (cyrillic) K's into '\xCA', which can then be removed with
string.gsub!(/\xCA/, '')
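A hedged sketch of that idea (the file name and the assumption that the mystery byte is 0xCA are mine, not confirmed by the poster):
# Read the raw file as ISO-8859-1 so the mystery byte shows up as 0xCA ("Ê"),
# delete it, then transcode the rest to UTF-8.
raw     = File.read("input.csv", encoding: "ISO-8859-1")
cleaned = raw.delete(0xCA.chr(Encoding::ISO_8859_1)).encode("UTF-8")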
** I have not solved this problem yet.
3rd solution attempt:
Trying to match the K's as if they were ordinary ASCII K's is foolish. Copy and paste in the actual Cyrillic K and see how that works. Here is the character; notice the little curl on the end:
К
Ruby displays it a little bit bolder than normal K's.
4th solution/strategy attempt (success):
Use regular expressions to capture the characters: as long as you can encode the weird spaces (or whatever they are) into something, you can then ignore them using regular expressions.
Also try to take advantage of any spatial (matrix-like) patterns among the document types.
The answer to this problem is
A) This is a very difficult problem; no one so far knows how to "physically" remove the Cyrillic K's.
but
B) CSV files are just strings separated by unescaped commas, so matching strings with regular expressions works just fine, so long as the encoding doesn't break the program.
So to read the file
require 'csv'

f = File.open(File.join(Rails.root, 'lib', 'assets', 'repo', name), :encoding => "windows-1251:utf-8")
parsed = CSV.parse(f)
then find specific rows via regular expression literal string matching (it will overlook the cyrillic K's)
parsed.each do |p| # here, p[0] is the metatag column
  @specific_metatag_row = parsed.index(p) if p[0] =~ /MetatagA/
end
I couldn't get sed working but finally had luck doing this in Vim:
vim myhorriblefile.csv
# Once vim is open:
:%s/Ê/ /g
:wq
# Done!
As a generalized function for reuse, this can be:
clean_weird_character () {
vim "$1" -c ":%s/Ê/ /g" -c "wq"
}

How can I match a URL but exclude terminators from the match?

I want to match URLs in text and replace them with anchor tags, but I want to exclude some terminators, just like how Twitter matches URLs in tweets.
So far I've got this, but it's obviously not working too well.
(http[s]?\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?)
EDIT: Some example URLs. In all cases below I only want to match "http://www.example.com"
http://www.example.com.
http://www.example.com:
"http://www.example.com"
http://www.example.com;
http://www.example.com!
[http://www.example.com]
{http://www.example.com}
http://www.example.com*
I looked into this very issue last year and developed a solution that you may want to look at. See: URL Linkification (HTTP/FTP). That link is a test page for the JavaScript solution with many examples of difficult-to-linkify URLs.
My regex solution, written for both PHP and JavaScript (but easily translated to Ruby), is not simple (but neither is the problem, as it turns out). For more information I would also recommend reading:
The Problem With URLs by Jeff Atwood, and
An Improved Liberal, Accurate Regex Pattern for Matching URLs by John Gruber
The comments following Jeff's blog post are a must read if you want to do this right...
Ruby's URI module has an extract method that is used to parse URLs out of text. Parsing the returned values lets you piggyback on the module's heuristics to extract the scheme and host information from a URL, avoiding reinventing the wheel.
text = '
http://www.example.com.
http://www.example.com:
"http://www.example.com"
http://www.example.com;
http://www.example.com!
[http://www.example.com]
{http://www.example.com}
http://www.example.com*
http://www.example.com/foo/bar?q=foobar
http://www.example.com:81
'
require 'uri'
puts URI::extract(text).map{ |u| uri = URI.parse(u); "#{ uri.scheme }://#{ uri.host[/(^.+?)\.?$/, 1] }" }
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
# >> http://www.example.com
The only gotcha is that a period '.' is a legitimate character in a host name, so URI#host won't strip it. Those get caught in the map statement where the URL is rebuilt. Note that URI is stripping off the path and query information.
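For illustration, here is that host-trimming expression on its own (values made up):
# The capture keeps everything up to an optional trailing dot.
"www.example.com."[/(^.+?)\.?$/, 1]  # => "www.example.com"
"www.example.com"[/(^.+?)\.?$/, 1]   # => "www.example.com"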
A pragmatic and easily understandable solution is:
regex = %r!"(https?://[-.\w]+\.\w{2,6})"!
Some notes:
With %r we can choose the start and end delimiter. In this case I used exclamation mark, since I want to use slash unescaped in the regex.
The optional quantifier (i.e. '?') binds only to the preceding expression, in this case 's'. There's no need to put the 's' in a character class [s]?. It's the same as s?.
Inside the character class [-.\w] we don't need to escape the dash and dot for them to match literally. The dash should come first, however, so it doesn't denote a range.
\w matches [A-Za-z0-9_] in Ruby. It's not exactly the full definition of URL characters, but combined with dash and dot it may be enough for our needs.
Top domains are between 2 and 6 characters long, e.g. '.se' and '.travel'
I'm not sure what you mean by "I want to exclude some terminators", but this regex matches only the wanted URL in your examples.
We want to use the first capture group, e.g. like this:
if input =~ %r!"(https?://[-.\w]+\.\w{2,6})"!
  match = $~[1]
else
  match = ""
end
What about this?
%r|https?://[-\w.]*\w|
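Applied to a couple of the question's examples (an illustrative check, not from the answer):
pattern = %r|https?://[-\w.]*\w|
"[http://www.example.com]"[pattern]  # => "http://www.example.com"
"http://www.example.com."[pattern]   # => "http://www.example.com"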