I'm writing a class that parses a filename. I've got 3 questions:
The regex
Given hello/this/is/my/page.html I want to capture three parts:
The parent folders: hello/this/is/my
The filename itself: page
The extension: .html
This is the regex: /^((?:[^\/]+\/)*)(\w+)(\.\w+)$/
The problem: when I test this in Rubular with a bare relative path such as page.html, everything gets captured into the first capturing group.
Can someone suggest a regex that works correctly for both relative and absolute filepaths?
The class
Would this class be ok?
class RegexFilenameHelper
  filenameRegex = /^((?:[^\/]+\/)*)(\w+)(\.\w+)$/
  def self.getParentFolders(filePath)
    matchData = filenameRegex.match(filePath)
    return matchData[1]
  end
  def self.getFileName(filePath)
    # ...
  end
  def self.getFileExtension(filePath)
    # ...
  end
end
I understand that it's inefficient to call .match for every function, but I don't intend to use all three functions sequentially.
I also intend to call the class itself, and not instantiate an object.
An aside
Assuming this is important: would you rather capture .html or html, and why?
Using the standard library:
As Tim Pietzcker suggested, the functionality is already implemented in the Pathname and File classes.
filepath = "hello/this/is/my/page.html"
Getting the parents: File.dirname(filepath) => "hello/this/is/my"
Getting the name: File.basename(filepath) => "page.html"
without extension: File.basename(filepath, File.extname(filepath)) => "page"
Getting the extension: File.extname(filepath) => ".html"
We call class methods without having to instantiate any class, which is exactly what I wanted.
It's not necessary for the file or folders to actually exist in the file system!
Thanks to Tim Pietzcker for letting me know!
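For comparison, the same pieces can also be read through Pathname, which is mentioned above but not shown. This is only a small sketch; the variable names are mine, and nothing here touches the file system.

```ruby
require 'pathname'

path = Pathname.new("hello/this/is/my/page.html")

parents   = path.dirname.to_s                  # "hello/this/is/my"
name      = path.basename.to_s                 # "page.html"
stem      = path.basename(path.extname).to_s   # "page"
extension = path.extname                       # ".html"

# A bare filename has no parent folders; dirname then reports ".".
bare_parent = Pathname.new("page.html").dirname.to_s   # "."
```

Pathname#dirname and Pathname#basename return Pathname objects, hence the to_s calls; extname already returns a String.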
Using regex:
If I had wanted to do it with regex, the correct regex would be ((?:^.*\/)?)([^\/]+)(\..*$).
((?:^.*\/)?): Captures everything before the last /, or nothing (that's what the last ? is for). This is the parent path, which is optional.
([^\/]+): Gets everything that's not /, which is the filename.
(\..*$): Captures everything coming after the last ., including it.
I tried this in Rubular and it worked like a charm, but I'm still not sure if the second capturing group is too broad, so be careful if you use this!
Thanks to user230910 for helping me get there! :)
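If the regex route is still wanted, here is a rough sketch of how it could be wrapped. It is not the exact pattern above but a slightly stricter variation: it is anchored with \A and \z, and the extension is not allowed to contain "/" or ".". The names FilenameParts and PATTERN are made up for illustration. The pattern lives in a constant because a local variable assigned in the class body (like filenameRegex in the question) is not visible inside a def.

```ruby
module FilenameParts
  # \A and \z anchor the whole string; the extension may not contain "/" or ".".
  PATTERN = %r{\A((?:.*/)?)([^/]+)(\.[^/.]*)\z}

  # Returns [parents, name, extension], or nil when the pattern does not match
  # (for example when the last path segment has no extension).
  def self.split(path)
    match = PATTERN.match(path)
    match && match.captures
  end
end

FilenameParts.split("hello/this/is/my/page.html")  # ["hello/this/is/my/", "page", ".html"]
FilenameParts.split("page.html")                   # ["", "page", ".html"]
FilenameParts.split("archive.tar.gz")              # ["", "archive.tar", ".gz"]
FilenameParts.split("dir.v2/readme")               # nil
```

Note that the first group keeps its trailing slash, and that a dot in a folder name (the last example) no longer gets mistaken for the start of the extension.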
Related
So I'm working on a crawler to get a bunch of images on a page that are saved as links. The relevant code, at the moment, is:
def parse_html(html)
  html_doc = Nokogiri::HTML(html)
  nodes = html_doc.xpath("//a[@href]")
  nodes.inject([]) do |uris, node|
    uris << node.attr('href').strip
  end.uniq
end
I am currently getting a bunch of links, most of which are images, but not all. I want to narrow down the links with a regex before downloading. So far, I haven't been able to come up with a Ruby-friendly regex for the job. The best I have is:
^https?:\/\/(?:[a-z0-9\-]+\.)+[a-z]{2,6}(?:/[^\/?]+)+\.(?:jpg|gif|png)$.match(nodes)
Admittedly, I got that regex from someone else and tried to edit it to make it work, and I'm failing. One of the big problems is that the original regex had a few "#" characters in it, and I don't know whether that is a character I can escape or whether Ruby will just stop reading at that point. Help much appreciated.
I would consider modifying your XPath to include your logic. For example, if you only wanted the a elements that contained an img you can use the following:
"//a[img][@href]"
Or even go further and extract just the URIs directly from the href values:
uris = html_doc.xpath("//a[img]/@href").map(&:value)
As some have said, you may not want to use Regex for this, but if you're determined to:
^http(s?):\/\/.*\.(jpeg|jpg|gif|png)$
This is a fairly simple pattern that matches anything beginning with http or https and ending with one of the listed file extensions. You should be able to extend it from there; Rubular.com is good for experimenting with these.
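As a rough illustration of how such a pattern could be applied to a list of links, here is a small sketch. The URLs are invented, and the pattern is my own variation: anchored with \A and \z and made case-insensitive.

```ruby
# Matches http(s) URLs that end in a known image extension, ignoring case.
IMAGE_URL = %r{\Ahttps?://.*\.(?:jpe?g|gif|png)\z}i

links = [
  "http://example.com/cat.jpg",
  "https://example.com/gallery/dog.PNG",
  "http://example.com/about.html",
  "ftp://example.com/bird.gif",
  "http://example.com/fish.png?size=large"
]

# Enumerable#grep keeps only the elements that match the pattern.
image_links = links.grep(IMAGE_URL)
# => ["http://example.com/cat.jpg", "https://example.com/gallery/dog.PNG"]
```

The last link is rejected because of its query string, which is one reason a pure regex filter may be too blunt for real pages.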
Regular expressions are a very powerful tool, but compared to simple string comparisons they are pretty slow.
For your simple example, I would suggest using a simple condition like:
IMAGE_EXTS = %w[gif jpg png]
if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
  # ...
end
In the context of your question, you might want to change your method to:
IMAGE_EXTS = %w[gif jpg png]
def parse_html(html)
  uris = []
  Nokogiri::HTML(html).xpath("//a[@href]").each do |node|
    uri = node.attr('href').strip
    uris << uri if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
  end
  uris.uniq
end
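One caveat with a plain end_with? check is that it also accepts a link such as http://example.com/mypng and rejects image.png?size=large. A possible refinement, sketched here using only the standard library, is to parse the link and compare the extension of its path. The method name image_link? and the constant are hypothetical.

```ruby
require 'uri'

IMAGE_EXTENSIONS = %w[.gif .jpg .jpeg .png].freeze

# Hypothetical helper: true when the path part of the link ends in an image extension.
def image_link?(href)
  path = URI.parse(href).path
  return false if path.nil?

  # File.extname only looks at the string; downcase makes the check case-insensitive.
  IMAGE_EXTENSIONS.include?(File.extname(path).downcase)
rescue URI::InvalidURIError
  # Links that cannot be parsed are simply treated as non-images.
  false
end

image_link?("http://example.com/a/b.PNG?size=large")  # true
image_link?("http://example.com/mypng")               # false
image_link?("http://example.com/page.html")           # false
```

Inside parse_html this could replace the any? condition, at the cost of one URI.parse call per link.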
What is the most standard Ruby convention for naming variables that contain file names, file names with a path, and File instances? A completely clear way of doing this would be:
file_name = "bar.txt"
file_name_with_path = File.join( "foo", file_name )
file = File.open( file_name_with_path )
But it's too long. It is out of the question to use :file_name_with_path in a method definition:
def quux( file_name_with_path: "foo/bar.txt" )
  # ...
end
Having encountered this for the umpteenth time, I realized that shortening conventions are needed. I started making personal shortening conventions: :file_name => :fn, :file_name_with_path => :fnwp, :file always refers to a File instance, :fn never includes a path, :fnwap means :file_name_with_absolute_path, etc. But everyone must be facing this, so I am asking: is there a public convention for this? More particularly, does Rails code have a convention for this?
But everyone must be facing this...
No, not really, because you're really over-thinking this.
Just use file:, or filename:. It doesn't matter whether your filename contains a relative or absolute path, or whether the path contains directories, and your code should reflect this. A path to a file is just a path to a file, and all paths should be treated identically by your code: It just opens the file, and raises an error if it can't.
You can use filesystem utilities to extract directories and base names from a path, and they'll work just fine on any path, regardless of the presence of directories and regardless of whether the path is absolute or relative. It just doesn't matter.
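To make that concrete, here is a minimal sketch with invented paths. The same three calls are applied to a bare name, a relative path and an absolute path without any special casing; the helper name describe is mine.

```ruby
# Hypothetical helper: splits any path with the standard File utilities.
def describe(path)
  {
    dir:  File.dirname(path),
    base: File.basename(path),
    ext:  File.extname(path)
  }
end

bare     = describe("bar.txt")
relative = describe(File.join("foo", "bar.txt"))
absolute = describe("/var/data/foo/bar.txt")

bare[:dir]       # "."
relative[:dir]   # "foo"
absolute[:dir]   # "/var/data/foo"
absolute[:base]  # "bar.txt"
```

None of these calls touch the file system, so the caller does not need to know (or name) which kind of path it was handed.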
I have the following path:
http://192.168.56.10:4567/browse/foo/bar?x=100&y=200
I want absolutely everything that comes after "http://192.168.56.10:4567/browse/" in a string.
Using a splat doesn't work (only catches "foo/bar"):
get '/browse/*' do
Neither does the regular expression (also only catches "foo/bar"):
get %r{/browse/(.*)} do
The x and y params are all accessible in the params hash, but doing a .map on the ones I want seems unreasonable and un-ruby-like (also, this is just an example.. my params are actually very dynamic and numerous). Is there a better way to do this?
More info: my path looks this way because it is communicating with an API and I use the route to determine the API call I will make. I need the string to look this way.
If you are willing to ignore the fragment (the part after #) in the path, this should work. The browser does not send anything after the # to the server anyway.
updated answer
get "/browse/*" do
  p "#{request.path}?#{request.query_string}".split("browse/")[1]
end
Or even simpler
request.fullpath.split("browse/")[1]
get "/browse/*" do
  # params[:splat] is an array of the wildcard matches, so take the first one
  a = "#{params[:splat].first}?#{request.query_string}"
  "Got #{a}"
end
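Outside of a Sinatra route, the same idea can be sketched with the standard library alone. This is only an illustration under my own assumptions: the helper name after_prefix is made up, and inside a route the value of request.fullpath would play the role of request_uri here. Unlike split("browse/")[1], removing a prefix still works when "browse/" appears again later in the path.

```ruby
require 'uri'

# Hypothetical helper: everything after the prefix, query string included.
def after_prefix(url, prefix = "/browse/")
  full = URI.parse(url).request_uri   # path plus "?query" when a query is present
  return nil unless full.start_with?(prefix)

  full.delete_prefix(prefix)          # String#delete_prefix needs Ruby 2.5 or later
end

after_prefix("http://192.168.56.10:4567/browse/foo/bar?x=100&y=200")
# => "foo/bar?x=100&y=200"
after_prefix("http://192.168.56.10:4567/browse/a/browse/b")
# => "a/browse/b"
```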
I often use long paths in my scripts, and since I'm on Windows I have to convert them to *nix style, with slashes instead of backslashes. That is not difficult, but it is annoying: if you later copy the path to open the folder in Explorer, you have to do the opposite again.
So I made a function that does the conversion. Now I can use Windows paths that I can copy around and still keep Ruby satisfied.
Question: is there a more elegant solution? I don't like the second gsub that handles the double \ at the beginning, and I would also like to handle a \ at the end (currently not possible). The function should be able to handle network UNC paths (\..) and local drive paths (c:..).
class String
  def path
    self.gsub('\\','/').gsub(/^\//,'//')
  end
end
path = '\\server\share\folder'.path
Dir.glob(path+'/**/*') do |file|
  puts file
end
#=>
#//server/share/folder/file1.txt
#//server/share/folder/file2.txt
The suggestion to use File.join made me try a regular split and join, and now I have this version. It gets rid of the ugly double gsub; it is longer, but it can handle an ending slash. Does someone have a better version?
class String
  def to_path(end_slash=false)
    "#{'/' if self[0]=='\\'}#{self.split('\\').join('/')}#{'/' if end_slash}"
  end
end
puts '\\server\share\folder'.to_path(true) # => //server/share/folder/
puts 'c:\folder'.to_path                   # => c:/folder
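Another possible shape for this, sketched without reopening String: String#tr swaps every backslash in one pass, and the trailing slash is normalised separately. The method name and the trailing_slash: keyword are my own. Note that inside a single-quoted literal '\\' is one backslash, so a UNC path needs four backslashes at the start of the literal to produce two.

```ruby
# Hypothetical helper: converts backslashes to forward slashes.
def to_forward_slashes(path, trailing_slash: false)
  converted = path.tr('\\', '/').chomp('/')   # swap separators, drop one trailing slash
  trailing_slash ? "#{converted}/" : converted
end

to_forward_slashes('\\\\server\share\folder', trailing_slash: true)  # "//server/share/folder/"
to_forward_slashes('c:\folder\\')                                     # "c:/folder"
```

Because nothing is split and rejoined, the two leading slashes of a UNC path survive without a special case.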
The portable way to write paths is with Ruby's File.join method. This will create OS-independent paths, using the right path separators.
For UNC paths, this previous answer addresses the creation of a custom File.to_unc method:
def File.to_unc( path, server="localhost", share=nil )
  parts = path.split(File::SEPARATOR)
  parts.shift while parts.first.empty?
  if share
    parts.unshift share
  else
    # Assumes the drive will always be a single letter up front
    parts[0] = "#{parts[0][0,1]}$"
  end
  parts.unshift server
  "\\\\#{parts.join('\\')}"
end
I haven't tried it myself, but it would appear to be the result you're looking for.
From any URL I want to extract its path.
For example:
URL: https://stackoverflow.com/questions/ask
Path: questions/ask
It shouldn't be difficult:
url[/(?:\w{2,}\/).+/]
But I think I use a wrong pattern for 'ignore this' ('?:' - doesn't work). What is the right way?
I would suggest you don't do this with a regular expression, and instead use the built in URI lib:
require 'uri'
uri = URI::parse('http://stackoverflow.com/questions/ask')
puts uri.path # results in: /questions/ask
It has a leading slash, but that's easy to deal with =)
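One way to deal with the leading slash, as a small sketch (the method name is mine, and String#delete_prefix needs Ruby 2.5 or later):

```ruby
require 'uri'

# Hypothetical helper: the URL's path without its leading slash.
def path_without_leading_slash(url)
  URI.parse(url).path.delete_prefix("/")
end

path_without_leading_slash("https://stackoverflow.com/questions/ask")        # "questions/ask"
path_without_leading_slash("https://stackoverflow.com/questions/ask?tab=x")  # "questions/ask"
```

The query string is not part of uri.path, so it is dropped without any extra work.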
You can use regex in this case, which is faster than URI.parse:
s = 'http://stackoverflow.com/questions/ask'
s[s[/.*?\/\/[^\/]*\//].size..-1]
# => "questions/ask" (6.8 times faster)
s[/\/(?!.*\.).*/]
# => "/questions/ask" (9.9 times faster, but with an extra slash)
But if speed does not matter to you, use URI as ctcherry showed; it is more readable.
The approach presented by ctcherry is perfectly correct, but I prefer to use request.fullpath instead of including the URI library in the code. Just call request.fullpath in your views or controllers. Be careful, though: if the URL has any GET parameters, they will be included as well; in that case I use split('?').first.