Regular Expression - matching file extension from a URL - ruby

So I have a very specific URL, that tends to always follow the following format:
http://mtc.cdn.vine.co/r/videos/0DCB6FF2EF1179983941847883776_38a153447e7.1.5.3901866229871838946.mp4?versionId=.k9_w6W7t1Yr1KUCWRIm6AnYhSdOUz32
Basically I want to grab everything from after the . and before the ?versionId as I imagine that's the consistent location of the file extension.
I currently have something like \.\.{0}(.+)\?versionId, but it matches everything from the first . up to versionId.
One solution I thought about was using the . as a delimiter. I've never tried to exclude a character before, but basically I want to match everything starting at a ., while rejecting any candidate that has another . between it and the ?.
Anyone got any idea how to get this to work?

Is your goal to get 'mp4'? Might consider not using a regex at all...
> require 'uri'
> uri = URI.parse('http://mtc.cdn.vine.co/r/videos/0DCB6FF2EF1179983941847883776_38a153447e7.1.5.3901866229871838946.mp4?versionId=.k9_w6W7t1Yr1KUCWRIm6AnYhSdOUz32')
=> #<URI::HTTP http://mtc.cdn.vine.co/r/videos/0DCB6FF2EF1179983941847883776_38a153447e7.1.5.3901866229871838946.mp4?versionId=.k9_w6W7t1Yr1KUCWRIm6AnYhSdOUz32>
> uri.path
=> "/r/videos/0DCB6FF2EF1179983941847883776_38a153447e7.1.5.3901866229871838946.mp4"
> File.extname(uri.path)
=> ".mp4"

Completely in agreement with Philip Hallstrom, this is a typical XY problem. However, if you really wish to just hone your Regexp skills, the literal solution to your question is (Rubular):
(?<=\.)[^.]+(?=\?)
"Starting where a period comes just before, match one or more non-periods, up to where a question mark comes just after."
To understand this, read up on positive lookbehind ((?<=...)), positive lookahead ((?=...)), and negated character sets ([^...]).
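As a quick check of both approaches on the URL from the question, here is a minimal sketch. The variable names are only illustrative; the pattern and the URI/File calls are the ones quoted in the answers above.

```ruby
require 'uri'

url = 'http://mtc.cdn.vine.co/r/videos/0DCB6FF2EF1179983941847883776_38a153447e7.1.5.3901866229871838946.mp4?versionId=.k9_w6W7t1Yr1KUCWRIm6AnYhSdOUz32'

# Regex route: lookbehind for a period, a run of non-periods,
# then a lookahead for the question mark.
ext_from_regex = url[/(?<=\.)[^.]+(?=\?)/]

# Non-regex route: parse the URL, then ask File for the extension of the path.
ext_from_uri = File.extname(URI.parse(url).path)

puts ext_from_regex  # mp4
puts ext_from_uri    # .mp4
```

Note that the regex route returns the extension without the leading period, while File.extname keeps it.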

Related

simple symbol regex solution

The problem I'm looking at says only inputs with '+' symbols covering any letters in the string is true so like "+d++" or "+d+==+a+" but not
"f++d+"
"3+a=+b+"
"++d+=c+"
I tried to solve this using regex since it's kind of a string pattern matching problem: /(+[a-z][^+])|([^+.][a-z]+)/ but this does not cover patterns where the letters are at the beginning or end of the string. I need help with something more comprehensive.
You should try the following:
/^\+{0,2}[a-z0-9]+\+{0,2}(=*\+{0,2}[a-z0-9]+\+{0,2})*$/
You could use the below regex.
^(?:[^\w\n]*\+[a-z]+\+)+[^\w\n]*$
If you want to match +f+g+ also, then put the following + inside a positive lookahead assertion.
^(?:[^\w\n]*\+[a-z]+(?=\+))+[^\w\n]*$
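To see how the lookahead version behaves on the strings from the question, here is a small sketch. The constant and array names are made up for illustration, and match? assumes Ruby 2.4 or later.

```ruby
# The lookahead variant quoted above; the name is hypothetical.
PLUS_WRAPPED = /^(?:[^\w\n]*\+[a-z]+(?=\+))+[^\w\n]*$/

# Strings the question says should pass, plus the +f+g+ case.
accepted = ['+d++', '+d+==+a+', '+f+g+']
# Strings the question says should fail.
rejected = ['f++d+', '3+a=+b+', '++d+=c+']

(accepted + rejected).each do |s|
  puts "#{s.inspect} => #{PLUS_WRAPPED.match?(s)}"
end
```

Because the trailing + is only asserted and not consumed, it can also serve as the leading + of the next letter, which is what lets +f+g+ through.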

Regex for matching everything before trailing slash, or first question mark?

I'm trying to come up with a regex that will elegantly match everything in a URL AFTER the domain name and before the first ?, the last slash, or the end of the URL if neither of the two exists.
This is what I came up with but it seems to be failing in some cases:
regex = /[http|https]:\/\/.+?\/(.+)[?|\/|]$/
In summary:
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price/ should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2 should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price should return
2013/07/31/a-new-health-care-approach-dont-hide-the-price
Please don't use Regex for this. Use the URI library:
require 'uri'
str_you_want = URI("http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price").path
Why?
See everything about this famous question for a good discussion of why these kinds of things are a bad idea.
Also, this XKCD really says why:
In short, regexes are an incredibly powerful tool, but when you're dealing with things defined by hundred-page convoluted standards, and there is already a library that does the job faster, easier, and more correctly, why reinvent the wheel?
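URI(...).path keeps the leading slash (and any trailing one), so a little trimming is still needed to get exactly the strings listed in the question. A minimal sketch, where path_after_domain is a hypothetical helper name and not part of the URI library:

```ruby
require 'uri'

# Hypothetical helper: the URL path without its leading or trailing slash.
def path_after_domain(url)
  URI(url).path.sub(%r{\A/}, '').chomp('/')
end

urls = [
  'http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price/',
  'http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2',
  'http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price'
]

urls.each { |u| puts path_after_domain(u) }
```

The query string never reaches the trimming step because URI already separates it from the path.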
If lookaheads are allowed
((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)
Copy + Paste this in http://regexpal.com/
See here with ruby regex tester: http://rubular.com/r/uoLLvTwkaz
Image using javascript regex, but it works out the same
(?=) is just a lookahead
I basically set up three matches from 2XXX up to (in this order):
(?=\?\w+) # lookahead for a question mark followed by one or more word characters
(?=/\s+) # lookahead for a slash followed by one or more whitespace characters
.*\w # match up to the last word character
I'm pretty sure that some parentheses were not needed but I just copy pasted.
There are essentially two OR | expressions in the (A|B|C) expression. The order matters since it's like a (ifthen|elseif|else) type deal.
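For anyone who wants to try the alternation in Ruby directly, here is a sketch that runs the pattern above over the three URLs from the question. The constant name is made up, and %r{} is used only so the / inside the pattern needs no escaping.

```ruby
# The three-way alternation from this answer; the name is hypothetical.
YEAR_PATH = %r{((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)}

urls = [
  'http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price/',
  'http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2',
  'http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price'
]

# String#[] with a regex returns the whole matched text, or nil.
urls.each { |u| puts u[YEAR_PATH] }
```

For the URL with a query string the first branch wins; for the other two, the first two branches fail and the last branch trims back to the final word character.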
You can probably adjust the prefix; I just assumed that you wanted to match 2XXX, where X is a digit.
Also, save the pitchforks everyone, regular expressions are not always the best but it's there for you when you need it.
Also, there is xkcd (https://xkcd.com/208/) for everything:

ruby regex: match URL recurring pattern

I want to be able to match all the following cases below using Ruby 1.8.7.
/pages/multiedit/16801,16809,16817,16825,16833
/pages/multiedit/16801,16809,16817
/pages/multiedit/16801
/pages/multiedit/1,3,5,7,8,9,10,46
I currently have:
\/pages\/multiedit\/\d*
This matches up to the first set of numbers. So for example:
"/pages/multiedit/16801,16809,16817,16825,16833"[/\/pages\/multiedit\/\d*/]
# => "/pages/multiedit/16801"
See http://rubular.com/r/ruFPx5yIAF for example.
Thanks for the help, regex gods.
\/pages\/multiedit\/\d+(?:,\d+)*
Example: http://rubular.com/r/0nhpgki6Gy
Edit: Updated to not capture anything... Although the performance hit would be negligible. (Thanks Tin Man)
The currently accepted answer of
\/pages\/multiedit\/[\d,]+
may not be a good idea because that will also match the following strings
.../pages/multiedit/,,,
.../pages/multiedit/,1,
My answer requires there be at least one digit before the first comma, and at least one digit between commas, and it must end with a digit.
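The difference between the two patterns can be seen with a short sketch. STRICT and LOOSE are illustrative names for the two regexes quoted in this thread.

```ruby
# Digits separated by single commas (this answer).
STRICT = /\/pages\/multiedit\/\d+(?:,\d+)*/
# Any run of digits and commas (the other answer).
LOOSE  = /\/pages\/multiedit\/[\d,]+/

samples = [
  '/pages/multiedit/16801,16809,16817,16825,16833',
  '/pages/multiedit/16801',
  '/pages/multiedit/,,,',
  '/pages/multiedit/,1,'
]

samples.each do |s|
  puts "#{s}: strict=#{s[STRICT].inspect} loose=#{s[LOOSE].inspect}"
end
```

Neither pattern is anchored, so if the whole string has to be validated rather than searched, adding \A and \z around the pattern would be a reasonable next step.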
I'd use:
/\/pages\/multiedit\/[\d,]+/
Here's a demonstration of the pattern at http://rubular.com/r/h7VLZS1W1q
[\d,]+ means "find one or more numbers or commas"
The reason \d* doesn't work is it means "find zero or more numbers". As soon as the pattern search runs into a comma it stops. You have to tell the engine that it's OK to find numbers and commas.

How can I simplify this regular expression?

The format I'm trying to match is:
# (Apple push notification codes)
"11a735e9 9f696c2f 700b2700 728042c6 137eeb7a 8442c27d 40e59d9e 3c7e0de7"
The simplest expression I can think of is: /((\w{8}\s){7}\w{8})/i
Can anyone think of a simpler one?
(I'm using Ruby regular expressions)
UPDATE - thanks to user1096188, I've removed \d - this is included in \w
You can detect a word boundary using \b, and use (?: to prevent capturing groups
/(?:\w{8}\b\s?){8}/
You could do this if the end of the match is the end of the whole string.
(\w{8}(?:\s|$)){8}
Taking @zapthedingbat's solution one stage further, it looks like the code only contains hexadecimal characters (0-9 and a-f) and spaces. So you could possibly sacrifice a little simplicity for accuracy.
I'm making an assumption, but I suspect letters g to z are invalid.
If the format is hexadecimal only (you should check Apple's documentation to be sure), a tighter match would be:
/(?:[0-9a-f]{8}\b\s?){8}/
EDIT
In fact, in Ruby, it looks like you should be able to do:
/(?:\h{8}\b\s?){8}/
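A small sketch of why the hexadecimal class is tighter than \w. The \A and \z anchors and the constant names are additions for this illustration, not part of the answer above, and \h needs Ruby 1.9 or later.

```ruby
# Anchored variants of the two patterns discussed above (names are hypothetical).
HEX_TOKEN  = /\A(?:\h{8}\b\s?){8}\z/
WORD_TOKEN = /\A(?:\w{8}\b\s?){8}\z/

good = '11a735e9 9f696c2f 700b2700 728042c6 137eeb7a 8442c27d 40e59d9e 3c7e0de7'
# Same token with one non-hex letter swapped in.
bad = good.sub('11a735e9', '11a735g9')

puts HEX_TOKEN.match(good) ? 'hex: good accepted' : 'hex: good rejected'
puts HEX_TOKEN.match(bad)  ? 'hex: bad accepted'  : 'hex: bad rejected'
puts WORD_TOKEN.match(bad) ? 'word: bad accepted' : 'word: bad rejected'
```

The \w version lets the g through, while the \h version rejects it.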
> "11a735e9 9f696c2f 700b2700 728042c6 137eeb7a 8442c27d 40e59d9e 3c7e0de7".match(/(?:\w{8}\s?)+/)
> $&
=> "11a735e9 9f696c2f 700b2700 728042c6 137eeb7a 8442c27d 40e59d9e 3c7e0de7"

Ruby RegEx issue

I'm having a problem getting my RegEx to work with my Ruby script.
Here is what I'm trying to match:
http://my.test.website.com/{GUID}/{GUID}/
Here is the RegEx that I've tested and should be matching the string as shown above:
/([-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])*?\/)/
3 capturing groups:
group 1: ([-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])*?\/)
group 2: (\/[-a-zA-Z0-9#:%_\+.~#?&\/\/=]*)
group 3: ([\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\/\/])
Ruby is giving me an error when trying to validate a match against this regex:
empty range in char class: (My RegEx goes here) (SyntaxError)
I appreciate any thoughts or suggestions on this.
You could simplify things a bit by using URI to deal with parsing the URL, \h in the regex, and scan to pull out the GUIDs:
require 'uri'
uri = URI.parse(your_url)
path = uri.path
guids = path.scan(/\h{8}-\h{4}-\h{4}-\h{4}-\h{12}/)
If you need any of the non-path components of the URL then you can easily pull them out of uri.
You might need to tighten things up a bit depending on your data or it might be sufficient to check that guids has two elements.
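Put together as a runnable sketch, with two made-up GUIDs standing in for the {GUID} placeholders from the question (the GUID constant name is also just illustrative):

```ruby
require 'uri'

# One GUID: 8-4-4-4-12 hexadecimal digits.
GUID = /\h{8}-\h{4}-\h{4}-\h{4}-\h{12}/

# Hypothetical URL in the shape described in the question.
url = 'http://my.test.website.com/3f2504e0-4f89-11d3-9a0c-0305e82c3301/a1b2c3d4-e5f6-7a8b-9c0d-1e2f3a4b5c6d/'

guids = URI.parse(url).path.scan(GUID)

puts guids.inspect
puts "looks valid: #{guids.length == 2}"
```

Checking guids.length == 2 is the simple sanity check mentioned above; stricter validation would depend on the real data.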
You have several errors in your RegEx. I am very sleepy now, so I'll just give you a hint instead of a solution:
...[\/\/[0-9a-fA-F]....
the first [ does not belong there. Also, having \/\/ inside [] is unnecessary - you only need each character once inside []. Also,
...[-a-zA-Z0-9#:%_\+.~#?&\/\/=]{2,256}...
is greedy, and includes a period - indeed, includes all chars (AFAICS) that can come after it, effectively swallowing the whole string (when you get rid of other bugs). Consider {2,256}? instead.
