Why does this Regex time out?

This might not be the ideal question for Stack Overflow, so sorry if I violated a guideline (like "Too localized"). But this is quite an interesting problem:
I have the following Regex (a simpler version of URL matching):
\A(http(s)?:\/\/)?(([\da-z\.-]+)\.([a-z]{2,6})(\.([a-z]{2,6}))?([\/\w \.-]*)*\/?)\z
Now if I test this string (which doesn't match, because of the special characters):
http://t3n.de/news/nokia-lumia-930-test-560264/?utm_source=feedburner+t3n+News+12.000er&utm_medium=feed&utm_campaign=Feed%3A+aktuell%2Ffeeds%2Frss+%28t3n+News%29
Like this (just to make sure I didn't make an obvious error):
str = 'http://t3n.de/news/nokia-lumia-930-test-560264/?utm_source=feedburner+t3n+News+12.000er&utm_medium=feed&utm_campaign=Feed%3A+aktuell%2Ffeeds%2Frss+%28t3n+News%29'
str.match /\A(http(s)?:\/\/)?(([\da-z\.-]+)\.([a-z]{2,6})(\.([a-z]{2,6}))?([\/\w \.-]*)*\/?)\z/i
The command just runs forever. Shouldn't it return nil since the string doesn't match? I use the latest version of ruby, but this also occurs on Rubular: http://rubular.com/r/2ajABaqmTE
jarvis:~ rudolf$ ruby -v
ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-darwin13.0]
Any ideas what might cause this? Did I discover a Ruby bug or what am I missing?

Inside your regex, there's this:
([\/\w \.-]*)*
which causes the regex engine to create a lot of states that it can possibly backtrack to. You can safely remove the last *:
([\/\w \.-]*)

The way I see it, the 2nd * in this part ([\/\w \.-]*)* is redundant and causes great amounts of backtracking. Remove it and it works fine: ([\/\w \.-]*)
You have a lot of capture groups and you might want to remove them as well if you don't intend to use them, but that won't have as big of an impact.
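To illustrate the fix suggested in the answers above, here is a minimal sketch. With the nested quantifier ([\/\w \.-]*)*, the engine can split the same run of path characters between inner and outer repetitions in exponentially many ways before giving up. Dropping the outer * removes that ambiguity. The constant name FIXED_URL and the shortened URL are assumptions made for the example; the long URL is the one from the question.

```ruby
# Same pattern as in the question, except that the nested quantifier
# ([\/\w \.-]*)* has been reduced to ([\/\w \.-]*).
FIXED_URL = /\A(http(s)?:\/\/)?(([\da-z\.-]+)\.([a-z]{2,6})(\.([a-z]{2,6}))?([\/\w \.-]*)\/?)\z/i

# URL from the question: '?', '=', '+', '&' and '%' are not in the path class,
# so the pattern cannot match it.
long_url = 'http://t3n.de/news/nokia-lumia-930-test-560264/?utm_source=feedburner+t3n+News+12.000er&utm_medium=feed&utm_campaign=Feed%3A+aktuell%2Ffeeds%2Frss+%28t3n+News%29'

# Hypothetical variant without the query string, which should match.
short_url = 'http://t3n.de/news/nokia-lumia-930-test-560264/'

no_match = FIXED_URL.match(long_url)   # returns quickly instead of hanging
match    = FIXED_URL.match(short_url)

puts no_match.inspect   # nil
puts match[0]           # the whole short URL
puts match[5]           # the top-level domain capture
```

If the captures are not needed, the groups can be turned into non-capturing (?:...) groups, but as noted above that matters far less than removing the nested quantifier.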

Related

How to match regexp starting from specific character index in Ruby 1.8?

In Ruby 1.9 I would use String#match(regexp,start_index). I'm sure there must be a (computationally efficient) equivalent in Ruby 1.8, but I can't find it. Do you know what it is?

You could start the regexp with ^.{start_index}
or take the substring first before performing the match.
Alternatively, if you're constrained to using Ruby 1.8, but can install your own libraries then you could use Oniguruma.

As far as I can tell, there is no efficient way to match a Regexp against a large string, starting from an arbitrary index, in pure Ruby 1.8.
This seems like a major flaw. I guess the moral of the story is: use Ruby 1.9!
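The three approaches mentioned in the answers can be compared in a short sketch. The sample text and the start index are made up for illustration, and the two-argument String#match form requires Ruby 1.9 or later, as the question says.

```ruby
text  = 'foo bar foo baz'
start = 4  # hypothetical start index

# Ruby 1.9+: the second argument is the position where the search begins.
m = text.match(/foo/, start)
puts m.begin(0)             # offset relative to the whole string

# Substring approach: offsets are relative to the slice, so add start back.
sub = text[start..-1].match(/foo/)
puts sub.begin(0) + start

# Prefix approach: skip a fixed number of characters, then match right there.
# Note that this only matches at exactly that index (8 here).
anchored = text.match(/\A.{8}(foo)/m)
puts anchored.begin(1)
```

The substring approach copies the tail of the string, which is the cost the question is trying to avoid for large inputs.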

What Ruby parser would you suggest to parse Ruby sources?

The parser I'm looking for should:
be friendly to parsing Ruby,
have an elegant rule design,
produce user-friendly parsing errors,
have user documentation that goes beyond a calculator example,
UPD: allow optional whitespace to be omitted when writing a grammar.
Fast parsing is not an important feature.
I tried Citrus, but the lack of documentation and the need to specify every space in the rules turned me away from it.

Treetop
Ragel
Or in case you want to parse Ruby itself:
parse_tree and ruby_parser
Edit:
I just saw your last comment about needing a subset of Ruby for your project, in that case I'd also recommend having a look at tinyrb.

Why should I use File.join()?

I wonder why I should use:
puts "In folder #{File.join ENV['HOME'], 'projects'}"
Instead of:
puts "In folder #{ENV['HOME']}/projects"
I am aware that File.join will put in the appropriate separator (/ vs \) depending on the OS.
The script is already tightly tied to the version of Ruby you are using, the gems you have installed and so on. Unlike an ORM, which is meant to be independent of the database, my scripts do not need to be independent of the OS.
I will never run this on Windows (the other dependencies would stop the script from working there anyway).
So there seems to be no strong reason for using it, right?

Any of the following:
File.join("first","second")
File.join("first/","second")
File.join("first","/second")
File.join("first/","/second")
Will return
=> "first/second"
Could that be a good enough reason for you?
That's only one example I can think of.
Actually, your goal is not to concatenate 2 strings, your goal is creating a path. This looks like a strong reason to use File.join to me.
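The separator handling described above can be checked with a short sketch. The fallback value '/tmp' for a missing HOME variable is an assumption added so the example runs anywhere.

```ruby
# All four combinations of leading/trailing separators give the same path.
variants = [
  File.join('first',  'second'),
  File.join('first/', 'second'),
  File.join('first',  '/second'),
  File.join('first/', '/second')
]
puts variants.uniq.inspect   # ["first/second"]

# Applied to the question's case; '/tmp' is a hypothetical fallback.
home = ENV.fetch('HOME', '/tmp')
project_dir = File.join(home, 'projects')
puts "In folder #{project_dir}"
```

Plain interpolation would produce a doubled separator if HOME happened to end in a slash, which File.join avoids.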

Haven't used Ruby, but I expect a Path.join to handle corner cases, like paths ending with or without directory separators. Besides, it expresses intent a bit more clearly than string concatenation, and clarity is IMHO almost always a good idea.

I expect join to handle corner cases gracefully, like when ENV['HOME'] is empty for some weird reason.

In addition to the other answers your code will be more portable, the correct separator will be used regardless of unix/windows/etc.

Be aware of the difference between Ruby and Python:
Ruby: File.join("", "something") → "/something"
Python: os.path.join("", "something") → "something"
Ruby treats the empty string as a path component; I would call this a bug.
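A small sketch of the Ruby behaviour described above, together with one possible way to get the Python-like result by dropping empty components first. The reject step is an illustration, not something File.join does on its own.

```ruby
parts = ['', 'something']

# An empty leading component yields a leading separator.
with_empty = File.join(*parts)
puts with_empty      # /something

# Dropping empty components before joining gives a relative path.
without_empty = File.join(*parts.reject(&:empty?))
puts without_empty   # something
```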

Where can I find an actively developed lint tool for Ruby?

Most of the code I write is in Ruby, and every once in a while, I make some typo which only gets caught after a while. This is irritating when I have my scripts running long tasks, and return to find I had a typo.
Is there an actively developed lint tool for Ruby that could help me overcome this? Would it be possible to use it across a system that works with a lot of source files, some of them loaded dynamically?
Take this snippet as an example:
a = 20
b = 30
puts c
To win bounty, show me a tool that will detect the c variable as not created/undefined.

ruby -c myfile.rb will check for correct Ruby syntax.
Reek checks Ruby code for common code smells.
Roodi checks Ruby code for common object-oriented design issues.
Flog can warn you about unusually complex code.
[Plug] If your project is in a public Github repository, Caliper can run the latter three tools and others on your code every time you commit. (Disclaimer: I work on Caliper)

You could give Diamondback Ruby a try. It does a static typecheck of Ruby code, and will thus blame you for using an undefined variable.
While DRuby is an ongoing research project, it already works quite well for small, self-contained Ruby scripts. Currently, it is unable to analyze much of the Ruby standard library “out-of-the-box”. The authors are working toward typing Ruby on Rails (see their most recent papers).

RubyMine (http://www.jetbrains.com/ruby) does the trick.
None of the below will do all the analysis that RubyMine does.
NetBeans Ruby pack
Aptana RadRails
gVIM (with syntastic plugin by scrooloose)
Each of these has the capacity to identify syntax errors such as wrong number of parentheses, too many defs, ends, braces, etc. But none will identify invalid method calls the way RubyMine does.
Here's why: it's difficult.
Since Ruby is extremely dynamic (and methods like 'c' could easily be generated on the fly), any editor that tries to identify non-existent variables/methods would need to have a large part of the entire environment loaded and multiple program flow paths constantly tested in order to get accurate 'validity' results. This is much more difficult than in Java, where almost all programming is static (at least it was when I dropped that hat).
This ability to easily generate methods on the fly is one of the reasons the community holds testing in such high esteem. I really do recommend you try testing as well.

Have a look at RuboCop. It is a Ruby code style checker based on the Ruby Style Guide. It's maintained pretty actively and supports all major Ruby implementations. It works well with Ruby 1.9 and 2.0 and has great Emacs integration.

Yes. Test::Unit
Ok, I know you already know this and that in some sense this is a non-helpful answer, but you do bring up the negative consequence of duck typing, that there kind of is (at this time) no way around just writing more tests than something like Java might need.
So, for the record, see Test::Unit in the Ruby Standard Library or one of the other test frameworks.
Having unit tests that you can run and rerun is the best way to catch errors, and you do need more of them (tests, not errors :-) in dynamic languages like Ruby...

nitpick might be what you're looking for.
With this code:
class MyString < String
  def awesome
    self.gsub("e", "3").gsub("l", "1").uppercase
  end
end
puts MyString.new("leet").awesome
... it outputs:
$ nitpick misspelling.rb
*** Nitpick had trouble loading "misspelling.rb":
NoMethodError undefined method `uppercase' for "133t":MyString
Nothing to report boss! He's clean!

Have not used it yet, but sounds promising (will update when I've tested this).
https://github.com/michaeledgar/laser
Static analysis and style linter for Ruby code.

Pelusa is nice, but it only works on Rubinius. This shouldn't be a problem for people familiar with RVM though.

avdi@lazarus:~$ irb
>> a = 20
=> 20
>> b = 30
=> 30
>> puts c
NameError: undefined local variable or method `c' for main:Object
from (irb):3
>>
There ya go, the tool is called "IRB". Do I get the bounty?
I'm only half joking. I wrote this second answer to hopefully drive home the point that in Ruby, if you want to know that something is defined or not, you have to run the code.
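The point can be shown in a short sketch: the snippet from the question passes a pure syntax check, and the undefined name only surfaces when the code is executed. Ripper (from the standard library) is used here as a stand-in for a syntax-only check such as ruby -c; that substitution is an assumption made for the example.

```ruby
require 'ripper'

# The snippet from the question, held as a string.
source = "a = 20\nb = 30\nputs c\n"

# Ripper.sexp returns nil when the source contains a syntax error.
syntax_ok = !Ripper.sexp(source).nil?

# Executing the snippet is what reveals the undefined name.
error = begin
  eval(source)
  nil
rescue NameError => e
  e
end

puts "syntax ok: #{syntax_ok}"
puts "runtime error: #{error.class}"
```

Because `c` could just as well be a method defined at run time, the parser cannot reject it, which is why the answers above keep pointing to tests.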

Ruby: How to break a potentially unicode string into bytes

I'm writing a game which takes user input and renders it on-screen. The engine I'm using for this is entirely Unicode-friendly, so I'd like to keep that if at all possible. The problem is that the rendering loop looks like this:
"string".each_byte do |c|
  render_this_letter(c)
end
I don't know a whole lot about i18n, but I know enough to know the above code is only ever going to work for me and people who speak my language. I'd prefer something like:
"unicode string".each_unicode_letter do |u|
  render_unicode_letter(u)
end
Does this exist in the core distribution? I'm somewhat averse to adding additional requirements to the install, but if it's the only way to do it, I'll live.
For extra fun, I have no way of knowing if the string is, in fact, a unicode string.
EDIT: The library I'm using can indeed render entire strings, however I'm letting the user edit what comes up on the fly - if they hit 'backspace', essentially, I need to know how many bytes to chop off the end.

Unfortunately, Ruby 1.8.x has poor Unicode support. This is being addressed in 1.9. In the meantime, libraries like this one (http://snippets.dzone.com/posts/show/4527) are a good solution. Using the linked library, your code would look something like this:
"unicode_string".each_utf8_char do |u|
  render_unicode_letter(u)
end

You could try including the ActiveSupport::CoreExtensions::String::Unicode module from the rails codebase.
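For comparison, here is a minimal sketch of how this looks on Ruby 1.9 or later, where strings carry an encoding and String#each_char iterates over characters rather than bytes. It assumes the input is valid UTF-8; the sample text is made up, and the puts call stands in for the question's rendering method.

```ruby
# "naïve ✓" written with escapes so the example does not depend on the
# source file encoding. Strings containing \u escapes are UTF-8.
text = "na\u00EFve \u2713"

# Check the assumption before iterating, since the input may be arbitrary bytes.
valid = text.valid_encoding?

# Iterate character by character instead of byte by byte.
text.each_char { |ch| puts "#{ch} (#{ch.bytesize} bytes)" }

# For backspace handling: the number of bytes in the last character,
# and the string with that character removed.
last_char_bytes = text[-1].bytesize
shortened = text.chop

puts last_char_bytes
puts shortened.bytesize
```

This does not help on Ruby 1.8, where strings are plain byte sequences and a library such as the ones mentioned above is still needed.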
