Regexp hangs when input string contains brackets - ruby

I have:
vv = /added:\s{0,}\d{1,2}\/\d{1,2}\/\d{4}|terminated:\s{0,}\d{1,2}\/\d{1,2}\/\d{4}|(?-mix:\((\w+([\p{P}\s]{,3}\w*)*)\))/i
Below is my experiment:
detail = "(value containts lorem ipsum lorum ipsum"
detail =~ vv
When I try without the bracket at the start of the input string, it works.
detail = "value containts lorem ipsum lorum ipsum"
detail =~ vv
# => nil

The problem you are experiencing is catastrophic backtracking. Your \w+([\p{P}\s]{,3}\w*)* causes it because the group ([\p{P}\s]{,3}\w*)* is repeated with * while both subpatterns inside it are optional (they can match empty strings), so the engine can split the same text between repetitions in a huge number of ways. See your regex demo and try adding one more symbol to watch the step count grow: adding a space after (value containt raises the number of steps from 65,742 to 102,610! Adding 1 more symbol crashes the demo.
Replacing it with \w+(?:[\p{P}\s]{1,3}\w+)*, or even \w+(?:\W{1,3}\w+)* should fix the issue as the subpatterns inside the grouping (...) construct will no longer be matching empty strings (but the whole group will be optional, zero or more repetitions). [\p{P}\s]{1,3} requires at least 1 punctuation or whitespace and \w+ requires one or more word characters.
Also note that you do not need the (?-mix:...) group, so I removed it from my suggested pattern: you have no . inside (no need for m), no letters that can be in lower- or upper case (no need for i) and there are no spaces to ignore in the pattern (no need for x). Also, the {0,} quantifier is equal to *, so I replaced both occurrences at the beginning.
Use
vv = /added:\s*\d{1,2}\/\d{1,2}\/\d{4}|terminated:\s*\d{1,2}\/\d{1,2}\/\d{4}|\((\w+(?:[\p{P}\s]{1,3}\w+)*)\)/i
detail = "(value containts lorem ipsum lorum ipsum"
detail =~ vv
See Ruby demo
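As a quick sanity check of the suggested pattern, here is a minimal sketch. The constant name FIXED and the two extra sample strings are made up for illustration; only the first input comes from the question.

```ruby
# The suggested pattern: every repetition of the inner group must consume
# at least one separator and one word character, so it can never match "".
FIXED = /added:\s*\d{1,2}\/\d{1,2}\/\d{4}|terminated:\s*\d{1,2}\/\d{1,2}\/\d{4}|\((\w+(?:[\p{P}\s]{1,3}\w+)*)\)/i

unclosed = "(value containts lorem ipsum lorum ipsum"  # input from the question
closed   = "(value contains lorem ipsum)"              # hypothetical balanced input
dated    = "Added: 1/2/2020"                           # hypothetical date input

p FIXED.match(unclosed)    # no closing bracket, so there is no match
p FIXED.match(closed)[1]   # the text captured between the brackets
p FIXED.match(dated)[0]    # the /i flag lets "Added:" match "added:"
```

The unclosed input now fails quickly instead of hanging, because the engine has far fewer ways to split the text between repetitions of the group.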

Related

Spring JPA find entity where property containing List

For example:
This is the name property of an object I want to search for:
Lorem ipsum dolor sit amet, consectetur adipiscing elit
When I fill in "lo adip tetur" (lo (= Lorem), adip (= adipiscing), tetur (= consectetur)) I want to be able to find this object.
I tried to split my name property on space and pass it to the jpa method but I did not get any results.
List<Obj> findAllByNameIgnoreCaseContaining(String[] splittedName);
What would be the correct way to solve this problem? Thanks!
A regex query will allow you to specify this type of complex text search criteria.
Regular expressions are supported by many databases, and can be supplied when using a native query.
An example for postgres could look like this:
@Query(nativeQuery = true, value =
    "SELECT * FROM my_entity " +
    "WHERE text_column ~ :contentRegex")
List<MyEntity> findByContentRegex(@Param("contentRegex") String contentRegex);
To match the first two characters of the first three words, you could for example pass a regex like this one:
var result = repository.findByContentRegex("^Lo\\S*\\sip\\S*\\sdo.*");
(a string starting with Lo, followed by an arbitrary number of non-whitespace characters, a whitespace character, ip, an arbitrary number of non-whitespace characters, a whitespace character, do, and an arbitrary number of arbitrary characters)
Of course you can dynamically assemble the regex, e.g. by concatenating user-supplied search term fragments:
List<String> searchTerms = List.of("Lo", "ip", "do"); // e.g. from http request url params
String regex = "^" + String.join("\\S*\\s", searchTerms) + ".*";
var result = repository.findByContentRegex(regex);
See e.g. https://regex101.com/ for an interactive playground.
Note that complex expressions may make the query expensive, so at some point you may want to consider a more advanced approach such as full text search, which can make use of special indexing. https://www.postgresql.org/docs/current/textsearch-intro.html
Also note that setting a query timeout is recommended when the parameters come from a potentially hostile source.
Apart from that, there are also more specialized search servers such as Apache Solr.
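The same assembling idea can be tried out locally before sending anything to the database. The following is a minimal Ruby sketch, not part of the Spring code above: the helper name prefix_regex_source is hypothetical, and Regexp.escape follows Ruby's regex syntax, so treating its output as safe for Postgres is an assumption you would need to verify.

```ruby
# Build the regex source from user-supplied fragments, escaping each one so
# that characters such as "." or "*" are matched literally.
def prefix_regex_source(fragments)
  "^" + fragments.map { |f| Regexp.escape(f) }.join('\S*\s') + ".*"
end

source = prefix_regex_source(%w[Lo ip do])
puts source

# Try the assembled pattern against the sample text from the question.
puts Regexp.new(source).match?("Lorem ipsum dolor sit amet")
```

Escaping the fragments matters as soon as they come from a request parameter, since an unescaped fragment could change the meaning of the whole pattern.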

ruby 'gsub' to snake case

The following code from a book is supposed to transform "FOO92OBAR" to "FOO92_O_BAR":
gsub(/([a-z\d])([A-Z])/, '\1_\2')
Can anyone explain how this works?
([a-z\d]) looks for a lowercase letter (a-z) or a number (\d means a digit). The () around the whole thing assign the result to regex subgroup 1.
([A-Z]) then looks for an uppercase letter, assigning the result to group 2. So the whole thing looks for a lowercase-or-digit followed by an uppercase letter. The second part, '\1_\2', means "regex group 1, followed by an underscore, followed by regex group 2".
gsub replaces every time it sees a lowercase-or-digit followed by an uppercase letter with (the first thing)_(the second thing).
So actually FOO92OBAR will be FOO92_OBAR.
For FOO92OBAR to become FOO92_O_BAR, the replace part should be '\1_\2_' (since only the O is the second part.. BAR is not matched, so not replaced at all).
It works using regular expressions.
The two parameters of gsub are the match expression and the replacement. Because the match /([a-z\d])([A-Z])/ contains groups (identified by (...)), then you can reference a match in the replacement using \ID where the ID is the number of the group, starting from 1.
That said, the code gsub(/([a-z\d])([A-Z])/, '\1_\2')
# take any combination of
([a-z\d])([A-Z])
# which means any combinations of a (1) lower-case char or (2) digit
([a-z\d])
# followed by an (1) upper case letter
([A-Z])
# if any, replace it with
\1_\2
# that represents the first group
\1
# followed by _
# followed by the second group
\2
Please note that your example will generate FOO92_OBAR, not FOO92_O_BAR
2.1.5 :001 > string = "FOO92OBAR"
=> "FOO92OBAR"
2.1.5 :002 > string.gsub(/([a-z\d])([A-Z])/, '\1_\2')
=> "FOO92_OBAR"
This is because there is only one case of a "lower-case char or digit" (and that is a digit) followed by an upper case char.
2.1.5 :003 > string.scan(/([a-z\d])([A-Z])/)
=> [["2", "O"]]
Regular expressions are case sensitive by default.
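To see both replacement strings side by side, here is a small sketch. The constant name SNAKE and the camelCase input are made up for illustration.

```ruby
# Group 1: a lowercase letter or digit. Group 2: the uppercase letter after it.
SNAKE = /([a-z\d])([A-Z])/

puts "FOO92OBAR".gsub(SNAKE, '\1_\2')           # only "2O" matches
puts "FOO92OBAR".gsub(SNAKE, '\1_\2_')          # extra underscore after group 2
puts "fooBarBaz".gsub(SNAKE, '\1_\2').downcase  # hypothetical camelCase input
```

The single quotes around the replacement keep \1 and \2 as backreferences; in a double-quoted string the backslashes would have to be doubled.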

Find last word in string without any dots, white spaces

I don't understand regular expressions, because they seem very difficult... So what I found just finds the last word including dots etc.
$content='Nullam quis risus eget urna mollis ornare vel eu leo.';
$pattern = '/[^ ]*$/';
preg_match($pattern, $content, $result);
echo $result[0];
I get "leo.".
How can I get just "leo", without dot, or question mark ?
Thank you.
You can obtain the last word using the $ anchor and a lookahead:
$pattern = '~[a-z]+(?=[^a-z]*$)~i';
explanation:
~ # pattern delimiter
[a-z]+ # all characters between a and z (included) one or more times
(?= # open a lookahead. Means followed by:
[^a-z]* # all characters that are not letters zero or more times
$ # until the end of the string
) # close the lookahead
~ # pattern delimiter
i # case insensitive ( [a-z] => [a-zA-Z] and [^a-z] => [^a-zA-Z])
To experiment with your regexes, I suggest you use this online tool, which is specific to PHP.
You can use rubular.com to learn and test regular expressions. They also provide a nice cheat-sheet =).
You will catch your word with : \b([a-zA-Z]*)[\W]$
With this regex, you will match the period too. To extract only "leo" you will need to understand capturing groups: http://www.regular-expressions.info/named.html
Somehow I solved this problem with the pattern '/[\p{L}\d]+(?=[^a-z]*$)/u'. Maybe someone knows how to find the first and last word in a string and put them in arrays with preg_match?
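For comparison, the lookahead idea and the first-and-last-word follow-up can be sketched in Ruby rather than PHP. The method name first_and_last_word is hypothetical, and \z is used instead of $ because in Ruby $ matches at the end of any line.

```ruby
content = 'Nullam quis risus eget urna mollis ornare vel eu leo.'

# Last run of letters that is followed only by non-letters up to the end.
puts content[/[a-z]+(?=[^a-z]*\z)/i]

# Collect every run of letters, then take the first and the last one.
def first_and_last_word(text)
  words = text.scan(/\p{L}+/)
  [words.first, words.last]
end

p first_and_last_word(content)
```

Scanning for all words and picking both ends avoids writing two separate anchored patterns.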

What is the most efficient way to search a blob of text for an array of regular expressions?

I'm looking for the most efficient way to search a blob of text (± 1/2KB) for many regular expressions stored in an array.
Example code:
patterns = [/patternA/i,/patternB/i,/patternC/m,...,/patternN/i]
content = "Lorem ipsum dolor sit amet, consectetur... officiam id est laborum."
r = patterns.collect{ |pattern|
pattern unless ( content =~ pattern ).blank?
}.compact
Where r now contains patterns that matched the content string.
If you are only interested in whether any of the patterns match the text, then consider combining all patterns into a single big regex, using the regex 'or' operator, and compiling that giant regex once.
For instance, if your patterns are: A, B, C, create a single regex of the form A|B|C
Sorry, I don't know Ruby, but hopefully you can turn that into code (:
Side Note: This is how Mercurial's .hgignore files are handled last I looked. In that case there are 1000s of filenames that get thrown at the one big regex, which is more efficient than those filenames getting thrown at each of hundreds of smaller regexes.
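In Ruby the combined regex can be built with Regexp.union, which joins the patterns with | and keeps each pattern's own flags. A minimal sketch, with a hypothetical helper name and made-up sample patterns:

```ruby
# True if at least one of the patterns matches the content.
def any_match?(patterns, content)
  !Regexp.union(patterns).match(content).nil?
end

patterns = [/lorem/i, /dolor\s+sit/, /zzz/]  # hypothetical sample patterns

p any_match?(patterns, "Lorem ipsum dolor sit amet")
p any_match?(patterns, "nothing here")
```

This only tells you whether something matched, not which pattern did, so it fits the "any of the patterns" case described above.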
Solution 1
Do this:
r = patterns.select{|pattern| content =~ pattern}
Since the string is huge, it is better to implement this method on String rather than on something else, because passing a large argument seems to be slow.
class String
  def filter_patterns(patterns)
    patterns.select { |pattern| self =~ pattern }
  end
end
and use it like:
content.filter_patterns(patterns)
Solution 2
This has the restriction that no regex may include a named or numbered capture.
combined_regex = Regexp.new(patterns.map{|r| "(?=[.\n]*(#{r.source}))?"}.join)
content =~ combined_regex
The following part will have a problem if any regex inside patterns includes a named or numbered capture. If there were a way to know how many potential captures each regex has, that would solve the problem.
# group 0 is the whole match, so the capture for patterns[i] is group i + 1
r = patterns.select.with_index{|pattern, i| Regexp.last_match[i + 1]}
Addition
Given:
dogs = {
'saluki' => 'Hounds',
'russian wolfhound' => 'Hounds',
'italian greyhound' => 'Hounds',
..
}
content = "Running in the fields at great speeds, the sleek saluki dog comes from..."
you can do this:
combined_regex =
Regexp.new(dogs.keys.map{|w| "(?=[.\n]*(#{w}))?"}.join, Regexp::IGNORECASE)
content =~ combined_regex
r = dogs.keys.select.with_index{|w, i| Regexp.last_match[i + 1]}
"This article talks about #{r.collect{|x| dogs[x]}.to_sentence}."
=> "This article talks about Hounds."
To avoid outputs like This article talks about Hounds, Hounds and Hounds., you might want to put uniq in it.
"This article talks about #{r.uniq.collect{|x| dogs[x]}.to_sentence}."
How about:
text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor magna'
targets = [ /(am?et)/, /(ips.m)/, /(elit)/, /(magna)/, /([Ll]or[eu]m)/ ]
regex = Regexp.union(targets)
hits = []
text.scan(regex) { |a| hits += a.each_with_index.to_a }
r = hits.select{ |w,i| w }.map{ |w,i| targets[i]} # => [/([Ll]or[eu]m)/, /(ips.m)/, /(am?et)/, /(elit)/, /(magna)/]
This works to return the matched patterns in the order that the words were found in the text.
There's probably a way to do it using named-captures too.
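One possible named-capture variant is sketched below. It is an illustration rather than a tested drop-in: the method name matched_patterns and the group names p0, p1, ... are made up, and it assumes the targets themselves contain no captures, since Ruby stops numbering plain (...) groups once named groups are present.

```ruby
# Return the targets that matched, in the order their matches appear in text.
def matched_patterns(text, targets)
  # Wrap each target in its own named group; interpolating the Regexp
  # (its to_s form) keeps the target's own flags.
  combined = Regexp.new(
    targets.each_with_index.map { |t, i| "(?<p#{i}>#{t})" }.join('|')
  )
  hits = []
  text.scan(combined) do
    m = Regexp.last_match
    name = m.names.find { |n| m[n] }   # the one group that took part
    hits << targets[name[1..-1].to_i]  # "p4" -> targets[4]
  end
  hits
end

text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor magna'
targets = [/am?et/, /ips.m/, /elit/, /magna/, /[Ll]or[eu]m/]
p matched_patterns(text, targets)
```

Looking the pattern up by group name avoids the index bookkeeping that the each_with_index version needs.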
What you want is exactly what a lexer has been designed to do. Pick out a set of regular expressions from an input stream with only a single pass over the input required.
Unfortunately I haven't been able to find a good lexer gem for Ruby which lets you define your own lexer. I'll update the answer if I find anything.

How to remove the first 4 characters from a string if it matches a pattern in Ruby

I have the following string:
"h3. My Title Goes Here"
I basically want to remove the first four characters from the string so that I just get back:
"My Title Goes Here".
The thing is I am iterating over an array of strings and not all have the h3. part in front so I can't just ditch the first four characters blindly.
I checked the docs and the closest thing I could find was chomp, but that only works for the end of a string.
Right now I am doing this:
"h3. My Title Goes Here".reverse.chomp(" .3h").reverse
This gives me my desired output, but there has to be a better way. I don't want to reverse a string twice for no reason. Is there another method that will work?
To alter the original string, use sub!, e.g.:
my_strings = [ "h3. My Title Goes Here", "No h3. at the start of this line" ]
my_strings.each { |s| s.sub!(/^h3\. /, '') }
To leave the original unaltered and only return the result, drop the exclamation point, i.e. use sub. In the general case, where a pattern can match more than once and you want every match replaced, use gsub! or gsub; without the g only the first match is replaced, which is what you want here (and for a single-line string the ^ can match only once anyway, at the start).
You can use sub with a regular expression:
s = 'h3. foo'
s.sub!(/^h[0-9]+\. /, '')
puts s
Output:
foo
The regular expression should be understood as follows:
^ Match at the start of a line (in Ruby, use \A to anchor at the start of the whole string).
h A literal "h".
[0-9] A digit from 0-9.
+ One or more of the previous (i.e. one or more digits)
\. A literal period.
" " A literal space (yes, spaces are significant by default in regular expressions!)
You can modify the regular expression to suit your needs. See a regular expression tutorial or syntax guide, for example here.
A standard approach would be to use regular expressions:
"h3. My Title Goes Here".gsub(/^h3\. /, '') #=> "My Title Goes Here"
gsub means globally substitute and it replaces a pattern by a string, in this case an empty string.
The regular expression is enclosed in / and consists of:
^ means the beginning of a line (use \A for the beginning of the whole string)
h3 is matched literally, so it means h3
\. - a dot normally means any character so we escape it with a backslash
the trailing space is matched literally
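Putting the answers together, here is a minimal sketch that leaves lines without the marker untouched. The method name strip_heading_marker is hypothetical, and String#delete_prefix assumes Ruby 2.5 or newer.

```ruby
# Remove a leading "h<digits>. " marker if present; otherwise return the
# line unchanged. \A anchors at the start of the whole string.
def strip_heading_marker(line)
  line.sub(/\Ah\d+\. /, '')
end

titles = ["h3. My Title Goes Here", "No h3. at the start of this line"]
p titles.map { |t| strip_heading_marker(t) }

# When the prefix is a fixed string, no regex is needed (Ruby 2.5+).
p "h3. My Title Goes Here".delete_prefix("h3. ")
```

Both calls return a new string, so the original array elements are not modified.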

Resources