Lookbehind and lookahead regex - ruby

I have strings like this:
journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2
I need to capture Michele Di Santo, Libero Nigro, and Wilma Russo, but not the last field.
This regex matches almost what I need:
/(?<=::).*?(?=::)/
But it has a problem: the first match includes the third colon of the ::: separator.
str.scan(/(?<=::).*?(?=::)/) #=> [":Michele Di Santo", ...]
As you can see, the first match has a colon at the beginning.
How can I fix this regex to avoid capturing that third colon?

Don't use regex for this. All you need to do is split the input string on :::, take the second string from the resulting array, and split that on ::. Faster to code, faster to run, and easier to read than a regex version.
Edit: The code:
str.split(':::')[1].split('::')
Running on CodePad: http://codepad.org/1BNNwoh6

An expression to do that could be:
(?<=::)[^:].*?(?=::)
Although if the string to be searched is always in the form of "xxx:::A::B::C:::xxx" and you only care about A, B and C, consider using something more specific, and using the capture groups to get A, B and C:
:::(.+?)::(.+?)::(.+?):::
$1, $2 and $3 will contain the group matches.
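A minimal sketch comparing the question's pattern with the two patterns above, assuming the input always has the shape key:::A::B::C:::title. The variable names are illustrative and not from the original posts.

```ruby
str = 'journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::' \
      'Programmer-Defined Control Abstractions in Modula-2'

# Question's pattern: the lookbehind is already satisfied after the first two
# colons of ":::", so the lazy match starts on the third colon.
original = str.scan(/(?<=::).*?(?=::)/)

# Requiring a non-colon first character pushes the match start past ":::".
fixed = str.scan(/(?<=::)[^:].*?(?=::)/)

# Capture-group variant: MatchData#captures returns the three groups.
m = str.match(/:::(.+?)::(.+?)::(.+?):::/)
captured = m ? m.captures : []

p original
p fixed
p captured
```

Note that [^:] requires every name to be at least one character long, and the capture-group variant only fits inputs with exactly three names.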

I'd use a simple split because the string is basically a CSV with colons instead of commas:
str = 'journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::Programmer-Defined Control Abstractions in Modula-2'
items = str.split(':')
str1, str2, str3 = items[3], items[5], items[7]
=> [
[0] "Michele Di Santo",
[1] "Libero Nigro",
[2] "Wilma Russo"
]
You could also use:
str1, str2, str3 = str.split(':').select{ |s| s > '' }[1, 3]
If it's possible to have quoted colons, use the CSV module and set your field delimiter to ':'.
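A rough sketch of the CSV suggestion, assuming the csv library is available and that a single colon is the field delimiter (so ::: and :: simply produce empty fields). The variable names are illustrative.

```ruby
require 'csv'

str = 'journals/cl/SantoNR90:::Michele Di Santo::Libero Nigro::Wilma Russo:::' \
      'Programmer-Defined Control Abstractions in Modula-2'

# Parse one line with ':' as the column separator.
fields = CSV.parse_line(str, col_sep: ':')

# Drop the empty fields created by the doubled and tripled colons, whether
# they come back as nil or as an empty string, then take the three names.
names = fields.reject { |f| f.nil? || f.empty? }[1, 3]

p names
```

The benefit over a plain split is that a field wrapped in double quotes may contain colons without being split.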

Related

Regex to obfuscate substring of a repeating substring

Given a string like:
abc_1234 xyz def_123aa4a56
I want to replace parts of it so the output is:
abc_*******z def_*******56
The rules are:
abc_ and def_ act as delimiters: everything between one delimiter and the next belongs to the preceding delimiter.
The text after each delimiter, up to the next delimiter (or the end of the string), should be replaced by *, except for the last 2 characters of that text. In the above example, abc_1234 xyz (note the trailing space) is turned into abc_*******z (with the trailing space kept).

prefixes = %w|abc_ def_|
input = "Hello abc_111def_frg def_333World abc_444"
input.gsub(/(#{Regexp.union(prefixes)})../, "\\1**")
#⇒ "Hello abc_**1def_**g def_**3World abc_**4"

Is this what you are looking for?
str = "Hello abc_111def_frg def_333World abc_444"
str.scan(/(?<=abc_|def_)(?:[[:alpha:]]+|[[:digit:]]+)/)
# => ["111", "frg", "333", "444"]
I've assumed the string following "abc_" or "def_" is either all digits or all letters. It won't work if, for example, you wished to extract "a1b" from "abc_a1b cat". You need to better define the rules for what terminates the strings you want.
The regular expression reads, "Following the string "abc_" or "def_" (a positive lookbehind that is not part of the match), match a string of digits or a string of letters".

Given:
> s
=> "abc_1234 xyz def_123aa4a56"
You can do:
> s.gsub(/(?<=abc_|def_)(.*?)(..)(?=(?:abc_|def_|$))/) { |m| "*" * $1.length<<$2 }
=> "abc_*******z def_*******56"
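The same idea can be sketched as a reusable method. This is only an illustration: the method name obfuscate and the keep parameter are hypothetical, and it captures the prefix in a group instead of using a lookbehind, which sidesteps the restrictions Ruby places on lookbehind contents when the prefixes have different lengths.

```ruby
# Hypothetical helper; the name and the `keep` parameter are illustrative.
def obfuscate(text, prefixes, keep: 2)
  prefix = Regexp.union(prefixes)
  # Group 1: the prefix; group 2: the part to mask (lazy);
  # group 3: the last `keep` characters before the next prefix or end of string.
  pattern = /(#{prefix})(.*?)(.{#{keep}})(?=#{prefix}|\z)/
  text.gsub(pattern) { "#{$1}#{'*' * $2.length}#{$3}" }
end

puts obfuscate("abc_1234 xyz def_123aa4a56", %w[abc_ def_])
```

Segments shorter than keep characters are not handled specially in this sketch.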

Regular Expression replacement to convert Less mixins to Scss

I'm looking to convert Less mixin calls to their equivalents in Scss:
.mixin(); should become #mixin();
.mixin(0); should become #mixin(0);
.mixin(0; 1; 2); should become #mixin(0, 1, 2);
I'm having the most difficulty with the third example, as I essentially need to match n groups separated by semicolons, and replace those with the same groups separated by commas. I suppose this relies on some sort of repeating groups functionality in regexes that I'm not familiar with.
It's not enough to simply replace semicolons within parentheses. I need a regex that will only match the \.[\w\-]+\(.*\) format of mixins, but obviously with some magic in the second match group to handle the 3rd example above.
I'm doing this in Ruby, so if you're able to provide replacement syntax that's compatible with gsub, that would be awesome. I would like a single regex replacement, something that doesn't require multiple passes to clean up the semicolons.

I suggest adding two capturing groups around the subvalues you need and using an additional gsub inside the first gsub's block to replace ; with , only in the 2nd group:
s = ".mixin(0; 1; 2);"
puts s.gsub(/\.([\w\-]+)(\(.*\))/) { "##{$1}#{$2.gsub(/;/, ',')}" }
# => #mixin(0, 1, 2);
The pattern details:
\. - a literal dot
([\w\-]+) - Group 1 capturing 1 or more word chars ([a-zA-Z0-9_]) or -
(\(.*\)) - Group 2 capturing a (, then any 0+ chars other than line break characters, as many as possible, up to and including the last ). NOTE: if a line can contain more than one mixin call, use lazy matching, (\(.*?\)), here.
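To illustrate the note about lazy matching, here is a small sketch that runs the greedy and the lazy pattern on a line with two mixin calls. The helper name convert_mixins and the sample input are made up for the illustration.

```ruby
# Hypothetical helper: swaps the leading dot for '#' and turns the semicolons
# inside the captured argument list into commas.
def convert_mixins(less, pattern)
  less.gsub(pattern) do
    name, args = $1, $2
    "##{name}#{args.tr(';', ',')}"
  end
end

less = ".a(0; 1); .b(2; 3);"

# Greedy: group 2 runs to the LAST ")" on the line and swallows the second call.
greedy = convert_mixins(less, /\.([\w\-]+)(\(.*\))/)

# Lazy: group 2 stops at the first ")", so each call is converted on its own.
lazy = convert_mixins(less, /\.([\w\-]+)(\(.*?\))/)

puts greedy
puts lazy
```

With the greedy pattern the semicolon that separates the two calls is also turned into a comma, which is why the lazy form is the safer choice when several calls share a line.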

Here you go:
less_style = ".mixin(0; 1; 2);"
# convert the leading period to #
less_style.gsub!(/^\./, '#')
# convert the inner semicolons to commas
scss_style = less_style.gsub(/(?<=[\(\d]);/, ',')
scss_style
# => "#mixin(0, 1, 2);"
The second regex uses a positive lookbehind. You can read about those here: http://www.regular-expressions.info/lookaround.html
I also use this neat web app to play around with regexes: http://rubular.com/

This will get you a single pass through gsub:
".mixin(0; 1; 2);".gsub(/(?<!\));|\./, ";" => ",", "." => "#")
=> "#mixin(0, 1, 2);"
It's an alternation (OR) regex with a hash that supplies the replacement for each matched string.
Judging from your example, I'm assuming you only want to replace semicolons that do not follow a closing paren, hence the negative lookbehind: (?<!\));
You can modify/build on this with other expressions. Even add more OR conditions to the regex.
Also, you can use the block version of gsub if you need more options.
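A quick sketch that checks the hash-replacement form against the three examples from the question, plus one caveat. The constant names and the decimal example are illustrative assumptions.

```ruby
PATTERN = /(?<!\));|\./
REPLACEMENTS = { ";" => ",", "." => "#" }.freeze

examples = [".mixin();", ".mixin(0);", ".mixin(0; 1; 2);"]
converted = examples.map { |s| s.gsub(PATTERN, REPLACEMENTS) }
puts converted

# Caveat: `\.` matches every dot, not only the leading one, so a decimal
# argument is mangled as well.
decimal = ".mixin(0.5; 1);".gsub(PATTERN, REPLACEMENTS)
puts decimal
```

If decimal arguments can occur, anchoring the dot alternative (for example with a check that a name follows) or switching to the block form of gsub is probably needed.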

What is the best way to delimit a csv files thats contain commas and double quotes?

Let's say I have the following string and I want the output below without requiring the csv library.
this, "what I need", to, do, "i, want, this", to, work
this
what I need
to
do
i, want, this
to
work

This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."
We can solve it with a beautifully-simple regex:
"([^"]+)"|[^, ]+
The left side of the alternation | matches complete "quotes" and captures the contents to Group 1. The right side matches characters that are neither commas nor spaces, and we know they are the right ones because they were not matched by the expression on the left.
Option 2: Allowing Multiple Words
In your input, all tokens are single words, but if you also want the regex to work for my cat scratches, "what I need", your dog barks, use this:
"([^"]+)"|[^, ]+(?:[ ]*[^, ]+)*
The only difference is the addition of (?:[ ]*[^, ]+)* which optionally adds spaces + characters, zero or more times.
This program shows how to use the regex:
subject = 'this, "what I need", to, do, "i, want, this", to, work'
regex = /"([^"]+)"|[^, ]+/
# put Group 1 captures in an array
mymatches = []
subject.scan(regex) {|m|
$1.nil? ? mymatches << $& : mymatches << $1
}
mymatches.each { |x| puts x }
Output
this
what I need
to
do
i, want, this
to
work
Reference
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...
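A variation on the program above, offered as a sketch: if the unquoted alternative is wrapped in its own capture group, scan returns a pair per match and no $& bookkeeping is needed. The constant FIELD and the method split_fields are hypothetical names; the pattern is the "Option 2" regex with one added group.

```ruby
# Group 1: contents of a quoted field; group 2: an unquoted field, which may
# contain internal spaces (the "Option 2" pattern wrapped in a capture group).
FIELD = /"([^"]+)"|([^, ]+(?:[ ]*[^, ]+)*)/

def split_fields(line)
  # Exactly one of the two groups is non-nil for each match.
  line.scan(FIELD).map { |quoted, bare| quoted || bare }
end

puts split_fields('this, "what I need", to, do, "i, want, this", to, work')
puts split_fields('my cat scratches, "what I need", your dog barks')
```

Like the original regex, this does not handle escaped quotes inside a quoted field.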

String gsub - Replace characters between two elements, but leave surrounding elements

Suppose I have the following string:
mystring = "start/abc123/end"
How can I replace the abc123 with something else, while leaving the "start/" and "/end" elements intact?
I had the following to match for the pattern, but it replaces the entire string. I was hoping to just have it replace the abc123 with 123abc.
mystring.gsub(/start\/(.*)\/end/,"123abc") #=> "123abc"
Edit: The characters between the start & end elements can be any combination of alphanumeric characters, I changed my example to reflect this.

You can do it using this character class : [^\/] (all that is not a slash) and lookarounds
mystring.gsub(/(?<=start\/)[^\/]+(?=\/end)/,"7")

For your example, you could perhaps use:
mystring.gsub(/\/(.*?)\//,"/7/")
This matches the string you're replacing together with the two slashes around it, and puts the slashes back in the substitution.

Alternatively, you could capture the pieces of the string you want to keep and interpolate them around your replacement. This turns out to be much more readable than lookaheads/lookbehinds:
irb(main):010:0> mystring.gsub(/(start)\/.*\/(end)/, "\\1/7/\\2")
=> "start/7/end"
\\1 and \\2 here refer to the numbered captures inside of your regular expression.

The problem is that you're replacing the entire matched string, "start/8/end", with "7". You need to include the matched characters you want to persist:
mystring.gsub(/start\/(.*)\/end/, "start/7/end")
Alternatively, just match the digits:
mystring.gsub(/\d+/, "7")

You can do this by grouping the start and end elements in the regular expression and then referring to these groups in the substitution string (named groups are referenced with \k<name>):
mystring.gsub(/(?<start>start\/).*(?<end>\/end)/, "\\k<start>7\\k<end>")
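For comparison, a short sketch of three of the approaches from the answers, using the 123abc replacement the question asks for. The group names head and tail are illustrative.

```ruby
mystring = "start/abc123/end"

# Lookarounds: only the middle segment is part of the match.
with_lookaround = mystring.gsub(/(?<=start\/)[^\/]+(?=\/end)/, "123abc")

# Block form: rebuild the string around the captured delimiters.
with_block = mystring.gsub(/(start\/).*(\/end)/) { "#{$1}123abc#{$2}" }

# Named groups: \k<name> stays unambiguous even though the replacement
# text begins with a digit.
with_names = mystring.gsub(/(?<head>start\/).*(?<tail>\/end)/, '\k<head>123abc\k<tail>')

puts with_lookaround, with_block, with_names
```

The block and named-group forms are worth considering whenever the replacement text starts with a digit, since a numbered backreference directly followed by digits is easy to misread.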

Ruby regular expression

Apparently I still don't understand exactly how it works ...
Here is my problem: I'm trying to match numbers in strings such as:
910 -6.258000 6.290
That string should give me an array like this:
[910, -6.258000, 6.290]
while the string
blabla9999 some more text 1.1
should not be matched.
The regex I'm trying to use is
/([-]?\d+[.]?\d+)/
but it doesn't do exactly that. Could someone help me?
It would be great if the answer could clarify the use of parentheses in the matching.

Here's a pattern that works:
/^[^\d]+?\d+[^\d]+?\d+[\.]?\d+$/
Note that [^\d]+ means at least one non digit character.
On second thought, here's a more generic solution that doesn't need a complicated regular expression:
str.gsub(/[^\d.-]+/, " ").split.collect{|d| d.to_f}
Example:
str = "blabla9999 some more text -1.1"
Parsed:
[9999.0, -1.1]

The different kinds of brackets have different meanings.
[] defines a character class: it matches exactly one character that is a member of the class.
() defines a capturing group: the substring matched by the part of the pattern inside the parentheses is stored so you can refer to it later (for example as $1).
You did not define any anchors so your pattern will match your second string
blabla9999 some more text 1.1
      ^^^^ here           ^^^ and here
Maybe this is more what you wanted
^(\s*-?\d+(?:\.\d+)?\s*)+$
See it here on Regexr
^ anchors the pattern to the start of the string and $ to the end.
It allows whitespace (\s) before and after each number, and an optional fractional part, (?:\.\d+)?. The outer group followed by + means this number pattern must match at least once.
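A small sketch of how the bracket types affect what scan returns, using the input line from the question. The variable names are illustrative.

```ruby
line = "910 -6.258000 6.290"

# No capturing group ((?:...) only groups): scan returns the full text of
# each match.
whole = line.scan(/-?\d+(?:\.\d+)?/)

# Two capturing groups: scan returns one array of group contents per match,
# with nil for a group that did not take part in the match.
parts = line.scan(/(-?\d+)(\.\d+)?/)

# A character class matches exactly one character from the set.
signs = line.scan(/[-.]/)

p whole
p parts
p signs
```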

Maybe /(-?\d+(\.\d+)?)+/
irb(main):010:0> "910 -6.258000 6.290".scan(/(\-?\d+(\.\d+)?)+/).map{|x| x[0]}
=> ["910", "-6.258000", "6.290"]

str = " 910 -6.258000 6.290"
str.scan(/-?\d+\.?\d+/).map(&:to_f)
# => [910.0, -6.258, 6.29]
If you don't want integers to be converted to floats, try this:
str = " 910 -6.258000 6.290"
str.scan(/-?\d+\.?\d+/).map do |ns|
ns[/\./] ? ns.to_f : ns.to_i
end
# => [910, -6.258, 6.29]
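One thing to watch: -?\d+\.?\d+ needs at least two digits, so a single-digit number such as 5 is never matched. The sketch below combines a pattern with an optional fractional part and a whole-line check, so that lines containing other text are rejected. The names NUMBER, NUMBER_LINE and parse_numbers are hypothetical, and returning nil for a rejected line is an assumption about the desired behaviour.

```ruby
# An optional minus sign, digits, and an optional fractional part.
NUMBER = /-?\d+(?:\.\d+)?/
# The whole line must consist of whitespace-separated numbers only.
NUMBER_LINE = /\A\s*#{NUMBER}(?:\s+#{NUMBER})*\s*\z/

def parse_numbers(line)
  return nil unless line.match?(NUMBER_LINE)
  # Tokens with a decimal point become Floats, the rest Integers.
  line.scan(NUMBER).map { |t| t.include?(".") ? t.to_f : t.to_i }
end

p parse_numbers(" 910 -6.258000 6.290")
p parse_numbers("blabla9999 some more text 1.1")
p parse_numbers("5 7")
```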
