Match balanced occurrences of nested tag - ruby

I have a test string:
s = "A test [[you|n|note|content of the note with a [[link|n|link|http://link]] inside]] paragraph. wef [[you|n|note|content of the note with a [[link|n|link|http://link]] inside]] test".
I need to match the occurrences of the [[...]] parts of the string. There can be up to the second level of nested [[ ]] tags in the string (as shown in the test string).
I started with /\[\[.*?\]\]/, but that only matches the following:
[[you|n|note|content of the note with a [[link|n|link|http://link]] (it's missing the last occurrence of the ]].
How do I go about matching the remainder of each [[ .. ]] block? Is this possible with regex?

If you don't have single isolated [ or ], then it is fairly simple. The following assumes no restriction on the nesting level.
s.scan(/(?<match>\[\[(?:[^\[\]]|\g<match>)*\]\])/).flatten
returns:
[
"[[you|n|note|content of the note with a [[link|n|link|http://link]] inside]]",
"[[you|n|note|content of the note with a [[link|n|link|http://link]] inside]]"
]
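To make the recursive pattern above easier to read, here is a sketch of the same regex written in extended (/x) mode with comments. The variable names balanced and sample are only illustrative, and the sample string is a shortened version of the one in the question.

```ruby
# Same pattern as above, spread out with the x flag so each part can be annotated.
balanced = /
  (?<match>        # named group, so the pattern can call itself
    \[\[           # opening double bracket
    (?:
      [^\[\]]      # any single character that is not a bracket
      |
      \g<match>    # or a complete nested block, matched recursively
    )*
    \]\]           # closing double bracket
  )
/x

sample = "A test [[you|n|note|content with a [[link|n|link|http://link]] inside]] paragraph."
result = sample.scan(balanced).flatten
puts result.inspect
```

Because the pattern contains a capture group, scan returns one array per match, which is why flatten is applied.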

Here's a non-regex solution. I've assumed left (right) brackets always appear in pairs.
level = 0
s.each_char.each_cons(2).with_index.with_object([]) do |(pair, i), a|
case pair.join
when "[["
level += 1
a << i if level==1
when "]]"
a << i+1 if level==1
level -= 1
end
end.each_slice(2).map { |b,e| s[b..e] }
#=> ["[[you|n|note|content of the note with a [[link|n|link|http://link]] inside]]",
# "[[you|n|note|content of the note with a [[link|n|link|http://link]] inside]]"]

Related

What do these symbols mean in the RFC docs regarding grammars?

Here are the examples:
Transfer-Encoding = "Transfer-Encoding" ":" 1#transfer-coding
Upgrade = "Upgrade" ":" 1#product
Server = "Server" ":" 1*( product | comment )
delta-seconds = 1*DIGIT
Via = "Via" ":" 1#( received-protocol received-by [ comment ] )
chunk-extension= *( ";" chunk-ext-name [ "=" chunk-ext-val ] )
http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]
date3 = month SP ( 2DIGIT | ( SP 1DIGIT ))
Questions are:
What is the 1#transfer-coding (the 1# regarding the rule transfer-coding)? Same with 1#product.
What does 1 times x mean, as in 1*( product | comment )? Or 1*DIGIT.
What do the brackets mean, as in [ comment ]? The parens (...) group it all, but what about the [...]?
What does the *(...) mean, as in *( ";" chunk-ext-name [ "=" chunk-ext-val ] )?
What do the nested square brackets mean, as in [ abs_path [ "?" query ]]? Nested optional values? It doesn't make sense.
What does 2DIGIT and 1DIGIT mean, where do those come from / get defined?
I may have missed where these are defined, but knowing these would help clarify how to parse the grammar definitions they use in the RFCs.
I get the rest of the grammar notation, just not these few remaining pieces.
Update: Looks like this is a good start.
Square brackets enclose an optional element sequence:
[foo bar]
is equivalent to
*1(foo bar).
Specific Repetition: nRule
A rule of the form:
<n>element
is equivalent to
<n>*<n>element
That is, exactly <n> occurrences of <element>. Thus, 2DIGIT is a
2-digit number, and 3ALPHA is a string of three alphabetic
characters.
Variable Repetition: *Rule
The operator "*" preceding an element indicates repetition. The full
form is:
<a>*<b>element
where <a> and <b> are optional decimal values, indicating at least
<a> and at most <b> occurrences of the element.
Default values are 0 and infinity so that *<element> allows any
number, including zero; 1*<element> requires at least one;
3*3<element> allows exactly 3; and 1*2<element> allows one or two.
But what I'm still missing is what the # means?
Update 2: Found it I think!
#RULE: LISTS
A construct "#" is defined, similar to "*", as follows:
<l>#<m>element
indicating at least <l> and at most <m> elements, each separated
by one or more commas (","). This makes the usual form of lists
very easy; a rule such as '(element *("," element))' can be shown
as "1#element".
Also, what do these mean?
1*2DIGIT
2*4DIGIT
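Reading these with the variable repetition rule quoted above (<a>*<b>element), 1*2DIGIT should mean one or two digits and 2*4DIGIT two to four digits. As a rough sketch, the prefix can be translated into a regex quantifier. The helper name repetition_to_quantifier is hypothetical, and [0-9] is assumed to stand in for DIGIT.

```ruby
# Hypothetical helper: turn a repetition prefix such as "1*2", "1*", "*" or "3"
# into the equivalent regex quantifier.
def repetition_to_quantifier(prefix)
  if prefix.include?('*')
    min, max = prefix.split('*', 2)
    min = '0' if min.nil? || min.empty?   # a missing lower bound defaults to 0
    "{#{min},#{max}}"                     # a missing upper bound means unbounded
  else
    "{#{prefix}}"                         # nRule: exactly n occurrences
  end
end

one_or_two_digits = /\A[0-9]#{repetition_to_quantifier('1*2')}\z/  # 1*2DIGIT
two_to_four_digits = /\A[0-9]#{repetition_to_quantifier('2*4')}\z/ # 2*4DIGIT

puts one_or_two_digits.match?("42")     # two digits
puts two_to_four_digits.match?("2024")  # four digits
```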

Ruby regex: union 2K values in one regex

I am writing a process that scans a batch of text files and captures a file's name if any of 2000 literals occurs in it (once or many times). I'm thinking of combining that many values into one regex. Do you think it's doable? I tested with 100 values and it looks OK. Thanks, all.
The code below depicts my flow with sample code, just without the looping.
# 1. read regex value list as file [alpha,fox, delta] # 2000 values
# 2. read file into s #5000 files
# 3. find if any of #1 values exists in each #2 file. *with regex tweaks to match format dbname.dob.table
s = '1 dbName.dbo.ALPHA 2 DBNAME.bcd.ALPHA 3 dbName..ALPHA 4 ALPHA 5x dbName.alphA 6x alpha.XX 7x ###dbName.###a.alpha --alpha
dbName..FOX dbName.dbo.DELTA clarity.aba..fox '
value1 = '(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:alpha)(?=\s|$)'
value2 = '(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:fox)(?=\s|$)'
##...
value2000 = '(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:delta)(?=\s|$)'
regex = /#{value1}|#{value2}|#{value2000}/i ## can I union 2000 regex's ???
puts 'reg1: ' + regex.to_s
puts 'result: ' + s.scan(regex).to_s
if s.scan(regex) then puts '...Match!!!d' end
Declaring 2000 variables is highly unnecessary; you should define all values in a single array, then somehow loop through them.
Also, the regular expression is highly repetitive - e.g. the use of (?:dbName\.[a-z]*\.) 2000 times. This can be simplified by grouping all of your values within the non-capture group as follows:
values = %w(alpha fox delta)
regex = /(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:#{Regexp.union(values)})(?=\s|$)/
This is the result:
/(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:(?-mix:alpha|fox|delta))(?=\s|$)/
If you extend that values array to contain 2000 strings, the other code does not need to change.
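One detail worth checking: an interpolated Regexp carries its own flags (the (?-mix:...) wrapper shown above), so an outer /i would not apply to the literals. A minimal sketch that keeps the question's case-insensitive matching, assuming it is acceptable to interpolate the pattern text via Regexp#source instead; the names alternatives and sample are illustrative only.

```ruby
values = %w[alpha fox delta]  # stand-in for the 2000 literals

# Regexp.union escapes each literal and joins them with "|".
# .source returns the bare pattern text, so the outer /i flag applies to it.
alternatives = Regexp.union(values).source

regex = /(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:#{alternatives})(?=\s|$)/i

sample = "1 dbName.dbo.ALPHA 2 alpha.XX 3 dbName..FOX"
matches = sample.scan(regex)
puts matches.inspect
```

Since the pattern has no capture groups, scan returns the matched strings directly.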
Provided two conditions are met, I would do it as follows. I think this would be far more efficient than using a gigantic regular expression, which, by its nature, requires a linear search of the "bad words" for each word in the string until a match is found or it is determined that there are no matches.
We are given a file whose path is contained in a variable fname and an array of bad words:
arr = ["alpha", "fox", "delta", "charlie", "mabel"]
The first condition that I spoke of above is that, by way of example, "ALPHA" and "Alpha" match "alpha", but "aLPha" does not (or some variant of that).
The second condition is that there is a regular expression with a capture group that would capture a bad word if a bad word were present at the given location in a match. For example:
regex = /(?<=^|\s)(?:dbName\.[a-z]*\.)?(\p{Alpha}+)(?=\s|$)/
Wherever there is a match, the capture group (\p{Alpha}+) would capture a string of one or more alphabetic characters whose value is assigned to the global variable $1. We will then check to see if the value of $1 is a bad word. (The regular expression might have other capture groups as well, in which case we might be looking for $2 or $3, say, or a named capture group.)
If there were more than one such regular expression to check for, the code below could be executed for each of them until a match is found or it is determined that there are no more matches.
The first step is to convert the array of bad words to a set:
require 'set'
bad_words = arr.flat_map { |w| [w, w.capitalize, w.upcase] }.to_set
#=> #<Set: {"alpha", "Alpha", "ALPHA", "fox", "Fox", "FOX",
# "delta", "Delta", "DELTA", "charlie", "Charlie", "CHARLIE",
# "mabel", "Mabel", "MABEL"}>
This allows very fast word lookups--much faster than stepping through an array. We may then search the file as follows.
rv = IO.foreach(fname).any? do |line|
line.gsub(regex).any? { bad_words.include?($1) }
end
IO::foreach without a block is seen to return an enumerator. We can then chain that to any? to determine if there is a line that contains a match of the regular expression and the value of its capture group is contained in the set bad_words. If such a line is found the search terminates and true is returned; else, false is returned.
It is seen that String#gsub without a block returns an enumerator, which here I've chained to any?. This form of gsub has nothing to do with string replacements; it just generates matches. Those matches are passed to the block, but we are only interested in the contents of the capture group, which are held by $1. Hence the expression bad_words.include?($1).

Regex returning weird arrays

I want to make an array of results from a string like this one, using a regular expression:
results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday
Here’s my regex as it stands. It works in Sublime Text’s regex search but not in Ruby:
(results)\|.*?\\n(?=((results\|)|(timestamps\|\|)))
and this would be the desired result:
1. results|foofoofoo
2. results|barbarbarbar
3. results|googoogoo
Instead I’m getting these weird returns, and I can’t understand it. Why does this not select the result lines?
Match 1
1. results
2. results|
3. results|
4.
Match 2
1. results
2. results|
3. results|
4.
Match 3
1. results
2. timestamps||
3.
4. timestamps||
Here’s the actual code using the regex:
#create new lines for each regex'd line body with that body set as the raw attribute
host_scan.raw.scan(/(?:results)\|.*?\\n(?=((?:results\|)|(?:timestamps\|\|)))/).each do |body|
#lines << Line.new({:raw => body})
end
As Kendall Frey already stated, you are creating too many capture groups. There is no need to group the first literal “results|”, and no need to wrap the elements of your alternation in individual non-capturing groups. What you intend is this regex:
/results\|.*?(?=\\n(?:results\||timestamps\|\|))/
or, if you don’t mind repeating the \\n part, you can do away with the non-capturing subgroup:
/results\|.*?(?=\\nresults\||\\ntimestamps\|\|)/
– both will return an array of matched values as specified in your question.
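To see why the capture groups matter, here is a small comparison. It assumes, as the \\n in the patterns suggests, that the input contains a literal backslash followed by n rather than real newlines; the variable names are illustrative.

```ruby
# Single quotes keep \n as two characters: a backslash and the letter n.
raw = 'results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday'

# With capture groups, scan returns one array of group contents per match.
with_groups = raw.scan(/(results)\|.*?\\n(?=((results\|)|(timestamps\|\|)))/)

# Without capture groups, scan returns the full text of each match.
without_groups = raw.scan(/results\|.*?(?=\\n(?:results\||timestamps\|\|))/)

puts with_groups.first.inspect
puts without_groups.inspect
```

The first call reproduces the "weird" output from the question, because String#scan reports the groups instead of the whole match as soon as the pattern contains any.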
I'm guessing it has something to do with capturing groups. If you change all your (...) to (?:...) it will eliminate capturing groups.
Rather than jump to a regex, which is a much more complicated way to get at the data, use split("\n").
text = "results|foofoofoo\nresults|barbarbarbar\nresults|googoogoo\ntimestamps||friday"
ary = text.split("\n")
ary is:
[
"results|foofoofoo",
"results|barbarbarbar",
"results|googoogoo",
"timestamps||friday"
]
Slice that and you can get:
ary[0..2]
=> ["results|foofoofoo", "results|barbarbarbar", "results|googoogoo"]
EDIT:
Based on the comment that there are more carriage returns and complex characters in the strings:
require 'awesome_print'
text = "results|foofoofoo\nmorefoo\nandevenmorefoo\nresults|barbarbarbar\nandmorebar\nandyetagainmorebar\nresults|googoogoo\ntimestamps||friday"
ap text.sub(/\|\|friday$/, '').split('results')[1..-1].map{ |l| 'results' << l }
Which outputs:
[
[0] "results|foofoofoo\nmorefoo\nandevenmorefoo\n",
[1] "results|barbarbarbar\nandmorebar\nandyetagainmorebar\n",
[2] "results|googoogoo\ntimestamps"
]
The answer turned out to lie in the parentheses. Wrapping in parentheses caused it to return the entire match instead of just the tail delimiter.
host_scan.raw.scan(/((?:results\|.*?\\n)(?=(?:results\|)|(?:timestamps\|\|)))/).each do |body|
#lines << Line.new({:raw => body})
end

Checking if a string has balanced parentheses

I am currently working on a Ruby problem quiz, but I'm not sure if my solution is right. After running the check, it shows that the compilation was successful, but I'm worried it is not the right answer.
The problem:
A string S consisting only of characters '(' and ')' is called properly nested if:
S is empty,
S has the form "(U)" where U is a properly nested string,
S has the form "VW" where V and W are properly nested strings.
For example, "(()(())())" is properly nested and "())" isn't.
Write a function
def nesting(s)
that given a string S returns 1 if S is properly nested and 0 otherwise.
Assume that the length of S does not exceed 1,000,000. Assume that S consists only of characters '(' and ')'.
For example, given S = "(()(())())" the function should return 1 and given S = "())" the function should return 0, as explained above.
Solution:
def nesting ( s )
# write your code here
if s == '(()(())())' && s.length <= 1000000
return 1
elsif s == ' ' && s.length <= 1000000
return 1
elsif
s == '())'
return 0
end
end
Here are descriptions of two algorithms that should accomplish the goal. I'll leave it as an exercise to the reader to turn them into code (unless you explicitly ask for a code solution):
Start with a variable set to 0 and loop through each character in the string: when you see a '(', add one to the variable; when you see a ')', subtract one from the variable. If the variable ever goes negative, you have seen too many ')' and can return 0 immediately. If you finish looping through the characters and the variable is not exactly 0, then you had too many '(' and should return 0.
Remove every occurrence of '()' in the string (replace with ''). Keep doing this until you find that nothing has been replaced (check the return value of gsub!). If the string is empty, the parentheses were matched. If the string is not empty, it was mismatched.
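For readers who do want code, here is a minimal sketch of both descriptions. It assumes, as the problem statement allows, that the input contains only '(' and ')'. The name nesting_by_removal is made up for the second variant.

```ruby
# Algorithm 1: track the number of currently open parentheses.
def nesting(s)
  depth = 0
  s.each_char do |c|
    depth += (c == '(' ? 1 : -1)
    return 0 if depth < 0        # a ')' appeared with nothing open
  end
  depth.zero? ? 1 : 0            # anything still open means too many '('
end

# Algorithm 2: repeatedly delete "()" until nothing changes.
def nesting_by_removal(s)
  rest = s.dup                   # work on a copy; gsub! mutates its receiver
  loop { break unless rest.gsub!('()', '') }  # gsub! returns nil when nothing was replaced
  rest.empty? ? 1 : 0
end

puts nesting("(()(())())")
puts nesting_by_removal("())")
```

The counter version makes a single pass over the string, while the removal version may rescan the string many times, so the first is the safer choice near the stated length limit.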
You're not supposed to just enumerate the given examples. You're supposed to solve the problem generally. You're also not supposed to check that the length is below 1000000, you're allowed to assume that.
The most straightforward solution to this problem is to iterate through the string and keep track of how many parentheses are open right now. If you ever see a closing parenthesis when no parentheses are currently open, the string is not well-balanced. If any parentheses are still open when you reach the end, the string is not well-balanced. Otherwise it is.
Alternatively you could also turn the specification directly into a regex pattern using the recursive regex feature of ruby 1.9 if you were so inclined.
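A sketch of what that recursive pattern could look like, using Ruby's \g<name> subexpression call. The method name nesting_with_regex is made up, String#match? needs Ruby 2.4 or later, and I have not checked how this behaves on inputs near the 1,000,000-character limit.

```ruby
# The group "paren" matches "(" + zero or more nested paren groups + ")".
# The outer * allows a sequence of such groups (the "VW" form) or none (empty S).
def nesting_with_regex(s)
  s.match?(/\A(?<paren>\(\g<paren>*\))*\z/) ? 1 : 0
end

puts nesting_with_regex("(()(())())")
puts nesting_with_regex("())")
```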
My algorithm would use a stack for this purpose. Stacks are well suited to this kind of problem.
Algorithm
Define a hash that maps each opening bracket to its closing bracket, for instance {"(" => ")", "{" => "}", and so on...}
Declare a stack (in our case, an array), i.e. brackets = []
Loop through the string using each_char; if the character is a key of the hash (an opening bracket), push it onto brackets
Within the same loop, if the character is a value of the hash (a closing bracket), pop brackets and check that the popped opener maps to this closer; if it does not, or the stack was empty, the string is unbalanced
In the end, if the brackets stack is empty, the brackets are balanced.
def brackets_balanced?(string)
  brackets_hash = {"(" => ")", "{" => "}", "[" => "]"}
  brackets = []
  string.each_char do |x|
    if brackets_hash.key?(x)
      # opening bracket: remember it
      brackets.push(x)
    elsif brackets_hash.value?(x)
      # closing bracket: it must match the most recent opener
      return false unless brackets_hash[brackets.pop] == x
    end
  end
  brackets.empty?
end
You can solve this problem theoretically, by using a grammar like this:
S ← LSRS | ε
L ← (
R ← )
A grammar like this is easy to recognize with a recursive algorithm.
That would be the most elegant solution. Otherwise, as already mentioned here, count the open parentheses.
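As an illustration of the recursive idea, here is a small recognizer written against the rule S → ( S ) S | empty, which is my own assumption about how to phrase the properly nested language for recursive descent. The method names are hypothetical, and deeply nested input could exhaust Ruby's call stack, so treat it as a sketch rather than a solution for very long strings.

```ruby
# Parses one S starting at pos and returns the position just after it,
# or nil if a "(" is never closed.
def parse_s(str, pos)
  while str[pos] == '('
    pos = parse_s(str, pos + 1)                    # the S inside the parentheses
    return nil if pos.nil? || str[pos] != ')'      # the matching ")" is required
    pos += 1                                       # continue with the trailing S
  end
  pos
end

def properly_nested?(str)
  parse_s(str, 0) == str.length                    # S must consume the whole string
end

puts properly_nested?("(()(())())")
puts properly_nested?("())")
```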
Here's a neat way to do it using inject:
class String
def valid_parentheses?
valid = true
self.gsub(/[^\(\)]/, '').split('').inject(0) do |counter, parenthesis|
counter += (parenthesis == '(' ? 1 : -1)
valid = false if counter < 0
counter
end.zero? && valid
end
end
> "(a+b)".valid_parentheses? # => true
> "(a+b)(".valid_parentheses? # => false
> "(a+b))".valid_parentheses? # => false
> "(a+b))(".valid_parentheses? # => false
You're right to be worried; I think you've got the wrong end of the stick, and you're solving the problem too literally. The statement that the string doesn't exceed 1,000,000 characters is just to stop people worrying about how slowly their code would run if the length were 100 times that, and the examples are just that, examples, not the definitive list of strings you can expect to receive.
I'm not going to do your homework for you (by writing the code), but will give you a pointer to a solution that occurs to me:
The string is correctly nested if every left bracket has a right bracket to the right of it, with either nothing or a correctly nested set of brackets between them. So how about a recursive function, or a loop, that removes every occurrence of "()" from the string? When you run out of matches, what are you left with? Nothing? Then it was a properly nested string. Something else (like ')' or ')(', etc.) would mean it was not correctly nested in the first place.
Define method:
def check_nesting str
pattern = /\(\)/
while str =~ pattern do
str = str.gsub pattern, ''
end
str.length == 0
end
And test it:
>ruby nest.rb (()(())())
true
>ruby nest.rb (()
false
>ruby nest.rb ((((()))))
true
>ruby nest.rb (()
false
>ruby nest.rb (()(((())))())
true
>ruby nest.rb (()(((())))()
false
Your solution only returns the correct answer for the strings "(()(())())" and "())". You surely need a solution that works for any string!
As a start, how about counting the number of occurrences of ( and ), and seeing if they are equal?

what does this backtick ruby code mean?

while line = gets
next if line =~ /^\s*#/ # skip comments
break if line =~ /^END/ # stop at end
#substitute stuff in backticks and try again
redo if line.gsub!(/`(.*?)`/) { eval($1) }
end
What I don't understand is this line:
line.gsub!(/`(.*?)`/) { eval($1) }
What does the gsub! exactly do?
the meaning of regex (.*?)
the meaning of the block {eval($1)}
It replaces each part of line that matches the pattern with the result of the block, modifying line in place.
It will match 0 or more of the previous subexpression (which was '.', match any one char). The ? modifies the .* RE so that it matches no more than is necessary to continue matching subsequent RE elements. This is called "non-greedy". Without the ?, the .* might also match the second backtick, depending on the rest of the line, and then the expression as a whole might fail.
The block returns the result of eval ("evaluate a Ruby expression") on the backreference, which is the part of the string between the back tick characters. This is specified by $1, which refers to the first paren-enclosed section ("backreference") of the RE.
In the big picture, the result of all this is that lines containing backtick-bracketed expressions have the part within the backticks (and the backticks) replaced with the result value of executing the contained Ruby expression. And since the outer block is subject to a redo, the loop will immediately repeat without rerunning the while condition. This means that the resulting expression is also subject to a backtick evaluation.
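A short sketch of the two behaviours described above: gsub! returns nil once nothing is left to replace (which is what eventually stops the redo), and the non-greedy .*? keeps each backtick pair separate. The sample strings are made up, and eval should only ever be run on input you trust.

```ruby
pattern = /`(.*?)`/
line = "sum: `1+2`, product: `2*3`"

# First pass: both backtick sections are evaluated and replaced in place.
first = line.gsub!(pattern) { eval($1).to_s }
# Second pass: no backticks remain, so gsub! returns nil and redo would not fire.
second = line.gsub!(pattern) { eval($1).to_s }

# Non-greedy vs greedy capture on a line with two backtick pairs.
lazy = "`1+2` and `3+4`".scan(/`(.*?)`/)
greedy = "`1+2` and `3+4`".scan(/`(.*)`/)

puts line
puts second.inspect
puts lazy.inspect
puts greedy.inspect
```

With the greedy form, the single capture spans both pairs and is not a valid Ruby expression, so eval would raise on it.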
Replaces everything between backticks in line with the result of evaluating the ruby code contained therein.
>> line = "one plus two equals `1+2`"
>> line.gsub!(/`(.*?)`/) { eval($1) }
>> p line
=> "one plus two equals 3"
.* matches zero or more characters, ? makes it non-greedy (i.e., it will take the shortest match rather than the longest).
$1 is the string which matched the stuff between the (). In the above example, $1 would have been set to "1+2". eval evaluates the string as ruby code.
line.gsub!(/`(.*?)`/) { eval($1) }
gsub! modifies line in place (instead of using line = line.gsub).
.*? so that it matches only up to the next `; otherwise a single match would stretch to the last ` on the line and swallow several backtick pairs.
The block evaluates whatever was matched (so, for example, if line contains `1+1`, eval would replace it with 2).

Resources