Regex to match pipes not within brackets or braces - ruby

I am trying to parse some wiki markup. For example, the following:
{{Infobox
| person
| name = Joe
| title = Ruler
| location = [[United States|USA]] | height = {{convert|12|m|abbr=on}}
| note = <ref>{{cite book|title= Some Book}}</ref>
}}
can be the text to start with. I first remove the starting {{ and ending }}, so I can assume those are gone.
I want to do .split(<regex>) on the string to split the string by all | characters that are not within braces or brackets. The regex needs to ignore the | characters in [[United States|USA]], {{convert|12|m|abbr=on}}, and {{cite book|title= Some Book}}. The expected result is:
[
'person',
'name = Joe',
'title = Ruler',
'location = [[United States|USA]]',
'height = {{convert|12|m|abbr=on}}',
'note = <ref>{{cite book|title= Some Book}}</ref>'
]
There can be line breaks at any point, so I can't just look for \n|. If there is extra white space in it, that is fine. I can easily strip out extra \s* or \n*.

You could split on:
\s*\|\s*(?![^{\[]*[]}])
Breakdown:
\s*\|\s* Match a pipe with any leading or trailing whitespaces
(?! Start of negative lookahead
[^{\[]* Match anything except { and [ as much as possible
[]}] Up to a closing ] or }
) End of negative lookahead
The negative lookahead asserts that we shouldn't reach } or ] without matching an opening pair.
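As a quick check, the split can be run against the sample from the question. This is only a minimal sketch: the variable names are illustrative, the closing bracket inside the character class is written as [\]}] (equivalent to []}] but explicit), and the leading template name is dropped by hand.

```ruby
markup = <<~WIKI
  {{Infobox
  | person
  | name = Joe
  | title = Ruler
  | location = [[United States|USA]] | height = {{convert|12|m|abbr=on}}
  | note = <ref>{{cite book|title= Some Book}}</ref>
  }}
WIKI

# Remove the outer {{ and }} as described in the question.
body = markup.strip[2...-2]

# Split on pipes that do not reach a closing ] or } before the next opener.
pipe = /\s*\|\s*(?![^{\[]*[\]}])/
fields = body.split(pipe).map(&:strip)
fields.shift # drop the template name ("Infobox")

puts fields
```

Note that the lookahead only inspects the text up to the next opening brace or bracket, so a template nested directly after a pipe, such as {{a|{{b}}}}, may still be split.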

I literally stole the regex from @WiktorStribiżew, but this should work for your input string:
regex = (/\w+(?:\s*=\s*(?:\[\[[^\]\[]*]]|{{[^{}]*}}|[^|{\[])*)?/)
arr = str.scan(regex).map{|l| l.strip.delete("\n")}[1..-1]
arr is now the array you've requested.

Related

How to grab all text inside of matching brackets with ruby and/or Regular Expressions

I am working on doing some code cleanup and need to make sure that my gsub! only runs on a small section of code. The portion of the code I need to examine starts with {{Infobox television (\{\{[Ii]nfobox\s[Tt]elevision to be technical) and ends with the matching double brackets "}}".
An example of the gsub! that will be run is text.gsub!(/\|(\s*)channel\s*=\s*(.*)\n/, "|\\1network = \\2\n")
...
{{Infobox television
| show_name = 60 Minutos
| image =
| director =
| developer =
| channel = [[NBC]]
| presenter = [[Raúl Matas]] (1977–86)<br />[[Raquel Argandoña]] (1979–81)
| language = [[Spanish language|Spanish]]
| first_aired = {{Date|7 April 1975}}
| website = {{url|https://foo.bar.com}}
}}
...
Note:
Using sub instead of gsub is not an option due to the fact that multiple instances of the parameter needed to be substituted may exist.
I cannot just look for the first set of }} as there may be multiple sets, as shown in the example above.
You may use a regex with a bit of recursion:
/(?=\{\{[Ii]nfobox\s[Tt]elevision)(\{\{(?>[^{}]++|\g<1>)*}})/
Or, if there are single { or } inside, you will need to also match those with (?<!{){(?!{)|(?<!})}(?!}):
/(?=\{\{[Ii]nfobox\s[Tt]elevision)(\{\{(?>[^{}]++|(?<!{){(?!{)|(?<!})}(?!})|\g<1>)*}})/
Details:
(?=\{\{[Ii]nfobox\s[Tt]elevision) - a positive lookahead making sure the current location is followed with {{Infobox television like string (with different casing)
(\{\{(?>[^{}]++|\g<1>)*}}) - Group 1 that matches the following:
\{\{ - a {{ substring
(?>[^{}]++|\g<1>)* - zero or more occurrences of:
[^{}]++ - 1 or more chars other than { and }
(?<!{){(?!{) - a { not enclosed with other {
(?<!})}(?!}) - a } not enclosed with other }
| - or
\g<1> - the whole Group 1 subpattern
}} - a }} substring
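To restrict the replacement to the infobox, one option is to match the whole block with the recursive pattern and run the inner gsub only on that match. This is a sketch under assumptions: the variable names and the shortened sample text are illustrative, and the closing braces are escaped for readability.

```ruby
infobox = /(?=\{\{[Ii]nfobox\s[Tt]elevision)(\{\{(?>[^{}]++|\g<1>)*\}\})/

text = <<~WIKI
  | channel = outside
  {{Infobox television
  | show_name = 60 Minutos
  | channel = [[NBC]]
  | first_aired = {{Date|7 April 1975}}
  }}
WIKI

# The outer gsub finds each balanced infobox block; the inner gsub
# renames the parameter only inside that block.
cleaned = text.gsub(infobox) do |block|
  block.gsub(/\|(\s*)channel\s*=\s*(.*)\n/, "|\\1network = \\2\n")
end

puts cleaned
```

The "| channel = outside" line is left alone because it is not part of the matched block.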
Can't give you a direct answer without spending a lot of time on it.
But it is notable that the first bracket set is at the beginning of a line, as is the last one.
So you have
/^{{(.*)^}}$/m
In Ruby, the m flag makes . match newlines as well (^ and $ already match at line boundaries). That will match everything between the braces. The () brackets mean that you can pull out what was matched inside the braces, for example:
string = <<_EOT
{{Infobox television
| show_name = 60 Minutos
| image =
| director =
| developer =
| channel = [[NBC]]
| presenter = [[Raúl Matas]] (1977–86)<br />[[Raquel Argandoña]] (1979–81)
| language = [[Spanish language|Spanish]]
| first_aired = {{Date|7 April 1975}}
| website = {{url|https://foo.bar.com}}
}}
_EOT
matcher = string.match(/^{{(.*)^}}$/m)
matcher[0] will give you the whole expression
matcher[1] will give you what was matched inside the () brackets
The danger with this is that it will do "greedy" matching and match the largest piece of text it can, so you will have to turn this off. Without more info on what you're trying to do I can't help any more.
NB - to match () brackets you have to escape them. See https://ruby-doc.org/core-2.1.1/Regexp.html for more info.
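The greedy behaviour mentioned above can be seen, and avoided, by making the quantifier lazy with .*?. A minimal sketch, assuming a made-up text with two top-level blocks (braces escaped for readability):

```ruby
text = "{{A\n| x = 1\n}}\nmiddle\n{{B\n| y = 2\n}}\n"

greedy = text.match(/^\{\{(.*)^\}\}$/m)  # runs to the LAST closing }}
lazy   = text.match(/^\{\{(.*?)^\}\}$/m) # stops at the FIRST closing }}

p greedy[1] # => "A\n| x = 1\n}}\nmiddle\n{{B\n| y = 2\n"
p lazy[1]   # => "A\n| x = 1\n"
```

The lazy version still relies on the closing }} being at the start of a line, so it is not a substitute for real brace balancing.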

How to pull out all characters from a to b in a string using regular expressions

So I have a section of text that is part of a larger body. I am trying to pull out one specific section... (The text is MediaWiki code by the way). Basically what I am trying to do is replace everything starting from {{ and ending at }} INCLUSIVE (brackets should be grabbed also).
| locator map = {{Location map|island of Ireland|relief=yes|caption=|float=center|marksize=5|lat= 53.50073|long=-10.14984}}
Now the current ruby REGEX I have is shown below and this works great if all the parameters are on one line as in the example above.
\|\s*locator\smap\s*=\s*\{\{[Ll]ocation map\s*\|(?<map>[A-Za-z0-9\s]*).*caption\s*=\s*(?<caption>[^\|]*).*\}\}
However, if the parameters are on multiple lines such as below, then the regular expression breaks.
| locator map = {{Location map
|Island of Ireland
|relief=yes|caption=|float=center
|marksize=5|lat= 53.50073|long=-10.14984
}}
| coords = {{coord|12|12|}}
Note that the last line should NOT be selected by the REGEX. I do not have my heart set on using regular expressions... If there is an easier way to get what I need, perhaps using Ruby's String class, that would be fine by me!
Try something simple:
\|\s*locator\smap[\s\S]+\}\}
Demo: https://regex101.com/r/BEUGNn/1
The code above gets you the same results that your code does. However, if you want to match only what is between the curly brackets { }, as indicated in your question, you can use regex lookarounds, which Ruby supports. Try this code:
(?<=\|\slocator\smap\s{6}\=\s\{\{)[\s\S]+\d+(?=\}\})
Demo: https://regex101.com/r/2JfrJU/1
You can use Oniguruma's subroutines to make it work even when you have nested curlies:
text = <<TEXT
| locator map = {{Location map
|Island of {{country}} <-- NESTED CURLY HERE
|relief=yes|caption=|float=center
|marksize=5|lat= 53.50073|long=-10.14984
}}
| coords = {{coord|12|12|}}
TEXT
field_name = "locator map"
re = %r[
#{Regexp.escape(field_name)} # find the key we want
\s* = \s* # then the equals sign
(?<curlies> # start subroutine (also the final capture region)
{{ # opening curlies, then
(?: # any number of
\g<curlies> # full curly tag
| # or
(?!{{). # any character that would not start a curly tag
)*
}} # then closing curlies
)
]xm # extended syntax, multiline matching
puts text[re, :curlies] # extract the curlies region
# => {{Location map
#    |Island of {{country}} <-- NESTED CURLY HERE
#    |relief=yes|caption=|float=center
#    |marksize=5|lat= 53.50073|long=-10.14984
#    }}
text[re, :curlies] = "SOMETHING" # replace it with something
puts text
# => | locator map = SOMETHING
#    | coords = {{coord|12|12|}}
Of course you can also use the other Regexp methods like gsub.
Code
R = /
(?<={{) # match two left brackets in a positive lookbehind
.* # match any number of any character, greedily
(?=}}) # match two right brackets in a positive lookahead
/xm # free-spacing regex definition and multi-line modes
def replace_it(str, replacement)
str.sub(R, replacement)
end
Examples
str =<<-END
| locator map = {{Location map
|Island of Ireland
|relief=yes|caption=|float=center
|marksize=5|lat= 53.50073|long=-10.14984
}}
END
str[R]
#=> "Location map\n |Island...|long=-10.14984\n "
replace_it(str, "How now, brown cow?")
#=> " | locator map = {{How now, brown cow?}}\n"
Another example:
str = "| locator map = {{pig{{dog}}cat}}"
str[R]
#=> "pig{{dog}}cat"
replace_it(str, "How now, brown cow?")
#=> "| locator map = {{How now, brown cow?}}"
In my opinion, a regex is the easiest and shortest way to solve your task. If you want to work with parameters spread over multiple lines, use the "m" modifier, like this: /your regex here/m. If your regex selects too long a string, you are using the greedy version of a quantifier. Greedy quantifiers look for the longest substring that matches the pattern; non-greedy (lazy) quantifiers look for the shortest. To get the non-greedy version, put "?" after the quantifier. For your example
| locator map = {{Location map
|Island of Ireland
|relief=yes|caption=|float=center
|marksize=5|lat= 53.50073|long=-10.14984
}}
| coords = {{coord|12|12|}}
the right regex would be:
/\|\s*locator\smap\s*=\s*\{\{[Ll]ocation map\s*\|(?<map>[A-Za-z0-9\s]*).*caption\s*=\s*(?<caption>[^\|]*).*?\}\}/m
There is a great site, rubular.com, where you can check your regular expressions. It shows the result immediately without writing any code, which can make working with regular expressions faster even if you don't love them.
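As a sanity check, the pattern above can be run against the multi-line example to confirm that the named groups are filled and that the coords line is left out. This is a sketch only: the variable names are illustrative, and the map capture needs a strip because [A-Za-z0-9\s]* also swallows the trailing newline.

```ruby
text = <<~WIKI
  | locator map = {{Location map
  |Island of Ireland
  |relief=yes|caption=|float=center
  |marksize=5|lat= 53.50073|long=-10.14984
  }}
  | coords = {{coord|12|12|}}
WIKI

re = /\|\s*locator\smap\s*=\s*\{\{[Ll]ocation map\s*\|(?<map>[A-Za-z0-9\s]*).*caption\s*=\s*(?<caption>[^\|]*).*?\}\}/m

m = text.match(re)
p m[:map].strip            # => "Island of Ireland"
p m[:caption]              # => ""
p m[0].include?("coords")  # => false
```

The lazy .*? stops at the first }} after the caption, so a template nested inside the locator map would still end the match too early.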

Return specific segment from Ruby regex

I have a big chunk of text I am scanning through and I am searching with a regex that is prefixed by some text.
var1 = textchunk.match(/thedata=(\d{6})/)
My result from var1 would return something like:
thedata=123456
How do I only return the number part of the search (in the example above, just 123456) without taking var1 and then stripping thedata= off in a line below?
If you expect just one match in the string, you may use your own code and access the captures property and get the first item (since the data you need is captured with the first set of unescaped parentheses that form a capturing group):
textchunk.match(/thedata=(\d{6})/).captures.first
If you have multiple matches, just use scan:
textchunk.scan(/thedata=(\d{6})/)
NOTE: to only match thedata= followed with exactly 6 digits, add a word boundary:
/thedata=(\d{6})\b/
                ^^
or a lookahead (if there can be word chars after 6 digits other than digits):
/thedata=(\d{6})(?!\d)/
                ^^^^^^
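Because the pattern contains a capture group, scan returns an array of one-element arrays, so flatten is handy when only the numbers are wanted. A small sketch with a made-up textchunk that also shows the effect of the lookahead:

```ruby
textchunk = 'thedata=123456 x thedata=6543217 thedata=111222'

# Without a boundary, the first six digits of the 7-digit run are accepted.
loose  = textchunk.scan(/thedata=(\d{6})/).flatten
# With the lookahead, the 7-digit run is rejected entirely.
strict = textchunk.scan(/thedata=(\d{6})(?!\d)/).flatten

p loose  # => ["123456", "654321", "111222"]
p strict # => ["123456", "111222"]
```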
▶ textchunk = 'garbage=42 thedata=123456'
#⇒ "garbage=42 thedata=123456"
▶ textchunk[/thedata=(\d{6})/, 1]
#⇒ "123456"
▶ textchunk[/(?<=thedata=)\d{6}/]
#⇒ "123456"
The latter uses positive lookbehind.

Remove Certain Alphanumeric Characters from a String in Ruby

I have to validate a string based on the first alphanumeric character of the string. Certain characters can be part of the string, but if they are at the beginning then they have to be ignored.
For example:
--- BATest- 1 --
should be:
BATest-1
How do I remove dashes from beginning and end but not from middle?
To add to my question: can the first alphanumeric character decide if following alphanumeric characters are to be removed or not?
I.e. If A then nothing would need to be removed and throw a validation error; and yet if B then strip the string as mentioned above.
r = /
--+ # Match at least two hyphens
| # or
\s # Match a space
/x # Free-spacing regex definition mode
'--- BATest- 1 --'.gsub r, ""
#=> "BATest-1"
You asked to remove the dashes from the beginning and the end:
"--- BATest- 1 --".gsub(/^-+|-+$|\s/, "")
# => "BATest-1"
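The two answers differ once the string contains a run of hyphens in the middle. A minimal sketch with a made-up input; it uses \A and \z instead of ^ and $, which is an assumption that the whole string, not each line, should be trimmed:

```ruby
input = "--- BA--Test- 1 --"

# Removes every run of two or more hyphens, wherever it appears.
everywhere = input.gsub(/--+|\s/, "")
# Removes hyphens only at the very start and very end of the string.
edges_only = input.gsub(/\A-+|-+\z|\s/, "")

p everywhere # => "BATest-1"
p edges_only # => "BA--Test-1"
```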

Ruby regular expressions for movie titles and ratings

The quiz problem:
You are given the following short list of movies exported from an Excel comma-separated values (CSV) file. Each entry is a single string that contains the movie name in double quotes, zero or more spaces, and the movie rating in double quotes. For example, here is a list with three entries:
movies = [
%q{"Aladdin", "G"},
%q{"I, Robot", "PG-13"},
%q{"Star Wars","PG"}
]
Your job is to create a regular expression to help parse this list:
movies.each do |movie|
movie.match(regexp)
title,rating = $1,$2
end
# => for first entry, title should be Aladdin, rating should be G,
# => WITHOUT the double quotes
You may assume movie titles and ratings never contain double-quote marks. Within a single entry, a variable number of spaces (including 0) may appear between the comma after the title and the opening quote of the rating.
Which of the following regular expressions will accomplish this? Check all that apply.
regexp = /"([^"]+)",\s*"([^"]+)"/
regexp = /"(.*)",\s*"(.*)"/
regexp = /"(.*)", "(.*)"/
regexp = /(.*),\s*(.*)/
Would someone explain why the answer was (1) and (2)?
The resulting strings will be similar to "Aladdin", "G". Let's take a look at the correct answer #1:
/"([^"]+)",\s*"([^"]+)"/
"([^"]+)" = at least one character that is not a " surrounded by "
, = a comma
\s* = a number of spaces (including 0)
"([^"]+)" = like first
Which is exactly the type of strings you will get. Let's take a look at the above string:
"Aladdin", "G"
#^1      ^2^3^4
Now let's take a look at the second correct answer:
/"(.*)",\s*"(.*)"/
"(.*)" = any number (including 0) of almost any character surrounded by ".
, = a comma
\s* = any number of spaces (including 0)
"(.*)" = see first point
Which is correct as well, as the following irb session (using Ruby 1.9.3) shows:
'"Aladdin", "G"'.match(/"([^"]+)",\s*"([^"]+)"/) # number 1
# => #<MatchData "\"Aladdin\", \"G\"" 1:"Aladdin" 2:"G">
'"Aladdin", "G"'.match(/"(.*)",\s*"(.*)"/) # number 2
# => #<MatchData "\"Aladdin\", \"G\"" 1:"Aladdin" 2:"G">
Just for completeness, I'll explain why the third and fourth are wrong as well:
/"(.*)", "(.*)"/
The above regex is:
"(.*)" = any number (including 0) of almost any character surrounded by "
, = a comma
= a single space
"(.*)" = see first point
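Which is wrong because it requires exactly one space between the comma and the rating, while the problem allows any number of spaces (including 0). It fails, for example, on the third entry, as the following irb session shows:
'"Star Wars","PG"'.match(/"(.*)", "(.*)"/) # number 3
# => nil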
The fourth regex is:
/(.*),\s*(.*)/
which is:
(.*) = any number (including 0) of almost any character
, = a comma
\s* = any number (including 0) of spaces
(.*) = see first point
Which is wrong because the problem states that titles and ratings are surrounded by double quotes and that the results must not include them. This regex never checks for the surrounding quotes, so the quotes end up inside the captured groups, and it even accepts invalid strings like "," as the following irb session shows:
'","'.match(/(.*),\s*(.*)/) # number 4
# => #<MatchData "\",\"" 1:"\"" 2:"\"">
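The reasoning above can also be checked mechanically by running each candidate against all three entries and comparing the captures with the expected title and rating. A sketch only; the candidates, expected and results names are illustrative:

```ruby
movies = [
  %q{"Aladdin", "G"},
  %q{"I, Robot", "PG-13"},
  %q{"Star Wars","PG"}
]
expected = [["Aladdin", "G"], ["I, Robot", "PG-13"], ["Star Wars", "PG"]]

candidates = {
  1 => /"([^"]+)",\s*"([^"]+)"/,
  2 => /"(.*)",\s*"(.*)"/,
  3 => /"(.*)", "(.*)"/,
  4 => /(.*),\s*(.*)/
}

# A candidate passes only if every entry matches and both captures
# come out without the surrounding double quotes.
results = candidates.transform_values do |regexp|
  movies.zip(expected).all? do |movie, want|
    m = movie.match(regexp)
    !m.nil? && [m[1], m[2]] == want
  end
end

p results # => {1=>true, 2=>true, 3=>false, 4=>false}
```

Candidate 3 fails on the entry without a space, and candidate 4 fails because its captures keep the quotes.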

Resources