Regular expression to match some conditions given a formatted file name? - ruby

(Sorry for the bad title, any suggestion appreciated) ;-)
Well, consider those strings:
first = "SC/SCO_160ZA206_T_mlaz_kdiz_nziizjeij.ext"
second = "MLA/SA2_jkj15PO_B_lkazkl lakzlk-akzl.oxt"
third = "A12A/AZD_KZALKZL_F_LKAZ_AZ__azaz___.ixt"
I'm looking for a regular expression allowing me to get arrays like this (in ruby):
first_array = ['SCO', '160ZA206', 'T', 'mlaz_kdiz_nziizjeij']
second_array = ['SA2', 'jkj15PO', 'B', 'lkazkl lakzlk-akzl']
third_array = ['AZD', 'KZALKZL', 'F', 'LKAZ_AZ__azaz___']
The first match must be anything right after the / and before the first _
The second match must be anything between the first and the second _
The third match must be anything between the second and the third _
The last match must be anything between the third _ and the last .
I can't get it: [^\/].?([A-Z]*)_(.*)_(.*)[\.$] :-(

You're super close. Just add a question mark to the second matcher to make it lazy (otherwise, it won't stop at the first underscore), and then duplicate that matcher.
[^\/].?([A-Z]*)_(.*?)_(.*?)_(.*)[\.$]
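Note that ([A-Z]*) only accepts uppercase letters, while the first field of the second sample is SA2. A minimal sketch of a variant that avoids that assumption follows; the constant FILE_PARTS and the helper split_name are hypothetical names used for illustration, not part of the answer above:

```ruby
# Assumed format: "<dir>/<a>_<b>_<c>_<rest>.<ext>".
# The first three fields stop at the next underscore; the fourth
# runs up to the last dot in the name.
FILE_PARTS = %r{/([^_/]*)_([^_]*)_([^_]*)_(.*)\.[^.]*\z}

# Hypothetical helper: returns the four fields, or nil if the name does not fit.
def split_name(path)
  m = FILE_PARTS.match(path)
  m && m.captures
end

p split_name("SC/SCO_160ZA206_T_mlaz_kdiz_nziizjeij.ext")
p split_name("MLA/SA2_jkj15PO_B_lkazkl lakzlk-akzl.oxt")
p split_name("A12A/AZD_KZALKZL_F_LKAZ_AZ__azaz___.ixt")
```

Using negated character classes ([^_]*) instead of lazy quantifiers keeps each field from crossing an underscore, so the result does not depend on backtracking order.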

Following up on @fge's split suggestion:
str = "SC/SCO_160ZA206_T_mlaz_kdiz_nziizjeij.ext"
p str[(str.index('/')+1)...str.rindex('.')].split( '_', 4)
#=> ["SCO", "160ZA206", "T", "mlaz_kdiz_nziizjeij"]
It splits on _ for max 4 elements (the fourth element is the remainder).

Related

Ruby regex: union 2K values in one regex

I'm writing a process that goes through a bunch of text files and records a file's name if any of 2000 literals occurs in it (one or more of them). I'm thinking of combining all of those values into one regex. Is that doable? I tested with 100 values and it looks OK. Thanks, all.
The code below depicts my flow with sample code, just without the looping.
# 1. read regex value list as file [alpha,fox, delta] # 2000 values
# 2. read file into s #5000 files
# 3. find if any of #1 values exists in each #2 file. *with regex tweaks to match format dbname.dob.table
s = '1 dbName.dbo.ALPHA 2 DBNAME.bcd.ALPHA 3 dbName..ALPHA 4 ALPHA 5x dbName.alphA 6x alpha.XX 7x ###dbName.###a.alpha --alpha
dbName..FOX dbName.dbo.DELTA clarity.aba..fox '
value1 = '(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:alpha)(?=\s|$)'
value2 = '(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:fox)(?=\s|$)'
##...
value2000 = '(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:delta)(?=\s|$)'
regex = /#{value1}|#{value2}|#{value2000}/i ## can I union 2000 regex's ???
puts 'reg1: ' + regex.to_s
puts 'result: ' + s.scan(regex).to_s
if s.scan(regex) then puts '...Match!!!d' end
Declaring 2000 variables is highly unnecessary; you should define all values in a single array, then somehow loop through them.
Also, the regular expression is highly repetitive - e.g. the use of (?:dbName\.[a-z]*\.) 2000 times. This can be simplified by grouping all of your values within the non-capture group as follows:
values = %w(alpha fox delta)
regex = /(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:#{Regexp.union(values)})(?=\s|$)/
This is the result:
/(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:(?-mix:alpha|fox|delta))(?=\s|$)/
If you extend that values array to contain 2000 strings, the other code does not need to change.
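One detail worth checking: the interpolated union appears as (?-mix:...), which switches the i flag off inside that group, so adding /i to the outer regex would not make the literals case-insensitive. A minimal sketch of one way around this, assuming the values are plain literals; the variable names alternation and sample are made up for illustration:

```ruby
values = %w[alpha fox delta]

# Regexp.union escapes each literal; .source returns the bare
# "alpha|fox|delta" text without the (?-mix:...) wrapper, so the
# outer /i flag applies to the literals as well.
alternation = Regexp.union(values).source
regex = /(?<=^|\s)(?:dbName\.[a-z]*\.)?(?:#{alternation})(?=\s|$)/i

sample = "1 dbName.dbo.ALPHA 4 ALPHA 6x alpha.XX dbName..FOX"
matches = sample.scan(regex)
p matches
```

The same two lines work unchanged when values holds 2000 strings read from a file.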
Provided two conditions are met, I would do it as follows. I think this would be far more efficient than using a gigantic regular expression, which, by its nature, requires a linear search of the "bad words" for each word in the string, until a match is found or it is determined that there are no matches.
We are given a file whose path is contained in a variable fname and an array of bad words:
arr = ["alpha", "fox", "delta", "charlie", "mabel"]
The first condition that I spoke of above is that, by way of example, "ALPHA" and "Alpha" match "alpha", but "aLPha" does not (or some variant of that).
The second condition is that there is a regular expression with a capture group that would capture a bad word if a bad word were present at the given location in a match. For example:
regex = /(?<=^|\s)(?:dbName\.[a-z]*\.)?(\p{Alpha}+)(?=\s|$)/
Wherever there is a match, the capture group (\p{Alpha}+) would capture a string of one or more alphabetic characters whose value is assigned to the global variable $1. We will then check to see if the value of $1 is a bad word. (The regular expression might have other capture groups as well, in which case we might be looking for $2 or $3, say, or a named capture group.)
If there were more than one such regular expression to check for, the code below could be executed for each of them until a match is found or it is determined that there are no more matches.
The first step is to convert the array of bad words to a set:
require 'set'
bad_words = arr.flat_map { |w| [w, w.capitalize, w.upcase] }.to_set
#=> #<Set: {"alpha", "Alpha", "ALPHA", "fox", "Fox", "FOX",
# "delta", "Delta", "DELTA", "charlie", "Charlie", "CHARLIE",
# "mabel", "Mabel", "MABEL"}>
This allows very fast word lookups--much faster than stepping through an array. We may then search the file as follows.
rv = IO.foreach(fname).any? do |line|
  line.gsub(regex).any? { bad_words.include?($1) }
end
IO::foreach without a block is seen to return an enumerator. We can then chain that to any? to determine if there is a line that contains a match of the regular expression and the value of its capture group is contained in the set bad_words. If such a line is found the search terminates and true is returned; else, false is returned.
It is seen that String#gsub without a block returns an enumerator, which here I've chained to any?. This form of gsub has nothing to do with string replacements; it just generates matches. Those matches are passed to the block, but we are only interested in the contents of the capture group, which are held by $1. Hence the expression bad_words.include?($1).
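For experimenting without a file on disk, here is a minimal sketch of the same idea. It swaps in String#scan with a capture group for the gsub enumerator and takes an array of lines instead of reading fname; the helper name bad_word_in? is an assumption for illustration:

```ruby
require 'set'

arr = %w[alpha fox delta]
# Accept the lowercase, capitalized and uppercase spellings only.
bad_words = arr.flat_map { |w| [w, w.capitalize, w.upcase] }.to_set
regex = /(?<=^|\s)(?:dbName\.[a-z]*\.)?(\p{Alpha}+)(?=\s|$)/

# Hypothetical helper: true as soon as one line has a captured word in the set.
def bad_word_in?(lines, regex, bad_words)
  lines.any? do |line|
    # With one capture group, scan yields one-element arrays.
    line.scan(regex).any? { |(word)| bad_words.include?(word) }
  end
end

p bad_word_in?(["nothing here", "select * from dbName.dbo.ALPHA"], regex, bad_words)
p bad_word_in?(["dbName.dbo.aLPha", "alpha.XX"], regex, bad_words)
```

Both any? calls stop at the first hit, so the early-exit behaviour described above is preserved.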

Regex cuts word if end of string

I want to check and capture 2 or x words after and before a target string in a multiline text. The problem is that if the words matched are less than x number of words, then regex cuts off the last word and splits it till x.
For example
text = "This is an example /year"
if example is the target:
Matching Data: "is" , "an", "/yea", "r"
If I add random words after /year, it matches correctly.
How could I fix this so that if less than x words exist just stop there or return empty for the rest of the matches?
So it should be
Matching Data: "is" , "an", "/year", ""
def checkWords(target, text, numLeft = 2, numRight = 2)
  target = target.compact.map{|x| x.inspect}.join('').gsub(/"/, '')
  regex = ""
  regex += "\\s+{,2}(\\S+)\\s+{,2}" * numLeft
  regex += target
  regex += "\\s+{,2}(\\S+)" * numRight
  pattern = Regexp.new(regex)
  matches = pattern.match(text)
  puts matches.inspect
end
Since you want to capture the words before and after target, you need to put a capturing group around each whole part of the regex that matches the 0 to 2 occurrences of spaces and non-spaces. Also, you need to allow a minimum bound of 0 (use the {0,2} limiting quantifier, or the more succinct {,2}) to make sure you get the context on the left even if it is missing on the right:
/((?:\S+\s+){,2})target((?:\s+\S+){,2})/
 ^              ^      ^              ^
If you use /(?:(\S+)\s+){0,2}target(?:\s+(\S+)){0,2}/, all captured values but the last one will be lost, i.e. once quantified, repeated capturing groups only store the value captured during the last iteration in the group buffer.
Also note that setting a {,2} quantifier on the + quantifier makes no sense, \\s+{,2} = \\s+.
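As a minimal sketch of how that pattern could be wired into a method (the name context_words and its return shape are assumptions, and the target is taken to be a plain string here):

```ruby
# Hypothetical helper: returns [words_before, words_after], each holding
# up to num_left / num_right words, or nil if the target is absent.
def context_words(target, text, num_left = 2, num_right = 2)
  pattern = /((?:\S+\s+){0,#{num_left}})#{Regexp.escape(target)}((?:\s+\S+){0,#{num_right}})/
  m = pattern.match(text)
  return nil unless m

  # Each group holds the whole left or right context; split it into words.
  [m[1].split, m[2].split]
end

p context_words("example", "This is an example /year")
p context_words("example", "This is an example")
```

Because each quantified part sits inside a single capturing group, a shorter context simply produces a shorter array instead of a word being split.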

Regex: text before multiple matches

Idea. Given the string, return all the matches (with overlaps) and the text before these matches.
Example. For the text atatgcgcatatat and the query atat there are three matches, and the desired output is atat, atatgcgcatat and atatgcgcatatat.
Problem. I use Ruby 2.2 and String#scan method to get multiple matches. I've tried to use lookahead, but the regex /(?=(.*?atat))/ returns every substring that ends with atat. There must be some regex magic to solve this problem, but I can't figure out the right spell.
I believe this is at least better than the OP's answer:
text = "atatgcgcatatat"
query = "atat"
res = []
text.scan(/(?=#{query})/){res.push($` + query)} #`
res # => ["atat", "atatgcgcatat", "atatgcgcatatat"]
Given the nature and purpose of regex, there is no way to do that. When a regex matches text, there is no way to include the same text in another match. Therefore, the best option that I can think of is to use a look-behind to find the ending position of each match:
(?<=atat)
With your example input of atatgcgcatatat, that would return the following three matches:
Position 4, Length 0
Position 12, Length 0
Position 14, Length 0
You could then loop through those results, get the position for each one, and then get the sub-string that starts at the beginning of the input string and ends at that position. If you don't know how to get the positions of each match, you may find the answers to this question helpful.
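A minimal sketch of that loop in Ruby, assuming the query is a literal string; the method name prefixes_ending_with is made up for illustration:

```ruby
# Collect the end position of every (possibly overlapping) occurrence
# via a zero-width look-behind, then slice the prefix up to each position.
def prefixes_ending_with(text, query)
  ends = []
  text.scan(/(?<=#{Regexp.escape(query)})/) { ends << Regexp.last_match.begin(0) }
  ends.map { |pos| text[0, pos] }
end

p prefixes_ending_with("atatgcgcatatat", "atat")
```

Regexp.escape keeps characters such as . or + in the query from being read as regex syntax.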
You could do this:
str = 'atatgcgcatatat'
target = 'atat'
[].tap do |a|
  str.gsub(/(?=#{target})/) { a << str[0, $~.end(0)+target.size] }
end
#=> ["atat", "atatgcgcatat", "atatgcgcatatat"]
Notice that the string returned by gsub is discarded.
It seems there's no way to solve the problem in just one go.
One possible solution is to use this knowledge to get indices of matches when using String#scan, and then return the array of sliced strings:
def find_by_end text, query
  res = []
  n = query.length
  text.scan( /(?=(#{query}))/ ) do |m|
    res << text.slice(0, $~.offset(0).first + n)
  end
  res
end
find_by_end "atatgcgcatatat", "atat" #=> ["atat", "atatgcgcatat", "atatgcgcatatat"]
A slightly different solution was proposed by @StevenDoggart. Here's a nice, short snippet that uses this hack to solve the problem:
"atatgcatatat".to_enum(:scan, /(?<=atat)/).map { $` } #`
#=> ["atat", "atatgcatat", "atatgcatatat"]
As @CasimiretHippolyte notes, reversing the string might help to solve the problem. It actually does, but it's hardly the prettiest solution:
"atatgcatatat".reverse.scan(/(?=(tata.*))/).flatten.map(&:reverse).reverse
#=> ["atat", "atatgcatat", "atatgcatatat"]

At which position does the regex fail?

I need a very simple string validator that shows the position of the first symbol that does not correspond to the desired format. I want to use a regex, but then I have to find the place where the string stops corresponding to the expression, and I can't find a method that does that.
(It's got to be a fairly simple method... maybe there isn't one?)
For example if I have regex:
/^Q+E+R+$/
with string:
"QQQQEEE2ER"
The desired result should be 7
An idea: what you can do is to tokenize your pattern and write it with optional nested capturing groups:
^(Q+(E+(R+($)?)?)?)?
Then you only need to count the number of capture groups you obtain to know where the regex engine stops in the pattern and you can determine the offset of the match end in the string with the whole match length.
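A minimal sketch of that counting step in Ruby, using the nested pattern exactly as written above; the constant NESTED and the helper failure_point are hypothetical names:

```ruby
# Each token is wrapped in its own optional group, so the match stops
# where the string stops conforming instead of failing outright.
NESTED = /^(Q+(E+(R+($)?)?)?)?/

# Hypothetical helper: returns [matched_length, number_of_groups_set].
def failure_point(str)
  m = NESTED.match(str)
  # Groups that did not take part in the match are nil.
  [m[0].length, m.captures.compact.size]
end

p failure_point("QQQQEEE2ER")
p failure_point("QQQQEEEERR")
```

For the string from the question this gives a matched length of 7 with two groups set, which means Q+ and E+ matched and the third token, R+, is where the pattern stopped. All four groups are set only when the whole string conforms.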
As @zx81 notes in his comment, if one of the elements can match the next element (for example, Q can match the element E), things become different.
Let's say that Q is \w (and can match E and R). For the string QQQEEERRR the preceding pattern will give only one capturing group (the greedy \w+ matches everything), whereas ^(\w+)(E+)(R+)$ will give three groups: QQQEE, E, RRR
To obtain the same result you need to add an alternation:
^((?:\w+(?=E)|\w+)(E+(R+($)?)?)?)?
In the alternation, the case where E exists must be tested first, and only if this branch fails (with the lookahead), then the other branch where E doesn't exist is used.
Thus the full pattern can be rewritten like this to deal with this specific case:
^((?:Q+(?=E)|Q+)((?:E+(?=R)|E+)((?:R+(?=$)|R+)($)?)?)?)?
You could perhaps take a look at the amatch gem too.
This is an interesting task that can be accomplished with a neat regex trick:
^(?:(?=(Q+)))?(?:(?=(Q+E+)))?(?:(?=(Q+E+R+)))?(?:(?=(Q+E+R+$)))?
We have four optional lookaheads checking various parts of the pattern and capturing the partial matches to Groups 1, 2, 3 and 4 incrementally.
Group 1 contains Q+ if it can be matched, in your example QQQQ.
Group 2 contains Q+E+ if it can be matched, in your example QQQQEEE.
Group 3 contains Q+E+R+ if it can be matched, in your example nil.
Group 4 contains Q+E+R+$ if it can be matched, in your example nil.
In your code, check which is the last Group that is set by testing !$1.nil?, !$2.nil? and so on.
The last one set gives you the length that is matchable, so in your example $2.length gives you the 7 you wanted.
Incidentally, the fact that Group 2 is the last one set also tells you that we fail on R+.
For your example, you could do the following.
Code
Change your regex from:
/^Q+E+R+$/
to
R = /^(Q*)(E*)(R*)/
and then apply the following method to the string:
def nbr_matched_chars(str)
  str.scan(R).flatten.reduce(0) {|t,e| return t if e.nil?; t+e.size }
end
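If str matches the original regex, then nbr_matched_chars(str) == str.size. The converse does not hold in general: because the groups use *, a string such as "QQQQ" also gives nbr_matched_chars(str) == str.size without matching /^Q+E+R+$/.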
Examples
nbr_matched_chars("QQQQEEE2ER") #=> 7
nbr_matched_chars("QQQQEEEERR") #=> 10 (= "QQQQEEEERR".size)
nbr_matched_chars("QQAQQEEEER") #=> 2
Explanation
To see why this [evidently :-)] works, we can look at the results of invoking String#scan, followed by Array#flatten (a starred group that matches nothing yields an empty string):
"QQQQEEE2ER".scan(R).flatten #=> ["QQQQ", "EEE" , ""  ]
"QQQQEEEERR".scan(R).flatten #=> ["QQQQ", "EEEE", "RR"]
"QQAQQEEEER".scan(R).flatten #=> ["QQ"  , ""    , ""  ]

Incrementing numeric parameter in a URL parameter string?

I've had a look round and can't find what I need on Stack Overflow, and was wondering if someone had a simple solution.
I want to find a parameter within a URL and increment its value, so, as an example:
?kws=&pstc=&cty=&prvnm=1
I want to be able to locate the prvnm parameter no matter where it is in the string and increment its value by 1.
I know I could split the parameters into an array, find the key, increment it and write it back but that seems rather long winded and wondered if someone else had any ideas!
require "uri"
url = "http://example.com/?kws=&pstc=&cty=&prvnm=1"
def new_url(url)
  uri = URI.parse(url)
  hsh = Hash[URI.decode_www_form(uri.query)]
  hsh['prvnm'] = hsh['prvnm'].next
  uri.query = URI.encode_www_form(hsh).to_s
  uri.to_s
end
new_url(url) # => "http://example.com/?kws=&pstc=&cty=&prvnm=2"
There are already four answers, so I had to come up with something a little different:
s = "?kws=&pstc=&cty=&prvnm=1"
head, sep, tail = s.partition(/(?<=[?&]prvnm=)\d+/)
head + (sep.to_i + 1).to_s + tail # => "?kws=&pstc=&cty=&prvnm=2"
String#partition returns an array of three strings [head, sep, tail] such that head + sep + tail == s, where sep is the part of s matched by partition's argument, which can be a string or a regex.
We want the separator to be the digits following ?prvnm= or &prvnm=. We therefore use a regex with \d+ preceded by that prefix, which we want to treat as having zero length so that it is not included in the separator. That calls for a "positive look-behind": (?<=[?&]prvnm=). \d+ is "greedy", so it takes all consecutive digits.
For the given value of s, head, sep, tail = s.partition(/(?<=[?&]prvnm=)\d+/)
=> ["?kws=&pstc=&cty=&prvnm=", "1", ""].
Edit: my thanks to @quetzalcoatl for pointing out that I needed to change (?<=&prvnm=) in my regex to what I have now, as what I had would fail when ?prvnm= was at the beginning of the string.
split the string by `&`
then iterate over the parts
then split each part by `=` and inspect the results
when found `prvnm`, parse the integer and increment it
then join the bits by '='
then join the parts by '&'
Or, use regex like:
/[?&]prvnm=\d+/
and parse the result and then do a replacement.
Or, get some URL-parsing library..
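The regex route can be sketched in a few lines, assuming the value is always a non-negative integer; the method name increment_param is made up for illustration:

```ruby
# Find "<name>=<digits>" right after ? or & and rewrite only the digits.
def increment_param(query, name)
  query.sub(/([?&]#{Regexp.escape(name)}=)(\d+)/) { "#{$1}#{$2.to_i + 1}" }
end

p increment_param("?kws=&pstc=&cty=&prvnm=1", "prvnm")
p increment_param("?prvnm=9&kws=", "prvnm")
```

Anchoring on [?&] keeps a parameter whose name merely ends in prvnm from being touched, and a string without the parameter is returned unchanged.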
Try something like this:
params = "?kws=&pstc=&cty=&prvnm=1"
num = params.scan(/prvnm=(\d+)/)[0].join.to_i
puts num + 1
Use:
require 'uri'
require 'cgi'
Then:
parsed_url = URI.parse(url) # url is your full URL
r = CGI.parse(parsed_url.query)
r is now a hash of all your query parameters; each value is an array of strings.
You can easily access the parameter by using:
r["prvnm"].first.to_i + 1
