Need to extract substrings based on key words - ruby

I have a string (a block of cdata from a soap) that looks roughly like:
"<![CDATA[XXX|^~\&
KEY|^~\&|xxxxx|xxxxx^xxxx xxxxx
INFO||xxx|xxxxxx||xxxxx|xxxxxxx|xxxxxxx
INFO|||xxxxx||||xxxxxxxxx||||||||||xxxxxxxx
KEY|^~\&|xxxxxx|xxxxxxxxxx|xxxxxxxx
INFO||xx|xxxxxxxx||xxxxxxx|xxxxxx
INFO|||xxxx|x|||xxxxxxxxx|||||||x|||xxxxx|||xxxx||||||||||||||||||||||||xxxx
KEY|^~\&|xxxxx|xxxxx^xxxx xxxxx
INFO||xxx|xxxxxx||xxxxx|xxxxxxx|xxxxxxx
INFO|||xxxxx||||xxxxxxxxx||||||||||xxxxxxxx ]]>"
I am trying to figure how to safely parse out a string for each 'KEY' section using ruby. Basically I need a sting that looks like:
"KEY|^~\&|xxxxx|xxxxx^xxxx xxxxx
INFO||xxx|xxxxxx||xxxxx|xxxxxxx|xxxxxxx
INFO|||xxxxx||||xxxxxxxxx||||||||||xxxxxxxx"
For each time there is a 'KEY'. Thoughts on the best way to go about this? Thanks.

Here's one way to do it (with a simplified example):
str =
"<![CDATA[XXX|^~\&
KEY|^~\&|x
INFO||x
INFO|||x
KEY|^~\&|x
INFO||xx|x
INFO|||x
KEY|^~\&|x
INFO||x
INFO|||x"
r = /
^KEY\b # match KEY at beginning of line followed by word boundary
.+? # match any number of any character, lazily
(?=\bKEY\b|\z) # match KEY bracketed by word boundaries or end of
# string, in positive lookahead
/mx # multiline and extended modes
str.scan r
#=> ["KEY|^~&|x\nINFO||x\nINFO|||x\n",
# "KEY|^~&|x\nINFO||xx|x\nINFO|||x\n",
# "KEY|^~&|x\nINFO||x\nINFO|||x"]

Not as relaxed of a regex as like, but this might work for you:
KEY(.+\n)+(?=\s+KEY)

Related

How do I remove a common substring using Ruby?

I have read How do I remove substring after a certain character in a string using Ruby?. This is close, but different.
I have these emails with a mask:
email1 = 'giovanna.macedo#lojas100.com.br-215000695716b.ct.domain.com.br'
email2 = 'alvaro-neves#stockshop.com-215000695716b.ct.domain.com.br'
email3 = 'filiallojas123#filiallojas.net-215000695716b.ct.domain.com.br'
I want to remove the substrings that are after .br, .com and .net. The return must be:
email1 = 'giovanna.macedo#lojas100.com.br'
email2 = 'alvaro-neves#stockshop.com'
email3 = 'filiallojas123#filiallojas.net'
You can do that with the method String#[] with an argument that is a regular expression.
r = /.*?\.(?:rb|com|net|br)(?!\.br)/
'giovanna.macedo#lojas100.com.br-215000695716b.ct.domain.com.br'[r]
#=> "giovanna.macedo#lojas100.com.br"
'alvaro-neves#stockshop.com-215000695716b.ct.domain.com.br'[r]
#=> "alvaro-neves#stockshop.com"
'filiallojas123#filiallojas.net-215000695716b.ct.domain.com.br'[r]
#=> "filiallojas123#filiallojas.net"
The regular expression reads as follows: "Match zero or more characters non-greedily (?), follow by a period, followed by 'rb' or 'com' or 'net' or 'br', which is not followed by .br. (?!\.br) is a negative lookahead.
Alternatively the regular expression can be written in free-spacing mode to make it self-documenting:
r = /
.*? # match zero or more characters non-greedily
\. # match '.'
(?: # begin a non-capture group
rb # match 'rb'
| # or
com # match 'com'
| # or
net # match 'net'
| # or
br # match 'br'
) # end non-capture group
(?! # begin a negative lookahead
\.br # match '.br'
) # end negative lookahead
/x # invoke free-spacing regex definition mode
This should work for your scenario:
expr = /^(.+\.(?:br|com|net))-[^']+(')$/
str = "email = 'giovanna.macedo#lojas100.com.br-215000695716b.ct.domain.com.br'"
str.gsub(expr, '\1\2')
Use the String#delete_suffix Method
This was tested with Ruby 3.0.2. Your mileage may vary with other versions that don't support String#delete_suffix or its related bang method. Since you're trying to remove the exact same suffix from all your emails, you can simply invoke #delete_suffix! on each of your strings. For example:
common_suffix = "-215000695716b.ct.domain.com.br".freeze
emails = [email1, email2, email3]
emails.each { _1.delete_suffix! common_suffix }
You can then validate your results with:
emails
#=> ["giovanna.macedo#lojas100.com.br", "alvaro-neves#stockshop.com", "filiallojas123#filiallojas.net"]
email1
#=> "giovanna.macedo#lojas100.com.br"
email2
#=> "alvaro-neves#stockshop.com"
email3
#=> "filiallojas123#filiallojas.net"
You can see that the array has replaced each value, or you can call each of the array's variables individually if you want to check that the strings have actually been modified in place.
String Methods are Usually Faster, But Your Mileage May Vary
Since you're dealing with String objects instead of regular expressions, this solution is likely to be faster at scale, although I didn't bother to benchmark all solutions to compare. If you care about performance, you can measure larger samples using IRB's new measure command, it took only 0.000062s to process the strings this way on my system, and String methods generally work faster than regular expressions at large scales. You'll need to do more extensive benchmarking if performance is a core concern, though.
Making the Call Shorter
You can even make the call shorter if you want. I left it a bit verbose above so you could see what the intent was at each step, but you can trim this to a single one-liner with the following block:
# one method chain, just wrapped to prevent scrolling
[email1, email2, email3].
map { _1.delete_suffix! "-215000695716b.ct.domain.com.br" }
Caveats
You Need Fixed-String Suffixes
The main caveat here is that this solution will only work when you know the suffix (or set of suffixes) you want to remove. If you can't rely on the suffixes to be fixed, then you'll likely need to pursue a regex solution in one way or another, even if it's just to collect a set of suffixes.
Dealing with Frozen Strings
Another caveat is that if you've created your code with frozen string literals, you'll need to adjust your code to avoid attempting in-place changes to frozen strings. There's more than one way to do this, but a simple destructuring assignment is probably the easiest to follow given your small code sample. Consider the following:
# assume that the strings in email1 etc. are frozen, but the array
# itself is not; you can't change the strings in-place, but you can
# re-assign new strings to the same variables or the same array
emails = [email1, email2, email3]
email1, email2, email3 =
emails.map { _1.delete_suffix "-215000695716b.ct.domain.com.br" }
There are certainly other ways to work around frozen strings, but the point is that while the now-common use of the # frozen_string_literal: true magic comment can improve VM performance or memory usage in large programs, it isn't always the best option for string-mangling code. Just keep that in mind, as tools like RuboCop love to enforce frozen strings, and not everyone stops to consider the consequences of such generic advice to the given problem domain.
I would just use the chomp(string) method like so:
mask = "-215000695716b.ct.domain.com.br"
email1.chomp(mask)
#=> "giovanna.macedo#lojas100.com.br"
email2.chomp(mask)
#=> "alvaro-neves#stockshop.com"
email3.chomp(mask)
#=> "filiallojas123#filiallojas.net"

How do I regex-match an unknown number of repeating elements?

I'm trying to write a Ruby script that replaces all rem values in a CSS file with their px equivalents. This would be an example CSS file:
body{font-size:1.6rem;margin:4rem 7rem;}
The MatchData I'd like to get would be:
# Match 1 Match 2
# 1. font-size 1. margin
# 2. 1.6 2. 4
# 3. 7
However I'm entirely clueless as to how to get multiple and different MatchData results. The RegEx that got me closest is this (you can also take a look at it at Rubular):
/([^}{;]+):\s*([0-9.]+?)rem(?=\s*;|\s*})/i
This will match single instances of value declarations (so it will properly return the desired Match 1 result), but entirely disregards multiples.
I also tried something along the lines of ([0-9.]+?rem\s*)+, but that didn't return the desired result either, and doesn't feel like I'm on the right track, as it won't return multiple result data sets.
EDIT After the suggestions in the answers, I ended up solving the problem like this:
# search for any declarations that contain rem unit values and modify blockwise
#output.gsub!(/([^ }{;]+):\s*([^}{;]*[0-9.]rem+[^;]*)(?=\s*;|\s*})/i) do |match|
# search for any single rem value
string = match.gsub(/([0-9.]+)rem/i) do |value|
# convert the rem value to px by multiplying by 10 (this is not universal!)
value = sprintf('%g', Regexp.last_match[1].to_f * 10).to_s + 'px'
end
string += ';' + match # append the original match result to the replacement
match = string # overwrite the matched result
end
You can't capture a dynamic number of match groups (at least not in ruby).
Instead you could do either one of the following:
Capture the whole value and split on space
Use multilevel matching to capture first the whole key/value pair and secondly match the value. You can use blocks on the match method in ruby.
This regex will do the job for your example :
([^}{;]+):(?:([0-9\.]+?)rem\s?)?(?:([0-9\.]+?)rem\s?)
But whith this you can't match something like : margin:4rem 7rem 9rem
This is what I've been able to do: DEMO
Regex: (?<={|;)([^:}]+)(?::)([^A-Za-z]+)
And this is what my result looks like:
# Match 1 Match 2
# 1. font-size 1. margin
# 2. 1.6 2. 4
As #koffeinfrei says, dynamic capture isn't possible in Ruby. Would be smarter to capture the whole string and remove spaces.
str = 'body{font-size:1.6rem;margin:4rem 7rem;}'
str.scan(/(?<=[{; ]).+?(?=[;}])/)
.map { |e| e.match /(?<prop>.+):(?<value>.+)/ }
#⇒ [
# [0] #<MatchData "font-size:1.6rem" prop:"font-size" value:"1.6rem">,
# [1] #<MatchData "margin:4rem 7rem" prop:"margin" value:"4rem 7rem">
# ]
The latter match might be easily adapted to return whatever you want, value.split(/\s+/) will return all the values, \d+ instead of .+ will match digits only etc.

Return specific segment from Ruby regex

I have a big chunk of text I am scanning through and I am searching with a regex that is prefixed by some text.
var1 = textchunk.match(/thedata=(\d{6})/)
My result from var1 would return something like:
thedata=123456
How do I only return the number part of the search so in the example above just 123456 without taking var1 and then stripping thedata= off in a line below
If you expect just one match in the string, you may use your own code and access the captures property and get the first item (since the data you need is captured with the first set of unescaped parentheses that form a capturing group):
textchunk.match(/thedata=(\d{6})/).captures.first
See this IDEONE demo
If you have multiple matches, just use scan:
textchunk.scan(/thedata=(\d{6})/)
NOTE: to only match thedata= followed with exactly 6 digits, add a word boundary:
/thedata=(\d{6})\b/
^^
or a lookahead (if there can be word chars after 6 digits other than digits):
/thedata=(\d{6})(?!\d)/
^^^^^^
▶ textchunk = 'garbage=42 thedata=123456'
#⇒ "garbage=42 thedata=123456"
▶ textchunk[/thedata=(\d{6})/, 1]
#⇒ "123456"
▶ textchunk[/(?<=thedata=)\d{6}/]
#⇒ "123456"
The latter uses positive lookbehind.

Substring starting with a combination of digits till the next white space

I have a very long string of around 2000 chars. The string is a join of segments with first two chars of each segment as the segment indicator.
Eg- '11xxxxx 12yyyy 14ddddd gghgfbddc 0876686589 SANDRA COLINS 201 STMONK CA'
Now I want to extract the segment with indicator 14.
I achieved this using:
str.split(' ').each do |substr|
if substr.starts_with?('14')
key = substr.slice(2,5).to_i
break
end
end
I feel there should be a better way to do this. I am not able to find a more direct and one line solution for string matching in ruby. Please someone suggest a better approach.
It's not entirely clear what you're looking for, because your example string shows letters, but your title says digits. Either way, this is a good task for a regular expression.
foo = '12yyyy 014dddd 14ddddd gghgfbddc'
bar = '12yyyy 014dddd 1499999 gghgfbddc'
baz = '12yyyy 014dddd 14a9B9z gghgfbddc'
foo[/\b14[a-zA-Z]+/] # => "14ddddd"
bar[/\b14\d+/] # => "1499999"
baz[/\b14\w+/] # => "14a9B9z"
foo[/\b14\S+/] # => "14ddddd"
bar[/\b14\S+/] # => "1499999"
baz[/\b14\S+/] # => "14a9B9z"
In the patterns:
\b means word-break, so the pattern has to start at a transition between spaces or punctuation.
[a-zA-Z]+ means one or more letters.
\d+ means one or more digits.
\w+ means one or more of letters, digits and '_'. That is equivalent to the character set [a-zA-Z0-9_]+.
\S+ means non-whitespace, which is useful if you want everything up to a space.
Which of those is appropriate for your use-case is really up to you to decide.

Ruby regular expression

Apparently I still don't understand exactly how it works ...
Here is my problem: I'm trying to match numbers in strings such as:
910 -6.258000 6.290
That string should gives me an array like this:
[910, -6.2580000, 6.290]
while the string
blabla9999 some more text 1.1
should not be matched.
The regex I'm trying to use is
/([-]?\d+[.]?\d+)/
but it doesn't do exactly that. Could someone help me ?
It would be great if the answer could clarify the use of the parenthesis in the matching.
Here's a pattern that works:
/^[^\d]+?\d+[^\d]+?\d+[\.]?\d+$/
Note that [^\d]+ means at least one non digit character.
On second thought, here's a more generic solution that doesn't need to deal with regular expressions:
str.gsub(/[^\d.-]+/, " ").split.collect{|d| d.to_f}
Example:
str = "blabla9999 some more text -1.1"
Parsed:
[9999.0, -1.1]
The parenthesis have different meanings.
[] defines a character class, that means one character is matched that is part of this class
() is defining a capturing group, the string that is matched by this part in brackets is put into a variable.
You did not define any anchors so your pattern will match your second string
blabla9999 some more text 1.1
^^^^ here ^^^ and here
Maybe this is more what you wanted
^(\s*-?\d+(?:\.\d+)?\s*)+$
See it here on Regexr
^ anchors the pattern to the start of the string and $ to the end.
it allows Whitespace \s before and after the number and an optional fraction part (?:\.\d+)? This kind of pattern will be matched at least once.
maybe /(-?\d+(.\d+)?)+/
irb(main):010:0> "910 -6.258000 6.290".scan(/(\-?\d+(\.\d+)?)+/).map{|x| x[0]}
=> ["910", "-6.258000", "6.290"]
str = " 910 -6.258000 6.290"
str.scan(/-?\d+\.?\d+/).map(&:to_f)
# => [910.0, -6.258, 6.29]
If you don't want integers to be converted to floats, try this:
str = " 910 -6.258000 6.290"
str.scan(/-?\d+\.?\d+/).map do |ns|
ns[/\./] ? ns.to_f : ns.to_i
end
# => [910, -6.258, 6.29]

Resources