Return specific segment from Ruby regex - ruby

I have a big chunk of text I am scanning through and I am searching with a regex that is prefixed by some text.
var1 = textchunk.match(/thedata=(\d{6})/)
My result from var1 would return something like:
thedata=123456
How do I return only the number part of the match (just 123456 in the example above), without taking var1 and stripping thedata= off on a separate line?

If you expect just one match in the string, you can keep your own code, call captures on the resulting MatchData, and take the first item (the digits you need are captured by the first pair of unescaped parentheses, which form a capturing group):
textchunk.match(/thedata=(\d{6})/).captures.first
See this IDEONE demo
If you have multiple matches, just use scan:
textchunk.scan(/thedata=(\d{6})/)
NOTE: to match thedata= followed by exactly 6 digits, add a word boundary:
/thedata=(\d{6})\b/
                ^^
or a negative lookahead (if word characters other than digits may follow the 6 digits):
/thedata=(\d{6})(?!\d)/
                ^^^^^^
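A minimal sketch pulling the suggestions above together. The sample string and variable names are made up for illustration, and the &. guard is an addition (not in the answer) that avoids a NoMethodError when nothing matches:

```ruby
# Hypothetical sample text, not taken from the question.
textchunk = 'garbage=42 thedata=123456 thedata=1234567 thedata=654321x'

# Single match: match returns nil when nothing matches, so guard with &.
first = textchunk.match(/thedata=(\d{6})/)&.captures&.first

# All matches: scan returns one inner array per match, hence flatten.
all = textchunk.scan(/thedata=(\d{6})/).flatten

# Negative lookahead: reject a 7th digit, allow other trailing characters.
no_more_digits = textchunk.scan(/thedata=(\d{6})(?!\d)/).flatten

# Word boundary: reject any trailing word character (digit, letter, underscore).
word_bounded = textchunk.scan(/thedata=(\d{6})\b/).flatten

p first, all, no_more_digits, word_bounded
```

The plain pattern also picks up the first six digits of 1234567, which is what the boundary and the lookahead are there to prevent.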

▶ textchunk = 'garbage=42 thedata=123456'
#⇒ "garbage=42 thedata=123456"
▶ textchunk[/thedata=(\d{6})/, 1]
#⇒ "123456"
▶ textchunk[/(?<=thedata=)\d{6}/]
#⇒ "123456"
The latter uses positive lookbehind.
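For comparison, a short sketch of the String#[] forms. The named group num and the no-match string are illustrative assumptions, not part of the answer:

```ruby
textchunk = 'garbage=42 thedata=123456'

# Capture selected by index.
by_index = textchunk[/thedata=(\d{6})/, 1]

# Capture selected by name (the group name "num" is made up here).
by_name = textchunk[/thedata=(?<num>\d{6})/, "num"]

# Lookbehind: the prefix is required but is not part of the match.
by_lookbehind = textchunk[/(?<=thedata=)\d{6}/]

# String#[] returns nil instead of raising when nothing matches.
missing = 'no data here'[/thedata=(\d{6})/, 1]

p by_index, by_name, by_lookbehind, missing
```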

Related

Regex to obfuscate substring of a repeating substring

Given a string like:
abc_1234 xyz def_123aa4a56
I want to replace parts of it so the output is:
abc_*******z def_*******56
The rules are:
abc_ and def_ act as delimiters, so anything between two delimiters belongs to the preceding delimiter.
The string between the abc_ and def_, and the next delimited string should be replaced by *, except for the last 2 characters of that substring. In the above example, abc_1234 xyz (note trailing space), got turned into abc_*******z
prefixes = %w|abc_ def_|
input = "Hello abc_111def_frg def_333World abc_444"
input.gsub(/(#{Regexp.union(prefixes)})../, "\\1**")
#⇒ "Hello abc_**1def_**g def_**3World abc_**4"
Is this what you are looking for?
str = "Hello abc_111def_frg def_333World abc_444"
str.scan(/(?<=abc_|def_)(?:[[:alpha:]]+|[[:digit:]]+)/)
# => ["111", "frg", "333", "444"]
I've assumed the string following "abc_" or "def_" is either all digits or all letters. It won't work if, for example, you wished to extract "a1b" from "abc_a1b cat". You need to better define the rules for what terminates the strings you want.
The regular expression reads, "Following the string "abc_" or "def_" (a positive lookbehind that is not part of the match), match a string of digits or a string of letters".
Given:
> s
=> "abc_1234 xyz def_123aa4a56"
You can do:
> s.gsub(/(?<=abc_|def_)(.*?)(..)(?=(?:abc_|def_|$))/) { |m| "*" * $1.length<<$2 }
=> "abc_*******z def_*******56"
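The same gsub can be wrapped in a small helper so the delimiters are not hard-coded. This is only a sketch: the method name obfuscate and the keep: parameter are hypothetical, and \z is used in place of $ so that only the end of the whole string counts as a terminator:

```ruby
# Hypothetical helper; the name and the keep: parameter are not from the thread.
def obfuscate(str, prefixes, keep: 2)
  # Build "abc_|def_" with each prefix escaped, so the alternatives stay
  # at the top level of the lookbehind.
  alt = prefixes.map { |prefix| Regexp.escape(prefix) }.join("|")
  pattern = /(?<=#{alt})(.*?)(.{#{keep}})(?=#{alt}|\z)/
  # $1 is the part to mask, $2 the trailing characters to keep.
  str.gsub(pattern) { "*" * $1.length + $2 }
end

puts obfuscate("abc_1234 xyz def_123aa4a56", %w[abc_ def_])
puts obfuscate("abc_1234 xyz def_123aa4a56", %w[abc_ def_], keep: 3)
```

What should happen to a segment shorter than keep characters is not defined by the question, so this sketch does not try to handle that case specially.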

best way to find substring in ruby using regular expression

I have a string https://stackoverflow.com. I want a new string that contains the domain from the given string, extracted with a regular expression.
Example:
x = "https://stackoverflow.com"
newstring = "stackoverflow.com"
Example 2:
x = "https://www.stackoverflow.com"
newstring = "www.stackoverflow.com"
"https://stackverflow.com"[/(?<=:\/\/).*/]
#⇒ "stackverflow.com"
(?<=..) is a positive lookbehind.
If string = "http://stackoverflow.com",
a really easy way is string.split("http://")[1]. But this isn't regex.
A regex solution would be as follows:
string.scan(/^http:\/\/(.+)$/).flatten.first
To explain:
String#scan returns an array of all matches of the regex. When the pattern contains a capture group, each element is itself an array of that match's captures.
The regex:
^ matches beginning of line
http: matches those characters
\/\/ matches //
(.+) defines a capture group containing one or more of any character (except a newline). This is the value returned by the scan.
$ matches end of line
.flatten.first extracts the result from String#scan, which in this case returns a nested array with a single inner element.
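A short sketch of what scan returns here, plus two variations. The http/https pattern and the URI.parse line are additions for comparison, not part of the answer:

```ruby
require "uri"

string = "http://stackoverflow.com"

# One inner array per match, one entry per capture group.
nested = string.scan(/^http:\/\/(.+)$/)
domain = nested.flatten.first

# Same idea for http or https, using String#[] with a capture index.
host = "https://www.stackoverflow.com"[%r{\Ahttps?://(.+)\z}, 1]

# Non-regex alternative from the standard library.
parsed = URI.parse("https://www.stackoverflow.com").host

p nested, domain, host, parsed
```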
You might want to try this:
#!/usr/bin/env ruby
str = "https://stackoverflow.com"
# (?::\/\/) is a non-capturing group matching "://"; (\S+) captures what follows
if mtch = str.match(/(?::\/\/)(\S+)/)
  f1 = mtch.captures.first
end
There are two groups in the pattern: the first is a non-capturing group that matches the :// separator, and the second captures the run of non-whitespace characters after it. The captures method returns an array of the captured strings, so captures.first assigns the desired result to f1.
I hope this solves your problem.

Need to extract substrings based on key words

I have a string (a block of cdata from a soap) that looks roughly like:
"<![CDATA[XXX|^~\&
KEY|^~\&|xxxxx|xxxxx^xxxx xxxxx
INFO||xxx|xxxxxx||xxxxx|xxxxxxx|xxxxxxx
INFO|||xxxxx||||xxxxxxxxx||||||||||xxxxxxxx
KEY|^~\&|xxxxxx|xxxxxxxxxx|xxxxxxxx
INFO||xx|xxxxxxxx||xxxxxxx|xxxxxx
INFO|||xxxx|x|||xxxxxxxxx|||||||x|||xxxxx|||xxxx||||||||||||||||||||||||xxxx
KEY|^~\&|xxxxx|xxxxx^xxxx xxxxx
INFO||xxx|xxxxxx||xxxxx|xxxxxxx|xxxxxxx
INFO|||xxxxx||||xxxxxxxxx||||||||||xxxxxxxx ]]>"
I am trying to figure out how to safely parse out a string for each 'KEY' section using Ruby. Basically I need a string that looks like:
"KEY|^~\&|xxxxx|xxxxx^xxxx xxxxx
INFO||xxx|xxxxxx||xxxxx|xxxxxxx|xxxxxxx
INFO|||xxxxx||||xxxxxxxxx||||||||||xxxxxxxx"
For each time there is a 'KEY'. Thoughts on the best way to go about this? Thanks.
Here's one way to do it (with a simplified example):
str =
"<![CDATA[XXX|^~\&
KEY|^~\&|x
INFO||x
INFO|||x
KEY|^~\&|x
INFO||xx|x
INFO|||x
KEY|^~\&|x
INFO||x
INFO|||x"
r = /
^KEY\b # match KEY at beginning of line followed by word boundary
.+? # match any number of any character, lazily
(?=\bKEY\b|\z) # match KEY bracketed by word boundaries or end of
# string, in positive lookahead
/mx # multiline and extended modes
str.scan r
#=> ["KEY|^~&|x\nINFO||x\nINFO|||x\n",
# "KEY|^~&|x\nINFO||xx|x\nINFO|||x\n",
# "KEY|^~&|x\nINFO||x\nINFO|||x"]
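As an alternative to the regex, the sections can also be built line by line with Enumerable#slice_before. This is a different technique from the scan above, sketched under the assumption that every section starts with a line beginning with KEY|. The heredoc is single-quoted, so unlike the double-quoted string above it keeps the backslash in \&:

```ruby
# Simplified sample data, same shape as in the answer above.
str = <<~'TEXT'
  <![CDATA[XXX|^~\&
  KEY|^~\&|x
  INFO||x
  INFO|||x
  KEY|^~\&|x
  INFO||xx|x
  INFO|||x
  KEY|^~\&|x
  INFO||x
  INFO|||x
TEXT

sections = str.lines(chomp: true)
              .slice_before { |line| line.start_with?("KEY|") } # new chunk at each KEY line
              .select { |chunk| chunk.first.start_with?("KEY|") } # drop the CDATA header chunk
              .map { |chunk| chunk.join("\n") }

sections.each { |section| puts section, "---" }
```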
Not as relaxed a regex as I'd like, but this might work for you:
KEY(.+\n)+(?=\s+KEY)

Regex matching chars around text

I have a string with chars inside and I would like to match only the chars around a string.
"This is a [1]test[/1] string. And [2]test[/2]"
Rubular http://rubular.com/r/f2Xwe3zPzo
Currently, the code in the link matches the text inside the special chars, how can I change it?
Update
To clarify my question. It should only match if the opening and closing has the same number.
"[2]first[/2] [1]second[/2]"
In the code above, only first should match and not second. The text inside the special chars (first), should be ignored.
Try this:
(\[[0-9]\]).+?(\[\/[0-9]\])
Permalink to the example on Rubular.
Update
Since you want to remove the 'special' characters, try this instead:
foo = "This is a [1]test[/1] string. And [2]test[/2]"
foo.gsub /\[\/?\d\]/, ""
# => "This is a test string. And test"
Update, Part II
You only want to remove the 'special' characters when the surrounding tags match, so what about this:
foo = "This is a [1]test[/1] string. And [2]test[/2], but not [3]test[/2]"
foo.gsub /(?:\[(?<number>\d)\])(?<content>.+?)(?:\[\/\k<number>\])/, '\k<content>'
# => "This is a test string. And test, but not [3]test[/2]"
\[([0-9])\].+?\[\/\1\]
([0-9]) is a capture since it is surrounded with parentheses. The \1 tells it to use the result of that capture. If you had more than one capture, you could reference them as well, \2, \3, etc.
You can also use a named capture, rather than \1 to make it a little less cryptic. As in: \[(?<number>[0-9])\].+?\[\/\k<number>\]
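Both forms can be checked against the string from the question's update. A small sketch; the variable names are illustrative:

```ruby
str = "[2]first[/2] [1]second[/2]"

numbered = /\[([0-9])\].+?\[\/\1\]/
named    = /\[(?<number>[0-9])\].+?\[\/\k<number>\]/

# Only the pair whose opening and closing numbers agree is matched.
by_number = str[numbered]
by_name   = str[named]

# The captured tag number can be read back from the MatchData by name.
tag = str.match(named)[:number]

p by_number, by_name, tag
```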
Here's a way to do it that uses the form of String#gsub that takes a block. The idea is to pull strings such as "[1]test[/1]" into the block, and there remove the unwanted bits.
str = "This is a [1]test[/1] string. And [2]test[/2], plus [3]test[/99]"
r = /
\[ # match a left bracket
(\d+) # capture one or more digits in capture group 1
\] # match a right bracket
.+? # match one or more characters lazily
\[\/ # match a left bracket and forward slash
\1 # match the contents of capture group 1
\] # match a right bracket
/x
str.gsub(r) { |s| s[/(?<=\]).*?(?=\[)/] }
#=> "This is a test string. And test, plus [3]test[/99]"
Aside: When I first heard of named capture groups, they seemed like a great idea, but now I wonder if they really make regexes easier to read than \1, \2....

Ruby regular expression

Apparently I still don't understand exactly how it works ...
Here is my problem: I'm trying to match numbers in strings such as:
910 -6.258000 6.290
That string should give me an array like this:
[910, -6.2580000, 6.290]
while the string
blabla9999 some more text 1.1
should not be matched.
The regex I'm trying to use is
/([-]?\d+[.]?\d+)/
but it doesn't do exactly that. Could someone help me?
It would be great if the answer could clarify the use of the parentheses in the matching.
Here's a pattern that works:
/^[^\d]+?\d+[^\d]+?\d+[\.]?\d+$/
Note that [^\d]+ means at least one non digit character.
On second thought, here's a more generic solution that avoids writing one regular expression for the whole line: replace everything that cannot be part of a number with a space, then split:
str.gsub(/[^\d.-]+/, " ").split.collect{|d| d.to_f}
Example:
str = "blabla9999 some more text -1.1"
Parsed:
[9999.0, -1.1]
The different kinds of brackets have different meanings.
[] defines a character class: it matches a single character that is a member of the class.
() defines a capturing group: the text matched by the part inside the parentheses is saved and can be retrieved later (for example through captures or $1).
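A small sketch of how the two kinds of brackets, plus the non-capturing (?:...) form, change what scan returns. The sample string is made up:

```ruby
sample = "-6.25 7"

# Non-capturing group only: scan returns the whole matches.
whole = sample.scan(/-?\d+(?:\.\d+)?/)

# Capturing groups: scan returns one array of captures per match,
# with nil for a group that did not take part in the match.
parts = sample.scan(/(-?\d+)(\.\d+)?/)

# Character class: matches exactly one character out of the set.
sign = "-6.25"[/[-+]/]

p whole, parts, sign
```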
You did not define any anchors so your pattern will match your second string
blabla9999 some more text 1.1
      ^^^^ here           ^^^ and here
Maybe this is more what you wanted
^(\s*-?\d+(?:\.\d+)?\s*)+$
See it here on Regexr
^ anchors the pattern to the start of a line and $ to the end of a line. (In Ruby these are line anchors; \A and \z anchor to the start and end of the whole string.)
It allows whitespace (\s*) before and after each number, and an optional fractional part, (?:\.\d+)?. The outer group must match at least once.
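One way to use an anchored pattern like this is to validate the whole line first and only then split it into numbers. The sketch below is a variant rather than the exact pattern above: it uses \A and \z because Ruby's ^ and $ match at line boundaries, it requires whitespace between numbers, and the names NUMBER, NUMBER_LINE and parse_numbers are hypothetical:

```ruby
# One number: optional minus sign, digits, optional fractional part.
NUMBER = /-?\d+(?:\.\d+)?/
# A whole string consisting only of whitespace-separated numbers.
NUMBER_LINE = /\A\s*#{NUMBER}(?:\s+#{NUMBER})*\s*\z/

# Returns an array of numbers, or nil if the line contains anything else.
def parse_numbers(line)
  return nil unless line.match?(NUMBER_LINE)

  line.split.map { |token| token.include?(".") ? token.to_f : token.to_i }
end

p parse_numbers("910 -6.258000 6.290")
p parse_numbers("blabla9999 some more text 1.1")
```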
maybe /(-?\d+(\.\d+)?)+/ (note the escaped dot; an unescaped . would match any character)
irb(main):010:0> "910 -6.258000 6.290".scan(/(\-?\d+(\.\d+)?)+/).map{|x| x[0]}
=> ["910", "-6.258000", "6.290"]
str = " 910 -6.258000 6.290"
str.scan(/-?\d+\.?\d+/).map(&:to_f)
# => [910.0, -6.258, 6.29]
If you don't want integers to be converted to floats, try this:
str = " 910 -6.258000 6.290"
str.scan(/-?\d+\.?\d+/).map do |ns|
ns[/\./] ? ns.to_f : ns.to_i
end
# => [910, -6.258, 6.29]
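One caveat with /-?\d+\.?\d+/ is that it needs at least two digits, so a lone single-digit number is skipped. A sketch of the difference, using a made-up input with a leading 7 and a variant pattern that makes the fractional part an optional group:

```ruby
str = "7 910 -6.258000 6.290"

# Original pattern: \d+ ... \d+ requires two or more digits in total.
two_digit_min = str.scan(/-?\d+\.?\d+/)

# Variant: digits, then an optional ".digits" group.
any_length = str.scan(/-?\d+(?:\.\d+)?/)

# Keep integers as integers, as in the answer above.
converted = any_length.map { |ns| ns.include?(".") ? ns.to_f : ns.to_i }

p two_digit_min, any_length, converted
```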

Resources