Extracting parts of substring after a given target

Given a string like (but not identical to) this:
"\ndigfodigjn \nfdoigoidfgj \nResidence\n123 N 74TH STREET \nPhiladelphia\nPA 19020\ngfhfgh gfhgfh \ndfijoij"
It will contain the substring "Residence", and I want to extract the three substrings after it. Each is separated by a newline, but there is no guarantee about the total number of newlines in the entire string. The only guarantee is that the Residence substring is followed by three newline-delimited substrings that represent the address.
I want this:
123 N 74TH STREET Philadelphia PA 19020
I am able to get the Residence substring this way:
str.split("\n").detect {|s| s =~ /^Residence/ }
But how can I get the substrings I want after it?

Given:
> s="\ndigfodigjn \nfdoigoidfgj \nResidence\n123 N 74TH STREET \nPhiladelphia\nPA 19020\ngfhfgh gfhgfh \ndfijoij"
You can slice the multiline string with a regex and capture the 3 lines after:
> s[/Residence\s*([^\n]*\n[^\n]*\n[^\n]*\n)/]
=> "Residence\n123 N 74TH STREET \nPhiladelphia\nPA 19020\n"
Or if you just want the capture group portion:
> s[/Residence\s*([^\n]*\n[^\n]*\n[^\n]*\n)/,1]
=> "123 N 74TH STREET \nPhiladelphia\nPA 19020\n"
Then you can split that on "\n" if you need three strings.
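As a minimal sketch of that last step, assuming the sample string from the question, the captured block can be split, trimmed, and joined into the single-line address the question asks for. The variable names block, parts, and address are illustrative only.

```ruby
s = "\ndigfodigjn \nfdoigoidfgj \nResidence\n123 N 74TH STREET \nPhiladelphia\nPA 19020\ngfhfgh gfhgfh \ndfijoij"

# Capture the three lines that follow "Residence" (group 1 of the regex).
block = s[/Residence\s*([^\n]*\n[^\n]*\n[^\n]*\n)/, 1]

# Split into lines and trim the stray trailing spaces.
parts = block.split("\n").map(&:strip)
address = parts.join(" ")

p parts
# => ["123 N 74TH STREET", "Philadelphia", "PA 19020"]
puts address
# 123 N 74TH STREET Philadelphia PA 19020
```

Note that the pattern requires a newline after the third line, so the slice returns nil if the address is the very end of the string.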

Based on @dawg's answer, this will do the trick:
s[/Residence(\n[^\n]+){3}/].split("\n")[1..3]
The regex looks for Residence, then for three repetitions of a newline followed by one or more non-newline characters.
The matched string can then be split on newlines; the last three elements hold the address.
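For comparison, here is a sketch that avoids the regex slice and builds on the split("\n") approach from the question instead. It is a different technique from the answers here, and the method name address_after and its parameters are hypothetical.

```ruby
# Hypothetical helper: return the `count` lines following the first line
# that starts with `marker`, joined by single spaces (nil if no such line).
def address_after(str, marker = "Residence", count = 3)
  lines = str.split("\n")
  idx = lines.index { |line| line.start_with?(marker) }
  return nil unless idx

  lines[idx + 1, count].map(&:strip).join(" ")
end

str = "\ndigfodigjn \nfdoigoidfgj \nResidence\n123 N 74TH STREET \nPhiladelphia\nPA 19020\ngfhfgh gfhgfh \ndfijoij"
puts address_after(str)
# 123 N 74TH STREET Philadelphia PA 19020
```

If fewer than three lines follow the marker, this version returns whatever lines are there rather than failing.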

Try with a lookbehind expression:
> str[/(?<=Residence)(\n[^\n]+){3}/].split("\n").join
=> "123 N 74TH STREET PhiladelphiaPA 19020"

Related

Regex to obfuscate substring of a repeating substring

Given a string like:
abc_1234 xyz def_123aa4a56
I want to replace parts of it so the output is:
abc_*******z def_*******56
The rules are:
abc_ and def_ act as delimiters: anything between two delimiters belongs to the preceding one.
The text following each delimiter, up to the next delimiter or the end of the string, should be replaced by *, except for its last 2 characters. In the above example, "abc_1234 xyz " (note the trailing space) is turned into "abc_*******z ".

prefixes = %w|abc_ def_|
input = "Hello abc_111def_frg def_333World abc_444"
input.gsub(/(#{Regexp.union(prefixes)})../, "\\1**")
#⇒ "Hello abc_**1def_**g def_**3World abc_**4"

Is this what you are looking for?
str = "Hello abc_111def_frg def_333World abc_444"
str.scan(/(?<=abc_|def_)(?:[[:alpha:]]+|[[:digit:]]+)/)
# => ["111", "frg", "333", "444"]
I've assumed the string following "abc_" or "def_" is either all digits or all letters. It won't work if, for example, you wished to extract "a1b" from "abc_a1b cat". You need to better define the rules for what terminates the strings you want.
The regular expression reads, "Following the string "abc_" or "def_" (a positive lookbehind that is not part of the match), match a string of digits or a string of letters".

Given:
> s
=> "abc_1234 xyz def_123aa4a56"
You can do:
> s.gsub(/(?<=abc_|def_)(.*?)(..)(?=(?:abc_|def_|$))/) { |m| "*" * $1.length<<$2 }
=> "abc_*******z def_*******56"
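The same idea can be sketched as a reusable method. This is only an illustration: the name mask_after_prefixes and the keep parameter are hypothetical, the prefix is captured and re-emitted instead of using a lookbehind, and each delimited segment is assumed to be at least keep characters long.

```ruby
# Hypothetical helper: after each prefix, replace everything up to the next
# prefix (or the end of the string) with "*", keeping the last `keep` characters.
def mask_after_prefixes(text, prefixes, keep: 2)
  prefix_re = Regexp.union(prefixes)
  pattern = /(#{prefix_re})(.*?)(.{#{keep}})(?=#{prefix_re}|\z)/
  text.gsub(pattern) do
    m = Regexp.last_match
    # m[1] is the prefix, m[2] the part to hide, m[3] the part to keep.
    m[1] + ("*" * m[2].length) + m[3]
  end
end

puts mask_after_prefixes("abc_1234 xyz def_123aa4a56", %w[abc_ def_])
# abc_*******z def_*******56
```

Regexp.union escapes the prefixes, so the list can contain characters that are special in a regex.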

Regex to find a newline character ("\n") and replace with empty string from address

We have a string which contains address in it like below:
"first-name, last-name, email, address\n Ashok, G, \"Hyderabad\nTelangana\n India\"\n John, M, \"Mayur Vihar\nNew Delhi\n110096, India\"\n"
and the requirement is to replace all the newline characters ("\n") with "" in the address string only (inside \" \").
The Expected output should be like:
"first-name, last-name, email, address\n Ashok, G, \"Hyderabad Telangana India\"\n John, M, \"Mayur Vihar, New Delhi 110096, India\"\n "

\\n(?=(?:(?!\\").)*\\"(?:(?:(?!\\").)*\\"(?:(?!\\").)*\\")*(?:(?!\\").)*$)
Try this. Replace by an empty string. See demo.
https://www.regex101.com/r/rG7gX4/7

I suggest you do it as follows:
str.gsub(/"[^"]*"/) { |s| s.gsub(/\n/,' ') }
#=> "first-name, last-name, email, address\n Ashok, G, \"Hyderabad Telangana  India\"\n John, M, \"Mayur Vihar New Delhi 110096, India\"\n"
This matches each string bracketed by \", quotes included, which in turn is passed to the block for replacement of all \n's with a space. [^"]* matches any run of characters other than ", newlines included, so each match stops at the closing quote of its own pair. A version built on lookarounds, /(?<=\").*?(?=\")/, does not do the job: without the m option . never matches \n, and with it the text between one closing quote and the next opening quote is matched as well.
This doesn't give quite the spacing contained in your desired output. That spacing seems somewhat inconsistent, however. For example, where did the single space at the end of the string come from? You said you wanted to replace \n between pairs of \", but you didn't say what you want to replace it with. (I assumed one space.) If you want different spacing, you could adjust the regex used by gsub inside the block. For example, you might have /\s*\n\s*/.
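A minimal sketch of that spacing adjustment, assuming the quoted fields never contain an escaped double quote. It matches each quoted field with /"[^"]*"/ and collapses every newline, together with the whitespace around it, into one space; the variable name cleaned is illustrative.

```ruby
str = "first-name, last-name, email, address\n Ashok, G, \"Hyderabad\nTelangana\n India\"\n John, M, \"Mayur Vihar\nNew Delhi\n110096, India\"\n"

# Only the quoted fields reach the block, so the newlines that separate
# records are left untouched.
cleaned = str.gsub(/"[^"]*"/) { |quoted| quoted.gsub(/\s*\n\s*/, " ") }

p cleaned
# => "first-name, last-name, email, address\n Ashok, G, \"Hyderabad Telangana India\"\n John, M, \"Mayur Vihar New Delhi 110096, India\"\n"
```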

Matching repeated pattern in string

I have street names and numbers in a file, like so:
Sokolov 19, 20, 23 ,25
Hertzl 80,82,84,86
Hertzl 80a,82b,84e,90
Aba Hillel Silver 2,3,5,6,
Weizman 8
Ahad Ha'am 9 13 29
I parse the lines one by one with regex. I want a regex that will find and match:
The name of the street,
The street numbers with its possible a,b,c,d attached.
I've come up with this in the meantime:
/(\D{2,})\s+(\d{1,3}[a-d|א-ד]?)(?:[,\s]{1,3})?/
It finds the street name and first number. I need to find all the numbers.
I don't want to use two separate regexes if possible, and I'd prefer not to use Ruby's scan but to have it all in one regex.

You can use regex to find all the numbers, with their separators:
re = /\A(.+?)\s+((?:\d+[a-z]*[,\s]+)*\d+[a-z]*)/
txt = "Sokolov 19, 20, 23 ,25
Hertzl 80,82,84,86
Hertzl 80a,82b,84e,90
Aba Hillel Silver 2,3,5,6,
Weizman 8
Ahad Ha'am 9 13 29"
matches = txt.lines.map{ |line| line.match(re).to_a[1..-1] }
p matches
#=> [["Sokolov", "19, 20, 23 ,25"],
#=> ["Hertzl", "80,82,84,86"],
#=> ["Hertzl", "80a,82b,84e,90"],
#=> ["Aba Hillel Silver", "2,3,5,6"],
#=> ["Weizman", "8"],
#=> ["Ahad Ha'am", "9 13 29"]]
The above regex says:
\A Starting at the front of the string
(…) Capture the result
.+? Find one or more characters, as few as possible that make the rest of this pattern match.
\s+ Followed by one or more whitespace characters (which we don't capture)
(…) Capture the result
(?:…)* Find zero or more of what's in here, but don't capture them
\d+ One or more digits (0–9)
[a-z]* Zero or more lowercase letters
[,\s]+ One or more commas and/or whitespace characters
\d+ Followed by one or more digits
[a-z]* And zero or more lowercase letters
However, if you want to break the number up into pieces you will need to use scan or split or the equivalent.
result = matches.map{ |name,numbers| [name,numbers.scan(/[^,\s]+/)] }
p result
#=> [["Sokolov", ["19", "20", "23", "25"]],
#=> ["Hertzl", ["80", "82", "84", "86"]],
#=> ["Hertzl", ["80a", "82b", "84e", "90"]],
#=> ["Aba Hillel Silver", ["2", "3", "5", "6"]],
#=> ["Weizman", ["8"]],
#=> ["Ahad Ha'am", ["9", "13", "29"]]]
This is because regex captures inside a repeating group do not capture each repetition. For example:
re = /((\d+) )+/
txt = "hello 11 2 3 44 5 6 77 world"
p txt.match(re)
#=> #<MatchData "11 2 3 44 5 6 77 " 1:"77 " 2:"77">
The regex as a whole matches the entire run of numbers, but each capture group only saves the last instance it saw. In this case, the outer capture only gets "77 " and the inner capture only gets "77".
Why do you prefer not to use scan? This is what it is made for.
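Putting the two steps together, a per-line parser might look like the sketch below. The constant STREET_RE and the method parse_street_line are hypothetical names; the regex is the one from this answer, and lines that do not match yield nil.

```ruby
# The pattern from above: street name, then the run of house numbers.
STREET_RE = /\A(.+?)\s+((?:\d+[a-z]*[,\s]+)*\d+[a-z]*)/

# Hypothetical helper: parse one line into a name and an array of numbers.
def parse_street_line(line)
  m = STREET_RE.match(line)
  return nil unless m

  # The regex captures the numbers as one string; scan splits them up.
  { name: m[1], numbers: m[2].scan(/[^,\s]+/) }
end

p parse_street_line("Hertzl 80a,82b,84e,90")
# => {:name=>"Hertzl", :numbers=>["80a", "82b", "84e", "90"]}
p parse_street_line("Aba Hillel Silver 2,3,5,6,")
# => {:name=>"Aba Hillel Silver", :numbers=>["2", "3", "5", "6"]}
```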

If you want your 3rd example to work, you need to have the [a-d] change to include the e in the range. After changing that you can use (\D{2,})\s+(\d{1,3}[a-e]?(?:[,\s]{1,3})*)*. Using the examples you gave I did some testing using Rubular.
Using some more groupings, you can repeat those last few conditions (which seem to be pretty tricky). This way the spacing and comma at the end get caught in the repetition after the initial space is consumed.

The only way around the limitation that you can only capture the last instance of a repeated expression is to write your regex for a single instance and let the regex machine do the repeating for you, as occurs with the global substitute options, admittedly similar to scan. Unfortunately, in that case, you have to match for either the street name or the street number and then have no way to easily associate the captured numbers with the captured names.
Regex is great at what it does, but when you try to extend its application beyond its natural limitations, it's not pretty. ;-)

I want a regex that will find and match....
Do the street names also contain digits (0-9), or other characters besides an apostrophe?
Are the street numbers based on arbitrary data? Is it always just an optional a, b, c, or d?
Do you need a minimum and maximum limit on string length?
Here are some possible options:
If you are unsure about what the street name contains, but know your street number pattern will be numbers with an optional letter, separated by commas or spaces:
/^(.*?)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/
If the street names contain only letters with optional apostrophes, and the street numbers are numbers with an optional letter, separated by commas or spaces:
/^([a-zA-Z' ]+)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/
If your street name and street number pattern are always consistent, you could simply use:
/^([a-zA-Z' ]+)\s+([0-9a-z, ]+)$/

How do I match repeated characters?

How do I find repeated characters using a regular expression?
If I have aaabbab, I would like to match only characters which have three repetitions:
aaa

Try string.scan(/((.)\2{2,})/).map(&:first), where string is your string of characters.
The way this works is that it looks for any character and captures it (the dot), then matches repeats of that character (the \2 backreference) 2 or more times (the {2,} range means "anywhere between 2 and infinity times"). Scan will return an array of arrays, so we map the first matches out of it to get the desired results.
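A small sketch of this on the sample input, wrapped in a method so the minimum run length can vary. The name repeated_runs and the min parameter are hypothetical.

```ruby
# Hypothetical helper: return every run of the same character that is
# at least `min` characters long.
def repeated_runs(string, min = 3)
  # Group 2 is the single character, \2{min-1,} its repeats,
  # and group 1 (the first element of each scan result) the whole run.
  string.scan(/((.)\2{#{min - 1},})/).map(&:first)
end

p repeated_runs("aaabbab")
# => ["aaa"]
p repeated_runs("aaaabbbcc")
# => ["aaaa", "bbb"]
```

Because of the open-ended {2,} range, runs longer than three characters are returned whole rather than cut to three.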

How to insert tag every 5 characters in a Ruby String?

I would like to insert a <wbr> tag every 5 characters.
Input: s = 'HelloWorld-Hello guys'
Expected outcome: Hello<wbr>World<wbr>-Hell<wbr>o guys

s = 'HelloWorld-Hello guys'
s.scan(/.{5}|.+/).join("<wbr>")
Explanation:
scan collects all matches of the regexp into an array. The .{5} matches any 5 characters. If fewer than 5 characters are left at the end of the string, they are matched by the .+ instead. Finally, join the array with your tag.

There are several options to do this. If you just want to insert a delimiter string you can use scan followed by join as follows:
s = '12345678901234567'
puts s.scan(/.{1,5}/).join(":")
# 12345:67890:12345:67
.{1,5} matches between 1 and 5 of "any" character, but since it's greedy, it will take 5 if it can. The allowance for taking fewer is to accommodate the last match, where there may not be enough characters left.
Another option is to use gsub, which allows for more flexible substitutions:
puts s.gsub(/.{1,5}/, '<\0>')
# <12345><67890><12345><67>
\0 is a backreference to what group 0 matched, i.e. the whole match. So substituting with <\0> effectively puts whatever the regex matched in literal brackets.
If whitespaces are not to be counted, then instead of ., you want to match \s*\S (i.e. a non whitespace, possibly preceded by whitespaces).
s = '123 4 567 890 1 2 3 456 7 '
puts s.gsub(/(\s*\S){1,5}/, '[\0]')
# [123 4 5][67 890][ 1 2 3 45][6 7]

Here is a solution that is adapted from the answer to a recent question:
class String
  def in_groups_of(n, sep = ' ')
    chars.each_slice(n).map(&:join).join(sep)
  end
end
p 'HelloWorld-Hello guys'.in_groups_of(5,'<wbr>')
# "Hello<wbr>World<wbr>-Hell<wbr>o guy<wbr>s"
The result differs from your example in that the space counts as a character, leaving the final s in a group of its own. Was your example flawed, or do you mean to exclude spaces (whitespace in general?) from the character count?
To only count non-whitespace (“sticking” trailing whitespace to the last non-whitespace, leaving whitespace-only strings alone):
# count "hard coded" into regexp
s.scan(/(?:\s*\S(?:\s+\z)?){1,5}|\s+\z/).join('<wbr>')
# parametric count
s.scan(/\s*\S(?:\s+\z)?|\s+\z/).each_slice(5).map(&:join).join('<wbr>')
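The parametric variant can be wrapped in a method without reopening String. This is only a sketch; the name insert_every and its parameters are hypothetical, and it counts non-whitespace characters only, which reproduces the outcome expected in the question.

```ruby
# Hypothetical helper: insert `sep` after every `n` non-whitespace characters.
def insert_every(str, n, sep)
  # Each token is one non-whitespace character plus any whitespace before it;
  # whitespace at the very end of the string sticks to the last token.
  tokens = str.scan(/\s*\S(?:\s+\z)?|\s+\z/)
  tokens.each_slice(n).map(&:join).join(sep)
end

puts insert_every('HelloWorld-Hello guys', 5, '<wbr>')
# Hello<wbr>World<wbr>-Hell<wbr>o guys
```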
