I have this code running inside a buffer (used to unescape a JS string in Ruby):
elsif hex_substring =~ /^\\u[0-9a-fA-F]{1,4}/
hex_substring.scan(/^((\\u[\da-fA-F]{4}){1,})/) do |match|
hex_byte = match[0]
buffer << JSON.load(%Q("#{hex_byte}"))
hex_index += hex_byte.length
end
...
I have a concern that the scan() is matching a bit too much:
hex_substring.scan(/^((\\u[\da-fA-F]{4}){1,})/)
# => [["\\ud83c\\udfec", "\\udfec"]]
I am using only "\\ud83c\\udfec", not "\\udfec".
Is there a way in Ruby or in regex to grab only the first part?
You should use a single grouping construct here, the one to match 1 or more occurrences of four hex chars, and omit the inner capturing group that resulted in an extra item in the resulting array:
.scan(/^(?:\\u[\da-fA-F]{4})+/)
Note that + is a simpler and shorter way to write {1,} (one or more occurrences).
Details
^ - start of string
(?: - start of a non-capturing group (what it matches won't be added to the final scan result):
\\u - a \u substring
[\da-fA-F]{4} - four hex chars
)+ - 1 or more occurrences (of the group pattern sequence).
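To see the difference side by side, here is a minimal sketch that runs both patterns on a sample input. The sample string and the variable names are assumptions made for illustration, and decoding through a one-element JSON array is just one possible way to turn the escapes into a character, not necessarily what the original buffer code does.

```ruby
require 'json'

# Sample input (assumed): a surrogate pair written as literal \uXXXX escapes.
hex_substring = '\ud83c\udfec and more text'

# Nested capturing groups: scan returns one array per match, one item per group.
with_groups = hex_substring.scan(/^((\\u[\da-fA-F]{4}){1,})/)

# Non-capturing group: scan returns each whole match as a plain string.
without_groups = hex_substring.scan(/^(?:\\u[\da-fA-F]{4})+/)

p with_groups     # [["\\ud83c\\udfec", "\\udfec"]]
p without_groups  # ["\\ud83c\\udfec"]

# Decode the escapes by wrapping them in a one-element JSON array.
decoded = JSON.parse(%Q(["#{without_groups.first}"])).first
puts decoded.length  # 1
```

With the non-capturing version there is no need to pick match[0] out of a nested array; each element yielded by scan is already the full run of escapes.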
Given any email address, I would like to keep only the first two and last two characters and insert four asterisks on each side of the # character.
The best way to explain are examples:
lorem.ipsum#gmail.com changed to lo****#****om
foo#foo.de changed to fo****#****de
How to do it with gsub?
If you want to mask with a fixed number of * symbols, you may use
'lorem.ipsum#gmail.com'.sub(/\A(..).*#.*(..)\z/, '\1****#****\2')
# => lo****#****om
See the Ruby demo.
Here,
\A - start of string anchor
(..) - Group 1: first 2 chars
.*#.* - any 0+ chars other than line break chars as many as possible up to the last # followed with another set of 0+ chars other than line break ones
(..) - Group 2: last 2 chars
\z - end of string.
The \1 in the replacement string refers to the value kept in Group 1, and \2 references the value in Group 2.
If you want to mask existing chars while keeping their number, you might consider an approach to capture the parts of the string you need to keep or process, and manipulate the captures inside a sub block:
'lorem.ipsum#gmail.com'.sub(/\A(..)(.*)#(.*)(..)\z/) {
  $1 + "*"*$2.length + "#" + "*"*$3.length + $4
}
# => lo*********#*******om
See the Ruby demo
Details
\A - start of string
(..) - Group 1 capturing any 2 chars
(.*) - Group 2 capturing any 0+ chars, as many as possible, up to the last # char
# - a # char
(.*) - Group 3 capturing any 0+ chars, as many as possible, up to the last two chars
(..) - Group 4: last two chars
\z - end of string.
Note that inside the block, $1 contains Group 1 value, $2 holds Group 2 value, and so on.
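As a quick check of both approaches against the two addresses from the question, here is a small sketch. The method names mask_fixed and mask_keep_length are hypothetical and exist only for this illustration.

```ruby
# Hypothetical helpers; the names are illustrative only.

# Fixed-width mask: always four asterisks on each side of '#'.
def mask_fixed(address)
  address.sub(/\A(..).*#.*(..)\z/, '\1****#****\2')
end

# Length-preserving mask: one asterisk per hidden character.
def mask_keep_length(address)
  address.sub(/\A(..)(.*)#(.*)(..)\z/) do
    $1 + '*' * $2.length + '#' + '*' * $3.length + $4
  end
end

puts mask_fixed('lorem.ipsum#gmail.com')        # lo****#****om
puts mask_fixed('foo#foo.de')                   # fo****#****de
puts mask_keep_length('lorem.ipsum#gmail.com')  # lo*********#*******om
puts mask_keep_length('foo#foo.de')             # fo*#****de
puts mask_fixed('a#b')                          # a#b (no match, left unchanged)
```

Note that sub returns the string unchanged when the pattern does not match, so very short inputs pass through unmasked.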
Using gsub with look-ahead and look-behind regex patterns:
'lorem.ipsum#gmail.com'.gsub(/(?<=.{2}).*#.*(?=\S{2})/, '****#****')
=> "lo****#****om"
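Using plain Ruby (String#first and String#last are ActiveSupport extensions, so outside Rails use index slicing instead):
str[0, 2] + '****#****' + str[-2, 2]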
=> "lo****#****om"
I have a solution which doesn't fully solve your problem but it's pretty flexible and I think it's worth it to share it for anyone else looking for similar solutions.
module CoreExtensions
  module String
    module MaskChars
      def mask_chars(except_first_n: 1, except_last_n: 2, mask_with: '*')
        if except_first_n.zero? && except_last_n.zero?
          raise ArgumentError, "except_first_n and except_last_n can't both be zero"
        end
        if length < (except_first_n + except_last_n)
          raise ArgumentError, "String '#{self}' must be at least #{except_first_n}"\
            " (except_first_n) + #{except_last_n} (except_last_n) ="\
            " #{except_first_n + except_last_n} characters long"
        end
        sub(
          /\A(.{#{except_first_n}})(.*)(.{#{except_last_n}})\z/,
          '\1' + (mask_with * (length - (except_first_n + except_last_n))) + '\3'
        )
      end
    end
  end
end
Let me explain the regex in /\A(.{#{except_first_n}})(.*)(.{#{except_last_n}})\z/
\A - start of string
(.{#{except_first_n}}) or (.{1}) Group 1: first n chars. Default value of except_first_n is 1
(.*) Group 2 capturing any 0+ chars as many as possible before the last n characters
(.{#{except_last_n}}) or (.{2}) Group 3: last n chars. Default value of except_last_n is 2
\z - end of string
Let me explain what's happening in '\1' + (mask_with * (length - (except_first_n + except_last_n))) + '\3'
The replacement starts with Group 1 (\1), which holds as many characters as the except_first_n argument specifies. Group 2 is not reused; it is replaced with the mask_with character repeated length - (except_first_n + except_last_n) times, that is, the total length of the string minus the number of characters kept at both ends. This ensures that exactly the right number of mask_with characters sits between the first except_first_n and the last except_last_n characters. Group 3 (\3) then closes the result with the last except_last_n characters.
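The same arithmetic can be sketched without a regex, using plain string slicing. The method name mask_middle and its keyword names are hypothetical; this is a swapped-in slicing variant for illustration, not the mask_chars implementation above.

```ruby
# Hypothetical regex-free variant of the same calculation.
def mask_middle(str, first_n: 1, last_n: 2, mask: '*')
  # Number of characters to hide: total length minus what is kept at both ends.
  hidden = str.length - (first_n + last_n)
  raise ArgumentError, 'string is too short to mask' if hidden.negative?

  str[0, first_n] + mask * hidden + str[str.length - last_n, last_n]
end

account = '123456789101112'
puts account.length - (1 + 2)  # 12
puts mask_middle(account)      # 1************12
puts mask_middle(account, first_n: 3, last_n: 4, mask: '#')  # 123########1112
```

For the 15-character sample, the defaults keep 1 + 2 characters, so 12 mask characters are inserted.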
Then I created an initializer file config/initializers/core_extensions.rb with this line:
String.include CoreExtensions::String::MaskChars
It will add mask_chars as an instance method to the String class available to all strings.
It should work like this:
account = "123456789101112"
=> "123456789101112"
account.mask_chars
=> "1************12"
account.mask_chars(except_first_n: 3, except_last_n: 4, mask_with: '#')
=> "123########1112"
I think this is a flexible method that can be useful in many scenarios.
I'm trying to learn ruby and having a hard time figuring out what each individual part of this code is doing. Specifically, how does the global subbing determine whether two sequential numbers are both one of these values [13579] and how does it add a dash (-) in between them?
def DashInsert(num)
  num_str = num.to_s
  num_str.gsub(/([13579])(?=[13579])/, '\1-')
end
num_str.gsub(/([13579])(?=[13579])/, '\1-')
([13579]) is a capturing group: it captures the characters matched by the pattern inside it. Here the pattern is [13579], which matches a single digit from the given set. The matched digit is stored in group 1.
(?=[13579]) is a positive lookahead, which asserts that the match must be followed by text matching the pattern inside the lookahead, without consuming it. The replacement occurs only if this condition is satisfied.
\1 refers to the characters captured in group 1.
Example:
> "13".gsub(/([13579])(?=[13579])/, '\1-')
=> "1-3"
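To show why the lookahead matters, here is a small sketch comparing it with a variant that consumes the second digit. The consuming pattern is an assumption added purely for contrast; it is not part of the original code.

```ruby
# Consuming both digits: the second digit becomes part of the match,
# so it cannot start the next match.
consuming = '135'.gsub(/([13579])([13579])/, '\1-\2')

# Lookahead: only the first digit is consumed; the second is checked
# but left in place for the next match.
lookahead = '135'.gsub(/([13579])(?=[13579])/, '\1-')

puts consuming  # 1-35
puts lookahead  # 1-3-5
```

In the consuming variant the 3 is already used up by the first match, so the pair 3 and 5 is never seen.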
You may start with some random tests:
def DashInsert(num)
  num_str = num.to_s
  num_str.gsub(/([13579])(?=[13579])/, '\1-')
end
10.times{
  x = rand(10000)
  puts "%6i: %6s" % [x,DashInsert(x)]
}
Example:
9633: 963-3
7774: 7-7-74
6826: 6826
7386: 7-386
2145: 2145
7806: 7806
9499: 949-9
4117: 41-1-7
4920: 4920
14: 14
And now to check the regex.
([13579]) Take any odd digit and remember it (it can be used later with \1)
(?=[13579]) Check if the next digit is also odd, but don't take it (it still remains in the string)
'\1-' Output the first odd digit and add a - to it.
In other words:
Put a - between every two adjacent odd digits.
I have street names and numbers in a file, like so:
Sokolov 19, 20, 23 ,25
Hertzl 80,82,84,86
Hertzl 80a,82b,84e,90
Aba Hillel Silver 2,3,5,6,
Weizman 8
Ahad Ha'am 9 13 29
I parse the lines one by one with regex. I want a regex that will find and match:
The name of the street,
The street numbers with its possible a,b,c,d attached.
I've come up with this in the meantime:
/(\D{2,})\s+(\d{1,3}[a-d|א-ד]?)(?:[,\s]{1,3})?/
It finds the street name and first number. I need to find all the numbers.
I don't want to use two separate regexes if possible, and I prefer not to use Ruby's scan but just have it in one regex.
You can use regex to find all the numbers, with their separators:
re = /\A(.+?)\s+((?:\d+[a-z]*[,\s]+)*\d+[a-z]*)/
txt = "Sokolov 19, 20, 23 ,25
Hertzl 80,82,84,86
Hertzl 80a,82b,84e,90
Aba Hillel Silver 2,3,5,6,
Weizman 8
Ahad Ha'am 9 13 29"
matches = txt.lines.map{ |line| line.match(re).to_a[1..-1] }
p matches
#=> [["Sokolov", "19, 20, 23 ,25"],
#=> ["Hertzl", "80,82,84,86"],
#=> ["Hertzl", "80a,82b,84e,90"],
#=> ["Aba Hillel Silver", "2,3,5,6"],
#=> ["Weizman", "8"],
#=> ["Ahad Ha'am", "9 13 29"]]
The above regex says:
\A Starting at the front of the string
(…) Capture the result
.+? Find one or more characters, as few as possible that make the rest of this pattern match.
\s+ Followed by one or more whitespace characters (which we don't capture)
(…) Capture the result
(?:…)* Find zero or more of what's in here, but don't capture them
\d+ One or more digits (0–9)
[a-z]* Zero or more lowercase letters
[,\s]+ One or more commas and/or whitespace characters
\d+ Followed by one or more digits
[a-z]* And zero or more lowercase letters
However, if you want to break the number up into pieces you will need to use scan or split or the equivalent.
result = matches.map{ |name,numbers| [name,numbers.scan(/[^,\s]+/)] }
p result
#=> [["Sokolov", ["19", "20", "23", "25"]],
#=> ["Hertzl", ["80", "82", "84", "86"]],
#=> ["Hertzl", ["80a", "82b", "84e", "90"]],
#=> ["Aba Hillel Silver", ["2", "3", "5", "6"]],
#=> ["Weizman", ["8"]],
#=> ["Ahad Ha'am", ["9", "13", "29"]]]
This is because regex captures inside a repeating group do not capture each repetition. For example:
re = /((\d+) )+/
txt = "hello 11 2 3 44 5 6 77 world"
p txt.match(re)
#=> #<MatchData "11 2 3 44 5 6 77 " 1:"77 " 2:"77">
The whole regex matches the entire run of numbers, but each capture only saves the last-seen instance. In this case, the outer capture only gets "77 " and the inner capture only gets "77".
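A minimal sketch of the contrast, reusing the sample text from above: the repeated group keeps only its last repetition, while scan with a single-instance pattern returns every occurrence.

```ruby
txt = 'hello 11 2 3 44 5 6 77 world'

# A repeated group keeps only its last repetition.
m = txt.match(/((\d+) )+/)
p m[0]  # "11 2 3 44 5 6 77 "
p m[2]  # "77"

# scan applies a single-instance pattern repeatedly and keeps every match.
numbers = txt.scan(/\d+/)
p numbers  # ["11", "2", "3", "44", "5", "6", "77"]
```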
Why do you prefer not to use scan? This is what it is made for.
If you want your 3rd example to work, you need to have the [a-d] change to include the e in the range. After changing that you can use (\D{2,})\s+(\d{1,3}[a-e]?(?:[,\s]{1,3})*)*. Using the examples you gave I did some testing using Rubular.
Using some more groupings, you can repeat those last few conditions (which seem to be pretty tricky). This way, the spacing and comma at the end get caught in the repetition after the initial space is consumed.
The only way around the limitation that you can only capture the last instance of a repeated expression is to write your regex for a single instance and let the regex machine do the repeating for you, as occurs with the global substitute options, admittedly similar to scan. Unfortunately, in that case, you have to match for either the street name or the street number and then have no way to easily associate the captured numbers with the captured names.
Regex is great at what it does, but when you try to extend its application beyond its natural limitations, it's not pretty. ;-)
I want a regex that will find and match....
Do the street names also contain digits (0-9), other characters beside an apostrophe?
Are the street numbers based off arbitrary data? Is it always just an optional a, b, c, or d?
Are you needing a minimum and maximum limitation of string length?
Here are some possible options:
If you are unsure about what the street name contains, but know that the street numbers are digits with an optional letter, separated by commas or spaces:
/^(.*?)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/
See working demo
If the street names contain only letters with optional apostrophes, and the street numbers are digits with an optional letter, separated by commas or spaces:
/^([a-zA-Z' ]+)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/
See working demo
If your street name and street number patterns are always consistent, you could simply use:
/^([a-zA-Z' ]+)\s+([0-9a-z, ]+)$/
See working demo
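Here is a minimal sketch that runs the first of the options above against some of the sample lines from the question. The variable names are assumptions, and splitting the captured run into individual numbers still relies on scan.

```ruby
# First option from above: lazy street name, then the run of numbers.
street_re = /^(.*?)\s+(\d+(?:[a-z]?[, ]+\d+)*)(?=,|$)/

lines = [
  'Sokolov 19, 20, 23 ,25',
  'Hertzl 80a,82b,84e,90',
  'Aba Hillel Silver 2,3,5,6,',
  "Ahad Ha'am 9 13 29"
]

parsed = lines.map do |line|
  m = line.match(street_re)
  # Group 1 is the street name, group 2 the whole run of numbers.
  [m[1], m[2].scan(/[^,\s]+/)]
end

parsed.each { |name, numbers| puts "#{name}: #{numbers.inspect}" }
# Sokolov: ["19", "20", "23", "25"]
# Hertzl: ["80a", "82b", "84e", "90"]
# Aba Hillel Silver: ["2", "3", "5", "6"]
# Ahad Ha'am: ["9", "13", "29"]
```

One caveat worth checking against your data: this pattern only allows a letter on a number that is followed by another number, so a line such as 'Hertzl 80a' does not match at all.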
I would like to insert a <wbr> tag every 5 characters.
Input: s = 'HelloWorld-Hello guys'
Expected outcome: Hello<wbr>World<wbr>-Hell<wbr>o guys
s = 'HelloWorld-Hello guys'
s.scan(/.{5}|.+/).join("<wbr>")
Explanation:
scan collects all matches of the regexp into an array. The .{5} matches any 5 characters. If there are fewer than 5 characters left at the end of the string, they will be matched by the .+. Finally, join the array with your separator string.
There are several options to do this. If you just want to insert a delimiter string you can use scan followed by join as follows:
s = '12345678901234567'
puts s.scan(/.{1,5}/).join(":")
# 12345:67890:12345:67
.{1,5} matches between 1 and 5 of "any" character, but since it's greedy, it will take 5 if it can. The allowance for taking fewer is to accommodate the last match, where there may not be enough characters left.
Another option is to use gsub, which allows for more flexible substitutions:
puts s.gsub(/.{1,5}/, '<\0>')
# <12345><67890><12345><67>
\0 is a backreference to what group 0 matched, i.e. the whole match. So substituting with <\0> effectively puts whatever the regex matched in literal brackets.
If whitespaces are not to be counted, then instead of ., you want to match \s*\S (i.e. a non whitespace, possibly preceded by whitespaces).
s = '123 4 567 890 1 2 3 456 7 '
puts s.gsub(/(\s*\S){1,5}/, '[\0]')
# [123 4 5][67 890][ 1 2 3 45][6 7]
Here is a solution that is adapted from the answer to a recent question:
class String
  def in_groups_of(n, sep = ' ')
    chars.each_slice(n).map(&:join).join(sep)
  end
end
p 'HelloWorld-Hello guys'.in_groups_of(5,'<wbr>')
# "Hello<wbr>World<wbr>-Hell<wbr>o guy<wbr>s"
The result differs from your example in that the space counts as a character, leaving the final s in a group of its own. Was your example flawed, or do you mean to exclude spaces (whitespace in general?) from the character count?
To only count non-whitespace (“sticking” trailing whitespace to the last non-whitespace, leaving whitespace-only strings alone):
# count "hard coded" into regexp
s.scan(/(?:\s*\S(?:\s+\z)?){1,5}|\s+\z/).join('<wbr>')
# parametric count
s.scan(/\s*\S(?:\s+\z)?|\s+\z/).each_slice(5).map(&:join).join('<wbr>')
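As a rough check, the parametric version can be wrapped in a small method and run on the string from the question. The method name insert_every is hypothetical; the regex is the one shown above.

```ruby
# Hypothetical helper: insert sep after every n non-whitespace characters.
def insert_every(str, n, sep)
  # Each token is one non-whitespace char plus any whitespace before it;
  # whitespace at the very end sticks to the last token.
  tokens = str.scan(/\s*\S(?:\s+\z)?|\s+\z/)
  tokens.each_slice(n).map(&:join).join(sep)
end

s = 'HelloWorld-Hello guys'
puts insert_every(s, 5, '<wbr>')      # Hello<wbr>World<wbr>-Hell<wbr>o guys
puts s.scan(/.{5}|.+/).join('<wbr>')  # Hello<wbr>World<wbr>-Hell<wbr>o guy<wbr>s
```

The first line reproduces the expected outcome from the question, which suggests the example there did not count the space; the second line shows the result when every character counts.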