What is the fastest way to modify a large string in Ruby? - ruby

I need to modify a string in Ruby. Specifically, I'm trying to remove 'holes' from a WKT string. A hole is any single set of parentheses, with numbers inside, that comes after the first such set. For example, in this string...
POLYGON ((1 2, 3 4), (5 6, 7 8))
I would need to remove , (5 6, 7 8) because this parenthesis data is a hole, and the comma and the space don't belong except to separate sets of parentheses.
I am avoiding Ruby methods like match or scan to try to optimize for speed and achieve O(n) running time.
Here's what I have so far.
def remove_holes_from(wkt)
  output_string = ""
  last_3_chars = [ nil, nil, nil ]
  number_chars = [ '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' ]
  should_delete_chars = false
  wkt.each_char do |char|
    last_3_chars.shift
    last_3_chars.push(char)
    if should_delete_chars == false
      if number_chars.include?(last_3_chars[0]) && last_3_chars[1] == ")" && last_3_chars[2] == ","
        should_delete_chars = true
        next
      else
        output_string += char
      end
    end
    if should_delete_chars == true
      if number_chars.include?(last_3_chars[0]) && last_3_chars[1] == ")" && last_3_chars[2] == ")"
        should_delete_chars = false
        output_string += char
      else
        next
      end
    end
  end
  output_string
end
The problem I am facing is that for a large polygon, like the United States (over 500,000 characters and over 40,000 points) it takes me 66 seconds to complete this. You can find the string here: https://gist.github.com/cheeseandpepper/9da5ca6ade921da2b4ab
Can anyone think of optimizations to this example I can use? Or maybe a separate approach? Thanks.

Whelp... regex wins!
wkt.gsub(/, \(-?\d.*?\)/, "")
took me 0.003742 seconds
As for the regex
,    Literal comma
(space)    Literal space
\(    Literal open parenthesis
-?    Optional negative sign
\d    Any digit (because the previous is optional, we need to make sure we have a digit vs another open parenthesis)
.*?    Any number of any characters, matched lazily (will be digits, spaces, commas and maybe negative signs)
\)    A literal close parenthesis; because the preceding .*? is lazy, the match stops at the first one
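For comparison, here is a rough sketch of the regex next to a hand-written single pass. The method names are hypothetical, and the single-pass version assumes a plain POLYGON (not a MULTIPOLYGON), where any comma at parenthesis depth 1 introduces a hole. It appends with << rather than +=, because += builds a new String on every character, which is a likely contributor to the slowness of the original loop.

```ruby
HOLE = /, \(-?\d.*?\)/

# Regex version from the answer above, wrapped in a (hypothetical) helper.
def remove_holes(wkt)
  wkt.gsub(HOLE, "")
end

# Single-pass sketch. Assumes a plain POLYGON: a comma seen at depth 1
# sits between rings, so everything up to the next ")" is a hole.
def remove_holes_single_pass(wkt)
  output = +""      # mutable buffer; << appends in place
  depth = 0
  skipping = false
  wkt.each_char do |char|
    if skipping
      # Drop characters up to and including the hole's closing parenthesis.
      skipping = false if char == ")"
      next
    end
    if char == "," && depth == 1
      skipping = true
      next
    end
    depth += 1 if char == "("
    depth -= 1 if char == ")"
    output << char
  end
  output
end

remove_holes("POLYGON ((1 2, 3 4), (5 6, 7 8))")
# => "POLYGON ((1 2, 3 4))"
remove_holes_single_pass("POLYGON ((1 2, 3 4), (5 6, 7 8))")
# => "POLYGON ((1 2, 3 4))"
```

In practice the gsub call is likely to stay the faster of the two, since the regex engine runs in C while each_char yields to Ruby code for every character.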

Related

Move decimal fixed number of spaces with Ruby

I have large integers (typically 15-30 digits) stored as strings that represent an amount of a given currency (such as ETH). Stored alongside each is the number of digits to move the decimal point.
{
"base_price"=>"5000000000000000000",
"decimals"=>18
}
The output that I'm ultimately looking for is 5.00 (which is what you'd get if you took the decimal point from 5000000000000000000 and moved it 18 positions to the left).
How would I do that in Ruby?
Given:
my_map = {
"base_price"=>"5000000000000000000",
"decimals"=>18
}
You could use:
my_number = my_map["base_price"].to_i / (10**my_map["decimals"]).to_f
puts(my_number)
h = { "base_price"=>"5000000000000000000", "decimals"=>18 }
bef, aft = h["base_price"].split(/(?=\d{#{h["decimals"]}}\z)/)
#=> ["5", "000000000000000000"]
bef + '.' + aft[0,2]
#=> "5.00"
The regular expression uses the positive lookahead (?=\d{18}\z) to split the string at a ("zero-width") location between digits such that 18 digits follow to the end of the string.
Alternatively, one could write:
str = h["base_price"][0, h["base_price"].size-h["decimals"]+2]
#=> h["base_price"][0, 3]
#=> "500"
str.insert(str.size-2, '.')
#=> "5.00"
Neither of these addresses potential boundary cases such as
{ "base_price"=>"500", "decimals"=>1 }
or
{ "base_price"=>"500", "decimals"=>4 }
Nor do they consider rounding issues.
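The boundary cases above (fewer digits than decimals, or a fractional part shorter than the requested precision) can be handled by padding the string first. The following is only a sketch: shift_decimal and its places parameter are hypothetical names, it assumes a non-negative string of digits, and it truncates instead of rounding.

```ruby
# Hypothetical helper: moves the decimal point `decimals` places to the left
# and keeps `places` fractional digits (truncated, not rounded).
# Assumes `digits` is a non-negative integer written as a string.
def shift_decimal(digits, decimals, places = 2)
  # Left-pad with zeroes so there is always at least one integer digit.
  padded = digits.rjust(decimals + 1, "0")
  whole = padded[0, padded.size - decimals]
  fraction = padded[padded.size - decimals, decimals]
  # Right-pad short fractions, then cut to the requested precision.
  fraction = fraction.ljust(places, "0")[0, places]
  places.zero? ? whole : "#{whole}.#{fraction}"
end

shift_decimal("5000000000000000000", 18) # => "5.00"
shift_decimal("500", 1)                  # => "50.00"
shift_decimal("500", 4)                  # => "0.05"
```

Working on the string avoids the precision loss that converting a 30-digit value to a Float can introduce.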
Regular expressions and interpolation?
my_map = {
"base_price"=>"5000000000000000000",
"decimals"=>18
}
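my_map["base_price"].sub(
  /(0{#{my_map["decimals"]}})\s*$/,
  '.\1'
)
The number of decimal places is interpolated into the regular expression as the count of zeroes to look for from the end of the string (plus zero or more whitespace characters). This is matched, and the match is replaced with itself preceded by a dot. The backreference has to be written as '.\1': a double-quoted ".#{$1}" is interpolated before sub runs, when $1 has not yet been set by this match. Note that this only works when the trailing digits are all zeroes.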
Producing:
=> "5.000000000000000000"

How to replace all characters but for the first and last two with gsub Ruby

Given any email address, I would like to keep only the first and last two characters and insert 4 asterisks to the left and right of the # character.
The best way to explain are examples:
lorem.ipsum#gmail.com changed to lo****#****om
foo#foo.de changed to fo****#****de
How to do it with gsub?
If you want to mask with a fixed number of * symbols, you may use
'lorem.ipsum#gmail.com'.sub(/\A(..).*#.*(..)\z/, '\1****#****\2')
# => lo****#****om
Here,
\A - start of string anchor
(..) - Group 1: first 2 chars
.*#.* - any 0+ chars other than line break chars as many as possible up to the last # followed with another set of 0+ chars other than line break ones
(..) - Group 2: last 2 chars
\z - end of string.
The \1 in the replacement string refers to the value kept in Group 1, and \2 references the value in Group 2.
If you want to mask existing chars while keeping their number, you might consider an approach to capture the parts of the string you need to keep or process, and manipulate the captures inside a sub block:
'lorem.ipsum#gmail.com'.sub(/\A(..)(.*)#(.*)(..)\z/) {
$1 + "*"*$2.length + "#" + "*"*$3.length + $4
}
# => lo*********#*******om
Details
\A - start of string
(..) - Group 1 capturing any 2 chars
(.*) - Group 2 capturing any 0+ chars as many as possible up to the last # char
# - # char
(.*) - Group 3 capturing any 0+ chars as many as possible up to the last two chars
(..) - Group 4: last two chars
\z - end of string.
Note that inside the block, $1 contains Group 1 value, $2 holds Group 2 value, and so on.
Using gsub with look-ahead and look-behind regex patterns:
'lorem.ipsum#gmail.com'.gsub(/(?<=.{2}).*#.*(?=\S{2})/, '****#****')
=> "lo****#****om"
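Using String#first and String#last (these come from Rails' ActiveSupport; in plain Ruby, use str[0, 2] and str[-2, 2] instead):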
str.first(2) + '****#****' + str.last(2)
=> "lo****#****om"
I have a solution which doesn't fully solve your problem but it's pretty flexible and I think it's worth it to share it for anyone else looking for similar solutions.
module CoreExtensions
  module String
    module MaskChars
      def mask_chars(except_first_n: 1, except_last_n: 2, mask_with: '*')
        if except_first_n.zero? && except_last_n.zero?
          raise ArgumentError, "except_first_n and except_last_n can't both be zero"
        end
        if length < (except_first_n + except_last_n)
          raise ArgumentError, "String '#{self}' must be at least #{except_first_n}"\
            " (except_first_n) + #{except_last_n} (except_last_n) ="\
            " #{except_first_n + except_last_n} characters long"
        end
        sub(
          /\A(.{#{except_first_n}})(.*)(.{#{except_last_n}})\z/,
          '\1' + (mask_with * (length - (except_first_n + except_last_n))) + '\3'
        )
      end
    end
  end
end
Let me explain the regex /\A(.{#{except_first_n}})(.*)(.{#{except_last_n}})\z/
\A - start of string
(.{#{except_first_n}}), i.e. (.{1}) by default - Group 1: first n chars. Default value of except_first_n is 1
(.*) - Group 2 capturing any 0+ chars as many as possible before the last n characters
(.{#{except_last_n}}), i.e. (.{2}) by default - Group 3: last n chars. Default value of except_last_n is 2
\z - end of string
Now for the replacement, '\1' + (mask_with * (length - (except_first_n + except_last_n))) + '\3':
The result starts with group 1 (\1), which holds the first except_first_n characters, and ends with group 3 (\3), which holds the last except_last_n characters. Group 2 is not reused; it is replaced by mask_with repeated length - (except_first_n + except_last_n) times, that is, the total length of the string minus the characters kept at both ends. When mask_with is a single character, this keeps the masked string the same length as the original.
Then I created an initializer file config/initializers/core_extensions.rb with this line:
String.include CoreExtensions::String::MaskChars
It will add mask_chars as an instance method to the String class available to all strings.
It should work like this:
account = "123456789101112"
=> "123456789101112"
account.mask_chars
=> "1************12"
account.mask_chars(except_first_n: 3, except_last_n: 4, mask_with: '#')
=> "123########1112"
I think this is a flexible method that can be useful in many scenarios.

Regex cuts word if end of string

I want to check and capture 2 (or x) words before and after a target string in a multiline text. The problem is that if fewer than x words are available, the regex cuts up the last word and spreads its pieces over the remaining capture groups.
For example
text = "This is an example /year"
if example is the target:
Matching Data: "is" , "an", "/yea", "r"
If I add random words after /year, it matches correctly.
How could I fix this so that if less than x words exist just stop there or return empty for the rest of the matches?
So it should be
Matching Data: "is" , "an", "/year", ""
def checkWords(target, text, numLeft = 2, numRight = 2)
  target = target.compact.map{|x| x.inspect}.join('').gsub(/"/, '')
  regex = ""
  regex += "\\s+{,2}(\\S+)\\s+{,2}" * numLeft
  regex += target
  regex += "\\s+{,2}(\\S+)" * numRight
  pattern = Regexp.new(regex)
  matches = pattern.match(text)
  puts matches.inspect
end
Since you want to capture the words before and after target, you need to set a capturing group around each whole regex part that matches the 0 to 2 occurrences of spaces-non-spaces. Also, you need to allow a minimum bound of 0 - use the {0,2} (or the more succinct {,2}) limiting quantifier to make sure you get the context on the left even if it is missing on the right:
/((?:\S+\s+){,2})target((?:\s+\S+){,2})/
 ^              ^      ^              ^
If you use /(?:(\S+)\s+){0,2}target(?:\s+(\S+)){0,2}/, all captured values but the last one will be lost, i.e. once quantified, repeated capturing groups only store the value captured during the last iteration in the group buffer.
Also note that setting a {,2} quantifier on the + quantifier makes no sense, \\s+{,2} = \\s+.
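Putting the pieces together, a method along these lines might look as follows. This is a sketch only: context_words is a hypothetical name, and it assumes target is a plain String (the question's code builds it from an array), which is escaped with Regexp.escape so that characters like / or . are matched literally. Each captured group holds a whole run of words, so it is split afterwards.

```ruby
# Hypothetical helper: returns up to num_left words before the target and up
# to num_right words after it, or nil when the target does not occur.
def context_words(target, text, num_left = 2, num_right = 2)
  pattern = /((?:\S+\s+){0,#{num_left}})#{Regexp.escape(target)}((?:\s+\S+){0,#{num_right}})/
  match = pattern.match(text)
  return nil unless match
  # Group 1 is the run of words on the left, group 2 the run on the right.
  [match[1].split, match[2].split]
end

context_words("example", "This is an example /year")
# => [["is", "an"], ["/year"]]
```

When fewer words are available than requested, the corresponding array is simply shorter instead of containing fragments of a word.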

Counting and removing leading characters

I have strings that contain a variable number of leading hyphens and which may or may not contain a hyphen in the body of the string. For example:
--xxx-xxx
-xxxx
---xxxxxx-xx
How do I in Ruby a) count the number of leading hyphens and b) return the string with the leading hyphens removed?
Many thanks for your help!
>> "--xxx-xxx"[/\A-+/].size
=> 2
>> "--xxx-xxx".sub(/\A-+/, '')
=> "xxx-xxx"
EDIT: The comment from @shime made me want to show the other relevant capability of String#[] or String#slice:
>> "--xxx-xxx"[/\A-+(.+)/, 1]
=> "xxx-xxx"
Remove leading hyphens:
.sub(/^-*/, "")
Count leading hyphens by subtracting the length of the string before and after the removal.
For removing leading hyphens:
str.sub(/^-+(.+)/, '\1')
s = '---xx-x'
new_s = s.gsub(/\A-*/, '')
hyph_num = s.length - new_s.length
gsub removes leading hyphens. And the difference between s and new_s length equals the number of leading hyphens.
count = 0
"-----xxx---xxx---".each_char do |ch|
  break if ch != '-'
  count = count + 1
end
# => 5
"-----xxx---xxx---".sub(/^-+/, '')
# => "xxx---xxx---"
Read Ruby string strip defined characters and https://stackoverflow.com/a/3166005/284795
Then compare the length of the naked string and the original string
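The counting and stripping steps can also be combined. A small sketch, with split_leading_hyphens as a hypothetical name; it uses -* rather than -+ in the pattern, because "xxx"[/\A-+/] returns nil for a string without leading hyphens, and calling size on nil raises NoMethodError.

```ruby
# Hypothetical helper: returns [number of leading hyphens, remaining string].
def split_leading_hyphens(str)
  hyphens = str[/\A-*/]   # always a String, possibly empty
  [hyphens.size, str[hyphens.size..-1]]
end

split_leading_hyphens("--xxx-xxx")    # => [2, "xxx-xxx"]
split_leading_hyphens("---xxxxxx-xx") # => [3, "xxxxxx-xx"]
split_leading_hyphens("xxx")          # => [0, "xxx"]
```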

Checking if a string has balanced parentheses

I am currently working on a Ruby problem quiz, but I'm not sure if my solution is right. After running the check, it shows that the compilation was successful, but I'm worried it is not the right answer.
The problem:
A string S consisting only of characters '(' and ')' is called properly nested if:
S is empty,
S has the form "(U)" where U is a properly nested string,
S has the form "VW" where V and W are properly nested strings.
For example, "(()(())())" is properly nested and "())" isn't.
Write a function
def nesting(s)
that given a string S returns 1 if S is properly nested and 0 otherwise.
Assume that the length of S does not exceed 1,000,000. Assume that S consists only of characters '(' and ')'.
For example, given S = "(()(())())" the function should return 1 and given S = "())" the function should return 0, as explained above.
Solution:
def nesting ( s )
  # write your code here
  if s == '(()(())())' && s.length <= 1000000
    return 1
  elsif s == ' ' && s.length <= 1000000
    return 1
  elsif
    s == '())'
    return 0
  end
end
Here are descriptions of two algorithms that should accomplish the goal. I'll leave it as an exercise to the reader to turn them into code (unless you explicitly ask for a code solution):
Start with a variable set to 0 and loop through each character in the string: when you see a '(', add one to the variable; when you see a ')', subtract one from the variable. If the variable ever goes negative, you have seen too many ')' and can return 0 immediately. If you finish looping through the characters and the variable is not exactly 0, then you had too many '(' and should return 0.
Remove every occurrence of '()' in the string (replace with ''). Keep doing this until you find that nothing has been replaced (check the return value of gsub!). If the string is empty, the parentheses were matched. If the string is not empty, it was mismatched.
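For readers who do want code, the two descriptions above might be sketched like this. The name nesting comes from the problem statement; nesting_by_removal is a hypothetical name. Both assume the input contains only '(' and ')', as the problem allows.

```ruby
# Counter approach: one pass, constant extra memory.
def nesting(s)
  depth = 0
  s.each_char do |char|
    depth += (char == "(" ? 1 : -1)
    return 0 if depth < 0   # a ")" appeared with nothing open
  end
  depth.zero? ? 1 : 0       # leftover "(" means not properly nested
end

# Removal approach: strip "()" pairs until nothing changes.
def nesting_by_removal(s)
  remaining = s.dup         # gsub! mutates, so work on a copy
  loop do
    # gsub! returns nil when no substitution was made
    break unless remaining.gsub!("()", "")
  end
  remaining.empty? ? 1 : 0
end

nesting("(()(())())")            # => 1
nesting("())")                   # => 0
nesting_by_removal("(()(())())") # => 1
nesting_by_removal("())")        # => 0
```

The counter version is the safer choice for long inputs: the removal version needs one pass per nesting level, so a deeply nested string of 1,000,000 characters could take a very large number of passes.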
You're not supposed to just enumerate the given examples. You're supposed to solve the problem generally. You're also not supposed to check that the length is below 1000000, you're allowed to assume that.
The most straightforward solution to this problem is to iterate through the string and keep track of how many parentheses are open right now. If you ever see a closing parenthesis when no parentheses are currently open, the string is not well-balanced. If any parentheses are still open when you reach the end, the string is not well-balanced. Otherwise it is.
Alternatively, you could turn the specification directly into a regex pattern using the recursive regex feature of Ruby 1.9, if you were so inclined.
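As an illustration of the recursive-regex idea, Ruby's regex engine supports subexpression calls with \g<name>, which lets a named group refer to itself. A sketch, where NESTED and nesting_regex are hypothetical names (match? requires Ruby 2.4 or later):

```ruby
# (?<paren> ... ) names a group: "(" followed by zero or more nested
# groups (\g<paren> calls the group recursively) followed by ")".
# The outer * allows several such groups side by side, or none at all.
NESTED = /\A(?<paren>\(\g<paren>*\))*\z/

def nesting_regex(s)
  NESTED.match?(s) ? 1 : 0
end

nesting_regex("(()(())())") # => 1
nesting_regex("())")        # => 0
```

This mirrors the problem's definition closely, but it may not cope with very deeply nested input near the 1,000,000-character limit, so the counting loop is probably the more robust option there.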
My algorithm would use a stack for this purpose. Stacks are well suited to problems like this.
Algorithm
Define a hash which holds the pairs of matching brackets, for
instance {"(" => ")", "{" => "}", and so on...}
Declare a stack (in our case, an array), i.e. brackets = []
Loop through the string using each_char; when a character is one of the keys of the hash (an opening bracket), push it onto brackets
Within the same loop, when a character is one of the values of the hash (a closing bracket), pop from brackets and check that the popped opening bracket pairs with it; if not, the string is unbalanced
In the end, if the brackets stack is empty, the brackets are balanced.
def brackets_balanced?(string)
  brackets_hash = {"(" => ")", "{" => "}", "[" => "]"}
  closing = brackets_hash.values
  brackets = []
  string.each_char do |x|
    if brackets_hash.key?(x)
      brackets.push(x)
    elsif closing.include?(x)
      # The closer must pair with the most recently opened bracket.
      # On an empty stack, pop returns nil and the check fails.
      return false unless brackets_hash[brackets.pop] == x
    end
  end
  brackets.empty?
end
You can solve this problem theoretically, using a grammar like this (ε is the empty string; the trailing S and the ε alternative are needed to cover strings such as "()()" and the empty string):
S ← L S R S | ε
L ← (
R ← )
A recursive algorithm can recognize this grammar easily.
That would be the most elegant solution. Otherwise, as already mentioned here, count the open parentheses.
Here's a neat way to do it using inject:
class String
  def valid_parentheses?
    valid = true
    self.gsub(/[^\(\)]/, '').split('').inject(0) do |counter, parenthesis|
      counter += (parenthesis == '(' ? 1 : -1)
      valid = false if counter < 0
      counter
    end.zero? && valid
  end
end
> "(a+b)".valid_parentheses? # => true
> "(a+b)(".valid_parentheses? # => false
> "(a+b))".valid_parentheses? # => false
> "(a+b))(".valid_parentheses? # => false
You're right to be worried; I think you've got the wrong end of the stick and are solving the problem too literally. The note that the string doesn't exceed 1,000,000 characters is just there to stop people worrying about how slowly their code would run if the length were 100 times that, and the examples are just that: examples, not the definitive list of strings you can expect to receive.
I'm not going to do your homework for you (by writing the code), but will give you a pointer to a solution that occurs to me:
The string is correctly nested if every left bracket has a right bracket to the right of it, with either nothing or a correctly nested set of brackets between them. So how about a recursive function, or a loop, that removes every "()" it finds in the string? When you run out of matches, what are you left with? Nothing? That was a properly nested string then. Something else (like ')' or ')(', etc.) would mean it was not correctly nested in the first place.
Define method:
def check_nesting str
  pattern = /\(\)/
  while str =~ pattern do
    str = str.gsub pattern, ''
  end
  str.length == 0
end
And test it:
>ruby nest.rb (()(())())
true
>ruby nest.rb (()
false
>ruby nest.rb ((((()))))
true
>ruby nest.rb (()
false
>ruby nest.rb (()(((())))())
true
>ruby nest.rb (()(((())))()
false
Your solution only returns the correct answer for the strings "(()(())())" and "())". You surely need a solution that works for any string!
As a start, how about counting the number of occurrences of ( and ), and seeing if they are equal?
