Ruby (on Rails) Regex: removing thousands comma from numbers - ruby

This seems like a simple one, but I am missing something.
I have a number of inputs coming in from a variety of sources and in different formats.
Number inputs
123
123.45
123,45 (note the comma used here to denote decimals)
1,234
1,234.56
12,345.67
12,345,67 (note the comma used here to denote decimals)
Additional info on the inputs
Numbers will always be less than 1 million
EDIT: These are prices, so will either be whole integers or go to the hundredths place
I am trying to write a regex and use gsub to strip out the thousands comma. How do I do this?
I wrote a regex: myregex = /\d+(,)\d{3}/
When I test it in Rubular, it shows that it captures the comma only in the test cases that I want.
But when I run gsub, I get an empty string: inputstr.gsub(myregex,"")
It looks like gsub is capturing everything, not just the comma in (). Where am I going wrong?

result = inputstr.gsub(/,(?=\d{3}\b)/, '')
removes commas only if exactly three digits follow.
(?=...) is a lookahead assertion: It needs to be possible to be matched at the current position, but it's not becoming part of the text that is actually matched (and subsequently replaced).
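As a quick check, the lookahead version can be run over the sample inputs from the question. This is only a minimal sketch; the constant and method names are made up for illustration.

```ruby
# Hypothetical names; the regex itself is the one from the answer above.
THOUSANDS_COMMA = /,(?=\d{3}\b)/

def strip_thousands_comma(input)
  # Only the comma is matched; the digits inside the lookahead are not consumed.
  input.gsub(THOUSANDS_COMMA, '')
end

samples = %w[123 123.45 123,45 1,234 1,234.56 12,345.67 12,345,67]
samples.each { |s| puts "#{s} -> #{strip_thousands_comma(s)}" }
# 123,45    -> 123,45    (the decimal comma is kept)
# 1,234.56  -> 1234.56
# 12,345,67 -> 12345,67
```

Note that the decimal comma is left in place, so it still has to be converted separately if a parseable number is needed.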

You are confusing "match" with "capture": to "capture" means to save something so you can refer to it later. You want to capture not the comma, but everything else, and then use the captured portions to build your substitution string.
Try
myregex = /(\d+),(\d{3})/
inputstr.gsub(myregex,'\1\2')
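To make the difference concrete, here is a small sketch comparing the regex from the question with the back-reference version. The method names are hypothetical.

```ruby
# Hypothetical method names, for illustration only.
def strip_with_whole_match(input)
  # The entire match ("1,234") is replaced, not just the captured comma.
  input.gsub(/\d+(,)\d{3}/, '')
end

def strip_with_backreferences(input)
  # Capture the digits on both sides and write them back without the comma.
  input.gsub(/(\d+),(\d{3})/, '\1\2')
end

p strip_with_whole_match('1,234')       # => ""
p strip_with_backreferences('1,234')    # => "1234"
p strip_with_backreferences('1,234.56') # => "1234.56"
p strip_with_backreferences('123,45')   # => "123,45"
```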

In your example, it is possible to tell from the number of digits after the last separator (either , or .) that it is a decimal point, since it is followed by only 2 digits. In most cases, if the last group of digits does not have 3 digits, you can assume that the separator in front of it is the decimal point. Another clue is that a separator appearing several times in a large number helps to distinguish thousands separators from the decimal point.
However, I can give you the string 123,456 or 123.456 without any sort of context, and it is impossible to tell whether it means "123 thousand 456" or "123 point 456".
You need to scan the document for clues about whether , is used as the thousands separator or as the decimal point, and vice versa for the period. Once you have that context, you can safely apply the same method to remove the thousands separators.
You may also want to check out this article on Wikipedia on the less common ways to specify separators or decimal points. Knowing and deciding not to support is better than assuming things will work.
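Under the constraints stated in the question (prices below one million, with either no fractional part or exactly two fractional digits), the "count the digits after the last separator" rule could be sketched as follows. The method name is hypothetical, and the sketch deliberately treats an ambiguous input such as 123,456 as a whole number.

```ruby
# Hypothetical helper. Assumes single-line input and prices that are either
# whole numbers or have exactly two fractional digits.
def normalize_price(input)
  # A final separator followed by exactly two digits is taken as the decimal mark.
  whole, fraction = input.match(/\A(.*?)(?:[.,](\d{2}))?\z/).captures
  # Any separator left in the whole part is treated as a thousands separator.
  whole = whole.delete('.,')
  fraction ? "#{whole}.#{fraction}" : whole
end

p normalize_price('123,45')    # => "123.45"
p normalize_price('1,234')     # => "1234"
p normalize_price('12,345,67') # => "12345.67"
```

This only works because the two-digit rule removes the ambiguity; without such a rule, the context has to come from somewhere else.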

Related

How does this regular expression limit email addresses to ".com" instead of "...com"

The regex below:
EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i
is what I initially used to validate email format. After finding that the format "name@email...com" was passing my tests, I copied and pasted a different regex that limits the number of periods. It looks like this:
EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-]+(?:\.[a-z\d\-]+)*\.[a-z]+\z/i
The main difference is the piece of regex below:
(?:\.[a-z\d\-]+)
I can't quite figure out how this bit works. Can someone break it down for me?
Notice that in this subexpression:
(?:\.[a-z\d\-]+)
The character class [a-z\d-] does not contain a period. The expression requires there to be at least one (+) of those characters after the period (\.) in order to match. Therefore, a series of periods with no letters or digits or hyphens between them won't match the repetition of the subexpression.
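A short sketch comparing the two patterns from the question may help. It assumes the separator in the address is the usual @ sign, and the constant names are made up.

```ruby
# Hypothetical constant names; the patterns follow the question,
# written with @ as the local-part/domain separator.
LOOSE_EMAIL  = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i
STRICT_EMAIL = /\A[\w+\-.]+@[a-z\d\-]+(?:\.[a-z\d\-]+)*\.[a-z]+\z/i

dotty = 'name@email...com'
p LOOSE_EMAIL.match?(dotty)  # => true  ([a-z\d\-.]+ can consume "email..")
p STRICT_EMAIL.match?(dotty) # => false (every period must be followed by a non-period)
p STRICT_EMAIL.match?('name@mail.example.com') # => true
```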
The problem with your regular expression here is that you're allowing for multiple dots:
/[a-z\.]+\.[a-z]+\z/
To fix this you need to make your repeating pattern more specific in terms of structure:
/(?:[a-z]+\.)+[a-z]+\z/
That means you can have one or more repeating groups of letters plus dot. That will exclude multiple dots in a row.
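The two simplified patterns can be tried on a bare domain part. This is a sketch with hypothetical constant names; the patterns are the ones shown above.

```ruby
# Hypothetical constant names for the two simplified patterns.
ALLOWS_DOT_RUNS = /[a-z\.]+\.[a-z]+\z/
GROUPED_LABELS  = /(?:[a-z]+\.)+[a-z]+\z/

p ALLOWS_DOT_RUNS.match?('email...com')      # => true
p GROUPED_LABELS.match?('email...com')       # => false
p GROUPED_LABELS.match?('mail.example.com')  # => true
```

Neither pattern is anchored at the start, so in a full validator they would still need to be combined with the rest of the expression.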
Do keep in mind that email addresses are getting increasingly insane with the introduction of new GTLDs that are often used without any sort of prefix. That is, example@google may be a valid address in the future. You can't expect there to be a dot in the domain.
You have [a-z\d\-]+(?:\.[a-z\d\-]+)*. The [a-z\d\-]+ part ensures that this part of the string starts with a sequence of at least one non-period character. A period is only allowed one per (?:\.[a-z\d\-]+) structure. In each (?:\.[a-z\d\-]+), the period \. is necessarily followed by [a-z\d\-]+, which includes at least one non-period character. This ensures that whenever a period appears, it has at least one non-period character on the left and on the right. In other words, consecutive periods are not allowed.

Phone regex is not completely working

In my country, phone numbers follow a format like (XX)XXXX-XXXX. But not everyone types phone numbers into text inputs according to that pattern: some people do, some don't. I'd like to write a regex that catches all the possible cases. So far it looks like this:
/^[\(]?\d{2}?[\)]?\d{4}[. -]?\d{4}$/
And I prepared some test cases to prove the regex's functionality
# GOOD PHONES #
8432115262
843211 5262
843211.5262
843211-5262
32115262
3211.5262
3211 5262
3211-5262
(84)32115262
(84)3211.5262
(84)3211 5262
(84)3211-5262
# BAD PHONES #
!##$%*()
()32115262
()1231 3213
()1231.3213
()1231-3213
().3213
()-3213
()3213.
()3213-
3211-5a62
sakdiihbnmwlzi
Unfortunately, the bad case ()32115262 gets past the regex, although it is clear why: the part [\(]?\d{2}?[\)]? is responsible for the mistake. From left to right, you can enter zero or one (, then zero or two digits, then zero or one ).
I'd like that part to behave like this: if you enter (, you must also enter two digits and ); otherwise you can enter zero or two digits. Is something like this, or something with similar semantics, possible in the regex world?
Thanks in advance
Something like this perhaps:
/^(?:\(\d{2}\)|\d{2}?)\d{4}[. -]?\d{4}$/
I used a non-capturing group (?: ... ) and alternation to provide two possible options for the first part of the phone number.
Either it is \(\d{2}\), which means parentheses around exactly two digits, or it is \d{2}?, which in Ruby's regex engine means two digits or the empty string (in most other engines, {2}? is simply a lazy {2}).
Combine these two options together with | (which means OR) and you get the first part of the regex above: (?:\(\d{2}\)|\d{2}?)
It seemed to work for all your test cases!
Try this: ^(?:\(\d\d\)|\d\d)?\d{4}[. -]?\d{4}$
If the input contains parentheses, they must enclose exactly two digits.
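A subset of the test cases from the question can be checked against this last pattern. This is a minimal sketch with made-up constant names.

```ruby
# Hypothetical constant names; the pattern is the one suggested above.
PHONE = /^(?:\(\d\d\)|\d\d)?\d{4}[. -]?\d{4}$/

GOOD_PHONES = ['8432115262', '843211 5262', '843211.5262', '32115262',
               '3211-5262', '(84)32115262', '(84)3211-5262'].freeze
BAD_PHONES  = ['()32115262', '()1231-3213', '().3213', '()3213-',
               '3211-5a62', 'sakdiihbnmwlzi'].freeze

p GOOD_PHONES.all?  { |number| PHONE.match?(number) } # => true
p BAD_PHONES.none?  { |number| PHONE.match?(number) } # => true
```

In Ruby, ^ and $ match at line boundaries; if the input may contain newlines, \A and \z are the stricter anchors.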

What's the difference between /\t+|,/ and /[\t+,]/ when splitting a string using Ruby?

I have a string separated by \t and ,, but the number of \t characters is not fixed, for example:
a = "seg1\tseg2\t\tseg3,seg4"
seg2 and seg3 are separated by two \t characters.
So I tried to split them with
a.split(/\t+|,/)
which prints the right answer:
["seg1", "seg2", "seg3", "seg4"]
I also tried this
a.split(/[\t+,]/)
but the answer is
["seg1", "seg2", "", "seg3", "seg4"]
Why does Ruby print different results?
Because \t+ inside [] does not mean "one or more tabs", it means "a tab or a plus". Since it finds two consecutive tabs, it splits twice, and the string in the middle becomes empty.
Most special characters, like . + * ? etc., become "regular" characters when placed inside a character class. There are some exceptions, like ^ (which negates the class when placed at the beginning), \ (which escapes the next character(s), just as it does outside a class) and ] (which closes the class; an unescaped [ is also disallowed there). So, [\t+,] actually means '\t' or '+' or ','.
Unfortunately, I don't know of any reference for the full set of characters that need or don't need escaping inside a character class. When in doubt, I tend to escape just to be sure. In any case, a character class always matches a single character only; if you want something different, you must put your quantifier outside the class. (For example, [\t,]+ if you also accept two commas in a row; otherwise, your first regex is really the correct one.)
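The three variants discussed above can be compared side by side. This is a small sketch; the constant name is made up.

```ruby
# Hypothetical constant name; the string is the example from the question.
SEGMENTS = "seg1\tseg2\t\tseg3,seg4"

p SEGMENTS.split(/\t+|,/)   # => ["seg1", "seg2", "seg3", "seg4"]
p SEGMENTS.split(/[\t+,]/)  # => ["seg1", "seg2", "", "seg3", "seg4"]
p SEGMENTS.split(/[\t,]+/)  # => ["seg1", "seg2", "seg3", "seg4"]

# Inside the class, + is a literal plus sign, so it acts as a delimiter too.
p 'a+b'.split(/[\t+,]/)     # => ["a", "b"]
```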

How to conflate consecutive gsubs in ruby

I have the following
address.gsub(/^\d*/, "").gsub(/\d*-?\d*$/, "").gsub(/\# ?\d*/,"")
Can this be done in one gsub? I would like to pass a list of patterns rather than just one pattern, since they are all being replaced by the same thing.
You could combine them with an alternation operator (|):
address = '6 66-666 #99 11-23'
address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/, "")
# " 66-666  "
address = 'pancakes 6 66-666 # pancakes #99 11-23'
address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/,"")
# "pancakes 6 66-666 pancakes  "
You might want to add little more whitespace cleanup. And you might want to switch to one of:
/\A\d*|\d*-?\d*\z|\# ?\d*/
/\A\d*|\d*-?\d*\Z|\# ?\d*/
depending on what your data really looks like and how you need to handle newlines.
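The difference between the anchors only shows up when the string contains newlines. Here is a minimal sketch on a made-up two-line address, using \d+ for brevity:

```ruby
# Hypothetical sample data, not from the question.
TWO_LINES = "12 Main St\n34 Oak Ave\n"

p TWO_LINES.gsub(/^\d+/, '')   # => " Main St\n Oak Ave\n"   (^ matches at every line start)
p TWO_LINES.gsub(/\A\d+/, '')  # => " Main St\n34 Oak Ave\n" (\A matches only at the string start)

p "Apt 7\n".match?(/\d\Z/)     # => true  (\Z tolerates one trailing newline)
p "Apt 7\n".match?(/\d\z/)     # => false (\z is the very end of the string)
```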
Combining the regexes is a good idea--and relatively simple--but I'd like to recommend some additional changes. To wit:
address.gsub(/^\d+|\d+(?:-\d+)?$|\# *\d+/, "")
Of your original regexes, ^\d* and \d*-?\d*$ will always match, because they don't have to consume any characters. So you're guaranteed to perform two replacements on every line, even if that's just replacing empty strings with empty strings. Of my regexes, ^\d+ doesn't bother to match unless there's at least one digit at the beginning of the line, and \d+(?:-\d+)?$ matches what looks like an integer-or-range expression at the end of the line.
Your third regex, \# ?\d*, will match any # character, and if the # is followed by a space and some digits, it'll take those as well. Judging by your other regexes and my experience with other questions, I suspect you meant to match a # only if it's followed by one or more digits, with optional spaces intervening. That's what my third regex does.
If any of my guesses are wrong, please describe what you were trying to do, and I'll do my best to come up with the right regex. But I really don't think those first two regexes, at least, are what you want.
EDIT (in answer to the comment): When working with regexes, you should always be aware of the distinction between a regex that matches nothing and a regex that doesn't match. You say you're applying the regexes to street addresses. If an address doesn't happen to start with a house number, ^\d* will match nothing--that is, it will report a successful match, said match consisting of the empty string preceding the first character in the address.
That doesn't matter to you; you're just replacing it with another empty string anyway. But why bother doing the replacement at all? If you change the regex to ^\d+, it will report a failed match and no replacement will be performed. The result is the same either way, but the "matches nothing" scenario (^\d*) results in a lot of extra work that the "doesn't match" scenario avoids. In a high-throughput situation, that could be a life-saver.
The other two regexes bring additional complications: \d*-?\d*$ could match a hyphen at the end of the string (e.g. "123-", or even "-"); and \# ?\d* could match a hash symbol anywhere in string, not just as part of an apartment/office number. You know your data, so you probably know neither of those problems will ever arise; I'm just making sure you're aware of them. My regex \d+(?:-\d+)?$ deals with the trailing-hyphen issue, and \# *\d+ at least makes sure there are digits after the hash symbol.
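The "matches nothing" versus "doesn't match" distinction, and the trailing-hyphen issue, can both be observed directly. This sketch uses made-up sample addresses and hypothetical constant names.

```ruby
# Hypothetical constant names and sample strings, for illustration only.
OPTIONAL_DIGITS = /^\d*/
REQUIRED_DIGITS = /^\d+/

p 'Main Street'.scan(OPTIONAL_DIGITS) # => [""]  (a successful, empty match)
p 'Main Street'.scan(REQUIRED_DIGITS) # => []    (no match at all)

# The original end-of-line pattern also eats a dangling hyphen.
p '12 Elm St 4-'.gsub(/\d*-?\d*$/, '')      # => "12 Elm St "
p '12 Elm St 4-'.gsub(/\d+(?:-\d+)?$/, '')  # => "12 Elm St 4-"
```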
I think that if you combine them together in a single gsub() regex, as an alternation, it changes the context of the starting search position.
For example, each of these lines starts at the beginning of the result of the previous regex substitution:
s/^\d*//g
s/\d*-?\d*$//g
s/\# ?\d*//g
whereas this
s/^\d*|\d*-?\d*$|\# ?\d*//g
resumes search/replace where the last match left off and could potentially produce a different overall output, especially since many of the subexpressions search for similar, if not the same, characters, distinguished only by line anchors.
I think your regexes are unique enough in this case, and of course changing the order changes the result.
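Here is a small sketch of an input where the two approaches actually diverge. The input "#12-34" is made up, and the method names are hypothetical. In the combined version, ^\d* wins at position 0 with an empty match, gsub then copies the # and moves on, so the third alternative never gets a chance to see it.

```ruby
# Hypothetical method names; the patterns are the ones from the question.
def strip_sequentially(address)
  # Each gsub restarts from the beginning of the previous result.
  address.gsub(/^\d*/, '').gsub(/\d*-?\d*$/, '').gsub(/\# ?\d*/, '')
end

def strip_combined(address)
  # A single left-to-right pass; alternatives are tried in order at each position.
  address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/, '')
end

p strip_sequentially('#12-34') # => ""
p strip_combined('#12-34')     # => "#"
```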

regex for matching german postal codes but not a

following string:
23434 5465434
58495 / 46949345
58495 - 46949345
58495 / 55643
d 44444 ssdfsdf
64784
45643 dfgh
58495/55643
48593/48309596
675643235
34565435 34545
I only want to extract the bold ones. Each is a five-digit number (a German postal code).
It should not match telephone numbers such as 43564 366334 or 45433 / 45663, as in my example above.
I tried something like ^\b\d{5}, but that's not a good start.
Any hints on how to get this working?
Thanks for all hints.
You could add a negative look-ahead assertion to avoid the matches with phone numbers.
\b[0124678][0-9]{4}\b(?!\s?[ \/-]\s?[0-9]+)
If you're using Ruby 1.9, you can add a negative look-behind assertion as well.
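Run against the sample lines from the question, the look-ahead pattern behaves as sketched below. The constant and method names are made up. Note that the leading class [0124678] leaves out 3, 5 and 9, which keeps the second halves of the sample phone numbers from matching but would also skip real postal codes starting with those digits.

```ruby
# Hypothetical names; the pattern is the one from the answer above.
POSTAL_LOOKAHEAD = /\b[0124678][0-9]{4}\b(?!\s?[ \/-]\s?[0-9]+)/

POSTAL_SAMPLE = [
  '23434 5465434', '58495 / 46949345', '58495 - 46949345', '58495 / 55643',
  'd 44444 ssdfsdf', '64784', '45643 dfgh', '58495/55643', '48593/48309596',
  '675643235', '34565435 34545'
].freeze

def codes_by_lookahead(lines)
  # String#[] with a regex returns the matched text, or nil when nothing matches.
  lines.map { |line| line[POSTAL_LOOKAHEAD] }.compact
end

p codes_by_lookahead(POSTAL_SAMPLE) # => ["44444", "64784", "45643"]
```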
You haven't specified what distinguishes the number you're trying to search for.
Based on the example string you gave, it looks like you just want:
^(\d{5})\n
Which matches lines that start with 5 digits and contain nothing else.
You might want to permit some spaces after the first 5 digits (but nothing else):
^(\d{5})\s*\n
I'm not completely sure about the specified rules. But if you want lines that start with 5 digits and do not contain additional digits, this may work:
^(\d{5})[^\d]*$
If leading white space is okay, then:
^\s*(\d{5})[^\d]*$
Here is the Rubular link that shows the result.
^\D*(\d{5})(\s(\D)*$|()$)
This should (it's untested) match:
a line starting with five digits (or some non-digits and then five digits), then a space, and ending with some non-numbers
a line starting and ending with five digits (or some non-digits and then five digits)
\1 would be the five digits
\2 would be the whole second half, if any
\3 would be the word after the digits, if any
edited to fit the asker's edited question
edit again: I came up with a much more elegant solution:
^\D*(\d{5})\D*$
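This last pattern can be checked line by line against the sample from the question. The sketch below uses hypothetical names and applies the pattern to each line separately, because \D also matches a newline and could otherwise span lines.

```ruby
# Hypothetical names; the pattern is the one from the answer above.
FIVE_DIGITS_ONLY = /^\D*(\d{5})\D*$/

ADDRESS_LINES = [
  '23434 5465434', '58495 / 46949345', '58495 - 46949345', '58495 / 55643',
  'd 44444 ssdfsdf', '64784', '45643 dfgh', '58495/55643', '48593/48309596',
  '675643235', '34565435 34545'
].freeze

def extract_postal_codes(lines)
  # String#[] with a capture index returns group 1, or nil when the line does not match.
  lines.map { |line| line[FIVE_DIGITS_ONLY, 1] }.compact
end

p extract_postal_codes(ADDRESS_LINES) # => ["44444", "64784", "45643"]
```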
