format string (postcode) in ruby - ruby

I need to re-format a list of UK postcodes and have started with the following to strip whitespace and capitalize:
postcode.upcase.gsub(/\s/,'')
I now need to change the postcode so the new postcode will be in a format that will match the following regexp:
^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$
I would be grateful of any assistance.

If this standards doc is to be believed (and Wikipedia concurs), formatting a valid post code for output is straightforward: the last three characters are the second part, everything before is the first part!
So assuming you have a valid postcode, without any pre-embedded space, you just need
def format_post_code(pc)
pc.strip.sub(/([A-Z0-9]+)([A-Z0-9]{3})/, '\1 \2')
end
If you want to validate an input post code first, then the regex you gave looks like a good starting point. Perhaps something like this?
NORMAL_POSTCODE_RE = /^([A-PR-UWYZ][A-HK-Y0-9][A-HJKS-UW0-9]?[A-HJKS-UW0-9]?)\s*([0-9][ABD-HJLN-UW-Z]{2})$/i
GIROBANK_POSTCODE_RE = /^GIR\s*0AA$/i
def format_post_code(pc)
return pc.strip.upcase.sub(NORMAL_POSTCODE_RE, '\1 \2') if pc =~ NORMAL_POSTCODE_RE
return 'GIR 0AA' if pc =~ GIROBANK_POSTCODE_RE
end
Note that I removed the '0-9' part of the first character, which appears unnecessary according to the sources I quoted. I also changed the alpha sets to match the first-cited document. It's still not perfect: a code of the format 'AAA ANN' validates, for example, and I think a more complex RE is probably required.
I think this might cover it (constructed in stages for easier fixing!)
A1 = "[A-PR-UWYZ]"
A2 = "[A-HK-Y]"
A34 = "[A-HJKS-UW]" # assume rule for alpha in fourth char is same as for third
A5 = "[ABD-HJLN-UW-Z]"
N = "[0-9]"
AANN = A1 + A2 + N + N # the six possible first-part combos
AANA = A1 + A2 + N + A34
ANA = A1 + N + A34
ANN = A1 + N + N
AAN = A1 + A2 + N
AN = A1 + N
PART_ONE = [AANN, AANA, ANA, ANN, AAN, AN].join('|')
PART_TWO = N + A5 + A5
NORMAL_POSTCODE_RE = Regexp.new("^(#{PART_ONE})[ ]*(#{PART_TWO})$", Regexp::IGNORECASE)

UK Postcodes aren't consistent, but they are finite - you might be better with a look-up table.

Reformat or pattern match? I suspect the latter, although upcasing it first is a good idea.
Before we proceed though I would point out that you are stripping spaces but your regex contains " {1,2}" which is "one or two space characters". As you have already stripped whitespace you've already caused all to fail the match.
Given a post code as input we can check whether it matches the regex using =~
Here we create some example post codes (taken from the wikipedia page), and test each one against the regex:
post_codes = ["M1 1AA", "M60 1NW", "CR2 6XH", "DN55 1PT", "W1A 1HQ", "EC1A 1BB", "bad one", "cc93h29r2"]
r = /^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$/
post_codes.each do |pc|
# pc =~ r will return something true if we have a match (specifically the integer of first match position)
# We use !! to display it as true|false
puts "#{pc}: #{!!(pc =~ r)}"
end
M1 1AA: true
M60 1NW: true
CR2 6XH: true
DN55 1PT: true
W1A 1HQ: true
EC1A 1BB: true
bad one: false
cc93h29r2: false

Related

ruby regex string $

I have to cut the price from strings like that:
s1 = "somefing $ 100"
s2 = "$ 19081 words $"
s3 = "30$"
s4 = "hi $90"
s5 = "wow 150"
Output should be:
s1 = "100"
s2 = "19081"
s3 = "30"
s4 = "90"
s5 = nil
I use the following regex:
price = str[/\$\s*(\d+)|(\d+)\s*\$/, 1]
But it doesn't work for all types of strings.
Your code always returns the result of the first capture group group whereas in the failing case it is the second capture group that you are interested in. I don't think the [] method has a good way of dealing with this (when using numbered capture groups). You could write this like so
price = str =~ /\$\s*(\d+)|(\d+)\s*\$/ && ($1 || $2)
Although this isn't very legible. If instead you use a named capture group, then you can do
price = str[/\$\s*(?<amount>\d+)|(?<amount>\d+)\s*\$/, 'amount']
Duplicate named capture groups won't always do what you want but when they are in separate alternation branches (as they are here) then it should work.
The problem is that you're always getting value from the first regex group and you don't check the second. So, you're not looking the case after | - the one when digit is before $ sign.
If you look at the graphical representation of your regex, by typing 1 as a second parameter in square brackets, you are covering only the upper path (first case), and you never check lower one (second case).
Basically, try:
price = str[/\$\s*(\d+)|(\d+)\s*\$/, 1] or str[/\$\s*(\d+)|(\d+)\s*\$/, 2]
P.S. I'm not that experienced in Ruby, there might be some more optimal way to type this, but this should do the trick
try this, its much simpler but it may not be the most efficient.
p1 = s1.gsub(' ','')[/\$(\d+)|(\d+)\$/,1]

Ruby Regular expressions (regex): character appear only once at most

Suppose I want to make sure a string x equals any combination of abcd (each character appearing one or zero times-->each character should not repeat, but the combination may appear in any order)
valid ex: bc .. abcd ... bcad ... b... d .. dc
invalid ex. abcdd, cc, bbbb, abcde (ofcourse)
my effort:
I tried various techniques:
the closest I came was
x =~ ^(((a)?(b)?(c)?(d)?))$
but this wont work if I do not type them in the same order as i have written:
works for: ab, acd, abcd, a, d, c
wont work for: bcda, cb, da (anything that is not in the above order)
you can test your solutions here : http://rubular.com/r/wCpD355bub
PS: the characters may not be in alphabetical order, it could be u c e t
If you can use things besides regexes, you can try:
str.chars.uniq.length == str.length && str.match(/^[a-d]+$/)
The general idea here is that you just strip any duplicated characters from the string, and if the length of the uniq array is not equal to the length of the source string, you have a duplicated character in the string. The regex then enforces the character set.
This can probably be improved, but it's pretty straightforward. It does create a couple of extra arrays, so you might want a different approach if this needs to be used in a performance-critical location.
If you want to stick to regexes, you could use:
str.match(/^[a-d]+$/) && !str.match(/([a-d]).*\1/)
That'll basically check that the string only contains the allowed characters, and that those characters are never repeated.
This is really not what regular expressions are meant to do, but if you really really want to.
Here is a regex that satisfies the conditions.
^([a-d])(?!(\1))([a-d])?(?!(\1|\3))([a-d])?(?!(\1|\3|\5))([a-d])?(?!(\1|\3|\5|\7))$
basically it goes through each character, making the group, then makes sure that that group isn't matched. Then checks the next character, and makes sure that group and the previous groups don't match.
You can reverse it (match the condition that would make it fail)
re = /^ # start of line
(?=.*([a-d]).*\1) # match if a letter appears more than once
| # or
(?=.*[^a-d]) # match if a non abcd char appears
/x
puts 'fail' if %w{bc abcd bcad b d dc}.any?{|s| s =~ re}
puts 'fail' unless %w{abcdd cc bbbb abcde}.all?{|s| s =~ re}
I don't think regexes are well suited to this problem, so here is another non-regex solution. It's recursive:
def match_chars_no_more_than_once(characters, string)
return true if string.empty?
if characters.index(string[0])
match_chars_no_more_than_once(characters.sub(string[0],''), string[1..-1])
else
false
end
end
%w{bc bdac hello acbbd cdda}.each do |string|
p [string, match_chars_no_more_than_once('abcd', string)]
end
Output:
["bc", true]
["bdac", true]
["hello", false]
["acbbd", false]
["cdda", false]

Ruby Truncate Words + Long Text

I have the following function which accepts text and a word count and if the number of words in the text exceeded the word-count it gets truncated with an ellipsis.
#Truncate the passed text. Used for headlines and such
def snippet(thought, wordcount)
thought.split[0..(wordcount-1)].join(" ") + (thought.split.size > wordcount ? "..." : "")
end
However what this function doesn't take into account is extremely long words, for instance...
"Helloooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
world!"
I was wondering if there's a better way to approach what I'm trying to do so it takes both word count and text size into consideration in an efficient way.
Is this a Rails project?
Why not use the following helper:
truncate("Once upon a time in a world far far away", :length => 17)
If not, just reuse the code.
This is probably a two step process:
Truncate the string to a max length (no need for regex for this)
Using regex, find a max words quantity from the truncated string.
Edit:
Another approach is to split the string into words, loop through the array adding up
the lengths. When you find the overrun, join 0 .. index just before the overrun.
Hint: regex ^(\s*.+?\b){5} will match first 5 "words"
The logic for checking both word and char limits becomes too convoluted to clearly express as one expression. I would suggest something like this:
def snippet str, max_words, max_chars, omission='...'
max_chars = 1+omision.size if max_chars <= omission.size # need at least one char plus ellipses
words = str.split
omit = words.size > max_words || str.length > max_chars ? omission : ''
snip = words[0...max_words].join ' '
snip = snip[0...(max_chars-3)] if snip.length > max_chars
snip + omit
end
As other have pointed out Rails String#truncate offers almost the functionality you want (truncate to fit in length at a natural boundary), but it doesn't let you independently state max char length and word count.
First 20 characters:
>> "hello world this is the world".gsub(/.+/) { |m| m[0..20] + (m.size > 20 ? '...' : '') }
=> "hello world this is t..."
First 5 words:
>> "hello world this is the world".gsub(/.+/) { |m| m.split[0..5].join(' ') + (m.split.size > 5 ? '...' : '') }
=> "hello world this is the world..."

How does this Ruby app know to select the middle third of sentences?

I am currently following Beginning Ruby by Peter Cooper and have put together my first app, a text analyzer. However, whilst I understand all of the concepts and the way in which they work, I can't for the life of me understand how the app knows to select the middle third of sentences sorted by length from this line:
ideal_sentances = sentences_sorted.slice(one_third, one_third + 1)
I have included the whole app for context any help is much appreciated as so far everything is making sense.
#analyzer.rb --Text Analyzer
stopwords = %w{the a by on for of are with just but and to the my I has some in do}
lines = File.readlines(ARGV[0])
line_count = lines.size
text = lines.join
#Count the characters
character_count = text.length
character_count_nospaces = text.gsub(/\s+/, '').length
#Count the words, sentances, and paragraphs
word_count = text.split.length
paragraph_count = text.split(/\n\n/).length
sentence_count = text.split(/\.|\?|!/).length
#Make a list of words in the text that aren't stop words,
#count them, and work out the percentage of non-stop words
#against all words
all_words = text.scan(/\w+/)
good_words = all_words.select {|word| !stopwords.include?(word)}
good_percentage = ((good_words.length.to_f / all_words.length.to_f)*100).to_i
#Summarize the text by cherry picking some choice sentances
sentances = text.gsub(/\s+/, ' ').strip.split(/\.|\?|!/)
sentances_sorted = sentences.sort_by { |sentence| sentance.length }
one_third = sentences_sorted.length / 3
ideal_sentances = sentences_sorted.slice(one_third, one_third + 1)
ideal_sentances = ideal_sentences.select{ |sentence| sentence =~ /is|are/ }
#Give analysis back to user
puts "#{line_count} lines"
puts "#{character_count} characters"
puts "#{character_count_nospaces} characters excluding spaces"
puts "#{word_count} words"
puts "#{paragraph_count} paragraphs"
puts "#{sentence_count} sentences"
puts "#{sentence_count / paragraph_count} sentences per paragraph (average)"
puts "#{word_count / sentence_count} words per sentence (average)"
puts "#{good_percentage}% of words are non-fluff words"
puts "Summary:\n\n" + ideal_sentences.join(". ")
puts "-- End of analysis."
Obviously I am a beginner so plain English would help enormously, cheers.
It gets a third of the length of the sentence with one_third = sentences_sorted.length / 3 then the line you posted ideal_sentances = sentences_sorted.slice(one_third, one_third + 1) says "grab a slice of all the sentences starting at the index equal to 1/3rd and continue 1/3rd of the length +1".
Make sense?
The slice method in you look it up in the ruby API say this:
If passed two Fixnum objects, returns a substring starting at the
offset given by the first, and a length given by the second.
This means that if you have a sentence broken into three pieces
ONE | TWO | THREE
slice(1/3, 1/3+1)
will return the string starting at 1/3 from the beginning
| TWO | THREE (this is what you are looking at now)
then you return the string that is 1/3+1 distance from where you are, which gives you
| TWO |
sentences is a list of all sentences. sentances_sorted is that list sorted by sentence length, so the middle third will be the sentences with the most average length. slice() grabs that middle third of the list, starting from the position represented by one_third and counting one_third + 1 from that point.
Note that the correct spelling is 'sentence' and not 'sentance'. I mention this only because you have some code errors that result from spelling it inconsistently.
I was stuck on this when I first started too. In plain English, you have to realize that the slice method can take 2 parameters here.
The first is the index. The second is how long slice goes for.
So lets say you start off with 6 sentences.
one_third = 2
slice(one_third, one_third+1)
1/3 of 6 is 2.
1) here the 1/3 means you start at element 2 which is index[1]
2) then it goes on for 2 (6/3) more + 1 length, so a total of 3 spaces
so it is affecting indexes 1 to index 3

Ruby String pad zero OPE ID

I'm working with OPE IDs. One file has them with two trailing zeros, eg, [998700, 1001900]. The other file has them with one or two leading zeros for a total length of six, eg, [009987, 010019]. I want to convert every OPE ID (in both files) to an eight-digit string with exactly two leading zeros and however many zeros at the end to get it to be eight digits long.
Try this:
a = [ "00123123", "077934", "93422", "1231234", "12333" ]
a.map { |n| n.gsub(/^0*/, '00').ljust(8, '0') }
=> ["00123123", "00779340", "00934220", "001231234", "00123330"]
If you have your data parsed and stored as strings, it could be done like this, for example.
n = ["998700", "1001900", "009987", "0010019"]
puts n.map { |i|
i =~ /^0*([0-9]+?)0*$/
"00" + $1 + "0" * [0, 6 - $1.length].max
}
Output:
00998700
00100190
00998700
00100190
This example on codepad.
I'm note very sure though, that I got the description exactly right. Please check the comments and I correct in case it's not exactly what you were looking for.
With the help of the answers given by #detunized & #nimblegorilla, I came up with:
"998700"[0..-3].rjust(6, '0').to_sym
to make the first format I described (always with two trailing zeros) equal to the second.

Resources