ruby remove variable length string from regular expression leaving hyphen - ruby

I have a string such as this: "im# -33.870816,151.203654"
I want to extract the two numbers including the hyphen.
I tried this:
mystring = "im# -33.870816,151.203654"
/\D*(\-*\d+\.\d+),(\-*\d+\.\d+)/.match(mystring)
This gives me:
33.870816,151.203654
How do I get the hyphen?
I need to do this in ruby
Edit: To clarify, the "im# " was just an example; any set of characters can precede the numbers. The numbers are mostly well formed and separated by a comma. I was having trouble with the hyphen (-).
Edit2: Note that the two numbers are latitude and longitude. That pattern is mostly fixed. However, in theory, the preceding string can be arbitrary. I don't expect it to contain digits or a hyphen, but you never know.

How about this?
arr = "im# -33.2222,151.200".split(/[, ]/)[1..-1]
and arr is ["-33.2222", "151.200"], (using the split method).
now
arr[0].to_f is -33.2222 and arr[1].to_f is 151.2
EDIT: stripped "im#" part with [1..-1] as suggested in comments.
EDIT2: also, this works regardless of what the first characters are, as long as they contain no spaces or commas.

If you want to capture the two numbers with the hyphen you can use this regex:
> str = "im# -33.870816,151.203654"
> str.match(/([\d.,-]+)/).captures
=> ["-33.870816,151.203654"]
Edit: now it captures hyphen.
This one captures each number separately: http://rubular.com/r/NNP2OTEdiL
Note: String#scan returns all occurrences of the given pattern. In this case:
> str.scan(/(-?\d+\.\d+)/)
=> [["-33.870816"], ["151.203654"]] # Good, but flattened version is better
> str.scan(/(-?\d+\.\d+)/).flatten
=> ["-33.870816", "151.203654"]
I recommend playing around a little with Rubular. There are also some docs about regular expressions in Ruby:
http://www.ruby-doc.org/docs/ProgrammingRuby/html/language.html#UJ
http://www.regular-expressions.info/ruby.html
http://www.ruby-doc.org/core-1.9.3/Regexp.html

Your regex doesn't work because the hyphen is caught by \D, so you have to modify it to catch only the right set of characters.
[^0-9-]* would be a good option.
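As a rough sketch of that fix applied to the question's pattern: the names lat and lng are illustrative, and -? is used in place of \-* on the assumption that each number carries at most one sign.

```ruby
mystring = "im# -33.870816,151.203654"

# [^0-9-]* skips the prefix but stops before a digit or a hyphen,
# so the sign is left for the first capture group.
m = /[^0-9-]*(-?\d+\.\d+),(-?\d+\.\d+)/.match(mystring)

p m.captures   # ["-33.870816", "151.203654"]

# Convert the captured strings to floats (illustrative variable names).
lat, lng = m.captures.map(&:to_f)
```

If the prefix can itself contain digits or hyphens, this pattern may still match, but the leading part is then no longer guaranteed to be skipped cleanly, so it is worth testing against real inputs.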

Related

Extracting numbers with regex in ruby from a numbers divided by a dot (thousand delimiter)

Trying to extract '4995' from the string '4.995,-' with regex in Ruby.
I tried with
/\d+/
Which seems to work from this Rubular screenshot: http://cl.ly/image/111c2x0N3s0C
but running it only outputs
4
You cannot match it in a single regex because it is not a single substring.
"4.995,-".gsub(/\D/, "") # => "4995"
I'm up-voting sawa's answer because it's a good answer.
But since you are new to regular expressions, you may want further explanation as to why his answer works for you.
When you are trying to match with the regexp /\d+/, what you are saying is "Match for me 1 or more consecutive digits." But your target string, 4.995,-, is not made up of only consecutive digits. It has a 4 and it has a 995. The first match of "1 or more consecutive digits" is 4. That's why what you're getting as a result is 4.
Try to look at your problem differently. Instead of saying, "Find me all the digits and extract those out," you could say, "Find me anything that's not a digit, and get rid of it." To do this, you can use Ruby's search-and-replace function, gsub. gsub searches a target string for anything that matches a given regular expression, and then it replaces those matches with some replacement string that you also provide. gsub is documented under String#gsub in the Ruby core documentation.
The regular expression for "non-digit" is /\D/. So, you can do a gsub that looks for any /\D/ and replaces it with a blank string.
'4.995,-'.gsub(/\D/,'')
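To make the difference concrete, here is a small sketch comparing the approaches on the question's string. The variable name price is just for illustration.

```ruby
price = "4.995,-"

p price[/\d+/]            # "4"  (only the first run of digits)
p price.scan(/\d+/)       # ["4", "995"]  (every run of digits)
p price.scan(/\d+/).join  # "4995"
p price.gsub(/\D/, "")    # "4995"  (delete everything that is not a digit)
```

The scan-and-join variant and the gsub variant give the same result here; gsub is simply the shorter way to say it.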
Do as below using String#[] and String#tr:
"4.995,-"[/\d+\.\d+/].tr('.','') # => "4995"
# more Rubyish way using #tr method only
"4.995,-".tr("^0-9",'') # => "4995"
p '4.995,-1'.delete('.')[/\d+/] #=> "4995"
Here's another way that, like @Arup's solution, works when a digit follows the first non-digit:
'4.995,-1'.sub('.','').to_i.to_s #=> "4995"
This works because
'4.995,-1'.sub('.','') #=> "4995,-1"
and to_i converts the leading part of the string that can be read as an integer, ignoring the rest.
Alternatively:
'4.995,-1'.to_f.to_s.sub('.','') #=> "4995"

Regular expression to find first letter in a string

Consider this example string:
mystr ="1. moody"
I want to capitalize the first letter that occurs in mystr. I am trying this regular expression in Ruby, but it still returns all the letters in mystr (moody) instead of only the letter m.
puts mystr.scan(/[a-zA-Z]{1}/)
Any help appreciated!
Do as below using String#sub
(arup~>~)$ pry --simple-prompt
>> s = "1. moody"
=> "1. moody"
>> s.sub(/[a-z]/i,&:upcase)
=> "1. Moody"
>>
If you want to modify the source string use s.sub!(/[a-z]/i,&:upcase).
Just for completeness, although it doesn’t directly answer your question as posed but could be relevant, consider this variation:
mystr ="1. école"
The line mystr.sub(/[a-z]/i,&:upcase) (as in Arup Rakshit’s answer) will match the second letter of the word, producing
1. éCole
The line mystr.sub /\b\s?[a-zA-Z]{1}/, &:upcase (diego.greyrobot’s answer) won’t match at all and so the line will be unchanged.
There are two problems here. The first is that [a-zA-Z] doesn’t match accented characters, so é isn’t matched. The fix for this is to use the \p{Letter} character property:
mystr.sub /\p{Letter}/, &:upcase
This will match the character in question, but won’t change it. This is due to the second problem, which is that upcase (and downcase) only work on characters in the ASCII range in Ruby versions before 2.4. This is almost as easy to fix, but relies on using an external library such as unicode_utils:
require 'unicode_utils'
mystr.sub(/\p{Letter}/) { |c| UnicodeUtils.upcase(c)}
This results in:
1. École
which is probably what is wanted in this case.
This may not affect you if you are sure all your data is just ASCII, but is worth knowing for other situations.
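For newer Rubies, a minimal sketch without the external library, assuming Ruby 2.4 or later, where String#upcase applies Unicode case mapping. The string is written with \u escapes so the snippet does not depend on the source file's encoding.

```ruby
# "1. \u00E9cole" is "1. ecole" with an acute accent on the first e.
mystr = "1. \u00E9cole"

# \p{Letter} matches the first letter of any script, not just a-z.
# Assumption: Ruby >= 2.4, so upcase also handles non-ASCII letters.
result = mystr.sub(/\p{Letter}/, &:upcase)

puts result
```

On such a Ruby version this should print 1. École; on older versions the string comes back unchanged, which is the behaviour described above.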
Your attempt returns all the letters because scan does just that: it returns every substring that matches the regex, which in your case is every letter. For your use case you should use sub, since you only want to substitute one letter.
I use http://rubular.com to practice my Ruby Regexes. Here's what I came up with http://rubular.com/r/fAQEDFVEVn
The regex is: /\b[a-z]/
It uses \b to find a word boundary, then asks for exactly one letter with [a-z].
Finally we'll use sub to replace it with its upcased version:
"1. moody".sub(/\b[a-z]/, &:upcase)
=> "1. Moody"
Hope that helps.

Match comma separated list with Ruby Regex

Given the following string, I'd like to match the elements of the list and parts of the rest after the colon:
foo,bar,baz:something
I.e. I am expecting the first three match groups to be "foo", "bar", "baz". No commas and no colon. The minimum number of elements is 1, and there can be arbitrarily many. Assume no whitespace and lower case.
I've tried this, which should work, but doesn't populate all the match groups for some reason:
^([a-z]+)(?:,([a-z]+))*:(something)
That matches foo in \1 and baz (or whatever the last element is) in \2. I don't understand why I don't get a match group for bar.
Any ideas?
EDIT: Ruby 1.9.3, if that matters.
EDIT2: Rubular link: http://rubular.com/r/pDhByoarbA
EDIT3: Add colon to the end, because I am not just trying to match the list. Sorry, oversimplified the problem.
This expression works for me: /(\w+)/i
If you want to do it with regex, how about this?
(?<=^|,)("[^"]*"|[^,]*)(?=,|$)
This matches comma-separated fields, including the possibility of commas appearing inside quoted strings like 123,"Yes, No". Regexr for this.
More verbosely:
(?<=^|,) # Must be preceded by start-of-line or comma
(
"[^"]*"| # A quote, followed by a bunch of non-quotes, followed by quote, OR
[^,]* # OR anything until the next comma
)
(?=,|$) # Must end with comma or end-of-line
Usage would be with something like Python's re.findall(), which returns all non-overlapping matches in the string (working from left to right, if that matters.) Don't use it with your equivalent of re.search() or re.match() which only return the first match found.
(NOTE: This actually doesn't work in Python because the lookbehind (?<=^|,) isn't fixed width. Grr. Open to suggestions on this one.)
Edit: Use a non-capturing group to consume start-of-line or comma, instead of a lookbehind, and it works in Python.
>>> test_str = '123,456,"String","String, with, commas","Zero-width fields next",,"",nyet,123'
>>> m = re.findall('(?:^|,)("[^"]*"|[^,]*)(?=,|$)',test_str)
>>> m
['123', '456', '"String"', '"String, with, commas"',
'"Zero-width fields next"', '', '""', 'nyet', '123']
Edit 2: The Ruby equivalent of Python's re.findall(needle, haystack) is haystack.scan(needle).
Maybe split would be a better solution for this case?
'foo,bar,baz'.split(',')
=> ["foo", "bar", "baz"]
If I am interpreting your post correctly, you want everything separated by commas before the colon (:).
The appropriate regex for this would be:
[^\s:]*(,[^\s:]*)*(:.*)?
This should find everything you are looking for.
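On why the original pattern loses bar: a capture group that sits inside a repetition keeps only the text from its last iteration, so \2 ends up holding baz. One possible way around this, sketched below on the assumption that the list part never contains a colon, is to split off the part after the colon first and then split the list on commas. The variable names are illustrative.

```ruby
input = "foo,bar,baz:something"

# Split once on the first colon, then split the list part on commas.
list, rest = input.split(":", 2)
items = list.split(",")

p items  # ["foo", "bar", "baz"]
p rest   # "something"

# For comparison, the pattern from the question: the repeated group
# only remembers its final iteration.
m = /^([a-z]+)(?:,([a-z]+))*:(something)/.match(input)
p m.captures  # ["foo", "baz", "something"]
```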

Regular expression to strip everything but words

I'm helpless on regular expressions so please help me on this problem.
Basically I am downloading web pages and RSS feeds and want to strip everything except plain words. No periods, commas, ifs, ands, or buts. I literally have a list of the most common words used in English and I want to strip those too, but I think I know how to do that and don't need a regular expression for it because it would be way too long.
How do I strip everything from a chunk of text except words that are delimited by spaces? Everything else goes in the trash.
This works quite well, thanks to Pavel: .split(/[^[:alpha:]]/).uniq!
I think what fits you best is splitting the string into words. In this case, String#split would be the better option. It accepts a regexp matching the separators at which the source string is split into array elements.
In your case, it should be "some non-alphabetic characters". Alphabetic character class is denoted by [:alpha:]. So, here's the example of what you need:
irb(main):001:0> "asd, < er >w , we., wZr,fq.".split(/[^[:alpha:]]+/)
=> ["asd", "er", "w", "we", "wZr", "fq"]
You may further filter the result by intersecting the resultant array with array that contains only English words:
irb(main):001:0> ["asd", "er", "w", "we", "wZr", "fq"] & ["we","you","me"]
=> ["we"]
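Going the other way, removing a list of common words instead of keeping only known ones, can be sketched with array difference. The sample text and the stop_words list here are made up for illustration.

```ruby
text = "The quick, brown fox -- and the lazy dog."

# Hypothetical list of common words to discard.
stop_words = %w[the and]

# Split on runs of non-letters; reject empty strings in case the text
# starts with a non-letter.
words = text.downcase.split(/[^[:alpha:]]+/).reject(&:empty?)

# uniq (without the bang) always returns an array; Array#- drops stop words.
content_words = words.uniq - stop_words

p content_words  # ["quick", "brown", "fox", "lazy", "dog"]
```

The non-bang uniq is used on purpose: uniq! returns nil when there is nothing to remove, which is easy to trip over in a method chain.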
try \b\w*\b to match whole words

Strip words beginning with a specific letter from a sentence using regex

I'm not sure how to use regular expressions in a function so that I could grab all the words in a sentence starting with a particular letter. I know that I can do:
word =~ /^#{letter}/
to check if the word starts with the letter, but how do I go from word to word? Do I need to convert the string to an array and then iterate through each word, or is there a faster way using regex? I'm using Ruby, so that would look like:
matching_words = Array.new
sentence.split(" ").each do |word|
  matching_words.push(word) if word =~ /^#{letter}/
end
Scan may be a good tool for this:
#!/usr/bin/ruby1.8
s = "I think Paris in the spring is a beautiful place"
p s.scan(/\b[it][[:alpha:]]*/i)
# => ["I", "think", "in", "the", "is"]
\b means "word boundary".
[:alpha:] means upper or lowercase alpha (a-z).
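To plug in the letter from the question instead of hard-coding it, something along these lines may work. The helper name words_starting_with is hypothetical, and Regexp.escape is there in case the "letter" is a character with special meaning in a regex.

```ruby
# Hypothetical helper: returns every word in sentence that starts with letter.
def words_starting_with(sentence, letter)
  # \b anchors at a word start; /i makes the match case-insensitive.
  sentence.scan(/\b#{Regexp.escape(letter)}\w*/i)
end

s = "I think Paris in the spring is a beautiful place"
p words_starting_with(s, "t")  # ["think", "the"]
p words_starting_with(s, "p")  # ["Paris", "place"]
```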
You can use \b. It matches word boundaries--the invisible spot just before and after a word. (You can't see them, but oh they're there!) Here's the regex:
/\b(a\w*)\b/
The \w matches a word character, like letters and digits and stuff like that.
You can see me testing it here: http://rubular.com/regexes/13347
Similar to Anon.'s answer:
/\b(a\w*)/g
and then see all the results with (usually) $n, where n is the n-th hit. Many libraries return /g results as arrays per set of parentheses, so in this case $1 would return an array of all the matching words. You'll want to double-check with whatever library you're using to figure out how it returns matches like this; sadly, there's a lot of variation in how global searches return results.
As to the \w vs [a-zA-Z], you can sometimes get faster execution by using the built-in definitions of things like that, as it can easily have an optimized path for the preset character classes.
The /g at the end makes it a "global" search, so it'll find more than one. It's still restricted by line in some languages / libraries, though, so if you wish to check an entire file you'll sometimes need /gm, to make it multi-line
If you want to remove results, like your title (but not question) suggests, try:
/\ba\w*//g
which does a search-and-replace in most languages (/<search>/<replacement>/). Sometimes you need a "s" at the front. Depends on the language / library. In Ruby's case, use:
string.gsub(/(\b)a\w*(\b)/, "\\1\\2")
to retain the non-word characters, and optionally put any replacement text between \1 and \2. gsub for global, sub for the first result.
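A small sketch of the removal case on a made-up sentence. This variant also swallows the whitespace after each removed word, which is a choice made for the example, not something the answer above prescribes.

```ruby
sentence = "an apple a day keeps all away"

# Remove every word starting with "a" (case-insensitive) together with
# any whitespace that follows it, then trim what is left at the ends.
cleaned = sentence.gsub(/\ba\w*\b\s*/i, "").strip

p cleaned  # "day keeps"
```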
/\ba[a-z]*\b/i
will match any word starting with 'a'.
The \b indicates a word boundary - we want to only match starting from the beginning of a word, after all.
Then there's the character we want our word to start with.
Then we have as many as possible letter characters, followed by another word boundary.
To match all words starting with t, use:
\bt\w+
That will match test but not footest; \b means "word boundary".
Personally, I think that regex is overkill for this application; a simple select is more than capable of solving this particular problem.
"this is a test".split(' ').select{ |word| word[0,1] == 't' }
result => ["this", "test"]
or if you are determined to use regex then go with grep
"this is a test".split(' ').grep(/^t/)
result => ["this", "test"]
Hope this helps.
