How do I extract this substring within this string? - ruby

I have the following text:
"Showing1-30\nof 1404results"
What I want to pull out is the 1404.
How do I do that?
I was thinking I would use a regexp to match just the string between the words of and results, but can't quite figure out how to do that.
Or is there another way, say a built-in Ruby method I could use that is efficient?
I was also considering using split, but the spacing is off so it looks like this:
=> ["Showing1-30", "of", "1404results"]
How do I do what I want?

You could just do
["Showing1-30", "of", "1404results"].last.to_i
Or use a regex like
/of (\d+)results/
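A minimal sketch of both suggestions, assuming the text is held in a variable called text (the variable names here are illustrative, not from the question):

```ruby
text = "Showing1-30\nof 1404results"

# split with no argument splits on any run of whitespace, including "\n"
parts = text.split

# to_i reads the leading digits and ignores the trailing "results"
count_from_split = parts.last.to_i

# String#[] with a regex and a group index returns that capture (or nil)
count_from_regex = text[/of (\d+)results/, 1].to_i

puts count_from_split   # 1404
puts count_from_regex   # 1404
```

Note that to_i silently returns 0 when the last element does not start with digits, so the regex form is easier to check for a failed match.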

Match "of" followed by one or more spaces, followed by one or more digits in capture group 1, followed by "results", then retrieve the contents of capture group 1.
"Showing1-30\nof 1404results"[/of\s+(\d+)results/,1]
#=> "1404"
or
Match the string that is preceded by "of" followed by one[1] space (positive lookbehind) and is followed by "results" (positive lookahead)
"Showing1-30\nof 1404results"[/(?<=of\s)\d+(?=results)/]
#=> "1404"
or
Match "of" followed by one or more spaces, forget everything matched so far (\K), match one or more digits followed by "results" (positive lookahead)
"Showing1-30\nof 1404results"[/of\s+\K\d+(?=results)/]
#=> "1404"
It may be desirable to change the lookbehind regex to
/(?<=of\s)\d+(?=\s*results)/
in case someone decides to "correct" the string to read "Showing 1-30\nof 1404 results". (Same for the other two.)
[1] Ruby's positive lookbehinds cannot be variable length; hence, \s+ is not permitted here.
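As a rough check, the three approaches can be run against both the original string and the "corrected" one. This is only a sketch: the variable names are made up, and each pattern tolerates optional whitespace before "results".

```ruby
compact = "Showing1-30\nof 1404results"
spaced  = "Showing 1-30\nof 1404 results"

found = [compact, spaced].flat_map do |s|
  [
    s[/of\s+(\d+)\s*results/, 1],        # capture group 1
    s[/(?<=of\s)\d+(?=\s*results)/],     # fixed-length lookbehind plus lookahead
    s[/of\s+\K\d+(?=\s*results)/]        # \K drops "of " from the reported match
  ]
end

p found   # ["1404", "1404", "1404", "1404", "1404", "1404"]
```

Putting the \s* inside the lookahead keeps the trailing space out of the returned match.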

I'd use:
"Showing1-30\nof 1404results"[/(\d+)results/, 1] # => "1404"
"Showing1-30\nof 1404results" is not overly readable. If you are in charge of generating it, or if it is likely to change to something more readable, such as "Showing 1-30\nof 1404 results", then a simple tweak will help:
"Showing1-30\nof 1404results"[/(\d+)\s*results/, 1] # => "1404"
where \s* will allow 0, 1 or multiple whitespace characters.
Keep regular expressions as simple as possible until it's proven they need to be more complex. As complexity increases, the odds of slowing the match increase, which can be drastic in a loop over long strings. The odds of adding a hole that leads to false positives go up too, and those can be hard to debug.
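One way to package the simple pattern is a small helper that returns nil when nothing matches. This is a sketch; total_results is a hypothetical name, not something from the question.

```ruby
# Hypothetical helper: returns the total as an Integer, or nil if absent
def total_results(text)
  digits = text[/(\d+)\s*results/, 1]
  digits && digits.to_i
end

p total_results("Showing1-30\nof 1404results")     # 1404
p total_results("Showing 1-30\nof 1404 results")   # 1404
p total_results("No matches found")                # nil
```

Returning nil rather than 0 lets the caller tell "no count in the text" apart from a genuine count of zero.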

If the position of this number is fixed, then the following is the fastest:
"Showing1-30\nof 1404results"[-11..-8]
The [-11..-8] is a range. You can treat the string as an array of characters and take the characters between the 11th and the 8th position counting from the right; -1 is the last character, -2 the second to last, and so on.
If not, then use a regular expression like
"Showing1-30\nof 14results"[/ \d+/].strip
You look for a space followed by a number, then you remove the leading space.
This is simpler than having to use a capture group.
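A quick sketch of how the negative indices line up, and of what happens when the number of digits changes (variable names are illustrative):

```ruby
text = "Showing1-30\nof 1404results"

# "results" occupies the last 7 characters (-7..-1),
# so a four-digit count sits at -11..-8
four_digits = text[-11..-8]
p four_digits            # "1404"

# Starting one position further left also picks up the space
with_space = text[-12..-8]
p with_space             # " 1404"

# With a two-digit count the fixed range no longer lines up
short = "Showing1-30\nof 14results"
misaligned = short[-11..-8]
p misaligned             # "f 14"

# The regex does not depend on the number of digits
by_regex = short[/ \d+/].strip
p by_regex               # "14"
```

So the slice is only safe when both the suffix and the width of the number are guaranteed not to change.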

Related

Discard contractions from string

I have a special use case where I want to discard all contractions from a string and select only words that start with a letter followed by lowercase letters and contain no special characters.
For example:
string = "~ ASAP ASCII Achilles Ada Stackoverflow James I'd I'll I'm I've"
string.scan(/\b[A-z][a-z]+\b/)
#=> ["Achilles", "Ada", "Stackoverflow", "James", "ll", "ve"]
Note: it does not discard the whole words I'll and I've.
How can I discard the whole word when it contains a contraction?
Try this Regex:
(?:(?<=\s)|(?<=^))[a-zA-Z]+(?=\s|$)
Explanation:
(?:(?<=\s)|(?<=^)) - finds the position immediately preceded by either start of the line or by a white-space
[a-zA-Z]+ - matches 1+ occurrences of a letter
(?=\s|$) - The substring matched above must be followed by either a whitespace or end of the line
Update:
To make sure that not all the letters are in upper case, use the following regex:
(?:(?<=\s)|(?<=^))(?=\S*[a-z])[a-zA-Z]+(?=\s|$)
The only thing added here is (?=\S*[a-z]), which means that there must be at least one lowercase letter.
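Run against the string from the question, the two patterns behave as follows. This is a sketch; the constant-like names whole_words and with_lowercase are made up for the example.

```ruby
string = "~ ASAP ASCII Achilles Ada Stackoverflow James I'd I'll I'm I've"

# Letters only, bounded by whitespace or line start/end
whole_words = /(?:(?<=\s)|(?<=^))[a-zA-Z]+(?=\s|$)/

# Same, but the word must also contain at least one lowercase letter
with_lowercase = /(?:(?<=\s)|(?<=^))(?=\S*[a-z])[a-zA-Z]+(?=\s|$)/

all_letter_words = string.scan(whole_words)
mixed_case_words = string.scan(with_lowercase)

p all_letter_words   # ["ASAP", "ASCII", "Achilles", "Ada", "Stackoverflow", "James"]
p mixed_case_words   # ["Achilles", "Ada", "Stackoverflow", "James"]
```

The contractions drop out in both cases because the apostrophe is neither a letter nor whitespace, so neither half of the word satisfies both boundaries.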
I know that there's an accepted answer already, but I'd like to give my own shot:
(?<=\s|^)\w+[a-z]\w*
This regex is shorter and more efficient (157 steps against 315 from the accepted answer).
The explanation is rather simple:
(?<=\s|^) - This is a positive lookbehind. It means that we want strings preceded by a whitespace character or the start of the string.
\w+[a-z]\w* - This one means that we want strings composed of word characters (letters, digits and underscore) containing at least one lowercase letter after the first character, thus discarding words which are entirely uppercase. Along with the positive lookbehind, the whole regex ends up discarding the contractions in the example, since the single letter before the apostrophe is too short to match.
NOTE: this regex won't take into account one-letter words. If you want to accomplish that, then you should use \w*[a-z]\w* instead, with a little efficiency cost.
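A sketch of this shorter pattern on the question's string, plus one extra input that is my own assumption (not from the question) to show where a trailing lookahead still matters:

```ruby
string = "~ ASAP ASCII Achilles Ada Stackoverflow James I'd I'll I'm I've"
short_pattern = /(?<=\s|^)\w+[a-z]\w*/

from_question = string.scan(short_pattern)
p from_question   # ["Achilles", "Ada", "Stackoverflow", "James"]

# Assumed extra input: a contraction whose first part is longer than one letter
partial = "don't stop".scan(short_pattern)
p partial         # ["don", "stop"]

# Adding the trailing lookahead from the accepted answer rejects the fragment
bounded = "don't stop".scan(/(?<=\s|^)\w+[a-z]\w*(?=\s|$)/)
p bounded         # ["stop"]
```

So the shorter pattern is fine for contractions like I'll, but it may need the (?=\s|$) check if words such as don't can appear.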

Ruby regex specify length of captured group

I need to match a string of variable length (between 5 and 12), composed of uppercase letters and one or more digits between 1 and 8.
How can I specify that I need the whole captured group's length to be between 5 and 12?
I have tried with parentheses but with no luck.
I have tried this
\s([A-Z]+[1-8]+[A-Z]+){5,12}\s
My idea was to use the quantifier {5,12} to limit the length of the captured group between the parentheses, but clearly it doesn't work like that.
The string needs to be identified inside a normal text just like
"THE STRING I NEED TO DECODE IS SOMETHING LIKE FD1531FHHKWF BUT NOT LIKE g4G58234JJ"
You actually have two conditions to meet:
The length of the match is to be specified with curly brackets {5,12}, and there must be no letters or digits immediately before or after it. So:
/(?!\b[A-Z]+\b)\b[A-Z1-8]{5,12}\b/
First, a negative lookahead rules out words made of letters only; then we match the pattern.
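Applied to the sentence from the question, this regex picks out only the code. A small sketch, with two extra inputs of my own to show the length limits:

```ruby
text = "THE STRING I NEED TO DECODE IS SOMETHING LIKE FD1531FHHKWF BUT NOT LIKE g4G58234JJ"
code_pattern = /(?!\b[A-Z]+\b)\b[A-Z1-8]{5,12}\b/

codes = text.scan(code_pattern)
p codes        # ["FD1531FHHKWF"]

# Assumed extra inputs: too short (4) and too long (14) are both rejected
too_short = "A1B2".scan(code_pattern)
too_long  = "AB12CD34EF56GH".scan(code_pattern)
p too_short    # []
p too_long     # []
```

The quantifier sits directly on the character class, so it counts characters rather than repetitions of a group.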
Use positive look-ahead on total size of regex
\s(?=^.{5,12}$)([A-Z]+[1-8]+[A-Z]+)\s
Explanation
(?= # look-ahead match start
^.{5,12}$ # 5 to 12 characters from start to end
) # look-ahead match end
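Because ^ and $ tie the lookahead to a whole line, the same idea needs a small adaptation when the code sits inside running text. The following is a sketch of one such adaptation, not the regex above: it swaps the anchors for whitespace checks, and the name embedded_code is made up.

```ruby
text = "THE STRING I NEED TO DECODE IS SOMETHING LIKE FD1531FHHKWF BUT NOT LIKE g4G58234JJ"

embedded_code = /
  (?<!\S)                  # not preceded by a non-space character
  (?=\S{5,12}(?!\S))       # the whole token is 5 to 12 characters long
  [A-Z]+[1-8]+[A-Z]+       # letters, then digits 1-8, then letters
  (?!\S)                   # not followed by a non-space character
/x

matches = text.scan(embedded_code)
p matches      # ["FD1531FHHKWF"]

# Assumed extra input: a 15-character token fails the length check
overlong = "X ABCDEF1234567GH Y".scan(embedded_code)
p overlong     # []
```

The lookahead measures the token first; the main pattern then checks its shape.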

regexp match group with the exception of a member of the group

So, there are a number of regular expressions that match a particular group, like the following:
/./ - Any character except a newline.
/./m - Any character (the m modifier enables multiline mode)
/\w/ - A word character ([a-zA-Z0-9_])
/\s/ - Any whitespace character
And in ruby:
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
So, here is my question: how do I get a regexp to match a group like this, but with one member excluded?
Examples:
match all punctuation apart from the question mark
match all whitespace characters apart from the newline
match all words apart from "go"... etc
Thanks.
You can use character class subtraction.
Rexegg:
The syntax […&&[…]] allows you to use a logical AND on several character classes to ensure that a character is present in them all. Intersecting with a negated character, as in […&&[^…]] allows you to subtract that class from the original class.
Consider this code:
s = "./?!"
res = s.scan(/[[:punct:]&&[^!]]/)
puts res
Output is only ., / and ? since ! is excluded.
Restricting with a lookahead (as sawa has written just now) is also possible, but is not required when you have this subtraction supported. When you need to restrict some longer values (more than 1 character) a lookahead is required.
In many cases, a lookahead must be anchored to a word boundary to return correct results. As an example of using a lookahead to restrict punctuation (single character matching generic pattern):
/(?:(?!!)[[:punct:]])+/
This will match one or more punctuation symbols other than !.
The puts "./?!".scan(/(?:(?!!)[[:punct:]])+/) code will output ./?
Use character class subtraction whenever you need to restrict with single characters, it is more efficient than using lookaheads.
So, the 3rd scenario regex must look like:
/\b(?!go\b)\w+\b/
        ^^
If you write /(?!\bgo\b)\b\w+\b/, the regex engine will check each position in the input string. If you use a \b at the beginning, only word boundary positions will be checked, and the pattern will yield better performance. Also note that the ^^ \b is very important since it makes the regex engine check for the whole word go. If you remove it, it will only restrict to the words that do not start with go.
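The three scenarios from the question can be sketched like this. The sample strings are my own, chosen only to exercise each case.

```ruby
# Punctuation except "?" via intersection with a negated class
punct = "Hi! Why? Yes, ok.".scan(/[[:punct:]&&[^?]]/)
p punct    # ["!", ",", "."]

# Whitespace except newline, same technique
spaces = "a b\tc\nd".scan(/[\s&&[^\n]]/)
p spaces   # [" ", "\t"]

# Words except "go": a lookahead is needed for multi-character exclusions
words = "go gone ago go-go".scan(/\b(?!go\b)\w+\b/)
p words    # ["gone", "ago"]
```

"gone" survives because the \b inside the lookahead requires "go" to be the whole word.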
Put what you want to exclude inside a negative lookahead in front of the match. For example,
To match all punctuations apart from the question mark,
/(?!\?)[[:punct:]]/
To match all words apart from "go",
/(?!\bgo\b)\b\w+\b/
This is a general approach that is sometimes useful:
a = []
".?!,:;-".scan(/[[:punct:]]/) { |s| a << s unless s == '?' }
a #=> [".", "!", ",", ":", ";", "-"]
The content of the block is limited only by your imagination.

How to find whole complete number with ruby regex

I'm looking to find the first whole occurrence of a number within a string. I'm not looking for the first digit, rather the whole first number. So, for example, the first number in: w134fklj342 is 134, while the first number in 1235alkj9342klja9034 is 1235.
I have attempted to use \d but I'm unsure how to expand that to include multiple digits (without specifying how long the number is).
I think you're looking for this regex:
\d+
"Plus" means "one or more". This regex will match all numbers within a string, so pick the first one.
strings = ['w134fklj342', '1235alkj9342klja9034']
strings.each do |s|
puts s[/\d+/]
end
# >> 134
# >> 1235
Demo: http://rubular.com/r/YE8kPE2SyW
The easiest way to understand regexes is to think of each bit as one character; e.g. \d or [1234567890] or [0-9] will match one digit.
To expand this one character you have 2 basic options: * and +
* will match the character 0 or more times
+ will match it one or more times
Like Sergio said, you should use \d+ to match many digits.
Excellent tutorial for regexes in general: http://www.regular-expressions.info/tutorial.html
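A short sketch of how the quantifiers differ on the first example string (variable names are illustrative):

```ruby
s = "w134fklj342"

one_digit    = s[/\d/]     # a single digit
one_or_more  = s[/\d+/]    # the first whole run of digits
zero_or_more = s[/\d*/]    # matches the empty string at position 0, before "w"
all_numbers  = s.scan(/\d+/)

p one_digit      # "1"
p one_or_more    # "134"
p zero_or_more   # ""
p all_numbers    # ["134", "342"]
p one_or_more.to_i   # 134
```

The \d* case is why + is the right choice here: * is satisfied by zero digits, so it succeeds immediately without consuming anything.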

Strip words beginning with a specific letter from a sentence using regex

I'm not sure how to use regular expressions in a function so that I could grab all the words in a sentence starting with a particular letter. I know that I can do:
word =~ /^#{letter}/
to check if the word starts with the letter, but how do I go from word to word? Do I need to convert the string to an array and then iterate through each word, or is there a faster way using regex? I'm using ruby so that would look like:
matching_words = Array.new
sentance.split(" ").each do |word|
matching_words.push(word) if word =~ /^#{letter}/
end
Scan may be a good tool for this:
#!/usr/bin/ruby1.8
s = "I think Paris in the spring is a beautiful place"
p s.scan(/\b[it][[:alpha:]]*/i)
# => ["I", "think", "in", "the", "is"]
\b means "word boundary".
[[:alpha:]] matches any alphabetic character, upper or lowercase.
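To take the letter as a parameter, as in the question, the scan can be wrapped in a method. This is a sketch; words_starting_with is a hypothetical name.

```ruby
# Hypothetical helper; Regexp.escape guards against inputs such as "." or "+"
def words_starting_with(sentence, letter)
  sentence.scan(/\b#{Regexp.escape(letter)}[[:alpha:]]*/i)
end

s = "I think Paris in the spring is a beautiful place"
p words_starting_with(s, "t")   # ["think", "the"]
p words_starting_with(s, "i")   # ["I", "in", "is"]
```

This avoids the split-and-iterate loop entirely, since scan already returns every match as an array.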
You can use \b. It matches word boundaries--the invisible spot just before and after a word. (You can't see them, but oh they're there!) Here's the regex:
/\b(a\w*)\b/
The \w matches a word character, like letters and digits and stuff like that.
You can see me testing it here: http://rubular.com/regexes/13347
Similar to Anon.'s answer:
/\b(a\w*)/g
and then see all the results with (usually) $n, where n is the n-th hit. Many libraries will return /g results as arrays on the $n-th set of parentheses, so in this case $1 would return an array of all the matching words. You'll want to double-check with whatever library you're using to figure out how it returns matches like this; sadly, there's a lot of variation in how global searches return results.
As to the \w vs [a-zA-Z], you can sometimes get faster execution by using the built-in definitions of things like that, as it can easily have an optimized path for the preset character classes.
The /g at the end makes it a "global" search, so it'll find more than one. It's still restricted by line in some languages / libraries, though, so if you wish to check an entire file you'll sometimes need /gm to make it multi-line.
If you want to remove results, like your title (but not question) suggests, try:
/\ba\w*//g
which does a search-and-replace in most languages (/<search>/<replacement>/). Sometimes you need a "s" at the front. Depends on the language / library. In Ruby's case, use:
string.gsub(/(\b)a\w*(\b)/, "\\1\\2")
to retain the non-word characters, and optionally put any replacement text between \1 and \2. gsub for global, sub for the first result.
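For the removal case in Ruby, a slightly simpler sketch is possible. Since \b is zero-width, capturing it always yields an empty string, so the pattern below drops the groups and instead swallows the whitespace after each removed word. The sample sentence is my own.

```ruby
sentence = "she ate an apple today"

# gsub removes every word starting with "a", plus any whitespace after it
all_removed = sentence.gsub(/\ba\w*\s*/i, "")
p all_removed     # "she today"

# sub removes only the first such word
first_removed = sentence.sub(/\ba\w*\s*/i, "")
p first_removed   # "she an apple today"
```

Without the trailing \s* the words are removed but the spaces around them stay, which leaves runs of blanks in the result.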
/\ba[a-z]*\b/i
will match any word starting with 'a'.
The \b indicates a word boundary - we want to only match starting from the beginning of a word, after all.
Then there's the character we want our word to start with.
Then we have as many as possible letter characters, followed by another word boundary.
To match all words starting with t, use:
\bt\w+
That will match test but not footest; \b means "word boundary".
Personally, I think that regex is overkill for this application; simply running a select is more than capable of solving this particular problem.
"this is a test".split(' ').select{ |word| word[0,1] == 't' }
result => ["this", "test"]
or if you are determined to use regex then go with grep
"this is a test".split(' ').grep(/^t/)
result => ["this", "test"]
Hope this helps.
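The same idea can also be written with String#start_with?, which reads a little more directly than word[0,1]. A sketch, assuming plain space-separated words:

```ruby
words = "this is a test".split

# No regex: keep the words whose first character is "t"
selected = words.select { |word| word.start_with?("t") }
p selected   # ["this", "test"]

# Regex via grep; \A anchors at the start of each word
grepped = words.grep(/\At/)
p grepped    # ["this", "test"]

# grep_v returns the words that do not match
rejected = words.grep_v(/\At/)
p rejected   # ["is", "a"]
```

grep_v is handy when the goal is to strip the matching words rather than collect them.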

Resources