Regex for first x words in string - ruby

I need a regex that returns the first N words from a string, including line breaks and white spaces. I tried with the following code, but the server crashes:
str[/\S+(\s)?{N}/].strip

Like this (for the first 15 words):
if subject =~ /^(?:\w+\s){15}/
  thefirstwords = $&
end
Just change the 15 to whatever number you like.

I guess you can achieve this without even regex:
str.split[0...n].join(' ')

Try this expression
'/^.\S+(\s){N}/'
Start with any character and match up to N words.
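For comparison, here is a minimal sketch of the regex idea and the split idea side by side. The helper name first_words and the sample text are made up for illustration. It assumes a "word" is any run of non-whitespace characters; the regex version keeps the original spacing and line breaks, while the split version collapses them into single spaces.

```ruby
# Hypothetical helper (not from the answers above): returns the first n
# whitespace-delimited words, preserving the whitespace between them.
def first_words(str, n)
  return "" if n < 1
  # One word, then up to n - 1 further "whitespace + word" groups.
  str[/\A\s*\S+(?:\s+\S+){0,#{n - 1}}/].to_s.lstrip
end

text = "one two\nthree   four five"

regex_result = first_words(text, 3)          # keeps the line break
split_result = text.split[0...3].join(' ')   # normalizes to single spaces

p regex_result
p split_result
```

If the string has fewer than n words, both versions simply return everything there is.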

Related

How do I tune this regex to return the matches I want?

So I have a string that looks like this:
#jackie#test.com, #mike#test.com
What I want to do is before any email in this comma separated list, I want to remove the #. The issue I keep running into is that if I try to do a regular \A flag like so /[\A#]+/, it finds all the instances of # in that string...including the middle crucial #.
The same thing happens if I do /[\s#]+/. I can't figure out how to just look at the beginning of each string, where each string is a complete email address.
Edit 1
Note that all I need is the regex, I already have the rest of the stuff I need to do what I want. Specifically, I am achieving everything else like this:
str.gsub(/#/, '').split(',').map(&:strip)
Where str is my string.
All I am looking for is the regex portion for my gsub.
You may use the below negative lookbehind based regex.
str.gsub(/(?<!\S)#/, '').split(',').map(&:strip)
(?<!\S) is a negative lookbehind: it asserts that the character we are about to match is not preceded by a non-space character. So this matches a # at the start of the string or a # that directly follows a whitespace character.
The difference between my answer and hwnd's str.gsub(/\B#/, '') is that mine won't match the # in :#, while hwnd's does. \B matches between two word characters or between two non-word characters.
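A small sketch of that difference, using the string from the question plus a made-up input ("#a:#b.com") that is only there to show where the two patterns diverge:

```ruby
str = "#jackie#test.com, #mike#test.com"

# On the question's input both patterns leave the # inside each address alone.
lookbehind   = str.gsub(/(?<!\S)#/, '').split(',').map(&:strip)
non_boundary = str.gsub(/\B#/, '').split(',').map(&:strip)

# Made-up input where they differ: a # that follows a colon.
tricky = "#a:#b.com"
tricky_lookbehind   = tricky.gsub(/(?<!\S)#/, '')  # keeps the # after ':'
tricky_non_boundary = tricky.gsub(/\B#/, '')       # removes it as well

p lookbehind, non_boundary
p tricky_lookbehind, tricky_non_boundary
```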
Here is one solution
str = "#jackie#test.com, #mike#test.com"
p str.split(/,[ ]+/).map{ |i| i.gsub(/^#/, '')}
Output
["jackie#test.com", "mike#test.com"]

RegExp match word with parenthesis?

I have problem with regular expressions. I have strings like this:
test(r), testtest(r,r), example, example2, exmp (r,5)
I would like to find all words without blanks between text and the parenthesis. From the string above, I want to get:
test(r), testtest(r,r)
I created this regexp but it catches only brackets with content inside:
/\(([^\)]+)\)/g
Thanks for all answers.
Edit:
I am working in Ruby, and this regex works perfectly /\w+\(([^)]+)\)/g.
What should I do if I would like to separate words with commas between brackets and without? For example how do I get this?
testtest(r,r) <-with comma
test(r) <-without comma
You can add a \w+ in front of the pattern:
/\w+\(([^)]+)\)/g
This will match one or more 'word' characters (which includes letters, digits, and underscores) followed by an (, followed by one or more of any character other than ), followed by a ).
If you want to be able to capture the word that appears before the parentheses separately, you can put it in a group like this:
/(\w+)\(([^)]+)\)/g
In your example, this would give group 1: test and group 2: r for the first match and group 1: testtest and group 2: r,r for the second match.
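In Ruby there is no /g flag on regex literals; String#scan returns all matches instead. A minimal sketch, assuming the two-group pattern above, that also splits the results into "with comma" and "without comma" (the variable names are just for illustration):

```ruby
text = "test(r), testtest(r,r), example, example2, exmp (r,5)"

# scan with two groups yields one [word, args] pair per match.
pairs = text.scan(/(\w+)\(([^)]+)\)/)

# Separate matches whose parenthesized part contains a comma from the rest.
with_comma, without_comma = pairs.partition { |_word, args| args.include?(',') }

p pairs
p with_comma.map { |word, args| "#{word}(#{args})" }
p without_comma.map { |word, args| "#{word}(#{args})" }
```

"exmp (r,5)" is skipped because of the blank between the word and the opening parenthesis.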
Recently had the same issue myself. Doing this solved my problem:
(\w+\([\w, ]+\))
Hope it helps!
Something like this:
/\w+\(\w([,]\w)?\)/g
import re

text = '''test(r), testtest(r,r), example, example2, exmp (r,5)'''
pattern = r'\w+\([^()]+\)'
m = re.compile(pattern)
results = m.finditer(text)
for r in results:
    # Print the matched text, e.g. test(r)
    print(r.group())

ruby remove variable length string from regular expression leaving hyphen

I have a string such as this: "im# -33.870816,151.203654"
I want to extract the two numbers including the hyphen.
I tried this:
mystring = "im# -33.870816,151.203654"
/\D*(\-*\d+\.\d+),(\-*\d+\.\d+)/.match(mystring)
This gives me:
33.870816,151.203654
How do I get the hyphen?
I need to do this in ruby
Edit: I should clarify, the "im# " was just an example, there can be any set of characters before the numbers. the numbers are mostly well formed with the comma. I was having trouble with the hyphen (-)
Edit2: Note that the two numbers are latitude and longitude. That pattern is mostly fixed. However, in theory, the preceding string can be arbitrary. I don't expect it to contain numbers or hyphens, but you never know.
How about this?
arr = "im# -33.2222,151.200".split(/[, ]/)[1..-1]
and arr is ["-33.2222", "151.200"], (using the split method).
now
arr[0].to_f is -33.2222 and arr[1].to_f is 151.2
EDIT: stripped "im#" part with [1..-1] as suggested in comments.
EDIT2: also, this works regardless of what the first characters are.
If you want to capture the two numbers with the hyphen you can use this regex:
> str = "im# -33.870816,151.203654"
> str.match(/([\d.,-]+)/).captures
=> ["-33.870816,151.203654"]
Edit: now it captures hyphen.
This one captures each number separately: http://rubular.com/r/NNP2OTEdiL
Note: String#scan will match all occurrences of the given pattern, in this case
> str.scan(/(-?[\d.]+)/)
=> [["-33.870816"], ["151.203654"]] # Good, but flattened version is better
> str.scan(/(-?[\d.]+)/).flatten
=> ["-33.870816", "151.203654"]
I recommend playing around a little with Rubular. There are also some docs about regular expressions in Ruby:
http://www.ruby-doc.org/docs/ProgrammingRuby/html/language.html#UJ
http://www.regular-expressions.info/ruby.html
http://www.ruby-doc.org/core-1.9.3/Regexp.html
Your regex doesn't work because the hyphen is caught by \D, so you have to modify it to catch only the right set of characters.
[^0-9-]* would be a good option.
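A minimal sketch of that fix applied to the question's pattern. Using -? (at most one hyphen) instead of \-* is my own tightening, not something the question requires:

```ruby
mystring = "im# -33.870816,151.203654"

# [^0-9-]* skips the prefix without swallowing the leading hyphen.
m = /[^0-9-]*(-?\d+\.\d+),(-?\d+\.\d+)/.match(mystring)

lat_str, lng_str = m.captures
lat = lat_str.to_f
lng = lng_str.to_f

p m.captures
p [lat, lng]
```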

How to find whole complete number with ruby regex

I'm looking to find the first whole occurrence of a number within a string. I'm not looking for the first digit, rather the whole first number. So, for example, the first number in: w134fklj342 is 134, while the first number in 1235alkj9342klja9034 is 1235.
I have attempted to use \d but I'm unsure how to expand that to include multiple digits (without specifying how long the number is).
I think, you're looking for this regex
\d+
"Plus" means "one or more". This regex will match all numbers within a string, so pick first one.
strings = ['w134fklj342', '1235alkj9342klja9034']
strings.each do |s|
  puts s[/\d+/]
end
# >> 134
# >> 1235
Demo: http://rubular.com/r/YE8kPE2SyW
The easiest way to understand regexes is to think of each bit as one character; e.g. \d or [1234567890] or [0-9] will match one digit.
To expand this one character you have 2 basic options: * and +
* will match the character 0 or more times
+ will match it one or more times
Like Sergio said you should use \d+ to match many digits.
Excellent tutorial for regexes in general: http://www.regular-expressions.info/tutorial.html
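To see why + rather than * is the right choice here, a short sketch on the question's first string. Because * also allows zero repetitions, \d* happily matches the empty string at the very start:

```ruby
s = "w134fklj342"

first_plus  = s[/\d+/]       # first run of one or more digits
first_star  = s[/\d*/]       # matches zero digits at position 0
all_numbers = s.scan(/\d+/)  # every run of digits, in order
no_digits   = "abc"[/\d+/]   # nil when there is no digit at all

p first_plus, first_star, all_numbers, no_digits
```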

Regular expression Unix shell script

I need to filter all lines with words starting with a letter followed by zero or more letters or numbers, but no special characters (basically names which could be used for c++ variable).
egrep '^[a-zA-Z][a-zA-Z0-9]*'
This works fine for words such as "a", "ab10", but it also includes words like "b.b". I understand that * at the end of expression is problem. If I replace * with + (one or more) it skips the words which contain one letter only, so it doesn't help.
EDIT:
I should be more precise. I want to find lines with any number of possible words as described above. Here is an example:
int = 5;
cout << "hello";
//some comments
In that case it should print all of the lines above, as they all include at least one word which fits the described conditions, and a line does not have to begin with a letter.
Your solution will look roughly like this example. In this case, the regex requires that the "word" be preceded by space or start-of-line and then followed by space or end-of-line. You will need to modify the boundary requirements (the parenthesized stuff) as needed.
'(^| )[a-zA-Z][a-zA-Z0-9]*( |$)'
Assuming the line ends after the word:
'^[a-zA-Z][a-zA-Z0-9]+|^[a-zA-Z]$'
You have to add something to it: either allow the rest of the line to be whitespace, or append the end-of-line anchor (AFAIR it is $).
Your problem lies in the ^ and $ anchors that match the start and end of the line respectively. You want the line to match if it does contain a word, getting rid of the anchors does what you want:
egrep '[a-zA-Z][a-zA-Z0-9]+'
Note that the + only matches words of length 2 and higher; a * in that place would match single characters too.
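The patterns in this thread can be tried outside the shell as well. This is a rough Ruby sketch, not egrep itself: Array#grep keeps the lines that match a pattern, which is close to what egrep does per line. The last two sample lines ("b.b" and "123 456") are made up to show the difference between the unanchored pattern and the space-delimited one.

```ruby
lines = ['int = 5;', 'cout << "hello";', '//some comments', 'b.b', '123 456']

# No anchors: any line containing a letter-led run matches, including "b.b".
unanchored = /[a-zA-Z][a-zA-Z0-9]*/

# Word must sit between start-of-line/space and space/end-of-line.
delimited = /(^| )[a-zA-Z][a-zA-Z0-9]*( |$)/

unanchored_hits = lines.grep(unanchored)
delimited_hits  = lines.grep(delimited)

p unanchored_hits
p delimited_hits
```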
