Regular expression to strip everything but words - ruby

I'm helpless with regular expressions, so please help me with this problem.
Basically I am downloading web pages and RSS feeds and want to strip everything except plain words. No periods, commas, ifs, ands, or buts. I also have a list of the most common words used in English and want to strip those too, but I think I know how to do that without a regular expression, because it would be way too long.
How do I strip everything from a chunk of text except words that are delimited by spaces? Everything else goes in the trash.
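This works quite well, thanks to Pavel: .split(/[^[:alpha:]]+/).uniq (plain uniq rather than uniq!, because uniq! returns nil when there are no duplicates to remove).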

I think what fits you best is splitting the string into words, so String#split is the better option. It accepts a regexp that matches the separators and returns the pieces between them as array elements.
In your case, the separator is "one or more non-alphabetic characters". The alphabetic character class is denoted by [:alpha:]. So, here's an example of what you need:
irb(main):001:0> "asd, < er >w , we., wZr,fq.".split(/[^[:alpha:]]+/)
=> ["asd", "er", "w", "we", "wZr", "fq"]
You may further filter the result by intersecting the resulting array with an array that contains only English words:
irb(main):001:0> ["asd", "er", "w", "we", "wZr", "fq"] & ["we","you","me"]
=> ["we"]

Try \b\w+\b to match whole words (with \w* the pattern can also match the empty string at a word boundary).
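Putting the pieces of this thread together, a minimal sketch of the whole pipeline might look like the following. The STOP_WORDS list and the content_words helper are made up for illustration, since the asker's actual list of common words is not shown. It subtracts the common words with Array#- rather than intersecting with &, because the goal is to drop them.

```ruby
# Hypothetical stop-word list; the asker's real list is not shown.
STOP_WORDS = %w[the a an and or but if of to in is].freeze

# Hypothetical helper: returns the unique words of +text+ minus the stop words.
def content_words(text)
  text.downcase
      .split(/[^[:alpha:]]+/) # split on runs of non-letters
      .reject(&:empty?)       # a leading separator yields one empty string
      .uniq - STOP_WORDS
end

p content_words("The quick, brown fox; and the lazy dog.")
```

Downcasing first means the comparison against the stop-word list is case-insensitive; drop it if the original casing matters.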

Related

Removing trailing newlines with regex in Ruby's 'String#scan'

I have a string, which contains a bunch of HTML documents, tagged with #name:
string = "#one\n\n<html>\n</html>\n\n#two\n<html>\n</html>\n\n\n"
I want to get an array of two-element arrays, each of which with a tag as the first element and the HTML document as the second:
[ ["#one", "<html>\n</html>"], ["#two", "<html>\n</html>"] ]
In order to solve the problem, I crafted the following regular expression:
regex = /(#.+)\n+([^#]+)\n+/
and applied it in string.scan regex.
However, instead of the desired output, I get the following:
[ ["#one", "<html>\n</html>\n"], ["#two", "<html>\n</html>\n\n"] ]
There are trailing newline characters at the end of each document. It appears that only one newline character was removed from each document, while the others stayed in place.
How can the aforementioned regular expression be changed in order to remove all the trailing characters from the resulting documents?
The reason only the last \n was thrown away is that the two relevant capturing parts in your regex, .+ and [^#]+, capture everything up to the last \n (leaving just enough for the match to be possible at all). It does not matter that they are followed by \n+. Remember that a regex works from left to right and that these quantifiers are greedy. If some substring (a sequence of \n in this case) can fit in either the preceding part or the following part of a regex, it actually fits in the preceding part.
More generally, I would suggest doing this:
string.split(/\s+(?=#)/).map{|s| s.strip.split(/\s+/, 2)}
# => [["#one", "<html>\n</html>"], ["#two", "<html>\n</html>"]]
You can remove duplicated newlines first:
string.gsub(/\n+/, "\n").scan(regex)
=> [["#one", "<html>\n</html>"], ["#two", "<html>\n</html>"]]
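Following the greedy-matching explanation above, another option is to keep scan but make the document group lazy and anchor the trailing newlines with a lookahead. This is only a sketch under the question's assumptions (documents never contain #, and every tag starts with #); the pattern is not taken from any of the answers.

```ruby
string = "#one\n\n<html>\n</html>\n\n#two\n<html>\n</html>\n\n\n"

# [^#]+? is lazy, so it stops as early as the rest of the pattern allows.
# The lookahead forces the run of newlines to reach the next tag or the
# end of the string, so no trailing newline can stay inside the capture.
regex = /(#.+)\n+([^#]+?)\n+(?=#|\z)/

pairs = string.scan(regex)
p pairs
```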

ruby remove variable length string from regular expression leaving hyphen

I have a string such as this: "im# -33.870816,151.203654"
I want to extract the two numbers including the hyphen.
I tried this:
mystring = "im# -33.870816,151.203654"
/\D*(\-*\d+\.\d+),(\-*\d+\.\d+)/.match(mystring)
This gives me:
33.870816,151.203654
How do I get the hyphen?
I need to do this in ruby
Edit: I should clarify that the "im# " was just an example; there can be any set of characters before the numbers. The numbers are mostly well formed, with the comma. I was having trouble with the hyphen (-).
Edit2: Note that the two numbers are latitude and longitude. That pattern is mostly fixed. However, in theory, the preceding string can be arbitrary. I don't expect it to contain numbers or a hyphen, but you never know.
How about this?
arr = "im# -33.2222,151.200".split(/[, ]/)[1..-1]
and arr is ["-33.2222", "151.200"] (using the split method).
Now arr[0].to_f is -33.2222 and arr[1].to_f is 151.2.
EDIT: stripped the "im#" part with [1..-1] as suggested in comments.
EDIT2: also, this works regardless of what the first characters are, as long as they contain no space or comma.
If you want to capture the two numbers with the hyphen you can use this regex:
> str = "im# -33.870816,151.203654"
> str.match(/([\d.,-]+)/).captures
=> ["-33.870816,151.203654"]
Edit: now it captures hyphen.
This one captures each number separately: http://rubular.com/r/NNP2OTEdiL
Note: String#scan returns all occurrences of the given pattern. In this case:
> str.scan(/(-?\d+\.\d+)/)
=> [["-33.870816"], ["151.203654"]] # Good, but flattened version is better
> str.scan(/(-?\d+\.\d+)/).flatten
=> ["-33.870816", "151.203654"]
I recommend playing around a little with Rubular. There are also some docs about regular expressions in Ruby:
http://www.ruby-doc.org/docs/ProgrammingRuby/html/language.html#UJ
http://www.regular-expressions.info/ruby.html
http://www.ruby-doc.org/core-1.9.3/Regexp.html
Your regex doesn't work because the hyphen is caught by \D, so you have to modify it to catch only the right set of characters.
[^0-9-]* would be a good option.
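As a sketch of that suggestion applied to the asker's pattern: the prefix class excludes both digits and the hyphen, so the optional sign is left for the capture group. The variable names lat and lng are illustrative only.

```ruby
mystring = "im# -33.870816,151.203654"

# [^0-9-]* skips the arbitrary prefix but cannot swallow the sign.
match = /[^0-9-]*(-?\d+\.\d+),(-?\d+\.\d+)/.match(mystring)

# Convert the captured strings to floats when the pattern matched.
lat, lng = match.captures.map(&:to_f) if match
p [lat, lng]
```

As the asker notes, a prefix that itself contains digits or a hyphen would still confuse this pattern.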

Match comma separated list with Ruby Regex

Given the following string, I'd like to match the elements of the list and parts of the rest after the colon:
foo,bar,baz:something
I.e. I am expecting the first three match groups to be "foo", "bar", "baz". No commas and no colon. The minimum number of elements is 1, and there can be arbitrarily many. Assume no whitespace and lower case.
I've tried this, which should work, but doesn't populate all the match groups for some reason:
^([a-z]+)(?:,([a-z]+))*:(something)
That matches foo in \1 and baz (or whatever the last element is) in \2. I don't understand why I don't get a match group for bar.
Any ideas?
EDIT: Ruby 1.9.3, if that matters.
EDIT2: Rubular link: http://rubular.com/r/pDhByoarbA
EDIT3: Add colon to the end, because I am not just trying to match the list. Sorry, oversimplified the problem.
This expression works for me: /(\w+)/i
If you want to do it with regex, how about this?
(?<=^|,)("[^"]*"|[^,]*)(?=,|$)
This matches comma-separated fields, including the possibility of commas appearing inside quoted strings like 123,"Yes, No". Regexr for this.
More verbosely:
(?<=^|,) # Must be preceded by start-of-line or comma
(
"[^"]*"| # A quote, followed by a bunch of non-quotes, followed by quote, OR
[^,]* # OR anything until the next comma
)
(?=,|$) # Must end with comma or end-of-line
Usage would be with something like Python's re.findall(), which returns all non-overlapping matches in the string (working from left to right, if that matters.) Don't use it with your equivalent of re.search() or re.match() which only return the first match found.
(NOTE: This actually doesn't work in Python because the lookbehind (?<=^|,) isn't fixed width. Grr. Open to suggestions on this one.)
Edit: Use a non-capturing group to consume start-of-line or comma, instead of a lookbehind, and it works in Python.
>>> test_str = '123,456,"String","String, with, commas","Zero-width fields next",,"",nyet,123'
>>> m = re.findall('(?:^|,)("[^"]*"|[^,]*)(?=,|$)',test_str)
>>> m
['123', '456', '"String"', '"String, with, commas"',
'"Zero-width fields next"', '', '""', 'nyet', '123']
Edit 2: The Ruby equivalent of Python's re.findall(needle, haystack) is haystack.scan(needle).
Maybe split would be a better solution in this case?
'foo,bar,baz'.split(',')
=> ["foo", "bar", "baz"]
If I am interpreting your post correctly, you want everything separated by commas before the colon (:).
The appropriate regex for this would be:
[^\s:]*(,[^\s:]*)*(:.*)?
This should find everything you are looking for.
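On the original puzzle: a capture group inside a repetition keeps only the text of its last iteration, which is why \2 ends up holding baz and bar is lost. A minimal sketch of one way around it is to capture the whole list in a single group and split it afterwards. The literal "something" is taken from the asker's pattern; the variable names are illustrative.

```ruby
line = "foo,bar,baz:something"

# Group 1 captures the whole comma-separated list,
# group 2 the part after the colon.
m = line.match(/\A([a-z]+(?:,[a-z]+)*):(something)\z/)

if m
  items = m[1].split(",") # break the list apart outside the regex
  rest  = m[2]
  p items
  p rest
end
```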

Split string suppressing all null fields

I want to split a string suppressing all null fields
Command:
",1,2,,3,4,,".split(',')
Result:
["", "1", "2", "", "3", "4"]
Expected:
["1", "2", "3", "4"]
How to do this?
Edit
Ok. Just to sum up all the good answers posted.
What I wanted was for the split method (or some other method) not to generate empty strings. It looks like that isn't possible.
So, the solution is a two-step process: split the string as usual, and then somehow delete the empty strings from the resulting array.
The second part is exactly this question
(and its duplicate)
So I would use
",1,2,,3,4,,".split(',').delete_if(&:empty?)
The solution proposed by Nikita Rybak and by user229426 is to use the reject method. According to the docs, reject returns a new array, whereas delete_if modifies the array in place, which is more efficient since I don't want a copy. Using select, as proposed by Mark Byers, is even less efficient.
steenslag proposed replacing the commas with spaces and then splitting on a space:
",1,2,,3,4,,".gsub(',', ' ').split(' ')
Actually, the documentation says that a single space stands for whitespace. But results of "split(/\s/)" and "split(' ')" are not the same. Why's that?
Mark Byers proposed another solution: just using regular expressions. This seems like what I need, although it implies that you have to be a master of regexps. Still, it is a great solution! For example, if I need spaces to be separators as well as any non-alphanumeric symbol, I can rewrite this to
",1,2, ,3 3,4 4 4,,".scan(/\w+[\s*\w*]*/)
the result is:
["1", "2", "3 3", "4 4 4"]
But again, regexps are very unintuitive and they take experience.
Summary
I expected split to treat whitespace as if it were a comma or even a regexp. I expect it not to produce empty strings. I think this is a bug in ruby or my misunderstanding.
Made it a community question.
There's a reject method in Array:
",1,2,,3,4,,".split(',').reject { |s| s.empty? }
Or if you prefer Symbol#to_proc:
",1,2,,3,4,,".split(',').reject(&:empty?)
Hoping to illuminate a bit here:
But results of "split(/\s/)" and "split(' ')" are not the same. Why's that?
If you look at the docs for String#split you'll see that split with ' ' is a special case:
If pattern is a single space, str is split on whitespace,
with leading whitespace and runs of contiguous whitespace characters ignored.
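A small sketch of the difference the documentation describes, using a made-up sample string:

```ruby
text = " a  b "

by_space  = text.split(" ")   # special case: leading and repeated whitespace ignored
by_regexp = text.split(/\s/)  # every single whitespace character is a separator
by_runs   = text.split(/\s+/) # runs collapse, but the leading empty field remains

p by_space  # => ["a", "b"]
p by_regexp # => ["", "a", "", "b"]
p by_runs   # => ["", "a", "b"]
```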
You also mention:
I expect it not to produce empty strings. I think this is a bug in ruby or my misunderstanding.
The problem probably lies between the keyboard and the chair. ;-)
split will happily produce empty strings as it should, because there are times when you would definitely want this ability, and there are plenty of easy ways to work around it. Consider if you were splitting a csv from an Excel file. Anywhere you see ',,' would be an empty column, not a column you should just get rid of.
Regardless, you've seen a bunch of solutions - and here's another one that might show you the things you can do with ruby and split!
It seems you want to split up data between multiple commas, so why not try that and see what happens?
a = ",1,2,,3,4,,5,,,,6,,,".split(/,+/)
It's a simple enough regular expression: /,+/ means one or more commas, so we'll split on that.
This almost gives you what you want, except that you also want to ignore the leading empty field. You'll note that split ignores the empty field on the end because (from the String#split docs):
If the limit parameter is omitted, trailing null fields are suppressed.
So that means we can either use something that will remove that empty string at the front of the array or just remove the initial commas. We can use gsub for that:
a = ",1,2,,3,4,,5,,,,6,,,".gsub(/^,+/,'')
If you print that out you'll see that our leading empty "field" is now gone. So we can combine them all in one line:
a = ",1,2,,3,4,,5,,,,6,,,".gsub(/^,+/,'').split(/,+/)
And you have another solution!
And incidentally, this points out another possibility: we can just clean up our string entirely before sending it to split if we want a simple split. I'll leave it to you to figure out what this one is doing:
a = ",1,2,,3,4,,5,,,,6,,,".gsub(/,+/,',').gsub(/^,/,'').split(',')
There are lots of ways to do things in ruby. If it seems that ruby isn't doing what you want, then take a look at the docs and realize that it probably works the way that it does for a reason. There are plenty of people who would be upset if split wasn't able to spit out empty fields :)
Hope that helps!
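To see the behaviours discussed in this answer side by side, here is a short sketch using the string from the question. The variable names are illustrative.

```ruby
input = ",1,2,,3,4,,"

default_split = input.split(",")      # trailing empty fields are suppressed
keep_trailing = input.split(",", -1)  # a negative limit keeps them
non_empty     = input.split(",").reject(&:empty?)
scanned       = input.scan(/[^,]+/)   # match the fields instead of the separators

p default_split # => ["", "1", "2", "", "3", "4"]
p keep_trailing # => ["", "1", "2", "", "3", "4", "", ""]
p non_empty     # => ["1", "2", "3", "4"]
p scanned       # => ["1", "2", "3", "4"]
```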
You could use split followed by select:
",1,2,,3,4,,".split(',').select{|x|!x.empty?}
Or you could use a regular expression to match what you want to keep instead of splitting on the delimiter:
",1,2,,3,4,,".scan(/[^,]+/)
",1,2,,3,4,,".split(/,/).reject(&:empty?)
",1,2,,3,,,4,,".squeeze(",").sub(/^,*|,*$/,"").split(",")
String#split(pattern) behaves as desired when pattern is a single space (ruby-doc).
",1,2,,3,4,,".gsub(',', ' ').split(' ')

Strip words beginning with a specific letter from a sentence using regex

I'm not sure how to use regular expressions in a function so that I could grab all the words in a sentence starting with a particular letter. I know that I can do:
word =~ /^#{letter}/
to check if the word starts with the letter, but how do I go from word to word? Do I need to convert the string to an array and then iterate through each word, or is there a faster way using regex? I'm using ruby, so that would look like:
matching_words = Array.new
sentence.split(" ").each do |word|
  matching_words.push(word) if word =~ /^#{letter}/
end
Scan may be a good tool for this:
#!/usr/bin/ruby1.8
s = "I think Paris in the spring is a beautiful place"
p s.scan(/\b[it][[:alpha:]]*/i)
# => ["I", "think", "in", "the", "is"]
\b means "word boundary".
[:alpha:] matches an upper or lowercase letter (a-z, A-Z).
You can use \b. It matches word boundaries--the invisible spot just before and after a word. (You can't see them, but oh they're there!) Here's the regex:
/\b(a\w*)\b/
The \w matches a word character, like letters and digits and stuff like that.
You can see me testing it here: http://rubular.com/regexes/13347
Similar to Anon.'s answer:
/\b(a\w*)/g
and then see all the results with (usually) $n, where n is the n-th hit. Many libraries return /g results as arrays for the n-th set of parentheses, so in this case $1 would return an array of all the matching words. You'll want to double-check with whatever library you're using to figure out how it returns matches like this; sadly, there's a lot of variation in how global searches return results.
As to the \w vs [a-zA-Z], you can sometimes get faster execution by using the built-in definitions of things like that, as it can easily have an optimized path for the preset character classes.
The /g at the end makes it a "global" search, so it'll find more than one. It's still restricted by line in some languages / libraries, though, so if you wish to check an entire file you'll sometimes need /gm, to make it multi-line
If you want to remove results, like your title (but not question) suggests, try:
/\ba\w*//g
which does a search-and-replace in most languages (/<search>/<replacement>/). Sometimes you need a "s" at the front. Depends on the language / library. In Ruby's case, use:
string.gsub(/(\b)a\w*(\b)/, "\\1\\2")
to retain the non-word characters, and optionally put any replacement text between \1 and \2. gsub for global, sub for the first result.
/\ba[a-z]*\b/i
will match any word starting with 'a'.
The \b indicates a word boundary - we want to only match starting from the beginning of a word, after all.
Then there's the character we want our word to start with.
Then we have as many as possible letter characters, followed by another word boundary.
To match all words starting with t, use:
\bt\w+
That will match test but not footest; \b means "word boundary".
Personally I think that regex is overkill for this application; simply running a select is more than capable of solving this particular problem.
"this is a test".split(' ').select{ |word| word[0,1] == 't' }
result => ["this", "test"]
or if you are determined to use regex then go with grep
"this is a test".split(' ').grep(/^t/)
result => ["this", "test"]
Hope this helps.
