I'm trying to use a Ruby regex to match word combinations like the ones below.
In the example below I only need cases 1-4; I marked them in caps for easy testing. The word in the middle (dbo, bcd) could be anything, or nothing as in case #3. I'm having trouble getting the double-period case #3 to work. It would also be good to match standalone SALES as a word, but maybe that's too much for one regex? Thanks to all the gurus.
This is my script, which partially works; I still need to add alpha..SALES:
s = '1 alpha.dbo.SALES 2 alpha.bcd.SALES 3 alpha..SALES 4 SALES
bad cases 5x alpha.saleS 6x saleSXX'
regex = /alpha+\.+[a-z]+\.?sales/ix
puts 'R: ' + s.scan(regex).to_s
##R: ["alpha.dbo.SALES", "alpha.bcd.SALES"]
s = '1 alpha.dbo.SALES 2 alpha.bcd.SALES 3 alpha..SALES 4 SALES
bad cases 5x alpha.saleS 6x saleSXX 7x alpha.abc.SALES.etc'
regex = /(?<=^|\s)(?:alpha\.[a-z]*\.)?(?:sales)(?=\s|$)/i
puts 'R: ' + s.scan(regex).to_s
Output:
R: ["alpha.dbo.SALES", "alpha.bcd.SALES", "alpha..SALES", "SALES"]
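As a quick sanity check of the idea above, the same token boundaries can also be spelled with negative lookarounds, `(?<!\S)` and `(?!\S)`. This is only a sketch, assuming the tokens are always delimited by whitespace; the constant name `SALES_TOKEN` is made up for illustration.

```ruby
# Hypothetical constant name. The boundaries are written as
# "not preceded by a non-space" and "not followed by a non-space".
SALES_TOKEN = /(?<!\S)(?:alpha\.[a-z]*\.)?sales(?!\S)/i

s = "1 alpha.dbo.SALES 2 alpha.bcd.SALES 3 alpha..SALES 4 SALES\n" \
    "bad cases 5x alpha.saleS 6x saleSXX 7x alpha.abc.SALES.etc"

matches = s.scan(SALES_TOKEN)
p matches
```

Cases 5x, 6x and 7x are rejected because the token either lacks the second period or is followed by more non-space characters.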
r = /
(?<=\d[ ]) # match a digit followed by a space in a positive lookbehind
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
\. # match a period
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
\. # match a period
| # or
\. # match a period
) # end non-capture group
)? # end non-capture group and optionally match it
SALES # match string
(?![.\p{Alpha}]) # do not match a period or letter (negative lookahead)
/x # free-spacing regex definition mode.
s.scan(r)
#=> ["alpha.dbo.SALES", "alpha.bcd.SALES", "alpha..SALES", "SALES"]
This regular expression is customarily written as follows.
r = /(?<=\d )(?:\p{Alpha}+\.(?:\p{Alpha}+\.|\.))?SALES(?![.\p{Alpha}])/
In free-spacing mode the space must be put in a character class ([ ]); else it would be stripped out.
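A tiny illustration of that point, using made-up patterns rather than the regex above:

```ruby
# In free-spacing mode (/x) a bare space in the pattern is ignored,
# while a space inside a character class is kept.
bare      = /a b/x     # behaves like /ab/
bracketed = /a[ ]b/x   # behaves like /a b/

bare_matches_ab        = bare.match?("ab")
bare_matches_a_b       = bare.match?("a b")
bracketed_matches_a_b  = bracketed.match?("a b")

p bare_matches_ab, bare_matches_a_b, bracketed_matches_a_b
```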
I have a large file, and I want to be able to check if a word is present twice.
puts "Enter a word: "
$word = gets.chomp
if File.read('worldcountry.txt') # do something if the word entered is present twice...
How can I check whether the file worldcountry.txt contains the $word I entered twice?
I found what I needed here: count-the-frequency-of-a-given-word-in-text-file-in-ruby
It is in Gerry's post, with this code:
word_count = 0
my_word = "input"
File.open("texte.txt", "r") do |f|
  f.each_line do |line|
    line.split(' ').each do |word|
      word_count += 1 if word == my_word
    end
  end
end
puts "\n" + word_count.to_s
Thanks, I will pay more attention next time.
If the file is not overly large, it can be gulped into a string. Suppose:
str = File.read('cat')
#=> "There was a dog 'Henry' who\nwas pals with a dog 'Buck' and\na dog 'Sal'."
puts str
There was a dog 'Henry' who
was pals with a dog 'Buck' and
a dog 'Sal'.
Suppose the given word is 'dog'.
Confirm the file contains at least two instances of the given word
One can attempt to match the regular expression
r1 = /\bdog\b.*\bdog\b/m
str.match?(r1)
#=> true
Confirm the file contains exactly two instances of the given word
Using a regular expression to determine whether the file contains exactly two instances of the given word is somewhat more complex. Let
r2 = /\A(?:(?:(?!\bdog\b).)*\bdog\b){2}(?!.*\bdog\b)/m
str.match?(r2)
#=> false
The two regular expressions can be written in free-spacing mode to make them self-documenting.
r1 = /
\bdog\b # match 'dog' surrounded by word breaks
.* # match zero or more characters
\bdog\b # match 'dog' surrounded by word breaks
/m # cause . to match newlines
r2 = /
\A # match beginning of string
(?: # begin non-capture group
(?: # begin non-capture group
(?! # begin negative lookahead
\bdog\b # match 'dog' surrounded by word breaks
) # end negative lookahead
. # match one character that does not begin the word 'dog'
) # end non-capture group
* # execute preceding non-capture group zero or more times
\bdog\b # match 'dog' surrounded by word breaks
) # end non-capture group
{2} # execute preceding non-capture group twice
(?! # begin negative lookahead
.* # match zero or more characters
\bdog\b # match 'dog' surrounded by word breaks
) # end negative lookahead
/xm # cause . to match newlines and invoke free-spacing mode
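If the word comes from user input, as in the question, a simpler alternative is to count whole-word occurrences with `String#scan`. This is a sketch of a different technique from the single-regex test above; the method name `occurrences` is hypothetical, and `Regexp.escape` is used on the assumption that the entered word may contain regex metacharacters.

```ruby
# Hypothetical helper: number of whole-word occurrences of `word` in `text`.
def occurrences(text, word)
  text.scan(/\b#{Regexp.escape(word)}\b/).size
end

str = "There was a dog 'Henry' who\nwas pals with a dog 'Buck' and\na dog 'Sal'."

count = occurrences(str, 'dog')
puts count                 # number of times 'dog' appears as a word
puts count >= 2            # at least two instances?
puts count == 2            # exactly two instances?
```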
I have the following regex which will capture the first N words and finish at the next period, exclamation point or question mark. I need to get chunks of texts that vary in the number of words but I want complete sentences.
regex = (?:\w+[.?!]?\s+){10}(?:\w+,?\s+)*?\w+[.?!]
It works with the following text:
Therapy extract straw and chitosan from shrimp shells alone
accounted for 2, 4, 6, 8 and 10% found that the extract straw 8% is
highly effective in inhibiting the growth of algae Microcystis spp.
The number of cells and the amount of chlorophyll a was reduced during
treatment. Both value decreased continuous until the end of the trial.
https://regex101.com/r/ardIQ7/5
However it won't work with the following text:
Therapy extract straw and chitosan from shrimp shells alone accounted
for 2, 4, 6, 8 and 10% found that the extract straw 8.2% is highly
effective in inhibiting the growth of algae Microcystis spp. The
number of cells and the amount of chlorophyll a was reduced during
treatment. Both value decreased continuous until the end of the trial.
That is because of the digits (8.2%) with decimals and %.
I have been trying to figure out how to also capture these items but need some assistance to point me in the right direction. I don't just want to capture the first sentence. I want to capture N words which may include several sentences and returns complete sentences.
r = /
(?: # begin a non-capture group
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
| # or
\-? # optionally match a minus sign
(?: # begin non-capture group
\d+ # match one or more digits
| # or
\d+ # match one or more digits
\. # match a decimal point
\d+ # match one or more digits
) # end non-capture group
%? # optionally match a percentage character
) # end non-capture group
[,;:.!?]? # optionally ('?' following ']') match a punctuation char
[ ]+ # match one or more spaces
) # end non-capture group
{9,}? # execute the preceding non-capture group at least 9 times, lazily ('?')
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
| # or
\-? # optionally match a minus sign
(?: # begin non-capture group
\d+ # match one or more digits
| # or
\d+ # match one or more digits
\. # match a decimal point
\d+ # match one or more digits
) # end non-capture group
%? # optionally match a percentage character
) # end non-capture group
[.!?] # match one of the three punctuation characters
(?!\S) # negative look-ahead: do not match a non-whitespace char
/x # free-spacing regex definition mode
Let text equal the paragraph you wish to examine ("Therapy extract straw...end of the trial.")
Then
text[r]
#=> "Therapy extract straw and chitosan from...the growth of algae Microcystis spp."
We can simplify the construction of the regex (and avoid duplicate bits) as follows.
def construct_regex(min_nbr_words)
common_bits = /(?:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)/
/(?:#{common_bits}[,;:.!?]? +){#{min_nbr_words},}?#{common_bits}[.!?](?!\S)/
end
r = construct_regex(10)
#=> /(?:(?-mix:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)[,;:.!?]? +){10,}?(?-mix:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)[.!?](?!\S)/
This regex could be simplified if it were permitted to match nonsense words such as "ab2.3e%" or "2.3.2%". As presently defined, the regex will not match such words.
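To see how the minimum word count affects the chunk that is returned, here is a small sketch that repeats the `construct_regex` definition so it runs on its own and applies it to the second sample paragraph, assuming the words are separated by single spaces (no line breaks).

```ruby
def construct_regex(min_nbr_words)
  common_bits = /(?:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)/
  /(?:#{common_bits}[,;:.!?]? +){#{min_nbr_words},}?#{common_bits}[.!?](?!\S)/
end

# The sample paragraph, joined with single spaces (an assumption of this sketch).
text = "Therapy extract straw and chitosan from shrimp shells alone accounted " \
       "for 2, 4, 6, 8 and 10% found that the extract straw 8.2% is highly " \
       "effective in inhibiting the growth of algae Microcystis spp. The " \
       "number of cells and the amount of chlorophyll a was reduced during " \
       "treatment. Both value decreased continuous until the end of the trial."

short_chunk = text[construct_regex(10)]  # stops at the first sentence end after 10+ words
long_chunk  = text[construct_regex(40)]  # needs 40+ words, so it runs into the second sentence

puts short_chunk
puts long_chunk
```

The first sentence has 34 words, so asking for at least 40 forces the match to continue to the end of the second sentence.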
Try this: (?:\S+[,.?!]?\s+){1,200}[\s\S]*?(\. |!|\?)
This will match up to N words (200 here).
If the Nth word does not end a sentence, the lazy [\s\S]*? extends the match to the next sentence terminator. N is given in the quantifier as {1,N}, with no space after the comma.
I've been having difficulty trying to figure out how to go about solving this issue. I have 2 kinds of URLs in which I need to be able to update/increment the number value for the page.
Url 1:
forum-351-page-2.html
In the above, I would like to modify this url for n pages. So I'd like to generate new urls with a given range of say page-1 to page-30. But that's all I'd like to change. page-n.html
Url 2:
href="forumdisplay.php?fid=115&page=3
The second URL is different, but I feel it's the easier one.
R = /
(?: # begin non-capture group
(?<=-page-) # match string in a positive lookbehind
\d+ # match 1 or more digits
(?=\.html) # match period followed by 'html' in a positive lookahead
) # close non-capture group
| # or
(?: # begin non-capture group
(?<=&page=) # match string in a positive lookbehind
\d+ # match 1 or more digits
\z # match end of string
) # close non-capture group
/x # free-spacing regex definition mode
def update(str, val)
str.sub(R, val.to_s)
end
update("forum-351-page-2.html", 4)
#=> "forum-351-page-4.html"
update("forumdisplay.php?fid=115&page=3", "4")
#=> "forumdisplay.php?fid=115&page=4"
For the first url
url1 = "forum-351-page-2.html"
(1..30).each do |x|
puts url1.sub(/page-\d*/, "page-#{x}")
end
This will output
forum-351-page-1.html
forum-351-page-2.html
forum-351-page-3.html
...
forum-351-page-28.html
forum-351-page-29.html
forum-351-page-30.html
You can do the same thing for the second url.
url2 = "forumdisplay.php?fid=115&page=3"
url2.sub(/page=\d*$/, "page=#{x}")
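Both URL styles can also be handled by one helper that returns the whole list of pages. This is a minimal sketch; the names `PAGE_NUMBER` and `page_urls` are made up for illustration, and it assumes the page number appears either as `-page-N.html` or as a `page=N` query parameter.

```ruby
# Matches only the digits of the page number in either URL style.
PAGE_NUMBER = /(?<=-page-)\d+(?=\.html)|(?<=[?&]page=)\d+/

# Hypothetical helper: one URL per page number in `pages`.
def page_urls(url, pages)
  pages.map { |n| url.sub(PAGE_NUMBER, n.to_s) }
end

html_pages  = page_urls("forum-351-page-2.html", 1..3)
query_pages = page_urls("forumdisplay.php?fid=115&page=3", 1..2)

puts html_pages
puts query_pages
```

Because the lookarounds leave everything except the digits out of the match, the replacement string is just the new page number.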
Picked up Ruby recently and have been fiddling around with it. I wanted to learn how to use regex or other Ruby tricks to check for certain words, whitespace characters, valid format etc in a given text line.
Let's say I have an order list that looks strictly like this in this format:
cost: 50 items: book,lamp
One space after each colon, no space after each comma, no trailing whitespace at the end, and so on.
How can I check for errors in this format using Ruby? This for example should fail my checks:
cost: 60 items:shoes,football
My goal was to split the string by a " " and check to see if the first word was "cost:", if the second word was a number and so on but I realized that splitting on a " " doesn't help me check for extra whitespaces as it just eats it up. Also doesn't help me check for trailing whitespaces. How do I go about doing this?
You could use the following regular expression.
r = /
\A # match beginning of string
cost:\s # match "cost:" followed by a space
\d+\s # match > 0 digits followed by a space
items:\s # match "items:" followed by a space
[[:alpha:]]+ # match > 0 lowercase or uppercase letters
(?:,[[:alpha:]]+) # match a comma followed by > 0 lowercase or uppercase
# letters in a non-capture group (?: ... )
* # perform the match on non-capture group >= 0 times
\z # match the end of the string
/x # free-spacing regex definition mode
"cost: 50 items: book,lamp" =~ r #=> 0 (a match, beginning at index 0)
"cost: 50 items: book,lamp,table" =~ r #=> 0 (a match, beginning at index 0)
"cost: 60 items:shoes,football" =~ r #=> nil (no match)
The regex can of course be written in the normal manner:
r = /\Acost:\s\d+\sitems:\s[[:alpha:]]+(?:,[[:alpha:]]+)*\z/
or
r = /\Acost: \d+ items: [[:alpha:]]+(?:,[[:alpha:]]+)*\z/
though a whitespace character (\s) cannot be replaced by a space in the free-spacing mode definition (/x).
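Wrapped in a small predicate, the pattern also catches the extra-whitespace and trailing-whitespace cases from the question. A sketch only: the names `ORDER_LINE` and `valid_order_line?` are hypothetical, and it uses the literal-space form of the regex.

```ruby
ORDER_LINE = /\Acost: \d+ items: [[:alpha:]]+(?:,[[:alpha:]]+)*\z/

# Hypothetical helper: true only if the whole line has the strict format.
def valid_order_line?(line)
  ORDER_LINE.match?(line)
end

checks = {
  "cost: 50 items: book,lamp"     => valid_order_line?("cost: 50 items: book,lamp"),
  "cost: 60 items:shoes,football" => valid_order_line?("cost: 60 items:shoes,football"),
  "cost:  50 items: book"         => valid_order_line?("cost:  50 items: book"),   # doubled space
  "cost: 50 items: book,lamp "    => valid_order_line?("cost: 50 items: book,lamp "), # trailing space
  "cost: 50 items: book, lamp"    => valid_order_line?("cost: 50 items: book, lamp")  # space after comma
}

checks.each { |line, ok| puts "#{line.inspect} => #{ok}" }
```

Anchoring with `\A` and `\z` is what makes stray whitespace at either end fail, which splitting on a space cannot detect.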
I need a regex pattern which matches a date with optional time.
The date should be a valid U.S. date in m/d/yyyy format. The time should be h:mm:ss am/pm or 24-hour time hh:mm:ss.
Matches: 9/1/2011 | 9/1/2011 10:00 am | 9/1/2011 10:00 AM | 9/1/2011 10:00:00
This pattern will be used in a Ruby on Rails project, so it should be in a format usable via Ruby. See http://rubular.com/ for testing.
Here's my existing date pattern (which may be an over-kill):
DATE_PATTERN = /^((((0[13578])|([13578])|(1[02]))[\/](([1-9])|([0-2][0-9])|(3[01])))|(((0[469])|([469])|(11))[\/](([1-9])|([0-2][0-9])|(30)))|((2|02)[\/](([1-9])|([0-2][0-9]))))[\/]\d{4}$|^\d{4}/
Regular expressions are horrible for this kind of job. If you're using Ruby, I'd recommend using Time.strptime (from the time standard library) to parse the data and check its validity:
require 'time' # Time.strptime is defined by the 'time' standard library

def validate_date(date_str)
  valid_formats = ["%m/%d/%Y", "%m/%d/%Y %I:%M %P"]
  # see http://www.ruby-doc.org/core-1.9.3/Time.html#method-i-strftime for more
  valid_formats.each do |format|
    # strptime raises if the string does not fit the format
    valid = Time.strptime(date_str, format) rescue false
    return true if valid
  end
  return false
end
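One thing a parser can check that a pattern alone cannot is whether the day exists in that month. The following sketch uses `DateTime.strptime` from the `date` library for that; the helper name `real_us_date?` and the exact list of formats are assumptions made for illustration. It only demonstrates the calendar check and does not test how trailing characters are handled.

```ruby
require 'date'

# Assumed formats: date only, 12-hour time with am/pm, 24-hour time with seconds.
US_FORMATS = ["%m/%d/%Y", "%m/%d/%Y %I:%M %p", "%m/%d/%Y %H:%M:%S"].freeze

# Hypothetical helper: true if the string parses under one of the formats
# and names a date that exists on the calendar.
def real_us_date?(str)
  US_FORMATS.any? do |format|
    begin
      DateTime.strptime(str, format)
      true
    rescue ArgumentError   # raised for unparsable input and for invalid dates
      false
    end
  end
end

puts real_us_date?("9/1/2011")
puts real_us_date?("9/1/2011 10:00 am")
puts real_us_date?("2/31/2011")   # February 31st does not exist
puts real_us_date?("13/1/2011")   # there is no month 13
```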
Well, here's what I ended up with; using stricter military time:
DATE_TIME_FORMAT = /^([01]?\d)\/([0-2]?\d|3[01])\/(199\d|[2-9]\d{3})\s(0?\d|1\d|2[0-3]):([0-5]\d):([0-5]\d)$/
Matches: 1/19/2011 23:59:59
Captures:
1
19
2011
23
59
59
if subject =~ /\A(?:0?[1-9]|1[012])\/(?:0?[1-9]|[12]\d|3[01])\/(?:\d{4})(?:\s+(?:(?:[01]?\d|2[0-3]):(?:[0-5]\d)|(?:0?\d|1[0-2]):(?:[0-5]\d)\s+[ap]m))?\s*\Z/i
  # Successful match
end
Good luck..
How it works :
"
\A # Assert position at the beginning of the string
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
0 # Match the character “0” literally
? # Between zero and one times, as many times as possible, giving back as needed (greedy)
[1-9] # Match a single character in the range between “1” and “9”
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
1 # Match the character “1” literally
[012] # Match a single character present in the list “012”
)
/ # Match the character “/” literally
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
0 # Match the character “0” literally
? # Between zero and one times, as many times as possible, giving back as needed (greedy)
[1-9] # Match a single character in the range between “1” and “9”
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
[12] # Match a single character present in the list “12”
\d # Match a single digit 0..9
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
3 # Match the character “3” literally
[01] # Match a single character present in the list “01”
)
/ # Match the character “/” literally
(?: # Match the regular expression below
\d # Match a single digit 0..9
{4} # Exactly 4 times
)
(?: # Match the regular expression below
\s # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
[01] # Match a single character present in the list “01”
? # Between zero and one times, as many times as possible, giving back as needed (greedy)
\d # Match a single digit 0..9
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
2 # Match the character “2” literally
[0-3] # Match a single character in the range between “0” and “3”
)
: # Match the character “:” literally
(?: # Match the regular expression below
[0-5] # Match a single character in the range between “0” and “5”
\d # Match a single digit 0..9
)
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
0 # Match the character “0” literally
? # Between zero and one times, as many times as possible, giving back as needed (greedy)
\d # Match a single digit 0..9
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
1 # Match the character “1” literally
[0-2] # Match a single character in the range between “0” and “2”
)
: # Match the character “:” literally
(?: # Match the regular expression below
[0-5] # Match a single character in the range between “0” and “5”
\d # Match a single digit 0..9
)
\s # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
[ap] # Match a single character present in the list “ap”
m # Match the character “m” literally
)
)? # Between zero and one times, as many times as possible, giving back as needed (greedy)
\s # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\Z # Assert position at the end of the string (or before the line break at the end of the string, if any)
"
Remember, it is not the spoon that bends, but you!
Here's what I came up with that seems to work:
regex = /^1?\d{1}\/[123]?\d{1}\/\d{4}(\s[12]?\d:[0-5]\d(:[0-5]\d)?(\s[ap]m)?)?$/
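Running this last pattern over the sample inputs from the question shows where it is loose or strict. The sketch below only exercises the regex exactly as written; the `/i` variant is an assumption about how the uppercase AM case might be covered.

```ruby
regex = /^1?\d{1}\/[123]?\d{1}\/\d{4}(\s[12]?\d:[0-5]\d(:[0-5]\d)?(\s[ap]m)?)?$/

samples = ["9/1/2011", "9/1/2011 10:00 am", "9/1/2011 10:00 AM",
           "9/1/2011 10:00:00", "19/39/2011"]

results = samples.map { |s| [s, regex.match?(s)] }.to_h
results.each { |input, ok| puts "#{input.inspect} => #{ok}" }

# Same pattern with the case-insensitive flag added (an assumed tweak).
insensitive = Regexp.new(regex.source, Regexp::IGNORECASE)
upper_am_ok = insensitive.match?("9/1/2011 10:00 AM")
puts upper_am_ok
```

As written, the pattern rejects the uppercase "AM" sample because it has no `i` flag, and it accepts out-of-range values such as month 19 or day 39, so it checks the shape of the string rather than the validity of the date.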