Identify extract & replace part of a string in ruby - ruby

I've been having difficulty trying to figure out how to go about solving this issue. I have 2 kinds of URLs in which I need to be able to update/increment the number value for the page.
Url 1:
forum-351-page-2.html
In the above, I would like to modify this url for n pages. So I'd like to generate new urls with a given range of say page-1 to page-30. But that's all I'd like to change. page-n.html
Url 2:
href="forumdisplay.php?fid=115&page=3
The second URL is different, but I feel it's easier to handle.

R = /
(?: # begin non-capture group
(?<=-page-) # match string in a positive lookbehind
\d+ # match 1 or more digits
(?=\.html) # match period followed by 'html' in a positive lookahead
) # close non-capture group
| # or
(?: # begin non-capture group
(?<=&page=) # match string in a positive lookbehind
\d+ # match 1 or more digits
\z # match end of string
) # close non-capture group
/x # free-spacing regex definition mode
def update(str, val)
str.sub(R, val.to_s)
end
update("forum-351-page-2.html", 4)
#=> "forum-351-page-4.html"
update("forumdisplay.php?fid=115&page=3", "4")
#=> "forumdisplay.php?fid=115&page=4"

For the first url
url1 = "forum-351-page-2.html"
(1..30).each do |x|
puts url1.sub(/page-\d*/, "page-#{x}")
end
This will output
"forum-351-page-1.html"
"forum-351-page-2.html"
"forum-351-page-3.html"
...
"forum-351-page-28.html"
"forum-351-page-29.html"
"forum-351-page-30.html"
You can do the same thing for the second URL, inside the same loop:
url2 = "forumdisplay.php?fid=115&page=3"
url2.sub(/page=\d*$/, "page=#{x}")
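Both URL styles can be handled by one helper. The following is only a sketch: the names PAGE_NUMBER and page_urls are made up for illustration, and it assumes the page number always follows "-page-" (before ".html") or a "page=" query parameter.

```ruby
# Hypothetical helper, not part of the original answers.
# Matches only the digits of the page number in either URL style.
PAGE_NUMBER = /(?<=-page-)\d+(?=\.html)|(?<=[?&]page=)\d+/

# Returns one URL per page number in `pages`, leaving the rest of the URL untouched.
def page_urls(url, pages)
  pages.map { |n| url.sub(PAGE_NUMBER, n.to_s) }
end

page_urls("forum-351-page-2.html", 1..3)
#=> ["forum-351-page-1.html", "forum-351-page-2.html", "forum-351-page-3.html"]
page_urls("forumdisplay.php?fid=115&page=3", 1..2)
#=> ["forumdisplay.php?fid=115&page=1", "forumdisplay.php?fid=115&page=2"]
```

Because only the digits are matched, the forum id (351, fid=115) is never modified.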

Related

Ruby - quick way to extract number from string [closed]

As a beginner in Ruby, is there a quick way to extract the first and second number from this string 5.16.0.0-15? In this case, I am looking for 5 and 16. Thanks
One way is to use the method String#match with the regular expression
rgx = /(\d+)\.(\d+)/
to construct a MatchData object. The regular expression captures the first two strings of digits, separated by a period. The method MatchData#captures is then used to extract the contents of capture groups 1 and 2 (strings) and save them to an array. Lastly, String#to_i is used to convert the strings in the array to integers:
"5.16.0.0-15".match(rgx).captures.map(&:to_i)
#=> [5, 16]
We see that
m = "5.16.0.0-15".match(rgx)
#=> #<MatchData "5.16" 1:"5" 2:"16">
a = m.captures
#=> ["5", "16"]
a.map(&:to_i)
#=> [5, 16]
a.map(&:to_i) can be thought of as shorthand for a.map { |s| s.to_i }.
We can express the regular expression in free-spacing mode to make it self-documenting:
/
( # begin capture group 1
\d+ # match one or more digits
) # end capture group 1
\. # match a period
( # begin capture group 2
\d+ # match one or more digits
) # end capture group 2
/x # invoke free-spacing regex definition mode
One reason for using a regular expression here is to confirm the structure of the string, should that be desired. That could be done by using the following regex:
rgx1 =
/
\A # match the beginning of the string
( # begin capture group 1
\d+ # match one or more digits
) # end capture group 1
\. # match a period
( # begin capture group 2
\d+ # match one or more digits
) # end capture group 2
(?: # begin a non-capture group
\. # match a period
\d+ # match one or more digits
(?: # begin a non-capture group
\- # match a hyphen
\d+ # match one or more digits
)? # end non-capture group and make it optional
)* # end non-capture group and execute it zero or more times
\z # match the end of the string
/x # invoke free-spacing regex definition mode
"5.16.0.0-15".match(rgx1).captures.map(&:to_i)
#=> [5, 16]
"5.16.0.A".match(rgx1)
#=> nil
"5.16.0.0-1-5".match(rgx1)
#=> nil
In the last two examples match returns nil, so chaining captures onto the result would raise a NoMethodError because nil has no method captures. One could of course handle those exceptions.
rgx1 is conventionally written /\A(\d+)\.(\d+)(?:\.\d+(?:\-\d+)?)*\z/.
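One way to avoid the exception altogether is the safe-navigation operator, which returns nil instead of raising when the match fails. This is a minimal sketch; the names VERSION and major_minor are hypothetical and the regex is the validating one described above.

```ruby
# Hypothetical names; the pattern validates the whole string.
VERSION = /\A(\d+)\.(\d+)(?:\.\d+(?:-\d+)?)*\z/

# Returns [major, minor] as integers, or nil when the string is malformed.
def major_minor(str)
  str.match(VERSION)&.captures&.map(&:to_i)
end

major_minor("5.16.0.0-15")  #=> [5, 16]
major_minor("5.16.0.A")     #=> nil
major_minor("5.16.0.0-1-5") #=> nil
```

The caller can then test the result for nil rather than rescuing NoMethodError.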
Use #split, telling it to split on "." and only split into three parts, then access the first two.
irb(main):003:0> s = "5.16.0.0-15"
=> "5.16.0.0-15"
irb(main):004:0> s.split(".", 3)[0..1]
=> ["5", "16"]
Optionally map to integers.
irb(main):005:0> s.split(".", 3)[0..1].map(&:to_i)
=> [5, 16]

How to check with ruby if a word is repeated twice in a file

I have a large file, and I want to be able to check if a word is present twice.
puts "Enter a word: "
$word = gets.chomp
if File.read('worldcountry.txt') # do something if the word entered is present twice...
How can I check whether the file worldcountry.txt includes the $word I entered twice?
I found what I needed here: count-the-frequency-of-a-given-word-in-text-file-in-ruby
From Gerry's post, with this code:
word_count = 0
my_word = "input"
File.open("texte.txt", "r") do |f|
f.each_line do |line|
line.split(' ').each do |word|
word_count += 1 if word == my_word
end
end
end
puts "\n" + word_count.to_s
Thanks, I will pay more attention next time.
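The same line-by-line count can be written more compactly. The sketch below is an assumption-laden variant, not the quoted code: the method name count_word_in_file is made up, words are taken to be whitespace-separated, and the comparison is exact (case-sensitive, punctuation not stripped). A temporary file stands in for worldcountry.txt.

```ruby
require "tempfile"

# Hypothetical helper: counts exact, whitespace-separated occurrences of `word`.
# File.foreach reads one line at a time, so a large file is never fully loaded.
def count_word_in_file(path, word)
  File.foreach(path).sum { |line| line.split.count(word) }
end

# Demonstration with a throwaway file.
Tempfile.create("countries") do |f|
  f.write("France Spain\nItaly France\nPortugal\n")
  f.flush
  puts "present twice" if count_word_in_file(f.path, "France") == 2
end
```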
If the file is not overly large, it can be gulped into a string. Suppose:
str = File.read('cat')
#=> "There was a dog 'Henry' who\nwas pals with a dog 'Buck' and\na dog 'Sal'."
puts str
There was a dog 'Henry' who
was pals with a dog 'Buck' and
a dog 'Sal'.
Suppose the given word is 'dog'.
Confirm the file contains at least two instances of the given word
One can attempt to match the regular expression
r1 = /\bdog\b.*\bdog\b/m
str.match?(r1)
#=> true
Confirm the file contains exactly two instances of the given word
Using a regular expression to determine if the file contains exactly two instances of the given word is somewhat more complex. Let
r2 = /\A(?:(?:(?!\bdog\b).)*\bdog\b){2}(?!.*\bdog\b)/m
str.match?(r2)
#=> false
The two regular expressions can be written in free-spacing mode to make them self-documenting.
r1 = /
\bdog\b # match 'dog' surrounded by word breaks
.* # match zero or more characters
\bdog\b # match 'dog' surrounded by word breaks
/m # cause . to match newlines
r2 = /
\A # match beginning of string
(?: # begin non-capture group
(?: # begin non-capture group
(?! # begin negative lookahead
\bdog\b # match 'dog' surrounded by word breaks
) # end negative lookahead
. # match one character that does not begin 'dog'
) # end non-capture group
* # execute preceding non-capture group zero or more times
\bdog\b # match 'dog' surrounded by word breaks
) # end non-capture group
{2} # execute preceding non-capture group twice
(?! # begin negative lookahead
.* # match zero or more characters
\bdog\b # match 'dog' surrounded by word breaks
) # end negative lookahead
/xm # cause . to match newlines and invoke free-spacing mode
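If a single regex is not required, counting the matches is arguably easier to read and covers both the "at least two" and the "exactly two" questions. This is a sketch using String#scan; the method name word_count is hypothetical, and Regexp.escape is used so that a user-supplied word is treated literally.

```ruby
# Hypothetical helper: number of whole-word occurrences of `word` in `text`.
def word_count(text, word)
  text.scan(/\b#{Regexp.escape(word)}\b/).size
end

str = "There was a dog 'Henry' who\nwas pals with a dog 'Buck' and\na dog 'Sal'."

word_count(str, "dog")      #=> 3
word_count(str, "dog") >= 2 #=> true  (at least two instances)
word_count(str, "dog") == 2 #=> false (not exactly two instances)
```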

Ruby regex get the word combo separated by period

I'm trying to use Ruby regex to get word combo like below.
In the example below I only need cases 1-4; I marked them in caps for easy testing. The word in the middle (dbo, bcd) could be anything, or nothing as in case 3. I'm having trouble getting the double-period case 3 to work. It would also be good to match the standalone SALES as a word, but that's probably too much for one regex? Thanks, all you gurus.
This is my script, which is partially working; I still need to add alpha..SALES
s = '1 alpha.dbo.SALES 2 alpha.bcd.SALES 3 alpha..SALES 4 SALES
bad cases 5x alpha.saleS 6x saleSXX'
regex = /alpha+\.+[a-z]+\.?sales/ix
puts 'R: ' + s.scan(regex).to_s
##R: ["alpha.dbo.SALES", "alpha.bcd.SALES"]
s = '1 alpha.dbo.SALES 2 alpha.bcd.SALES 3 alpha..SALES 4 SALES
bad cases 5x alpha.saleS 6x saleSXX 7x alpha.abc.SALES.etc'
regex = /(?<=^|\s)(?:alpha\.[a-z]*\.)?(?:sales)(?=\s|$)/i
puts 'R: ' + s.scan(regex).to_s
Output:
R: ["alpha.dbo.SALES", "alpha.bcd.SALES", "alpha..SALES", "SALES"]
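When the candidates are always separated by whitespace, an alternative is to split the string and test each token against a fully anchored pattern, which avoids the lookarounds. This is only a sketch under that assumption; TOKEN is a made-up name.

```ruby
s = "1 alpha.dbo.SALES 2 alpha.bcd.SALES 3 alpha..SALES 4 SALES\n" \
    "bad cases 5x alpha.saleS 6x saleSXX 7x alpha.abc.SALES.etc"

# Hypothetical constant: an optional "alpha.<word or nothing>." prefix, then "sales",
# anchored to the whole token so trailing text such as ".etc" is rejected.
TOKEN = /\A(?:alpha\.[a-z]*\.)?sales\z/i

s.split.grep(TOKEN)
#=> ["alpha.dbo.SALES", "alpha.bcd.SALES", "alpha..SALES", "SALES"]
```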
r = /
(?<=\d[ ]) # match a digit followed by a space in a positive lookbehind
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
\. # match a period
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
\. # match a period
| # or
\. # match a period
) # end non-capture group
)? # end non-capture group and optionally match it
SALES # match string
(?![.\p{Alpha}]) # do not match a period or letter (negative lookahead)
/x # free-spacing regex definition mode.
s.scan(r)
#=> ["alpha.dbo.SALES", "alpha.bcd.SALES", "alpha..SALES", "SALES"]
This regular expression is customarily written as follows.
r = /(?<=\d )(?:\p{Alpha}+\.(?:\p{Alpha}+\.|\.))?SALES(?![.\p{Alpha}])/
In free-spacing mode the space must be put in a character class ([ ]); else it would be stripped out.

Insert hyphen into number

I want to convert:
"890414.14.1422, 900515141092, 950616-12-5414"
to:
"890414-14-1422, 900515-14-1092, 950616-12-5414"
How can I achieve it?
I tried:
def format_ids(string)
string.gsub(/(\d{6})[.-](\d{2})[.-](\d{4})/, '\1-\2-\3')
end
format_ids("890414.14.1422, 900515141092, 950616-12-5414")
# => "890414-14-1422, 900515141092, 950616-12-5414"
You should make the delimiters in the input string optional:
- string.gsub(/(\d{6})[.-](\d{2})[.-](\d{4})/, '\1-\2-\3')
+ string.gsub(/(\d{6})[.-]?(\d{2})[.-]?(\d{4})/, '\1-\2-\3')
Note the question marks after the delimiters; they do the trick.
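Applied to the method from the question, the change looks like this:

```ruby
# The question's method with both delimiters made optional.
def format_ids(string)
  string.gsub(/(\d{6})[.-]?(\d{2})[.-]?(\d{4})/, '\1-\2-\3')
end

format_ids("890414.14.1422, 900515141092, 950616-12-5414")
#=> "890414-14-1422, 900515-14-1092, 950616-12-5414"
```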
str = "890414.14.1422, 900515141092, 950616-12-5414"
r = /
( # begin capture group 1
\. # match a period
| # or
(?<=\d{6}) # match after 6 digits (positive lookbehind)
(?=\d{6}) # match before 6 digits (positive lookahead)
| # or
(?<=\d{8}) # match after 8 digits (positive lookbehind)
(?=\d{4}) # match before 4 digits (positive lookahead)
) # end capture group 1
/x # free-spacing regex definition mode
str.gsub(r,'-')
#=> "890414-14-1422, 900515-14-1092, 950616-12-5414"
This regular expression is conventionally (not free-spacing mode) written as follows:
/(\.|(?<=\d{6})(?=\d{6})|(?<=\d{8})(?=\d{4}))/
Note that (?<=\d{6}) and (?=\d{6}) together match a position between two consecutive digits that has a width of zero, as do (?<=\d{8}) and (?=\d{4}).
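A different technique, sketched here as an alternative rather than as either answer's method, is to strip whatever separators are present and rebuild each id from its digits. The method name normalize_ids is hypothetical, and the sketch assumes every valid id has exactly 12 digits; anything else is left unchanged.

```ruby
# Hypothetical helper: rebuilds each 12-digit id as 6-2-4, whatever separators it had.
def normalize_ids(string)
  string.gsub(/\d[\d.-]*\d/) do |id|
    digits = id.delete(".-")
    if digits.length == 12
      "#{digits[0, 6]}-#{digits[6, 2]}-#{digits[8, 4]}"
    else
      id # not a 12-digit id: leave as is
    end
  end
end

normalize_ids("890414.14.1422, 900515141092, 950616-12-5414")
#=> "890414-14-1422, 900515-14-1092, 950616-12-5414"
```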

Improve my Regex to include numbers that contain decimals and percentage signs

I have the following regex which will capture the first N words and finish at the next period, exclamation point or question mark. I need to get chunks of texts that vary in the number of words but I want complete sentences.
regex = (?:\w+[.?!]?\s+){10}(?:\w+,?\s+)*?\w+[.?!]
It works with the following text:
Therapy extract straw and chitosan from shrimp shells alone
accounted for 2, 4, 6, 8 and 10% found that the extract straw 8% is
highly effective in inhibiting the growth of algae Microcystis spp.
The number of cells and the amount of chlorophyll a was reduced during
treatment. Both value decreased continuous until the end of the trial.
https://regex101.com/r/ardIQ7/5
However it won't work with the following text:
Therapy extract straw and chitosan from shrimp shells alone accounted
for 2, 4, 6, 8 and 10% found that the extract straw 8.2% is highly
effective in inhibiting the growth of algae Microcystis spp. The
number of cells and the amount of chlorophyll a was reduced during
treatment. Both value decreased continuous until the end of the trial.
That is because of the digits (8.2%) with decimals and %.
I have been trying to figure out how to also capture these items but need some assistance to point me in the right direction. I don't just want to capture the first sentence. I want to capture N words which may include several sentences and returns complete sentences.
r = /
(?: # begin a non-capture group
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
| # or
\-? # optionally match a minus sign
(?: # begin non-capture group
\d+ # match one or more digits
| # or
\d+ # match one or more digits
\. # match a decimal point
\d+ # match one or more digits
) # end non-capture group
%? # optionally match a percentage character
) # end non-capture group
[,;:.!?]? # optionally ('?' following ']') match a punctuation char
[ ]+ # match one or more spaces
) # end non-capture group
{9,}? # execute the preceding non-capture group at least 9 times, lazily ('?')
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
| # or
\-? # optionally match a minus sign
(?: # begin non-capture group
\d+ # match one or more digits
| # or
\d+ # match one or more digits
\. # match a decimal point
\d+ # match one or more digits
) # end non-capture group
%? # optionally match a percentage character
) # end non-capture group
[.!?] # match one of the three punctuation characters
(?!\S) # negative look-ahead: do not match a non-whitespace char
/x # free-spacing regex definition mode
Let text equal the paragraph you wish to examine ("Therapy extract straw...end of the trial.")
Then
text[r]
#=> "Therapy extract straw and chitosan from...the growth of algae Microcystis spp."
We can simplify the construction of the regex (and avoid duplicate bits) as follows.
def construct_regex(min_nbr_words)
common_bits = /(?:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)/
/(?:#{common_bits}[,;:.!?]? +){#{min_nbr_words},}?#{common_bits}[.!?](?!\S)/
end
r = construct_regex(10)
#=> /(?:(?-mix:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)[,;:.!?]? +){10,}?(?-mix:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)[.!?](?!\S)/
This regex could be simplified if it were permitted to match nonsense words such as "ab2.3e%" or "2.3.2%". As presently defined, the regex will not match such words.
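To see how the minimum word count affects the result, the sketch below repeats the construct_regex method and runs it on the second sample paragraph, joined into a single line (an assumption, since the regex only allows spaces between words).

```ruby
def construct_regex(min_nbr_words)
  common_bits = /(?:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)/
  /(?:#{common_bits}[,;:.!?]? +){#{min_nbr_words},}?#{common_bits}[.!?](?!\S)/
end

# The sample paragraph as one line; line breaks in the post are treated as spaces.
text = [
  "Therapy extract straw and chitosan from shrimp shells alone accounted",
  "for 2, 4, 6, 8 and 10% found that the extract straw 8.2% is highly",
  "effective in inhibiting the growth of algae Microcystis spp. The",
  "number of cells and the amount of chlorophyll a was reduced during",
  "treatment. Both value decreased continuous until the end of the trial."
].join(" ")

short = text[construct_regex(10)] # at least 10 words, then the rest of the sentence
long  = text[construct_regex(40)] # at least 40 words, so a second sentence is included

short.end_with?("Microcystis spp.")  #=> true
long.end_with?("during treatment.")  #=> true
```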
Try this: (?:\S+[,.?!]?\s+){1,200}[\s\S]*?(\. |!|\?)
This matches up to N whitespace-separated words (here N = 200) and then extends lazily to the next sentence-ending punctuation mark, so the result ends on a complete sentence.
The word limit N should be given in the quantifier as {1,N}.