I want to convert:
"890414.14.1422, 900515141092, 950616-12-5414"
to:
"890414-14-1422, 900515-14-1092, 950616-12-5414"
How can I achieve it?
I tried:
def format_ids(string)
string.gsub(/(\d{6})[.-](\d{2})[.-](\d{4})/, '\1-\2-\3')
end
format_ids("890414.14.1422, 900515141092, 950616-12-5414")
# => "890414-14-1422, 900515141092, 950616-12-5414"
You should make the delimiters in the input string optional:
- string.gsub(/(\d{6})[.-](\d{2})[.-](\d{4})/, '\1-\2-\3')
+ string.gsub(/(\d{6})[.-]?(\d{2})[.-]?(\d{4})/, '\1-\2-\3')
Note the question marks after the delimiters; they make each delimiter optional.
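For reference, a minimal sketch of the method from the question with only the two `?` quantifiers added (nothing else is changed):

```ruby
# Same method as in the question; each delimiter is now optional ([.-]?).
def format_ids(string)
  string.gsub(/(\d{6})[.-]?(\d{2})[.-]?(\d{4})/, '\1-\2-\3')
end

puts format_ids("890414.14.1422, 900515141092, 950616-12-5414")
# prints: 890414-14-1422, 900515-14-1092, 950616-12-5414
```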
str = "890414.14.1422, 900515141092, 950616-12-5414"
r = /
( # begin capture group 1
\. # match a period
| # or
(?<=\d{6}) # match after 6 digits (positive lookbehind)
(?=\d{6}) # match before 6 digits (positive lookahead)
| # or
(?<=\d{8}) # match after 8 digits (positive lookbehind)
(?=\d{4}) # match before 4 digits (positive lookahead)
) # end capture group 1
/x # free-spacing regex definition mode
str.gsub(r,'-')
#=> "890414-14-1422, 900515-14-1092, 950616-12-5414"
This regular expression is conventionally (not free-spacing mode) written as follows:
/(\.|(?<=\d{6})(?=\d{6})|(?<=\d{8})(?=\d{4}))/
Note that (?<=\d{6}) and (?=\d{6}) together match a zero-width position between two consecutive characters, as do (?<=\d{8}) and (?=\d{4}).
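A small sketch of what a zero-width match looks like in practice, using one of the IDs from the question. The match is an empty string located at an offset, which is why gsub can insert a hyphen there without consuming any digits:

```ruby
id = "900515141092"
m = id.match(/(?<=\d{6})(?=\d{6})/)

p m[0]        # => ""  (nothing is consumed)
p m.begin(0)  # => 6   (the position between the 6th and 7th digits)
p id.sub(/(?<=\d{6})(?=\d{6})/, '-')  # => "900515-141092"
```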
I have a large file, and I want to be able to check if a word is present twice.
puts "Enter a word: "
$word = gets.chomp
if File.read('worldcountry.txt') # do something if the word entered is present twice...
How can I check whether the file worldcountry.txt contains the $word I entered twice?
I found what I needed in this question: count-the-frequency-of-a-given-word-in-text-file-in-ruby
It is in Gerry's post, with this code:
word_count = 0
my_word = "input"
File.open("texte.txt", "r") do |f|
f.each_line do |line|
line.split(' ').each do |word|
word_count += 1 if word == my_word
end
end
end
puts "\n" + word_count.to_s
Thanks, I will pay more attention next time.
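A more compact sketch of the same line-by-line count, wrapped in a method so the "present twice" check from the question becomes a simple comparison. The method name word_count and the temporary file contents are assumptions for illustration only:

```ruby
require 'tempfile'

# Hypothetical helper: counts whitespace-separated tokens equal to +word+,
# reading the file one line at a time.
def word_count(path, word)
  File.foreach(path).sum { |line| line.split.count(word) }
end

# Made-up sample data standing in for worldcountry.txt.
Tempfile.create('countries') do |f|
  f.write("France Spain\nItaly France\n")
  f.flush
  puts word_count(f.path, 'France') == 2   # prints: true
end
```

Like the original loop, this compares whole tokens, so "France," with a trailing comma would not be counted.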
If the file is not overly large, it can be gulped into a string. Suppose:
str = File.read('cat')
#=> "There was a dog 'Henry' who\nwas pals with a dog 'Buck' and\na dog 'Sal'."
puts str
There was a dog 'Henry' who
was pals with a dog 'Buck' and
a dog 'Sal'.
Suppose the given word is 'dog'.
Confirm the file contains at least two instances of the given word
One can attempt to match the regular expression
r1 = /\bdog\b.*\bdog\b/m
str.match?(r1)
#=> true
Confirm the file contains exactly two instances of the given word
Using a regular expression to determine whether the file contains exactly two instances of the given word is somewhat more complex. Let
r2 = /\A(?:(?:(?!\bdog\b).)*\bdog\b){2}(?!.*\bdog\b)/m
str.match?(r2)
#=> false
The two regular expressions can be written in free-spacing mode to make them self-documenting.
r1 = /
\bdog\b # match 'dog' surrounded by word breaks
.* # match zero or more characters
\bdog\b # match 'dog' surrounded by word breaks
/m # cause . to match newlines
r2 = /
\A # match beginning of string
(?: # begin non-capture group
(?: # begin non-capture group
(?! # begin negative lookahead
\bdog\b # match 'dog' surrounded by word breaks
) # end negative lookahead
. # match one character
) # end non-capture group
* # execute preceding non-capture group zero or more times
\bdog\b # match 'dog' surrounded by word breaks
) # end non-capture group
{2} # execute preceding non-capture group twice
(?! # begin negative lookahead
.* # match zero or more characters
\bdog\b # match 'dog' surrounded by word breaks
) # end negative lookahead
/xm # cause . to match newlines and invoke free-spacing mode
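As an alternative to the lookahead-based pattern, a sketch that simply counts matches with String#scan and compares the count. The helper name occurrences is an assumption; Regexp.escape is used so that a word typed by the user is treated literally:

```ruby
# Hypothetical helper: number of whole-word occurrences of +word+ in +str+.
def occurrences(str, word)
  str.scan(/\b#{Regexp.escape(word)}\b/).size
end

str = "There was a dog 'Henry' who\nwas pals with a dog 'Buck' and\na dog 'Sal'."
n = occurrences(str, 'dog')

puts n        # prints: 3
puts n >= 2   # prints: true  (at least two instances)
puts n == 2   # prints: false (not exactly two instances)
```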
I'm trying to use a Ruby regex to match word combinations like the ones below.
In the example below I only need cases 1-4; I marked them in caps for easy testing. The word in the middle (dbo, bcd) could be anything, or nothing as in case #3. I have trouble getting the double-period case #3 to work. It would also be good to match standalone SALES as a word, but that is probably too much for one regex? Thanks, all.
This is my script, which partially works; I need to add alpha..SALES
s = '1 alpha.dbo.SALES 2 alpha.bcd.SALES 3 alpha..SALES 4 SALES
bad cases 5x alpha.saleS 6x saleSXX'
regex = /alpha+\.+[a-z]+\.?sales/ix
puts 'R: ' + s.scan(regex).to_s
##R: ["alpha.dbo.SALES", "alpha.bcd.SALES"]
s = '1 alpha.dbo.SALES 2 alpha.bcd.SALES 3 alpha..SALES 4 SALES
bad cases 5x alpha.saleS 6x saleSXX 7x alpha.abc.SALES.etc'
regex = /(?<=^|\s)(?:alpha\.[a-z]*\.)?(?:sales)(?=\s|$)/i
puts 'R: ' + s.scan(regex).to_s
Output:
R: ["alpha.dbo.SALES", "alpha.bcd.SALES", "alpha..SALES", "SALES"]
r = /
(?<=\d[ ]) # match a digit followed by a space in a positive lookbehind
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
\. # match a period
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
\. # match a period
| # or
\. # match a period
) # end non-capture group
)? # end non-capture group and optionally match it
SALES # match string
(?![.\p{Alpha}]) # do not match a period or letter (negative lookahead)
/x # free-spacing regex definition mode.
s.scan(r)
#=> ["alpha.dbo.SALES", "alpha.bcd.SALES", "alpha..SALES", "SALES"]
This regular expression is customarily written as follows.
r = /(?<=\d )(?:\p{Alpha}+\.(?:\p{Alpha}+\.|\.))?SALES(?![.\p{Alpha}])/
In free-spacing mode the space must be put in a character class ([ ]); else it would be stripped out.
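A short sketch of that behaviour with two throwaway patterns (the variable names stripped and kept are just for illustration):

```ruby
stripped = /a b/x    # the space is ignored, so this is effectively /ab/
kept     = /a[ ]b/x  # the space inside the character class is preserved

p stripped.match?("a b")  # => false
p stripped.match?("ab")   # => true
p kept.match?("a b")      # => true
```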
I have the following regex which will capture the first N words and finish at the next period, exclamation point or question mark. I need to get chunks of texts that vary in the number of words but I want complete sentences.
regex = (?:\w+[.?!]?\s+){10}(?:\w+,?\s+)*?\w+[.?!]
It works with the following text:
Therapy extract straw and chitosan from shrimp shells alone
accounted for 2, 4, 6, 8 and 10% found that the extract straw 8% is
highly effective in inhibiting the growth of algae Microcystis spp.
The number of cells and the amount of chlorophyll a was reduced during
treatment. Both value decreased continuous until the end of the trial.
https://regex101.com/r/ardIQ7/5
However it won't work with the following text:
Therapy extract straw and chitosan from shrimp shells alone accounted
for 2, 4, 6, 8 and 10% found that the extract straw 8.2% is highly
effective in inhibiting the growth of algae Microcystis spp. The
number of cells and the amount of chlorophyll a was reduced during
treatment. Both value decreased continuous until the end of the trial.
That is because of the digits (8.2%) with decimals and %.
I have been trying to figure out how to also capture these items but need some assistance to point me in the right direction. I don't just want to capture the first sentence. I want to capture N words which may include several sentences and returns complete sentences.
r = /
(?: # begin a non-capture group
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
| # or
\-? # optionally match a minus sign
(?: # begin non-capture group
\d+ # match one or more digits
| # or
\d+ # match one or more digits
\. # match a decimal point
\d+ # match one or more digits
) # end non-capture group
%? # optionally match a percentage character
) # end non-capture group
[,;:.!?]? # optionally ('?' following ']') match a punctuation char
[ ]+ # match one or more spaces
) # end non-capture group
{9,}? # execute the preceding non-capture group at least 9 times, lazily ('?')
(?: # begin a non-capture group
\p{Alpha}+ # match one or more letters
| # or
\-? # optionally match a minus sign
(?: # begin non-capture group
\d+ # match one or more digits
| # or
\d+ # match one or more digits
\. # match a decimal point
\d+ # match one or more digits
) # end non-capture group
%? # optionally match a percentage character
) # end non-capture group
[.!?] # match one of the three punctuation characters
(?!\S) # negative look-ahead: do not match a non-whitespace char
/x # free-spacing regex definition mode
Let text equal the paragraph you wish to examine ("Therapy extract straw...end of the trial.")
Then
text[r]
#=> "Therapy extract straw and chitosan from...the growth of algae Microcystis spp."
We can simplify the construction of the regex (and avoid duplicate bits) as follows.
def construct_regex(min_nbr_words)
common_bits = /(?:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)/
/(?:#{common_bits}[,;:.!?]? +){#{min_nbr_words},}?#{common_bits}[.!?](?!\S)/
end
r = construct_regex(10)
#=> /(?:(?-mix:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)[,;:.!?]? +){10,}?(?-mix:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)[.!?](?!\S)/
This regex could be simplified if it were permitted to match nonsense words such as "ab2.3e%" or "2.3.2%". As presently defined, the regex will not match such words.
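A usage sketch for construct_regex on a short, made-up sample (the sample text and the small word counts are assumptions chosen so the result is easy to check by eye). The minimum counts the words before the final one, and the match always ends at a sentence terminator:

```ruby
# Copied from the answer above so the example is self-contained.
def construct_regex(min_nbr_words)
  common_bits = /(?:\p{Alpha}+|\-?(?:\d+|\d+\.\d+)%?)/
  /(?:#{common_bits}[,;:.!?]? +){#{min_nbr_words},}?#{common_bits}[.!?](?!\S)/
end

sample = "It rose 8.2% today. It fell later. Done."

puts sample[construct_regex(3)]  # prints: It rose 8.2% today.
puts sample[construct_regex(4)]  # prints: It rose 8.2% today. It fell later.
```

With a minimum of 4 the first sentence is too short, so the match extends lazily to the end of the second sentence.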
Try this: (?:\S+[,.?!]?\s+){1,200}[\s\S]*?(\. |!|\?)
This will match up to N words (200 in the pattern above).
If the Nth word does not end a sentence, the match continues to the next sentence terminator, and falls back to an earlier one only if no terminator follows. Write N as the upper bound of the quantifier, {1,N}.
I've been having difficulty trying to figure out how to go about solving this issue. I have 2 kinds of URLs in which I need to be able to update/increment the number value for the page.
Url 1:
forum-351-page-2.html
In the above, I would like to modify this url for n pages. So I'd like to generate new urls with a given range of say page-1 to page-30. But that's all I'd like to change. page-n.html
Url 2:
href="forumdisplay.php?fid=115&page=3
The second URL is different, but I feel it's easier to visit.
R = /
(?: # begin non-capture group
(?<=-page-) # match string in a positive lookbehind
\d+ # match 1 or more digits
(?=\.html) # match period followed by 'html' in a positive lookahead
) # close non-capture group
| # or
(?: # begin non-capture group
(?<=&page=) # match string in a positive lookbehind
\d+ # match 1 or more digits
\z # match end of string
) # close non-capture group
/x # free-spacing regex definition mode
def update(str, val)
str.sub(R, val.to_s)
end
update("forum-351-page-2.html", 4)
#=> "forum-351-page-4.html"
update("forumdisplay.php?fid=115&page=3", "4")
#=> "forumdisplay.php?fid=115&page=4"
For the first url
url1 = "forum-351-page-2.html"
(1..30).each do |x|
puts url1.sub(/page-\d*/, "page-#{x}")
end
This will output
forum-351-page-1.html
forum-351-page-2.html
forum-351-page-3.html
...
forum-351-page-28.html
forum-351-page-29.html
forum-351-page-30.html
You can do the same thing for the second URL (held in url2 here):
url2.sub(/page=\d*$/, "page=#{x}")
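A self-contained sketch of the second case, collecting the generated URLs into an array instead of printing them inside the loop. The variable names url2 and pages are assumptions, and the range is shortened to 1..3 to keep the output brief:

```ruby
url2 = "forumdisplay.php?fid=115&page=3"

# Replace the trailing page number with each value in the range.
pages = (1..3).map { |x| url2.sub(/page=\d*$/, "page=#{x}") }

puts pages
# prints:
# forumdisplay.php?fid=115&page=1
# forumdisplay.php?fid=115&page=2
# forumdisplay.php?fid=115&page=3
```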
I'm trying to extract a pair of strings from a parsed PDF and I have this extract:
Number:731 / 13/06/2016 1823750212 10/06/2016\n\n\n\n Articolo
http://rubular.com/r/GRI6j4Byz3
My goal is to get out the 731 and 1823750212 values.
I tried something like text[/Number:(.*)Articolo/] for the first steps but it's returning nil while on rubular it somewhat matches.
Any tips?
If the format of the string is fixed (dates and the long number), this will do the trick:
text.scan(/\ANumber:(\d+).*?(\d{5,})/)
#⇒ [[ "731", "1823750212" ]]
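A sketch of why the attempt in the question likely returns nil, assuming text is exactly the extract shown there: in Ruby, `.` does not match a newline unless the /m flag is given, so `.*` cannot reach "Articolo". The variable names number and code are illustrative only:

```ruby
text = "Number:731 / 13/06/2016 1823750212 10/06/2016\n\n\n\n Articolo"

p text[/Number:(.*)Articolo/]      # => nil
p text[/Number:(.*)Articolo/m, 1]  # => "731 / 13/06/2016 1823750212 10/06/2016\n\n\n\n "

# Pulling both values out with the pattern above and MatchData#captures.
number, code = text.match(/\ANumber:(\d+).*?(\d{5,})/).captures
p number  # => "731"
p code    # => "1823750212"
```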
I have assumed that we do not know the length of either string (representations of non-negative integers) to be extracted, only that the first follows "Number:", which is at the beginning of the string, and the second is preceded and followed by at least one space.
r = /
(?<=\ANumber:) # match beginning of string followed by 'Number:' in a
# positive lookbehind
\d+ # match one or more digits
| # or
(?<=\s) # match a whitespace char in a positive lookbehind
\d+ # match one or more digits
(?=\s) # match a whitespace char in a positive lookahead
/x # free-spacing regex definition mode
str = "Number:731 / 13/06/2016 1823750212 10/06/2016\n\n\n\n Articolo"
str.scan(r)
#=> ["731", "1823750212"]
If there could be intervening spaces between the colon and "731", you could modify the regex as follows.
r = /
\A # match beginning of string
Number: # match string 'Number:'
\s* # match zero or more spaces
\K # forget everything matched so far
\d+ # match one or more digits
| # or
(?<=\s) # match a whitespace char in a positive lookbehind
\d+ # match one or more digits
(?=\s) # match a whitespace char in a positive lookahead
/x # free-spacing regex definition mode
str = "Number: 731 / 13/06/2016 1823750212 10/06/2016\n\n\n\n Articolo"
str.scan(r)
#=> ["731", "1823750212"]
Here \K must be used because Ruby does not support variable-length positive lookbehinds.
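A minimal sketch of \K on its own, next to the capture-group form that gives the same result. The sample string is made up, with extra spaces after the colon:

```ruby
str = "Number:   731 / 13/06/2016"

# \K discards "Number:" and the spaces from the reported match.
p str[/\ANumber:\s*\K\d+/]     # => "731"

# Equivalent result using a capture group instead of \K.
p str[/\ANumber:\s*(\d+)/, 1]  # => "731"
```

The \K form is convenient with scan and gsub, where the whole match (rather than a group) is what gets returned or replaced.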