extracting data through regexps is returning nil - ruby

I'm trying to extract a pair of strings from a parsed PDF and I have this extract:
Number:731 / 13/06/2016 1823750212 10/06/2016\n\n\n\n Articolo
http://rubular.com/r/GRI6j4Byz3
My goal is to get out the 731 and 1823750212 values.
I tried something like text[/Number:(.*)Articolo/] for the first steps but it's returning nil while on rubular it somewhat matches.
Any tips?

If the format of the string is fixed (the dates and the long number), this will do the trick:
text.scan /\ANumber:(\d+).*?(\d{5,})/
#⇒ [[ "731", "1823750212" ]]
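As for why the original attempt returns nil: a likely cause, assuming the parsed text contains real newline characters, is that . does not match a newline unless the m flag is given (on Rubular the \n sequences may have been typed as literal backslash-n text, which would explain the match there). A minimal sketch, with the sample string standing in for the parsed PDF text:

```ruby
# Sample text standing in for the parsed PDF (assumed to hold real newlines).
text = "Number:731 / 13/06/2016 1823750212 10/06/2016\n\n\n\n Articolo"

# Without /m, "." stops at the first newline, so "Articolo" is never reached.
without_m = text[/Number:(.*)Articolo/]

# With /m, "." also matches newlines; the second argument selects group 1.
with_m = text[/Number:(.*)Articolo/m, 1]

# Capturing only the digits avoids the newline issue altogether.
first, second = text.scan(/\ANumber:(\d+).*?(\d{5,})/).first

p without_m  # nil
p first      # "731"
p second     # "1823750212"
```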

I have assumed that we do not know the length of either string (representations of non-negative integers) to be extracted, only that the first follows "Number:", which is at the beginning of the string, and the second is preceded and followed by at least one space.
r = /
(?<=\ANumber:) # match beginning of string followed by 'Number:' in a
# positive lookbehind
\d+ # match one or more digits
| # or
(?<=\s) # match a whitespace char in a positive lookbehind
\d+ # match one or more digits
(?=\s) # match a whitespace char in a positive lookahead
/x # free-spacing regex definition mode
str = "Number:731 / 13/06/2016 1823750212 10/06/2016\n\n\n\n Articolo"
str.scan(r)
#=> ["731", "1823750212"]
If there could be intervening spaces between the colon and "731", you could modify the regex as follows.
r = /
\A # match beginning of string
Number: # match string 'Number:'
\s* # match zero or more spaces
\K # forget everything matched so far
\d+ # match one or more digits
| # or
(?<=\s) # match a whitespace char in a positive lookbehind
\d+ # match one or more digits
(?=\s) # match a whitespace char in a positive lookahead
/x # free-spacing regex definition mode
str = "Number: 731 / 13/06/2016 1823750212 10/06/2016\n\n\n\n Articolo"
str.scan(r)
#=> ["731", "1823750212"]
Here \K must be used because Ruby does not support variable-length positive lookbehinds.
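If the lookarounds feel heavy, capture groups can do the same job. The following is only a sketch under the same assumptions about the input (the first number follows "Number:" at the start, the second is surrounded by whitespace); the group names first and second are made up for this example.

```ruby
str = "Number: 731 / 13/06/2016 1823750212 10/06/2016\n\n\n\n Articolo"

# Hypothetical group names, chosen for this example only.
pattern = /\ANumber:\s*(?<first>\d+).*?\s(?<second>\d+)\s/

m = str.match(pattern)
numbers = m ? [m[:first], m[:second]] : []

p numbers  # ["731", "1823750212"]
```

Here \s* is allowed to vary in length because it sits in the main pattern rather than inside a lookbehind.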

Related

Insert hyphen into number

I want to convert:
"890414.14.1422, 900515141092, 950616-12-5414"
to:
"890414-14-1422, 900515-14-1092, 950616-12-5414"
How can I achieve it?
I tried:
def format_ids(string)
string.gsub(/(\d{6})[.-](\d{2})[.-](\d{4})/, '\1-\2-\3')
end
format_ids("890414.14.1422, 900515141092, 950616-12-5414")
# => "890414-14-1422, 900515141092, 950616-12-5414"
You should make the delimiters in the input string optional:
- string.gsub(/(\d{6})[.-](\d{2})[.-](\d{4})/, '\1-\2-\3')
+ string.gsub(/(\d{6})[.-]?(\d{2})[.-]?(\d{4})/, '\1-\2-\3')
Note the question marks after the delimiters; they do the trick.
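Put together, the method from the question with the optional delimiters applied looks like this. It assumes every ID is exactly 6 + 2 + 4 digits, as in the sample input.

```ruby
# Same method as in the question, with each delimiter made optional via "?".
def format_ids(string)
  string.gsub(/(\d{6})[.-]?(\d{2})[.-]?(\d{4})/, '\1-\2-\3')
end

formatted = format_ids("890414.14.1422, 900515141092, 950616-12-5414")
p formatted  # "890414-14-1422, 900515-14-1092, 950616-12-5414"
```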
str = "890414.14.1422, 900515141092, 950616-12-5414"
r = /
( # begin capture group 1
\. # match a period
| # or
(?<=\d{6}) # match after 6 digits (positive lookbehind)
(?=\d{6}) # match before 6 digits (positive lookahead)
| # or
(?<=\d{8}) # match after 8 digits (positive lookbehind)
(?=\d{4}) # match before 4 digits (positive lookahead)
) # end capture group 1
/x # free-spacing regex definition mode
str.gsub(r,'-')
#=> "890414-14-1422, 900515-14-1092, 950616-12-5414"
This regular expression is conventionally (not free-spacing mode) written as follows:
/(\.|(?<=\d{6})(?=\d{6})|(?<=\d{8})(?=\d{4}))/
Note that (?<=\d{6}) and (?=\d{6}) together match a zero-width position between two consecutive characters, as do (?<=\d{8}) and (?=\d{4}).
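To see what a zero-width match means in practice, here is a small sketch using just one pair of lookarounds on a 12-digit ID taken from the sample input. The match consumes no characters; it only identifies a position.

```ruby
id = "900515141092"

# Position preceded by six digits and followed by six digits.
between = /(?<=\d{6})(?=\d{6})/

index      = id =~ between          # where the position is
matched    = id[between]            # what text was matched
hyphenated = id.gsub(between, "-")  # inserting at that position

p index       # 6
p matched     # ""
p hyphenated  # "900515-141092"
```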

How do I match something only if a character doesn't follow a pattern?

I'm using Ruby 2.4. How do I write a regular expression that matches a series of numbers, the plus sign and then any sequence that follows provided that sequence doesn't contain another number? For example, this would match per my rules
23+abcdef
as would this
1111111+ __++
But this would not
2+3
Neither would this
2+ L43
I tried this but was unsuccessful ...
/\d+[[:space:]]*(\+|plus).*([^\d]|$)/i.match(mystr)
r = /\A # match beginning of string
\d+ # match one or more digits
\+ # match plus sign
\D* # match zero or more characters other than a digit
\z # match end of string
/x # free-spacing regex definition mode
"23+abcdef".match?(r)
#=> true
"1111111+ __++".match?(r)
#=> true
"23 abcdef".match?(r)
#=> false
"2+3".match?(r)
#=> false
"2+ L43".match?(r)
#=> false
If at least one character that is not a digit is to follow '+', change \D* in the regex to \D+.
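The difference between the two variants only shows up when nothing follows the plus sign. A quick comparison, where "23+" is an extra sample added here for illustration:

```ruby
zero_or_more = /\A\d+\+\D*\z/  # allows nothing after "+"
one_or_more  = /\A\d+\+\D+\z/  # requires at least one non-digit after "+"

samples = ["23+abcdef", "1111111+ __++", "2+3", "2+ L43", "23+"]

results = samples.map do |s|
  [s, s.match?(zero_or_more), s.match?(one_or_more)]
end

results.each { |row| p row }
# ["23+abcdef", true, true]
# ["1111111+ __++", true, true]
# ["2+3", false, false]
# ["2+ L43", false, false]
# ["23+", true, false]
```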

Regex: Match all hyphens or underscores not at the beginning or the end of the string

I am writing some code that needs to convert a string to camel case. However, I want to allow any _ or - at the beginning of the code.
I have had success matching up an _ character using the regex here:
^(?!_)(\w+)_(\w+)(?<!_)$
when the inputs are:
pro_gamer #matched
#ignored
_proto
proto_
__proto
proto__
__proto__
#matched as nerd_godess_of, skyrim
nerd_godess_of_skyrim
I recursively apply my method on the first match if it looks like nerd_godess_of.
I am having trouble adding - matches to the same; I assumed that just adding a - to the mix like this would work:
^(?![_-])(\w+)[_-](\w+)(?<![_-])$
and it matches like this:
super-mario #matched
eslint-path #matched
eslint-global-path #NOT MATCHED.
I would like to understand why the regex fails to match the last case given that it worked correctly for the _.
The (almost) full set of test inputs can be found here
The fact that
^(?![_-])(\w+)[_-](\w+)(?<![_-])$
does not match the second hyphen in "eslint-global-path" is because of the anchor ^ which limits the match to be on the first hyphen only. This regex reads, "Match the beginning of the line, not followed by a hyphen or underscore, then match one or more word characters (including underscores), a hyphen or underscore, and then one or more word characters in a capture group. Lastly, do not match a hyphen or underscore at the end of the line."
The fact that an underscore (but not a hyphen) is a word (\w) character completely messes up the regex. In general, rather than using \w, you might want to use \p{Alpha} or \p{Alnum} (or POSIX [[:alpha:]] or [[:alnum:]]).
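The effect of \w on the two kinds of input can be checked directly. This sketch runs the regex from the question against one underscore case and the failing hyphen case:

```ruby
original = /^(?![_-])(\w+)[_-](\w+)(?<![_-])$/

# \w treats "_" as a word character but not "-".
underscore_is_word = "_".match?(/\w/)
hyphen_is_word     = "-".match?(/\w/)

# With underscores, the first (\w+) can swallow the extra "_" characters.
snake = "nerd_godess_of_skyrim".match(original)

# With hyphens, neither (\w+) can cross the second "-", so "$" is never reached.
kebab = "eslint-global-path".match(original)

p underscore_is_word  # true
p hyphen_is_word      # false
p snake.captures      # ["nerd_godess_of", "skyrim"]
p kebab               # nil
```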
Try this.
r = /
(?<= # begin a positive lookbehind
[^_-] # match a character other than an underscore or hyphen
) # end positive lookbehind
( # begin capture group 1
(?: # begin a non-capture group
-+ # match one or more hyphens
| # or
_+ # match one or more underscores
) # end non-capture group
[^_-] # match any character other than an underscore or hyphen
) # end capture group 1
/x # free-spacing regex definition mode
'_cats_have--nine_lives--'.gsub(r) { |s| s[-1].upcase }
#=> "_catsHaveNineLives--"
This regex is conventionally written as follows.
r = /(?<=[^_-])((?:-+|_+)[^_-])/
If all the letters are lower case one could alternatively write
'_cats_have--nine_lives--'.split(/(?<=[^_-])(?:_+|-+)(?=[^_-])/).
map(&:capitalize).join
#=> "_catsHaveNineLives--"
where
'_cats_have--nine_lives--'.split(/(?<=[^_-])(?:_+|-+)(?=[^_-])/)
#=> ["_cats", "have", "nine", "lives--"]
(?=[^_-]) is a positive lookahead that requires the characters on which the split is made to be followed by a character other than an underscore or hyphen.
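Wrapped in a method, the gsub approach can be run against the inputs from the question. The name camelize is just a label for this sketch, not an existing library method.

```ruby
# Hypothetical helper: upcases the character that follows a run of "-" or "_",
# but only when that run is not at the start or end of the string.
def camelize(str)
  str.gsub(/(?<=[^_-])((?:-+|_+)[^_-])/) { |s| s[-1].upcase }
end

inputs = %w[pro_gamer eslint-global-path nerd_godess_of_skyrim _proto proto_ __proto__]
outputs = inputs.map { |s| camelize(s) }

p outputs
# ["proGamer", "eslintGlobalPath", "nerdGodessOfSkyrim", "_proto", "proto_", "__proto__"]
```

No recursion is needed because gsub replaces every interior separator in a single pass.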
You can try the regex
^(?=[^-_])(\w+[-_]\w*)+(?=[^-_])\w$
See the demo here.
Switch _- to -_ so that - is not treated as a range op, as in a-z.

Ruby parsing and regex

Picked up Ruby recently and have been fiddling around with it. I wanted to learn how to use regex or other Ruby tricks to check for certain words, whitespace characters, valid format etc in a given text line.
Let's say I have an order list that looks strictly like this in this format:
cost: 50 items: book,lamp
One space after each colon, no space after each comma, no trailing whitespace at the end and stuff like that.
How can I check for errors in this format using Ruby? This for example should fail my checks:
cost: 60 items:shoes,football
My goal was to split the string by a " " and check to see if the first word was "cost:", if the second word was a number and so on but I realized that splitting on a " " doesn't help me check for extra whitespaces as it just eats it up. Also doesn't help me check for trailing whitespaces. How do I go about doing this?
You could use the following regular expression.
r = /
\A # match beginning of string
cost:\s # match "cost:" followed by a space
\d+\s # match > 0 digits followed by a space
items:\s # match "items:" followed by a space
[[:alpha:]]+ # match > 0 lowercase or uppercase letters
(?:,[[:alpha:]]+) # match a comma followed by > 0 lowercase or uppercase
# letters in a non-capture group (?: ... )
* # perform the match on non-capture group >= 0 times
\z # match the end of the string
/x # free-spacing regex definition mode
"cost: 50 items: book,lamp" =~ r #=> 0 (a match, beginning at index 0)
"cost: 50 items: book,lamp,table" =~ r #=> 0 (a match, beginning at index 0)
"cost: 60 items:shoes,football" =~ r #=> nil (no match)
The regex can of course be written in the normal manner:
r = /\Acost:\s\d+\sitems:\s[[:alpha:]]+(?:,[[:alpha:]]+)*\z/
or
r = /\Acost: \d+ items: [[:alpha:]]+(?:,[[:alpha:]]+)*\z/
though a whitespace character (\s) cannot be replaced by a literal space in the free-spacing mode definition (/x), because unescaped spaces in the pattern are ignored there.
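Since \s also matches tabs and newlines, the literal-space version is the stricter check. A sketch of it as a predicate (valid_order? is a name made up here), plus a free-spacing variant that uses [ ] to keep a literal space, since a space inside a character class is not ignored under /x:

```ruby
# Hypothetical helper name; uses literal spaces so tabs are rejected too.
def valid_order?(line)
  line.match?(/\Acost: \d+ items: [[:alpha:]]+(?:,[[:alpha:]]+)*\z/)
end

# Free-spacing equivalent: "[ ]" stands for exactly one literal space.
free_spacing = /
  \A cost: [ ] \d+ [ ] items: [ ]
  [[:alpha:]]+ (?:,[[:alpha:]]+)*
  \z
/x

p valid_order?("cost: 50 items: book,lamp")       # true
p valid_order?("cost: 60 items:shoes,football")   # false (no space after "items:")
p valid_order?("cost: 50 items: book,lamp ")      # false (trailing space)
p valid_order?("cost:  50 items: book")           # false (two spaces)
p "cost: 50 items: book,lamp".match?(free_spacing) # true
```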

Capitalize the first character after a dash

So I've got a string that's an improperly formatted name. Let's say, "Jean-paul Bertaud-alain".
I want to use a regex in Ruby to find the first character after every dash and make it uppercase. So, in this case, I want to apply a method that would yield: "Jean-Paul Bertaud-Alain".
Any help?
String#gsub can take a block argument, so this is as simple as:
str = "Jean-paul Bertaud-alain"
str.gsub(/-[a-z]/) {|s| s.upcase }
# => "Jean-Paul Bertaud-Alain"
Or, more succinctly:
str.gsub(/-[a-z]/, &:upcase)
Note that the regular expression /-[a-z]/ only matches letters in the a-z range, so it won't match e.g. à. In Ruby versions before 2.4, String#upcase did not change characters with diacritics anyway, because capitalization is language-dependent (e.g. i is capitalized differently in Turkish than in English); since 2.4 it applies Unicode case mapping by default. Read this answer for more information: https://stackoverflow.com/a/4418681
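On Ruby 2.4 or later, where String#upcase handles non-ASCII letters, the character class can be widened to \p{Lower}. This is only a sketch: the name below is invented for the example, and it assumes a UTF-8 string and default (non-Turkic) case mapping.

```ruby
# Made-up name containing an accented letter after the first dash.
name = "Jean-élise Bertaud-alain"

ascii_only    = name.gsub(/-[a-z]/, &:upcase)       # skips "-é"
unicode_aware = name.gsub(/-\p{Lower}/, &:upcase)   # also handles "-é"

p ascii_only     # "Jean-élise Bertaud-Alain"
p unicode_aware  # "Jean-Élise Bertaud-Alain"
```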
"Jean-paul Bertaud-alain".gsub(/(?<=-)\w/, &:upcase)
# => "Jean-Paul Bertaud-Alain"
I suggest you make the test more demanding by requiring that the letter to be upcased 1) be preceded by a capitalized word followed by a hyphen and 2) be followed by lowercase letters followed by a word break.
r = /
\b # Match a word break
[A-Z] # Match an upper-case letter
[a-z]+ # Match >= 1 lower-case letters
\- # Match hyphen
\K # Forget everything matched so far
[a-z] # Match a lower-case letter
(?= # Begin a positive lookahead
[a-z]+ # Match >= 1 lower-case letters
\b # Match a word break
) # End positive lookahead
/x # Free-spacing regex definition mode
"Jean-paul Bertaud-alain".gsub(r) { |s| s.upcase }
#=> "Jean-Paul Bertaud-Alain"
"Jean de-paul Bertaud-alainM".gsub(r) { |s| s.upcase }
#=> "Jean de-paul Bertaud-alainM"
