Remove Certain Alphanumeric Characters from a String in Ruby - ruby

I have to validate a string based on the first alphanumeric character of the string. Certain characters can be part of the string, but if they are at the beginning they have to be ignored.
For example:
--- BATest- 1 --
should be:
BATest-1
How do I remove dashes from the beginning and end but not from the middle?
To add to my question: can the first alphanumeric character decide whether the following characters are to be removed or not?
That is, if it is A, nothing should be removed and a validation error should be raised; if it is B, the string should be stripped as described above.

r = /
    --+  # match at least two hyphens
    |    # or
    \s   # match a whitespace character
    /x   # free-spacing regex definition mode
'--- BATest- 1 --'.gsub r, ""
#=> "BATest-1"

You asked to remove the dashes from the beginning and the end:
"--- BATest- 1 --".gsub(/^-+|-+$|\s/, "")
# => "BATest-1"
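Neither answer above covers the follow-up about letting the first alphanumeric character decide what happens. The following is only a sketch of one way to read that requirement: the method name normalize is hypothetical, and the A/B rule is taken literally from the question's example (A raises a validation error, B strips the string).

```ruby
# Hypothetical helper; the A/B rule mirrors the example in the question.
def normalize(str)
  first = str[/[[:alnum:]]/] # first alphanumeric character, or nil if none

  case first
  when "A"
    # Nothing is removed; the string is rejected instead.
    raise ArgumentError, "validation failed for #{str.inspect}"
  when "B"
    # Strip leading/trailing hyphens and all whitespace, keep inner hyphens.
    str.gsub(/\A-+|-+\z|\s/, "")
  else
    str # any other rule is left undefined by the question
  end
end

normalize("--- BATest- 1 --")
#=> "BATest-1"
```

Using \A and \z instead of ^ and $ keeps the anchors tied to the whole string even if the input contains newlines.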

Related

How to extract substring between two characters/substrings

I have a string:
string1 = "my name is fname.lname and i live in xyz. my lname is not common"
I want to extract a substring from string1 that is anything between the first empty space " " and ".lname". In the case above, the answer should be "fname.lname".
string1[/(?<= ).*?(?=\.lname\b)/]
#=> "name is fname"
(?<= ) is a positive lookbehind that requires the first character matched be immediately preceded by a space, but that space is not part of the match.
(?=\.lname\b) is a positive lookahead that requires the last character matched be immediately followed by the string ".lname" (see note 1), which is itself followed by a word break (\b), but that string is not part of the match. That ensures, for example, that ".lnamespace" is not matched. If that should be matched, remove \b.
.*? matches zero or more characters (.*), non-greedily (?). (Matches are greedy by default.) The non-greedy qualifier has the following effect:
"my name is fname.lname and fname.lname"[/(?<= ).*(?=\.lname\b)/]
#=> "name is fname.lname and fname"
"my name is fname.lname and fname.lname"[/(?<= ).*?(?=\.lname\b)/]
#=> "name is fname"
In other words, the non-greedy (greedy) match matches the first (last) occurrence of ".lname" in the string.
This could alternatively be written with a capture group and no lookarounds:
string1[/ (.*?)\.lname\b/, 1]
#=> "name is fname"
This regular expression reads, "match a space followed by zero or more characters, saved in capture group 1, followed by the string ".lname", followed by a word break". This uses the form of String#[] that takes two arguments, the second being a reference to a capture group.
Yet another way follows.
string1[(string1 =~ / /)+1..(string1 =~ /\.lname\b/)-1]
#=> "name is fname"
Note 1: The period in ".lname" must be escaped because an unescaped period in a regular expression (except in a character class) matches any character.
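If the terminating string is not fixed, the capture-group form can be wrapped in a small method. This is a sketch only: the method name between_first_space_and is made up for illustration, and it assumes the terminator ends in a word character so that \b still makes sense. Regexp.escape takes care of the escaping described in the note above.

```ruby
# Hypothetical helper: returns the text between the first space and the first
# occurrence of `suffix` that is followed by a word break, or nil if none.
def between_first_space_and(str, suffix)
  # Regexp.escape turns ".lname" into "\.lname", so the period is literal.
  str[/ (.*?)#{Regexp.escape(suffix)}\b/, 1]
end

string1 = "my name is fname.lname and i live in xyz. my lname is not common"
between_first_space_and(string1, ".lname")
#=> "name is fname"
```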

ignore a specific \n character while still enabling the m flag

I want to match characters across multiple lines so I enabled the m flag. However, I do not want to match a specific \n. Instead I want to match a space \s only. But it seems like the newline is matching spaces too:
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+\s.+,.+,.+\d+)/m
=> 0
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+[ ].+,.+,.+\d+)/m
=> 3
Even when I try to explicitly exclude the newline:
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+[^\n].+,.+,.+\d+)/m
=> 0
Why is the newline matching a space character? And what can I do to ensure that it does not and still matches characters across multiple lines everywhere else?
The /\s(\d+[^\n].+,.+,.+\d+)/m pattern matches " 41\n6332 Hardin Rd, Bensalem, PA\n 19020" because of backtracking. After \d+ has matched 41, [^\n] cannot match the newline that follows, so the regex engine steps back: \d+ gives up the 1 and matches only 4, [^\n] then matches 1 (which is not a newline), and matching continues.
You may anchor the search at the start of the string and prevent backtracking with a possessive quantifier, also implementing the negative check with a lookahead:
/\A\s*(\d++(?!\n).+,.+,.+\d)/m
Details
\A - start of string
\s* - zero or more whitespace characters
(\d++(?!\n).+,.+,.+\d) - capturing group 1:
    \d++(?!\n) - one or more digits, matched possessively (++), not followed by a newline; (?!\n) is a negative lookahead that fails the match if a newline is immediately to the right of the current position
    .+,.+, - two occurrences of one or more characters (as many as possible), each followed by a comma
    .+\d - one or more characters (as many as possible) followed by a digit
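To see what the possessive quantifier changes, the same pattern can be compared with an ordinary greedy \d+. This is a small sketch using the sample string from the question; the variable names are only for illustration. Note that, because of \A, the possessive version rejects the sample string outright rather than finding a match on its second line.

```ruby
input = " 41\n6332 Hardin Rd, Bensalem, PA\n 19020"

backtracking = /\A\s*(\d+(?!\n).+,.+,.+\d)/m   # ordinary greedy \d+
possessive   = /\A\s*(\d++(?!\n).+,.+,.+\d)/m  # possessive \d++

# \d+ backs off from "41" to "4", so the lookahead sees "1" and succeeds.
backtracking.match?(input)
#=> true

# \d++ keeps "41", the lookahead sees "\n", and the match fails.
possessive.match?(input)
#=> false

# Without a newline right after the leading digits, the pattern matches.
possessive.match("6332 Hardin Rd, Bensalem, PA\n 19020")[1]
#=> "6332 Hardin Rd, Bensalem, PA\n 19020"
```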

Regex matching chars around text

I have a string with chars inside and I would like to match only the chars around a string.
"This is a [1]test[/1] string. And [2]test[/2]"
Rubular http://rubular.com/r/f2Xwe3zPzo
Currently, the code in the link matches the text inside the special chars, how can I change it?
Update
To clarify my question. It should only match if the opening and closing has the same number.
"[2]first[/2] [1]second[/2]"
In the code above, only first should match and not second. The text inside the special chars (first), should be ignored.
Try this:
(\[[0-9]\]).+?(\[\/[0-9]\])
Update
Since you want to remove the 'special' characters, try this instead:
foo = "This is a [1]test[/1] string. And [2]test[/2]"
foo.gsub /\[\/?\d\]/, ""
# => "This is a test string. And test"
Update, Part II
You only want to remove the 'special' characters when the surrounding tags match, so what about this:
foo = "This is a [1]test[/1] string. And [2]test[/2], but not [3]test[/2]"
foo.gsub /(?:\[(?<number>\d)\])(?<content>.+?)(?:\[\/\k<number>\])/, '\k<content>'
# => "This is a test string. And test, but not [3]test[/2]"
\[([0-9])\].+?\[\/\1\]
([0-9]) is a capture group because it is wrapped in parentheses. \1 is a backreference to whatever that group matched. If you had more than one capture group, you could reference them as well with \2, \3, etc.
You can also use a named capture, rather than \1 to make it a little less cryptic. As in: \[(?<number>[0-9])\].+?\[\/\k<number>\]
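As a quick check of the named-backreference form, here it is applied to the mismatched-tag example from the question. The variable name tag is only for illustration.

```ruby
tag = /\[(?<number>[0-9])\].+?\[\/\k<number>\]/

# The opening and closing numbers agree, so the whole tagged span matches.
"[2]first[/2] [1]second[/2]"[tag]
#=> "[2]first[/2]"

# [1] is never closed by [/1], so nothing matches.
"[1]second[/2]"[tag]
#=> nil
```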
Here's a way to do it that uses the form of String#gsub that takes a block. The idea is to pull strings such as "[1]test[/1]" into the block, and there remove the unwanted bits.
str = "This is a [1]test[/1] string. And [2]test[/2], plus [3]test[/99]"
r = /
    \[      # match a left bracket
    (\d+)   # capture one or more digits in capture group 1
    \]      # match a right bracket
    .+?     # match one or more characters lazily
    \[\/    # match a left bracket and forward slash
    \1      # match the contents of capture group 1
    \]      # match a right bracket
/x
str.gsub(r) { |s| s[/(?<=\]).*?(?=\[)/] }
#=> "This is a test string. And test, plus [3]test[/99]"
Aside: When I first heard of named capture groups, they seemed like a great idea, but now I wonder if they really make regexes easier to read than \1, \2....

Ruby REGEX for letters and numbers or letters followed by period, letters and numbers

I am trying to construct a Ruby REGEX that will only allow the following:
some string (read letter only characters)
some string followed by numbers
some string followed by a period and another string
some string followed by a period and another string followed by numbers
period is only allowed if another string follows it
no other periods are allowed afterwards
numbers may only be at the very end
I have got \A[[^0-9.]a-z]*([0-9]*|((.)([[^0-9]a-z]*)[0-9]*))\z but I can't get what I need. This allows:
test.
test..
test.123
What is the correct REGEX? If someone could explain what I am doing wrong to help me understand for future that would be great too.
Edit: update requirements to be more descriptive
So I'm guessing you want identifiers separated by periods (.).
By identifier I mean:
a string consisting of alphanumeric characters
that does not start with a number
and is at least one character long.
Written out as a grammar, it would look something like this:
EXPR := IDENT "." EXPR | IDENT
IDENT := [A-Z]\w*
And the regex for this would be the following:
/\A[A-Z]\w*(\.[A-Z]\w*)*\Z/i
Note: Because of how \w behaves, this pattern also accepts underscores (_) after the first character (i.e. test_123 will also pass).
EDIT to reflect update of question
So the grammar you want is actually like this:
EXPR := IDENT [0-9]*
IDENT := STR | STR "." STR
STR := [A-Z]+
And the regexp then is this:
/\A[A-Z]+(\.[A-Z]+)?[0-9]*\z/i
The explanation is as follows:
/           # start Regexp
\A          # start of string
[A-Z]+      # "some string"
(
  \.        # followed by a period
  [A-Z]+    # and another string
)?          # period + another string is optional
[0-9]*      # optional digits at the end
\z          # end of string
/i          # this regexp is case insensitive.
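A quick way to check the pattern against the requirements is to run it over inputs that should and should not pass. The sample strings below are assumptions chosen to cover each rule in the question, including the three inputs the asker's own regex wrongly accepted.

```ruby
pattern = /\A[A-Z]+(\.[A-Z]+)?[0-9]*\z/i

accepted = %w[test test123 test.abc test.abc123]
rejected = %w[test. test.. test.123 test.abc.def test1a 123]

accepted.all?  { |s| pattern.match?(s) }
#=> true
rejected.none? { |s| pattern.match?(s) }
#=> true
```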
You can try
^[a-z]+\.?[a-z]+[0-9]*$
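Note: use \A and \z instead of ^ and $ to anchor at the start and end of the string rather than of a line.
The period must be escaped (\.), because an unescaped . matches any single character.
Also note that, as written, this pattern requires at least two letters, because [a-z]+ appears twice.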
Pattern explanation:
^         the beginning of the line
[a-z]+    any character of: 'a' to 'z' (1 or more times)
\.?       '.' (optional)
[a-z]+    any character of: 'a' to 'z' (1 or more times)
[0-9]*    any character of: '0' to '9' (0 or more times)
$         the end of the line
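The difference between line anchors and string anchors matters for validation when the input may contain a newline. A small sketch, with a made-up two-line input:

```ruby
line_anchored   = /^[a-z]+\.?[a-z]+[0-9]*$/
string_anchored = /\A[a-z]+\.?[a-z]+[0-9]*\z/

input = "test.abc1\n<script>" # hypothetical input with a second line

# ^ and $ are satisfied by the first line alone.
line_anchored.match?(input)
#=> true

# \A and \z require the whole string to fit the pattern.
string_anchored.match?(input)
#=> false
```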

how to remove leading and trailing non-alphabetic characters in ruby

I want to remove any leading and trailing non-alphabetic characters from my string.
For example, given ":----- pt-br:-", I want "pt-br".
Thanks
result = subject.gsub(/\A[\d_\W]+|[\d_\W]+\Z/, '')
will remove non-letters from the start and end of the string.
\A and \Z anchor the regex at the start/end of the string (^/$ would also match after/before a newline which is probably not what you want - but that might not matter in this case);
[\d_\W]+ matches one or more digits, the underscore or anything else that is not an alphanumeric character, leaving only letters.
| is the alternation operator.
In Ruby 1.9.1:
":----- pt-br:-".partition( /[a-zA-Z](...)[a-zA-Z]/ )[1]
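partition searches for the pattern in the string and returns a three-element array: the part before the match, the match itself, and the part after it, so [1] is the match. Note that this pattern, as written, matches exactly five characters (a letter, any three characters, a letter), so it fits "pt-br" but not strings of other lengths.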
result = subject.gsub(/^[^a-zA-Z]+/, '').gsub(/[^a-zA-Z]+$/, '')
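The two chained calls can also be combined into one gsub anchored to the whole string. This is only a sketch: the method name strip_non_letters is hypothetical, and it treats only ASCII a-z and A-Z as letters, as the answers above do.

```ruby
# Hypothetical helper: removes leading and trailing characters that are not
# ASCII letters, leaving anything between the first and last letter intact.
def strip_non_letters(str)
  str.gsub(/\A[^a-zA-Z]+|[^a-zA-Z]+\z/, "")
end

strip_non_letters(":----- pt-br:-")
#=> "pt-br"
strip_non_letters("123abc-def_456")
#=> "abc-def"
strip_non_letters("12345")
#=> ""
```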
