Match any character, including special characters using the Match function - go

I have some relatively weird file names,
but the glob function doesn't seem to pick them up using the usual wildcards:
fmt.Println(filepath.Match("/home/catch/*.xml", "/home/catch/{foo/x/y}.xml"))
So I want to match XML files in the catch folder, and they might have special characters in their names, like {path1/path2}.xml
Sadly, the * wildcard won't match that, since I assume slashes (and maybe curly braces) are treated as separator characters?

I'm using Linux, and with the pattern below Match returns true. So it is possible to match files like {path1/path2}.xml, although such names are not common.
package main
import (
"fmt"
"path/filepath"
)
func main() {
fmt.Println(filepath.Match("/home/catch/*\\/*", "/home/catch/{path1/path2}.xml"))
}
Output
true <nil>
But on Windows, escaping is disabled. Instead, \\ is treated as a path separator, so this pattern will not work there:
"/home/catch/*\\/*"
You can read about it in the documentation:
pattern:
    { term }
term:
    '*'         matches any sequence of non-Separator characters
    '?'         matches any single non-Separator character
    '[' [ '^' ] { character-range } ']'
                character class (must be non-empty)
    c           matches character c (c != '*', '?', '\\', '[')
    '\\' c      matches character c
character-range:
    c           matches character c (c != '\\', '-', ']')
    '\\' c      matches character c
    lo '-' hi   matches character c for lo <= c <= hi
Match requires pattern to match all of name, not just a substring. The only possible returned error is ErrBadPattern, when pattern is malformed.
On Windows, escaping is disabled. Instead, '\\' is treated as path separator.
https://pkg.go.dev/path/filepath#Match

Related

How does the REGEXP_REPLACE below work?

I have a query in my project that uses REGEXP_REPLACE.
I tried to find out how it works by searching, but all I found was this:
\w+ Matches a word character (that is, an alphanumeric or underscore
(_) character).
I was not able to find anything about '"\w+\":'. Why are these "" used, and what is meant by '{|}|"',''?
UPDATE (SELECT data,data_value FROM TEMP) t
SET t.DATA_VALUE=REGEXP_REPLACE(REGEXP_REPLACE(t.data, '"\w+\":',''),'{|}|"','');
Can you please tell me how it works?
This appears to be a regular expression for stripping keys and enclosing braces from a JSON string. Unfortunately, if this is the case, it does not work in all situations.
The regular expression
'"\w+\":'
will match:
A " double quotation mark;
\w+ one-or-more word (a-z or A-Z or 0-9 or _) characters;
\" another double quotation mark - note: the \ character is not necessary; then
A : colon.
So:
REGEXP_REPLACE(
'{"key":"value","key2":"value with \"quote"}',
'"\w+":', -- Pattern matched
'' -- Replacement string
)
Will output:
{"value","value with \"quote"}
The second pattern {|}|" will match either a {, or a } or a " character (and could have been equivalently written as [{}"]) so:
REGEXP_REPLACE(
'{"value","value with \"quote"}',
'{|}|"', -- Pattern matched
'' -- Replacement string
)
Will output:
value,value with \quote
Which is fine, until (like my example) you have an escaped double quote (or curly braces) in the value string; in which case those will also get stripped leaving the escape character.
(Note: you would not typically find this but it is possible to include escaped quotes in the key. So {"keywith\":quote":"value"} would get replaced to {quote":"value"} and then quote:value which is not the intended output.)
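The two-step behaviour described above can be reproduced outside the database. The following is a minimal sketch in Ruby rather than Oracle SQL, on the assumption that Ruby's regex engine treats these two simple patterns the same way Oracle's does. The variable names are illustrative only.

```ruby
# Same sample as in the answer; single quotes keep the backslash literal.
json = '{"key":"value","key2":"value with \"quote"}'

# Step 1: drop every "key": prefix (quote, word characters, quote, colon).
without_keys = json.gsub(/"\w+":/, '')

# Step 2: drop braces and double quotes; [{}"] is equivalent to {|}|".
stripped = without_keys.gsub(/[{}"]/, '')

puts without_keys  # {"value","value with \"quote"}
puts stripped      # value,value with \quote
```

The stray backslash in the final string is the pitfall the answer points out: the escaped quote inside the value is removed along with the structural quotes.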
If parsing JSON is what you are trying to do (pre-Oracle 12) then you can use:
REGEXP_REPLACE(
'{"key":"value","key2":"value with \"quote","keywith\":quote":"value with \"{}"}',
'^{|"(\\"|[^"])+":(")?((\\"|[^"])+?)\2((,)|})',
'\3\6'
)
Which outputs:
value,value with \"quote,value with \"{}
Or in Oracle 12 you can do:
SELECT *
FROM JSON_TABLE(
'{"key":"value","key2":"value with \"quote","keywith\":quote":"value with \"{}"}',
'$.*' NULL ON ERROR
COLUMNS (
value VARCHAR2(4000) PATH '$'
)
)
Which outputs:
VALUE
-----------------
value
value with "quote
value with "{}
Example: REGEXP_REPLACE( string, pattern [, replacement_string [, start_position [, nth_appearance [, match_parameter ] ] ] ] )
| means OR (it can separate more than one alternative); the comma means "at least", as in {n,}, which matches at least n times.
https://www.techonthenet.com/oracle/functions/regexp_replace.php (where I got my info)
You asked why the "" are used in '"\w+\":' and what is meant by '{|}|"',''.
\w matches a word character and + means one or more times. The pattern looks malformed to me, as if it were missing closing parentheses. Writing \" around \w+ allows the " characters to be matched.
The nested call applies the inner replacement first, then uses its result as the input for the outer replacement. Good luck figuring the rest out. Regular expressions aren't too bad, and they are pretty intuitive once you get the basics down.

Ignore a specific \n character while still enabling the m flag

I want to match characters across multiple lines so I enabled the m flag. However, I do not want to match a specific \n. Instead I want to match a space \s only. But it seems like the newline is matching spaces too:
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+\s.+,.+,.+\d+)/m
=> 0
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+[ ].+,.+,.+\d+)/m
=> 3
Even if I try to explicitly exclude the newline:
" 41\n6332 Hardin Rd, Bensalem, PA\n 19020" =~ /\s(\d+[^\n].+,.+,.+\d+)/m
=> 0
Why is the newline matching a space character? And what can I do to ensure that it does not and still matches characters across multiple lines everywhere else?
The /\s(\d+[^\n].+,.+,.+\d+)/m pattern matches " 41\n6332 Hardin Rd, Bensalem, PA\n 19020" because backtracking occurs when the regex engine gets to [^\n] after matching 41 with \d+. The next character is \n, which [^\n] cannot match, so the engine steps back into \d+ and lets it match only 4. The following 1 is not a newline, so [^\n] matches it and matching continues.
You may anchor the search at the start of the string and prevent backtracking with a possessive quantifier, also implementing the negative check with a lookahead:
/\A\s*(\d++(?!\n).+,.+,.+\d)/m
Details
\A - start of string
\s* - 0+ whitespaces
(\d++(?!\n).+,.+,.+\d) - Capturing group 1:
\d++(?!\n) - 1+ digits (matched possessively with ++ quantifier) not followed with a newline (as (?!\n) is a negative lookahead that fails the match if there is a newline immediately to the right of the current location)
.+,.+, - 2 occurrences of any 1+ chars as many as possible, followed with ,
.+\d - any 1+ chars as many as possible followed with a digit.
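The backtracking effect and the possessive fix can be checked in isolation. This is a small sketch under stated assumptions: the short "41\n" probe string and the unanchored variant (the answer's pattern with the question's leading \s in place of \A\s*) are my own additions for illustration, not part of the answer.

```ruby
text = " 41\n6332 Hardin Rd, Bensalem, PA\n 19020"

# Greedy \d+ gives back the "1", so the lookahead succeeds right after "4".
greedy_idx = "41\n" =~ /\d+(?!\n)/

# Possessive \d++ keeps both digits, so the lookahead sees "\n" and fails.
possessive_idx = "41\n" =~ /\d++(?!\n)/

# Anchored pattern from the answer: the leading number is followed by a
# newline, so the whole sample string is rejected.
anchored_idx = text =~ /\A\s*(\d++(?!\n).+,.+,.+\d)/m

# Assumed unanchored variant: skips "41" and starts at the newline instead.
unanchored = /\s(\d++(?!\n).+,.+,.+\d+)/m
unanchored_idx = text =~ unanchored
captured = text[unanchored, 1]

p greedy_idx      # 0
p possessive_idx  # nil
p anchored_idx    # nil
p unanchored_idx  # 3
```

Whether the anchored or the unanchored form is the right one depends on whether a number directly before a newline should make the whole string fail or simply be skipped.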

Ruby REGEX for letters and numbers or letters followed by period, letters and numbers

I am trying to construct a Ruby REGEX that will only allow the following:
some string (read letter only characters)
some string followed by numbers
some string followed by a period and another string
some string followed by a period and another string followed by numbers
period is only allowed if another string follows it
no other periods are allowed afterwards
numbers may only be at the very end
I have got \A[[^0-9.]a-z]*([0-9]*|((.)([[^0-9]a-z]*)[0-9]*))\z but I can't get what I need. This allows:
test.
test..
test.123
What is the correct regex? If someone could explain what I am doing wrong, to help me understand for the future, that would be great too.
Edit: update requirements to be more descriptive
So I'm guessing you want identifiers separated by periods (.).
By identifier I mean:
a string consisting of alphanumeric characters
that does not start with a number
and is at least one character long.
Written out as a grammar, it would look something like this:
EXPR := IDENT "." EXPR | IDENT
IDENT := [A-Z]\w*
And the regex for this would be the following:
/\A[A-Z]\w*(\.[A-Z]\w*)*\Z/i
Note: due to the behaviour of \w, this pattern will also accept _ (underscores) after the first character (i.e. test_123 will also pass).
EDIT to reflect update of question
So the grammar you want is actually like this:
EXPR := IDENT [0-9]*
IDENT := STR | STR "." STR
STR := [A-Z]+
And the regexp then is this:
/\A[A-Z]+(\.[A-Z]+)?[0-9]*\z/i
The explanation is as follows:
/             # start Regexp
  \A          # start of string
  [A-Z]+      # "some string"
  (
    \.        # followed by a period
    [A-Z]+    # and another string
  )?          # period + another string is optional
  [0-9]*      # optional digits at the end
  \z          # end of string
/i            # this regexp is case insensitive.
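To see how the pattern from the edited answer lines up with the requirements in the question, here is a quick check in Ruby. The sample strings are illustrative picks of mine (the three that the question says must not pass are included), not an exhaustive test.

```ruby
PATTERN = /\A[A-Z]+(\.[A-Z]+)?[0-9]*\z/i

# Strings the requirements allow.
accepted = %w[test test123 test.abc test.abc123]
# Strings the requirements forbid, including those from the question.
rejected = %w[test. test.. test.123 123test test.abc.def test1.abc]

results = (accepted + rejected).map { |s| [s, PATTERN.match?(s)] }.to_h

results.each { |s, ok| puts "#{s.ljust(14)} #{ok}" }
```

Every string in accepted should print true and every string in rejected should print false.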
You can try
^[a-z]+\.?[a-z]+[0-9]*$
Note: use \A and \z to match the start and end of the string instead of the line.
You need to escape the ., which otherwise matches any single character.
Pattern explanation:
^ the beginning of the line
[a-z]+ any character of: 'a' to 'z' (1 or more times)
\.? '.' (optional)
[a-z]+ any character of: 'a' to 'z' (1 or more times)
[0-9]* any character of: '0' to '9' (0 or more times)
$ the end of the line

Ruby, weird substitution

For example:
str1 = "pppp(m)pppp"
str2 = "(m)"
str1 = str1.sub(/#{str2}/, "<>#{str2}<>")
I get this:
"pppp(<>(m)<>)pppp"
I expected to get this:
"pppp<>(m)<>pppp"
Why is this happening and how can I avoid it?
( and ) have a special meaning in regexes and do not actually match the characters ( and ). The regex /(m)/ will match any m whether or not it is enclosed in parentheses (and if it is, it won't match the parentheses).
To match literal parentheses use \( and \) - or in a case like this where you're interpolating a string, you can just use Regexp.escape on the string, i.e. /#{ Regexp.escape(str2) }/.
The regular expression treats "(m)" as a capture group because parentheses are operators in regular expressions. To get a literal "(m)" you need to use the escape character \, as in "\(m\)".
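Both behaviours can be put side by side. This sketch reuses the strings from the question; the result variable names are made up for illustration.

```ruby
str1 = "pppp(m)pppp"
str2 = "(m)"

# Interpolated as-is: the parentheses form a group, so only "m" is matched.
unescaped = str1.sub(/#{str2}/, "<>#{str2}<>")

# Regexp.escape turns "(m)" into "\(m\)", so the parentheses match literally.
escaped = str1.sub(/#{Regexp.escape(str2)}/, "<>#{str2}<>")

# Passing the String itself as the pattern also matches it literally.
literal = str1.sub(str2, "<>#{str2}<>")

puts unescaped  # pppp(<>(m)<>)pppp
puts escaped    # pppp<>(m)<>pppp
puts literal    # pppp<>(m)<>pppp
```

When no regex features are needed, the String form is the simplest option. Regexp.escape is useful when the literal text is only one part of a larger pattern.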

Regex to remove non letters

I'm trying to remove non-letters from a string. Would this do it:
c = o.replace(o.gsub!(/\W+/, ''))
Just gsub! is sufficient:
o.gsub!(/\W+/, '')
Note that gsub! modifies the original o object. Also, if o does not contain any non-word characters, the result will be nil, so using the return value as the modified string is unreliable.
You probably want this instead:
c = o.gsub(/\W+/, '')
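The difference in return values is easy to miss, so here is a short sketch of it. The sample strings and variable names are hypothetical.

```ruby
with_symbols = "a-b!c"
only_words   = "abc"

# gsub! returns the modified string when at least one substitution happened...
bang_changed = with_symbols.dup.gsub!(/\W+/, '')

# ...and nil when nothing matched, even though the string is already "clean".
bang_unchanged = only_words.dup.gsub!(/\W+/, '')

# gsub always returns a string, changed or not.
plain = only_words.gsub(/\W+/, '')

p bang_changed    # "abc"
p bang_unchanged  # nil
p plain           # "abc"
```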
Remove anything that is not a letter:
> " sd 190i.2912390123.aaabbcd".gsub(/[^a-zA-Z]/, '')
"sdiaaabbcd"
EDIT: as ikegami points out, this doesn't take into account accented characters, umlauts, and other similar characters. The solution to this problem will depend on what exactly you are referring to as "not a letter". Also, what your input will be.
Keep in mind that Ruby considers the underscore _ to be a word character. So if you want to keep underscores as well, this should do it:
string.gsub!(/\W+/, '')
Otherwise, you need to do this:
string.gsub!(/[^a-zA-Z]/, '')
That will work in most cases, except when o initially does not contain any non-letter characters, in which case gsub! will return nil.
If you just want a replaced string, it can be simpler:
c = o.gsub(/\W+/, '')
Using \W or \w to select or delete only letters won't work. \w means A-Z, a-z, 0-9, and "_":
irb(main):002:0> characters = (' ' .. "\x7e").to_a.join('')
=> " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
irb(main):003:0> characters.gsub(/\W+/, '')
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"
So, stripping using \W preserves digits and underscores.
If you want to match letters, use /[A-Za-z]+/, or the POSIX character class [:alpha:], i.e. /[[:alpha:]]+/, or /\p{ALPHA}/.
The final format is the Unicode property for 'A'..'Z' + 'a'..'z' in ASCII, and gets extended when dealing with Unicode, so if you have multibyte characters you should probably use that.
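The difference only shows up on non-ASCII input. This is a small sketch assuming a UTF-8 string; the sample text is made up, and \u escapes are used so the example does not depend on the source file's encoding.

```ruby
sample = "h\u00e9llo w\u00f6rld_123"   # "héllo wörld_123"

# ASCII range only: accented letters are treated as non-letters and removed.
ascii_only = sample.gsub(/[^a-zA-Z]/, '')

# POSIX bracket expression: Unicode-aware on UTF-8 strings.
posix_alpha = sample.gsub(/[^[:alpha:]]/, '')

# \P{Alpha} is the negated form of the \p{Alpha} property.
unicode_alpha = sample.gsub(/\P{Alpha}/, '')

puts ascii_only     # hllowrld
puts posix_alpha    # héllowörld
puts unicode_alpha  # héllowörld
```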
Use Regexp.union to create one big matching object:
allowed = Regexp.union(/[a-zA-Z0-9]/, " ", "-", ":", ")", "(", ".")
cleanstring = dirty_string.chars.select {|c| c =~ allowed}.join("")
I don't see what that o.replace is in there for. If you have a string:
string = 't = 4 6 ^'
And you do:
string.gsub!(/\W+/, '')
You get:
t46
If you want to get rid of the number characters too, you can do:
string.gsub!(/\W+|\d+/, '')
And you get:
t

Resources