preg_match to match a URL with multiple languages

I was using a standard preg_match to validate the URL, excluding the
http://domainlllll.com/
part, and it was working without any issue:
preg_match("/^[0-9a-z_\/.\|\-]+$/",$url)
But now I want to support multiple languages, so I used this instead, and it also works without any problem:
preg_match("/\S+/",$url)
my url is
link/kn/some-word-গরম-এবং-সেক্সি-ইমেজ/611766
But I want to exclude some special characters that are hackers' favorites, such as single quotes. I don't want to exclude all special characters, because a few of them are part of some languages and excluding them would break links in those languages.
Any guidance would be great.

Look, the /^[0-9a-z_\/.\|\-]+$/ regex requires the whole string to match the patterns, 1+ chars from the defined ranges (digits, lowercase ASCII letters) and sets. The /\S+/ regex does not require a full string match since there are no anchors, ^ (start of string) and $ (end of string), and matches 1 or more non-whitespace chars anywhere inside a string.
If you plan to match strings that only contain non-whitespace symbols and NOT quotes, use
preg_match('~^[^\s\'"]+$~', $url)
The ^[^\s\'"]+$ matches
^ - start of string
[^\s\'"]+ - 1 or more chars other than whitespace (\s), ' and "
$ - end of string (in PHP it also matches just before a final newline, so better use \z instead if you need to validate a string).

Related

Regular expression for letters, spaces and hyphens

Looking for the regex to allow letters (either case), spaces and dashes for validation in ruby. Can't quite crack it.
As a starting point I'm using:
validates :name, format: { with: /\A[a-zA-Z]+(?: [a-zA-Z]+)?\z/, allow_blank: true}
Many thanks!
If you need to support all Unicode letters and to make sure - and spaces only appear between letters and no consecutive spaces/hyphens may occur (and there may be any amount of spaces/hyphens), use
/\A\p{L}+(?:[- ]\p{L}+)*\z/
/\A\p{L}+(?:[-\s]\p{L}+)*\z/
/\A\p{L}+(?:[-\p{Zs}\t]\p{L}+)*\z/
In short,
\A - matches start of string
\p{L}+ - one or more letters
(?:[-\s]\p{L}+)* - a non-capturing group that matches zero or more occurrences of
[-\s] - a - or whitespace
\p{L}+ - one or more Unicode letters
\z - the end of string.
In the comments, you mention that /\A[-A-Z\s]+\z/i works for you, but it also matches blank strings, or strings that are a mix of hyphens and whitespace because it means "start of string, one or more ASCII letters, whitespace or hyphens and then the end of string". This can be used to allow only specific chars to be input, but this does not validate much.
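To make the difference between the two patterns concrete, here is a minimal sketch. The constant and method names (NAME_PATTERN, LOOSE_PATTERN, valid_name?) are made up for illustration and are not part of the original validation; the sample strings are assumptions too.

```ruby
# Strict pattern from the answer: Unicode letters, with single hyphens or
# spaces allowed only between runs of letters.
NAME_PATTERN = /\A\p{L}+(?:[- ]\p{L}+)*\z/

# Looser pattern mentioned in the comments: any mix of ASCII letters,
# whitespace and hyphens.
LOOSE_PATTERN = /\A[-A-Z\s]+\z/i

# Hypothetical helper, named for this example only.
def valid_name?(name)
  NAME_PATTERN.match?(name)
end

p valid_name?("Jean-Luc Picard")       # true
p valid_name?("Jos\u00e9 Mar\u00eda")  # true, accented letters are \p{L}
p valid_name?("Anna  Maria")           # false, two consecutive spaces
p valid_name?("-Anna")                 # false, leading hyphen
p LOOSE_PATTERN.match?(" - ")          # true, no letter required
```

The strict pattern rejects the hyphen-and-space-only input that the looser one lets through, which is the point made above.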
This regex will allow letters, spaces and hyphens: /\A[A-Za-z\s\-]+\z/

regexp match group with the exception of a member of the group

So, there are a number of regular expressions that match a particular group of characters, like the following:
/./ - Any character except a newline.
/./m - Any character (the m modifier enables multiline mode)
/\w/ - A word character ([a-zA-Z0-9_])
/\s/ - Any whitespace character
And in ruby:
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
So, here is my question: how do I get a regexp to match a group like this, but exempt a character out?
Examples:
match all punctuations apart from the question mark
match all whitespace characters apart from the new line
match all words apart from "go"... etc
Thanks.
You can use character class subtraction.
Rexegg:
The syntax […&&[…]] allows you to use a logical AND on several character classes to ensure that a character is present in them all. Intersecting with a negated character, as in […&&[^…]] allows you to subtract that class from the original class.
Consider this code:
s = "./?!"
res = s.scan(/[[:punct:]&&[^!]]/)
puts res
Output is only ., / and ? since ! is excluded.
Restricting with a lookahead (as sawa has written just now) is also possible, but is not required when you have this subtraction supported. When you need to restrict some longer values (more than 1 character) a lookahead is required.
In many cases, a lookahead must be anchored to a word boundary to return correct results. As an example of using a lookahead to restrict punctuation (single character matching generic pattern):
/(?:(?!!)[[:punct:]])+/
This will match 1 or more punctuation symbols other than !.
The puts "./?!".scan(/(?:(?!!)[[:punct:]])+/) code will output ./?
Use character class subtraction whenever you need to restrict with single characters, it is more efficient than using lookaheads.
So, the 3rd scenario regex must look like:
/\b(?!go\b)\w+\b/
        ^^
If you write /(?!\bgo\b)\b\w+\b/, the regex engine will check each position in the input string. If you use a \b at the beginning, only word boundary positions will be checked, and the pattern will yield better performance. Also note that the ^^ \b is very important since it makes the regex engine check for the whole word go. If you remove it, it will only restrict to the words that do not start with go.
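The second and third scenarios from the question can be sketched the same way. This is only an illustration: the constant and method names and the sample strings below are assumptions, not taken from the question.

```ruby
# Scenario 2: whitespace except the newline, using class intersection
# with a negated class (the subtraction idiom described above).
NON_NEWLINE_SPACE = /[[:space:]&&[^\n]]/

# Scenario 3: whole words except "go", using a lookahead placed after
# a word boundary.
NOT_GO = /\b(?!go\b)\w+\b/

# Hypothetical helpers, named for this example only.
def non_newline_spaces(str)
  str.scan(NON_NEWLINE_SPACE)
end

def words_except_go(str)
  str.scan(NOT_GO)
end

p non_newline_spaces("a b\tc\nd")      # [" ", "\t"]
p words_except_go("go going ago go")   # ["going", "ago"]
```

Note that "going" and "ago" are kept because the \b inside the lookahead restricts the exclusion to the whole word go.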
Put what you want to exclude inside a negative lookahead in front of the match. For example,
To match all punctuations apart from the question mark,
/(?!\?)[[:punct:]]/
To match all words apart from "go",
/(?!\bgo\b)\b\w+\b/
This is a general approach that is sometimes useful:
a = []
".?!,:;-".scan(/[[:punct:]]/) { |s| a << s unless s == '?' }
a #=> [".", "!", ",", ":", ";", "-"]
The content of the block is limited only by your imagination.

any explanation on the following regular expression?

I came across the following regex in Ruby code. Could anyone explain it to me in detail?
[\w-]+\.(?:doc|txt)$
In particular, I am not clear about [\w-]+\ and ?:
It is a sequence of one or more letter/number/underscore/hyphen, followed by the period, followed by either doc or txt at the end of a line.
[\w-] is letter/number/underscore/hyphen.
\. is an escaped period.
(?:...) is a grouping (required to express options between doc and txt) that would not appear in the result as a captured substring.
It is likely written for searching a file name with the extension doc or txt, embedded within a multi-line string. Or, if the author of that regex is stupid (mistaking $ for \z), then it might have been intended to simply match a file name with that extension.
There is an online regex tester available at https://regex101.com/
You can use it to analyse, verify or debug your regex strings. It already saved me tons of time.
Your regex detailed automatically with the help of that tool:
/[\w-]+\.(?:doc|txt)$/
[\w-]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\w match any word character [a-zA-Z0-9_]
- the literal character -
\. matches the character . literally
(?:doc|txt) Non-capturing group
1st Alternative: doc
doc matches the characters doc literally (case sensitive)
2nd Alternative: txt
txt matches the characters txt literally (case sensitive)
$ assert position at end of the string
\w means any word character.
The minus in this context just means a literal minus character.
(?:doc|txt) means match doc or txt.
So: any word character or a minus, repeated one or more times, followed by a dot, followed by either doc or txt, and the pattern must be at the end of the line.
The author should have escaped the minus for clarity, IMHO.
It means a file name which contains only word characters (a-z, A-Z, 0-9 and underscore) and hyphens, and with an extension of either .doc or .txt.
In detail,
\w matches a word character
[\w-] matches either a word character or a hyphen
[\w-]+ matches one or more such characters
\. matches a period
(?:) forms a non-capture group
(?:doc|txt) matches either a doc sequence, or a txt sequence
In ruby, $ matches the end of a line
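A small sketch of the $ versus \z distinction raised in the answers above, and of what the non-capturing group changes. The constant names and the sample text are assumptions made for the example.

```ruby
# The regex from the question: $ anchors at the end of any line in Ruby.
LINE_END = /[\w-]+\.(?:doc|txt)$/

# The same pattern anchored to the end of the whole string instead.
STRING_END = /[\w-]+\.(?:doc|txt)\z/

# Same pattern with a capturing group, for comparison.
CAPTURING = /[\w-]+\.(doc|txt)$/

# Hypothetical multi-line input with a file name on the first line.
TEXT = "see notes.txt\nand more"

p TEXT[LINE_END]                          # "notes.txt"
p TEXT[STRING_END]                        # nil
p "my-report.doc"[STRING_END]             # "my-report.doc"
p "notes.txt".match(LINE_END).captures    # []
p "notes.txt".match(CAPTURING).captures   # ["txt"]
```

With $, the file name is found even though more text follows on the next line; with \z it is found only when it ends the string.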

Regular expression help to skip first occurrence of a special character while allowing for later special chars but no whitespace

I'm looking for words starting with a hashtag: "#yolo"
My regex for this was very simple: /#\w+/
This worked fine until I hit words that ended with a question mark: "#yolo?".
I updated my regex to allow for words and any non whitespace character as well: /#[\w\S]*/.
The problem is I sometimes need to pull a match from a word starting with two '#' characters, up until whitespace, that may contain a special character in it or at the end of the word (which I need to capture).
Example:
"##yolo?"
And I would like to end up with:
"#yolo?"
Note: the regular expressions are for Ruby.
P.S. I'm testing these out here: http://rubular.com/
Maybe this would work
#(#?[\S]+)
What about
#[^#\s]+
\w is a subset of ^\s (i.e. \S) so you don't need both. Also, I assume you don't want any more #s in the match, so we use [^#\s] which negates both whitespace and # characters.
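As a quick check of both suggestions, here is a sketch run against the example from the question plus one extra made-up tag. The constant names are hypothetical.

```ruby
# Second suggestion: a # followed by one or more characters that are
# neither # nor whitespace.
HASHTAG = /#[^#\s]+/

# First suggestion: skip one leading # and capture the rest in group 1.
HASHTAG_GROUP = /#(#?[\S]+)/

p "##yolo?"[HASHTAG]                   # "#yolo?"
p "##yolo? and #swag".scan(HASHTAG)    # ["#yolo?", "#swag"]
p "##yolo?"[HASHTAG_GROUP, 1]          # "#yolo?"
```

For the doubled ##, the first # cannot start a match with HASHTAG because the next character is another #, so the match begins at the second one.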

How can I write a regex in Ruby that will determine if a string meets this criteria?

How can I write a regex in Ruby 1.9.2 that will determine if a string meets this criteria:
Can only include letters, numbers and the - character
Cannot be an empty string, i.e. cannot have a length of 0
Must contain at least one letter
/\A[a-z0-9-]*[a-z][a-z0-9-]*\z/i
It goes like
beginning of string
some (or zero) letters, digits and/or dashes
a letter
some (or zero) letters, digits and/or dashes
end of string
I suppose these two will help you: /\A[a-z0-9\-]{1,}\z/i and /[a-z]{1,}/i. The first one checks the first two rules, and the second one checks the last condition.
No regex:
str.count("a-zA-Z") > 0 && str.count("^a-zA-Z0-9-") == 0
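The single-regex answer and the regex-free check above can be compared side by side. This is a sketch only; the method names and the sample inputs are invented for the comparison.

```ruby
# Single-regex answer: allowed characters only, with at least one letter.
CRITERIA = /\A[a-z0-9-]*[a-z][a-z0-9-]*\z/i

# Hypothetical wrappers, named for this example only.
def valid_by_regex?(str)
  CRITERIA.match?(str)
end

def valid_by_count?(str)
  # At least one letter, and no character outside letters, digits and "-".
  str.count("a-zA-Z") > 0 && str.count("^a-zA-Z0-9-") == 0
end

["abc-123", "1a", "123-", "", "a b", "abc\n"].each do |sample|
  puts "#{sample.inspect}: regex=#{valid_by_regex?(sample)} count=#{valid_by_count?(sample)}"
end
```

Both checks accept "abc-123" and "1a" and reject the other four samples, including the one with a trailing newline.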
You can take a look at this tutorial for how to use regular expressions in ruby. With regards to what you need, you can use the following:
^[A-Za-z0-9\-]+$
The ^ will instruct the regex engine to start matching from the beginning of a line. In Ruby, ^ and $ are line anchors; use \A and \z if the whole string must match.
The [..] will instruct the regex engine to match any one of the characters they contain.
A-Z means any upper case letter, a-z means any lower case letter and 0-9 means any digit.
The \- will instruct the regex engine to match the -. The \ is used in front of it because inside square brackets the - is a special symbol (it denotes a range), so it is escaped.
The $ will instruct the regex engine to stop matching at the end of the line.
The + instructs the regex engine to match what is contained between the square brackets one or more times.
Note that this pattern alone does not check the "at least one letter" rule.
You can also use the i flag to make your search case insensitive, so the regex might become something like this:
/^[a-z0-9\-]+$/i
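Since Ruby treats ^ and $ as line anchors, the choice of anchors matters for validation. A minimal sketch, with made-up constant names and a made-up multi-line input:

```ruby
# Line-anchored version: matches if any single line satisfies the pattern.
LINE_ANCHORED = /^[a-z0-9\-]+$/i

# String-anchored version: the whole string must satisfy the pattern.
STRING_ANCHORED = /\A[a-z0-9\-]+\z/i

# Hypothetical input whose first line is valid but whose second is not.
INPUT = "abc\n<script>"

p LINE_ANCHORED.match?(INPUT)       # true
p STRING_ANCHORED.match?(INPUT)     # false
p STRING_ANCHORED.match?("abc-1")   # true
```

The line-anchored pattern accepts the input because its first line alone matches, which is usually not what a validation is meant to allow.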
