any explanation on the following regular expression? - ruby

I came across the following regex in Ruby code. Could anyone explain it in detail?
[\w-]+\.(?:doc|txt)$
In particular, I am not clear about [\w-]+\ and ?:

It is a sequence of one or more letter/number/underscore/hyphen, followed by the period, followed by either doc or txt at the end of a line.
[\w-] is letter/number/underscore/hyphen.
\. is an escaped period.
(?:...) is a grouping (required to express options between doc and txt) that would not appear in the result as a captured substring.
It is likely written for searching a file name with the extension doc or txt, embedded within a multi-line string. Or, if the author of that regex is stupid (mistaking $ for \z), then it might have been intended to simply match a file name with that extension.
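The difference between $ and \z is easy to check in irb. This is only a minimal sketch: the constant names and the sample strings are made up for illustration, and the first pattern is the one from the question.

```ruby
# Hypothetical constant names; LINE_END is the regex from the question.
LINE_END   = /[\w-]+\.(?:doc|txt)$/   # $ matches at the end of any line
STRING_END = /[\w-]+\.(?:doc|txt)\z/  # \z matches only at the very end of the string

multi_line = "see report.doc\nthen run setup.exe"

p LINE_END.match?(multi_line)      # true: "report.doc" ends the first line
p STRING_END.match?(multi_line)    # false: the string itself ends with ".exe"
p STRING_END.match?("report.doc")  # true
```

So if the goal is to validate a single file name, the \z version is probably the one to use.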

There is an online regex tester available at https://regex101.com/
You can use it to analyse, verify or debug your regex strings. It already saved me tons of time.
Your regex detailed automatically with the help of that tool:
/[\w-]+\.(?:doc|txt)$/
[\w-]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\w match any word character [a-zA-Z0-9_]
- the literal character -
\. matches the character . literally
(?:doc|txt) Non-capturing group
1st Alternative: doc
doc matches the characters doc literally (case sensitive)
2nd Alternative: txt
txt matches the characters txt literally (case sensitive)
$ assert position at end of the string (as the tool reads it under its default flavor; in Ruby, $ matches at the end of each line)

\w means any word character
minus in this context just means minus char
(?:doc|txt) means match doc or txt
so any word char or a minus repeated one or more times followed by a dot followed by either doc or txt and the pattern must be at the end of the line
the author should have escaped the minus for clarity imho

It means a file name which contains only word characters (a-z, A-Z, 0-9 and underscore) and hyphens, and with an extension of either .doc or .txt.
In detail,
\w matches a word character
[\w-] matches either a word character or a hyphen
[\w-]+ matches one or more such characters
\. matches a period
(?:) forms a non-capture group
(?:doc|txt) matches either a doc sequence, or a txt sequence
In ruby, $ matches the end of a line
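To see what "non-capture" means in practice, here is a small sketch comparing the original pattern with a capturing variant. The second pattern and the sample strings are my own additions, not from the question.

```ruby
# NON_CAPTURING is the regex from the question; CAPTURING is a hypothetical
# variant that uses a plain (...) group for comparison.
NON_CAPTURING = /[\w-]+\.(?:doc|txt)$/
CAPTURING     = /[\w-]+\.(doc|txt)$/

name = "my-report.doc"
p name.match(NON_CAPTURING).captures  # []
p name.match(CAPTURING).captures      # ["doc"]

# String#scan returns whole matches unless the pattern has capture groups.
list = "a.doc\nb_2.txt\nc.exe"
p list.scan(NON_CAPTURING)  # ["a.doc", "b_2.txt"]
p list.scan(CAPTURING)      # [["doc"], ["txt"]]
```

With scan in particular, the non-capturing form is usually what you want when you are after the full file names.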

Related

preg_match to match url with multiple languages

i was using standard preg_match for making url excluding
http://domainlllll.com/
and it was working without any issue
preg_match("/^[0-9a-z_\/.\|\-]+$/",$url)
but now i want to support multiple languages so i used this and it is also working without any problem
preg_match("/\S+/",$url)
my url is
link/kn/some-word-গরম-এবং-সেক্সি-ইমেজ/611766
But I want to exclude some special characters that are favorites of hackers, such as single quotes. I don't want to exclude all special characters, because some of them are part of certain languages and excluding them would break links in those languages.
Any guide will be great
Look, the /^[0-9a-z_\/.\|\-]+$/ regex requires the whole string to match the patterns, 1+ chars from the defined ranges (digits, lowercase ASCII letters) and sets. The /\S+/ regex does not require a full string match since there are no anchors, ^ (start of string) and $ (end of string), and matches 1 or more non-whitespace chars anywhere inside a string.
If you plan to match strings that only contain non-whitespace symbols and NOT quotes, use
preg_match('~^[^\s\'"]+$~', $url)
The ^[^\s\'"]+$ matches
^ - start of string
[^\s\'"]+ - 1 or more chars other than whitespace (\s), ' and "
$ - end of string, or just before a final trailing newline (better use \z instead if you need to validate a string).
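The same anchoring idea can be tried out in Ruby, the language used in the other threads on this page. This is a rough rendering rather than a translation of the PHP call: the constant name and the second sample URL are hypothetical, and \A and \z are used as the strict string anchors.

```ruby
# Hypothetical Ruby rendering of ^[^\s'"]+$ with strict anchors.
NO_SPACE_NO_QUOTES = /\A[^\s'"]+\z/

ok_url  = "link/kn/some-word-গরম-এবং-সেক্সি-ইমেজ/611766"  # URL from the question
bad_url = "link/kn/some'word/611766"                      # made-up URL with a quote

p NO_SPACE_NO_QUOTES.match?(ok_url)   # true
p NO_SPACE_NO_QUOTES.match?(bad_url)  # false

# The unanchored /\S+/ accepts anything that has a non-whitespace run in it.
p(/\S+/.match?("has space 'and quote'"))  # true
```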

regexp match group with the exception of a member of the group

So, there are a number of regular expression which matches a particular group like the following:
/./ - Any character except a newline.
/./m - Any character (the m modifier enables multiline mode)
/\w/ - A word character ([a-zA-Z0-9_])
/\s/ - Any whitespace character
And in ruby:
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
So, here is my question: how do I get a regexp to match a group like this, but exempt a character out?
Examples:
match all punctuations apart from the question mark
match all whitespace characters apart from the new line
match all words apart from "go"... etc
Thanks.
You can use character class subtraction.
Rexegg:
The syntax […&&[…]] allows you to use a logical AND on several character classes to ensure that a character is present in them all. Intersecting with a negated character, as in […&&[^…]] allows you to subtract that class from the original class.
Consider this code:
s = "./?!"
res = s.scan(/[[:punct:]&&[^!]]/)
puts res
Output is only ., / and ? since ! is excluded.
Restricting with a lookahead (as sawa has written just now) is also possible, but is not required when you have this subtraction supported. When you need to restrict some longer values (more than 1 character) a lookahead is required.
In many cases, a lookahead must be anchored to a word boundary to return correct results. As an example of using a lookahead to restrict punctuation (single character matching generic pattern):
/(?:(?!!)[[:punct:]])+/
This will match 1 or more punctuation symbols except !.
The puts "./?!".scan(/(?:(?!!)[[:punct:]])+/) code will output ./?
Use character class subtraction whenever you need to restrict with single characters, it is more efficient than using lookaheads.
So, the 3rd scenario regex must look like:
/\b(?!go\b)\w+\b/
        ^^
If you write /(?!\bgo\b)\b\w+\b/, the regex engine will check each position in the input string. If you use a \b at the beginning, only word boundary positions will be checked, and the pattern will yield better performance. Also note that the ^^ \b is very important since it makes the regex engine check for the whole word go. If you remove it, it will only restrict to the words that do not start with go.
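Putting the three scenarios from the question side by side, a short sketch might look like this. The constant names and the test strings are made up for illustration.

```ruby
# Hypothetical constant names, one per scenario in the question.
PUNCT_EXCEPT_QUESTION = /[[:punct:]&&[^?]]/  # class subtraction
SPACE_EXCEPT_NEWLINE  = /[[:space:]&&[^\n]]/ # class subtraction
WORD_EXCEPT_GO        = /\b(?!go\b)\w+\b/    # lookahead for a multi-character exclusion

p "a.b?c!".scan(PUNCT_EXCEPT_QUESTION)     # [".", "!"]
p "a b\tc\nd".scan(SPACE_EXCEPT_NEWLINE)   # [" ", "\t"]
p "go gopher ago go".scan(WORD_EXCEPT_GO)  # ["gopher", "ago"]
```

Note that "gopher" still matches in the last line: the \b inside the lookahead makes it reject only the whole word go.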
Put what you want to exclude inside a negative lookahead in front of the match. For example,
To match all punctuations apart from the question mark,
/(?!\?)[[:punct:]]/
To match all words apart from "go",
/(?!\bgo\b)\b\w+\b/
This is a general approach that is sometimes useful:
a = []
".?!,:;-".scan(/[[:punct:]]/) { |s| a << s unless s == '?' }
a #=> [".", "!", ",", ":", ";", "-"]
The content of the block is limited only by your imagination.

How do I tune this regex to return the matches I want?

So I have a string that looks like this:
#jackie#test.com, #mike#test.com
What I want to do is before any email in this comma separated list, I want to remove the #. The issue I keep running into is that if I try to do a regular \A flag like so /[\A#]+/, it finds all the instances of # in that string...including the middle crucial #.
The same thing happens if I do /[\s#]+/. I can't figure out how to just look at the beginning of each string, where each string is a complete email address.
Edit 1
Note that all I need is the regex, I already have the rest of the stuff I need to do what I want. Specifically, I am achieving everything else like this:
str.gsub(/#/, '').split(',').map(&:strip)
Where str is my string.
All I am looking for is the regex portion for my gsub.
You may use the below negative lookbehind based regex.
str.gsub(/(?<!\S)#/, '').split(',').map(&:strip)
(?<!\S) is a negative lookbehind asserting that the character we are going to match is not preceded by a non-space character. So this matches the # at the start of the string or the # right after a space character.
The difference between my answer and hwnd's str.gsub(/\B#/, '') is that mine won't match the # in :# but hwnd's does. \B matches between two word characters or two non-word characters.
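A quick way to compare the two patterns is to run both on the string from the question and on a string containing :#. The constant names and the "a:#b" input are made up for this sketch.

```ruby
# Hypothetical constant names for the two patterns being compared.
LOOKBEHIND   = /(?<!\S)#/
NON_BOUNDARY = /\B#/

str = "#jackie#test.com, #mike#test.com"
p str.gsub(LOOKBEHIND, '')    # "jackie#test.com, mike#test.com"
p str.gsub(NON_BOUNDARY, '')  # "jackie#test.com, mike#test.com"

# The two differ when # follows a non-space, non-word character such as ":".
p "a:#b".gsub(LOOKBEHIND, '')    # "a:#b"
p "a:#b".gsub(NON_BOUNDARY, '')  # "a:b"
```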
Here is one solution
str = "#jackie#test.com, #mike#test.com"
p str.split(/,[ ]+/).map{ |i| i.gsub(/^#/, '')}
Output
["jackie#test.com", "mike#test.com"]

Syntax Highlighting in Notepad++: how to highlight timestamps in log files

I am using Notepad++ to check logs. I want to define custom syntax highlighting for timestamps and log levels. Highlighting logs levels works fine (defined as keywords). However, I am still struggling with highlighting timestamps of the form
06 Mar 2014 08:40:30,193
Any idea how to do that?
If you just want simple highlighting, you can use Notepad++'s regex search mode. Open the Find dialog, switch to the Mark tab, and make sure Regular Expression is set as the search mode. Assuming the timestamp is at the start of the line, this Regex should work for you:
^\d{2}\s[A-Za-z]+\s\d{4}\s\d{2}:\d{2}:\d{2},[\d]+
Breaking it down bit by bit:
^ means the following Regex should be anchored to the start of the line. If your timestamp appears anywhere but the start of a line, delete this.
\d means match any digit (0-9). {n} is a quantifier that means to match the preceding bit of Regex exactly n times, so \d{2} means match exactly two digits.
\s means match any whitespace character.
[A-Za-z] means match any character in the set A-Z or the set a-z, and the + is a quantifier that means match the preceding bit of Regex 1 or more times. So we're looking for a sequence of one or more alphabetic characters.
\s means match any whitespace character.
\d{4} is just like \d{2} earlier, only now we're matching exactly 4 digits.
\s means match any whitespace character.
\d{2} means match exactly two digits.
: matches a colon.
\d{2} matches exactly two digits.
: matches another colon.
\d{2} matches another two digits.
, matches a comma.
[\d]+ works similarly to the alphabetic search sequence we set up earlier, only this one's for digits. This finds one or more digits.
When you run this Regex on your document, the Mark feature will highlight anything that matches it. Unlike the temporary highlighting the "Find All in Document" search type can give you, Mark highlighting lasts even after you click somewhere else in the document.
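If you want to sanity-check the pattern outside Notepad++ first, a small Ruby sketch can do that. Notepad++ uses a different regex engine, so this only checks the pattern's logic, and the log lines below are invented for the example.

```ruby
# The pattern from the answer above; TIMESTAMP is a hypothetical name.
TIMESTAMP = /^\d{2}\s[A-Za-z]+\s\d{4}\s\d{2}:\d{2}:\d{2},[\d]+/

log_line = "06 Mar 2014 08:40:30,193 INFO Server started"  # made-up log line

p log_line[TIMESTAMP]                  # "06 Mar 2014 08:40:30,193"
p "INFO no timestamp here"[TIMESTAMP]  # nil
```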

Regular expression help to skip first occurrence of a special character while allowing for later special chars but no whitespace

I'm looking for words starting with a hashtag: "#yolo"
My regex for this was very simple: /#\w+/
This worked fine until I hit words that ended with a question mark: "#yolo?".
I updated my regex to allow for words and any non whitespace character as well: /#[\w\S]*/.
The problem is I sometimes need to pull a match from a word starting with two '#' characters, up until whitespace, that may contain a special character in it or at the end of the word (which I need to capture).
Example:
"##yolo?"
And I would like to end up with:
"#yolo?"
Note: the regular expressions are for Ruby.
P.S. I'm testing these out here: http://rubular.com/
Maybe this would work
#(#?[\S]+)
What about
#[^#\s]+
\w is a subset of ^\s (i.e. \S) so you don't need both. Also, I assume you don't want any more #s in the match, so we use [^#\s] which negates both whitespace and # characters.
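Here is a quick check of that pattern against the inputs from the question, with the original /#\w+/ for comparison. The constant name and the second sample string are made up.

```ruby
# HASHTAG is a hypothetical name for the pattern suggested above.
HASHTAG = /#[^#\s]+/

p "##yolo?".scan(HASHTAG)              # ["#yolo?"]
p "#yolo ##swag! plain".scan(HASHTAG)  # ["#yolo", "#swag!"]

# The original pattern drops the trailing "?".
p "##yolo?".scan(/#\w+/)               # ["#yolo"]
```

The leading # of ## is skipped because the character right after a matched # must not be another #.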

Resources