Regex in Ruby on Rails for adding "\n\n" (two newlines) when "\n" is found [duplicate] - ruby

I don't really understand regular expressions. Can you explain them to me in an easy-to-follow manner? If there are any online tools or books, could you also link to them?

The most important part is the concepts. Once you understand how the building blocks work, differences in syntax amount to little more than mild dialects. A layer on top of your regular expression engine's syntax is the syntax of the programming language you're using. Languages such as Perl remove most of this complication, but you'll have to keep in mind other considerations if you're using regular expressions in a C program.
If you think of regular expressions as building blocks that you can mix and match as you please, it helps you learn how to write and debug your own patterns but also how to understand patterns written by others.
Start simple
Conceptually, the simplest regular expressions are literal characters. The pattern N matches the character 'N'.
Regular expressions next to each other match sequences. For example, the pattern Nick matches the sequence 'N' followed by 'i' followed by 'c' followed by 'k'.
If you've ever used grep on Unix—even if only to search for ordinary looking strings—you've already been using regular expressions! (The re in grep refers to regular expressions.)
Order from the menu
Adding just a little complexity, you can match either 'Nick' or 'nick' with the pattern [Nn]ick. The part in square brackets is a character class, which means it matches exactly one of the enclosed characters. You can also use ranges in character classes, so [a-c] matches either 'a' or 'b' or 'c'.
The pattern . is special: rather than matching a literal dot only, it matches any character†. It's the same conceptually as the really big character class [-.?+%$A-Za-z0-9...].
Think of character classes as menus: pick just one.
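As a small illustration of the menu idea, here is a sketch in Ruby (the constant name NICK is only for this example):

```ruby
# A character class picks exactly one character from its "menu".
NICK = /[Nn]ick/

puts NICK.match?("Nick")   # true
puts NICK.match?("nick")   # true
puts NICK.match?("Rick")   # false: 'R' is not on the menu

# Ranges work the same way: [a-c] is one of 'a', 'b' or 'c'.
puts "abcd".scan(/[a-c]/).inspect   # ["a", "b", "c"]
```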
Helpful shortcuts
Using . can save you lots of typing, and there are other shortcuts for common patterns. Say you want to match a digit: one way to write that is [0-9]. Digits are a frequent match target, so you could instead use the shortcut \d. Others are \s (whitespace) and \w (word characters: alphanumerics or underscore).
The uppercased variants are their complements, so \S matches any non-whitespace character, for example.
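To see the shortcuts side by side, the following Ruby sketch scans one made-up sample string (SAMPLE is an assumption for illustration) with each of them:

```ruby
SAMPLE = "Room 42, floor_3"

p SAMPLE.scan(/\d/)    # ["4", "2", "3"]            single digits
p SAMPLE.scan(/\w+/)   # ["Room", "42", "floor_3"]  runs of word characters
p SAMPLE.scan(/\S+/)   # ["Room", "42,", "floor_3"] runs of non-whitespace
```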
Once is not enough
From there, you can repeat parts of your pattern with quantifiers. For example, the pattern ab?c matches 'abc' or 'ac' because the ? quantifier makes the subpattern it modifies optional. Other quantifiers are
* (zero or more times)
+ (one or more times)
{n} (exactly n times)
{n,} (at least n times)
{n,m} (at least n times but no more than m times)
Putting some of these blocks together, the pattern [Nn]*ick matches all of
ick
Nick
nick
Nnick
nNick
nnick
(and so on)
The first match demonstrates an important lesson: * always succeeds! Any pattern can match zero times.
A few other useful examples:
[0-9]+ (and its equivalent \d+) matches any non-negative integer
\d{4}-\d{2}-\d{2} matches dates formatted like 2019-01-01
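A quick way to check these claims is to try them in Ruby. The sketch below wraps each pattern in \A and \z (Ruby's start-of-string and end-of-string anchors, which are an addition here) so that the whole input has to match:

```ruby
OPTIONAL_B = /\Aab?c\z/
ANY_NS     = /\A[Nn]*ick\z/
ISO_DATE   = /\A\d{4}-\d{2}-\d{2}\z/

p OPTIONAL_B.match?("ac")      # true
p OPTIONAL_B.match?("abc")     # true
p OPTIONAL_B.match?("abbc")    # false: ? allows at most one 'b'

# * may match zero times, so plain "ick" is accepted too.
p %w[ick Nick nNick].all? { |s| ANY_NS.match?(s) }   # true

p ISO_DATE.match?("2019-01-01")   # true
p ISO_DATE.match?("19-1-1")       # false: wrong digit counts
```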
Grouping
A quantifier modifies the pattern to its immediate left. You might expect 0abc+0 to match '0abc0', '0abcabc0', and so forth, but the pattern immediately to the left of the plus quantifier is c. This means 0abc+0 matches '0abc0', '0abcc0', '0abccc0', and so on.
To match one or more sequences of 'abc' with zeros on the ends, use 0(abc)+0. The parentheses denote a subpattern that can be quantified as a unit. It's also common for regular expression engines to save or "capture" the portion of the input text that matches a parenthesized group. Extracting bits this way is much more flexible and less error-prone than counting indices and substr.
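The difference between quantifying one character and quantifying a group, and the capture that comes with a group, can be sketched in Ruby like this (the "id=12345;" input is invented for the example):

```ruby
UNGROUPED = /\A0abc+0\z/     # + applies to 'c' only
GROUPED   = /\A0(abc)+0\z/   # + applies to the whole group

p UNGROUPED.match?("0abcc0")     # true
p UNGROUPED.match?("0abcabc0")   # false
p GROUPED.match?("0abcabc0")     # true

# Parentheses also capture: group 1 holds the text matched by (\d+).
ID_MATCH = "id=12345;".match(/id=(\d+)/)
p ID_MATCH[1]                    # "12345"
```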
Alternation
Earlier, we saw one way to match either 'Nick' or 'nick'. Another is with alternation as in Nick|nick. Remember that alternation includes everything to its left and everything to its right. Use grouping parentheses to limit the scope of |, e.g., (Nick|nick).
For another example, you could equivalently write [a-c] as a|b|c, but this is likely to be suboptimal because many implementations assume alternatives will have lengths greater than 1.
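The scope of | is easy to get wrong, so here is a small Ruby sketch with a deliberately loose and a properly grouped pattern (both names and the "Hi " prefix are assumptions made up for this example):

```ruby
# Without parentheses, | splits the entire pattern in two:
# either "Hi Nick" at the start, or "nick" at the end.
LOOSE = /\AHi Nick|nick\z/
# With parentheses, only the name varies.
TIGHT = /\AHi (Nick|nick)\z/

p LOOSE.match?("Hello nick")   # true, via the second alternative
p TIGHT.match?("Hello nick")   # false
p TIGHT.match?("Hi nick")      # true
```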
Escaping
Although some characters match themselves, others have special meanings. The pattern \d+ doesn't match backslash followed by lowercase D followed by a plus sign: to get that, we'd use \\d\+. A backslash removes the special meaning from the following character.
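In Ruby, the escaping rule can be tried out as follows. Single-quoted strings are used for the input so that the backslash stays a literal backslash; Regexp.escape is a standard helper that adds the backslashes for you:

```ruby
LITERAL_TEXT = '\d+'          # three characters: backslash, d, plus
LITERAL_RX   = /\\d\+/        # matches exactly those three characters

p LITERAL_RX.match?(LITERAL_TEXT)   # true
p LITERAL_RX.match?("123")          # false
p(/\d+/.match?("123"))              # true: unescaped, \d+ means "digits"

# Regexp.escape builds the escaped source from a literal string.
ESCAPED_RX = Regexp.new(Regexp.escape(LITERAL_TEXT))
p ESCAPED_RX.match?(LITERAL_TEXT)   # true
```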
Greediness
Regular expression quantifiers are greedy. This means they match as much text as they possibly can while allowing the entire pattern to match successfully.
For example, say the input is
"Hello," she said, "How are you?"
You might expect ".+" to match only 'Hello,' and will then be surprised when you see that it matched from 'Hello' all the way through 'you?'.
To switch from greedy to what you might think of as cautious, add an extra ? to the quantifier. Now you understand how \((.+?)\), the example from your question, works. It matches a literal left parenthesis, followed by one or more characters, and terminated by a right parenthesis.
If your input is '(123) (456)', then the first capture will be '123'. Non-greedy quantifiers want to allow the rest of the pattern to start matching as soon as possible.
(As to your confusion, I don't know of any regular-expression dialect where ((.+?)) would do the same thing. I suspect something got lost in transmission somewhere along the way.)
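The greedy and non-greedy behaviour described in this section can be reproduced in Ruby with the same inputs; String#[] with a regex returns the matched text, and an optional second argument selects a capture group:

```ruby
QUOTE = %q{"Hello," she said, "How are you?"}

# Greedy: runs from the first quote to the last one.
p QUOTE[/".+"/] == QUOTE   # true
# Non-greedy: stops at the earliest closing quote.
p QUOTE[/".+?"/]           # "\"Hello,\""

PARENS = "(123) (456)"
p PARENS[/\((.+?)\)/, 1]   # "123"
p PARENS[/\((.+)\)/, 1]    # "123) (456"
```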
Anchors
Use the special pattern ^ to match only at the beginning of your input and $ to match only at the end. Making "bookends" with your patterns where you say, "I know what's at the front and back, but give me everything between" is a useful technique.
Say you want to match comments of the form
-- This is a comment --
you'd write ^--\s+(.+)\s+--$.
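Tried in Ruby, the bookend pattern captures the text between the dashes (COMMENT_RX is a name chosen for this sketch):

```ruby
COMMENT_RX = /^--\s+(.+)\s+--$/

COMMENT_MATCH = COMMENT_RX.match("-- This is a comment --")
p COMMENT_MATCH[1]                        # "This is a comment"
p COMMENT_RX.match("not a comment").nil?  # true: the bookends are missing
```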
Build your own
Regular expressions are recursive, so now that you understand these basic rules, you can combine them however you like.
Tools for writing and debugging regexes:
RegExr (for JavaScript)
Perl: YAPE: Regex Explain
Regex Coach (engine backed by CL-PPCRE)
RegexPal (for JavaScript)
Regular Expressions Online Tester
Regex Buddy
Regex 101 (for PCRE, JavaScript, Python, Golang, Java 8)
I Hate Regex
Visual RegExp
Expresso (for .NET)
Rubular (for Ruby)
Regular Expression Library (Predefined Regexes for common scenarios)
Txt2RE
Regex Tester (for JavaScript)
Regex Storm (for .NET)
Debuggex (visual regex tester and helper)
Books
Mastering Regular Expressions, the 2nd Edition, and the 3rd edition.
Regular Expressions Cheat Sheet
Regex Cookbook
Teach Yourself Regular Expressions
Free resources
RegexOne - Learn with simple, interactive exercises.
Regular Expressions - Everything you should know (PDF Series)
Regex Syntax Summary
How Regexes Work
JavaScript Regular Expressions
Footnote
†: The statement above that . matches any character is a simplification for pedagogical purposes that is not strictly true. Dot matches any character except newline, "\n", but in practice you rarely expect a pattern such as .+ to cross a newline boundary. Perl regexes have a /s switch and Java Pattern.DOTALL, for example, to make . match any character at all. For languages that don't have such a feature, you can use something like [\s\S] to match "any whitespace or any non-whitespace", in other words anything.
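In Ruby specifically, the switch that makes . match newlines is the /m modifier. A short sketch of the three cases from the footnote:

```ruby
TWO_LINES = "first\nsecond"

p TWO_LINES[/.+/]        # "first"          . stops at the newline
p TWO_LINES[/.+/m]       # "first\nsecond"  with /m, . also matches "\n"
p TWO_LINES[/[\s\S]+/]   # "first\nsecond"  the portable workaround
```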

Related

Count Number of Sentence Ruby

I searched everywhere and did not manage to find a way to count the number of sentences in a String using Ruby. Does anyone know how to do it?
Example
string = "The best things in an artist’s work are so much a matter of intuition, that there is much to be said for the point of view that would altogether discourage intellectual inquiry into artistic phenomena on the part of the artist. Intuitions are shy things and apt to disappear if looked into too closely. And there is undoubtedly a danger that too much knowledge and training may supplant the natural intuitive feeling of a student, leaving only a cold knowledge of the means of expression in its place. For the artist, if he has the right stuff in him ... "
This string should return number 4.
You can split the text into sentences and count them. Here:
string.scan(/[^\.!?]+[\.!?]/).map(&:strip).count # scan extracts each sentence; strip removes the surrounding whitespace.
# => 4
Explaining regex:
[^\.!?]
A caret at the start of a character class, [^ ], is the negation operator, which means we are looking for characters that are not in the list: ., ! and ?.
+
is a greedy quantifier that matches the preceding element one or more times (capturing our sentences here and ignoring repetitions like ...).
[\.!?]
matching characters ., ! or ?.
In a nutshell, we are capturing all characters that are not ., ! or ? until we reach a character that is ., ! or ?. Each such run can be treated as a sentence (in a broad sense).
I think it makes sense to consider a word char followed by a ?! or . the delimiter of a sentence:
string.strip.split(/\w[?!.]/).length
#=> 4
So I'm not considering the ... a delimiter when it hangs on its own like that:
"I waited a while ... and then I went home"
But then again, maybe I should...
It also occurs to me that maybe a better delimiter is a punctuation mark followed by some whitespace and a capital letter:
string.split(/[?!.]\s+[A-Z]/).length
#=> 4
Sentences end with full stops, question marks, and exclamation marks. They can also be separated with dashes and other punctuation, but we won’t worry about these rare cases here.
The split is simple. Instead of asking Ruby to split the text on one type of character, you simply ask it to split on any of three types of characters, like so:
txt = "The best things in an artist’s work are so much a matter of intuition, that there is much to be said for the point of view that would altogether discourage intellectual inquiry into artistic phenomena on the part of the artist. Intuitions are shy things and apt to disappear if looked into too closely. And there is undoubtedly a danger that too much knowledge and training may supplant the natural intuitive feeling of a student, leaving only a cold knowledge of the means of expression in its place. For the artist, if he has the right stuff in him ... "
sentence_count = txt.split(/\.|\?|!/).length
puts sentence_count
#=> 7
string.squeeze('.!?').count('.!?')
#=> 4

Is there a method to find the most specific pattern for a string?

I'm wondering whether there is a way to generate the most specific regular expression (if such a thing exists) that matches a given string. Here's an illustration of what I want the method to do:
str = "(17 + 31)"
find_pattern(str)
# => /^\(\d+ \+ \d+\)$/ (or something more specific)
My intuition was to use Regexp.new to accumulate the desired pattern by looping through str and checking for known patterns like \d, \s, and so on. I suspect there is an easy way of doing this.
In essence, this is an algorithmic compression problem. The simplest way to match a list of known strings is the Regexp.union factory method, but that just tries each string in turn; it does not do anything "clever":
combined_rx = Regexp.union( "(17 + 31)", "(17 + 45)" )
=> /\(17\ \+\ 31\)|\(17\ \+\ 45\)/
This can still be useful to construct multi-stage validators, without you needing to write loops to check them all.
However, a generic pattern matcher that could figure out what you mean to match from examples is not really possible. There are too many ways in which you could consider strings to be similar or not. The closest I could think of would be genetic programming where you supply a large list of should match/should not match strings and the code guesses at the best regex by constructing random Regexp objects (a challenge in itself) and seeing how accurately they match and don't match your examples. The best matchers could be combined and mutated and tried again until you got 100% accuracy. This might be a fun project, but ultimately much more effort for most purposes than writing the regular expressions yourself from a description of the problem.
If your problem is heavily constrained - e.g. any example integer could always be replaced by \d+, any example space by \s+ etc, then you could work through the string replacing "matchable units", in fact using the same regular expressions checked in turn. E.g. if you match \A\d+ then consume the match from the string, and add \d+ to your regex. Then take the remainder of the string and look for next matching pattern. Working this way will have its limitations (you must know the full set of patterns you want to match in advance, and all examples would have to be unambiguous). However, it is more tractable than a genetic program.
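A minimal sketch of that constrained approach might look like the following. The TOKENS table and the find_pattern method are hypothetical names, and the choice of just two units (\d+ and \s+) is an assumption; anything not covered by a unit is kept as an escaped literal:

```ruby
# Hypothetical table of "matchable units": a recognizer anchored at the
# start of the remaining input, and the regex source it contributes.
TOKENS = [
  [/\A\d+/, '\d+'],
  [/\A\s+/, '\s+']
].freeze

def find_pattern(str)
  rest = str.dup
  source = String.new
  until rest.empty?
    recognizer, replacement = TOKENS.find { |rx, _| rest.match?(rx) }
    if recognizer
      # Consume the matched unit and emit its generalised form.
      rest = rest.sub(recognizer, '')
      source << replacement
    else
      # No known unit here: keep the character as an escaped literal.
      source << Regexp.escape(rest[0])
      rest = rest[1..-1]
    end
  end
  Regexp.new("\\A#{source}\\z")
end

pattern = find_pattern("(17 + 31)")
puts pattern.source               # \A\(\d+\s+\+\s+\d+\)\z
puts pattern.match?("(17 + 45)")  # true
puts pattern.match?("(a + b)")    # false
```

The order of the table entries matters once units overlap, which is one of the ambiguity limitations mentioned above.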

Inconsistency between engines when using reluctant quantifier in negative look ahead

I found something odd when using a reluctant quantifier in a negative look ahead.
When creating a regex to assert a maximum of 3 uppercase characters, I devised this:
^(?!(.*?[A-Z]){4}).*$
which works on rubular, but not on regex101.
Why is that?
In Ruby, ^ and $ match at the beginning and end of each line.
In most other languages, ^ and $ match only at the beginning and end of the whole string unless multiline mode (m) is specified. (Some regular expression engines also require the g flag to match multiple times.)
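The effect is easy to reproduce in Ruby with a two-line input. The sketch below compares the pattern from the question with a variant that uses \A and \z, Ruby's whole-string anchors (the constant names and the sample input are assumptions for illustration):

```ruby
PER_LINE   = /^(?!(.*?[A-Z]){4}).*$/    # pattern from the question
PER_STRING = /\A(?!(.*?[A-Z]){4}).*\z/  # same check, anchored to the string

INPUT = "ABCD\nabc"

# ^ also matches after the newline, so the second line satisfies it.
p INPUT[PER_LINE]               # "abc"
# \A only tries the very start of the string, where the lookahead rejects.
p PER_STRING.match?(INPUT)      # false

p PER_STRING.match?("ABCd")     # true: only three uppercase letters
p PER_STRING.match?("ABCD")     # false: four uppercase letters
```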

Ruby Regular expression too big / Multiple string match

I have 1,000,000 strings that I want to categorize. The way I do this is to bucket it if it contains a set of words or phrases. The set of words is about 10,000. Ideally I would be able to support regular expressions, but I am focused on making it run fast right now. Example phrases:
ford, porsche, mazda...
I really don't want to match each word against the strings one by one, so I decided to use regular expressions. Unfortunately, I am running into a regular expression issue:
Regexp.new("(a)"*253)
=> /(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)...
Regexp.new("(a)"*254)
RegexpError: regular expression too big: /(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)(a)...
where a would be one of my words or phrases. Right now, I am planning on running 10,000 / 253 matches. I read that the length of the regex heavily impacts performance, but my regex match is really simple and the regexp is created very quickly. I would like to get around the limitation somehow, or use a better solution if anyone has any ideas. Thanks.
You might consider other mechanisms for recognizing 10k words.
Trie: Sometimes called a prefix tree, it is often used by spell checkers for doing word lookups. See Trie on wikipedia
DFA (deterministic finite automaton): A DFA is often created by the lexer in a compiler for recognizing the tokens of the language. A DFA runs very quickly. Simple regexes are often compiled into DFAs. See DFA on wikipedia
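To make the trie suggestion concrete, here is a rough sketch built from nested hashes. PhraseTrie and find_in are names invented for this example, and it matches phrases anywhere in the text without checking word boundaries, so "afford" would count as containing "ford":

```ruby
# A minimal character trie. The :end key marks where a stored phrase
# finishes and holds the phrase itself.
class PhraseTrie
  def initialize(phrases)
    @root = {}
    phrases.each do |phrase|
      node = @root
      phrase.each_char { |ch| node = (node[ch] ||= {}) }
      node[:end] = phrase
    end
  end

  # Returns the first stored phrase found anywhere in text, or nil.
  def find_in(text)
    chars = text.chars
    chars.each_index do |start|
      node = @root
      (start...chars.length).each do |i|
        node = node[chars[i]]
        break unless node
        return node[:end] if node[:end]
      end
    end
    nil
  end
end

CARS = PhraseTrie.new(%w[ford porsche mazda])
p CARS.find_in("used mazda for sale")   # "mazda"
p CARS.find_in("bicycle")               # nil
```

Each string is walked once per start position, and every walk stops as soon as no stored phrase can continue, so the cost does not grow with the number of phrases the way one-by-one matching does.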

Treetop backtracking similar to regex?

Everything I've read suggests Treetop backtracks like regular expressions, but I'm having a hard time making that work.
Suppose I have the following grammar:
grammar TestGrammar
  rule open_close
    '{' .+ '}'
  end
end
This does not match the string {abc}. I suspect that's because the .+ is consuming everything from the letter a onwards. I.e. it's consuming abc} when I only want it to consume abc.
This appears different from what a similar regex does. The regex /{.+}/ will match {abc}. It's my understanding that this is possible because the regex engine backtracks after consuming the closing } as part of the .+ and then failing to match.
So can Treetop do backtracking like that? If so, how?
I know you can use negation to match "anything other than a }." But that's not my intention. Suppose I want to be able to match the string {ab}c}. The tokens I want in that case are the opening {, a middle string of ab}c, and the closing }. This is a contrived example, but it becomes very relevant when working with nested expressions like {a b {c d}}.
Treetop is an implementation of a Parsing Expression Grammar parser. One of the benefits of PEGs is their combination of flexibility, speed, and memory requirements. However, this balancing act has some tradeoffs.
Quoting from the Wikipedia article:
The zero-or-more, one-or-more, and optional operators consume zero or more, one or more, or zero or one consecutive repetitions of their sub-expression e, respectively. Unlike in context-free grammars and regular expressions, however, these operators always behave greedily, consuming as much input as possible and never backtracking. […] the expression (a* a) will always fail because the first part (a*) will never leave any a's for the second part to match.
(Emphasis mine.)
In short: while certain PEG operators can backtrack in an attempt to take another route, the + operator cannot.
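This is not Treetop itself, but Ruby's regex engine can imitate the same "greedy and never backtracking" behaviour with possessive quantifiers (*+ and ++), which may make the effect easier to see. A sketch, using the a* a example from the quote and the braces example from the question:

```ruby
BACKTRACKING = /a*a/    # ordinary greedy star: gives one 'a' back
POSSESSIVE   = /a*+a/   # possessive star: keeps everything it consumed

p BACKTRACKING.match?("aaa")   # true
p POSSESSIVE.match?("aaa")     # false: no 'a' is left for the final a

# The same thing happens with the braces example.
p(/\{.+\}/.match?("{abc}"))    # true
p(/\{.++\}/.match?("{abc}"))   # false: .++ swallowed the closing brace
```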
Instead, in order to match nested sub-expressions, you want to create an alternation between the delimited sub-expression (checked first) followed by the non-expression characters. Something like (untested):
grammar TestGrammar
  rule open_close
    '{' contents '}'
  end
  rule contents
    open_close / non_brackets
  end
  rule non_brackets
    # …
  end
end
