Parenthesis and regular expressions [duplicate] - ruby

I am currently working on an exercise in which I need to locate all the occurring (),[],{}, both grouped or single, and I can't figure out the regular expression. I do not need the text in between. I already filtered them out of my string with this:
string_updated = string.gsub(/([a-zA-Z]|\d+)|\s+/, "")
For example, in this string:
"I can't ( find the } regular ] expression to ) grab these[."
All I want is: (, }, ], ), [.

You should be able to do that with ([\(\[\{]).*([\)\]\}])
Here is a working example: http://regex101.com/r/qH1uF5

You should use the below regex:
([][)(}{])
Explanations:
( # Start capturing group
[ # Match any of these characters
][)(}{ # Desired chars
] # End of char group
) # End of capturing group

What about:
a = "I can't ( find the } regular ] expression to ) grab these[."
brackets = %w|( ) [ ] { }|
puts a.scan(Regexp.union(brackets)).join(', ') #=> (, }, ], ), [
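For comparison, here is a minimal sketch of the character-class approach with String#scan. The constant name BRACKETS and the second sample string are only illustrative; the brackets are escaped inside the class to avoid Ruby's warning about an unescaped ].

```ruby
text = "I can't ( find the } regular ] expression to ) grab these[."

# One character class containing all six bracket characters.
BRACKETS = /[\[\](){}]/

# scan returns every single match, in order of appearance.
singles = text.scan(BRACKETS)
puts singles.join(', ')

# Adding + returns adjacent brackets as one grouped match instead.
groups = "a(( b ]} c".scan(/[\[\](){}]+/)
puts groups.inspect
```

Since scan ignores everything that is not a bracket, the earlier gsub filtering step should not be needed with this approach.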


What do these symbols mean in the RFC docs regarding grammars?

Here are the examples:
Transfer-Encoding = "Transfer-Encoding" ":" 1#transfer-coding
Upgrade = "Upgrade" ":" 1#product
Server = "Server" ":" 1*( product | comment )
delta-seconds = 1*DIGIT
Via = "Via" ":" 1#( received-protocol received-by [ comment ] )
chunk-extension= *( ";" chunk-ext-name [ "=" chunk-ext-val ] )
http_URL = "http:" "//" host [ ":" port ] [ abs_path [ "?" query ]]
date3 = month SP ( 2DIGIT | ( SP 1DIGIT ))
Questions are:
What is the 1#transfer-coding (the 1# regarding the rule transfer-coding)? Same with 1#product.
What does 1 times x mean, as in 1*( product | comment )? Or 1*DIGIT.
What do the brackets mean, as in [ comment ]? The parens (...) group it all, but what about the [...]?
What does the *(...) mean, as in *( ";" chunk-ext-name [ "=" chunk-ext-val ] )?
What do the nested square brackets mean, as in [ abs_path [ "?" query ]]? Nested optional values? It doesn't make sense.
What do 2DIGIT and 1DIGIT mean, and where are they defined?
I may have missed where these are defined, but knowing these would help clarify how to parse the grammar definitions they use in the RFCs.
I get the rest of the grammar notation, just not these few remaining pieces.
Update: Looks like this is a good start.
Square brackets enclose an optional element sequence:
[foo bar]
is equivalent to
*1(foo bar).
Specific Repetition: nRule
A rule of the form:
<n>element
is equivalent to
<n>*<n>element
That is, exactly <n> occurrences of <element>. Thus, 2DIGIT is a
2-digit number, and 3ALPHA is a string of three alphabetic
characters.
Variable Repetition: *Rule
The operator "*" preceding an element indicates repetition. The full
form is:
<a>*<b>element
where <a> and <b> are optional decimal values, indicating at least
<a> and at most <b> occurrences of the element.
Default values are 0 and infinity so that *<element> allows any
number, including zero; 1*<element> requires at least one;
3*3<element> allows exactly 3; and 1*2<element> allows one or two.
But what I'm still missing is what the # means?
Update 2: Found it I think!
#RULE: LISTS
A construct "#" is defined, similar to "*", as follows:
<l>#<m>element
indicating at least <l> and at most <m> elements, each separated
by one or more commas (","). This makes the usual form of lists
very easy; a rule such as '(element *("," element))' can be shown
as "1#element".
Also, what do these mean?
1*2DIGIT
2*4DIGIT
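Going by the rules quoted above, 1*2DIGIT should mean one or two digits and 2*4DIGIT two to four digits. Here is a rough sketch that maps those rules onto Ruby regex quantifiers. The helpers repeat and list_of are hypothetical names, and the list helper is a simplification: it only handles the "at least one" case and ignores any upper bound.

```ruby
DIGIT = '\d'

# <a>*<b>element: at least a, at most b occurrences.
# Defaults are 0 and infinity, as in the quoted rule.
def repeat(element, min: 0, max: nil)
  "(?:#{element}){#{min},#{max}}"
end

# 1#element: one or more elements separated by commas
# (simplified; optional whitespace around each comma is an assumption).
def list_of(element)
  "(?:#{element})(?:\\s*,\\s*(?:#{element}))*"
end

one_or_two  = /\A#{repeat(DIGIT, min: 1, max: 2)}\z/   # 1*2DIGIT
two_to_four = /\A#{repeat(DIGIT, min: 2, max: 4)}\z/   # 2*4DIGIT
exactly_two = /\A#{repeat(DIGIT, min: 2, max: 2)}\z/   # 2DIGIT == 2*2DIGIT
optional    = /\A#{repeat('foo bar', max: 1)}\z/       # [foo bar] == *1(foo bar)
token_list  = /\A#{list_of('[a-z]+')}\z/               # 1#token, with a made-up token rule

puts one_or_two.match?("42")              # true
puts two_to_four.match?("1")              # false
puts token_list.match?("gzip, chunked")   # true
```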

Bash - More efficient way to process csv file than grep [closed]

I have a file (file.txt) with a list of words:
apple
banana
cherry
orange
pineapples
I have a csv file (data.csv) that contains lots of data:
1,"tasty apples",3,5
23,"iphone app",5,12
1,"sour grapes",3,5
23,"banana apple smoothie",5,12
1,"cherries and orange shortage",3,5
23,"apple iphone orange cover",5,12
3,"pineapple cherry bubble gum",13,5
5,"pineapples are best frozen",22,33
I want to append the match from file like this (output.csv):
1,"tasty apples",3,5,""
23,"iphone app",5,12,""
1,"sour grapes",3,5,""
23,"banana apple smoothie",5,12,"apple+banana"
1,"cherries and orange shortage",3,5,"orange"
23,"apple iphone orange cover",5,12,"apple+orange"
3,"pineapple cherry bubble gum",13,5,"cherry"
5,"pineapples are best frozen",22,33,"pineapples"
I can do this with grep, but to do so I have to use a while loop with if statements and process intermediate text files.
The problem with doing this is that file.txt has about 500 lines, and data.csv has 330,000 lines. My script would work, however it may take days to complete.
I'm wondering is there a more efficient way to do this than my method?
Perl to the rescue!
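#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS qw{ csv };
# Read the word list, one word per line.
open my $f1, '<', 'file.txt' or die $!;
my @fruits;
chomp, push @fruits, $_ while <$f1>;
# Remember each word's position so matches can be sorted in file order.
my %order;
@order{@fruits} = 0 .. $#fruits;
# Longest words first, so "pineapples" is tried before "apple".
my $regex = join '|', sort { length $b <=> length $a } @fruits;
csv(
    in    => 'data.csv',
    eol   => "\n",
    on_in => sub {
        # $_[1] is the current row; column 1 holds the quoted text.
        my @matches;
        push @matches, $1 while $_[1][1] =~ /\b($regex)\b/g;
        push @{ $_[1] }, join '+',
            sort { $order{$a} <=> $order{$b} }
            @matches;
    },
);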
Unfortunately, Text::CSV_XS can't quote the last field if it doesn't contain a special character (short of quoting all the fields). If file.txt doesn't contain double quotes or commas, though, you can add the quotes easily:
perl ... | sed 's/,\([^,"]*\)$/,"\1"/'
Is there a reason you want that last field quoted? The "+" has no special meaning in CSV, so does not need quotation and neither does an empty field.
Text::CSV_XS does support quotation of empty fields or quotation of all fields, but not yet quotation of all non-numeric fields.
choroba's answer allows a repeated match in the last field, e.g. "apple+apple+orange", and the OP does not clearly say whether that is wanted. Building on that answer but keeping each match only once, I'd write it like this:
use 5.14.1;
use warnings;
use Text::CSV_XS qw( csv );
use Data::Peek;
chomp (my @fruits = do { local @ARGV = "file.txt"; <> });
my %order;
@order{@fruits} = 0 .. $#fruits;
my $regex = join "|", sort { length $b <=> length $a } @fruits;
csv (
    in          => "data.csv",
    eol         => "\n",
    quote_empty => 1,
    on_in       => sub {
        # The hash keys deduplicate the matches before sorting.
        push @{$_[1]}, join "+" =>
            sort { $order{$a} <=> $order{$b} }
            keys %{{map { $_ => 1 }
                ($_[1][1] =~ m/\b($regex)\b/g)}};
    },
);
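The matching step in both Perl answers is one alternation built from the word list (longest word first), wrapped in \b word boundaries, with the matches sorted back into file order. Here is a rough Ruby sketch of just that step. The names WORDS, ORDER, WORD_RE and matched_words are made up for illustration, the word list is inlined instead of read from file.txt, and CSV parsing is left out so the snippet stays self-contained. Like the second answer, it keeps each match only once.

```ruby
# Words from file.txt, in file order (inlined for the sketch).
WORDS = %w[apple banana cherry orange pineapples].freeze

# word => position in the list, used to sort matches back into file order.
ORDER = WORDS.each_with_index.to_h

# One alternation, longest words first, anchored on word boundaries
# so "apples" and "pineapple" do not count as matches for "apple".
WORD_RE = /\b(?:#{Regexp.union(WORDS.sort_by { |w| -w.length }).source})\b/

# Returns the value for the extra column, e.g. "apple+banana" or "".
def matched_words(text)
  text.scan(WORD_RE).uniq.sort_by { |w| ORDER[w] }.join('+')
end

[
  "tasty apples",
  "banana apple smoothie",
  "apple iphone orange cover",
  "pineapple cherry bubble gum",
  "pineapples are best frozen"
].each do |description|
  puts %("#{description}","#{matched_words(description)}")
end
```

Because the regex is compiled once and each row is scanned a single time, the work grows with the number of rows rather than rows times words, which is where the grep loop loses its time.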

Regex to match pipes not within brackets or braces

I am trying to parse some wiki markup. For example, the following:
{{Infobox
| person
| name = Joe
| title = Ruler
| location = [[United States|USA]] | height = {{convert|12|m|abbr=on}}
| note = <ref>{{cite book|title= Some Book}}</ref>
}}
is a typical input. I first remove the starting {{ and ending }}, so I can assume those are gone.
I want to do .split(<regex>) on the string to split the string by all | characters that are not within braces or brackets. The regex needs to ignore the | characters in [[United States|USA]], {{convert|12|m|abbr=on}}, and {{cite book|title= Some Book}}. The expected result is:
[
'person',
'name = Joe',
'title = Ruler',
'location = [[United States|USA]]',
'height = {{convert|12|m|abbr=on}}',
'note = <ref>{{cite book|title= Some Book}}</ref>'
]
There can be line breaks at any point, so I can't just look for \n|. If there is extra white space in it, that is fine. I can easily strip out extra \s* or \n*.
You could split on:
\s*\|\s*(?![^{\[]*[]}])
Breakdown:
\s*\|\s*    Match a pipe with any leading or trailing whitespace
(?!         Start of negative lookahead
[^{\[]*     Match anything except { and [, as much as possible
[]}]        Up to a closing ] or }
)           End of negative lookahead
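The negative lookahead asserts that, scanning forward from the pipe, we do not reach a closing } or ] before passing an opening { or [. If we do, the pipe sits inside a bracketed or braced group and is skipped.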
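Here is a small Ruby sketch of that split, run on a shortened version of the sample input. The constant name SPLIT_RE is arbitrary, and the ] inside the last character class is escaped to avoid a Ruby warning. One caveat to assume: a template nested inside another template's arguments, such as {{a|{{b}}|c}}, would still be split incorrectly, since the lookahead only inspects the next bracket it encounters.

```ruby
SPLIT_RE = /\s*\|\s*(?![^{\[]*[\]}])/

text = <<~WIKI
  | person
  | name = Joe
  | location = [[United States|USA]] | height = {{convert|12|m|abbr=on}}
WIKI

# The leading pipe produces an empty first element, so drop empties
# and trim the trailing newline left on the last field.
fields = text.split(SPLIT_RE).map(&:strip).reject(&:empty?)
puts fields.inspect
```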
I literally stole the regex from @WiktorStribiżew, but this should work for your input string:
regex = (/\w+(?:\s*=\s*(?:\[\[[^\]\[]*]]|{{[^{}]*}}|[^|{\[])*)?/)
arr = str.scan(regex).map{|l| l.strip.delete("\n")}[1..-1]
arr is now the array you've requested.

Create regular expression from Array of search terms ruby

Is there a way (or a gem) to create regular expressions from some basic search parameters?
e.g.
Search = ["\"German Shepherd\"","Collie","poodle", "Miniature Schnauzer"]
Such that the regexp will search (case insensitively) for:
"German Shepherd" - exactly
OR
"Collie"
OR
"poodle"
OR
"Miniature" AND "Schnauzer"
So in this case something like:
/German\ Shepherd|Collie|poodle|(?=.*Miniature)(?=.*Schnauzer).+/i
(Open to suggestions of better ways of doing the last bit...)
If I understood the question properly, here you go:
regexps = ["\"German Shepherd\"","Collie","poodle", "Miniature Schnauzer"]
# those in quotes
greedy = regexps.select { |re| re =~ /\A['"].*['"]\z/ } # c'"mon, parser
# the rest unquoted
non_greedy = (regexps - greedy).map(&:split).flatten
# concatenating... ⇓⇓⇓ get rid of quotes
all = Regexp.union(non_greedy + greedy.map { |re| re[1...-1] })
#⇒ /Collie|poodle|Miniature|Schnauzer|German\ Shepherd/
UPD
I finally got what is to be done with Miniature Schnauzer (please see a comment below for further explanation.) That said, these words are to be permuted and joined with non-greedy .*?:
non_greedy = (regexps - greedy).map(&:split).map do |re|
# single word? YES : NO, permute and join
re.length < 2 ? re : re.permutation.map { |p| Regexp.new p.join('.*?') }
end.flatten
all = Regexp.union(non_greedy + greedy.map { |re| re[1...-1] })
#=> /Collie|poodle|(?-mix:Miniature.*?Schnauzer)|(?-mix:Schnauzer.*?Miniature)|German\ Shepherd/
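One thing to watch: Regexp.union does not add the i flag, and the embedded (?-mix:...) groups would switch it off again even if the outer regex had it. Here is a sketch of the same idea built from plain pattern strings, so the whole expression can be compiled case-insensitively. The method name build_search_regex is hypothetical.

```ruby
def build_search_regex(terms)
  sources = terms.flat_map do |term|
    if (quoted = term.match(/\A["'](.*)["']\z/))
      # Quoted term: match the phrase exactly as written.
      [Regexp.escape(quoted[1])]
    else
      # Unquoted term: every word must appear, in any order.
      # A single word yields one permutation, i.e. the word itself.
      words = term.split.map { |w| Regexp.escape(w) }
      words.permutation.map { |perm| perm.join('.*?') }
    end
  end
  Regexp.new(sources.join('|'), Regexp::IGNORECASE)
end

search = ["\"German Shepherd\"", "Collie", "poodle", "Miniature Schnauzer"]
re = build_search_regex(search)

puts re.match?("a german shepherd puppy")   # true
puts re.match?("Schnauzer (miniature)")     # true
puts re.match?("Miniature pinscher")        # false
```

The number of permutations grows factorially with the number of words in a term, so this is only reasonable for short terms.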

Ruby gsub / regex with several arguments [duplicate]

I'm new to Ruby and I'm trying to solve a problem.
I'm parsing several text fields and want to remove the header, which can have different values. It works fine when the header is always the same:
variable = variable.gsub(/(^Header_1:$)/, '')
But when I put in several arguments it doesn't work:
variable = variable.gsub(/(^Header_1$)/ || /(^Header_2$)/ || /(^Header_3$)/ || /(^Header_4$)/ || /^:$/, '')
You can use Regexp.union:
regex = Regexp.union(
/^Header_1/,
/^Header_2/,
/^Header_3/,
/^Header_4/,
/^:$/
)
variable.gsub(regex, '')
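As for why the || version does nothing useful: a Regexp object is truthy, so a || b evaluates to a, and gsub only ever sees the first pattern. A minimal sketch (variable names are illustrative):

```ruby
first  = /^Header_1$/
second = /^Header_2$/

# `||` is a boolean operator, not a regex alternation:
# the first operand is truthy, so the second is never used.
either = first || second
puts either.equal?(first)   # true

text = "Header_1\nHeader_2\nbody"

only_first = text.gsub(either, '')
both       = text.gsub(Regexp.union(first, second), '')

puts only_first.inspect   # "\nHeader_2\nbody"
puts both.inspect         # "\n\nbody"
```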
Please note that ^something$ will not match a line that contains anything more than something :)
That is because ^ matches the beginning of a line and $ matches the end of a line (in Ruby these are line anchors; \A and \z anchor the whole string).
So I intentionally removed the $.
Also, you do not need the parentheses when you only want to remove the matched string.
You can also use it like this:
headers = %w[Header_1 Header_2 Header_3]
regex = Regexp.union(*headers.map{|s| /^#{s}/}, /^\:$/, /etc/)
variable.gsub(regex, '')
And of course you can remove headers without explicitly defining them.
Most likely there is whitespace after the headers?
If so, you can do it as simply as:
variable = "Header_1 something else"
puts variable.gsub(/(^Header[^\s]*)?(.*)/, '\2')
#=> something else
variable = "Header_BLAH something else"
puts variable.gsub(/(^Header[^\s]*)?(.*)/, '\2')
#=> something else
Just use a proper regexp:
variable.gsub(/^(Header_1|Header_2|Header_3|Header_4|:)$/, '')
If the header is always the same format of Header_n, where n is some integer value, then you can simplify your regex greatly:
/Header_\d+/
will find every one of these:
%w[Header_1 Header_2 Header_3].grep(/Header_\d+/)
[
[0] "Header_1",
[1] "Header_2",
[2] "Header_3"
]
Tweaking it to handle finding words, not substrings:
/^Header_\d+$/
or:
/\bHeader_\d+\b/
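To make the difference between the three patterns concrete, here is a small sketch with a few made-up sample strings. It also shows that ^ and $ anchor lines in Ruby, while \A and \z anchor the whole string.

```ruby
samples = ["Header_1", "XHeader_22", "Header_3_old", "see Header_4 here"]

substring = samples.grep(/Header_\d+/)      # matches anywhere in the string
whole     = samples.grep(/^Header_\d+$/)    # the entire line must be the header
word      = samples.grep(/\bHeader_\d+\b/)  # a standalone word inside longer text

puts substring.inspect   # all four samples
puts whole.inspect       # ["Header_1"]
puts word.inspect        # ["Header_1", "see Header_4 here"]

# ^ and $ are line anchors, so a header on its own line inside a
# multi-line string still matches; \A and \z require the whole string.
multi = "intro\nHeader_5\nbody"
puts multi.match?(/^Header_\d+$/)     # true
puts multi.match?(/\AHeader_\d+\z/)   # false
```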
As mentioned, using Regexp.union is a good start, but, used blindly, can result in very slow or inefficient patterns, so think ahead and help out the engine by giving it useful sub-patterns to work with:
values = %w[foo bar]
/Header_(?:\d+|#{ values.join('|') })/
=> /Header_(?:\d+|foo|bar)/
Unfortunately, Ruby doesn't have an equivalent of Perl's Regexp::Assemble module, which can build highly optimized patterns from big lists of words. Search here on Stack Overflow for examples of what it can do. For instance:
use Regexp::Assemble;
my @values = ('Header_1', 'Header_2', 'foo', 'bar', 'Header_3');
my $ra = Regexp::Assemble->new;
foreach (@values) {
    $ra->add($_);
}
print $ra->re, "\n";
=> (?-xism:(?:Header_[123]|bar|foo))
