grep wildcards issue ubuntu - bash

I have an input file named test which looks like this
leonid sergeevich vinogradov
ilya alexandrovich svintsov
When I run grep 'leonid*vinogradov' test it prints nothing, but when I run grep 'leonid.*vinogradov' test it prints the first line. What's the difference between * and .*? I see no difference between "any number of any characters" and "any character followed by any number of any characters".
I use Ubuntu 14.04.3.

Unlike in a file glob, * in a regular expression does not match any number of characters. It is a quantifier that means zero or more matches of the preceding character. The regular expression leonid*vinogradov therefore requires a v to appear immediately after leoni and zero or more d characters. The . is the regular expression metacharacter representing any single character, so .* matches zero or more arbitrary characters.

grep uses regular expressions, and .* matches zero or more of any character.
'leonid*vinogradov' is also evaluated as a regex, and it means leoni followed by zero or more of the letter d and then vinogradov, hence your match fails.

It's regular expressions ("regexp" for short) that grep uses, not the wildcards you were thinking of. In this case, "." means any character and "*" means any number (including zero) of the previous character, so ".*" means anything here.
Check the link, or google it; it's a powerful tool that you'll find worth knowing.
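A quick way to see the difference is to run both patterns against the sample file and count the matching lines. This is only a sketch: the temporary file and the variable names (tmp, star_only, dot_star, repeats) are made up for illustration, and the `|| true` is there only because grep exits non-zero when nothing matches.

```shell
# Recreate the two-line sample file from the question in a temporary file.
tmp=$(mktemp)
printf '%s\n' 'leonid sergeevich vinogradov' 'ilya alexandrovich svintsov' > "$tmp"

# d* = zero or more "d", so this needs "leoni", optional d's, then "vinogradov".
star_only=$(grep -c 'leonid*vinogradov' "$tmp" || true)

# .* = zero or more of any character, so the middle name can sit in between.
dot_star=$(grep -c 'leonid.*vinogradov' "$tmp" || true)

# What the first pattern does match: the d repeated zero or more times.
repeats=$(printf '%s\n' 'leonivinogradov' 'leonidddvinogradov' | grep -c 'leonid*vinogradov' || true)

echo "star only: $star_only, dot star: $dot_star, repeats: $repeats"
rm -f "$tmp"
```

The first count is 0 and the second is 1, which matches what the question observed.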

Related

What is the meaning of this BASH SED command?

Example of tnum ... HYH19986_T_DRIVER_BAG_PRESSURE__78ms_546ms
tnum=`echo $1 | sed -e 's/_.*$//'`
The end result is that tnum will eventually become HYH19986. I have absolutely no experience with Bash, but a quick search found that sed is the stream editor and essentially a find-and-replace tool.
Please could someone explain to me what everything means from the -e onwards? Thank you.
Sed is the "stream editor". It is a non-interactive text editor that takes commands to edit text. Its most commonly used command is "s", short for "substitute". This takes two expressions and optionally some options, and replaces the first expression with the second one.
The character after the "s" is the delimiter - it separates the expressions. Typically this is "/", but if you are working e.g. with paths it might be nicer to use something different like : or _ so you don't need to escape every /.
The _.*$ is a regular expression. Sed matches this, and replaces it with the second expression, the bit between the second and third slash, i.e. nothing in this case.
_ is a literal underscore, .* is "any number of any characters" and $ is the end of the line.
After that third slash you could also give options, like "g" (I remember it as "global"), which would cause the substitution to be applied to every match on the line instead of just the first. That's missing here, but the expression matches to the end of the line anyway, so nothing would change.
So this replaces everything from the first underscore onward with nothing, which trims it off.
s/pattern/repl/ replaces the first occurrence of the pattern with the string repl. _.*$ matches a literal _ followed by the longest string of zero or more of any character (.*) up to the end of the line ($). So this just deletes everything from and including the first underscore to the end of the line.
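To try this out without the surrounding script, here is a small sketch that uses the example value from the question in place of $1. The variable names (name, tnum_alt, tnum_pe) are only for illustration, and the last line is a different technique (shell parameter expansion) that happens to give the same result for this input.

```shell
# Stand-in for $1 from the original script.
name='HYH19986_T_DRIVER_BAG_PRESSURE__78ms_546ms'

# Same command as in the question, with $(...) instead of backticks.
tnum=$(printf '%s\n' "$name" | sed -e 's/_.*$//')

# Same substitution with ":" as the delimiter instead of "/".
tnum_alt=$(printf '%s\n' "$name" | sed -e 's:_.*$::')

# Alternative without sed: strip the longest suffix that starts with "_".
tnum_pe=${name%%_*}

echo "$tnum $tnum_alt $tnum_pe"
```

All three variables end up holding HYH19986.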

repeating regex to match mathematical symbol then number fails

I am trying to match mathematical expressions like 1+2 and 1*2/3.... to infinity. Can someone explain why my regex matches the final case below, and how to fix it so that it matches only valid expressions (that might stretch forever)?
perms=["12+2*4","2+2","-2+","12+34-"]
perms.each do |line|
puts "#{line}=#{eval(line)}" if line =~ /^\d+([+-\/*]\d+){1,}/
end
I expected the output to be:
12+2*4=20
2+2=4
Inside a [character set], the - character defines a range of characters -- think of [a-z] or [0-9]. If you want to match a literal -, it must be the first or last character.
/^\d+(?:[+\/*-]\d+)+$/
Other things: {1,} is exactly +; and you need to anchor at the end too, so you don't match 1+2+
You should finalize your expression with $ to match the entire input string:
/^\d+([-+\/*]\d+){1,}$/
The wrong position of the hyphen - is one source of error in your expression. The missing $ is the other.
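Both fixes can be checked from a shell with grep -E. This is a sketch under a few assumptions: it uses POSIX extended regular expressions rather than Ruby's engine, so [0-9] stands in for \d and the / needs no backslash inside the brackets. The variable names (valid, accepted, range_hit) are made up for the example.

```shell
# POSIX ERE version of the corrected pattern: hyphen last in the bracket
# expression, anchored at both ends.
valid='^[0-9]+([+*/-][0-9]+)+$'

accepted=''
for expr in '12+2*4' '2+2' '-2+' '12+34-'; do
  if printf '%s\n' "$expr" | grep -Eq "$valid"; then
    accepted="$accepted $expr"
  fi
done
accepted=${accepted# }
echo "accepted: $accepted"

# A hyphen in the middle forms a range: in the C locale, + through / also
# covers the comma, so "1,2" slips through.
range_hit=$(printf '%s\n' '1,2' | LC_ALL=C grep -Ec '^[0-9]+[+-/][0-9]+$' || true)
echo "1,2 matched by [+-/]: $range_hit"
```

Only 12+2*4 and 2+2 are accepted; the trailing operator in 12+34- is rejected because of the $ anchor.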

Why does this regex run differently in sed than in Perl/Ruby?

I have a regex that gives me one result in sed but another in Perl (and Ruby).
I have the string one;two;;three and I want to highlight the substrings delimited by the ;. So I do the following in Perl:
$a = "one;two;;three";
$a =~ s/([^;]*)/[\1]/g;
print $a;
(Or, in Ruby: print "one;two;;three".gsub(/([^;]*)/, "[\\1]").)
The result is:
[one][];[two][];[];[three][]
(I know the reason for the spurious empty substrings.)
Curiously, when I run the same regexp in sed I get a different result. I run:
echo "one;two;;three" | sed -e 's/[^;]*/[\0]/g'
and I get:
[one];[two];[];[three]
What is the reason for this different result?
EDIT:
Somebody replied "because sed is not perl". I know that. The reason I'm asking my question is because I don't understand how sed copes so well with zero-length matches.
This is an interesting and surprising edge case.
Your [^;]* pattern may match the empty string, so it becomes a philosophy question, viz., how many empty strings are between two characters: zero, one, or many?
sed
The sed match clearly follows the philosophy described in the “Advancing After a Zero–Length Regex Match” section of “Zero–Length Regex Matches.”
Now the regex engine is in a tricky situation. We’re asking it to go through the entire string to find all non–overlapping regex matches. The first match ended at the start of the string, where the first match attempt began. The regex engine needs a way to avoid getting stuck in an infinite loop that forever finds the same zero-length match at the start of the string.
The simplest solution, which is used by most regex engines, is to start the next match attempt one character after the end of the previous match, if the previous match was zero–length.
That is, zero empty strings are between characters.
The above passage is not an authoritative standard, and quoting such a document instead would make this a better answer.
Inspecting the source of GNU sed, we see
/* Start after the match. last_end is the real end of the matched
substring, excluding characters that were skipped in case the RE
matched the empty string. */
start = offset + matched;
last_end = regs.end[0];
Perl and Ruby
Perl’s philosophy with s///, which Ruby seems to share—so the documentation and examples below use Perl to represent both—is there is exactly one empty string after each character.
The “Regexp Quote–Like Operators” section of the perlop documentation reads
The /g modifier specifies global pattern matching—that is, matching as many times as possible within the string.
Tracing execution of s/([^;]*)/[\1]/g gives
Start. The “match position,” denoted by ^, is at the beginning of the target string.
o n e ; t w o ; ; t h r e e
^
Attempt to match [^;]*.
o n e ; t w o ; ; t h r e e
      ^
Note that the result captured in $1 is one.
Attempt to match [^;]*.
o n e ; t w o ; ; t h r e e
      ^
Important Lesson: The * regex quantifier always succeeds because it means “zero or more.” In this case, the substring in $1 is the empty string.
The rest of the match proceeds as in the above.
Being a perceptive reader, you now ask yourself, “Self, if * always succeeds, how does the match terminate at the end of the target string, or for that matter, how does it get past even the first zero–length match?”
We find the answer to this incisive question in the “Repeated Patterns Matching a Zero–length Substring” section of the perlre documentation.
However, long experience has shown that many programming tasks may be significantly simplified by using repeated subexpressions that may match zero–length substrings. Here’s a simple example being:
@chars = split //, $string; # // is not magic in split
($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
Thus Perl allows such constructs, by forcefully breaking the infinite loop. The rules for this are different for lower–level loops given by the greedy quantifiers *+{}, and for higher-level ones like the /g modifier or split operator.
…
The higher–level loops preserve an additional state between iterations: whether the last match was zero–length. To break the loop, the following match after a zero–length match is prohibited to have a length of zero. This prohibition interacts with backtracking … and so the second best match is chosen if the best match is of zero length.
Other Perl approaches
With the addition of a negative lookbehind assertion, you can filter the spurious empty matches.
$ perl -le '$a = "one;two;;three";
$a =~ s/(?<![^;])([^;]*)/[\1]/g;
print $a;'
[one];[two];[];[three]
Apply what Mark Dominus dubbed Randal’s Rule, “Use capturing when you know what you want to keep. Use split when you know what you want to throw away.” You want to throw away the semicolons, so your code becomes more direct with
$ perl -le '$a = "one;two;;three";
$a = join ";", map "[$_]", split /;/, $a;
print $a;'
[one];[two];[];[three]
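The same "split on what you want to throw away" idea can be sketched in plain POSIX shell. This is not a translation of the Perl above, just an illustration of the approach: it relies on field splitting with IFS set to ";" and overwrites the positional parameters, and the variable names (input, old_ifs, result) are made up.

```shell
input='one;two;;three'

# Split on ";" only: set IFS, turn off globbing, and let the unquoted
# expansion do the splitting. Note that this replaces "$@".
old_ifs=$IFS
IFS=';'
set -f
set -- $input
set +f
IFS=$old_ifs

# Wrap every field (including the empty one) in brackets and re-join with ";".
result=''
for field in "$@"; do
  result="${result}[${field}];"
done
result=${result%;}
echo "$result"
```

Because each ";" delimits a field, the empty field between the two adjacent semicolons is kept.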
From the source code for sed-4.2 for the substitute function:
/sed/execute.c
/* If we're counting up to the Nth match, are we there yet?
And even if we are there, there is another case we have to
skip: are we matching an empty string immediately following
another match?
This latter case avoids that baaaac, when passed through
s,a*,x,g, gives `xbxxcx' instead of xbxcx. This behavior is
unacceptable because it is not consistently applied (for
example, `baaaa' gives `xbx', not `xbxx'). */
This indicates that the behavior we see in Ruby and Perl was consciously avoided in sed. This is not due to any fundamental difference between the languages but a result of special handling in sed.
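The behavior described in that source comment is easy to reproduce. The sketch below assumes GNU sed (other sed implementations may treat empty matches differently), and the variable names (bracketed, xs) are only for illustration.

```shell
# An empty match immediately after a previous match is skipped, so no
# stray [] appears after "one", "two" or "three".
bracketed=$(echo 'one;two;;three' | sed -e 's/[^;]*/[&]/g')
echo "$bracketed"

# The example from the source comment: baaaac through s,a*,x,g.
xs=$(echo 'baaaac' | sed -e 's,a*,x,g')
echo "$xs"
```

The first command prints [one];[two];[];[three] and the second prints xbxcx, as the comment says.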
There's something else going on in the perl (and presumably ruby) scripts as that output makes no sense for simply handling the regexp as a BRE or ERE.
awk (EREs) and sed (BREs) behave as they should for just doing an RE replacement:
$ echo "one;two;;three" | sed -e 's/[^;]*/[&]/g'
[one];[two];[];[three]
$ echo "one;two;;three" | awk 'gsub(/[^;]*/,"[&]")'
[one];[two];[];[three]
You said "I know the reason for the spurious empty substrings." Care to clue us in?

bash copy file where some of the filename is not known

In a bash script I want to copy a file, but the file name will change over time.
The start and end of the file name will, however, stay the same.
Is there a way to get the file like so:
cp start~end.jar
where ~ can be anything?
The cp command would be run in a bash script on an Ubuntu machine, if this makes any difference.
A glob (start*end) will give you all matching files.
Check out the Expansion > Pathname Expansion > Pattern Matching section of the bash manual for more specific control
* Matches any string, including the null string.
? Matches any single character.
[...] Matches any one of the enclosed characters. A pair of characters separated by a hyphen denotes a range expression; any character that sorts between those two characters, inclusive, using the current locale's collating sequence and character set, is matched. If the first character following the [ is a ! or a ^ then any character not enclosed is matched. The sorting order of characters in range expressions is determined by the current locale and the value of the LC_COLLATE shell variable, if set. A - may be matched by including it as the first or last character in the set. A ] may be matched by including it as the first character in the set.
and if you enable extglob:
?(pattern-list)
Matches zero or one occurrence of the given patterns
*(pattern-list)
Matches zero or more occurrences of the given patterns
+(pattern-list)
Matches one or more occurrences of the given patterns
@(pattern-list)
Matches one of the given patterns
!(pattern-list)
Matches anything except one of the given patterns
Use a glob to capture the variable text:
cp start*end.jar
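Here is a self-contained sketch of the glob in action. The temporary directory layout and the file names (start-1.2.3-end.jar, other.jar) are invented for the example; in a real script you would use your own source and destination paths.

```shell
# Throwaway directories so the example does not touch real files.
workdir=$(mktemp -d)
mkdir "$workdir/src" "$workdir/dest"
touch "$workdir/src/start-1.2.3-end.jar" "$workdir/src/other.jar"

# Leave the * unquoted so the shell expands it; quote the variable parts.
cp "$workdir"/src/start*end.jar "$workdir/dest/"

copied=$(ls "$workdir/dest")
echo "$copied"
rm -rf "$workdir"
```

If several files match the pattern, cp copies all of them into the destination directory. If nothing matches, the shell by default passes the pattern through unchanged and cp reports that the file does not exist.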

count quotes in a string that do not have a backslash before them

Hey, I'm trying to use a regex to count the number of quotes in a string that are not preceded by a backslash.
For example, the following strings:
"\"Some text
"\"Some \"text
The code I have was previously using String#count('"'),
which is obviously not good enough.
When I count the quotes in both of these examples, I need the result to be 1.
I have been searching here for similar questions and I've tried using lookbehinds, but cannot get them to work in Ruby.
I have tried the following regexes on Rubular from this previous question:
/[^\\]"/
^"((?<!\\)[^"]+)"
^"([^"]|(?<!\)\\")"
None of them give me the results I'm after.
Maybe a regex is not the way to do that. Maybe a programmatic approach is the solution.
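How about string.count('"') - string.scan('\\"').size? (String#count treats its argument as a set of characters, so it cannot count the two-character sequence \" on its own.)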
result = subject.scan(
  /(?:        # match either
      ^       # start-of-string\/line
    |         # or
      \G      # the position where the previous match ended
    |         # or
      [^\\]   # one non-backslash character
    )         # then
    (?:\\\\)* # match an even number of backslashes (0 is even, too);
              # non-capturing, so scan returns whole matches
    "         # match a quote
  /x)
gives you an array of all quote characters (possibly with a preceding non-backslash character) except escaped ones.
The \G anchor is needed to match successive quotes, and the (\\\\)* makes sure that backslashes are only counted as escaping characters if they occur in odd numbers before the quote (to take Amarghosh's correct caveat into account).
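For comparison, the same count can be sketched from a shell without any lookbehind. This is a different technique from the regex above, not a port of it: it deletes every backslash together with the character it escapes and then counts the quotes that are left. The function name count_unescaped_quotes and the test strings are made up for the example.

```shell
# Count double quotes that are not escaped by a backslash.
# sed removes each backslash plus the character after it, working left to
# right, so a pair like \\ is consumed as a unit and a quote after an even
# number of backslashes still counts.
count_unescaped_quotes() {
  n=$(printf '%s\n' "$1" | sed 's/\\.//g' | tr -cd '"' | wc -c)
  echo "$(( n ))"
}

first=$(count_unescaped_quotes '"\"Some text')
second=$(count_unescaped_quotes '"\"Some \"text')
even=$(count_unescaped_quotes 'a\\"b')
echo "$first $second $even"
```

Both strings from the question give 1, and so does the third string, where the quote follows an escaped backslash rather than being escaped itself.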

Resources