bash - proper use of grep's infix operator "|" - bash

I'm having a hard time discovering which method is best...
(debian latest, gnome)
This works;
$ ls -1 | grep "JPG|RAF|TIF"
I am trying to replace the "JPG|RAF|TIF" with a variable
e.g.
$ xFILTER="JPG RAF TIF"
I've tried to assign to the xFILTER variable...
xFILTER="JPG\|RAF\|TIF"
xFILTER="\"JPG|RAF|TIF\""
xFILTER="JPG\nRAF\nTIF"
$ ls -1 | grep -E "$xFILTER"
$ ls -1 | grep -e $xFILTER
$ ls -1 | grep -E "$(echo "xFILTER" | tr ' ' '|')"
Could someone please direct me towards a more sensible approach ?
Thank you.

This should work:
xFILTER="JPG|RAF|TIF"
ls -1 | grep -E "$xFILTER"
However, a word of caution: parsing ls output is unreliable when file names can contain spaces or newlines. Look into shopt -s extglob to enable extended globbing and select your files with extended patterns instead.
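The extended-globbing route mentioned above could look roughly like this. It is a minimal sketch, assuming bash; the helper name list_by_ext and its arguments are hypothetical, and only the JPG|RAF|TIF list comes from the question.

```shell
#!/usr/bin/env bash
# Sketch: select files by extension with an extended glob instead of parsing ls.
# extglob enables @(a|b|c); nullglob makes a non-matching glob expand to nothing.
shopt -s extglob nullglob

# list_by_ext DIR EXTS  (hypothetical helper)
# EXTS is a |-separated list such as "JPG|RAF|TIF".
list_by_ext() {
    local dir=$1 exts=$2 f
    local pattern="*.@($exts)"
    # $pattern is deliberately unquoted so that it is expanded as a glob.
    for f in "$dir"/$pattern; do
        printf '%s\n' "${f##*/}"
    done
}

# Example: list_by_ext . 'JPG|RAF|TIF'
```

Because the names come straight from the glob, spaces in file names need no special handling.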

You're running into the different regular expressions syntaxes that grep supports.
Quoting the man page,
In basic regular expressions the metacharacters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).
I'm surprised that you say your first example works.
Use one of the following:
xFILTER="JPG\|RAF\|TIF" ; ls -1 | grep "$xFILTER"
xFILTER="JPG|RAF|TIF" ; ls -1 | grep -E "$xFILTER"
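If the list has to stay space-separated, as in the question's xFILTER="JPG RAF TIF", one option is to convert it with bash parameter expansion before handing it to grep -E. This is only a sketch; the variable name pattern and the sample file names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: turn a space-separated list into an ERE alternation.
xFILTER="JPG RAF TIF"
pattern=${xFILTER// /|}     # replace every space with |

# Sample names stand in for real directory contents (illustrative only).
printf '%s\n' one.JPG two.txt three.TIF | grep -E -- "$pattern"
```

Quoting "$pattern" keeps the shell from touching the | characters, and -E makes grep treat them as alternation.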

How to subset filenames in bash `ls` output

So I have a list of files that start with a date (yyyymmdd) followed by different endings. I would like to extract all the dates, find the unique ones, and count them. So something like this pseudocode:
ls | grep 'file[0:8]' | unique | wc -l
But this obviously doesn't work. Is there a reasonably easy way to do this?
The data I have looks something like this:
20160124_vv_iw2.slc 20170118_vv_iw2.slc.tops_par 20190120_vv_iw2.slc.par
20160124_vv_iw2.slc.par 20170915_vv_iw2.slc 20190120_vv_iw2.slc.tops_par
20160124_vv_iw2.slc.tops_par 20170915_vv_iw2.slc.par 20200911_vv_iw2.slc
20160827_vv_iw2.slc 20170915_vv_iw2.slc.tops_par 20200911_vv_iw2.slc.par
20160827_vv_iw2.slc.par 20180113_vv_iw2.slc 20200911_vv_iw2.slc.tops_par
20160827_vv_iw2.slc.tops_par 20180113_vv_iw2.slc.par 20200923_vv_iw2.slc
20170118_vv_iw2.slc 20180113_vv_iw2.slc.tops_par 20200923_vv_iw2.slc.par
20170118_vv_iw2.slc.par 20190120_vv_iw2.slc 20200923_vv_iw2.slc.tops_par
Don't use ls in scripts.
printf "%-8.8s\n" * | uniq | wc -l
More generally, you could do something like
for file in *; do
    echo "${file:0:8}"
done | uniq | wc -l
Like any line-oriented approach, this will break if you have file names with newlines in them.
If you just want to split at the first underscore, "${file%%_*}" does that.
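Putting the underscore split into a small function might look like the sketch below. The name count_prefixes is hypothetical, and sort -u is used instead of uniq so the result does not depend on the order of the input.

```shell
#!/usr/bin/env bash
# Sketch: count the distinct prefixes (text before the first "_")
# among the file names in a directory, without parsing ls.
count_prefixes() {
    local dir=$1 f name
    for f in "$dir"/*; do
        [ -e "$f" ] || continue          # skip the literal pattern if nothing matches
        name=${f##*/}                    # strip the directory part
        printf '%s\n' "${name%%_*}"      # keep everything before the first "_"
    done | sort -u | wc -l
}

# Example: count_prefixes .
```

As with the loop above, this is line-oriented, so file names containing newlines would be miscounted.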
ls -1 | sed -E 's/_.*//' | sort -u | wc -l
or
ls -1 | sed -E 's/^([[:digit:]]+)_.*/\1/' | sort -u | wc -l
Use this Perl one-liner, combined with uniq | wc -l
perl -le 'print sort /^(\d+)/ for glob "*";' | uniq | wc -l
8
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
glob "*" returns the list of all files and directories.
/^(\d+)/ returns the regex matches, here, the stretches of digits at the beginning of the file names. Use something like /^(\d{8})/ if you need the exact number of digits.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlrequick: Perl regular expressions quick start

sed: remove all characters except for last n characters

I am trying to remove every character in a text string except for the last 11 characters. The string is Sample Text_that-would$normally~be,here--pe_-l4_mBY and what I want to end up with is just -pe_-l4_mBY.
Here's what I've tried:
$ cat food
Sample Text_that-would$normally~be,here--pe_-l4_mBY
$ cat food | sed 's/^.*(.{3})$/\1/'
sed: 1: "s/^.*(.{3})$/\1/": \1 not defined in the RE
Please note that the text string isn't really stored in a file, I just used cat food as an example.
OS is macOS High Sierra 10.13.6 and bash version is 3.2.57(1)-release
You can use this sed with a capture group:
sed -E 's/.*(.{11})$/\1/' file
-pe_-l4_mBY
Basic regular expressions (used by default by sed) require both the parentheses in the capture group and the braces in the brace expression to be escaped. ( and { are otherwise treated as literal characters to be matched.
$ cat food | sed 's/^.*\(.\{3\}\)$/\1/'
mBY
By contrast, explicitly requesting sed to use extended regular expressions with the -E option reverses the meaning, with \( and \{ being the literal characters.
$ cat food | sed -E 's/^.*(.{3})$/\1/'
mBY
Try this also:
grep -o -E '.{11}$' food
grep, like sed, accepts an arbitrary number of file name arguments, so there is no need for a separate cat. (See also useless use of cat.)
You can use tail or Parameter Expansion :
string='Sample Text_that-would$normally~be,here--pe_-l4_mBY'
echo "$string" | tail -c 12    # 11 characters plus the newline that echo appends
echo "${string#"${string%???????????}"}"
-pe_-l4_mBY
-pe_-l4_mBY
also with rev/cut/rev
$ echo abcdefghijklmnopqrstuvwxyz | rev | cut -c1-11 | rev
pqrstuvwxyz
man rev => rev - reverse lines characterwise
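For a string that is already in a shell variable, bash's substring expansion with a negative offset is another option. A minimal sketch, assuming bash (the variable name last11 is made up here):

```shell
#!/usr/bin/env bash
# Sketch: take the last 11 characters with substring expansion.
string='Sample Text_that-would$normally~be,here--pe_-l4_mBY'

# The space before the minus sign is required; without it, ${string:-11}
# would be parsed as the "use default value" expansion instead.
last11=${string: -11}
printf '%s\n' "$last11"
```

No external process is started, which matters mostly when this runs in a loop.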

grep ".*" does not match valid matches?

Information and Problems
I am learning Linux commands and was practicing the grep command in bash.
I want to match every file whose name begins with the character "a", which is quite a simple requirement. From what I understand, the regex should be something like a.*, but it doesn't work the way I expected.
Some of the file names that should match do not.
My Command
I typed commands in a Ubuntu Mate 16.04 VirtualBox terminal.
I have created a directory called test. In the test directory, I have three files:
a.txt
a1.txt
a2.txt
Here the following is my command using grep.
ls -a | grep -E -e a.*
But the output is simply
a.txt
I think .* should mean any number of any character, so a1.txt and a2.txt should match the regex, but they don't.
However if I tried
ls -a | grep -E -e ^a.*
ls -a | grep -E -e a.+
Both of these commands work as I expected; all the file names match.
a.txt
a1.txt
a2.txt
I could not figure out what goes wrong?
What I have tried
I have searched through the existing questions. There is one very similar to mine, but it is about the difference between extended and basic grep, which definitely isn't my situation.
Use more quotes!
With the literal command you ran in your question:
ls -a | grep -E -e a.*
...your shell will replace a.* with a list of filenames in the current directory matching a.* as a glob pattern before grep is started at all. (See also the full bash-hackers page on globbing).
If a.* is placed inside quotes, as in:
ls -a | grep -E 'a.*'
...then this string will no longer be evaluated as a glob. You might also want to anchor the regex with ^, to search only at the beginning:
ls -a | grep -E '^a.*'
That said, ls is not a tool built for programmatic use -- it isn't guaranteed to emit filenames in unmodified literal form, so it's not certain that all possible names will be emitted in such a way that grep or other tools will parse them correctly (indeed, ls can't emit all possible names in literal form, since it uses newline delimiters between names, whereas newline literals are actually possible within names themselves). Consider using find for this kind of processing:
while IFS= read -r -d '' filename; do
    printf 'Found file: %q\n' "$filename"
done < <(find . -regex '.*/a[^/]*' -print0)
...will work even with files having intentionally difficult-to-process names; consider, for example, mkdir -p $'\n/etc/passwd\n' && touch $'\n/etc/passwd\n/a.txt'.
You are misunderstanding how the shell is parsing your command. When you do this:
ls -a | grep -E -e a.*
The shell globs the command before it is passed to ls or grep. The result of the glob is this:
ls -a | grep -E -e a.txt
Because in globbing, a.* only matches a.txt.
You need to put the regexes in quotes, e.g.
ls -a | grep -E -e 'a.*'
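The difference between the glob and the regex can be reproduced in a scratch directory. This is a sketch only; the variable names are made up, and the three file names are the ones from the question.

```shell
#!/usr/bin/env bash
# Sketch: show what the shell does to an unquoted a.* versus a quoted regex.
demo=$(mktemp -d)
touch "$demo/a.txt" "$demo/a1.txt" "$demo/a2.txt"

# Unquoted: the shell expands a.* as a glob before any command runs.
# Only a.txt has a literal "a." prefix, so that is the single result.
glob_result=$(cd "$demo" && echo a.*)

# Quoted: grep receives the regex itself and counts every matching name.
regex_count=$(cd "$demo" && printf '%s\n' * | grep -c -E -e '^a.*')

echo "glob expands to: $glob_result"
echo "regex matches:   $regex_count names"
rm -rf "$demo"
```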

Why do you have to escape | and + in grep between apostrophes?

I was under the impression that within single quotes, e.g. 'pattern', bash special characters are not interpolated, so one need only escape single quotes themselves.
Why then does echo "123" | grep '[0-9]+' output nothing, whereas echo "123" | grep '[0-9]\+' (plus sign escaped) output 123? (Likewise, echo "123" | grep '3|4' outputs nothing unless you escape the |.)
This is under bash 4.1.2 and grep 2.6.3 on CentOS 6.5.
By default, grep uses Basic Regular Expressions, like sed and vi. In a BRE, characters such as + and | are literal unless you escape them, which is tedious.
You probably want Extended Regular Expressions, so use egrep or grep -E (depending on the version in use). Check your man grep.
See also the GNU documentation for a full list of the characters involved.
Most languages use Extended Regular Expressions (EREs) these days, and they are much easier to use. Basic Regular Expressions (BREs) are really a throw-back.
That seems to be the regular expression engine that grep uses. If you use a different one, it works:
$ echo "123" | grep '[0-9]+'
$ echo "123" | grep -P '[0-9]+'
123
$ echo "123" | grep '3|4'
$ echo "123" | grep -P '3|4'
123
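The same comparison can be made with grep -E, which avoids depending on -P (a GNU option that is not available in every build). A small sketch; the variable names are made up, and grep -c is used so that each result is a match count.

```shell
#!/usr/bin/env bash
# Sketch: compare BRE (default) and ERE (-E) handling of + and |.
# "|| true" keeps a no-match exit status from aborting under set -e.
bre_plus=$(echo "123" | grep -c '[0-9]+' || true)     # BRE: + is a literal plus sign
ere_plus=$(echo "123" | grep -c -E '[0-9]+' || true)  # ERE: + means "one or more"
bre_alt=$(echo "123" | grep -c '3|4' || true)         # BRE: | is a literal bar
ere_alt=$(echo "123" | grep -c -E '3|4' || true)      # ERE: | means alternation

echo "BRE '[0-9]+': $bre_plus   ERE '[0-9]+': $ere_plus"
echo "BRE '3|4':    $bre_alt   ERE '3|4':    $ere_alt"
```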

Bash: escape characters in backticks

I'm trying to escape characters within backticks in my bash command, mainly to handle spaces in filenames which cause my command to fail.
The command I have so far is:
grep -Li badword `grep -lr goodword *`
This command should result in a list of files that do not contain the word "badword" but do contain "goodword".
Your approach, even if you get the escaping right, will run into problems when the number of files output by the goodword grep reaches the limits on command-line length. It is better to pipe the output of the first grep onto a second grep, like this
grep -lr -- goodword * | xargs grep -Li -- badword
This avoids the command-line length limit, but xargs splits its input on whitespace by default, so it still fails on file names that contain spaces or newlines. At least GNU grep and xargs support separating the file names with NUL bytes, which handles both cases:
grep -lrZ -- goodword * | xargs -0 grep -Li -- badword
EDIT: Added double dashes -- to grep invocations to avoid the case when some file names start with - and would be interpreted by grep as additional options.
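Where the -Z and -0 options are not available, a different technique is to let find run both greps itself, so file names are never passed through a pipe. The sketch below assumes a POSIX find; the function name files_good_not_bad is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list files under a directory that contain "goodword" but do not
# contain "badword" (case-insensitively), using find instead of xargs.
files_good_not_bad() {
    # Each -exec acts as a test: it is true when grep -q finds a match.
    # "!" negates the second test, and -print runs only if both pass.
    find "$1" -type f \
        -exec grep -q -- goodword {} \; \
        ! -exec grep -qi -- badword {} \; \
        -print
}

# Example: files_good_not_bad .
```

This starts one or two grep processes per file, so it trades speed for robustness. The final -print is newline-delimited, which is fine for display; use -print0 if the list feeds another program.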
How about rewriting it so that the second grep receives the file names as arguments via xargs:
grep -lr goodword * | xargs grep -Li badword
