Using regex in grep filename - bash

I want to search a certain string in a number of archival log folders which reflect different servers. I use 2 different commands as of now
-bash-4.1$ zcat /mnt/bkp/logs/cmmt-54-22[8-9]/my_app.2021-12-28-* | grep 'abc'
and
-bash-4.1$ zcat /mnt/bkp/logs/cmmt-54-23[0-3]/my_app.2021-12-28-* | grep 'abc'
I basically want to search on server folders cmmt-54-228, cmmt-54-229 .... cmmt-54-233.
I tried combining the two commands into one, but it doesn't seem to work. I assume I made some mistake in the regex:
-bash-4.1$ zcat /mnt/bkp/logs/cmmt-54-22[8-9]|3[0-3]/my_app.2021-12-28-* | grep 'abc'
Please help.

Regex is not glob. See man 7 glob vs man 7 regex.
grep works with regex: it filters lines that match some regular expression.
The shell expands the words that you write. A word that contains a "filename expansion trigger" (*, ? or [) is replaced with the list of matching filenames.
You can use extended pattern matching (see man bash), which sounds like the most natural here:
shopt -s extglob
echo /mnt/bkp/logs/cmmt-54-2@(2[8-9]|3[0-3])/my_app.2021-12-28-*
In interactive shell I would just write it twice:
zcat /mnt/bkp/logs/cmmt-54-22[8-9]/my_app.2021-12-28-* /mnt/bkp/logs/cmmt-54-23[0-3]/my_app.2021-12-28-*
Or with brace expansion (see man bash):
zcat /mnt/bkp/logs/cmmt-54-2{2[8-9],3[0-3]}/my_app.2021-12-28-*
Brace expansion first replaces the word with two words, then filename expansion replaces each of them with the actual filenames.
You can also find files with a -regex. For that, see man find. (Or output a list of filenames and pipe it to grep and then use xargs or similar to pass it to a command)
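The two-step expansion described above can be tried safely on a throwaway directory tree. This is only a sketch: the temporary directory and the extra server numbers (227 and 234, added to show what is excluded) are assumptions for illustration, not part of the real log layout.

```shell
#!/usr/bin/env bash
# Build a scratch tree that mimics the folder layout from the question.
tmp=$(mktemp -d)
for n in 227 228 229 230 233 234; do
    mkdir -p "$tmp/cmmt-54-$n"
    : > "$tmp/cmmt-54-$n/my_app.2021-12-28-01"
done

# Brace expansion produces two words; each one is then glob-expanded separately.
matches=( "$tmp"/cmmt-54-2{2[8-9],3[0-3]}/my_app.2021-12-28-* )

# Collect just the server directory name of every match.
matched_dirs=()
for f in "${matches[@]}"; do
    d=${f%/*}                   # strip the file name
    matched_dirs+=( "${d##*/}" ) # keep the last path component
done
echo "${matched_dirs[*]}"

rm -rf "$tmp"
```

Folders 227 and 234 exist in the scratch tree but are not matched, which is the behaviour the combined command needs.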

Related

Why does the second grep command not work?

I have a folder named "components" and within that folder a file name "apple"
If I cd to "components" folder and execute the following command:
ls | grep -G a*e
It works and returns apple correctly.
However, if I do not cd to components folder and execute the following command:
ls components | grep -G a*e
It does not work and returns blank. What could be the reason?
A third grep command below works fine.
ls components | grep ap
The actual filename I am grepping is complex. So I need the grep -G tag to work.
Unquoted, a*e is a shell glob pattern that is expanded by the shell before grep runs.
When you are in the directory, this:
ls | grep -G a*e
becomes
ls | grep -G apple
As you have a file named 'apple' this matches.
When you are not in the folder, and you run:
ls components | grep -G a*e
the shell again attempts to expand the glob pattern.
If there is any file in your current directory that matches (for example, "abalone"), then the glob will expand to that. It may expand to multiple strings if there is more than one such filename (for example, "abalone", "algae"). The command becomes something like:
ls components | grep -G abalone
ls components | grep -G abalone algae
In the first case, you will get blank output unless components directory also contains that filename.
In the second case, grep will ignore the directory entirely and attempt to find the string "abalone" inside the file "algae".
There is a third possibility: the glob fails to find anything. In this case, grep will receive the regexp a*e. The -G option to grep tells it to use BRE-style regexp. With these, a*e means "zero or more a followed by e". This is equivalent to saying "contains e".
In that case, you should see apple in your results regardless of whether you are in components or not. In a comment, you say that ls components | grep "a*e" returned nothing. As quoting should force precisely the same result as this third case, this is surprising.
Note that if you are intending to use globs you don't need grep at all:
cd components
ls a*e
ls components/a*e
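The cases above can be reproduced in a scratch directory. This is a sketch under assumptions: the directory named elsewhere and the file abalone are made up to stand in for "some other current directory".

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
mkdir "$tmp/components" "$tmp/elsewhere"
: > "$tmp/components/apple"
: > "$tmp/elsewhere/abalone"

# Inside components, the unquoted glob a*e is replaced by "apple".
cd "$tmp/components" || exit 1
in_components=$(echo a*e)

# In a directory that holds "abalone", the same glob becomes "abalone".
cd "$tmp/elsewhere" || exit 1
in_elsewhere=$(echo a*e)

# Quoted, the pattern reaches grep unchanged and is read as the BRE a*e.
quoted=$(ls "$tmp/components" | grep -G 'a*e')

echo "$in_components $in_elsewhere $quoted"
cd / && rm -rf "$tmp"
```

Using echo in place of grep is a convenient way to see what the shell would actually pass to the command.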
a*e is a glob, not a regex. It's important to understand the difference.
The shell expands globs in unquoted arguments by matching the argument with available files. The * in a*e means "any sequence of characters not containing a directory separator", so it will match the filename apple (or accolade.node) as long as that file is present in the current directory. Glob matches are complete, not substring matches.
So when you execute grep a*e in a directory which contains the file apple, the shell will replace a*e with the word apple before invoking grep, making the command grep apple. If the directory also contained the file accolade.node, the shell would have put that into the command line as well; grep accolade.node apple. That's very rarely what you want to happen to grep arguments (other than filename arguments), so it's highly recommended to get into the habit of quoting arguments.
Unlike the shell, grep is based on regular expression matching. In a regular expression, * means "any number of repetitions of the previous element", so the regular expression a*e will match e, ae, aae, aaae, and so on. Since grep does substring matching (by default), those strings could be anywhere in the line being matched. That will match the e in apple, for example, but it will also match any other line which contains an e, such as electronics. (That makes it a bit surprising that ls components | grep "a*e" did not match components/apple. Perhaps there was some typing problem.)
In order to match a followed by a sequence of arbitrary characters followed by an e, you could use the regular expression a.*e (i.e. grep "a.*e" -- note the use of quotes to avoid having the shell try to expand that argument as a glob). But that will probably match too much, if you're expecting it to do the same thing as the glob a*e. You might want to add some restrictions. For example, grep -w forces the match to be complete words. And (with GNU grep, at least) you can use grep -w "a\S*e" to match a complete word which starts with a and ends with e, using the \S shortcut (any character other than whitespace).
You very rarely want to use -G, by the way, particularly since it's the default (unfortunately). Most of the time, you'll want to use grep -E in order to not have to insert backslashes throughout your pattern. Please read man 7 regex for a quick overview of regex syntax and the difference between basic and extended POSIX regexes. man grep is also useful, of course.
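To make the difference between these patterns concrete, here is a small sketch that counts matching lines for each one. The word list is invented for illustration, and the anchored pattern ^a.*e$ is one possible restriction, used here instead of -w because each input line holds exactly one name.

```shell
#!/usr/bin/env bash
# One name per line, as ls would print them when piped.
words=$(printf '%s\n' apple electronics banana axe grapes)

# a*e: zero or more "a", then "e" -> any line that contains an "e".
contains_e=$(printf '%s\n' "$words" | grep -c 'a*e')

# a.*e: an "a", anything, then an "e", anywhere in the line.
a_then_e=$(printf '%s\n' "$words" | grep -c 'a.*e')

# ^a.*e$: the whole line must start with "a" and end with "e", like the glob a*e.
whole_line=$(printf '%s\n' "$words" | grep -c '^a.*e$')

echo "$contains_e $a_then_e $whole_line"
```

Only the anchored form behaves like the glob: it accepts apple and axe but rejects grapes and electronics.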

bash list files of a particular naming convention

Operating System - Linux (Ubuntu 20.04)
I have a directory with thousands of files in it. The file names range anything from a.daily.csv to a.b.daily.csv to a.b.c.daily.csv to a.b.c.d.daily.csv to a.b.c.d.e.daily.csv
The challenge I'm having is in listing just a.daily.csv or a.b.daily.csv and so on. That is to say with "daily.csv" as the fixed part, I would like to be able to wildcard what is in front of it with "." being the delimiter between the fields
I tried a few wildcards such as ? [a-zA-Z0-9] & so on but unable to achieve this. Please could I get some guidance
Please note a,b,c etc are placeholders I'm using to post the question. In real world, a,b,c are alphanumeric words
Example -
PAHKY.daily.csv
TYUI.GHJ.WE.daily.csv
WGGH.FGH.daily.csv
98KJL-GHR.YUI.daily.csv
67HJE.HJQ.ATD.HJ.daily.csv
If I want to list all those files that are like PAHKY.daily.csv, where there is only one field (dot being the delimiter) in front of daily.csv, how could I do this?
If you enable the extglob option:
$ shopt -s extglob
you can use extended pattern matching operators like *(pattern) for zero or more of pattern. Knowing that [^.] matches any character but a dot, this leads to:
$ ls *([^.]).daily.csv
PAHKY.daily.csv
to obtain all a.daily.csv files. For the next group:
$ ls *([^.]).*([^.]).daily.csv
WGGH.FGH.daily.csv 98KJL-GHR.YUI.daily.csv
and so on. Replace *(pattern) by +(pattern) if you want to match one or more of pattern instead of zero or more.
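As a sketch of the approach above, the following script recreates the example file names in a temporary directory (an assumption made so the example is safe to run) and uses the +(pattern) variant, so that every field must contain at least one character.

```shell
#!/usr/bin/env bash
shopt -s extglob   # has to be enabled before the patterns below are parsed

tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch PAHKY.daily.csv TYUI.GHJ.WE.daily.csv WGGH.FGH.daily.csv \
      98KJL-GHR.YUI.daily.csv 67HJE.HJQ.ATD.HJ.daily.csv

# Exactly one dot-free field in front of daily.csv.
one_field=( +([^.]).daily.csv )

# Exactly two dot-free fields in front of daily.csv.
two_fields=( +([^.]).+([^.]).daily.csv )

echo "${one_field[*]}"
echo "${two_fields[*]}"
cd / && rm -rf "$tmp"
```

Storing the matches in arrays avoids parsing the output of ls when the names are needed later in a script.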
You can use grep with ls, as grep works well with regex.
Try something like this:
^a\.b\.c\.data\.csv$
ls | grep 'Your Expression'
In fact, you can even use find without piping to grep.
This should work:
ls |grep -Po '([A-Za-z0-9\-\.]?)+.daily.csv'
Explanation:
-P, --perl-regexp
-o, --only-matching
[A-Za-z0-9\-\.] --match the group of characters : (A-Z,a-z,0-9,-,.)
() -- to capture a group
? -- matches zero or one of the previous RE.
+ -- matches one or more of the previous RE
Output:
67HJE.HJQ.ATD.HJ.daily.csv
98KJL-GHR.YUI.daily.csv
PAHKY.daily.csv
TYUI.GHJ.WE.daily.csv
WGGH.FGH.daily.csv
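Note that the pattern above matches every file in the example. If the goal is to select by the number of fields, an anchored extended regex is one option. This is a sketch, not part of the answer above: the list is fed in with printf instead of ls so that it runs anywhere.

```shell
#!/usr/bin/env bash
files=$(printf '%s\n' PAHKY.daily.csv TYUI.GHJ.WE.daily.csv WGGH.FGH.daily.csv \
                      98KJL-GHR.YUI.daily.csv 67HJE.HJQ.ATD.HJ.daily.csv)

# One field: a run of non-dot characters, then the literal ".daily.csv".
one_field=$(printf '%s\n' "$files" | grep -E '^[^.]+\.daily\.csv$')

# Two fields: the group "non-dots followed by a dot" repeated exactly twice.
two_field_count=$(printf '%s\n' "$files" | grep -cE '^([^.]+\.){2}daily\.csv$')

echo "$one_field"
echo "$two_field_count"
```

Changing the {2} interval to another number selects names with that many fields.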

How to grep '*' in unix korn shell

I'm trying to find something in a file with a pattern using the '*' but is not working, any idea how to do this?
this is what I'm trying to do:
grep "files*.txt" $myTestFile
is not returning anything, it's suppose that "*" should be all.
By default, grep doesn't support extended regular expressions, but grep -E or egrep do.
egrep "files.*\.txt" $myTestFile
or
grep -E "files.*\.txt" $myTestFile
In addition, three variant programs egrep, fgrep and rgrep are available. egrep is the same as grep -E. fgrep is the same as grep -F. rgrep is the same as grep -r. Direct invocation as either egrep or fgrep is deprecated, but is provided to allow historical applications that rely on them to run unmodified.
Matcher Selection
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below). (-E is specified by POSIX.)
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX.)
-G, --basic-regexp
Interpret PATTERN as a basic regular expression (BRE, see below). This is the default.
-P, --perl-regexp
Interpret PATTERN as a Perl regular expression (PCRE, see below). This is highly experimental and `grep -P` may warn of unimplemented features.
If you only want to match the exact string files*.txt, that would be:
# match exactly "files*.txt"
grep -e "files[*][.]txt" "$myTestFile"
...or, more simply put using fgrep to match only the exact string given:
# match exactly "files*.txt"
fgrep -e 'files*.txt' "$myTestFile"
[*] defines a character class with only a single character -- the * -- contained, and thus matches only that one character. Backslash-based escaping is also possible, but can have different meanings in different contexts and thus is less reliable.
If you want to match any line that contains files, and later .txt, then:
# match any line containing "files" and later ".txt"
grep -e "files.*[.]txt" "$myTestFile"
.* matches zero-or-more characters, and is thus the regex equivalent to the glob-pattern *. Likewise, whereas in a glob pattern . matches only itself, in a regex . can match any character, so the . in .txt needs to be escaped, as in [.]txt, to prevent it from matching anything else.
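The following sketch compares the patterns discussed above on a few sample lines. The sample lines are made up, and grep -F is used in place of fgrep since the two are equivalent.

```shell
#!/usr/bin/env bash
sample=$(printf '%s\n' 'files*.txt' 'files_2021.txt' 'filesXtxt' 'file.txt')

# The pattern from the question: "file", zero or more "s", any one character, "txt".
original=$(printf '%s\n' "$sample" | grep -c -e 'files*.txt')

# Fixed-string match: only the literal text files*.txt.
fixed=$(printf '%s\n' "$sample" | grep -F -e 'files*.txt')

# Bracket expressions make * and . literal inside a regex.
bracketed=$(printf '%s\n' "$sample" | grep -e 'files[*][.]txt')

# "files", then anything, then a literal ".txt".
wildcard=$(printf '%s\n' "$sample" | grep -c -e 'files.*[.]txt')

echo "$original $fixed $bracketed $wildcard"
```

The question's pattern matches filesXtxt and file.txt but neither of the lines it was presumably meant to find, which explains the empty result.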

Remove everything but one file tcsh

I basically want to do the following bash command but in tcsh:
rm !(file1)
Thanks
You can use ls -1 (that's the number one, not the lowercase letter L) to list one file per line, and then use grep -vx <pattern> to exclude (-v) lines that exactly (-x) match <pattern>, and then xargs it to your command, rm. For example,
ls -1 | grep -vx file1 | xargs rm
In case your version of grep doesn't support the -x option, you can use anchors:
ls -1 | grep -v '^file1$' | xargs rm
To use this with commands other than rm that may not take an arbitrary number of arguments, remember to add the -n 1 option to xargs so that arguments are handled one by one:
ls -1 | grep -v '^file1$' | xargs -n 1 rm
I believe you can also achieve this with find, which can negate a test with !, for example find . -maxdepth 1 -type f ! -name file1 (the -maxdepth option is available in GNU and BSD find). You'll still have to pass the results to rm, for example by piping them to xargs.
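Since rm is destructive, the pipeline is worth trying in a scratch directory first. A minimal sketch, assuming file names without whitespace (xargs splits its input on blanks and newlines by default); the names other than file1 are placeholders:

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch file1 file2 file3 notes

# List one name per line, drop the line that is exactly "file1", remove the rest.
ls -1 | grep -vx 'file1' | xargs rm

remaining=$(ls)
echo "$remaining"
cd / && rm -rf "$tmp"
```

The pipeline itself does not depend on any bash feature, so the same line can be typed at a tcsh prompt.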
tcsh has a special ^ syntax for glob patterns (not supported in csh, sh, or bash). Prefixing a glob pattern with ^ negates it, causing it to match all file names that don't match the pattern.
Quoting the tcsh manual:
An entire glob-pattern can also be negated with `^':
> echo *
bang crash crunch ouch
> echo ^cr*
bang ouch
A single file name is not a glob pattern, and so the ^ prefix doesn't apply to it, but it can be turned into one by, for example, surrounding the first character with square brackets.
So this:
rm ^[f]ile1
should remove all files in the current directory other than file1.
I strongly recommend testing this before using it, either by using an echo command first:
echo ^[f]ile1
or by using Ctrl-X * to expand the pattern to a list of files before hitting Enter.
UPDATE: I've since learned that bash supports similar functionality but with a different syntax. In bash, !(PATTERN) matches anything not matched by the pattern. This is not recognized unless the extglob shell option is enabled. Unlike tcsh's ^ syntax, the pattern can be a single file name. This isn't relevant to what you're asking, but it could be useful if you ever decide to switch to bash.
zsh probably has something similar.

bash filename globbing - operate on files starting with capital

Let's say I have a folder with the following jpeg-files:
adfjhu.jpg Afgjo.jpg
Bdfji.jpg bkdfjhru.jpg
Cdfgj.jpg cfgir.jpg
Ddfgjr.jpg dfgjrr.jpg
How do I remove or list the files that start with a capital?
This can be solved with a combination of find, grep and xargs.
But is it possible with normal file-globbing/pattern matching in bash?
The command below doesn't work because (as far as I can tell) LANG is set to en_US, which affects the collation order.
$ ls [A-Z]*.jpg
Afgjo.jpg Bdfji.jpg bkdfjhru.jpg Cdfgj.jpg cfgir.jpg Ddfgjr.jpg dfgjrr.jpg
This sort of works
$ ls +(A|B|C|D)*.jpg
Afgjo.jpg Bdfji.jpg Cdfgj.jpg Ddfgjr.jpg
But I don't wanna do this for all characters A-Z for a general solution!
So is this possible?
cheers
//Fredrik
You should set your locale to the C (or POSIX) locale.
$ LC_ALL=C ls [A-Z]*.jpg
or
$ LC_ALL=C ls [[:upper:]]*.jpg
read here for more information: http://www.opengroup.org/onlinepubs/007908799/xbd/locale.html
Use a bracket expression with a character class:
ls -l [[:upper:]]*
See man 7 regex for a list of character classes and other information.
From that page:
Within a bracket expression, the name of a character class enclosed in '[:' and ':]' stands for the list of all characters belonging to that class. Standard character class names are:
alnum digit punct
alpha graph space
blank lower upper
cntrl print xdigit
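A quick way to check the character-class approach is to recreate the file names from the question in a temporary directory (the directory is an assumption made so the sketch is safe to run):

```shell
#!/usr/bin/env bash
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch adfjhu.jpg Afgjo.jpg Bdfji.jpg bkdfjhru.jpg \
      Cdfgj.jpg cfgir.jpg Ddfgjr.jpg dfgjrr.jpg

# [[:upper:]] matches by character class rather than by a collation range.
upper=( [[:upper:]]*.jpg )

echo "${upper[*]}"
cd / && rm -rf "$tmp"
```

Replacing echo with rm or another command operates on the same list of files.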
Use grep:
ls | grep -e '^[A-Z]'
If you want to do more, use a for loop:
for i in $(ls | grep -e '^[A-Z]'); do echo "$i"; done
