grep ".*" does not match valid matches? - bash

Information and Problems
I am learning Linux commands and was practicing the grep command in bash.
I want to match every file whose name begins with the character "a"...quite a simple requirement...From what I understand, the regex should be something like a.*, but it doesn't work the way I thought.
Some of the filenames that should be matched don't match.
My Command
I typed the commands in an Ubuntu MATE 16.04 VirtualBox terminal.
I created a directory called test. In the test directory, I have three files:
a.txt
a1.txt
a2.txt
Here is my command using grep:
ls -a | grep -E -e a.*
But the output is simply
a.txt
I think .* should mean any number of any character, so a1.txt and a2.txt should match the regex, but they don't.
However, if I try
ls -a | grep -E -e ^a.*
ls -a | grep -E -e a.+
both commands work as I expected; all the filenames match:
a.txt
a1.txt
a2.txt
I cannot figure out what is going wrong.
What I have tried
I have searched through the existing questions; there is one very similar to mine, but that problem is about extended grep versus basic grep, which definitely isn't my situation.

Use more quotes!
With the literal command you ran in your question:
ls -a | grep -E -e a.*
...your shell will replace a.* with a list of filenames in the current directory matching a.* as a glob pattern before grep is started at all. (See also the full bash-hackers page on globbing).
If a.* is placed inside quotes, as in:
ls -a | grep -E 'a.*'
...then this string will no longer be evaluated as a glob. You might also want to anchor the regex with ^, to search only at the beginning:
ls -a | grep -E '^a.*'
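(Since grep selects lines containing a match anywhere, the trailing .* is actually redundant; ls -a | grep -E '^a' selects the same lines.)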
That said, ls is not a tool built for programmatic use -- it isn't guaranteed to emit filenames in unmodified literal form, so it's not certain that all possible names will be emitted in such a way that grep or other tools can parse them correctly (indeed, ls can't emit all possible names in literal form, since it uses newline delimiters between names, whereas newline literals are actually possible within names themselves). Consider using find for this kind of processing:
while IFS= read -r -d '' filename; do
  printf 'Found file: %q\n' "$filename"
done < <(find . -regex '.*/a[^/]*' -print0)
...will work even with files having intentionally difficult-to-process names; consider, for example, mkdir -p $'\n/etc/passwd\n' && touch $'\n/etc/passwd\n/a.txt'.

You are misunderstanding how the shell is parsing your command. When you do this:
ls -a | grep -E -e a.*
The shell globs the command before it is passed to ls or grep. The result of the glob is this:
ls -a | grep -E -e a.txt
Because in globbing, the . is a literal dot rather than a metacharacter, so a.* matches only names that begin with a. -- here, just a.txt.
You need to put the regexes in quotes, e.g.
ls -a | grep -E -e 'a.*'
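You can watch the shell perform this expansion yourself. With the three files above in the current directory:
$ echo a.*
a.txt
$ echo 'a.*'
a.*
The unquoted glob is expanded before echo (or grep) ever runs; the quoted string is passed through literally.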

Related

Bash / Sed / Grep : Parsing / Capturing a Substring

I have generated a set of filepaths as strings in a bash script, all of this form:
./foo/bar/filename.proto
There can be any number of subfolders/slashes, but they all have the .proto extension.
I want to trim the leading ./ and trailing filename.proto to transform them to look like this:
foo/bar
I have had a surprising amount of difficulty adapting this from other solutions and debugging it. I have tried:
grep -Po "\.\/(.*)\/[^\/]+\.proto"
and
sed -n 's/\.\/\(.*\)\/[^\/]+\.proto/\1/p'
I have tried sed with both escaped and unescaped parentheses. For reference, I am currently working on a mac, and would like the most cross-platform-compatible solution.
I could do this fairly easily in Python, but I want to avoid the complexity of calling another script to do this.
To give you an idea of how this is working, my full script looks like this (so far):
#!/bin/bash
consume_single_folder () {
  do_stuff $1
}
find . -name \*.proto | while read fname; do
  echo "$fname" | sed -n 's/\.\/\(.*\)\/[^\/]+\.proto/\1/p' | consume_single_folder
done
Any help is appreciated. Thanks!
EDIT:
To be clear, I have tested my regex on regex101.com and it seems to look alright:
\.\/(.*)\/[^\/]+\.proto
It should be greedy, capturing everything between the first and last slash.
Looks like dirname could help you:
$ dirname "./foo/bar/filename.proto"
./foo/bar
With leading ./ removal:
$ dirname "./foo/bar/filename.proto" | sed "s/\.\///g"
foo/bar
Also, you could add sort | uniq to avoid duplicates:
find . -name \*.proto | while read fname; do
  echo "$fname" | xargs dirname | sed "s/\.\///g" | consume_single_folder
done
Works on macOS and Linux.
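One caveat: s/\.\///g deletes every "./" occurrence in the path, not just the leading one, so the rare directory name that ends in a dot would be mangled. Anchoring the pattern avoids that (a small sketch; the a./b name is contrived):
$ dirname "./a./b/filename.proto" | sed 's|^\./||'
a./b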
Please do not use sites like regex101 for testing sed regular expressions - syntax and features vary a lot between tools, as well as between various implementations. See Why does my regular expression work in X but not in Y? and the differences between various sed implementations.
For your given example, changing + to * will work (look up the differences between BRE and ERE):
$ fname='./foo/bar/filename.proto'
$ echo "$fname" | sed -n 's/\.\/\(.*\)\/[^\/]*\.proto/\1/p'
foo/bar
$ # or use a different delimiter
$ echo "$fname" | sed 's|\./\(.*\)/[^/]*\.proto|\1|'
foo/bar
$ # further simplification as find already filters by extension
$ echo "$fname" | sed 's|\./\(.*\)/.*|\1|'
foo/bar
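If you would rather keep the + from your attempt, you can switch to ERE instead; both GNU sed and the BSD sed on macOS accept -E for that (a sketch continuing the same example):
$ echo "$fname" | sed -E 's|\./(.*)/[^/]+\.proto|\1|'
foo/bar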
Also, I would suggest reading Why is looping over find's output bad practice? and changing your find syntax accordingly.
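For reference, a NUL-delimited sketch of that advice applied here; it assumes consume_single_folder is changed to take the directory as an argument ($1) rather than reading stdin:
while IFS= read -r -d '' fname; do
  consume_single_folder "$(dirname "${fname#./}")"
done < <(find . -name '*.proto' -print0)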

Grep multiple strings from text file

Okay, so I have a text file containing multiple strings, for example -
Hello123
Halo123
Gracias
Thank you
...
I want grep to use these strings to find lines with matching strings/keywords from other files within a directory
example of text files being grepped -
123-example-Halo123
321-example-Gracias-com-no
321-example-match
so in this instance the output should be
123-example-Halo123
321-example-Gracias-com-no
With GNU grep:
grep -f file1 file2
-f FILE: Obtain patterns from FILE, one per line.
Output:
123-example-Halo123
321-example-Gracias-com-no
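If the lines in file1 are literal strings rather than regular expressions (they might contain characters like . or +), add -F so they are matched literally:
grep -Ff file1 file2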
You should probably look at the manpage for grep to get a better understanding of what options are supported by the grep utility. However, there are a number of ways to achieve what you're trying to accomplish. Here's one approach:
grep -e "Hello123" -e "Halo123" -e "Gracias" -e "Thank you" list_of_files_to_search
However, since your search strings are already in a separate file, you would probably want to use this approach:
grep -f patternFile list_of_files_to_search
I can think of two possible solutions for your question:
Use multiple regular expressions - a regular expression for each word you want to find, for example:
grep -e Hello123 -e Halo123 file_to_search.txt
Use a single regular expression with an "or" operator. Using Perl regular expressions, it will look like the following:
grep -P "Hello123|Halo123" file_to_search.txt
EDIT:
As you mentioned in your comment, you want to use a list of words to find from a file and search in a full directory.
You can transform the words-to-find file into a concatenation of -e flags:
cat words_to_find.txt | sed 's/^/-e "/;s/$/"/' | tr '\n' ' '
This will return something like -e "Hello123" -e "Halo123" -e "Gracias" -e "Thank you", which you can then pass to grep using xargs:
cat words_to_find.txt | sed 's/^/-e "/;s/$/"/' | tr '\n' ' ' | xargs grep dir_to_search/*
As you can see, the last command also searches in all of the files in the directory.
SECOND EDIT: as PesaThe mentioned, the following command would do this in a much simpler and more elegant way:
grep -f words_to_find.txt dir_to_search/*

How to extract file name after grepping a string in unix

I am using the code below to grep some string:
grep 'string' *.log | grep -v 'string1'
I am getting output in a particular file. My requirement is to extract that file name into a variable. How can I do that effectively?
In general, you can capture the output of any command into a shell variable via command substitution like this:
variable=$(command arg1 arg2)
This is appropriate for your particular case if you are sure that there will be only one file name produced by the grep pipeline. In that case, you capture its name into shell variable fname via:
fname=$(grep -lZ string *.log | xargs -0 grep -lv string1)
This is safe for difficult file names because, via the -Z and -0 options, we use NUL-separated lists. The -l option to grep is useful here because it suppresses the normal grep output and just prints the file names.
If there might be multiple file matches, then, if you can use an advanced shell like bash, try:
grep -lZ string *.log | xargs -0 grep -lvZ string1 |
while IFS= read -r -d $'\0' fname; do
  # Process file "$fname"
done
This is also safe for difficult file names because, throughout the pipeline, it uses NUL-separated lists.
For a POSIX shell, read only works with newline-separated input. To make the above safe for difficult file names, the -d option is used, which is supported by bash, zsh, and other advanced shells.
Use the basename command:
basename "filePath" "fileExtension"
For example: basename /home/john/xyz.txt .txt
Output: xyz
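Combining this with the earlier pipeline, a sketch that strips both the directory and the .log extension from a single matching file:
fname=$(grep -lZ string *.log | xargs -0 grep -lv string1)
fname=$(basename "$fname" .log)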

perform an operation for *each* item listed by grep

How can I perform an operation for each item listed by grep individually?
Background:
I use grep to list all files containing a certain pattern:
grep -l '<pattern>' directory/*.extension1
I want to delete all listed files but also all files having the same file name but a different extension: .extension2.
I tried using the pipe, but it seems to take the output of grep as a whole.
In find there is the -exec option, but grep has nothing like that.
If I understand your specification, you want:
grep --null -l '<pattern>' directory/*.extension1 | \
xargs -n 1 -0 -I{} bash -c 'rm "$1" "${1%.*}.extension2"' -- {}
This is essentially the same as what @triplee's comment describes, except that it's newline-safe.
What's going on here?
grep with --null will return output delimited with nulls instead of newline. Since file names can have newlines in them delimiting with newline makes it impossible to parse the output of grep safely, but null is not a valid character in a file name and thus makes a nice delimiter.
xargs will take a stream of newline-delimited items and execute a given command, passing as many of those items as possible (one as each parameter) to the command (or to echo if no command is given). Thus if you said:
printf 'one\ntwo three \nfour\n' | xargs echo
xargs would execute echo one 'two three' four. This is not safe for file names because, again, file names might contain embedded newlines.
The -0 switch to xargs changes it from looking for a newline delimiter to a null delimiter. This makes it match the output we got from grep --null and makes it safe for processing a list of file names.
Normally xargs simply appends the input to the end of the command. The -I switch to xargs changes this to substituting the specified replacement string with each input item. To get the idea, try this experiment:
printf 'one\ntwo three \nfour\n' | xargs -I{} echo foo {} bar
And note the difference from the earlier printf | xargs command.
In the case of my solution the command I execute is bash, to which I pass -c. The -c switch causes bash to execute the commands in the following argument (and then terminate) instead of starting an interactive shell. The next block 'rm "$1" "${1%.*}.extension2"' is the first argument to -c and is the script which will be executed by bash. Any arguments following the script argument to -c are assigned as the arguments to the script. Thus, if I were to say:
bash -c 'echo $0' "Hello, world"
Then Hello, world would be assigned to $0 (the first argument to the script) and inside the script I could echo it back.
Since $0 is normally reserved for the script name I pass a dummy value (in this case --) as the first argument and, then, in place of the second argument I write {}, which is the replacement string I specified for xargs. This will be replaced by xargs with each file name parsed from grep's output before bash is executed.
The mini shell script might look complicated but it's rather trivial. First, the entire script is single-quoted to prevent the calling shell from interpreting it. Inside the script I invoke rm and pass it two file names to remove: the $1 argument, which was the file name passed when the replacement string was substituted above, and ${1%.*}.extension2. This latter is a parameter substitution on the $1 variable. The important part is %.* which says
% "Match from the end of the variable and remove the shortest string matching the pattern.
.* The pattern is a single period followed by anything.
This effectively strips the extension, if any, from the file name. You can observe the effect yourself:
foo='my file.txt'
bar='this.is.a.file.txt'
baz='no extension'
printf '%s\n' "${foo%.*}" "${bar%.*}" "${baz%.*}"
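This prints my file, this.is.a.file, and no extension: the shortest trailing .-suffix is removed, and a value containing no dot at all is left unchanged.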
Since the extension has been stripped I concatenate the desired alternate extension .extension2 to the stripped file name to obtain the alternate file name.
If this does what you want, pipe the output through /bin/sh.
grep -l 'RE' folder/*.ext1 | sed 's/\(.*\).ext1/rm "&" "\1.ext2"/'
Or if sed makes you itchy:
grep -l 'RE' folder/*.ext1 | while read file; do
  echo rm "$file" "${file%.ext1}.ext2"
done
Remove echo if the output looks like the commands you want to run.
But you can do this with find as well:
find /path/to/start -name \*.ext1 -exec grep -q 'RE' {} \; -print | ...
where ... is either the sed script or the three lines from while to done.
The idea here is that find will ... well, "find" things based on the qualifiers you give it -- namely, that names match the file glob *.ext1, AND that the result of the -exec is successful. The -q tells grep to look for RE in {} (the file supplied by find), and exit with a TRUE or FALSE without generating any output of its own.
The only real difference between doing this in find vs doing it with grep is that you get to use find's awesome collection of conditions to narrow down your search further if required. man find for details. By default, find will recurse into subdirectories.
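Putting it together with the loop from above (a sketch; the echo is left in as a dry run, so remove it once the output looks right):
find /path/to/start -name \*.ext1 -exec grep -q 'RE' {} \; -print |
while read -r file; do
  echo rm "$file" "${file%.ext1}.ext2"
done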
You can pipe the list to xargs:
grep -l '<pattern>' directory/*.extension1 | xargs rm
As for the second set of files with a different extension, I'd do this (as usual, use xargs echo rm for a dry run when testing; I haven't tested it, and it may not work correctly with filenames containing spaces):
filelist=$(grep -l '<pattern>' directory/*.extension1)
echo $filelist | xargs rm
echo ${filelist//.extension1/.extension2} | xargs rm
Pipe the result to xargs; it will allow you to run a command for each match.

Bash and filenames with spaces

The following is a simple Bash command line:
grep -li 'regex' "filename with spaces" "filename"
No problems. Also the following works just fine:
grep -li 'regex' $(<listOfFiles.txt)
where listOfFiles.txt contains a list of filenames to be grepped, one
filename per line.
The problem occurs when listOfFiles.txt contains filenames with
embedded spaces. In all cases I've tried (see below), Bash splits the
filenames at the spaces so, for example, a line in listOfFiles.txt
containing a name like ./this is a file.xml ends up trying to run
grep on each piece (./this, is, a and file.xml).
I thought I was a relatively advanced Bash user, but I cannot find a
simple magic incantation to get this to work. Here are the things I've
tried.
grep -li 'regex' `cat listOfFiles.txt`
Fails as described above (I didn't really expect this to work), so I
thought I'd put quotes around each filename:
grep -li 'regex' `sed -e 's/.*/"&"/' listOfFiles.txt`
Bash interprets the quotes as part of the filename and gives "No such
file or directory" for each file (and still splits the filenames with
blanks)
for i in $(<listOfFiles.txt); do grep -li 'regex' "$i"; done
This fails as for the original attempt (that is, it behaves as if the
quotes are ignored) and is very slow since it has to launch one 'grep'
process per file instead of processing all files in one invocation.
The following works, but requires some careful double-escaping if
the regular expression contains shell metacharacters:
eval grep -li 'regex' `sed -e 's/.*/"&"/' listOfFiles.txt`
Is this the only way to construct the command line so it will
correctly handle filenames with spaces?
Try this:
(IFS=$'\n'; grep -li 'regex' $(<listOfFiles.txt))
IFS is the Internal Field Separator. Setting it to $'\n' tells Bash to use the newline character to delimit filenames. Its default value is $' \t\n' and can be printed using cat -etv <<<"$IFS".
Enclosing the script in parentheses starts a subshell so that only commands within the parentheses are affected by the custom IFS value.
cat listOfFiles.txt |tr '\n' '\0' |xargs -0 grep -li 'regex'
The -0 option on xargs tells xargs to use a null character rather than white space as a filename terminator. The tr command converts the incoming newlines to a null character.
This meets the OP's requirement that grep not be invoked multiple times. It has been my experience that for a large number of files avoiding the multiple invocations of grep improves performance considerably.
This scheme also avoids a bug in the OP's original method, because his scheme will break when listOfFiles.txt contains enough files to exceed the maximum command-line length. xargs knows about the maximum command size and will invoke grep multiple times to avoid that problem.
A related problem with using xargs and grep is that grep will prefix the output with the filename when invoked with multiple files. Because xargs invokes grep with multiple files, one will receive output with the filename prefixed -- but not when listOfFiles.txt contains just one file, or when the last of several invocations happens to receive a single filename. To achieve consistent output, add /dev/null to the grep command:
cat listOfFiles.txt |tr '\n' '\0' |xargs -0 grep -i 'regex' /dev/null
Note that this was not an issue for the OP because he was using the -l option on grep; however, it is likely to be an issue for others.
This works:
while read file; do grep -li dtw "$file"; done < listOfFiles.txt
With Bash 4, you can also use the builtin mapfile function to set an array containing each line and iterate on this array:
$ tree
.
├── a
│   ├── a 1
│   └── a 2
├── b
│   ├── b 1
│   └── b 2
└── c
    ├── c 1
    └── c 2
3 directories, 6 files
$ mapfile -t files < <(find -type f)
$ for file in "${files[@]}"; do
> echo "file: $file"
> done
file: ./a/a 2
file: ./a/a 1
file: ./b/b 2
file: ./b/b 1
file: ./c/c 2
file: ./c/c 1
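Applied to the question's list (assuming no filename contains a newline), a sketch:
mapfile -t files < listOfFiles.txt
grep -li 'regex' "${files[@]}"
Quoting the array expansion is what preserves the embedded spaces.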
Though it may overmatch, this is my favorite solution:
grep -i 'regex' $(cat listOfFiles.txt | sed -e "s/ /?/g")
Do note that if you somehow ended up with a list file that has Windows line endings (\r\n), NONE of the notes above about the internal field separator $IFS (and quoting the argument) will work; so make sure that the line endings are correctly \n (I use SciTE to show the line endings and easily change them from one to the other).
Also, cat piped into while read file ... seems to work (apparently without the need to set separators):
cat <(echo -e "AA AA\nBB BB") | while read file; do echo $file; done
... although for me it was more relevant for a "grep" through a directory with spaces in filenames:
grep -rlI 'search' "My Dir"/ | while read file; do echo $file; grep 'search\|else' "$file"; done
