Grep multiple occurrences given two strings and two integers - bash

I'm looking for a bash script to count the occurrences of a word in a given directory and its subdirectories' files, with this pattern:
^str1{n}str2{m}$
for example:
str1= yo
str2= uf
n= 3
m= 4
the match would be "yoyoyoufufufuf"
but I'm having trouble with grep.
This is what I have tried:
for file in $(find $dir)
do
    if [ -f $file ]; then
        echo "<$file>:<`grep '\<\$str1\{$n\}\$str2\{$m\}\>' $file | wc -l`>" >> a.txt
    fi
done
Should I use find?

@Barmar's comment is useful.
If I understand your question, I think this single grep command should do what you're looking for:
grep -r -c "^\($str1\)\{$n\}\($str2\)\{$m\}$" "$dir"
Note the combination of -r and -c causes grep to output zero-counts for non-matching files. You can pipe to grep -v ":0$" to suppress this output if you require:
$ dir=.
$ str1=yo
$ str2=uf
$ n=3
$ m=4
$ cat youf
yoyoyoufufufuf
$ grep -r -c "^\($str1\)\{$n\}\($str2\)\{$m\}$" "$dir"
./noyouf:0
./youf:1
./dir/youf:1
$ grep -r -c "^\($str1\)\{$n\}\($str2\)\{$m\}$" "$dir" | grep -v ":0$"
./youf:1
./dir/youf:1
$
Note also that $str1 and $str2 need to be put in parentheses so that the {n} and {m} apply to everything within the parentheses and not just the last character.
Note the escaping of the () and {} as we require double-quotes ", so that the variables are expanded into the grep regular expression.
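Equivalently, you can switch to extended regular expressions with -E, which drops all the backslash escaping (the variables still expand inside the double quotes):
grep -r -c -E "^($str1){$n}($str2){$m}$" "$dir"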

Related

Using bash to iterate through similarly named files and grep

I have a list of base files:
file1.txt
file2.txt
file3.txt
and a list of target files:
target1.txt
target2.txt
target3.txt
and I want to use bash to perform the following commands using a loop:
grep -wf "file1.txt" "target1.txt" > "result1.txt"
grep -wf "file2.txt" "target2.txt" > "result2.txt"
The files will all have the same name besides the final integer, which will be in a series (1:22).
With a for loop:
for ((i=1; i<=22; i++)); do
    grep -wf "file$i.txt" "target$i.txt" > "result$i.txt"
done
With an arbitrary number of file#.txt and target#.txt:
#!/usr/bin/env bash
shopt -s extglob # Enable extended globbing patterns
# Iterate over all file#.txt
for f in file+([[:digit:]]).txt; do
    # Extract the index from the file name by stripping out all non-digit characters
    i="${f//[^[:digit:]]/}"
    file="$f"
    target="target$i.txt"
    result="result$i.txt"
    # If both file#.txt and target#.txt exist
    if [ -e "$file" ] && [ -e "$target" ]; then
        grep -wf "$file" "$target" > "$result"
    fi
done
This is a one-line version suitable for the command line, with brace expansion:
for i in {1..22};do grep -wf "file$i.txt" "target$i.txt" > "result$i.txt"; done
Do them all in parallel with GNU Parallel:
parallel 'grep -wf file{}.txt target{}.txt > result{}.txt' ::: {1..22}

Print program path and its symlink using which

Whenever I use which I do this: $ which -a npm
Which results in: /usr/local/bin/npm
Then to find the real path, I run:
ls -l /usr/local/bin/npm
I would like a fast way of doing this. The best I have come up with is defining a function:
which() {
    /usr/bin/which -a "$@" | xargs ls -l | tr -s ' ' | cut -d ' ' -f 9-
}
Now it has a nice output of: /usr/local/bin/npm -> ../lib/node_modules/npm/bin/npm-cli.js
Is there a better way to do this? I don't like using cut like this.
This won't print the -> output ls -l does, but it will resolve symlinks:
which() {
    command which -a "$@" | xargs -d '\n' readlink -m
}
If you want the -> output but want to do it more robustly, you could mimic ls -l with:
which() {
    command which -a "$@" | while IFS= read -r file; do
        if [[ -L $file ]]; then
            echo "$file -> $(readlink -m "$file")"
        else
            echo "$file"
        fi
    done
}
What does command do?
command suppresses function lookup and runs the real which binary as if which() weren't defined. This way you don't have to hardcode /usr/bin/which.
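A quick toy demo of the difference (not from the original answer):
which() { echo "function version"; }
which ls          # prints: function version
command which ls  # bypasses the function and prints e.g. /bin/ls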
awk: if the first field starts with "l" (a symlink), print fields 9, 10 and 11; otherwise only field 9.
[ranga@garuda ~]$ ls -l $(which -a firefox)|awk '{print $1 ~ /^l/ ? $9 $10 $11 : $9 }'
/usr/bin/firefox->/usr/lib64/firefox/firefox
$1 is the first field
$1 ~ /^l/ ? tests whether the first field matches the pattern ^l (first character is "l")
if the test passes, print receives $9 $10 $11; else, only $9.
sed: remove the first 8 bunches of non-space and space characters.
[ranga@garuda ~]$ ls -l $(which firefox) | sed 's/^\([^ ]*[ ]*\)\{8\}//'
/usr/bin/firefox -> /usr/lib/firefox/firefox
[ ]* matches a bunch of contiguous spaces
[^ ]* matches a contiguous bunch of non-space characters
the grouping \([^ ]*[ ]*\) matches a text with one non-space bunch and one space bunch (in that order).
\{8\} matches 8 contiguous instances of this combination. ^ at the beginning pins the match to the beginning of the line.
's/^\([^ ]*[ ]*\)\{8\}//' replaces a match with empty text - effectively removing it.
This seems to work so long as you aren't running "which" on an alias.
These commands are not presented inside a function, but can be used in one (which you already know how to do).
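For instance, the awk variant could be wrapped like this (a sketch with the same alias caveat, using command to reach the real which as explained above):
which() {
    # note: like the originals above, this relies on paths without whitespace
    ls -l $(command which -a "$@") | awk '{print $1 ~ /^l/ ? $9 $10 $11 : $9}'
}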

How to store NUL output of a program in bash script?

Suppose there is a directory 'foo' which contains several files:
ls foo:
1.aa 2.bb 3.aa 4.cc
Now in a bash script, I want to count the number of files with specific suffix in 'foo', and display them, e.g.:
SUFF='aa'
FILES=`ls -1 *."$SUFF" foo`
COUNT=`echo $FILES | wc -l`
echo "$COUNT files have suffix $SUFF, they are: $FILES"
The problem is: if SUFF='dd', $COUNT still equals 1. After googling, the reason I found is that when SUFF='dd', $FILES is an empty string rather than truly empty output, and echo turns it into one line for wc to count. Empty output can only be passed through pipes.
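You can reproduce the difference in isolation, since echo always appends a newline even to an empty string:
$ printf '' | wc -l
0
$ echo "" | wc -l
1
So one solution is: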
COUNT=`ls -1 *."$SUFF" foo | wc -l`
but this will lead to the ls command being executed twice. So my question is: is there any more elegant way to achieve this?
$ shopt -s nullglob
$ FILES=(*)
$ echo "${#FILES[@]}"
4
$ FILES=(*aa)
$ echo "${#FILES[@]}"
2
$ FILES=(*dd)
$ echo "${#FILES[@]}"
0
$ SUFFIX=aa
$ FILES=(*"$SUFFIX")
$ echo "${#FILES[@]}"
2
$ SUFFIX=dd
$ FILES=(*"$SUFFIX")
$ echo "${#FILES[@]}"
0
You can also try this:
#!/bin/bash
SUFF='aa'
FILES=`ls -1 *."$SUFF" foo`
FILENAMES=`echo $FILES | awk -F ':' '{print $2}'`
COUNT=`echo $FILENAMES | wc -w`
echo "$COUNT files have suffix $SUFF, they are: $FILENAMES"
If you insert echo $FILES in your script, the output is foo: 1.aa 2.aa 3.aa, so
awk -F ':' '{print $2}' gets 1.aa 2.aa 3.aa from the $FILES variable, and
wc -w prints the word count.
If you only need the file count, I would actually use find for that:
find '/path/to/directory' -mindepth 1 -maxdepth 1 -name '*.aa' -printf '\n' | wc -l
This is more reliable as it correctly handles filenames with line breaks. The way this works is that find outputs one empty line for each matching file.
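For the question's setup this could be wrapped in a small helper (a sketch; count_suffix is a made-up name, and -printf requires GNU find):
count_suffix() {
    # one empty line per matching file, so wc -l yields the count
    find "$1" -mindepth 1 -maxdepth 1 -name "*.$2" -printf '\n' | wc -l
}
COUNT=$(count_suffix foo "$SUFF")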
Edit: If you want to keep the file list in an array, you can use a glob:
GLOBIGNORE=".:.."
shopt -s nullglob
FILES=(*aa)
COUNT=${#FILES[@]}
echo "$COUNT"
The reason is that the option nullglob is unset by default in bash:
If no matching file names are found, and the shell option nullglob is not enabled, the word is left unchanged. If the nullglob option is set, and no matches are found, the word is removed.
So, just set the nullglob option, and run your code again:
shopt -s nullglob
SUFF='aa'
FILES="$(printf '%s\n' foo/*."$SUFF")"
COUNT="$(printf '%.0s\n' foo/*."$SUFF" | wc -l)"
echo "$COUNT files have suffix $SUFF, they are: $FILES"
Or better yet:
shopt -s nullglob
suff='aa'
files=( foo/*."$suff" )
count=${#files[@]}
echo "$count files have suffix $suff, they are: ${files[@]}"

Bash: Filter directory when piping from `ls` to `tee`

(background info)
Writing my first bash pseudo-program. The program downloads a bunch of files from the network, stores them in a sub-directory called ./network-files/, then removes all the files it downloaded. It also logs the result to several log files in ./logs/.
I want to log the filenames of each file deleted.
Currently, I'm doing this:
echo -e "$(date -u) >>> Removing files: $(ls -1 "$base_directory"/network-files/* | tr '\n' ' ')" | tee -a $network_files_log $verbose_log $network_log
($base_directory is a variable defining the base directory for the app, $network_files_log etc are variables defining the location of various log files)
This produces some pretty grody and unreadable output:
Tue Jun 21 04:55:46 UTC 2016 >>> Removing files: /home/vagrant/load-simulator/network-files/207822218.png /home/vagrant/load-simulator/network-files/217311040.png /home/vagrant/load-simulator/network-files/442119100.png /home/vagrant/load-simulator/network-files/464324101.png /home/vagrant/load-simulator/network-files/525787337.png /home/vagrant/load-simulator/network-files/581100197.png /home/vagrant/load-simulator/network-files/640387393.png /home/vagrant/load-simulator/network-files/650797708.png /home/vagrant/load-simulator/network-files/827538696.png /home/vagrant/load-simulator/network-files/833069509.png /home/vagrant/load-simulator/network-files/8580204.png /home/vagrant/load-simulator/network-files/858174053.png /home/vagrant/load-simulator/network-files/998266826.png
Any good way to strip out the /home/vagrant/load-simulator/network-files/ part from each of those file paths? I suspect there's something I should be doing with sed or grep, but haven't had any luck so far.
You might also consider using find. It's perfect for walking directories, removing files and using a customized printf for output:
find $PWD/x -type f -printf "%f\n" -delete >>$YourLogFile.log
Don't use ls at all; use a glob to populate an array with the desired files. You can then use parameter expansion to shorten each array element.
d=$base_directory/network-files
files=( "$d"/* )
printf '%s Removing files: %s' "$(date -u)" "${files[*]#$d/}" | tee ...
You could do it a couple of ways. To directly answer the question, you could use sed to do it with the substitution command like:
echo -e "$(date -u) >>> Removing files: $(ls -1 "$base_directory"/network-files/* | tr '\n' ' ')" | sed -e "s,$base_directory/network-files/,," | tee -a $network_files_log $verbose_log $network_log
which adds sed -e "s,$base_directory/network-files/,," to the pipeline. It will substitute the string found in base_directory with the empty string, so long as there are not any commas in base_directory. If there are you could try a different separator for the parts of the sed command, like underscore: sed -e "s_$base_directory/network-files__"
Instead though, you could just have the subshell cd to that directory and then the string wouldn't be there in the first place:
echo -e "$(date -u) >>> Removing files: $(cd "$base_directory/network-files/"; ls -1 | tr '\n' ' ')" | tee -a "$network_files_log" "$verbose_log" "$network_log"
Or you could avoid some potential pitfalls with echo and use printf like
{ printf '%s >>>Removing files: '; printf '%s ' "$(cd "$base_directory/network-files"; ls -1)"; printf '\n'; } | tee -a ...
testdata="/home/vagrant/load-simulator/network-files/207822218.png /home/vagrant/load-simulator/network-files/217311040.png"
echo -e $testdata | sed -e 's/\/[^ ]*\///g'
Pipe your output to sed to replace that captured group with nothing.
The regex: \/[^ ]*\/
It starts with a /, captures everything that is not a space, and stops at the last /.

grep for multiple strings in file on different lines (ie. whole file, not line based search)?

I want to grep for files containing the words Dansk, Svenska or Norsk on any line, with a usable return code (as I really only want to know whether the strings are contained, my one-liner goes a little further than this).
I have many files with lines in them like this:
Disc Title: unknown
Title: 01, Length: 01:33:37.000 Chapters: 33, Cells: 31, Audio streams: 04, Subpictures: 20
Subtitle: 01, Language: ar - Arabic, Content: Undefined, Stream id: 0x20,
Subtitle: 02, Language: bg - Bulgarian, Content: Undefined, Stream id: 0x21,
Subtitle: 03, Language: cs - Czech, Content: Undefined, Stream id: 0x22,
Subtitle: 04, Language: da - Dansk, Content: Undefined, Stream id: 0x23,
Subtitle: 05, Language: de - Deutsch, Content: Undefined, Stream id: 0x24,
(...)
Here is the pseudocode of what I want:
for all files in directory;
    if file contains "Dansk" AND "Norsk" AND "Svenska"
    then echo the filename
end
What is the best way to do this? Can it be done on one line?
You can use:
grep -l Dansk * | xargs grep -l Norsk | xargs grep -l Svenska
If you want also to find in hidden files:
grep -l Dansk .* | xargs grep -l Norsk | xargs grep -l Svenska
Yet another way using just bash and grep:
For a single file 'test.txt':
grep -q Dansk test.txt && grep -q Norsk test.txt && grep -l Svenska test.txt
Will print test.txt iff the file contains all three (in any combination). The first two greps don't print anything (-q) and the last only prints the file if the other two have passed.
If you want to do it for every file in the directory:
for f in *; do grep -q Dansk $f && grep -q Norsk $f && grep -l Svenska $f; done
grep -irl word1 * | grep -il word2 `cat -` | grep -il word3 `cat -`
-i makes search case insensitive
-r makes file search recursive through folders
-l pipes the list of files with the word found
cat - causes the next grep to look through the list of files passed to it.
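A variant of the same idea using xargs instead of cat - (a sketch, not part of the original answer; -d '\n' and -r are GNU xargs options, and -d '\n' also keeps file names with spaces intact):
grep -irl word1 . | xargs -r -d '\n' grep -il word2 | xargs -r -d '\n' grep -il word3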
You can do this really easily with ack:
ack -l 'cats' | ack -xl 'dogs'
-l: return a list of files
-x: take the files from STDIN (the previous search) and only search those files
And you can just keep piping until you get just the files you want.
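For the three languages from this question, that chain would look like:
ack -l 'Dansk' | ack -xl 'Norsk' | ack -xl 'Svenska'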
How to grep for multiple strings in file on different lines (Use the pipe symbol):
for file in *; do
    test $(grep -E 'Dansk|Norsk|Svenska' $file | wc -l) -ge 3 && echo $file
done
Notes:
If you use double quotes "" with your grep, you will have to escape the pipe like this: \| to search for Dansk, Norsk and Svenska.
Assumes that one line has only one language (and that no language appears on more than one line, since the test only counts matching lines).
Walkthrough: http://www.cyberciti.biz/faq/howto-use-grep-command-in-linux-unix/
awk '/Dansk/{a=1}/Norsk/{b=1}/Svenska/{c=1}END{ if (a && b && c) print "0" }'
You can then catch the printed value with the shell.
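For example (a sketch using dvdfile.dat, the file name used elsewhere on this page):
if [ "$(awk '/Dansk/{a=1}/Norsk/{b=1}/Svenska/{c=1}END{ if (a && b && c) print "0" }' dvdfile.dat)" = "0" ]; then
    echo "all three languages found"
fi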
If you have Ruby (1.9+):
ruby -0777 -ne 'print if /Dansk/ and /Norsk/ and /Svenska/' file
This searches multiple words in multiple files:
egrep 'abc|xyz' file1 file2 ..filen
Simply:
grep 'word1\|word2\|word3' *
see this post for more info
This is a blending of glenn jackman's and kurumi's answers which allows an arbitrary number of regexes instead of an arbitrary number of fixed words or a fixed set of regexes.
#!/usr/bin/awk -f
# by Dennis Williamson - 2011-01-25
BEGIN {
    for (i=ARGC-2; i>=1; i--) {
        patterns[ARGV[i]] = 0;
        delete ARGV[i];
    }
}
{
    for (p in patterns)
        if ($0 ~ p)
            matches[p] = 1
    # print # the matching line could be printed
}
END {
    for (p in patterns) {
        if (matches[p] != 1)
            exit 1
    }
}
Run it like this:
./multigrep.awk Dansk Norsk Svenska 'Language: .. - A.*c' dvdfile.dat
Here's what worked well for me:
find . -path '*/.svn' -prune -o -type f -exec gawk '/Dansk/{a=1}/Norsk/{b=1}/Svenska/{c=1}END{ if (a && b && c) print FILENAME }' {} \;
./path/to/file1.sh
./another/path/to/file2.txt
./blah/foo.php
If I just wanted to find .sh files with these three, then I could have used:
find . -path '*/.svn' -prune -o -type f -name "*.sh" -exec gawk '/Dansk/{a=1}/Norsk/{b=1}/Svenska/{c=1}END{ if (a && b && c) print FILENAME }' {} \;
./path/to/file1.sh
Expanding on @kurumi's awk answer, here's a bash function:
all_word_search() {
    gawk '
    BEGIN {
        for (i=ARGC-2; i>=1; i--) {
            search_terms[ARGV[i]] = 0;
            ARGV[i] = ARGV[i+1];
            delete ARGV[i+1];
        }
    }
    {
        for (i=1; i<=NF; i++)
            if ($i in search_terms)
                search_terms[$i] = 1
    }
    END {
        for (word in search_terms)
            if (search_terms[word] == 0)
                exit 1
    }
    ' "$@"
    return $?
}
Usage:
if all_word_search Dansk Norsk Svenska filename; then
    echo "all words found"
else
    echo "not all words found"
fi
I did it in two steps: first make a list of the csv files in one file, then search through them. With the help of this page's comments, I got what I needed without writing a script. Just type into the terminal:
$ find /csv/file/dir -name '*.csv' > csv_list.txt
$ grep -q Svenska `cat csv_list.txt` && grep -q Norsk `cat csv_list.txt` && grep -l Dansk `cat csv_list.txt`
It did exactly what I needed: print the file names containing all three words.
Also mind the symbols like ` ' ".
If you only need two search terms, arguably the most readable approach is to run each search and intersect the results:
comm -12 <(grep -rl word1 . | sort) <(grep -rl word2 . | sort)
If you have git installed
git grep -l --all-match --no-index -e Dansk -e Norsk -e Svenska
The --no-index option searches files in the current directory that are not managed by Git, so this command will work in any directory, irrespective of whether it is a git repository or not.
I had this problem today, and all the one-liners here failed for me because the files contained spaces in their names.
This is what I came up with, and it worked:
grep -ril <WORD1> | sed 's/.*/"&"/' | xargs grep -il <WORD2>
A simple one-liner in bash for an arbitrary list LIST for file my_file.txt can be:
LIST="Dansk Norsk Svenska"
EVAL=$(echo "$LIST" | sed 's/[^ ]* */grep -q & my_file.txt \&\& /g'); eval "$EVAL echo yes || echo no"
Replacing eval with echo reveals, that the following command is evaluated:
grep -q Dansk my_file.txt && grep -q Norsk my_file.txt && grep -q Svenska my_file.txt && echo yes || echo no
