Using bash to iterate through similarly named files and grep - bash

I have a list of base files:
file1.txt
file2.txt
file3.txt
and a list of target files:
target1.txt
target2.txt
target3.txt
and I want to use bash to perform the following command using a loop:
grep -wf "file1.txt" "target1.txt" > "result1.txt"
grep -wf "file2.txt" "target2.txt" > "result2.txt"
The files will all have the same name besides the final integer, which will be in a series (1:22).

With a for loop:
for((i=1; i<=22; i++)); do
grep -wf "file$i.txt" "target$i.txt" > "result$i.txt"
done

With an arbitrary number of file#.txt and target#.txt:
#!/usr/bin/env bash
shopt -s extglob # Enable extended globbing patterns
# Iterate all file#.txt
for f in file+([[:digit:]]).txt; do
# Extract the index from the file name by stripping out all non-digit characters
i="${f//[^[:digit:]]/}"
file="$f"
target="target$i.txt"
result="result$i.txt"
# If both file#.txt and target#.txt exist
if [ -e "$file" ] && [ -e "$target" ]; then
grep -wf "$file" "$target" >"$result"
fi
done
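The same idea also works without extglob, by taking the index with plain parameter expansion. This is only a minimal sketch: the function name grep_pairs and its directory argument are assumptions made for illustration and do not come from the answers above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run grep -wf for every fileN.txt/targetN.txt pair in a directory.
grep_pairs() {
    local dir=${1:-.} f i
    for f in "$dir"/file*.txt; do
        i=${f##*/}       # drop the directory part
        i=${i#file}      # drop the "file" prefix
        i=${i%.txt}      # drop the ".txt" suffix, leaving the index
        # Skip names whose index is not purely numeric (this also covers an unmatched glob)
        [[ $i =~ ^[0-9]+$ ]] || continue
        # Skip indexes that have no matching target file
        [ -e "$dir/target$i.txt" ] || continue
        # grep exits non-zero when nothing matches; that is not treated as an error here
        grep -wf "$f" "$dir/target$i.txt" > "$dir/result$i.txt" || true
    done
}
```

Calling grep_pairs with no argument processes the current directory; grep_pairs /some/dir processes that directory instead.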

This is a one-line version suitable for the command line, using brace expansion:
for i in {1..22};do grep -wf "file$i.txt" "target$i.txt" > "result$i.txt"; done

Do them all in parallel with GNU Parallel:
parallel 'grep -wf file{}.txt target{}.txt > result{}.txt' ::: {1..22}

Related

Find the occurrences of an element in array

arr=(7793 7793123 7793 37793 3214)
I'd like to find the occurrence of 7793. I tried: grep -o '7793' <<< $arr | wc -l
However, this also counts other elements that contain 7793 (e.g. 7793123, 37793)
printf '%s\n' "${arr[@]}" | grep -c '^7793$'
Explanation:
printf prints each item of the array on a new line
grep -c '^7793$' uses the start and end anchors to match 7793 exactly and outputs the count
With GNU grep (note the correct counting of elements containing newlines, refer to documentation for a description of options used):
arr=(7793 7793123 7793 37793 3214 7793$'\n'7793)
printf '%s\0' "${arr[@]}" | grep --null-data -cFxe 7793
Output:
2
This works because variables in bash cannot contain the NUL character.
You can use a regex in this case:
grep -e '^7793$'
To make a bash script efficient (from CPU/memory consumption point of view), whenever possible, avoid running sub-shells and programs. Hence, instead of using grep or any other program, here we have the choice of using a simple loop with variable comparison and arithmetic:
#!/bin/bash
key=7793
arr=(7793 7793123 7793 37793 3214)
count=0
for i in "${arr[@]}"
do if [ "$i" = "$key" ]
then count=$((count+1))
fi
done
echo $count
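If counts for several different values are needed, a single pass with an associative array avoids rescanning the array for each key. This is a sketch that assumes bash 4 or later; the names tally and item are chosen here for illustration.

```shell
#!/usr/bin/env bash
# Requires bash 4+ for associative arrays.
arr=(7793 7793123 7793 37793 3214)

declare -A tally=()
for item in "${arr[@]}"; do
    # Start from 0 when the element has not been seen yet
    tally[$item]=$(( ${tally[$item]:-0} + 1 ))
done

key=7793
count=${tally[$key]:-0}   # 0 if the key never occurred
echo "$count"
```

For the sample array this prints 2, the same result as the loop above.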

How to iterate two variables in bash script?

I have these kind of files:
file6543_015.bam
subreadset_15.xml
file6543_024.bam
subreadset_24.xml
file6543_027.bam
subreadset_27.xml
I would like to run something like this:
for i in *bam && l in *xml
do
my_script $i $l > output_file
done
Because in my command the first bam file goes with the first xml file. For each combination bam/xml, that command will give a specific output file.
Like this, using bash arrays:
bam=( *.bam )
xml=( *.xml )
for ((i=0; i<${#bam[@]}; i++)); do
my_script "${bam[i]}" "${xml[i]}"
done
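Pairing by array index relies on both globs sorting in the same order. An alternative sketch derives the XML name from the BAM name instead. It assumes the naming pattern from the question (a zero-padded number after the last underscore); the helper names xml_for_bam and process_bams and the .out output name are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map file6543_015.bam -> subreadset_15.xml
xml_for_bam() {
    local bam=${1##*/} num   # work on the file name without its directory
    num=${bam##*_}           # "015.bam"
    num=${num%.bam}          # "015"
    num=$((10#$num))         # force base 10 so "015" is not read as octal
    printf 'subreadset_%d.xml\n' "$num"
}

# Hypothetical driver: run my_script on each BAM file and its derived XML file.
process_bams() {
    local bam xml
    for bam in *.bam; do
        [ -e "$bam" ] || continue    # an unmatched glob stays literal; skip it
        xml=$(xml_for_bam "$bam")
        [ -e "$xml" ] || { echo "no XML for $bam" >&2; continue; }
        my_script "$bam" "$xml" > "${bam%.bam}.out"   # output name is an assumption
    done
}
```

Run process_bams in the directory holding the files; each pair then writes to its own output file rather than a shared output_file.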
Assuming you have a way to uniquely name your output_file for each specific output,
here is one way:
#!/bin/bash
ls file*.bam | while read i
do
CMD=`echo -n "my_script $i "`
CMD="$CMD `echo $i | sed -e 's/file.*_0/subreadset_/' -e 's/.bam/.xml/'`"
$CMD >> output_file
done

Sed replace substring only if expression exist

In a bash script, I am trying to remove the directory name in filenames :
documents/file.txt
direc/file5.txt
file2.txt
file3.txt
So I try to first see if there is a "/" and if yes delete everything before :
for i in **/*.scss *.scss; do
echo "$i" | sed -n '^/.*\// s/^.*\///p'
done
But it doesn't work for files in the current directory, it gives me a blank string.
I get :
file.txt
file5.txt
When you only want the filename, use basename instead of sed.
# basename /path/to/file
returns file
here is the man page
Your sed attempt is basically fine, but you should print regardless of whether you performed a substitution; take out the -n and the p at the end. (Also there was an unrelated syntax error.)
Also, don't needlessly loop over all files.
printf '%s\n' **/*.scss *.scss |
sed 's%^.*/%%'
This can also be done with awk.
Example:
echo "1/2/i.py" | awk 'BEGIN {FS="/"} {print $NF}'
output: i.py
Eventually, I did :
for i in **/*.scss *.scss; do
# for i in *.scss; do
# for i in _hm-globals.scss; do
name=${i##*/} # remove dir name
name=${name%.scss} # remove extension
name=`echo "$name" | sed -n "s/^_hm-//p"` # remove _hm-
if [[ $name = *"."* ]]; then
name=`echo "$name" | sed -n 's/\./-/p'` # replace the first . with -
fi
echo "$name" >&2
done
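The sed calls in that loop can probably be replaced with parameter expansion as well. A minimal sketch follows, with the hypothetical function name short_name. Note two deliberate differences from the loop above: a name without the _hm- prefix is kept rather than emptied, and every dot is replaced rather than only the first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the short name using only parameter expansion.
short_name() {
    local name=${1##*/}     # remove the directory part, if any
    name=${name%.scss}      # remove the extension
    name=${name#_hm-}       # remove a leading "_hm-" if present
    name=${name//./-}       # replace every "." with "-"
    printf '%s\n' "$name"
}
```

For example, short_name dir/_hm-globals.scss prints globals, and short_name _hm-a.b.scss prints a-b.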

How to store NUL output of a program in bash script?

Suppose there is a directory 'foo' which contains several files:
ls foo:
1.aa 2.bb 3.aa 4.cc
Now in a bash script, I want to count the number of files with specific suffix in 'foo', and display them, e.g.:
SUFF='aa'
FILES=`ls -1 *."$SUFF" foo`
COUNT=`echo $FILES | wc -l`
echo "$COUNT files have suffix $SUFF, they are: $FILES"
The problem is: if SUFF='dd', $COUNT also equal to 1. After google, the reason I found is when SUFF='dd', $FILES is an empty string, not really the null output of a program, which will be considered to have one line by wc. NUL output can only be passed through pipes. So one solution is:
COUNT=`ls -1 *."$SUFF" foo | wc -l`
but this will lead to the ls command being executed twice. So my question is: is there any more elegant way to achieve this?
$ shopt -s nullglob
$ FILES=(*)
$ echo "${#FILES[@]}"
4
$ FILES=(*aa)
$ echo "${#FILES[@]}"
2
$ FILES=(*dd)
$ echo "${#FILES[@]}"
0
$ SUFFIX=aa
$ FILES=(*"$SUFFIX")
$ echo "${#FILES[@]}"
2
$ SUFFIX=dd
$ FILES=(*"$SUFFIX")
$ echo "${#FILES[@]}"
0
You can also try this:
#!/bin/bash
SUFF='aa'
FILES=`ls -1 *."$SUFF" foo`
FILENAMES=`echo $FILES | awk -F ':' '{print $2}'`
COUNT=`echo $FILENAMES | wc -w`
echo "$COUNT files have suffix $SUFF, they are: $FILENAMES"
If you insert echo $FILES into your script, the output is foo: 1.aa 2.aa 3.aa, so
awk -F ':' '{print $2}' extracts 1.aa 2.aa 3.aa from the $FILES variable, and
wc -w prints the word count.
If you only need the file count, I would actually use find for that:
find '/path/to/directory' -mindepth 1 -maxdepth 1 -name '*.aa' -printf '\n' | wc -l
This is more reliable because it correctly handles filenames containing line breaks: find outputs one empty line for each matching file, and wc -l counts those lines.
Edit: If you want to keep the file list in an array, you can use a glob:
GLOBIGNORE=".:.."
shopt -s nullglob
FILES=(*aa)
COUNT=${#FILES[@]}
echo "$COUNT"
The reason is that the option nullglob is unset by default in bash:
If no matching file names are found, and the shell option nullglob is not enabled, the word is left unchanged. If the nullglob option is set, and no matches are found, the word is removed.
So, just set the nullglob option, and run your code again:
shopt -s nullglob
SUFF='aa'
FILES="$(printf '%s\n' foo/*."$SUFF")"
COUNT="$(printf '%.0s\n' foo/*."$SUFF" | wc -l)"
echo "$COUNT files have suffix $SUFF, they are: $FILES"
Or better yet:
shopt -s nullglob
suff='aa'
files=( foo/*."$suff" )
count=${#files[@]}
echo "$count files have suffix $suff, they are: ${files[@]}"
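To keep nullglob from leaking into the rest of a script, the glob can be wrapped in a function whose body runs in a subshell. This is a sketch under that assumption; the name count_suffix is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print how many entries in directory $1 end in ".$2".
# The body is a subshell ( ... ), so nullglob is only enabled inside it.
count_suffix() (
    shopt -s nullglob
    files=( "$1"/*."$2" )   # expands to an empty array when nothing matches
    echo "${#files[@]}"
)
```

With the directory from the question, count_suffix foo aa would print 2 and count_suffix foo dd would print 0. The trade-off is one extra subshell per call.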

Grep multiple occurrences given two strings and two integers

I'm looking for a bash script to count the occurrences of a word in the files of a given directory and its subdirectories, using this pattern:
^str1{n}str2{m}$
for example:
str1= yo
str2= uf
n= 3
m= 4
the match would be "yoyoyoufufufuf"
but i'm having trouble with grep
that's what i have tried
for file in $(find $dir)
do
if [ -f $file ]; then
echo "<$file>:<`grep '\<\$str1\{$n\}\$str2\{$m\}\>'' $file | wc -l >" >> a.txt
fi
done
should i use find?
@Barmar's comment is useful.
If I understand your question, I think this single grep command should do what you're looking for:
grep -r -c "^\($str1\)\{$n\}\($str2\)\{$m\}$" "$dir"
Note the combination of -r and -c causes grep to output zero-counts for non-matching files. You can pipe to grep -v ":0$" to suppress this output if you require:
$ dir=.
$ str1=yo
$ str2=uf
$ n=3
$ m=4
$ cat youf
yoyoyoufufufuf
$ grep -r -c "^\($str1\)\{$n\}\($str2\)\{$m\}$" "$dir"
./noyouf:0
./youf:1
./dir/youf:1
$ grep -r -c "^\($str1\)\{$n\}\($str2\)\{$m\}$" "$dir" | grep -v ":0$"
./youf:1
./dir/youf:1
$
Note also $str1 and $str2 need to be put in parentheses so that the {m} and {n} apply to everything within the parentheses and not just the last character.
Note the escaping of the () and {} as we require double-quotes ", so that the variables are expanded into the grep regular expression.
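If a single total is wanted rather than per-file counts, the same pattern can be written with extended regular expressions, which removes the backslashes. This is a sketch only: the function name count_matches is hypothetical, and it assumes str1 and str2 contain no regex metacharacters.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total number of lines under directory $1 that consist
# solely of str1 repeated n times followed by str2 repeated m times.
count_matches() {
    local dir=$1 str1=$2 n=$3 str2=$4 m=$5
    # -r: recurse into subdirectories
    # -h: omit file names from the output
    # -x: the whole line must match (takes the place of ^ and $)
    # -E: extended regex, so () and {} need no backslashes
    grep -rhxE "($str1){$n}($str2){$m}" "$dir" | wc -l
}
```

For the example tree shown above, count_matches . yo 3 uf 4 should report 2, one for ./youf and one for ./dir/youf.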
