pattern matching in the filename and change extension - bash script

I want to use name-last.txt files to call several other files in the parent directory whose names come from parts of the filename string:
For example, for Perez-Castillo.txt, I want to use: (1) grep in Perez-Castillo.txt, (2) grep in Perez.list and (3) grep in Castillo.list.
I have this part:
for i in *.txt;
do
wc -l $i > out1.txt
grep -c "something" ../${i%-*}.list > out2.txt
grep -c "something" ../${i#*-}.list > out3.txt
done;
However, I fail to call e.g. Castillo.list, as my script is calling Castillo.txt.list instead.
Any suggestion?

Bash doesn't let you nest two transformations in a single parameter expansion, so there is no way to strip both a prefix and a suffix in one expansion.
The simplest approach, then, is to remove the .txt extension first:
for i in *.txt; do
    pfx=${i%.txt}
    wc -l "${pfx}.txt" > out1.txt
    grep -c "something" "../${pfx%-*}.list" > out2.txt
    grep -c "something" "../${pfx#*-}.list" > out3.txt
done
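
For illustration, here is how the three expansions resolve for one of the example filenames:

i=Perez-Castillo.txt
pfx=${i%.txt}      # Perez-Castillo
echo "${pfx%-*}"   # Perez
echo "${pfx#*-}"   # Castillo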

Related

cat multiple files in separate directories file1 file2 file3....file100 using loop in bash script

I have several files in multiple directories, like 1/file1, 2/file2, 3/file3 ... 100/file100. I want to cat all those files into a single file, looping over the index in a bash script. Is there an easy loop for doing so?
Thanks,
seq 100 | sed 's:.*:dir&/file&:' | xargs cat
seq 100 generates list of numbers from 1 to 100
sed
s substitutes
: separates parts of the command
.* the whole line
: separator again; usually / is used, but / appears in the replacement string, so : is used instead
dir&/file& the replacement; & stands for the whole matched line, so line N becomes dirN/fileN
: separator
so it generates list of dir1/file1 ... dir100/file100
xargs - pass input as arguments to ...
cat - so it will execute cat dir1/file1 dir2/file2 ... dir100/file100.
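
For example, running the first two stages with a smaller count shows the list that gets handed to xargs:

$ seq 3 | sed 's:.*:dir&/file&:'
dir1/file1
dir2/file2
dir3/file3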
This code should do the trick:
for ((i=1; i<=$(ls -d */ | wc -l); i++)); do cat "dir${i}/file${i}" >> output; done
(Counting with ls -l also counts its "total" line and any ordinary files, so ls -d */ is used to count only the directories; if you already know there are 100, i<=100 is simpler still.)
I made an example of what you're describing for your directory structure and files. Create the directories and files, each with its own content.
for ((i=1;i<=100;i++)); do
mkdir "$i" && touch "$i/file$i" && echo content of "$(pwd) $i" > "$i/file$i"
done
Check the created directories.
ls */*
ls */* | sort -n
If you see that the directories and files are created then proceed to the next step.
This solution does not involve any external command from the shell, except of course cat :-)
Now we can check the contents of each file using bash syntax.
i=1
while [[ -e "$i" ]]; do
cat "$i"/*
((i++))
done
The same loop in POSIX syntax; this code was tested in dash.
i=1
while [ -e "$i" ]; do
cat "$i"/*
i=$((i+1))
done
To collect everything into a single file, just add the output redirection after the done, as shown below.
You can add more tests if you like; see help test.
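
For instance, the bash version of the loop with the combined output redirected to a file (output is just an example name):

i=1
while [[ -e "$i" ]]; do
    cat "$i"/*
    ((i++))
done > output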
One more thing :-), you can just check the contents using tail and brace expansion
tail -n +1 {1..100}/*
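
With more than one file, tail prints a ==> name <== header before each file's contents, which makes it easy to see where each piece came from. For example (the /home/user path is hypothetical; it is whatever $(pwd) was when the files were created):

$ tail -n +1 {1..2}/*
==> 1/file1 <==
content of /home/user 1

==> 2/file2 <==
content of /home/user 2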
With cat you can also redirect the combined output to a file directly; just remember that brace expansion is a bash 3+ feature.
cat {1..100}/*

Bash subshell input with variable number of subshells

I want to grep lines from a variable number of log files and connect their outputs with paste. If I had a fixed number of outputs, I could do it thus:
paste <(grep $PATTERN $FILE1) <(grep $PATTERN $FILE2)
But is there a way to do this with a variable number of input files? I want to write a shell script whose arguments are the input files. The shell script should paste the grepped lines from ALL of them.
Use explicit named pipes, instead of process substitution.
pipes=()
for f in "$FILE1" "$FILE2" "$FILE3"; do
    n="$(mktemp -u)"   # -u prints a name without creating the file, so mkfifo can create the FIFO
    mkfifo "$n"
    pipes+=( "$n" )
    grep "$PATTERN" "$f" > "$n" &
done
paste "${pipes[@]}"
rm "${pipes[@]}"   # When done with them
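
Adapted to the stated goal of a script whose arguments are the input files, a sketch along the same lines might look like this (the script name, argument order, and temporary-directory approach are assumptions, not part of the original answer):

#!/usr/bin/env bash
# usage: pastegrep PATTERN FILE...
pattern=$1; shift

tmpdir=$(mktemp -d)   # keep the FIFOs in a private directory
pipes=()
for f in "$@"; do
    n="$tmpdir/pipe${#pipes[@]}"
    mkfifo "$n"
    pipes+=( "$n" )
    grep "$pattern" "$f" > "$n" &
done

paste "${pipes[@]}"
rm -r "$tmpdir"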
You can do this by combining the find command, which lists the files, with grep, using xargs to ensure grep is applied to each file that find lists:
$ find /dir/containing/files -name "file.*" | xargs grep "$PATTERN"

Bash - change the filename by changing the filename variable

I want to save the results of a multiple grep in a .txt format. I do
for i in GO_*.txt; do
grep -o "GO:\w*" ${i} | grep -f - ../PFAM2GO.txt > ${i}_PFAM+GO.txt
done
The thing is that, obviously, the final filename includes the original file extension too, ending up as GO_*.txt_PFAM+GO.txt.
Now, I'd like to have just GO_*_PFAM+GO.txt. Is there a way to modify ${i} so as to drop the .txt, without having to perform a rename or a mv afterwards?
Note: the * part has variable length.
You can use parameter expansion to remove the extension from the filename:
for i in GO_*.txt; do
name="${i%.txt}"
grep -o "GO:\w*" "${i}" | grep -f - ../PFAM2GO.txt > "${name}_PFAM+GO.txt"
done
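
For one hypothetical match of the glob (GO_example.txt is made up for illustration), the expansion works out like this:

i=GO_example.txt              # hypothetical filename
name=${i%.txt}                # GO_example
echo "${name}_PFAM+GO.txt"    # GO_example_PFAM+GO.txt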

bash: How to extract episode number from a string

Suppose you have this string variable in bash:
filename="House Of Lies 5x02 HDTV XviD [DivxTotaL]"
What can I do to get the 5x02 part?
I've tried with grep with no luck:
echo "$filename" > grep -c '[0-9]x[0-9]{2}'
The -c option which you are passing to grep is wrong for this:
-c Only a count of selected lines is written to standard output.
$ echo "$filename" | grep -oE '[0-9]{1,2}x[0-9]{1,3}'
5x02
-o Prints only the matching part of the lines.
-E Extended Regex
echo "$filename" | egrep -o '[0-9]x[0-9]{2}'
>file redirects output to a file; |cmd pipes it to another command. -c counts the number of matches, which isn't useful here; -o outputs the matching string(s). To be able to use {2} you need to enable extended regexes, which egrep does.
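
If you also want the season and episode as separate numbers, bash's built-in regex matching can extract them without grep (a small sketch, assuming the NxNN pattern appears exactly once):

filename="House Of Lies 5x02 HDTV XviD [DivxTotaL]"
if [[ $filename =~ ([0-9]{1,2})x([0-9]{2,3}) ]]; then
    echo "match:   ${BASH_REMATCH[0]}"   # 5x02
    echo "season:  ${BASH_REMATCH[1]}"   # 5
    echo "episode: ${BASH_REMATCH[2]}"   # 02
fi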

How could I redirect file name into counts by tab using one line commands in bash?

I have some files in fasta format and want to count their reads, and would like the output as file names with their corresponding counts.
input file names:
1.fa
2.fa
3.fa
...
I tried:
for i in $(ls -t -v *.fa); do grep -c '>' $i > echo $i >> out.txt ; done
Problem:
It gives me out.txt, but with duplicated file names, and the counts are separated by ':'. However, I need a tab separator and unique file names.
1.fa:7323580
1.fa:7323580
2.fa:5591179
2.fa:5591179
...
Suggested solution
grep -c '>' *.fa | sed 's/:/'$'\t'/ > out.txt
The $'\t' is a Bash-ism called ANSI C Quoting.
Analysis of what went wrong
Your code is:
for i in $(ls -t -v *.fa); do grep -c '>' $i > echo $i >> out.txt ; done
It isn't a good idea to parse the output of the ls command. However, if your file names are well behaved (roughly, in the portable filename character set, which is [-A-Za-z0-9._]), you'll be reasonably OK.
Your grep command, though, is confused. It is:
grep -c '>' $i > echo $i >> out.txt
That could be written more clearly as:
grep -c '>' $i $i > echo >> out.txt
This means 'count the number of lines containing > in $i, and then in $i again, and send the output first to a file called echo, and then append it to out.txt'. Since the later redirection overrides the earlier one, the file echo is created but left empty. You get the file name included in the output because there are two files to search; with only one file, you wouldn't get the file name. (One way to ensure you get file names with regular (not -c or -l) grep is to scan /dev/null too. Many versions of grep also provide an option to print the name explicitly, but POSIX doesn't mandate one; both BSD grep and GNU grep use -H.)
So, that's why you got the double file names and entries in your output.
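
You can see the effect with the counts from the question:

$ grep -c '>' 1.fa
7323580
$ grep -c '>' 1.fa 1.fa
1.fa:7323580
1.fa:7323580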
Try this:
for i in $(ls -t -v *.fa)
do
    c=$(grep -c '>' "$i")
    printf '%s\t%s\n' "$i" "$c" >> out.txt
done
(grep -c with a single file argument prints just the count, so no awk is needed; printf supplies the tab.)
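
With the counts from the question, out.txt then looks like:

1.fa	7323580
2.fa	5591179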
