How can I redirect file names and their counts, tab-separated, into a file using a one-line command in bash? - bash

I have some files in FASTA format and want to count their reads, with the output listing each file name and its corresponding count.
input file names:
1.fa
2.fa
3.fa
...
I tried:
for i in $(ls -t -v *.fa); do grep -c '>' $i > echo $i >> out.txt ; done
Problem:
It gives me out.txt, but with each file name duplicated and separated from its count by ':'. However, I need a tab separator and each file name only once.
1.fa:7323580
1.fa:7323580
2.fa:5591179
2.fa:5591179
...

Suggested solution
grep -c '>' *.fa | sed 's/:/'$'\t'/ > out.txt
The $'\t' is a Bash-ism called ANSI-C Quoting.
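A minimal demonstration of the quoting form on its own:
$ printf '[%s]\n' $'\t'    # $'...' turns \t into a real tab character
[	]
$ printf '[%s]\n' '\t'     # plain single quotes keep the backslash literal
[\t]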
Analysis of what went wrong
Your code is:
for i in $(ls -t -v *.fa); do grep -c '>' $i > echo $i >> out.txt ; done
It isn't a good idea to parse the output of the ls command. However, if your file names are well behaved (roughly, in the portable filename character set, which is [-A-Za-z0-9._]), you'll be reasonably OK.
Your grep command, though, is confused. It is:
grep -c '>' $i > echo $i >> out.txt
That could be written more clearly as:
grep -c '>' $i $i > echo >> out.txt
This means 'count the number of lines containing > in $i, and then in $i again, and send the output first to a file named echo, and then append it to out.txt'. Since the later redirection overrides the earlier one, the file echo is created but left empty. You get the file name included in the output because there are two files to search; with only one file, you wouldn't get the file name. (One way to ensure you get file names with grep, even for a single file, is to scan /dev/null too. Many versions of grep also provide an option to print the name explicitly, but POSIX doesn't mandate one. BSD grep uses -H; so does GNU grep.)
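For illustration (sample.fa and its count are hypothetical):
grep -c '>' sample.fa              # one file: prints the bare count, 7323580
grep -c '>' /dev/null sample.fa    # two files, so the names are prefixed:
                                   #   /dev/null:0
                                   #   sample.fa:7323580
grep -cH '>' sample.fa             # -H forces the name: sample.fa:7323580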
So that's why you got the duplicated file names and counts in your output.

Try this:
for i in $(ls -t -v *.fa)
do
    c=$(grep -c '>' "$i")
    printf '%s\t%s\n' "$i" "$c" >> out.txt
done
(With a single file argument, grep -c prints only the bare count, so no post-processing is needed; printf supplies the tab the question asks for.)

Related

How to read strings from a text file and use them in grep?

I have a file of strings that I need to search for in another file. So I've tried the following code:
#!/bin/bash
while read name; do
    #echo $name
    grep "$name" file.txt > results.txt
done < missing.txt
The echo line confirms the file is being read into the variable, but my results file is always empty. Doing the grep command on its own works; I'm obviously missing something very basic here, but I have been stuck for a while and can't figure it out.
I've also tried without quotes around the variable. Can someone tell me what I'm missing? Thanks a bunch
Edit - input file was DOS format, set file format to unix and works fine now
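For the record, stripping the carriage returns is all that conversion needs; tr is POSIX, and dos2unix, where installed, does the same job (missing.unix.txt is just a scratch name):
tr -d '\r' < missing.txt > missing.unix.txt && mv missing.unix.txt missing.txt
# or: dos2unix missing.txt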
Use grep's -f option: then you only need a single grep call and no loop.
grep -f missing.txt file.txt > results.txt
If the contents of "missing.txt" are fixed strings, not regular expressions, this will speed up the process:
grep -F -f missing.txt file.txt > results.txt
And if you want to match the entries of missing.txt only as whole words, not as parts of longer words:
grep -F -w -f missing.txt file.txt > results.txt
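A tiny sketch of the difference between the -F and -F -w forms, with hypothetical file contents:
$ cat missing.txt
cat
$ cat file.txt
cat
catalog
$ grep -F -f missing.txt file.txt      # substring match: both lines
cat
catalog
$ grep -F -w -f missing.txt file.txt   # whole words only
cat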
My first guess is that you are overwriting your results.txt file in every iteration of the while loop (with the single >). If that is the case, you should at least have the result for the very last line in your missing.txt file. Then it would suffice to do something like:
#!/bin/bash
while read name; do
    #echo "$name"
    grep "$name" file.txt
done < missing.txt > results.txt

Bash subshell input with variable number of subshells

I want to grep lines from a variable number of log files and connect their outputs with paste. If I had a fixed number of outputs, I could do it thus:
paste <(grep $PATTERN $FILE1) <(grep $PATTERN $FILE2)
But is there a way to do this with a variable number of input files? I want to write a shell script whose arguments are the input files. The shell script should paste the grepped lines from ALL of them.
Use explicit named pipes instead of process substitution.
pipes=()
for f in "$FILE1" "$FILE2" "$FILE3"; do
    n="$(mktemp -u)"   # -u prints a fresh temporary name without creating the file
    mkfifo "$n"
    pipes+=( "$n" )
    grep "$PATTERN" "$f" > "$n" &
done
paste "${pipes[@]}"
rm "${pipes[@]}"   # when done with them
You can also do this by combining the find command to list the files and piping its output through xargs, so that grep is applied to each file that find lists:
$ find /dir/containing/files -name "file.*" | xargs grep "$PATTERN"
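Note that plain xargs splits its input on whitespace and quotes; if the file names might contain either, find's own -exec is a safer variant of the same idea:
find /dir/containing/files -name "file.*" -exec grep "$PATTERN" {} +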

BASH output from grep

I am relatively new to bash and I am testing my code for the first case.
counter=1
for file in not_processed/*.txt; do
    if [ $counter -le 1 ]; then
        grep -v '2018-07' $file > bis.txt;
        counter=$(($counter+1));
    fi;
done
I want to subtract all the lines containing '2018-07' from my file. The new file needs to be named $file_bis.txt.
Thanks
With sed or awk it's much easier and faster to process complex files. To delete the lines containing '2018-07' (the equivalent of your grep -v):
sed '/2018-07/d' not_processed/*.txt
Then you get the output in your console. If you want, you can redirect the output to a new file.
sed '/2018-07/d' not_processed/*.txt > out.txt
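Since the answer mentions awk too, the equivalent awk filter would be:
awk '!/2018-07/' not_processed/*.txt > out.txt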
This is to do it on all files in not_processed/*.txt
for file in not_processed/*.txt
do
    grep -v '2018-07' "$file" > "$file"_bis.txt
done
And this is to do it only on the first 2 files in not_processed/*.txt
for file in $(ls not_processed/*.txt | head -2)
do
    grep -v '2018-07' "$file" > "$file"_bis.txt
done
Don't forget to add "" around $file in the output name, because otherwise bash considers $file_bis as a new variable, which has no assigned value.
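Equivalently, curly braces remove the ambiguity without relying on the quotes (though quoting $file is still wise for names with spaces):
grep -v '2018-07' "$file" > "${file}_bis.txt"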
I don't understand why you are using a counter and an if condition for this simple requirement. Use the script below, which fulfills your requirement:
# first store all the files in a variable
files=$(ls /your/path/*.txt)
# now use a for loop
for file in $files
do
    grep -v '2018-07' "$file" >> bis.txt
done
Better to avoid the for loop here, as the single line below suffices (-h suppresses the file-name prefixes, -v keeps the lines that do not match):
grep -hv '2018-07' /your/path/*.txt > bis.txt

Concatenate files based on numeric sort of name substring in awk w/o header

I am interested in concatenating many files together based on their numeric order, and also removing the first line of each.
e.g. chr1_smallfiles then chr2_smallfiles then chr3_smallfiles.... etc (each without the header)
Note that chr10_smallfiles needs to come after chr9_smallfiles -- that is, this needs to be numeric sort order.
When I run the two commands awk and ls -v1 separately, each does its job properly, but when I put them together, it doesn't work. Please help, thanks!
awk 'FNR>1' | ls -v1 chr*_smallfiles > bigfile
The issue is with the way that you're trying to pass the list of files to awk. At the moment, you're piping the output of awk to ls, which makes no sense.
Bear in mind that, as mentioned in the comments, ls is a tool for interactive use, and in general its output shouldn't be parsed.
If sorting weren't an issue, you could just use:
awk 'FNR > 1' chr*_smallfiles > bigfile
The shell will expand the glob chr*_smallfiles into a list of files, which are passed as arguments to awk. For each filename argument, all but the first line will be printed.
Since you want to sort the files, things aren't quite so simple. If you're sure the full range of files exist, just replace chr*_smallfiles with chr{1..99}_smallfiles in the original command.
Using some Bash-specific and GNU sort features, you can also achieve the sorting like this:
printf '%s\0' chr*_smallfiles | sort -z -n -k1.4 | xargs -0 awk 'FNR > 1' > bigfile
printf '%s\0' prints each filename followed by a null-byte
sort -z sorts records separated by null-bytes
-n -k1.4 does a numeric sort, starting from the 4th character (the numeric part of the filename)
xargs -0 passes the sorted, null-separated output as arguments to awk
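A quick way to see the effect of the sort step on its own (substituting printf for awk just to show the resulting order; the three names are examples):
$ printf '%s\0' chr1_smallfiles chr10_smallfiles chr9_smallfiles | sort -z -n -k1.4 | xargs -0 printf '%s\n'
chr1_smallfiles
chr9_smallfiles
chr10_smallfiles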
Otherwise, if you want to go through the files in numerical order, and you're not sure whether all the files exist, then you can use a shell loop (although it'll be significantly slower than a single awk invocation):
for file in chr{1..99}_smallfiles; do  # 99 is the maximum file number
    [ -f "$file" ] || continue         # skip missing files
    awk 'FNR > 1' "$file"
done > bigfile
You can also use tail to concatenate all the files without their headers:
tail -q -n+2 chr*_smallfiles > bigfile
In case you want to concatenate the files in the natural sort order described in your question, you can pipe the result of ls -v1 to xargs using
ls -v1 chr*_smallfiles | xargs -d $'\n' tail -q -n+2 > bigfile
(Thanks to Charles Duffy) xargs -d $'\n' sets the delimiter to a newline \n in case the filenames contain whitespace or quote characters
Using a bash 4 associative array to extract only the numeric substring of each filename; sort those individually; and then retrieve and concatenate the full names in the resulting order:
#!/usr/bin/env bash
case $BASH_VERSION in ''|[123].*) echo "Requires bash 4.0 or newer" >&2; exit 1;; esac
# when this is done, you'll have something like:
#   files=( [1]=chr1_smallfiles.txt
#           [10]=chr10_smallfiles.txt
#           [9]=chr9_smallfiles.txt )
declare -A files=( )
for f in chr*_smallfiles.txt; do
    files[${f//[![:digit:]]/}]=$f
done
# now, emit those indexes (1, 10, 9) to "sort -n -z" to sort them as numbers
# then read those numbers, look up the filenames associated, and pass to awk.
while read -r -d '' key; do
    awk 'FNR > 1' <"${files[$key]}"
done < <(printf '%s\0' "${!files[@]}" | sort -n -z) > bigfile
You can do it with a for loop like the one below, which works for me:
for file in chr*_smallfiles
do
    tail -n +2 "$file" >> bigfile
done
How does it work? The for loop picks up every file in the current directory matching the wildcard pattern chr*_smallfiles and assigns each name to the variable file; tail -n +2 "$file" then outputs every line of that file except the first and appends them to the file bigfile. So finally all the files are merged (except the first line of each) into the single file bigfile. Note that the glob expands in lexicographic order, so chr10_smallfiles sorts before chr9_smallfiles; for the numeric order the question asks for, combine this with one of the sorting approaches above.
Just for completeness, how about a sed solution?
for file in chr*_smallfiles
do
    sed -n '2,$p' "$file" >> bigfile
done
Hope it helps!

pattern matching in the filename and change extension - bash script

I want to use name-last.txt files to call several other files in the parent directory whose names come from parts of the file name:
For example, for Perez-Castillo.txt, I want to use: (1) grep in Perez-Castillo.txt, (2) grep in Perez.list and (3) grep in Castillo.list.
I have this part:
for i in *.txt;
do
    wc -l $i > out1.txt
    grep -c "something" ../${i%-*}.list > out2.txt
    grep -c "something" ../${i#*-}.list > out3.txt
done;
However, I fail to call e.g. Castillo.list, as my script calls Castillo.txt.list instead.
Any suggestion?
Bash doesn't let you nest two transformations into a single parameter expansion, so there is no way to delete both a prefix and a suffix with a parameter expansion.
So the simplest approach is to just remove the .txt extension at the beginning:
for i in *.txt; do
    pfx=${i%.txt}
    wc -l "${pfx}.txt" > out1.txt
    grep -c "something" "../${pfx%-*}.list" > out2.txt
    grep -c "something" "../${pfx#*-}.list" > out3.txt
done
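For example, with the file name from the question, the expansions work out as:
$ i=Perez-Castillo.txt
$ pfx=${i%.txt}                          # Perez-Castillo
$ echo "${pfx%-*}.list" "${pfx#*-}.list"
Perez.list Castillo.list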
