Running zcat on multiple files using a for loop - bash

I'm very new to terminal/bash. Perhaps this has been asked before, but I wasn't able to find what I'm looking for, possibly because I'm not sure exactly what to search for.
I'm trying to format some files for genetic analysis and while I could write out the following command for every sample file, I know there is a better way:
zcat myfile.fastq.gz | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip > myfile.2.fastq.gz
zcat myfile.fastq.gz | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip > myfile.1.fastq.gz
I have the following files:
-bash-3.2$ ls
BB001.fastq BB013.fastq.gz IN014.fastq.gz RV006.fastq.gz SL083.fastq.gz
BB001.fastq.gz BB014.fastq.gz INA01.fastq.gz RV007.fastq.gz SL192.fastq.gz
BB003.fastq.gz BB015.fastq.gz INA02.fastq.gz RV008.fastq.gz SL218.fastq.gz
BB004.fastq.gz IN001.fastq.gz INA03.fastq.gz RV009.fastq.gz SL276.fastq.gz
BB006.fastq.gz IN002.fastq.gz INA04.fastq.gz RV010.fastq.gz SL277.fastq.gz
BB008.fastq.gz IN007.fastq.gz INA05.fastq.gz RV011.fastq.gz SL326.fastq.gz
BB009.fastq.gz IN010.fastq.gz INA1M.fastq.gz RV012.fastq.gz SL392.fastq.gz
BB010.fastq.gz IN011.fastq.gz RV003.fastq.gz SL075.fastq.gz SL393.fastq.gz
BB011.fastq.gz IN012.fastq.gz RV004.fastq.gz SL080.fastq.gz SL395.fastq.gz
BB012.fastq.gz IN013.fastq.gz RV005.fastq.gz SL081.fastq.gz
and I would like to apply the two zcat functions to each file, creating two new files from each one without writing it out 50 times. I've used for loops in R quite a bit but don't know where to start in bash. I can say in words what I want and hopefully someone can give me a hand coding it!:
for FILENAME.fastq.gz in all files in cd
zcat FILENAME.fastq.gz | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip > FILENAME.2.fastq.gz
zcat FILENAME.fastq.gz | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip > FILENAME.1.fastq.gz
Thanks a ton in advance for your help!
*****EDIT*****
My notation was a bit off, here's the final, correct for loop:
for fname in *.fastq.gz
do
gzcat "$fname" | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip >../../SeparateReads/"${fname%.fastq.gz}.2.fastq.gz"
gzcat "$fname" | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip >../../SeparateReads/"${fname%.fastq.gz}.1.fastq.gz"
done
*****FOLLOWUP QUESTION*****
When I run the following:
for fname in *.1.fastq.gz
do
cat ./CleanedSeparate/XhoI/"$fname" ./CleanedSeparate/MseI/"${fname%.1.fastq.gz}.2.fastq.gz" > ./FinalCleaned/"${fname%.1.fastq.gz}.fastq.gz"
done
I get this error:
cat: ./CleanedSeparate/XhoI/*.1.fastq.gz: No such file or directory
cat: ./CleanedSeparate/MseI/*.2.fastq.gz: No such file or directory
Obviously I'm not using * correctly. Any tips on where I'm going wrong?

for fname in *.fastq.gz
do
zcat "$fname" | awk 'NR % 8 == 5 || NR % 8 == 6 || NR % 8 == 7 || NR % 8 == 0 {print $0}' | gzip >"${fname%.fastq.gz}.2.fastq.gz"
zcat "$fname" | awk 'NR % 8 == 1 || NR % 8 == 2 || NR % 8 == 3 || NR % 8 == 4 {print $0}' | gzip >"${fname%.fastq.gz}.1.fastq.gz"
done
Key points:
for fname in *.fastq.gz
This loops over every file in the current directory ending in .fastq.gz. If the files are in a different directory, then use:
for fname in /path/to/*.fastq.gz
where /path/to/ is whatever the path should be to get to those files.
zcat "$fname"
This part is straightforward. It substitutes in the file name as the argument for zcat.
"${fname%.fastq.gz}.1.fastq.gz"
This is a little bit trickier. To get the desired output file name, we need to insert the .1 into the original filename. The easiest way to do this in bash is to remove the .fastq.gz suffix from the file name with ${fname%.fastq.gz} where the % is bash-speak meaning remove what follows from the end. Then, we add on the new suffix .1.fastq.gz and we have the correct file name.
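For example, you can see the expansion in action directly in the shell (using one of the sample file names from the listing above):
fname=BB001.fastq.gz
echo "${fname%.fastq.gz}"              # prints: BB001
echo "${fname%.fastq.gz}.1.fastq.gz"   # prints: BB001.1.fastq.gz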
Creating the new files in a different directory
As per the follow-up question, this does not work:
for fname in *.1.fastq.gz
do
cat ./CleanedSeparate/XhoI/"$fname" ./CleanedSeparate/MseI/"${fname%.1.fastq.gz}.2.fastq.gz" > ./FinalCleaned/"${fname%.1.fastq.gz}.fastq.gz"
done
The problem is that, in the for statement, the shell looks for *.1.fastq.gz in the current directory. But the files aren't there; they are in ./CleanedSeparate/XhoI/. Instead, run:
dir1=./CleanedSeparate/XhoI
for fname in "$dir1"/*.1.fastq.gz
do
base=${fname#$dir1/}
base=${base%.1.fastq.gz}
echo "base=$base"
cat "$fname" "./CleanedSeparate/MseI/${base}.2.fastq.gz" >"./FinalCleaned/${base}.fastq.gz"
done
Notice here that the for statement is given the correct directory in which to find the files.
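If you prefer, the two-step prefix/suffix removal can be replaced by a single basename call; this sketch should behave the same for the file names shown above:
dir1=./CleanedSeparate/XhoI
for fname in "$dir1"/*.1.fastq.gz
do
base=$(basename "$fname" .1.fastq.gz)
cat "$fname" "./CleanedSeparate/MseI/${base}.2.fastq.gz" >"./FinalCleaned/${base}.fastq.gz"
done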

You can use something like:
for fspec in *.fastq.gz ; do
echo "${fspec}"
done
That will simply echo the file being processed but you can do anything you want to ${fspec}, including using it for a couple of zcat commands.
In order to get the root of the file name (for creating the other files), you can use the pattern deletion feature of bash to remove the trailing bit:
for fspec in *.fastq.gz ; do
froot=${fspec%%.fastq.gz}
echo "Transform ${froot}.fastq.gz into ${froot}.1.fastq.gz"
done
In addition, for your specific need, it appears you want to send the first four lines of an eight-line group to one file and the other four lines to a second file.
I tend to just use sed for simple tasks like that since it's likely to be faster. You can get the first line group (first four lines of the eight) with:
sed -n 'p;n;p;n;p;n;p;n;n;n;n'
and the second (second four lines of the eight) with:
sed -n 'n;n;n;n;p;n;p;n;p;n;p'
using the p print-current and n get-next commands.
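If GNU sed is available, the same groups can also be selected with its first~step addressing; this is a GNU-only sketch, not portable sed:
sed -n '1~8{N;N;N;p}'   # lines 1-4 of every group of 8
sed -n '5~8{N;N;N;p}'   # lines 5-8 of every group of 8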
Hence the code then becomes something like:
for fsrc in *.fastq.gz ; do
fdst1="${fsrc%%.fastq.gz}.1.fastq.gz"
fdst2="${fsrc%%.fastq.gz}.2.fastq.gz"
echo "Processing ${fsrc}"
# For each group of 8 lines, fdst1 gets 1-4, fdst2 gets 5-8.
zcat "${fsrc}" | sed -n 'p;n;p;n;p;n;p;n;n;n;n' | gzip >"${fdst1}"
zcat "${fsrc}" | sed -n 'n;n;n;n;p;n;p;n;p;n;p' | gzip >"${fdst2}"
done
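If reading each compressed file twice bothers you, a single pass can feed both outputs by letting awk pipe into two gzip processes; this is a sketch along the same lines, assuming the file names contain no spaces:
for fsrc in *.fastq.gz ; do
froot="${fsrc%%.fastq.gz}"
echo "Processing ${fsrc}"
# For each group of 8 lines, lines 1-4 go to the .1 file, lines 5-8 to the .2 file.
zcat "${fsrc}" | awk -v f1="${froot}.1.fastq.gz" -v f2="${froot}.2.fastq.gz" '
    NR % 8 >= 1 && NR % 8 <= 4 { print | ("gzip > " f1); next }
                               { print | ("gzip > " f2) }'
done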

Related

Print first few and last few lines of file through a pipe with "..." in the middle

Problem Description
This is my file
1
2
3
4
5
6
7
8
9
10
I would like to send the cat output of this file through a pipe and receive this
% cat file | some_command
1
2
...
9
10
Attempted solutions
Here are some solutions I've tried, with their output
% cat temp | (head -n2 && echo '...' && tail -n2)
1
2
...
% cat temp | tee >(head -n3) >(tail -n3) >/dev/null
1
2
3
8
9
10
# I don't know how to get the ...
% cat temp | sed -e 1b -e '$!d'
1
10
% cat temp | awk 'NR==1;END{print}'
1
10
# Can only get 2 lines
An awk:
awk -v head=2 -v tail=2 'FNR==NR && FNR<=head
FNR==NR && cnt++==head {print "..."}
NR>FNR && FNR>(cnt-tail)' file file
Or if a single pass is important (and memory allows), you can use perl:
perl -0777 -lanE 'BEGIN{$head=2; $tail=2;}
END{say join("\n", @F[0..$head-1],("..."),@F[-$tail..-1]);}' file
Or, an awk that is one pass:
awk -v head=2 -v tail=2 'FNR<=head
{lines[FNR]=$0}
END{
print "..."
for (i=FNR-tail+1; i<=FNR; i++) print lines[i]
}' file
Or, there's nothing wrong with being caveman-direct:
head -2 file; echo "..."; tail -2 file
Any of these prints:
1
2
...
9
10
In terms of efficiency, here are some stats.
For small files (i.e., less than 10 MB or so) all of these finish in under 1 second, and the 'caveman' approach takes 2 ms.
I then created a 1.1 GB file with seq 99999999 >file
The two pass awk: 50 secs
One pass perl: 10 seconds
One pass awk: 29 seconds
'Caveman': 2 ms
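The 'caveman' approach is only that fast because head and tail are handed a regular file they can read independently; if the data really only exists as a stream from a pipe (per the problem description), one workaround (my own addition, not part of the timings above) is to spool the stream to a temporary file first:
tmp=$(mktemp)
cat > "$tmp"                                  # drain the pipe into a temp file
head -n 2 "$tmp"; echo "..."; tail -n 2 "$tmp"
rm -f "$tmp"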
You may consider this awk solution:
awk -v top=2 -v bot=2 'FNR == NR {++n; next} FNR <= top || FNR > n-top; FNR == top+1 {print "..."}' file{,}
1
2
...
9
10
Two single pass sed solutions:
sed '1,2b
3c\
...
N
$!D'
and
sed '1,2b
3c\
...
$!{h;d;}
H;g'
Assumptions:
as OP has stated, a solution must be able to work with a stream from a pipe
the total number of lines coming from the stream is unknown
if the total number of lines is less than the sum of the head/tail offsets then we'll print duplicate lines (we can add more logic if OP updates the question with more details on how to address this situation)
A single-pass awk solution that implements a queue in awk to keep track of the most recent N lines; the queue allows us to limit awk's memory usage to just N lines (as opposed to loading the entire input stream into memory, which could be problematic when processing a large volume of lines/data on a machine with limited available memory):
h=2 t=3
cat temp | awk -v head=${h} -v tail=${t} '
{ if (NR <= head) print $0
lines[NR % tail] = $0
}
END { print "..."
if (NR < tail) i=0
else i=NR
do { i=(i+1)%tail
print lines[i]
} while (i != (NR % tail) )
}'
This generates:
1
2
...
8
9
10
Demonstrating the overlap issue:
$ cat temp4
1
2
3
4
With h=3;t=3 the proposed awk code generates:
$ cat temp4 | awk -v head=${h} -v tail=${t} '...'
1
2
3
...
2
3
4
Whether or not this is the 'correct' output will depend on OP's requirements.
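If the duplicated lines are unwanted, one possible tweak (my own sketch, not part of the answer above) is to start the tail at whichever is larger, head or NR-tail, and to print the ellipsis only when lines were actually skipped:
cat temp4 | awk -v head=3 -v tail=3 '
  { if (NR <= head) print $0
    lines[NR % tail] = $0
  }
  END { start = (NR - tail > head) ? NR - tail : head
        if (NR - tail > head) print "..."
        for (i = start + 1; i <= NR; i++) print lines[i % tail]
  }'
With the temp4 sample this prints 1 2 3 4 with no ellipsis and no repeated lines.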
I suggest with bash:
(head -n 2; echo "..."; tail -n 2) < file
Output:
1
2
...
9
10

Piping awk and sed commands is too slow! Any ideas on how to make it work faster?

I am trying to convert a file containing a column with scaffold numbers and another one with corresponding individual sites into a bed file which lists sites in ranges. For example, this file ($indiv.txt):
SCAFF SITE
1 1
1 2
1 3
1 4
1 5
3 1
3 2
3 34
3 35
3 36
should be converted into $indiv.bed:
SCAFF SITE-START SITE-END
1 1 5
3 1 2
3 34 36
Currently, I am using the following code, but it is super slow, so I wanted to ask if anybody could come up with a quicker way?
COMMAND:
for scaff in $(awk '{print $1}' $indiv.txt | uniq)
do
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt | awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' | sed "s/^/$scaff\t/" >> $indiv.bed
done
DESCRIPTION:
awk '{print $1}' $indiv.txt | uniq #outputs a list with the unique scaffold numbers
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt #extracts the values from column 2 if the value in the first column equals the variable $scaff
awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' #converts the list of sequential numbers into ranges as described here: https://stackoverflow.com/questions/26809668/collapse-sequential-numbers-to-ranges-in-bash
sed "s/^/$scaff\t/" >> $indiv.bed #adds a column with the respective scaffold number and then outputs the file into $indiv.bed
Thanks a lot in advance!
Calling several programs for each scaffold of the input is bound to be slow. It's usually better to find a way to process all the lines in one call.
I'd reach for Perl:
tail -n+2 indiv.txt \
| sort -u -nk1,1 -nk2,2 \
| perl -ane 'END {print " $F[1]"}
next if $p[0] == $F[0] && $F[1] == $p[1] + 1;
print " $p[1]\n#F";
} continue { #p = #F;' > indiv.bed
The first two lines sort the input so that the groups are always adjacent (this might be unnecessary if your input is already sorted that way). Perl then reads the lines; -a splits each line into the @F array, and the @p array keeps the previous line: if the current line has the same first element and its second element is greater by 1, we go to the continue section, which just stores the current line into @p. Otherwise, we print the last element of the previous section and the start of the current one. The END block is responsible for printing the last element of the last section.
The output is different from yours for sections that have only a single member.
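The same grouping can also be done with a single awk pass; here is a rough equivalent (my own sketch, using the same sort and tab-separated output):
tail -n+2 indiv.txt \
| sort -u -nk1,1 -nk2,2 \
| awk -v OFS='\t' '
    NR == 1 { scaff=$1; first=$2; last=$2; next }
    $1 == scaff && $2 == last + 1 { last=$2; next }
    { print scaff, first, last; scaff=$1; first=$2; last=$2 }
    END { print scaff, first, last }' > indiv.bed
For the sample input this produces the three ranges 1 1 5, 3 1 2 and 3 34 36.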

Skip lines starting with a character and delete lines matching second column lesser than a value

I have a file with the following format:
Qil
Lop
A D E
a 1 10
b 2 21
c 3 22
d 4 5
3 5 9
I need to skip lines that start with pattern 'Qil' or 'Lop' or 'A D E' and ones where the third column has a value greater than 10 and save the entire thing in 2 different files with formats as shown below.
Example output files :
Output file 1
Qil
Lop
A D E
a 1 10
d 4 5
3 5 9
Output file 2
a
d
3
My code :
while read -r line; if [[ $line == "A" ]] ||[[ $line == "Q" ]]||[[ $line == "L" ]] ; then
awk '$2 < "11" { print $0 }' test.txt
awk '$2 < "11" { print $1 }' test1.txt
done < input.file
Could you please try the following.
awk '
/^Qil$|^Lop$|^A D E$/{
val=(val?val ORS:"")$0
next
}
$3<=10{
if(!flag){
print val > "file1"
flag=1
}
print > "file1"
if(!a[$1]++){
print $1> "file2"
}
}' Input_file
This will create 2 output files named file1 and file2 as per OP's requirements.
This can be done in a single awk:
awk '$1 !~ /^[QLA]/ && $3 <= 10' file
a 1 10
d 4 5
3 5 9
If you want to print only first column then use:
awk '$1 !~ /^[QLA]/ && $3 <= 10 { print $1 }' file
a
d
3

Get lengths of zeroes (interrupted by ones)

I have a long column of ones and zeroes:
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
1
0
0
1
....
I can easily get the average number of zeroes between ones (just total/ones):
ones=$(grep -c 1 file.txt)
lines=$(wc -l < file.txt)
echo "$lines / $ones" | bc -l
But how can I get the length of strings of zeroes between the ones? In the short example above it would be:
3
5
5
2
I'd include uniq for a more easily read approach:
uniq -c file.txt | awk '/ 0$/ {print $1}'
Edit: fixed for the case where the last line is a 0
Easy in awk:
awk '/1/{print NR-prev-1; prev=NR;}END{if (NR>prev)print NR-prev;}'
Not so difficult in bash, either:
i=0
for x in $(<file.txt); do
if ((x)); then echo $i; i=0; else ((++i)); fi
done
((i)) && echo $i
Using awk, I would use the fact that a field with the value 0 evaluates as False:
awk '!$1{s++; next} {if (s) print s; s=0} END {if (s) print s}' file
This returns:
3
5
5
2
Also, note the END block to print any "remaining" zeroes appearing after the last 1.
Explanation
!$1{s++; next} if the field is not True, that is, if the field is 0, increment the counter. Then, skip to the next line.
{if (s) print s; s=0} otherwise, print the value of the counter and reset it, but just if it contains some value (to avoid printing 0 if the file starts with a 1).
END {if (s) print s} print the remaining value of the counter after processing the file, but just if it wasn't printed before.
If your file.txt is just a column of ones and zeros, you can use awk and change the record separator to "1\n". This makes each "record" a sequence of "0\n", and the count of 0's in the record is the length of the record divided by 2. Counts will be correct for leading and trailing ones and zeros.
awk 'BEGIN {RS="1\n"} { print length/2 }' file.txt
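A closely related variant (my own, relying on the same multi-character RS support, e.g. in gawk) counts the 0s explicitly instead of dividing the record length by 2; note that it skips an empty record if the file happens to start with a 1:
awk 'BEGIN {RS="1\n"} NF {print gsub(/0/, "")}' file.txt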
This seems to be a pretty popular question today. Joining the party late, here is another short gnu-awk command to do the job:
awk -F '\n' -v RS='(1\n)+' 'NF{print NF-1}' file
3
5
5
2
How it works:
-F '\n' # set the input field separator to \n (newline)
-v RS='(1\n)+' # set the input record separator to one or more repetitions of 1 followed by a newline
NF # execute the block only if the record has at least one field
print NF-1 # print the number of fields minus 1 to get the count of 0s
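A quick way to see the splitting at work on a shortened, made-up sample (again gawk, because of the regex RS):
printf '0\n0\n0\n1\n0\n0\n1\n' \
| awk -F '\n' -v RS='(1\n)+' 'NF {printf "record %d: NF=%d -> %d zeros\n", NR, NF, NF-1}'
This reports 3 zeros for the first record and 2 for the second.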
Pure bash:
sum=0
while read n ; do
if ((n)) ; then
echo $sum
sum=0
else
((++sum))
fi
done < file.txt
((sum)) && echo $sum # Don't forget to output the last number if the file ended in 0.
Another way:
perl -lnE 'if(m/1/){say $.-1;$.=0}' < file
"reset" the line counter when 1.
prints
3
5
5
2
You can use awk:
awk '$1=="0"{s++} $1=="1"{if(s)print s;s=0} END{if(s)print(s)}'
Explanation:
The special variable $1 contains the value of the first field (column) of a line of text. Unless you specify the field delimiter using the -F command line option it defaults to whitespace - meaning $1 will contain 0 or 1 in your example.
If the value of $1 equals 0 a variable called s will get incremented but if $1 is equal to 1 the current value of s gets printed (if greater than zero) and re-initialized to 0. (Note that awk initializes s with 0 before the first increment operation)
The END block gets executed after the last line of input has been processed. If the file ends with 0(s), the number of 0s between the file's end and the last 1 will get printed. (Without the END block they wouldn't be printed.)
Output:
3
5
5
2
if you can use perl:
perl -lne 'BEGIN{$counter=0;} if ($_ == 1){ print $counter; $counter=0; next} $counter++' file
3
5
5
2
The same logic actually looks better in awk:
awk '$1{print c; c=0} !$1{c++}' file
3
5
5
2
My attempt. Not so pretty but.. :3
grep -n 1 test.txt | gawk '{y=$1-x; print y-1; x=$1}' FS=":"
Out:
3
5
5
2
A funny one, in pure Bash:
while read -d 1 -a u || ((${#u[@]})); do
echo "${#u[@]}"
done < file
This tells read to use 1 as a delimiter, i.e., to stop reading as soon as a 1 is encountered; read stores the 0's in the fields of the array u. Then we only need to count the number of fields in u with ${#u[@]}. The || ((${#u[@]})) is here just in case your file doesn't end with a 1.
A stranger (and not fully correct) way:
perl -0x31 -laE 'say @F+0' <file
prints
3
5
5
2
0
It
reads the file with the record separator set to the character 1 (the -0x31),
with autosplit -a (which splits each record into the array @F),
and prints the number of elements in @F, e.g. say @F+0 (or say scalar @F).
Unfortunately, after the final 1 (used as the record separator) it reads an effectively empty record and therefore prints the trailing 0.
It is an incorrect solution, shown only as an alternative curiosity.
Expanding erickson's excellent answer, you can say:
$ uniq -c file | awk '!$2 {print $1}'
3
5
5
2
From man uniq we see that the purpose of uniq is to:
Filter adjacent matching lines from INPUT (or standard input), writing
to OUTPUT (or standard output).
So uniq groups the numbers. Using the -c option we get a prefix with the number of occurrences:
$ uniq -c file
3 0
1 1
5 0
1 1
5 0
1 1
2 0
1 1
Then it is just a matter of printing the counters that precede the 0 groups. For this we can use awk like: awk '!$2 {print $1}'. That is: print the first field (the count) when the second field is 0.
The simplest solution would be to use sed together with awk, like this:
sed -n '$bp;/0/{:r;N;/0$/{h;br}};/1/{x;bp};:p;/.\+/{s/\n//g;p}' input.txt \
| awk '{print length}'
Explanation:
The sed command separates the 0s and creates output like this:
000
00000
00000
00
Piped into awk '{print length}' you can get the count of 0 for each interval:
Output:
3
5
5
2

Printing a variable number of lines to output

I would like to have a script to modify some large text files (100k records) such that, for every record, the output contains a number of lines equal to the difference between columns 3 and 2 of that input line. In the output I want to print the record name (column 1) and a step-wise walk between the numbers contained in columns 2 and 3.
Sample trivial input could be (tab separated data, if it makes a difference)
a 3 5
b 10 14
with the desired output (again, ideally tab separated)
a 3 4
a 4 5
b 10 11
b 11 12
b 12 13
b 13 14
It's a challenge sadly beyond my (very) limited abilities.
Can anyone provide a solution to the problem, or point me in the right direction? In an ideal world I would be able to be integrate this into a bash script, but I'll take anything that works!
Bash solution:
while read h f t ; do
for ((i=f; i<t; i++)) ; do
printf "%s\t%d\t%d\n" $h $i $((i+1))
done
done < input.txt
Perl solution:
perl -lape '$_ = join "\n", map join("\t", $F[0], $_, $_ + 1), $F[1] .. $F[2] - 1' input.txt
awk -F '\t' -v OFS='\t' '
$2 >= $3 {print; next}
{for (i=$2; i<$3; i++) print $1, i, i+1}
' filename
With awk:
awk '$3!=$2 { while (($3 - $2) > 1) { print $1,$2,$2+1 ; $2++} }1' inputfile
Fully POSIX, and no unneeded loop variables:
$ while read h f t; do
while test $f -lt $t; do
printf "%s\t%d\t%d\n" "$h" $f $((++f))
done
done < input.txt
a 3 4
a 4 5
b 10 11
b 11 12
b 12 13
b 13 14
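For completeness, a seq-based variant of the same idea (my own sketch; assumes a seq that accepts a start and an end, e.g. GNU coreutils):
while read -r name start end; do
    seq "$start" "$((end - 1))" | awk -v n="$name" -v OFS='\t' '{print n, $1, $1 + 1}'
done < input.txt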
