Getting different output files - bash

I'm doing a test with these files:
comp900_c0_seq1_Glicose_1_ACTTGA_merge_R1_001.fastq
comp900_c0_seq1_Glicose_1_ACTTGA_merge_R2_001.fastq
comp900_c0_seq2_Glicose_1_ACTTGA_merge_R1_001.fastq
comp900_c0_seq2_Glicose_1_ACTTGA_merge_R2_001.fastq
comp995_c0_seq1_Glicose_1_ACTTGA_merge_R2_001.fastq
comp995_c0_seq1_Xilano_1_AGTCAA_merge_R1_001.fastq
comp995_c0_seq1_Xilano_1_AGTCAA_merge_R2_001.fastq
I want to group the R1 files that share the same code (the part of the name before the first underscore) and concatenate each group into its own output file. Each output file should be named after that code.
This is my code, but I'm having trouble producing the output files:
#!/bin/bash
for i in {900..995}; do
if [[ ${i} -eq ${i} ]]; then
cat comp${i}_*_R1_001.fastq
fi
done
I want to have two outputs:
One output will have all lines from:
comp900_c0_seq1_Glicose_1_ACTTGA_merge_R1_001.fastq
comp900_c0_seq2_Glicose_1_ACTTGA_merge_R1_001.fastq
and its name should be comp900_R1.out
The other output will have lines from:
comp995_c0_seq1_Xilano_1_AGTCAA_merge_R1_001.fastq
and its name should be comp995_R1.out
Finally, as I said, this is a small test. I want my script to work with a lot of files that have the same characteristics.

Using awk:
ls -1 *.fastq | awk -F_ '$8 == "R1" {system("cat " $0 ">>" $1 "_R1.out")}'
This lists all *.fastq files and pipes the names into awk, which splits each name on _. If the eighth field ($8) is R1, the file is appended (cat ... >>) to an output file named after the first field ($1) plus _R1.out, which gives comp900_R1.out or comp995_R1.out. It is assumed that no filenames contain spaces or other special characters.
Result:
File comp900_R1.out containing all lines from
comp900_c0_seq1_Glicose_1_ACTTGA_merge_R1_001.fastq
comp900_c0_seq2_Glicose_1_ACTTGA_merge_R1_001.fastq
and file comp995_R1.out containing all lines from
comp995_c0_seq1_Xilano_1_AGTCAA_merge_R1_001.fastq

My stab at a general solution:
#!/bin/bash
for f in *_R1_*; do
    code=$(echo "$f" | cut -d _ -f 1)
    cat "$f" >> "${code}_R1.out"
done
This iterates over the files with _R1_ in their names and appends each one to an output file named after its code.
cut pulls out the code by splitting the filename on underscores (-d _) and returning the first field (-f 1).
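Because the loop appends with >>, running it a second time would duplicate the contents of every output file. A minimal sketch of a re-runnable variant follows. It assumes the input names end in .fastq and that the outputs are named <code>_R1.out as in the question; the function name merge_r1 is made up for illustration. The code is taken with the shell's own ${f%%_*} expansion instead of cut.

```bash
# Sketch: concatenate every *_R1_*.fastq file into <code>_R1.out, where
# <code> is the part of the file name before the first underscore.
merge_r1() {
    # First pass: empty (or create) each output file so reruns start clean.
    for f in *_R1_*.fastq; do
        [ -e "$f" ] || continue          # the glob matched nothing
        : > "${f%%_*}_R1.out"
    done
    # Second pass: append each input file to the output for its code.
    for f in *_R1_*.fastq; do
        [ -e "$f" ] || continue
        cat -- "$f" >> "${f%%_*}_R1.out"
    done
}
```

Calling merge_r1 in the directory that holds the files should leave one .out file per code, with the inputs concatenated in glob (alphabetical) order.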

Related

looping with grep over several files

I have multiple files /text-1.txt, /text-2.txt ... /text-20.txt
and what I want to do is to grep for two patterns and stitch them into one file.
For example:
I have
grep "Int_dogs" /text-1.txt > /text-1-dogs.txt
grep "Int_cats" /text-1.txt> /text-1-cats.txt
cat /text-1-dogs.txt /text-1-cats.txt > /text-1-output.txt
I want to repeat this for all 20 files above. Is there an efficient way to do this in bash, awk, etc.?
#!/bin/bash
count=1
next () {
[[ "${count}" -lt 21 ]] && main
[[ "${count}" -eq 21 ]] && exit 0
}
main () {
file="text-${count}"
grep "Int_dogs" "${file}.txt" > "${file}-dogs.txt"
grep "Int_cats" "${file}.txt" > "${file}-cats.txt"
cat "${file}-dogs.txt" "${file}-cats.txt" > "${file}-output.txt"
count=$((count+1))
next
}
next
grep has some features you seem not to be aware of:
grep can be launched on lists of files, but the output will be different:
For a single file, the output will only contain the filtered line, like in this example:
cat text-1.txt
I have a cat.
I have a dog.
I have a canary.
grep "cat" text-1.txt
I have a cat.
For multiple files, also the filename will be shown in the output: let's add another textfile:
cat text-2.txt
I don't have a dog.
I don't have a cat.
I don't have a canary.
grep "cat" text-*.txt
text-1.txt:I have a cat.
text-2.txt:I don't have a cat.
grep can search for several patterns at once using the -E switch (extended regular expressions). The patterns need to be separated using a pipe symbol. Matching lines are printed in the order in which they occur in the file:
grep -E "cat|dog" text-1.txt
I have a cat.
I have a dog.
Combining the previous two points (egrep is an older, now deprecated alias for grep -E):
egrep "cat|dog" text-*.txt
text-1.txt:I have a cat.
text-1.txt:I have a dog.
text-2.txt:I don't have a dog.
text-2.txt:I don't have a cat.
So, in order to redirect this to an output file, you can simply say (the output name is chosen so that it does not itself match text-*.txt on a later run):
egrep "cat|dog" text-*.txt >output.txt
Assuming you're using bash, try this:
for i in $(seq 1 20); do rm -f "text-${i}-output.txt"; grep -E "Int_dogs|Int_cats" "text-${i}.txt" >> "text-${i}-output.txt"; done
Details
This one-line script does the following:
It expects the original files to be named text-<INTEGER_NUMBER>.txt, for example text-1.txt, text-2.txt, ... text-100.txt.
It loops from 1 to <N>, where <N> is the number of files you want to process (20 here).
Warning: rm -f text-${i}-output.txt runs first and removes any existing output file, so that only freshly written output is present at the end of the process.
grep -E "Int_dogs|Int_cats" text-${i}.txt matches either string in the original file, and >> text-${i}-output.txt redirects all matching lines to a newly created output file that carries the number of the original file. For example, for text-5.txt the file text-5-output.txt is created and contains the matching lines (if any).
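One difference is worth noting: grep -E prints matching lines in file order, so Int_dogs and Int_cats lines come out interleaved, whereas the commands in the question put all Int_dogs lines first and the Int_cats lines after them. If that order matters, a sketch like the following keeps it without the intermediate -dogs.txt and -cats.txt files. The function name stitch_matches is hypothetical.

```bash
# Sketch: for each input file NAME.txt, write all Int_dogs lines followed by
# all Int_cats lines to NAME-output.txt (same order as in the question).
stitch_matches() {
    for f in "$@"; do
        out="${f%.txt}-output.txt"
        {
            grep "Int_dogs" "$f"
            grep "Int_cats" "$f"
        } > "$out"
    done
}
```

In bash it could be called as stitch_matches text-{1..20}.txt. The brace expansion lists exactly the 20 inputs, so the -output.txt files from an earlier run are not picked up.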

Trying to create a script that counts the length of a all the reads in a fastq file but getting no return

I am trying to count the length of each read in a FASTQ file from Illumina sequencing and write the result to a TSV (or any other kind of file), so that I can later inspect it and also count the number of reads per file. So I need to step through the file, extract each line that has a read on it (every 4th line), get its length, and store that as output.
num=2
for file in *.fastq
do
echo "counting $file"
function file_length(){
wc -l $file | awk '{print$FNR}'
}
for line in $file_length
do
awk 'NR==$num' $file | chrlen > ${file}read_length.tsv
num=$((num + 4))
done
done
Currently all I get is the counting $file message and no other output, but also no errors.
Your script contains a lot of errors in both syntax and algorithm. Please try shellcheck to see what the problems are. The main issue is the $file_length part.
You may have wanted to call the function file_length() here, but $file_length is just
an undefined variable, which expands to nothing in the for loop.
If you just want to count the length of the 4th line of *.fastq files,
please try something like:
for file in *.fastq; do
awk 'NR==4 {print length}' "$file" > "${file}_length.tsv"
done
Or if you want to put the results together in a single tsv file, try:
tsvfile="read_length.tsv"
for file in *.fastq; do
    printf '%s\t' "$file" >> "$tsvfile"
    awk 'NR==4 {print length}' "$file" >> "$tsvfile"
done
Hope this helps.
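The question asks for the length of every read rather than of one line, so here is a minimal sketch for that case. It assumes the usual four-line FASTQ layout, in which the sequence is line 2 of each record and is never wrapped across lines; the function name read_lengths is made up for illustration.

```bash
# Sketch: print "<file><TAB><read length>" for every sequence line
# (line 2 of each 4-line FASTQ record) in the given files.
read_lengths() {
    awk 'FNR % 4 == 2 { print FILENAME "\t" length($0) }' "$@"
}
```

For example, read_lengths *.fastq > read_length.tsv collects everything in one file, and cut -f 1 read_length.tsv | sort | uniq -c then gives the number of reads per file.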

Grep list (file) from another file

I'm new to bash and am trying to extract the lines that match a list of patterns from a file:
File1.txt
ABC
BDF
GHJ
base.csv (tried comma separated and tab delimited)
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
line 3 .."himk,n,hn.ujj., BDF"
etc
The desired output is something like
ABC
line 1..
line 2..(whole lines)
BDF
line 3..
and so on for each pattern from file 1
the code i tried was:
#!/bin/bash
for i in *.txt -# cycle through all files containing pattern lists
do
for q in "$i"; # # cycle through list
do
echo $q >>output.${i};
grep -f "${q}" base.csv >>output.${i};
echo "\n";
done
done
But output is only filename and then some list of strings without pattern names, e.g.
File1.txt
line 1...
line 2...
line 3..
so I don't know which pattern each line belongs to and have to check and assign them manually. Can you please point out my errors? Thanks!
grep can process multiple files in one go, and then has the attractive added bonus of indicating which file it found a match in.
grep -f File1.txt base.csv >output.txt
It's not clear what you hope the inner loop will do; for q in "$i" iterates over a single token, so it's not really a loop at all.
If you want the output to be grouped per pattern, here's a loop which looks for one pattern at a time:
while read -r pat; do
    echo "$pat"
    grep "$pat" base.csv
done <File1.txt >output.txt
But the most efficient way to tackle this is to write a simple Awk script which processes all the input files at once, and groups the matches before printing them.
An additional concern is anchoring. grep "ABC" will find a match in 123DEABCXYZ; is this something you want to avoid? You can improve the regex, or, again, turn to Awk which gives you more control over where exactly to look for a match in a structured line.
awk '# Read patterns into memory
NR==FNR { a[++i] = $1; next }
# Loop across patterns
{ for(j=1; j<=i; ++j)
    if($0 ~ a[j]) {
        # The output file name must be built by concatenation, in parentheses
        print FILENAME ":" FNR ":" $0 >> ("output." a[j])
        next }
}' File1.txt base.csv
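If the goal is the single grouped listing shown in the question (pattern name, then its matching lines), the grouping can also be done in memory and printed at the end. The following is a sketch under a few assumptions: patterns are treated as literal substrings rather than regular expressions, a line that contains several patterns is listed under each of them, and the pattern file is not empty. The function name group_by_pattern is hypothetical.

```bash
# Sketch: print each pattern from the first file, followed by every line of
# the second file that contains it as a literal substring.
group_by_pattern() {
    awk '
        NR == FNR { if ($0 != "") pats[++n] = $0; next }   # first file: patterns
        {
            for (j = 1; j <= n; j++)
                if (index($0, pats[j]))                    # literal match
                    hits[j] = hits[j] $0 "\n"
        }
        END {
            for (j = 1; j <= n; j++)
                if (hits[j] != "")                         # skip unmatched patterns
                    printf "%s\n%s", pats[j], hits[j]
        }
    ' "$1" "$2"
}
```

It would be called as group_by_pattern File1.txt base.csv > output.txt. Patterns keep the order they have in the pattern file, and patterns with no matching line are left out.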
You're not actually reading the files, you're just handling the filenames. Try this:
#!/bin/bash
for i in *.txt # cycle through all files containing pattern lists
do
    while read -r q # read the pattern list line by line
    do
        echo "$q" >>"output.${i}"
        grep -e "$q" base.csv >>"output.${i}" # -e takes a pattern; -f would expect a file name
        echo >>"output.${i}" # blank line between pattern blocks
    done < "${i}"
done
Here is one that splits the words out of file2 (using split on commas, with quotes and spaces stripped off) into an array (word[]) and stores the record names (line 1 etc.) in it, comma-separated:
awk '
NR==FNR {
n=split($0,tmp,/[" ]*(,|$)[" ]*/) # split words
for(i=2;i<=n;i++) # after first
if(tmp[i]!="") # non-empties
word[tmp[i]]=word[tmp[i]] (word[tmp[i]]==""?"":",") tmp[1] # hash rownames
record[tmp[1]]=$0 # store records
next
}
($1 in word) { # word found
n=split(word[$1],tmp,",") # get record names
print $1 ":" # output word
for(i=1;i<=n;i++) # and records
print record[tmp[i]]
}' file2 file1
Output:
ABC:
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
BDF:
line 3 .."himk,n,hn.ujj., BDF"
Thank you for your kind help, my friends.
I tried both variants above but kept getting various errors ("do" expected) or misbehavior (the names of the pattern blocks, e.g. ABC, BDF, were written, but no lines).
I gave up for a while and then eventually tried another way.
Since the basic goal was to cycle through the pattern list files, search for the patterns in a huge file, and write out specific columns from the lines found, I simply wrote:
for i in *txt # cycle through files with patterns
do
    grep -F -f "$i" bigfile.csv >> "${i}.out1" # grep all patterns from the current file
    cut -f 2,3,4,7 "${i}.out1" >> "${i}.out2" # cut the columns of interest and write them to another file
done
I'm aware that this code could be improved using some fancy pipeline features, but it works perfectly as is. I hope it'll help somebody in a similar situation. You can easily add some echoes to write out the pattern list names, as I initially requested.
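The pipeline version hinted at above could look like the following sketch. It assumes, as the cut call in the loop does, that the big file is tab-delimited. The function name extract_columns and the .out suffix are hypothetical.

```bash
# Sketch: for each pattern-list file, keep the lines of the big file that
# contain any of the listed strings, then keep only columns 2, 3, 4 and 7.
extract_columns() {
    big=$1
    shift
    for list in "$@"; do
        grep -F -f "$list" "$big" | cut -f 2,3,4,7 > "${list}.out"
    done
}
```

Called as extract_columns bigfile.csv *txt, it writes one .out file per pattern list and needs no intermediate file. Because it uses > rather than >>, running it again replaces the old results instead of appending to them.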

Writing a script for large text file manipulation (iterative substitution of duplicated lines), weird bugs and very slow.

I am trying to write a script which takes a directory containing text files (384 of them) and modifies duplicate lines that have a specific format in order to make them not duplicates.
In particular, I have files in which some lines begin with the '#' character and contain the substring 0:0. A subset of these lines are duplicated one or more times. For those that are duplicated, I'd like to replace 0:0 with i:0 where i starts at 1 and is incremented.
So far I've written a bash script that finds duplicated lines beginning with '#', writes them to a file, then reads them back and uses sed in a while loop to search and replace the first occurrence of the line to be replaced. This is it below:
#!/bin/bash
fdir=$1"*"
#for each fastq file
for f in $fdir
do
(
#find duplicated read names and write to file $f.txt
sort $f | uniq -d | grep ^# > "$f".txt
#loop over each duplicated readname
while read in; do
rname=$in
i=1
#while this readname still exists in the file increment and replace
while grep -q "$rname" $f; do
replace=${rname/0:0/$i:0}
sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
let "i+=1"
done
done < "$f".txt
rm "$f".txt
rm "$f".bu
done
echo "done" >> progress.txt
)&
background=( $(jobs -p) )
if (( ${#background[@]} == 40 )); then
wait -n
fi
done
The problem with it is that it's impractically slow. I ran it on a 48-core computer for over 3 days and it hardly got through 30 files. It also seems to have removed about 10 files, and I'm not sure why.
My question is where are the bugs coming from and how can I do this more efficiently? I'm open to using other programming languages or changing my approach.
EDIT
Strangely the loop works fine on one file. Basically I ran
sort $f | uniq -d | grep ^# > "$f".txt
while read in; do
rname=$in
i=1
while grep -q "$rname" $f; do
replace=${rname/0:0/$i:0}
sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
let "i+=1"
done
done < "$f".txt
To give you an idea of what the files look like, a few lines from one of them are shown below. The thing is that even though it works for the one file, it's slow: multiple hours for one file of 7.5 M. I'm wondering if there's a more practical approach.
With regard to the file deletions and other bugs, I have no idea what was happening. Maybe it was running into memory collisions or something when the jobs were run in parallel?
Sample input:
#D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
Sample output:
#D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
Here's some code that produces the required output from your sample input.
It is assumed that your input file is sorted by the first value (up to the first space character).
time awk '{
#dbg if (dbg) print "#dbg:prev=" prev
if (/^#/ && prev!=$1) {fixNum=0 ;if (dbg) print "prev!=$1=" prev "!=" $1}
if (/^#/ && (prev==$1 || NR==1) ) {
prev=$1
n=split($1,tmpArr,":") ; n++
#dbg if (dbg) print "tmpArr[6]="tmpArr[6] "\tfixNum="fixNum
fixNum++;tmpArr[6]=fixNum;
# magic to rebuild $1 here
for (i=1;i<n;i++) {
tmpFix ? tmpFix=tmpFix":"tmpArr[i]"" : tmpFix=tmpArr[i]
}
$1=tmpFix ; $0=$0
print $0
}
else { tmpFix=""; print $0 }
}' file > fixedFile
output
#D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
#D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG
I've left a few of the #dbg:... statements in place (but they are now commented out) to show how you can run a small set of data as you have provided, and watch the values of variables change.
Assuming a non-csh shell, you should be able to copy/paste the code block into a terminal command line and replace file > fixedFile at the end with your real file name and a new name for the fixed file. Recall that awk 'program' file > file (actually, any ... file > file) truncates the existing file before it is read, so you can lose all the data in a file by reusing its name for the output.
There are probably some syntax improvements that will reduce the size of this code, and there might be 1 or 2 things that could be done that will make the code faster, but this should run very quickly. If not, please post the result of time command that should appear at the end of the run, i.e.
real 0m0.18s
user 0m0.03s
sys 0m0.06s
IHTH
#!/bin/bash
i=4
sort "$1" | uniq -d | grep '^#' > dups.txt
while IFS= read -r in; do
    # i is a multiple of 4 exactly on the header line of each 4-line record
    if [ $((i % 4)) -eq 0 ] && grep -qxF -- "$in" dups.txt; then
        x="$in"
        x=${x/"0:0 "/$i":0 "}
        echo "$x" >> "$1"fixed.txt
    else
        echo "$in" >> "$1"fixed.txt
    fi
    i=$((i + 1))
done < "$1"
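For comparison, here is a two-pass awk sketch of the same idea. It makes a few assumptions: records are exactly four lines long, so the read name is always line 1, 5, 9, and so on; the part to renumber is the first ":0:0 " (with its trailing space) in the header; and the file is small enough for the header counts to fit in memory. The function name dedupe_headers is made up. It does not require sorted input and leaves unique headers untouched.

```bash
# Sketch: read the file twice. Pass 1 counts how often each header line
# occurs. Pass 2 rewrites ":0:0 " to ":<k>:0 " for the k-th copy of every
# header that occurs more than once, and prints all other lines unchanged.
dedupe_headers() {
    awk '
        NR == FNR { if (FNR % 4 == 1) count[$0]++; next }   # pass 1
        FNR % 4 == 1 && count[$0] > 1 {                     # pass 2, duplicated header
            line = $0
            sub(/:0:0 /, ":" (++seen[line]) ":0 ")
        }
        { print }
    ' "$1" "$1"
}
```

The result goes to standard output, so it would be used as dedupe_headers reads.fastq > reads.fixed.fastq, with an output name that differs from the input. Each file is read only twice, instead of once per duplicate as in the sed loop.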

appending text to specific line in file bash

So I have a file that contains lines of text separated by ','. I want to create a script that counts how many parts a line has and, if the line contains 16 parts, adds a new one. So far it's working great. The only thing that is not working is appending the new part at the end of the line. See my example below:
Original file:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
Expected result:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
This is my code:
while read p; do
if [[ $p == "HEA"* ]]
then
IFS=',' read -ra ADDR <<< "$p"
echo ${#ADDR[@]}
arrayCount=${#ADDR[@]}
if [ "${arrayCount}" -eq 16 ];
then
sed -i "/$p/ s/\$/,xx/g" $f
fi
fi
done <$f
Result:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
,xx
What am I doing wrong? I'm sure it's something small but I can't find it.
It can be done using awk:
awk -F, 'NF==16{$0 = $0 FS "xx"} 1' file
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
-F, sets the input field separator to a comma
NF==16 is the condition: the block inside { and } runs only if the number of fields is 16
$0 = $0 FS "xx" appends ,xx to the end of the line
1 is an always-true pattern with no action block, so awk applies its default action and prints the (possibly modified) line
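The result in the question, with ,xx landing on a line of its own, is what one would see if the file had Windows (CRLF) line endings. That is only a guess, since the question does not say where the file comes from. Under that assumption, a sketch that strips a trailing carriage return before counting fields could look like this; the function name append_xx is hypothetical.

```bash
# Sketch: append ",xx" to every line with exactly 16 comma-separated fields.
# A trailing carriage return is removed first (assumption: CRLF input).
append_xx() {
    awk -F, '{ sub(/\r$/, "") } NF == 16 { $0 = $0 FS "xx" } 1' "$1"
}
```

The output has plain LF line endings. To update the file itself, write to a new name first, for example append_xx file > file.new && mv file.new file.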
With sed, the approach would be the following:
Use the ${line_number}s/..../..../ form. To target a specific line, you need to find out its line number first.
Use the special character & to denote the matched string.
The sed statement should look like the following:
sed -i "${line_number}s/.*/&,xx/" "$f"
I would prefer to leave it to you to play around with it, but if you would like, I can give you a full working sample.
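As one possible shape for such a sample, the line-number lookup can be avoided by letting the regular expression itself count the fields. The sketch below assumes a sed that accepts -E for extended regular expressions, as current GNU and BSD versions do; the function name append_xx_sed is made up.

```bash
# Sketch: a line has exactly 16 fields when it consists of 15 "field,"
# groups followed by one last field that contains no comma.
# & stands for the whole matched line, so the replacement appends ",xx".
append_xx_sed() {
    sed -E 's/^([^,]*,){15}[^,]*$/&,xx/' "$1"
}
```

Lines with fewer or more than 16 fields do not match the anchored pattern and pass through unchanged. The result is printed to standard output.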
