Slow bash script to execute sed expression on each line of an input file

I have a simple bash script as follows
#!/bin/bash
#This script reads a file of row identifiers separated by new lines
# and outputs all query FASTA sequences whose headers contain that identifier.
# usage: filter_fasta_on_ids.sh fasta_to_filter.fa < seq_ids.txt > filtered.fa
while read SEQID; do
    sed -n -e "/$SEQID/,/>/ p" "$1" | head -n -1
done
A fasta file has the following format:
>HeadER23217;count=1342
ACTGTGCCCCGTGTAA
CGTTTGTCCACATACC
>ANotherName;count=3221
GGGTACAGACCTACAC
CAACTAGGGGACCAAT
Edit: changed header names to better show their actual structure in the files.
The script I made above does filter the file correctly, but it is very slow. My input file has ~20,000,000 lines containing ~4,000,000 sequences, and I have a list of 80,000 headers that I want to filter on. Is there a faster way to do this using bash/sed or other tools (like python or perl)? Any ideas why the script above takes hours to complete?

You're scanning the large file 80k times. I'll suggest a different approach with a different tool, awk: load the selection list into a hashmap (an awk array) and, while scanning the large file once, print any sequence whose header is in it.
For example
$ awk -F"\n" -v RS=">" 'NR==FNR{for(i=1;i<=NF;i++) a["Sequence ID " $i]; next}
$1 in a' headers fasta
-F"\n" sets the field separator to a newline and -v RS=">" sets the record separator to ">", so each FASTA record becomes one awk record whose first field is the header line. On the example files below, the command prints the matching records:
Sequence ID 1
ACTGTGCCCCGTGTAA
CGTTTGTCCACATACC
Sequence ID 4
GGGTACAGACCTACAT
CAACTAGGGGACCAAT
the headers file contains
$ cat headers
1
4
and the fasta file includes some more records in the same format.
If your headers file already includes the "Sequence ID" prefix, adjust the code accordingly. I didn't test this on large files, but it should be dramatically faster than your code as long as you don't have memory restrictions on holding an 80K-entry array. In that case, splitting the headers into multiple chunks and combining the results should be trivial.
To allow any format of header and to have the resulting file be a valid FASTA file, you can use the following command:
awk -F"\n" -v RS=">" -v ORS=">" -v OFS="\n" 'NR==FNR{for(i=1;i<=NF;i++) a[$i]; next} $1 in a' headers fasta > out
The ORS and OFS options set the output record and field separators, in this case to match the input fasta file.
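Here is a quick sanity check of the hash-lookup approach on the sample data from the question (the file names headers and fasta are just for the demo; note the headers file is written without a trailing newline so no empty key ends up in the array):

```shell
# demo input files
printf 'HeadER23217;count=1342\nANotherName;count=3221' > headers
printf '>HeadER23217;count=1342\nACTGTGCCCCGTGTAA\n>SkipMe;count=9\nGGGG\n>ANotherName;count=3221\nGGGTACAGACCTACAC\n' > fasta
# records whose header is listed in headers are printed; SkipMe is not
awk -F"\n" -v RS=">" 'NR==FNR{for(i=1;i<=NF;i++) a[$i]; next}
$1 in a' headers fasta
```

The records come out without their leading ">"; the ORS=">" variant above puts the separators back.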

You should take advantage of the fact (which you haven't explicitly stated, but I assume) that the huge fasta file contains the sequences in order (sorted by ID).
I'm also assuming the headers file is sorted by ID. If it isn't, make it so - sorting 80k integers is not costly.
When both are sorted, it boils down to a single simultaneous linear scan through both files. And since it runs in constant memory, it can handle inputs of any size, unlike the awk example above. I give an example in python, since I'm not comfortable with manual iteration in awk.
import sys

fneedles = open(sys.argv[1])
fhaystack = open(sys.argv[2])

def get_next_id():
    while True:
        line = next(fhaystack)
        if line.startswith(">Sequence ID "):
            return int(line[len(">Sequence ID "):])

def get_next_needle():
    return int(next(fneedles))

try:
    i = get_next_id()
    j = get_next_needle()
    while True:
        if i == j:
            print(i)
        if i <= j:
            i = get_next_id()
        else:
            j = get_next_needle()
except StopIteration:
    pass
Sure it's a bit verbose, but it finds 80k of 4M sequences (339M of input) in about 10 seconds on my old machine. (It could also be rewritten in awk which would probably be much faster). I created the fasta file this way:
for i in range(4000000):
    print(">Sequence ID {}".format(i))
    print("ACTGTGCCCCGTGTAA")
    print("ACTGTGCCCCGTGTAA")
    print("ACTGTGCCCCGTGTAA")
    print("ACTGTGCCCCGTGTAA")
And the headers ("needles") this way:
import random
ids = list(range(4000000))
random.shuffle(ids)
ids = ids[:80000]
ids.sort()
for i in ids:
print(i)

It's slow because you are reading the same file over and over, once per ID, when you could have sed read it once and process all the patterns in a single pass. To do that, generate a sed script with one range block per ID: inside each range, print the matched header and the sequence lines, and skip the ">" line that ends the range (this replaces your head -n -1).
while read ID; do
    printf '/%s/,/>/ { /%s/ p; />/! p }\n' "$ID" "$ID"
done | sed -n -f - data.fa
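Here is a self-contained run of the generated-script idea on a toy file (data.fa and ids.txt are invented for the demo; the range block prints the matched header itself and skips the ">" line that closes the range):

```shell
# toy FASTA file and ID list
printf '>seq1;count=1\nAAAA\n>seq2;count=2\nCCCC\n>seq3;count=3\nGGGG\n' > data.fa
printf 'seq1\nseq3\n' > ids.txt
# build one sed range block per ID, then run the whole script in one pass
while read ID; do
    printf '/%s/,/>/ { /%s/ p; />/! p }\n' "$ID" "$ID"
done < ids.txt | sed -n -f - data.fa
```

This prints the seq1 and seq3 records, headers included, and leaves seq2 out.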

Related

Script to find and print common IDs between two files working but it's not optimal

I have a code that works and does what I want, but it is extremely slow: it takes 1 or 2 days depending on the size of the input files. I know that there are alternatives that can be almost instant, and that my code is slow because it runs grep once per ID. I wrote another code in python that works as intended and is almost instant, but it does not print everything I need.
What I need is the common IDs between two files, and I want the whole line printed. My python script does not do that, while the bash one does, but it is far too slow.
This is my code in bash:
awk '{print $2}' file1.bim > sites.txt
for snp in `cat sites.txt`
do
    grep -w $snp file2.bim >> file1_2_shared.txt
done
This is my code in python:
#!/usr/bin/env python3
import sys

argv1 = sys.argv[1]  # argv1 is the first .bim file
argv2 = sys.argv[2]  # argv2 is the second .bim file
argv3 = sys.argv[3]  # argv3 is the output .txt file name

def printcommonSNPs(inputbim1, inputbim2, outputtxt):
    bim1 = open(inputbim1, "r")
    bim2 = open(inputbim2, "r")
    output = open(outputtxt, "w")

    snps1 = []
    for line1 in bim1:
        line1 = line1.split()
        snps1.append(line1[1])
    bim1.close()

    snps2 = []
    for line2 in bim2:
        line2 = line2.split()
        snps2.append(line2[1])
    bim2.close()

    common = list(set(snps1).intersection(snps2))
    for SNP in common:
        print(SNP, file=output)

printcommonSNPs(argv1, argv2, argv3)
My .bim input files are made this way:
1 1:891021 0 891021 G A
1 1:903426 0 903426 T C
1 1:949654 0 949654 A G
I would appreciate suggestions on how to make this quick in bash. I suspect an awk script could do it, but I tried awk 'FNR==NR {map[$2]=$2; next} {print $2, map[$2]}' file1.bim file2.bim > Roma_sets_shared_sites.txt and it simply prints every line, so it's not working as I need. Alternatively, how could I tell python3 to print the whole line?
It looks as if the problem can be solved like this:
grep -w -f <(awk '{ print $2 }' file1.bim) file2.bim
The identifiers (field $2) from file1.bim are treated as patterns to grep for in file2.bim. GNU grep takes a -f file argument which gives a list of patterns, one per line; we use <() process substitution in place of a file. The -w option applies to each of the -f patterns individually.
This won't have the same output as your shell script if there are duplicate IDs in file1.bim: if the same pattern occurs more than once, that's the same as one instance. And of course the order is different. Grepping the entire second file for one identifier, then the next and the next, produces the matches in a different order; if that order has to be reproduced, it will take extra work.
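If you'd rather stay in python, here is a small sketch (file names taken from the question) that stores the IDs from file1.bim in a set and prints every whole line of file2.bim whose second field is in it; the function name is made up for the example:

```python
import sys

def print_common_lines(bim1_path, bim2_path):
    # collect the IDs (second column) from the first file
    with open(bim1_path) as bim1:
        ids = {line.split()[1] for line in bim1 if line.strip()}
    # print the whole line from the second file when its ID matches
    with open(bim2_path) as bim2:
        for line in bim2:
            fields = line.split()
            if len(fields) > 1 and fields[1] in ids:
                print(line, end="")

if __name__ == "__main__" and len(sys.argv) > 2:
    print_common_lines(sys.argv[1], sys.argv[2])
```

Like the grep version, duplicate IDs collapse to one set entry, but the output order follows file2.bim.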

How to parse specific values from csv file to a for loop command? [duplicate]

This question already has answers here:
Looping over pairs of values in bash [duplicate]
(6 answers)
Closed 1 year ago.
I am trying to write a for loop where I conditionally parse specific values from a csv file into the do command.
My situation is as follows:
I have several directories containing genome sequences. The samples are numbered and the directories are named accordingly.
Dir 1 contains sample1_genome.fasta
Dir 2 contains sample2_genome.fasta
Dir 3 contains sample3_genome.fasta
The genome sequences have differing average read lengths, and it is important to address this. Therefore, I created a csv file containing the sample number and the corresponding average read length of the genome sequence.
csv file example (first column = sample_no, 2nd column = avg_read_length):
1,130
2,134
3,129
Now, I want to loop through the directories, take the genome sequences as input and parse the respective average read length to the process.
my code is as follows:
for f in *
do
shortbred_quantify.py --genome $f/sample${f%}.fasta --average_read_length *THE SAMPLE MATCHING VALUE FROM 2nd COLUMN* --results results/quantify_results_sample${f%}
done
Can you help me out with this?
Use awk. $2 is the second field, $1 is the first. eg:
$ cat input
1,130
2,134
3,129
$ awk '$2 == avgReadBP{ print $1 }' FS=, avgReadBP=134 input
2
Flipping that around to look up the read length for a given sample, your command ends up looking like:
fasta="$f"/genome_sample.fasta
shortbred_quantify.py --genome "$fasta" \
    --avgreadBP "$(awk -F, -v sample="$f" '$1 == sample { print $2 }' input)" \
    --results results/quantify_results_sample"${f}"
Don't forget to quote the filename.
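For the loop, what you need is the read length for a given sample number, i.e. match on the first field and print the second. A quick check of that lookup on its own (readlengths.csv is a demo name for the csv from the question):

```shell
# sample_no,avg_read_length
printf '1,130\n2,134\n3,129\n' > readlengths.csv
# look up the read length for sample 2
awk -F, -v sample=2 '$1 == sample { print $2 }' readlengths.csv
# -> 134
```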
I would structure it along these lines:
while IFS=, read sample read_length
do
shortbred_quantify.py --genome "$sample/genome_sample.fasta" --avgreadBP "$read_length" --results "results/quantify_results_sample$sample"
done < ./input
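A dry run of that loop, with echo standing in for shortbred_quantify.py and the csv written to ./input first:

```shell
# sample_no,avg_read_length
printf '1,130\n2,134\n3,129\n' > input
while IFS=, read -r sample read_length
do
    echo "sample $sample: avg read length $read_length"
done < ./input
```

Each iteration sees the two comma-separated fields already split into the two variables.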

Grep list (file) from another file

I'm new to bash and trying to extract a list of patterns from a file:
File1.txt
ABC
BDF
GHJ
base.csv (tried comma separated and tab delimited)
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
line 3 .."himk,n,hn.ujj., BDF"
etc
Desired output is something like
ABC
line 1..
line 2..(whole lines)
BDF
line 3..
and so on for each pattern from file 1
the code i tried was:
#!/bin/bash
for i in *.txt # cycle through all files containing pattern lists
do
for q in "$i"; # # cycle through list
do
echo $q >>output.${i};
grep -f "${q}" base.csv >>output.${i};
echo "\n";
done
done
But the output is only the filename and then some list of strings without pattern names, e.g.
File1.txt
line 1...
line 2...
line 3..
so I don't know which pattern each string belongs to, and have to check and assign manually. Can you please point out my errors? Thanks!
grep can process multiple files in one go, and has the attractive added bonus of indicating which file it found a match in.
grep -f File1.txt base.csv >output.txt
It's not clear what you hope for the inner loop to do; it just loops over a single token at a time, so it's not really a loop at all.
If you want the output to be grouped per pattern, here's a while loop which looks for one pattern at a time:
while read -r pat; do
    echo "$pat"
    grep "$pat" base.csv
done <File1.txt >output.txt
But the most efficient way to tackle this is to write a simple Awk script which processes all the input files at once, and groups the matches before printing them.
An additional concern is anchoring. grep "ABC" will find a match in 123DEABCXYZ; is this something you want to avoid? You can improve the regex, or, again, turn to Awk which gives you more control over where exactly to look for a match in a structured line.
awk '# Read patterns into memory
NR==FNR { a[++i] = $1; next }
# Loop across patterns
{
    for (j=1; j<=i; ++j)
        if ($0 ~ a[j]) {
            print FILENAME ":" FNR ":" $0 >> ("output." a[j])
            next
        }
}' File1.txt base.csv
You're not actually reading the files, you're just handling the filenames. Try this:
#!/bin/bash
for i in *.txt # cycle through all files containing pattern lists
do
while read -r q # read file line by line
do
    echo "$q" >>"output.${i}"
    grep -e "$q" base.csv >>"output.${i}"
    echo >>"output.${i}" # blank separator line
done < "${i}"
done
Here is one that uses split to separate the words from file2 into an array (word[]) - comma-separated, with quotes and spaces stripped off - and stores the record names (line 1 etc.) against each word, comma-separated:
awk '
NR==FNR {
n=split($0,tmp,/[" ]*(,|$)[" ]*/) # split words
for(i=2;i<=n;i++) # after first
if(tmp[i]!="") # non-empties
word[tmp[i]]=word[tmp[i]] (word[tmp[i]]==""?"":",") tmp[1] # hash rownames
record[tmp[1]]=$0 # store records
next
}
($1 in word) { # word found
n=split(word[$1],tmp,",") # get record names
print $1 ":" # output word
for(i=1;i<=n;i++) # and records
print record[tmp[i]]
}' file2 file1
Output:
ABC:
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
BDF:
line 3 .."himk,n,hn.ujj., BDF"
Thank you for your kind help, my friends.
Tried both variants above but kept getting various errors ("do" expected) or misbehavior (it got the names of the pattern blocks, e.g. ABC, BDF, but no lines).
Gave up for a while and then eventually tried another way.
While the base goal was to cycle through pattern list files, search for the patterns in a huge file and write out specific columns from the lines found - I simply wrote
for i in *.txt # cycle through files with patterns
do
    grep -F -f "$i" bigfile.csv >> "${i}.out1" # greps all patterns from the current file
    cut -f 2,3,4,7 "${i}.out1" >> "${i}.out2" # cuts columns of interest and writes them out to another file
done
I'm aware that this code should be improved using some fancy pipeline features, but it works perfectly as is; hope it'll help somebody in a similar situation. You can easily add some echoes to write out the pattern list names as I initially requested.

Find Replace using Values in another File

I have a directory of files, myFiles/, and a text file values.txt in which one column is a set of values to find, and the second column is the corresponding replace value.
The goal is to replace all instances of find values (first column of values.txt) with the corresponding replace values (second column of values.txt) in all of the files located in myFiles/.
For example...
values.txt:
Hello Goodbye
Happy Sad
Running the command would replace all instances of "Hello" with "Goodbye" in every file in myFiles/, as well as replace every instance of "Happy" with "Sad" in every file in myFiles/.
I've made as many attempts with awk/sed and so on as I could think of, but have failed to produce a command that performs the desired action.
Any guidance is appreciated. Thank you!
Read each line from values.txt
Split that line into its 2 words
Use sed for each line to replace the 1st word with the 2nd word in all files in the myFiles/ directory
Note: I've used bash parameter expansion to split the line (${line% *} for the first word, ${line#* } for the second), assuming values.txt is a space-separated, 2-column file. If that's not the case, you may use awk or cut to split the line.
while read -r line; do
    sed -i "s/${line% *}/${line#* }/g" myFiles/* # '-i' edits files in place and 'g' replaces all occurrences of the pattern
done < values.txt
You can do what you want with awk.
#! /usr/bin/awk -f
# snarf in first file, values.txt
FNR == NR {
    subs[$1] = $2
    next
}
# apply replacements to subsequent files
{
    for (old in subs) {
        while (start = index($0, old)) {
            len = length(old)
            $0 = substr($0, 1, start - 1) subs[old] substr($0, start + len)
        }
    }
    print
}
When you invoke it, put values.txt as the first file to be processed.
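Here is a self-contained check of the approach (values.txt and data.txt are demo files; the loop locates each occurrence with index($0, old) and splices in the replacement with substr):

```shell
printf 'Hello Goodbye\nHappy Sad\n' > values.txt
printf 'Hello world\na Happy day\n' > data.txt
awk 'FNR == NR { subs[$1] = $2; next }
{
    for (old in subs)
        while (start = index($0, old)) {
            len = length(old)
            $0 = substr($0, 1, start - 1) subs[old] substr($0, start + len)
        }
    print
}' values.txt data.txt
```

One caveat: if a replacement string contains its own search string (e.g. Hello -> SayHello), the inner while loop never terminates.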
Option One:
create a python script
with open('filename', 'r') as infile, etc., read the values.txt file into a python dict with 'from' as key and 'to' as value; close the infile.
use os.listdir (or os.walk) on the wanted directory, iterate over the files, and for each one either popen "sed 's/from/to/g'" or read the file, iterating over all the lines, doing the find/replace on each.
Option Two:
bash script
read in a from/to pair
invoke
perl -p -i -e 's/from/to/g' dirname/*.txt
done
The second is probably easier to write, but does less exception handling.
It's called a 'perl pie' (perl -p -i -e) and it's a relatively famous idiom for doing find/replace in lots of files at once.
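An alternative in the same spirit: turn values.txt into a sed script (one s/from/to/g per line) and apply all substitutions to every file in one pass. This is a sketch assuming GNU sed for -i and that the words contain no slashes or sed metacharacters; file names are from the question:

```shell
printf 'Hello Goodbye\nHappy Sad\n' > values.txt
mkdir -p myFiles
printf 'Hello world, what a Happy day\n' > myFiles/demo.txt
# one s/from/to/g command per line of values.txt
awk '{ print "s/" $1 "/" $2 "/g" }' values.txt > replace.sed
sed -i -f replace.sed myFiles/*
cat myFiles/demo.txt
```

Unlike the per-line loop, each file is read and rewritten only once regardless of how many pairs values.txt holds.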

Bash script replace two fields in a text file using variables

This should be a simple fix but I cannot wrap my head around it at the moment.
I have a comma-delimited file called my_course that contains a list of courses with some information about them.
I need to get user input about the last two fields and change them accordingly.
Each field is constructed like:
CourseNumber,CourseTitle,CreditHours,Status,Grade
Example file:
CSC3210,COMPUTER ORG & PROGRAMMING,3,0,N/A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,0,N/A
I get the user input for 3 things: Course Number, Status (0 or 1), and Grade (A,B,C,N/A)
So far I have tried matching the line containing the course number and changing the last two fields. I haven't been able to figure out how to modify the last two fields using sed, so I'm using this horrible jumble of awk and sed:
temporary=$(awk -v status=$status -v grade=$grade '
BEGIN { FS="," }; $(NF)=""; $(NF-1)="";
/'$cNum'/ {printf $0","status","grade;}' my_course)
sed -i "s/CSC$cNum.*/$temporary/g" my_course
The issue that I'm running into here is that the number of words in the course title can range from 1 to 4, so I can't just print a fixed number of fields. I've tried removing the last two fields and appending the new values for status and grade, but that isn't working for me.
Note: I have already done checks to ensure that the user inputs valid data.
Use a simple awk-script:
BEGIN {
FS=","
OFS=FS
}
$0 ~ course {
$(NF-1)=status
$NF=grade
} {print}
and on the cmd-line, pass the three parameters course, status and grade as awk variables.
in action:
$ cat input
CSC3210,COMPUTER ORG & PROGRAMMING,3,0,N/A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,0,N/A
$ awk -vcourse="CSC3210" -vstatus="1" -vgrade="A" -f grades.awk input
CSC3210,COMPUTER ORG & PROGRAMMING,3,1,A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,0,N/A
$ awk -vcourse="CSC1010" -vstatus="1" -vgrade="B" -f grades.awk input
CSC3210,COMPUTER ORG & PROGRAMMING,3,0,N/A
CSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A
CSC1010,COMPUTERS & APPLICATIONS,3,1,B
It doesn't matter how many commas you have in the course name as long as you only touch the last two fields, i.e. everything after the second-to-last comma:
sed -i "/^CSC$cNum,/ s/[^,]*,[^,]*\$/$status,$grade/" my_course
The trick is to use $ in the pattern to match the end of the line. It is written \$ because of the double quotes: a bare $$ would be expanded by the shell to its process ID.
And don't bother building the "temporary" line - apply the substitution only to the line that matches the course number.
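A quick check of the last-two-fields substitution (with \$ keeping the end-of-line anchor literal inside the double quotes):

```shell
cNum=3210 status=1 grade=A
printf 'CSC3210,COMPUTER ORG & PROGRAMMING,3,0,N\/A\n' > /dev/null # placeholder, real data below
printf 'CSC3210,COMPUTER ORG & PROGRAMMING,3,0,N/A\nCSC2010,INTRO TO COMPUTER SCIENCE,3,0,N/A\n' > my_course
# rewrite only the matching course's Status and Grade fields
sed -i "/^CSC$cNum,/ s/[^,]*,[^,]*\$/$status,$grade/" my_course
cat my_course
```

Only the CSC3210 line changes, ending in ",1,A"; the other lines are untouched.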
