Pull out a range of data from unique character to unique character using grep or awk - bash

I have a moderately large fasta format file that has a complex header. I need to pull a sequence out based on a value (an 8-digit number) from another file. I can get the sequence out using 'grep -20 "value" fasta.file', but some of the sequences are very large and I often have to adjust the number of context lines to get the whole sequence, then copy and paste it into another file. Right now I have too many values (1000) to do this manually, and the tools I have found to do this haven't worked so far...
The fasta format file looks like this:
>transcript_cluster:RaGene-2_0-st-v1:17818557; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134300789; Stop=134300869; Strand=+; Length=80;
GGATCATTGATGACCATAAAAGATGTGGGAGTCGTCTGAAACATGCATGATGACCACAAC
ATTGAGAGTCTGAGGTCCAC
>transcript_cluster:RaGene-2_0-st-v1:17818559; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134301675; Stop=134301762; Strand=+; Length=87;
GGATCATTGATGACCAAAAAAAAAAAAACATCTGGGAGTCCTCTGAGACATCCATGATGA
CCACAACATTGGGAGTCTGAGGTCCAC
If I use the command grep -4 "17818557" fasta.fa I get:
ATTGCGAGTCTGAGGTCCAC
>transcript_cluster:RaGene-2_0-st-v1:17818555; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134299894; Stop=134299978; Strand=+; Length=84;
GGATCATTGATGACCAGAAAAAAATCATCTCGGAGTCCTCTGAGACATCCATGATGACCA
CAACATTGGGAGTCTGAGGTCCAC
>transcript_cluster:RaGene-2_0-st-v1:17818557; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134300789; Stop=134300869; Strand=+; Length=80;
GGATCATTGATGACCATAAAAGATGTGGGAGTCGTCTGAAACATGCATGATGACCACAAC
ATTGAGAGTCTGAGGTCCAC
>transcript_cluster:RaGene-2_0-st-v1:17818559; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134301675; Stop=134301762; Strand=+; Length=87;
GGATCATTGATGACCAAAAAAAAAAAAACATCTGGGAGTCCTCTGAGACATCCATGATGA
grep -4 grabs four lines above and below the match. What I need is to use the numerical query to pull out just the sequence data below the matching fasta header (>), ideally collecting everything from that header up to the next one, i.e. from > to >.
I have tried some of the UCSC tools ('faSomeRecord') and some Perl scripts. They haven't worked with the numerical query, whether supplied in a list file or on the command line, and with or without the 'transcript_cluster:RaGene-2_0-st-v1:' prefix. I suspect the problem is the colons, or the fact that the header includes positions and lengths, which are variable.
Any comments or help is greatly appreciated!
EDIT 30July14
Thanks to the help I received here, I was able to get the data from one file to another using this bash script:
#!/usr/bin/bash
filename='21Feb14_list.txt'
filelines=`cat $filename`
for i in $filelines ; do
    awk '/transcript/ && f==1 {f=0;next} /'"$i"'/ {f=1} f==1{print $1}' RaGene-2_0-st-v1.rn4.transcript_cluster.fa
done
This pulls out the sequence, but the header gets truncated to its first field (the part containing the query value). Is there a way to modify this so that I can get the entire header?
example output:
>transcript_cluster:RaGene-2_0-st-v1:17719499;
ATGCCTGAGCCTTCGAAATCTGCACCAGCTCCTAAGAAGGGCTCTAAGAAAGCTATCTCT
AAAGCTCAGAAAAAGGATGGCAAGAAGCGCAAGCGTAGCCGCAAGGAGAGCTATTCCGTG
TACGTGTACAAGGTGCTGAAGCAAGTGCACCCGGACACCGGCATCTCTTCCAAGGCCATG
GGCATCATGAACTCGTTCGTGAACGACATCTTCGAGCGCATCGCGGGCGAGGCGTCGCGC
CTGGCGCATTACAACAAGCGCTCGACCATCACGTCCCGGGAGATCCAGACCGCCGTGCGC
CTGCTGCTGCCGGGGGAGCTGGCCAAGCACGCGGTGTCGGAAGGCACCAAGGCGGTCACC
AAGTACACCAGCTCCAAGTG
>transcript_cluster:RaGene-2_0-st-v1:17623679;
Thanks again!!

$ awk '/transcript/ {f=0} /17818557/ {f=1} f==1{print}' fasta
>transcript_cluster:RaGene-2_0-st-v1:17818557; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134300789; Stop=134300869; Strand=+; Length=80;
GGATCATTGATGACCATAAAAGATGTGGGAGTCGTCTGAAACATGCATGATGACCACAAC
ATTGAGAGTCTGAGGTCCAC
How it works:
The code uses a flag, called f, to decide if a line should be printed. Taking each command, one by one:
/transcript/ {f=0}
If "transcript" appears in the line, indicating a header, we set the flag to 0.
/17818557/ {f=1}
If the line contains our key, 17818557, we set the flag to 1.
f==1{print}
If the flag is 1, print the line.
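To run this for every value in the OP's list file without truncating the header, a minimal sketch along the lines of the OP's own loop (passing the ID with -v instead of splicing it into the program, and printing whole lines rather than $1; "selected.fa" is just an assumed output name):

#!/usr/bin/bash
# Sketch: apply the flag approach above once per ID in the list file.
# Printing the whole record keeps the full header intact.
filename='21Feb14_list.txt'
while read -r id; do
    awk -v id="$id" '/transcript/ {f=0} $0 ~ id {f=1} f==1 {print}' \
        RaGene-2_0-st-v1.rn4.transcript_cluster.fa
done < "$filename" > selected.fa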

sed '1,/17818557/d;/>/,$d' file
Output:
GGATCATTGATGACCATAAAAGATGTGGGAGTCGTCTGAAACATGCATGATGACCACAAC
ATTGAGAGTCTGAGGTCCAC
With Header:
id=17818557
sed "/$id/p;1,/$id/d;/>/,\$d" file
Output:
>transcript_cluster:RaGene-2_0-st-v1:17818557; Assembly=build-Rnor3.4/rn4; Seqname=chr6; Start=134300789; Stop=134300869; Strand=+; Length=80;
GGATCATTGATGACCATAAAAGATGTGGGAGTCGTCTGAAACATGCATGATGACCACAAC
ATTGAGAGTCTGAGGTCCAC

Related

Pass number of for loop elements to external command

I'm using a for loop to iterate through .txt files in a directory and grab specified rows from each file. Afterwards the output is passed to the pr command in order to print it as a table. Everything works fine; however, I'm manually specifying the number of columns the table should contain, which is cumbersome when the number of files is not constant.
The command I'm using:
for f in *txt; do awk -F"\t" 'FNR ~ /^(2|6|9)$/{print $2}' $f; done | pr -ts --column 4
How should I modify the command so that '4' is replaced with the number of elements (files)?
Edit:
The fundamental question was whether one can pass the number of matching files to a command outside the loop. Judging from the solutions, there is no way around counting them separately. Up to that conclusion the structure of the files was not really relevant.
However, taking the above into account, I'm providing the file structure below.
Sample file.txt:
Irrelevant1 text
Placebo 1222327
Irrelevant1 text
Irrelevant2 text
Irrelevant3 text
Treatment1 105956
Irrelevant1 text
Irrelevant2 text
Treatment2 49271
Irrelevant1 text
Irrelevant2 text
The for loop generates the following from 4 *txt files:
1222327
105956
49271
969136
169119
9672
1297357
237210
11581
1189529
232095
13891
Expected pr output using a dynamically generated --column 4:
1222327 969136 1297357 1189529
105956 169119 237210 232095
49271 9672 11581 13891
Assumptions:
all input files generate the same number of output lines (otherwise we can add some code to keep track of the max number of lines and generate blank columns as needed)
Setup (columns are tab-delimited):
$ grep -n xxx f[1-4].txt
f1.txt:6:xxx 1222327
f1.txt:9:xxx 105956
f1.txt:24:xxx 49271
f2.txt:6:xxx 969136
f2.txt:9:xxx 169119
f2.txt:24:xxx 9672
f3.txt:6:xxx 1297357
f3.txt:9:xxx 237210
f3.txt:24:xxx 11581
f4.txt:6:xxx 1189529
f4.txt:9:xxx 232095
f4.txt:24:xxx 13891
One idea using awk to dynamically build the 'table' (replaces OP's current for loop):
awk -F'\t' '
FNR==1 { c=0 }
FNR ~ /^(6|9|24)$/ { ++c ; arr[c]=arr[c] (FNR==NR ? "" : " ") $2 }
END { for (i=1;i<=c;i++) print arr[i] }
' f[1-4].txt | column -t -o ' '
NOTE: we'll go ahead and let column take care of pretty-printing the table with a single space separating the columns; otherwise we could add some more code to awk to right-pad the columns with spaces (a sketch of that follows the example output below).
This generates:
1222327 969136 1297357 1189529
105956 169119 237210 232095
49271 9672 11581 13891
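As mentioned in the NOTE, the right-padding could also be done inside awk instead of piping to column; a minimal sketch of that idea (the width of 10 is an assumption, pick something wider than your longest value):

awk -F'\t' '
FNR==1 { c=0 }
FNR ~ /^(6|9|24)$/ { ++c; arr[c] = arr[c] sprintf("%-10s", $2) }   # left-justify each value in a 10-char cell
END { for (i=1;i<=c;i++) print arr[i] }
' f[1-4].txt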
You could just run ls and pipe the output to wc -l. Then once you've got that number you can assign it to a variable and place that variable in your command.
num=$(ls *.txt | wc -l)
I forget how to place bash variables in AWK, but I think you can do that. If not, respond back and I'll try to find a different answer.
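Wiring that count into the original command might look like this (a sketch; the awk part is unchanged, and ls | wc -l assumes filenames without embedded newlines):

num=$(ls *.txt | wc -l)            # number of .txt files = number of table columns
for f in *.txt; do
    awk -F"\t" 'FNR ~ /^(2|6|9)$/{print $2}' "$f"
done | pr -ts --column "$num"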

Replace column after performing actions using awk

I'm trying to remove commas only from the msg field in the file below.
Input file (abc.txt has many entries like this one):
alert tcp any any -> any [10,112,34] (msg:"Its an Test, Rule"; reference:url,view/Main; sid:1234; rev:1;)
Expected Output:
alert tcp any any -> any [10,112,34] (msg:"Its an Test Rule"; reference:url,view/Main; sid:1234; rev:1;)
This is what I have tried using awk:
awk -F ';' '{for(i=1;i<=NF;i++){if(match($i,"msg:")>0){split($i, array, "\"");tmessage=array[2];gsub("[',']","",tmessage);message=tmessage; }}print message'} abc.txt
The problem with having awk rewrite your fields is that output for modified lines will be field-separated by OFS, which is static.
The way around this is to avoid dealing with fields, and just handle the string replacement on $0. You could piece together the parts of the line manually, like this:
awk '{x=index($0,"msg:"); y=index(substr($0,x),";"); s=substr($0,x,y); gsub(/,/,"",s); print substr($0,1,x-1) s substr($0,x+y)}' input.txt
Or spelled out for easier reading:
{
x=index($0,"msg:") # find the offset of the interesting bit
y=index(substr($0,x),";") # find the length of that bit
s=substr($0,x,y) # clip the bit
gsub(/,/,"",s) # replace commas,
print substr($0,1,x-1) s substr($0,x+y) # print the result.
}

Read and sum occurrence lines in bash

I have a file with comma-separated lines like the ones below:
filename.txt
usernameA,10,10
usernameB,20,20
usernameA,10,10
usernameB,20,20
usernameC,10,10
I just want to parse the file and add up the numbers per username when it occurs multiple times, so the result should be:
usernameA=40
usernameB=80
usernameC=20
How can I achieve this result using a Bash script?
Thank you,
$ awk -F, '{a[$1]+=$2+$3}END{for(x in a)print x "=" a[x]}' file
usernameA=40
usernameB=80
usernameC=20
This works for the given example.
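For readability, the same command spelled out with comments (nothing changed, just formatting):

awk -F, '
    { a[$1] += $2 + $3 }                    # key on the username, add both numeric columns
    END { for (x in a) print x "=" a[x] }   # emit username=total (output order is not guaranteed)
' filename.txt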

Slow bash script to execute sed expression on each line of an input file

I have a simple bash script as follows
#!/bin/bash
#This script reads a file of row identifiers separated by new lines
# and outputs all query FASTA sequences whose headers contain that identifier.
# usage: filter_fasta_on_ids.sh fasta_to_filter.fa < seq_ids.txt > filtered.fa
while read SEQID; do
    sed -n -e "/$SEQID/,/>/ p" $1 | head -n -1
done
A fasta file has the following format:
> HeadER23217;count=1342
ACTGTGCCCCGTGTAA
CGTTTGTCCACATACC
>ANotherName;count=3221
GGGTACAGACCTACAC
CAACTAGGGGACCAAT
Edit: changed header names to better show their actual structure in the files.
The script I made above does filter the file correctly, but it is very slow. My input file has ~20,000,000 lines containing ~4,000,000 sequences, and I have a list of 80,000 headers that I want to filter on. Is there a faster way to do this using bash/sed or other tools (like Python or Perl)? Any ideas why the script above is taking hours to complete?
You're scanning the large file 80k times. I'll suggest a different approach with a different tool: awk. Load the selection list into a hashmap (awk array), then scan the large file once and print any sequence whose header matches.
For example
$ awk -F"\n" -v RS=">" 'NR==FNR{for(i=1;i<=NF;i++) a["Sequence ID " $i]; next}
$1 in a' headers fasta
The -F"\n" flag sets the field separator in the input file to be a new line. -v RS=">" sets the record separator to be a ">"
Sequence ID 1
ACTGTGCCCCGTGTAA
CGTTTGTCCACATACC
Sequence ID 4
GGGTACAGACCTACAT
CAACTAGGGGACCAAT
Here the headers file contains:
$ cat headers
1
4
and the fasta file includes some more records in the same format.
If your headers already include the "Sequence ID" prefix, adjust the code accordingly. I didn't test this on large files, but it should be dramatically faster than your code as long as you have enough memory to hold an 80k-entry array. If not, splitting the headers into multiple sections and combining the results should be trivial.
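If memory did turn out to be a problem, a sketch of that splitting idea (the chunk size of 20000 is an assumption; the fasta file is rescanned once per chunk, and the awk program is the same as above):

split -l 20000 headers headers.part.
for part in headers.part.*; do
    awk -F"\n" -v RS=">" 'NR==FNR{for(i=1;i<=NF;i++) a["Sequence ID " $i]; next}
        $1 in a' "$part" fasta
done > out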
To allow any format of header and to have the resulting file be a valid FASTA file, you can use the following command:
awk -F"\n" -v RS=">" -v ORS=">" -v OFS="\n" 'NR==FNR{for(i=1;i<=NF;i++) a[$i]; next} $1 in a' headers fasta > out
The ORS and OFS flags set the output field and record separators, in this case to be the same as the input fasta file.
You should take advantage of the fact (which you haven't explicitly stated, but I assume) that the huge fasta file contains the sequences in order (sorted by ID).
I'm also assuming the headers file is sorted by ID. If it isn't, make it so - sorting 80k integers is not costly.
When both are sorted, it boils down to a single simultaneous linear scan through both files. And since it runs in constant memory, it works for any input size, unlike the other awk example. I give an example in Python since I'm not comfortable with manual iteration in awk.
import sys

fneedles = open(sys.argv[1])
fhaystack = open(sys.argv[2])

def get_next_id():
    while True:
        line = next(fhaystack)
        if line.startswith(">Sequence ID "):
            return int(line[len(">Sequence ID "):])

def get_next_needle():
    return int(next(fneedles))

try:
    i = get_next_id()
    j = get_next_needle()
    while True:
        if i == j:
            print(i)
        while i <= j:
            i = get_next_id()
        while i > j:
            j = get_next_needle()
except StopIteration:
    pass
Sure it's a bit verbose, but it finds 80k of 4M sequences (339M of input) in about 10 seconds on my old machine. (It could also be rewritten in awk which would probably be much faster). I created the fasta file this way:
for i in range(4000000):
    print(">Sequence ID {}".format(i))
    print("ACTGTGCCCCGTGTAA")
    print("ACTGTGCCCCGTGTAA")
    print("ACTGTGCCCCGTGTAA")
    print("ACTGTGCCCCGTGTAA")
And the headers ("needles") this way:
import random
ids = list(range(4000000))
random.shuffle(ids)
ids = ids[:80000]
ids.sort()
for i in ids:
    print(i)
It's slow because you are reading the same file over and over, once per pattern, when you could have sed read it once and process all the patterns. So generate a sed script with a statement for each ID, using />/b in place of your head -n -1:
while read ID; do
    printf '/%s/,/>/ { />/b; p }\n' $ID;
done | sed -n -f - data.fa
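For illustration, this is the script the loop writes to sed's standard input for two of the identifiers from the question's sample fasta:

$ printf '/%s/,/>/ { />/b; p }\n' HeadER23217 ANotherName
/HeadER23217/,/>/ { />/b; p }
/ANotherName/,/>/ { />/b; p }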

awk for sort lines in file

I have a file which needs to be sorted on a fixed-width column, i.e. characters 5 to 10 of each line.
example file:
0120456789bcdc hsdsjjlofk
01204567-9 __abc __hsdsjjjiejks
01224-6777 abcddd hsdsjjjpsdpf
012645670- abccccd hsdsjjjopp
I tried awk -v FIELDWIDTHS="4 10" '{print|"$2 sort -n"}' file but it does not give proper output.
You can use sort for this; -k1.5,1.10 makes the sort key run from character 5 to character 10 of the first field:
$ sort -k1.5,1.10 file
01224-6777 abcddd hsdsjjjpsdpf
01204567-9 __abc __hsdsjjjiejks
012645670- abccccd hsdsjjjopp
0120456789bcdc hsdsjjlofk
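If you would rather involve awk, as in the question, a decorate/sort/undecorate sketch that should give the same ordering (it assumes the key is always characters 5-10 of the line):

# Prepend the 6-character key, sort on it, then strip it off again.
awk '{ print substr($0, 5, 6) "\t" $0 }' file | sort -k1,1 | cut -f2-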
